Corpus: eng_news_2020_30K

1.1 Summary

Values for some general parameters

Parameter Value
Number of sentences 30000
Average sentence length in characters 116.8187
Average sentence length in words 19.5937
Number of distinct word forms 73471
Number of distinct word forms (without multiwords) 55032
Percentage of lower case word forms 48.6069
Number of multi word units 18439
Percentage of multi word units 25.0970
Number of running word forms 638183
Number of running word forms (without multiwords) 587908
Percentage of lower case running words 82.1821
Average word form length 7.5070
Average running word length 4.8731
Percentage of word forms with frequency=1 62.3171
Number of sentence based co-occurrences 111896
- minimal likelihood ratio 6.63
- maximal likelihood ratio 4494.88
Number of neighbour based co-occurrences 23796
- minimal likelihood ratio 3.84
- maximal likelihood ratio 6189.34
Average number of sentence based co-occurrences per sentence 59.3391
Average number of neighbour co-occurrences per sentence 7.0224
Most frequent word the
Most frequent word's frequency 30312
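
Some of the values above follow from others in the same table, so they can be cross-checked. The following is a minimal sketch in Python with the counts copied from the table. The variable names are illustrative and not part of any corpus tooling, and the last line only reports the difference between the two running counts without assuming how multiwords are tokenised.

```python
# Counts copied from the summary table.
distinct_word_forms = 73471
multi_word_units = 18439
running_word_forms = 638183
running_word_forms_no_mwu = 587908

# Distinct word forms without multiwords: all distinct forms minus multi word units.
distinct_no_mwu = distinct_word_forms - multi_word_units

# Share of multi word units among all distinct word forms, in percent.
pct_mwu = 100 * multi_word_units / distinct_word_forms

# Difference between the running counts with and without multiwords.
running_diff = running_word_forms - running_word_forms_no_mwu

print(distinct_no_mwu)      # 55032, matches the table
print(f"{pct_mwu:.4f}")     # 25.0970, matches the table
print(running_diff)         # 50275
```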
571 msec needed at 2021-05-28 18:00
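
The minimal likelihood ratios in the table, 6.63 and 3.84, coincide with the chi-square critical values for one degree of freedom at p = 0.01 and p = 0.05. The page does not state which formula was used to score co-occurrences, so the following is only a sketch of one common definition, the log-likelihood ratio (G²) over a 2x2 contingency table. The function name, its parameters, and the example counts are hypothetical.

```python
import math

# Thresholds taken from the table (minimal likelihood ratios).
SENTENCE_THRESHOLD = 6.63
NEIGHBOUR_THRESHOLD = 3.84


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table (assumed scoring formula).

    k11: units containing both words, k12: only word A,
    k21: only word B, k22: neither word.
    """
    def xlogx_sum(*counts):
        # Sum of k * ln(k), treating 0 * ln(0) as 0.
        return sum(k * math.log(k) for k in counts if k > 0)

    n = k11 + k12 + k21 + k22
    return 2 * (
        xlogx_sum(k11, k12, k21, k22)
        - xlogx_sum(k11 + k12, k21 + k22)   # row totals
        - xlogx_sum(k11 + k21, k12 + k22)   # column totals
        + xlogx_sum(n)
    )


# Hypothetical counts for a word pair in a 30000-sentence corpus.
score = log_likelihood_ratio(100, 10, 10, 29880)
print(score > SENTENCE_THRESHOLD)   # the pair would pass the sentence-based cut-off
```

Under this definition a pair of words that occur independently of each other scores close to zero, and only pairs whose score exceeds the threshold would be counted as co-occurrences.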