Corpus: kat_newscrawl_2011_300K

1.1 Summary

Values for some general parameters

Parameter | Value
Number of sentences | 300000
Average sentence length in characters | 298.2627
Average sentence length in words | 13.4558
Number of distinct word forms | 311379
Number of distinct word forms (without multiwords) | 311157
Percentage of lower case word forms | 99.1374
Number of multi word units | 222
Percentage of multi word units | 0.0713
Number of running word forms | 4010702
Number of running word forms (without multiwords) | 4009979
Percentage of lower case running words | 99.7548
Average word form length | 27.7083
Average running word length | 21.15247013
Percentage of word forms with frequency=1 | 55.1964
Number of sentence based co-occurrences | 944292
  - minimal likelihood ratio | 6.63
  - maximal likelihood ratio | 70500.04
Number of neighbour based co-occurrences | 128757
  - minimal likelihood ratio | 3.84
  - maximal likelihood ratio | 123909.65
Average number of sentence based co-occurrences per sentence | 34.3170
Average number of neighbour based co-occurrences per sentence | 3.8631
Most frequent word | და
Frequency of the most frequent word | 126990
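
The minimal likelihood ratios listed above, 6.63 and 3.84, coincide with the chi-square critical values for one degree of freedom at p = 0.01 and p = 0.05. This suggests, though the page does not state it, that co-occurrences are kept only when a log-likelihood ratio test passes such a threshold. The following is a minimal sketch of that test on a 2x2 contingency table, not the corpus builder's actual implementation. The function names and the example counts are hypothetical; only the two thresholds and the sentence count come from the table.

```python
import math

# Thresholds taken from the table above (assumed to be significance cut-offs).
SENTENCE_THRESHOLD = 6.63   # chi-square critical value, 1 df, p = 0.01
NEIGHBOUR_THRESHOLD = 3.84  # chi-square critical value, 1 df, p = 0.05


def _xlogx(x):
    """Return x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table.

    k11: units containing both words
    k12: units containing word A but not word B
    k21: units containing word B but not word A
    k22: units containing neither word
    """
    n = k11 + k12 + k21 + k22
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    return 2.0 * (cells - rows - cols + _xlogx(n))


def cooccurrence_score(freq_a, freq_b, freq_ab, total_units):
    """Build the 2x2 table from marginal counts and score it."""
    return log_likelihood_ratio(
        freq_ab,
        freq_a - freq_ab,
        freq_b - freq_ab,
        total_units - freq_a - freq_b + freq_ab,
    )


if __name__ == "__main__":
    # Hypothetical counts for two words in a 300000-sentence corpus.
    score = cooccurrence_score(freq_a=500, freq_b=400, freq_ab=50,
                               total_units=300000)
    print(f"score = {score:.2f}, kept = {score >= SENTENCE_THRESHOLD}")
```

Under this reading, a word pair would count as a sentence based co-occurrence when its score is at least 6.63, and as a neighbour based co-occurrence when its score is at least 3.84.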
Generated in 2498 msec on 2018-03-12 23:02