Corpus: bul_news_2011

1.1 Summary

Values for some general parameters

Parameter Value
Number of sentences 1812845
Average sentence length in characters 191.8041
Average sentence length in words 17.0180
Number of distinct word forms 468595
Number of distinct word forms (without multiwords) 460367
Percentage of lower case word forms 66.8358
Number of multi word units 8228
Percentage of multi word units 1.7559
Number of running word forms 30895909
Number of running word forms (without multiwords) 30704796
Percentage of lower case running words 86.7070
Average word form length 17.1962
Average running word length 10.20578186
Percentage of word forms with frequency=1 41.3624
Number of sentence based co-occurrences 13668158
- minimal likelihood ratio 6.63
- maximal likelihood ratio 448175.28
Number of neighbour based co-occurrences 889508
- minimal likelihood ratio 3.84
- maximal likelihood ratio 526636.38
Average number of sentence based co-occurrences per sentence 160.1964
Average number of neighbour co-occurrences per sentence 10.9442
Most frequent word на
Frequency of the most frequent word 1446881
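
Some of the percentages in the summary table can be recomputed from the raw counts. The sketch below is only a consistency check and is not part of the original report. The variable names and the `percentage` helper are assumptions made for illustration. It reproduces the reported share of multi word units (1.7559) and confirms that the two distinct word form counts differ by exactly the number of multi word units.

```python
# Counts copied from the summary table; the variable names are illustrative.
distinct_word_forms = 468595
distinct_word_forms_no_multiwords = 460367
multi_word_units = 8228
running_word_forms = 30895909
most_frequent_word_count = 1446881


def percentage(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of multi word units among all distinct word forms.
multiword_share = percentage(multi_word_units, distinct_word_forms)

# Share of the running text covered by the single most frequent word.
top_word_share = percentage(most_frequent_word_count, running_word_forms)

print(f"Multi word units: {multiword_share:.4f} %")
print(f"Most frequent word: {top_word_share:.2f} % of running word forms")
```

Other averages in the table depend on tokenization details that the report does not state, so they are left out of this check.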
Generated in 32965 msec on 2018-02-03 at 21:00
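
The minimal likelihood ratios listed in the table, 6.63 and 3.84, coincide with the chi-square critical values for one degree of freedom at the 0.01 and 0.05 significance levels. The report does not say how the ratios were computed. The sketch below assumes the common log-likelihood (G²) statistic over a 2x2 contingency table of co-occurrence counts. The function name, the cell layout, and the example counts are hypothetical.

```python
import math

# Thresholds taken from the table; treating them as cut-offs is an assumption.
SENTENCE_THRESHOLD = 6.63   # chi-square critical value, 1 d.f., p = 0.01
NEIGHBOUR_THRESHOLD = 3.84  # chi-square critical value, 1 d.f., p = 0.05


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table.

    k11: units containing both words
    k12: units containing only the first word
    k21: units containing only the second word
    k22: units containing neither word
    """
    total = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            count = observed[i][j]
            if count > 0:  # empty cells contribute nothing to the sum
                expected = row_sums[i] * col_sums[j] / total
                g2 += count * math.log(count / expected)
    return 2.0 * g2


# Hypothetical counts: two words that appear together far more often
# than independence would predict.
score = log_likelihood_ratio(50, 5, 5, 940)
print(score >= SENTENCE_THRESHOLD)
```

Under this assumption, a word pair would be kept as a sentence based co-occurrence only if its score reaches 6.63, and as a neighbour based co-occurrence only if it reaches 3.84.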