Corpus: srp_wikipedia_2018_300K

1.1 Summary

Values for some general parameters

Parameter                                                     Value
Number of sentences                                           300000
Average sentence length in characters                         178.9287
Average sentence length in words                              15.5676
Number of distinct word forms                                 388623
Number of distinct word forms (without multiwords)            377492
Percentage of lower case word forms                           56.3562
Number of multi word units                                    11131
Percentage of multi word units                                2.8642
Number of running word forms                                  4667497
Number of running word forms (without multiwords)             4643865
Percentage of lower case running words                        84.2518
Average word form length                                      15.9588
Average running word length                                   10.45685738
Percentage of word forms with frequency=1                     58.0336
Number of sentence based co-occurrences                       914512
  - minimal likelihood ratio                                  6.63
  - maximal likelihood ratio                                  26914.22
Number of neighbour based co-occurrences                      147579
  - minimal likelihood ratio                                  3.84
  - maximal likelihood ratio                                  53235.79
Average number of sentence based co-occurrences per sentence  50.0065
Average number of neighbour co-occurrences per sentence       5.7169
Most frequent word                                            је
Most frequent word's frequency                                208617
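
Some of the figures above can be cross-checked against each other. The following is a minimal Python sketch, not part of the original corpus tooling, that recomputes derived values from the listed counts. The variable names are illustrative, and reading the difference between the two running word form counts as "tokens belonging to multiword units" is an assumption.

```python
# Counts copied from the summary table.
distinct_word_forms = 388623
multiword_units = 11131
running_word_forms = 4667497
running_word_forms_no_mw = 4643865

# Distinct word forms without multiwords (the table lists 377492).
distinct_without_multiwords = distinct_word_forms - multiword_units

# Share of multiword units among distinct word forms, in percent
# (the table lists 2.8642).
pct_multiword_units = 100 * multiword_units / distinct_word_forms

# Assumed interpretation: running tokens that are multiword units.
running_multiword_tokens = running_word_forms - running_word_forms_no_mw

print(distinct_without_multiwords)
print(round(pct_multiword_units, 4))
print(running_multiword_tokens)
```

Both recomputed values agree with the table, which suggests the percentage of multiword units is taken over distinct word forms rather than over running word forms.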
Computed in 3762 msec on 2024-05-03 at 01:02
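
The summary does not state how the likelihood ratios are computed. The listed minima, 6.63 and 3.84, coincide with the chi-square critical values for one degree of freedom at p = 0.01 and p = 0.05, which suggests they act as significance cutoffs. As a sketch only, assuming a Dunning-style log-likelihood ratio over a 2×2 contingency table (an assumption, not confirmed by the source), such a score could be computed as follows. The function name, parameter names, and example counts are hypothetical.

```python
import math


def log_likelihood_ratio(n_ab, n_a, n_b, n):
    """Return the G^2 statistic for two words A and B.

    n_ab: units (e.g. sentences) containing both A and B
    n_a:  units containing A
    n_b:  units containing B
    n:    total number of units
    Assumes 0 < n_a < n and 0 < n_b < n.
    """
    # Observed cells of the 2x2 table: (A,B), (A,not B), (not A,B), (neither).
    observed = [
        n_ab,
        n_a - n_ab,
        n_b - n_ab,
        n - n_a - n_b + n_ab,
    ]
    # Expected cells if A and B occurred independently.
    expected = [
        n_a * n_b / n,
        n_a * (n - n_b) / n,
        (n - n_a) * n_b / n,
        (n - n_a) * (n - n_b) / n,
    ]
    total = 0.0
    for o, e in zip(observed, expected):
        if o > 0:  # a zero cell contributes nothing to the sum
            total += o * math.log(o / e)
    return 2.0 * total


# Cutoffs taken from the summary table (assumed to be significance thresholds).
SENTENCE_THRESHOLD = 6.63
NEIGHBOUR_THRESHOLD = 3.84

# Hypothetical counts, for illustration only.
score = log_likelihood_ratio(n_ab=50, n_a=200, n_b=300, n=300000)
print(score > SENTENCE_THRESHOLD)
```

Under this reading, a word pair would be kept as a sentence based co-occurrence only if its score reaches 6.63, and as a neighbour based co-occurrence only if it reaches 3.84.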