Corpus: ekk_wikipedia_2021_300K

1.1 Summary

Values for some general parameters

Parameter                                                     Value
Number of sentences                                           300000
Average sentence length in characters                         94.5810
Average sentence length in words                              11.7774
Number of distinct word forms                                 509528
Number of distinct word forms (without multiwords)            509528
Percentage of lower case word forms                           63.4340
Number of multi word units                                    0
Percentage of multi word units                                0.0000
Number of running word forms                                  3517758
Number of running word forms (without multiwords)             3517758
Percentage of lower case running words                        80.9643
Average word form length                                      10.6172
Average running word length                                   6.95927946
Percentage of word forms with frequency=1                     65.6462
Number of sentence based co-occurrences                       614870
- minimal likelihood ratio                                    6.63
- maximal likelihood ratio                                    15188.15
Number of neighbour based co-occurrences                      91041
- minimal likelihood ratio                                    3.84
- maximal likelihood ratio                                    22056.87
Average number of sentence based co-occurrences per sentence  21.0532
Average number of neighbour co-occurrences per sentence       2.5722
Most frequent word                                            ja
Most frequent word's frequency                                115870
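
The parameters above distinguish distinct word forms (types) from running word forms (tokens). The following Python sketch illustrates how such values could be computed from a list of sentences. It is not the tool that produced this table: the function name corpus_summary, the dictionary keys and the whitespace tokenization are assumptions made for illustration, and a real pipeline would tokenize differently.

```python
from collections import Counter


def corpus_summary(sentences):
    """Compute type/token summary values for a list of sentence strings.

    Assumption: tokens are obtained by naive whitespace splitting.
    """
    tokens = [tok for sent in sentences for tok in sent.split()]
    freq = Counter(tokens)          # word form -> frequency
    n_sent = len(sentences)
    n_tokens = len(tokens)          # running word forms
    n_types = len(freq)             # distinct word forms
    top_word, top_count = freq.most_common(1)[0]
    return {
        "sentences": n_sent,
        "avg_sentence_length_chars": sum(len(s) for s in sentences) / n_sent,
        "avg_sentence_length_words": n_tokens / n_sent,
        "distinct_word_forms": n_types,
        "running_word_forms": n_tokens,
        # Share of types that are entirely lower case
        "pct_lower_word_forms": 100 * sum(1 for w in freq if w.islower()) / n_types,
        # Same share, but weighted by how often each type occurs
        "pct_lower_running_words": 100 * sum(c for w, c in freq.items() if w.islower()) / n_tokens,
        "avg_word_form_length": sum(len(w) for w in freq) / n_types,
        "avg_running_word_length": sum(len(w) * c for w, c in freq.items()) / n_tokens,
        # Word forms that occur exactly once
        "pct_frequency_one": 100 * sum(1 for c in freq.values() if c == 1) / n_types,
        "most_frequent_word": top_word,
        "most_frequent_word_frequency": top_count,
    }


if __name__ == "__main__":
    demo = ["kass ja koer", "Koer ja kass ja lind"]
    for key, value in corpus_summary(demo).items():
        print(key, value)
```

Averaging over types gives every word form equal weight, while averaging over tokens weights each form by its frequency. Because frequent words tend to be short, this is consistent with the average word form length (10.6172) exceeding the average running word length (6.95927946) in the table.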
3719 msec needed at 2021-06-12 09:01
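
The minimal likelihood ratios reported for co-occurrences (6.63 and 3.84) coincide with the chi-square critical values for one degree of freedom at the 0.01 and 0.05 significance levels. The summary does not state which formula was used, so the following Python sketch is only an assumption: it computes the standard log-likelihood ratio (G²) for a 2x2 contingency table of two words A and B. The function name and the parameter names k11, k12, k21, k22 are hypothetical.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: units (e.g. sentences) containing both A and B
    k12: units containing A but not B
    k21: units containing B but not A
    k22: units containing neither
    """
    n = k11 + k12 + k21 + k22
    row_totals = (k11 + k12, k21 + k22)
    col_totals = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the call
    score = log_likelihood_ratio(10, 0, 0, 10)
    print(score, score >= 6.63)
```

Under this reading, a word pair would be kept as a sentence based co-occurrence only if its score reaches 6.63, and as a neighbour based co-occurrence only if it reaches 3.84.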