Corpus: lit_wikipedia_2016_10K

1.1 Summary

Values for some general parameters

parameter                                                      value
number of sentences                                            10000
average sentence length in characters                          104.7955
average sentence length in words                               13.2207
number of distinct word forms                                  44591
percentage of lowercase word forms                             69.0521
percentage of multi-word units                                 1.4196
number of running word forms                                   157985
percentage of lowercase running words                          80.3454
average word form length                                       8.7102
average running word length                                    6.84355406
percentage of word forms with frequency=1                      69.9042
number of sentence-based co-occurrences                        10176
minimal likelihood ratio (sentence-based)                      6.63
maximal likelihood ratio (sentence-based)                      1272.70
number of neighbour-based co-occurrences                       2021
minimal likelihood ratio (neighbour-based)                     3.88
maximal likelihood ratio (neighbour-based)                     2392.12
average number of sentence-based co-occurrences per sentence   7.3382
average number of neighbour-based co-occurrences per sentence  1.2547
most frequent word                                             ir
most frequent word's frequency                                 4164
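
As a rough illustration of how several of the parameters above could be derived from a list of sentences, here is a minimal sketch. The function name `corpus_summary` is hypothetical, and splitting on whitespace is an assumption: the tokenization actually used to produce the figures in this summary is not documented here, so the sketch is not expected to reproduce them exactly.

```python
from collections import Counter


def corpus_summary(sentences):
    """Compute a few summary parameters for a list of sentence strings.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    if not sentences:
        raise ValueError("need at least one sentence")

    # Running word forms: every token occurrence, in order.
    tokens = [tok for sentence in sentences for tok in sentence.split()]
    if not tokens:
        raise ValueError("sentences contain no tokens")

    # Distinct word forms with their frequencies.
    counts = Counter(tokens)
    n_sentences = len(sentences)
    n_types = len(counts)
    n_tokens = len(tokens)

    # Word forms that occur exactly once in the corpus.
    singletons = sum(1 for c in counts.values() if c == 1)
    top_word, top_freq = counts.most_common(1)[0]

    return {
        "number of sentences": n_sentences,
        "average sentence length in characters":
            sum(len(s) for s in sentences) / n_sentences,
        "average sentence length in words": n_tokens / n_sentences,
        "number of distinct word forms": n_types,
        "number of running word forms": n_tokens,
        "average word form length": sum(len(w) for w in counts) / n_types,
        "average running word length": sum(len(t) for t in tokens) / n_tokens,
        "percentage of word forms with frequency=1": 100 * singletons / n_types,
        "most frequent word": top_word,
        "most frequent word's frequency": top_freq,
    }
```

Note the difference between the two length averages: "average word form length" weights each distinct form once, while "average running word length" weights each form by its frequency, which is why the running average is typically lower (short words are the frequent ones).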
377 msec needed at 2018-01-01 16:00
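
The summary reports minimal and maximal likelihood ratios for co-occurrences but does not say how they are computed. One common choice for co-occurrence significance is the log-likelihood ratio statistic (G²) over a 2x2 contingency table; the sketch below assumes that measure, and the function name `log_likelihood_ratio` and its argument layout are hypothetical.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of counts.

    k11: units containing both words A and B
    k12: units containing A but not B
    k21: units containing B but not A
    k22: units containing neither
    """
    def xlogx(x):
        # Convention: 0 * log(0) is taken as 0.
        return x * math.log(x) if x > 0 else 0.0

    n = k11 + k12 + k21 + k22
    cells = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
    rows = xlogx(k11 + k12) + xlogx(k21 + k22)
    cols = xlogx(k11 + k21) + xlogx(k12 + k22)

    # Equivalent to 2 * sum(O * ln(O / E)) with E = row_total * col_total / n.
    return 2.0 * (cells - rows - cols + xlogx(n))
```

Under this assumption, a value near 0 means the two words co-occur about as often as chance predicts, and larger values indicate stronger association. For reference, 6.63 is approximately the chi-square critical value at p = 0.01 with one degree of freedom, which may explain the minimal value reported for sentence-based co-occurrences.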