Corpus: lit_newscrawl_2014_30K

1.1 Summary

Values for some general parameters

Parameter Value
Number of sentences 30000
Average sentence length in characters 114.3751
Average sentence length in words 14.5273
Number of distinct word forms 98887
Number of distinct word forms (without multiwords) 97981
Percentage of lowercase word forms 74.3354
Number of multi-word units 906
Percentage of multi-word units 0.9162
Number of running word forms 434039
Number of running word forms (without multiwords) 432638
Percentage of lowercase running words 83.4766
Average word form length 9.1992
Average running word length 6.75073849
Percentage of word forms with frequency=1 65.6537
Number of sentence-based co-occurrences 53816
- minimum likelihood ratio 6.63
- maximum likelihood ratio 4423.49
Number of neighbour-based co-occurrences 8825
- minimum likelihood ratio 3.84
- maximum likelihood ratio 4675.50
Average number of sentence-based co-occurrences per sentence 12.3555
Average number of neighbour-based co-occurrences per sentence 1.7492
Most frequent word ir
Frequency of the most frequent word 12941
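
Some of the values above can be cross-checked from the others, and the two minimum likelihood ratios (6.63 and 3.84) coincide with the chi-square critical values for one degree of freedom at p = 0.01 and p = 0.05. The page does not state which significance formula was used, so the sketch below assumes the common log-likelihood ratio (G²) over a 2x2 contingency table. The function names and the contingency-table layout are illustrative assumptions, not part of the corpus tooling.

```python
import math

# Thresholds copied from the table above; their use as cut-offs is an assumption.
SENTENCE_MIN_LLR = 6.63    # matches chi-square critical value, 1 d.f., p = 0.01
NEIGHBOUR_MIN_LLR = 3.84   # matches chi-square critical value, 1 d.f., p = 0.05


def _xlogx(x: float) -> float:
    """Return x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood_ratio(k11: int, k12: int, k21: int, k22: int) -> float:
    """G^2 statistic for a 2x2 contingency table (assumed layout).

    k11: units containing both words    k12: units with word A only
    k21: units with word B only         k22: units with neither word
    """
    n = k11 + k12 + k21 + k22
    cells = sum(_xlogx(k) for k in (k11, k12, k21, k22))
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    return 2.0 * (cells + _xlogx(n) - rows - cols)


def multiword_percentage(multiword_units: int, distinct_forms: int) -> float:
    """Share of multi-word units among all distinct word forms, in percent."""
    return 100.0 * multiword_units / distinct_forms


if __name__ == "__main__":
    # Cross-check derived table values against the raw counts.
    print(round(multiword_percentage(906, 98887), 4))  # table lists 0.9162
    print(98887 - 906)                                 # table lists 97981

    # Made-up counts for illustration only, not taken from the corpus.
    score = log_likelihood_ratio(10, 0, 0, 10)
    print(score >= SENTENCE_MIN_LLR)
```

Under this assumption, a word pair would be listed as a sentence-based co-occurrence only if its score reaches 6.63, and as a neighbour-based co-occurrence only if it reaches 3.84.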
Generated in 486 ms on 2018-03-15 at 16:00