Corpus: zul_web_2012_10K

1.1 Summary

Values for some general parameters

Parameter Value
Number of sentences 10000
Average sentence length in characters 93.3167
Average sentence length in words 10.9150
Number of distinct word forms 38356
Number of distinct word forms (without multiwords) 38307
Percentage of lower case word forms 78.2772
Number of multi word units 49
Percentage of multi word units 0.1278
Number of running word forms 108385
Number of running word forms (without multiwords) 108314
Percentage of lower case running words 77.9674
Average word form length 9.1653
Average running word length 7.43162472
Percentage of word forms with frequency=1 71.9079
Number of sentence based co-occurrences 13786
- minimal likelihood ratio 6.63
- maximal likelihood ratio 657.62
Number of neighbour based co-occurrences 1880
- minimal likelihood ratio 3.87
- maximal likelihood ratio 1144.02
Average number of sentence based co-occurrences per sentence 5.9272
Average number of neighbour co-occurrences per sentence 0.9291
Most frequent word ukuthi
Frequency of the most frequent word 1503
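
Several rows in the table are derived from others, so they can be cross-checked with simple arithmetic. The sketch below is only an illustration: the variable names are hypothetical, the input figures are copied from the table, and the formulas (for example, that the multi word unit percentage is taken over distinct word forms) are assumptions that happen to reproduce the published values, not a description of how the corpus tool computes them.

```python
# Figures copied from the summary table.
distinct_forms = 38356          # Number of distinct word forms
distinct_forms_no_mwu = 38307   # ... without multiwords
running_forms = 108385          # Number of running word forms
running_forms_no_mwu = 108314   # ... without multiwords
top_word_frequency = 1503       # frequency of "ukuthi"

# Assumption: multi word units are the distinct forms that disappear
# when multiwords are excluded.
multi_word_units = distinct_forms - distinct_forms_no_mwu

# Assumption: the percentage is taken relative to all distinct word forms.
pct_multi_word_units = 100 * multi_word_units / distinct_forms

# Derived values that the table does not list.
running_mwu_tokens = running_forms - running_forms_no_mwu
top_word_share = 100 * top_word_frequency / running_forms

print("multi word units:", multi_word_units)
print("percentage of multi word units:", round(pct_multi_word_units, 4))
print("running tokens in multi word units:", running_mwu_tokens)
print("share of running words taken by 'ukuthi' (%):", round(top_word_share, 2))
```

Under these assumptions the first two printed values agree with the table rows "Number of multi word units" and "Percentage of multi word units". The last two are extra figures for orientation only.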
Generated in 354 msec on 2018-07-03 06:20