Corpus: zho-mo_web_2016_1M

1.1 Summary

Values for some general parameters

Parameter                                                     Value
Number of sentences                                           1000000
Average sentence length in characters                         193.0802
Average sentence length in words                              1.2557
Number of distinct word forms                                 1237507
Number of distinct word forms (without multiwords)            1237507
Percentage of lower case word forms                           97.2018
Number of multi word units                                    0
Percentage of multi word units                                0.0000
Number of running word forms                                  33359026
Number of running word forms (without multiwords)             33359026
Percentage of lower case running words                        99.4591
Average word form length                                      9.1235
Average running word length                                   5.38675314
Percentage of word forms with frequency=1                     60.9538
Number of sentence based co-occurrences                       16300028
- minimal likelihood ratio                                    6.63
- maximal likelihood ratio                                    198415.23
Number of neighbour based co-occurrences                      990537
- minimal likelihood ratio                                    3.84
- maximal likelihood ratio                                    410528.41
Average number of sentence based co-occurrences per sentence  522.3889
Average number of neighbour co-occurrences per sentence       19.8656
Most frequent word
Frequent word's frequency                                     1485407
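
The minimal likelihood ratios in the table, 6.63 and 3.84, coincide with the chi-square critical values for one degree of freedom at p = 0.01 and p = 0.05. The following is a minimal sketch of how a co-occurrence score of this kind could be computed and filtered, assuming the score is the log-likelihood ratio (G2) of a 2x2 contingency table. The summary does not state which formula was used, so the function names, the contingency-table layout, and the interpretation of the two values as cut-offs are assumptions made for illustration.

```python
import math

# Thresholds copied from the summary table. Treating them as
# significance cut-offs is an assumption, not stated in the source.
SENTENCE_MIN_LLR = 6.63   # sentence based co-occurrences
NEIGHBOUR_MIN_LLR = 3.84  # neighbour based co-occurrences


def _xlogx(x):
    """Return x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood_ratio(k11, k12, k21, k22):
    """G2 statistic for a 2x2 contingency table (hypothetical helper).

    k11: units containing both words
    k12: units containing the first word only
    k21: units containing the second word only
    k22: units containing neither word
    """
    n = k11 + k12 + k21 + k22
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    # Equivalent to 2 * sum(O * ln(O / E)) over the four cells.
    return 2.0 * (cells + _xlogx(n) - rows - cols)


def passes_threshold(score, threshold):
    """Keep a word pair only if its score reaches the cut-off."""
    return score >= threshold
```

With this sketch, a pair whose counts are exactly what independence predicts scores 0 and is dropped, while a pair that always occurs together scores far above either cut-off.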
Computed in 346646 msec on 2018-07-02 at 21:25