Corpus: jpn_newscrawl_2016_100K

1.1 Summary

Values for some general parameters

Parameter                                                     Value
Number of sentences                                           100000
Average sentence length in characters                         155.2497
Average sentence length in words                              1.6550
Number of distinct word forms                                 86079
Number of distinct word forms (without multiwords)            86030
Percentage of lower case word forms                           93.1877
Number of multi word units                                    49
Percentage of multi word units                                0.0569
Number of running word forms                                  2844611
Number of running word forms (without multiwords)             2844358
Percentage of lower case running words                        99.1896
Average word form length                                      10.5721
Average running word length                                   5.10900280
Percentage of word forms with frequency=1                     45.0842
Number of sentence based co-occurrences                       910118
  - minimal likelihood ratio                                  6.63
  - maximal likelihood ratio                                  22647.97
Number of neighbour based co-occurrences                      97240
  - minimal likelihood ratio                                  3.84
  - maximal likelihood ratio                                  119837.86
Average number of sentence based co-occurrences per sentence  240.6889
Average number of neighbour co-occurrences per sentence       15.9671
Most frequent word
Most frequent word's frequency                                147034
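
Several of the figures above can be cross-checked against each other. The following Python sketch does so, assuming (the report does not say this) that the multi word unit percentage and the frequency=1 percentage are both taken relative to the number of distinct word forms. The variable names are illustrative only.

```python
# Figures copied from the summary table above.
distinct_forms = 86079          # Number of distinct word forms
multiword_units = 49            # Number of multi word units
running_forms = 2844611         # Number of running word forms
running_forms_no_mwu = 2844358  # ... without multiwords
pct_freq_one = 45.0842          # Percentage of word forms with frequency=1

# Distinct word forms once multi word units are excluded.
distinct_no_mwu = distinct_forms - multiword_units

# Share of multi word units among distinct word forms (assumed denominator).
pct_mwu = round(100 * multiword_units / distinct_forms, 4)

# Running occurrences that belong to multi word units.
mwu_occurrences = running_forms - running_forms_no_mwu

# Approximate number of word forms seen exactly once (assumed denominator).
freq_one_forms = round(distinct_forms * pct_freq_one / 100)

print(distinct_no_mwu)   # 86030, matches the table
print(pct_mwu)           # 0.0569, matches the table
print(mwu_occurrences)   # 253
print(freq_one_forms)    # 38808
```

Under these assumptions the derived values agree with the reported ones, which suggests that both percentages are computed over distinct word forms rather than over running words.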
2570 msec needed at 2018-05-31 06:12
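
The minimal likelihood ratios 6.63 and 3.84 are consistent with the chi-square critical values for one degree of freedom at p = 0.01 and p = 0.05, which suggests they act as significance cut-offs for keeping a co-occurrence. The report does not state which formula is used; the Python sketch below assumes the common log-likelihood ratio (G²) over a 2x2 contingency table. The function name and the counts in the usage example are hypothetical.

```python
import math

# Cut-offs as reported in the table above.
SENTENCE_THRESHOLD = 6.63   # sentence based co-occurrences
NEIGHBOUR_THRESHOLD = 3.84  # neighbour based co-occurrences


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table of co-occurrence counts.

    k11: units containing both words
    k12: units containing word A but not word B
    k21: units containing word B but not word A
    k22: units containing neither word
    """
    def xlogx(x):
        # x * ln(x), with the usual convention 0 * ln(0) = 0.
        return x * math.log(x) if x > 0 else 0.0

    n = k11 + k12 + k21 + k22
    cells = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
    rows = xlogx(k11 + k12) + xlogx(k21 + k22)
    cols = xlogx(k11 + k21) + xlogx(k12 + k22)
    return 2.0 * (cells + xlogx(n) - rows - cols)


# Hypothetical counts in a 100000-sentence corpus: word A occurs in
# 100 sentences, word B in 200, and both together in 30.
score = log_likelihood_ratio(30, 70, 170, 99730)
keep = score >= SENTENCE_THRESHOLD
print(round(score, 2), keep)
```

With these made-up counts the pair clears the sentence based cut-off by a wide margin, while a table whose cells match their expected values under independence scores 0.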