Corpus: hin_newscrawl_2017_100K

1.1 Summary

Values for some general parameters

Parameter                                                     Value
Number of sentences                                           100000
Average sentence length in characters                         236.6231
Average sentence length in words                              18.1867
Number of distinct word forms                                 94801
Number of distinct word forms (without multiwords)            93078
Percentage of lower case word forms                           96.0549
Number of multi word units                                    1723
Percentage of multi word units                                1.8175
Number of running word forms                                  1826666
Number of running word forms (without multiwords)             1816076
Percentage of lower case running words                        99.1578
Average word form length                                      19.0153
Average running word length                                   12.04338695
Percentage of word forms with frequency=1                     57.6956
Number of sentence based co-occurrences                       400038
- minimal likelihood ratio                                    6.63
- maximal likelihood ratio                                    15527.91
Number of neighbour based co-occurrences                      66518
- minimal likelihood ratio                                    3.84
- maximal likelihood ratio                                    55582.22
Average number of sentence based co-occurrences per sentence  90.0559
Average number of neighbour co-occurrences per sentence       8.6549
Most frequent word                                            के
Frequency of the most frequent word                           73426
Computed in 1638 msec at 2018-05-29 06:41
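
Some of the figures in the summary table can be cross-checked against each other. The following is a minimal sketch in Python, not the tool that produced this page. It recomputes the percentage of multi word units from the two distinct word form counts, and it shows what the two minimal likelihood ratio thresholds would mean under one assumption: that the likelihood ratio is compared against a chi-square distribution with 1 degree of freedom. The page itself does not state this, and the helper names `pct` and `chi2_sf_1df` are made up for this illustration.

```python
import math

# Counts copied from the summary table.
distinct_word_forms = 94801
distinct_without_multiwords = 93078
multi_word_units = 1723


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# The multi word units account for the difference between the two counts.
assert distinct_word_forms - distinct_without_multiwords == multi_word_units

# Share of multi word units among all distinct word forms, in percent.
pct_multi = round(pct(multi_word_units, distinct_word_forms), 4)
print(pct_multi)  # 1.8175, matching the table

# Assumption (not stated on the page): thresholds are chi-square critical values.
p_sentence = chi2_sf_1df(6.63)   # sentence based threshold, close to 0.01
p_neighbour = chi2_sf_1df(3.84)  # neighbour based threshold, close to 0.05
print(round(p_sentence, 3), round(p_neighbour, 3))
```

Under that assumption, the sentence based co-occurrences would be filtered at roughly the 1% significance level and the neighbour based ones at roughly the 5% level.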