Corpus: guj_wikipedia_2014_100K

1.1 Summary

Values for some general parameters

Parameter                                                     Value
Number of sentences                                           100000
Average sentence length in characters                         276.9656
Average sentence length in words                              17.0847
Number of distinct word forms                                 202731
Number of distinct word forms (without multiwords)            199549
Percentage of lower case word forms                           96.9304
Number of multi word units                                    3182
Percentage of multi word units                                1.5696
Number of running word forms                                  1719988
Number of running word forms (without multiwords)             1711495
Percentage of lower case running words                        99.2739
Average word form length                                      22.0680
Average running word length                                   15.11271140
Percentage of word forms with frequency=1                     63.4397
Number of sentence based co-occurrences                       339426
  - minimal likelihood ratio                                  6.63
  - maximal likelihood ratio                                  29164.08
Number of neighbour based co-occurrences                      45872
  - minimal likelihood ratio                                  3.84
  - maximal likelihood ratio                                  67577.25
Average number of sentence based co-occurrences per sentence  57.1932
Average number of neighbour co-occurrences per sentence       5.2125
Most frequent word                                            છે
Most frequent word's frequency                                81070
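
Some of the values in the table can be recomputed from the raw counts, which gives a quick sanity check when comparing corpora. The Python sketch below does this for a few of them. The variable and function names are illustrative and not part of the corpus tooling, and the formulas are plain ratios, which is an assumption about how the published values were derived.

```python
# Counts copied from the summary table.
distinct_word_forms = 202731
multi_word_units = 3182
running_word_forms = 1719988
most_frequent_word_freq = 81070


def percentage(part, whole):
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


# Share of multi-word units among all distinct word forms.
mwu_share = percentage(multi_word_units, distinct_word_forms)

# Distinct word forms left after removing the multi-word units.
distinct_without_mwu = distinct_word_forms - multi_word_units

# Share of the most frequent word among all running word forms
# (not listed in the table; derived here for illustration only).
top_word_share = percentage(most_frequent_word_freq, running_word_forms)

print(f"Multi-word units: {mwu_share:.4f} % of distinct word forms")
print(f"Distinct word forms without multiwords: {distinct_without_mwu}")
print(f"Most frequent word: {top_word_share:.2f} % of running word forms")
```

Under these assumptions the first two results agree with the table entries 1.5696 and 199549.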
2038 msec needed at 2024-07-29 02:00
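
The minimal likelihood ratios listed for the co-occurrences (6.63 and 3.84) match the chi-square critical values for one degree of freedom at p = 0.01 and p = 0.05, rounded to two decimals. This suggests, though the page does not say so, that co-occurrences are kept only when a log-likelihood ratio test passes that threshold. The following is a minimal sketch of such a test using Dunning's G² statistic on a 2x2 contingency table. The function names, the choice of statistic, and the example counts are assumptions for illustration, not the actual method used to build this corpus.

```python
import math

# Assumed thresholds, taken from the "minimal likelihood ratio" rows.
SENTENCE_THRESHOLD = 6.63   # chi-square, 1 d.f., p = 0.01
NEIGHBOUR_THRESHOLD = 3.84  # chi-square, 1 d.f., p = 0.05


def _xlogx_sum(*counts):
    """Sum of k * ln(k) over the given counts, treating 0 * ln(0) as 0."""
    return sum(k * math.log(k) for k in counts if k > 0)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 for a 2x2 contingency table.

    k11: units (e.g. sentences) containing both word A and word B
    k12: units containing A but not B
    k21: units containing B but not A
    k22: units containing neither
    """
    n = k11 + k12 + k21 + k22
    cells = _xlogx_sum(k11, k12, k21, k22)
    rows = _xlogx_sum(k11 + k12, k21 + k22)
    cols = _xlogx_sum(k11 + k21, k12 + k22)
    return 2.0 * (cells - rows - cols + n * math.log(n))


def is_significant(k11, k12, k21, k22, threshold=SENTENCE_THRESHOLD):
    """True if the pair's G^2 reaches the given threshold."""
    return log_likelihood_ratio(k11, k12, k21, k22) >= threshold


# Hypothetical counts: the two words occur independently of each other.
independent = log_likelihood_ratio(10, 90, 90, 810)

# Hypothetical counts: the two words co-occur far more often than chance.
associated = log_likelihood_ratio(50, 50, 50, 850)
```

With the independent counts the statistic is zero up to floating-point error, so the pair would be discarded. The associated counts give a value well above both thresholds.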