Corpus: ara_wikipedia_2021_300K


1.1 Summary

Values for some general parameters

Parameter Value
Number of sentences 300000
Average sentence length in characters 174.1700
Average sentence length in words 16.5453
Number of distinct word forms 469837
Number of distinct word forms (without multiwords) 416549
Percentage of lower case word forms 95.0247
Number of multiword units 53288
Percentage of multiword units 11.3418
Number of running word forms 5149259
Number of running word forms (without multiwords) 4958467
Percentage of lower case running words 99.2671
Average word form length 12.5711
Average running word length 9.50323679
Percentage of word forms with frequency=1 60.7211
Number of sentence based co-occurrences 1133230
- minimal likelihood ratio 6.63
- maximal likelihood ratio 23102.02
Number of neighbour based co-occurrences 157237
- minimal likelihood ratio 3.84
- maximal likelihood ratio 36552.81
Average number of sentence based co-occurrences per sentence 50.9220
Average number of neighbour based co-occurrences per sentence 5.5166
Most frequent word في
Most frequent word's frequency 201039
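
Some rows in the table can be derived from others. The sketch below recomputes the multiword unit count and percentage from the two distinct word form counts as a consistency check. The function and variable names are illustrative and not part of the corpus tooling.

```python
# Counts copied from the summary table.
DISTINCT_FORMS = 469837          # distinct word forms, multiwords included
DISTINCT_FORMS_NO_MW = 416549    # distinct word forms, multiwords excluded


def multiword_stats(total_types, single_word_types):
    """Return (multiword unit count, share of all distinct forms in percent)."""
    multiword_units = total_types - single_word_types
    percentage = 100.0 * multiword_units / total_types
    return multiword_units, percentage


mw_units, mw_pct = multiword_stats(DISTINCT_FORMS, DISTINCT_FORMS_NO_MW)
print(mw_units, round(mw_pct, 4))  # 53288 11.3418
```

Both results agree with the values listed in the table.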
Generated in 3917 ms on 2021-06-09 at 10:02
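
The minimal likelihood ratios in the table, 6.63 and 3.84, coincide with the chi-squared critical values for one degree of freedom at p = 0.01 and p = 0.05. This suggests they were used as significance cut-offs, although the report does not say so. The sketch below shows one common way to score a word pair, the log-likelihood ratio over a 2x2 contingency table. It is an assumption that the corpus statistics use this exact formulation, and the function names and the example counts are hypothetical.

```python
import math

# Assumed cut-offs, matching the minimal values in the table.
SENTENCE_THRESHOLD = 6.63    # chi-squared, 1 degree of freedom, p = 0.01
NEIGHBOUR_THRESHOLD = 3.84   # chi-squared, 1 degree of freedom, p = 0.05


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table.

    k11: units containing both words
    k12: units containing the first word only
    k21: units containing the second word only
    k22: units containing neither word
    """
    def h(*counts):
        # Sum of k * ln(k / total); empty cells contribute nothing.
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)

    cells = h(k11, k12, k21, k22)
    rows = h(k11 + k12, k21 + k22)
    cols = h(k11 + k21, k12 + k22)
    return 2.0 * (cells - rows - cols)


def is_significant(score, threshold):
    """True if the score reaches the chosen cut-off."""
    return score >= threshold


# Hypothetical counts: two words that always appear together.
score = log_likelihood_ratio(10, 0, 0, 10)
print(is_significant(score, SENTENCE_THRESHOLD))  # True
```

A pair whose words occur independently of each other, such as the table (10, 10, 10, 10), scores 0 and falls below both cut-offs.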