Corpus: spa_news_2018_1M

1.1 Summary

Values for some general parameters

Parameter                                                     Value
Number of sentences                                           1000000
Average sentence length in characters                         139.2186
Average sentence length in words                              22.7173
Number of distinct word forms                                 470102
Number of distinct word forms (without multiwords)            397625
Percentage of lower case word forms                           46.0026
Number of multi word units                                    72477
Percentage of multi word units                                15.4173
Number of running word forms                                  23226639
Number of running word forms (without multiwords)             22694075
Percentage of lower case running words                        84.9411
Average word form length                                      8.6077
Average running word length                                   5.03716798
Percentage of word forms with frequency=1                     51.8728
Number of sentence based co-occurrences                       4259556
  - minimal likelihood ratio                                  6.63
  - maximal likelihood ratio                                  95735.37
Number of neighbour based co-occurrences                      514062
  - minimal likelihood ratio                                  3.84
  - maximal likelihood ratio                                  357529.12
Average number of sentence based co-occurrences per sentence  186.9129
Average number of neighbour co-occurrences per sentence       14.6909
Most frequent word                                            de
Most frequent word's frequency                                1520448
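
Several of the values above can be cross-checked from the raw counts. The following is a minimal sketch in Python, not the tool that produced this page; the variable names are illustrative, and the assumption that the percentage of multi word units is taken relative to the number of distinct word forms is inferred from the numbers rather than stated in the source.

```python
# Raw counts copied from the summary table above.
sentences = 1_000_000
distinct_forms = 470_102
multiword_units = 72_477
running_forms = 23_226_639
running_forms_no_mw = 22_694_075
most_frequent_count = 1_520_448  # frequency of "de"

# Distinct word forms without multiwords (table reports 397625).
forms_no_mw = distinct_forms - multiword_units

# Share of multi word units among distinct word forms, in percent
# (assumed denominator; table reports 15.4173).
pct_multiword = 100 * multiword_units / distinct_forms

# Running word forms per sentence, with and without multiwords.
tokens_per_sentence = running_forms / sentences
tokens_per_sentence_no_mw = running_forms_no_mw / sentences

# Share of the most frequent word among all running word forms, in percent.
pct_most_frequent = 100 * most_frequent_count / running_forms

print(forms_no_mw)
print(round(pct_multiword, 4))
print(round(tokens_per_sentence, 4), round(tokens_per_sentence_no_mw, 4))
print(round(pct_most_frequent, 2))
```

Neither per-sentence ratio reproduces the reported average sentence length in words (22.7173) exactly, so that figure presumably rests on a slightly different token count than the two totals listed here.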
Generated at 2021-07-14 15:00 (13568 msec needed)
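
The minimal likelihood ratios in the table (6.63 and 3.84) coincide with the chi-square critical values for one degree of freedom at the 0.01 and 0.05 levels. The sketch below assumes, without confirmation from this page, that the co-occurrence scores are log-likelihood ratios (G²) computed from a 2x2 contingency table; the function names and the example counts are hypothetical.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of co-occurrence counts.

    k11: units containing both words, k12: only the first word,
    k21: only the second word, k22: neither word.
    """
    n = k11 + k12 + k21 + k22
    observed = ((k11, k12), (k21, k22))
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_sums[i] * col_sums[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


def p_value_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# The two minimal likelihood ratios from the table, read as thresholds.
print(round(p_value_1df(3.84), 3))
print(round(p_value_1df(6.63), 3))

# Hypothetical counts: two words sharing 30 of 1000 sentences.
score = log_likelihood_ratio(30, 70, 120, 780)
print(score > 6.63)
```

Under this reading, a word pair would be kept as a sentence based co-occurrence only if its score reaches 6.63, and as a neighbour based co-occurrence only if it reaches 3.84.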