Corpus: swe_wikipedia_2021_100K

1.1 Summary

Values for some general parameters

Parameter                                                     Value
Number of sentences                                           100000
Average sentence length in characters                         91.7184
Average sentence length in words                              13.5938
Number of distinct word forms                                 203405
Number of distinct word forms (without multiwords)            170584
Percentage of lower case word forms                           42.9449
Number of multi word units                                    32821
Percentage of multi word units                                16.1358
Number of running word forms                                  1396310
Number of running word forms (without multiwords)             1357161
Percentage of lower case running words                        80.1599
Average word form length                                      9.3258
Average running word length                                   5.69663585
Percentage of word forms with frequency=1                     74.3851
Number of sentence based co-occurrences                       236204
- minimal likelihood ratio                                    6.63
- maximal likelihood ratio                                    64473.49
Number of neighbour based co-occurrences                      35614
- minimal likelihood ratio                                    3.84
- maximal likelihood ratio                                    121883.49
Average number of sentence based co-occurrences per sentence  59.6484
Average number of neighbour co-occurrences per sentence       5.2418
Most frequent word                                            är
Most frequent word's frequency                                45842
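
Some of the percentages in the table can be recomputed from the raw counts. The following is a minimal sketch of that arithmetic for the multiword figures; the variable and function names are illustrative and not part of the corpus tooling.

```python
# Counts copied from the summary table (names are illustrative only).
distinct_word_forms = 203405
multiword_units = 32821


def percentage(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of distinct word forms that are multiword units.
multiword_share = percentage(multiword_units, distinct_word_forms)

# Distinct word forms left after removing multiword units.
single_word_forms = distinct_word_forms - multiword_units

print(f"Percentage of multi word units: {multiword_share:.4f}")
print(f"Distinct word forms without multiwords: {single_word_forms}")
```

Both results agree with the table rows "Percentage of multi word units" and "Number of distinct word forms (without multiwords)".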
1505 msec needed at 2021-06-24 18:00
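
The minimal likelihood ratios listed for co-occurrences, 6.63 and 3.84, coincide with the chi-square critical values for one degree of freedom at p = 0.01 and p = 0.05. The page does not state which formula produced the scores. The sketch below assumes the common log-likelihood ratio (G²) over a 2x2 contingency table; the function name, the cell layout, and the example counts are hypothetical.

```python
import math

# Thresholds as listed in the summary table.
MIN_LLR_SENTENCE = 6.63    # chi-square critical value, 1 d.f., p = 0.01
MIN_LLR_NEIGHBOUR = 3.84   # chi-square critical value, 1 d.f., p = 0.05


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table (assumed formula).

    k11: units containing both words
    k12: units containing the first word only
    k21: units containing the second word only
    k22: units containing neither word
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Hypothetical counts: two words that co-occur far more often than chance.
score = log_likelihood_ratio(100, 10, 10, 10000)
print(score > MIN_LLR_SENTENCE)
```

Under this assumption, a word pair would be kept as a sentence based co-occurrence only if its score reaches 6.63, and as a neighbour based co-occurrence only if it reaches 3.84.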