Based on corpus data, over half of the words on a typical page of English text have four or fewer letters, and the average word length is slightly less than five (4.95).
| Length of… | Mean | Median | Mode |
|---|---|---|---|
| Unique words | 7.52 | 7 | 7 |
| Words weighted by frequency | 4.95 | 4 | 3 |
Source code
import pandas

# Tab-separated word list with no header row
wd = pandas.read_csv("eng-ca_web_2002_1M-words.txt", sep="\t", header=None,
                     names=["Rank", "Word", "Frequency"])
wd.Word = wd.Word.apply(str).str.lower()
# engine="python" lets query() evaluate the .str methods
wd = (wd.query("Frequency > 5")                                    # Remove rare words,
        .query("Word.str.match('^[A-Za-z]*$')", engine="python")   # words with non-letters,
        .query("Word.str.contains('[aeiouy]')", engine="python")   # abbreviations without vowels,
        .groupby("Word")                                           # and duplicates
        .sum())
wd["Length"] = pandas.Series(wd.index, index=wd.index).str.len()
sm = wd.groupby("Length")
result = pandas.DataFrame({
    "NormalizedCount": sm.Frequency.count() / sm.Frequency.count().sum(),
    "NormalizedFrequency": sm.Frequency.sum() / sm.Frequency.sum().sum(),
})
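
The listing stops at the per-length distributions and does not show how the mean, median and mode in the table were derived. One possible way to get such summary figures from a column of `result` is sketched below. The helper name `summarize_lengths`, the "first length whose cumulative share reaches one half" definition of the median, and the toy counts are all assumptions made for illustration, not the original method or the corpus data.

```python
import pandas


def summarize_lengths(share):
    """Summarize a word-length distribution.

    `share` is a Series indexed by word length whose values are counts
    or shares (hypothetical helper, not part of the original script).
    """
    share = share.sort_index()
    share = share / share.sum()  # normalize so the values sum to 1
    lengths = share.index.to_numpy()
    cumulative = share.cumsum()
    return {
        # Weighted mean: each length multiplied by its share
        "mean": float((lengths * share.to_numpy()).sum()),
        # Weighted median (assumed convention): first length at which
        # the cumulative share reaches one half
        "median": int(cumulative[cumulative >= 0.5].index[0]),
        # Mode: the length with the largest share
        "mode": int(share.idxmax()),
        # Share of words with four or fewer letters
        "share_up_to_4": float(share[share.index <= 4].sum()),
    }


# Made-up counts per word length, for illustration only
toy = pandas.Series({1: 10, 2: 20, 3: 30, 4: 15, 5: 15, 6: 10})
print(summarize_lengths(toy))
```

With the real data, `summarize_lengths(result["NormalizedCount"])` would describe unique words and `summarize_lengths(result["NormalizedFrequency"])` would describe words weighted by frequency.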