Based on corpus data, over half of the words in a typical page of English text have four or fewer letters, and the average word length is slightly less than five.
Length of… | Mean | Median | Mode |
---|---|---|---|
Unique words | 7.52 | 7 | 7 |
Words weighted by frequency | 4.95 | 4 | 3 |
Source code
import pandas

# Tab-separated word list with no header row: rank, word, corpus frequency
wd = pandas.read_csv(
    "eng-ca_web_2002_1M-words.txt",
    sep="\t",  # must be passed by keyword in pandas 2.0 and later
    header=None,
    names=["Rank", "Word", "Frequency"]
)
wd.Word = wd.Word.apply(str).str.lower()
wd = (
    wd.query("Frequency > 5")                                   # Remove rare words,
    .query("Word.str.match('^[A-Za-z]*$')", engine="python")  # words with non-letters,
    .query("Word.str.contains('[aeiouy]')", engine="python")  # abbreviations without vowels,
    .groupby("Word")                                            # and duplicates
    .sum()
)
wd["Length"] = pandas.Series(wd.index, index=wd.index).str.len()
sm = wd.groupby("Length")
# Share of unique words and share of total occurrences at each word length
result = pandas.DataFrame({
    "NormalizedCount": sm.Frequency.count() / sm.Frequency.count().sum(),
    "NormalizedFrequency": sm.Frequency.sum() / sm.Frequency.sum().sum()
})
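
The listing stops at the length distribution and does not show how the mean, median, and mode in the table were obtained. The following is a minimal sketch of one way to compute such statistics, not necessarily the method used for the table. The function `length_stats`, the toy word list, and its weights are hypothetical, and the median is taken as the smallest length whose cumulative weight reaches half of the total, which is an assumption about the definition.

```python
import pandas


def length_stats(words, weights=None):
    """Return (mean, median, mode) of word length, optionally weighted.

    Hypothetical helper: with weights=None every word counts once
    (the "unique words" case); otherwise each word counts `weight` times.
    """
    df = pandas.DataFrame({"Word": words})
    df["Weight"] = 1 if weights is None else weights
    df["Length"] = df.Word.str.len()

    # Total weight at each length; groupby sorts the lengths in ascending order
    dist = df.groupby("Length").Weight.sum()
    total = dist.sum()

    lengths = dist.index.to_numpy()
    mean = float((lengths * dist.to_numpy()).sum() / total)

    # Assumed median definition: first length where cumulative weight >= half
    reached_half = (dist.cumsum() * 2 >= total).to_numpy()
    median = int(lengths[reached_half.argmax()])

    # Mode: the length carrying the largest total weight
    mode = int(dist.idxmax())
    return mean, median, mode


# Toy data, invented for illustration only
words = ["the", "of", "and", "letter", "frequency"]
weights = [50, 30, 20, 4, 1]

print(length_stats(words))           # each word counted once
print(length_stats(words, weights))  # words weighted by frequency
```

Applied to the real data, the unweighted call would correspond to the "Unique words" row and the weighted call to the "Words weighted by frequency" row, assuming `wd.index` and `wd.Frequency` are passed as `words` and `weights`.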