Language • Ross Churchley

Ten years ago, I discovered that a disproportionate number of posts on Twitter had exactly 140 characters. Five years ago, I found the spike had shifted to the new 280-character limit. Now, Twitter has imploded and lots of people have migrated to Bluesky — what does the post length distribution look like on the new platform?

Length distribution of two million 2024 Bluesky posts, fit to an exponential curve.

Once again, we reproduce the spike approaching the character limit, with 300-character posts accounting for over half a percent of the entire dataset — more common than any other length greater than 50.

Once again, we also see the effects of bots: two bots are responsible for posting random strings of length 46 and 256 every few seconds, causing those lengths to be overrepresented.

But this might be my favourite of the charts I’ve made in this series. Take a close look at the 140 and 280 character marks: a larger-than-expected number of posts are exactly those lengths or slightly shorter. The curve abruptly reverts to its original course — an exponential decay curve — immediately after those milestones. That’s exactly the behaviour we expect (and observe!) when people compose posts that are a bit too long, then edit them down to fit within a character cap.

It’s entirely possible that these are the exact same spikes we saw in my last two charts in this series, if people are reposting their old tweets from yesteryear on Bluesky. But I suspect this is more likely evidence of people crafting messages to be cross-posted to multiple platforms, in which case they may be using tools that recommend smaller limits than those supported by Bluesky. Time will tell which hypothesis is right!
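As a rough illustration of how a chart like this can be built (not the code behind the one above), the Python sketch below counts posts by length, fits an exponential curve by least squares on the logarithm of the counts, and compares the observed count at a cap with what the curve predicts. The function names, the fitting range, and the use of len() as the length measure are all assumptions made for illustration; a platform may count characters differently than Python does.

```python
import math
from collections import Counter


def length_counts(posts):
    """Count how many posts have each length (in Python code points)."""
    return Counter(len(p) for p in posts)


def fit_exponential(counts, lo, hi):
    """Fit count ~ A * exp(b * length) over lengths lo..hi inclusive.

    Uses ordinary least squares on log(count). Lengths with no posts
    are skipped, since log(0) is undefined.
    """
    pts = [(n, math.log(counts[n])) for n in range(lo, hi + 1) if counts.get(n, 0) > 0]
    if len(pts) < 2:
        raise ValueError("need at least two non-empty lengths to fit a curve")
    x_bar = sum(x for x, _ in pts) / len(pts)
    y_bar = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - x_bar) ** 2 for x, _ in pts)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pts)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return math.exp(a), b


def excess_ratio(counts, A, b, length):
    """Observed count at `length` divided by the fitted curve's prediction."""
    return counts.get(length, 0) / (A * math.exp(b * length))
```

Fitting on a range that stops short of a cap and then calling excess_ratio at the cap gives a single number for how far the spike rises above the underlying decay.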

2025-03-13
Today’s Unicode calligraphy highlights the International Phonetic Alphabet symbol for the glottal stop, the sound between the syllables of “uh-oh”. (It can also be heard if you say the phrase “glottal stop” in a Cockney accent!)

In Squamish orthography, the glottal stop is represented by the digit 7 instead of ʔ, as in Sḵwx̱wú7mesh.

2025-03-03
Numbers like $\pi$ and $\sqrt{2}$ are irrational. Are they called that because they are unreasonable? Or is it just because they can’t be expressed as a ratio of two integers?

As it turns out, this question is even harder to answer in Ancient Greek. On the one hand, the word used by (e.g.) Euclid to describe irrational lengths (ἄλογος) comes from adding the negative prefix ἄ- to the word for ratio (λόγος). On the other hand, λόγος and ἄλογος had a lot of other meanings.

Ἄλογος could mean unexpected, which would be an appropriate description of the newly-discovered irrational numbers. Or it could mean unspeakable — in the literal sense, although the figurative sense fits with the (possibly fictitious) cover-up by the Pythagoreans. The word could also mean speechless or incapable of reasoning.

It’s fascinating to see how translators (and later mathematicians) decided to resolve the polysemy of ἄλογος:
- Latin translations of Greek works used ratio for λόγος and irrationalis for ἄλογος. These were actually imported into mathematical English before their non-technical meanings.
- Early Arabic mathematicians called $\sqrt{2}$ a deaf root (جذر أصم) in reference to the “speechless” meaning of ἄλογος. This got translated back to Latin as surd, a now-archaic term for (irreducible) roots.
- In modern Arabic, the phrase is “non-fractional number” (عدد غير كسري).
- In modern Greek, irrational numbers are now called inexpressible (άρρητος).

2025-02-27
Today’s Unicode calligraphy is the good old ampersand. The name for the character comes from “and per se and”, but despite all sources agreeing on that fact, I couldn’t figure out what the longer phrase actually meant.

Looking at older sources, I think the phrase is supposed to be parsed “& per se: and”. That is, “& per se” means the character “&” by itself, and the second “and” is a spelling-bee-style repetition to indicate that you have finished spelling the word “and”. So one might spell “wear and tear” out loud by saying:

W-E-A-R wear; & per se and; T-E-A-R tear

The same was apparently true for the words I, a, and O, so our national anthem would be spelled “O per se: O; C-A-N-A-D-A: Canada”.

2025-02-10
Most Chinese languages, such as Mandarin and Cantonese, pronounce 茶 along the lines of cha, but Min varieties along the southern coast of China and in Southeast Asia pronounce it like teh. Almost all other languages use variants of one or the other, depending on who they got their tea from.
- The Portuguese borrowed chá from Cantonese in the 1550s via their trading posts in Macau.
- The Persians adopted chay from Mandarin, and the word spread throughout Central Asia.
- The Dutch may have borrowed their word for tea through trade directly from Fujian or Taiwan, or from Malay traders in Java who had adopted the Min pronunciation as teh.
- The Dutch word influenced other European languages, including French (thé), Spanish (té), German (Tee), and English, even after the latter started buying tea directly from Cantonese-speaking merchants.

Several articles pointing this out have the headline “Cha if by land, tea if by sea”. This plays on Longfellow’s poem about Paul Revere, in which lanterns are to signal how the British invade: one if by land, and two if by sea.

2025-01-09
The word “distribute” is etymologically the opposite of “tribute”.

early 15c., distributen, “to deal out or apportion, bestow in parts or in due proportion,” from Latin distributus, past participle of distribuere “to divide, deal out in portions,” from dis- “individually” and tribuere “to pay, assign, grant,” also “allot among the tribes or to a tribe,” from tribus (see tribe)

In geography, a tributary is a stream or river that feeds into a larger body of water; for example, the Thompson River (Snek’w7étkwe) is a tributary of the Fraser (Sto:lo). When a river bifurcates into multiple downstream branches, such as the North and South Arms of the Fraser, those branches are called distributaries.

2024-09-20
“Hydrogen” in German is Wasserstoff, which sounds hilarious except it’s just a literal translation of the Greek hydro-gen!

Most chemical elements are more or less the same in German and English. The fun exceptions are:
- Wasserstoff (Hydrogen); “water stuff” is a literal translation from Greek.
- Kohlenstoff (Carbon); “coal stuff” is a literal translation from Latin.
- Stickstoff (Nitrogen); “suffocation stuff”, apparently because it’s the non-oxygen part of air, is a German original.
- Sauerstoff (Oxygen); “sour stuff” is a literal translation from Greek.

German also has Natrium (Sodium), Kalium (Potassium), Wolfram (Tungsten), Quecksilber (Mercury), and Blei (Lead).

2024-05-27
My favourite etymology fact is that “helicopter” is helico-pter — Greek for “spiral wing”. It’s obvious when pointed out, but I’d never have realized it on my own since in English it’s always broken down as heli-copter!

Relatedly, Magic: The Gathering has a creature type called Thopter, which is a rebracketed abbreviation of the word “ornithopter” (from ornitho- meaning bird, and pter meaning wing).

2023-10-02
The length distribution of tweets has shifted in response to raised character limits, but it’s still the case that a disproportionate number of tweets use all the characters they’re given.

A sample of tweets gathered in 2019 still exhibits a telltale spike approaching the character limit, but it is smaller than the tweet distribution from a decade earlier. The peak of the curve has also shifted leftwards, to 15 characters, due to a separate change in 2016 that excluded media attachments and certain at-mentions from the character count.

The most interesting feature of the above graph is unfortunately an artifact of the dataset — the massive spike at 105 characters can be blamed on a spambot network broadcasting identical copies of the same tweet when the dataset was collected.

2020-05-17
“General particulars” is an excellent phrase that deserves to catch on more widely than its current context of legally-mandated notices on boats.

(Boats are required by international law to have a wheelhouse poster listing their “general particulars”, i.e., a list of statistics, properties, and other bits of information necessary to get a basic view of the vessel.)

2019-11-15
“May you live in interesting times” is typically claimed to be a Chinese expression, but it actually originated with the British. Joseph Chamberlain — Neville’s dad — used the phrase “interesting times” frequently in speeches:

I think that you will all agree that we are living in most interesting times. I never remember myself a time in which our history was so full, in which day by day brought us new objects of interest, and, let me say also, new objects for anxiety.
Joseph Chamberlain

Joseph’s other son Austen was the first to claim it originated as a Chinese saying. Quote Investigator theorizes that Austen, in conversation with his diplomat colleagues, learned about a Chinese proverb that expresses apprehension about living in what his father would call “interesting times” and assumed that was the source of Joseph’s phrase. But the wording of the real proverb is entirely different:

寧為太平犬，莫作亂離人

Better to be a dog in days of peace, than a human in times of chaos.

Feng Menglong

2016-07-15
The word pea was originally pease in the singular and peasen in the plural. Eventually, speakers understandably interpreted the -s in pease as the plural suffix rather than just a sound in the original Latin pisum/pisa and Greek πίσον, and the English singular pea was born.

For example, a 15th-century cookbook has the following recipe for what we would today call pea soup:

Take grene pesyn, an washe hem clene an caste hem on a potte, an boyle hem tyl þey breste, an þanne take hem vppe of þe potte, an put hem with brothe yn a-noþer potte, and lete hem kele; þan draw hem þorw a straynowre in-to a fayre potte, an þan take oynonys…

Harleian manuscript 279

Pease also functioned as a mass noun, like bread or oatmeal.

Yisterday I ete cale and pes, & to-day I eete pes & cale, & to-morn I mon eate pess with cale, & after to-morn I mon eate cale with pease.
Alphabet of Tales

Unfortunately, the latter quote is taken from a religious anecdote promoting a moderate and uniform diet, and not a hilariously sarcastic comment by a medieval peasant.

2015-10-01
A disproportionate number of my tweets are exactly 140 characters. I don’t know whether that means I’m really good at Twitter or really bad. Sometimes it’s the result of a too-long idea being meticulously edited down to size; sometimes it’s purely chance. Either way, I find 140-character tweets oddly satisfying — and based on a large dataset of tweets, it looks like I’m not the only one.

The dataset paints a fascinating picture of the distribution of tweet lengths. Extremely short tweets are understandably very rare, but it doesn’t take long for the distribution to reach its first mode at 35 characters. The curve gradually and smoothly trails off to a local minimum around 116 characters, before positively spiking after 135. The average length is a bit more than 68 characters and the median a bit lower at 62.
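
The summary figures quoted above (mode, mean, median) and the share of tweets sitting exactly at the cap are simple to compute once the lengths are in hand. A minimal sketch using only the Python standard library follows; summarize_lengths and its cap parameter are hypothetical names, and len() is only an approximation of how Twitter counted characters.

```python
from statistics import mean, median, multimode


def summarize_lengths(texts, cap=140):
    """Return basic statistics for the character lengths of `texts`."""
    lengths = [len(t) for t in texts]
    return {
        "mean": mean(lengths),
        "median": median(lengths),
        # multimode returns every length tied for most common
        "modes": multimode(lengths),
        # fraction of texts that use exactly `cap` characters
        "share_at_cap": sum(1 for n in lengths if n == cap) / len(lengths),
    }
```

A share_at_cap well above the shares of neighbouring lengths would be the numerical counterpart of the spike in the chart.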

2015-09-20

Based on corpus data, over half of the words in a typical page of English text have four or fewer letters, with the average word length being slightly less than five.

Length of…	Mean	Median	Mode
Unique words	7.52	7	7
Words weighted by frequency	4.95	4	3

Source code

import pandas

# Word-frequency list: tab-separated, no header row.
wd = pandas.read_csv(
    "eng-ca_web_2002_1M-words.txt",
    sep="\t",  # passed by keyword; recent pandas rejects a positional separator
    header=None,
    names=["Rank", "Word", "Frequency"],
)

wd.Word = wd.Word.apply(str).str.lower()
wd = (
    wd.query("Frequency > 5")                                  # Remove rare words,
    .query("Word.str.match('^[A-Za-z]*$')", engine="python")   # words with non-letters,
    .query("Word.str.contains('[aeiouy]')", engine="python")   # abbrvns w/o vowels,
    .groupby("Word")                                           # and duplicates
    .sum()
)

# Word is now the index, so measure the length of each index entry.
wd["Length"] = pandas.Series(wd.index, index=wd.index).str.len()

sm = wd.groupby("Length")

# Share of unique words, and share of running text, at each word length.
result = pandas.DataFrame({
    "NormalizedCount": sm.Frequency.count() / sm.Frequency.count().sum(),
    "NormalizedFrequency": sm.Frequency.sum() / sm.Frequency.sum().sum(),
})
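
The source code above ends at the normalized distribution. One way the mean, median, and mode in the table could be derived from such a distribution is sketched below. This is an assumption about the final step, not the original code, and weighted_stats is a hypothetical helper that takes a plain mapping from word length to weight.

```python
def weighted_stats(dist):
    """Return (mean, median, mode) for a mapping of length -> weight.

    Weights may be raw counts or normalized shares.
    """
    total = sum(dist.values())
    mean = sum(n * w for n, w in dist.items()) / total
    # Mode: the length carrying the largest weight.
    mode = max(dist, key=dist.get)
    # Median: the smallest length at which the cumulative weight
    # reaches half of the total (the lower weighted median).
    running = 0.0
    median = None
    for n in sorted(dist):
        running += dist[n]
        if running >= total / 2:
            median = n
            break
    return mean, median, mode
```

With the DataFrame from the source code, a call such as weighted_stats(result["NormalizedFrequency"].to_dict()) would give the frequency-weighted row, and the "NormalizedCount" column the unique-words row.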

2007-01-27

According to an old piece of email forwarding-spam, it’s easy to read text even if you scramble all but the first and last letters in each word. But the truth is a bit more complicated.

The ancient meme reads:

Aoccdrnig to rseearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit plcae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

The form of this paragraph appears at first glance to provide direct evidence of its own “azanmig” claim. But something’s a little fishy: a lot of the words aren’t actually scrambled. Short words aren’t affected much if at all by the message’s middle-muddling, and most English words are short!

Matt Davis, an actual researcher at the University of Cambridge, wrote an informative response to the meme:

There are elements of truth in this, but also some things which scientists studying the psychology of language (psycholinguists) know to be incorrect.

I’m going to list some of the ways in which I think that the author(s) of this meme might have manipulated the jumbled text to make it relatively easy to read. This will also serve to list the factors that we think might be important in determining the ease or difficulty of reading jumbled text in general.

Short words are easy.

Function words (the, be, and, you etc.) stay the same - mostly because they are short words.

Of the 15 words in this sentence, there are 8 that are still in the correct order. However, as a reader you might not notice this since many of the words that remain intact are function words, which readers don’t tend to notice when reading.

Transpositions of adjacent letters are easier to read than more distant transpositions.

None of the words that have reordered letters create another word.

Transpositions were used that preserve the sound of the original word (e.g. toatl vs ttaol for total).

The text is reasonably predictable.

If you want to test your own permutation powers against realistic examples, I whipped up a bookmarklet that you can use to scramble the words on any website you want to challenge!
Scramble text!
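
The bookmarklet itself is not reproduced here. As a rough Python sketch of the same kind of middle-muddling, the functions below shuffle every letter except the first and last of each word, and also measure how many words in a text cannot be changed by scrambling at all. All names are hypothetical, and words are taken to be runs of ASCII letters, so a contraction like "doesn't" is treated as two separate pieces.

```python
import random
import re

WORD = re.compile(r"[A-Za-z]+")


def scramble_word(word, rng=random):
    """Shuffle the interior letters of `word`, keeping first and last in place."""
    if len(word) <= 3:
        return word  # at most one interior letter, so nothing can move
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def scramble_text(text, rng=random):
    """Scramble every run of letters in `text`, leaving everything else intact."""
    return WORD.sub(lambda m: scramble_word(m.group(0), rng), text)


def is_scramble_proof(word):
    """True if no shuffle of the interior letters can change the word."""
    return len(set(word[1:-1])) <= 1


def immune_share(text):
    """Fraction of words in `text` that scrambling cannot alter."""
    words = WORD.findall(text)
    if not words:
        return 0.0
    return sum(is_scramble_proof(w) for w in words) / len(words)
```

Running immune_share on ordinary prose gives a feel for how much of a "scrambled" paragraph is really untouched, which is the fishy part of the meme noted above.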
2007-01-20