The 280-character spike

Back when tweets were capped at 140 characters, I observed that a disproportionate number of them barely fit within the character limit. Since then, Twitter has expanded the maximum tweet length to 280 characters, so how has this changed the distribution of tweet lengths?

The distribution of tweet lengths in a 2019 dataset

Compared to 2009, the tweet length distribution is much less dramatic. There is still a telltale spike approaching the character limit, but it is smaller than it used to be. The peak of the curve has also shifted leftwards, to 15 characters, due to a separate change in 2016 that excluded media attachments and certain at-mentions from the character count.

The most interesting feature of the above graph is unfortunately an artifact of the dataset: the massive spike at 105 characters can be blamed on a spambot network that was broadcasting the same tweet over and over when the dataset was collected.
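Spikes like this can be diagnosed by counting exact duplicates. A minimal pandas sketch (the toy texts and the threshold are invented for illustration, not taken from the dataset):

```python
import pandas

# Toy stand-in for the cleaned tweets DataFrame from the analysis script.
tweets = pandas.DataFrame({"text": [
    "win free followers today",   # hypothetical spam text, repeated verbatim
    "win free followers today",
    "win free followers today",
    "just had lunch",
    "good morning everyone",
]})

# Count how often each exact text occurs; verbatim repeats stand out.
counts = tweets["text"].value_counts()

# Flag anything repeated more than twice (the threshold here is arbitrary).
suspects = counts[counts > 2]
print(suspects)
```

Any text that dominates `suspects` will show up as a narrow spike at that text's length in the distribution.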

Source code

2019-analysis.py
import html
import pandas
from pandas import Series

# Load data
#
# Downloaded from https://archive.org/details/twitter_cikm_2010
# on 6 Mar 2016.

tweets = pandas.read_csv("test_set_tweets.txt",
                         sep="\t",
                         header=None,
                         names=["user_id", "tweet_id", "text", "date"])

# Clean data
#
# Filter out null tweets, retweets, and tweets that may have been incorrectly
# imported. Then unescape HTML entities such as &gt;

tweets = tweets[pandas.notnull(tweets['text'])]
tweets = tweets[~tweets['text'].str.contains('RT @')]
tweets = tweets[~tweets['text'].str.contains('\t')]
tweets["text"] = tweets["text"].apply(html.unescape)

# Calculate lengths

tweets["length"] = tweets["text"].str.len()


# Express each length's frequency as a percentage of all tweets
by_length = tweets.groupby("length").size() / len(tweets) * 100
by_length.to_csv("data.csv")
