Pareto Distribution in Language

I saw a great video by Vsauce on YouTube that covered the Pareto distribution and showed how it appears in the frequency of words in natural language. When the occurrences of each word in a text are tallied, the result is a distribution in which the most common word appears N times, the next most common appears about N/2 times, the third about N/3 times, and so on, a rank-frequency pattern usually called Zipf's law. This was intriguing, and I was skeptical. No fully satisfying explanation for why this happens exists yet. To test this property, I wrote a Python script that reads all the words from a PDF document, tallies their occurrences, and displays the result in a bar graph. I ran this script on my statement of objectives, and the result is shown below. The distribution is not perfect, but it does follow a similar pattern to the expected one. For a document of fewer than 3000 words, the result is not bad.

You can download my code by hitting the button below and try it on any PDF document you like.
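As a rough illustration of the tallying step, here is a minimal sketch. It is not the downloadable script: it assumes the text has already been extracted from the PDF into a plain string, it skips the bar graph, and the names `rank_frequencies` and `compare_to_ideal` are made up for this example.

```python
import re
from collections import Counter

def rank_frequencies(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    # Lowercase the text and keep runs of letters (with an optional inner apostrophe).
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words).most_common()

def compare_to_ideal(ranked, top=5):
    """Pair each observed count with the N/rank value the idealized pattern predicts."""
    n = ranked[0][1]  # count of the most common word
    return [
        (word, count, n / rank)
        for rank, (word, count) in enumerate(ranked[:top], start=1)
    ]

# Toy input; a real run would use the text pulled out of a PDF.
sample = "the cat and the dog and the bird saw the fish"
ranked = rank_frequencies(sample)
for word, count, ideal in compare_to_ideal(ranked):
    print(f"{word:>6} observed={count} ideal={ideal:.1f}")
```

On a short sample like this the match is only loose, which is consistent with what I saw on my own short document.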

The Pareto Principle is the idea that 80% of a result comes from 20% of the work. In the context of language, this refers to the estimate that 80% of all words written will come from the top 20% of the most frequently used words. The plot below demonstrates this further through its right skew: the most frequent words are used so often that the first 1/5 of the plot contains roughly 4/5 of all the words in the text.
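The 80/20 estimate can be checked directly from a table of word counts. The sketch below is only an illustration under stated assumptions: `top_share` is a hypothetical helper, and the `ideal` counts are synthetic (the word at rank r is given about N/r occurrences, with N = 1000 and 500 distinct words) rather than taken from my document.

```python
import math
from collections import Counter

def top_share(counts, fraction=0.2):
    """Share of all word occurrences covered by the top `fraction` of distinct words."""
    ordered = sorted(counts.values(), reverse=True)
    # Number of distinct words that make up the top slice (at least one).
    k = max(1, math.ceil(fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)

# Synthetic counts following the idealized N/rank pattern (assumed values, not real data).
ideal = Counter({f"w{r}": round(1000 / r) for r in range(1, 501)})
print(f"Top 20% of distinct words cover {top_share(ideal):.0%} of the text")
```

Passing the `Counter` built from a real document instead of `ideal` gives the corresponding share for that text.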

See for yourself
