1/10/2012 – English Letter Frequency – Redux

Back on September 15, 2010, I wrote a little piece, “Frequency – Scrabble and the actual frequency of letter usage in English”, about what seemed to me a puzzling mismatch between the number of tiles in Scrabble and my superficial sense of how often letters occur in English. While I was away in Hong Kong in December, I received the following email note from David T. Wong:

Excuse me for being very late to comment about the piece you wrote last year regarding the frequency of letters used in the English language. The article was posted on September 15, 2010, but I have only recently read it. Anyway, in the article you had given the assumption that the frequency of letters used in English should be based upon their frequency as occurring in the Concise Oxford Dictionary. I have no objection to using that dictionary for the frequency of letters used in English words, but to assume that the frequency occurring in English words is the same as the frequency occurring in English usage is an absolutely untrue assumption. This is because not all words are used equally.

The simplest way to explain this is that while there is only one word “quiz” in the dictionary, and only one word “the” in the dictionary, why would you expect the two words (and their component letters) to be used with equal frequency? My point is that you will hear (or see) the word “the” (and its component letters) hundreds of times more frequently than you will see or hear the word “quiz”, yet you counted each letter (T, H, E, Q, U, I, Z) only once in your analysis of the dictionary. Most analyses of English usage that I see reveal a rather different ranking for letter frequency from the one you showed in your posting of 15 September 2010.

Thank you for your time in considering my opinion on the matter.

In thinking again about this question, it struck me that in the case of Scrabble, though a word may appear more than once on the board, there is no real connection between the words on the board and the words used in conversation and in writing. Scrabble players select words to place on the board based on the letters available to them and on maximizing their score under the scoring rules of the game. One could play Scrabble with just the lexicon of chemistry and find the game works quite well, though with a much smaller universe of likely players.

Thus, for Scrabble, the appropriate question is the frequency of letters in the lexicon of English, not the frequency of letters as found in a real stream of communication, or “usage”.
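The difference between the two kinds of frequency can be made concrete with a small sketch. It counts letters once per dictionary entry (lexicon frequency) and then again with each word weighted by how often it is used (usage frequency). The two-word lexicon comes from Mr. Wong's example; the usage counts are invented purely for illustration, and the function name `letter_counts` is hypothetical.

```python
from collections import Counter

def letter_counts(words, weights=None):
    """Count letters over `words`.

    With no weights, every word counts once (lexicon frequency).
    With weights, each word counts weights[word] times (usage frequency).
    """
    counts = Counter()
    for word in words:
        weight = 1 if weights is None else weights[word]
        for letter in word.lower():
            if letter.isalpha():
                counts[letter] += weight
    return counts

# Two dictionary entries, as in the email's example.
lexicon = ["the", "quiz"]

# Hypothetical usage counts, NOT measured data: "the" is simply assumed
# to be used 300 times for every use of "quiz".
usage = {"the": 300, "quiz": 1}

lexicon_counts = letter_counts(lexicon)        # each word counted once
usage_counts = letter_counts(lexicon, usage)   # each word weighted by use

print(lexicon_counts["t"], lexicon_counts["q"])  # 1 1
print(usage_counts["t"], usage_counts["q"])      # 300 1
```

In the lexicon count T and Q are tied, which is the view that matters for tiles drawn from a bag; in the usage count T dwarfs Q, which is the view that matters for running text.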

This frequency topic reminded me of how peculiar frequency can be in human language. Consider the voiced English phoneme “th” (/ð/, as in the word “then”), and compare its frequency in a lexicon of over 70,646 words in an advanced learner’s dictionary with its frequency in spoken text. The phoneme accounts for only 0.12% of all phoneme occurrences in the lexicon, yet comprises 3.56% of the phonemes in spoken usage. By the figures in the chart, it ranks 43rd out of 44 English phonemes by occurrences in the lexicon but 8th in spoken usage.

Below is a chart for further exploration.

 

RP phonemes in the Advanced Learner’s Dictionary

(adapted from: http://myweb.tiscali.co.uk/wordscape/wordlist/phonfreq.html)

| Phoneme | Illustrative keyword | Total occurrences in lexicon | Words containing the phoneme | % of total occurrences in lexicon | % of occurrences in speech |
|---|---|---|---|---|---|
| ə | another | 31009 | 26813 | 6.29% | 10.74% |
| ɪ | bid | 51830 | 37729 | 10.52% | 8.33% |
| n | near | 31934 | 27020 | 6.48% | 7.58% |
| t | teat | 34260 | 29441 | 6.95% | 6.42% |
| d | died | 21275 | 19125 | 4.32% | 5.14% |
| s | see | 33922 | 28548 | 6.88% | 4.81% |
| l | low | 27373 | 25435 | 5.56% | 3.66% |
| ð | then | 596 | 593 | 0.12% | 3.56% |
| r | raw | 23069 | 21434 | 4.68% | 3.51% |
| m | my | 14823 | 13988 | 3.01% | 3.22% |
| k | cake | 22453 | 20308 | 4.56% | 3.09% |
| e | bed | 11312 | 10940 | 2.30% | 2.97% |
| w | west | 4600 | 4523 | 0.93% | 2.81% |
| z | zoo | 19972 | 18808 | 4.05% | 2.46% |
| v | vine | 6007 | 5859 | 1.22% | 2.00% |
| b | bib | 10907 | 10420 | 2.21% | 1.97% |
| aɪ | bite | 7441 | 7236 | 1.51% | 1.83% |
| f | fine | 8839 | 8606 | 1.79% | 1.79% |
| p | pop | 15553 | 14569 | 3.16% | 1.78% |
| ʌ | bud | 7124 | 6917 | 1.45% | 1.75% |
| eɪ | bait | 10234 | 10029 | 2.08% | 1.71% |
| i | bead | 6721 | 6525 | 1.36% | 1.65% |
| əʊ | no | 6685 | 6416 | 1.36% | 1.51% |
| h | high | 3699 | 3625 | 0.75% | 1.46% |
| æ | bad | 11603 | 11149 | 2.35% | 1.45% |
| ɒ | pot | 7960 | 7747 | 1.62% | 1.37% |
| ɔ | port | 4730 | 4627 | 0.96% | 1.24% |
| ŋ | sing | 9181 | 8958 | 1.86% | 1.15% |
| u | boot | 4794 | 4743 | 0.97% | 1.13% |
| g | go | 6239 | 6079 | 1.27% | 1.05% |
| ʃ | shy | 6117 | 6039 | 1.24% | 0.96% |
| j | year | 3560 | 3518 | 0.72% | 0.88% |
| ʊ | put | 1977 | 1959 | 0.40% | 0.86% |
| ɑ | bard | 4215 | 4141 | 0.86% | 0.79% |
| aʊ | cow | 2179 | 2135 | 0.44% | 0.61% |
| ʤ | judge | 3869 | 3802 | 0.79% | 0.60% |
| ɜ | bird | 3095 | 3083 | 0.63% | 0.52% |
| ʧ | chin | 2672 | 2639 | 0.54% | 0.41% |
| θ | think | 1602 | 1591 | 0.33% | 0.37% |
| eə | bear | 965 | 962 | 0.20% | 0.34% |
| ɪə | beer | 4174 | 4034 | 0.85% | 0.21% |
| ɔɪ | boy | 788 | 784 | 0.16% | 0.14% |
| ʒ | treasure | 334 | 334 | 0.07% | 0.10% |
| ʊə | poor | 1053 | 1053 | 0.21% | 0.06% |
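
For readers who want to explore the chart, here is a minimal sketch that ranks the phonemes two ways: by total occurrences in the lexicon and by share of speech. The figures are copied from the chart above. Each phoneme is identified by its illustrative keyword rather than its symbol, which is a choice made for this sketch, and the names `ROWS` and `rank_descending` are hypothetical.

```python
# Each line: keyword, total occurrences in lexicon, % of occurrences in speech.
# Values are copied from the chart; phonemes are keyed by their keyword.
ROWS = """
another 31009 10.74
bid 51830 8.33
near 31934 7.58
teat 34260 6.42
died 21275 5.14
see 33922 4.81
low 27373 3.66
then 596 3.56
raw 23069 3.51
my 14823 3.22
cake 22453 3.09
bed 11312 2.97
west 4600 2.81
zoo 19972 2.46
vine 6007 2.00
bib 10907 1.97
bite 7441 1.83
fine 8839 1.79
pop 15553 1.78
bud 7124 1.75
bait 10234 1.71
bead 6721 1.65
no 6685 1.51
high 3699 1.46
bad 11603 1.45
pot 7960 1.37
port 4730 1.24
sing 9181 1.15
boot 4794 1.13
go 6239 1.05
shy 6117 0.96
year 3560 0.88
put 1977 0.86
bard 4215 0.79
cow 2179 0.61
judge 3869 0.60
bird 3095 0.52
chin 2672 0.41
think 1602 0.37
bear 965 0.34
beer 4174 0.21
boy 788 0.14
treasure 334 0.10
poor 1053 0.06
"""

def rank_descending(values):
    """Map each key to its rank (1 = largest value)."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {key: position for position, key in enumerate(ordered, start=1)}

lexicon_occurrences = {}
speech_percent = {}
for line in ROWS.strip().splitlines():
    keyword, occurrences, percent = line.split()
    lexicon_occurrences[keyword] = int(occurrences)
    speech_percent[keyword] = float(percent)

lexicon_rank = rank_descending(lexicon_occurrences)
speech_rank = rank_descending(speech_percent)

# Share of all phoneme occurrences in the lexicon, recomputed from the counts.
total = sum(lexicon_occurrences.values())
then_share = round(100 * lexicon_occurrences["then"] / total, 2)

print(len(lexicon_rank), "phonemes")
print("then:", lexicon_rank["then"], "in lexicon,", speech_rank["then"], "in speech")
print("then share of lexicon occurrences:", then_share, "%")
```

Changing `"then"` to any other keyword in the chart shows how far that phoneme moves between the two rankings.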