User:Aleph2.0
Hello. This page is not my first contribution; a few months ago I contributed an article on Frequency Analysis, though it was deleted: I only put data from a 500,000 letter sample and a few other things in there. The article is below, if anyone is interested. Aleph2.0 04:34, 19 May 2006 (UTC)
Frequency Analysis, More Concisely
Frequency analysis is a technique commonly used in cryptanalysis (mainly against “hand systems,” ciphers breakable by hand). It involves counting the occurrences of each symbol in the ciphertext and then using those counts to work out which symbols represent which letters of the plaintext alphabet.
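The counting step can be sketched in a few lines of Python. This is only an illustration: the function name `letter_counts` is made up for this example, and it assumes the ciphertext uses the ordinary letters A to Z, ignoring case, spaces, and punctuation.

```python
from collections import Counter
import string

def letter_counts(text):
    """Count how often each letter A-Z occurs, ignoring case and non-letters."""
    letters = [ch for ch in text.upper() if ch in string.ascii_uppercase]
    return Counter(letters)

counts = letter_counts("Attack at dawn!")
# Show the two most frequent letters with their counts.
for letter, n in counts.most_common(2):
    print(letter, n)
```

For this very short sample, A occurs 4 times and T occurs 3 times. Real frequency analysis needs far more text before the counts become reliable.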
Frequency analysis of “monoalphabetic substitution ciphers,” such as the “Caesar shift cipher” and the general monoalphabetic cipher, is very straightforward: the most commonly occurring symbol probably stands for E, the most common letter in English plaintext, the next most common probably stands for T, and so on.
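For a Caesar shift, a single guess is enough, because every letter is moved by the same amount. The sketch below assumes that the most common ciphertext letter stands for E and derives the shift from that. The function names and the sample message are invented for illustration, and on short messages the guess can easily be wrong.

```python
from collections import Counter
import string

ALPHABET = string.ascii_uppercase

def guess_caesar_shift(ciphertext):
    """Guess the shift by assuming the most common ciphertext letter stands for E."""
    letters = [ch for ch in ciphertext.upper() if ch in ALPHABET]
    most_common = Counter(letters).most_common(1)[0][0]
    return (ALPHABET.index(most_common) - ALPHABET.index("E")) % 26

def caesar_decrypt(ciphertext, shift):
    """Shift every letter back by `shift` places; leave other characters alone."""
    out = []
    for ch in ciphertext.upper():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) - shift) % 26])
        else:
            out.append(ch)
    return "".join(out)

ciphertext = "PHHW PH DW WKH HOHYHQ WUHHV"
shift = guess_caesar_shift(ciphertext)
print(shift, caesar_decrypt(ciphertext, shift))  # 3 MEET ME AT THE ELEVEN TREES
```

A general monoalphabetic cipher has no single shift, so there each symbol has to be matched to a letter one at a time, working down the frequency order.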
For polyalphabetic ciphers, however, one has to apply the analysis to every second letter, then, if that doesn’t work, to every third letter, and so on, until the cryptanalyst finds the length of the “keyword” that directs the switching between cipher alphabets. Once the length n is known, the cryptanalyst applies frequency analysis separately to each of the n groups formed by taking every nth letter, since all the letters in one group were enciphered with the same cipher alphabet.
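Splitting the ciphertext into groups of every nth letter can be sketched as follows. The text above does not say how to tell whether a given n “works,” so this example assumes one common test, the index of coincidence: the chance that two letters picked from a group are the same. It is roughly 0.067 for English text and about 0.038 for uniformly random letters, so a candidate n whose groups score near the English value is a plausible keyword length. All function names here are hypothetical.

```python
from collections import Counter
import string

def only_letters(text):
    """Keep just the letters A-Z, uppercased."""
    return [ch for ch in text.upper() if ch in string.ascii_uppercase]

def every_nth(letters, n):
    """Split letters into n groups: group k holds letters k, k+n, k+2n, ..."""
    return [letters[k::n] for k in range(n)]

def index_of_coincidence(group):
    """Chance that two letters drawn from the group are the same letter."""
    total = len(group)
    if total < 2:
        return 0.0
    counts = Counter(group)
    return sum(c * (c - 1) for c in counts.values()) / (total * (total - 1))

def average_ic(letters, n):
    """Average index of coincidence over the n groups for candidate length n."""
    groups = every_nth(letters, n)
    return sum(index_of_coincidence(g) for g in groups) / n

def rank_key_lengths(ciphertext, max_length=10):
    """List candidate keyword lengths, most English-like groups first."""
    letters = only_letters(ciphertext)
    scores = {n: average_ic(letters, n) for n in range(1, max_length + 1)}
    return sorted(scores, key=scores.get, reverse=True)
```

Multiples of the true keyword length tend to score well too, so the smallest high-scoring n is usually the one to try first. Each group can then be attacked like a simple monoalphabetic cipher.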
Frequencies
Below is a table of the frequencies of the letters of the English alphabet, based on a sample of 500,000 characters from CTY’s listing and descriptions of its 7th and 8th grade courses for 2006, my journal of my activities with BSA Troop 680, discussions of some of the “classic Senate speeches,” and Wikipedia articles on:
TABLES OF FREQUENCIES
IN ALPHABETICAL ORDER………
A…………………43,194; 08.638800% N…………………37,751; 07.550200%
B…………………06,702; 01.340400% O…………………36,150; 07.230000%
C…………………16,694; 03.338800% P…………………10,409; 02.081800%
D…………………19,199; 03.839800% Q…………………00,396; 00.079200%
E…………………61,230; 12.246000% R…………………33,299; 06.659800%
F…………………11,297; 02.259400% S…………………33,734; 06.746800%
G…………………09,968; 01.993600% T…………………46,047; 09.209400%
H…………………23,006; 04.601200% U…………………13,322; 02.664400%
I…………………37,744; 07.548800% V…………………05,462; 01.092400%
J…………………00,869; 00.173800% W…………………08,510; 01.702000%
K…………………02,634; 00.526800% X…………………00,912; 00.182400%
L…………………20,712; 04.142400% Y…………………07,519; 01.503800%
M…………………12,643; 02.528600% Z…………………00,597; 00.119400%
IN ORDER OF FREQUENCY………
E…………………61,230; 12.246000% M…………………12,643; 02.528600%
T…………………46,047; 09.209400% F…………………11,297; 02.259400%
A…………………43,194; 08.638800% P…………………10,409; 02.081800%
N…………………37,751; 07.550200% G…………………09,968; 01.993600%
I…………………37,744; 07.548800% W…………………08,510; 01.702000%
O…………………36,150; 07.230000% Y…………………07,519; 01.503800%
S…………………33,734; 06.746800% B…………………06,702; 01.340400%
R…………………33,299; 06.659800% V…………………05,462; 01.092400%
H…………………23,006; 04.601200% K…………………02,634; 00.526800%
L…………………20,712; 04.142400% X…………………00,912; 00.182400%
D…………………19,199; 03.839800% J…………………00,869; 00.173800%
C…………………16,694; 03.338800% Z…………………00,597; 00.119400%
U…………………13,322; 02.664400% Q…………………00,396; 00.079200%
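The percentages and the frequency order can be recomputed from the raw counts. The sketch below uses the counts as printed in the alphabetical table, which sum to exactly 500,000. The names `COUNTS`, `percentages`, and `frequency_order` are made up for this example.

```python
# Letter counts copied from the alphabetical table.
COUNTS = {
    "A": 43194, "B": 6702, "C": 16694, "D": 19199, "E": 61230,
    "F": 11297, "G": 9968, "H": 23006, "I": 37744, "J": 869,
    "K": 2634, "L": 20712, "M": 12643, "N": 37751, "O": 36150,
    "P": 10409, "Q": 396, "R": 33299, "S": 33734, "T": 46047,
    "U": 13322, "V": 5462, "W": 8510, "X": 912, "Y": 7519, "Z": 597,
}

def percentages(counts):
    """Convert raw counts into percentages of the whole sample."""
    total = sum(counts.values())
    return {letter: 100 * n / total for letter, n in counts.items()}

def frequency_order(counts):
    """Return the letters as one string, most frequent first."""
    return "".join(sorted(counts, key=counts.get, reverse=True))

print(sum(COUNTS.values()))
print(frequency_order(COUNTS))
for letter, pct in sorted(percentages(COUNTS).items()):
    print(f"{letter}: {pct:.4f}%")
```

Summing the counts first is a cheap check that no figure was mistyped when the table was copied.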
A NOTE ON THE RESULTS
One will notice that the order of frequency here is rather different from the order given by most frequency sources:
THIS ARTICLE…………………………………MOST OTHERS*
E………………………………………………E
T………………………………………………T
A………………………………………………A
N………………………………………………O
I………………………………………………I
O………………………………………………N
S………………………………………………S
R………………………………………………H
H………………………………………………R
L………………………………………………D
D………………………………………………L
C………………………………………………U
U………………………………………………C
M………………………………………………M
F………………………………………………F
P………………………………………………W
G………………………………………………Y
W………………………………………………P
Y………………………………………………V
B………………………………………………B
V………………………………………………G
K………………………………………………K
X………………………………………………Q
J………………………………………………J
Z………………………………………………X
Q………………………………………………Z
This is probably because the selection includes more “foreign” words than most selections (see the articles on China, Russia, Triumph of the Will, WWI, and WWII). The order under “MOST OTHERS” would then be an “American frequency,” while the order found in this article would be a “World frequency.” Also, because of differences in spelling within the same language, if this article had used purely British works, the order produced would probably differ from both orders listed above. For example, U might overtake L: when Webster made his dictionary, he dropped the U’s that directly follow O’s in many words, so Americans spell it “color” where the British spell it “colour.” And if a “gray/grey situation” (the British “grEy” is the American “grAy”) appears in a sufficient number of words, A might fall behind O in the order.
* The “MOST OTHERS” order is the order from the Wikipedia article ETAOIN SHRDLU. It is quite similar to, if not the same as, the order given in Simon Singh’s The Code Book.
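The differences between the two orders can be listed mechanically. The sketch below writes each column of the comparison as a 26-letter string and reports every letter whose rank differs. The names `THIS_ARTICLE`, `MOST_OTHERS`, and `rank_shifts` are made up for this example.

```python
# The two frequency orders from the comparison, most frequent letter first.
THIS_ARTICLE = "ETANIOSRHLDCUMFPGWYBVKXJZQ"
MOST_OTHERS = "ETAOINSHRDLUCMFWYPVBGKQJXZ"

def rank_shifts(order_a, order_b):
    """Map each letter whose rank differs to (rank in a, rank in b), counting from 1."""
    shifts = {}
    for rank_a, letter in enumerate(order_a, start=1):
        rank_b = order_b.index(letter) + 1
        if rank_a != rank_b:
            shifts[letter] = (rank_a, rank_b)
    return shifts

shifts = rank_shifts(THIS_ARTICLE, MOST_OTHERS)
for letter, (here, elsewhere) in shifts.items():
    print(f"{letter}: rank {here} here, rank {elsewhere} in most others")
print(len(shifts), "letters change rank")
```

Sixteen of the 26 letters change rank, but most move by only one or two places. The largest moves are G (17th here, 21st in the other order) and Q (26th here, 23rd in the other order).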
ACKNOWLEDGEMENTS
I would like to thank Simon Singh for writing The Code Book (first US edition published by Doubleday, New York, NY, in 1999). Without it, I might never have gotten interested in codes and ciphers, and therefore never would have written this article. I would also like to thank my sister for going to CTY’s summer school on cryptology, which uses The Code Book as its course textbook, thus forcing her to buy it.
I would like to thank the San Diego Union-Tribune (one of the county newspapers) for running a blurb in its Quest section about Wikipedia, which prompted me to go to www.wikipedia.org. After only a few minutes of surfing around the site, I decided to write and donate this article.
WORKS CITED
As mentioned above, CTY’s 2006 7th and 8th grade summer school course listings, my Scout Journal, and the following Wikipedia articles were used in counting the frequencies:
Also used was the Wikipedia article ETAOIN SHRDLU.
Also, as mentioned in the Acknowledgements section, Simon Singh’s The Code Book helped a lot: my explanation of how to apply frequency analysis is based on my understanding of his.
- I took off a few thousand letters from this article so that the total number of characters used would be 500,000. I do not remember the exact number taken off, but it was around 7,300. The characters were a chunk lopped off from the end; I did not go in and remove several thousand E’s, say.
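Trimming a sample from the end so that it holds an exact number of letters can be sketched like this. The function name `truncate_to_letters` is made up for this example, and it assumes only the letters A to Z (in either case) count toward the limit.

```python
def truncate_to_letters(text, limit):
    """Return the shortest prefix of text holding `limit` letters (or all of text)."""
    if limit <= 0:
        return ""
    seen = 0
    for i, ch in enumerate(text):
        # Count only ASCII letters; spaces, digits, and punctuation are skipped.
        if ch.isascii() and ch.isalpha():
            seen += 1
            if seen == limit:
                return text[: i + 1]
    return text  # fewer than `limit` letters: keep everything

print(truncate_to_letters("ab, cd ef", 3))  # ab, c
```

Cutting a single chunk from the end, rather than deleting chosen letters, leaves the relative frequencies in the kept text untouched.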