Talk:Letter frequency

(Redirected from Talk:Letter frequencies)

Dispute

Actually, the source is Cryptographical Mathematics, by Robert Edward Lewand and it does not state the sample size. The 15000 word sample is from someone named Tom's apparently independent analysis. hnw555, 11/28/06

The sources for the statistics are taken from: http://www.central.edu/homepages/LintonT/classes/spring01/cryptography/letterfreq.html And are based on a ridiculously small sample size (15000 characters). It might also be copyrighted information.

Oh dear, I didn't notice that sample size. I think the thing to do is remove the data from this page and link to (preferably several) others with data. Objections? Frencheigh 03:23, 20 July 2005 (UTC)[reply]
I found a nice letter frequency calculator and I'm gonna give it input from a largish number of Wikipedia articles. It'll contain some bias from stub templates and things, but it should be a somewhat accurate estimate of the frequencies of letters in the English language. --Ihope127 19:37, 9 September 2005 (UTC)[reply]
...Whoops, it poofed. Ah well... --Ihope127 13:41, 10 September 2005 (UTC)[reply]
Wikipedia is a very biased source: besides the templates, there are going to be many foreign words, which will affect the frequency. For English you'll be better off getting 50MB of text files from Project Gutenberg. -- 6 October 2005
I don't think that sort of thing could be used here, as original research is not permitted on Wikipedia. (Wikipedia:No original research)
I guess the results from Project Gutenberg that I just posted could be construed as original research. If so, I apologize, and I guess take them down. :-( I can provide detailed explanation of my methods and source code, as well as full result data, to anyone interested. Matt Whitlock 21:54, 12 April 2006 (UTC)[reply]
Removed, for interested persons this is when they were added. Frencheigh 17:59, 5 July 2006 (UTC)[reply]
Meanwhile, I've heard that plain-old data isn't protected by copyright. So instead of what I suggested above, I envision a table where each row is a letter and each column is a different study, and in the cells is the frequency of that letter as given by that study. That way the page would be immediately useful for what I bet is the main reason somebody would want to view it. Anyone know if that would be legal? Frencheigh 23:38, 6 October 2005 (UTC)[reply]

Well the data that is in there now doesn't seem terribly good, so much better would be to replace it with something else. There's got to be published sources that have used large representative samples. Any ideas of where to look? - Taxman Talk 18:38, 30 November 2005 (UTC)[reply]

Ok I did some searching and found corpus linguistics, which seems a much better way to do it. Summary statistics on the prominent corpora such as the Brown Corpus and the British National Corpus seem much more valuable than what is in the article. Only I couldn't find them. All I could find when searching was this that lists some interesting letter frequencies in various languages, but they appear to be just from some guy's webpage that calculated them. Help on finding summary statistics on the corpora would be great. - Taxman Talk 22:02, 30 November 2005 (UTC)

You've all misunderstood the original quoted article. The frequencies given in the Wikipedia article are correct; note that they all match the second source quoted of British National Corpus to the accuracy given. The mistake was that the title "Tom's Letter Frequencies (in order)" in the center of the page is NOT the caption to the table above; rather, it is the heading for the paragraph and table BELOW. Note that the paragraph even says it is "below" and also that the second table, based on the 15,000 letter sample, is in "order" of frequency. Thus, the original Wikipedia article should stand as being accurate. (JPP)

Is the factual accuracy still disputed? Argyriou 20:09, 3 July 2006 (UTC)[reply]
It appears that the four following sections are still based upon that 15,000-char analysis. If there are no objections, I think I'll remove those four sections and attribute the rest above them to "Cryptographical Mathematics" by Robert Edward Lewand. Now, are we sure on the title? "Cryptological Mathematics" gets many more Google hits. ([1], [2]). ((signature added later - comment by User:Frencheigh, PDT 15:59, 5 July 2006))
I've removed the {{disputed}} label. The sections User:Frencheigh removed can be found at [3] Argyriou 21:58, 11 July 2006 (UTC)[reply]
Directly above I was referring to the "Top 10 beginning of word letters", "Top 10 end of word letters", "Most common bigrams (in order)", and "Most common trigrams (in order)" sections, which were present during the above discussion, unlike the other Project Gutenberg ones I deleted recently (on account of their being OR, see farther up). I suppose I'll leave it for a bit again and clarify; I intend to remove all sections but "Relative frequencies of letters", "See also", and "External links", because the others are from the 15000-char analysis. Frencheigh 08:41, 12 July 2006 (UTC)
Done. Frencheigh 20:15, 19 July 2006 (UTC)[reply]

When I was a kid, I read a book on cryptography (I think it may have been "The First Book of Codes and Ciphers," which you can see Neal Stephenson reading in his author photo in "Cryptonomicon"!) that gave the frequency list as ETAONRISHDLFCMUGYPWBVKXJQZ. Anyone else recognize this ordering? Anyone know what statistical source it might have come from? I'm obviously not the only one who's ever thought it was the authoritative ordering, since googling that string of letters produces 187 results. --Mr. A. 21:07, 16 July 2006 (UTC)[reply]

The Stat keyboard design released around 2000 uses a database of 300,000 letter strokes from random Internet news articles and essays. It graphs a smooth bell curve which is statistical justification that the source accurately reflects the usage in the whole population. The keyboard design put the 6 letters used for 52% of the strokes close together, and the letters used in 83% of the strokes on the right half of the keyboard. The keys for the most common 3 letter groups, also determined from bell curve analysis of the same 300,000 stroke database, are placed so the keys can be struck in a pinky to index pattern which is faster than index to pinky. I have a copy of the material in graphic form but it was an internet source, didn't get wide distribution, can't be found, and may be called OR. 2600:8807:5400:28F0:1829:439C:BE94:D30E (talk) 16:54, 1 August 2023 (UTC)

Chart ordered by frequency would be helpful

The chart shown graphing letter frequency vs. letter is ordered alphabetically. An additional chart ordering the vertical bars by frequency (rather than alphabetically) would enhance the presentation.

I generated such a frequency-ordered chart on my Windows system using the Excel spreadsheet chart facility. I have not tried to add the result to the Wiki article because it's relatively ugly and because I couldn't figure out how to convert it to a .png file.

I've found an ordered letter frequency of the English language in this page: http://www.csm.astate.edu/~rossa/datasec/frequency.html The source of the table is: H. Beker and F. Piper, Cipher Systems, Wiley-Interscience, 1982. I don't know if it would be ok to put it here.


How about using this source:

Case-sensitive letter and bigram frequency counts from large-scale English corpora. MN Jones, DJK Mewhort - Behavior Research Methods, Instruments, & Computers, 2004

A link to it can be found here [4]

--Zip123 (talk) 16:37, 23 October 2008 (UTC)[reply]

CAN SOMEONE PLEASE ADD INFORMATION ABOUT HOW TO GENERATE LETTER FREQUENCY TABLES IN FOREIGN LANGUAGES (i.e. from texts that are loaded into a computer program)?!? [24.59.100.23]

I wrote a program that takes a file as input and generates a primitive frequency table...it won't work for Unicode, though. I have to fix that. If you want it, I can upload it to Wikipedia (is that legal?). 7 July 2006 - dargueta

I made a simple Mathematica program for this purpose. Save a text with only lower-case characters and with only letters from a to z (don't use other characters like á, à, ê etc.). Save it with the name Liber.txt. The code is the following.

Doc = Import["Liber.txt"];
Numb = Sum[StringCount[Doc, FromCharacterCode[i]], {i, 97, 122}];
K = Table[{StringCount[Doc, FromCharacterCode[i]]*(100./Numb), FromCharacterCode[i]}, {i, 97, 122}];
TableForm[K[[Ordering[100 - K]]]]

—Preceding unsigned comment added by 201.58.15.73 (talk) 14:59, 25 December 2007 (UTC)
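For the foreign-language and Unicode cases raised above, a rough Python sketch might look like the following. It is not the program either commenter wrote; the function names and the sample string are made up for illustration, and it assumes that "letter" means any character Python's `str.isalpha()` accepts.

```python
import unicodedata
from collections import Counter


def letter_frequencies(text):
    """Return {letter: percentage} for every alphabetic character in text."""
    # NFC normalization merges a base letter and a combining accent
    # (e.g. "e" + U+0301) into one precomposed character where possible.
    normalized = unicodedata.normalize("NFC", text)
    # str.isalpha() is Unicode-aware, so accented letters are kept.
    letters = [ch.lower() for ch in normalized if ch.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {letter: 100.0 * n / total for letter, n in counts.items()}


def print_table(freqs):
    """Print letters from most to least frequent, ties in code-point order."""
    for letter, pct in sorted(freqs.items(), key=lambda item: (-item[1], item[0])):
        print(f"{letter}\t{pct:.2f}")


sample = "Été à Noël"  # hypothetical sample text
print_table(letter_frequencies(sample))
```

To run it on a file, the text could be read with `open(path, encoding="utf-8").read()`, assuming the file is stored as UTF-8.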

Statistics from a larger sample size

In the book The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography by Simon Singh, I found the following table with a caption that reads:

This table of relative frequencies is based on passages taken from newspapers and novels, and the total sample was 100,362 alphabetic characters. The table was compiled by H. Beker and F. Piper, and originally published in Cipher Systems: The Protection Of Communication.

Note that the values below add to 100.3 due to rounding.

Letter Percentage Letter Percentage
a 8.2 n 6.7
b 1.5 o 7.5
c 2.8 p 1.9
d 4.3 q 0.1
e 12.7 r 6.0
f 2.2 s 6.3
g 2.0 t 9.1
h 6.1 u 2.8
i 7.0 v 1.0
j 0.2 w 2.4
k 0.8 x 0.2
l 4.0 y 2.0
m 2.4 z 0.1


The following table sorts the values given above in order of letter frequency.

Letter Percentage Letter Percentage
e 12.7 m 2.4
t 9.1 w 2.4
a 8.2 f 2.2
o 7.5 g 2.0
i 7.0 y 2.0
n 6.7 p 1.9
s 6.3 b 1.5
h 6.1 v 1.0
r 6.0 k 0.8
d 4.3 j 0.2
l 4.0 x 0.2
c 2.8 q 0.1
u 2.8 z 0.1
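The two claims about this table (that the rounded values sum to 100.3, and the frequency ordering) can be rechecked with a short script. This is only a sketch using the percentages exactly as quoted above; the variable names are illustrative.

```python
# Percentages as quoted in the table above (Beker and Piper, via Singh).
percent = {
    "a": 8.2, "b": 1.5, "c": 2.8, "d": 4.3, "e": 12.7, "f": 2.2, "g": 2.0,
    "h": 6.1, "i": 7.0, "j": 0.2, "k": 0.8, "l": 4.0, "m": 2.4, "n": 6.7,
    "o": 7.5, "p": 1.9, "q": 0.1, "r": 6.0, "s": 6.3, "t": 9.1, "u": 2.8,
    "v": 1.0, "w": 2.4, "x": 0.2, "y": 2.0, "z": 0.1,
}

# Work in integer tenths of a percent to avoid floating-point noise.
tenths = {letter: round(value * 10) for letter, value in percent.items()}
total = sum(tenths.values()) / 10

# sorted() is stable, so letters with equal percentages stay alphabetical.
by_frequency = sorted(tenths, key=lambda letter: -tenths[letter])
ordering = "".join(by_frequency)

print(total)
print(ordering)
```

Letters with equal rounded percentages (for example c and u) cannot be ranked from this table alone, so their relative order here is just alphabetical.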


I took a little class on cryptology once, and The Code Book and Cryptological Mathematics were our textbooks. I'm pretty sure they have the same data, but in The Code Book it's rounded. --Ravi12346 19:40, 30 July 2006 (UTC)[reply]


This pertains both to the "Letter frequency" section and the "Samples from a larger dataset" section. In the course of an NLP project I've been doing, I've parsed through the English Wikipedia dump and built up a vocabulary file. I've parsed this file (which takes into account word occurrence counts as well, unlike the OED dataset presented earlier). I can provide first letter frequencies, all letter frequencies and last letter frequencies if that could help. I may also be able to share the source code to replicate this (the program is written in Python with Gensim, Pattern3 and some NumPy thrown in). The letter frequencies approach those given here (with some minor variance). It should also be noted that some foreign words creep in (depending on the subject matter of the Wiki articles), but only alpha characters between a-z have been tallied, even for those foreign words. — Preceding unsigned comment added by PGadoury (talk · contribs) 02:53, 18 December 2018 (UTC)
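The kind of tally described above (first-letter, all-letter and last-letter counts weighted by how often each word occurs, a-z only) could be sketched roughly as below. This is not PGadoury's program: it uses only the standard library rather than Gensim or Pattern3, and the function name and the tiny vocabulary are hypothetical.

```python
from collections import Counter


def positional_letter_counts(vocab):
    """vocab maps word -> occurrence count; returns (first, all, last) Counters."""
    first, every, last = Counter(), Counter(), Counter()
    for word, count in vocab.items():
        # Keep only ASCII a-z, dropping accented or non-Latin characters.
        letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
        if not letters:
            continue
        first[letters[0]] += count
        last[letters[-1]] += count
        for ch in letters:
            every[ch] += count
    return first, every, last


# Hypothetical vocabulary; real counts would come from the parsed dump.
vocab = {"the": 5, "naïve": 1, "and": 2}
first, every, last = positional_letter_counts(vocab)
print(first.most_common(3))
print(last.most_common(3))
```

Weighting by occurrence count is what separates this from a dictionary-based tally, where every distinct word counts once.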

Query

sth is a surprise in a list of high-frequency trigrams. On its own it's an abbreviation of south, and I can think of a few words containing it, but not enough to account for its listing here. Can anyone tell me what it is I haven't thought of?

I grepped a dictionary and came up with 414 results... admittedly, almost all of them you wouldn't use in conversation (try "somesthetic" and "chromesthesia"), but there are a couple like 'firsthand' and 'guesthouse' that aren't so outlandish.

is as has was / this the that there they

Trigraphs ignoring spaces may not be of great practical use though. Uldoon 10:33, 10 March 2006 (UTC)[reply]

Given that this seems suspicious, and that we have reproducible numbers from PG, might this section (and sections 1-4) be gotten rid of? Onepairofpants 14:38, 30 May 2006 (UTC)

I agree that the top portion of the page should be deleted. The sample size of that portion is 15000 characters with only 2700 words. And the input is definitely biased (license agreement from Sun, teaching philosophy of a computer science professor, letter of recommendation). This is probably why "sth" appears in the results.
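The word-boundary explanation hinted at in this section (a word ending in "s" followed by one starting with "th") can be illustrated with a small sketch. The function name and the sample sentence are made up; the only assumption is that trigrams are counted over the text with everything but letters removed, as opposed to within words only.

```python
from collections import Counter


def trigram_counts(text, strip_spaces=True):
    """Count letter trigrams, either across or within word boundaries."""
    if strip_spaces:
        # Cryptanalysis style: one unbroken stream of letters.
        streams = ["".join(ch for ch in text.lower() if ch.isalpha())]
    else:
        # Count only within words, so no trigram spans a word boundary.
        streams = ["".join(ch for ch in word if ch.isalpha())
                   for word in text.lower().split()]
    counts = Counter()
    for stream in streams:
        for i in range(len(stream) - 2):
            counts[stream[i:i + 3]] += 1
    return counts


sample = "this is the one that was there"  # hypothetical sample
print(trigram_counts(sample)["sth"])
print(trigram_counts(sample, strip_spaces=False)["sth"])
```

In this sample "sth" only appears when spaces are stripped, coming from "is the" and "was there".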

American English

contains a lot more "z"s than British English. 218.102.218.250 03:02, 5 April 2006 (UTC)[reply]

Mainly, I assume, thro' a preference for -ize as a suffix in the US rather than -ise; this despite the reverence in which the Oxford English Dictionary is held, and its general preference for the former spelling.

Average Word length

I would be interested to know some more statistics about these letter frequencies, but I lack the skill to extract the relevant information from the PG archive's ample selection of texts. What is the average word length in English? I read somewhere that it was 4.26, though this was with a rather small sample size. Is the distribution of word lengths a standard distribution? If so, what is the standard deviation? How does letter frequency vary with word length? Obviously at words of 1 letter, the frequencies will be 0 apart from "I", "A" and possibly "O"... Would anyone have the ability and the capability to satisfy my curiosity? 86.20.233.151 20:59, 1 June 2006 (UTC)

One of the references for this article (Peter Norvig "English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU") answers some of your questions:
The average word length in English text is 4.79 letters per word, the most common word length in English text is 3 letters per word.
The average word length for *distinct* English words is 7.60 letters long, the most common word length for *distinct* English words is 7 letters.
Both appear to be approximately a Poisson distribution.
--DavidCary (talk) 01:04, 18 October 2021 (UTC)[reply]
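The difference between the two averages quoted above (over running text versus over distinct words) is easy to see in a small sketch. This is not Norvig's method; it assumes a "word" is a maximal run of ASCII letters, and the function name and sample sentence are hypothetical.

```python
import re
from collections import Counter


def word_length_stats(text):
    """Return (mean length over tokens, mean over distinct words, histogram)."""
    # Assumption: a word is a maximal run of a-z after lower-casing.
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        raise ValueError("no words found")
    distinct = set(words)
    mean_tokens = sum(map(len, words)) / len(words)
    mean_distinct = sum(map(len, distinct)) / len(distinct)
    # Histogram of word lengths over running text (tokens).
    histogram = Counter(len(w) for w in words)
    return mean_tokens, mean_distinct, histogram


sample = "the cat and the dog saw the bird"  # hypothetical sample
mean_tokens, mean_distinct, histogram = word_length_stats(sample)
print(mean_tokens, mean_distinct, histogram.most_common(1))
```

Frequent short words such as "the" are counted every time they occur in the token average but only once in the distinct-word average, which is why the latter tends to be larger on real text.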

Wheel of Fortune

So...H and D appear more than L, but the "gimme" letters in the last round of Wheel of Fortune are RSTLNE. That should appear somewhere towards the bottom of this article. --JD79 19:20, 26 January 2007 (UTC)[reply]

Pure speculation here, but (1) I think those are the gimme letters because people had gotten the idea that they were a "good set" and guessed the exact same letters all the time, and the producers probably wanted to mix it up; and (2) if you've already guessed S and T, you can probably infer the locations of H's, as well as the likelihood of the last letter's being D (the past tense suffix no doubt being the reason why D is so common). --Mr. A. 04:21, 27 January 2007 (UTC)

"Text messages"?

From the article:

"The frequency of letters in text messages has often been studied for use in cryptography..."

In the UK at least, the term "text messages" means SMS messages, and readers could hardly understand it to mean anything else. However, from the context it doesn't seem very likely that this was the intended meaning, at least not exclusively. If not then I suggest "text messages" is replaced by just "text". I won't change it just yet in case anyone has a good reason why it should read as it does. Matt 20:36, 21 April 2007 (UTC).

I think you're absolutely right, and I've gone ahead and made the change. --Mr. A. 10:49, 24 April 2007 (UTC)[reply]

Hemingway and Faulkner

I've removed the Hemingway vs Faulkner section in Letter frequencies, again, because, aside from the quote about conventional use of punctuation, it's unreferenced, and may be original research. It's also not really germane to the article; showing that two particular works have different letter frequencies doesn't really say anything about those writers' styles - they may have chosen character or place names which account for the entire discrepancy. An analysis, to be meaningful, would need to show similarities across all of an author's works for several authors, and consistent differences between authors. If there is any published literature on the subject, then it would be appropriate to summarize it and include it in this article under a section heading such as letter frequency variation and authorial style. Argyriou (talk) 19:39, 23 July 2007 (UTC)[reply]

ETAONRISH

I recall having read a book that used an alphabet ordered by frequency, beginning with the letters above, that was useful in solving substitution ciphers. Searching for "etaonrish" on Google gets quite a few results, too, so I wonder why this order isn't discussed in this article. B7T (talk) 04:11, 17 April 2008 (UTC)[reply]

This is the list that I mention at the end of the Dispute section above. I'd be interested in knowing where this list of frequencies came from. --Geenius at Wrok (talk) 10:41, 19 April 2008 (UTC)[reply]
If the title in that section is correct, a query on Amazon.com suggests that the book in question may be a children's book, The First Book of Codes and Ciphers by Sam and Beryl Epstein, illustrated by Laszlo Roth, published in 1956, if anyone wants to try to find a copy to verify it; I'm fairly certain that I first heard of this sequence in a different book for young people. A search for "etaonrish" on Amazon.com yielded results of its mention in five other books in the context of ciphers or cryptography, none of which are children's books (although one is about activities to do with children). B7T (talk) 19:44, 24 April 2008 (UTC)
If you ever figure it out, perhaps it would also be useful in the ETAOIN SHRDLU article. --68.0.124.33 (talk) 17:21, 9 March 2010 (UTC)[reply]
The full sequence of the letters is ETAONRISHDLFCMUGYPWBVKXJQZ and can be found in a children's book by the name of "Alvin's Secret Code" 1963 by Clifford B. Hicks. — Preceding unsigned comment added by 204.44.186.204 (talk) 17:37, 2 March 2015 (UTC)[reply]
This sequence is also mentioned in the children's book "Codes and Secret Writing" by Herbert Zim (1971). According to Zim, "Many studies of the frequency of letters in the English language have been made. Samuel Morse made such a study when he made his Morse Code. Edgar Allan Poe did too. Dozens of other studies are all in reasonably close agreement. The printer who sets up type knows this fact about our alphabet and uses it every day. The case in which he keeps his type is divided up into compartments. The letters used most get the largest boxes and those that are easiest to reach." (p71) Dabley2 (talk) 13:15, 2 May 2017 (UTC)[reply]

German ä, ö, and ü

This list is immediately suspect. Although I have no figures to back it up, I know enough German to wager that these three letters are nowhere near the given zero percent. Mamarazzi (talk) 05:35, 16 June 2008 (UTC)[reply]

The appendix to Fletcher Pratt's classic Secret and Urgent also gives a frequency table for German that omits these letters; it notes that umlauted a, o, and u have been counted with the non-umlauted a, o, and u. The reason for doing this is not explained. Perhaps in telegraphy they were treated the same? 76.199.88.163 (talk) 12:30, 8 July 2008 (UTC)[reply]
In The American Cryptogram Association's Xenocrypts, umlauts and other diacritical marks are stripped. Pratt does seem to have been a member:
http://voynichcentral.com/transcriptions/Voynich-101/strong_letters.pdf.
My guess would be that the frequency tables he printed had been compiled by members of the ACA.
--jdege (talk) 22:18, 15 July 2008 (UTC)[reply]
But if they weren't counted, don't put in a number such as 0... AnonMoos (talk) 14:02, 20 August 2008 (UTC)[reply]

Shouldn't we be warning readers about potential problems in the data? I know German but blithely copied the data without thinking, assuming them to be correct until I started seeing major anomalies in the statistical totals I was tabulating across languages! If I hadn't noticed this, it could have thrown my work way out! Matthew Slyman (talk) 09:07, 29 April 2013 (UTC)

I've just noticed the discrepancy as well. It should be fixed since it's incorrect and misleading. Both this page, and its German Wikipedia equivalent cite the same source but unfortunately, I do not have a copy of the source available. The German page says that ä, ö, and ü were counted as if they were ae, oe, and ue (standard practice when those characters aren't available). It also states the ligature ſz (which would later develop into ß) is counted independently from ß itself, which suggests some older texts were used in the analysis. Ideally another source should be found if possible. Note I do in fact assume the German page is right, since its totals sum to around 100% while the English page sums to about 102 or 103%. 74.12.29.232 (talk) 03:45, 4 October 2013 (UTC)
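The counting convention the German page reportedly describes (ä, ö, ü counted as ae, oe, ue, with ß kept as its own letter) can be sketched as follows. This is only an illustration of that convention, not the method of the cited source; the names and the sample phrase are hypothetical.

```python
from collections import Counter

# Expand umlauted vowels to two letters; ß is deliberately left alone.
UMLAUT_EXPANSION = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def german_letter_percentages(text, expand_umlauts=True):
    """Return {letter: percentage}, optionally counting ä/ö/ü as ae/oe/ue."""
    text = text.lower()
    if expand_umlauts:
        text = text.translate(UMLAUT_EXPANSION)
    counts = Counter(ch for ch in text if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: 100.0 * n / total for ch, n in counts.items()}


sample = "Grüße aus Köln"  # hypothetical sample
expanded = german_letter_percentages(sample)
raw = german_letter_percentages(sample, expand_umlauts=False)
print("ü" in expanded, "ü" in raw)
```

Under the expanded convention ä, ö and ü have no count at all, so a table built that way should leave those rows blank rather than show 0, and either convention should sum to 100% on its own.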

Apostrophe

This is a great article, but I would find it useful to include the apostrophe character in the list of characters for which data is gathered, sorted by frequency, etc. I recall that in some languages (including English?) the apostrophe has a higher frequency than a few other letters of the alphabet. I went to this Wikipedia page to check this, but couldn't find such information, because the apostrophe is not even considered. Even more importantly, in languages like Italian the apostrophe can be part of a word. Even in languages like English or German there may be scenarios where one could wish to include the apostrophe (as used in possessives and contractions) in such statistics. Hi-Toro (talk) 01:19, 22 March 2011 (UTC)[reply]

Frequencies are inherently a property of the specific type of text. The traditional letter frequency counts are from telegraph text - upper case, with numbers spelled out and punctuation either dropped or spelled out. Mixed case, with punctuation, and/or with spaces, and/or with umlauts/accents/etc., will all have different frequency counts. As will Baudot code, ASCII, every different flavor of Unicode, etc.
We can't possibly include tables for them all.
jdege (talk) 01:43, 23 March 2011 (UTC)[reply]

Move to "Letter frequency"?

Shouldn't this article be moved to Letter frequency per WP:SINGULAR? Leon math (talk) 21:47, 4 January 2009 (UTC)[reply]

From what it can be read, it might be either because English people do not know how to use a keyboard, or because dictionary spelling is not enforced. — Preceding unsigned comment added by 86.75.160.141 (talk) 22:29, 27 October 2012 (UTC)[reply]

French ë?

Why is the ë 0.000% instead of 0? Is there only one word in their entire language with it? And is it only used in certain times of year? (Noël) Is Noël the only example? Is it 0.000% because the percentage is less than 0.000% but greater than 0? Uber-Awesomeness (talk) 19:49, 10 February 2009 (UTC)[reply]

Off the top of my head, I can think of "canoë", "contiguë", "ambiguë", "noël", "exiguë", "ciguë", "aiguë", and "Israël", "Staël", "Saint-Saëns". We also use "ü" in "capharnaüm", "Saül", "Esaü". And of course, we use lots of "ï" as in "ambiguïté", "exiguïté", "égoïste", "aïeul", "glaïeul", "haïr", "maïs", "coïncider", "inouï", etc... —Preceding unsigned comment added by 84.72.92.4 (talk) 00:58, 16 April 2009 (UTC)

Diacritics in the English language

Why not include English diacritics in the statistics, such as those in the English terms with diacritical marks article? — Preceding unsigned comment added by 86.75.160.141 (talk) 22:23, 27 October 2012 (UTC)

I added those that I could find examples for. It would be good to regenerate the stats using a better dataset. I've seen it spelled "naïve" so many times that I think if you're an adult English reader you're almost required to know it and the frequency table doesn't reflect that reality. Akeosnhaoe (talk) 04:48, 19 November 2020 (UTC)[reply]

Letter Frequency Not Reliable Enough?

I do not think that the English letter frequency on this page is very reliable. The source (http://pages.central.edu/emp/LintonT/classes/spring01/cryptography/letterfreq.html) only counted 15000 characters, which is not really enough to get a good picture of letter frequency. It only uses three documents, all of which are relatively specialized and so do not accurately represent letter frequency in the language as a whole.

Additionally, letter frequency from Project Gutenberg alone (http://en.wikipedia.org/w/index.php?title=Letter_frequencies&diff=48178370&oldid=45375638) is unreliable, as it only contains certain types of text and styles of writing. A combination of styles is necessary to be truly reliable. —Preceding unsigned comment added by Humanperson0 (talkcontribs) 19:38, 28 June 2009 (UTC)[reply]

(http://letterfrequency.org/) cites letter frequency as "e t a o i n s r h l d c u m f p g w y b v k x j q z". I do not know how reliable that site is, but my own study of letter frequency (http://mtgap.bilfo.com/letter_frequency.html) – in which I used 750,000 characters and a variety of writing styles – came up with the same result so it's probably pretty reliable. Also, the site includes many different types of frequency (word frequency, letter frequency in religious writings, etc) which points towards the study being thorough.

I do not have the skill necessary to completely modify the English section of the letter frequency page, but I recommend that it be seriously modified. The letter frequency that is currently used is unreliable.

Humanperson0 (talk) 19:24, 28 June 2009 (UTC)[reply]

I don't think we should be doing our own frequency counts. That would be original research. We should be referring to - and citing - frequency counts which we can properly document via reliable sources. Yes, different sources have different frequency counts, but different kinds of text have different letter frequencies. We shouldn't try to hide that.
--jdege (talk) 12:34, 30 June 2009 (UTC)[reply]

As someone who's interested in these frequencies in order to teach someone Braille, I'd like to point out that the deaf and blind folks have a slightly different ordering. Maybe it's worth it to mention, although, to be fair, they don't cite their sources either: http://www.deafandblind.com/word_frequency.htm for what it's worth Phillipkwood (talk) 22:39, 10 July 2009 (UTC)[reply]

The English letter frequency counts in the article from http://pages.central.edu/emp/LintonT/classes/spring01/cryptography/letterfreq.html (at the top of the webpage) as well as the bigram and trigram frequencies listed at the bottom of the webpage are from Cryptological Mathematics by Robert Lewand, pages 36 and 37 respectively (google books link:[5]) and are NOT based on the webpage author Tom Linton's own 15000 character sample. (Lewand's book is approaching letter statistics from the perspective of cryptanalysis of messages with all spaces stripped, which gives the odd distribution of trigrams: some like "edt" and "sth" are found mostly across word boundaries.) It seems more appropriate to cite Lewand's book rather than Linton's website for the numbers used in the article so as to avoid any confusion with Linton's own low-quality dataset. I'm not sure what corpus Lewand used, but it must have been much larger than 15,000 characters. Alternatively, if we could find a reliable source that actually states what corpus they used, that could be better. --Speight (talk) 02:36, 4 August 2009 (UTC)[reply]

My method: I took all the public word lists from magneticpoetry.com except for Yiddish, as well as the words from Euro Magnets, Bumper Magnets, Amusing Magnets and Bold Words, and got this letter frequency from 107685 letters: EOATIRSNHLUDMCYWGFBPKVJZXQ. 2A01:119F:21E:4D00:B93E:76E5:CA3E:4F14 (talk) 09:27, 17 March 2018 (UTC)[reply]

Toki Pona

I must question whether Toki Pona — a constructed language of recent origin, with only a few dozen users, and intentionally designed not to be of general use as an international auxiliary language — is sufficiently notable to merit inclusion in the "Relative frequencies of letters in other languages" table. Comments? Richwales (talk · contribs) 06:11, 22 December 2010 (UTC)

There is absolutely no excuse for the obscure Toki Pona to be in this article. The language itself is not really notable enough to deserve an article (I'm shocked it has one). If the goal is to include a made-up language, I'd suggest Pig Latin. If the goal is to include a language with a very different distribution, I'd suggest Hawaiian. If it weren't such a pain to edit tables on Wikipedia, I would have ripped Toki Pona out already. Perhaps someone knows of an easier way to edit tables. RoyLeban (talk) 22:06, 9 September 2011 (UTC)[reply]

There is already a "made-up" language in the article, Esperanto. This designed language is in use by many people all over the world, so covering another made up language by including Toki Pona or Pig Latin isn't necessary unless those languages have interest in their own right. Decorian (talk) 13:12, 28 September 2011 (UTC)

My impression

I have always understood that - although, admittedly, there is dispute about this - the most common letters in English by popular consent are - in decreasing order of frequency - ETAONRISH

I am not sure what comes next in the list, but it would probably be D followed by U or L. ACEOREVIVED (talk) 17:22, 24 March 2011 (UTC)[reply]

Isn't it EOATIRSNHLUDMCYWGFBPKVJZXQ? 2A01:119F:21E:4D00:B93E:76E5:CA3E:4F14 (talk) 09:20, 17 March 2018 (UTC)[reply]
Any "isn't it..." question is inviting the answer "nope". Any definitive version will be "wrong" according to someone, because we haven't even defined what "frequency of letter use" actually means. We're not going to get a consensus, because we're not even arguing about the same thing.
Many years ago I read that it was ETAOIN SHRDLU (in several places, but one was Gödel, Escher, Bach: an Eternal Golden Braid). Martin Kealey (talk) 08:48, 26 April 2019 (UTC)[reply]

Universal translator?

Is there a source for letter frequency by language more generally? (That is, for languages not tabulated here.) It seems to me it would be useful to mention or link to, for people wanting more information. TREKphiler any time you're ready, Uhura 16:29, 13 April 2011 (UTC)[reply]

Strictly WP:OR

Entirely original research, but if anyone is seeking independent confirmation of (approximate) letter frequencies, I'd recommend taking a look at their computer keyboard. My netbook - a few months old - had a matt surface on the keys. The well-used ones are now glossy... AndyTheGrump (talk) 03:15, 8 June 2011 (UTC)[reply]

Introduction

At 7 paragraphs, the intro for this article appears to be excessive. Rather than deleting any content, perhaps some of the material could be moved into its own section or merged with others. I'd like to add a cleanup tag, called {{lead too long}}, to resolve this problem ... unless there are objections, or a specific reason for the verbosity? — VoxLuna  orbitland  22:11, 9 November 2011 (UTC)[reply]

Data miscopied from source?

Upon comparing the letter frequencies in the first table of this Wiki article against the source citation [4], I find that the frequencies for the three letters K, V, W do not agree with the source, as follows:

Letter    Freq in Wiki    Freq in [4]
K         0.747           0.772
V         1.037           0.978
W         2.365           2.360

As verification, the sum of the frequencies of all 26 letters of the alphabet should add up to 1. The frequencies listed in the Wikipedia article add up to 1.00038. The frequencies listed in the source [4] add up to .99999 (differs from 1 presumably due to roundoff). I went ahead and revised the numbers for these three letters in the Wiki table to match those of the source [4]. AlanSiegrist (talk) 15:31, 14 October 2012 (UTC)[reply]
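The figures in this comment are at least consistent with each other, which can be checked with exact decimal arithmetic. The sketch below assumes the table values are percentages and the two totals are fractions of 1, as the comment implies; the variable names are illustrative.

```python
from decimal import Decimal

# Values quoted in the comparison above, in percent.
wiki = {"K": Decimal("0.747"), "V": Decimal("1.037"), "W": Decimal("2.365")}
source = {"K": Decimal("0.772"), "V": Decimal("0.978"), "W": Decimal("2.360")}

# Net surplus of the article's figures over the source, in percentage points.
surplus_percent = sum(wiki[k] - source[k] for k in wiki)

# Gap between the two reported totals, converted from a fraction to percent.
gap_percent = (Decimal("1.00038") - Decimal("0.99999")) * 100

print(surplus_percent, gap_percent)
```

Both come out to 0.039 percentage points, so the three differing letters account for the whole gap between the two totals.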

Not sure what "[4]" source you were talking about last year, but the current article seems a little confusing, citing the 1982 Cipher Systems for the letter frequencies "listed below", then telling the reader that it differs from some Cornell table, then saying that the Concise Oxford dictionary includes some analysis, before finally saying that the Wikipedia list is using "this table" - sourced to some coder's personal website! The data seems to match the numbers on the coder's site, but he doesn't say where he got them from, only vaguely citing a 2000 textbook as "sources".
I've tried to clean this up by cutting the first statement as false, moving the second after the table, leaving the third and explaining the fourth. But what do we think is the best source to use here? --McGeddon (talk) 18:53, 19 July 2013 (UTC)[reply]

Tables do not add up to 100%

The sums of the letter frequencies for the languages English, French, German, Spanish, Portuguese, Esperanto, Italian, Turkish, Swedish, Polish, Dutch, Danish, Icelandic, and Finnish are currently 100.00%, 99.16%, 102.36%, 104.47%, 104.09%, 99.99%, 101.14%, 94.01%, 100.00%, 108.01%, 101.59%, 100.00%, 100.00%, and 100.00%, respectively. Whilst I understand that frequencies may not always add up to 100%, I cannot imagine why they should ever exceed 100%. I cannot avoid the conclusion that some letters are being counted twice. — Preceding unsigned comment added by 2.104.4.142 (talk) 14:08, 15 February 2014 (UTC)[reply]
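The column totals quoted in this comment can be screened mechanically. A small Python sketch that flags totals above or below 100%; the 0.05 tolerance for rounding is an arbitrary assumption, not something taken from the article or its sources:

```python
# Column totals (in percent) as quoted in the comment.
languages = ["English", "French", "German", "Spanish", "Portuguese", "Esperanto",
             "Italian", "Turkish", "Swedish", "Polish", "Dutch", "Danish",
             "Icelandic", "Finnish"]
totals = [100.00, 99.16, 102.36, 104.47, 104.09, 99.99,
          101.14, 94.01, 100.00, 108.01, 101.59, 100.00,
          100.00, 100.00]

TOLERANCE = 0.05  # hypothetical allowance for rounding error, in percentage points

# Totals above 100% suggest double counting; totals below suggest missing letters.
over = [lang for lang, total in zip(languages, totals) if total > 100 + TOLERANCE]
under = [lang for lang, total in zip(languages, totals) if total < 100 - TOLERANCE]

print("Over 100%: ", over)
print("Under 100%:", under)
```

With this tolerance, six columns (German, Spanish, Portuguese, Italian, Polish, Dutch) come out above 100% and two (French, Turkish) below.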

It's possible that the diacritics are being counted twice - checking a reference at random, the source for the Portuguese distribution differs slightly (it totals 100.01% versus this article's 104.09% - the .01% is presumably just a rounding error, although H, W and X have different values) and it says absolutely nothing about diacritics, so perhaps somebody just added them afterwards from a different source. --McGeddon (talk) 09:58, 21 February 2014 (UTC)[reply]
Looks like these edits from May/June 2013 might be the culprit - an IP editor added more diacritic rows and updated the stats for every language, somehow bumping them up to three decimal places of accuracy despite the sources only using two. Assuming good faith, this could have been somebody running their own analysis of corpus texts and getting a finer level of accuracy, but it's left us with a lot of data that doesn't match its sources, and apparently doesn't quite add up. --McGeddon (talk) 10:05, 21 February 2014 (UTC)[reply]

Apparently corrupted data

On the 5th through the 8th of July 2014, three IP addresses ([6], [7], [8]) made unsourced changes in the article, mainly in the table with data from other languages, amongst them Polish, Turkish and Dutch data. I cannot read or understand the original language for some of the data, but in what I can read I discovered (intentional) errors. Today I'll revert only those for the Dutch language to the source data. --VanBuren (talk) 15:05, 16 July 2015 (UTC)[reply]

Good catch. As the tag at the top of the table says, the numbers do not match the supposed sources. This is noticeable with the number of decimal places in the Dutch column, which are almost entirely given to only two decimal places in the cited source. We should change everything to match the sources. Sminthopsis84 (talk) 15:15, 16 July 2015 (UTC)[reply]

All of these languages use a similar 25+ character alphabet.

Why 25? This number seems a bit random. All aforementioned languages have 26 or more letters, the exception being Italian with only 21 official letters. 25 can't be the average either [French (26), German (26), Spanish (27), Portuguese (26), Esperanto (28), Italian (21), Turkish (29), Swedish (29), Polish (32), Dutch (26), Danish (29), Icelandic (32), Finnish (28), Czech (42)]. --147.142.185.206 (talk) 11:12, 5 March 2016 (UTC)[reply]
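The point about the average can be checked from the letter counts listed in this comment. A short Python sketch using only those counts (the dictionary name is illustrative):

```python
from statistics import mean

# Alphabet sizes as listed in the comment.
letter_counts = {
    "French": 26, "German": 26, "Spanish": 27, "Portuguese": 26,
    "Esperanto": 28, "Italian": 21, "Turkish": 29, "Swedish": 29,
    "Polish": 32, "Dutch": 26, "Danish": 29, "Icelandic": 32,
    "Finnish": 28, "Czech": 42,
}

average = mean(list(letter_counts.values()))
smallest = min(letter_counts, key=letter_counts.get)
below_25 = [lang for lang, n in letter_counts.items() if n < 25]

print(round(average, 2))                  # 28.64
print(smallest, letter_counts[smallest])  # Italian 21
print(below_25)                           # ['Italian']
```

On these figures 25 is neither the mean nor the minimum: the mean is about 28.6, and Italian is the only language below 25.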

So... what is the list of most frequent first letters?

Quoted directly from the article:

The first letter of an English word, from most to least common, s a c m p r t b f g d l h i e n o w u v j k q y z x.[16]

The first letter of an English word, from most to least common, t o a w b c d s f m r h i y e g l n p u j k[17]

So which is the correct list? 128.101.108.73 (talk) 20:57, 31 May 2016 (UTC)[reply]

I'm unsure, but the former may be a list of the most common words organized by word count. It might also refer to names and such, given the text above, which definitely refers to names. Still, 's' is almost certainly more common than eighth, right? The second one also does not list a few letters, but these are the quite rare 'v', 'q', 'x', & 'z', which might lend more credence to the source (the discrepancies are too insignificant), but at the same time might suggest that a small sample was used. I'm removing the former one, on the basis that it is probably about names, not words in text, but if anyone prefers the other, here is the source/code, with spaces in the < ref > tags. Alternatively, if someone has access to the source, could you please check if that has to do with names, and if so, add it in as such?
The first letter of an English word, from most to least common, s a c m p r t b f g d l h i e n o w u v j k q y z x.< ref name="ohlman" >Herbert Marvin Ohlman. "Subject-Word Letter Frequencies with Applications to Superimposed Coding". [9] Proceedings of the International Conference on Scientific Information (1959).</ ref >
--Blanket P.I. (talk) 23:33, 6 July 2016 (UTC)[reply]
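The two observations in this reply (which letters the second list omits, and where 's' ranks in it) can be verified directly from the quoted lists. A minimal Python sketch; the names `list_16` and `list_17` simply follow the citation numbers and are otherwise arbitrary:

```python
import string

# The two orderings quoted from the article, keyed by their citation numbers.
list_16 = "s a c m p r t b f g d l h i e n o w u v j k q y z x".split()
list_17 = "t o a w b c d s f m r h i y e g l n p u j k".split()

alphabet = set(string.ascii_lowercase)

# Letters absent from each list.
missing_from_16 = sorted(alphabet - set(list_16))
missing_from_17 = sorted(alphabet - set(list_17))

# 1-based rank of 's' in each list.
rank_s_16 = list_16.index("s") + 1
rank_s_17 = list_17.index("s") + 1

print(missing_from_16)       # []
print(missing_from_17)       # ['q', 'v', 'x', 'z']
print(rank_s_16, rank_s_17)  # 1 8
```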

Uppercase/lowercase Frequency

This article only addresses the frequency of letters without distinguishing whether they occur in lowercase or uppercase in the corpus of texts analyzed. For instance, in the books of Project Gutenberg (mentioned here), there certainly are enough words with uppercase letters - mainly at the beginning of a sentence - to make that distinction meaningful. In some applications (e.g. text compression), this information is indeed valuable. Since the article is titled "Letter frequency" and uppercase letters are different from lowercase letters, I'd argue that addressing this point belongs in this article. Havajsky (talk) 19:57, 30 March 2017 (UTC)[reply]

Michael N. Jones and D. J. K. Mewhort, “Case-sensitive letter and bigram frequency counts from large-scale English corpora” [10] (Behavior Research Methods, Instruments, & Computers, 2004, 36 (3), 388–396) is very informative. (Notably, uppercase E is relatively rare, with a frequency rank of 11th out of 26 uppercase letters, while lowercase e is by far the most frequent.) Kanou-h (talk) 09:56, 15 November 2023 (UTC)[reply]
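For anyone wanting to see what a case-sensitive count looks like in practice, here is a minimal Python sketch. It is only an illustration of the idea, not the method used by Jones and Mewhort; the function name and the sample sentence are made up, and only ASCII letters are counted.

```python
from collections import Counter

def case_sensitive_frequencies(text):
    """Return relative frequencies of ASCII letters, keeping 'E' and 'e' distinct."""
    letters = [ch for ch in text if ch.isascii() and ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {ch: n / total for ch, n in counts.items()}

# Hypothetical sample text; a real analysis would use a large corpus.
sample = "The cat sat. The dog ran. Every day ends."
freqs = case_sensitive_frequencies(sample)

# Uppercase and lowercase forms are reported separately.
for letter in ("T", "t", "E", "e"):
    print(letter, round(freqs.get(letter, 0.0), 3))
```

Folding case before counting (for example with `text.lower()`) would merge the pairs again, which is what the article's tables appear to do.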

Dutch é and ë

Though they are not added to the Dutch alphabet, é and ë are regularly used in writing Dutch (and ï less so*). For example, één in Dutch means "one" while een means "a". The letter é is also used to accentuate a letter e so the pronunciation is clear. The letter ë is used even more frequently in Dutch, for example in the words industriële, accentuële or artificiële. It's also frequently used for geographical locations like Belgium (België), Asia (Azië), Oceania (Oceanië), Tasmania (Tasmanië), Transylvania (Transsylvanië)... (*an example of ï: maïs or pinguïn). That being so, why are these letters not included in the list? Falco iron (talk) 11:05, 10 September 2017 (UTC)[reply]

Dutch frequencies of accented letters are incorrect

As mentioned above, accented letters do occur in Dutch. I went to the source and found a more recent one. They obviously removed accents before determining frequency, as the word "één" (1) does not occur on the list of 200 most frequent Dutch words from the same corpus, while twee, drie and vier (2, 3 and 4) all do occur on the list.

Unfortunately I have not been able to find a source of better data. BGPexpert (talk) 13:13, 19 February 2024 (UTC)[reply]
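To illustrate how stripping accents before counting would hide é and ë, here is a small Python sketch. It is an assumption about the kind of preprocessing that could produce this effect, not a description of what the source actually did; the word list is taken from examples mentioned on this page.

```python
import unicodedata
from collections import Counter

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Example words from this talk page; normalized to NFC so accented letters
# are single code points before counting.
words = ["\u00e9\u00e9n", "een", "Belgi\u00eb", "ma\u00efs"]  # één, een, België, maïs
raw = unicodedata.normalize("NFC", "".join(words))

with_accents = Counter(raw)
without_accents = Counter(strip_accents(raw))

print(strip_accents("\u00e9\u00e9n"))               # een
print(with_accents["e"], with_accents["\u00e9"])    # 3 2
print(without_accents["e"], without_accents["\u00e9"])  # 6 0
```

After stripping, "één" and "een" become the same word, which would explain why "één" cannot appear separately in a frequency list built this way.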

Morse

I think it would be interesting to explain that the order of Morse letters by length, e it san hurdm wgvlfbk opxcz jyq, puts o, l, and c in an odd place (far towards the end despite their high frequency) because the current International Morse code is actually a modification of the original version, where this would look like e t i alno ms cdfru ghkw bqvxyz jp.

I added a footnote explaining this but then was told that, without references, this might be considered original research so I undid it. However I'm opening this discussion to see if someone can add a reference to said fact or if this footnote can be safely added as is (since this information can be almost directly obtained from the image at American Morse code) without any reference.

Cousteau (talk) 00:33, 4 December 2017 (UTC)[reply]
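The International Morse grouping quoted in this comment can be reproduced from the code table. A Python sketch, assuming "length" means transmission time with the usual convention of 1 unit per dot, 3 units per dash, and 1 unit of silence between elements of a letter (that reading of "length" is an assumption; the American Morse ordering is not reproduced here):

```python
from itertools import groupby

# International Morse code for the 26 basic Latin letters.
MORSE = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".", "f": "..-.",
    "g": "--.", "h": "....", "i": "..", "j": ".---", "k": "-.-", "l": ".-..",
    "m": "--", "n": "-.", "o": "---", "p": ".--.", "q": "--.-", "r": ".-.",
    "s": "...", "t": "-", "u": "..-", "v": "...-", "w": ".--", "x": "-..-",
    "y": "-.--", "z": "--..",
}

def duration(code):
    """Length in time units: dot = 1, dash = 3, plus 1 unit between elements."""
    return sum(1 if symbol == "." else 3 for symbol in code) + (len(code) - 1)

# Sort by duration (ties broken alphabetically) and group equal durations.
ordered = sorted(MORSE, key=lambda ch: (duration(MORSE[ch]), ch))
groups = ["".join(g) for _, g in groupby(ordered, key=lambda ch: duration(MORSE[ch]))]

# Grouping quoted in the comment, with letters sorted inside each group
# so that only group membership is compared.
quoted = ["".join(sorted(g)) for g in "e it san hurdm wgvlfbk opxcz jyq".split()]

print(groups)
print(groups == quoted)  # True
```

The grouping matches the quoted one, with o, c and l falling in the later groups as described.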

Letter Frequency in Roots vs Text

The page reads "The first method, used in the chart below, is to count letter frequency in root words of a dictionary," so I take it that this is referring to Pavel Mička's list. However, the data for the chart on the right from Nandhp with the caption "Relative frequencies of letters in text" has exactly the same numbers. So which is it? Are these numbers for the roots or for average text?

195.191.126.1 (talk) 13:27, 29 December 2017 (UTC)[reply]

Vbgkjq

Which is right? vbgkjq or vbgkqj? 99.10.204.30 (talk) 14:09, 12 July 2018 (UTC)[reply]

English frequency column now adds up to 106%

It seems the frequencies for the letters 'w' and 'k' for English have been changed and the frequencies no longer add up to 100%. The total is now 106%. 'w' was previously 2.36 (now 5.37) and 'k' was 0.772 (now 3.872). Not sure who changed it or why, but if the new frequencies of 'k' and 'w' are more accurate, the other frequencies need to be adjusted to add up to 100% again. — Preceding unsigned comment added by Karl Napf der Abwaschbare (talk • contribs) 04:48, 8 December 2018 (UTC)[reply]
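The 106% figure follows from the two changed values alone. A brief Python sketch, assuming the column summed to exactly 100% before the change (the comment implies this but does not state it):

```python
from decimal import Decimal

# Assumption: the English column summed to 100% before the two values changed.
previous_total = Decimal("100")

# (old value, new value) in percent, as quoted in the comment.
changes = {
    "w": (Decimal("2.36"), Decimal("5.37")),
    "k": (Decimal("0.772"), Decimal("3.872")),
}

# Total increase in percentage points caused by the two edits.
increase = sum(new - old for old, new in changes.values())
new_total = previous_total + increase

print(f"{increase:.2f} percentage points")  # 6.11 percentage points
print(f"{new_total:.2f}%")                  # 106.11%
```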

Are accented letters in English actually 0.000%?

Accents are used in English occasionally. Most commonly, they appear in names, but there are also a few words that can optionally have accents or other marks, such as résumé (with one or both accents), naïve, façade, and a few others. Should the "frequency by language" chart be updated with very low nonzero frequencies for English, or does it round to zero? HotdogPi 13:26, 27 June 2019 (UTC)[reply]

I very much agree with HotdogPi (and many others) regarding the flaws of this article as it stands. It appears the biggest problem is a lack of citable resources with which to produce more informative infographics. I still propose that anyone interested in this topic find some resources on frequency analyses that rank the frequency of all Unicode characters in one of the corpora from the List of text corpora. Original research is not permitted on Wikipedia; however, anyone with sufficient skills and interest could also just publish their own research (so as to meet the requirements of citability/reliability, of course). See M. Jankowska, V. Kešelj and E. Milios, "Relative N-gram signatures: Document visualization at the level of character N-grams," 2012 IEEE Conference on Visual Analytics Science and Technology (VAST), Seattle, WA, 2012, pp. 103-112, doi: 10.1109/VAST.2012.6400484, as a potential starting point for further research. BlackPlatinumChowChow (talk) 15:52, 22 May 2020 (UTC)[reply]

'Relative frequency as the first letter of an English word' add up to 90%

The percentages in the table 'Relative frequency as the first letter of an English word' add up to 90%.

Should the percentages' sum be 100%?

Thank you, 2604:6000:7750:B800:ACBA:D86:9628:57D (talk) 05:33, 29 November 2020 (UTC)Scott[reply]

Etaoin Shrdlu equivalents

Why are the etaoin shrdlu equivalents listed as two groups of five letters, when etaoin shrdlu is two groups of six? If there is a reason for this, wouldn't it be useful to include this in the article? 84.92.90.18 (talk) 10:57, 27 May 2021 (UTC)[reply]

Inconsistent Uppercase/Lowercase

At present among the tables in this article, some use A B C D etc., others use a b c d etc. Besides being inconsistent in style, this may also mislead some readers about what is being counted: only upper case? only lower case? both actually?

I would like to update each table to follow the Aa Bb Cc Dd etc. paradigm seen in other articles, such as List of Latin-script letters and List of Cyrillic letters. - DuckMaestro (talk) 18:37, 7 August 2022 (UTC)[reply]

The sum of the freq % is not 100 for Hungarian letters

Summing up the letter frequencies for Hungarian letters gives 117.531% instead of the expected 100%. Milan Berta 07:55, 25 March 2023 (UTC)[reply]

 Done Dexxor (talk) 12:23, 25 March 2023 (UTC)[reply]

How could French k share the same frequency (0.074%) as the English z?

Considering that k is a foreign letter in French, appearing only in loanwords? This link reports an even higher frequency for k, wow! 129.104.241.242 (talk) 04:53, 3 June 2024 (UTC)[reply]