Wikipedia:Typo Team/moss/Archive
DNA
DNA sequences, like those in
Hmm, I will have to ask around on MoS or something. Thanks for finding that. -- Beland (talk) 01:38, 19 July 2018 (UTC)
- If not, we could make one, a template with at bare minimum
<span class="dna-sequence">{{{1}}}</span>
Do similarly for the poem structure patterns. We did this with trade designations for horticultural plants, and it has worked out well: {{tdes}}. Turns out the nomenclature authority requires them (in a scientific name) to be in a different font, so we used kerned monospace (it supports extra options, but that part was probably a bad idea). Anyway, here's all "Template:"-namespace pages with "dna" in their titles; here's those with "gene", in case there's already a template for this (I have not pored over them). — SMcCandlish ☏ ¢ 😼 14:48, 21 July 2018 (UTC)
- Oof, that has resulted in some pretty ugly plant name typography; I wish we hadn't followed the typographical conventions of that source. I do like the idea of a template, though - that would make it easy for anyone who is interested to find all of the DNA sequences on Wikipedia...which is a thing that could happen? I put your code in {{DNA sequence}} and applied that to this article; thanks for throwing that together! I'll ponder poem patterns a bit more. -- Beland (talk) 05:44, 25 July 2018 (UTC)
Fixed
- 549 – List of MicroWorlds Logo commands – ignore articles tagged {{Copy to \w+}} or {{Move to \w+}}
- Testing a fix for this on the next run (April 14 or later). -- Beland (talk) 21:29, 13 April 2018 (UTC)
- 1635 – wikt:nowrap – 100th United States Congress, etc. – HTML comments inside a table
- The problem was that the leading {| for the table was in a template, not the main article text. I added |- as an alternate start sequence for tables, so this should be fixed in the next run (April 14 or later). -- Beland (talk) 00:07, 14 April 2018 (UTC)
- Three-letter ISO language codes. These should be excluded if article titles like ISO 639:zzj are split on ":".
- Good catch. Improved the article title-splitting algorithm to split on all punctuation as well as whitespace. -- Beland (talk) 00:23, 14 April 2018 (UTC)
- 1 – 10P/Tempel – wikt:astrosurf – ignore text in "References" and "External links" sections
- Yeah, there's going to be a lot of obscure proper nouns and URL words down there. Fixed; I'm also ignoring "Bibliography" and "Further reading" sections. -- Beland (talk) 00:30, 14 April 2018 (UTC)
- 1 – 1 + 2 + 3 + 4 + ⋯ – wikt:prespacetime – word in a url
- Fixed by excluding "External links" sections. -- Beland (talk) 00:32, 14 April 2018 (UTC)
- 4 zjem – foreign language word, ignore words in italics
- I wrapped this with {{lang}}. -- Beland (talk) 00:49, 14 April 2018 (UTC)
- 1 – 1620 Geographos – wikt:graphos – part of name of asteroid
- I wrapped this with {{lang}}; it's a Greek root in the etymology. -- Beland (talk) 00:49, 14 April 2018 (UTC)
- 148 – wikt:rowspan – 16th United States Congress, 2004 Veikkausliiga, A11 (Croatia), Andalusian parliamentary election, 2008, British Columbia general election, 1882 ... find all – HTML comments inside tables?
- Hmm, a diversity of issues with table start/end syntax mixing with templates. Given the fix I just made for missing {|, I'll take another pass at this in a subsequent run. -- Beland (talk) 00:07, 14 April 2018 (UTC)
- I made some changes to handle stray HTML tags; I think this is fixed in the next run. -- Beland (talk) 02:15, 11 July 2018 (UTC)
- Direct quotes
- 1 – 0 (year) – wikt:fortysixth – in a direct quote – ignore passages inside ""
- 1 – 14 (number) – wikt:forteen – as above
- 1 – 108th Training Command (Initial Entry Training) – wikt:trainfire – ignore wording in quotation marks
- Hmm, we should spell-check quotations too (it's certainly possible to make a typo when copying), but we have plenty of other typos to fix before we get to the ones that require source-checking. I'll put these on a different output channel and just not post that until I implement some more complex handling (we should probably mark them with {{sic}} or {{typo}}). -- Beland (talk) 01:56, 11 July 2018 (UTC)
- 1 – 15th (Imperial Service) Cavalry Brigade – wikt:artilleryand – the string in question is artillery{{spaced ndash}}and (i.e. two words separated by a dash)
- It was the wrong symbol, this should be unspaced mdash ie:—. I have fixed it, but we don't need a spelling alert from this. Graeme Bartlett (talk) 04:05, 25 April 2018 (UTC)
- 1 – 1-(4-(Trifluoromethyl)phenyl)piperazine – wikt:pTFMPP – chemical names where first letter is not capitalized
- I've run across a few things where words in quotes are fixed with brackets, ex. improbabl[e], replac[es], etc. [Examples added. Tevildo (talk) 09:50, 29 April 2018 (UTC)]
- 1 - 12 oz. Mouse - wikt:unrave : (unrave[l])
- Fixed. -- John Broughton (♫♫) 04:04, 26 May 2018 (UTC)
- 1 - 340B Drug Pricing Program - wikt:tretch : ([s]tretch)
- 1 - '47 (magazine) - wikt:eople : ([p]eople)
- 1 - Zaat (novel) - wikt:enowned : ([r]enowned)
- 1 - Zachery Kouwe - wikt:appropriat : (appropriat[ing])
- The next run will ignore direct quotes, so this sort of thing will hopefully not be a problem. -- Beland (talk) 02:15, 11 July 2018 (UTC)
- Reference names, ex. wikt:grayone in 10th Scripps National Spelling Bee
- Ah, that was due to unexpected capitalization inside the HTML tag. I made the logic case insensitive so that shouldn't happen in the next run. -- Beland (talk) 02:23, 11 July 2018 (UTC)
- Items separated by · are considered one word, causing the supposed typo 52 testingRegressionCorrelationTerminologyParticipatory on social science
- Fixed in the next run. I'm not doing template substitution so there may be other similar utility templates lurking. -- Beland (talk) 06:04, 11 July 2018 (UTC)
- 48 fffffffunnyfearlessfemmefataleorverynaughtygurls
- I'll add Publications sections to the list of things to ignore. -- Beland (talk) 02:04, 19 July 2018 (UTC)
- Fixed in the next run. -- Beland (talk) 05:24, 25 July 2018 (UTC)
- Proper nouns that do not begin with a capital letter. These are generally Irish names, trademarks or stage names. Tevildo (talk) 18:29, 26 April 2018 (UTC)
- 1 - 11th century in Ireland - wikt:hAinmere
- 1 - 309 Road - wikt:iSITE
- 1 - 30th century BC - wikt:nGiall
- 1 - 32nd Young Artist Awards - wikt:yFury
- 1 - 35th century BC - wikt:nGiall
- 1 - 36 Cube - wikt:iParenting
- 1 - 4/4 (EP series) - wikt:illdude
- 1 - 45th NAACP Image Awards - wikt:influANTces
- 1 - 53A (band) - wikt:neoDominatrix
- 1 - 59 (number) - wikt:djTAKA
- 1 - 7 Angels 7 Plagues - wikt:xFor
- 1 - 7th Grade Civil Servant - wikt:workpointTV
- 1 - 95bfm - wikt:bCasts
- 1 - 9th Irish Film & Television Awards - wikt:tSaoil
- 1 - A10 Networks - wikt:vThunder
- 1 - Zach Boisjoly - wikt:iAMZJB
- 1 - Zach Veach - wikt:urTXT
- Hmm, if these don't have English Wikipedia articles, I think the only way to handle them is to wrap them in {{not a typo}} or maybe {{lang}} (for Irish names)? -- Beland (talk) 02:23, 11 July 2018 (UTC)
- All tagged. Some of these would have been ignored by new code anyway, since they're in quote marks or ignored sections. -- Beland (talk) 20:23, 25 July 2018 (UTC)
- 1 -
These are due to difficult-to-parse mixtures of tables and templates. ::sigh:: I think I can fix this in code. -- Beland (talk) 00:49, 19 July 2018 (UTC)
- 252 - wikt:colspan - 1901–02 Minnesota Golden Gophers men's basketball team, 1917 Norwegian Football Cup, 1917 in Norwegian football, 1918–19 Minnesota Golden Gophers men's basketball team, 1964–65 SV Werder Bremen season ... find all
- These should be ignored in the next run (20 July 2018 dump or later). -- Beland (talk) 22:28, 25 July 2018 (UTC)
- 88 - wikt:hillclimbs - 1955 Le Mans disaster, Adolf Brudes, Augusto Monaco, Austin-Healey Sebring Sprite, Autobianchi A112 ... find all
- Singular added to Wiktionary --Harmonicaplayer (talk) 13:27, 27 July 2018 (UTC)
- Hmm, it was deleted. I guess we need to see if it has been used by a diversity of editors here? -- Beland (talk) 19:57, 13 September 2018 (UTC)
- I've re-added it, this time with enough quotations of literature to demonstrate that it meets Wiktionary's criteria for inclusion. I suspect it was deleted the first time around because it's uncommon and many instances on e.g. Google Books are scannos of "hill climb" or the like, and Harmonicaplayer, while often helpful, is also a longtime jokester. ;) It does appear to be a valid word in the specific context of competitively racing cars up hills. -sche (talk) 17:21, 14 September 2018 (UTC)
- Excellent; thanks for your efforts! -- Beland (talk) 19:09, 20 September 2018 (UTC)
- 81 - wikt:gholas - Bene Tleilax, Chani, Duncan Idaho, Dune (franchise), Emperor: Battle for Dune ... find all
- ghola is either related to West Bengal, Pakistan, Afghanistan, or related to the Dune universe; Wiktionary does not have either wikt:ghola or wikt:gholas;
- 24 - "gholas" : of 24 matches only one (Hasnabad (community development block)) is not from the Dune universe
- 427 - "ghola"
- 59 - "ghola" -"bengal"
- of 59, only 7 are not about the Dune universe: Ghoul, Prem Pujari/List of songs recorded by Kishore Kumar (a song title) Mount Paiko/Kharkoo (places) Bogeyman List of rampage killers/List of rampage killers (familicides) (a town)
- So this is the plural of a word that is most often a made-up term from the Dune universe, not exactly ready for Wiktionary! What to do? Shenme (talk) 19:12, 28 August 2018 (UTC)
- Ah, we have a redirect from ghola; I can add redirects to the exclusion list. I'll have to be careful of those with {{R from misspelling}} and variations, and we'll have to go through all untagged redirects and tag those that are also misspellings. (In the end, I think all redirects will be tagged; categorizing them helps projects decide whether or not they are worthy for inclusion in a print version or CD, etc.) -- Beland (talk) 19:53, 13 September 2018 (UTC)
- Oh, redirects are already included in the dictionary. I just created a redirect from gholas, so this should be ignored on the next run. -- Beland (talk) 04:43, 24 September 2018 (UTC)
- 76 - wikt:phyllon - Acleris phyllosocia, Antispila oinophylla, Arsenate minerals, Arsenite minerals, Arthrochilus stenophyllus ... find all
- I've added it because it is an English word, but most or all of the uses above are of Greek and should probably be language-tagged; compare tetartos. -sche (talk) 01:05, 30 August 2018 (UTC)
- 57 - wikt:decisioned - Abner Cotto, Ada Vélez, Al Hostak, Amanda Serrano, Beneil Dariush ... find all
- Infinitive added to Wiktionary --XY3999 (talk) 10:28, 31 August 2018 (UTC)
- Added to Wiktionary --XY3999 (talk) 07:59, 2 September 2018 (UTC) And expanded out of Wikipedia Graeme Bartlett (talk) 12:37, 2 September 2018 (UTC)
- 40 - wikt:circrnas - Circular RNA, Circular RNA (circRNA) databases and resources, Competing endogenous RNA (ceRNA) databases and resources, RMBase (RNA Modification Base), RNA ... find all
- plural of circRNA, which is short for Circular RNA. Darylgolden(talk) Ping when replying 13:35, 4 September 2018 (UTC)
- I've added it to Wiktionary as wikt:circRNA, wikt:circRNAs. (For good measure I also redirected those titles here on WP to Circular RNA.) -sche (talk) 19:43, 21 September 2018 (UTC)
- 28 - wikt:hylaxes - List of mammals of Algeria, List of mammals of Angola, List of mammals of Botswana, List of mammals of Cameroon, List of mammals of Chad ... find all
- Although "hylax" appears to be a real word in Taxonomic Latin in reference to butterflies, it appears to be a persistently copy-pasted typo of "hyrax" in these articles. (It does appear to have existed in the past as an uncommon and now archaic spelling of "hyrax", but it seems unsuitable for use in articles.) -sche (talk) 18:46, 14 September 2018 (UTC)
- I've changed all instances of "hylaxes" to "hyraxes". Instances of "hylax" should also be looked at (some are valid, in re butterflies; some are not, in re hyraxes). -sche (talk) 19:04, 14 September 2018 (UTC)
- 28 - wikt:junebugs - Acoma (beetle), Coenonycha, Dichelonyx, Diplotaxini, Fossocarus ... find all
- A less common but apparently valid spelling of June bug. Added to Wiktionary, but someone may wish to change entries here to the spelling June bug that is about 15x more common. -sche (talk) 16:41, 15 September 2018 (UTC)
- 1 - Aasu, American Samoa - wikt:oloaufou → false positive after apostrophe - Polynesian names often contain an apostrophe
- Well, I created A’oloaufou as a redirect. Based on what is documented at ʻokina I think this is supposed to be an ’eta because this is Tahitian, but it doesn't have a separate Unicode point from ’. I definitely don't want to parse quote marks as letters but maybe I can combine things on either side for spell-check dictionary lookup purposes. -- Beland (talk) 00:54, 26 July 2018 (UTC)
- In order to fix a problem with Persian transliteration I had to allow U+2019 (right single quote mark) inside words, so this sort of thing will be ignored in the future as a proper noun even if there isn't a redirect or Wikipedia article with the exact name. -- Beland (talk) 04:51, 24 September 2018 (UTC)
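The title-splitting fix discussed above (splitting article titles on all punctuation as well as whitespace, so that "zzj" from ISO 639:zzj counts as a known word) might look roughly like this. It is only a minimal sketch: the function name `title_words` and the regular expression are assumptions made for illustration, not the actual moss code.

```python
import re


def title_words(title: str) -> set[str]:
    """Split an article title into words on whitespace and punctuation.

    Hypothetical helper: any run of non-word characters (or underscores)
    is treated as a separator, so "ISO 639:zzj" yields "ISO", "639", "zzj".
    """
    # \W is Unicode-aware for str patterns, so accented letters stay in words.
    parts = re.split(r"[\W_]+", title)
    return {part for part in parts if part}


if __name__ == "__main__":
    print(sorted(title_words("ISO 639:zzj")))
    print(sorted(title_words("10P/Tempel")))
```

Words collected this way could then be added to the set of strings the spell checker treats as known, which is the effect described in the thread.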
Notes from Apr 2018
- 1 - 3,4-Ethylidenedioxyamphetamine - wikt:isopropylidinedioxy → chemical name fragment
- I checked with a chemist, and this was slightly misspelled. -- Beland (talk) 00:10, 26 July 2018 (UTC)
- 1 - 49th Quartermaster Group - wikt:counterpotente → Heraldry term
- Needs Wiktionary entry. -- Beland (talk) 00:10, 26 July 2018 (UTC)
- Added. Great find. It and potente are apparently less-common spellings of (counter)potenty, which Wiktionary also lacked until now. -sche (talk) 04:44, 27 July 2018 (UTC)
- 1 - 59th Ariel Awards - wikt:plainrowheaders → false alarm with table class="wikitable plainrowheaders"
- Hmm, not sure why that happened. We'll see if it happens again on the next run. -- Beland (talk) 00:10, 26 July 2018 (UTC)
- The 2018-07-20 run didn't list this as an error, yay! -- Beland (talk) 23:34, 26 July 2018 (UTC)
- 1 - 89th Academy Awards - wikt:orchestrals → false positive in quote
- Actually, I think this should have a Wiktionary entry. -- Beland (talk) 00:10, 26 July 2018 (UTC)
- Added. -sche (talk) 04:44, 27 July 2018 (UTC)
Poems
These are patterns used to describe poetry. Not sure they are appropriate for Wiktionary; if not, I will whitelist them. -- Beland (talk) 00:49, 19 July 2018 (UTC)
- There may be a better and even conventionally marked-up way to represent these. Check poetry sources? Maybe they're done as c-d-c-d or whatever. — SMcCandlish ☏ ¢ 😼 14:38, 21 July 2018 (UTC)
- 129 - wikt:cdcd - Anne Locke, Poetry, Rhyme, Shakespeare's sonnets, Sonnet ... find all
- 128 - wikt:efef - Anne Locke, Ilse Kokula, Mariana (poem), Ode to Psyche, Poetry ... find all
Oh, there are lots more where that came from. Maybe these should be tagged or maybe I can fix in code with a pattern recognizer or something. I'll have to ponder. -- Beland (talk) 01:42, 19 July 2018 (UTC)
- 1 - wikt:abaaba - List of ideophones in Basque
- 1 - wikt:ababababcd - Christis Kirk on the Green
- 1 - wikt:ababababcdcd - Middle English Metrical Paraphrase of the Old Testament
- 1 - wikt:ababababcdddc - The Knightly Tale of Gologras and Gawain
- 1 - wikt:ababbacc - Ottava rima
- 1 - wikt:ababbcac - Summum Bonum (poem)
- 1 - wikt:ababbccb - The Flyting of Dumbar and Kennedie
- 1 - wikt:ababbccdcd - Ballade (forme fixe)
- 1 - wikt:ababc - Ballad of Eric
- 1 - wikt:ababccab - Sapphic stanza in Polish poetry
- 1 - wikt:ababccbdd - Spenserian stanza
- 1 - wikt:ababccc - Nachtlied (Reger)
- 1 - wikt:ababcdcdefefgg - Bright star, would I were steadfast as thou art
- 1 - wikt:ababcdcdeffeef - Ode to Psyche
- 1 - wikt:abacabadabacaba - Abacaba pattern
- 1 - wikt:abacabadabacabaeabacabadabacaba - Abacaba pattern
- 1 - wikt:abacabadabacabaeabacabadabacabafabacabadabacabaeabacabadabacaba - Abacaba pattern
- 1 - wikt:abbaab - Sestain
- 1 - wikt:abbacdcdee - The Sun Rising (poem)
- 1 - wikt:abcabc - Sestain
- 1 - wikt:abcabdcefedf - Sestina
- 1 - wikt:abcbba - The Sunlight on the Garden
- 1 - wikt:ababbcc - A Lover's Complaint
From longest:
- 63 abacabadabacabaeabacabadabacabafabacabadabacabaeabacabadabacaba
- 48 abaabaababaaabaabababaabaabaababaabaaababaabaaab
I think these should be capitalized or enclosed in quotes, either of which would prevent them from showing up here as spelling errors. I started a discussion at Wikipedia talk:Manual of Style § Rhyme scheme patterns. -- Beland (talk) 22:26, 25 July 2018 (UTC)
- Continued at Wikipedia:Typo Team/moss#Repeating patterns. -- Beland (talk) 02:04, 17 August 2018 (UTC)
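For the "pattern recognizer" idea floated above, here is one possible heuristic, offered as a sketch only. The function name `looks_like_rhyme_scheme` and the rule itself (all-lowercase ASCII letters, drawn from a contiguous run of the alphabet such as a-g or c-d, with at least one letter repeated) are assumptions for illustration. Real English words such as "deed" also satisfy the rule, so it would only make sense for strings that have already failed the dictionary lookup.

```python
def looks_like_rhyme_scheme(word: str, min_len: int = 4) -> bool:
    """Heuristically decide whether a string resembles a rhyme-scheme pattern.

    Assumed rule (hypothetical, not taken from moss): lowercase ASCII letters
    only, the distinct letters form a contiguous alphabet run, and at least
    one letter occurs more than once.
    """
    if len(word) < min_len:
        return False
    if not (word.isascii() and word.isalpha() and word.islower()):
        return False
    letters = sorted(set(word))
    # Contiguous run: the span from first to last letter has no gaps.
    contiguous = ord(letters[-1]) - ord(letters[0]) + 1 == len(letters)
    has_repeat = len(letters) < len(word)
    return contiguous and has_repeat


if __name__ == "__main__":
    for candidate in ["cdcd", "efef", "ababcdcdefefgg", "hillclimbs", "gholas"]:
        print(candidate, looks_like_rhyme_scheme(candidate))
```

Capitalized forms are rejected on purpose, since the thread notes that capitalized or quoted patterns would not be reported as spelling errors in the first place.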
Notes from Jan 2019
- 1 - Babak Hamidian - wikt:fakhrmousavi - this good enough? "fakhrmousavi" -> "Fakhrmousavi"
- @Shenme: Yes, that fixes it; capitalized words are assumed to be proper nouns and are ignored. -- Beland (talk) 05:28, 24 September 2018 (UTC)
- 1 - Babakotia - wikt:antipronograde - wikt has wikt:pronograde, not the 'anti-' tho; see definitions antipronograde and pronograde
- 1 - Côn Sơn Island - wikt:oversped - When I corrected this one, a user on my talk page said it was a niche term used for "over-powering" a motor. References included at my talk page –eggofreasontalk 15:57, 20 December 2018 (UTC)
Statistics
2018-04 to 2018-09
Misspellings per article | 2018-04-01 dump moss 4933ad4 | 2018-07-01 dump moss 4933ad4 | 2018-07-20 dump moss 5e6b2ce | 2018-08-01 dump moss 0f7ddbf | 2018-08-20 dump moss 032a6be | 2018-09-01 dump moss 816c025 | 2018-09-20 dump moss 7e26fe6 | Total change (to 2018-09-20) |
---|---|---|---|---|---|---|---|---|
0 | 4839889 | 4910541 (+70652) | 4948698 (+38157) | 4956727 (+8029) | 4975895 (+19168) | 4986531 (+10636) | 5066713 (+80182) | (+226824) |
1 | 319509 | 319315 (-194) | 315926 (-3389) | 312871 (-3055) | 311641 (-1230) | 309785 (-1856) | 268592 (-41193) | (-50917) |
2 | 104405 | 104591 (+186) | 90630 (-13961) | 89861 (-769) | 89701 (-160) | 89286 (-415) | 71105 (-18181) | (-33300) |
3 | 40270 | 40099 (-171) | 38430 (-1669) | 37891 (-539) | 37832 (-59) | 37669 (-163) | 29796 (-7873) | (-10474) |
4 | 22793 | 22739 (-54) | 21069 (-1670) | 20900 (-169) | 20909 (+9) | 20859 (-50) | 16180 (-4679) | (-6613) |
5 | 13355 | 13331 (-24) | 12561 (-770) | 12392 (-169) | 12357 (-35) | 12315 (-42) | 9483 (-2832) | (-3872) |
6 | 9398 | 9422 (+24) | 8700 (-722) | 8620 (-80) | 8625 (+5) | 8574 (-51) | 6411 (-2163) | (-2987) |
7 | 6599 | 6614 (+15) | 6150 (-464) | 6095 (-55) | 6098 (+3) | 6076 (-22) | 4573 (-1503) | (-2026) |
8 | 5314 | 5312 (-2) | 4854 (-458) | 4832 (-22) | 4812 (-20) | 4839 (+27) | 3474 (-1365) | (-1840) |
9 | 3992 | 3985 (-7) | 3723 (-262) | 3643 (-80) | 3665 (+22) | 3631 (-34) | 2640 (-991) | (-1352) |
10-19 | 16753 | 16879 (+126) | 15508 (-1371) | 15437 (-71) | 15497 (+60) | 15458 (-39) | 10260 (-5198) | (-6493) |
20-29 | 4997 | 4992 (-5) | 4597 (-395) | 4594 (-3) | 4524 (-70) | 4512 (-12) | 2596 (-1916) | (-2401) |
30-39 | 2169 | 2211 (+42) | 1976 (-225) | 1962 (-14) | 1934 (-28) | 1929 (-5) | 1011 (-918) | (-1158) |
40-49 | 1177 | 1205 (+28) | 1061 (-144) | 1061 (0) | 1031 (-30) | 1027 (-4) | 525 (-502) | (-652) |
50-59 | 674 | 695 (+21) | 619 (-74) | 618 (-1) | 560 (-58) | 553 (-7) | 296 (-257) | (-378) |
60-69 | 453 | 476 (+23) | 420 (-56) | 419 (-1) | 378 (-41) | 377 (-1) | 179 (-198) | (-274) |
70-79 | 299 | 326 (+27) | 243 (-83) | 241 (-2) | 214 (-27) | 218 (+4) | 119 (-99) | (-180) |
80-89 | 213 | 218 (+5) | 179 (-39) | 181 (+2) | 177 (-4) | 179 (+2) | 81 (-98) | (-132) |
90-99 | 140 | 153 (+13) | 131 (-21) | 126 (-5) | 126 (0) | 128 (+2) | 61 (-67) | (-79) |
100-199 | 456 | 521 (+65) | 434 (-87) | 435 (+1) | 416 (-19) | 414 (-2) | 196 (-218) | (-260) |
200-299 | 90 | 113 (+23) | 93 (-20) | 95 (+2) | 91 (-4) | 96 (+5) | 44 (-52) | (-46) |
300-399 | 44 | 45 (+1) | 41 (-4) | 42 (+1) | 41 (-1) | 42 (+1) | 27 (-15) | (-17) |
400-499 | 19 | 26 (+7) | 21 (-5) | 22 (+1) | 18 (-4) | 18 (0) | 9 (-9) | (-10) |
500-599 | 12 | 13 (+1) | 13 (0) | 13 (0) | 16 (+3) | 16 (0) | 7 (-9) | (-5) |
600-699 | 8 | 9 (+1) | 9 (0) | 8 (-1) | 6 (-2) | 7 (+1) | 2 (-5) | (-6) |
700-799 | 2 | 3 (+1) | 3 (0) | 5 (+2) | 5 (0) | 5 (0) | 1 (-4) | (-1) |
800-899 | 2 | 3 (+1) | 3 (0) | 3 (0) | 3 (0) | 2 (-1) | 0 (-2) | (-2) |
900-999 | 6 | 7 (+1) | 6 (-1) | 6 (0) | 5 (-1) | 5 (0) | 0 (-5) | (-6) |
1000-1999 | 25 | 27 (+2) | 27 (0) | 26 (-1) | 24 (-2) | 23 (-1) | 0 (-23) | (-25) |
2000-2999 | 3 | 5 (+2) | 5 (0) | 5 (0) | 5 (0) | 3 (-2) | 0 (-3) | (-3) |
4000-4999 | 0 | 2 (+2) | 2 (0) | 2 (0) | 3 (+1) | 1 (-2) | 0 (-1) | (0) |
Parse failed | 193671 | 194777 (+1106) | 191813 (-2964) | 195147 (+3334) | 203588 (+8441) | 203583 (-5) | 201420 (-2163) | (+7749) |
The spell checker has been getting smarter over time, so more recent versions report fewer false alarms. This explains most of the drop in the number of possible typos reported. Most of the gains for pages with more than 100 possible typos are due to changes that ignore pages with {{cleanup}} and similar tags, which indicate the page may not be ready for spell checking. I have been specifically tagging pages with a high number of possible typos to bring them to the attention of interested editors. Pages tagged for cleanup are reported in the statistics of cleanup-related work queues.
Some variation in the number of typos fixed between runs is also explained by the differences in the amount of time between runs. The biggest sources of variance are the unusually long time between the first two runs and the fact that dumps snapshotted on the first day of the month (which have a lot of additional data the spell checker doesn't need) take longer for Wikimedia servers to generate than the dumps snapshotted on the twentieth day of the month. There is also considerable activity from other editors writing new material and correcting typos as they find them while reading or editing articles.
moss project participants have been correcting hundreds or thousands of typos per month (yay!) mostly in articles with a single typo. We have also been adding somewhere from handfuls to dozens of entries to Wiktionary a month. Looking only at the generated reports, these numbers are difficult to separate from the other changes in data and code, but we do see progress as we strike through or remove items from the todo lists.
Since figuring out which words are not typos is such a big part of the problem to be solved, the code may need to get smarter in the future, but we're probably going to have an upcoming period of relative stability as we work through some low-hanging fruit. Hopefully upcoming statistics will reflect progress in actually reducing typos more than changes in spell checker code. -- Beland (talk) 18:20, 12 October 2018 (UTC)
2018-09 to 2019-03
At least 10% of possible typos reported in the old statistics are definitely misspellings, but it's unclear how many of the remaining 90% are. Below is a new way of breaking down possible typos, by type instead of count per article. The "T1" items are almost all typos, and those are what we've been working on in the main "by article" section. Some of the other types have their own reports on this page, but most will require further analysis to either automatically distinguish typos vs. legitimate strings, or produce a more useful report for human editors.
Reporting symbol | Explanation | Instances/Unique strings, 2018-09-20 dump (7e26fe6) | Instances/Unique strings, 2018-10-20 dump (7649023) | Instances/Unique strings, 2018-11-01 dump (0aa8575) | Instances/Unique strings, 2018-12-20 dump (03be966) | Instances/Unique strings, 2019-01-20 dump (1bcf51c) | Instances/Unique strings, 2019-02-01 dump (c6ce3ab) | Instances/Unique strings, 2019-03-01 dump (ff8b9d2) | Instances/Unique strings, 2019-03-20 dump (692642d) |
---|---|---|---|---|---|---|---|---|---|
TS | Missing or extra whitespace or dash (or new compound) | 152985/84720 | 194758/114535 | 194711/114518 | 195044/114675 | 192811/114167 | 193752/114734 | 191701/113928 | 183795/109989 |
T1 | Edit distance 1 from common English word | 111429/70527 | 104280/68352 | 103043/67652 | 96081/64513 | 89549/61018 | 89355/60879 | 83353/57483 | 75941/53339 |
T2 | Edit distance 2 from common English word | 82638/53517 | 81793/53146 | 81721/53191 | 81536/53093 | 81170/52980 | 82727/53945 | 81410/53326 | 72093/47849 |
T3 | Edit distance 3 from common English word | 91844/61332 | 90769/60713 | 90778/60760 | 90382/60574 | 89841/60397 | 91893/61566 | 90328/60825 | 79609/54610 |
T4 | Edit distance 4 from common English word | 76336/52684 | 75139/52090 | 75006/52101 | 74757/51828 | 74536/51752 | 76323/52938 | 75335/52296 | - |
T5 | Edit distance 5 from common English word | 52071/36450 | 50970/35807 | 50882/35812 | 50614/35649 | 50571/35624 | 51785/36446 | 50852/36022 | - |
T6 | Edit distance 6 from common English word | 30437/21927 | 29755/21481 | 29704/21478 | 29490/21302 | 29440/21280 | 30134/21759 | 29685/21506 | - |
T7 | Edit distance 7 from common English word | 15392/11095 | 14972/10854 | 14977/10858 | 14858/10736 | 14765/10698 | 15153/10939 | 14929/10790 | - |
T8 | Edit distance 8 from common English word | 7138/5060 | 6966/4936 | 6970/4947 | 6911/4902 | 6863/4881 | 6967/4959 | 6811/4886 | - |
T9 | Edit distance 9 from common English word | 2450/1868 | 2383/1823 | 2380/1822 | 2349/1822 | 2348/1819 | 2407/1867 | 2386/1848 | - |
T10 | Edit distance 10 from common English word | 1027/721 | 987/705 | 986/706 | 995/702 | 978/697 | 992/708 | 960/693 | - |
T11 | Edit distance 11 from common English word | 399/324 | 390/317 | 389/316 | 380/312 | 378/309 | 386/315 | 388/316 | - |
T12 | Edit distance 12 from common English word | 122/105 | 119/102 | 119/102 | 120/103 | 117/101 | 118/101 | 118/101 | - |
T13 | Edit distance 13 from common English word | 44/29 | 44/29 | 44/29 | 44/29 | 45/30 | 45/30 | 45/30 | - |
T14 | Edit distance 14 from common English word | 15/13 | 14/12 | 14/12 | 13/11 | 1/1 | 6/5 | 5/5 | - |
T15 | Edit distance 15 from common English word | 1/1 | 1/1 | 1/1 | 0/0 | 1/1 | 0/0 | 0/0 | - |
T16 | Edit distance 16 from common English word | 2/2 | 0/0 | 0/0 | 0/0 | 0/0 | 1/1 | 1/1 | - |
R | A-Z only, not near a common English word | 168446/121107 | 165841/119452 | 165960/119619 | 165403/119208 | 165091/119086 | 169103/121936 | 166235/120111 | 101178/77389 |
I | Letters with accents or mixed with punctuation (other than hyphen) | 266937/143960 | 261310/144833 | 261653/145040 | 263654/145754 | 263679/146027 | 275444/153887 | 229579/149303 | 93902/70014 |
W | Not in English Wiktionary, in non-English Wiktionary | - | - | - | - | - | - | - | 82548/48389 |
L | Probable Romanization (transLiteration) | - | - | - | - | - | - | - | 4294/2610 |
ME | Probable coMpound, English | - | - | - | - | - | - | - | 51279/33301 |
MI | Probable coMpound, non-English (International) in English Wiktionary | - | - | - | - | - | - | - | 194949/133055 |
MW | Probable coMpound, found in non-English Wiktionary | - | - | - | - | - | - | - | 51656/36961 |
ML | Probable coMpound, transLiteration | - | - | - | - | - | - | - | 4010/2791 |
C | Chemistry words | 6581/4604 | 6597/4619 | 6613/4629 | 6631/4638 | 6633/4624 | 6618/4618 | 6637/4625 | 1853/1399 |
D | DNA sequences (a, c, g, t) | 51/18 | 15/3 | 16/4 | 16/4 | 15/3 | 15/3 | 2/2 | 0/0 |
N | A-Z plus numbers and hyphens | 25061/20114 | 25728/20854 | 25702/20846 | 25748/20899 | 25582/20737 | 26201/21255 | 25969/21130 | 26620/21685 |
P | Patterns (e.g. rhyme schemes) | 808/461 | 796/484 | 790/484 | 778/478 | 736/439 | 744/443 | 493/423 | 47/33 |
H | HTML/XML/SGML tag | - | - | - | - | - | - | 3389/1592 | 3519/1593 |
HB | Known bad HTML tag, like <font> | - | - | - | - | - | - | 14417/49 | 15366/49 |
HL | Bad HTML-like linking, like <http://...> | - | - | - | - | - | - | 519/5 | 516/5 |
Parse failure | Mismatched punctuation | ? | ? | ? | 202583 | 203044 | 203611 | 214525 | 199130 articles |
Total | | 1092214/690639 | 1113627/715148 | 1112459/714927 | 1105804/711232 | 1095150/706671 | 1120169/723334 | 1075547/711296 | 1043175/695061 |
2019-03 to 2020-02
From 2018-09-20 to 2019-03-01, the number of typos classified as T1 (edit distance 1 from an English word, the most likely to be actual misspellings) dropped by 35,488, or 32%, and this appears to be due to the hard work of editors participating in the moss project fixing typos on the T1 lists. Amazing progress! The numbers for categories we aren't fixing have remained relatively stable, though for all categories there is some bouncing around as new typos are created and fixed in the normal course of writing and editing articles.
While processing the 2019-03-01 dump, I made a major change to how typos are classified. (You can see the old method in the archived statistics.) I've dropped categories with an edit distance greater than 3 from an English word (T4 thru T16) since these are quite unlikely to be misspellings. Most of the reported typos that are not likely English misspellings are either compound words or non-English words. (Some of the non-English words are also misspelled.) Some English compounds end up as TS, if they are caught by a conventional spell checker; the rest are now classified as ME. (There are various other categories for compounds, all starting with M, and these will all need to be refined later because a fair number of words are up there that don't belong.) In an effort to exclude as many non-English words as possible, I've started looking at non-English Wiktionaries; any words found there but not in the English Wiktionary are classified as W. Romanizations are not eligible for Wiktionary; words native to non-Latin writing systems are entered under those other systems. I've written some code that attempts to perform transliteration from any given writing system. It's starting to catch a few thousand words (classified as L) but is obviously missing a lot and so will need to be further refined. I've also added some categories for bad HTML tags and similar problems.
Since the classification changes make the new numbers incomparable with the old numbers, I've started a new table below. I've started posting some TS typos as well as T1s, so expect to see both of those numbers improve significantly in the coming months. -- Beland (talk) 07:30, 23 March 2019 (UTC)
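For readers unfamiliar with the T1/T2/T3 buckets, here is a minimal sketch of how a word might be sorted by its edit distance from the nearest dictionary word. It assumes plain Levenshtein distance and a tiny stand-in word list; the function names and the `"other"` fallback are hypothetical and are not taken from the moss code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def classify(word: str, dictionary: set) -> "str | None":
    """Return a reporting symbol, or None if the word is already known.

    Assumption: distances 1-3 map to T1-T3, and anything further away is
    reported as R when it uses only A-Z letters (hypothetical simplification).
    """
    if word in dictionary:
        return None
    nearest = min((edit_distance(word, known) for known in dictionary),
                  default=None)
    if nearest is not None and nearest <= 3:
        return f"T{nearest}"
    if word.isascii() and word.isalpha():
        return "R"
    return "other"  # placeholder for the remaining categories


words = {"occurred", "receive"}  # stand-in for a real English word list
print(classify("occured", words))  # one missing letter
print(classify("recieve", words))  # swapped letters count as two edits here
```

Note that a transposition costs two edits under plain Levenshtein distance, so a word like "recieve" would land in T2 in this sketch; a real checker might treat swaps as a single edit.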
Reporting symbol | Explanation | Change from 2019-03-01 to 2020-02-20 | Instances, 2019-03-01 dump (692642d) | Instances, 2019-03-20 dump (802b6c0) | Instances, 2019-04-01 dump (ab3fabd) | Instances, 2019-04-20 dump (7bb97ba) | Instances, 2019-05-01 dump (dcb388a) | Instances, 2019-05-20 dump (dcb388a) | Instances, 2019-06-01 dump (30a59f6) | Instances, 2019-07-01 dump (2fc381f) | Instances, 2019-07-20 dump (41f99ab) | Instances, 2019-08-01 dump (bc954d6) | Instances, 2019-08-20 dump (c600526) | Instances, 2019-09-01 dump (4660042) | Instances, 2019-09-20 dump (18f7307) | Instances, 2019-10-01 dump (08a1438) | Instances, 2019-10-20 dump (e07a89f) | Instances, 2019-11-01 dump (e07a89f) | Instances, 2019-11-20 dump (e07a89f) | Instances, 2019-12-01 dump (95d1a53) | Instances, 2019-12-20 dump (0434c67) | Instances, 2020-01-20 dump (99af116) | Instances, 2020-02-20 dump (99af116) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TS | Missing or extra whitespace or dash (or new compound) | -39368 (-21%) | 183795 | 182018 (-1777/.97%) | 178591 (-3427/1.9%) | 177391 | 176266 | 175163 | 173312 | 170828 | 168401 | 166966 | 164205 | 161344 | 160707 | 157832 | 155980 | 155218 | 152621 | 147666 | 146591 | 144424 | 144427 |
T1 | Edit distance 1 from common English word | -36192 (-48%) | 75941 | 73600 (-2341/3.1%) | 70756 (-2844/3.9%) | 69261 | 68790 | 66099 | 64732 | 61255 | 57141 | 55160 | 51987 | 48904 | 45926 | 44275 | 40436 | 39285 | 39106 | 39721 | 39301 | 38737 | 39749 |
T2 | Edit distance 2 from common English word | -7560 (-10%) | 72093 | 71615 (-478/.66%) | 70949 (-666/.93%) | 70909 | 70684 | 70247 | 69741 | 69629 | 69365 | 69266 | 69146 | 68748 | 68657 | 67161 | 66173 | 65589 | 64952 | 64890 | 64886 | 64691 | 64533 |
T3 | Edit distance 3 from common English word | -5276 (-7%) | 79609 | 78925 (-684/.86%) | 78209 (-716/.91%) | 78139 | 78046 | 77541 | 76954 | 76887 | 76672 | 76691 | 76663 | 75998 | 76061 | 75096 | 74636 | 74327 | 73995 | 74030 | 74551 | 74419 | 74333 |
R | Regular word (A-Z only) not near a common English word | -3525 (-3%) | 101178 | 100067 (-1111/1.1%) | 99491 (-576/.58%) | 99722 | 99694 | 99236 | 98856 | 98788 | 98646 | 98498 | 98411 | 97438 | 97588 | 96865 | 96775 | 96746 | 96490 | 96593 | 96948 | 97342 | 97653 |
I | Definitely not English (International) due to accents or mixed with punctuation (other than hyphen) | -22196 (-24%) | 93902 | 90875 (-3027/3.2%) | 88564 (-2311/2.5%) | 87748 | 87925 | 84690 | 81042 | 81284 | 82263 | 82412 | 82431 | 71982 | 71240 | 70248 | 70349 | 70385 | 70510 | 70468 | 70714 | 70856 | 71706 |
W | Not in English Wiktionary, in non-English Wiktionary | -6764 (-8%) | 82548 | 82519 (-29/.04%) | 80041 (-2478/3.0%) | 79664 | 79486 | 77888 | 76310 | 76309 | 76224 | 76177 | 76142 | 75508 | 76248 | 75263 | 74906 | 74816 | 74851 | 74991 | 75294 | 75663 | 75784 |
L | Probable Romanization (transLiteration) | +81 (+2%) | 4294 | 4306 (+12/.28%) | 4206 (-100/2.3%) | 4219 | 4237 | 4197 | 4168 | 4181 | 4189 | 4188 | 4191 | 4191 | 4234 | 4115 | 4126 | 4132 | 4182 | 4195 | 4228 | 4282 | 4375 |
ME | Probable coMpound, English (with and without dash) | +976 (+2%) | 51279 | 51052 (-227/.44%) | 50845 (-207/.41%) | 50932 | 50902 | 50659 | 50263 | 50352 | 50439 | 50419 | 50700 | 50606 | 50708 | 50392 | 51830 | 51791 | 51782 | 51830 | 52026 | 52173 | 52255 |
MI | Probable coMpound, non-English (International) in English Wiktionary (both A-Z and non-ASCII characters, with and without dash) | -18475 (-9%) | 194949 | 192743 (-2206/1.1%) | 189661 (-3082/1.6%) | 189758 | 190172 | 187870 | 184497 | 185101 | 185733 | 185960 | 186074 | 175904 | 176069 | 174746 | 173592 | 173700 | 173611 | 173710 | 174881 | 175528 | 176474 |
MW | Probable coMpound, found in non-English Wiktionary | -5544 (-11%) | 51656 | 51240 (-416/.81%) | 50288 (-952/1.9%) | 50026 | 49785 | 48728 | 47641 | 47642 | 47544 | 47831 | 47555 | 46854 | 46850 | 46342 | 46232 | 46026 | 45944 | 45968 | 46031 | 45947 | 46112 |
ML | Probable coMpound, transLiteration | -124 (-3%) | 4010 | 3964 (-46/1.1%) | 3925 (-39/.98%) | 3881 | 3892 | 3835 | 3829 | 3827 | 3826 | 3857 | 3853 | 3849 | 3852 | 3779 | 3750 | 3759 | 3786 | 3798 | 3834 | 3863 | 3886 |
C | Chemistry words | -176 (-9%) | 1853 | 1855 (+2/.11%) | 1863 (+8/.43%) | 1862 | 1858 | 1864 | 1569 | 1559 | 1554 | 1560 | 1561 | 1552 | 1551 | 1665 | 1662 | 1651 | 1635 | 1639 | 1657 | 1662 | 1677 |
D | DNA sequences (a, c, g, t) | 0 | 0 | 0 (-) | 0 (-) | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
N | A-Z plus numbers and hyphens | -1391 (-5%) | 26620 | 25854 (-766/2.8%) | 25711 (-143/.56%) | 25739 | 26263 | 26134 | 25945 | 25841 | 25703 | 25650 | 25664 | 26664 | 25776 | 25557 | 25245 | 25072 | 24942 | 24993 | 25119 | 25107 | 25229 |
P | Patterns (e.g. rhyme schemes) | -20 (-43%) | 47 | 50 (+3/6.4%) | 49 (-1/2.0%) | 50 | 48 | 47 | 50 | 49 | 45 | 42 | 38 | 37 | 39 | 17 | 18 | 16 | 17 | 19 | 21 | 19 | 27 |
H | HTML/XML/SGML tag | -539 (-15%) | 3519 | 3459 (-60/1.7%) | 3423 (-36/1.0%) | 3420 | 3404 | 3237 | 3197 | 3160 | 3173 | 3180 | 3190 | 3059 | 3078 | 3003 | 3016 | 3673 | 3012 | 3019 | 3019 | 2978 | 2980 |
HB | Known bad HTML tag, like <font> | -1080 (-7%) | 15366 | 14837 (-529/3.4%) | 14541 (-296/2.0%) | 14776 | 14622 | 16313 | 16286 | 16818 | 16816 | ? | 15558 | 14620 | 15525 | 15262 | 14494 | 14891 | 14872 | 15003 | 15116 | 14164 | 14286 |
HL | Bad HTML-like linking, like <http://...> | -98 (-19%) | 516 | 510 (-6/1.2%) | 501 (-9/1.8%) | 500 | 497 | 492 | 491 | 496 | 492 | 493 | 492 | 474 | 482 | 459 | 448 | 449 | 441 | 441 | 446 | 433 | 418 |
U | URL | -94 (-7%, from 2019-03-20) | - | 1284 | 1242 (-42/3.3%) | 1235 | 1222 | 1225 | 1218 | 1225 | 1227 | 1213 | 1200 | 1219 | 1213 | 1192 | 1197 | 1196 | 1194 | 1199 | 1205 | 1192 | 1190 |
BC | Bad characters | -12678 (-6%, from 2019-09-01) | - | - | - | - | - | - | - | - | - | - | - | 205046* | 196231 | 194847 | 194674 | 194281 | 192895 | 192845 | 192679 | 192523 | 192368 |
BW | Bad words | -6542 (-5%, from 2019-09-20) | - | - | - | - | - | - | - | - | - | - | - | 306181* | 120289* | 115983 | 116073 | 115612 | 115522 | 117419 | 115418 | 114602 | 113747 |
Total | | -39115 (-3%, from 2019-09-20) | 1043175 instances | 1030773 instances (-12402/1.2%) | 1012856 instances (-17917/1.7%) | 1009232 | 1007793 | 995465 | 980102 | 975232 | 969454 | 964828 | 959061 | 1440178* instances | 1242324* instances | 1224099 instances | 1215612 instances | 1212615 instances | 1206360 instances | 1204437 instances | 1203965 instances | 1200605 instances | 1203209 instances |
Parse failure | Mismatched punctuation | -5145 (-3%) | 199130 articles | 200032 articles (+902/.45%) | 195598 articles (-4434/2.2%) | 195995 articles | 196330 articles | 196566 articles | 196882 articles | 197380 articles | 197810 articles | 198086 articles | 198442 articles | 158283 articles + 40465 MOS:STRAIGHT violations | 158564 articles + 40523 MOS:STRAIGHT violations | 151604 articles + 39214 MOS:STRAIGHT violations | 151827 articles + 39333 MOS:STRAIGHT violations | 152017 articles + 39428 MOS:STRAIGHT violations | 152167 articles + 39590 MOS:STRAIGHT violations | 152254 articles + 39727 MOS:STRAIGHT violations | 152557 articles + 39971 MOS:STRAIGHT violations | 152835 articles + 40112 MOS:STRAIGHT violations | 153494 articles + 40491 MOS:STRAIGHT violations |
* Affected by significant algorithm changes. 1 Sep 2019: Added BC and BW. (Parse failures dropped due to JWB-powered MOS:STRAIGHT cleanup.) 20 Sep 2019: BC and BW restricted to lowercase; added TS+COMMA, TS+BRACKET, TS+EXTRA.
- red = Probably need to fix
- yellow = Unsorted
- blue = Probably OK (but may need to verify)
- bold = actively working on fixing
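The "Change" column in the table above can be reproduced from the first and last instance counts in each row. A minimal sketch follows; the function name is hypothetical, and rounding to a whole percent is an assumption based on how the figures are shown.

```python
def change(first: int, last: int) -> "tuple[int, int]":
    """Return (absolute change, percent change rounded to a whole number)."""
    delta = last - first
    return delta, round(100 * delta / first)


# TS row: 183795 instances on 2019-03-01, 144427 on 2020-02-20
print(change(183795, 144427))
# T1 row: 75941 instances on 2019-03-01, 39749 on 2020-02-20
print(change(75941, 39749))
```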
2020 statistics
In the year from March 2019 to March 2020, moss volunteers fixed over 94,000 typos! The most impressive progress is in the T1 category (single-letter misspellings), where we eliminated about half from the English Wikipedia. During this period we also started fixing missing spaces (focusing on those around punctuation) and those have dropped by about one-fifth. As we make progress, clear misspellings are increasingly mixed in with unclear cases; I'll be doing some more work on separation algorithms to keep the typo reports useful, so you'll probably see some more changes to typo classifications. Thanks to everyone who has been helping out! -- Beland (talk) 16:54, 28 April 2020 (UTC)
Reporting symbol | Explanation | Change from 2019-03-01 to 2020-02-20 | Instances, 2020-04-01 dump (9f6d726) | Instances, 2020-04-20 dump (5ff589d) | Instances, 2020-05-01 dump (1a96ded) | Instances, 2020-05-20 dump (e511f74) | Instances, 2020-06-01 dump (509f79a) | Instances, 2020-06-20 dump (825ceb4) | Instances, 2020-07-01 dump (db9db23) | Instances, 2020-07-20 dump (caa619f) | Instances, 2020-08-01 dump (cf76e8c) | Instances, 2020-08-20 dump (f104e58) | Instances, 2020-09-01 dump (4654d88) | Instances, 2020-09-20 dump (a26ccca) | Instances, 2020-10-01 dump (686f5db) | Instances, 2020-10-20 dump (4f90810) | Instances, 2020-11-01 dump (ac54580) | Instances, 2020-11-20 dump (6dbd61d) | Instances, 2020-12-01 dump (917bcc8) | Instances, 2020-12-20 dump (0b3409d) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TS | Missing or extra whitespace or dash (or new compound) | -39368 (-21%) | 145297 | 144673 | 331658** | 330624 | 328249 | 325399 | 324179 | 322282 | 321801 | 318621 | 317183 | 315825 | 314747 | 312110 | 310537 | 309386 | 308280 | 308977 |
T1 | Edit distance 1 from common English word | -36192 (-48%) | 41090 | 41081 | 39967 | 39452 | 38783 | 38379 | 38436 | 38271 | 37803 | 36783 | 35976 | 34036 | 33539 | 33764 | 32347 | 33097 | 33559 | 33427 |
T2 | Edit distance 2 from common English word | -7560 (-10%) | 64526 | 63263 | 60690 | 60321 | 59589 | 58603 | 58649 | 58521 | 58200 | 58085 | 57845 | 57329 | 57152 | 57487 | 57387 | 57511 | 57386 | 57348 |
T3 | Edit distance 3 from common English word | -5276 (-7%) | 74396 | 73255 | 70516 | 70039 | 68887 | 68192 | 68149 | 68020 | 67769 | 67788 | 67482 | 67226 | 67025 | 67101 | 67002 | 67213 | 67298 | 67399 |
R | Regular word (A-Z only) not near a common English word | -3525 (-3%) | 97726 | 96916 | 94793 | 93855 | 93252 | 91537 | 91489 | 91746 | 91521 | 91729 | 91513 | 91613 | 91339 | 91813 | 92329 | 93246 | 93377 | 93493 |
I | Definitely not English (International) due to accents or mixed with punctuation (other than hyphen) | -22196 (-24%) | 72151 | 69118 | 65842 | 64827 | 63630 | 61844 | 61888 | 61782 | 61899 | 62113 | 61916 | 62003 | 62049 | 62274 | 62287 | 62390 | 62234 | 62471 |
W | Not in English Wiktionary, in non-English Wiktionary | -6764 (-8%) | 75913 | 74351 | 86935 | 85604 | 83173 | 81894 | 81946 | 82173 | 81943 | 82170 | 81912 | 81968 | 81792 | 81256 | 81052 | 81224 | 81131 | 81192 |
L | Probable Romanization (transLiteration) | +81 (+2%) | 4435 | 4486 | 4266 | 4199 | 4120 | 4122 | 4104 | 4113 | 4137 | 4140 | 4151 | 4164 | 4165 | 4207 | 4203 | 4234 | 4240 | 4260 |
ME | Probable coMpound, English (with and without dash) | +976 (+2%) | 52269 | 48761 | 47187 | 47153 | 46830 | 46856 | 46967 | 47163 | 47052 | 47170 | 47009 | 47070 | 47066 | 47045 | 47023 | 47193 | 47142 | 47302 |
MI | Probable coMpound, non-English (International) in English Wiktionary (both A-Z and non-ASCII characters, with and without dash) | -18475 (-9%) | 177646 | 176929 | 171484 | 169592 | 166216 | 164828 | 165140 | 165351 | 165605 | 166016 | 166208 | 166499 | 166572 | 167349 | 167961 | 169044 | 168953 | 169409 |
MW | Probable coMpound, found in non-English Wiktionary | -5544 (-11%) | 46113 | 45103 | 43501 | 42931 | 40436 | 41383 | 41325 | 41440 | 41173 | 41234 | 40990 | 40956 | 40795 | 40353 | 40272 | 40454 | 40411 | 40338 |
ML | Probable coMpound, transLiteration | -124 (-3%) | 3909 | 3874 | 3707 | 3663 | 3672 | 3575 | 3589 | 3593 | 3628 | 3639 | 3658 | 3717 | 3724 | 3779 | 3769 | 3825 | 3830 | 3822 |
C | Chemistry words | -176 (-9%) | 1782 | 7564 | 7530 | 7644 | 7640 | 7655 | 7658 | 7659 | 7660 | 7662 | 7654 | 7644 | 7659 | 7661 | 7665 | 7659 | 7674 | 7700 |
N | A-Z plus numbers and hyphens | -1391 (-5%) | 25209 | 23813 | 22650 | 22511 | 22290 | 22020 | 22052 | 22053 | 21971 | 22009 | 21960 | 21923 | 21879 | 21856 | 21885 | 21898 | 21893 | 21943 |
Z | Decimal fraction missing leading Zero | - | 47* | 0* | 11405** | 11418 | 11414 | 11398 | 11402 | 11421 | 11455 | 11530 | 11546 | 11578 | 11598 | 11669 | 11683 | 11703 | 11728 | 11762 |
P | Patterns (e.g. rhyme schemes) | -20 (-43%) | 27 | 28 | 7 | 9 | 7 | 7 | 3 | 2 | 2 | 4 | 5 | 4 | 5 | 5 | 4 | 5 | 5 | 5 |
H | HTML/XML/SGML tag | -539 (-15%) | 3010 | 2886 | 2938 | 2903 | 2904 | 2848 | 2693 | 2697 | 2680 | 2747 | 2757 | 2729 | 2565 | 2569 | 2542 | 2538 | 2540 | 2572 |
HB | Known bad HTML tag, like <font> | -1080 (-7%) | 14465 | 14121 | 12903 | 13928 | 12919 | 14733 | 14022 | 11428 | 11670 | 11198 | 10191 | 8860 | 8756 | 8842 | 9725 | 11088 | 10164 | 10556 |
HL | Bad HTML-like linking, like <http://...> | -98 (-19%) | 414 | 418 | 377 | 394 | 394 | 421 | 408 | 425 | 420 | 413 | 373 | 359 | 356 | 329 | 324 | 315 | 318 | 328 |
U | URL | -94 (-7%, from 2019-03-20) | 1179 | 1152 | 1118 | 1134 | 1117 | 1122 | 1129 | 1124 | 1120 | 1124 | 1124 | 1103 | 1101 | 1099 | 1091 | 1096 | 1050 | 1055 |
BC | Bad characters | -12678 (-6%, from 2019-09-01) | 192230 | 190482 | 186651 | 186517 | 185572 | 178698 | 175325 | 166116 | 159095 | 124158 | 112959 | 112755 | 112695 | 112633 | 112479 | 110608 | 110025 | 109808 |
BW | Bad words | -6542 (-5%, from 2019-09-20) | 113682 | 106327 | 381288** | 380259 | 378710 | 374982 | 375107 | 375206 | 375431 | 375306 | 374622 | 374740 | 374560 | 375010 | 375008 | 375557 | 374989 | 375663 |
Total | | -39115 (-3%, from 2019-09-20) | 1207516 instances | 1188601 instances | 1647413** instances | 1638977 instances | 1619804 instances | 1600496 instances | 1595660 instances | 1582586 instances | 1574035 instances | 1535639 instances | 1519034 instances | 1514101 instances | 1511139 instances | 1510211 instances | 1508575 instances | 1511284 instances | 1508227 instances | 1510830 instances |
Parse failure | Mismatched punctuation | -5145 (-3%) | 154084 articles + 40705 MOS:STRAIGHT violations | 153033 articles + 40838 MOS:STRAIGHT violations | 214365 articles + 37697 MOS:STRAIGHT violations | 214463 articles + 37667 MOS:STRAIGHT violations | 214101 articles + 37607 MOS:STRAIGHT violations | 214465 articles + 37767 MOS:STRAIGHT violations | 214732 articles + 37849 MOS:STRAIGHT violations | 215081 articles + 37993 MOS:STRAIGHT violations | 215447 articles + 38067 MOS:STRAIGHT violations | 215915 articles + 38169 MOS:STRAIGHT violations | 216227 articles + 38210 MOS:STRAIGHT violations | 216472 articles + 38205 MOS:STRAIGHT violations | 216738 articles + 38213 MOS:STRAIGHT violations | 216991 articles + 38246 MOS:STRAIGHT violations | 217192 articles + 38338 MOS:STRAIGHT violations | 217660 articles + 38498 MOS:STRAIGHT violations | 217861 articles + 38625 MOS:STRAIGHT violations | 218207 articles + 38789 MOS:STRAIGHT violations |
- red = Probably need to fix
- yellow = Unsorted
- blue = Probably OK (but may need to verify)
- bold = actively working on fixing
* Identification of Z was broken
** Affected by major bug fix for counting inter-word typos (e.g. involving punctuation)
2021 statistics
Dump (moss version) | Parse failures (articles + articles with MOS:STRAIGHT violations) | TOTAL (instances) | BC | BW | C | H | HB | HL | I | L | ME | MI | ML | MW | N | P | R | T1 | T2 | T3 | TS | U | W | Z | D |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2021-01-01 (b4af24a) | 218317 + 38841 | 1505808 | 108661 | 375875 | 7705 | 2550 | 10726 | 311 | 62583 | 4262 | 47274 | 169504 | 3841 | 40131 | 21954 | 4 | 93373 | 32968 | 56903 | 66819 | 306445 | 1054 | 81112 | 11753 | |
2021-01-20 (a249b2d) | 218455 + 38930 | 1506940 | 108030 | 376079 | 7679 | 2616 | 11036 | 298 | 62746 | 4298 | 47044 | 170234 | 3885 | 39960 | 21959 | 4 | 93467 | 33598 | 56688 | 66688 | 306776 | 1042 | 81049 | 11764 | |
2021-02-01 (8279235) | 218833 + 38960 | 1506004 | 107000 | 375979 | 7677 | 2595 | 11729 | 298 | 62829 | 4305 | 47053 | 171005 | 3888 | 39771 | 21971 | 2 | 93726 | 33237 | 56822 | 66707 | 305573 | 1035 | 81079 | 11723 | |
2021-02-20 (2f00c51) | 218991 + 39035 | 1504064 | 106534 | 375909 | 7682 | 2602 | 11697 | 275 | 62942 | 4342 | 47036 | 171313 | 3897 | 39732 | 22009 | 3 | 93959 | 32705 | 56529 | 66617 | 304463 | 1020 | 81041 | 11757 | |
2021-03-01 (248159a) | 219198 + 39155 | 1494162 | 106421 | 376305 | 7669 | 2624 | 9291 | 281 | 62978 | 4328 | 46830 | 169666 | 3876 | 39189 | 21936 | 4 | 92221 | 32762 | 56197 | 66069 | 302377 | 1020 | 80338 | 11780 | |
2021-03-20 (57aaae7) | 219556 + 39371 | 1492923 | 106284 | 375853 | 7695 | 2610 | 9965 | 278 | 63055 | 4331 | 47064 | 170453 | 3880 | 39172 | 21998 | 2 | 92721 | 32523 | 56052 | 66087 | 299751 | 1002 | 80305 | 11842 | |
2021-04-01 (d47c725) | 219692 + 39478 | 1484879 | 105670 | 375757 | 7697 | 2620 | 8857 | 205 | 62842 | 4309 | 46966 | 170369 | 3884 | 38886 | 21964 | 0 | 92575 | 32160 | 55810 | 65706 | 296009 | 995 | 79736 | 11862 | |
2021-04-20 (d169566) | 220014 + 39634 | 1476477 | 104505 | 374548 | 7686 | 2648 | 8863 | 199 | 62668 | 4327 | 47036 | 170547 | 3878 | 38644 | 21973 | 4 | 92336 | 30560 | 55284 | 65191 | 293170 | 985 | 79487 | 11938 | |
2021-05-01 (7719363) | 219292 + 39601 | 1445819 | 103253 | 367236 | 7661 | 2387 | 7682 | 178 | 59749 | 3966 | 44397 | 165787 | 3774 | 38591 | 21697 | 4 | 91448 | 30666 | 56556 | 65257 | 283967 | 980 | 78634 | 11949 | |
2021-05-20 (c6359fc) | 219284 + 39761 | 1444570 | 102794 | 368258 | 7678 | 2271 | 7878 | 176 | 59913 | 3978 | 44514 | 166538 | 3804 | 38629 | 21725 | 4 | 91887 | 29205 | 56341 | 65171 | 282093 | 983 | 78651 | 12079 | |
2021-06-01 (076f14c) | 219111 + 39759 | 1441769 | 102409 | 368046 | 7689 | 2275 | 7827 | 166 | 59876 | 3943 | 44658 | 166622 | 3818 | 38567 | 21755 | 5 | 92077 | 28507 | 56157 | 64919 | 280645 | 975 | 78682 | 12151 | |
2021-06-20 (ffbc72f) | 219625 + 39935 | 1435330 | 101926 | 367522 | 7694 | 2276 | 7108 | 162 | 59650 | 3964 | 44692 | 167038 | 3819 | 38298 | 21687 | 8 | 92365 | 28020 | 55983 | 64688 | 276538 | 955 | 78621 | 12316 | |
2021-07-01 (cb3d5e8) | 219791 + 39990 | 1433415 | 101916 | 367581 | 7704 | 2263 | 6921 | 169 | 59663 | 3960 | 44770 | 167508 | 3837 | 38299 | 21674 | 8 | 92600 | 27369 | 55755 | 64301 | 275024 | 946 | 78720 | 12427 | |
2021-07-20 (5c3b9e9) | 220086 + 40132 | 1429627 | 101518 | 367954 | 7688 | 2136 | 6702 | 137 | 59995 | 3955 | 44805 | 167818 | 3824 | 38179 | 21646 | 7 | 92660 | 26469 | 55565 | 64171 | 272147 | 950 | 78624 | 12677 | |
2021-08-01 (86e7022) | 220338 + 40213 | 1424448 | 101229 | 367552 | 7708 | 2123 | 6252 | 121 | 61727 | 3767 | 44851 | 168279 | 3812 | 36769 | 21643 | 0 | 93146 | 26555 | 55547 | 64124 | 271406 | 953 | 74189 | 12695 | |
2021-08-20 (33a14e3) | 220370 + 40254 | 1414854 | 100973 | 367172 | 7719 | 2047 | 5736 | 119 | 59520 | 3746 | 44729 | 167010 | 3811 | 37772 | 21537 | 2 | 92763 | 24146 | 54950 | 63571 | 266761 | 960 | 77075 | 12735 | |
2021-09-01 (90e0a3b) | 220449 + 40268 | 1411194 | 100113 | 367110 | 7714 | 2046 | 5801 | 120 | 59567 | 3733 | 44623 | 167222 | 3824 | 37710 | 21525 | 2 | 92833 | 23310 | 54796 | 63455 | 265044 | 953 | 76926 | 12767 | |
2021-09-20 (c71a444) | 220781 + 40328 | 1412140 | 99635 | 367286 | 7713 | 2040 | 5650 | 121 | 59595 | 3766 | 44828 | 167997 | 3843 | 37719 | 21561 | 0 | 93701 | 22924 | 54661 | 63575 | 264775 | 948 | 76966 | 12836 | |
2021-10-01 (cdd699c) | 221094 + 40362 | 1405448 | 99065 | 367498 | 7683 | 2060 | 5774 | 111 | 59546 | 3710 | 44579 | 167357 | 3831 | 37696 | 21381 | 2 | 93027 | 22576 | 54268 | 63134 | 261463 | 952 | 76883 | 12851 | 1 |
A major upgrade to word categorization was made in October 2021. The same dump is shown on the old and new systems for comparison. R, I, W, MI, MW, and ML were eliminated and sorted by language as TE or TF instead. New categories:
- A = mAth
- T/ = Suspected MOS:SLASH violation
- TE = AI thinks it's trying to be English
- TF = AI thinks it's trying to be a non-English language (Foreign to English Wikipedia), sorted by language (e.g. TF+el)
Dump (moss version) | Parse failures (articles + articles with MOS:STRAIGHT violations) | TOTAL (instances) | A | BC | BW | C | H | HB | HL | L | ME | N | P | T/ | T1 | TE | TF | TS | U | Z |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2021-10-01 (2ec07e4) | 221094 + 40362 | 1457644 | 17030 | 175488 | 367537 | 4049 | 2060 | 5774 | 111 | 5428 | 237959 | 2329 | 37 | 3237 | 54108 | 10076 | 439099 | 118822 | 1649 | 12851 |
2021-10-20 (b44e087) | 221396 + 40415 | 1452333 | 22433 | 173701 | 381776 | 7762 | 2032 | 5341 | 95 | 5399 | 219482 | 2351 | 6 | 3252 | 53679 | 10151 | 438103 | 112265 | 1613 | 12892 |
2021-11-01 (0786728) | 221592 + 40396 | 1476996 | 22385 | 97423 | 481799 | 7793 | 1573 | 5122 | 97 | 5399 | 219638 | 2297 | 9 | 3246 | 53546 | 10145 | 440061 | 111957 | 1607 | 12899 |
2021-11-20 (34069e9) | 153165 + 42992 | 1491000 | 23808 | 99945 | 497995 | 7816 | 1609 | 5587 | 111 | 5688 | 222435 | 2340 | 9 | 3373 | 53516 | 9847 | 426498 | 116119 | 1642 | 12662 |
2021-12-01 (0fc2fb3) | 153177 + 42994 | 1489025 | 23727 | 99782 | 496905 | 7828 | 1558 | 5602 | 104 | 5702 | 222571 | 2346 | 8 | 3359 | 53405 | 9816 | 425937 | 116070 | 1627 | 12678 |
2021-12-20 (d20f520) | 153289 + 42902 | 1488550 | 23761 | 99074 | 496904 | 7845 | 1561 | 5601 | 108 | 5715 | 223063 | 2351 | 4 | 3337 | 53580 | 9806 | 425623 | 115890 | 1618 | 12709 |
2022 statistics
Dump (moss version) | Parse failures (articles + articles with MOS:STRAIGHT violations) | TOTAL (instances) | A | BC | BW | C | D | H | HB | HL | L | ME | N | P | T/ | T1 | TE | TF | TS | U | Z |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2022-01-01 (92506e2) | 153265 + 42919 | 1488043 | 23730 | 98949 | 496872 | 7872 | 0 | 1561 | 5712 | 108 | 5744 | 222842 | 2355 | 8 | 3337 | 53020 | 9801 | 425923 | 115845 | 1608 | 12756 |
2022-01-20 (f63dc78) | 153371 + 42894 | 1490532 | 23729 | 98433 | 497315 | 7875 | 1 | 1603 | 6158 | 108 | 5794 | 223402 | 2345 | 5 | 3325 | 53057 | 9667 | 426560 | 116722 | 1594 | 12839 |
2022-02-01 (8fbf720) | 153444 + 43002 | 1621627 | 23804 | 98366 | 497551 | 7934 | 1 | 1579 | 6051 | 108 | 6007 | 240216 | 2381 | 13 | 3334 | 58724 | 11652 | 531477 | 117630 | 1599 | 13200 |
2022-02-20 (8245233) | 153724 + 43135 | 1622459 | 23835 | 98083 | 497766 | 7956 | 1 | 1604 | 5177 | 102 | 5999 | 240497 | 2370 | 14 | 3281 | 59384 | 11661 | 531576 | 118343 | 1616 | 13194 |
2022-03-01 (8245233) | 153733 + 43208 | 1624427 | 23837 | 98107 | 497855 | 7989 | 1 | 1571 | 5815 | 102 | 6027 | 240789 | 2371 | 16 | 3278 | 59744 | 11669 | 531890 | 118567 | 1608 | 13191 |
2022-03-20 (fb66b79) | 153882 + 43327 | 1624509 | 23823 | 97961 | 498466 | 7996 | 1 | 1552 | 4746 | 106 | 6059 | 241192 | 2363 | 15 | 3311 | 60058 | 11638 | 531382 | 119054 | 1601 | 13185 |
2022-04-01 (fb66b79) | 153932 + 43430 | 1626452 | 23823 | 97828 | 498085 | 8000 | 1 | 1594 | 4793 | 105 | 6063 | 241718 | 2375 | 16 | 3327 | 60572 | 11642 | 532088 | 119684 | 1591 | 13147 |
2022-04-20 (fb66b79) | 154017 + 43596 | 1630486 | 23789 | 97841 | 498611 | 8012 | 1 | 1607 | 4990 | 105 | 6065 | 242940 | 2374 | 17 | 3337 | 60977 | 11649 | 532927 | 120483 | 1587 | 13174 |
2022-05-01 (fb66b79) | 153825 + 43698 | 1631287 | 23793 | 97801 | 498632 | 8020 | 1 | 1609 | 5048 | 104 | 6073 | 243306 | 2384 | 20 | 3337 | 61453 | 11694 | 533878 | 119359 | 1579 | 13196 |
2022-05-20 (cc63e5f) | 153870 + 43814 | 1635174 | 23851 | 97718 | 498090 | 8043 | 1 | 1636 | 4925 | 107 | 6103 | 243986 | 2385 | 19 | 3337 | 59550 | 11866 | 538310 | 120406 | 1574 | 13267 |
2022-05-20 (ae346b0)* | 164831 + 29862 | 1620797 | 23846 | 92522 | 487792 | 8099 | 1 | 1631 | 4930 | 110 | 6076 | 244851 | 2308 | 18 | 3335 | 60170 | 11838 | 538751 | 119670 | 1580 | 13269 |
2022-06-01 (6090418) | 164899 + 29887 | 1620209 | 23786 | 92402 | 487512 | 8099 | 1 | 1620 | 4620 | 113 | 6090 | 245017 | 2309 | 16 | 3331 | 60318 | 11803 | 538115 | 120085 | 1587 | 13385 |
2022-06-20 (97d23b9) | 164770 + 29816 | 1617952 | 23775 | 91799 | 486712 | 8102 | 0 | 1611 | 4705 | 116 | 6087 | 245190 | 2319 | 13 | 3300 | 59666 | 11763 | 538585 | 119215 | 1568 | 13426 |
2022-06-20 (1432a2f)† | 164877 + 29821 | 1677855 | 23781 | 91816 | 547534 | 8102 | 0 | 1611 | 4706 | 116 | 6071 | 245153 | 2318 | 13 | 3297 | 59659 | 11764 | 537643 | 119292 | 1554 | 13425 |
2022-07-01 (9ab6dad) | 164769 + 29855 | 1674273 | 23732 | 91585 | 547881 | 8113 | 0 | 1644 | 4657 | 116 | 6110 | 244376 | 2295 | 143 | 3261 | 59286 | 11657 | 535628 | 118761 | 1559 | 13469 |
2022-07-20 (06d752b) | 164636 + 29850 | 1674512 | 23605 | 91172 | 547558 | 8111 | 0 | 1663 | 4856 | 126 | 6127 | 244725 | 2294 | 144 | 3272 | 58857 | 11659 | 536841 | 118429 | 1550 | 13523 |
2022-08-01 (622271d) | 164730 + 29865 | 1675287 | 23593 | 90912 | 547590 | 8080 | 0 | 1660 | 4926 | 127 | 6144 | 244829 | 2284 | 145 | 3273 | 58908 | 11604 | 537355 | 118773 | 1553 | 13531 |
2022-08-20 (597dbd2) | 163908 + 29808 | 1667614 | 23508 | 90561 | 544710 | 8081 | 0 | 1653 | 5137 | 121 | 6136 | 243853 | 2287 | 122 | 3234 | 58163 | 11473 | 536597 | 117099 | 1535 | 13344 |
2022-08-20 (5ee7ffd)‡ | 162500 + 29580 | 1210578 | 10681 | 86656 | 540463 | 7981 | 0 | 1611 | 5136 | 122 | 2073 | 182672 | 1964 | 114 | 2307 | 43457 | 6582 | 206072 | 97829 | 1522 | 13336 |
2022-08-20 (6965e1f)⹋ | 162432 + 29567 | 1205869 | 10669 | 86557 | 538964 | 7979 | 0 | 1610 | 5131 | 122 | 2041 | 181481 | 1963 | 114 | 2298 | 43278 | 6540 | 204575 | 97689 | 1520 | 13338 |
2022-09-01 (cda0784) | 161909 + 29468 | 1198769 | 10663 | 86161 | 536440 | 7990 | 0 | 1603 | 5399 | 120 | 1977 | 180548 | 1945 | 99 | 2270 | 42927 | 6445 | 202651 | 96760 | 1485 | 13286 |
2022-09-20 (4689b50) | 162154 + 29594 | 1199166 | 10676 | 85924 | 536599 | 7981 | 0 | 1621 | 6730 | 125 | 1985 | 180428 | 1950 | 99 | 2267 | 42279 | 6383 | 202327 | 96972 | 1487 | 13333 |
2022-10-01 (e725bbd) | 161370 + 29450 | 1193722 | 10646 | 84999 | 534429 | 7981 | 0 | 1623 | 6988 | 123 | 1964 | 179378 | 1934 | 99 | 2259 | 42089 | 6356 | 201547 | 96530 | 1466 | 13311 |
2022-10-20 (e725bbd) | 161347 + 29546 | 1192591 | 10632 | 84851 | 534850 | 7998 | 0 | 1623 | 6987 | 121 | 1981 | 178500 | 1921 | 101 | 2271 | 41414 | 6264 | 201358 | 96915 | 1454 | 13350 |
2022-11-01 (ebbea0e) | 161388 + 29603 | 1192455 | 10634 | 84376 | 535156 | 8036 | 0 | 1633 | 6505 | 116 | 1976 | 178546 | 1917 | 102 | 2270 | 41341 | 6217 | 201463 | 97334 | 1450 | 13383 |
2022-11-20 (84f0fc4) | 161548 + 29683 | 1193478 | 10659 | 84327 | 535811 | 8112 | 0 | 1614 | 6622 | 115 | 1970 | 178817 | 1918 | 102 | 2259 | 41326 | 6187 | 201180 | 97563 | 1444 | 13452 |
2022-12-01 (d57116b) | 161334 + 29741 | 1193626 | 10650 | 84229 | 536307 | 8124 | 0 | 1604 | 6503 | 110 | 1981 | 178844 | 1913 | 102 | 2262 | 41018 | 6181 | 201090 | 97779 | 1446 | 13483 |
2022-12-20 (003741b) | 161351 + 29828 | 1189035 | 10658 | 83972 | 535095 | 8218 | 0 | 1592 | 4957 | 110 | 1971 | 178831 | 1917 | 1 | 2236 | 41413 | 6177 | 198807 | 98124 | 1431 | 13525 |
* ae346b0 started ignoring content inside curly quotes
† 1432a2f excluded more end sections
‡ 5ee7ffd started ignoring italicised content
⹋ 6965e1f started ignoring content inside single quotes
Likely new words by frequency (non-English)
From 2019-02-01 dump:
- 50 - wikt:µtorrent - BitComet, BitTorrent, BitTorrent (company), BitTorrent (software), BitTorrent protocol encryption ... find all
- Should not be listed here, as μTorrent is an article with μtorrent as redirect. Graeme Bartlett (talk) 12:09, 7 April 2019 (UTC)
- Hmm, maybe it's because the title of the redirect is actually Μtorrent, or my code isn't case insensitive with respect to Greek letters. -- Beland (talk) 06:19, 31 August 2019 (UTC)
- Ah, there are a ton of pages that use the micro character instead of the mu character, and they look exactly the same. I'm cleaning them all up so they use mu, so this should go away. -- Beland (talk) 16:06, 4 September 2019 (UTC)
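The micro sign (U+00B5) and the Greek small letter mu (U+03BC) are distinct code points that render identically, which is why the two spellings failed to match. As a rough sketch of one way to compare such titles, the following builds a comparison key with Unicode NFKC normalization plus case folding. The helper name is hypothetical; the fix described in the thread was to edit the pages themselves, not necessarily to change the code like this.

```python
import unicodedata

MICRO_SIGN = "\u00b5"      # µ, compatibility character
GREEK_SMALL_MU = "\u03bc"  # μ, the letter used in the article title
GREEK_CAPITAL_MU = "\u039c"  # Μ, looks like Latin M


def title_key(title: str) -> str:
    """Comparison key: NFKC maps the micro sign to mu; casefold handles Greek case."""
    return unicodedata.normalize("NFKC", title).casefold()


variants = [MICRO_SIGN + "torrent",
            GREEK_SMALL_MU + "Torrent",
            GREEK_CAPITAL_MU + "torrent"]
# All three variants collapse to a single key
print({title_key(v) for v in variants})
```

NFKC also folds other compatibility characters (ligatures, full-width forms), so a key like this is only suitable for lookups, not for rewriting article text.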
From 2019-02-01 dump, but clearly not foreign words (need to figure out what to do with them):
- 81 - wikt:₤100 - 19th-century London, Agio, Arthur Machen, Auckland Baptist Tabernacle, Australian native police ... find all
- I should probably put in a code change to exclude money patterns like this. -- Beland (talk) 01:25, 30 August 2019 (UTC)
- Ah, the problem is that ₤ (two horizontal bars) is not allowed by MOS:CURRENCY. It needs to be changed to £ (one horizontal bar) if it represents the British pound and it's unclear to me what to do for Italian lira. find all ₤ -- Beland (talk) 06:17, 31 August 2019 (UTC)
- OK, clarified with MOS folks the Lira should use ₤, so I'll clean up all the GBP instances that use ₤ and then add ₤ to moss's list of allowed currency symbols. -- Beland (talk) 16:08, 4 September 2019 (UTC)
- I have changed all those that mean Pounds. Some that mean Lira still remain. Graeme Bartlett (talk) 04:03, 14 September 2019 (UTC)
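To illustrate the two ideas in this thread (skipping money patterns, and locating the lira sign ₤, U+20A4, so each use can be checked against the pound sign £, U+00A3), here is a small sketch. The set of allowed symbols, the number format, and the function names are assumptions for illustration and do not reflect moss's actual list.

```python
import re

POUND_SIGN = "\u00a3"  # £, one horizontal bar
LIRA_SIGN = "\u20a4"   # ₤, two horizontal bars

# Hypothetical allow-list of currency symbols, followed by a plain number
MONEY_PATTERN = re.compile(
    "[" + POUND_SIGN + LIRA_SIGN + "$\u20ac]" + r"\d[\d,]*(?:\.\d+)?"
)


def is_money(token: str) -> bool:
    """True if the whole token looks like a currency amount."""
    return MONEY_PATTERN.fullmatch(token) is not None


def find_lira_signs(text: str) -> list:
    """Character offsets of every lira sign, for manual review."""
    return [match.start() for match in re.finditer(LIRA_SIGN, text)]


print(is_money(LIRA_SIGN + "100"))
print(find_lira_signs("cost " + LIRA_SIGN + "100 or " + LIRA_SIGN + "5"))
```

Deciding whether a given ₤ means pounds or lira still needs a human reading the article, so the second helper only reports positions.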
Case notes from 2019-06-01 dump
- 1 - QueA RNA motif - wikt:preQ -- this appears as preQ1 which does have a Wiktionary entry, wikt:preQ1 so why is it included here?
- Weird, I'll have to debug that. -- Beland (talk) 08:47, 16 June 2019 (UTC)
- Oh, of course, because sup and sub tags cause text on either side to be in different tokens. I'll try changing that and see if it is an overall improvement. That should also fix things like chemical formulas, so I think it will be good. -- Beland (talk) 02:10, 9 May 2020 (UTC)
- This is confirmed fixed. -- Beland (talk) 18:42, 27 May 2020 (UTC)
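The tokenization issue above can be sketched roughly as follows: if sub and sup tags are removed without inserting a space, the text on either side stays in one token, while other tags still act as word breaks. The regular expressions and the function name are hypothetical and only approximate whatever moss actually does.

```python
import re

# Opening or closing <sub>/<sup> tags, with optional attributes
SUB_SUP_TAG = re.compile(r"</?su[bp]\b[^>]*>", re.IGNORECASE)
# Any other HTML-like tag
OTHER_TAG = re.compile(r"<[^>]+>")


def tokenize(markup: str) -> list:
    """Split markup into words, keeping subscripts and superscripts attached."""
    joined = SUB_SUP_TAG.sub("", markup)  # drop the tag, keep its content in place
    spaced = OTHER_TAG.sub(" ", joined)   # remaining tags become word breaks
    return spaced.split()


print(tokenize("preQ<sub>1</sub> riboswitch"))
print(tokenize("H<sub>2</sub>O and CO<sub>2</sub>"))
```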