
User:John of Reading/Typo fixing with AutoWikiBrowser

From Wikipedia, the free encyclopedia

If anyone's interested, here is how I do bulk typo fixing:

Creating the list

I download and uncompress a 100GB database dump every couple of months, and use the AWB database scanner to create my main work lists. This is the only method I know that:

  • Allows precise control over the search using regular expressions
  • Returns the full list of results
  • Does not try to guess what I really meant

I am happy to be asked to create article lists for other editors. My latest download is enwiki-20240801-pages-articles.xml.bz2; watch for a new download.
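For readers without AWB, the idea of scanning a dump with a regular expression can be sketched in Python. This is only a rough analogue, not how the AWB database scanner is implemented: the function names `scan_dump` and `local_name` and the two-page sample are made up for illustration, and Python's `re` module does not accept everything that the .NET regex engine used by AWB accepts.

```python
import io
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on each tag."""
    return tag.rsplit("}", 1)[-1]


def scan_dump(xml_file, text_pattern):
    """Yield the title of every page whose wikitext matches text_pattern.

    xml_file is a binary file object holding a MediaWiki XML export.
    Pages are parsed one at a time instead of loading the whole file.
    """
    regex = re.compile(text_pattern)
    title, text = None, ""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if regex.search(text):
                yield title
            title, text = None, ""
            elem.clear()  # drop the finished page's children


# Tiny made-up sample with a namespaced root element, as real dumps have.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>Alpha</title><revision><text>Daily exercice is good.</text></revision></page>
  <page><title>Beta</title><revision><text>Nothing to see here.</text></revision></page>
</mediawiki>"""

matches = list(scan_dump(io.BytesIO(SAMPLE), r"\bexercic"))
print(matches)
```

For a real dump, the file object could come from `open(path, "rb")` on the uncompressed file, or from `bz2.open(path, "rb")` to read the compressed download directly, at the cost of speed.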

Using the Database Scanner

I use a regular expression that describes several typos at once, so that I get a long list of articles that need a variety of fixes.

  • I like to search for a prefix rather than a whole word, so that I find occurrences where editors have mangled the ending of a word as well as the beginning.
  • If the list is going to be a long one, I'll run the first 1 or 2 percent of the scan and see if I can tweak the regular expression to eliminate some of the false positives.
  • When working on the Lists of common misspellings, I use a regular expression that searches for all the words in an entire list. For example, \be(?:ached|achother|...(several hundred more omitted)...|yasr) from the "E" list.
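A combined prefix search of the kind described in the list above could be assembled along these lines. This is a minimal Python sketch, not the tool I actually use; `build_scan_regex` and the three sample prefixes are hypothetical, and unlike my real expressions it does not factor out the shared first letter.

```python
import re


def build_scan_regex(prefixes):
    """Join misspelling prefixes into one alternation anchored at a word start.

    There is deliberately no trailing \\b, so mangled endings still match.
    """
    ordered = sorted(set(prefixes))
    return r"\b(?:" + "|".join(re.escape(p) for p in ordered) + ")"


# Hypothetical sample; a real list has several hundred entries.
pattern = build_scan_regex(["exercic", "embarass", "enviroment"])

# Append \w* only to display the whole offending word.
word_re = re.compile(pattern + r"\w*")
found = word_re.findall("She exerciced in a harsh enviroment, unembarassed.")
print(found)
```

Because the pattern starts with `\b`, "unembarassed" is not reported: the prefix has to begin a word.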

To use the AWB database scanner, select "Database dump" from the drop-down control just above the article list, then press "Make list". Within the scanner dialog, I choose these options before clicking "Start":

  • On the "Database" tab, I browse to my download folder and select the uncompressed database dump.
  • On the "Namespace" tab, I tick the first "Content" checkbox, which ticks everything in the first list box, then untick "Draft". These database dumps do not include any user pages, so that tick is not relevant.
  • On the "Title" tab, I skip blocks of titles within the Wikipedia namespace. I tick "Not contains", "Regex" and "Case sensitive", and paste the long regular expression, below, into the second text box.
  • On the "Text" tab, I tick "Contains", "Regex" and "Ignore comments", and paste my current search targets into the first text box.
Titles skipped

(?:~~ARTICLES~~|Charles Magauran|Commonly misspelled English words|Cut Spelling|Date and time notation in the United Kingdom|Drexel\s+4\d\d\d|Early Cornish texts|English orthography|Henry Marshall Furman|Interspel|List of On Cinema episodes|List of the Dead Daisies members|Nairai\b|Otte Rud|SoundSpel|Transposed letter effect|~~OTHERS~~|Abuse reports|Abuse response/|Academic studies of Wikipedia|ACF Regionals answers/|Administrators' noticeboard|AMA IRC Meeting log|Adopt-a-typo|Arbitration Committee Elections|Arbitration/|Archived deletion|articles by quality log|Articles for|Articles with UK Geocodes|Attached KML/List of power stations in New Zealand|AutoWikiBrowser/Typos|BillboardEncode/|BillboardID/|Categories for|Catholic Encyclopedia topics/|Centralized discussion/|Changing username/|CHECKWIKI/|Contributor copyright investigations/|Copyright problems/|Correct typos in one click|Coverage of Mathworld topics/|Database reports/|Deleted articles with freaky titles|Deletion log/|Deletion log archive|Deletion review|Did you know nominations/|Disambiguation pages with links/|Editor review/|Featured article|Featured list|Featured picture|Featured portal|Featured topic|Files for|Find a Grave famous people/|GLAM/NHMandSM|GLAM/Your paintings|Goings-on/|Good article reassessment|In the news/|India Education Program/Courses/|Jewish Encyclopedia topics/|Jimbo Wales discussion|List of encyclopedia topics/|List of Wikipedians by|Lists of common misspellings|Main Page history/|Mediation Cabal/|Meetup/|Miscellany for|Move review/|New user log/|Pfam2pdb|Pfam2PDBsum|Picture peer review|Possibly unfree|Recent additions|Redirects for|Reference desk archive|Requested articles|Requests for|Sandbox/|School and university projects/|Shortcut table/|Sockpuppet investigations/|Stub types for deletion|Suspected copyright violations/|Suspected sock puppets|Templates for|Templates with red links|Tyop Contest|Typo Team|Unwanted Cinema cover.png|Upload log archive|Votes for deletion|Wiki Ed/|Wiki Guides/|Wikipedia Signpost/2|Wikipedia Signpost/Special|WikiProject Academic Journals/|WikiProject Chemicals/Log/|WikiProject Chemistry/IRC|WikiProject Directory/Description|WikiProject Editor Retention/|WikiProject Fix common mistakes/|WikiProject History Merge/|WikiProject Intertranswiki/|WikiProject Languages/|WikiProject London Transport/The Metropolitan/|WikiProject Missing encyclopedic articles/|WikiProject Pharmacology/Log/|WikiProject Red Link Recovery/|WikiProject Short descriptions/wd/|WikiProject Spam/|~~SLASH~~|/All discussions|/[Aa]rchive|/Article alerts|/Article list|/Article Talk list|/Articles|/Assessment|/Cleanup listing|/CurrentTranscriptions|/[Dd]ata|/Deletion archive|/Did you know|/Discussions?|/DYK|/Encyclopedic articles|/Example generated lists|/[Ff]eedback|/Fundraising|/ICC valuations|/Internet Relay Chat|/IRC|/List of all portals|/List of biographies|/List of mountains|/Listeria|/Listing by project|/Lists of pages|/Members|/Metrics/|/Newsletter|/Participants|/Peer review|/Popular pages|/Prospectus|/[Pp]ublicwatchlist|/Recent changes|/Recognized [Cc]ontent|/[Rr]edlinks|/Rename template parameters|/[Ss]andbox|/Settings/|/Stale drafts|/Stats|/Statistics|/Talk|/Translation task force|/Unpatrolled|/Watchall|/[Ww]atchlist)

Yes, there are a few article titles in this list. Some of these contain many false positives; others are where I don't wish to repeat a mistake; and others are where I am avoiding a slow-motion edit war.
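The effect of ticking "Not contains", "Regex" and "Case sensitive" on the "Title" tab can be illustrated with a short Python sketch. The alternation here is only a small excerpt of the expression above, and `keep_title` is a made-up helper name; AWB itself does the filtering inside the scanner dialog.

```python
import re

# Short excerpt of the skip list; the real alternation is far longer.
SKIP_TITLES = re.compile(
    r"(?:Cut Spelling|Drexel\s+4\d\d\d|Nairai\b|Articles for|/[Aa]rchive|/[Ss]andbox)"
)


def keep_title(title):
    """Return True when the title matches none of the skip patterns.

    The search is unanchored ("contains") and case sensitive.
    """
    return SKIP_TITLES.search(title) is None


for title in ["Exercise", "Cut Spelling", "Cut spelling", "Drexel 4321",
              "Wikipedia:Articles for deletion/Foo"]:
    print(title, keep_title(title))
```

Note that "Cut spelling" with a lowercase "s" is kept, because the match is case sensitive.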

Settings within AutoWikiBrowser

  • I tick "Find & Replace"; and within the configuration dialog:
    • I do not tick the "ignore" checkboxes, so that I force the correction of the current misspelling in the entire page.
    • I tick "Add replacements to edit summary" so that the edit summary is as helpful as possible. For this to work properly, the "Find" strings must match all four brackets of a [[Link]].
    • I usually start with my accumulated list of over 4,000 spelling rules.
    • Below the spelling rules, I start with a dummy "Find & Replace" rule that finds the exact regular expression that I used for the database scan and replaces it with "INVESTIGATE".
  • I tick "Skip if no replacement".
  • I tick "Skip if only minor replacements made"; within the "Find & Replace" dialog I use the "Minor" checkbox to mark rules that make a change that is valid but, I think, not worth saving by itself.
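How these settings interact can be sketched in Python. This is a simplified model under my own assumptions, not AWB's actual logic: `Rule`, `apply_rules` and the two sample rules are hypothetical, and a "replacement" is counted whenever a rule matches.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    find: str
    replace: str
    minor: bool = False


def apply_rules(text, rules, scan_regex):
    """Apply the rules in order, then flag whatever the scan regex still matches.

    Returns (new_text, decision), where decision is "skip" or "review".
    """
    major_hits = 0
    for rule in rules:
        text, count = re.subn(rule.find, rule.replace, text)
        if not rule.minor:
            major_hits += count
    # Dummy rule, last in the list: anything no spelling rule fixed gets flagged.
    text, flagged = re.subn(scan_regex, "INVESTIGATE", text)
    major_hits += flagged
    # No replacements, or only minor ones, means the page is skipped.
    decision = "review" if major_hits else "skip"
    return text, decision


rules = [
    Rule(r"\bexerciced\b", "exercised"),
    Rule(r" {2,}", " ", minor=True),  # valid, but not worth a save by itself
]
scan = r"\bexercic"

print(apply_rules("He exerciced daily.", rules, scan))
print(apply_rules("She kept exercicing.", rules, scan))
print(apply_rules("Two  spaces here.", rules, scan))
```

The second call shows why the dummy rule is useful: no spelling rule covers "exercicing", so the scan prefix is replaced and the page is shown for manual attention instead of being skipped silently.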

I'm currently running with General Fixes turned off because this discussion has not reached a conclusion.

I gave up on RegExpTypoFix some years ago. Although there are lots of good spelling rules there, I prefer to leave MOS fixes to editors who are prepared to defend them.

Checking each proposed edit

Then it's up to me to check each proposed edit.

  • If the text looks like vandalism or a WP:BLP violation, then I jump out to look at the article history.
  • If I can't understand what the text is trying to say, I don't try to fix it.
  • If my "Find & Replace" has damaged correct text, then I may pause to think about changing the re-spelling rules to avoid the false positive.
  • If my "Find & Replace" has identified incorrect text by changing it to the word "INVESTIGATE", then I'll either add or adjust a re-spelling rule and try again, or make a one-off edit to the article text.
  • If my "Find & Replace" has made an incorrect fix, I'll either adjust the settings and try again, or make a one-off edit to the article text.
  • If the changes are part of quoted text or something like a book title, I'll jump out to another window to try to check the source.

I may make other edits in the AWB edit box, fixing additional typos or correcting syntax errors that AWB has identified but not fixed automatically.

I pick one of a handful of pre-configured edit summaries, and then modify it if necessary to describe the edits I actually made.

I try to remember to clear the "Minor edit" checkbox if I've done anything more than simple typo-fixing or if the diff seems very long; the danger is that I forget to tick it again afterwards.

Then it's "Save" and on to the next article.

There is a danger that I'll accidentally save the word "INVESTIGATE" in an article. I check for this kind of error by running this search every day or two.

Editing quotes, book titles and such like

My regular expressions run on the whole page including quotations, book titles and so on. If I edit these, I try to leave a helpful edit summary:

replaced: foobar per source
I found the source and was able to verify that the version at Wikipedia was incorrect. Either an earlier editor miscopied it, or, perhaps, the source has been corrected after it was copied.
replaced: foobar per book cover image at Amazon/Abebooks/etc.
I found an image of the book with enough pixels for the words to be read clearly.
replaced: foobar per a search at Amazon/Abebooks/etc.
I didn't find a usable cover image, but these external sites support the correction.
replaced: foobar - MOS:QUOTE recommends fixing "insignificant" errors in quoted text
I found the source also has the error, but I've made an editorial decision to apply MOS:QUOTE.
replaced: foobar - In a quote, but I'm assuming this was a copying error
I haven't found the source, but to me it looks like a copying error. Or perhaps MOS:QUOTE might apply.
replaced: foobar for legibility
I didn't bother to check the source, as the change is small and the incorrect version is hard to read - something like WIlliam > William
replaced: foobar
Oh dear. Perhaps I didn't spot that I was about to edit a quote, or I neglected to adjust the edit summary. Please revert if necessary, but it's possible that MOS:QUOTE might apply.

Skipping false positives

The best way to skip false positives is to use regular expressions with lookahead/lookbehind. This method is especially useful when doing the initial database scan, since it means the articles don't even appear in the list.

I've developed a few standard suffixes that arrange for some common false positives to be skipped. I'll tack some of them on to the long regular expression when doing the initial search, and sometimes tack them on to individual find+replace rules when needed.

Suffix Skip matches...
(?(?<![\.\-]\w*)|(?!\w*[\.\-])) ...inside hyphens or dots, probably part of a URL or domain name
(?(?<!"\w*)|(?!\w*")) ...where a single word is inside double quotes
(?(?<!\[\[\w*)|(?!\w*\]\])) ...where a single word is inside a wikilink
(?![ \(\)\.\,\;\-\'\"\+\&\%\w\d]*\.(?i:(?:gif|jpe?g|ogg|ogv|pdf|png|svg|tiff?|webm))\b) ...inside an image file name
(?!(?:<sup>|&#91;|</?nowiki>|\W)+(?i:Sic)\b)(?<!{{(?:[Aa]s\s+written|[Nn]at|[Nn]ot\s+typo|[Nn]ot\s*a\s*typo|[Pp]roper\s*name|[Ss]ic\??|[Ss]IC|[Tt]ypo)\|[^{}]+) ...inside a {{Sic}} template, or closely followed by the word "sic"
(?<!\<\s*ref\s+name\s*=\s*(?:"|'|)[\w\s\:\-\.\/]{0,99}) ...inside a reference name
(?<!https?://[^ \|\{\}\[\]\<\>]*) ...inside a URL
(?<!\b(?<!trans-)title\d*\s*=[^\|\{\}]{0,255}) ...inside a title parameter, but not a trans-title
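As an illustration of how one of these suffixes behaves, here is the image-file-name suffix tacked on to a hypothetical `\brecieve` rule in Python. Only the pure lookahead suffixes can be tried this way: AWB uses the .NET regex engine, and the rows that rely on variable-length lookbehind or on a conditional such as `(?(?<!...)|...)` are not accepted by Python's `re` module. The sample text and the names below are made up.

```python
import re

# The "inside an image file name" suffix from the table, copied unchanged.
NOT_IN_FILE_NAME = (
    r"""(?![ \(\)\.\,\;\-\'\"\+\&\%\w\d]*\.(?i:(?:gif|jpe?g|ogg|ogv|pdf|png|svg|tiff?|webm))\b)"""
)

# Hypothetical spelling rule with the suffix appended.
rule = re.compile(r"\brecieve" + NOT_IN_FILE_NAME)

text = "[[File:Letter recieved 1900.jpg|thumb|A letter]] He recieved it."
fixed = rule.sub("receive", text)
print(fixed)
```

The misspelling inside the file name is left alone, since changing it would break the image link, while the one in the sentence is corrected.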

I'll typically save edits to around 40% of the articles that turn up in my list, so it is important that the other 60% are skipped efficiently.

For example, the "E" list says that "exercice" may be a misspelling of "exercise". I actually searched for \bexercic so that I found "exerciced", "exercicing" and so on. As I worked through the list I gradually expanded the rule to

exercic(?!(i|io|ios|is|o)\b)(?!es?\s+(anarchistes|au|comme|commun|d|dans|de|des|divertissants?|du|en|et|journaliers|modulé|ou|par|participatif|phénoménologique|pour|pratiques|préparatoires|prepar[eé]es|progressifs|spirituels?|sur|terminé)\b)(?<!\b(d|l)['’]exercic)(?<!\b(avec|ces|cet|des|douze|en|en\s+\d+|et|les|mes|ou|plein|son|un)\s+exercic)

Fragment Meaning
(?!(i|io|ios|is|o)\b) Skip if the word is "exercici" (Latin) or other foreign words
(?!es?\s+(anarchistes|...|phénoménologique)\b) Skip if the word is "exercice" or "exercices" followed by something indicating we're in French-language text
(?<!\b(d|l)['’]exercic) Skip if the word is immediately preceded by d' or l', again indicating French-language text
(?<!\b(avec|...|un)\s+exercic) Skip if the previous word tells us we're in French-language text
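The fragments above can be tried out in Python with a cut-down version of the rule. This sketch keeps only a few of the French words, and it omits the last lookbehind entirely, because that one is variable-length and Python's `re` module, unlike the .NET engine in AWB, insists on fixed-width lookbehind. `needs_review` is a made-up helper name.

```python
import re

# Cut-down version of the rule: two lookaheads and the fixed-width lookbehind.
EXERCIC = re.compile(
    r"exercic"
    r"(?!(i|io|ios|is|o)\b)"                     # "exercici", "exercicio", ...
    r"(?!es?\s+(de|des|du|pour|spirituels?)\b)"  # French word follows
    r"(?<!\b(d|l)['’]exercic)"                   # preceded by d' or l'
)


def needs_review(text):
    """Return True when the text still contains a likely misspelling."""
    return EXERCIC.search(text) is not None


for sample in ["He exerciced daily", "exercici spirituali",
               "les exercices spirituels", "l'exercice.",
               "an exercice in futility"]:
    print(sample, needs_review(sample))
```

Only the first and last samples are reported; the other three are skipped by the first lookahead, the second lookahead and the lookbehind respectively.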

Alternatively, I use the "Minor replacement" feature. If I can write a regular expression that describes a set of false positives, I'll add a respelling rule to change that to "FALSE" and mark the rule as "minor".

Namespaces

I will happily fix typos in most non-talk namespaces.

Namespace Comment
Draft I don't touch these
File I consider file descriptions and fair use rationales to be part of the encyclopedia, and fair game for typo fixing. However, I try not to fix descriptions written in the first person. Some files contain lists of old edit summaries (example) inside <nowiki>...</nowiki> tags. I skip those efficiently by ticking the first "Ignore" checkbox at the top left of the "Find & Replace" dialog while working on the File namespace
Module I'll have a look at them, but I'm most unlikely to make any edits
Portal With care; some portal pages are used for discussion, and shouldn't be fixed; others are archives and probably shouldn't be fixed
Template I'll fix typos in template documentation, with care; I'll even fix typos in templates sometimes, with great care
User I don't touch these. They are not included in the database dumps so don't turn up in my lists.
Wikipedia I aim to fix typos only on pages which are still being used. On the "Skip" tab, I have a regular expression (\{\{([Ff]ailed|[Hh]istorical|[Rr]ejected)(\||\}\})|\[\[User:|\[\[User\s+talk:|^(?s:.{499999})) which I turn on while working on the Wikipedia namespace. This skips most discussion pages, some other inactive pages, and huge pages that can cause AWB to hang.
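The "Skip" tab expression for the Wikipedia namespace can be checked in Python, where it happens to compile unchanged. This is just a sketch of what the expression matches; `should_skip` and the sample strings are hypothetical, and AWB applies the expression itself.

```python
import re

# The "Skip" tab expression from the table, split over three lines for
# readability; the concatenated string is the same expression.
SKIP_PAGE = re.compile(
    r"(\{\{([Ff]ailed|[Hh]istorical|[Rr]ejected)(\||\}\})"
    r"|\[\[User:|\[\[User\s+talk:"
    r"|^(?s:.{499999}))"
)


def should_skip(wikitext):
    """Return True for inactive pages, signed discussions and huge pages."""
    return SKIP_PAGE.search(wikitext) is not None


print(should_skip("{{Historical}} Old process"))                  # tagged inactive
print(should_skip("Thanks! [[User:Example|Example]] 12:00"))      # signed comment
print(should_skip("A policy page with a typo."))                  # ordinary page
print(should_skip("a\n" * 250000))                                # 500,000 characters
```

The last alternative, `^(?s:.{499999})`, only matches when the page has at least 499,999 characters, line breaks included, which is how the huge pages are caught.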

Common misspellings

The settings files are here: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

May 2023: In terms of effort per fix, this approach is no longer efficient.

Each of these lists contains a mixture of spellings. Some are easy, in the sense that most articles that contain that spelling need to be fixed. Others are not easy, because although they are incorrect spellings of English words, they are valid foreign-language words, surnames, brand names, and so on. Back in 2012 there was a backlog of easy errors which I was able to fix. Nowadays I find that other editors keep on top of the easy errors, and, despite my efforts to eliminate the false positives automatically, I'm looking through a list where most of the matches shouldn't be fixed but cannot easily be skipped automatically.

Tackling the Lists of common misspellings

List Start date Time Edits Notes
A March 2012 2 months 5500

My (somewhat naive) database scans created lists totalling 60,000 articles

  • a.k.a > a.k.a. - 2,000 of these
  • Alright > All right - Quickly abandoned, 99% song/film titles and quotations
  • alright > all right - Abandoned after this edit
  • anyways > anyway - Abandoned part way; too subtle to be a typo fix
February 2015 3 weeks 4400
  • aka > a.k.a - Abandoned; a MOS issue, not a misspelling as such; c. 40,000 articles
  • a.k.a > a.k.a. - 1,200 of these
  • august > August - 300 fixes, many false positives
November 2016 3 weeks 2800
October 2018 3 weeks 2700
November 2020 3 weeks 1900
  • antiapartheid > anti-apartheid - Not done, not convinced
November 2022 3 weeks 2270
  • ant > aunt - Only searched "his/her/their ant", one fix made, to "anti"
B May 2012 3 weeks 1600
  • broadcasted > broadcast - Quickly abandoned; very common, and several dictionaries allow it
September 2014 6 days 500

Reassuringly faster with improved regular expressions

  • bye-election > by-election - Not done, some dictionaries allow it
  • Bonnano > Bonanno - Next time don't bother, Wikipedia must follow the sources
December 2016 6 days 600
  • bearly > barely - Next time look for lowercase examples only
  • boo's > boos - Next time look for lowercase examples only
March 2019 1 week 620
December 2020 1 week 400
December 2022 4 days 360
C June 2012 5 weeks 5600
  • cancelation > cancellation - Wiktionary allows both; feels like an engvar issue
  • councellor > councillor or counsellor - Usually too hard to work out which fix is correct
  • calender > calendar - The "-er" spelling seems common in the context of Patent Rolls
October 2014 2 weeks 1900
  • cancelation > cancellation - Not done, several dictionaries allow it
  • cant > can't - 60 fixes, 2000 false positives. Next time search lowercase only
  • carrer > career - 30 fixes, 500 false positives
  • casted > cast - 200 fixes
  • chanel > channel - 60 fixes, 3000 false positives
  • childrens > children/children's - 130 fixes, 1000 false positives
  • comando > commando - Just one fix, 750 false positives. Next time don't bother?
  • correspondance > correspondence - 20 fixes, 700 false positives
January 2017 3 weeks 2500
  • carrer > career - Now a 1400-character regex for next time
  • chanel > channel - Now a 2000-character regex
  • constructable > constructible - Not done, at least two major dictionaries allow it
March 2019 4 weeks 2400
January 2021 3 weeks 2000
  • Cape town > Capetown - Abandoned; seems too minor a change
  • connotate > connote - Not done, several dictionaries allow it
January 2023 3 weeks 1850
  • chanel > channel - Not done this time
D October 2012 3 weeks 3500
  • de-facto > de facto - About 1000 of these
  • decypher > decipher - Wiktionary allows both, but not many real dictionaries allow "decypher"
  • dukeship > dukedom - Not done, I think there is a different shade of meaning here
December 2014 1 week 850
February 2017 1 week 1000
May 2019 3 weeks 1300
March 2021 1 week 800
January 2023 1 week 800
E November 2012 5 weeks 4650
  • eg > e.g. - About 600 of these
  • enroute > en route - About 1300 of these
  • eery > eerie - Not done, some dictionaries allow it
  • (various) > entrepreneur - About 350 of these
March 2015 2 weeks 1400
March 2017 1 week 1300
  • eg. > e.g. - About 160 of these
  • eg [no dot] > e.g. - On hold after this revert, discussion here
June 2019 2 weeks 1400
May 2021 1 week 800
March 2023 1 week 700
  • employes > employs or employees - Next time search lowercase only
  • enroute > en route - Mostly skipped despite what the dictionaries say
F January 2013 3 weeks 2750
  • floatation > flotation - Not done, some dictionaries allow it
  • fo > of/for/to - About 700 of these; I only searched for lowercase examples
  • followup > follow-up (adjective) - About 500 of these
  • forecasted > forecast - Not done, some dictionaries allow it
  • forego > forgo - Not done, some dictionaries allow "forego" to mean "go without"
  • forewent > forwent - Not done, similarly
  • forrest > forest - Only searching for lowercase "f" to avoid the thousands of false positives
June 2015 2 weeks 1500
April 2017 1 week 900
  • forth > fourth - I only searched for forth followed by a few likely words
July 2019 2 weeks 1050
  • followup > follow-up (adjective) - On hold after a revert; my post attracted no comments
July 2021 2 weeks 950
  • for exemple > for example - Not done, included in my "e" list
April 2023 2 weeks 590
  • fiel > feel, field, file, phial - Four fixes, 1100+ false positives
G February 2013 8 days 1050
  • gae > game/Gael/gale - One fix, 400+ false positives
  • GameBoy > Game Boy - Needs an expert. I only tackled GameBoy Advance and GameBoy Color
  • Guiseppe > Giuseppe (et al.) - Very tricky, as each fix needed research and a detailed edit summary
August 2015 2 weeks 500
May 2017 5 days 300
August 2019 1 week 500
August 2021 1 week 350
May 2023 1 week 290
  • glamourous > glamorous - Not done, some dictionaries allow it
H March 2013 11 days 1100
  • Habsbourg > Habsburg - Not done, needs an expert
  • hace > hare - 900 false positives and no fixes
  • honourary > honorary - Tricky, since 'honourary' may be valid in Canadian articles
September 2015 1 week 550
  • hace > hare - 4 fixes (to have) and an 1100-character regex for the false positives
June 2017 4 days 500
  • heros > heroes - Doable: 50+ fixes, 800 false positives
September 2019 11 days 650 Delayed by phab:T232491
September 2021 1 week 500
  • honourary > honorary - Still not brave enough to do this in Canadian articles
July 2023 5 days 380
I April 2013 3 weeks 3500
  • In Memorium > In Memoriam - About 60 fixes, but only where I could check with a source
  • Inchon > Incheon - Not done, is correct in historical contexts
  • indite > indict - Not done, I'm not familiar with either word
  • inputted > input - Not done, some dictionaries allow it
  • intermural > intramural - Not done, I'm not familiar with US school/college terminology
  • internment > interment - About 100 fixes out of 3,500 total uses
  • ironical > ironic - Not done, some dictionaries allow it
  • its' > its or it's - About 750 of these
September 2015 2 weeks 1250
  • internment > interment - Now I have a 3000-character regex for many of the false positives
July 2017 2 weeks 1300
November 2019 3 weeks 1700
October 2021 2 weeks 800
September 2023 1 week 900
  • internment > interment - Another 70 fixes
J June 2013 1 day 150
  • judgement > judgment - Not attempted, no efficient way to skip the correct uses
  • judgment > judgement - Likewise
October 2015 1 day 60
July 2017 1 day 50
December 2019 1 day 60
November 2021 1 day 100
September 2023 1 day 90
K June 2013 1 day 160
  • kiloohm > kilohm - Why?
  • Kingdon > Kingdom - A few fixes, but it's quite a common surname
October 2015 1 day 70
August 2017 1 day 110
December 2019 1 day 60
November 2021 1 day 60
November 2023 1 day 40
L June 2013 3 weeks 2000
  • leafs > leaves - Only searching for lowercase matches
  • leant > leaned - Not done, some dictionaries allow it
  • lessor > lesser - 20 changed out of 600; next time do this separately
  • lite > light - Not attempted, no efficient way to work through the 5,000 matches
  • locus > locust - Done separately; just one fix (to lotus) out of 6,000
  • loose > lose - 250 changed out of 43000; much manual checking despite a 5Kb regex
  • 26 different misspellings of Louisiana - 210 fixes; I searched for \blou[ains]+a\b
  • Lousia > Louisa - 33 found by the Louisiana regex
October 2015 10 days 1100

An overall 90% false positive rate despite last time's regex

September 2017 2 weeks 1000
  • lead > lede - Not done, surely uncommon
  • locus > locust - Not done, but should include it next time
  • loose > lose - 180 changed out of 69000; regex expanded to 19Kb
January 2019 2 weeks 1000
  • lillies > lilies - Next time add the "skip if in a quoted fragment" regex
  • locus > locust - One fix, at caterpillar
  • loose > lose - Abandoned; next time use the small regex that identifies likely errors
November 2021 1 week 550
  • leaded > led - Next time use the simplified exclusion regex I've been working on
  • loose > lose - 19 changed, using a pragmatic and simple search
November 2023 2 weeks 430

Slowed by off-wiki distractions

M September 2013 5 weeks 4100
  • majorly > mainly - Not done, hard to tell which shade of meaning is intended
  • Malcom > Malcolm - 300 changed, but only where verifiable. "Malcom" is common as a surname.
  • mens > men's - 700, avoiding those in Latin, Dutch, Danish...
  • Micheal > Michael - 500 changed, but only where verifiable. "Micheal" is common in Ireland.
  • minimali[sz]e > minimi[sz]e - Not done, some dictionaries allow it
  • monoatomic > monatomic - Not done, allowed by Collins
  • moreso > more so - Abandoned, too subtle to be a mere typo fix
  • moveable > movable - Not done, many dictionaries allow it

I didn't tackle "manouver" and its variants very thoroughly. In many cases it is hard to decide whether to correct to the British or American spelling; and the British spelling is frankly silly. I'll wait for the next edition of the Concise Oxford.

December 2015 3 weeks 3600
  • Malcom > Malcolm - 100 fixes
  • mens > men's - 2250 fixes, thanks to many errors inside {{MedalSport}}
  • Micheal > Michael - 100 fixes
September 2017 3 weeks 1900

Still very slow, and only a 20% hit rate, thanks to...

  • Malcom > Malcolm - 130 forenames fixed, but only where verifiable
  • mens > men's - 350 fixed, avoiding non-English uses and other false positives
  • Micheal > Michael - 220 fixed, but only where verifiable
  • Millenium > Millennium - 140 fixed, but in proper names only where verifiable
January 2020 4 weeks 2050

Did I mention that this one is slow?

  • Malcom > Malcolm - 90 fixes
  • mens > men's - 400 fixes
  • Micheal > Michael - 190 fixes
  • Millenium > Millennium - 210 fixes
  • milion > million - 90 fixes, avoiding non-English uses and the Milion
January 2022 3 weeks 1550

How about omitting Malcom, Micheal and Millenium next time?

January 2024 2 weeks 1200
  • Malcom > Malcolm - 1 fix, left out of the scan this time
  • Massachussetts (etc) > Massachusetts - 120 fixes
  • mens > men's - 330 fixes
  • Micheal > Michael - Not done this time
  • Millenium > Millennium - 150 fixes, perhaps scan for lowercase only next time
  • milion > million - 80 fixes
N December 2013 4 days 500
  • Newyorker > New Yorker - Not done thoroughly, there seem to be many special usages
  • noone > no one - 60 fixes, 3500 false positives
February 2016 2 days 300
October 2017 3 days 480
March 2020 5 days 340
  • Newyorker > New Yorker - Abandoned
March 2022 3 days 390
March 2024 3 days 290
O January 2014 8 days 1300
  • octostyle > octastyle - Not done, some dictionaries allow it
  • OSes > OSs - Not done, not obviously an improvement
  • outputted > output - Not done, some dictionaries allow it
February 2016 6 days 840
November 2017 6 days 700
  • Ottowa > Ottawa - 100 fixes
March 2020 5 days 580
March 2022 1 week 740
March 2024 1 week 660
  • overlayed > overlaid - Not done, too subtle for me
P March 2014 4 weeks 4000
  • paide > paid - 4 fixes, 400+ false positives
  • panal > panel - 19 fixes, 400+ false positives
  • parlament > parliament - 120 fixes, 500+ false positives, 100+ foreign-language suffixes
  • pasted > passed - 3 fixes, 1200 false positives
  • persaude > persuade - Next time search for persaud\w+ not persaud\w*
  • pidgeon > pigeon - 20 fixes, 1000 false positives
  • planed > planned - 4 fixes, 200+ false positives
  • plentitude > plenitude - Not done, some dictionaries allow it
  • pokemon > pokémon - Not done, needs an expert
  • predominately > predominantly - Not done, modern dictionaries allow it
  • profesional > professional - 235 fixes, 1600 foreign-language false positives
  • pwn > own - 4 fixes (owner (2), pen, Pennytown), 900 false positives
March 2016 3 weeks 1900
November 2017 3 weeks 2200
March 2020 3 weeks 1600
March 2022 2 weeks 1400
June 2024 2 weeks 1350
Q August 2013 3 days 300
  • quitted > quit - Not done, many dictionaries allow it
April 2016 1 day 120
December 2017 2 days 200
April 2020 1 day 100
April 2022 1 day 90
June 2024 1 day 80
R April 2014 2 weeks 1700
  • roman > Roman - 1030 fixes; I need more regular expressions to skip the false positives
  • romans > Romans - Remember to include this next time

My thanks to Arjayay for regularly tackling most of the entries in this list.

May 2016 1 week 750
January 2018 1 week 650
  • roman > Roman - 220 fixes; 2500-character regular expression
May 2020 5 days 660
May 2022 1 week 700
May 2024 1 week 580
S June 2014 4 weeks 4300
  • sargent > sergeant - 80 fixes, over 8,000 false positives
  • sayed > said - I only searched for lowercase matches; see Sayyid (name)
  • seemless > seamless - I only searched for lowercase matches; see Seemless
  • simulcasted > simulcast - Not done, not convinced this is wrong
  • slippy > slippery - Not done, many dictionaries allow it
  • smoothen > smooth - Not done, many dictionaries allow it
  • smoothes > smooths - Not done, some dictionaries allow it
  • snuck > sneaked - Not done, many dictionaries allow it
  • spacial > spatial or special - "spacial" seems valid, but I corrected 14 to "special"
  • speciality > specialty - Not done, too hard to identify US contexts
  • sportscar > sports car - Not done, see this inconclusive discussion
  • subsequential > subsequent - Not done, Collins allows it
  • symbolical > symbolic - Not done, the COD allows it
  • synthetical > synthetic - Not done, some dictionaries allow it
  • syphon > siphon - Not done, many dictionaries allow it

You'd be amazed how many ways there are to spell "specification"

June 2016 2 weeks 1250
  • sargent > sergeant - 20 fixes, 8,000 false positives, and a 5,700-character regex for next time
February 2018 4 weeks 2500

Hindsight says I didn't build the list correctly last time

  • Sahastra > Sahasra, et al. - Not done, needs an expert
  • setlist > set list - Not done, with luck a future dictionary will include it
May 2020 2 weeks 1500
  • sargent > sergeant - 35 fixes, 10,000 false positives, and the regex is up to 6,700 characters
  • sauter > solder - 1 fix, over 2,000 false positives
  • savy > savvy - 1 fix, over 2,000 false positives
  • sherif > sheriff - 2 fixes, 2,000 false positives
  • soley > solely - 10 fixes, 1,000 false positives
  • stange > strange - 2 fixes, 1,000 false positives
  • staring > starring - 15 fixes, 4,000 false positives
  • steller > stellar - 2 fixes, 1,000 false positives
  • surender > surrender - 2 fixes, 500 false positives
  • syas > says - no fixes, 300 false positives
May 2022 3 weeks 1600
August 2024 3 weeks 1550

Ridiculously inefficient

  • sargent > sergeant - Skipped this time
  • sauter > solder - Skipped this time, next time try lowercase only
  • Shangdong > Shandong - Not done, didn't have the courage
  • sherif > sheriff - Skipped this time, next time search using the small regex from my correction rule
  • Skagerak > Skagerrak - Not done, needs an expert to sort out the special cases
  • spacial > special - One fix, too many false positives
  • supremist > supremacist - Not done, appears to be the standard spelling in some contexts
  • surender > surrender - Skipped this time, next time scan for -ed, -s, -ing and the small regex from my correction rule
  • surprize > surprise - Next time omit from the scan when quoted
  • syas > says - Skipped this time
T March 2010 3 weeks 2100

I began with "T" because I assumed most editors would begin with "A".

September 2014 3 weeks 2900
  • tendonitis > tendinitis - Not done, several dictionaries allow it
  • tennant > tenant - c 50 fixes, over 4000 false positives
  • ther > the/their/they/her - c 150 fixes, several hundred false positives
  • thru > through - Abandoned after this revert; needs dual proficiency in US/International English
  • todays > today's - c 100 fixes, mostly Today's Zaman
  • tracklisting > track listing - So common that I think it would be disruptive to change them all
  • transitionary > transitional - Not done, several dictionaries allow it
  • trys > tries - c 70 fixes, a few false positives in Romanian text
  • twitter > Twitter - c 500 fixes, but sometimes I applied WP:ELMINOFFICIAL instead
August 2016 10 days 1300
  • traditon > tradition - c 400 fixes in copies of the same text
April 2018 2 weeks 1300
June 2020 2 weeks 1100
  • teh > the - c 60 fixes, over 4000 false positives
  • Telengana > Telangana - c 60 fixes
  • tennant > tenant - c 10 fixes, over 5000 false positives (example), and the regex is up to 4,300 characters
  • themself > whatever - Not done, increasingly a mainstream usage
  • ther > the/their/they/her - c 60 fixes, many false positives
  • traning > training - c 40 fixes, mostly in copies of the same text
  • twitter > Twitter - c 250 fixes
July 2022 11 days 950
  • twitter > Twitter - mostly abandoned, more a MOS fix than a misspelling
U March 2010 2 weeks 1200
December 2014 1 week 1250
  • unadvisable > inadvisable - Not done, several dictionaries allow it
  • undoubtably > undoubtedly - Not done, listed in my COD
  • ukelele > ukulele - Not done, standard UK spelling
  • unviable > inviable - Not done, several dictionaries allow it
October 2016 5 days 700
June 2018 1 week 840
September 2020 4 days 500
August 2022 3 days 350
V April 2010 10 days 850
December 2014 1 week 1750
  • varios > various - 11 fixes, 400 false positives
  • varius > various - 12 fixes, 1250 false positives
  • viceversa or vice-versa > vice versa - c 750 fixes. March 2015: c 900 more that I'd skipped by mistake
  • viri > viruses - 1 fix, 400 false positives
  • vittel > vittle - Searched for lowercase only
October 2016 4 days 620
June 2018 1 week 735
  • vittel > vittle - Searched uppercase also
September 2020 1 week 500
September 2022 1 week 580
W April 2010 2 weeks 1200
January 2015 2 weeks 2100
  • wanna > want to - Not done, far too common in song titles, lyrics, quotes
  • wass > was - Searched for lowercase only
  • webcasted > webcast - Not done, not convinced
  • wich > which - 100 fixes, 300 false positives
  • womens > women's - 750 fixes, in titles etc. only where verifiable
  • wont > won't - 100 fixes, 1000 false positives
  • ws > was - 150 fixes, 6000 false positives
November 2016 1 week 820
  • walla > voilà - Two fixes, to Wala and Wallah, 3000 false positives
  • wholistic > holistic - Not done, some dictionaries allow it
August 2018 1 week 940
  • walla > voilà - One fix, to walls, 3000 false positives
September 2020 1 week 700
  • walla > voilà - Omit next time, not worth the effort
September 2022 1 week 800
X April 2010 1 day 60
February 2015 1 day 175
November 2016 1 day 75
August 2018 1 day 90
September 2020 1 day 50
September 2022 1 day 40
Y May 2010 1 day 60
February 2015 1 day 75
  • ya'll > y'all - Not done, too many quotes and song titles
November 2016 1 day 50
August 2018 1 day 50
September 2020 1 day 50
September 2022 1 day 30
Z May 2010 1 day 40
February 2015 1 day 20
November 2016 1 day 2
September 2018 1 day 8
September 2020 1 day 8
September 2022 1 day 8
  • Ziegfield > Ziegfeld - no longer in the list, but checked anyway
Repetitions May 2010 1 month 3500 Using only the Google search, so I must have missed many.
Grammar and Misc June 2010 19 months 70000
  • current flow > current - Abandoned after this discussion
  • in January 1 > on January 1 - This single line took three months
  • try and > try to - Abandoned after this discussion. c. 10K occurrences, so rather daunting anyway
  • up until > until - Not attempted; appears to be a style issue rather than a clear grammar error. c. 14K occurrences, too many to tackle
  • will likely > will probably - Not attempted; appears to be a style issue rather than a clear grammar error. Only 2K occurrences, so I may get round to it
  • Dates - Not attempted; huge tasks, gradually being dealt with by the general fixes
  • Hyphens - Not attempted; not clearly defined, and too minor to tackle in my opinion

Repeated words

The the

The settings file is here

I like to scan for "the the" errors whenever I download a new database dump. My regular expression searches for:

  • Either "The" or "the", followed by "the"
  • Perhaps with quotes or apostrophes in between
    • ...said it was the "the greatest thing since sliced bread"
    • ...announced the the sale of the century
  • Perhaps where the second "the" is an article title or in a piped link
    • ...the worst outrage since the [[the Troubles]]
    • ...did well in the [[1969–70 NBA season|the previous season]]

I don't search for "The The" or "the The" where the second "The" is uppercase. I used to, but after a while I couldn't decide whether "The The Who tour..." or "...the The Times reporter..." looked wrong or not.
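The search described above can be sketched in Python. This is only an illustration: the actual regular expression lives in the AWB settings file and is not reproduced here, so the pattern `THE_THE` and the helper `has_the_the` are hypothetical names, and the pattern is a simplified guess at the behaviour listed in the bullets.

```python
import re

# Hypothetical, simplified pattern; NOT the regex from the settings file.
THE_THE = re.compile(
    r"\b[Tt]he"                # first word: "The" or "the"
    r"(?:\s"                   # separators: whitespace,
    r"|[\"'’`]"                # quotes or apostrophes,
    r"|\[\[(?:[^\[\]|]*\|)?"   # or "[[" with an optional "target|" pipe
    r")+"
    r"the\b"                   # second word: lowercase "the" only
)


def has_the_the(text: str) -> bool:
    """Return True if the text contains a doubled "the" of the kinds listed above."""
    return THE_THE.search(text) is not None


samples = [
    '...said it was the "the greatest thing since sliced bread"',
    "...announced the the sale of the century",
    "...the worst outrage since the [[the Troubles]]",
    "...did well in the [[1969–70 NBA season|the previous season]]",
    "...the The Times reporter...",  # uppercase second word: deliberately ignored
]
for sample in samples:
    print(has_the_the(sample), sample)
```

Because the second word must be a lowercase "the", strings such as "The The Who tour" are not reported, which matches the choice described in the preceding paragraph.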

More generally

The settings files are here: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

After each "List of common misspellings", I've been scanning for repeated words beginning with that letter. Here is the main part of the regexp for the letter "U"...

\b[Uu](?<!https?://[\w\.\,\:\/\?\&\%\+\=\-\#_]+)([a-z]+)(\s|’|'|`|"|\]\]|\[\[(?!Category:)[^\[\]\|]*\||\[\[(?!Category:)(?=[^\[\]\|]*\]\]))+u\1\b

...which searches for a word beginning with "U" or "u", followed by the same word beginning with "u". I found that a search for two uppercase words found too many false positives in book/film/song titles. The words may appear inside wikilinks and may be separated by various kinds of quote mark.
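To experiment with this idea outside AWB, a rough Python adaptation might look like the sketch below. It is an approximation under stated assumptions: the function name `repeated_word_pattern` is hypothetical, and the URL-skipping lookbehind is left out because Python's `re` module only accepts fixed-width lookbehind, whereas the expression above relies on a variable-length one (which the .NET regex engine allows). The sketch will therefore also report repeats that occur inside URLs.

```python
import re


def repeated_word_pattern(letter: str) -> re.Pattern:
    """Build a simplified repeated-word pattern for one initial letter.

    Hypothetical helper: it mirrors the structure of the regex above but
    omits the URL lookbehind, so it is not equivalent to the original.
    """
    if len(letter) != 1 or not letter.isascii() or not letter.isalpha():
        raise ValueError("expected a single ASCII letter")
    lower, upper = letter.lower(), letter.upper()
    separator = (
        r"(?:\s|[’'`\"]|\]\]"                      # whitespace, quotes, link close
        r"|\[\[(?!Category:)[^\[\]|]*\|"           # piped link: "[[target|"
        r"|\[\[(?!Category:)(?=[^\[\]|]*\]\])"     # unpiped link: "[["
        r")+"
    )
    # First word may start with either case; the repeat must start lowercase.
    return re.compile(rf"\b[{upper}{lower}]([a-z]+){separator}{lower}\1\b")


pattern_u = repeated_word_pattern("u")
for text in [
    "kept under under review",
    "Under under the bridge",
    "Under Under",            # two uppercase words: not matched
    "a used [[used car]]",
    "up upon",                # different words: not matched
]:
    print(bool(pattern_u.search(text)), text)
```

The same helper built for another letter shows why binomial names cause trouble: `repeated_word_pattern("b")` matches "Bubo bubo".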

The main false positives are species names. I began by telling the database scanner to skip any article containing a {{Taxobox}} or {{Automatic taxobox}}, and added a rule that guessed that any Latin-like word ending was a false positive. I later decided this was a mistake, and now deal with these more thoroughly.

Many other false positives turn up, so I add additional rules as needed.