Wikipedia:AutoWikiBrowser/Typos/Guide

These are the typo regular expressions for RegExTypoFix (Regular Expression Typographical error Fixer, or RETF). Development has been open to the public since 2006.

Please add to or improve these regular expressions!

Description

These regular expressions find and fix common misspellings and grammatical errors. The primary advantages of RegExTypoFix over other spellchecking engines and approaches are its accuracy and the return of exactly one replacement per match. The rules below are written to give as few false positives as possible. Errors should occur only in extremely rare usages or when parsing other languages (and even then, an expression that produces too many false positives will be modified). On everyday English text, accuracy should be close to 100%.

RegExTypoFix is used on diverse text from many languages within the English Wikipedia. It is also used on other MediaWiki-based wikis, and derivatives can be used in other software. The result is a heavily tested, well-vetted set of automatic corrections. Even so, because text varies so widely, RegExTypoFix is not accurate enough to run on an encyclopedia such as Wikipedia without a human checking every proposed correction.

Syntax of the expressions is described in full on the MSDN website, though for the purposes of this page the Well House summary is likely easier to use.

Usage

Everyone using RegExTypoFix should use it responsibly. Check every edit before you make it. If in doubt, SKIP. This typo list is used by the in-browser editor and multiple Wikipedia tools.

AutoWikiBrowser (AWB)

AWB purposely avoids fixing typos in certain areas of the wiki-text. Typo fixing is prevented within image names, template names and parameters, wikilink targets, text in quotations and italics, and any text that follows a colon or asterisk. If a typo rule matches a wikilink target, that rule is ignored on the whole page.
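
The wikilink-target behaviour can be illustrated with a minimal Python sketch. This is not AWB's actual implementation (AWB is written in C#); the helper names `wikilink_targets` and `rule_is_skipped`, the sample page text, and the sample rule are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical helper pattern: captures the target part of [[target]] or [[target|label]].
WIKILINK = re.compile(r"\[\[([^\]\|]+)")

def wikilink_targets(wikitext):
    """Return the target of every wikilink found in the text."""
    return [m.group(1).strip() for m in WIKILINK.finditer(wikitext)]

def rule_is_skipped(find_regex, wikitext):
    """Return True if the rule matches any wikilink target on the page.

    Assumption: a single match in any link target disables the rule for
    the whole page, as the paragraph above describes.
    """
    pattern = re.compile(find_regex)
    return any(pattern.search(target) for target in wikilink_targets(wikitext))

# Illustrative rule and page text (not taken from the live typo list).
rule = r"\b([Rr])eciev"
page_with_link = "The [[Recieve (album)|album]] was well recieved."
page_without_link = "It was well recieved by [[critic]]s."
```

In this sketch, `rule_is_skipped(rule, page_with_link)` is true because the misspelling appears in a link target, so the misspelling in the running text of that page would also be left alone.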

When using AWB, you can refresh the typo list by selecting "File → Refresh status/typos" (CTRL-R). This is useful when you are modifying the typo list on Wikipedia while using AWB to test/process the modification (but basic testing should first be done offline—e.g. by using AWB's Regex Tester or "Find and replace").

JavaScript Wiki Browser (JWB)

The JavaScript Wiki Browser follows the same rules as the downloadable AWB for deciding where typo fixing is skipped: typo rules are not applied to image names, template names and parameters, quotes, or any text following a colon or asterisk, and any rule that also matches a wikilink target on the page is skipped. Additionally, JWB ignores any typo that occurs on the same line of text as {{sic, to avoid fixing intentional or transcribed typos. Because some browsers do not support lookbehinds, any replacement rules containing lookbehinds (?<= and ?<!) are ignored on those browsers; browsers that do support lookbehinds apply these rules as normal.
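
As a rough illustration of the two JWB-specific behaviours (skipping lines that contain {{sic, and skipping lookbehind rules where they are unsupported), here is a minimal Python sketch. JWB itself is written in JavaScript, so this is only an analogy: the function names, the `lookbehind_supported` flag, and the exact-case match on `{{sic` are assumptions made for the example.

```python
import re

# Detects the lookbehind constructs (?<= and (?<! inside a rule's regex.
LOOKBEHIND = re.compile(r"\(\?<[=!]")

def rule_uses_lookbehind(find_regex):
    return LOOKBEHIND.search(find_regex) is not None

def fix_line(line, find_regex, replacement, lookbehind_supported=True):
    """Apply one rule to one line, mimicking the two skip conditions."""
    # Lines containing {{sic are left untouched (exact-case match is an assumption).
    if "{{sic" in line:
        return line
    # Lookbehind rules are ignored when the (hypothetical) engine lacks support.
    if rule_uses_lookbehind(find_regex) and not lookbehind_supported:
        return line
    return re.sub(find_regex, replacement, line)

# Illustrative rule using Python's \1 replacement syntax.
fixed = fix_line("He recieved it.", r"\b([Rr])eciev", r"\1eceiv")
kept = fix_line("He recieved{{sic}} it.", r"\b([Rr])eciev", r"\1eceiv")
```

Here `fixed` is "He received it." while `kept` is returned unchanged because of the {{sic marker.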

To refresh the typo list, click the refresh button right next to the checkbox that enables typo fixing.

WPCleaner

WPCleaner also purposely avoids fixing typos in certain areas of the wiki-text. Because Java handles lookbehinds somewhat differently from C#, any replacement rule containing a lookbehind (?<= or ?<!) is rejected if the lookbehind expression has no obvious maximum length (for example, a lookbehind that uses quantifiers such as * or + will probably be rejected). Rules starting with \{\{ are applied only at the beginning of templates, and rules starting with \[\[ only at the beginning of internal links. For other rules, typo fixing is prevented within:

  • comments,
  • internal links, except for the text description when the link is in the form [[link|description]],
  • images, except for the text description or the alternate text description,
  • templates,
  • categories,
  • interwiki links, except for the text description when the link is in the form [[xx:link|description]],
  • language links,
  • external links, except for the text description when the link is in the form [http://xxxx/ description],
  • defaultsort,
  • tags,
  • between <gallery>...</gallery>, <math>...</math>, <code>...</code> or <timeline>...</timeline> tags,
  • if the text is surrounded by dots, themselves surrounded by letters or digits.
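
The lookbehind restriction described at the start of this section (no "obvious maximum length") can be approximated before a rule is submitted. The following Python sketch is a deliberately simple heuristic, not WPCleaner's or Java's actual check: the function names are hypothetical, it ignores character classes such as [()] and [*+], and it treats any unescaped *, + or {n,} inside a lookbehind as unbounded.

```python
import re

def lookbehind_bodies(pattern):
    """Yield the text inside each (?<=...) or (?<!...) group."""
    i = 0
    while i < len(pattern):
        if pattern.startswith(("(?<=", "(?<!"), i):
            depth, j = 1, i + 4
            start = j
            while j < len(pattern) and depth > 0:
                if pattern[j] == "\\":
                    j += 2              # skip an escaped character
                    continue
                if pattern[j] == "(":
                    depth += 1
                elif pattern[j] == ")":
                    depth -= 1
                j += 1
            end = j - 1 if depth == 0 else j   # tolerate an unterminated group
            yield pattern[start:end]
            i = j
        else:
            i += 2 if pattern[i] == "\\" else 1

# An unescaped *, + or open-ended {n,} quantifier.
UNBOUNDED = re.compile(r"(?<!\\)(?:[*+]|\{\d+,\})")

def has_unbounded_lookbehind(pattern):
    """Heuristic: True if any lookbehind contains an unbounded quantifier."""
    return any(UNBOUNDED.search(body) for body in lookbehind_bodies(pattern))
```

Under this heuristic, a rule such as `(?<=\bthe\s+)cat` would be flagged, while `(?<=the )cat` and `(?<!a{1,3})b+` would not, because the `+` in the last one sits outside the lookbehind.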

When using WPCleaner, you can refresh the typo list by clicking on the button in the main window.

wikEd

In the Wikipedia gadget wikEd, the rules are applied everywhere.

Adding/changing a misspelling

The syntax for each rule is as follows (according to the AWB and wikEd source code):

<Typo word="Optional name for this rule" find="Regex code to detect the error" replace="Replacement for the error"/>

The "word" parameter is optional and any additional spaces between the parameters are ignored.
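
To make the rule format concrete, the sketch below parses one rule line and applies it to a sample sentence in Python. It is an illustration only, not the parser used by AWB or wikEd. The sample rule is invented for the example rather than copied from the live list, the helper names are hypothetical, and the conversion of `$1`-style group references (the .NET replacement syntax) to Python's `\g<1>` form is an assumption about how replacements are written.

```python
import re

# Matches one <Typo .../> element; the word attribute is optional and
# extra whitespace between attributes is tolerated.
TYPO_RULE = re.compile(
    r'<Typo\s+(?:word="(?P<word>[^"]*)"\s+)?'
    r'find="(?P<find>[^"]*)"\s+'
    r'replace="(?P<replace>[^"]*)"\s*/>'
)

def parse_rule(line):
    """Return (word, find, replace); word is None when omitted."""
    m = TYPO_RULE.search(line)
    if m is None:
        raise ValueError("not a Typo rule: " + line)
    return m.group("word"), m.group("find"), m.group("replace")

def to_python_replacement(replace):
    # Convert $1, $2, ... to Python's \g<1>, \g<2>, ... group references.
    return re.sub(r"\$(\d+)", r"\\g<\1>", replace)

# Invented example rule.
line = r'<Typo word="Receive" find="\b([Rr])eciev" replace="$1eceiv"/>'
word, find, replace = parse_rule(line)
fixed = re.sub(find, to_python_replacement(replace), "They recieved it.")
```

With this invented rule, `fixed` becomes "They received it.". Python's `re` module differs from the .NET engine in some details, so a rule that behaves well here should still be checked in AWB's own tester.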

Before editing this page

  • Note that all typo rules are case-sensitive. This affects how they are written and tested.
  • Test your proposed change by using an ordinary Wikipedia search or an AWB Google Search with a "Find and Replace" configured. This may reveal that your rule will sometimes damage correct text, or may sometimes make the wrong correction. In these cases do not add the rule here; instead, consider adding it to the Lists of common misspellings.
  • If you do not know how to make a change, suggest it here, where a knowledgeable user will add it for you.
  • Keep in mind that every additional word or alternative spelling in a rule uses more CPU and slows scanning.
  • Note that only words outside wikimarkup and URLs are fixed, so a rule to fix, say, a wiki template will not work on AWB.

Writing typo rules

  • Aim to have a single rule for each root word, prefix, and suffix.
  • Avoid having a rule detect a spelling outside its intended scope (for example, a rule that fixes housa to house must not detect thousand or house). Add word boundaries (\b) to both ends of the regex unless you are matching errors in parts of words or multiple words.
  • Do not expect rules to be applied in the order they appear.
  • Write fast rules:
    • Beginnings are expensive, so be specific in the matching of the first few characters to eliminate possibilities quickly.
    • Where possible, apply the quantifiers * and + only to a single character, and avoid them entirely if you can: they put extra strain on the CPU and are apt to match something other than what you expect.
  • Each rule must be completely independent.
  • Update the rule name if you change something that affects it.
  • Lookbehind constructs (?<= and ?<!) are not supported by wikEd, nor by JavaScript Wiki Browser in some web browsers (notably Firefox and Safari as of October 2019); rules that use them may be skipped there.
  • Because the typo rules are case-sensitive, be sure to handle all reasonable case possibilities.
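
The points about scope and case sensitivity can be seen in a short Python sketch using the housa example from the list above. The sample sentence is invented, and Python's `\1` stands in for the group reference a real rule would use in its replace attribute.

```python
import re

text = "A thousand people saw the housa. Housa prices rose."

# Too broad: without word boundaries the rule also fires inside "thousand",
# and being case-sensitive it misses the capitalised "Housa".
loose = re.sub(r"housa", "house", text)

# Scoped: \b on both ends keeps the match to the whole word, and the
# captured first letter handles both lower and upper case.
strict = re.sub(r"\b([Hh])ousa\b", r"\1ouse", text)

print(loose)   # A thousend people saw the house. Housa prices rose.
print(strict)  # A thousand people saw the house. House prices rose.
```

The loose rule damages a correct word and still misses one of the typos, while the scoped rule fixes both typos and leaves "thousand" alone.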

Testing typo rules

  • Use the AWB Regular Expression tester, AWB's "Find and replace", or something similar before adding here. If you use AWB's "Find and replace", make sure "CaseSensitive", "Regex" and "Enabled" in Normal settings (or "Case sensitive", "Regular expression" and "Enabled" in Advanced settings) are checked for each rule tested.
  • Verify with AWB or wikEd immediately after you add them. If they do not work, remove them first, and analyze later.
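
For a quick offline sanity check before trying a rule in AWB, a small table of cases that should change and cases that must stay untouched can be run through the rule. The Python sketch below shows one way to do that; `check_rule` and the test cases are hypothetical, and because Python's `re` module is not the .NET engine that AWB uses, this complements the AWB tester rather than replacing it.

```python
import re

def check_rule(find_regex, replacement, should_fix, should_keep):
    """Return a list of failure messages; an empty list means every case passed.

    should_fix  maps misspelled input to the expected corrected output.
    should_keep lists correct text that the rule must not match at all.
    """
    pattern = re.compile(find_regex)  # case-sensitive by default, like the typo rules
    failures = []
    for wrong, right in should_fix.items():
        got = pattern.sub(replacement, wrong)
        if got != right:
            failures.append(f"{wrong!r} -> {got!r}, expected {right!r}")
    for text in should_keep:
        if pattern.search(text):
            failures.append(f"false positive on {text!r}")
    return failures

good = check_rule(
    r"\b([Hh])ousa\b", r"\1ouse",
    should_fix={"housa": "house", "Housa": "House"},
    should_keep=["thousand", "house", "Hausa"],
)
bad = check_rule(r"housa", "house", {"housa": "house"}, ["thousand"])
```

Here `good` is an empty list, while `bad` reports a false positive on "thousand", which is the kind of result that should keep a rule out of the list.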

To do

Typo list

All changes to this list are live. AWB loads directly from this list whenever someone invokes the RETF option.