User:SMcCandlish/TidyRefs

User script
TidyRefs
Description	Harmonize spacing and attribute value quoting in <ref> tags
Author(s)	SMcCandlish
Status	Working for core functionality (more features in development)
Updated	24 January 2024
Browsers	Likely all
Skins	Likely all
Source	User:SMcCandlish/TidyRefs.js

User:SMcCandlish/TidyRefs.js is a user JavaScript for your common.js page. It adds two options to the "Tools" menu (on the left in most skins), "<Tidy>" and "<Tidy> (vertically)" [the latter still in development]. These only appear in the menu when in editing mode.

"<Tidy>" normalizes all horizontal <ref> citation code throughout the article to have consistent spacing within, quoted attribute values, and lowercased tag and attribute names. This includes <ref>...</ref> and <ref /> instances, and also includes fixing visually disruptive vertical ones to be horizontal.
"<Tidy> (vertically)" - [Forthcoming.] When developed, this will format <ref>...</ref> tags vertically in some sane manner, with consistent spacing plus the quote-marks fixes, and should only be used in a page-bottom citations section that is using vertical citations in list-defined references (LDR) style.
- In an article using LDR, the article body will contain horizontal citations, and the LDR references at the bottom may be vertical (though this is not required). In such a case of mixed citation formatting, the way to use these scripts is to copy–paste the vertical LDR references into a user sandbox, run "<Tidy>" on the entire article (don't save it yet), run "<Tidy> (vertically)", when available, on the vertical references in the sandbox, and copy the vertical references back out from the sandbox and paste them over the undesirably horizontalized ones at the bottom of the article.

Usually, neither function should be used without also making a more substantive change (at least fix a typo or something) in the same edit, per the human-editor rules at WP:COSMETICBOT.

This script does not do anything with the CS1/CS2 citation templates ({{cite web}}, {{cite journal}}, {{citation}}, etc.) that form the content between the <ref>...</ref> tags (i.e., it does not clean up {{cite book | last = Ceedie | first = A. B. | title = My Book | ... }} to {{cite book |last=Ceedie |first=A. B. |title=My Book |...}}). The script for doing that, to run along with TidyRefs, is User:SMcCandlish/TidyCitations.

Installation instructions

Put the line:

{{subst:Load user script|User:SMcCandlish/TidyRefs.js|User:SMcCandlish/TidyRefs}}

in either your common.js or the skin.js of your current skin, save the page, and bypass your browser cache.

The function importScript was deprecated in the July 2017 release of MediaWiki 1.29, and mw.loader is preferred.^[1] But importScript is not obsolete and still works, in case you prefer the old method of manually installing with {{subst:Install user script|User:SMcCandlish/TidyRefs.js}} or using ScriptInstaller.
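As a rough sketch of what the mw.loader approach amounts to in common.js: the helper name rawScriptUrl is hypothetical, and the /w/index.php?...&action=raw&ctype=text/javascript URL pattern is an assumption based on common en.Wikipedia practice, not something taken from the TidyRefs source. The exact line that the template writes may differ.

```javascript
// Hypothetical helper: build the raw-JavaScript URL for a user script page.
// The URL pattern is an assumption (common en.Wikipedia practice).
function rawScriptUrl(pageTitle) {
  const title = encodeURIComponent(pageTitle)
    .replace(/%3A/g, ':')   // keep the namespace colon readable
    .replace(/%2F/g, '/');  // keep the subpage slash readable
  return '/w/index.php?title=' + title + '&action=raw&ctype=text/javascript';
}

const url = rawScriptUrl('User:SMcCandlish/TidyRefs.js');

// mw.loader.load() only exists inside MediaWiki pages, so guard the call.
if (typeof mw !== 'undefined' && mw.loader) {
  mw.loader.load(url); // Backlink: [[User:SMcCandlish/TidyRefs]]
}
```

Outside a wiki page the guard simply skips the load, so the snippet can be tried in any JavaScript console.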

Usage

TidyRefs will add two menu items to the p-toolbar when in edit mode:

<Tidy>
<Tidy> (vertically)

[The second of these is still in development.]

Clicking them will harmonize citations in a mess as bad as this (with several instances of invalid markup):

<ref    name  =  'Jarnow 2018'   group  =  notes  >{{cite book ...}}</ref>
<ref
 name
  =
   "
    Jarnow 2018
   "
 group
  =
   '
    notes
   '
/>
<REF Name=Jarnow 2018 Group=notes/>
<ref name="Jarnow 2018"group=notes />

to one of the following:

<Tidy>:

<ref name="Jarnow 2018" group="notes">{{cite book ...}}</ref>
<ref name="Jarnow 2018" group="notes" />
<ref name="Jarnow 2018" group="notes" />
<ref name="Jarnow 2018" group="notes" />

<Tidy> (vertically):

[Nothing will be done yet! Format forthcoming, after more study of what vertical citations are doing.]

Features

Put double quotes around attribute values (leaves them alone if already quoted): <ref name=foo /> → <ref name="foo" />
Change single quotes around attribute values to the required double: <ref name='foo' /> → <ref name="foo" />
Fix invalid nested double quotes: <ref name="foo "bar" baz" /> or <ref name='foo "bar" baz' /> → <ref name="foo 'bar' baz" />
Enforce spacing between attribute="value" pairs: <ref name="foo"group="bar" /> → <ref name="foo" group="bar" />
Remove extraneous spacing around attribute="value" pairs: <ref   name="foo"   group="bar"   /> → <ref name="foo" group="bar" />
Remove extraneous spacing between attribute, =, and "value": <ref name = "foo" /> → <ref name="foo" />
Remove extraneous spacing around the citation content between the ref tags: <ref name="foo"> {{cite journal |...}} </ref> → <ref name="foo">{{cite journal |...}}</ref>
In the horizontal version of the script, this cleanup also applies to line-breaks, not just whitespace on the same line. I.e., makes short work of vertically formatted citation code in mid-article – however, the {{cite book}}, etc., templates inside the <ref>...</ref> have to be cleaned up with a separate script, User:SMcCandlish/TidyCitations (which can be done in the same edit).
Remove extraneous spacing inside attribute values: <ref name=" foo " /> → <ref name="foo" /> (but not desired ones between words/names: <ref name="foo bar" /> is untouched).
Enforce a space in front of /> (the spaced version is understood by more parsers): <ref name="foo"/> → <ref name="foo" />
- Bonus cleanup: does the same thing with <br>, <br/>, broken </br>, etc. → <br />; and same with the uncommon <hr>, etc. (again, handled properly by more parsers; see also here and here).
Remove a space in front of > by itself: <ref name="foo" >...</ref> → <ref name="foo">...</ref>
Fix broken tags by removing extraneous spacing at the start of <ref>: < ref > → <ref> and < ref ...> → <ref ...>
Fix broken tags by removing extraneous spacing inside </ref>: < / ref > → </ref>
Can handle any sane attribute value, even one containing >.
Detects and fixes multi-word attribute values that aren't quoted: <ref name=Smith Jones 2023/> → <ref name="Smith Jones 2023" />
Reduces all-caps and camelcase tag and attribute names to lowercase: <REF Name="foo" gRoUp="bar" />...</rEF> → <ref name="foo" group="bar" />...</ref>
Applies all these fixes at once, in a single pass.
Detects all of the attributes of <ref>...</ref>, in any order, even the new ones not supported on en.Wikipedia yet. The two that work here already are name= and group=. The two that do not yet are follow= and extends=.
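To make the feature list concrete, here is a much-simplified sketch of the same kind of cleanup. It is not the script's actual regex (see the regex101 link under Credits for that); it swaps in a plainer technique that splits the tag at each known attribute name. The function names tidyAttributes and tidyRefs are hypothetical, and the sketch assumes attribute values never contain ">" or one of the strings name=, group=, follow=, extends=.

```javascript
// Known <ref> attribute names, matched with optional spacing before "=".
const ATTR_NAMES = /\b(name|group|follow|extends)\s*=/gi;

// Rebuild the attribute list of one tag as name="value" pairs.
function tidyAttributes(raw) {
  const hits = [...raw.matchAll(ATTR_NAMES)];
  return hits.map((hit, i) => {
    // A value runs from after "=" to the next attribute name (or the end).
    const start = hit.index + hit[0].length;
    const end = i + 1 < hits.length ? hits[i + 1].index : raw.length;
    const value = raw.slice(start, end)
      .trim()
      .replace(/^["']|["']$/g, '') // drop one outer quote mark at each end
      .replace(/"/g, "'")          // nested double quotes become single
      .replace(/\s+/g, ' ')        // collapse spaces and line-breaks
      .trim();
    return `${hit[1].toLowerCase()}="${value}"`;
  }).join(' ');
}

function tidyRefs(wikitext) {
  return wikitext
    // Closing tags: fix "< / ref >" and drop whitespace before them.
    .replace(/\s*<\s*\/\s*ref\s*>/gi, '</ref>')
    // Opening and self-closed tags.
    .replace(/<\s*ref\b([^>]*?)(\/?)\s*>(\s*)/gi, (m, attrs, slash, after) => {
      const tidy = tidyAttributes(attrs);
      const open = tidy ? `<ref ${tidy}` : '<ref';
      // Self-closed: keep following whitespace. Opening: trim before content.
      return slash ? `${open} />${after}` : `${open}>`;
    });
}

console.log(tidyRefs('<REF Name=Jarnow 2018 Group=notes/>'));
```

Unlike the real script, this sketch makes no attempt at the harder cases listed under "Known limitations", and it would be confused by a ">" inside a value.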

It has been tested against articles as long and complex as Tartan and Donald Trump without producing any unexpected or undesired results, and works quite quickly despite the complexity of the regular expressions and the number of JavaScript operations, across input that (for Wikipedia) is very large.

Forthcoming features

More detection and repair of invalid ref markup, particularly of sorts that MW doesn't throw an error message about, such as empty attributes like name="", or a bare name= or name followed by nothing.
Detect and convert curly-quoted values as in <ref name=“Mcguffin 123”> (it turns out that various editors do this, either on mobile or from editing in a word processor and pasting into our edit window).
Vertical version for refs formatted that way at page-bottom (WP:LDR).
Remove spacing between <ref> tags. Will probably do this by injecting a temporary token after each tag, then parsing for that and the start of a new one, instead of doing more complex regex to read the entire preceding tag again.
Remove spacing between <ref> and other citation templates like {{sfnp}} and {{sfn}}. [It may not actually be feasible to do with such templates that precede the <ref>...</ref>, only those that follow.]
Remove spacing between <ref> and the non-citation content that precedes it (with some exceptions like tables).
Remove linebreak after </ref> if more content in the same paragraph is present.
Maybe another "bonus" cleaner-upper to fix invalid  to required 
Maybe un-"hiding" bare URLs (see here): <ref>[https://example/com/foo/bar.html]</ref> → <ref>https://example/com/foo/bar.html</ref> (CitationBot will try to do something useful with the latter but will not touch the former, and they verge on useless to readers since they just show up as something like "[39]" instead of the URL.)
Need to detect cases of empty <ref name="foo"></ref>, by people who don't understand the syntax, and correct it to <ref name="foo" />
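Two of the planned items above are simple enough to sketch. The following is only an illustration of the idea, not code from the script; the function names straightenRefQuotes and collapseEmptyRefs are hypothetical, and only curly double quotes are handled.

```javascript
// Convert curly double quotes to straight ones, but only inside <ref ...> tags,
// so that typographic quotes in the article prose are left alone.
function straightenRefQuotes(wikitext) {
  return wikitext.replace(/<ref\b[^>]*>/gi, (tag) =>
    tag.replace(/[\u201C\u201D]/g, '"'));
}

// Collapse an empty named pair, <ref name="foo"></ref>, to <ref name="foo" />.
// Tags without attributes and tags with content are not matched.
function collapseEmptyRefs(wikitext) {
  return wikitext.replace(/<ref\b([^>]*[^>\/\s])\s*>\s*<\/ref>/gi, '<ref$1 />');
}

console.log(collapseEmptyRefs('<ref name="foo"></ref>'));
```

Both functions assume the tag spacing has already been tidied, so they would make most sense as extra steps after the main cleanup.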

Test cases page: User:SMcCandlish/sandbox/ref name js testpage – Feel free to add more, but don't save the page after running the script; the whole point is to edit the page, tweak the script and run it, re-edit the page, tweak the script and run it again, etc., using the same test data, until issues go away.

Known limitations

This script is very close to magic, but is not actual magic. A few unusual circumstances may confuse it.

Not really bugs but perhaps "failures to be as maximally forgiving of garbage input as possible", when fed markup that is technically invalid but which MW doesn't presently treat as an error (which means someone somewhere might actually do it):
- If fed the incomplete markup <ref name= group=>foo</ref> with empty but present attributes, it will misparse this and output <ref name=" group=">foo</ref>
- Similarly, the incomplete markup <ref name= group="foo"/> will be misunderstood and result in <ref name=" group='foo'" />
- Incomplete markup of the form <ref name="" /> will not be misparsed but will be skipped entirely.
- Several other bits of weirdness like that. The only practical solutions for them are pre-pass filters that detect and fix them before moving on to the main operations of the script; implementing these will be really tedious.
The script may fail and produce incorrect results if run on blatantly invalid input of kinds that it is not already written to handle – kinds that MW itself cannot handle, and which will show up as visibly broken citations in the rendered page (either cites that don't render at all, or code garbage showing up in the content). Known examples:
- Extraneous junk inside the <ref> such as stray characters (<ref name="foo"x/>), or unrecognized attributes (<ref name="foo" test="bar" />). Presently, en.Wikipedia only recognizes name= and group=; it is not clear when we are getting follow= (deployed at WikiSource and a few other projects) or extends= (in beta since 2019, somehow).
- An attribute value that starts quoted but the quote never ends, e.g. <ref name="foo />
- An attribute value with mismatched quotation marks in the form: <ref name="foo' /> (MW simply sees this as another case of the quote never ending).
- The invalid null markup <ref /> or <ref/> with no attribute and value.
- The exact string /> being used as content inside the value of an attribute: <ref name="foo/>bar" /> (an attribute value with / > with a space between them is fine, though).
- Boneheaded (outside the context of template code) attempts to use MW xtags inside the start or end tag of the <ref> element, as in <ref <includeonly>name="foo"</includeonly> /> or <ref name="<nowiki>foo</nowiki>" />. That's just too broken to contemplate. Same goes for attempts to put an HTML comment inside a tag, like <ref name="" />
An attribute value with mismatched quotation marks in the different form: <ref name='foo" /> (MW will actually render a citation that simple, but if it has a second attribute like group="quux", then that attribute will fail).
The script detects the specific <ref> attributes name=, group=, follow=, and extends= (with or without a space before =). If you use one of those strings as content inside the value of another attribute, then the material will be misparsed. The odds of anything like <ref name="group=foo" group="bar" /> existing in the wild are extremely low, but this would definitely break the script. MW actually does parse that markup, so it's infinitesimally possible for this to happen.
If a citation has the same attribute twice, this will not be detected by the script as a valid citation: <ref name="foo" name="bar">baz</ref> (MW presently just accepts the second and ignores the first, and does not throw an error, for some reason). This might be a common enough error to have the script check for it and do something about it. We'll see. It would be a lot of work.
The script does not detect the presence of <pre>, <syntaxhighlight>, <code><nowiki>, or wikimarkup blocks equivalent to <pre> created by code being put on lines indented by one or more space characters, or any other means of laying out code blocks to present wikimarkup examples. If such an example contains <ref ...>...</ref> or <ref ... /> code that is subject to cleanup by this script, it will be cleaned up. If this is not desired, try replacing <ref with &lt;ref in the example code.
This script is for parsing textual content with citations in it. If you run it against template code, JS code, CSS code, Lua module code, interface pages, and other weird stuff, you are entirely on your own. In theory, it will actually work if it encounters a string of citation code in such a page, but supporting such use is outside the scope of this script.
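A pre-pass of the kind mentioned above could at least flag some of these inputs before the main cleanup runs. The sketch below only detects three of the listed cases (empty attribute values, duplicate attributes, and a self-closed tag with no attributes) and repairs nothing; the function name findRiskyRefTags and the wording of the reasons are hypothetical, and it is not part of the script.

```javascript
const ATTR = '(?:name|group|follow|extends)';

// Return a list of { tag, reasons } for <ref> tags likely to confuse a tidier.
function findRiskyRefTags(wikitext) {
  const findings = [];
  const openTag = /<\s*ref\b([^>]*?)(\/?)\s*>/gi;
  // An attribute followed by "", '', nothing at all, or the next attribute.
  const emptyValue = new RegExp(`\\b${ATTR}\\s*=\\s*(?:""|''|$|${ATTR}\\s*=)`, 'i');
  for (const [tag, attrs, slash] of wikitext.matchAll(openTag)) {
    const names = [...attrs.matchAll(new RegExp(`\\b(${ATTR})\\s*=`, 'gi'))]
      .map((m) => m[1].toLowerCase());
    const reasons = [];
    if (emptyValue.test(attrs)) reasons.push('empty attribute value');
    if (new Set(names).size !== names.length) reasons.push('duplicate attribute');
    if (names.length === 0 && slash) reasons.push('self-closed tag with no attributes');
    if (reasons.length > 0) findings.push({ tag, reasons });
  }
  return findings;
}

console.log(findRiskyRefTags('<ref name= group=>foo</ref>'));
```

Reporting rather than fixing keeps the decision with the editor, which seems the safer default for markup this ambiguous.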

Additional technical notes

Invalid input of <ref name=foo/bar>...</ref> should not be parsed by MW as valid, but presently is and produces a usable citation, so this script detects it and repairs it by putting quotes around the attribute value.
Invalid input of <ref name='foo "bar" baz'>...</ref> or, even worse, <ref name="foo "bar" baz">...</ref> should not be parsed by MW as valid, but in a case this simple it is and produces a usable citation. However, if the inner-quoted material contains a space, then the citation will break, so MW's (probably accidental) handling of this circumstance is faulty. The script detects such a mess and repairs it to <ref name="foo 'bar' baz">...</ref> in both cases.
The <br /> and <hr /> "bonus cleaner" does not detect rare attribute-bearing instances, like <br class="..." />, <hr class="..." /> and so on.
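The attribute-free restriction on the "bonus cleaner" can be illustrated with a short sketch. This is an assumption about how such a cleaner might look, not the script's own code: the function name tidyVoidTags and its default tag list (br and hr) are hypothetical.

```javascript
// Normalize attribute-free void tags such as <hr>, <hr/> or <HR > to "<hr />".
// Anything between the tag name and ">" other than spacing and "/" prevents a
// match, so attribute-bearing tags like <hr class="..." /> are left untouched.
function tidyVoidTags(wikitext, tagNames = ['br', 'hr']) {
  const pattern = new RegExp(`<\\s*(${tagNames.join('|')})\\s*/?\\s*>`, 'gi');
  return wikitext.replace(pattern, (match, name) => `<${name.toLowerCase()} />`);
}

console.log(tidyVoidTags('one<hr>two<HR/>three'));
```

Invalid closing-style variants (a slash before the tag name) are not covered by this pattern and would need their own rule.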

wikEd compatibility

wikEd (an advanced editor you can install via "Gadgets" in the Wikipedia "Preferences" menu) is generally incompatible with scripts, add-ons, or extensions that rely on or change the standard text edit box, and TidyRefs is one of those scripts. The workaround is to temporarily turn off wikEd by pressing the button, making the changes with TidyRefs, then re-enabling wikEd.

There may be a way to fix this, but I would have to install it and figure out what it's doing in detail.

Credits

Kudos for inspiring this work goes to Nick (user:9473764) at Stack Overflow, who first produced a "basic" (actually very complicated) regex that could handle the gist of a <ref> citation with a name= attribute under most circumstances. The material has gotten much more complicated since then to account for numerous legitimate and erroneous use cases, including the four attributes that the tag supports.

The "framing" code around the meat of this script is based on User:SMcCandlish/TidyCitations.js; see credits inside it and on its documentation page.

Regex101 was immensely helpful in the development of this. If you want to examine what the main regex is doing (the ones for other attributes than name= are just variants of it), see https://regex101.com/r/xubdCt/20 which has an "Explanation" panel that walks through it step-by-step.

ChatGPT helped work through a few things, though by the time the regex got much more complicated than Nick's original, the "AI" was no longer able to accurately help much (it was not really able to correctly predict most of the changes it suggested, and kept causing massive regressions). The LLM did help a little with mostly-serviceable JS code snippets for a few tasks, including treating regex capture groups as JS vars, and that saved some time and annoyance.

Change log

1 January 2024 – Development began (after several days of regular-expression testing at https://regex101.com).
21 January 2024 – First version safe to use in mainspace for horizontal-citation cleanup; documentation started.
22 January 2024 – Minor bug fix: had to account for strings "group", "name", "follow", or "extends" within an attribute's value (not being an attribute itself followed by =).
23 January 2024 – Now fixes broken <ref> and </ref> tags (e.g. < ref >, etc.), and removes extraneous whitespace in the citation data immediately after <ref> or <ref ...> and before </ref>
24 January 2024 – Now detects and fixes invalid <ref name=foo/bar /> markup. Detects and is not fooled by, but does not repair, invalid <ref name=>Foo</ref> markup without a value. Lowercases any ALL-CAPS or CamelCase ref tag and attribute names (MW treats them as valid, rare though it may be). Bonus cleanup added, of <br>, <br/>, and various invalid variants like </br>, etc. to <br />; plus <hr /> versions.

Infrequently asked questions

Isn't there some rule against this?
- No. See RfC here: there is definitely not a consensus against using citation-formatting tools, and a large discussion to affirm the acceptability of using citation tools is not needed. It would be possible to do something disruptive with one, like mass-changing a bunch of articles in a bot-like fashion and not checking the output and letting a bunch of errors get through. But who is doing that? Close continues: questions of editor behavior should be addressed as needed at noticeboards. See also other RfC here: changes to visual output for the reader generally require consensus, as do systematic changes across an entire article changing from one consistent citation style to another consistent citation style, but changes of coding that occur while updating the content of a citation and/or adding citations do not require consensus. Even reader-facing changes are permissible when making the visual output of citations consistent within an article where there has been no history of consistency. Also from the closer: An editor hopping from article to article converting everything to a template would be a 'no' without consensus. Next see third RfC here: There is a clear consensus that the usage of vertical and horizontal templates does not fall within the purview of WP:CITEVAR. ... the inclusion of wikitext formatting within a style guideline is a form of WP:CREEP as the coded structure of the citation does not visually alter the article and provides no difference to the reader ... The existence of established policies such as WP:BRD, WP:EW, WP:OWN, and WP:BUREAU eliminates the need to codify something as specific as this. ... the code structure does not require consensus to change ..., though edit-warring over such trivia is prohibited about this as it is about everything.
In summary, forcing everything to be CS1 templates in articles using another citation style consistently is not okay (change of major citation style), but cleaning up the wikicode without changing to a different major citation style is fine. In short, the efforts of certain editors to get every aspect of internal formatting of citations deemed to be part of a "citation style" that was "protected" by WP:CITEVAR have been repeatedly rejected by consensus (despite strenuous efforts in that direction by various parties).
Why didn't you do this with a MediaWiki parser?
- None of them that are fully functional are in JavaScript, so they can't be installed and used as WP user scripts. Using some other parser for this in some way would require using an external tool hosted at Toolforge, and pretty much no one is going to do that. Maybe there's some way to hook into such a tool through an internal JavaScript here, but I'm unaware of how to do so. Also, I just liked the challenge of writing a (multi-step) regex quasi-parser that can mostly handle a simple four-attribute element, in the face of various people on StackExchange saying it's not possible. There is a JavaScript parser called wtf_wikipedia, but it can't convert output back into markup (it's aimed at extracting text from it for reuse elsewhere); another called wikiapi looks promising but requires node.js and seems really to be a way to remotely edit a WP page through some other application, not a means of using JS while on WP to futz with the content.
Why does this put quotes around attribute values?
- Because quoting is a best practice that makes the markup robust, guaranteed to work, and future-proof. It stops citations from being broken later, fixes broken ones now, is better for reuse of Wikipedia content, and trying to go the other direction is a totally lost cause. The quotation marks are required any time the attribute value has a space, has punctuation, or has any character that is not part of the original ASCII character set (which means characters from any non-Latin-based writing system like Greek, Cyrillic, CJK, etc., and the majority of the Latin-alphabet characters with diacritics). Most editors do not realize this (or realize only the space part). Consequently, many existing citations are in invalid markup (most commonly hyphens, underscores, dots, slashes, and other punctuation, e.g. <ref name=Smith-Jones />). Right this moment, MW seems to parse most of them properly most of the time anyway, though this becomes more iffy when there are two attributes, like also a group=. This kinda-sorta support for the bad markup cannot be depended upon indefinitely, as it is not officially supported and is against the documented requirements of MW's own ref extension. The more complex an unquoted attribute, the more likely the MW parser is to fail with it, and the behavior could change with any future MW version.
 Worse, any unhelpful ref name like <ref name=news /> or <ref name=p7 /> is very likely to be improved by later editors to make more sense, e.g. <ref name="NYT 2022" /> and <ref name="Smith-Jones 1998 p7" /> and will break citations if they do not remember to add the quotation marks that should have been there to begin with. As for reuse, any given third party is likely to attempt to parse our material with whatever they have, and most XML parsers are not going to handle bad markup of this sort. Even if someone uses a purpose-built MW markup parser, the odds of it perfectly replicating MW's quirks with regard to invalid markup in its "XML-like syntax" for ref tags are very low (doing it would require quite an effort on the part of the parser writer, and exactly what quirks MW will parse is something that's going to change from version to version).
 Finally, various tools, including WMF's own highly ... discussed VisualEditor enforce the quotation marks anyway, so trying to resist them is a quixotic waste of time and just annoying to everyone who understands what the quotation marks are there for and how often they are needed. PS: On the accessibility claim that quotation marks are harder for certain users to type due to mobility issues, doing <ref name=foo /> is not actually going to (immediately) break anything for very simple ref names, and no one will yell at you for doing it (we hope!). The only issue would be removing quotes just because you don't like them or revert-warring against other editors doing later cleanup to add them.
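The quoting rule as stated in this answer (quotes are required for a space, punctuation, or any non-ASCII character) can be written as a one-line check. This sketch only encodes that stated rule; it makes no claim about what the MW parser actually tolerates, and the function name requiresQuotes is hypothetical.

```javascript
// True when a ref attribute value needs quotation marks under the rule above:
// anything other than plain ASCII letters and digits (or an empty value).
function requiresQuotes(value) {
  return !/^[A-Za-z0-9]+$/.test(value);
}

for (const value of ['news', 'p7', 'Smith-Jones', 'NYT 2022', 'M\u00FCller']) {
  console.log(value, requiresQuotes(value));
}
```

Since the script quotes every value anyway, a check like this would mainly be useful for counting how many existing citations in a page rely on the unsupported unquoted form.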

References

^ mw:Special:Permalink/2764501#MediaWiki 1.29