Wikipedia:Bots/Requests for approval/AntiCompositeBot

The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

Request Expired.

AntiCompositeBot

Operator: AntiCompositeNumber

Time filed: 20:25, Sunday, February 23, 2020 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): python

Source code available: https://github.com/AntiCompositeNumber/AntiCompositeBot/blob/master/src/harvcheck.py

Function overview: Tag Harvard-style references ({{sfn}}, {{harv}}, etc) when the link to the full citation is broken.

Links to relevant discussions (where appropriate): Wikipedia:Bot requests/Archive 80#Harvard Bot

Edit period(s): Continuous

Estimated number of pages affected: There are around 95,000 pages transcluding {{Sfn}} alone, but most probably won't be broken. Trial indicates 2100–2700 pages.

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): No

Function details: Harvard-style refs use a shortened footnote that links to the full citation elsewhere on the page. The shortened footnote alone does not contain enough information to verify the statements in the article. If the full citation has been removed, or the citation or footnote has been improperly constructed, the footnote does not link to the citation.

When the bot finds a broken reference, it adds {{subst:Broken footnote}} after the footnote, unless the footnote is already followed by {{broken footnote}} or {{citation not found}}.
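The tagging rule above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code in harvcheck.py: the helper names `already_tagged` and `tag_footnote` are hypothetical, and the sketch works on raw wikitext with a simple regular expression rather than a real wikitext parser.

```python
import re

# Tag names taken from the function details; everything else is illustrative.
EXISTING_TAGS = ("broken footnote", "citation not found")
NEW_TAG = "{{subst:Broken footnote}}"


def already_tagged(text_after: str) -> bool:
    """Return True if the text right after a footnote begins with a known problem tag."""
    match = re.match(r"\s*\{\{\s*([^|{}]+?)\s*(\||\}\})", text_after)
    if match is None:
        return False
    return match.group(1).lower() in EXISTING_TAGS


def tag_footnote(wikitext: str, footnote: str) -> str:
    """Insert NEW_TAG after the first occurrence of `footnote`, unless a tag already follows it."""
    start = wikitext.find(footnote)
    if start == -1:
        return wikitext  # footnote not present, nothing to do
    end = start + len(footnote)
    if already_tagged(wikitext[end:]):
        return wikitext  # leave existing problem tags alone
    return wikitext[:end] + NEW_TAG + wikitext[end:]
```

Calling `tag_footnote(text, "{{sfn|Smith|2020}}")` a second time on its own output would add a duplicate only after the saved page has had the template substituted into something the check does not recognise, which is why the check matches the resulting template names case-insensitively.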

Discussion

No opinion on the task so far, but as noted in the above discussion it's not clear that this can't be accomplished by appropriately tweaking the CS1/2 templates. Jo-Jo Eumerus (talk) 21:23, 23 February 2020 (UTC)

It cannot. Tweaking CS1 could reduce the number of broken footnotes, but it wouldn't eliminate them, nor would it categorize articles as being in need of cleanup. Headbomb {t · c · p · b} 21:51, 23 February 2020 (UTC)

@AntiCompositeNumber: does it also cover the other Harvard templates, like {{harvnb}}, {{harvtxt}}, {{sfnm}}, etc.? It should cover all footnote templates that create links to other citations. Also, it should substitute with the date (e.g. {{Broken footnote|date=February 2020}}). Headbomb {t · c · p · b} 21:53, 23 February 2020 (UTC)

@Headbomb: The bot doesn't care what template creates the links, as long as they are of the form ...{{PAGENAME}}#CITEREF... and there's no element on the page with a matching ID. The detection logic is very similar to User:Ucucha/HarvErrors.js's first check, except that I explicitly ignore any links to other pages. The bot does substitute the template, which automatically includes the date. See this edit to my test cases. --AntiCompositeNumber (talk) 22:52, 23 February 2020 (UTC)
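A minimal Python sketch of the check described above, using only the standard library. It is not the bot's actual implementation: the names `AnchorCollector` and `broken_citerefs` are hypothetical, and the `./Page#fragment` shape of same-page links is an assumption about the rendered HTML. The sketch collects every element ID, collects every link whose fragment starts with CITEREF, ignores links to other pages, and reports fragments with no matching ID.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class AnchorCollector(HTMLParser):
    """Collect every element id and every link whose fragment starts with CITEREF."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.citeref_links = []  # (page part of href, decoded fragment)

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        element_id = attr_map.get("id")
        if element_id:
            self.ids.add(element_id)
        href = attr_map.get("href") or ""
        if tag == "a" and "#" in href:
            page, _, fragment = href.partition("#")
            fragment = unquote(fragment)
            if fragment.startswith("CITEREF"):
                self.citeref_links.append((page, fragment))


def broken_citerefs(html, page_href):
    """Return CITEREF fragments linked from this page that have no matching id on it."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return [
        fragment
        for page, fragment in collector.citeref_links
        # Keep only same-page links: a bare "#..." href or the page's own href.
        if page in ("", page_href) and fragment not in collector.ids
    ]
```

Because the check looks only at the rendered links and IDs, it does not need to know which template produced them, which matches the point made in the reply above.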

I feel that before this is deployed, CS1 templates should be updated to emit the default |ref=harv anchors. This would cut down on quite a few errors, and save everyone lots of headaches. Still, a technical trial will at least demonstrate feasibility and give an idea of how many such citations are affected. Maybe there's room for more limited bot runs while templates are being updated, with a full rollout at a later date.

For now,

Approved for trial (25 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. I'll be recusing myself from final approval, given this is my idea of a bot. Headbomb {t · c · p · b} 03:20, 26 February 2020 (UTC)

Partially done, stopping now so I can make sure Kubernetes doesn't try to restart it while I'm sleeping. 34,597 random pages were checked and 16 were tagged in 3 hours. That's a hit about every 11 minutes or 2200 pages. I'll do the last 9 edits in the next few days. --AntiCompositeNumber (talk) 05:24, 27 February 2020 (UTC)

Surely there's a better way of finding pages than using random pages. Checking for pages with {{sfn}} on them, for example. Or using WP:DUMPs. Headbomb {t · c · p · b} 05:28, 27 February 2020 (UTC)

Seems useful, as we don't know which templates are out there, and to see what the percentages are, e.g. 5% of all articles have one of these templates of any type and 5% of those have a problem that can be fixed by the bot (ergo whatever 5% of 5% of 7 million is), to get a sense of the bot's footprint. At the same time it is making X number of edits for testing purposes. -- GreenC 15:18, 27 February 2020 (UTC)

Getting an idea could just as easily be achieved through "10% of 95,000 pages with sfn transclusions" or similar. The concern is that during regular operation, it's much more efficient to operate only on pages with templates that could be problematic, rather than access random pages until you have covered all six million of them. Anyway, 16 out of 34,597 is ~0.046% of pages, which works out to about 2,775 pages out of 6,000,000, if the sample was representative. Headbomb {t · c · p · b} 15:40, 27 February 2020 (UTC)
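The extrapolation above can be reproduced with a few lines of Python. The function name `extrapolate` is hypothetical, and the result is only meaningful if the random sample was representative, as noted in the comment.

```python
def extrapolate(hits, sampled, population):
    """Scale a hit rate observed in a sample up to the whole population."""
    rate = hits / sampled
    return rate, rate * population


# Figures from the trial: 16 tagged pages out of 34,597 checked, ~6,000,000 articles.
rate, estimate = extrapolate(16, 34_597, 6_000_000)
print(f"{rate:.3%} of sampled pages, about {estimate:,.0f} pages overall")
# prints "0.046% of sampled pages, about 2,775 pages overall"
```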

Assuming all templates are of equal popularity/usage. -- GreenC 15:53, 27 February 2020 (UTC)

Both have their upsides and their downsides. Using the random ordering at the current speed, it would take about 3 weeks to cover all articles. Querying the database for only articles that transclude any of these templates would yield 191,387 pages. That would miss any uncommon redirects or any links not generated from a template. I think that it would be worth running the full search at least once to try to catch some of that stuff. I'll run the remaining trial edits with the templatelinks search later. --AntiCompositeNumber (talk) 17:20, 27 February 2020 (UTC)

Trial complete. 563 articles from quarry:query/42383 in a random order scanned in roughly 10 minutes with 9 problems found. That's one every 62.5 articles, or 1.6% of the queried articles. That would put us around 2100 total articles to edit over about a day and a half. The biggest source of delay with the pre-prepared query is the throttle to maintain max 1 epm. Had a little bit of a hiccup with a bug in the code that detects existing problem tags and Kubernetes being overenthusiastic about restarting the bot. I've fixed that bug, written a regression test, and calmed Kubernetes down. --AntiCompositeNumber (talk) 19:09, 29 February 2020 (UTC)
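To illustrate why the 1 edit per minute throttle dominates the run time, here is a small Python sketch. It is an assumption about how such a throttle might look, not the bot's code: `EditThrottle` and `hours_for` are hypothetical names, and the clock and sleep functions are injectable only so the behaviour can be checked without waiting.

```python
import time


class EditThrottle:
    """Space out edits so that no more than `edits_per_minute` are made."""

    def __init__(self, edits_per_minute=1.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / edits_per_minute  # minimum seconds between edits
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous edit, if any

    def wait(self):
        """Block until at least `interval` seconds have passed since the last edit."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def hours_for(edits, edits_per_minute=1.0):
    """Lower bound on run time, in hours, when the throttle is the bottleneck."""
    return edits / edits_per_minute / 60
```

With the trial's estimate, `hours_for(2100)` gives 35.0 hours, which is consistent with the "about a day and a half" figure quoted above.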

Some problem edits

Problematic edit 1: [1] tags when {{full citation needed}} is already present

Problematic edit 2: [2] seems to be missing named references

Problematic edit 3: [3] adds a subst: that doesn't subst?

Headbomb {t · c · p · b} 19:25, 29 February 2020 (UTC)

Fixed: {{full citation needed}} added to the ignore list.
~~Not a bug~~: Not tagging reuses of a named reference was a conscious decision. When a reused citation is fixed, there is nothing that has to be done to the reuses to update them, so tagging them just makes more work for the editor fixing them.
Fixed: Substitution does not work in ref tags. I already had something in place to try to handle this, but the template on that page was deeply nested inside the ref tag, so it didn't get picked up. I've added a second check for it, and the problem tag is now placed outside the ref tag. --AntiCompositeNumber (talk) 21:39, 29 February 2020 (UTC)
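One way to picture the third fix, placing the tag outside the ref tag, is the following Python sketch. It is a simplified illustration, not the bot's actual check: `insertion_point` is a hypothetical helper, and it uses regular expressions on raw wikitext, so it would not cope with every nesting case a real parser handles.

```python
import re

# Opening <ref ...> tags, excluding self-closing forms such as <ref name="x" />.
REF_OPEN = re.compile(r"<ref\b[^>/]*>", re.IGNORECASE)
REF_CLOSE = re.compile(r"</ref\s*>", re.IGNORECASE)


def insertion_point(wikitext, footnote_end):
    """Return the index where a problem tag should go for a footnote ending at footnote_end.

    If the footnote sits inside <ref>...</ref>, the tag goes after the closing
    </ref>, because substitution does not work inside ref tags.
    """
    last_open = None
    for match in REF_OPEN.finditer(wikitext, 0, footnote_end):
        last_open = match
    if last_open is None:
        return footnote_end  # no ref tag opened before the footnote
    if REF_CLOSE.search(wikitext, last_open.end(), footnote_end):
        return footnote_end  # that ref was already closed before the footnote
    closing = REF_CLOSE.search(wikitext, footnote_end)
    return closing.end() if closing else footnote_end
```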

@AntiCompositeNumber: concerning 2, it's not about necessarily tagging every re-used citation, but rather at least one of them (probably the first use / the one generating the broken footnote). For example, ref 21 (Dauril Alden 1996) generates a broken footnote, but is not tagged as broken. There's a total of 10 distinct broken footnotes in that article, but only 4 are tagged. Headbomb {t · c · p · b} 22:42, 29 February 2020 (UTC)

The bot could also remove {{broken footnote}} once fixed. Headbomb {t · c · p · b} 22:42, 29 February 2020 (UTC)

I see what's happened there. That ref is <ref name=DA>{{Harvnb|Dauril Alden|1996|Page=152}}</ref>. The Cite extension considers the quotes around the name to be optional, but Parsoid will always include them when parsing HTML into wikitext. That means that string matching failed. Thankfully I can re-use the fix for problem 3 here. It's a bit hacky, but it works.

I'll look into removing tags from fixed references. --AntiCompositeNumber (talk) 02:01, 1 March 2020 (UTC)
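The quoting mismatch described above (`<ref name=DA>` in the saved wikitext versus a quoted name in the round-tripped wikitext) can be illustrated with a normalization step applied to both strings before comparing them. This Python sketch is an alternative illustration, not the fix the bot actually reuses; `normalize_ref_names` is a hypothetical helper and only handles the simple case where `name` is the first attribute.

```python
import re

# Matches <ref name=...> with a double-quoted, single-quoted, or unquoted name,
# optionally self-closing. Other attributes are not handled in this sketch.
REF_NAME = re.compile(
    r"""<ref\s+name\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s/>]+))\s*(/?)>""",
    re.IGNORECASE,
)


def normalize_ref_names(wikitext):
    """Rewrite ref name attributes to the double-quoted form so strings compare equal."""

    def repl(match):
        name = match.group(1) or match.group(2) or match.group(3) or ""
        closing = " />" if match.group(4) else ">"
        return f'<ref name="{name}"' + closing

    return REF_NAME.sub(repl, wikitext)
```

Comparing `normalize_ref_names(a) == normalize_ref_names(b)` then treats the quoted and unquoted spellings of the same ref as identical.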

Note: I recommend waiting on Help talk:Citation Style 1#make ref=harv the default for CS1 to be implemented before doing further trials / final approval. Headbomb {t · c · p · b} 21:10, 13 March 2020 (UTC)

@AntiCompositeNumber and Headbomb: Would you be okay if we expire this request until such a time as the above change has been made? Once made, ping me and I can approve. What do you think of that plan? (Just want to clear the backlog.) --TheSandDoctor ^Talk 17:21, 23 March 2020 (UTC)

Doesn't really matter to me. The relevant CS1/2 changes are purported to be rolled out in early April. The landscape concerning this request will have changed a bit by then, so marking as expired should be fine, with the understanding that this can be re-opened once the changes have been made and a clearer picture has emerged. Headbomb {t · c · p · b} 17:24, 23 March 2020 (UTC)

No objections here either. The bot task and code are likely to change once the CS1/2 changes are made anyway. --AntiCompositeNumber (talk) 17:51, 23 March 2020 (UTC)

Request Expired. Per the above. This can be re-opened when a clearer picture has emerged/by AntiCompositeNumber pinging me requesting such. --TheSandDoctor ^Talk 18:23, 23 March 2020 (UTC)
