Wikipedia:Bots/Requests for approval/GreenC bot

The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

GreenC bot

Operator: Green Cardamom

Time filed: 16:29, Sunday, March 13, 2016 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): Nim and AWK

Source code available: WaybackMedic on GitHub

Function overview: Fix known problems with Internet Archive wayback machine links and page formatting errors introduced by Cyberbot IABot between December 2015 and March 2016.

Links to relevant discussions (where appropriate):

Edit period(s): one time run

Estimated number of pages affected: est. 20k pages of ~100k checked (the corpus of all articles edited by Cyberbot IABot from 20151231 to 20160310).

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): No

Function details: User:Green Cardamom/WaybackMedic lists details

Discussion

Due to the large scope, this will likely require multiple trials and a community response period. Sometimes these are easier to show as demonstrations, so your first small trial is approved; please post results below when ready. — xaosflux ^Talk 17:00, 13 March 2016 (UTC)

Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. — xaosflux ^Talk 17:00, 13 March 2016 (UTC)[reply]

During development a trial run was made in full manual mode. I checked each edit, verified it using offline tools, then uploaded via AWB. It processed the first 500 articles edited by Cyberbot (starting Dec 14 2015). Of those, it found corrections were needed in 94. The 94 edits can be seen [1] starting 11 March at 7:21pm with a subject line "wayback medic using AWB". If this trial run is acceptable I can run the next 250-article batch (they have to be done in batches), which should roughly correspond to 50 edits. -- GreenC 19:31, 13 March 2016 (UTC)
OK, also - added AWB "bot" access. As your account is not yet flagged for botedit, please use an AWB delay rate of 10. — xaosflux ^Talk 21:15, 13 March 2016 (UTC)

Alright, thanks. I've been in touch with Internet Archive and they provided documentation on a new version of their API to use, so once I get that coded and tested, I will run the next batch of 250 articles (~50 edits) using GreenC bot. -- GreenC 21:50, 13 March 2016 (UTC)

@Green Cardamom: It looks like {{Dead link|bot=...}} is a thing. I dunno if that param is truly critical in the grand scheme of things, but I'd suggest supplying it with the bot's username. Also, pinging @Cyberpower678: into the loop. --slakr^\ talk / 02:31, 16 March 2016 (UTC)[reply]

Cyberbot should not be tagging external links as dead yet.—^cyberpower_{Chat:Limited Access} 03:05, 16 March 2016 (UTC)[reply]

I was not aware of the bot param and can easily add it in case someone wants a record trail. WaybackMedic is re-adding the dead link template after it was removed by Cyberbot so the decision to tag the source dead is not original to WaybackMedic (or Cyberbot). That distinction may or may not matter. -- GreenC 04:13, 16 March 2016 (UTC)[reply]

GreenC bot has completed its trial run. The edits are viewable here, ending with the Moscow theater hostage crisis. -- GreenC 20:35, 18 March 2016 (UTC)

The bot has gone through a major overhaul to incorporate the new API and new features. I added an additional 25 edits to the previous 33 to show some of it. -- GreenC 21:57, 20 March 2016 (UTC)

Marking as Trial complete so someone will drop by to check the diffs (possibly even me a little later). Either way, another trial is probably a good idea due to the overhaul. --slakr^\ talk / 03:29, 24 March 2016 (UTC)

Slakr: There was a bug in this trial run. A number of articles were dropping the query portion of the URL due to a problem with urlencoding in the POST to the API. Example. This is fixed. -- GreenC 15:59, 24 March 2016 (UTC)
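(Illustration, not taken from the bot's source.) The exact cause of the bug is not spelled out above, but one common way a query string gets dropped is when a URL that has its own query is pasted unencoded into a form body. A minimal Python sketch of that failure mode follows; the field names `url` and `timestamp` are hypothetical and are not the Internet Archive API's documented parameters.

```python
from urllib.parse import parse_qs, urlencode

# Target URL taken from the Memento request quoted later in this discussion.
target = "http://www.themusic.com.au:80/imm_display.php?s=christie&id=556&d=2008-08-12"
timestamp = "20090101075242"

# Naive form body: the target's own "&" characters act as field separators,
# so the receiving side sees only the first query parameter.
naive_body = "url=" + target + "&timestamp=" + timestamp
naive_url = parse_qs(naive_body)["url"][0]

# Encoded form body: urlencode() percent-encodes "?", "&" and "=" inside
# the value, so the whole target URL survives the round trip.
safe_body = urlencode({"url": target, "timestamp": timestamp})
safe_url = parse_qs(safe_body)["url"][0]

print(naive_url)  # truncated after "s=christie"
print(safe_url)   # identical to target
```

Here `parse_qs` stands in for whatever parsing the receiving server does; the point is only that the value must be percent-encoded before it goes into the request body.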

Approved for extended trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. --slakr^\ talk / 02:47, 29 March 2016 (UTC)[reply]

Trial complete. -- GreenC 18:59, 29 March 2016 (UTC)

{{BAGAssistanceNeeded}} - there were no errors with the trial. -- GreenC 20:47, 10 April 2016 (UTC)[reply]

I've never seen a bot in Awk before. I think you deserve some kind of award... — Earwig ^talk 02:49, 11 April 2016 (UTC)[reply]

I agree with Earwig. →Σ σ ς. (Sigma) 02:55, 11 April 2016 (UTC)[reply]

Thanks. I'm happy with awk, fun and easy. It's specialized for text processing so is ideal for wiki text processing (with some external programs for any networking etc) -- GreenC 04:59, 11 April 2016 (UTC)[reply]

This edit broke a citation template, unfortunately. — Earwig ^talk 02:58, 11 April 2016 (UTC)[reply]
This edit replaces a broken archive with a... incorrect archive? I'm not sure. — Earwig ^talk 03:25, 11 April 2016 (UTC)[reply]
In this edit, the bot removes a broken archive leaving the original link, which is some ad-spam nonsense. I guess that's because it appears to be up? I'm not sure if there's much we can do here. — Earwig ^talk 03:36, 11 April 2016 (UTC)[reply]

That's all I've reviewed so far (stopping at Henry Fox Talbot). Other than those things, it looks good. — Earwig ^talk 03:38, 11 April 2016 (UTC)[reply]

The original cite had an invisible LF character at the start (maybe the text was uploaded from a Windows text file). I've added a strip().
Edit is correct. If archive.org has no available snapshots it will query Memento which is an index to a dozen or so other archives (WebCite, Library of Congress, national archives). This is an unusual condition, though, 98% or more will be from Wayback.
If there is nothing available at Wayback (or other archives) it restores back to the original non-working link.

-- GreenC 04:59, 11 April 2016 (UTC)[reply]

But for the second one, the WebCite link doesn't appear to be valid: the date given is from 2005 but the news story is from 2008, and the link leads to an image download. Is it really archiving the right page? — Earwig ^talk 16:15, 11 April 2016 (UTC)[reply]

You're right. Unfortunately this is bad data from the Memento API. Here is the API request:

http://timetravel.mementoweb.org/api/json/20090101075242/http://www.themusic.com.au:80/imm_display.php?s%3Dchristie%26id%3D556%26d%3D2008-08-12

Returns the following JSON output (Pastebin). WaybackMedic tries to find the nearest match to 20090101075242 that isn't archive.org or archive.is; in this case that is "first", dated 2005-11, at WebCite. Not sure what can be done here other than report it to Memento. In total there are only about 300-400 links to alternative archives in the whole set (I've already run it to completion offline). After WM has completed I'll go through and check the WebCites that have this unusual truncated URL, fix any articles and send the data to Memento. In other spot checks things looked ok. -- GreenC 19:59, 11 April 2016 (UTC)
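(Illustration, not taken from the bot's source.) The selection rule described above, "nearest match to the timestamp that isn't archive.org or archive.is", can be sketched in Python as below. The request URL pattern is the one quoted above; the shape of the `mementos` dictionary (keys such as `first` and `closest`, each with a `datetime` and a list of `uri` values) and the sample values are assumptions made for the example, not the contents of the Pastebin.

```python
from datetime import datetime
from urllib.parse import urlsplit

API_BASE = "http://timetravel.mementoweb.org/api/json"
EXCLUDED_HOSTS = ("archive.org", "archive.is")


def build_request(timestamp: str, url: str) -> str:
    """Build a request URL following the pattern quoted in the discussion."""
    return f"{API_BASE}/{timestamp}/{url}"


def is_excluded(uri: str) -> bool:
    """True if the URI is hosted at an excluded archive or a subdomain of one."""
    host = urlsplit(uri).hostname or ""
    return any(host == h or host.endswith("." + h) for h in EXCLUDED_HOSTS)


def nearest_alternative(mementos: dict, timestamp: str):
    """Return the non-excluded memento URI closest in time to timestamp."""
    target = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    best = None  # (time gap, uri)
    for entry in mementos.values():
        when = datetime.strptime(entry["datetime"], "%Y-%m-%dT%H:%M:%SZ")
        gap = abs(when - target)
        for uri in entry["uri"]:
            if is_excluded(uri):
                continue
            if best is None or gap < best[0]:
                best = (gap, uri)
    return best[1] if best else None


# Hypothetical response fragment: the only non-Wayback entry is an old,
# truncated WebCite link, so it wins despite being years off.
sample = {
    "first": {"datetime": "2005-11-20T00:00:00Z",
              "uri": ["http://www.webcitation.org/5"]},
    "closest": {"datetime": "2009-01-01T07:52:42Z",
                "uri": ["http://web.archive.org/web/20090101075242/http://example.com/"]},
}
print(nearest_alternative(sample, "20090101075242"))
```

With this kind of rule the result is only as good as the index data: if the sole alternative entry is wrong, the function still returns it, which matches the behaviour Earwig noticed.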

Ok, I found that 20 out of the 61 WebCite URLs don't work. The non-working ones all have a URL path of 3 characters or fewer; the working ones have a 9-character path (http://www.webcitation.org/5lZ39OFsi), so it is easy to fix and is now fixed. -- GreenC 21:13, 11 April 2016 (UTC)
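(Illustration, not taken from the bot's source.) A path-length check of the kind described above might look like this in Python. The function name is hypothetical, and accepting only the 9-character form is an assumption that is stricter than what the comment states (it says nothing about paths of 4 to 8 characters).

```python
from urllib.parse import urlsplit

WEBCITE_HOSTS = ("webcitation.org", "www.webcitation.org")


def looks_like_valid_webcite(url: str) -> bool:
    """Accept WebCite URLs whose path is a 9-character snapshot ID."""
    parts = urlsplit(url)
    if (parts.hostname or "") not in WEBCITE_HOSTS:
        return False
    snapshot_id = parts.path.strip("/")
    return len(snapshot_id) == 9


print(looks_like_valid_webcite("http://www.webcitation.org/5lZ39OFsi"))
print(looks_like_valid_webcite("http://www.webcitation.org/5"))
```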

{{BAGAssistanceNeeded}} - I understand that in the 30+ days of this bot's trial, a single editor, Earwig, found two problems. Those problems are edge cases that, had the bot run to completion, would have impacted an estimated 25 of 25,000 edits, a bot accuracy rate of 0.999, though there might be other unknown edge cases that bring it down to 0.99 or so. No other editor has raised concerns. Meanwhile the problems that WaybackMedic is trying to fix are becoming worse: editors attempt to fix them manually, and by doing so break things, making it impossible for WaybackMedic to make the fixes it is designed for (e.g. they see a link doesn't work, remove it and add {{cbignore}}, making it impossible for WaybackMedic to replace it with a working link). Each day that goes by, WM's ability to fix problems is degraded. -- GreenC

Just letting you know: you probably shouldn't {{tl}} the assistance template if you want it to show up on the main status page. Anyway, I'm at work now. I wanted to finish going through the trial, and I haven't had time... Anyone else? — Earwig ^talk 16:13, 13 April 2016 (UTC)[reply]

Ok. Thank you for the assistance. -- GreenC 16:39, 13 April 2016 (UTC)[reply]

I've listed the remaining 22 edits in the second trial below. -- GreenC 18:03, 14 April 2016 (UTC)[reply]

May I suggest that you also filter URLs of the old WBM schemes, including http://replay.waybackmachine.org/ and http://wayback.archive.org/, to the new https://web.archive.org/. --bender235 (talk) 00:42, 12 May 2016 (UTC)[reply]

@Bender235: that's a good idea. It already will convert http://wayback.archive.org/ but only when doing something else at the same time, like changing a snapshot date. The focus for the first iteration of the bot is to fix some known problems within a limited sub-set of articles - once it finishes I hope to make a new version that will run against all articles containing wayback links and does general formatting fixes like you suggested. -- GreenC 02:22, 12 May 2016 (UTC)[reply]
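(Illustration, not taken from the bot's source.) The host rewrite bender235 suggests can be sketched in Python with a single anchored regular expression. This sketch only swaps the scheme and host and carries the rest of the URL over unchanged; whether the old schemes need any further path adjustment is not covered here and is left as an open assumption.

```python
import re

# Old Wayback Machine prefixes named in the suggestion above.
OLD_PREFIX = re.compile(
    r"^https?://(?:replay\.waybackmachine\.org|wayback\.archive\.org)/",
    re.IGNORECASE,
)
NEW_PREFIX = "https://web.archive.org/"


def normalize_wayback(url: str) -> str:
    """Rewrite an old-scheme Wayback URL to the web.archive.org host."""
    return OLD_PREFIX.sub(NEW_PREFIX, url, count=1)


print(normalize_wayback(
    "http://wayback.archive.org/web/20100101000000/http://example.com/"
))
```

Because the pattern is anchored at the start of the string, an old-scheme host that happens to appear inside the archived target URL is left alone.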

Trial results

Trial results are at User:Green_Cardamom/WaybackMedic/trial2.

There are still enough red X's to justify more trials; edge cases keep showing up. I'd like to run in batches of 25, which seems manageable, using the same method as above. Hopefully it won't need more than another 50-75 edits, but however long it takes. I'll log the results on sub-pages to avoid making this page too long. -- GreenC 21:15, 15 April 2016 (UTC)

Would it be possible to dry run some of these instead of making live edits in production? e.g., log what would have changed to a sub-page in the bot's userspace or just manually review `diff` output, for example. We shouldn't have to post-mortem numerous trial runs. By the third trial, this should be at production readiness. There should also be clear evidence that there's been large amounts of self-testing without disruption to the production environment. --slakr^\ talk / 05:39, 16 April 2016 (UTC)[reply]
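(Illustration, not taken from the bot's source.) A dry run of the kind suggested above can be as simple as producing `diff`-style output for each page instead of saving it. A minimal Python sketch using the standard library's `difflib`; the function name and the sample wikitext are made up for the example.

```python
import difflib


def dry_run_diff(title: str, before: str, after: str) -> str:
    """Return a unified diff of the proposed change; empty if nothing changes."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"{title} (current)",
        tofile=f"{title} (proposed)",
    )
    return "".join(lines)


before = "{{cite web |url=http://example.com |archiveurl=http://wayback.archive.org/web/2010/http://example.com}}\n"
after = "{{cite web |url=http://example.com |archiveurl=https://web.archive.org/web/2010/http://example.com}}\n"
print(dry_run_diff("Example article", before, after))
```

The resulting text could be reviewed by hand or written to a log page in the bot's userspace, so no live edit is needed to inspect a change.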

If you only knew how much dry run testing has been done! And of course I will continue to do so. In the trial's 47 edits, 99 changes were made, of which 4 had fixable bugs that were difficult to spot (edits 1, 6, 14 & 38), or about a 4% error rate. That's not good enough, but it's close. The bot by its nature will always contain 'mistakes' (edits 5 and 17) that can't be helped; it's the nature of a constantly changing Internet. -- GreenC 13:48, 16 April 2016 (UTC)
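(Illustration.) The "about 4%" figure above is the share of changes with bugs, 4 out of 99, not the share of the 47 edits. A small Python check of that arithmetic, with a hypothetical helper name:

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of changes that had errors."""
    if total <= 0:
        raise ValueError("total must be positive")
    return errors / total


trial_rate = error_rate(4, 99)
print(f"{trial_rate:.1%}")  # 4.0%
```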

I had mostly skipped the first trial due to your comment that the bot had been reworked, so as far as I'm concerned, there's only been one trial. I don't see any problem with a "second" one closely-monitored in small batches; Linus's Law comes into play, and the error rate is small enough to avoid damage.

Approved for trial (100 edits max, in 25-edit batches). Please provide a link to the relevant contributions and/or diffs when the trial is complete. — Earwig ^talk 22:42, 20 April 2016 (UTC)[reply]

Earwig, my problem has been the lack of graphic in-line diffs, so bugs were hard to spot and I was depending on the live trials to pick out the remaining bugs. I took User:Slakr's advice and set up some dry runs in User space, and the last one ran mostly clean. Also, I ported the bot to a new language, Nim, which compiles to C with assembly optimizations, and it's now running about 400% faster and using half the memory. The port wasn't difficult as Nim can be made to look like other languages such as Awk or Python; the Nim code appears close to the Awk code. I uncovered some deep bugs along the way, and added some new features and optimizations, so it's a much improved version. -- GreenC 15:21, 21 April 2016 (UTC)

(edit: and I'll run some live trials next) -- GreenC 16:03, 21 April 2016 (UTC)[reply]

Trial 3 results

Trial complete. - results:

The trial articles were hand-picked to stress test the software's feature set. There was one bug in 51-75 that in production would have impacted few articles (it required two rare conditions to occur simultaneously). -- GreenC 14:32, 23 April 2016 (UTC) {{BAGAssistanceNeeded}}

Nice. Very much improved. =) I might be paranoid or overly cautious here, but I say let's do one last trial to put this one to rest. :D If everything looks decent, I don't otherwise see any issues and have no problem with greenlighting it fully. Thanks again for all your hard work, tenacity, and response to input. :)

Approved for extended trial (100 edits max, in 25-edit batches). Please provide a link to the relevant contributions and/or diffs when the trial is complete. --slakr^\ talk / 02:42, 7 May 2016 (UTC)

Trial 4 results

Trial complete. - results:

Trial4 76-100 - live test (May 12)
Trial4 51-75 - live test (May 11)
Trial4 26-50 - live test (May 10)
Trial4 01-25 - live test (May 9)

One bug. This bug was created when fixing the bug from the last trial. Both bugs are related to code dealing with alternative (non-wayback) archives, which in total account for ~100 articles out of the ~100,000 being processed. It is showing up in the trial because I am stress testing by manually picking articles that contain alternative archives, along with other rare cases intentionally chosen for the trial. Here is a suggestion for how to ease into this:

Run the complete set of alternative archives as a single batch and manually check each one. As noted there are only ~100 articles in this set. I don't expect any more problems, but this code is the most processed part of the bot; it's at the end of a long chain of decisions and has some separate functions.
Run the first 10% (10,000) which will end up making changes in about 1500 to 2500 articles. Spot check 100 of them. Wait 7 days for user feedback.
Continue this process with no less than 48hr wait between each 10% block until completed.
I believe in "I break it, I fix it." The bot keeps full records of everything, so once a known error is discovered it is trivial to regex previous runs to find where else it showed up and go back and fix it.

-- GreenC 15:45, 12 May 2016 (UTC)[reply]
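(Illustration, not taken from the bot's source.) The staged rollout proposed above, 10% blocks with a spot check of 100 changed articles per block, can be sketched in Python as follows. The function names and the fixed random seed are assumptions for the example; the waiting periods between blocks are a human process and are not modelled.

```python
import random


def split_into_blocks(items: list, n_blocks: int = 10) -> list:
    """Split items into at most n_blocks consecutive blocks of equal size."""
    size = max(1, -(-len(items) // n_blocks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def spot_check_sample(changed: list, k: int = 100, seed: int = 0) -> list:
    """Pick up to k changed articles at random for manual review."""
    rng = random.Random(seed)
    return rng.sample(changed, min(k, len(changed)))


# Stand-in for the ~100,000 article titles in the corpus.
articles = [f"Article {n}" for n in range(100_000)]
blocks = split_into_blocks(articles)
print(len(blocks), len(blocks[0]))  # 10 10000

# Stand-in for the articles actually changed in the first block.
changed_in_first_block = blocks[0][:2000]
print(len(spot_check_sample(changed_in_first_block)))  # 100
```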

A user has requested the attention of a member of the Bot Approvals Group. Once assistance has been rendered, please deactivate this tag by replacing it with {{t|BAG assistance needed}}. @Slakr: @The Earwig:

Approved. — Earwig ^talk 20:19, 24 May 2016 (UTC)[reply]

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.