Wikipedia:Bots/Requests for approval/Cyberbot II 5
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Operator: Cyberpower678
Time filed: 13:37, Saturday, June 6, 2015 (UTC)
Automatic, Supervised, or Manual: Automatic and Supervised
Programming language(s): PHP
Source code available: Here
Function overview: Replace links tagged as dead with a viable archived copy of the page.
Links to relevant discussions (where appropriate): Here
Edit period(s): Daily, but it will likely look like it runs continuously.
Estimated number of pages affected: 130,000 to possibly a million.
Exclusion compliant (Yes/No): Yes
Already has a bot flag (Yes/No): Yes
Function details: The bot will crawl its way through articles on Wikipedia and attempt to retrieve an archived copy of each dead link from the time closest to the original access date, if specified. To avoid persistent edit-warring, users have the option of placing a blank, non-breaking {{cbignore}} tag on the affected citation to tell Cyberbot to leave it alone. If the bot makes any changes to the page, a talk page notice is placed alerting the editors there that Cyberbot has tinkered with a ref.
The bot's detection of a dead link needs to be carefully thought out to avoid false positives, such as a temporary site outage. Feel free to suggest some algorithms to add to this detection function. At present, the plan is to check for a 200 OK response in the header. If the response indicates that the site is down, the bot proceeds to add the archived link if available, or otherwise tags the link as dead. A rule mechanism can be added to the configuration for sites that follow certain rules when they kill a link.
There is a configuration page that allows the bot to be configured to desired specifications, which can be seen at User:Cyberbot II/Dead-links. The bot attempts to parse the various ways references have been formatted and to stay consistent with the existing format so as not to destroy the citation. Even though the option to not touch an archived source is available, Cyberbot II will attempt to repair misformatted sources using archives if it comes across any.
For any link/source that is still alive, Cyberbot can check for an available archive copy and, if it can't find one, request that the site be archived.
The bot can forcibly verify if the link is actually dead, or be set to blindly trust references tagged as dead.
The bot may need some further developing depending on what additional issues crop up, but is otherwise ready to be tested.
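As an illustration of "closest to the original access date", here is a minimal Python sketch (the bot itself is written in PHP). It assumes the public Wayback Machine availability endpoint is used for lookups; the discussion does not say which endpoint the bot actually calls, and the function names are hypothetical.

```python
import json
from datetime import datetime
from typing import Optional
from urllib.parse import urlencode

# Assumption: the public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_query(url: str, access_date: Optional[datetime] = None) -> str:
    """Build a lookup URL; a timestamp asks for the snapshot closest to that moment."""
    params = {"url": url}
    if access_date is not None:
        params["timestamp"] = access_date.strftime("%Y%m%d%H%M%S")
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the archived URL from a JSON response, or None if nothing usable came back."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("url")
```

When no access date is known, the query is built without a timestamp, so the caller never has to invent one.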
Discussion
I think this is a great idea. One thought: there are several kinds of dead links - (a) sometimes the site is completely defunct and the domain simply doesn't work - there is no server there any more, (b) sometimes the site has been bought by another entity and whatever used to be there isn't there any more, so most things get a 404, (c) sometimes a news story is removed and now gets a 404, or (d) sometimes a news story is removed and is now a 30x redirect to another page.
For a, b, or c, what you are describing is a great idea and probably completely solves the problem. For (d), it may be tricky to resolve whether this is really a dead link or whether they merely relocated the article.
One thought/idea: can you have a maintainable list of newspapers that are known to only leave their articles available online for a certain amount of time? The Roanoke Times for example, I think only leaves things up for maybe six months. Sometimes, they might redirect you to a list of other articles by the same person, e.g. [1] which was a specific article by Andy Bitter and now takes you to a list of Andy Bitter's latest articles. Other times you just get a 404, e.g. [2]. Since links from roanoke.com are completely predictable that they will disappear after six months, you could automatically replace 302s, whereas for some other sites, you might tag it for review instead of making the replacement on a 302. An additional possible enhancement would be that, knowing that the article is going to disappear in six months, you could even submit it to one of the web citation places so that we can know it will be archived, even if archive.org misses a particular article. --B (talk) 18:38, 9 June 2015 (UTC)[reply]
- First time comment on a BRFA. For d) one possibility (for cite templates with additional parameters such as author, date, location or title) would be for the bot to check if these words appear on the target page of the link. Non-404 dead links usually lack these. Jo-Jo Eumerus (talk) 18:54, 9 June 2015 (UTC)[reply]
- This is really great input, and it should be possible to maintain such a list via a control panel I've already set up. Right now, my development is focused on the various ways a reference has been formatted, and on appropriately parsing and modifying it when needed.—cyberpowerChat:Limited Access 20:42, 9 June 2015 (UTC)[reply]
- Another thing that came to mind: Can the bot check more than one archive service? Such as both the Wayback one and WebCite? Jo-Jo Eumerus (talk) 16:13, 13 June 2015 (UTC)[reply]
- I really wish you informed me of that earlier, the bot has been written around the Wayback Machine. Also, WebCite doesn't seem to have an API, or a way to immediately look up a website. It seems to prefer to email the user with a list of archive links. My bot doesn't have email though, so WebCite is out atm. Screen scraping by the looks of it may not be effective either.—cyberpowerChat:Online 16:29, 13 June 2015 (UTC)[reply]
- I can however program the bot to ignore any references that have a webcite template or archive link.—cyberpowerChat:Online 16:35, 13 June 2015 (UTC)[reply]
- Rats. I know I have to suggest things more quickly. Category:Citation Style Vancouver templates should perhaps be included as well, some of them allow for URLs. Jo-Jo Eumerus (talk) 18:58, 13 June 2015 (UTC)[reply]
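The four cases (a) to (d) and the suggestion to look for citation words on the target page can be sketched roughly as follows. This is an illustrative Python sketch and not the bot's code: the verdict labels, the function names and the 50% threshold are assumptions.

```python
from typing import Iterable, Optional


def classify_response(status: Optional[int]) -> str:
    """Map an HTTP status (None means no response at all) to a coarse verdict."""
    if status is None:
        return "dead"      # case (a): no server answered
    if status in (404, 410):
        return "dead"      # cases (b) and (c): the page is gone
    if 300 <= status < 400:
        return "review"    # case (d): a redirect may be a relocation
    if status == 200:
        return "alive"
    return "review"        # anything else is left for a human


def mentions_citation(page_text: str, citation_words: Iterable[str],
                      threshold: float = 0.5) -> bool:
    """Heuristic: does the fetched page contain enough of the cited title/author strings?"""
    words = [w.lower() for w in citation_words if w.strip()]
    if not words:
        return False
    text = page_text.lower()
    hits = sum(1 for w in words if w in text)
    return hits / len(words) >= threshold
```

A redirect whose target no longer mentions the cited title or author would be a candidate for replacement; one that still does is more likely a relocated article.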
Since you mentioned this, how do you avoid things like "temporary site outage"? Links go temporarily bad very often. Sometimes there are DNS issues. Sometimes regional servers or cache servers are down. Sometimes clouds are having issues. There is scheduled maintenance and there are general user errors. It is definitely unreliable to check a link only once.
I don't want to repeat all the comments from previous BRFAs, but there are tons of exceptions that you have to monitor. For example, I've had sites return 200 with a page-missing error, as well as sites returning 404 with valid content. I've had sites ignore me because of not having some expected user agent, allowing/denying cookies, having/not having referrer, being from certain region, not viewing ads, not loading scripts, not redirecting or redirecting to a wrong place or failing redirect in scripts, HEAD and GET returning different results, and a hundred other things. — HELLKNOWZ ▎TALK 18:03, 24 June 2015 (UTC)[reply]
- Those are issues that need to be controlled for even without a bot. If an average editor tries to follow an external link and comes to a 404 page, that editor is as likely to replace the link with a working one, even if the 404 page only comes from a temporary site error. If there is an archived version of the page, and the link is changed to that, then no information is lost. bd2412 T 18:09, 24 June 2015 (UTC)[reply]
- Even if it is a temporary outage, Cyberbot will simply be adding an archived version of the link to the original citation, either through the wayback template or, if a cite template is used, through the archive-url parameter. Nothing is lost. The verification procedure will be very error-prone at first, but as I get more information, refinements can easily be added. Rules can be added to the bot's configuration page for sites with regular problems. If the bot is being problematic with a source, users can attach a {{cbignore}} tag to the citation to tell the bot to go away.—cyberpowerChat:Limited Access 18:46, 24 June 2015 (UTC)[reply]
- Adding an archive url implies the link is dead, unless you add |deadurl=no, which implies it is not dead at this time. There was brief discussion on this (can't really recall where), and sending a user to a slower, cached version when a live one is available was deemed "bad". I would say you need consensus for making archive links the default links when the bot has known detection errors and links may be live. It may be low enough that people don't care as long as there are archives for really dead links. — HELLKNOWZ ▎TALK 19:35, 24 June 2015 (UTC)[reply]
- There is a clear consensus for a bot to do this at the Village Pump discussion. The problem of dead links is substantial, and the slim chance that a website will be down temporarily when the bot checks is vastly outweighed by the benefit of fixing links that are actually bad. I would also suggest that a site that goes down "temporarily" may not be the best site to link to either. bd2412 T 19:48, 24 June 2015 (UTC)[reply]
- I linked to the discussion that supports this bot. The bot leaves a message on the talk page advising the user to review the bot's edit and fix as needed. So any link changed that shouldn't be changed can be fixed and tagged with {{cbignore}}.—cyberpowerChat:Online 20:03, 24 June 2015 (UTC)[reply]
- Worth noting also that when I manually repair dead links, the site is usually down (although I have encountered a few working links which were presumably tagged during temporary outages) and the archived links almost always work. These errors do constitute only a minor share of all replacements, in my experience. Jo-Jo Eumerus (talk) 20:07, 24 June 2015 (UTC)[reply]
- I see consensus for dead links, not most likely dead links though. We had such consensus already, and this is a previously approved bot task. There is no question that we need a bot, the question is what error rate in what areas is allowed? The VP proposal was worded "could we have such a bot in theory?", not "we have a bot that will have x% error rate, is this acceptable?" We are talking hundreds of thousands of links here. Even a 0.01% error rate is thousands of links. From what Cyberpower says, it would be higher and we know some cases cannot be avoided. BRFA needs to show either close to 0% error rate or clear consensus that an error rate is acceptable (see, for example, ClueBot NG BRFA). This is described as part of WP:CONTEXTBOT. — HELLKNOWZ ▎TALK 21:25, 24 June 2015 (UTC)[reply]
- If we are talking about links that sometimes work and sometimes don't (and therefore might not be working when the bot checks), I think it's pretty obvious that we are better off with a link to an archived page that works all the time. It's not an error at all to replace a questionable link with a stable link to the same content. bd2412 T 22:08, 24 June 2015 (UTC)[reply]
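The point that a single check is unreliable can be made concrete with a small Python sketch: a link is only treated as dead after several consecutive failed checks. The function name and the default of three failures are assumptions for illustration, not something proposed in the discussion.

```python
from typing import Sequence


def confirmed_dead(check_history: Sequence[bool], required_failures: int = 3) -> bool:
    """check_history holds one entry per check, oldest first; True means the link responded.

    The link counts as dead only if the most recent `required_failures` checks all failed.
    """
    if len(check_history) < required_failures:
        return False
    return not any(check_history[-required_failures:])
```

Spacing the checks out over hours or days, rather than retrying immediately, would be what actually filters out maintenance windows and DNS hiccups.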
"If the bot makes any changes to the page, a talk page notice is placed alerting the editors there that Cyberbot has tinkered with a ref." -- Is there consensus for this? That's a lot of messages. — HELLKNOWZ ▎TALK 21:25, 24 June 2015 (UTC)[reply]
- Technically, 0.01% would be tens of links. I think we'll need a test run to establish how reliable the link replacement is, though.Jo-Jo Eumerus (talk) 21:45, 24 June 2015 (UTC)[reply]
- It's been asked for a couple times, it can be switched off. Link checking can be switched off too. It would drastically speed the bot up.—cyberpowerChat:Online 21:48, 24 June 2015 (UTC)[reply]
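For reference, the error-rate arithmetic discussed above works out as follows. The link counts are illustrative round numbers, not measured figures.

```python
def expected_errors(links: int, error_rate_percent: float) -> float:
    """Expected number of wrong replacements for a given percentage error rate."""
    return links * error_rate_percent / 100


# Illustrative totals only: "hundreds of thousands of links".
for total in (100_000, 500_000, 1_000_000):
    print(total, round(expected_errors(total, 0.01)))
```

At 0.01%, even a million links gives on the order of a hundred errors, so "tens" is the right magnitude for hundreds of thousands of links; thousands of errors would correspond to a rate closer to 1%.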
The main problem I see with this is automatically trying to identify whether a link is up or down. It's ridiculously tough for a bot to do it (reflinks had a ton of code for it), and IIRC sites like CNN and/or NYT blocked the toolserver in the past. I also don't see any advantage to using a special exclusion template and spamming talk pages. I also had written my own code for this (BRFA) which I'll resuscitate. It'll be great to have multiple bots working on this! Legoktm (talk) 22:22, 26 June 2015 (UTC)[reply]
- I have been discussing with Legoktm on IRC and I think 2 bots is a lovely idea. More coverage quicker. My bot shouldn't have any conflicts with another bot. Legoktm and I will be implementing a feature to allow them both to acknowledge {{nobots|deny=InternetArchiveBot}}. As for checking whether a link is dead or not, there seems to be agreement among us to leave that feature off for now, or indefinitely. As for spamming talk pages, we can see how that works out. If it's too much after the trial, we can turn that off too.—cyberpowerChat:Online 23:16, 26 June 2015 (UTC)[reply]
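A rough Python sketch of what honouring these opt-out tags might look like. The regular expressions are simplified and the function names are hypothetical; how the bots actually parse the templates is not described in the discussion. The sketch assumes that a bare {{nobots}} excludes every bot and that a deny list excludes only the bots it names.

```python
import re

# Simplified matcher for {{nobots}}, {{bots}}, {{nobots|deny=...}} and {{bots|deny=...}}.
BOT_TEMPLATE = re.compile(
    r"\{\{\s*(nobots|bots)\s*(?:\|\s*deny\s*=\s*([^}|]*))?\s*\}\}", re.IGNORECASE
)
CBIGNORE = re.compile(r"\{\{\s*cbignore\s*(?:\|[^}]*)?\}\}", re.IGNORECASE)


def page_denies(wikitext: str, bot_name: str) -> bool:
    """True if the page opts out of edits by `bot_name`."""
    for match in BOT_TEMPLATE.finditer(wikitext):
        template, deny = match.group(1).lower(), match.group(2)
        if deny is None:
            if template == "nobots":
                return True   # bare {{nobots}}: assumed to exclude every bot
            continue
        names = [n.strip().lower() for n in deny.split(",")]
        if "all" in names or bot_name.lower() in names:
            return True
    return False


def has_cbignore(citation_wikitext: str) -> bool:
    """True if a single citation carries the {{cbignore}} tag."""
    return CBIGNORE.search(citation_wikitext) is not None
```

The page-level check would run once per article, while the {{cbignore}} check would run per citation.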
Development Status
- Done Fetch appropriate articles
- Done Recognize and parse various formats in references
- Done Parse a template properly
- Done Recognize and parse various formatted external links, and citations
- Done Detect if a link is really dead
- Done Submit archive requests for links that are alive but have no archive
- Done Detect if the link has been marked as dead
- Done Detect if the link has an archive
- Done Handle the link properly
- Done Scan the archive and retrieve an archive
- Done Properly format new references and links
- Done Fix improperly formatted templates
- Done Notify on talk page
- Done Log report generator
- Done Refinements
- ((BAGAssistanceNeeded)) Development is finished, source code has been posted and I believe the bot is ready for a trial run.—cyberpowerChat:Online 23:08, 22 June 2015 (UTC)[reply]
- Before being approved for a trial please answer the following questions:
- Should Cyberbot scan all links on specified pages, or just references?
- Should Cyberbot scan all pages, or only those that contain dead-link tags?
- Should Cyberbot modify all links, only those tagged as dead, or those tagged as dead plus those the bot sees as dead?
- Should the bot verify if a tagged link is really dead, or blindly trust dead-link tags?
- Should the bot provide the latest archived copy or the one closest to the set access date of the source?
- Should Cyberbot touch sources that already have archives on them?
- Should Cyberbot leave a message on the respective talk page when it edits a page?
- Can you suggest a subject line Cyberbot should use for talk page messages? You can use keywords such as {linksrescued}, {linkstagged}, {linksmodified}, and {namespacepage}.
- Can you suggest the body of the message Cyberbot should leave behind. You can use the same syntax mentioned in the previous question. Use \n for newlines.
- Should Cyberbot check if a link is dead, as in check those that aren't tagged?
- Should Cyberbot make sure an archived copy is available and ready should the live link ever go down?
- All these questions are individual configuration options for this bot. Knowing what the community wants would be a great help.
- Here is my opinion on the matter:
- Bad links are bad links, so it shouldn't make a difference if they are in references or text.
- I interpret that as all links.—cyberpowerChat:Online 01:43, 23 June 2015 (UTC)[reply]
- Yes, all links. If a link is dead, it should be made good.
- Same as above, although I would start with those that are tagged.
- This is a configuration question. There are 2 scanning methods: one scans all pages, the other populates a list of pages that contain dead-link templates and scans those. It sounds like you want all pages in the end.—cyberpowerChat:Online 01:43, 23 June 2015 (UTC)[reply]
- In that case, I would go with all pages. Going with tagged links is useful because it focuses on links known to be dead, but if the resources exist to do all pages, go for it.
- I presume the bot will do nothing to links that appear to be in fine working order. If it sees a link as dead, it should fix it, tagged or not.
- I think it makes more sense to verify. Basically, it should be agnostic about the tags, since those may be erroneous.
- I agree that verification is a must, but the process is still quite error-prone. Certain dead links do return a 200 OK and the bot will see that as a live link.
- To what extent can the process be tweaked as it goes? Can we start with clearly dead links, and then refine the process for links that are tagged as dead but do not show up as dead?
- Rules can be introduced using the rules parameter in the configuration page. Verification algorithms can be improved on demand. The bot's source code has been written for maintainability.—cyberpowerChat:Limited Access 02:36, 23 June 2015 (UTC)[reply]
- Ok - not to throw in new complications, but if a link is tagged as a dead link, but the bot thinks it's a live link, perhaps the "dead link" tag should either be removed or modified to indicate that there's some question about whether it really is a dead link. Also, this raises an additional question for me. What does the bot do when it finds a dead link for which no fix exists (i.e. no archive)? Perhaps it should also note this on the talk page, so editors will know that whatever proposition the link is supposed to support will need a new source. bd2412 T 02:55, 23 June 2015 (UTC)[reply]
- The bot would simply remove the tag if the link was deemed alive, and if the link is dead and the bot can't find an archive, it will tag it as dead. Any modification done to the page results in a talk page notification. Both these features can be turned on and off on the configuration page.—cyberpowerChat:Limited Access 03:50, 23 June 2015 (UTC)[reply]
- I would prefer the closest archive to the source date, since the contents of the page may have changed.
- I'm not sure what you mean by "sources that already have archives". If the link already purports to point to an archive I don't know how we would find an archive of that link.
- What I mean by that is, if a source contains a reference to an archive, should Cyberbot fiddle with it or leave it alone? My recommendation is to leave them alone.—cyberpowerChat:Online 01:43, 23 June 2015 (UTC)[reply]
- I have no preference with respect to talk page messages. Since the operation is a bit complicated, I guess it would be too much to describe in a tag on the page.
- Have you seen the source code yet? Compared to that, notifying on the talk page is easy. :p—cyberpowerChat:Online 01:43, 23 June 2015 (UTC)[reply]
- This is more than a tagging or modification. I would just say "Dead link(s) replaced with archived links".
- Any message should briefly describe the operation, and state that "[this] dead link was replaced with [this] link from the Internet Archive" (or whatever service is used).
- As above, the concern is the links, irrespective of the tags. Although we can start with tagged links, ultimately every link should be checked.
- Checking for archives of working links seems a bit out of scope, and a bigger task. I don't recall whether we had determined that there is a way to prompt Internet Archive or another such service to archive a link.
- Some users have asked for it, and I've been able to implement without much cost to resources. I recommend this be turned on.
- If there's a call for it, sure.
- Cheers! bd2412 T 01:34, 23 June 2015 (UTC)[reply]
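The subject and body keywords from the questions ({linksrescued}, {linkstagged}, {linksmodified}, {namespacepage}, with \n for newlines) suggest a simple substitution step. The Python sketch below shows one way that could work; the function name and the sample values are hypothetical, and the bot's real templating is not shown in this discussion.

```python
def render_message(template: str, values: dict) -> str:
    """Expand literal \\n sequences and {keyword} placeholders in a configured message."""
    text = template.replace("\\n", "\n")
    for key, value in values.items():
        text = text.replace("{" + key + "}", str(value))
    return text


# Hypothetical configured body and values, for illustration only.
body = "Dead link(s) replaced with archived links on {namespacepage}.\\nRescued: {linksrescued}"
print(render_message(body, {"namespacepage": "Example", "linksrescued": 2}))
```

Placeholders with no supplied value are left untouched, which makes a misconfigured keyword easy to spot in the posted message.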
What should we do here? -- Magioladitis (talk) 13:54, 28 June 2015 (UTC)[reply]
- Approve for a trial, obviously. :p—cyberpowerChat:Online 14:03, 28 June 2015 (UTC)[reply]
First trial (100 edits)
Approved for trial (100 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. ·addshore· talk to me! 15:52, 28 June 2015 (UTC)[reply]
- 50 article edits and 49 talk page edits already done (counting by way of the edit reasons). I'll inspect the article edits. Jo-Jo Eumerus (talk) 18:00, 28 June 2015 (UTC)[reply]
- Trial complete. A random review of the edits reveals no problems.—cyberpowerChat:Online 18:24, 28 June 2015 (UTC)[reply]
- A few notes of mine:
- The bot is adding the {{wayback}} template without a space between the template and the preceding markup, leaving no space between any punctuation and the "Archived" template output in seeing mode. Is this right? (On Paul Bonner, it did add a space in one of the two replacements).
- Fixed Though I should note it didn't edit Paul Bonner.—cyberpowerChat:Online 19:43, 28 June 2015 (UTC)[reply]
- Whoops. It was Peter Bonner, not Paul. Sorry!
- The bot didn't change the two broken external links on Zeba Islam Seraj. I assume that the bot noticed that the most recent archived copies are also broken?
- Archive.org only returns the closest working copy of the page, or it returns nothing. If the bot gets nothing, it does nothing with the link.—cyberpowerChat:Online 19:52, 28 June 2015 (UTC)[reply]
- Floating ecopolis had a previously archived link that was broken by the bot. Apparently it tried to archive the already archived link.
- Actually the wayback template was being improperly used. The generated link is unusable. The bot attempted to fix the formatting, but it failed; it should have removed the 1= parameter.—cyberpowerChat:Online 19:57, 28 June 2015 (UTC)[reply]
- Fixed—cyberpowerChat:Online 21:15, 28 June 2015 (UTC)[reply]
- Talysh Khanate also had an incomplete replacement, not sure what went wrong there.
- Not sure what happened there either, I'll have to look at that closely.
- Fixed—cyberpowerChat:Online 21:31, 28 June 2015 (UTC)[reply]
- One Wayback archive was of an already broken page (last link on the Überlingen article). The Margin of error, Palmer's College and Gecko (software) replacements also appear to be already broken. Same for the last link on Koreatown, Los Angeles (or so it appears to me).
- The bot can't be expected to accurately determine if the archive is good or not, that's why the suggestion of human review.—cyberpowerChat:Online 20:27, 28 June 2015 (UTC)[reply]
- In a few instances, the bot replaced a working link with another working link because the original was mis-tagged as broken (Parsley Sidings, Hot Cross and Vanity (singer)).
- Don't blame the bot if someone else mistagged it. The bot can't be expected to know if the link is really dead or not when there is consensus to shut the link verification process off.—cyberpowerChat:Online 20:27, 28 June 2015 (UTC)[reply]
That's all from me - only the first four things are potentially problematic. Jo-Jo Eumerus (talk) 18:53, 28 June 2015 (UTC)[reply]
- [3] - the bot grabbed the earliest archive, why earliest and not latest? (P.S. I only checked like 10 pages, so this isn't a full review.) — HELLKNOWZ ▎TALK 19:51, 28 June 2015 (UTC)[reply]
- I'm assuming it has something to do with the blank accessdate parameter making the bot assume a unix timestamp of 0 and resulting in it trying to pull an archive as close to January 1, 1970 as possible. I'll put in a fix for that.—cyberpowerChat:Online 20:27, 28 June 2015 (UTC)[reply]
- WikiBlame could perhaps be implemented somehow? That's what I use when finding the best archived-link. (t) Josve05a (c) 20:37, 28 June 2015 (UTC)[reply]
- In all my years of being here, I never learned what WikiBlame is. Can someone enlighten me?—cyberpowerChat:Online 20:50, 28 June 2015 (UTC)[reply]
- A tool for searching in the revision history of a page, per Wikipedia:WikiBlame.Jo-Jo Eumerus (talk) 21:39, 28 June 2015 (UTC)[reply]
- How would that help?
- Fixed—cyberpowerChat:Online 21:53, 28 June 2015 (UTC)[reply]
- Josve05a has brought up more issues that I missed and have addressed them.—cyberpowerChat:Online 13:50, 29 June 2015 (UTC)[reply]
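The blank-accessdate problem described above (an empty value being read as Unix timestamp 0) can be illustrated with a short Python sketch. Falling back to the current time, and therefore to the latest archive, is an assumption made here; the discussion does not say what the actual fix does. The function names and accepted date formats are likewise hypothetical.

```python
from datetime import datetime
from typing import Optional


def parse_access_date(raw: Optional[str]) -> Optional[datetime]:
    """Return the parsed access date, or None when it is blank or unparseable."""
    if raw is None or not raw.strip():
        return None   # never coerce a blank value into timestamp 0
    for fmt in ("%Y-%m-%d", "%d %B %Y", "%B %d, %Y"):
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None


def snapshot_target(raw_access_date: Optional[str], now: datetime) -> datetime:
    """Pick the moment to search around: the access date if known, else `now`."""
    parsed = parse_access_date(raw_access_date)
    return parsed if parsed is not None else now
```

Keeping "unknown" as an explicit None until the last step is what prevents the 1970 lookup.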
Second trial (300 edits)
- The previous trial has concluded, and the issues brought up have been addressed. Requesting another trial, of 500 edits this time.—cyberpowerChat:Online 22:04, 28 June 2015 (UTC)[reply]
Approved for extended trial (300 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. 500 is too much for us to check. Let's do 300 first. -- Magioladitis (talk) 08:09, 5 July 2015 (UTC)[reply]
- Notes by Josve05a
I've only done tests on a few articles to find the worst bugs. These are not all, but those I found when checking a selective number of articles to see its reliability. Saying "the bot can't know if it is dead on Wayback" is not a good excuse. That is a reason not to allow the bot task.
Legend for recurring errors:
Code | Error |
---|---|
(b) | The source URL was not dead. |
(c) | The archive-url is dead. |
Diff | URL | Archived | Note |
---|---|---|---|
[4] | [5] | [6] | (c) THE BOT REPEATED THE EDIT AFTER BEING REVERTED |
[7] | [8] | [9] | (c) THE BOT REPEATED THE EDIT AFTER BEING REVERTED |
[10] | [11] | [12] | (c) THE LINK WAS INLINE, NOT IN-REF OR UNDER EXTERNAL LINKS |
[13] | [14] | [15] | (c) |
[16] | [17] | [18] | (c) |
[19] | - | - | Added |dead-url=yes, even though |deadurl=yes already existed |
[20] | [21] | [22] | (c) |
^ | [23] | [24] | (b) |
^ | [25] | [26] | (c) |
[27] | [28] | [29] | (c) |
[30] | [31] | [32] | (b) |
[33] | - | - | REMOVED CONTENT FROM THE ARTICLE |
[34] | - | - | TRIED TO FIX STRAY REF IN COMMENTED TEXT, BREAKING TEMPLATE, REMOVING CONTENT |
[35] | - | - | REMOVED CONTENT FROM THE ARTICLE |
(t) Josve05a (c) 16:18, 5 July 2015 (UTC)[reply]
{{OperatorAssistanceNeeded|D}}
Magioladitis (talk) 22:46, 7 July 2015 (UTC)[reply]
- Trial complete. Sorry. The bot is still waiting to receive the fixes to mentioned bugs.—cyberpowerChat:Online 22:53, 7 July 2015 (UTC)[reply]
- Rome wasn't built in a day. ;-) bd2412 T 23:15, 7 July 2015 (UTC)[reply]
- I have addressed (c). The likelihood of a bad archive being added should be greatly reduced now. A solution for items 1 and 2 is already present. I have fixed item number 6 and item number 13 so far.—cyberpowerChat:Online 13:09, 10 July 2015 (UTC)[reply]
- It took some searching but I managed to get 12 and 14 fixed and confirmed it with this edit. Also an addendum, I have instructed the bot to change the links in external links directly, if they are not inside reference tags. That way when fixing sources and links, I'm not disrupting the article with a wayback template.—cyberpowerChat:Offline 06:04, 11 July 2015 (UTC)[reply]
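Item 6 in the table (adding |dead-url=yes when |deadurl=yes was already present) comes down to checking for parameter aliases before adding a new one. A minimal Python sketch, with template parameters modelled as a plain dict; the function name and the choice of deadurl as the default spelling are assumptions.

```python
# Both spellings appear in the report above.
DEAD_URL_ALIASES = ("deadurl", "dead-url")


def mark_dead(params: dict) -> dict:
    """Set the dead-URL flag, reusing whichever alias the citation already has."""
    updated = dict(params)
    for alias in DEAD_URL_ALIASES:
        if alias in updated:
            updated[alias] = "yes"
            return updated
    updated["deadurl"] = "yes"   # no alias present: add one (choice of spelling is arbitrary)
    return updated
```

The same alias lookup would apply to other paired spellings a citation template accepts.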
Third trial (500 edits)
The bot appears to be ready for one last trial before approval.—cyberpowerChat:Offline 06:04, 11 July 2015 (UTC)[reply]
- I have reviewed the bot's configuration once again (last time I did it was before the first trial), and it seems like my earlier major concern about VERIFY_DEAD being set to true is resolved. (Note to closing BAGer: bot does not seem to have consensus to run with VERIFY_DEAD set to true, since it is too prone to errors.)
- I'm still unsure about the talk page notices. I cleaned up the wording a bit, adding a {diff} label (which Cyberpower says he can implement) and removing unnecessary information, but I'm still debating the general usefulness, since it will appear on tens (possibly hundreds) of thousands of talk pages. I would like to see some more comments on this.
- My only other issue is concerning PAGE_SCAN. As I understand it, setting this to false (as Cyberpower intends to do when the bot is approved) will involve tens of millions of archival requests to the Internet Archive (Wikipedia has 81,235,194 external links at last count; some of these are already archived but many will not be). I understand this is in line with the goals of that service, but I'm not sure if this is a good idea without explicit confirmation from them. So let's hold off on setting PAGE_SCAN to 0 after approval until we get more details on this.
- Anyway: Approved for extended trial (500 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Hopefully I will be able to do a careful review of the results of this trial after it is complete. — Earwig talk 05:00, 13 July 2015 (UTC)[reply]
- Trial complete. I looked over the edits and can't find any problems with them except that some pages don't seem to be archiving correctly, based on the talk messages left behind. The source rescuing hasn't revealed any bugs this time, but I would appreciate a separate set of eyes on this too in case I missed something.—cyberpowerChat:Online 13:43, 14 July 2015 (UTC)
The Earwig here you are! 500 pages :) Josve05a you may also want to have a look! -- Magioladitis (talk) 08:53, 15 July 2015 (UTC)[reply]
- I'll be checking the articlespace contributions. As a note/question, I am not sure how much importance should be placed on working link-->working archive or nonworking link-->nonworking archive replacements; they appear to be fairly minor issues on their own (unlike working link-->nonworking archive replacements). Jo-Jo Eumerus (talk, contributions) 13:30, 15 July 2015 (UTC)[reply]
- Alright, from Ani DiFranco forward I see [36] where the bot fixed one link but de-{{dead link}}-ed two and [37] where a citation already using a Webarchive link got its "wayback" part removed. Jo-Jo Eumerus (talk, contributions) 14:03, 15 July 2015 (UTC)[reply]
- I somehow missed that edit, which raises the question of how many others I missed. As for that edit, it's reasonable to assume the bot got confused as 2 sources were placed in one reference. Unless I'm mistaken, only one source should be in a reference at a time, so that should rather be fixed on the article. True?—cyberpowerChat:Online 14:14, 15 July 2015 (UTC)
- @Jo-Jo Eumerus: Is there anything wrong with that second case? It doesn't seem to be an explicit part of the task description, but the end result is better formatted since it does make the original source visible. @Cyberpower678: It is odd, yes, but I don't think it's technically disallowed – either way, the bot shouldn't be doing that, even though it's understandable why the bug would arise in the first place. — Earwig talk 05:20, 17 July 2015 (UTC)[reply]
- Mmm. Yeah, with your argument I think that can be done. I'll review some more edits from Assembly line forward. Jo-Jo Eumerus (talk, contributions) 10:58, 17 July 2015 (UTC)[reply]
- Alrighty, aside from the usual, Austin, Texas had a mistagged link (because it still works) changed to a broken Wayback link, but it's clearly noted on the talk page so I guess it's not a major issue. Nothing else serious to see. Jo-Jo Eumerus (talk, contributions) 11:49, 17 July 2015 (UTC)
- Then this might be a problem. The way the bot is coded, it's designed to look for reference tags, external links, and citation templates. If it finds the reference tag, it looks for the source inside it. It can't see 2 sources the way it's coded, and updating that would require major rewrites of the bot. While I can adjust the regex to absorb the 2 links no problem, feeding them into the parser might be a problem as it only takes in one link. Ideally, I would rather simply fix this issue on the article since this seems to have occurred only once of all the trials.—cyberpowerChat:Online 14:10, 17 July 2015 (UTC)[reply]
- Just thinking outside the box here, but why don't we make a separate bot to find and fix instances of multiple references in a single tag, run that on all of Wikipedia, and then run this one when that one is done. bd2412 T 14:29, 17 July 2015 (UTC)[reply]
- Unfortunately, you're thinking outside of our galaxy here. Such a bot would be extremely difficult to program. How would it know what text to put where? In the case here, this reference has 2 external links and text mentioning both links. Your bot would need to master the English language first. I do like the idea though.—cyberpowerChat:Online 15:16, 17 July 2015 (UTC)
- My suggestion would be to correct the reference manually and move on. I have a sneaking suspicion that this issue will come up so rarely, that any human could easily fix it. And the bot won't come back to it once it has an archive link, or an ignore tag.— Preceding unsigned comment added by cyberpower678 (talk • contribs)
- Is there a way to tell these problem refs? Maybe the bot can simply list them somewhere (or tag them) and have a human repair them before doing the botwork. Jo-Jo Eumerus (talk, contributions) 15:46, 17 July 2015 (UTC)[reply]
- Yes. That can easily be done. But to do all 12 million articles may take some time.—cyberpowerChat:Online 15:53, 17 July 2015 (UTC)[reply]
- OK. I believe this is a problem that can be actively dealt with while the bot is working. Were there any other problems? -- Magioladitis (talk) 10:36, 24 July 2015 (UTC)
- Not that I am aware of.—cyberpowerChat:Online 13:54, 24 July 2015 (UTC)[reply]
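The idea raised above, listing references that hold more than one source so a human can repair them before the bot reaches them, could be sketched roughly as follows. This is only an illustrative Python sketch, not the bot's actual PHP code: the function name `refs_with_multiple_sources`, both regular expressions, and the rule that a `web.archive.org` snapshot counts as a copy of a source rather than a second source are all assumptions made for the example.

```python
import re

# Assumption: a reference is a <ref ...>...</ref> pair. Self-closing tags such as
# <ref name="x" /> carry no body and are not matched by this pattern.
REF_RE = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)

# Assumption: a bare or bracketed URL ends at whitespace or common wikitext delimiters.
URL_RE = re.compile(r"https?://[^\s\]|}<>\"]+", re.IGNORECASE)


def refs_with_multiple_sources(wikitext):
    """Return, for each reference containing more than one distinct
    non-archive URL, the list of those URLs in order of appearance."""
    flagged = []
    for match in REF_RE.finditer(wikitext):
        urls = URL_RE.findall(match.group(1))
        # Treat Wayback snapshots as copies of a source, not as extra sources.
        originals = [u for u in urls if "web.archive.org/" not in u]
        distinct = list(dict.fromkeys(originals))  # de-duplicate, keep order
        if len(distinct) > 1:
            flagged.append(distinct)
    return flagged


sample = (
    "Text.<ref>[http://a.example/one First] and [http://b.example/two Second]</ref> "
    "More.<ref>{{cite web|url=http://c.example/x"
    "|archiveurl=https://web.archive.org/web/2015/http://c.example/x}}</ref> "
    '<ref name="n" />'
)
problem_refs = refs_with_multiple_sources(sample)
```

A report built this way would only list candidates for manual repair. It makes no attempt to rewrite the reference text, which is the part described above as too hard to automate.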
Josve05a did you have the chance to check (some of) the 500 edits? -- Magioladitis (talk) 14:32, 24 July 2015 (UTC)
- I did do some spot tests and checks and the rate of error (dead archives etc.) is within my acceptable parameters, in my opinion. However, I would suggest a new maintenance template/category be added next to the links/on the talk page, so a human can do a second review of all bot edits if wanted, instead of having to look at edit logs. Like "Template:Bot link-archivation" or something, in monthly categories. Just a suggestion, to catch those which may be in error. (t) Josve05a (c) 16:50, 24 July 2015 (UTC)
- Would the talk page notifiers serve that scope? Jo-Jo Eumerus (talk, contributions) 16:55, 24 July 2015 (UTC)[reply]
- Not unless they all got "collected" at one page, like if they had a category/template in them. It is one thing to "see" the talk page notifiers while on the article, another to systematically manually review them afterwards. The notifiers are there to "let you know" that it happened; a category that these could be in would be there to "allow manual reviews"...I'm just mumbling right now... (t) Josve05a (c) 17:10, 24 July 2015 (UTC)
- How about fashioning a template to go in the talkpage message? The template has a switch, resolved=no, which places the page in a category, and resolved=yes, which removes the page from the category.—cyberpowerChat:Online 14:28, 6 August 2015 (UTC)[reply]
- "Resolved" makes it sound like it inherently has a problem, which it should not. I think {{{manually_checked}}} or something is more "accurate", but it sounds like a plan. Has my 'vote'. (t) Josve05a (c) 19:16, 6 August 2015 (UTC)
- How about {{{checked}}}?—cyberpowerChat:Online 19:53, 6 August 2015 (UTC)
- Done. Also web archive doesn't seem to have a problem with the bot archiving.—cyberpowerChat:Limited Access 02:59, 7 August 2015 (UTC)
- Example
Here's an example. {{sourcecheck}}
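To make the proposed switch concrete, here is a rough Python sketch of how a reviewer's script might decide whether a talk page still needs a manual check. It is not the template's implementation. The parameter name `checked` comes from the discussion above, but the function names, the regular expression, and the rule that a missing value or anything other than `true` counts as unchecked are assumptions made for illustration.

```python
import re

# Assumption: the notice is transcluded as {{sourcecheck}} or {{sourcecheck|checked=...}},
# with no nested templates inside its parameters.
SOURCECHECK_RE = re.compile(r"\{\{\s*sourcecheck\s*(\|[^{}]*)?\}\}", re.IGNORECASE)


def sourcecheck_states(talk_wikitext):
    """Return the value of the checked switch for every notice found on the page."""
    states = []
    for match in SOURCECHECK_RE.finditer(talk_wikitext):
        params = {}
        for part in (match.group(1) or "").split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                params[key.strip().lower()] = value.strip().lower()
        # Assumption: a notice without the parameter has not been reviewed yet.
        states.append(params.get("checked", "false"))
    return states


def needs_review(talk_wikitext):
    """True if at least one notice on the page is not marked checked=true."""
    return any(state != "true" for state in sourcecheck_states(talk_wikitext))
```

On the wiki itself the template would handle the categorization; a script like this would only be useful for auditing a dump or a list of talk pages offline.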
Outcome
{{BAGAssistanceNeeded}} I recommend that this bot task be approved, on the condition that the template above is implemented. In case a bug arises which breaks a page, or changes page layout in any way, the bot shall be turned off and not be turned on again until the bug has been fixed, in order to not break more pages. This should not be conditional. (t) Josve05a (c) 03:23, 8 August 2015 (UTC)
- The bot has a runpage and the changes have been implemented.—cyberpowerChat:Offline 04:56, 8 August 2015 (UTC)[reply]
- Three things:
- What's going on here?
- Talk pages that are automatically bot-archived are going to lose these notifications, even when they are still marked with |checked=false. This might be a problem given the categorization. Also, I'm not sure if requiring (or recommending, at the very least) manual intervention on over a hundred thousand talk pages is a good idea.
- I made a minor tweak to the talk page message and changed the name of {{sourcecheck}}'s category to Category:Articles with unchecked bot-modified external links. Willing to change again if people don't like it. Let's leave it red until approval.
- Thanks. — Earwig talk 01:50, 10 August 2015 (UTC)[reply]
- What do you mean?
- How? It'll simply relocate the category link to the archive, meaning one can still put 2 and 2 together to figure out which article that archive belongs to.
- Agreed.
- Cheers.—cyberpowerChat:Limited Access 18:42, 11 August 2015 (UTC)[reply]
- Re #1, I do not understand what that first message is about. Is it part of this task? What's the real point of it? I suspect it will show up a lot for similar pages. Why isn't it combined with the main message? Re #2, I realize the meaning will be clear, but we are then suggesting that users edit talk archives. I suppose this is not a dealbreaker, but I'm not fully satisfied with it either. — Earwig talk 09:02, 12 August 2015 (UTC)[reply]
- It simply means that the bot received a bad response from the archive while attempting to archive non-dead pages. It's doing that to alert editors to the possibility that the link may be dead, may be a redirect, or may be on a site that does not allow archiving, and to suggest that, if the site is prone to dying, the link be manually archived somehow.—cyberpowerChat:Online 12:45, 12 August 2015 (UTC)
- Can you combine that with the main message? — Earwig talk 03:15, 15 August 2015 (UTC)[reply]
- I can possibly put in a patch to combine the messages. But what about the edit summaries? Also, it would seem the WMF has taken an interest in this bot and is offering to use their name in talks with IA, to better improve the service. So now I am also waiting to hear from them.—cyberpowerChat:Limited Access 16:08, 21 August 2015 (UTC)[reply]
Approved. Cyberpower has removed the message regarding un-archivable links. To the best of my knowledge, that was the only remaining issue. Future feature requests, such as detecting unmarked dead links, should be made under a subsequent BRFA. — Earwig talk 00:24, 25 August 2015 (UTC)
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.