Wikipedia:Bots/Requests for approval/Ocobot
- The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.
Operator: Ocolon
Automatic or Manually Assisted: unsupervised automatic
Programming Language(s): PHP, MySQL
Function Summary: list articles with broken external links
Edit period(s) (e.g. Continuous, daily, one time run): continuous
Edit rate requested: ≤ 12 per hour, later revised to 0.4 per minute (see below)
Already has a bot flag (Y/N): -
Function Details: Ocobot checks articles for broken external links and then lists the articles that contain them.
Human users can decide what to do with these links (keep/alter/delete...).
Discussion
This will be my first bot, if it gets approved. So I'm looking forward to suggestions on how it could be made better. Thank you. Ocolon 16:59, 4 March 2007 (UTC)
- What if an external server is down for a half hour or so? Will the bot go back and check links more than once to make sure they really are dead? Where will it get the list of articles to check? --Selket Talk 15:32, 5 March 2007 (UTC)
- What if an external server is down for half an hour? I can make Ocobot check dead links twice; good suggestion. But even then he won't delete them on his own. He'll only list them on a dead links page e.g. User:Ocobot/Dead links. I think that's better for several reasons:
- An external server might be down for, say, two days. Human users can better evaluate if a dead link will be resurrected.
- The link might be dead but still worth keeping in the article anyway, for example if it is an important reference. We also keep books as references that can no longer be bought.
- Sometimes a link only needs a small change, maybe it had been misspelled, and then it works again.
- Where will it get the list of articles to check?
It will automatically check random articles. But you can also put articles or categories on his schedule manually. The search depth will be one by default (he checks a random article and all linked articles). But if you put a category on his schedule manually, you can also make him search deeper, so that he checks all articles in all sub-categories of a given category, for example. He won't check all of Wikipedia at once though - too much traffic for me. — Ocolon 15:57, 5 March 2007 (UTC)
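A minimal sketch of this depth-limited scheduling might look like the PHP below. It assumes raw wikitext can be fetched via index.php?action=raw and that plain [[wikilinks]] are enough to find linked articles; Ocobot's real crawler is not shown in this request.

```php
<?php
// Illustrative sketch only (not Ocobot's actual code): collect the titles of
// a start article and everything it links to, down to $maxDepth levels.
// Assumes allow_url_fopen is enabled.

function fetchWikitext(string $title): string
{
    $url = 'https://en.wikipedia.org/w/index.php?title='
         . urlencode($title) . '&action=raw';
    return (string) @file_get_contents($url);
}

function collectTitles(string $start, int $maxDepth): array
{
    $seen  = [$start => 0];      // title => depth at which it was found
    $queue = [[$start, 0]];
    while ($queue) {
        [$title, $depth] = array_shift($queue);
        if ($depth >= $maxDepth) {
            continue;            // depth 1 = the article plus its direct links
        }
        // Plain [[wikilinks]]; pipe labels and section anchors are cut off.
        preg_match_all('/\[\[([^\[\]|#]+)/', fetchWikitext($title), $m);
        foreach ($m[1] as $linked) {
            $linked = trim($linked);
            if ($linked !== '' && !isset($seen[$linked])) {
                $seen[$linked] = $depth + 1;
                $queue[]       = [$linked, $depth + 1];
            }
        }
    }
    return array_keys($seen);    // the titles whose external links would be checked
}

// Example: the default schedule entry, search depth 1.
// $titles = collectTitles('Example article', 1);
```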
- I just checked my userpage with Ocobot and a search depth of 1 and got HTTP code 404 (file not found) six times. I checked them manually: they really didn't exist. I just add this to the discussion to show that the bot can really be of use. There are many broken links out there. This test run didn't require Ocobot to log in or edit anything. He only read my userpage and the pages linked there (fewer than 20), just as a human would have done and like search engine bots do all day long. Therefore it didn't need approval. — Ocolon 18:23, 5 March 2007 (UTC)
I put an example output table in my sandbox. The suggested repeated checks of broken links are implemented. The status codes will be explained. I'll add links to WP:CITE and Using the Wayback Machine of course. And there will be an ignore list for links that should be skipped. Suggestions are always appreciated. Ocolon 07:46, 8 March 2007 (UTC)
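For illustration, the repeated-check and ignore-list behaviour can be reduced to a filter like the one below. The array shapes (URL => status code) and the function name are assumptions made for this sketch, not Ocobot's real data structures.

```php
<?php
// Sketch of the "report only after two failed checks" rule plus the ignore
// list. Status codes are either HTTP codes (e.g. 404) or cURL error numbers.

function brokenLinksToReport(array $firstPass, array $secondPass, array $ignore): array
{
    $report = [];
    foreach ($firstPass as $url => $status1) {
        if (in_array($url, $ignore, true)) {
            continue;                      // users asked for this URL to be skipped
        }
        if (!array_key_exists($url, $secondPass)) {
            continue;                      // not rechecked yet, so don't report it
        }
        $status2 = $secondPass[$url];
        $failed1 = ($status1 < 200 || $status1 >= 400);   // HTTP error or cURL errno
        $failed2 = ($status2 < 200 || $status2 >= 400);
        if ($failed1 && $failed2) {
            $report[$url] = $status2;      // dead in both passes: list it for humans
        }
    }
    return $report;
}
```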
It has been one week since my request and I have not received a comment from an approvals group member yet (I'm thankful for Selket's proposals though). Approvals group members have been quite active since then and have approved or commented on a number of bots, so I reckon my request didn't meet the standards. I'm sorry about that. I don't want to rush anyone, I'd just prefer to get a hint instead of silence.
… To be more specific about Ocobot: The only thing I request approval for at the moment is that Ocobot may log in and edit its own subpages to list the broken links he has found. The broken link finding process requires neither logging in nor editing anything. It is something that search engine bots do all the time. I could run Ocobot completely outside Wikipedia and make him list dead links on my server. I'd then remove or revive them. This would have the following advantages:
- Saves some server space because it doesn't edit its own user subpages.
- Doesn't require approval, as long as it doesn't use the random article or other dynamically generated pages.
Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please. — Wikipedia's robots.txt
Disadvantages would be:
- Lack of transparency
- Users cannot add categories and articles to Ocobot's schedule
- Users cannot add categories, articles and external links to Ocobot's ignore list
- The only user who would then remove or revive dead links from the list would be me
I think the disadvantages outweigh the advantages. Therefore I ask you to approve that Ocobot may log in to Wikipedia and save its search results on its own subpages. Suggestions are always appreciated. Ocolon 08:21, 11 March 2007 (UTC)
- Can we see the results of this bot in a test run? Check one or two pages. Betacommand (talk • contribs • Bot) 19:43, 14 March 2007 (UTC)
- Sure. :) I put an example output table into my sandbox. It's just the table at the moment. I'll write detailed explanations for the error codes too (404 is the HTTP code for page not found; the others are cURL error numbers, e.g. 6 means it couldn't connect to the server (it might not exist anymore) and 28 stands for a timeout…). — Ocolon 20:50, 14 March 2007 (UTC)
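As a rough idea of how such a check can be done with PHP's cURL extension (the exact options Ocobot uses aren't documented here, so treat this as an assumption): the function returns the HTTP status code when the request goes through and the cURL error number when it doesn't, matching the kind of codes shown in the example table.

```php
<?php
// Sketch of a single external link check with cURL.

function checkUrl(string $url): int
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request keeps traffic low
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 20);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Ocobot link check (example sketch)');
    curl_exec($ch);
    $errno = curl_errno($ch);                        // 0 if the transfer itself worked
    $code  = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $errno !== 0 ? $errno : $code;            // cURL errno, else HTTP status
}
```

A HEAD request keeps the traffic low, although a few servers answer HEAD differently from GET, which is one more reason to let humans make the final call on each listed link.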
- I've changed the edit rate to two edits within five minutes: one edit to update the dead links list and one to read out and clear the schedule. If there's nothing on the schedule, the rate will be reduced to one edit in five minutes, as requested before. Furthermore, I don't think Ocobot needs to check random articles. Doing what is put on the schedule should be enough. And, well, you can put Special:Random on the schedule anyway. I updated Ocobot's user page. It's not finished yet, but I think it already provides valuable information, much more than other bots' pages do. — Ocolon 12:58, 17 March 2007 (UTC)
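Read as pseudocode, the cadence described here boils down to a five-minute cycle. The helper functions below are placeholders (the real schedule reading and page editing aren't shown in this request), so the sketch only illustrates the throttling.

```php
<?php
// Sketch of the requested edit rate: per five-minute cycle at most one edit
// to read out and clear the schedule and one to update the broken links list.

const CYCLE_SECONDS = 300;   // five minutes, i.e. at most 0.4 edits per minute

function readAndClearSchedule(): array { return []; } // placeholder: would blank the schedule page (edit #1)
function checkArticles(array $titles): array { return []; } // reading only, no edits
function updateBrokenLinksPage(array $broken): void {}      // placeholder: would save the list (edit #2)

while (true) {
    $start = time();

    $scheduled = readAndClearSchedule();
    if ($scheduled !== []) {
        updateBrokenLinksPage(checkArticles($scheduled));
    }
    // With an empty schedule the cycle makes at most one edit, matching the
    // reduced rate mentioned above.

    $elapsed = time() - $start;
    if ($elapsed < CYCLE_SECONDS) {
        sleep(CYCLE_SECONDS - $elapsed);   // never exceed the requested rate
    }
}
```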
- Looks good. Approved for trial. Please provide a link to the relevant contributions and/or diffs when the trial is complete. Make 100 edits or so and post back here with diffs. —METS501 (talk) 16:27, 17 March 2007 (UTC)
- Ocobot made 96 automatic edits: schedule history, broken links history. I stopped the bot manually twice during the run to make minor regular expression changes. It's working fine.—The preceding unsigned comment was added by Ocolon (talk • contribs).
- The entire process looks fine, I just have concerns about getting the lists of pages and loading them. How about getting a database dump to get all of the page titles, and then loading the pages using Special:Export? —METS501 (talk) 00:50, 24 March 2007 (UTC)
- I'm currently using the Query API to load the pages. Its XML output doesn't look fundamentally different from the Special:Export output, and it isn't any longer. Special:Export would be better for retrieving all pages, that's right. The page titles could be retrieved from a database dump or simply from Special:Allpages (a rough Special:Export sketch follows below this reply). But User:Ocobot would then basically become a clone of Wikipedia:Dead external links, which is a great project with some advantages over Ocobot but also some disadvantages:
- Due to the huge amount of data it isn't up to date.
- Users have to strike out recovered/removed links by hand so other users won't check the same ones again — that's a consequence of 1.
- Nevertheless it lists links that have already been removed, see 1.
- It lists more links that aren't actually broken, because it cannot recheck external URLs as often, again because of 1.
- My approach is somewhat different. I want to give contributors the possibility to have "their article" or their area of interest checked with a relatively short delay. I don't see yet how this would be possible with a database dump.
- I have to keep Ocobot's list of dead links rather short and up to date to achieve my goal. That's also why I am probably going to request approval for an extension if this one gets approved.
- Do you agree, or do you see a better alternative? — Ocolon 07:58, 24 March 2007 (UTC)
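For reference, a bare-bones version of the Special:Export route discussed above could look like this; the naive <text> extraction and the URL regex are simplifications for illustration, not the parsing Ocobot actually uses.

```php
<?php
// Sketch: fetch one page title as export XML and pull the external link
// targets out of its wikitext. Assumes allow_url_fopen is enabled.

function exportWikitext(string $title): ?string
{
    $url = 'https://en.wikipedia.org/wiki/Special:Export/' . rawurlencode($title);
    $xml = @file_get_contents($url);
    if ($xml === false) {
        return null;
    }
    // The wikitext sits inside the <text> element of the export XML.
    if (preg_match('#<text[^>]*>(.*?)</text>#s', $xml, $m)) {
        return htmlspecialchars_decode($m[1], ENT_QUOTES);
    }
    return null;
}

function externalLinks(string $wikitext): array
{
    // Bare http/https URLs, whether bracketed or not.
    preg_match_all('#\bhttps?://[^\s\]<>"]+#i', $wikitext, $m);
    return array_unique($m[0]);
}

// Example: $urls = externalLinks((string) exportWikitext('Example article'));
```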
- OK, that's fine. No problems. Approved. This bot shall run with a flag. —METS501 (talk) 14:23, 24 March 2007 (UTC)
- The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.