Wikipedia:Bots/Requests for approval/JeffGBot
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Operator: Jeff G. (talk · contribs)
Time filed: 18:21, Wednesday February 23, 2011 (UTC)
Automatic or Manually assisted: automatic posting as a part of manually initiated runs
Programming language(s): Python (latest v2.x version, 2.7.1 as of 2010-11-27)
Source code available: standard pywikipediabot (latest nightly build, weblinkchecker.py as last modified 2010-12-23, internally stamped "8787 2010-12-22 22:09:36Z")
Function overview: finds broken external links and reports them to the talk page of the article in which the URL was found, per m:Pywikipediabot/weblinkchecker.py. weblinkchecker.py creates two files (the workfile deadlinks-wikipedia-en.dat and the results file results-wikipedia-en.txt), which would be distributed manually by the Operator (the first on request, as part of and after the discussion below; the second at /results-wikipedia-en.txt, a subpage of this page, or at User:JeffGBot/results-wikipedia-en.txt).
Links to relevant discussions (where appropriate):
Edit period(s): manual runs, at least once every two weeks
Estimated number of pages affected: all talk pages of article-space pages with broken external links, using the default put throttle of 10 seconds between posts (a maximum of 6 posts per minute)
Exclusion compliant (Y/N): N/A - this bot is not intended to touch user or user talk pages
Already has a bot flag (Y/N):
Function details: can be found at m:Pywikipediabot/weblinkchecker.py. With reference to some questions at Wikipedia:Bots/Requests for approval/PhuzBot and elsewhere, I have asked for some assistance at m:Talk:Pywikipediabot/weblinkchecker.py#Questions_from_BRFAs_and_elsewhere_on_English_Wikipedia.
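As a rough illustration of the workflow described in the function overview and details above, the check-record-report cycle amounts to something like the following. This is a simplified sketch in present-day Python (not the 2.x the bot runs), not the actual weblinkchecker.py code; the function names and file handling are assumptions.

```python
# Simplified sketch of the weblinkchecker workflow described above.
# Not the real pywikipediabot code; names and structure are illustrative only.
import pickle
import time

PUT_THROTTLE = 10  # seconds between talk-page posts, i.e. at most 6 posts per minute
DEADLINK_REPORT_DELAY = 7 * 24 * 3600  # only report links that have stayed dead for a week

def load_history(path="deadlinks-wikipedia-en.dat"):
    """Load the workfile that remembers when each URL/article pair first failed."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except IOError:
        return {}

def record_failure(history, url, page_title):
    """Remember a failed URL; return True once it has been dead long enough to report."""
    first_seen = history.setdefault((url, page_title), time.time())
    return time.time() - first_seen >= DEADLINK_REPORT_DELAY

def report(page_title, url, results_path="results-wikipedia-en.txt"):
    """Append to the results file; the real bot also posts to the article's talk page."""
    with open(results_path, "a") as f:
        f.write("* %s in [[%s]]\n" % (url, page_title))
```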
Discussion
- "Bots that download substantial portions of Wikipedia's content by requesting many individual pages are not permitted. When such content is required, download database dumps instead."
Even were it modified to crawl every page querying prop=extlinks instead of downloading the page text, IMO it would still be better done from a database dump or a toolserver query. I also would like to know why you intend to post to article talk pages instead of applying {{dead link}} directly to the page, and how this would interact with User:WebCiteBOT or other processes that provide archive links; for example, would it complain about dead links in |url= for every {{cite web|url=|archiveurl=}}? Anomie⚔ 19:51, 23 February 2011 (UTC)[reply]
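For reference, a prop=extlinks query returns just the recorded external links for a page, without the page text. A minimal sketch of such a query, assuming the standard MediaWiki API and written in present-day Python, might look like this:

```python
# Sketch of fetching external links via prop=extlinks instead of page text.
# Illustrative only; parameters follow the standard MediaWiki API.
import json
import urllib.parse
import urllib.request

def get_external_links(title, api="https://en.wikipedia.org/w/api.php"):
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "extlinks",
        "titles": title,
        "ellimit": "max",
        "format": "json",
    })
    with urllib.request.urlopen("%s?%s" % (api, params)) as resp:
        data = json.load(resp)
    links = []
    for page in data["query"]["pages"].values():
        for el in page.get("extlinks", []):
            # Older JSON output keys each link as "*"; newer format versions use "url".
            links.append(el.get("*") or el.get("url"))
    return links
```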
- The bot does not, by default, know how to use database dumps, query the toolserver, download only extlinks, or post directly to the article page. It appears to ignore cites that WebCiteBOT has already processed because they already have an archiveurl parameter. — Jeff G. ツ 20:02, 23 February 2011 (UTC)[reply]
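A check of the kind just described, skipping citations that already carry an archiveurl parameter, could look roughly like the following. This is a hypothetical sketch for illustration only, not the bot's actual logic (and, as noted in the next reply, testing suggests the bot in fact checks both links).

```python
# Hypothetical sketch: skip |url= values whose {{cite web}} already has |archiveurl=.
# Not the bot's actual behaviour; shown only to illustrate the point under discussion.
import re

CITE_RE = re.compile(r"\{\{\s*cite web(.*?)\}\}", re.IGNORECASE | re.DOTALL)

def urls_without_archive(wikitext):
    """Yield |url= values from cite web templates that lack an |archiveurl=."""
    for match in CITE_RE.finditer(wikitext):
        body = match.group(1)
        if re.search(r"\|\s*archiveurl\s*=\s*\S", body):
            continue  # already archived, leave it alone
        url = re.search(r"\|\s*url\s*=\s*(\S+)", body)
        if url:
            yield url.group(1)
```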
- When I pointed it at a test page, it did seem to check both links in {{cite web|url=|archiveurl=}}. Anomie⚔ 00:32, 24 February 2011 (UTC)[reply]
- I've got a script in the works (including a partnership with webcitation.org) that should make this bot pointless. ΔT The only constant 00:36, 24 February 2011 (UTC)[reply]
- Perhaps this bot account could run that script instead. Would you care to share any details? — Jeff G. ツ 01:57, 1 March 2011 (UTC)[reply]
- It's nowhere near stable enough for me to release the code (it uses a ton of code; the last count had me at over 110 pages), and it still requires a human eye to double-check it. But the basics include a Python-like implementation of AWB's general fixes, along with a few other advanced cleanup features: lookup/addition of archive.org URLs and those of WebCite, archiving of still-active links via WebCite where needed (and passing along metadata to them through a soon-to-be-upgraded API), removal of missing images, and several other features. ΔT The only constant 17:08, 1 March 2011 (UTC)[reply]
- How is your work on that going? — Jeff G. ツ 03:17, 23 March 2011 (UTC)[reply]
- Actually pretty good, the code has become fairly stable and just need to poke the cite team for whitelisting of my ip. ΔT The only constant 03:21, 23 March 2011 (UTC)[reply]
- How do you deal with websites that have paywalls or require subscriptions? Will the bot ignore links tagged with "(subscription required)"-style tags? What is your delay between re-visiting websites? Do you respect the robots.txt of websites, and is the bot faking the user agent/referrer? — HELLKNOWZ ▎TALK 13:23, 3 March 2011 (UTC)[reply]
- The paragraph above appears, from its single colon of indentation, to be a reply to Anomie, but its content suggests it may be a reply to me, so I'll provide some answers:
- My computer does not have access to any content on reliable-source "web-sites that have paywalls or require subscriptions". Other humans, and their computers, who do have such access can discount its reports.
- The delay between re-visiting particular article/URL combinations is a minimum of a week.
- Robots.txt is inapplicable because the bot is not crawling any websites, just trying to visit particular URLs that are already in articles.
- The bot is coded to use user agent "pywikibot.useragent" or "Mozilla/5.0 (X11; U; Linux i686; de; rv:1.8) Gecko/20051128 SUSE/1.5-0.1 Firefox/1.5" and not to mention referrer.
- — Jeff G. ツ 02:10, 10 March 2011 (UTC)[reply]
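The non-crawling, single-URL check and the user-agent behaviour described in these answers could be pictured roughly as follows. This is a sketch with assumed names in present-day Python, not the actual weblinkchecker.py code, and the real bot's retry and error handling are more involved.

```python
# Sketch of a single-URL liveness check: no crawling, just one request per listed URL.
import urllib.error
import urllib.request

USER_AGENT = ("Mozilla/5.0 (X11; U; Linux i686; de; rv:1.8) Gecko/20051128 "
              "SUSE/1.5-0.1 Firefox/1.5")  # one of the agents mentioned above

def check_url(url, timeout=30):
    """Return (alive, detail) for a single URL already present in an article."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return True, "HTTP %d" % resp.getcode()
    except urllib.error.HTTPError as e:
        # Paywalled sites should show up as 401/403 here, though some wrongly return 404.
        return False, "HTTP %d" % e.code
    except urllib.error.URLError as e:
        return False, str(e.reason)
```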
- Also, the bot is not "requesting many individual pages" (as in HTML web pages), it is instead requesting raw wikitext, which should not increase the parsing load at all. — Jeff G. ツ 03:10, 10 March 2011 (UTC)[reply]
- The questions were for you; I meant to do a bullet. I mentioned subscriptions/paywalls because some sites wrongly return 404s in place of 401/403 or the like. robots.txt is applicable because you are using an automated tool for browsing in large quantities; Wayback and WebCite wouldn't have archives of robots.txt-excluded URLs anyway, but it's not really that important. The delay sounds good. The second user agent is fine. Not mentioning a referrer does make some sites act weird; it's best to fake the domain's top-level URL as the referrer. — HELLKNOWZ ▎TALK 08:52, 10 March 2011 (UTC)[reply]
- IMHO, sites that require a referrer, thus actively denying traffic from "Email Clients, IM, AIR Apps, and Direct" (bit.ly terminology), should be considered unreliable (or, at minimum, badly configured). — Jeff G. ツ 03:23, 11 March 2011 (UTC)[reply]
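For illustration, the "top-level URL as referrer" idea suggested above amounts to something like the following sketch; the bot as configured sends no referrer at all.

```python
# Sketch: derive a Referer header from the link's own domain, as suggested above.
from urllib.parse import urlsplit

def top_level_referer(url):
    """Return 'scheme://host/' for use as a Referer header."""
    parts = urlsplit(url)
    return "%s://%s/" % (parts.scheme, parts.netloc)

# Example: top_level_referer("http://example.com/some/deep/page.html")
# gives "http://example.com/".
```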
- Also, not sure what you mean by "N/A - this bot is not intended to touch user or user talk pages" in Exclusion Compliance. It refers to {{bots}} template that could be placed on any page. — HELLKNOWZ ▎TALK 08:53, 10 March 2011 (UTC)[reply]
- Sorry, I took 'These templates should be used mainly on the "User" and "User talk" namespaces and should be used carefully in other spaces' on Template:Bots literally. The bot is not designed to write to pages anywhere except the "Talk" namespace; I don't know if it is designed to respect {{bots}} on pages in the "Talk" namespace. — Jeff G. ツ 02:53, 11 March 2011 (UTC)[reply]
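For reference, exclusion compliance is usually a small wikitext test before saving. A minimal sketch, simplified from the pattern documented at Template:Bots and assuming raw wikitext is available and the account name is "JeffGBot", might be:

```python
# Minimal sketch of {{bots}}/{{nobots}} exclusion checking on a page's wikitext.
# Simplified; the real template also supports allow= semantics not handled here.
import re

def allowed_to_edit(wikitext, botname="JeffGBot"):
    """Return False if {{nobots}} or a {{bots|deny=...}} listing this bot is present."""
    if re.search(r"\{\{\s*nobots\s*\}\}", wikitext, re.IGNORECASE):
        return False
    deny = re.search(r"\{\{\s*bots\s*\|[^}]*deny\s*=([^}|]*)", wikitext, re.IGNORECASE)
    if deny:
        denied = [name.strip().lower() for name in deny.group(1).split(",")]
        return "all" not in denied and botname.lower() not in denied
    return True
```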
{{BAG assistance needed}} Are there any further questions, comments, or concerns? Thanks! — Jeff G. ツ 02:53, 23 March 2011 (UTC)[reply]
- I think that the bot will do good and you'll be a great owner. WayneSlam 20:49, 23 March 2011 (UTC)[reply]
Given the number of dead links, I strongly suggest the bot place {{Dead link}} instead and attempt to fix the links with Wayback/WebCite. Few people repair dead links in articles, and I feel that posting them on the talk page will be even more cumbersome. A couple of bots are already approved for that, though they are inactive. — HELLKNOWZ ▎TALK 12:23, 3 April 2011 (UTC)[reply]
- Thanks, I have relayed your suggestion to m:Talk:Pywikipediabot/weblinkchecker.py#Questions_from_BRFAs_and_elsewhere_on_English_Wikipedia. Also, this bot would be active. — Jeff G. ツ 03:34, 4 April 2011 (UTC)[reply]
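As an illustration of the kind of repair suggested above (a sketch only; this is not something the bot currently does), a dead URL could be looked up against archive.org's public Wayback availability endpoint:

```python
# Sketch: ask the Wayback Machine whether an archived copy of a dead URL exists.
import json
import urllib.parse
import urllib.request

def wayback_snapshot(url):
    """Return the closest archived snapshot URL, or None if none is recorded."""
    query = urllib.parse.urlencode({"url": url})
    endpoint = "https://archive.org/wayback/available?" + query
    with urllib.request.urlopen(endpoint) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None
```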
- Question from SpinningSpark if an editor informs you that they think your bot has made a mistake, what action will you take? SpinningSpark 17:58, 21 April 2011 (UTC)[reply]
- Thanks for your question, SpinningSpark. If that happens, I will verify the accuracy of that editor's report. If the editor is accurate and I agree, I will reverse the mistake and try to stop it from happening in the future, possibly including reporting the mistake to the bot's programmers and halting the bot until I get the problem fixed. If the editor is not accurate or I do not agree, I will discuss the situation with the editor amicably. If that does not suffice, I will discuss the situation here or at any willing BAG member's user talk page (if this BRFA has not yet been approved) or at WT:BRFA or WP:BOTN as appropriate, leaving a pointer to that discussion on the appropriate talk pages. The dispute resolution process should suffice if the resulting discussion degrades. Also, if an Administrator agrees, I understand that they may block the bot until I have had a chance to deal with the situation. I will be happy to discuss modification of the above. — Jeff G. ツ 03:10, 22 April 2011 (UTC)[reply]
I started sharing the results file here on English Wikipedia as wikitext, but that has proven to be too cumbersome because of size (limited to 2MB) and spam filters, so I have instead started sharing both files via Windows Live SkyDrive here. The results files are in sequential order. — Jeff G. ツ 18:48, 22 April 2011 (UTC)[reply]
- Approved for trial (50 edits or 5 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. MBisanz talk 08:25, 13 May 2011 (UTC)[reply]
- Thank you, I am commencing testing... — Jeff G. ツ 21:10, 20 May 2011 (UTC)[reply]
- Trial complete. The bot has now made 51 edits (the first 3 manual, the next 7 semiautomated (requiring captcha authorization), and the next 41 fully automated). I eagerly await your analysis. — Jeff G. ツ 22:39, 20 May 2011 (UTC)[reply]
- Approved. MBisanz talk 05:02, 25 May 2011 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.