Wikipedia:Bots/Requests for approval/WaybackBot
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Withdrawn by operator.
Operator: Tim1357
Automatic or Manually assisted: The bot would run under close supervision until it has enough "experience" to run by itself.
Programming language(s): Python, using pywikipedia
Source code available: Here is a link to the code (it updates automatically every time I change it). It needs work; I keep getting little formatting errors that I need some programmers to help me with.
Function overview:
WaybackBot would (intelligently) check the Internet Archive for archives of dead pages.
Links to relevant discussions (where appropriate): There are a lot. Some are
Edit period(s):
At first, I will babysit the bot and check every edit it makes, until I feel confident enough to let it run free.
Estimated number of pages affected: An estimated 10% of all links on Wikipedia are, in some way, dead. If there are 2.5 million links on Wikipedia (there were in 2006), that means 250,000 are dead. That's a lot of pages.
Exclusion compliant (?): I'm not sure; is pywikipedia automatically exclusion compliant?
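For reference, here is a minimal sketch of what a manual exclusion check built on the {{bots}}/{{nobots}} convention could look like. It is illustrative only: the function name bot_may_edit is made up for this example, the parsing is simplified, and it says nothing about what pywikipedia itself already provides.
<syntaxhighlight lang="python">
import re

def bot_may_edit(wikitext, bot_name="WaybackBot"):
    """Return False if the page's wikitext opts out of edits by bot_name."""
    # {{nobots}} excludes all bots.
    if re.search(r"\{\{\s*nobots\s*\}\}", wikitext, re.IGNORECASE):
        return False
    # {{bots|allow=...}} / {{bots|deny=...}} list specific bots.
    for match in re.finditer(r"\{\{\s*bots\s*\|([^}]*)\}\}", wikitext, re.IGNORECASE):
        params = dict(part.split("=", 1)
                      for part in match.group(1).split("|") if "=" in part)
        allow = [n.strip() for n in params.get("allow", "").split(",") if n.strip()]
        deny = [n.strip() for n in params.get("deny", "").split(",") if n.strip()]
        if allow and bot_name not in allow and "all" not in allow:
            return False
        if "all" in deny or bot_name in deny:
            return False
    return True
</syntaxhighlight>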
Already has a bot flag (Y/N):
Function details: The bot's workflow looks like this:
- Load a page (from an XML dump)
- Extract all the external links
- Check them all and flag the dead ones (defined as those returning error code 404 or 401); a rough sketch of this check follows the list
- If a link is dead, look for its corresponding access date; if none exists, use WikiBlame to find it
- Create a range of acceptable dates (for right now, an archive is acceptable if it is within 2 months of the original access date; I am willing to change that. Remember that a larger range means an archive is more likely to be found.) A sketch of the archive lookup also follows the list
- If the URL is referenced using {{citeweb}} and does not already have an archive, add archive-url and archive-date
- If there is no cite web template, append the reference with {{wayback}}, using the |date and |url parameters
- If there is no Internet Archive copy, mark the reference with {{Dead link}}, using the |date and |bot parameters (the template handling is also sketched below)
- Start over, caching links that were already checked as either dead or alive so I don't have to check them again. I will add a function to the script to clear the cache.
Whew, I think that's it. If you want a more nitty-gritty explanation of what the bot does, look at the source code; pretty much every line has a comment. Note that the source is hosted on my home computer, so it might not be up when the computer is off.
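As noted in the list above, a rough sketch of the dead-link check and the cache might look like the following. It is not the bot's actual code (which is in the linked source); the names DEAD_CODES, CACHE and is_dead are invented for this example, and the choice to leave network-level failures alone is an assumption.
<syntaxhighlight lang="python">
import urllib.error
import urllib.request

DEAD_CODES = {404, 401}   # the codes the list above treats as "dead"
CACHE = {}                # url -> True (alive) / False (dead), reused between runs

def is_dead(url):
    """Return True if the URL answers with one of the 'dead' status codes."""
    if url in CACHE:
        return not CACHE[url]
    try:
        request = urllib.request.Request(url, method="HEAD")
        urllib.request.urlopen(request, timeout=30)
        alive = True
    except urllib.error.HTTPError as err:
        alive = err.code not in DEAD_CODES
    except urllib.error.URLError:
        # A network-level failure is not the same as a 404/401; leave it alone.
        alive = True
    CACHE[url] = alive
    return not alive
</syntaxhighlight>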
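The archive lookup with the two-month window could be sketched like this. It uses the Wayback Machine's present-day availability endpoint (archive.org/wayback/available), which is not necessarily how the bot queried the archive at the time; the function name closest_archive and the 61-day window value are illustrative.
<syntaxhighlight lang="python">
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta

def closest_archive(url, access_date, window=timedelta(days=61)):
    """Return (archive_url, archive_date) for a snapshot taken within the
    acceptable window of the access date, or None if no such archive exists."""
    query = urllib.parse.urlencode({"url": url,
                                    "timestamp": access_date.strftime("%Y%m%d")})
    with urllib.request.urlopen(
            "https://archive.org/wayback/available?" + query, timeout=30) as response:
        data = json.load(response)
    snapshot = data.get("archived_snapshots", {}).get("closest")
    if not snapshot or not snapshot.get("available"):
        return None
    archive_date = datetime.strptime(snapshot["timestamp"], "%Y%m%d%H%M%S")
    if abs(archive_date - access_date) > window:
        return None
    return snapshot["url"], archive_date

# Example: closest_archive("http://example.com/page", datetime(2009, 6, 1))
</syntaxhighlight>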
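Finally, the last three list items (add archive parameters to {{citeweb}}, otherwise append {{wayback}}, otherwise tag with {{Dead link}}) come down to a decision like the one below. This is only an illustration: real citation editing needs a proper wikitext parser, and mark_reference and its parameters are names made up for the sketch.
<syntaxhighlight lang="python">
import re

def mark_reference(ref_text, archive_url=None, archive_date=None,
                   tag_date="December 2009"):
    """Return the reference wikitext with archive or dead-link markup added."""
    is_cite_web = re.search(r"\{\{\s*cite[\s_-]?web", ref_text, re.IGNORECASE)
    already_archived = "archiveurl" in ref_text or "archive-url" in ref_text
    if archive_url:
        if is_cite_web and not already_archived:
            # Assumes the reference ends with the template's closing braces.
            return ref_text.rstrip()[:-2] + (
                " |archive-url=%s |archive-date=%s}}" % (archive_url, archive_date))
        # No cite web template: append {{wayback}} after the existing reference.
        return ref_text + " {{wayback|url=%s|date=%s}}" % (archive_url, archive_date)
    # No acceptable archive found: tag the link as dead.
    return ref_text + " {{Dead link|date=%s|bot=WaybackBot}}" % tag_date
</syntaxhighlight>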
Discussion
Some Stuff You Should Know:
- See this screw-up I made (still very sorry).
- I am pretty new to Python; this was my first big project, so I need some help.
- The Internet Archive does not show archives until 6 months after they are grabbed (right now it is still processing archives from June), so if I request an archive for a page that was accessed today, the bot will not get any archives.
- I support a larger archive range, but I will leave it up to consensus here.
Things to do
- Add logging similar to the logs of User:WebCiteBOT. Still need to write the code that uploads the log.
- Make the bot exclusion compliant
- Auto-clear cached links
- Add synonyms for templates (citeweb = Citeweb = cite-web), etc.
I'd like to put this on hold for a while. User:Dispenser gave me some points about the bot's concept that I hadn't thought about. I am going to tweak the code to make it more fail-safe, and so that the bot gives a dead link two tries before it looks for an archive (as some links are only dead for a bit and then are live again). Thanks Tim1357 (talk) 02:05, 10 December 2009 (UTC)
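A sketch of the "two tries" idea mentioned above, for completeness: only treat a link as dead if it fails again on a later re-check. The confirmed_dead name and the one-week delay are placeholders, and in practice the bot would re-queue the link rather than sleep.
<syntaxhighlight lang="python">
import time

def confirmed_dead(url, check, recheck_delay_seconds=7 * 24 * 3600):
    """check is any callable returning True when the URL currently looks dead."""
    if not check(url):
        return False
    time.sleep(recheck_delay_seconds)   # in practice, re-queue instead of sleeping
    return check(url)
</syntaxhighlight>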
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.