Wikipedia:Bots/Requests for approval/WikiLinkChecker

Operator: Skarz (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 02:54, Sunday, March 10, 2024 (UTC)

Function overview: It's a basic Python script that retrieves the wiki markup of an article I specify via the REST API, scans it for cited URLs that are not already archived, checks the Internet Archive for an archived version (saving the page to the archive if none exists), and updates the applicable citation accordingly.
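
For illustration, the retrieval step could look roughly like the sketch below, which fetches an article's wikitext through the English Wikipedia REST API page-source endpoint. This is a minimal approximation; the endpoint and error handling used in the actual script may differ.

  import requests

  def fetch_wikitext(title: str) -> str:
      """Fetch the current wikitext of an article via the REST API (sketch)."""
      url = "https://en.wikipedia.org/w/rest.php/v1/page/" + title.replace(" ", "_")
      resp = requests.get(url, timeout=10)
      resp.raise_for_status()
      # The page-source response returns the article's wikitext in the "source" field.
      return resp.json()["source"]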

Automatic, Supervised, or Manual: It's hard to categorize because it's not a bot; it's a script that performs a very specific action only when I direct it to, much like Internet Archive Bot. In its current state it cannot be used to process hundreds or thousands of URLs at a time.

Note: I have been advised by [[User:ferret|@ferret]] that because this script writes changes to Wikipedia without giving me the ability to preview them, it does not meet the criteria for assisted editing.

Programming language(s): Python

Source code available: User:Skarz/WikiLinkChecker

Links to relevant discussions (where appropriate): Discord

Edit period(s): Whenever I run it.

Estimated number of pages affected: ~10 per day

Namespace(s): Mainspace

Exclusion compliant (Yes/No):

Function details:

This Python script is designed to update Wikipedia pages by replacing dead external links with archived versions from the Internet Archive's Wayback Machine. Here's a step-by-step explanation of what the script does:

  • User Login: The script prompts the user to enter their Wikipedia username and password to log in to the English Wikipedia.
  • Page Selection: The user is prompted to enter the name of the Wikipedia page or its URL. The script extracts the page name from the URL if a URL is provided.
  • Page Content Retrieval: The script retrieves the content of the specified Wikipedia page.
  • Link Extraction: The script extracts all external links from the page content. It specifically looks for links within {{cite web}} templates and <ref> tags.
  • Link Checking and Updating (see the code sketch after this list):
    • For each extracted link, the script checks if the link is alive by sending a HEAD request.
    • If the link is dead, the script checks if there is an archived version available on the Wayback Machine.
    • If an archived version is available, the script updates the reference in the page content with the archive URL and the archive date.
  • Page Update: The script saves the updated page content to Wikipedia with a summary indicating that dead links have been updated with archive URLs.
  • Output: The script prints a message indicating that the Wikipedia page has been updated.
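
A rough sketch of the link-checking and updating steps is below. It is an illustration rather than the script itself: the {{cite web}} matching is a naive regex that ignores nested templates, the availability check uses the public Wayback Machine availability API, and the CS1 parameters added (|archive-url=, |archive-date=, |url-status=) are assumptions; saving unarchived URLs to the archive is only noted in a comment.

  import re
  import requests
  from datetime import datetime

  WAYBACK_AVAILABLE = "https://archive.org/wayback/available"

  def is_dead(url: str) -> bool:
      """Treat network errors and 4xx/5xx responses as a dead link."""
      try:
          resp = requests.head(url, allow_redirects=True, timeout=10)
          return resp.status_code >= 400
      except requests.RequestException:
          return True

  def closest_snapshot(url: str):
      """Ask the Wayback Machine availability API for the closest snapshot, if any."""
      resp = requests.get(WAYBACK_AVAILABLE, params={"url": url}, timeout=10)
      resp.raise_for_status()
      snap = resp.json().get("archived_snapshots", {}).get("closest")
      if snap and snap.get("available"):
          date = datetime.strptime(snap["timestamp"], "%Y%m%d%H%M%S").date()
          return snap["url"], date
      return None

  def update_cite_web(wikitext: str) -> str:
      """Add |archive-url=/|archive-date= to {{cite web}} templates whose |url= is dead."""
      def repl(match: re.Match) -> str:
          template = match.group(0)
          if "archive-url" in template:
              return template                    # already archived; leave untouched
          url_match = re.search(r"\|\s*url\s*=\s*(\S+)", template)
          if not url_match or not is_dead(url_match.group(1)):
              return template
          snapshot = closest_snapshot(url_match.group(1))
          if snapshot is None:
              return template                    # no snapshot; a real run could request Save Page Now here
          archive_url, archive_date = snapshot
          extra = f" |archive-url={archive_url} |archive-date={archive_date} |url-status=dead"
          return template[:-2] + extra + "}}"
      # Naive match: assumes no nested templates inside {{cite web}}.
      return re.sub(r"\{\{\s*cite web.*?\}\}", repl, wikitext, flags=re.IGNORECASE | re.DOTALL)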

Limitations and Safeguards:

  • User Authentication: The script requires a valid Wikipedia username and password, limiting its use to authenticated users (see the save-step sketch after this list).
  • Edit Summary: The script provides an edit summary for transparency, allowing other Wikipedia editors to review the changes.
  • Rate Limiting: Wikipedia has rate limits and abuse filters in place to prevent automated scripts from making too many requests or disruptive edits.
  • Error Handling: The script checks for errors when making web requests and accessing the Wayback Machine, preventing it from continuing with invalid data.
  • No Automatic Deletion: The script does not delete any content; it only updates dead links with archived versions, reducing the risk of unintended content removal.
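
To illustrate the authentication and edit-summary points above, here is a rough sketch of the save step using the MediaWiki Action API. A bot password is assumed, and the actual script's login flow and summary wording may differ.

  import requests

  API = "https://en.wikipedia.org/w/api.php"

  def save_page(title: str, new_text: str, username: str, password: str) -> None:
      """Log in and save the updated wikitext with an explanatory edit summary."""
      session = requests.Session()
      # Fetch a login token, then log in (a bot password is assumed here).
      login_token = session.get(API, params={
          "action": "query", "meta": "tokens", "type": "login", "format": "json",
      }, timeout=10).json()["query"]["tokens"]["logintoken"]
      session.post(API, data={
          "action": "login", "lgname": username, "lgpassword": password,
          "lgtoken": login_token, "format": "json",
      }, timeout=10).raise_for_status()
      # Fetch a CSRF token and save the page with a transparent edit summary.
      csrf_token = session.get(API, params={
          "action": "query", "meta": "tokens", "format": "json",
      }, timeout=10).json()["query"]["tokens"]["csrftoken"]
      result = session.post(API, data={
          "action": "edit", "title": title, "text": new_text,
          "summary": "Updating dead external links with archive URLs",
          "token": csrf_token, "format": "json",
      }, timeout=10).json()
      if "error" in result:
          raise RuntimeError(result["error"])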

Discussion
