User:GreenC/BotWikiAwk

A little auk goes a long way

BotWikiAwk is a framework and libraries for creating and running bots on Wikipedia.

Features

  • Bot management tools compatible with bots written in any language
  • Libraries for bots written in awk
  • No SQL; data files are plain text
  • Manage batches of articles of any size, 50 for WP:BRFA or 50k+ for production runs
  • Runs using GNU parallel, making full use of multi-core CPUs
  • ...or runs on the Toolforge grid across 40+ distributed computers
  • Dry-run mode; diffs can be reviewed before uploading
  • Inline colorized diffs on the command-line
  • Re-run individual pages via a cached copy of the page (download the wikisource once, run the bot many times)
  • Installs in a single directory, easily removed
  • Includes complete example bots and skeleton bots
  • Includes a general awk library developed over years of writing bots
  • Includes a command-line interface to the MediaWiki API
  • In development and private use since 2016; public since June 2018

Overview

BotWikiAwk contains two elements:

  • A library of routines for writing bots in awk
  • An integrated set of tools for running and managing bots written in any language

Why awk? Awk is a small, elegant language whose implementation is a single binary, the interpreter. It is a POSIX tool installed on most Unix computers. The syntax is simple and forgiving. Awk is usually associated with one-line scripts, but since about 2012 the GNU version has become considerably more powerful. While not a general-purpose language, awk is first and foremost a text-processing language, and text processing is exactly what bots do. Tasks awk cannot handle natively (e.g. networking) are delegated to external programs.
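For instance, a bot can shell out to an external tool such as wget for page retrieval. A minimal illustrative sketch of that pattern in awk (not code from the BotWikiAwk library; the URL is only an example):

BEGIN {
    # awk has no built-in networking, so delegate the download to wget
    cmd = "wget -q -O- https://en.wikipedia.org/wiki/Special:Random"
    while ((cmd | getline line) > 0)
        page = page line "\n"
    close(cmd)
    print length(page) " characters downloaded"
}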

BotWikiAwk is batch oriented. After a master list of articles is created, batches are carved out of it and each is assigned a unique name, called a project ID. Each utility takes as input a project ID and the action to take on that project. Projects can be any size, up to the full size of the master list (i.e. a single project).
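For example, a 10,000-page master list named mybot20180601.auth (an illustrative name) could be carved into projects mybot20180601.00001-05000 and mybot20180601.05001-10000, or processed whole as mybot20180601.00001-10000; the naming convention is shown in the Example bot section below.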

Requirements

  • A Wikipedia account with bot flag permissions
  • GNU awk (version 4.1+)
  • GNU wget (version 1.13+)
  • GNU parallel (sudo apt-get install parallel) - not required on Toolforge
  • openssl for login authentication (if writing to pages)
  • wdiff (sudo apt-get install wdiff) - small utility for inline diffs
  • GNU tac (part of GNU coreutils) - small utility that prints files in reverse line order
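To confirm the installed versions meet these requirements, the standard version flags can be used, for example:

gawk --version
wget --version
parallel --version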

Setup

If installing on Toolforge see special instructions.

  • Add the BotWikiAwk lib directory to AWKPATH eg.
export AWKPATH=.:/home/adminuser/BotWikiAwk/lib:/usr/local/share/awk
  • Add the BotWikiAwk bin directory to the PATH eg.
PATH=$PATH:/home/adminuser/BotWikiAwk/bin
  • Log out and back in so the environment variables are set (a consolidated example follows this list).
  • cd to ~/BotWikiAwk and run ./setup.sh
  • Edit ~/BotWikiAwk/lib/botwiki.awk
Change #1) StopButton URL
Change #2) UserPage URL
  • Read the SETUP file for additional instructions
  • For Wikipedia edit authorization: add your OAuth keys/secrets to bin/wikiget.awk -- see EDITSETUP
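For example, assuming a bash login shell, the two environment settings above would typically be added to a startup file such as ~/.bashrc (see the SETUP file for the authoritative instructions):

export AWKPATH=.:/home/adminuser/BotWikiAwk/lib:/usr/local/share/awk
export PATH=$PATH:/home/adminuser/BotWikiAwk/bin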

New bot

To create a new bot:

makebot ~/botname

The path should point to a new directory, botname, that does not yet exist, with "botname" being the name of your bot (no spaces recommended). The path can be anywhere, but if it is different from the default ~/BotWikiAwk/bots directory, also update section #3 of ~/BotWikiAwk/lib/botwiki.awk following the "mybot" example.

I find locating the bot outside the ~/BotWikiAwk directories makes it easier to upgrade BotWikiAwk later. One can simply delete everything and re-clone it (saving only the original botwiki.awk file).

makebot will prompt for the type of bot skeleton. If the bot will be operating on CS1|2 templates, choose #2.

Writing bot

See ~/BotWikiAwk/example-bots

<to be expanded>
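In the meantime, a purely illustrative sketch of the kind of text transformation a bot performs on article wikisource. This is not the actual accdate.awk, nor the calling interface runbot expects; the function and regex are hypothetical simplifications:

# Hypothetical helper: strip |access-date= / |accessdate= parameters
# from a citation template string. The real accdate.awk handles many
# more cases (presence of |url=, whitespace, nested templates, etc.)
function remove_accessdate(tmpl) {
    gsub(/\|[ ]*access-?date[ ]*=[^|}]*/, "", tmpl)
    return tmpl
}
BEGIN {
    s = "{{cite book |title=Foo |accessdate=1 Jan 2018 |publisher=Bar}}"
    print remove_accessdate(s)
}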

Running bot

In summary, the process works by running four utilities:

  • wikiget downloads a list of page titles the bot will operate on, e.g. 10k page titles from a category
  • project -c creates a new project (or batch) to process, e.g. the first 50 pages
  • runbot executes the bot in dry-run mode on a given project
  • bug -dc displays diffs for individual pages, to see what changes the bot made
  • bug -r re-runs the bot for individual pages
  • when satisfied the bot is running well, runbot again in live mode to upload the changes. Repeat with larger project sizes until done.

The utility programs (wikiget, project, runbot and bug) have many more options, available with -h.

Example bot

The easiest way to demonstrate BotWikiAwk is by running a real bot.

0. Create the bot using the existing example, accdate, a bot for removing |access-date= in CS1|2 templates.

Make the bot:
makebot ~/BotWikiAwk/bots/accdate
Copy in the pre-written example bot:
cp ~/BotWikiAwk/example-bots/accdate.awk ~/BotWikiAwk/bots/accdate
cd to the bot directory:
cd ~/BotWikiAwk/bots/accdate
All utilities work only while in the bot's home directory, with the exception of wikiget, which can run anywhere.

A. Make a master list of pages to process, called an "auth" file. Here the list comes from a category, using the "-c" option.

wikiget -c "Category:Pages using citations with accessdate and no URL" > meta/accdate20181102.auth
The file ends in .auth (required) and is located in the bot's meta subdirectory.
In this case '20181102' is today's date but it can be any identifying string of numbers or letters.
The "accdate" portion of the filename can also be anything, though it's helpful to use the bot name.
Manually edit meta/accdate20181102.auth to remove unwanted pages, e.g. pages in the "Template:" or "Wikipedia:" namespaces.
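As an alternative to manual editing, a standard grep filter can strip those namespaces in one step (ordinary shell usage, not a BotWikiAwk utility):

grep -vE '^(Template|Wikipedia):' meta/accdate20181102.auth > meta/tmp.auth && mv meta/tmp.auth meta/accdate20181102.auth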

B. Create (-c) a batch (called a 'project') of 50 articles to process

project -c -p accdate20181102.00001-00050
The project ID (-p) is composed of the name created in Step A (accdate20181102), followed by a ".", followed by a range of numbers (00001-00050) meaning line #1 through line #50 of the file meta/accdate20181102.auth, i.e. the first 50 articles to process.
The project ID is referenced by every utility to identify which project is being worked on.

C. Run the bot in dry-run mode

runbot accdate20181102.00001-00050 auth dryrun

D. Look at resulting local diffs

Find which pages the bot modified as recorded in the "discovered" file in the meta directory
cat meta/accdate20181102.00001-00050/discovered
For each, visually check the diff with bug -dc
bug -p accdate20181102.00001-00050 -n "Theory of relativity" -dc
The bot can be re-run for individual pages
bug -p accdate20181102.00001-00050 -n "Theory of relativity" -r
Further info is available with -v, which shows the location of the data directory
bug -p accdate20181102.00001-00050 -n "Theory of relativity" -v

E. Push changes to Wikipedia

If the project was previously run in dry-run mode, first delete (-x) and recreate (-c) it
project -x -p accdate20181102.00001-00050
project -c -p accdate20181102.00001-00050
Then run in live mode (CAUTION: don't do this for the demonstration)
runbot accdate20181102.00001-00050 auth
If the project has never been created before, just create it and run
project -c -p accdate20181102.00001-00050
runbot accdate20181102.00001-00050 auth

F. Repeat

Repeat steps B through F, increasing the size of the batch and using "bug -dc" to spot-check diffs until confidence is high. Once confidence is high, only the last part of step E is required. As can be seen, each project run is a two-step process: create the project, defining its size, then run the bot on the project.
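For example, a next, larger batch covering lines 51 through 550 of the auth file might look like this (the range is arbitrary, chosen for illustration):

project -c -p accdate20181102.00051-00550
runbot accdate20181102.00051-00550 auth dryrun
(check diffs with bug -dc, then when satisfied delete, recreate and run live)
project -x -p accdate20181102.00051-00550
project -c -p accdate20181102.00051-00550
runbot accdate20181102.00051-00550 auth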