Jump to content

Wikipedia:Request a query/Tips and schemas

From Wikipedia, the free encyclopedia

Tips

[edit]

If you don't know SQL or you are rusty, SQLBolt has a great set of interactive exercises.

Speed

[edit]

enwiki is a massive database. For example, the revisions table contains one billion rows. SQL queries that would execute in a fraction of a second on a smaller database may take over an hour and then time out when run on Quarry and enwiki. Here are some tips to make your SQL queries as efficient as possible.

  • Test your query on a smaller wiki, such as simplewiki (612,000 articles) or sawiki (11,000 articles).
  • Use a LIMIT.
  • Omit ORDER BY if you don't need it.
  • Be aware that the WHERE, GROUP BY, and ORDER BY fields are the most likely to slow down a query. Try to use primary key fields for these whenever possible. For example, WHERE log_id > 90267267 will be faster than WHERE log_timestamp > 20200415012345, because log_id is a primary key field.
  • SQL optimizer tool - Run https://sql-optimizer.toolforge.org/ on your query. It will suggest special tables that you can use to speed up your query.
    • Alternative views - The replica databases (used by Quarry) have copies of certain tables that are exactly the same, except they have a different primary key. Using one of these tables and properly using its corresponding primary key can speed up queries. See the "Alternative views" section below. Example: using logging_userindex instead of logging.
    • Sub-views - The replica databases (used by Quarry) have copies of certain tables that omit unnecessary rows. Look into using these tables instead of the main table when joining certain tables, to speed up your queries. Example: When joining actor and recentchanges, use actor_recentchanges.
    • Rows - In the "results" section of SQL optimizer, pay attention to any rows with a "rows" column value in the millions or billions. Then check if that row's "extra" field says "using index". If not, consider refactoring your query to use an index. Indexes should also be in the correct order. Index documentation for each MediaWiki core table can be found at pages such as Manual:Logging table#Indexes. Most tables have many indexes, but you need to use the correct combination and order of fields in order to benefit from them.
  • Avoid the revision table, because it literally has a billion rows
    • If you only need to check recent page revisions (last 30 days), use the recentchanges table instead of the revision table.
    • If you can get the data you need from one of the Special:Logs, use the logging table instead of the revision table.
  • If a query scans millions of rows, consider breaking it into multiple queries. Query time versus number of rows increases exponentially, not linearly.
  • Indexes - Check the documentation on MediaWiki wiki for each database table. There is a list of indexes. Pay attention to what those are, since you'll want to try to search only those fields. Also pay attention to which indexes come in pairs and triplets. If you try to search WHERE log_title = 'foo' without also searching AND log_namespace = 'bar', your query will time out because those two fields are indexed together.

Other

[edit]
  • Timestamps - Be aware that date/time fields are in an unusual format. They are BINARY(14) integers in the format YYYYMMDDHHMMSS. You can use code such as WHERE rev.rev_timestamp > DATE_FORMAT(DATE_ADD(NOW(), INTERVAL -1 YEAR), "%Y%m%d%H%i%s") to do conversions.
  • Actor table - Be aware that there is a table called actor, with fields actor_id and actor_name (username). Instead of JOINing the user table, you will often need to JOIN the actor table instead.
    • To get a user_name, instead of joining both the actor and user tables, you can join just actor and then select actor_name.
  • Timeouts - A query will time out after awhile. The timeout is often 30 minutes, but sometimes it will go for longer. You may want to hit "Submit query", and then revisit the page later.
  • Page_title doesn't include namespace - When selecting page_title, you will usually also want to select page_namespace, so you can figure out the exact page name, and also for speed (index) reasons. page_title does not include namespace prefixes such as Talk:, User:, etc.
  • Page_title formatting - Page title requires underscores in place of spaces. Page title (and likely all fields in this database) are CaSe SeNsITiVe. Enwiki pages all begin with a capital letter.
  • Replication lag - The replica database that Quarry uses is typically synced up almost exactly with the main database. Very rarely, sometimes due to maintenance, the replica database may lag behind the main database by hours or days. Click here to check replication lag.
  • LIKE is case sensitive - Normally on MariaDB, LIKE is case insensitive, but in Wikimedia databases it is often case sensitive, due to many fields having binary types.

Databases

[edit]

Databases follow the pattern: [languageCode][typeOfWiki]

Database names can have no suffix (enwiki) or end in the _p suffix for public (enwiki_p). They're identical on Quarry, use interchangeably. Only one or the other may work on certain other tools though.

  • Language codes
    • en - English
    • See the rest of the languages codes here, in the "Wiki" column of the table.
  • Types of Wikis
    • wiki - Wikipedia (encyclopedia)
    • wikibooks
    • wikinews
    • wikiquote
    • wikisource
    • wikiversity
    • wikivoyage
    • wiktionary
  • Wikis that aren't encyclopedias but still use MediaWiki software tend to end in "wiki":
    • commonswiki
    • mediawikiwiki
    • metawiki
    • wikidatawiki
  • Some databases aren't wikis at all
    • centralauth - keeps track of global user groups such as staff, sysadmins, global sysops, etc.

Tables

[edit]

Accessible tables

[edit]

SHOW TABLES;[1]

Core

[edit]
  • actor
  • archive[1]
  • category
  • categorylinks
  • change_tag
  • change_tag_def
  • comment
  • content
  • content_models
  • externallinks
  • filearchive
  • image
  • imagelinks
  • ipblocks
  • ipblocks_restrictions
  • ip_changes
  • iwlinks
  • l10n_cache
  • langlinks
  • logging
  • module_deps
  • oldimage
  • page
  • pagelinks
  • page_props
  • page_restrictions
  • protected_titles
  • recentchanges
  • redirect
  • revision
  • sites
  • site_identifiers
  • site_stats
  • slot_roles
  • slots
  • templatelinks
  • updatelog
  • user
  • user_former_groups
  • user_groups
  • user_properties

Extensions

[edit]

Unknown

[edit]
  • pif_edits
  • user_properties_anon
  • wbc_entity_usage

Hidden tables

[edit]

Not all tables are available in Quarry. Certain tables are intentionally omitted. This is an incomplete list. Complete list is here

  • bot_passwords
  • interwiki
  • job
  • log_search
  • objectcache
  • querycache
  • querycachetwo
  • querycache_info
  • searchindex
  • text
  • uploadstash
  • user_newtalk
  • watchlist
  • watchlist_expiry

Hidden columns

[edit]

Columns in certain tables are unavailable.

  • archive
  • user_properties

Hidden rows

[edit]

Rows in certain tables are partially blanked.

  • abuse_filter
  • abuse_filter_action
  • abuse_filter_history
  • abuse_filter_log

Alternative views

[edit]

Data redaction causes some queries and indexes to be really slow. To solve this problem, "alternative view" tables were created, with different indexes to help speed things up. This is only applicable to Quarry and these are not used in Wikimedia production (since Wikimedia production tables are not redacted and therefore not slow).

  • archive_userindex - "userindex" means it indexes the "actor" field
  • filearchive_userindex
  • ipblocks_ipindex - "ipindex" means it indexes the "address" field
  • logging_logindex - "logindex" means it indexes following 3 fields together: log_namespace, log_title, log_page
  • logging_userindex
  • oldimage_userindex
  • recentchanges_userindex
  • revision_userindex

Sub-views

[edit]

These tables have only the data needed to JOIN with another specific table. The rest of the data has been deleted. This is for faster querying when JOINing with those particular tables.

  • actor_filearchive
  • actor_image
  • actor_ipblocks
  • actor_logging
  • actor_oldimage
  • actor_protected_titles
  • actor_recentchanges
  • actor_revision
  • comment_filearchive
  • comment_image
  • comment_ipblocks
  • comment_logging
  • comment_oldimage
  • comment_protected_titles
  • comment_recentchanges
  • comment_revision

Tables left over from "actor" migration

[edit]

"compat" might stand for "compatibility". These tables are compatible with the pre-actor versions of MediaWiki, and will eventually be removed.

  • archive_compat
  • filearchive_compat
  • image_compat
  • ipblocks_compat
  • logging_compat
  • oldimage_compat
  • page_compat
  • protected_titles_compat
  • recentchanges_compat
  • revision_actor_temp
  • revision_compat

Query limits

[edit]

Queries are limited to approximately 8300 characters

Table size

[edit]

Here is the number of rows from a couple of tables on enwiki, to give an idea of the database size.

  • page - 6 million
  • externallinks - 168 million[2]
  • revision - 1 billion

Notes

[edit]
  1. ^ Revisions of deleted pages are stored here.
  2. ^ https://quarry.wmflabs.org/query/61412