User:The Anome/SQL
This page exists to save useful SQL queries, just in case the Quarry project ever goes away. Where this differs from the Quarry query, the Quarry version should be regarded as authoritative.
See also https://sql-optimizer.toolforge.org/ for a useful SQL analysis tool.
Articles not assigned to WikiProjects
[edit]https://quarry.wmcloud.org/query/72412
SELECT article.page_title, article.page_id, talk.page_id AS talk_id FROM page AS article LEFT JOIN page_props ON pp_page = article.page_id AND pp_propname = 'disambiguation' LEFT JOIN page AS talk ON talk.page_title = article.page_title AND talk.page_namespace = 1 AND talk.page_is_redirect = 0 WHERE article.page_namespace = 0 AND pp_page IS NULL AND article.page_is_redirect = 0 AND NOT (article.page_title LIKE "List_of_%") AND NOT (article.page_title LIKE "Lists_of_%") AND NOT EXISTS ( SELECT 1 FROM categorylinks WHERE article.page_id = cl_from AND ( cl_to IN ( "Temporary_maintenance_holdings", "Candidates_for_speedy_deletion", "Requested_RD1_redactions" ) OR cl_to LIKE "%_names" OR cl_to LIKE "%_surnames" OR cl_to LIKE "Lists_of_%" ) ) AND NOT EXISTS ( SELECT 1 FROM templatelinks JOIN linktarget ON lt_id = tl_target_id WHERE tl_from = article.page_id AND lt_namespace = 10 AND ( lt_title IN ( "Wi", "Surname", "Given_name", "Animal_common_name", "Plant_common_name", "Nickname", "Dmbox" ) OR lt_title LIKE "%index%" OR lt_title LIKE "%redirect%" ) ) AND ( (talk.page_id IS NULL) OR ( (talk.page_id IS NOT NULL) AND NOT EXISTS ( SELECT 1 FROM categorylinks WHERE talk.page_id = cl_from AND ( cl_to LIKE "%WikiProject_%" OR cl_to LIKE "%-Class_%_articles" OR cl_to LIKE "%-importance_%_articles" OR cl_to LIKE "%-priority_%_articles" OR cl_to LIKE "Unassessed_%_articles" ) ) AND NOT EXISTS ( SELECT 1 FROM templatelinks JOIN linktarget ON lt_id = tl_target_id WHERE tl_from = talk.page_id AND lt_namespace = 10 AND (lt_title LIKE "%WikiProject_%") ) ) ) ORDER BY article.page_title
Biographical articles without WP Biography tag
[edit]https://quarry.wmcloud.org/query/72401
SELECT CONCAT("* [[",article.page_title, "]]") -- , talk.page_id AS talk_page_id FROM page AS article LEFT JOIN page AS talk ON talk.page_title = article.page_title AND talk.page_namespace = 1 AND talk.page_is_redirect = 0 WHERE -- article.page_id % 10 = 2 article.page_namespace = 0 AND article.page_is_redirect = 0 AND NOT article.page_title RLIKE "^(Lists?_of|[0-9]{4})_.*$" -- no dated events or lists AND EXISTS (SELECT 1 FROM categorylinks WHERE cl_from = article.page_id and (cl_to = "Living_people" OR cl_to RLIKE "^(Date|Year)_of_(birth|death)_(missing|unknown).*$" OR cl_to RLIKE "^[0-9][^_]+_(births|deaths)$")) -- make sure we don't catch things like 'protest-related deaths' AND ( (talk.page_id IS NULL) OR ((talk.page_id IS NOT NULL) AND NOT EXISTS (SELECT 1 FROM categorylinks WHERE cl_from = talk.page_id and (cl_to LIKE "%Biography_articles" OR cl_to LIKE "%biography_articles" OR cl_to LIKE "%WikiProject_Biography%")))) ORDER BY article.page_title
WikiProject to page category cross-reference
[edit]https://quarry.wmcloud.org/query/72090
Part of a plan to assign articles to WikiProjects using Naive Bayes.
Still being worked on. Limited to just a few pages, for testing purposes. Sorting by all three result columns to make output more compressible later.
Now using modulo arithmetic to sub-sample the page table more uniformly. 1/1000 of the table ~= 6700 articles
page_id % 1000 = 77: Executed in 251.52 seconds as of Tue, 07 Mar 2023 08:21:01 UTC. Resultset (296041 rows) page_id % 1000 = 78: Executed in 265.67 seconds as of Tue, 07 Mar 2023 08:28:22 UTC. Resultset (316077 rows) page_id % 1000 = 392: Executed in 256.58 seconds as of Tue, 07 Mar 2023 09:22:56 UTC. Resultset (310016 rows) page_id % 1000 = 816: Executed in 269.80 seconds as of Tue, 07 Mar 2023 09:31:37 UTC. Resultset (295323 rows)
With 1/1000 scan, jsonl file is about 30 Mbytes in size, but bz2's to about 1.3 Mbytes.
Running this query repeatedly to cover all articles will take ~3 days., but is necessary to generate ground truth for later categorization.
Next step: Toolforge.
SELECT pagecats.cl_to, talkcats.cl_to, -- REGEXP_EXTRACT(talkcats.cl_to, '(?WikiProject_|.*_importance_|.*_priority_|Unassessed)(.*)'), -- does not work COUNT(*) AS my_count FROM page AS article INNER JOIN page AS talk ON talk.page_title = article.page_title INNER JOIN categorylinks AS pagecats ON pagecats.cl_from = article.page_id INNER JOIN categorylinks AS talkcats ON talkcats.cl_from = talk.page_id WHERE article.page_id % 1000 = 816 AND article.page_namespace = 0 AND talk.page_namespace = 1 AND article.page_is_redirect = 0 AND talk.page_is_redirect = 0 AND ( talkcats.cl_to LIKE "%WikiProject_%" OR talkcats.cl_to LIKE "%-Class_%_articles" OR talkcats.cl_to LIKE "%-importance_%_articles" OR talkcats.cl_to LIKE "%-priority_%_articles" OR talkcats.cl_to LIKE "Unassessed_%_articles" ) AND NOT ( talkcats.cl_to LIKE "%vital%" OR talkcats.cl_to LIKE "%Version%" ) AND NOT (pagecats.cl_to LIKE "%Disambig%" OR pagecats.cl_to LIKE "%disambig%" OR pagecats.cl_to LIKE "%set_index%" OR pagecats.cl_to LIKE "Set_index%") AND NOT (talkcats.cl_to LIKE "%Disambig%" OR talkcats.cl_to LIKE "%disambig%") AND NOT (pagecats.cl_to LIKE "Short_description%" OR pagecats.cl_to LIKE "%_errors%" OR pagecats.cl_to LIKE "CS1_%" OR pagecats.cl_to LIKE "%short_description%" OR pagecats.cl_to LIKE "%articles%" OR pagecats.cl_to LIKE "Articles%" OR pagecats.cl_to LIKE "%pages%" OR pagecats.cl_to LIKE "%disputes%" OR pagecats.cl_to LIKE "Pages%" OR pagecats.cl_to LIKE "Use_dmy_date%" OR pagecats.cl_to LIKE "%Wikipedia%" OR pagecats.cl_to LIKE "%articles%" OR pagecats.cl_to LIKE "%Articles%" OR pagecats.cl_to LIKE "%Wikidata%" OR pagecats.cl_to LIKE "Webarchive%") GROUP BY pagecats.cl_to, talkcats.cl_to ORDER BY my_count DESC, pagecats.cl_to, talkcats.cl_to -- LIMIT 10000