User:MZMcBride/climax

From Wikipedia, the free encyclopedia

climax is the code name for a project that gathers and analyzes a set of attributes of biographies of living people in an attempt to programmatically find problematic biographies.

Attributes

A number of attributes of the pages are collected using Python scripts and are inserted into an SQLite database. The database will be released to the public (with the exception of one attribute—the number of page watchers). Below is a table of the raw attributes collected. Other attributes will be derived from this data.

| Attribute | Source | Description |
|---|---|---|
| Page ID | Database dump | Numeric page ID |
| Page title | Database dump | Page title |
| Page length | Database dump | Length of the page in bytes |
| Bad words count | Database dump | Total number of instances of bad words |
| Bad words within 50 bytes of an inline references tag | Database dump | Total number of bad words that are within 50 bytes of an inline <ref> tag |
| Bad words within 50 bytes of an inline "citation needed" tag | Database dump | Total number of bad words that are within 50 bytes of an inline {{citation needed}} tag |
| Inline references tag count | Database dump | Total number of instances of "<ref" |
| Inline "citation needed" tag count | Database dump | Total number of instances of "{{citation needed}}" (and its redirects) |
| Banner "citation needed" count | Database dump | Total number of instances of "{{BLP unsourced}}" (and its redirects) |
| External hyperlink count | Database dump | Total number of instances of "http" |
| Page views | Compiled text file | Total number of page views (in November 2009) |
| Days since last edit | Replicated database | Number of days since the page was edited most recently |
| Days since first edit | Replicated database | Number of days since page creation |
| Revisions count | Replicated database | Total number of revisions for the page |
| Page watchers count | Replicated database | Number of users with the page on their watchlist |
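
As an illustration of the simpler dump attributes, here is a minimal sketch of how a few of them could be counted for a single page's wikitext. The function name and dictionary keys are hypothetical, UTF-8 encoding is assumed for the byte length, and template redirects are not handled; this is not the project's actual script.

```python
import re


def count_basic_attributes(wikitext: str) -> dict:
    """Count a few raw dump attributes for one page (illustrative only)."""
    return {
        # Length in bytes rather than characters (assumes UTF-8).
        "length": len(wikitext.encode("utf-8")),
        # Every literal "<ref", so "<references />" is counted as well.
        "ref_tags": wikitext.count("<ref"),
        # Every literal "http"; this also counts "https".
        "external_links": wikitext.count("http"),
        # Canonical template name only; redirects are not resolved here.
        "inline_citation_needed": len(
            re.findall(r"\{\{\s*citation needed\b", wikitext, re.IGNORECASE)
        ),
    }
```

Counting plain substrings keeps the scan cheap, at the cost of some over-counting (for example, "http" inside running text).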

Analysis

The value of this data lies in its analysis. climax will focus on a scoring system. Other users may be interested in performing their own analysis to examine certain trends or problem areas.
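
No scoring chart has been devised yet, so the following is only a sketch of the general shape a scoring function might take. Every name, weight, and threshold in it is hypothetical and chosen purely for illustration; none of them comes from the project.

```python
from dataclasses import dataclass


@dataclass
class PageAttributes:
    """Hypothetical container for a subset of the collected attributes."""
    bad_words: int
    bad_words_near_ref: int
    ref_tags: int
    inline_citation_needed: int
    banner_citation_needed: int
    watchers: int


def score_page(page: PageAttributes) -> float:
    """Return a score where higher means 'more likely problematic'.

    All weights below are made-up placeholders, not a devised chart.
    """
    score = 0.0
    # Bad words with no reference nearby are treated as more suspicious.
    unsourced_bad_words = max(page.bad_words - page.bad_words_near_ref, 0)
    score += 2.0 * unsourced_bad_words
    score += 1.0 * page.inline_citation_needed
    score += 5.0 * page.banner_citation_needed
    # A page with no inline references at all gets a flat penalty.
    if page.ref_tags == 0:
        score += 3.0
    # Few watchers means problems are less likely to be noticed.
    if page.watchers < 5:
        score += 1.0
    return score
```

A weighted sum like this is easy to tune and explain, which may matter more than precision when the goal is to rank pages for human review.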

Technical details

The work will be split into a few separate scripts. The dump scanner goes first. The remaining props then need to be retrieved from a text file and from the database.

Script statuses — largely deprecated
  • climax-dump-props.py
    • Implemented
      • Total page length
      • Total number of bad words
      • Total number of "<ref"s
      • Total number of "[http"
      • Presence of reference banners
    • Not implemented
      • Number of bad words within X bytes of {{cn, etc.
      • Number of bad words within X bytes of <ref / [http
  • climax-database-props.py
    • Implemented
      • none
    • Not implemented
      • Date of first edit to page
      • Date of last edit to page
      • Total number of revisions
      • Number of page watchers
  • climax-views-props.py
    • Implemented
      • none
    • Not implemented
      • Number of page views from bh.txt (rename this file...)
  • climax-scorer.py
    • Implemented
      • none
    • Not implemented
      • Need to devise a proper scoring chart
      • over 9000 :PP
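
The unimplemented "bad words within X bytes" props could be approached along these lines. This is a sketch under stated assumptions: distance is measured between start offsets, the text is handled as bytes so that the window really is in bytes, and the function name and parameters are hypothetical.

```python
import re
from bisect import bisect_left


def count_bad_words_near(text: bytes, bad_word_re, marker: bytes,
                         window: int = 50) -> int:
    """Count bad-word matches that start within `window` bytes of a marker.

    `bad_word_re` must be a compiled *bytes* pattern; `marker` is a literal
    such as b"<ref" or b"{{citation needed".
    """
    # finditer scans left to right, so this list is already sorted.
    marker_starts = [m.start() for m in re.finditer(re.escape(marker), text)]
    if not marker_starts:
        return 0
    count = 0
    for match in bad_word_re.finditer(text):
        i = bisect_left(marker_starts, match.start())
        # Only the closest marker on each side can be within the window.
        neighbours = marker_starts[max(i - 1, 0):i + 1]
        if any(abs(match.start() - pos) <= window for pos in neighbours):
            count += 1
    return count
```

Using a sorted list with a binary search avoids comparing every bad word against every marker on long pages.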

Test cases

Bad words

Urgently need to add case sensitivity support here.

Words that definitely need case sensitivity:

  • dick
  • evil
  • traitor
  • arrested
  • psycho

Words to possibly remove from the bad word list:

  • steals (lolbaseball)
  • investigations

Full list of bad words
\babusing\b
\babuse\b
\babused\b
\babducted\b
\babduction\b
\baccuse\b
\baccused\b
\baccusation\b
\ballege\b
\balleged\b
\banus\b
\barrest\b
\barrested\b
\barse\b
\bass\b
assault\b
assaulted\b
asshole
bastard
bitch
bloody
bollocks
\bbribe\b
\bbribes\b
\bbribed\b
bugger
\bcharges\b
child molester
child molestor
child predator
child predater
\bcocks\b
convict\b
convicted\b
\bcorrupt\b
cunt\b
\bdick\b
dumbass
espionage
\bevil\b
fag\b
faggot\b
faggots\b
fags\b
\bfired\b
\bfled\b
\bflee\b
fraud\b
\bfuck\b
\bfucks\b
\bfucked\b
is gay
\bghey\b
guilty
had an affair
\bhates
idiot
\bimpeach
insane
insanity
investigation
jackass
\bkilled\b
\bliar\b
\bliars\b
\blie\b
\blied\b
lol\b
\blying
malpractice
molest\b
molested
molestation
molesting
murder\b
murdered\b
murdering\b
mutant
neglect
neglected
negligent
\bnigger
paedophile
parole
pedophile
psychiatric
\bpedo\b
\bpsycho\b
\bpussy\b
\bracist
\brape\b
\braped\b
\braping
\bscandal\b
sexual assault
sexually assault
\bshit
\bslut\b
\bsluts\b
\bslutty\b
\bsteal
\bstole\b
\bstupid\b
\bretarded\b
\bretard\b
\btheft\b
\btits\b
\btwat\b
\bwanker
your mom
\bcharged\b
\bsentenced\b
in jail\b
\btraitor
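
One possible way to add the case sensitivity support mentioned above is to compile each pattern separately and drop the ignore-case flag for a chosen subset. The sketch below assumes that all other patterns are matched case-insensitively, which the notes do not state; the names `CASE_SENSITIVE`, `compile_bad_words`, and `count_bad_words` are hypothetical.

```python
import re

# Patterns from the list that should match only in their lowercase form,
# so that capitalized names and titles are not counted.
CASE_SENSITIVE = {
    r"\bdick\b",
    r"\bevil\b",
    r"\btraitor",
    r"\barrested\b",
    r"\bpsycho\b",
}


def compile_bad_words(patterns, case_sensitive=CASE_SENSITIVE):
    """Compile each pattern on its own, ignoring case unless listed."""
    return [
        re.compile(p, 0 if p in case_sensitive else re.IGNORECASE)
        for p in patterns
    ]


def count_bad_words(text, compiled):
    """Sum the matches of every pattern; overlapping patterns each count."""
    return sum(len(regex.findall(text)) for regex in compiled)
```

Compiling the patterns individually, rather than joining them into one alternation, means a phrase matched by two patterns is counted twice. Whether that is the desired behavior is a design choice left open here.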

Database schema

As of 02:46, 25 January 2010 (UTC)
sqlite> .schema
CREATE TABLE dump (
    dump_id INTEGER NOT NULL UNIQUE,
    dump_title TEXT NOT NULL,
    dump_length INTEGER,
    dump_bad_words INTEGER,
    dump_bad_words_near_ref INTEGER,
    dump_bad_words_near_citation_needed INTEGER,
    dump_individual_ref_tags INTEGER,
    dump_external_links INTEGER,
    dump_inline_citation_templates INTEGER,
    dump_banner_citation_templates INTEGER
);

CREATE TABLE database (
    db_id INTEGER NOT NULL UNIQUE,
    db_title TEXT,
    db_first_edit INTEGER,
    db_last_edit INTEGER,
    db_revision_count INTEGER,
    db_creator TEXT,
    db_watchers INTEGER
);

CREATE TABLE views (
    views_title TEXT NOT NULL,
    views_value INTEGER NOT NULL,
    views_month TEXT NOT NULL
);

CREATE INDEX vtindex ON views(views_title);
CREATE INDEX vvindex ON views(views_value);
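
For anyone running their own analysis against this schema, a query might look like the sketch below. It assumes that dump_id and db_id both hold the page ID and that views_title matches dump_title exactly; the function name and the choice of filter are hypothetical.

```python
import sqlite3


def pages_with_unsourced_bad_words(conn: sqlite3.Connection,
                                   min_bad_words: int = 1):
    """List pages that contain bad words but no inline <ref tags.

    Rows are ordered by page views, most viewed first; pages without a
    views row sort last because SQLite treats NULL as the smallest value.
    """
    query = """
        SELECT d.dump_title,
               d.dump_bad_words,
               d.dump_individual_ref_tags,
               db.db_revision_count,
               v.views_value
        FROM dump AS d
        JOIN "database" AS db ON db.db_id = d.dump_id
        LEFT JOIN views AS v ON v.views_title = d.dump_title
        WHERE d.dump_bad_words >= ?
          AND d.dump_individual_ref_tags = 0
        ORDER BY v.views_value DESC
    """
    return conn.execute(query, (min_bad_words,)).fetchall()
```

The table name is quoted because DATABASE is also an SQLite keyword. The function takes any open connection, for example one returned by `sqlite3.connect()` on a copy of the released database file.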

To-do

  • case sensitive bad words (e.g., "dick")
  • add column in the database table for page creator text
  • add all code to code repo
  • prefix database names properly
  • version views columns (e.g., nov_09_views)
  • test.db includes non-articles
  • need to add "reference_headers" count
  • differentiate bad words vs. very bad words
  • track Google hits?
  • track incoming page links from other articles?