User:MZMcBride/climax
climax is the code name for a project that gathers and analyzes a set of attributes of biographies of living people in an attempt to programmatically find problematic biographies.
Attributes
Python scripts collect a number of attributes of the pages and insert them into an SQLite database. The database will be released to the public, with the exception of one attribute: the number of page watchers. Below is a table of the raw attributes collected. Other attributes will be derived from this data.
Attribute | Source | Description |
---|---|---|
Page ID | Database dump | Numeric page ID |
Page title | Database dump | Page title |
Page length | Database dump | Length of the page in bytes |
Bad words count | Database dump | Total number of instances of bad words |
Bad words within 50 bytes of an inline references tag | Database dump | Total number of bad words that are within 50 bytes of an inline <ref> tag |
Bad words within 50 bytes of an inline "citation needed" tag | Database dump | Total number of bad words that are within 50 bytes of an inline {{citation needed}} tag |
Inline references tag count | Database dump | Total number of instances of "<ref" |
Inline "citation needed" tag count | Database dump | Total number of instances of "{{citation needed}}" (and its redirects) |
Banner "citation needed" count | Database dump | Total number of instances of "{{BLP unsourced}}" (and its redirects) |
External hyperlink count | Database dump | Total number of instances of "http" |
Page views | Compiled text file | Total number of page views (in November 2009) |
Days since last edit | Replicated database | Number of days since the page was edited most recently |
Days since first edit | Replicated database | Number of days since page creation |
Revisions count | Replicated database | Total number of revisions for the page |
Page watchers count | Replicated database | Number of users with the page on their watchlist |
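The two proximity attributes in the table could be computed along the following lines. This is only a minimal sketch: the shortened word list, the function name `count_bad_words_near`, and the exact definition of "within 50 bytes" (measured from the edges of the bad word to the start of the nearest tag) are assumptions, not the project's actual code.

```python
import bisect
import re

# Hypothetical, shortened word list; the real list is much longer.
BAD_WORDS = re.compile(rb"\b(?:arrested|fraud|scandal)\b", re.IGNORECASE)
# Matches opening tags such as "<ref>" and "<ref name=...>", but not "</ref>".
REF_TAG = re.compile(rb"<ref", re.IGNORECASE)


def count_bad_words_near(wikitext, marker=REF_TAG, window=50):
    """Count bad words with a marker starting within `window` bytes of them."""
    # Work on UTF-8 bytes so that distances are in bytes, not characters.
    data = wikitext.encode("utf-8")
    marker_positions = [m.start() for m in marker.finditer(data)]
    count = 0
    for match in BAD_WORDS.finditer(data):
        lowest = match.start() - window
        highest = match.end() + window
        # Find the first marker at or after `lowest`, then check the upper bound.
        i = bisect.bisect_left(marker_positions, lowest)
        if i < len(marker_positions) and marker_positions[i] <= highest:
            count += 1
    return count
```

Passing a different compiled pattern as `marker` (for example one matching `{{citation needed`) would give the corresponding "citation needed" count under the same assumptions.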
Analysis
The value of this data comes from analyzing it. climax will focus on a scoring system. Other users may be interested in performing their own analysis to examine certain trends or problem areas.
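The page does not specify how the scoring system works. Purely as an illustration of what scoring over the raw attributes might look like, here is a sketch; the function name `review_score`, the choice of inputs, and every weight are hypothetical placeholders rather than the project's method.

```python
import math


def review_score(bad_words, bad_words_near_ref, ref_tags, page_views):
    """Return a rough priority score; higher means 'look at this page sooner'.

    All weights here are illustrative assumptions, not values from the project.
    """
    # Bad words far from any inline reference are treated as more worrying.
    unsourced = max(bad_words - bad_words_near_ref, 0)
    score = float(unsourced)
    if ref_tags == 0:
        score *= 2  # hypothetical penalty for pages with no inline references
    # Hypothetical: scale by readership so widely read pages surface first.
    return score * math.log10(page_views + 10)
```

A page with no bad words scores zero regardless of its views, so sorting by this score would push pages with unsourced bad words and high readership to the top.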
Technical details
Going to split this into a few separate scripts. The dump scanner goes first. Then the various properties need to be retrieved from a text file and from the database....
Script statuses — largely deprecated
- Test cases
Bad words
Urgently need to add case sensitivity support here.
Words that definitely need case sensitivity:
- dick
- evil
- traitor
- arrested
- psycho
Words to possibly remove from the bad word list:
- steals (lolbaseball)
- investigations
Full list of bad words
|
---|
\babusing\b \babuse\b \babused\b \babducted\b \babduction\b \baccuse\b \baccused\b \baccusation\b \ballege\b \balleged\b \banus\b \barrest\b \barrested\b \barse\b \bass\b assault\b assaulted\b asshole bastard bitch bloody bollocks \bbribe\b \bbribes\b \bbribed\b bugger \bcharges\b child molester child molestor child predator child predater \bcocks\b convict\b convicted\b \bcorrupt\b cunt\b \bdick\b dumbass espionage \bevil\b fag\b faggot\b faggots\b fags\b \bfired\b \bfled\b \bflee\b fraud\b \bfuck\b \bfucks\b \bfucked\b is gay \bghey\b guilty had an affair \bhates idiot \bimpeach insane insanity investigation jackass \bkilled\b \bliar\b \bliars\b \blie\b \blied\b lol\b \blying malpractice molest\b molested molestation molesting murder\b murdered\b murdering\b mutant neglect neglected negligent \bnigger paedophile parole pedophile psychiatric \bpedo\b \bpsycho\b \bpussy\b \bracist \brape\b \braped\b \braping \bscandal\b sexual assault sexually assault \bshit \bslut\b \bsluts\b \bslutty\b \bsteal \bstole\b \bstupid\b \bretarded\b \bretard\b \btheft\b \btits\b \btwat\b \bwanker your mom \bcharged\b \bsentenced\b in jail\b \btraitor |
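One possible way to add the case sensitivity support mentioned above is to compile the words in two groups. The sketch below assumes that "case sensitivity" means matching only the lowercase form (so that names and titles such as "Dick" or "Arrested Development" are skipped) and that the rest of the list is matched case-insensitively; both the split and the helper names are assumptions. It uses Python's scoped inline flag `(?i:...)`, available since Python 3.6.

```python
import re

# Patterns copied from the list above; the grouping is an assumption.
CASE_SENSITIVE = [r"\bdick\b", r"\bevil\b", r"\btraitor", r"\barrested\b", r"\bpsycho\b"]
CASE_INSENSITIVE = [r"fraud\b", r"\bscandal\b", r"\bliar\b"]  # shortened for the example


def compile_bad_words(sensitive, insensitive):
    """Build one regex where only the second group ignores case."""
    parts = list(sensitive)
    if insensitive:
        # (?i:...) applies IGNORECASE to this group only.
        parts.append("(?i:" + "|".join(insensitive) + ")")
    return re.compile("|".join(parts))


BAD_WORDS = compile_bad_words(CASE_SENSITIVE, CASE_INSENSITIVE)


def count_bad_words(text):
    """Count non-overlapping bad-word matches in the text."""
    return sum(1 for _ in BAD_WORDS.finditer(text))
```

With these lists, "Dick Cheney was called a dick and a LIAR." would count two matches: the lowercase "dick" and the case-insensitive "LIAR".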
Database schema
As of 02:46, 25 January 2010 (UTC)
|
---|
sqlite> .schema
CREATE TABLE dump (
dump_id INTEGER NOT NULL UNIQUE,
dump_title TEXT NOT NULL,
dump_length INTEGER,
dump_bad_words INTEGER,
dump_bad_words_near_ref INTEGER,
dump_bad_words_near_citation_needed INTEGER,
dump_individual_ref_tags INTEGER,
dump_external_links INTEGER,
dump_inline_citation_templates INTEGER,
dump_banner_citation_templates INTEGER
);
CREATE TABLE database (
db_id INTEGER NOT NULL UNIQUE,
db_title TEXT,
db_first_edit INTEGER,
db_last_edit INTEGER,
db_revision_count INTEGER,
db_creator TEXT,
db_watchers INTEGER
);
CREATE TABLE views (
views_title TEXT NOT NULL,
views_value INTEGER NOT NULL,
views_month TEXT NOT NULL
);
CREATE INDEX vtindex ON views(views_title);
CREATE INDEX vvindex ON views(views_value);
|
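For anyone wanting to run their own analysis, a query across these tables might look like the sketch below. It recreates a subset of the schema in memory and fills it with made-up rows. The join conditions (`dump_id` equal to `db_id`, `views_title` equal to `dump_title`), the `views_month` format `"2009-11"`, and the function name `find_candidates` are assumptions, since the page does not document them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Subset of the schema above; "database" is quoted because it is an SQL keyword.
conn.executescript("""
CREATE TABLE dump (
  dump_id INTEGER NOT NULL UNIQUE,
  dump_title TEXT NOT NULL,
  dump_bad_words INTEGER,
  dump_individual_ref_tags INTEGER
);
CREATE TABLE "database" (
  db_id INTEGER NOT NULL UNIQUE,
  db_title TEXT,
  db_revision_count INTEGER
);
CREATE TABLE views (
  views_title TEXT NOT NULL,
  views_value INTEGER NOT NULL,
  views_month TEXT NOT NULL
);
""")

# Made-up rows for illustration only.
conn.executemany("INSERT INTO dump VALUES (?, ?, ?, ?)", [
    (1, "Example Person A", 0, 12),
    (2, "Example Person B", 3, 0),
    (3, "Example Person C", 1, 0),
])
conn.executemany('INSERT INTO "database" VALUES (?, ?, ?)', [
    (1, "Example Person A", 250),
    (2, "Example Person B", 4),
    (3, "Example Person C", 2),
])
conn.executemany("INSERT INTO views VALUES (?, ?, ?)", [
    ("Example Person A", 5000, "2009-11"),
    ("Example Person B", 900, "2009-11"),
    ("Example Person C", 150, "2009-11"),
])


def find_candidates(conn, month):
    """Pages with bad words but no inline references, most viewed first."""
    query = """
        SELECT d.dump_title, d.dump_bad_words, v.views_value, b.db_revision_count
        FROM dump AS d
        JOIN "database" AS b ON b.db_id = d.dump_id
        JOIN views AS v ON v.views_title = d.dump_title
        WHERE d.dump_bad_words > 0
          AND d.dump_individual_ref_tags = 0
          AND v.views_month = ?
        ORDER BY v.views_value DESC
    """
    return conn.execute(query, (month,)).fetchall()
```

Calling `find_candidates(conn, "2009-11")` on the sample rows returns the two pages that contain bad words and have no `<ref` tags, ordered by page views.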
To-do
- case-sensitive bad words (e.g., "dick")
- add column in the database table for page creator text
- add all code to code repo
- prefix database names properly
- version views columns (e.g., nov_09_views)
- test.db includes non-articles
- need to add "reference_headers" count
- differentiate bad words vs. very bad words
- track Google hits?
- track incoming page links from other articles?