User:Alanyst/Vector space research

This is another page that documents research pertaining to Wikipedia:Requests for arbitration/Mantanmoreland. See also User:Alanyst/Edit collision research.

Aim of the research

The research aims to provide insight into the question of whether two Wikipedia user accounts, 'MM' and 'SH', are independent or are controlled by the same individual (sockpuppets). This research focuses on the similarity of edit summaries as a metric of independence.

Initial assumptions

If two Wikipedia accounts are controlled by the same individual, we can expect a higher degree of similarity in the edit summaries made by those accounts than for a typical pair of unrelated accounts. Even if the individual is consciously avoiding using similar language, there is still likely to be significant overlap in terms used simply because people's speech patterns are deeply ingrained.

Similarity

The notion of similarity needs to be defined. It is not enough to show that two accounts share similar terms, since unrelated accounts might also employ those terms. It is also important to show that terms that one account employs and the other does not are less of a factor than the similar terms. Also, the more commonly used the term is among the general population of editors, the less weight the term should have in the analysis.

Edit summary similarity as an information retrieval problem

To gauge similarity between different editors' edit summaries, we can view the problem in terms of information retrieval. Let A be the set (more strictly, the bag) of all terms used by editor A in A's edit summaries. We define B similarly for editor B, and so forth for all editors who have contributed edit summaries.

We can view A, B, and so forth as individual documents, much like web pages (though without HTML markup and full of vaguely Wikipedia-related gibberish). If we want to know whose edit summaries are most similar to a particular editor's (say, editor 'MM'), we treat the document MM as a search query over the set of all documents in our corpus, and find the closest match. Just as a search engine ranks the results by relevance, so we can assess the relevance of MM to each other editor's combined edit summaries.

The vector space model is a well-known technique in information retrieval that can provide this kind of analysis. Term frequencies are calculated for each document and for the entire corpus. Each term is treated as one dimension of a vector, with the frequency of that term in that document (normalized by the overall frequency of the term in the corpus) supplying the magnitude of that component of the vector. Then the similarity of the documents is calculated by measuring the angle (actually, the cosine) between the vectors that represent each document. The cosine is maximized when two vectors are collinear (complete similarity), and minimized when they are orthogonal (zero similarity).
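
As a reminder of the standard definition, the cosine measure can be written as follows. The symbols are assumptions introduced here for illustration: w_{A,t} denotes the weight of term t in editor A's vector, and the sums run over all terms in the corpus.

```latex
\cos\theta_{A,B}
  = \frac{\sum_{t} w_{A,t}\, w_{B,t}}
         {\sqrt{\sum_{t} w_{A,t}^{2}}\;\sqrt{\sum_{t} w_{B,t}^{2}}}
```

A term contributes to the numerator only when both editors used it, while every term an editor used contributes to that editor's norm in the denominator. This is how terms used by one account but not the other pull the similarity down.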

Methodology

I began with the set of all revisions made in 2007. Each record contained the editor name (or IP address for anonymous editors), a timestamp, and the full edit summary for the revision. The timestamp was dropped for this analysis.

The set of all 2007 revisions is too large for a vector space analysis on my hardware, so I reduced the set to all revisions made by editors who had between 1000 and 2000 edits (inclusive) during that time. These include edits made by MM and SH.

I then combined each editor's edit summaries to produce a set of tf-idf scores as follows:

// first gather raw counts of terms per user (=document) and for entire corpus
for each revision do
  if no edit summary then skip
  remove commonly seen automated edit summaries:
    * automatic section comments (/* like this */)
    * revert-tool messages ("Reverted P by Q to R by S")
    * undo-tool messages ("Undo/Undid...by User" with wiki markup)
    * Twinkle ("...using TW")
  remove HTML entities (&quot;, &gt;, etc.)
  remove all non-alphabetic and non-whitespace sequences (punctuation, digits, symbols)
  condense all whitespace sequences to single space characters
  trim whitespace from the start and end of the text
  if nothing left in edit summary then skip
  split edit summary into tokens using space character as delimiter
  // note: token case was preserved in order to capture similar styles 
  // of capitalization, so "Foo" is treated as a different token than "foo".
  increment token count for the user and for the entire corpus
end loop on each revision

// next calculate inverse document frequency for each term (see vector space model)
for each term do
  term.idf = log(number of users/number of times term appears in corpus)
end loop on each term

// finally, calculate term frequency for each term and user
// and multiply by idf to get vector of term weights for each user
for each user do
  for each term appearing in the user's edit summaries do
    user.term.weight = (number of times term appears in user's edit summaries) * idf
  end loop on each term
  write out the vector of term=>weight pairs for the user
end loop on each user
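
The pseudocode above can be sketched in Python roughly as follows. This is an illustration only, not the Perl code actually used (extractIndex.pl): the names tokenize and tfidf_vectors and the regular expressions are assumptions, and only the section-comment filter is shown out of the automated-summary filters listed above. The idf line follows the pseudocode as written (corpus-wide term count in the denominator); textbook idf uses the number of documents containing the term instead.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed patterns; the original scripts may match these differently.
SECTION_COMMENT = re.compile(r"/\*.*?\*/")   # automatic section comments
ENTITY = re.compile(r"&#?\w+;")              # HTML entities such as &quot;
NON_ALPHA = re.compile(r"[^A-Za-z\s]+")      # punctuation, digits, symbols


def tokenize(summary):
    """Clean one edit summary and split it into case-preserving tokens."""
    text = SECTION_COMMENT.sub("", summary)
    text = ENTITY.sub("", text)
    text = NON_ALPHA.sub("", text)
    # str.split() with no argument condenses whitespace and trims both ends.
    return text.split()


def tfidf_vectors(revisions):
    """Build a term -> weight dict per editor from (editor, summary) pairs."""
    per_user = defaultdict(Counter)
    corpus = Counter()
    for editor, summary in revisions:
        if not summary:
            continue
        tokens = tokenize(summary)
        if not tokens:
            continue
        per_user[editor].update(tokens)
        corpus.update(tokens)

    n_users = len(per_user)
    # As in the pseudocode: log(number of users / corpus count of the term).
    idf = {term: math.log(n_users / count) for term, count in corpus.items()}

    return {
        editor: {term: n * idf[term] for term, n in counts.items()}
        for editor, counts in per_user.items()
    }


# Made-up sample data, for illustration only.
sample = [
    ("A", "/* History */ fix typo"),
    ("B", "rv vandalism"),
    ("A", "fix &quot;link&quot;"),
    ("C", ""),
]
vectors = tfidf_vectors(sample)
```

In this toy sample, editor C contributes nothing because the only summary is empty, and "fix" receives weight zero for editor A because its corpus count equals the number of users.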

I calculated tf-idf vectors for all 3629 editors who had between 1000 and 2000 edits in 2007.

Finally, I calculated similarity rankings for all 3629 editors with respect to MM and SH individually. In other words, I treated MM's tf-idf vector as the query and ranked all other editors' vectors by similarity, and then did the same for SH's vector.
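
The ranking step might look like the sketch below, under stated assumptions. The function names and the toy vectors are hypothetical, and the ignore parameter only imitates the idea of excluding a list of terms from the calculation; whether the real vsm.pl also drops ignored terms from the vector norms is not documented here, so that choice is an assumption.

```python
import math


def cosine(u, v, ignore=frozenset()):
    """Cosine similarity of two term -> weight dicts, skipping ignored terms."""
    dot = sum(w * v[t] for t, w in u.items() if t in v and t not in ignore)
    norm_u = math.sqrt(sum(w * w for t, w in u.items() if t not in ignore))
    norm_v = math.sqrt(sum(w * w for t, w in v.items() if t not in ignore))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_editor, vectors, ignore=frozenset()):
    """Score every editor against the query editor, most similar last."""
    query = vectors[query_editor]
    scores = [(editor, cosine(query, vec, ignore))
              for editor, vec in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1])


# Toy vectors, invented for illustration.
toy = {
    "MM": {"rply": 2.0, "SEC": 1.0},
    "SH": {"rply": 1.0, "SEC": 3.0},
    "X": {"typo": 1.0},
}
for editor, score in rank_by_similarity("MM", toy):
    print(f"{editor}|{score:.6f}")
```

The query editor always appears at the bottom of its own list with a score of (approximately) 1, which is the trivial self-similarity seen in the result tables.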

Artifacts of the process

The following table details the artifacts involved in the analysis, which can be found on my server (link posted on the arbitration case's Evidence page).

File	Type	Remarks	Command to create
enwiki-20080103-stub-meta-history.xml.gz	Compressed data	Original data dump downloaded from Wikimedia
filter2007.pl	Perl script	Extracts all 'revision' tags with timestamp within 2007 from data dump
revisions.2007.bz2	Compressed data	Full revision metadata for 2007 (sans article text).	`gzip -c -d enwiki-20080103-stub-meta-history.xml.gz \| perl filter2007.pl \| bzip2 > revisions.2007.bz2`
condenseRevs.pl	Perl script	Extracts username or IP, timestamp, and raw edit comment from each 'revision' tag and outputs pipe-delimited record for smaller filesize and easier parsing. Does not preserve revision ID, numeric contributor ID, "minor comment" flag, or text ID.
smallrevs.2007.bz2	Compressed data	Username/IP, timestamp, and raw edit comment for all 2007 revisions.	`bzcat revisions.2007.bz2 \| perl condenseRevs.pl \| bzip2 > smallrevs.2007.bz2`
countEdits.pl	Perl script	Counts how many revisions each editor made.
editorCounts.2007	Uncompressed data	Pipe-delimited records containing editor name/IP and count of edits, for all editors with edit counts in 2007 between 1000 and 2000 inclusive.	`bzcat smallrevs.2007.bz2 \| perl countEdits.pl --from=1000 --to=2000 \| sort --key=1 --field-separator=\\| > editorCounts.2007`
sample1K2Krevisions.pl	Perl script	Prints revisions for editors listed in an input file.
sample1K2Kedits.bz2	Compressed data	Subset of smallrevs.2007.bz2, limited to revisions by editors listed in editorCounts.2007.	`bzcat smallrevs.2007.bz2 \| perl sample1K2Krevisions.pl -f editorCounts.2007 \| bzip2 > sample1K2Kedits.bz2`
extractIndex.pl	Perl script	Derives tf-idf weights for each editor's combined edit summaries taken from input file.
tfidf.1K2K.bz2	Compressed data	List of editors and, for each, every term they used in 2007 along with tf-idf weight for that term (formatted as: "term1=weight1 term2=weight2 ...").	`bzcat sample1K2Kedits.bz2 \| perl extractIndex.pl \| bzip2 > tfidf.1K2K.bz2`
vsm.pl	Perl script	Calculates VSM similarity rankings for a given editor, from a file containing tf-idf scores. With -v option, accepts comma-delimited list of terms to ignore in ranking calculation.
Mantanmoreland.vsm Samiharris.vsm	Uncompressed data	VSM similarity rankings for Mantanmoreland and Samiharris, sorted by similarity rank (highest last).	(Example shown is for Mantanmoreland account; similar for Samiharris.) `bzcat tfidf.1K2K.bz2 \| perl vsm.pl -u Mantanmoreland \| sort --key=2n --field-separator=\\| > Mantanmoreland.vsm`

Results

Note: These results are erroneous. Please see the "Correction" section below.

The 20 lowest-similarity editors are (lowest first):

MM		SH
Editor	Weight	Editor	Weight
AfDBot	0.000000	AfDBot	0.000000
Warpozio	0.000001	Warpozio	0.000002
Uncle G's 'bot	0.000003	Uncle G's 'bot	0.000003
Android Mouse Bot 4	0.000016	Gerakibot	0.000008
Tsemii	0.000020	NongBot	0.000011
Puuropyssy	0.000024	Tsemii	0.000022
Lissander	0.000042	Android Mouse Bot 4	0.000024
SQLBot	0.000052	Kauczuk	0.000029
Mircea cs	0.000053	Puuropyssy	0.000035
Gerakibot	0.000054	Lissander	0.000036
SHARU(ja)	0.000057	Nk	0.000036
Nk	0.000063	Soregashi	0.000037
Lindum	0.000072	Davecrosby uk	0.000038
Jacob.jose	0.000079	Jacob.jose	0.000042
Soregashi	0.000083	GurchBot	0.000046
Tiyoringo	0.000092	Tiyoringo	0.000067
Kauczuk	0.000093	SQLBot	0.000073
NongBot	0.000111	Sporti	0.000075
TnS	0.000122	SHARU(ja)	0.000079
Paul-L	0.000124	Tsiaojian lee	0.000086

The 20 highest-similarity editors are (lowest first):

MM		SH
Editor	Weight	Editor	Weight
Jinxmchue	0.072591	William Pietri	0.088757
RenamedUser2	0.072745	Tango	0.089049
Antaeus Feldspar	0.073847	Istanbuljohnm	0.089700
Tirronan	0.073930	Littleolive oil	0.091930
William Pietri	0.075480	Monkeyzpop	0.093728
Icarus3	0.075492	ObiterDicta	0.093880
Ikanreed	0.076567	Madeleine Price Ball	0.095214
Shot info	0.078433	Alii h	0.096108
Monkeyzpop	0.079736	Davidbspalding	0.100061
Revolving Bugbear	0.079747	Shot info	0.101637
Qworty	0.080497	80.229.29.19	0.106104
Davidbspalding	0.081912	Tdl1060	0.108981
Lisapollison	0.083451	Qworty	0.110574
AniMate	0.087351	Ww	0.111477
Ramsquire	0.094263	Lisapollison	0.116296
80.229.29.19	0.099841	Ramsquire	0.120949
Istanbuljohnm	0.101898	AniMate	0.128838
Piperdown	0.125442	Mantanmoreland	0.178484
Samiharris	0.178484	Piperdown	0.233706
Mantanmoreland	1.000000	Samiharris	1.000000

Analysis

MM and SH are in each other's top two similarity rankings, disregarding the trivial self-similarity.
Interestingly, User:Piperdown is also in both editors' top two, and in fact ranks as more similar to SH than MM does.
Piperdown's strong similarity ranking challenges the hypothesis of collusion somewhat, as it is well known that Piperdown is on the opposite side of the Overstock issue from MM and SH. However, the terms Piperdown has most strongly in common with MM and SH are mostly connected to the Overstock battle, in which SH participated more than MM did during 2007:
- SEC
- Forbes
- Bloomberg
- hedge
- shorting
- Byrne
- Weiss
- SHO
- piperdown
- material
- DOB
- RS
While MM and SH also share strong correlations with some Overstock-related terms, there are also some distinctive terms of habit that strongly correlate:
- SEC
- rply
- expanding
- clarifying
- distort
- regulatory
- duplicative
- NPA
- RS
- naked
The technique of stripping out non-alphabetic characters and tokenizing on whitespace means that phrases, numbers, and punctuation do not factor into the results. This means that the "as per" and " -- " tics do not influence these findings.

Variations

At User:Noroton's suggestion, I re-ran the vector space algorithm for Samiharris and Mantanmoreland with a set of words excluded: SEC, Forbes, Bloomberg, naked, shorting, Byrne, Weiss, hedge, SHO, piperdown, material, DOB, and RS. The results:

Samiharris
Editor	Weight
Redrocketboy	0.093542
Ikanreed	0.094971
William Pietri	0.096567
Tango	0.097557
Istanbuljohnm	0.098271
Monkeyzpop	0.098586
ObiterDicta	0.100128
Alii h	0.104990
Shot info	0.106822
Davidbspalding	0.107650
80.229.29.19	0.116242
Tdl1060	0.119069
Qworty	0.119986
Ww	0.121966
Lisapollison	0.126238
Ramsquire	0.131007
Piperdown	0.136043
AniMate	0.141181
Mantanmoreland	0.168838
Samiharris	1.000000

Mantanmoreland
Editor	Weight
Antaeus Feldspar	0.071822
Tttom	0.071906
GabrielF	0.072012
Tirronan	0.074653
Shot info	0.074855
William Pietri	0.075941
Icarus3	0.076027
Ikanreed	0.077093
Monkeyzpop	0.079326
Revolving Bugbear	0.080164
Qworty	0.080644
Davidbspalding	0.081183
Lisapollison	0.083997
AniMate	0.087079
Piperdown	0.091709
Ramsquire	0.094061
80.229.29.19	0.101006
Istanbuljohnm	0.103086
Samiharris	0.168838
Mantanmoreland	1.000000

Comments

(please move to discussion page if you think that's the best place)

Deleting the Topic references (financial words in this case) seems like a very good idea. It was also mentioned on the workshop page by User:Avruch. I'd like to see the Mantanmoreland results as well (if possible). This does suggest that the topic words are quite important in determining these results, but that there is more, i.e. Piperdown is still highly "correlated." I was wondering what else might be driving this, so checked out AniMate's edit summaries - it's pretty clear, he uses "verb"-ing quite a bit.

I'm not familiar with this method (yes, I'll try to check it out) but it seems good, if it's not too sensitive to a couple of things like same articles edited, and (maybe) "verb"-ing. I'm afraid I don't see the "timing correlations" as anything but proof that they edit from the same time zone - which I think we already knew. Keep up the good work. Smallbones (talk) 20:03, 21 February 2008 (UTC)

"Verbing" will only matter if they are using the same verbs repeatedly in that fashion. Both "revising" and "extending" are "verbing" forms, but they are completely different tokens for this algorithm.

As to the timing correlations, it is just one piece of the puzzle. Sort of like the Blind Men and an Elephant story, you mislead yourself if you only look for proof from individual pieces. All of the evidence needs to be looked at as a complete pattern. The timing correlation combined with the lack of interleaving edits while working in the same topic areas to me indicates two accounts that were actively working to keep their activities from looking at first glance coordinated. The punctuation similarity is independent of the method described on this page, because this algorithm treats all punctuation as whitespace. So we have multiple strong threads of evidence that in my mind form a coherent pattern - timing data as a whole, topics chosen and POV, punctuation and word choice. GRBerry 21:20, 21 February 2008 (UTC)

Okay, I've added the results for Mantanmoreland with the same topic filter as I used for the Samiharris variation above. alanyst ^/talk/ 05:04, 22 February 2008 (UTC)

Remove 13 finance-related terms and Piperdown's similarity recedes, most dramatically in the Mantanmoreland table. This would seem to indicate that subject matter is not the underlying reason for the similarity between these two user accounts; rather, they share a deeper, more personal style independent of the article subjects they edited. Another strand in the rope. Noroton (talk) 06:52, 22 February 2008 (UTC)

Impressive. A few comments:

It might be worth noting that there is a long tradition of statistical analysis for determining text authorship, cf. forensic linguistics and stylometry. (For example, this thesis has a detailed overview.) This includes the application of concepts from information retrieval, like above. I don't say this to diminish your achievement, but to stress that in such a situation, it is an entirely reasonable decision to use a careful application of statistical methods, contrary to some "with statistics you can prove anything" comments in the debate about this case (I don't mean Smallbones, who seems to have taken a more nuanced stance).

For a paper specifically about detecting sock puppets in online communities, whose conclusions could be of value here, see:

Jasmine Novak, Prabhakar Raghavan, Andrew Tomkins: Anti-Aliasing on the Web. In: Proceedings of the 13th international conference on World Wide Web (2004) online version

They used a real-life data set to evaluate the accuracy of different similarity measures. (More concretely: They looked at 100 posters on a board of the web forum of CourtTV.com, with at least 100 postings each, and split the 100 accounts artificially into 200, such that each account a had one "artificial sock puppet" a´. The accuracy of a similarity measure is then the probability with which a´ is the account most similar to a in this measure among the 199 others. See Chapter 4. The number of users is smaller than in your data set, but the size of the text corpus for each user should be comparable.)

They compared tf–idf (in its variant without the log) against the Kullback–Leibler divergence and found that the KL divergence yielded a similarity measure with much better accuracy.

They also state that smoothing the distributions (weights) improved the accuracy greatly. Smoothing here means replacing the term frequency vector for one user by a linear combination of it with the overall (whole corpus) frequency vector. They say this is because the unsmoothed distribution over-emphasizes highly infrequent terms, which could be the reason for the Piperdown result (the highly infrequent terms coming from an external issue which affected several users, here the Overstock battle, rather than from each person's default preferred vocabulary). In other words: Smoothing might achieve automatically and less arbitrarily what has been done above by selecting and removing these terms by hand.

Quote from their conclusion:

In this paper, we have shown that matching aliases to authors [i.e. accounts to real life persons] with accuracy in excess of 90% is practically feasible in online environments.

(And they did not use time stamps and other features which have been analyzed elsewhere in this case.)

Regards, High on a tree (talk) 12:15, 23 February 2008 (UTC)
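
For readers unfamiliar with the two ideas mentioned in the comment above (smoothing and the Kullback–Leibler divergence), here is a minimal sketch. It is not taken from the cited paper: the function names, the mixing parameter lam and its default value, and the toy counts are all assumptions chosen for illustration.

```python
import math


def smoothed_distribution(user_counts, corpus_counts, lam=0.5):
    """Mix a user's relative term frequencies with the corpus-wide ones.

    Assumes every term the user used also appears in corpus_counts and
    that lam < 1, so every corpus term gets a nonzero probability.
    """
    user_total = sum(user_counts.values())
    corpus_total = sum(corpus_counts.values())
    return {
        term: lam * user_counts.get(term, 0) / user_total
        + (1 - lam) * count / corpus_total
        for term, count in corpus_counts.items()
    }


def kl_divergence(p, q):
    """D(p || q) over a shared vocabulary; smaller means more similar."""
    return sum(pt * math.log(pt / q[t]) for t, pt in p.items() if pt > 0)


# Toy counts, invented for illustration.
corpus = {"a": 3, "b": 3}
p = smoothed_distribution({"a": 2}, corpus)
q = smoothed_distribution({"b": 2}, corpus)
print(kl_divergence(p, q))
print(kl_divergence(p, p))
```

Unlike the cosine measure, the KL divergence is a distance-like quantity (zero for identical distributions) and is not symmetric, so a ranking based on it would sort the smallest values as the best matches.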

I've had another idea. Would it be possible to extract the Mantanmoreland 2006 contributions and use them to query the data set? Since Mantanmoreland has more than 2K contributions in 2006, maybe just the last 2K of them? It would be interesting to see how Mantanmoreland 2006 compares to Mantanmoreland 2007 and to Samiharris. (I think we all believe that Mantanmoreland 2006 == Mantanmoreland 2007.) However, I'm not certain if the method allows this. If it doesn't allow that simple an approach, would it be worth adding Mantanmoreland 2006 to the complete data set and then running with MM 2006, MM 2007, and SH as the three queries? GRBerry 22:54, 26 February 2008 (UTC)

Correction

I have detected an error in my original VSM work that affects the results given above. Those results should be considered unreliable.

I have corrected the error and computed new results, which I give below. I will also detail what went wrong.

New results

These new results show, for Mantanmoreland and Samiharris, the top 20 editors in terms of edit summary similarity, for two different datasets:

editors with edit counts between 1000 and 2000 in 2007
editors with edit counts between 500 and 3500 in 2007

(Note that the second is a superset of the first.)

Overview of results:

Samiharris is not in Mantanmoreland's top 20 similar editors for either dataset, contrary to the previous results. In fact, Samiharris ranks at about #98 of 3628 in the 1000-2000 dataset, and at #188 of 11377 in the 500-3500 dataset.
However, Mantanmoreland is #1 in Samiharris's rankings of similar editors, for both datasets.
Note that Piperdown no longer appears in any of these rankings. He ranks at #288 in the 1000-2000 dataset for Mantanmoreland, and #151 for Samiharris in the same dataset. In the 500-3500 dataset, he ranks at #688 for Mantanmoreland and #351 for Samiharris.

MM 1K2K

MM 1000-2000
Editor	Weight
Jd2718	0.906714
Paul Pieniezny	0.906970
Nethgirb	0.907032
Sparkzilla	0.907156
Elaich	0.907357
Action potential	0.907641
Rhialto	0.908820
Jinxmchue	0.909088
Shot info	0.909232
Arjuna808	0.909299
Rainwarrior	0.910045
Davidbspalding	0.910842
Loonymonkey	0.911421
Madeleine Price Ball	0.912481
6SJ7	0.915240
Skywriter	0.916594
Ikanreed	0.917859
Jance	0.921206
Antaeus Feldspar	0.925242
Mantanmoreland	1.000000

MM500-3500

MM 500-3500
Editor	Weight
Jinxmchue	0.905089
Phil Sandifer	0.905458
Davidbspalding	0.906335
Skybunny	0.906777
Loonymonkey	0.907059
Madeleine Price Ball	0.907387
Edhubbard	0.907447
Snalwibma	0.907594
GDallimore	0.910814
Skinwalker	0.910888
Skywriter	0.911339
SheffieldSteel	0.911888
6SJ7	0.912143
Andyvphil	0.912587
Ikanreed	0.913579
Jance	0.915028
Antaeus Feldspar	0.921188
Risker	0.923119
Jmh123	0.929817
Mantanmoreland	1.000000

SH1K2K

SH 1000-2000
Editor	Weight
Jance	0.822843
Melonbarmonster	0.824340
Alii h	0.824862
Fresheneesz	0.825351
MaximvsDecimvs	0.826019
MJBurrage	0.826792
Nadirali	0.827380
Gaimhreadhan	0.829191
Maniwar	0.830085
Northmeister	0.830931
Jim Butler	0.833111
Fourdee	0.837236
W. Frank	0.848141
Skywriter	0.850681
Tonicthebrown	0.854164
Monkeyzpop	0.860232
Littleolive oil	0.864525
AniMate	0.865724
Mantanmoreland	0.880997
Samiharris	1.000000

SH500-3500

SH 500-3500
Editor	Weight
Malljaja	0.830741
Dseer	0.830809
Jmh123	0.830888
Northmeister	0.831229
NBeale	0.831244
Jim Butler	0.832238
Fourdee	0.835833
Khorshid	0.837822
Guliolopez	0.838067
Alice	0.838662
Zeraeph	0.842278
W. Frank	0.845215
Tonicthebrown	0.848386
Skywriter	0.849354
Kierant	0.857126
Monkeyzpop	0.857441
AniMate	0.859353
Littleolive oil	0.861320
Mantanmoreland	0.874837
Samiharris	1.000000

What went wrong

In my original code, I attempted to filter out automated parts of edit summaries, since these do not reflect a writer's word choices. Unfortunately, my filter caught some but not all automated edit summaries. Most importantly, it missed the automatic section headings (enclosed in C-style comments /* like this */).

When I adjusted the filter to include those and re-ran the code, the results were quite different, as can be observed above. I believe these new results to be much more reflective of a true similarity measure.
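
To illustrate the effect of the missed filter, the sketch below tokenizes one edit summary with and without stripping the automatic section heading. The regular expressions, the function name, and the example summary are assumptions made for illustration and are not the code from the original scripts.

```python
import re

# Assumed pattern for automatic section headings: /* like this */
SECTION_COMMENT = re.compile(r"/\*.*?\*/")
NON_ALPHA = re.compile(r"[^A-Za-z\s]+")


def summary_tokens(summary, strip_sections=True):
    """Tokenize an edit summary, optionally removing section headings first."""
    if strip_sections:
        summary = SECTION_COMMENT.sub("", summary)
    return NON_ALPHA.sub("", summary).split()


example = "/* Naked short selling */ clarifying"
print(summary_tokens(example, strip_sections=False))
print(summary_tokens(example))
```

Without the filter, the heading words ("Naked", "short", "selling") are counted as if the editor had typed them, so editors who work on the same article sections look similar regardless of their own word choices.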

Analysis of new results

With section headings being filtered out of the edit summaries, Piperdown disappears from these lists. This shows that Piperdown's high ranking in the erroneous results was almost wholly due to having edited in the same articles as MM or SH, which caused the edit summaries to have similar terms taken from the section headings.
Samiharris also drops in Mantanmoreland's rankings, but Mantanmoreland continues to be at the top of Samiharris's rankings. This suggests that Mantanmoreland uses a higher number of distinctive terms that Samiharris does not, but that there are still distinctive terms that the two accounts do share that relatively few others do.
The new results seem to lend some credence to Mantanmoreland's argument that Samiharris is a separate individual who has adopted the same habits of phrasing that Mantanmoreland has used. This is a plausible hypothesis under these new results, but the sockpuppet hypothesis is IMO not debunked by these results. Other evidence available needs to be considered in deciding between these hypotheses.
It may be profitable to examine the terms that correlate best between Samiharris and Mantanmoreland, to see if it's plausible that Samiharris could have picked up terminology from Mantanmoreland. Are the unusual terms that they share used by Mantanmoreland where Samiharris is likely to have seen them and picked them up? Or, are the timestamps and articles corresponding to Mantanmoreland's use of those terms so distant from Samiharris's editing times and areas of interest that the mimicry hypothesis is implausible?

Mea culpa

I sincerely apologize to the arbitrators, involved parties, and interested observers for this error. I urge anyone who has based their conclusions on the erroneous results to reexamine them. I also welcome any additional scrutiny of my work, as well as inquiries into my methods if there are further doubts as to the reliability of my work. alanyst ^/talk/ 00:08, 27 February 2008 (UTC)

Comments

Alanyst, I hope you don't mind my adding this "Comments" section. The possibility of error in any one set of data is one of the reasons why many of us are trying to rely on a range of different sets of data and different types of data. Also, this data is weaker in terms of showing a link between the two accounts, although it does show similarities.
SlimVirgin had asked at Wikback about whether or not one editor might pick up edit-summary styles from another editor. My assumption is that this would diverge over time, with similarities most evident earlier on. At this point, I don't know if it would be worth your time to do this kind of research, but if you're curious (and if it isn't too difficult) you might want to try comparing earlier edits (say, the first half of the year) with later edits (the second half of the year). I have to admit, I'm not sure that a divergence would prove anything. Also, is it possible that accounts with more edits but still within your range (say, 1,900 edits) would be more likely to show up as similar than would accounts with a smaller number of edits (say, 1,001)? The accounts with more edits would have more opportunities to use similar words, and with a range of 1,000 to 2,000 the accounts with the most edits could be almost double the size of the accounts with the fewest edits. Or maybe I'm missing something. I'm asking more because I'm curious than because I think any new results would matter much at this point. Anyway, thanks for the effort. I think you've given us all some valuable information. Noroton (talk) 06:23, 28 February 2008 (UTC)