User:Alanyst/Vector space research

This is another page that documents research pertaining to Wikipedia:Requests for arbitration/Mantanmoreland. See also User:Alanyst/Edit collision research.

Aim of the research

The research aims to provide insight into the question of whether two user accounts 'MM' and 'SH' on Wikipedia are independent, or else are controlled by the same individual (sockpuppets). This research focuses on the similarity of edit summaries as a metric of independence.

Initial assumptions

If two Wikipedia accounts are controlled by the same individual, we can expect a higher degree of similarity in the edit summaries made by those accounts than for a typical pair of unrelated accounts. Even if the individual is consciously avoiding using similar language, there is still likely to be significant overlap in terms used simply because people's speech patterns are deeply ingrained.

Similarity

The notion of similarity needs to be defined. It is not enough to show that two accounts share similar terms, since unrelated accounts might also employ those terms. It is also important to show that terms that one account employs and the other does not are less of a factor than the similar terms. Also, the more commonly used the term is among the general population of editors, the less weight the term should have in the analysis.

Edit summary similarity as an information retrieval problem

To gauge similarity between different editors' edit summaries, we can view the problem in terms of information retrieval. Let A be the set (more strictly, the bag) of all terms used by editor A in A's edit summaries. We define B similarly for editor B, and so forth for all editors who have contributed edit summaries.

We can view A, B, and so forth as individual documents, much like web pages (though without HTML markup and full of vaguely Wikipedia-related gibberish). If we want to know whose edit summaries are most similar to a particular editor's (say, editor 'MM'), we treat the document MM as a search query over the set of all documents in our corpus, and find the closest match. Just as a search engine ranks the results by relevance, so we can assess the relevance of MM to each other editor's combined edit summaries.

The vector space model is a well-known technique in information retrieval that can provide this kind of analysis. Term frequencies are calculated for each document and for the entire corpus. Each term is treated as one dimension of a vector, with the frequency of that term in that document (normalized by the overall frequency of the term in the corpus) supplying the magnitude of that component of the vector. The similarity of two documents is then calculated as the cosine of the angle between the vectors that represent them. The cosine is maximized when two vectors are collinear (complete similarity) and minimized when they are orthogonal (zero similarity).
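
For reference, the cosine measure described above can be written out as follows. The symbols are illustrative notation, not taken from the original analysis: a_t and b_t stand for the weights of term t in the vectors for two editors.

```latex
\cos\theta
  = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
  = \frac{\sum_{t} a_t\, b_t}{\sqrt{\sum_{t} a_t^{2}}\;\sqrt{\sum_{t} b_t^{2}}}
```

Only terms used by both editors contribute to the numerator, while every term an editor uses contributes to that editor's norm in the denominator. This is how terms that one account employs and the other does not reduce the score.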

Methodology

I began with the set of all revisions made in 2007. Each record contained the editor name (or IP address for anonymous editors), a timestamp, and the full edit summary for the revision. The timestamp was dropped for this analysis.

The set of all 2007 revisions is too large for a vector space analysis on my hardware, so I reduced the set to all revisions made by editors who had between 1000 and 2000 edits (inclusive) during that time. These include edits made by MM and SH.
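
The reduction step can be sketched in Python as below. This is only an illustration, not the Perl scripts actually used; it assumes pipe-delimited records with the editor name as the first field, and the function names and sample records are hypothetical.

```python
from collections import Counter


def editors_in_range(records, low=1000, high=2000):
    """Count revisions per editor and keep editors with low <= count <= high."""
    # Assumption: each record looks like "editor|timestamp|edit summary".
    counts = Counter(line.split("|", 1)[0] for line in records)
    return {editor: n for editor, n in counts.items() if low <= n <= high}


def revisions_by(records, editors):
    """Keep only the records whose editor is in the given collection."""
    return [line for line in records if line.split("|", 1)[0] in editors]


# Tiny hypothetical sample, with a lowered range so that it selects something.
sample = [
    "Alice|2007-01-01T00:00:00Z|fix typo",
    "Bob|2007-01-02T00:00:00Z|rv",
    "Alice|2007-01-03T00:00:00Z|expanding",
]
keep = editors_in_range(sample, low=2, high=3)
subset = revisions_by(sample, keep)
```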

I then combined each editor's edit summaries to produce a set of tf-idf scores as follows:

// first gather raw counts of terms per user (=document) and for entire corpus
for each revision do
  if no edit summary then skip
  remove commonly seen automated edit summaries:
    * automatic section comments (/* like this */)
    * revert-tool messages ("Reverted P by Q to R by S")
    * undo-tool messages ("Undo/Undid...by User" with wiki markup)
    * Twinkle ("...using TW")
  remove HTML entities (&quot;, &gt;, etc.)
  remove all non-alphabetic and non-whitespace sequences (punctuation, digits, symbols)
  condense all whitespace sequences to single space characters
  trim whitespace from the start and end of the text
  if nothing left in edit summary then skip
  split edit summary into tokens using space character as delimiter
  // note: token case was preserved in order to capture similar styles 
  // of capitalization, so "Foo" is treated as a different token than "foo".
  increment token count for the user and for the entire corpus
end loop on each revision
// next calculate inverse document frequency for each term (see vector space model)
for each term do
  term.idf = log(number of users/number of times term appears in corpus)
end loop on each term
// finally, calculate term frequency for each term and user
// and multiply by idf to get vector of term weights for each user
for each user do
  for each term appearing in the user's edit summaries do
    user.term.weight = (number of times term appears in user's edit summaries) * term.idf
  end loop on each term
  write out the vector of term=>weight pairs for the user
end loop on each user
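
A minimal Python sketch of the pseudocode above follows. It is not the original extractIndex.pl. Several details are assumptions: only the section-comment filter is implemented (the exact patterns for the revert, undo and Twinkle messages are not given here), "alphabetic" is taken to mean ASCII letters, non-alphabetic characters are deleted rather than replaced by spaces, and "number of users" is taken to mean editors with at least one surviving token. The idf follows the formula as written above, which divides by the term's total count in the corpus rather than by the number of editors who used it, so very common terms can receive a negative weight.

```python
import math
import re
from collections import Counter, defaultdict

SECTION_COMMENT = re.compile(r"/\*.*?\*/")   # automatic section comments
HTML_ENTITY = re.compile(r"&#?\w+;")         # e.g. &quot; or &gt;
NON_ALPHA = re.compile(r"[^A-Za-z\s]+")      # assumption: ASCII letters only


def tokenize(summary):
    """Clean one edit summary and split it into case-preserving tokens."""
    text = SECTION_COMMENT.sub(" ", summary)
    text = HTML_ENTITY.sub(" ", text)
    text = NON_ALPHA.sub("", text)
    # split() with no argument collapses whitespace runs and trims both ends.
    return text.split()


def tfidf_vectors(revisions):
    """Build {editor: {term: weight}} from (editor, summary) pairs."""
    user_counts = defaultdict(Counter)
    corpus_counts = Counter()
    for editor, summary in revisions:
        tokens = tokenize(summary)
        if not tokens:
            continue  # nothing left in this edit summary
        user_counts[editor].update(tokens)
        corpus_counts.update(tokens)
    n_users = len(user_counts)
    # idf as documented: log(number of users / corpus count of the term).
    idf = {term: math.log(n_users / count) for term, count in corpus_counts.items()}
    return {
        editor: {term: n * idf[term] for term, n in counts.items()}
        for editor, counts in user_counts.items()
    }


# Hypothetical input: two editors, three revisions (one with no summary).
revisions = [
    ("A", "/* History */ rply to Foo"),
    ("B", "rply &quot;fixed&quot; typo 123"),
    ("A", ""),
]
vectors = tfidf_vectors(revisions)
```

In this toy input the token "rply" appears twice in a corpus of two editors, so its idf is log(2/2) and it carries no weight, while tokens used once each keep a positive weight.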

I calculated tf-idf vectors for all 3629 editors who had between 1000 and 2000 edits in 2007.

Finally, I calculated similarity rankings for all 3629 editors with respect to MM and SH individually. In other words, I treated MM's tf-idf vector as the query and ranked all other editors' vectors by similarity, and then did the same for SH's vector.
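
The ranking step can be sketched as follows. This is an illustrative stand-in for vsm.pl, not its actual code: vectors are assumed to be dictionaries mapping terms to weights, and the editor names and weights in the example are made up.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_editor, vectors):
    """Score every editor against the query editor, most similar last."""
    query = vectors[query_editor]
    scores = [(editor, cosine(query, vec)) for editor, vec in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1])


# Hypothetical weights for three editors.
vectors = {
    "MM": {"rply": 1.0, "SEC": 2.0},
    "SH": {"rply": 1.0, "SEC": 1.0},
    "Bot": {"robot": 5.0},
}
ranking = rank_by_similarity("MM", vectors)
```

The query editor is left in the ranking on purpose, since the result tables on this page also list the trivial self-similarity in the last row.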

Artifacts of the process

The following table details the artifacts involved in the analysis, which can be found on my server (link posted on the arbitration case's Evidence page).

File: enwiki-20080103-stub-meta-history.xml.gz
Type: Compressed data
Remarks: Original data dump downloaded from MediaWiki

File: filter2007.pl
Type: Perl script
Remarks: Extracts all 'revision' tags with timestamp within 2007 from data dump

File: revisions.2007.bz2
Type: Compressed data
Remarks: Full revision metadata for 2007 (sans article text).
Command to create: gzip -c -d enwiki-20080103-stub-meta-history.xml.gz | perl filter2007.pl | bzip2 > revisions.2007.bz2

File: condenseRevs.pl
Type: Perl script
Remarks: Extracts username or IP, timestamp, and raw edit comment from each 'revision' tag and outputs pipe-delimited record for smaller filesize and easier parsing. Does not preserve revision ID, numeric contributor ID, "minor comment" flag, or text ID.

File: smallrevs.2007.bz2
Type: Compressed data
Remarks: Username/IP, timestamp, and raw edit comment for all 2007 revisions.
Command to create: bzcat revisions.2007.bz2 | perl condenseRevs.pl | bzip2 > smallrevs.2007.bz2

File: countEdits.pl
Type: Perl script
Remarks: Counts how many revisions each editor made.

File: editorCounts.2007
Type: Uncompressed data
Remarks: Pipe-delimited records containing editor name/IP and count of edits, for all editors with edit counts in 2007 between 1000 and 2000 inclusive.
Command to create: bzcat smallrevs.2007.bz2 | perl countEdits.pl --from=1000 --to=2000 | sort --key=1 --field-separator=\| > editorCounts.2007

File: sample1K2Krevisions.pl
Type: Perl script
Remarks: Prints revisions for editors listed in an input file.

File: sample1K2Kedits.bz2
Type: Compressed data
Remarks: Subset of smallrevs.2007.bz2, limited to revisions by editors listed in editorCounts.2007.
Command to create: bzcat smallrevs.2007.bz2 | perl sample1K2Krevisions.pl -f editorCounts.2007 | bzip2 > sample1K2Kedits.bz2

File: extractIndex.pl
Type: Perl script
Remarks: Derives tf-idf weights for each editor's combined edit summaries taken from input file.

File: tfidf.1K2K.bz2
Type: Compressed data
Remarks: List of editors and, for each, every term they used in 2007 along with tf-idf weight for that term (formatted as: "term1=weight1 term2=weight2 ...").
Command to create: bzcat sample1K2Kedits.bz2 | perl extractIndex.pl | bzip2 > tfidf.1K2K.bz2

File: vsm.pl
Type: Perl script
Remarks: Calculates VSM similarity rankings for a given editor, from a file containing tf-idf scores. With -v option, accepts comma-delimited list of terms to ignore in ranking calculation.

Files: Mantanmoreland.vsm, Samiharris.vsm
Type: Uncompressed data
Remarks: VSM similarity rankings for Mantanmoreland and Samiharris, sorted by similarity rank (highest last). (Example shown is for Mantanmoreland account; similar for Samiharris.)
Command to create: bzcat tfidf.1K2K.bz2 | vsm.pl -u Mantanmoreland | sort --key=2n --field-separator=\| > Mantanmoreland.vsm

Results

Note: These results are erroneous. Please see the "Correction" section below.

The 20 lowest-similarity editors are (lowest first):

MM editor | MM weight | SH editor | SH weight
AfDBot | 0.000000 | AfDBot | 0.000000
Warpozio | 0.000001 | Warpozio | 0.000002
Uncle G's 'bot | 0.000003 | Uncle G's 'bot | 0.000003
Android Mouse Bot 4 | 0.000016 | Gerakibot | 0.000008
Tsemii | 0.000020 | NongBot | 0.000011
Puuropyssy | 0.000024 | Tsemii | 0.000022
Lissander | 0.000042 | Android Mouse Bot 4 | 0.000024
SQLBot | 0.000052 | Kauczuk | 0.000029
Mircea cs | 0.000053 | Puuropyssy | 0.000035
Gerakibot | 0.000054 | Lissander | 0.000036
SHARU(ja) | 0.000057 | Nk | 0.000036
Nk | 0.000063 | Soregashi | 0.000037
Lindum | 0.000072 | Davecrosby uk | 0.000038
Jacob.jose | 0.000079 | Jacob.jose | 0.000042
Soregashi | 0.000083 | GurchBot | 0.000046
Tiyoringo | 0.000092 | Tiyoringo | 0.000067
Kauczuk | 0.000093 | SQLBot | 0.000073
NongBot | 0.000111 | Sporti | 0.000075
TnS | 0.000122 | SHARU(ja) | 0.000079
Paul-L | 0.000124 | Tsiaojian lee | 0.000086

The 20 highest-similarity editors are (lowest first):

MM editor | MM weight | SH editor | SH weight
Jinxmchue | 0.072591 | William Pietri | 0.088757
RenamedUser2 | 0.072745 | Tango | 0.089049
Antaeus Feldspar | 0.073847 | Istanbuljohnm | 0.089700
Tirronan | 0.073930 | Littleolive oil | 0.091930
William Pietri | 0.075480 | Monkeyzpop | 0.093728
Icarus3 | 0.075492 | ObiterDicta | 0.093880
Ikanreed | 0.076567 | Madeleine Price Ball | 0.095214
Shot info | 0.078433 | Alii h | 0.096108
Monkeyzpop | 0.079736 | Davidbspalding | 0.100061
Revolving Bugbear | 0.079747 | Shot info | 0.101637
Qworty | 0.080497 | 80.229.29.19 | 0.106104
Davidbspalding | 0.081912 | Tdl1060 | 0.108981
Lisapollison | 0.083451 | Qworty | 0.110574
AniMate | 0.087351 | Ww | 0.111477
Ramsquire | 0.094263 | Lisapollison | 0.116296
80.229.29.19 | 0.099841 | Ramsquire | 0.120949
Istanbuljohnm | 0.101898 | AniMate | 0.128838
Piperdown | 0.125442 | Mantanmoreland | 0.178484
Samiharris | 0.178484 | Piperdown | 0.233706
Mantanmoreland | 1.000000 | Samiharris | 1.000000

Analysis

  • MM and SH are in each other's top two similarity rankings, disregarding the trivial self-similarity.
  • Interestingly, User:Piperdown is also in both editors' top two, and in fact is the most similar with respect to SH, over MM.
  • Piperdown's strong similarity ranking challenges the hypothesis of collusion somewhat, as it is well known that Piperdown is on the opposite side of the Overstock issue from MM and SH. However, the terms Piperdown has most strongly in common with MM and SH are mostly connected to the Overstock battle, in which SH participated more than MM did during 2007:
    • SEC
    • Forbes
    • Bloomberg
    • hedge
    • shorting
    • Byrne
    • Weiss
    • SHO
    • piperdown
    • material
    • DOB
    • RS
  • While MM and SH also share strong correlations with some Overstock-related terms, there are also some distinctive terms of habit that strongly correlate:
    • SEC
    • rply
    • expanding
    • clarifying
    • distort
    • regulatory
    • duplicative
    • NPA
    • RS
    • naked
  • The technique of stripping out non-alphabetic characters and tokenizing on whitespace means that phrases, numbers, and punctuation do not factor into the results. This means that the "as per" and " -- " tics do not influence these findings.

Variations

At User:Noroton's suggestion, I re-ran the vector space algorithm for Samiharris and Mantanmoreland with a set of words excluded: SEC, Forbes, Bloomberg, naked, shorting, Byrne, Weiss, hedge, SHO, piperdown, material, DOB, and RS. The results:

Samiharris
Editor Weight
Redrocketboy 0.093542
Ikanreed 0.094971
William Pietri 0.096567
Tango 0.097557
Istanbuljohnm 0.098271
Monkeyzpop 0.098586
ObiterDicta 0.100128
Alii h 0.104990
Shot info 0.106822
Davidbspalding 0.107650
80.229.29.19 0.116242
Tdl1060 0.119069
Qworty 0.119986
Ww 0.121966
Lisapollison 0.126238
Ramsquire 0.131007
Piperdown 0.136043
AniMate 0.141181
Mantanmoreland 0.168838
Samiharris 1.000000

Mantanmoreland
Editor Weight
Antaeus Feldspar 0.071822
Tttom 0.071906
GabrielF 0.072012
Tirronan 0.074653
Shot info 0.074855
William Pietri 0.075941
Icarus3 0.076027
Ikanreed 0.077093
Monkeyzpop 0.079326
Revolving Bugbear 0.080164
Qworty 0.080644
Davidbspalding 0.081183
Lisapollison 0.083997
AniMate 0.087079
Piperdown 0.091709
Ramsquire 0.094061
80.229.29.19 0.101006
Istanbuljohnm 0.103086
Samiharris 0.168838
Mantanmoreland 1.000000
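
The term exclusion used for these variations might look like the sketch below. How vsm.pl applies its -v list internally is not documented on this page, so dropping the terms from a vector before the cosine is computed is an assumption, and the function name and example vector are hypothetical.

```python
# The 13 excluded terms listed above.
EXCLUDED = {
    "SEC", "Forbes", "Bloomberg", "naked", "shorting", "Byrne", "Weiss",
    "hedge", "SHO", "piperdown", "material", "DOB", "RS",
}


def drop_terms(vector, excluded):
    """Return a copy of a {term: weight} vector without the excluded terms."""
    return {term: w for term, w in vector.items() if term not in excluded}


# Hypothetical vector: only the non-excluded term survives.
example = {"SEC": 2.0, "rply": 1.0, "hedge": 0.5}
filtered = drop_terms(example, EXCLUDED)
```

Because tokens are case-sensitive in this analysis, the match is exact: "SEC" is dropped, but a token such as "sec" would not be.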

Comments

(please move to discussion page if you think that's the best place)

Deleting the Topic references (financial words in this case) seems like a very good idea. It was also mentioned on the workshop page by User:Avruch. I'd like to see the Mantanmoreland results as well (if possible). This does suggest that the topic words are quite important in determining these results, but that there is more, i.e. Piperdown is still highly "correlated." I was wondering what else might be driving this, so checked out AniMate's edit summaries - it's pretty clear, he uses "verb"-ing quite a bit.

I'm not familiar with this method (yes, I'll try to check it out) but it seems good, if it's not too sensitive to a couple of things like same articles edited, and (maybe) "verb"-ing. I'm afraid I don't see the "timing correlations" as anything but proof that they edit from the same time zone - which I think we already knew. Keep up the good work. Smallbones (talk) 20:03, 21 February 2008 (UTC)

"Verbing" will only matter if they are using the same verbs repeatedly in that fashion. Both "revising" and "extending" are "verbing" forms, but they are completely different tokens for this algorithm.
As to the timing correlations, it is just one piece of the puzzle. Sort of like the Blind Men and an Elephant story, you mislead yourself if you only look for proof from individual pieces. All of the evidence needs to be looked at as a complete pattern. The timing correlation combined with the lack of interleaving edits while working in the same topic areas to me indicates two accounts that were actively working to keep their activities from looking at first glance coordinated. The punctuation similarity is independent of the method described on this page, because this algorithm treats all punctuation as whitespace. So we have multiple strong threads of evidence that in my mind form a coherent pattern - timing data as a whole, topics chosen and POV, punctuation and word choice. GRBerry 21:20, 21 February 2008 (UTC)

Okay, I've added the results for Mantanmoreland with the same topic filter as I used for the Samiharris variation above. alanyst /talk/ 05:04, 22 February 2008 (UTC)

  • Remove 13 finance-related terms and Piperdown's similarity recedes, most dramatically in the Mantanmoreland table. This would seem to indicate that subject matter is not the underlying reason for this similarity between these two user accounts but rather because they share a deeper, more personal style independent of the article subjects they edited. Another strand in the rope. Noroton (talk) 06:52, 22 February 2008 (UTC)


Impressive. A few comments:

It might be worth noting that there is a long tradition of statistical analysis for determining text authorship, cf. forensic linguistics and stylometry. (For example, this thesis has a detailed overview.) This includes the application of concepts from information retrieval, like above. I don't say this to diminish your achievement, but to stress that in such a situation, it is an entirely reasonable decision to use a careful application of statistical methods, contrary to some "with statistics you can prove anything" comments in the debate about this case (I don't mean Smallbones, who seems to have taken a more nuanced stance).

For a paper specifically about detecting sock puppets in online communities, whose conclusions could be of value here, see:

Jasmine Novak, Prabhakar Raghavan, Andrew Tomkins: Anti-Aliasing on the Web. In: Proceedings of the 13th international conference on World Wide Web (2004) online version

They used a real-life data set to evaluate the accuracy of different similarity measures. (More concretely: They looked at 100 posters on a board of the web forum of CourtTV.com, with at least 100 postings each, and split the 100 accounts artificially into 200, such that each account a had one "artificial sock puppet" a'. The accuracy of a similarity measure is then the probability with which a' is the account most similar to a in this measure among the 199 others. See Chapter 4. The number of users is smaller than in your data set, but the size of the text corpus for each user should be comparable.)

They compared tf–idf (in its variant without the log) against the Kullback–Leibler divergence and found that the KL divergence yielded a similarity measure with much better accuracy.

They also state that smoothing the distributions (weights) improved the accuracy greatly. Smoothing here means replacing the term frequency vector for one user by a linear combination of it with the overall (whole corpus) frequency vector. They say this is because the unsmoothed distribution over-emphasizes highly infrequent terms, which could be the reason for the Piperdown result (the highly infrequent terms coming from an external issue which affected several users - here: the Overstock battle -, rather than from each person's default preferred vocabulary). In other words: Smoothing might achieve automatically and less arbitrarily what has been done above by selecting and removing these terms by hand.

Quote from their conclusion:

In this paper, we have shown that matching aliases to authors [i.e. accounts to real life persons] with accuracy in excess of 90% is practically feasible in online environments.

(And they did not use time stamps and other features which have been analyzed elsewhere in this case.)

Regards, High on a tree (talk) 12:15, 23 February 2008 (UTC)
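
The smoothing and KL-divergence ideas mentioned in the comment above can be sketched roughly as follows. This is not necessarily the formulation used in the cited paper: the mixing weight lam, the function names, and the toy counts are all assumptions made for illustration, and each user's vocabulary is assumed to be contained in the corpus vocabulary.

```python
import math


def distribution(counts):
    """Turn raw term counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def smooth(user_dist, corpus_dist, lam=0.9):
    """Linear combination lam * user + (1 - lam) * corpus over the corpus vocabulary."""
    return {
        term: lam * user_dist.get(term, 0.0) + (1 - lam) * p
        for term, p in corpus_dist.items()
    }


def kl_divergence(p, q):
    """D(p || q) = sum over t of p(t) * log(p(t) / q(t)); needs q(t) > 0 where p(t) > 0."""
    return sum(pt * math.log(pt / q[term]) for term, pt in p.items() if pt > 0)


# Hypothetical counts for a three-term corpus and three editors.
corpus = distribution({"rply": 4, "revert": 4, "typo": 2})
editor_a = smooth(distribution({"rply": 3, "typo": 1}), corpus)
editor_b = smooth(distribution({"rply": 2, "typo": 2}), corpus)
editor_c = smooth(distribution({"revert": 4}), corpus)

closer = kl_divergence(editor_a, editor_b) < kl_divergence(editor_a, editor_c)
```

Unlike the cosine, a lower divergence means greater similarity, and the measure is not symmetric in its two arguments. After smoothing, every corpus term has non-zero probability for every editor, which keeps the divergence finite.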

  • I've had another idea. Would it be possible to extract the Mantanmoreland 2006 contributions and use them to query the data set? Since Mantanmoreland has more than 2K contributions in 2006, maybe just the last 2K of them? It would be interesting to see how Mantanmoreland 2006 compares to Mantanmoreland 2007 and to Samiharris. (I think we all believe that Mantanmoreland 2006 == Mantanmoreland 2007.) However, I'm not certain if the method allows this. If it doesn't allow that simple an approach, would it be worth adding Mantanmoreland 2006 to the complete data set then running with MM 2006, MM 2007, and SH as the three queries? GRBerry 22:54, 26 February 2008 (UTC)


Correction

I have detected an error in my original VSM work that affects the results given above. Those results should be considered unreliable.

I have corrected the error and computed new results, which I give below. I will also detail what went wrong.

New results

These new results show, for Mantanmoreland and Samiharris, the top 20 editors in terms of edit summary similarity, for two different datasets:

  • editors with edit counts between 1000 and 2000 in 2007
  • editors with edit counts between 500 and 3500 in 2007

(Note that the second is a superset of the first.)

Overview of results:

  • Samiharris is not in Mantanmoreland's top 20 similar editors for either dataset, contrary to the previous results. In fact, Samiharris ranks at about #98 of 3628 in the 1000-2000 dataset, and at #188 of 11377 in the 500-3500 dataset.
  • However, Mantanmoreland is #1 in Samiharris's rankings of similar editors, for both datasets.
  • Note that Piperdown no longer appears in any of these rankings. He ranks at #288 in the 1000-2000 dataset for Mantanmoreland, and #151 for Samiharris in the same dataset. In the 500-3500 dataset, he ranks at #688 for Mantanmoreland and #351 for Samiharris.

MM 1K2K

MM 1000-2000
Editor Weight
Jd2718 0.906714
Paul Pieniezny 0.906970
Nethgirb 0.907032
Sparkzilla 0.907156
Elaich 0.907357
Action potential 0.907641
Rhialto 0.908820
Jinxmchue 0.909088
Shot info 0.909232
Arjuna808 0.909299
Rainwarrior 0.910045
Davidbspalding 0.910842
Loonymonkey 0.911421
Madeleine Price Ball 0.912481
6SJ7 0.915240
Skywriter 0.916594
Ikanreed 0.917859
Jance 0.921206
Antaeus Feldspar 0.925242
Mantanmoreland 1.000000


MM 500-3500

MM 500-3500
Editor Weight
Jinxmchue 0.905089
Phil Sandifer 0.905458
Davidbspalding 0.906335
Skybunny 0.906777
Loonymonkey 0.907059
Madeleine Price Ball 0.907387
Edhubbard 0.907447
Snalwibma 0.907594
GDallimore 0.910814
Skinwalker 0.910888
Skywriter 0.911339
SheffieldSteel 0.911888
6SJ7 0.912143
Andyvphil 0.912587
Ikanreed 0.913579
Jance 0.915028
Antaeus Feldspar 0.921188
Risker 0.923119
Jmh123 0.929817
Mantanmoreland 1.000000

SH 1K2K

SH 1000-2000
Editor Weight
Jance 0.822843
Melonbarmonster 0.824340
Alii h 0.824862
Fresheneesz 0.825351
MaximvsDecimvs 0.826019
MJBurrage 0.826792
Nadirali 0.827380
Gaimhreadhan 0.829191
Maniwar 0.830085
Northmeister 0.830931
Jim Butler 0.833111
Fourdee 0.837236
W. Frank 0.848141
Skywriter 0.850681
Tonicthebrown 0.854164
Monkeyzpop 0.860232
Littleolive oil 0.864525
AniMate 0.865724
Mantanmoreland 0.880997
Samiharris 1.000000

SH 500-3500

SH 500-3500
Editor Weight
Malljaja 0.830741
Dseer 0.830809
Jmh123 0.830888
Northmeister 0.831229
NBeale 0.831244
Jim Butler 0.832238
Fourdee 0.835833
Khorshid 0.837822
Guliolopez 0.838067
Alice 0.838662
Zeraeph 0.842278
W. Frank 0.845215
Tonicthebrown 0.848386
Skywriter 0.849354
Kierant 0.857126
Monkeyzpop 0.857441
AniMate 0.859353
Littleolive oil 0.861320
Mantanmoreland 0.874837
Samiharris 1.000000

What went wrong

In my original code, I attempted to filter out automated parts of edit summaries, since these do not reflect a writer's word choices. Unfortunately, my filter caught some but not all automated edit summaries. Most importantly, it missed the automatic section headings (enclosed in C-style comments /* like this */).

When I adjusted the filter to include those and re-ran the code, the results were quite different, as can be observed above. I believe these new results to be much more reflective of a true similarity measure.
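
The effect of the missed filter can be illustrated with a small Python sketch. The regular expression and the sample summaries are hypothetical and are not taken from the original Perl code.

```python
import re

# Automatic section headings appear as C-style comments at the start of a summary.
SECTION_COMMENT = re.compile(r"/\*.*?\*/")


def strip_section_comment(summary):
    """Remove any /* section heading */ from an edit summary."""
    return SECTION_COMMENT.sub(" ", summary).strip()


def shared_terms(a, b):
    """Whitespace-separated tokens that two summaries have in common."""
    return set(a.split()) & set(b.split())


# Two hypothetical editors working in the same article section
# but writing different summaries of their own.
first = "/* Naked short selling */ rply"
second = "/* Naked short selling */ expanding"

before = shared_terms(first, second)
after = shared_terms(strip_section_comment(first), strip_section_comment(second))
```

Before filtering, the two summaries overlap on every word of the section heading; after filtering they share nothing. This is the kind of article-driven overlap that can inflate similarity between editors who merely edited the same sections.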

Analysis of new results

  • With section headings being filtered out of the edit summaries, Piperdown disappears from these lists. This shows that Piperdown's high ranking in the erroneous results was almost wholly due to having edited in the same articles as MM or SH, which caused the edit summaries to have similar terms taken from the section headings.
  • Samiharris also drops in Mantanmoreland's rankings, but Mantanmoreland continues to be at the top of Samiharris's rankings. This suggests that Mantanmoreland uses a higher number of distinctive terms that Samiharris does not, but that there are still distinctive terms that the two accounts do share that relatively few others do.
  • The new results seem to lend some credence to Mantanmoreland's argument that Samiharris is a separate individual who has adopted the same habits of phrasing that Mantanmoreland has used. This is a plausible hypothesis under these new results, but the sockpuppet hypothesis is IMO not debunked by these results. Other evidence available needs to be considered in deciding between these hypotheses.
  • It may be profitable to examine the terms that correlate best between Samiharris and Mantanmoreland, to see if it's plausible that Samiharris could have picked up terminology from Mantanmoreland. Are the unusual terms that they share used by Mantanmoreland where Samiharris is likely to have seen them and picked them up? Or, are the timestamps and articles corresponding to Mantanmoreland's use of those terms so distant from Samiharris's editing times and areas of interest that the mimicry hypothesis is implausible?

Mea culpa

I sincerely apologize to the arbitrators, involved parties, and interested observers for this error. I urge anyone who has based their conclusions on the erroneous results to reexamine them. I also welcome any additional scrutiny of my work, as well as inquiries into my methods if there are further doubts as to the reliability of my work. alanyst /talk/ 00:08, 27 February 2008 (UTC)

Comments

  • Alanyst, I hope you don't mind my adding this "Comments" section. The possibility of error in any one set of data is one of the reasons why many of us are trying to rely on a range of different sets of data and different types of data. Also, this data is weaker in terms of showing a link between the two accounts, although it does show similarities.
  • SlimVirgin had asked at Wikback about whether or not one editor might pick up edit-summary styles from another editor. My assumption is that this would diverge over time, with similarities most evident earlier on. At this point, I don't know if it would be worth your time to do this kind of research, but if you're curious (and if it isn't too difficult) you might want to try comparing earlier edits (say, the first half of the year) with later edits (the second half of the year). I have to admit, I'm not sure that a divergence would prove anything. Also, is it possible that accounts with more edits but still within your range (say, 1,900 edits) would be more likely to show up as similar than would accounts with a smaller number of edits (say, 1,001)? The accounts with more edits would have more opportunities to use similar words, and with a range of 1,000 to 2,000 the accounts with the most edits could be almost double the size of the accounts with the fewest edits. Or maybe I'm missing something. I'm asking more because I'm curious than because I think any new results would matter much at this point. Anyway, thanks for the effort. I think you've given us all some valuable information. Noroton (talk) 06:23, 28 February 2008 (UTC)