User talk:Jarry1250/Findings
Suspicious number
The total for Special Groupings is 65535 (2^16 - 1). Is this a coincidence or is there a systematic reason? OrangeDog (talk • edits) 01:29, 12 May 2009 (UTC)
- Yes, there is a reason. I will clarify that on the main page (I see what you mean), but also here: that's the number of articles in the whole sample, not a sum of the other numbers in that table. It is there to give perspective for the other numbers, which are counts within the sample rather than within the population. 65535 was picked, somewhat unwittingly, as the sample size because it's the maximum number of rows Microsoft Excel can handle without help. - Jarry1250 (t, c) 08:20, 12 May 2009 (UTC)
Preemptive disambiguation
One other detail that would be useful to know is how often an article is named "<Obscurity> (<specific>)" even when <Obscurity> does not (yet) need disambiguation.
- Method
According to your method you "discarded" the list of all names. If you still have them, sort by name and discard every title that is genuinely disambiguated, i.e. where the text outside the brackets occurs at least twice. What remains are the possibly preemptive disambiguations. They would need to be checked against all names, including those without brackets (and those ending in (disambiguation), of course).
Instead of leaving it to you, I downloaded the latest list and did it myself. I get 6437566 pages, 498694 disamb pages (not counting one "original" page per name, whether or not it has brackets), and 97122 preemptively disambiguated pages.
You may still want to do it for your snapshot as well, just so it corresponds to the rest of your data.
After downloading and extracting enwiki-latest-all-titles-in-ns0.txt, I:
- executed DOS SORT to ensure it is ordered
- then I ran this in DotLisp:
 (with-dispose (page-names (System.IO.File:OpenText "T:\\Current Downloads\\enwiki-latest-all-titles-in-ns0-sorted.txt"))
   (let (pages -1 #| ignore "field name" |#
         disamb 0 preemptive 0
         page-name nil previous-page-name nil
         truncate false same false
         previous-truncated false previous-same false)
     (while (set page-name page-names.ReadLine)
       (set truncate (and (page-name.EndsWith ")") (positive? (page-name.IndexOf "_("))))
       (when truncate
         (set page-name (page-name.Substring 0 (page-name.IndexOf "_("))))
       (set same (and previous-page-name page-name (== page-name previous-page-name)))
       (if same
         (++ disamb)
         (when (and (not previous-same) previous-truncated)
           (++ preemptive)))
       (++ pages)
       (set previous-page-name page-name previous-truncated truncate previous-same same))
     (list pages disamb preemptive)))
Mark Hurd (talk) 02:43, 14 May 2009 (UTC)
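For anyone who wants to repeat the count without DotLisp, here is a rough Python sketch of the same idea. It is not the script that produced the figures above and is not guaranteed to reproduce them. It groups titles by the text before the first "_(" using a dictionary rather than comparing adjacent lines of a sorted file, so a title such as "Mercury,_Nevada" sorting between "Mercury" and "Mercury_(planet)" does not split a group. The names base_name and count_titles are made up for illustration, and the caller is assumed to have dropped the "page_title" header line already.

```python
from collections import defaultdict


def base_name(title):
    """Split a title such as 'Mercury_(planet)' into ('Mercury', True).

    Titles without a trailing bracketed qualifier are returned unchanged
    together with False. Mirrors the EndsWith/IndexOf test in the DotLisp code.
    """
    cut = title.find("_(")
    if title.endswith(")") and cut > 0:
        return title[:cut], True
    return title, False


def count_titles(titles):
    """Return (pages, disamb, preemptive) for an iterable of page titles.

    disamb     -- titles sharing a base name, not counting one per group
    preemptive -- bracketed titles whose base name occurs nowhere else
    """
    # base name -> [number of titles, number of those with a qualifier]
    groups = defaultdict(lambda: [0, 0])
    pages = 0
    for title in titles:
        base, qualified = base_name(title)
        groups[base][0] += 1
        groups[base][1] += int(qualified)
        pages += 1
    disamb = sum(n - 1 for n, _ in groups.values() if n > 1)
    preemptive = sum(1 for n, q in groups.values() if n == 1 and q == 1)
    return pages, disamb, preemptive
```

To run it on the real file, something like count_titles(line.rstrip("\n") for line in open(path, encoding="utf-8")) should work, where path is wherever the extracted list was saved. Holding every base name in memory is the price of not needing a sorted input.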
(disambiguation)
Note that a lot of redirects to disambiguation pages were created in this time frame. Rich Farmbrough, 09:29 14 May 2009 (UTC).
- All redirects were ignored, so I don't think that would have affected anything. Unless you think it has done? I'm only human, it's possible I made a mistake somewhere. - Jarry1250 (t, c) 11:06, 14 May 2009 (UTC)