Wikipedia talk:WikiProject Sweep

Welcome to WikiProject Sweep! This collaboration aims to comprehensively review every article created in Wikipedia's early days to ensure basic conformity to modern standards. This page and its subpages aim to facilitate collaboration around this goal. If you'd like to help us, please add yourself as a participant to the project, inquire on the talk page, and see the to-do list for things to do. Thanks for helping us clean up old articles!

Transcluded WikiProject proposal

Project launch

Hi all! Tagging those who commented on the proposal: @Danre98, Techie3, GenQuest, Bilorv, Hog Farm, Remagoxer, Jh15s, Berrely, JesterSocks, Ajpolino, Pigsonthewing, Graeme Bartlett, Matt Deres, Eddie891, History DMZ, and PJvanMill: welcome! I'd appreciate help spreading the word about the launch of this project. Please also feel free to help build the project page. I laid out the basic game plan I think we should follow in the stages section; the first step is discussing here which articles to include (see below), and the second step is launching a WP:CENT-listed RfC to affirm community consensus for the sweep. Please let me know how that sounds and if you think we should make any changes. Cheers, {{u|Sdkb}}talk 03:48, 24 March 2021 (UTC)[reply]

Building the set of articles

Okay, so our first major task is building the list of articles to be swept. Per the project proposal, this will include all articles that meet two criteria: (1) created before a TBD cutoff date, and (2) not in any TBD exclusion category. I'll create subsections below for us to discuss each of those things. {{u|Sdkb}}talk 03:48, 24 March 2021 (UTC)

Cutoff date

Very roughly speaking, this is somewhere in the late 2000s or the early 2010s. One potential date that comes to mind is the establishment of comprehensive new page patrolling, which I think was launched about this time. Could someone sleuth out when specifically that was? {{u|Sdkb}}talk 03:48, 24 March 2021 (UTC)[reply]

Patrol logging started at 16:58, 16 November 2007. The first page curation log event was at 21:54, 6 September 2012. Graeme Bartlett (talk) 04:35, 24 March 2021 (UTC)
@Graeme Bartlett: What's the difference between patrolling and page curation? Is it that page curation is comprehensive (i.e. every page goes through a queue) whereas the patrolling was just ad hoc? If so, I think the curation might be the better cutoff date. I found Wikipedia:New_pages_patrol#History, which seems to indicate that NPP was at least being set up in 2011. {{u|Sdkb}}talk 04:44, 24 March 2021 (UTC)[reply]
Wikipedia:New pages patrol goes back to 2004 as a proposal. Page curation is adding the tags, with an interface to allow deletion or maintenance tagging. It was a WMF extension to help with patrolling. I don't know when pages were no-indexed till patrol though. Graeme Bartlett (talk) 07:41, 24 March 2021 (UTC)[reply]
I don't think NPP was particularly comprehensive or organised before Kudpung's reforms c. 2011. Those led to the creation of Wikipedia:Page Curation, which was deployed on 20 September 2012. So that could be a good, relatively conservative cutoff. @Kudpung: would know more though. – Joe (talk) 20:20, 24 March 2021 (UTC)[reply]

Exclusion criteria

We have a whole bunch of possibilities here. I think we should aim to include all categories that indicate that an article has passed a quality review or is otherwise unlikely to be totally a mess. Here's some ideas:

These all seem like very safe criteria. If we wanted to get a little more aggressive, we could add B-class articles (but that's a little risky, as anyone could have self-assessed) or articles with high pageviews (but this should overlap with number of editors, and I think it's better to go by that than number of readers). Something to note is that we can always loosen our standards after we get under way, but it's a lot harder to tighten them if we notice questionable pages slipping through. Also, once we agree on these standards, I think they should be somewhat fixed, e.g. if someone adds a page from 2006 to the VA5 list a month from now, it should still be swept. This is to prevent editors protecting questionable pages who get wind of the sweep from trying to evade it. Does anyone have suggestions on other criteria we might want to include, or concerns about the criteria I denoted above? {{u|Sdkb}}talk 03:48, 24 March 2021 (UTC)[reply]

If some of these have not been edited since the cutoff, then something is wrong, and they may well not meet today's standards. Though for some of these (vital, GA, FA, Britannica, AFD survival (if individual), peer review) you can expect that notability is passed, as that will have been examined properly. One issue cropping up recently is FLs that are not notable, which should probably go to AFD. So neglected old FLs should get a checkout. Disambiguation pages, set indexes, and outlines should get a checkout too. Redirects probably don't need any checking. Graeme Bartlett (talk) 04:49, 24 March 2021 (UTC)
Notability is explicitly not part of the GA criteria and I've seen at least two GAs (including one I reviewed) have their notability later questioned (IMO one was notable and the other wasn't). Graeme makes a good point about FLs. Even the AfD point... I've seen some very low-quality pages survive AfD due to lack of participation (even though it seems to me like the current rule is to treat the close like evaluating an expired PROD if there are no "keep"s).
However, given that the project is huge-scope, it's probably not worth the labour power to go through e.g. all the GAs to find maybe 100 that are borderline at AfD, when there are populations with much higher proportions of badness. I might even start with something like "stubs and unassessed pages created by an IP (back when IPs could create articles)". Or, the "cutoff of X edits by Y different users" would be a good one if we can establish what realistic and useful bounds on this are, by sampling some bad old pages. If the task gets close to completion, only then is it worth thinking about what the next category is.
I think what we'll be looking for is (a) people acting in good faith who have no idea of our rules; and (b) people acting constructively but the standards have now changed. So I wouldn't worry about anyone trying to evade exclusion criteria or anyone who B-rated their own article. I don't think it's a major factor or that there are many cases of someone trying to "hide" an article from us. — Bilorv (talk) 12:53, 24 March 2021 (UTC)[reply]
Surviving an AfD is not necessarily a good indicator of notability unless you can limit it to those that closed as 'keep' with >X participants (say, five). The others seem reasonable. Eddie891 Talk Work 14:45, 24 March 2021 (UTC)
I might suggest avoiding FA since there is presently a parallel process. Or coordinate. W/e. --Izno (talk) 15:08, 24 March 2021 (UTC)
Izno, the FA sweeps seem to be aiming to do a much more thorough job. When we talk about "basic conformity to modern standards", we're not talking the FA standards, but more cleaning out/tagging the "how on earth does that exist on Wikipedia?"-type pages. {{u|Sdkb}}talk 19:48, 24 March 2021 (UTC)
That's even better reason not to add those to the list? --Izno (talk) 20:36, 24 March 2021 (UTC)
Yes, certainly. {{u|Sdkb}}talk 20:38, 24 March 2021 (UTC)
Thinking in terms of cost/benefit, GA+FA+FL+PR+VA+EB11 combined is going to be such a small proportion of the total that I'm not sure it would be worth the extra script-writing effort to exclude them? It will also only take a couple of seconds for anyone patrolling them to see that they're fine.
Non-article articles like disambiguation pages can also hide a lot of cruft, so I think those should be included if they meet the other criteria. Again, they will be very quick to patrol so it doesn't seem like it would cost much, relatively speaking.
I would focus on the criteria to exclude articles that have been actively edited since the cutoff; that's the only one that looks like it will exclude a significant number of pages, and it will probably catch most of the others anyway. – Joe (talk) 20:44, 24 March 2021 (UTC)[reply]
One thing I notice for older articles is that they can have a lot of minor gnome edits over the years (e.g. MOS compliance, spelling fixes, category renames, tagging, template restructuring) without any textual change or any check of whether the topic is suitable. So many useless articles will have been edited over the years. It will be the very basic ones with no templates that remain untouched. It would be good if we could find articles that have only minor edits after the cutoff date. This will be more work than just checking for no editing after the cutoff date. Graeme Bartlett (talk) 23:23, 24 March 2021 (UTC)
I think we're going to run into one type of article that could cause lots of trouble - old one-liner place stubs. For instance, last year, a group of editors including myself discovered that a single editor had created about 3,000 mostly non-notable one-liners about places in California (2009-2011). The same user also created about 5500 one-liners for uninhabited locations in Iran, and looking through the first page of Special:AncientPages, that's mainly short stubs and SIAs for places in Russia. Those could be quite tricky to deal with - I don't know that we can trust mass-creation of geography stubs, and I'm aware of at least 4 situations where the mass-creation was of very low quality. Hog Farm Talk 03:24, 25 March 2021 (UTC)[reply]
Broadly agreed—this is the thing I was thinking of in my comment at the original project proposal. WP:PLACEOUTCOMES says "Cities and villages anywhere in the world are generally kept, regardless of size or length of existence, as long as that existence can be verified through a reliable source", so I'll assume we're talking just about more minor geographical locations than that (like the uninhabited locations). I would support mass redirection of such content to the next-largest notable place, unless notability (say, three sources in something that's human-compiled, rather than from a database dump) is clearly established. Part of the reason for this is exactly what you say, that the mass creations have been found to be of very low quality in the past. This should be true as a rule, because if someone is creating thousands of sub-stubs on minor geographical locations then there's simply no way they could be doing this in a high-quality manner, due to the faultiness of underlying data sources. — Bilorv (talk) 09:20, 25 March 2021 (UTC)[reply]

June 2021 follow-up

So, following up on all this, I think to move things forward, we're going to need someone with the relevant technical expertise to create potential lists for us so that we can see how much changing various criteria expands or contracts the counts. This is more complicated than what PetScan can handle, so I'll put an invite at WP:Request a query and see if that draws anyone. {{u|Sdkb}}talk 21:46, 4 June 2021 (UTC)[reply]

@Sdkb There's a lot to process above. What lists precisely are we looking for? – SD0001 (talk) 09:06, 5 June 2021 (UTC)[reply]
@SD0001: It seems like 20 September 2012 may be our cutoff date, and that "had non-minor edits by more than some number of editors" may be our most significant exclusion criterion. So to start, it'd be helpful to have some figures/examples about how many/what sort of articles would be in our set if we include all articles created before that date and that have non-minor edits by fewer than X editors for various values of X. Does that help? {{u|Sdkb}}talk 02:49, 6 June 2021 (UTC)
@Sdkb I created Wikipedia:WikiProject Sweep/Pre-2004 - not really what you asked but this alone took 2.5 hours to execute. Doing it for 2012 instead of 2004 will certainly timeout. See also WP:RAQ#Help wanted compiling list for WikiProject Sweep. – SD0001 (talk) 16:21, 7 June 2021 (UTC)[reply]
Thanks SD0001, that's helpful to see the pre-2004 list with <10 edits. Perhaps we could extrapolate out how many articles would likely be in the set total if we chose X=10. I think we still need more info before we'd be able to make any concrete decision, though; hopefully someone else at RAQ may be able to help further. {{u|Sdkb}}talk 17:19, 7 June 2021 (UTC)[reply]
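
To make the criteria in this thread concrete, here is a minimal sketch of how the candidate set could be counted for several values of X. It is not the query that was actually run: the record layout, the made-up sample data, and the names `Article`, `Edit`, and `sweep_candidates` are all hypothetical, and a real run would have to go against the database replicas rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import date

# Cutoff suggested in the thread above (Page Curation deployment date).
CUTOFF = date(2012, 9, 20)


@dataclass
class Edit:
    editor: str
    minor: bool


@dataclass
class Article:
    title: str
    created: date
    edits: list


def distinct_major_editors(article):
    """Count distinct editors with at least one non-minor edit."""
    return len({e.editor for e in article.edits if not e.minor})


def sweep_candidates(articles, max_editors, cutoff=CUTOFF):
    """Articles created before the cutoff that have non-minor edits
    by fewer than max_editors distinct editors."""
    return [
        a for a in articles
        if a.created < cutoff and distinct_major_editors(a) < max_editors
    ]


# Made-up sample data, for illustration only.
SAMPLE = [
    Article("Old stub", date(2005, 3, 1),
            [Edit("A", False), Edit("B", True), Edit("C", True)]),
    Article("Busy article", date(2006, 7, 4),
            [Edit("A", False), Edit("B", False), Edit("C", False)]),
    Article("Recent article", date(2015, 1, 1), [Edit("A", False)]),
]

# Show how the set grows as X is loosened.
for x in (2, 3, 5):
    print(x, [a.title for a in sweep_candidates(SAMPLE, x)])
```

Counting only non-minor edits is one way to address the gnome-edit concern raised earlier, though it relies on editors having flagged minor edits accurately.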

Level of scrutiny desired

Following up from the comment Bilorv made at the proposal, one thing we should figure out early is the level of desired scrutiny we're going to ask reviewers to pay to these pages. If we go too strict, we're never going to get through the queue, and if we go too loose, this project won't do much good. My suggestion is to use roughly the standard for NPP, or maybe something slightly weaker. Basically, determine whether the article is notable and whether it has glaring problems that need tagging (e.g. NPOV, copyright violations, lack of references) but don't worry about fussier things like making sure it's fully categorized, tagged with all the relevant WikiProjects, compliant with the manual of style, etc. Does that sound about right? {{u|Sdkb}}talk 04:03, 24 March 2021 (UTC)

Well that won't improve things much for the pages we retain. So how about we do some things to assist, e.g. bolding the title, adding categories, and running AWB against it to fix some common errors. If citations are structured then use citation bot, and if bare urls then refill. Graeme Bartlett (talk) 04:35, 24 March 2021 (UTC)
If we can build patrolling tools that make that stuff really easy, then sure. But if we're going to get through several million pages, we can't really set our sights too high. Keep in mind that for most pages, if they're at all important, they've likely already had those things done. {{u|Sdkb}}talk 04:48, 24 March 2021 (UTC)[reply]
If we are doing several million pages, then the job will never finish. But for 10,000s of pages it is feasible. 100,000s of pages puts it in the same league as Wikipedia:Typo Team/moss, which is multiyear. For AWB you can give it a list to deal with, and for me it takes about 10 seconds per page, including a check that the right edit happens. For citation bot you can dump in a category, and let it work away by itself. Graeme Bartlett (talk) 04:56, 24 March 2021 (UTC)
I think we want to keep it as quick as possible. If individual reviewers want to also run AWB and categorise or whatever else is worthwhile, they should be encouraged to do so, but the minimum standard should just be to determine whether the article meets the standard we are assessing it for. As for that standard, I'm in several minds. "GNG or TNT" could be one idea—we move on immediately from every article that's notable and has some encyclopedic content. That leaves us with pages to CSD, PROD, AFD, or immediately redirect or merge, on the grounds of notability or that all of the content of the page is a copyvio or a blatant advert. — Bilorv (talk) 12:59, 24 March 2021 (UTC)[reply]
I think a basic check in regards to notability and whether the article is at least somewhat up to scratch, followed by perhaps some minor changes (nothing major content wise but things like Graeme suggested and perhaps grammatical cleanup) and maybe a class assessment, is the most we can realistically do with the number of pages. Remagoxer (talk) 16:20, 6 April 2021 (UTC)
I agree with Bilorv, if it's notable and reasonable, move on. Eddie891 Talk Work 14:45, 24 March 2021 (UTC)

No new templates

Consider modifying/using Template:Article history rather than introducing a new Template:Swept. Izno (talk) 15:55, 24 March 2021 (UTC)

Izno, I like that suggestion. The last thing I want to do is to contribute to banner bloat. Adding to article history for articles that use it would be a good option. Many of these pages will be very low-traffic, though, and might not be using {{Article history}}. An alternative would be to use talk page messages, which would allow the sweeper to leave a brief comment if they want, and would get naturally archived after time if the page is active. {{u|Sdkb}}talk 19:52, 24 March 2021 (UTC)[reply]
I was wondering whether the best procedure might be to tag all the articles (at the top, by bot) with something like {{Old article, needs review}}. As they're presumably very low-watched pages, it's not a major disruption; if you wake up and find 200 pages on your watchlist with this tag then that's good, because presumably you've been monitoring the pages or know that they are in reasonable shape, and can just remove the tags to save us the time. Otherwise you might be the best-placed person to review them. The other advantage is that I think I might stumble across these sort of pages naturally from time to time, and it would be good for me to see by banner at the top of the article that it's on the review list, rather than me having to navigate through a bot dump of 10,000 pages beginning with "A" and searching for the article that I just found because I was trying to do some other small task there. This approach does have its possible drawbacks, though. — Bilorv (talk) 20:01, 24 March 2021 (UTC)[reply]
Bilorv, that's another interesting idea! The main drawback I see is that, if we track the pages by a maintenance tag, that prevents us from having any sort of restriction on who can do the sweeping. Letting people choose the pages they review and letting anyone do it seems like it could be a bad combination for COI. {{u|Sdkb}}talk 20:32, 24 March 2021 (UTC)[reply]
Yep, this is the main drawback I had considered too. Two variants on the idea I thought were: (a) some user script you can install that highlights the article name in red or puts a line at the top saying "Unreviewed in the 2021 sweep" (problem: someone has to write it. Not sure how long writing a script takes or whether it's worth the effort); (b) you can only remove the tag by going to a Sweep subpage which is under ECP and removing the article from the "Unreviewed list", and if you try to remove it manually, a bot will just re-add it (problem: will cause major confusion to newbies and a lot of edit warring and people coming to noticeboards confused about "how do i remove this tag on my article?"; then again, that's almost a way of baiting COI people into drawing volunteer attention to bad articles that need to go). — Bilorv (talk) 20:41, 24 March 2021 (UTC)[reply]
I think that articles should probably have article history regardless. I know that SV was bemoaning its general lack of use by the major scripts/bots of late. --Izno (talk) 20:35, 24 March 2021 (UTC)[reply]
I think a good general principle would be that pages shouldn't be edited until something substantial is done to them, i.e. when they're "swept". Don't underestimate how annoyed people get by watchlist clutter, or how many old pages are lingering in old editors' watchlists! That excludes using tracking categories or a preemptive cleanup template. What you could do is maintain a central, bot-maintained list of pages to be swept, but have humans do everything through a user script. So they could click a button to patrol the next article on the list, or see an indication that a page hasn't been swept if they come across it another way. The script then updates {{article history}}/adds {{swept}} to the talk page, and a bot updates the central list accordingly. And you can restrict who has access to the script. – Joe (talk) 21:09, 24 March 2021 (UTC)[reply]
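
The last workflow described above (a central list of pages, a restricted set of sweepers, and a "next article" action) can be sketched roughly as follows. This is only an in-memory illustration under stated assumptions: the class `SweepQueue` and its methods are hypothetical names, and nothing here touches the wiki, {{article history}}, or any real bot or user-script API.

```python
class SweepQueue:
    """Hypothetical in-memory stand-in for a central, bot-maintained list."""

    def __init__(self, titles, allowed_sweepers):
        self.pending = list(titles)            # pages still to be swept, in order
        self.swept = {}                        # title -> (sweeper, note)
        self.allowed = set(allowed_sweepers)   # who may record a sweep

    def next_page(self):
        """Return the next unswept title, or None when nothing is left."""
        return self.pending[0] if self.pending else None

    def is_swept(self, title):
        """Let a script show whether a page someone stumbled on is done."""
        return title in self.swept

    def mark_swept(self, title, sweeper, note=""):
        """Record a sweep; only approved sweepers may do this."""
        if sweeper not in self.allowed:
            raise PermissionError(f"{sweeper} is not an approved sweeper")
        if title not in self.pending:
            raise KeyError(f"{title} is not awaiting a sweep")
        self.pending.remove(title)
        self.swept[title] = (sweeper, note)


# Example use with placeholder page and user names.
queue = SweepQueue(["Page A", "Page B"], allowed_sweepers=["Example sweeper"])
queue.mark_swept(queue.next_page(), "Example sweeper", "notable, no glaring problems")
print(queue.next_page(), queue.is_swept("Page A"))
```

Keeping the list separate from the articles themselves is what avoids the watchlist clutter mentioned above: an article is only edited at the moment it is actually swept.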

Proposed flow

I think we need a flow chart or at least a flow-list of how an article is to be addressed, to be sure we're all on the same page. GenQuest "scribble" 00:00, 25 March 2021 (UTC)

It depends on the standard reached. If it is the same as NPP, File:NPP flowchart.svg can be used. Otherwise it's relatively easy to make a good SVG flow chart in a program such as Flow.io. — Berrely • TalkContribs 07:42, 25 March 2021 (UTC)

Discord

Is anyone here (aside from Izno) in the Discord server? Remagoxer (talk) 19:24, 25 March 2021 (UTC)

Update: I'm aware that Sdkb and Berrely are in the server. Remagoxer (talk) 19:58, 25 March 2021 (UTC)
@Remagoxer: I am in the server. Danre98(talk^contribs) 22:51, 1 July 2021 (UTC)

Part-offwiki methods for larger numbers of articles?

I realise that the number of old articles to check seems terrifyingly high, and so you're trying to cut down on the numbers. I'm not totally sure that this won't remove the wrong articles from the sweeping process (somebody above suggested removing EB1911 articles, most of which are fine notability wise but many of them fail NPOV and other core policies), and would suggest to think about how to do the sweeping as efficiently as possible instead.
Basically, what I'm imagining is having a Toolserver-based tool that sweepers have access to that presents you with a page created before day X, and you then have three buttons: accept and remove from sweep, skip to let others decide, and edit on Wikipedia (with or without removing from sweep). You could accept articles in a few seconds like that and get through the unproblematic 2.5 million of the 3 million articles in something like 3000-4000 people hours, which isn't actually terrible. Storing the data about what has been swept offwiki saves a couple million edits and avoids craziness on watchlists. (You could also have more buttons for semiautomatic addition of suitable cleanup tags).

Things like this (cleanup organised via offwiki tools) have been done before (and may still be happening): we used to have an offwiki tool to resolve interwikis that would present you with two pages in different languages and make you click on "same topic"/"different topic"/"don't know". —Kusma (talk) 09:12, 30 June 2021 (UTC)[reply]
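
As a rough check on the figures in this section, the sketch below converts between person-hours and seconds per article. It only restates numbers already given on this page (2.5 million unproblematic articles in 3000 to 4000 hours, and the roughly 10 seconds per page quoted for AWB earlier); the function names are made up for illustration.

```python
def seconds_per_article(articles, person_hours):
    """Average time available per article for a given effort budget."""
    return person_hours * 3600 / articles


def person_hours_needed(articles, seconds_each):
    """Total effort if every article takes seconds_each seconds."""
    return articles * seconds_each / 3600


UNPROBLEMATIC = 2_500_000  # figure quoted in the comment above

for hours in (3000, 4000):
    print(hours, "hours ->",
          round(seconds_per_article(UNPROBLEMATIC, hours), 2), "s per article")

# The same set at about 10 seconds per page.
print(round(person_hours_needed(UNPROBLEMATIC, 10)), "hours at 10 s per page")
```

So the 3000 to 4000 hour estimate assumes roughly four to six seconds per accepted article, which is faster than the 10 seconds per page reported for semi-automated AWB edits.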

revival attempt

@Sdkb Hello, I've been independently working on something similar to this project's goals, partly based on a personal metric of "link density", but also based on roughly the first million page IDs, mostly from this query over on Quarry. I'm sure it could be improved upon, as I'm far from a coding expert (I haven't yet figured out how to exclude list articles, for example), but it has proved fruitful: I've been able to successfully PROD almost a dozen articles created in 2006 or earlier. I'm mostly coming in here to get guidance from a more experienced user, as I've been getting some pushback on my methods of improving these (as seen on my user talk page), and to see whether my "link density" metric and the related query would be useful to this project. I think the 2010 cutoff discussed above is too big, at least as a starting point, given that we had about 3.2 million articles created by that point, whereas my query produces just under a tenth of that and covers a period that is much less of a "transitional period" in Wikipedia's practices. Akaibu (talk) 06:39, 1 November 2024 (UTC)