User talk:Full-date unlinking bot/Archive 1

Chronological articles

Is 1983 in sports a chronological article? I don't see a consensus either way, in the most recent flawed RfC determining the guidelines, nor in the discussion leading here. — Arthur Rubin (talk) 14:25, 9 July 2009 (UTC)

Yes, it is a chronological article as it can be seen as a split off from 1983 which is definitely chronological. --Apoc2400 (talk) 17:01, 9 July 2009 (UTC)

Notifications

At which noticeboards should we post notification of this RfC? Personally, I think we should pull out all the stops, including a watchlist notice. Dabomb87 (talk) 14:39, 9 July 2009 (UTC)

I suggest Wikipedia:WikiProject Chronology and Wikipedia:WikiProject Time. Editors involved in these projects may be familiar with unusual dates and the articles in which they appear. --Jc3s5h (talk) 15:38, 9 July 2009 (UTC)
That sounds like a good idea, but I was referring to general noticeboards (e.g. Village Pump, WP:AN, WP:ANI, WT:FAC, etc.) Dabomb87 (talk) 15:55, 9 July 2009 (UTC)

Thanks

Thank you harej for taking on this task. I have never made a Wikipedia bot but I know some programming so I will be happy to take a look at the code when there is some. Which language are you using? --Apoc2400 (talk) 16:23, 9 July 2009 (UTC)

I'd second that. This is a very worthwhile task, and one that is sure to have a lot of pitfalls attached. I would also be happy to assist in writing or reviewing the code. Happymelon 16:26, 9 July 2009 (UTC)
Like User:RFC bot it will be written in PHP. —harej (talk) (cool!) 22:19, 9 July 2009 (UTC)

Some comments

  • The coding can start at any time. We don't have to wait for the exclusion list to be completed.
  • Should there be separate sections for collecting the intrinsically chronological articles and for list of other articles that should still be excluded?

--Apoc2400 (talk) 17:07, 9 July 2009 (UTC)

    • I would rather start the coding knowing exactly what is supposed to be processed and what isn't supposed to be processed. The list of exceptions can be one list — no need to differentiate between intrinsically chronological and not intrinsically chronological but should be excluded anyway. —harej (talk) (cool!) 22:22, 9 July 2009 (UTC)

Timings

One week to jump through probably two, three or even four rounds of trialling at the BRFA-level seems more than a little ambitious. But we'll see :) . - Jarry1250 [ humourousdiscuss ] 17:56, 9 July 2009 (UTC)

The timeline will be updated when appropriate. —harej (talk) (cool!) 22:23, 9 July 2009 (UTC)

Date punctuation variants

A few weeks back there was a short discussion of the possibility of recognizing some improperly formatted dates. Is this still being considered? In the case of actual broken dates such as "[[June]] [[29]]" and "[[June 29]]-[[30]]", I think it is probably best to leave any cleanup to a separate manual or AWB-assisted process. However, the date autoformatting process currently recognizes, accepts, and corrects a number of cases where the punctuation between the day-month and year parts is not quite standard.

In general, date formatting allows any number of spaces (zero or more) and at most one comma (anywhere within those spaces), regardless of whether the date is in dmy or mdy form. Whatever punctuation is present is replaced by one space for dmy dates, or a comma plus one space for mdy dates. Specifically, "[[July 9]] [[2009]]" and "[[9 July]], [[2009]]" are output as "[[July 9]], [[2009]]" and "[[9 July]] [[2009]]" respectively (unless overridden by date preferences). A couple months back, I counted over 100,000 pages with "[[month day]] [[year]]" (no comma) forms and over 35,000 pages with "[[day month]], [[year]]" (unexpected comma) forms. Thousands of pages also have other variants that are recognized and corrected by date autoformatting such as no space or comma, comma but no space, or any of several other non-standard combinations of comma and spaces. All of these are presently corrected by the date autoformatting function.

I would recommend including these in the full-date unlinking bot. It should be a simple matter to specify " *,? *" or more properly " *(?:, *)?" as the separator string in the regular expressions that search for both dmy and mdy dates. As an alternative, this could be done in a separate process, preferably one that precedes the actual unlinking, but that would involve an extra 150,000 edits that could be avoided if we included it here. -- Tcncv (talk) 03:44, 10 July 2009 (UTC)

I am not sure where you are getting at. Are you saying the bot should be correcting grammatically incorrect dates that the autoformatting has transparently fixed, or that the bot should know to look out for grammatically incorrect dates? —harej (talk) (cool!) 06:50, 10 July 2009 (UTC)
I'm suggesting that in addition to
  • [[January 15]], [[2005]] → January 15, 2005
  • [[27 May]] [[2007]] → 27 May 2007
consider recognizing and processing additional forms such as
  • [[January 15]][[2005]] → January 15, 2005
  • [[January 15]] [[2005]] → January 15, 2005
  • [[January 15]],[[2005]] → January 15, 2005
  • [[January 15]] ,[[2005]] → January 15, 2005
  • [[January 15]]<<Any other combination of spaces and zero or one comma>>[[2005]] → January 15, 2005
  • [[27 May]][[2007]] → 27 May 2007
  • [[27 May]], [[2007]] → 27 May 2007
  • [[27 May]],[[2007]] → 27 May 2007
  • [[27 May]] ,[[2007]] → 27 May 2007
  • [[27 May]]<<Any other combination of spaces and zero or one comma>>[[2007]] → 27 May 2007
Note that all of the above "poorly formed dates" display correctly with date autoformatting enabled. The bot would remove the links, but would also normalize the punctuation. -- Tcncv (talk) 14:02, 10 July 2009 (UTC)
If you would prefer to keep the bot as simple as possible and only handle the normal cases, I would be willing to take on the task of normalizing date punctuation as a separate process from the date unlinking bot. -- Tcncv (talk) 14:30, 10 July 2009 (UTC)
I agree that the bot ought to perform any fixes that the autoformatting software is currently performing for readers who are not logged in or have selected "no preference" for date format. --Jc3s5h (talk) 14:47, 10 July 2009 (UTC)
Yes, this is why the proposal has the point "7. The bot will also fix uncontroversial errors such as missing commas and spaces." The fixes Tcncv listed are all simple and uncontroversial. --Apoc2400 (talk) 17:37, 10 July 2009 (UTC)
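The proposal in this thread (unlink the date and normalize the punctuation in one pass) could be sketched roughly as follows. This is an illustration in Python rather than the bot's PHP; the function name unlink_full_dates and the constants are hypothetical, the separator is the " *(?:, *)?" pattern suggested above, and BC years and the File/exclusion checks are deliberately left out.

```python
import re

# Month names and common abbreviations (assumed set, for illustration only).
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
SEP = r" *(?:, *)?"          # any number of spaces with at most one comma
YEAR = r"\[\[(\d{1,4})\]\]"  # linked year, captured without the brackets

MDY = re.compile(r"\[\[(" + MONTH + r")[ _](\d{1,2})\]\]" + SEP + YEAR)
DMY = re.compile(r"\[\[(\d{1,2})[ _](" + MONTH + r")\]\]" + SEP + YEAR)

def unlink_full_dates(text):
    """Remove the links and rewrite the separator in the standard form."""
    text = MDY.sub(r"\1 \2, \3", text)  # mdy: comma plus one space
    text = DMY.sub(r"\1 \2 \3", text)   # dmy: one space, no comma
    return text
```

Broken forms such as "[[June]] [[29]]" do not match either pattern and are left untouched, which matches the suggestion to handle them in a separate supervised process.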

Would this regex work: /\[{2}(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|June?|July?|Aug(ust)?|Sept?(ember)?|Oct(ober?)|Nov(ember)|Dec(ember)?)(\s?\d{1,2}|\]{2}\s?\[{2})\d{1,2})\],?\s?\[{2}\d{1-4}\]{2}/i ? This would pick up dates such as [[January 15]], [[2001]], [[January 15]][[2001]], [[Jan]] [[15]] [[2001]], etc. This would just be for American dates, and it accounts for bizarre comma usage or lack thereof. —harej (talk) (cool!) 23:10, 10 July 2009 (UTC)

Close. A few corrections and suggestions:
  1. "Oct(ober)?" (moved "?")
  2. "Nov(ember)?" (added "?")
  3. Suggest changing "\s?" to "[ _]" (allow either space or underscore within links)
  4. Change "\]," to "\]{2}," (added quantifier after "]")
  5. Suggest replacing ",?\s?" with "\s*(,\s*)?" (more general punctuation case)
  6. Insert "([ _]BC)?" after "\d{1-4}"
  7. Throughout, "(" can be replaced with "(?:" for any group you do not wish to capture.
  8. I'm not sure it is necessary to process the "[[Jan]] [[15]] [[2001]]" forms, as these are not working date links and do not autoformat.
I have been using the following regex expressions (and variants) in my earlier investigations. Perhaps you can adapt these to your needs:
  • mdy - \[\[(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[ _](\d{1,2})\]\](?: *(?:, *)?)\[\[(\d{1,4}(?:[ _]BC)?)\]\]
  • dmy - \[\[(\d{1,2})[ _](Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\]\](?: *(?:, *)?)\[\[(\d{1,4}(?:[ _]BC)?)\]\]
which I believe cover most or all of the mdy and dmy date autoformatting recognized cases. Note that the above only matches true spaces, but "\s" can be substituted for generality. It will capture the day, month, and year parts separately, while the rest of the groups are non-capturing. It requires either a single space or a "_" between the day and month parts and before the BC (if present). The " *(, *)?" punctuation pattern can be replaced with the simple " " (dmy) or ", " (mdy) to get only normal punctuation, or with the negative-lookahead qualified "(?! \[) *(, *)?" (dmy) or "(?!, \[) *(, *)?" (mdy) to get only the abnormal punctuation cases. I made a couple of last minute changes to the above. I hope I got them right. They seem to check out in AWB Regex Tester. (If you use these, I suggest copying the source from the edit window.) -- Tcncv (talk) 03:32, 11 July 2009 (UTC)
To make sure I'm doing it right, could you re-post those regexes with the adjustments necessary to account for unusual grammar? —harej (talk) (cool!) 01:02, 12 July 2009 (UTC)
I will assume that to implement edit summary codes, you will likely need separate regular expressions for different cases, so I have split the cases out separately below.
  • Normal punctuation
    • mdy - \[\[(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[\s_](\d{1,2})\]\],\s\[\[(\d{1,4}(?:[\s_]BC)?)\]\]
    • dmy - \[\[(\d{1,2})[\s_](Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\]\]\s\[\[(\d{1,4}(?:[\s_]BC)?)\]\]
  • Minor punctuation variations (fixed by autoformatting)
    • mdy - \[\[(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[\s_](\d{1,2})\]\](?!,\s\[)(?:\s*(?:,\s*)?)\[\[(\d{1,4}(?:[\s_]BC)?)\]\]
    • dmy - \[\[(\d{1,2})[\s_](Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\]\](?!\s\[)(?:\s*(?:,\s*)?)\[\[(\d{1,4}(?:[\s_]BC)?)\]\]
  • Broken [[month]] [[day]] [[year]] and [[day]] [[month]] [[year]] forms (not fixed by autoformatting)
    • mdy - \[\[(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\]\]\s*\[\[(\d{1,2})\]\](?:\s*(?:,\s*)?)\[\[(\d{1,4}(?:[\s_]BC)?)\]\]
    • dmy - \[\[(\d{1,2})\]\]\s*\[\[(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\]\](?:\s*(?:,\s*)?)\[\[(\d{1,4}(?:[\s_]BC)?)\]\]
The more I think about it, the more the last pair of cases bothers me, with the possibility of false hits or some unanticipated side effect. These might be better handled by a separate AWB-assisted process that identifies and fixes these (without removing links) under close supervision. The results could then be reprocessed as normal cases by the date unlinking bot (removing links this time). E.g., a supervised AWB operation could change "[[Jan]] [[15]] [[2001]]" to "[[Jan 15]], [[2001]]", and the date unlinking bot would later remove the links leaving "Jan 15, 2001".
If you like, I can work on some expressions for the ymd, iso-8601-like dates, and date range cases. -- Tcncv (talk) 03:37, 12 July 2009 (UTC)
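To show how the negative lookahead separates the "normal" and "minor punctuation variation" mdy cases listed above, here is a small Python sketch. The names NORMAL_MDY, VARIANT_MDY and classify_mdy are made up for this example; the patterns follow the mdy expressions above with the capture groups dropped.

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
HEAD = r"\[\[" + MONTH + r"[\s_]\d{1,2}\]\]"   # [[Month day]]
TAIL = r"\[\[\d{1,4}(?:[\s_]BC)?\]\]"          # [[year]] or [[year BC]]

# Normal punctuation: exactly a comma and one whitespace character.
NORMAL_MDY = re.compile(HEAD + r",\s" + TAIL)
# Variants: any spaces with at most one comma, but the lookahead rejects
# the normal ", [" separator so the two patterns never overlap.
VARIANT_MDY = re.compile(HEAD + r"(?!,\s\[)\s*(?:,\s*)?" + TAIL)

def classify_mdy(snippet):
    """Return 'normal', 'variant', or None for a snippet of wikitext."""
    if NORMAL_MDY.search(snippet):
        return "normal"
    if VARIANT_MDY.search(snippet):
        return "variant"
    return None
```

Keeping the two cases disjoint is what would let each one carry its own edit summary code.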
That would be great! Of course, you will get credit as one of the bot coders. —harej (talk) (cool!) 19:17, 12 July 2009 (UTC)
Will do. I will prepare some expressions based on the current specifications and will monitor for changes. It will likely be a few days before I can put them out for review. Should we move such implementation details to a separate page so as not to clutter the talk page? The expressions will of course continue to be open for general review and comment. -- Tcncv (talk) 15:06, 13 July 2009 (UTC)
The regex review will be a part of the code review. —harej (talk) (cool!) 16:53, 13 July 2009 (UTC)
Perhaps not relevant, but see User:Andrwsc/bad year links. Dabomb87 (talk) 23:37, 14 July 2009 (UTC)
Thank you. All such input is welcome, even if it doesn't make it into the bot. I think our intent here is to focus on the common, correctly-formed dates plus some forms with minor punctuation issues. Cases like those you referenced definitely need fixing, but I think they are best left to manual or AWB-assisted correction. I looked at several, and at least some have already been fixed.
I am currently scanning the database to identify as many distinct date and date range forms as I can, including both the good and the bad. The good and unambiguous forms will likely be included in the bot operations (subject to review). I will attempt to identify all pages containing other date forms and publish lists or manual fixup. In the case of days linked as if they were two digit years, I will try to narrow the cases by looking for adjacent month or month and year references. This should reduce the number of false hits – excluding one or two digit years references that actually are one or two digit years. -- Tcncv (talk) 01:41, 16 July 2009 (UTC)
Why not just take the regexes straight out of MediaWiki? The languages are the same, and the intention here is to handle the same dates as MW handles, right? The code is a bit messy, but it's all there, and it all works. It's called in Parser.php as DateFormatter::getInstance()->reformat( <dateformat>, <page_text> ); You might be able to use it almost 'as is'; or certainly the regex engine. (also)Happymelon 15:28, 20 July 2009 (UTC)
Originally I did not want to, believing that I would have to go through the millions of lines that comprise MediaWiki. But now that you have explicitly pointed out where it is, I will consider it. —harej (talk) (cool!) 17:04, 20 July 2009 (UTC)
Thanks for the info HM. The code confirms most of what I'd reverse engineered (the hard way) and brings to light a few things I didn't know, such as negative ISO-like dates like "[[-0012-05-06]]" (rendered as "-0012-05-06" and linking to 13 BC) and the exclusion of cases like "[[April 1]]st" from autoformatting altogether. I don't think we can use DateFormatter directly, because our goals and selection criteria are different, but some of the logic might be adapted for the bot.
Harej - I am still running database dump scans (finding many variations of date range notations). If you prefer, when I'm done, I can prepare regular expressions in a building-block fashion similar to what is shown in the DateFormatter code. I think this will be easier to maintain, and more understandable to those reviewing bot code. -- Tcncv (talk) 03:08, 21 July 2009 (UTC)

Edit summary codes

Accounting for grammatical errors should not be a problem. Considering all the sub-tasks the bot will be doing, I was thinking of making an edit summary code system for each type of edit the bot makes. —harej (talk) (cool!) 19:27, 10 July 2009 (UTC)

I think that's a fantastic idea. That should preemptively reduce complaints if editors can look up the codes to know exactly what the bot intended to do. —Ost (talk) 19:39, 10 July 2009 (UTC)
I second that. An excellent idea. It will also help during the testing phase, making it easy to verify test case coverage and proper operation of the different tasks. As for the punctuation issues, I'd suggest copying the "uncontroversial errors" item from proposal page to the bot description page. -- Tcncv (talk) 21:10, 10 July 2009 (UTC)
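One possible shape for such a code system, sketched in Python. The codes (MDY, DMY, PUNC), their descriptions, and the summary wording are all invented for illustration; nothing here reflects codes the bot actually uses.

```python
# Hypothetical edit summary codes, one per type of edit.
SUMMARY_CODES = {
    "MDY": "unlinked month-day-year date",
    "DMY": "unlinked day-month-year date",
    "PUNC": "normalized date punctuation",
}

def build_summary(codes):
    """Build one edit summary from the codes of all edits made to a page."""
    unknown = set(codes) - set(SUMMARY_CODES)
    if unknown:
        raise ValueError("unknown codes: " + ", ".join(sorted(unknown)))
    # De-duplicate and sort so the same set of edits gives the same summary.
    return "Unlinking full dates. Codes: " + ", ".join(sorted(set(codes)))
```

During testing, the set of codes seen across trial edits also gives a quick check of which cases have been exercised.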

Different ways of writing the date

I will have to write a parser for each type, so let's go through each possibility to make sure I cover everything. There is:

  • Month-date-year (February 15, 1992)
  • Date-month-year (15 February 1992)
  • ISO (1992-02-15)

What other forms are there? Also, do people link to dates such as [[02-15-1992]]; that is, a non-standard numbers-only format? —harej (talk) (cool!) 23:33, 10 July 2009 (UTC)

I've seen usage of the ambiguous format (DD/MM/YYYY or MM/DD/YYYY) Dabomb87 (talk) 23:39, 10 July 2009 (UTC)
We're SOL for that one except when it's obvious (15/02/1992 is obviously the day-first format). Those will have to be sorted by humans. —harej (talk) (cool!) 01:38, 11 July 2009 (UTC)
I don't think those dates involve links, so they are outside the scope of this bot. -- Tcncv (talk) 04:10, 12 July 2009 (UTC)
I have seen some instances when links are used, a very messy result. In any case though, they are not autoformatted and are therefore, as you said, outside the bot's scope. Dabomb87 (talk) 04:14, 12 July 2009 (UTC)
Date autoformatting recognizes both the "[[yyyy-mm-dd]]" and "[[yyyy]]-[[mm-dd]]" forms. It also recognizes "[[yyyy]] [[month dd]]" and the very unusual "[[yyyy]] [[dd month]]", including punctuation variations like the mdy and dmy forms. (Example: "1986 March 4" in Halley's Comet and several in List of Roman battles.) I'd suggest caution with these, since a list of mdy dates might appear (to a regex) to have oddly punctuated ymd dates in them. It might be best to only recognize such forms if they are neither preceded nor followed by other date components. This can be done with some extra negative look-behind and look-ahead regex constructs. -- Tcncv (talk) 04:10, 12 July 2009 (UTC)
To expand, date autoformatting maps both "[[2009-07-04]]" and "[[2009]]-[[07-04]]" as if they were coded as "[[2009]]-[[July 4|07-04]]". And to answer the question above, forms like [[02-15-1992]] are not supported and do not generate meaningful links, so I do not think we need process these forms. -- Tcncv (talk) 04:41, 12 July 2009 (UTC)
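A rough Python sketch of recognizing the two ISO-like link forms mentioned here. It assumes the bot simply removes the brackets and leaves the digits as written, which is only one of the options under discussion; the names ISO_WHOLE, ISO_SPLIT and unlink_iso are hypothetical, and the range check is a crude sanity test, not calendar validation.

```python
import re

ISO_WHOLE = re.compile(r"\[\[(\d{4})-(\d{2})-(\d{2})\]\]")         # [[yyyy-mm-dd]]
ISO_SPLIT = re.compile(r"\[\[(\d{4})\]\]-\[\[(\d{2})-(\d{2})\]\]")  # [[yyyy]]-[[mm-dd]]

def _plain(match):
    year, month, day = match.groups()
    # Leave implausible values linked so a human can look at them.
    if 1 <= int(month) <= 12 and 1 <= int(day) <= 31:
        return f"{year}-{month}-{day}"
    return match.group(0)

def unlink_iso(text):
    text = ISO_WHOLE.sub(_plain, text)
    return ISO_SPLIT.sub(_plain, text)
```

Unsupported forms such as [[02-15-1992]] do not match either pattern and stay as they are.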
  • I have also seen some real abominations in use, for example "[[July 4|4 July]] [[2009]]", "[[July 4|4th of July]] [[2009]]", "[[July 4|7/4]]/[[2009|09]]". These are not particularly widespread, so are perhaps not of great concern. I bring these up anyway for purposes of discussion. Ohconfucius (talk) 02:50, 7 August 2009 (UTC)

Writing of dates that are not standard in English

It is obvious that DMY dates will be preserved as DMY, and MDY dates will be preserved as MDY. But what about YDM or YMD? Those two are not standard in written English, so I was thinking of having them converted into DMY just to keep some kind of consistency. People can fix them as necessary. —harej (talk) (cool!) 17:37, 13 July 2009 (UTC)

YMD at least is a reasonably common arrangement c.f. access dates, some templates; I don't think you ought to change these. Best not YDM either really. - Jarry1250 [ humourousdiscuss ] 17:59, 13 July 2009 (UTC)
I presume you mean dates such as [[2009]] [[July 13]] or [[2009]] [[13 July]]. Since there is a good chance of guessing wrong about the format to convert them to, I would leave them alone. This would preserve some context so that whoever corrects them will have some hints about the writing style of whichever editor put them there in the first place. Such hints can be helpful if the text is confusing. --Jc3s5h (talk) 18:04, 13 July 2009 (UTC)
So I should just leave them as they are, except unlinked? —harej (talk) (cool!) 03:28, 14 July 2009 (UTC)
I think the answer is yes. As for unlinking, I think it is safe to unlink the "[[2009]] [[July 13]]" case, since it displays as "2009 July 13" with no punctuation changes. As for the unusual year-day-month case, I would be cautious. "[[2009]] [[13 July]]" displays as "2009, 13 July" (note comma added by date autoformatting). We could add the comma, but it might be best to leave this case for manual review. I expect there are relatively few such cases out there. Similarly, date autoformatting will remove the comma from "[[2009]], [[July 13]]", displaying it as "2009 July 13". The bot could perform a similar edit as a non-controversial punctuation change. However, we need to guard against accidentally picking up such dates in lists such as "[[July 13]] [[2009]], [[July 13]] [[2009]]", where the comma separates two month-day-year dates, and is not touched by date autoformatting. I have also seen articles such as List of Roman battles, that contain a list of dates, all starting with the year, intentionally followed by a comma, and then followed by the day-month or month-day link. I think the comma should remain in these cases.
I would recommend not processing any year/day/month dates, and only processing year/month/day cases when there are no adjacent date components that might indicate that this is part of a list of dates. The remaining cases can be reviewed and corrected manually. -- Tcncv (talk) 21:36, 14 July 2009 (UTC)
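The "no adjacent date components" guard could look something like the following Python sketch. It only handles the plain "[[year]] [[month day]]" form with a single space, treats any neighbouring link as a possible date component (a deliberately conservative assumption), and uses the hypothetical names YMD and unlink_isolated_ymd. Python requires fixed-width look-behinds, hence the three separate ones.

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")

YMD = re.compile(
    # Not directly preceded by another link (with or without separator).
    r"(?<!\]\])(?<!\]\] )(?<!\]\], )"
    r"\[\[(\d{1,4})\]\] \[\[(" + MONTH + r" \d{1,2})\]\]"
    # Not followed by another link, e.g. the year of a following mdy date.
    r"(?!\s*,?\s*\[\[)"
)

def unlink_isolated_ymd(text):
    """Unlink year-month-day dates only when they stand alone."""
    return YMD.sub(r"\1 \2", text)
```

Anything the guard rejects is simply left linked, so it can be collected for manual review instead.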

← The bot could leave alone dates in which it's not obvious what to do, but add the article title and the "suspect" date to a list which humans can consult and decide what to do with. --A. di M. – 2009 Great Wikipedia Dramaout 12:20, 17 July 2009 (UTC)


In the dark old days when we just assumed that everyone was logged in with their preferences set the linked iso format was generally expected to turn into normal English prose. So in keeping with this original intent we should not do this
  • [[1989-11-05]] → 1989-11-05
but either one of these
  • [[1989-11-05]] → 5 November 1989
  • [[1989-11-05]] → November 5, 1989
It's then a matter of deciding which but this would usually be a simple matter of keeping consistent with the article ... if the article is consistent, which is another question. JIMp talk·cont 21:08, 31 July 2009 (UTC)

Notification for adding to exclusion list

I drafted a notice here. Does it look good? Dabomb87 (talk) 16:33, 15 July 2009 (UTC)

Where would this notice go? —harej (talk) (cool!) 18:56, 15 July 2009 (UTC)
General noticeboards, such as the Village Pumps, MOS pages, and other high-visibility pages (WT:FAC, WT:GAN, etc.) Dabomb87 (talk) 22:19, 15 July 2009 (UTC)
That seems fine to me. NW (Talk) 22:25, 15 July 2009 (UTC)
In addition, this is an RfC, and I posted notice of this on Template:Cent. I won't hold out hopes for a watchlist notice, as we've had multiple such messages posted there already WRT date linking. Dabomb87 (talk) 22:30, 15 July 2009 (UTC)
I think it might be wise to structure the exception list so it has a clear consistent form that can be later converted to a form readable by the bot. I would suggest a list of well-defined general cases, such as day, year, decade, topic in year, and other common chronological article groups, plus a list of specific protected articles. It might also be wise to have a separate area submitting and (if necessary) discussing proposed exceptions, stating that proposals should be specific, justified, and signed. Properly justified submissions would normally be accepted and added to the master accepted list. Others could be discussed further, and may be later either accepted by consensus (or significant minority support), or rejected as not satisfying the terms of the exclusion list clause of the proposal. Maybe something like:
Exception list
General exceptions
Specific exceptions
Proposed exceptions
Please add proposed exceptions below in the form: "linked article title - relevant date (or date part) - justification".
  • Sample - 1980 - This event had a strong influence on many other notable events that same year. -- (signed)
Accepted -- (signed)
  • Example - 1968 - Many related events. -- (signed)
Accepted -- (signed)
Please state your justification. -- (signed)
Because it's my birthday. -- (signed)
Declined Does not satisfy approved RFC criteria for exclusion list entries. -- (signed)
This does not seem like a good enough reason. Need to discuss further. -- (signed)
As an alternative, we could allow editors to add exceptions directly to the main list, with the possibility that they might be pulled and moved to separate "discussion" or "rejected" sections if contested. This more closely follows WP:AGF and would reduce list maintenance to only handling contested items. Something like:
Exception list
General exceptions
Specific exceptions
Please add proposed exceptions below in the form: "linked article title - relevant date (or date part) - justification".
  • Sample - 1980 - This event had a strong influence on many other notable events that same year. -- (signed)
  • Example - 1968 - Many related events. -- (signed)
Under Discussion
This does not seem like a good enough reason. Need to discuss further. -- (signed)
Rejected
Please state your justification. -- (signed)
Because it's my birthday. -- (signed)
Declined Does not satisfy approved RFC criteria for exclusion list entries. -- (signed)
In either case I think we need a well structured list, or the additions may become unmanageable. -- Tcncv (talk) 03:13, 16 July 2009 (UTC)
Agree, and I like the second example better. We might consider making the exclusion list a separate page. Dabomb87 (talk) 03:38, 16 July 2009 (UTC)
I like the separate page idea. And I also prefer the second example above over the first. -- Tcncv (talk) 04:09, 16 July 2009 (UTC)
If you can set up the page, I can send out notifications when you're done. Dabomb87 (talk) 04:13, 16 July 2009 (UTC)

Of note, the bot will not be interpreting this list; at one point, the list will be closed off to submission and the recommendations that make it through will be made a part of the bot code. —harej (talk) (cool!) 16:36, 16 July 2009 (UTC)

Consistency

Often there is a mix of formats in an article. Sometimes you get all three. It would be good if we could impose some consistency, if not by the bot then by a person (after the bot makes a list of problematic articles). JIMp talk·cont 21:14, 31 July 2009 (UTC)

Actually, some kind of consistency enforcement could be coded in. It would pick up what the majority of the dates (between DMY and MDY) are, and change the non-majority ones to fit with the rest. If neither is used, then it would go with an arbitrary one (probably DMY). This whole consistency enforcement scheme would have its own edit summary code so that people would be able to get full explanations. —harej (talk) (cool!) 21:28, 31 July 2009 (UTC)
A bot can't tell which format is most appropriate, nor can it tell if there is true inconsistency; an otherwise consistent article may contain dates within a quotation in a different format. Quotations are difficult for a bot to detect. --Jc3s5h (talk) 21:28, 31 July 2009 (UTC)
A bot cannot determine which is most appropriate, but it can do a reasonable guess based on what format appears the most number of times. In any case, this would enforce consistency throughout the article. Anything considered to be a quote should be ignored by the bot, and if the bot makes a mistake, the mistakes could easily be fixed. —harej (talk) (cool!) 21:38, 31 July 2009 (UTC)
Changing the format of dates is completely outside the scope of the bot as initially proposed. If you want to do that, start the RfC over again. --Jc3s5h (talk) 21:51, 31 July 2009 (UTC)
Agree. If editors who set their preferences (a minuscule percentage of our readership) complain about seeing inconsistent date formats, it's not too big a deal; IPs have always seen the mixed formatting. Dabomb87 (talk) 21:54, 31 July 2009 (UTC)

Clarification: I don't think this is a bad idea at all. However, as Jc3s5h points out, it would be unethical to tell the community that we're doing one thing, and then without warning or approval slip in another task. We can always start another RfC and make this the next task for the bot on a later date. Dabomb87 (talk) 22:02, 31 July 2009 (UTC)

Okay, so it is decided then that the bot will not do anything to address inconsistency directly, except maybe make a note of them. —harej (talk) (cool!) 22:17, 31 July 2009 (UTC)

Solitary years and day–months

Why not to delink solitary years and solitary day–months? JIMp talk·cont 21:14, 31 July 2009 (UTC)

The RFC stipulated that only full dates get unlinked. That is just the specific scope of the bot; I am not sure why the other aspects are not covered. —harej (talk) (cool!) 21:22, 31 July 2009 (UTC)
Because of the history of date-delinking, especially concerning bots, we (the proposer and others who are concerned with the issue) have decided to take a conservative initial approach to automated delinking. If the community indicates support for further date delinking, we can always get the bot to run again. Dabomb87 (talk) 21:44, 31 July 2009 (UTC)

On schedule?

According to the timeline, bot coding will begin in four days. Is this date correct, or will it be pushed back? Dabomb87 (talk) 20:51, 5 August 2009 (UTC)

It will probably begin sooner than that, considering that the RFC has slowed down. —harej (talk) (cool!) 01:30, 6 August 2009 (UTC)
Great! Dabomb87 (talk) 02:04, 6 August 2009 (UTC)

puzzled

the criteria for delinking still include "Using the Category:Years by topic tree for comparison against the article's listed categories, the bot will suggest (e.g.) [[1983 in sports|1983]] to replace [[1983]] where the article already has an existing Category:Sports". but as discussed earlier, this is NOT what the bot should recommend - it should recommend (also see 1983 in sports). outside of tables, aliasing "year-in-X" links as "plain year" links is not generally a good idea. this was brought up during the RfC and Harej indicated that the point would be taken into account. Sssoul (talk) 05:55, 13 August 2009 (UTC)

Actually, I thought that was explicitly rejected for this bot (in either form) by consensus, as being out of scope. — Arthur Rubin (talk) 14:09, 13 August 2009 (UTC)
Third that — I thought it was agreed that one thing the bot absolutely should not be doing is creating new links. Mlaffs (talk) 14:26, 13 August 2009 (UTC)
It was pretty much introduced in two forms, with the above form not generating discussion and the bottom form effectively defeated in discussion. Therefore, should de-linking proceed as planned on those pages? Or just the opposite? —harej (talk) 16:49, 13 August 2009 (UTC)
I'd say that delinking of "year-in-x" links should not happen, as already outlined in exception 2. I'd also say that suggestion of new links of that type should not be undertaken, contrary to what's laid out in bullet three of the notes section. Mlaffs (talk) 18:24, 13 August 2009 (UTC)
Noted. For the sake of a limited scope, I will remove it from the list of duties. —harej (talk) 18:54, 13 August 2009 (UTC)

(outdent) just for the record: the idea was not to have the bot remove any year-in-X links OR for it to create any. the idea was for the bot to recommend using a year-in-X link when possible. if it's not going to do any recommending, so be it - but if it were going to, it shouldn't recommend aliased links. Sssoul (talk) 21:05, 13 August 2009 (UTC)

Code now available

User:Full-date unlinking bot/code. Bear in mind that I am not that great of a programmer, and that at this point, the code is not perfect. This is open-source software, and I encourage that you make changes to the code where appropriate. @harej 10:13, 23 August 2009 (UTC)

Similarly, I have filed the BRFA. @harej 10:21, 23 August 2009 (UTC)
I don't (immediately) see support for excluding decade articles, such as List of decades (may not be necessary, yet, to exclude), 210s, 209–200 BC, 210s BC, 1500–1509, 2000s (decade), 210s BC, and possibly 1990s in sport, etc. I could easily be wrong, as I'm not a PHP programmer. — Arthur Rubin (talk) 18:43, 24 August 2009 (UTC)

Bots template

I have skimmed the code, and don't see any mention of respecting the {{Bots}} template. --Jc3s5h (talk) 14:17, 23 August 2009 (UTC)

It will be added; it just has not been added yet. @harej 18:01, 23 August 2009 (UTC)
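For reference, a simplified check for the {{Bots}} and {{Nobots}} exclusion templates might look like this in Python. It is a sketch of the common allow/deny conventions only (it ignores other parameters such as optout), and the function name allowed_to_edit is hypothetical.

```python
import re

def allowed_to_edit(text, bot_name):
    """Return False if the page's wikitext opts this bot out."""
    # {{nobots}} bans every bot.
    if re.search(r"\{\{\s*nobots\s*\}\}", text, re.IGNORECASE):
        return False
    m = re.search(r"\{\{\s*bots\s*\|\s*(allow|deny)\s*=\s*([^}]*)\}\}",
                  text, re.IGNORECASE)
    if not m:
        return True  # no exclusion template present
    mode = m.group(1).lower()
    names = {n.strip().lower() for n in m.group(2).split(",")}
    listed = bot_name.lower() in names or "all" in names
    # allow=...: only listed bots may edit; deny=...: listed bots may not.
    return listed if mode == "allow" else not listed
```

The check would run on the fetched wikitext before any date processing, so an excluded page is skipped without an edit.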

Code comments

  • In checktoprocess(): Add the File: namespace next to Image: in the regex.
  • Does it have to match specific words like architecture|art|aviation|comics|film? Couldn't it match any article that contains a year, century or millennium?
  • Does it have to add comments to processed articles? It adds clutter that would have to be removed later. Is it possible to keep a list of processed articles locally on the machine where the bot runs?
  • Good idea to check the whatlinkshere for dates. I wouldn't have thought of that. I think you need to add a space to whatlinkshere($months[$i] . $d) though.

--Apoc2400 (talk) 17:30, 23 August 2009 (UTC)

(a) Thank you, I forgot about the File namespace. (b) It does not have to match "in architecture" etc. Note that after the huge parenthetical section there is a question mark: ( in (architecture| ... |television)? (c) The point of the comment is to prevent the bot from editing the same article again so that we don't have an edit war situation (though I have established that bots by definition can't edit war). A centralized list would get very resource intensive. Imagine if the bot is on Page #100,000. It would have to check the list to make sure that Page #100,000 is not Page #1 through Page #99,999. As the list gets bigger, it would take longer and longer to match up the article currently being processed with the list. The comment in my opinion should not be too bothersome, as it would be stuck at the end of the article and no one would see it except in the edit window. (d) I have now done that. Thank you. @harej 18:00, 23 August 2009 (UTC)
About marking processed articles: There are more efficient ways than searching through the list every time. One is to use a database. The bot has to communicate with the Wikipedia database anyway (through the API) so contacting a local db too shouldn't make it much slower. There are other ways too. I don't know PHP, but Perl has table variables implemented with an efficient hashmap. You can write e.g. $processedarticles{$articlename}=1 and then if not $processedarticles{$articlename}... You would still have to write a list to file and reload that file into the hashmap if the bot is halted and restarted. --Apoc2400 (talk) 21:20, 23 August 2009 (UTC)
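
The hashmap-plus-file idea can be sketched as follows. This is Python rather than PHP or Perl, only to illustrate the approach; the helper names load_processed and mark_processed and the file name are hypothetical.

```python
import os
import tempfile

def load_processed(path):
    """Reload the titles saved by earlier runs into a set (hash-based lookup)."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}

def mark_processed(path, processed, title):
    """Remember a title in memory and append it to the file for the next restart."""
    processed.add(title)
    with open(path, "a", encoding="utf-8") as f:
        f.write(title + "\n")

# Usage with a throwaway file; the title is only an example.
path = os.path.join(tempfile.mkdtemp(), "processed.txt")
processed = load_processed(path)
if "History of Gabon" not in processed:  # lookup cost does not grow with list size
    mark_processed(path, processed, "History of Gabon")
```

The membership test is a hash lookup, so checking Page #100,000 costs about the same as checking Page #1.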

Replacement expressions

I see you've got a good start on the bot code. One thing I see in the code that probably needs changing is the replacement strings. The current code uses the PHP date function to format the replacement date string, but that will produce dates with full month names and no leading zeros on days, even if the original date had an abbreviated month and/or a leading zero on the day. Date autoformatting preserves the original format of each date component, so I think we should do the same.

The date match expressions are currently set up to capture matched date components in addition to the overall matched string. Each open parenthesis not immediately followed by "?:" defines a capture group. If $brReg[0][$z] is the entire matched string, $brReg[1][$z] is the first matched date component (day for DMY, month for MDY), $brReg[2][$z] is the second matched date component (month for DMY, day for MDY), and $brReg[3][$z] is the third matched date component (year). (Note that the year may also contain a BC suffix)

I suggest building replacement dates using the matched components as follows:

  • $unlinked = $brReg[1][$z] . ' ' . $brReg[2][$z] . ' ' . $brReg[3][$z]; // British: "day month year"
  • $unlinked = $brReg[1][$z] . ' ' . $brReg[2][$z] . ', ' . $brReg[3][$z]; // American: "month day, year"

It may also be advantageous to use preg_replace to combine the preg_match_all and str_replace operations. For example:

$count = 0;
// preg_replace fills $count (fifth argument) with the number of replacements made
$contents = preg_replace('...(match expression here)...', '$1 $2, $3', $contents, -1, $count);
if ($count > 0) {
	$editsummary .= "AMreg, ";
}

I think I have the syntax correct. I am not experienced in PHP, so the above could use independent review. -- Tcncv (talk) 19:36, 23 August 2009 (UTC)
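
To illustrate the idea of rebuilding the date from the captured text instead of reformatting it, here is a minimal sketch in Python (not the bot's PHP). The pattern names and the function unlink_full_dates are hypothetical, and the patterns are far simpler than what the bot would need.

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
DAY = r"\d{1,2}"
YEAR = r"\d{1,4}(?: BC)?"

# American: [[month day]], [[year]]   British: [[day month]] [[year]]
AM_LINKED = re.compile(r"\[\[(%s) (%s)\]\], \[\[(%s)\]\]" % (MONTH, DAY, YEAR))
BR_LINKED = re.compile(r"\[\[(%s) (%s)\]\] \[\[(%s)\]\]" % (DAY, MONTH, YEAR))

def unlink_full_dates(text):
    """Strip the links but keep each captured component exactly as written."""
    text, am_count = AM_LINKED.subn(r"\1 \2, \3", text)
    text, br_count = BR_LINKED.subn(r"\1 \2 \3", text)
    return text, am_count + br_count

result, count = unlink_full_dates(
    "Born [[Dec 05]], [[1901]] and died [[7 August]] [[44 BC]].")
print(result)  # Born Dec 05, 1901 and died 7 August 44 BC.
```

The abbreviated month, the leading zero and the BC suffix all survive because nothing is ever converted to a date value.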

I find the use of the PHP date function and strtotime function alarming. I am not a PHP coder. When I tried to look up these functions using a Google search, the hits that stood out from the crowd were at http://www.php.net. The descriptions of the functions there did not give any statement about the acceptable range of the input or output, which is alarming in and of itself. There was a statement for the strtotime function that assumptions would be made about two-digit years, that these would be mapped to years in the 20th and 21st century. For Wikipedia articles, this is wrong; 23 August 19 is a date 1,990 years ago, not 10 years in the future.
It has been a while since I have coded detailed regular expressions. Perhaps the regular expressions are such that certain year ranges will be ignored. Any such exclusions should be added to the exclusion criteria on User: Full-date unlinking bot where people who cannot read PHP can understand them. --Jc3s5h (talk) 21:07, 23 August 2009 (UTC)
Upon further investigation, I see that strtotime produces a timestamp, which is an integer. The maximum absolute value that can be contained by an integer is platform dependent, but is often 31 bits, or about two billion. Since Unix time is a count of seconds from January 1, 1970 UT, this corresponds to an acceptable date range of about 1902 to 2038. This is obviously unacceptable. --Jc3s5h (talk) 21:25, 23 August 2009 (UTC)
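
The arithmetic behind that range can be checked with a short Python calculation. It assumes the timestamp is a signed 32-bit count of seconds from 1970; that is an assumption about the platform, not something stated in the PHP documentation quoted above.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # Unix time zero (UT)

# A signed 32-bit integer holds values from -2**31 to 2**31 - 1.
earliest = EPOCH - timedelta(seconds=2**31)
latest = EPOCH + timedelta(seconds=2**31 - 1)

print(earliest)  # 1901-12-13 20:45:52
print(latest)    # 2038-01-19 03:14:07
```

Anything outside that window, which covers most of the dates in history articles, cannot be represented under this assumption.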
Implementing the above-proposed changes to the replace logic would eliminate the use of the strtotime() and date() functions in favor of plain text substitutions (no date conversion needed), so the issues you identify would be resolved. -- Tcncv (talk) 00:38, 24 August 2009 (UTC)
Absolutely agree. The very act of converting to a date-type loses information (principally the original format of the month) and is unnecessary. The point of this bot run is to remove links, not to impose a format on the dates. You can extract the three parts of the date as text and strip the links (even if they are somewhat malformed by extraneous or missing whitespace or punctuation); it should be only necessary to then re-assemble those into {d m y} or {m d, y} by simple concatenation. --RexxS (talk) 01:00, 24 August 2009 (UTC)
Tcncv, you're welcome to implement your requested change to the logic, since you probably would know how to implement it best. @harej 01:58, 24 August 2009 (UTC)
Done. I implemented preg_replace and also restructured the regular expressions so that they are built up from more readable components. I also switched back to using a true space rather than "\s", since tabs, newlines, or more exotic whitespace characters should almost never be present in the source. I do not have a test environment to test the code, so it needs close review. I did test the regex's though using AWB regex tester.
I'll add ISO-like dates and date ranges in the next day or two. -- Tcncv (talk) 03:45, 24 August 2009 (UTC)
You're awesome. @harej 03:48, 24 August 2009 (UTC)

As Ohconfucious pointed out earlier, some linked dates use piped link syntax, such that the displayed text is different from the link target. What level of support should we include for recognizing and delinking these dates? More specifically, consider the following:

  • [[December 31|31 December]] [[2009]]
  • [[December 31|Dec 31st]], [[2009]]
  • [[Dec 31|12/31]]/[[2009]]
  • [[Dec 31|31/12]]/[[2009|09]]
  • [[April 12|Easter]], [[2009]]
  • [[April 12|arbitrary]] [[2009|text]]

The above are all full linked dates, and at least some of them are trivially different from the plain date formats. Recognizing and processing the above would not be difficult. The links can be identified using a variation of the existing logic by allowing arbitrary piped text on a valid date link. (Limiting the piped text to something that reasonably matches the target would be more work, but is doable.) The date links would simply be replaced with the piped link text, when present, or with the target link text, for parts that do not have piped link text. Punctuation would be left unchanged. The result would display as before, but without links.

Putting the obvious style violations aside, the question is: should some or all of these cases fall within the scope of the date delinking bot? -- Tcncv (talk) 04:22, 24 August 2009 (UTC)
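
A rough sketch of the replacement rule described above (piped text when present, otherwise the link target, punctuation untouched). It is Python for illustration only; the names FULL and delink_piped are hypothetical, and the piped text is deliberately left unrestricted.

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
MD = r"(?:%s \d{1,2}|\d{1,2} %s)" % (MONTH, MONTH)  # valid month-day link target
TEXT = r"[^\[\]|]+"                                  # arbitrary piped text

# [[month-day target|optional text]] <punctuation> [[year|optional text]]
FULL = re.compile(
    r"\[\[(%s)(?:\|(%s))?\]\]([ ,/]{1,2})\[\[(\d{1,4})(?:\|(%s))?\]\]"
    % (MD, TEXT, TEXT))

def delink_piped(text):
    def repl(m):
        day_part = m.group(2) or m.group(1)   # piped text wins over the target
        year_part = m.group(5) or m.group(4)
        return day_part + m.group(3) + year_part
    return FULL.sub(repl, text)

print(delink_piped("[[December 31|Dec 31st]], [[2009]]"))  # Dec 31st, 2009
```

Note that this sketch would also turn the Easter example into plain "Easter, 2009", which is the kind of case the scope question is about.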

I would say it wouldn't fall within the charter of this bot, only of a partial date delinking bot. — Arthur Rubin (talk) 18:36, 24 August 2009 (UTC)

Date ranges and lists

Per the bot description:

"...the bot will unlink month-day items that are clearly adjacent to and in combination with a triple—i.e., in date ranges and slashed dates. Examples are:
  • [[October 17]] – [[November 8]], [[1987]] → October 17 – November 8, 1987
  • [[23 April|23]]/[[24 April]] [[1966]] → 23/24 April 1966".

I've been scanning the dumps and found quite a variety of characters and words that are used to denote date ranges. The more common ones include:

  • Words: to, and, or, until, through, till
  • Characters: - (hyphen), – (en dash), — (em dash), / (slash), & (ampersand)
  • HTML escapes: "&ndash;", "&mdash;"
  • Templates: "{{ndash}}"
  • Rare: + (plus), − (minus), × (times), x (letter x), ; (semicolon), thru, til, into, en (?), oder (?)

I have also encountered a number of lists or list/range combinations containing three or more dates such as:

  • [[August 3]], [[August 5]], [[August 9]], [[1644]]
  • [[May 15]] – [[May 25]]; [[July 19]], [[2002]]
  • [[March 3]]-[[March 4|4]], [[April 7]] [[1964]]

For simple date ranges and date lists, I can define some generalized match and replace logic that allows for most of the punctuation variants identified above. Any of the month-day parts may optionally have piped link text containing only the day number. The goal would be to remove the date links, but preserve the displayed text and punctuation without change. Some dates, date ranges, and date lists contain special codes such as "&nbsp;" or "<br/>". I am looking into recognizing and preserving these in dates, date ranges, and lists.

Some complicated cases exist that would be very difficult to generalize and codify, such as:

  • [[July 27]] (O.S.)/[[August 9]], [[1888]]
  • Saturday, [[February 9]]-Sunday, [[February 10]] [[2008]]
  • [[8 December|8]]-c. [[10 December]] [[1941]]
  • [[June 28]] [[1751]] (NS: [[July 9]])–[[January 2]] [[1809]] (NS: [[January 14]])

It is likely that these complicated cases would not be recognized as a date range or list, so only the right-most dates would be delinked.

Comments? -- Tcncv (talk) 05:32, 24 August 2009 (UTC)

PS, does anyone know the meaning of "en", "oder", or "×" (times) in the context of date ranges?
en is Dutch for 'and'; oder is German for 'or'. Probably the articles containing them were hastily translated from one of those languages. Colonies Chris (talk) 18:22, 13 September 2009 (UTC)
Thanks. After looking at the context of the articles (Serious Request and List of people of Heilbronn) and with a little help from Babel Fish, I came to the same conclusion. It turns out they were isolated occurrences, which I have fixed. -- Tcncv (talk) 20:32, 13 September 2009 (UTC)
Support for date ranges and lists has now been added to the bot code. The expressions are fairly involved and attempt to cover a variety of punctuation and connector words between adjacent dates. This was built based on the results of extensive database dump scans looking for the more common cases (and some less common ones). Date lists in general consist of up to ten linked day-month, month-day, and piped-day parts followed by a linked year. Dates in the list may be separated by spaces, commas, hashes, slashes, some other punctuation, or any of a short list of joining words ("to", "through", "and", "or", etc.). I believe the allowed syntax satisfies the "month-day items that are clearly adjacent to and in combination with a triple" criteria for the bot. Review and comment is requested. -- Tcncv (talk) 05:40, 26 August 2009 (UTC)
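
For readers who do not read PHP, a much-reduced sketch of the general approach, in Python. It is not the bot's actual expression: the names PART, CONNECT, RUN and delink_ranges are hypothetical, and only a subset of the connectors listed earlier is covered.

```python
import re

MONTH = ("(?:January|February|March|April|May|June|July|August"
         "|September|October|November|December)")
MD = r"(?:%s \d{1,2}|\d{1,2} %s)" % (MONTH, MONTH)
PART = r"\[\[%s(?:\|\d{1,2})?\]\]" % MD  # month-day link, optional piped day number
CONNECT = (r" *(?:&ndash;|&mdash;|-|–|—|/|&|,|;"
           r"|\bto\b|\band\b|\bor\b|\buntil\b|\bthrough\b|\btill\b) *")

# Up to ten month-day parts joined by connectors, then a linked year.
RUN = re.compile(r"(?:%s%s){0,9}%s,? \[\[\d{1,4}\]\]" % (PART, CONNECT, PART))
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")

def delink_ranges(text):
    """Remove links inside a recognized run; displayed text and punctuation stay."""
    def strip_links(run):
        return LINK.sub(lambda l: l.group(2) or l.group(1), run.group(0))
    return RUN.sub(strip_links, text)

print(delink_ranges("[[23 April|23]]/[[24 April]] [[1966]]"))  # 23/24 April 1966
```

A partially linked range such as the 44th Fighter Squadron example is not matched as a run by this sketch, so only its final full date would be delinked.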

Partially linked full dates

During my database scans, I found a number of cases of full dates, date ranges, and lists where the date components were only partially linked. Often these were cases where the month-day part was linked, but the year was not. Further, a cursory look at several of these articles led me to conclude that the most likely reason that the year was not linked was that the same year was linked earlier in the article.

For example, the 44th Fighter Squadron article contains the partially linked date [[21 December]] 1942–[[15 August]] [[1945]]. The apparent reason for not linking the year is that the same year is linked in the immediately preceding sentence. The rest of the section is very heavily date-linked, but with potentially duplicate links omitted.

My question is: Should we delink partially linked full dates? Alternately, should we delink partially linked full dates, if the unlinked date parts are linked elsewhere in the same article? -- Tcncv (talk) 05:47, 24 August 2009 (UTC)

I would say partially linked dates indicate that the editor who wrote it actually intended to link to the date-related article, and was not linking for the purpose of autoformatting. Such linking would be outside the scope of the bot. --Jc3s5h (talk) 15:10, 24 August 2009 (UTC)
Yup, I think partially linked triples have to be accepted as purposeful links rather than date-autoformatting at this stage, to be audited manually at a later stage. Although this will take a lot of time, it's all one can do. I think there aren't many of them around, actually.
In addition, may I remind people that one of the initial ideas behind the bot was that the task be relatively narrow and simple. However, this does not preclude the addition of simple fixes where they are easy to code and unlikely to cause problems (and where, as a result of the trial runs, they can be disentangled and removed if unsatisfactory). Tony (talk) 09:15, 26 August 2009 (UTC)
What simple fixes are you referring to? @harej 10:24, 26 August 2009 (UTC)
(ec) for the record, back when i was just learning to edit, i frequently linked partial dates in the belief that that would autoformat them. but i agree that guessing what partially-linked dates might mean is not part of what the bot was supposed to do. Sssoul (talk) 10:28, 26 August 2009 (UTC)

BRFA status?

I see the page hasn't been edited in a week. What's the status there? Dabomb87 (talk) 04:04, 9 September 2009 (UTC)

"Approved for trial run. (since September 12, 2009)" - bravo! Sssoul (talk) 19:42, 12 September 2009 (UTC)

block

The bot seemed to go crazy attempting to blank articles, so I have temporarily blocked it. From the edit filter log:

   * 22:06, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on History of Gabon. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 22:06, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on February. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 22:06, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on December. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 22:06, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on Coronation Street. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 22:05, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on August. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 22:05, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on April. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:59, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on November. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:59, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on Month. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:59, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on Beijing cuisine. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:58, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on May. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:58, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on March. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:58, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on Foreign relations of Morocco. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:57, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on July. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:57, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on June. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:57, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on List of historical anniversaries. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:57, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on History of Gabon. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:57, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on February. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:56, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on December. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:56, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on Coronation Street. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:56, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on August. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:55, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on April. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)
   * 21:53, 2 October 2009: Full-date unlinking bot (talk | contribs | block) triggered filter 3, performing the action "edit" on April. Actions taken: Warn; Filter description: New user blanking articles (details) (examine)


Graeme Bartlett (talk) 22:11, 2 October 2009 (UTC)

Hmm, this is strange. I can't see anything in the contributions log though. Dabomb87 (talk) 22:12, 2 October 2009 (UTC)
Ah, never mind. The edit filter probably prevented those edits from going through. Dabomb87 (talk) 22:13, 2 October 2009 (UTC)
It would have been nice if this hadn't been filtered; otherwise I would have been able to stop it in a much more timely manner. Of course, as with any software test, I will look into the cause and try to fix it. @harej 22:16, 2 October 2009 (UTC)
Are you ready for an unblock? Check this log for results too: http://en.wikipedia.org/w/index.php?title=Special:AbuseLog&wpSearchUser=Full-date%20unlinking%20bot Graeme Bartlett (talk) 22:21, 2 October 2009 (UTC)
Thank you for the link. I will unblock it when I am ready to run it again. @harej 22:39, 2 October 2009 (UTC)
Good luck! I am guessing that the edit filter only warned, but the bot did not know what to do, so did not proceed to override the warning and save. But something must have gone wrong before then. If I was programming this, it could have been due to spelling the variable wrong, and ending up with a null string! Graeme Bartlett (talk) 22:44, 2 October 2009 (UTC)
I forgot to utter global $contents; in one of the subroutines. -_- @harej 22:58, 2 October 2009 (UTC)

Proposed Arbcom Motion re date delinking

Your attention is brought to a motion currently being considered by the Arbitration Committee:
Wikipedia:Arbitration/Requests/Motions#Motion to amend Wikipedia:Requests for arbitration/Date delinking.

At the time this notice was posted the text of the motion read:

This wording may have since changed; please see the above link for the current wording.

On behalf of the Arbitration Committee, Manning (talk) 09:46, 12 October 2009 (UTC)

The motion has been passed, clearing any remaining obstacles. Dabomb87 (talk) 14:15, 24 October 2009 (UTC)

Need help?

Let me know if help is needed with this bot. Rich Farmbrough, 12:38, 16 October 2009 (UTC).

False negatives

The bot has only limited support to recognize piped forms of the form "[[month day|day]]" or "[[day month|day]]" and then only when used in date lists. The forms identified above are not supported, and it is probably a bit late to add them at this point in the approval cycle. I would suspect these forms are relatively rare and could possibly be handled as an AWB task. Bare years and month-year combinations are out of scope for the bot. -- Tom N (tcncv) talk/contrib 02:04, 23 October 2009 (UTC)
Links to year-in-topic articles are specifically excluded from the scope of the bot. I believe there is a preference to replace the piped "[[year in topic|year]]" forms to something like "year (see [[year in topic]])" in the article body, but the shorter piped references may be more appropriate for tables and infoboxes, as long as the context of topic is understood. Outside of context, I think they are considered Easter egg links. Too many nuances for a bot to figure out. -- Tom N (tcncv) talk/contrib 02:04, 23 October 2009 (UTC)
Thank you for your input. I fixed the second diff link above (it was missing a digit and yielded a very confusing result). I've looked at the cases you pointed out above and have run a database scan and found that there are a couple thousand articles containing linked dates with ordinals (e.g., [[October 24th]] or [[24th October]]). There are also about a hundred date links of the form [[27th of October]], but date redirects of that form are sparsely defined, and some (1st of May, 5th of May, 4th of July, 6th of October, 8th of November) are not standard month-day page redirects. As for the piped forms, I found several hundred cases of the form [[October 24|24 October]], which may make it worth the effort to automate.
The bot could be modified to recognize and process the above ordinal forms and piped forms, but I'd like to get input from Harej on (1) do we consider these forms to be clearly within the scope of the bot and (2) do we want to make these changes at this point in the test cycle. As for the ordinal notation, although the bot might delink that form, I don't think it is within the scope of the bot to remove the ordinal suffix, thus "[[October 24th]]" would become "October 24th", not "October 24". -- Tom N (tcncv) talk/contrib 19:34, 24 October 2009 (UTC)
I consider ordinal dates, as long as they are full dates, to be within the scope of the bot. As long as the formatting is preserved once the link is de-linked. @harej 19:45, 24 October 2009 (UTC)
I don't think the form [[nth of month]], [[y]] should be touched. It may be intended to indicate a holiday traditionally referred to by the date, such as [[4th of July]]. Processing such a date would create a change in meaning and eliminate a link to the holiday. --Jc3s5h (talk) 19:55, 24 October 2009 (UTC) (The preceding signature was manually fixed by tcncv.)
I had intended to exclude those links (listed above) that were not standard redirects to the month-day articles, but I can exclude all if you prefer. The number of articles affected is small enough to be handled manually (or as an AWB task). -- Tom N (tcncv) talk/contrib 20:42, 24 October 2009 (UTC)
If you have identified all the articles that exist of the form "nth of month", which are not redirects to a normal article about a date, and are excluding them, that's fine with me. --Jc3s5h (talk) 20:57, 24 October 2009 (UTC)
Changes have been implemented and will be included in the next trial run. -- Tom N (tcncv) talk/contrib 21:53, 31 October 2009 (UTC)
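
A minimal sketch of the agreed behaviour, in Python and not taken from the bot's code: ordinal dates are delinked with the suffix preserved, and the "nth of month" titles named above that are not standard date redirects are left alone. The names NOT_DATE_REDIRECTS and delink_ordinal are hypothetical.

```python
import re

MONTH = ("(?:January|February|March|April|May|June|July|August"
         "|September|October|November|December)")
ORD = r"\d{1,2}(?:st|nd|rd|th)"

# Titles identified in this thread as not being plain month-day redirects.
NOT_DATE_REDIRECTS = {"1st of May", "5th of May", "4th of July",
                      "6th of October", "8th of November"}

PATTERN = re.compile(
    r"\[\[(%s %s|%s (?:of )?%s)\]\](,? )\[\[(\d{1,4})\]\]"
    % (MONTH, ORD, ORD, MONTH))

def delink_ordinal(text):
    def repl(m):
        if m.group(1) in NOT_DATE_REDIRECTS:
            return m.group(0)  # may refer to a holiday: leave the links alone
        return m.group(1) + m.group(2) + m.group(3)  # suffix kept as written
    return PATTERN.sub(repl, text)

print(delink_ordinal("[[October 24th]], [[2009]]"))  # October 24th, 2009
```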

Capitalization

I noticed in this edit to Chesley Bonestell that the bot doesn't fix capitalization (i.e. it left it as june, not June). That seems like a very uncontroversial thing to implement—per the MoS, month names are always capitalized. TheFeds 01:30, 8 November 2009 (UTC)

Good catch. I was about to suggest that this be put off as an AWB task, but when I examined current linked date behavior, I found that proper-casing the month names is one of the side effects of date auto-formatting. Thus, "[[7 november]]" and "[[november 7]]" display as "7 november" and "november 7", respectively. Other case forms such as "[[7 NOVEMBER]]" and "[[NOVEMBER 7]]" are not fixed by autoformatting and display as "7 NOVEMBER" and "NOVEMBER 7".
I scanned the database dump and found about 2100 articles affected, which means we may be hitting about 20-30 affected articles per day. Since this affects the presentation (unlinking effectively changes "November" to "november"), I'm going to invoke Joe Biden and shut down the bot for now.
An enhancement would not be too difficult but would require a new set of regular expression patterns and substitutions. As far as I know, the "preg_replace" function does not directly support changing case in the substitution, so each month name and abbreviation will require a separate statement.
An alternative would be to pre-process the affected pages and then let the bot run as written. Using AWB to do this would take about 10 hours of time that I don't currently have. (Volunteers?) In particular, the "may" dates should be fixed, due to the ambiguity with the word "may" after unlinking. -- Tom N (tcncv) talk/contrib 04:03, 8 November 2009 (UTC)
I have posted a proposed update to the bot code that will change lower-case month names to proper case (leading cap). Only delinked dates are affected and only all-lower-case month names are changed. This will preserve the date-autoformatting appearance of these dates after delinking. Unit test results can be examined here (new test cases at the bottom), and a comparison with prior test results can be viewed here (no regression observed). It is my opinion that other forms, such as all-caps, that are not recognized by date auto-formatting should be left as-is by the bot, and any cleanup be done as a separate activity.
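
The rule can be illustrated with a short Python sketch (not the PHP update itself; the names DMY, fix_case and delink_dmy are hypothetical, and only the day-month-year form is shown). Only month names written entirely in lower case are changed.

```python
import re

MONTHS = ("january|february|march|april|may|june|july|august"
          "|september|october|november|december")
# Case-insensitive so that "november", "November" and "NOVEMBER" are all matched.
DMY = re.compile(r"\[\[(\d{1,2}) (%s)\]\] \[\[(\d{1,4})\]\]" % MONTHS,
                 re.IGNORECASE)

def fix_case(month):
    """Capitalize only all-lower-case names; leave every other form as-is."""
    return month.capitalize() if month.islower() else month

def delink_dmy(text):
    return DMY.sub(
        lambda m: "%s %s %s" % (m.group(1), fix_case(m.group(2)), m.group(3)),
        text)

print(delink_dmy("[[7 november]] [[2009]]"))  # 7 November 2009
print(delink_dmy("[[7 NOVEMBER]] [[2009]]"))  # 7 NOVEMBER 2009
```

Python can change case inside a replacement callback, so one pattern suffices here; this does not contradict the earlier remark about needing one statement per month with preg_replace.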
So the question is: do we (1) implement the change and resume bot operations, (2) preprocess the ~2000 affected pages and then resume bot operations using existing code, or (3) resume bot operations immediately as-is and clean-up the dates later. -- Tom N (tcncv) talk/contrib 06:55, 8 November 2009 (UTC)
I implemented the change and resumed the bot operation. @harej 07:34, 8 November 2009 (UTC)
and i watched in grateful admiration - thanks, people; please feel appreciated Sssoul (talk) 07:53, 8 November 2009 (UTC)

Special dates

  • 1st of May no links
  • 5th of May no links
  • 4th of July a disambiguation page - should have no links - has about 80
  • 6th of October a disambiguation page - should have no links - just a redirect now
  • 8th of November a disambiguation page - should have no links, has about 40?
Rich Farmbrough, 19:49, 12 November 2009 (UTC).
OK Dabs taken care of, no special cases in main-space just now. Rich Farmbrough, 20:57, 12 November 2009 (UTC).
What about the 5th of November ("Remember remember the fifth of November")? @harej 07:50, 14 November 2009 (UTC)

A DIY edition of FDUB

As I have done with the RfC posting tool, I could create a fork of fulldateunlinker.php where users enter the name of an article, it goes through the de-linking process, and then they save the changes with their own accounts. @harej 21:15, 12 November 2009 (UTC)

  • A reminder to all that User:Lightmouse has written monobook scripts in order to make all date formats consistent (without changing any ISO formats), and to remove links of date fragments as well (copy importScript('User:Lightmouse/monobook.js/script.js'); into your monobook page). The scripts below, also written by Lightmouse, are similar but may be run within the 'module' function of AWB, to give a mono-format article:
Ohconfucius ¡digame! 03:15, 13 November 2009 (UTC)

Articles such as Labour Day

The following item was moved from User Talk:Tcncv#work on James's script. -- Tom N (tcncv) talk/contrib 20:28, 5 November 2009 (UTC)

I noticed the bot has treated articles such as Labour Day. Could be easy to exclude by searching for calendar terms in titles ("day", "week", "month", "year", etc.). But I notice such articles have not been "excluded" here. Thanks. Tony (talk) 14:04, 5 November 2009 (UTC)

I don't think any of us have thought of that. While I don't know whether Labour Day should have had its date links removed, it's not like the bot will ever edit that article again. @harej 20:31, 5 November 2009 (UTC)
Infobox holiday is the key here. But again the date links are not that useful. Rich Farmbrough, 12:40, 17 November 2009 (UTC).

[1] Use with caution and at your own risk. Rich Farmbrough, 07:44, 16 November 2009 (UTC).

A note on the bot's operation

The bot will do this weird thing where it will just plain stop editing after it has been running for a long time — the output would indicate that it is in fact submitting changes, but those changes are not actually being submitted. As a workaround, the bot automatically shuts off after being in operation for 7 hours, then it resumes after an hour-long break. This is accomplished by a built-in time limit along with a cronjob. @harej 01:52, 9 November 2009 (UTC)
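
A sketch of what a built-in time limit of this kind might look like, in Python rather than the bot's PHP. The function run and its parameters are hypothetical; an external scheduler such as cron is assumed to start the script again after the break.

```python
import time

MAX_RUNTIME_SECONDS = 7 * 60 * 60  # stop after seven hours of operation

def run(pages, process, clock=time.monotonic, limit=MAX_RUNTIME_SECONDS):
    """Process pages until the list is exhausted or the time limit is reached."""
    started = clock()
    done = 0
    for page in pages:
        if clock() - started >= limit:
            break  # exit cleanly; the scheduler starts a fresh run later
        process(page)
        done += 1
    return done
```

The clock is passed in as a parameter only so that the limit can be exercised in a test without waiting seven hours.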

What can I do to help? Look at the code? Run the bot? Run the task under SmackBot? Rich Farmbrough, 12:56, 12 November 2009 (UTC).

James, could I chime in here as well to say that many hands make light work, and you've borne the brunt of the task alone already, for which we're all thankful ... but would you be OK about sharing the burden now? Tony (talk) 14:13, 12 November 2009 (UTC)

It's not that I have given up, it's that I have little free time during the week. Rich Farmbrough is very much invited to look at the code to see if there's some flaw that causes submissions to not work. I also have another idea which I will elaborate upon below. @harej 21:13, 12 November 2009 (UTC)

Good news! I believe I know the source of the problem — the bot had been aggravating the API limits on edits, so I decided to rewrite the entire error detection system. I have posted the changes and will try running the bot. @harej 07:19, 14 November 2009 (UTC)

The bot may re-edit a few pages. This is because the glitch created a bunch of junk entries in the database, and it was easier to purge the database than to pick out which entries were wrongfully placed in the database. From the cases I've seen where the bot re-edits a page, it is not edit warring with the "local population" of the article but is handling new links, so I would not worry about reverting it. (Of course, those who insist on reverting the bot's superfluous edit out of principle technically are allowed to do so, per the bot's operating requirements). @harej 07:53, 14 November 2009 (UTC)
  • The bot is no longer controversial, it hasn't made any bloopers so far, it has a panic button. Is there any reason why it's taking such frequent and long constitutionals? This seems to be an extremely laid-back approach to WP:NODEADLINE. Ohconfucius ¡digame! 15:07, 17 November 2009 (UTC)
    • There was a glitch (a glitch I have now resolved) that derailed the bot's process, requiring it to take vacations. Otherwise, the bot runs pretty quickly (4-5 edits per minute) and only has breaks in between edits. @harej 05:43, 18 November 2009 (UTC)
      • James, can you please clarify if that means it will no longer be taking breaks (at all). Also, the bot is mandated to run at a rate of 6-7 per minute. What would be the current constraints upon it running at 4? Ohconfucius ¡digame! 05:47, 18 November 2009 (UTC)

Non-breaking spaces?

When de-linking, how about putting a non-breaking space (&nbsp;) between day and month?
—WWoods (talk) 19:36, 14 November 2009 (UTC)

To prevent dates being split by line breaks, e.g. 15
November. This is routinely done with numbers, e.g. "15&nbsp;billion" or "15&nbsp;ft (4.6&nbsp;m)".
—WWoods (talk) 17:34, 15 November 2009 (UTC)
This is a Good Thing, however - FDUB is being extremely conservative to reduce the chance of community friction kyboshing the task. Is it worth a supplementary BRFA? Rich Farmbrough, 07:44, 16 November 2009 (UTC).
At this time, I do not believe there is a consensus opinion that non-breaking spaces are needed in dates. Wikipedia:Manual of Style (dates and numbers)#Non-breaking spaces lists several cases where they should be used, but dates is not one of them. -- Tom N (tcncv) talk/contrib 04:01, 18 November 2009 (UTC)

Pages edited by FDUB where my AWB script finds a full date

User:Rich Farmbrough/temp54 will hold off on these for a while. Needs to be considered that someone may have relinked the date correctly - but it seems unlikely. Rich Farmbrough, 18:17, 17 November 2009 (UTC).

Thank you for identifying these cases. I investigated the first 20 articles listed and found the following
  1. Anytime with Bob Kushell - Extra space between month and day: "[[March  31]], [[2009]]"
  2. List of Royal Malaysian police officers killed in the line of duty - New linked date added after bot processing
  3. LBE Nos. 1 to 3 - False hit: "[[2-4-2]]"
  4. InetSoft - Improperly formatted ISO date "[[2007-2-20]]" (fixed)
  5. Bob Scherbarth - Extra space between month and day: "[[January  18]], [[1926]]"
  6. Bavarian Eastern Railway Company - False hit: "[[2-2-2]]"
  7. List of mergers in Gunma Prefecture - New linked date added after bot processing
  8. Albrecht Achilles (Korvettenkapitän) - Extra space after year: "[[5 April]] [[1945 ]]"
  9. Tramway de Pithiviers à Toury - False hit: "[[2-5-2]]"
  10. List of fictional United States Presidents N-T - Unable to locate fully linked date
  11. Francis Lewis Cardozo - Extra space before year: "[[22 July]] [[ 1903]]"
  12. United States – Zimbabwe relations - Piped link to topic specific year article: "[[5 March]] [[2009 in literature|2009]]"
  13. Plymouth Friary railway station - False hit: "[[0-6-2]]"
  14. Roman Catholic Diocese of Foligno - Not sure what was detected. Suspect: "[[22 December]] [[1894]" (only one trailing bracket).
  15. List of massacres committed prior to the 1948 Arab–Israeli war in Mandate Palestine - Unable to locate fully linked date
  16. Regional implementations of DAB - Unable to locate fully linked date
  17. Hawkhurst Branch Line - False hit: "[[2-2-2]]"
  18. Opt-outs in the European Union - Bot edits reverted by another editor who objected to non-autoformatted ISO dates
  19. List of Sigma Alpha Epsilon chapters - Improperly formatted ISO date "[[2004-05-1]]" (fixed)
  20. WJOU - Piped link to topic specific year article: "[[January 4]], [[2008 in radio|2008]]"
As noted, I could not locate the date in question in a couple of the articles above. If you can identify them for me, please update the above results. Also, for the Opt-outs in the European Union article, if anyone has a script or bot to convert the ISO dates to day-month-year, please do so.
Of the cases identified, I think the only ones that may warrant a bot enhancement are those that contain extra spaces inside the link brackets and between the month and day. I think these could be considered uncontroversial cases within the scope of the bot. If there are no objections, I can start working on the change. -- Tom N (tcncv) talk/contrib 03:41, 18 November 2009 (UTC)
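As an illustration of the whitespace cases listed above, a tolerant pattern might look like the sketch below. It is written in Python for brevity and is not the bot's actual PHP code; the names `MDY`, `DMY` and `unlink_full_dates` are made up for this example, and only full English month names are assumed.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Month-day-year, e.g. [[March  31]], [[2009]]
MDY = re.compile(
    r"\[\[\s*(" + MONTHS + r")\s+(\d{1,2})\s*\]\],?\s*"
    r"\[\[\s*(\d{4})\s*\]\]")

# Day-month-year, e.g. [[5 April]] [[1945 ]]
DMY = re.compile(
    r"\[\[\s*(\d{1,2})\s+(" + MONTHS + r")\s*\]\]\s*"
    r"\[\[\s*(\d{4})\s*\]\]")

def unlink_full_dates(text):
    """Unlink full dates, tolerating stray spaces inside the brackets
    and between month and day. Piped year links such as
    [[2008 in radio|2008]] do not match and are left alone.
    Note: month-day-year output always gets a comma before the year."""
    text = MDY.sub(r"\1 \2, \3", text)
    text = DMY.sub(r"\1 \2 \3", text)
    return text

if __name__ == "__main__":
    print(unlink_full_dates("[[March  31]], [[2009]]"))  # March 31, 2009
    print(unlink_full_dates("[[5 April]] [[1945 ]]"))    # 5 April 1945
```

Wheel-arrangement links such as `[[2-4-2]]` are untouched because neither pattern accepts an all-numeric link.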
Will do the opt-out stuff. Rich Farmbrough, 08:26, 18 November 2009 (UTC).

Perhaps Rich's AWB does not have a tight enough definition of what an all-numeric date is. Then again, since there is no consensus on whether ISO 8601 applies to Wikipedia articles or not, there is no way to decide whether 8-03-01 is a valid all-numeric date (which defies the MOSNUM guideline about pre-1583 dates). --Jc3s5h (talk) 04:11, 18 November 2009 (UTC)

It does actually avoid changing stuff like 9-1-1 but is not accurate enough with its pre-rejection of these articles. Rich Farmbrough, 08:26, 18 November 2009 (UTC).
  • OK I am running through (fixing) these articles to make notes for future work as much as anything - examples or descriptions of each type (possibly mentioned above)
    • [[2007-06]] (valid ISO)
    • Piped links to "year in foo"
    • False positives
    • [[19 August]] [[14|AD 14]]
    • [[8]] [[December]] [[1821]] (since corrected)
    • [[Jan. 01]] ....
    • November 3, 17S8, that's an "S" ...
    • [[May 1]], ]]1798]]
    • 5 digit years NTFS
    • 2007-[[06-30]]
    • [[April 30]], [[4th millennium|3168]]
    • [21 March]] [[1945]]

The rest are either false positives or as described by Tcncv. Rich Farmbrough, 10:42, 18 November 2009 (UTC).
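One possible way to tighten the definition of an all-numeric date, so that links like `[[2-4-2]]` or `[[2007-2-20]]` are not mistaken for well-formed ISO-style dates, is sketched below. This is a hypothetical Python illustration, not the AWB script or the bot's code; `is_strict_iso_date` is a name invented here, and it takes no position on whether such dates should be changed.

```python
import re
from datetime import date

# Exactly four digits, two digits, two digits inside one link.
STRICT_ISO_LINK = re.compile(r"\[\[(\d{4})-(\d{2})-(\d{2})\]\]")

def is_strict_iso_date(link):
    """Return True only for links of the form [[YYYY-MM-DD]] whose
    fields also form a real calendar date."""
    match = STRICT_ISO_LINK.fullmatch(link)
    if not match:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # raises ValueError for e.g. 30 February
    except ValueError:
        return False
    return True

if __name__ == "__main__":
    for sample in ["[[2-4-2]]", "[[2007-2-20]]", "[[2004-05-1]]",
                   "[[2009-11-18]]"]:
        print(sample, is_strict_iso_date(sample))
```

Under this check only `[[2009-11-18]]` among the samples counts as a strict date; the others are either false hits or improperly formatted.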

The bot now runs full-time

For real this time. 24 hours a day, 7 days a week, about 4 edits per minute. If I am not mistaken, I have resolved the last glitch in the bot. Knock on wood. @harej 05:44, 18 November 2009 (UTC)

Congratulations. I also noticed that as of 05:42 today, the bot has finished with January 1 and has begun processing January 2 dates. Because many articles have many linked dates distributed through the calendar year, a disproportionate number of articles will be processed during scans for the early dates, so the pace should pick up in the coming weeks.
You might be able to tweak the built-in sleep statement to increase the processing rate. 10,000 edits per day equates to about 7 per minute. Assuming an average actual processing time of 5-6 seconds (retrieval, processing, and postback), changing the sleep timer to 4-5 seconds might get you closer to that target. -- Tom N (tcncv) talk/contrib 06:08, 18 November 2009 (UTC)
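The arithmetic behind this suggestion can be checked with a small sketch. It assumes, as the comment does, that each edit costs a fixed processing time plus a fixed sleep and that edits are strictly sequential; both function names are hypothetical.

```python
def edits_per_minute(processing_seconds, sleep_seconds):
    """Edit rate if every edit takes processing time plus one sleep."""
    return 60.0 / (processing_seconds + sleep_seconds)

def sleep_for_target(target_per_day, processing_seconds):
    """Sleep (in seconds) needed per edit to reach a daily target,
    never less than zero."""
    seconds_per_edit = 86400.0 / target_per_day
    return max(0.0, seconds_per_edit - processing_seconds)

if __name__ == "__main__":
    print(10000 / 1440.0)                 # target rate in edits per minute
    print(edits_per_minute(5.0, 4.0))     # fastest case mentioned above
    print(edits_per_minute(6.0, 5.0))     # slowest case mentioned above
    print(sleep_for_target(10000, 5.5))   # sleep for 10,000 edits per day
```

Under these assumptions 10,000 edits per day allows 8.64 seconds per edit, and a 4-5 second sleep with 5-6 seconds of processing gives roughly 5.5 to 6.7 edits per minute.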
Interesting to learn how it works, and by what article logic. So the number of articles processed is programmed in terms of idle time... Can we have an edit counter, please? and one that tells us what day of the year we're on? ;-) Ohconfucius ¡digame!
There may be a better way, but this tool (found through WP:COUNT) shows the edit count. About 12,750 as of this writing. As for the date currently being processed, you can look at Special:Contributions/Full-date_unlinking_bot and examine the diff of an article with only one date change. Most likely, that is the date whose "what-links-here" list is being processed. Check a second article to confirm. -- Tom N (tcncv) talk/contrib 07:20, 18 November 2009 (UTC)

Thank you...

...for existing. I dream of a Wikipedia where you aren't necessary. Eric talk 03:50, 20 November 2009 (UTC)

A t-shirt reading "Full-date unlinking bot's #1 Fan" is now en route to your place of residence. You're welcome. @harej 03:54, 20 November 2009 (UTC)

Styles of dating

There seems to be some confusion about how to present dates in articles. In the US, only the military writes dates day-first (e.g. 3 Dec 2009). Apparently the UK and other countries use that form all the time. It doesn't really matter to me which way I date but perhaps there should be some uniformity, at least within an article. Is there a particular policy at Wiki about the "correct" way to date? Mugginsx (talk) 16:58, 21 November 2009 (UTC)

Wow, it all seems too complicated for me. I guess I'll just depend on someone to edit my dates. Thanks. Mugginsx (talk) 17:54, 21 November 2009 (UTC)

Pretty much only the US uses December 3 1999 as the majority format. The US increasingly uses 3 December too, and not just in the military. Canada uses a complete mix. Most of the rest of the world uses "3 December". Unfortunately this has led to the formats being labelled "US" and "International" - guaranteed to reinforce arguments. MOS certainly suggests one style per article (excluding quotes and titles). Rich Farmbrough, 13:22, 22 November 2009 (UTC).

I will date as you recommend in the future. I do think it would be better for an entire article to be dated the same. Mugginsx (talk) 13:39, 22 November 2009 (UTC)
I wouldn't read too much into the "The US increasingly uses 3 December too" – the US civilian population still uses the month-day-year form exclusively. Although some government agencies and the military may use day-month-year for internal use, especially in areas of international influence (such as with the FAA), government communications with the public and the press generally use month-day-year dates (see http://nasa.gov, http://irs.gov, http://www.defense.gov, or anything under http://usa.gov). Spelled out month names are preferred in formal communications, but numeric-only dates of the form nn/nn, nn/nn/nn, or nn/nn/nnnn are understood to be month/day or month/day/year in the US. As far as I know, there is no public trend away from current usage. -- Tom N (tcncv) talk/contrib 17:22, 22 November 2009 (UTC)

As to style: one thing that bothers me, although I do not know how to change it without controversy, is that the "de Clare" article is headlined "De Clare", instead of "de Clare". The French editors agree. Do you agree, and is that something that should be changed? Mugginsx (talk) 13:42, 22 November 2009 (UTC)

I fixed the title of the de Clare article by adding the {{Lowercase}} template. -- Tom N (tcncv) talk/contrib 15:48, 22 November 2009 (UTC)
Thank you, Thank you, Thank you Tcncv! Mugginsx (talk) 21:59, 22 November 2009 (UTC)

ISO-style dates

Why are we leaving the reader-unfriendly ISO-style dates alone? Is there a discussion somewhere? Mr Stephen (talk) 10:50, 14 November 2009 (UTC)

I can think of a few reasons. One is a change in meaning. In an article about the UK, 5 November 1605 is certainly in the Julian calendar. On the other hand, 1605-11-05 might have been written by an editor who thinks in ISO 8601 and who took the trouble to read the ISO 8601 standard before using it, and knows that ISO 8601 dates must be in the Gregorian calendar. It might have been written by some fool who thinks ISO 8601 applies to Wikipedia, but didn't bother reading the spec first. Or, it might have been written by someone who is using that style, but never heard of ISO 8601, or does not give a damn what the spec actually says. Converting 1605-11-05 to 5 November 1605 converts an uncertain date to a certain date, thus changing the meaning.
Another reason is that the date might be within a quote or a programming language example. Since robots have no reliable means of detecting such cases, they could misquote someone, or damage what used to be a working example. --Jc3s5h (talk) 16:47, 14 November 2009 (UTC)
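To make the calendar point concrete: the same day carries different labels in the Julian and Gregorian calendars, so a bare 1605-11-05 is ambiguous by ten days. The sketch below uses the standard Julian Day Number arithmetic; the function names are hypothetical and it is only meant as a worked example, not as anything the bot does.

```python
from datetime import date

def julian_to_jdn(year, month, day):
    """Julian Day Number of a date given in the Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

def julian_to_gregorian(year, month, day):
    """Proleptic Gregorian date for a Julian calendar date.
    JDN 1721426 is Gregorian 0001-01-01, which Python counts as
    ordinal 1, hence the offset of 1721425."""
    return date.fromordinal(julian_to_jdn(year, month, day) - 1721425)

if __name__ == "__main__":
    # 5 November 1605 (Julian) is 15 November 1605 (Gregorian).
    print(julian_to_gregorian(1605, 11, 5))  # 1605-11-15
```

A reader who takes 1605-11-05 as a strict Gregorian ISO 8601 date and one who takes it as a restyled Julian date therefore disagree by ten days.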
Also because it would prescribe a certain date style. It's not obvious to a bot whether 1992-02-15 should be written as February 15, 1992 or 15 February 1992, so the bot would be arbitrarily assigned to use a specific format which is bad. @harej 16:53, 14 November 2009 (UTC)
Thanks for your replies. Neither of those reasons would apply to a human-assisted program, like the AWB module described immediately above, though. From a slightly different tack, Cybercobra pointed me at Wikipedia:Mosnum/proposal_on_YYYY-MM-DD_numerical_dates. Mr Stephen (talk) 17:52, 14 November 2009 (UTC)

However the axiom that you touch date formats at your peril is shown to be true on my talk page. I despair sometimes. Rich Farmbrough, 20:19, 16 November 2009 (UTC).

You don't need me to tell you how long this ... episode ... has been going on. Make some space on the despair couch. Mr Stephen (talk) 20:29, 16 November 2009 (UTC)
This 1605-11-05 example misses the two points that a) it is unlikely that someone has converted a Julian date to the Gregorian calendar (proleptic or otherwise) so the "certainty" of its meaning is non-existent and b) that it will in any case render 5 November or November 5 for many users. Rich Farmbrough, 12:27, 17 November 2009 (UTC).

Just a note: many of these dates in references were created by a long AWB run about a year ago, e.g. http://en.wikipedia.org/w/index.php?title=Jerry_Hsu&action=historysubmit&diff=267380684&oldid=267000838. At that point it was the Right Thing to do according to the Cite template documentation (although those following other events may have been aware it was the Wrong Thing). Rich Farmbrough, 14:31, 23 November 2009 (UTC).

  • I can't find any diffs now, but I've seen a few instances of a bot converting dates in this way. The last time I saw this happen was maybe three/four weeks ago. I'll post here, or notify the bot owner, if I find any examples. Ohconfucius ¡digame! 15:44, 23 November 2009 (UTC)

AWB

I saw this page and WP:RFAR/DDL only moments ago, after running AWB for the past week with some commands that removed date links. I now realise I may have actually run afoul of the ArbCom ruling. So now that this bot is active, may I confirm that it is OK to resume delinking? Thanks!--Huaiwei (talk) 10:50, 23 November 2009 (UTC)

I think it unlikely anyone would object to unlinking full dates, unless you were unlucky enough to hit an article which had been unlinked/relinked. Technically the ARBCOM ruling expires 14 December but this is about improving the 'pedia. Rich Farmbrough, 14:28, 23 November 2009 (UTC).
Yes, it's perfectly OK now. The ruling actually says "All mass date delinking is restricted for six months. For six months, no mass date delinking should be done until the Arbitration Committee is notified of a Community approved process for the mass delinking [my emphasis]." Since ArbCom was notified a while back, I don't think there's an issue now. Dabomb87 (talk) 15:57, 23 November 2009 (UTC)
Ah I am relieved to hear that! Many thanks guys! ;)--Huaiwei (talk) 19:03, 23 November 2009 (UTC)