Wikipedia:Bots/Requests for approval/KolbertBot 3

The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

Request Expired.

KolbertBot 3

Operator: Jon Kolbert

Time filed: 01:28, Tuesday, February 6, 2018 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s):

Source code available:

Function overview: Convert New York Times abstract URLs to archive URLs.

Links to relevant discussions (where appropriate): Wikipedia:Bot_requests#Bot_to_convert_New_York_Times_abstract_URLs_to_archive_PDF_URLs

Edit period(s): ~~one time~~ continuous (NYT links are continuously being added to the project)

Estimated number of pages affected: ~10,000

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: Replace http[s]://select.nytimes.com/gst/abstract.html?res= and http[s]://query.nytimes.com/gst/abstract.html?res= with http://query.nytimes.com/mem/archive/pdf?res=
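
A minimal sketch of the replacement described in the function details, assuming a plain regular-expression substitution over page wikitext. The names `ABSTRACT_URL`, `ARCHIVE_PREFIX`, and `convert_abstract_urls` are illustrative only; they are not taken from KolbertBot's source, which is not listed in this request.

```python
import re

# Matches the two abstract URL prefixes named in the function details,
# over either http or https.
ABSTRACT_URL = re.compile(
    r"https?://(?:select|query)\.nytimes\.com/gst/abstract\.html\?res="
)

# Target prefix from the function details (always http).
ARCHIVE_PREFIX = "http://query.nytimes.com/mem/archive/pdf?res="


def convert_abstract_urls(wikitext: str) -> str:
    """Rewrite every matching abstract URL prefix, leaving the res= key untouched."""
    return ABSTRACT_URL.sub(ARCHIVE_PREFIX, wikitext)


if __name__ == "__main__":
    sample = (
        "https://select.nytimes.com/gst/abstract.html"
        "?res=9805E1DC123AE633A25752C3A9649D946096D6CF"
    )
    print(convert_abstract_urls(sample))
```

Only the prefix is rewritten, so the key after `res=` carries over unchanged and other URLs on the page are not affected.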

Discussion

Is this going to convert working https links to http? — xaosflux Talk 02:45, 6 February 2018 (UTC)

@Xaosflux: If you convert a NY Times PDF link from http to https, the link will not work. I explained this on Jon's talk page a month ago, and at WP:BOTREQ just yesterday. epicgenius (talk) 22:07, 6 February 2018 (UTC)

@Jon Kolbert: Thanks for filing the BRFA. Since NY Times links are being added all the time, should this be a periodic task? epicgenius (talk) 22:07, 6 February 2018 (UTC)

@Epicgenius: Yeah, that makes more sense. I'll make the adjustment in the request. Jon Kolbert (talk) 01:06, 7 February 2018 (UTC)

Please provide below an example of one of these being changed in a current article (just do it with your own account). I'm especially interested in one where the reference is currently using HTTPS. — xaosflux Talk 22:33, 6 February 2018 (UTC)

@Xaosflux: See here for a sample edit. Jon Kolbert (talk) 01:03, 7 February 2018 (UTC)

@Jon Kolbert: This looks like it is making use of a current vulnerability with this publisher, and may be infringing on their digital property. Do we have any indication that this is both a supported and authorized use? — xaosflux Talk 02:22, 7 February 2018 (UTC)

Not Jon, but now that I think about it, we can limit the run to articles published before January 1, 1923 (so 1922 or earlier) by checking the publication date in the metadata. Pages published before then are free to non-members, though on a limited basis. Pages published between January 1, 1923 and December 31, 1980 are not free, and linking to them this way may be considered digital theft. epicgenius (talk) 15:00, 7 February 2018 (UTC)

@Epicgenius: Do you have an example of the "old" and "new" link for one of these pre-1923 articles? — xaosflux Talk 15:29, 7 February 2018 (UTC)

This is an article from 1896. This is the old link and this is the new link. Incidentally, here are the "keys" for several randomly selected articles in the December 29, 1922 to January 4, 1923 range. I don't see a particular pattern, though. epicgenius (talk) 15:42, 7 February 2018 (UTC)

Dec 31, 1911: 9805E1DC123AE633A25752C3A9649D946096D6CF (free)
Dec 31, 1921: 950CE0DF1E3EEE3ABC4950DFB467838A639EDE (free)
Dec 29, 1922: 9C05E0DF1730E433A2575AC2A9649D946395D6CF (free)
Dec 30, 1922: 9D04E2DE1730E433A25753C3A9649D946395D6CF (free)
Dec 31, 1922: 9C0CE7DC1730E433A25752C3A9649D946395D6CF (free)
Jan 01, 1923: 9B04E6D91731E633A25752C0A9679C946295D6CF (free)
Jan 02, 1923: 9A05E4D91731E633A25751C0A9679C946295D6CF (not free)
Jan 03, 1923: 9902E4DF1630E333A25750C0A9679C946295D6CF (not free)
Jan 04, 1923: 9E03E3D61531E333A25757C0A9679C946295D6CF (not free)
Dec 31, 1923: 9803E6D71130E233A25752C3A9649D946295D6CF (not free)
Dec 31, 1933: 9406E3DA1731E633A25752C3A9649D946294D6CF (not free)

@Epicgenius: OK, the "old" ones that are using the query.nytimes.com/mem/archive-free/pdf? resolver appear safer to use. Do you think there is any benefit to ADDING the new link while also keeping the old one? — xaosflux Talk 15:49, 7 February 2018 (UTC)

I think we could definitely keep both links, in case non-subscribers want to see the abstract and not count the article against their monthly limit. Or we can add {{subscription required}} with a note. NY Times is unusual in that pre-1923 and post-1980 articles are free to view, but with a limit. epicgenius (talk) 16:18, 7 February 2018 (UTC)

Ack, that's a good point. I thought these articles were intentionally available for free access. I guess the question now becomes what to do with the current links to the .pdf documents and how to determine whether one is in the public domain or not. Surely transcriptions/copies of the public domain articles could be useful on some Wikimedia project. Jon Kolbert (talk) 19:09, 7 February 2018 (UTC)

As per the copyright law of the United States, articles published before January 1, 1923 are free and in the public domain - hence the restriction to articles before January 1, 1923. So this article published in 1896 would be in the public domain. I'm not sure about articles published on January 1, 1923, but I'd imagine these should be treated as copyrighted articles. epicgenius (talk) 21:47, 7 February 2018 (UTC)
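
A minimal sketch of the date restriction described in the comment above, assuming the publication date has already been obtained from the article's metadata (how to obtain it is not covered in this discussion). The names `PD_CUTOFF` and `is_pre_1923` are hypothetical. Following the comment, an article dated exactly January 1, 1923 is treated as copyrighted.

```python
from datetime import date

# Cutoff from the discussion: only articles published strictly before
# January 1, 1923 are treated as public domain.
PD_CUTOFF = date(1923, 1, 1)


def is_pre_1923(published: date) -> bool:
    """Return True only for publication dates before the cutoff."""
    return published < PD_CUTOFF


if __name__ == "__main__":
    # Two dates taken from the list of sample keys in this discussion.
    for d in (date(1922, 12, 31), date(1923, 1, 1)):
        print(d.isoformat(), is_pre_1923(d))
```

A bot could apply a check like this before rewriting a link and skip any page whose publication date is missing or on/after the cutoff.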

Sorry, that's not what I meant. I know that the material published before 1923 is PD, but I don't know how to set up KolbertBot so it can make that distinction. I mostly do work with images, and I know there are some particular restrictions about the use of PDF files on Commons, but I'll check in with my colleagues there to see if the PD news articles would be welcome. Jon Kolbert (talk) 02:32, 9 February 2018 (UTC)

A user has requested the attention of the operator. Once the operator has seen this message and replied, please deactivate this tag. (user notified) Have you had a chance to follow up and decide how/if the scope of this should be updated? — xaosflux Talk 22:00, 19 February 2018 (UTC)

Request Expired. — xaosflux Talk 12:09, 28 February 2018 (UTC)

@Xaosflux: Whoops, I just saw this in my watchlist now. It appears that AnomieBOT only notified me the first time the {{OperatorAssistanceNeeded}} template was invoked and didn't do it the second time around on February 19 (I didn't notice a comment was made because it was probably buried in my watchlist). It seems that the bot may have skipped notifying me a second time because I already had an old notification from the bot for the same BRFA on my talk page on February 19. In any case, the task of determining which articles are PD and which aren't is a bit above my coding ability and doesn't exactly fit within KolbertBot's current scope of HTTP->HTTPS and preventing linkrot. Jon Kolbert (talk) 16:26, 28 February 2018 (UTC)

@Jon Kolbert: No worries. Please keep in mind that having an expired request isn't a negative; you may always present this task again (though please include it in the list of prior discussions). Best regards, — xaosflux Talk 16:29, 28 February 2018 (UTC)

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.