Wikipedia:Bots/Requests for approval/GreenC bot 4

The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

GreenC bot 4

Operator: Green Cardamom

Time filed: 00:57, Saturday, February 18, 2017 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): Nim + Awk

Source code available: Yes

Function overview: Convert machine-specific URLs at Internet Archive to generic work page URLs.

Links to relevant discussions (where appropriate): User_talk:Cyberpower678/Archive_45#IABot_dead_link_fix

Edit period(s): One time run initially

Estimated number of pages affected: 20,000

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details:

The bot's purpose is most easily seen in an example diff: [1]

Links to Internet Archive collections have a main work page in this form: https://archive.org/details/manualofconcholo111tryo. The work page contains multiple files in PDF, Epub, text, etc., as seen in the index. It's possible to link directly to a file like this: http://ia700307.us.archive.org/4/items/manualofconcholo111tryo/manualofconcholo111tryo.pdf. That link contains a machine ID (ia700307) within the cluster, so if the cluster changes (for example, a machine is replaced after a hardware failure), the link dies.

Initial tests showed that approximately 50% of all such "machine ID links" on enwiki have become dead links. The bot's purpose is to replace all such machine ID links with the work page which redirects to whatever machine hosts it in the IA cluster. The bot will also detect and remove any {{dead link}} tags.
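As a rough illustration of the rewrite described above, here is a minimal Python sketch. It assumes machine ID links always follow the shape of the single example given (`iaNNNNNN.us.archive.org/N/items/<identifier>/<file>`). The function name and the regular expression are hypothetical and are not taken from the bot's Nim/Awk source.

```python
import re

# Assumed shape of a machine ID link, based only on the example above:
#   http://ia700307.us.archive.org/4/items/<identifier>/<filename>
MACHINE_LINK = re.compile(
    r"^https?://ia\d+\.us\.archive\.org/\d+/items/(?P<identifier>[^/]+)"
)


def to_work_page(url):
    """Return the generic work page URL, or None if url is not a machine ID link."""
    match = MACHINE_LINK.match(url)
    if match is None:
        return None
    return "https://archive.org/details/" + match.group("identifier")


old = ("http://ia700307.us.archive.org/4/items/"
       "manualofconcholo111tryo/manualofconcholo111tryo.pdf")
print(to_work_page(old))
```

Links that do not match the assumed pattern return None, so a caller can leave them untouched.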

The code for the bot is completed, tested and ready to run. It is a module of WaybackMedic, so whatever pages this bot edits will also get other WM fixes. It will initially target all articles containing machine ID links as found by searching a recent database dump, then continue running incidentally as part of the WaybackMedic runs.

Discussion

This seems like a great task. I can see no policy based reason for opposing this. TheMagikCow (talk) 15:27, 18 February 2017 (UTC)

Approved for trial (10 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Just a short run to see some more examples. Otherwise, I don't see any issues. — HELLKNOWZ ▎TALK 15:54, 18 February 2017 (UTC)

It did a couple beyond 10 during the last batch of WaybackMedic. Trial complete. -- GreenC 20:07, 19 February 2017 (UTC)

@Green Cardamom: It looks like the trial edits were all in situations where the former link was dead - can you point to an example of an edit this task would perform on a link that is not dead? — xaosflux ^Talk 23:57, 23 February 2017 (UTC)

Yeah, that is disabled. I didn't want to run into an editorial dispute before the BRFA was approved, but felt safe fixing dead links on its own merit. I'll enable it for the next WaybackMedic batch (in a day or two) and post some results. -- GreenC 00:18, 24 February 2017 (UTC)

Approved for extended trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. OK, please update here when done, be sure to include at least a few of these (yes it may open an editorial can of worms). — xaosflux ^Talk 00:39, 24 February 2017 (UTC)

The last batch found about 60 articles and they are grouped together in the GreenC bot edit history starting at 17:04 (Paleoconservatism) through 17:08 (Chinese Canadians in Greater Vancouver). It's about 50/50 dead vs non-dead.

Trial complete. -- GreenC 21:12, 26 February 2017 (UTC)

OK, so most edits are around here. I took one as an example, the edit on Beit Sira. The change replaced a working direct reference BEFORE with a link to a page that welcomes readers with:

There Is No Preview Available For This Item
This item does not appear to have any files that can be experienced on Archive.org.
Please download files in this item to interact with them on your computer.

AFTER. — xaosflux ^Talk 02:06, 28 February 2017 (UTC)

Needs wider discussion. This seems to be an undesirable landing page for readers, a larger community discussion is warranted. Please post at appropriate venues, including Wikipedia:Village pump (proposals), and centralize a discussion. — xaosflux ^Talk 02:09, 28 February 2017 (UTC)

That's just how Internet Archive works. It's uncommon for there to be no preview, but the page has a link to the PDF. There is no other method to make a permalink. I've opened discussion at Wikipedia:Village_pump_(proposals)#Links_to_Internet_Archive and Wikipedia_talk:External_links#Links_to_Internet_Archive. -- GreenC 14:15, 28 February 2017 (UTC)

I am probably in favour of one aspect of this proposal but not of the other. (1) If a URL containing a cluster ID can invariably be mapped to a generic URL then I support such a change. Does the Internet Archive have a statement for their policy? Have they been asked? Do we know it will always be the same edition of a book that is being linked to? (2) I expect it will be problematic linking to a "work page" which links in turn to multiple formats. The formats will not necessarily be identical (for example OCR errors, pagination) and careful referencing will have been done with respect to the particular format used originally in the citation. Modifying a citation would certainly require adding a note stating the originally linked format. However, an additional convenience link to the work page could well be helpful to the reader. Such an approach would have helped with the problem highlighted above. Suggestion: https://archive.org/download/CensusOfPalestine1931.PopulationOfVillagesTownsAndAdministrativeAreas/PalestineCensus1931.pdf as a replacement link (but preferably a non-download variant) and https://archive.org/details/CensusOfPalestine1931.PopulationOfVillagesTownsAndAdministrativeAreas as a convenience link. Thincat (talk) 15:46, 28 February 2017 (UTC)

That's a good point; I'll ask IA if there is a permalink URL for the media file. Each upload has its own work page. The multiple formats are called derivations: they are derived (automatically generated) from the same source upload and contain identical content (sans OCR mistakes for derived text files). A convenience link is not a bad idea, but it would be very difficult for a bot for a number of reasons. I have no problem replacing with the "/download/" version. -- GreenC 16:22, 28 February 2017 (UTC)

I heard from Internet Archive and the permalink options are "/download/" and "/details/". Suggest the bot checks for a preview on the details page and, if none exists, uses "/download/", otherwise "/details/". The "/details/" page contains metadata, links to other file formats and an in-browser document view that is hidden when using "/download/", but it's a compromise to use "/download/" when the document view is missing. -- GreenC 14:56, 1 March 2017 (UTC)

Oh, thank you for investigating that. In the light of what you say about "derivations" I think my suggestion about preserving a record of the exact format is too minor to be usefully pursued. Thincat (talk) 18:31, 1 March 2017 (UTC)

Updated edit policy:

"/download/" for .mp3/.mp4/.ogg etc., and .pdf if page has no preview
"/details/" for everything else

-- GreenC 13:31, 23 March 2017 (UTC)
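The updated edit policy can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: it reads the policy as "media files always use /download/, PDFs use /download/ only when the details page has no preview", it assumes the machine ID URL shape from the example earlier in this request, and it takes `has_preview` as an input because the discussion does not say how the bot detects a preview. The names `permalink`, `MEDIA_EXTENSIONS` and the sample item `example_item` are hypothetical.

```python
import re

# Assumed shape of a machine ID link (taken from the single example in this BRFA).
MACHINE_LINK = re.compile(
    r"^https?://ia\d+\.us\.archive\.org/\d+/items/([^/]+)/([^?#]+)$"
)

# Extensions named in the policy; the original list ends with "..",
# so the real set is probably larger (assumption).
MEDIA_EXTENSIONS = (".mp3", ".mp4", ".ogg")


def permalink(url, has_preview):
    """Pick a /download/ or /details/ permalink for a machine ID link.

    has_preview: whether the item's details page offers an in-browser
    preview. How that is detected is outside this sketch.
    """
    match = MACHINE_LINK.match(url)
    if match is None:
        return url  # not a machine ID link: leave unchanged
    identifier, filename = match.groups()
    lower = filename.lower()
    use_download = lower.endswith(MEDIA_EXTENSIONS) or (
        lower.endswith(".pdf") and not has_preview
    )
    if use_download:
        return f"https://archive.org/download/{identifier}/{filename}"
    return f"https://archive.org/details/{identifier}"


pdf = ("http://ia700307.us.archive.org/4/items/"
       "manualofconcholo111tryo/manualofconcholo111tryo.pdf")
print(permalink(pdf, has_preview=True))
print(permalink(pdf, has_preview=False))
```

Passing the preview check in as a flag keeps the URL decision separate from any network access, so the rule itself can be checked offline.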

{{BAGAssistanceNeeded}} -- the BRFA is now 2 months old and the discussions opened per the "wider discussion" request above are now archived. No one has commented other than Thincat. It's in line with policy about using permalinks. I've continued dribbling in changes incidentally as part of the WaybackMedic work, probably a few hundred more. -- GreenC 16:20, 18 April 2017 (UTC)

Approved. Useful extension to existing task to preserve links against inevitable linkrot. No pending concerns brought up during wider discussion (mainly SILENCE). Trusted botop. No further issues with trial edits. Due to the nature of the edits, various special cases and reliance on external tools, there may be occasional unforeseen errors. Obviously, this BRFA is approved on the general BOTPOL assumption that any issues are fixed and discussed, if needed. Approval also includes potential expansion to include other unambiguously better archive URLs that may appear in future or be for other services. — HELLKNOWZ ▎TALK 17:47, 18 April 2017 (UTC)

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.