Wikipedia:Bots/Requests for approval/OAbot
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Withdrawn by operator.
Operator: Pintoch (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 19:05, Saturday, October 22, 2016 (UTC)
Automatic, Supervised, or Manual: automatic
Programming language(s): Python
Source code available: https://github.com/dissemin/oabot
Function overview: Adds free-to-read external links to citation templates.
Links to relevant discussions (where appropriate):
Edit period(s): continuous
Estimated number of pages affected: most main space pages with scholarly citation templates (about one million pages, maybe?)
Exclusion compliant (Yes/No): Yes
Already has a bot flag (Yes/No): No
Function details: The bot queries various APIs to find free-to-read external links for citation templates and adds them to the templates. More details can be found on the project page. You can try out what the bot would currently do with the web-based demo.
The bot uses the forthcoming access-signaling features of CS1-based templates, so it should not be run before these features are deployed (expected in about one week). I am posting the application now because I expect a lot of discussion anyway.
Our goal for this bot was not to add as many links as possible, but rather to make sure all the changes would be as small and uncontroversial as possible. The bot adds at most one link per template, uses the appropriate identifier parameters when available, and skips templates where one of the existing links is already free to read. The editing pace of the bot would be quite low (a few pages per minute), as the API calls it makes are quite slow (as you can see on the web interface).
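For illustration, the per-template policy described above can be sketched roughly as follows in Python; the helper names and data structures are hypothetical, not the actual oabot code:

FREE_ACCESS_PARAMS = ('doi-access', 'hdl-access', 'url-access')

def pick_single_addition(template, candidates):
    """Sketch of the policy above: skip templates that already carry a
    free-to-read link, prefer identifier parameters, and add at most one link.
    `template` behaves like a dict of citation parameters; `candidates` are
    dicts describing free-to-read links found by the APIs (hypothetical)."""
    if any(template.get(p) == 'free' for p in FREE_ACCESS_PARAMS):
        return None  # an existing link is already marked free to read
    for link in candidates:
        # Prefer a dedicated identifier parameter (hdl, arxiv, citeseerx, ...).
        if link.get('id_param') and not template.get(link['id_param']):
            return (link['id_param'], link['id_value'])
    if candidates and not template.get('url'):
        return ('url', candidates[0]['url'])  # otherwise fall back to a plain |url=
    return None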
Of course, suggestions are welcome. For instance, would it be useful to post a message on the talk page of each affected page, similarly to what User:InternetArchiveBot currently does?
Notifying users involved in the project: Ocaasi (WMF), symac, Andrew Su, jamestwebber, A3nm, Sckott and ChPietsch.
Discussion
Demo
I ran the demo on Alan Turing and the bot added |doi-access=free to the citation with doi:10.1112/plms/s2-43.6.544. When I click through that DOI, the page says that I need to sign in to get a full text PDF. Is that a bug? – Jonesey95 (talk) 19:37, 22 October 2016 (UTC)[reply]
- Good catch! This error comes from our data sources, as you can see here (the record is classified as free to read). This is bound to happen sometimes, although errors of this kind are usually quite rare in my experience, as BASE's OA classification tends to err on the side of caution. ChPietsch (responsible for APIs at BASE), any idea? I do not see any straightforward fix. The newly launched http://oadoi.org does not solve the issue for this DOI either. This is probably a good case for a notification on the talk page, inviting editors to check the edit? − Pintoch (talk) 19:56, 22 October 2016 (UTC)[reply]
- If the bot were live and it were to make the edit described above, an editor might notice that the bot added a non-free URL to a citation and remove it. What prevents the bot from trying to add a non-free (or any non-working) URL again? For example, InternetArchiveBot uses {{cbignore}} as a way for an editor to prevent the bot from further altering a citation. Does OABot have something similar, e.g. perhaps honoring the same ignore template or a new {{oaignore}}? —RP88 (talk) 21:45, 22 October 2016 (UTC)[reply]
- Excellent suggestion. We should have that indeed. − Pintoch (talk) 06:59, 23 October 2016 (UTC)[reply]
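For example, such a check could be as simple as the following sketch, where {{oaignore}} is only a proposed name and the wikitext context is a hypothetical input:

import re

# Skip a citation if an ignore marker follows it, similar to how
# InternetArchiveBot honours {{cbignore}}.
IGNORE_RE = re.compile(r'\{\{\s*(?:cbignore|oaignore)\s*[|}]', re.IGNORECASE)

def is_ignored(wikitext_after_citation):
    """Return True if an ignore template appears right after the citation."""
    return bool(IGNORE_RE.search(wikitext_after_citation[:200]))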
As another example, on Makemake for the citation to doi:10.1038/nature11597 the bot wants to add a 'url' of http://orbi.ulg.ac.be/jspui/handle/2268/142198 . That link redirects to http://orbi.ulg.ac.be//handle/2268/142198 . That page doesn't actually have a full text version, the "fulltext file" attached is just a 2.2 kB blank PDF. The tool oaDOI also fails for this DOI, but in a different manner. For this DOI it claims to have found a full text version, but it directs us to a different URL http://pubman.mpdl.mpg.de/pubman/faces/viewItemOverviewPage.jsp?itemId=escidoc:1615196 which says "There are no public full texts available". —RP88 (talk) 05:39, 23 October 2016 (UTC)[reply]
- Interesting. I am considering adding a second layer of full-text detection in the bot, based on Zotero. It will be very slow, but for a bot like this we don't really care. − Pintoch (talk) 06:59, 23 October 2016 (UTC)[reply]
I ran the demo on Alzheimer's disease. Twice, apparently for the same citation, the demo added http://www.ncbi.nlm.nih.gov/pmc/articles/PMC. That link points to an error page; presumably the bot forgot to append the appropriate identifier (if there is one). Also, all of the new ResearchGate URLs end with the .pdf extension, but when I followed those links I did not get a PDF file but instead got an HTML abstract page. The bot shouldn't cause the cs1|2 template to mislead the reader (URLs ending in '.pdf' display the PDF icon).
- Thanks! I think I can deal with these problems via special cases in the code. − Pintoch (talk) 20:17, 22 October 2016 (UTC)[reply]
- The two bugs should be fixed now (make sure you refresh the processing as there is some caching). − Pintoch (talk) 20:39, 22 October 2016 (UTC)[reply]
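One way to special-case the misleading .pdf extensions is to check what the server actually returns before keeping the link; a rough sketch using the requests library, not necessarily how the bot implements it:

import requests

def really_serves_pdf(url, timeout=30):
    """Best-effort check that `url` serves a PDF rather than an HTML abstract page.
    Sketch only: redirects, rate limits and errors would need more care."""
    try:
        r = requests.get(url, timeout=timeout, stream=True)
        if r.status_code != 200:
            return False
        first_chunk = next(r.iter_content(1024), b'')
        content_type = r.headers.get('Content-Type', '').lower()
        return 'pdf' in content_type or first_chunk.startswith(b'%PDF')
    except requests.RequestException:
        return False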
I ran the demo on Influenza A virus subtype H7N9 and noticed one item that wasn't really an error. For the citation to doi:10.1016/S0140-6736(13)60938-1 it wants to add a 'url' of:
https://www.researchgate.net/profile/Haixia_Xiao/publication/236637171_Origin_and_diversity_of_novel_avian_influenza_A_H7N9_viruses_causing_human_infection_Phylogenetic_structural_and_coalescent_analyses
when a 'url' of https://www.researchgate.net/publication/236637171
would have been adequate. Not really a big deal, but I figured it would be nice to use the short researchgate URLs if possible. —RP88 (talk) 21:19, 22 October 2016 (UTC)[reply]
- I think I broke the tester: I tried to run Diffie–Hellman key exchange, got a 500 server error, and now it won't try anymore. — xaosflux Talk 21:30, 22 October 2016 (UTC)[reply]
- Hmm, same for Cayley–Purser algorithm - I think it may have some input handling bugs. — xaosflux Talk 21:33, 22 October 2016 (UTC)[reply]
- It's the m-dash in the page name. The bot is breaking on pages with non-ascii characters in their page name. For example, it also fails on the page À. —RP88 (talk) 21:54, 22 October 2016 (UTC)[reply]
Thanks. I've fixed the unicode problem (that's what you deserve when you use python 2) and have shortened RG urls a bit more (though I could probably remove the profile bit indeed). − Pintoch (talk) 22:32, 22 October 2016 (UTC)[reply]
- Do you think you could add the following code to main.py in order to further shorten the ResearchGate URLs? After rg_re add...
rg2_re = re.compile('^(https?://www\.researchgate\.net/)profile/[^/]+/(publication/[0-9]+)$')
- ...and change the rg_match code to the following:
rg_match = rg_re.match(oa_url)
if rg_match:
    oa_url = rg_match.group(1)
# Further shorten ResearchGate URLs by removing "profile/<name>"
rg_match = rg2_re.match(oa_url)
if rg_match:
    oa_url = rg_match.group(1) + rg_match.group(2)
- Thanks. —RP88 (talk) 13:21, 26 October 2016 (UTC)[reply]
- Thanks a lot for the suggestion, I have added it to the code after a slight simplification. − Pintoch (talk) 22:40, 26 October 2016 (UTC)[reply]
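For reference, both normalisations can be folded into a single expression along these lines; this is a sketch of the simplification, and the actual code in main.py may differ:

import re

# Collapse https://www.researchgate.net/[profile/<name>/]publication/<id>[_<slug>]
# down to https://www.researchgate.net/publication/<id>
RG_SHORT_RE = re.compile(
    r'^(https?://www\.researchgate\.net/)(?:profile/[^/]+/)?(publication/[0-9]+)(?:[_/].*)?$')

def shorten_researchgate_url(url):
    m = RG_SHORT_RE.match(url)
    return m.group(1) + m.group(2) if m else url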
Back to Alzheimer's disease again. The demo added |url=https://www.academia.edu/27685650. When I clicked that link from the demo page, I landed on a page that offers apparently two ways to get to the PDF document. One way is through a Google login – a rather large reddish button which pops up a window wanting me to sign in with a Google account. The other is a larger image that purports to look like a stack of pages, with the document title and "PDF" in white on a red background and the Acrobat icon. Clicking that image gets me to a 'sign up to download' display (Google and/or Facebook). The demo did not add |url-access=registration, but wasn't that the purpose of all of that haggling we've been doing at WT:CS1?
But wait, there's more. When I put the link here in this post, magically, it takes me to the document and now the link in the demo's report does the same. This is astonishing. Do not astonish the user.
I know that you are not responsible for that pathetic academia.edu user interface, but I do have to wonder: if it is so poorly designed as to act this way, should the bot be adding those links to Wikipedia articles?
—Trappist the monk (talk) 00:12, 23 October 2016 (UTC)[reply]
- Yes, as you noticed, Academia.edu does not require registration to download the PDF if the user comes from Wikipedia. Therefore, links to this website are currently displayed without an access annotation. But this behavior can easily be changed, of course. I personally think these links are useful, as they give access to full texts that are hard to find elsewhere. Sometimes, even Google Scholar is not aware of these links (compare this Google Scholar cluster and http://doai.io/10.1017/s0790966700007503). The bot currently prioritizes links to conventional repositories over social networks such as ResearchGate and Academia.edu. − Pintoch (talk) 18:10, 23 October 2016 (UTC)[reply]
This is from Clitoris:
{{cite book |last=Schünke |first=Michael |first2=Erik |last2=Schulte |first3=Lawrence M. |last3=Ross |first4=Edward D. |last4=Lamperti |first5=Udo |last5=Schumacher |title=Thieme Atlas of Anatomy: General Anatomy and Musculoskeletal System |volume=1 |publisher=[[Thieme Medical Publishers]] |year=2006 |isbn=978-3-13-142081-7 |url=https://books.google.com/books?id=NK9TgTaGt6UC&pg=PP1 |accessdate=November 27, 2012 |ref=harv}}
- Schünke, Michael; Schulte, Erik; Ross, Lawrence M.; Lamperti, Edward D.; Schumacher, Udo (2006). Thieme Atlas of Anatomy: General Anatomy and Musculoskeletal System. Vol. 1. Thieme Medical Publishers. ISBN 978-3-13-142081-7. Retrieved November 27, 2012.
The demo added |doi=10.1136/aim.26.4.253. The template has a |url= link to a previewable facsimile at Google Books. The doi:10.1136/aim.26.4.253 links to a vaguely related review of this book in the BMJ (which is not mentioned in the citation). Not really useful for helping readers find a copy of the source material that supports the Wikipedia article. The bot should not be littering cs1|2 templates with such vaguely related material.
—Trappist the monk (talk) 00:24, 23 October 2016 (UTC)[reply]
- I propose to exclude {{cite book}} from the bot's scope, as confusion between books and their reviews could indeed be a problem. Is there a book-specific parameter in {{citation}} that I could use to detect CS2 books? − Pintoch (talk) 06:59, 23 October 2016 (UTC)[reply]
- I'm not sure that is necessary. I think the problem is at a different point in the processing workflow. Why did the bot even think this citation matched doi:10.1136/aim.26.4.253? Comparing the metadata between the DOI and the citation, the only match is the title. There is no overlap between the five authors in the citation and the one author for the DOI. The years of publication are not the same. Why did the bot think it had a match when literally the only matching metadata was the title? While it may be uncommon, I suspect there are more than a few journal articles that share titles. Maybe the better fix is for the bot to demand a higher-quality match before adding a DOI? How about this proposal: if a query is made to the Dissemin API without a DOI, the bot should apply an additional filter that rejects any results that differ in either the author's last name or year of publication. —RP88 (talk) 09:02, 23 October 2016 (UTC)[reply]
- That would make sense. Feel free to implement that if you are up for it. For now I will keep {{cite book}} blacklisted, mostly for performance reasons. − Pintoch (talk) 18:10, 23 October 2016 (UTC)[reply]
- I have added an additional check when adding a DOI, by comparing the metadata in the citation with the official metadata from the publisher. This solves the problem in the example raised by Trappist. − Pintoch (talk) 21:36, 4 November 2016 (UTC)[reply]
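A minimal sketch of the kind of metadata check discussed here, rejecting a DOI candidate whose first author or year disagrees with the citation; the field names are hypothetical and the real check compares the publisher metadata more broadly:

def plausible_doi_match(citation, doi_metadata):
    """Reject a DOI whose year or author surnames contradict the citation.
    `citation` and `doi_metadata` are dicts with 'year', 'last1'/'authors'
    keys in this sketch."""
    cited_year = str(citation.get('year', '')).strip()
    doi_year = str(doi_metadata.get('year', '')).strip()
    if cited_year and doi_year and cited_year != doi_year:
        return False
    cited_surname = citation.get('last1', '').strip().lower()
    doi_surnames = {a.split()[-1].lower()
                    for a in doi_metadata.get('authors', []) if a.strip()}
    if cited_surname and doi_surnames and cited_surname not in doi_surnames:
        return False
    return True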
This is also from Clitoris; the demo added |doi-access=free (I modified it to use Module:Citation/CS1/sandbox to avoid the unrecognized parameter error):
{{cite journal/new |last=Smith |first=K. C. |first2=T. J. |last2=Parkinson |first3=S. E. |last3=Long |first4=F. J. |last4=Barr |title=Anatomical, cytogenetic and behavioural studies of freemartin ewes |journal=[[Veterinary Record]] |volume=146 |issue=20 |pages=574–8 |year=2000 |doi=10.1136/vr.146.20.574 |ref=harv |subscription=yes|doi-access=free }}
- Smith, K. C.; Parkinson, T. J.; Long, S. E.; Barr, F. J. (2000). "Anatomical, cytogenetic and behavioural studies of freemartin ewes". Veterinary Record. 146 (20): 574–8. doi:10.1136/vr.146.20.574. {{cite journal}}: Invalid |ref=harv (help); Unknown parameter |subscription= ignored (|url-access= suggested) (help)
Following that doi gets an abstract and, ultimately, a note that reads: "Access to the full text of this article requires a subscription or payment." Far from free. So, the |subscription=yes is correct. Should the bot add a free signal parameter when there is only one 'external' link and when the template includes |subscription=yes? This is contradictory.
—Trappist the monk (talk) 00:37, 23 October 2016 (UTC)[reply]
- Good point. The error comes from the same data source (HighWire Press). If this is frequent among records from this publisher, ChPietsch might consider reclassifying them? Otherwise I can blacklist this publisher downstream. I will also restrict the bot to templates without |subscription= and |registration=. − Pintoch (talk) 06:59, 23 October 2016 (UTC)[reply]
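That restriction amounts to a small guard before editing a template; a sketch, assuming the template exposes its parameters as a dict (helper name hypothetical):

def has_access_note(template):
    """Skip templates where an editor already set |subscription= or |registration=
    (any non-empty value), per the restriction above."""
    return any(str(template.get(p, '')).strip()
               for p in ('subscription', 'registration'))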
I ran the demo on Pluto and noticed another error, maybe a data source problem? For the citation to doi: 10.1007/s10569-010-9320-4 the bot wants to add a 'url' of http://www.springerlink.com/content/g272325h45517581/fulltext.pdf . However, that URL redirects to http://link.springer.com/article/10.1007%2Fs10569-010-9320-4 which requires a login/money for full text. —RP88 (talk) 06:16, 23 October 2016 (UTC)[reply]
- This bug is introduced in Dissemin; I think I can fix it upstream. − Pintoch (talk) 06:59, 23 October 2016 (UTC)[reply]
I have rolled out a new version that checks all links with Zotero's scrapers, which increases the processing time a lot but should give more accurate full-text detection. − Pintoch (talk) 18:10, 23 October 2016 (UTC)[reply]
I ran the demo on Dengue fever and noticed one item that was a little odd, but maybe not your issue. The bot failed to identify doi: 10.1002/14651858.CD003488.pub3 as freely available. The DOI resolves to http://onlinelibrary.wiley.com/doi/10.1002/14651858.CD003488.pub3/abstract , which hosts the full PDF (presumably via Cochrane Library's "green open access" program). —RP88 (talk) 13:33, 27 October 2016 (UTC)[reply]
- Yes, this is due to my recent decision to add one extra layer of full text availability detection. What happens in this case is that Zotero scrapes this page, returns a full text URL http://onlinelibrary.wiley.com/store/10.1002/14651858.CD003488.pub3/asset/CD003488.pdf?v=1&t=iusekwil&s=86c45ca3ee282309247c9e0cdc4b8d3779ecf544 , then we try to download this file to check it looks like a PDF, and we fail because of a 403 error. I suspect Wiley puts some protections on their full text URLs to make sure they are not shared (this is probably why they put some tokens after .pdf). So, we will not add any green locks on Wiley DOIs. This is bound to happen if we want to be absolutely sure that all the links we add lead to full texts: this is a precision/recall trade-off. I think we all agree the bot should have excellent precision. In the case you brought up, the bot does not make any change to the template: I think this is fine. There are surely cases where it adds another free URL or identifier without noticing that the DOI is free. Is this a serious issue? At least editors (or other bots!) can prevent it by adding |doi-access=free (in which case OAbot would not change the template). By the way, I have added instructions for publishers in case they want to make sure the bot can detect their full texts. (Wiley does not fully comply with the Google Scholar guidelines, as the link they put in the citation_pdf_url meta tag does not lead to a PDF file, but to an HTML file). − Pintoch (talk) 14:27, 27 October 2016 (UTC)[reply]
- Thanks for the explanation. —RP88 (talk) 14:44, 27 October 2016 (UTC)[reply]
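For context, the Google Scholar convention mentioned here is a citation_pdf_url meta tag on the article's landing page; a minimal, Zotero-independent way to read it could look like this sketch (a real implementation would use an HTML parser, since attribute order varies):

import re

CITATION_PDF_RE = re.compile(
    r'<meta[^>]+name=["\']citation_pdf_url["\'][^>]+content=["\']([^"\']+)["\']',
    re.IGNORECASE)

def extract_citation_pdf_url(html):
    """Return the full-text URL advertised via citation_pdf_url, if any."""
    m = CITATION_PDF_RE.search(html)
    return m.group(1) if m else None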
Here is another one that very likely has nothing to do with the bot and is more likely a data issue. On (225088) 2007 OR10, the cite to doi:10.3847/0004-6256/151/5/117 ideally should be found to match arXiv:1603.03090 ( https://arxiv.org/pdf/1603.03090.pdf ). —RP88 (talk) 15:06, 27 October 2016 (UTC)[reply]
- Yes, Dissemin's index is currently not fully up to date with BASE's, and this DOI mapping has been modified recently, so we don't detect this. This should normally be resolved as we process updates from BASE. − Pintoch (talk) 21:38, 28 October 2016 (UTC)[reply]
{{BAGAssistanceNeeded}} I believe all the errors reported here have been fixed. What are the next steps? − Pintoch (talk) 21:36, 4 November 2016 (UTC)[reply]
- Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Do a small run, post the results here please. — xaosflux Talk 23:42, 4 November 2016 (UTC)[reply]
Trial 1 (50 edits)
The bot was blocked at its 10th edit by the pywikibot error "Hit WP:AbuseFilter: Link spamming". The bot account does not look blocked, but I will not attempt any edit before getting your opinion on the matter. − Pintoch (talk) 01:03, 10 November 2016 (UTC)[reply]
- In Autism, the bot added hdl:11693/22959 and |hdl-access=free. That's sort of right, I guess. The article is available through a DOI link on the HDL page. But doi:10.1016/j.neuron.2015.09.016 (which does link to the article) is already present in the template, yet the bot appears to have ignored that fact.
- The Academia.edu links still require some sort of login, so the bot should not be adding them as free-to-read links. The ResearchGate links never quite finish loading (I know, not your problem; just annoying from a reader's perspective).
- —Trappist the monk (talk) 01:40, 10 November 2016 (UTC)[reply]
- Thanks for spotting this problem with the HDL. The bot is misled by the PDF placeholder on the institutional repository. The DOI is detected as free to read by Dissemin, but Zotero fails to confirm that. I will do two fixes: do not consider PDF files as full texts if they are too short (say, fewer than 3 pages), and do not add any link to a citation with a DOI that Dissemin detects as free to read (but do not add the green lock unless Zotero confirms it is free to read).
- Concerning links to academia.edu, are you sure you need to login when coming from Wikipedia? (Have you tried clicking on the links as they appear in the wiki pages?) Or do you think we should not add these links because they require a registration when coming from outside Wikipedia? It seems to me that a debate about the sources we want to include (and more generally on the bot itself) would be helpful, perhaps at WP:VPP? − Pintoch (talk) 10:11, 10 November 2016 (UTC)[reply]
- Relying on page-count doesn't seem to me to be a good idea. I've seen pdf 'articles' that were less than a page in length – obituaries, book reviews, retractions, etc.
- Perhaps it's simply a definition problem where free-to-read includes the caveat that some 'free' sources are only free-to-read when linked from Wikipedia.
- —Trappist the monk (talk) 11:02, 10 November 2016 (UTC)[reply]
- I agree looking at the page count is not great, but what else can we do? It is not an issue if we skip very short articles (we just won't add a full text link to them, but we will not tag them as paywalled or anything like that). Ideally I could have a look at CiteSeerX's filtering heuristics; I know they have put some effort into filtering out PDF files that are not full texts. But integrating that into the pipeline will almost surely be quite painful. − Pintoch (talk) 11:10, 10 November 2016 (UTC)[reply]
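A rough sketch of the length check being discussed, using only a byte-level heuristic; a real implementation would use a proper PDF library, and the three-page threshold is just the value floated above:

import re

PAGE_OBJECT_RE = re.compile(rb'/Type\s*/Page\b')

def long_enough(pdf_bytes, min_pages=3):
    """Heuristically reject PDFs that are too short to be full texts.
    Counting /Type /Page objects can miscount on compressed object streams,
    which is why short documents are merely skipped, never tagged as paywalled."""
    if not pdf_bytes.startswith(b'%PDF'):
        return False
    return len(PAGE_OBJECT_RE.findall(pdf_bytes)) >= min_pages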
- Pintoch the abusefilter issue should be resolved now (I adjusted the filter) - ping me if you hit it again. — xaosflux Talk 04:48, 10 November 2016 (UTC)[reply]
- Thanks a lot! I will resume the bot after fixing the problems above. − Pintoch (talk) 10:11, 10 November 2016 (UTC)[reply]
- {{OperatorAssistanceNeeded}} Any progress on moving back to trials? — xaosflux Talk 20:18, 2 December 2016 (UTC)[reply]
The bot has made 20 edits so far, and I have spotted a few mistakes:
- Considering that some sources were freely available when they were not: for instance doi:10.1039/js8641700112, for which Zotero finds a PDF URL, but the PDF cannot be downloaded without authentication. The other identifiers where the bot had similar problems were doi:10.1107/S0108270102002032, doi:10.1107/S0567739476001551, doi:10.1163/156853897X00297 and hdl:11693/22959 (already reported above).
- Added a link to a source that was not the one designated by the citation: https://www.academia.edu/12441837 was added as a citation to doi:10.1214/009053604000001048. The association between this URL and this DOI comes from Academia.edu (see the relevant OAI record).
I am taking the following measures:
- instead of just doing a HEAD request on the URLs that are supposed to lead to a PDF, the bot now downloads the PDF and checks that it is a valid PDF file (this measure was implemented earlier but was not effective due to a cache in the pipeline).
- as I have no control over how academia.edu extracts its DOIs, and as the usefulness of these links was questioned by Trappist, the bot will now stop adding links to this website (a small domain filter, sketched below). ResearchGate has similar issues, as DOAI.io's index of it is getting out of sync with the website, so the bot will stop querying DOAI.io altogether. I think it would be good to have a debate at WP:VPR to decide whether we want these links or not: if so, it should not be very hard to convince these two sources to adapt their metadata accordingly.
Editing will resume once I am satisfied with my implementation of these fixes. − Pintoch (talk) 15:09, 4 December 2016 (UTC)[reply]
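The blacklisting of Academia.edu links described above amounts to a small domain filter; a minimal sketch, with a hypothetical helper name and an illustrative domain set:

from urllib.parse import urlparse

# Domains the bot should no longer add links from, per the decision above.
BLACKLISTED_DOMAINS = {'academia.edu', 'www.academia.edu'}

def allowed_candidate(url):
    """Drop candidate links hosted on blacklisted domains."""
    return urlparse(url).netloc.lower() not in BLACKLISTED_DOMAINS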
Trial complete. Now I just need to find the time to analyze the results. − Pintoch (talk) 09:35, 6 December 2016 (UTC)[reply]
Trial 1 results
Here is a report for the last 30 edits made by the bot, which are the edits made with the latest version of the code. The bot processed about two thousand citation templates (I don't have the exact figure).
- The bot added |doi-access=free on 69 DOIs, and I checked manually that they were all free to read from an IP address that is not covered by institutional subscriptions.
- The bot added additional free-to-read links to other citations. All the links it added were free to read. I have checked that the free versions all look faithful to the "published" one (if not identical), when my institutional access allowed me to go through the paywall.
| Link added by OAbot | Publisher link | Publisher access |
| --- | --- | --- |
| CiteSeerX: 10.1.1.659.5717 | doi:10.1037/a0029016 | $11.95 |
| hal | doi:10.1021/jp9077008 | $40.00 |
| hal | doi:10.1016/j.nuclphysa.2003.11.00 | $39.95 |
| hdl:1807/50099 | JSTOR 23499358 | registration |
| hdl:1807/50095 | JSTOR 23499354 | registration |
| CiteSeerX: 10.1.1.680.5115 | doi:10.1080/10888705.2010.507119 | EUR 39.00 |
| CiteSeerX: 10.1.1.145.4600 | doi:10.1145/102782.102783 | $15 |
| hdl:10536/DRO/DU:30050819 | doi:10.1016/S0140-6736(12)61728-0 | registration |
| CiteSeerX: 10.1.1.454.4197 | doi:10.1016/S0140-6736(07)61575-X | registration |
| hdl:10419/85494 | doi:10.1093/cje/25.1.1 | $29.00 |
| arXiv:1311.2763 | doi:10.1140/epjh/e2013-40037-6 | EUR 41.94 |
| hdl:10915/2785 | doi:10.1002/andp.19053220806 | free |
| hdl:10915/2786 | doi:10.1002/andp.19053221004 | free |
| hdl:10144/125625 | http://ajcn.nutrition.org/content/85/1/218.short | free |
| CiteSeerX: 10.1.1.529.1977 | doi:10.1016/j.ympev.2005.10.017 | $39.95 |
As you can see, in three cases the published version was also free. This happens when the bot fails to discover the full text from the publisher. This is bound to happen as not all publishers comply with the Google Scholar guidelines or have an up-to-date Zotero scraper. This makes the HDL link less useful, but the bot did not add any wrong information, as it does not mark any link as paywalled. − Pintoch (talk) 18:52, 6 December 2016 (UTC)[reply]
{{OperatorAssistanceNeeded}} @Pintoch: Sorry... this sorta got lost in the mix of things. Given there were some kinks in the prior run, I'm thinking we should probably do another trial run (presuming those kinks would be fixed?). Sound like a plan? --slakr\ talk / 05:19, 10 February 2017 (UTC)[reply]
- Sure, no worries! But I propose we wait for the second RFC about the access locks in CS1/2 to be closed (the first one was closed a few days ago). This might change the behaviour of CS1, which might require some adjustments to the bot. − Pintoch (talk) 11:34, 10 February 2017 (UTC)[reply]
- As the RFC closure indicates that consensus for visual indication of access levels has not been reached, I propose to run the bot without adding these parameters in any template. I will adapt the code for that. Can you let me know how many edits I should do in this new version? − Pintoch (talk) 21:05, 10 February 2017 (UTC)[reply]
- Hmm... I'm only a little concerned with the whole automatic section on that RFC. I think the intent there was default behavior (i.e., uninformed), which seems to be the opposite of what this bot would do (i.e., informed, intelligent). Of course, it doesn't look like the current template is rendering the lock icons anyway (at least, not based on the field the bot was using). Also, I don't see any major opposition input here. Still, it does seem like this would affect quite a few pages. Basically, I'll assume that intelligent tagging isn't going to be an issue, at least as far as getting the bot working, so it's probably safe to do another batch to iron out kinks: Approved for extended trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. If we feel we need some sort of separate RFC for increased attention, we can go from there. --slakr\ talk / 06:25, 14 February 2017 (UTC)[reply]
- Hi Slakr, I don't understand how the discussion about automatic flagging on the RFC could have any impact on what the bot does, as by definition these are just rendering changes that are not linked to any parameters. In addition, what I propose is to run the bot without any access-signaling features, ignoring completely all locks and not adding any new ones, so not doing even intelligent tagging. The locks *are* currently implemented but it looks like the community does not wish to see them popping around yet. − Pintoch (talk) 07:38, 14 February 2017 (UTC)[reply]
- Ah, indeed, it looks like this edit does indeed add them (e.g., [1]) as of that template revision. When I say "intelligent" I mean that, at least presumably based on the discussion above, the bot's actually checking against a data source to determine the status of the link (rather than assuming a given site is inherently paywalled, for example), regardless of the icon or how the template interprets the parameter. But yeah, either way, I think there's obviously not consensus to say it should be adding icons per se, but I think that's more of a template issue (i.e., should the parameter be filled in but just not rendered with an icon?). I was under the assumption that the RFC was more about how the template interprets the field and renders it. I basically just did the extended trial so you can demonstrate any tweaks/fixes. I'm not saying you have to do anything in addition to whatever you were comfortable with. :P --slakr\ talk / 06:10, 15 February 2017 (UTC)[reply]
Trial 2 (50 edits)
In progress. The bot is running, and here are the results so far. − Pintoch (talk) 19:51, 26 February 2017 (UTC)[reply]
Trial complete. − Pintoch (talk) 09:17, 27 February 2017 (UTC)[reply]
Trial 2 results
CiteSeerX and ELNEVER
I just had a run-in with this bot on abstract data type. It added a CiteSeerX link to a copyrighted journal paper, for which CiteSeerX traced the provenance of its copy of the paper to a course web site not run by the paper's authors. I believe that this sort of link violates WP:ELNEVER: making that paper available to course students may have been fair use, but that does not make it fair use for CiteSeerX or for us. In particular, this material fails the tests in ELNEVER of being a website run by the content author, licensed by the author, or being compliant with fair use. And I believe that it specifically violates the language in ELNEVER that "This is particularly relevant when linking to sites such as Scribd or YouTube, where due care should be taken to avoid linking to material that violates copyright."
CiteSeerX links may sometimes be appropriate, when they are derived from versions made available by the content authors themselves, although in general I would prefer to see direct links to the author copies. But checking whether such a link is appropriate or inappropriate is clearly not something a bot can be trusted to have the judgement to do correctly.
Please cease this copyright-violating misbehavior. —David Eppstein (talk) 22:58, 26 February 2017 (UTC)[reply]
- PS I just checked the three CiteSeerX links listed above in the progress section. (The other handle-based links are non-problematic.) The link added in apatosaurus is derived from a Russian piracy site. The link added in amine is derived from two different author copies. The link added in ALGOL is derived from the official publisher copy and an author copy. So one in three of these links was bad. —David Eppstein (talk) 23:04, 26 February 2017 (UTC)[reply]
- Hi @David Eppstein: thanks for your help to analyze these edits! Feel free to add a column to the table, with your assessment of the legal status of the added links. I will let the bot run for the 50 edits as agreed by BAG, so that we get a better picture of what the bot does. Cheers. − Pintoch (talk) 23:33, 26 February 2017 (UTC)[reply]
So, based on the analysis above it looks like the bot should not add links from CiteSeerX. This would solve all the copyright violations observed in this sample. Personally, I am quite surprised that these links are not welcome, for the following reasons:
- CiteSeerX is not exactly a pirate website: it is run by an American university, complying with DMCA takedown requests and respecting robots.txt crawling policies. It is mirrored by many other websites, including core.ac.uk (run by a UK university).
- Editors routinely use this service in citations, and other services where an even larger proportion of the papers infringe copyright (academic social networks, for instance).
- The suggestion that direct links to papers stored on personal home pages would be superior is surprising: these links have a much shorter life expectancy. Given all the effort around the archival of sources in Wikipedia, for instance with the Wayback Machine, I would find it counterintuitive to run a bot that could directly add links to a stable archive but instead prefers unstable sources. Legally speaking, these home pages are generally equally problematic if the document is an unauthorized copy of a published version of a paper.
But I agree that the links flagged by David Eppstein indeed violate WP:ELNEVER (which I find out of touch with current practice, but this is not the place to discuss that policy). Therefore I propose to remove CiteSeerX from the links added by the bot. What do you think? − Pintoch (talk) 20:15, 4 March 2017 (UTC)[reply]
- That seems like the simplest solution to me. I don't think there are significant or widespread problems with the other types of links added by the bot, at least judging from the sample so far. —David Eppstein (talk) 20:17, 4 March 2017 (UTC)[reply]
I would be quite saddened by the complete loss of CiteSeerX as a resource because a minority of papers are uploaded there against copyright. I'm going to brainstorm here and suggest:
1. The bot adds |citeseerX=<!--10.1.1.xxx.xxxx REVIEW FOR COPYRIGHT BEFORE UNCOMMENTING-->
2. The bot adds a list of |citeseerX=10.1.1.xxx.xxxx on the talk page, letting editors know these links might be acceptable free resources.
Headbomb {talk / contribs / physics / books} 22:16, 4 March 2017 (UTC)[reply]
- Personally, I'm much in favour of option 1. Headbomb {talk / contribs / physics / books} 22:16, 4 March 2017 (UTC)[reply]
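A minimal sketch of what option 1 (and, for completeness, option 2) could produce when the bot serialises its suggestions; the exact parameter spelling and comment wording are illustrative only:

def commented_citeseerx_param(citeseerx_id):
    """Option 1: emit a commented-out parameter that a human must review first."""
    return ('|citeseerx=<!-- {} REVIEW FOR COPYRIGHT BEFORE UNCOMMENTING -->'
            .format(citeseerx_id))

def talk_page_listing(citeseerx_ids):
    """Option 2: list candidate identifiers on the talk page for human review."""
    header = ('OAbot found CiteSeerX records that may match citations in this '
              'article; please check their copyright status before adding them:')
    return header + '\n' + '\n'.join('* citeseerx: {}'.format(i) for i in citeseerx_ids)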
- If we're going to go that route, why not both? But "review for copyright" is too vague — it needs to be made clear what is to be reviewed and why. Also, I think "uploaded there" is an inaccurate description of how this happens — people don't upload their papers to CiteSeerX, they upload them elsewhere and then CiteSeerX finds them and links them. So it is basically a search engine for papers, not unlike Google scholar, but with the advantage that it keeps a "cached copy" so that it still works if the original gets taken down. —David Eppstein (talk) 22:23, 4 March 2017 (UTC)[reply]
- The point is to do the hard work of finding the paper in the first place. Meat popsicles can then decide if the link is compliant or not. The comment message can be tweaked as needed, and I've got no objection to having both done. Headbomb {talk / contribs / physics / books}
- I also think it would be interesting to make a semi-automated version of the bot where the links could be checked by editors. This could be based on an adaptation of the current web interface. Users would then perform the edits with their account, via the interface. The bot is currently throwing away a lot of sources on many different grounds, so a semi-automated version could add many more links with the help of editors.
- But for now I would like to restrict the bot to the edits which are clearly uncontroversial, so that it can be approved and run. − Pintoch (talk) 10:50, 5 March 2017 (UTC)[reply]
- The point is to do the hard work of finding the paper in the first place. Meat popsicles can then decide if the link is compliant or not. The comment message can be tweaked as needed, and I've got no objection to having both done. Headbomb {talk / contribs / physics / books}
- If we're going to go that route, why not both? But "review for copyright" is too vague — it needs to be made clear what is to be reviewed and why. Also, I think "uploaded there" is an inaccurate description of how this happens — people don't upload their papers to CiteSeerX, they upload them elsewhere and then CiteSeerX finds them and links them. So it is basically a search engine for papers, not unlike Google scholar, but with the advantage that it keeps a "cached copy" so that it still works if the original gets taken down. —David Eppstein (talk) 22:23, 4 March 2017 (UTC)[reply]
A user has requested the attention of a member of the Bot Approvals Group. Once assistance has been rendered, please deactivate this tag by replacing it with {{t|BAG assistance needed}}. Summary of the situation: the test run is complete. One issue about copyright violations was raised, and the proposed solution is to blacklist a particular source (CiteSeerX) so that the bot does not add links from it. − Pintoch (talk) 19:05, 6 March 2017 (UTC)[reply]
Withdrawn by operator. Six months have passed since the submission of this request for approval. I am not blaming the BAG at all; they are under a very heavy workload currently, and we are all volunteers. However, I have the feeling that this request is not going anywhere: to reach a satisfactory precision, we have decreased the recall so much that the bot adds very few links. By doing so we are throwing away many useful links. An unsupervised bot is not the appropriate format for this project. Therefore, I am converting this bot to a different tool, where edits are proposed by users who perform them with their own account. − Pintoch (talk) 14:42, 17 April 2017 (UTC)[reply]
- Please don't withdraw this. I would review, and wholeheartedly support the task, but having been so involved in the discussion here and elsewhere, I'm uncomfortable approving it myself. Several of its links are great, and are completely fine, like arxivs, hdls, and dois. Having the bot exclude one of its many possible sources shouldn't be a reason to consider this a bad bot, or decline it. Headbomb {t · c · p · b} 14:50, 17 April 2017 (UTC)[reply]
- Thanks a lot Headbomb for your support! But I genuinely think a semi-supervised approach is better here. It has many advantages, the main one being that it will allow us to add many more links. The new tool can be accessed at https://tools.wmflabs.org/oabot/ (it is still evolving). I would recommend using it on your own sandbox first (feel free to use mine too). − Pintoch (talk) 20:27, 17 April 2017 (UTC)[reply]
Closing per OP's request. — HELLKNOWZ ▎TALK 20:33, 17 April 2017 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.