Wikipedia:Bots/Requests for approval/Pi bot 3
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Operator: Mike Peel
Time filed: 20:56, 28 November 2017 (UTC)
Automatic, Supervised, or Manual: Automatic
Programming language(s): Python (pywikibot)
Source code available: on bitbucket
Function overview: Look through references to Cochrane (organisation) reports to check for updates to them; when found, tag with {{update inline}} [1], and add to the report at Wikipedia:WikiProject Medicine/Cochrane update/August 2017 for manual checking by editors [2]. Also archive report lines marked with {{done}} to the archive at Wikipedia:WikiProject Medicine/Cochrane update/August 2017/Archive [3] [4].
Links to relevant discussions (where appropriate): This was previously run by @Ladsgroup on an ad-hoc basis. I was asked to take over the running of it on a more regular basis by @JenOttawa:. See [5] and [6].
Edit period(s): Once per month
Estimated number of pages affected: Depends on the number of Cochrane updates each month, and the number of references to them. Likely to be a number in the tens rather than the hundreds.
Namespace(s): Mainspace and Wikipedia
Exclusion compliant (Yes/No): No, not relevant in this situation
Function details: The code searches for cases of "journal=Cochrane" in Wikipedia articles, extracts the PubMed ID from the reference, then fetches the webpage from PubMed and looks for an "Update in" link. If an update is available, then it marks the reference as {{update inline}}, with a link to the updated document, and adds it to the report at Wikipedia:WikiProject Medicine/Cochrane update/August 2017, where users manually check to see if the article needs updating. If it does, then they can update the reference and mark it as {{done}} in the report, and the bot then archives the report line when it next runs. If it does not, then it can be marked with <!-- No update needed: ID_HERE --> in the article code, and the bot won't re-report the outdated link in the future. I've made some test edits under my main user account to demonstrate how the bot works; links are in the function overview above. Mike Peel (talk) 20:56, 28 November 2017 (UTC)
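The per-article steps described above can be sketched roughly as follows. This is a minimal illustration rather than the bot's actual source: the function names and the sample wikitext are hypothetical, the PubMed fetch is left out entirely, and it assumes that the ID in the "No update needed" comment is the PMID of the reference being checked.

```python
import re

# Matches "|pmid=12345|" with optional whitespace around each part.
PMID_PATTERN = re.compile(r'\|\s*pmid\s*=\s*(\d+)\s*\|')


def find_pmids(text):
    """Return every PubMed ID cited in the wikitext, in order of appearance."""
    return PMID_PATTERN.findall(text)


def is_suppressed(text, pmid):
    """True if an editor left the 'No update needed' comment for this ID.

    Assumption: the ID in the comment is the PMID of the cited reference.
    """
    return '<!-- No update needed: {} -->'.format(pmid) in text


def pmids_to_check(text):
    """PMIDs that should still be looked up for an 'Update in' link."""
    return [p for p in find_pmids(text) if not is_suppressed(text, p)]


# Hypothetical wikitext with placeholder PMIDs.
sample = (
    "{{cite journal |title=A |journal=Cochrane Database Syst Rev |pmid=111 |year=2010}}"
    "<!-- No update needed: 111 -->"
    "{{cite journal |title=B |journal=Cochrane Database Syst Rev | pmid = 222 |year=2012}}"
)
print(pmids_to_check(sample))  # ['222']
```

Note that this pattern only matches when another parameter follows the PMID, so a `pmid` that is the last parameter before the closing braces would be missed.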
Discussion
- Comment: Is text like "journal=The Cochrane database of systematic reviews" (as in Postpartum bleeding) or "journal = Cochrane Database of Systematic Reviews" (as in Common cold) or the presumably incorrect "title=Cochrane Database of Systematic Reviews" (as in Common cold) or "journal = Cochrane Database Syst Rev" (as in Common cold) relevant to this request? You might want to include those variations. – Jonesey95 (talk) 21:08, 28 November 2017 (UTC)
- @Jonesey95: The code that's currently used to select articles is
generator = pagegenerators.SearchPageGenerator('insource:/\| *journal *= *.+Cochrane/', site=site, namespaces=[0])
That was written by @Ladsgroup, and I'm not sure how to modify it to catch more cases. It also currently returns the message "WARNING: API warning (search): The regex search timed out, only partial results are available. Try simplifying your regular expression to get complete results. Retrieving 50 pages from wikipedia:en." Once the articles are selected,
pmids = re.findall(r'\|\s*?pmid\s*?\=\s*?(\d+?)\s*?\|', text)
is run on the article text to find the references to update, which will actually catch more than just the Cochrane reviews in the article, but only the references with updates are touched by the code. TBH, I'm not an expert in regexes, so any suggestions you have to improve these would be very welcome! Thanks. Mike Peel (talk) 21:17, 28 November 2017 (UTC)
- Insource searches have a very low timeout value, so anything with a mildly complex regex will time out. See T106685 for some details. The only way I know of to get around it is to search for multiple regexes in succession, like this:
- insource:/\| journal =.+Cochrane/
- insource:/\| journal=.+Cochrane/
- insource:/\|journal =.+Cochrane/
- insource:/\|journal=.+Cochrane/
- It looks like the regex you have will catch all of the above cases except the junky "title" instance, which should be fixed manually by someone who knows the right way to fix it. – Jonesey95 (talk) 00:45, 29 November 2017 (UTC)
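The four simpler searches listed above could be generated rather than typed out by hand. The following is only a sketch: `insource_queries` is a hypothetical helper, and each resulting string would still need to be passed to a search generator (such as the `SearchPageGenerator` call quoted earlier) one at a time.

```python
from itertools import product


def insource_queries(params=("journal",), keyword="Cochrane"):
    """Build one simple insource: query per whitespace variant.

    Each variant fixes the spacing before and after the parameter name,
    so no single regex has to handle optional whitespace.
    """
    queries = []
    for param, before, after in product(params, (" ", ""), (" ", "")):
        queries.append(
            "insource:/\\|{}{}{}=.+{}/".format(before, param, after, keyword)
        )
    return queries


for query in insource_queries():
    print(query)
```

Passing `params=("journal", "title")` yields eight queries, covering the same spacing variants for a second parameter name.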
- @Jonesey95: I've added a loop that runs each of those regexes in turn, and just for the fun of it I've also added the same set for 'title' as well as 'journal' so it'll try to catch those odd cases. It currently checks 6576 Wikipedia articles in total, which will include duplicates (since I don't currently filter them out - is there a good way to merge and de-duplicate the return values from SearchPageGenerator or PreloadingGenerator?). While 6 out of 8 of the regexes run without timeouts, the last two do still return the warning, but they're "insource:/\|title =.+Cochrane/" and "insource:/\|title=.+Cochrane/" - so if there's not a good way around that then maybe we just live with it (those two queries return 98 and 304 results respectively, which is a lot less than some of the others, so this is a bit odd).
- I'd like to set this going for a full run soon, if that would be OK? Thanks. Mike Peel (talk) 21:36, 1 December 2017 (UTC)
- @Mike Peel: Re "is there a good way to merge and de-duplicate the return values", you could maintain a list in-memory of the page IDs/titles that have been processed and skip anything that has shown up before. That may or may not be helpful depending on the amount of duplication. Anyway, I have a broader question. As you said above, the bot actually checks all PMIDs in a given page for updates, not just the Cochrane-related ones; this includes logging said non-Cochrane-related updates on the Cochrane updates page. Is there any potential for this to be a problem? Alternatively, would it be useful to potentially expand the task scope to all PMIDs? — Earwig talk 05:59, 12 December 2017 (UTC)
- @The Earwig: De-duplicating: that's true, although I was hoping there might be a built-in option. :-) The numbers are fairly small here, and the code should cope fine with a second pass through a page (it'll see the messages left by any previous pass and not do anything). On checking PMIDs - @JenOttawa: can probably answer this better than me, but my understanding is that most PMIDs will never be updated since they're one-off articles rather than part of a series like the Cochrane ones are, so while we can check for updates to them they won’t be flagged by the bot. If there are any that aren’t Cochrane-related that do have an update, then they’ll be investigated by a human after being posted to the Cochrane page, and we can figure out how to deal with them then. Thanks. Mike Peel (talk) 14:24, 12 December 2017 (UTC)
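The in-memory approach suggested above might look something like this. It is a sketch under stated assumptions: `unique_items` is a hypothetical helper, and plain title strings stand in for the page objects a real search would return.

```python
from itertools import chain


def unique_items(iterables, key=lambda item: item):
    """Yield items from several iterables, skipping any key already seen."""
    seen = set()
    for item in chain.from_iterable(iterables):
        k = key(item)
        if k in seen:
            continue  # already processed via an earlier query
        seen.add(k)
        yield item


# Stand-ins for the results of two overlapping searches.
results_a = ["Common cold", "Meningitis"]
results_b = ["Meningitis", "Postpartum bleeding"]
print(list(unique_items([results_a, results_b])))
# ['Common cold', 'Meningitis', 'Postpartum bleeding']
```

With pywikibot page objects, something like `key=lambda page: page.title()` could serve as the de-duplication key. Because the helper is a generator, pages are still processed lazily as the searches return them.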
- Thanks for helping here The Earwig and Mike Peel. In my experience, most other PMIDs are not updated like Cochrane Reviews are; however, I cannot speak for all journals/publishing companies. Other publications are certainly retracted/withdrawn, but I am also not sure what happens to the PMIDs in those cases. This bot ran for quite a few years and seemed to work very well and be accurate. I performed a large number of the updates (at least 100). This means that I manually went through the citation needed tags + PMID list generated, and there were very few errors. I never saw an instance where a non-Cochrane Review was flagged with the citation needed tag, for example. I hope this helps and somewhat answers the question. We have spent considerable time on this over the past 12 months, so we are now fairly caught up with the updates. In May 2017 we had about 300 updates to perform. I would expect that a full run of the bot would pull about 50-75 new updates needed (August-December updates that were published by Cochrane), and then if we run it monthly, it would pull about 15-20 a month. This means that the volunteers will be able to stay fairly up to date with the updates, and if there are errors (other reviews pulled, etc.) we will be able to manually correct them within a month or so. If you have any other questions, or if there is anything that I can help with, please let me know. I am still learning about this, but we greatly appreciate your assistance on this! JenOttawa (talk) 14:38, 12 December 2017 (UTC)
- Thanks for the prompt replies, everyone. This sounds good to me, so let's move forward with a trial run. Since the plan is for monthly runs, let's have the bot complete a full round of updates for this month and we can evaluate it from there. Approved for trial. Please provide a link to the relevant contributions and/or diffs when the trial is complete. — Earwig talk 17:57, 12 December 2017 (UTC)
- @The Earwig: Thanks, it is now running. Mike Peel (talk) 18:17, 12 December 2017 (UTC)
- It's taking longer to run than I was expecting (due to the number of unique pubmed pages it's fetching), but the edits so far seem to be OK. I'm heading offline for the eve now, so if there are any issues then please abort it by blocking the bot. Otherwise, I'll check things in the morning. Thanks. Mike Peel (talk) 23:28, 12 December 2017 (UTC)
- Thanks again to both of you. Looks good so far. JenOttawa (talk) 01:28, 13 December 2017 (UTC)
- 90% of what the bot is marking for updates are "withdrawn" reviews. I have reverted most of them and updated the one or two that were newer and not withdrawn.
The bot needs to exclude withdrawn articles. It also needs to look for the newest version, not just the next newer version. Best Doc James (talk · contribs · email) 05:30, 13 December 2017 (UTC)
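One way to read the second request is to keep following "Update in" links until no newer version exists. The sketch below illustrates that idea only; it is not confirmed to be what the bot does. `latest_version` and `next_update` are hypothetical names, and a dictionary of placeholder IDs stands in for the PubMed lookup.

```python
def latest_version(pmid, next_update):
    """Follow the chain of updates from pmid to the newest version.

    next_update(pmid) should return the PMID of the next newer version,
    or None when there is none. Already-visited IDs end the walk so a
    circular chain cannot loop forever.
    """
    seen = {pmid}
    current = pmid
    while True:
        newer = next_update(current)
        if newer is None or newer in seen:
            return current
        seen.add(newer)
        current = newer


# Placeholder IDs: "1" was updated by "2", which was updated by "3".
updates = {"1": "2", "2": "3"}
print(latest_version("1", updates.get))  # 3
```

A withdrawn-review check would still be needed at each step, since the newest entry in a chain may itself be a withdrawal notice.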
- OK, I think this test run has shown two issues - the need to handle withdrawn articles better, and also an intermittent problem with fetching the webpages (which is why the bot stopped at ~0200UT without finishing the run). I'll work on improving those before requesting another test run. Thanks. Mike Peel (talk) 19:00, 13 December 2017 (UTC)
- @The Earwig: I've now updated the code to ignore updates that have themselves been withdrawn (per @Doc James:), and I'm also using a different package to fetch the webpages that will hopefully avoid timeouts. So I'm now ready to try another test run, if that's OK with you? Thanks. Mike Peel (talk) 16:30, 2 January 2018 (UTC)
- Can we do 10 and I will then check? Best Doc James (talk · contribs · email) 04:39, 3 January 2018 (UTC)
- @Mike Peel: That's OK with me. Could you also have the bot add the |date= parameter so AnomieBOT doesn't have to follow it around? — Earwig talk 05:01, 3 January 2018 (UTC)
- Thanks - I've modified it to edit a maximum of 10 articles, and I've added the date parameter. I'll set it running later today. Thanks. Mike Peel (talk) 06:36, 3 January 2018 (UTC)
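A dated tag could be assembled along these lines. This is a sketch only: the function name, the use of a `reason` parameter, and the PubMed URL form are assumptions rather than the bot's actual output. The `Month YYYY` date format is the one maintenance templates conventionally use.

```python
from datetime import date

# Spelled out to avoid depending on the system locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def update_inline_tag(new_pmid, today=None):
    """Build a dated {{update inline}} tag pointing at the newer review."""
    today = today or date.today()
    stamp = "{} {}".format(MONTHS[today.month - 1], today.year)
    # Assumed link form for the updated document.
    link = "https://www.ncbi.nlm.nih.gov/pubmed/" + str(new_pmid)
    return "{{update inline|date=" + stamp + "|reason=Updated version " + link + "}}"


print(update_inline_tag("27121755", date(2018, 1, 3)))
# {{update inline|date=January 2018|reason=Updated version https://www.ncbi.nlm.nih.gov/pubmed/27121755}}
```

Supplying the date at tagging time means no follow-up edit is needed just to add it.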
- Now running. The addition of whitespace at [7] was unexpected, but should be fixed in the next run. Thanks. Mike Peel (talk) 07:17, 3 January 2018 (UTC)
- Restarted due to a bug in the code for the date parameter, now fixed. As a result, the whitespace thing I mentioned in the line above is now fixed. Thanks. Mike Peel (talk) 15:28, 3 January 2018 (UTC)
- @The Earwig: Now Done with 10 pages edited. @Doc James: spotted a case where a withdrawn one wasn't caught as the pubmed website didn't use the same punctuation after "WITHDRAWN" in the title, which I've now worked around (by not including the punctuation in the check). The bot did need prodding at one point as it hung again on fetching a page from pubmed, so I'll look into other ways of doing that, but I'd like to do a complete run next please. Thanks. Mike Peel (talk) 16:00, 4 January 2018 (UTC)
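The workaround described above, checking for the word without the punctuation that follows it, could be as simple as the sketch below. `is_withdrawn` is a hypothetical name and the sample titles are invented for illustration; the bot's real check may differ.

```python
def is_withdrawn(title):
    """True if a review title is marked as withdrawn.

    Only the leading word is tested, so "WITHDRAWN:", "WITHDRAWN." and
    "WITHDRAWN -" are all caught. The comparison is case-sensitive so that
    an ordinary title starting with "Withdrawn ..." is not flagged.
    """
    return title.strip().startswith("WITHDRAWN")


# Invented example titles.
print(is_withdrawn("WITHDRAWN: Example intervention for example condition"))  # True
print(is_withdrawn("WITHDRAWN. Example intervention for example condition"))  # True
print(is_withdrawn("Example intervention for example condition"))             # False
```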
- @Mike Peel: Is it intended for all updates to go to a page titled "August 2017"? This seems confusing. Other than that, I don't have any real concerns. — Earwig talk 20:33, 4 January 2018 (UTC)
- Thanks for working on this Mike Peel and Earwig. At this time, I do not have a concern about the updates going to the August 2017 page. Unless we were to put a re-direct in, the volunteers are already using this page and Mike had added the function to archive updates marked as "done". Thanks again, JenOttawa (talk) 00:41, 5 January 2018 (UTC)
- Okay reviewed them all. Looks good. I think we can do a batch of 100 next? What do you think User:JenOttawa? Doc James (talk · contribs · email) 09:57, 5 January 2018 (UTC)
- It's easy to change the page if needed, it's using the "August 2017" page as per JenOttawa. I've set it running again now, it will edit a maximum of 500 pages this run, which I anticipate will be a complete set. Then we can switch to running it monthly via a cron job if formally approved. Thanks. Mike Peel (talk) 10:24, 5 January 2018 (UTC)
- Thanks Doc James and Mike Peel. I appreciate you reviewing the updates added so far. On the new updates that I have reviewed, I do not see the "update needed" tag added to the WP article. For example,
- Article Meningitis: old review PMID:18254003, new review PMID:27121755
- Everything else looks great so far. The update needed tags are not 100% necessary, how do you feel Doc James? Thanks again, Jenny JenOttawa (talk) 01:20, 6 January 2018 (UTC)
- @JenOttawa: The bot added it, but @Doc James: then updated the ref and didn't mark it as done. Thanks. Mike Peel (talk) 06:26, 6 January 2018 (UTC)
- The latest run just completed, with 6738 pages checked. 0 tagged in this run, so they were all tagged in the previous one. If everything's OK, then perhaps this can be approved/closed, and I'll set it to run monthly from now on. Thanks. Mike Peel (talk) 11:20, 6 January 2018 (UTC)
- @The Earwig: The bot ran OK at the start of this month, is it OK to approve/continue doing so monthly now, or is there anything else we need to talk about here first? Thanks. Mike Peel (talk) 21:45, 11 February 2018 (UTC)
- Sorry Mike, lost track of this request. All seems good now. Thanks for your work on this! Approved. — Earwig talk 05:16, 12 February 2018 (UTC)
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.