Wikipedia:WikiProject Bluelink patrol/Workshop

This subpage is intended for editors of a technical bent. It is, of course, open to anyone; but if you feel that the waters are closing over your head while you read it, they probably are.

Backlinks

Moved from Wikipedia talk:WikiProject Bluelink patrol

– Narky Blart

Added to watchlist. GoingBatty enjoys a bespoke alert service currently unavailable to anyone else. If a handful more would like this same service it would be quick and easy to replicate on request. If it were to be a public service where anyone can sign up that would be a lot of work. -- GreenC 17:11, 3 January 2021 (UTC)

@GreenC: Understood. As with advanced WP privileges, you would have to argue a convincing case to be granted it. Narky Blert (talk) 17:41, 3 January 2021 (UTC)

@GreenC: I've created User:Certes/Backlinks. If that could be checked, with a note of where any output appears, that would be helpful. I intend to prune the list of anything that produces more false positives than useful leads. If the initial list is too long then we could start with a subset such as one initial letter. Certes (talk) 17:54, 3 January 2021 (UTC)

Certes, the alerts are sent via email. Email me your email and I'll set it up. -- GreenC 18:13, 3 January 2021 (UTC)

Done; thanks. m:Community Wishlist Survey 2021/Watchlists/Link watchlist might also help but probably didn't attract enough support to get implemented. Certes (talk) 19:35, 3 January 2021 (UTC)

It will email once a day around 9 or 10 GMT. On the first day after an entry is added to the list (Monday in this case, for every entry, since the whole list is new), it will send an email confirming the entry was added, with some debugging info that can be ignored. On the second day (Tuesday) it will start reporting new links. Re: wishlist, that's a great idea. I use the email alert backlink watchlist tool for a lot of things and don't know how the community gets by without it; tbh it's really useful. For example I monitor every new instance of Template:Internet Archive author since many use it incorrectly. So many use cases. -- GreenC 22:40, 3 January 2021 (UTC)

@GreenC: First e-mails received, thanks. That's good timing for me, as I'm in the UK and usually edit in afternoons and evenings. For now, I'll work through the full lists but rely on my previous checks for those with over 100 cases. Once actual changes start to arrive, I should be able to keep up with words deserving attention and delist those which are predominantly false positives. I really must get around to putting my local scripts onto Toolserver. They're mainly in Perl but should be simple enough to convert to Python or whatever's needed. Thanks again for setting up this very useful service, Certes (talk) 12:29, 4 January 2021 (UTC)

I'm getting some great leads from those e-mails. One question: I know that lowercase looks only for lowercase links (e.g. "ford" finds ford, which probably needs attention, but ignores Ford, which is probably correct). Does sentence/title case also look only for links as stated? For example, I just fixed a pile of Turks born in Batman. Am I right to assume that it didn't check for Pte. Pike being batman to Col. Cholmondeley and that would need a separate entry? Certes (talk) 13:55, 4 January 2021 (UTC)

Hi Certes, glad it is working. If you have access to Toolforge, the data files are available in /data/project/botwikiawk/bw2; everything is plain text, no databases. The ".old" file is yesterday's complete list of backlinks and ".new" is today's list; it then does a list compare (subtraction of .old from .new) and sends the results, if any, via email. The list is then added to an ".add" file as a history if ever needed. For case: if the entry is lowercase it only reports lowercase links; if uppercase it only reports uppercase links. Thus covering both requires two entries in /Backlinks. If both cases will be a common requirement, maybe we can come up with a way to flag it in the /Backlinks page, such as the page title starting with a "!" or something. -- GreenC 14:17, 4 January 2021 (UTC)
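
The list compare described above can be sketched as follows. This is a Python illustration only, not the tool's own code (which is described later in this thread as GNU awk); the function names and the `stem` argument are hypothetical, and the one-title-per-line file layout is an assumption.

```python
from pathlib import Path


def read_titles(path):
    """Read one page title per line; a missing file counts as an empty list."""
    p = Path(path)
    if not p.exists():
        return []
    lines = p.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def new_backlinks(old_titles, new_titles):
    """Subtract yesterday's list from today's, keeping today's order."""
    seen = set(old_titles)
    return [title for title in new_titles if title not in seen]


def daily_compare(stem):
    """Compare <stem>.old with <stem>.new and append any additions to <stem>.add."""
    added = new_backlinks(read_titles(stem + ".old"), read_titles(stem + ".new"))
    if added:
        with open(stem + ".add", "a", encoding="utf-8") as history:
            history.write("\n".join(added) + "\n")
    return added
```

Because only the difference is reported, a truncated ".old" list would make every missing title look new on the following day.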

I've logged in successfully to wikitech.wikimedia.org as Certes. I used a different password from my Wikipedia/Global account, so I think this must be my "Wikimedia developer account". I've found wikitech:Portal:Toolforge/Quickstart but that seems to be about becoming a maintainer of a tool, so I didn't follow the instructions as far as creating SSH keys, etc. Is there a lower level of "able to log into and look at the files"? If they're only available to accepted tool maintainers that's no problem; everything I need is in the e-mails. Certes (talk) 15:00, 4 January 2021 (UTC)

It's a Unix shell server at login-stretch.tools.wmflabs.org via ssh with the wikitech ID and password. Probably will need to follow the 6 steps in the Quickstart to get approved. If you are a programmer it's a good free resource. -- GreenC 15:32, 4 January 2021 (UTC)

Thanks, I'll look into that soon. I am a coder and familiar with Unix. I should get round to turning some jobs I run regularly into proper tools. Certes (talk) 16:23, 4 January 2021 (UTC)

I've worked through my first daily changes e-mails: 88 e-mails with 10 bad links, so I need to do some refining. Some entries (London, Luxembourg, Melbourne) produce nearly all false positives and are better done by other means (though it would be wonderful to combine the two, say by monitoring pages which link to both London and Ontario). Some others (knot, mass, primate...) need lowercase duplicates if they stay.

The approach I've taken is to export the day's e-mails as text then run a Perl script to combine them as wikitext. (Normally I'll just preview that rather than saving it.) Does anyone have hints for doing this more efficiently? Certes (talk) 14:12, 5 January 2021 (UTC)

Probably the bot could generate and post a wikitable report, instead of email. It would overwrite the page each day and the page history could be used to navigate by date. It could follow your table example. I agree that is a good way to get the information if you think so. -- GreenC 14:30, 5 January 2021 (UTC)

That would certainly be more collegiate. (I don't own these links.) I'm not sure whether having the bot edit pages rather than send e-mails creates any authorisation problems. I find the history link useful. Ideally there would be a diff link from the last time this page was assessed (24 hours earlier) but I realise that this would be more awkward. (My simple script doesn't wade through page histories; it just processes locally stored text, leaving page preview to format tables and links.) Certes (talk) 14:51, 5 January 2021 (UTC)

Ok the code is done, but untested. It can be configured for emails, or table, or both. Will do both for a couple days, so you can verify the table is accurate. It will post at User:Certes/Backlinks/Report. I'll run it now and see how badly it breaks :) -- GreenC 15:46, 5 January 2021 (UTC)

Certes, after fixing a logic error the table seems to be working (was using "A" and "ABC" as test cases so they show > 100 as new entries). When/if you are ready to disable emails let me know. The idea of linking to the diff is interesting; it would have to account for preexisting links of the same name. Maybe it would walk back through the history, counting the number of target wikilinks in each revision until it finds a revision where the count goes down. Will think on it. -- GreenC 17:03, 5 January 2021 (UTC)
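
One way the counting idea above might look, as a Python sketch rather than anything the bot actually does. It assumes the caller has already fetched the page history as a newest-first list of (revision id, wikitext) pairs; both function names are hypothetical.

```python
import re


def count_links(wikitext, target):
    """Count [[target]] and [[target|label]] links (case-sensitive)."""
    pattern = r"\[\[\s*" + re.escape(target) + r"\s*(?:\|[^\]]*)?\]\]"
    return len(re.findall(pattern, wikitext))


def revision_that_added_link(revisions, target):
    """Walk back through history until the link count drops.

    `revisions` is a newest-first list of (rev_id, wikitext) pairs.
    Returns (older_id, newer_id) for the edit that raised the count,
    or None if no such edit is found in the supplied history.
    """
    for (newer_id, newer_text), (older_id, older_text) in zip(revisions, revisions[1:]):
        if count_links(older_text, target) < count_links(newer_text, target):
            return older_id, newer_id
    return None
```

Comparing counts rather than mere presence is what copes with preexisting links of the same name: a page that already linked once and gains a second link is still detected.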

That looks great (and I don't think either of those will be linking to Athens, Georgia, so I should probably take Athens off my list). Certes (talk) 17:20, 5 January 2021 (UTC)

User:Certes/Backlinks/Report updated today and is exactly what I was hoping for, except that it only covers ABC News. (The sole added link today refers correctly to the US brand, though I found 50 others for ABC News (Australia) that had appeared since my last trawl). Certes (talk) 13:31, 6 January 2021 (UTC)

Yes I was just about to post :) Noticed that also and found a bug that caused it to stop running after the first hit (was using the same counter variable name ("i") inside nested loops). It's fixed and rerunning now; you should see new results shortly. -- GreenC 13:44, 6 January 2021 (UTC)

That's perfect, thanks, and I can see one that needs fixing already. I also got the e-mails but won't need these now. Is it OK if I add and remove a few articles and rearrange the list into categories rather than alphabetical? Certes (talk) 14:43, 6 January 2021 (UTC)

Seeing some false positives (for example Crusaders linked in Church of the Ascension, Jerusalem). Not sure why, but going on the assumption the testing has confused the data files. It should hopefully clear out with the run tomorrow. Emails are now disabled. The page is overwritten each day so manual changes would get overwritten. You could move the page to another name like /Report -> /Report20210106, making a permanent copy? Or browse via the article history. -- GreenC 15:25, 6 January 2021 (UTC)

That article was moved today, so you won't have a record of an old link from its new name to Crusaders. Overwriting the page is fine; I can use the history if I miss a few days. Is it OK to update and rearrange my Backlinks list, keeping the comments? Certes (talk) 15:39, 6 January 2021 (UTC)

Ah whew, thought there was a deeper problem. On the /Backlinks page, any line that starts with "*" will be parsed and everything else ignored, including lines starting with whitespace, "#", "=" etc. The ordering doesn't matter. -- GreenC 16:44, 6 January 2021 (UTC)
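
The parsing rule above could be sketched like this in Python (the real parser is not shown in this thread). The function name is hypothetical, and the sketch assumes each entry is a bare page title after the "*".

```python
def parse_backlinks_page(wikitext):
    """Return the titles on lines that start with "*"; ignore everything else.

    Lines starting with whitespace, "#", "=" and so on are skipped, so
    headings and comments can be used freely to organise the list.
    """
    entries = []
    for line in wikitext.splitlines():
        if line.startswith("*"):
            title = line[1:].strip()
            if title:
                entries.append(title)
    return entries
```

Entries are returned in page order and are not case-folded, so "Ford" and "ford" remain separate entries.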

Today's report says there are more than 100 new links for The Daily Telegraph, which is not a new target. I think I've tracked down and fixed the relevant changed articles with PetScan but I'm wondering how this happened. It could be a template change, but only if articles using the template also have a relevant link in the text. Does the Backlinks tool monitor new links from the Template: namespace as well as main? If not then that would be an extremely useful addition, as one careful edit to a template can improve links in hundreds of articles. (Category:, File: and Portal: would also be nice to have if it's as easy as adding them to a list, but please don't go to any effort for those.) Certes (talk) 16:30, 11 January 2021 (UTC)

Yes, I see there are over 11,000 new entries added. This looks like a problem with yesterday's run: it undercounted, probably because the backlinks process got aborted for an unknown reason. Since you see only additions, not deletions, it was not apparent what happened. This sort of thing can happen sometimes. The program only knows two things: the list of backlinks found yesterday and the list found today, and it reports the difference between the two. If for some reason the list of backlinks was not accurately created yesterday, or today, the difference between the lists will appear strange. Wait a day and it should fix itself, assuming no more problems retrieving the backlink list.

The /Backlinks page can contain any page name, including Category:, File: etc. By default, backlinks are monitored only from mainspace pages. However, it could monitor backlinks from any page type, or all types. This is configurable in the program either for all entities in /Backlinks, or customized on a per-entity basis, which if set would override the global setting for that entity. There's no way currently to change those settings yourself but it would be easy on request. -- GreenC 17:41, 11 January 2021 (UTC)

Yes, I'm interested in monitoring links from templates to the titles I've listed (which all happen to be articles but need not be). So if someone creates {{Newspapers in Australia}} and adds The Daily Telegraph when they meant The Daily Telegraph (Sydney), I'd hope to detect and fix that before it causes much head-scratching on all the Australian articles which transclude it. Certes (talk) 19:22, 11 January 2021 (UTC)

Template backlinks added for all entities. Should see a bunch of new logs. -- GreenC 22:53, 11 January 2021 (UTC)

Thanks, that's exactly what I was hoping for. I've skimmed today's expanded list, checked the suspicious cases and fixed as necessary. People who edit templates tend to make fewer mistakes but they can be big ones. Certes (talk) 12:53, 12 January 2021 (UTC)

Good idea. Glad it worked; you are the first to use non-mainspace backlinks. I wrote it in a long time ago but never needed it before. There might be some others of interest also, like Module:, File: and Category: -- GreenC 13:50, 12 January 2021 (UTC)

File: and Category: descriptions are worth checking if it's easy; also Portal:. Not all standard tools check those namespaces, so errors can become neglected. I'm not sure Module:s can wikilink to anything but if they can then it would be harmless to include them. Certes (talk) 14:43, 12 January 2021 (UTC)

Ok done. If you think it's too many false positives I can restore the filters. -- GreenC 14:52, 12 January 2021 (UTC)

That's very useful but I manually applied a tweak: File: and Category: wikilinks need an initial colon. (I also restored the non-free File: links which had been removed by a bot.) (Never write software: if it has flaws, we will find them; if it's perfect, we will demand endless enhancements.) There are five Module: pages, all documentation, which (whilst less important than articles) we may as well check for completeness. Thanks again, Certes (talk) 10:56, 13 January 2021 (UTC)
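
The initial-colon tweak matters because a plain [[File:...]] embeds the file and a plain [[Category:...]] puts the report page into that category, instead of linking to them. A minimal Python sketch of the fix follows; `report_link` is a hypothetical name, not the bot's function, and only the two prefixes mentioned above are handled.

```python
# Namespace prefixes whose plain wikilinks do something other than link.
COLON_PREFIXES = ("File:", "Category:")


def report_link(title):
    """Format a page title as a wikilink that always renders as a link."""
    if title.startswith(COLON_PREFIXES):
        return "[[:" + title + "]]"
    return "[[" + title + "]]"
```

For example, `report_link("File:Example.jpg")` gives `[[:File:Example.jpg]]`, while article and Template: titles are left as ordinary links.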

Fixed, though not yet tested; it should show with tomorrow's run. Assuming JC's bot won't remove them with a leading ":" -- GreenC 13:57, 13 January 2021 (UTC)

@GoingBatty: I'm watching Fox, though there's no harm in having more eyes on it. Do you do anything automatically with User:GoingBatty/misdirected links? Many of my checks would have far fewer false positives with a search rather than a simple link list, e.g. linksto:Greenwich -London or linksto:Greenwich Connecticut rather than manually excluding the majority of links which really are for Greenwich. Certes (talk) 00:10, 17 January 2021 (UTC)

@Certes: I created most of those rules pre-Backlinks. Once in a while I might run a query like Connecticut insource:/\[\[Greenwich\]\]/ but I don't have any automated checks for those rules other than Backlinks. GoingBatty (talk) 00:25, 17 January 2021 (UTC)

@Certes: ...and I just fixed a few Greenwich links. GoingBatty (talk) 00:55, 17 January 2021 (UTC)

Thanks, though I looked at E. Wight Bakke recently and decided it probably meant Greenwich, England. Certes (talk) 01:00, 17 January 2021 (UTC)

@GreenC: Thanks again for the daily report which is working very nicely. It's flagging about 100 articles a day and I'm finding about 10 errors. I'm aware that I've expanded my list considerably and this must be tying up the server for longer. How is the loading: should I be shortening my list? If so, would it be better to remove a few checks which match too many articles or several checks which rarely trigger? The latter are still useful as false positives are rare and they can find more serious problems such as this today. Other possibilities are to run weekly rather than daily, or to run different lists on different days of the week. I probably need to do some pruning anyway and to establish some more nuanced manual checks like those on User:GoingBatty/misdirected links. For example, I don't check London due to the many false positives, but such a check limited to pages which mention Ontario would be productive. Certes (talk) 12:57, 13 February 2021 (UTC)

No problem, glad it's useful. It looks like you are currently tracking 443 pages and it takes 90 minutes to complete all. This is not a burden on the WMF servers. I think more important is what you are able to keep up with; if there is too much information it can be overload. It's up to you how frequently to run; it's a cron job so can be set to any time period (hourly, bi-weekly, etc). The idea of a co-word check (London + Ontario) is theoretically possible. Maybe specify a regex statement with an exclude/include keyword, i.e. only report a hit if there is also a match on a regex statement. Or, exclude if there is a match on a statement. That way you can filter out false positives as you discover them. Not a perfect solution of course, as there might be an article that contains Ontario and London unrelated to each other. -- GreenC 15:12, 13 February 2021 (UTC)
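
The include/exclude idea above might be sketched as follows. This is a Python illustration of the proposal only, not an existing feature of the tool; `should_report` and its parameters are hypothetical names.

```python
import re


def should_report(page_text, include=None, exclude=None):
    """Decide whether a new backlink on this page is worth reporting.

    include: report only if this regex matches somewhere in the page.
    exclude: suppress the report if this regex matches somewhere in the page.
    Either may be None, meaning that filter is not applied.
    """
    if include is not None and not re.search(include, page_text):
        return False
    if exclude is not None and re.search(exclude, page_text):
        return False
    return True
```

With `include=r"\bOntario\b"`, a new link to London would be reported only on pages that also mention Ontario. As noted above, the two may still be unrelated, so this narrows the list rather than proving an error.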

I'm not worried about a few false positives with London (UK) validly linked in one place and Ontario mentioned elsewhere. It's minor compared with the chore of checking every link to London, which I currently avoid by ignoring the problem. There are other perennials that I just scan, for example I hover over everyone listed on ABC News and skim for Australia, NSW, etc.; if it's listed in the lead then I check properly whether ABC News (Australia) would be more appropriate. That could probably be automated but I'd need some clue as to how the code works to say more. What technology is it based on: SQL on pagelinks, Cirrus search, Python grepping a dump or something else? Certes (talk) 17:36, 13 February 2021 (UTC)

Oh, old school: GNU awk. It gets the wikitext and applies regexes. Awk regex has a few peculiarities and limitations but is pretty standard, no lookbacks or anything fancy. For example this is what I use to match [[Foo]] or [[Foo|foobar]] = "[[]{2}[ ]*(" linkname "[ ]*[]]{2}|" linkname "[ ]*[|][^]]*[]]{2})" where "linkname" is "foo". You can test it from the CLI:

echo "[[foo|foobar]]" | awk '{linkname="foo"; match($0, "[[]{2}[ ]*(" linkname "[ ]*[]]{2}|" linkname "[ ]*[|][^]]*[]]{2})", d); print d[0]}'

-- GreenC 18:24, 13 February 2021 (UTC)
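
For readers more at home in Python, the awk pattern above translates roughly as below. This is a sketch for comparison, not the tool's code; `find_link` is a hypothetical name, and `re.escape` is an addition that the awk one-liner does not have (it stops characters such as "." or "(" in a title being read as regex syntax).

```python
import re


def find_link(wikitext, linkname):
    """Return the first [[linkname]] or [[linkname|label]] link, or None.

    Mirrors the awk pattern: two "[", optional spaces, the link name,
    optional spaces, then either "]]" or "|label]]". Case-sensitive.
    """
    name = re.escape(linkname)
    pattern = r"\[{2} *(?:" + name + r" *\]{2}|" + name + r" *\|[^\]]*\]{2})"
    match = re.search(pattern, wikitext)
    return match.group(0) if match else None
```

Like the awk version, it does not match longer titles that merely begin with the link name, because the name must be followed by spaces, "|" or "]]".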

Thanks. I've written awk, but not since I discovered Perl about 20 years ago! I'll have a think, but the change I was contemplating looks awkward to implement: Ontario might be before or after the London link or even on a different line. My best guess would be something like

echo "[[London|UK capital]] Ontario" | awk '{linkname="London"; othername="Ontario"; match($0, "[[]{2}[ ]*(" linkname "[ ]*[]]{2}|" linkname "[ ]*[|][^]]*[]]{2})", d); match($0, othername, e); if (e[0]) {print d[0]}}'

. (Annoyingly, the obvious default of othername="" doesn't match, even though the line contains plenty of null strings.) Then we'd need a backward-compatible syntax for adding such complex entries to the Backlinks file: maybe *London|Ontario as | can't appear in article titles. Probably more trouble than it's worth. Certes (talk) 19:14, 13 February 2021 (UTC)
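
A Python sketch of how the proposed *London|Ontario syntax might be handled, purely as an illustration of the idea floated above. Nothing here exists in the tool, and both function names are hypothetical. It searches the whole page text rather than one line at a time, so the co-word may appear before or after the link or on a different line, and it treats an empty co-word as "no restriction", which keeps plain *London entries working.

```python
import re


def parse_entry(line):
    """Split '*Target|Coword' into (target, coword); coword is '' if absent."""
    body = line.lstrip("*").strip()
    target, _, coword = body.partition("|")
    return target.strip(), coword.strip()


def link_with_coword(page_text, target, coword=""):
    """Return the first link to `target` if the page also mentions `coword`.

    An empty coword disables the extra check, so existing single-title
    entries behave as before.
    """
    pattern = r"\[\[ *" + re.escape(target) + r" *(?:\|[^\]]*)?\]\]"
    link = re.search(pattern, page_text)
    if link is None:
        return None
    if coword and coword not in page_text:
        return None
    return link.group(0)
```

For example, `parse_entry("*London|Ontario")` yields `("London", "Ontario")`, which can then be passed to `link_with_coword` along with the page's wikitext.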