Wikipedia:Bots/Requests for approval/Theo's Little Bot 23
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Operator: Theopolisme (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 18:18, Sunday July 14, 2013 (UTC)
Automatic, Supervised, or Manual: Automatic
Programming language(s): Python
Source code available: of course
Function overview: Removes Google Analytics tracking parameters (e.g. utm_source) from external links, replacing links with their canonical urls where possible
Links to relevant discussions (where appropriate): bot request with more details and discussion
Edit period(s): 500 edit batches
Estimated number of pages affected: A hearty handful
Exclusion compliant (Yes/No): Yes
Already has a bot flag (Yes/No): Yes
Function details: For all articles that include an external link containing the string "utm_" (found via the externallinks table):
- gets all links on the page
- if a link includes "utm_":
- screen scrape the link's html to try to find a canonical url
- if we can find a canonical url, replace the link on wiki with the canonical url
- if not, use regular expression to remove all "utm_"-ish things
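The screen-scrape step above can be sketched in Python (the bot's language). This regex-based canonical-tag finder is an illustrative stand-in, not the bot's actual code; a production implementation would more robustly use an HTML parser:

```python
import re

def find_canonical(html):
    """Search raw HTML for a <link rel="canonical" href="..."> tag
    and return its href, or None if no canonical URL is declared."""
    match = re.search(
        r'<link[^>]*rel=["\']canonical["\'][^>]*href=["\']([^"\']+)["\']',
        html,
        re.IGNORECASE,
    )
    return match.group(1) if match else None
```

Note this only matches tags that put rel before href; pages ordering the attributes the other way around would need a second pattern or a real parser.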
Discussion
[edit]- I support this task in principle, in terms of upholding openness and user privacy on WP. Don't have any specific comments on the task as yet, though the function details seem a little bit light at the moment. Rjwilmsi 05:31, 19 July 2013 (UTC)[reply]
- @Rjwilmsi: What additional details are you interested in? The code has been written; here is a sample edit of it replacing a link that included utm_ with a canonical url. I'm happy to answer any questions. Theopolisme (talk) 15:46, 19 July 2013 (UTC)[reply]
- I am obviously in support of this bot, as I am the person who submitted the request. I would be happy to discuss the motivations of the bot. I cannot speak for the code, because that was all handled graciously by Theo. DouglasCalvert (talk) 18:21, 19 July 2013 (UTC)[reply]
- {{BAGAssistanceNeeded}} A trial, perhaps? Theopolisme (talk) 21:26, 25 July 2013 (UTC)[reply]
- Some quick comments on your regexes:
  - In the first one, you have ((?:\w+:)? to identify links on the page. However, would it not be better to use the API to get a list of all external links on the page? You can use action=parse&prop=externallinks&page=. Or at least limit the regex to looking inside single square brackets, and to specific protocols (http and https), rather than just searching for any string (not containing whitespace, <> or []) preceded by //, which is what you're doing at the moment.
  - You should add a close square bracket as a possible terminator for a link parameter (so (?=\s|&|$) becomes (?=\s|&|$|])). I know external links shouldn't really be used like this, but if there were one formatted as follows: [http://example.com&utm=123] your bot would (I believe) remove the closing square bracket.
- That's all for now - Kingpin13 (talk) 18:24, 31 July 2013 (UTC)[reply]
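Kingpin13's square-bracket point can be demonstrated with a pair of simplified patterns (illustrative stand-ins, not the bot's actual regexes, which are only partially quoted above):

```python
import re

# Simplified utm parameter removers. The first uses the original
# lookahead; the second adds "]" as a terminator and also excludes it
# from the parameter value.
without_bracket = re.compile(r'[?&]utm\w*=[^\s&]*(?=\s|&|$)')
with_bracket = re.compile(r'[?&]utm\w*=[^\s&\]]*(?=\s|&|$|\])')

link = '[http://example.com&utm=123]'
# The old lookahead lets the value run over the "]", deleting it:
#   without_bracket.sub('', link) -> '[http://example.com'
# The fixed pattern stops at the bracket and keeps it:
#   with_bracket.sub('', link)    -> '[http://example.com]'
```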
- Some quick comments on your regexes:
- I copied the regex from Wikipedia:AutoWikiBrowser/Regular_expression#Using_look_ahead.2Fbehind (I think I was having a lazy day, you know how that goes); however, the API seems like a much better method -- I'll work on implementing that. Theopolisme (talk) 19:51, 31 July 2013 (UTC)[reply]
- Done in source and tested. Theopolisme (talk) 20:08, 31 July 2013 (UTC)[reply]
- Great, thanks for the quick response. My only other concern would be that the bot may encounter some false positives, where a parameter named "utm_x" is used for something other than Google Analytics. My only real thoughts on reducing the risk of that would be to search the target page for the Google Analytics script (since you're loading the page already) or to add more strict criteria about the parameter name (Douglas mentioned in the bot request about some "cheat sheets"). Anyway, if you do want to implement something to be more strict about only removing Analytics related parameters, don't let me stop you. But in the meantime I'm happy to do a trial run and see if any issues arise. Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. - Kingpin13 (talk) 23:41, 31 July 2013 (UTC)[reply]
- Actually, before you start the trial, I was just taking a look at the regexes again, and would I be correct in thinking that http://example.com?para=value would become http://example.com&para=value? That would obviously be incorrect, as the first parameter needs a question mark, not an ampersand. - Kingpin13 (talk) 23:52, 31 July 2013 (UTC)[reply]
  - Ah, that's a really good catch. I can look at this tomorrow. Theopolisme (talk) 00:09, 1 August 2013 (UTC)[reply]
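The bug can be reproduced with a simplified remover (again a stand-in for the bot's regex), together with one possible "reactive" repair:

```python
import re

# A simplified utm_ remover (a stand-in for the bot's actual regex):
strip = re.compile(r'[?&]utm_\w+=[^\s&]*')

url = 'http://example.com?utm_source=feed&para=value'
broken = strip.sub('', url)
# broken == 'http://example.com&para=value' -- the surviving first
# parameter is now led by "&" where a "?" is required.

# One "reactive" repair: promote the first "&" of a query string that
# lost its "?" back into a "?":
fixed = re.sub(r'^([^?]*)&', r'\1?', broken)
# fixed == 'http://example.com?para=value'
```

Note the repair as written assumes the URL actually lost a "?"; applied blindly, it would mangle URLs that legitimately contain "&" but never had a query string.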
@Kingpin13: I initially saw two options here, and both involved two regexes:
- option 1 ("proactive"): change the UTM regex to two regexes, |&|$|]) and \?, and then have the first one replace with "" and the second with "?"
- option 2 ("reactive"): use the current regex and then fix it immediately afterwards, by doing (^.*?\.[^?]*?)&(.) → \1?\2.
I ran 100,000 tests using timeit and got the following total execution times for the regexes (i.e. how long 100,000 runs took in total):
- original: 0.372184038162s
- reactive: 0.797237873077s
- proactive: 0.648833990097s
I'm leaning towards the proactive one (because I'm generally a fairly proactive person). Do you have a preference as to which I should use (or any alternative suggestions)? I also need to keep in mind that this will only be running 500 times per run, so all things considered it's basically an unnoticeable difference. Theopolisme (talk) 16:49, 1 August 2013 (UTC)[reply]
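For reference, a comparison of this sort can be run with the standard timeit module along these lines (the pattern here is illustrative, and absolute times will vary by machine):

```python
import re
import timeit

pattern = re.compile(r'[?&]utm_\w+=[^\s&]*')
url = 'http://example.com/page?id=7&utm_source=feed'

# Total wall time for 100,000 substitutions, matching the methodology
# of the numbers quoted above.
elapsed = timeit.timeit(lambda: pattern.sub('', url), number=100_000)
print(f'{elapsed:.6f}s')
```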
- Your proactive example certainly makes more sense to me. However, it does not cover cases where the utm parameter is the only parameter (e.g. http://example.com). It does get to the point where it gets a bit silly to keep using regex, but here's what I've come up with, using the proactive regexes as a starting point:
  - s/\?(]*?=[^&\s$]*[&\s$])+(?<=$|\s)|(?<=\?)utm_.*?=.*?&(utm_.*?=.*?&)*|(&|(?<=&))|&|$|])//
- Just a tiny bit convoluted; I haven't really done any optimisation or tidying up. But it does get it all into a single regex. There may be mistakes, so I'll outline the basics of it so you can double check. There are three different cases, essentially, and the regex has three different parts:
  - Case 1 (Red): all of the parameters are utm_x parameters, so remove all parameters (including the question mark).
  - Case 2 (Green): a mix of parameters, and the first parameter is a utm_x parameter, so remove the first parameter and any other utm parameters immediately after it (leave the question mark behind but remove the trailing ampersand).
  - Case 3 (Blue): a mix of parameters, so remove all utm parameters (bearing in mind that the leading ampersand may have been captured by the previous case).
- The main issue with the above regex is that it still doesn't know when to correctly terminate a url (e.g. a question mark at the end of the url should terminate it, but the regex treats it as part of the parameter's value). I'm assuming this won't be an issue if you're gathering the list of external links from the API, as they'll all be terminated by $...? Anyway, I can understand if at this point you want to claw your eyes out and never talk about regex with me again! - Kingpin13 (talk) 18:17, 1 August 2013 (UTC)[reply]
- @Kingpin13: There's a module for that. We don't need to ever speak of our ineptitude again; let's just drop the sticks and back slowly away from the horse carcass, and remember that at least 78.32% of the world's population has done exactly what we're trying to accomplish... Approved for trial? ;) Theopolisme (talk) 18:41, 1 August 2013 (UTC)[reply]
- Haha, nice find. Yeah, feel free to run the trial whenever you're good and ready - Kingpin13 (talk) 18:43, 1 August 2013 (UTC)[reply]
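The "module" approach sidesteps the regex edge cases entirely: Python's standard library (urlparse in Python 2, urllib.parse in Python 3) can split the query string, drop the utm_ keys, and reassemble the URL. A minimal sketch, not the bot's actual code:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_utm(url):
    """Drop all utm_* query parameters, keeping everything else intact."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if not key.lower().startswith('utm_')]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

This handles all three cases from the regex discussion uniformly: utm-only queries (the "?" is dropped along with the empty query), a leading utm parameter, and trailing utm parameters.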
Trial complete. [1] Theopolisme (talk) 01:22, 2 August 2013 (UTC)[reply]
- The edits look pretty good overall. Just two problems I spotted. Firstly, you need to deal with canonical links which are not absolute; for example, the canonical links on the targets in this edit, which led to the bot messing up the cite template. Secondly, I think it would be a good idea to have the bot mark links which return 404 errors with {{Dead link|date=Month Year|bot=Theo's Little Bot}}. This would make it a great deal easier for editors to fix the dead link than if it is replaced with the canonical 404 link, which does not give much clue as to what the original dead link was (e.g. as happened in this edit). - Kingpin13 (talk) 02:11, 2 August 2013 (UTC)[reply]
  - Thanks for examining the edits. I've fixed the issue with non-absolute urls (here's a test edit using the example you linked previously). As far as dead links go...I'll look into that shortly. Theopolisme (talk) 03:09, 2 August 2013 (UTC)[reply]
- Alright, I've added the dead link tagging functionality as well: if the link returns an error code, then the bot will attempt to add {{dead link}} after the link in question (if the link is inside a template, then the bot will add {{dead link}} immediately after the template). Here's a test edit using the example you linked previously. Theopolisme (talk) 04:11, 2 August 2013 (UTC)[reply]
  - Thanks again for the quick fixes. It appears that all the problems identified in the trial have been resolved. Seems like an uncontroversial task with high support, so Approved. - Kingpin13 (talk) 17:11, 2 August 2013 (UTC)[reply]
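The tagging step might look something like this sketch (the function and its placement logic are hypothetical simplifications; in particular, the inside-a-template case described above is omitted, and the date parameter is assumed from the discussion's timeframe):

```python
DEAD_LINK_TAG = "{{Dead link|date=August 2013|bot=Theo's Little Bot}}"

def tag_dead_link(wikitext, url, tag=DEAD_LINK_TAG):
    """Insert a {{Dead link}} tag immediately after the bracketed
    external link containing `url`; return the text unchanged if the
    link cannot be located."""
    start = wikitext.find('[' + url)
    if start == -1:
        return wikitext
    end = wikitext.find(']', start)
    if end == -1:
        return wikitext
    return wikitext[:end + 1] + tag + wikitext[end + 1:]
```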
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.