Wikipedia:Bots/Requests for approval/AnomieBOT III 3
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Operator: Anomie (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 22:20, Tuesday, November 1, 2016 (UTC)
Automatic, Supervised, or Manual: Automatic
Programming language(s): Perl
Source code available: User:AnomieBOT/source/tasks/SpamBlacklistBlocker.pm
Function overview: Block IPs that hit certain URLs on the spam blacklist too frequently.
Links to relevant discussions (where appropriate): Wikipedia:Bot requests#blocking IPs that only hit the spam blacklist
Edit period(s): Continuous
Estimated number of pages affected: IP talk pages as they hit the blacklist
Exclusion compliant (Y/N): No
Adminbot (Yes/No): Yes
Already has a bot flag (Y/N): Yes
Function details: The bot will block IPs that:
- attempt to add URLs that are on the spam blacklist (see MediaWiki:Spam-blacklist and m:Spam blacklist)
- and are on User:AnomieBOT III/Spambot URI list
- and have done so more than N times in T seconds (currently N=2 and T=120, but this may be changed if consensus determines it).
The IPs will be blocked for 1 month, anon-only, account creation disabled, and talk page access disabled, with message {{spamblacklistblock}}.
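For illustration only, a minimal sketch of the rate check and block decision described above, assuming a simple in-memory history of blacklist-hit timestamps per IP; the names, structure, and exact threshold handling are illustrative and not taken from the bot's actual SpamBlacklistBlocker.pm:
#!/usr/bin/perl
# Illustrative sketch, not AnomieBOT's real code: sliding-window check for
# "N hits on listed domains within T seconds" (treated here as: block on the
# second qualifying hit inside the window, per the discussion below).
use strict;
use warnings;

my $N = 2;     # qualifying hits needed to trigger a block
my $T = 120;   # window in seconds

my %hits;      # IP address => [ epoch timestamps of qualifying blacklist hits ]

# Record one qualifying hit and report whether the IP should now be blocked.
sub should_block {
    my ( $ip, $when ) = @_;
    push @{ $hits{$ip} }, $when;
    @{ $hits{$ip} } = grep { $_ > $when - $T } @{ $hits{$ip} };   # drop hits outside the window
    return scalar( @{ $hits{$ip} } ) >= $N;
}

# Two hits 30 seconds apart: the first is tolerated, the second triggers the block.
print should_block( '192.0.2.1', time() )      ? "block\n" : "ok\n";   # ok
print should_block( '192.0.2.1', time() + 30 ) ? "block\n" : "ok\n";   # block
The real task additionally restricts the check to hits on domains listed at User:AnomieBOT III/Spambot URI list and applies the block settings described above (1 month, anon-only, account creation and talk page access disabled).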
Community notifications
Have been placed at:
- Wikipedia:Bots/Requests for approval/Adminbots
- Wikipedia:Village_pump_(proposals)#New_adminbot_proposal_-_blocking_spambot_IPs
- Wikipedia:Administrators'_noticeboard#New_adminbot_proposal_-_blocking_spambot_IPs
- Wikipedia_talk:Blocking_policy#New_adminbot_proposal_-_blocking_spambot_IPs
Discussion
As noted in the linked discussion, admins are already making these blocks manually. The bot is requested in order to have a faster response time. Anomie⚔ 22:20, 1 November 2016 (UTC)[reply]
- Will you only be counting if their edit is actually stopped (i.e. if MediaWiki:Spam-whitelist has overridden the meta: list, they will NOT be blocked?) — xaosflux Talk 01:05, 2 November 2016 (UTC)[reply]
- @Xaosflux: It will operate on a specific set of domains, these will never be whitelisted. The bot should look at blacklist-hits, which by definition means that the link is not whitelisted. --Dirk Beetstra T C 03:28, 2 November 2016 (UTC)[reply]
- @Beetstra: - The request above specifically mentions both the local and global general SBLs; if this is not so, it changes most of my follow-up questions. — xaosflux Talk 04:50, 2 November 2016 (UTC)[reply]
- {{OperatorAssistanceNeeded}} — xaosflux Talk 04:50, 2 November 2016 (UTC)[reply]
- @Xaosflux: The request reads 'on the spam blacklist (see MediaWiki:Spam-blacklist and m:Spam blacklist) and on User:AnomieBOT III/Spambot URI list'; that is a logical AND, the links have to be blocked by the spam blacklists, and when they are blocked and also on the special list (User:AnomieBOT III/Spambot URI list), then the IPs are spambots which need to be blocked on second edit. --Dirk Beetstra T C 05:14, 2 November 2016 (UTC)[reply]
- @Xaosflux: The bot operates on "hit" entries in Special:Log/spamblacklist (note that log page is visible to admins only). If a URL is whitelisted, it won't show up in that log so the bot won't see it in the first place to check it against User:AnomieBOT III/Spambot URI list. Anomie⚔ 12:09, 2 November 2016 (UTC)[reply]
- Please note, that template was changed to {{Uw-spamblacklistblock}}. — xaosflux Talk 01:07, 2 November 2016 (UTC)[reply]
- A quick look at the (admin only) SBLog shows some odd trends - some "unwanted" links like youtube are causing hits - and it looks like the editor may not be understanding the message and just tries to hit save again - these are usually consecutive attempts (seconds apart) so if you are blocking for 2hits/2min they will get blocked right away. Is this really desired? — xaosflux Talk 01:11, 2 November 2016 (UTC)[reply]
- Keeping YouTube out of it is definitely a good idea. Maybe we should go only by User:AnomieBOT III/Spambot URI list, adding any other links we feel could not conceivably be added in good faith, as opposed to blatant spamming — MusikAnimal talk 01:28, 2 November 2016 (UTC)[reply]
- I'm not arguing for youtube :D Just that it looks like those editors aren't really being malicious and the SBL already did its job - hitting save twice in a row seems a bit much for getting a month block. — xaosflux Talk 02:01, 2 November 2016 (UTC)[reply]
- @Xaosflux: It should only operate on the links in User:AnomieBOT III/Spambot URI list. Note that the spam blacklist log is filled with editors that are adding in good faith, hitting sometimes 10-20 times. These are spambot domains. We had (before we started to aggressively block them) IPs who were hitting the blacklist thousands of times (thousands of hits; and even recently there is one (October 26) with a good hundred when no-one was looking for a couple of hours). There is no human behind the IP at this time (or at best a sweatshop). --Dirk Beetstra T C 03:28, 2 November 2016 (UTC)[reply]
- @Xaosflux - for the good-faith edits where editors repeatedly try to add a redirect site, I have actually suggested doing the opposite - leave a message on the talkpage of the editor telling them which link they should use (have a bot expand the url-shortened link). This request only concerns the real spam as used by these spambots. --Dirk Beetstra T C 03:41, 2 November 2016 (UTC)[reply]
- And in a way, the concern raised here is very valid - and I would be willing to help those editors who don't get through with good-faith additions - but these spambots sometimes obscure the logs so much that it becomes an impossible task to see (this cherry-picked chunk of 250 log entries/90 minutes from October 26 is 96.8% spambot hits). --Dirk Beetstra T C 04:01, 2 November 2016 (UTC)[reply]
- That template has directions for placing an unblock message - however your bot specifications say they will disable talk page access - is this action necessary? — xaosflux Talk 02:25, 2 November 2016 (UTC)[reply]
- @Xaosflux: The template is written for editors who are aggressively hitting the spam blacklist. It is used on editors outside of the spambots. For those editors outside of the spambots there is no need to withdraw talkpage access, and they could ask for a regular unblock (and these editors should not be blocked by the bot, hence the selected sub-blacklist that is used by this bot). These spambots however also hit (and sometimes, only hit) the spam blacklist with edits to their own talkpage. For those IPs plain blocking is not helping a lot with clearing the log (500 hits to own talkpage; the main reason why the IPs need to get blocked), one needs to withdraw talkpage access as well. A good faith future IP editor on that IP will have to go through UTRS (but it turns out that we generally keep re-blocking the IPs, and saying that I regularly find IPs with reasonable edits would be an exaggeration). --Dirk Beetstra T C 03:28, 2 November 2016 (UTC)[reply]
- Note that there are now IP ranges of these spambots blocked with withdrawn talkpage access. --Dirk Beetstra T C 03:30, 2 November 2016 (UTC)[reply]
- Note that the template has a |notalk=yes parameter, which the bot will use. The parameter changes the unblock instructions to refer the IP to UTRS. Anomie⚔ 12:13, 2 November 2016 (UTC)[reply]
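Purely as an illustration of the parameter being discussed, a sketch of how the talk-page notice wikitext might be assembled; the substitution syntax and signature here are assumptions, not necessarily how the bot actually posts the message:
use strict;
use warnings;

# Build the block-notice wikitext, optionally with the |notalk=yes parameter
# that points the IP to UTRS instead of the on-wiki unblock process.
sub block_notice {
    my ($notalk) = @_;
    my $msg = '{{subst:Uw-spamblacklistblock';
    $msg .= '|notalk=yes' if $notalk;
    $msg .= '}} ~~~~';
    return $msg;
}

print block_notice(1), "\n";   # {{subst:Uw-spamblacklistblock|notalk=yes}} ~~~~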
- Here comes my big question - why is this necessary to run on a bot? If the community is fine with automated blocking for this type of criteria, couldn't this be handed by enabling extra features in the AbuseFilter? — xaosflux Talk 02:28, 2 November 2016 (UTC)[reply]
- @Xaosflux: I have tried a filter (791), but it turns out that the filter does not see these hits. The spam blacklist is blocking the edit before the AbuseFilter sees it (makes sense, if the blacklist can filter out edits, it would make the abusefilter have less to do). A possible solution could then be to locally whitelist the domains and have an edit filter blocking the additions, but I think that is significantly more heavy on the server. --Dirk Beetstra T C 03:28, 2 November 2016 (UTC)[reply]
- Makes sense, ignore this section now - especially if not actually using the SBL/GBL. — xaosflux Talk 04:51, 2 November 2016 (UTC)[reply]
- The bot is not using the SBL/GBL, it blocks specific cases where editors have been disallowed to save an edit by the SBL/GBL. I guess I get your confusion now. --Dirk Beetstra T C 05:15, 2 November 2016 (UTC)[reply]
- What authentication method is this bot using (traditional, botpasswords, oauth)? If oauth, please link to the consumer; if botpasswords what restrictions are being set? — xaosflux Talk 04:54, 2 November 2016 (UTC)[reply]
- OAuth. The current consumer is Special:OAuthListConsumers/view/fcb7ec99d7927ff32c327cd27ad8d434, although I'll have to make a new one to enable the newly-added "View restricted log entries" grant that the bot needs to be able to access the log. Anomie⚔ 12:09, 2 November 2016 (UTC)[reply]
- Why are we/you not implementing this as part of the SpamBlacklist extension that implemented some rate limiting? Legoktm (talk) 05:18, 2 November 2016 (UTC)[reply]
- @Legoktm: It would still show the hits in the logs, right. And rate-limiting would also catch noobs who try and try to add e.g. a redirect site and fail to read the messages that they get. An update of the spamblacklist extension (e.g. according to a suggestion of Brion 5 years ago) would be a very welcome solution, as would be a major upgrade to the Captcha system (so the spambots would not get through). Phab tickets exist for those, but no serious follow up by the development team. --Dirk Beetstra T C 06:28, 2 November 2016 (UTC)[reply]
- Do you have links to specific tickets? I quickly looked for one about rate limiting based on spam blacklist hits and didn't find anything. Legoktm (talk) 05:59, 3 November 2016 (UTC)[reply]
- Not for rate limiting; there is T125132 for a better Captcha (the current captcha mostly deters humans .. not the bots; WMF has even been considering removing the whole Captcha so it annoys new editors less ..). I can't find Brion's remark again .. Phabricator is difficult to search for that. I'll try to find it. --Dirk Beetstra T C 03:45, 6 November 2016 (UTC)[reply]
- @Legoktm: Another one here: T6459 from ... 2006; And the remark I meant from Brion was in T16719#194121 (2010). Both allude to the current system being too crude. I would suggest rewriting it into a version akin to the AbuseFilter, but without the heavy code-parser, and with added option buttons (namespace choice; log yes/no/only; warn/disallow); built-in whitelisting; page exclusion (?); tagging, etc. Maybe it is something to be discussed in T6459. --Dirk Beetstra T C 09:05, 9 November 2016 (UTC)[reply]
- Didn't think of it. But now that I do think of it, I still don't have a good idea of how exactly rate limiting should work for log events: we could rewrite existing log entries (ugh) or delay logging events until we determine there won't be more hits to combine (ugh) or just drop events on the floor (ugh), and at the same time figure out how any of those needs to interact with other relevant event streams (RC, RCStream or its replacement, etc). Anomie⚔ 12:21, 2 November 2016 (UTC)[reply]
- I'm not really sure what you mean, I quickly wrote gerrit:319514 for what I was thinking about. Legoktm (talk) 05:59, 3 November 2016 (UTC)[reply]
- @Legoktm: Hmm .. this is an idea, but with a lot of collateral damage I am afraid. We have many good-faith editors who repeatedly trip the spam blacklist in short times, because they fail to read the instructions (21 in 90 minutes; 1 per 5 minutes; 14 over several hours; 1 per 30 minutes; 9 in 20 minutes; 1 per 2 minutes; 11 in 3 minutes, 4 per minute). If you contrast that with 100 edits in 34 minutes; 2 per minute; 24 in 2 hours; 1 every 5 minutes, you see that a too-short throttle will not catch all of them, and a longer throttle will catch many good-faith editors (adding to their frustration over why their edits don't save). --Dirk Beetstra T C 06:21, 3 November 2016 (UTC)[reply]
- So it seems like a short throttle would still be useful right? False negatives are okay, false positives are not. Basically I'm trying to figure out that if we can have a bot block users without judgement, at what level can we integrate that into the software. Legoktm (talk) 21:18, 3 November 2016 (UTC)[reply]
- @Legoktm: Oh, it for sure would be a welcome addition, and might take out some, but I am afraid that most will edit slower, or at a similar pace as a human editor. --Dirk Beetstra T C 03:45, 6 November 2016 (UTC)[reply]
- Oh, you were referring to blocking the users, not just rate-limiting the log entries. Anomie⚔ 13:54, 3 November 2016 (UTC)[reply]
- I think that it should list the IP addresses in real time on a fully protected page, which human admins will look over, removing entries once they either approve of the block or undo/reduce it if it's wrong. I'm completely against a bot blocking anyone with talk page access revoked, unless some immediate means of human supervision is available. עוד מישהו Od Mishehu 11:10, 2 November 2016 (UTC)[reply]
- @Od Mishehu: List the IPs in real time .. that is the same as just looking at the logs in real time. The bot is doing nothing else than what any of us, including me, would already do: block on sight for 1 month with talkpage access removed. As per this cherry-picked chunk of 250 log entries/90 minutes from October 26 - we have 3-5 admins looking at the log practically around the clock, and still 242 edits obscuring 8 possibly genuine edits come through. If we had more admins looking at the logs, then we would not need to find alternate solutions. And the human supervision is still there: we would still look at the logs regularly, looking at anything that seems out of the ordinary. The domains the bot would block on (which is a selected subset) do not have any genuine use (really none); there will not be any good-faith editor caught in the blocks who uses these domains. Of the hundreds of IPs I blocked for this reason, I have until now seen only one that had any edits in its history. We could ask for a daily log to be maintained at the same time in the bot's userspace so it is easier for admins to check the actions of the bot in real time. And I am volunteering, for the first 1-2 weeks, to regularly check every single block the bot makes. --Dirk Beetstra T C 11:41, 2 November 2016 (UTC)[reply]
- When you do it, a human admin clearly approves the action; and when another spam watcher sees that log entry, (s)he can see that fact. If a bot does it, the same level of scrutiny is needed. And the bot's block log alone can't be edited by human admins who approve the individual blocks. — Preceding unsigned comment added by Od Mishehu (talk • contribs) 11:46, 2 November 2016 (UTC)[reply]
- @Od Mishehu: When the bot does it, a human editor has approved that IPs who add those specific useless domains should be blocked because no human editor would use them, and when another spam watcher sees that log entry, they can see that fact. If a human admin approves a block placed by another admin (or in this case, the bot), they would not need to follow up at all. If a human admin disapproves of another admin's block, they would start a discussion to get the block removed; if a human admin disapproves of the adminbot's block, they should do the same (unless, and I would encourage this, removing blocks placed by the adminbot can be done without discussion and without constituting wheel warring). --Dirk Beetstra T C 12:37, 2 November 2016 (UTC)[reply]
- Wouldn't the bot's block log be a real-time list of IPs the bot blocked? Anomie⚔ 12:09, 2 November 2016 (UTC)[reply]
- You would then have several admins redoing the same work in checking these blocks. Better to have an editable list, from which human admins can remove checked entries. עוד מישהו Od Mishehu 12:11, 2 November 2016 (UTC)[reply]
- @Od Mishehu: That is what I am saying: the bot could also log the blocks in a private log page, with a specific template linking to the necessary user actions and a 'checked=no' parameter, which humans can flip to 'yes' once they have manually checked it. That however does not preclude the bot making the initial block under the conditions set in this BRFA. --Dirk Beetstra T C 12:37, 2 November 2016 (UTC)[reply]
- I wasn't disagreeing with the conditions for the bot to make the decision, only pointing out that there needs to be a reasonable editable log of its actions. עוד מישהו Od Mishehu 13:12, 2 November 2016 (UTC)[reply]
- The proposal is to block any IP which adds 2 specified links in 120 seconds (values are configurable). That seems very desirable to me, and the suggestion of an editable log is unnecessary make-work. In the early stages of operation, each block by the bot could be examined by anyone interested, and would be examined by those involved in the plan. If what the bot does is seen to have undesirable side-effects, the procedure could be re-evaluated. My concern is that spam operators would learn to add links no more rapidly than once per 61 seconds. Johnuniq (talk) 10:06, 6 November 2016 (UTC)[reply]
- @Johnuniq: I foresee that as well, but given the M.O. of all the IPs I blocked in the process, one could easily go to 2 edits in 60 minutes and still have zero false positives. If they then are going to spam @ 1 edit every 31 minutes, we have basically achieved what we want anyway .. clear the log of hammering (sigh, then we will get rapid IP rotation .. to which our answer can only be to restrict it to IPs who add domains on our shortlist and block on first sight so at least they do not return a week later). --Dirk Beetstra T C 10:29, 6 November 2016 (UTC)[reply]
- I do indeed not really see the necessity (we will still be looking at the logs to see what happens and whether there are false positives, which should be easy to weed out when this is operational), but also no harm in the bot keeping a daily log of the IPs it blocks. Even if only for bookkeeping and manual re-evaluation when patterns emerge, and to make it a bit easier to check the actions. Just log them in a custom template with the IP and the date/time of blocking as parameters and we can later sort them by IP and find typical ranges to pre-emptively block (if there are no genuine edits in them) or even transfer the blocks to global. --Dirk Beetstra T C 10:33, 6 November 2016 (UTC)[reply]
- I have to admit that I've triggered the spamblacklist myself. I would support this only if:
- The block comes on at least the second attempt.
- The bot leaves a warning on the IP's talk page.
- The block is for a week, since this could potentially be abused by people looking to deliberately get library IPs blocked. VegasCasinoKid (talk) 12:09, 8 November 2016 (UTC)[reply]
- @VegasCasinoKid: the first two are indeed the case - the latter requirement I do not understand .. if an admin were to add urls to this list to get non-spambots blocked, then I would regard that as abuse of admin privileges by proxy (basically, because the admin added the domain, the block is functionally theirs, even though the bot did it for them). The list (User:AnomieBOT III/Spambot URI list) the bot uses as a check to block should (and currently does) strictly only contain urls that are being abused by spambots. Any links outside of that shortlist will not result in IPs being blocked. That list is editable by any admin, and the list is visible to everyone. Any links on that list that should not be there should be removed; any blocks caused by said domains should be overturned, and a discussion (at the least) with the admin who added it should be started. --Dirk Beetstra T C 12:27, 8 November 2016 (UTC)[reply]
- I think VegasCasinoKid is referring to the possibility that someone goes to the library and intentionally hits the blacklist because they want to get the library's IP blocked. OTOH, someone could get the same result by posting normal vandalism to highly-visible articles, and the blocks the bot will make do not prevent logged-in editors from editing. Anomie⚔ 13:18, 8 November 2016 (UTC)[reply]
- Also, blocking the talk page is too much. As for account creation, I think it's time to find a way where people can create accounts (without an email address) but delay them or require a harder captcha if the IP is determined to be a busy one. As far as I know the ACC team would be less inclined to make an account from an IP that just got blocked for spamming. VegasCasinoKid (talk) 00:39, 9 November 2016 (UTC)[reply]
- @Anomie and VegasCasinoKid: If someone were now to go to a library and use these domains in order to get the library IP blocked, they would get their way. If they then proceed to excessive talkpage vandalism, we would also happily remove talkpage access. They could also do simple petty vandalism for an hour or so, and we would happily block the IP. This bot would not make a difference to people who want to get an IP blocked.
- @VegasCasinoKid: Hitting their own talkpage is in the MO, and not disabling talkpage access is simply not going to help - an admin would just have to come behind them and revoke talkpage access to make sure the logs stay clear. See e.g. this editor, 1 hour - 24 blacklist hits, 23 to their own talkpage. Or this editor, 15 hits, 13 to own talkpage ..
- Regarding other ways .. these bots have broken the current Captcha, but yes, we know there are better solutions (much improved spam blacklist extension, much improved Captcha system), but that would actually require WMF to do something useful for the community. For now we need a solution as the logs are simply useless if there are extended periods where these spambots were covering >95% of the log. --Dirk Beetstra T C 03:41, 9 November 2016 (UTC)[reply]
- Those two cases would certainly merit blocking the talkpage, but the bot can extend the block to talkpage access if the IP or editor starts spamming the talkpage. As for account creation, I only bring it up because I've seen off-wiki sites that discourage granting account requests originating from a blocked IP. I also think the block length should depend on how much spamming is detected. More spam = longer block. VegasCasinoKid (talk) 04:56, 9 November 2016 (UTC)[reply]
- @VegasCasinoKid: Do you really want to see the hundreds of IPs I have blocked for these reasons, and how many of those have significant (if not only) talkpage hits on their list as well? It is part of their MO, I block these spambots on their first own talkpage hit with these domains. The bot will be a bit more lenient, blocking only on the second edit. These are not human editors, these are bots.
- I agree that account creation should not be blocked, though seeing that there are zero IP edits (of the many IPs I blocked, I recall having seen 'regular' edits on only one(!)), the point is largely moot. --Dirk Beetstra T C 05:03, 9 November 2016 (UTC)[reply]
- @VegasCasinoKid: See /Examples - 14 out of 45 hit their own talkpage, some exclusively. Note that we now block early on, so many may not get to own talkpage as we withdraw access pre-emptively. --Dirk Beetstra T C 05:38, 9 November 2016 (UTC)[reply]
- At least we can agree on something. Only some 30 editors are privy to checking IPs, so yes, I do think a harder captcha is a good idea. I know some sites do this to throttle creating too many accounts in short order. VegasCasinoKid (talk) 06:08, 9 November 2016 (UTC)[reply]
- @VegasCasinoKid: I am not disagreeing with you, and I am fully with you on your concerns. The problem is that if we make the bot act weaker than we already do (there are admins blocking IP ranges with talkpage access withdrawn, as many IPs seem to come out of certain regions, and I've blocked repeat-IPs for a year after the 1-month block expired), admins will simply have to come along behind the bot and tighten the blocks manually anyway.
- Overall, the problem should be solved by the developers, install a MediaWiki-wide better Captcha (and hope these are real spambots and not sweatshops), and completely overhaul the spam blacklist (so that certain rules can be set to 'no log' - we don't need to log the real rubbish, only for the grey area cases). <rant>But well .. that would mean that WMF would really have to do something that is of use to the community</rant>. --Dirk Beetstra T C 06:26, 9 November 2016 (UTC)[reply]
- @VegasCasinoKid: The problem is that, with at times >95% of the edits in the log being spambots (example block of 250 consecutive hits), there is no possibility for human editors to go through the list and e.g. help genuine editors who inadvertently run into the spam blacklist. The log becomes so utterly useless that we could just as well disable it. --Dirk Beetstra T C 06:56, 9 November 2016 (UTC)[reply]
- The log would still be useful. Like the abuse filter, the failed edits are logged. The solution would be to log the diffs. And even if the talkpage is spammed who really has interest in reading those anyway? VegasCasinoKid (talk) 07:06, 9 November 2016 (UTC)[reply]
- @VegasCasinoKid: If I wanted to see whether some semi-genuine site is still being spammed over the last month or two, I would have to go through hundreds of pages to filter out the 200-300 edits that are genuine (on some pages, >95% was spambot). Now, with a handful of admins actively (range-)blocking the IPs, we get much more useful logs (the last 250 hits cover >1 1/2 days; per the previous example, 250 hits would only cover 1 1/2 hours (~factor of 25 improvement) - imagine wanting to check just one month). I have in the past gone through blacklist hits after a site got blacklisted, found an IP trying to add it, who then happily edited by using an alternative spam address. That type of research is near impossible. I am currently running behind the maintenance/archiving bots that hit the blacklist, trying to solve the underlying issues (archiving of threads with blacklisted links, repairing blacklisted links, etc.). And yes, there are other solutions (e.g. make the log searchable), but that again would mean that we actually get something done by development. If the bot is not going to do it, then we will do it manually; it is (at the moment) the only way to get somewhat readable logs. --Dirk Beetstra T C 07:28, 9 November 2016 (UTC)[reply]
- What about the captcha that comes when you try to add a link? Are these bots capable of cracking them? Or do they check the wiki code for which words the captcha generates? VegasCasinoKid (talk) 07:48, 9 November 2016 (UTC)[reply]
- The Captcha is utterly useless, see also T125132. And a Captcha is assuming that this is a real bot, not a sweatshop (which I assume is correct). --Dirk Beetstra T C 07:55, 9 November 2016 (UTC)[reply]
- The better captchas are ones that use random images and you have to select the right ones. VegasCasinoKid (talk) 10:22, 10 November 2016 (UTC)[reply]
- The people who actually combat spam have to deal with the world as it is, rather than how it might be. Spam is a real problem and providing any opening for spambots/sweatshops just encourages more abuse. Johnuniq (talk) 10:27, 10 November 2016 (UTC)[reply]
- As I said before, there are many other options to solve this problem (having a better Captcha is one of them, having a proper rewrite of the spamblacklist is another, removing these rules from the blacklist and using an edit filter is another). The problem is, none of that is getting any priority (that is the flipside of the coin with VisualEditor, MediaViewer, Flow etc. on one side), or it is going to be excessively heavy on the server (a spam-catching edit filter is rather time consuming), so we are left with what we have. We should indeed continue to push to have a working system in the background, and push to have this task made obsolete. But knowing from experience that even real bugs are not actively resolved and stay open for a long time, and that feature requests are apparently lower on the priority list .. that is going to take ages. It is only that these IPs don't seem to have any 'real' use; otherwise the lack of incentive to solve this problem would cost more good-faith editors than all other incentives to increase the editor base gain. --Dirk Beetstra T C 10:38, 10 November 2016 (UTC)[reply]
- Overall I think spam filtering is a good idea, but to be on the safe side the parameters should be 3 tries in 2 min and blocks should be weeks -- not months -- at a time in case the IP is a shared one. While it's true the spammer will wait and come back it's to the benefit of the legit users of that IP. Since editing will require logged in access, I suggest using a template (in the block reason) that directs users to log in or create an account by request. VegasCasinoKid (talk) 10:11, 12 November 2016 (UTC)[reply]
- @VegasCasinoKid: As I alluded to above, I would not mind increasing lengths, starting even with 31 hours. But, according to T125132, there are on a global scale (quote) NO users (end quote) on these IPs, so the risk of that damage is minimal (it is what I see locally and on meta as well). The talkpage tags do have a notalk option enabled, and the standard block message is sufficient (and several admins do not tag them, and some are under rangeblocks). --Dirk Beetstra T C 17:34, 12 November 2016 (UTC)[reply]
- Just as an example, we were just hit by 176.213.36.24 (talk • contribs • deleted contribs • blacklist hits • AbuseLog • what links to user page • COIBot • Spamcheck • count • block log • x-wiki • Edit filter search • WHOIS • RDNS • tracert • robtex.com • StopForumSpam • Google • AboutUs • Project HoneyPot), an IP with zero edits on 733 wikis, but with 28 hits in 23 minutes, and hits here and on meta (I am not an admin elsewhere, so I cannot see any other or global logs). You can go through the log I have been keeping (User:AnomieBOT III/Spambot URI list/Log) to see how many of the IPs have edits anywhere (click 'x-wiki' for each IP). The risk that any of these IPs are going to edit constructively is way, way smaller than that of a non-static, editing IP here getting blocked with talkpage access withdrawn due to talkpage abuse. And all of them are pointed to WP:UTRS anyway. --Dirk Beetstra T C 12:36, 13 November 2016 (UTC)[reply]
- Going through the list, I do see some global contributions on some IPs - they are until now all related to the same spam though, hence not constructive and not collateral damage. --Dirk Beetstra T C 13:45, 13 November 2016 (UTC)[reply]
{{BAG assistance needed}} I have kept a log since this discussion started under User:AnomieBOT III/Spambot URI list/Log. I find myself adding several IPs on a daily basis. Can someone please have a look at this request, consider how what is suggested lines up with the observations recorded in the log, and act accordingly? Thanks. --Dirk Beetstra T C 10:27, 21 November 2016 (UTC)[reply]
- @Beetstra: the operator has not edited this page since 08-NOV, but I see you are answering questions - are you taking over this bot? — xaosflux Talk 05:15, 24 November 2016 (UTC)[reply]
- @Xaosflux: No, I am not planning to take it over .. most of the questions were regarding my original request, concerns over when to block etc. (in the meantime, we keep blocking several IPs on a daily basis). As far as I understand, User:Anomie is ready to run this. --Dirk Beetstra T C 05:56, 24 November 2016 (UTC)[reply]
- Confirmed. I just haven't had anything to add to the discussion since it was mostly back-and-forth about the request itself that Beetstra was answering. Anomie⚔ 15:21, 24 November 2016 (UTC)[reply]
- Approved for trial (14 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Anomie, for the first-stage trial, can you have the bot just produce a list of what it would block? — xaosflux Talk 15:30, 24 November 2016 (UTC)[reply]
- Let's give it a try. The bot will write to User:AnomieBOT III/Trial block log. Anomie⚔ 16:10, 24 November 2016 (UTC)[reply]
Log format
I'm not sure of the protocol for BRFA discussions, so I figured I'd start a new subsection for a specific issue...
The logfile at User:AnomieBOT III/Trial block log is formatted by the following line from User:AnomieBOT/source/tasks/SpamBlacklistBlocker.pm:
$txt .= "* [$t1] Would block $ip: Hit the blacklist $ct times in $timeframe seconds<sup class='plainlinks'>[//en.wikipedia.org/wiki/Special:Log/spamblacklist/$ip?offset=$t2]</sup>\n";
which renders the link to the spamblacklist log as a superscript in square brackets; it's an extlink, so it displays as consecutively-numbered throughout the page. But it looks exactly like a <ref>...</ref> footnote marker rather than an actual link itself. Actual extlinks shouldn't be superscripted. DMacks (talk) 21:25, 25 November 2016 (UTC)[reply]
- It's only an "external" link because I don't know of a way to supply the offset parameter with an internal link. Also the log is intended to be temporary for this trial. Anomie⚔ 16:17, 26 November 2016 (UTC)[reply]
- please wrap each IP in a {{IP summary|1.2.3.4}}. That gives a lot of spam-fighting links. --Dirk Beetstra T C 17:24, 26 November 2016 (UTC)[reply]
- @Anomie: Would you mind changing the following two things in the code: a) the above-mentioned wrapping of each IP in a {{IP summary|1.2.3.4}} in the log, and b) mentioning the IP in the edit summary. It makes administrating much easier (my main problem in seeing the efficacy of the bot is actually detecting the ones that are missed ..). --Dirk Beetstra T C 13:28, 6 December 2016 (UTC)[reply]
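For illustration, one way the logging line quoted above could be adjusted to wrap the IP in {{IP summary}} and carry the IP in the edit summary; the variable names follow the quoted snippet, but the summary handling and sample values are assumptions rather than the bot's actual code:
use strict;
use warnings;

# Sample values standing in for the task's real variables ($t1, $t2, $ip, $ct, $timeframe).
my ( $txt, $t1, $t2, $ip, $ct, $timeframe )
    = ( '', '2016-12-06T13:00:00Z', '2016-12-06T13:05:00Z', '192.0.2.1', 3, 120 );

# Log line with the IP wrapped in {{IP summary}} to expose the spam-fighting links.
$txt .= "* [$t1] Would block {{IP summary|$ip}}: Hit the blacklist $ct times in $timeframe seconds"
      . "<sup class='plainlinks'>[//en.wikipedia.org/wiki/Special:Log/spamblacklist/$ip?offset=$t2]</sup>\n";

# Edit summary mentioning the IP, so log watchers can see at a glance which IP was handled.
my $summary = "Logging would-be block of $ip ($ct blacklist hits in $timeframe seconds)";
print $txt, $summary, "\n";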
Disruptive bot can be blocked
My immediate concern was "what if the bot goes haywire?" With most bots, this doesn't matter, because we can just block the bot and it won't be able to do anything, but a blocking bot won't need to edit. However, just for everyone's information — when you're blocked, you can't use the block tool, so if AnomieBOT III misbehaves to the point that we have to block it, the bot won't be able to make further blocks until it's unblocked. Anomie, just please don't teach it how to unblock itself :-) Nyttend (talk) 17:36, 26 November 2016 (UTC)[reply]
- Besides blocking it, editing User:AnomieBOT III/shutoff/SpamBlacklistBlocker to not be empty will cause the SpamBlacklistBlocker task to stop without preventing BrokenRedirectDeleter from doing its thing. Anomie⚔ 19:41, 26 November 2016 (UTC)[reply]
- Of course, and I'd do that before blocking it. It's just that we must always be able to use blocking as a last resort; if we had a proposed bot that could somehow continue to make a mess while blocked (a concept that I originally thought described this bot), I would oppose it as being highly reckless, and I want to ensure that others don't oppose on those grounds. Nyttend (talk) 05:27, 27 November 2016 (UTC)[reply]
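For context, the general shape of such a wiki-page kill switch, sketched here with the page text already fetched; this is an assumption about the mechanism, as AnomieBOT's actual framework handles the shutoff check internally:
use strict;
use warnings;

# Any non-empty content on the shutoff page disables the task (whitespace is
# treated as empty here; the real behaviour may differ).
sub task_enabled {
    my ($shutoff_page_text) = @_;
    return ( $shutoff_page_text // '' ) !~ /\S/;
}

my $page_text = '';   # imagine this was fetched from User:AnomieBOT III/shutoff/SpamBlacklistBlocker
print task_enabled($page_text) ? "task may run\n" : "task halts\n";   # task may run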
Log analysis
I've updated User:AnomieBOT_III/Spambot_URI_list/Log#IP_list_.28with_bot.29 with a new section after the bot started (starting with the first IP noticed). Of the listed 12, 3 were noticed. Many of the remaining 9 were not noticed because the list of to-catch domains was incomplete (also updated since), and the rest only had one attempt. Its initial working has been correct. --Dirk Beetstra T C 06:43, 27 November 2016 (UTC)[reply]
- @Anomie: We're now about 2 weeks in, and have yet to find a false positive. I have recorded, from 19:21, 24 November 2016 till 14:18, 8 December 2016, 163 IPs, of which 93 were noticed by the bot (there are many false negatives - mainly because they are single edit, which is not much of a problem as they do not flood the log, or because they use unique domains, which is a problem but can easily be mitigated by adding the domains).
- As far as I am concerned, the bot can be set to actually block (as there are some IPs which manage to get through with a good number of hits at times when no-one is looking: 16, 20, 12, 28). We can then proceed to add any domains that are commonly hit but not yet blocked by the bot. I will now stop logging, and just add missing domains to the regex list. --Dirk Beetstra T C 11:51, 8 December 2016 (UTC)[reply]
Blocking Trial
- Approved for extended trial (14 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. You may begin trialing the blocking function - please report any issues here. This is effectively a go-live, keeping under BRFA for centralized review during the initial launch. — xaosflux Talk 13:56, 10 December 2016 (UTC)[reply]
- WP:AN notice left. — xaosflux Talk 15:50, 10 December 2016 (UTC)[reply]
- The block log for this trial can be seen here. — xaosflux Talk 02:00, 13 December 2016 (UTC)[reply]
I have been looking through the logs over the last couple of days; no mistakes recorded as yet. False negatives are there due to a) editors editing too slowly (which is fine .. they don't clog the logs), b) editors doing only one edit (fine, no clogging either), or c) urls not on the list (which shows intended use; we'll expand the list where needed/possible). Thanks Anomie, this does clear the log significantly!
One interesting observation: Special:Log/spamblacklist/5.164.227.67 apparently edited too fast, they managed to get 4 hits before the bot blocked them. --Dirk Beetstra T C 05:33, 13 December 2016 (UTC)[reply]
- The bot checks the spamblacklist log once per minute. That seemed a reasonable tradeoff between catching things quickly and not hammering the API too much. Anomie⚔ 16:06, 13 December 2016 (UTC)[reply]
- It is, as long as we don't get spambots that go way faster than that. 4 is very acceptable; when it becomes more than 10 we may want to poll faster. --Dirk Beetstra T C 20:10, 13 December 2016 (UTC)[reply]
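For illustration, a rough sketch of the once-a-minute poll described above, using the standard API (action=query&list=logevents&letype=spamblacklist); the OAuth signing and the restricted-log right the real bot needs are omitted here, so this exact request would not return the admin-only entries:
use strict;
use warnings;
use LWP::UserAgent;
use JSON;

my $ua  = LWP::UserAgent->new( agent => 'SpamBlacklistPollExample/0.1' );
my $api = 'https://en.wikipedia.org/w/api.php';

while (1) {
    my $res = $ua->post( $api, {
        action  => 'query',
        list    => 'logevents',
        letype  => 'spamblacklist',
        lelimit => 50,
        format  => 'json',
    } );
    if ( $res->is_success ) {
        my $data = decode_json( $res->decoded_content );
        for my $event ( @{ $data->{query}{logevents} // [] } ) {
            # Each entry carries the user (IP) and timestamp needed for the rate check.
            printf "%s hit the blacklist at %s\n", $event->{user} // '?', $event->{timestamp} // '?';
        }
    }
    sleep 60;   # poll once per minute, as described above
}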
- Half way through the supervised launch, the block logs appear good so far; I haven't seen any complaints - @Anomie: have you seen any issues? — xaosflux Talk 15:41, 17 December 2016 (UTC)[reply]
- Seconding this - I haven't seen any issues / mistakes. The usual as in the pre-block-trial (just false negatives). The overall log is now much calmer! --Dirk Beetstra T C 15:44, 17 December 2016 (UTC)[reply]
- I haven't seen any issues. Anomie⚔ 15:53, 18 December 2016 (UTC)[reply]
- A couple points:
- I agree with those above: false negatives are far preferable to false positives on any sort of "enforcement" type bot like this; the edits are already being blocked anyway.
- Similarly, as was mentioned, I'd probably prefer shorter block lengths at first, especially when dealing with blocks where talk page access is revoked (as appears to be necessary, given their behavior). It should be fairly simple to keep a database of previously-blocked IPs so that if/when they're re-encountered, they can be blocked for much longer. A lot of these don't seem to be repeat offenders. Anyway, not that huge of a deal; it can be tweaked over time.
- It might be an idea to avoid sticking the block message on the talk page, as this will probably end up just polluting the namespace over time with otherwise-useless edits
- --slakr\ talk / 23:22, 21 December 2016 (UTC)[reply]
- @Anomie: - any comments on Slakr's notes? — xaosflux Talk 19:28, 24 December 2016 (UTC)[reply]
- False positives vs false negatives depends mainly on what URLs people put on User:AnomieBOT III/Spambot URI list, there's nothing much I can do in code about that. I'd rather not deal with the complexity of trying to track block lengths on the suggestion of just one person. Since talk page notifications are standard, I'd also like more than just one person to comment before removing them. Anomie⚔ 16:32, 25 December 2016 (UTC)[reply]
- The standard block procedure includes notification, so for now I'd leave these in. I may support not adding them when doing so would also be a page creation. Additional follow-up on this can be at your or your bot's talk page after it runs for a while. — xaosflux Talk 18:38, 25 December 2016 (UTC)[reply]
- Approved. Task approved. — xaosflux Talk 18:39, 25 December 2016 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.