Wikipedia:Bots/Requests for approval/FearBot
- The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Withdrawn by operator.
Operator: TheFearow
Automatic or Manually Assisted: Automatic (Supervised)
Programming Language(s): Java
Function Summary: Patrols Special:Newpages and checks for nuisance articles, then flags them with the IdentifiedSpam template.
Edit period(s) (e.g. Continuous, daily, one time run): For a start, when I am active. Once the bot has been running properly for a week or two, I will make it fully automatic on a dedicated server.
Edit rate requested: 10-12 edits per minute (MAXIMUM)
Already has a bot flag (Y/N):N
Function Details: Scans Special:Newpages and checks each article for nuisance content using a scoring system based on several factors and badword/goodword lists.
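As a rough illustration of the scoring approach summarised above (a sketch only, not the bot's actual source; the word lists, weights, and class name are placeholders rather than the real User:FearBot/Wordlists entries):

import java.util.Arrays;
import java.util.List;

public class NuisanceScorer {
    // Placeholder lists; the real ones would be loaded from User:FearBot/Wordlists.
    private static final List<String> BAD_WORDS  = Arrays.asList("placeholder-badword-1", "placeholder-badword-2");
    private static final List<String> GOOD_WORDS = Arrays.asList("[[category:", "{{infobox", "<ref>");

    /** Higher score means more likely to be a nuisance article. */
    public static int score(String wikitext) {
        String text = wikitext.toLowerCase();
        int p = 0;
        for (String w : BAD_WORDS)  { if (text.contains(w)) p += 5; }  // badword hit raises the score
        for (String w : GOOD_WORDS) { if (text.contains(w)) p -= 3; }  // signs of a real article lower it
        if (text.length() < 100) p += 10;  // very short new pages score higher
        return p;
    }
}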
Discussion
I have updated the request, so please make comments here. Thanks! Matt - TheFearow 01:38, 18 May 2007 (UTC)
OK. I've decided to wait until I have a better bot task, like some of my suggested stuff at WP:DEAD. Could a BAG member mark this as Withdrawn? Thanks! Matt - TheFearow 22:00, 19 May 2007 (UTC)
Archived Discussion
The discussion below was for the old functions. Please add comments on the updated request in the section above. Thanks!
Note: You do not have a bot flag. --ST47Talk 23:20, 14 May 2007 (UTC)
- Ok, thanks! I edited the request. TheFearow 23:30, 14 May 2007 (UTC)
- Thanks. Looks very interesting, but are you at all worried about biting? Any algorithm can give false positives; what is the extent of the supervision? Will you approve each tagging? Can you post your whitelist/blacklist somewhere for review? --ST47Talk 00:38, 15 May 2007 (UTC)
- Regarding biting, it scans the creator's talk page and uses the warning appropriate to how many previous warnings there are (uw-create followed by the number of previous warnings + 1, with a maximum of 4, as well as a note about it being a bot). The algorithm I'm using is being tested at the moment, just seeing what the results are of scanning the current list, and I am refining it as I go. Just before, I was getting an approximate 90% success rate at recognising spam; right now I've stuffed something up, but I think I have it sorted now. I won't approve each tagging, however it won't tag articles unless it is really sure, and if it's only partially sure it'll check with me. As for the whitelist/blacklist, I believe you mean the goodword/badword filters? I am currently working on that, as my previous system seems to have failed and was reporting that about 99% of all new articles were extreme spam, which obviously wasn't correct. The blacklist/whitelist I am going to have loaded from a subpage of the bot's user page, so when I get that functionality it will be public. I am going to request that the page gets semi-protected by an admin to stop vandals breaking the lists. Thanks! TheFearow 02:03, 15 May 2007 (UTC)
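A minimal sketch of the escalation just described, assuming the previous-warning count comes from counting uw- templates on the creator's talk page (the class name, regex, and note wording are illustrative, not taken from the bot's code):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WarningLevel {
    private static final Pattern UW_TEMPLATE = Pattern.compile("\\{\\{\\s*uw-");

    /** Picks uw-create(previous + 1), capped at level 4, with a note that a bot left it. */
    public static String nextCreateWarning(String talkPageWikitext, String articleTitle) {
        Matcher m = UW_TEMPLATE.matcher(talkPageWikitext);
        int previous = 0;
        while (m.find()) previous++;
        int level = Math.min(previous + 1, 4);  // maximum of 4
        return "{{subst:uw-create" + level + "|" + articleTitle + "}} ''(This message was left by a bot.)'' ~~~~";
    }
}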
- Addition: I have changed several things internally; these make it more efficient, it can properly check most pages, and it will flag ones it is pretty certain about (via Wikipedia:PROD) and ones it is completely certain about (via Wikipedia:SPEEDY). If it believes some are spam, but it's not sure, it will log their names at User:FearBot/notsure. Thanks. TheFearow 02:53, 15 May 2007 (UTC)
- Sounds good. Let me know when you're ready for a trial. --ST47Talk 10:27, 15 May 2007 (UTC)
- I am currently working on the finishing touches and some minor configuration, so it should be ready within a day or two. I am also considering using a category rather than a page - would this be easier? Something like "Category:Suspected Spam Pages" could be used by other bots as well, and it would be immediately obvious to editors that the page has been marked, so they can remove/delete accordingly. Any comments on the matter? I have also publicised the wordlists - they are at User:FearBot/Wordlists, so feel free to browse/edit. Thanks! TheFearow 20:56, 15 May 2007 (UTC) (Edited to use my signature, as I forgot to log in)
- I have been fine-tuning, and it should be ready for trial tomorrow. I have also been testing the automatic creator-notification and nomination functions. If you have a look through the history of User:FearBot/sandbox you can see it was nominated under the IdentifiedSpam template, WP:PROD, and WP:SPEEDY. Of course, in practice it won't report an article under everything, but that's an example of it. Also, more information on the algorithm is at User:FearBot/Factors and User:FearBot/Wordlists (the wordlist is semi-protected for vandalism reasons). Thanks! TheFearow 05:45, 16 May 2007 (UTC)
On the point of a trial - I would like to see this go through an extended bot trial for about 2 weeks, to allow objections to its use to surface. Notices will also need to be posted on Wikipedia:AN and Wikipedia:VP. (Note that the bot is not approved for trial yet) Now for some questions about the bot:
- How often will it check Special:Newpages?
- Will it load every page?
- How do you define spam? (I don't want a link to a page - I want an explanation from you of what you consider spam and what your bot will tag)
A number of the factors and some entries on the wordlist seem suspect, but again I would need to hear your definition of "spam" first (see also: Wikipedia:CSD and Wikipedia:SPAM). Martinp23 18:03, 16 May 2007 (UTC)
- I also get the feeling that some users would like to see the source code of this bot (I wouldn't mind a look, but am not a Java expert (in fact, I can probably only just about read it)). Martinp23 18:14, 16 May 2007 (UTC)
- Ok, here are my answers:
- It will check Special:Newpages whenever necessary; each time it loads 10 pages, so it'll complete as soon as it has processed that list. Each entry that is not spam just has to be loaded and parsed, which takes about 2 seconds; identified spam articles have to be loaded, parsed, and edited, and then the creator looked up and warned, which takes about 6 seconds. So it varies depending on how much spam is currently active.
- Yes, however it uses action=raw, so only the article source (not the entire page) will be loaded.
- I define spam as an unwanted article of little or no value to Wikipedia, or an obvious attack or nonsense page, such as the many "(Guy name here) is a guy who goes to (School name here) and he is ugly". It also gives high priority to very short articles (under 100 bytes, or about three words), and varying scores to articles only slightly larger. For information on how my bot categorises things as spam, see User:FearBot/Wordlists and User:FearBot/Factors.
Regarding the entries you consider suspect: none of the factors can individually make an article spam. It needs at least 5 of the minor factors to be considered possible spam, at least 8 to be considered PROD-able, and at least 12 to be DB-able. Each factor gives a different score, so I cannot say exactly what the effects are. A lot of things also lower the score, so it needs more if there are good factors like templates, images, links, categories, etc. On the wordlists, some words were added that I didn't think needed to be in there, but I found them in at least 5 different spam articles while I was testing (e.g. Lobster). About the source code, I may publish just the main evaluation function, to save space (as there are a lot of other functions that would waste space). Most of the functions I won't post are the ones that get the appropriate warning level and the ones that flag pages. The only one that might be required is the warning-level one, but that just counts the number of occurrences of uw- templates on the page. For the main wiki-access code, see jwbf.sourceforge.net, which is the framework I am using; I only modified it slightly to add the reading of newpages and history. I have read Wikipedia:SPAM, and it is not the same as my definition of spam; I mean spam in the general sense of unwanted messages/articles, rather than the definition at Wikipedia:SPAM. Thanks, and if you have any more questions, or you wish to be privately sent the entire code, please reply. If you just want the evaluation functions (which decide if an article is spam or not), say so and I'll make them public. TheFearow 21:21, 16 May 2007 (UTC)
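To make those thresholds concrete, the decision step could look something like the following sketch (the names here are illustrative; the real evaluation function is the one later posted at User:FearBot/EvalFunc):

public class SpamClassifier {
    public enum Verdict { CLEAN, NOT_SURE, PROD, SPEEDY }

    /** Maps the number of minor factors triggered to the actions described above. */
    public static Verdict classify(int minorFactorsTriggered) {
        if (minorFactorsTriggered >= 12) return Verdict.SPEEDY;    // completely certain: tag for speedy deletion
        if (minorFactorsTriggered >= 8)  return Verdict.PROD;      // pretty certain: tag with PROD
        if (minorFactorsTriggered >= 5)  return Verdict.NOT_SURE;  // possible spam: log to User:FearBot/notsure
        return Verdict.CLEAN;
    }
}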
- OK - on Wikipedia at least, spam (as defined at Wikipedia:SPAM and in Wikipedia:CSD) is the addition of useless external links, or the production of pages solely for advertising. Hence I would suggest that you reformulate the code which places the tag on the article to reflect this. I am not convinced that a bot should have approval to mark articles for PROD, nor to place warning templates on them, as these will, in most cases, not be consistent with the expectations of a normal editor. That said, for pages which are clearly vandalism/nonsense, I wouldn't object to the tagging of the page and the warning of the creator. However, a key part of the job of every Wikipedia new page patroller is to be responsive to the creators, in order to help them become helpful contributors, hence I would suggest that you should, for the first few months at least, only run the bot while you are connected to the internet and able to respond to comments left for it.
- I also want more community input on this proposal before giving it trial approval, so I am going to ask you to make posts to Wikipedia:VP and Wikipedia:AN in order to (hopefully) generate it. Could you post the score of each factor next to its respective name on the factors list as well, please?
- In summary, we need to make sure that the warnings make sense, and that the bot won't start biting the newbies. To do this, we need to carefully determine how it will choose articles to tag for deletion, and how it will avoid false positives. For example, "Martinp23 is a Nobel prize-winning Wikipedia administrator" is not valid to be deleted under CSD A7 (granted, it would be deleted under G1), because it contains an assertion of notability. On another note, I'm not convinced that we need a bot to do NP work, as no bot will be able to enforce speedy tags, or respond to {{hangon}}s, but that's just my opinion :). Martinp23 22:14, 16 May 2007 (UTC)
I'll post the full source of the evaluation function on the factors page, as that gives the exact detail better than I can describe it. PROD was being used as a midway point between suspected articles and definite articles - sending everything to AFD would clog it. Maybe I should just make it use speedy if it's sure, rather than only if it's completely sure?
As for the warnings, I use the uw-create and PRODWarning templates, with a little note afterwards saying that a bot left the message.
As for your example, the bot doesn't differentiate between criteria - that would be incredibly complex - it flags using the db|REASON tag (a shortcut for db-reason). Therefore the admin can delete under the appropriate criterion. This is no different to the current manual system, as (mostly) patrollers use the db-nonsense and db-bio tags - whichever is closest - not what it actually should be under.
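A small sketch of how that generic tag text might be assembled and prepended to the article (the class name and reason string are illustrative, and the actual page save through the JWBF framework is omitted):

public class SpeedyTagger {
    /** Prepends a generic {{db|reason}} tag to the article wikitext. */
    public static String tagForSpeedy(String articleWikitext, String reason) {
        return "{{db|" + reason + "}}\n" + articleWikitext;
    }
}

// Example use: tagForSpeedy(source, "Bot-identified nuisance page with no meaningful content")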
Regarding hangons, it doesn't delete anything, and the people who flag articles don't deal with hangons anyway; it's the admins who have to look at them.
Also, NP work is in the same category as RC work: you can't get it to enforce, and even AntiVandalBot will not try again if you revert back (even on obvious vandalism).
The idea of this is not to REPLACE editors, but to ease the load a significant amount. You set the required score to flag high enough that there are minimal false positives, while still removing a large percentage of nuisance articles.
Also, I am changing "spam articles" to "nuisance articles" in the request (the only change is on this page; the code uses the standardized page-creation user warnings, so no code change is needed).
If I missed anything, please point it out to me. Once I have posted the evaluation code etc., I will ask for opinions on WP:VP and WP:AN.
Thanks! TheFearow 00:11, 17 May 2007 (UTC)
- Code: I have added the evaluation function code at User:FearBot/EvalFunc. It is also linked from the Factors page. TheFearow 00:28, 17 May 2007 (UTC)
I'm responding to the posting at Wikipedia:AN. This looks like a reasonable proposal. YechielMan 02:22, 17 May 2007 (UTC)
Looks like a good idea; it needs testing, but all in all OK. I would suggest making sure it does NOT check pages that are newer than, say, 10 minutes, or pages that are new but have had edits in the last 10 minutes. This would be to avoid flagging bad first versions that the editor is still working on. I know that when I make a new page I would probably not have many of the things you count as good things in the first version, but give me 1 more hour and a few updates and it would have these things and not be flagged. Stefan 03:06, 17 May 2007 (UTC)
I have some experience with my bot User:AlexNewArtBot, which I used to generate lists of bad articles. I have a few observations:
- Rap music articles, etc. can have a huge number of bad words (all the song titles can easily be obscenities) but still be valid.
- Articles can be vandalized, but the original theme might be valid.
I am sceptical about the possibility of prodding and DBing articles with no false positives. I would suggest generating lists of potentially bad articles and then prodding and DBing them in a semi-automatic regime. If there are no false positives during a week's time, then the bot is safe for automatic prodding. You might consider re-writing the User:AlexNewArtBot/Bad rules; then ANAB would provide most of the functionality required. Alex Bakharev 06:22, 17 May 2007 (UTC)
I have a few hesitations regarding this bot. For one, I reviewed the scoring algorithm that you provided, and I have to say, without testing it on a large sampling of pages, that the values used to determine the scores and the evaluation of what these scores indicate appear entirely arbitrary. I see absolutely no justification for the weights placed upon certain items, nor a clear parallel to the judgments of what the scores indicate. Of particular concern to me are arbitrary judgments such as if(numlinks < 3){p += 10;}
and if(numimages == 0){p += 2;}
along with others. By the judgments you provide, a well-written article that provides fewer than 3 wikilinks is tagged as spam--this seems highly illogical to me. Similarly, a simple formatting mistake, such as typing '''Bold Text'''The text I meant to make bold'''Bold Text''', especially when coupled with not using wikilinks, not adding images, or, for some odd reason, using one or more exclamation marks, could lead to such a page being tagged as spam, prodded, or tagged for speedy. The likelihood of producing false positives by this method is way, way too high, and I would certainly not support approving this bot unless the algorithm can be dramatically improved and tested on a wide sampling of both manually-deemed acceptable and unacceptable articles.
Coupled with the tendency of the algorithm to produce false positives, I must echo concerns mentioned above about biting newcomers. As you may be aware, many new users create pages in short spurts--they write a page like "== Headline Text == '''Bold Text'''Dick Cheney'''Bold Text''' is the vice president of the united states." and over the course of many, many edits in a short period of time convert the article to a well-written and well-formatted article. By immediately tagging these articles for deletion and then posting threatening messages on the article creators' talk pages, you serve only to discourage newcomers from contributing anything at all. I do still remember my first article creation, one that may well have been deemed by your bot to be spam, and had I received a threatening notice from some bot immediately after making that edit, I likely would have refrained from contributing, assuming that Wikipedia did not welcome me. What I would suggest is that, instead of tagging these articles for deletion immediately following their creation, you notify users that they may have made formatting or style mistakes and point them to Help:Editing and Wikipedia:MOS; then you might also leave a link on some page to be checked at a much later time by administrators to see if any progress in the way of improving the article has been made, or if it was simply spam.
Additionally, and this is more of a semantic concern, the bot's name really needs to be changed. Along with above concerns about biting, I must say that a FearBot sounds to me like a bot to be used to intimidate evil spammers; even if you do not regard the bot as such, I assure you that many users will regard a message from a "FearBot" as quite intimidating. AmiDaniel (talk) 08:15, 17 May 2007 (UTC)
- Ok everyone, thanks for the messages; I will answer each in turn.
New articles that are spam are usually speedied within a minute by newpage patrollers - waiting any amount of time longer than that makes the bot unneeded. Generic album pages seem to always come up as spam - I am considering excluding them. The numlinks-smaller-than-3 check was a mistake; it should NOT be that high, that was a big error. It was supposed to be == 0 or <= 1, not smaller than three. Numimages I am going to remove, as it seems to be unnecessary, as a lot of articles don't have images. I always try to add images to my new articles, however I understand it is often hard to do and newcomers often don't. Regarding using the bold tags wrong - after patrolling newpages for a while, as well as watching newpages closely over the last week, there seem to be no problems with people accidentally using bold text etc. incorrectly - there was one time, but it was obvious they accidentally clicked, and it was a spam article anyway. Regarding threatening messages - I am using the standardized uw-create and PRODWarning templates; as far as I know these are not threatening, and the only reason it uses more than level 1 is if the user has already been warned. I could change the bot to put everything in the IdentifiedSpam template, which puts them in a category, but that wouldn't be as fast and it would mean another category admins have to watch. Regarding FearBot - would FearowBot be better, or do I need to eliminate the word fear altogether? One policy is that the name should somehow reflect the owner/operator's name; that's why I'm using Fear or Fearow. FearBot was also the name of an old MSN bot of mine, and an old IRC bot I made for a friend. If needed, I will be happy to change it. Lastly, for the first week of the trial I would just have it report, not DB or PROD. After that, I would have it do so, but I would be closely watching. Once it's public, I would only be running it when I'm on for at least the first month, and I will have it set to message me when anything happens. If there is anything I missed, please don't hesitate to point it out. Thanks! TheFearow 08:55, 17 May 2007 (UTC)
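A sketch of the corrected link-count check mentioned above (the weight of 10 is carried over from the earlier snippet and remains an assumption, as does the wikilink regex):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkFactor {
    private static final Pattern WIKILINK = Pattern.compile("\\[\\[[^\\]]+\\]\\]");

    /** Adds to the score only when the page has at most one wikilink (was: fewer than three). */
    public static int linkScore(String wikitext) {
        Matcher m = WIKILINK.matcher(wikitext);
        int numlinks = 0;
        while (m.find()) numlinks++;
        return (numlinks <= 1) ? 10 : 0;
    }
}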
- If new page spam is speedied within a minute by human newpage patrollers, the bot is not needed! If it is not done by human newpage patrollers, you can just as well wait 10 minutes and avoid biting the newbies. If the bot is so good that it never (or very rarely, and by that I mean less than maybe 1 in 1000) makes mistakes, I can accept it tagging pages directly; otherwise I would really suggest waiting at least 10 minutes after the last edit. See AmiDaniel above, who might never have stayed, and I'm sure this bot will make some users never come back if it tags pages directly. Sure, a prod is not threatening, but it is still not very welcoming either; remember this bot would probably not wrongly tag pages from experienced users, more likely newbies. Stefan 13:10, 17 May 2007 (UTC)
- If by spam we mean Wikipedia:AUTO pages, then a significant number of them survive new page patrol perfectly well. Look through User:AlexNewArtBot/COISearchResult for fine examples. Alex Bakharev 14:36, 17 May 2007 (UTC)
- I do not think that the bot is intending to find Wikipedia:AUTO pages. Stefan 14:45, 17 May 2007 (UTC)
I am against using this bot. Let's indeed look at User:AlexNewArtBot/COISearchResult for examples. Some articles are just fine - there are no reasons for deletion. Look at Farah Abushwesha - this is just one of many examples. If these articles are marked for "non-controversial deletion" (and people rarely look through these lists), they will be deleted. Such a bot can be used for only one purpose: to identify articles that need an editor's attention and mark them as such. No marking for deletion by bots, please. Biophys 16:31, 17 May 2007 (UTC).
- I concur. No marking for deletion by bots, only tagging for human attention. W. Frank 17:41, 17 May 2007 (UTC)
- Question - I created a couple of pages, one that got nailed by a goodword/badword bot and one that didn't. One was a redirect to an essay I wrote, at Wikipedia:DONTGIVEASHIT, that I contested the speedy on, and it had to sit at Wikipedia:RFD for a week just to get an 8-1 consensus to keep; the other, at Bong Hits 4 Jesus, the common name for a Supreme Court case, originally had the full article but is now a redirect to the official case name. This one did not alert any bot or newpages patroller. What sort of steps would you take to make sure legitimate pages such as these are not tagged by accident? I'd suggest keeping it away from the Wikipedia: namespace, for one. -Mask? 17:18, 17 May 2007 (UTC)
- Comment. I'm greatly puzzled by the fact that containing an external link is a factor that helps your bot mark a page as not spam. At least in my personal experience, I've found a lot of spam pages that contain external links. Many pages serve, at least in what I've seen, only the purpose of containing one or two external links. Why then include external links as a factor indicating a page is not spam? Cool3 19:09, 17 May 2007 (UTC)
Subsection 1
Ok, to answer AKMask's question: it only monitors the main namespace. Also, I am considering switching this just to IdentifiedSpam, which can be watched by human editors. Can I get an opinion on this, and should I submit a new bot request or edit this one, as it has a majorly different purpose? And yes, all nuisance articles are picked up already, however this was designed to reduce the load on human editors so they could work on more important tasks, and it would still fill that role just identifying spam. Can I get comments on switching to IdentifiedSpam rather than SPEEDY or PROD? Thanks! TheFearow 21:01, 17 May 2007 (UTC)
- Knowing that, I'll support this bot, but I'd be even more enthusiastic about it if it just tagged things 'identified spam' rather than tagging them for deletion. -Mask? 21:28, 17 May 2007 (UTC)
- Ok, I am going to switch to marking pages only with the IdentifiedSpam template, which places them in a category.
Should I edit this request or create a new one? If I should create a new one, mark this one as Withdrawn By Author and I will create a new one with the new purpose. Thanks! TheFearow 22:49, 17 May 2007 (UTC)
- Just keep it in this one, perhaps redefine the processes below - all of the above discussion is relevant, so no reason to withdraw (unless you really want to, perhaps if you change the bot's username?) Martinp23 22:51, 17 May 2007 (UTC)
Here are my thoughts... Newpages patrol isn't very taxing; pages aren't created at the rate that AntiVandalBot and other anti-vandalism bots have to deal with on RCP. If anything, I say we need more admins on Newpages patrol (a Jeffrey O. Gusatafon Shazaam! moment), as I've had more problems with new users removing speedy tags than with tagging articles in the first place. hbdragon88 01:08, 18 May 2007 (UTC)
OK. I've decided to wait until I have a better bot task, like some of my suggested stuff at WP:DEAD.
Could a BAG member mark this as Withdrawn? Thanks! Matt - TheFearow 21:59, 19 May 2007 (UTC)
Withdrawn by operator. Martinp23 22:03, 19 May 2007 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.