Wikipedia:Bots/Requests for approval/CeraBot2
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Request Expired.
Operator: Ceradon (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 05:38, Saturday June 16, 2012 (UTC)
Automatic, Supervised, or Manual:
Programming language(s): Python/Pywikipedia
Source code available: Standard Pywikipedia
Function overview: Transforms bare references to ones which use the proper templates. (Cite web, etc.)
Links to relevant discussions (where appropriate): DumZiBoT's BRFA.
Edit period(s): Daily
Estimated number of pages affected: 300-500
Exclusion compliant (Yes/No): Yes.
Already has a bot flag (Yes/No): No.
Function details: The bot uses the pywikipedia framework's reflinks.py. The bot would convert bare references into ones that use the proper citation templates, in a manner similar to one of Dispenser's tools. For instance, <ref>[http://www.google.fr]</ref>
would be given a bot-generated title and converted to use one of the many citation templates. If the bot detects a dead link, it appends {{dead link}} to the reference. In the case of duplicate references, the bot would leave only the first untouched and add a refname to the others found in the article.
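The transformation can be illustrated with a short sketch. This is not the actual reflinks.py source, just a minimal illustration of the behaviour described above; the requests dependency, the BARE_REF regex, and the fetch_title helper are all assumptions for the example.

<syntaxhighlight lang="python">
# Minimal sketch of a reflinks-style conversion (illustrative, not the
# pywikipedia implementation). Finds bare references, fetches each page's
# <title>, and rewrites the reference as {{cite web}} or tags it dead.
import re
import requests

BARE_REF = re.compile(r'<ref>\s*\[?(https?://[^\s\]<]+)\]?\s*</ref>')

def fetch_title(url):
    """Fetch a page and return its <title> text, or None on failure."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return None  # treat any fetch failure as a dead link
    match = re.search(r'<title>(.*?)</title>', resp.text, re.I | re.S)
    return match.group(1).strip() if match else None

def convert(wikitext):
    """Replace bare references with {{cite web}}, tagging dead links."""
    def repl(m):
        url = m.group(1)
        title = fetch_title(url)
        if title is None:
            return '<ref>[%s]</ref>{{dead link}}' % url
        return '<ref>{{cite web |url=%s |title=%s}}</ref>' % (url, title)
    return BARE_REF.sub(repl, wikitext)
</syntaxhighlight>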
Discussion
- The CeraBot2 account has not yet been created. I will create it as soon as possible. --Ceradon talkcontribs 05:38, 16 June 2012 (UTC)
- I can't help noticing you seem to have nicked this task from me, after I agreed to do it at Wikipedia:Bot requests/Archive 48#Bare reference conversion only yesterday. Rcsprinter (whisper) 11:18, 16 June 2012 (UTC)
- It's not a competition. Approved for trial (25 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. – Seems to be a simple enough task; uses standard pywikipedia. — madman 20:16, 19 June 2012 (UTC)
- A quick grep of an old dump shows at least 30,000 articles with bare references. Also, if you intend to run this bot well, be prepared to update the magic regexes. My ability to provide consulting will be more limited this time. — Dispenser 07:42, 21 June 2012 (UTC)
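For reference, a scan along the lines Dispenser describes could look like the following sketch; the dump filename is a placeholder, and the bare-reference regex is illustrative rather than the one he actually used.

<syntaxhighlight lang="python">
# Rough sketch: count pages in a database dump that contain at least one
# bare reference. Streams the compressed XML line by line.
import bz2
import html
import re

BARE_REF = re.compile(r'<ref>\s*\[?https?://[^\s\]<]+\]?\s*</ref>')

count = 0
page_has_bare = False
with bz2.open('enwiki-pages-articles.xml.bz2', 'rt', encoding='utf-8') as dump:
    for line in dump:
        if '<page>' in line:
            page_has_bare = False
        # Page text in XML dumps is entity-escaped, so unescape before matching.
        if BARE_REF.search(html.unescape(line)):
            page_has_bare = True
        if '</page>' in line and page_has_bare:
            count += 1
print('%d pages with bare references' % count)
</syntaxhighlight>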
- Dispenser, are you referring to regular expressions in reflinks.py? If so, without having to get too much into it, what are the shortcomings of the current regular expressions and how should they be fixed? This should be done by someone with access to Git so everyone can benefit from the fixes. — madman 15:28, 21 June 2012 (UTC)
- Nearly every time reflinks was run, NicDumZ and I spot-checked hundreds of diffs for mistakes and possible improvements. We're still getting complaints four years later. The regex blacklist probably needs updating to add more foreign keywords and URL matches.
- Other known issues: some sites only reply with gzip content (I've patched this locally); protocol-relative URLs are unsupported; sites serving invalid UTF-8 trip up UnicodeDammit; there is no mechanism for automatically blacklisting repetitive titles; and the User-Agent could be more informative. And I'm probably forgetting some things too. — Dispenser 22:43, 23 June 2012 (UTC)
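Two of these issues (gzip-only responses and an uninformative User-Agent) are straightforward to illustrate. The sketch below is an assumption about how a fix might look, using only the standard library; the bot name and contact URL in the User-Agent are placeholders.

<syntaxhighlight lang="python">
# Sketch: decompress gzip-only responses and send a descriptive User-Agent.
import gzip
import io
import urllib.request

HEADERS = {
    # An informative User-Agent identifies the bot and its operator.
    'User-Agent': 'CeraBot2/1.0 (https://en.wikipedia.org/wiki/User:CeraBot2)',
    'Accept-Encoding': 'gzip',
}

def fetch(url):
    """Fetch a URL, transparently decompressing gzip responses."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req, timeout=10) as resp:
        data = resp.read()
        if resp.headers.get('Content-Encoding') == 'gzip':
            data = gzip.GzipFile(fileobj=io.BytesIO(data)).read()
    return data
</syntaxhighlight>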
- Does the bot recognize {{Use dmy dates}}/{{Use mdy dates}}? Will it add accessdates? What other fields will be recognized? Will the bot check for archived (webcitation.org, archive.org) links? mabdul 21:08, 23 June 2012 (UTC)
- No; the regular version just copies the HTML <title> tag (including all SEO spam, e.g. "Daily Express: The World's Greatest Newspaper") into the wikitext, without templates or anything fancy. — Dispenser 22:43, 23 June 2012 (UTC)
- Most URLs include the release date, the publisher, and the whole title - so why do we need such a stupid bot? mabdul 09:43, 24 June 2012 (UTC)
- Furthermore, why wasn't it detected here that <ref name=velez> contains the same content as <ref name=autogenerated4>? mabdul 09:51, 24 June 2012 (UTC)
- In answer to your first point, Mabdul, the simple answer is to combat link rot, and because all citations are supposed to use a proper {{cite}} template. I do, however, agree there are lots of problems with this script. Rcsprinter (gas) 09:55, 24 June 2012 (UTC)
- I'm the first person to support a bot that tackles the link rot problem - if it does it 'right'. Add some additional fields, maybe use any cite template, fill out as many fields as possible (maybe on a hard-coded basis, e.g. completing all NYT refs with all fields, etc.), and I will support it. I know that this is really not easy, but as it stands this bot is totally useless and creates more work than it saves. Collecting a title from the <title> element is not really helpful. mabdul 10:05, 24 June 2012 (UTC)
- You're listing features from webreflinks. Hard-coded or not, it doesn't work well. Constant redesigns hinder hard-coding; sites use multiple designs for sections, regions, or publishing years; they use the same metadata fields for the article and the comment section; lots mark things up using only the <font> tag; and authors hide behind pseudonyms. Not even the Googlebot can get the publishing date reliably correct. I thought and worked on this, built a web interface, added references to scores of pages, and conceptualized how improvements would be fed back into the system. And then I realized:
People do not care about references
What our readers want (when they care) is that every book, journal, and website be accessible to them online. Of course, what makes this problem worse is that the WMF does not care about references; how else could you explain leaving $100,000+ in hardware savings and partnerships on the table? Despite the hired talent, they're still using analysis and thinking from 2005 for a top-10 website. Fuck, we're more popular than Twitter!
In the meantime, reflinks and webreflinks are still surprisingly popular despite their shortcomings, averaging 33 and 240 edits a day respectively. So some people clearly disagree about the helpfulness. — Dispenser 06:02, 25 June 2012 (UTC)
- One has punctuation; the other does not. — Dispenser 20:39, 24 June 2012 (UTC)
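In other words, an exact string comparison missed two references that differ only in punctuation. A hypothetical normalization step, sketched below, would catch such near-duplicates; the normalize helper and the sample strings are illustrative only.

<syntaxhighlight lang="python">
# Sketch: normalize reference bodies before comparing for duplicates, so
# that differences in punctuation or spacing do not hide a match.
import re
import string

def normalize(ref_body):
    """Lowercase, strip punctuation, and collapse whitespace."""
    ref_body = ref_body.lower().translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', ref_body).strip()

a = 'Smith, John. "An Article." Example.com, 2012.'
b = 'Smith John An Article Examplecom 2012'
print(normalize(a) == normalize(b))  # True: duplicates after normalization
</syntaxhighlight>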
{{OperatorAssistanceNeeded|D}}
Hasn't operated for more than a month since doing the first few edits of the trial. Are you going to continue? Rcsprinter (speak) @ 16:31, 27 July 2012 (UTC)
- Request Expired. – No response from operator when assistance was requested. This request may be re-opened at any point in the future when the operator's in a better position to resume the trial and to respond to any discussion of the request. Thanks, — madman 00:44, 9 August 2012 (UTC)
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.