Wikipedia:Bots/Requests for approval/CheMoBot
- The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.
Operator: Dirk Beetstra T C
Automatic or Manually Assisted: Automatic
Programming Language(s): Perl, Perlwikipedia
Function Summary: Extract infoboxes from mainspace pages and save them on subpages of a WikiProject so the data in them can be verified. From then on, report changes to mainspace infoboxes (when 'verified' data gets changed) to a report page under the same WikiProject for manual review.
Edit period(s) (e.g. Continuous, daily, one time run): Continuous
Already has a bot flag (Y/N): No
Function Details:
I am writing this bot in stages, and I will ask for separate permission for the further functions, but the possible future tasks may be mentioned here:
Introduction
Some data on mainspace pages is not likely to change and, importantly, should not be changed without being properly discussed and/or backed up with proper references. E.g. we all know that the boiling point of water under normal conditions is 100 degrees centigrade. That number is not likely to change, and anyone who changes that number on that page is likely to be reverted. Typically, these numbers are kept in infoboxes (for Water (molecule) this is in a {{chembox new}}).
These infoboxes generally contain that type of data, which is not likely to change. And for a number of them, we know which value is correct (or at least verifiable), and often there are numerous references to back them up. This goes for the boiling and melting point of water, and the birthdate of John F. Kennedy (May 29, 1917); changes to these numbers are easily traced (many people know them by heart). But on other pages such numbers are also verified, correct, and not likely to change, yet less well known, so detection is less obvious. The melting point of Trimethylphosphine is stated to be -86 degrees centigrade; I don't know if it is correct (though I expect that it has been checked), but I would not be able to detect whether an editor who changes that to -55 degrees centigrade is actually putting in the correct value, changing it in good faith, or 'vandalising' the page (example: this ..). It would be nice to have a bot to check those changes against a repository of verified values. That is what this bot is trying to accomplish.
For whom
I am writing this bot for Wikipedia:WikiProject Chemicals and Wikipedia:Wikiproject Pharmacology, which keep a lot of such 'unlikely-to-change' data in infoboxes ({{chembox new}} and {{drugbox}}, respectively), and quite a lot of that data has been checked and verified. I am trying to keep the bot scalable, so that other infoboxes can easily be added (which may in the end result in a bot rename).
Tasks
The first task this bot is going to perform is to build a repository of data as subpages of Wikipedia:WikiProject Chemicals. This data can then be checked and verified. During that time, and afterwards, the bot will trace edits to mainspace pages containing a {{chembox new}} or a {{drugbox}} and check changes against the (verified) WikiProject copy (it will not correct them). When data in the infoboxes gets changed, it will report the edit to a log, which will also be stored under the WikiProject.
Working
The bot is written in Perl, and most of its working can be controlled via User:CheMoBot/Settings. There are settings for which boxes are monitored, which fields in those boxes, the wikiproject(s) involved, representatives for the wikiproject(s), IRC representatives for the off-wiki control, etc.
More technically, it 'extracts' the infoboxes from the pages (for a diff it extracts them from both the old and the new revid) and compares them, and the change, against the infobox in the WikiProject copy. If a monitored field is changed, it reports the addition/deletion/change.
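To make this step concrete, here is a minimal Perl sketch of what such an extract-and-compare routine could look like. The function names, the simple brace-counting parser, and the field regex are illustrative assumptions, not the bot's actual code; a real implementation would also have to cope with nested templates inside field values and multiple boxes on one page.

 use strict;
 use warnings;

 # Pull the first {{chembox new ...}} (or other template) out of a page's
 # wikitext by counting '{{' and '}}' pairs.
 sub extract_infobox {
     my ($wikitext, $template) = @_;
     return '' unless $wikitext =~ /\{\{\s*\Q$template\E/i;
     my $start = $-[0];
     my ($depth, $pos) = (0, $start);
     while ($pos < length $wikitext) {
         my $chunk = substr $wikitext, $pos, 2;
         if    ($chunk eq '{{') { $depth++; $pos += 2; }
         elsif ($chunk eq '}}') { $depth--; $pos += 2; last if $depth == 0; }
         else                   { $pos++; }
     }
     return substr $wikitext, $start, $pos - $start;
 }

 # Split "| field = value" lines into a hash (ignores nested templates for simplicity).
 sub parse_fields {
     my ($box) = @_;
     my %fields;
     for my $line (split /\n/, $box) {
         $fields{$1} = $2 if $line =~ /^\s*\|\s*([\w ]+?)\s*=\s*(.*?)\s*$/;
     }
     return %fields;
 }

 # Report the monitored fields whose value differs between two revisions.
 sub changed_fields {
     my ($old_text, $new_text, $template, @monitored) = @_;
     my %old = parse_fields(extract_infobox($old_text, $template));
     my %new = parse_fields(extract_infobox($new_text, $template));
     return grep { ($old{$_} // '') ne ($new{$_} // '') } @monitored;
 }

 # Illustrative call (field names assumed, not checked against the real template):
 # my @diff = changed_fields($old_wikitext, $new_wikitext, 'chembox new', 'MeltingPt', 'BoilingPt');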
The current working (on a small set of test pages) can be seen on IRC in the wikiproject chemistry channel and my own bot channel.
Summarising
Summarising, the tasks I ask permission for here are:
- Creating copies of the infobox data in WikiProject space for verification (in the end there will be about 4,000-5,000 such pages in WikiProject space), and possibly using an off-wiki database with verified data to update the pages in WikiProject space. For the trial phase: about 100 edits for the former (copying/updating from mainspace), 50 for the latter (updating from the database)?
- Creating on-wiki logs of changes to data in {{chembox new}} and {{drugbox}} for review of the bot's detection process and, where necessary, MANUAL repair of the mainspace pages. For the trial phase I ask for a couple of days of logging (this also gives us a feel for how many edits we have to handle).
Technically, the bot could also do the reverse ('repair' infoboxes in mainspace after changes), but for that we first need the verified data and a thorough test of the mechanism (which will be visible via the mentioned log). I'd like to keep that part outside this discussion for now (this bot will NOT edit in mainspace under this permission), and will ask for separate permission when such tasks are viable!
--Dirk Beetstra T C 14:08, 15 July 2008 (UTC)[reply]
Discussion
Sounds like a very good idea, and it could be used in many different projects. I can't comment on the programming and issues that could arise because I have no experience when it comes to Perlwikipedia. It sounds good to me! Printer222 (talk) 15:29, 15 July 2008 (UTC)[reply]
- Perhaps a better idea than creating subpages to store the verified data in is to store it as wikimarkup on your local machine (the plain text shouldn't be that large). The bot could then be set up to change the local copies to reflect any 'verified' edits (perhaps by certain people, like a whitelist?). The logs would need to be public, but this shouldn't be a problem. Just a few ideas! RichardΩ612 Ɣ ɸ 15:39, 15 July 2008 (UTC)[reply]
- Re: Richard0612. We have been thinking about an off-wiki database for this (MySQL would be suitable), but it poses a couple of problems. First, the data would not be 'free for anyone to edit', and the data, even if verified, will sometimes need changing. Secondly, only a few people would have access to the database. I don't mind doing that for one or two wikiprojects, but if more projects opt in, it may become quite a task, and as I would then be responsible for the data as well, I would also have to check all of it. If we keep the data on-wiki, anyone can change the verified data (unless the verified pages get protected, but even then a significant number of people can change the data), and other projects, especially the ones where I am less knowledgeable, can manage their own data. A third point (which would also apply to an on-wiki database format) is that some of the first-party verified data (e.g. a database available to some members of Wikipedia:WikiProject Chemicals) has to be handled differently due to some agreements.
- I am thinking of using databases to update the project copies of the boxes with verified data. --Dirk Beetstra T C 15:52, 15 July 2008 (UTC)[reply]
- I could be missing something, but if the wikiproject copy of the data is free to edit, doesn't that mean that unscrupulous people could vandalise these copies just as easily as the live ones? RichardΩ612 Ɣ ɸ 15:59, 15 July 2008 (UTC)[reply]
- If that happens on a regular basis (these copies are less visible to the public), then they can be protected, so that e.g. only admins (still a big group) can edit them. --Dirk Beetstra T C 16:05, 15 July 2008 (UTC)[reply]
If you are committed to using wiki pages to store the "clean" data, could you find a way to store more than one infobox worth of data on a single wiki page? For example, maybe you could make a page for each letter of the alphabet, and put all the infoboxes on the corresponding page? 5000 pages makes it hard to watchlist them all, hard to make any uniform edit to all of them, and so on. But do be aware of size limits on pages - I can answer questions about them on my talk page. — Carl (CBM · talk) 17:05, 15 July 2008 (UTC)[reply]
- It would be easier for the bot to have all the data in off-wiki databases, but that makes it much, much harder to maintain and to alter. If a project wants to add another field to the/a database, they will have to ask me to kill the bot, change the bot, change the database, start the bot again, and add the data. I don't know how many different infoboxes we have, but I think that the more-transcluded infoboxes will also have more people watching the verified data. And changes to a record in the database also have to be made by someone with access to the database. All of that makes it better, in my view, to have the database on-wiki.
- Watching them all is not necessary if an admin in the project turns on transclusion protection for the verified-data template. As it is now, the data in mainspace is in {{chembox new}} and the verified data is in a {{chembox new verified}}; the latter should not be used in mainspace, and hence can be transclusion-protected when maintenance is not needed and the project does not want to have to fight vandalism on those pages. When changes are needed, take down the protection for a while and then keep an eye on vandalism to those pages.
- We talked in the Chemicals WikiProject about having an on-wiki database (just a wikipage with an XML layout, e.g.), but that has problems with readability, and typos have drastic effects (damaging one tag kills the whole database, e.g.). Similarly, a CSV-type format is possible, but that also suffers in readability. And for those on-wiki databases it still holds that the pages are going to be HUGE. If someone knows a good in-between, that would be nice. --Dirk Beetstra T C 09:14, 16 July 2008 (UTC)[reply]
- I would not be comfortable with using page protection on these - it's not mentioned in the protection policy, and our general practice is to avoid preemptive protection whenever possible. And you can't trust page protection, since it can be accidentally removed and admins can make mistakes. One way to detect vandalism would be to keep an md5sum of each verified infobox on the computer that runs that script, and check the wiki copies from time to time (a small sketch of such a check follows after this comment).
- What I was thinking about storage was just to put 50 or 75 infoboxes per page, instead of one infobox per page. That doesn't require xml or csv formatting. You could separate the infoboxes with section headers, and it would be simple both for bots and humans to understand the data. If you run into template limits doing that, you could just add nowiki around the infoboxes. — Carl (CBM · talk) 13:24, 16 July 2008 (UTC)[reply]
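For the md5sum check suggested above, a minimal Perl sketch could look like the following. Nothing like this exists yet; the %known hash, the example title and checksum, and the use of index.php?action=raw are all illustrative assumptions.

 use strict;
 use warnings;
 use LWP::Simple qw(get);
 use Digest::MD5 qw(md5_hex);
 use URI::Escape qw(uri_escape_utf8);
 use Encode qw(encode_utf8);

 # page title => md5 of the wikitext as it was when last verified (example value only)
 my %known = (
     'Wikipedia:WikiProject Chemicals/Index/B (0)' => 'd41d8cd98f00b204e9800998ecf8427e',
 );

 for my $title (sort keys %known) {
     my $url  = 'https://en.wikipedia.org/w/index.php?action=raw&title='
              . uri_escape_utf8($title);
     my $text = get($url);
     next unless defined $text;                      # skip on fetch failure
     my $sum = md5_hex(encode_utf8($text));          # encode first: md5 works on bytes
     print "CHANGED since verification: $title\n" if $sum ne $known{$title};
 }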
- 50-75 might be doable. Not sure how it will be with finding the data though. 50 boxes would mean about 25 pages for {{chembox new}} (I guess), on which the bot would have to find the correct one (not sure how to do that quickly, and it would be quite some throughput for the bot). Or I have to also add an index to it (I have to think about this).
- I don't think that protection would be needed, generally. The pages are not very visible, and undiscussed edits by people unknown to the project can be treated as suspicious. Only if there is really persistent vandalism should page protection be considered. --Dirk Beetstra T C 13:53, 16 July 2008 (UTC)[reply]
Here are the counts for chembox new by first letter of the article title:
A: 293, B: 220, C: 418, D: 290, E: 103, F: 82, G: 121, H: 134, I: 148, J: 6, K: 17, L: 121, M: 289, N: 141, O: 68, P: 405, Q: 15, R: 74, S: 313, T: 385, U: 92, V: 52, W: 11, X: 24, Y: 15, Z: 48, 1: 105, 2: 126, 3: 49, 4: 40, 5: 21, 6: 5, 7: 8, 8: 3, 9: 3, Other: 8
If you estimate 1k per chembox, you could fit up to 100 chemboxes per page, which means that most letters could have a single page, but some would need to be divided further. — Carl (CBM · talk) 14:21, 16 July 2008 (UTC)[reply]
- Heh, nice statistics. 100 chemboxes per page, and the first character alone is not enough .. which means that I either need an index (which needs maintaining as well), or resort to the first 2 characters, in which case we end up with 900 subpages per project anyway (and this is only {{chembox new}}; there are many boxes out there, some of which may have much higher numbers, and that may in the end result in problems with scalability). What are the arguments against actually making 2000 subpages? Yes, the watchlist becomes long if you want to watch them all, but they are not edited often (low visibility), you don't need to watch them all (it can be pooled), the bot can help with it anyway (working on that), and it beats watching the 2000 pages in mainspace (which will also not be on one person's list). Advantages: if one subpage gets vandalised too much, it can be protected without protecting the (max) 99 others; it is easy to find the data for one of them (just insert the prefix); and it is quick and easy to parse for the bot (yes, I am lazy as well ;-) ). --Dirk Beetstra T C 14:31, 16 July 2008 (UTC)[reply]
- I did check two-letter prefixes as well as one-letter prefixes. What I found is that the distribution is not at all even - some prefixes, like "1-", have over 100 pages, but most prefixes have only one or two. So rather than making 26*26 pages, you could probably get by with 100 or fewer. One option is to just use single letter prefixes, sort articles alphabetically within each prefix, and then split a single-prefix page into pieces no more than 100 articles long ( A (1), A (2), etc). This wouldn't be very hard to code.
- There are about 4300 chembox new transclusions. There are several reasons not to create 4300 pages to mirror them:
- Humans can't do maintenance on that scale - you are committing yourself to doing a bot run anytime you need to make a change to the pages.
- Any time you want to check the pages, you have to do 4300 page requests, which takes a long time, is inconvenient, and while not a burden to the system is still unneeded work for it.
- The system isn't scalable. What happens when the number of chemboxes grows from 4,000 to 8,000 or 12,000 over the next couple of years? Perhaps 4,000 is at the edge of what could be done by a bot - but would 10,000 still be reasonable? If two or three other projects ask for the same system for their pet templates, the one-page-per-infobox system could lead to hundreds of thousands of individual pages. But if your code is successful, I expect a lot of other projects would ask for it. This is the strongest reason not to use the one-page-per system.
- Other bots that have generated large numbers of individual pages have been shot down by the site admins. They recommend working smarter, not harder.
- If the only issue is time and/or experience writing this sort of code, I'd be glad to help. — Carl (CBM · talk) 14:47, 16 July 2008 (UTC)[reply]
- OK, makes sense. So it will be indexing. There must be a smart way to find the right box and to do this .. --Dirk Beetstra T C 15:08, 16 July 2008 (UTC)[reply]
- I was thinking about how to implement it just now. One idea is: if there are fewer than 100 pages for index A, just list their contents at page A, separated by section headers. If there are more than 100, split them alphabetically into groups of 90 in subpages: A (1), A (2), etc. Then use A as an index that can be used by the bot to find the right subpage given the name of the article. To add another article, just add it to the right subpage and update the index. The reason to use 90 instead of 100 is that you can add ten more articles "free" before you need to re-split again into groups of 90. Each of those inserts would only take two edits (index page and subpage) instead of needing to edit all the subpages, so it's a lot cheaper. — Carl (CBM · talk) 15:20, 16 July 2008 (UTC)[reply]
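A minimal Perl sketch of that split-into-groups-of-90 idea, under an assumed index format of one '* [[Article]] = subpage' line per article (both the format and the function name are illustrative, not anything agreed on here):

 use strict;
 use warnings;

 # Given the article titles under one prefix, assign each to a subpage of at most
 # 90 entries and build the wikitext of the index page mapping article -> subpage.
 sub build_index {
     my ($prefix, @titles) = @_;
     @titles = sort @titles;
     my %subpage_of;
     my $groups = 0;
     for my $i (0 .. $#titles) {
         my $group = int($i / 90) + 1;      # groups of 90 leave room to grow to 100
         $subpage_of{ $titles[$i] } = "$prefix ($group)";
         $groups = $group;
     }
     my $index_text = join '', map { "* [[$_]] = $subpage_of{$_}\n" } @titles;
     return ($index_text, $groups);
 }

 my ($text, $n) = build_index('A', 'Acetone', 'Acetic acid', 'Aniline');
 print "$n subpage(s) needed:\n$text";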
- OK, got it. See Wikipedia:WikiProject Chemicals/Index/B (0). I can make the bot read that just as easily (I just have to be aware of some strange characters which may appear in some chemical names). The header of the section defines the name of the mainspace article. It is a bit more error-prone (if you delete a character from the section header the bot can't find it, and an editor will maybe regenerate it from mainspace), but it takes down the number of subpages drastically (index pages are now 'B (0)', 'B (1)'). Hope this resolves this issue. --Dirk Beetstra T C 18:15, 16 July 2008 (UTC)[reply]
- Now I have to write the 'copy them from mainspace' .. --Dirk Beetstra T C 18:15, 16 July 2008 (UTC)[reply]
- Actually, I don't bother with an index anymore; I just browse B (#) until I either retrieve an empty page or find the compound. --Dirk Beetstra T C 18:17, 16 July 2008 (UTC)[reply]
- I think you would have to write "copy from mainspace" either way :)
- One thing to be wary of is the template limits. If you put too many templates on the same page they won't all be expanded. It may be necessary to put nowiki around them; editors can always remove the nowiki to review the template before saving. — Carl (CBM · talk) 21:02, 16 July 2008 (UTC)[reply]
- Done: new setting. It can be tweaked, though if you exceed it you will have to repair it manually for now. I hope 75 is safe, but we can test it later. --Dirk Beetstra T C 15:55, 17 July 2008 (UTC)[reply]
- There are lots of hash-function techniques that can be used to keep a balance between the number of pages of infobox datasets and the number of datasets per page. One more logical and general solution than "I just browse B (#) until I either retrieve an empty page or find the compound" would be to have a special "index" page format that links to other pages, which could themselves be index pages or actual data pages. So you start by reading the main index page, decide which of the listed entries on it is where a given compound would be, and read that page. If that page is the compound data itself, then you're done. If it's another index, read it and recurse. Probably the "browse B (#)" approach is easier and sufficient for the scale we have now, but actual indices would be less server load and might become more necessary for larger-scale uses. DMacks (talk) 19:44, 19 July 2008 (UTC)[reply]
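A rough Perl sketch of such a recursive lookup, under a purely hypothetical 'index:' line format (neither the format nor the function exists anywhere; this only illustrates the recursion):

 use strict;
 use warnings;

 # $fetch is a code reference that returns the wikitext of a given page title.
 # Index pages contain lines of the form "index: <prefix> -> <target page>";
 # pages without such lines are taken to be data pages.
 sub find_data_page {
     my ($fetch, $page, $compound) = @_;
     my $text    = $fetch->($page);
     my @entries = $text =~ /^index:\s*(\S+)\s*->\s*(.+?)\s*$/mg;
     return $page unless @entries;          # no index lines: this is the data page
     while (my ($prefix, $target) = splice @entries, 0, 2) {
         return find_data_page($fetch, $target, $compound)
             if index(lc $compound, lc $prefix) == 0;   # compound starts with prefix
     }
     return undef;                          # no branch covers this compound
 }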
- Regarding secondary pages for the data, a few issues come to mind (not sure how intimately and/or intrinsically they relate to the bot at hand). Storing the data on-wiki seems pretty important for the spirit of WP (assuming it doesn't conflict with any agreements we have for data verified by other groups); keeping the pages protected pre-emptively might be overkill. Keeping the data on a few centralized pages (subpages, or even collections of secondary pages with known names or on some other wikiproject; either transcluded or just "authentic" copies of the verified data) makes it easier for humans to look over the bot's shoulder. If editors watchlist those pages, only changes to the data would be flagged. The current format of the chemicals pages (the only place for the data is the chembox in the article itself) is annoying to watch, because any change to article prose triggers a watch just like a change to data. Further, moving this "factual" (and in some cases "derived from other factual": mol wt from mol formula, etc.) data out of the main pages makes it less likely for skool-kidz to find it to vandalize it, and makes the article pages themselves simpler ("less crap to scroll through to find the article text to edit"). Lots of specific advantages here, which could mitigate "no full-page transclusions, period" and related sentiments. Also (for bot and other external programmatic uses) having just this stuff on a separate page makes it easier to avoid false positives for changes: it is easier to parse if it's a predictable format without extraneous material (other types of templates with same-named fields, for example). DMacks (talk) 19:56, 19 July 2008 (UTC)[reply]
- Positive data-point for approval: I've been watching #wikichem diagnostic output/testing of various functions while this program has moved towards becoming this bot. Haven't seen any evidence of it mis-detecting page changes (either pages it shouldn't care about or changes to parts outside the chembox). DMacks (talk) 20:00, 19 July 2008 (UTC)[reply]
- I was thinking this weekend: there may be a fourth place to store data: using the box stored in a permanent revision of the page itself. We could then work with an index somewhere, with one page per line, stating 'Pagename=123456789'. If you improve the data in an infobox on a certain page, it is sufficient to update the revid in the index, and the bot will start using that (a small parsing sketch follows after this comment).
- This trick may be enough for certain boxes (though I am planning some enhanced possibilities, which would not be possible to store in a working revision), but I could also make that an option in the settings (whee .. setting-creep!). --Dirk Beetstra T C 09:16, 21 July 2008 (UTC)[reply]
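A minimal Perl sketch of reading such a 'Pagename=123456789' index and fetching the stored revision. The one-page-per-line format is as described above; the action=raw/oldid fetch is standard MediaWiki, and the revid numbers and function names here are illustrative only.

 use strict;
 use warnings;
 use LWP::Simple qw(get);
 use URI::Escape qw(uri_escape_utf8);

 # Parse an index with one "Pagename=revid" line per page into a hash.
 sub load_revid_index {
     my ($index_text) = @_;
     my %revid;
     for my $line (split /\n/, $index_text) {
         $revid{$1} = $2 if $line =~ /^\s*(.+?)\s*=\s*(\d+)\s*$/;
     }
     return %revid;
 }

 # Fetch the wikitext of the stored ('verified') revision of a page.
 sub fetch_verified_revision {
     my ($title, $revid) = @_;
     my $url = 'https://en.wikipedia.org/w/index.php?action=raw'
             . '&title=' . uri_escape_utf8($title) . "&oldid=$revid";
     return get($url);
 }

 my %index = load_revid_index("Aspirin=123456789\nParacetamol=987654321\n");
 print "$_ -> revid $index{$_}\n" for sort keys %index;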
- That would be a very efficient system. It's something like the flagged versions extension, but just for templates. It would be possible to make a Twinkle function to tag the current version of a page as "checked" using your index. — Carl (CBM · talk) 05:44, 23 July 2008 (UTC)[reply]
Given the response to bots like the StatusBots, which used subpages for storing data, you'll probably quickly find this bot banned by a dev. They severely frown upon using Wikipedia as data storage. Q T C 05:46, 23 July 2008 (UTC)[reply]
- OverlordQ, I am not sure what you mean here, do you mean that we would not be allowed to have on-wiki copies of (parts of) mainspace pages and use them to compare? This would 'force' us to store data 'off wiki', which is a bit unwikilike. --Dirk Beetstra T C 09:39, 25 July 2008 (UTC)[reply]
Comment: I don't particularly like the idea of putting information from articles into WikiProject subpages. We put article content in NS:0 ((Main)), NS:10 (Template:), or NS:6 (Image:). Putting article content into the NS:4 (Wikipedia:), the project namespace, is a bad idea. Other articles that have moved their infoboxes into a separate page almost always use the Template: namespace. This prevents "ownership" by a particular WikiProject and is a safeguard in case a WikiProject is suddenly dissolved. It also keeps article content in an appropriate place. --MZMcBride (talk) 20:18, 25 July 2008 (UTC)[reply]
Follow-up: It would seem I mis-read a piece of this bot's request. I assumed it would be moving the infobox from the article to a WikiProject subpage, and I suggested that it instead move the infobox to a Template: namespace. The same type of thing was done with some of the chemical element infoboxes, and it has made editing much easier for new users. When they click "edit this page," they see this instead of this. I think it would be a very good task for this bot to actually move the content from articles to a Template: page (or subpage). Thoughts? --MZMcBride (talk) 20:40, 25 July 2008 (UTC)[reply]
- That is something different from this request, MZMcBride. What we aim for here is having a project-bound copy of infoboxes which contain data that is 'controlled' and verified by the project. The bot then compares the data in the live copy with this verified data (and if either is changed, it writes to a log). The position of the actual data is separate from that: it could be on the page in mainspace, but also in a template in template space. If the wikiproject decided to make a template {{infobox benzene}}, containing the whole of the {{chembox new}} that is now transcluded on Benzene, and then transclude the short {{infobox benzene}} onto Benzene (comparable to what the wikiproject elements has done), then this bot would compare the fields of the {{chembox new}} as transcluded in {{infobox benzene}} with the WikiProject copy.
- What I am aiming for here, in short (further possibilities are for later requests), is: we have two copies of the infoboxes, the copy especially containing the specialised, dry, correct, verified, 'numerical' data, and we let the bot compare changes to these two copies. The data in the project copy can be thoroughly verified etc.; if the mainspace copy changes with respect to that, then that needs to be checked, but probably the change can be reverted. It takes away the need, when someone decides to change the boiling point of some obscure chemical in mainspace, for a couple of people to grab the 'Handbook of Chemistry and Physics' (or a chemicals catalogue) and check, as we know that the offline data has been checked (and if the mainspace editor has a proper, referenced argument for the change, then both copies have to be changed). And unexplained changes to the verified copy should be checked as well, for the same reason.
- The choice of which fields qualify is at the discretion of the wikiprojects, who can judge that best, but it is probably particularly suitable for things like physical constants of chemicals, the numerical values of mathematical constants (e, pi), and dates of birth/death of people, as these are not subject to change and hardly subject to change in formatting (and if they are, then the wikiproject should probably be involved and change it throughout, which would mean coordinated changes to both copies; the bot may be able to assist here, even at a later stage). --Dirk Beetstra T C 07:55, 26 July 2008 (UTC)[reply]
- What if you did both of these:
- 1. Copy the infobox data to templates to start
- 2. Keep track of "approved" versions of the templates, which can be used to compare the current data values in the template. This list would just be a list of revision id numbers, and would be kept on a different page.
- — Carl (CBM · talk) 12:13, 26 July 2008 (UTC)[reply]
- The comparing is already being done, with the few pages in the current index. I am not planning to store the data in templates (that would not be a good place anyway), but in index pages linked to the wikiproject. Where the infobox data itself lives is not of importance to this task and will depend on the preference of the wikiproject; I am not the one to decide on that. --Dirk Beetstra T C 12:18, 26 July 2008 (UTC)[reply]
- Could you summarize exactly what the bot is currently doing, and give links to any pages it uses in WikiProject: space? I'm sure it's getting hard for new people to figure out what's actually being discussed; I'm a little confused myself. — Carl (CBM · talk) 12:21, 26 July 2008 (UTC)[reply]
What the bot is currently doing:
- It keeps a list of all pages which transclude {{chembox new}} and {{drugbox}} in memory. If an edit is performed on such a page, it retrieves the two revids (the new and the old one), extracts the infobox as transcluded in both, and then checks whether any fields have been changed. If a field is changed, it reports for {{chembox new}} to #wikichem and for {{drugbox}} to #wikidrugs (and everything to #BeetstraBotChannel) on IRC (see User:CheMoBot/Settings).
- There are a couple of pages I put manually into the 'repository': subpages of Wikipedia:WikiProject Chemicals/Index (all for {{chembox new}}). When a page in mainspace that transcludes a {{chembox new}} is edited, the bot checks whether the page is in the repository, extracts the {{chembox new verified}} from the correct section of the repository, and also compares the changes in mainspace with that data (a rough sketch of this comparison follows below). The repository should, in the end, contain fully checked infoboxes, and the bot should then be able to spot 'genuine edits' and 'possible vandalism' (and write that to the log; that will be the test case before I ask for further bot permissions -> first: IRC-commanded mainspace repairing/updating of boxes; later: automatic repairing of infoboxes; restoring verified data).
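The sketch referred to above: an illustrative Perl fragment (not the bot's actual code) of how the repository section could be located by its '== Article name ==' header and the three copies compared. The helper names are assumptions; field parsing would be as in the earlier sketch.

 use strict;
 use warnings;

 # Find the '== Article name ==' section of a repository page and return its body.
 sub verified_section {
     my ($repo_text, $article) = @_;
     return $1 if $repo_text =~ /^==\s*\Q$article\E\s*==\s*\n(.*?)(?=^==[^=]|\z)/ms;
     return undef;                              # article not in the repository (yet)
 }

 # Compare the old and new mainspace values of the monitored fields against the
 # verified values; takes three hash references of field => value.
 sub classify_edit {
     my ($verified, $old, $new, @monitored) = @_;
     my @report;
     for my $f (@monitored) {
         my ($v, $o, $n) = map { $_->{$f} // '' } ($verified, $old, $new);
         next if $o eq $n;                                        # field untouched
         if    ($v eq '') { push @report, "$f: changed, no verified value to compare"; }
         elsif ($n eq $v) { push @report, "$f: restored to the verified value";        }
         else             { push @report, "$f: now differs from verified value '$v'";  }
     }
     return @report;
 }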
What I'd like the bot to do for now:
- Log the above changes to log pages under the wikiproject (for {{chembox new}} in Wikipedia:WikiProject Chemicals/Log and for {{drugbox}} in Wikipedia:WikiProject_Pharmacology/Log; it will only log there for edits pertaining to changes to that template) and to the userspace of CheMoBot (for general messages), so they can be checked by non-IRC members. I plan to update the logs once per hour; with one template that would be one edit per hour here.
- Help in creating the repository by extracting the infobox from the mainspace page and putting it in the index. It will only put it there when the page is not already in the index. After that, members of the wikiprojects can go through the pages and check and clean them (they will already be largely correct). That will also help me see what we can expect there and adapt the bot where possible. I'd like to have, at a certain point, about 100-500 indexed boxes in the repository to see how things run. --Dirk Beetstra T C 11:26, 28 July 2008 (UTC)[reply]
Addition: I have also programmed the function so it can use an old revid of the page to 'store' the verified data. I will make an index for this by hand in Wikipedia:WikiProject_Pharmacology/Index and see if it works properly. I think I am going to run into some problems with that, but that remains to be seen. --Dirk Beetstra T C 11:26, 28 July 2008 (UTC)[reply]
- A little background information on this effort may help. We have a group of Wikipedia:Chem members who have been working behind the scenes since December (for many, many hours) to check and double-check our existing data on chemicals, and to obtain some new data from the primary sources (e.g., CAS numbers from CAS). We want to present rock-solid content like the CAS Nos. in such a way that people can trust what they read. We've been open to a lot of ideas on how to do this, and we've spent hours on IRC discussing most of them. We think that the approach Dirk describes is one of the best ways to do this. Walkerma (talk) 15:31, 28 July 2008 (UTC)[reply]
- I am manually copying the logs of User:CheMoBot onto subpages of Wikipedia:WikiProject Chemicals/Log (for {{chembox new}} only), and tweaking the bot when necessary. This may help in reviewing the functions of the bot. --Dirk Beetstra T C 13:29, 1 August 2008 (UTC)[reply]
Trials
Whew! This can be a rather intimidating bot approval request to plow through. I've been reading and studying the subpages for the past hour, and I think I understand what is requested.
For one thing, you have requested approval for this operation:
- Log the above changes to log pages under the wikiproject (for {{chembox new}} in Wikipedia:WikiProject Chemicals/Log and for {{drugbox}} in Wikipedia:WikiProject_Pharmacology/Log; it will only log there for edits pertaining to changes to that template) and to the userspace of CheMoBot (for general messages), so they can be checked by non-IRC members. I plan to update the logs once per hour; with one template that would be one edit per hour here.
- Approved for trial (7 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. (for this task.) That's an easy one.
As for storing "stable" data, several methods have been suggested above, and you seem to have narrowed the approaches down to two: (1) Store stable data in _verified templates on Wikipedia:WikiProject Chemicals/Index subpages (and other project subpages). (2) Store, in an index, the revisionid of a version of the page which holds stable data in the template itself. It sounds like you're saying that approach #1 is ready to go and #2 is under development. Is that correct? There are a number of advantages to approach #2 over #1: the data cannot be changed (although the stored revision number can), making vandalism more difficult; it does not needlessly duplicate information beyond what is already stored; a huge number of pagename/revisionid pairs can be stored on one page, making this method easy on the watchlist; etc. Is this ready to go? If so, you can certainly set up Wikipedia:WikiProject Pharmacology/Index and similar pages without RfBA approval. You can then use the seven days referred to above to log whether the infobox changes are in line with the stable data or not.
If I seem to be missing an important aspect of this request, please let me know. – Quadell (talk) 19:00, 8 August 2008 (UTC)[reply]
- Heh, yes .. and these are only the first tasks.
- I'll enable saving the logs to subpages of:
- Thanks for that part.
- For the other part, option #1 is better, as it provides an editable repository, which can be cleaned and updated and can contain some extra functions which I'd like to try to develop. E.g. if an infobox contains the field 'death_date' and the person is still alive, then in mainspace the field is empty, but I could use a code in the off-mainspace copy to signify that the field in mainspace has to be empty; I can't do that using old revisions (option #2). Old revisions may be enough for certain boxes, but not for all.
- Making that repository is going to be a huge task (getting thousands of boxes from mainspace), and I'd prefer to do that bot-wise. For now I can save them to disk and then move them by hand, but it would be easier if it could be done automatically.
- Thanks again. --Dirk Beetstra T C 12:01, 9 August 2008 (UTC)[reply]
- Oh, I should say: no, both options are ready to go, and actually running (on IRC, but not editing on-wiki). At the moment the bot runs from the _verified templates for {{chembox new}} and from the revid index for {{drugbox}}. However, for drugbox the situation is similar to chembox new: both will in the end have to run from _verified, there is just too much data, and revid can't properly store that data (except if we make the templates even more esoteric). --Dirk Beetstra T C 12:13, 9 August 2008 (UTC)[reply]
- I know many pixels have already been spent on this bot request, but there's something I'm just not getting. It seems to me that approach #2 should work exactly as effectively, with hundreds fewer pages to manually watch and maintain. And if that's true, I'm not inclined to approve approach #1. You've said that approach #2 would not be sufficient, but -- maybe I'm being dense here -- I don't understand why. You've said that #1 "provides an editable repository", yet #2 also provides a list of links to revisions (which would work as a repository), and these can be edited simply by verifying the data on the live page and setting the stored revision id to the current version. In your example of additional functionality, the infobox could contain <!-- empty --> and work just as well, right? You said "there is just too much data, and revid can't properly store that data (except if we make the templates even more esoteric)", but I don't see why that would be.
- You've been very patient, and I'm sorry for the additional delay, but I want to make sure we don't approve a version which makes life harder for the people who will be maintaining/looking over this information. For approach #2 (which you say is set up to run for drug-box), go ahead and run a week trial. But for approach #1, I'm not yet convinced. – Quadell (talk) 22:59, 9 August 2008 (UTC)[reply]
- Approved for trial (7 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. (for approach #2, using revids)
- Thanks for that, I will see if I can do something with that.
- #1 gives a copy of the data off-mainspace, in which data can be put as needed. Having an index with revids means that, at a certain point in time, you have to have the data in the revid to 'program' the box, so if you want to use 'extra' functionality, you need to have that in the mainspace copy (where it would be displayed). Also, for some values there is no verified value, but there may be an unverified value in the mainspace box.
- If there is a value in the off-mainspace copy, then the bot defines that as the verified value. If there is nothing there, it regards that as 'there is no verified value available'. The latter does not mean that the mainspace field has to be empty; there may be other reasons for not having the verified value there. If the off-mainspace copy contains '#NA', then the bot regards that as 'the value does not exist', and if someone then puts a value there, that is 'vandalism' (as it was verified that the value does not exist). I am also thinking about making a '#COUNT' function, which could count something in the mainspace field and warn if it exceeds a certain value. But these things cannot be stored in a live revid in mainspace, except if you want to have, for a short time, a bad copy in mainspace (which you would have to have every time you verify another field in mainspace). As I said, for some boxes the revid may be enough, but at least for chembox_new and drugbox it is not. I hope this explains it; otherwise, can I give it a try on IRC to explain this? --Dirk Beetstra T C 09:52, 11 August 2008 (UTC)[reply]
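A small Perl sketch illustrating the interpretation rules just described (the function name and the return labels are mine, not the bot's internal names):

 use strict;
 use warnings;

 # Empty verified field -> no verified value available; '#NA' -> it was verified
 # that no value exists; anything else -> the verified value itself.
 sub check_field {
     my ($verified, $mainspace) = @_;
     $verified  //= '';
     $mainspace //= '';
     return 'unverified' if $verified eq '';
     return ($mainspace eq '' ? 'ok' : 'suspect') if $verified eq '#NA';
     return ($mainspace eq $verified ? 'ok' : 'mismatch');
 }

 print check_field('#NA', '42'), "\n";   # 'suspect': a value appeared where none should exist
 print check_field('',    '42'), "\n";   # 'unverified': nothing to compare against
 print check_field('100', '100'), "\n";  # 'ok'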
- Well, now, you mention the '#NA' indicator for "no value should be in this field"... wouldn't it work just as well to have <!-- #NA --> or <!-- no value --> or something in the template itself, and store that revid? That doesn't seem to me to be something you need a separate copy of the template for. As for the #COUNT function, I'm not sure what you mean for it to do. If it's checking data outside of the templates, it would be outside this request's scope. So I still don't understand why a revid isn't enough for chembox_new and drugbox. Can you explain why revid isn't enough? And does anyone else have an opinion on whether an entire off-page copy of the template is a better choice (or why it would be)? I won't approve off-page copies of tens of thousands of templates if it's not useful to the project. – Quadell (talk) 12:39, 12 August 2008 (UTC)[reply]
- Hmm, did not think about it in that way. That is indeed a (brilliant) solution. I'll give it a thought.
- With #COUNT I would like to try to 'count' how many of a certain item there are in a field. One of the things that I would like to apply that to is e.g. the number of reviews listed in templates like {{infobox_album}}. The field 'Reviews' there attracts many people who add yet another review (the project's guideline has a maximum of 5 reviews). I could e.g. count how many '<br />' there are, which should be 4, or how many '*', which should be max 5. It is not fail-safe, but it generally goes OK. But these are future functions. For that, it is a bit silly to put <!-- COUNT 5 <br /> --> as the field, though this may be better as a general setting. --Dirk Beetstra T C 12:55, 12 August 2008 (UTC)[reply]
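A tiny Perl sketch of such a hypothetical '#COUNT' check (nothing like this exists in the bot yet; marker and limit are examples from the review discussion above):

 use strict;
 use warnings;

 # Return true if the field contains at most $max occurrences of $marker.
 sub count_check {
     my ($field_value, $marker, $max) = @_;
     my $count = () = $field_value =~ /\Q$marker\E/g;    # count occurrences
     return $count <= $max;
 }

 my $reviews = "A * B * C * D * E * F";                  # six starred reviews
 warn "too many reviews listed\n" unless count_check($reviews, '*', 5);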
- The trial seems to have completed. How do you think it went? Where are we now? – Quadell (talk) 12:57, 18 August 2008 (UTC)[reply]
- {{OperatorAssistanceNeeded}} SQLQuery me! 06:18, 22 August 2008 (UTC)[reply]
- User:Beetstra is on a short wikibreak; however, I can report that he did a short demo of the bot at our 12 August IRC meeting, and it appeared to work very nicely! See this diff. Walkerma (talk) 07:04, 26 August 2008 (UTC)[reply]
- I also played with it a bit (and was watching its change reports for a few days) and it seemed pretty functional. Didn't see it trigger on any wrong pages or wrong parts of pages. The revid does indeed seem like a low-redundancy (and clever!) solution...something like the oft-discussed "flagged revisions" for WP itself, but only within a constrained section of certain pages. DMacks (talk) 07:24, 26 August 2008 (UTC)[reply]
I've been away for some time (actually I still am, but I have some spare time). All seems to work correctly; all I am waiting for is for some people to start building the index by verifying data on pages and adding the current revid number to the indexes. For now it is merely checking and reporting changes to the pages. --Dirk Beetstra T C 12:37, 2 September 2008 (UTC)[reply]
- I'm poking this request. I'd like to activate the logging capabilities of this bot. All other functions have to be done by hand. --Dirk Beetstra T C 06:58, 20 September 2008 (UTC)[reply]
{{BAGAssistanceNeeded}} Could someone more familiar with this take a look? Mr.Z-man 03:13, 2 November 2008 (UTC)[reply]
Approved. BJTalk 07:23, 11 November 2008 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.