Wikipedia:WikiProject Chemistry/IRC discussions/2008
Agenda
- What do we have? The dataset of chemical compounds, currently being cleaned up by ChemSpiderMan et al. - numbers, quality? Data in other chemistry articles, e.g. on chemists?
- How can we make the data more easily searchable/mineable, and more suitable for the semantic web?
- How can we foster mashups with other sites that might bring chemists to us, while providing useful chemical information for the other site?
Summary of main conclusions
- We probably have around 6000 organics with chembox or drugbox, and the majority of the list has been checked by User:ChemSpiderMan for Structure/Name. The list will probably get finished during February. Inorganics/organometallics haven't been addressed yet.
- User:ChemSpiderMan will also provide us with InChIs and InChIKeys for all compounds, in the SDF file he is providing.
- User:Petermr is planning to use this collection of articles/chemboxes/drugboxes as the basis of an RDF-based database, like a chemical version of DBPedia.
- For this, we need to standardise the chemboxes (partly being done now through Chembox new), and we need to standardise the data content. "Main problems with the data were (e.g.) character encodings (can be awful), lack of consistency in units, difficulty of parsing annotations in values (e.g. 200 (decomposes))."
- We might reduce errors in things like density, MP, BP, by having such things stored with one single entry (in °C; or g/cm³), with other versions being calculated from these.
- User:Petermr would like to standardise how we pass information to and from the chemboxes/drugboxes. Bot?
- WP is becoming the #1 source of information on simple compounds. Can we get things like links from chemistry articles, using the approach of Project Prospect?
- User:Petermr would like us to use only ASCII, with no character encodings.
- Do we need a "WikichemID" for each compound? If so, how should it be done? There was extensive discussion, but no clear conclusion.
- Should the database reside on the wiki, or off? How should we get "drive-by" users to add information, if we make it hard to enter data? We agreed to sleep on this!
- Should we start handling spectra? (ChemSpider is already doing this.)
- How should we handle salts and different "forms" of the same compound - to be discussed later (covered the following week).
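The parsing and unit problems noted above (annotated values such as "200 (decomposes)", and one stored unit with the rest calculated) could be sketched roughly as follows. This is a minimal illustration only: the `Measurement` class and the function names are hypothetical and are not taken from any existing Chembox code, and the sketch assumes that values are a plain number optionally followed by a parenthesised note.

```python
import re
from dataclasses import dataclass
from typing import Optional

# A number (optional sign and decimals) followed by an optional
# parenthesised annotation, e.g. "200 (decomposes)" or "-33.5".
_VALUE_RE = re.compile(r"^\s*([+-]?\d+(?:\.\d+)?)\s*(?:\(([^)]*)\))?\s*$")


@dataclass
class Measurement:
    """Hypothetical container: one stored value plus any free-text note."""
    value: float
    note: Optional[str] = None


def parse_value(raw: str) -> Measurement:
    """Split a raw chembox-style value into its number and its annotation."""
    # Normalise the Unicode minus sign (U+2212) to an ASCII hyphen-minus.
    cleaned = raw.replace("\u2212", "-")
    match = _VALUE_RE.match(cleaned)
    if match is None:
        raise ValueError(f"cannot parse {raw!r}")
    note = match.group(2).strip() if match.group(2) else None
    return Measurement(float(match.group(1)), note)


def celsius_to_kelvin(celsius: float) -> float:
    """Derive kelvin from the single stored Celsius value."""
    return celsius + 273.15


def celsius_to_fahrenheit(celsius: float) -> float:
    """Derive Fahrenheit from the single stored Celsius value."""
    return celsius * 9 / 5 + 32
```

Storing only the Celsius value and deriving the others means a typo can occur in one place rather than three; the annotation is kept separately so that it no longer interferes with machine reading.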
Action
- Continue to help ChemSpiderMan in completing the list, by fixing errors on the wiki. See the working list.
- Complete the migration of the Chembox so we have one standard chembox
- Look into amending how we handle certain data, to standardise on one unit and calculate the rest (I'm not sure how much of this is done already - Walkerma).
- Some items warrant further discussion soon, notably the WikichemID idea and the problem with salts - to be done the following week.
Agenda
- What progress has been made with the dataset, and what issues have arisen?
- How do we deal with salts, where there is perhaps a counterion in the name but not in the structure?
- What should be used as the primary key for the dataset? (This was an unresolved issue from the previous week.) Should we classify by compound (and if so, by name, structure or CAS#) or by article (which may cover several compounds)?
Summary of main conclusions
- We should put in the MOS that structure, name, CAS, InChI, etc. should all be for the same form of the compound.
- We may need to put tables in, as with cresol or tartaric acid, when multiple forms are possible, but more discussion on this aspect is needed.
- We will continue to classify compounds by article name, at least for the time being. The reasons: CAS is problematical in cases like tartaric acid where one "compound" can have lots of CAS#s, InChIs don't really work for inorganics, and Wikipedia is organised by article, not by specific compound.
- We still need to clarify what CAS# should be used as the "main" one in the chembox, for the MOS.
- We will need to validate the CAS nos. for the 6000 structures checked by ChemSpiderMan.
Action
- Work on validation of CAS nos. at Wikipedia:WikiProject_Chemistry/CAS_validation.
Please review some responses to Walkerma's questions to get the views of some chemical information professionals on this topic. Please also take a look at the InChI and InChIKey on some test pages:
- Tributylphosphine (the only one on this list to use the {{InChI}} template)
- Phytane
- Quassin
- Beta-hydroxybutyryl-CoA
- Meldrum's acid
Agenda: InChIs and InChIKeys
- How can we handle structural identifiers such as InChIs and SMILES properly? These are designed for machine-reading, but people may often use our visible info to "copy and paste" into a search engine.
- Should we promote the use of {{InChI}}, or develop something different? Should such information be placed together in the ChemBox, or in a databox at the bottom of the page (as with the InChI template), or on a separate data page, or what? Should it be hidden, semi-hidden, or fully visible?
- ChemSpiderMan will be providing us with InChIs and InChIKeys (a concise, hashed version of InChI) for all the structures. Should we include InChIKey information as well? If so, where?
- (If time) How will we upload the information from ChemSpiderMan's SDF file, including the InChIs and InChIKeys?
Summary of main conclusions:
- Consensus wasn't totally clear, but several options were discussed for the display of InChI strings:
- A link farm
- "Click to see or search on InChI"
- Use of {{InChI}}
- A lot depends on the technical feasibility. PC was not present, to explain how the {{InChI}} option might work. Some felt it would be better to display an InChI, perhaps with "soft" line breaks to break up the string only for displaying (if this can be done). Others liked the "Click to see or search" approach. There was an extensive discussion about how InChIs and InChIKeys work.
- It should be possible to upload ChemSpiderMan's SDF file into Wikipedia using a bot, assuming the articles have Chemboxes. The same bot might be used to check ChemBoxes on an ongoing basis. The bot should flag any Chembox where the PubChem link doesn't match with the bot list, and any other quick check like that.
- We should reach consensus on use of InChIKeys on Wikipedia.
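The "quick check" proposed for the bot, flagging any Chembox whose PubChem link disagrees with the bot list, might look something like the sketch below. It assumes both sides have already been reduced to simple title-to-identifier mappings; the function name and the data shapes are hypothetical, and nothing here reflects how an actual bot was written.

```python
from typing import Dict, List


def flag_mismatches(chembox_ids: Dict[str, str],
                    reference_ids: Dict[str, str]) -> List[str]:
    """Return the article titles that need a human look.

    chembox_ids:   article title -> PubChem identifier found in the Chembox
    reference_ids: article title -> identifier from the checked list
    An article is flagged if its Chembox value is missing or differs.
    """
    flagged = []
    for title, expected in reference_ids.items():
        if chembox_ids.get(title) != expected:
            flagged.append(title)
    return sorted(flagged)
```

Articles that appear only in `chembox_ids` are deliberately ignored here, since the checked list is treated as the reference; a real bot might want to report those separately.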
Action:
- Look into the possibility of a soft line break to break up InChIs etc.
- Post a "request for comment" regarding InChIKeys.
- Consider who might write and operate a bot for uploading the SDF file.
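One way to explore the soft line break idea is to insert an invisible break opportunity before each "/" layer separator in the InChI, purely for display. The sketch below uses the Unicode zero-width space (U+200B) for this; that choice and the function names are assumptions for illustration, not an agreed approach.

```python
ZWSP = "\u200b"  # zero-width space: invisible, but permits a line wrap


def add_soft_breaks(inchi: str) -> str:
    """Return a display-only copy that may wrap before each '/' layer."""
    return inchi.replace("/", ZWSP + "/")


def strip_soft_breaks(displayed: str) -> str:
    """Recover the exact machine-readable string from a displayed copy."""
    return displayed.replace(ZWSP, "")


if __name__ == "__main__":
    benzene = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"
    shown = add_soft_breaks(benzene)
    # The round trip must give back the original string unchanged.
    print(strip_soft_breaks(shown) == benzene)
```

A caveat that bears on the copy-and-paste concern in the agenda: a zero-width space is a real character, so text copied from the page would carry it along and a search on the pasted string could fail. An HTML-level break hint such as `<wbr>` may avoid that, if the wiki software permits it, which would need checking.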
Agenda: CAS numbers - how can we validate these quickly, easily and cheaply?
Summary of main conclusions:
- Dealing with CAS nos. is very challenging!
- The only reliable way to validate them is via ACS. Ideally this might be done with the help of people at ACS/CAS, but failing that we will have to plod through SciFinder.
- Should we also mention "wrong but popular" CAS nos., to aid searches?
- The ChemBox could have separate cas and cas_validated fields.
- We have an issue of clarity: If an article is on (say) glucose, should it show the CAS no. for the unspecified isomer (which matches better with the article title) or the CAS no. for the structure shown directly with it? The consensus was to usually include both, with the "specific" form shown close to the drawn structure. One proposal was: If we use a non-specific chembox, we could add a separate chemsubbox (within the chembox?) for information on a specific form or isomer.
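Although only CAS can confirm that a number belongs to a given substance, a cheap first pass is possible because every CAS Registry Number ends in a check digit: the other digits are weighted 1, 2, 3, ... from right to left, summed, and the total modulo 10 must equal the final digit. The sketch below applies that rule; it catches typing errors only and says nothing about whether the number is the right one for the compound. The function name is hypothetical.

```python
import re

# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
_CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_checksum_ok(cas: str) -> bool:
    """Return True if the string is well formed and its check digit fits."""
    match = _CAS_RE.match(cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * weight
                for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check


if __name__ == "__main__":
    print(cas_checksum_ok("7732-18-5"))  # water
    print(cas_checksum_ok("7732-18-4"))  # same number with a wrong check digit
```

A number that fails this test is certainly wrong; a number that passes still needs proper validation, for instance before being recorded in a field such as cas_validated.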
Action:
- Contact ACS for help. If we don't hear back by February 29th we will continue to work manually on the CAS numbers.
- Add cas and cas_validated into the ChemBox
- PC, Rifleman and Beetstra to determine the details of how to handle specific vs. generic CAS nos. in the Chembox.
Followup
- I contacted someone I know at ACS (call him A), and he says that he passed my request on to CAS. I still haven't found out who at CAS, despite a "reminder" email at the end of February. While waiting to hear back from person A, I had also contacted someone I know (less well) at CAS (call him B), and he responded, but by that time I had a reply. I didn't want B to be duplicating the effort of someone else; I said I would ask for help from him if my first "lead" (via A) failed. Walkerma (talk) 04:30, 4 March 2008 (UTC)
- We have been asked by CAS not to use SciFinder for curation. I have been in contact with CAS, we should hear back by mid-March. Walkerma (talk) 04:08, 9 March 2008 (UTC)
- Agenda: Choice and indexing of identifiers
- Which identifiers (InChI, CAS, etc) are the most important for us (already discussed to some extent)?
- Should we create indexes on these identifiers?
- Under what circumstances should we link out to external sites?
- Summary of main conclusions:
- Action:
Agenda: "The protonation problem" and related issues
- How do we deal with compounds such as Geranyl pyrophosphate which may exist in various conjugate acid/base forms under physiological conditions? See comment here. What about drugs such as Ranitidine, which may be produced in a salt form, yet which are often written as a neutral compound?
- Related to this, how should we handle zwitterions such as amino acids and betanin?
- Related to this, how do we handle tautomers in cases such as 1,3-cyclopentanedione, where the structure may vary depending on conditions?
- (If time) How do we deal with counterions - this often arises with pharmaceuticals which may even exist with a variety of counterions such as succinate, maleate, etc.
- (If time) How do we deal with sugars such as Fructose-1-phosphate or glucose? Cyclic or open-chain form? See these comments.
- (If time) How should we deal with hydrates of salts, and different Werner complexes, as seen at chromium(III) chloride?
Summary of main conclusions:
- When we have the choice of charged or uncharged forms, we will (for consistency) use the uncharged form. We agreed that "compounds will be shown in the neutral form, no matter what is their "standard form"." Thus, a pyrophosphate ester will have OH groups, not O−s. An amine will typically be shown as the amine rather than in its protonated form. Details of structure and counterions can be discussed in the article.
- The same approach will be taken with a zwitterion such as an amino acid, with explanation of the zwitterionic structure included in the article.
- In cases of tautomerism, where there is some ambiguity, the article name will be agreed on a case-by-case basis, and the structure etc. will match the article name. Details of the tautomerism can be handled in the article, as in 2-pyridone.
Action:
- Add the above information to the working draft MOS section at User:Itub/Ambiguous chemical identifiers.
- Agenda: Chembox issues
- A carry-over from last week - how should we organise chemboxes for pages where several related substances are being described (e.g., tartaric acid, cresol)?
- How can we cite our sources for ChemBox information without (a) breaking the Chembox in the printable version and (b) confusing the reader? See User:Walkerma/Sandbox2 and its printable version as a test place.
- (If time) Is "table creep" a problem? Is there anything we should be keeping off mainspace and either hiding or placing on the data page?
- Summary of main conclusions:
- Action:
Agenda: Organic reactions - now with a general review
Background: We have been approached by Mark Leach (who runs an online reaction database), regarding the upload of generic reaction information into Wikipedia. I (Walkerma) took the liberty of inviting him to talk to us on IRC about how reactions can be represented online. A more detailed agenda will be posted later.
- Meeting with Mark Leach postponed: He has had to cancel the meeting with us for 4th March, but hopes to attend in a week or two. However, he will use the time to write a demo page for us to look at. For March 4th I am proposing we cover the following:
- Review of all our recent meetings - what are the main things we should be working on? Who is going to work on them? (I will try to expand the summaries and action items before the meeting)
- What are the main topics still outstanding?
- If there is time, perhaps we could begin to consider how we handle reactions. One proposal of mine (Walkerma) is the use of image maps: See Ryoji_Noyori#Chemistry and Lithium_aluminium_hydride#Use_in_organic_chemistry.
Summary of main conclusions:
- Most discussion centred around updating and expanding the chemistry manual of style, so that it includes all of the standards and systems we agreed upon at the recent meetings.
- Related to this, we agreed on a need to improve communication, and to work with neighbouring WikiProjects.
- There was some informal discussion on image maps, which were seen as useful, and people were impressed by the image map editor.
Action
- User:Rifleman_82 agreed to take on the central task, rewriting and organising the [manual of style] (also see draft version). This will include the WP:Chem style guide as well as other aspects of chemistry. It will be written in summary style with sub-pages as needed. He will also assist Walkerma by contacting fvas and assisting with a new navigation scheme.
- User:DMacks agreed to help establish rules for images and contribute this to the style guide. He will be contacting User:Benjah-bmm27 to request his input.
- User:Axiosaurus has agreed to look at how we can tighten up our policies on inorganic nomenclature.
- User:Walkerma has agreed to (1) Design a better navigation system around the chemistry pages (projects, portal, MOS), so that newcomers can find stuff more easily and (2) Talk to the neighbouring projects, and get their opinion on our MOS additions (others may help with #2). He will also contact ~K about writing a policy for reaction pages.
Agenda: Dealing with inorganics & organometallics. Also Mark Leach will join us to talk about chemical reactions.
We have looked in detail at ChemSpiderMan's collection of organics. How should we validate the remaining compounds?
Summary of main conclusions:
Action
Followup
Agenda: Tying up the loose ends for validation by CAS
- We need to resolve a few outstanding issues such as "Which carbohydrate form should be used?"
- Ensuring that we have neutral forms, not charged forms (as we agreed at the 19 Feb meeting).
- What remains to be done to build a collection of inorganics & organometallics?
Summary of main conclusions:
- For carbohydrates, we plan to define a "standard form" for all of the common carbohydrates. For others, the alpha-pyranose form will be the standard form by default. If there is good reason to choose a different form for a particular carbohydrate, this can be discussed until a consensus standard form for that compound is reached. We did not agree on which representation would be used; the Haworth form was not popular, but there was no clear decision made between chair forms or stereodifferentiated hexagonal cyclohexanes.
Action
- Write a page showing the standard form for the common carbohydrates.
- See how well our current collection matches with the new rules.
- Resolve how best to represent the pyranoses.
/25 Mar 2008
Many of the regulars can't make it this week, so there is no formal meeting. As usual, #wikichem is always open for informal discussion about...anything really.
/1 Apr 2008
No formal meeting.
/8 Apr 2008
I propose that we continue with informal meetings for now - mostly we just need to get on and do the work, instead of talking about doing the work! We can discuss progress on the CAS validation work, and also perhaps get a New Orleans report from anyone who is there. Walkerma (talk) 07:19, 6 April 2008 (UTC)
Time: Should we change the time? It seems that our original time has become difficult for several of our group, and things have changed anyway with the clocks going forward in many countries. Are there any other times on Tuesday that you would prefer?
- My availability is significantly reduced now unfortunately. This week I am not available until 1pm and the following week I am at the ACS. Lunchtime (noon) on Tuesday is certainly better for me.--68.33.211.217 (talk) 16:54, 30 March 2008 (UTC)
Agenda: Getting the chemicals list ready for CAS
We have two main groups of articles that we are currently getting ready for CAS. Physchim62 also has a combined version.
- I assume it is a list (XLS/TXT?) rather than an SDF file?--68.33.211.217 (talk) 16:54, 30 March 2008 (UTC)
- Actually, Antony's collection is an SDF file. I'm not sure about Physchim62's file. Walkerma (talk) 03:44, 1 April 2008 (UTC)
- My file is in a relational database, but I can provide other versions without too much problem. Wikipedia:WikiProject Chemicals/Inorganics was extracted from the database, for example. Physchim62 (talk) 17:49, 1 April 2008 (UTC)
I propose we find out what has been done and what final tweaks still need to be done. The two main lists are:
- Antony's SDF collection
- This is progressing but slower than I would hope because of many other distractions--68.33.211.217 (talk) 16:54, 30 March 2008 (UTC)
- I know the feeling! Physchim62 (talk) 17:49, 1 April 2008 (UTC)
- I extracted a list of linked Wikipedia pages from it. Was pretty easy to parse and munge the .sdf DMacks (talk) 14:35, 8 April 2008 (UTC)
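For readers wondering what "parse and munge the .sdf" involves: in an SDF file each record ends with a line containing `$$$$`, and each data item is a header line of the form `> <FieldName>` followed by its value lines and a blank line. The sketch below pulls those data items out into dictionaries. It is not the script that was actually used, and the field name `WikipediaPage` in the usage example is a hypothetical stand-in, since the real field names in the collection are not given here.

```python
from typing import Dict, List


def parse_sdf_data(text: str) -> List[Dict[str, str]]:
    """Return one dict of data items per SDF record (structure block ignored)."""
    records = []
    for block in text.split("$$$$"):
        if not block.strip():
            continue  # skip the empty remainder after the last delimiter
        fields: Dict[str, str] = {}
        lines = block.splitlines()
        i = 0
        while i < len(lines):
            line = lines[i]
            # A data header looks like "> <FieldName>".
            if line.startswith(">") and "<" in line:
                name = line[line.index("<") + 1:line.rindex(">")]
                i += 1
                value_lines = []
                # The value runs until the next blank line.
                while i < len(lines) and lines[i].strip():
                    value_lines.append(lines[i])
                    i += 1
                fields[name] = "\n".join(value_lines)
            else:
                i += 1
        records.append(fields)
    return records


def linked_pages(text: str, field: str = "WikipediaPage") -> List[str]:
    """List the values of one field across all records that have it."""
    return [rec[field] for rec in parse_sdf_data(text) if field in rec]
```

This assumes that no line inside the structure block starts with `>`, which holds for ordinary molfiles but is not checked.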
Several people seem to want to discuss CAS validation and "data-mining" from WP; I shall do my best to be available. Physchim62 (talk) 19:26, 14 April 2008 (UTC)
- OK, let's do that. I asked if people were interested in an IRC meeting on this topic, but no one responded to my request, so I was expecting this to be another small, informal gathering. I think many people are catching up after New Orleans, and I think Rifleman may also be on the road, but perhaps a few can gather - there is certainly interest in the data-mining aspect. I will be quite busy myself, so I may not be able to be there for much of the time. PC, can you chair the meeting? I expect I will be joining on IRC around 1610h UTC. I should have an update on the CAS work, also. Walkerma (talk) 21:41, 14 April 2008 (UTC)
- OK, will do. Can you remind me to log it, in case I forget! Agenda is (depending on who can be available):
- update on CAS verification
- questions/discussion concerning "data-mining" from Wikipedia
- any other issues
- Physchim62 (talk) 13:46, 15 April 2008 (UTC)
Summary of main conclusions:
Action
It looks as if ChemSpiderMan can make this meeting, and PC can now talk on IRC, so we will try to meet formally this week. Many of the topics I'm proposing are similar to what is listed above for April 15.
Time: 1700 h UTC (1pm US EDT). NOTE NEW TIME, one hour later!
Agenda:
- How to merge in data from CAS (once this has been received). If we have received the file, we can perhaps discuss that too.
- How to organise the data once it is validated. We need to find a way to ensure that validated content remains intact. PC has some ideas on how to put the data into a database form within WP.
Summary of main conclusions:
- PC will continue working on CAVer, a relational database linking WP articles with specific compounds, while we are waiting for news from CAS.
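The details of CAVer are not recorded here, but the general shape of a relational database linking articles with specific compounds can be illustrated. Because one article may cover several compounds (tartaric acid being the recurring example), the sketch uses a separate link table. All table, column and function names are hypothetical and are not CAVer's actual schema.

```python
import sqlite3
from typing import List


def build_db() -> sqlite3.Connection:
    """Create an in-memory database with articles, compounds and links."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE article (
            id    INTEGER PRIMARY KEY,
            title TEXT UNIQUE NOT NULL
        );
        CREATE TABLE compound (
            id    INTEGER PRIMARY KEY,
            name  TEXT NOT NULL,
            cas   TEXT,   -- left empty until validated
            inchi TEXT
        );
        -- One article may describe several compounds.
        CREATE TABLE article_compound (
            article_id  INTEGER NOT NULL REFERENCES article(id),
            compound_id INTEGER NOT NULL REFERENCES compound(id),
            PRIMARY KEY (article_id, compound_id)
        );
    """)
    return conn


def link(conn: sqlite3.Connection, title: str, compound_names: List[str]) -> None:
    """Insert an article and attach the named compounds to it."""
    article_id = conn.execute(
        "INSERT INTO article (title) VALUES (?)", (title,)).lastrowid
    for name in compound_names:
        compound_id = conn.execute(
            "INSERT INTO compound (name) VALUES (?)", (name,)).lastrowid
        conn.execute("INSERT INTO article_compound VALUES (?, ?)",
                     (article_id, compound_id))


def compounds_for(conn: sqlite3.Connection, title: str) -> List[str]:
    """Return the names of all compounds linked to one article."""
    rows = conn.execute("""
        SELECT c.name
        FROM compound c
        JOIN article_compound ac ON ac.compound_id = c.id
        JOIN article a ON a.id = ac.article_id
        WHERE a.title = ?
        ORDER BY c.name
    """, (title,))
    return [row[0] for row in rows]
```

Keeping the article as the entry point matches the earlier decision to classify by article name, while still allowing each specific form to carry its own identifiers.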
Action
- circulation (by email) of test lists in the various formats used; for queries, contact PC.
- meeting logs not to be published until situation with CAS is clarified
PC will not be able to make the formal meeting, but will try to be on IRC 1530–1630 UTC to answer any questions.
I probably can't make it (network flaky--if someone else can log, I can format & post it later), but did manage to get non-volatile data shifted out of the main article. Bonus: InChI keys google-indexable (actually visible on a page) but not visible in article Chembox. proof of concept DMacks (talk) 13:44, 29 April 2008 (UTC)
Time: 1700 h UTC (1pm US EDT).
Agenda:
- feedback re CAS data and prospects
- AOB
Summary of main conclusions:
Action
Agenda:
- Work to be done on the CAS collection
- Wichempedia, chempedia, wikichem and related ideas.
Summary of main conclusions:
- There are some issues to be resolved on differences between CAS format and our format for some data (especially inorganics). We may need to contact CAS on this.
- Despite this, it should be possible for us to release groups of 500 articles at a time (monthly?), starting quite soon.
- Physchim62 will handle curation of inorganic data, while ChemSpiderMan, Walkerma and Rifleman82 will be handling the organics.
Action
/13 May 2008
Meeting at 1600h UTC (noon US EDT). Agenda:
- Wichempedia, chempedia, wikichem and related ideas.
- If any time left, we can discuss CAS issues further.
Summary of main conclusions:
- Nobody was talkative in-channel today...no actual meeting.
Meeting at 1600h UTC (noon US EDT). Agenda:
- An informal meeting to discuss the CAS work, and the wikichem idea, as people see fit.
Summary of main conclusions:
1600h UTC (noon in US EDT, 1700h in British Summer Time).
Agenda: Validated data for chemboxes - which method? Persondaten method or transcluded from a data page? See WT:Chem discussion.
Summary of main conclusions:
We didn't resolve the above issue, but we DID resolve the presentation of InChIs and other long data, with an elegant hide/show option from DMacks. This may affect how we present validated data, too. On 26 June, we had an informal meeting, at which Beetstra tested out CheMoBot to see if it can be used to protect selected data fields within Chemboxes - it seems that it can.
Action
Beetstra is testing the bot.
1600h UTC (noon in US EDT, 1700h in British Summer Time).
Agenda:
- In the light of the developments from 24-26 June, with new collapsible data fields that can be read by the bot, we still need to decide which method we will use to upload and present our validated data. Are we ready to upload the first 500? Should we have data present on a single on-wiki page, or in some other format? Will the bot be able to watch things, or can we avoid the use of a bot altogether?
- If there's time, I'd also like to get people's views on structure searching on Wikipedia.
Summary of main conclusions:
Beetstra outlined how the bot might work, and did a simple demonstration. We chose an option whereby each article has an associated data page containing the validated data, which would be transcluded onto the article page. If problems arose with that, we might consider having one single on-wiki page, though there is concern that one small error might render such a page unreadable by the bot.
Action
Beetstra will continue to test the bot, then apply for permission to use it as described. Initially (for testing purposes) the bot will simply report edits to validated data, but later it will revert such edits.
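The two stages described, first reporting edits to validated data and later reverting them, can be pictured with a small sketch. This is not CheMoBot's code; the function, its arguments and the report format are hypothetical, and field values are assumed to be available as plain dictionaries.

```python
from typing import Dict, List, Tuple


def handle_edit(validated: Dict[str, str],
                edited: Dict[str, str],
                revert: bool = False) -> Tuple[Dict[str, str], List[str]]:
    """Compare an edited set of fields with the validated values.

    validated: the protected fields and their validated values
    edited:    all fields as they stand after the edit
    revert:    False = only report (testing stage), True = also restore
    Returns the resulting fields and a list of report lines.
    """
    changed = sorted(f for f, v in validated.items() if edited.get(f) != v)
    report = [f"{f}: {validated[f]!r} -> {edited.get(f)!r}" for f in changed]
    result = dict(edited)
    if revert:
        # Restore only the protected fields; leave other edits alone.
        for f in changed:
            result[f] = validated[f]
    return result, report
```

Restoring only the protected fields, rather than undoing the whole edit, is a design choice made for this sketch; whether the bot should revert the entire revision is a separate question.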
1600h UTC (noon in US EDT, 1700h in British Summer Time).
Agenda:
- We will review progress on uploading and presenting our validated data using CheMoBot. The technical details of this are being discussed here. If bot approval goes through, are we ready to upload the first 500, or at least 50 for testing?
- If there's time, I'd also like to get people's views on structure searching on Wikipedia.
Note: I may not be able to attend until 1630h UTC, so I'll ask someone else to start the meeting off if necessary. Walkerma (talk) 20:47, 20 July 2008 (UTC)
- I'll hopefully leave a channel-log running, but likely won't be there before 1630-UTC either. DMacks (talk) 21:34, 20 July 2008 (UTC)
Summary of main conclusions: The bot has been set up and is simply awaiting approval.
Agenda: An informal meeting to chat about the feasibility of structure searching on Wikipedia.
Summary of main conclusions: We should introduce structure searching on WP.
Action
ChemSpiderMan will set this up on the ChemSpider site when time allows.
Testing of CheMoBot has been approved, but I don't think tests have been enough to conclude anything so far, so I don't think we need to have a formal meeting. Walkerma (talk) 02:05, 12 August 2008 (UTC)
- Beetstra informs me that there probably is enough to discuss, so we can have an informal meeting. I hope to be there for noon, but I'm not 100% certain (maybe 95%) - please go ahead without me if necessary. Next week I'll be away at the ACS meeting. Walkerma (talk) 12:01, 12 August 2008 (UTC)
- Summary of main conclusions
We are testing out the use of specific article "versions with validated data". This avoids the need to create long pages of data - the data are stored in the article history instead. Beetstra demonstrated on benzene how the bot reported an edit to the Chembox instantly after a change was made to one of the (pseudo)validated fields, and it should log all such edits. The validated data can be restored if the Bot is commanded to do so.
To upload validated data into WP for a specific article, we will simply need to:
- Check that the Chembox fields match our validated data (edit the data if they don't).
- Record that version of the article on Wikipedia:WikiProject Chemicals/Index
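The first step above, checking that the Chembox fields match the validated data, could be partly automated along the lines of the sketch below. It reads simple `| name = value` lines from the wikitext and reports any validated field that is missing or different. The parsing is deliberately naive (one parameter per line, no nested templates), and the function names and the field names in the example are illustrative assumptions.

```python
import re
from typing import Dict, Optional, Tuple

# One parameter per line: "| Name = value". Spaces and tabs only, so that
# an empty value cannot swallow the following line.
_PARAM_RE = re.compile(
    r"^[ \t]*\|[ \t]*([\w ]+?)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def chembox_fields(wikitext: str) -> Dict[str, str]:
    """Extract parameter names and values from simple chembox wikitext."""
    return {name: value for name, value in _PARAM_RE.findall(wikitext)}


def mismatches(wikitext: str,
               validated: Dict[str, str]) -> Dict[str, Tuple[Optional[str], str]]:
    """Map each disagreeing field to (value in article, validated value)."""
    current = chembox_fields(wikitext)
    return {field: (current.get(field), expected)
            for field, expected in validated.items()
            if current.get(field) != expected}


if __name__ == "__main__":
    page = "{{Chembox\n| Name = Benzene\n| CASNo = 71-43-2\n}}"
    print(mismatches(page, {"CASNo": "71-43-2"}))
```

An empty result means the article version is ready to be recorded as the validated one; anything else lists exactly which fields need editing first.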
We had an informal meeting. We agreed to start uploading validated data to Wikipedia:WikiProject Chemicals/Index, with a working page at Wikipedia:WikiProject Chemicals/Chembox validation to coordinate the work. The trouble since then is that everyone (including me!) seems to be very busy! We discussed the possibility of recording any edit to the data, so that such an edit would change a flag from "validated" to "not validated". We considered a red-yellow-green system to indicate unchecked, believed-OK and validated data respectively.
If anyone's around, I wouldn't mind having a chat for a few minutes with WP:Chem members at 1600 UTC, about the problems involved in uploading our SDF data into ChemBoxes and Drugboxes. Cheers, Walkerma (talk) 09:36, 7 October 2008 (UTC)
- I'd like to discuss our talk-page-tagging policies as well, given Itub's recent post. Physchim62 (talk) 15:18, 7 October 2008 (UTC)
I'm proposing we meet this Tuesday, one hour earlier at 1500h UTC (11am US EDT). Let's plan to talk about issues surrounding the uploading of validated data onto Wikipedia. See these comments for background, as well as the 9th September IRC records. Walkerma (talk) 07:04, 13 October 2008 (UTC)
Could we chat for a few minutes on Tuesday again at 1500h UTC (11am US EDT)? No need to talk for too long this week, but DMacks/Physchim62 have now demonstrated that we can flag validated data in a straightforward way, and I'd like us to clarify what is next. Walkerma (talk) 05:57, 21 October 2008 (UTC)
Let's meet at 1600h UTC (11am US Eastern Time, 1700h Central European Time). We can finalise our plans for "validated" flags for data in chemboxes, and start their implementation. Walkerma (talk) 01:56, 11 November 2008 (UTC)
- Probably can't make it (but will read channel log). DMacks (talk) 01:58, 11 November 2008 (UTC)
Let's meet at 1600h UTC (11am US Eastern Time, 1700h Central European Time). Just an informal chat about some of the details on how the bot works - see User talk:CheMoBot/Q&A. Walkerma (talk) 06:29, 25 November 2008 (UTC)
- Will try to leave window open, but probably only to read it all later due to Real Life meeting. DMacks (talk) 06:55, 25 November 2008 (UTC)
Let's meet at 1600h UTC (11am US Eastern Time, 1700h Central European Time), to talk about the work with CAS - what needs to be done, and how it's going. I will invite someone from CAS to join us, though I'm not sure if anyone will be able to come. Walkerma (talk) 18:35, 15 December 2008 (UTC)