Wikipedia talk:WikiProject Chemistry/CAS validation

CAS Discourages Using SciFinder for curating 3rd party databases (e.g. Wikipedia)

Chemical Abstracts Service (CAS) objects to anyone encouraging the use of SciFinder® and STN® to curate third-party databases or chemical substance collections, including the one found in Wikipedia. SciFinder and STN are provided to researchers under formal license agreements, under which the researchers agree to refrain from using these tools to build databases. We urge and expect those researchers to respect the explicit terms of the agreements they have entered into. CAS is a division of the American Chemical Society. Please contact CAS if you have questions. Eric Shively, CAS, eshivelyATcas.org Eshively (talk) 20:56, 5 March 2008 (UTC)[reply]

Thanks for your reply, I contacted CAS last month and was hoping to hear from someone there. Thanks for making contact and clarifying the limitations. I'll be in touch soon, Walkerma (talk) 03:43, 7 March 2008 (UTC)[reply]

I have blogged this (thanks to Chemspiderman) and am appalled by it. Unless Chemical Abstracts changes their policies I think the only logical and safe thing to do is to boycott the use of CAS numbers anywhere in Wikipedia. (There should of course be factual entries about CAS and the CAS number system). There is no way of proving to CAS that information in Wikipedia has not been scraped from Scifinder, but as Wikipedia rightly honours copyright it should assume that with an aggressively unhelpful copyright holder all information comes from third-party sources. Petermr (talk) 05:02, 8 March 2008 (UTC)

Well, does this mean that there is no way of correcting structures according to their CAS numbers? And this indeed raises the question of whether CAS numbers are of any use in the public domain. It looks like a classic license compatibility problem, and it raises the following points. 1. SciFinder® and STN® may not be used for curating CAS numbers. 2. May scientific publications that contain CAS numbers be used for curating CAS numbers? JKW (talk) 08:53, 8 March 2008 (UTC)

CAS numbers are not subject to copyright. Wikipedia did not enter an agreement with the parties involved with CAS. Any individual should check whether he/she is bound by any contract that would prohibit entering CAS numbers in Wikipedia. Anyone else should continue to improve the articles, which includes validating CAS numbers. -- 89.12.246.73 (talk) 15:51, 8 March 2008 (UTC)[reply]

FYI: /CAS-German Wikipedia -- 89.12.246.73 (talk) 16:01, 8 March 2008 (UTC)[reply]

I agree with Eric Shively wholeheartedly on this issue. CAS databases' terms and conditions should be respected (unless they conflict with each other which makes this impossible). Equally, I would not be in favour of a boycott or removal of any information wrt CAS or CAS numbers from Wikipedia. This would damage the Wikipedia information resource and undermine the contributions of users. There are many sources of CAS numbers (e.g. some chemical catalogues) from which I think (without being a lawyer) it is perfectly legal to obtain these identifiers (as I see them), though I accept CAS databases must be the most reliable source. Will-ocw (talk) 20:10, 8 March 2008 (UTC)

IMHO, the ideal solution is for us to work together with CAS, and that is what I want WP:Chem to do. At one of the recent Wikipedia IRC meetings where we discussed CAS numbers, that was the approach we agreed to take, and the one I have been pursuing. Whatever opinions people may hold on the copyright issue, most would agree that CAS registry numbers are more established and more reliable than the alternatives, while InChIKeys (while valuable) are based on molecular structure (microscopic), not chemical composition (macroscopic). I think if we lose CAS from the chemical information community, it would be a great loss to all, so instead I would rather find a way to break down the walls. If ACS and CAS can fully participate in the new web-based world of open information, then everybody gains a huge amount - except perhaps for ACS's commercial competitors. We just need ACS and CAS to see it that way! Walkerma (talk) 03:41, 9 March 2008 (UTC)

Comment--Please examine the exact wording of their notice. Even they did not explicitly say not to use the registry numbers. And they referred only to Scifinder and STN, the electronic services, and in terms of the licensing, not the copyright. CAS registry numbers are available from other sources, including their printed Chemical Abstracts. Most libraries in the US no longer get it in print, but some of the large public libraries still do, including the New York Public Library, as do many libraries in other countries. The numbers can also be found for many compounds through a variety of other sources.

CAS has claimed in the past that the registry numbers are copyright, though to the best of my knowledge this has never been tested in court. They would be hard put to show that the use of a limited number of them, obtained individually in observance of the license, was not a fair use. They do not make that claim here. DGG (talk) 10:17, 9 March 2008 (UTC)[reply]

I trust that the chemical community, on Wikipedia and elsewhere, will treat this missive from CAS with the contempt it deserves.

It is clear that CAS places the maximization of its revenue above the provision of chemical information. What does CAS object to here? That researchers use its products to find chemical information, or that this information is published? In either case, its stance is both ludicrous and profoundly anti-scientific.
In a discussion about CAS registry numbers, it should be pointed out that these are used by many governments and international organizations (see, e.g., 29 CFR 1910.1000) and innumerable commercial firms (e.g. chemical suppliers). Indeed, they would not be interesting for WP if they were not so widely used! CAS tacitly admits that it cannot control this use through copyright law, as has been discussed at length on WP, which is why it has to resort to contract law in the form of the draconian license terms it imposes for access to its databases.
However CAS is effectively a monopoly supplier of much chemical information, as can be seen from the prices it manages to charge for access to its databases. The restrictions it purports to impose on the reuse of its “product” would appear to breach anti-trust legislation on both sides of the Atlantic. Users of CAS databases in the European Union can take heart from Art. 8.1 of the Database Directive (96/9/EC):
“The maker of a database which is made available to the public in whatever manner may not prevent a lawful user of the database from extracting and/or re-utilising insubstantial parts of its contents, evaluated qualitatively and/or quantitatively, for any purposes whatsoever.”

I call on CAS to make it clear that the information contained in its databases may be freely reused in accordance with the principles of chemical science and the laws of the jurisdictions in which it operates. Physchim62 (talk) 12:24, 10 March 2008 (UTC)[reply]

I'm curious about how much is insubstantial. I don't think we aspire to include even 1% of the CAS registry numbers, but is that "insubstantial" enough? --Itub (talk) 12:29, 10 March 2008 (UTC)[reply]

I have blogged about this and received considerable feedback. There is clearly a difference of opinion in the WP community about whether to try to deal with CAS or not. I side with Physchim62 but I think there is also a dispassionate reason why WP should withdraw any mention of CAS numbers and I have outlined this in my blog. Simply:

Wikipedia requires authoritative sources for its information.
The assignment of a CAS number to one or more WP entries requires the authority of CAS
CAS forbids WP to use this authority
Therefore WP cannot include CAS numbers if it wishes to uphold its principles of authoritative sources - there are NONE available to it.

So I think WP would violate its own principles by including CAS numbers Petermr (talk) 14:56, 10 March 2008 (UTC)[reply]

With all due respect to Petermr, his argument fails on the second point: CAS numbers are used (and often misused) without the intervention of CAS. Both the US federal government and the European Union quote them in their regulations, for example. The people who would benefit from "correct" CAS numbers being included in WP are CAS themselves. That is the irony of it all! Physchim62 (talk) 17:51, 10 March 2008 (UTC)

Two more aspects: 1. The question must be raised whether there are not already plenty of examples of copyright misuse in 3rd party databases and especially journal publications. Just imagine X publications publishing Y CAS numbers each, and the total would be higher than 10000. This is a copyright violation, isn't it? We need a lawyer to answer how valid the CAS copyright claims are from a public-domain and misuse point of view; 2. I honestly think that no information is better than wrong information, which here means CAS numbers. Since we cannot use CAS services for validation, can we include CAS numbers at all in Wikipedia? How do we verify that they are correct? JKW (talk) 21:14, 10 March 2008 (UTC)
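On the narrow question of verifying a CAS number without access to CAS services: the format itself carries a check digit, so transcription errors can at least be caught locally. The sketch below is only that, a syntactic check; the function name is made up for illustration, and passing the check says nothing about whether the number is assigned to the right substance.

```python
import re

def cas_check_digit_ok(cas_rn: str) -> bool:
    """Return True if cas_rn looks like a CAS RN and its check digit matches.

    Illustrative helper (not from any library). It catches typos only;
    it cannot tell whether the number belongs to the intended substance.
    """
    m = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_rn)
    if m is None:
        return False
    body = m.group(1) + m.group(2)
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == int(m.group(3))

print(cas_check_digit_ok("7732-18-5"))  # water
```

A number that fails this test is certainly mistyped; a number that passes still needs a trusted source.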

I'll quote from my blogpost "I say let’s not abandon hope regarding CAS opening their numbers to the world just yet. This dialog is likely sparking discussions already. Let’s keep it out there and establish a groundswell of concern and support and hope that the right thing can happen for our good and for CAS. I have great respect for many of their people and their work and want the resolution to be appropriate for all parties." --ChemSpiderMan (talk) 00:25, 11 March 2008 (UTC)[reply]

To reply to JKW, it's not a lawyer we need to determine whether the claims of copyright in CAS Registry Numbers® are valid, but a judge! The issue has been discussed at length over the last two years on Wikipedia (English and German) and the consensus seems to be that there is no copyright for lack of creativity. For example, see Matthew Bender v. West Pub. Co. (158 F.3d 693). Physchim62 (talk) 15:48, 11 March 2008 (UTC)[reply]

New announcement from CAS

CAS, a division of the American Chemical Society, is pleased to announce that it will contribute to the Wikipedia project. CAS will work with Wikipedia to help provide accurate CAS Registry Numbers® for current substances listed in Wikiprojects-Chemicals section of the Wikipedia Chemistry Portal that are of widespread general public interest.

The CAS Registry is the world’s most comprehensive collection of chemical substances and the CAS Registry Number is the recognized global standard for chemical substance identification.

CAS views Wikipedia as an important societal tool for the general public, and this collaboration with Wikipedia is in line with CAS’ mission as a Division of the American Chemical Society.

We look forward to working with the Wikipedia volunteers over the next few weeks to make this happen. Eshively (talk) 13:40, 12 March 2008 (UTC)

Well that's great news, both for CAS and for Wikipedia. I can only hope that the same constructive spirit is shown to other legitimate users of information from CAS databases ;) Physchim62 (talk) 14:19, 12 March 2008 (UTC)[reply]

This is a terrific outcome. I think the collaboration between CAS and Wikipedia will definitely assist us in producing a high quality validated dataset of chemical compounds as well as demonstrate to the community the intentions for both parties to collaborate for the public good. This is a good outcome! I have blogged it here. --ChemSpiderMan (talk) 15:33, 12 March 2008 (UTC)[reply]

Congratulations and thanks for initiating this discussion, contributing, and bringing it to a broader audience! My personal special thanks go to the very constructive contributions of ChemSpiderMan, User:Walkerma, and Physchim62. JKW (talk) 17:12, 15 March 2008 (UTC)

Blog Feedback

--Historiograf (talk) 21:49, 10 March 2008 (UTC) and JKW (talk) 23:48, 10 March 2008 (UTC)[reply]

First block of 25

I've looked at the first block, and there appear to be some problems with stereochemistry already. A few comments. To check the pdf against scifinder as-is (for someone using only one monitor/laptop screen) requires a lot of alt-tab window switching. Copying individual CAS numbers from the PDF into the Scifinder window (which can accept a text file or multiple lines) is not very efficient.

Tony, is there a way to generate the list in an Excel file, CAS on left, structure on right, which I can print out? Then I can copy 10 or 25 CAS #s into SciFinder, print out, and check one by one. It'll save a lot of paper compared with printing the 25 pages as 25 pages.

Walkerma, I'll send you a summary of the Scifinder output by email. --Rifleman 82 (talk) 09:30, 26 January 2008 (UTC)[reply]

Yes, I can generate an Excel like file for you and send it forward. More work for me but it will reduce the work for you all so I'll do it. I'm not sure when I'll get to it but I might get it done today. If not today then it's unlikely to be until next week because of some meetings I am hosting next week.--ChemSpiderMan (talk) 13:44, 26 January 2008 (UTC)[reply]

Rifleman...before I wade too deeply into the issues can you comment on the problems you are seeing? Lack of stereochemistry? Incorrect stereochemistry...I assume the latter. What we have to agree on is whether the primary key is the structure name and find the CAS reg number that matches that and adjust the structure to that. The problem is if the name itself is "too general". What we need to make sure of is that the CAS Number doesn't become the primary key and things are adjusted around that too far. This is much easier to explain in an interactive dialog if we can have one. --ChemSpiderMan (talk) 15:43, 26 January 2008 (UTC)

Looking for a Status Report

I am presently finishing up a paper for submission tomorrow. My hope is to get back to Wikipedia curation in the next few days. Question: What is the progress with the validation of the 150 structures I have posted so far? Any feedback? I don't want to jump back into the project until those have been looked at in detail and there's feedback on the progress to date and the process I'm using. I welcome all your comments. --ChemSpiderMan (talk) 05:36, 15 February 2008 (UTC)

I did one block, I think, but then I'm not sure exactly what was I supposed to do. :) For the entries that had a CAS number, I searched for that CAS number using SciFinder and checked if it matched the structure in the PDF file. There was one case where the stereochemistry of the structure in the PDF was wrong, but the structure in the WP article was correct! This was Adenosine thiamine triphosphate; however, the CAS number is for the neutral chloride so it does not match the figure exactly. I added a parenthetical note in the infobox to that effect; I don't know what the standard practice should be, but I think we will need these types of annotations often, especially if we want to provide more than one CAS number per "substance" (i.e., WP article).

Another question is whether we will go ahead with using a different infobox field for CAS numbers that have been checked. In this case I did nothing with the entries that were correct.

The case of articles that have no CAS number yet, and especially those with no infobox, is more complicated. I would prefer handling those as a separate project, as it is much more time-consuming. In those cases I searched for the compound by name, or sometimes by structure or formula, and added the new CAS number to the infobox if there was one. In the articles with no infobox I just added the CAS number somewhere in the article or talk page, as I was not in the mood to start creating infoboxes at that moment.

Besides fixing the articles that need to be fixed, are we keeping some sort of log? For example, there were some pages in the PDF with comments and questions. In some cases I could answer the question if desired, but where? Especially, should we start a list of "unfinished fixes"? For example, the article on Aciclovir seems to have the wrong tautomer in the figure, but structure drawing is not my specialty, so I just left a note in the talk page. --Itub (talk) 10:08, 15 February 2008 (UTC)[reply]

In terms of what to do, I would hope to receive feedback/comments regarding errors identified in the files as they exist. You have provided one piece and the stereochemistry has been changed for the adenosine thiamine triphosphate structure. I believe the CAS number should be found for that compound if possible. Noting that the CAS number is for the chloride is exactly what I was hoping for, since before that the CAS number was not for the compound shown. Now people know it's for the chloride. Excellent. Adding the CAS numbers for the rest is good too. I have no way to source them from CAS. In terms of answering the questions I think the community preference would be to have the questions answered in front of everyone so probably setting up a page and listing the structure (linked to WP article), the question asked and the answer given would be ideal. Then people can discuss, get to consensus (maybe) and an action can be taken. At the end of the project my output will be an SDF file of structures and connected terms and then the job of migrating that information will need to start. There are automated ways to take my SDF file and create PNG files for uploading. --ChemSpiderMan (talk) 14:43, 15 February 2008 (UTC)

Regarding contacting ACS, I've heard back from a couple of people I contacted, but I still haven't heard back from the CAS people they contacted. Walkerma (talk) 16:47, 15 February 2008 (UTC)[reply]

Stereo Issues on Structures

Rifleman...I have sent you and Walkerma the first 50 records in Excel format. I have now started re-checking those records based on your information. I can confirm there ARE issues with stereochemistry. The problem is where? The structures and the IUPAC names on Wikipedia now match, so the disconnect is with the CAS number. So, is there a DIFFERENT CAS number for the structure drawn? My expectation is YES, so you would need to draw the structure to do the search to get the correct CAS number. OR the fact is the structure itself is wrong. How do we figure some of these out? This undertaking just got a whole lot bigger gentlemen...a lot bigger. We need to chat before it continues. --ChemSpiderMan (talk) 20:07, 26 January 2008 (UTC)

Tuesday's not too far away. Can we do it then? --Rifleman 82 (talk) 02:50, 27 January 2008 (UTC)[reply]

I will try and sit in on that discussion but have actually got someone visiting from Europe for the week and need to make the most of our time. I commit to attending the IRC chat provided I have internet access wherever I am at that time.--ChemSpiderMan (talk) 04:34, 27 January 2008 (UTC)[reply]

Inorganics

Martin and I have compiled a list of some 2400 articles describing inorganic substances, which can be found at Wikipedia:WikiProject Chemicals/Inorganics. Most of these substances will have to be verified on the basis of a name, and many articles describe more than one substance. There are also substances which are described in more than one article. Any comments on the best way to proceed would be gratefully received! Physchim62 (talk) 14:56, 25 March 2008 (UTC) It appears that there are a large number of red links on the list, despite the fact that it was compiled from WP sources! I'll get on to that problem, and post a revised version in the next couple of days or so. Physchim62 (talk) 15:01, 25 March 2008 (UTC)

Validation of CAS numbers; collaboration with Wikidata?

Hi all, for the past few months we have been talking to a source of trusted CAS number information, and likely we can use this to confirm many CAS numbers, similar to commonchemistry.org. Together with this source, we're exploring how to get this data into Wikipedia and Wikidata, and we have been talking about using ChemBox to pull out the information from Wikidata (which I think it does for various other fields already). On the Wikidata side (see Validation of CAS numbers; collaboration with Wikipedia?), I want a clear data model: we don't just want to give the CAS, but also this new source as reference, when it was added/verified, etc. Importantly, I am also thinking about indicating on what basis the statement was made. For example, was this based on InChI(-Key) matching? The model should ideally say this, so that we can detect items where the InChIKey changed after the match was done. We're likely talking a few hundred thousand CAS registry numbers, so I'd like to work out these details early. And the more we can automate the better. Your thoughts, please. --Egon Willighagen (talk) 07:50, 11 October 2020 (UTC)
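To make the requested data model concrete, here is a minimal sketch of one validated CAS statement: the value, its source, the verification date, the basis for the match, and the InChIKey as it stood when the match was made. All field names, the "inchikey-match" label and the placeholder item ID are assumptions for illustration; this is not the actual Wikidata model.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ValidatedCas:
    item_id: str            # placeholder for a Wikidata QID (hypothetical value below)
    cas_rn: str             # the CAS registry number being asserted
    source: str             # the reference that confirmed it
    verified_on: date       # when the match was made
    basis: str              # how the match was made, e.g. "inchikey-match"
    inchikey_at_match: str  # InChIKey of the item at verification time

def is_stale(record: ValidatedCas, current_inchikey: str) -> bool:
    """True if the record was matched on InChIKey and the key has since changed."""
    return (record.basis == "inchikey-match"
            and record.inchikey_at_match != current_inchikey)

rec = ValidatedCas(
    item_id="Q_EXAMPLE",                       # hypothetical identifier
    cas_rn="7732-18-5",                        # water
    source="hypothetical trusted source",
    verified_on=date(2020, 10, 11),
    basis="inchikey-match",
    inchikey_at_match="XLYOFNOQVPJJNP-UHFFFAOYSA-N",
)
print(is_stale(rec, "XLYOFNOQVPJJNP-UHFFFAOYSA-N"))
```

Storing the key used at match time is what allows a later sweep to flag items whose structure was edited after validation.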

I also just left a note on the ChemBox Talk page. --Egon Willighagen (talk) 08:08, 11 October 2020 (UTC)[reply]

@Walkerma: let's get the ball rolling. --Egon Willighagen (talk) 08:13, 11 October 2020 (UTC)[reply]

May I add my two-pence worth? I had extensive experience with chemical/biological database design in the period 2000-2009 before I retired. Part of that experience involved agreeing with our IT experts a relational db schema (in Oracle 8 originally) that treats chemistry fully and correctly. I can't, obviously, disclose that schema, even in outline, without company approval. After implementation we (mainly I) used InChI key matching to fix the ~20k "bad" structures out of ~250,000 good ones we rolled forward from a previous schema. InChI never failed, in the sense that if it detected a duplicate, then there really was duplication of underlying connection tables (MDL/MACCS-based). If starting again today, I'd definitely think of making the InChI the primary key for a chemical substance, although I'd think hard about whether to split the key over columns for the three base-strings within a standard InChI 14-10-1.
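As a rough illustration of the key matching described above, the sketch below splits an InChIKey into its 14-10-1 blocks (so each could sit in its own column) and groups records that share a key. The function names and the record layout are invented for the example; nothing here reflects the undisclosed company schema.

```python
from collections import defaultdict

def split_inchikey(key: str) -> tuple:
    """Split an InChIKey into its three hyphen-separated blocks (14, 10, 1 characters)."""
    parts = key.split("-")
    if [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a 14-10-1 InChIKey: {key!r}")
    return tuple(parts)

def find_duplicates(records):
    """records: iterable of (record_id, inchikey) pairs.

    Returns {inchikey: [record_ids]} for every key that occurs more than once.
    """
    groups = defaultdict(list)
    for record_id, key in records:
        groups[key].append(record_id)
    return {k: ids for k, ids in groups.items() if len(ids) > 1}

# Hypothetical registry rows: two entries for water, one for cyanocobalamin.
rows = [
    ("A1", "XLYOFNOQVPJJNP-UHFFFAOYSA-N"),
    ("A2", "RMRCNWBMXRMIRW-WYVZQNDMSA-L"),
    ("A3", "XLYOFNOQVPJJNP-UHFFFAOYSA-N"),
]
print(find_duplicates(rows))
print(split_inchikey("RMRCNWBMXRMIRW-WYVZQNDMSA-L"))
```

Grouping on the first block alone would instead collect entries that share a skeleton but differ in stereochemistry, which may be useful when hunting for the kind of stereo errors discussed earlier on this page.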

Second pence-worth. We learned the hard way that you need to test the proposals against the toughest cases. If you can handle the 16 individual isomers of cyhalothrin, plus the various combinations of isomers encountered in ordinary lab/manufacturing practice (see the article) and the multiple salts of paraquat, then you're on to a winner. We gave up over some inorganics which were of much less interest, and given my examples you can guess the company. Include me in at any time I can help, but I won't be following your progress closely as I'm elsewhere engaged for the next couple of weeks at least. Michael D. Turnbull (talk) 11:38, 11 October 2020 (UTC)

I'd advocate to curate and maintain this information in Wikidata. It is best equipped to handle mass-edits, and it is the central location to broadcast info to all wikis. A first step could be: list all delta's between enwiki and wd. That is: categorise all Chembox/Drugbox articles that have a mismatch between enwiki and WD, for say CAS number and InChI. Also, use enwiki indexing to differentiate (add QID for each indexed compound); so an article can have multiple compounds correct in WD. (Do the same in dewiki?). -DePiep (talk) 14:07, 11 October 2020 (UTC)[reply]
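The "list all deltas" step could look roughly like the sketch below. It assumes the infobox values and the Wikidata values have already been extracted into plain dictionaries keyed by article title, which glosses over the indexed-compound and QID details; the field names and status labels are made up for illustration.

```python
def list_deltas(enwiki, wikidata, fields=("cas_rn", "inchi")):
    """Compare per-article values and return (article, field, status) tuples.

    enwiki, wikidata: {article_title: {field_name: value}} (assumed layout).
    """
    deltas = []
    for article, local in sorted(enwiki.items()):
        remote = wikidata.get(article)
        if remote is None:
            deltas.append((article, "*", "missing-in-wikidata"))
            continue
        for field in fields:
            a, b = local.get(field), remote.get(field)
            if a == b:
                continue  # values agree (or both absent): nothing to report
            if a is None:
                status = "only-in-wikidata"
            elif b is None:
                status = "only-in-enwiki"
            else:
                status = "mismatch"
            deltas.append((article, field, status))
    return deltas

# Hypothetical extracted data; the Wikidata CAS value for Ethanol is deliberately wrong.
en = {
    "Water": {"cas_rn": "7732-18-5", "inchi": "InChI=1S/H2O/h1H2"},
    "Ethanol": {"cas_rn": "64-17-5"},
}
wd = {
    "Water": {"cas_rn": "7732-18-5", "inchi": "InChI=1S/H2O/h1H2"},
    "Ethanol": {"cas_rn": "64-17-6", "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
}
for row in list_deltas(en, wd):
    print(row)
```

Each status could then map onto a tracking category, so that mismatches are reviewed by hand while one-sided values are candidates for copying across.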

More thoughts. (Cyhalothrin is an excellent example!) So C23H19ClF3NO3 has one en-article, three indexed compounds in the article infobox, and many many isomers. Solution idea: the chemical formula has/needs/gets its own QID; any isomer can have their own too. -DePiep (talk) 18:53, 11 October 2020 (UTC)[reply]

So what is the unique key in that case? 2- 3- and 4-nitrophenol all have the same molecular formula and that's even before one considers E/Z isomers and other stereoisomers and, especially, tautomers (which were a complete nightmare that I first started to tackle in large databases in the 1980s). That's why InChI (or InChI keys) work so well. Michael D. Turnbull (talk) 09:35, 12 October 2020 (UTC)[reply]

Apologies DePiep, that may have sounded harsh in tone; I was very busy yesterday. All good db designs for a substance table will have an index on the mol.form. as that's a very efficient field for an index. That is why, I think, InChIs were designed the way they are: the part between the first two / is indeed the mol.form., so the index can be implemented over that substring, if that turns out to be more efficient than having a separate field. It certainly does ensure data integrity, given the fact that the InChI is the true unique key: the so-called InChI key can, theoretically, be the same for two chemicals. Michael D. Turnbull (talk) 11:25, 13 October 2020 (UTC)
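The point about indexing on the formula substring can be sketched as follows: take the text between the first and second "/" of the InChI. The function name is invented, and edge cases (for instance an InChI that carries no formula layer at all) are not handled.

```python
def formula_layer(inchi: str) -> str:
    """Return the layer between the first two '/' of an InChI string.

    Accepts the string with or without the leading 'InChI=' prefix.
    For ordinary compounds this layer is the molecular formula.
    """
    body = inchi[len("InChI="):] if inchi.startswith("InChI=") else inchi
    layers = body.split("/")
    if len(layers) < 2:
        raise ValueError(f"no layer after the version prefix: {inchi!r}")
    return layers[1]

print(formula_layer("InChI=1S/H2O/h1H2"))
print(formula_layer("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

Because isomers such as the three nitrophenols share this layer, an index built on it narrows a search to candidates but cannot serve as the unique key; that role stays with the full InChI.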

No problem not received as such :-) more like good to learn. I am not familiar enough with CASnumber nor with InChI (but more with DBMS). If I understand you well, InChI is superior to CASno as identifier for (all spatially different) compounds. That would be great: we can identify all compounds, and do so same as in RL! I'll follow the discussion re this. Some questions, I guess not that important now. Maybe save time and skip answering = OK ;-)

- The mol-form to index (and in InChI), is that the "experimental formula" (counting atoms), ~~or should/could there be structural info in there too (brackets, split and mention element X in more places)? If not the basic one, this would allow variants for the same set of atoms.~~ -- Obvious, see InChI. -DePiep (talk) 19:31, 15 October 2020 (UTC).[reply]

- Is there a reason not to use Standard-InChI (1S/) always & everywhere? (In other words, should we get rid of non-standard InChI's?)

- Is there an InChI for generic compounds, like groups or mono-nitrophenol (which can appear as three compounds o, m, p)?

- If the InChI is preferred as ID for compounds wiki-wide, we could make this setup: each InChI (potentially) can get its own QID at Wikidata (=a compound with unique property InChI (P234)). This is where Wikipedia meets the RL chemical world! (CASno's can follow=be added; maybe the current CASno restriction re uniqueness cannot be kept). If generic compounds like mono-nitrophenol have an InChI, better.

- The mol.form in WD can be structural not experimental (e.g., cyhalothrin (Q421593)). If the User Requirement wants a search option by mol.form (yes), we should either search this WD property (chemical formula (P274)), or indeed search part of the InChI string. (The concept of 'index' you refer to, as in "field, part of the multi-field key" or even "repeated persistent field derived from single-field full key" [key=InChI], is obvious in relational DB, but not in the concept of WD. Hence a more complicated & expensive/slow query may be needed).

Anyway your post here about our first task is to the point and so I'll yield my & our time. -DePiep (talk) 17:48, 15 October 2020 (UTC)[reply]

I copy/paste two posts that were made by @DMacks and Egon Willighagen: in a third talk location:

It's a hopeless design flaw that there is a 1:1 mapping from WD entries to WP articles:( Also, major segments of WP fundamentally distrust WD content outright. You might get better acceptance if it's done on enwiki directly (rather than shifting to WD). There is already a flag here for validated data, and I'm not sure if there is a way to have a similar tracking on WD that would appear on enwiki boxes. DMacks 08:19, 11 October 2020 (UTC)

I can personally live with an outcome where Wikipedia-EN and Wikidata are done in parallel. I am aware of all the complications, but hope we can minimize the effort. --Egon Willighagen 09:08, 11 October 2020 (UTC)

@Michael D. Turnbull: Does InChI work with inorganics/organometallics? My understanding has always been that it does not. --Project Osprey (talk) 18:08, 11 October 2020 (UTC)[reply]

No, it absolutely can. RMRCNWBMXRMIRW-WYVZQNDMSA-L is Vitamin B12 (or strictly, cyanocobalamin, see the article's Chembox, where sits the rather fearsome InChI). The problem is legacy compounds if you are trying to convert databases (which in the past I was). In legacy examples we had NaCl or Fe2O3 as simple text strings, whereas the correct InChI for the latter is 1S/2Fe.3O. Most of the WP articles where there is a chembox are, AFAIK, correct even for inorganics/organometallics. Note also that modern drawing packages like Biovia Draw and Chemsketch (both free downloads for private use) can read/write InChI and the latter can use the Chemspider API to sub/structure search direct from one's PC. Michael D. Turnbull (talk) 09:22, 12 October 2020 (UTC)

Ok, that's good news. I suppose the next question is for User:Egon Willighagen. What is the nature of the incoming data? Because the data model is going to be shaped by it. Are the CAS numbers related to InChI or something else? --Project Osprey (talk) 16:35, 18 October 2020 (UTC)

Allow me, an interested outsider, to add this. By now, this is starting to look like the CAS number has an obscure (or paywalled) definition, and there is no way to find the chemical it defines (not helpful that it is a random code, not a descriptive ID. A passport number, not a fingerprint). OTOH, since InChI (and SMILES?) is fully descriptive*, this is the better ID, with its openness as a bonus. The 16:35 Project Osprey response above here illustrates: it reads like "what is CAS number pointing to?" and, literally, "something else". (* "fully descriptive": claim may be too strong).

So, for the primal question here ('how can we use the external CAS number base?') it looks like the answer is: 'we need more info'. OTOH, Michael D. Turnbull offers an off-topic option: let's base the ID on InChI (wikiwide: WD, enwiki, ..). My opinion: since this question, long term, circles around define-and-identify-the-chemical, the answer is simple. (Mean but about the point: why do db-aware chemists here keep hanging on the CAS number?).

InChI also has this advantage: it is self-sourcing. Like math: we do not need to source a "1+1=2" claim (proving it is another thing). Looks quite important for the Wikipedia base. -DePiep (talk) 17:19, 18 October 2020 (UTC)

@Egon Willighagen: To answer your question, I would propose to use your new data set to curate an established and well known database, as example PubChem, and then using the curated CAS numbers in that database, to import them into WD. Why ? Because WD can't be a source. We need to rely on external documents or databases, we need references for the values imported into WD. WD should be the connection between references and other authorities, not becoming the reference. Snipre (talk) 21:42, 13 October 2020 (UTC)[reply]

@Snipre: so that took a lot longer than planned, but in January I received the first batch of data; since then I have been working on tools for the validation and am now able to start releasing the first validation results. Focusing on the easy stuff. I just created a shape expression (not perfect yet, but at least documents the reference model, per our protocol in this paper): [chemical compound with validated CAS registry by number]. I also used QuickStatements to add such references for CAS registry numbers I validated because my reference data (from CAS Common Chemistry) has the same InChIKey (and CAS registry number) as the Wikidata item. I also have Wikidata items with InChIKey matches without a CAS registry number and will discuss next week if I can add those. On the Wikidata discussion page there is some more content. It's moving again! There will be two talks about the project at the upcoming American Chemical Society meeting with more details (but those slides will have to be cleared first by the other authors). --Egon Willighagen (talk) 19:29, 2 April 2021 (UTC)
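The "easy stuff" rule described above (same InChIKey and same CAS registry number in both the reference data and the Wikidata item) can be sketched as a small classifier. This is not the actual validation tooling; the dictionary layout and the status labels are assumptions, and the reference maps each InChIKey to a set of CAS numbers because the correspondence is not guaranteed to be one-to-one.

```python
def classify(item, reference):
    """Classify one item against reference data.

    item:      {"inchikey": str, "cas_rn": str or None}  (assumed layout)
    reference: {inchikey: set of CAS RNs known for that key}
    """
    ref_cas = reference.get(item["inchikey"])
    if ref_cas is None:
        return "no-inchikey-match"
    if item.get("cas_rn") is None:
        return "match-without-cas"   # candidate for adding a CAS RN later
    if item["cas_rn"] in ref_cas:
        return "validated"           # both identifiers agree
    return "cas-conflict"            # same structure key, different CAS RN

reference = {"XLYOFNOQVPJJNP-UHFFFAOYSA-N": {"7732-18-5"}}
item = {"inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N", "cas_rn": "7732-18-5"}
print(classify(item, reference))
```

Only the "validated" bucket would be safe for automated referencing; the other three buckets are the ones that need discussion or manual review.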

First, a bit of pedantry, but I think it's worth pointing out; I love InChIs, @Michael D. Turnbull:, and I think you're absolutely right about them being the best primary key in any substance database. However, CAS RNs are based on the macroscopic substance, whereas InChIs are based on the molecular structure. That means in practice there will not always be a 1:1 correlation between InChI and CAS RN.

But my main point: It sounds to me like we need to have some form of validation for certain Wikidata item-property-value triples. I'm sorry that I'm not familiar enough with Wikidata to know how that would be done, or even if it is already being done somewhere, so please forgive me if I'm off base here. I'm much more likely to trust a datum that has been checked against a trusted source, rather than just imported en masse from a database that may have errors. For example, I would trust ChemSpider to be a reliable source for ChemSpider IDs, but unreliable for CAS numbers. I do think the source @Egon Willighagen: has in mind is going to turn out to be a reliable source (at least, as good as we can realistically expect), but in that case we need to show that provenance as reliable, so we can trust it.

If we had validated Wikidata item-property-value triples, it may be we could establish the trust needed to use Wikidata entries in Chemboxes in the various Wikipedias. Without any validation tagging in Wikidata, it would be seen to be going backwards in terms of data reliability. Is this kind of "reliable provenance tagging" possible? Thanks, Walkerma (talk) 06:45, 14 October 2020 (UTC)[reply]

Walkerma Not pedantry but a fundamental insight that I believe is the crux of the meaning of "chemical substance". When saying "this is a (new) chemical substance" one has in mind a concept, not a sample. A single molecule is enough: all properties such as MP and BP are, as you say, based on the macroscopic [sample of] that substance but if we haven't made any of it yet, how can we possibly know what their values might be? We might have a good estimate for BP, but given that many pure samples exist in polymorphic forms, we can't have one for MP. Yet we may for very good reasons (perhaps to state the scope of a patent claim) want to as-it-were reserve a particular set of (never-to-be-made) substances as being within the Markush structure of our claim. CAS had the problem that not only did they have to index all the listed substances in a patent, they in some cases had to assign different CAS numbers to different tautomers: mesotrione is the test case for whether you are handling tautomers correctly (adding to cyhalothrin and paraquat). P.S. I've asked Syngenta for the necessary permissions. Michael D. Turnbull (talk) 10:03, 14 October 2020 (UTC)[reply]

Walkerma Syngenta have not yet said "no" and given the document I seek to share is dated 2008, I'm sure they'll give me permission soon.

Meanwhile, I will note that pedantically the true Key to a substance table (that which identifies a single row) is almost always a business key: for ChemSpider it is the ChemSpider ID, for example. Thus ChemSpider ID #236 is linked in their database to Benzene (presumably as a connection table), the molecular formula for benzene and so on. These are the columns of that table. The unique identifier for a substance will (for example) be its InChI + one or more other fields to handle the awkward cases like the allotropes of carbon. Then this combination of fields (i.e. columns) is associated with the business key. As I understand it, a CAS number does not have the properties necessary for it to be a business key; for example, multiple CAS numbers may in fact refer to the same substance (because of tautomerism, amongst other historic reasons). Mike Turnbull (talk) 17:50, 18 October 2020 (UTC)[reply]

Like someone said: "I do think the source Egon Willighagen: has in mind is going to turn out to be a reliable source". I could not tell at the time, but the reliable source is CAS Chemistry itself. The news is the (close to) 500,000 CAS registry numbers now released on CAS Common Chemistry (press release). This validation is in collaboration with the CAS Chemistry team. --Egon Willighagen (talk) 19:51, 2 April 2021 (UTC)[reply]

Chembox

Some background notes on {{Chembox}}. I am not a chemist; I do maintain Chembox.

1. Drugbox. For this, it is best to also look at {{Infobox drug}} (aka 'Drugbox'), and treat the same. In there, the identifiers mentioned here are equally present (eg Aspirin). Together, these infoboxes have 11500 + 7200 = 18700 instances (infobox in article); almost all disjoint (not both infoboxes in one article).
2. Read from WD. Today, three parameters are read from Wikidata, in both infoboxes (eg ammonia):

E number Category:E number from Wikidata (372)

ECHA InfoCard ID Category:ECHA InfoCard ID from Wikidata (10,995)

|DTXSID= (CompTox Chemicals Dashboard)

The latter two can be locally overwritten (=overwrite WD value by entering a parameter value on enwiki; or 'none' to suppress).

They have the option to Edit-At-Wikidata: the pencil icon that is a link to the Wikidata page (per good WP design: editors can edit!).

|DTXSID= was added to Wikidata by d:User:ChemConnector, to allow bulk-adding (April 2019). (Instead of: adding all values to enwiki infoboxes at enwiki, dewiki, ...). We could ask ChemConnector their experiences (working on Wikidata).

3. IDs. As DMacks and Michael D. Turnbull noted here, the "1:1" relation between en-article and wd-QID ("ammonia" <-> "Q4087") is tricky. As far as I know, there has been no systematic check for correctness. For example, how is the distinction between isomers, racemic and chiral compounds maintained? Wikidata people have built an orthography for this; I am not familiar with it.

In Chembox and Drugbox, variants are added by indexes: |CASno=, |CASno1=, ..., |CASno5=; and |InChI=, |InChI1=, ..., |InChI5=. (eg Linalool, Dithiane). Apart from the soft guideline same-index-is-same-compound, no integrity checks are performed. See also Category:Chemical articles with multiple compound IDs (2,215).
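
A first integrity check on the same-index-is-same-compound guideline could look roughly like the sketch below. It is only an illustration: it assumes the infobox parameters have already been parsed into a plain dict, the function names are hypothetical, and it merely reports indexes where a CAS number and an InChI are not both present. It does not verify that the two actually describe the same compound.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches CASno, CASno1 ... and InChI, InChI1 ... (but not e.g. InChIKey).
PARAM_RE = re.compile(r"^(CASno|InChI)(\d*)$", re.IGNORECASE)


def group_by_index(params: Dict[str, str]) -> Dict[int, Dict[str, str]]:
    """Group non-empty CASno*/InChI* parameters by their numeric index.

    The unindexed parameter (|CASno=, |InChI=) is treated as index 0.
    """
    groups: Dict[int, Dict[str, str]] = defaultdict(dict)
    for name, value in params.items():
        match = PARAM_RE.match(name.strip())
        if match and value.strip():
            base = match.group(1).lower()       # 'casno' or 'inchi'
            index = int(match.group(2) or 0)
            groups[index][base] = value.strip()
    return dict(groups)


def unpaired_indexes(groups: Dict[int, Dict[str, str]]) -> List[int]:
    """Indexes that have a CAS number without an InChI, or the reverse."""
    return sorted(i for i, g in groups.items() if set(g) != {"casno", "inchi"})


# Placeholder values, not real identifiers.
params = {
    "Name": "Example",
    "CASno": "<cas-0>", "InChI": "<inchi-0>",
    "CASno1": "<cas-1>",            # no |InChI1= given
    "InChIKey": "<key-0>",          # ignored: not one of the indexed pairs
}
print(unpaired_indexes(group_by_index(params)))
```

Pages flagged this way could be collected in a maintenance category so that only the unpaired cases need a human look.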

Chembox has no policy to handle InChI and Standard InChI systematically (both or one can be present).

4. Validation. DMacks mentions the validation system, in use for both infoboxes (and not outside). It adds Y or N to a set of parameters that have been verified (by a trusted editor). I have no guess on its completeness; it appears that many infoboxes have not been screened for this in a long time. For example, Category:Chemical articles without CAS registry number (1,070) has had between 1500 and 2000 articles 'forever'.

-DePiep (talk) 13:35, 11 October 2020 (UTC)[reply]

Chembox Validation, CheMoBot

Validation. DMacks mentions the validation system, in use for both infoboxes (and not outside). It adds Y or N to a set of parameters that have been verified (by a trusted editor). I have no guess on its completeness; it appears that many infoboxes have not been screened for this in a long time. For example, Category:Chemical articles without CAS registry number (1,070) has had between 1500 and 2000 articles 'forever'.
-DePiep (talk) 13:35, 11 October 2020 (UTC)

On validation, I have noticed many people have a template to use when they create a chembox that says that the entry is validated. So I suspect that the validation green ticks are not indicating that it has been checked by a "trusted user". If I find a red cross, yet the value appears correct, then I change it to the "correct" value. I assume this is how it's supposed to work. But automated or semiautomated checks would be valuable. Because of "distrust" of Wikidata, I would not recommend just applying the values found there. Instead we could have a hidden category that says there is a mismatch. For indexed chemboxes, perhaps qid(n) could be actually used or added to match up multiple Wikidata entries. I would also recommend that a list of mismatched entries should be made. This matching should also include German Wikipedia, which also has good quality infoboxes. I suppose the Wikidata entries could also have a known good reference for the CASNo. Currently there are several references that are in use, but they are not all 100% good, and they all have missed entries. (eg PubChem, ECHA, ChemSpider) Graeme Bartlett (talk) 00:40, 12 October 2020 (UTC)[reply]

Quite possibly I've been doing the wrong thing but when I work on a chemical article at any length I always check all the Chembox lines that point to external databases and ensure that there is |StdInChI_Ref=^Y or |ChemSpiderID_Ref=^Y etc, then I delete the |Watchedfields = changed |verifiedrevid=464365162 lines (or set them to blank) so that the larger green/red ticks disappear from the bottom of the box. I assume that there is intended to be a bot that acts as guardian angel for certain values in a Chembox but absence of its |watchedfields flag is not an error. Michael D. Turnbull (talk) 11:37, 13 October 2020 (UTC)[reply]

The validation in en:WP is done by CheMoBot by checking against a list of validated RevIDs. The idea is that once validation is done on these stable fields, you don't (usually) need to change it. For example, the CAS no. is very unlikely to change, so why should anyone ever need to edit that field? (Yes - I accept that very occasionally there are revisions made - once in a blue moon.) I did once see someone trying to vandalize the w:Sulfuric acid article ChemBox, and the bot caught it within about five minutes, adding the red X. Walkerma (talk) 06:12, 14 October 2020 (UTC)[reply]
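
The idea of comparing stable fields against a verified revision can be illustrated with a small sketch. This is not CheMoBot's code; the function name and the dict-based representation of an infobox are assumptions made for the example.

```python
from typing import Dict, List, Sequence


def changed_watched_fields(
    verified: Dict[str, str],
    current: Dict[str, str],
    watched: Sequence[str],
) -> List[str]:
    """Return the watched fields whose value differs from the verified revision.

    A field that was removed or newly added also counts as changed.
    """
    return [name for name in watched if verified.get(name) != current.get(name)]


# Sulfuric acid, with a hypothetical vandalised CAS number in the current revision.
verified = {"CASNo": "7664-93-9", "ChemSpiderID": "<id>"}
current = {"CASNo": "0000-00-0", "ChemSpiderID": "<id>"}
print(changed_watched_fields(verified, current, ["CASNo", "ChemSpiderID"]))
```

An empty result would correspond to keeping the green tick; any listed field would get the red X until someone re-verifies it.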

That Bot last edited in June 2018. So it sounds inactive. But perhaps it could run again. Graeme Bartlett (talk) 02:27, 15 October 2020 (UTC)[reply]

Validation at Wikidata

I currently have no experience with automated editing of infoboxes (I stopped around the time when DBPedia had a lot of errors extracting SMILES, many, many years ago). All updates are welcome. It's good there are bots right now to support this! So, on Wikidata I have started validating (see also my reply earlier in the thread tonight (CEST)) and adding references (35 now; I am going to give the Wikidata chemistry team a bit more time to provide feedback on the reference model) and will then scale up. This Wikidata Query Service query can return the validated CAS registry numbers (based on identical InChIKey): https://w.wiki/39oN Looking forward to hearing more about how this can be used to validate CAS numbers in the chemboxes. cc Walkerma --Egon Willighagen (talk) 20:05, 2 April 2021 (UTC)[reply]
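
For readers who have not used the Wikidata Query Service, a query of this general kind could be built as sketched below. This is not the query behind the short link above. It assumes the validation reference is recorded with "stated in" (P248) on the CAS Registry Number (P231) statement, and the source item's QID is left as a parameter because it is not given in this discussion; the function names are hypothetical.

```python
import re
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_validated_cas_query(source_qid: str, limit: int = 100) -> str:
    """SPARQL for items whose CAS RN statement has a reference 'stated in' source_qid.

    P231 = CAS Registry Number, P235 = InChIKey, P248 = stated in.
    The reference model itself is an assumption of this sketch.
    """
    if not re.fullmatch(r"Q\d+", source_qid):
        raise ValueError(f"not a Wikidata QID: {source_qid!r}")
    return f"""
SELECT ?item ?cas ?inchikey WHERE {{
  ?item p:P231 ?statement ;
        wdt:P235 ?inchikey .
  ?statement ps:P231 ?cas ;
             prov:wasDerivedFrom ?reference .
  ?reference pr:P248 wd:{source_qid} .
}}
LIMIT {int(limit)}
""".strip()


def build_query_url(query: str) -> str:
    """URL that asks the query service for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


# "Q0" is a placeholder; replace it with the QID of the reference source.
print(build_query_url(build_validated_cas_query("Q0", limit=10)))
```

The resulting URL can be fetched with any HTTP client; the bindings in the JSON response give the item, CAS registry number and InChIKey per row.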

First results are in. I have started validating CAS registry numbers in Wikidata, and using that data (this query: https://w.wiki/3A8C) I matched up CAS/Wikidata/Wikipedia. The approach is very basic and it seems the Wikipedia community can do better: it just checks if the Wikipedia page mentions the CAS registry number validated in Wikidata (using a basic substring match on the full page HTML). Result: out of the 72, 58 do indeed seem to be mentioned, but 15 are not. An example is Ferulic acid, where Wikipedia lists the CAS number of a record where the stereochemistry is not defined, whereas Wikidata has the stereospecific one. Another common pattern is Category pages, which have been a longer-standing Wikipedia/Wikidata interoperability issue, where Wikidata is often too specific for the matching Wikipedia page. Over the next weeks more and more validation will become available. The list of 15: https://gist.github.com/egonw/6cfc455193d302abc2dde4f920a11426 --Egon Willighagen (talk) 10:25, 5 April 2021 (UTC)[reply]
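
One way to tighten the basic substring match is to look only for complete CAS-shaped tokens and to confirm the check digit before comparing. The sketch below is an illustration under that assumption, not the script used for the results above; the function names are hypothetical. The check digit rule is the standard one for CAS registry numbers: weight the digits before the check digit 1, 2, 3, ... from right to left, sum them, and take the result modulo 10.

```python
import re

# A CAS RN is 2-7 digits, 2 digits, 1 check digit; the lookarounds reject
# matches embedded in longer digit/hyphen runs.
CAS_RE = re.compile(r"(?<![\d-])(\d{2,7})-(\d{2})-(\d)(?![\d-])")


def cas_checksum_ok(cas: str) -> bool:
    """True if `cas` has the CAS RN shape and a correct check digit."""
    match = CAS_RE.fullmatch(cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def mentions_cas(html: str, cas: str) -> bool:
    """True if `html` contains `cas` as a complete, checksum-valid token."""
    if not cas_checksum_ok(cas):
        return False
    return any(m.group(0) == cas for m in CAS_RE.finditer(html))


page = "<td>CAS Number</td><td>7732-18-5</td> ... catalogue no. 171-43-25"
print(mentions_cas(page, "7732-18-5"))  # complete token present
print("71-43-2" in page)                # bare substring match: false positive
print(mentions_cas(page, "71-43-2"))    # token-aware match rejects it
```

This still cannot tell which compound on a page a number belongs to, so pages with several Chemboxes or several indexed CAS numbers would need a closer look.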

Would it help if enwiki lists (categorises) all Chembox articles by CAS comparison, like:

(see below, #Chembox support) -DePiep (talk) 17:55, 5 April 2021 (UTC)[reply]

-DePiep (talk) 12:34, 5 April 2021 (UTC)[reply]

(Edit conflict with DePiep) OK, Egon Willighagen, I've downloaded your list and started fixing the 15 (see Ferulic acid for example where I've left the old CAS commented out in the Chembox). I'll do the others the same way. If you get massive lists later, we clearly need a bot! I'm sure DePiep can help with that... Mike Turnbull (talk) 12:37, 5 April 2021 (UTC)[reply]

Turns out that most of the fixes needed were on Commons Categories, which were being updated via Wikidata by Egon. Only Ferulic acid, Trifluperidol and Iron(II) perchlorate needed attention on en.wp. I didn't do the latter because the whole thing is a mess because of multiple possible hydrates and CAS's poor handling of inorganics. Mike Turnbull (talk) 13:11, 5 April 2021 (UTC)[reply]

Hope I understand this process right: currently the Egon Willighagen project is to curate Wikidata CAS RN's based on source https://commonchemistry.cas.org. enwiki's {{Chembox}} & {{Drugbox}} are invited to follow (check & improve local CAS RN's using sourced WD value). So far, no updates from enwiki -> Wikidata are expected (except maybe to disentangle stereochemistry or salt issues). -DePiep (talk) 14:20, 5 April 2021 (UTC)[reply]

We may want to talk about the Categories, because in these cases the link to Wikidata is likely incorrect, as Wikidata is about one specific chemical. Indeed, I know how to automate the validation, but on the Wikipedia side things are harder (more than one ChemBox on one page, or only one but it's just an example, etc, etc) --Egon Willighagen (talk) 14:46, 5 April 2021 (UTC)[reply]

With this, I'd like to learn how enwiki's Chembox can help (through code). -DePiep (talk) 14:20, 5 April 2021 (UTC)[reply]

I will keep the list on https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Chemistry/CAS_validation/CASCommons updated. In the next days more and more CAS numbers will get marked as validated in Wikidata, and the list on this page will follow. Note that the category pages should likely be fixed too, but depends on the specific situation. Generally, I think Wikidata is more specific than Wikipedia, and the Wikipedia category page should be linked to a new page in Wikidata reflecting the category instead. But this may need the Chemistry teams of Wikipedia and Wikidata to sit together and discuss the options and make some choices. --Egon Willighagen (talk) 15:00, 5 April 2021 (UTC)[reply]

Looks like enwiki (Chembox) is not a help in curating CAS--Wikidata. In that case, editing (updating) enwiki Chembox articles can be postponed. (This would also add time to convince WP:CHEMICALS editors to accept WD data: read CAS from WD).

OR, if Chemboxes actually are a help (eg to disambiguate stereochemical entries), we can start working to make that useful.

Whichever way, I support the process of curating Wikidata not enwiki, because of: better automation possible, identification & separation (no 5 chemicals in one article/'ID'), standard good sourcing option, publishes wiki-wide/international, covers commonscat, etc. -DePiep (talk) 15:13, 5 April 2021 (UTC)[reply]

I agree. Wikidata needs a source for many more chemicals than have articles on en.wiki and CAS seems like a reasonable master for curating some of them. There will be "surprises" as a result: for example Alliin, which was on Egon Willighagen's list of the first 15. It turns out that Wikimedia Commons has Q57741744, which is CAS 17795-27-6 BUT that's for a totally unspecified stereochemistry (the compound has 2 chiral centres = 4 isomers and Wikidata calls this a "group of stereoisomers"). Our article is about the natural product, which is CAS 556-27-4, although it also mentions a diastereomer first synthesized in 1951, adding to the complexity (CAS will at some point also have indexed that one). So, in my opinion the automation should focus on curating Wikidata, while we rely on expert editors to check and if necessary fix our articles and their Chemboxes, assisted, perhaps, as DePiep suggests below (where I'll also comment in a moment). Mike Turnbull (talk) 11:34, 6 April 2021 (UTC)[reply]

Chembox support

Would it help if enwiki lists (categorises) all Chembox articles by CAS comparison, like:

cat 1: WD enwiki: CAS same

cat 2: WD enwiki: CAS differ

cat 3: WD enwiki: one CAS missing

...

cat X: Chembox with stereochemistry involved

-DePiep (talk) 12:34, 5 April 2021 (UTC)[reply]
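
The category assignment proposed above could be sketched as a small function. This is only an illustration of the comparison logic: the function name is hypothetical, the category labels are copied from the list, the extra "both missing" bucket is an assumption added here, and detecting "stereochemistry involved" is left out because it needs more than the two CAS values.

```python
from typing import Optional


def cas_category(wd_cas: Optional[str], enwiki_cas: Optional[str]) -> str:
    """Pick a tracking category from the Wikidata and enwiki CAS values.

    None (or an empty string) means the value is absent on that side.
    """
    wd = (wd_cas or "").strip()
    en = (enwiki_cas or "").strip()
    if not wd and not en:
        return "WD enwiki: both CAS missing"  # assumption: not in the proposed list
    if not wd or not en:
        return "WD enwiki: one CAS missing"   # cat 3
    if wd == en:
        return "WD enwiki: CAS same"          # cat 1
    return "WD enwiki: CAS differ"            # cat 2


# Alliin, using the two numbers quoted earlier in this discussion.
print(cas_category("17795-27-6", "556-27-4"))
```

For indexed Chemboxes the same function would have to be applied per index, or to the set of all CAS values in the box.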

Thinking: why not concentrate on {{Infobox drug}} (with 6800 CAS RNs), instead of {{Chembox}} (11k)? {{Drugbox}} is by {{Infobox}} already! -DePiep (talk) 19:20, 5 April 2021 (UTC)[reply]

I don't think that the difference between 11k and 18k (i.e. all compounds) is that great, if you can automate some of the checking so that human editors only have to reassess a small group of articles. While drugs are likely to have stereochemistry, the "common" CAS number will likely be the important one, that is, the one to have reached the market and have the significant body of publications. Simple compounds without isomer issues should only ever have had one CAS, I would hope, while inorganics will be a mess! There are, we know, examples like cyhalothrin with multiple marketed isomer groups. I can vouch that this one is already correct - but might be a good test of the automation ;-) Mike Turnbull (talk) 11:44, 6 April 2021 (UTC)[reply]