
Wikipedia talk:WikiProject Molecular Biology/Gene Wiki/Archive 1

From Wikipedia, the free encyclopedia
Gene Wiki – Discussion


Idea

Would it be possible to create a references section on the page using the bibliography data in the Entrez gene page? TimVickers 22:41, 4 May 2007 (UTC)

Another interesting suggestion is to sort the input database according to GeneRIF and only import genes where there is a proposed or known function. TimVickers 00:50, 5 May 2007 (UTC)

Great ideas, added it to the specs. I propose adding any gene with either a gene2pubmed entry (minus the genome-wide annotation papers that are linked to tens of thousands of genes) or a generif. sound good? AndrewGNF 18:10, 5 May 2007 (UTC)
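The selection rule proposed above could be sketched roughly as follows. This is only an illustration, not the bot's actual code: the input shapes, the function name select_genes, and the cutoff max_genes_per_paper are all assumptions, since the thread does not say how a "genome-wide annotation paper" would be detected.

```python
from collections import defaultdict


def select_genes(gene2pubmed, generif_genes, max_genes_per_paper=1000):
    """Return gene IDs that have a GeneRIF or at least one non-bulk citation.

    gene2pubmed: iterable of (gene_id, pubmed_id) pairs.
    generif_genes: set of gene IDs that have at least one GeneRIF.
    max_genes_per_paper: hypothetical cutoff; a paper linked to more genes
        than this is treated as a genome-wide annotation paper and ignored.
    """
    pairs = list(gene2pubmed)

    # Count how many distinct genes each paper is linked to.
    genes_per_paper = defaultdict(set)
    for gene_id, pubmed_id in pairs:
        genes_per_paper[pubmed_id].add(gene_id)

    # Start from the GeneRIF genes, then add genes cited by "normal" papers.
    selected = set(generif_genes)
    for gene_id, pubmed_id in pairs:
        if len(genes_per_paper[pubmed_id]) <= max_genes_per_paper:
            selected.add(gene_id)
    return selected


# Toy data: paper 900 is linked to five genes, so a cutoff of 3 ignores it.
links = [(1, 100), (2, 900), (3, 900), (4, 900), (1, 900), (5, 900)]
print(sorted(select_genes(links, generif_genes={3}, max_genes_per_paper=3)))
```

With the toy data, gene 1 is kept because of paper 100 and gene 3 because of its GeneRIF; the others are only linked through the bulk paper.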

Sounds good. These were several ideas that came out of the discussion of how to apply this bot. I wanted to cut off any chance that we would run the bot, upload thousands of entries and then have people object and delete them. The discussion is [1] and seems to have coalesced around the idea that we should create stubs for each gene with attributable references. You may also want to add your own comments to this discussion. TimVickers 21:40, 7 May 2007 (UTC)

Whoa, great discussion I've been missing out on! Looks like lots of good feedback and ideas there contributed in a very short amount of time. Anyway, thanks for the pointer and starting the thread. I'll be sure to chime in... AndrewGNF 22:58, 7 May 2007 (UTC)
Since Tim Vickers' reference to the Village Pump discussion (above) has timed out, I replaced it with a history pointer. EdJohnston 19:38, 25 June 2007 (UTC)

evaluation metrics

One of the planned outputs of this project is a peer-reviewed publication, and it would be nice to include in that publication some metrics of ProteinBoxBot's impact. Some ideas:

  • Obviously, simply the number of articles created or edited by PBB
  • the number of pre- and post-PBB edits by other users (say, within a +/- one-month time frame)
  • the number of wikilinks per article (either to other PBB pages or to other WP pages as a whole)

other ideas? AndrewGNF 17:17, 21 September 2007 (UTC)
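As a rough illustration of the second metric, the pre/post edit counts could be computed from an article's revision history along these lines. The (timestamp, username) tuple format and the function name edits_around_bot are assumptions made for the sketch; real revision data would have to come from the wiki itself.

```python
from datetime import datetime, timedelta


def edits_around_bot(revisions, bot_name="ProteinBoxBot", window_days=30):
    """Count non-bot edits before and after the bot's first edit.

    revisions: list of (timestamp, username) tuples for one article.
    Returns (before, after), or None if the bot never edited the article.
    """
    bot_times = [ts for ts, user in revisions if user == bot_name]
    if not bot_times:
        return None
    first_bot_edit = min(bot_times)
    window = timedelta(days=window_days)

    before = after = 0
    for ts, user in revisions:
        if user == bot_name:
            continue  # only edits by other users are counted
        if first_bot_edit - window <= ts < first_bot_edit:
            before += 1
        elif first_bot_edit < ts <= first_bot_edit + window:
            after += 1
    return before, after


# Made-up history for a single article.
history = [
    (datetime(2007, 10, 1), "UserA"),    # more than 30 days before: ignored
    (datetime(2007, 10, 20), "UserB"),
    (datetime(2007, 11, 6), "ProteinBoxBot"),
    (datetime(2007, 11, 10), "UserC"),
    (datetime(2007, 11, 25), "UserA"),
    (datetime(2007, 12, 20), "UserD"),   # more than 30 days after: ignored
]
print(edits_around_bot(history))  # (1, 2)
```

Summing these pairs over all PBB articles would give one number for the paper; the window length is a free parameter.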

You might refer to the discussion below and describe how you addressed the criticism/comments in WP. It is possible that some reviewers of your paper will have very similar comments or concerns.Biophys 18:30, 6 November 2007 (UTC)

But what is the subject/title of your publication? Is it "Curated database of human protein expression profiles", or is it "Using Wikipedia as a collaborative tool for annotation of human genome"?Biophys 14:26, 7 November 2007 (UTC)
Something more along the latter. The former have already been published here and here. AndrewGNF 17:54, 7 November 2007 (UTC)
Good luck! It would be interesting to look at your article. I think this will also provide some good publicity for WP.Biophys 04:18, 9 November 2007 (UTC)

Some comments and a question

I like this Bot. Now I realize that it can only modify articles about one protein (or gene?). That is probably fine. But I think it would be very helpful if your Bot also provided links to family databases, such as Pfam and SMART. A technical question: how do you identify all relevant PDB files? As indicated in a Uniprot record for the protein?Biophys 21:27, 1 November 2007 (UTC)

Yes, to make your project more useful for WP, one would need to create also articles or disambig. pages for protein families or clusters of orthologous genes. Did you think about using COGs database from NCBI for linking related genes? I am not sure though if this is practical.Biophys 21:39, 1 November 2007 (UTC)
I strongly believe that each gene deserves its own gene page. Protein families are defined mostly by sequence similarity ("molecular function"), but genes within the same protein family can often be involved in very different biological processes. And as we get further and further along in gene annotation, I think it only makes sense to give each gene a page to talk about its own peculiarities. As you note, many pages devoted to gene families can also summarize the features in common.
Regarding systematic efforts to deal with protein families, I would support that, but perhaps that will be a separate effort... ;) For example, have you seen this effort to generate categories based on EC data? Similar efforts based on PFAM/SMART/GO would I think be worthwhile too. (COGs I think are a bit different than these other efforts -- I believe they link orthologous genes across species, instead of related genes within a species.) We can certainly think about adding more protein family information to the PBB protein box, but the masters student we have on this right now would really appreciate not having additional requirements added at this stage... ;) But thankfully we have an update infrastructure in place so that adding these data later shouldn't be a problem...
PDB links can be automatically harvested from Ensembl, and we supplement that with some of our own sequence analysis. Cheers,AndrewGNF 22:21, 1 November 2007 (UTC)
Oh, and regarding the use of gene symbols versus gene names which you raised at Talk:Phosphatidylinositol_transfer_protein, I have primarily been creating them at the gene symbol because for the most part they are unique and well-structured. For example, the official name of CD117 is "V-kit Hardy-Zuckerman 4 feline sarcoma viral oncogene homolog". Quite a mouthful. Anyway, default has been to create pages at the gene symbol and adding redirects from the gene name (when it's reasonable). I'm open to discussion on this though... (other examples: PIK3R1 is "Phosphoinositide-3-kinase, regulatory subunit 1 (p85 alpha)", FASLG is "Fas ligand (TNF superfamily, member 6)", etc.) Cheers,AndrewGNF 22:29, 1 November 2007 (UTC)
I agree, it is logical to have a separate article for each protein. As for short gene names, such as PIK3R1, this is questionable from WP perspective. I would rather make abbreviations as redirects. Did you ask for any third opinion about that?Biophys 20:48, 2 November 2007 (UTC)
Hmm, I can't recall any direct discussions on this issue advocating either way. Two possibly relevant thoughts, and then a proposal. First, the bot approval process went through clearly with the gene symbol as the page title. Second, gene names can often be pretty obscure and complicated -- not sure if it would be appropriate for page title. (I've put the names of the top 500 genes that we'll be going through at User:AndrewGNF/Sandbox.) My proposal is this: in the case where there is a clear and simple gene name, put the page there and redirect the gene symbol (example: Endothelin 1 and EDN1). When the gene name is too confusing, then we put it at the gene symbol. (Unfortunately, the bot can't make that judgment call so it would default to putting it at the gene symbol, but fully automated edits only account for ~10-20% of PBB pages right now.) What do you think? AndrewGNF 21:27, 2 November 2007 (UTC)
I think this compromise solution is fine.Biophys 23:24, 3 November 2007 (UTC)
Sounds reasonable to me. If anyone disagrees, this can be corrected. Perhaps a separate WP article is needed to explain the protein expression patterns shown in the Protein Box or some Legend. How have these patterns been determined? Why are several patterns shown for some proteins? Why do we not simply say: "this protein is expressed in retina" as in Human Protein Reference Database? Why has the site of expression been defined as "kidney" or "eye", instead of indicating the specific type of cell per List of distinct cell types in the adult human body? The existing proteomics articles do not seem to explain exactly these questions, unless I am missing something.Biophys 22:55, 2 November 2007 (UTC)
All good and valid questions. In order... All of these expression patterns come from the SymAtlas database (with which I am affiliated, and potential conflicts and WP:OR issues have been discussed and addressed previously), so that means they are gene expression measurements from microarrays. Several patterns are sometimes shown because the microarrays often have several probe sets interrogating a single gene; since there is no good way to systematically determine which is "right", we show them all (up to three). I actually prefer seeing the barcharts over the "Gene X is expressed in Tissue Y" declaration. The barcharts are from a systematic survey of expression so seeing the quantitative variation can be useful, whereas the black-and-white statements are cherry-picked from literature sources without context. Going down to the level of cell-type specificity would clearly be cool, but beyond the scope of the original data set (for the most part, only whole tissues were used).
I think I've answered all of your questions to you (but of course, let me know if you want further clarification). I like the idea of putting this in some sort of legend so that everyone else can also see the answers. Suggestions on where to put such a legend? Perhaps a discreet "?" link in the template? AndrewGNF 23:16, 2 November 2007 (UTC)
Thank you! To me "expression in the eye" is ridiculous, although there is nothing we can do about it. Some kind of legend as a link would be good.Biophys 23:25, 2 November 2007 (UTC)
Yeah, but the people who need more resolution than "expression in the eye" can pay for their own experiments! AndrewGNF 00:22, 3 November 2007 (UTC)
Maybe not ridiculous. If it was the only place expressed that would be very useful information. Besides what more can one expect from a limited selection of differentiated tissue analysis? Clearly a developmental times series is more preferable, however, I know where Andrew will tell me to go if I ask for that. Or he'll raid my bank account :) David D. (Talk) 04:07, 6 November 2007 (UTC)
I don't know about more preferable (the "Gene Atlas" data set has been pretty darn useful), but I agree, developmental atlas would be great too. A human atlas would be difficult for obvious reasons, but a mouse atlas would be great! Anyone got $30-50k to spare? AndrewGNF 17:13, 6 November 2007 (UTC)
I'm just projecting my particular interest onto you, like a subliminal message ;) Who knows, you might do it and I won't have to fork over the cash? OK, maybe not. Clearly the current resource is more than pretty darn useful! David D. (Talk) 17:18, 6 November 2007 (UTC)
Oh, and on the naming issue, Forluvoft pointed out that in fact there is already a naming convention defined, and it is to use the gene symbol as the page name. AndrewGNF 14:47, 3 November 2007 (UTC) (Actually, that's not the correct interpretation...) AndrewGNF 22:57, 3 November 2007 (UTC)

Entrez Gene

Is copyright status of Entrez Gene compatible with Wikipedia, so we can copy anything from there?Biophys 00:20, 4 November 2007 (UTC) I guess it is after looking at their disclaimer.Biophys 01:47, 4 November 2007 (UTC)

Most notably, OMIM content is not. Sigh, it would be great to use that content to seed the free-text section. (If OMIM were *really* smart, they'd just convert their own site over to a wiki, but I hear NCBI wants nothing to do with public wikis...) AndrewGNF 03:27, 5 November 2007 (UTC)

Problems with protein domains

1. Let's take PTK2 as an example. As you can see for PDB files in Uniprot entry,

  • 1K04; X-ray; A=891-1052.
  • 1K05; X-ray; A/B/C=891-1052.
  • 1MP8; X-ray; A=411-686, and so on.

It means that PDB structures 1K04 and 1MP8 correspond to different structural domains of the same polypeptide chain/gene product. So, you would have to make pictures for both domains and show them both in the article with names of the corresponding domains. The names of domains could be taken from the Uniprot entry (taking them from SCOP would be probably more difficult). Uniprot says that it includes one FERM domain and one protein kinase domain (see DOMAIN field and some other fields, such as SMART). This information is extremely important and ideally could be included in WP article automatically as "Domain structure".

2. In case of KPNB1, for example, the bot picked up image of first PDB file of this protein, 1F59; (X-ray; A/B=1-442, 442 residues), which represent only one of its domains, instead of using 1QGK (X-ray; A=1-876), which represents more complete structure (876 residues).

3. Another problem appears for PDB entries that represent multimeric complexes from several polypeptide chains, only one of which is your gene product. The easiest way to deal with that would be selecting a PDB entry with the largest number of polypeptide chains/subunits or all PDB entries with different number of subunits (this is even better) - for making images. However, there is another catch: some symmetry-related subunits are sometimes not included in PDB entries. To resolve this, one would have to take structures of the complexes from Protein Quaternary Structure server/database and draw their pictures, rather than taking them from the PDB. I am not sure you can correct this, but at least keep this in mind.Biophys 02:42, 4 November 2007 (UTC)
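To make point 2 concrete, here is a minimal sketch of picking the PDB entry that covers the most residues instead of simply the first one listed. It assumes the simple cross-reference form quoted in this thread ("ID; method; chains=start-end"); entries with several residue ranges, and the multimer problem from point 3, are not handled. The names parse_pdb_line and longest_coverage are made up for the example.

```python
import re

# Matches the simple form quoted above, e.g. '1K05; X-ray; A/B/C=891-1052.'
PDB_LINE = re.compile(r"^(\w{4});\s*([^;]+);\s*([\w/]+)=(\d+)-(\d+)\.?$")


def parse_pdb_line(line):
    """Split one cross-reference line into ID, method, chains and range."""
    match = PDB_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised PDB line: {line!r}")
    pdb_id, method, chains, start, end = match.groups()
    return {
        "id": pdb_id,
        "method": method.strip(),
        "chains": chains.split("/"),
        "start": int(start),
        "end": int(end),
    }


def longest_coverage(lines):
    """Return the entry spanning the most residues (first one wins ties)."""
    entries = [parse_pdb_line(line) for line in lines]
    return max(entries, key=lambda e: e["end"] - e["start"] + 1)


# The KPNB1 example from point 2.
kpnb1 = ["1F59; X-ray; A/B=1-442.", "1QGK; X-ray; A=1-876."]
print(longest_coverage(kpnb1)["id"])  # 1QGK
```

Residue span is only one possible criterion; resolution or experimental method could be added to the sort key in the same way.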

Whew, you raise a lot of good points and you're absolutely right. I'm sure there are lots of better ways of doing this, and for now, we are doing the incredibly naive thing and just taking the first linked PDB code (which is "first" for likely some arbitrary reason). As I mentioned to Forluvoft, we're a bit resource constrained in terms of adding in the additional logic. But, the good news is that A) we've built in a pretty good update mechanism for making the protein box better on each iteration (and we plan on running ~quarterly to keep up to date on content), and B) soon we'll also release the PBB code on some public open source site so that anyone can tweak and tune. Unfortunately, I think we're going to have to table your suggestions above for now so that we can focus a bit on breadth. Sound reasonable? Cheers, AndrewGNF 03:33, 5 November 2007 (UTC)
Actually, my main point is not even selection of PDB file for picture. We are missing the domain structure of the protein, although it is included in UniProt and probably could be extracted from there. See for example UniProt entry for BTK:
DOMAIN 3 133 131 PH.
DOMAIN 214 274 61 SH3.
DOMAIN 281 377 97 SH2.
DOMAIN 402 655 254 Protein kinase
ZN_FING 135 171 37 Btk-type.
This information would be very important to have in a WP article if at all possible.Biophys 18:00, 5 November 2007 (UTC)
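Extracting the domain structure from records like the BTK ones quoted above might look something like this. The sketch parses the lines exactly as pasted in this thread (key, start, end, length, name); the actual UniProt flat-file layout differs, and the function names are invented for the example.

```python
BTK_FEATURES = """\
DOMAIN 3 133 131 PH.
DOMAIN 214 274 61 SH3.
DOMAIN 281 377 97 SH2.
DOMAIN 402 655 254 Protein kinase
ZN_FING 135 171 37 Btk-type.
"""


def parse_feature_lines(text):
    """Parse lines of the form 'KEY START END LENGTH NAME' as pasted above."""
    features = []
    for line in text.strip().splitlines():
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue  # skip anything that does not have all five fields
        key, start, end, length, name = parts
        start, end, length = int(start), int(end), int(length)
        # Sanity check: the stated length should match the residue range.
        if end - start + 1 != length:
            raise ValueError(f"inconsistent length in line: {line!r}")
        features.append(
            {"key": key, "start": start, "end": end, "name": name.rstrip(".")}
        )
    return features


def domain_summary(features):
    """Format only the DOMAIN records as 'NAME domain START-END' strings."""
    return [
        f"{f['name']} domain {f['start']}-{f['end']}"
        for f in features
        if f["key"] == "DOMAIN"
    ]


for row in domain_summary(parse_feature_lines(BTK_FEATURES)):
    print(row)
```

The resulting strings could go into the protein box or into a "Domain structure" section of the article text.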
Ahh, I see your point. Yes, protein domain information is definitely useful and something I'd definitely like to get in there as well. We'll look into getting that data imported. AndrewGNF 18:26, 5 November 2007 (UTC)
Note that protein families are defined only for individual domains (not for polypeptide chains or genes) - see PFAM or SCOP. That is one of reasons the domains are so important.Biophys 18:51, 5 November 2007 (UTC)
Also note that the legend of the image in the BTK article is not quite correct. It should say: "structure of SH3 domain of Bruton's tyrosine kinase" (a reader might think: "Wow, such a small protein does so many things!"). The image of its PH domain, also present as another PDB file, is missing for the reasons discussed above. Sorry for criticism - this is just to let you know.Biophys 19:03, 5 November 2007 (UTC)
Good point. I've added a proposed change to the ideas page that hopefully will fix that problem. AndrewGNF 00:19, 6 November 2007 (UTC)

Just to chime in here (excuse the pun), it would be great if the bot could pick the best pdb file, but how could it do this without some consistent tags for it to read? This seems like something that will have to be done by hand after the bot has run through. Personally I don't envisage these pages as anywhere near a final product but more as honey for researchers to come and add info. There is no doubt they will be adding their favourite pdb file, whether the one the bot chooses is full length or not. However, if Andrew can train the bot to be selective for the full length version then that would be a benefit. I just don't see it as critical for these seed articles. David D. (Talk) 03:55, 6 November 2007 (UTC)

Most important thing is to identify domains based on DOMAIN records of UniProt and include this information as a part of Protein Box or in the body of article. So far I did not see anyone in WP who would replace an existing image of a PDB structure with his structure of the same or lower quality. So that is hardly a problem.Biophys 04:14, 6 November 2007 (UTC)
Isn't that info linked to? Maybe I'm not sure what you want. Do you want all domains listed under the pdb files in the infobox? Or possibly listed in the article? How accurate is uniprot? Usually when you see the domain guesstimates of various algorithms there are significant differences. David D. (Talk) 04:29, 6 November 2007 (UTC)
Probably the best way would be to make a small "domain box" with the following information taken from UniProt for BTK (for example):
DOMAIN: Residues:
PH domain 3-133
SH3 domain 214-274
SH2 domain 281-377
How accurate is UniProt? The assignment of domains is basically the same as in Pfam and SMART. Yes, it is sufficiently accurate based on my personal experience (and I am sure there are relevant publications). It is just as good as manual assignment of domains in SCOP. I am not talking about ridiculous domains in DALI or some poor quality computational tools.Biophys 04:50, 6 November 2007 (UTC)
The only thing I was thinking is that the info may get dated quickly which is why a link might be preferable. The only other potential issue is how big is this infobox going to get? Maybe a prose version like you just did above, as a section of text is best for this information? David D. (Talk) 04:56, 6 November 2007 (UTC)
Domain structure does not change with time. Any way of representation would be fine. One just has to realize that domain structure is one of the first things a reader would like to know about a protein. It is the domains (not polypeptide chains) that are the "units" of protein evolution and function. Hence they are also classification units of Pfam and SCOP.Biophys 15:30, 6 November 2007 (UTC)
Good points, I was thinking secondary structure when i wrote that, not domains, duh. Personally, I would favor a section in the text devoted to this information. David D. (Talk) 16:03, 6 November 2007 (UTC)
There are 726 domains in SMART [2], 3506 human protein families/domains in Pfam [3] [4], but only a few domains in WP (see Category:Protein domains). I guess we may need a new bot to harvest SMART and PFAM domains to WP.Biophys 16:20, 6 November 2007 (UTC)

Pseudogenes

Wow, it is already going after pseudogenes - see HCG4P11. Should not they be filtered out and perhaps included into a separate List of human pseudogenes?Biophys 17:43, 4 November 2007 (UTC) Please also see my comments here [5]. If those improvements can be done, it is better to make them now. I think that at least link to HPRD database in Protein Box can be done easily.Biophys 17:50, 4 November 2007 (UTC)

Ooops. It was definitely not our intention to go after pseudogenes. In case you're wondering, the order so far has been determined by the number of linked citations in Entrez Gene. (The top of the list are all the classics -- p53, VEGF, TNF, EGFR, etc...) I think this particular case is related to some server hiccups we've been having over the weekend. In fact, we've had to temporarily stop automated edits while we look into it. Anyway, to sum up, pseudogenes are generally not high on the priority list... AndrewGNF 03:39, 5 November 2007 (UTC)

Could you generate articles for related genes (for example all caspases - CASP1, CASP2 and so on) at the same time, rather than making only one (e.g. CASP7) article? That would facilitate bringing everything in order in parallel with the bot. Otherwise, one would better wait until the bot is done with all 30 thousand genes.Biophys 00:38, 5 November 2007 (UTC)

Yeah, thought about that too. Unfortunately, I'm not sure how we could do it systematically. I've done it for a couple cases for collaborators -- for example, the TLRs and all of the TRP channels are done. If there are particular protein families that you're interested, let me know gene IDs and we can bump them up. (Note User:ProteinBoxBot#Requests.) Or, we can create the stand-alone gene pages now and brush up on the links to the family page later when the last family member arrives. Hopefully start to end won't be too long... But gene order is one of the easy things to change, so if you've got better ideas on how to order genes systematically, feel free to make a suggestion. AndrewGNF 03:45, 5 November 2007 (UTC)
Now I realize that you order genes by the number of linked citations in Entrez Gene, which is reasonable, although not the most convenient way for someone who would like to "digest" the bot's articles by making wikilinks, categories, and so on. How long would it take to process the entire genome? If not too long, there is no good reason to rush and manipulate the order. I think it could be a reasonable approach to generate all the articles ASAP and then to "digest" them all together.Biophys 17:08, 5 November 2007 (UTC)
Well, personally I hope to be over 3000 by the end of the month, but that largely depends on having willing volunteers to help with the merging. Except for a brief hiccup this weekend with server issues, I hope we can continue to expand the links at User:ProteinBoxBot/PBB_Log_Index pretty quickly... AndrewGNF 18:41, 5 November 2007 (UTC)
Around nine months for the entire genome? Not that fast. The mergings could be done any time later. Each of the articles actually needs wikification. But Entrez Gene abstracts in WP articles will not be updatable after the wikification... Do you know how frequently those abstracts are updated by NCBI annotators in Entrez Gene?Biophys 20:17, 5 November 2007 (UTC)
Geez, usually I throw around 3000 as a number and people are impressed... Tough crowd... ;) Really we could probably go much faster than that. In terms of automated time, it takes about one hour of running time to do a batch of 25. So at full throttle, everything could be done in a couple months. Two considerations. First, we're ramping up (relatively) slowly, just to be sure we can detect and correct issues early. And second, I actually think we'll stop short of the 25k-30k genes in the human genome. Even though it's a real gene, I doubt it will be useful at this point to have a page devoted to C5orf23, for example. My guess is that we'll stop at somewhere between 3k and 10k gene pages, but we'll just play it by ear.
Not sure how often NCBI updates those summaries, but I'd guess pretty infrequently. I don't think they have an army of annotators at their disposal, and if they did, I'd hope they'd be focusing on getting a summary for every gene before going back and revising the existing ones... I'd be in favor of just disabling the update once the wikilinks are added. I think wikilinks are more valuable than incremental revisions from NCBI. AndrewGNF 21:13, 5 November 2007 (UTC)
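The "couple months" estimate can be checked with some quick arithmetic. The sketch below uses the one-hour-per-25-genes figure quoted above and assumes, optimistically, that the bot runs around the clock with no downtime; hours_per_day is the knob to change for a more realistic schedule.

```python
def days_to_finish(n_genes, batch_size=25, hours_per_batch=1.0, hours_per_day=24):
    """Estimate calendar days needed to process n_genes in fixed-size batches."""
    batches = -(-n_genes // batch_size)  # ceiling division
    return batches * hours_per_batch / hours_per_day


for n in (3_000, 10_000, 25_000, 30_000):
    print(f"{n:>6} genes: {days_to_finish(n):5.1f} days")
```

At full throttle this gives about 5 days for 3,000 genes and roughly 42 to 50 days for 25,000 to 30,000 genes, which is consistent with "a couple months" once pauses and checks are allowed for.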
Of course, the huge amount of data is an important factor. That is why I suggest to collect and include in WP articles more information through your bot (especially links to HPRD and the content of the following UniProt fields: PFAM, "Protein name", "Synonyms", FUNCTION, DOMAIN, SUBCELLULAR LOCATION, CATALYTIC ACTIVITY, COFACTOR, SUBUNIT, and WEB RESOURCE). This information can facilitate all future work of people in WP with articles generated by your bot, and that is main consideration. After taking care of that, one could proceed with the maximal speed. You are right about C5orf23. WP:Notability applies to proteins and genes just as to people: some publications about them are required.Biophys 23:15, 5 November 2007 (UTC)
Yeah, I agree, all those would be useful. But if I'm reading your comment correctly, I disagree that that information is a prerequisite for making a functional stub. As it stands now, all that information is one click away through the provided Uniprot link (and at worst, two links away through the Entrez Gene / Ensembl link). It's not as if that information is unavailable, just less convenient than one would hope. Anyway, I point this out just because I don't want to hold back from completing version 1.0 of this project (or version 0.1, depending on perspective). The two primary motivations for my impatience are a desire to get the free-text contributions from the community flowing as soon as possible, and getting our master's student through to a publishable unit of work and gainful employment. AndrewGNF 00:29, 6 November 2007 (UTC)
Sure, there is no reason to slow down the ongoing project. I just thought that including a link to HPRD and picking up a few more fields from UniProt is not a big deal, but it would be beneficial for the WP community. So, this is entirely up to you and others. I can only suggest something.Biophys 04:35, 6 November 2007 (UTC)

Any bugs?

The bot did not pick up the PDB image for MUC1 for some reason... It also did not identify the UniProt link for Peptidylprolyl isomerase A. This is supposed to be P62937: [6].Biophys 23:34, 5 November 2007 (UTC)

The image thing looks like a bug -- I've asked JonSDSUGrad to check it out and report back here. The Uniprot issue looks to be just an artifact of working off of downloaded snapshots of NCBI data (rather than screenscraping their site). (And working from a local mirror is necessary since we also have to organize all the relationships from so many different sources.) Right now, our local mirror is a couple of months old. In the future, we hope to have things a bit more streamlined so we can reduce the lag time between the primary data updates and what PBB sees. AndrewGNF 00:14, 6 November 2007 (UTC)
I'm looking into this now - it is strange to me that things did not work correctly, But I will figure out the reason and then correct it if I can. JonSDSUGrad 00:56, 6 November 2007 (UTC)
OK. I think I know what happened in this case - it was not so much a bug, but more like an error on my part. I think I had aborted an update batch in the middle, after the image had already been uploaded in the aborted instance. PBB was then unable to find that it had previously uploaded the image (it already existed) because it was unable to find the created page - I have accounted for this small bug in the code and hopefully something like this will not happen again. It was more of a fluke really, and should only have affected that one log of proteins. If something like this happens again, it is likely that the image does exist on the wikiserver and you just need to manually write the name of the image file in (I use a standard naming scheme). If there is any trouble I can also just run pbb for the affected pages again and that should clear them up. JonSDSUGrad 01:28, 6 November 2007 (UTC)
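One way to guard against this kind of failure is to make the upload step depend on whether the file itself exists, not on whether the article page was created. The sketch below is purely illustrative: the naming scheme in image_filename and the existing_files/upload interface are hypothetical stand-ins, since PBB's real code and naming scheme are not shown in this thread.

```python
def image_filename(symbol):
    # Hypothetical naming scheme; the bot's real scheme is not given here.
    return f"PBB_Protein_{symbol}_image.jpg"


def ensure_image(symbol, existing_files, upload):
    """Return the image name, uploading only if the file is not already there.

    existing_files: set of file names already on the server (assumed input).
    upload: callable that takes a file name and performs the upload.
    """
    name = image_filename(symbol)
    if name not in existing_files:
        upload(name)
        existing_files.add(name)
    return name


# Simulate an aborted batch followed by a rerun for the same gene.
server_files = set()
uploaded = []
ensure_image("MUC1", server_files, uploaded.append)  # first (aborted) run
ensure_image("MUC1", server_files, uploaded.append)  # rerun finds the file
print(uploaded)  # ['PBB_Protein_MUC1_image.jpg']
```

Because the check is keyed on the file name, rerunning the bot for an affected page is safe and still yields the name needed for the protein box.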
The image was not loaded in PLCG1, TIMP1, and PARP1.Biophys 01:42, 6 November 2007 (UTC)
Fixed... AndrewGNF 02:18, 6 November 2007 (UTC)
MAP4K1 does not have any references, although they are present in the Entrez Gene entry. I am not sure how your bot selects references.Biophys 15:10, 6 November 2007 (UTC) It looks like you have a cutoff for the number of references. It would be good to increase this cutoff to 10.Biophys 15:16, 6 November 2007 (UTC)
There was quite a debate over how to do references during the bot approval process. I think we settled on linking review articles only that were linked at NCBI. I don't think we set an arbitrary upper limit in terms of number. Check out Apolipoprotein_E#Further_reading. AndrewGNF 17:22, 6 November 2007 (UTC)
Does it mean that MAP4K1 lacks any references because none of the references in NCBI was a review article? Then you should allow incorporation of at least two non-review references in each created article. Otherwise, anyone can mark articles like MAP4K1 for deletion per WP:Notability. At least one (but better two) references must be present. I would definitely include all references to articles published in Nature, Science, PNAS and Cell as most reliable sources per WP:Source and presumably most highly cited. Biophys 17:43, 6 November 2007 (UTC)
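A possible fallback rule along these lines: keep the review articles, and if fewer than two are found, top up with non-review references, trying the preferred journals first. This is a simplified sketch of the suggestion, not what the bot does; the citation dict fields and the journal strings are assumptions (real PubMed journal abbreviations differ, for instance for PNAS).

```python
# Journal labels are assumptions for illustration only.
PREFERRED_JOURNALS = {"Nature", "Science", "PNAS", "Cell"}


def pick_references(citations, min_refs=2):
    """Pick all reviews, topping up with non-reviews to reach min_refs.

    citations: list of dicts with 'pmid', 'journal' and 'is_review' keys.
    """
    chosen = [c for c in citations if c["is_review"]]
    if len(chosen) < min_refs:
        others = [c for c in citations if not c["is_review"]]
        # Stable sort: preferred journals first, original order otherwise.
        others.sort(key=lambda c: c["journal"] not in PREFERRED_JOURNALS)
        chosen.extend(others[: min_refs - len(chosen)])
    return chosen


# Placeholder citations (the PMIDs are not real identifiers).
cites = [
    {"pmid": 1, "journal": "J Biol Chem", "is_review": False},
    {"pmid": 2, "journal": "Nature", "is_review": False},
    {"pmid": 3, "journal": "Cell", "is_review": False},
]
print([c["pmid"] for c in pick_references(cites)])  # [2, 3]
```

A gene with no linked reviews would then still get two references, which addresses the notability concern without changing the reviews-only default.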
Good point. Added it to the PBB/Ideas page. If I'm reading the policy correctly though, we could successfully fight deletion because the topic is notable (evidenced by the existence of references), even if the article itself is not (yet) properly referenced. AndrewGNF 17:58, 6 November 2007 (UTC)
Perhaps. But I would recommend to include automatically the references rather than to risk AfD discussions, which is not fun and might negatively affect the entire project. I am an "inclusionist" and do not mind even having articles about pseudogenes, but most people here think otherwise.Biophys 22:48, 6 November 2007 (UTC)
Point well taken. We'll work on it... AndrewGNF 17:56, 7 November 2007 (UTC)
It seems that the bot sometimes takes an incorrect Uniprot code, as in the case of PIAS4 (I fixed it). Maybe this is worth checking.Biophys 05:03, 10 November 2007 (UTC)
Hmmm, just took a look and I'm not sure. PBB linked mouse uniprot to Q3UGQ2, which does link to mouse Pias4. You changed it to Q8N2W9, which looks like one of the human versions. PBB linked human uniprot to Q05DS6, which also appears to be reasonable (duplicate entry in Uniprot?). Can you sanity check me here and let me know if I'm missing something? AndrewGNF 05:24, 10 November 2007 (UTC)
Sorry, that is my mistake. I placed code in a wrong place. Q8N2W9 and Q05DS6 are two different human proteins (compare the lengths) but products of the same gene? We definitely need a bot for that kind of work.Biophys 06:29, 11 November 2007 (UTC)
PPP2R2A did not pick up the UniProt code in the Box. It is supposed to be P63151. PPP1CB - same thing. Biophys 16:14, 13 November 2007 (UTC)
Just checked, and that association is correctly loaded in our next release of the data. Feel free to either manually update it now, or just wait until the next update. Just as proof that keeping links up to date is an incredibly hard problem, note that Uniprot links P63151 to ENSG00000147459, which is clearly not correct. (Should have been ENSG00000214122.) Anyway, you can be sure that any portal will have some small percentage of cross-references incorrect at any given time, and ours is no exception... (The beauty of WP, of course, is that we can override the automatically-computed data with our own human intelligence, which is not possible at any other gene portal). AndrewGNF 16:56, 13 November 2007 (UTC)
ND1 appears in the Category:Genes on chromosome MT but ND6 and many other proteins do not.Biophys 18:30, 15 November 2007 (UTC)
ND6 does for me. Might have been that the cache is still refreshing on some of these genes? The easiest way to force a refresh is to edit and save a "blank" edit. AndrewGNF 19:20, 15 November 2007 (UTC)
When you pick Category:Genes on chromosome MT, do you see ND6 there? I do not.Biophys 19:44, 15 November 2007 (UTC)
Ahh, I see what you mean now. You're right, that seems like odd behavior to me. Thankfully though, I'm reasonably confident that this is not a PBB bug... ;) AndrewGNF 20:25, 15 November 2007 (UTC)
This is probably related to last update. All articles that do not appear in "Chromosome ..." category have not been recently updated judging from a small set of their journal references. BTW, the articles categorized by the bot do not have normal "Category:Genes on chromosome MT" record near the bottom. So, I am not even sure where the categories came from.Biophys 02:25, 16 November 2007 (UTC)
I think this addition should actually be independent of the PBB doing an update. Banus previously described how to cleverly add the category to the ortholog template, which I did here. This should retroactively add the category to all genes with the PBB templates. Still not sure what the difference is then on why ND1 and ND6 have different behavior... AndrewGNF —Preceding comment was added at 02:34, 16 November 2007 (UTC)
Yes, this was automatically fixed after update. So, one should simply wait until all other articles are updated.Biophys 03:43, 16 November 2007 (UTC)
yes, and it can be any update, not just a PBB update. If you want to force any particular page, just edit and save. This may be obvious to you and others, but just to be sure... AndrewGNF 03:50, 16 November 2007 (UTC)
The bot generated a very strange category: Category:Genes on chromosome c6 COX - see pages that belong there. I guess this should be simply chromosome 6.Biophys (talk) 19:10, 20 November 2007 (UTC)
Strange indeed. We'll take a look at it. Thanks for reporting it... AndrewGNF (talk) 20:39, 20 November 2007 (UTC)

A suggestion

Can your bot download images of all PDB files in the Protein Box to the Wikimedia? Then one can easily replace the image in a protein box or include several images (e.g. different domains or different quaternary complexes) whenever necessary.Biophys 02:43, 6 November 2007 (UTC)

If we do that we should include a gallery at the bottom of each page so that the options available in Wikimedia are clear to the users who come after the bot. David D. (Talk) 04:00, 6 November 2007 (UTC)
We might use a small icon for each PDB structure and allow some brief legend for a group of icons, such as "NMR models of PH domain of 'protein X': 1abv, 2dcf, 4fds". To explain why having many images is so important, please compare these two structures, [7] and [8], of "the same" protein (one of the articles recently created by the bot). Biophys 04:26, 6 November 2007 (UTC)
Something like that would be useful for any potential user who wanted to expand the article. David D. (Talk) 04:52, 6 November 2007 (UTC)
Right, it is exactly what I am talking about. If the relevant images were already in Wikimedia (and we do not need anything else!), a wikipedian who wants to improve the articles (like me) could easily select the pictures of all domains and quaternary complexes that are needed for each protein and include them in a series of articles. But this user does not have enough time to download many images himself to Wikimedia. That is why we need the bot. Of course, one could download all 15,000 protein images from the PDB to Wikimedia, and that would work even better, or one could make a special "protein image download tool" for that purpose.Biophys 05:04, 6 November 2007 (UTC)
I like this idea a lot. In fact, I just added it as the top priority on the PBB/Ideas page. However, this task might actually be well suited to a second bot. It's a nice discrete unit of work that really doesn't need to be integrated into PBB (and yet PBB could benefit from the fruits of the second bot's labor). Delegating this to a second bot would also help counteract "bloat" in the PBB functionality, and bloat often dooms programs as far as maintainability and future extensibility. So, anyone have access to some enterprising young undergraduate or graduate student? David D.? Or should we try posting at Wikipedia:Bot_requests? AndrewGNF 17:33, 6 November 2007 (UTC)
I don't know one off hand I'm afraid. David D. (Talk) 19:38, 6 November 2007 (UTC)

Titles of protein images

The titles of protein images should be taken as titles of the corresponding PDB files, not as names of proteins. Otherwise, almost all titles are misleading, since the structures do not represent the protein but an isolated domain of the protein or a complex with another protein.Biophys 05:46, 7 November 2007 (UTC)

If we follow the same logic as the old protein box, the title refers to the entire box, and so should be the name of the protein. I think it would be better to use the source field as a "caption" like in the old template, where you can put "XY domain of protein Z (source:PDB link)" or "Z protein complexed with X" and so on (see here for an example). But I find the bold text in the current template a little too strong for a long caption... the best would be a plain font and small size. Hopefully, this is not a big change, if we decide to do it.--Banus 09:04, 7 November 2007 (UTC)
I definitely agree with Banus that the title refers to the entire box, not to the image. The image caption appears below the PDB image, and by default says something like "PDB rendering based on 2acm". (As suggested before, I have it on the to-do list to change that caption so that it comes directly from the PDB entry title.) Accordingly, I'm going to undo your previous edit of MUC1, okay? Banus, not sure what you're referring to by the "bold text in the current template", but if you have a change you'd like to make, go for it. AndrewGNF 17:05, 7 November 2007 (UTC)
OK, that is the title for the whole protein. But the protein is 1250 residues long, whereas a domain (the image) is only 50-100 residues. Then we need a separate title for the picture, exactly as you said. Otherwise, the image will mislead a reader. The box includes the following text below the picture: "Image source: PDB rendering based on 2acm". Would it be OK to replace this text with a title for the picture that includes the name of the PDB file (2acm)? "Image source: PDB rendering" is not informative anyway. 18:20, 7 November 2007 (UTC) —Preceding unsigned comment added by Biophys (talk • contribs)
Yup, replace that text however you see fit. (Note though that if you turn off the infobox updating, you won't get updates as far as GO categories, genome locations on new assemblies, and additions to PBB. Just food for thought -- I still trust humans to make the appropriate choice for their gene(s) of interest.) If the changes are something generic that we can codify, then we can modify the bot to do it for all bots. Of course, the best way to up the priority for feature changes is to create a prototype we can reference, so let me know if you do (or just add a link next to the entry on User:ProteinBoxBot/Ideas). AndrewGNF 19:19, 7 November 2007 (UTC)
So, the bot would overwrite the image legend during the next update? Then, one might wish to make the protein image(s) and their legend independent of the rest of the box to allow manual modifications and updates. Biophys 20:10, 7 November 2007 (UTC)
Yes, as it stands now, any changes in the infobox will get overwritten on the next run. While we could start thinking about better ways of handling the infobox content (having a specific update_image tag in PBB_Controls, for example), I think some of the other suggestions you raised have a better cost/benefit. There are two possible intermediate work-arounds: 1) you could keep the update flag as "yes", and when the PBB update stomps your edits on its infrequent updates, you could revert or merge those changes as appropriate; 2) you could add additional images outside of the infobox (or perhaps a new "protein structure" infobox that goes along the bottom of the page?) and PBB won't touch those. AndrewGNF 21:03, 7 November 2007 (UTC)

Categories

I suggest using a new "Category: Human proteins" instead of simply "Proteins" in the generated articles, and also "Category: Gene from chromosome N". Could you do that? Biophys 18:32, 7 November 2007 (UTC)

Yet another good idea. Added it to the ideas page... AndrewGNF 19:19, 7 November 2007 (UTC)
Just saw that you were proposing a new category for human proteins, which presumably would be a subcategory of Category:Proteins. Is there a place where we get consensus on a change like that? I can imagine the powers that be wanting to put some reality checks on new categories (though in this case I'd support the new addition). AndrewGNF 01:45, 8 November 2007 (UTC)
This is trivial. I just created Category:Proteins by species and Category:Human proteins. If people do not like it, this can be changed.Biophys 02:05, 8 November 2007 (UTC)
Super, I just changed the template to use that category. I assume it will start populating the new category when some cache expires... AndrewGNF 02:11, 8 November 2007 (UTC)
Adding genes to Category:Genes on chromosome N is also fairly simple: copying from Template:Protein, you can add <includeonly>{{#if: {{{Hs_GenLoc_chr|}}} | [[Category: Genes on chromosome {{{Hs_GenLoc_chr}}}]] }}</includeonly> to Template:GNF Ortholog box. Barring strange behaviors in nested templates, the trick should work. The categories "Genes on chromosome N" already exist and will be populated.--Banus 21:55, 14 November 2007 (UTC)
Brilliant, I like it... Done... AndrewGNF 22:19, 14 November 2007 (UTC)
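
For readers less familiar with parser functions, the conditional in Banus's snippet can be sketched in Python. This is only an illustration of the logic, not code used by the template or by PBB: the function name chromosome_category, the validate option and the VALID_CHROMOSOMES set are assumptions introduced here. The optional check shows one way a malformed value such as "c6 COX" could be kept out of a category name.

```python
# Hypothetical illustration of the template's #if logic; not part of PBB.
# Assumed set of acceptable values: autosomes 1-22, X, Y and MT.
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def chromosome_category(hs_genloc_chr, validate=False):
    """Return category wikitext for a gene, or "" when no chromosome is set.

    Mirrors {{#if: {{{Hs_GenLoc_chr|}}} | [[Category: ...]] }}: an empty or
    whitespace-only value produces no category at all.
    """
    value = (hs_genloc_chr or "").strip()
    if not value:
        return ""
    # Optional guard (an assumption, not in the original template).
    if validate and value not in VALID_CHROMOSOMES:
        return ""
    return f"[[Category: Genes on chromosome {value}]]"
```

Without the optional check the value is passed through unchanged, which matches what the template does.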

Scope of the project

So, how many human genes have at least a couple of references in Entrez Gene? One or two thousand? If I understand correctly, you are not going to create articles about any other genes? Biophys 00:41, 8 November 2007 (UTC)

According to the Oct 30 2007 release of the gene2pubmed file, there are 25694 unique human entrez genes. Of those, 20078 have at least 2 linked pubmeds, 17305 have at least 3 linked pubmeds, and 15380 have at least 4 linked pubmeds. Note though that many of those linked pubmeds are huge genome-wide studies. Take, for example, PMID 12477932 and PMID 15489334 which are associated to 18009 and 11328 human genes, respectively. So I don't know that the # of linked pubmeds is the right thing to draw a threshold on. Personally, I don't think there's a need to draw a strict and hard threshold. We'll stop when they stop being uniformly useful. From there, we'll add only on the basis of specific requests (or a consensus that we should keep going). AndrewGNF 01:51, 8 November 2007 (UTC)
Thank you for the explanation. References to genome-wide studies would not qualify. Biophys 06:04, 8 November 2007 (UTC)
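
The kind of tally discussed above can be sketched as follows. This is a minimal sketch rather than the script actually used: it assumes the gene2pubmed file is tab-separated with the columns tax_id, GeneID and PubMed_ID and a header line starting with "#", and the function name and the max_genes_per_pmid cutoff for genome-wide papers are hypothetical choices made for illustration.

```python
from collections import defaultdict

HUMAN_TAX_ID = "9606"  # NCBI taxonomy ID for Homo sapiens


def count_genes_by_pubmed_links(lines, thresholds=(2, 3, 4),
                                max_genes_per_pmid=None):
    """Count human genes that have at least N linked PubMed IDs.

    `lines` yields tab-separated rows: tax_id, GeneID, PubMed_ID.
    If `max_genes_per_pmid` is given, any PMID linked to more human genes
    than that is treated as a genome-wide study and ignored (assumed rule).
    """
    gene_to_pmids = defaultdict(set)
    pmid_to_genes = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        tax_id, gene_id, pmid = line.rstrip("\n").split("\t")[:3]
        if tax_id != HUMAN_TAX_ID:
            continue
        gene_to_pmids[gene_id].add(pmid)
        pmid_to_genes[pmid].add(gene_id)

    broad = set()
    if max_genes_per_pmid is not None:
        broad = {p for p, genes in pmid_to_genes.items()
                 if len(genes) > max_genes_per_pmid}

    counts = {"total": len(gene_to_pmids)}
    for n in thresholds:
        counts[n] = sum(1 for pmids in gene_to_pmids.values()
                        if len(pmids - broad) >= n)
    return counts
```

Passing an open file handle for the downloaded gene2pubmed file as `lines` would give the per-threshold counts; setting max_genes_per_pmid shows how much those counts depend on the genome-wide papers.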

Gene ontology

It seems that you just included Gene ontology in the Protein Box. This is something which requires a third opinion and community discussion. It makes the Protein box very big. If I understand correctly, Gene ontology is a low-quality computer-generated annotation, unlike curated Entrez Gene abstracts and UniProt records. I would not recommend this as a part of the Protein Box, but rather as additional links or another box at the bottom of the article. Biophys 06:04, 8 November 2007 (UTC)

Yes, I'm open to further discussion of this topic, but please remember that this bot has undergone extensive "community discussion" in many forums prior to approval, including on this talk page, in WP:MCB/Proposals, and on the bot approval page. Our primary goal right now is to execute the consensus plan that was decided on during those discussions. Gene Ontology functional annotations come from a variety of sources. Check out the Entrez Gene page for BTK and scroll down to the Gene Ontology section. To the right side of the table, each annotation is provided with a three-letter code that indicates the source of the annotation. The three letter codes can be decoded here. I believe one-click access (through the Entrez Gene link) is the appropriate level of accessibility to this information (integrating it directly would clutter and obfuscate). AndrewGNF 16:56, 8 November 2007 (UTC)
Sorry, I was not clear. I think that "hidden" link to Gene Ontology is great (as it is in TOP2B right now - with small "show" icon to the right). I just did not like when the Gene Ontology menu was always open for some reason yesterday, which made the protein box too big.Biophys 18:25, 8 November 2007 (UTC)

Interaction partners

This is just something to think about. Entrez Gene has a well-defined Table of interaction partners for each human protein - with supporting references. All these partners are other human proteins, whose articles will be included in WP as a result of this project. Hence each interaction partner has (or will have) an internal link in WP. Therefore, WP is uniquely equipped to deal with protein-protein interaction networks. It would make a lot of sense to create a Table of human protein interaction partners in each article about a human protein - automatically by this bot. That would also be encyclopedic since each partner/interaction is supported by a reliable source. For example,

Interaction partners of CUL1:
SKP2 (reference), (comment),
RAC2 (reference), (comment),
RBX1 (reference) and so on.Biophys 15:16, 9 November 2007 (UTC)
Another excellent idea... And I agree, the synergy between protein interactions and wikilinking make this very attractive... AndrewGNF 16:53, 9 November 2007 (UTC)

Better names for gene articles

This 'bot is creating lots of strange articles that will truly mess up searches. Most people who search for "sumo" won't be looking for SUMO2. I would prefer that these articles' titles were prefixed with the string "Human Gene"; e.g. I would have the bot create a Human Gene SUMO2 article.--Mumia-w-18 02:25, 10 November 2007 (UTC)

[9] Search seems fine to me... AndrewGNF 02:30, 10 November 2007 (UTC)

I also notice that the 'bot is not providing references for its data. All of this data looks like something I would find in something other than an encyclopedia. Perhaps there is a place in the Commons for all of this?--Mumia-w-18 02:29, 10 November 2007 (UTC)

What in specific are you referring to? I think most biologists would think that these stubs are good starting places for encyclopedia articles. Check out relevant discussions here and here. AndrewGNF 02:33, 10 November 2007 (UTC)
I read those pages, and I'm not impressed by the lack of discussion about who is going to maintain and protect those pages. My concerns are in the section titled "ProteinBoxBot objections."--Mumia-w-18 04:26, 10 November 2007 (UTC)

ProteinBoxBot objections

These articles use jargon and are unreferenced. They are also presumably going to be 10,000+ strong. As a Wikipedia editor, you know how important it is to provide references because you want other editors to be able to verify your articles' claims. And in order to protect your articles from being vandalized, other editors need the references. And when I say "other editors," I mean other editors who are not biologists. So you would never imagine writing a robot that would create thousands of unreferenced articles--right?

As things stand, you've created probably hundreds of articles that only 1% of Wikipedians can edit for improvements and protect from vandalism. And please don't tell me that you intend to watch 10,000+ articles--I wouldn't believe it. You would have to create a 'bot to protect these articles, and that 'bot would freeze the articles in their original states--preventing other editors from correcting mistakes in them--which goes against the purpose of Wikipedia.

I'm not about to dispute the value of this information (yet--technically, I can dispute all of it, because you've not provided any references). I recognize the value of genetics information to our society; however, these entries need to either be truly encyclopedic articles, or they need to go elsewhere. Please host them yourself, or find some interested party willing to do it for you. Perhaps Wikisource is willing to accept something like this.

And calling them stubs is evading responsibility. You're asking thousands of other editors to bring your article-stubs up to article quality--even though only a very tiny fraction of them have the requisite knowledge to do so. Anyway, they shouldn't be forced to do what your 'bot should do from the start. Your 'bot has access to the data; it should specify where it got its data from in the many articles it creates (by providing proper references and citations). If that proves impossible to program, Wikipedia is not the right place for this data.--Mumia-w-18 04:23, 10 November 2007 (UTC)

Thanks for putting your concerns out in detail, and I understand you're looking out for the best interest of Wikipedia. However, please understand that this effort is not trying to "fly under the radar", creating thousands of these pages without anyone noticing. On the contrary, it's been discussed and planned for many months now, with discussion and contributions and input from dozens of other wikipedians. Between the discussion on WP:MCB and the bot approval group, I'd guess that hundreds of wikipedians are aware of this effort, many of whom are very experienced editors. You are the first to raise such strong (and strongly worded) objections. My suggestion? Perhaps you would be willing to draw the attention of other experienced Wikipedians you feel might be an impartial third party. Those third parties can help us decide if anything needs to be adjusted. Does that sound like a plan? I get the sense that neither of us is going to be convinced simply by the words of the other alone. Involving others is democratic, in the spirit of Wikipedia... AndrewGNF 04:59, 10 November 2007 (UTC)
I am sure these stubs belong to Wikipedia, but Mumia-w-18 has some valid points. First, all articles should be referenced. Second, they should have at least a couple of content phrases to qualify as a stub (there is nothing when Entrez Gene abstract is missing). So, I would suggest to include ASAP at least a couple of references in each article by the bot (from Entrez Gene list) and include a couple of UniProt fields when Entrez Gene abstract is missing (see my comments above). Overall, I believe this bot is great. We need more such bots to create biology articles.Biophys 05:25, 10 November 2007 (UTC)
Please no more bots. I wish there were a flat-out ban on article-creating bots. And with all due respect, I don't think these "stubs" are really stubs. A stub is a budding article which is meant to be fixed up by humans and made into a truly encyclopedic article (not just data). No one that I know of is fixing these "stubs," so they are not stubs--just non-encyclopedic data. We shouldn't use stub tags on non-encyclopedic things in order to get them into Wikipedia. Rhetorical Question: How many of the bot-created articles will have been upgraded to article quality by November 2008?--Mumia-w-18 09:14, 10 November 2007 (UTC)
RE: "No one that I know of is fixing these 'stubs'". I respectfully disagree. There is still a lot of work to be done, but I and several others have made a start on a few of these stubs (see for example PPARA, PPARG, and RXRA). Cheers. Boghog2 09:32, 10 November 2007 (UTC)
If people are fixing the stubs then they are stubs, so one of my arguments fails. Now please take down my other argument.--Mumia-w-18
The previous discussion on this topic was made at the village pump, see archive. The community came to a consensus that only genes meeting the notability criteria of being the topic of several scientific publications should be created, and that these articles must contain reliable sources. This bot implements that decision. Tim Vickers 06:14, 10 November 2007 (UTC)
Thanks. I'll read the pump article soon.--Mumia-w-18 09:14, 10 November 2007 (UTC)
I've fleshed-out more of my concerns about these articles here: Article Half Life.--Mumia-w-18 09:26, 10 November 2007 (UTC)
I am sorry, but this "Half life" argument is your personal essay. Could you please refer to any official WP policy which is violated here? This is something really new: argue about creation of new articles based on future vandalism concerns. In my experience, vandalism rate for biology stubs is quite low.Biophys 17:53, 10 November 2007 (UTC)

Unreferenced articles are being created and apparently in large numbers

Coming across these articles in new pages patrol, I considered blocking the bot while we discuss the fact that as Tim Vickers noted at the end of the previous section, consensus was that "these articles must contain reliable sources". The bot is not implementing that decision and is creating many unsourced articles. I do not know exactly how many as I would need a bot to check them all. Some of the articles created have further reading sections. Being referred to an outside source for further study is not the same as providing a reference to a source verifying the information in the article.

A quick perusal of the last 50 articles created reveals 8 articles with only "further reading" sections: SLC9A3; SLC37A4; SLC15A1; RUVBL1; PPP5C; RAPGEF3; CD83 and NONO; and 8 articles with not even putative references: SAA2; ARAF; ETV4; DLG3; RCC1; CUL4A; PIAS4; and PPP1CC. That's sixteen out of 50 articles or 32%.

Considering the possibility that the bot may do a second pass for whatever reason, so that looking at the most recent fifty might be deceptive, I went back to earlier articles created and the first one I clicked on at random, CRY1, was unreferenced. I don't know if this is a malfunction or a failure to adhere strictly to the criteria established, but it should stop, and the articles created without references should be remedied by the bot or programmed to be tagged with {{db-author}}. I refrained from blocking since this is done in small batches, so there should be an opportunity for discussion before massive numbers of new articles are created.

A second issue of concern is that, though there is no prohibition on locally uploading free license images, our policy pages strongly encourage uploading free images to the Commons. I don't think it's a good idea to have this bot uploading thousands of images locally. I understand, of course, that there is likely a whole separate bot approval process at the Commons, but that doesn't mean that we should forego what is proper because more hoops must be jumped through.--Fuhghettaboutit 15:07, 10 November 2007 (UTC)

I have just noted that some of the infoboxes contain links to sources. That's better than nothing but is still no substitute for a transparent, flagged source in the article proper.--Fuhghettaboutit 15:22, 10 November 2007 (UTC)

Unreferenced articles are a malfunction. I note several of the articles in the "October 2nd dry run log" that created CRY1 don't have references. I could add these now (CRY1 has over 300 papers written about it!) but I'd prefer for us to work out what was going on first. Tim Vickers 18:03, 10 November 2007 (UTC)

I agree with Tim. Besides, I have seen a lot of WP articles without any sources. The Protein Box bot could simply tag such articles as "unsourced" for further improvement by people and by the next generation of the bot. There is no serious reason for stopping the bot. Biophys 18:09, 10 November 2007 (UTC)
Please also see WP:Source. It says: "A reliable source is a published work regarded as trustworthy or authoritative in relation to the subject at hand." Hence the links to established scientific databases qualify as references to reliable sources.Biophys 18:18, 10 November 2007 (UTC)
First, I'm not sure what you are agreeing to but you don't seem to be echoing what Tim said. The fact that you have seen a lot of unsourced articles on Wikipedia is the problem. In fact it's a cancer on Wikipedia, and is no proper rationale for allowing a process to create more. Please see WP:OTHERSTUFFEXISTS. Second, your citation to what constitutes a reliable source is inapposite to whether that source is referenced. I am not disputing the reliability of any sources. The problem at issue is that there are a large percentage of these articles that are not showing any source as a reference if my statistical sampling of the last 50 created was at all reliable. If what you are looking for is a statement in policy that placing a link in an infobox is analogous to creating a references section with citations populated, you won't find it and for good reason. I also agree with Tim—this should be worked out first.--Fuhghettaboutit 18:52, 10 November 2007 (UTC)
I agree with Fuhghettaboutit. The 'bot should be disabled while it's being fixed. And while the sources for CRY1 may be reliable, there are still no references. Nonetheless, even if I consider the links to be adequate replacements for references, I still cannot get the data. Please click on the CRY1 link in this article.--Mumia-w-18 18:38, 10 November 2007 (UTC)
I should clarify that I get this response from Mozilla Firefox when I click that link: "Firefox can't establish a connection to the server at www.gene.ucl.ac.uk."--Mumia-w-18 18:40, 10 November 2007 (UTC)
I think this is just a malfunction though, the references exist but for some reason they weren't added to the stubs by the bot. Tim Vickers 18:33, 10 November 2007 (UTC)
I wonder if that is the reason why it malfunctioned? Does the bot have external dependencies? Tim Vickers 18:44, 10 November 2007 (UTC)
It's not a malfunction per se. We, the wikipedia community, decided (on one of those discussion pages) that we would limit references to review articles only. Other users raised concerns about too many references listed... Anyway, the bot was approved with that protocol in mind, but since multiple experienced editors are chiming in here, we'd be open to discussing this spec again. New proposal: link all review articles, and if less than 10, add primary articles up to 10. Sound good? If anyone has an issue with this revision, please raise it by Monday. Changing code and specifications at this point is non-trivial, so I'd like to minimize them before the first-pass run is complete. AndrewGNF 18:50, 10 November 2007 (UTC)
[10] HGNC link fixed. AndrewGNF 18:54, 10 November 2007 (UTC)
If I understand you, you're talking about listing ten separate sources now? I'm not overly concerned that there be a wealth of sources when the information is one sentence. The main issue is that a source or sources be placed in a references section and thus are clearly a reference for the text in the article, rather than a link in the infobox. At best, this should be an inline citation, which provides the markup for future human editors who often don't know how they are created and have no idea where to go to find out.--Fuhghettaboutit 19:08, 10 November 2007 (UTC)

I think the software needs a longer prototyping period and some more eyes looking at the code. The bot's code should not be made public because of potential abuse. The bot doesn't need to create normal articles in the article namespace for right now. I would guess that it needs to be prototyped for another month in another namespace (which one?)

Programmers with molecular biology skills should help AndrewGNF with the ProteinBoxBot.

Meanwhile, more people need time to discuss the benefits and drawbacks of the idea of having 10,000 bot-created article-stubs on Wikipedia.--Mumia-w-18 19:18, 10 November 2007 (UTC)

I think we have the following problem: (a) there is a huge number of notable biological subjects that should be included in WP as independent articles (almost all human proteins are notable per WP:Notability), and (b) there are too few wikipedians who can do it. That is why we need such bots. Biophys 00:48, 11 November 2007 (UTC)
Do you have a list of drawbacks? David D. (Talk) 09:10, 11 November 2007 (UTC)
I asked an opinion of my friend who is very skeptical about everything. He said: "This work does not make any sense because the articles created by the bot are not better than the original biological database(s). In fact, the original databases are better and more useful because they provide links to the protein interaction partners and other information missing in the created stubs". Of course, we are not going to compete here with professional biological databases, and we work to create a very different educational resource... However we do not really know if this project will really work in WP environment. If few people work with created stubs, this project might be of little significance. Let's try and see.Biophys 05:29, 12 November 2007 (UTC)
Maybe your friend does not understand the goal. These are seed articles. There is no intention to add more than the databases themselves already have. Once these articles are in place they will start to be linked to from other articles. Other users will add to the articles and some will grow (maybe many will grow). Researchers will come and improve their favourite genes. I suspect this would never happen without the seed. Worst case scenario is they all sit there unedited for the next ten years. If so, no real problem. I doubt this will be the case and in fact we might see a massive influx of editors due to this initiative. David D. (Talk) 05:37, 12 November 2007 (UTC)

Incidentally, there's still the problem of uploading all the graphics locally instead of using the Commons. Please get bot approval from the Commons to mass-upload there so that these images will be as useful as possible. —Remember the dot (talk) 23:47, 13 November 2007 (UTC)

Whew! Who knew so many people were watching this thread! After Fuhghettaboutit pointed it out, I agree that the commons is definitely something that we should look at. I've accordingly added it to our ideas page for version 2. However, since this is not an official policy (only a recommendation), and since the bot was approved with the images clearly hosted at WP, I am reluctant to further delay implementing the approved "Version 1" plan. Remember the dot, if this is reasonable to you, please log your support below. (If not, please come back when we're drafting specs for version 2... ;) ) AndrewGNF 23:58, 13 November 2007 (UTC)

Revision of the code

No problems with too many references! An editor who knows a particular protein can easily delete excessive references. Yes, I think inclusion of 10 references, as AndrewGNF suggested, resolves this issue and will greatly facilitate the future development of the articles. Are the references taken from Entrez Gene? There are two Tables with references there: (1) "GeneRIFs: Gene References Into Function" (they are already ordered by the publication date) and (2) References in the "Interactions" table. So, if there are not enough references in the first Table, they can be taken from the second. Isn't this a good moment to include protein "interaction partners" with references as I suggested above? Bringing in the content of at least FUNCTION, but possibly also other fields from UniProt, would also help to make these articles real "stubs". Biophys 19:36, 10 November 2007 (UTC)

There seem to be a lot of issues with references included in this data. It is not hard for me to have PBB include a few more references with each entry, there just isn't any guarantee that they will be the best references. Previously in the bot approval discussion (sorry I don't have the link handy), there was a member who adamantly opposed adding in even the review article references - he thought there were too many references at that point. So, I have already modified the bot's code to include a few more references when it is not able to find any review articles (or very few). Basically I just need to know what kind of limits we want - I was thinking limiting review articles to the most recent 20, and then if we got less than 10 articles, we could supplement with other types of articles to make the total about 10 (there is a possibility that exactly 10 will not occur, but will usually be close to that number, and the minimum would be 5, but those 5 would all be review articles). I just need to know what kind of limits are acceptable, and that is more up to the wiki-community than me. Also, is this an acceptable alternative? Remember that there is no problem with legitimacy of the data that we are posting, that has all been researched out by the scientific community and is accurate to the extent that such a community can provide. JonSDSUGrad 23:51, 10 November 2007 (UTC)
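
The limits proposed above (at most the 20 most recent review articles, topped up with other article types when fewer than 10 references result) can be sketched as below. This is a rough illustration under stated assumptions, not PBB's actual code: the Article record, its fields and the function select_references are hypothetical names, and sorting by publication year is an assumed stand-in for "most recent".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """Hypothetical record for one PubMed article linked to a gene."""
    pmid: str
    year: int
    is_review: bool


def select_references(articles, max_reviews=20, target_total=10):
    """Pick references: recent reviews first, then primary articles as filler.

    Keeps up to `max_reviews` of the most recent reviews. If that leaves
    fewer than `target_total` references, the most recent non-review
    articles are appended until the target is reached or none remain.
    """
    newest_first = sorted(articles, key=lambda a: a.year, reverse=True)
    selected = [a for a in newest_first if a.is_review][:max_reviews]
    if len(selected) < target_total:
        primary = [a for a in newest_first if not a.is_review]
        selected.extend(primary[:target_total - len(selected)])
    return selected
```

A gene with only a handful of linked articles simply ends up with fewer than the target, which is why the total is "about 10" rather than exactly 10.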

These articles are basically stubs, which are supposed to be modified later. The main consideration here is convenience for a reader and for the Wikipedians who will improve these articles later. I have already worked with 10-15 articles generated by the bot. My comments can be found at this talk page. The following important information generated by the bot would facilitate further work with these articles (in the order of significance): (1) domain structure of each protein (taken from UniProt with link to SMART if feasible); (2) Table of protein interaction partners (taken from Entrez Gene, see above); (3) several fields from UniProt (see above); (4) bigger list of references. As a potential future developer of these articles, I do not see any problems with an "excessive" number of references. Your suggestion is reasonable ("limiting review articles to the most recent 20, and then if we got less than 10 articles, we could supplement with other types of articles to make the total about 10"). As I said above, some references can be taken from the Table of protein interaction partners in Entrez Gene. Since your reference format includes a link to the Abstract, a developer of the article can look at them all and delete anything that is excessive (but usually nothing will be excessive). A good idea would be to include all references to articles from the Nature series, Science, Cell and PNAS for obvious reasons. Biophys 00:37, 11 November 2007 (UTC)
Biophys, we've talked about many of your suggestions above and I absolutely agree that they're valuable. However, I'm going to strongly advise that we table your suggestions until the second run of PBB. Undoubtedly everyone has got a slightly different vision of what PBB should do, and often those visions conflict with each other. When we proposed and discussed this project with WP:MCB and the bot approval group, we tried to synthesize everyone's input and come up with a consensus set of specs. Those specs were then approved by both the WP:MCB and BAG, and now we are running PBB under those approved specs. To everyone, if you agree with the principal goals of PBB, I'm asking you to hold off on adding to the discussion here. (Add to the ideas page if you like.) Let's focus the discussion here on what it will take to satisfy the people who are fundamentally opposed to PBB as it is running now and to avoid any deletion issues. I promise, after the first round is complete, we'll seek feedback from the community (which is obviously more aware of PBB now) and open the floor up again for enhanced and revised specs. (I'm sure all of Biophys' ideas will be highly prioritized then). Right now, I just want to be sure we finish round one in a timely manner. AndrewGNF 07:14, 11 November 2007 (UTC)
Adding review articles first and then the ten most recent papers deals with any possible notability concerns. I support that idea. Have we worked out why some articles (such as CRY1) lacked a reference section? There are reviews on that gene - was this a malfunction? Tim Vickers 13:57, 11 November 2007 (UTC)
I should have also mentioned that we are also constrained by Pubmed articles linked in Entrez Gene (presumably by NCBI curators). For example, CRY1 has an Entrez Gene ID of 1407, and you can find its linked Pubmed IDs here (no reviews). Unfortunately, doing free-text searches of Pubmed is too imprecise (given conflicting synonyms/aliases and common words, e.g., CLOCK, KIT). AndrewGNF 15:15, 11 November 2007 (UTC)
No problem, just do whatever can be done right now. No free searches of PubMed, please. The references should be taken only from curated lists, such as those in Entrez Gene or UniProt. Biophys 15:59, 11 November 2007 (UTC)
Yes, I strongly agree with the principal goals of PBB. Biophys 18:42, 11 November 2007 (UTC)

If no reviews are located for a citation, I would suggest adding a link to the NCBI PubMed search rather than randomly picking some of the research papers. Who knows what the first ten are about, and they may give a distorted perspective with regard to gene function. For example, most of the recent Cry1 papers are to do with schizophrenia and bipolar-type disorders rather than its role in entraining the circadian rhythm. By having a link to all the research articles, an author or reader gets a better picture of the gene's role. If ten papers are listed, an author or reader may not think to look at more papers. David D. (Talk) 16:14, 11 November 2007 (UTC)

Yes, it would be helpful to provide a link to the results of a PubMed search, like here [11]. Let's do that as well. But the whole discussion here was about providing normal references, as there are supposed to be in all WP articles. In the case of CRY1 there are only three references in the main reference table:
1. Cry1 may be a candidate gene of schizophrenia. The proposition may have new clues on the development of genetic study on complex diseases
2. Linkage disequilibrium analyses using single SNPs and haplotypes showed no association to bipolar disease.
3. the CLOCK(NPAS2)/BMAL1 complex is post-translationally regulated by cry1 and cry2

All of them are relevant. "A candidate gene of schizophrenia" is important. "CLOCK complex" is important as well. There are also 15 references in the "interaction partners" table. They are also relevant. These are not random references; they have been selected by annotators or depositors. This way we avoid using random references. Therefore, JonSDSUGrad's suggestion is reasonable, I think. Biophys 16:47, 11 November 2007 (UTC)

I'm not suggesting they are not all relevant; I'm just noting that if the bot just takes the first five or ten, we do not necessarily have full coverage of the gene's function. Maybe the link to the NCBI PubMed search should be in all the articles, even those with reviews? David D. (Talk) 16:52, 11 November 2007 (UTC)
I agree, that would be good. Biophys 18:39, 11 November 2007 (UTC)
I made this change, which adds a link to all Pubmed references in Entrez Gene to the Ortholog box. (Example, ARAF.) Obviously I was looking for a way to retroactively add the link to all previously-created pages without re-running PBB, but if this isn't adequate, we can certainly go back and redo those pages later if the consensus is to have those links in the main text. AndrewGNF 22:25, 11 November 2007 (UTC)
That was a very good and helpful change. If all newly created articles also include several standard references (as suggested by JonSDSUGrad), this should resolve all concerns raised by Fuhghettaboutit and Mumia-w-18. I personally do not think that you must re-run anything, although if this can be done automatically to update only the references, why not do it? Biophys 04:52, 12 November 2007 (UTC)

Proposal and request for support

Okay, it looks like the discussion is settling down a bit. It also looks like we've converged on a proposal that will satisfy WP standards. First, we have changed the infobox template to add links to the full Pubmed searches. And second, we will modify PBB so that it will supplement the list of all review articles with primary research articles, up to 20 total references.

For future reference, can I ask that people who have been following this thread note their support that this proposal reflects a consensus of Wikipedia editors at this time? (I've only created a "Support" section below. If you disagree, please highlight your remaining concerns in the discussions above.) Thanks all for the input and suggestions... AndrewGNF 03:51, 13 November 2007 (UTC)

Support

  1. AndrewGNF 03:51, 13 November 2007 (UTC)
  2. I support this bot and hope to see other similar bots.Biophys 15:26, 13 November 2007 (UTC)
  3. Good job in trying to accommodate everyone's concerns; it's never easy to please everyone. You have come very close. David D. (Talk) 15:58, 13 November 2007 (UTC)
  4. I agree with User:David D. You've been incredibly patient. And I fully support this bot. Forluvoft 16:55, 13 November 2007 (UTC)
  5. Well, of course I support this bot. :) JonSDSUGrad 17:22, 13 November 2007 (UTC)
  6. I haven't followed the entire discussion above, but I like the idea of taking references from curated lists, I don't like unreferenced articles, and I look forward to a progress report at such time as the unreferenced articles have been fixed. I hope that the bot people will continue their valuable work, but I acknowledge that Fuhghettaboutit's issues have not been addressed in exact language. EdJohnston 21:11, 13 November 2007 (UTC)
  7. Bot has a useful function and adds verifiable information. Tim Vickers 18:00, 14 November 2007 (UTC)
  8. I support this bot. I have some relatively minor reservations about its present implementation, but you have been very accommodating to suggestions and I am sure we can come to an agreement which will be acceptable to most editors. Boghog2 21:20, 14 November 2007 (UTC)

Oppose
Following the clarification below, I must oppose. We can spend time dickering about whether non-direct sources in infoboxes and further reading sections meet the mandate of WP:V, which states "The source should be cited clearly and precisely to enable readers to find the text that supports the article content in question." I don't think they do, and each would arguably be properly tagged with {{unreferenced}}, but that discussion is really beside the point. The issue is what we should be guaranteeing is in place from a bot creating 10,000 stubs, managed by a host of experienced users who know or should know that sourcing is such a problem on Wikipedia, and that inline citations are the preferred method of sourcing and required for a top-flight article. Stubs are the germs of future articles and people take their cue in expansion from what they find in place when first coming across an article.

Mechanically, a person stumbling across an article who finds directly cited text via a references section understands intuitively that sources are or likely are required, and will be more prone to follow suit with their own cited text. While a general references section is a basic sourcing requirement, which isn't even provided by these infobox/further reading "citations," inline citations are vastly better. Having them in place teaches future editors how they are done by example, and if those editors get that sources are required and are willing, they are also more likely to source using that device rather than another. Moreover, if a person already knows about inline citations, he or she won't be stymied by having to navigate to a page describing how to place them when the markup is already provided.

If any of these articles are going to become great articles, inline citations will eventually have to be placed. Why should we leave it to future human editors to add the same markup that a bot can faithfully place in each article once programmed, right now? I have yet to be told what exactly is the problem with doing so. This is an encyclopedia. It is by definition a tertiary source. Ensuring that the addition of a large number of new articles are in a form that fosters future use of sources is not a side note but of fundamental concern.--Fuhghettaboutit 01:43, 14 November 2007 (UTC)

Why should we leave it to future human editors to add the same markup that a bot can faithfully place in each article once programmed, right now? Because the bot cannot faithfully place them right now. Human intelligence is required to take the references and properly align them with relevant statements.
Ensuring that the addition of a large number of new articles are in a form that fosters future use of sources is not a side note but of fundamental concern. Agreed, but I'd much rather (in the short term) have scholarly and unreferenced contributions from newbies, than no new contributions at all. Remember that there is a very active (and friendly!) WP:MCB community that will ease newbies into the process, including the process of adding citations. So the downside of proceeding as-is is that lots of newbie education will be required, but that is a responsibility that WP:MCB I think is willing to take. The downside of not proceeding is that we'll never see those newbies because they can't find a stub that they'd want to edit. (and realistically speaking, few newbies "ease" into the process by creating new articles, especially in the life sciences.)
Finally, you haven't mentioned whether you think my prototype (diff) addresses your concerns on inline citations. Again, while I think this change is neutral at best, it's a change we'd be willing to make to gain consensus. AndrewGNF 02:10, 14 November 2007 (UTC)
I also realized I could make these changes which might also address your concerns. What do you think? If none of these changes address your concerns, then I'm afraid I must not understand what you're looking for. Additional details and/or a prototype would be helpful. AndrewGNF 06:56, 14 November 2007 (UTC)
Right. There are inline citations. So, you fixed the problem.Biophys 15:27, 14 November 2007 (UTC)
I hope so. But I'd really like to get one more tick in the "Support" column (and one less in the "oppose")...  ;) AndrewGNF 15:55, 14 November 2007 (UTC)
Regarding your first comment: The text is not being generated in a vacuum; it's either written by a human, or taken by the bot from somewhere. Wherever it's being taken from is what should be cited as a reference. What am I missing?
Second comment: That's great. However, I don't believe it will ever work like that in practice. 10,000 stubs speaks for itself. We recently celebrated 2,000,000 "articles" when 80% were unreliable placeholders for sourced content. Despite good intentions, you're adding in some measure to that mess with a 34% unsourced rate if my initial findings remain true. What radical framework do you have in place to monitor these 10,000 articles and ensure once material is added, the newbie is contacted right away, they are tutored and their additions changed to conform with policy? This also ignores the "germ" metaphor I was speaking of; it is much better if new users start with sourcing because it's in their face when they arrive. It's always radically more efficient to measure twice, cut once.
Third: Your prototype is a sourced stub, with inline citations, What could I possibly say but that it's exactly what I'm looking for!? (are we talking past each other?). What baffles me is why no one will address head on and in detail what the problem is. Of the total fifty articles I looked at for my original post, what is different about the sixteen articles which the bot didn't (couldn't?) place references in, when it could and did provide references in the balance of the thirty four articles remaining? On the one hand you say: "Human intelligence is required to take the references and properly align them with relevant statements.", while at the same time the bot created 50 articles with 68% of them sourced.--Fuhghettaboutit 00:58, 15 November 2007 (UTC)
What baffles me is why no one will address head on and in detail what the problem is. Well, let's first settle on what the problem is and then we can address it. I think we have been two boats passing in the fog, but we're getting close here (as evidenced by the apparent common ground we're reaching on point #3)...
But first, let's reach a stipulation of the facts. For the purposes of this discussion, gene stubs generally fall into one of three categories: 1) having an Entrez Gene summary and references in "Further reading" (e.g., MUC1), 2) having no Entrez Gene summary but having references in "Further reading" (e.g., SLC9A3), and 3) having neither an Entrez Gene summary nor references (e.g., SAA2). You asked ... what is different about the sixteen articles which the bot didn't (couldn't?) place references in ..., and your guess was correct: PBB cannot place the Entrez Gene summary (nor the reference to it) because it doesn't exist for that gene. Sound good to this point?
Class 1 has both inline citations and references, so nobody has any problems with those, right? So I'm assuming we're talking about class 2 and class 3 stubs only now. It sounds like the expanded number of references in the Further reading section isn't addressing your concerns, but that the changes in User:AndrewGNF/Sandbox/ARAF are. Can you clarify exactly which footnotes/references you think sufficiently address your concern? Some of us (myself included) think that not all of these changes actually improve the article. In particular, I think all the references to footnote 2 are highly redundant, since all that is stated much more succinctly and clearly in the infobox. Would adding footnotes 1 and 3 satisfy your concern?
Finally, regarding your question What radical framework do you have in place to monitor these 10,000 articles... Let me ask you, what radical framework is there to monitor the 2 million other articles in wikipedia? Pretty much none, right? It just works. Not perfectly, but in general. We're making the same leap of faith here. (And remember, newbies don't stay newbies. They learn things like adding references, and then they teach other newbies.) AndrewGNF 02:35, 15 November 2007 (UTC)

Okay. Let's talk about the nature of a "reference". A source listed *somewhere* in an article is not a reference. A reference tells the reader "this source was used for adding this text, directly." It also teaches the astute reader, by its existence in that direct manner, that anything they add should follow suit. There is nothing about a link (a source) listed in a further reading section or in an infobox that tells people it is a "reference." Show me an article with 100 incredible links in the infobox and thirty fine sources in a further reading section, but no:[1]

References

  1. ^ this is a reference!

Or no

References

  • Some book citation

And it deserves an {{unreferenced}}. It may be verifiable through those other infobox/further reading links, but you haven't flagged anything to tell people that it is verified through them. This is exactly why WP:CITE states (I quoted it more extensively in the comment section below) "All items used as sources in the article must be listed in the "References" or "Notes" section, and are usually not included in "Further reading" or "External links"." So let's be clear. If there isn't an actual "references" section, marked as such, in the body of the article, you aren't referencing, or at best, you're referencing by weak implication rather than directly. When you call a link in an infobox a "reference," I tilt my head in a puzzled manner. Let me also say this as clearly as possible: I am and have been unconcerned by the quality of the sources you have listed. I trust you, as experts, that they are reliable, and am, in any event, incompetent in this area to judge. I am concerned only with the manner of their application so that they function as references and not as something else.--Fuhghettaboutit 03:01, 15 November 2007 (UTC)

It sounds like the expanded number of references in the Further reading section isn't addressing your concerns, but that the changes in User:AndrewGNF/Sandbox/ARAF are. Can you clarify exactly which footnotes/references you think sufficiently address your concern? Some of us (myself included) think that not all of these changes actually improve the article. In particular, I think all the references to footnote 2 are highly redundant, since all that is stated much more succinctly and clearly in the infobox. Would adding footnotes 1 and 3 satisfy your concern?
Copied verbatim from above. I think if you answer these questions, we'll have a mutually satisfactory answer. Unless I hear from you differently, I'm going to assume that the first reference to Entrez Gene only (as Tim suggested below) will be sufficient. I think we're all tired of spinning our wheels on this one... AndrewGNF 07:16, 15 November 2007 (UTC)
I'm really not sure how I can be any clearer. As I said above regarding User:AndrewGNF/Sandbox/ARAF, which you denominated a prototype, "Your prototype is a sourced stub, with inline citations, What could I possibly say but that it's exactly what I'm looking for!? (are we talking past each other?)." Any article with a references section and at least one inline citation such as appears in User:AndrewGNF/Sandbox/ARAF satisfies me in spades, and I don't care which source is used so long as it's used as a reference by flagging it as a reference by its placement in a references section, denominated as such. Footnote 1, 2 or 3 together or solo are all fine and also completely irrelevant to the issue. What you haven't said is what this prototype is. Again, all you need to do to satisfy me is assure me that every article will have a section denominated "references" with something in it, I don't care what. So 1) Will this prototype be implemented in all articles, or 2) will some have no reference section and just links in an infobox and a further reading section as 16 out of 50 did. If the former, I'm happy, if the latter, I'm opposed. I'm smiling as I write this: I'm starting to feel like a character in a Kafka story where everyone bends over backwards to address a question I asked, which I appreciate, but they somehow keep attempting to answer a completely different question than what was put to them.--Fuhghettaboutit 12:51, 15 November 2007 (UTC)
In an attempt at even more clarity as we indeed seem to be, as you put it, ships passing in the fog, looking at the last three recently created articles by the bot: AURKB and OLR1 are good, they both have a references section; ITPR1 is not good, it has no references section. It has 21 sources listed in a further reading section. You could have 1,000 sources listed there; none of them are functioning as references because they don't tell a reader they are references (by the way, looks like overkill to me to list so many further reading sources. I would think one or two would be enough, but that's not the issue here).--Fuhghettaboutit 13:20, 15 November 2007 (UTC)
Fantastic. Let me officially draw this discussion to a close then. All future runs of PBB will minimally look like ITPR1 with the single inline reference. (Past runs will require some manual edits to add that, since I just realized that the header line is the one piece of PBB content that we never had any intention of updating.) Congratulations (and apologies) to any who made it to the end of this thread, and I pity anyone who comes to this talk page and tries to make sense of all our interleaved comments! AndrewGNF 15:12, 15 November 2007 (UTC)
Fantastic right back at you! Why it took so long to get to that simple statement will always be a mystery to me.--Fuhghettaboutit 18:40, 15 November 2007 (UTC)

Comment

  • There has been much discussion following my original post regarding quantity, quality and type of references. This is commendable, but I have yet to have anyone spell out that the changes discussed will result in each of these articles containing a direct reference for their text to a ==references== section and a populated {{Reflist}} (and not to additional, better, more numerous "further reading" or infobox citations). Likewise, I have yet to see anyone explain that the lack of such sections was indeed a malfunction, why the malfunction occurred, and how it has been fixed. These issues were and remain my chief concerns. If that is exactly what you are proposing and the malfunction has been addressed, please tell me I'm being dense, and I will add my support above.--Fuhghettaboutit 19:31, 13 November 2007 (UTC)
I replied in section "Verifiability issues" above. The core Wikipedia:Verifiability requirement has been satisfied in articles generated by the bot. The exact citation style ("in line") is only a matter of convenience for a reader, but not a requirement, if I understand this correctly. If you think that any specific statement in any of these articles is wrong, you are welcome to correct it or comment at the article talk page, as you would normally do for any other WP article. Please note that the articles' content has been generated by humans. The bot is only a tool like Microsoft Word. Biophys 21:15, 13 November 2007 (UTC)
In addition, the summary text added by the bot is imported directly from the curated, public domain database EntrezGene and the source of this text is cited. Therefore the entirety of the text is verifiable since it is a direct quote of a reliable source. Tim Vickers 18:02, 14 November 2007 (UTC)
I have never addressed any comments to whether any text was verifiable. Whether it is verified by links in an infobox and further reading section was at issue. That it is not is fairly easily made out. Consensus to go ahead was qualified by the condition precedent that all articles created be sourced.

WP:V: "The source should be cited clearly and precisely to enable readers to find the text that supports the article content in question."
WP:CITE: "Articles can be supported with references in two ways: the provision of general references – books or other sources that support a significant amount of the material in the article – and inline citations, that is, references within the text, which provide source information for specific statements.... All items used as sources in the article must be listed in the "References" or "Notes" section, and are usually not included in "Further reading" or "External links"."

Again, I'm more concerned with proper results than parsing policy.--Fuhghettaboutit 00:58, 15 November 2007 (UTC)
The text is verifiable and verified by an in-line citation to the database from which the text is quoted. That citation is the reference that verifies the text. Other papers are cited in the further reading section to give a broad outline of current research and as a resource to further expand the article. Tim Vickers 01:52, 15 November 2007 (UTC)
This is a truly baffling response. All I have ever asked for is that. If that's true, what has changed? All anyone has ever had to say is "we have changed the bot so that every article will have at least one inline citation" to satisfy me. Indeed, though I'd be less happy with it, I'd accept "the bot has been changed so that all the articles with just links in the infobox, or just a further reading section, have been changed to have a general references section" (as opposed to inline citations). This all started because I found sixteen articles (out of the last fifty created at that time), eight of which had nothing but infoboxes with some links, and eight of which had further reading sections, and none of which had any other references whatever. No one has stated this has been addressed.--Fuhghettaboutit 02:13, 15 November 2007 (UTC)
Okay, I sense some rising tension and frustration here. Can I suggest that we all keep a level head here and err on the side of an overly happy and collaborative tone? We're converging on a unanimous consensus here, so frustration should be coming down, not rising.
I think we've already proposed your lesser acceptable suggestion of creating a "general references section" for all stubs. (See the first paragraph of this section.) This will be created under the "Further reading" section heading, and will be populated by review articles and primary research articles. In the past, it was only populated with review articles, and in some cases, there were none to choose from. Moving forward, there will always be at least one article in the "Further reading" section, and if there isn't, feel free to speedily delete the stub.
I also think we're getting close to your ideal solution on at least one inline reference. Please see my response above. AndrewGNF 02:47, 15 November 2007 (UTC)
Adding the Entrez reference as default seems a good solution, then all the text in all the stubs will have at least one in-line reference. Tim Vickers 03:38, 15 November 2007 (UTC)

Moving forward

Well, haven't yet heard back from Fuhghettaboutit, but I'm going to hope that the two more recent signers in the support column are enough to keep moving forward. We've started PBB back up on creating pages. Of course we're still open to further discussion on how to fine tune things, but sounds like the effort as a whole is still sound. So, while we sort out the details, we're moving forward with page creation, both in a fully automated manner (when there are no namespace conflicts) and in a semi-automated manner (with our small but effective team of volunteers). Implementing changes later will be pretty easy -- the hard part is getting the page in the right location in the first place. AndrewGNF 21:48, 14 November 2007 (UTC)

Is it intended that the bot removes links from the article it is editing? For example, it removed all links from CASP6 and added a bunch of references, most of them not really necessary in this article. — Tirkfltalk 08:39, 15 November 2007 (UTC)

The bot copies the text from Entrez Gene whenever it updates. When you change the text, you need to change the bot controls in order to preserve it. See here for an example. --Banus 12:16, 15 November 2007 (UTC)
Yeah, unfortunately, there is no easy way to preserve the links created in the summary on a bot over-write. While it may be possible to spend a few days building code to preserve links on an overwrite, that would be something for version 2 of the bot. I'm going to leave a note about the nature of such an algorithm here: It would be a complicated text parser that compared the current text to the new text - where the text was the same, but one had a wikilink, it would then add the wikilink to the new text. It could also log all the words that are currently wikilinked, and then wikilink the same words that appear in the new text. Definitely something we are not going to code up now (due to lack of time mainly), but that could be done in the future. Sorry for the current inconvenience. This shouldn't come up very often at all right now as we are just trying to generate/merge new pages and not update pages (though a large batch of pages must be redone at this point...) JonSDSUGrad 20:02, 15 November 2007 (UTC)
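As a note for whoever picks this up for version 2, the second idea above (log the phrases that are currently wikilinked, then link the same phrases in the new text) could look roughly like the Python sketch below. It is not PBB code: the function names are hypothetical, it assumes the incoming Entrez Gene summary is plain text with no wiki markup, and it links only the first occurrence of each phrase.

```python
import re

# Matches [[Target]] and [[Target|label]] wikilinks.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def collect_links(old_text):
    """Map each displayed phrase in old_text to its link target."""
    links = {}
    for match in WIKILINK.finditer(old_text):
        target = match.group(1)
        label = match.group(2) or target
        links.setdefault(label, target)
    return links


def reapply_links(old_text, new_text):
    """Re-link, in new_text, the phrases that were wikilinked in old_text."""
    links = collect_links(old_text)
    if not links:
        return new_text
    # Longest phrases first so "protein kinase" wins over "kinase".
    labels = sorted(links, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(l) for l in labels) + r")\b")
    seen = set()

    def relink(match):
        label = match.group(1)
        if label in seen:  # link only the first occurrence
            return label
        seen.add(label)
        target = links[label]
        return f"[[{label}]]" if target == label else f"[[{target}|{label}]]"

    return pattern.sub(relink, new_text)
```

A real implementation would also have to decide what to do with links whose phrase no longer appears in the new summary; this sketch simply drops them.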
And just to emphasize, the solution for now is to disable the summary updates completely (as shown in the link posted by Banus above). Entrez Gene doesn't update those summaries often/substantially, so I'd much rather have even one human-added wikilink than the absolute latest and greatest summary from Entrez Gene. I try to look at all the diffs for PBB updates (or watching for wikilinks in the summary) so I can disable the flag, but I missed this one. If anyone else notices either of these cases in genes they watch, please feel free to change the PBB flag accordingly... AndrewGNF 20:13, 15 November 2007 (UTC)
You could reduce the problem by comparing the Entrez summary text to the log of the last run and only updating the page if the Entrez database has changed. Tim Vickers 20:05, 15 November 2007 (UTC)
Yeah, we thought a bit about solutions like that. But that would require logging a history of Entrez Gene summaries and I think that might be a lot of work too. We can discuss options over V2 specs though. Personally, I favor the disabling of the summary updates as the long-term solution. After all, it's my great hope that eventually, these summaries from Entrez Gene will be obsolete, replaced by more current and more dynamic summaries written by wikipedians... AndrewGNF 20:17, 15 November 2007 (UTC)
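For the V2 discussion, one way to get the "only update if Entrez changed" behaviour without keeping a full history of summaries would be to log just a digest of each summary from the last run. The Python sketch below is an assumption about how that might be done, not a description of PBB: `summary_digest`, `needs_update`, and the dictionary used as the log are all hypothetical.

```python
import hashlib


def summary_digest(summary):
    """Digest of a summary, ignoring differences in whitespace only."""
    normalized = " ".join(summary.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_update(gene_id, summary, last_run_log):
    """Return True if the summary differs from the one seen on the last run.

    last_run_log maps an Entrez Gene ID to the digest recorded previously;
    it is updated in place whenever a change is detected.
    """
    digest = summary_digest(summary)
    if last_run_log.get(gene_id) == digest:
        return False
    last_run_log[gene_id] = digest
    return True
```

Note that a gene with no entry in the log is treated as changed, so the first run after adopting such a scheme would still overwrite summaries unless the log were seeded beforehand.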
This is not a problem if an editor knows that the summary update should be turned off by writing "no". What should be done ASAP is decoupling protein images (PDB structures) from the other content of the ProteinBox, because turning ProteinBox updates off is not advisable, and therefore one cannot correct the titles of protein structures (which are not informative at best) or replace protein images, as I noted above. Biophys 20:36, 15 November 2007 (UTC)
My apologies, I was wrong about this. Straight from JonSDSUGrad, "It does not overwrite those two fields - any changes will persist through an update. The only time that PBB updates those fields is when no information is present in them." That's the problem when PBB's mouth is decoupled from PBB's hands... ;) So, edit/update the PDB fields as you see fit while leaving the update flag as "yes". AndrewGNF 20:51, 15 November 2007 (UTC)
Thank you! This is really great, except that links to PDB files are not regularly updated (a lot of new structures are deposited in the PDB every month). An ideal solution would be to automatically update the list of PDB files in the ProteinBox but allow the user to select one or several of them for the image and write the title for the image(s) themselves. Of course, it would be fantastic to have an automated "downloader" of images from the PDB to Wikimedia, which is basically one of the bot's subroutines. I do not know if the present version of the bot allows a user to specify one or several PDB structures for the image(s). Biophys 22:48, 15 November 2007 (UTC)
Just to clarify, there are three parameters in the infobox that may be relevant. "image" indicates which image is actually displayed at the top of the infobox. "image_source" spells out the caption for the image. "PDB" lists all known linked PDB images and is shown by "Available structures" label in the rendered page. The first two (image and image_source) are skipped by PBB if there are already values there. The PDB field will be updated unless the update is turned off in the control template. So, newly submitted PDB structures will appear as additions on subsequent runs of PBB. AndrewGNF 23:03, 15 November 2007 (UTC)
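The rules just described for the three fields can be summarized in a short Python sketch. The field names ("image", "image_source", "PDB") come from the comment above; everything else, including the `merge_infobox` function and the assumption that a disabled update flag leaves the whole box untouched, is hypothetical and not taken from PBB's source.

```python
def merge_infobox(existing, fresh, update_enabled=True):
    """Combine the infobox values on the page with freshly generated ones.

    existing: field values currently on the page (possibly human-edited)
    fresh:    field values generated on this bot run
    """
    merged = dict(existing)
    if not update_enabled:
        # Assumption: with updates turned off, nothing in the box is changed.
        return merged
    # image and image_source are filled in only when they are empty.
    for field in ("image", "image_source"):
        if not merged.get(field, "").strip():
            merged[field] = fresh.get(field, "")
    # The PDB list is always refreshed so newly deposited structures appear.
    merged["PDB"] = fresh.get("PDB", merged.get("PDB", ""))
    return merged
```

In this reading, a hand-picked image and caption survive every run, while the list shown under "Available structures" follows whatever the latest run produces.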
Great! So everything is perfect. Easy downloading of additional PDB images is certainly a separate question which should not be addressed right now.Biophys 02:33, 16 November 2007 (UTC)

Source code

Will Jon's source code be placed in a public repository (e.g., SourceForge.net or something like it)? --Arcadian (talk) 01:06, 17 November 2007 (UTC)

Absolutely. I think he's a little bashful about it right now because of how many twists and turns I (we) threw at him -- it's gotta look like spaghetti code. But he's not getting his degree until it's posted somewhere... ;) AndrewGNF (talk) 03:16, 17 November 2007 (UTC)

Protein families and categories

It is really important to categorize the proteins as, say, "GPCR", "ABC transporters", etc. Your company database has InterPro family fields. Is there any way to include the InterPro links somewhere, which would significantly facilitate the categorization? Biophys (talk) 06:40, 17 November 2007 (UTC)

Yeah, we know InterPro is important, but it was a question of priorities. For whatever reason, during V1 spec determination, it didn't make it into the final specs. And just because it's in our database, doesn't mean it's simple to incorporate into PBB in a timely manner. I know it's not terribly satisfying, but all I can say is we'll get it in Version 2. AndrewGNF (talk) 19:28, 17 November 2007 (UTC)
I agree. There is nothing urgent here. Let's wait for version 2.0 and then make the categories.Biophys (talk) 21:06, 17 November 2007 (UTC)

"Image source" in caption of protein box

I realize you're probably getting tired of all these code tweaks, but I was editing Apolipoprotein H and came across a "feature". I liked the image that was in the protein box previously and incorporated it into the new protein box. However, when I also incorporated the caption it ended up saying: "Image source: Apolipoprotein H showing positively (blue) and negatively (red) charged regions." Forluvoft (talk) 18:48, 17 November 2007 (UTC)

I guess on the image's WikiMedia page, it lists the original PDB file. I can throw that in so it makes sense. Forluvoft (talk) 19:02, 17 November 2007 (UTC)
Well, I actually prefer keeping the original caption as well, especially when it gives more information than just the PDB code. Do you have a suggestion as to how to handle it? I actually don't think it's too confusing with the "Image source" tag in there, but we could also just remove the caption label altogether. (Easy change of the template.) AndrewGNF (talk) 19:33, 17 November 2007 (UTC)
I would vote for just getting rid of "Image source." It's probably not even necessary for a default PBB caption such as " PDB rendering based on 1cmo." But again, this isn't really a major issue. Forluvoft (talk) 20:36, 17 November 2007 (UTC)
I agree with Forluvoft. A title like "Image source: PDB rendering based on 1a25" in PRKCB1 does not make sense. One could simply make the title "Image of PDB entry 1a25". A good and correct title would be the one from the TITLE field of the PDB file, which is "C2 DOMAIN FROM PROTEIN KINASE C (BETA)". Indeed, this is not an image of Protein kinase C (as a reader would think from the titles generated by the bot); it is an image of the C2 domain. Biophys (talk) 20:59, 17 November 2007 (UTC)
Simple solution of changing the template is done. Look good for now? (Again, to refresh the cache for a gene page, hit edit then save...) Biophys, your idea to get the caption from the PDB title is already on the Ideas page... V2... AndrewGNF (talk) 21:13, 17 November 2007 (UTC)
Thanks!  :) Forluvoft (talk) 22:19, 17 November 2007 (UTC)

Silly question in a sea of relevant discussion

Why are the protein images JPGs? As someone who spends considerable time on images, and doing stuff like changing this to this, I'm ever so slightly disappointed at seeing low-resolution JPGs in these articles, although I do get the need for a small image footprint if a couple thousand of them are going to be uploaded. Fvasconcellos (t·c) 18:46, 20 November 2007 (UTC)

Not a silly question at all. We get our images straight from the PDB. For example, a thumbnail of the PDB entry 2bio can be found at http://www.rcsb.org/pdb/images/2bio_asym_r_500.jpg. Replace the "2bio" in the URL with any PDB ID and you will find a thumbnail, and it is that thumbnail that we use. Unfortunately, their image server is only configured to return JPGs. On another note, Image:ERbeta.jpg was actually not uploaded by PBB. When we find an existing image in a protein box, we are generally inclined to preserve that image. For examples of pages that use PBB-uploaded images, check out ESR1 and ESR2. Cheers, AndrewGNF (talk) 18:59, 20 November 2007 (UTC)
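As a small illustration of the URL substitution described above, here is a Python sketch. The URL pattern is taken from the comment as written in 2007 and may no longer be served; the function name `pdb_thumbnail_url`, the lower-casing of the ID, and the four-character check are assumptions made for the example.

```python
def pdb_thumbnail_url(pdb_id):
    """Build the thumbnail URL quoted above for a given PDB ID.

    Assumptions: IDs are four alphanumeric characters and the server
    expects them in lower case, as in the "2bio" example.
    """
    pdb_id = pdb_id.strip().lower()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError("expected a four-character PDB ID, got %r" % pdb_id)
    return "http://www.rcsb.org/pdb/images/%s_asym_r_500.jpg" % pdb_id


url = pdb_thumbnail_url("2bio")
```

The sketch only builds the address; it does not download anything or check that the image exists.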
That's what I suspected. (I do know Image:ERbeta.jpg was made by a person—bad example :) Thanks. Fvasconcellos (t·c) 16:13, 21 November 2007 (UTC)

Another silly question. I understand that Bot should be running under someone's guidance to update the existing pages. But can we run the Bot to create a lot of new articles quickly, where the updates are not required? I think we are ready for that.Biophys (talk) 18:22, 21 November 2007 (UTC)

Yes, the bot was running to update existing pages with the new reference specs. That has been completed at this time and we are moving back to creating new pages. I have a batch job in the works that will be creating/updating all the GPCRs that I will be starting very soon. After that, it is back to pulling jobs out of the master protein list. You can probably expect to start seeing updates within the next 12 hours. JonSDSUGrad (talk) 20:45, 21 November 2007 (UTC)
(and just to add to Jon's reply...) I think you're referring to the fact that PBB for a long while did not create any brand new pages, only updating existing PBB pages. This was due to the fact that PBB processed ~2000 genes before we stopped and backtracked to fix the references issue raised above. Over the past week or so, PBB has been running quite a bit, but first fixing those previously-created pages (and the wikicode contained in the logs). You'll be happy to know that the first brand new run was completed last night (log file here), and these eight genes were created: RAPSN PCM1 LTB PVRL2 PPP2R5C PLSCR1 SORT1 ARHGAP1.
And as a reminder, for every batch of 25 genes we process, only 6-10 genes typically can be created outright. The rest have namespace conflicts, and those require manual inspection to resolve (either merge with an existing gene page or create a brand new one). So while we have over 2000 genes processed by PBB, there are currently only ~1200 PBB pages currently in existence. If you or anyone would like to help volunteer to reduce that backlog, please see the PBB volunteer instructions. AndrewGNF (talk) 20:50, 21 November 2007 (UTC)
This is great! It is a good idea to run all GPCR first (olfactory including?). I am also interested in single-pass transmembrane proteins, articles for some of which have been already generated...Biophys (talk) 21:37, 21 November 2007 (UTC)
Still, I do not understand. Are you saying that 15-19 of every 25 genes have namespace conflicts for the short gene names? That is really strange. If there are any other "conflicts", the corresponding pages can be created and modified later. Anyway, can we just go ahead and automatically generate all pages that can be generated automatically? I am asking because I would like to try a semi-automatic generation of pages for protein families and do not want all links to human proteins to appear in red. Perhaps you simply do not have expression profiles for other proteins? Then, let's create the pages without expression profiles. Biophys 17:29, 30 November 2007 (UTC)
So, is it possible to generate automatically as many human protein pages as possible? As a practical matter, I started generating protein "family/domain" pages where most human proteins appear in red.Biophys (talk) 05:34, 6 December 2007 (UTC)
Reply below in new section... AndrewGNF (talk) 06:13, 6 December 2007 (UTC)

Protein families

I'm happy to do requests. If you have a group of genes you would like to have run, just get me a list of Entrez Gene Ids and I'll be happy to put it in the queue. JonSDSUGrad (talk) 19:27, 22 November 2007 (UTC)
Thank you! But the list of human single-pass proteins would be at least 2000 long (this project is mostly about protein/peptide structure modeling and some bioinformatics, rather than studies of a few specific proteins). I could try however to automatically generate some WP-formatted content about these proteins in my namespace (a content to be merged with existing or future articles). Ironically, that would mostly be a content which PBB 1.0 is missing: protein family, domain structure, subcellular localization and function - everything is encyclopedic...Biophys (talk) 18:09, 23 November 2007 (UTC)
I just figured out a way to generate lists of Entrez IDs given the Gene Symbol. On the Entrez FTP site is a flat file called DATA/gene_info.gz which contains one gene per line and each line contains the Entrez ID + Gene name + other info. One can simply grep to find the gene, and therefore the Entrez ID that you are searching for. Based on data extracted from this file, I just created a table for the voltage gated ion channels. Cheers. Boghog2 (talk) 21:40, 23 November 2007 (UTC)
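The grep approach above can also be sketched in Python. This is a minimal sketch that assumes gene_info.gz is tab-separated with the taxonomy ID, Entrez Gene ID and gene symbol in the first three columns, and that human rows carry taxonomy ID 9606; the function names and the sample rows (GENEA, GENEB and their IDs) are made up for illustration.

```python
import gzip


def symbol_to_entrez(lines, symbols, tax_id="9606"):
    """Map gene symbols to Entrez Gene IDs from gene_info-style lines.

    Assumed column layout: tax_id <TAB> GeneID <TAB> Symbol <TAB> ...
    """
    wanted = set(symbols)
    found = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip the header/comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3:
            continue  # skip malformed rows
        if cols[0] == tax_id and cols[2] in wanted:
            found[cols[2]] = cols[1]
    return found


def lookup_in_file(path, symbols, tax_id="9606"):
    """Same lookup, reading a gzip-compressed file such as gene_info.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return symbol_to_entrez(handle, symbols, tax_id)


# Made-up sample rows, only to show the expected shape of the input.
sample = [
    "#tax_id\tGeneID\tSymbol\tLocusTag\n",
    "9606\t111\tGENEA\t-\n",
    "10090\t222\tGENEA\t-\n",
    "9606\t333\tGENEB\t-\n",
]
ids = symbol_to_entrez(sample, ["GENEA", "GENEB", "GENEC"])
```

Symbols that are not found (GENEC here) are simply absent from the result, which makes it easy to spot names that need a manual check.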
Good to know! I am wondering, however, whether it would be reasonable to do the following. 1. Automatically create a single file with information on several hundred proteins or protein families, using Pfam/Entrez/UniProt files as input and formatting everything exactly as needed for a series of WP articles (my programming skills are sufficient to do that). The file would basically be a sequence of articles (or some content missed by PBB and organized by article): article 1, article 2, and so on. 2. Quickly merge the file's content into existing and/or new WP articles using AWB. Biophys (talk) 23:38, 23 November 2007 (UTC)
Since we have that resource, could/should we create a page with wikilinks to the gene name for the list of pages intended to be created by the bot? Then, if you grepped on the raw HTML on the Wikipedia page, you can identify which links are red (therefore, no namespace conflict), which could help address Andrew's concern above. --Arcadian (talk) 01:50, 24 November 2007 (UTC)
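A minimal Python sketch of the "grep the raw HTML for red links" idea follows. It assumes that MediaWiki marks links to missing pages with the CSS class "new" and a title ending in " (page does not exist)"; the helper names and the sample HTML (including the GENEA link) are hypothetical.

```python
from html.parser import HTMLParser

MISSING_SUFFIX = " (page does not exist)"


class RedLinkCollector(HTMLParser):
    """Collect the titles of links rendered as red links (class "new")."""

    def __init__(self):
        super().__init__()
        self.red_titles = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "new" not in classes:
            return  # a blue link: the page already exists
        title = attr.get("title") or ""
        if title.endswith(MISSING_SUFFIX):
            title = title[: -len(MISSING_SUFFIX)]
        self.red_titles.append(title)


def red_links(html):
    collector = RedLinkCollector()
    collector.feed(html)
    return collector.red_titles


# Hypothetical rendered HTML with one existing page and one missing page.
page_html = (
    '<a href="/wiki/ESR1" title="ESR1">ESR1</a> '
    '<a href="/w/index.php?title=GENEA&amp;action=edit&amp;redlink=1" '
    'class="new" title="GENEA (page does not exist)">GENEA</a>'
)
missing = red_links(page_html)
```

Titles that come back from such a scan have no page at that exact name, so they are candidates for creation without a conflict on the gene symbol; aliases and full gene names would still need their own check.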
Good idea. As a start, I have added the WP links to the gene names for three protein families (voltage-gated ion channels, ligand gated ion channels, and solute carrier family) here. As I have stated elsewhere, I think it would be a good idea, if possible, to process each of the major gene/protein families one at a time so that all the members of a particular family are added at the same time. This will make integrating with existing WP content easier. Boghog2 (talk) 10:09, 24 November 2007 (UTC)
I have now also added lists of transcription factors and the six Enzyme classes (EC1-6) here. I previously forwarded a list of GPCR gene symbols/Entrez IDs to Andrew. With more work, we could probably produce lists for the other major families. Boghog2 (talk) 17:40, 24 November 2007 (UTC)
Right. It makes so much sense to include all proteins/genes family by family, and not to forget to include the name of each family as a category (e.g. Category:Ligand gated ion channels) in the corresponding pages. Biophys (talk) 01:40, 25 November 2007 (UTC)
My last comment was directed to Boghog's statement, but appeared below, due to simultaneous editing. However, I support Biophys's proposal. Here is a list of pages using the existing protein box. --Arcadian (talk) 01:54, 24 November 2007 (UTC)
We had a discussion with Boghog on a related topic here[12]. I will try to generate a few examples in my namespace, as time allows, and then ask for everyone's opinion.Biophys (talk) 02:26, 24 November 2007 (UTC)

I would like to reply to Boghog2's comment about protein families (see his work here). I have written a simple program that automatically generates a list of all human protein families in Pfam-A and a list of the human proteins in each family. There are as many as 3450 different Pfam human protein families. One gene/polypeptide chain can be present in several families because it can have several domains. Each family is a separate sub-category, such as GPCR from family 1, etc. Biophys (talk) 19:46, 28 November 2007 (UTC)
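The grouping described above (one protein listed under every family in which it has a domain) can be sketched as follows. This is not the program mentioned in the comment; the function name and the protein and family names are placeholders, and real input would come from Pfam-A annotation files.

```python
from collections import defaultdict


def families_to_members(protein_domains):
    """Invert a protein -> families mapping into family -> sorted members."""
    families = defaultdict(set)
    for protein, domains in protein_domains.items():
        for family in domains:
            families[family].add(protein)
    return {family: sorted(members) for family, members in sorted(families.items())}


# Placeholder data: PROT_A has two domains, so it is listed in two families.
annotations = {
    "PROT_A": ["FamilyX", "FamilyY"],
    "PROT_B": ["FamilyX"],
    "PROT_C": ["FamilyZ"],
}
by_family = families_to_members(annotations)
```

Because multi-domain proteins appear in several families, the family lists add up to more entries than there are proteins.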

Oops, these people found 3,853 human protein Pfam-A families [13].Biophys (talk) 23:46, 28 November 2007 (UTC)

UniProt entries

P.S. I do not know if you realize this, but each UniProt entry is a variety of a ProteinBox. However, "Uniprot ProteinBox" includes some important information missing in our Protein Box. For example, [14]:

FUNCTION: Occurs in almost all aerobically respiring organisms and serves to protect cells from the toxic effects of hydrogen peroxide.
CATALYTIC ACTIVITY: 2 H2O2 = O2 + 2 H2O.
COFACTOR: Heme group.
COFACTOR: NADP.
SUBUNIT: Homotetramer.
SUBCELLULAR LOCATION: Peroxisome.
DISEASE: Defects in CAT are the cause of acatalasia [MIM:115500]; also known as acatalasemia. This disease is characterized by absence of catalase activity in red cells and is often associated with ulcerating oral lesions.
SIMILARITY: Belongs to the catalase family. Biophys (talk) 02:39, 24 November 2007 (UTC)
I agree that the above information would be very useful to include, but one first must look at the UniProt terms which in turn applies the Creative Commons - No Derivative Works license which states "You may not alter, transform, or build upon this work." I don't know if I am interpreting the UniProt terms correctly, but it would appear that incorporating UniProt derived data into WP would constitute "transforming" the work and hence would not be permitted under the UniProt terms. At a minimum, I think we would have to obtain permission from UniProt before even considering such an undertaking. Cheers. Boghog2 (talk) 08:40, 24 November 2007 (UTC)
Right. I left the copyright question to Alex Bateman. We need to ask UniProt for permission, and it probably matters who does the asking. Biophys (talk) 23:06, 24 November 2007 (UTC)
You should see a more complete explanation, however: [15]. The biological data in databases are not protected by copyright. It says: "whether the data itself is copyrightable, depends on what it is. To the extent it consists of factual information, it will not be copyrightable. For example, the contents of NCBI’s Entrez Gene database include gene names, descriptions, pathways, protein products, and other facts. However, to the extent the data is creative and expressive works, such as papers or photographs, then the database content itself is likely to be protected by copyright." Hence amino acid sequences, locations of protein domains, assignment to a certain protein family, oligomeric state, catalytic activity, subcellular location, or biological function are not themselves protected by copyright, although the entire table can be protected because it includes some creative elements, such as the names of database fields and the overall organization of the data. As for summaries by database annotators, this is not quite clear. If they qualify as a "creative essay", they are protected. If they are simply "data", they are not protected. Biophys (talk) 22:35, 25 November 2007 (UTC) In short, it says: "facts are free". Biophys (talk) 22:39, 25 November 2007 (UTC)

New template

I've created a template called Template:NLM content that can be put on Main–namespace pages to show that a page's text was copied directly from the NLM. Use it if you'd like. — Insanity Incarnate 09:22, 22 November 2007 (UTC)

Awesome! Thanks for this. I'll have to check and see if we can make use of it. JonSDSUGrad (talk) 19:15, 22 November 2007 (UTC)
So, the articles now have the following text at the bottom: "This article incorporates text from the United States National Library of Medicine, which is in the public domain.". Sorry, but I think this text should be deleted as misleading. As soon as any Wikipedian makes a single edit in the abstract (which is supposed to be the case for every article), this is no longer a text from the United States National Library of Medicine. Biophys (talk) 19:05, 26 November 2007 (UTC)
I think the acknowledgment/disclaimer text is necessary and important and therefore should be included by default. If the article is later substantially edited so that the original text is replaced, then the template can be deleted, but only after significant editing. I have had at least one bot and two administrators issue warnings to me after I manually added some PBB-generated content. I am now including this disclaimer on all pages that I create that incorporate any NLM text. Boghog2 (talk) 19:14, 26 November 2007 (UTC)
I see. You have already notified User:Coren (there is also User:Where, with User:Wherebot), who operates such a bot, so he will add the NLM address to the list of websites whose content can be freely copied to WP (he is supposed to have such a list). There is already a citation of the Entrez Gene database, so the NLM template is a duplicate acknowledgement and an additional link to delete for anyone who edits an article in the future. But I do not care. Let's have this template - no problem. Biophys (talk) 23:15, 26 November 2007 (UTC) A WP "purist" would tell you that such a duplicate link falls under the Wikispam#Link_spam category, but I am not such a person. Biophys (talk) 23:31, 26 November 2007 (UTC)
Perhaps Template:Gray's would be a good model: "This article was originally based on an entry from a public domain edition of Gray's Anatomy. As such, some of the information contained herein may be outdated. Please edit the article if this is the case, and feel free to remove this notice when it is no longer relevant." --Arcadian (talk) 02:57, 27 November 2007 (UTC)
I agree. That would be better.Biophys (talk) 06:56, 27 November 2007 (UTC)
How about adding something like "Feel free to remove this template after editing if it is no longer true."Insanity Incarnate 21:12, 28 November 2007 (UTC)
Same thing but shorter.Biophys (talk) 17:13, 29 November 2007 (UTC)

increasing the rate of page creation (by increasing burden on MCB)?

... (response to Biophys' comment/question above)

Sorry, been busy with "real life" work for the past couple of days. Quick reply here. Check out a somewhat recent log file. Of the 25 genes processed, 8 were created outright by PBB. The remaining 17 had some namespace conflict with any of the gene symbol, name, or aliases. Of those 17 conflicts, 9 were removed if we only check the gene symbol. Of those 9, four look like they have existing gene pages that would need to be manually merged/reconciled (GMNN, ENPP2, NEU1, IL23A). Up to this point, we've been very strongly avoiding these cases where we're creating a duplicate gene/protein page, and doing this at the expense of flagging more pages for manual inspection. (Bots should tread lightly.) If there is consensus to relax that bias and be a bit bolder with page creation (putting a bigger onus on the community to resolve merges), we'd happily consider that proposal.

To summarize, for each batch of 25, we currently create about eight and flag 17 for manual inspection. Generally, PBB creates no duplicate pages that need to be merged. We would certainly consider a proposal that would increase the number of pages created to ~17, leaving eight or so for manual inspection. Importantly, 4 of the 17 pages auto-created would be duplicates in the WP namespace; these duplicates are not easily searched for and are difficult to retrospectively identify. Thoughts? AndrewGNF (talk) 06:15, 6 December 2007 (UTC)
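The trade-off in the summary above can be written out as a short calculation, using only the counts quoted from the one log file (25 genes processed, 8 created, 9 conflicts that disappear under a symbol-only check, 4 of which are real duplicates). The variable and function names are just for illustration.

```python
def batch_outcome(processed, created, cleared_by_symbol_only, duplicates_in_cleared):
    """Compare the cautious check with a relaxed, symbol-only check."""
    flagged = processed - created
    relaxed_created = created + cleared_by_symbol_only
    return {
        "cautious": {"created": created, "flagged": flagged, "duplicates": 0},
        "relaxed": {
            "created": relaxed_created,
            "flagged": processed - relaxed_created,
            "duplicates": duplicates_in_cleared,
        },
    }


outcome = batch_outcome(processed=25, created=8,
                        cleared_by_symbol_only=9, duplicates_in_cleared=4)
```

With these numbers the relaxed check roughly doubles the pages created per batch (17 instead of 8), at the cost of about 4 duplicate pages that would have to be found and merged afterwards.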

I vote to retain the cautious approach with the possible exception of large gene families where for some reason, the pages are not created automatically even though there doesn't appear to be any name conflicts. For example, in the GPCR special request (see A0 - D11), none of the very large number of olfactory receptors were created outright by the PBB. Would it be possible to rerun the olfactory receptors with relaxed standards so that these pages are created automatically? Because of the large number of pages (~390), it would be very tedious to do this manually. Boghog2 (talk) 06:39, 6 December 2007 (UTC)
Ahh, that would be because when we check for namespace conflicts on the official gene name, we strip everything after the first comma. That makes sense for most official gene names, but not for the olfactory receptors (which all get trimmed to "olfactory receptor", which of course has an existing page). I think we could convince Jon to redo those genes with relaxed criteria, but he's a bit cramped for time these days... AndrewGNF (talk) 06:49, 6 December 2007 (UTC)
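The effect of stripping everything after the first comma can be shown with a small Python sketch. It is an illustration of the behaviour described above, not PBB's code; the helper names, the case-insensitive comparison, and the set of existing page titles are assumptions.

```python
def trimmed_name(official_name):
    """Keep only the text before the first comma, as described above."""
    return official_name.split(",", 1)[0].strip()


def has_conflict(official_name, existing_titles, strip_after_comma=True):
    """Check a gene name against existing page titles (case-insensitive here)."""
    name = trimmed_name(official_name) if strip_after_comma else official_name.strip()
    return name.lower() in {title.lower() for title in existing_titles}


existing = {"Olfactory receptor"}  # hypothetical list of existing pages
gene_name = "olfactory receptor, family 1, subfamily A, member 1"

strict = has_conflict(gene_name, existing)                             # trimmed name collides
relaxed = has_conflict(gene_name, existing, strip_after_comma=False)   # full name does not
```

Every olfactory receptor name trims to the same string, so with the comma rule the whole family is flagged, while comparing the full name lets each member through.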
Thanks for the explanation. If Jon could find time to rerun the olfactory receptors, I would be most grateful. That would just about complete the entire list of G protein-coupled receptors/Chemokine receptors contained in the PBB logs A0 - D11. Boghog2 (talk) 06:24, 8 December 2007 (UTC)
Sorry, I was not clear enough. You are saying that the bot can automatically create only 1/3 of the pages due to namespace conflicts. Great! Then let's run the bot against 30,000 genes and automatically generate 10,000 pages (1/3) - right now. Why not? If this is done automatically, it does not increase the burden on MCB members. Relaxing the namespace overlap criteria is a separate question. We should not automatically create any pages that cause namespace conflicts for short gene symbols (the article name is the gene symbol). As for the other "conflicts" (they are not really conflicts), all such cases could be recorded in a separate list of pages to be inspected later. Biophys (talk) 21:01, 6 December 2007 (UTC)
Not all 30,000 have reliable sources that address the topic. This was discussed in the bot approval process. Tim Vickers (talk) 21:08, 6 December 2007 (UTC)
So, how many human genes with references are still left to be processed automatically? Many thousands. Let's process them now. I just checked four first red links to human genes in SH2 domain. All these genes have reliable sources in Entrez Gene, and all of them can be generated using short gene symbols as article titles.Biophys (talk) 23:10, 6 December 2007 (UTC)
My back-of-the-envelope calculation goes something like this... There are ~25000 genes, and maybe a third of them (~8500) will have sufficient useful information so that we'll want to run them. Going with the 8/25 = 32% automatic creation rate, that means we can create ~2720 total straight off. Right now, there are 1678 pages with PBB content on them, and probably ~1000 of them were created outright by PBB. (The rest, of course, were created/merged by our volunteers.) So I'd guess we have another couple thousand to go in terms of genes that can be auto-created by PBB.
But regarding the first four red links at SH2 domain. ABL2 is waiting for a volunteer to process it, BCAR3 has only 17 linked citations in Pubmed so is only ranked 4669 on our priority list (haven't gotten there yet), BLNK is also waiting for a volunteer, and CHN1 only has 13 linked references and is ranked 5842 in priority. (CHN1 also had an existing gene page at Chimerin 1, to which I've added a redirect.) Anyway, if you've got a specific list of requests you'd like us to prioritize, feel free to post them in the requests section. If your favorite gene is waiting for a volunteer, feel free to tackle it yourself (find it by doing a WP search). If all else fails, I'm afraid you'll just have to wait until PBB gets to your gene and/or a volunteer gets to it... AndrewGNF (talk) 02:32, 8 December 2007 (UTC)
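For anyone who wants to replay the back-of-the-envelope estimate above, here it is as a few lines of Python. All inputs are the rough figures quoted in the comment (~8500 informative genes, an 8-in-25 automatic creation rate, ~1000 pages already created outright); the variable names are illustrative only.

```python
informative_genes = 8500      # "maybe a third" of ~25000 genes, as quoted
created_per_batch = 8         # pages created outright ...
batch_size = 25               # ... per batch of 25 genes
already_auto_created = 1000   # rough count of pages created outright so far

# Integer arithmetic keeps the estimate exact: 8500 * 8 / 25
auto_creatable_total = informative_genes * created_per_batch // batch_size
remaining_auto_creatable = auto_creatable_total - already_auto_created
```

On these inputs the total comes to 2720 and the remainder to 1720, which is the "another couple thousand" mentioned above.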
Great. So, can we just go ahead and create all articles that can be generated automatically by the bot and satisfy the WP:Notability criteria, that is, where a protein/gene has been mentioned in at least one or two publications? You do not need a volunteer for that? If you are going to publish something, it would be to your advantage to say that your bot had actually generated such and such a large number of pages in WP (significant coverage). Of course, if I were your reviewer, I could have many other critical comments, but it will be someone else. Biophys (talk) 16:35, 8 December 2007 (UTC)
Point heard to "hurry the heck up", but you'll have to believe that we're moving as fast as we can given other constraints... AndrewGNF (talk) 19:11, 8 December 2007 (UTC)
I am sorry. No rush. I just did not see any reason not to run the bot.Biophys (talk) 22:25, 8 December 2007 (UTC)
No need to apologize, PBB just still needs a bit of hand-holding when we run it... AndrewGNF (talk) 17:51, 9 December 2007 (UTC)

CD83 - Protein vs gene

At CD83, it says CD83 molecule is a gene - even if the gene didn't have a special name, I'd call it "the CD83 gene" (certainly not the "CD83 molecule"). Can the bot be corrected to call things proteins instead of genes?

Also, would it be useful for the bot to automatically create talk pages with the MCB template in? --Seans Potato Business 12:50, 8 December 2007 (UTC)

These articles are about both the gene and the protein. In many cases, however, most of the literature discusses the properties of the protein and not the gene, so the wording of these automatically generated pages (e.g., the previous version of CD83) often seems very strange. My bias would also be to describe the subject of these articles primarily as proteins rather than genes. I have modified CD83 to make clear that this article is about both the protein and the gene. Boghog2 (talk) 13:30, 8 December 2007 (UTC)
Can this change be made to other PBB pages (e.g. CCR7) en masse? --Seans Potato Business 17:43, 8 December 2007 (UTC)
I think comfort with the term "gene" vs. "protein" (I agree "molecule" isn't good) depends on what biological specialty you were trained in. In genomics, we are more likely to talk about the function of genes, not the proteins. But to be a bit more philosophical, although the "protein" actually does all the good stuff, I think the word "gene" encompasses more of the lifespan of this "unit of heredity" than does the word "protein". And I definitely don't want to see separate articles for genes and proteins! I guess going so far as to say something like "CD83 is a gene whose protein product (does something)" would be fine, if necessary. Forluvoft (talk) 18:20, 8 December 2007 (UTC)
Sorry, what Boghog2 wrote for CD83 is great too. Forluvoft (talk) 18:42, 8 December 2007 (UTC)
I agree that a gene and all its protein products (there could be several of them due to alternative splicing and various mutations) should be described in the same WP article. Different protein products of the same gene can be found in each Entrez Gene entry as several links to different UniProt entries, for example (~60,000 human protein entries are currently present in UniProt). ProteinBox provides only one such link if at all.Biophys (talk) 18:45, 8 December 2007 (UTC)
Or perhaps "CD83 is a protein whose gene (does [nothing (j/k)])"? Either way, when referring to the gene, shouldn't it be in italics? I don't think that's currently the case (at least not for CCR7). I am also happy to keep proteins and genes discussed together in the same article.
I think that whenever a protein name (e.g. CD83, CCR7) is mentioned without a qualifier, most people will assume we're talking about the protein. You've got to say "CD83 gene". If you really wanted, you could start an article "(Cluster of Differentiation 83) gene is a human gene encoding the CD83 protein", but I object to CD83 without a qualifier, unless you're referring to the protein. --Seans Potato Business 18:48, 8 December 2007 (UTC)
I certainly don't want to discount the validity of the gene/protein issue, but I do think, relatively speaking, that it's a point of minutiae. ("CD83 molecule" is a special case -- unfortunately, that's its official gene name...) It's also an issue that has been discussed several times above and on other related pages (bot approval, village pump, MCB proposals). If someone wants to make a proposal and rally a consensus of WP users around a new and better header line, then we'll make the change on the bot end. But I'm certain that if we make a change based on one person's comment, it will get picked apart too by some future contributor. And I think none of us wants this issue to distract from the bigger and more interesting issue of improving the gene stubs with new content. (And, of course, anyone is welcome to change the header line for a page they care about, as Boghog2 did above...) (Addition of the MCB template is on the Version2 specs...) AndrewGNF (talk) 19:25, 8 December 2007 (UTC)

Discussion on notability moved from AndrewGNF's talkpage

What is the point of creating stubs for genes if nothing is said about what they do? For example, GPR155, GPR156, GPR157, GPR158 all read exactly the same. Aren't you just mirroring some database(s)? Adding 10,000 stubs will increase the size of Wikipedia by 0.47%. AnteaterZot (talk) 08:59, 12 December 2007 (UTC)

Those GPCR genes were created by request -- see User_talk:ProteinBoxBot. Presumably now that those stubs are created, the interested user(s) will add additional useful content. But, I'd argue that the stubs even as they are now are useful and notable (also the consensus of the BAG, MCB, etc.), even if slightly less full than some of the other gene pages. AndrewGNF (talk) 17:34, 12 December 2007 (UTC)
So you are not planning on creating 10,000 such stubs? AnteaterZot (talk) 23:32, 12 December 2007 (UTC)
Yes, we eventually will work up to ~10k genes, as described on the bot approval page. Not sure what you're getting at here... AndrewGNF (talk) 00:03, 13 December 2007 (UTC)
What I'm getting at is that most of the genes in the world are not notable. For example, fruitless is notable. AnteaterZot (talk) 00:08, 13 December 2007 (UTC)
Well, the notability issue has been discussed extensively on the bot approval page, on the MCB/Proposals page, PBB talk page, and at the village pump. (Sorry, if you can't find any of those pages, I'm happy to wikilink. Just feeling lazy...) Each time, the consensus of users has been to move ahead. If you still want to raise notability issues (hopefully with arguments that haven't been previously raised), I'd suggest doing it at the bot talk page. AndrewGNF (talk) 00:13, 13 December 2007 (UTC)
Well, okay. And I think it would be very kind of you to provide maybe one link that would lead me to the others. AnteaterZot (talk) 00:17, 13 December 2007 (UTC)
Done, added two links... AndrewGNF (talk) 00:20, 13 December 2007 (UTC)
I've looked it over, and I must commend you in your efforts to digest material from the various databases into a more accessible format than Entrez. But a couple of things still worry me. One is the assertion that a gene is inherently notable; "Notability of the genes themselves, I think, is a given. These are human genes, the stuff of life!" This is simply not true. Most genes, if knocked out, have little or no effect on phenotype. You address this by requiring the gene be mentioned in more than a couple papers, which is a good start. Two is the heavy reliance on primary sources, and I mean this in the scientific literature sense. Wikipedia requires secondary and/or tertiary sources to establish notability. I take this to mean that a gene should have a couple of mentions in review articles, and/or a mention in the popular press. Take for example, BRCA1. It has 174 mentions in the New York Times. You might say that example is a bit unfair, so how about C5a receptor? It appears in the title of a couple of review journals, and here in a story about a pricy biotech startup. So it might be okay. Now let's take one your bot created, GPR32. It has 208 unique g-hits, none of which amount to anything. I found only one citation on webofknowledge, the (Marchese et al. 1998) one, which is a short communication. They don't really seem to know what the gene does. The gene does not appear to have been in the title or abstract of any review articles. Therefore the gene appears to be not notable. Do you disagree? AnteaterZot (talk) 10:56, 13 December 2007 (UTC)
Having said that, is there any way you can tune your bot to not create stubs on genes like GPR32 while keeping notable ones? Perhaps it can require the word "review" in two sources? AnteaterZot (talk) 10:56, 13 December 2007 (UTC)
I second Andrew's suggestion to move this discussion to the PBB talk page. Since I was the one that requested these GPR pages, I feel that I have an obligation to respond, but on the PBB talk page, not here. Cheers Boghog2 (talk) 17:20, 13 December 2007 (UTC)
So moved. Is the bot active? AnteaterZot (talk) 23:47, 13 December 2007 (UTC)
No, it has not been active for a couple of weeks. Biophys (talk) 00:16, 14 December 2007 (UTC)
The idea that "Wikipedia requires secondary and/or tertiary sources to establish notability." is true, but the definition of "secondary" here encompasses articles in peer-reviewed journals. The guideline states "A topic is presumed to be notable if it has received significant coverage in reliable sources that are independent of the subject." A gene that is the subject of even one paper might meet the notability guideline, those that are the subject of multiple papers certainly meet the guideline and those that are the subject of review articles (tertiary sources) are indisputably notable. Tim Vickers (talk) 00:04, 14 December 2007 (UTC)
So, do you claim that GPR32 is notable? When policy says "sources", it means more than one source. If a gene, say, caused people to have three eyes but was only in one paper, it would be notable. But most genes are not notable. AnteaterZot (talk) 00:20, 14 December 2007 (UTC)
I would argue that all human genes should be considered notable, based on the notability of the human genome as a whole. To avoid overload, I suggest that the notability of animal and plant genes should be judged individually. Similarly the notability of proteins should be judged individually. Apparently the current thinking at WP:N is that all towns and cities in the world are notable, no matter how small, so long as their existence and naming can be reliably sourced. (See WP:OUTCOMES). It is not too much of a stretch to allow human genes into the realm of intrinsic notability. (How many things can be more important?) EdJohnston (talk) 00:39, 14 December 2007 (UTC)
Notability is not inherited. See Wikipedia:Arguments_to_avoid_in_deletion_discussions#Notability_is_inherited. AnteaterZot (talk) 00:51, 14 December 2007 (UTC)
And, I disagree. As I said before, you could be walking around right now, missing both copies of a gene, and not show any effects. In fact, it is likely that you have at least one non-functional gene. What if I wanted to add a stub for every known SNP to Wikipedia? AnteaterZot (talk) 00:51, 14 December 2007 (UTC)
Huh? This is a strawman argument; you can't compare SNPs with genes. Also, since when has a gene only been notable if it has a phenotype when knocked out? David D. (Talk) 07:29, 14 December 2007 (UTC)
Articles created on request are not covered by the ProteinBoxBot notability requirements. We create them and then pass them over to the person who requested them, so they can improve the article themselves. If you want to AfD these examples, go ahead, but this isn't characteristic of the output that the bot is making. Have a look at the pages created in User:ProteinBoxBot/PBB Log Index to get a better idea of the standards we follow. However, I disagree with the idea that all human genes are intrinsically notable since there is no reason to bias this towards humans and the same argument could be applied to any species on the planet. That would be an indiscriminate and uninformative collection of information. Tim Vickers (talk) 00:41, 14 December 2007 (UTC)
Could you give me six (completely random chosen) stubs on human genes the bot created without any human request or intervention, so I can check them out? AnteaterZot (talk) 00:51, 14 December 2007 (UTC)
Choose an integer between 1 and 65. Tim Vickers (talk) 00:55, 14 December 2007 (UTC)
18 AnteaterZot (talk) 00:56, 14 December 2007 (UTC)
From log file 18 on the page I linked above, the bot created/updated SUMO2, STX4, C1S, Thymidine kinase 1 and SLC9A3R2. Tim Vickers (talk) 00:59, 14 December 2007 (UTC)
Okay, I'll look at them. It will take me awhile. Can you give me 5 more from log file 19? I think a slightly larger sample would be better. Thanks, AnteaterZot (talk) 01:04, 14 December 2007 (UTC)
Here's the page User:ProteinBoxBot/PBB Log Wiki 11-9-2007-A2-4, have a look at the content and referencing. Tim Vickers (talk) 01:08, 14 December 2007 (UTC)
May I also suggest [16]? (scroll down past all the GPCRs) AndrewGNF (talk) 01:09, 14 December 2007 (UTC)

Sourcing

Just a quick question: At random, I chose OPN5. I then looked for review articles in the reference list, clicked on Terakita A (2006). "The opsins". Genome Biol. 6 (3): 213. doi:10.1186/gb-2005-6-3-213. PMID 15774036.. I read the whole article, searching for the gene. It turns out the gene is only mentioned in one of the citations of the article, but when I looked at what the reference was pointing at, found a very general statement about opsins. What this means to me is that the bot, not knowing any better, pulled this article from a database as a source, which it really isn't. So, why not tell the bot to only take sources where the gene name is in the title or abstract only? Next, I looked at Vassilatis DK, Hohmann JG, Zeng H, et al. (2003). "The G protein-coupled receptor repertoires of human and mouse". Proc. Natl. Acad. Sci. U.S.A. 100 (8): 4903–8. Bibcode:2003PNAS..100.4903V. doi:10.1073/pnas.0230374100. PMID 12679517.. This paper doesn't mention the gene at all! How did that happen? This is an issue distinct from notability; you don't want the bot using bad citations, right? Anyway, I'm going to go on with my notability investigation, be back later. AnteaterZot (talk) 04:54, 14 December 2007 (UTC)

Well, I rechecked this since it seemed surprising that OPN5 would not be mentioned in this review. I counted the neuropsin subfamily mentioned quite a few times. Isn't that synonymous with OPN5? David D. (Talk) 06:41, 14 December 2007 (UTC)

Fredriksson R, Höglund PJ, Gloriam DE, et al. (2003). "Seven evolutionarily conserved human rhodopsin G protein-coupled receptors lacking close relatives". FEBS Lett. 554 (3): 381–8. doi:10.1016/S0014-5793(03)01196-7. PMID 14623098. doesn't mention Opsin 5 or Opn5 at all either. That leaves just the one citation, Tarttelin EE, Bellingham J, Hankins MW, et al. (2003). "Neuropsin (Opn5): a novel opsin identified in mammalian neural tissue". FEBS Lett. 554 (3): 410–6. doi:10.1016/S0014-5793(03)01212-2. PMID 14623103. S2CID 9577067., that is actually about the OPN5 gene on the whole article. It is a primary source, so the very first gene I randomly selected is not-notable. Disagree? AnteaterZot (talk) 05:22, 14 December 2007 (UTC)

When I made a regular PubMed search using "OPN5" as a search word, I got these two references: PMID 16753026 and PMID 14623103. One of them is indeed about the OPN5 gene/protein. Perhaps the selection of references can indeed be improved. Does it mean that this gene is not notable? Not at all. The publication (a reliable source) claims that "Neuropsin shares 25-30% amino acid identity with all known opsins, making it the founding member of a new opsin family." That may be notable enough. "A founder" of something would be great if said about a person (just kidding). 05:42, 14 December 2007 (UTC)
That's not what that means. AnteaterZot (talk) 05:43, 14 December 2007 (UTC)
And no, by WP:N, it is not notable, because more than one source is required and the sources must be secondary or tertiary. The primary sources are good for WP:V. AnteaterZot (talk) 05:54, 14 December 2007 (UTC)
If you insist, we would have to discuss if the established biological databases (like UniProt and Entrez) qualify as reliable secondary sources about the genes. What would be your arguments that they do not?Biophys (talk) 06:02, 14 December 2007 (UTC)
In the scientific literature, a secondary source is a review article. Databases are directories, and primary or below (zeroary? halfary?). AnteaterZot (talk) 06:08, 14 December 2007 (UTC)

You are forgetting about the synonyms, with GPR136 you also get PMID 14623098 and PMID 12732197. Tim Vickers (talk) 06:02, 14 December 2007 (UTC)

Correct. There are two more sources.Biophys (talk) 06:07, 14 December 2007 (UTC)
Okay, up to three primary sources. Without a secondary or tertiary source, the gene is not notable. You could have a dozen primary sources, but if scientists haven't talked about the gene in a review article, it can't be all that important. Compare it to my BRCA1 and C5a receptor examples above. AnteaterZot (talk) 06:13, 14 December 2007 (UTC)
Just looking at what cites the primary source leads to PMID 16005867. I think the main point is lost here. These articles are starting points not finished products. David D. (Talk) 06:31, 14 December 2007 (UTC)
Guys, in general I think the project of getting these genes onto Wikipedia is great. I'm not trying to sabotage your plan, just to improve it. David D. found what looks like a secondary source; great. "Why didn't the bot list it?" is the question you should be asking yourselves. AnteaterZot (talk) 06:54, 14 December 2007 (UTC)
I didn't mean to sound defensive, these are legitimate questions but these bots are only as good as the data bases they mine. Part of the problem is there are very many references cited and a subset are grabbed, Andrew will correct me if I'm wrong here. I agree in principle that only taking ones with OPN5 in the title is good but in practice that will not solve the problem either. You say yourself that you read the whole of Terakita A (2006). "The opsins". Genome Biol. 6 (3): 213. doi:10.1186/gb-2005-6-3-213. PMID 15774036. but you did not notice that OPN5 was mentioned quite a few times but as neuropsin. If humans find this hard it will be much harder for a script. David D. (Talk) 07:04, 14 December 2007 (UTC)
If humans find it hard, won't it be impossible with 10,000 stubs? How many minutes did it take you to find the secondary source, and then add a few minutes to put it in the article. Now multiply that number by 10,000, and we're talking, what, a Man-year? AnteaterZot (talk) 07:14, 14 December 2007 (UTC)
No, not impossible, as you know there are sources that the scripts can mine, and they are not irrelevant, just not necessarily the best (the ones for OPN5, for example, were OK). How many scientists are there in this world? Even if a fraction join in that is not too much effort. That is the whole point of a massively collaborative project. If this is successful then most of the genes will be upgraded to something very useful. At worst no one touches them and they remain a valuable source for information when linked to from other wikipedia articles. David D. (Talk) 07:36, 14 December 2007 (UTC)
A man-year if there are no hitches, and few AfD debates. The database it's mining is from the government. And the OPN5 citations were not OK, the bot found two that don't mention the gene at all, and one that mentioned the gene in the Literature Cited section--a citation which it failed to pick up. AnteaterZot (talk) 08:28, 14 December 2007 (UTC)

Maybe I'm missing your point. Eight papers were cited. Three specifically discuss OPN5: one is a review, and the other two, I assume, are the first mentions of OPN5.

The next one is a survey paper on the family of proteins that OPN5 belongs to.

While this might not be perfect, it is perfectly acceptable for a bot. All four are informative and useful to understanding more about OPN5 function. The other four are less useful in the sense that they are genomic papers although they will give some info that is gene specific such as location on the chromosome.

On top of these there are the others that the bot did not add to the article PMID 14623098, PMID 12732197 and PMID 16005867 that are all specifically about OPN5. I think the BOT did a fairly good job here although maybe it is possible for Andrew to program it to be a little more specific to bias against the genomic type papers? David D. (Talk) 09:06, 14 December 2007 (UTC)

  • Yes, the bot did a fairly good job. But wouldn't it have been awesome if it had cited the one review paper? And not cited the Vassilatis paper that doesn't mention it at all? The thing is, I didn't really even check any other stubs yet. It's a lot of work, and any small improvements to the bot will be vastly multiplied over 10,000 pages. AnteaterZot (talk) 09:24, 14 December 2007 (UTC)
The first ref is a review too. I'll be the first to cheer if the bot can do better. David D. (Talk) 09:27, 14 December 2007 (UTC)
You're right. Still, one wonders if it was a happy accident, or the design parameters of the project. I'll know more when I randomly sample some additional stubs, and now that I've had some practice at it, I should be able to do it without help. AnteaterZot (talk) 09:33, 14 December 2007 (UTC)

Bug in citations?

Look at SLC9A3R2, one of the ones that was suggested to me. If you look at the citations, many of them seem to be about the gene E3KARP instead. I think the bot has somehow gotten out of sync with itself, and is applying the citations to the wrong article. Am I wrong? If I'm right, I can't even begin to look at notability. (In any case, SLC9A3R2 doesn't show up much in the citations. I admit I haven't gone through it with a fine-toothed comb, since it would be a waste of time if the bot was applying the wrong citations.) AnteaterZot (talk) 05:39, 14 December 2007 (UTC)

If you look in the protein box, there is a list of alternative names, E3KARP is one of these. Tim Vickers (talk) 05:46, 14 December 2007 (UTC)
Ah, I was looking in the text. How about OPN5? How did the bot find those citations? AnteaterZot (talk) 05:49, 14 December 2007 (UTC)
This citation seems to be right. From what I can see, this is all about NHERF2 aka SLC9A3R2. Of course, it interacts with a lot of other proteins.Biophys (talk) 05:51, 14 December 2007 (UTC)
I see, the protein goes by many names. How did the bot find what I think are bad citations for OPN5 (above)? AnteaterZot (talk) 05:56, 14 December 2007 (UTC)
Perhaps let's step back and see the forest for the trees here. As stated above, all citations are taken from Entrez Gene (click the PubMed link). Even if the OPN5 references above are garbage, I think we can assume that in the vast majority of cases, references are correctly linked. And in the rare cases when things are wrong, the bot content is easily overridden by human editors. AndrewGNF (talk) 06:24, 14 December 2007 (UTC)
Yes, I overlooked the alternate names. Sorry about that. How does one override the bot in a case like OPN5, where it seemed to have grabbed "Opsin G"? What about my idea of the bot only taking articles which feature the gene in the title or abstract? And my idea of limiting the genes to ones that have been in the title or abstract of one (or more) review articles? I feel that this would address the notability concern. AnteaterZot (talk) 06:31, 14 December 2007 (UTC)

Notability

(break into new section for clarity)

Journal articles are perfectly acceptable secondary sources, any gene that has even a single paper dealing with it as a main topic is notable. Tim Vickers (talk) 06:59, 14 December 2007 (UTC)
As I've already stated, no. Journal articles are primary sources (unless they are a review). Review articles are secondary. Additionally, Wikipedia generally requires two sources. In the case of a gene, a mix of a primary and a secondary should be fine, if the gene is in fact the topic of both. That's why I have been suggesting that the bot take only those articles that use the name of the gene in the title or abstract, and only take genes that have the word "review" in the title, or the "article type" id (many databases have this, Entrez does for sure, I just checked), or in the title of the journal for at least one citation.
As a temporary measure, could the project at least start out with my more restrictive definition, and see how many gene stubs it creates? AnteaterZot (talk) 07:09, 14 December 2007 (UTC)
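For illustration only, the more restrictive rule proposed above (keep a citation only if the gene's name or one of its synonyms appears in the title or abstract, and require at least one review among the kept citations) could be sketched roughly as follows. This is a minimal Python sketch, not ProteinBoxBot's code: the `Citation` record, its `is_review` flag and all function names are hypothetical, and the example uses only the titles quoted in this thread, with abstracts left empty because they are not quoted here.

```python
import re
from dataclasses import dataclass


@dataclass
class Citation:
    """Hypothetical citation record; is_review stands in for an article-type field."""
    title: str
    abstract: str = ""
    is_review: bool = False


def mentions_gene(citation, names):
    """True if any gene symbol or synonym appears as a whole token in title or abstract."""
    text = f"{citation.title} {citation.abstract}"
    for name in names:
        # Disallow letters/digits on either side so "OPN5" does not match "XOPN5A".
        pattern = r"(?<![A-Za-z0-9])" + re.escape(name) + r"(?![A-Za-z0-9])"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


def select_citations(citations, names):
    """Keep only citations whose title or abstract names the gene."""
    return [c for c in citations if mentions_gene(c, names)]


def passes_proposed_rule(citations, names):
    """Proposed inclusion rule: at least one kept citation must be a review."""
    return any(c.is_review for c in select_citations(citations, names))


# Symbol plus the synonyms mentioned in this thread.
names = ["OPN5", "neuropsin", "GPR136"]
citations = [
    Citation("The opsins", is_review=True),
    Citation("The G protein-coupled receptor repertoires of human and mouse"),
    Citation("Neuropsin (Opn5): a novel opsin identified in mammalian neural tissue"),
]

kept = select_citations(citations, names)
print([c.title for c in kept])
print(passes_proposed_rule(citations, names))
```

With titles alone, only the Tarttelin paper is kept, and the review is dropped because its title uses none of the names, so the gene would fail the rule. That is the nomenclature problem raised elsewhere in this thread; matching on abstracts and a full synonym list would change the outcome.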
This definition seems pretty clear to me in the policy, I've asked at Wikipedia_talk:No_original_research#Journal_articles for clarification. Tim Vickers (talk) 16:44, 14 December 2007 (UTC)
What is your primary worry here, that there will be tons of stubs that are not obviously notable? A big problem with the restrictive approach is that many notable genes by your standard will be missed due to nomenclature issues. What about the geography stub project? Surely that broke the notability rules too when that bot was run, or is that considered a bad precedent? In the school debate all high schools are considered notable and will not get deleted by AfD. Do you consider these genes less notable than a five year old high school? Although i admit that no one is suggesting a bot should create HS stubs, yet, the notability threshold that has been set for them seems much lower than the guidelines. So there is room for exceptions.
What of the gains? A full set of human genes with known function in wikipedia WILL be big publicity for wikipedia in the scientific press and probably the popular press. You can bet that many expert users will be drawn to upgrade the pages of the genes they work on. Think how many scientists are out there? If you say that wikipedia has only a few thousand of some of the most well known genes then the tidal waves of publicity will turn to a trickle. David D. (Talk) 07:24, 14 December 2007 (UTC)
My worry is that there will be tons of stubs that are not notable ever. The geography stubs are a bad precedent. In the AfD debates, a few high schools do get deleted, and it is always because they have no secondary sources demonstrating notability. A couple of years ago, elementary schools were kept at AfD in the same way the high schools are now, but nowadays they are generally deleted. A similar fate may befall the geography stubs, people are just unwilling to open that can of worms right now.
I was not aware of the scope of the plan. A failed project would be worse than a delayed or diminished project, right? If the gene project creates a large number of stubs that never get improved, long term support for them will erode. So it behooves the project to get the bot as perfect as possible before letting it run full-bore. I can tell you that from my point of view, a one sentence entry with 11 citations doesn't look as good as three sentence entry with three citations. What is the bot really creating, a slightly more functional "mirror" of Entrez? It could be more than that. At present, scientists are all over Wikipedia creating articles for genes. Of course, they start with ones like SRY because it's important. So getting the bot to start with its definition of important is not such a bad idea. If I was assured that these 10,000 were "well known" genes I would have no problem with the bot. But in the end, an article must be compatible with Wikipedia policy on notability. AnteaterZot (talk) 07:55, 14 December 2007 (UTC)
Again, I agree with you here, but much depends on what is available for the bot to grab from the databases. Three sentences would be great but might not be available. If they are not, having a bot generate three sentences based on keywords might be possible, but it would not read well. I suspect the titles of the cited papers would be more informative. Personally I see these pages as much more than a mirror since they bring together data from multiple sources. In my mind that alone makes it a worthwhile article. Do you really think the small towns and villages are such a problem? I could understand if they are taking up valuable server space but I don't believe that is the case. Probably they'll get vandalised but what is new, just look at any popular pages if you want to read crude, mindless additions. Subtle vandalism might well be a problem but the original stub is always available in the history so I don't think that would be a huge reliability issue either. David D. (Talk) 08:05, 14 December 2007 (UTC)
The bringing together of multiple sources is a very good thing, very worthwhile.
The geography stubs are only a problem because people use them as a precedent. AnteaterZot (talk) 08:36, 14 December 2007 (UTC)

I just noticed you, AnteaterZot, have been quite active in the schools debate so my example above is quite appropriate. You say that all high schools should be in Wikipedia.

"So, it seems clear to me that high schools deserve the benefit of the doubt on notability, and Wikipedia is richer for their presence" [17]

So I'm intrigued why you would not consider human genes in a similar way? David D. (Talk) 07:53, 14 December 2007 (UTC)

I do like genes, honestly. I give a school the benefit of the doubt because some human took the time to create the article. But I would argue for the deletion of a school if it was an unsourced article even after a five day deletion debate. Picture this; the bot creates 10,000 stubs, and then people begin nominating them for deletion. Now scientists have to waste time defending the article at AfD, and trust me, that sucks if you really care about your article.
But you know, right, that most genes have no known function, and most genes, if knocked out completely, result in little or no change in phenotype? How hard would it be to tweak the bot to avoid the less well understood genes at first? AnteaterZot (talk) 08:07, 14 December 2007 (UTC)
Clearly we already have a subset of genes I assume the 10,000 represent the ones for which we do know the function (Andrew?). The number for unknown function is about 40%, and the 10,000 genes represent less than the 60% of known genes. And there is no doubt that amorphs often have no obvious phenotype but I don't think function at the phenotypic level should be a requirement for inclusion. Function at the cellular level is often more important so the gene ontology and expression patterns are where the real information lies in these articles. David D. (Talk) 08:18, 14 December 2007 (UTC)
Function would be nice, but not a rigid criterion for inclusion. A known function, explained ever so briefly, would probably save an article from deletion. AnteaterZot (talk) 08:24, 14 December 2007 (UTC)
Can you give an example of what you mean by this? Key protein domains or the biochemical pathways? Doesn't the preamble we currently get add that type of information? Or do you mean saying where it is expressed? David D. (Talk) 08:30, 14 December 2007 (UTC)
For example, DNM2 says enough to give me a hint as to what it does. The bot did it all by itself, too. But I had to click on a lot of stubs to find it. AnteaterZot (talk) 08:42, 14 December 2007 (UTC)
As for AfD, well, that is beyond the scope of those involved; I don't see us rushing to AfD to save them all. If users cannot see the inherent usefulness of such articles from a recruitment perspective, from an internal linking perspective for biology and medical articles or from a general interest perspective then I guess it is doomed. My bet is that some so-called deletionists would start the process and would end up bashing heads strongly with the usual bunch of inclusionists. Not a debate I care to join. As for tweaking the bot to mine data for a subset of the more important genes, I'll let Andrew address that issue. David D. (Talk) 08:30, 14 December 2007 (UTC)
It will probably work out, with or without my suggestions. But what you have in me is a person who knows Wikipedia's rules, and knows something about genes. I would be happy to help out in a positive way. Right now I have to get some shut eye. I have one more question, why was the bot stopped in November? AnteaterZot (talk) 08:57, 14 December 2007 (UTC)
Well, the stopping of the bot in November would be my fault. Basically, This project had eaten up so much of my time that I was falling behind in my classes. Since my last final exam was a few days ago, I'm getting the bot back up and running. As I type this, PBB is in the process of running a few more genes. Also, as for the citations strategy: Initially, we only included the Review papers in the reference section, but for some Wikipedians that wasn't enough (See the long debates above). So we started supplementing with additional papers not of the review type. To make this selection, we try and pick up the more recent papers. We also pick the papers via querying PubMed, which has papers linked by gene number. If PBB ends up using a paper that isn't a great reference, it is because someone behind PubMed thought that the paper had relevance to the gene. Also, something else to keep in mind is that PBB isn't a one-time ordeal. It has been made to update and maintain all these gene pages over time - even if no one else cares about a particular page for a time, PBB will still care for it (and by extension the Operators behind PBB will care for those pages - PBB isn't fully automated - it does a lot on its own, but not everything. :) ). Also, since we are in a constant process of upgrading PBB's capabilities, we like to hear suggestions for improvement. Currently, we are running version 1 of the bot, but version 2 is in development, so sometime in the future, version 2 will run back over all the gene pages, adding content or correcting references, etc. In version 2 we could certainly adopt a Reference search which would give priority to those articles with the name of the gene in the title. So keeping in mind that the Bot is not a finished product, it would be great if these suggestions could go in the "future ideas" for PBB. 
We really want to make the Bot as useful as possible, but we had to draw the line somewhere and start the whole process, otherwise no one would notice the bot and actually make any suggestions for improvement. :) So I hope this helps clear a few things up for everyone. I'm sure that when version 2 of the bot is up and running, making corrections and updates to pages, it will spawn even more suggestions, so you'll be able to expect a Version 3 of the bot to come after it as well ;) We want this bot to be very useful and we are working our hardest on making sure that happens (just keep in mind that it is an evolving process and would be impossible to make it "100% shiny platinum" on our first try, but I like to think that we are getting closer). JonSDSUGrad (talk) 21:09, 15 December 2007 (UTC)
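As a rough illustration of the reference strategy described above (reviews first, supplemented by other papers, with a possible version 2 preference for papers that name the gene in the title and for more recent papers), here is a minimal Python sketch. It is not the bot's actual code: the `Paper` record, the `rank_references` function, the `limit` parameter and the way the three preferences are combined into one sort key are all assumptions made for the example. Titles and years are the ones cited earlier in this thread.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical reference record."""
    title: str
    year: int
    is_review: bool = False


def rank_references(papers, gene_names, limit=5):
    """Order papers: reviews first, then title matches, then newest first."""
    def sort_key(paper):
        title = paper.title.lower()
        in_title = any(name.lower() in title for name in gene_names)
        # False sorts before True, so negate the properties we want first.
        return (not paper.is_review, not in_title, -paper.year)

    # sorted() is stable, so ties keep their input order.
    return sorted(papers, key=sort_key)[:limit]


papers = [
    Paper("The G protein-coupled receptor repertoires of human and mouse", 2003),
    Paper("Neuropsin (Opn5): a novel opsin identified in mammalian neural tissue", 2003),
    Paper("Seven evolutionarily conserved human rhodopsin G protein-coupled "
          "receptors lacking close relatives", 2003),
    Paper("The opsins", 2006, is_review=True),
]

for paper in rank_references(papers, ["OPN5", "neuropsin", "GPR136"], limit=3):
    print(paper.year, paper.title)
```

Under these assumptions the review comes first, the paper naming the gene in its title second, and the remaining genome-scale papers fall to the bottom of the list, where a small `limit` would cut them off.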
Good news! Thank you for your great work! I think there are no notability problems here. The real practical questions are different. First, the bot produces too much work for volunteers. Let's consider the last log file, for example [18]. It includes 25 pages to be created. Among them, 18 could be created automatically by the bot, since they have no exact match for both "short" and complete gene names. Typically, the bot does not create an article about a specific Protein kinase D1 because there is already an article about a family of the proteins (Protein kinase domain, for example). But these subjects are different. I bet that the bot can always create an article if there is no exact namespace match for both the short and long names. Two other substantial improvements would be including the domain structure and protein-protein interactions, which can be extracted automatically from UniProt and Entrez gene records, respectively. Having said that, I enthusiastically support your hard work of placing proteins in Wikipedia. Biophys (talk) 22:28, 15 December 2007 (UTC)
Typically, the bot does not create an article about a specific Protein kinase D1 because there is already an article about a family of the proteins. Sorry Biophys, this isn't correct. Most flags are due to namespace conflicts with gene aliases, and the spate of recently-created pages are due to changes we made to resolve that issue. Conflicts with gene family pages are less common, and those will continue to be flagged for manual inspection by a human volunteer since those are difficult to differentiate in an automated way. AndrewGNF (talk) 00:58, 16 December 2007 (UTC)
Wow! I just looked at the log file above [19] and can see that all "short name" articles have now been created, although the articles still remain in the log file to be checked by a volunteer. That is a good compromise decision that simplifies the work of a volunteer, but still asks for a manual check. This way anyone can go pretty fast through the log file without making too many copy and paste operations, and only merge what should actually be merged. Great. This resolves my first concern. As for the others, they are just wishes for the future. Thanks. Biophys (talk) 02:14, 16 December 2007 (UTC)

Verifiability issues

According to Wikipedia:Verifiability, the content of WP articles should be "verifiable", which means "that any reader should be able to check that material added to Wikipedia has already been published by a reliable source". Articles generated by the bot satisfy this core policy twice. First, the content of the articles is supported by established scientific databases, and the links to the databases have been provided. Second, the articles cite supporting publications from scientific journals. Perhaps the title "Further Reading" is misleading and should be replaced by the title "Sources", because these are sources. There are no other problems. As long as the core Wikipedia:Verifiability requirement has been satisfied, the exact citation style (that is, "in line" or not) is only a matter of convenience for a reader, not a requirement. All cited sources satisfy WP:Source. Note that biological databases are reliable secondary sources. Biophys 20:50, 13 November 2007 (UTC)

I agree with Biophys' comments regarding satisfying Wikipedia:Verifiability. Although if we wanted to create an inline citation, we could certainly do something like this. But I think the original version is much more concise and clear. AndrewGNF 21:59, 13 November 2007 (UTC)
Good answer! This is actually correct inline citation. It does not make the article any worse, but it also does not make anything better. So, that is something to avoid unless others make a big issue of it.Biophys 22:17, 13 November 2007 (UTC)

The problem with this ProteinBoxBot is that in some cases of RNA expression profiles (e.g., Avpr1a and Avpr2), the data are clearly wrong. Therefore, although you are referencing a serious effort, each gene's expression should be verifiable by peer-reviewed sources. Most scientists realize that these "production scale" databases are useful starting points but are not to be taken as gospel. The Wikipedia readers will not be consulting other peer-reviewed sources so it is incumbent upon us to provide reliable information. AlbertHall (talk) 15:28, 16 December 2007 (UTC)

Thanks for your input. First, one should note that in cases where the patterns are obviously wrong, feel free to delete the images from the infobox and set the "update_protein_box" to no. Second, we feel that those images are of pretty high-quality (though as mentioned before I'm biased). Moreover, I think the expression pattern in human anatomy of a given gene is an important aspect of gene function (just as important as knowing it's a kinase, for example). So, having said all that, do you have a recommendation regarding changes to the default behavior of PBB? AndrewGNF (talk) 16:21, 16 December 2007 (UTC)
One problem with the gcRMA data (RNA profile) is that it includes various tissue cultured cells and tumors. If the gene is overexpressed in one or more of those, the median is skewed. I would look for a database that only used normal human tissues. AlbertHall (talk) 18:27, 16 December 2007 (UTC)
True, though I don't think there are so many of those samples that will substantially change the median. (The median, of course, is rather robust to a small number of outliers.) Having said that, we're still open to any specific suggestions/alternatives. AndrewGNF (talk) 18:34, 16 December 2007 (UTC)
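The point about the median being robust to a small number of outliers can be checked with a toy calculation. The numbers below are made up purely for illustration (they are not gcRMA values): ten "normal tissue" readings plus two very high readings standing in for tumor or cell-line samples.

```python
from statistics import mean, median

# Made-up expression values for illustration only.
normal_tissues = [10, 12, 11, 9, 13, 10, 12, 11, 10, 12]
overexpressing_samples = [400, 650]  # hypothetical tumor / cell-line readings

all_samples = normal_tissues + overexpressing_samples

print("median, normal only:", median(normal_tissues))
print("median, all samples:", median(all_samples))
print("mean, normal only:  ", mean(normal_tissues))
print("mean, all samples:  ", mean(all_samples))
```

In this toy case the two extreme samples move the median only from 11 to 11.5, while the mean jumps from 11 to almost 97. The median would shift noticeably only if such samples made up a large share of the panel, so whether the concern applies depends on how many tumor and cell-line samples the dataset actually contains.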

A proposal

I suggest the following. Let's name each article "ABCD (gene)" instead of "ABCD" (where ABCD is the name of a gene). This way we avoid any present and future namespace conflicts. Then, we can go ahead and generate automatically pages about all human genes described in at least a few publications. Of course, there are two possible objections here. First, we might create many pages to be merged with the existing pages. However, this is not really the case. When someone had created in the past an article about a Tubby protein for example, what he meant was a family of proteins, or a group of enzymes with certain catalytic activity, etc., not a human gene. Therefore, there is no need for merging in 95% of cases. Second, we might create "orphan articles". One way to handle this problem is a semi-automatic creation of protein family articles with links to individual human proteins, which seems to be doable (Banus and I have generated a couple of files with "semi-preps" of WP family articles). So, what do you think? I still suggest an automatic creation of human gene articles by the bot. Otherwise, this project is not feasible. Biophys (talk) 14:01, 12 December 2007 (UTC)

That's an interesting idea. It has the advantage of being fast, but the main problem is that it will create two sets of articles, the "ABCD" ones we have at the moment and a new set of "ABCD (gene)" articles. I'm not sure if this will really save us any work. We will still have to check manually to see if an equivalent article on "ABCD (gene)" exists and then merge, if there is a page, or rename "ABCD (gene)" if there is not. Tim Vickers (talk) 16:53, 12 December 2007 (UTC)
Hmmm, I'm not a huge fan of the "(gene)" suffix for all genes. I understand that we could substantially increase the rate of gene pages created, but I think they'd be at sub-optimal locations. Clearly though, the bottleneck is in manual inspection to resolve namespace conflicts. One idea we're working through is to change the way conflicts are flagged. Currently, a gene is flagged if any of its symbol/name/aliases hit any existing WP pages, but we're thinking of changing so that conflicts are only flagged if the page contains one of the other protein box templates. My guess is that would decrease by about half the number of pages that need manual inspection. We're still working on the idea/implementation though...
It's also worth noting that the PBB code has also been posted in a public repository ([20]). If people want to make modifications to the code for new/better functionality, go for it. (But also note, of course, that those changes either need to be submitted back to us to run under the PBB account, or you need to get approval for the modified bot.) AndrewGNF (talk) 17:59, 12 December 2007 (UTC)
No, there is no bottleneck with namespace conflicts. If an article with the exact name "ABCD" does not exist, the bot should simply create it. If an article with the name "ABCD" exists, the bot should create an article called "ABCD (gene)", which is a proper way to call it. The only shortcoming of this is not making redirect pages from alternative gene names, but that is of secondary importance. In reply to the concern by Tim, the advantage of making this is huge. This is a difference between "to have" (pages about human genes) and "not to have". As soon as the pages are created, all pages of "(gene)" type can be manually inspected by anyone and only ~10% of them would require actual merging with something (I went through "backlog" pages to check). So, what's the problem? This is normal process. Just merge the content and mark one of the articles as "Prod". Let's run the bot. It did not create automatically anything during last several weeks. Biophys (talk) 19:57, 13 December 2007 (UTC)
Another point. It is important to have a systematic approach here. Let's run the bot for all notable (based on the number of PubMed references) genes. The argument that "a gene was placed as number 4,566" (based on a certain highly artificial criterion) does not really work. The SH2 genes mentioned above are very notable based on any human or WP criteria. Biophys (talk) 20:06, 13 December 2007 (UTC)
You know you're loved when people notice your absence... ;) On the issue of namespace conflicts, it comes down to whether (in the short term) we err on the side of creating more pages in the main namespace that need to be resolved by humans later (the stance you're advocating), or whether we err on the side of creating fewer pages and exercise great caution to prevent duplicate pages (the approach the bot has taken so far). Either is reasonable, but #2 is the approach we got approved for, and it goes along with the idea that bots should be somewhat less bold in mass edits. I mentioned above an approach that we're thinking about to reduce the number of manual inspection flags, which will hopefully increase the rate of page creation. On the issue of PBB's notable hiatus, I hate to use the stock answer again, but you'll have to believe us that we're moving as fast as we can given other constraints (some combination of implementation issues, work issues, personal issues, etc.). As far as I know, MCB activities don't fall into any of our official job descriptions, so it's hard for anyone to demand any level of commitment. We do it when we can work it in... AndrewGNF (talk) 21:35, 13 December 2007 (UTC)
Of course, and you're doing it very well. I don't see what the rush is myself, I'd rather do the job well and slowly than quickly cutting corners. After all, we're in this for the long haul. 00:08, 14 December 2007 (UTC)
What commitment are we talking about if the bot can run automatically for numerous pages that have no namespace conflicts, and there are thousands of them?Biophys (talk) 00:20, 14 December 2007 (UTC)
It is better to do it right the first time than to have to undo a lot of work. For example, I don't think the bot should run until my notability concerns have been addressed. AnteaterZot (talk) 00:23, 14 December 2007 (UTC)
It seems that JonSDSUGrad had actually implemented this proposal with "gene" prefixes (in cases when there is a precise match of gene names). I think it actually worked well. I have quickly checked a lot of pages just created by the bot (probably around a hundred?). I found only four cases for merging, and only one duplicate page. However, I also marked for "Prod" seven wrong redirect pages created earlier by people, not by the bot. As soon as these redirect pages are deleted, we should simply rename (move) seven "ABCD (gene)" pages as "ABCD". Not a big deal. Biophys (talk) 06:59, 16 December 2007 (UTC)
Well, actually what we implemented had a little more back-end work than simply creating every gene using the "gene" suffix when the gene symbol was taken. That back-end work was meant to avoid any issues with merging and duplicates, but apparently the system is still imperfect. Can you post here a running list of possible conflicts so that I can look into them? Again, PBB operates under the principle that we should err on the side of avoiding pages that will later need to be merged by hand, so I'm very interested to get this problem fixed. AndrewGNF (talk) 16:43, 16 December 2007 (UTC)
So far I found only one duplicate: COL4A3 (gene). Yes, this should be avoided. As for mergings (I found only MEF2A, HLA-DR52, MUTYH, and SIRT1 - the last one is already merged), the bot did an excellent job. Congratulations! Let's repeat such updates again. One should only check all pages with prefix "gene" created by the bot. There are not too many such pages, and I did just that. They belong to two categories (50:50): (a) such as GRN (gene) (too common abbreviation) - one should make GRN a disambig. page or use an already existing disambig. page; and (b) such as MEF2A (gene) - merging is needed. This is actually good. Anyone like me can easily mark these pages for merging and wait a little for community opinion and participation. That is actually consistent with WP policies. The bot did just great. Someone only needs to do a little bit of follow up work, which is much easier than work with log file, in my opinion. Biophys (talk) 20:41, 16 December 2007 (UTC)
OK, now I realize that you actually did not implement this proposal. Otherwise, articles like PDCD6IP and PSME3 from this log [21] would have been created. Actually, I do not understand why they have not been created (no matches with existing articles). Biophys (talk) 19:25, 17 December 2007 (UTC) Oh... now I see: you did not do "fixes" for that log. Let's fix all logs. Biophys (talk) 19:31, 17 December 2007 (UTC)
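For reference, the naming rule proposed in this thread (create "ABCD" if the title is free, otherwise fall back to "ABCD (gene)") can be sketched as follows. This is only an illustration of the proposal as worded above, not PBB's actual code; the function name choose_title and the set of existing titles are hypothetical, and the back-end checks that the bot operators mention are not modelled.

```python
from typing import Optional, Set


def choose_title(symbol: str, existing_titles: Set[str]) -> Optional[str]:
    """Pick a page title for a gene symbol under the proposed rule.

    Returns the bare symbol if that title is free, the "(gene)"-suffixed
    title if only the bare symbol is taken, and None if both are taken
    (a case that would still need manual inspection).
    """
    if symbol not in existing_titles:
        return symbol
    suffixed = f"{symbol} (gene)"
    if suffixed not in existing_titles:
        return suffixed
    return None


# Hypothetical set of titles that already exist on the wiki.
existing = {"ABCD", "WXYZ", "WXYZ (gene)"}

print(choose_title("EFGH", existing))  # title is free
print(choose_title("ABCD", existing))  # bare title taken, suffix free
print(choose_title("WXYZ", existing))  # both taken: needs a human
```

Note that this sketch only checks exact title matches; redirects from alternative gene names and aliases, which the thread identifies as the source of most manual work, are deliberately left out.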

Summary and consensus?

AnteaterZot, certainly good to have more knowledgeable people commenting on and participating in this bot project. (It is worth noting that others who are also versed in both WP policy and in biology have helped craft the bot specs, but of course, we can never have too much expertise here...) Assuming you support the general principle (as it seems like you do), can I ask you to distill the discussion above into any points that you think are fundamentally flawed with the bot as it has been running? We will attempt to address those now, and all the "nice to have" feature requests and changes can be incorporated in Version 2 specs, after the first round of stub creation is done.

As you can imagine, the more people who become aware of the project, the more voices we have to try to satisfy. And getting absolute consensus has been difficult/impossible (as evidenced by this lengthy talk page). As the bot owner, I am inclined to only make changes if a super-majority here support it. Otherwise, I'll default back to the specs that were approved by the Bot Approval Group. Unfortunately, if we stop for every new concern/discussion, we risk not doing anything at all. AndrewGNF (talk) 19:57, 14 December 2007 (UTC)

<moved back to policy page>, this isn't the place to discuss policy. Tim Vickers (talk) 20:33, 16 December 2007 (UTC)
My reason for placing that there was to show the overwhelming belief of university librarians and others that journal articles are primary literature, and review articles secondary. AnteaterZot (talk) 01:28, 17 December 2007 (UTC)
Btw, on Wikipedia:Etiquette#A few things to bear in mind, it says, "Though editing articles is acceptable (and, in fact, encouraged), editing the signed words of another editor on a talk page or other discussion page is generally not acceptable, as it can alter the intent or message of the original comment and misrepresent the original editor's thoughts. Try to avoid editing another editor's comments unless absolutely necessary." AnteaterZot (talk) 01:31, 17 December 2007 (UTC)

Now, I would like to see if I can put a positive spin on things. According to User:JonSDSUGrad the bot was capable of detecting review articles and listed only them in the reference section; "as for the citations strategy: Initially, we only included the Review papers in the reference section, but for some Wikipedians that wasn't enough (See the long debates above)."

Proposal one: Since the bot can detect review articles, why can't it make a three-tiered refs list? Something like "Review articles" (all review articles), "Further reading" (all other Entrez links), and "External links"? By labeling the reviews and the not-reviews, we can see if the parameters already in place for inclusion are catching genes that would be notable under my (and every science librarian in the world's) definition. This will also be useful for users and researchers, who can read the review articles first to get an overview, and the primary sources for the details. If it turns out that most of the first 10,000 have reviews, then the few that don't can be dealt with by humans, either to find a review article or to sigh and let it be deleted if somebody tags it. If nearly all of the 10,000 have reviews, then the number of gene stubs could be expanded beyond 10,000. I think you guys are being too pessimistic about the proportion of genes that would be notable under my stringent definition; I suspect most of them will pass muster. AnteaterZot (talk) 11:21, 16 December 2007 (UTC)

Proposal two: Correct me if I'm wrong, but didn't you place some sort of number-of-references requirement on the bot? Was this not an attempt to find the "most important" genes? I say, lift that requirement, since the number of references is irrelevant to notability. All an article needs is one or two references for verifiability, and one reference for notability. Go back to the "20078 that have at least 2 linked pubmeds," and if one is a review, make a stub. Remember, most reviews are going to talk about more than one gene, so a review article can be used on multiple stubs. AnteaterZot (talk) 11:21, 16 December 2007 (UTC)

Proposal three: Go back to the "20078 that have at least 2 linked pubmeds," and if one is a review that mentions it in the Title or Abstract, make a stub. Remember, most reviews are going to talk about more than one gene, so a review article can be used on multiple stubs. This super-stringent definition may reduce the number of stubs, but they will be unassailable for notability. After those stubs are made and digested by the Wikipedia community, then make the remaining stubs by the standards of proposal one or two and see how it goes. AnteaterZot (talk) 11:21, 16 December 2007 (UTC)

Proposal four: Ignore me and users like User:Mumia-w-18, User:Fuhghettaboutit, User:Reywas92, User:Ryan_Postlethwaite and User:Carcharoth, and any others out there and keep going with the bot as is. AnteaterZot (talk) 11:21, 16 December 2007 (UTC)
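To make the difference between proposals two and three concrete, here is a minimal Python sketch of the two selection rules as they are worded above. Everything in it is an assumption for illustration: the Paper record, the field names, and the sample data are hypothetical, and the plain substring match stands in for whatever matching the bot operators would actually use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Paper:
    """Hypothetical record for one PubMed entry linked to a gene."""
    title: str
    abstract: str
    is_review: bool


def passes_proposal_two(papers: List[Paper]) -> bool:
    # At least 2 linked PubMed entries, and at least one is a review.
    return len(papers) >= 2 and any(p.is_review for p in papers)


def passes_proposal_three(symbol: str, papers: List[Paper]) -> bool:
    # As proposal two, but the review must mention the gene symbol in its
    # title or abstract. A naive case-insensitive substring match is used.
    needle = symbol.lower()
    return len(papers) >= 2 and any(
        p.is_review
        and (needle in p.title.lower() or needle in p.abstract.lower())
        for p in papers
    )


# Invented example: a gene with one primary paper and one general review
# that does not name the gene in its title or abstract.
papers = [
    Paper("Cloning of ABCD", "We report the cloning of ABCD.", False),
    Paper("A review of a protein family", "Overview of family members.", True),
]

print(passes_proposal_two(papers))            # True
print(passes_proposal_three("ABCD", papers))  # False
```

The example shows why proposal three is described as the more stringent one: a gene can have a review among its linked papers and still fail if that review never names it in the title or abstract.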

All are reasonable proposals, but please see my comments above regarding the challenges of trying to please everyone. If we modified to accommodate every reasonable proposal, we'd never get anything done. So, again, on the proposals that you feel represent fundamental flaws in how PBB is working in Version 1, please create a new section with Support and Oppose tags. Please also note in the proposal that you are proposing to stop the ongoing Version 1 run to implement the proposal. If there is a consensus to do so, we (the bot operators) will comply. (The other option mentioned above is for you to add your modifications to the Version 2 specs.)
Finally, if proposal 4 is meant to insinuate that we've been ignoring you or other users who have raised concerns, frankly I think that's ridiculous. We've gone out of our way to accommodate everyone's input, and as far as I know, not one of those users you invoke has left the discussion feeling like they've been ignored. If listing their names here is meant to illustrate some sort of pattern, then I think it shows that we discuss all users' comments (often times ad nauseam) and attempt to synthesize a consensus agreement. (For others' reference, [22].) AndrewGNF (talk) 16:14, 16 December 2007 (UTC)
No, Proposal four is my way of saying go ahead if you can't implement my other suggestions. The fundamental flaw is the fact that the bot does not choose genes by notability, but by an arbitrary number-of-articles system. My proposals are my attempt to address that flaw. I don't know what you mean by "create a new section with Support and Oppose tags." Could you first tell me which of the first three proposals are technically feasible? I really want any change you guys make to be easy for you. AnteaterZot (talk) 01:24, 17 December 2007 (UTC)
The fundamental flaw is the fact that the bot does not choose genes by notability, but by an arbitrary number-of-articles system. Again, not sure I follow you here. Are you taking issue with the order in which gene pages are created? Or are you saying we are under-creating pages (as you seem to allude to in proposals 1 and 2 above)? If it's either of these two, then I'd suggest that you wait until the Version 1 pass is done. The end result will be the same (i.e., order of creation is a moot point), and we can of course add more gene pages later.
Regarding the technical feasibility, anything is feasible, so for the moment, let's ignore this issue. Let's decide if anything is inconsistent with WP policy, and given a proposal to fix it (and a consensus of interested parties), we'll make it happen. But if we're talking about things to make PBB better, please make a note on the V2 specs and wait until then. Any changes now really slow things down, so we're dealing only in the essentials now. AndrewGNF (talk) 05:06, 17 December 2007 (UTC)
I've said it many times; without at least one secondary (read: "independent") source to attest to the notability of a gene, the gene fails WP:N. My Proposal one, to somehow label the review articles, seems to me to be the easiest to implement without slowing you down. AnteaterZot (talk) 21:44, 17 December 2007 (UTC)

MiszaBot archiving

Following up on Andrew's comment above, I added the config instructions for MiszaBot to this page. There is a 10-day timeout (adjustable). Within 24 hours it should start taking away old threads, and adding them to User_talk:ProteinBoxBot/Archives/Archive1, which I believe it is smart enough to create. See the instructions at User:MiszaBot/Archive HowTo. EdJohnston (talk) 20:58, 21 December 2007 (UTC)

Super, much appreciated... AndrewGNF (talk) 21:08, 21 December 2007 (UTC)

Proposal to tag review articles

As proposed by AnteaterZot above, halt the currently ongoing Version 1 and modify the bot to make a three-tiered refs list: "Review articles" (all review articles), "Further reading" (all other Entrez links), and "External links".
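For anyone weighing the proposal, the three-tiered list amounts to a simple grouping step, sketched below in Python. This is a hedged illustration only: the function name tier_references, the dictionary keys "citation" and "is_review", and the sample entries are hypothetical, and it assumes the bot already knows which Entrez-linked papers are reviews, as stated earlier on this page.

```python
from typing import Dict, List


def tier_references(
    entrez_links: List[dict], external_links: List[str]
) -> Dict[str, List[str]]:
    """Split references into the three proposed sections.

    Each item in entrez_links is assumed to be a dict with a "citation"
    string and an "is_review" flag (hypothetical field names).
    """
    reviews = [r["citation"] for r in entrez_links if r["is_review"]]
    further = [r["citation"] for r in entrez_links if not r["is_review"]]
    return {
        "Review articles": reviews,
        "Further reading": further,
        "External links": list(external_links),
    }


# Invented sample input.
entrez = [
    {"citation": "Author A (2001). A review.", "is_review": True},
    {"citation": "Author B (2003). A primary paper.", "is_review": False},
    {"citation": "Author C (2005). Another primary paper.", "is_review": False},
]
tiers = tier_references(entrez, ["Entrez Gene entry"])

for heading, items in tiers.items():
    print(heading, len(items))
```

A gene whose "Review articles" list comes back empty would be one of the cases the proposal wants flagged for human attention.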

Support

  1. AnteaterZot Might I just add that having the review articles highlighted is useful beyond assessing notability, and that appearances in databases (as named below) is rather like almanac information, and does nothing to establish notability? For example, a database exists for nearly all movie actors–IMDB–but not all actors are notable. Ooh-ooh! I've got another example; a bot makes a stub for every star in the galaxy that has been given a name, has a known position, spectrum, size, whatever. Once you get beyond the brightest ones, are they notable? AnteaterZot (talk) 00:20, 18 December 2007 (UTC)
Comment. I agree that "orphan" proteins without an established biological function are insufficiently notable for WP. But we can not easily sort them out using any simple and formal criteria. I think the best approach would be to create such articles (including olfactory receptors), look at them, and label some of them with "notability" template, just as any ordinary article. But so far, I have seen only one such potentially non-notable non-GPCR article. If a lot of such articles will begin to appear, it would mean that run "1" of the bot is close to the end. But we are still very far from that point.Biophys (talk) 20:46, 23 December 2007 (UTC)

Oppose

  1. AndrewGNF (talk) 22:56, 17 December 2007 (UTC) -- Seems like it's addressing a non-issue to most people, and even worse, letting a rule get in the way of something we all think is useful. (I'm tempted to just WP:IAR, but instead, we'll obey the consensus here...) Finally, I think database links in Entrez Gene, Ensembl, Uniprot, etc. satisfy the need for secondary sources.
  2. Boghog2 (talk) 23:04, 17 December 2007 (UTC) -- ProteinBoxBot should be able to retroactively improve/sort the citations in existing articles if necessary. "Damn the torpedoes, full speed ahead"! My personal opinion is that a manuscript that has been published in the peer reviewed literature is both a primary citation (since it reports the original research) and a secondary citation by virtue of being reviewed by two anonymous referees and the journal editor. Furthermore, each of the articles that has been produced by ProteinBoxBot is of a gene whose name has been approved by the HUGO Gene Nomenclature Committee and cited in the National Center for Biotechnology Information which adds an additional level of verification and notability.
  3. Biophys (talk) 00:11, 18 December 2007 (UTC). First, WP:Verifiability only recommends using secondary sources whenever possible but does not require this. If secondary sources are unavailable (for example, there is a notable event just reported in a few newspapers), an article can be based exclusively on the primary sources. Second, an "original" non-review paper may be either a primary source or a secondary source (if the protein is only mentioned in Introduction and/or Discussion of the article). This has to be decided on a case-by-case basis. Third, as Boghog2 said, HUGO and NCBI approval adds reliability and notability. Finally, I believe that a record in an established database, such as Entrez Gene, represents a reliable secondary source. Biophys (talk) 00:11, 18 December 2007 (UTC)
  4. EdJohnston (talk) 01:59, 18 December 2007 (UTC) I am sympathetic with the original bot approval of ProteinBoxBot. I wouldn't support the change recommended by AnteaterZot unless I could see clear benefits. There is no need to interrupt the work of the Version 1 bot before the first pass is done. I do not worry that too many articles will be created, since the bot was approved on September 14 to go ahead and create 10,000 articles, and due to the caution of the developers, we are still far short of that number. The greatest risk is doing the job badly.
  5. David D. (Talk), I favor AnteaterZot's idea of having references tiered but I think it is important to get the first run done. Since there will be a second run I think it will be fine to wait until then for fine tuning. —Preceding comment was added at 17:10, 18 December 2007 (UTC)
  6. Since the classification of scientific articles as either primary or secondary sources is presently unclear, I think the idea of tagging review articles with REVIEW and regarding the rest as reliable sources is the best long-term approach. Tim Vickers (talk) 16:46, 21 December 2007 (UTC)

Interested parties

Is it a good idea to run this by people who have previously expressed their support for the bot and those who have previously expressed a concern about notability? AnteaterZot (talk) 23:06, 17 December 2007 (UTC)

I think a fair number of people have this page on their watchlist, so I wouldn't be surprised if we get a fair turnout. OTOH, if you'd like to link here from the other relevant discussions, feel free... AndrewGNF (talk) 23:14, 17 December 2007 (UTC)
I can't change an archive, so I was thinking of just alerting some folks on their talk pages. Is that cool? AnteaterZot (talk) 23:41, 17 December 2007 (UTC)
Don't see why not. Though I pity anyone who is going to try to get up to speed "cold" by reading the discussions above and at WP:NOR and WP:N.  ;) AndrewGNF (talk) 23:49, 17 December 2007 (UTC)
I also do not object. But if that was an AfD discussion of a "hot" political topic, that would be considered "forum shopping" and against WP policies. Biophys (talk) 00:15, 18 December 2007 (UTC)
Yes, that's why I asked first, although I think it is called canvassing, not forum shopping. AnteaterZot (talk) 00:26, 18 December 2007 (UTC)
I'm not going to ask them. If you guys want to be certain that you have consensus, consider writing them yourselves. AnteaterZot (talk) 01:09, 18 December 2007 (UTC)

Alternative Idea

Just to throw this out here (and I am talking for Version 2 of the bot) as an alternative to adding a whole new section to each gene page, why not just add a small bold REVIEW to each article that deserves such a tag? It would be easy for anyone to scan the references for review articles then and make a distinction between review and non-review. Plus it wouldn't add another section to the page (PBB has no control over page sections except on initial creation), but would still accomplish what was suggested and allow PBBv2 to make the change to all gene pages in the future that allow for changes in the citations. It would be a non-issue to make a small change to the citation template to incorporate a "Review?" field and then PBB's code could be modified to take advantage of the new option... just my $.02 JonSDSUGrad (talk) 19:47, 18 December 2007 (UTC)

I'll put it on the list of things to do. If a few more people chime in that this is a great change they would like to see, then I might move up its priority on that list.. :) JonSDSUGrad (talk) 23:42, 20 December 2007 (UTC)

On a completely different subject, while chasing the bot around last night, I discovered Lactoylglutathione lyase is another name for GLO1 (Glyoxalase I is a redirect to Lactoylglutathione lyase). I'm not sure, but I think that the L.L. page is about Glyoxalase I in all species, right? Nevertheless, should the bot include a link to an EC number where a listing exists? (For an example, see EC 4.4.1.5 for Glyoxalase I.) AnteaterZot (talk) 23:49, 18 December 2007 (UTC)

It's already in the protein box, the third "identifier" at the top. It currently links to this page. David D. (Talk) 06:33, 19 December 2007 (UTC)
I can't speak to the specific example you gave above, but more generally, better treatment of EC numbers is one of our Version 2 improvements. While there is a slot for it in the protein box template, EC numbers currently aren't retrieved by PBB (flaw in the underlying database actually, not PBB's fault). Especially with the parallel effort to get pages for all EC numbers, this is high on the list of improvements. In the mean time, for pages being merged with PBB content, we make a special note to transfer EC designations when available in the old infobox to the new PBB infobox. AndrewGNF (talk) 16:59, 19 December 2007 (UTC)
Great. I strongly recommend to change the colors of some of the infoboxes so that human gene pages can be rapidly distinguished from these other page types. Beige is so boring. AnteaterZot (talk) 00:31, 20 December 2007 (UTC)
Have at it... Changes to how the underlying templates are rendered can be done by anyone, and PBB doesn't care one bit. So if anyone with an eye for that sort of thing wants to mess around with that, you'll get no objection from me. (Well, not before I see the change anyway...) AndrewGNF (talk) 00:56, 20 December 2007 (UTC)
Color is changed. AnteaterZot (talk) 02:09, 20 December 2007 (UTC)
I can't say that I love the new colors, but I don't hate them either. I'm willing to let it stew for a while to see if it grows on me. Other than that, let's see how the court of public opinion reacts. (Maybe we can do seasonal colors -- red and green in December, yellow and purple around easter, orange and black around halloween...)  ;) AndrewGNF (talk) 02:21, 20 December 2007 (UTC)
I tried out a few colors. I could try a few more. It's tricky; you have to avoid grey, beige, dark colors and blues. I avoided bright colors too. I wanted something that would be unobtrusive but would look different from the protein template. AnteaterZot (talk) 02:30, 20 December 2007 (UTC)
I would write more, but my wife would tell me I have no basis on which to comment on the aesthetics of color. AndrewGNF (talk) 02:44, 20 December 2007 (UTC)
Gotta say, I'm not much one for the "dusty rose" color - the contrast just isn't there.. I kinda liked the original gold.. The best thing would be to adjust the background grey and the rose color at the same time and see if you can get a nice combo that way. I don't have time to tinker with the colors right now or I would give it a shot.. :) JonSDSUGrad (talk) 23:55, 20 December 2007 (UTC)
Agree with JonSDSUGrad. I also like old color better.Biophys (talk) 00:38, 21 December 2007 (UTC)
Given lukewarm reception, I reverted the color change. But still open to alterations. Take further discussion over at Template_talk:GNF_Protein_box? AndrewGNF (talk) 00:51, 21 December 2007 (UTC)
My goal was to make the color different from the protein family infoboxes. I was not wedded to the dusky rose. I'll try out a few other colors, nothing permanent need be assumed. AnteaterZot (talk) 01:09, 21 December 2007 (UTC)
Since this isn't really a PBB issue (and since this talk page is long enough), let's take any followups to Template_talk:GNF_Protein_box#Color_scheme. Incidentally, if anyone readily knows how to configure one of the archiving bots to scan this page and wants to do us all the great favor of setting it up, that would be great... AndrewGNF (talk) 01:32, 21 December 2007 (UTC)
My word, glyoxalases! How exciting. To be completely clear EC 4.4.1.5 applies to all glyoxalase I enzymes except those in the Trypanosomatids, since these organisms contain a novel trypanothione-dependent glyoxalase system. This was discussed first in the particularly fine article PMID 15329410 from a small lab in Dundee. Tim Vickers (talk) 16:44, 21 December 2007 (UTC)
So do you support Dundee or Dundee United? Or are you an atheist :) David D. (Talk) 18:08, 21 December 2007 (UTC)
I'm agnostic, I'm not sure if either team is really better than the other. :) Tim Vickers (talk) 12:46, 23 December 2007 (UTC)

List of existing proteins

If you look at Special:Whatlinkshere/Template:Protein you'll see many articles that already exist for genes the bot might create/has created a stub for. AnteaterZot (talk) 02:09, 20 December 2007 (UTC)

Yes, we should look at these pages and mark some of them for merging. But this will be most efficient after completion of the first run by the bot. I think ProteinBoxBot is running very smoothly. Almost half of links to human genes in protein family pages now appear in blue. Good job! Let's keep the bot running.Biophys (talk) 20:32, 21 December 2007 (UTC)
I went through a few logs just in case. Not only is it much faster now, but it allows much better quality work, when an editor can focus on real alternative names of each protein indicated in the UniProt entry, instead of dealing with "junk" gene names in Entrez gene. Unfortunately, UniProt links are sometimes missed by the bot. Note that UniProt codes are included in each Entrez Gene entry, so I do not know what is happening. Otherwise, everything is fine. That is a significant improvement over the previous version. Biophys (talk) 01:50, 23 December 2007 (UTC)
Glad you like it, and thanks again for the help. Just to note, half of the improvement is that we're skipping most of the "hard cases" that will need more manual attention later. But you're right, doing the low-lying fruit first really speeds things up and makes volunteering much easier. (Sorry, couldn't resist the plug.) We'll check on the Uniprot issue. AndrewGNF (talk) 01:57, 23 December 2007 (UTC)
Where is the list of difficult/missed cases? Is not it in a log?Biophys (talk) 01:59, 23 December 2007 (UTC) Are you talking about pages with "gene" prefix?. That is fine to have.Biophys (talk) 02:00, 23 December 2007 (UTC)
We've actually just set them aside for the time being. Anything that has a high likelihood of hitting an existing page, we'll save for later. The ones that we think do not have an existing page are easy, so we'll get through those all first. (Of course, we falsely allow some pages to pass, which is why we still need volunteers to double check the work.) Anyway, it's easier if all the easy cases are lumped together and all the hard cases are lumped together. And obviously, might as well do the easy cases first... AndrewGNF (talk) 02:12, 23 December 2007 (UTC)
I agree. It just would be good to make a log file with problematic cases somewhere. But let's keep bot running in present state.Biophys (talk) 02:46, 23 December 2007 (UTC)

Btw, you can see and/or count all instances of the use of the infobox here: Special:Whatlinkshere/Template:GNF Protein box. It looks like it lists them in the order they were created. Right now it is at 5,391. AnteaterZot (talk) 04:11, 23 December 2007 (UTC)

My count is 4552, after limiting to the main namespace. This link is the one I use... The order listed by "What links here" is not the order created, but it seems to be a rough ordering of "importance" (# links? # edits?). Anyway, WP's order of importance roughly agrees with PBB's order of creation, but not exactly... FWIW... AndrewGNF (talk) 05:41, 24 December 2007 (UTC)
Just do not rely too much on this formal ranking. At least two thirds of important/notable human proteins/genes are missing now. Biophys (talk) 17:01, 24 December 2007 (UTC)

Turning off summary update

I made a stub for Metastasis suppressor, then linked it to a number of pages the bot made. I then turned off the summary update to prevent the bot from overwriting my changes. Was that necessary? Or should I not have done that? AnteaterZot (talk) 05:54, 23 December 2007 (UTC)

If you don't modify the text enclosed in the PBB summary template, as you have done on TIMP1, you can safely keep its update status on "yes". The bot updates only specific sections marked off by templates. By the way, I'm glad to see new articles filling the gaps between gene pages ;) --Banus (talk) 10:09, 23 December 2007 (UTC)

Commons

Would it be possible for this bot to upload to the Commons instead of to here? —Remember the dot (talk) 19:51, 5 January 2008 (UTC)

Yes, that is a change we are considering for version 2. A quick scan of WP:MITC suggests there are many issues to consider if we are to do it right. The plan for the immediate future is to finish the previously-approved version 1 run. AndrewGNF (talk) 17:09, 7 January 2008 (UTC)

Bot flag

Your new page creations are not flagged as bot creations, making it impossible to hide them on new page patrol. This is a bit annoying. Could this be solved? Fram (talk) 15:55, 14 January 2008 (UTC)

Thanks for the note. Yes, we'll definitely look into this. AndrewGNF (talk) 19:23, 14 January 2008 (UTC)

Blogosphere

In case anyone is interested, the PBB effort has been blogged... [23] AndrewGNF (talk) 19:24, 14 January 2008 (UTC)

Great, a very interesting discussion. Tim Vickers (talk) 19:37, 14 January 2008 (UTC)
Was a more solid article about PBB submitted anywhere? I did not know about Proteins Wiki. Please see the organization of their protein articles: [24]. It includes "Protein interactions", "Domain structure" and other sub-headings missing in PBB articles. Biophys (talk) 21:16, 14 January 2008 (UTC)
Gene/protein wiki efforts shouldn't be judged based on the stubs, whose primary purpose is to seed contributions from the larger (human) community. [25] AndrewGNF (talk) 18:32, 15 January 2008 (UTC)
Right, I agree 100%. Also, there is no question that your stubs are much better than their stubs. I just wanted to say that everyone (including the authors of Protein Wiki) understands the importance of domain structure and other information that could be automatically extracted from the databases by your bot, but was not extracted. That could be criticized by reviewers if you submit a paper. But you know better. Perhaps it has been already published. So I asked. Biophys (talk) 19:01, 15 January 2008 (UTC)
Yup, no harm in the reminder and you also know it's on our V2 to-do list. But the real point actually is not that our stubs are better. (In fact, perhaps the Proteins Wiki ones are better for some people for the differences you noted.) The point I think is that the community here is better/bigger. Where to put the stub is more important than what's in it... (no news on the manuscript yet...) AndrewGNF (talk) 19:10, 15 January 2008 (UTC)

blank spaces

I've noticed that articles generated by ProteinBoxBot tend to have large blank spaces in between the infobox and the initial text. Is there any way that your bot can be reconfigured so that it doesn't create these blank spaces?--69.118.143.107 (talk) 23:34, 21 January 2008 (UTC)

Thanks for the note. We're in the process of finishing up round 1 of the bot run, but I've added your note to the Version 2 specs. Cheers, AndrewGNF (talk) 01:37, 22 January 2008 (UTC)

Image:PBB GE NIPA1 gnf1h07157 at fs.png listed for deletion

An image or media file that you uploaded or altered, Image:PBB GE NIPA1 gnf1h07157 at fs.png, has been listed at Wikipedia:Images and media for deletion. Please see the discussion to see why this is (you may have to search for the title of the image to find its entry), if you are interested in it not being deleted. Thank you. — Cuyler91093 - Соитяівцтіоиѕ 05:51, 27 January 2008 (UTC)

Image Copyright problem

Thank you for uploading Image:PBB Protein NFAT5 image.jpg. However, it currently is missing information on its copyright status. Wikipedia takes copyright very seriously. It may be deleted soon, unless we can determine the license and the source of the image. If you know this information, then you can add a copyright tag to the image description page.

If you have any questions, please feel free to ask them at the media copyright questions page. Thanks again for your cooperation. Flominator (talk) —Preceding comment was added at 08:08, 25 January 2008 (UTC)

Thanks for the note. As you noted, we reference [26] for copyright info, and the first line says "The contents of PDB are in the public domain." This was the basis of tagging all images from the PDB with {{PD-release}}. Is that not sufficient? I also have a vague recollection that Tim had some direct contact with PDB folks, so perhaps he will chime in too. AndrewGNF (talk) 12:47, 25 January 2008 (UTC)

I asked them and got the reply that the contents of the PDB website were in the public domain, they referred me to that statement in their FAQ. Tim Vickers (talk) 17:26, 25 January 2008 (UTC)

Thanks for the response Tim. In my early morning stupor, I didn't notice the link to the media copyright questions page. I'll take the rest of the discussion over there. AndrewGNF (talk) 18:20, 25 January 2008 (UTC)

I raised the issue at Wikipedia:Media_copyright_questions#Protein_Data_Bank. Tim Vickers (talk) 18:46, 25 January 2008 (UTC)

ProteinBoxBot's uploads

Per the switch to the new pre-processor (see m:Migration_to_the_new_preprocessor#Expected_differences), the trick of passing template parameters via {{!}} no longer works. This was actually a bug in the old preprocessor, which can be verified by the fact that it only worked if the template argument was inside a parserfunction. On {{self}} for example it would not work for the first argument, only 2 and beyond, which were inside #if.

Furthermore, a bot should probably not be using {{self}}. It would probably be best to replace all instances of:

{{self|GFDL-no-disclaimers|cc-by-sa-3.0{{!}}[[Genomics Institute of the Novartis Research Foundation]]}}

with:

{{GFDL-no-disclaimers}}
{{cc-by-sa-3.0|[[Genomics Institute of the Novartis Research Foundation]]}}

Would you be willing to fix ProteinBoxBot's uploaded images to no longer break with the new preprocessor (as per the above suggestion)? --MZMcBride (talk) 02:31, 25 January 2008 (UTC)
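A minimal sketch of the substitution suggested above, for anyone scripting the cleanup. It is an illustration only, not the script that was actually run: the function name `replace_self_template` and the regular expression are assumptions, and the pattern handles only the exact two-license form quoted above (attribution text containing wiki links but no nested templates). Fetching and saving the image description pages is left out.

```python
import re

# Matches only the broken form quoted above:
#   {{self|<license 1>|<license 2>{{!}}<attribution>}}
# Assumption: the attribution contains no nested templates ("}}").
SELF_PATTERN = re.compile(
    r"\{\{self\|(?P<first>[^|{}]+)\|(?P<second>[^|{}]+)"
    r"\{\{!\}\}(?P<attribution>(?:(?!\}\}).)*)\}\}",
    re.IGNORECASE,
)


def replace_self_template(wikitext: str) -> str:
    """Split the {{self}} wrapper into two standalone license templates."""

    def _expand(match: re.Match) -> str:
        # First license stands alone; second keeps the attribution
        # as an ordinary pipe-separated parameter.
        return "{{%s}}\n{{%s|%s}}" % (
            match.group("first"),
            match.group("second"),
            match.group("attribution"),
        )

    return SELF_PATTERN.sub(_expand, wikitext)


if __name__ == "__main__":
    broken = (
        "{{self|GFDL-no-disclaimers|cc-by-sa-3.0{{!}}"
        "[[Genomics Institute of the Novartis Research Foundation]]}}"
    )
    print(replace_self_template(broken))
```

Text that does not contain the quoted pattern is returned unchanged, so the function can be applied to every image description page without first checking which ones are affected.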

Thanks for the note. Yikes, I wish we'd caught that earlier. (I checked the PBB articles under the new preprocessor, but not the images. But on the other hand, I guess we've been making these uploads using that syntax for quite some time...) Do we have an expected timeline by which these changes should be complete? I don't like those malformed templates hanging out either (and I agree with discontinuing the use of {{self}}), but we're trying to balance any changes and cleanup duties with getting the first approved run complete (on which our master's student is depending for his thesis). Any thoughts? AndrewGNF (talk) 13:11, 25 January 2008 (UTC)
The new preprocessor went live today, so all uses of {{self}} with {{!}} are now broken. There are no current plans for the preprocessor to be turned off again, so I imagine we'll need some sort of bot to fix these pages. Removing {{self}} entirely seems like the best option. --MZMcBride (talk) 15:35, 25 January 2008 (UTC)
We'll absolutely fix for all future uploads. The question is what degree of expediency we need to devote to fixing all of the old edits. Since the broken usage only affects the image license templates (as opposed to templates that are commonly visible to the community), I'm hoping that we don't have to drop everything to address it this instant. Agreed, a bot seems like the best option, but of course we don't have one written. I'll inquire over at the bot requests page. AndrewGNF (talk) 18:24, 25 January 2008 (UTC)
Well, even if the display of the image's templates is slightly ugly, they are still informative enough to convey the license information. But... it would probably be nice to bulk-fix them at some point before too many are manually fixed? --Splarka (rant) 17:54, 26 January 2008 (UTC)
Heh. I forgot I'm kinda capable in this area and can do the fixes myself. I've got a script running currently. It should be done in about a day or two.
Also, there are some other things that probably need to be discussed as well. It seems you've been creating two images for every graph and redirecting one (the thumbnail version). However, unfortunately, that redirect "hack" no longer works. Also, I'm not completely sure why two images are needed, though, honestly, I know very, very little about the bot and its work. I was also sort of confused why it's doing all of these uploads to en.wiki vs. doing them to Commons. --MZMcBride (talk) 20:48, 26 January 2008 (UTC)
I swapped the broken link to the redirect for the direct link to the full-size image in AKT1. This looks fine to me, perhaps the simplest way to fix the redirect problem is just to remove the thumbnails? Or is there some technical reason for doing it this way? Tim Vickers (talk) 19:08, 27 January 2008 (UTC)
We had the thumbnails redirect to the full-sized images on David D.'s suggestion (which is a good one still, I think, if we can technically make it work). The rationale was that the image labels on the full-sized images are unreadable when shown at reduced size (as they are in the protein infobox). Better to show a less-detailed but legible axis label. Anyway, if there's not a good workaround to get that behavior back, then I guess we can use the direct link... (MZMcBride, thanks for fixing the image licenses. It was a big help!) AndrewGNF (talk) 20:51, 27 January 2008 (UTC)

Another editor has added the "{{prod}}" template to the article IQSEC1, suggesting that it be deleted according to the proposed deletion process. All contributions are appreciated, but the editor doesn't believe it satisfies Wikipedia's criteria for inclusion, and has explained why in the article (see also Wikipedia:What Wikipedia is not and Wikipedia:Notability). Please either work to improve the article if the topic is worthy of inclusion in Wikipedia or discuss the relevant issues at its talk page. If you remove the {{prod}} template, the article will not be deleted, but note that it may still be sent to Wikipedia:Articles for deletion, where it may be deleted if consensus to delete is reached. BJBot (talk) 20:59, 26 January 2008 (UTC)

I added some more context, incorporated some of the refs from "Further reading" into the article, and removed the template. Tim Vickers (talk) 01:05, 27 January 2008 (UTC)
The problem here is that Tim can't do that for every article. If these articles are allowed to live then experts will be attracted to make significant contributions to Wikipedia. If everyone is going to prod these articles it will be a nightmare to keep repeating the same rationales. We need to figure out a way to stop these prods without having to manually update these articles so they will survive long enough to attract new editors. David D. (Talk) 01:46, 27 January 2008 (UTC)
One proposed deletion out of several thousand articles over six months, that's not something I'm worried about. Tim Vickers (talk) 01:48, 27 January 2008 (UTC)
Wow, there are several thousand already! I have not been keeping my eyes open. David D. (Talk) 07:25, 27 January 2008 (UTC)
As of right now, there are 8341 pages that use {{PBB_Controls}}. A semi-current list can also be found here. AndrewGNF (talk) 20:56, 27 January 2008 (UTC)