User:Certes/Gene links
This page lists gene articles which are not linked from the base name. For example, there is no obvious route from ACR to ACR (gene). FOO is used as a placeholder to denote the base name such as ACR.
Dab missing entry
FOO is (or redirects to) a dab which does not list the gene.
Done Section completed: add an entry for FOO (gene) to existing dab FOO.
- ACR
- AIC also add Akaike information criterion
- BTC
- CAMP: CAMP (gene) is a list of enzymes
- Rewrote cathelicidin and the enzyme article it linked to, as neither of its two corresponding genes are now called "camP", then redirected CAMP (gene). Seppi333 (Insert 2¢) 02:50, 30 November 2019 (UTC)
- CDNF
- CFI
- CGB: existing entry displays the gene name but is piped elsewhere
- CNO
- CROP
- CS
- CTRC
- FAT
- GART
- GDA
- GLA
- HAL
- MFF
- MIA
- NPPA
- NTM
- PGC
- PIGS
- PLI
- Pol
- POR
- PTLD
- REN
- SAC
- SLN
- TAT
- UMPS
- Y14
Unrelated article with dab
FOO is (or redirects to) an article about an unrelated primary topic. FOO (disambiguation) is (or redirects to) a dab which does not list the gene.
Done: Section completed: add an entry for FOO (gene) to existing dab FOO (disambiguation).
Unrelated article without dab
FOO is (or redirects to) an article about an unrelated topic. FOO (disambiguation) does not exist.
Fix: If the incumbent article is not primary, move it to FOO (topic) and list it along with the gene on a new dab FOO. Check for incoming links to FOO and update these. If the topic is primary but the initials also denote other topics, create FOO (disambiguation). Otherwise, the primary topic article needs a hatnote to the gene.
Done Section complete except for CTU2, which is the actual name of the C16orf84 gene: requesting a second opinion from PamD or Seppi333.
- AK1 Done
- APOD Done
- ASUN Done
- BATF Done
- BRF1 Done
- BRF2 Done
- BX3 Done
- CCT2 Done
- CCT5 Done
- CES3 Done
- CGB2 Done
- CHGA Done
- CKLF Done
- CLK3 Done
- CNN3 Done
- CPA4 Done
- CRCP Done
- CROT Done, though the gene has a claim to be PT
- CSF3 Done: retargeted to gene
- CSH2 Done
- CSN3 Done
- CTSH Done
- CTU2
- DMWD Done
- DNA2 Done
- Doubletime Done dab page tweaked, gene now linked directly by hatnote
- EN1 Done two genes linked in one complex hatnote
- ESAM Done new dab page
- ESPN Done (mega hatnote now even bigger)
- FMOD Done expanded hatnote
- GATM Done new dab page
- GBAS Done new dab page
- GMDS Done new dab page
- GPS2 Done new dab page (but querying whether existing redirect was justified)
- GPX2 Done new dab page
- HPCA Done new dab
- HULC Done
- IRGC Done
- Isomorph Done: added Isomorph (gene) (a classification of mutations) to Isomorphism (disambiguation)
- Kaiso Done linked PT to new dab
- KMO Done new dab
- KYNU Done
- MAL2 Done
- MLIP Done
- MPNS Done
- MSLN Done
- NAAB Done new dab page
- NAGA Done: added NAGA (gene) to dab Naga
- NEBL Done
- NEMF Done
- NFIC Done
- NKRF Done
- ODAM Done
- Paralytic Done P to S
- PCTP
- PIGN: medical, but probably unrelated to PIGN (gene)
- POLA1
- POP1
- POP4
- PPCS
- PPIE
- PPIG
- PREP
- PSG1
- RARS
- RAX
- SCEL
- SNCB
- Spätzle Done P to S
- VISA Done
- WARS Done
- WTAP Done
Enzyme or protein article
FOO describes an enzyme or protein related to FOO (gene) but does not link to the gene.
Fix: Expert advice is needed.
- AlkB; AlkB (gene) redirected to a section of AlkB Done Seppi333 (Insert 2¢) 23:19, 29 November 2019 (UTC)
- ANK2; ANK2 (gene) – the latter is a duplicate article of the former created by User:ProteinBoxBot. It should be merged into the former. The sitelink for ANK2 (gene) needs to be moved to ANK2 on wikidata in order to move the {{Infobox gene}} template when this happens (i.e., the duplicate article has a gene infobox but the primary article does not). Done Seppi333 (Insert 2¢) 23:23, 29 November 2019 (UTC)
- CACNA1B; CACNA1B (gene) Done merged. Seppi333 (Insert 2¢) 23:44, 29 November 2019 (UTC)
- CASP12; CASP12 (gene) Done merged. Seppi333 (Insert 2¢) 23:44, 29 November 2019 (UTC)
- NRXN1; NRXN1 (gene) Done merged. Seppi333 (Insert 2¢) 23:44, 29 November 2019 (UTC)
- SCN1A; SCN1A (gene) Done merged. Seppi333 (Insert 2¢) 23:54, 29 November 2019 (UTC)
- SKP1; SKP1 (gene) Done moved Skp1 to the official UniProt name since that was an incorrectly capitalized gene name, SKP1 and SKP1 (gene) (will) redirect there after 2x redirects are corrected by a bot. Sitelink on wikidata moved to the correct item. Seppi333 (Insert 2¢) 00:04, 30 November 2019 (UTC)
- SPI1; SPI1 (gene) Done - fixed this one earlier. Seppi333 (Insert 2¢) 23:54, 29 November 2019 (UTC)
- TMEM243; TMEM243 (gene) Done moved sitelink and redirected page. Seppi333 (Insert 2¢) 00:04, 30 November 2019 (UTC)
Miscellaneous
See individual entries for a description of each anomaly.
Fix: Expert advice is needed.
- ALG2 is a gene; ALG2 (gene) is a list of enzymes. Fixed
See WT:MCB#ALG2 and GDP-Man:Man2GlcNAc2-PP-dolichol alpha-1,6-mannosyltransferase. Deleted that section. The ALG2 gene encodes a protein that belongs to 2 classes of enzymes, so it makes sense to redirect both pages to the gene and list the corresponding enzymes there. Seppi333 (Insert 2¢) 01:23, 30 November 2019 (UTC)
- CFTR: redirects are anomalously titled CFTR(gene) (no space) and Cftr (gene) (lower case). Pending deletion.
- This is a fairly widely studied gene due to its central pathophysiological role in cystic fibrosis; CFTR gets a lot of search traffic. I'd suggest deleting CFTR(gene) since it's an erroneous page title that I tried to move to CFTR (gene) with redirect suppression before realizing it already existed. Keeping cftr (gene) seems fine since, while technically incorrect capitalization, it's at least the correct spelling. Addendum: I've PRODed CFTR(gene). Seppi333 (Insert 2¢) 01:23, 30 November 2019 (UTC)
- EYCL1 is a gene; EYCL1 (gene) redirects to a related protein. Pending deletion.
- See WT:MCB#Deletion of EYCL1, EYCL1 (gene) and Eye color 1 (green/blue). Seppi333 (Insert 2¢) 01:46, 30 November 2019 (UTC)
- KCTD9 and KCTD9 (gene) may be duplicate articles. Done
- LRIF1 and LRIF1 (gene) redirect to articles about different proteins. Fixed
- NAGK redirects to one enzyme; NAGK (gene) is a list of enzymes. Fixed
- The bacterial enzyme listed on NAGK (gene) had a 2:1 correspondence between gene and enzyme and the corresponding gene also had a different capitalization (NagK), so I redirected it to where NAGK went. Seppi333 (Insert 2¢) 02:00, 30 November 2019 (UTC)
- NFAM1 and NFAM1 (gene) may be duplicate articles. Done
- NFATC2IP and NFATC2IP (gene) may be duplicate articles. Done
- NMT1 dab and NMT1 (gene) list share the same entries. Fixed
- This one was confusing; don't think I've ever seen 2 enzymes associated with a single gene, but since both enzymes are associated with multiple genes, I redirected both pages to the pagename of the protein that the gene encodes and listed both enzymes there. Seppi333 (Insert 2¢) 03:21, 30 November 2019 (UTC)
- TITF1: TITF1 (gene) is a different topic, but is it just a typo for TTF1? Fixed
- Hmm. [1] - this is a query in the HGNC database for TITF1. It's an old gene symbol for NKX2-1 (current gene symbol), which is currently known as the "NK2 homeobox 1" gene. Both should redirect to the current gene symbol unless disambiguation at TITF1 is necessary. Changed the TITF1 target. Seppi333 (Insert 2¢) 00:57, 30 November 2019 (UTC)
- WAS: WAS (gene) redirects to Wiskott–Aldrich syndrome protein but that article calls the gene WASp. Fixed Clarified by stating the encoding gene's gene symbol. Seppi333 (Insert 2¢) 00:57, 30 November 2019 (UTC)
Merged the wikidata sitelinks for NFATC2IP, KCTD9, and NFAM1 and the corresponding (gene) pages. Will deal with the rest a bit later. Seppi333 (Insert 2¢) 00:10, 30 November 2019 (UTC)
Re-ALG2 (gene): I think it may be worth recoding and rerunning my User:Seppi333/GeneListNLP script to detect/write a list of target pages that are wikilinked from the gene lists and that contain all 5 of the words "Set", "index", "page", "lists", and "articles" on them in order to identify links to set index articles, unless you can locate those with an SQL query. The last time I ran that script, it took 1:33:45 (1.5 hrs) to download and process all the pages, so if it's possible to locate them using another method, it'd probably be best to do that instead. Seppi333 (Insert 2¢) 01:23, 30 November 2019 (UTC)
- This PetScan query identifies SIAs linked from gene lists. Certes (talk) 10:25, 30 November 2019 (UTC)
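For illustration only, the five-word test for set index articles proposed above could be sketched as follows. This is not the GeneListNLP script itself: the function name is hypothetical, the match is assumed to be case-insensitive on whole words, and fetching the page text is left out.

```python
import re

# The five words named in the comment above; matching is assumed case-insensitive.
SIA_WORDS = ("set", "index", "page", "lists", "articles")


def looks_like_set_index(page_text):
    """Return True when every word in SIA_WORDS occurs as a whole word in the text."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return all(word in words for word in SIA_WORDS)
```

A page would be flagged only when all five words are present, so an ordinary gene article that merely mentions "page" or "lists" is not reported. The PetScan approach in the reply above avoids downloading pages at all.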
False positives
FOO links to FOO (gene) (or the target of that redirect) in a complex way not spotted by the Quarry queries.
Fix: probably no action but we may consider a more direct link.
- BBC3 Done (improved hatnote to offer direct link to gene)
- CAD
- Cfr
- Dlx
- ELO Done (clarified link to gene on dab page)
- FARSA Done (retargeted to dab page, no clear PT: shortens route to gene)
- Hairy Done (improved hatnote to offer direct link to gene)
- KIZ
- LAT
- LOX Done (clarified link to gene on dab page)
- MAFA (possibly a related protein)
- MFSD2A and MFSD2A (gene) redirect to the same article.
- MIB2
- MINA Done (clarified link to gene on dab page)
- NES
- OSCAR
- Pokemon
- REST
- RHO
- Sphinx
- Tinman
- THEMIS
- TOR
Other links
Here are some other link issues raised by the gene lists. They need an expert to fix them because the suggested fix may be wrong, they may indicate wider problems, or the initialism redirect might merit conversion into a dab.
Direct links
The gene lists link directly to a page which is not in gene categories. These fall into two sections.
1. The target page appears not to be a gene. The link needs to be corrected. In each case, incoming links suggest that the non-gene article is the primary topic, but we could consider moving that article and creating a dab.
- CHML: List of human protein-coding genes 1 should link to CHML (gene)
- DR1: List of human protein-coding genes 1 should link to DR1 (gene)
- HPX: List of human protein-coding genes 2 should link to HPX (gene)
- PIM2: List of human protein-coding genes 2 and Protein kinase domain should link to PIM2 (gene)
2. The target page appears to be a gene or closely related topic. Links may be correct but the gene page could be added to appropriate gene categories.
Redirects
The gene lists link to a redirect to a page which is not in gene categories.
- List of human protein-coding genes 1 links to AAMP, which redirects to unrelated article African American Museum in Philadelphia. It should probably link to AAMP (gene).
- List of human protein-coding genes 1 links to CCNC, which redirects to unrelated article Chinese Canadian National Council. It should probably link to CCNC (gene).
- List of human protein-coding genes 1 and Cathepsin Z link to CTSW, which redirects to unrelated article Flight Design CT. They should probably link to CTSW (gene).
- List of human protein-coding genes 1, Helicase and ZGRF1 link to DNA2, which redirects to unrelated article DNA². They should probably link to DNA2 (gene).
- List of human protein-coding genes 1 links to EN1, which redirects to unrelated article EN postcode area. It should probably link to EN1 (gene).
- List of human protein-coding genes 1 links to EN2, which redirects to unrelated article EN postcode area. It should probably link to EN2 (gene).
- List of human protein-coding genes 2 links to ETDA, which redirects to unrelated article Ethylenediaminetetraacetic acid. It should probably link to a new redirect ETDA (gene). Which article should it redirect to?
- List of human protein-coding genes 2, Epstein–Barr virus-associated lymphoproliferative diseases, List of OMIM disorder codes and PD-1 and PD-L1 inhibitors link to ICOS, which redirects to article Icos about a genetics company. They should probably link to ICOS (gene).
- List of human protein-coding genes 2 and Brpf1 link to KAT7, which redirects to unrelated article KAT-7. They should probably link to KAT7 (gene).
- List of human protein-coding genes 2, Alpha/beta hydrolase superfamily and Ichthyosis link to LIPN, which redirects to an article Lamellar ichthyosis about a related disease. We may want to link via a new redirect LIPN (gene).
- Created Lipase member N and LIPN (gene)
- List of human protein-coding genes 2 and CARD domain link to MAVS, which redirects to unrelated article Dallas Mavericks. They should probably link to Mitochondrial antiviral-signaling protein, perhaps via a new redirect MAVS (gene).
- List of human protein-coding genes 3 and AAA proteins link to NVL, which redirects to unrelated article Null (SQL). They should probably link to NVL (gene).
- List of human protein-coding genes 3 links to OSR2, which redirects to unrelated article Windows 95. It should probably link to OSR2 (gene).
- List of human protein-coding genes 3 and several articles link to PIGN, which redirects to unrelated article Acute proliferative glomerulonephritis. They should probably link to PIGN (gene).
- List of human protein-coding genes 3 and WD40 repeat link to PLAA, which redirects to unrelated article Poor Law Amendment Act 1834. They should probably link to PLAA (gene).
- List of human protein-coding genes 3 and List of OMIM disorder codes link to RHO, which redirects to unrelated article Rho. They should probably link to RHO (gene).
- List of human protein-coding genes 3, Cancer syndrome and Housekeeping gene link to SDHC, which redirects to unrelated article SD card. They should probably link to SDHC (gene).
- List of human protein-coding genes 4 and several articles link to SYK, which redirects to unrelated article Helsingin Suomalainen Yhteiskoulu. They should probably link either to Syk or to its redirect SYK (gene).
Ahh. I was wondering why my NLP script didn’t locate those... it’s the hatnotes. I should probably reprogram it to fix that bug. Will fix these pages later tonight and (nothing to fix, except maybe conversion to DABs; I think you guys are better judges of when/how to disambiguate than I am, though, so I'll leave it to you) revise the wikitables once we locate all these pages. Seppi333 (Insert 2¢) 02:02, 1 December 2019 (UTC)
- Looks like you're right; all of them should link to the SYMBOL (gene) page since those are all the correct articles. I moved the Syk page to the official UniProt name for the protein (Tyrosine-protein kinase SYK) since the only synonym/alias with a lowercase spelling was "p72-Syk". I'll retarget the links in the gene lists/tables once we find the rest of these since it's much less work for me to add them all at once than piecewise. I can rewrite my script to detect the multi-word expressions used on the hatnote pages and just parse the leads to identify ones like Rho tomorrow since it's fairly easy to code that; but, I get the impression that you're able to identify all of the remaining mistargeted links by simpler means than downloading and parsing 11500 pages.
- Makes me want to learn SQL. What other methods do you use to locate pages like this? I'm really curious now. Seppi333 (Insert 2¢) 04:51, 1 December 2019 (UTC)
- @Seppi333: In theory I could have located these with SQL. In practice, it might have been too complex to complete within Quarry's 30 minute limit, so I used PetScan instead with a Wikipedia search for incoming links. You mention checking 11,500 pages manually. In a way I've done that check myself, but only on the 30 or so suspicious pages that remained after filtering out cases that the queries suggest to be correct. Certes (talk) 12:57, 1 December 2019 (UTC)
- Oh. Wow, that's a surprisingly useful tool then. The algorithm is actually fully-automated; it basically just iteratively goes through all ~11500 of the blue wikilinks on the four list pages one at a time, loads the page (it takes 1.5 hours to run almost entirely because it has to load 11500 pages; I can't run it on a database dump), and determines whether or not the words "gene", "genes", "protein" or "proteins" are present on the page. It missed most of the links above because those words are in the DAB hatnotes. I hadn't considered that being a possibility when I wrote it. I should have some time to revise both the wikitable script to fix the lists and mistargeted link detection script to do a second check within the next 12-24 hours; shouldn't take that long to do. Seppi333 (Insert 2¢) 22:00, 1 December 2019 (UTC)
- Finding the bad direct links is as simple as this, which takes 4 seconds. There are a few false positives such as Locus (genetics) from wikilinks not in the table, but they're obvious. The links via redirects took a little more fiddling. Certes (talk) 22:52, 1 December 2019 (UTC)
- I'll have to make use of that tool; seems very handy. Going to work on the gene lists now and update it once I'm done. Seppi333 (Insert 2¢) 10:07, 2 December 2019 (UTC)
- Following up, I retargeted the links in the gene lists yesterday. Haven't quite finished reprogramming the other one yet, but will probably be tomorrow. I'll retarget the non-list gene articles with mistargeted links sometime within the next couple of hours.
- Assuming neither of us find any additional pages, I suppose we're done. Thanks again for your help. Edit: I didn't notice the sections above; will get to them after I retarget the links. Seppi333 (Insert 2¢) 10:04, 3 December 2019 (UTC)
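As a rough illustration of the check described in this thread (not the actual script), the sketch below flags a page as likely mistargeted when none of the four words "gene", "genes", "protein" or "proteins" appears in its wikitext. The function names and the sample titles are hypothetical, and the page text is assumed to have been downloaded already. It also shows why hatnotes caused false negatives: a hatnote that merely points to the gene article contains the word "gene".

```python
import re

# The four words named in the comment above.
GENE_TERMS = {"gene", "genes", "protein", "proteins"}


def mentions_gene_terms(wikitext):
    """Return True if any of GENE_TERMS occurs as a whole word in the wikitext."""
    tokens = set(re.findall(r"[a-z]+", wikitext.lower()))
    return not GENE_TERMS.isdisjoint(tokens)


def likely_mistargeted(pages):
    """Given {title: wikitext}, return the titles that contain no gene-related word."""
    return sorted(title for title, text in pages.items()
                  if not mentions_gene_terms(text))


# Hypothetical sample: neither page is about a gene, but the hatnote on
# "Example A" contains the word "gene", so only "Example B" is flagged.
sample = {
    "Example A": "{{Redirect|EXA|the gene|EXA (gene)}}\nExample A is a postcode area.",
    "Example B": "Example B is a village.",
}
flagged = likely_mistargeted(sample)
```

Here `flagged` contains only "Example B"; "Example A" slips through, which matches the behaviour reported above for pages whose only gene-related words sit in a DAB hatnote.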
Further progress
@Seppi333: I've fixed incoming links apart from the gene lists which should link to CHML (gene) rather than CHML, AAMP (gene) rather than AAMP, etc. I see that some of these have been done manually in the lists (though a piped link might be better) but not in the Python. Also, do you have any thoughts about AKNA, CD96 and WRAP53? Certes (talk) 00:25, 16 December 2019 (UTC)
- @Certes: Hey there! I'm really sorry for falling off the grid after my last reply here; it seems rather rude of me. I've been really busy off-wiki lately and forgot to work on this. My bad about that. I'll go ahead and finish addressing the links above within the next day or so since I now have some time to work on WP. I'll fix AKNA, CD96, and WRAP53 right now though. I only need to adjust their wikidata sitelinks and add {{infobox gene}} to the article source. Done
- BTW, I finished recoding an updated version of my mistargeted link detection algorithm last week. The updated algorithm is designed to detect the type of mistargeted links you uncovered since I used all of the links that you listed in this section as a sample of testcases; I continually revised the algorithm until it had a 100% detection rate on that sample. This time around, it took 3.5 hours (originally, 1.5 hours) for the algorithm to finish processing all ~12,500 blue wikilinks in the gene lists (LOL). The likely mistargeted links it found are included in the collapse tab below. It found a few more articles with similar issues to the ones that you listed above; these articles would be included in the 2nd list in the tab below. Sometime within the next 24-48 hours, I'll manually go through all the links in the tab below and highlight the mistargeted ones I find. This is probably the last set of links in the gene lists that need to be fixed/retargeted since I think I've accounted for all possible ways that a false negative might occur. Seppi333 (Insert 2¢) 00:39, 18 December 2019 (UTC)
Note: immediately after each bulleted entry below, there are two index values listed: My original script detected links to articles where none of 4 gene-related terms (i.e., "gene", "genes", "protein", "proteins") were found anywhere in the article's source code (NB: these links would be marked with The updated algorithm also listed all articles that included specific gene-related multi-word expressions (i.e., the following phrases: "the gene", "the genes", "the protein", "the proteins", "the enzyme", "the enzymes", "(gene)", "(enzyme)", and "(protein)") in the parameters of certain lead hatnotes if any were present – specifically, the
Entries in this list are articles where none of these 5 single-word tokens –
Entries in this list are articles where one or more of these 5 single-word tokens –
- Also, thank you so much for helping me find and address the problematic links in the gene lists! I can't adequately express just how much I appreciate your assistance thus far.
- If it weren't for you, several dozen links in the gene lists probably would've continued to point to the wrong articles since I don't think I would've realized the issues with the original algorithm that were producing false negatives. Seppi333 (Insert 2¢) 00:46, 18 December 2019 (UTC)
- No problem: there is no deadline and we all have things to do offline, especially in December. I'm happy to have helped but have probably done all I can for now. I think the only outstanding issue not mentioned above is cases like CTU2, where the base name leads to a rather flimsy non-gene primary topic and we need either a {{redirect}} hatnote or a two-entry dab. (I'm not sure which is better.) However, I think all the wikilinks now lead to the right destination even in those cases. We've made a lot of improvements and it looks as if the job's almost complete. Certes (talk) 01:26, 18 December 2019 (UTC)
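A minimal sketch of the hatnote check mentioned in the note above, for illustration only. The nine phrases are the ones quoted there; the list of hatnote template names is an assumption (the note is cut off before naming them), as are the function names and the one-template-per-line parsing.

```python
import re

# Multi-word expressions quoted in the note above.
HATNOTE_PHRASES = (
    "the gene", "the genes", "the protein", "the proteins",
    "the enzyme", "the enzymes", "(gene)", "(enzyme)", "(protein)",
)

# ASSUMPTION: which hatnote templates to inspect is not stated in the text.
HATNOTE_TEMPLATES = ("redirect", "about", "for", "other uses")


def lead_hatnotes(wikitext):
    """Yield the lower-cased parameter text of hatnote templates at the top of a page."""
    for line in wikitext.lstrip().splitlines():
        match = re.match(r"\{\{\s*([^|{}]+?)\s*\|(.*)\}\}\s*$", line)
        if match and match.group(1).lower() in HATNOTE_TEMPLATES:
            yield match.group(2).lower()
        elif line.strip():
            break  # the first non-hatnote line ends the lead hatnotes


def hatnote_points_to_gene(wikitext):
    """Return True if a lead hatnote contains one of the gene-related phrases."""
    return any(phrase in params
               for params in lead_hatnotes(wikitext)
               for phrase in HATNOTE_PHRASES)
```

A page for which `hatnote_points_to_gene` returns True is a candidate for manual review: the base name probably leads to a non-gene topic whose hatnote points on to the gene article.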
I went through all the links and fixed problems that I found. In addition to the 4 you identified (CHML, DR1, HPX, and PIM2), it looks like only DDT is new. I'll fix these links in the lists shortly. Seppi333 (Insert 2¢) 15:58, 24 December 2019 (UTC)
- @Seppi333: I missed DDT because it's in Category:Nonsteroidal antiandrogens, a subcategory of Hormones, which I viewed as legitimate link targets. When I stopped excluding Hormones from my Petscan query, DDT appeared and nothing else did, so I don't see any similar cases. Most links to the pesticide seem correct but please can you fix the Python for List of human protein-coding genes 1 and check Protein design, which should perhaps link to DDT (gene) instead? Certes (talk) 16:23, 24 December 2019 (UTC)
- Looks like the DDT link in protein design is correctly targeted; had to read the paper to verify which page to link to (quote: "Then they synthesized the 24-mcr (MIF1RPNVGAMSNFYHYPNIIIII:) designed to form a four-stranded 13-sheet and to bind the insecticide DDT. It did indeed..."). Working on recoding the python script for the list pages right now. Seppi333 (Insert 2¢) 17:23, 24 December 2019 (UTC)
- Done The lists have been updated with piped links for these genes. Seppi333 (Insert 2¢) 18:48, 24 December 2019 (UTC)