Wikipedia:Link rot/URL change requests/Archives/2024/September
This is an archive of past discussions on Wikipedia:Link rot. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current main page.
cbsnews.com/stories
Hello. CBS News links with /stories/ in the URL don't work. www.cbsnews.com/stories/ is now www.cbsnews.com/news/name-of-the-article/. Some of these can be converted over while others don't fit the format: For example, this is now here for Pedro Carmona.
- Any punctuation marks in the article title are removed. For example, this is now here for Darleen Druyun. Same with this is now here with Kamal Derwish.
- However, this URL change doesn't always work as some also require the date at the end of the URL (day month year). For instance, this is now here for Pittsburgh Tribune-Review.
In this case, I think this changeover would need 3 stages: try the article title, then the article title plus date, then archive any that remain broken.
Thanks! MrLinkinPark333 (talk) 23:28, 26 August 2024 (UTC)
- Building a new URL from |title= data is difficult. In the above examples:
- "CBS: Venezuelan Coup Leader Exits" --> "venezuelan-coup-leader-exits" (drop leading "CBS:")
- "Cashing In For Profit?" --> "cashing-in-for-profit" (drop ?)
- "'Lackawanna 6' Link To Yemen Killings?" --> "lackawanna-6-link-to-yemen-killings-04-11-2002" (drop single-quote and ?, add a date string parsed from original URL)
- "U.S. Plants: Open To Terrorists" --> "us-plants-open-to-terrorists-13-11-2003" (drop periods and colon, add a date string)
- In #1 and #4 they each have colons but are done differently. I suspect there will be a lot of edge cases. I can try some generic rules like this and see how many it can get. If you find any more rules, that will help. -- GreenC 03:20, 27 August 2024 (UTC)
- Other cases:
- The subtitle was removed from this and now that at Surveillance.
- Dollar sign was also removed from this to make that at Al Waleed bin Talal Al Saud.
- This is now here for Big Mac Index (no punctuation)
- It might be easier to do ones without punctuation marks first. However, I can't predict which would need dates and which don't. MrLinkinPark333 (talk) 03:59, 27 August 2024 (UTC)
- "We're Watching: How Chicago Authorities Keep An Eye On The City" --> "were-watching" (drop punct and split : to left side)
- "$10 Million? NYC Says No Thanks" --> "10-million-nyc-says-no-thanks" (drop punct including $)
- "Iceland Says Bye to the Big Mac" --> "iceland-says-bye-to-the-big-mac" (square-link title)
- -- GreenC 15:40, 27 August 2024 (UTC)
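The title-to-slug rules worked out in this thread can be sketched roughly as follows. This is a minimal Python sketch, not the bot's actual code: the function names are hypothetical, the `https://www.cbsnews.com/news/<slug>/` pattern is taken from the request above, and the idea that old URLs carry the date as `/stories/YYYY/MM/DD/` is an assumption about their layout. It only generates candidate URLs; each candidate would still need to be checked against the live site.

```python
import re
from typing import List, Optional

# Assumed shape of an old URL: .../stories/YYYY/MM/DD/... (not confirmed in the thread).
DATE_IN_OLD_URL = re.compile(r"/stories/(\d{4})/(\d{2})/(\d{2})/")


def slugify(text: str) -> str:
    """Lowercase, drop punctuation (including $, ?, ' and .), hyphenate spaces."""
    text = re.sub(r"[^a-z0-9\s-]", "", text.lower())
    return re.sub(r"[\s-]+", "-", text).strip("-")


def date_suffix(old_url: str) -> Optional[str]:
    """Return 'DD-MM-YYYY' parsed from the old URL, or None if no date is found."""
    match = DATE_IN_OLD_URL.search(old_url)
    if match is None:
        return None
    year, month, day = match.groups()
    return f"{day}-{month}-{year}"


def candidate_urls(title: str, old_url: str) -> List[str]:
    """Build candidate new URLs for later live-checking."""
    title = re.sub(r"^\s*CBS:\s*", "", title)  # drop a leading "CBS:"
    bases = [slugify(title)]
    if ":" in title:
        # Some titles keep only the part left of the colon (subtitle removed).
        bases.append(slugify(title.split(":", 1)[0]))
    suffix = date_suffix(old_url)
    slugs = []
    for base in bases:
        slugs.append(base)
        if suffix:
            slugs.append(f"{base}-{suffix}")
    return [f"https://www.cbsnews.com/news/{slug}/" for slug in slugs]
```

Trying the plain slug before the dated one mirrors the staged approach proposed in the request (title, then title and date, then archive whatever is still broken).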
Results
- URLs with a match: 3,984 (converted via above method)
- URLs not matched: 1,293 (unable to convert)
- Title unspecified: 78 (bare and square links without a title)
User:MrLinkinPark333: This turned out better than expected with a 74% success rate. Though the 80/20 Rule is expected. It was fiddly getting all the transforms right and building a table of possible URLs. Some of the #2's probably have a match but the title is too complicated to parse. Many of the titles in #2 are straightforward but no URL exists. If you want the list of #2 let me know. -- GreenC 17:26, 29 August 2024 (UTC)
- If you mean the ones that were too complicated to convert, sure. Perhaps I can find more conversion rules from them. I'm also interested in the bare links of #3. I don't need the 404s of easy conversion. MrLinkinPark333 (talk) 17:44, 29 August 2024 (UTC)
- Set #2 and #3: Wikipedia:Link_rot/Cases/cbsnews.com-stories -- GreenC 05:47, 31 August 2024 (UTC)
- Of the 10 I tested in case #2, I found a handful that worked.
- Clam sauce this is now here after The Early Show - CBS News is removed
- Comparison of the healthcare systems in Canada and the United States, this is now here (subtitle is removed)
- Cold case - this is now here - Everything after the slash is removed.
- Continuity of government: - this is now here (single quotes are removed)
- I would like the list for #2 updated at that link rot cases page if any more links are resolved. However, not all of these links will be fixed. For example, the links at Columbus Blue Jackets and Concerns and controversies at the 2010 Commonwealth Games don't have working links. If I find any more, I'll let you know. Otherwise, if we run out of ones to replace, the rest could be replaced with archived links. MrLinkinPark333 (talk) 20:07, 31 August 2024 (UTC)
- They all had archives added already, so there's no loss to verifiability if nothing further is done. The ones that might be made to work require special edge-case rules that I don't want to deal with, sorry; it's too messy and time-consuming, and there is too much variability. For example, how many URLs are fixed by removing "The Early Show - CBS News"? The answer is 4. So that's 4 out of 1000. Ctrl-F search on "/" in that list: there is no general rule for "everything after slash is removed". It goes on like that; the data is extremely messy and variable. In a situation like this, the 80/20 Rule rules: you can often get the first "easy" 80%, and the remaining "hard" 20% is dealt with or not, but getting 80% is better than nothing. It's just the nature of this particular problem, trying to create a URL from free-form text. -- GreenC 23:23, 31 August 2024 (UTC)
- Fair enough. Hopefully there'd be more luck with the other cbs ones below. --MrLinkinPark333 (talk) 23:26, 31 August 2024 (UTC)
- Honestly, 74% is much better than I expected, considering. And probably at least half of those in #2 are legitimate dead links with no page available, so the real conversion rate might be closer to 90% after the dead links are factored out. -- GreenC 00:22, 1 September 2024 (UTC)
Done
cbsnews.com/numeric
Hello. While looking at CBS News, I found many URLs with numeric IDs that don't work. I found 2 that redirect but the rest don't:
- this redirects here for Judy Sheindlin.
- this redirects to here for 2020 United States presidential election in Idaho.
URL replacements are the same as the above section with some exceptions:
- For Jihobbyist, this is now here. - Political Hotshot needs to be removed from the reference as it does not exist in the new URL's article title.
- ~590 URLs that start with 2 (any non-mainspace can be ignored).
- ~4500 URLs that start with 8 (any non-mainspace can also be ignored)
Thanks again! MrLinkinPark333 (talk) 23:47, 26 August 2024 (UTC)
Results
- URLs with a match: 2,279 (converted with above method)
- URLs unable to match: 631
- URLs no title available: 36
Successful matches: 77.4%. Of those 631, roughly half are not a matching problem; rather, the page no longer exists. Assuming 50% is true, and also removing the URLs with no title available, the real match rate is 88%, i.e. a further 12% might be matched, but that is not practical given the variability of the data. -- GreenC 00:34, 1 September 2024 (UTC)
Done -- GreenC 00:34, 1 September 2024 (UTC)
- Not too bad! MrLinkinPark333 (talk) 03:44, 1 September 2024 (UTC)
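The percentages quoted in the two CBS News results can be reproduced with a few lines of arithmetic. This is a sketch only: the function name and parameters are hypothetical, and the 50% dead-page share is the rough estimate given in the thread, not a measured figure.

```python
def match_rate(matched: int, unmatched: int, untitled: int,
               dead_share: float = 0.0, drop_untitled: bool = False) -> float:
    """Percentage of matched URLs.

    dead_share is the assumed fraction of unmatched URLs whose page no longer
    exists (so no match was ever possible); drop_untitled excludes links that
    had no title to build a URL from.
    """
    denominator = matched + unmatched * (1 - dead_share)
    if not drop_untitled:
        denominator += untitled
    return round(100 * matched / denominator, 1)


# cbsnews.com/numeric figures from the results above.
raw = match_rate(2279, 631, 36)
adjusted = match_rate(2279, 631, 36, dead_share=0.5, drop_untitled=True)
print(raw, adjusted)  # 77.4 87.8
```

Applied to the cbsnews.com/stories figures (3,984 matched, 1,293 unmatched, 78 untitled), the same calculation gives 74.4 raw and 86.0 under the same 50% assumption.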
cbc.ca/story
Hello. There are links to cbc.ca using /story/ that are broken. While there are new working URLs, I can't predict them. For example, this has a working archived link that redirects here for Scouting controversy and conflict. For these ones, I request looking for archived redirects first, then adding archives to the rest.
Thanks! MrLinkinPark333 (talk) 22:08, 28 August 2024 (UTC)
- This is a weird site because the ghost redirects are.. ghostly. In the above example, there are different redirects depending on timestamp. Sometimes it goes here and other times here. They are also somewhat chronologically buried in the list, normally I only get the most recent redirect (because there is no way of knowing which is correct without looking), and the last redirect goes here, which is not a ghost redirect. Thus unable to determine redirect URLs with automation. -- GreenC 21:06, 1 September 2024 (UTC)
- Checked 457 pages and edited 376 pages. Added 2 {{dead link}}. Switched 45 |url-status=live to dead. Added 384 archive URLs (337 Wayback). Changed 28 citation metadata.
Done -- GreenC 17:17, 3 September 2024 (UTC)
MrLinkinPark333: The cbc.ca/story and /news/story appear to have been parsed and fixed during the below section. Example Special:Diff/1243671672/1243757966 -- GreenC 17:17, 3 September 2024 (UTC)
www.cbc.ca/redirects
Hello. Cbc.ca has redirects to working URLs. Some of them require URL changes while others can be fixed quickly.
- No Changes
- This automatically goes here without any URL changes for Serena Williams. Not sure how many cases don't require /news/ to make working redirects.
- Changes
- 1a) Any with /m/ might need to be removed to make a redirect. For example, changing this to that points here for Commonwealth Games.
- 1b) However, that doesn't always work like this for House of Commons of Canada. Any of these or others that don't work would need archives, as I can't predict the new URL, which is this in the example.
- 2) Most URLs need to add /news/ to create a working redirect. For example, changing this into that makes the working URL for I'll Be Lovin' U Long Time.
- 3) However, an extra step for 2) might be needed if that doesn't work. Changing this to that makes a broken redirect. This redirect would need most of the URL changed to /amp/ to make the correct URL for Drake (musician).
- ~2000 2 folders
- ~3000 insource:/http?:\/\/www\.cbc\.ca\/[a-z]+\/[a-z]+\/[story]+\//
- ~1300 insource:/http?:\/\/www\.cbc\.ca\/[a-z]+\/[a-z]+\/[a-z]+\/[story]+\//
- ~50 insource:/http?:\/\/www\.cbc\.ca\/[a-z]+\/[a-z]+\/[a-z]+\/[a-z]+\/[story]+\//
- 3 insource:/http?:\/\/www\.cbc\.ca\/[a-z]+\/[a-z]+\/[a-z]+\/[a-z]+\/[a-z]+\/[story]+\//
Since this is a big request, I suggest focusing on ones that already redirect without changing the URLs first, then the ~180 /m/ ones. Thank you very much! MrLinkinPark333 (talk) 22:44, 28 August 2024 (UTC)
- Able to resolve 1b) by using the same technique in 3) ie. any url that ends in something like "1.2689859" can be tested for https://www.cbc.ca/amp/1.2689859
- URL like this can be changed to this (remove "/canada/" and change to "html")
- URL like this can be changed to this (add "/canada/") -- GreenC 04:52, 2 September 2024 (UTC)
- I tested this with the /news/story/ ones archived in the above section. This makes a working redirect for Vancouver and this makes a working redirect for Calgary Flames. Could these ones be rechecked? Haven't figured out if the /news/ ones have working redirects. MrLinkinPark333 (talk) 16:26, 2 September 2024 (UTC)
- Update: Got a few working redirects from /news/ with some changes. This works for Stan Lee. However, this needs to be changed to this to make a redirect for The Da Vinci Code. Same with this to that for Karla Homolka. MrLinkinPark333 (talk) 16:41, 2 September 2024 (UTC)
- This is getting complicated enough that I might need to run it and see what's remaining for a second pass, otherwise it could create rule contradictions. It's still processing, but out of over 4000 edits so far there are about 300 it missed using the rules from yesterday. Still need to check for soft404s and other problems. -- GreenC 16:49, 2 September 2024 (UTC)
- No worries! Those remaining rechecks could wait :) MrLinkinPark333 (talk) 17:06, 2 September 2024 (UTC)
- As the House of Commons of Canada example has both an /amp/ and full link, could those /amp/ ones be archived in case they break? Not sure why there's two links to the same article, but that helps! MrLinkinPark333 (talk) 16:09, 2 September 2024 (UTC)
- For House of Commons-like URLs I can't automatically determine the desktop URL only the mobile version. "AMP" is for pages optimized for mobile users, a parallel version of the site. Some sites have an API (a URL) that allows translation between the mobile and desktop URL ie. give it the AMP URL and it will return the desktop URL. Ideally all URLs on Wikipedia are the desktop version. But I don't know if they have an API, that would be nice to have. Either way anything added to Wikipedia will get archived into the Wayback Machine automatically. If the link later dies the bots or my tool will add an archive. -- GreenC 16:20, 2 September 2024 (UTC)
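The /amp/ technique described in this thread (take the trailing numeric ID and test it under https://www.cbc.ca/amp/) might look something like the sketch below. The function name and the regular expression are assumptions for illustration, and the returned URL is only a candidate that still has to be fetched and checked for a working redirect or a soft-404.

```python
import re
from typing import Optional

# Trailing article ID such as "1.2689859", optionally followed by a slash.
ID_AT_END = re.compile(r"(\d+\.\d+)/?$")


def amp_candidate(url: str) -> Optional[str]:
    """Return the /amp/ URL to test for a cbc.ca link, or None if it has no trailing ID."""
    # Ignore any query string or fragment before looking for the ID.
    path = url.split("?", 1)[0].split("#", 1)[0]
    match = ID_AT_END.search(path)
    if match is None:
        return None
    return f"https://www.cbc.ca/amp/{match.group(1)}"
```

URLs without such an ID (for example older /story/ links ending in .html) return None and would fall through to the other rules or to archiving.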
Results
- 10,368 links are live. All (but 31) are new, created per above rules.
- 95 links are not working. Of those, 12 had a {{dead link}} added. The rest have archives.
- Checked ____ pages and edited 7,133 pages. Moved 10,368 links to a new URL. Removed 45 {{dead link}}. Added 12 {{dead link}}. Switched 1,460 |url-status=dead to live. Switched 2 |url-status=live to dead. Added 556 archive URLs (402 Wayback).
Done -- GreenC 02:08, 3 September 2024 (UTC)
articles.latimes.com
Hello. This is a big request. URLs with articles.latimes.com either redirect to the new URL or don't work:
- Working redirect: this to that at Carleton Varney
- Not working redirect: this should go to here for The Man from Earth. These redirects that don't redirect could use archive copies for the time being as the date format doesn't match the above URL at Carleton Varney.
- Broken redirects: this gives a redirect to the Los Angeles Times's main page for Cedars-Sinai Medical Center. It's a search URL and not a URL to a specific article.
60000 with HTTP/HTTPS. Any non articlespace links can be filtered out. Thank you very much! MrLinkinPark333 (talk) 23:58, 12 August 2024 (UTC)
- MrLinkinPark333: It looks like *.latimes.com is 96,000 pages and articles.latimes.com is 37,000. I could focus on articles.latimes.com (which is a significant project) but I wonder about the other 2/3rds. Are they redirecting also? Maybe I should do articles.* right now to keep the size manageable. -- GreenC 02:00, 13 August 2024 (UTC)
- From the sample checks at Summer Olympic Games, Tampa Bay Buccaneers and 2020 Summer Olympics for latimes.com, these look fine and don't need new URLs. If that changes in the future, I could file a separate request later. MrLinkinPark333 (talk) 22:40, 13 August 2024 (UTC)
- OK great. This job is going fast because the LAT has an exceptionally clean site, rapid response, few dead links. It's mostly just finding the redirect and replacing. I'm happy with how well ghost redirect discovery is working, now noted in the statistics (along with soft-404 stats) starting with this run. -- GreenC 16:05, 14 August 2024 (UTC)
Enwiki in multiple batches:
- Batch 1: Checked 3,000 pages and edited 2,935 pages. Moved 4,152 links to a new URL. Resolved 68 ghost redirects. Resolved 25 soft-404s. Removed 2 {{dead link}} templates. Added 8 {{dead link}}. Switched 149 |url-status=dead to live. Switched 14 |url-status=live to dead. Added 166 archive URLs (142 Wayback). Changed 13 citation metadata fields.
- Batch 2: Checked 7,000 pages and edited 6,859 pages. Moved 9,663 links to a new URL. Resolved 143 ghost redirects. Resolved 53 soft-404s. Removed 1 {{dead link}}. Added 21 {{dead link}}. Switched 314 |url-status=dead to live. Switched 34 |url-status=live to dead. Added 372 archive URLs (276 Wayback). Changed 36 citation metadata.
- Batch 3: Checked 26,845 pages and edited 26,251 pages. Moved 36,856 links to a new URL. Resolved 481 ghost redirects. Resolved 195 soft-404s. Removed 5 {{dead link}}. Added 89 {{dead link}}. Switched 1,289 |url-status=dead to live. Switched 128 |url-status=live to dead. Added 1,329 archive URLs (1,106 Wayback). Changed 140 citation metadata.
IABot: does not support URL moves; since the redirects are working, the bot will consider the links live.
Done -- GreenC 20:43, 14 August 2024 (UTC)
Pass 2
- Checked 36,845 pages and edited 658 pages. Moved 379 links to a new URL. Resolved 216 ghost redirects. Resolved 438 soft-404s. Removed 2 {{dead link}}. Added 106 {{dead link}}. Switched 199 |url-status=dead to live. Added 217 archive URLs (138 Wayback).
-- GreenC 02:29, 4 September 2024 (UTC)
ehdenfamilytree.com
This 'ehdenfamilytree.com' is dead and the new one is 'ehdenfamilytree.org'. Saroufim1 (talk) 01:49, 31 August 2024 (UTC)
- Checked 79 pages and edited 79 pages. Moved 120 links to a new URL. Removed 3 {{dead link}}. Switched 1 |url-status=dead to live.
Done -- GreenC 19:26, 3 September 2024 (UTC)
AnandTech shuts down
Amazing website/technews site AnandTech has shut down (https://www.anandtech.com/)
If an archive bot could preemptively archive the entirety of that website, that would be mint, as people are unsure what will happen to the content.
Thanks.
Headbomb {t · c · p · b} 04:32, 31 August 2024 (UTC)
1,158 pages — Preceding unsigned comment added by GreenC (talk • contribs)
- @GreenC: those are just what's used on Wikipedia. Which, I agree, should be a priority. But if archiving the entirety of AnandTech is possible... either by talking to IA or through your bot or whatever, that would be an amazing service to the tech community/tech historians. Headbomb {t · c · p · b} 14:45, 31 August 2024 (UTC)
- I believe the domain is already crawled by the Wayback Machine as part of the GDELT Collection ("NO404-GDELT"). For example, given this archive, the "About this capture" tab says GDELT Collection. The crawl was started in 2014, though it might cover the whole site. Could you find some older URLs (the older the better) and check whether they exist in the Wayback? They should be there, but it's worth checking to see if the crawl missed them. If there are blank spots then I'll need to go through the URLs on Wikipedia one by one and capture any that are missing, which is a bit of a job. -- GreenC 16:24, 31 August 2024 (UTC)
- Here's an article from 1998. You can tell it's a very early article by the URL: /161/ .. recent articles are at around /21000/. It's a pretty good bet the site is well archived. -- GreenC 05:29, 2 September 2024 (UTC)
- Sounds like anandtech.com will be staying stable and keeping all its articles up. [1].
And while the AnandTech staff is riding off into the sunset, I am happy to report that the site itself won’t be going anywhere for a while. Our publisher, Future PLC, will be keeping the AnandTech website and its many articles live indefinitely. So that all of the content we’ve created over the years remains accessible and citable.
Just FYI to help with making the decision. –Novem Linguae (talk) 17:41, 1 September 2024 (UTC)
- There is a dedicated team of volunteers and staff (I suppose) of the Internet Archive archiving dead or dying websites. And AnandTech is listed on their wiki. If anyone here wants to speed up the process of the site getting archived, I suggest volunteering some time or resources there as well. – robertsky (talk) 05:52, 2 September 2024 (UTC)
- ω Awaiting to see if the site goes offline. -- GreenC 18:15, 3 September 2024 (UTC)
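For anyone wanting to spot-check whether older URLs exist in the Wayback Machine, as suggested above, one option is the Wayback Machine availability endpoint at archive.org/wayback/available. The sketch below is an illustration under assumptions, not the tool used in this discussion: the helper names are hypothetical, and the response layout (`archived_snapshots` → `closest` → `available`/`url`) is what that endpoint is understood to return.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build the availability query; timestamp is an optional YYYYMMDDhhmmss prefix."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{API}?{urlencode(params)}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the closest snapshot URL from a parsed response, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(url: str, timestamp: Optional[str] = None) -> Optional[str]:
    """Query the endpoint over the network and return a snapshot URL if one exists."""
    with urlopen(availability_query(url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `check("https://www.anandtech.com/", "1998")` would ask for the capture closest to 1998; a `None` result for an old article URL would point to a gap in the crawl worth capturing manually.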
google.com/search?q=cache:
Practically all Google Search links with this string are redirects to Google cache, which has shut down. (technically, not every link starting with this string necessarily redirects to cache, but all links I've found are redirects). Helpful Raccoon (talk) 18:45, 1 September 2024 (UTC)
- Note: while the vast majority of URLs I found are followed by 12 characters and another colon before the original website URL (e.g. http://google.com/search?q=cache:EdF1mH2UVF8J:www.maurinet.com/allform/pportnew.pdf+mauritius+national+card&hl=en&ct=clnk&cd=5&gl=nz in Identity document), a few of them are not followed by 12 characters (e.g. http://www.google.com/search?q=cache:www.melafoundation.org/theatre.pdf in Drone music). Helpful Raccoon (talk) 18:58, 1 September 2024 (UTC)
- OK. I wrote/use Google Cache Parser (GitHub). It correctly parses both those URLs. -- GreenC 21:53, 4 September 2024 (UTC)
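The two URL shapes noted above (with and without the 12-character ID) can be handled along these lines. This is a standalone sketch and not the Google Cache Parser mentioned in the reply: the function name is hypothetical, the character set assumed for the ID is a guess, and defaulting to `http://` when the cached target has no scheme is also an assumption.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Assumed form of the cache ID: 12 URL-safe characters followed by a colon.
CACHE_ID = re.compile(r"^[A-Za-z0-9_-]{12}:")


def original_from_cache_url(url: str) -> Optional[str]:
    """Extract the cached page's own URL from a google.com/search?q=cache: link."""
    q = parse_qs(urlsplit(url).query).get("q", [""])[0]
    if not q.startswith("cache:"):
        return None
    target = q[len("cache:"):]
    target = CACHE_ID.sub("", target, count=1)  # drop the ID when present
    # parse_qs turns "+" into spaces; anything after the first space is search terms.
    target = target.split(" ", 1)[0]
    if not target:
        return None
    if not re.match(r"^https?://", target):
        target = "http://" + target  # assumed default scheme
    return target
```

The recovered URL can then be looked up in an archive, since the Google cache copy itself is no longer available.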
User:Helpful Raccoon: I cleared Google Cache in February: Wikipedia:Link_rot/URL_change_requests/Archives/2024/February#Google_cache targeting webcache.googleusercontent.com but was not aware of google.com/search?q=cache: .. thanks for bringing this to attention. 776 pages -- GreenC 19:08, 1 September 2024 (UTC)
Results
- Converted 810 URLs
Done -- GreenC 23:40, 4 September 2024 (UTC)
time-blog.com
Site appears to be dead. All links redirect to the time.com homepage. There are 54 pages. Thank you! Helpful Raccoon (talk) 23:13, 1 September 2024 (UTC)
Enwiki
- Checked 54 pages and edited 41 pages. Added 2 {{dead link}}. Switched 4 |url-status=live to dead. Added 44 archive URLs (43 Wayback).
IABot DB
- Checked and updated 84 unique URLs which will propagate across 300+ wikis.
Done -- GreenC 00:29, 5 September 2024 (UTC)
nola.com/politics
This subpage appears to be dead. All links currently redirect to https://www.theadvocate.com/baton_rouge/news/politics/ (and it doesn't show the original article). I could not find the original articles by searching in theadvocate.com. There are 343 pages. Helpful Raccoon (talk) 23:23, 1 September 2024 (UTC)
Enwiki
- Checked 342 pages and edited 320 pages. Added 30 {{dead link}}. Switched 36 |url-status=live to dead. Added 563 archive URLs (419 Wayback). Changed 22 citation metadata.
IABot DB
- Checked and updated 677 unique links which will propagate to 300+ wikis
Done -- GreenC 05:03, 5 September 2024 (UTC)