Wikipedia:Link rot/URL change requests/Archives/2024/September

From Wikipedia, the free encyclopedia


cbsnews.com/stories

Hello. CBS News links with /stories/ in the URL don't work. www.cbsnews.com/stories/ is now www.cbsnews.com/news/name-of-the-article/. Some of these can be converted over while others don't fit the format: For example, this is now here for Pedro Carmona.

In this case, I think this changeover would need 3 stages: match by article title, then by article title and date, then archive any that remain broken.

Thanks! MrLinkinPark333 (talk) 23:28, 26 August 2024 (UTC)

Building a new URL from |title= data is difficult. In the above examples:
  1. "CBS: Venezuelan Coup Leader Exits" --> "venezuelan-coup-leader-exits" (drop leading "CBS:")
  2. "Cashing In For Profit?" --> "cashing-in-for-profit" (drop ?)
  3. "'Lackawanna 6' Link To Yemen Killings?" --> "lackawanna-6-link-to-yemen-killings-04-11-2002" (drop single-quote and ?, add a date string parsed from original URL)
  4. "U.S. Plants: Open To Terrorists" --> "us-plants-open-to-terrorists-13-11-2003" (drop periods and colon, add a date string)
In #1 and #4 they each have colons but are done differently. I suspect there will be a lot of edge cases. I can try some generic rules like this and see how many it can get. If you find any more rules, that will help. -- GreenC 03:20, 27 August 2024 (UTC)
It might be easier to do ones without punctuation marks first. However, I can't predict which would need dates and which don't. MrLinkinPark333 (talk) 03:59, 27 August 2024 (UTC)
Other cases:
  1. "We're Watching: How Chicago Authorities Keep An Eye On The City" --> "were-watching" (drop punct and split : to left side)
  2. "$10 Million? NYC Says No Thanks" --> "10-million-nyc-says-no-thanks" (drop punct including $)
  3. "Iceland Says Bye to the Big Mac" --> "iceland-says-bye-to-the-big-mac" (square-link title)
-- GreenC 15:40, 27 August 2024 (UTC)
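The title-to-slug rules listed in this thread can be sketched roughly as below. This is only an illustration, not the code actually used for the run: the function names `slugify` and `candidate_slugs` are hypothetical, the date is assumed to be supplied already formatted as DD-MM-YYYY (how it is parsed from the old URL is not shown here), and each candidate would still need to be checked against the live site under www.cbsnews.com/news/.

```python
import re

def slugify(title):
    """Lowercase the title, drop punctuation, and join the words with hyphens."""
    text = title.lower()
    # Delete everything except letters, digits, whitespace and hyphens
    # (this drops ' ? . : $ and similar characters).
    text = re.sub(r"[^a-z0-9\s-]", "", text)
    # Treat runs of whitespace or hyphens as a single word separator.
    return "-".join(w for w in re.split(r"[\s-]+", text) if w)

def candidate_slugs(title, date=None):
    """Return possible slugs for a citation title, most literal first.

    date is an optional DD-MM-YYYY string (assumed to come from the old URL).
    """
    variants = [title]
    if ":" in title:
        left, right = title.split(":", 1)
        variants.append(right)  # "CBS: Venezuelan ..." keeps the right side
        variants.append(left)   # "We're Watching: How ..." keeps the left side
    slugs = []
    for variant in variants:
        slug = slugify(variant)
        options = [slug, f"{slug}-{date}"] if date else [slug]
        for option in options:
            if slug and option not in slugs:
                slugs.append(option)
    return slugs

print(candidate_slugs("U.S. Plants: Open To Terrorists", "13-11-2003"))
```

Because the colon cases are handled differently from title to title, the sketch emits several candidates rather than guessing one; titles with other quirks would still fall through.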
4,393 pages

Results

  1. URLs with a match: 3,984 (converted via above method)
  2. URLs not matched: 1,293 (unable to convert)
  3. Title unspecified: 78 (bare and square links without a title)

User:MrLinkinPark333: This turned out better than expected with a 74% success rate. Though the 80/20 Rule is expected. It was fiddly getting all the transforms right and building a table of possible URLs. Some of the #2's probably have a match but the title is too complicated to parse. Many of the titles in #2 are straightforward but no URL exists. If you want the list of #2 let me know. -- GreenC 17:26, 29 August 2024 (UTC)

If you mean the ones that were too complicated to convert, sure. Perhaps I can find more conversion rules from them. I'm also interested in the bare links of #3. I don't need the 404s of easy conversion. MrLinkinPark333 (talk) 17:44, 29 August 2024 (UTC)
Set #2 and #3: Wikipedia:Link_rot/Cases/cbsnews.com-stories -- GreenC 05:47, 31 August 2024 (UTC)
Of the 10 I tested in case #2, I found a handful that worked.
I would like the list for #2 updated at that link rot cases page if any more links are resolved. However, not all of these links will be fixed. For example, the links at Columbus Blue Jackets and Concerns and controversies at the 2010 Commonwealth Games don't have working links. If I find any more, I'll let you know. Otherwise, if we run out of ones to replace, the rest could be replaced with archived links. MrLinkinPark333 (talk) 20:07, 31 August 2024 (UTC)
They all had archives added already, so there's no loss to verifiability if nothing further is done. The ones that might be made to work require special edge-case rules that I don't want to deal with, sorry. It's too messy and time-consuming, and there is too much variability. For example, how many URLs are fixed by removing "The Early Show - CBS News"? The answer is 4. So that's 4 out of 1000. Ctrl-F search on "/" in that list: there is no general rule for "everything after slash is removed". It goes on like that; the data is extremely messy and variable. In a situation like this, the 80/20 Rule rules: you can often get the first "easy" 80%, and the remaining "hard" 20% is dealt with or not, but getting 80% is better than nothing. It's just the nature of this particular problem, trying to create a URL from free-form text. -- GreenC 23:23, 31 August 2024 (UTC)
Fair enough. Hopefully there'd be more luck with the other cbs ones below. --MrLinkinPark333 (talk) 23:26, 31 August 2024 (UTC)
Honestly, 74% is much better than I expected, considering. And since probably at least half of those in #2 are legitimate dead links with no page available, the real conversion rate might be closer to 90% after the dead links are factored out. -- GreenC 00:22, 1 September 2024 (UTC)

 Done

cbsnews.com/numeric

Hello. While looking at CBS News, I found many URLs with numeric IDs that don't work. I found 2 that redirect but the rest don't:

URL replacements are the same as the above section with some exceptions:

  • For Jihobbyist, this is now here. - Political Hotshot needs to be removed from the reference as it does not exist in the new URL's article title.
  • ~590 URLs that start with 2 (any non-mainspace can be ignored).
  • ~4500 URLs that start with 8 (any non-mainspace can also be ignored)

Thanks again! MrLinkinPark333 (talk) 23:47, 26 August 2024 (UTC)

2,413 pages

Results

  • URLs with a match: 2,279 (converted with above method)
  • URLs unable to match: 631
  • URLs no title available: 36

Successful matches: 77.4%. Of those 631, roughly half are not a matching problem; rather, the page no longer exists. Assuming 50% is true, and also removing the ones with no title available, the real match rate is 88%, i.e. a further 12% might be matched, but doing so is not practical due to the variability of the data. -- GreenC 00:34, 1 September 2024 (UTC)
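The two percentages above can be reproduced with a short calculation. This is a sketch of the arithmetic only; the function name `match_rates` and the `dead_share` parameter are hypothetical, and the 50% share of dead pages is the estimate stated in the comment, not a measured value.

```python
def match_rates(matched, unmatched, no_title, dead_share=0.5):
    """Return (raw rate, adjusted rate) for a URL conversion run.

    raw: matched URLs over all URLs seen.
    adjusted: matched URLs over those that could plausibly be matched,
    i.e. ignoring titleless links and the share of unmatched URLs whose
    target page no longer exists.
    """
    total = matched + unmatched + no_title
    raw = matched / total
    still_matchable = unmatched * (1 - dead_share)
    adjusted = matched / (matched + still_matchable)
    return raw, adjusted

raw, adjusted = match_rates(matched=2279, unmatched=631, no_title=36)
print(f"raw {raw:.1%}, adjusted {adjusted:.1%}")  # raw 77.4%, adjusted 87.8%
```

The adjusted figure rounds to the 88% quoted; a different guess for the dead share moves it accordingly.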

 Done -- GreenC 00:34, 1 September 2024 (UTC)

Not too bad! MrLinkinPark333 (talk) 03:44, 1 September 2024 (UTC)

cbc.ca/story

Hello. There are links to cbc.ca using /story/ that are broken. While there are new working URLs, I can't predict them. For example, this has a working archived link that redirects here for Scouting controversy and conflict. For these ones, I request looking for archived redirects first, then adding archives to the rest.

  • /story/ 41 articles
  • /news/story/ 415 articles.

Thanks! MrLinkinPark333 (talk) 22:08, 28 August 2024 (UTC)

This is a weird site because the ghost redirects are.. ghostly. In the above example, there are different redirects depending on timestamp. Sometimes it goes here and other times here. They are also somewhat chronologically buried in the list, normally I only get the most recent redirect (because there is no way of knowing which is correct without looking), and the last redirect goes here, which is not a ghost redirect. Thus unable to determine redirect URLs with automation. -- GreenC 21:06, 1 September 2024 (UTC)

456 pages

  • Checked 457 pages and edited 376 pages. Added 2 {{dead link}}. Switched 45 |url-status=live to dead. Added 384 archive URLs (337 Wayback). Changed 28 citation metadata.

 Done -- GreenC 17:17, 3 September 2024 (UTC)

MrLinkinPark333: The cbc.ca/story and /news/story appear to have been parsed and fixed during the below section. Example Special:Diff/1243671672/1243757966 -- GreenC 17:17, 3 September 2024 (UTC)

www.cbc.ca/redirects

Hello. Cbc.ca has redirects to working URLs. Some of them require URL changes while others can be fixed quickly.

  • No Changes
    • This automatically goes here without any URL changes for Serena Williams. Not sure how many cases don't require /news/ to make working redirects.
  • ~2000 2 folders
  • ~3000 insource:/http?:\/\/www\.cbc\.ca\/[a-z]+\/[a-z]+\/[story]+\//
  • ~1300 insource:/http?:\/\/www\.cbc\.ca\/[a-z]+\/[a-z]+\/[a-z]+\/[story]+\//
  • ~50 insource:/http?:\/\/www\.cbc\.ca\/[a-z]+\/[a-z]+\/[a-z]+\/[a-z]+\/[story]+\//
  • 3 insource:/http?:\/\/www\.cbc\.ca\/[a-z]+\/[a-z]+\/[a-z]+\/[a-z]+\/[a-z]+\/[story]+\//
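The searches above group cbc.ca links by how many folders come before the story segment. A rough offline equivalent is sketched below in Python, whose `re` module is a different engine from Wikipedia's insource: regex search, so the pattern is not interchangeable with the queries. Two things here are assumptions about intent rather than a copy of the queries: the sketch accepts both http and https (the queries' `http?:` only makes the final "p" optional), and it matches the literal segment `story` (the queries' `[story]+` is a character class matching any run of the letters s, t, o, r, y). The function name and the example URLs are hypothetical.

```python
import re

# Zero or more lowercase folders, then the literal "story/" segment.
CBC_STORY = re.compile(r"^https?://www\.cbc\.ca/((?:[a-z]+/)*)story/")

def folders_before_story(url):
    """Return the number of folders before /story/, or None if not a story URL."""
    m = CBC_STORY.match(url)
    if m is None:
        return None
    # Each captured folder ends with one slash, so counting slashes counts folders.
    return m.group(1).count("/")

# Hypothetical URLs, for illustration only.
for url in [
    "http://www.cbc.ca/story/example.html",
    "http://www.cbc.ca/news/story/example.html",
    "https://www.cbc.ca/news/canada/story/example.html",
    "http://www.cbc.ca/news/canada/example-1.234",
]:
    print(folders_before_story(url), url)
```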

Since this is a big request, I suggest focusing on ones that already redirect without changing the URLs first, then the ~180 /m/ ones. Thank you very much! MrLinkinPark333 (talk) 22:44, 28 August 2024 (UTC)

As the House of Commons of Canada example has both an /amp/ and full link, could those /amp/ ones be archived in case they break? Not sure why there's two links to the same article, but that helps! MrLinkinPark333 (talk) 16:09, 2 September 2024 (UTC)
For House of Commons-like URLs I can't automatically determine the desktop URL, only the mobile version. "AMP" is for pages optimized for mobile users, a parallel version of the site. Some sites have an API (a URL) that allows translation between the mobile and desktop URL, i.e. give it the AMP URL and it will return the desktop URL. Ideally all URLs on Wikipedia are the desktop version. But I don't know if they have an API; that would be nice to have. Either way, anything added to Wikipedia will get archived into the Wayback Machine automatically. If the link later dies, the bots or my tool will add an archive. -- GreenC 16:20, 2 September 2024 (UTC)

Results

  • 10,368 links are live. All (but 31) are new, created per above rules.
  • 95 links are not working. Of those, 12 had a {{dead link}} added. The rest have archives.
  • Checked ____ pages and edited 7,133 pages. Moved 10,368 links to a new URL. Removed 45 {{dead link}}. Added 12 {{dead link}}. Switched 1,460 |url-status=dead to live. Switched 2 |url-status=live to dead. Added 556 archive URLs (402 Wayback).

 Done -- GreenC 02:08, 3 September 2024 (UTC)

articles.latimes.com

Hello. This is a big request. URLs with articles.latimes.com either redirect to the new URL or don't work:

60000 with HTTP/HTTPS. Any non articlespace links can be filtered out. Thank you very much! MrLinkinPark333 (talk) 23:58, 12 August 2024 (UTC)

MrLinkinPark333: It looks like *.latimes.com is 96,000 pages and articles.latimes.com is 37,000. I could focus on articles.latimes.com (which is a significant project) but I wonder about the other 2/3rds. Are they redirecting also? Maybe I should do articles.* right now to keep the size manageable. -- GreenC 02:00, 13 August 2024 (UTC)
From the sample checks at Summer Olympic Games, Tampa Bay Buccaneers and 2020 Summer Olympics for latimes.com, these look fine and don't need new URLs. If that changes in the future, I could file a separate request later. MrLinkinPark333 (talk) 22:40, 13 August 2024 (UTC)
OK great. This job is going fast because the LAT has an exceptionally clean site, rapid response, few dead links. It's mostly just finding the redirect and replacing. I'm happy with how well ghost redirect discovery is working, now noted in the statistics (along with soft-404 stats) starting with this run. -- GreenC 16:05, 14 August 2024 (UTC)

Enwiki in multiple batches:

  • Batch 1: Checked 3,000 pages and edited 2,935 pages. Moved 4,152 links to a new URL. Resolved 68 ghost redirects. Resolved 25 soft-404s. Removed 2 {{dead link}} templates. Added 8 {{dead link}}. Switched 149 |url-status=dead to live. Switched 14 |url-status=live to dead. Added 166 archive URLs (142 Wayback). Changed 13 citation metadata fields.
  • Batch 2: Checked 7,000 pages and edited 6,859 pages. Moved 9,663 links to a new URL. Resolved 143 ghost redirects. Resolved 53 soft-404s. Removed 1 {{dead link}}. Added 21 {{dead link}}. Switched 314 |url-status=dead to live. Switched 34 |url-status=live to dead. Added 372 archive URLs (276 Wayback). Changed 36 citation metadata.
  • Batch 3: Checked 26,845 pages and edited 26,251 pages. Moved 36,856 links to a new URL. Resolved 481 ghost redirects. Resolved 195 soft-404s. Removed 5 {{dead link}}. Added 89 {{dead link}}. Switched 1,289 |url-status=dead to live. Switched 128 |url-status=live to dead. Added 1,329 archive URLs (1,106 Wayback). Changed 140 citation metadata.

IABot: does not support URL moves. Since the redirects are working, the bot will consider the links live.

 Done -- GreenC 20:43, 14 August 2024 (UTC)

Pass 2

  • Checked 36,845 pages and edited 658 pages. Moved 379 links to a new URL. Resolved 216 ghost redirects. Resolved 438 soft-404s. Removed 2 {{dead link}}. Added 106 {{dead link}}. Switched 199 |url-status=dead to live. Added 217 archive URLs (138 Wayback).

-- GreenC 02:29, 4 September 2024 (UTC)

ehdenfamilytree.com

This 'ehdenfamilytree.com' is dead and the new one is 'ehdenfamilytree.org'. Saroufim1 (talk) 01:49, 31 August 2024 (UTC)

80 pages

  • Checked 79 pages and edited 79 pages. Moved 120 links to a new URL. Removed 3 {{dead link}}. Switched 1 |url-status=dead to live.

 Done -- GreenC 19:26, 3 September 2024 (UTC)

AnandTech shuts down

Amazing website/technews site AnandTech has shut down (https://www.anandtech.com/)

If an archive bot could preemptively archive the entirety of that website, that would be mint, as people are unsure what will happen to the content.

Thanks.

Headbomb {t · c · p · b} 04:32, 31 August 2024 (UTC)

1,158 pages — Preceding unsigned comment added by GreenC (talkcontribs)

@GreenC: those are just what's used on Wikipedia. Which, I agree should be a priority. But if archiving the entirety of AnandTech is possible... either by talking to IA or through your bot or whatever that would be an amazing service to the tech community/tech historians. Headbomb {t · c · p · b} 14:45, 31 August 2024 (UTC)
I believe the domain is already crawled by the Wayback Machine as part of the GDELT Collection ("NO404-GDELT"). For example given this archive the "About this capture" tab says GDELT Collection. The crawl was started in 2014, though it might be the whole site. If you can find some older URLs (older the better) and check if they exist in the Wayback. They should be there, but worth checking to see if the crawl missed them. If there are blank spots then I'll need to go through the URLs on Wikipedia one by one and capture any that are missing which is a bit of a job. -- GreenC 16:24, 31 August 2024 (UTC)
Here's an article from 1998. You can tell it's a very early article by the URL: /161/ .. recent articles are at around /21000/. It's a pretty good bet the site is well archived. -- GreenC 05:29, 2 September 2024 (UTC)
Sounds like anandtech.com will be staying stable and keeping all its articles up. [1]: "And while the AnandTech staff is riding off into the sunset, I am happy to report that the site itself won’t be going anywhere for a while. Our publisher, Future PLC, will be keeping the AnandTech website and its many articles live indefinitely. So that all of the content we’ve created over the years remains accessible and citable." Just FYI to help with making the decision. –Novem Linguae (talk) 17:41, 1 September 2024 (UTC)
There is a dedicated team of volunteers and staff (I suppose) of the Internet Archive archiving dead or dying websites. And AnandTech is listed on their wiki. If anyone here wants to speed up the process of the site getting archived, I suggest volunteering some time or resources there as well. – robertsky (talk) 05:52, 2 September 2024 (UTC)
ω Awaiting to see if the site goes offline. -- GreenC 18:15, 3 September 2024 (UTC)

google.com/search?q=cache:

Practically all Google Search links with this string are redirects to Google cache, which has shut down. (technically, not every link starting with this string necessarily redirects to cache, but all links I've found are redirects). Helpful Raccoon (talk) 18:45, 1 September 2024 (UTC)

Note: while the vast majority of URLs I found are followed by 12 characters and another colon before the original website URL (e.g. http://google.com/search?q=cache:EdF1mH2UVF8J:www.maurinet.com/allform/pportnew.pdf+mauritius+national+card&hl=en&ct=clnk&cd=5&gl=nz in Identity document), a few of them are not followed by 12 characters (e.g. http://www.google.com/search?q=cache:www.melafoundation.org/theatre.pdf in Drone music). Helpful Raccoon (talk) 18:58, 1 September 2024 (UTC)
OK. I wrote/use Google Cache Parser (GitHub). It correctly parses both those URLs. -- GreenC 21:53, 4 September 2024 (UTC)
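For illustration, the two URL shapes described above can be unpacked with a few lines of Python. This is a minimal sketch written for this note, not the Google Cache Parser mentioned above and not its API; the function name `original_url_from_cache` is hypothetical. It assumes the optional cache ID is exactly 12 characters from letters, digits, "_" and "-", that anything after the first "+" is search terms, and that a missing scheme means http.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed shape of the optional cache ID: 12 characters, then a colon.
CACHE_ID = re.compile(r"^[A-Za-z0-9_-]{12}:")

def original_url_from_cache(link):
    """Return the page a google.com/search?q=cache: link pointed at, or None."""
    query = parse_qs(urlsplit(link).query)
    q = query.get("q", [""])[0]
    if not q.startswith("cache:"):
        return None
    target = q[len("cache:"):]
    target = CACHE_ID.sub("", target)      # drop the cache ID if present
    # parse_qs decodes "+" as a space; words after the URL are search terms.
    target = target.split(" ", 1)[0]
    if not target:
        return None
    if not re.match(r"^[a-z][a-z0-9+.-]*://", target, re.IGNORECASE):
        target = "http://" + target        # assumption: default to http
    return target

print(original_url_from_cache(
    "http://google.com/search?q=cache:EdF1mH2UVF8J:www.maurinet.com/allform/"
    "pportnew.pdf+mauritius+national+card&hl=en&ct=clnk&cd=5&gl=nz"))
print(original_url_from_cache(
    "http://www.google.com/search?q=cache:www.melafoundation.org/theatre.pdf"))
```

A page URL that legitimately contains a "+" would be cut short by this sketch, so results would still need a check before being used in a citation.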

User:Helpful Raccoon: I cleared Google Cache in February: Wikipedia:Link_rot/URL_change_requests/Archives/2024/February#Google_cache targeting webcache.googleusercontent.com but was not aware of google.com/search?q=cache: .. thanks for bringing this to attention. 776 pages -- GreenC 19:08, 1 September 2024 (UTC)

Results

 Done -- GreenC 23:40, 4 September 2024 (UTC)

time-blog.com

Site appears to be dead. All links redirect to the time.com homepage. There are 54 pages. Thank you! Helpful Raccoon (talk) 23:13, 1 September 2024 (UTC)

Enwiki

  • Checked 54 pages and edited 41 pages. Added 2 {{dead link}}. Switched 4 |url-status=live to dead. Added 44 archive URLs (43 Wayback).

IABot DB

  • Checked and updated 84 unique URLs which will propagate across 300+ wikis.

 Done -- GreenC 00:29, 5 September 2024 (UTC)

nola.com/politics

This subpage appears to be dead. All links currently redirect to https://www.theadvocate.com/baton_rouge/news/politics/ (and it doesn't show the original article). I could not find the original articles by searching in theadvocate.com. There are 343 pages. Helpful Raccoon (talk) 23:23, 1 September 2024 (UTC)

Enwiki

  • Checked 342 pages and edited 320 pages. Added 30 {{dead link}}. Switched 36 |url-status=live to dead. Added 563 archive URLs (419 Wayback). Changed 22 citation metadata.

IABot DB

  • Checked and updated 677 unique links which will propagate to 300+ wikis

 Done -- GreenC 05:03, 5 September 2024 (UTC)