User:Mill 1/WikipediaReferences
“ | "I realised that deaths lists lend themselves for citation generation, perhaps uniquely so. | ” |
Inception
During my private project Chaining back the Years I noticed that a lot of articles that list the deceased per month ('dpms') lacked references. These references should state the date (and cause) of death of those listed (the 'entries').
In early 2020 I realised that deaths lists lend themselves to citation generation, perhaps uniquely so. I was already aware of the excellent archive API of The New York Times, so I experimented with some code that would ultimately become the WikipediaReferences application. This article describes its evolution and general algorithms.
Architecture
The .NET Core 3.1 application consists of three projects:
- A Web API that interacts with the NYT API and with Wikipedia
- A console application acting as a GUI
- A unit testing project, obviously
To not scare you, the reader, away I will not dwell too much on the technical details of the solution, or on why the console ended up with way too many responsibilities; I will focus on its functions instead.
Generating NYTimes references
The most-used functionality of the app is the 'Print month of death' option. When selected, it first resolves which dpm should be handled (see screenshot). It then generates a file containing wikitext in which NYT citations have been added to the corresponding entries. Sometimes existing references are also replaced by NYT refs in the file (more on that later). When the file contents are pasted into the dpm, the article is updated accordingly.
But how was this accomplished?
The challenge: matching the data
In the end two sources of data need to be connected: the NYTimes data (more specifically the obituaries) on the one hand and the Wikipedia biographies ('bios') on the other.
I could only generate a reference if I could find an NYT obituary stating the date of death of someone who has a bio on Wikipedia.
Several steps were necessary to accomplish this:
1. Add NYT obituaries to db
Sending the request
The initial step was to populate a database with NYT obituary data.[1] The Archive API returns an array of all NYT articles for a given month.[2] The response is in JSON and can be quite large (~20 MB). This is structured data. A typical article (or 'Doc') contains the following properties and metadata (a sketch of fetching and deserializing these follows the list):
- string @abstract
- string web_url
- string snippet
- string lead_paragraph
- string print_section
- string print_page
- string source
- object[] multimedia
- Headline headline
- Keyword[] keywords
- DateTime pub_date
- string document_type
- string news_desk
- string section_name
- Byline byline
- string type_of_material
- string _id
- int word_count
- string uri
- string subsection_name
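To give an idea of how these Docs could be retrieved, here is a minimal sketch of querying the Archive API for one month and deserializing the result. The wrapper types, the use of Newtonsoft.Json and the exact endpoint shape are my assumptions for illustration, not necessarily how WikipediaReferences does it.

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json;

// Subset of the Doc properties listed above; names match the JSON fields.
public class Doc
{
    public string web_url { get; set; }
    public string lead_paragraph { get; set; }
    public DateTime pub_date { get; set; }
    public string type_of_material { get; set; }
}

// Assumed wrapper types mirroring the JSON envelope of the Archive API.
public class ArchiveBody { public Doc[] docs { get; set; } }
public class ArchiveResponse { public ArchiveBody response { get; set; } }

public static class ArchiveClient
{
    private static readonly HttpClient http = new HttpClient();

    public static async Task<Doc[]> GetMonthAsync(int year, int month, string apiKey)
    {
        // One request returns every NYT article of the given month (~20 MB of JSON).
        string url = $"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={apiKey}";
        string json = await http.GetStringAsync(url);
        return JsonConvert.DeserializeObject<ArchiveResponse>(json).response.docs;
    }
}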
Filtering and matching the response
The first task was to retrieve the obituaries from this huge array of Docs.[3] Only entries that have a corresponding biography page on the English Wikipedia are listed in a dpm. So before the obituary data could be added I had to check that the deceased was on Wikipedia. I would retrieve the name of the obituary's subject from the Doc's metadata and perform the check.[4] The issue was that the NYT subject's name and the Wikipedia article name could differ, and it would be a shame to miss a lot of citations because of that. So, depending on the name, the application would check up to four name variations (see the sketch after this list). For instance the name James D. Hardy Jr. would be checked using these variations:
- James Hardy
- James D. Hardy
- James Hardy Jr.
- James D. Hardy Jr.
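A minimal sketch of how such variations can be generated. The method name and the exact rules (middle initials, Jr./Sr. suffixes) are illustrative assumptions rather than the app's actual code.

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NameVariations
{
    public static IEnumerable<string> Get(string fullName)
    {
        var variations = new HashSet<string> { fullName };

        // Version without middle initial(s): "James D. Hardy Jr." -> "James Hardy Jr."
        variations.Add(Regex.Replace(fullName, @"\s+[A-Z]\.(?=\s)", ""));

        // Versions without a trailing suffix such as "Jr." or "Sr."
        foreach (string name in variations.ToList())
            variations.Add(Regex.Replace(name, @"\s+(Jr\.|Sr\.)$", ""));

        return variations;
    }
}

For "James D. Hardy Jr." this produces exactly the four variations listed above.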
Sometimes resolving an NYT subject would lead to a disambiguation page like this one. Luckily these types of pages are highly standardized in format, which enabled the software to look for the entry in the disambiguation page whose stated year of death matched the year of the month I was processing. So if I was adding January 1995 to the references database, then George Price would be resolved as George Price (cartoonist).
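Roughly, and assuming the disambiguation page's wikitext is already at hand, that lookup could work as follows; the regex is a simplification of the real variety of disambiguation line formats.

using System.Text.RegularExpressions;

public static class Disambiguation
{
    // Scans disambiguation wikitext for a linked entry whose stated year of death
    // matches, e.g. "* [[George Price (cartoonist)]] (1901–1995), American cartoonist".
    public static string ResolveByDeathYear(string wikitext, int deathYear)
    {
        var regex = new Regex(@"\[\[([^\]|]+)[^\]]*\]\][^(\r\n]*\(\d{4}[–-]" + deathYear + @"\)");
        Match match = regex.Match(wikitext);
        return match.Success ? match.Groups[1].Value : null;
    }
}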
During this initial phase I also discovered some bugs in the Archive API itself, which I reported to the NYTimes staff and which have since been fixed. On July 8 the software finally produced something I could work with.
Resolving the date of death
The trickiest bit, however, was resolving the actual death date stated in the obituary. One would expect this fact to be present in the Doc's metadata, but alas. Thankfully the properties lead_paragraph and @abstract in most cases contained enough information to determine the person's date of demise. Easiest, of course, was when the death date was stated explicitly in the first paragraph:
"Ann Dunnigan, an actress and translator, died on Sept. 5 at her home in Manhattan. She was 87."[5]
Other examples:
- ...died Aug. 8
- ...died at his home in Fort Lauderdale, Fla., on Oct. 23
- ...was found dead on July 9 near Llangollen
However, in most cases the date of passing had to be deduced from the obituary, using the publication date as a reference point. There were two flavors:
1. Day names mentioned in the lead:
- ...died of undisclosed causes in a clinic outside Paris on Friday
- ...was declared dead on arrival on Saturday
- ...was killed in a traffic accident in Japan on Saturday
After I implemented the first version of the deduction algorithm I still couldn't resolve quite a few dates; see 'bucket section' 31/12/9999 in September 1997. Analysis of the bucket results pointed to the remaining indicators:
- ...died in his sleep yesterday
- ...died at a clinic here today
- ...was killed in a car crash early this morning
- ...died here this afternoon at home
I also had to figure out which words in combination with the date indicators would yield the best results. After much experimenting I came up with this regular expression:
Regex regex = new Regex(" (?:died|dead|killed) .{0,60}" + dayIndicator);
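Once the regex finds such a phrase, the matched day indicator still has to be translated into an actual calendar date. A sketch of that translation, counting back from the publication date; the exact offsets the app uses are an assumption here.

using System;

public static class DeathDateDeduction
{
    // Translates a matched day indicator into a date, relative to the
    // obituary's publication date.
    public static DateTime Resolve(DateTime pubDate, string dayIndicator)
    {
        switch (dayIndicator.ToLowerInvariant())
        {
            case "yesterday":
                return pubDate.AddDays(-1);
            case "today":
            case "this morning":
            case "this afternoon":
                return pubDate;
            default:
                // A weekday name such as "Friday": walk back until it matches.
                DateTime date = pubDate.AddDays(-1);
                while (!date.DayOfWeek.ToString().Equals(dayIndicator, StringComparison.OrdinalIgnoreCase))
                    date = date.AddDays(-1);
                return date;
        }
    }
}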
By 11 July 2020 I could resolve most dates. I drew the line at trying to crack these:
- ...shot to death yesterday
- ...leaped to his death from a rooftop late Saturday[6]
- ...committed suicide on Thursday
2. Check the entries of a specific month
With our obituary data securely stored in the database we can now use it to compile citations when generating the wikicode for a dpm. The general process can be divided into these tasks:
Perform checks
After the dpm had been parsed into entries as software objects, the tool would validate its data:
- Are any of the entries red links? Remove them from the list manually if found.
- Does the dpm contain duplicate entry names? If so, get rid of them (a sketch of this check follows the list).
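For the duplicate check a short LINQ query suffices. This sketch assumes the Entry type exposes a Name property; the real type has more members.

using System.Collections.Generic;
using System.Linq;

// Assumed shape of a parsed dpm entry, for illustration only.
public class Entry { public string Name { get; set; } }

public static class EntryChecks
{
    // Returns every name that occurs more than once among the parsed entries.
    public static IEnumerable<string> FindDuplicateNames(IEnumerable<Entry> entries) =>
        entries.GroupBy(e => e.Name)
               .Where(g => g.Count() > 1)
               .Select(g => g.Key);
}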
Process the references per day of the month
For each day of the processed month the app checks if NYT obituary data exists for the existing entries. If so, the reference wikitext is generated. It will be used as a citation in the next two situations:
- The dpm entry does not have a reference; the citation will be added.
- The dpm entry has a reference which is more susceptible to link rot than the NYT ones; the citation will be updated.[7]
Note: if NYT obituary data is found for a bio which is not present in the dpm AND the app evaluates it as sufficiently notable,[8] then a warning is displayed urging the user (me) to consider adding the bio as an entry to the dpm.
Code excerpt:
private void EvaluateDeathsPerMonthArticle(int year, int monthId, IEnumerable<Entry> entries)
{
    UI.Console.WriteLine("Evaluating the entries...");
    var references = GetReferencesPermonth(year, monthId);

    for (int day = 1; day <= DateTime.DaysInMonth(year, monthId); day++)
    {
        UI.Console.WriteLine($"\r\nChecking nyt ref. date {new DateTime(year, monthId, day).ToShortDateString()}");
        IEnumerable<Reference> referencesPerDay = references.Where(r => r.DeathDate.Day == day);

        foreach (var reference in referencesPerDay)
            HandleReference(reference, entries);
    }
}
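HandleReference itself is not shown in the excerpt. Based on the two situations listed above, it could look roughly like this; ArticleTitle, ToWikitext and IsMoreLinkRotProne are assumed names, not the app's actual members.

private void HandleReference(Reference reference, IEnumerable<Entry> entries)
{
    // Match the obituary subject to a dpm entry by name (assumed property names).
    Entry entry = entries.FirstOrDefault(e => e.Name == reference.ArticleTitle);

    if (entry == null)
    {
        // The note above: an obituary was found, but no entry exists in the dpm.
        UI.Console.WriteLine($"Consider adding {reference.ArticleTitle} to the dpm.");
        return;
    }

    if (entry.Reference == null)
        entry.Reference = reference.ToWikitext();       // situation 1: add the citation
    else if (IsMoreLinkRotProne(entry.Reference))
        entry.Reference = reference.ToWikitext();       // situation 2: replace the citation
}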
3. Generate the wikitext for the dpm
In the early stages the output generated by the app was quite crude. When the wikitext was pasted into my NYT references page it looked like this. The listed items still needed to be matched manually and the references pasted by hand into the processing page. This situation persisted until October 4 2020.
By that time I had got so fed up with this manual work that I spent time on automating the wikicode generation. By November 14 work had progressed considerably: I could now copy the wikitext output by the app to a text file and paste the text into the dpm. After saving the edits, the generated citations would be added. This also signalled the start of Round 2.
Generating other types of references
During the course of this endeavour I had been adding numerous citations to entries manually. And although the Wikipedia:RefToolbar is a great help, I got increasingly frustrated with tediously filling in the fields. Since sportspeople die too, I had been using the subsites of Sports Reference extensively as citation sources. I had already noticed that the layout and HTML of those sites were very similar. Another time-saving idea popped up in my head: why not extend the application so that it facilitates generating references other than NYTimes ones?
Data extraction
I delved into website data extraction and settled on the Html Agility Pack as the weapon of choice. After some experimenting I was able to grab the citation data of the following sites by entering only the person's id as input in the console app:
As you can see in the image, the generated wikitext is displayed in green in the console. The text could now be copied and pasted as a citation. It worked so well that in time I added three other sources for references:
Since the data was already present, I also added an option to generate specific NYTimes references.
Code excerpt regarding Olympedia.org:
public void GenerateOlympediaReference()
{
    // Build the URL from the athlete id entered in the console.
    string url = GetReferenceUrl("http://www.olympedia.org/athletes/", "Olympedia Id: (f.i.: 73711)");
    var rootNode = GetHtmlDocRootNode(url);

    // Read the 'biodata' table into key/value pairs, one per table row.
    var table = rootNode.Descendants(0).First(n => n.HasClass("biodata"))
        .Descendants("tr")
        .Select(tr =>
        {
            var key = tr.Elements("th").Select(td => td.InnerText).First();
            var value = tr.Elements("td").Select(td => td.InnerText).First();
            return new KeyValuePair<string, string>(key, value);
        }).ToList();

    string usedName = table.First(kvp => kvp.Key == "Used name").Value;

    var reference = GenerateWebReference($"Olympedia – {usedName}", url, "olympedia.org",
        DateTime.Today, DateTime.MinValue, publisher: "[[OlyMADMen]]");

    UI.Console.WriteLine(ConsoleColor.Green, reference);
}
Although I still have to create refs manually, this final piece of functionality expedited citation generation to a level that was acceptable to me.
References
- ^ See screendump; this functionality could be accessed via option 'Add NYT obituaries to db'
- ^ The fact that the NYT API can be queried per month and that the dpms list dead people per month is purely coincidental and in the end added no benefit.
- ^ The following LINQ query worked quite well:
return articleDocs.Where(d => d.type_of_material.Contains("Obituary")).AsEnumerable().OrderBy(d => d.pub_date);
- ^ Determining the existence of an article on Wikipedia was done in a pretty straightforward way: just send a GET request with the subject's name as the title and check the response. Any 200 response which is not a redirect means a bio is present.
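As an illustration, here is the same check done via the MediaWiki query API, which reports missing pages and redirects explicitly; a sketch only, assuming Newtonsoft.Json, and not the raw GET the app actually uses:

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class WikipediaCheck
{
    public static async Task<bool> BioExistsAsync(HttpClient http, string title)
    {
        string url = "https://en.wikipedia.org/w/api.php?action=query&format=json&redirects=1" +
                     "&titles=" + Uri.EscapeDataString(title);
        JObject query = (JObject)JObject.Parse(await http.GetStringAsync(url))["query"];

        // A followed redirect shows up in query.redirects; a missing page carries a "missing" flag.
        bool redirected = query["redirects"] != null;
        bool missing = query["pages"].First.First["missing"] != null;
        return !missing && !redirected;
    }
}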
- ^ "Ann Dunnigan, Actress and Translator, 87". The New York Times. 12 September 1997. p. B 8. Retrieved 4 July 2023.
- ^ The preceding word 'death' resulted in a lot of false positives (example)
- ^ This check is done by evaluating the source cited by the existing reference. I will not explain the entire rule set here. Some rules:
- references citing web sites are replaced except when deemed sufficiently reliable (e.g. Rolling Stone, Britannica.com etc.)
- references citing books and journals are never replaced
- references citing news sites are not replaced when deemed sufficiently reliable/established (e.g. BBC News, LA Times, The Independent, The Guardian etc.)
- ^ The notability algorithm applied was subject to change during the lifespan of the application. Therefore it will not be elaborated on here.
- ^ After initial success, changes in the LoC's site resulted in 418 (I'm a teapot) responses. I have not been able to solve this.