Wikipedia:Analysis of citation issues for date and year articles

It's been brought up that most of the year/date articles lack inline citations; as of now, referencing guidelines have specific exemptions for these articles. Some people want to change this, on the basis that all text in the encyclopedia should be referenced to something.

Other people say that, since these entries all point to articles which have citations for the information in question, this is a pointless waste of time.

My personal opinion is that, if it's even remotely feasible, we should try to go through and cite the content in these articles.

The question, then, is whether this is remotely feasible. Let's investigate!

Number of articles

| | Days | Years | Decades (← 600 BC) | Decades (1800 →) | Cent. | Mill. (prose) | Mill. (list) | Total |
|---|---|---|---|---|---|---|---|---|
| Articles | 366 | 2,736 | 130 | 24 | 61 | 6 | 7 | 3,330 |

I could not find any articles for specific days of specific years: for example, January 1, 2000 redirects to Year 2000 problem, July 20 1969 redirects to Apollo 11, and September 11, 2001 redirects to September 11 attacks. I will assume that these "aren't a thing". (NB: I was later to learn that they were all redirected following a couple of discussions many years ago!)

There are, however, articles for every day of the year (January 1 through December 31, including February 29). The three I just linked are going to be abnormally long, since they're unusual days (the first of the year, the last of the year, and the anomalous day that happens once every four years). There are 366 dates.

As for years, we have articles for all of them between 708 BC and 2021. There are also articles for years going up to 2029 (they haven't happened yet, but including them makes virtually no impact on the projections, so why the hell not). Since 0 AD did not exist, this gives us 2,736 year articles.

There are also articles for decades and centuries, which are generally the same "deal" as years (they list significant events that happened, and when they happened). Consulting List of decades, centuries, and millennia, we can see that there are 13 millennium articles, 61 century articles, and 384 decade articles.

Many of the decade articles are autofilled by transcluding events from year articles (see the section below for more information); since the information in these ones is the exact same as in the respective year articles (which we already counted), we can ignore these ones, which leaves us with 154.

Adding all the decade, century and millennium articles to the day and year articles, we end up with 3,560 in total; leaving out the 230 decade articles that are produced almost entirely by transcluding other pages (which we'll talk about more in a little bit) gives us 3,330 pages. Wowie zowie!
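The arithmetic behind these totals can be checked with a short sketch. The counts are the ones given in the text; the figure of 230 transcluded decade pages comes from the decades section further down, and the variable names are only illustrative.

```python
# Article counts as stated in the text.
counts = {
    "days": 366,
    "years": 2736,
    "decades": 384,
    "centuries": 61,
    "millennia": 13,
}
# Decade pages filled almost entirely by transclusion (see the decades section).
transcluded_decades = 230

total_all = sum(counts.values())
remaining_decades = counts["decades"] - transcluded_decades
total_to_review = total_all - transcluded_decades

print(total_all, remaining_decades, total_to_review)  # 3560 154 3330
```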

Estimation of scope

My strat here is going to be looking at a few random dates, then shamelessly extrapolating this into broad generalizations. To do this, I will just look at the source of a few articles of each category and see how many bullet points it has (some entries start with ** instead of *, so double-asterisk and triple-asterisk strings will be find-replaced to a single asterisk before counting them).
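The counting step described above could be sketched roughly as follows. This is not the procedure actually used for the tables (that was done by hand with find-and-replace); the function name and the way references are counted (opening `<ref>` tags, including named reuses) are assumptions made for illustration.

```python
import re


def count_entries_and_refs(wikitext: str) -> tuple[int, int]:
    """Count bulleted list entries and <ref> tags in a page's wikitext."""
    # Collapse nested bullets ("**", "***") at the start of a line to a single "*".
    flattened = re.sub(r"^\*+", "*", wikitext, flags=re.MULTILINE)
    entries = sum(1 for line in flattened.splitlines() if line.startswith("*"))
    # Count opening <ref> tags, including self-closing reuses like <ref name="x" />.
    # "<references />" and closing "</ref>" tags are not matched.
    refs = len(re.findall(r"<ref[\s>/]", flattened, flags=re.IGNORECASE))
    return entries, refs


sample_wikitext = """== Events ==
* 1066: Something happens.<ref>Source A</ref>
** A nested detail.
*** A deeper detail.<ref name="b" />
Not a list line.
<references />
"""

entries, refs = count_entries_and_refs(sample_wikitext)
print(entries, refs, entries - refs)  # 3 2 1
```

Subtracting references from entries only approximates the number of uncited entries, since one entry may carry several references while others carry none.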

Days

For date articles, of which there are 366, we can probably just pick one from each month and assume that those twelve are representative of date articles in general. Now, we don't want to pick the first of each month, because more stuff will happen then, and we don't want to pick the ends of weeks, or round numbers, so they ought to be actually random. I found a website that claims to generate random dates, and it gave me a few.

| Date | Bytes | Entries | Refs | # not cited | % not cited |
|---|---|---|---|---|---|
| January 4 | 95,204 | 289 | 267 | 22 | 7.6 |
| February 3 | 34,870 | 298 | 26 | 272 | 91.3 |
| March 28 | 41,729 | 400 | 19 | 381 | 95.3 |
| April 2 | 40,172 | 368 | 34 | 334 | 90.8 |
| May 13 | 47,144 | 443 | 20 | 423 | 95.5 |
| June 27 | 35,133 | 350 | 15 | 335 | 95.7 |
| July 22 | 36,097 | 349 | 14 | 335 | 96.0 |
| August 25 | 41,414 | 405 | 16 | 389 | 96.0 |
| September 20 | 44,192 | 386 | 41 | 345 | 89.4 |
| October 9 | 36,120 | 391 | 9 | 382 | 97.7 |
| November 30 | 47,245 | 398 | 38 | 360 | 90.5 |
| December 1 | 38,920 | 363 | 15 | 348 | 95.9 |
| Total | 538,240 | 4,440 | 514 | 3,926 | 88.4 |
| Average | 44,853 | 370 | 43 | 327 | 88.4 |
| All 366 | 16,416,320 | 135,420 | 15,677 | 119,743 | 88.4 |
| w/o Jan 4 | 14,741,016 | 138,115 | 8,218 | 129,896 | 94.0 |

Note that January 4 has a bizarrely high number of citations (about 92% of its entries are cited, while every other date sampled sits somewhere between 2% and 11%). Since it seems like an anomaly to me, I've included a second projection based on an average excluding it (note how much it changes the averages).
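For anyone who wants to re-run the extrapolation, here is a minimal sketch: sum the sample, then scale by the ratio of all 366 date articles to the number sampled. The entry and reference counts are copied from the table; the function and variable names are made up for illustration, and rounding may differ from the table by one.

```python
# (date, entries, refs) for the twelve sampled days, copied from the table.
sample = [
    ("January 4", 289, 267), ("February 3", 298, 26), ("March 28", 400, 19),
    ("April 2", 368, 34), ("May 13", 443, 20), ("June 27", 350, 15),
    ("July 22", 349, 14), ("August 25", 405, 16), ("September 20", 386, 41),
    ("October 9", 391, 9), ("November 30", 398, 38), ("December 1", 363, 15),
]
TOTAL_DAYS = 366


def project(rows, population=TOTAL_DAYS):
    """Scale sample totals up to the whole population of date articles."""
    entries = sum(e for _, e, _ in rows)
    refs = sum(r for _, _, r in rows)
    uncited = entries - refs
    scale = population / len(rows)
    return {
        "entries": round(entries * scale),
        "uncited": round(uncited * scale),
        "pct_uncited": round(100 * uncited / entries, 1),
    }


with_outlier = project(sample)
without_outlier = project([row for row in sample if row[0] != "January 4"])
print(with_outlier)
print(without_outlier)
```

Dropping a single article out of twelve moves the projected uncited count by roughly ten thousand, which is why both projections are shown.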

Years

Extrapolating these numbers out to the year/decade articles is problematic for a number of reasons; primarily, because most of them aren't that big. While the worst-case scenario is mind-blowingly bad (all year lists having the same number of entries as date lists would yield roughly a million entries, around 900,000 of them uncited), it isn't very likely.

While the size of date articles is largely random, the size of year articles is not; the further back a year is, the less likely records are to exist. Articles like 2020, for example, are extremely anomalous both in their number of entries and the percentage carrying citations. Moreover, the size of year articles is determined by a multitude of factors: the number of people living in the world and causing events to happen, the advancement of recording technologies that allow those events to be documented, and the willingness of Wikipedia editors to compile chronicles of any given year all have an impact.

So instead of doing that, we can concoct an estimate of what we're dealing with as far as years go, using a "random" sample of one year from each century all the way forward and back. Since this is getting boring, I will select a few years fairly arbitrarily.

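| Year | Bytes | Entries | Refs | # not cited | % not cited |
|---|---|---|---|---|---|
| 666 BC | 706 | 2 | 0 | 2 | 100.0 |
| 579 BC | 796 | 2 | 0 | 2 | 100.0 |
| 436 BC | 1,222 | 4 | 0 | 4 | 100.0 |
| 303 BC | 2,031 | 6 | 0 | 6 | 100.0 |
| 259 BC | 1,580 | 4 | 0 | 4 | 100.0 |
| 177 BC | 1,425 | 5 | 0 | 5 | 100.0 |
| 69 BC | 2,506 | 14 | 2 | 12 | 85.7 |
| AD 1 | 5,521 | 20 | 2 | 18 | 90.0 |
| AD 128 | 2,809 | 10 | 3 | 7 | 70.0 |
| AD 256 | 3,219 | 16 | 0 | 16 | 100.0 |
| AD 371 | 2,548 | 14 | 2 | 12 | 85.7 |
| AD 420 | 3,673 | 23 | 2 | 21 | 91.3 |
| AD 534 | 4,330 | 20 | 1 | 19 | 95.0 |
| AD 666 | 2,889 | 11 | 3 | 8 | 72.7 |
| AD 763 | 3,451 | 15 | 2 | 13 | 86.7 |
| AD 811 | 5,069 | 19 | 7 | 12 | 63.2 |
| AD 987 | 5,719 | 26 | 5 | 21 | 80.8 |
| 1024 | 4,496 | 24 | 2 | 22 | 91.7 |
| 1111 | 6,031 | 33 | 5 | 28 | 84.8 |
| 1234 | 3,334 | 20 | 1 | 19 | 95.0 |
| 1337 | 3,685 | 24 | 4 | 20 | 83.3 |
| 1420 | 4,985 | 32 | 3 | 29 | 90.6 |
| 1572 | 15,051 | 119 | 4 | 115 | 96.6 |
| 1666 | 14,494 | 98 | 11 | 87 | 88.8 |
| 1776 | 47,599 | 476 | 16 | 460 | 96.6 |
| 1816 | 16,329 | 103 | 14 | 89 | 86.4 |
| 1969 | 103,762 | 1,158 | 19 | 1,139 | 98.4 |
| 2008 | 58,111 | 494 | 66 | 428 | 86.6 |
| 2020 | 321,430 | 929 | 772 | 157 | 16.9 |
| Total | 648,801 | 3,721 | 946 | 2,775 | 74.6 |
| Average | 22,372 | 128 | 33 | 96 | 74.6 |
| All 2728 | 61,032,039 | 350,031 | 88,989 | 261,041 | 74.6 |
| All 2736 | 61,211,018 | 351,057 | 89,250 | 261,807 | 74.6 |
| w/o 2020 | 31,988,823 | 272,818 | 17,002 | 255,816 | 93.8 |
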
NB: Whether I count only the 2,728 years up to 2021, or the 2,736 years up to 2029 makes basically no difference, so I might as well just future-proof this essay for another decade by including the latter figures.

Later years have an outsized effect on these numbers: 2020 brings up the average number of entries considerably (from 100 to 128), and if it is excluded, the estimate for uncited entries drops from ≈261k to ≈255k. If 2020 and 2008 are excluded, it drops to ≈221k (with 85 entries on average); if 1969 is removed as well, it drops to 110k (with only 44 entries on average).
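This kind of leave-out check is easy to repeat. The sketch below removes the three largest sampled articles one at a time, cumulatively, and re-projects the uncited count. The sample totals and per-year figures are copied from the table; the population of 2,728 year articles is an assumption chosen here, and passing 2736 instead gives the slightly higher figures of the "All 2736" row.

```python
SAMPLE_SIZE = 29      # sampled year articles in the table
TOTAL_ENTRIES = 3721  # "Total" row: entries
TOTAL_UNCITED = 2775  # "Total" row: entries without a citation
# Largest sampled articles as (year, entries, uncited), removed cumulatively.
OUTLIERS = [("2020", 929, 157), ("2008", 494, 428), ("1969", 1158, 1139)]


def sensitivity(population):
    """Return (label, average entries, projected uncited) after each removal."""
    n, entries, uncited = SAMPLE_SIZE, TOTAL_ENTRIES, TOTAL_UNCITED
    results = [("full sample", round(entries / n), round(uncited / n * population))]
    for year, e, u in OUTLIERS:
        # Each step also keeps the previous removals in effect.
        n, entries, uncited = n - 1, entries - e, uncited - u
        results.append(
            (f"also without {year}", round(entries / n), round(uncited / n * population))
        )
    return results


for label, avg_entries, projected in sensitivity(population=2728):
    print(f"{label:>18}: {avg_entries:>4} entries/article, ~{projected:,} uncited")
```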

Because of the extreme influence of 20th- and 21st-century articles on averages, a previous version of this section, which made an estimate based on a far smaller sample of year articles, had projected a much higher number of entries (≈900k) as well as a much higher number of uncited entries (≈380k – 600k).

But anyway, let's move on to the other stuff.

Decades

The decade articles, of which there are 384, are more complicated: these tend to all contain an "Events", "Significant people", "Births" and "Deaths" section (at the minimum). On many articles (for example, 210s), three of these sections use templates to transclude content from individual year articles. By looking at transclusions† for {{Births and deaths by year for decade}}, {{Events by year for decade}}, and {{Events by year for decade BC}}, we can see that everything between 490s BC and 1790s contains these templates.

† More detail on transclusion (largely irrelevant)

Below is a table of the three decade transclusion templates, and which mainspace articles they are on (according to WhatLinksHere and TransclusionCount, which seem to disagree):

| Template | WLH | TC | Earliest | Latest |
|---|---|---|---|---|
| {{Births and deaths by year for decade}} | 228 | 230 | 490s BC | 1790s |
| {{Events by year for decade BC}} | 50 | 51 | 490s BC | 0s BC |
| {{Events by year for decade}} | 179 | 181 | 0s | 1790s |

Note 1: {{Births and deaths by year for decade}} originally cut off at the 1350s for no apparent reason, so I decided to see what was going on with subsequent articles. Many of them either had nothing for births and deaths, or contained information nearly identical to that on the individual year pages, copied and pasted over, so I began checking whether they corresponded with the transcluded lists provided by that template. For the most part they did, although a couple of entries were missing from the individual year lists (so I added them). Others, however, were quite wrong: the articles on Dafydd ap Gwilym, James Audley, and Nissim of Gerona place their deaths in totally different decades (let alone the specific years they were listed under in the 1380s article). At any rate, I've verified consistency between the births and deaths sections of the decade articles and the individual years, and added the template to every decade up to the 1790s (except the 1770s, where for some reason the template refused to transclude and I had to use {{transclude births}} and {{transclude deaths}} for each year instead).
Note 2: There is some kind of weird error I don't understand; while there are 230 articles in that decade range, neither of the measures seems accurate. For {{Births and deaths by year for decade}}, there should be 229 transclusions (it couldn't be used on 1770s), but WhatLinksHere gives 228 and TransclusionCount gives 230. Similarly, WhatLinksHere gives a sum of 229 transclusions for the "events by year" templates, and TC gives 232. I've got no idea what's up with that.

Out of 384 decade pages, 230 are like this, rendering them a "who cares" situation (of course, they still have "Significant people" and "World leaders" sections, but let's say we ignore those for the time being); only 154 remain. Of these, 130 fall prior to the 5th century BC, and 24 fall after the 18th century; it goes without saying that the older ones are much sparser in content.

By this point, you know the drill: here are some samples for the decade pages prior to 490s BC.

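| Decade | Bytes | Entries | Refs | # not cited | % not cited | Bytes / entry |
|---|---|---|---|---|---|---|
| 1710s BC | 531 | 3 | 0 | 3 | 100.0 | 177 |
| 1610s BC | 368 | 1 | 0 | 1 | 100.0 | 368 |
| 1510s BC | 590 | 1 | 0 | 1 | 100.0 | 590 |
| 1410s BC | 443 | 3 | 0 | 3 | 100.0 | 147.67 |
| 1310s BC | 1,189 | 4 | 3 | 1 | 25.0 | 297.25 |
| 1210s BC | 1,359 | 6 | 0 | 6 | 100.0 | 226.5 |
| 1110s BC | 396 | 2 | 0 | 2 | 100.0 | 198 |
| 1010s BC | 610 | 3 | 0 | 3 | 100.0 | 203.33 |
| 910s BC | 706 | 3 | 1 | 2 | 66.7 | 235.33 |
| 810s BC | 722 | 3 | 0 | 3 | 100.0 | 240.67 |
| 710s BC | 1,824 | 17 | 0 | 17 | 100.0 | 107.29 |
| 610s BC | 2,105 | 19 | 1 | 18 | 94.7 | 110.79 |
| 510s BC | 2,790 | 21 | 1 | 20 | 95.2 | 132.86 |
| Total | 13,633 | 86 | 6 | 80 | 93.0 | 158.52 |
| Average | 1,049 | 7 | 0 | 6 | 93.0 | 158.52 |
| All 130 | 136,330 | 860 | 60 | 800 | 93.0 | 158.52 |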

I really wasn't kidding when I said they were sparse: by my estimation, all of the 130 decade pages from this period put together have about eight hundred missing citations, which is fewer than the number in a single decade's worth of year pages (which have an average of 96 missing citations each)!

Now here are some articles representing the decades from 1800s onward:

| Decade | Bytes | Entries | Refs | # not cited | % not cited | Bytes / entry |
|---|---|---|---|---|---|---|
| 1810s | 51,496 | 191 | 18 | 173 | 90.6 | 269.61 |
| 1860s | 16,210 | 74 | 2 | 72 | 97.3 | 219.05 |
| 1910s | 26,757 | 224 | 13 | 211 | 94.2 | 119.45 |
| Total | 94,463 | 489 | 33 | 456 | 93.3 | 193.18 |
| Average | 31,488 | 163 | 11 | 152 | 93.3 | 193.18 |
| All 24 | 755,704 | 3,912 | 264 | 3,648 | 93.3 | 193.18 |
| First 12 | 377,852 | 1,956 | 132 | 1,824 | 93.3 | 193.18 |

You'll notice I have provided a second projection, which only gives the first 12 decades in this series (1800s–1910s). This is because pages for modern decades present a special challenge: as we get closer to the present day, they stop being simple lists of events, and start becoming articles in their own right. For example, 2010s is 321,395 bytes long and has 556 citations, but only contains 34 bullet points. The majority of the article is written in prose, or contained in specially constructed tables; there's no way to apply the bird's-eye view of assessing citation density by tallying up list entries and counting numbered references. Consequently, I'm heavily tempted to say that every decade article from the mid-20th century onward is beyond the scope of this analysis and needs to be evaluated individually.

Centuries

Not a whole lot to say about these. There are 61 of them; sampling every tenth one manages to capture the earliest (40th century BC) and the latest (21st century). Unlike the decades, these do not turn into full prose articles; the 20th and 21st centuries are still basically lists of events.

| Century | Bytes | Entries | Refs | # not cited | % not cited | Bytes / entry |
|---|---|---|---|---|---|---|
| 40th BC | 3,883 | 20 | 5 | 15 | 75.0 | 194.15 |
| 30th BC | 3,283 | 27 | 3 | 24 | 88.9 | 121.60 |
| 20th BC | 5,620 | 35 | 4 | 31 | 88.6 | 160.57 |
| 10th BC | 5,943 | 40 | 2 | 38 | 95.0 | 148.58 |
| 1st | 20,048 | 160 | 15 | 145 | 90.6 | 125.3 |
| 11th | 73,681 | 530 | 42 | 488 | 92.1 | 139.02 |
| 21st | 117,700 | 417 | 82 | 335 | 80.3 | 282.25 |
| Total | 230,158 | 1,229 | 153 | 1,076 | 87.6 | 187.3 |
| Average | 32,880 | 176 | 22 | 154 | 87.6 | 187.3 |
| All 61 | 2,005,663 | 10,710 | 1,333 | 9,377 | 87.6 | 187.3 |

Millennia

There are only 13 millennium articles, so sampling isn't necessary; we can just look at each article individually. The six from the 10th BC through the 5th BC are prose articles, with all statements appropriately sourced. The ones that can be assessed as list articles, then, are the seven afterwards.

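| Millennium | Bytes | Entries | Refs | # not cited | % not cited | Bytes / entry |
|---|---|---|---|---|---|---|
| 10th BC | 23,294 | 40 | 40 | 0 | 0.0 | 582.4 |
| 9th BC | 24,747 | 40 | 40 | 0 | 0.0 | 618.7 |
| 8th BC | 13,575 | 20 | 20 | 0 | 0.0 | 678.8 |
| 7th BC | 7,310 | 13 | 13 | 0 | 0.0 | 562.3 |
| 6th BC | 9,425 | 17 | 17 | 0 | 0.0 | 554.4 |
| 5th BC | 8,657 | 16 | 16 | 0 | 0.0 | 541.1 |
| 4th BC | 17,733 | 97 | 14 | 83 | 85.6 | 182.8 |
| 3rd BC | 17,642 | 129 | 11 | 118 | 91.5 | 136.8 |
| 2nd BC | 18,052 | 58 | 6 | 52 | 89.7 | 311.2 |
| 1st BC | 29,177 | 186 | 10 | 176 | 94.6 | 156.9 |
| 1st | 40,467 | 145 | 15 | 130 | 89.7 | 279.1 |
| 2nd | 22,395 | 166 | 6 | 160 | 96.4 | 134.9 |
| 3rd | 39,316 | 133 | 54 | 79 | 59.4 | 295.6 |
| Total (all 13) | 271,790 | 1,060 | 262 | 798 | 75.3 | 256.4 |
| Average (all 13) | 20,907 | 82 | 20 | 61 | 75.3 | 256.4 |
| First 6 | 87,008 | 146 | 146 | 0 | 0.0 | 596.0 |
| Average (first 6) | 14,501 | 24 | 24 | 0 | 0.0 | 596.0 |
| Last 7 | 184,782 | 914 | 116 | 798 | 87.3 | 202.2 |
| Average (last 7) | 26,397 | 131 | 17 | 114 | 87.3 | 202.2 |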

Summary

| (rounded estim.) | Days | Years | Decades (← 600 BC) | Decades (1800 →) | Cent. | Mill. (prose) | Mill. (list) | Total |
|---|---|---|---|---|---|---|---|---|
| Articles | 366 | 2,736 | 130 | 24 | 61 | 6 | 7 | 3,330 |
| Entries / article | 370 | 128 | 7 | 163 | 176 | 24 | 131 | 151.1 |
| Total entries | 135k | 351k | 0.8k | 3.9k | 10k | 146 | 914 | 503,019 |
| Total citations | 15.7k | 89.3k | 60 | 264 | 1.3k | 146 | 116 | 106,846 |
| % not cited | 88.4 | 74.6 | 93.0 | 93.3 | 87.6 | 0.0 | 87.3 | 78.8% |
| # not cited | 120 – 130k | 260k | 0.8k | 1.8 – 3.6k | 9.3k | 0 | 798 | 392 – 404k |

Based on the analysis above, which I'm sure could be refined further (but I don't think is off by an appreciable amount), we have somewhere around four hundred thousand uncited statements between all the day, year, decade, century and millennium articles.
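The grand totals can be recomputed from the per-category projections with a few lines. The figures are the projected entry and citation counts from the individual tables (the "All ..." rows, plus the exact millennium counts); this sketch uses the scenario that includes January 4, 2020 and all 24 modern decades, and the dictionary keys are just illustrative labels.

```python
# Projected (entries, citations) per category, from the tables in each section.
projections = {
    "days": (135_420, 15_677),
    "years": (351_057, 89_250),
    "decades_early": (860, 60),
    "decades_modern": (3_912, 264),
    "centuries": (10_710, 1_333),
    "millennia_prose": (146, 146),
    "millennia_list": (914, 116),
}

total_entries = sum(e for e, _ in projections.values())
total_citations = sum(c for _, c in projections.values())
total_uncited = total_entries - total_citations
pct_uncited = round(100 * total_uncited / total_entries, 1)

print(total_entries, total_citations, total_uncited, pct_uncited)
# 503019 106846 396173 78.8
```

Because this scenario keeps the heavily cited outliers in the sample, the result lands inside the range quoted in the summary rather than at its upper end.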

For reference, as of the time of writing (February 2021), {{Citation needed}}'s transclusion count indicates we have just over 455,000 instances of it.

Of course, there are some mitigating factors that make these a little less bad than {{cn}} transclusion (i.e. each date list entry should link to an article containing the relevant citation, which can simply be copied out); nonetheless, it does look like we are dealing with a problem roughly comparable in scope to the entirety of tagged uncited statements in the whole encyclopedia.

For comparison, the Guild of Copy Editors, over the course of over ten years, has succeeded in getting a backlog of 9,000 articles down to almost zero.

What is to be done?

Some random stuff that has popped into my head:

  • I'm not quite convinced anything really needs to be done. Year and date articles don't seem to be very commonly used; if someone is using them for something really mission-critical, they only link to people who have articles written about them anyway, and the information can easily be verified/debunked based on those articles. This seems like a fairly distinct situation, compared to other Wikipedia pages. A good comparison might be, say, List of people from Sacramento, California: sure, there are lots of citations here, but every entry doesn't carry one, and why should it? Who cares? You can go to their articles and find out.
  • This whole situation seems like a legacy issue from the way things were a long time ago (inline citations being a luxury option for most articles). If these articles were all being created today, it wouldn't be that hard to just go through and add a citation for every entry; the main issue seems to be that they've gotten this way over the course of twenty years. If we were to go through and verify every single entry on all these articles, even if we completely stopped monitoring them all, it would probably take at least another twenty years for things to get this way again (by which time I'm sure we will have some way to reliably parse language and make large amounts of edits other than manually opening up browser windows, highlighting text and slapping ctrl+c and ctrl+v).
  • It seems like lots of this information should be possible to automatically extract from biography articles now; I don't know whether that's infoboxes, categories, or some kind of fancy language parsing. However, if we're talking about a half million missing citations, I think it'd be well worth the investment to spend time on some solution that had even a tiny impact. Let's say a program is able to fill in citations on date/year links... but only if the person's birth year in their article had a direct inline citation... to a machine-readable website that directly mentioned it as their birth year... from a small whitelist of reliable sources. Maybe this only happens on one out of every hundred entries. But that's still five thousand citations. How many days of work is that? Probably at least a few.
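The filtering idea in the last bullet could look something like the sketch below. Everything here is hypothetical: the `BirthClaim` structure, the `TRUSTED_DOMAINS` whitelist and the `reusable_citation` function are invented for illustration, no such tool is known to exist, and the check that the source page itself states the birth year in machine-readable form is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

# Hypothetical whitelist; which sources qualify would be an editorial decision.
TRUSTED_DOMAINS = {"example.org", "example.edu"}


@dataclass
class BirthClaim:
    """A birth year as stated in a biography article (hypothetical structure)."""
    person: str
    year: int
    citation_url: Optional[str]  # inline citation attached to the birth year, if any


def reusable_citation(claim: BirthClaim, listed_year: int) -> Optional[str]:
    """Return a citation URL that could be copied to the year article, or None."""
    if claim.citation_url is None or claim.year != listed_year:
        return None
    host = urlparse(claim.citation_url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return claim.citation_url if host in TRUSTED_DOMAINS else None


claims = [
    BirthClaim("Person A", 1901, "https://www.example.org/bio/a"),
    BirthClaim("Person B", 1901, None),                         # no inline citation
    BirthClaim("Person C", 1902, "https://example.edu/c"),      # year mismatch
    BirthClaim("Person D", 1901, "https://blog.invalid/d"),     # not whitelisted
]
usable = [c.person for c in claims if reusable_citation(c, listed_year=1901)]
print(usable)  # ['Person A']

# Back-of-the-envelope yield from the bullet: 1 in 100 of about half a million entries.
print(500_000 // 100)  # 5000
```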

Conclusions

I guess my opinion of this is that, barring some technological solution that allows portions of this workload to be automated (which may emerge sooner rather than later), it would be unimaginably time-consuming to cite all the date/year/etc articles (potentially involving an effort comparable to the repair of all uncited statements in the whole of Wikipedia).

I would not recommend going through and fixing these by hand, when the effort involved could be used on any number of other tasks.


jp×g, 2021