Wikipedia:Reliability of open government data

This is an essay.

It contains the advice or opinions of one or more Wikipedia contributors. This page is not an encyclopedia article, nor is it one of Wikipedia's policies or guidelines, as it has not been thoroughly vetted by the community. Some essays represent widespread norms; others only represent minority viewpoints.

Shortcut: WP:ROGD

This page in a nutshell: Wikipedia cites and republishes various governmental data, but "official" does not automatically mean "reliable". Such data should be cautiously assessed for accuracy, conflicts of interest and other biases.

Wikipedia fundamentally relies on the use of what we call reliable sources. We are starting to use more and more open data from government sources, as illustrated in the COVID-19 pandemic. But shouldn't we clearly distinguish between "reliable" data and "official" data? When can government agencies be trusted to provide reliable data? COVID-19 pandemic daily infection counts lack credibility for several countries around the world:^[1]^[2] how should Wikipedia readers be warned?

Sep 2021: constructive editing of this essay is welcome, but it is not intended as a support/oppose survey. Please edit or insert arguments and counterarguments, preferably with sources, into prose and/or lists. Individual sections on the talk page could be used for support/oppose type discussions, with summaries later being inserted into the essay itself.

The COVID-19 pandemic case

During the COVID-19 pandemic that dominated world news starting in 2020, some of the key pieces of knowledge that readers have sought and editors have provided are the daily counts of how many people have been infected or have died in countries around the world. Numerous media sources have raised specific concerns about the data from several countries, and Wikipedia editing generally follows the usual pattern of judging the reliability of particular media sources, doctors' statements and citizens' groups' statements, rather than relying on government agencies' statements alone. However, the key diagrams, and the numbers that feed through to global totals for the pandemic, carry no indication that some of the underlying data are unreliable.

The WikiProject COVID-19/Case Count Task Force (WP C19CCTF) stated as of 18 Jan 2021 that "COVID-19 confirmed cases, deaths and recovery counts" data are based on reliable sources. But these "reliable sources" are in fact open data provided by government health agencies^[3] from around the world, whose methods of providing information differ fundamentally from those of peer-reviewed research and journalism. In addition to the country-level claims of data fabrication covered in some article sections (Belarus, Russia, Nicaragua, Venezuela), the statistical properties of the numbers published by government agencies can be tested for credibility in a way that does not depend on political biases, such as the known systematic demographic biases in Wikipedia. Both Benford's law^[4] and the lack of noise in the officially stated COVID-19 daily data^[1]^[2] point to the unreliability of the data from several countries. Unsurprisingly, the worse a country's Reporters Without Borders Press Freedom Index is, the more likely its official COVID-19 daily infection counts are to lack day-to-day random fluctuations (stochastic noise). Presumably, government agencies that face less risk of press criticism are less worried about fabricating their official open data.^[1]
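The two kinds of statistical check mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any of the cited papers: the function names `benford_chi2` and `dispersion_ratio` are hypothetical, the 7-day window is an arbitrary choice, and the "local mean" comparison is a simplified stand-in for the more careful Poisson-based modelling used in the published analyses.

```python
import math
from collections import Counter


def benford_chi2(counts):
    """Chi-square statistic of leading digits against Benford's law.

    Larger values mean the leading digits are further from the
    Benford distribution P(d) = log10(1 + 1/d).
    """
    digits = [int(str(n)[0]) for n in counts if n > 0]
    total = len(digits)
    if total == 0:
        return float("nan")
    observed = Counter(digits)
    chi2 = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        chi2 += (observed.get(d, 0) - expected) ** 2 / expected
    return chi2


def dispersion_ratio(counts, window=7):
    """Mean of (count - local mean)^2 / local mean over the series.

    Assumption of this sketch: for Poisson-like noise the ratio is
    roughly 1; values far below 1 indicate a series that is smoother
    than random counting noise would normally allow.
    """
    counts = list(counts)
    half = window // 2
    ratios = []
    for i in range(half, len(counts) - half):
        # Neighbouring days, excluding the day being tested.
        neighbours = counts[i - half:i] + counts[i + 1:i + half + 1]
        local_mean = sum(neighbours) / len(neighbours)
        if local_mean > 0:
            ratios.append((counts[i] - local_mean) ** 2 / local_mean)
    return sum(ratios) / len(ratios) if ratios else float("nan")


# Hypothetical daily counts, invented purely for illustration.
suspiciously_smooth = [100, 101, 100, 99, 100, 101, 100, 100, 99, 100]
print(dispersion_ratio(suspiciously_smooth))
print(benford_chi2(suspiciously_smooth))
```

Neither statistic proves fabrication on its own; a low dispersion ratio or a large Benford chi-square only flags a series as worth closer scrutiny by reliable sources.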

In this particular case, switching to WHO or Johns Hopkins University CSSE (JHU CSSE) data would not solve the problem of finding unfabricated data. WHO is restricted to providing official national data, and the JHU CSSE data show suspiciously low-noise daily counts broadly similar to those in the WP C19CCTF data. In fact, the statistical significance of the relation between the Press Freedom Index and low noise is stronger with the JHU CSSE version of the data; see the appendices of the analysis, which aims to be fully reproducible from source data and source code.^[1]

What should Wikipedia policy be?

Terminology: reliable vs official

Is it acceptable that we continue to use the term "reliable" (18 Jan 2021) when we really mean "official" (from a government or governmental agency), and we know that "official" in many cases may mean quite likely falsified? Are we contributing to disinformation if we fail to clearly warn readers that "official" information may be fictitious? Should we trust official open government data by default, or should we distrust it by default?

The COVID-19 pandemic is not the only example of government open data used in Wikipedia, and these questions are likely to become more relevant as citizens increasingly pressure governments to publish open data.

Templates

We could create a template with a mouseover, similar to {{cn}} or {{fv}}, with a short superscript message such as "govt" and a longer mouseover message such as: Official information from a governmental institution or agency; "official" information may or may not be reliable.

Official sources noticeboard

Should we have a noticeboard to develop ratings lists for official sources, similar to WP:RSP? This would need enough volunteers willing to rate specific government agencies, or specific governments or countries, and enough information to warn Wikipedians of the potential personal and legal security risks of accusing their governments of fabricating data. The debates could become extremely controversial and would be subject to the usual risks of controversial Wikipedia topics.

Usage

Elections

The overall and detailed numbers of votes in elections for political office are a form of open government data for which electoral fraud is well known to occur, and election forensics is a small but emerging field of study. The current convention in the English-language Wikipedia is that the infoboxes show the official results even when the results are dubious (e.g. Iran 2009; Belarus 2015 and 2020; Turkmenistan 2017). The implicit policy seems to be that the infobox reliably reports the government's point of view on the election results, even if these are false data, while the validity or invalidity of the open data is described in prose in the lead, based on reliable sources independent of the government.

Robots, search engines and websites that feed off machine-readable Wikipedia infoboxes process and propagate the numerical infobox data but, as of 2021, do not propagate the prose. It is the prose that contains the warnings that the information is (in some cases) highly unreliable, except in the sense that it is a reliable report of the government agency's claim about the data.

COVID-19 pandemic

It can reasonably be argued that the COVID-19 pandemic data currently (Sep 2021) in Wikipedia is reliable in the sense that it represents the governments' points of view on their pandemic statistics. However, would the use of better terminology or some good templates be enough to warn users that the data may be nonsense in some cases, so that we are not contributing to official governmental disinformation?

Excluding COVID-19 pandemic data from those countries whose data is most suspicious would be aesthetically unsatisfying and would risk accusations of pro-Western bias, even if the decisions were based purely on statistical properties of the official government data.^[4]^[1]^[5]^[2]

Bayesian option

A possible approach would be to assign a Bayesian probability to the credibility of each source of open government data, where the individual probabilities are derived from peer-reviewed research,^[5]^[1]^[4]^[2] preprint research (itself with a lower Bayesian probability of being correct), and media articles (with Bayesian probabilities related to WP:RSP?). Would there be enough people from diverse backgrounds, with the editing capabilities and the enthusiasm, to get these data into Wikidata? Currently (Sep 2021), Wikidata elements are subject to much less editorial debate than Wikipedia articles.

Infoboxes for elections, pandemic data or other open government data could have parameters such as |credibility_percent = 3 | credibility_refs = <ref name="JStats_Bloggs2017" /> that display a probability, either as a percentage (3% in this case) or as a decimal in the range from 0 to 1, giving a median (more robust than the mean) credibility estimate based on one or more references. As in ordinary Wikipedia editing, the parameter would quite likely be subject to intense debate on source reliability, how to express the overall value, and so on, depending on the quality of sources for individual open government data articles.
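As a rough illustration of how such parameters might be filled in, the following Python sketch takes per-reference credibility estimates and produces the infobox fragment. The parameter names credibility_percent and credibility_refs come from the proposal above; the function name `credibility_parameters`, the input format and the choice to round to a whole percent are assumptions made only for this example.

```python
from statistics import median


def credibility_parameters(estimates):
    """Build the proposed infobox fragment from per-source estimates.

    `estimates` maps a <ref> name to a credibility probability in the
    range 0 to 1 (hypothetical input format for this sketch).
    """
    if not estimates:
        raise ValueError("at least one credibility estimate is required")
    for name, value in estimates.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"estimate for {name!r} is outside 0..1")
    # The median is used because it is more robust to a single
    # outlying estimate than the mean.
    percent = round(100 * median(estimates.values()))
    refs = "".join(f'<ref name="{name}" />' for name in sorted(estimates))
    return f"|credibility_percent = {percent} | credibility_refs = {refs}"


print(credibility_parameters({"JStats_Bloggs2017": 0.03}))
```

With a single reference the median is simply that reference's estimate, which is why the robustness of the median only starts to matter once several independent estimates exist.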

Openness and verifiability of the credibility research itself

The English-language Wikipedia generally considers any peer-reviewed research published in a reputable research journal to be reliable, without requiring that the research paper be open access, and without requiring that the specific data sources, input parameters and method be presented in a fully reproducible format. Given the risk of initially relying on a small number of research papers in what is, as of 2022, a small research field, we could require much higher standards than are typically considered sufficient. We could require both of the following:

the research papers must be open access;
the research papers must be fully reproducible in the "narrower scope": any results should be documented by making all data and code available in such a way that the computations can be executed again, yielding identical results, by any independent researcher with basic scientific computing skills.

How do we combine different researchers' assessments?

If we use the credibility estimates from a single research paper by a single research group (or researcher), then we introduce a high element of sensitivity to error in that one research paper: if the paper is wrong, then that feeds through to a whole range of articles.

If we use the credibility estimates from multiple research papers, then how do we combine them? One solution would be to assign credibility parameters to each of the research papers and/or researchers, and take weighted medians (medians for robustness). These parameters could initially be set to, e.g., 0.5, and then raised or lowered based on qualitative discussion, or on the track records of those researchers' previous publications. However, this risks being counted as WP:OR or WP:SYNTH. There would have to be strong consensus on the method and algorithm. Alternatively, we could report a range, such as the full range, the interquartile range, or the central 95% range if there is a high number of research papers.
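A weighted median of this kind could be computed as in the sketch below. It is one possible convention (the "lower" weighted median, returning the first value at which the cumulative weight reaches half the total), not an agreed Wikipedia algorithm; the function name `weighted_median` and the example numbers are hypothetical.

```python
def weighted_median(values, weights):
    """Lower weighted median of `values` with non-negative `weights`.

    Returns the smallest value at which the cumulative weight reaches
    at least half of the total weight.
    """
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal in length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("total weight must be positive")
    cumulative = 0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if 2 * cumulative >= total:
            return value


# Hypothetical example: three papers' credibility estimates for one data
# source, each paper starting with the same credibility parameter of 0.5.
estimates = [0.03, 0.10, 0.40]
paper_weights = [0.5, 0.5, 0.5]
print(weighted_median(estimates, paper_weights))
```

With equal weights this reduces to the ordinary median; raising or lowering one paper's weight after discussion shifts the result only when that paper's weight is large enough to move the halfway point, which is the robustness property the text refers to.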

Policies

Should there be any specific Wikipedia guideline or policy distinguishing "reliable" versus "official" data? Some sort of text label to clarify the distinction?

Reliable sourcing versus geographical bias dilemma

COVID-19 data is generally more dubious in countries with worse press freedom,^[1] and election data is generally more dubious in countries with less developed democratic structures and human rights cultures and institutions. If we systematically remove the less reliable open government data from Wikipedia, then we improve our information reliability but risk strengthening the known geographic biases of the English-language Wikipedia. If we don't remove it, then we risk presenting unreliable data as reliable while appearing to provide less biased encyclopedic coverage. This dilemma is similar to the usual sourcing dilemma in relation to these biases, with the difference that numbers can create an illusion of reliability, since they tend to seem more objective than words. (Numbers obtained and presented accurately are, of course, at the heart of most of modern science; but there is a huge caveat in the word "accurately".)

Negotiation with other editors on where to compromise, on a case-by-case or topic-by-topic basis on talk pages, with standards evolving with time, is the one way to handle this dilemma.

References

[1] Roukema, Boudewijn F. (2021-08-27). "Anti-clustering in the national SARS-CoV-2 daily infection counts". PeerJ. 9: e11856. arXiv:2007.11779. doi:10.7717/peerj.11856. ISSN 2167-8359. PMC 8404575. PMID 34532156. Zenodo: 5262698. Archived from the original on 2021-08-27.
[2] Kobak, Dmitry (2022-03-29). "Underdispersion: A statistical anomaly in reported Covid data". Significance. 19: 10–13. doi:10.1111/1740-9713.01627. eISSN 1740-9713. Archived from the original on 2022-04-06.
[3] Ruijer, Erna; Françoise, Détienne; Baker, Michael; Groff, Jonathan; Meijer, Albert J. (2019). "The Politics of Open Government Data: Understanding Organizational Responses to Pressure for More Transparency". Amer. Rev. Publ. Admin. 50. SAGE Publishing: 260–274. doi:10.1177/0275074019888065. Archived from the original on 2021-09-16. Retrieved 2021-09-16.
[4] Balashov, Vadim S.; Yuxing, Yan; Zhu, Xiaodi (2021). "Using the Newcomb–Benford law to study the association between a country's COVID-19 reporting accuracy and its development". Scientific Reports. 11. Springer Nature: 22914. arXiv:2007.14841. doi:10.1038/s41598-021-02367-z. Archived from the original on 2021-11-27. Retrieved 2022-02-12.
[5] Robertson, M.P.; Hinde, R.L.; Lavee, J. (14 November 2019). "Analysis of official deceased organ donation data casts doubt on the credibility of China's organ transplant reform". BMC Med Ethics. 20 (79): 79. doi:10.1186/s12910-019-0406-6. PMC 6854896. PMID 31722695.
