Wikipedia:Wikipedia Signpost/2024-05-16/Op-Ed
Wikidata to split as sheer volume of information overloads infrastructure
The Wikimedia Foundation will soon split parts of the WikiCite dataset off from the main Wikidata dataset. Both data collections will remain available through the Wikidata Query Service, but by default queries will return content from the main graph only; users will need to take extra steps to request WikiCite content. This is the start of query federation for Wikidata content, and it is a consequence of Wikidata having so much content that the servers hosting the Wikidata Query Service are under strain.
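To make the change concrete, here is a minimal sketch in Python of what a post-split federated query might look like. The endpoint URLs are illustrative assumptions, not the Foundation's confirmed names; only the general SPARQL 1.1 `SERVICE` federation mechanism is standard.

```python
# Sketch of what a federated query might look like after the split.
# Both endpoint URLs below are illustrative assumptions, not the
# Wikimedia Foundation's confirmed names for the post-split services.
MAIN_ENDPOINT = "https://query-main.wikidata.org/sparql"
SCHOLARLY_ENDPOINT = "https://query-scholarly.wikidata.org/sparql"

def federated_query(author_qid: str) -> str:
    """Build a SPARQL query that runs against the main graph and
    explicitly pulls WikiCite data in through a SERVICE clause,
    the standard SPARQL 1.1 federation mechanism."""
    return f"""
SELECT ?article ?articleLabel WHERE {{
  # Biographical data about the author stays in the main graph ...
  wd:{author_qid} wdt:P31 wd:Q5 .
  # ... while scholarly-article items must now be requested
  # explicitly from the split-off WikiCite graph.
  SERVICE <{SCHOLARLY_ENDPOINT}> {{
    ?article wdt:P50 wd:{author_qid} .
  }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""

# A client would POST this query text to MAIN_ENDPOINT.
query = federated_query("Q80")  # Q80 = Tim Berners-Lee
```

The point of the sketch is the extra `SERVICE` clause: after the split, citation data is no longer in the default graph and must be asked for by name.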
I support this as a WikiCite editor, because WikiCite is consuming considerable resources, and the split preserves the content at the cost of reduced accessibility. This split could also be the start of dedicated support for Wikimedia citation data products.
I am wary of the split, because it only gives about three more years to look for another solution, and we have already been seeking one since 2018. The complete scholarly citation corpus of ~300 million citations is not a large dataset by contemporary standards, but our Blazegraph backend strains to include 40 million right now. Even after a split, Wikidata will fill with content again. Fear of the split has been slowing and deterring Wikidata content creation for years, and we do not have long-term plans for splitting and federating Wikibase instances repeatedly.
This challenge does not have an obvious solution. I have tried to identify experts who could describe the barriers at d:wikidata:WikiProject Limits of Wikidata, but have not been able to do so. I asked if Wikidata could usefully expand its capacity with US$10 million of development, and got uncertainty in return. I have no request of the Wikimedia community members who read this, except to remain aware of how technical development decisions determine the content we can host, the partnerships we can make, and the editors we can attract. The Wikimedia Foundation team managing the split has documentation that invites comment. Visit, and ponder the extent to which it is possible to discuss the scope of Wikidata.
That is the summary, and all that casual readers may wish to know. For more details, read on!
I am writing this article as an opinion or personal statement, not as fact-checked investigative journalism. I present it from my own perspective as a long-term WikiCite contributor who has incorporated WikiCite into many sponsored projects in my role as Wikimedian in residence at the School of Data Science at the University of Virginia. I have a professional stake in this content, and wish to be able to anticipate its future level of stability.
Why split WikiCite from Wikidata?
Wikipedia is a prose encyclopedia established in 2001. Over the years, the Wikipedia community deconstructed Wikipedia's parts into Wikimedia sister projects, one of which was Wikidata, established in 2012. Wikidata is designed such that, to the casual observer, it seems to work magic, addressing multiple billion-dollar challenges facing humanity. Soon after its establishment, though, its infrastructure hit technical limitations. Those limits prevent current (and especially prospective) projects from importing more content into Wikidata.
The only large Wikidata project to continue even limited development was WikiCite, an effort to index all scholarly publications. WikiCite grew, and is currently a third of the content on Wikidata. Users access WikiCite content through multiple tools; the tool I watch is Scholia, a scholarly profiling service serving this content 100,000 times a day. The point of Wikimedia projects is to be popular and present content that people want, and there is agreement that WikiCite is a worthwhile project. But because Wikidata is overstuffed with content, use of the Wikidata Query Service strains Wikidata's computational resources and causes downtime. Splitting off WikiCite is one way to manage those costs and that strain.
The problem is that Wikidata is facing an existential crisis, having hit many of the limits reported in WikiProject Limits of Wikidata. Users must be able to access Wikidata content through database queries, and the amount of content in Wikidata is now large enough that queries fail more often. The short-term solution, happening right now, is the first Wikidata graph split, which will separate the WikiCite dataset from the main Wikidata graph. This is not a long-term solution, because Wikidata will fill up with data again. If users had their way, Wikidata would expand in resource use to index all public information on people, maps, the Sum of All Paintings, video games, climate, sports, civics, species, and every other concept which is familiar to Wikipedia editors and which could be further described with small (meaning not big data) general reference datasets.
Here is a timeline of the discussions:
- 2018 d:Wikidata:WikiCite/Roadmap
- 2019 d:Wikidata:WikiProject Limits of Wikidata
- 2021 wikitech:User:AKhatun/Wikidata Scholarly Articles Subgraph Analysis
- 2021 d:Wikidata:SPARQL query service/WDQS backend update/Blazegraph failure playbook
- 2021 WikiCite panel discussion (WikidataCon 2021 recording) (video)
- 2023 WikiCite talk page discussion
- 2023 meta:WikiCite/Roadmap 2023
- 2024 d:Wikidata:SPARQL query service/WDQS graph split/WDQS Split Refinement
Here are some questions to explore. Ideally, answers to these questions could be comprehensible to Wikimedia community members, technology journalists, and computer scientists. I do not believe that published attempts at answering these questions for those audiences exist.
- If we could predict Wikidata's future capacity, then editors could plan strategically to acquire data at the rate of growth. Will Wikidata's capacity in three years be greater than, or the same as, its current capacity?
- WikiCite hit many upload limits in 2018. In the 6 years since, we have not identified a solution. What could we have done differently to develop appropriate discussion at the time the problem was identified?
- Suppose that the Wikimedia community developed a successful product – like WikiCite and Scholia – which also came with expenses. How can the Wikimedia community assess the value of such things and determine what support is appropriate?
Scholarly profiling
Scholarly profiling is the process of summarizing scholarly metadata from publications, researchers, institutions, research resources including software and datasets, and grants to give the user enough information to gain useful insights and to tell accurate stories about a topic. For example, a scholarly profile of a researcher would identify the topics they research, their social network of co-authors, history of institutional relationships, and the tools they use to do their research. Such data could be rearranged to make any of these elements the subject of a profile, so for example, a profile of a university would identify its researchers and what they study; a profile of software would identify who uses it and for what work; and a profile of a funder would tell what impact their investments make.
The easiest way to understand scholarly profiling is to use and experience popular scholarly profiling services.
Google Scholar is the most popular service and is a free Google product. It presents a search engine results page based on topics and authors. Scopus is the Elsevier product and Web of Science is the Clarivate product. Many universities in Western countries pay for subscriptions to these, with typical subscription costs being US$100,000-200,000 per year.
Free and nonprofit comparable products include Semantic Scholar developed by the Allen Institute for AI, OpenAlex developed by OurResearch, and the scrappy Internet Archive Scholar developed by Wikimedia friend Internet Archive.
Other tools with scholarly profiling features include ResearchGate, which is a commercial scientific social networking platform, and ORCID, which compiles bibliographies of researchers.
OpenAlex, Semantic Scholar, and Internet Archive Scholar designate their data as openly licensed and allow export, though all of them have ambiguous open licensing terms for elements of their platforms. Google Scholar, Scopus, and Web of Science slurp up data that they find and encourage crowdsourced upload of data, but their terms of use do not allow others to export it as open data. It has been a recurring thought that WikiCite and Scholia could meet institutional needs at a fraction of the Scopus and Web of Science subscription costs. ORCID also encourages data upload, and entire universities do this, but only for living people, and the data is only public with the consent of the individual profiled.
Statements such as the Barcelona Declaration on Open Research Information seek to gather a collaboration which could manifest an ideal profiling platform: one which would be open data, exportable, allow crowdsourced curation, encourage public community discussion of the many social and ethical issues which arise from presenting such a platform, and of course be sustainable as a tool which uses computing resources. Scholia is all of these things, except that it has hit technical limits.
WikiCite and Scholia
WikiCite is a collection of scholarly metadata in Wikidata, the WikiProject to curate that data, and the name of the Wikimedia community which engages in that curation. Scholia is a tool on Toolforge which generates scholarly profiles by combining WikiCite and general Wikidata content into a reader-friendly format. Scholia is preloaded with about 400 Wikidata queries, so instead of needing to learn the query language, new users can use the Scholia interface to run queries that answer common questions in academic literature research.
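The preloaded-query mechanism is simple to sketch. The template below is an illustrative stand-in written for this article, not an actual Scholia query; Scholia's real templates live in its own repository and its placeholder conventions may differ.

```python
# Sketch of a preloaded-query mechanism like Scholia's: a stored SPARQL
# template is filled in with a Wikidata item identifier, so the user
# never has to write SPARQL themselves. The template is illustrative only.
TOPIC_PUBLICATIONS = """
SELECT ?work ?workLabel ?date WHERE {
  ?work wdt:P921 wd:{{qid}} ;   # main subject = the profiled topic
        wdt:P577 ?date .        # publication date
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?date)
LIMIT 100
"""

def render(template: str, qid: str) -> str:
    """Substitute the item identifier into a stored query template."""
    return template.replace("{{qid}}", qid)

# A profile page for an item would render and run this query.
sparql = render(TOPIC_PUBLICATIONS, "Q202864")  # example topic identifier
```

A profiling service holds hundreds of such templates, one per panel on a profile page, and the user only ever supplies the item being profiled.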
WikiCite is the single most popular project in Wikidata in terms of amount of content, number of participants, depth of engagement of participants, count of institutional collaborations, and donation of in-kind labor from paid staff subject matter experts contributing to the project. In terms of content, WikiCite is about 40 million of Wikidata's 110 million items. Because it is openly licensed, many other applications ingest this content, including the other scholarly profiling services but also free and open services such as Histropedia. Four WikiCite conferences have each convened 100 participants. WikiCite presentations have been a part of many other Wikimedia conferences for some years. The largest WikiCite project in terms of participants was the WikiProject Program for Cooperative Cataloging, which recruited staff at about 50 schools to make substantial WikiCite contributions about their own research output. Although the Wikimedia Foundation invests in outreach, projects like this fall outside that investment, yet they still attract investors, new editors, and institutional partnerships.
The promise of WikiCite is to collect research metadata, confirm its openness, then enrich it with further metadata including topic tagging and deconstruction of the source material to note use of research resources, such as software, datasets, protocols, or anything else which could be reusable. Scholia presents all this content. Example Scholia applications are shown here, with links to the queries and pages which present such results.
What next?
The Wikidata Query Service is failing more often. It works 99% of the time, but a 1% failure rate for a tool central to accessing Wikidata is an emergency to address immediately. To ensure continued access to Wikidata content, the Foundation has responded with a recently refined plan, announced here, which incorporates everyone's best ideas for what to do.
It is challenging to coordinate the Wikipedia and Wikimedia community. The above-mentioned Barcelona Declaration asks organizations to commit to "make openness the default", "enable open research information", "support the sustainability of infrastructures for open research information", and "support collective action to accelerate the transition to openness of research information", which are all aims of WikiCite, Scholia, and Wikimedia more broadly, but in my view Wikimedia projects have been too independent to join such public consortia. If we could reach community consensus to join such programs, then I think experts in that group could advise us on technical needs, and funders would consider sponsoring our proposals to develop our technical infrastructure. If the Wikimedia Movement had money, then based on my incomplete understanding of the limit problems, I would recommend investing in Wikidata now so that we can better recruit expert partnerships and contributors. Since we lack money, the best idea that I have is to find the world's best experts considering comparable problems, and explore options for collaboration with them. I wish that "Wikipedia" could sign the Barcelona Declaration or a similar effort, and get more outside support.
Discuss this story
I am amazed that a major Wikimedia project is about to fail, and this is the first we have heard of it. Scaling databases is not a new science, the principles have been known for decades, I am certain there is a technical solution for this, and I haven't even looked at the "Limits" page yet. All the best: Rich Farmbrough 11:11, 16 May 2024 (UTC).[reply]
I operate my own copy of the Wikidata Query Service that continues to provide a unified graph. If you do not have time to rewrite your queries to use the split graph, you should be able to run queries on my query service. I will continue operating this unified graph until it is literally no longer possible. Please reach out by email if you are interested in using this. Harej (talk) 21:41, 16 May 2024 (UTC)[reply]
Didn't realize Wikipedia has now (or will soon have) had Wikidata for longer than it didn't. Nardog (talk) 23:44, 16 May 2024 (UTC)[reply]
Bad idea when I first heard about this option at Wikimania Capetown, still a bad idea now. Charles Matthews (talk) 04:21, 17 May 2024 (UTC)[reply]
- well, I'm a WikiCite editor, and I don't support this. The truths about Wikidata's quite genuine "big data" issues need to be addressed, for sure. Scholia, as a front end, surely also is a different matter from these back end questions. (That is not to discuss funding, which obviously is a point.) I was struck at WikidataCon 2019, where much was said about federation, how little of it made sense from a technical point of view. As someone who puts time into "main subject" and "author disambiguation" questions for article items, I see the direct link into Wikipedias and other sites from the subject and author items as fundamental. The Capetown advocacy was solely in terms of convenience to bot operators, which I would say was a blinkered outlook.
Scholia usage volume
- what is the source for this number, and does it distinguish between human users and automated traffic? (It matched the 2018 stats here, but that publication explicitly warned that it "cannot be used to draw any conclusions other than that usage grows [as of 2018]. Multiple contributing factors seem likely, including web crawlers, generic growth of Wikidata and WikiCite content, increased interlinking both within Wikidata and between Wikidata and other websites, especially Wikimedia projects, as well as WikiCite or Wikidata outreach activities, which often feature Scholia prominently"). Regards, HaeB (talk) 06:30, 17 May 2024 (UTC)[reply]
What's the limiting factor?
I was surprised that WD, having about as many items as Commons has pictures, should be subject to severe strain that Commons is spared. I guess it's because WD gets a lot more queries. I bring up WikiShootMe more days than not, and Commons App almost as often, and probably each instance means multiple searches among those hundred million items most of which have nothing to do with geocoordinates. So, it makes sense that my favorite views, the ones that are all about geocoordinates, should be limited to a separate database for places, and users who want to know something about people will use a separate biographical one, and scholarly articles? Every one that was published in the past century? Yipes; that must be a terribly heavy load, so that means separation again. And so forth. Kind of a shame that separate but affiliated websites must be set up to work around technical limitations, but I hope they can all be made to work together without too much confusion. Jim.henderson (talk) 02:02, 18 May 2024 (UTC)[reply]
Well, what Lane has written is indeed an opinion piece, leaving room for other opinions and discussions. My own takeaways are more complex. Leaving aside implications for my own future WikiCite involvement, and the whole tech funding debate, here are some:
Enough for now, really. Charles Matthews (talk) 05:28, 19 May 2024 (UTC)[reply]
Missing aspects
The other day, I recommended this Signpost article as background reading in a Facebook discussion (about this recent announcement on the WDQS backend update). I received this feedback:
Just wanted to post this here as context for other Signpost readers, and also a bit in the context of general discussions we have been having in the Signpost team on how much to vet or review opinion articles. (Again, obviously the author of this one put a lot of work into it and also, as the above quoted critic acknowledges, diligently included a disclaimer.)
Regards, HaeB (talk) 06:46, 10 September 2024 (UTC)[reply]