Wikipedia talk:Wikipedia Signpost/2024-05-16/Op-Ed


Discuss this story

I am amazed that a major Wikimedia project is about to fail, and this is the first we have heard of it. Scaling databases is not a new science; the principles have been known for decades. I am certain there is a technical solution for this, and I haven't even looked at the "Limits" page yet. All the best: Rich Farmbrough 11:11, 16 May 2024 (UTC).[reply]

P.S. the link https://www.wikidata.org/wiki/WikiProject:Limits_of_Wikidata doesn't work. I'll check back later. All the best: Rich Farmbrough 11:14, 16 May 2024 (UTC).[reply]
@Rich Farmbrough: Good catch -- fixed. Thanks! jp×g🗯️ 11:28, 16 May 2024 (UTC)[reply]
Well, it's not something that most people have a good grasp on, so only once you really start hitting the limits does it become interesting for people who don't really care about the details. It's also not uncommon; it's just that usually people are able to find better solutions in the nick of time, and most outsiders don't even realize something changed. Reminder that all of WMF was once one database server, and one rack in one datacenter, and that we ran without multimedia backups for over 15 years. In reality, most WMF projects have existential (technical) crises on an almost continual basis and hardly anyone notices (see also the major pagelinks db table normalizations, going on right now). —TheDJ (talkcontribs) 14:34, 16 May 2024 (UTC)[reply]
One problem is that Blazegraph development is somewhat dead in the water, as Amazon quietly hired the Blazegraph developers to build the closed-source Amazon Neptune service. However, QLever could be a valid alternative to Blazegraph if it scales well enough. --Zache (talk) 16:50, 16 May 2024 (UTC)[reply]
WMF does have money. They could hire a team to revive Blazegraph development. We can't always rely on the open-source world to provide us with software; sometimes we have to make it ourselves. To be clear, this is an expensive option: DB developers are much more expensive than PHP web devs. Trade-offs would probably have to be made. Like all things in life, it comes down to deciding what is important. Bawolff (talk) 17:35, 16 May 2024 (UTC)[reply]
Scaling RDF databases is a pretty different ballgame than scaling relational databases. I would consider scaling SPARQL-based databases to very much be an open research problem. That doesn't mean there are no solutions, but they are full of trade-offs, and often look like doing the exact same thing (partitioning the graph) with some front end to keep the queries the same. Bawolff (talk) 17:33, 16 May 2024 (UTC)[reply]
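To make the partitioning idea concrete, here is a toy sketch: triples are hash-partitioned across shards, while a thin front end fans each query pattern out to every shard and merges the results, so callers never see the split. All names here are made up for illustration; production SPARQL engines are far more sophisticated.

```python
class ShardedTripleStore:
    """Toy store: triples are hash-partitioned by subject across N shards,
    but match() presents them as one unified graph, as a front end would."""

    def __init__(self, n_shards=4):
        self.shards = [set() for _ in range(n_shards)]

    def _shard_for(self, subject):
        return self.shards[hash(subject) % len(self.shards)]

    def add(self, s, p, o):
        self._shard_for(s).add((s, p, o))

    def match(self, s=None, p=None, o=None):
        # The "front end": fan the pattern out to every shard and merge,
        # so queries look the same as against an unpartitioned graph.
        for shard in self.shards:
            for (ts, tp, to) in shard:
                if (s is None or ts == s) and (p is None or tp == p) \
                        and (o is None or to == o):
                    yield (ts, tp, to)
```

The catch, as the comment notes, is the trade-offs this hides: patterns that do not bind the partition key must touch every shard, and joins across shards get expensive.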

I operate my own copy of the Wikidata Query Service that continues to provide a unified graph. If you do not have time to rewrite your queries to use the split graph, you should be able to run queries on my query service. I will continue operating this unified graph until it is literally no longer possible. Please reach out by email if you are interested in using this. Harej (talk) 21:41, 16 May 2024 (UTC)[reply]

@Harej: Technical question: how do you keep it in sync with Wikidata changes? (i.e. is it near-realtime, are there periodic updates, etc.) I think that keeping it in sync is the biggest unsolved issue if one runs one's own replica. --Zache (talk) 23:52, 16 May 2024 (UTC)[reply]
Zache, syncing is not an unsolved issue if you use Blazegraph. The Wikimedia Foundation has developed tools to synchronize Wikibase instances, including Wikidata, via Recent Changes. So my query service keeps up more or less with real time. Harej (talk) 05:03, 20 May 2024 (UTC)[reply]
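For readers curious what syncing "via Recent Changes" involves, the polling side of such a loop can be sketched against the standard MediaWiki API (the parameters below are the documented `action=query&list=recentchanges` ones; the endpoint for Wikidata is https://www.wikidata.org/w/api.php). The real updater does considerably more than this sketch, e.g. fetching each changed entity's RDF and diffing it into the triple store.

```python
# Sketch: build the API request a Recent Changes poller would issue to
# find entities changed since the last sync checkpoint.

def recent_changes_params(since_utc, limit=500):
    """Parameters listing changes since `since_utc`, oldest first, so a
    poller can resume from the last timestamp it processed."""
    return {
        "action": "query",
        "list": "recentchanges",
        "rcstart": since_utc,      # e.g. "2024-05-16T00:00:00Z"
        "rcdir": "newer",          # oldest to newest
        "rclimit": str(limit),
        "rcprop": "title|ids|timestamp",
        "format": "json",
    }
```

A poller would request these parameters in a loop, process the returned change entries, and store the newest timestamp seen as the next `since_utc`.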

Didn't realize Wikipedia has now (or will soon have) had Wikidata for longer than it didn't. Nardog (talk) 23:44, 16 May 2024 (UTC)[reply]

Bad idea when I first heard about this option at Wikimania Capetown, still a bad idea now. I support this as a WikiCite editor - well, I'm a WikiCite editor, and I don't support this. The truths about Wikidata's quite genuine "big data" issues need to be addressed, for sure. Scholia, as a front end, surely also is a different matter from these back end questions. (That is not to discuss funding, which obviously is a point.) I was struck at WikidataCon 2019, where much was said about federation, how little of it made sense from a technical point of view. As someone who puts time into "main subject" and "author disambiguation" questions for article items, I see the direct link into Wikipedias and other sites from the subject and author items as fundamental. The Capetown advocacy was solely in terms of convenience to bot operators, which I would say was a blinkered outlook. Charles Matthews (talk) 04:21, 17 May 2024 (UTC)[reply]

Oh, and I note that the feedback period for comment ended the day before the publication of this article. Charles Matthews (talk) 07:17, 19 May 2024 (UTC)[reply]

Scholia usage volume

Scholia, a scholarly profiling service serving this content 100,000 times a day - what is the source for this number, and does it distinguish between human users and automated traffic? (It matched the 2018 stats here, but that publication explicitly warned that it "cannot be used to draw any conclusions other than that usage grows [as of 2018]. Multiple contributing factors seem likely, including web crawlers, generic growth of Wikidata and WikiCite content, increased interlinking both within Wikidata and between Wikidata and other websites, especially Wikimedia projects, as well as WikiCite or Wikidata outreach activities, which often feature Scholia prominently"). Regards, HaeB (talk) 06:30, 17 May 2024 (UTC)[reply]

@HaeB: The source is https://toolviews.toolforge.org/ which is a Wikimedia platform service. Like so many of our fundamentals, documentation is lacking, but based on rumor that service has same specifications for reporting a view as wikitech:Tool:Pageviews, which is how we report Wikipedia traffic to the world and which we consider reliable. The narrative for both pageviews and toolviews is that the Wikimedia platform reports web traffic in the way standard to the field, so regardless of bot/human ratio, our reporting should be comparable to anyone else reporting any other web platform traffic. Bluerasberry (talk) 18:09, 17 May 2024 (UTC)[reply]
Thanks, that's helpful. Some quick observations:
  • based on rumor that service has same specifications for reporting a view as wikitech:Tool:Pageviews - from a quick look at the code (good point about documentation btw), that does not appear to be the case, as Toolviews does not use the same pageview definition. In particular, regarding my question above, it does not seem to attempt to detect automated traffic. (Which does not mean it is unreliable per se - it does what it does - just that the conclusions we can draw from this data are limited, as warned about by the authors of that 2018 paper, which included yourself.)
  • so regardless of bot/human ratio, our reporting should be comparable to anyone else reporting any other web platform traffic - that doesn't seem to be true. Bot pageviews are usually excluded from web traffic reporting. That's also the default setting in the Pageviews tool (in views like [1] you have to change the setting under "Agent" to include "Spider" and "Automated" views alongside human/"User" views; their FAQ has more on the distinction.)
  • One way to get closer to an answer to the question of how much human usage Scholia actually sees might be the unique daily visitor counts whose existence the Toolviews API documentation advertises. Unfortunately though they seem to be broken (the API claims 0 visitors for Scholia and all other tools for recent dates; and it turns out that MusikAnimal filed a bug about this last year already).
  • We can get a clue though by looking at the series of daily hits/pageviews for Scholia over time. Spot-checking January-April 2024, it's very interesting that while the numbers are indeed above 100k most days, there are several days where they are much smaller (e.g. 2024-01-02: 34, 2024-01-12: 12, 2024-02-08: 28, 2024-03-04: 26. Similarly a year earlier, e.g. 2023-02-15: 31, 2023-03-30: 8, 2023-04-17: 34). Such extremely large, isolated drops basically never occur for web traffic that is substantially human-generated. Now there might be some different explanations (e.g. further bugs in the Toolviews tool), but the most likely one is that Scholia sees very little direct human usage.
Regards, HaeB (talk) 07:49, 18 May 2024 (UTC)[reply]
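The spot-check reasoning above can be expressed as a tiny anomaly filter: flag any day whose count collapses to a small fraction of the series median. The sample counts below are illustrative, not real Toolviews data.

```python
# Sketch: flag days whose hit count falls far below the series median.
# Organic human traffic dips by tens of percent (weekends, holidays),
# not by four orders of magnitude.
from statistics import median

def suspicious_drops(daily_counts, factor=0.01):
    """Return (day, count) pairs falling below factor * median count."""
    m = median(count for _, count in daily_counts)
    return [(day, c) for day, c in daily_counts if c < factor * m]
```

On a series hovering around 100,000 hits per day, an isolated day of 12 or 34 hits trips the filter immediately, which is the pattern HaeB describes.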
PS, re the last bullet point: I noticed since that these weird drops also show up for other tools (example), so the "further bugs in the Toolviews tool" possibility I mentioned is in fact the more likely one. And you already pointed that out elsewhere in the article (Many outages since 2022 are of the wikitech:Tool:Pageviews tool - sorry for overlooking that in my last comment; as I said, it consisted just of some quick observations). That's a bit in contrast, though, with "which we consider reliable", no?
Either way, the question remains how much human usage the Scholia tool actually sees. I'm a bit surprised that this op-ed describes it as a successful product, citing nothing more than a metric that the 2018 paper warned should not be used for such conclusions.
Regards, HaeB (talk) 12:26, 20 May 2024 (UTC)[reply]
I don't quite follow everything being said here, but to whom it may concern, there will be a Toolviews visualization akin to Pageviews relatively soon. It's a volunteer project I've slowly been picking away at since 2019. It's maybe 90% of the way there, so stay tuned :) MusikAnimal talk 22:51, 20 May 2024 (UTC)[reply]
Belated apologies for pinging you here without indicating more clearly why. It was because I saw you had filed that bug about that unique visitor data in toolviews. It made me think that you might be able to provide more insight in general on how reliable the toolviews numbers are, in particular when it comes to distinguishing tool usage by humans and spiders/automata (as WMF does for content pageviews). I realize though that you are not responsible at WMF for addressing that kind of problem.
All that being said, I would be curious whether/how it is planned to alert users of that forthcoming visualization of the Toolviews data about such issues (will it include a FAQ like the Pageviews Analysis tool?).
Regards, HaeB (talk) 06:38, 10 September 2024 (UTC)[reply]
PS: While closing tabs I noticed the comments at phab:T87001#8899655 ff. (e.g. The question of bot traffic vs. human was the first thing that came up when I showed toolviews to people at the hackathon and The very streamlined data collection that toolviews is doing based on the front proxy nginx logs does not contain any user-agent based processing. The idea was to be able to show traffic patterns more than to provide any sort of detailed traffic analysis. I agree that we should probably put a FAQ section on the https://wikitech.wikimedia.org/wiki/Tool:Toolviews page that I've not yet bothered to create for this tool). So it definitely looks like this issue is unsolved and that this Signpost article should not have used these numbers in this way to imply conclusions about the popularity of Scholia among human users. Regards, HaeB (talk) 16:30, 10 September 2024 (UTC)[reply]
@HaeB: I admit near total ignorance in interpreting the metrics which the Wikimedia platform makes available, including the limits of their reliability and what to notice when they conflict with each other.
I will be presenting some of the information in this report in about a month at wikiconference:Submissions:2024/WikiCite_-_proposed_as_Wikimedia_sibling_project with care to make the corrections you have indicated. My presentation will include published slides and a recording, so if you have specific suggestions for what I should correct, then say so, as everything you have said so far is valuable to me. I will follow up with an email to ask you for a voice/video chat, and if you can meet me, I will take notes because you understand this and are the only person giving me feedback on this in all the times I have presented this. Thanks so much. Bluerasberry (talk) 18:06, 10 September 2024 (UTC)[reply]

What's the limiting factor?

I was surprised that WD, having about as many items as Commons has pictures, should be subject to severe strain that Commons is spared. I guess it's because WD gets a lot more queries. I bring up WikiShootMe more days than not, and Commons App almost as often, and probably each instance means multiple searches among those hundred million items most of which have nothing to do with geocoordinates. So, it makes sense that my favorite views, the ones that are all about geocoordinates, should be limited to a separate database for places, and users who want to know something about people will use a separate biographical one, and scholarly articles? Every one that was published in the past century? Yipes; that must be a terribly heavy load, so that means separation again. And so forth. Kind of a shame that separate but affiliated websites must be set up to work around technical limitations, but I hope they can all be made to work together without too much confusion. Jim.henderson (talk) 02:02, 18 May 2024 (UTC)[reply]

Wikidata has 125 million items. Does Commons have 125 million pictures? Ymblanter (talk) 07:08, 18 May 2024 (UTC)[reply]
Wikidata and Commons have a similar number of pages, around 100 million. Last time I checked, the issues with Wikidata were due to the frequency of updates. Wikidata has over 2 billion edits, while Commons has less than 1 billion (or about 20 million monthly edits vs. 10). Nemo 09:19, 18 May 2024 (UTC)[reply]
To amplify (my first posting on this got lost): Commons has 105 million media files. There is no direct comparison. If the commons:SPARQL query service were as heavily used as the Wikidata SPARQL service, and the metadata for files were fully expressed in commons:Structured data, there would be a basis for comparison. It would likely turn out that few media files had really large metadata, that typically the metadata was not much updated after the initial upload, and certainly that Commons had few projects for systematic metadata expansion (unless the role of categories changed). These differences can help to explain why, from the point of view of "big data" issues, you cannot really equate the two sister projects. For example, query.wikidata.org needs to refresh its working dump of Wikidata on a timescale of seconds to function properly, which is not a relevant requirement for Commons at present. Charles Matthews (talk) 09:53, 18 May 2024 (UTC)[reply]
So I would say that the "sheer volume" headline is more than a trifle misleading on the technical side. The worst-case WikiCite items are things like the Higgs boson article item, with several thousand authors. But it is probably the average-case ones, say with 30 "cites-work" statements, that fatten up the graph, and these get edited often enough, for example with tools, to link to author items rather than just giving author strings. UniProt is twice the size of Wikidata in terms of items[2]. Charles Matthews (talk) 14:22, 18 May 2024 (UTC)[reply]
So, it's not the number of items as the headline might suggest, nor the size of items, nor even the traffic of queries as I suspected. Rather, the major problem is the heavy traffic of changes, and most of those changes are by bots. It's pleasant to know that we Commons users are not much the problem; mostly we are among the many who are at risk of suffering. Even though my uploads via WikiShootMe and Commons App are automatically linked in WD, that's a small part of the WD load. Yippee; it's not my fault; I'm just a victim! Jim.henderson (talk) 16:08, 18 May 2024 (UTC)[reply]
I'm not familiar with the exact issues involved, but I strongly suspect it is not simply data churn rate, but that the size of the underlying graph is a very significant issue here. Bawolff (talk) 08:58, 19 May 2024 (UTC)[reply]
I mentioned Uniprot, and https://sparql.uniprot.org/ is not having the same problems, I believe, despite having 150 billion triples. I'm no expert. It does seem to get by with an update every eight weeks. Charles Matthews (talk) 10:12, 19 May 2024 (UTC)[reply]
[I should preface this with: I'm just guessing here.] I'm more just saying that the factors are inter-related and we probably shouldn't assume it's one single cause rather than a bunch of factors in combination. I suspect (albeit this is a total guess with nothing backing it up) things would be a lot easier if the graph size were much smaller or the churn rate were significantly lower. As far as churn rate goes, it should be noted that there is a significant difference between a db with a low churn rate and one with zero churn (a static db you have to rebuild if you want to change something). The static case opens up certain possibilities that just aren't possible if the data changes at all. Bawolff (talk) 18:58, 19 May 2024 (UTC)[reply]
I read a little more of the on-wiki pages. wikitech:Wikidata Query Service/ScalingStrategy seems to imply that there are two problems: capacity (overall graph size) and update lag (data churn rate). It sounds like some improvements were made on the update-lag front, so it is less pressing at least in the near term, and the larger concern at the moment is capacity. The UniProt endpoint is using Virtuoso, which from what I understand makes some trade-offs to allow for really high capacity. In particular it has poor support for incremental updates (so no live updates), and it also has a feature where, if a query is hard, it may return its best guess instead of the correct answer. Having the query engine be a static snapshot is a trade-off that might be acceptable to some users, but it is a pretty big trade-off and probably not to be taken lightly. Bawolff (talk) 08:04, 20 May 2024 (UTC)[reply]
Capacity is indeed the most important issue here. As of now, true horizontal scaling (sharding) of graph databases is an unsolved technological problem – but this would be the solution that is needed for Wikidata.
Currently, the entire graph representation of Wikidata needs to fit into the memory of a single computer for the Wikidata Query Service (WDQS), which is the most important interface for consuming the data. Adding more memory to this computer (vertical scaling) becomes increasingly expensive and does not scale indefinitely anyway. Due to the lack of horizontal scaling for graph databases, it is currently not an option to distribute the graph across several computers which behave as if they were one database—as can easily be done for many other database types.
There are a few graph database engines that claim to have solved horizontal scaling, but usually this comes at the cost of considerable query performance (i.e. consuming data is usually significantly slower and less predictable when horizontal scaling is used), and the engines that have this feature only gained it rather recently. At this point it is not obvious which engine will emerge as the clear go-to option for a graph of Wikidata's size.
I'm afraid there is really not much that can be done differently than the proposed split at this point; it is not a question of resources, or lack of will. We have to accept that multiple graphs and query federation are the way to go. —MisterSynergy (talk) 19:33, 21 May 2024 (UTC)[reply]
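For readers unfamiliar with what "query federation" means in practice: under SPARQL 1.1, a query run against one graph can pull in data from another endpoint at query time via the SERVICE keyword. A minimal sketch follows; the scholarly-split endpoint URL is hypothetical, and only the SERVICE mechanism itself is standard SPARQL 1.1.

```python
# Illustration of query federation with SPARQL 1.1 SERVICE: the outer
# pattern is answered by the local (split) graph, while the SERVICE
# block is answered by a second, remote endpoint. The remote URL below
# is made up for illustration.

FEDERATED_QUERY = """
SELECT ?person ?article WHERE {
  ?person wdt:P31 wd:Q5 .                  # answered by the local graph
  SERVICE <https://query-scholarly.example.org/sparql> {
    ?article wdt:P50 ?person .             # answered by the remote graph
  }
}
"""
```

The trade-off MisterSynergy alludes to is visible even in this sketch: every SERVICE block is a network round-trip, so query planning, latency, and result-size limits all get harder once a single logical query spans multiple endpoints.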
Does Blazegraph really need everything in memory? It's not advertised as an in-memory database. Of course, regardless of that, memory pressure still has a significant effect on query performance as the data set gets larger. While I agree that vertical scaling can't go on forever, it does seem like we are very far away from the limit if money were no object. For example, AWS offers servers with 24 terabytes of RAM (e.g. u-24tb1.112xlarge). Such things are very expensive, but they do exist. Bawolff (talk) 22:50, 21 May 2024 (UTC)[reply]
As far as I am aware, the memory spec of the current setup is indeed relatively moderate compared to what is technically possible. However, please also consider that there are already north of 20 servers behind WDQS, distributed across two data centers and partially reserved for internal and external usage. Each of those servers is capable of running queries on the entire graph, and the load from different requests is distributed more or less evenly among them.
If you want to scale vertically, you need to do this for all of these machines. Doing so might help for possibly a couple of years, but the fundamental problem—the absence of a viable technical solution—would very likely not be solved by then either. —MisterSynergy (talk) 21:29, 23 May 2024 (UTC)[reply]
It's important to distinguish the Wikidata Query Service from the Wikidata wiki. The database powering the Wikidata wiki is scaling fine. Wikimedia Commons is a bad comparison because the wiki itself is not having scaling issues; it is the query service that is having issues. Bawolff (talk) 08:51, 19 May 2024 (UTC)[reply]
Indeed, the analysis of the alternatives has the number of triples and the frequency of their updates as major criteria. I'd say it's debatable where the boundaries of "the wiki itself" end, though: what about native search, for example? At one point ElasticSearch was struggling to handle the updates on Wikidata as well, for pretty much the same reason (IIRC). Nemo 09:41, 19 May 2024 (UTC)[reply]

Well, what Lane has written is indeed an opinion piece, leaving room for other opinions and discussions. My own takeaways are more complex. Leaving aside implications for my own future WikiCite involvement, and the whole tech funding debate, here are some:

  • Continuing significance of the pre-COVID timelines of Wikidata and WikiCite.
  • Poor state of community control of automated editing on Wikidata, coupled with attachment to "quick results".
  • A concrete assessment is needed of how the WMF's wooing of librarians is going.

Enough for now, really. Charles Matthews (talk) 05:28, 19 May 2024 (UTC)[reply]

Missing aspects

The other day, I recommended this Signpost article as background reading in a Facebook discussion (about this recent announcement on the WDQS backend update). I received this feedback:

I think it should be stressed that this is an opinion piece and rather incomplete. The author writes it himself [...]
the author does not seem to be aware of https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/WDQS_backend_alternatives , which shows throughout the article. It's perfectly fine because of the disclaimer he adds himself. I just wanted to point that out.

Just wanted to post this here as context for other Signpost readers, and also a bit in the context of general discussions we have been having in the Signpost team on how much to vet or review opinion articles. (Again, obviously the author of this one put a lot of work into it and also, as the above quoted critic acknowledges, diligently included a disclaimer.)

Regards, HaeB (talk) 06:46, 10 September 2024 (UTC)[reply]

In fairness, I don't know how much that changes this article. Like, yes, the plan seems to be: shard the graph for the short term, move to QLever in the longer (or at least medium) term. It is not like we're just sharding the graph and then totally out of ideas. The question still remains: will it work, and how much scaling does that buy us? The initial benchmarks do look extremely promising, but we're talking about a small number of benchmarks done in a non-production context, not under load, on a system that is quite frankly relatively immature (e.g. UPDATE support was just written recently). There is still a ton of uncertainty here. At a high level, I think this op-ed is essentially asking what the viability of WDQS is at the different levels of scale we might optimistically see Wikidata reach in the next 10 years, assuming no artificial limitations are placed on it. Community members want to know this because they need to make plans based on the answer to this question. Not knowing creates fear, uncertainty and doubt in the Wikidata community. The WDQS backend alternatives document doesn't answer that question. Bawolff (talk) 08:43, 3 October 2024 (UTC)[reply]