User:TCO/Improving Wikipedia's important articles

How is Wikipedia doing on its most important articles?

Recently, Looie496 did an analysis of how many important FAs are made each year. He found that we have produced very few since 2008. In the subsequent discussion, Wehwalt had an interesting mathematical challenge: how often do readers see articles of what quality?

I have collected data manually on page views and looked at cases to analyze quality versus importance. After 40 pages of Excel, the result is here in 100 pages of evil PowerPoint:

Quality coverage

Academic examination

We are used to thinking of Wikipedia in terms of numbers of articles, but readers do not look at all articles equally often. Answering Wehwalt’s question is not an easy task, and not one that can be accomplished by manual methods. This month, Harvard researcher Andreea Gorbatai combined quality and page view statistics from two databases to calculate what percentage of eyeballs are hitting what quality level. She found that only 3% of views are on “high quality” articles (GA/A/FA), 28% are on “medium quality” (C/B), and 69% are on “low quality” (start/stub/unranked). Manual examination of 10 unranked articles showed they were starts or stubs, so this group is assigned to low quality. Essentially, what the average reader sees when he hits Wiki is…low quality.
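For readers who want to see the mechanics, here is a minimal sketch of a view-weighted quality calculation. It is not Gorbatai's actual method or data: the function name, the tier mapping table, and the sample numbers are all my own illustrative assumptions, and the sample is invented so that its totals happen to mirror the reported 3/28/69 split.

```python
from collections import defaultdict

# Assessment class -> quality tier, following the grouping described in the text.
TIER_BY_CLASS = {
    "FA": "high", "A": "high", "GA": "high",
    "B": "medium", "C": "medium",
    "Start": "low", "Stub": "low", "Unranked": "low",
}


def view_share_by_tier(articles):
    """Return each tier's percentage of total page views.

    `articles` is an iterable of (assessment_class, monthly_views) pairs.
    """
    views = defaultdict(int)
    for assessment_class, monthly_views in articles:
        views[TIER_BY_CLASS[assessment_class]] += monthly_views
    total = sum(views.values())
    if total == 0:
        raise ValueError("no page views to distribute")
    return {tier: 100 * count / total for tier, count in views.items()}


# Invented sample (hypothetical numbers, NOT real page view data).
sample = [
    ("FA", 3_000),
    ("B", 20_000),
    ("C", 8_000),
    ("Start", 40_000),
    ("Stub", 9_000),
    ("Unranked", 20_000),
]
shares = view_share_by_tier(sample)
print(shares)
```

The point of weighting by views rather than counting articles is that one heavily read Start-class page moves the result more than a hundred barely read FAs.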

Previously in a published paper, "Exploring underproduction in Wikipedia", Gorbatai had called out major concerns with Wikipedia’s ability to produce high quality important articles. She examined both articles that were high page view (over 50,000 views per month) and articles that were subjectively rated important (top/high, by WikiProjects). For both cases only a few percent of important articles were high quality, and there were 10-50 times as many low quality articles as high quality articles.

Vital Article situation

Looking at a subjective list of the top 1000 Wikipedia articles, the Vital Articles, we find issues as well. 85% of these super-important articles are below GA. Ten years into the otherwise successful Wiki enterprise, we still have not produced high quality content on our most important articles. Even worse, the number of GA+ VAs has been slowly dropping over the last 4 years.

Moving to objective page view analysis, we see some interesting results. Compared with FAs, the VAs have 20 times the median monthly page views: 66,000 versus 3,300. GAs are even lower at 900 median monthly views. The Vital Articles are not just subjectively important for educational or cultural reasons…they are popular!

Relevance of FAs and GAs

FAs and GAs are becoming more obscure lately. The 2011 FAs had a median page view of 1000 views per month. GA median for a recent sample was only 600.

Part of the problem is overconcentration on certain topics. 10% of the GA+ articles come from four peculiar categories: hurricanes, mushrooms, trains, and US roads. What would an outside scholar think of this? Those four categories together have only 3 top 1000 VAs (0.3%). Could these topics be pushed because they are easier to mechanically write award winners on?

Hurricanes in particular show a strange pattern: GAs are 30% of the project’s articles and outnumber B and C articles (very unusual). Also, low priority articles make up a higher percentage of the GAs than they do of the WikiProject overall. For Project Hurricane, “low priority” gets high priority!

While having an incredible 600+ GA/FAs, the project still has half of its top/high articles (only 13 total) below GA. This includes such famous storms as Andrew, Camille, and Hugo. And Andrew has 26,000 views per month, while a typical hurricane GA has only 260, less than the average Wiki stub. WikiProject Hurricane appears to be a factory for making GA plus signs, not a project to serve encyclopedic readers.

There are also some broad categories that appear to be underserved. There are only 6 automobile FAs, 4 fashion FAs, and 6 aircraft FAs (2 added recently). Consider at the same time that there are 67 FAs on battleships.

Why have we piled up so many obscure FAs and still not achieved the goal that ALoan called out for us in 2006: to bring the most important articles to FA?

“We should aim to do [Feature] all chemical elements, all major solar system objects, all of the plays by Shakespeare, all Roman emperors (well, the Julio-Claudians, at least), all countries, all capital cities, all currencies, the few longest rivers and highest mountains on each continent, all heads of state, all winners of a Nobel or Booker or Pulitzer prize...”

— ALoan (retired)

Patterns of FA writing

FA writers are generally known by their number of FAs. They show their stars prominently in the right corner of their user pages. And there is a ranking, WPBFAN, of who has the most.

The importance of the FAs is not tracked or recognized. What happens if we look at article relevance to readers as well as just number of FAs? With that second metric, we can then segment the writers into 4 boxes.

Dabblers (low importance, low production)

Star collectors (low importance, high production)

Champions (high importance, low production)

Battleships (high importance, high production).
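The four boxes above can be sketched as a simple two-threshold classifier. The cutoffs here (`min_fas`, `min_views`) are hypothetical values picked only for illustration; this page does not state the exact cutoffs used in the report.

```python
def segment(fa_count, median_views, min_fas=5, min_views=10_000):
    """Place an FA writer in one of the four boxes.

    `min_fas` and `min_views` are illustrative, made-up thresholds:
    production is judged by number of FAs, importance by the median
    monthly page views of those FAs.
    """
    high_production = fa_count >= min_fas
    high_importance = median_views >= min_views
    if high_importance:
        return "Battleship" if high_production else "Champion"
    return "Star collector" if high_production else "Dabbler"


# Hypothetical writers, not real editors.
print(segment(fa_count=1, median_views=150_000))   # Champion
print(segment(fa_count=20, median_views=800))      # Star collector
```

Moving either threshold reshuffles the boxes, so any real use would need the cutoffs agreed on first.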

It is interesting to compare the most prominent champion (of a solo-authored article), Garrondo, with the most prominent star collector, Ucucha. Garrondo has written one FA, Parkinson’s disease, in 2011. Ucucha has produced 14 FAs on rare, Latin-named mammal species. Garrondo has a lousy strategy for climbing up the WPBFAN. However, when we look at what he did for the readers, we see that he blew Ucucha out of the water. Because his single article has 180 times the views of Ucucha’s average article, he achieved 13 times the total contribution to reader-viewed FA content.
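The "13 times" figure follows from simple arithmetic, sketched below. The function name is my own, and the calculation assumes total contribution is just number of FAs times average views per FA.

```python
def relative_reader_impact(fa_count_a, fa_count_b, views_ratio_a_to_b):
    """Ratio of writer A's total FA page views to writer B's.

    `views_ratio_a_to_b` is how many times more views A's average FA
    gets than B's average FA. Total contribution is assumed to be
    (number of FAs) x (average views per FA).
    """
    return fa_count_a * views_ratio_a_to_b / fa_count_b


# One FA with 180x the views, versus 14 FAs at the baseline.
ratio = relative_reader_impact(fa_count_a=1, fa_count_b=14, views_ratio_a_to_b=180)
print(round(ratio, 1))  # 12.9, i.e. roughly 13 times
```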

The problem is that all our systems of rewards, all our tracking systems, all our unconscious assumptions, talk page remarks, etc. simply talk about the number of stars…instead of their importance. We are incenting the star collectors and discouraging the champions. Yet the champion strategy is the more efficient way to serve the readers.

Issues with the Featured Article program

The Featured Article program is intriguing. There are top minds there, writing meaty, researched prose that is highly polished. However, there are reasons for concern. Not only is article relevancy (by page views) dropping precipitously, but the production is dropping. And this is happening at the same time that GA is growing as a program. Nor have GA standards gone down from 2007. They have gone up.

Not only that, but there are no efforts to recruit new FA writers and reviewers. Even if the standards for writing and reviewing are very high, we still need to find more in Wiki, external to Wiki, or somehow train the ones who are not good enough now. We can't rely on the regulars moving their time around. We need to grow the pie.

Most concerning was a statement from the FA leader that welcomes a smaller program with fewer contributors that can then nominate more articles at a time. Her other comments blame the decline in FA on Wiki overall and are defensive. Surely FA is not optimized? Surely it can at least try to improve its production? We need more strategic, inspirational leadership in FA. Not just criticism of other people's proposals. But development and implementation of new ideas.

The first step to turning FA around should come from leadership and governance. We have hard-working, quality-intent delegates. But they are not elected and the FA project gives the appearance of a power fiefdom, not like other Wikiprojects (for example content subject projects). It is also very strange that the titular FA leader is absent and the acting leader is formally a delegate.

Another concern is open discussion. I was shocked to hear a notable FA writer say that he was "scared silly" to discuss change at FAC-Talk. That’s un-American. It’s unlike any other chat forum of the Internet that I’ve been on.

We should elect the FA delegates yearly. GOCE and MilHist do this and it works fine. Even if we select the exact same people, discussions during an election would be helpful for the project. And there would be a different tone if the position was not for life, more of a connection to the group. And we could resolve the acting/titular director issue. Think how the governance looks to a new contributor…which we need some of. We have good hardworking volunteers, but it is not as if we had hired Daniel Boorstin or Ben Bradlee. What is the fear of an election? Losing?

Path forward

The Wikimedia Foundation has listed quality as one of its top 5 priorities. However, the target is timid: a 25% increase in the percentage of FA/GAs in five years. We will easily achieve that just based on current production. And their initiatives to support quality appear perfunctory. Furthermore, recent statements suggest an even more passive acceptance of current quality. Yes, other priorities exist. We need to keep the servers turned on, look better on smartphones, and stream video in a format that the civilized world uses.

And undeniably, the drop in participation is a big “oh shit” elephant in the room. The CEO is under board pressure to turn it around. But that is a very difficult, amorphous problem. And…it is really “back office”. Readers care about the actual articles, not our community. And donors give money to WMF because they want a better Wiki for readers, not editors.

We have come a long way since 2001, when Jimbo and Larry started Wikipedia. Who could have predicted the incredible Internet presence it would become, with a nonprofit support structure growing towards a $50 million operating budget? What is the next real differentiating step we can take in our evolution? That step is QUALITY. And the way to most efficiently drive improved quality is by concentrating on high importance articles. The results will resonate with readers and donors and the press.

Criticisms of the PowerPoint report and responses

Methodology

1. I don’t think House should be a Vital Article and Donner Party should be. That invalidates your study.

Tree for forest. The average Vital Article is clearly more important than the average FA. Arguing over a handful of VA choices is exactly the kind of activity we need less of. We should be pushing to get them increased in quality. Not making categorization debates. Wiki gets sidetracked like that too often.

And not that it matters, I think House is a decent VA topic similar to Automobile. And Donner Party is a juicy story, but probably doesn’t deserve to be even in the expanded (top 10,000) VAs. Within American history, it is less central than Oregon Trail or Covered wagon or even Sutter's Mill. That said, the VA top 10,000 have over 1,000 slots left and are open to additions.

2. Page view ranking is not a perfect metric: Wiki traffic on a topic may be different than overall Internet traffic because of competition; articles may get some hits from similar named topics; etc.

More nits in the grand scheme of things. They may affect an individual article, but when we think of broad classes of articles, they are noise. Also, having a page view frame of reference is a huge step up from the previous, 'all children equal' mindset. And the blockbuster articles like Parkinson’s disease are 100-1000 times more viewed than the obscure stars.

3. The quotes are cherry-picked and out of context.

The quotes are cherry-picked in the sense of making the points which I wanted to make. They are not a survey. But they are not taken out of context in the sense of a phrase or sentence lifted from caveats or modifying text to change its meaning.

4. The analyses are not perfect. Let’s not change anything until we have perfect analysis.

Products get launched and companies bought and sold on less analysis. Getting perfect statistical methodology is not going to change this story. Ask yourself if this makes enough sense as is. If not, stay with the status quo. If yes, then we probably don’t need to do more work to support decisions. Leave that for academics.

Substance

Weaknesses

1. Content section is weak analysis (subjective) and maybe content was checked outside of FAC.

It is one of the weaker sections. But it is an attempt to move past the common talk page criticism of FA (too much prose, don’t care about content) and the FA defense (nuh-uh) to some actual determination of extent. Ideally we would get a panel of magazine editors or the like and have them evaluate the level of content attention.

Of course, if content was thoroughly covered in preliminary reviews, that suffices…but we should have a definitive statement and a reference to the previous discussion on the FAC review. It should be readily apparent, not a burden of proof to find that it did not occur. And I can cite at least one example, Waddesdon Road railway station, where the article definitely did not get content review. Everyone willing to answer the issue of focus of that article has said that it was padded out from the Brill Tramway article.

Anyhow, I invite community members to look at 5 random FACs on their own (click here for a random FA) and judge for themselves the level of content concern in reviews. And while we are not doing well on a per article basis in content concern, we are doing a good job on a per eyeball basis. The blockbuster articles get great content review and the obscure articles that are reviewed for prose are a very tiny fraction of overall eyeballs.

2. B articles are pretty decent. I am working to get important articles to B and not going to GA/FA.

Good point. There are a huge number of very important articles still at Start, even at stub. Taking these to B helps a lot. Then the reader basically has some meat. I spent a lot of the analysis looking at higher quality rankings because they are easier to track and analyze. Late in the report, I came to think more about this aspect and appreciate the B more. I am still a little worried about the low social rewards for B and the probably lower stability of B class articles against degradation. I would like to see the top 1000 Vital at GA+ and also all the elements because of the benefits of review.

3. Tactical aspects of staying silent: your report will give ammunition to people who want to hurt FA, or mentioning more protection will lose us the little we have now, etc.

Interesting. Hope not.

Star collector reactions

1. I want to work on obscure topics. I do it from interest, not star chasing. This is the most important thing we do for the encyclopedia.

Given we have almost 4 million articles and produce fewer than 400 FAs per year, perfecting obscure topics is a dog of a strategy. It’s fine if an individual wants to work on the obscure, even on the whimsical, but stepping back as a community, we still need to think about how the basic encyclopedia gets written. And we need to consider that the current level of recognition implicitly over-values the star collectors, and that perhaps even our premier programs are unconsciously over-valued. Also, even if one person is not star collecting for the awards, I’ve looked at the phenomenon too many different ways to think that there is not a group impact. Imagine the outrage if we pulled the social rewards from obscure articles--that shows the rewards are valued, thus motivating behavior.

2. The big topics get edit warred, vandalized, and good faith degraded more…so I concentrate on obscure topics.

This is a crying shame. I feel for the editors here. That said, I still think social rewards have an impact…if we incented writers more it might make them take on these issues. (This is like how bigger topics are also harder to research and longer to write…just an aspect of difficulty.) Also in a small sense the plus sign or star can help give authority for the current version or to make protection requests easier. Plus, anyway, we still need to figure out how to get those important articles to high quality for the readers.

3. You should not name people or look at individual projects or articles as cases.

Not sure. I’m not out to get anyone so in that sense nameless could work. FA authors, even of trivial topics, are admirable for their brains, writing skill and work ethic. That said, it is all public data and fair analysis. Some people I most respect got dabbler or star collector tags--I just let the chips fall. Two star collectors said the negative label was insightful (but many more complained). Need to think about this going forward. It’s an understandable objection. A formal publication could show the BCG 2x2 grid but omit the list. I think a few specific examples (e.g. Garrondo versus Ucucha) are edifying.

4. Your report is discouraging me from working on content. I am sad.

I would rather have you angry at me or just dismiss me. Do not let people on the Internet make you sad. I don’t want that.

Confusions

1. Subjective importance and popularity are not the same thing (stated several different ways).

Agreed. They are strongly correlated, but not exactly the same thing. Almost all the “important” articles (ones that feel more like traditional encyclopedia topics) are also high view. Sure, there are times when a pop culture topic, such as Lady Gaga, has more page views than someone we might think of as more subjectively important, like Albert Einstein. No matter. Emphasizing popular topics OR subjectively ranked ones would help us over the current system where every Featured Article is implicitly equal. I think working on both axes, page view importance and subjective importance, is worthwhile, but I’m flexible. Either one would be a huge positive change.

2. Lots of telling what is wrong, where are the reccs

There are nine pages of recommendations in the document. Probably way too many recommendations. Plus we are all leaders here and should be able to come up with solutions. To move the discussion forward, some sort of workshop would be helpful to rank them or assign them into some impact versus difficulty matrix or just vote on best, worst and “one extra”.

3. High quality and high importance are positively correlated.

Sure. For instance, 0.4% of articles are GA+ and 3% of reader views are GA+. Realize that this is a very low bar though. We have 4 million articles with a lot of stubs. Saying GA+ is more important than the average of that 4 million is not much. Also, even with the positive correlation, we still only have 3% of eyeballs at GA+ and 85% of the top 1000 articles at less than GA.
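To put a number on that correlation, here is a back-of-envelope sketch using only the two percentages quoted above. The variable names are mine, and "over-representation" is just my label for the ratio of view share to article share.

```python
# Percentages quoted in the text above.
GA_PLUS_SHARE_OF_ARTICLES = 0.4   # percent of all articles that are GA+
GA_PLUS_SHARE_OF_VIEWS = 3.0      # percent of all reader views landing on GA+

# How many times more views a GA+ article draws than the average article.
over_representation = GA_PLUS_SHARE_OF_VIEWS / GA_PLUS_SHARE_OF_ARTICLES
print(round(over_representation, 1))  # 7.5
```

So a GA+ article gets roughly 7.5 times the views of the average article, which is real but modest next to the 20 times advantage the Vital Articles hold over FAs.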

4. You deprecate collaborations.

No, I don't. That section is phenomenon-investigative (not an argument for action, but observational analysis). We need to think more about collaborations--how to do them and their benefits. I am tentatively favorable to collaborations. My hunch is they will help us get more traction on bigger topics, be more content thoughtful (by being interdisciplinary), and foster apprenticeship. Also they make things less damned lonely.

Format

1. Deck is long and fragmented, no clear audience or theme.

Guilty. Please read it anyway. It is broadly around a theme of quality, but more a collection of deep dives than a single argument document. Wiki is complicated and problem-solving benefits from looking at different aspects and using different methods (statistics, cases, etc.). It would help to have some on-slide tracker or hyperlinks, but I don't intend to do more work on it in this format. If you print the document, you will find it much easier to flip to different sections. Each individual section has a narrative story.

2. The tone is informal, the axes are not labeled and it is not peer reviewed.

Yet.

3. The Four Award descriptions of articles are sarcastic.

True. Fair crit. I indulged myself there. (Like some of the most entertaining movie reviews are the slams.) Maybe it's insightful and maybe it's a distraction. I have a bad habit of mixing in humor with serious points and analysis (but purged most of it from the deck...it was worse before). That one section with the article summaries is the one where I feel most guilty. (I understand the outrage with the segment labels is higher.)

Peripheral

1. Ad hominem dismissals: TCO is in cahoots with (A) the WMF, (B) USEP, or (C) Croatan High School. He is a (D) big-balled troll. He (E) abused Request to Vanish and he (F) gets blocked a lot. He is (G) manic.

(A-C): No. (D-F): Yes. (G): Self-diagnose using Wiki article?

2. He is a front for/stole from/misrepresented/based his work on A. Gorbatai.

No. Level of involvement is as stated in deck (just the two slides, one summarizing and citing a 2 page paper, one a requested new analysis). Not even consultation or a review. Flaws, text, analytical approaches, recommendations, and flippant remarks are all my responsibility. And I was not aware of the Gorbatai work until my own was almost complete.

3. It’s an outrage for Signpost to cover this study.

Our little press should cover the news and the WikiProjects. We should not try to control internal or external criticism.