User:Tzusheng/sandbox/Wikipedia:Wikibench/Evaluation:Editquality

This Wikibench evaluation page shows AI's predictions on edit quality alongside Wikibench's labels, which Wikipedians collectively curate through the edit quality campaign. The goal of the evaluation is to surface the strengths and limitations of AI used in Wikipedia, such as ORES and LiftWing.

Table

Upon successfully importing Wikibench, the table below randomly loads ten edits curated through the edit quality campaign. The number of edits is restricted by the rate limit imposed by ORES and LiftWing.
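The random selection described above can be sketched as follows. This is a minimal illustration, not the page's actual loading script: the function name `sample_edits`, the `seed` parameter, and the revision IDs are all hypothetical, and the limit of ten simply mirrors the number stated above.

```python
import random
from typing import List, Optional, Sequence


def sample_edits(edit_ids: Sequence[int], k: int = 10,
                 seed: Optional[int] = None) -> List[int]:
    """Pick up to k distinct edits at random from the curated pool."""
    rng = random.Random(seed)      # a seed makes the draw reproducible
    k = min(k, len(edit_ids))      # never ask for more edits than exist
    return rng.sample(list(edit_ids), k)


# Hypothetical revision IDs standing in for the campaign's curated edits.
curated = list(range(1000, 1050))
shown = sample_edits(curated)
```

Capping the sample size keeps the number of prediction requests per page load fixed, which is the point of the restriction mentioned above.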

Table columns

Diff IDs: The revision IDs before and after an edit. The link brings you to Wikibench's entity page, which documents Wikipedians' labels for an edit.
Wikibench: The primary label collectively determined by Wikipedians using Wikibench. The number in parentheses ranges from 0 to 1 and reflects the level of agreement among Wikipedians on this primary label. A higher number indicates a higher level of agreement.
ORES: The prediction for an edit by the edit quality model of ORES, which predicts whether an edit is damaging and whether it was saved in good or bad faith. The number in parentheses ranges from 0 to 1 and reflects the confidence of the prediction. A higher number indicates higher confidence.
LiftWing: The prediction for an edit by the language-agnostic revert risk model of LiftWing, which predicts whether an edit will be reverted. The number in parentheses ranges from 0 to 1 and reflects the confidence of the prediction. A higher number indicates higher confidence.

Note that the predictions of ORES and LiftWing use each model's default confidence threshold, which can be adjusted for different applications. For example, ORES uses different thresholds for the recent changes filters.
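To illustrate why the threshold matters, here is a minimal sketch of how one score can yield different predictions under different thresholds. The function `classify`, the default of 0.5, and the example values are assumptions made for illustration; they are not the actual defaults of ORES or LiftWing.

```python
def classify(score: float, threshold: float = 0.5) -> bool:
    """Return True when the score meets the threshold (e.g. 'damaging')."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return score >= threshold


score = 0.62                               # hypothetical model score
default = classify(score)                  # assumed default threshold of 0.5
strict = classify(score, threshold=0.9)    # flags fewer edits
lenient = classify(score, threshold=0.3)   # flags more edits
```

A stricter threshold flags fewer edits and a more lenient one flags more, so the same model can disagree with a Wikibench label under one setting and agree under another.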

Diff ID	Edit damage		User intent		Reverted
Diff ID	Wikibench	ORES	Wikibench	ORES	LiftWing
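One simple way to read the table is to count how often a model's prediction matches the Wikibench label in the same column group. The sketch below assumes each row has been reduced to a pair of label strings; the function `match_rate`, the label strings, and the sample rows are hypothetical and do not come from the campaign's data.

```python
from typing import Iterable, Tuple


def match_rate(rows: Iterable[Tuple[str, str]]) -> float:
    """Fraction of rows where the model label equals the Wikibench label."""
    pairs = list(rows)
    if not pairs:
        raise ValueError("no rows to compare")
    matches = sum(1 for wikibench, model in pairs if wikibench == model)
    return matches / len(pairs)


# Hypothetical (Wikibench label, ORES label) pairs for the edit damage columns.
rows = [
    ("damaging", "damaging"),
    ("not damaging", "damaging"),
    ("not damaging", "not damaging"),
    ("damaging", "damaging"),
]
rate = match_rate(rows)
```

With only ten edits loaded at a time, a rate like this is a rough indication rather than a reliable measure of model quality.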

Limitations and next steps

Because ORES and LiftWing impose rate limits (for safety reasons), retrieving predictions for every edit curated through the edit quality campaign, which a more holistic evaluation of the AI models would require, is technically challenging without WMF's support.

The next step for Wikibench's research team is to communicate the pros and cons of Wikibench to WMF and, if the community finds it useful, to identify the best way to deploy Wikibench more widely on Wikipedia. Please consider signing up if you're interested in future updates on Wikibench.