User talk:Wugapodes/RFA trend lines
(I may be biased but) I enjoyed this! Quite a quick turnaround, but effective at providing a technical explanation. ~ Amory (u • t • c) 11:33, 26 February 2020 (UTC)
- Interesting. Moneytrees🌴Talk🌲Help out at CCI! 02:10, 7 July 2020 (UTC)
Thoughts on your R Code
@Wugapodes:, I love seeing R code in the wild, kudos! That being said, I think your comparison operator is incorrect. Isn't the intention to use `p.start` for `i < switchPoint` and `p.end` for `i >= switchPoint`? I've taken the liberty of providing an idea of my own below:
sampVote <- function(N, p1, p2, s) {
  # First s - 1 !votes are Bernoulli(p1); the remaining N - s + 1 are Bernoulli(p2)
  c(rbinom(s - 1, 1, p1), rbinom(N - s + 1, 1, p2))
}
This is a vectorized version of what you intended to do, I believe.
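For comparison, a minimal loop-based sketch of what I take the corrected original to look like; the names p.start, p.end, and switchPoint follow your post, while sampVoteLoop is just my own name for the helper:
# Loop-based sketch with the comparison fixed: p.start before switchPoint, p.end from it on.
# Should behave like sampVote() above with switchPoint = s.
sampVoteLoop <- function(N, p.start, p.end, switchPoint) {
  votes <- numeric(N)
  for (i in seq_len(N)) {
    p <- if (i < switchPoint) p.start else p.end
    votes[i] <- rbinom(1, 1, p)
  }
  votes
}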
Now in practice:
# Case 1
p1 <- 0.76
p2 <- 0.6
N <- 150
s <- 90
set.seed(133)
A <- sampVote(N, p1, p2, s)
# Mean Case 1
mean(A)
# Plot Case 1: running support percentage by comment order
plot(cumsum(A) / seq_along(A), type = 'l')
abline(v = s)
The mean is 0.64 (it's a weighted average) and the plot is to the right.
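For reference, the expected mean under this two-phase setup is just the weighted average of p1 and p2 by phase length; the observed 0.64 sits within ordinary sampling noise of it:
# Expected support fraction: weighted average of p1 and p2, weighted by phase length
((s - 1) * p1 + (N - s + 1) * p2) / N  # roughly 0.69 for Case 1; the simulated 0.64 reflects sampling noise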
Now let's consider a case closer to Tamzin:
# Case 2
p1 <- 0.9
p2 <- 0.5
N <- 340
s <- 210
set.seed(133)
A <- sampVote(N, p1, p2, s)
# Mean Case 2
mean(A)
# Plot Case 2: running support percentage by comment order
plot(cumsum(A) / seq_along(A), type = 'l')
abline(v = s)
The mean is 0.76 (not exactly our case but close) and the plot is to the right:
I think one graph encodes more information than the other. Thanks again!! -- Avi (talk) 19:56, 2 May 2022 (UTC)
- @Avraham: (sorry for yet another ping!) I appreciate the review. The true answer as to the comparison operator is lost to the sands of time, but I do know that between making the plot and posting the R code I had been playing around with it and refactoring it for public consumption. It's possible it got switched while playing around and I didn't catch it while refactoring. Even if the error was in the original, I believe your replication shows the general pattern still holds; the larger point is that the two are so close that swapping them doesn't really matter when looking at the graphs. You're right that in some extreme cases like Tamzin's the inflection point may well be obvious, but in response I have two main points. First, a historical point: the essay was written in response to the crat chat following Money Emoji's RfA, which had a voting pattern more similar to the example. While I still think the argument is the same (more later), you're right to point out that the given example is a product of its particular historical and rhetorical context. Second, a rebuttal: your example chart could also arise from a stable parameter. If the underlying parameter were always p=0.76, the assumption of independent trials does not preclude an ordering where most "heads" occur first. It's unlikely, but the only reason we can reject that null hypothesis is based on rational evidence, not mathematics or trial order alone. That is the central argument of this essay:
these trends are difficult to interpret even when they might be informative...Like any hypothesis testing tool, a trend line is only useful if we already have a hypothesis.
That point still holds true, and the only reason we know that your example did not arise from a distribution with a stable parameter is because we can inspect "under the hood". We cannot do that in the situation of an RfA, and we have even less confidence in knowing the underlying parameter values. We need to examine more than just the numbers or order of comments in order to come to a rational decision as to the likelihood of a parameter change, but that change, alone or in combination with parameter estimates, is not sufficient for understanding the consensus of a discussion regardless of whether late-breaking information occurs. — Wug·a·po·des 22:34, 2 May 2022 (UTC)
- You are correct in your rebuttal, but consider the following: an urn with 340 balls, of which 255 are blue and 85 are red. What is the probability that at least 189 of the blue balls land in the first 210 spots? That would mean the observed percentage of "blue" at the 210th draw is 90% or greater, while the true percentage of blue remains 75%. This is sampling without replacement, so it's hypergeometric. In R, I'd do:
x <- 189:210  # possible counts of blue among the first 210 draws (at least 189)
m <- 255      # blue balls in the urn
n <- 85       # red balls in the urn
sum(dhyper(x, m, n, 210))  # P(at least 189 blue among the first 210 draws)
[1] 7.924466e-16
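An equivalent way to get the same tail probability directly from the hypergeometric CDF (same m and n as above), which should match the sum:
phyper(188, m, n, 210, lower.tail = FALSE)  # P(X >= 189)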
Agreed, there are oversimplifications here too, but I believe it rebuts your rebuttal: while it is possible to see this pattern at random, it is vanishingly unlikely. The change point is more likely, and may well be more representative of the overall "newer" consensus, as Ixtal says here. Thanks! -- Avi (talk) 23:17, 2 May 2022 (UTC)
- If I may add to what I said in the other thread (I do appreciate the ping here): as most of us are STEM-educated, we may be looking at what is really a political group decision-making process as a strictly mathematical problem. This seriously undermines our understanding of the process by which RfX voters make subjective decisions that then create an objective result. — Ixtal ( T / C ) ⁂ Join WP:FINANCE! 23:53, 2 May 2022 (UTC)
- What happens a lot when looking at RfA trend lines is that editors may feel they have learnt something about that RfA, but I seriously question their ability to then apply that "knowledge" to other past or future RfAs. My feeling from reading the Tamzin discussion (I don't have much experience at RfAs, having only participated in maybe 4 of them) is that editors think they know much more about RfAs than they actually communicate. To the extent that this affects the job of 'crats (judging by how a number of them in the Tamzin chat are talking about trends), this may be a highly problematic wildfire borne out of a toy match (i.e. from a thing that doesn't really exist we create a problem with serious real-world implications). — Ixtal ( T / C ) ⁂ Join WP:FINANCE! 23:58, 2 May 2022 (UTC)
- Not disagreeing. My biggest mistake here was falling back into the jargon with which I am familiar to explain a decision that wasn't mathematical. Thank you. -- Avi (talk) 23:58, 2 May 2022 (UTC)
- It is an understandable assumption that most of the editors who take a liking to meta Wikipedia analysis are educated enough to understand the jargon and thus your overall point, Avraham :) I think we are all various shades of agreeing with each other, but curious to create further knowledge with each other! — Ixtal ( T / C ) ⁂ Join WP:FINANCE! 00:03, 3 May 2022 (UTC)
- Yeah I don't disagree, but unlikely ≠ impossible, so we need to be clear about how we make epistemic decisions, especially when considering how to scale up our models. For example, RfAs usually start with a lot of support; for whatever sociological reason, editors who support a candidate tend to opine before those who don't. The trials are also not independent, and early comments are likely to affect later ones. Neither of these is likely to make a significant dent in that 8e-16 probability (helpful to remember this is the most attended RfA ever, so I think this is more of an extreme lower bound), but as our model starts getting more "real" we need to be more circumspect in what conclusions we can draw from idealized models. Lots of orders are possible, but given domain knowledge, orders in which most support occurs early are actually more likely than our iid models would predict. Comments are not truly independent, so the order of comments has some (unknown) influence on later comments in a way that is hard for our iid models to account for and may not even be stable across RfAs. Given these assumption violations, we know the "early support, later opposition" pattern is more likely than our naive models would predict, but by how much? We can't run the RfA multiple times to see the percentage-over-time distribution, so the range of variation is incredibly large depending on our assumptions and hypotheses. To narrow it down to something tractable, we need to start relying on our domain knowledge, and that's really my point. We'll need those rational arguments anyway, so might as well start with them. I don't disagree that "newer" consensus is useful, I've used it myself when closing some contentious discussions, but I think it's more principled to look at it in terms of how a debate developed (your analyses in the RexxS and Hawkeye 2 crat chats are ideal examples of this mode of argument, I think). Trends in !voting are one kind of evidence to justify that, but not necessary or sufficient to demonstrate consensus in the "grey area" most likely to go to crat chat. These things are hard to interpret, and lots of people do not have the math skills to understand or implement the kind of analysis we've been doing. However, the facade of objectivity makes trying to do it dangerously attractive. Better, I think, to just say "hey, you're probably going to need a lot of linking hypotheses supported by rational arguments from your domain expertise. Just use that to begin with and avoid playing mathematician." All this said, I do think your examples are interesting, and I'll think about how to include them more. I think we agree on how to think about the extremes, and from our discussion I think maybe this essay focuses too much on the ambiguous cases and could be expanded. — Wug·a·po·des 00:37, 3 May 2022 (UTC)
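As a rough illustration of the "range of variation" point above, the following permutation sketch (a toy of my own, not the essay's method) shuffles a fixed Tamzin-sized tally of 255 supports and 85 opposes many times and records the running support percentage at comment 210, under exactly the iid ordering assumption being criticized in this thread:
set.seed(1)
# Toy illustration: a fixed tally shuffled at random, to see how the support
# percentage at comment 210 varies when order carries no information at all.
votes <- c(rep(1, 255), rep(0, 85))
at210 <- replicate(10000, mean(sample(votes)[1:210]))
quantile(at210, c(0.025, 0.5, 0.975))  # spread of the percentage at comment 210 under random order
mean(at210 >= 0.9)                     # essentially zero, consistent with the hypergeometric result above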