Talk:Statistical significance/Archive 5


Wiki Education Foundation-supported course assignment

This article is or was the subject of a Wiki Education Foundation-supported course assignment. Further details are available on the course page. Student editor(s): Cokusiak. Peer reviewers: Cokusiak.

Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 03:49, 18 January 2022 (UTC)

Introduction

I suggest the following introduction:

In statistics, statistical significance (or a statistically significant result) is attained when, simply put, the result is rather extreme, i.e. unexpected, assuming the null hypothesis to be true. As a measure of how extreme the result is, either the p-value of the result should be sufficiently small, i.e. less than a given value, the significance level, or the value of the test statistic should be extreme, i.e. lie in the critical region.

The p-value is the probability of observing an effect as extreme or more extreme than the actual result, given that the null hypothesis is true, whereas the significance or alpha (α) level is the probability of rejecting the null hypothesis given that it is true.

Nijdam (talk) 18:33, 10 June
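For readers following this thread, the two quantities in the proposed wording can be written out compactly (illustrative notation only, not part of the proposal):

    % p-value of an observed test statistic t_obs, under the null hypothesis H_0
    p = \Pr(T \ge t_{\mathrm{obs}} \mid H_0) \quad \text{(one-sided case; a two-sided test uses both tails)}
    % significance level, fixed before the data are seen
    \alpha = \Pr(\text{reject } H_0 \mid H_0 \text{ true})
    % decision rule: the result is called statistically significant when p < \alpha,
    % equivalently when t_obs falls in the critical region.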

  • Oppose. Aside from the clumsy wording, it's not an improvement over the present version. I suggest giving the above discussion, led by EmilKarlsson, some time to come to a consensus on how best to proceed. In the meantime, we could just change "an effect" to "at least as extreme results," as suggested by EmilKarlsson above. danielkueh (talk) 18:48, 10 June 2015 (UTC)
Then improve the wording, but the text as it stands in the intro is too theoretical. Anyone may understand that a result is significant, if it is very unlikely in the light of the null hypothesis. The main point is not the p-value itself, but it being a measure for the extremeness of the found result. Nijdam (talk) 22:33, 10 June 2015 (UTC)
@Nijdam, I've changed "observing an effect" to "obtaining at least as extreme results," as discussed above. danielkueh (talk) 22:43, 10 June 2015 (UTC)
Explaining 'significance' is best done without direct reference to the p-value (of the observed result). As I said above, it is not difficult to understand that a result is significant - meaning pointing towards rejection of the (null) hypothesis - when, assuming the null hypothesis to be true, an unlikely event has occurred. Compare this to Proof by contradiction. What 'unlikely' means may then be explained by the p-value. Nijdam (talk) 09:45, 11 June 2015 (UTC)
We go with the sources, which overwhelmingly define significance in terms of p-values. See cited references in the lead and the extensive discussion in the archives. If you would like to expound further on the process of establishing statistical significance, the best place to do so is in the main body of this article and not the lead, which is supposed to be a summary. You should join the discussion above with EmilKarlsson, who intends to revise the entire article. danielkueh (talk) 11:47, 11 June 2015 (UTC)

I've amended the intro and made explicit ref to the standard fallacies of p-value interpretation, plus a reference. Contributors to this section should note that, as Goodman makes clear in the cited article, many textbooks cannot be relied upon concerning the definition of p-values. Robma (talk) 11:43, 7 August 2015 (UTC)

@Wen D House: I removed the newly added paragraph because it appears to try to settle a controversial issue (frequentist vs Bayesian approach) that is far from being resolved. I am not opposed to adding a new section that compares and contrasts the frequentist and Bayesian approaches. However, I think a better place for that sort of comparison is the p-value article, which already covers it. danielkueh (talk) 16:01, 7 August 2015 (UTC)
@Danielkueh:I think we may be "editing at cross purposes" here. The mods I made were to fix a misconception which transcends the debate between frequentism/Bayesianism. The p-value simply does not mean what the introduction (and, regrettably, Sirkin's text) states, under either paradigm. The frequentist/Bayes debate centres on a different issue: whether metrics like p-values (even correctly defined) are a meaningful inferential tool - which as you rightly say, is best addressed elsewhere. Of course, feel free to revert your edit if you see my point! Cheers Robma (talk) 09:47, 8 August 2015 (UTC)
@Wen D House: While there are often misconceptions, and I sometimes get confused myself (!), in what way does the p-value not mean what is stated in the introduction? Isambard Kingdom (talk) 13:55, 8 August 2015 (UTC)
@Wen D House: I fail to see how your recent edits "transcend" the frequentist/Bayesian approaches. If they are supposed to transcend that issue, then you're clearly using the wrong source because the whole point of the Goodman paper was to address the frequentist/Bayesian issue. I am interested to see your reply to Isambard's question above. danielkueh (talk) 15:17, 8 August 2015 (UTC)

@Danielkueh:@Isambard Kingdom: Gah! Thanks for picking that up; I cited the wrong Goodman paper; here's the right one [1]. This makes clear all the points I'm failing to make - incl the one about the correct interpretation of p-values transcending the Freq/Bayes debate. As Goodman explicitly states in the abstract, there's the p-value and there's its Bayesian counterpart in inferential issues. One is the tail area giving Pr(at least as extreme results, assuming Ho), the other is a posterior probability. As such, they're not different interpretations of the same metric, but literally different quantities calculated in different ways. My edit was simply trying to correct the notion that P-values are the probability that Ho has been ruled out...which isn't true, regardless of one's stance on Frequentism/Bayes. Robma (talk) 18:59, 8 August 2015 (UTC)

Okay, but do you agree that the statement given in the intro, which is *conditional* on the null hypothesis being true, is an accurate statement of the meaning of the p-value? Again, I agree that significance is sometimes not understood, but I think the intro is technically correct, if not exactly poetic. Isambard Kingdom (talk) 19:13, 8 August 2015 (UTC)
Yes, indeed that is correct. What I deleted is the following: "But if the p-value is less than the significance level (e.g., p < 0.05), then an investigator may conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error". This indeed is what is done, but it's an incorrect inference from the p-value (as Fisher himself tried to make clear....). Robma (talk) 08:39, 9 August 2015 (UTC)
@Wen D House: Thanks for clarifying the reference. However, I am still not sure why you made the first edit [2], which doesn't seem to be related to your second edit [3]. If you can explain that further, that would be helpful. Thanks. danielkueh (talk) 21:26, 8 August 2015 (UTC)
Because it removes the incorrect interpretation of the p-value, supported by an unreliable reference, while retaining the frequentist argument in the second part that a p-value < 0.05 can be deemed statistically significant, which of course is correct. Robma (talk) 22:11, 8 August 2015 (UTC)
@Wen D House: But this article is not just about p-values. It is about statistical significance. Significant p-values are typically linked to calculated statistics such as t- or F-values, which are calculated based on the ratio of "effect and error (numerator)" and "error alone (denominator)." That's what the interpretation is based on. danielkueh (talk) 22:20, 8 August 2015 (UTC)

Further work on the intro

Despite the good work above, the lede and the first paragraph in particular are still not helpful to most readers. The lede repeats itself in a few places, and repeats whole sentences that appear elsewhere in the short article. It is full of jargon and parenthetical remarks that could be easily improved, and it includes a few bits that could be moved to the body of the article.

Some clear sentences that could be used more prominently in the introduction (perhaps shortened slightly):

In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone.[15][16] But if the p-value is less than the significance level (e.g., p < 0.05), then an investigator may conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error.

statistically significant, i.e. unlikely to have occurred if the null hypothesis were true, due to sampling error alone.

Parts of the current intro worth removing or refactoring:

  • Any sentence beginning with "Equivalently, ..."
  • Details about history and practice, which are repeated in their own sections
  • Any mention of confidence level. That's confusing, better suited to a short section on related concepts. Pick either that or significance level and use that concept throughout.

– SJ + 23:50, 4 October 2016 (UTC)

Agreed, lots of problems. To begin with, I don't like the use, here, of "sampling error". I don't believe "error" is an issue (it is possibly never an issue). Significance is defined in terms of a null hypothesis (given by a null distribution), and it is measured by the probability that the data at hand could be realized from the null hypothesis ("small" probabilities are "significant"). I'm not sure how to say that for the lay-person, however. Isambard Kingdom (talk) 00:47, 5 October 2016 (UTC)

Initial suggestion below. If we can simplify and clarify the first sentence and first paragraph, the rest of the article can have as much detail and math as desired. – SJ + 01:13, 5 October 2016 (UTC)
Sampling error just means that the sample is not representative of the population. Hence, one way to increase the power of inferential statistical tests is to increase the sample size of the study, so that the sample may more closely approximate the population. Sampling error is ALWAYS an issue, especially in the social sciences, ecology, or in clinical studies. This is such a rudimentary and basic concept in inferential statistics. Forgive me if I sound impatient but I really don't understand the objection here. danielkueh (talk) 03:35, 7 October 2016 (UTC)
Yes, you do sound impatient. Isambard Kingdom (talk) 03:39, 7 October 2016 (UTC)
I am exasperated because we are having unnecessary debates over trivial concepts. I would be more excited if the discussion were geared towards expanding the scope and depth of this article. danielkueh (talk) 03:50, 7 October 2016 (UTC)
Is that the way you talk to people in person? I hope not. On the subject, then, I do find it confusing to refer to an unlikely sample realization as an "error", though, yes, I recognize that some refer to it as such. Isambard Kingdom (talk) 14:35, 7 October 2016 (UTC)
Do you always provide a counterfactual argument against well-established knowledge without supporting references? I hope not. Once again, sampling error is a standard term and concept used in statistics and taught in practically every introductory statistics course. It is not just "some" who refer to it as an "error" (see the WP article). danielkueh (talk) 16:41, 7 October 2016 (UTC)
Daniel, statistics is used for more than error analysis, hence my suggestion that we not describe unlikely samples as errors. Like I said in my edit summary, you can proceed with describing them as "errors", I was only trying to make a suggestion to avoid confusion. Have a good day. Isambard Kingdom (talk) 16:48, 7 October 2016 (UTC)
You seem to imply that I made a conscious decision to use the term "error", in place of another word, for sampling error. I hate to break it to you again, but that is the name of the term. Pick up any stats book and look it up (e.g., [[4]], [[5]], [[6]], [[7]]). It's simply not possible to "suggest" another name. Have a nice day. danielkueh (talk) 12:04, 9 October 2016 (UTC)
I'm not implying anything about a decision you've made. What decision would that be? I'm advocating language that is less confusing to the uninitiated reader. But like I've said, please proceed with sampling error if you think it is best. Isambard Kingdom (talk) 01:51, 10 October 2016 (UTC)
Ok, fair enough. danielkueh (talk) 02:49, 10 October 2016 (UTC)

Suggested rework of the first paragraph

Current first para:

In statistical hypothesis testing, statistical significance (or a statistically significant result)
is attained whenever a p-value is less than the significance level (denoted α, alpha).
The p-value is the probability of obtaining at least as extreme results given that the null hypothesis is true
whereas the significance level α is the probability of rejecting the null hypothesis given that it is true.
Equivalently, when the null hypothesis specifies the value of a parameter, the data are said to be statistically significant at given confidence level γ = 1 − α when the computed confidence interval for that parameter fails to contain the value specified by the null hypothesis.

Proposed para, broken into logical steps:

In statistical hypothesis testing, a statistically significant result
is one that is unlikely to have occurred if the null hypothesis were true.
Statistical significance is the degree of certainty with which
one can say such a result would not have occurred.
Experiments often have an explicitly or implicitly defined significance level,
which is a minimum threshold for claiming a result is statistically significant.
In such cases, a result is statistically significant
if its p-value (p) is less than the significance level (α).
Here p is the probability of obtaining a result at least as extreme as the observed result, if the null hypothesis were true,
and α can be thought of as the chance of observing a false positive, by rejecting the null hypothesis if it were true.
The lower p is for a result, the more statistically significant it is.

Other suggestions and comments welcome.
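
To make the proposed wording concrete, here is a minimal sketch of the decision rule it describes, with invented numbers (this is illustrative only, not article text):

    # Minimal sketch of the decision rule described in the proposed paragraph.
    # The observed statistic and alpha below are invented for illustration.
    from scipy.stats import norm

    alpha = 0.05          # significance level chosen before the experiment
    z_observed = 2.1      # standardized test statistic computed from the sample

    # Two-sided p-value: probability, under the null hypothesis, of a statistic
    # at least as extreme as the one observed, in either direction.
    p_value = 2 * norm.sf(abs(z_observed))   # about 0.036

    print("p =", round(p_value, 3))
    print("statistically significant" if p_value < alpha else "not statistically significant")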

Quick comment, I just want to say that I think the latest modification is a nice improvement. That said, the problem that I see with the proposed paragraph above is that there is no transition from the lead sentence to the second sentence. How do we get from "unlikely to have occurred if ...." to "more precisely, the p-value..."? A simple solution would be to reinsert the original lead sentence (now proposed as the 4th sentence). However, doing so makes the proposed lead sentence redundant and not much of an improvement. By the way, I agree that the sentence "Equivalently, when the null hypothesis specifies.." can be deleted. I don't think it adds much. danielkueh (talk) 02:42, 5 October 2016 (UTC)

Quickly revised. – SJ + 03:32, 5 October 2016 (UTC)
Better. But change "More precisely, it means..." to "It is attained whenever the p-value.." Also, remove "The more important it is to avoid a false positive, the lower experimenters set α." That's different for different fields. Or at least make that clear. As for the new lead sentence, which might catch the attention of the more technically inclined editors, is it an example of the "inverse probability fallacy"? Talk:Statistical_significance/Archive_2, Talk:Statistical_significance/Archive_3 danielkueh (talk) 03:44, 5 October 2016 (UTC)
@EmilKarlsson: Care to comment? danielkueh (talk) 04:12, 5 October 2016 (UTC)
Updated. – SJ +

I developed this version [8] of the lead (and other interior content) to address concerns about redundancy and some confusing use of terms like "sample error". This was, however, undone here [9]. Anyway, I submit it for discussion. Isambard Kingdom (talk) 19:05, 6 October 2016 (UTC)

That's shorter in good ways, but also wordier in others. Can you integrate your ideas into a version of the lead paragraph proposed above? – SJ + 21:00, 6 October 2016 (UTC)
I'm not sure how it is wordier, but okay. "Statistical significance" can be measured by the p-value (as a "confidence"). Whether or not some data are "statistically significant" is an assessment that the experimenter makes of the p-value. Not exactly the same thing, and, indeed, the two notions are sometimes confused by too much focus on 5% levels and arbitrary choices for alpha. Isambard Kingdom (talk) 21:32, 6 October 2016 (UTC)
The proposed lead sentence appears to be a misconception and may even be an example of the "inverse probability fallacy," which has been discussed extensively in archives 2 and 3. This needs to be addressed. I am all for making this article more accessible, but not at the expense of accuracy. danielkueh (talk) 03:03, 7 October 2016 (UTC)
Hi @Danielkueh:, where is the misconception? The inverse p. fallacy would be saying "a stat. significant result is one that makes it unlikely that the null hypothesis is true". The current sentence seems accurate. – SJ + 15:24, 9 October 2016 (UTC)
Exactly. Look at the proposed lead sentence. It says "... a statistically significant result is one that is unlikely to have occurred if the null hypothesis were true." Thus, what you are essentially saying is that "... a statistically significant result is one that is likely to have occurred if the null hypothesis were not true." That's the problem. In statistical hypothesis testing, the null hypothesis is always assumed to be true because it is based on conditional probability. Thus, if the observed p-value is 0.03, which is significant, it means that the probability of finding an effect given that the null is true is p = 0.03. danielkueh (talk) 17:40, 9 October 2016 (UTC)
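The distinction at issue in this exchange can be written out explicitly (a sketch of the standard conditional reading, not proposed article text):

    % What a p-value of 0.03 asserts:
    \Pr(\text{result at least as extreme as the one observed} \mid H_0 \text{ true}) = 0.03
    % What it does not assert (the inverse, or posterior, probability):
    \Pr(H_0 \text{ true} \mid \text{observed result})
    % Equating the two is the inverse probability fallacy discussed in the archives.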

I updated the paragraph to clarify that statistical significance is a degree of certainty, not a binary thing; and it makes sense to talk about (in the sense of "more" or "less" significant events) even without defining a significance level. – SJ + 17:24, 9 October 2016 (UTC)

Oppose. Not an improvement. More vague and unnecessarily more complex. Writing is somewhat colloquial. Besides, significance is always defined based on a predetermined threshold, i.e., alpha, and should therefore be stated early. danielkueh (talk) 17:40, 9 October 2016 (UTC)

ADDED NOTE: If you were to pick up any random statistics 101 textbook, you would see a description of statistical significance and the sequential steps of statistical hypothesis testing that corresponds fairly closely to the existing lead of this article and the WP article on statistical hypothesis testing, minus the history lesson. But you will be hard-pressed to find one that describes it in the way that the proposed lead does. danielkueh (talk) 00:15, 10 October 2016 (UTC)
The last statement is not really true: for instance, you can talk about relative statistical significance without specifying an alpha level (lower p-values are more significant), and you might talk about the relative significance of different alpha levels. Similarly you might talk about "what degree of statistical significance is needed" in a particular context, which might mean "how low should we set alpha". Can you suggest better ways to get that point across? – SJ + 22:43, 9 October 2016 (UTC)
To quote a cheesy movie phrase, "It doesn't matter if you win by an inch or a mile. Winning is winning." No doubt, 0.001 is, as you say, relatively more significant than 0.01, which is itself more significant than 0.04. But why are these values even significant to begin with? Because they are all below the threshold (0.05 in this example). So we're back to talking about thresholds again. That aside, this doesn't appear to be such a fundamental concept that it deserves to be at the very beginning of the lead (WP:UNDUE). danielkueh (talk) 00:15, 10 October 2016 (UTC)
Also, and especially in the natural sciences (where the laboratory might not be controlled, but, rather provided by Mother Nature), one evaluates a null hypothesis (or any hypothesis) and reports the significance p-value as is, giving it, simply, as information, rather than as some binary attempt at rejecting or not rejecting an hypothesis. More generally, not everyone agrees with Fisher's strict procedures for rejecting/not-rejecting null hypotheses. Isambard Kingdom (talk) 22:55, 9 October 2016 (UTC)
Reliable secondary sources please (WP:V and WP:RS). danielkueh (talk) 00:15, 10 October 2016 (UTC)
I started to put together a list of sources, then noticed that many of them are listed in the article: [10]. These are also interesting: [11], [12], [13]. Isambard Kingdom (talk) 01:33, 10 October 2016 (UTC)
Isambard Kingdom, thank you for that. It may be helpful to separate out the articles based on topics (e.g., alternatives to p-values such as confidence intervals, etc). I am well aware of the literature criticizing the use of significance hypothesis testing and the uncritical use of p-values and thresholds. Unfortunately, this article is not the place to resolve that issue. That said, I agree it is important to acknowledge these criticisms. There is a new section heading called Limitations. Perhaps we can add to that list about the use of alternatives such as confidence intervals and/or other methods. Best. danielkueh (talk) 01:43, 10 October 2016 (UTC)

One-tailed vs two-tailed

There seems to be a misunderstanding as to what "power" means. Whether or not "one sided tests are for one sided hypotheses. Two sided for two sided hypotheses" is IRRELEVANT to this point. Power is just the probability of rejecting a null hypothesis, assuming that H_A is true [[14]]. So it is not incorrect to say that one-tailed tests are more powerful, assuming the directional hypothesis is correct, because the rejection region is concentrated on one end. Hence, it is easier to reject with less extreme results. This is taught in practically EVERY introductory stats book (e.g., [[15]], [[16]], [[17]], [[18]]). Hence, I am perplexed as to why this is even an issue. danielkueh (talk) 03:00, 7 October 2016 (UTC)

[19]. Isambard Kingdom (talk) 03:31, 7 October 2016 (UTC)
That WP article describes the difference between a one-tail vs a two-tail test and how/when they should be used. It doesn't (it should) describe the difference between the two tests in terms of power. Take a look at the secondary sources above to better understand the difference. danielkueh (talk) 03:39, 7 October 2016 (UTC)
Here's a sports analogy. A baseball is used in the game of baseball and a football (American) is used in the game of American football. That convention does not contradict the statement that a football is heavier than a baseball. Likewise, a two-tailed test is used for non-directional hypotheses and a one-tailed test is used for directional hypotheses. That does not contradict the statement that in terms of power (1 - beta), a one-tailed test is more powerful than a two-tailed test, i.e., you might get a statistically significant result with a one-tailed test but not necessarily with a two-tailed test, assuming the directional hypothesis is true. See the difference? danielkueh (talk) 04:41, 7 October 2016 (UTC)
I can see the difference between a baseball and a football, yes. And, if it is appropriate to use a one-tailed test, then it should be used. Same for two-tailed, if it is appropriate, use it. But just as we don't use a football when playing baseball, we don't use a one-tailed test when we should use a two-tailed test. I (personally) also wouldn't call a test more "powerful" when that test is, actually, just the test that should be used. Isambard Kingdom (talk) 14:28, 7 October 2016 (UTC)
You're missing the point. The point is that a football is heavier than a baseball regardless of whether you use a football or a baseball in a game of football (or baseball). Also, once again, the term "powerful" just means the probability of rejecting the null given that H_A is true. It's not a value judgment. It's not saying that one is better than the other. For example, the SNK post hoc test is more powerful than Tukey's test. All that means is that you are more likely to get a rejection of the null with SNK than with Tukey's. In fact, some people prefer to use Tukey's test for this reason, because you are less likely to commit a type I error. Finally, did you at least take a gander at the secondary sources that I cited above?!?! Regardless of whether you accept my point or not, we have to follow WP:V. Anyway, here's a passage from one of the sources:
"There is a way of reducing β without increasing α that superficially looks like a good idea: be more specific in our prediction. A one-tailed test is more powerful than a two-tailed test.[emphasis mine] In the latter we have to consider both tails of the distribution and we hedge our bets as to the position of the unknown distribution. For an overall significance level of 0.05 we must set the cut-off point at each tail at p = 0.025. It is like performing two one-tailed tests at the same time, one on each tail. If the unknown distribution really is higher than the known distribution we will only find it if it is beyond if it is beyond the p = 0.025 level. With a one-tailed test, we can focus on only one tail and at that tail α is twice the size (0.05) than for a two-tailed test. Shooting from a two-tailed to a one-tailed test increases 1-β. So far so good but there is a problem. By using a one-tailed prediction we now have no power in detecting the effect if the result goes the 'wrong way.'" [[20]]
danielkueh (talk) 16:41, 7 October 2016 (UTC)
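A small numerical sketch of the power comparison in the passage quoted above, assuming a normal test statistic whose mean is shifted by delta in the predicted direction under H_A (all numbers are illustrative, not from the cited source):

    # Power of one-tailed vs two-tailed z-tests when the true effect lies in the
    # predicted direction (delta and alpha are invented for illustration).
    from scipy.stats import norm

    alpha = 0.05
    delta = 2.0   # true standardized shift of the test statistic under H_A

    # One-tailed test: the whole rejection region sits in the predicted tail.
    power_one_tailed = norm.sf(norm.ppf(1 - alpha) - delta)                 # ~0.64

    # Two-tailed test: the rejection region is split between the two tails.
    z_crit = norm.ppf(1 - alpha / 2)
    power_two_tailed = norm.sf(z_crit - delta) + norm.cdf(-z_crit - delta)  # ~0.52

    print(round(power_one_tailed, 2), round(power_two_tailed, 2))
    # If the true effect went the "wrong way" (delta < 0), the one-tailed power
    # would collapse toward zero, which is the caveat noted in the quote.
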
Daniel, have a good day. Isambard Kingdom (talk) 16:50, 7 October 2016 (UTC)

I'm inclined to agree with @Isambard Kingdom: here, for a few reasons.

  1. This article is too short to fully discuss statistical power, and this certainly shouldn't be its only appearance. The primary connection of power to significance is that, for a given experiment, power and significance are inversely related: doubling the significance level will double the power, trading off false positives for false negatives.
  2. Without that sort of explanation of power as a tradeoff with precision, "being more likely to reject the null hypothesis" should not be equated with being powerful. A one-tailed test is a different sort of statistical accounting, and in fact is misused at times to effectively double α without admitting it.
  3. A one-tailed test doesn't actually have more power than a two-tailed test: they test different hypotheses. A one-tailed test applies to a hypothesis that is twice as constrained, at the same level of power. (Or: makes it twice as likely that you can prove a hypothesis half as strong.) I think this is what Isambard was getting at when he says "if it is appropriate to use a one-tailed test, then it should be used".

You can certainly find statistics books such as the one you quote that approach statistical methods as a tool to demonstrate an effect and improve one's reputation as a researcher. I don't think this is the right article for that approach. (That's certainly not what the founders of statistical hypothesis testing had in mind.) The strong forms of statistical significance are ones with tiny alphas, which are trying to be certain that we really, really know something is true. – SJ + 15:46, 9 October 2016 (UTC)

SJ, here are my responses to your comments below. For clarity, I copied your original comments:

I'm inclined to agree with Isambard here, for a few reasons.
You are free to agree or disagree with what the preponderance of secondary sources say and to have an opinion about them. But this is not a forum (WP:forum). We have to go with what the sources say. danielkueh (talk) 17:40, 9 October 2016 (UTC)
  1. This article is too short to fully discuss statistical power, and this certainly shouldn't be its only appearance. The primary connection of power to significance is that, for a given experiment, power and significance are inversely related: doubling the significance level will double the power, trading off false positives for false negatives.
It's not its only appearance and we have a WP article on statistical power. In this article, we are only describing the difference in power between the one-tailed and two-tailed tests, which is not complicated. Thus, we can add one or two more sentences to explain it further. This is an encyclopedia and so we should strive to be more comprehensive and represent the sources accurately. danielkueh (talk) 17:40, 9 October 2016 (UTC)
  1. Without that sort of explanation of power as a tradeoff with precision, "being more likely to reject the null hypothesis" should not be equated with being powerful. A one-tailed test is a different sort of statistical accounting, and in fact is misused at times to effectively double α without admitting it.
According to reliable sources, the definition of statistical power IS the probability of rejecting a null hypothesis given H_A is true. This is not a controversial statement or issue. The WP article does a good job of explaining this. danielkueh (talk) 17:40, 9 October 2016 (UTC)
  1. A one-tailed test doesn't actually have more power than a two-tailed test: they test different hypotheses. A one-tailed test applies to a hypothesis that is twice as constrained, at the same level of power. (Or: makes it twice as likely that you can prove a hypothesis half as strong.) I think this is what Isambard was getting at when he says "if it is appropriate to use a one-tailed test, then it should be used".
You are conflating the power of a test with the use of a test, which muddles this discussion. As I made clear with my sports analogy above, I'm not disagreeing that one should use a one-tailed test if one knows or can predict the direction of an effect. That's a different matter. The point that the reliable sources are making is that if you were to use both tests to analyze a result with a predicted/known direction, then there is a higher probability of rejecting the null given that H_A is true with a one-tailed test than with a two-tailed test because the rejection region is greater (0.05 vs 0.025) in a one-tailed test than with a two-tailed test. That is why it is more powerful. That is it. There is nothing more to it. danielkueh (talk) 17:40, 9 October 2016 (UTC)
You can certainly find statistics books such as the one you quote that approach statistical methods as a tool to demonstrate an effect and improve one's reputation as a researcher. I don't think this is the right article for that approach. (That's certainly not what the founders of statistical hypothesis testing had in mind.) The strong forms of statistical significance are ones with tiny alphas, which are trying to be certain that we really, really know something is true. – SJ + 15:46, 9 October 2016 (UTC)
First, all the secondary sources are saying is that if we were to use both tests on a particular type of result, we are more likely to get a rejection of the null with one test and not with the other. That's just a statement of fact. There is no hidden agenda or approach in these secondary sources in wanting to "approach statistical methods as a tool to demonstrate an effect and improve one's reputation as a researcher." That's just an ad hominem assertion. Second, the alpha level is the same for both tests. The only difference is that in a two-tailed test, the rejection region is split into two. danielkueh (talk) 17:40, 9 October 2016 (UTC)
I think I have said all I need to say about this topic. The statement that a one-tailed test is more powerful than a two-tailed test is mundane, non-controversial, and supported by a preponderance of reliable sources. One of the authors I cited, Jerome Myers, is a well-established statistician and Professor Emeritus from the University of Massachusetts who has co-authored three editions of his book Research Design and Statistical Analysis, which is a VERY reliable source. Myers gave a fairly comprehensive and technical explanation, which is available online for free [[21]]. I can certainly provide more. But at this point, the burden of proof is not on me. So unless there are many mainstream secondary sources that say otherwise, I would say there is really nothing more to discuss here. However, if the discussion is geared towards wanting to improve the writing of that statement in this article, I am certainly open to that. But to omit a statement or fact simply because it goes against one's personal opinion, that I cannot support. danielkueh (talk) 17:40, 9 October 2016 (UTC)
It wasn't ad hominem :) The page you cited focuses on this - "researchers would prefer to miss an effect than falsely claim one that could affect their reputation."
No, it doesn't. I just looked at the page again and do not see anything corresponding to that. danielkueh (talk) 00:15, 10 October 2016 (UTC)
@Danielkueh: The relevant sentence starts on line 6 of the page. Here is an image. – SJ +
I see that. My mistake. But the author was just making an observation. It's not a motive or approach. Besides, the author was talking about committing a type I error by setting the alpha level to 0.1 (or 10%). That's a pretty low bar. danielkueh (talk) 14:15, 10 October 2016 (UTC)
The definition of statistical power is clear; it's its use in this article (in a way confusing to unfamiliar readers) that we're debating.
I think the root of the problem is: it's not that a one-tailed test is more powerful, it's that a directional hypothesis is more powerful for the same set of {hypothesis, data}, and you always use a one-tailed test to evaluate a directional hypothesis. (what would it even mean to use a two-tailed test to evaluate that? If the result was extreme in the wrong direction, it would not support your hypothesis; the null hypothesis would still be the most likely hypothesis on the table). The Myers reference is indeed a great one, on this detail and on others.
Warmly, – SJ + 22:39, 9 October 2016 (UTC)
No, the test itself is more powerful. It's explicitly stated and explained by the sources. Sorry, this is not up for interpretation (WP:OR, WP:SYNTH). The alternative hypothesis merely states there is an effect (two-tail) and if so, which direction (one-tail). Please read or re-read the sources. Take a look at this diagram [[22]] to aid your understanding. Look at the bottom right tail of the blue null distribution and imagine it was alpha (0.05) and not alpha/2 (0.025). When the alpha region increases, the beta region decreases. Thus, power (1-beta) increases. It's that simple. There is, however, a caveat. If the predicted direction is wrong, then the one-tailed test has no power. Thus, the one-tailed test is only more powerful than the two-tailed test if the predicted direction of the alternative hypothesis is correct. danielkueh (talk) 00:15, 10 October 2016 (UTC)
Anyway, I elaborated on the explanation of the one-tailed vs two-tailed test in terms of power. Hopefully, it clarifies the concerns that were raised. Best, danielkueh (talk) 03:03, 10 October 2016 (UTC)
Thank you! Your new clarification is long, but clear. It seems to me (as a stats user, not a stats instructor) there is a fallacy in suggesting that one-tailed and two-tailed tests can both be relevant to the same situation. The same fallacy is involved in simultaneously saying "X is more powerful than Y (assuming A) but has no power (if A is wrong), therefore X is more powerful than Y". I recognize that some intro stats sources (though not afaict the most precise ones) seem to make exactly this claim as a quick way of explaining power, but I think this article is improved now that it no longer does. Regards, – SJ + 13:51, 10 October 2016 (UTC)
I am glad you approved. You're right, without the qualification, it is a fallacy. danielkueh (talk) 14:17, 10 October 2016 (UTC)

Deleted history text from the lead

I noticed that the history text from the lead was deleted and the rationale given was "Remove lede para repeated almost exactly two paras later." This was an issue that was raised earlier in the discussion and I forgot to respond to it. I agree that we shouldn't repeat things needlessly. However, the lead is supposed to summarize the entire article (WP:LEAD). Describing the history of statistical significance should be a part of that summary. Rather than delete the text, I would strongly prefer to see it either paraphrased or at least have the text in the history section expanded so that it won't be perceived as needless repetition. danielkueh (talk) 21:02, 10 October 2016 (UTC)

I added a short note about the history back in. – SJ + 19:15, 24 October 2016 (UTC)

Timeline for introduction of 'null hypothesis' as concept

The null hypothesis wasn't given that name until 1935 (per Lady_tasting_tea), perhaps there is a way to describe the original definition / the Neyman-Pearson results without using that term. (In the history section). At the least this could include a ref to Fisher's work clarifying the concept. – SJ + 04:24, 4 November 2016 (UTC)

Journals banning significance testing

There is a small movement among some journals to ban significance testing as justification of results. This is largely in subfields where significance testing has been overused or misinterpreted. For instance, Basic and Applied Social Psychology, back in early 2015. I think this is worth mentioning somewhere in the article. Thoughts? – SJ + 19:15, 24 October 2016 (UTC)

I guess the issue would be due weight wp:weight. Are there prominent secondary sources (e.g., review articles) that comment and encourage this movement? Or is this just an editorial policy of a handful of journals? If the latter, I recommend holding off. danielkueh (talk) 20:18, 24 October 2016 (UTC)
Yes, it seems to be a big deal to some secondary sources. Some suggest that using null hypothesis significance testing to estimate the importance of a result is controversial. Here's Nature noting the controversy, here is Science News calling the method flawed, and here's an overview of the argument over P-values from a stats prof. All highlight the decision by BASP as a critical point in this field-wide discussion. – SJ + 04:05, 4 November 2016 (UTC)

Part of the debate: Why Most Published Research Findings Are False: [23]. Isambard Kingdom (talk) 20:28, 24 October 2016 (UTC)

Thanks for sharing but the PLoS article doesn't recommend or encourage the banning of statistical significance. Instead, it recommends researchers not to just "chase statistical significance" and that they should also be improving other factors related to sample size and experimental design. danielkueh (talk) 20:40, 24 October 2016 (UTC)


This article may be really helpful as a citation in the Reproducibility section. This could help to provide insight as to why sometimes it is so difficult to reproduce a study when the original researchers "chased" statistical significance. 148.85.225.112 (talk) 01:23, 7 November 2016 (UTC)

Suggested rework of the first paragraph

The first paragraph needs work. Revisiting the earlier discussion, updated to account for feedback and interim changes to the current lede:

Current first paragraph:

In statistical hypothesis testing, statistical significance (or a statistically significant result)
is attained whenever the observed p-value of a test statistic is less than the significance level defined for the study.
The p-value is the probability of obtaining results at least as extreme as those observed, given that the null hypothesis is true.
The significance level, α, is the probability of rejecting the null hypothesis, given that it is true.

Proposed paragraph, broken into logical segments:

1. In statistical hypothesis testing, a result has statistical significance when
it is very unlikely to have occurred given the null hypothesis.[1]
2. More precisely, the significance level defined for a study, α,
is the probability of the study rejecting the null hypothesis, given that it were true;
3. and the p-value of a result, p,
is the probability of obtaining a result at least as extreme, given that the null hypothesis were true.
4. The result is statistically significant, by the standards of the study, when p < α.

References

  1. ^ Myers, et al. 2010

Suggestions and comments welcome. Please – SJ + 12:54, 2 March 2017 (UTC)

We have already discussed this and there is no consensus for this proposed change. Please see the archives (e.g., Archive 2) for details of the inverse probability fallacy. Plus, the null hypothesis is always assumed to be true (conditional probability). danielkueh (talk) 14:14, 2 March 2017 (UTC)
Hello Daniel, how are you? This proposed first paragraph is different from the previous proposal, and has incorporated all feedback from the earlier discussion. Please look at the details. I have numbered the clauses to make this easier.
I believe you are taking issue with clause 1a. In that case, how would you complete that sentence without using p and α? "A statistically significant result of a study is... "
Finally, I'm not sure where you are going with your comment about the null hypothesis: getting a significant result often leads to concluding the null hypothesis is false.
Hello, Sj, I'm fine. Thanks for asking. Here is my list of issues:
  • First, to be technically correct, you would either "reject" or "retain" a null hypothesis. You would not conclude that it is false (see [[24]]). This approach is based on conditional probability, which is best explained with an example. Suppose you performed an experiment comparing two groups of runners and found that the difference in speed between the two groups was about 10 m/s, with a p-value of 0.02. You would read that as "the probability of finding a mean difference of 10 m/s, given that the null is true, is p = 0.02" (a short code sketch of this reading appears after this list of issues). Thus, the null is assumed to be true. All we're doing is setting a threshold that would allow us to either retain or reject the null. This is not an easy concept to grasp, and if your goal is readability, I don't see the benefit of introducing it so early in the lead paragraph.
I see what you mean, thanks. Does my revised wording above avoid that? – SJ +
If you insist on going this route, at the very least, change "likely" to "very likely," and cite Myers et al. (2010) who states the following:
"First, a statistically significant result means that the value of a test statistic has occurred that is very unlikely if H_0 is true."
danielkueh (talk) 14:06, 8 March 2017 (UTC)
Because somebody says "very" we should say it too? Isambard Kingdom (talk) 14:19, 8 March 2017 (UTC)
No, because "likely" and "unlikely" are often understood as probabilities greater and less than fifty percent, respectively. In statistical significance, we often deal with small probabilities (5% or less). And yes, we should be consistent with the sources, especially if it's written by "somebody" whose work is a reliable source in the field (wp:v). danielkueh (talk) 14:31, 8 March 2017 (UTC)
  • What does "unlikely (segment 1)" mean? Is it measurable? Or is it just subjective probability? There is no transition from 1a to 2. I know "likelihood" in this context is measured by the p-value. An "unlikely result" is one in which the p-value is less than the pre-set alpha. I know that only because I have learned and used statistics. But to a naive reader, that is not clear. Wouldn't it be much simpler to just say a significant result is one in which the p-value is less than the alpha? Removing all ambiguity?
Added a transition: "More precisely," . The point of the rest of the paragraph is to provide that clarification. – SJ +
  • If you were to look at past discussions on this talk page (starting with Archive 2), you will notice that there is no clear consensus on what statistical significance really is. Your proposed paragraph attempts to at least define it as a result that is "unlikely," as indicated by a p-value that is less than alpha. I myself used to think it is just a p-value less than alpha and I am sympathetic to defining a concept more concretely. But some editors have argued that statistical significance is not something concrete like a number. They argue that it is a "concept" or a decision-making judgment aid of some kind. After much discussion, I am somewhat agnostic about what it is. I have dug deep to try to find a canonical reference that would settle it one way or another. But I have yet to find one. That is why we have settled on the imperfect arrangement of just saying that statistical significance (and the less controversial statistically significant result), whatever it may be, is attained whenever the p-value is less than alpha. Until we can find a canonical reference that provides a definitive definition, it is not up to WP to settle the issue by arbitrarily defining it one way or another.
Aha. I will have to look at the earliest instances more closely. As long as α << 0.5, it seems accurate to say that p < α describes a result that would be unlikely (happening a small % of the time) under the null hypothesis. But could you say α=1 and declare that all results for an experiment are significant? My gut says no, but the current definition in this article says yes. – SJ +
In principle, yes. The choice of alpha is arbitrary and is based on convention. A blackjack player may arbitrarily change the rules of the game by setting the upper limit to 51 instead of 21 but he or she is not likely to find any takers. In any event, this is beside the point of this issue. Remove "is a result" as it is redundant and muddles the definition. Instead, just say "statistical significance (or a statistically significant result) is very unlikely to occur, ....." By the way, if you intend to omit "statistical significance" in favor of "statistically significant result," then you should propose a change to the article title. Otherwise, the title should be included in the lead sentence (WP:lead). danielkueh (talk) 14:06, 8 March 2017 (UTC)

Good points. Revised to include the title properly. – SJ + 21:29, 11 March 2017 (UTC)

  • I am all about improving readability but not at the expense of accuracy or precision. There is nothing wrong with using the technical terms (or jargon) in this article. It is, after all, a high-level scientific concept, and a very narrow component of a much larger concept or approach (hypothesis testing). As an example, if you were to take a look at the article "Chi Square," you will notice that it is defined as "the distribution of a sum of the squares of k independent standard normal random variables." Is it "jargon laden?" Yes, it is. But you know what, the chi square article is not an introductory article to statistics where it has to define for the reader every technical term such as distribution, sum of squares, k independent variables, etc. If readers want to know more, they can follow the wikilinks or read a more basic introductory article.
  • Overall, and with due respect, I don't find the newly proposed first lead paragraph to be an improvement. It attempts to squeeze too much into so little space. Part of it, such as the setting of alpha, is already explained in the current second lead paragraph. Why try to squeeze it into the first lead paragraph? There is no reason to do that. If anything, the entire lead can be expanded.
Best, danielkueh (talk) 05:54, 3 March 2017 (UTC)
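A minimal code sketch of the runners example from the first bullet above (the data and the resulting p-value are invented for illustration; they are not the 0.02 quoted in the example):

    # Two-group comparison in the spirit of the runners example above.
    # The data are made up; the p-value is read conditionally on H_0 being true.
    from scipy import stats

    group_a = [9.8, 10.1, 10.4, 9.9, 10.2]    # speeds in m/s (hypothetical)
    group_b = [10.9, 11.2, 10.8, 11.5, 11.0]

    t_stat, p_value = stats.ttest_ind(group_a, group_b)   # two-sided by default
    alpha = 0.05

    print("t =", round(t_stat, 2), " p =", round(p_value, 4))
    # Reading: "the probability of a mean difference at least this large, given
    # that the null hypothesis is true, is p"; we then retain or reject H_0 by
    # comparing p to alpha.
    print("reject H0" if p_value < alpha else "retain H0")
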
The chi square definition looks fine to me, and not opaque: all of the terms it uses are understood outside the context of that concept. In contrast, the definition here tries to introduce two new variables to the reader before concluding the first sentence, neither of which exists outside the scope of that definition. I hope we can do better. Early statisticians had a notion of significance, which they then embodied in this definition; we can describe that notion without using the jargon they developed to make the concept precise, before adding detail.
Additionally, one issue here is that "statistical significance" is used in a tremendous number of non-technical documents. Many people looking for an explanation of it, unlike those looking for chi squared details, will not have a stats background. – SJ + 09:02, 8 March 2017 (UTC)
The term "DNA" is used in a tremendous number of non-technical publications and lots of non-biologists are curious about it. But we don't define DNA as a "goopy substance" do we? I have used chi square multiple times and I myself find the WP lead sentence on that topic to be barely comprehensible. I also find it somewhat disingenuous to say that it is less opaque than the lead sentence of this article given that this article introduces only two variables and actually defines them, not to mention an entire second paragraph that explains the context. In any event, I have said pretty much all I have to say about the proposal. Other editors, especially the stats-savvy ones should weigh in. Where I draw the line is that the lead should NOT change without consensus (wp:consensus). And if there is one, then so be it. But even if you do get a change, be prepared for a potential barrage of pushback. It will keep you busy. :) danielkueh (talk) 14:06, 8 March 2017 (UTC)

The opening sentence of DNA is a perfect example of clarity and perspective. It explains why the topic is important, and why one might have encountered it, without going into technical details (which become more specific, step by step, in the following sentences). – SJ + 21:29, 11 March 2017 (UTC)

Well, I think "molecule" and "development" are technical details. But then again, I'm just a silly biologist. The first lead paragraph of this article used to state the importance of statistical significance to statistical hypothesis testing but that was abruptly removed. Anyway, the importance of statistical significance is explained in the second lead paragraph of this article, which is not addressed by the present or newly proposed first lead paragraph. And that's ok, because statistical significance is a threshold (or finish line) that requires quite a bit of explaining. danielkueh (talk) 01:12, 12 March 2017 (UTC)
Sj, Very clear text. I support the change. Thank you. Isambard Kingdom (talk) 14:18, 2 March 2017 (UTC)
Thank you kindly, Isambard. I would like to refine it a bit further, but I do feel leading with a jargon-free summary, and introducing p and α before using them to define significance, will make this clearer to many. – SJ + 04:50, 3 March 2017 (UTC)


I've revised the proposal based on Daniel's feedback so far, and removed the last sentence which has been shifted to the second paragraph. – SJ + 09:09, 8 March 2017 (UTC) And again. – SJ +

For the purpose of this discussion, I recommend writing out new revisions of just the proposed draft as opposed to making changes directly to the proposal where it first appeared so that other interested editors can follow the changes that have been made. But if you still want to make changes to the original draft, then use strikethroughs and inserts. As for this latest version, it still feels a little rushed from the first to the second sentence. But that's minor. Overall, it looks a lot better. No objections from me for this version. Other interested editors should weigh in. danielkueh (talk) 01:12, 12 March 2017 (UTC)
Ok, I'll do that for any further updates. So far it's only the three of us weighing in; I'll wait a while for other feedback. Warmly, – SJ + 18:53, 18 March 2017 (UTC)

It's been a couple of weeks; any objections to trying this new lede paragraph out? – SJ + 17:48, 31 March 2017 (UTC)

Give it a shot. :) danielkueh (talk) 19:26, 31 March 2017 (UTC)
Okay, see what you think in context. I reused a later Myers cite; not sure if the page numbers apply. Cheers, – SJ + 09:09, 21 April 2017 (UTC)
It looks fine. Good job. I corrected the Myers reference so that it points to the right chapter. danielkueh (talk) 11:41, 21 April 2017 (UTC)

Statistically significant versus highly statistically significant

I agree that the first paragraph could do with a re-write - this might give it greater clarity. However, I also wish to add that I think the opening could distinguish " a statistically significant result" from a "highly significant result". My understanding is that the former is what you get if the probability of getting your results by chance alone was less than 1 in 20, i.e. p < 0.05, whereas the latter is what you get if the probability of getting your results by chance alone is less than 1 in 100, i.e. p < 0.01. This article could also distinguish Type One errors from Type Two errors. Vorbee (talk) 16:51, 12 August 2017 (UTC)
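
As a rough sketch of the distinction proposed above (the cut-offs and labels below are conventional, not a definitive rule, and the precise reading of p remains the conditional one discussed earlier on this page):

    # Reporting the same p-value against the conventional 0.05 and 0.01 thresholds.
    def describe(p_value):
        if p_value < 0.01:
            return "highly statistically significant (p < 0.01)"
        if p_value < 0.05:
            return "statistically significant (p < 0.05)"
        return "not statistically significant at the 0.05 level"

    for p in (0.003, 0.03, 0.3):
        print(p, "->", describe(p))

    # Related terms mentioned above: a Type I error is rejecting H_0 when it is
    # true (probability alpha); a Type II error is failing to reject H_0 when it
    # is false (probability beta, with power = 1 - beta).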