Talk:P-value/Archive 1
This is an archive of past discussions about P-value. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Let us Step Back a Moment
A big reason this wiki page generates so much misunderstanding, is so “unintelligible” (as one person wrote), & is difficult to write is that so many of us expect statistical results to be inferential, whereas the p-value p is not calculated using inferential reasoning. Inference makes statements about the likelihood or probability of some general hypothesis, whereas p assumes such a generality & then states the probability of results “at least as extreme” as the observed data.
In short, inference is inductive (statements about the unknown given the known), but p is deductive (a statement about the known given the unknown).
In my experience, people’s attachment to p is as emotional as it is intellectual. Especially for statisticians steeped in the culture of frequentist statistics, p serves as a security blanket, their only link—however tenuous—to scientific truth. Therefore, people cannot let p go until they embrace Bayesian reasoning.
When we see how Bayesianism is inferential (and therefore gives us what we want scientifically) but p is not, we can tone down our expectations concerning p. In the short term, this allows us immediately to cut thru all the confusion & avoid all the misinterpretations of p cited in this wiki page. In the longer term, it allows us to hold p “loosely,” & to appreciate it for what it is—an approximation, under some conditions, to certain types of Bayesian probabilities. We can become calmer and more rational students of the topic of when p is useful & when it is not. Andrew Hartley 18:00, 17 July 2010 (UTC)
--Sigh--Melcombe, I think, has removed my so-called "eulogy" to bayesianism, saying it's out of place. Now once more the "interpretation" section therefore contains no information on interpreting p. The section does refer to rejecting the tested hypothesis, which is a decision, or an action, but does not help us with inference, that is, with how strongly we can believe or disbelieve that hypothesis. I understand how some people believe that bayesianism & frequentism are two separate worlds. However, one needs to recognize too that statistical inference is a process of looking outwards, inductively, to make statements about hypotheses given data. That process is at root a bayesian exercise, whereas p is instead a deductive probability, viz., a probability about data given hypotheses. Therefore, since we want to use p inferentially, we need some rough correspondence, at least, between p & useful bayesian probabilities. In fact, as my previous, though now removed, remarks claimed, such a correspondence does exist under some conditions. When it exists, we can use p inferentially. Otherwise, the inferential meaning of p is not clear. Therefore, in the next few days I plan to put that "eulogy" back where it was, along with those bibliographic references which justify the correspondence. Andrew Hartley 18:00, 13 Mar 2012 (UTC)
(initial comments)
Can I just add that the following sentence is confusing and needs simplifying: "The lower the p-value, the less likely the result, assuming the null hypothesis, the more "significant" the result, in the sense of statistical significance." Also, more practical examples needed as well as clearer wording. I agree with the comment below. -- JB —Preceding unsigned comment added by 82.22.79.213 (talk) 16:20, 23 February 2010 (UTC)
THIS IS THE MOST UNINTELLIGIBLE DESCRIPTION I HAVE EVER READ ON WIKIPEDIA--PARTICULARLY THE FIRST PARAGRAPH. SOMEONE NEEDS TO RE-WORD THIS ENTIRE ARTICLE SO THAT ENGLISH-SPEAKING PEOPLE CAN READ IT AND UNDERSTAND IT. —Preceding unsigned comment added by 74.76.51.136 (talk) 08:15, 22 April 2009 (UTC)
- Added title. —Nils von Barth (nbarth) (talk) 23:20, 18 April 2009 (UTC)
Does anyone actually know how to figure out P, or is it all made up?
- Certainly all the textbooks explain how to calculate it in various settings. This article, as now written, implies the answer but is not very explicit. Certainly more could be added to it. Michael Hardy 20:23, 5 May 2005 (UTC)
Ahhh, I see, thank you. I still haven't managed to find out how to work out the p-value for a correlational study using Pearson's parametric test... guess I must be looking in the wrong text books!
I have difficulties understanding this article as a layperson. Maybe an example would be good ...
Frequent misunderstandings, part b in comment: there is a numerical mistake. %29 should be %5.
I was adding a numerical example to the p-value article, as requested above, but it's all been deleted. I've no idea why. --Robma 00:17, 11 December 2005 (UTC)
Michael Hardy, who modified the previous statement "If the p-value is 0.1, you have a 10% chance of being wrong if you reject the null hypothesis", should explain this.
- That statement is clearly not true. Michael Hardy 20:15, 28 August 2006 (UTC)
I read again and again and couldn't understand. Why do we need to add the chance of 14/20 heads to the chance of 14/20 tails? Shouldn't the first probability (14 out of 20 heads) be exactly the chance of observing the data under the null hypothesis? —Preceding unsigned comment added by 75.36.187.15 (talk) 20:24, 15 July 2009 (UTC)
Transferred comments of User:Xiaowei JIANG
If the p-value is 0.1, you have a 10% chance to reject the null hypothesis if it is true. ("The current statement is confusing"; Michael Hardy, who modified the previous statement "If the p-value is 0.1, you have a 10% chance of being wrong if you reject the null hypothesis", should explain this.) Note that, in the Bayesian context, the P-value has a quite different meaning than in the frequentist context!
Um, I can't make any sense of this page. Can we have a rewrite? -- Arthur
I agree that this article isn't as clear as it could be - or needs to be (and, as a contributor to it, I take some responsibility for that). The intro, at the very least, needs redoing. Now back to the day-job....Robma 12:27, 5 June 2006 (UTC)
I agree that we should rewrite this page, which may include how to calculate different P-values in various conditions. A good start might come from the calculating of the p-values from randomized experiments.--Xiaowei JIANG 00:19, 20 October 2006 (UTC)
Show some calculations
It would be nice to see some equations or calculations made in the article, that way people are not left standing in the dark wondering where the numbers came from. I know I could follow the article and the example because I have a little experience with probability and statistics. If someone would like, I can post the equations and some other information I feel would be useful.
- Yes, this definitely needs some f-ing formulas. I know what p-values are, but came here in the hopes of getting a formula I could use to calculate them. --Belg4mit 02:44, 31 January 2007 (UTC)
- Usually people just look it up in a table. If you're doing a two-sample t-test you want to find the area under (the integral of) the continuous probability density function for values of t over a particular range, usually –t to t, which is explained here. Dopeytaylor (talk) 02:23, 9 August 2011 (UTC)
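For anyone who would rather compute this than read it off a printed table, here is a minimal R sketch; the t statistic and degrees of freedom below are made-up numbers, not taken from any real data set:
  t_stat <- 2.1              # hypothetical observed t statistic
  df <- 18                   # hypothetical degrees of freedom (e.g. n1 + n2 - 2 for a two-sample test)
  2 * pt(-abs(t_stat), df)   # two-sided p-value: the area outside -|t| and |t|, here about 0.05
The pt() call gives the lower-tail area of Student's t distribution, so doubling the tail below -|t| gives the usual two-sided p-value.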
Still waiting for those formulas... —Preceding unsigned comment added by 70.40.148.236 (talk) 06:25, 12 June 2010 (UTC)
What really bothers me about the coin example is that the definition of the null hypothesis changes. At first, Ho is "null hypothesis (H0): fair coin; P(heads) = 0.5" but in the final statement of the problem, the definition is changed to "the null hypothesis – that the observed result of 15 heads out of 20 flips can be ascribed to chance alone". This is really confusing. — Preceding unsigned comment added by 97.121.184.166 (talk) 14:49, 26 August 2012 (UTC)
Needs improvement
I feel the explanations offered here are too brief to be of use to anyone who doesn't already know what p-values are. What exactly is 'impressive' supposed to mean in this context to a lay person? General improvement in the clarity of language used, number of examples, and adding calculations would benefit this page.
- Agreed -- this suffers from the same disease as most Wiki math-topics pages, i.e. they take a reasonably straightforward concept and immediately bury it in jargon. I'll take a stab at wikifying some of the introduction to this and see if I can include some parenthetical translations for people who are looking to learn something they didn't already know. JSoules (talk) 19:12, 27 March 2008 (UTC) -- ETA: Would " -- that is, the chance that a given data result would occur even if the hypothesis in question were false. Thus, it can be considered a measure of the predictive power or relevance of the theory under consideration (lower p-values indicating greater relevance), by indicating the likelihood that a result is chance" be a fair addition after the initial sentence? Am I understanding this concept correctly? This is how I've seen it used rhetorically, but that was from a semi-trustworthy source only...
...This page is incredibly overcomplicated. On finding P-value, this is what it should say (while of course substituting the appropriate symbols for terms such as mu and xbar):
Step one: Subtract mu from xbar.
Step two: Take the square root of the size of the sample (for example, if the sample is 40 test scores, take the square root of 40)
Step three: Divide the given standard deviation by the square root (final answer) from step two.
Step four: Divide the difference (final answer) from step one by the quotient (final answer) from step three.
Step five: Locate the corresponding value for the quotient (final answer) from step four on a z-chart. This is your p-value. (Example, if your quotient from step four was -1.58, then the corresponding p-value would be 0.0571).
Heck, you could even combine steps two and three... HaploTR (talk) 09:10, 1 December 2010 (UTC)
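For what it is worth, the five steps above describe a one-sample z test with a known standard deviation. Here is a rough R sketch with invented values of xbar, mu, sigma and n (chosen only so that the quotient comes out to -1.58, matching the worked number above):
  xbar <- 71.84; mu <- 75; sigma <- 10; n <- 25   # hypothetical values
  z <- (xbar - mu) / (sigma / sqrt(n))            # steps one to four: z = -1.58
  pnorm(z)                                        # step five: lower-tail area, about 0.0571
Note that pnorm(z) reproduces the one-sided (lower-tail) z-chart lookup; a two-sided p-value would be 2 * pnorm(-abs(z)).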
- I am having trouble following your instructions. It sounds like you are talking about a χ2 test or maybe a test for deviation from a Gaussian distribution's mean. These are two of the many ways of choosing a null model and computing a p-value from it. —Quantling (talk | contribs) 13:51, 1 December 2010 (UTC)
Interpretation
A recent editor added:
- The p-value shows the probability that differences between two parameters are by chance deviation.
This is directly (and correctly) contradicted by Point #1 in the next section. It is always the case that the difference between the two parameters is due to chance, thus the probability of this is 1! Indeed, data-dependent p-values do not have a ready interpretation in terms of probabilities, as Berger and Delampady, and Berger and Sellke, have pointed out. From a conditional frequentist point of view, p-values are not the Type I error probabilities for those experiments that are rejected at the observed p-value, for example. Bill Jefferys 21:41, 18 April 2007 (UTC)
Eric Kvaalen removed my paragraph in this section that said in many situations p approximates the probability of the null hypothesis. He disagrees, thinking it's misleading. I still think that some paragraph such as what I inserted is essential, given that, with that removal, the “interpretation” section contains plenty of advice on calculating p but not a single sentence on interpreting p. I'm adding the paragraph in again, but will supplement it this time with more bibliographic references. Andrew Hartley —Preceding undated comment added 13:38, 18 December 2010 (UTC).
Question
Shouldn't this: Generally, the smaller the p-value, the more people there are who would be willing to say that the results came from a biased coin.
instead read: Generally, the smaller the p-value, the less people would be willing to say that the results came from a biased coin. --68.196.242.64 11 June 2007
- I thought so too, seeing that was what drove me to this talk page. Anyone care to disagree? (I added bold emphasis to your sentences... as well as signing on your behalf...) --Hugovdm 17:56, 14 July 2007 (UTC)
A small p-value represents a low probability that the result is happening by chance. An unbiased coin is supposed to work by chance alone! So if we have a low probability that we are getting these results by chance, we *might want to consider* that we are getting them by using a biased coin. Conversely, a high p-value *suggests* that we have an unbiased coin that is giving us our results based on chance. It is important to remember that the p-value is just another piece of information designed to help you make up your mind about what just happened, but doesn't by itself confirm or deny the null hypothesis. You could be having a crazy day with a fair coin, but then again, you could be having a normal day with a biased coin. The p-value is just a tool to help you decide which is more likely: is it the universe messing with you or is it the coin?
All that being said, I think that this is a terrible example to use to introduce someone to the concept of a p-value! 206.47.252.66 15:03, 4 August 2007 (UTC)
Someone says above that a small p-value represents a low probability that the result is happening by chance. That is false. The result is happening by chance iff the tested hypothesis is true, so the probability the result is happening by chance is the probability of the tested hypothesis, which several notes in the main article show to be a mis-interpretation of the p-value.
I have re-inserted the earlier note that, in many situations, the p-value approximates a useful bayesian posterior probability. This approximation is well-established & documented (see the 4 included bibliographic references), & with the accompanying caveats about the approximation not always holding, is not misleading. khahstats —Preceding undated comment added 12:10, 3 May 2011 (UTC).
Contradiction?
The coin flipping example computes the probability that 20 coin flips of a fair coin would result in 14 heads. Then the interpretation section treats that probability as a p-value. This seems to say that the p-value is the probability of the null hypothesis. But the first "frequent misunderstanding" says "The p-value is not the probability that the null hypothesis is true". Huh? That sounds like a contradiction to me.
- Yes, the example calculates the probability that a *theoretically fair coin* would give the results that we got (14 heads in 20 flips). The example does not calculate the probability of *our coin* producing these results. (Not possible, since we don't know if the coin is fair or unfair, and if it is unfair, exactly how unfair, how often etc). There are two possibilities: the coin is fair or unfair. The p-value only gives you information about one of these possibilities (the fair coin). Knowing how a fair coin would behave helps you to make a guess about your coin. So although the p-value can describe a theoretical coin, it can not directly describe the coin in your hand. It is up to you to make the comparison.
- It seems like a semantic distinction but there is a fundamental difference. Imagine that an evil-overlord-type secretly replaced some of the world's fair coins with unfair ones. What would happen to the probability of a theoretically fair coin producing our results? Nothing would change - the probability of a fair coin giving 14/20 heads would remain exactly the same! But what would happen to the probability that our results were due to chance? They would change, wouldn't they? So once again, the p-value only gives us the probability of getting our results 'IF' the null hypothesis were true. That 'IF' can get pretty big, pretty fast. Without knowing exactly how many coins the evil overlord replaced, you are really left guessing. The p-value can not tell us the probability that the null hypothesis is true - only the evil overlord knows for sure! Hope this helps. 206.47.252.66 15:52, 4 August 2007 (UTC)
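To put rough numbers on the evil-overlord point, here is a small R sketch. The assumed bias of the replaced coins (P(heads) = 0.75) and the assumed fractions of coins left fair are pure illustrations, chosen only to show how the posterior probability moves while the p-value stays put:
  p_data_fair   <- pbinom(13, 20, 0.5,  lower.tail = FALSE)  # P(14+ heads | fair coin), about 0.058, never changes
  p_data_biased <- pbinom(13, 20, 0.75, lower.tail = FALSE)  # assumed bias of the replaced coins
  for (prior_fair in c(0.99, 0.9, 0.5)) {                    # assumed fraction of coins the overlord left fair
    post_fair <- prior_fair * p_data_fair /
      (prior_fair * p_data_fair + (1 - prior_fair) * p_data_biased)
    cat("prior fair =", prior_fair, " P(fair | 14+ heads) =", round(post_fair, 2), "\n")
  }
The first line never changes, but the printed posterior drops from almost 0.9 to well under 0.1 as the assumed number of swapped coins grows, which is exactly why the p-value cannot be read as the probability that the null hypothesis is true.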
Frequent misunderstandings
I think this section is excellent. However, I would delete or substantially modify the 6th misunderstanding since it assumes the Neyman-Pearson approach which is far from universally adopted today. Fisher's interpretation of the p value as one indicator of the evidence against the null hypothesis rather than as an intermediate step in a binary decision process is more widely accepted.
I have seen a great deal of criticism of p-values in the statistical literature. I think this section is very crucial to the content of this page but I think it is only a starting point--the article needs to discuss criticisms that go beyond just potential for misinterpretation. Any basic text on Bayesian statistics talks about these issues (the J.O. Berger text on decision theory & bayesian analysis comes to mind--it has a very rich discussion of these issues). I'll add some stuff when I get time but I would also really appreciate if other people could explore this too! Cazort 19:01, 3 December 2007 (UTC)
Why is the p-value not the probability of falsely rejecting the null hypothesis?
The p-value is not the probability of falsely rejecting the null hypothesis. Why? "the p-value is the probability of obtaining a result at least as extreme as a given data point, under the null hypothesis" If I reject the null hypothesis when p is, say, lower than 5%, and repeat the test for many true null hypotheses, then I will reject true null hypotheses in approx. 5% of the tests. --NeoUrfahraner (talk) 05:57, 14 March 2008 (UTC)
- Maybe you're mistaking p-values for significance levels. 5% is the significance level, not the p-value. If you get a p-value of 30% and your significance level is 5%, then you don't reject the null hypothesis. 5% would then be the probability of rejecting the null hypothesis, given that the null hypothesis is true, and 30% would be the p-value. Michael Hardy (talk) 19:35, 27 March 2008 (UTC)
- To flesh this out a bit, in an article in Statistical Science some years ago, Berger and Delampady give the following example: Suppose you have two experiments each with a point null hypothesis, such that the null hypothesis is true under one of them (call it A) and false under the other (call it B). Suppose you select one of the experiments by the flip of a fair coin, and perform the experiment. Suppose the p-value that results is 0.05. Then, the probability that you actually selected experiment A is not 0.05, as you might think, but is actually no less than 0.3. Above, I pointed out that "From a conditional frequentist point of view, p-values are not the Type I error probabilities for those experiments that are rejected at the observed p-value, for example." This is what I was talking about. Bill Jefferys (talk) 20:55, 27 March 2008 (UTC)
- I forgot to mention that Berger has a web page with information on understanding p-values; note in particular the applet available on this page that implements in software the example above. You can plug in various situations and determine the type I error rate. Bill Jefferys (talk) 21:48, 27 March 2008 (UTC)
- Of course the situation is much more complicated in the case of multiple tests. Suppose that there is one fixed statistical test. Is the statement "The p-value is not the probability of falsely rejecting the null hypothesis" valid in the case of one single fixed statistical test and many objects to be tested as to whether they satisfy the null hypothesis? --NeoUrfahraner (talk) 05:27, 28 March 2008 (UTC)
- You're asking about the false discovery rate problem, which is a separate issue. But it remains true, when testing point null hypotheses, that the observed p-value is not the type I error rate for the test that was just conducted. The type I error rate, the probability of rejecting a true null hypothesis, can only be defined in the context of a predetermined significance level that is chosen before the data are observed. This is as true when considering multiple hypotheses (adjusted by some method for the false discovery rate problem) as it is in the case of a single hypothesis. Bill Jefferys (talk) 14:27, 28 March 2008 (UTC)
- I still do not understand. Suppose that there is one fixed statistical test giving me a p-value when testing whether a coin is fair. Let's say I decide that I say the coin is not fair when the p-value is less than q. Now I test many coins. What percentage of the fair coins will I reject in the long run? --NeoUrfahraner (talk) 17:18, 28 March 2008 (UTC)
- Ah, that's a different question. Here, you have fixed in advance a value q, such that if the p-value is less than q you will reject. This is a standard significance test, and the probability that you will reject a true point null hypothesis in this scenario is q. Note, that you will reject when the p-value is any value that is less than q. So the type I error rate is q in this example.
- But that's a different question from observing a p-value, and then saying that the type I error rate is the value of the observed p-value. That is wrong. Type I error rates are defined if the rejection level is chosen in advance. As the Berger-Delampady paper shows, if you look only at p-values that are very close to a particular value (say 0.05), then the conditional type I error rate for those experiments is bounded from below by 0.3, and can be much larger, in the case that (1) you are testing point null hypotheses and (2) the experiments are an equal mixture of true and false null hypotheses.
- The Berger-Delampady paper is Testing Precise Hypotheses, by James O. Berger and Mohan Delampady, Statistical Science 2, pp. 317-335 (1987). If your institution subscribes to jstor.org, it may be accessed here. Even if you are not at a subscribing institution, you can at least read the abstract. Note that in common with many statistics journals, this paper is followed by a discussion by several distinguished statisticians and a response by the authors. Bill Jefferys (talk) 18:29, 28 March 2008 (UTC)
- I still do not understand. I did not say I reject if it is close to some value. I reject if the p-value is smaller than some fixed value q. You said the type I error rate is q in this example. The text, however, says that the p-value is not the probability of falsely rejecting the null hypothesis. What is correct in that specific example? --NeoUrfahraner (talk) 16:55, 29 March 2008 (UTC)
- The example is correct because of the fact that q is chosen in advance of looking at the data. It is not a function of the data (as the p-value is). It is not correct to take the data, compute the p-value, and declare that to be the probability of falsely rejecting the null hypothesis, because (as the Berger-Delampady example shows) it is not the probability that you falsely rejected the particular null hypothesis that you were testing. Please read the Berger-Delampady paper, it's all clearly explained there. Go to Berger's p-value website and plug numbers into the applet. Maybe that will help you understand. Bill Jefferys (talk) 18:35, 29 March 2008 (UTC)
- Let me try an intuitive approach. Obviously, if you specify q=0.05, then the probability that the p-value will be less than or equal to q is 0.05; that's the definition of the p-value. But, the (true) nulls that are being rejected will have p-values ranging between 0 and 0.05. The rejection is an average over that entire range. Those p-values that are smaller would be more likely to be rejected (regardless of what value of q you choose) and therefore in our judgment would be more likely to be false nulls. Conversely, those p-values that are larger are less likely to be false nulls than the average over the entire [0,0.05] range. The closer the p-value is to 0.05, the more likely it is that we are rejecting a true null hypothesis. Since the rejection criterion is averaged over the entire range, it follows that those values close to or equal to 0.05 are more likely to be rejections of a true null hypothesis than are those for more extreme p-values. But since the average is 0.05, those closer to the upper end are more likely to be false rejections than the average.
- The mistake therefore is in identifying the particular p-value you observe as the probability of falsely rejecting a true null. The average over the entire interval from 0 to the observed p-value would of course be equal to the p-value you observed. But once you observe that p-value, you are no longer averaging over the entire interval, you are now fixed on a p-value that is exactly at the upper end of the interval, and is therefore more likely to be a false rejection than the average over the interval. Bill Jefferys (talk) 23:17, 29 March 2008 (UTC)
- To understand the Wikipedia article, one has first to read the Berger-Delampady paper? --NeoUrfahraner (talk) 18:06, 30 March 2008 (UTC)
- No, but you do need to pay attention. I think I've said everything that needs to be said. It's all in this section, and if you still don't understand, I can't help you further. 18:56, 30 March 2008 (UTC) —Preceding unsigned comment added by Billjefferys (talk • contribs)
- "Use of the applet demonstrates results such as: if, in this long series of tests, half of the null hypotheses are initially true, then, among the subset of tests for which the p-value is near 0.05, at least 22% (and typically over 50%) of the corresponding null hypotheses will be true." ( http://www.stat.duke.edu/~berger/papers/02-01.ps ).
- Actually this means "The p-value is not the probability of falsely rejecting the null hypothesis under the condition that a hypothesis has been rejected." Here I agree, this is indeed a version of the prosecutor's fallacy. Nevertheless, this does not say that the p-value is not the probability of falsely rejecting the null hypothesis under the condition that the null hypothesis is true. --NeoUrfahraner (talk) 10:40, 2 April 2008 (UTC)
- If the null hypothesis is true, and if you select a rejection level q prospectively, in advance of doing a significance test (as you are supposed to do), then the probability of rejecting is equal to q. But that isn't the p-value. You are not allowed to do the test, compute the p-value and then claim that the probability that you falsely rejected the null hypothesis is equal to the p-value that you computed. That is wrong. In other words, you are not allowed to "up the ante" by choosing q retrospectively based on the p-value you happened to compute. This is not a legitimate significance test. Bill Jefferys (talk) 14:49, 2 April 2008 (UTC)
- 4.5 Rejoinder 5: P-Values Have a Valid Frequentist Interpretation
- This rejoinder is simply not true. P-values are not a repetitive error rate, at least in any real sense. A Neyman-Pearson error probability α has the actual frequentist interpretation that a long series of level α tests will reject no more than 100α% of true H0, but the data-dependent P-values have no such interpretation. P-values do not even fit easily into any of the conditional frequentist paradigms. (Berger and Delampady 1987, p. 329)
- This quotation from Berger and Delampady's paper specifically contradicts the notion that an observed (data-dependent) p-value can be interpreted as the probability of falsely rejecting a true null hypothesis. If you want to reject a hypothesis with a specified probability, conduct a standard significance test by selecting the rejection level in advance of looking at the data. Bill Jefferys (talk) 15:13, 2 April 2008 (UTC)
- Why is a prospectively chosen p-value not a p-value? --NeoUrfahraner (talk) 18:22, 2 April 2008 (UTC)
- I didn't say that. It's a p-value all right, but a data-dependent p-value does not have a valid frequentist interpretation in the Neyman-Pearson sense, as Delampady and Berger say. Remember, frequentist theory is all about the "long run" behavior of a large sequence of hypothetical replications of an experiment. In the case of hypothesis testing, in the long run, if you conduct a large number of tests where the null hypothesis is true, and reject at a predetermined level q, then a proportion q of those tests will be falsely rejected. No problem. That is the way to do significance testing.
- But when you observe a particular instance and it has a particular (data-dependent) p-value, there's no long run sequence of tests, there's only the one test you've conducted, and so you can't give it a frequentist interpretation. Frequentist theory applies only to an ensemble of a large number of hypothetical replications of the experiment. But frequentist theory does not have a probability interpretation for a single one of those replications. The best you can say is that probability that you obtained that particular p-value in that particular replication of the experiment is 1, because you observed it; probability no longer is an issue.
- Similarly, in the case of confidence intervals, if you construct the intervals validly for a large number of replications of the experiment, (say for a 95% interval), then 95% of the intervals so constructed will contain the unknown value; but you can't talk about the probability that a particular one of those intervals contains the unknown value. Again, that's because a particular interval is not a large sequence of intervals, so there's no way you can give it a frequentist interpretation.
- If you want to talk Bayesian, then you can talk about the probability of unique events, but if you are talking frequentist interpretations, no such interpretation exists. Bill Jefferys (talk) 19:32, 2 April 2008 (UTC)
- Thanks for posting this. I can't say I understand it completely but it seems to help. So regarding the aforementioned "the conditional type I error rate for those experiments is bounded from below by 0.3": I take it that is not the frequentist interpretation, but a Bayesian one then? Especially since the word "conditional" is mentioned. Huggie (talk) 18:11, 27 November 2009 (UTC)
- OK, so sorry for bringing this back up since it seems like a tiring thing to explain, and no I didn't read the paper, although I did look at the abstract and first page. I could get access to it but you know... I have other things to do and it seems like once you understand it you can't easily justify it to anyone, which is what matters in a practical sense (publications) anyway. But my scenario is: I have a timecourse (i.e. some treatment is given, then an effect is measured at times 5 min, 10 min, 30 min, whatever, and compared to baseline). At 5 minutes treatment leads to a p-value <.05 while at 10 minutes the difference leads to p<.001. Why is it wrong (or is it wrong?) to use p-values as "scores" like this? I intuitively feel it does not make sense but can't figure out the logic leading to that feeling. Repapetilto (talk) 23:42, 20 March 2010 (UTC)
- p-values may be useful for bringing your attention to phenomena that deserve it.
- But there is no frequentist justification for conducting an experiment where you take some data, compute a p-value, take more data, compute another p-value, and do this until you get a satisfyingly small p-value. This will inevitably result in rejection of even a true null hypothesis at any predetermined rejection level, due to sampling to a foregone conclusion (there's no page on this, but it means that if you test and test, you will reject at any level, even if the null is exactly true. It's a property of Brownian motion, which can have unexpectedly large excursions from time to time. This procedure is for this reason not an approved use of p-values.)
- There is no valid frequentist interpretation of a p-value as the probability of a null hypothesis being true for the experiment that was conducted. The p-value only refers to the probability of hypothetical experiments that have not been conducted and will never be conducted being rejected at that particular p-value, given that the null hypothesis is true.
- See Jim Berger's page on p-values. Try the applet that shows that there is no way that p-values actually represent a probability that is reasonably representative of the experiment that has been performed. Bill Jefferys (talk) 04:39, 21 March 2010 (UTC)
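For anyone who would rather see this numerically than via the applet, here is a rough R simulation in the spirit of Berger's example; the sample size and the effect size under the alternative are arbitrary assumptions, and different choices give different (but always far larger than 5%) answers:
  set.seed(1)
  m <- 200000                                 # number of simulated experiments
  truth <- rep(c(TRUE, FALSE), each = m / 2)  # half of the null hypotheses are true
  effect <- 0.5; n <- 16                      # assumed effect size and sample size when H0 is false
  xbar <- rnorm(m, mean = ifelse(truth, 0, effect), sd = 1 / sqrt(n))
  z <- xbar * sqrt(n)                         # one-sample z test of H0: mean = 0, known sd = 1
  p <- 2 * pnorm(-abs(z))                     # two-sided p-values
  mean(truth[p > 0.04 & p < 0.05])            # fraction of "p near 0.05" results whose null was true
Under these assumptions the last line comes out around 0.2, i.e. roughly a fifth of the results that look "significant at about the 0.05 level" are rejections of a true null, which is the point of the 22% figure quoted above.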
Probability of 14/20 on a fair coin = ?
Shouldn't the correct probability (as determined by the PMF of a binomial dist) be .03696? —Preceding unsigned comment added by 70.185.120.218 (talk) 14:03, 28 March 2008 (UTC)
Yes, I second the above...the probability of 14 heads out of 20 coin flips is given by
20!/[(20-14)!14!]*(1/2)^20 = 0.03696
The coin toss example is erroneous. —Preceding unsigned comment added by 165.89.84.86 (talk) 19:31, 20 April 2009 (UTC)
- The original version of the coin-toss example is not erroneous, and I have reverted back to the original example.
- The one-sided p-value is defined as the probability of getting 14, 15, 16, 17, 18, 19, or 20 heads on 20 flips. That is, it is the sum of 7 numbers. It is not just 0.03696; that is just the largest of those 7 numbers. The number you want is the sum of the number you calculated above and 6 more numbers of the form
- 20!/[(20-n)!n!]*(1/2)^20
- for n=15, 16,...,20.
- The R (programming language) evaluates this sum as 0.05766 (enter pbinom(6,20,0.5) to get the lower tail, which is equal to the upper tail described above). The number 0.03696 is equal to the probability of getting exactly 14 heads on 20 flips (in R, type in dbinom(6,20,0.5)). But that's not the p-value, which is defined to be the probability of getting 14 or more heads on 20 flips (one-sided). The two-sided value is twice the one-sided value. Bill Jefferys (talk) 17:28, 3 July 2009 (UTC)
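To make the arithmetic easy to reproduce, here is the whole calculation in R; dbinom gives individual binomial probabilities and pbinom gives tail sums (the 15-head case discussed in a later section is included for comparison):
  dbinom(14, 20, 0.5)                          # P(exactly 14 heads) = 0.03696
  pbinom(13, 20, 0.5, lower.tail = FALSE)      # P(14 or more heads): one-sided p-value, about 0.0577
  2 * pbinom(13, 20, 0.5, lower.tail = FALSE)  # two-sided p-value (adds the 6-or-fewer-heads tail), about 0.115
  2 * pbinom(14, 20, 0.5, lower.tail = FALSE)  # two-sided p-value had 15 heads been observed, about 0.0414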
About "The p-value of this result would be the chance of a fair coin landing on heads at least 14 times out of 20 flips plus the chance of a fair coin landing on heads 6 or fewer times out of 20 flips."
Shouldn't this read "The p-value of this result would be the chance of a fair coin landing on heads at least 14 times out of 20 flips."?
Or should it read "The p-value of this result would be the chance of a fair coin landing on heads at least 14 times out of 20 flips plus the chance of a fair coin landing on TAILS 6 or fewer times out of 20 flips."? Briancady413 (talk) 16:41, 20 January 2009 (UTC)
significance
Can someone please clearly state whether the larger the p-value, the "more" significant the coefficient, or the reverse: the larger the p-value, the "less" significant the coefficient. Which one is right? I think the former is right. I am confused very often. Jackzhp (talk) 23:53, 18 April 2008 (UTC)
- The p-value is the probability of erroneously rejecting the hypothesis that the coefficient's true value is zero. IOW, the larger the p-value, the "less" significant the coefficient. Wikiant (talk) 18:35, 1 October 2009 (UTC)
Shades of gray
I've just removed "However, the idea of more or less significance is here only being used for illustrative purposes. The result of a test of significance is either "statistically significant" or "not statistically significant"; there are no shades of gray." from the beginning of this entry.
My reasoning is that people familiar with significance tests don't need this kind of qualification, and people unfamiliar with them will be confused or misled.
It is confusing to put the idea of "more significant" on the table and then take it away. (It isn't meaningless, as the remark implies; like "significant" itself, it can have different technical meanings depending on the choices one makes. If these choices are made explicit, it has real--- not just "illustrative"--- meaning.)
It is misleading to summarize the conclusion of a significance test as just a binary "statistically significant" or "not statistically significant", as if that settles the matter. The content of these phrases is embedded in the actual tests used, the values of the parameters used, and so on. Whether we use a p-value of 0.05 or 0.01 or 0.12345 we are making an arbitrary choice that changes the technical meaning of "statistically significant" or "not statistically significant." There certainly _are_ "shades of gray" if we condense this information into just a binary "significant" or "not." I understand the intent of the lines here, but it is better expressed in the entry on statistical significance, which is already referenced in this entry. 75.175.218.136 (talk) 22:40, 5 September 2009 (UTC)
Significance
- However, had a single extra head been obtained, the resulting p-value (two-tailed) would be 0.0414 (4.14%). This time the null hypothesis - that the observed result of 14 heads out of 20 flips can be ascribed to chance alone - is rejected. Such a finding would be described as being "statistically significant at the 5% level".
Why would it be rejected if 4.14% is still below 5% ? DRosenbach (Talk | Contribs) 17:44, 1 October 2009 (UTC)
- The value would then be below 5%, but the value before was not below 5%, it was 5.77% Melcombe (talk) 14:19, 2 October 2009 (UTC)
- I think the single coin toss example is a very poor one. If one is trying to judge the fairness of a single coin, a .01 confidence level would be more appropriate. —Preceding unsigned comment added by 70.239.212.52 (talk) 18:06, 15 April 2010 (UTC)
External reference linking to a paying page
Schervish MJ (1996). "P Values: What They Are and What They Are Not" : only the first page is readable as an image, for the rest, one needs to pay —Preceding unsigned comment added by 188.93.45.170 (talk) 23:09, 7 November 2009 (UTC)
Frequent misunderstandings
I find several of the points in the "Frequent misunderstandings" paragraph lacking a clear reasoning. One example is point 3: "The p-value is not the probability of falsely rejecting the null hypothesis." A link to the prosecutor's fallacy is not established, the only guess I have is that there is an incorrect assumption of the H0 and H1 hypotheses having the same prior probability. I believe that the p-value is the probability of falsely rejecting the H0 hypothesis if it were true and if we were to reject it (for example) for every value to the right side of the value our experiment returned. —Preceding unsigned comment added by 148.188.9.56 (talk) 15:49, 12 February 2010 (UTC)
- No, this is still wrong. The rejection level MUST be established in advance. This is the only correct probability interpretation. See above. Bill Jefferys (talk) 00:10, 18 April 2010 (UTC)
I have trouble with these two statements: "The p-value is not the probability that the null hypothesis is true." "real meaning which is that the p-value is the chance that null hypothesis explains the result" Surely where the null hypothesis is true it explains the result? Zoolium (talk) 09:01, 30 June 2010 (UTC)
Some mathematics?
I was thinking of adding some mathematics to the introduction here, and I came up with the following. But I'm not completely sure it's accurate, so hopefully some of you can comment on it. Not even sure if it's a good idea to add it at all:
The P-value of a sample from a population can be written mathematically as
p = P(Z ≥ (x̄ − μ0) / (σ/√n)),
where Z is the standard normal distribution, x̄ is the sample average, μ0 is the actual average tested against, σ is the standard deviation and n is the size of the sample. --Cachinnus (talk) 18:45, 15 May 2010 (UTC)
The formula you have provided is only valid when you sample from a normal distribution (or you assume normality, or have a large sample size such that normality can be assumed), you already know the standard deviation σ (you can't estimate it from the sample) and your alternative hypothesis is that μ > μ0.
This is almost certainly over-simplified and too specific, and this is only one of many commonly used formulas.
Tank (talk) 13:25, 18 May 2010 (UTC)
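To illustrate the point about the standard deviation, here is a short R sketch on a made-up sample (the eight numbers and the "known" sigma are pure assumptions): in practice sigma is estimated from the data, so the usual route is a t-test rather than the z formula above:
  x <- c(102, 98, 110, 95, 101, 99, 105, 97)   # hypothetical sample
  mu0 <- 100                                   # hypothesized mean
  t.test(x, mu = mu0)$p.value                  # two-sided p-value using the estimated sd (Student's t)
  z <- (mean(x) - mu0) / (5 / sqrt(length(x))) # the z formula, pretending sigma = 5 were known
  2 * pnorm(-abs(z))                           # corresponding two-sided normal-theory p-value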
Hi-jacked by minority opinions
The P-value is one of the most ubiquitously used properties of science - yet this article is plastered with minority opinions on how alternative and better statistics are available. Yes there are references, and links and argument - but to more minority opinions. As an example take a main one such as [2] (Hubbard et al), which itself acknowledges that it is in opposition to most established statistical text books. I'd like to start a discussion on whether it wouldn't be better to make the bulk of the article describe the established use of P-value, in the standard established way, based on a major statistical text book. And then leave a smaller fraction to a problems and limitations section, where all these claimed "confusions" and "misunderstandings" can live.--LasseFolkersen (talk) 16:09, 8 December 2011 (UTC)
Valid reference?
I'd like to note that a potentially controversial claim is made in the intro of this entry that only references this non-peer-reviewed "working paper":
- Raymond Hubbard, M.J. Bayarri, P Values are not Error Probabilities. A working paper that explains the difference between Fisher's evidential p-value and the Neyman–Pearson Type I error rate .
While both authors have academic affiliations, their paper has not been published in a peer-reviewed venue. As well, they expressly say their thesis goes against the views of most researchers, and frequently say things like "researchers erroneously believe ..." They may be correct, but a non-peer-reviewed paper saying most experts are wrong seems perhaps inappropriate for implying an accepted fact in an encyclopedia. But I leave this to others adept in wiki criteria to make a decision.
98.204.201.124 (talk) 00:17, 25 May 2012 (UTC)
- Besides being peer-reviewed or not, another consideration is any published comments on the paper (so that it either does or does not have backing by independent reviewers). An internet search suggests that the working paper has been cited 10 times, but half of these are in foreign languages, one shares an author with this working paper, one is a set of conference slides where the attitude to the paper is difficult to determine, and others are not available online without payment. While some of these citers were in what seem to be well-established journals, none seemed to be in mainstream statistical journals. It may be that [this published paper] (in Theory & Psychology, 2008), involving one of the same authors, could be used as a substitute for the existing citation if it says something broadly similar. Melcombe (talk) 01:12, 25 May 2012 (UTC)
- Also, there are lots of contrary views expressed among academics, and of course each view can't be the consensus view. Yet in this case, this entry presents a view about research methodology that the cited authors state is at odds with the views of most researchers and presents this uncommon view as accepted fact. The statement for this citation in question might be more accurate if modified to reflect the fact that at least two experts believe it's the case that ... 98.204.201.124 (talk) 09:49, 25 May 2012 (UTC)
- It sounds to me like that sentence and its reference should be moved away from the introduction at least. --LasseFolkersen (talk) 11:10, 25 May 2012 (UTC)
- The specific point being made in the lead section is not something that is particularly controversial and it should be possible to find some replacement citation(s)... and the point is so simple that it does deserve to be mentioned in the lead section. As for the second occurrence, someone would need to decide what points are worth making in this article about p-values and find reliable sources to back them up. It may be that the paper mentioned above contains some reasonable sources for alternative view points. In fact, the second citation relates directly to something that Fisher is said to have said, and the fact that he wrote something like this is presumably not controversial. The citation as I recall does say that Fisher did say this, which is all that you can require of a citation. Someone might supply a citation to a publication by Fisher. The more "controversial" aspects of the Hubbard paper don't seem to be mentioned in the article. Melcombe (talk) 11:52, 25 May 2012 (UTC)
- I have gone ahead and replaced the citation with the peer-reviewed version as this meets the point originally made about the reference not having been peer-reviewed. That obviously doesn't preclude further changes. Melcombe (talk) 12:30, 25 May 2012 (UTC)
Bayesian Fanaticism
Interesting how at least HALF of the references are not linked to articles explaining the p-value, but articles to please the Bayesian crowd on how horrible p-values are.
Turns out that in nearly every single Frequentist/Fisherian statistical article in Wikipedia, at one moment or another, there is a section where some Bayesian fanboy sells his product which, if it were along the lines of "there are alternatives", would be okay, but often it is along the lines of "frequentist BAD... Bayesian GOOD", and it is becoming a bit annoying besides being untrue.
The section named Problems maybe, just maybe, should be renamed as Criticisms? Especially considering the NON problems listed in the section, like:
- the criterion used to decide "statistical significance" is based on the somewhat arbitrary choice of level (often set at 0.05)
So researchers choose what is significant in their field and that's a problem?? Physicists at CERN choose a five sigma significance level to claim the discovery of a new particle, also arbitrary, also a problem? Of course not!
- If significance testing is applied to hypotheses that are known to be false in advance, a non-significant result will simply reflect an insufficient sample size.
Why would anyone in his right mind apply any test to something that is already known to be false? To please Bayesians with a pointless point?
- The definition of "more extreme" data depends on the intentions of the investigator
Not the intentions but rather the sampling methodology, and this is rather a virtue than a problem.
Well, I could continue actually but, since it seems the Bayesian crowd is powerful in Wikipedia, I don't feel like going into an update/remove-update fight, but really guys, this is lowering the standards of Wikipedia. To sell your Bayesian product is all right, and for some problems it is the adequate approach, but you don't need to trash every non-Bayesian approach in the process and present your criticisms as if they were undeniable problems in non-Bayesian methodologies. — Preceding unsigned comment added by Viraltux (talk • contribs) 10:26, 23 January 2013 (UTC)
- I can only second that. P-values are widely used in science, but any student reading here is going to think it's an obscure concept to be avoided at all cost. It's not. So at least anybody who is confused on reading this article and makes it on to this talk section: get a good statistics book instead. Wikipedia can't be used to learn statistics.--LasseFolkersen (talk) 21:49, 23 January 2013 (UTC)
- One solution is to provide more references to the proper basis of p-values. A number of points have one citation, but Wikipedia standards state that more than one source is preferable. 81.98.35.149 (talk) 19:10, 24 January 2013 (UTC)
Shouldn't the P value in the coin example be .115?
Since the null hypothesis is that the coin is not biased at all, shouldn't the universe of events that are as or less favorable to the hypothesis include too many heads AND too many tails? For example, the coin coming up all tails would be more unfavorable to the "fair coin" hypothesis than 14 heads. If the null hypothesis was "the coin is not biased towards heads (but it may be towards tails)", .058 would be correct.
- But as given, the null hypothesis is not that the coin is fair, but rather that the coin is not unfairly biased toward "heads". If the coin is unfairly biased toward "tails", then the null hypothesis is true. Michael Hardy 20:13, 28 August 2006 (UTC)
Is 2x0.058=0.116? (in the Example)
Nice example. Some (minor) points:
1) 2 x 0.058 = 0.116
2) I think, for the purposes of the example, H_{1} is usually taken to be the logical negation of the statement that is "H_{0}", so that in the example:
i) H_{0} = "p_{heads}=p_{tails}=0.5"
ii) H_{1}=¬H_{0}= "Either (p_{heads}<p_{tails}) OR (p_{heads}>p_{tails})" (where H_{1} could be written using a symmetric difference logical symbol).
The p-value is then p = "the probability of obtaining a _test statistic_ *at least as extreme as the one that was actually observed*, assuming that the null hypothesis is true" = Prob({observation of events as extreme or more extreme than the observation, under the assumption of H_{0}}) = Prob({14 heads OR MORE out of 20 flips (assuming p_{heads}=0.5)}) = Prob(14 heads for fair coin) + .... + Prob(20 heads for fair coin).
{*Where notions of "extreme events" for this example are assumed to refer to EXCESSIVE NUMBERS OF HEADS ONLY. }
Here, our test statistic (I think) is just the sum of 20 characteristic functions, T = \sum_{i=1}^{20} I(heads_i) - ie: count (and add) 1 whenever we observe a head. It is useful to note that our Observation is, actually, an EVENT (which is why it has a probability). Of course, the test statistic was never explicitly defined....
This example is AMBIGUOUS according to the definition given in the introductory paragraph BECAUSE we have not defined what the "test statistic" is and what is meant by "at least as extreme as the one that was actually observed". Also, the test statistic page does not go into as much mathematical detail as one might hope (ie: with a use of functions).
Test statistic ambiguity : IF the test statistic is the number of heads, that seems clear enough (but it's good practice to state that test statistic clearly, and to emphasise that it's NOT the number of consecutive heads, etc...). Extremum ambiguity : In this case, we would say that the p-value encompasses those events described by {T>=14}={T=14,15,16,17,18,19 or 20} (disjoint events). The possibility arises that we might define the p-value to be only those events described by {T>15}, but common-sense/intuition means that, in this example, the notion of extreme event we are interested in is defined by {T>=14}. Technically, I could create a notion of extreme events which would not be intuitive at all (though such a notion might end up being quite useless...).
Using the definition of the introductory sentence (as it stands) we would have to CHANGE our (at present, UNDEFINED) test-statistic, OR make precise what we meant by an "extreme event" (ie: obtaining 14 or more heads OR (disjointly!) 14 or more tails).
The example to a newcomer as it stands contains sufficient ambiguity to cause confusion (for those who like a procedural approach). Of course, for someone who has a pre-existing "intuition", what I state is not an issue and the example is an acceptable one for such a person.
PS - Am I correct in saying that, technically, "one-tailed tests" and "two-tailed tests" are, actually, just tests based upon different (and complementary) test statistics (using the examples notion of "extreme event"). That is, we could go without ever having to use notions of one-tailed test statistics at all IF we explicitly wrote down our test statistic each time (though, to some, that would be tedious - though, to others, that would be good formalism).
PPS - I have seen criticisms of this example elsewhere, and I imagine that those criticisms must (at least partially) pertain to the above observations. Having said that, things are intuitively clear - though I would have expected more from Wiki.... AnInformedDude (talk) 22:51, 2 February 2013 (UTC)
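In case it helps, here is the procedural version of the example with the test statistic written down explicitly in R: T = number of heads in 20 flips, with "extreme" spelled out as {T >= 14} for the one-sided version and {T >= 14 or T <= 6} for the two-sided version:
  t_values <- 0:20
  probs <- dbinom(t_values, size = 20, prob = 0.5)   # P(T = t) under H0: fair coin
  sum(probs[t_values >= 14])                         # one-sided p-value, about 0.058
  sum(probs[t_values >= 14 | t_values <= 6])         # two-sided p-value, about 0.115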
p-values in hypothesis testing
I've provided a few pages illustrating an approach to using p-values in statistical testing that I have used in the past. I'm no wizard at providing comments in these talk pages, but the file has been uploaded. The file name is P-value_discussion.pdf. Hope this helps.
03:20, 9 April 2013 (UTC)
New Figure Added
I created a figure that I believe explains what a p-value is and strongly addresses the most common misunderstanding. If anyone disagrees or would like to edit it let me know. — Preceding unsigned comment added by Repapetilto (talk • contribs) 22:25, 22 December 2011 (UTC)
- Perhaps if you remove the grey text box in the top. It's just confusing for a reader who wants to get an easy introduction to P-values. Save all the bayesian thinking for later in the article. Also, in-image text boxes like this is not particularly wikipedia-style, --LasseFolkersen (talk) 19:56, 27 December 2011 (UTC)
- I realize it is somewhat unconventional. However, one of my goals in creating the figure was to strongly address this common misunderstanding outright because I was confused as to what everyone was going on about for years. Additionally I think it helps greatly in defining what a p-value means to the average person. Just looking at the chart always failed to inform me. So I think the good outweighs the bad. Also, I would not say that this is an example of "Bayesian thinking", it is just probability theory. I guess we'd need someone new to stats to chime in on whether it just confuses them or not.--Repapetilto (talk) 04:33, 28 December 2011 (UTC)
- Ok if you insist, we can wait and see if others chime in. I just think it's a pity to put in that ugly text box in an otherwise nice figure, when the normal text caption would do just as well for that purpose. --LasseFolkersen (talk) 14:14, 29 December 2011 (UTC)
- Anecdotally, the main point of confusion was why scientists "assume a bell curve"... Explaining the text box required a "Rain is always associated with clouds, but you can have clouds without rain" example. So perhaps that should be included in the introduction. — Preceding unsigned comment added by Repapetilto (talk • contribs) 08:33, 30 December 2011 (UTC)
- No, I agree with LasseFolkersen, the big, chrome backgrounded warning box is distracting, and makes it look more complicated than it actually is. Also, the floating axis at the bottom looks like the tails are floating at a non-zero value above the axis - also potentially confusing. Otherwise, the graph looks okay! - Xgkkp (talk) 18:25, 24 February 2012 (UTC)
- It is that complicated. There is a widespread misplaced trust in p-values because people think they are simple to interpret. Either way, I replaced it with a simpler version. I think Wikipedia should strive to be more informative than other sources. But so be it. Perhaps my method of getting this point across was not ideal. — Preceding unsigned comment added by Repapetilto (talk • contribs) 02:59, 21 April 2012 (UTC)
I am removing the grey box. It is illegible in my browser, and it is out of keeping with wikipedia's general stylistic standards. Just because something is commonly misunderstood does not mean that we need to reach for drastic fonts and background effects to correct people's impressions of it. Ethan Mitchell (talk) 01:52, 2 April 2012 (UTC)
- Now it seems picture is entirely gone. That's also a pity. I really did like the graph parts, just without the grey box --LasseFolkersen (talk) 09:03, 11 April 2012 (UTC)
- Is this OK then? Repapetilto (talk) 02:56, 21 April 2012 (UTC)
- Why is the vertical axis labeled "Probability"? Let's assume that this is a density curve (otherwise the p-value graphic is meaningless). Probability is area under the density curve, not the height of the curve. For example, the green shaded area is a probability. The correct label for the vertical axis is probability density, or simply density. It is also somewhat confusing to label the horizontal axis using "observation" as in "Very unlikely observation." The p-value is easy to interpret - but it does not tell us anything about the likelihood of an individual observation. The distribution of the data, and the distribution of the test statistic are different distributions, and the p-value is a probability statement about the sampling distribution of the test statistic. Mathstat (talk) 19:24, 21 April 2012 (UTC)
- This figure as it is, is incorrectly labeled and quite misleading. It is misinformation. Call for removing it. Please comment. Mathstat (talk) 16:37, 28 April 2012 (UTC)
- I think the figure is nice-looking and largely correct, but agree that the y-axis labelling (y-axis "Probability" / density) could be discussed further. I think removing it is overkill. --LasseFolkersen (talk) 09:09, 7 May 2012 (UTC)
- Besides the vertical axis being mislabeled, the labeling of "most likely observations" or "unlikely observation" is incorrect, and it confuses the distribution of the test statistic with the distribution of individual sample elements. This by itself leads to much confusion, especially for students. If, e.g., a test of the mean height of 30 people is significant at an observed mean of 69, this does not mean that a height of 69 is a "very unlikely observation". The observed test statistic is labeled "observed data point". So the figure really only applies to a sample size of 1, since it conveys that the test statistic is one of the individual observations of the sample. What is unfortunate here is that the figure reinforces the main thing that is typically misunderstood by students of introductory statistics - it reinforces the wrong interpretation. Mathstat (talk) 12:30, 7 May 2012 (UTC)
- I thought I had a handle on this but maybe not. I see nothing wrong with applying p-values to a test statistic with sample size=1. I think it is easiest to explain through example.
- So let's say you are testing samples from batches of beer (10 kegs per batch) for "hops levels". You don't want to throw out an entire batch just because 1 keg is bad, and you also do not want to perform ten tests. It is cheaper for you to deal with returns than to test more than one keg per batch. You perform 1000 experiments in which you test all ten kegs and find that hops levels are normally distributed amongst the kegs with a standard deviation of 10. Further, you do some taste tests and determine that the best-tasting beer has a "hops level" of 100, and deviations from this value get progressively less acceptable. So the best you can hope for (until someone devises a more consistent brewing process) is that your batch has a mean hops level of 100 with six or seven kegs within the range 90-110.
- After this you test one keg of a batch. Your results tell you hops level= 130. Is it likely or unlikely that this keg comes from a batch with mean of 100? You test another batch and get hops level=100. Is it more or less probable that it comes from a batch with mean hops level=100 than the batch with hops level=130?
- Well, from the distribution you discovered earlier you know that you rarely find kegs with hops level of 130 or greater in good batches of beer. In fact you know that is actually very rare (this only happened one or two times during the course of your 1000 experiments), so you reject the entire batch.
- Am I incorrect in saying you have just used a p-value (with implicit significance level)?
- As you continue doing this, you realize that usually only 19 out of 20 distributors return bad batches, none of them any more likely to do so than the others. So sometimes even bad batches can still make you back the money you spent on brewing them. This leads you to think you can maximize profits by limiting your bad batches to only 5% of your total output. From your earlier experiments you know that only 5% of your good batches included kegs with hops levels outside the range of 80-120. So you tell your workers to throw out any batch when the measurement is <80 or >120.
- Am I incorrect in saying you have just set a significance level?
- So the y-axis reads "probability", because taking a sample with hops level = 100 is the result that maximizes the probability that your batch has the same distribution as a good batch. As the results deviate further from this number, it becomes less and less probable that the batch you are sampling from is one with a good-batch distribution. If the beer is from a good batch, the "most likely result" is that the hops level of your sample is near 100. It is much "less likely" that you will get a value of 130. Now, if you start sampling 3 kegs rather than one, discover that some bad batches have larger variances, want to find out which batch some kegs you found in the back most likely came from, etc., you use t-values. In this case I think the best chart would be one showing two partially overlapping distributions. Repapetilto (talk) 01:34, 17 May 2012 (UTC)
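For concreteness, here is a minimal sketch of the keg example in Python (using scipy), assuming the within-batch hops levels really are Normal(100, 10) as described above; the observed value 130 and the 5% cutoff are the ones used in the example, nothing here comes from the article itself.
<syntaxhighlight lang="python">
# Sketch of the keg example: good batches have kegs whose hops levels
# are roughly Normal(mean=100, sd=10). We test one keg and observe 130.
from scipy.stats import norm

mu, sigma = 100, 10      # within-batch distribution (from the 1000 experiments)
observed = 130           # hops level of the single keg tested

# Two-sided p-value: probability of a keg at least this far from 100,
# in either direction, if the batch really is a good one.
p_value = 2 * norm.sf(abs(observed - mu) / sigma)
print(p_value)           # ~0.0027, well below a 5% cutoff

# The implicit 5% decision rule ("reject the batch") corresponds to
# rejecting whenever the keg falls outside roughly mu +/- 2*sigma:
lower, upper = norm.ppf(0.025, mu, sigma), norm.ppf(0.975, mu, sigma)
print(lower, upper)      # ~80.4 and ~119.6
</syntaxhighlight>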
- In general a p-value is a probability (area under the probability density curve) associated with the sampling distribution of a test statistic. In general the test statistic is rarely ever an observed value of sample size 1. Although one can technically construct a test based on any criterion, I think that it is safe to say that most readers who feel it necessary to consult Wikipedia to understand p-values, are trying to understand the more common application as in a classical test. Examples are one and two-sample t-tests, F tests, one and two-sample tests for proportions, goodness-of-fit tests, etc. The p-value is then a probability associated with the sampling distribution of a sample mean, an F-statistic, a sample proportion, etc. None of these are based on a single observation from a sample size of 1. Would you infer from a single toss of a coin whether the coin is fair?
- It seems that the confusion here is that p-value is being regarded as simply tail probability. The figure might be okay for understanding how a tail probability of an individual observation is computed (if the y-axis is labeled "probability density"). In general, though, the height of a probability density curve is not a probability, so in this respect the diagram is incorrect for tail probability also.
- Furthermore, the caption of the figure does not indicate anything about sample size 1 or that the example presented is such an extreme special case. The figure is quite misleading as it confuses the sampling distribution of the test statistic (for example, a sample mean or a sample proportion) with the distribution of a randomly drawn member of the population. Please contribute a revised, corrected figure. The idea of the figure is good, but as it is, very misleading and in fact, contributes misinformation rather than clarification. An individual who studies this figure to try to understand p-value will come away with the wrong information: they would be thinking that p-value is a tail probability for an individual observation from the sample, rather than the test statistic. They would also be thinking that the height of the curve can be interpreted as probability, which it is not. Wikipedia is here to inform all readers, not just one person's favorite example. Mathstat (talk) 12:26, 17 May 2012 (UTC)
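Perhaps a small numerical illustration of Mathstat's distinction, assuming a one-sample z-test with known sigma; the heights 66 and 67.2, sigma of 3, and n = 30 are purely illustrative placeholders, not taken from any source.
<syntaxhighlight lang="python">
# The p-value comes from the sampling distribution of the test statistic
# (here the sample mean), not from the distribution of a single observation.
from math import sqrt
from scipy.stats import norm

mu0, sigma, n = 66, 3, 30     # hypothetical population mean/sd and sample size
xbar = 67.2                   # hypothetical observed sample mean

# Test statistic: z = (xbar - mu0) / (sigma / sqrt(n)); null distribution N(0, 1).
z = (xbar - mu0) / (sigma / sqrt(n))
p_value_mean = 2 * norm.sf(abs(z))                     # ~0.028: the sample mean is "extreme"

# A single individual 1.2 units above the mean is not extreme at all:
p_value_single = 2 * norm.sf(abs(xbar - mu0) / sigma)  # ~0.69
print(p_value_mean, p_value_single)
</syntaxhighlight>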
- So what do you have in mind exactly? I am thinking of showing two normally distributed samples with the same sample size and variance but different means, then the two sample t-test equation and an inset showing the t distribution for that degrees of freedom with the t statistic placed as a point on the x axis similar to what is shown in the current image. This seems like it introduces too many topics/nuances at once and may be confusing though. I don't see a way around including these nuances if we wish to be accurate with the figure. As I've said above, p-values are deceptively complex to interpret under many conditions. I don't know why people claim they are easy to understand when misunderstanding is so widespread. Although this misunderstanding is probably due, in part, to the efforts of people to simplify the concept. — Preceding unsigned comment added by Repapetilto (talk • contribs) 23:59, 17 May 2012 (UTC)
- That is more complicated than you need. The figure is nice but it seems to confuse the data distribution with the distribution of the test statistic. I moved the figure so that it is close to the coin flipping example. However, it is still not quite right (for binomial it should be a histogram). It should not be too hard to revise with better labels.
- There is a website (Beth Chance at Cal Poly) that has several properly labeled plots in a homework solution. [1]. Notice that the test statistics with continuous distributions are labeled "density" on the vertical axis. Farther down the page is an example where the statistic is discrete (binomial, like the coin-flipping example), and its graph is a probability histogram rather than a curve. Notice the labeling on the horizontal axis, also. (The horizontal axis for this article can be simply "observed value of test statistic" or "observed value".) Perhaps change "likely observation" to "likely outcome".
- Hope this helps. Mathstat (talk) 00:50, 18 May 2012 (UTC)
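For the discrete case mentioned above (the coin-flipping example), here is a minimal sketch of how a two-sided binomial p-value is a sum of point probabilities rather than an area under a curve; the choice of 14 heads in 20 tosses is only illustrative.
<syntaxhighlight lang="python">
# Two-sided p-value for a coin-flip test with a discrete test statistic:
# a sum of binomial point probabilities, not an area under a density curve.
from scipy.stats import binom

n, k, p0 = 20, 14, 0.5                      # 20 tosses, 14 heads, fair-coin null

# Sum the probabilities of all outcomes at least as improbable as the observed
# one under the null (one common convention for two-sided discrete p-values).
pmf_obs = binom.pmf(k, n, p0)
p_value = sum(binom.pmf(i, n, p0) for i in range(n + 1)
              if binom.pmf(i, n, p0) <= pmf_obs * (1 + 1e-9))
print(p_value)                              # ~0.115
</syntaxhighlight>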
- The more I consider it the more I think you are right that it is misleading. However, I don't want to introduce the concepts of probability density, the various possible distributions, etc without explaining them somehow in the figure. Also the figure should ideally be general and not refer to a specific example. I will think on it. Perhaps a figure that includes both the prob density curve and an inset showing the probabilities would work. I wouldn't call myself a statistician by any means, so a submission of your own would be welcome.Repapetilto (talk) 05:22, 20 May 2012 (UTC)
I have to say this figure is really misleading for people new to this concept; please do something about it. Maybe add some further instruction. Lotustian (talk) 01:24, 14 March 2013 (UTC)
P-value a probability or not?
The following two statements may sound contradictory to a novice:
- "In statistical significance testing the p-value is the probability of obtaining a test statistic at least as extreme [..]"
- "The p-value is not the probability that the null hypothesis is true, nor is it the probability that the alternative hypothesis is false – it is not connected to either of these. In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses."
This should be fixed to avoid confusing readers. landroni (talk) 20:49, 25 August 2013 (UTC)
- I don't think there's a contradiction here. A p-value is a probability; these statements are just trying to clarify that it's one kind of probability and not another.
- --Kairotic (talk) 22:28, 22 September 2013 (UTC)
Deliberately Creating Confusion
There is a long, and valid, section about common misinterpretations but no section that actually gives an interpretation, even in the section labeled interpretation. — Preceding unsigned comment added by 167.206.48.220 (talk) 04:21, 16 February 2013 (UTC)
- I agree here. The article needs to be more explicit about what the p-value is (and not only on what it is not). landroni (talk) 16:15, 25 August 2013 (UTC)
- Amen to this! I've tried to fix some things, but this article is bogged down with too much abstraction and too many passives. I think it also needs to hold off on the cautions and pedantic nitpicking (i.e., the criticisms and "here's what it's not..."s) until after the concepts are adequately described.
- --Kairotic (talk) 22:37, 22 September 2013 (UTC)
Why p=5%?
I found this link which links it to Fisher: http://www.jerrydallal.com/LHSP/p05.htm
If someone would get around to checking the sources, I think the cited quotes are suited for this article. Tal Galili (talk) 18:42, 16 October 2013 (UTC)
Reader feedback: Perhaps giving a few formula...
Zheng1212 posted this comment on 5 July 2013 (view all feedback).
Perhaps giving a few formulas and explaining the intuition behind them would be helpful.
The intuition behind p-value is better presented in the article statistical significance.
Manoguru (talk) 10:11, 4 November 2013 (UTC)
Figure misleading
I fear the figure is misleading. The labels of the tails suggest a two-sided problem, while the area under the density function clearly corresponds to a one-sided problem. — Preceding unsigned comment added by 94.220.3.133 (talk) 16:56, 1 December 2013 (UTC)
What am I missing?
The third point in the Misunderstanding p-value section mentions that "The p-value is not the probability of falsely rejecting the null hypothesis." In symbols, the Type I error rate can be written as Pr(Reject H | H). If we have a rule of rejecting the null hypothesis when p <= alpha, then this becomes Pr(Reject H | H) = Pr(p <= alpha | H). However, since the p-value is uniformly distributed over [0,1] under a simple null hypothesis, this would mean that the Type I error rate is indeed given by Pr(Reject H | H) = Pr(p <= alpha | H) = alpha. So it is indeed correct to interpret alpha, the cut-off p-value, as the Type I error rate. What exactly am I missing here? (Manoguru (talk) 16:40, 3 December 2013 (UTC))
- After reading the earlier talk pages, I just realized that this point has been extensively discussed. As far as I am concerned, the issue is resolved. Who in their right mind would ever confuse a p-value with an alpha level? (Manoguru (talk) 17:13, 3 December 2013 (UTC))
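For anyone following along, a small simulation of the point argued above, that Pr(p <= alpha | H) = alpha when the p-value is uniform under a simple null. The choice of a two-sided z-test, n = 25 and 100,000 replications is only an illustrative assumption.
<syntaxhighlight lang="python">
# Under a simple null hypothesis the p-value is Uniform[0,1], so rejecting
# whenever p <= alpha gives a Type I error rate of alpha.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, n, reps = 0.05, 25, 100_000

# Generate data from the null (mean 0, sd 1) and run a two-sided z-test each time.
data = rng.normal(0.0, 1.0, size=(reps, n))
z = data.mean(axis=1) / (1 / np.sqrt(n))
p = 2 * norm.sf(np.abs(z))

print((p <= alpha).mean())   # close to 0.05, as argued above
</syntaxhighlight>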
- Not sure how to read earlier talk pages - could you post a link? More importantly, although your understanding, and the article's understanding in the sixth paragraph ("The p-value should not be confused with the Type I error rate [false positive rate] α in the Neyman–Pearson approach."), is that the p-value and alpha are different things, the article in paragraphs 4 and 5 under the section "Definition" reads as if the p-value and alpha are the exact same thing, and completely interchangeable. This needs to be re-written, and the relationship, or lack thereof, between the two needs to be made part of the main article, not just some old talk page.
(182.185.144.212 (talk) 18:59, 4 April 2014 (UTC))
- You can click on the Archive no. 1 in the box above. Here is the link if you need it: https://en.wikipedia.org/wiki/Talk:P-value/Archive_1 It would have been better if you had pinpointed the specific part that you found confusing. I am not sure how you got the notion that the two concepts are the same from reading the text. I have made some modifications, which I hope you will find useful. The p-value tends to change with every repetition of a test that deals with the same null hypothesis. However, alpha is always held fixed by the investigator for every repetition and does not change. The value of alpha is determined based on the consensus of the research community that the investigator is working in and is not derived from the actual observational data; thus the setting of alpha is ad hoc. This arbitrary setting of alpha has often been criticized by many detractors of the p-value. Indeed, alpha needs to be fixed a priori, before any data manipulation can even take place, lest the investigator adjust the alpha level a posteriori based on the calculated p-value to push his/her own agenda. For more about alphas, you should look into the article statistical significance. (Manoguru (talk) 11:32, 16 April 2014 (UTC))
There are some minor issues here. Today there are different notions of p-value. Strictly speaking, a p-value is any statistic assuming values between 0 and 1 (extremes included). The definition given here is the standard p-value. This is indeed a statistic (a random variable which can be calculated once the value of the sample is given). It is not a probability at all; it is a random variable, and in fact the standard p-value has a uniform distribution on [0,1] under the (simple) null hypothesis whenever the sample is absolutely continuous. A desirable property of a p-value (as is usually, but not always, the case for the standard one) is that it be concentrated near zero under the alternative hypothesis and not concentrated near zero under the null hypothesis. BrennoBarbosa (talk) 16:22, 17 June 2014 (UTC)
- That would be a nice addition. But since I have not heard of this idea before, I invite you to make the necessary amendments. (Manoguru (talk) 08:54, 6 July 2014 (UTC))
normal distribution related
Is it fair to assume that the p value is based on the fact that the outcome of any normal random variable will tend to fall in the 95% space of its distribution?
- Not really, since the outcome of any random variable, whether it is normal or not, will tend to fall in 95% of its distribution. (Manoguru (talk) 12:54, 18 February 2014 (UTC))
You are a bit off in that statement: the 5% space is "defined" as the one with low probability in the case of a normal distribution function. If the RV's p.d.f. were a rectangle, what would the p-value be? The context is specific to the tail ends of the pdf, or the low-probability outcomes of the random variable. That "chunk of 5%" can lie anywhere, but it is assumed to be the cumulative area in the region with low probability, from my understanding.
- You are contradicting yourself in your last statement. You see, when you say 5% or 95% of a distribution, you are specifying the probability of some event. You have not prescribed what the event is, only the probability associated with that event. So you are correct to say that the 5% region can lie anywhere, be it at the tail end or at the most likely outcome of a normal distribution. But you cannot just say "a region with low probability", since that region, as you said, can come from anywhere. Hence the contradiction in that statement. It is important to specify the event to be the tail event when talking about p-values. Also, the p-value need not be restricted to normal random variables. For instance, for a uniform distribution defined over an interval [a,b], for a left-tailed event {X <= x}, the p-value is simply the area under the rectangle from a to x, Pr(X <= x | U[a,b]) = (x-a)/(b-a), even though the uniform distribution does not have a 'tail'. (Manoguru (talk) 10:20, 5 July 2014 (UTC))
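A two-line check of the uniform-distribution example above, with arbitrary illustrative values of a, b and x (not taken from the discussion).
<syntaxhighlight lang="python">
# Left-tailed p-value Pr(X <= x) for X ~ Uniform[a, b]: simply (x - a) / (b - a).
from scipy.stats import uniform

a, b, x = 2.0, 10.0, 3.5                   # arbitrary illustrative values
print((x - a) / (b - a))                   # 0.1875
print(uniform.cdf(x, loc=a, scale=b - a))  # same value, via the cdf
</syntaxhighlight>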
P-values need not be associated with normal RVs alone, but they are relevant only for low-probability zones.
- Technically, for a normal distribution the interval (mu-epsilon, mu+epsilon), where mu is the mean and the epsilon is a very small number, still counts as your "low probability zone", even though the most likely outcome of the normal is the mean. By "low probability zone", do you mean places with low pdf values? (Manoguru (talk) 09:09, 6 July 2014 (UTC))
Low-probability zones would amount to low pdf. I am pretty sure of my knowledge of statistics and probability; I am an engineer and an economist, by the way, and have used random-variable models for several years. I am simply trying to point out that the article takes a 5% p-value cutoff simply because that is the low-probability zone of a normal distribution, which may not be the case for all random variables - not all RVs are normal. To answer your quote above, the 5% cutoff for p implies the tail 5% zone of the normal distribution function, whereas the article should perhaps say it should be the lowest 5% zone of outcomes in the pdf of the RV. Hope that clarifies. 223.227.28.241 (talk) 11:26, 6 July 2014 (UTC)
- Hi, thanks for the clarification. I think we are both on the same page. However I don't think there is any ambiguity in the article, since it is clearly mentioned in the definition section that the cutoff value is independent of the statistical hypothesis under consideration. (Manoguru (talk) 17:00, 6 July 2014 (UTC))
Yes, but it should be the lowest 5% zone of the cdf, else it is normal-specific. — Preceding unsigned comment added by 223.227.98.64 (talk) 06:18, 19 July 2014 (UTC)
- I am not quite sure what you are talking about anymore. It feels like we are not discussing p-values, but rather what cutoff value to take. It is clearly mentioned in the definition section that the value of the cutoff is entirely up to the researcher to decide, and does not depend on what type of distribution is assumed, normal or non-normal. If I may paraphrase you just so I understand you right, by "it should be the lowest 5% zone of the cdf else it is normal specific" do you mean to say that had the cutoff been any other percentage, say 1% or 10%, then that cutoff is related only to a normal distribution? But that's certainly not true. Perhaps it would be helpful if you could point out the particular passage in the article that you find confusing. (Manoguru (talk) 11:15, 21 July 2014 (UTC))
The word extreme needs to be clarified in "In statistical significance testing, the p-value is the probability of obtaining a test statistic result at least as extreme as the one that was actually observed, assuming that the null hypothesis is true"
---Extreme, to me, sounds like a fixation that the curve is somewhat like the normal distribution function, and that values away from the mean (in either direction) are "extreme" or have low probability. Either way, I understand what I use p-values for, but it simply is not properly written. — Preceding unsigned comment added by 122.178.239.189 (talk) 08:51, 26 July 2014 (UTC)
Most wiki articles are normal-distribution specific. Those who learn probability the random-variable way realize that p-values, t statistics and z values are normal-specific. — Preceding unsigned comment added by 223.227.74.34 (talk) 13:52, 29 July 2014 (UTC)
- Thanks for additional clarification and pointing out the issue with the starting sentence of the article. The wording is quite standard, and I don't quite understand how it conjures a normal distribution when nothing of that sort has even been mentioned. I feel that the issue you have raised has more to do with personal point of view. However, we can try to add another sentence that better explains what the extreme is supposed to mean. (Manoguru (talk) 19:00, 15 August 2014 (UTC))
Wolfram has a definition - http://mathworld.wolfram.com/P-Value.html - "The probability that a variate would assume a value greater than or equal to the observed value strictly by chance: P(z>=z_(observed))". P-values pin on the variate, not the probability. In the wiki definition, by using the word "extreme" it points to the value of the probability, whereas p-values are defined for the value of the variate.
Irrespective of the nature of the curve, it simply represents the number of outcomes greater than this one (the area under the region of the pdf beyond this value of the variate z) - and it in no way represents less probable or more probable events. The Wolfram definition is perfectly valid, as ∫P(z) dz should always be 1. (I added the dz in the edit; the math symbols were off.) In essence, the cdf adds to unity... Literally, the upper limit of the cdf integral defines the p-value - right? (The integral from -infinity to x defines the cdf at x, and x is chosen so that 95% of the curve is included in the cdf?)
What I was trying to ask above is: is it the 5% chunk (the low-probability chunk) or the P(z>=z_observed) that the article was addressing? Because the p-value is applicable to the variate. Of course the application is to check how well the sample fits the normal distribution curve, but that is not "p-value"; that is the p-value in the context of a normal distribution. The Wolfram definition is better. "Extreme" here sounds like the result is extreme (a low-probability outcome), which may not be the case. 122.178.245.153 (talk) 18:42, 6 September 2014 (UTC)
- Thanks! Yes, the Wolfram definition is correct, and it is the same as the one given in the Wiki. The word "extreme" as used in the opening sentence is not used to mean low probability, but rather the values of a variate equal to or greater than that observed. However, this is only one of three possible interpretations of what extreme means. The meaning of this word is made precise in the definition section. The use of the word "extreme" in the context of p-values is quite standard and is in use in many textbooks. I don't think it is wise to burden the opening paragraph with too many technical nuances such as this. The opening should serve only as a simple introduction for a casual reader. Anyone wishing to know more needs to read the article in detail. (Manoguru (talk) 16:31, 11 September 2014 (UTC))
Then perhaps re-read the original post: "the 5% chunk" is specific to the normal only. That is how this debate started. And your answer "any RV will fall in 95% of its distribution" confused me. The 95% space is defined for the variate (the space of the distribution is defined based on the variable, not the probability), simply because the p-value works on the variate and not on the probability. Trust that solves the confusion? I guess we agree on that? So, in essence, the p-value need not cover a low-probability zone at all?
Now to the part I was really trying to point out by my 1st post: My original question:
When you use a linear regression function in *Excel*, it follows the 5% default for p-values per coefficient based on what curve? Likewise, Excel regression also reports t stats for the coefficients. These are based on the regression y=mx+c only and are not concerned with anything else. The p-value as reported by Excel there is based on the P(z>=Z) notion (95% of m's total distribution area or cdf < the p-value for m reported by Excel) and is actually not relevant, because it tries to pin "m" down as a normally distributed variable.
What is the p-value for, other than the 95%/5% cutoff that assumes the value of "m" in y=mx+c follows a normal distribution?
To the casual observer, the Wolfram definition is more accurate as per me, and that casual note on wiki is where Excel seems confusing. Most professors are also biased toward the normal curve, and in fact in most quant courses do tell one to ensure that the p-value as seen via Excel is below 0.05. Why? Just so that we can safely assume the coefficients are normally distributed? How does that impact the regression? (m is the coefficient in the y=mx+c example.) It should not make any underlying assumption on the distribution of m/coef. (I edited the integral bit above; the symbols confused me. Hope the fact that the p-value ties to the "cdf" of a RV, F(X < x_observed), is clear? We choose the upper limit of the cdf integral so that we have 95% of the area of the cdf covered??)
You also have to note that "significant" is actually "not by chance" in the probability-test world, because a significant result assumes bias, i.e. not chance.
Your second point "For instance, for a uniform distribution defined over an interval [a,b], for a left tailed event {X<=x}, the p-value is simply the area under the rectangle from 'a' to 'x', Pr(X<=x|U[a,b]) = (x-a)/(b-a), even though uniform distribution does not have a 'tail'. " is equally confusing.
The p-value will either be the area from a to infinity or from -infinity to a. P-values can perhaps be stated mathematically as follows.
For a one-tailed test (the limits of integration are shown in brackets):
p = 1 - ∫(-infinity to a) P(z) dz
That is, 1 - (cdf up to a).
And for a two-tailed test (for a distribution symmetric about 0):
p = 1 - ∫(-a to a) P(z) dz
Sorry, I don't know why wiki gives such a huge signature; it overwrites the edit. --Dr. Alok
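If it helps make the integrals and the regression question above concrete, here is a short sketch assuming a continuous test statistic with cdf F: the one-tailed p-value is 1 - F(z_observed) and, for a distribution symmetric about 0, the two-tailed p-value is 2*(1 - F(|z_observed|)). The regression numbers (coefficient, standard error, n) are made up purely to show how a coefficient p-value comes from a t statistic; nothing here is a claim about Excel's internals.
<syntaxhighlight lang="python">
# One- and two-tailed p-values written directly in terms of the cdf F of the
# test statistic's null distribution, as in the integrals above.
from scipy.stats import norm, t

z_obs = 1.8
p_one_tailed = 1 - norm.cdf(z_obs)              # integral from z_obs to +infinity of f(z) dz
p_two_tailed = 2 * (1 - norm.cdf(abs(z_obs)))   # symmetric case: both tails

# A regression coefficient p-value works the same way: the statistic is
# t = m_hat / se(m_hat), and the p-value is a two-tailed area under a
# t distribution with (n - 2) degrees of freedom (numbers are made up).
m_hat, se_m, n = 0.42, 0.15, 30
t_stat = m_hat / se_m
p_coef = 2 * t.sf(abs(t_stat), df=n - 2)
print(p_one_tailed, p_two_tailed, p_coef)
</syntaxhighlight>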
actual observations versus test statistic
The article states: "Usually, instead of the actual observations, X is instead a test statistic." It would be fair to explain why (which I can't, because I do not know). Thanks Cuvwb (talk) 17:18, 7 February 2015 (UTC)
formatting problem
This is the opening of #Definition and interpretation:
- The p-value is defined as the probability, under the assumption of hypothesis H, of obtaining a result equal to or more extreme than what was actually observed. Depending on how we look at it, the "more extreme than what was actually observed" can either mean {X ≥ x} (right tail event) or {X ≤ x} (left tail event) or the "smaller" of P(X ≤ x) and P(X ≥ x) (double tailed event).
The problem is that the H in the first line is cut off at the bottom, missing a row or two of pixels and its serifs, both in the article and as reproduced here. I have 'MathML' set for math rendering in my preferences, the recommended setting for modern browsers, which mine is (up-to-date Mac OS and Safari).--JohnBlackburnewordsdeeds 21:07, 12 March 2015 (UTC)
Edited the lead of the article
I just shortened the lead of the article, as well as clarified how p-value fits with hypothesis testing.
I removed the following sentences, and am leaving them here if someone thinks they can be incorporated in other parts of the article (I think they don't, but feel free to try):
- An informal interpretation of a p-value, based on a significance level of about 10%, might be:
- * p ≤ 0.01 : very strong presumption against null hypothesis
- * 0.01 < p ≤ 0.05 : strong presumption against null hypothesis
- * 0.05 < p ≤ 0.1 : low presumption against null hypothesis
- * p > 0.1 : no presumption against the null hypothesis
- The p-value is a key concept in the approach of Ronald Fisher, where he uses it to measure the weight of the data against a specified hypothesis, and as a guideline to ignore data that does not reach a specified significance level.[1] Fisher's approach does not involve any alternative hypothesis, which is instead a feature of the Neyman–Pearson approach.
- The p-value should not be confused with the significance level α in the Neyman–Pearson approach or the Type I error rate [false positive rate].
- Fundamentally, the p-value does not in itself support reasoning about the probabilities of hypotheses, nor choosing between different hypotheses – it is simply a measure of how likely the data (or a more "extreme" version of it) were to have occurred, assuming the null hypothesis is true.
References
Tal Galili (talk) 09:25, 3 January 2015 (UTC)
The problem here is two-fold: (i) you've moved the correct definition *out* of the lead, to the next paragraph, and instead (ii) left in an unintuitive definition that essentially relies on a Neyman-Pearson framework. If someone is operating from a more Fisherian paradigm, the lead paragraph is next to useless to them, whereas the "as or more extreme" part, while somewhat clumsily worded in the article, would be more intuitive and correct for both.