Talk:Statistical significance/Archive 1
This is an archive of past discussions about Statistical significance. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Some major changes, mostly in tone
All things considered, I applaud the writers of "statistical significance." With the exception of the p value choice error (an a-priori decision), this describes and documents important issues very well! I do programmatic evaluation and am forever battling "lies, damn lies, and statistics." About to write another reminder to staff about the meaning of "statistical significance?!" of annual change which shows up from our instruments, I ran across this, and will just refer them to Wikipedia for now. A little statistics is a dangerous thing. Thanks, folks.
I removed the final paragraph only after I had written what is now the second paragraph, and noted it covered the same ground. I tried to convey the same information somewhat less formally and put it in a more prominent position. I did this because I think this is of great importance to lay persons who are constantly confused by the concept of significance. This is a serious social issue, since pharmaceutical companies are willfully misleading the public by the use of this term. There are drugs out there, Aricept is one, that have a trivial effect and no long-term effect on curtailing Alzheimer’s, yet were approved and sold because of their “significant” effect, the degree of which is not even described.
With all due respect for those whose work I am being presumptuous enough to modify, it is those without the benefit of a good college statistics course who need this article. I do not believe I “dumbed it down” but rather attempted to make it more accessible. I, of course, left untouched the technical description which is excellent.
I also included the paragraph on spurious significance of multiple groups, which is another way that the public can be confused. I will follow up with a reference to the recent women's study, or someone else can do it, if they choose
I would welcome any comments Arodb 01:11, 26 February 2006 (UTC)
- Arodb, I've moved your edits. I think your points should be made, but I felt that the article in its previous form was a masterpiece of succinctness, so I created a section for them in order to restore the clarity of the original. BrendanH 21:04, 20 March 2006 (UTC)
First Sentence Confusing
"In statistics, a result is significant if it is unlikely to have occurred by chance, given that in reality, the independent variable (the test condition being examined) has no effect, or, formally stated, that a presumed null hypothesis is true."
I understand and agree with everything up to the second comma. After the comma it appears to say that "the independent variable has no effect in reality" which of course depends on the situation... could someone reword it? --Username132 (talk) 03:58, 16 April 2006 (UTC)
Popular levels of significance
I changed this, in the opening paragraph, from 10%, 5% and 1% to 5%, 1% and 0.1%. In any of the sciences where I've seen significance level used, as far as I remember, 5% is the maximum usually ever considered "statistically significant". If some people do sometimes use 10% somewhere, my editing is still not incorrect, since it's just listing some examples of commonly used levels. Coppertwig 19:53, 6 November 2006 (UTC)
Small cleanup
The article seems messy right now. The first paragraph in particular was horrible. I've altered some parts for clarity and to try and make it more concise. Let me know what you think (particularly about the opening paragraph - I'm thinking more should be added to that). --Davril2020 06:21, 31 October 2006 (UTC)
- I tried to make the opening paragraph more readable and more accessible to the ordinary person. It can still be further improved. I also added a paragraph to the "pitfalls" section (the last paragraph), describing one more pitfall. --Coppertwig 23:46, 6 November 2006 (UTC)
Comments moved from article
131.130.93.136 put the following at the top of the article: significance...THIS ARTICLE IS HORRENDOUS.
- The following article seems to have an error, as statistical significance is defined the other way round from how it is used here.
- The cited significance level of 5% actually is known as alpha error or error of the first kind or Type I error, whereas the significance level is 95%. Thus, the comparison of two significance levels of 99% and 95% obviously results in the facts stated below.
- The statistical power is defined as 1-beta, beta being the Type II error or error of second kind.
- The original article below:
I find this anonymous user's comments to be without merit. Since it's anonymous, I don't think any further comment is needed. Michael Hardy 01:20, 20 Nov 2004 (UTC)
To my mind, there is a confusion in the article between Type I error (α) OF A TEST, which has to be decided a priori, and significance level (p value) OF A RESULT, which is a posteriori. I don't regard this confusion as very serious, but some people do. --Henri de Solages 21:57, 10 December 2005 (UTC)
- I'd like to see this corrected. Smoe
- Raymond Hubbard paper in External Links goes into great detail about confusion between p and α. Have not got my head round it yet, but article in present form appears to treat them as being identical. 172.173.27.197 13:27, 28 March 2007 (UTC)
Added link from Level to Quantile. Maybe this should move to "See also". Smoe 21:22, 15 January 2006 (UTC)
Armstrong
I oppose this edit. [1] It looks like fringe to me. --Coppertwig 02:14, 29 August 2007 (UTC)
Does anyone know if the points raised in the Armstrong paragraph represent more than a single researcher's concerns? I think that this type of material might have a role if it represents an emerging, but broad-based set of concerns. However, if it is just one guy who doesn't like significance tests, I would recommend it be removed or at least toned down. Right now it seems a little prominent. Neltana 13:51, 24 October 2007 (UTC)
Actually, I think it is fair to call his work part of an emerging but broad-based set of concerns, and if anything, I don't think these concerns receive ENOUGH prominence in the article. See work by McCloskey and Ziliak, as well as books like "What if there were no significance tests?", "The significance test controversy", etc. 99.233.173.150 (talk) 01:48, 19 December 2007 (UTC)
Recipes to using statistical significance in practice
I read this article when a reviewer said I need to test the significance of the results in my paper submitted to a conference. But I found no recipe for how to add such a test to my experiments. I think the article needs some practical formulas: when one has n experiments where population X gave the average A and population Y gave the average B, how could one reason about the significance of A > B? Thanks! 189.141.63.166 (talk) 22:01, 25 December 2007 (UTC) [2]
- You're quite right! There is some information about tests at t-test. We need perhaps better links to it; but also, the information there can be improved, and some information on this page here about tests would be good, enough information that people can actually do the tests. I might help with this at some point. --Coppertwig (talk) 00:40, 26 December 2007 (UTC)
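As an illustrative aside (not part of the article; the data and the 0.05 threshold below are made-up placeholders), the kind of recipe being asked for usually amounts to a two-sample t-test along these lines:

```python
# Minimal sketch of a two-sample t-test for comparing the means of two groups.
# X and Y are placeholder data; in practice they would be the raw measurements
# from the two populations being compared.
import numpy as np
from scipy import stats

X = np.array([5.1, 4.9, 5.4, 5.0, 5.2, 4.8, 5.3, 5.1])   # group X, average A
Y = np.array([4.6, 4.7, 4.9, 4.5, 4.8, 4.6, 4.7, 4.4])   # group Y, average B

# Welch's t-test (does not assume equal variances); the default is two-sided.
res = stats.ttest_ind(X, Y, equal_var=False)
t_stat, p_two_sided = res.statistic, res.pvalue

# For the directional question "is A > B?", halve the two-sided p-value
# when the observed difference points in the hypothesized direction.
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

alpha = 0.05  # conventional significance level, chosen a priori
print(f"t = {t_stat:.3f}, one-sided p = {p_one_sided:.4f}")
print("significant at the 5% level" if p_one_sided < alpha else "not significant")
```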
More Confusion
"For example, one may choose a significance level of, say, 5%, and calculate a critical value of a statistic (such as the mean) so that the probability of it exceeding that value, given the truth of the null hypothesis, would be 5%. If the actual, calculated statistic value exceeds the critical value, then it is significant "at the 5% level"."
What is the word "it" in reference to? --Username132 (talk) 04:14, 16 April 2006 (UTC)
It seems to me that the first sentence contradicts point #2 in the "frequent misunderstandings" section of Wikipedia's description of a "p value." —Preceding unsigned comment added by 68.10.153.140 (talk) 20:43, 25 May 2008 (UTC)
"More" significant
As someone wrote at the very top of this Talk page, the idea that a smaller p-value implies a "more" significant result is fallacious. I removed the statement saying that ("The smaller the p-value, the more significant the result is said to be.") from the lead. Another editor reverted my edit stating "No, modern recommendation is to state the p-value, not start with an alpha value." That statement is true but it has no relation whatsoever to the fallacious idea being discussed. What is significant in one context may not be significant in another context (using different alphas) but we can never state that a result is "more" significant as significance is a binary state. --ElKevbo (talk) 12:37, 9 July 2009 (UTC)
- I don't know who "we" is, but it is common to say, for example in the context of testing for trends of climate at different places, that the result of a test of trend is more significant at one place than another, and to do things like plotting contour maps of the p-values. And in building a regression model by adding one explanatory variable at a time, it is common to do so by selecting the one that is most significant among tests of adding individual variables to an existing model. As for terminology, accept/reject is binary but the significance level is either fixed in the case of a test pre-specified critical region, or variable (in the case of reporting that a test result is just significant at a 93.2% level for example). Thus in modern usage the "significance of a test result" is often the acceptance probabilty of a test at which the formal result would switch from "reject" to "accept". However I don't now think the "more significant" terminology should be in intro ... I was thinking that what was trying to be said was related to what I indicated originally. Melcombe (talk) 13:59, 10 July 2009 (UTC)
I would suggest that some of the authors of this page read the sections on statistical significance in Rothman and Greenland's Modern Epidemiology. For those who don't have access to that book, there is an article by David Savitz in Reproductive Toxicology (1995) that explains the pitfalls of significance testing. Statistical significance is overused in the health sciences as a way to determine the presence or absence of an association. Given that a p-value is an amalgamation of the standard deviation, sample size, and effect size, it is quite useless as a measure of "chance" in the results. Furthermore, p-values tell you nothing about the potential for results to be biased by poor study design (confounding, selection bias, information bias, etc.). —Preceding unsigned comment added by 24.162.231.137 (talk) 22:03, 11 October 2009 (UTC)
Bad form
I have removed the following warning from the lead section:
- "Clearing up this issue in this article is a work in progress, the reader is asked to proceed with caution."
If there are still problems with the article, fix them. Don't warn readers that the article may be wrong. - dcljr (talk) 16:14, 15 October 2009 (UTC)
Old comments
The assertion "the smaller the p-value the more significant" is fallacious. A finding is either significant or it isn't. The p-value is not capable of measuring "how significant" a difference between two groups is. Before statistical analysis is performed, an a priori p-value which is the cutoff point for statistical significance must be chosen. This value is arbitrary, but 0.05 is conventionally the most commonly used value. If the p-value is greater than this a priori p-value, then the null hypothesis cannot be rejected. If the p-value is less, then the null hypothesis is rejected. It is a simple yes or no question. To put more stock in the p-value than this is to miss the point. Confidence intervals are a nice alternative to the p-value for this reason.
- I do not agree that a p-value is interpreted dichotomously. One will interpret it initially as either significant or non-significant, but there is a scale there. That is to say, a hypothesis tested true at the p<0.05 level (i.e. 5% chance of it being purely by chance) is less significant than one at the p<0.01 (i.e. 1% chance) level. We set an arbitrary limit for actually identifying sig vs non-sig, but the lower the probability of obtaining the observed result by a random sample, the more significant it is. If it were interpreted dichotomously, one could assume 95% probability (p<0.95) to be significant, and a random sample with probability of 95% would be just as significant as one with 0.0001% probability... —Preceding unsigned comment added by 114.76.27.133 (talk) 12:06, 16 October 2009 (UTC)
I'm not an expert, but this article is mixing two independent concepts in statistics. See: http://hops.wharton.upenn.edu/ideas/pdf/Armstrong/StatisticalSignificance.pdf
Should this be merged or linked to the article p-value?
Shouldn't critical value get its own article?
P-values versus Alphas
I appreciate the article as a valiant attempt to explain statistical significance and hypothesis testing. There is one major fallacy: Fisher’s p-values and Neyman-Pearson alpha levels are not equivalent. Obtaining a p-value of 0.05 or less tells you absolutely nothing about the Type-1 error rate. To calculate the p-value, you only need the knowledge of the null hypothesis and the distribution of the test statistic under the null. To calculate the probability of a Type-1 error and choose a suitable alpha, you need to know both the null and alternative hypotheses (and distributions). In practice, correctly determining alpha is not feasible in scientific experiments. Instead, we cite a p-value (if less than 0.05) and erroneously believe that this will limit the overall false positive rate of published scientific works to 5%. It does not. A p-value represents only the level of significance – how unlikely is it to get this result (or more extreme) by chance. Nothing else. See the nice article by Hubbard cited at the bottom of the page. —Preceding unsigned comment added by Franek cimono (talk • contribs) 17:02, 14 March 2008 (UTC)
- It is simply not true that "[t]o calculate the p-value, you only need the knowledge of the null hypothesis and the distribution of the test statistic under the null." A p-value only makes sense if it's calculated with respect to a particular alternative hypothesis. This is why a big deal is made in statistics textbooks about "one-sided" (or "one-tailed") tests versus "two-sided" (or "two-tailed") tests. The correct p-values will differ in the two cases. As for the remark that "[i]n practice, correctly determining alpha is not feasible in scientific experiments", I don't see how that makes any sense. As far as I can see, your objections about p-values are equally true of the significance level and power of a test -- they all rely on assumptions that the distribution of the test statistic is known under certain conditions. - dcljr (talk) 23:00, 15 October 2009 (UTC)
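As a numerical aside (the observed z value below is just an assumed placeholder), the one-sided versus two-sided distinction shows up directly in the p-value calculation:

```python
# The same observed test statistic yields different p-values depending on
# whether the alternative hypothesis is one-sided or two-sided.
from scipy import stats

z = 1.8                                   # assumed observed z statistic
p_one_sided = stats.norm.sf(z)            # P(Z >= 1.8), about 0.036
p_two_sided = 2 * stats.norm.sf(abs(z))   # P(|Z| >= 1.8), about 0.072

print(f"one-sided p = {p_one_sided:.3f}, two-sided p = {p_two_sided:.3f}")
# Here the result would be significant at the 5% level one-sided but not two-sided.
```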
- In some approaches, all the effect of the alternative hypothesis is concentrated in defining the test statistic being used (but this may mean altering what is thought of as the test statistic). For your example of one-sided or two-sided tests, in the first case the test statistic might be a statistic T, while in the second case it would be the absolute value of T. However, I don't think the earlier comment makes sense either ... but it might not relate to the current version of the article. Melcombe (talk) 09:13, 20 October 2009 (UTC)
- Hmm. Okay, I'll grant you that (about the test statistic). I wasn't assuming Franek's comments referred to the current article. It just seems to me that his notions of p-value vs. type I error probability are exactly backward. - dcljr (talk) 20:08, 21 October 2009 (UTC)
Pitfalls
The first paragraph of the Pitfalls section appears to have been copied verbatim from this website: http://d-edreckoning.blogspot.com/2008/01/statistical-significance-in-education.html Krashski35 (talk) 17:36, 29 May 2008 (UTC)
I think there needs to be more discussion about the pitfalls of using p-values to judge the presence or absence of an association. For instance, one could test the mean difference between two groups of 25 people each and find the p-value for the difference is 0.10. Someone might conclude that there is no difference based on the p-value since it was not below the holy 0.05. However, if we conducted the same study again, this time sampling 250 people in each group, and we found the same difference, we would likely obtain a significant p-value because our sample size increased. Would we reach different conclusions based on the results of these 2 studies?
It is worth mentioning that the p-value is a characteristic of the sample size, standard deviation, and effect size. Any one or combination of those 3 things can influence the p-value.
It is worth looking at Rothman and Greenland's Modern Epidemiology to see what some of the pitfalls of significance testing are. Ioannidis has also written about significance testing in the American Journal of Epidemiology. —Preceding unsigned comment added by 24.162.231.137 (talk) 21:34, 5 November 2009 (UTC)
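To put rough numbers on the point above (the difference, standard deviation, and sample sizes are made-up placeholders chosen to mimic the scenario described), a quick sketch:

```python
# Same assumed difference in means (4.8) and spread (SD 10) in both studies;
# only the larger one crosses the conventional 0.05 threshold.
from scipy import stats

diff, sd = 4.8, 10.0                       # made-up effect and standard deviation
for n in (25, 250):                        # people per group, as in the comment above
    se = sd * (2.0 / n) ** 0.5             # standard error of the difference in means
    t = diff / se
    df = 2 * n - 2
    p = 2 * stats.t.sf(abs(t), df)         # two-sided p-value
    print(f"n = {n:>3} per group: t = {t:.2f}, p = {p:.4g}")
# Roughly: n = 25 gives p near 0.10, while n = 250 gives p far below 0.001.
```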
- "We" are in some difficulty because of the different articles besides this one: P-value and statistical hypothesis testing. I think that more general criticisms of testing should be mainly contained in statistical hypothesis testing. Nominally your thoughts about p-values should go in p-value, but it is difficult to see what the relation between the article here and the others really should be. Melcombe (talk) 11:02, 6 November 2009 (UTC)
Real-world application/Controversy.
Hi. Not a statistician. I took AP Probability in high school 20 years ago. I feel the following is relevant but have no clue how to get it in the article appropriately: http://www.washingtonpost.com/politics/supreme-court-sides-with-investors-workers-in-two-business-related-cases/2011/03/22/ABbxzBDB_story.html
- "Matrixx said it had no reason to disclose incidents in which users reported a loss of smell because the number was not statistically significant. It asked the court to set a rule that would protect companies from having to disclose such information.
- ...
- 'Given that medical professionals and regulators act on the basis of evidence of causation that is not statistically significant, it stands to reason that in certain cases reasonable investors would as well,' she wrote."
If someone could find a way to get it in the article I think the world would be better off. Pär Larsson (talk) 19:47, 23 March 2011 (UTC)
Exsqueeze me?
The article currently states:
- "It is worth stressing that p-values do not have any repeat sampling interpretation."
This has got to be wrong. The p-value is the probability, computed assuming that the null hypothesis is true, of getting sample results at least as "extreme" (that is, detrimental to the assumption that the null is true) as what we actually observed. How is this probability computed? By considering the distribution of the test statistic across all possible samples of a given size. The p-value is the (exact or estimated) proportion of samples which would give at least as extreme results. Thus, it is the probability, under repeated sampling of the population, that, etc. That is a "repeated sampling" interpretation. - dcljr (talk)
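As an illustrative aside (the null distribution, sample size, and observed mean below are arbitrary assumptions), the repeated-sampling reading described above can be checked by simulation:

```python
# Monte Carlo check of the repeated-sampling reading of a p-value: the p-value
# approximately equals the proportion of samples drawn under H0 whose test
# statistic is at least as extreme as the one actually observed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu0, sigma, n = 0.0, 1.0, 30           # assumed null distribution and sample size
observed_mean = 0.35                   # hypothetical observed sample mean

# Analytic one-sided p-value for the sample mean under H0.
z = (observed_mean - mu0) / (sigma / np.sqrt(n))
p_analytic = stats.norm.sf(z)

# Proportion of simulated H0 samples with a mean at least as extreme.
sims = rng.normal(mu0, sigma, size=(100_000, n)).mean(axis=1)
p_simulated = np.mean(sims >= observed_mean)

print(f"analytic p = {p_analytic:.4f}, simulated proportion = {p_simulated:.4f}")
# The two numbers should agree closely, which is the repeated-sampling reading.
```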
- P-values have to be interpreted in a sampling framework. A single p-value is the proportion of trials that would observe a result as extreme or more extreme if the null hypothesis were true. If you think about it in the confidence interval (CI) framework, a 95% CI has a 95% chance of containing the true value prior to drawing the sample. After you draw the sample, the computed 95% CI either contains the true value or it doesn't. —Preceding unsigned comment added by 24.162.231.137 (talk) 18:28, 25 October 2009 (UTC)
- I don't see how your comment addresses the issue being raised here. Confidence levels are related to significance levels, not p-values. - dcljr (talk) 20:17, 3 November 2009 (UTC)
- What do you mean by significance levels? P-values or alpha? Be clear on your terminology since these terms are used interchangeably in the literature and they actually mean different things. —Preceding unsigned comment added by 24.162.231.137 (talk) 21:29, 5 November 2009 (UTC)
- "Significance level" = "alpha". I'm using the standard terminology used in mathematical statistics classes/textbooks in the U.S. I still don't buy the original line I quoted above, which is still in the article and has now been tagged by me with a {{citation needed}}. - dcljr (talk) 09:56, 22 April 2011 (UTC)
Also:
- "[A]uthors Deirdre McCloskey and Stephen Ziliak... propose that the scientific community should abandon usage of the test altogether, as it can cause false hypotheses to be accepted and true hypotheses to be rejected."
Uh... isn't that kind of unavoidable unless you have access to the entire population? In fact, doesn't any approach using incomplete information have the possibility of leading to incorrect conclusions? Their views really should be explained a bit more, if we're going to cite them at all. - dcljr (talk) 20:01, 15 October 2009 (UTC)
Refimprove tag/explanation for
Hi folks,
I've added a refimprove tag to this article. I'm not disputing that the information is accurate, but the article contains entire sections that are unsourced.--Grapplequip (talk) 00:02, 26 September 2011 (UTC)
Someone should probably document this
http://medicine.plosjournals.org/perlserv/?request=get-document&doi=10.1371%2Fjournal.pmed.0020124&ct=1 —Preceding unsigned comment added by 84.68.149.153 (talk) 09:45, 11 October 2007 (UTC)
- "Why Most Published Research Findings Are False" Yes. Here and under clinical trials. - Rod57 (talk) 02:22, 11 February 2012 (UTC)
Intro
I'm not happy with the recent change of the introduction. Others? Nijdam (talk) 10:25, 9 June 2012 (UTC)
- No. It's too complex for an introduction intended for a general audience. ElKevbo (talk) 22:18, 9 June 2012 (UTC)
- Dear Nijdam and ElKevbo, thank you for explaining why you reverted. I started off with the aim of stating that significance does not state the probability of a result being a chance finding (it can't be, because we don't know the relative frequency of genuine effects and null effects amongst the hypotheses we are testing). However I had become carried away and am grateful for your politeness. I have had another go, and made a much more minor attempt at making that point. I have also edited the reference to "observational error" to "random error" because it is only the random error component of observational error that significance testing can help with. Finally I have reworded the reference to effect size statistics to avoid the Weasel-word flag. Darrel Francis (talk)
How does one choose
There is an incompleteness in the article on statistical significance. I’ll try to illustrate it with an example. A dealer and a player are playing one hand of a five card game. The cards are dealt face up one at a time, first five cards to the dealer then five cards to the player. At some point during or before or after the game, the player forms two hypotheses. The null hypothesis is that the deck was fairly shuffled, and the alternate is that the deck was stacked. The null hypothesis is to be rejected if the dealer’s hand is a royal flush. Say the dealer’s cards were 10, Q, A, K, and J all of clubs in that order. The probability of the dealer’s hand being a royal flush can have as many as six different values depending on at what stage in the dealing it is calculated. If no cards have been dealt, it is P0 = 0.000002, after the 10 of clubs is dealt it is P1 = 0.000004. Continuing, P2 = 0.000051, P3 = 0.000850, P4 = 0.020833, P5 = 1.0. The statistical significance depends on when the hypothesis is specified. If it’s specified between the third and fourth card dealt it is P3, if before the first card is dealt it’s P0, if after the fifth card it’s P5, etc. The critical event (which causes the rejection of the null hypothesis) has several probabilities and the definition of statistical significance has to tell which probability to choose. Gjsis (talk) 16:05, 11 July 2012 (UTC)
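For anyone wanting to check the arithmetic in the example above, the quoted values can be reproduced with a few lines (a quick sketch, using the same conditioning as the comment: P0 counts a royal flush in any suit, P1 through P4 condition on the club cards already dealt):

```python
# Reproduces the probabilities quoted in the card example above.
from math import factorial, prod

# P0: probability that five dealt cards form a royal flush in any of the 4 suits.
P0 = 4 * factorial(5) / prod(range(48, 53))          # about 0.000002

# Pk (k = 1..4): probability of completing the club royal flush given that the
# first k dealt cards already belong to it.
P = [factorial(5 - k) / prod(range(48, 53 - k)) for k in range(1, 5)]
# P1 ~ 0.000004, P2 ~ 0.000051, P3 ~ 0.000850, P4 ~ 0.020833; P5 = 1 trivially.

print(f"P0 = {P0:.6f}")
for k, p in enumerate(P, start=1):
    print(f"P{k} = {p:.6f}")
```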
- A dealt royal flush should not cause a rejection of the "fairly shuffled" hypothesis as it is perfectly possible, though remote. Binksternet (talk) 17:33, 11 July 2012 (UTC)
- No, I think you're a bit confused. The researcher selects the threshold at which a result is considered statistically significant or not. And more closely related to your pseudo-example, that threshold is selected prior to conducting the study; this helps avoid the possibility of a researcher unnecessarily or unethically moving the goalposts during a study. ElKevbo (talk) 17:47, 11 July 2012 (UTC)
- If there are several stages in an experiment at which a decision might be made, this is dealt with in sequential analysis: see, for example, sequential probability ratio test. This can still involve something identifiable as an overall "significance level". Melcombe (talk) 18:03, 11 July 2012 (UTC)
From the comments, I conclude that my remarks weren’t clear. Here’s another try. The five card game is as described above. The cards dealt to the dealer in order are 10, Q, A, K, and J of clubs. After the third card is dealt, the player specifies two hypotheses. The null hypothesis is that the deck was shuffled fairly and the alternate is that the deck was stacked. The criterion the player uses is to reject the null hypothesis if the dealer gets a royal flush. The dealer got a royal flush and the null was rejected. Then the player wants to know the statistical significance level of the test he used. He reads the WP article on statistical significance and it doesn’t tell him whether it is (a) 0.0002% i.e. P0 or (b) 0.0850% i.e. P3. That the article doesn’t answer this question is an incompleteness in the definition of statistical significance presented. Gjsis (talk) 15:40, 20 July 2012 (UTC)
- Your line of thinking looks like you will never buy a lottery ticket with the number 777777 (from a lottery with numbers 000000 till 999999), because you very much doubt the prize will fall on number 777777. Nijdam (talk) 08:40, 21 July 2012 (UTC)
- I understand Gjsis's reasoning. In the current intro, it says "significance ... is an assessment of whether observations reflect a pattern...", without specifying that we shouldn't just look through a wagonload of noise, settle on one thing that looks interesting, and declare it to be significant. To continue Nijdam's analogy, assuming the lottery results are obtained by random digits appearing one-by-one on television, he is saying that if someone predicts the lottery result to be 777777, and then moments later it turns out to indeed be 777777, he would only be impressed if the prediction had been delivered before any of the digits of the lottery result had been announced, and progressively less impressed with every additional digit of the result that was already available by the time the prediction was made.
- I have added a pair of sentences to the intro to cover his point. There is more in the body of the article, but Gjsis is asking for it to be clear in the intro. Let's see what people think:
- "The calculated statistical significance of a result is in principle only valid if the hypothesis was specified before any data were examined. If, instead, the hypothesis was specified after some of the data were examined, and specifically tuned to match the direction in which the early data appeared to point, the calculation would overestimate statistical significance."
- It is not always unethical to devise a hypothesis after data have been collected. Sometimes the data is shown to you first, and an opinion asked. Or a pattern is observed in some data which wasn't the prespecified primary aim of the study. In this situation one might calculate the significance, but be appropriately wary that it should be taken in the context of the data not shown to you, or how much data was searched through before finding this. I think gjsis wants that to be clearer up front. Darrel Francis (talk) 20:06, 1 October 2012 (UTC)
10 coins
The recently introduced example of 10 coins all landing up heads isn't a very good example, because a result of HTTHHTHHTT also has probability 1/2^10, yet is not a significant outcome supporting the coin to be unfair. Nijdam (talk) 13:53, 15 October 2012 (UTC)
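As a numerical aside illustrating the point: every particular sequence of ten tosses has probability 1/2^10, but significance is assessed on a test statistic such as the number of heads, and on that statistic the two outcomes differ greatly. A small sketch:

```python
# Any specific sequence of 10 fair-coin tosses has probability 1/2**10 (~0.001),
# but significance is judged on a test statistic such as the number of heads.
from math import comb

n = 10
p_any_sequence = 1 / 2 ** n            # same for HHHHHHHHHH and for HTTHHTHHTT

p_ten_heads = comb(n, 10) / 2 ** n     # P(10 heads), about 0.00098
p_five_heads = comb(n, 5) / 2 ** n     # P(exactly 5 heads), about 0.246

print(f"P(specific sequence) = {p_any_sequence:.5f}")
print(f"P(10 heads) = {p_ten_heads:.5f}, P(5 heads) = {p_five_heads:.5f}")
# Ten heads is extreme on the heads-count statistic; HTTHHTHHTT (5 heads) is not.
```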
Dubious claims
"Traditionally, prospective tests have been required. However, there is a well-known generally accepted hypothesis test in which the data preceded the hypotheses. In that study the statistical significance was calculated the same as it would have been had the hypotheses preceded the data."
Dubious. The citation does not contain a significance test. Instead, "The conclusions were based on what is commonly known as the total weight-of-evidence" rather than on any one study or type of study. The citation does not support the claims. 159.83.196.35 (talk) 20:57, 13 April 2013 (UTC)
'Everyday speech'
When used in statistics, the word significant does not mean important or meaningful, as it does in everyday speech: with sufficient data, a statistically significant result may be very small in magnitude.
This sentence strikes me as conveying the opposite of what it intends. Surely in statistics the word significant does mean important and meaningful, and it is the connotations of 'large' or 'major' that are not accurate here, as in for example 'He ate a significant portion of the cake'? Otherwise the bit about significant results being small isn't relevant. An obtained result that does not occur by chance is inherently important to the data it resides in. --78.150.166.254 (talk) 20:29, 29 July 2013 (UTC)
Idol of the market
The probabilities of events in the known past are restricted to zero and one. That is equivalent to saying that the occurrence of such events lacks uncertainty or that such events are unsuitable for gambling. In a retrospective statistical study, wherein the hypotheses are specified after the data are known, the critical event (critical outcome or rejection region) is an event in the known past and so has probability zero or one. What is called the level of statistical significance in retrospective studies is not a probability. It is only a measure of relative goodness of fit of the data to the distributions determined by the null hypothesis and the alternate hypothesis. If the critical event occurred before the hypotheses were specified, then the null hypothesis never had a chance. If it didn’t occur, then the alternate hypothesis never had a chance. Calling that measure of goodness of fit a probability is a misuse of words. The misuse of words is an impediment to understanding long recognized and named an “idol of the market”, words being the coin in the market place of ideas. The actual level of statistical significance in a retrospective study is 0% or 100% i.e. in retrospective statistical studies including retrospective meta-analyses no hypothesis is tested.Gjsis (talk) 22:11, 2 October 2013 (UTC)
Deletion of the section "Does order of procedure affect statistical significance?"
The section "Does order of procedure affect statistical significance?" has been bothering me for a long time, and I just went through the cited references. As previous commentator noted, the site http://www.epa.gov/smokefree/pubs/etsfs.html clearly says that it uses "weights-of-evidence" to combine the data from historic studies. The method is outline in Appendix D of the comprehensive report. Clearly this is not the same as the usual hypothesis testing that much of the article is about and has more to do with Fisher's method of combining probabilities and regression analysis. Any sophisticated application of statistics will often involve more than one statistical tool, as this report on passive smoking clearly does. However, citing this report as a reason to include a separate section in this article is very misleading. I feel it is best simply to mention other approaches. As such I propose the deletion of this passage. (Manoguru (talk) 07:11, 7 November 2013 (UTC))
Lead definition
This article's definition of statistical significance is a mess, to say the least. It defines significance as a concept, a type of test statistic, and two different aspects (described here as sense 1 and sense 2) of hypothesis testing. So which is it?!?! Plus, where are the sources to support any one of these three definitions? The present citations do not support the article's definitions of statistical significance. I took a cursory look at some secondary sources (e.g., [3], [4], and [5]) on statistical significance, and they all use the term according to sense 1 (Fisher) and not sense 2 (incorrectly rejecting a null hypothesis). In fact, I have yet to see a single mainstream and reliable source that describes statistical significance using sense 2. I really think it's time we cleaned up the lead section of this article. Right now, it is just haphazard, unsourced, and potentially original research WP:OR. danielkueh (talk) 16:26, 22 November 2013 (UTC)
- Hi Daniel, I don't really understand where your confusion lies. Yes, it is a concept. Yes, the label of significance is attached to a test statistic if it can reject the null hypothesis. Yes, the p-value quantifies the idea of significance. The confusion comes when you try to interpret what the alpha-level means. For Fisher, it is just a convenient threshold point to make a decision. But Neyman-Pearson also used alpha to denote type I error rate, which later somehow got conflated with the Fisherian alpha. If you look at the 2nd reference that you have given, on page 166, the 2nd paragraph that describes statistical significance interprets the alpha-level as the probability of wrongly rejecting the null hypothesis when it is actually true. The problem here is that the Fisherian alpha-level is given a Neyman-Pearsonian type I error interpretation, resulting in a hybridization of two distinctly different approaches. As is vocally argued in the P Values are not Error Probabilities article, such conflation is not correct. It is wise to make this distinction at the very beginning of the article. (Manoguru (talk) 04:47, 24 November 2013 (UTC))
- Manoguru, if you want to discuss the controversy surrounding the conflation of Fisherian statistics and Neyman-Pearsonian statistics, then fine, we can do that in a section called misconception, controversy or whatever. Or maybe you can even start a new article about this topic. The title of this article is Statistical Significance and that should be the main focus in the lead paragraph. Just like you have an article called Physics, you start out by describing physics, not droning on about some misplaced interpretation of physics. As it stands, the lead section of this article is poorly written and in need of high quality and reliable sources that are published and consistent with the content of the article, see WP:V. Right now, much of it is original research WP:OR. danielkueh (talk) 05:49, 24 November 2013 (UTC)
- Daniel, the controversy surrounding these two statistics has been covered in much detail here and here. I don't see any point in going down that path in this article. However, in mainstream statistical textbooks, as the 2nd ref you linked to testifies, both the Fisherian alpha and the Neyman-Pearson alpha are referred to as statistical significance. We can blame Neyman-Pearson for the poor choice of words; see the 1st line of pg. 8 of P Values are not Error Probabilities, where they refer to their alpha as the level of significance. So it makes sense to tell it outright that the same word is used for two different ideas. Having pointed out that the same word is used for two different ideas, it is also important to draw a distinction, as the mainstream view tends to conflate two different interpretations. As such I don't see much problem with the opening paragraph of the article. However, if you feel that you can improve it, please be my guest. BTW, I am removing that "not in citation" tag. (Manoguru (talk) 06:28, 24 November 2013 (UTC))
- You're missing the point that I was making, which is that I don't much care whether there is a discussion about the controversy; I just think we should move it out of the lead. As for the second reference that I gave, they described the alpha level as a "significance level," but they did not equate it with "statistical significance," which is how this article defines it (sense 2). I am not going to speak to the reference that you cited from the Duke website. It is not published and it doesn't define statistical significance the way this article does. Furthermore, it warns us to guard against conflating statistical significance and error rate, a conflation that the current lead is making! You can add as many references as you like. But unless those references are 1) published in mainstream reliable sources like a peer-reviewed journal and 2) actually make the kind of assertions that this lead paragraph makes, they are essentially useless. I strongly suggest that you review Wikipedia's policies and guidelines such as WP:V, WP:lead, and WP:OR. Finally, I have located the source of this problem and I am going to revert to this previous version. It is not that great but at least it is simple, consistent with mainstream sources, and easy to follow. It is a good place to start. danielkueh (talk) 07:25, 24 November 2013 (UTC)
Restoration of the section "Does order of procedure affect statistical significance?"
The explanation provided by Manoguru for deleting the section on “order of procedure” is relevant to how the authors of the USEPA report make policy decisions, but is not relevant to how they calculated p-values to determine statistical significance, which they defined as p-value < 0.05. The meaning of the term “statistical significance” as it is currently understood is found by seeing how the term is used. The referenced report, Respiratory health effects of passive smoking: Lung cancer and other disorders, EPA/600/6-90/006F (1992), is from a prestigious organization and has numerous examples of the use of the term. Moreover the report is well known, generally accepted as sound, and is readily available. That makes it a useful reference. Also, dropping the requirement that the hypothesis precede the experiment that tests it is an innovation in the scientific method that deserves to be noted. For these reasons I suggest restoring the deleted section. Gjsis (talk) 17:08, 1 January 2014 (UTC)
- Prospective tests are preferred, but human testing with known or suspected carcinogens is unethical. Most of toxicology is based on the testing of laboratory animals.172.250.105.20 (talk) 21:00, 6 January 2014 (UTC)
Convention for statistical significance
Recently, I changed the threshold level from p≤0.05 to p<0.05. The reason is that this IS the convention that is used in many peer-reviewed studies, textbooks, and even statistical software packages. My change was reverted by Nijdam, who claimed to know "another convention." If one does a Google Scholar search, there are 2,890,000 hits for p<0.05. Searching with "p≤0.05" will still produce the results page with hits for p<0.05. You will find the same thing in Google Books as well. This is not an "unimportant issue." Setting a significance threshold allows researchers to make a decision about whether their data/research is worth publishing. And there exists an unwritten consensus among researchers that p<0.05 IS that threshold. Stating that the convention is p≤0.05 IS very misleading. So unless there is a reputable source that states otherwise, I'm reverting Nijdam's reversion per WP:V, which is supported by multiple sources, both primary and secondary. danielkueh (talk) 23:34, 11 October 2013 (UTC)
- I know nothing about conventions (except that they involve a lot of drinking):
- "If the P-value is less than or equal to the specified significance level, then reject the null hypothesis; otherwise, do not reject the null hypothesis." Introductory Statistics, Weiss, Neil A., 5th, p 540, isbn = 9780201598773, 1999
- "If the P-value is as small or smaller than α, we say that the data are statistically significant at level α." Introduction to the Practice of Statistics, Moore & McCabe, p 442, 2003, isbn = 9780716796572 172.250.105.20 (talk) 20:44, 6 January 2014 (UTC)
- The convention of (p≤0.05) seems to originate from Neyman & Pearson's paper of 1933. Please confirm (or deny) my interpretation of critical regions.172.250.105.20 (talk) 20:09, 12 January 2014 (UTC)
The Criticisms section
The criticisms section CAN be vastly expanded. See the texts in the Further reading section.172.250.105.20 (talk) 20:49, 7 January 2014 (UTC)
- Yes, it can. With actual criticisms supported by reliable sources. danielkueh (talk) 21:27, 7 January 2014 (UTC)
- The section, “Does order of procedure affect statistical significance?”, contained actual criticism supported by reliable sources and it was deleted.Gjsis (talk) 15:12, 8 January 2014 (UTC)
- Gjsis, the section you wrote is 1) hardly a criticism of statistical significance in general but 2) a personal criticism that is very much original research (see WP:OR and synthesis WP:synth). Plus, it seems so out of place. There might even be issues of WP:Fringe and WP:undue as well. danielkueh (talk) 15:49, 8 January 2014 (UTC)
- "Hypothesis tests. It is hard to imagine a situation in which a dichotomous accept–reject decision is better than reporting an actual p value or, better still, a confidence interval." Wilkinson, Leland. "Statistical methods in psychology journals: guidelines and explanations." American psychologist 54.8 (1999): 594. The criticized decision is termed statistical significance.172.250.105.20 (talk) 21:15, 17 January 2014 (UTC)
- That section by Cohen speaks to the issue of effect size and why it should be included when reporting statistical significant results. This point is already made in the Criticism section. danielkueh (talk) 22:18, 17 January 2014 (UTC)
The History section
The existing section may slightly overstate Fisher's role. A search on "statistical significance" in Google Scholar will show that the term was not unknown before Fisher's book was published. Three citations from Google:
Wilson, Edwin Bidwell. "The statistical significance of experimental data." Science 58.1493 (1923): 93-100.
Boring, Edwin G. "The number of observations upon which a limen may be based." The American Journal of Psychology 27.3 (1916): 315-319.
Boring, Edwin G. "Mathematical vs. scientific significance." Psychological Bulletin 16.10 (1919): 335. 172.250.105.20 (talk) 20:57, 17 January 2014 (UTC)
- That is true. The term did precede Fisher. But the "modern" (for lack of a better word) practice of tying p-values to statistical significance did originate from him. The question now is how best to write a text on the pre-Fisherian use of the term. It's great that you found these primary sources. But it would be better if we had one or two secondary sources that have done all the work for us. danielkueh (talk) 22:53, 17 January 2014 (UTC)
- Can we please discuss what it is that should be included in the History section? I have no problems expanding the history section but these two newly inserted sentences appear to be original research. Plus, it would be helpful to specify the exact page numbers of those two citations. Again, I have no problems expanding the section. In fact, I enthusiastically support it. But I think we need to craft something that is consistent and faithful to the sources. danielkueh (talk) 22:49, 20 January 2014 (UTC)
- The two cited books on the history of statistics completely support the sentences that you removed. As supplemental evidence I have both primary and secondary sources (none of which need be cited in the text because the books are better histories):
- Arbuthnot, John. 1710. An Argument for Divine Providence, taken from the Constant Regularity observ'd in the Births of Both Sexes. Philosophical Transactions of the Royal Society, 27: 186-90. An early example of a statistical significance test. The null hypothesis was that the number of males and females born were equal. The data showed a consistent excess of males. The odds of that were so low as to cast doubt on the null hypothesis.
- Can you provide a specific quote that uses the word "significance" or "statistical significance?" Excess of males is a criterion, but that doesn't mean it is an example of statistical significance. Can you also provide a secondary source that cites this study as an early pre-Fisherian example of statistical significance? Otherwise, this is an example of original research. danielkueh (talk) 21:55, 24 January 2014 (UTC)
- "[Fisher] was the first to investigate the problem of making sound deductions from small collections of measurements. This was a new topic in statistics. Recall that Pearson had used large data sets." Probability and Statistics; The Science of Uncertainty. John Tabak. ISBN= 9780816049561 page 144. "Although it is no longer used in quite the way that Pearson preferred the [chi-squared] test is still one of the most widely used statistical techniques for testing the reasonableness of a hypothesis." page 141. The first person discussed in a chapter "The Birth of Modern Statistics" was Karl Pearson. The second was Ronald Fisher.
- Yes, Pearson had used large data sets and preferred the chi square, but where is the term "statistical significance"? Another example of original research danielkueh (talk) 21:55, 24 January 2014 (UTC)
- "[Fisher] was the first to investigate the problem of making sound deductions from small collections of measurements. This was a new topic in statistics. Recall that Pearson had used large data sets." Probability and Statistics; The Science of Uncertainty. John Tabak. ISBN= 9780816049561 page 144. "Although it is no longer used in quite the way that Pearson preferred the [chi-squared] test is still one of the most widely used statistical techniques for testing the reasonableness of a hypothesis." page 141. The first person discussed in a chapter "The Birth of Modern Statistics" was Karl Pearson. The second was Ronald Fisher.
- Fisher described Pearson's Chi-squared distribution as "the measure of discrepancy between observation and hypothesis" in the Introductory to the 11th Edition of Fisher's Statistical Methods for Research Workers. Pearson's work was repeatedly cited by Fisher.
- Just because you cite someone repeatedly doesn't mean they came up with the idea that you are proposing. It could just mean that you are building on what they have discovered or formulated. In any event, did Fisher specifically refer to Pearson's work as tests of significance? danielkueh (talk) 21:55, 24 January 2014 (UTC)
- "[F]rom the first edition it has been one of the chief purposes of this book to make better known the effect of [Gosset's] researches...", from the Introductory to the 11th Edition of Fisher's Statistical Methods for Research Workers. Gosset's work was repeatedly cited by Fisher. Gosset's work provided the foundation for the t-test.
- Where is the word "significance?" danielkueh (talk) 21:55, 24 January 2014 (UTC)
- "[F]rom the first edition it has been one of the chief purposes of this book to make better known the effect of [Gosset's] researches...", from the Introductory to the 11th Edition of Fisher's Statistical Methods for Research Workers. Gosset's work was repeatedly cited by Fisher. Gosset's work provided the foundation for the t-test.
- "The history of Student [Gosset] and the history of Fisher are inextricably linked. Fisher not only championed Student’s work, but Student exerted a profound influence on the nature and direction of Fisher’s research for nearly two decades." On Student’s 1908 Article “The Probable Error of a Mean”. S. L. Zabell. Journal of the American Statistical Association March 2008, Vol. 103, No. 481 DOI 10.1198/016214508000000030 — Preceding unsigned comment added by 172.250.105.20 (talk) 20:25, 24 January 2014 (UTC)
- I'm afraid this doesn't tell us anything about the origins of statistical significance. danielkueh (talk) 21:55, 24 January 2014 (UTC)
- "The history of Student [Gosset] and the history of Fisher are inextricably linked. Fisher not only championed Student’s work, but Student exerted a profound influence on the nature and direction of Fisher’s research for nearly two decades." On Student’s 1908 Article “The Probable Error of a Mean”. S. L. Zabell. Journal of the American Statistical Association March 2008, Vol. 103, No. 481 DOI 10.1198/016214508000000030 — Preceding unsigned comment added by 172.250.105.20 (talk) 20:25, 24 January 2014 (UTC)
- I now have enough data to reject my null hypothesis of good faith.172.250.105.20 (talk) 19:16, 25 January 2014 (UTC)
- Sorry to see you feel that way. But you really need to review WP's policy of original research wp:or. Cheers. danielkueh (talk) 21:25, 25 January 2014 (UTC)
- "Wikipedia articles must not contain original research. The phrase "original research" (OR) is used on Wikipedia to refer to material—such as facts, allegations, and ideas—for which no reliable, published sources exist. This includes any analysis or synthesis of published material that serves to advance a position not advanced by the sources. To demonstrate that you are not adding OR, you must be able to cite reliable, published sources that are directly related to the topic of the article, and directly support the material being presented." No problem.
- Historical Origins of Statistical Testing Practices: The Treatment of Fisher Versus Neyman-Pearson Views in Textbooks, Carl J. Huberty, Journal of Experimental Education, 61(4), pages 317-333, 1993. Table 1 of Statistical Testing Applications lists Arbuthnot, LaPlace, K Pearson and Gosset with dates well before Fisher in 1925. Huberty notes that the logic of (significance) testing was present even if the modern terminology was not. Huberty's sources for the table are 4 histories of probability and statistics.
- What are your remaining objections to my proposed edit (in enough detail to address them)? You have objected to two sentences while I see nothing wrong with either. "While antecedents extend centuries into the past, statistical significance is largely a development of the early twentieth century." Given that p-values were computed centuries ago (P-value#History), the first sentence is largely immune from attack. "Major contributors include Karl Pearson, William Sealy Gosset, Ronald Fisher, Jerzy Neyman and Egon Pearson." Which names do you object to and why?172.250.105.20 (talk) 20:01, 27 January 2014 (UTC)
- I have already given a point-by-point reply to each of your statements above. If you read them carefully and think it through, you will see why your interpretations and conclusions of the sources are inconsistent with these two WP policies: WP:OR and WP:SYNTH. If you go over WP:OR carefully, you will see that it clearly states that "any interpretation of primary source material requires a reliable secondary source for that interpretation." And if you look at WP:SYNTH, you will see that it states "Do not combine material from multiple sources to reach or imply a conclusion not explicitly stated by any of the sources." danielkueh (talk) 21:05, 27 January 2014 (UTC)
- Did you look at Huberty? "Statistical testing was applied by the English scholar John Arbuthnot nearly 300 years ago." p 317172.250.105.20 (talk) 19:31, 28 January 2014 (UTC)
- Where is the word "significance"? danielkueh (talk) 20:31, 28 January 2014 (UTC)
- Huberty does not use the adjective. Look at the title of the paper, or better yet, read it.172.250.105.20 (talk) 00:52, 30 January 2014 (UTC)
- If he doesn't use it, then we can't use it as well. This article is specifically about statistical significance. Not hypothesis testing. You should take your own advice and learn the policies. Better yet, read them. danielkueh (talk) 01:06, 30 January 2014 (UTC)
- Ah! A difficult distinction to make and one that usually is not made. Thanks for the clarification. I was eventually planning to ask the reason for two articles.172.250.105.20 (talk) 01:17, 30 January 2014 (UTC)