Talk:Correlation/Archive 2
This is an archive of past discussions about Correlation. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Archive 1 | Archive 2
A further comment on pseudocode
Wikipedia is not a how-to, and in my opinion the pseudocode does not deserve space in such an important article. Nobody is going to be reading this article to pick up coding hints. Code examples are almost nonexistent in Wikipedia articles on math or statistics. EdJohnston (talk) 19:11, 23 February 2009 (UTC)
I think pretty much everyone agrees the code should be elided, but only once someone has created a wiki page on the numerical analysis of calculating correlation.
Please note that the rant above (apparently by SciberDoc) is incorrect. The algorithm works with high precision. To address the (completely valid) referential concerns of Darkroll and Qwfp I hunted down and cited a reference text that addresses the issue.
I also note that an accessible pseudocode representation of the algorithm is very important (I have now seen versions of this code appear in open-source implementations in several places), and that (with no offense intended to EdJohnston) one can hardly call pseudocode representations of numerical algorithms "coding hints" unless one is prepared to apply the same moniker to e.g. Volume 2 of Knuth's Art of Computer Programming oeuvre. Brianboonstra (talk) 22:00, 23 February 2009 (UTC)
- Brian, I perceive that you feel a need for an article on the numerical analysis for computing correlation coefficients. Why don't you go ahead and write such an article? So far as I can tell, the pseudocode is original with you and is not the work of a well-known author who we would consider a reliable source. We do have an article on Donald Knuth, but it contains no code examples. EdJohnston (talk) 23:17, 23 February 2009 (UTC)
- You perceive well. I have ambitions of doing so, indeed. But it will take no small effort to make myself qualified to do it, as I work in industry and lack the resources of a research library. W.r.t. Knuth, I was trying (clumsily) to make a point about pseudocode, not to try to put myself on a par with that master!76.237.206.0 (talk) 04:15, 24 February 2009 (UTC)
I believe that having the pseudocode in this article does not violate the how-to guideline. In the case of an article on a mathematical function, one of the purposes of the article is to indicate how to compute the function; such instruction does not violate Wikipedia not being a guide or manual - otherwise many mathematics articles could not be written. In the current version of the article, not counting the disputed pseudocode, there are 3 formulas for the sample correlation (not counting also taking the square root of the several formulas for the coefficient of determination). Each of these formulas amounts to pseudocode written in the common language of mathematics[1]. One can (and I do) use each of these formulas for calculation in different circumstances. If one has the mean-normalized data, one uses that formula. If one has the means, one uses the formula with precomputed means.
Today, almost all computation of sample correlation coefficients outside of the classroom is done by computers working with limited precision arithmetic. Due to the unsuitability of the three given algorithms for the case of numerical computation with limited precision, a fourth algorithm is given in the article. One could write this algorithm in the common language of mathematics. However, since those who need numerically stable algorithms are almost always programming computers, that language would be less useful. So pseudocode is the appropriate language in which to write the algorithm.
Mercifully, most mathematical algorithms work without much trouble translated directly from the common language of mathematics. Thus there is no need in most articles for a separate indication of how to compute particular functions in a numerically stable way. However, the (Pearson) correlation coefficient is a notable exception. Thus, its article could also be a reasonable exception to the general rule that mathematical articles do not contain pseudocode.
To summarize my thoughts: since the other 3 algorithms do not make Wikipedia a how-to (and in fact are essential in its fulfilling its encyclopedic function), adding a 4th algorithm also does not make it a how-to. Further, the sample correlation coefficient is normally computed by computers, but (unlike in most articles) the first 3 algorithms are not suitable for computers. So an exception to the usual practice of having no pseudocode in math articles is reasonable in this case. Thus, there is no intrinsic problem with including the pseudocode in the article and a reasonable case for including it.
Having finished, understand that I agree that a separate article on numerical computation is a good idea and would allow all involved to be satisfied. BrotherE (talk) 10:38, 9 March 2009 (UTC)
- For those who might not believe that algorithms given in math and pseudocode are equivalent, I have attempted to translate the numerically stable pseudocode in the article into a mathematical expression. I have not checked this very thoroughly, so don't put it in the article without a lot of consideration.
- BrotherE (talk) 11:29, 9 March 2009 (UTC)
- Since this algorithm is something you put together yourself, I believe it counts as WP:Original research under our rules. Unless you can cite it to a reliable source, it does not belong in a Wikipedia article. Wikipedia does not publish original thought nor original algorithms. See WP:5P for more about our policies. EdJohnston (talk) 18:38, 9 March 2009 (UTC)
- I don't understand the point:
- algorithms for computing encyclopaedic formulas are obviously encyclopaedic themselves (if they are desirable from some point of view);
- the proper way of representing an algorithm is pseudocode, as anybody can see in any academic handbook;
- the algorithm has a source which says the computation is accurate, who says the contrary? Please, report it.
- We all know that the numerical computation of a correlation often raises a number of problems, so why should we omit precious information like that?
- Nightbit (talk) 06:06, 13 March 2009 (UTC)
- It's simply not important enough to deserve space here. Can you find *any* other math article that contains a pseudocode algorithm? If we supply the mathematical expression for the answer, we know it can be implemented in software. Including a reference to an algorithm would be more reasonable. EdJohnston (talk) 13:33, 13 March 2009 (UTC)
- I believe it is important because it's accurate and because it's a one-pass algorithm. Furthermore, this is not exactly a *math* article but a *statistics* article: computational aspects are central in statistics, though they are often neglected by mathematicians. If you read an article on correlation it is often because you want to actually *compute* a correlation on real data. Nightbit (talk) 01:55, 18 March 2009 (UTC)
I removed Admdikramr's second algorithm
for two reasons. (1) It is actually not numerically stable (stability of estimation of each mean does not guarantee stability of the whole estimator), and (2) having even one such formula is clearly controversial enough.Brianboonstra (talk) 22:01, 16 March 2009 (UTC)
- Actually, my first algorithm; the others are not mine and this is my first contribution to the page. Calculating the correlation based on numerically "correct" means does, in theory, produce the right result; the formula is a derivation of the formulas listed on top of the page (I will derive it here if requested). So, I am not quite sure what you mean by "it is not actually numerically stable," but I do not claim you are wrong.
- Also, I do not know what you mean by "having one such formula is controversial enough;" the controversy above is about the inclusion of "questionable pseudocode" because (1) pseudocode is inappropriate for this venue and (2) the pseudocode actually contains an algorithm which is of questionable stability. I, rather, am providing a formula (not pseudocode), and suggesting that the issue of "one-pass correlation" can be pushed off to the issue of "one-pass mean," thus avoiding the stability issues mentioned above. I will put my code back if there are no further comments on this topic within a week. Admdikramr (talk) 06:50, 17 March 2009 (UTC)
- Point taken on the pseudocode. With respect to stability, instabilities arise when one takes differences of products that can be large, especially when subsequently taking quotients. It's always best to check these things first (plus one is not supposed to do anything original for Wikipedia). I've posted a C program to do just that on my talk page (200 lines would clog this page). If you compile and run it you can see that the form you submit doesn't really do any better than the classical one-pass formula, though it does at least outperform the single-precision version of the stable algorithm for the experimental data given. Cheers, Brianboonstra (talk) 13:19, 17 March 2009 (UTC)
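For illustration, here is a minimal Python sketch of that failure mode (the offset, spread and sample size are arbitrary choices made for the example):

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = 1e8 + rng.normal(0.0, 0.01, n)   # huge mean, tiny spread
y = x + rng.normal(0.0, 0.01, n)     # correlated with x; true r is about 0.71 here

# Classical one-pass sums formula: differences of very large, nearly equal products.
sx, sy = x.sum(), y.sum()
sxx, syy, sxy = (x * x).sum(), (y * y).sum(), (x * y).sum()
num = n * sxy - sx * sy
den = np.sqrt(n * sxx - sx * sx) * np.sqrt(n * syy - sy * sy)
print("classical sums formula:", num / den)      # lost to rounding noise (may even be nan)
print("after centring (numpy):", np.corrcoef(x, y)[0, 1])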
- The main reason I posted the formula regarding means is so that we'd be dealing with numbers in the same scale as the originals (and their squares), rather than sums of squares or sums; means are pretty easy to understand. I do not claim that my "algorithm" is superior or more stable! The point of posting it was to move towards a formulaic representation of "how to do this in one pass," so we could move the pseudocode (and questions of stability, which depend on many system-dependent variables that are not relevant to this page) somewhere else without leaving the section blank (as appears to be the default solution once you find somewhere better to put your code).
- One other issue (unrelated) regarding the pseudocode: Your variable names pop_sd_x, pop_sd_y, and cov_x_y seem a bit misleading, as they come from sums rather than averages. The population standard deviation for x, e.g., would be pop_sd_x / sqrt(n). For the correlation computation this doesn't matter, but if an enterprising individual were to attempt to compute a regression beta based on your code, debugging would be a nightmare...also, the analogy to the formula breaks down a bit. Admdikramr (talk) 19:54, 17 March 2009 (UTC)
- As originally posted the algorithm had the factors of N in there. People keep coming in, "noticing" that the factors of N are superfluous to the calc, and "optimizing" them out. I'll fix it again.Brianboonstra (talk) 21:49, 17 March 2009 (UTC)
Admdikramr, your formula is OK, but it doesn't help people implement an accurate algorithm. I believe that formula could be useful above, but not in a section that regards the computational aspects of an accurate algorithm. Nightbit (talk) 02:00, 18 March 2009 (UTC)
There is still a fundamental problem with the entire pseudocode / algorithm issue. Product moment correlation (the type of correlation being calculated here) simply isn't calculated this way. Any decent basic book on statistics will show how it is done, which is by finding the sums of x_i, y_i, x_i^2, y_i^2 and x_i y_i: a single pass through the data is all that is required to collect these sums entirely accurately.
The variance of x is then found from the data:

s_x^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n-1}

where \bar{x} is the mean of x.
The standard deviation of x, s_x, is the square root of this variance.
The variance of y is found the same way:

s_y^2 = \frac{\sum y_i^2 - n\bar{y}^2}{n-1}

where \bar{y} is the mean of y.
The standard deviation of y, s_y, is the square root of this variance.
The covariance of x and y is found from the data:

s_{xy} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{n-1}

Then the correlation is:

r = \frac{s_{xy}}{s_x s_y}

An alternative formula can be used instead:

r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}
where S_xx, S_yy and S_xy are the corrected sum of squares of x, the corrected sum of squares of y and the corrected sum of products xy respectively. These terms are simply the variance of x, the variance of y, and the covariance of xy each multiplied by (n-1); this thus saves several divisions by (n-1). Note the capital S used for the corrected sums, versus the small s for the estimates of the population standard deviations and variances, and make sure you use the correct one in your sums.
There are innumerable suitable references for this: one suitable, accurate and simple reference is R.C. Campbell, Statistics for Biologists, C.U.P..
Any algorithm should implement these correct formulae. It is very easy to do, simply summing the relevant terms x_i, y_i, x_i^2, y_i^2 and x_i y_i into suitably named variables, and then working out the correlation at the end. A single pass through the data is all that is required to collect these sums entirely accurately, and to calculate the correlation correctly according to these formulae.
There are sometimes slightly different formulae used for the variances, but essentially it doesn't matter much - if the sample size is very small then the errors are large in any case, and if the sample is large then differences between the values given by the different formulae are negligible.
SciberDoc (talk) 13:30, 20 March 2009 (UTC)
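For illustration, the single-pass summation described above can be sketched in Python as follows (the variable names are illustrative):

import math

def textbook_corr(xs, ys):
    # Collect the five sums in a single pass, as described above.
    n = len(xs)
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for x, y in zip(xs, ys):
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    # Corrected sums of squares and products (variances/covariance times n-1).
    S_xx = sum_xx - sum_x * sum_x / n
    S_yy = sum_yy - sum_y * sum_y / n
    S_xy = sum_xy - sum_x * sum_y / n
    return S_xy / math.sqrt(S_xx * S_yy)

print(textbook_corr([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0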
- SciberDoc, we all realize that the formulae you give above are the way most books tell one to calculate Pearson correlation, and that they are correct. However, they are not numerically stable in the sense that in real life they actually give the wrong answer in certain pathological cases (as you can see using the code on my talk page). Traditionally, the way to overcome this numerical instability has been to use a two-pass algorithm that first computes means. I believe your previous comment mentions that one.
- The pseudocode here is a different, but perfectly correct, stable one-pass algorithm found in "Elements of Statistical Computing: Numerical Computation" by Thisted (and probably several other numerical analysis sources, as the techniques are reasonably well known).
- I have not removed your notes about its correctness being "in dispute" because I respect your contribution. But you are indeed incorrect in your assessment, and I encourage you either to read Thisted, or to compile and test the code linked above to satisfy yourself.
- If a more neutral party would like to remove the "dispute" comment, I think it for the best. But I won't do it myself.Brianboonstra (talk) 14:47, 23 March 2009 (UTC)
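For readers who want to experiment, here is a minimal Python sketch of a stable one-pass scheme of this general kind, using running means with Welford-style updates; it is an illustration of the idea, not a transcription of the cited pseudocode:

import math

def one_pass_corr(xs, ys):
    # Update running means and centred co-moments in a single pass.
    # (Assumes at least two observations and non-constant x and y.)
    mean_x = mean_y = 0.0
    m2_x = m2_y = co_xy = 0.0
    n = 0
    for x, y in zip(xs, ys):
        n += 1
        dx = x - mean_x
        dy = y - mean_y
        mean_x += dx / n
        mean_y += dy / n
        m2_x += dx * (x - mean_x)    # running sum of squared deviations of x
        m2_y += dy * (y - mean_y)    # running sum of squared deviations of y
        co_xy += dx * (y - mean_y)   # running sum of co-deviations
    return co_xy / math.sqrt(m2_x * m2_y)

print(one_pass_corr([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0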
B+ class?? what the
huh.. why not just nominate it for GA class. -- OlEnglish (Talk) 01:12, 12 May 2009 (UTC)
Correlation as a measure of general dependence
I'm not sure how accepted it is to view correlation as a measure of general (not necessarily linear) dependence, as suggested in the introduction. In my experience, people use words like "association" or "dependence", while "correlation" strictly refers to a linear relationship. Skbkekas (talk) 03:20, 23 May 2009 (UTC)
normality
There are many sources (Wilcox 1998, 2005; Dalgaard 2003) suggesting that the pearson correlation coef (including the associated significance test) is highly dependent on the data coming from a bivariate normal distribution. The standard example for this is "Tukey's" contaminated normal distributions, e.g. when the normal distribution is heavy tailed. It would be best for this issue to have both sides of the story included + relevant refs. --landroni (talk) 13:10, 26 May 2009 (UTC)
- I generally agree with you, although I think it's very important to be specific about what "highly dependent" means, especially in the case of the population coefficient which I would say is not at all highly dependent on normality (at least as far as its existence and general interpretation go). I think it is appropriate for the non-parametric section to mostly focus on issues like the contamination model. I moved some of the material in the non-parametrics section to a new section on "Sensitivity to the data distribution". Skbkekas (talk) 14:08, 26 May 2009 (UTC)
- There is this relevant discussion on the R mailing list. --landroni (talk) 07:51, 29 May 2009 (UTC)
- I agree with Thomas Lumley's perspective on this and think that the current "sensitivity to the data distribution" and "non-parametric correlation coefficients" sections reflect this perspective well. Skbkekas (talk) 01:38, 31 May 2009 (UTC)
- By "highly dependent" I mean that the data not being bivariate normal might affect the estimate of the magnitude of the correlation. Various sources are divergent, so I am not (yet) sure of this. There is one small nice example in R programming language (originally Marona and Yohai (1998)). It shows at least that "that a multivariate outlier need not be an outlier in any of its coordinate variables", in other words that correlation is also affected by multivariate outliers. A more opportunistic interpretation is that deviation from bivariate distribution badly affects the magnitude of the correlation coefficient. An even more opportunistic: if data is not bivariate normal, the pearson correlation is largely good for nothing. Remark that the robust MCD covariance estimation handless well the outlier issue.
> require(rrcov)
> data(maryo)
> maryo
            V1       V2
 [1,] -0.80277  0.01779
 [2,] -0.27227 -1.39980
 [3,] -0.10184 -0.19524
 [4,] -0.52043 -0.72337
 [5,]  0.35972  0.34724
 [6,] -0.12618  0.14892
 [7,] -0.60633 -0.73132
 [8,] -0.37638 -0.62230
 [9,] -1.66646 -1.87687
[10,]  0.39734  0.49718
[11,] -0.19946  0.24907
[12,]  0.22108 -0.22474
[13,] -0.87706 -0.71262
[14,] -1.05453 -0.47379
[15,] -0.59331 -0.30050
[16,]  1.03261  1.42684
[17,] -1.08850  0.21384
[18,] -0.04958  0.36770
[19,]  1.22224  1.38157
[20,] -1.46916 -1.73041
> mshapiro.test(t(maryo))   # Original data is bivariate normal

        Shapiro-Wilk normality test

data:  Z
W = 0.9498, p-value = 0.3635

> ## Modify 10
> ## modify two points (out of 20) by interchanging the
> ## largest and smallest value of the first coordinate
> imin <- which(maryo[,1]==min(maryo[,1]))   # imin = 9
> imax <- which(maryo[,1]==max(maryo[,1]))   # imax = 19
> maryo1 <- maryo
> maryo1[imin,1] <- maryo[imax,1]
> maryo1[imax,1] <- maryo[imin,1]
> maryo1
            V1       V2
 [1,] -0.80277  0.01779
 [2,] -0.27227 -1.39980
 [3,] -0.10184 -0.19524
 [4,] -0.52043 -0.72337
 [5,]  0.35972  0.34724
 [6,] -0.12618  0.14892
 [7,] -0.60633 -0.73132
 [8,] -0.37638 -0.62230
 [9,]  1.22224 -1.87687
[10,]  0.39734  0.49718
[11,] -0.19946  0.24907
[12,]  0.22108 -0.22474
[13,] -0.87706 -0.71262
[14,] -1.05453 -0.47379
[15,] -0.59331 -0.30050
[16,]  1.03261  1.42684
[17,] -1.08850  0.21384
[18,] -0.04958  0.36770
[19,] -1.66646  1.38157
[20,] -1.46916 -1.73041
> mshapiro.test(t(maryo1))   # Modified data is no longer bivariate normal

        Shapiro-Wilk normality test

data:  Z
W = 0.8356, p-value = 0.003087

> sf.test(maryo1[,1])   # Although individually the variables are still normal

        Shapiro-Francia normality test

data:  maryo1[, 1]
W = 0.9819, p-value = 0.9075

> sf.test(maryo1[,2])

        Shapiro-Francia normality test

data:  maryo1[, 2]
W = 0.9643, p-value = 0.5437

> getCorr(CovClassic(maryo1))   ## the sample correlation becomes 0.05
        V1      V2
V1 1.00000 0.05557
V2 0.05557 1.00000
> getCorr(CovMcd(maryo1))   ## the robust (reweighted) MCD correlation is 0.79
       V1     V2
V1 1.0000 0.7917
V2 0.7917 1.0000
- I believe contaminated normals should be discussed in the "sensitivity" section. These distributions exemplify (small) departures from normality and the impact on parametric procedures. Otherwise the current separation looks good to me. Of course, sources are lacking.
- Another issue is the scope of the term "correlation." I'm not sure how well defined the bounds of this term are. I think it should be restricted to measures of the product-moment type (Pearson, Spearman, etc.). I would call the chi-square statistic a "measure of association" rather than a measure of correlation. Skbkekas (talk) 14:58, 26 May 2009 (UTC)
- I don't have an opinion on this, but it would make sense to clearly separate the groups as you suggest. landroni (talk) 19:51, 26 May 2009 (UTC)
- My opinion is that the article is moving too far away from covering what is in the first paragraph: "...in contrast with the usage of the term in colloquial speech, which denotes any relationship, not necessarily linear." This general use of the term does need to be covered and needs more prominence. Too high a proportion is concerned with either linear dependence or with measuring "dependence", rather than having a more general context such as graphical means for examining dependence. Of course there is the good figure relating to this, but little textual discussion of the points to be made. Remember that this is not a stats text book. Melcombe (talk) 09:23, 27 May 2009 (UTC)
- I am in favor of moving a lot of the more technical stuff that is specific to the Pearson correlation coefficient from this article over to the Pearson product moment correlation coefficient article. That would free up space for more general topics, like graphical approaches, which I agree are relevant. However, if it gets too general it will become impossible to distinguish from independence (probability theory) and association (statistics). Skbkekas (talk) 01:48, 31 May 2009 (UTC)
Moved sections to Pearson correlation
There seems to be some consensus that the more technical material that pertains only to the Pearson correlation coefficient should be moved off this page to the Pearson correlation page. I've moved three such sections and may migrate a bit more. Skbkekas (talk) 14:14, 5 June 2009 (UTC)
See the book B. P. Lathi - Modern Digital and Analog Communications Systems - 3rd Ed as a reference book for this text. —Preceding unsigned comment added by 201.58.206.35 (talk) 16:11, 7 September 2009 (UTC)
Correlation coefficient
Correlation coefficient currently directs here. Should it direct to Coefficient of determination (i.e. r-squared) instead? (note: I'm cross-listing this post at Talk:Coefficient of determination.) rʨanaɢ talk/contribs 03:30, 22 September 2009 (UTC)
- No, it shouldn't. This is the main article on correlation, and defines the correlation coefficient. The article on coefficient of determination mentions the correlation coefficient, but does not define it; in fact it rather presupposes a knowledge of the correlation coefficient. What is more this is as it should be, both because correlation coefficient is a much more widely known concept than coefficient of determination, and because it makes more sense to redirect upwards to a more general topic than to redirect sideways to a different concept at the same level. JamesBWatson (talk) 13:05, 27 September 2009 (UTC)
Section merger proposal
I disagree with the proposal to move material from this article to the Pearson correlation article. In fact, this issue has been discussed quite a bit in the past, and the consensus was to use the Pearson correlation article for issues related to linear correlation measures of the product-moment type, while the correlation article could cover topics related to pairwise association measures in general. The section on "sensitivity to the data distribution" applies specifically to the Pearson correlation measure. Some parts of it may be more general, but not most of it. I was the person who originally created this section, in both articles. I later came to feel that the section in the correlation article needed to be merged to Pearson correlation, not the other way around. I just hadn't had a chance to do it yet. The proposed merger takes us in the wrong direction. Skbkekas (talk) 03:54, 2 November 2009 (UTC)
- I don't see that it does any harm to keep both sections, but I certainly agree with Skbkekas that if there is to be a merge it should be from Correlation to Pearson correlation, not the other way around. JamesBWatson (talk) 12:10, 2 November 2009 (UTC)
I have changed the merge templates on both articles to indicate moving material to Pearson correlation, with the discussion pointer still pointing here. I have reverted the move already made of some stuff and I think much more should be moved. If this direction of change is what is wanted, it may be best to rename this article to something like "correlation and dependence" to give a better indication of its scope. Melcombe (talk) 16:58, 2 November 2009 (UTC)
- If Skbkekas is right in saying that previous discussion has resulted in a consensus for keeping both sections then I don't see that a merger is justified. I should also like to put it on record that I agree that it is better to keep both of them. JamesBWatson (talk) 08:32, 4 November 2009 (UTC)
- But the question is how much and what material should be in both. The use of the "Main" tag to point to the Pearson correlation article would mean that what should be here is only a summary plus whatever other stuff is required that is relevant to the main topic of the current article. Do we agree that the topic should be "pairwise association measures in general"? I think that topic does deserve an article of its own and this is the way the article starts I think, and the direction the article was being pushed. But there are a number of problems with the articles taken together that can hopefully be reduced by having an appropriate separation of topics. For example in the case of the product-moment correlation there are three separate concepts: the population value, the "raw" estimate obtained by the usual formula, and other estimates of correlation derived from appropriate non-normal joint distributions. It is hardly made clear which of these is being thought of for the various points being discussed. Melcombe (talk) 10:32, 4 November 2009 (UTC)
- I don't think I said that the consensus of the earlier discussion was to keep both sections. The earlier discussion dealt with how to divide material between the two articles. I like Melcombe's proposal to retitle the correlation article as something along the lines of "correlation and dependence." As far as "sensitivity to the data distribution" goes, I think a section like that belongs in nearly every article about a summary statistic. However, the contents of the section would obviously differ. If the correlation article moves to "correlation and dependence," I'm not sure if there are any general statements that can be made that are applicable in general to correlation and dependence, whereas it is of course possible to say things specifically about Pearson correlation. Skbkekas (talk) 19:57, 4 November 2009 (UTC)
Which one is known as canonical correlation
1. Scatter Diagram. 2. Karl Pearson. 3. Graphic Method. 4. Rank Correlation. —Preceding unsigned comment added by 117.193.144.20 (talk) 11:09, 6 June 2010 (UTC)
- None of them: see Canonical correlation. However, this page is not for questions of this kind: it is for discussing editing of the article. JamesBWatson (talk) 19:34, 6 June 2010 (UTC)
Reference update?
There is a citation given, near the bold term anticorrelation (ref number 5) to Dowdy, S. and Wearden, S. (1983). "Statistics for Research". Wiley. ISBN 0471086029 pp 230. This is the first edition and the latest is the 3rd (Detail and online subscription version)... can anyone say whether this term does (still) appear and so update the reference and page number? Melcombe (talk) 14:17, 21 September 2010 (UTC)
Pearson correlation "mainly sensitive" to linear relationships??
In the second paragraph we have the sentence:
- [...] The Pearson correlation coefficient, [is] mainly sensitive to a linear relationship between two variables.
Shouldn't the word "mainly" be changed to "strictly"?
watson (talk) 01:14, 12 September 2010 (UTC)
- I disagree. To the contrary, I think "mainly" is already too strict. I think it should be "somewhat more". Skbkekas (talk) 05:09, 12 September 2010 (UTC)
- what's an example of two 1D data sets that have high Pearson correlation but lack linear relationship? watson (talk) 01:46, 13 September 2010 (UTC)
- If X is uniformly distributed on (0,1) and Y = log(X) the correlation is around 0.86.Skbkekas (talk) 01:04, 15 September 2010 (UTC)
This example shows the linear relationship between x and log(x), which is present to a large extent. log(x) is not linear towards zero but it is very linear out near 1. Thus the correlation coefficient is reduced by the former but not the latter. I coded up this short program in Python to demonstrate. The visual demonstration of the Pearson correlation is linear regression as shown below (blue line is log(x), red line is the regression of course). Note that the correlation coefficient actually is near .787
The code for this example is as follows:
import numpy as N
import matplotlib.pyplot as plt
from scipy.stats.stats import pearsonr,linregress
x = N.linspace(.00000001,1,1000)
y = N.log(x)
(a,b,r,tt,stderr)=linregress(x,y)
z = a*x+b
print r
plt.plot(x,z,'r')
plt.plot(x,y)
plt.savefig('x_vs_logx.png')
r,p = pearsonr(x,y)
print r
it returns the above plot as well as this output of the Pearson correlation coefficient (calculated in two places independently):
0.787734089775
0.787734089775
watson (talk) 20:49, 15 September 2010 (UTC)
This isn't a big deal, but the numerical calculation above is giving the wrong answer, since the numerical approximation to the definite integral is very sensitive to how the limiting behavior at zero is handled. Doing the calculation analytically, you get -1/4 for E(X*Y) (using integration by parts), and you get -1/2 for EX*EY (using the fact that Y follows a standard exponential distribution). Thus cov(X,Y) = 1/4. The variance of X is 1/12 and the variance of Y is 1. Thus the correlation is sqrt(12)/4 = 0.866.
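Spelled out, assuming X uniform on (0,1) and Y = log X, the calculation is:

\begin{aligned}
E[X] &= \tfrac{1}{2}, \qquad \operatorname{Var}(X) = \tfrac{1}{12},\\
E[Y] &= \int_0^1 \ln x \, dx = -1, \qquad \operatorname{Var}(Y) = 1 \quad (\text{since } -\ln X \sim \operatorname{Exp}(1)),\\
E[XY] &= \int_0^1 x \ln x \, dx = -\tfrac{1}{4},\\
\operatorname{cov}(X,Y) &= -\tfrac{1}{4} - \tfrac{1}{2}\cdot(-1) = \tfrac{1}{4},\\
\rho &= \frac{1/4}{\sqrt{(1/12)\cdot 1}} = \frac{\sqrt{12}}{4} \approx 0.866.
\end{aligned}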
The larger issue about whether the Pearson correlation is "strictly" sensitive to a linear relationship amounts to how you interpret the word "strictly". Many people would incorrectly interpret this as implying that the Pearson correlation is blind to relationships that aren't perfectly linear. I also would argue that the plot above exaggerates the approximate linearity of log(x), based on the very large range of the vertical axis.Skbkekas (talk) 02:12, 16 September 2010 (UTC)
- Thanks for that correction, Skbkekas. I was playing fast and loose with my numerical approx and you're totally right about the limiting behavior at zero, considering log(x) shoots off to -inf. I reran my code with my function representation parameters maxed out, i.e. changing the line
x = N.linspace(.00000001,1,1000)
to
x = N.linspace(1e-150,1,6.5e7)
(the smallest value for the interval start and the largest vector size, respectively, that Python running on my computer can handle)
- and I get the value
0.865246469782
- Note that the correlation actually increases towards your analytic limit, because using a finer discretization of the x-axis amounts to giving less weight in the calculation to the values of log(x) really close to and including x=1e-150.
- As to how correlation handles non-linear relationships, I think we're getting caught up on the word "relationship". Yes, log(x) has an explicit non-linear relationship to x, but that's a different sense of the word "relationship" than what Pearson correlation measures. Pearson correlation measures the degree to which log(x) is "linear-ish", to use some colloquial language. That is, it measures the relationship of log(x) not to x, but to a linear approximation of itself along the specified interval of x. And, as the last paragraph points out, there is only a relatively small sub-interval on which log(x) is not close to linear.
- To demonstrate this, I ran my code again, but now with the above line changed to
x = N.linspace(.2,1,6.5e7)
which returns
0.980302482584
- Your comment about the log(x) axis is fair (I let Python choose it before), and I plotted again with the axis restricted to a minimum of -8. Including the figure here and also a figure showing the regression on the sub-interval [.2,1] mentioned above.
watson (talk) 21:44, 19 September 2010 (UTC)
@Watson The log(x) function on the unit interval (i.e., [0,1]) is not a suitable example for a numerical solution; computed that way, the correlation coefficient is ill-conditioned. Actually, in view of your edit history, you should have come to this conclusion yourself. How can you seriously change the value of this correlation coefficient in the article without understanding that even the new value is not the true value, simply because you cannot find it with your tool? Remember, at first you thought 0.787 was close enough to the true value. Tomeasy T C 06:50, 23 September 2010 (UTC)
pseudocode
I suggest removing the sections with pseudo code. Wikipedia is not the place for computing tips and tricks (there's some rule about Wikipedia not being a 'how to' site). If there are important algorithmic considerations, they should be presented more formally using correct numerical analysis terminology. And surely, there's no need to show the same algorithm in two languages. —G716 <T·C> 11:10, 12 October 2008 (UTC)
- I suggest NOT removing it, it is quite convenient to skip all these "important" formulas and get to the point. However, there seems to be an extra division by N there:
pop_sd_x = sqrt( sum_sq_x / N )
pop_sd_y = sqrt( sum_sq_y / N )
cov_x_y = sum_coproduct / N
correlation = cov_x_y / (pop_sd_x * pop_sd_y)
seems to be the same as
pop_sd_x = sqrt( sum_sq_x )
pop_sd_y = sqrt( sum_sq_y )
cov_x_y = sum_coproduct
correlation = cov_x_y / (pop_sd_x * pop_sd_y)
92.112.247.110 (talk) 17:04, 9 November 2008 (UTC)
- They are the same apart from numerical accuracy questions. But the former might be preferable if there were later other uses for the standard deviations and covariance, such as writing them out. Melcombe (talk) 10:34, 18 December 2008 (UTC)
- I agree with removing the pseudo code, given that this is not a text book. Melcombe (talk) 10:34, 18 December 2008 (UTC)
- Look, the argument is indeed rational, but deleting information without finding a new home for it is evil, whether or not it fits with the guidelines! Could you perhaps add it to the Ada programming wikibook and/or one of the Python wikibooks? --mcld (talk) 14:32, 19 December 2008 (UTC)
- Great idea. I know nothing about wikibooks - could you either move the info yourself or ask leave a note on the wikibooks talk page to get someone there to help. —G716 <T·C> 15:42, 19 December 2008 (UTC)
- I agree with removing the ada source, but the methodology for computing the correlation in one, fast, accurate and stable step is a very important information.
- Nightbit (talk) 03:12, 20 December 2008 (UTC)
I'm the guy who put the pseudocode there in the first place because so many people come here looking for how to compute correlation. I agree with Mcld ... don't just delete it! It is fine for the pseudocode to be moved elsewhere and just have a link to it, but it is important and difficult to find stuff.
A language-specific site is not really an acceptably general home for it -- this is pseudocode, not an implementation how-to. In addition, if there's a nontrivial risk of the link going stale, then I think the code ought to stay here, where it can actually be useful to people.
For reference, researching stable one-pass algorithms and distilling my findings into that pseudocode took me several hours (and I have a PhD in mathematics). I hope and believe it has saved many people many hours of work.
Frankly, G716 has the right idea: someone should properly write up the numerical analysis, providing the appropriate context for the pseudocode snippet. I simply lack the time to do it myself. But removing the pseudocode because the entry lacks the contextual information seems hasty and wrong.Brianboonstra (talk) 18:37, 6 February 2009 (UTC)
- You need to reference the sources of your research though (don't worry about formatting). Else you're asking any users of the code to just take it on trust, or someone else to repeat your research. Qwfp (talk) 20:00, 6 February 2009 (UTC)
- A reference for this code is certainly required. Further description of the code is also needed. For example, why is a one-pass algorithm better than a two-pass? What is the trade-off in accuracy? How much is the speed increased? Will this speed increase really improve someone's application overall (i.e. is the calculation of r likely to be a bottleneck)? Darkroll (talk) 02:55, 10 February 2009 (UTC)
The pseudocode and related text which was in the article (as of 22 Feb 2009) is reproduced below.
Computing correlation accurately in a single pass
The following algorithm (in pseudocode) will calculate Pearson correlation with good numerical stability[citation needed]. Notice this is not an accurate computation, as in each iteration only the updated mean is used (not the exact full-calculation mean), then the delta is squared, so this error is not fixed by the sweep factor.
sum_sq_x = 0
sum_sq_y = 0
sum_coproduct = 0
mean_x = x[1]
mean_y = y[1]
for i in 2 to N:
    sweep = (i - 1.0) / i
    delta_x = x[i] - mean_x
    delta_y = y[i] - mean_y
    sum_sq_x += delta_x * delta_x * sweep
    sum_sq_y += delta_y * delta_y * sweep
    sum_coproduct += delta_x * delta_y * sweep
    mean_x += delta_x / i
    mean_y += delta_y / i
pop_sd_x = sqrt( sum_sq_x )
pop_sd_y = sqrt( sum_sq_y )
cov_x_y = sum_coproduct
correlation = cov_x_y / (pop_sd_x * pop_sd_y)
I have removed this to this discussion page so that it is still available to anyone who wants to see it, but is not included in the article where the general consensus is that it does not belong on this page.
The correct calculation does not require the mean subtracting from each observation while passing through the data: even though this is one way in which it is possible to calculate the correlation, it is not the computationally simplest way to do it, and it would require the means to be found before calculating the deviations from the means (thus requiring a second pass through the data to do it accurately using this approach). Rather the sum of products

\sum x_i y_i

and the sums of squares

\sum x_i^2 and \sum y_i^2

are collected, along with the sums of

\sum x_i and \sum y_i,

then allowance is made for the means by subtraction at the end of the calculation, i.e. once the means are known, using the sample correlation coefficient formula given in the article:

r_{xy} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{(n-1)\,s_x s_y}
I have explained this in this discussion page hoping that it will satisfy those who thought the code was a useful part of the article. The code is still available, but you are strongly recommended not to use it. Instead do the calculation using the above formula. Alternatively you can use the formula

r_{xy} = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum (x_i-\bar{x})^2}\,\sqrt{\sum (y_i-\bar{y})^2}}

but if you do, you need to calculate the means before you can start, so although this formula is easy to understand, it is slightly less easy to use in practical calculations.
Hey, the proposed formula is wrong: where did you get this (n-1) in the divisor? —Preceding unsigned comment added by 178.94.5.109 (talk) 00:45, 27 December 2010 (UTC)
—Preceding unsigned comment added by SciberDoc (talk • contribs) 12:39, 22 February 2009 (UTC)
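For illustration, a minimal Python sketch of the mean-first (two-pass) calculation discussed above (the function name and test data are illustrative):

import math

def two_pass_corr(xs, ys):
    # First pass: compute the means.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Second pass: centred sums of squares and products.
    s_xx = s_yy = s_xy = 0.0
    for x, y in zip(xs, ys):
        dx, dy = x - mean_x, y - mean_y
        s_xx += dx * dx
        s_yy += dy * dy
        s_xy += dx * dy
    return s_xy / math.sqrt(s_xx * s_yy)

print(two_pass_corr([1, 2, 3], [6, 4, 2]))  # -1.0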
I have moved this section to the Pearson correlation page as the algorithm is specific to the Pearson correlation. Skbkekas (talk) 02:40, 5 June 2009 (UTC)
I have a question for the pseudocode.
The previous version:
pop_sd_x = sqrt( sum_sq_x / N )
pop_sd_y = sqrt( sum_sq_y / N )
cov_x_y = sum_coproduct / N
correlation = cov_x_y / (pop_sd_x * pop_sd_y)
The current version
pop_sd_x = sqrt( sum_sq_x )
pop_sd_y = sqrt( sum_sq_y )
cov_x_y = sum_coproduct
correlation = cov_x_y / (pop_sd_x * pop_sd_y)
And someone has said that they are the same, but I miss one N to be the same. Do you know which is the correct one? —Preceding unsigned comment added by 195.75.244.91 (talk) 14:57, 5 October 2009 (UTC)
They are the same. The numerators differ by a factor of N, while the two factors in the denominator each differ by a factor of \sqrt{N}, so that the factors of N cancel. JamesBWatson (talk) 11:09, 6 October 2009 (UTC)
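Written out with the variable names from the pseudocode, the cancellation is:

\frac{\text{sum\_coproduct}/N}{\sqrt{\text{sum\_sq\_x}/N}\,\sqrt{\text{sum\_sq\_y}/N}} = \frac{\text{sum\_coproduct}}{\sqrt{\text{sum\_sq\_x}}\,\sqrt{\text{sum\_sq\_y}}}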
Mistakes in Formulas
Yesterday I changed some major mistakes in the correlation coefficient (look at the edits eliminating the (n-1) in the denominator). I think these kinds of mistakes are unacceptable and inexcusable, and that a warning should be added to the article saying that its reliability or quality is poor, at least until a couple of experts devote some time verifying the quality of the info. 71.191.7.89 (talk) 16:08, 17 January 2011 (UTC)
- It was right before, with the n–1 terms in the divisors. However, these just serve to cancel out the n–1 in the formula for s given at Standard deviation#With sample standard deviation. I'll add another expression to the first display formula for rxy to make this a bit clearer. One quick way to see that the formulas you left had to be wrong is to consider what would happen if you computed the correlation for a sample with two copies of all the observations in the original sample. Clearly this should have no effect on the estimated correlation. --Qwfp (talk) 21:15, 17 January 2011 (UTC)
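Spelled out, with s as defined at Standard deviation#With sample standard deviation, the cancellation is:

r_{xy} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{(n-1)\,s_x s_y}
       = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^n (y_i-\bar{y})^2}},
\qquad s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}.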
Correction of a misunderstanding
The following comment was placed in the article by 82.46.170.196, in the section Pearson's product-moment coefficient.
- It is important to appreciate that the above description applies to the population and not a small sample. It does not take into account the degrees of freedom. A simple test in Excel shows that the covariance of an array divided by the product of the standard deviations does not give the correct value. For example when all x=y, r does not = 1. However, if the product of the z scores of each (x,y) pair are divided by (n-1) rather than n then the correct value is obtained.
Firstly, this comment belongs here, not in the article, so I have moved it. Secondly, I shall try to clear up the misunderstanding. The covariance of a sample is calculated by dividing by n, while dividing by n-1 is used to calculate an unbiased estimate of the covariance of the population. Exactly the same applies to calculating the variance of a sample and an unbiased estimate of a population variance. The standard definition, as given in the article, uses the sample covariance and the sample variances. Alternatively you can use unbiased estimates of population values in both cases: the result is exactly the same. However, the result is not the same if you mix the sample covariance and unbiased estimates of population variances: you have to be consistent. I do not normally use Excel, but to prepare for writing this I have looked at it. The function COVAR calculates the covariance of the numbers given (which may or may not be a sample: that is irrelevant). On the other hand the function VAR does not calculate the variance of the numbers given, but rather an estimate of a variance of a population which the numbers are assumed to be a sample from. In order to calculate the actual variance of the numbers given, you have to use the function VARP. Why VAR and COVAR work inconsistently is something only Microsoft programmers can explain. Unfortunately the Excel help files make things even more confusing: for example they say that VARP "Calculates variance based on the entire population", although the numbers are frequently not a population at all. I think it comes down to Microsoft programmers being programmers with a little knowledge of statistical techniques, rather than statisticians. JamesBWatson (talk) 20:41, 19 November 2009 (UTC)
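A small worked check of the inconsistency, using the illustrative sample {1, 2, 3} paired with itself:

\text{COVAR}(x,x) = \frac{\sum_i (x_i-\bar{x})^2}{n} = \tfrac{2}{3}, \qquad
\text{VARP}(x) = \tfrac{2}{3}, \qquad \text{VAR}(x) = 1,

\frac{\text{COVAR}}{\sqrt{\text{VARP}}\,\sqrt{\text{VARP}}} = \frac{2/3}{2/3} = 1, \qquad
\frac{\text{COVAR}}{\sqrt{\text{VAR}}\,\sqrt{\text{VAR}}} = \frac{2/3}{1} = \tfrac{2}{3}.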
- I think there is a (n-1) missing from the denominator of the second form of the 'sample correlation coefficient'. Shouldn't it be:
- Aaron McDaid (talk - contribs) 21:06, 24 October 2012 (UTC)
The alarm clock and the dawn
Why is the correlation between alarm clocks ringing and dawn such a bad example of correlation without causation? A ringing alarm clock does not cause the dawn, nor does the dawn cause the alarm clock to ring. The dawn (or anticipation of it) causes a person to set the alarm clock, which causes the alarm clock to ring, but this is not the same thing. Anyway, would the correlation of the forces on various objects and the time rate of change of their momenta be a better example? Probably not, it's a correlation resulting from the definition of force, an imposed correlation. Maybe two planets of similar mass rotating around two stars of similar mass - they would have a correlation in their position, but that smells like an imposed correlation too. I can't think of an example of unimposed correlation without some series of causative links. PAR (talk) 05:07, 10 June 2013 (UTC)
Non-linear correlation
(JamesBWatson left this comment on my user page; since I believe it's of general interest, I'm taking the liberty of moving it here.)
I see that you reverted an edit to the article Correlation with the edit summary "Undid revision 320015831 by JamesBWatson (talk) in stats, "correlation" always refers to linear -- not any -- relationship". I have restored the edit, together with references to three textbooks which use the expression "nonlinear correlation". I could have given many more references; for example, here are just a few papers with the expression in their titles:
- A. Mitropolsky, “On the multiple non-linear correlation equations”, Izv. Akad. Nauk SSSR Ser. Mat., 3:4 (1939), 399–406
- Non-linear canonical correlation analysis with a simulated annealing solution, Sheng G. Shi, Winson Taam (Journal of Applied Statistics, Volume 19, Issue 1 1992 , pages 155 - 165)
- Non-Linear Correlation Discovery-Based Technique in Data Mining, Liu Bo (Intelligent Information Technology Application Workshops, 2008. IITAW '08)
- Ravi K. Sheth (UC Berkeley), Bhuvnesh Jain (MPA-garching), The non-linear correlation function and the shapes of virialized halos.
Google Scholar gives 2790 citations for "non-linear correlation" and 3650 for "nonlinear correlation". I assure you, "correlation" usually, but by no means always, refers to linear correlation. JamesBWatson (talk) 15:37, 31 October 2009 (UTC)
- Thanks for the references, JamesBWatson. I guess the generalization of the correlation coefficient for both linear and non-linear associations would require rewriting, e.g., Correlation#Correlation and linearity. Furthermore, we need to define and show how to calculate it. I can see how it could be obtained as r = \frac{s_{xy}}{s_x s_y}, where the variances and the covariance come from a non-simple linear regression (for simple vs. non-simple linear regression, see Regression analysis#Linear regression). Is that what you mean? 128.138.43.211 (talk) 06:34, 1 November 2009 (UTC)
- First, unless "non-linear correlation" is precisely defined, I see no point in just mentioning it in this article. Secondly, if the interpretation above (in terms of variance and covariances) is correct, wouldn't such a non-linearity extend to the PMCC as well? 128.138.43.211 (talk) 05:20, 4 November 2009 (UTC)
Correlation refers only to LINEAR relationships. "Non-linear correlation" (when used correctly) refers to techniques for transforming one or more variables in a non-linear relation so that it can be made linear. E.g., suppose there is an exponential relationship between two variables -- clearly not linear. But by transforming one variable (by using logs), the relationship can be expressed linearly, and a linear correlation coefficient can be calculated. (We've all seen linear-log or log-log scale graphs.) Even the references above support this definition. E.g., if you read the Shi & Taam paper -- they define their approach as finding "... a and b which maximize the LINEAR relationship between v and h(u)". The Spearman Correlation Coefficient mentioned in the article is another example, transforming one or more variables to ranks so they can be analyzed linearly. In any case, it doesn't matter if some papers or books use this term incorrectly. JamesBWatson should not be doing original research by looking at titles, but by looking at AUTHORITATIVE definitions in mathematical texts (in this case, authoritative statistics texts). 139.228.61.103 (talk) 22:38, 20 August 2015 (UTC)
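For illustration, a minimal Python sketch of the log-transform idea described above (the constants are arbitrary):

import numpy as np

x = np.linspace(1.0, 10.0, 200)
y = 3.0 * np.exp(0.5 * x)                # exponential, clearly non-linear in x

r_raw = np.corrcoef(x, y)[0, 1]          # noticeably less than 1
r_log = np.corrcoef(x, np.log(y))[0, 1]  # log(y) = log(3) + 0.5*x is exactly linear, so ~1
print(r_raw, r_log)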
Unclear phrase
"Correlation refers to any of a broad class of statistical relationships involving dependence." I see references to this sentence on the web and it is too vague.— Preceding unsigned comment added by Toncho11~enwiki (talk • contribs) 15:03, 4 November 2015 (UTC)
Correlation and linearity
When discussing the scatterplots of Anscombe's quartet, the comments about normality are debatable, especially:
The first one (top left) seems to be distributed normally
Based only on the 11 points, it's hard to say anything, as the data could come from a normal distribution, as well as a Student, Laplace, Cauchy, or many other distributions. Even the uniform distribution seems hard to rule out.
The second one (top right) is not distributed normally
There is actually no more evidence for the normality of the data in the first graph than in the second graph. The difference is that the variance seems very low in the second graph. That does not mean that the data is not normally distributed.
Also it is not clear in the section what is/isn't normally distributed. I'm guessing that the contributor was referring to the variable Y, as in both top graphs, the X variable seems more uniformly distributed than normally distributed.
Overall, I think that we should improve this section. I'm willing to do so, for example by removing all the comments about normality. Has someone a better idea? --Nicooo (talk) 02:57, 22 February 2015 (UTC)
- From our article Anscombe's quartet, it appears that the comments about normality may appear in Anscombe's paper. Unfortunately I cannot access the paper. Can you? Maybe that paper makes it clearer what is intended. Loraof (talk) 19:34, 27 February 2016 (UTC)
Dependence does not demonstrate a causal relationship
The page says "dependence is not sufficient to demonstrate the presence of such a causal relationship". Is it? I thought that correlation doesn't imply causality, but that dependence does. If we consider all forms of causality and all forms of dependence, shouldn't dependence at least imply some sort of causality? 7804j (talk) 00:35, 7 June 2016 (UTC)
- I understand your point—"dependence" sounds like it means "causal dependence". But it does not mean that in the present context—paragraph 3 of the lead says "Formally, dependence refers to any situation in which random variables do not satisfy a mathematical condition of probabilistic independence." This is a feature of joint probability distributions, which are descriptions that do not address the issue of causation. Loraof (talk) 15:24, 11 June 2016 (UTC)
- So do you agree that the relationship between "dependence" and "causality" is not clearly explained in this article? Maybe this should be clarified? Because the page refers a lot to the fact that correlation is not an indication of causality, but never clearly addresses the case of dependence. And I have seen multiple people on other websites such as Quora having some very strong conflicting opinions about whether dependence should always be considered as implying causality or not.7804j (talk) 22:46, 11 June 2016 (UTC)
- I think the sentence I quoted above is quite clear. Also, the first sentence of the article is also clear: "In statistics, dependence is any statistical relationship between two random variables or two sets of data." But I'll add the phrase "whether causal or not" to this sentence.
- Context is important. If someone on another website is using dependence as in everyday informal English, that can imply causality. But this article states from the very first sentence that the context is the formalities of probability and statistics. Loraof (talk) 00:56, 12 June 2016 (UTC)
- I think the "whether causal or not" is a great addition. Thanks for the clarification! (and for the mergers with association. I had tagged it a few days ago).7804j (talk) 23:24, 12 June 2016 (UTC)
Direct and Inverse correlations not "positive" and "negative"
Positive and negative are almost never to be used in statistics. These words have arithmetic connotations and are poorly suited for use in statistics especially as descriptive nouns. Direct and Inverse correlation should be used instead of positive and negative in every manner. 64.134.69.103 (talk) 22:30, 13 January 2014 (UTC)
- I see positively correlated and negatively correlated all the time. And yes, they have arithmetic connotations: e.g., positively correlated means that the numerical correlation is a positive number. Loraof (talk) 01:27, 23 December 2016 (UTC)
Measure of association / correlation for nominal / categorical / clustering data?
Could someone add a section on measures of correlation for nominal data? This is discussed to some degree at Contingency table#Measures_of_association, but it gets a little circular when Association_(statistics) redirects back here.
I've run across references to these, but don't know enough to characterize them.
- Mutual information
- Cramér's V
- Variation of information
- The phi coefficient
- Tschuprow's T
- The uncertainty coefficient
- The Lambda coefficient
- Matthews correlation coefficient
- Rand index
- F1 score
And how should they be cross-referenced? Via "See also"? Should there be a Wikipedia category for them? Is "Category:Summary statistics for contingency tables" the right one? Should that category have a link to a summary page (here or the one at Contingency table?) ★NealMcB★ (talk) 02:44, 30 April 2017 (UTC)
- ^ Actually they are more than pseudocode: each formula is written in the language LaTeX (which, in turn, is a set of macros for the language TeX created, ironically, by the Donald Knuth mentioned above), and thus it is conceivable to write a parser that would be able to calculate directly from the LaTeX - so one could consider that the formulas are not just pseudocode but are written in a full programming language