Talk:Multicollinearity
This article is rated Start-class on Wikipedia's content assessment scale.
Bell Curve
Reviewing The Bell Curve, black author Thomas Sowell wrote:
- Perhaps the most intellectually troubling aspect of The Bell Curve is the authors' uncritical approach to statistical correlations. One of the first things taught in introductory statistics is that correlation is not causation. It is also one of the first things forgotten, and one of the most widely ignored facts in public policy research. The statistical term "multicollinearity," dealing with spurious correlations, appears only once in this massive book. [1]
This quote is related to the current article, as well as to the furor over Murray and Herrnstein's controversial book. --Uncle Ed 18:51, 3 March 2006 (UTC)
Dummy Variable Trap
Should this article mention perfect multicollinearity caused by the 'dummy variable trap'? This is just about the only time an applied researcher will come across perfect multicollinearity, and it might be useful for students and others who wonder why their software cannot estimate their equation by OLS. (See the sketch below.)
Mts202 (talk) 13:35, 20 June 2008 (UTC)
Matt
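A minimal sketch of the dummy-variable trap, assuming numpy is available (the categorical data here are invented purely for illustration): keeping an intercept plus a dummy column for every category makes the dummy columns sum to the intercept, so the design matrix is rank-deficient and OLS has no unique solution.
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical categorical variable with three levels: A, B, C
groups = np.array(["A", "B", "C", "A", "B", "C"])

# One-hot encode *all* three levels AND keep an intercept column.
intercept = np.ones(len(groups))
d_A = (groups == "A").astype(float)
d_B = (groups == "B").astype(float)
d_C = (groups == "C").astype(float)

X = np.column_stack([intercept, d_A, d_B, d_C])

# The dummies sum to the intercept column, so X is rank-deficient:
print(np.linalg.matrix_rank(X))   # 3, not 4
print(np.linalg.det(X.T @ X))     # ~0: X'X is singular, the OLS normal
                                  # equations have no unique solution
# The standard fix is to drop one dummy level (or the intercept).
</syntaxhighlight>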
Remedy of Multicollinearity is Obtaining More Data?
No. First, it's usually not feasible (statistical analysis often begins AFTER the experiment has finished). Second, obtaining more data does NOT always remedy multicollinearity: if some explanatory variable is expressible as a linear combination of other variables in the model, no amount of additional data will ever help.
Obtaining more data MAY reduce the chance of multicollinearity ONLY IF the number of data points is very small to begin with, because the observed values of the explanatory variables MAY BE collinear by chance. However, realistically, if we have few data points, we usually use only very few explanatory variables in the model to begin with, and thus multicollinearity is VERY RARELY encountered in that setting either. So, obtaining more data is NEVER the preferred method to reduce or eliminate multicollinearity in real research settings. Dropping variables is. Robbyjo (talk) 01:33, 29 October 2008 (UTC)
- Here's why obtaining more data (if possible) is a remedy for multicollinearity: From the article variance inflation factor, the variance of the estimate of coefficient j is
- <math>\widehat{\operatorname{var}}(\hat{\beta}_j) = \frac{s^2}{(n-1)\,\widehat{\operatorname{var}}(X_j)} \cdot \frac{1}{1 - R_j^2},</math>
- where <math>R_j^2</math> is the coefficient of determination of a regression of explanator j on the other explanators, and <math>1/(1 - R_j^2)</math> is the variance inflation factor, a measure of multicollinearity. The formula shows that, holding the degree of multicollinearity constant, a larger sample size n always lowers the variance of the coefficient estimate (except of course if <math>R_j^2 = 1</math>, so the multicollinearity is perfect). Therefore I'm removing the old "dubious" tag. Duoduoduo (talk) 17:53, 9 November 2010 (UTC)
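A self-contained numerical illustration of this point (a sketch using numpy; the simulation settings are invented, not taken from the article): with the correlation between two regressors held fixed, the estimated standard error of a coefficient keeps shrinking as n grows, even though the VIF stays roughly constant.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def coef_se_and_vif(n, rho=0.9):
    """Simulate y = x1 + x2 + noise with corr(x1, x2) ~ rho,
    then return the estimated SE of the x1 coefficient and its VIF."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X = rng.multivariate_normal([0, 0], cov, size=n)
    y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)
    Xd = np.column_stack([np.ones(n), X])          # add intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    s2 = resid @ resid / (n - Xd.shape[1])
    var_beta = s2 * np.linalg.inv(Xd.T @ Xd)       # classical OLS covariance
    # VIF of x1: regress x1 on the other explanator and take 1/(1 - R^2)
    x1, x2 = X[:, 0], X[:, 1]
    Z = np.column_stack([np.ones(n), x2])
    g, *_ = np.linalg.lstsq(Z, x1, rcond=None)
    r2 = 1 - np.sum((x1 - Z @ g) ** 2) / np.sum((x1 - x1.mean()) ** 2)
    return np.sqrt(var_beta[1, 1]), 1 / (1 - r2)

for n in (50, 500, 5000):
    se, vif = coef_se_and_vif(n)
    print(f"n={n:5d}  SE(beta_1)={se:.3f}  VIF={vif:.1f}")
# The VIF stays near 1/(1 - 0.9^2) ~ 5.3, but the SE keeps shrinking with n.
</syntaxhighlight>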
Principal Components as a remedy
Shouldn't there be some mention of PCA as an appropriate (at least in some circumstances) remedy for correlation among regressors? —Preceding unsigned comment added by 203.20.253.5 (talk) 06:41, 10 November 2008 (UTC)
--Also, combine two variables into one. For example, combine weight and height into a single variable... —Preceding unsigned comment added by 67.40.8.72 (talk) 20:22, 15 January 2009 (UTC)
- No, because you can't "remedy" something that isn't a problem. The reason principal components regression is useful is that adding more variables always increases variance in predictions--see Akaike information criterion. However, if the variables are strongly correlated, using PCA lets us reduce the dimensionality (and therefore the variance) without losing predictive power.
- In other words, collinearity isn't a "problem" we're "remedying" with PCA. It's a blessing: it means we can get two predictors for the price of (roughly) one. Closed Limelike Curves (talk) 04:39, 22 January 2024 (UTC)
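A small sketch of that idea, assuming scikit-learn is available (the two "height"/"weight" predictors and all numbers are invented for illustration): two strongly correlated predictors carry roughly one dimension of information, so a single principal component keeps most of the predictive signal.
<syntaxhighlight lang="python">
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 1000

# Two strongly correlated predictors, e.g. height and weight
height = rng.normal(170, 10, n)
weight = 0.9 * (height - 170) + rng.normal(0, 3, n) + 70
X = np.column_stack([height, weight])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)
# The first component explains the vast majority of the variance, so a
# regression on that single component keeps almost all the information
# in the two collinear predictors while halving the dimensionality.
X_reduced = PCA(n_components=1).fit_transform(X)
</syntaxhighlight>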
Leave the model as is
I think Gujarati in his Econometrics suggests leaving the model as it is, i.e. stressing that multicollinearity doesn't make a model redundant. I may be wrong on the source, but I have certainly read something on multicollinearity that suggested leaving it, to back up the original author. —Preceding unsigned comment added by 79.77.20.109 (talk) 20:19, 20 July 2009 (UTC)
Yes, indeed Gujarati suggests this, by quoting Blanchard on his "multicollinearity is God's will" comment. Uuchie (talk) 01:32, 15 November 2009 (UTC)
But "God's will" is of absolutely no help if you have money or reputation on the line. Keeping a model with multicollinearity is insanity, because you don't have any stability. This is an irresponsible suggestion. — Preceding unsigned comment added by 68.147.28.17 (talk) 17:37, 5 June 2011 (UTC)
- Perhaps so, but both Gujarati and Blanchard are reputable figures, and their opinion carries weight in the field. Uuchie (talk) 04:03, 23 November 2011 (UTC)
- Neither statistics nor truth care about your money or your reputation.
- You seem to be misunderstanding multicollinearity. Multicollinearity is a property of the data (or more accurately, the data-generating process, i.e. reality). It is not a property of the model. Removing collinear variables doesn't reduce the correlations between variables, it just ignores them. Closed Limelike Curves (talk) 03:13, 30 December 2023 (UTC)
Leaving the model "as is" is also the recommendation in Kennedy's "Guide to Econometrics." Dreze refers to variable reduction as "elevating ignorance to arrogance." If you leave the variables in, standard errors will accurately reflect the level of model stability. It is misleading to remove possible confounding variables... giving the illusion of certainty through small (biased) standard errors.--Dansbecker (talk) 16:14, 11 April 2012 (UTC)
I will be removing the dubious claim and adding Gujarati as a source. In fact, when he suggests "Remedial Measures" for multicollinearity, the name of the section where he quotes Blanchard is actually "Do Nothing". I will be waiting for someone to add Kennedy as a source as well, as I don't have his book. Timeu (talk) 05:24, 2 January 2013 (UTC)
- Hi Timeu, I've replied to the next comment. It may depend on sample size whether it is a good idea to leave the model as is, or to reduce multicollinearity. Utelity (talk) 10:57, 1 October 2023 (UTC)
Can we have a logical argument for this? So far I only see authority arguments. — Preceding unsigned comment added by 187.163.40.190 (talk) 23:54, 10 June 2019 (UTC)
- I agree, authorities can be misunderstood, or might even be wrong. You can easily run experiments that show that multicollinearity actually gives less precise estimates. An example is given in Agresti (2015: Foundations of Linear and Generalized Linear Models), exercise 4.36, simulating from a polynomial model (degree 5) that is almost linear. In the original setting, you are asked to simulate only a small data set, n=25. Here you actually get the best fit if, instead of the true model, you fit a straight line. The fit of the true model only gets better when you increase the sample size. Thus, the statement "leave the model as is" may be right for large data sets, but not in general. Utelity (talk) 10:55, 1 October 2023 (UTC)
- This isn't related to multicollinearity; it's a general property of larger models with more variables. Large models have higher variance, which should be addressed by using some kind of regularization (like informative priors in a hierarchical model; or in frequentist terms, Lasso or ridge regression).
- In fact, if you're using the standard orthogonal polynomial construction, there is exactly 0 correlation (no collinearity) between polynomial predictors! Closed Limelike Curves (talk) 03:46, 30 December 2023 (UTC)
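A quick check of that claim (a sketch using numpy; QR-decomposing the Vandermonde matrix is one common way to build orthogonal polynomial regressors, roughly analogous to R's poly()):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)

# Raw polynomial terms x, x^2, ..., x^5 show strong pairwise correlations
raw = np.column_stack([x ** k for k in range(1, 6)])
print(np.round(np.corrcoef(raw, rowvar=False), 2))

# Orthogonal polynomial construction: QR-decompose the Vandermonde matrix
# (constant column included), then drop the constant column.
V = np.vander(x, 6, increasing=True)   # columns 1, x, ..., x^5
Q, _ = np.linalg.qr(V)
ortho = Q[:, 1:]
print(np.round(np.corrcoef(ortho, rowvar=False), 2))  # ~identity: zero correlation
</syntaxhighlight>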
Level of discourse in the article
Hi, this is a great scholarly article, but not such a great encyclopedia article. As per Wikipedia:Make technical articles understandable, it would be great if editors working here could maybe add a section at the top, explaining this in layman's terms and providing noncontroversial examples to explain multicollinearity's effects and, importantly, why it matters, then add a summary of that section to the lead? Thanks! Jytdog (talk) 12:55, 31 March 2014 (UTC)
- I think the problem is that, for the typical person, it just doesn't matter. The ideal is to just always use elastic-net regression, 100% of the time, which will solve the "problem" for you by automatically dropping useless predictors. If you don't need to regularize, the regularization parameters will end up small after you fit the model.
- The only reason to care is if you write statistical software and care about condition numbers. Closed Limelike Curves (talk) 03:49, 30 December 2023 (UTC)
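For what it's worth, a minimal sketch of that workflow with scikit-learn (the data, variable names, and penalty grid are placeholders, not a recommendation of specific settings):
<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Three predictors, two of them nearly collinear; only x1 matters for y.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # almost a copy of x1
x3 = rng.normal(size=n)
X = StandardScaler().fit_transform(np.column_stack([x1, x2, x3]))
y = 2 * x1 + rng.normal(size=n)

# Cross-validated elastic net picks the penalty strength automatically;
# collinear or useless predictors get shrunk toward (or exactly to) zero.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
print(model.coef_, model.alpha_)
</syntaxhighlight>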
Problem with my edit
I'm new to editing on Wikipedia. I added two things that both come from the same book by Belsley but for some reason this made two references in the list where it should have made one. Can someone fix this? Thanks! PeterLFlomPhD (talk) 14:02, 18 July 2015 (UTC)
- Fixed and replied to request on your talk page. Cheers KylieTastic (talk) 14:20, 18 July 2015 (UTC)
Why have valid citations been removed?
I added a few citation links to pages from newonlinecourses.science.psu.edu since they explain clearly some of the facts/concepts stated in the article, are very specific to multicollinearity, and as such will be very useful to users who refer to this article for educational purposes. The pages are obviously from a reputed academic website, and yet Bender235 has removed those links and mentioned link-spam as the reason. I am fairly new to Wikipedia edits. Please clarify why those links are link spam. I did not intend to spam. Will you accept my edit if I cite the course material page at just one line, instead of referring to it at multiple lines? Thanks! Kwiki user (talk) 16:56, 16 March 2019 (UTC)
Update: I have added back one of the citation links and mentioned it at just one line instead of multiple lines. I hope this won't be treated as link-spam this time, as my intention is to provide a useful and easily accessible reference. Kwiki user (talk) 18:01, 16 March 2019 (UTC)
- Seeing this many external links added to the same website flashed some red lights, that's why I reverted. It is, of course, a legitimate website, but why did you choose to cite it in particular, and why so often? --bender235 (talk) 13:14, 19 March 2019 (UTC)
- Thanks for the response! I am fairly new to Wikipedia edits and wasn't aware of all the guidelines relating to citations. I knew that citations should preferably be from academic, non-commercial, trusted websites and that the cited source should itself be reliable and pertinent to the context in which it is referred to. I wasn't aware of limits on the number of references. I referred to that website in multiple places because the site covered multiple concepts discussed in the Wikipedia article and the article had been flagged as having insufficient citations. I will try to avoid referring to the same website in more than 2 places in an article hereafter, unless no better source exists and the article absolutely needs that citation in more than 2 places. Kwiki user (talk) 18:38, 20 March 2019 (UTC)
Frisch-Waugh-Lovell Theorem
Currently, at the bottom of the Remedies for Multicollinearity section, the following is posted:
"Note that one technique that does not work in offsetting the effects of multicollinearity is orthogonalizing the explanatory variables (linearly transforming them so that the transformed variables are uncorrelated with each other): By the Frisch–Waugh–Lovell theorem, using projection matrices to make the explanatory variables orthogonal to each other will lead to the same results as running the regression with all non-orthogonal explanators included."
This directly contradicts the statement "If the variables are found to be orthogonal, there is no multicollinearity" in the section Detection of Multicollinearity. It would also imply that principal component regression, which is an orthogonalizing linear transformation, is not a way to ameliorate multicollinearity, in contradiction to point 9 in the section Remedies for Multicollinearity.
If variables are orthogonal, then they are linearly independent. If they are linearly independent, they are not collinear. That is the only argument that needs to be made.
Looking at the linked theorem, I think that whoever cross-posted it to this page simply misinterpreted it in the context of this page. Note that multicollinearity is a statement about predictive variables only, and that the remedies discuss orthogonalizing these predictive variables. In the linked theorem, both predictive and response variables are transformed, and the projection is not orthogonalizing x1, x2 but mapping x1, x2 onto the orthogonal complement of the column space of X1 (left nullspace?).
Removing for these reasons.
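For anyone revisiting this later, here is a small numpy illustration of what the linked theorem actually states (the simulated data are purely for illustration): residualizing y and x1 on x2 reproduces the full-regression coefficient on x1; it says nothing about removing the correlation between x1 and x2 in the data.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 300

x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # x1 and x2 are correlated
y = 1.5 * x1 - 2.0 * x2 + rng.normal(size=n)

ones = np.ones(n)

# Full regression: y on [1, x1, x2]
X_full = np.column_stack([ones, x1, x2])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Frisch-Waugh-Lovell: residualize y and x1 on [1, x2], then regress
# the y-residuals on the x1-residuals.
Z = np.column_stack([ones, x2])
def resid(v):
    g, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ g

beta_fwl = np.linalg.lstsq(resid(x1)[:, None], resid(y), rcond=None)[0]

print(beta_full[1], beta_fwl[0])   # identical coefficient on x1
</syntaxhighlight>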
Thank you!
Just wanted to thank the editors of this page for the clear write-up. As a relatively new student of data science, I found it extremely helpful, especially the section on remedies. Invisible Flying Mangoes (talk) 03:01, 12 November 2019 (UTC)
Definition of condition number
The current article text contains:
"This indicates the potential sensitivity of the computed inverse to small changes in the original matrix. The condition number is computed by finding the square root of the maximum eigenvalue divided by the minimum eigenvalue of the design matrix."
Firstly, the text is ambiguous, since it is not clear whether:
- the square root is to be taken of (only) the maximum eigenvalue, after which this square root is divided by the minimum eigenvalue;
- or the square root is to be taken of the ratio of the maximum and the minimum eigenvalue.
Secondly, I wonder from where the square root taking originates; I cannot find any mention of it on https://en.wikipedia.org/wiki/Condition_number. Redav (talk) 21:49, 1 February 2022 (UTC)
- I believe the authors meant the square root of the ratio of eigenvalues, i.e. your *second* option. That being said, we are interested in the eigenvalues of the matrix (X^T X), where X is the design matrix, and not in the eigenvalues of the design matrix itself (as this matrix is not even square in general!). Hugolamarre (talk) 00:34, 25 September 2023 (UTC)
- The square root is referring to the variance inflation factor, which is the square of the condition number--I think that should be mentioned there! Closed Limelike Curves (talk) 04:42, 22 January 2024 (UTC)
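A short numpy check of the second reading (illustrative data only): for the 2-norm, the condition number of X equals the square root of the ratio of the largest to smallest eigenvalue of X^T X, which matches taking the square root of the eigenvalue ratio rather than of the maximum eigenvalue alone.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# A tall (non-square) design matrix with two nearly collinear columns
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
X = np.column_stack([np.ones(100), x1, x2])

eig = np.linalg.eigvalsh(X.T @ X)               # eigenvalues of X'X
kappa_from_eigs = np.sqrt(eig.max() / eig.min())
kappa_direct = np.linalg.cond(X)                # 2-norm condition number of X

print(kappa_from_eigs, kappa_direct)            # the two agree
</syntaxhighlight>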