Portal:Mathematics/Selected picture/24


Credit: User:Avenue based on original by User:Schutz (data by Francis Anscombe)

Anscombe's quartet is a collection of four sets of bivariate data (paired x–y observations) illustrating the importance of graphical displays of data when analyzing relationships among variables. The data sets were specially constructed in 1973 by English statistician Francis Anscombe to have the same (or nearly the same) values for many commonly computed descriptive statistics (values which summarize different aspects of the data) and yet to look very different when their scatter plots are compared. The four x variables share exactly the same mean (or "average value") of 9; the four y variables have approximately the same mean of 7.50 (to two decimal places). Similarly, the data sets share at least approximately the same standard deviations for x and y, and the same correlation between the two variables.

When y is viewed as being dependent on x and a least-squares regression line is fitted to each data set, almost the same slope and y-intercept are found in all cases. This results in almost the same predicted value of y for any given x value, and approximately the same coefficient of determination or R² value (a measure of the fraction of variation in y that can be "explained" by x, or more intuitively "how well y can be predicted" from x). Many other commonly computed statistics are also almost the same for the four data sets, including the standard error of the regression equation and the t statistic and accompanying p-value for testing the significance of the slope. Clear differences between the data sets are apparent, however, when they are graphed using scatter plots.
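The near-identical summary statistics described above can be checked directly. The following is a minimal sketch, assuming the data values as published in Anscombe's 1973 paper (they are not reproduced in this caption) and using only Python's standard library; the names `QUARTET`, `summarize`, and `summaries` are illustrative, not part of any established API.

```python
from statistics import mean, stdev

# Assumed data: values as published by Anscombe (1973).
# Sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, sample standard deviations, and the least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)          # sum of squares of x about its mean
    syy = sum((yi - my) ** 2 for yi in y)          # sum of squares of y about its mean
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "sd_x": stdev(x),
        "sd_y": stdev(y),
        "slope": slope,
        "intercept": my - slope * mx,
        "r_squared": sxy ** 2 / (sxx * syy),       # squared correlation coefficient
    }


summaries = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, s in summaries.items():
    print(
        f"{name:>3}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
        f"sd_x={s['sd_x']:.2f} sd_y={s['sd_y']:.2f} "
        f"slope={s['slope']:.2f} intercept={s['intercept']:.2f} R2={s['r_squared']:.2f}"
    )
```

Under these assumptions, all four rows agree to two decimal places, with a fitted line close to y = 3.00 + 0.50x and R² near 0.67, even though the scatter plots look nothing alike.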
The plots even suggest particular reasons why y cannot be perfectly predicted from x using each regression line:

(1) While the variables are roughly linearly related in the first data set, there is more variability in y than can be accounted for by x, as seen in the vertical spread of the points around the regression line; in this case, one or more additional independent variables may be needed to account for some of this "residual" variation in y.
(2) The second scatter plot shows strong curvature, so a simple linear model is not appropriate for the data; polynomial regression or some other model allowing for nonlinear relationships may be appropriate.
(3) The third data set contains an outlier, which ruins the otherwise perfect linear relationship between the variables; this may indicate that an error was made in collecting or recording the data, or may reveal an aspect of the variation of y that has not been considered.
(4) The fourth data set contains an influential point that almost completely determines the slope of the regression line; the reliability of the line would be increased if more data were collected at the high x value, or at any other x values besides 8.

Although some other common summary statistics such as quartiles could have revealed differences across the four data sets, the plots give additional information that would be difficult to glean from mere numerical summaries.

The importance of visualizing data is magnified (and made more complicated) when dealing with higher-dimensional data sets. Multiple regression is a straightforward generalization of linear regression to the case of multiple independent variables, while "multivariate" regression methods such as the general linear model allow for multiple dependent variables. Other statistical procedures designed to reveal relationships in multivariate data (several of which are closely tied to useful graphical depictions of the data) include principal component analysis, factor analysis, multidimensional scaling, discriminant function analysis, cluster analysis, and many others.
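The effect of the outlier in the third data set and of the influential point in the fourth can be illustrated by refitting the regression line without them. This is a rough sketch, again assuming the data values as published by Anscombe (1973); `fit_line` is a hypothetical helper written for this example, and dropping points is shown only to expose their influence, not as a recommended way to handle real data.

```python
def fit_line(points):
    """Least-squares fit of y on x.

    Returns (slope, intercept, r_squared), or None when every x value is
    identical, in which case no slope can be estimated.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    if sxx == 0:
        return None
    slope = sxy / sxx
    return slope, my - slope * mx, sxy ** 2 / (sxx * syy)


# Assumed data: third and fourth data sets as published by Anscombe (1973).
set3 = list(zip(
    [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
))
set4 = list(zip(
    [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
))

# Drop the single outlier from set III and the single high-x point from set IV.
set3_trimmed = [p for p in set3 if p != (13, 12.74)]
set4_trimmed = [p for p in set4 if p[0] != 19]

full3, trimmed3 = fit_line(set3), fit_line(set3_trimmed)
full4, trimmed4 = fit_line(set4), fit_line(set4_trimmed)

print("Set III, all points:     ", full3)
print("Set III, outlier removed:", trimmed3)
print("Set IV, all points:      ", full4)
print("Set IV, x=19 removed:    ", trimmed4)
```

With these values, removing the outlier from the third set changes the slope from about 0.50 to about 0.35 and brings R² very close to 1, while removing the point at x = 19 from the fourth set leaves every x equal to 8, so no regression line can be fitted at all.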
