Paired difference test

A paired difference test, better known as a paired comparison, is a type of location test that is used when comparing two sets of paired measurements to assess whether their population means differ. A paired difference test is designed for situations where there is dependence between pairs of measurements (in which case a test designed for comparing two independent samples would not be appropriate). That applies in a within-subjects study design, i.e., in a study where the same set of subjects undergo both of the conditions being compared.

Specific methods for carrying out paired difference tests include the paired-samples t-test, the paired Z-test, the Wilcoxon signed-rank test^[1] and others.

Use in reducing variance

Paired difference tests for reducing variance are a specific type of blocking. To illustrate the idea, suppose we are assessing the performance of a drug for treating high cholesterol. Under the design of our study, we enroll 100 subjects, and measure each subject's cholesterol level. Then all the subjects are treated with the drug for six months, after which their cholesterol levels are measured again. Our interest is in whether the drug has any effect on mean cholesterol levels, which can be inferred through a comparison of the post-treatment to pre-treatment measurements.

The key issue that motivates the paired difference test is that unless the study has very strict entry criteria, it is likely that the subjects will differ substantially from each other before the treatment begins. Important baseline differences among the subjects may be due to their gender, age, smoking status, activity level, and diet.

There are two natural approaches to analyzing these data:

In an "unpaired analysis", the data are treated as if the study design had actually been to enroll 200 subjects, followed by random assignment of 100 subjects to each of the treatment and control groups. The treatment group in the unpaired design would be viewed as analogous to the post-treatment measurements in the paired design, and the control group would be viewed as analogous to the pre-treatment measurements. We could then calculate the sample means within the treated and untreated groups of subjects, and compare these means to each other.
In a "paired difference analysis", we would first subtract the pre-treatment value from the post-treatment value for each subject, then compare these differences to zero. See also paired permutation test.

If we only consider the means, the paired and unpaired approaches give the same result. To see this, let $Y_{i1}, Y_{i2}$ be the observed data for the $i$th pair, and let $D_i = Y_{i2} - Y_{i1}$. Also let $\bar{D}$, $\bar{Y}_1$, and $\bar{Y}_2$ denote, respectively, the sample means of the $D_i$, the $Y_{i1}$, and the $Y_{i2}$. By rearranging terms we can see that

{\bar {D}}={\frac {1}{n}}\sum _{i}(Y_{i2}-Y_{i1})={\frac {1}{n}}\sum _{i}Y_{i2}-{\frac {1}{n}}\sum _{i}Y_{i1}={\bar {Y}}_{2}-{\bar {Y}}_{1},

where n is the number of pairs. Thus the mean difference between the groups does not depend on whether we organize the data as pairs.
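As a small numerical check of this identity, the following sketch computes the mean difference both ways. The cholesterol values are hypothetical numbers chosen only for illustration, and the function names are not from any particular library.

```python
from statistics import mean

# Hypothetical pre- and post-treatment cholesterol values for five subjects.
pre = [220.0, 245.0, 198.0, 260.0, 231.0]
post = [210.0, 240.0, 190.0, 248.0, 229.0]


def mean_of_differences(y1, y2):
    """Paired form: average the within-pair differences y2 - y1."""
    return mean(b - a for a, b in zip(y1, y2))


def difference_of_means(y1, y2):
    """Unpaired form: subtract the two sample means."""
    return mean(y2) - mean(y1)


d_paired = mean_of_differences(pre, post)
d_unpaired = difference_of_means(pre, post)
print(d_paired, d_unpaired)  # the two values agree up to floating-point rounding
```

Both quantities agree up to floating-point rounding, so any difference between the two analyses has to come from how the variance is estimated, not from the point estimate.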

Although the mean difference is the same for the paired and unpaired statistics, their statistical significance levels can be very different, because it is easy to overstate the variance of the unpaired statistic. Through Bienaymé's identity, the variance of $\bar{D}$ is

{\begin{aligned}\operatorname {var} ({\bar {D}})&=\operatorname {var} ({\bar {Y}}_{2}-{\bar {Y}}_{1})\\&=\operatorname {var} ({\bar {Y}}_{2})+\operatorname {var} ({\bar {Y}}_{1})-2\operatorname {cov} ({\bar {Y}}_{1},{\bar {Y}}_{2})\\&=\sigma _{1}^{2}/n+\sigma _{2}^{2}/n-2\sigma _{1}\sigma _{2}\operatorname {corr} (Y_{i1},Y_{i2})/n,\end{aligned}}

where $\sigma_1$ and $\sigma_2$ are the population standard deviations of the $Y_{i1}$ and $Y_{i2}$ data, respectively. Thus the variance of $\bar{D}$ is lower if there is positive correlation within each pair. Such correlation is very common in the repeated measures setting, since many factors influencing the value being compared are unaffected by the treatment. For example, if cholesterol levels are associated with age, the effect of age will lead to positive correlations between the cholesterol levels measured within subjects, as long as the duration of the study is small relative to the variation in ages in the sample.
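To make the effect of the correlation term concrete, here is a minimal sketch that evaluates the variance expression above for a few correlation values. The standard deviations, sample size, and function name are assumptions made for illustration.

```python
def var_mean_difference(sigma1, sigma2, rho, n):
    """Variance of the mean difference for n pairs.

    sigma1, sigma2: population standard deviations of the two measurements.
    rho: within-pair correlation corr(Y_i1, Y_i2).
    """
    return (sigma1**2 + sigma2**2 - 2.0 * sigma1 * sigma2 * rho) / n


# Hypothetical setting: both standard deviations equal 30, with 100 pairs.
for rho in (0.0, 0.5, 0.8):
    print(rho, var_mean_difference(30.0, 30.0, rho, 100))
```

With these assumed values the variance falls from 18 when the measurements are uncorrelated to roughly 3.6 when the within-pair correlation is 0.8, which is the reduction that the paired analysis takes advantage of.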

Power of the paired Z-test

Suppose we are using a Z-test to analyze the data, where the variances of the pre-treatment and post-treatment data $\sigma_1^2$ and $\sigma_2^2$ are known (the situation with a t-test is similar). The unpaired Z-test statistic is

{\frac {{\bar {Y}}_{2}-{\bar {Y}}_{1}}{\sqrt {\sigma _{1}^{2}/n+\sigma _{2}^{2}/n}}}.

The power of the unpaired, one-sided test carried out at level $α = 0.05$ can be calculated as follows:

{\begin{aligned}P\left({\frac {{\bar {Y}}_{2}-{\bar {Y}}_{1}}{\sqrt {\sigma _{1}^{2}/n+\sigma _{2}^{2}/n}}}>1.645\right)&=P\left({\frac {{\bar {Y}}_{2}-{\bar {Y}}_{1}}{S}}>1.645{\sqrt {\sigma _{1}^{2}/n+\sigma _{2}^{2}/n}}/S\right)\\&=P\left({\frac {{\bar {Y}}_{2}-{\bar {Y}}_{1}-\delta +\delta }{S}}>1.645{\sqrt {\sigma _{1}^{2}/n+\sigma _{2}^{2}/n}}/S\right)\\&=P\left({\frac {{\bar {Y}}_{2}-{\bar {Y}}_{1}-\delta }{S}}>1.645{\sqrt {\sigma _{1}^{2}/n+\sigma _{2}^{2}/n}}/S-\delta /S\right)\\&=1-\Phi (1.645{\sqrt {\sigma _{1}^{2}/n+\sigma _{2}^{2}/n}}/S-\delta /S),\end{aligned}}

where $S$ is the standard deviation of $\bar{D}$, $\Phi$ is the standard normal cumulative distribution function, and $\delta = EY_2 - EY_1$ is the true effect of the treatment. The constant 1.645 is the 95th percentile of the standard normal distribution, which defines the rejection region of the test.

By a similar calculation, the power of the paired Z-test is

1-\Phi (1.645-\delta /S).

By comparing the expressions for power of the paired and unpaired tests, one can see that the paired test has more power as long as

{\sqrt {\sigma _{1}^{2}/n+\sigma _{2}^{2}/n}}/S={\sqrt {\frac {\sigma _{1}^{2}+\sigma _{2}^{2}}{\sigma _{1}^{2}+\sigma _{2}^{2}-2\sigma _{1}\sigma _{2}\rho }}}>1{\text{ where }}\rho :=\operatorname {corr} (Y_{i1},Y_{i2}).

This condition is met whenever $\rho$ , the within-pairs correlation, is positive.
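The two power expressions can be compared numerically. The sketch below assumes known variances and a one-sided test, takes $S$ to be the standard deviation of the mean difference (as implied by the inequality above), and uses `NormalDist` from Python's standard library; the function name and the parameter values are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def z_test_powers(delta, sigma1, sigma2, rho, n, alpha=0.05):
    """Return (paired power, unpaired power) for the one-sided Z-tests.

    Assumes rho < 1 or sigma1 != sigma2, so that S is strictly positive.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha)  # about 1.645 for alpha = 0.05
    # S: true standard deviation of the mean difference.
    s = sqrt((sigma1**2 + sigma2**2 - 2.0 * sigma1 * sigma2 * rho) / n)
    # Denominator of the unpaired statistic, which ignores the correlation.
    s_unpaired = sqrt((sigma1**2 + sigma2**2) / n)
    power_paired = 1.0 - std_normal.cdf(z_crit - delta / s)
    power_unpaired = 1.0 - std_normal.cdf(z_crit * s_unpaired / s - delta / s)
    return power_paired, power_unpaired


# Hypothetical effect of 5 units, standard deviations of 30, 100 pairs.
for rho in (0.0, 0.5, 0.8):
    print(rho, z_test_powers(5.0, 30.0, 30.0, rho, 100))
```

When the correlation is zero the two powers coincide, and for positive correlation the paired power is the larger of the two. Setting `delta` to zero gives the actual type I error rates: the paired test stays at the nominal level, while the unpaired test drops below it when the correlation is positive.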

A random effects model for paired testing

The following statistical model is useful for understanding the paired difference test:

Y_{ij}=\mu _{j}+\alpha _{i}+\varepsilon _{ij}

where $\alpha_i$ is a random effect that is shared between the two values in the pair, and $\varepsilon_{ij}$ is a random noise term that is independent across all data points. The constant values $\mu_1, \mu_2$ are the expected values of the two measurements being compared, and our interest is in $\delta = \mu_2 - \mu_1$.

In this model, the $\alpha_i$ capture "stable confounders" that have the same effect on the pre-treatment and post-treatment measurements. When we subtract to form $D_i$, the $\alpha_i$ cancel out, and so do not contribute to the variance. The within-pairs covariance is

\operatorname {cov} (Y_{i1},Y_{i2})=\operatorname {var} (\alpha _{i}).

This is non-negative, so it leads to better performance for the paired difference test compared to the unpaired test, unless the $\alpha_i$ are constant over $i$, in which case the paired and unpaired tests are equivalent.
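A short simulation can illustrate how the shared effect cancels in the differences. This is only a sketch: the normal distributions, the parameter values, the fixed seed, and the helper names are assumptions, since the model above does not specify distributions for $\alpha_i$ or $\varepsilon_{ij}$.

```python
import random
from statistics import mean, variance


def simulate_pairs(n, mu1, mu2, sd_alpha, sd_eps, seed=0):
    """Draw n pairs from Y_ij = mu_j + alpha_i + eps_ij with normal terms."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        alpha = rng.gauss(0.0, sd_alpha)  # shared within the pair
        y1 = mu1 + alpha + rng.gauss(0.0, sd_eps)
        y2 = mu2 + alpha + rng.gauss(0.0, sd_eps)
        pairs.append((y1, y2))
    return pairs


def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


pairs = simulate_pairs(n=20000, mu1=230.0, mu2=220.0, sd_alpha=30.0, sd_eps=10.0)
y1 = [a for a, _ in pairs]
y2 = [b for _, b in pairs]
diffs = [b - a for a, b in pairs]

cov_12 = sample_covariance(y1, y2)  # expected to be near var(alpha) = 900
var_diff = variance(diffs)          # expected to be near 2 * var(eps) = 200
print(cov_12, var_diff, variance(y1) + variance(y2))
```

Under these assumptions the variance of the differences reflects only the noise terms, whereas the sum of the two marginal variances, which is what the unpaired analysis relies on, also includes the variance of the shared effect twice.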

In less mathematical terms, the unpaired test assumes that the data in the two groups being compared are independent. This assumption determines the form for the variance of $\bar{D}$. However, when two measurements are made for each subject, it is unlikely that the two measurements are independent. If the two measurements within a subject are positively correlated, the unpaired test overstates the variance of $\bar{D}$, making it a conservative test in the sense that its actual type I error probability will be lower than the nominal level, with a corresponding loss of statistical power. In rare cases, the data may be negatively correlated within subjects, in which case the unpaired test becomes anti-conservative. The paired test is generally used when repeated measurements are made on the same subjects, since it has the correct level regardless of the correlation of the measurements within pairs.

Use in reducing confounding

Another application of paired difference testing arises when comparing two groups in a set of observational data, with the goal being to isolate the effect of one factor of interest from the effects of other factors that may play a role. For example, suppose teachers adopt one of two different approaches, denoted "A" and "B", to teaching a particular mathematical topic. We may be interested in whether the performances of the students on a standardized mathematics test differ according to the teaching approach. If the teachers are free to adopt approach A or approach B, it is possible that teachers whose students are already performing well in mathematics will preferentially choose method A (or vice versa). In this situation, a simple comparison between the mean performances of students taught with approach A and approach B will likely show a difference, but this difference is partially or entirely due to the pre-existing differences between the two groups of students. In this situation, the baseline abilities of the students serve as a confounding variable, in that they are related to both the outcome (performance on the standardized test), and to the treatment assignment to approach A or approach B.

It is possible to reduce, but not necessarily eliminate, the effects of confounding variables by forming "artificial pairs" and performing a pairwise difference test. These artificial pairs are constructed based on additional variables that are thought to serve as confounders. By pairing students whose values on the confounding variables are similar, a greater fraction of the difference in the value of interest (e.g. the standardized test score in the example discussed above), is due to the factor of interest, and a lesser fraction is due to the confounder. Forming artificial pairs for paired difference testing is an example of a general approach for reducing the effects of confounding when making comparisons using observational data called matching.^[2]^[3]^[4]
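One simple way to form such artificial pairs is greedy nearest-neighbour matching on a single baseline covariate. The sketch below is only one of many possible matching schemes and is not taken from the cited references; the function name, the `caliper` parameter, and the data are hypothetical.

```python
def greedy_match(group_a, group_b, caliper):
    """Pair each A unit with the closest unused B unit on the baseline value.

    group_a, group_b: lists of (baseline, outcome) tuples.
    caliper: largest baseline gap allowed for a pair to be formed.
    Returns the within-pair outcome differences (A minus B).
    """
    unused_b = list(group_b)
    differences = []
    for base_a, out_a in sorted(group_a):
        if not unused_b:
            break
        # Closest remaining B unit in terms of the baseline covariate.
        nearest = min(unused_b, key=lambda unit: abs(unit[0] - base_a))
        if abs(nearest[0] - base_a) <= caliper:
            unused_b.remove(nearest)
            differences.append(out_a - nearest[1])
    return differences


# Hypothetical (baseline score, test score) records for each teaching approach.
students_a = [(70, 82), (55, 66), (90, 95)]
students_b = [(68, 80), (57, 65), (40, 50)]
print(greedy_match(students_a, students_b, caliper=5))
```

Units without a sufficiently close partner are left unmatched, and the resulting differences can then be analyzed with any paired difference test. Greedy matching depends on the processing order, so it is not guaranteed to find the best overall set of pairs.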

As a concrete example, suppose we observe student test scores $X$ under teaching strategies $A$ and $B$, and each student has either a "high" or "low" level of mathematical knowledge before the two teaching strategies are implemented. However, we do not know which students are in the "high" category and which are in the "low" category. The population mean test scores in the four possible groups are

{\begin{array}{l|ll}&A&B\\\hline {\text{High}}&\mu _{HA}&\mu _{HB}\\{\text{Low}}&\mu _{LA}&\mu _{LB}\end{array}}

and the proportions of students in the groups are

{\begin{array}{l|ll}&A&B\\\hline {\text{High}}&p_{HA}&p_{HB}\\{\text{Low}}&p_{LA}&p_{LB}\end{array}}

where $p_{HA} + p_{HB} + p_{LA} + p_{LB} = 1$.

The "treatment difference" among students in the "high" group is $\mu_{HA} - \mu_{HB}$ and the treatment difference among students in the "low" group is $\mu_{LA} - \mu_{LB}$. In general, it is possible that the two teaching strategies could differ in either direction, or show no difference, and the effects could differ in magnitude or even in sign between the "high" and "low" groups. For example, if strategy B were superior to strategy A for well-prepared students, but strategy A were superior to strategy B for poorly prepared students, the two treatment differences would have opposite signs.

Since we do not know the baseline levels of the students, the expected value of the average test score $\bar{X}_A$ among students in the A group is an average of those in the two baseline levels:

E{\bar {X}}_{A}=\mu _{HA}{\frac {p_{HA}}{p_{HA}+p_{LA}}}+\mu _{LA}{\frac {p_{LA}}{p_{HA}+p_{LA}}},

and similarly the expected value of the average test score $\bar{X}_B$ among students in the B group is

E{\bar {X}}_{B}=\mu _{HB}{\frac {p_{HB}}{p_{HB}+p_{LB}}}+\mu _{LB}{\frac {p_{LB}}{p_{HB}+p_{LB}}}.

Thus the expected value of the observed treatment difference $\bar{D} = \bar{X}_A - \bar{X}_B$ is

\mu _{HA}{\frac {p_{HA}}{p_{HA}+p_{LA}}}-\mu _{HB}{\frac {p_{HB}}{p_{HB}+p_{LB}}}+\mu _{LA}{\frac {p_{LA}}{p_{HA}+p_{LA}}}-\mu _{LB}{\frac {p_{LB}}{p_{HB}+p_{LB}}}.

A reasonable null hypothesis is that there is no effect of the treatment within either the "high" or "low" student groups, so that $\mu_{HA} = \mu_{HB}$ and $\mu_{LA} = \mu_{LB}$. Under this null hypothesis, the expected value of $\bar{D}$ will be zero if

p_{HA}=(p_{HA}+p_{LA})(p_{HA}+p_{HB})

and

p_{HB}=(p_{HB}+p_{LB})(p_{HA}+p_{HB}).

This condition asserts that the assignment of students to the $A$ and $B$ teaching strategy groups is independent of their mathematical knowledge before the teaching strategies are implemented. If this holds, baseline mathematical knowledge is not a confounder, and conversely, if baseline mathematical knowledge is a confounder, the expected value of $\bar{D}$ will generally differ from zero. If the expected value of $\bar{D}$ under the null hypothesis is not equal to zero, then a situation where we reject the null hypothesis could either be due to an actual differential effect between teaching strategies $A$ and $B$, or it could be due to non-independence in the assignment of students to the $A$ and $B$ groups (even in the complete absence of an effect due to the teaching strategy).

This example illustrates that if we make a direct comparison between two groups when confounders are present, we do not know whether any difference that is observed is due to the grouping itself, or is due to some other factor. If we are able to pair students by an exact or estimated measure of their baseline mathematical ability, then we are only comparing students "within rows" of the table of means given above. Consequently, if the null hypothesis holds, the expected value of $\bar{D}$ will equal zero, and statistical significance levels have their intended interpretation.
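The role of the independence condition can be checked with a small calculation. The sketch below evaluates the expected unpaired difference under the null hypothesis for two sets of proportions; all numerical values and names are hypothetical and chosen only to illustrate the formulas above.

```python
def expected_group_difference(mu, p):
    """Expected mean score in group A minus that in group B.

    mu and p map the keys "HA", "HB", "LA", "LB" to the population means
    and the population proportions of the four groups.
    """
    share_a = p["HA"] + p["LA"]
    share_b = p["HB"] + p["LB"]
    mean_a = (mu["HA"] * p["HA"] + mu["LA"] * p["LA"]) / share_a
    mean_b = (mu["HB"] * p["HB"] + mu["LB"] * p["LB"]) / share_b
    return mean_a - mean_b


# Null hypothesis: no treatment effect within either baseline level.
null_means = {"HA": 80.0, "HB": 80.0, "LA": 60.0, "LB": 60.0}

# Assignment independent of baseline level versus assignment related to it.
independent = {"HA": 0.25, "HB": 0.25, "LA": 0.25, "LB": 0.25}
confounded = {"HA": 0.40, "HB": 0.10, "LA": 0.10, "LB": 0.40}

diff_independent = expected_group_difference(null_means, independent)
diff_confounded = expected_group_difference(null_means, confounded)
print(diff_independent, diff_confounded)
```

With independent assignment the expected difference is zero, while in the confounded case it is about 12 points even though neither teaching strategy has any effect. The within-row differences, 80 minus 80 and 60 minus 60, are zero in both cases, which is what a comparison of pairs matched on baseline ability targets.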

References

[1] Derrick, B; Broad, A; Toher, D; White, P (2017). "The impact of an extreme observation in a paired samples design". Metodološki Zvezki - Advances in Methodology and Statistics. 14 (2): 1–17.
[2] Rubin, Donald B. (1973). "Matching to Remove Bias in Observational Studies". Biometrics. 29 (1): 159–183. doi:10.2307/2529684. JSTOR 2529684.
[3] Anderson, Dallas W.; Kish, Leslie; Cornell, Richard G. (1980). "On Stratification, Grouping and Matching". Scandinavian Journal of Statistics. 7 (2). Blackwell Publishing: 61–66. JSTOR 4615774.
[4] Kupper, Lawrence L.; Karon, John M.; Kleinbaum, David G.; Morgenstern, Hal; Lewis, Donald K. (1981). "Matching in Epidemiologic Studies: Validity and Efficiency Considerations". Biometrics. 37 (2): 271–291. CiteSeerX 10.1.1.154.1197. doi:10.2307/2530417. JSTOR 2530417. PMID 7272415.
