When comparing two independent groups, psychology researchers commonly use Student’s t-tests. Assumptions of normality and homogeneity of variance underlie this test. More often than not, when these conditions are not met, Student’s t-test can be severely biased and lead to invalid statistical inferences. Moreover, we argue that the assumption of equal variances will seldom hold in psychological research, and choosing between Student’s t-test and Welch’s t-test based on the outcomes of a test of the equality of variances often fails to provide an appropriate answer. We show that the Welch’s t-test provides a better control of Type 1 error rates when the assumption of homogeneity of variance is not met, and it loses little robustness compared to Student’s t-test when the assumptions are met. We argue that Welch’s t-test should be used as a default strategy.
Independent sample t-tests are commonly used in the psychological literature to statistically test differences between means. There are different types of t-tests, such as Student’s t-test, Welch’s t-test, Yuen’s t-test, and a bootstrapped t-test. These variations differ in the underlying assumptions about whether data is normally distributed and whether variances in both groups are equal (see, e.g., Rasch, Kubinger, & Moder, 2011; Yuen, 1974). Student’s t-test is the default method to compare two groups in psychology. The alternatives that are available are considerably less often reported. This is surprising, since Welch’s t-test is often the preferred choice and is available in practically all statistical software packages.
In this article, we will review the differences between Welch’s t-test, Student’s t-test, and Yuen’s t-test, and we suggest that Welch’s t-test is a better default for the social sciences than Student’s and Yuen’s t-tests. We do not include the bootstrapped t-test because it is known to fail in specific situations, such as when there are unequal sample sizes and standard deviations differ moderately (Hayes & Cai, 2007).
When performing a t-test, several software packages (i.e., R and Minitab) present Welch’s t-test by default. Users can request Student’s t-test, but only after explicitly stating that the assumption of equal variances is met. Student’s t-test is a parametric test, which means it relies on assumptions about the data that are analyzed. Parametric tests are believed to be more powerful than non-parametric tests (i.e., tests that do not require assumptions about the population parameters; Sheskin, 2003). However, Student’s t-test is generally only more powerful when the data are normally distributed (the assumption of normality) and the variances are equal in both groups (homoscedasticity; the assumption of homogeneity of variance; Carroll & Schneider, 1985; Erceg-Hurn & Mirosevich, 2008).
When sample sizes are equal between groups, Student’s t-test is robust to violations of the assumption of equal variances as long as sample sizes are big enough to allow correct estimates of both means and standard deviations (i.e., n ≥ 5),1 except when distributions underlying the data have very high skewness and kurtosis, such as a chi-square distribution with 2 degrees of freedom. However, if variances are not equal across groups and the sample sizes differ across independent groups, Student’s t-test can be severely biased and lead to invalid statistical inferences (Erceg-Hurn & Mirosevich, 2008).2, 3 Here, we argue that there are no strong reasons to assume equal variances in the psychological literature by default nor substantial costs in abandoning this assumption.
In this article, we will first discuss why we need a default test and why a two-step procedure where researchers decide whether or not to use Welch’s t-test based on a check of the assumption of normality and equal variances is undesirable. Then, we will discuss whether the assumption of equal variances is plausible in psychology and point out research areas where this assumption is implausible. We will then review differences between Student’s t-test, Welch’s t-test, and Yuen’s t-test and show through simulations that the bias in Type 1 error rates when Yuen’s t-test is used is often severely inflated (above 0.075, which is “critical inflation”, following Bradley, 1978) and that the bias in Type 1 error rates when Student’s t-test is used has a larger impact on statistical inferences than the rather modest impact on the Type 2 error rate of always using Welch’s t-test by default. Given our analysis and the availability of Welch’s t-test in all statistical software, we recommend a procedure where Welch’s t-test is used by default when sample sizes are unequal.
Limitations of Two-Step Procedures
Readers may have learned that the assumptions of normality and of equal variances (or the homoscedasticity assumption) must be examined using assumption checks prior to performing any t-test. When data are not normally distributed, with small sample sizes, alternatives should be used. Classic nonparametric statistics are well-known, such as the Mann-Whitney U-test and Kruskal-Wallis. However, unlike a t-test, tests based on rank assume that the distributions are the same between groups. Any departure to this assumption, such as unequal variances, will therefore lead to the rejection of the assumption of equal distributions (Zimmerman, 2000). Alternatives exist, known as the “modern robust statistics” (Wilcox, Granger, & Clark, 2013). For example, data sets with low kurtosis (i.e., a distribution flatter than the normal distribution) should be analyzed with the two-sample trimmed t-test for unequal population variances, also called Yuen’s t-test (Luh & Guo, 2007; Yuen, 1974). However, analyses in a later section will show that the normality assumption is not very important for Welch’s t-test and that there are good reasons to, in general, prefer Welch’s t-test over Yuen’s t-test.
With respect to the assumption of homogeneity of variance, if the test of the equality of variance is non-significant and the assumption of equal variances cannot be rejected, homoscedastic methods such as the Student’s t-test should be used (Wilcox et al., 2013). If the test of the equality of variances is significant, Welch’s t-test should be used instead of Student’s t-test because the assumption of equal variances is violated. However, testing the equality of variances before deciding which t-test is performed is problematic for several reasons, which will be explained after having described some of the most widely used tests of equality of variances.
Different Ways to Test for Equal Variances
Researchers have proposed several tests for the assumption of equal variances. Levene’s test and the F-ratio test are the most likely to be used by researchers because they are available in popular statistical software (Hayes & Cai, 2007). Levene’s test is the default option in SPSS. Levene’s test is the One-Way ANOVA computed on the terms |x_{ij}-θ_{j}|, where x_{ij} is the ith observation in the jth group, and θ_{j} is the “center” of the distribution for the jth group (Carroll & Schneider, 1985). In R, the “center” is by default the median, which is also called “Brown Forsythe test for equal variances”. In SPSS, the “center” is by default the mean (which is the most powerful choice when the underlying data are symmetrical).4 The F-ratio statistic is obtained by computing SD2/SD1 (standard deviation ratio, SDR). A generalization of the F-ratio test, to be used when there are more than two groups to compare, is known as the Bartlett’s test.
The F-ratio test and the Bartlett test are powerful, but they are only valid under the assumption of normality and collapse as soon as one deviates even slightly from the normal distribution. They are therefore not recommended (Rakotomalala, 2008).
Levene’s test is more robust than Bartlett’s test and the F-ratio test, but there are three arguments against the use of Levene’s test. First, there are several ways to compute Levene’s test (i.e., using the median or mean as center), and the best version of the test for equal variances depends on how symmetrically the data is distributed, which is itself difficult to statistically quantify.
Second, performing two tests (Levene’s test followed by a t-test) on the same data makes the alpha level and power of the t-test dependent upon the outcome of Levene’s test. When we perform Student’s or Welch’s t-test conditionally on a significant Levene’s test, the long-run Type 1 and Type 2 error rates will depend on the power of Levene’s test. When the power of Levene’s test is low, the error rates of the conditional choice will be very close to Student’s error rates (because the probability of choosing Student’s t-test is very high). On the other hand, when the power of Levene’s test is very high, the error rates of the conditional choice will be very close to Welch’s error rate (because the probability of choosing Welch’s t-test is very high; see Rasch, Kubinger, & Moder, 2011). When the power of Levene’s test is medium, the error rates of the conditional choice will be somewhere between Student’s and Welch’s error rates (see, e.g., Zimmerman, 2004). This is problematic when the test most often performed actually has incorrect error rates.
Third, and relatedly, Levene’s test can have very low power, which leads to Type 2 errors when sample sizes are small and unequal (Nordstokke & Zumbo, 2007). As an illustration, to estimate the power of Levene’s test, we simulated 1,000,000 simulations with balanced designs of different sample sizes (ranging from 10 to 80 in each condition, with a step of 5) under three SDR where the true variances are unequal, respectively, 1.1, 1.5, and 2, yielding 45,000,000 simulations in total. When SDR = 1, the equal variances assumption is true when SDR > 1 the standard deviation of the second sample is bigger than the standard deviation of the first sample and when SDR < 1 the standard deviation of the second sample is smaller than the standard deviation of the first sample. We ran Levene’s test centered around the mean and Levene’s test centered around the median and estimated the power (in %) to detect unequal variances with equal sample sizes (giving the best achievable power for a given total N; see Figure 1).5
Estimated power of Levene’s test as a function of sample size, SDR and centering parameter.
As we can see in the graph, the further SDR is from 1, the smaller the sample size needed to detect a statistically significant difference in the SDR. Furthermore, for each SDR, power curves of the Levene’s test based on the mean are slightly above power curves of the Levene’s test based on the median, meaning that it leads to slightly higher power than Levene’s test based on the median. This can be due to the fact that data is extracted from normal distributions. With asymmetric data, the median would perform better. When SDR = 2, approximately 50 subjects are needed to have 80 percent power to detect differences, while approximately 70 subjects are needed to have 95 percent power to detect differences (for both versions of Levene’s test). To detect an SDR of 1.5 with Levene’s test, approximately 120 subjects are needed to reach a power of 0.80 and about 160 to reach a power of 0.95. Since such an SDR is already very problematic in terms of the type 1 error rate for the Student’s t-test (Bradley, 1978), needing such a large sample size to detect it is a serious hurdle. This issue becomes even worse for lower SDR, since an SDR as small as 1.1 already calls for the use of Welch’s t-test (See table A3.1 to A3.9 in the additional file). Detecting such a small SDR calls for a huge sample size (a sample size of 160 provides a power rate of 0.16).
Since Welch’s t-test has practically the same power as Student’s t-test, even when SDR = 1, as explained using simulations later, we should seriously consider using Welch’s t-test by default.
The problems in using a two-step procedure (first testing for equality of variances, then deciding upon which test to use) have already been discussed in the field of statistics (see e.g., Rasch, Kubinger, & Moder, 2011; Ruxton, 2006; Wilcox, Granger, & Clark, 2013; Zimmerman, 2004), but these insights have not changed the current practices in psychology, as of yet. More importantly, researchers do not even seem to take the assumptions of Student’s t-test into consideration before performing the test, or at least rarely discuss assumption checks.
We surveyed statistical tests reported in the journal SPPS (Social Psychological and Personality Science) between April 2015 and April 2016. From the total of 282 studies, 97 used a t-test (34.4%), and the homogeneity of variance was explicitly discussed in only 2 of them. Moreover, based on the reported degrees of freedom in the results section, it seems that Student’s t-test is used most often and that alternatives are considerably less popular. For 7 studies, there were decimals in the values of the degrees of freedom, which suggests Welch’s t-test might have been used, although the use of Welch’s t-test might be higher but not identifiable because some statisticians recommend rounding the degrees of freedom to round numbers.
To explain this lack of attention to assumption checks, some authors have argued that researchers might have a lack of knowledge (or a misunderstanding) of the parametric assumptions and consequences of their violations or that they might not know how to check assumptions or what to do when assumptions are violated (Hoekstra, Kiers, & Johnson, 2012).6 Finally, many researchers don’t even know there are options other than the Student’s t-test for comparing two groups (Erceg-Hurn & Mirosevich, 2008). How problematic this is depends on how plausible the assumption of equal variances is in psychological research. We will discuss circumstances under which the equality of variances assumption is especially improbable and provide real-life examples where the assumption of equal variances is violated.
Homogeneity of Variance Assumptions
The homogeneity of variances assumption is rarely true in real life and cannot be taken for granted when performing a statistical test (Erceg-Hurn & Mirosevich, 2008; Zumbo & Coulombe, 1997). Many authors have examined real data and noted that SDR is often different from the 1:1 ratio (see, e.g., Grissom, 2000; Erceg-Hurn & Mirosevich, 2008). This shows that the presence of unequal variances is a realistic assumption in psychological research.7 We will discuss three different origins of unequal standard deviations across two groups of observations.
A first reason for unequal variances across groups is that psychologists often use measured variables (such as age, gender, educational level, ethnic origin, depression level, etc.) instead of random assignment to condition. In their review of comparing psychological findings from all fields of the behavioral sciences across cultures, Henrich, Heine, and Norenzayan (2010) suggest that parameters vary largely from one population to another. In other words, variance is not systematically the same in every pre-existing group. For example, Feingold (1992) has shown that intellectual abilities of males were more variable than intellectual abilities of females when looking at several standardized test batteries measuring general knowledge, mechanical reasoning, spatial visualization, quantitative ability, and spelling. Indeed, the variability hypothesis (that men demonstrate greater variability than women) is more than a century old (for a review, see Shields, 1975). In many research domains, such as mathematics performance, there are strong indicators that variances ratios differ between 1.1 and 1.2, although variances ratios do not differ in all countries, and the causes for these differences are not yet clear. Nevertheless, it is an empirical fact that variances ratios can differ among pre-existing groups.
Furthermore, some pre-existing groups have different variability by definition. An example from the field of education is the comparison of selective school systems (where students are accepted on the basis of selection criterions) versus comprehensive school systems (where all students are accepted, whatever their aptitudes; see, e.g., Hanushek & Wößmann, 2006). At the moment that a school accepts its students, variability in terms of aptitude will be greater in a comprehensive school than in a selective school, by definition.
Finally, a quasi-experimental treatment can have a different impact on variances between groups. Hanushek and Wößmann (2006) suggest that there is an impact of the educational system on variability in achievement. Even if variability, in terms of aptitude, is greater in a comprehensive school than in a selective school at first, a selective school system at primary school increases inequality (and then variability) in achievement in secondary school. Another example is variability in moods. Cowdry, Gardner, O’Leary, Leibenluft, & Rubinow (1991) noted than intra-individual variability is larger in patients suffering from premenstrual syndrome (PMS) than in normal patients and larger in normal patients than in depressive patients. Researchers studying the impact of an experimental treatment on mood changes can expect a bigger variability of mood changes in patients with PMS than in normal or depressive patients and thus a higher standard deviation in mood measurements.
A second reason for unequal variances across groups is that while variances of two groups are the same when group assignment is completely randomized, deviation from equality of variances can occur later, as a consequence of an experimental treatment (Cumming, 2013; Erceg-Hurn & Mirosevich, 2008; Keppel, 1991). For example, psychotherapy for depression can increase the variability in depressive symptoms, in comparison with a control group, because the effectiveness of the therapy will depend on individual differences (Bryk & Raudenbush, 1988; Erceg-Hurn & Mirosevich, 2008). Similarly, Kester (1969) compared the IQs of students from a control group with the IQs of students when high expectancies about students were induced in the teacher. While no effect of teacher expectancy on IQ was found, the variance was bigger in the treatment group than in the control group (56.52 vs. 32.59, that is, SDR ≈ 1.32). As proposed by Bryk and Raudenbush (1988), this can result from the interaction between the treatment and the students’ reactions: students can react differently to the induced expectations. More generally, whenever a manipulation has individual moderators, variability should increase compared to a control condition.
Knowing whether standard deviations differ across conditions is important information, but in many fields, we have no accurate estimates of the standard deviation in the population. Whereas we collect population effect sizes in meta-analyses, these meta-analyses often do not include the standard deviations from the literature. As a consequence, we regrettably do not have easy access to aggregated information about standard deviations across research areas, despite the importance of this information. It would be useful if meta-analysts start to code information about standard deviations when performing meta-analyses (Lakens, Hilgard, & Staaks, 2016), such that we can accurately quantify whether standard deviations differ between groups, and how large the SDR is.
The Mathematical Differences Between Student’s <italic>t</italic>-test, Welch’s <italic>t</italic>-test, and Yuen’s <italic>t</italic>-test
So far, we have simply mentioned that Welch’s t-test differs from Student’s t-test in that it does not rely on the equality of variances assumption. In this section, we will explain why this is the case. The Student’s t statistic is calculated by dividing the mean difference between group x¯1−x¯2
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
{\bar x_1} - {\bar x_2}
\]
\end{document}
by a pooled error term, where s12
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
s_1^2
\]
\end{document}
and s22
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
s_2^2
\]
\end{document}
are variance estimates from each independent group, and where n_{1} and n_{2} are the respective sample sizes for each independent group (Student, 1908):
Student’s t-test is calculated based on a pooled error term, which implies that both samples’ variances are estimates of a common population variance. Whenever the variances of the two normal distributions are not similar and the sample sizes in each group are not equal, Student’s t-test results are biased (Zimmerman, 1996). The more unbalanced the distribution of participants across both independent groups, the more Student’s t-test is based on the incorrect standard error (Wilcox et al., 2013) and, consequently, the less accurate the computation of the p-value will be.
When the larger variance is associated with the larger sample size, there is a decrease in the nominal Type 1 error rate (Nimon, 2012; Overall, Atlas, & Gibson, 1995). The reason for this is that the error term increases, and, as a consequence, the Student’s t-value decreases, leading to fewer significant findings than expected with a specific alpha level. When the larger variance is associated with the smaller sample size, the Type 1 error rate is inflated (Nimon, 2012; Overall, Atlas, & Gibson, 1995). This inflation is caused by the under-evaluation of the error term, which increases Student’s t value and thus leads to more significant results than are expected based on the alpha level.
As discussed earlier, Student’s t-test is robust to unequal variances as long as the sample sizes of each group are similar (Nimon, 2012; Ruxton, 2006; Wallenstein, Zucker, & Fleiss, 1980), but, in practice, researchers often have different sample sizes in each of the independent groups (Ruxton, 2006). Unequal sample sizes are particularly common when examining measured variables, where it is not always possible to determine a priori how many of the collected subjects will fall in each category (e.g., sex, nationality, or marital status). However, even with complete randomized assignment to conditions, where the same number of subjects are assigned to each condition, unequal sample sizes can emerge when participants have to be removed from the data analysis due to being outliers because the experimental protocol was not followed when collecting the data (Shaw & Mitchell-Olds, 1993) or due to missing values (Wang et al., 2012).
Previous work by many researchers has shown that Student’s t-test performs surprisingly poorly when variances are unequal and sample sizes are unequal (Glass, Peckham, & Sanders, 1972; Overall, Atlas, & Gibson, 1995; Zimmerman, 1996), especially with small sample sizes and low alpha levels (e.g., alpha = 1%; Zimmerman, 1996). The poor performance of Student’s t-test when variances are unequal becomes visible when we look at the error rates of the test and the influence of both Type 1 errors and Type 2 errors. An increase in the Type 1 error rate leads to an inflation of the number of false positives in the literature, while an increase in the Type 2 error rate leads to a loss of statistical power (Banerjee et al., 2009).
To address these limitations of Student’s t-test, Welch (1947) proposed a separate-variances t-test computed by dividing the mean difference between group x¯1−x¯2
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
{\bar x_1} - {\bar x_2}
\]
\end{document}
by an unpooled error term, where s12
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
s_1^2
\]
\end{document}
and s22
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
s_2^2
\]
\end{document}
are variance estimates from each independent group, and where n_{1} and n_{2} are the respective sample sizes for each independent group:8
When both variances and sample sizes are the same in each independent group, the t-values, degrees of freedom, and the p-values in Student’s t-test and Welch’s t-test are the same (see Table 1). When the variance is the same in both independent groups but the sample sizes differ, the t-value remains identical, but the degrees of freedom differ (and, as a consequence, the p-value differs). Similarly, when the variances differ between independent groups but the sample sizes in each group are the same, the t-value is identical in both tests, but the degrees of freedom differ (and, thus, the p-value differs). The most important difference between Student’s t-test and Welch’s t-test, and indeed the main reason Welch’s t-test was developed, is when both the variances and the sample sizes differ between groups, the t-value, degrees of freedom, and p-value all differ between Student’s t-test and Welch’s t-test. Note that, in practice, samples practically never show exactly the same pattern of variance as populations, especially with small sample sizes (Baguley, 2012; also see table A2 in the additional file).
Comparison of t-value and Degrees of Freedom of Welch’s and Student’s t-test.
Equal variances
Unequal variances
Balanced design
t_{Welch} = t_{Student}
t_{Welch} = t_{Student}
df_{Welch} = df_{Student}
df_{Welch} ≠ df_{Student}
p_{Welch} = p_{Student}
p_{Welch} ≠ p_{Student}
Unbalanced design
t_{Welch} = t_{Student}
t_{Welch} ≠ t_{Student}
df_{Welch} ≠ df_{Student}
df_{Welch} ≠ df_{Student}
p_{Welch} ≠ p_{Student}
p_{Welch} ≠ p_{Student}
Yuen’s t-test, also called “20 percent trimmed means test”, is an extension of Welch’s t-test and is allegedly more robust in case of non-normal distributions (Wilcox & Keselman, 2003). Yuen’s t-test consists of removing the lowest and highest 20 percent of the data and applying Welch’s t-test on the remaining values. The procedure is explained and well-illustrated in a paper by Erceg-Hurn and Mirosevich (2008).
Simulations: Error Rates for Student’s <italic>t</italic>-test versus Welch’s <italic>t</italic>-test
When we are working with a balanced design, the statistical power (the probability of finding a significant effect, when there is a true effect in the population, or 1 minus the Type 2 error rate) is very similar for Student’s t-test and Welch’s t-test. Even with extremely large SDR (respectively, 0.01, 0.1, 10, and 100) and small sample sizes (10 subjects per group), the biggest increase in power of Student’s t-test compared to Welch’s-t test is approximately 5 percent when the test is applied on two normal skewed distributions with unequal shapes. In all other cases, the difference in power between both tests is smaller (See table A1.1 to A1.9 in the additional file).
Considering the cases where sample sizes are unequal and SDR = 1, Student’s-t test is sometimes better than Welch’s t-test, and sometimes the reverse is true. The difference is small, except in three scenarios (See table A5.2, A5.5, and A5.6 in the additional file). However, because there is no correct test to perform that assures SDR = 1, and because variances are likely not to be equal in certain research areas, our recommendation is to always use Welch’s t-test instead of Student’s t-test.
To illustrate the differences in Type 1 error rates between Student’s t-test and Welch’s t-test, we simulated 1,000,000 studies under the null hypothesis (no difference between the means in each group) under four scenarios. We chose a small sample ratio (n_{1} = 40 vs. n_{2} = 60) to show that when the equal variances assumption was not met and SDR = 2, biased error rates are observed in Student’s t-test. We compared Scenario 1, where the variance is the same in each group (SDR = 1; homoscedasticity assumption met) and sample sizes are unequal (See v 2a), with Scenario 2, where the variance differs between groups (SDR = 2) but sample sizes are equal (n_{1} = n_{2} = 50; see Figure 2b). Furthermore, we simulated Scenario 3, where both sample sizes and variances were unequal between groups and the larger variance is associated with the larger sample size (SDR = 2; see Figure 2c), and a similar Scenario 4, where the larger variance is associated with the smaller sample size (SDR = 0.5; see Figure 2d). P-value distributions for both Student’s and Welch’s t-tests were then plotted. When there is no true effect, p-values are distributed uniformly.
P-value distributions for Student’s and Welch’s t-test under the null as a function of SDR, and sample size.
As long as the variances are equal between groups or sample sizes are equal, the distribution of Student’s p-values is uniform, as expected (see Figures 2a and 2b), which implies that the probability of rejecting a true null hypothesis equals the alpha level for any value of alpha. On the other hand, when the larger variance is associated with the larger sample size, the frequency of p-values less than 5 percent decreases to 0.028 (see Figure 2c), and when the larger variance is associated with the smaller sample size, the frequency of p-values less than 5 percent increases to 0.083 (see Figure 2d). Welch’s t-test has a more stable Type 1 error rate (see Keselman et al., 1998; Keselman, Othman, Wilcox, & Fradette, 2004; Moser & Stevens, 1992; Zimmerman, 2004) Additional simulations, presented in the additional file, show that these scenarios are similar for several shapes of distributions (see tables A3.1 to A3.9 and table A4 in the additional file).
Moreover, as discussed previously, with very small SDRs, Welch’s t-test still has a better control of Type 1 error rates than Student’s t-test, even if neither of them give critical values (i.e., values under 0.025 or above 0.075, according to the definition of Bradley, 1975). With SDR = 1.1, when the larger variance is associated with the larger sample size, the frequency of Student’s p-value being less than 5 percent decreases to 0.046, and when the larger variance is associated with the smaller sample size, the frequency of Student’s p-value being less than 5 percent increases to 0.054. On the other side, the frequency of Welch’s p-values being below 0.05 is exactly 5 percent in both cases.
Yuen’s t-test is not a good unconditional alternative because we observe an unacceptable departure from the nominal alpha risk of 5 percent for several shapes of distributions (see tables A3.1, A3.4, A3.7, A3.8, and A3.9 in the additional file), particularly when we are studying asymmetric distributions of unequal shapes (see tables A3.8 and A3.9 in the additional file). Moreover, even when Yuen’s Type 1 error does not show a critical departure from the nominal alpha risk (i.e., values above 0.075), Welch’s t-test more accurately controls the Type 1 error rate (see tables A3.2, A3.3, A3.5, and A3.6 in the additional file). The Type 1 error rate of Welch’s t-test remains closer to the nominal size (i.e., 5%) in all the previously discussed cases and also performs better with very extreme SDRs and unbalanced designs, as long as there are at least 10 subjects per groups (See table A4 in the additional file).
In Figure 3, p-values from Welch’s t-test and Student’s t-test tests, shown separately in Figure 2 (through histograms), are now plotted against each other. Figure 3a shows Student’s p-values plotted against Welch’s p-values of Scenario 1, where the variance is the same in each group (SDR = 1) and sample sizes are unequal. Figure 3b displays Student’s p-values plotted against Welch’s p-values of Scenario 2, where the variance differs between groups (SDR = 2) but sample sizes are equal (n_{1} = n_{2} = 50). Figure 3c shows Student’s p-values plotted against Welch’s p-values of Scenario 3, where both sample sizes and variances are unequal between groups and the larger variance is associated with the larger sample size (SDR = 2). And, finally, figure 3d plots Student’s p-values against Welch’s p-values of Scenario 4, where the greater variance is associated with the smaller sample size (SDR = 0.5).
P-values from Student’s t-test against p-values from Welch’s t-test under the null.
Dots are marked on the black diagonal line when both tests return the same p-value. The top left quadrant contains all p-values less than 0.05 according to a Student’s t-test, but greater than 0.05 according to Welch’s t-test. The bottom right quadrant reports all p-values less than 0.05 according to Welch’s t-test, but greater than 0.05 according to Student’s t-test. The larger the standard deviations ratio and the greater the sample sizes ratio, the larger the difference between p-values from Welch’s t-test and Student’s t-test.
Conclusion
When the assumption of equal variances is not met, Student’s t-test yields unreliable results, while Welch’s t-test controls Type 1 error rates as expected. The widely recommended two-step approach, where the assumption of equal variances is tested using Levene’s test and, based on the outcome of this test, a choice of Student’s t-test or Welch’s t-test is made, should not be used. Because the statistical power for this test is often low, researchers will inappropriately choose Student’s t-test instead of more robust alternatives. Furthermore, as we have argued, it is reasonable to assume that variances are unequal in many studies in psychology, either because measured variables are used (e.g., age, culture, gender) or because, after random assignment to conditions, variance is increased in the experimental condition compared to the control condition due to the experimental manipulation. As it is explained in the additional file, Yuen’s t-test is not a better test than Welch’s t-test, since it often suffers high departure from the alpha risk of 5 percent. Therefore, we argue that Welch’s t-test should always be used instead of Student’s t-test.
When using Welch’s t-test, a very small loss in statistical power can occur, depending on the shape of the distributions. However, the Type 1 error rate is more stable when using Welch’s t-test compared to Student’s t-test, and Welch’s t-test is less dependent on assumptions that cannot be easily tested. Welch’s t-test is available in practically all statistical software packages (and already the default in R and Minitab) and is easy to use and report. We recommend that researchers make clear which test they use by specifying the analysis approach in the result section.
Convention is a weak justification for the current practice of using Student’s t-test by default. Psychologists should pay more attention to the assumptions underlying the tests they perform. The default use of Welch’s t-test is a straightforward way to improve statistical practice.
There is a Type 1 error rate inflation in a few cases where sample sizes are extremely small and SDR is big (e.g., when n_{1} = n_{2} = 3 are sampled from uniform distributions and SDR = 2, the Type 1 error rate = 0.083; or when n_{1} = 3 is sampled from a uniform distribution and n_{2} = 3 is sampled from a double exponential distribution). However, with extremely small sample sizes (N ≤ 5), the estimate of means and standard deviations is extremely inaccurate anyway. As we mentioned in table A2 (see the additional file), the smaller the sample size, the further the average standard deviation is from the population standard deviation, and the larger the dispersion around this average.
This is called the Behren-Fisher problem (Hayes & Cai, 2007).
In a simulation that explored Type 1 error rates, we varied the size of the first sample from 10 to 40 in steps of 10 and the sample sizes ratio and the standard deviation ratio from 0.5 to 2 in steps of 0.5, resulting in 64 simulations designs. Each design was tested 1,000,000 times. Considering these parameter values, we found that the alpha level can be inflated up to 0.11 or deflated down to 0.02 (see the additional file).
Other variants have been proposed, such as the percent trimmed mean (Lim & Loh, 1996).
Because sample sizes are equal for each pair of samples, which sample has the bigger standard deviation is not applicable. In this way, SDR = X will return the same answer in terms of percent power of Levene’s test as SDR = 1/X. For example, SDR = 2 will return the same answer as SDR = ½ = 0.5.
For example, many statistical users believe that the Mann-Whitney non-parametric test can cope with both normality and homosedasticity issues (Ruxton, 2006). This assumption is false, since the Mann-Whitney test remains sensitive to heterosedasticity (Grissom, 2000; Nachar, 2008; Neuhäuser & Ruxton, 2009).
Like Bryk and Raudenbush (1988), we note that unequal variances between groups does not systematically mean that population variances are different: standard deviation ratios are more or less biased estimates of population variance (see table A2 in the additional file). Differences can be a consequence of bias in measurement, such as response styles (Baumgartner & Steenkamp, 2001). However, there is no way to determine what part of the variability is due to error rather than the true population value.
Also known as the Satterwaite’s test, the Smith/Welch/Satterwaite test, the Aspin-Welch test, or the unequal variances t-test.
Competing Interests
The authors have no competing interests to declare.
Additional File
The additional file for this article can be found as follows:
DOI: https://doi.org/10.5334/irsp.82.s1
Author’s Note
All code needed to recreate the simulations resulting in the figures and appendices is available at https://osf.io/bver8/files/, as are as the .txt files containing the results of all simulations.
BaguleyT.BanerjeeA.ChitnisU.JadhavS.BhawalkarJ.ChaudhuryS.Hypothesis testing, type I and type II errorsBaumgartnerH.SteenkampJ.-B. E.Response styles in marketing research: A cross-national investigationBrykA. S.RaudenbushS. W.Heterogeneity of variance in experimental studies: A challenge to conventional interpretationsCarrollR. J.SchneiderH.A note on Levene’s tests for equality of variancesCowdryR. W.GardnerD. L.O’LearyK. M.LeibenluftE.RubinowD. R.Mood variability: A study of four groupsCummingG.Erceg-HurnD. M.MirosevichV. M.Modern robust statistical methods: An easy way to maximize the accuracy and power of your researchFeingoldA.Sex differences in variability in intellectual abilities: A new look at an old controversyGlassG. V.PeckhamP. D.SandersJ. R.Consequences of failure to meet assumptions underlying the fixed effects analyses of variance and covarianceGrissomR. J.Heterogeneity of variance in clinical dataHanushekE. A.WößmannL.Does educational tracking affect performance and inequality? Differences- in-differences evidence across countries*HayesA. F.CaiL.Further evaluating the conditional decision rule for comparing two independent meansHenrichJ.HeineS. J.NorenzayanA.Most people are not WEIRDHoekstraR.KiersH.JohnsonA.Are assumptions of well-known statistical techniques checked, and why (not)?KeppelG.Design and analysis: A researcher’s handbook1991Prentice-Hall, Inc.Retrieved from http://psycnet.apa.org/psycinfo/1991-98751-000KeselmanH. J.HubertyC. J.LixL. M.OlejnikS.CribbieR. A.DonahueB.LevinJ. R.Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analysesKeselmanH. J.OthmanA. R.WilcoxR. R.FradetteK.The new and improved two-sample t testKesterS. W.The communication of teacher expectations and their effects on the achievement and attitudes of secondary school pupils1969University of OklahomaRetrieved from https://shareok.org/handle/11244/2570LakensD.HilgardJ.StaaksJ.On the reproducibility of meta-analyses: Six practical recommendationsLimT.-S.LohW.-Y.A comparison of tests of equality of variancesLuhW.-M.GuoJ.-H.Approximate sample size formulas for the two-sample trimmed mean test with unequal variancesMoserB. K.StevensG. R.Homogeneity of variance in the two-sample means testNacharN.The Mann-Whitney U: A test for assessing whether two independent samples come from the same distributionNeuhäuserM.RuxtonG. D.Distribution-free two-sample comparisons in the case of heterogeneous variancesNimonK. F.Statistical assumptions of substantive analyses across the general linear model: A mini-reviewNordstokkeD. W.ZumboB. D.A Cautionary Tale about Levene’s Tests for Equal VariancesOverallJ. E.AtlasR. S.GibsonJ. M.Tests that are robust against variance heterogeneity in k × 2 designs with unequal cell frequenciesRakotomalalaR.Comparaison de populationsRaschD.KubingerK. D.ModerK.The two-sample t test: Pre-testing its assumptions does not pay offRuxtonG. D.The unequal variance t-test is an underused alternative to Student’s t-test and the Mann–Whitney U testShawR. G.Mitchell-OldsT.ANOVA for unbalanced data: An overviewSheskinD. J.ShieldsS.Functionalism, Darwinism, and the psychology of womenStudentThe probable error of a meanWallensteinS.ZuckerC. L.FleissJ. L.Some statistical methods useful in circulation researchWangH.SmithK. P.CombsE.BlakeT.HorsleyR. D.MuehlbauerG. J.Effect of population size and unbalanced data sets on QTL detection using genome-wide association mapping in barley breeding germplasmWelchB. L.The generalization of Student’s problem when several different population variances are involvedWilcoxR. R.GrangerD. A.ClarkF.Modern robust statistical methods: Basics with illustrations using psychobiological dataWilcoxR. R.KeselmanH. J.Modern robust data analysis methods: Measures of central tendencyYuenK. K.The two-sample trimmed t for unequal population variancesZimmermanD. W.Some properties of preliminary tests of equality of variances in the two-sample location problemZimmermanD. W.Statistical significance levels of nonparametric tests biased by heterogeneous variances of treatment groupsZimmermanD. W.A note on preliminary tests of equality of variancesZumboB. D.CoulombeD.Investigation of the robust rank-order test for non-normal populations with unequal variances: The case of reaction time