Suppose you run into the following situation at your company: the head of the purchasing department needs to decide which supplier to choose based on delivery time. Then the project manager approaches you with doubts about whether implementing the WMS brought any operational benefit, and suggests analyzing picking time before and after the implementation.
Since you can easily retrieve the historical data for both cases, you commit to helping with the analysis. After collecting the data from the last 15 periods, you build two tables with the respective times:

From this, you make the following analysis:
“There is a difference in times in both cases. In the first case, however, the difference between the delivery times is small and probably not very significant: in percentage terms, it is only 2.2% between the average delivery times (19.3 days versus 18.8 days). In the second case, there was a significant reduction in picking time, which fell from 8.01 minutes to 7.44 minutes (a reduction of 7.2%).”
And you conclude:
“We can say that the delivery time of the two suppliers is practically the same, with an insignificant difference. The same is not true of the second case, as the implementation of the WMS brought a visible reduction in picking time.”
Well, here is the problem: statistically speaking, this analysis is flawed, because it simply disregards the dispersion of the data. In fact, at a 95% confidence level, the average delivery time of supplier 1 is significantly greater than that of supplier 2, while the implementation of the WMS did not significantly reduce picking time in the company.
Next, we will look at this visually and, in order to facilitate our explanation, we will initially consider two normally distributed data sets.


Recalling some basic concepts of statistics: in a normal curve, the area shaded in blue corresponds to 95% of the total area, that is, values between μ-1.96σ and μ+1.96σ represent 95% of the data. Note that in the second curve this range is much narrower, as the observations are less dispersed.
To illustrate what this means: suppose that the first curve represents observations with a mean of 0 and a standard deviation of 1, and the second curve represents observations with a mean of 0 and a standard deviation of 0.5. Although both have the same mean, in the first case 95% of the values are between -1.96 and +1.96, while in the second case 95% of the values are between -0.98 and +0.98. The larger the standard deviation, the wider this interval. In other words: greater dispersion in the data means greater uncertainty when we estimate the mean from a sample.
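The two intervals above can be reproduced with a few lines of Python. This is only a minimal sketch using the standard library; the helper name central_interval is an assumption made for illustration and is not part of the original analysis.

```python
from statistics import NormalDist

def central_interval(mean, sd, coverage=0.95):
    """Return the (low, high) range that holds `coverage` of a normal distribution."""
    dist = NormalDist(mu=mean, sigma=sd)
    tail = (1 - coverage) / 2          # probability left out on each side
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

# Same mean, different dispersion (the two curves described in the text)
wide = central_interval(0, 1.0)
narrow = central_interval(0, 0.5)

print(f"sd = 1.0: {wide[0]:.2f} to {wide[1]:.2f}")      # -1.96 to 1.96
print(f"sd = 0.5: {narrow[0]:.2f} to {narrow[1]:.2f}")  # -0.98 to 0.98
```

Halving the standard deviation halves the width of the interval, which is exactly the effect the two curves illustrate.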
Let us return to our initial tables. How do we check whether the means of two samples differ at a 95% confidence level? In these cases we should perform the t-test, which relies on Student's t distribution rather than the normal distribution, because the population standard deviation is unknown and has to be estimated from the sample.
Unlike the normal distribution, Student's t distribution has heavier tails when the sample size is small, and it approaches the normal distribution as the number of observations increases. The t-test is used to compare the means of up to two samples; to compare the means of three or more samples, a different test should be applied, namely ANOVA.
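One way to see the heavier tails numerically is to compare the probability of landing beyond 1.96 under the t distribution, for several degrees of freedom, with the same probability under the normal distribution. The sketch below is an illustration only: the helper names t_pdf and t_upper_tail are hypothetical, and the tail probability is obtained by simple numerical integration (Simpson's rule) instead of a statistics library.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    # Work in logs so that large df values do not overflow the gamma function
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, integrating the density on [0, t] with Simpson's rule."""
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3         # the density is symmetric around zero

normal_tail = 1 - NormalDist().cdf(1.96)
tails = {df: t_upper_tail(1.96, df) for df in (5, 14, 28, 1000)}

for df, p in tails.items():
    print(f"df = {df:4d}: P(T > 1.96) = {p:.4f}")
print(f"normal   : P(Z > 1.96) = {normal_tail:.4f}")
```

The tail probability shrinks as the degrees of freedom grow and gets close to the normal value of about 0.025, which is why small samples need larger critical values.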
Based on the data tables, we will create a visual representation of the data dispersions:
1 – Visual representation of the distribution of supplier lead times (independent samples).

Looking at the 95% confidence intervals for each supplier, we see that the upper limit for the supplier with the lower average lies below the lower limit for the supplier with the higher average, so the intervals do not overlap. This visual analysis is merely illustrative; the formal decision criterion is given by the t statistic and the p-value.
It is also important to emphasize that, in this example, we assumed equal variances. Traditionally, this assumption is checked with the F test. In modern applications, however, it is common to go straight to Welch's t-test, which does not assume equal variances and automatically adjusts the degrees of freedom (we will address these two topics in later posts).
t-test for two independent samples with equal variances
Our interest is to verify if supplier 1 has a greater average time than supplier 2. Since the hypothesis is directional, we use a one-tailed test.
Hypotheses:
• H₀: μ₁ ≤ μ₂
• H₁: μ₁ > μ₂
The formula for the t-test for 2 independent samples with equal variances is given by:
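In standard textbook notation, the pooled statistic is usually written as follows. The symbols are assumptions introduced here, since they are not defined in the text: \bar{x}_i, s_i^2 and n_i denote the mean, variance and size of sample i, and s_p^2 is the pooled variance.

```latex
\[
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2},
\qquad
\text{df} = n_1 + n_2 - 2
\]
```

With 15 observations per supplier, this gives 15 + 15 - 2 = 28 degrees of freedom.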

Applying the formula we have:

With 28 degrees of freedom (15 + 15 - 2), the one-tailed critical value for a significance level of 5% is approximately 1.701. The t value of 3.775 is well above the critical value; therefore, we reject H₀.
The p-value of this test is approximately 0.0004. This means that the result would remain statistically significant even at a confidence level as high as about 99.96%!
Confidence levels close to 100% are common in fields such as medicine, for example when certifying the effectiveness of a drug in a group of people (a case for the paired t-test, as we will see next). For our purposes, however, it is enough to say that, at a 95% confidence level, supplier 1 has a significantly greater average delivery time than supplier 2 (p-value below 0.05).
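For readers who want to reproduce this kind of calculation, here is a minimal sketch of the equal-variance t statistic in plain Python. The function name pooled_t and the lead times below are hypothetical: they are NOT the data from the tables in this article and serve only to show the mechanics.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(sample_1, sample_2):
    """Return (t, df) for an independent two-sample t-test assuming equal variances."""
    n1, n2 = len(sample_1), len(sample_2)
    df = n1 + n2 - 2
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * variance(sample_1)
                  + (n2 - 1) * variance(sample_2)) / df
    std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(sample_1) - mean(sample_2)) / std_error, df

# Hypothetical lead times in days (illustrative only, not the article's data)
supplier_1 = [19.5, 19.1, 19.6, 19.0, 19.4, 19.2, 19.7, 19.3]
supplier_2 = [18.9, 18.6, 19.0, 18.7, 18.8, 18.5, 19.1, 18.8]

t_stat, df = pooled_t(supplier_1, supplier_2)
print(f"t = {t_stat:.3f}, df = {df}")
```

The resulting t is then compared with the one-tailed critical value for the corresponding degrees of freedom (1.701 for the 28 degrees of freedom in our example). If SciPy is available, scipy.stats.ttest_ind with equal_var=True should give the same statistic together with the p-value.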
Now let's analyze the picking time data...
2 – Representation of the distribution of picking times before and after the implementation of the WMS (paired samples).

Note that, in our illustrative representation, the upper limit of the interval with the lower average exceeds the lower limit of the interval with the higher average. This overlap is shown by the green hatched area. In cases like this, we may not be able, at a confidence level of 95%, to dismiss the hypothesis that the times are equal; as before, the formal decision comes from the test itself.
t-test for two paired samples
Unlike the first case, where we performed the t-test on two independent and distinct samples, this time we are analyzing the same sample at two different moments, so we use the paired t-test.
Hypotheses (one-tailed, since we want to know whether picking time decreased; the differences are taken as after minus before):
• H₀: the mean of the differences is greater than or equal to zero (no reduction)
• H₁: the mean of the differences is less than zero (picking time decreased)
The formula for the paired t-test is given by:
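In standard textbook notation, the paired statistic is usually written as follows. The symbols are assumptions introduced here: d_i is the difference within pair i (after minus before), \bar{d} is the mean of those differences, s_d is their sample standard deviation, and n is the number of pairs.

```latex
\[
t = \frac{\bar{d}}{s_d / \sqrt{n}},
\qquad
s_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}},
\qquad
\text{df} = n - 1
\]
```

With 15 paired observations, this gives 15 - 1 = 14 degrees of freedom.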

Applying the formula we have:

With 14 degrees of freedom (15 - 1), the one-tailed critical value for a significance level of 5% is approximately -1.761. The t value of -1.675 does not go beyond it (1.675 is smaller than 1.761 in absolute value), and the one-tailed p-value is approximately 0.058, which is above 0.05. Thus, at a confidence level of 95%, we cannot reject H₀: the data do not show a statistically significant reduction in picking time. In other words, although the percentage reduction seems relevant (7.2%), the variability of the data prevents this difference from being considered statistically significant.
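The paired calculation can be sketched in the same way. The function name paired_t and the picking times below are hypothetical: they are NOT the data from the tables in this article. The differences are taken as after minus before, so a reduction in time produces a negative t.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Return (t, df) for a paired t-test on the differences after - before."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    # Mean difference divided by its standard error
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

# Hypothetical picking times in minutes (illustrative only, not the article's data)
before = [8.2, 7.9, 8.4, 7.6, 8.1, 8.3]
after = [7.9, 8.0, 7.8, 7.7, 7.5, 8.2]

t_stat, df = paired_t(before, after)
print(f"t = {t_stat:.3f}, df = {df}")
```

Because the test works on the differences within each pair, only the variability of those differences matters, not the variability of the raw times. If SciPy is available, scipy.stats.ttest_rel should give the same statistic together with the p-value.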
Important considerations
The t-test works well even with small samples (like 15 observations), as long as there is no strong skewness in the data. The Central Limit Theorem indicates that sample means tend towards normality as sample size increases; with small samples, we use the t distribution precisely to account for the additional uncertainty that comes from estimating the standard deviation from so few observations.
Conclusion
Comparing only percentages can lead to misguided decisions.
In the first case, a seemingly small difference (2.2%) turned out to be statistically significant. In the second case, a seemingly larger reduction (7.2%) was not sufficient to ensure statistical significance.
The lesson is clear: in data analysis, means do not tell the whole story. Dispersion is as important as the magnitude of the difference.