When n < 30 and the population standard deviation is not known, what is the appropriate distribution to use?

The t distribution is a continuous probability distribution that is symmetric and bell-shaped like the normal distribution but with a shorter peak and thicker tails. It was designed to factor in the greater uncertainty associated with small sample sizes.

The t distribution describes the variability of the distances between sample means and the population mean when the population standard deviation is unknown and the data approximately follow the normal distribution. This distribution has only one parameter, the degrees of freedom, based on (but not equal to) the sample size.

The t distribution, also known as Student's t distribution, was developed by William Sealy Gosset in 1908 for use with small sample sizes. Back then, the Z distribution and the corresponding Z-test were available to test means, but they are valid only for large sample sizes. There was no distribution designed for small samples.

Gosset was the Chief Brewer at the Guinness Brewery in Dublin and was dedicated to applying the scientific method to beer production. He needed a procedure for statistically analyzing small batches of barley. After he developed the t distribution for this purpose, the brewery wanted him to publish under a pen name so competitors would not learn about their methods. Hence, he published using the pseudonym "Student." That's why we have "Student's t-test" today!

When to Use the T Distribution

The essential uses for the t distribution are for finding p-values in t-tests and confidence intervals for means.

Use the t distribution when you need to assess the mean and do not know the population standard deviation. It’s particularly important to use it when you have a small (n < 30) sample size. More about this aspect below!

In the context of a t-test, it represents the sampling distribution of t-values for your design when the null hypothesis is true. Learn more about sampling distributions.

For more detailed information, read about using it to Find P values and Confidence Intervals.

To find the critical t-values using a table, see my T-table. It includes instructions and examples of how to use it.


Parameter – Degrees of Freedom

The t distribution has only one parameter, the degrees of freedom (DF). In t-tests, DF are linked to the sample size. For 1-sample and paired t-tests, DF = N – 1. For 2-sample t-tests, it equals N – 2. Hence, as the sample size increases, the DF also increases. Learn more about degrees of freedom.

Let’s see how changing the degrees of freedom affects it.

This graph illustrates how Gosset designed the t distribution to handle the greater uncertainty inherent with smaller samples. As the degrees of freedom increase, the curve pulls in tighter around zero—the tails become thinner and the peak becomes taller. The blue curve has the fewest DF (3) and it has the thickest tails. Conversely, the green curve has the most DF (20) and the thinnest tails.

The changing shapes are how it factors in the greater uncertainty when you have a smaller sample. The curves for smaller samples have thicker tails because small samples are more likely to produce unusual means than larger samples. However, as the sample size increases, unusual sample means become rarer, and the tails thin out.
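The effect of the degrees of freedom on the shape can be checked numerically. The following is a minimal sketch in base R, assuming the two curves described in the text (DF = 3 in blue, DF = 20 in green). The helper name plot_t_curves and the cutoff of |t| > 3 are illustrative choices, not something taken from the original graph.

```r
# Hypothetical helper: overlay the t densities for 3 and 20 degrees of freedom.
plot_t_curves <- function() {
  curve(dt(x, df = 20), from = -4, to = 4, col = "green",
        xlab = "t", ylab = "Density")
  curve(dt(x, df = 3), add = TRUE, col = "blue")
  legend("topright", legend = c("df = 3", "df = 20"),
         col = c("blue", "green"), lty = 1)
}

# Only draw when running in an interactive R session.
if (interactive()) plot_t_curves()

# Height of each curve at its peak (t = 0).
peak <- c(df3 = dt(0, df = 3), df20 = dt(0, df = 20))

# Probability of landing beyond |t| > 3 (both tails combined).
tail_prob <- c(df3 = 2 * pt(-3, df = 3), df20 = 2 * pt(-3, df = 20))

print(peak)       # the df = 20 curve has the taller peak
print(tail_prob)  # the df = 3 curve puts more probability in the tails
```

Comparing the two printed vectors shows the trade-off: fewer degrees of freedom move probability out of the peak and into the tails.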

Because the t distribution is a probability distribution, t-tests can use it to calculate probabilities like the p-value while factoring in the sample size.

At around 30 degrees of freedom, the t distribution closely approximates the standard normal distribution (Z-distribution), as shown below. Consequently, when your sample size exceeds ~30, t-tests and Z-tests provide very similar results.

In this graph, the blue curve is the standard normal distribution, while the red dashed curve is the t distribution with 30 degrees of freedom.
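To see how close the two curves are at 30 degrees of freedom, one can compare their critical values directly. This is a small sketch using the base R functions qt and qnorm; the particular probabilities in probs and the column names are arbitrary choices made for illustration.

```r
# Cumulative probabilities at which to compare the two distributions.
probs <- c(0.90, 0.95, 0.975, 0.995)

comparison <- data.frame(
  upper_tail_prob = 1 - probs,
  t_df30 = qt(probs, df = 30),  # t distribution with 30 degrees of freedom
  z = qnorm(probs)              # standard normal distribution
)

# The t critical value is always a little larger than the Z value.
comparison$difference <- comparison$t_df30 - comparison$z

print(round(comparison, 3))
```

The differences are small but never zero, and they grow further out in the tails, which is why the two tests give very similar rather than identical results.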

In the preceding discussion we have been using σ, the population standard deviation, to compute the standard error. However, we don't really know the population standard deviation, since we are working from samples. To get around this, we have been using the sample standard deviation (s) as an estimate. This is not a problem if the sample size is 30 or greater because of the central limit theorem. However, if the sample is small (n < 30), we have to adjust and use a t-value instead of a Z score in order to account for the smaller sample size and for using the sample SD.

Therefore, if n<30, use the appropriate t score instead of a z score, and note that the t-value will depend on the degrees of freedom (df) as a reflection of sample size. When using the t-distribution to compute a confidence interval, df = n-1.

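Calculation of a 95% confidence interval when n<30 will then use the appropriate t-value in place of Z in the formula:

$$\bar{x} \pm t \cdot \frac{s}{\sqrt{n}}$$

where $\bar{x}$ is the sample mean, s is the sample standard deviation, and n is the sample size.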

The T-distribution

One way to think about the t-distribution is that it is actually a large family of distributions that are similar in shape to the standard normal distribution, but adjusted to account for smaller sample sizes. A t-distribution for a small sample size would look like a squashed-down version of the standard normal distribution, but as the sample size increases the t-distribution will get closer and closer to approximating the standard normal distribution.

The table below shows a portion of the table for the t-distribution. Notice that sample size is represented by the "degrees of freedom" in the first column. For determining the confidence interval df=n-1. Notice also that this table is set up a lot differently than the table of Z scores. Here, only five levels of probability are shown in the column titles, whereas in the table of Z scores, the probabilities were in the interior of the table. Consequently, the levels of probability are much more limited here, because t-values depend on the degrees of freedom, which are listed in the rows.  

| Confidence Level | 80% | 90% | 95% | 98% | 99% |
|---|---|---|---|---|---|
| Two-sided test p-values | .20 | .10 | .05 | .02 | .01 |
| One-sided test p-values | .10 | .05 | .025 | .01 | .005 |
| Degrees of Freedom (df) | | | | | |
| 1 | 3.078 | 6.314 | 12.71 | 31.82 | 63.66 |
| 2 | 1.886 | 2.920 | 4.303 | 6.965 | 9.925 |
| 3 | 1.638 | 2.353 | 3.182 | 4.541 | 5.841 |
| 4 | 1.533 | 2.132 | 2.776 | 3.747 | 4.604 |
| 5 | 1.476 | 2.015 | 2.571 | 3.365 | 4.032 |
| 6 | 1.440 | 1.943 | 2.447 | 3.143 | 3.707 |
| 7 | 1.415 | 1.895 | 2.365 | 2.998 | 3.499 |
| 8 | 1.397 | 1.860 | 2.306 | 2.896 | 3.355 |
| 9 | 1.383 | 1.833 | 2.262 | 2.821 | 3.250 |
| 10 | 1.372 | 1.812 | 2.228 | 2.764 | 3.169 |
| 11 | 1.362 | 1.796 | 2.201 | 2.718 | 3.106 |
| 12 | 1.356 | 1.782 | 2.179 | 2.681 | 3.055 |
| 13 | 1.350 | 1.771 | 2.160 | 2.650 | 3.012 |
| 14 | 1.345 | 1.761 | 2.145 | 2.624 | 2.977 |
| 15 | 1.341 | 1.753 | 2.131 | 2.602 | 2.947 |
| 16 | 1.337 | 1.746 | 2.120 | 2.583 | 2.921 |
| 17 | 1.333 | 1.740 | 2.110 | 2.567 | 2.898 |
| 18 | 1.330 | 1.734 | 2.101 | 2.552 | 2.878 |
| 19 | 1.328 | 1.729 | 2.093 | 2.539 | 2.861 |
| 20 | 1.325 | 1.725 | 2.086 | 2.528 | 2.845 |

Notice that the value of t is larger for smaller sample sizes (i.e., lower df). When we use "t" instead of "Z" in the equation for the confidence interval, it will result in a larger margin of error and a wider confidence interval reflecting the smaller sample size.
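The size of that penalty can be made concrete by comparing the 95% critical t-values against Z for a few sample sizes. This is a sketch in base R; the particular df values are picked from the rows of the table purely for illustration.

```r
# A few degrees of freedom taken from the rows of the t table.
df <- c(1, 5, 10, 20)

# A two-sided 95% interval leaves 0.025 in each tail.
t_crit <- qt(0.975, df)
z_crit <- qnorm(0.975)

# How many times wider the margin of error is with t than with Z,
# holding the standard error fixed.
ratio <- t_crit / z_crit

print(data.frame(df = df,
                 t_crit = round(t_crit, 3),
                 ratio_to_z = round(ratio, 2)))
```

Because the margin of error is the critical value times the standard error, the ratio column is exactly the factor by which the interval widens when t replaces Z.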

With an infinitely large sample size the t-distribution and the standard normal distribution will be the same, and for samples greater than 30 they will be similar, but the t-distribution will be somewhat more conservative. Consequently, one can always use a t-distribution instead of the standard normal distribution. However, when you want to compute a 95% confidence interval for an estimate from a large sample, it is easier to just use Z=1.96.

Because the t-distribution is, if anything, more conservative, R relies heavily on the t-distribution.

Test Yourself

Problem #1

Using the table above, what is the critical t score for a 95% confidence interval if the sample size (n) is 11?


Problem #2

A sample of n=10 patients free of diabetes have their body mass index (BMI) measured. The mean is 27.26 with a standard deviation of 2.10. Generate a 90% confidence interval for the mean BMI among patients free of diabetes.


Confidence Intervals for a Mean Using R

Instead of using the table, you can use R to generate t-values. For example, to generate t values for calculating a 95% confidence interval, use the function qt(1-tail area,df).

For example, if the sample size is 15, then df=14, and we can calculate the t-scores for the lower and upper tails of the 95% confidence interval in R:

> qt(0.025,14)
[1] -2.144787
> qt(0.975,14)
[1] 2.144787

Then, to compute the 95% confidence interval we could plug t=2.144787 into the equation:

$$\bar{x} \pm t \cdot \frac{s}{\sqrt{n}} = \bar{x} \pm 2.144787 \cdot \frac{s}{\sqrt{15}}$$
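The whole calculation (sample mean plus or minus t times s divided by the square root of n) can also be wrapped in a small function. The sketch below assumes only summary statistics are available. The function name ci_from_summary and its argument names are hypothetical, not part of base R. The example call reuses the summary statistics given in Problem #2 (n = 10, mean 27.26, SD 2.10, 90% confidence).

```r
# Hypothetical helper (not a base R function): t-based confidence interval
# for a mean, computed from summary statistics.
ci_from_summary <- function(xbar, s, n, conf.level = 0.95) {
  df <- n - 1                        # degrees of freedom for one sample
  alpha <- 1 - conf.level            # total area left in the two tails
  t_crit <- qt(1 - alpha / 2, df)    # upper critical t-value
  margin <- t_crit * s / sqrt(n)     # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

# Summary statistics from Problem #2: n = 10, mean 27.26, SD 2.10.
ci90 <- ci_from_summary(xbar = 27.26, s = 2.10, n = 10, conf.level = 0.90)
print(round(ci90, 2))
```

Calling the function with conf.level = 0.95 and n = 15 uses the same t = 2.144787 computed with qt(0.975,14).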

Confidence Intervals from Raw Data Using R

It is also easy to compute the point estimate and 95% confidence interval from a raw data set using the "t.test" function in R. For example, in the data set from the Weymouth Health Survey I could compute the mean and 95% confidence interval for BMI as follows. First, I would load the data set and give it a short nickname. Then I would attach the data set and use the following command:

> t.test(bmi)

The output would look like this: 

One Sample t-test

data:  bmi
t = 228.5395, df = 3231, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
26.66357 27.12504

sample estimates:
mean of x
26.8943

R defaults to computing a 95% confidence interval, but you can specify a different confidence level as follows:

> t.test(bmi,conf.level=.90)

This would compute a 90% confidence interval.
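If the interval is needed for further calculation rather than just printed, it can be pulled out of the object that t.test returns. The sketch below uses a short vector of made-up BMI values (the Weymouth data are not reproduced here, so bmi_example is purely hypothetical) and checks that t.test gives the same interval as the formula computed by hand.

```r
# Hypothetical BMI values, invented only to make the example runnable.
bmi_example <- c(24.1, 27.3, 22.8, 30.5, 26.0, 28.4, 25.2, 29.1)

# 90% confidence interval from t.test; conf.int holds the two limits.
res <- t.test(bmi_example, conf.level = 0.90)
ci_ttest <- as.numeric(res$conf.int)

# The same interval by hand: mean +/- t * s / sqrt(n), with df = n - 1.
n <- length(bmi_example)
margin <- qt(0.95, df = n - 1) * sd(bmi_example) / sqrt(n)
ci_manual <- c(mean(bmi_example) - margin, mean(bmi_example) + margin)

print(ci_ttest)
print(ci_manual)
```

The two printed intervals agree, which shows that t.test is applying the t-based formula with df = n - 1.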

Test Yourself

Lozoff and colleagues compared developmental outcomes in children who had been anemic in infancy to those in children who had not been anemic. Some of the data are shown in the table below.

| Mean ± SD | Anemia in Infancy (n=30) | Non-anemic in Infancy (n=133) |
|---|---|---|
| Gross Motor Score | 52.4 ± 14.3 | 58.7 ± 12.5 |
| Verbal IQ | 101.4 ± 13.2 | 102.9 ± 12.4 |

Source: Lozoff et al.: Long-term Developmental Outcome of Infants with Iron Deficiency, NEJM, 1991

Compute the 95% confidence interval for verbal IQ using the t-distribution


