# Statistical Background on DOE

Variations occur in nature, be it the tensile strength of a particular grade of steel, the caffeine content in your energy drink or the distance traveled by your vehicle in a day. Variations are also seen in the observations recorded during multiple executions of a process, even when all factors are strictly maintained at their respective levels and all the executions are run as identically as possible. The natural variations that occur in a process, even when all conditions are maintained at the same level, are often called *noise*. When the effect of a particular factor on a process is studied, it becomes extremely important to distinguish the changes in the process caused by the factor from noise. A number of statistical methods are available to achieve this.

This chapter covers basic statistical concepts that are useful in understanding the statistical analysis of data obtained from designed experiments. The initial sections discuss the normal distribution and related concepts, because the assumption of normality is widely used in the analysis of designed experiments. The subsequent sections introduce the standard normal, chi-squared, $t$ and $F$ distributions, which are widely used in calculations related to hypothesis testing and confidence bounds. This chapter also covers hypothesis testing. It is important to gain a clear understanding of hypothesis testing because this concept finds direct application in the analysis of designed experiments to determine whether or not a particular factor is significant [Wu, 2000].

# Basic Concepts

## Random Variables and the Normal Distribution

If you record the distance traveled by your car every day, you'll notice that these values show some variation because your car does not travel the exact same distance every day. If a variable $X$ is used to denote these values, then $X$ is considered a *random variable* (because of the diverse and unpredictable values $X$ can have). Random variables are denoted by uppercase letters, while a measured value of the random variable is denoted by the corresponding lowercase letter. For example, if the distance traveled by your car on January 1 was 10.7 miles, then:

$$x = 10.7 \text{ miles}$$

A commonly used distribution to describe the behavior of random variables is the normal distribution. When you calculate the mean and standard deviation for a given data set, a common assumption is that the data follow a normal distribution. A normal distribution (also referred to as the Gaussian distribution) is a bell-shaped curve (see figure below). The mean, $\mu$, and standard deviation, $\sigma$, are the two parameters of this distribution. The mean determines the location of the distribution on the x-axis and is also called the *location parameter*. The standard deviation determines the spread of the distribution (how narrow or wide) and is thus called the *scale parameter*. The standard deviation, or its square, called the *variance*, gives an indication of the variability or spread of the data. A large value of the standard deviation (or variance) implies that a large amount of variability exists in the data.

Each curve in the figure below is a probability density function, or *pdf*, of the normal distribution: the area under the curve over a particular interval gives the probability that $X$ falls in that interval. For instance, if you obtained the mean and standard deviation for the distance data of your car as 15 miles and 2.5 miles respectively, then the probability that your car travels a distance between 7 miles and 14 miles is given by the area under the curve between these two values, which is calculated to be 34.4% (see figure below). This means that on roughly 34 out of every 100 days your car travels, it can be expected to cover a distance in the range of 7 to 14 miles.
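
The area quoted above can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative and not part of the original text.

```python
from statistics import NormalDist

# Daily distance model from the text: mean 15 miles, standard deviation 2.5 miles.
distance = NormalDist(mu=15.0, sigma=2.5)

# P(7 <= X <= 14) is the area under the pdf between 7 and 14, which equals
# the difference of the cumulative distribution function at the two ends.
prob_7_to_14 = distance.cdf(14.0) - distance.cdf(7.0)

print(f"P(7 <= X <= 14) = {prob_7_to_14:.3f}")
```

The printed probability should agree with the 34.4% read from the figure.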

On a normal probability density function, the area under the curve between the values of $\mu - 3\sigma$ and $\mu + 3\sigma$ is approximately 99.7% of the total area under the curve. For the distance example this implies that almost all the time (99.7% of the time) the distance traveled will fall in the range of 7.5 miles to 22.5 miles ($15 \pm 3 \times 2.5$). Similarly, $\mu \pm 2\sigma$ covers approximately 95% of the area under the curve and $\mu \pm \sigma$ covers approximately 68% of the area under the curve.
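
These coverage figures do not depend on the particular mean or standard deviation, so they can be verified once on the standard normal distribution. A short sketch, again assuming only the standard library:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in coverage.items():
    print(f"within {k} standard deviation(s): {area:.4f}")
```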

## Population Mean, Sample Mean and Variance

If data for all of the population under investigation is known, then the mean and variance for this population can be calculated as follows:

**Population Mean:**

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

**Population Variance:**

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Here, $N$ is the size of the population.

The population standard deviation is the positive square root of the population variance.

Most of the time it is not possible to obtain data for the entire population. For example, it is impossible to measure the height of every male in a country to determine the average height and variance for males of a particular country. In such cases, results for the population have to be estimated using samples. This process is known as *statistical inference*. Mean and variance for a sample are calculated using the following relations:

**Sample Mean:**

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

**Sample Variance:**

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Here, $n$ is the sample size.

The sample standard deviation is the positive square root of the sample variance.

The sample mean and variance of a random sample can be used as estimators of the population mean and variance, respectively. The sample mean and variance are referred to as *statistics*. A statistic is any function of observations in a random sample.

You may have noticed that the denominator in the calculation of sample variance, unlike the denominator in the calculation of population variance, is $n - 1$ and not $n$. The reason for this difference is explained in the section Unbiased and Biased Estimators.
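
The difference between the two denominators can be seen on a small data set. The sketch below uses the standard-library `statistics` module; the distance values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical daily distances in miles (illustrative values only).
distances = [10.7, 14.2, 15.9, 13.1, 17.4, 16.0, 12.8]
n = len(distances)

sample_mean = statistics.mean(distances)

# Treating the data as the entire population: sum of squares divided by N.
population_variance = statistics.pvariance(distances)

# Treating the data as a sample: sum of squares divided by n - 1.
sample_variance = statistics.variance(distances)
sample_std_dev = statistics.stdev(distances)

print(sample_mean, population_variance, sample_variance, sample_std_dev)
```

Because both variances share the same sum of squared deviations, `sample_variance * (n - 1)` equals `population_variance * n`.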

## Central Limit Theorem

The Central Limit Theorem states that for a large sample size, $n$:

- The sample means from a population are approximately normally distributed with a mean value equal to the population mean, $\mu$, even if the population is not normally distributed.

  - What this means is that if random samples are drawn from any population and the sample mean, $\bar{x}$, is calculated for each of these samples, then these sample means would follow the normal distribution with a mean (or location parameter) equal to the population mean, $\mu$. Thus, the distribution of the statistic, $\bar{x}$, would be a normal distribution with mean $\mu$. The distribution of a statistic is called the *sampling distribution*.

- The variance, $\sigma_{\bar{x}}^2$, of the sample means would be $n$ times smaller than the variance of the population, $\sigma^2$.

  - This implies that the sampling distribution of the sample means would have a variance equal to $\sigma^2/n$ (or a scale parameter equal to $\sigma/\sqrt{n}$), where $\sigma$ is the population standard deviation. The standard deviation of the sampling distribution of an estimator is called the *standard error* of the estimator. Thus the standard error of the sample mean is $\sigma/\sqrt{n}$.

In short, the Central Limit Theorem states that the sampling distribution of the sample mean is a normal distribution with parameters $\mu$ and $\sigma/\sqrt{n}$, as shown in the figure below.
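
A small simulation can make the theorem concrete. The sketch below is an illustration under assumed settings (an exponential population with mean 1 and variance 1, samples of size 30, and a fixed random seed); none of these choices come from the original text.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 30              # size of each sample
num_samples = 5000  # number of samples drawn

# Exponential population with rate 1: mean 1, variance 1, clearly non-normal.
population_mean = 1.0
population_variance = 1.0

sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
variance_of_means = statistics.pvariance(sample_means)

print(f"mean of sample means:     {mean_of_means:.4f}")
print(f"variance of sample means: {variance_of_means:.4f}")
print(f"population variance / n:  {population_variance / n:.4f}")
```

The mean of the sample means should sit close to the population mean, and their variance close to the population variance divided by $n$.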

## Unbiased and Biased Estimators

If the mean value of an estimator equals the true value of the quantity it estimates, then the estimator is called an *unbiased estimator* (see figure below). For example, assume that the sample mean is being used to estimate the mean of a population. Using the Central Limit Theorem, the mean value of the sample mean equals the population mean. Therefore, the sample mean is an unbiased estimator of the population mean.
If the mean value of an estimator is either less than or greater than the true value of the quantity it estimates, then the estimator is called a *biased estimator*. For example, suppose you decide to choose the smallest observation in a sample to be the estimator of the population mean. Such an estimator would be biased because the average of the values of this estimator would always be less than the true population mean. In other words, the mean of the sampling distribution of this estimator would be less than the true value of the population mean it is trying to estimate. Consequently, the estimator is a biased estimator.

A case of biased estimation is seen to occur when the sample variance, $s^2$, is used to estimate the population variance, $\sigma^2$, if the following relation is used to calculate the sample variance:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

The sample variance calculated using this relation is, on average, less than the true population variance. This is because deviations with respect to the sample mean, $\bar{x}$, are used to calculate the sample variance. Sample observations, $x_i$, tend to be closer to $\bar{x}$ than to $\mu$. Thus, the calculated deviations are smaller. As a result, the sample variance obtained is smaller than the population variance. To compensate for this, $n - 1$ is used as the denominator in place of $n$ in the calculation of sample variance. Thus, the correct formula to obtain the sample variance is:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

It is important to note that although using $n - 1$ as the denominator makes the sample variance, $s^2$, an unbiased estimator of the population variance, $\sigma^2$, the sample standard deviation, $s$, still remains a biased estimator of the population standard deviation, $\sigma$. For large sample sizes this bias is negligible.
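
The bias can be observed by simulation. The following sketch assumes a normal population with standard deviation 2 (variance 4), a deliberately small sample size of 5, and a fixed seed; these values are illustrative assumptions.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

n = 5                # a small sample size makes the bias easy to see
num_samples = 20000
true_variance = 4.0  # population is normal with standard deviation 2

biased, unbiased = [], []
for _ in range(num_samples):
    sample = [random.gauss(0.0, 2.0) for _ in range(n)]
    biased.append(statistics.pvariance(sample))   # divides by n
    unbiased.append(statistics.variance(sample))  # divides by n - 1

avg_biased = statistics.fmean(biased)
avg_unbiased = statistics.fmean(unbiased)

print(f"average with denominator n:     {avg_biased:.3f}")
print(f"average with denominator n - 1: {avg_unbiased:.3f}")
```

With the denominator $n$, the average estimate settles near $(n-1)/n$ times the true variance (about 3.2 here), while the $n - 1$ version settles near 4.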

## Degrees of Freedom (dof)

The number of *degrees of freedom* is the number of independent observations made in excess of the unknowns. If there are 3 unknowns and 7 independent observations are taken, then the number of degrees of freedom is 4 (7-3). As another example, two parameters are needed to specify a line. Therefore, there are 2 unknowns. If 10 points are available to fit the line, the number of degrees of freedom is 8 (10-2).

## Standard Normal Distribution

A normal distribution with mean $\mu = 0$ and variance $\sigma^2 = 1$ is called the *standard normal distribution* (see figure below). Standard normal random variables are denoted by $Z$. If $X$ represents a normal random variable that follows the normal distribution with mean $\mu$ and variance $\sigma^2$, then the corresponding standard normal random variable is:

$$Z = \frac{X - \mu}{\sigma}$$

$Z$ represents the distance of $X$ from the mean $\mu$ in terms of the standard deviation $\sigma$.

## Chi-Squared Distribution

If $Z$ is a standard normal random variable, then the distribution of $Z^2$ is a *chi-squared* distribution (see figure below).

A chi-squared random variable is represented by $\chi^2$. Thus:

$$\chi^2 = Z^2$$

The distribution of the variable $\chi^2$ in the previous equation is also referred to as *centrally distributed chi-squared* with one degree of freedom. The degree of freedom is 1 here because the chi-squared random variable is obtained from a single standard normal random variable $Z$. The previous equation may also be written with the degree of freedom included as a subscript:

$$\chi_1^2 = Z^2$$

If $Z_1$, $Z_2$, ..., $Z_m$ are independent standard normal random variables, then:

$$\chi_m^2 = Z_1^2 + Z_2^2 + \cdots + Z_m^2$$

is also a chi-squared random variable. The distribution of $\chi_m^2$ is said to be *centrally distributed chi-squared* with $m$ degrees of freedom, as the chi-squared random variable is obtained from $m$ independent standard normal random variables.

If $X$ is a normal random variable with unit variance and a nonzero mean, then the distribution of $X^2$ is said to be *non-centrally distributed* chi-squared with one degree of freedom. Therefore, $X^2$ is a chi-squared random variable and can be represented as:

$$\chi_1^2 = X^2$$

If $X_1$, $X_2$, ..., $X_m$ are independent normal random variables of this kind, then:

$$\chi_m^2 = X_1^2 + X_2^2 + \cdots + X_m^2$$

is a non-centrally distributed chi-squared random variable with $m$ degrees of freedom.

## Student's t Distribution (t Distribution)

If $Z$ is a standard normal random variable, $\chi_k^2$ is a chi-squared random variable with $k$ degrees of freedom, and both of these random variables are independent, then the distribution of the random variable $T$ such that:

$$T = \frac{Z}{\sqrt{\chi_k^2 / k}}$$

is said to follow the $t$ distribution with $k$ degrees of freedom.

The $t$ distribution is similar in appearance to the standard normal distribution (see figure below). Both of these distributions are symmetric, reaching a maximum at the mean value of zero. However, the $t$ distribution has heavier tails than the standard normal distribution, implying that it has more probability in the tails. As the degrees of freedom, $k$, of the $t$ distribution approach infinity, the distribution approaches the standard normal distribution.

## F Distribution

If $\chi_u^2$ and $\chi_v^2$ are two independent chi-squared random variables with $u$ and $v$ degrees of freedom, respectively, then the distribution of the random variable $F$ such that:

$$F = \frac{\chi_u^2 / u}{\chi_v^2 / v}$$

is said to follow the $F$ distribution with $u$ degrees of freedom in the numerator and $v$ degrees of freedom in the denominator. The $F$ distribution resembles the chi-squared distribution (see the following figure). This is because the $F$ random variable, like the chi-squared random variable, is non-negative and the distribution is skewed to the right (a right skew means that the distribution is asymmetric and has a long right tail). The $F$ random variable is usually abbreviated by including the degrees of freedom as $F_{u,v}$.

# Hypothesis Testing

A statistical hypothesis is a statement about the population under study or about the distribution of a quantity under consideration. The null hypothesis, $H_0$, is the hypothesis to be tested. It is a statement about a theory that is believed to be true but has not been proven. For instance, if a new product design is thought to perform consistently, regardless of the region of operation, then the null hypothesis may be stated as:

$H_0$: New product design performance is not affected by region.

Statements in $H_0$ always include exact values of the parameters under consideration. For example:

$H_0$: The population mean is 100.

Or simply:

$$H_0: \mu = 100$$

Rejection of the null hypothesis, $H_0$, leads to the possibility that the alternative hypothesis, $H_1$, may be true. Given the previous null hypothesis, the alternative hypothesis may be:

$H_1$: New product design performance is affected by region.

In the case of the example regarding inference on the population mean, the alternative hypothesis may be stated as:

$H_1$: The population mean is not 100.

Or simply:

$$H_1: \mu \neq 100$$

Hypothesis testing involves the calculation of a test statistic based on a random sample drawn from the population. The test statistic is then compared to the critical value(s) and used to make a decision about the null hypothesis. The critical values are set by the analyst.

The outcome of a hypothesis test is that we either *reject* $H_0$ or we *fail to reject* $H_0$. Failing to reject $H_0$ implies that we did not find sufficient evidence to reject $H_0$. It does not necessarily mean that there is a high probability that $H_0$ is true. As such, the terminology *accept* $H_0$ is not preferred.

For example, assume that an analyst wants to know if the mean of a certain population is 100 or not. The statements for this hypothesis test can be written as follows:

$$H_0: \mu = 100$$

$$H_1: \mu \neq 100$$

The analyst decides to use the sample mean as the test statistic for this test. The analyst further decides that if the sample mean lies between 98 and 102 it can be concluded that the population mean is 100. Thus, the critical values set for this test by the analyst are 98 and 102. It is also decided to draw out a random sample of size 25 from the population.

Now assume that the true population mean is $\mu = 100$ and the true population standard deviation is $\sigma = 5$. This information is not known to the analyst. Using the Central Limit Theorem, the test statistic (sample mean) will follow a normal distribution with a mean equal to the population mean, $\mu$, and a standard deviation of $\sigma/\sqrt{n}$, where $n$ is the sample size. Therefore, the distribution of the test statistic has a mean of 100 and a standard deviation of $5/\sqrt{25} = 1$. This distribution is shown in the figure below.

The unshaded area in the figure bounded by the critical values of 98 and 102 is called the *acceptance region*. Its area is the probability that a random sample drawn from the population has a sample mean between 98 and 102. Therefore, this is the region that leads to the "acceptance" of $H_0$. On the other hand, the shaded area gives the probability that the sample mean obtained from the random sample lies outside the critical values. In other words, it gives the probability of rejecting the null hypothesis when the true mean is 100. The shaded area is referred to as the *critical region* or the *rejection region*. Rejection of the null hypothesis when it is true is referred to as a *type I error*. Thus, there is approximately a 4.55% chance of making a type I error in this hypothesis test. This percentage is called the *significance level* of the test and is denoted by $\alpha$. Here $\alpha \approx 0.0455$ or 4.55% (the area of the shaded region in the figure). The value of $\alpha$ is set by the analyst through the choice of critical values.

A *type II error* is also defined in hypothesis testing. This error occurs when the analyst fails to reject the null hypothesis when it is actually false. Such an error would occur if the value of the sample mean obtained is in the acceptance region bounded by 98 and 102 even though the true population mean is not 100. The probability of occurrence of a type II error is denoted by $\beta$.
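
Both error probabilities can be computed from the sampling distribution of the sample mean. The sketch below assumes a population standard deviation of 5, the value consistent with the type I error rate quoted above, and a hypothetical true mean of 101 for the type II calculation; the alternative mean is not from the original text.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 100.0                # mean under the null hypothesis
sigma = 5.0                 # population standard deviation (assumed)
n = 25                      # sample size
lower, upper = 98.0, 102.0  # critical values chosen by the analyst

std_error = sigma / sqrt(n)  # standard deviation of the sample mean


def prob_in_acceptance_region(true_mean):
    """Probability that the sample mean lands between the critical values."""
    sampling_dist = NormalDist(true_mean, std_error)
    return sampling_dist.cdf(upper) - sampling_dist.cdf(lower)


# Type I error: rejecting H0 although the true mean is 100.
alpha = 1.0 - prob_in_acceptance_region(mu_0)

# Type II error: failing to reject H0 although the true mean is 101 (hypothetical).
beta = prob_in_acceptance_region(101.0)

print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
```

The computed $\alpha$ is about 0.0455; tables that round each tail area to 0.0228 give 0.0456. Note that $\beta$ depends on which alternative value of the mean is assumed, so it is not a single number for the test.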

### Two-sided and One-sided Hypotheses

As seen in the previous section, the critical region for the hypothesis test is split into two parts, with equal areas in each tail of the distribution of the test statistic. Such a hypothesis, in which the values for which we can reject $H_0$ are in both tails of the probability distribution, is called a *two-sided hypothesis*. A hypothesis for which the critical region lies only in one tail of the probability distribution is called a *one-sided hypothesis*. For instance, consider the following hypothesis test:

$$H_0: \mu = 100$$

$$H_1: \mu > 100$$

This is an example of a one-sided hypothesis. Here the critical region lies entirely in the right tail of the distribution.

The hypothesis test may also be set up as follows:

$$H_0: \mu = 100$$

$$H_1: \mu < 100$$

This is also a one-sided hypothesis. Here the critical region lies entirely in the left tail of the distribution.

# Statistical Inference for a Single Sample

Hypothesis testing forms an important part of statistical inference. As stated previously, statistical inference refers to the process of estimating results for the population based on measurements from a sample. In the next sections, statistical inference for a single sample is discussed briefly.

### Inference on the Mean of a Population When the Variance Is Known

The test statistic used in this case is based on the standard normal distribution. If $\bar{X}$ is the calculated sample mean, then the standard normal test statistic is:

$$Z_0 = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$

where $\mu_0$ is the hypothesized population mean, $\sigma$ is the population standard deviation and $n$ is the sample size.

For example, assume that an analyst wants to know if the mean of a population, $\mu$, is 100. The population variance, $\sigma^2$, is known to be 25. The hypothesis test may be conducted as follows:

1) The statements for this hypothesis test may be formulated as:

$$H_0: \mu = 100$$

$$H_1: \mu \neq 100$$

It is clear that this is a two-sided hypothesis. Thus the critical region will lie in both tails of the probability distribution.

2) Assume that the analyst chooses a significance level of 0.05. Thus $\alpha = 0.05$. The significance level determines the critical values of the test statistic. Here the test statistic is based on the standard normal distribution. For the two-sided hypothesis these values are obtained as:

$$z_{\alpha/2} = z_{0.025} = 1.96$$

and

$$-z_{\alpha/2} = -z_{0.025} = -1.96$$

These values and the critical regions are shown in the figure below. The analyst would fail to reject $H_0$ if the test statistic, $Z_0$, is such that:

$$-z_{\alpha/2} \leq Z_0 \leq z_{\alpha/2}$$

or

$$-1.96 \leq Z_0 \leq 1.96$$

3) Next the analyst draws a random sample from the population. Assume that the sample size, $n$, is 25 and the sample mean is obtained as $\bar{x} = 103$.

4) The value of the test statistic corresponding to the sample mean value of 103 is:

$$z_0 = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{103 - 100}{5 / \sqrt{25}} = 3$$

Since this value does not lie in the acceptance region $-1.96 \leq Z_0 \leq 1.96$, we reject $H_0: \mu = 100$ at a significance level of 0.05.
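
The four steps above can be sketched in a few lines of Python. The variable names are illustrative; the numbers are those of the example (hypothesized mean 100, known variance 25, sample size 25, sample mean 103).

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 100.0         # hypothesized population mean
sigma = 5.0          # known population standard deviation (variance 25)
n = 25               # sample size
sample_mean = 103.0
alpha = 0.05         # significance level

# Standard normal test statistic.
z_0 = (sample_mean - mu_0) / (sigma / sqrt(n))

# Two-sided critical value: the upper alpha/2 point of the standard normal.
z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)

# Reject H0 when the statistic falls outside [-z_crit, z_crit].
reject_h0 = abs(z_0) > z_crit

print(f"z_0 = {z_0:.2f}, critical value = {z_crit:.2f}, reject H0: {reject_h0}")
```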

### P Value

In the previous example the null hypothesis was rejected at a significance level of 0.05. This statement does not provide information as to how far out the test statistic was into the critical region. At times it is necessary to know if the test statistic was just into the critical region or was far out into the region. This information can be provided by using the $p$ value.

The $p$ value is the probability, computed assuming $H_0$ is true, of obtaining values of the test statistic that are either equal to the one obtained from the sample or more unfavorable to $H_0$ than the one obtained from the sample. It is the lowest significance level that would lead to the rejection of the null hypothesis, $H_0$, at the given value of the test statistic. The value of the test statistic is referred to as *significant* when $H_0$ is rejected. The $p$ value is the smallest $\alpha$ at which the statistic is significant and $H_0$ is rejected.

For instance, in the previous example the test statistic was obtained as $z_0 = 3$. Values that are more unfavorable to $H_0$ in this case are values greater than 3. Then the required probability is the probability of getting a test statistic value either equal to or greater than 3 (this is abbreviated as $P(Z \geq 3)$). This probability is shown in the figure below as the dark shaded area on the right tail of the distribution and is approximately 0.00135 or 0.135% (i.e., $P(Z \geq 3) \approx 0.00135$). Since this is a two-sided test, the $p$ value is:

$$p \text{ value} = 2 \times 0.00135 = 0.0027$$

Therefore, the smallest $\alpha$ (corresponding to the test statistic value of 3) that would lead to the rejection of $H_0$ is approximately 0.0027.
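
The two-sided $p$ value for this example can be computed directly from the standard normal distribution. A minimal sketch, assuming the test statistic value of 3 from the example:

```python
from statistics import NormalDist

z_0 = 3.0  # test statistic from the example

# Upper-tail probability P(Z >= |z_0|) under the standard normal distribution.
upper_tail = 1.0 - NormalDist().cdf(abs(z_0))

# Two-sided test: the same area is counted in both tails.
p_value = 2.0 * upper_tail

print(f"P(Z >= 3) = {upper_tail:.5f}, two-sided p value = {p_value:.4f}")
```

Doubling the unrounded tail area gives about 0.0027; doubling a tail area that has already been rounded to 0.0013 gives 0.0026.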

### Inference on Mean of a Population When Variance Is Unknown

When the variance, $\sigma^2$, of a population (that can be assumed to be normally distributed) is unknown, the sample variance, $S^2$, is used in its place in the calculation of the test statistic. The test statistic used in this case is based on the $t$ distribution and is obtained using the following relation:

$$T_0 = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$$

The test statistic follows the $t$ distribution with $n - 1$ degrees of freedom.

For example, assume that an analyst wants to know if the mean of a population, $\mu$, is less than 50 at a significance level of 0.05. A random sample drawn from the population gives the sample mean, $\bar{x}$, as 47.7 and the sample standard deviation, $s$, as 5. The sample size, $n$, is 25. The hypothesis test may be conducted as follows:

1) The statements for this hypothesis test may be formulated as:

$$H_0: \mu = 50$$

$$H_1: \mu < 50$$

It is clear that this is a one-sided hypothesis. Here the critical region will lie in the left tail of the probability distribution.

2) Significance level, $\alpha = 0.05$. Here, the test statistic is based on the $t$ distribution. Thus, for the one-sided hypothesis the critical value is obtained as:

$$-t_{\alpha, n-1} = -t_{0.05, 24} = -1.7109$$

This value and the critical region are shown in the figure below. The analyst would fail to reject $H_0$ if the test statistic $T_0$ is such that:

$$T_0 > -t_{0.05, 24}$$

3) The value of the test statistic, $T_0$, corresponding to the given sample data is:

$$t_0 = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{47.7 - 50}{5 / \sqrt{25}} = -2.3$$

Since $t_0$ is less than the critical value of -1.7109, $H_0: \mu = 50$ is rejected and it is concluded that at a significance level of 0.05 the population mean is less than 50.

4) $p$ value

In this case the $p$ value is the probability that the test statistic is less than or equal to $-2.3$ (since values less than $-2.3$ are unfavorable to $H_0$). This probability is equal to 0.0152.
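
The test statistic and its left-tail probability can be reproduced without a statistics package. The sketch below is one possible approach, not the method used by the source: it integrates the $t$ density numerically with Simpson's rule. The helper names `t_pdf`, `simpson` and `t_cdf` are hypothetical, and the critical value -1.7109 is taken from the example rather than computed.

```python
from math import gamma, pi, sqrt


def t_pdf(t, dof):
    """Density of the t distribution with `dof` degrees of freedom."""
    coeff = gamma((dof + 1) / 2) / (sqrt(dof * pi) * gamma(dof / 2))
    return coeff * (1 + t * t / dof) ** (-(dof + 1) / 2)


def simpson(f, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]; `steps` must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def t_cdf(t, dof):
    """P(T <= t), using symmetry about zero and numerical integration."""
    area = simpson(lambda u: t_pdf(u, dof), 0.0, abs(t))
    return 0.5 + area if t >= 0 else 0.5 - area


sample_mean, mu_0, s, n = 47.7, 50.0, 5.0, 25

t_0 = (sample_mean - mu_0) / (s / sqrt(n))  # test statistic
p_value = t_cdf(t_0, n - 1)                 # left-tail probability
reject_h0 = t_0 < -1.7109                   # critical value quoted in the example

print(f"t_0 = {t_0:.2f}, p value = {p_value:.4f}, reject H0: {reject_h0}")
```

In practice a statistics library would normally supply the $t$ distribution function; the numerical integration here only keeps the example self-contained.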

### Inference on Variance of a Normal Population

The test statistic used in this case is based on the chi-squared distribution. If $S^2$ is the calculated sample variance and $\sigma_0^2$ the hypothesized population variance, then the chi-squared test statistic is:

$$\chi_0^2 = \frac{(n - 1) S^2}{\sigma_0^2}$$

The test statistic follows the chi-squared distribution with $n - 1$ degrees of freedom.

For example, assume that an analyst wants to know if the variance of a population exceeds 1 at a significance level of 0.05. A random sample drawn from the population gives the sample variance as 2. The sample size, $n$, is 20. The hypothesis test may be conducted as follows:

1) The statements for this hypothesis test may be formulated as:

$$H_0: \sigma^2 = 1$$

$$H_1: \sigma^2 > 1$$

This is a one-sided hypothesis. Here the critical region will lie in the right tail of the probability distribution.

2) Significance level, $\alpha = 0.05$. Here, the test statistic is based on the chi-squared distribution. Thus for the one-sided hypothesis the critical value is obtained as:

$$\chi_{\alpha, n-1}^2 = \chi_{0.05, 19}^2 = 30.1435$$

This value and the critical region are shown in the figure below. The analyst would fail to reject $H_0$ if the test statistic $\chi_0^2$ is such that:

$$\chi_0^2 < \chi_{0.05, 19}^2$$

3) The value of the test statistic corresponding to the given sample data is:

$$\chi_0^2 = \frac{(n - 1) s^2}{\sigma_0^2} = \frac{(20 - 1) \times 2}{1} = 38$$

Since $\chi_0^2$ is greater than the critical value of 30.1435, $H_0: \sigma^2 = 1$ is rejected and it is concluded that at a significance level of 0.05 the population variance exceeds 1.

4) $p$ value

In this case the $p$ value is the probability that the test statistic is greater than or equal to 38 (since values greater than 38 are unfavorable to $H_0$). This probability is determined to be 0.0059.
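
The same kind of check can be made for the variance test. The sketch below integrates the chi-squared density numerically with Simpson's rule; this is an illustrative approach with hypothetical helper names (`chi2_pdf`, `simpson`), and the critical value 30.1435 is taken from the example rather than computed.

```python
from math import exp, gamma


def chi2_pdf(x, dof):
    """Density of the central chi-squared distribution with `dof` degrees of freedom."""
    half = dof / 2
    return x ** (half - 1) * exp(-x / 2) / (2 ** half * gamma(half))


def simpson(f, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]; `steps` must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


n = 20                 # sample size
sample_variance = 2.0
sigma0_sq = 1.0        # hypothesized population variance

chi2_0 = (n - 1) * sample_variance / sigma0_sq  # test statistic

# Right-tail probability P(chi-squared >= chi2_0) with n - 1 degrees of freedom.
p_value = 1.0 - simpson(lambda x: chi2_pdf(x, n - 1), 0.0, chi2_0)

reject_h0 = chi2_0 > 30.1435  # critical value quoted in the example

print(f"chi2_0 = {chi2_0:.1f}, p value = {p_value:.4f}, reject H0: {reject_h0}")
```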

# Statistical Inference for Two Samples

### Inference on the Difference in Population Means When Variances Are Known

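The test statistic used here is based on the standard normal distribution. Let $\mu_1$ and $\mu_2$ represent the means of two populations, and $\sigma_1^2$ and $\sigma_2^2$ their variances, respectively. Let $\Delta_0$ be the hypothesized difference in the population means and $\bar{X}_1$ and $\bar{X}_2$ be the sample means obtained from two samples of sizes $n_1$ and $n_2$ drawn randomly from the two populations, respectively. The test statistic can be obtained as:

$$Z_0 = \frac{\bar{X}_1 - \bar{X}_2 - \Delta_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

The statements for the hypothesis test are:

$$H_0: \mu_1 - \mu_2 = \Delta_0$$

$$H_1: \mu_1 - \mu_2 \neq \Delta_0$$

If $\Delta_0 = 0$, then the hypothesis will test for the equality of the two population means.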

### Inference on the Difference in Population Means When Variances Are Unknown

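If the population variances can be assumed to be equal, then the following test statistic based on the $t$ distribution can be used. Let $\bar{X}_1$, $\bar{X}_2$, $S_1^2$ and $S_2^2$ be the sample means and variances obtained from randomly drawn samples of sizes $n_1$ and $n_2$ from the two populations, respectively. The weighted average, $S_p^2$, of the two sample variances is:

$$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$$

$S_p^2$ has $(n_1 + n_2 - 2)$ degrees of freedom. The test statistic can be calculated as:

$$T_0 = \frac{\bar{X}_1 - \bar{X}_2 - \Delta_0}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

$T_0$ follows the $t$ distribution with $(n_1 + n_2 - 2)$ degrees of freedom. This test is also referred to as the two-sample pooled $t$ test.

If the population variances cannot be assumed to be equal, then the following test statistic is used:

$$T_0^* = \frac{\bar{X}_1 - \bar{X}_2 - \Delta_0}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$

$T_0^*$ approximately follows the $t$ distribution with $\nu$ degrees of freedom. In the widely used Welch-Satterthwaite approximation, $\nu$ is defined as follows:

$$\nu = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{(S_1^2 / n_1)^2}{n_1 - 1} + \dfrac{(S_2^2 / n_2)^2}{n_2 - 1}}$$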
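
To show how the two statistics differ in practice, the sketch below evaluates both on hypothetical summary statistics (the means, variances and sample sizes are invented for illustration). The degrees of freedom for the unequal-variance case use the Welch-Satterthwaite approximation, which is an assumption here and may differ in detail from the expression the source intended.

```python
from math import sqrt

# Hypothetical summary statistics, chosen only for illustration.
mean_1, var_1, n_1 = 52.0, 9.0, 12
mean_2, var_2, n_2 = 49.0, 16.0, 15
delta_0 = 0.0  # hypothesized difference: test for equal population means

# Equal-variance case: pooled variance and pooled t statistic.
pooled_var = ((n_1 - 1) * var_1 + (n_2 - 1) * var_2) / (n_1 + n_2 - 2)
t_pooled = (mean_1 - mean_2 - delta_0) / (sqrt(pooled_var) * sqrt(1 / n_1 + 1 / n_2))
dof_pooled = n_1 + n_2 - 2

# Unequal-variance case: separate variance terms and approximate degrees of freedom.
v_1, v_2 = var_1 / n_1, var_2 / n_2
t_unequal = (mean_1 - mean_2 - delta_0) / sqrt(v_1 + v_2)
dof_welch = (v_1 + v_2) ** 2 / (v_1 ** 2 / (n_1 - 1) + v_2 ** 2 / (n_2 - 1))

print(f"pooled:  t = {t_pooled:.4f}, dof = {dof_pooled}")
print(f"unequal: t = {t_unequal:.4f}, dof = {dof_welch:.2f}")
```

The approximate degrees of freedom are generally not an integer and never exceed $n_1 + n_2 - 2$.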

### Inference on the Variances of Two Normal Populations

The test statistic used here is based on the $F$ distribution. If $S_1^2$ and $S_2^2$ are the sample variances obtained from samples drawn randomly from the two populations and $n_1$ and $n_2$ are the two sample sizes, respectively, then the test statistic that can be used to test the equality of the population variances is:

$$F_0 = \frac{S_1^2}{S_2^2}$$

The test statistic follows the $F$ distribution with $(n_1 - 1)$ degrees of freedom in the numerator and $(n_2 - 1)$ degrees of freedom in the denominator.

For example, assume that an analyst wants to know if the variances of two normal populations are equal at a significance level of 0.05. Random samples drawn from the two populations give the sample standard deviations as 1.84 and 2, respectively. Both the sample sizes are 20. The hypothesis test may be conducted as follows:

1) The statements for this hypothesis test may be formulated as:

$$H_0: \sigma_1^2 = \sigma_2^2$$

$$H_1: \sigma_1^2 \neq \sigma_2^2$$

It is clear that this is a two-sided hypothesis and the critical region will be located in both tails of the probability distribution.

2) Significance level, $\alpha = 0.05$. Here the test statistic is based on the $F$ distribution. For the two-sided hypothesis the critical values are obtained as:

$$f_{\alpha/2, n_1 - 1, n_2 - 1} = f_{0.025, 19, 19} \approx 2.53$$

and

$$f_{1 - \alpha/2, n_1 - 1, n_2 - 1} = f_{0.975, 19, 19} \approx 0.40$$

These values and the critical regions are shown in the figure below. The analyst would fail to reject $H_0$ if the test statistic $F_0$ is such that:

$$f_{1 - \alpha/2, n_1 - 1, n_2 - 1} \leq F_0 \leq f_{\alpha/2, n_1 - 1, n_2 - 1}$$

or

$$0.40 \leq F_0 \leq 2.53$$

3) The value of the test statistic corresponding to the given data is:

$$f_0 = \frac{s_1^2}{s_2^2} = \frac{1.84^2}{2^2} = 0.8464$$

Since $f_0$ lies in the acceptance region, the analyst fails to reject $H_0: \sigma_1^2 = \sigma_2^2$ at a significance level of 0.05.
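
The decision in this example can also be checked through the cumulative probability of the test statistic instead of table lookups. The sketch below is an illustrative approach, not the source's method: it evaluates the $F$ distribution function by numerically integrating the equivalent beta density. The helper names `simpson` and `f_cdf` are hypothetical.

```python
from math import gamma


def simpson(f, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]; `steps` must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def f_cdf(x, d1, d2):
    """P(F <= x) for d1 and d2 degrees of freedom, via the regularized incomplete beta function."""
    a, b = d1 / 2, d2 / 2
    upper = d1 * x / (d1 * x + d2)
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * simpson(lambda t: t ** (a - 1) * (1 - t) ** (b - 1), 0.0, upper)


s_1, s_2 = 1.84, 2.0  # sample standard deviations
n_1 = n_2 = 20        # sample sizes
alpha = 0.05

f_0 = s_1 ** 2 / s_2 ** 2  # test statistic
cdf_at_f0 = f_cdf(f_0, n_1 - 1, n_2 - 1)

# Two-sided test: reject only if f_0 falls in either alpha/2 tail.
reject_h0 = cdf_at_f0 < alpha / 2 or cdf_at_f0 > 1 - alpha / 2

print(f"f_0 = {f_0:.4f}, P(F <= f_0) = {cdf_at_f0:.3f}, reject H0: {reject_h0}")
```

Because the cumulative probability of $f_0$ lies well between 0.025 and 0.975, the statistic is in neither tail, which agrees with the conclusion of the example.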