There are two types of statistical inference: estimation of population parameters and hypothesis testing. Hypothesis testing is one of the most important tools for applying statistics to real-life problems. Most often, decisions about populations must be made on the basis of sample information, and statistical tests are used in arriving at these decisions.
There are five ingredients to any statistical test:
In attempting to reach a decision, it is useful to make an educated guess or assumption about the population involved, such as the type of distribution.
Statistical Hypotheses: These are defined as assertions or conjectures about the parameter or parameters of a population, for example the mean or the variance of a normal population. They may also concern the type, nature, or probability distribution of the population.
Statistical hypotheses are based on the concept of proof by contradiction. For example, say we test the mean (μ) of a population to see if an experiment has caused an increase or decrease in μ. We do this by proof by contradiction, formulating a null hypothesis.
Null Hypothesis: A hypothesis which states that there is no difference between the procedures; it is denoted by H0. For the above example, the corresponding H0 would be that there has been no increase or decrease in the mean. It is always the null hypothesis that is tested, i.e., we want to either accept or reject the null hypothesis, because we have information only for the null hypothesis.
Alternative Hypothesis: A hypothesis which states that there is a difference between the procedures; it is denoted by HA.
Case | Null Hypothesis H0 | Alternate Hypothesis HA |
---|---|---|
1 | μ1 = μ2 | μ1 ≠ μ2 |
2 | μ1 ≥ μ2 | μ1 < μ2 |
3 | μ1 ≤ μ2 | μ1 > μ2 |
Test Statistic: It is the random variable X whose value is tested to arrive at a decision. The Central Limit Theorem states that for large sample sizes (n > 30) drawn randomly from a population, the distribution of the means of those samples will approximate normality, even when the data in the parent population are not distributed normally. A z statistic is usually used for large sample sizes (n > 30), but large samples are often not easy to obtain, in which case the t-distribution can be used. The population standard deviation σ is then estimated by the sample standard deviation, s. The t curves are bell shaped and centred on t = 0; the exact shape of a given t-curve depends on the degrees of freedom. When performing multiple comparisons by one-way ANOVA, the F-statistic is normally used. It is defined as the ratio of the mean square due to the variability between groups to the mean square due to the variability within groups. The critical value of F is read from tables of the F-distribution, knowing the Type I error α and the degrees of freedom between and within the groups.
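As an illustration, a t statistic can be computed directly from sample data in a few lines. The numbers below are hypothetical, not taken from this primer; the sketch simply assembles the ingredients named above (sample mean, sample standard deviation s estimating σ, and n − 1 degrees of freedom):

```python
import math
import statistics

# Hypothetical sample: 8 measurements of a pollutant (mg/L)
sample = [10.3, 9.8, 10.9, 11.2, 10.1, 10.6, 9.5, 10.8]
mu0 = 10.0                    # hypothesized population mean

n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample standard deviation estimates sigma

# t statistic with n - 1 degrees of freedom (small sample, sigma unknown)
t = (xbar - mu0) / (s / math.sqrt(n))
print(f"n = {n}, mean = {xbar:.3f}, s = {s:.3f}, t = {t:.3f}")
```

The observed t would then be compared against a critical value from a t table with n − 1 = 7 degrees of freedom.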
Rejection Region: It is the part of the sample space (critical region) where the null hypothesis H0 is rejected. The size of this region is determined by the probability (α) of the sample point falling in the critical region when H0 is true. α is also known as the level of significance: the probability of the value of the random variable falling in the critical region. Note that the term "statistical significance" refers only to the rejection of a null hypothesis at some level α; it implies only that the observed difference between the sample statistic and the mean of the sampling distribution did not occur by chance alone.
Conclusion: If the test statistic falls in the rejection/critical region, H0 is rejected; otherwise H0 is accepted, i.e., there is not enough evidence to reject it.
Tests of hypothesis can be carried out on one or two samples. One sample tests are used to test if the population parameter (μ) is different from a specified value. Two sample tests are used to detect the difference between the parameters of two populations (μ1 and μ2).
Two sample tests can further be classified as unpaired or paired two sample tests. While in unpaired two sample tests the sample data are not related, in paired two sample tests the sample data are paired according to some identifiable characteristic. For example, when testing hypotheses about the effect of a treatment on (say) a landfill, we would like to pair the data taken at the same sampling points before and after implementation of the treatment.
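Pairing works on the per-point differences, which turns the two-sample problem into a one-sample test on the mean difference. A minimal sketch; the before/after BOD readings below are hypothetical, not from this primer:

```python
import math
import statistics

# Hypothetical BOD readings (mg/L) at the same six sampling points
# before and after a treatment -- illustrative values only
before = [12.1, 14.3, 11.8, 13.5, 12.9, 14.0]
after  = [10.9, 13.1, 11.2, 12.0, 12.4, 12.8]

# A paired test analyses the per-point differences
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
dbar = statistics.mean(diffs)
sd = statistics.stdev(diffs)

# One-sample t statistic on the differences, df = n - 1 = 5
t = dbar / (sd / math.sqrt(n))
print(f"mean difference = {dbar:.3f}, t = {t:.3f}")
```

Because each difference removes the point-to-point variability shared by the two readings, a paired test is usually more sensitive than an unpaired one on the same data.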
Both one sample and two sample tests can be classified as:
One tailed test : Here the alternate hypothesis HA is one-sided and we test whether the test statistic falls in the critical region on only one side of the distribution.
One sample test: For example, we are measuring the concentration of a substance in a lake and we need to know if the mean concentration of the lake is greater than a specified value of 10 mg/L.
Hence, H0: μ ≤ 10 mg/L vs HA: μ > 10 mg/L.
Two sample test: In Table 1, cases 2 and 3 are illustrations of two sample, one tailed tests. In case 2 we want to test whether the population mean of the first sample is less than that of the second sample.
Hence, H0: μ1 ≥ μ2 vs HA: μ1 < μ2.
Two tailed test : Here the alternate hypothesis HA is formulated to test for difference in either direction, i.e., for either an increase or a decrease in the random variable. Hence the test statistic is tested for occurrence within either of the two critical regions on the two extremes of the distribution.
One sample test: For the lake example we need to know if the mean concentration of the lake is the same as or different from a specified value of 10 mg/L.
Hence, H0: μ = 10 mg/L vs HA: μ ≠ 10 mg/L.
Two sample test: In Table 1, case 1 is an illustration of a two sample two tailed test. In case 1 we want to test whether the population mean of the first sample (μ1) is the same as or different from the mean of the second sample (μ2).
Hence, H0: μ1 = μ2 vs HA: μ1 ≠ μ2.
Given the same level of significance, the two tailed test is more conservative, i.e., it is more rigorous than the one-tailed test, because the rejection point is farther out in the tail. It is more difficult to reject H0 with a two-tailed test than with a one-tailed test.
An accompanying diagram illustrates the critical region(s) for one and two tailed tests.
When using probability to decide whether a statistical test provides evidence for or against our predictions, there is always a chance of drawing the wrong conclusions. Even when choosing a probability level of 95%, there is always a 5% chance that one rejects the null hypothesis when it was actually correct. This is called Type I error, represented by the Greek letter α.
It is possible to err in the opposite way if one fails to reject the null hypothesis when it is, in fact, incorrect. This is called Type II error, represented by the Greek letter β. These two errors are represented in the following chart.
Type of decision | H0 true | H0 false |
---|---|---|
Reject H0 | Type I error (α) | Correct decision (1-β) |
Accept H0 | Correct decision (1-α) | Type II error (β) |
A related concept is power, which is the probability of rejecting the null hypothesis when it is actually false. Power is simply 1 minus the Type II error rate, and is usually expressed as 1-β.
When choosing the probability level of a test, it is possible to control the risk of committing a Type I error by choosing an appropriate α.
This also affects the Type II error, since the two are inversely related: as one increases, the other decreases.
There is little control over the risk of committing a Type II error, because it also depends on the actual difference being evaluated, which is usually unknown. At a fixed α value, the β value changes according to the actual distribution of the population.
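This dependence of β on the true (unknown) population mean can be sketched numerically. The scenario below is entirely hypothetical: a one-sided z test of H0: μ ≤ 10 mg/L with an assumed known σ and sample size; the sketch shows β shrinking (and power growing) as the true mean moves farther from the hypothesized value:

```python
import math
from statistics import NormalDist

# Hypothetical one-sided z test: H0: mu <= 10 vs HA: mu > 10,
# with sigma assumed known -- illustrative numbers only
sigma, n, alpha, mu0 = 2.0, 25, 0.05, 10.0

z_crit = NormalDist().inv_cdf(1 - alpha)       # one-sided cutoff in z units
cutoff = mu0 + z_crit * sigma / math.sqrt(n)   # same cutoff on the mg/L scale

for true_mu in (10.5, 11.0, 11.5):
    # beta: probability the sample mean falls below the cutoff
    # when the true mean is actually true_mu
    beta = NormalDist(true_mu, sigma / math.sqrt(n)).cdf(cutoff)
    print(f"true mean {true_mu}: beta = {beta:.3f}, power = {1 - beta:.3f}")
```

At a fixed α, nothing in the test itself pins down β: it is set by how far the true mean really is from μ0, which is exactly why Type II error is hard to control.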
The consequences of these different types of error are very different. For example, if one tests for the significant presence of a pollutant, incorrectly deciding that a site is polluted (Type I error) will cause a waste of resources and energy cleaning up a site that does not need it. On the other hand, failure to determine presence of pollution (Type II error) can lead to environmental deterioration or health problems in the nearby community.
1. Identify the null hypothesis H0 and the alternate hypothesis HA.
2. Choose α. The value should be small, usually less than 10%. It is important to consider the consequences of both types of errors.
3. Select the test statistic and determine its value from the sample data. This value is called the observed value of the test statistic. Remember that a t statistic is usually appropriate for a small number of samples; for a larger number of samples, a z statistic works well if the data are normally distributed.
4. Compare the observed value of the statistic to the critical value obtained for the chosen α.
5. Make a decision:
   If the test statistic falls in the critical region, reject H0 in favour of HA.
   If the test statistic does not fall in the critical region, conclude that there is not enough evidence to reject H0.
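The steps above can be sketched for the lake example (H0: μ ≤ 10 mg/L vs HA: μ > 10 mg/L). The readings below are hypothetical; the critical value is the standard one-tailed t for 7 degrees of freedom at α = 0.05, taken from a t table:

```python
import math
import statistics

# Hypothetical lake concentration readings (mg/L) -- illustrative only
data = [10.9, 11.4, 10.2, 11.8, 10.7, 11.1, 10.5, 11.6]
mu0 = 10.0       # step 1: H0: mu <= 10 mg/L vs HA: mu > 10 mg/L
alpha = 0.05     # step 2: chosen significance level
t_crit = 1.895   # step 4: one-tailed t, df = 7, alpha = 0.05 (from a t table)

# step 3: observed value of the test statistic
n = len(data)
t_obs = (statistics.mean(data) - mu0) / (statistics.stdev(data) / math.sqrt(n))

# step 5: decision
if t_obs > t_crit:
    print(f"t = {t_obs:.3f} > {t_crit}: reject H0 in favour of HA")
else:
    print(f"t = {t_obs:.3f}: not enough evidence to reject H0")
```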
A) One tailed Test
An aquaculture farm takes water from a stream and returns it after it has circulated through the fish tanks. The owner thinks that, since the water circulates rather quickly through the tanks, there is little organic matter in the effluent. To find out if this is true, he takes some samples of the water at the intake and other samples downstream of the outlet, and tests for Biochemical Oxygen Demand (BOD). If BOD increases, it can be said that the effluent contains more organic matter than the stream can handle.
The data for this problem are given in the following table:
Upstream | Downstream |
---|---|
6.782 | 9.063 |
5.809 | 8.381 |
6.849 | 8.660 |
6.879 | 8.405 |
7.014 | 9.248 |
7.321 | 8.735 |
5.986 | 9.772 |
6.628 | 8.545 |
6.822 | 8.063 |
6.448 | 8.001 |
One tailed t-test:

 | Upstream | Downstream |
---|---|---|
Mean | 6.6539 | 8.6874 |
Variance | 0.2124 | 0.2988 |
Observations | 10 | 10 |
Pooled Variance | 0.2556 | |
Hypothesized Mean Difference | 0 | |
Degrees of freedom | 18 | |
t Stat | -8.9941 | |
P(T<t) one-tail | 2.22 × 10⁻⁸ | |
t Critical one-tail | 1.7341 | |
P(T<t) two-tail | 4.45 × 10⁻⁸ | |
t Critical two-tail | 2.1009 | |
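The figures in the table above can be reproduced from the raw data. A sketch of the pooled-variance (equal-variance) two-sample t statistic, consistent with the 18 degrees of freedom reported:

```python
import math
import statistics

# BOD data from the table above (mg/L)
upstream   = [6.782, 5.809, 6.849, 6.879, 7.014, 7.321, 5.986, 6.628, 6.822, 6.448]
downstream = [9.063, 8.381, 8.660, 8.405, 9.248, 8.735, 9.772, 8.545, 8.063, 8.001]

n1, n2 = len(upstream), len(downstream)
v1, v2 = statistics.variance(upstream), statistics.variance(downstream)

# Pooled (equal-variance) estimate, df = n1 + n2 - 2 = 18
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t = (statistics.mean(upstream) - statistics.mean(downstream)) \
    / math.sqrt(sp2 * (1 / n1 + 1 / n2))
print(f"pooled variance = {sp2:.4f}, t = {t:.4f}")
```

Since |t| = 8.99 far exceeds the one-tailed critical value of 1.7341, H0 is rejected: the downstream BOD is significantly higher.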
B) Two tailed Test

Let us assume that an induced bioremediation process is being conducted at a contaminated site. The researcher has obtained good cleanup rates by injecting a mixture of nutrients into the soil in order to maintain an abundant microbial community. Someone suggests using a cheaper mixture. The researcher tries one patch of land with the new mixture and compares the degradation rates to those obtained from a patch treated with the expensive one, to see if he can get the same degradation rates.
The data for this problem are shown in the following table:

Cheap Nutrient | Expensive Nutrient |
---|---|
7.1031 | 9.6662 |
6.4085 | 10.1320 |
8.8819 | 9.0624 |
7.0094 | 8.8136 |
4.6715 | 9.2345 |
6.6135 | 9.9949 |
6.5877 | 9.4299 |
6.2849 | 8.8012 |
6.6789 | 9.9249 |
6.5542 | 8.1739 |
A is treated with the cheap nutrient, and B is treated with the expensive one.
 | Cheap Nutrient | Expensive Nutrient |
---|---|---|
Mean | 6.6794 | 9.3233 |
Variance | 1.0476 | 0.3917 |
Observations | 10 | 10 |
Hypothesized Mean Difference | 0 | |
Degrees of freedom | 15 | |
t Stat | -6.9691 | |
P(T<t) one-tail | 2.25 × 10⁻⁶ | |
t Critical one-tail | 1.7531 | |
P(T<t) two-tail | 4.51 × 10⁻⁶ | |
t Critical two-tail | 2.1315 | |
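The 15 degrees of freedom in this table suggest that an unequal-variance (Welch) t-test was used, which is sensible given the rather different sample variances. A sketch reproducing the reported t statistic and degrees of freedom from the raw data:

```python
import math
import statistics

# Degradation-rate data from the table above
cheap     = [7.1031, 6.4085, 8.8819, 7.0094, 4.6715, 6.6135, 6.5877, 6.2849, 6.6789, 6.5542]
expensive = [9.6662, 10.1320, 9.0624, 8.8136, 9.2345, 9.9949, 9.4299, 8.8012, 9.9249, 8.1739]

n1, n2 = len(cheap), len(expensive)
v1, v2 = statistics.variance(cheap), statistics.variance(expensive)
se2 = v1 / n1 + v2 / n2

# Unequal-variance (Welch) t statistic
t = (statistics.mean(cheap) - statistics.mean(expensive)) / math.sqrt(se2)

# Welch-Satterthwaite approximation for the degrees of freedom
df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
print(f"t = {t:.4f}, df = {df:.1f} (rounds to {round(df)})")
```

Since |t| = 6.97 exceeds the two-tailed critical value of 2.1315, H0 is rejected: the two mixtures do not give the same degradation rates.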
Although hypothesis tests are a very useful tool in general, they are sometimes not appropriate in the environmental field. The following cases illustrate some of the limitations of this type of test:
A) Multiple Comparisons
z and t tests are very useful when comparing two population means. However, when it comes to comparing several population means at the same time, this method is not very appropriate.
Suppose we are interested in comparing pollutant concentrations from three different wells with means μ1, μ2 and μ3. We could test the following hypothesis:
H0: μ1 = μ2 = μ3
HA: not all means are equal
We would need to conduct three different hypothesis tests, which are shown here:
H0 | HA |
---|---|
μ1 = μ2 | μ1 ≠ μ2 |
μ2 = μ3 | μ2 ≠ μ3 |
μ1 = μ3 | μ1 ≠ μ3 |
For each test, there is always the possibility of committing an error. Since we are conducting three such tests, the overall error probability would exceed the acceptable range, and we could not feel very confident about the final conclusion. Table 8 shows the resulting overall α if multiple t tests are conducted, where k is the number of populations (means) being compared and each column corresponds to the α value used for each individual test.
Number of means (k) | α = 0.20 | α = 0.10 | α = 0.05 | α = 0.02 | α = 0.01 | α = 0.001 |
---|---|---|---|---|---|---|
2 | 0.20 | 0.10 | 0.05 | 0.02 | 0.01 | 0.001 |
3 | 0.41 | 0.23 | 0.13 | 0.05 | 0.03 | 0.003 |
4 | 0.58 | 0.36 | 0.21 | 0.09 | 0.05 | 0.006 |
5 | 0.71 | 0.47 | 0.23 | 0.13 | 0.07 | 0.009 |
10 | 0.96 | 0.83 | 0.63 | 0.37 | 0.23 | 0.034 |
20 | 1.00 | 0.98 | 0.92 | 0.71 | 0.52 | 0.109 |
∞ | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Note : The particular values were derived from a table by Pearson (1942) by assuming equal population variances and large samples.
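Under the simplifying assumption that the individual tests are independent, the growth of the overall α can be approximated as 1 − (1 − α)^m, where m = k(k − 1)/2 is the number of pairwise comparisons among k means. A short sketch; note that the results differ slightly from Table 8, whose Pearson-derived values account for correlation between the comparisons:

```python
# Approximate family-wise (overall) alpha when m independent tests are
# each run at the same per-test alpha.  With k means there are
# k*(k-1)/2 pairwise comparisons.
def overall_alpha(k, alpha):
    m = k * (k - 1) // 2
    return 1 - (1 - alpha) ** m

for k in (2, 3, 4, 5):
    print(f"k = {k}: overall alpha ~= {overall_alpha(k, 0.05):.3f}")
```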
A better method for comparing several population means is an analysis of variance, abbreviated as ANOVA.
The ANOVA test is based on the variability between the sample means. This variability is measured in relation to the variability of the data values within the samples. These two variances are compared by means of the F ratio test.
If there is a large variability between the sample means, this suggests that not all the population means are equal. When the variability between the sample means is large compared to the variability within the samples, it can be concluded that not all the population means are equal.
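The F ratio described above can be computed directly. The three groups of well concentrations below are hypothetical; the sketch forms the between-group and within-group mean squares and takes their ratio:

```python
import statistics

# Hypothetical pollutant concentrations from three wells (mg/L)
groups = [
    [10.2, 11.1, 10.8, 10.5],
    [12.3, 11.9, 12.6, 12.0],
    [10.9, 11.4, 11.0, 11.3],
]

k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = statistics.mean(x for g in groups for x in g)

# Mean square between groups: variability of the group means
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Mean square within groups: variability of the data around their own means
ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
ms_within = ss_within / (n - k)

F = ms_between / ms_within  # compare to an F table with (k-1, n-k) df
print(f"F = {F:.2f} with ({k - 1}, {n - k}) degrees of freedom")
```

The observed F would then be compared against the critical value from an F table for the chosen α and (k − 1, n − k) degrees of freedom.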
B) Multiple Constituents
In example 1, we were only testing for BOD, so only one t test was necessary. If we had been trying to trace more than one pollutant, which is usually the case, we would have to carry out a different test for each pollutant in order to determine if the effluent was similar to the receiving stream. Then we would have the same problem we encountered with multiple comparisons: the overall α would increase. Table 8 applies to this case too; the k value in this case would represent the number of pollutants instead of the number of populations.
C) Difficulty in meeting assumptions
The tests used in the testing of hypotheses, viz. the t-test and ANOVA, have some fundamental assumptions that need to be met for the tests to work properly and yield good results. The main assumptions for the t-test and ANOVA are listed below.
The primary assumptions underlying the t-test are that:
1. The measurements are independent of each other.
2. The populations being compared are normally distributed.
3. The populations have equal variances.
Like the t-test, analysis of variance is based on a model that requires certain assumptions. Three primary assumptions of ANOVA are that:
1. The measurements in each group are independent.
2. Each group is sampled from a normally distributed population.
3. The groups have equal (homogeneous) variances.
In actual experimental or sampling situations, the underlying populations are not likely to be exactly normally distributed with exactly equal variances. Both the t-test and ANOVA are quite robust and yield reliable results when some of the assumptions are not met. For example, if n1 = n2 = ... = nk, ANOVA tends to be especially robust with respect to the assumption of homogeneity of variances; as the number of groups tested, k, increases, departures have a greater effect on the value of the F-statistic. It is also seen that a reasonable departure from the assumption of population normality does not have a serious effect on the reliability of the F-statistic or the t-statistic. It is essential, however, that the assumption of independence be met: the analysis is not robust for non-independent measurements. These factors are to be taken into consideration while testing hypotheses.
Sampling & Monitoring Primer Table of Contents
Send comments or suggestions to:
Student Authors: Georgina Wilson González, gwilsong@vt.edu , and Karpagam Sankaran, ksankara@vt.edu
Faculty Advisor: Daniel Gallagher, dang@vt.edu
Copyright © 1997 Daniel Gallagher
Last Modified: 09-10-1997