By Michelle Harris, Rick Nordheim, and Janet Batzli
Biocore 304 Spring 2007
When analyzing data, the first step should always be plotting your data. This gives you an opportunity to look at your data to see that they “make sense” and to make some preliminary interpretations. A good plot can help you spot errors before they creep into more formal analyses. Also, all formal statistical inference is based on underlying assumptions. An examination of the data at the beginning can help you to spot possible violations of the key assumptions. When you have gathered data from a biological study, avoid the temptation to jump straight to performing inference (e.g., conducting statistical tests of your hypotheses regarding population values). Always look at your data!!
After plotting your data, it is helpful to provide some summary (descriptive) statistics for your data. These provide sample quantities that you will use to estimate important characteristics of the population to which you want to make inference. Let the n data values be labeled y1, y2, …, yn. A common measure of central tendency is the mean (called y-bar), or numerical average, calculated for a sample of data using the formula:

ȳ = (y1 + y2 + … + yn) / n
A common measure of data dispersion, or spread, is the standard deviation (s), which can be determined for a given sample with the equation:

s = √[ Σ(yi − ȳ)² / (n − 1) ]
The sample mean and sample standard deviation are estimates of the corresponding population mean and population standard deviation. You will never know the population (true) values. However, if your sample is a random sample from the population of interest, you will be able to make reliable inference about the population. The units of both the mean and the standard deviation are the same as the units of the underlying distribution. For example, if you are measuring the length of six-week old lizards of a given species in centimeters, the mean and the standard deviation are both measured in centimeters. The variance, which is s², would have units of squared centimeters in this case.
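If you want to check these summary statistics by computer, Python's standard library computes all three directly. The sketch below uses six invented lizard lengths purely for illustration:

```python
# Sketch: sample mean, standard deviation, and variance via Python's
# standard library. The lizard lengths are hypothetical example values.
from statistics import mean, stdev, variance

lengths_cm = [9.1, 10.4, 9.8, 11.0, 10.2, 9.5]  # hypothetical sample (cm)

y_bar = mean(lengths_cm)   # sample mean, in cm
s = stdev(lengths_cm)      # sample standard deviation, in cm (n - 1 divisor)
s2 = variance(lengths_cm)  # sample variance, in cm^2 (equals s squared)

print(round(y_bar, 2), round(s, 2), round(s2, 2))  # 10.0 0.68 0.46
```

Note that `stdev` and `variance` use the sample (n − 1) divisor, matching the equation above; the mean and standard deviation come out in centimeters and the variance in squared centimeters.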
Let us think of the sample mean ȳ as estimating the population mean μ (the Greek letter mu). We would like to know how close ȳ is to μ. The measure of this uncertainty is known as the standard error of the mean (SE). For any distribution, it is related to the standard deviation and the sample size by

SE = s / √n
Suppose we have taken a random sample of 16 six-week-old lizards and computed a sample mean ȳ (in cm) and a standard deviation s = 2.0 cm. Then SE = 2.0/√16 = 0.5 cm. The standard deviation measures the typical deviation of the length of an individual lizard, whereas the standard error measures the uncertainty in how well ȳ estimates μ. There is a true standard deviation for the lizard length which is an inherent characteristic of the lizard. From our sample we estimate the standard deviation by 2.0; in different samples it may be somewhat larger or smaller but will not be grossly different. The standard error depends on the standard deviation and the sample size. As the sample size gets larger, the standard error becomes smaller, which means that we can be sure that ȳ estimates μ very closely. Indeed, the sample size is often chosen to achieve a given level of precision in how well ȳ estimates μ, balanced with the practical limitations of the study.
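The standard-error arithmetic for the lizard example can be checked in a couple of lines of Python, using the n = 16 and s = 2.0 cm from the text:

```python
# Standard error of the mean: SE = s / sqrt(n), for the lizard example.
from math import sqrt

n = 16            # sample size (lizards)
s = 2.0           # sample standard deviation, cm
se = s / sqrt(n)  # standard error of the mean
print(se)         # 0.5 (cm)
```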
It is often useful to use ȳ and SE to compute a range of plausible values for μ. The most common method for this is to compute a confidence interval for μ. An equation that often provides a useful approximation to such a confidence interval is:

ȳ ± 1.96 × SE
However, the construction of this, or any other, confidence interval depends on some underlying assumptions. If your sample is reasonably symmetric and bell-shaped and your sample size is at least 25, then this interval can be thought of as a 95% confidence interval for the true population mean. This says that, on average for intervals computed in this manner, the true population mean will fall within the interval about 95% of the time. If the data are asymmetric, then such an interval may not be reliable. If the sample size is smaller than 25, then the “1.96” needs to be replaced by a larger number with the magnitude depending on the specific sample size. (For sample sizes in the range of 5 to 10, an approximate value to use is 2.5. The “correct” value to use is based on the t-distribution described in a later section.)
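A sketch of the large-sample interval in Python, reusing the lizard example. The sample mean of 10.0 cm is an assumed value for illustration; and since n = 16 is below the 25 recommended above, the 1.96 multiplier here slightly understates the interval width — treat this as the large-sample approximation only:

```python
# Approximate 95% CI: y_bar +/- 1.96 * SE (large-sample form).
# y_bar = 10.0 cm is hypothetical; s and n follow the lizard example.
from math import sqrt

y_bar, s, n = 10.0, 2.0, 16
se = s / sqrt(n)                 # 0.5 cm

lower = y_bar - 1.96 * se
upper = y_bar + 1.96 * se
print(round(lower, 2), round(upper, 2))  # 9.02 10.98
```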
Comparing the Means of Two Samples
Often scientists wish to compare two groups to see if the means for each group are the same --- or not. For example, perhaps you wish to test whether the mean value of one group (e.g., the average heart rate of a group of women after 5 minutes of stair-stepping) is statistically different from the mean value of another group (e.g., the average heart rate of a group of men after similar exercise). As another example, you might be interested in determining if people on a particular medication regimen weigh more or less after one week on the medication. Note that the data here (heart rates and weights) are continuous variables. Before providing the procedure for performing the statistical inference for these problems, it is necessary to recognize that there are two distinct designs used in studies that compare the means of two samples. The second example, weight gain on medication, is an example of a paired two-sample design since data are obtained before and after on the same individual. (The paired design can be a bit more general than this; the key ingredient is that there is a direct relationship between an observation on one treatment (before medication) and an observation on the other treatment (after medication).) The first example, comparing heart rates, comes from an independent two-sample design. In this case there is no direct relationship between an observation on one treatment (women) and an observation on the second (men). (If, however, one had as one's sample a set of fraternal twins, one female and the other male, then one could conduct a paired study. In this case there is again a direct relationship between an observation on one treatment (male twin) and an observation on the other treatment (female twin).)
The key is to determine the design corresponding to your data before conducting your analysis. The analysis always follows the design!!
The independent two-sample case
The more common situation is the independent two-sample design, and the formal procedure for comparison leads to a t-test. There are a number of assumptions that must be met in order for this test to be valid:
- The two samples are independent of one another, and the observations within each sample are independent (e.g., each sample is a random sample from its population).
- The data in each group are approximately normally distributed.
- The two populations have (approximately) equal variances.
(Researchers must design their experimental data collection protocol carefully to satisfy these assumptions of independence.)
Performing the t-test
It is useful to formally state the underlying hypothesis for your test. Using the notation from the previous section with μ representing a population mean, there are now population means for each of the two groups: μ1 and μ2. The null hypothesis (H0) is always that the two population means are equal. In most cases you will design experiments to test hypotheses which, if supported, would lead you to reject (not accept) H0. The logic underlying the t-test is that you need the data to provide evidence against the null. We formally state the null hypothesis as:

H0: μ1 = μ2
The standard alternative hypothesis (HA) is written:

HA: μ1 ≠ μ2
This states that the two means differ in one direction or the other. Most of the experimental hypotheses that scientists pose are alternative hypotheses. Sometimes the alternative can be “one-sided”, for example HA: μ1 > μ2, which indicates that the null is rejected only if the mean of the first group is larger than the mean of the second. In scientific practice, it is best to use the two-sided alternative in most cases. Only if there is a strong and compelling argument for using a one-sided alternative, presented before the experiment is conducted, should a one-sided alternative be used.
To calculate the t-statistic and associated probability for two samples:
1. Make sure that the underlying assumptions for the test are met.
2. Compute the mean and variance (s²) for each of the two samples.
3. Compute a pooled variance as follows:

sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)

where n1 and n2 are the sample sizes for the two groups and s1² and s2² are the variances for the two groups. (Note that we pool variances and not standard deviations!!)
4. Compute the t-statistic for the null hypothesis using the equation:

t = (ȳ1 − ȳ2) / √[ sp² (1/n1 + 1/n2) ]
Every t-test has associated with it a value of degrees of freedom (df). (This is related to the discussion of the appropriate multiplier to use in the computation of the confidence interval as discussed above.) For this t-test: df = n1 + n2 − 2.
5. After you have determined the t-statistic and the degrees of freedom, look up the associated probability of a Type I error (the p-value) in the t-table of this appendix. The p-value is the probability of obtaining a value of t as extreme or more extreme than was observed given the null and alternative hypotheses. To find the p-value, look in the column headed by “df” and find the value appropriate to your problem. Then read across to the right to find the two columns between which the t-statistic you have calculated falls. Now record the p-value (see column headings) of the two columns. For example, for a two-sided alternative, if you calculated a t-value of 2.05 and you have df=12, then 0.05<p-value<0.10.
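The computational steps above (steps 2–4) can be sketched end-to-end in Python. The two groups of heart rates below are invented purely for illustration; this sketches the arithmetic only and is no substitute for checking the assumptions in step 1:

```python
# Steps 2-4: group means/variances, pooled variance, t-statistic, df.
from math import sqrt
from statistics import mean, variance

group1 = [68, 72, 75, 70, 74, 71]   # hypothetical heart rates (bpm)
group2 = [75, 78, 74, 80, 77, 76]   # hypothetical heart rates (bpm)

n1, n2 = len(group1), len(group2)
s2_1, s2_2 = variance(group1), variance(group2)   # step 2 (n - 1 divisor)

# Step 3: pool the variances (never the standard deviations!)
s2_p = ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / (n1 + n2 - 2)

# Step 4: t-statistic and its degrees of freedom
t = (mean(group1) - mean(group2)) / sqrt(s2_p * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

print(round(t, 2), df)   # -3.64 10
```

With df = 10, this t would then be compared against the t-table as described in step 5.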
Interpreting the p-value
By convention in many biological science disciplines, a p-value ≤ 0.05 indicates that the difference between your two sample means is statistically significant. By this convention, if your t-score results in a p-value that is less than or equal to 0.05, then you should reject your null hypothesis H0. The smaller the p-value, the more evidence there is against the null hypothesis. Even if you use a threshold of 0.05 for “statistical significance”, it is good practice to report the p-value. Scientifically, the difference in data values that produces a p-value of 0.048 versus 0.052 is minuscule, and it is bad practice to over-interpret the decision to reject the null or not. Scientifically, the difference between a p-value of 0.048 and 0.0048 (or between 0.052 and 0.52) is very meaningful, even though such differences do not affect conclusions on significance at 0.05.
If your data indicate that there is a statistically significant difference, always ask yourself if the difference is biologically meaningful. Statistical significance does not necessarily imply that the results are biologically meaningful.
Why is the 0.05 level so important here? You can think of this as a threshold in the following way. If your null hypothesis (that the treatment means are the same) is true, you are willing to accept that you will reject the null hypothesis by chance with a probability of 0.05. You would be making an error --- by falsely rejecting a true null hypothesis --- in this case with the stated probability. Many scientists are willing to accept this kind of error, called Type I error, about 5% of the time. You will often see the threshold value of 0.05 referred to as the α-value in the scientific literature.
Reporting the results of t-tests
When reporting t-test results, provide your reader with the sample means for each group and a measure of variation for each, the t-statistic, degrees of freedom, p-value, and whether the p-value (and hence the alternative hypothesis) was one or two-tailed. Here is an example of how you could concisely report the results of an independent t-test comparing mean heart rate in a sample of men and women:
“Females had a slightly lower heart rate (mean = 95.2 bpm, SD = 6.8) than men (mean = 96.8 bpm, SD = 4.8). This difference was not, however, statistically significant (t(18) = 0.63, p = 0.54, two-tailed).”
The number 18 in parentheses after the t represents the degrees of freedom.
The paired two-sample case
As explained above, the first step in performing a two-sample comparison is to determine whether the design is independent or paired. Our earlier example for the paired case focused on comparing the weight of a medication user before and after one week on the medication. In paired designs we are typically interested in the change or response elicited by the treatment. By looking at the difference between the before and after values, we focus on the changes within subjects (organisms) because we have “removed” the variation that normally is present between subjects. By lowering the variability in the sample, we make it more likely that we can detect any effect of the treatment, if present. Thus, we have reduced what was originally a two-sample problem - a comparison of the mean of the before values with the mean of the after values - to a one-sample problem, a comparison of the before-after difference with a constant value (usually zero). Indeed, an analysis of such data using the methods for independent samples would be incorrect!! This is because the first bulleted assumption for the independent sample procedure is not met.
Paired designs can be more general than “before” and “after”. For example, suppose you only had one week to test the effect of depleted nutrient medium on the change in leaf size of fast plant leaves. You notice that there is a great deal of variation in the size of the seedlings you can use in your experiment, however, so you are concerned that the variation within samples would obscure any differences in the mean increase in leaf area over just one week. You could control variation between seedlings, however, by carefully matching individual plants on such characteristics as age, shoot length, etc. You would then subject one member of the pair to normal nutrient medium and the other to a depleted medium over the same 7-day period. Leaf area is measured after 7 days for several of these pairs, and you would then test whether the mean leaf area difference between "with" and "without" normal nutrient medium pairs is significantly different from zero.
The appropriate analysis of a paired design again leads to a t-test. However, the t-test will have a different (and actually simpler) form. The key is “reducing” the data for each pair to a single value – the difference. Thus, if yb and ya represent the weights before and after medication (or the leaf areas of a pair of similar plants), we define the difference d by d = ya − yb. Our test will only depend on the values of d. Again, there are assumptions that underlie this test:
- The pairs (and hence the d values) are independent of one another.
- The d values are approximately normally distributed.
(We need make no assumptions directly about the “y” values. The key to a useful paired design is removing the variability --- in this case the differences among individuals subject to the medication --- by looking at the difference between the “after” and “before” measurements within an individual.)
Performing the paired t-test
Again it is useful to formally state the underlying hypothesis for your test. Using similar notation to before,

H0: μd = 0    versus    HA: μd ≠ 0

where μd is the mean of the population of difference (d) values. The same choices regarding one-sided and two-sided alternatives are possible here.
To calculate the t-statistic and associated probability for a paired sample:
1. Make sure that the underlying assumptions for the test are met.
2. Compute, for each experimental unit (organism), the difference (d) between the paired values (e.g., the before and after values). You now have a single sample made up of the d value from each pair (organism). Note: It does not matter whether you subtract the first value in the pair from the second, or vice versa (as long as you keep track of the sign), but it is usually easier to choose the order that you believe will give you a positive difference.
3. Compute the mean (d̄) and the standard deviation (sd) of the sample of d values.
4. Compute the t-statistic for the null hypothesis using the equation:

t = d̄ / (sd / √n)

where n is the number of pairs. For this t-test: df = n − 1.
5. Now that you have the t-statistic and the degrees of freedom, look up the associated probability of a Type I error as you did for the independent sample case. The interpretation of the p-value is very much as before.
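The paired calculation above can be sketched in Python as well. The before/after weights are invented for illustration:

```python
# Paired t-test steps: differences d, their mean and SD, then
# t = d_bar / (s_d / sqrt(n)) with df = n - 1.
from math import sqrt
from statistics import mean, stdev

before = [70.1, 82.4, 65.0, 90.3, 74.8]   # hypothetical weights (kg)
after  = [70.9, 83.0, 65.2, 91.1, 75.0]   # same subjects, one week later

d = [a - b for a, b in zip(after, before)]  # step 2: within-subject differences
d_bar = mean(d)                             # step 3: mean difference
s_d = stdev(d)                              # step 3: SD of the differences
n = len(d)

t = d_bar / (s_d / sqrt(n))                 # step 4
df = n - 1

print(round(t, 2), df)   # 3.83 4
```

Notice that once the data are reduced to the d values, the computation is a one-sample problem, exactly as described above.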
Ambrose, H.W. III and Peckham Ambrose, K. 1995. A handbook of biological investigation, 5th ed. Hunter Textbooks, Winston-Salem, North Carolina.
Gravetter, F.J. and Wallnau, L.B. 2000. Statistics for the behavioral sciences, 5th ed. Wadsworth Thomson Learning, Belmont, California.
Nordheim, E.V. and Clayton, M.K. 1997. Course notes for statistics/forestry/horticulture 571. Department of Statistics, University of Wisconsin-Madison.
Sokal, R.R. and Rohlf, F.J. 1981. Biometry, 2nd ed. W.H. Freeman and Company, New York.
The t Distribution, from page 693 in Gravetter & Wallnau (2000)
This semester you will subject your data to various statistical tests and generate graphs to represent the results of these analyses. Before running any statistical tests, graph your raw data first using “exploratory” graphs to get a feel for the normality of your sample and to understand what the results show. Raw data are typically not used for graphs presented in a talk or paper. Instead, use summary graphs whenever possible to show the main results of the experiment. Use a table only if you cannot think of an appropriate graph.
The purpose of this handout is to introduce you to some Excel summary statistics and graphing functions. Before you begin, tell your Excel program to give you the "Data analysis" option under the "Tools" option. To do this, choose Tools-->Add-ins, then check the "Analysis ToolPak" box.
To sort data from least to greatest choose (Data-->Sort). Data can also be arranged according to your unique variable categories (e.g., gender, chronological time, etc.)
To have Excel compute some summary statistics, choose (Tools-->Data Analysis-->Descriptive Stats). Check the "labels in first row" box to keep your column headings. Also check "summary statistics" and “confidence levels for mean, 95%.” To find out the mean, median, mode, variance, range, standard deviation, and standard error for a particular variable, click on the Input Range icon, select the appropriate column, and then "OK." Your output will be placed on a new sheet. Is the mean reported a sample mean or a population mean?
Calculating Standard Deviation & Standard Error
To calculate the standard deviation of a data set, first click on an empty cell in which you want the standard deviation to be displayed. Choose the fx (insert function) icon and select “STDEV.” You’ll have to click on the Input Range icon again to select the data, without the label in the first row. Because there is no built-in function to calculate standard error, you'll have to choose another empty cell and type in the formula yourself as:
=STDEV(data set cell range)/SQRT(your sample size)
You can check your calculated SD and SE values against those generated in a summary statistics output file.
95% Confidence Intervals
For sample sizes greater than about 30, the true population mean can be found within the range described by the sample mean +/- 2SE. We often don’t have the time or resources in Biocore to achieve samples of this size. For smaller sample sizes, we can instead compute the actual 95% confidence interval (CI) using a t-statistic from the t-table presented earlier in this appendix. The 95% CI is simply calculated as the sample mean +/- t × SE, where t is the two-tailed 0.05 value from the table with df = n − 1.
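A sketch of this small-sample CI in Python, reusing the lizard example from earlier in the appendix. The mean of 10.0 cm is an assumed value for illustration, and 2.131 is the two-tailed 0.05 t-value for df = 15 from a standard t-table:

```python
# Small-sample 95% CI: mean +/- t * SE, with t from the t-table at df = n - 1.
from math import sqrt

y_bar, s, n = 10.0, 2.0, 16  # hypothetical mean; s and n from the lizard example
t_crit = 2.131               # two-tailed 0.05 t-value for df = 15 (t-table)
se = s / sqrt(n)             # 0.5 cm

lower = y_bar - t_crit * se
upper = y_bar + t_crit * se
print(lower, upper)          # roughly 8.93 to 11.07 cm
```

Note that this interval is wider than the large-sample 1.96 version, as expected for a small n.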
The number of significant digits in data cells should be no greater than the number of significant figures in the data point(s) with the least number of significant figures. For example, the mean of 1.47, 1.33 and 1.4 should be reported as 1.4. The same convention is used when reporting the standard deviation and standard error of a sample. You can use the "decrease decimal" icon in the upper right of the tool bar to do this.
You'll be using graphs often this semester. Graphs will show the main points of your experiment; use a table only if you cannot think of an appropriate graph. Before subjecting your data to any statistical tests, ALWAYS GRAPH YOUR RAW DATA FIRST!!! Recall that one of the assumptions of t-tests is that your data are more or less normally distributed; you must check this assumption before proceeding with a t-test analysis. Using a few exploratory graphing techniques will help you to get a feel for the normality of your sample and to understand what the results show. Becoming familiar with general trends in your data is particularly important when results of statistical tests are counter-intuitive. For example, this may signal a simple error you made in some calculation or data entry.
To generate histograms choose (Tools-->Data Analysis-->Histogram). Select chart output to display your histogram. You can set the numerical data ranges or “Bins” yourself or let Excel generate them automatically. For example, a Bin column containing the numbers 3, 6, and 9 will generate a histogram showing the frequency of data points falling at or below 3, above 3 up to 6, and above 6 up to 9 (plus a “More” bin for anything larger). A general rule of thumb for the number of bins in your histogram is to set #bins = square root of your sample size.
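The bin-count rule of thumb, as a quick Python sketch (the sample size of 50 is hypothetical):

```python
# Rule of thumb: number of histogram bins ~ square root of the sample size.
from math import sqrt

sample_size = 50                   # hypothetical number of data points
n_bins = round(sqrt(sample_size))  # sqrt(50) = 7.07... -> 7 bins
print(n_bins)                      # 7
```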
To set your own Bins, create a new column in your spreadsheet that defines the maximum values for each data range. After selecting the histogram option, choose “Bin range” and then highlight the numbers in the Bin column. After the histogram appears, make it bigger to examine your data more closely. Do your data appear to be approximately normally distributed? What numerical ranges do most data points fall into? How does the histogram look when you change the Bin values? (To modify a graph once it's been generated, select the graph, then double click on the area you'd like to edit.)
See the Writing Manual, “Producing Figures Using Microsoft Excel” for directions on generating bar graphs, line graphs, and creating error bars around mean values. Try displaying your bar columns and lines with different colors and textures. Label axes appropriately. Make the background on each graph white instead of the default gray (it consumes too much ink when printed.) If you do not have access to a color printer, you'll need to think about how to best display your data in black, white and gray tones. Save each chart in a new sheet.
At some point you may want to look at how change in an independent continuous variable affects a dependent variable (e.g., how increases in external room temperature affect heart rate). An XY scatter plot is useful for depicting such relationships. (Chart Wizard-->XY scatter). Edit your graph such that your data points fill the entire graph area, axes are appropriately labeled, etc... Keep in mind that any charts you insert into your research papers this semester will have figure legends, and thus will not need titles. You should, however, insert titles on charts used in oral presentations.
Testing Differences in Sample Means
Often this semester you will wish to find out whether there is a statistically significant difference between the means of two or more independent samples. If your data have satisfied the Student's t-test assumptions, use this test to determine whether there is a significant difference between two independent sample means. To have Excel compute the t-statistic, select Tools-->Data Analysis-->t-test (Two sample assuming equal variances).
A paired t-test is appropriate when we are focusing on the changes in a particular variable within subjects, i.e., when data are paired. To have Excel compute the paired t-statistic, select Tools-->Data Analysis-->t-test (Paired two sample for means). In this Statistics Appendix you’ll find a detailed handout to further help you with reporting and graphing the results of independent and paired t-tests.