*********************************************************
Basic Statistics and the Idea of Resampling
*********************************************************

Here we explore how to use MATLAB to perform several common statistical analyses, specifically to estimate parameters of a population from data gathered from a sample of that population, by means of Resampling, Bootstrap and Permutation methods. One common example would be to estimate the population mean, median or quartiles of a numerical variable such as the weight of a student at Bellevue College registered on the 10th day of the current quarter.

We take a random sample of size n of individuals from a population of (usually) much larger size and calculate various statistics associated with the sample. We use these statistics to estimate the parameters we would like to know. The word ``random'' is absolutely critical. No statistical technique will give reliable information unless the sample is random, and the gold standard here is the ``simple random sample'' or SRS. A sample is an SRS of size n if it was taken by a procedure for which every subset of the population of size n was equally likely to have been the one selected. We will not investigate how to do this here. The details vary depending on the population.

Classical statistical practice is based on mathematical theorems from probability leading to many different formulas, and if the assumptions of these theorems hold for the population the desired results necessarily follow. However ... it is usually impossible to know if the assumptions of these theorems hold, and in many cases it is known they do not hold. Still, statisticians have learned that the CONCLUSIONS of the theorems are often still ``true enough'' when sample sizes are ``large enough'' or some of the assumptions are ``close enough'' to true. That is the foundation of standard statistical practice: a collection of rules to follow that these practitioners have learned give practical information about population parameters in many specific and common situations. It is an art. As far as a typical user is concerned, say a scientist trying to understand the implications of an experiment, following standard practice means the editors of journals will recognize that things are being done ``correctly'' and may publish the paper. Otherwise they may not, and colleagues might criticize the work.

In recent years advances in computing power have made certain competing techniques practical, techniques which are much simpler to understand in various ways but which are computationally expensive---too expensive to use during the time statistical practice was being codified. The words ``resampling'' and ``bootstrap'' and ``permutation'' and ``jackknife'' are associated with these methods. These techniques make full use of the data gathered, make fewer (possibly questionable) assumptions about the data and minimize the use of cookbook formulas. It is the purpose of this section to indicate how to use MATLAB to perform some of these calculations. We will make no comparison of the performance of these techniques with standard statistical practice: those practices and that comparison await you in an actual statistics course. Also, as you might expect, there are numerous tweaks and improvements and special cases (such as Bootstrap-t and BCa confidence intervals, found in the MATLAB Statistics and Machine Learning Toolbox) that go beyond what we look at here. We just perform the first and simplest versions of some calculations to give the general idea and illustrate some immediately-useful techniques.

Note: no method can give useful results if the data size is too small. A sample size of 30 is a common minimum, but 50 or 100 might be better. These methods are all based on the idea that the shape of the histogram of the randomly collected data looks, more or less, like the distribution of the population. For smallish data sets this could be far from true.

So we will collect a random sample of size n from the population and use it to estimate a population parameter. We will assume here that the parameter is the population mean mu of a numerical variable, though the same techniques work for other parameters such as the median or standard deviation or the quartiles or the IQR. (In fact, they may actually work BETTER for some of these particular parameters.)

*********************************************************
The Bootstrap Distribution:
*********************************************************

We first create the bootstrap distribution. We do this by taking a very large number of samples DRAWN FROM THIS ORIGINAL RANDOM SAMPLE, not the whole population. A bootstrap sample is a random sample WITH REPLACEMENT from the original sample. It is to be of size n, the same size as the original sample. There may be repeated values in a single bootstrap sample, and some values left out, but all selected values will be taken from the original sample. We calculate the parameter value (in this case the mean) on each bootstrap sample and determine the distribution (use, say, a histogram to look at it) of these numbers from a huge number of bootstrap samples. This is the bootstrap distribution.

It is a fact, and this is the key idea of the discussion here, that this distribution will usually have a very similar shape and spread to the complete sampling distribution of the parameter in question (the distribution of that statistic over all samples of size n from the original population), though the mean of the bootstrap distribution will be at xbar and not at the unknown population mean mu, as in the actual sampling distribution of the parameter.
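To make this concrete, here is one way the bootstrap distribution of the mean might be computed in base MATLAB. This is a minimal sketch: the names (x, B, bootMeans) are our own choices, and the sample x is simulated here only so the code runs on its own; in practice x would hold your actual data.

    % A minimal sketch of building a bootstrap distribution (assumed names).
    x = 150 + 10*randn(50,1);   % stand-in for the original random sample
    n = length(x);              % the original sample size
    B = 10000;                  % number of bootstrap samples to draw

    bootMeans = zeros(B,1);
    for k = 1:B
        idx = randi(n, n, 1);          % n indices chosen WITH replacement
        bootMeans(k) = mean(x(idx));   % the statistic on this bootstrap sample
    end

    histogram(bootMeans)               % look at the bootstrap distribution
    title('Bootstrap Distribution of the Sample Mean')

The loop can be vectorized (for instance, mean(x(randi(n,n,B)),1) builds all B bootstrap samples at once), but the loop form makes the resampling step explicit. To bootstrap a different parameter, replace mean with median or std, say.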
*********************************************************
Confidence Intervals:
*********************************************************

In Bootstrap Confidence Interval calculations we leave the bootstrap samples where they are, which means the mean of the bootstrap distribution will be xbar. We calculate the endpoints of an interval that captures a certain percentage of these bootstrap means. For instance a 95 percent confidence interval will have all but 2.5 percent of the bootstrap means to the left of its right endpoint, and all but 2.5 percent to the right of its left endpoint. We hope that this interval will ``capture the true population mean mu 95 percent of the time.''

If this interval is symmetric around xbar, the hope is justified as follows: the bootstrap distribution will probably have shape and variation very similar to the sampling distribution of samples of size n from the whole population, and IF THIS IS TRUE the same interval shifted to have center at the true parameter value mu would capture xbar in 95 percent of the possible random samples from this population. Our initial sample is one of these. This interval will, in any case, at least give us a sense of where the population mean is likely to be.

However if the interval is NOT symmetric we have a problem. We don't know whether mu is to the right or to the left of xbar, and so an interval relative to mu which contains xbar might not contain mu when shifted by xbar-mu. This could happen if xbar is on the ``shorter side'' of an asymmetric sampling distribution interval. We could try to cure this problem by increasing the length of the shorter side to match the longer, a conservative option. There are better ways which we don't consider here. The ``bias corrected and accelerated'' BCa confidence interval addresses problems with bias and skewness, and this technique is built into the bootci function, found in the Statistics and Machine Learning Toolbox.
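Continuing the same sketch, the simplest ``percentile'' confidence interval just reads these cutoffs off the sorted bootstrap means (bootMeans and B carried over from the sketch above):

    % Percentile bootstrap confidence interval (a minimal sketch)
    sortedMeans = sort(bootMeans);
    alpha = 0.05;                                 % for a 95 percent interval
    lo = sortedMeans(round(B*alpha/2));           % about the 2.5th percentile
    hi = sortedMeans(round(B*(1 - alpha/2)));     % about the 97.5th percentile
    CI = [lo, hi]                                 % the percentile interval

If you have the Statistics and Machine Learning Toolbox, the bootci function just mentioned packages all of this, with the BCa correction, into one call: ci = bootci(10000, @mean, x) produces a 95 percent BCa interval for the mean by default.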
*********************************************************
P-Values and the Hypothesis Test:
*********************************************************

We now discuss the idea of a hypothesis test. Let's suppose a sub-population under study has a numerical parameter whose value is expected to differ from that of a different or larger ``comparison'' population, whose value is known to be m. We will assume here that m is the comparison population mean. In a hypothesis test we assume that the sub-population has the same mean m (this is called the Null Hypothesis) and go out and collect a sample as above, which has mean xbar. Most likely xbar will differ from m, at least a little, and so will always constitute some evidence that the mean is NOT m. But is the evidence negligible or is it strong? If the xbar we got is HIGHLY UNLIKELY under our null assumption, the hypothesis that got us there is highly questionable.

The P-Value of the test statistic (your particular xbar) is the probability that you would have gotten the result you actually did get, or an xbar even farther away from m, by chance variation if the Null Hypothesis is true. A tiny P-Value provides support FOR the Alternative Hypothesis, that the population from which the sample was drawn DOES NOT have the same mean as the larger population, and is strong evidence AGAINST the Null Hypothesis.

Sometimes we calculate the P-Value that the population mean in question simply differs from m. This is called a 2-sided alternative hypothesis, because we start the whole project with no opinion about which way the mean should differ, only that it does differ. Sometimes we calculate the P-Value that our population mean exceeds m, or that the population mean is smaller than m. These are called 1-sided alternative hypotheses, chosen when we start the study with an opinion about which way the mean should differ from m. In a medical experiment, for instance, after animal studies indicate promise we might want to collect evidence that a treatment extends the life of human subjects with a disease beyond what would be expected without treatment. Or with a treatment of dubious merit, such as the numerous ``naturopathic'' treatments with which the gullible dose themselves, a researcher might start by wondering if the treatment had any effect at all, beneficial or harmful.

It is a good idea to meditate on the following flaw of hypothesis testing: just because a P-Value is microscopic (the data is said to be statistically significant in this event) does not mean that xbar differs from m in any PRACTICAL sense! You MUST look at the actual data values, and Confidence Intervals help you decide this.
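One simple way (not the only way) to carry out such a test by resampling is to shift our sample so that the Null Hypothesis is exactly true, bootstrap the shifted sample, and count how often chance alone produces a mean at least as far from m as the xbar we actually observed. A sketch, reusing x, n and B from before; the value of m is an arbitrary stand-in:

    m = 150;                  % the Null Hypothesis mean (illustrative value)
    xbar = mean(x);
    xShift = x - xbar + m;    % shifted sample: its mean is exactly m

    shiftMeans = zeros(B,1);
    for k = 1:B
        idx = randi(n, n, 1);               % resample WITH replacement
        shiftMeans(k) = mean(xShift(idx));
    end

    % 2-sided P-Value: the fraction of resampled means at least as far
    % from m as our actual xbar
    P = mean(abs(shiftMeans - m) >= abs(xbar - m))

For a 1-sided alternative, drop the absolute values and count only the relevant tail, for instance mean(shiftMeans - m >= xbar - m) when the alternative is that the mean exceeds m.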
*********************************************************
Comparing Two Populations:
*********************************************************

We also will be interested in samples taken from two different populations. We will want to know if there is strong evidence that a common parameter differs in the two populations, or not. Hypothesis tests and confidence intervals are useful here too. The accompanying m-files are extensively ``commented'' with further discussion in this context.
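As a final sketch, here is a permutation test (one of the methods named at the start of this section) of the Null Hypothesis that the two populations have the same mean. If the Null Hypothesis is true, the labels ``sample 1'' and ``sample 2'' on the pooled data are arbitrary, so we shuffle the pool many times and see how often a random relabeling produces a difference of means as large as the one observed. The data x1 and x2 are simulated stand-ins:

    x1 = 150 + 10*randn(40,1);    % stand-in for the sample from population 1
    x2 = 155 + 10*randn(35,1);    % stand-in for the sample from population 2
    n1 = length(x1);
    pool = [x1; x2];              % pooled data: labels arbitrary under the Null
    N = length(pool);
    obsDiff = mean(x1) - mean(x2);

    nPerm = 10000;                % number of random relabelings
    permDiffs = zeros(nPerm,1);
    for k = 1:nPerm
        s = pool(randperm(N));    % shuffle the pooled values (no replacement)
        permDiffs(k) = mean(s(1:n1)) - mean(s(n1+1:end));
    end

    % 2-sided P-Value for a difference in means
    P = mean(abs(permDiffs) >= abs(obsDiff))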