As was the case with proportions, we can use either a bootstrap procedure or a parametric procedure to construct a confidence interval for a population mean. The bootstrap is a relatively new method, and even once it was developed, it took some time for computing power to get to the point where it was easy to construct bootstrap confidence intervals. Prior to that, there were methods that were more mathematical in nature and took advantage of the fact that the sampling distribution of the sample mean is often nearly normally distributed under certain conditions. This is again a consequence of the Central Limit Theorem that helped us when we were working with proportions. We'll now look at parametric confidence intervals, which were developed before the bootstrap was even understood and are still used quite commonly; many confidence intervals in practice today still use the parametric method. Learning about both the bootstrap and the parametric method is quite useful.

We're going to first try to get a better intuition about the distribution of the sample mean. We'll do that in the context of a couple of datasets. The first dataset looks at the annual cost of college attendance for public universities in the United States; this data is from 2020. This is the population distribution of 500 public universities and their annual cost of attendance. We want to describe the population data, which, remember, in most situations we wouldn't have access to. But in this case we want to be working in an unrealistic setting where we do have access, so we can compare the population distribution to the sampling distribution of the mean.

We're going to describe the population distribution, so let's go back and take a look at it. It looks like it's a little bit skewed to the right. It looks like the center is, I don't know, 23,000 or something like that, and it looks like the values go from about 10,000 to 40,000, we could say. So: it's slightly skewed to the right, the mean value is approximately 23,000, and the values range from about 10,000 to 40,000. Formally, the range is the difference between the maximum and the minimum value; in this case, the approximate range would be about 40,000 minus 10,000, which is about 30,000. From a histogram, you can't determine the mean or the range very precisely, but we can get a good sense of what they are.

Now we want to think about what the distribution of means, let's say of 5 observations, or 20 observations, or 40 observations from this dataset, would look like. Next we're going to investigate the distribution of the mean. We'll use an app that helps facilitate this; the URL for the app is at the top of the slide. We'll go to that app, take a look at samples of size 5, of size 15, and of size 30, and record the shape of the distribution, the approximate mean of the distribution, and the approximate range of the distribution, and think about how sample size affects these three characteristics. Here's the app that will help us get a better sense of the sampling distribution of the mean and how the sample size affects it. Here's the population distribution we're looking at: the annual cost of college attendance in 2020 for 500 public universities in the US. Now we'll start with a sample of size five; we were asked to look at 500 such samples.
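As an aside, the description we just gave by eye from the histogram (roughly where the center is, and the range as maximum minus minimum) is easy to compute directly. Here's a minimal sketch in Python; since the actual 2020 data file isn't included here, it uses a simulated stand-in population whose values are hypothetical and chosen only to roughly resemble the histogram described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the population: 500 right-skewed "annual cost" values.
# These numbers are hypothetical, chosen only to roughly resemble the
# histogram described above (centered near 23,000, about 10,000-40,000).
cost = np.minimum(10_000 + rng.gamma(shape=6, scale=2200, size=500), 40_000)

print(f"mean  = {cost.mean():,.0f}")
print(f"min   = {cost.min():,.0f}")
print(f"max   = {cost.max():,.0f}")
print(f"range = {cost.max() - cost.min():,.0f}")  # range = maximum - minimum
```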
And basically what we're going to be doing with this app is asking it to, 500 times, take a sample of size five from this population distribution, compute the sample mean, and then show us what those 500 sample means look like with a histogram. Let's do that. If I look at the distribution of the sample means compared to the distribution of the population, the mean of both distributions is very close, still around 23,000. I notice that the sample-mean distribution seems to be a little bit more symmetric, still maybe a little bit skewed to the right, but seemingly more symmetric. I also notice that it is less spread out: the population distribution ranges from around 10,000 to 40,000, and this one ranges over a smaller set of values.

Okay, we'll do that again for a sample of size 15 and then a sample of size 30, to see how those values change. With a sample size of 15, I'll click Draw Samples again. We see the mean of both distributions is at about the same place here. This distribution of the means now looks pretty symmetric. Again, it is narrower than the distribution of the population, and if we were to compare with the distribution of the means from samples of size five, we would see that the distribution from samples of size 15 is narrower. Now let's do this for 30. With n equal to 30, I'll click Draw Samples. Again, the mean stays about the same, but now we have an even narrower distribution: less variability, or a smaller standard deviation. Again, it looks pretty symmetric. We can now summarize what we've seen from these three distributions of sample means, from samples of size 5, 15, and 30, and think about how, in particular, the sample size affects those distributions.

Let's summarize what we learned from the app, and in particular, what do we notice as the sample size increases? When we looked at the shape of the distribution for n equals 15 and n equals 30, it looked pretty symmetric; it showed a little bit of skew to the right when n was equal to five. In all cases, the mean was very similar to the population mean, which looks to be around 23,000. The range of the values got smaller as the sample size increased: for n equals 5, we had some sample means as low as 18,000 and some as high as 31,000, whereas by the time we got to n equals 30, we had sample means that ranged between about 20,500 and 25,000. So as the sample size increases, we tend to see more symmetry, a smaller range, and the same mean; the mean doesn't seem to change as the sample size does, but the shape of the distribution in this case does seem to change somewhat, and definitely the range changes quite a lot.

Here is a picture of another simulation of the same thing we did, with the same population data. Bottom left is n equals 5, top right is n equals 15, bottom right is n equals 30. What we're seeing is what we saw in our version: the means are all staying about the same, but the amount of variability, or the range of the data, is decreasing as the sample size increases. Also, the sample means seem to be more symmetric as the sample size increases.

We're going to do the same thing, except starting with a distribution that has a much more severe right skew. Same context, 500 public universities, but rather than the cost, we're going to look at the number of undergraduate students for these public universities, and you can see that this is skewed to the right, and pretty significantly skewed. Let's see what happens if, again, we use the app, and in this case we'll have four sample sizes: n equals 5, 15, 30, and 50.
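Before moving on to the more skewed dataset, here's a minimal sketch of the kind of simulation the app just ran on the cost data: repeatedly draw a sample of size n, compute its mean, and then look at the center and spread of those 500 sample means for n = 5, 15, and 30. The population here is a hypothetical stand-in (the real data live in the app), so the exact numbers will differ, but the pattern should match what we saw: the mean of the sample means stays near the population mean while the range shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the 500-university cost population
# (right-skewed, centered near 23,000); not the app's real data file.
population = np.minimum(10_000 + rng.gamma(shape=6, scale=2200, size=500), 40_000)

n_reps = 500  # the app drew 500 samples for each sample size

for n in (5, 15, 30):
    # Draw 500 samples of size n without replacement from the population
    # and record each sample's mean.
    sample_means = np.array([
        rng.choice(population, size=n, replace=False).mean()
        for _ in range(n_reps)
    ])
    print(f"n = {n:2d}: mean of sample means = {sample_means.mean():8,.0f}, "
          f"range of sample means = {np.ptp(sample_means):8,.0f}")
```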
We'll see whether the mean stays the same across all sample sizes, we'll see whether the range decreases as the sample size increases, and we'll see what happens to the shape as the sample size increases. Here again, we'll use the app with the same dataset, except the variable we want to use will be the number of undergraduate students. This is the very right-skewed distribution we're looking at. We'll start again with a sample size of five, ask for 500 samples, and draw those. What we see is, let's see: first of all, the mean seems to be very similar, maybe a tiny bit smaller, but it seems similar to the population mean. There's less right skew, but it's still pretty obvious in this dataset. Again, the distribution of the sample means with a sample size of five is narrower than the distribution of the original data. That corresponds to what we saw with the less skewed dataset.

Let's now change the sample size from 5 to 15 and do the same thing. We see something similar: again, the mean of the distribution of the original data and the mean of the distribution of sample means are approximately the same. The distribution of the sample means is less skewed, but still skewed to the right when n equals 15, and again it is getting narrower. Let's go to n equals 30. With n equals 30, we draw our samples. I'm not sure whether I should describe this as slightly skewed to the right or as pretty symmetric; it's pretty close to being symmetric, although there are maybe one or two outliers over here that might make us think there's a slight skew. But definitely things are getting more symmetric, and again getting narrower. Again, the mean of the distribution of the raw data is about the same as the mean of the distribution of the sample means.

Finally, we'll go all the way up to a sample size of 50. Here again we see what looks like a pretty symmetric distribution. Again, it's getting narrower. Again, the mean of the two distributions, the red line, is very similar across the original data and the distribution of the sample means. So we're seeing something similar with this skewed dataset, except that it takes a little bit more data, a higher sample size, in order for the distribution of the sample means to not be skewed, to be pretty close to being symmetric.
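Here's a similar minimal sketch for the skewed case, again with a hypothetical stand-in population rather than the real enrollment data. It adds a skewness measure for the 500 sample means at n = 5, 15, 30, and 50, which should drift toward zero as the sample size increases, mirroring the way the histograms became more symmetric in the app.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(2)

# Hypothetical stand-in for the heavily right-skewed population:
# 500 "undergraduate enrollment" counts with a long right tail
# (a lognormal shape; not the real 2020 enrollment data).
enrollment = rng.lognormal(mean=9.0, sigma=0.9, size=500)

n_reps = 500
print(f"population: mean = {enrollment.mean():8,.0f}, skewness = {skew(enrollment):4.2f}")

for n in (5, 15, 30, 50):
    # 500 sample means at each sample size, drawn without replacement.
    sample_means = np.array([
        rng.choice(enrollment, size=n, replace=False).mean()
        for _ in range(n_reps)
    ])
    print(f"n = {n:2d}: mean of sample means = {sample_means.mean():8,.0f}, "
          f"range = {np.ptp(sample_means):8,.0f}, "
          f"skewness = {skew(sample_means):4.2f}")
```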

SampleMeanDistributionIntuition

From Vincent Melfi, October 14th, 2023
