1 Review of Statistical Inference

1.1 Sampling Distribution

Suppose we take a random sample of size \(n\) from a large population of individuals. We record a variable and compute a statistic, like a sample mean or sample proportion, from this data. The sampling distribution of the statistic describes how the statistic is distributed over many random samples:

  1. take a random sample of size \(n\)
  2. record the variable of interest (data) for the \(n\) sampled cases
  3. compute the statistic from the data
  4. repeat 1-3 many, many times
  5. plot the distribution of your statistics from step 4.
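The five steps above can be sketched as a small simulation. This is only an illustration: the population below is made up (exponential values with mean 50), and the names population, n_reps and sample_means are our own choices.

```r
set.seed(1)  # so the simulation is reproducible

# Hypothetical population of 20,000 right-skewed values (made up for illustration)
population <- rexp(20000, rate = 1 / 50)

n <- 30          # sample size
n_reps <- 2000   # number of repeated samples (step 4)

# Steps 1-3, repeated n_reps times: draw a random sample, compute its mean
sample_means <- replicate(n_reps, mean(sample(population, size = n)))

# Step 5: plot the distribution of the statistics
hist(sample_means, breaks = 40,
     main = "Simulated sampling distribution of the sample mean",
     xlab = "sample mean")

mean(population)    # true population mean
mean(sample_means)  # center of the simulated sampling distribution
sd(sample_means)    # spread of the statistic over many samples
```

In practice we only get one sample, so we cannot run this simulation on real data. The point is to see what "distribution over many samples" means.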

1.2 Central Limit Theorem

Under the right conditions, the distribution of many sample means \(\bar{x}\) or proportions \(\hat{p}\) will look like a normal distribution that is centered at the true population parameter (mean \(\mu\) or proportion \(p\)) value.

  • Shape: normally distributed (“bell-shaped”)
  • Center: true population parameter value
  • Variation: called the standard error of the statistic which measures the standard deviation of your statistic over many samples

The conditions for “normality” (symmetry) have to do with both the sample size and the distribution of your variable.

  • If your measurements are very symmetric (e.g. heights of women in an adult population) then the sample size \(n\) can be very small and the sample mean will behave normally.
  • As your measurements get more skewed (e.g. income per person in a large city), then the sample size \(n\) needs to get larger for the sample mean to behave normally.
  • Outliers are always a problem, even if you have a large sample size, since means can be easily influenced by one or two very unusual cases.
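To see how skew and sample size interact, here is a rough simulation sketch. The "income-like" population is hypothetical (lognormal values we made up), and sim_means is a helper we define just for this example.

```r
set.seed(2)

# Hypothetical, strongly right-skewed population (made-up lognormal values)
population <- rlnorm(20000, meanlog = 10, sdlog = 1)

# Helper (our own): simulate many sample means for a given sample size n
sim_means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, size = n)))
}

means_small <- sim_means(5)    # small samples
means_large <- sim_means(200)  # larger samples

# Compare the two simulated sampling distributions side by side
par(mfrow = c(1, 2))
hist(means_small, breaks = 40, main = "n = 5", xlab = "sample mean")
hist(means_large, breaks = 40, main = "n = 200", xlab = "sample mean")
par(mfrow = c(1, 1))
```

With this made-up population, the histogram for n = 5 should still look clearly right-skewed, while the one for n = 200 should look much closer to bell-shaped.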

1.2.1 Standard error

The standard error of the sample mean \(\bar{x}\) measures the variability of this statistic and is the standard deviation of the sampling distribution for \(\bar{x}\). Roughly, the SE of any statistic tells us how it will vary from sample to sample.

For the sample mean statistic, the estimated SE is equal to \[ SE(\bar{x}) = \dfrac{s}{\sqrt{n}} \] where \(s\) is the sample standard deviation of your variable. For the difference of two sample means, \(\bar{x}_1 - \bar{x}_2\), the standard error is \[ SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\dfrac{s^2_1}{n_1} + \dfrac{s^2_2}{n_2}} \] where \(n_i\) is the sample size of sample \(i\) and \(s_i\) is the sample SD of sample \(i\).

For the sample proportion \(\hat{p}\), the standard error is \[ SE(\hat{p}) = \sqrt{\dfrac{p(1-p)}{n}} \] where \(p\) is the true population proportion. To estimate this SE we just use the sample proportion \(\hat{p}\) in the SE formula.
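As a quick check of the three SE formulas above, here is a sketch that computes each one directly. All data values (x, y, and the 42 out of 120 count) are made up for illustration.

```r
# Hypothetical samples (made-up values)
x <- c(12, 15, 9, 14, 11, 13, 16, 10)
y <- c(8, 11, 7, 12, 9, 10)

# SE of one sample mean: s / sqrt(n)
se_mean <- sd(x) / sqrt(length(x))

# SE of a difference in two sample means: sqrt(s1^2/n1 + s2^2/n2)
se_diff <- sqrt(var(x) / length(x) + var(y) / length(y))

# Estimated SE of a sample proportion: plug p-hat into sqrt(p(1-p)/n)
# Hypothetical count: 42 "successes" out of 120 sampled cases
n_prop <- 120
p_hat <- 42 / n_prop
se_prop <- sqrt(p_hat * (1 - p_hat) / n_prop)

se_mean
se_diff
se_prop
```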

1.3 Hypothesis testing

Your null hypothesis \(H_0\) is a specific claim about a population parameter of interest. Examples include:

  • \(H_0: p = 0.5\) (population proportion is 0.50, or 50%)
  • \(H_0: \mu_1 - \mu_2 = 0\) (the mean of population 1 \(\mu_1\) is equal to the mean of population 2 \(\mu_2\))

The alternative hypothesis specifies another, more general, scenario for the population(s). Examples include:

  • \(H_A: p > 0.5\) (population proportion greater than 0.50)
  • \(H_A: \mu_1 - \mu_2 \neq 0\) (the two populations have different means)

You then use a test statistic to measure how far the data (e.g. sample statistic) deviates from what is expected assuming the null is true. Often these test stats take the form of standardized values, so large absolute values indicate data that deviate a lot from what is expected if the null is true.

For a hypothesis about one population mean \(H_0: \mu = \mu_0\), we use a t-test statistic: \[ t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} \] To test for a difference in two means, we use another t-test stat: \[ t = \dfrac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\dfrac{s^2_1}{n_1} + \dfrac{s^2_2}{n_2}}} \]
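Both t-test statistics can be computed "by hand" and compared to what t.test reports. This is a sketch with made-up data (x, y) and a made-up null value (mu0 = 10). Note that t.test with two samples uses the unpooled (Welch) standard error by default, which matches the formula above.

```r
# Hypothetical samples and null value (made up for illustration)
x <- c(12, 15, 9, 14, 11, 13, 16, 10)
y <- c(8, 11, 7, 12, 9, 10)
mu0 <- 10

# One-sample t statistic: (xbar - mu0) / (s / sqrt(n))
t_one <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))

# Two-sample t statistic: (xbar1 - xbar2 - 0) / SE of the difference
t_two <- (mean(x) - mean(y)) /
  sqrt(var(x) / length(x) + var(y) / length(y))

# Compare with the values computed by t.test
t_one
t.test(x, mu = mu0)$statistic
t_two
t.test(x, y)$statistic
```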

We then construct the sampling distribution of the test statistic under the assumption that the null hypothesis is true. Using this model, we compute a p-value by finding the probability of getting a test stat as extreme as, or more extreme than, the observed test stat. The p-value measures the probability of observing data as unusual as, or more unusual than, the observed data if the null is true. Small p-values indicate data that would rarely be seen if the null is true, which means we have evidence (data) that supports the alternative hypothesis.

If we use a t-test statistic, then we use a t-distribution to compute a p-value. If t is the test stat value then either command below will give a one-sided (\(<\) or \(>\)) p-value:

pt(abs(t), df = , lower.tail = FALSE)  # method 1 (gives right tail value)
pt(-abs(t), df = )  # method 2 (gives left tail value)

If your alternative is two-sided, you need to double the value given by either command above.
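Here is a filled-in version of the p-value commands, assuming a made-up test stat of -2.1 with 15 degrees of freedom (both values are hypothetical).

```r
# Hypothetical test statistic and degrees of freedom
t_obs <- -2.1
df <- 15

# Method 1: right tail area beyond |t|
p_right <- pt(abs(t_obs), df = df, lower.tail = FALSE)

# Method 2: left tail area below -|t| (same value, since the t-distribution is symmetric)
p_left <- pt(-abs(t_obs), df = df)

# Two-sided p-value: double the one-tail area
p_two <- 2 * p_right

p_right
p_left
p_two
```

One caution: the one-tail area is the one-sided p-value only when the observed test stat falls on the side that the alternative points to.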

The degrees of freedom depend on the type of estimate:

  • for a one-sample mean problem, use \(df = n-1\)
  • for a two-sample difference in means problem, let technology get the value!
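For the curious, here is a sketch of what "technology" does for the two-sample case. By default, t.test uses the Welch (Satterthwaite) approximation for the degrees of freedom, and the value is stored in the parameter component of the result. The data are made up.

```r
# Hypothetical samples (made-up values)
x <- c(12, 15, 9, 14, 11, 13, 16, 10)
y <- c(8, 11, 7, 12, 9, 10)

# One-sample problem: df = n - 1
df_one <- length(x) - 1

# Two-sample problem: Welch approximation computed by hand
vx <- var(x) / length(x)   # s1^2 / n1
vy <- var(y) / length(y)   # s2^2 / n2
df_welch <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# The same value, pulled from t.test (default is var.equal = FALSE)
df_r <- t.test(x, y)$parameter

df_one
df_welch
df_r
```

You do not need to memorize the Welch formula. It is shown only so the df value in the t.test output is less mysterious.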

1.4 Confidence Intervals

A confidence interval for a population parameter gives us a range of likely values of the parameter. Most confidence intervals we use in this class are of the form: \[ \textrm{estimate} \pm \textrm{ margin of error} \]

The idea behind constructing a confidence interval starts with our estimate’s (statistic’s) sampling distribution and margin of error for some level of confidence:

  • The sampling distribution tells us how an estimate varies from sample to sample around the true population parameter.
  • The margin of error for a “C%” confidence interval is computed so that the estimate will be within the margin of error of the true parameter value C% of the time.
  • Another “confidence” interpretation: of all possible samples, C% will yield a confidence interval (using the C% margin of error) that contains the true parameter value.
  • Examples:
    • E.g. for 95% confidence: 95% of all possible samples will give an estimate that is within the (95% level) margin of error of the truth.
    • E.g. for 90% confidence: 90% of all possible samples will give a 90% confidence interval that contains the truth.
  • Normally distributed estimates: when a sampling distribution is roughly normally distributed, we can approximately say that
    • 95% margin of error \(\approx 1.96 \times SE\)
    • 90% margin of error \(\approx 1.65 \times SE\)
    • 99% margin of error \(\approx 2.58 \times SE\)
  • A higher level of confidence will lead to a larger margin of error: We need a larger margin of error to get a higher fraction of samples that are close to the population parameter.
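The normal multipliers listed above can be recovered with qnorm. This is a small sketch; conf_levels and z_star are names we chose for the example.

```r
# Confidence levels of interest
conf_levels <- c(0.90, 0.95, 0.99)

# For a C% interval we need the upper (1 - (1 - C)/2) quantile of the
# standard normal, e.g. the 0.975 quantile for 95% confidence
z_star <- qnorm(1 - (1 - conf_levels) / 2)

# Multipliers for 90%, 95%, and 99% confidence
round(z_star, 3)
```

The multipliers grow as the confidence level grows, which is the "higher confidence, larger margin of error" idea in numbers.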

The margins of error given above are ballpark values. We will mostly be using a more trustworthy distribution in our class, the t-distribution, to compute how many SEs we need to go out to reach a given level of confidence: \[ \textrm{margin of error } = t^*_{df} \times SE \] where \(t^*\) is the absolute value of the \((100-C)/2\) percentile (equivalently, the \(100 - (100-C)/2\) percentile) from the t-distribution with \(df\) degrees of freedom.

Examples of confidence levels:

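  • for 95% confidence, \(t^*\) is the 2.5th percentile (or the 97.5th), ignoring the sign
qt(.025, df= )   # for 95% confidence (negative value; use .975 for the positive one)
  • for 90% confidence, \(t^*\) is the 5th percentile (or the 95th), ignoring the sign
qt(.05, df= )   # for 90% confidence (negative value; use .95 for the positive one)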

Examples of degrees of freedom:

  • for a one-sample mean problem, use \(df = n-1\)
  • for a two-sample difference in means problem, let technology get the value!
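Putting the pieces together, here is a sketch of a 95% confidence interval for one mean computed by hand and checked against t.test. The data values in x are made up.

```r
# Hypothetical sample (made-up values)
x <- c(12, 15, 9, 14, 11, 13, 16, 10)
n <- length(x)

se <- sd(x) / sqrt(n)             # standard error of the sample mean
t_star <- qt(0.975, df = n - 1)   # positive multiplier for 95% confidence
moe <- t_star * se                # margin of error

# estimate +/- margin of error
ci <- mean(x) + c(-1, 1) * moe
ci

# Same interval from t.test
t.test(x, conf.level = 0.95)$conf.int

# With a small sample, t_star is noticeably larger than the normal value 1.96
t_star
```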

1.5 Review activity (day 2)

1.5.1 General questions

  1. What is the difference between a parameter and a statistic? Give an example of a parameter and a statistic, using “common” notation for each.

  2. Suppose you are told a 95% confidence interval for the proportion of registered voters for Trump just before the 2020 election is 0.38 to 0.44. What does this mean? What is the margin of error for this CI? What does “95% confidence” mean?

  3. A study found that participants who spent money on others instead of themselves had significantly lower blood pressure (two-sided p-value=0.012). What are the hypotheses for this test? What does the p-value mean?

  4. Inference methods rely on understanding the sampling distribution of the statistic that we used to estimate our unknown parameter. What does the sampling distribution of the sample mean look like? What does the standard error of the sample mean measure? How is the sampling distribution used in a hypothesis test? How is it used with confidence intervals?

1.5.2 Comparing two means

The data set agstrat.csv contains a stratified random sample of 300 US counties.

  • Sampling units: counties
  • Strata: region (North Central, Northeast, South, West)
  • response: number of farms in 1992 (farms92)
  • How can we answer the question:
    • Does the average number of farms per county differ between the western and north central regions in 1992?

Our basic plan for analysis looks like:

  1. Data: load and clean/manipulate (if needed)
  2. EDA: exploratory data analysis
  3. Inference: run tests/CIs and check assumptions!
  4. Conclusion: interpret results, avoid lots of stats jargon!

1.5.2.1 Data input

Read in the data and check variable names:

agstrat <- read.csv("http://people.carleton.edu/~kstclair/data/agstrat.csv")
names(agstrat)

Note: Here we are reading this data into R using a .csv that can be accessed with the url above. In the Appendix Data Section C.2, we used the R package SDaA to load this data into R. The one difference between these two methods is how they treat our categorical variables, like region. We get a “character” categorical variable using read.csv while the SDaA package uses “factor” categorical variables. Big picture: It doesn’t matter much how these variables are stored (character vs factor) right now, but some commands like levels only work for factor variable types!
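A tiny sketch of the character vs. factor difference mentioned in the note. The vector region_chr is made up and is not the real agstrat data.

```r
# Hypothetical region labels stored as a character vector
region_chr <- c("W", "NC", "S", "W")

# levels() returns NULL for a character vector
levels(region_chr)

# Convert to a factor to get levels (sorted alphabetically by default)
region_fct <- factor(region_chr)
levels(region_fct)

# table() works the same way for both types
table(region_chr)
table(region_fct)
```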

(1a) How many counties were sampled from each region?

table(agstrat$region)

We want to focus on W and NC for our comparison. I will do a lot of data manipulation in this class using the R package dplyr. This package is already available on the mirage server, but if you are using a newly installed version of RStudio on your computer you will need to install the package first (see Appendix A.4).

The filter command from the dplyr package selects rows from a data set that we want to keep (Section C.2.4). E.g. only rows from the W or NC regions. The argument used in this command is a logical statement and the row of a given case is kept in the data set when the statement is TRUE.

library(dplyr)  # need to load the package commands
agstrat2 <- filter(agstrat, region %in% c("W", "NC"))
table(agstrat2$region)
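To see what the logical statement inside filter is doing, here is a base R sketch on a small made-up data frame (toy is hypothetical, not the real agstrat data). The %in% operator returns TRUE or FALSE for each row, and only the TRUE rows are kept.

```r
# Hypothetical mini data set with the same variable names as agstrat
toy <- data.frame(
  region  = c("W", "NC", "S", "NE", "W", "NC"),
  farms92 = c(500, 800, 650, 300, 450, 900)
)

# One TRUE/FALSE value per row: is the region W or NC?
keep <- toy$region %in% c("W", "NC")
keep

# Base R version of filter(): keep only the rows where the statement is TRUE
toy2 <- toy[keep, ]
table(toy2$region)
```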

1.5.2.2 Exploratory data analysis

(2a) In 2-3 sentences, compare the distributions (shape, center, spread) of farms92 by region (W vs. NC).

Note that the commands below give basic R commands for comparing distributions graphically and numerically. You do not necessarily need to use or show all output below when answering this type of question for homework, exams or reports.

Graphical options (Section C.3.4) include boxplots:

boxplot(farms92 ~ region, data = agstrat2, horizontal=TRUE,
        main="Number of farms (1992) by region",
        ylab = "region", xlab = "# of farms per county in 1992")

or faceted histograms from the ggplot2 package:

library(ggplot2)  # load ggplot2 package
ggplot(agstrat2, aes(x=farms92)) + 
  geom_histogram() + 
  facet_wrap(~region) + 
  labs(title = "Number of farms by region")

Summary stats, by region, can be found using multiple tools in R (Section C.3.3). The tapply command is an option:

tapply(agstrat2$farms92, agstrat2$region, summary)

So is using the group_by and summarize combination in dplyr. Here we get the mean and SD of farms92 for both regions:

agstrat2 %>% 
    group_by(region) %>%
    summarize(mean(farms92), sd(farms92))

(2b) What county has the unusually high number of farms in the western region? How many farms do they have? Any ideas why it is so large? How do the summary stats change when this county is omitted?

Here we find which row in agstrat2 has a farms92 value greater than 3000 (Section C.2.4):

which(agstrat2$farms92 > 3000)   

What county is this? We need to access row 118 in agstrat2. We can use [] subsetting or the slice command from dplyr:

agstrat2[118,]  # see row 118
slice(agstrat2, 118)  # another way to see row 118

We can use the dplyr commands from (2a), with the slice command added, to get stats without row 118:

agstrat2 %>% slice(-118) %>%    # slice removes row 118
    group_by(region) %>%  # then get stats by region
    summarize(mean(farms92), sd(farms92), median(farms92), IQR(farms92))
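Instead of typing the row number by hand, we can store the result of which and reuse it. This is a base R sketch on a made-up data frame (toy and its county names are hypothetical).

```r
# Hypothetical mini data set with one unusually large county
toy <- data.frame(
  county  = c("A", "B", "C", "D", "E"),
  farms92 = c(420, 610, 7021, 380, 555)
)

# Row number(s) with more than 3000 farms
big <- which(toy$farms92 > 3000)
big

# Look at the unusual row(s)
toy[big, ]

# Mean with and without the unusual case
mean(toy$farms92)
mean(toy$farms92[-big])  # caution: only use -big when big is not empty
```

The caution in the last comment matters: if no rows meet the condition, big is empty and negative indexing with it drops every row instead of none.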

1.5.2.3 Inference

(3a) What hypotheses are being tested with the t.test command below? What are the p-value and test statistic values for this test? How do you interpret the test statistic? How do you interpret the p-value? Are the assumptions met for this t-test?

# run t test to compare mean number of farms 
t.test(farms92 ~ region, data = agstrat2)

(3b) How and why do the test stat and p-value change when we omit row 118? Does your test conclusion change when omitting 118? Why?

# redo without row 118 outlier
t.test(farms92 ~ region, subset = -118, data = agstrat2)
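If you want to pull specific numbers out of the t.test output (for example, to quote them in a conclusion), you can save the result and extract its components. This sketch uses simulated data in a made-up data frame called toy, so the numbers will not match the agstrat results.

```r
set.seed(3)

# Hypothetical data: 20 counties in each of two regions (simulated values)
toy <- data.frame(
  region  = rep(c("NC", "W"), each = 20),
  farms92 = c(rnorm(20, mean = 750, sd = 300),
              rnorm(20, mean = 600, sd = 300))
)

# Save the test result instead of just printing it
res <- t.test(farms92 ~ region, data = toy)

res$statistic  # t-test statistic
res$parameter  # degrees of freedom
res$p.value    # two-sided p-value
res$conf.int   # 95% CI for the difference in means (NC minus W here)
res$estimate   # the two sample means
```

The difference is computed as the first group minus the second, where the groups are ordered alphabetically (or by factor level), so check res$estimate to see which direction you are looking at.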

1.5.2.4 Conclusion

(4a) Does the average number of farms per county differ between the western and north central regions in 1992? If they do differ, by how much? Answer these questions in context using numbers from the results above to support your conclusions.
