Goals for this lab.

Setup and packages

As usual, we start by loading our two packages: mosaic and ggformula. To load a package, you use the library() function, wrapped around the name of a package. I’ve put the code to load one package into the chunk below. Add the other package you need.

library(mosaic)
library(ggformula)
# put in the other package that you need here

Loading in data

We’ll load the example data, GSS_clean.csv, from this URL: https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS_clean.csv using the read.csv() function.

#load data
GSS <- read.csv("https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS_clean.csv")

We also need to do a little data cleaning, removing rows with blank responses for our two variables, so that the lab works properly.

GSS <- filter(GSS, should_marijuana_be_made_legal != "")
GSS <- filter(GSS, self_emp_or_works_for_somebody != "")

Research Question

Our research question is whether there is a difference in the proportion of people who said marijuana should be made legal in the two groups of people that are self employed or work for somebody else. We think of the self_emp_or_works_for_somebody as the explanatory variable and should_marijuana_be_made_legal as the response variable.

Our null hypothesis is the proportion of people that believe marijuana should be made legal is the same in the self employed group as it is in the work for someone else group. In other words, there is no association between thinking marijuana should be legal and whether a person works for someone else or is self employed.

Let \(\pi_{selfEmp}\) be the proportion of Self Employed people that think marijuana should be legal and \(\pi_{someoneElse}\) be the proportion of people that work for someone else that think marijuana should be legal.

Our null and alternative hypotheses are

\[H_0 : \pi_{selfEmp} - \pi_{someoneElse} = 0\] \[H_a : \pi_{selfEmp} - \pi_{someoneElse} \neq 0\]

Let’s start by creating a bar chart to visualize the data. We want to graph the two groups, self employed or work for somebody else, and see in each bar those that believe marijuana should be legal and those that don’t.

Here is the most basic bar chart of counts.

gf_bar( ~self_emp_or_works_for_somebody, fill= ~ should_marijuana_be_made_legal, data=GSS)

Here is a bar chart of counts that doesn’t have the counts stacked and instead has them positioned side-by-side.

gf_bar( ~self_emp_or_works_for_somebody, fill= ~ should_marijuana_be_made_legal, data=GSS, position ="dodge" )

Here is a segmented bar graph. Notice that the command has changed to gf_props instead of gf_bar.

gf_props( ~self_emp_or_works_for_somebody, fill= ~ should_marijuana_be_made_legal, data=GSS, position ="fill" )

We create a 2-way table with the command tally to determine the proportion of self employed people that believe marijuana should be made legal and the proportion of people that work for someone else that believe marijuana should be made legal.

Important note: the order of the variables matters! It should be tally(response_var ~ explanatory_var). Be careful, or your proportions will be incorrect.

The first code chunk is a table of counts and the second is a table of proportions.

tally(should_marijuana_be_made_legal ~ self_emp_or_works_for_somebody, data=GSS)
##                               self_emp_or_works_for_somebody
## should_marijuana_be_made_legal Self-employed Someone else
##                      Legal                98          809
##                      Not legal            44          442

We can calculate the sample size for each group: \(n_1\) is the number of people that are self-employed and \(n_2\) the number of people that work for someone else.

n1 = 98+44
n1
## [1] 142
n2 = 809+442
n2
## [1] 1251
tally(should_marijuana_be_made_legal ~ self_emp_or_works_for_somebody, data=GSS, format="proportion")
##                               self_emp_or_works_for_somebody
## should_marijuana_be_made_legal Self-employed Someone else
##                      Legal         0.6901408    0.6466827
##                      Not legal     0.3098592    0.3533173
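The group sizes and proportions above can also be computed from the table of counts instead of being typed by hand. The following is a minimal sketch in base R, assuming the four counts are copied from the tally output; the names `counts`, `n`, and `props` are illustrative and not part of the lab.

```r
# Counts copied from the tally() output above (rows: response, columns: group).
counts <- matrix(c(98, 44, 809, 442), nrow = 2,
                 dimnames = list(response = c("Legal", "Not legal"),
                                 group = c("Self-employed", "Someone else")))

# Column totals give the sample size of each group (n1 and n2).
n <- colSums(counts)

# Column proportions: each column sums to 1, matching format = "proportion".
props <- prop.table(counts, margin = 2)

n
round(props, 4)
```

Working from the table reduces the chance of a copy error when the proportions are reused in later calculations.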

Two Proportion: Validity Conditions for theory-based inference and confidence intervals

Validity Conditions: The theory-based test and interval for the difference in two proportions (called a two-sample z-test or interval) work well when there are at least 10 observations in each of the four cells of the 2 × 2 table.

If we look at the tally of counts, we see that the values in the 2 x 2 table are 98, 44, 809, 442, all of which are greater than 10. So our validity conditions are satisfied.
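The validity check can also be written as a single logical test. This is a small sketch, assuming the four cell counts are entered by hand; `cell_counts` and `conditions_met` are hypothetical names chosen for illustration.

```r
# The four cell counts of the 2 x 2 table, entered by hand.
cell_counts <- c(legal_self = 98, not_legal_self = 44,
                 legal_else = 809, not_legal_else = 442)

# TRUE only if every cell has at least 10 observations.
conditions_met <- all(cell_counts >= 10)
conditions_met

# The smallest cell is the one that decides the check.
min(cell_counts)
```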

Calculate the standardized statistic

Let’s start by finding our observed statistic.

p_diff <- 0.6901408 - 0.6466827
p_diff
## [1] 0.0434581

For two proportions, in a hypothesis test the standard error of the null distribution is given by

\[ SE=\sqrt{\frac{\hat{p}(1-\hat{p})}{n_1}+\frac{\hat{p}(1-\hat{p})}{n_2}} \] where \(\hat{p}\) is the pooled proportion.

Using R as a calculator the pooled proportion is

phat <- (98+809)/1393
phat
## [1] 0.6511127

The standard error is

SE <- sqrt(phat*(1-phat)/(n1) + phat*(1-phat)/(n2))
SE
## [1] 0.04220592

Next, we can calculate the standardized statistic using the formula

\[ z = \frac{\hat{p}_1 - \hat{p}_2 - 0}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n_1}+\frac{\hat{p}(1-\hat{p})}{n_2}}} = \frac{\hat{p}_{diff} - 0}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n_1}+\frac{\hat{p}(1-\hat{p})}{n_2}}}\]

z<- (0.6901408 -  0.6466827)/SE
z
## [1] 1.029668

What does this standardized statistic suggest regarding our hypothesis test?

The standardized statistic is not greater than 2 or less than -2, so we don’t have enough evidence to reject the null hypothesis. It looks like the difference in proportions of people that believe that marijuana should be legal between those that are self-employed and those that work for someone else could be zero.

Next we calculate the theory-based \(p\)-value and a confidence interval for our difference in proportions using prop.test. The code below uses conf.level = 0.95; adjust it to calculate a 90% confidence interval. Note: In the code below we omit the default continuity correction (using the option correct = FALSE) because the counts in all four cells of the two-way table are large. The continuity correction becomes important if one of the cell counts is small, especially if a count is less than or equal to 5.

#inference for two proportions
prop.test(should_marijuana_be_made_legal ~ self_emp_or_works_for_somebody, data = GSS, success = "Legal", alternative = "two.sided", conf.level = 0.95, correct=FALSE)
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  tally(should_marijuana_be_made_legal ~ self_emp_or_works_for_somebody)
## X-squared = 1.0602, df = 1, p-value = 0.3032
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.03708184  0.12399822
## sample estimates:
##    prop 1    prop 2 
## 0.6901408 0.6466827

Interpret the confidence interval: We are 95% confident that the difference in the proportion of people that believe that marijuana should be legal between those that are self-employed and those that work for someone else is between -0.04 and 0.12. Since this confidence interval contains 0, a difference of zero is plausible, so we do not have convincing evidence that the two proportions differ.
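The instructions above mention a 90% confidence interval, while the output shown uses conf.level = 0.95. Here is a sketch of the 90% version, assuming we work from the counts and call base R's stats::prop.test directly rather than the mosaic formula interface; `res90` and `ci90` are illustrative names.

```r
# Successes ("Legal") and sample sizes for the two groups, from the tally above.
res90 <- stats::prop.test(x = c(98, 809), n = c(142, 1251),
                          alternative = "two.sided",
                          conf.level = 0.90, correct = FALSE)

# Extract just the interval for the difference in proportions.
ci90 <- res90$conf.int
ci90
```

The 90% interval runs from roughly -0.02 to 0.11. It is narrower than the 95% interval because it uses a smaller multiplier on the same standard error.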

#USE THIS COMMAND if you only have the counts and not the data
# c(98, 809) are the success counts for the two groups: self employed or works for someone else
# c(142, 1251) are the sample size counts for the two groups
prop.test(c(98, 809), c(142, 1251), alternative = "two.sided", conf.level = 0.95, correct=FALSE)
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c out of c98 out of 142809 out of 1251
## X-squared = 1.0602, df = 1, p-value = 0.3032
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.03708184  0.12399822
## sample estimates:
##    prop 1    prop 2 
## 0.6901408 0.6466827

Does the \(p\)-value from prop.test support the conclusion made with the standardized statistic?
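One way to see the connection is to recompute the standardized statistic from the counts and convert it to a two-sided \(p\)-value with the normal distribution. This is a sketch in base R with illustrative variable names; it relies on the fact that, without continuity correction, the X-squared value reported by prop.test is the square of \(z\).

```r
x <- c(98, 809)     # "Legal" counts: self-employed, someone else
n <- c(142, 1251)   # group sample sizes

p_hat  <- x / n               # observed proportions
p_pool <- sum(x) / sum(n)     # pooled proportion under the null

# Null standard error uses the pooled proportion in both terms.
se_null <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))

z <- (p_hat[1] - p_hat[2]) / se_null
p_value <- 2 * pnorm(-abs(z))   # two-sided p-value

c(z = z, z_squared = z^2, p_value = p_value)
```

Here z squared is about 1.06 and the \(p\)-value is about 0.30, in agreement with the prop.test output.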

2SD Confidence Intervals, Difference of proportions

To find confidence intervals for a difference of proportions, we start by computing the standard error. Recall that the formula is different based on whether we’re doing a confidence interval or a hypothesis test. This is because a hypothesis test has a hypothesized value for the parameter and a confidence interval does not.

For two proportions, the standard error for a confidence interval is given by \[ SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \] This formula is slightly different than the formula used above because we have no hypothesized value for the difference in proportions, so we use our observed proportions \(\hat{p}_1\) and \(\hat{p}_2\) instead of the pooled proportion. Thus, our margin of error (MOE) is given by

p1<- 0.6901408
p2<- 0.6466827
# margin of error = 2 * SE, using each group's own proportion
MOE <- 2*sqrt(p1*(1-p1)/(98+44) + p2*(1-p2)/(809+442))
MOE
## [1] 0.08218521

So the endpoints of our 2SD confidence interval are

p_diff <- p1-p2

left <- p_diff - MOE
left
## [1] -0.03872711
right <- p_diff + MOE
right
## [1] 0.1256433

Does this align with a 95% confidence interval calculated using prop.test? Yes!

We are 95% confident that the difference in population proportions (\(\pi_{selfEmp} - \pi_{someoneElse}\)) is between -0.04 and 0.13.
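The 2SD interval and the prop.test interval differ only in the multiplier applied to the standard error (2 versus about 1.96). The sketch below makes that explicit; `two_prop_interval` is a hypothetical helper written for this illustration, not a function from mosaic or base R.

```r
# Hypothetical helper: interval for p1 - p2 using the unpooled standard error.
two_prop_interval <- function(x, n, multiplier = 2) {
  p <- x / n                          # observed proportions
  se <- sqrt(sum(p * (1 - p) / n))    # unpooled standard error
  diff <- p[1] - p[2]
  c(lower = diff - multiplier * se, upper = diff + multiplier * se)
}

# 2SD interval (multiplier of 2).
ci_2sd <- two_prop_interval(c(98, 809), c(142, 1251))
ci_2sd

# Multiplier of qnorm(0.975), about 1.96, as used for a 95% theory-based interval.
ci_z <- two_prop_interval(c(98, 809), c(142, 1251), multiplier = qnorm(0.975))
ci_z
```

With the qnorm(0.975) multiplier the endpoints match the 95 percent confidence interval printed by prop.test; the 2SD version is slightly wider.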

Exercises

To investigate whether giving chest-compression-only (CC) instructions rather than standard cardiopulmonary resuscitation (CPR) instructions to the witness of a heart attack will improve the victim’s chance of surviving, researchers Hupfl et al. (The Lancet, 2010) combined the results from three randomized experiments. In each experiment, the emergency services dispatcher randomly assigned either CC or CPR instructions to the bystander who was at the site where a person had just experienced a heart attack. The data they collected is located at https://raw.githubusercontent.com/IJohnson-math/Math138/main/CPR.csv.

  1. Load the data from the URL: https://raw.githubusercontent.com/IJohnson-math/Math138/main/CPR.csv and name the data CPR. How many observational units are there? What are the names and types of the variables? Which variable is the explanatory variable? Which is the response?

Observational units and number:

Variables and type:

  2. Define (in words) the parameters of interest of this study. Also, assign symbols to the parameters.

  3. State the appropriate null and alternative hypotheses in words and symbols to address the research question of whether instructions for chest-compressions-only rather than standard cardiopulmonary resuscitation (CPR) instructions to the witness of a heart attack will improve the victim’s chance of surviving. (Hint: Remember that hypotheses are always about population parameters, and think about whether the alternative should be one- or two-sided before you see the data.)

\[H_0: \]

\[H_a: \]

  4. Make a graph to visualize the proportion of survivors in the two treatment groups.

  5. Create two-way tables with one showing the counts and another containing the proportions of survivors in each of the treatment groups.

  6. Use R as a calculator to find and display

  7. Check the Validity Conditions for a Two Proportion Inference Test. Explain what you are checking, any numerical values you are comparing, and whether or not the conditions have been met.

  8. Use the proper command in R to calculate the theory-based \(p\)-value for the hypothesis test. Are you using the continuity correction? Why or why not?

  9. Calculate a theory-based 90% confidence interval and interpret the resulting interval in the context of the study.

  10. Based on your findings, state a complete conclusion about the study. Be sure to address significance (p-value and standardized statistic), estimation (confidence interval), causation, and generalization.

Significance with context:

Estimation with interpretation:

Causation:

Generalization: