We use two packages in this course: mosaic
and
ggformula
. To load a package, you use the
library()
function, wrapped around the name of a package.
I’ve put the code to load one package into the chunk below. Add the
other two you need.
We will load the example data, GSS22clean.csv
, from the
url: https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS22clean.csv
Recall, this dataset comes from the General Social Survey (GSS), which is collected by NORC at the University of Chicago. It is a random sample of households from the United States, and has been running since 1972, so it is very useful for studying trends in American life. The data I’ve given you is a subset of the questions asked in the survey, and the data has been cleaned to make it easier to use.
We’ll use the read.csv()
function to read in the data.
Remember that quotes are required around the url address.
We’ve seen how to do inference for a single mean, and for a difference of means. Now, we’ll study how to do inference on many means. To do this, we’ll use a procedure called ANOVA for ANalysis Of VAriance. We’re comparing the variability within groups to the variability between groups using the F-statistic.
The ANOVA test is a hypothesis testing (no confidence intervals until after ANOVA) and the hypotheses are always of this form:
\[ H_0: \mu_1 = \mu_2 = \dots = \mu_k \\ H_A: \text{At least one of the means is different} \]
Let’s consider the variables marital_status
and
number_of_hours_worked_last_week
. We will investigate
whether or not there is an association between marital status and the
mean number of hours worked per week.
#filtering out NA values
GSS22 <- filter(GSS22, !is.na(marital_status))
GSS22 <- filter(GSS22, !is.na(hours_worked_last_week))
You should notice that the number of observational units decreased
from 3544 to 1933, meaning that 1611 observational units have no
recorded data for either martial_status
or
hours_worked_last_week
or both.
Make a set of side-by-side boxplots to show the relationship between
those two variables. Note you might have to filter the data to remove NA
values. Use the all_the_R_so_far_M138.Rmd
file if you don’t
remember the code.
Another graph to look for skewness or symmetric data is
gf_jitter()
#a jitterplot
gf_jitter(hours_worked_last_week ~ marital_status, data= GSS22, alpha = 0.3, width=0.2)
How many groups do we have here? Five.
So our hypotheses here are
\[ H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 \\ H_a: \text{At least one of the means is different} \] or, if we wanted to name the means according to the group names,
\[ H_0: \mu_{\text{divorced}} = \mu_{\text{married}} = \mu_{\text{never married}} = \mu_{\text{separated}} = \mu_{\text{widowed}} \\ H_a: \text{At least one of the means is different} \]
To perform ANOVA in R, we can use the built-in R function
aov()
, which will use an F distribution as the
distributional approximation for the null distribution. In order to use
this distributional approximation, two conditions must be met:
Sample size/shape: Either the sample size is at least 20 for all the groups, without strong skewness or outliers; or if the sample size is less than 20, then the distribution of the response variable is approximately symmetric in all the samples (examine the dotplots for strong skewness or outliers)
SD: the variances of each group should be approximately equal. In practice, we check this by determining if \(sd_{max}< 2\times sd_{min}\).
How can we check those conditions?
#calculating our favorite statistics for each category of marital status
favstats(hours_worked_last_week ~ marital_status, data= GSS22)
## marital_status min Q1 median Q3 max mean sd n missing
## 1 divorced 0 37 40 48.0 89 41.25890 14.41447 309 0
## 2 married 0 40 40 48.0 89 40.89529 14.19947 850 0
## 3 never married 0 35 40 45.0 89 39.15237 13.42575 676 0
## 4 separated 8 40 40 46.5 89 41.51064 14.38730 47 0
## 5 widowed 0 19 40 40.0 89 34.07843 17.67579 51 0
#another option to check for strong skewness
Married <- filter(GSS22, marital_status=='married')
gf_histogram(~hours_worked_last_week, data=Married, title="Number of hours worked last week by Married people")
Can you create histograms for the hours worked last week of the other groups?
#another option to check for strong skewness
Divorced <- filter(GSS22, marital_status=='divorced')
gf_histogram(~hours_worked_last_week, data=Divorced, title="Number of hours worked last week by Divorced people")
#another option to check for strong skewness
NeverMarried <- filter(GSS22, marital_status=='never married')
gf_histogram(~hours_worked_last_week, data=NeverMarried, title="Number of hours worked last week by Never Married people")
#another option to check for strong skewness
Separated <- filter(GSS22, marital_status=='separated')
gf_histogram(~hours_worked_last_week, data=Separated, title="Number of hours worked last week by Separated people")
#another option to check for strong skewness
Widowed <- filter(GSS22, marital_status=='widowed')
gf_histogram(~hours_worked_last_week, data=Widowed, title="Number of hours worked last week by Widowed people")
Checking the Validity Conditions
The group sizes are 850, 309, 676, 47, and 51, all of which are greater than 20. The jitter and histogram plots below show no more than moderate skewness.
## marital_status min Q1 median Q3 max mean sd n missing
## 1 divorced 0 37 40 48.0 89 41.25890 14.41447 309 0
## 2 married 0 40 40 48.0 89 40.89529 14.19947 850 0
## 3 never married 0 35 40 45.0 89 39.15237 13.42575 676 0
## 4 separated 8 40 40 46.5 89 41.51064 14.38730 47 0
## 5 widowed 0 19 40 40.0 89 34.07843 17.67579 51 0
From the favstats calculations we see the minimum sd is 13.4 and the maximum sd of 17.6 is well below 2(13.4) = 26.8.
Thus, validity conditions are satisfied.
To perform the ANOVA hypothesis test we use the aov( )
function. The input has two main parts:
response_variable ~ explanatory_variable
and
data=NAME_OF_DATA_FILE
. The aov( )
function
calculates and stores many pieces of information. We will limit our view
to a summary display of aov()
. To view the summary we will
name the output (in the code below I’ve named the result
a1
) and then look at the summary of the output.
Now, let’s run summary()
on that object,
## Df Sum Sq Mean Sq F value Pr(>F)
## marital_status 4 3490 872.6 4.404 0.00151 **
## Residuals 1928 381988 198.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What are the degrees of freedom for the groups? 4
The total degrees of freedom? 1932
The degrees of freedom for the error? 1932-4 = 1928
What is the test statistic? F=4.04
What is the p-value? 0.00151
What is our general conclusion at the \(\alpha=0.05\) level? Very strong evidence against the null. We conclude that the mean number of hours worked last week is different for at least one of the five marital status groups. We follow this analysis with pairwise confidence intervals for the difference of two means to find any significant differences.
We will create pairwise confidence intervals in R all at once. You may recall from class that doing many tests at once can lead to a problem of multiple comparisons which compound the likelihood of a Type I error. One solution to the problem of multiple comparisons is the Bonferroni correction, which essentially divides the \(\alpha\) cutoff value by the number of tests you are going to run. However, there are other, more sophisticated methods. One of these is called Tukey’s Honest Significant Difference, which we will use here.
Let’s create an 95% confidence interval for the difference between the number of hours worked by widowed and married people.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = hours_worked_last_week ~ marital_status, data = GSS22)
##
## $marital_status
## diff lwr upr p adj
## married-divorced -0.3636056 -2.916565 2.1893541 0.9951658
## never married-divorced -2.1065328 -4.745637 0.5325710 0.1879153
## separated-divorced 0.2517386 -5.765369 6.2688461 0.9999616
## widowed-divorced -7.1804683 -12.989154 -1.3717828 0.0067398
## never married-married -1.7429273 -3.723475 0.2376209 0.1149329
## separated-married 0.6153442 -5.143413 6.3741014 0.9984208
## widowed-married -6.8168627 -12.357488 -1.2762374 0.0071205
## separated-never married 2.3582714 -3.439189 8.1557321 0.8010060
## widowed-never married -5.0739355 -10.654777 0.5069063 0.0950258
## widowed-separated -7.4322069 -15.203083 0.3386687 0.0686741
95% confidence intervals: (-12.989154, -1.3717828) for difference in means \(\mu_{widowed}-\mu_{divorced}\).
(-12.357488 -1.2762374) for difference in means \(\mu_{widowed}-\mu_{married}\).
Significance: Our overall ANOVA test for association between mean number of hours worked and marital status shows very strong evidence (p-value = 0.00151) of a significant difference in mean number of hours worked for at least one of the five groups: married, never married, divorced, separated, widowed.
Following up with the pairwise tests we find two significant differences:
The 95% confidence interval for the difference between the number of hours worked by widowed and divorced people is (-12.989154, -1.3717828), meaning that we are 95% confidence that the average hours worked by divorced people is 1.37 to 12.9 hours more than the average hours worked by people that are widowed.
The 95% confidence interval for the difference between the number of hours worked by widowed and married people is (-12.357488 -1.2762374), meaning that we are 95% confidence that divorced people work on average 1.27 to 12.35 hours more than widowed people.
All other pairwise comparisons do not show a significant difference in mean number of hours worked. This is perhaps a bit surprising as the difference between widowed and separated groups resulted in the largest difference in means. However, these two groups also have the smallest sample sizes of 51 and 47 respectively. Recall, that when sample sizes are smaller the width of the confidence interval is larger. The larger width in confidence interval is seen using Tukey’s Honest Significant Difference a hypothesis test results in a p-value of 0.06 (not significant at the 0.05 level) and a 95% confidence interval that just barely includes 0.
Generalization: Since the GSS is a random sample of US households, we can generalize our findings. In other words, the households that consist of widowed people work significantly less than households with married and divorced households. This should make sense since people that are widowed are likely to be older and thus retired.
Causation: Since this is an observational study we cannot conclude any type of causal relationship between marital status and number of hours worked last week.