Every Lab will have two parts. The Pre-Lab is a guided code-with-me activity. Don’t worry if you fall behind or miss something: the solutions to the Pre-Lab will be posted online for you to consult whenever you wish. The Lab assignment is a collection of questions (similar to our in-class Explorations) that you will answer using text, code, and mathematical expressions. The Pre-Lab will give you helpful examples to follow when working on the Lab. Your solutions to the Lab will be written in an R Markdown document. The beauty of the R Markdown file is that it creates a final document that weaves together your code, the output of your code, and your written descriptions and conclusions.
We will be using the software R, RStudio, and the remote server Posit Cloud throughout the semester to learn statistical concepts and to analyze real data. To straighten out which is which: R is the name of the programming language itself, and RStudio is a convenient interface to R that we will use within Posit Cloud.
Initially there are three panes in the RStudio interface. The pane in the upper right contains your Environment work space as well as a History of the commands that you’ve entered. When you import data or define a variable they will appear in your Environment.
Any Files that you upload or generate will show up in the pane in the lower right corner. The lower right also contains a list of installed Packages that you can click on to put in your working library. R packages have complex dependency relationships, but often if you need a package that is not yet installed, R will ask if you want to install it. When this happens, just follow the prompts to install the package.
The pane on the left is where the action happens. The current display shows the Console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.
TO DO NEXT: Within the Files pane in the lower right, click on PreLab1.Rmd to open the document. You can now view and edit the R Markdown file that was used to create this document.
Rather than typing commands into the Console we will be writing our work in an R Markdown document. As described in this video Why use Rmarkdown?, we are using R Markdown to combine our code, data visualization & analysis, and written conclusions into one lab report. Here is a reference document that gives an overview of basic R Markdown components and commands.
When beginning any new lab or project, we will start by making sure certain “packages” in R are available for us to use. This is a two-step process. First we install the package with the install.packages command. Second we make the package accessible to us using the library command. The two main packages we will use are called mosaic and ggformula. We install the mosaic package with the command
install.packages("mosaic")
This command downloads and installs the mosaic package so that it can later be made accessible with the library command. The mosaic package includes many tools for calculating summary statistics, performing theory-based hypothesis tests, and calculating confidence intervals. Since we are using Posit Cloud, you won’t have to perform the install step, because I will be sharing assignments with you that already have the necessary packages installed. I’m mentioning the installation step here in case you want to use Posit Cloud for your own projects. In that case, you’ll need to install the packages yourself.
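If you do set up your own project, a quick check like the following can tell you which packages still need installing. This is only a sketch: the variable names `needed`, `installed`, and `missing_pkgs` are made up for illustration, and the actual install call is left commented out because it needs internet access.

```r
# Packages this course relies on.
needed <- c("mosaic", "ggformula")

# requireNamespace() returns TRUE when a package is installed,
# without attaching it to the session.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install
} else {
  message("All required packages are installed.")
}
```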
For now we will skip the install step and load the two required packages: mosaic and ggformula.
Here is the command to load the mosaic package. Write the command to load the ggformula package.
library(mosaic)
CODE: Remember to load the two packages: mosaic and ggformula.
Before doing anything fancy, notice that R can be used as a calculator. Run the code and calculate the value of z by hand to verify the answer.
#This is a comment. Any line of code with a # sign in front
# is a comment and ignored by R.
#After running the code (in Posit Cloud), alter it slightly and see what happens.
x = -1
y <- 4
q <- 10
z <- sqrt(q) - y*x + 4^2
z
## [1] 23.16228
R can also be used as a calculator with data. Imagine a dot plot of the data \[4, 5, 6, 7, 8.\] Determine the mean and estimate the standard deviation of the data by hand. Next, use R to calculate the mean and standard deviation using the commands below.
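One way to do this calculation, using only base R functions, is sketched here. The vector name `dots` is just an illustrative choice.

```r
# Store the five data values in a vector.
dots <- c(4, 5, 6, 7, 8)

# mean() gives the average; sd() gives the sample standard deviation,
# which divides the sum of squared deviations by n - 1.
mean(dots)   # 6
sd(dots)     # sqrt(10 / 4), about 1.58
```

Compare these values with your by-hand mean and your estimate of the typical distance of the dots from the mean.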
We’ll load the example data, GSS22clean.csv. It is available at this URL: https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS22clean.csv. Since the data is a csv file, we will use the read.csv() function to read in the data.
# this command will load data and save it as GSS22
GSS22 <- read.csv("https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS22clean.csv")
This dataset comes from the 2022 General Social Survey (GSS), which is collected by NORC, the National Opinion Research Center, at the University of Chicago. The survey collects and maintains data from a random sample of households in the United States. It has been running since 1972, so it is very useful for studying trends in American life. The data I’ve given you is a subset of the questions asked in the survey, and the data has been cleaned to make it easier for us to use. However, there are still some messy aspects, which we’ll discover as we analyze it repeatedly this semester.
We will learn two basic commands to import data and identify which function to use depending on whether the data is a .txt file or a .csv file.
CODE structure to import csv file:
YourDataFileName <- read.csv("the_https_address_for_the_data.csv")
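To see how the import functions behave without downloading anything, here is a small sketch that reads a made-up data set from inline text. The data values and object names are hypothetical. The use of read.delim() for a .txt file is an assumption that the file is tab-delimited with a header row; a differently formatted text file would need different settings.

```r
# Hypothetical miniature data set, written inline so the example runs offline.
csv_text <- "id,age,works_for
1,55,someone else
2,31,self-employed
3,NA,someone else"

# read.csv() expects comma-separated values with a header row.
toy_csv <- read.csv(text = csv_text)

# Assumption: the .txt version is tab-delimited, so read.delim() fits.
txt_text <- gsub(",", "\t", csv_text)
toy_txt <- read.delim(text = txt_text)

# Both imports should give the same data frame.
str(toy_csv)
identical(toy_csv, toy_txt)
```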
The following commands can be used to view parts of the data: glimpse, head, tail.
You can also view the data in another tab by clicking on ‘GSS22’ in the Environment pane. This allows you to scroll up and down, and left and right to view the data.
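The viewing commands can be tried on any data frame. The sketch below uses mtcars, a small data set that ships with R, so that it runs without the GSS download. Since glimpse comes from an add-on package, the base R function str() is used here as a stand-in that gives a similar one-line-per-variable summary.

```r
# mtcars is built into R, so no import step is needed.
head(mtcars, 3)   # first 3 rows
tail(mtcars, 3)   # last 3 rows
str(mtcars)       # one line per variable: name, type, first few values
dim(mtcars)       # number of rows, then number of columns
```

Replacing mtcars with GSS22 gives the same kinds of views of the survey data.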
Use the commands glimpse, head, tail on the GSS data to answer the following questions.
Number of children, age of respondent, highest year of school completed, number of siblings, & many others.
Pick one quantitative variable and calculate the mean value.
Name a couple of variables that are categorical.
college major 1, college major 2, diploma/GED/Other, self-employed or works for somebody, Occupation code, & many others.
Pick one categorical variable and create a table of counts for the categories.
Pick out an observation to write about. What are some characteristics of this observation?
Example: Observational unit number 10 is a 55-year-old father of 2. He works a 40-hour work week in a private (non-government) job. He has a high-school diploma with a total of 12 years of education; his spouse has 11 years, his father has 6 years, and his mother has 20 years of education.
Examples: Do people who are self-employed work more hours than people who work for someone else? Do religious people support the death penalty similarly, regardless of their religion? Is a person’s happiness level associated with their age? Are people with higher household incomes more or less likely to believe that marijuana should be illegal?
Calculate the average number of years of education of the people surveyed in the 2022 GSS. We use the function mean( ) applied to the variable ~years_education from our data file data=GSS22, and we remove all NA values using the code na.rm=TRUE.
mean(~years_education, data=GSS22, na.rm=TRUE)
## [1] 14.10812
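To see why na.rm=TRUE matters, here is a small sketch with a made-up vector (the values are hypothetical, not taken from the GSS). It uses the base R form of mean() rather than the formula form shown in this lab.

```r
# Hypothetical years of education for five people, one value missing.
years <- c(12, 16, NA, 14, 18)

# With a missing value present, the mean is itself missing.
mean(years)                 # NA

# na.rm = TRUE drops the NA first: (12 + 16 + 14 + 18) / 4
mean(years, na.rm = TRUE)   # 15
```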
Calculate the average number of siblings for the people surveyed in the 2022 GSS.
Calculate the number and proportion of people who are self-employed or work for someone else. We use the function tally( ) to create a table of counts for the variable ~works_for in the data file data=GSS22. If we want our tally to display the proportion of values in the various categories, we add the code format="proportion".
tally(~works_for, data=GSS22)
## works_for
## self-employed  someone else          <NA>
##           393          3012           139
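The proportions can also be worked out directly from the counts in this table. The sketch below does that arithmetic in base R; the object names are illustrative, and this is only a check on the arithmetic, not a replacement for tally with format="proportion".

```r
# Counts copied from the tally output above.
counts <- c("self-employed" = 393, "someone else" = 3012, "<NA>" = 139)

# Total number of people surveyed.
n <- sum(counts)

# Each proportion is a count divided by the total.
props <- counts / n
round(props, 3)
```

Note that these proportions include the missing values in the total, so they sum to 1 across all three columns.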
Calculate the number and proportion of people that have attained various educational degrees.
CODE for mean of quantitative variable:
mean(~exact_name_variable, data=DataFileName, na.rm=TRUE)
CODE for counts or proportions of categorical variable:
tally(~exact_variable_name, data=DataFileName, format="proportion")
Suppose we knew that a few years back 11.1% of the US adult population had an education level of “less than high school”. From the 2022 GSS survey it looks as though the proportion may have dropped. Perform a hypothesis test to determine if the drop is statistically significant.
Our null and alternative hypotheses are
\[ H_0: \pi = 0.111 \textrm{ vs. } H_a: \pi < 0.111\] Our statistic is \(\hat{p} = 0.10129797\), as shown above.
Perform the One Proportion Hypothesis Test. We use the function prop.test( ) applied to the variable ~degree from the data file data=GSS22, with success corresponding to success = "less than high school". Our hypothesized population proportion (our \(\pi\) value) is p=0.111, and our alternative hypothesis is less than, <, which we write in the code as alternative="less".
prop.test(~degree, data=GSS22, success="less than high school", p=0.111, alternative="less", correct=FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: GSS22$degree [with success = less than high school]
## X-squared = 3.3806, df = 1, p-value = 0.03298
## alternative hypothesis: true p is less than 0.111
## 95 percent confidence interval:
## 0.0000000 0.1099411
## sample estimates:
## p
## 0.101298
#Alternate code to calculate the p-value from the counts.
#The 359 is the success count and n=3544 is the sample size.
#p=0.111 is the null hypothesis value of pi and our alternative hypothesis is less than.
prop.test(359, 3544, p=0.111, alternative="less", correct=FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: 359 out of 3544
## X-squared = 3.3806, df = 1, p-value = 0.03298
## alternative hypothesis: true p is less than 0.111
## 95 percent confidence interval:
## 0.0000000 0.1099411
## sample estimates:
## p
## 0.101298
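To connect this output to the one-proportion z-test, the standardized statistic can be computed by hand from the same counts. This is a sketch of the usual formula \(z = (\hat{p} - \pi_0)/\sqrt{\pi_0(1-\pi_0)/n}\); the object names are illustrative. The X-squared value reported by prop.test should equal \(z^2\).

```r
successes <- 359
n <- 3544
pi0 <- 0.111                      # null hypothesis value of pi

p_hat <- successes / n            # sample proportion
se_null <- sqrt(pi0 * (1 - pi0) / n)  # standard deviation of p-hat under the null
z <- (p_hat - pi0) / se_null      # standardized statistic

# "Less than" alternative: area to the left of z under the normal curve.
p_value <- pnorm(z)

c(p_hat = p_hat, z = z, z_squared = z^2, p_value = p_value)
```

The values of z_squared and p_value should agree with the X-squared and p-value lines in the output above, up to rounding.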
CODE for One Proportion Hypothesis Test:
prop.test(~exact_variable_name, data=NameDataFile, success="category viewed as success", p=null hypothesis parameter value, alternative="less" or "greater" or "two.sided")
or alternatively,
prop.test(success_count, sample_size, p=null hypothesis parameter value, alternative="less" or "greater" or "two.sided")
The validity conditions for the one-proportion z-test are that there are at least 10 successes and at least 10 failures.
The success and failure counts of 359 and 3185 are both much larger than 10, so our validity conditions are satisfied. In cases where the counts are close to 10 or just under 10, a continuity correction should be used, and the default in R is to use the correction (better safe than sorry). Since our success/failure counts here are much larger than 10, we include the code correct=FALSE so that the continuity correction is not applied.
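The validity check and the effect of the correct argument can be sketched as follows, using the count form of prop.test. The object names are illustrative. No particular value is claimed for the corrected p-value; the point is only that the correction makes the test slightly more conservative.

```r
successes <- 359
n <- 3544
failures <- n - successes   # 3185

# Validity conditions: at least 10 successes and at least 10 failures.
valid <- successes >= 10 && failures >= 10
valid

# Same test with and without the continuity correction.
no_cc   <- prop.test(successes, n, p = 0.111, alternative = "less", correct = FALSE)
with_cc <- prop.test(successes, n, p = 0.111, alternative = "less", correct = TRUE)

# With counts this large the two p-values should be close,
# with the corrected one slightly larger.
c(no_correction = no_cc$p.value, with_correction = with_cc$p.value)
```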
Conclusion: From the one-proportion z-test (without continuity correction) we have a p-value of 0.03298, which is strong evidence against the null hypothesis. This p-value means that if the null hypothesis is true, we would expect to observe a sample proportion of \(\hat{p} = 0.101298\) or smaller about 3.3% of the time. Given this strong evidence, we reject the null hypothesis as not plausible. We conclude that in 2022 the true proportion of adults with less than a high school education is less than 11.1%. This suggests that the proportion of adults with less than a high school education has decreased, so a higher proportion of people are earning high school degrees or degrees beyond high school.