Goals for this lab.

Every Lab will have two parts. The Pre-Lab is a guided code-with-me activity. Don’t worry if you fall behind or miss something. The solutions to the Pre-Lab will be posted online for you to consult whenever you wish. The Lab assignment is a collection of questions (similar to our in-class Explorations) that you will answer using text, code, and mathematical expressions. The Pre-Lab will give you helpful examples to follow when working on the Lab. Your solutions to the Lab will be written in an R Markdown document that contains your text, mathematical expressions, code, and the output of your code. The beauty of an R Markdown file is that it can create a final document that weaves together your code, the output of your code, and your written descriptions and conclusions.

We will be using the software R, RStudio, and the remote server Posit Cloud throughout the semester to learn about the statistical concepts and how to analyze real data. To straighten out which is which: R is the name of the programming language itself and RStudio is a convenient interface that we will use within Posit Cloud.

The RStudio interface.

[Figure: The RStudio interface]

Initially there are three panes in the RStudio interface. The pane in the upper right contains your Environment workspace as well as a History of the commands that you’ve entered. When you import data or define variables, they will appear in your Environment.

Any Files that you upload or generate will show up in the pane in the lower right corner. The lower right also contains a list of installed Packages that you can click on to add to your working library. R packages have complex dependency relationships, but if you need a package that isn’t installed, R will often ask whether you want to install it. When this happens, just follow the prompts to install the package.

The pane on the left is where the action happens. The current display shows the Console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.

TO DO NEXT: Within the Files pane in the lower right, click on PreLab1.Rmd to open the document. You can now view and edit the R Markdown file that was used to create this document.

Rather than typing commands into the Console we will be writing our work in an R Markdown document. As described in this video Why use Rmarkdown?, we are using R Markdown to combine our code, data visualization & analysis, and written conclusions into one lab report. Here is a reference document that gives an overview of basic R Markdown components and commands.

When beginning any new lab or project, we will start by making sure certain “packages” in R are available for us to use. This is a two-step process. First, we install the package with the install.packages command. Second, we make the package accessible using the library command. The two main packages we will use are called mosaic and ggformula. We install the mosaic package with the command

install.packages("mosaic")

This command installs the mosaic package so that it is available for us to use. The mosaic package includes many tools for calculating summary statistics, performing theory-based hypothesis tests, and calculating confidence intervals. Since we are using Posit Cloud, you won’t have to perform the installation step, because I will be sharing assignments with you that already have the necessary packages installed. I mention the installation step here in case you want to use Posit Cloud for your own projects, in which case you’ll need to install the packages yourself.
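If you do end up working in a project of your own, a minimal sketch of the installation step (only needed once per project) might look like this:

# install both packages in one call; this only needs to be done once per project
install.packages(c("mosaic", "ggformula"))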

For now we will skip the install step and load the two required packages: mosaic and ggformula.

Load Packages:

Here is the command to load the mosaic package. Write the command to load the ggformula package.

library(mosaic)

CODE: Remember to load the two packages: mosaic and ggformula.
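As a sketch, the loading step for both packages looks like this (library must be run in every new session):

# load the packages for this session
library(mosaic)
library(ggformula)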

R as a calculator.

Before doing anything fancy, notice that R can be used as a calculator. Run the code and calculate the value of z by hand to verify the answer.

# This is a comment. Anything following a # sign on a line
# is a comment and is ignored by R.
# After running the code (in Posit Cloud), alter it slightly and see what happens.
x = -1    # note: both = and <- assign a value to a variable
y <- 4
q <- 10
z <- sqrt(q) - y*x + 4**2
z
## [1] 23.16228
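For example, one small alteration you might try: in R, both ** and ^ mean exponentiation, so the following two lines give the same value of 16.

# ** and ^ are interchangeable ways to write exponentiation
4**2
4^2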

R can also be used as a calculator with data. Imagine a dot plot of the data \[4, 5, 6, 7, 8\] Determine the mean and estimate the standard deviation of the data by hand. Next, use R to calculate the mean and standard deviation using the commands below.

MyData <- c(4, 5, 6, 7, 8)   
MyMean <- mean(MyData)
MySd <- sd(MyData)
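To see the results, print the stored values. For this data the mean is 6 and the sample standard deviation is \(\sqrt{2.5} \approx 1.58\), so you can check your hand calculation against R’s output.

MyMean   # should equal 6
MySd     # should equal sqrt(2.5), about 1.58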

Reading in data from a website.

We’ll load the example data, GSS22clean.csv, which is available at this URL: https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS22clean.csv. Since the data is a .csv file, we will use the read.csv() function to read in the data.

# this command will load data and save it as GSS22
GSS22 <- read.csv("https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS22clean.csv")

This dataset comes from the 2022 General Social Survey (GSS), which is collected by NORC, the National Opinion Research Center, at the University of Chicago. NORC collects and maintains data from a random sample of households in the United States. The survey has been running since 1972, so it is very useful for studying trends in American life. The data I’ve given you is a subset of the questions asked in the survey, and it has been cleaned to make it easier for us to use. But there are still some messy aspects, which we’ll discover as we analyze the data repeatedly this semester.

We will learn two basic commands to import data; which function to use depends on whether the data is stored as a .txt file or a .csv file.

CODE structure to import csv file:

YourDataFileName <- read.csv("the_https_address_for_the_data.csv")
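For a .txt file, one common base-R option is read.table(); as a sketch (the address below is a stand-in, not a real file):

# header=TRUE tells R that the first row contains the variable names
YourDataFileName <- read.table("the_https_address_for_the_data.txt", header=TRUE)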

Basic commands to view data and calculate statistics.

The following commands can be used to view parts of the data: glimpse, head, tail.

You can also view the data in another tab by clicking on ‘GSS22’ in the Environment pane. This allows you to scroll up and down, and left and right to view the data.
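As a sketch, each of these viewing functions simply takes the dataset as its argument:

glimpse(GSS22)   # variable names, variable types, and the first few values
head(GSS22)      # the first six rows of the data
tail(GSS22)      # the last six rows of the data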

Use the commands glimpse, head, tail on the GSS data to answer the following questions.

  • What are the observational units?
  • How many observational units are there?
  • Is there any missing data? How can you tell?
  • How many variables are there?
  • Name a couple of variables that are quantitative.

Number of children, age of respondent, highest year of school completed, number of siblings, & many others.

  • Pick one quantitative variable and calculate the mean value.

  • Name a couple of variables that are categorical.

college major 1, college major 2, diploma/GED/Other, self-employed or works for somebody, Occupation code, & many others.

  • Pick one categorical variable and create a table of counts for the categories.

  • Pick out an observation to write about. What are some characteristics of this observation?

Example: Observational unit number 10 is a 55-year-old father of 2. He works a 40-hour work week in a private (non-government) job. He has a high-school diploma with a total of 12 years of education; his spouse has 11 years, his father has 6 years, and his mother has 20 years of education.

  • What is a question we could answer using this data?

Examples: Do people who are self-employed work more hours than people who work for someone else? Do religious people support the death penalty similarly, regardless of their religion? Is a person’s happiness level associated with their age? Are people with higher household incomes more or less likely to believe that marijuana should be illegal?

Calculate the average number of years of education of the people surveyed in the 2022 GSS. We use the function mean( ) applied to the variable ~years_education from our data file (data=GSS22), and we remove all NA values with na.rm=TRUE.

mean(~years_education, data=GSS22, na.rm=TRUE)
## [1] 14.10812

Calculate the average number of siblings for the people surveyed in the 2022 GSS.

Calculate the number and proportion of people who are self-employed or work for someone else. We use the function tally( ) to create a table of counts for the variable ~works_for in the data file (data=GSS22). If we want our tally to display the proportion of values in the various categories, we add the argument format="proportion".

tally(~works_for, data=GSS22)
## works_for
## self-employed  someone else          <NA> 
##           393          3012           139
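To display proportions instead of counts, add the format argument to the same tally (a sketch; run it to see the proportions):

# proportion of respondents in each category of works_for
tally(~works_for, data=GSS22, format="proportion")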

Calculate the number and proportion of people that have attained various educational degrees.

CODE for mean of quantitative variable:

mean(~exact_variable_name, data=DataFileName, na.rm=TRUE)

CODE for counts or proportions of categorical variable:

tally(~exact_variable_name, data=DataFileName, format="proportion")

Theory-Based Inference for One Proportion

Suppose we knew that a few years back 11.1% of the US adult population had attained an education level of “less than high school”. From the 2022 GSS survey it looks as though the proportion may have dropped. Perform a hypothesis test to determine if the drop is statistically significant.

Our null and alternative hypotheses are

\[ H_0: \pi = 0.111 \textrm{ vs. } H_a: \pi < 0.111\] Our statistic is \(\hat{p} = 359/3544 \approx 0.1013\), the sample proportion of respondents whose highest education level is “less than high school”.
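As a quick check, the sample proportion can be computed directly from the counts used below (359 respondents with less than a high school education out of 3544 with a recorded degree):

# sample proportion of "less than high school"
phat <- 359 / 3544
phat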

Perform the one-proportion hypothesis test. We use the function prop.test( ) applied to the variable ~degree from the data file (data=GSS22), with success corresponding to success="less than high school". Our hypothesized population proportion (our \(\pi\) value) is p=0.111, and our alternative hypothesis is “less than”, which we write in the code as alternative="less".

prop.test(~degree, data=GSS22, success="less than high school", p=0.111, alternative="less", correct=FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  GSS22$degree  [with success = less than high school]
## X-squared = 3.3806, df = 1, p-value = 0.03298
## alternative hypothesis: true p is less than 0.111
## 95 percent confidence interval:
##  0.0000000 0.1099411
## sample estimates:
##        p 
## 0.101298
# alternate code to calculate the p-value from the counts:
# 359 is the success count and n=3544 is the sample size.
# p=0.111 is the null hypothesis value of pi, and our alternative hypothesis is "less than".
prop.test(359, 3544, p=0.111, alternative="less", correct=FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  359 out of 3544
## X-squared = 3.3806, df = 1, p-value = 0.03298
## alternative hypothesis: true p is less than 0.111
## 95 percent confidence interval:
##  0.0000000 0.1099411
## sample estimates:
##        p 
## 0.101298

CODE for One Proportion Hypothesis Test:

prop.test(~exact_variable_name, data=NameDataFile, success="category viewed as success", p=null hypothesis parameter value, alternative="less" or "greater" or "two.sided")

or alternatively,

prop.test(success_count, sample_size, p=null hypothesis parameter value, alternative="less" or "greater" or "two.sided")

Validity Conditions for Theory-Based One Proportion Inference

The validity conditions for a one-proportion z-test are that there are at least 10 successes and 10 failures.

# success = number of respondents with less than a high school education
success <- 359
success

# failure = number of respondents with some other degree
failure <- 3544 - success
failure

The success and failure counts of 359 and 3185 are both much larger than 10, so our validity conditions are satisfied. In cases where the counts are close to 10 or just under 10, a continuity correction should be used, and the default in R is to use the correction (better safe than sorry). Since our success/failure counts here are much larger than 10, we include the code correct=FALSE so that the continuity correction is not applied.
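For comparison, omitting correct=FALSE runs the same test with R’s default continuity correction; the p-value comes out slightly larger.

# same test with the default continuity correction (correct=TRUE is the default)
prop.test(359, 3544, p=0.111, alternative="less")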

Conclusion: From the one-proportion z-test (without continuity correction) we have a p-value of 0.03298, which is strong evidence against the null hypothesis. This p-value means that if the null hypothesis were true, we would expect to observe a sample proportion of \(\hat{p} = 0.101298\) or smaller about 3.3% of the time. Due to the strong evidence found, we reject the null hypothesis, finding it is not plausible. We conclude that in 2022 the true proportion of adults with less than a high school education is less than 11.1%. This suggests that the proportion of adults with less than a high school education is decreasing, so a higher proportion of people are earning high school degrees or degrees beyond high school.

Instructions to Print Your Lab!

  • Knit to HTML and you should see the .html file in the Files pane.
  • Open the .html file in a Web Browser window.
  • Print from the browser, BUT FIRST select the layout as 2 Pages per sheet. You may also print double-sided to save paper.
  • Bring your printed file to class and turn it in. It is due in one week. Check the Daily Schedule for the exact date.