The first goal of this lab is to introduce you to R and RStudio. We will be using this software throughout the course both to learn the statistical concepts discussed in the textbook and to analyze real data. To straighten out which is which: R is the name of the programming language itself and RStudio is a convenient interface.
Another goal is to learn how to import data.
We will also learn a few commands to view the data and create basic summaries in R.
We will learn how to create basic bar charts and dot plots.
At the end of the in-class portion of the lab you will be asked to apply what you have learned to a new data set. Your solutions will be written in an R Markdown document that will contain your code and the output of your code.
We begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and some basic commands.
The RStudio interface.
Initially there are three panes in the RStudio interface. The pane in the upper right contains your Environment workspace as well as a History of the commands that you’ve previously entered. When you import data or define a variable they will appear in your Environment.
Any Plots that you generate or files that you upload will show up in the pane in the lower right corner. The lower right also contains a list of installed Packages; putting a checkmark next to a package loads it for your current session. R packages have complex dependency relationships, but often, if you need a package that is not yet installed, you will be asked whether you want to install it.
The pane on the left is where the action happens. It’s called the Console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request, a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.
To get you started, enter the following command at the R prompt (i.e. right after > on the console). You can either type the command in manually or copy and paste it from this document.
install.packages("readr")
This command downloads and installs the ‘readr’ package so that it is available for us to use. As the name suggests, the ‘readr’ library is used to read a data file into R.
Next use the same ‘install.packages( )’ command to install three more packages: ‘tidyverse’, which contains tools for creating graphics and manipulating data; ‘rmarkdown’, which will be used to make the lab reports that you will turn in; and ‘openintro’, which contains a template that you can use for your lab reports.
The installation step above needs to be done only once. After installing these four packages they will be listed in the lower right pane under Packages. To use the packages, you can put a checkmark next to each of the four or, alternatively, you can enter the library( ) commands shown below. (From here on you can copy and paste the commands.)
#copy and paste these commands into the console
library(readr)
library(tidyverse)
library(rmarkdown)
library(openintro)
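Because installation happens once but loading happens every session, it can help to check which of the four packages are already installed before calling library( ). The lines below are a minimal sketch of one way to do that; the names `needed`, `installed`, and `to_install` are made up for this illustration and are not part of the lab.

```r
# Packages this lab relies on (names taken from the text above)
needed <- c("readr", "tidyverse", "rmarkdown", "openintro")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need install.packages()
to_install <- needed[!installed]
to_install

# Uncomment the next line to install whatever is missing:
# install.packages(to_install)
```

If `to_install` prints `character(0)`, all four packages are already installed and the library( ) commands above should run without errors.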
Before doing anything fancy, notice that R can be used as a calculator.
#enter these commands into R, run the code, then alter it slightly to see what happens.
x=5
y=7
z= x^2+3*y-10
z
## [1] 36
R can also be used as a calculator on data. Before trying the calculations below, draw a dot plot of the data 2, 3, 4, 5, 6 and calculate the mean (recall, the mean is the balance point of the data).
data_1 = c(2, 3, 4, 5, 6)
data_2 = 2*data_1
mean_1 = mean(data_1)
SD_1 = sd(data_1)
mean_1
## [1] 4
SD_1
## [1] 1.581139
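To see what mean( ) and sd( ) are doing, the same two numbers can be computed step by step. This is only a sketch for checking your understanding; the names `n`, `mean_by_hand`, and `sd_by_hand` are illustrative, and the formula assumes the usual sample standard deviation, which divides by n - 1.

```r
data_1 <- c(2, 3, 4, 5, 6)

# Number of observations
n <- length(data_1)

# Mean: add up the values and divide by how many there are
mean_by_hand <- sum(data_1) / n

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by n - 1
sd_by_hand <- sqrt(sum((data_1 - mean_by_hand)^2) / (n - 1))

mean_by_hand
sd_by_hand

# Compare with the built-in functions
all.equal(mean_by_hand, mean(data_1))
all.equal(sd_by_hand, sd(data_1))
```

Notice that `data_1 - mean_by_hand` subtracts the mean from every value at once; R applies arithmetic to a whole vector element by element.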
Next let’s read in a data set from our textbook. The data is available at http://www.isi-stats.com/isi/data/ under Chapter 2, College Midwest. The observational units of the data are students from a college in the Midwest. After reading the data you should see a data set named ‘CollegeMidwest’ in the Environment pane, containing 2919 observational units and 2 variables.
CollegeMidwest <- read_table2("http://www.isi-stats.com/isi/data/chap3/CollegeMidwest.txt")
## Warning: `read_table2()` was deprecated in readr 2.0.0.
## Please use `read_table()` instead.
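The file being read is plain text with a header line followed by one row per student, with columns separated by whitespace. The sketch below illustrates that layout without downloading anything: it types the first six rows of CollegeMidwest inline and reads them with base R's read.table( ), used here as a stand-in for the readr function so that the example runs on its own. The names `sample_text` and `sample_data` are illustrative.

```r
# First six rows of CollegeMidwest, typed inline so no download is needed
sample_text <- "OnCampus CumGpa
N 2.92
N 3.59
N 3.36
N 2.47
N 3.46
Y 2.98"

# header = TRUE treats the first line as column names;
# columns are split on whitespace by default
sample_data <- read.table(text = sample_text, header = TRUE)

sample_data
str(sample_data)
```

The str( ) output shows one text column and one numeric column, the same two variable types that the full data set contains.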
The following commands can be used to view parts of the data: glimpse, head, tail.
glimpse(CollegeMidwest)
## Rows: 2,919
## Columns: 2
## $ OnCampus <chr> "N", "N", "N", "N", "N", "Y", "Y", "Y", "N", "Y", "Y", "Y", "…
## $ CumGpa <dbl> 2.92, 3.59, 3.36, 2.47, 3.46, 2.98, 3.07, 3.79, 3.21, 3.67, 3…
To look at the first six rows of the data use the ‘head( )’ command.
head(CollegeMidwest)
## # A tibble: 6 × 2
## OnCampus CumGpa
## <chr> <dbl>
## 1 N 2.92
## 2 N 3.59
## 3 N 3.36
## 4 N 2.47
## 5 N 3.46
## 6 Y 2.98
To look at the last six rows of the data use the ‘tail( )’ command.
tail(CollegeMidwest)
## # A tibble: 6 × 2
## OnCampus CumGpa
## <chr> <dbl>
## 1 Y 3.09
## 2 Y 2.8
## 3 Y 4
## 4 N 3.35
## 5 Y 3.33
## 6 Y 2.99
You can also view the data by clicking on CollegeMidwest in the Environment pane.
The command below extracts the GPA data from CollegeMidwest. Try it. Can you calculate the mean GPA? The standard deviation of the GPAs?
GPA <- CollegeMidwest$CumGpa
summary(GPA)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.010 3.410 3.288 3.700 4.000
We can also create a dot plot of the GPA data.
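One way to draw a dot plot in base R is stripchart( ) with stacked points. The sketch below is an illustration rather than the lab's own plotting code: it uses twelve GPA values copied from the head( ) and tail( ) output above, stored under the made-up name `gpa_sample`, so that it runs without the full data set.

```r
# A few GPA values copied from the head() and tail() output in this lab
gpa_sample <- c(2.92, 3.59, 3.36, 2.47, 3.46, 2.98,
                3.09, 2.80, 4.00, 3.35, 3.33, 2.99)

# method = "stack" piles repeated values on top of each other,
# which gives the usual dot plot look; pch = 19 draws filled circles
stripchart(gpa_sample, method = "stack", pch = 19,
           xlab = "Cumulative GPA", main = "Dot Plot of GPA (sample)")
```

To plot every student, replace `gpa_sample` with the `GPA` vector extracted earlier.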
Next we create a table of counts from the OnCampus variable in CollegeMidwest. The output consists of the number of students reporting no, they don’t live on campus and the number reporting yes, they do live on campus.
table_OC <- table(CollegeMidwest$OnCampus)
table_OC
##
## N Y
## 654 2265
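Counts are often easier to compare as proportions. As a hedged sketch, the lines below rebuild the OnCampus responses from the two counts in the table and pass the table to prop.table( ), which divides each count by the total. The names `on_campus` and `prop_OC` are illustrative; with the real data loaded you could apply prop.table( ) to your own table_OC directly.

```r
# Rebuild the OnCampus responses from the counts in the table above
on_campus <- rep(c("N", "Y"), times = c(654, 2265))

# Table of counts, as in the lab
table_OC <- table(on_campus)

# Convert counts to proportions of the 2919 students
prop_OC <- prop.table(table_OC)
round(prop_OC, 3)
```

The two proportions always add up to 1, which is a quick check that no responses were dropped.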
From this table we can create a bar chart of the categorical OnCampus variable using the following command. Notice that this command uses the table we created above.
barplot(table_OC, main="Campus Housing Distribution")
An alternative collection of commands to create a bar chart is shown below. These commands use the tidyverse package that we installed at the beginning of our lab. The command starts with ‘ggplot’, where ‘gg’ is an abbreviation of grammar of graphics. Inside the ggplot function is aes( ), which sets the aesthetics of our graphic: here, OnCampus goes on the horizontal axis and also determines the fill color. Next comes geom_bar, which tells R that we want to create a bar chart. And last is ggtitle, which adds the title to the graph. The plus operator, ‘+’, is used to put the pieces together, ggplot(…)+ geom_bar( )+ ggtitle( ), to create the final graphic.
ggplot(data=CollegeMidwest, mapping=aes(OnCampus, fill=OnCampus))+geom_bar()+ggtitle("Bar Chart of Housing on Campus")
Have questions about ggplot? Try the command below.
?ggplot
This next part of the lab contains exercises for you to complete and turn in using an R Markdown file. To open a new R Markdown file use the New file button in the upper left-hand corner of the screen.
NationalAnthemTimes <- read_table2("http://www.isi-stats.com/isi/data/prelim/NationalAnthemTimes.txt")
## Warning: `read_table2()` was deprecated in readr 2.0.0.
## Please use `read_table()` instead.
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Genre = col_character(),
## Sex = col_character(),
## Time = col_double()
## )
table_genre <- table(NationalAnthemTimes$Genre)
barplot(table_genre, main="National Anthem Singers by Genre, 1980-2019")
ggplot(NationalAnthemTimes, aes(Genre, fill=Genre))+geom_bar()+ggtitle("1980-2019 National Anthem Singers by Genre")