The first goal of this lab is to introduce you to R and RStudio. We will be using this software throughout the course both to learn the statistical concepts discussed in the textbook and to analyze real data. To straighten out which is which: R is the name of the programming language itself and RStudio is a convenient interface.
Another goal is to learn how to import data.
We will also learn a few commands to view the data and create tables, bar charts and histograms.
At the end of the in-class portion of the lab you will be asked to apply what you have learned to a new data set. Your solutions will be written in an R Markdown document that will contain your code and the output of your code.
We begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and some basic commands.
The RStudio interface.
Initially there are three panes in the RStudio interface. The pane in the upper right contains your Environment workspace as well as a History of the commands that you’ve previously entered. When you import data or define a variable, it will appear in your Environment.
Any Plots that you generate or files that you upload will show up in the pane in the lower right corner. The lower right also contains a list of installed Packages; checking the box next to a package loads it for use in your current session. R packages have complex dependency relationships, but often if you need a package that is not yet installed you will be asked whether you want to install it.
The pane on the left is where the action happens. It’s called the Console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request, a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.
To get you started, enter the following command at the R prompt (i.e. right after > on the console). Type the command in manually exactly as it is displayed below.
install.packages("mosaic")
This command installs the ‘mosaic’ package so that it is available for us to use. The mosaic package includes many tools that we will be using this semester including tools for calculating summary statistics, performing theory-based hypothesis tests, and calculating confidence intervals.
Next use the same ‘install.packages( )’ command to install one more package, ggformula:
install.packages("ggformula")
The installation step above will only need to be done once. After installing these two packages they will be listed in the lower right pane under Packages. You can put a checkmark next to these packages to use the packages, or alternatively, you can enter the library( ) commands shown below. (From here on you can copy and paste the commands.)
#copy and paste these commands into the console
library(mosaic)
library(ggformula)
This step of loading our two main packages will be a necessary first step in all of our labs this semester.
Before doing anything fancy, notice that R can be used as a calculator. Run the code and calculate the value of z by hand to verify the answer.
#enter these commands into R, run the code, then alter it slightly to see what happens.
x <- 5
y <- 7
q <- 2
z <- (x^2 + 3*y - 10)/q
z
## [1] 18
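As a small sketch of how R applies the usual order of operations here, the lines below use values that reproduce the result 18 and rebuild z one piece at a time. The names numerator and z_check are illustrative and are not part of the lab.

```r
# Values that reproduce the result 18 shown above
x <- 5
y <- 7
q <- 2

# Exponentiation is evaluated first, then multiplication,
# then addition and subtraction
numerator <- x^2 + 3*y - 10   # 25 + 21 - 10
numerator

# The parentheses in the lab's expression make the division happen last
z_check <- numerator / q
z_check

# Without the parentheses, only the 10 is divided by q
no_parens <- x^2 + 3*y - 10/q
no_parens
```

Comparing z_check with no_parens shows why the parentheses matter when you translate a hand calculation into code.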
R can also be used as a calculator on data. Draw a dot plot of the data \[2, 3, 4, 5, 6\] and by hand calculate the mean. Now use R to calculate the mean and standard deviation using the commands below. Can you write code to calculate the mean and standard deviation of data_2?
data_1 <- c(2, 3, 4, 5, 6)
mean_1 <- mean(data_1)
SD_1 <- sd(data_1)
data_1
## [1] 2 3 4 5 6
mean_1
## [1] 4
SD_1
## [1] 1.581139
Transform a data set by multiplying every entry by 2.
data_2 = 2*data_1
This is data_2:
data_2
## [1] 4 6 8 10 12
mean_2 <- mean(data_2)
SD_2 <- sd(data_2)
mean_2
## [1] 8
SD_2
## [1] 3.162278
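Comparing the two sets of results suggests that doubling every entry doubles both the mean and the standard deviation. Here is a minimal sketch that checks this directly, using all.equal( ) because decimal results can differ by tiny rounding amounts. The names original and doubled are illustrative.

```r
original <- c(2, 3, 4, 5, 6)
doubled <- 2 * original

# Is the mean of the doubled data twice the original mean?
mean_doubles <- all.equal(mean(doubled), 2 * mean(original))
mean_doubles

# Is the standard deviation of the doubled data twice the original SD?
sd_doubles <- all.equal(sd(doubled), 2 * sd(original))
sd_doubles
```

Try replacing the 2 with another multiplier to see whether the pattern holds.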
Next let’s import a data set from our textbook into our workspace. The data is available at https://www.isi-stats.com/isi2nd/data.html under Chapter 2, Example 2.1/2.2: College Midwest. The observational units of the data are students from a college in the Midwest. After reading the data you should see a data set named ‘CollegeMidwest’ in the Environment pane, containing 2919 observational units and 2 variables.
CollegeMidwest <- read.table("http://www.isi-stats.com/isi/data/chap3/CollegeMidwest.txt", header=TRUE)
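To see what header=TRUE does without downloading anything, here is a small sketch that reads a few rows from a text string using the text argument of read.table( ). The rows and the names toy_text and toy are made up for illustration; they are not the real CollegeMidwest data.

```r
# A made-up file: the first line holds variable names, the rest hold data
toy_text <- "OnCampus CumGpa
N 3.10
Y 2.75
Y 3.90"

# header = TRUE tells R to treat the first line as variable names
toy <- read.table(text = toy_text, header = TRUE)

dim(toy)     # number of rows, then number of columns
names(toy)   # variable names taken from the first line
```

With header=FALSE the first line would instead be read as a row of data, so every column would be treated as text.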
The following commands can be used to view parts of the data: glimpse, head, tail. Try entering them into your console.
glimpse(CollegeMidwest)
## Rows: 2,919
## Columns: 2
## $ OnCampus <chr> "N", "N", "N", "N", "N", "Y", "Y", "Y", "N", "Y", "Y", "Y", "…
## $ CumGpa <dbl> 2.92, 3.59, 3.36, 2.47, 3.46, 2.98, 3.07, 3.79, 3.21, 3.67, 3…
To look at the first six rows of the data use the ‘head( )’ command.
head(CollegeMidwest)
## OnCampus CumGpa
## 1 N 2.92
## 2 N 3.59
## 3 N 3.36
## 4 N 2.47
## 5 N 3.46
## 6 Y 2.98
To look at the last six rows of the data use the ‘tail( )’ command.
tail(CollegeMidwest)
## OnCampus CumGpa
## 2914 Y 3.09
## 2915 Y 2.80
## 2916 Y 4.00
## 2917 N 3.35
## 2918 Y 3.33
## 2919 Y 2.99
You can also view the data in another tab by clicking on ‘CollegeMidwest’ in the Environment pane. This allows you to scroll up and down, and left and right to view the data.
The commands below calculate the mean, median, standard deviation, and range for the quantitative variable CumGpa in the CollegeMidwest data set. The range is 0 through 4, which means that the minimum value is 0. Look near row 2680. Can you find an observational unit that has a 0.00 cumulative GPA?
mean(~CumGpa, data=CollegeMidwest)
## [1] 3.287955
median(~CumGpa, data=CollegeMidwest)
## [1] 3.41
sd(~CumGpa, data=CollegeMidwest)
## [1] 0.550747
range(~CumGpa, data=CollegeMidwest)
## [1] 0 4
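The formula style used above (~variable, data=...) comes from the mosaic package. As a rough sketch, similar summaries can be computed in base R by pulling out a column with $, and which( ) can report the row where a particular value occurs. The small data frame toy below is made up for illustration; with the real data you would use CollegeMidwest in its place.

```r
# A made-up data frame with the same two variable names
toy <- data.frame(
  OnCampus = c("N", "Y", "Y", "N", "Y"),
  CumGpa   = c(3.10, 2.75, 0.00, 3.90, 3.45)
)

toy_mean <- mean(toy$CumGpa)
toy_median <- median(toy$CumGpa)
toy_mean
toy_median
range(toy$CumGpa)

# Row number(s) where the GPA equals the minimum value
min_rows <- which(toy$CumGpa == min(toy$CumGpa))
min_rows
toy[min_rows, ]
```

Searching by row number this way is often quicker than scrolling through a few thousand rows in the data viewer.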
We can create a histogram of the quantitative GPA data using the code shown below. The binwidth option controls the width of the bins in the histogram. Try changing this number and see what happens.
gf_histogram( ~CumGpa, data=CollegeMidwest, binwidth = 0.05)
Next we create a table of counts and a table of proportions from the categorical OnCampus variable in CollegeMidwest. The output consists of the number of students reporting no, they don’t live on campus and the number reporting yes, they do live on campus.
tally(~OnCampus, data=CollegeMidwest)
## OnCampus
## N Y
## 654 2265
tally(~OnCampus, data=CollegeMidwest, format="proportion")
## OnCampus
## N Y
## 0.2240493 0.7759507
We can also create a bar chart of the categorical OnCampus variable using the following command. In the code below we’ve also given the bar chart a title and a caption.
gf_bar( ~ OnCampus, data=CollegeMidwest, title = "Campus Housing Distribution", caption = "N, does not live on campus; Y, yes lives on campus")
Instead of writing solutions in a Word document you will write your solutions as an R Markdown file, which is a single file created in RStudio that will include the text of the exercises, your code, your code output, and your typed conclusions. The document that you are reading right now was created using R Markdown.
To open a new R Markdown file use the New file button in the upper left hand corner of the screen (see the red arrow in the image below).
After clicking the New file button, select “R Markdown” from the dropdown menu. Next select “From Template”, select “mosaic plain (HTML)”, and name your document like this FirstnameLastname-Lab0.Rmd but with your first and last name.
The R Markdown file should now be open in a new pane in the upper left corner. From here forward, you will write text and code in the text-chunks and code-chunks of your Lab0.Rmd file.
You can add new code chunks by clicking the green button with the C (see the blue arrow in the image above) and selecting “R” to make an R code chunk.
For this first lab I have created an R Markdown file for you to use. Look for it in the RStudio Cloud site that I shared with you.
Import the National Anthem Time data from http://www.isi-stats.com/isi/data/prelim/NationalAnthemTimes.txt. Name the data NationalAnthemTime. Notice that this name has no spaces. This is an important feature of a variable name.
Take a look at the National Anthem Time data using the glimpse, head and tail commands.
Make a table of counts and a table of proportions for the categorical variable Genre.
Create a bar chart of the Genre variable. Give your graph the title “19xx - 20xx Super Bowl National Anthem Singers by Genre” but replace the xx’s with the actual range of years for this data.
Use R as a calculator to verify that the proportion of singers from the R&B/Soul Genre during the range of years in question is indeed 0.175.
# Write the equation that gives the calculation of the proportion of R&B/Soul singers that sang the National Anthem.
#Hint:
# proportion <- (number of times R&B/Soul artist sang anthem)/(total number of times anthem is sung)
Create a histogram of the quantitative variable Time from the NationalAnthemTimes data. Adjust the binwidth so that each observational unit is easily countable. Give an appropriate title and include the time units for your graph. Recall, we saw this data in the initial chapter of our zyBook. You can find the description for this data by looking in the textbook’s “Chapter 12”.
BONUS. Do more with the National Anthem dataset using the tools introduced in Lab 1.