Intro to R: Day 1

Kumar Ramanathan

2019-09-17

Introduction

Why R?

  • R is flexible
  • R is customizable
  • R is open source
  • R has a large community of users

RStudio and RStudio Server

  • RStudio refers to both the company and the software
  • RStudio is an integrated development environment (IDE) for R
  • RStudio Server is a browser-based version of RStudio that runs on a remote Linux server
  • We will be using RStudio today

Some key concepts

You will hear these terms often this week:

  • Objects
  • Packages
  • Functions
  • Help files

Basics

Some of you may have completed the Introduction to R course on DataCamp that was recommended. Congratulations! You already know quite a lot about how R works. As we continue, please help your classmates who did not have a chance to complete the course.

The DataCamp course covered the following material:

  • Arithmetic with R
  • Objects and variable assignment
  • Data types, incl.: character, numeric, logical, factors
  • Data structures, incl.: vectors, matrices, data frames, lists

Copy relevant folders

We will be working on some exercises and using data files in the next few hours. To do this, you will need to fork and clone the bootcamp-2019 GitHub repository.

Let’s walk through how to fork the repo to your GitHub account, and then clone it locally on your computer using the Git workflow within RStudio.

Basics and data types

Arithmetic

  • <- is the assignment operator
  • Standard arithmetic operators work: +, -, /, *, ^
  • To run an operation, you can hit Ctrl+Return (Windows) or Cmd+Return (Mac)

Arithmetic

y <- 5
y
## [1] 5
y + 2
## [1] 7

Functions

  • Functions take inputs and return outputs, e.g. log(10) results in 2.302585
  • Functions can also take additional named arguments, e.g. log(10, base=10)
  • All functions have a help file that you can call up, e.g. ?log

Functions

log(10)
## [1] 2.302585
log(10, base=10)
## [1] 1

Arithmetic: exercise

Open up the file day1_exercises.R in RStudio.
Do the following tasks:

  • Pick a number; save it as x
  • Multiply x by 3
  • Take the log of the above
    (Hint, you need the function log() here)
  • Subtract 4 from the above
  • Square the above

Arithmetic: exercise

x <- 5
x*3
## [1] 15
log(x*3)
## [1] 2.70805
log(x*3)-4
## [1] -1.29195
(log(x*3)-4)^2
## [1] 1.669134

Comparisons and Logical Operators

  • Common logical operators:
    • == (is equal)
    • != (not equal)
    • < (less than), > (greater than)
    • & (and), | (or)
  • If you evaluate statements with these operators, R will tell you if the statement is TRUE or FALSE

Comparisons and Logical Operators: exercise

  • Check if 1 is bigger than 2
  • Check if 1 + 1 is equal to 2
  • Check if it is true that the strings “eat” and “drink” are not equal to each other
  • Check if it is true that 1 is equal to 1 AND 1 is equal to 2
    (Hint: remember what the operators & and | do)
  • Check if it is true that 1 is equal to 1 OR 1 is equal to 2

Comparisons and Logical Operators: exercise

1 > 2
## [1] FALSE
(1 + 1) == 2
## [1] TRUE
"eat" != 'drink'
## [1] TRUE
(1==1) & (1==2)
## [1] FALSE
(1==1) | (1==2)
## [1] TRUE

Packages

  • Load the package tidyverse using the function library()
    You may need to install it using install.packages("tidyverse") if you have not done so already
  • Open the help file for the function recode
    (Hint: remember what ? does)

Packages

library(tidyverse)

?recode

Remember, if you need to install a package:

install.packages("tidyverse")

Data types

  • logical: TRUE or FALSE
    (When coerced into numeric type, TRUE=1 and FALSE=0)
  • integer: a specific type; most numbers are numeric instead
  • numeric: real or decimal numbers (typeof() reports these as "double")
  • complex: ex: 2+5i
  • character: "text data", denoted with quotes
    ('single quotes' and "double quotes" both work)
  • Use typeof() on an object to check its type
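
As a quick illustration of the types listed above, the sketch below calls typeof() on a few literal values and shows the logical-to-numeric coercion. The values are arbitrary examples chosen for this illustration.

```r
typeof(TRUE)
## [1] "logical"
typeof(5L)      # the L suffix requests an integer
## [1] "integer"
typeof(5)       # plain numbers are stored as doubles
## [1] "double"
typeof(2+5i)
## [1] "complex"
typeof("text")
## [1] "character"

# Coercion: TRUE becomes 1 and FALSE becomes 0
coerced <- as.numeric(c(TRUE, FALSE))
coerced
## [1] 1 0
```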

Review: Data structures

Vectors

  • Vectors store multiple values of a single data type
  • Create a vector using the c() function
  • Vectors are homogeneous: each vector can only contain one data type
  • Arithmetic operators and many functions can apply to vectors
  • Vectors can be indexed by:
    • element position (vec[1])
    • ‘slice’ position (vec[1:3])
    • condition (vec[vec>2]).

Vectors

example <- c(7,8,9)
example[2]
## [1] 8
example[example>7]
## [1] 8 9

Vectors: exercise

Return to the exercise file and complete the following tasks:

  • Run the code to generate variables x1 and x2
  • Select the 3rd element in x1
    Hint: use []
  • Select the elements of x1 that are less than 0
  • Select the elements of x2 that are greater than 1
  • Create x3 containing the first five elements of x2
  • Select all but the third element of x1

Vectors: exercise

set.seed(1234) 
x1 <- rnorm(5)
x2 <- rnorm(20, mean=0.5)
x1[3]
## [1] 1.084441
x1[x1 < 0]
## [1] -1.207066 -2.345698
x2[x2 > 1]
## [1] 1.006056 1.459494 2.915835
x3 <- x2[1:5]
x1[-3]
## [1] -1.2070657  0.2774292 -2.3456977  0.4291247

Interlude: Missing Values

Variables in datasets sometimes include missing values. In R, missing values are stored as NA. Vectors containing any data type can contain missing values. Functions deal with missing values differently, and sometimes require arguments to specify how to deal with missing values.

vec <- c(1, 8, NA, 7, 3)
mean(vec)
## [1] NA
mean(vec, na.rm=TRUE)
## [1] 4.75

Interlude: Missing Values

You can check if a vector contains missing values using the function is.na(). Since this returns a logical vector, you can use sum() or mean() on the result to count the number or proportion of TRUE values.

is.na(vec)
## [1] FALSE FALSE  TRUE FALSE FALSE
sum(is.na(vec))
## [1] 1
mean(is.na(vec))
## [1] 0.2

Factors

  • Factors are a special type of vector that is useful for categorical variables
  • Factors have a limited number of levels that the variable can take, set by the user
  • For categorical variables with natural ordering between categories, we often want to use ordered factors
  • Create factors with factor(), which includes an argument for levels =
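
A minimal sketch of creating factors, assuming a made-up vector of survey responses. It shows the levels = argument and, for categories with a natural order, the ordered = TRUE argument of factor().

```r
# Hypothetical survey responses, used only for illustration
responses <- c("low", "high", "medium", "low", "high")

# levels = fixes the set of categories and the order in which they appear
f <- factor(responses, levels = c("low", "medium", "high"))
levels(f)
## [1] "low"    "medium" "high"

# ordered = TRUE records that low < medium < high, so comparisons work
f_ord <- factor(responses, levels = c("low", "medium", "high"), ordered = TRUE)
f_ord < "high"
## [1]  TRUE FALSE  TRUE  TRUE FALSE
```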

Lists

  • Lists are like vectors, but more complex.
  • Lists are heterogeneous: they can store single elements, vectors, or even lists.
  • You can keep multi-dimensional and ragged data in R using lists.
  • You can index an element in a list using double brackets: [[1]]. Single brackets will return the element as a list.
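
The difference between double and single brackets can be seen in a short sketch. The list lst is a hypothetical example, not part of the exercise file.

```r
# A hypothetical list mixing a number, a character vector, and a nested list
lst <- list(1, c("a", "b"), list(TRUE, 2))

lst[[2]]         # double brackets return the element itself
## [1] "a" "b"
class(lst[2])    # single brackets return a list of length one
## [1] "list"
lst[[3]][[1]]    # chain double brackets to reach into a nested list
## [1] TRUE
```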

Matrices

  • Matrices in R are two-dimensional arrays.
  • Matrices are homogeneous: all values of a matrix must be of the same data type.
  • You can initialize a matrix using the matrix() function.
  • Matrices are used sparingly in R, primarily for numerical calculations or explicit matrix manipulation.
  • Matrices are indexed as follows: mat[row no, col no].

Matrices

mat <- matrix(data=c(1,2,3,4,5,6,11,12,34), ncol=3)
mat
##      [,1] [,2] [,3]
## [1,]    1    4   11
## [2,]    2    5   12
## [3,]    3    6   34
mat[1,]
## [1]  1  4 11
mat[1,3]
## [1] 11

Data frames

  • Data frames are the core data structure in R. A data frame is a list of named vectors with the same length.
  • Data frames are heterogeneous: the vectors in a data frame can each be of a different data type.
  • Columns are typically variables and rows are observations.
  • You can make data frames with data.frame(), and add columns or rows to an existing data frame with cbind() or rbind().
  • Data frames can be indexed in the same way as matrices: df[row no, col no].
  • Data frames can also be indexed by using variable/column names: df$var returns the column as a vector, while df["var"] returns a one-column data frame.

Data frames

df <- data.frame(candidate=c("Biden","Warren","Sanders"), 
                 poll=c(26,17,17), 
                 age=c(76,70,78))
df
##   candidate poll age
## 1     Biden   26  76
## 2    Warren   17  70
## 3   Sanders   17  78
df[1,3]
## [1] 76
df$age
## [1] 76 70 78
df$age[df$candidate=="Biden"]
## [1] 76
df$poll_max <- df$poll+3
df
##   candidate poll age poll_max
## 1     Biden   26  76       29
## 2    Warren   17  70       20
## 3   Sanders   17  78       20

Data frames: exercise

  • Load the example data frame using the code provided
  • Identify the number of observations (rows) and number of variables (columns)
  • Identify the names of the variables
  • Select the variable mpg
  • Select the 4th row
  • Square the value of the cyl variable and store this as a new variable cylsq

Data frames: exercise

dim(mtcars) # str(mtcars) also okay here
## [1] 32 11
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
mtcars$mpg
mtcars[4,]
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
mtcars$cylsq <- (mtcars$cyl)^2

Break!

Roadmap

Roadmap for the rest of the R sessions

  • Today we will learn how to do basic data manipulation and data visualization in base R
  • Most commonly, you will do these tasks using specialized packages such as tidyverse or data.table
  • So why teach these skills in base R?
    • This helps you understand how R works
    • Many packages rely on how these tasks work in base R
    • Useful for simple tasks in workflows that otherwise don’t involve much manipulation or visualization

Reading files

Working directories

The working directory is the folder where R scripts and projects look for files by default. Since we are using RStudio’s projects feature, your working directory is set already. Check where it is as follows:

getwd()
## [1] "C:/Users/kumar/OneDrive - Northwestern University - Student Advantage/Assistantships & Jobs/MSiA/bootcamp-2019"

If you need to change your working directory, e.g. when you are not working within an RStudio Project, you can go to the Files tab in the bottom right window in RStudio and find the directory you want. Then you can set it as a working directory with an option in the “More” menu. Or you can use the setwd() function.

Reading files: read.csv()

gapminder <- read.csv("data/gapminder5.csv", stringsAsFactors=FALSE)
gapminder <- read.csv(file = "data/gapminder5.csv",
                       sep = ",",
                       stringsAsFactors = FALSE)

Reading files: exercise

  • Run the read.csv() code provided and see what happens in the Environment tab.
  • Load the readr package.
  • Use read_csv() to load the gapminder data. Read the message generated in the console.

Reading files: exercise

library(readr)
## Warning: package 'readr' was built under R version 3.4.4
gapminder <- read_csv("data/gapminder5.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   year = col_double(),
##   pop = col_double(),
##   continent = col_character(),
##   lifeExp = col_double(),
##   gdpPercap = col_double()
## )

Reading files

  • You can also read files from the full local path or from URLs
  • You can read files using RStudio’s interface through the “Files” tab
  • For other file types, use the packages haven (Stata, SAS, SPSS) or readxl (Excel)

Data manipulation

Exploring data frames

  • Remember: you can view data frames in RStudio with View() and examine other characteristics with str(), dim(), names(), nrow(), and more.
  • When run on a data frame, summary() returns summary statistics for all variables.
  • mean(), median(), var(), sd(), and quantile() operate as expected.
  • Frequency tables are a simple and useful way to explore discrete/categorical variables in data frames
    • table() creates a frequency table of one or more variables
    • prop.table() can turn a frequency table into a proportion table
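
As a sketch of a frequency table with more than one variable, the example below uses the built-in mtcars data frame as a stand-in. The object names tab and prop are chosen for this illustration.

```r
# Cross-tabulate two variables: rows are cyl values, columns are am values
tab <- table(mtcars$cyl, mtcars$am)
tab["4", "1"]                 # number of cars with cyl equal to 4 and am equal to 1
## [1] 8

# Turn a one-variable frequency table into proportions
prop <- prop.table(table(mtcars$cyl))
round(prop[["6"]], 3)         # share of cars with 6 cylinders
## [1] 0.219
sum(prop)                     # proportions add up to 1
## [1] 1
```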

Exploring data frames: exercise

  • Run summary() on the gapminder data
  • Find the mean of the variable pop
  • Create a frequency table of the variable continent, using table()
  • Create a proportion table of the variable continent, using prop.table()
    (Hint: the input for prop.table() is the output of table())

Exploring data frames: exercise

summary(gapminder)
##    country               year           pop             continent        
##  Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
##  Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
##  Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
##                     Mean   :1980   Mean   :2.960e+07                     
##                     3rd Qu.:1993   3rd Qu.:1.959e+07                     
##                     Max.   :2007   Max.   :1.319e+09                     
##     lifeExp        gdpPercap       
##  Min.   :23.60   Min.   :   241.2  
##  1st Qu.:48.20   1st Qu.:  1202.1  
##  Median :60.71   Median :  3531.8  
##  Mean   :59.47   Mean   :  7215.3  
##  3rd Qu.:70.85   3rd Qu.:  9325.5  
##  Max.   :82.60   Max.   :113523.1

Exploring data frames: exercise

mean(gapminder$pop)
## [1] 29601212
table(gapminder$continent)
## 
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24
prop.table(table(gapminder$continent))
## 
##     Africa   Americas       Asia     Europe    Oceania 
## 0.36619718 0.17605634 0.23239437 0.21126761 0.01408451

Subsetting

  • One of the benefits of R is that we can work with multiple data frames at the same time
  • We will often want to subset a data frame, i.e. work with a portion of the data frame
  • There are two common ways to subset a data frame in base R
    • Index the data frame: gapminder[gapminder$continent=="Asia",]
    • Use the subset() function: subset(gapminder, subset=continent=="Asia")
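
The two approaches can be compared side by side in a short sketch. It uses the built-in mtcars data frame as a stand-in, and the names by_index and by_subset are hypothetical.

```r
# Two ways to keep only the rows of mtcars where cyl is 6
by_index  <- mtcars[mtcars$cyl == 6, ]          # the trailing comma keeps all columns
by_subset <- subset(mtcars, subset = cyl == 6)  # column names are looked up in mtcars

nrow(by_index)
## [1] 7
identical(rownames(by_index), rownames(by_subset))   # both select the same rows
## [1] TRUE
```

One practical difference: subset() lets you refer to columns by name without repeating the data frame, which keeps longer conditions readable.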

Sorting

  • The sort() function reorders elements, in ascending order by default.
    • You can flip the order by using the decreasing = TRUE argument.
  • The order() function gives you the index positions in sorted order.
  • sort() is useful for quickly viewing vectors; order() is useful for arranging data frames.
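
The difference between sort() and order() is easiest to see on a small example. The vector pop and the data frame d below are made up for this sketch.

```r
pop <- c(30, 10, 20)            # hypothetical values

sort(pop)
## [1] 10 20 30
sort(pop, decreasing = TRUE)
## [1] 30 20 10
order(pop)                      # positions of the smallest, next smallest, ... values
## [1] 2 3 1

# order() can rearrange the rows of a data frame
d <- data.frame(name = c("a", "b", "c"), pop = pop)
d[order(d$pop), ]
##   name pop
## 2    b  10
## 3    c  20
## 1    a  30
```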

Subsetting and Sorting: exercise

  • Create a new data frame called gapminder07 containing only those rows in the gapminder data where year is 2007
  • Create a sorted frequency table of the variable continent in gapminder07
    (Hint: use table() and sort())
  • Print out the population of Mexico in 2007
  • Try the bonus question if you have time

Subsetting and Sorting: exercise

gapminder07 <- subset(gapminder, subset = year==2007)
sort(table(gapminder07$continent))
## 
##  Oceania Americas   Europe     Asia   Africa 
##        2       25       30       33       52
gapminder07$pop[gapminder07$country=="Mexico"]
## [1] 108700891
head(gapminder07[order(gapminder07$pop, decreasing=TRUE),])
## # A tibble: 6 x 6
##   country        year        pop continent lifeExp gdpPercap
##   <chr>         <dbl>      <dbl> <chr>       <dbl>     <dbl>
## 1 China          2007 1318683096 Asia         73.0     4959.
## 2 India          2007 1110396331 Asia         64.7     2452.
## 3 United States  2007  301139947 Americas     78.2    42952.
## 4 Indonesia      2007  223547000 Asia         70.6     3541.
## 5 Brazil         2007  190010647 Americas     72.4     9066.
## 6 Pakistan       2007  169270617 Asia         65.5     2606.

Adding and removing columns

When cleaning or wrangling datasets in R, we will often want to create new variables.

Two ways to add a vector as a new variable in R:

gapminder$newvar <- newvar

gapminder <- cbind(gapminder, newvar)

Removing columns is easy too:

gapminder$newvar <- NULL

gapminder <- gapminder[names(gapminder) != "newvar"]
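
Here is a small runnable sketch of adding and removing a column, using a hypothetical data frame d and vector newvar rather than the gapminder data. Removing a column by name is done here with a logical condition on names(), because a minus sign cannot be applied to a character string.

```r
d <- data.frame(x = 1:3)               # hypothetical data frame
newvar <- c(10, 20, 30)                # hypothetical vector of the same length

d$newvar <- newvar                     # add a column by name
d2 <- cbind(data.frame(x = 1:3), newvar)   # or bind it on with cbind()
names(d2)
## [1] "x"      "newvar"

d$newvar <- NULL                       # remove a column by assigning NULL
d2 <- d2[names(d2) != "newvar"]        # or keep every column except newvar
names(d2)
## [1] "x"
```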

Recoding variables

  • A common task when cleaning/wrangling data is recoding variables.
  • Think about what the recoded variable should look like & then decide on an approach.
    • Sometimes, a single function can accomplish the recoding task needed. The new vector can then be assigned to a new column in the data frame.
    • If no single function comes to mind, we can initialize a new variable in the data frame, and assign values using indexes and conditional statements.
    • More complex recoding tasks can be accomplished with other packages like dplyr, which you can preview in the lecture notes.
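
The first two approaches above can be sketched on a made-up vector. The vector life, the cutoff of 70, and the labels are assumptions chosen for illustration; ifelse() is one example of a single function that can do the recoding.

```r
life <- c(45.2, 71.8, 80.1, 63.4)      # hypothetical life expectancies

# Approach 1: a single function, ifelse(), recodes into two categories
recoded <- ifelse(life > 70, "Over 70", "70 or under")
recoded
## [1] "70 or under" "Over 70"     "Over 70"     "70 or under"

# Approach 2: initialize a new variable, then fill in values by condition
group <- rep(NA, length(life))
group[life > 70]  <- "High"
group[life <= 70] <- "Low"
group
## [1] "Low"  "High" "High" "Low"
```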

Recoding variables: exercise

Use the data frame gapminder07 throughout this exercise.

  • Round the values of the variable lifeExp using round(), and store this as a new variable lifeExp_round
  • Print out the new variable to see what it looks like
  • Read through the code that creates the new variable lifeExp_over70 and try to understand what it does.
  • Try to create a new variable lifeExp_highlow that has the value “High” when life expectancy is over the mean and the value “Low” when it is below the mean.

Recoding variables: exercise

gapminder07$lifeExp_round <- round(gapminder07$lifeExp)
head(gapminder07$lifeExp_round)
## [1] 44 76 72 43 75 81
gapminder07$lifeExp_highlow <- NA
gapminder07$lifeExp_highlow[gapminder07$lifeExp>mean(gapminder07$lifeExp)] <- "High"
gapminder07$lifeExp_highlow[gapminder07$lifeExp<mean(gapminder07$lifeExp)] <- "Low"
table(gapminder07$lifeExp_highlow)
## 
## High  Low 
##   85   57

Aggregating

  • Notice that the observations (i.e. rows) in our data frame are grouped; specifically, each country is grouped into a continent.
  • We are often interested in summary statistics by groups.
  • The aggregate() function accomplishes this: aggregate(y ~ x, FUN = mean) gives the mean of vector y for each unique group in x.
    • mean can be replaced by other functions here, such as median.
  • Try it! In the exercise file, find the mean of life expectancy in 2007 for each continent.

Aggregating: exercise

aggregate(gapminder07$lifeExp ~ gapminder07$continent, FUN = mean)
##   gapminder07$continent gapminder07$lifeExp
## 1                Africa            54.80604
## 2              Americas            73.60812
## 3                  Asia            70.72848
## 4                Europe            77.64860
## 5               Oceania            80.71950
aggregate(lifeExp ~ continent, data = gapminder07, FUN = mean)
##   continent  lifeExp
## 1    Africa 54.80604
## 2  Americas 73.60812
## 3      Asia 70.72848
## 4    Europe 77.64860
## 5   Oceania 80.71950

Statistics

  • Here are some easy statistical analyses to conduct in R
    • Correlations: cor(); Covariance: cov()
    • T-tests: t.test(var1 ~ var2), where var2 is the grouping variable
    • Linear regression: lm(y ~ x1 + x2, data = df)
  • You can store the results of these functions in objects, which is especially useful for statistical models with many components.

Statistics: exercise

Use gapminder07 for all the below exercises.

You’re using some new functions, so refer to help files whenever you get stuck.

  • Calculate the correlation between lifeExp and gdpPercap.
  • Use a t-test to evaluate the difference between gdpPercap in “high” and “low” life expectancy countries. Store the results as t1, and then print out t1.

Statistics: exercise

cor(gapminder07$lifeExp, gapminder07$gdpPercap)
## [1] 0.6786624
t1 <- t.test(gapminder07$gdpPercap~gapminder07$lifeExp_highlow)
t1 <- t.test(gdpPercap~lifeExp_highlow, data=gapminder07)
t1
## 
##  Welch Two Sample t-test
## 
## data:  gdpPercap by lifeExp_highlow
## t = 10.564, df = 95.704, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  12674.02 18539.14
## sample estimates:
## mean in group High  mean in group Low 
##          17944.685           2338.104

Note that t1 is stored as a list. You can now call up the components of the t-test when you need them.
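
As a sketch of calling up components of stored results, the example below runs a t-test and a regression on the built-in mtcars data (a stand-in, not the gapminder data). The object names tt and reg are hypothetical.

```r
# Store a t-test, then pull out individual components with $
tt <- t.test(mpg ~ am, data = mtcars)
is.list(tt)
## [1] TRUE
names(tt)          # lists the available components
tt$p.value         # the p-value on its own
tt$conf.int        # the confidence interval
tt$estimate        # the two group means

# Model objects work similarly; coef() extracts the coefficients
reg <- lm(mpg ~ wt, data = mtcars)
coef(reg)
```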

Statistics: exercise

  • Conduct a linear regression using lm() which predicts lifeExp as a function of gdpPercap and pop. Store the results as reg1.
    • You can define all the variables using the df$var syntax, or you can just use variable names and identify the data frame in the data = argument.
    • Examples are shown at the bottom of the help file for lm()
  • Print out reg1.
  • Run summary() on reg1.

Statistics: exercise

reg1 <- lm(lifeExp ~ gdpPercap + pop, data = gapminder07)
reg1
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap + pop, data = gapminder07)
## 
## Coefficients:
## (Intercept)    gdpPercap          pop  
##   5.921e+01    6.416e-04    7.001e-09
summary(reg1)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap + pop, data = gapminder07)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.496  -6.119   1.899   7.018  13.383 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.921e+01  1.040e+00  56.906   <2e-16 ***
## gdpPercap   6.416e-04  5.818e-05  11.029   <2e-16 ***
## pop         7.001e-09  5.068e-09   1.381    0.169    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.87 on 139 degrees of freedom
## Multiple R-squared:  0.4679, Adjusted R-squared:  0.4602 
## F-statistic: 61.11 on 2 and 139 DF,  p-value: < 2.2e-16

Writing files

Writing a data file

  • We will often want to save the data frames as data files after cleaning/wrangling/etc.
  • You can use write.csv() from base R or write_csv() from readr to do this.
  • Try it! Save the data frame gapminder07 to the “data” subfolder in your working directory using the write.csv function. Set the argument row.names = FALSE.

Writing a data file: exercise

write.csv(gapminder07, file = "data/gapminder07.csv", row.names = FALSE)
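
To see what row.names = FALSE changes, here is a small round-trip sketch. It writes a hypothetical data frame d to a temporary file (via tempfile()) rather than to the data folder, then reads it back.

```r
d <- data.frame(id = 1:2, score = c(9.5, 7))   # hypothetical data
path <- tempfile(fileext = ".csv")             # temporary file for this sketch

write.csv(d, file = path, row.names = FALSE)
names_without <- names(read.csv(path))
names_without
## [1] "id"    "score"

write.csv(d, file = path)                      # default: row names are written too
names_with <- names(read.csv(path))
names_with                                     # the row names come back as a column X
## [1] "X"     "id"    "score"
```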

Save R objects

  • You can save all objects in your workspace using save.image() or by clicking the “Save” icon in the Environment tab.
    • You can load all objects back in using load() on the .RData file that is created, or by opening that file.
    • You can save specific objects in an .RData file with the save() function.
  • If your script file is well-written, you should be able to retrieve all your objects just by running your code again.
  • If you have a project with code that takes a long time to run, I would recommend using project files.
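
A minimal sketch of saving specific objects with save() and restoring them with load(). The objects a and b and the temporary file path are assumptions made for this illustration.

```r
a <- 1:3                                 # hypothetical objects
b <- "hello"

path <- tempfile(fileext = ".RData")     # temporary file, used only for this sketch
save(a, b, file = path)                  # save specific objects
rm(a, b)                                 # remove them from the workspace
load(path)                               # restore them under their original names
a
## [1] 1 2 3
b
## [1] "hello"
```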

Data visualization

Base R vs. ggplot2

  • We will only cover visualization briefly today, using some functions included in base R. Data scientists generally use other packages for data visualization, especially ggplot2, which we will cover on Day 3.
  • So why learn data visualization in base R?
    • Some of the simple functions are useful ways to explore data while doing analysis.
    • The syntax of visualization in base R is often adopted by other packages.

Histograms

  • Histograms are a useful way to examine the distribution of a single variable. The base R function for histograms is simple: hist().
  • Try it! Create a histogram of the variable lifeExp in gapminder07.
    • When you’re done, look at the help file and try to re-create the histogram, this time with a title and axis labels.
    • Bonus: Change the breaks = argument from its default setting and see what happens.

Histograms: exercise

hist(gapminder07$lifeExp, 
     main="Distribution of life expectancy across countries in 2007", 
     xlab="Life expectancy", ylab="Frequency")

Scatterplots

  • You can create a scatterplot by providing a formula containing two variables (i.e. y ~ x) to the plot() function in R.
  • Titles and axis labels can be added in plot() similarly to hist().
  • The function abline() can “layer” straight lines on top of a plot() output.

Scatterplots: exercise

  • Create a scatterplot with lifeExp on the y-axis and gdpPercap on the x-axis.
  • Add a title and axis labels.
  • Bonus: Add a horizontal line indicating the mean of lifeExp onto the plot using abline().

Scatterplots: exercise

plot(gapminder07$lifeExp ~ gapminder07$gdpPercap)

Scatterplots: exercise

plot(gapminder07$lifeExp ~ gapminder07$gdpPercap,
     main="Relationship between life expectancy and GDP per capita in 2007", 
     ylab="Life expectancy", xlab="GDP per capita")

Scatterplots: exercise

plot(gapminder07$lifeExp ~ gapminder07$gdpPercap,
     main="Relationship between life expectancy and GDP per capita in 2007", 
     ylab="Life expectancy", xlab="GDP per capita")
abline(h = mean(gapminder07$lifeExp))

The End!