Getting acquainted with R

Why R?

R is a free software environment for statistical computing and graphics. It is supported by the R Foundation for Statistical Computing and was first developed in the 1990s. As a full programming language, it is more flexible and capable than alternatives for statistical computing such as Stata. Since it is designed for data analysis, it is particularly well suited to the needs of data scientists compared to most other programming languages. Since it is open source, it is a customizable and extensible tool, for which thousands of useful and tailored packages have been written by users worldwide. Finally, it is widely used, and a large community has grown around R, meaning that help is always around the corner.

For more information, see the R project website. You can get a taste for the R community at R-bloggers, the RStudio community forum, R-Ladies, and Stack Overflow.

Here are a few things to remember about R:

  • “R” refers to both the language and the environment (application).
  • R runs in memory; the objects you work with are loaded into memory.
  • It’s expected that you’ll install and use additional packages.
  • Packages are open source and user contributed, so use established packages or evaluate quality.
  • There are multiple ways to do most things. Some ways are better than others, but sometimes it is a question of style and preference.
  • You can, and often will, have more than one dataset open in R at the same time.

RStudio and RStudio Server

RStudio is a software program that makes working in R easier, developed and maintained by a company of the same name. It provides an integrated development environment, or IDE, for R. RStudio helps you organize your workflow and keep track of your work. The top-left pane is where you open, work on, and save script files. The bottom-left pane includes the console, which is where your code actually “runs” – you can run code here directly, or from a script file. The right-hand side panes include tools for managing your environment, workspace, and packages; for plotting and graphics; for accessing help files; and more.

At MSiA, you have access to RStudio Server, which allows you to access RStudio through a web browser and do computation on a server rather than your personal computer.

Installing and loading packages

Much of R’s power comes from contributed packages. You can install and manage packages using the Packages tab in the bottom right pane in RStudio. Or you can install packages with a command:

install.packages("tidyverse")

The RStudio Server Pro install that you are using already has tidyverse and some other common packages installed at the system-level, so you will not actually need to do this step. But do take note of this material, as you will likely need to install packages in the future. You can easily check what packages have been installed already in the “Packages” tab on the bottom-right pane.

Note that tidyverse is a composite package that will install multiple component packages and their dependencies. It includes dplyr and ggplot2, which we will use extensively on Day 3 of the boot camp. You’ll get a lot of messages as the installation happens.

CRAN (Comprehensive R Archive Network) is the name of the package repository. There are mirrors around the world. You can also install packages that are not on CRAN using the devtools package.
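
For example, here is a minimal sketch of installing a package from GitHub with devtools (the repository name below is purely hypothetical):

# install a package hosted on GitHub rather than on CRAN
library(devtools)
install_github("username/packagename")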

If you have trouble or get errors when trying to install a package, you might need to specify the repository mirror to download from:

install.packages("tidyverse", repos="http://cran.wustl.edu/")

After you install a package, you have to load it with the library function to actually use it.

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1       v purrr   0.3.2  
## v tibble  2.1.1       v dplyr   0.8.0.1
## v tidyr   0.8.3       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Functions and help files

Functions are called with functionName(parameters). Multiple parameters are comma-separated. Parameters can be named. For unnamed parameters, the order matters. R functions don’t change the objects passed to them (more on this later). Instead, to store the result of a function, you need to assign its output to an object using the assignment operator <-. For example: object <- functionName(parameter1, parameter2).
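
For example, using the built-in round() function with a named parameter and storing the result:

result <- round(3.14159, digits=2)  # digits is a named parameter
result
## [1] 3.14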

Remember that all functions come with help files. Well-written packages will include extensive help files which explain the kinds of inputs the function takes, the purpose of each parameter or “argument”, and the kinds of output the function produces. They will also often include examples.

To access a help file, enter ?functionName into the console or use the Help tab in the bottom-right pane of RStudio.

Sometimes different packages will include functions with the same name. To make sure you are using the function from the right package, you can use the following syntax: packageName::functionName().
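
For example, after loading the tidyverse, dplyr::filter() masks stats::filter() (see the conflicts message above). Prefixing the function with the package name makes it explicit which one you mean (a quick sketch, output not shown):

stats::filter(1:10, rep(1/3, 3))  # base R's moving-average filter
dplyr::filter(mtcars, cyl == 6)   # dplyr's row-filtering function, on the built-in mtcars data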

Recall your DataCamp course

By now, you should all have completed the Introduction to R course on DataCamp. Congratulations! You already know quite a lot about how R works. To refresh, you covered the following material in your course:

  • Arithmetic with R
  • Variable assignment
  • Data types, including:
    • Character
    • Numeric
    • Logical
    • Factors
  • Data structures, including:
    • Vectors
    • Matrices
    • Data frames
    • Lists

In these lecture notes, we will review all of the above. In the lecture slides, we will move through these topics quickly, and mostly in the form of exercises, focusing our attention on some new material: reading and writing files, basic data manipulation, and basic data visualization.

Review: Basics and data types

A good reference for R Basics is the Base R Cheat Sheet.

Remember, to run a line of code from a script file in RStudio:

  • MAC: command + return
  • PC: Ctrl + Enter

Arithmetic

Let’s recall how to do some basic arithmetic in R:

2+2
## [1] 4
5%%2
## [1] 1
3.452*6
## [1] 20.712
2^4
## [1] 16

You can use ?Arithmetic to pull up the help for arithmetic operators.

A few functions

Functions are called with functionName(parameters). Multiple parameters are comma-separated. Parameters can be named. For unnamed parameters, the order matters. R functions don’t change the objects passed to them (more on this later).

log(10)
## [1] 2.302585
log(16, base=2)
## [1] 4
log10(10)
## [1] 1
sqrt(10)
## [1] 3.162278
exp(10)
## [1] 22026.47
sin(1)
## [1] 0.841471

Comparisons

1 < 2
## [1] TRUE
TRUE == FALSE
## [1] FALSE
'a' != "Boy" # not equal
## [1] TRUE

Note that character vectors/strings can use single or double quotes.

Logical Operators

& is “and”, | is “or”, and ! is “not”:

TRUE & FALSE
## [1] FALSE
!TRUE & FALSE
## [1] FALSE
TRUE | FALSE
## [1] TRUE
(2 > 1) & (3 > 2)
## [1] TRUE

You use these to join together conditions.

Variables

Use the <- operator to assign values to variables. = also works but is bad practice and less common.

The right hand side of the assignment can be any valid R expression. The right hand side is fully evaluated before the assignment occurs.
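
A small illustration: because the right hand side is evaluated first, a variable can be updated in terms of its current value.

x <- 5
x <- x + 1  # x + 1 is evaluated (giving 6), then assigned back to x
x
## [1] 6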

Variable names can contain letters, numbers, underscores and periods. They cannot start with a number; they should not start with an underscore or period in regular use. They cannot contain spaces. Different people use different conventions for long variable names; these include:

  • periods.between.words
  • underscores_between_words
  • camelCaseToSeparateWords
x <- 4
x
## [1] 4
y <- 3/10
y
## [1] 0.3
x + y
## [1] 4.3
myVariable <- x <- 3 + 4 + 7

Note that when you create a variable in RStudio, it shows up in the environment tab in the top-right pane.

Data Types

There are a few types of data in R:

  • logical: TRUE or FALSE
  • integer: a specific type; most numbers are numeric instead
  • numeric: real or decimal
  • complex: ex: 2+5i
  • character: "text data", denoted with single or double quotes
typeof(TRUE)
## [1] "logical"
typeof("foo")
## [1] "character"

Review: Data structures

Vectors

Vectors store multiple values of a single data type. You can create a vector by combining values with the c() function.

x<-c(1,2,3,4,5)
x<-1:5

Vectors can only have one type of value in them. If you combine multiple types, everything in the vector is converted (“coerced”) to a single type: whichever type present is lowest in the list below, since each type can be converted to the types that follow it:

  • logical
  • integer
  • numeric
  • complex
  • character
x<-c(TRUE, 2, 4.3)
x
## [1] 1.0 2.0 4.3
x<-c(4, "alpha", TRUE)
x
## [1] "4"     "alpha" "TRUE"

Functions and arithmetic operators can apply to vectors as well:

x <- c(1,2,3,4,5)
x+1
## [1] 2 3 4 5 6
x*2
## [1]  2  4  6  8 10
x*x
## [1]  1  4  9 16 25
log(x)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
x < 5
## [1]  TRUE  TRUE  TRUE  TRUE FALSE

Some functions apply to each element of a vector, while others take the whole vector as input and return a single result:

log(x)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
sum(x)
## [1] 15

Vectors are one-dimensional and can’t be nested:

c(c(1,2,3), 4, 5)
## [1] 1 2 3 4 5

Vector indexes (and all other indexes in R) start with 1, not 0:

x <- c('a', 'b', 'c', 'd', 'e')
x[1]
## [1] "a"

You can take “slices” of vectors using index brackets:

x[1:3]
## [1] "a" "b" "c"

Or exclude values with a negative sign:

x[-1]
## [1] "b" "c" "d" "e"

Elements are returned in the order that the indices are supplied:

y <- c(5,1)
x[y]
## [1] "e" "a"

You can also use a vector of integers or logical values to select elements from a vector:

x[x<'c']
## [1] "a" "b"
x[c(1,3,5)]
## [1] "a" "c" "e"

Get the length of a vector with length:

length(x)
## [1] 5

See if a value is in a vector with the %in% operator:

'b' %in% x
## [1] TRUE

Or get the first position of one or more elements in a vector with the match function:

match(c('b', 'd', 'k'), x)
## [1]  2  4 NA

Use which to find all positions:

y <- c('a','b','c','a','b','c')
which(y == 'c')
## [1] 3 6

You can also name the elements of a vector:

x<-1:5
names(x)<-c("Ohio","Illinois","Indiana","Michigan","Wisconsin")
x
##      Ohio  Illinois   Indiana  Michigan Wisconsin 
##         1         2         3         4         5

This allows you to select values from the vector using the names:

x["Ohio"]
## Ohio 
##    1
x[c("Illinois", "Indiana")]
## Illinois  Indiana 
##        2        3

Missing Data (NA)

Before we move onto other data structures, let’s pause to consider how to deal with missing values in a vector (or, later, a data frame). Missing data in R is encoded as NA. Some functions will ignore NA when doing computations. Others will ignore missing values if you tell them to. Others will process NA and give you a result of NA.

tmp <- c(1,2,5,NA,6,NA,2,5,1,1,NA,5)

You can test for NA (is.na). Or you can get the index location of the missing observations within the vector (useful for later selecting observations in your dataset).

is.na(tmp)
##  [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
## [12] FALSE
which(is.na(tmp))
## [1]  4  6 11

It can also be useful to count the number of NAs in a vector:

sum(is.na(tmp))
## [1] 3

Why does this work? How can you sum logical values? This takes advantage of the trick that TRUE=1 and FALSE=0. The function call tries to convert the logicals to numeric, and this is how the conversion works:

as.numeric(c(TRUE, FALSE))
## [1] 1 0

Remember that different functions treat NA differently. With an input vector that includes NA values, mean results in NA. It has an option to exclude missing:

mean(tmp)
## [1] NA
mean(tmp, na.rm=TRUE)
## [1] 3.111111

table behaves differently. It excludes NA by default. You have to tell it to include NA.

table(tmp)
## tmp
## 1 2 5 6 
## 3 2 3 1
table(tmp, useNA = "ifany")
## tmp
##    1    2    5    6 <NA> 
##    3    2    3    1    3

Other Special Values

NULL

NULL is another special type. NULL is usually used to mean undefined. You might get it when a function can’t compute a result. NULL is a single value and can’t be in a vector. (NAs can be in vectors and data.frames.)

c()
## NULL
c(NULL, NULL)
## NULL

The above somewhat surprisingly gives a single NULL because of the restrictions on how it’s used.

NULL should not be used for missing data.
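
One consequence, shown in a small illustrative example: assigning NULL to a list element removes the element entirely, while assigning NA keeps a placeholder for the missing value.

l <- list(a=1, b=2)
l$a <- NULL  # drops the element "a" from the list
l$b <- NA    # keeps "b", but marks its value as missing
names(l)
## [1] "b"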

NaN, Inf

NaN means “not a number”.

0/0
## [1] NaN

Inf and -Inf are “infinity” and “negative infinity”.

1/0
## [1] Inf
-1/0
## [1] -Inf

Factors

Factors are a special type of vector that can be used for categorical variables. Why would we need them? Consider that the values of character vectors sometimes have an order, and we may want to store this information in R. For example, consider a vector containing month names. When we use table(), R arranges the values in alphabetical order.

months<-c("January","March","February","December","January","March")
table(months)
## months
## December February  January    March 
##        1        1        2        2

The factor() function converts a vector into a factor. Without supplying any additional information, the function infers the possible “levels” that the vector takes.

months_fac <- factor(months)
levels(months_fac)
## [1] "December" "February" "January"  "March"

The levels of a factor can be put in a meaningful order, which is useful when you have categorical variables in your data. Let’s create such a factor from the months variable. Using the table() function on the factor, we can see one of the benefits of using factors for categorical variables - the values are ordered in a meaningful way rather than alphabetically.

months_fac <- factor(months, levels=c("January","February","March","December"))
table(months_fac)
## months_fac
##  January February    March December 
##        2        1        2        1

Note that you cannot add values to a factor that are not included as one of the levels; if you try, R issues a warning and stores NA instead.

months_fac[5] <- "April"

The best solution to this is to remake the factor with the extra level included. In the example below, the factor() function rebuilds the factor from the original months character vector, and we can then assign "April".

months_fac <- factor(months, levels=c("January","February","March","April","December")) 
months_fac[5] <- "April"
months_fac
## [1] January  March    February December April    March   
## Levels: January February March April December

Alternatively, when you create the factor for the first time, you can include all possible levels of the factor. This has the added benefit of producing even more meaningful results when using functions such as table().

months_fac <- factor(months, levels=c("January","February","March","April","May","June","July","August","September","October","November","December"))
table(months_fac)
## months_fac
##   January  February     March     April       May      June      July 
##         2         1         2         0         0         0         0 
##    August September   October  November  December 
##         0         0         0         0         1

Under the hood, factors are stored as integers, with the (ordered) levels attribute providing information about the character value associated with each integer.

typeof(months_fac)
## [1] "integer"

Even if you don’t plan to use categorical data, you should know that factors exist because when reading data into R, text strings can be loaded as factors.

Lists

Lists are a bit like complex vectors. An element of a list can hold any other object, including another list. You can keep multi-dimensional and ragged data in R using lists.

l1 <- list(1, "a", TRUE, 1+4i)
l1
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 1+4i
l2 <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE )
l2
## $title
## [1] "Research Bazaar"
## 
## $numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $data
## [1] TRUE

Indexing lists is a little different. [[1]] is the first element of the list as whatever type it was. [1] is a subset of the list – the first element of the list as a list. You can also access list elements by name using the $ operator.

l2[[2]]
##  [1]  1  2  3  4  5  6  7  8  9 10
l2[2]
## $numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
l2$numbers
##  [1]  1  2  3  4  5  6  7  8  9 10

Matrices

Matrices in R are two-dimensional arrays. All values of a matrix must be of the same type. You can initialize a matrix using the matrix() function.

matrix(c('a', 'b', 'c', 'd'), nrow=2)
##      [,1] [,2]
## [1,] "a"  "c" 
## [2,] "b"  "d"
y<-matrix(1:25, nrow=5, byrow=TRUE)
y
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
## [4,]   16   17   18   19   20
## [5,]   21   22   23   24   25

Matrices are used sparingly in R, primarily for numerical calculations or explicit matrix manipulation. You can attach names to rows and columns.
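
For example, a small illustrative sketch (separate from the matrices used below):

m <- matrix(1:4, nrow=2)
rownames(m) <- c("r1", "r2")
colnames(m) <- c("c1", "c2")
m["r2", "c1"]  # index by name instead of position
## [1] 2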

Matrix algebra functions are available:

y%*%y
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  215  230  245  260  275
## [2,]  490  530  570  610  650
## [3,]  765  830  895  960 1025
## [4,] 1040 1130 1220 1310 1400
## [5,] 1315 1430 1545 1660 1775
x<-1:5
y%*%x
##      [,1]
## [1,]   55
## [2,]  130
## [3,]  205
## [4,]  280
## [5,]  355
y^-1 # element-wise reciprocals, NOT matrix inversion (use solve() for that)
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] 1.00000000 0.50000000 0.33333333 0.25000000 0.20000000
## [2,] 0.16666667 0.14285714 0.12500000 0.11111111 0.10000000
## [3,] 0.09090909 0.08333333 0.07692308 0.07142857 0.06666667
## [4,] 0.06250000 0.05882353 0.05555556 0.05263158 0.05000000
## [5,] 0.04761905 0.04545455 0.04347826 0.04166667 0.04000000
y * -1
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   -1   -2   -3   -4   -5
## [2,]   -6   -7   -8   -9  -10
## [3,]  -11  -12  -13  -14  -15
## [4,]  -16  -17  -18  -19  -20
## [5,]  -21  -22  -23  -24  -25
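
If you need an actual matrix inverse, use solve(). The matrix y above is singular, so here is a small invertible example for illustration:

m <- matrix(c(2, 0, 0, 4), nrow=2)
solve(m)  # the true matrix inverse
##      [,1] [,2]
## [1,]  0.5 0.00
## [2,]  0.0 0.25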

Elements in a matrix are indexed like mat[row number, col number]. Omitting a value for row or column will give you the entire column or row, respectively.

y[1,1]
## [1] 1
y[1,]
## [1] 1 2 3 4 5
y[,1]
## [1]  1  6 11 16 21
y[1:2,3:4]
##      [,1] [,2]
## [1,]    3    4
## [2,]    8    9
y[,c(1,4)]
##      [,1] [,2]
## [1,]    1    4
## [2,]    6    9
## [3,]   11   14
## [4,]   16   19
## [5,]   21   24

Using just a single index will get the element from the specified position, as if the matrix were turned into a vector first:

w<-matrix(5:29, nrow=5)
w[7]
## [1] 11
as.vector(w)[7]
## [1] 11

Data Frames

Data frames are the core data structure in R. A data frame is a list of named vectors with the same length. Columns are typically variables and rows are observations. Different columns can have different types of data:

id<-1:20
id
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
color<-c(rep("red", 3), rep("green",10), rep("blue", 7))
color
##  [1] "red"   "red"   "red"   "green" "green" "green" "green" "green"
##  [9] "green" "green" "green" "green" "green" "blue"  "blue"  "blue" 
## [17] "blue"  "blue"  "blue"  "blue"
score<-runif(20)
score
##  [1] 0.20354493 0.50185548 0.35535068 0.52583579 0.57558433 0.81498953
##  [7] 0.83673077 0.62897844 0.21323122 0.69859791 0.34753467 0.90691752
## [13] 0.57466982 0.79328765 0.55604629 0.07937945 0.67997536 0.99449463
## [19] 0.68909116 0.57609978
df<-data.frame(id, color, score)
df
##    id color      score
## 1   1   red 0.20354493
## 2   2   red 0.50185548
## 3   3   red 0.35535068
## 4   4 green 0.52583579
## 5   5 green 0.57558433
## 6   6 green 0.81498953
## 7   7 green 0.83673077
## 8   8 green 0.62897844
## 9   9 green 0.21323122
## 10 10 green 0.69859791
## 11 11 green 0.34753467
## 12 12 green 0.90691752
## 13 13 green 0.57466982
## 14 14  blue 0.79328765
## 15 15  blue 0.55604629
## 16 16  blue 0.07937945
## 17 17  blue 0.67997536
## 18 18  blue 0.99449463
## 19 19  blue 0.68909116
## 20 20  blue 0.57609978

Instead of making individual objects first, we could do it all together:

df<-data.frame(id=1:20, 
               color=c(rep("red", 3), rep("green",10), rep("blue", 7)),
               score=runif(20))

Data frames can be indexed like matrices to retrieve the values.

df[2,2]
## [1] red
## Levels: blue green red
df[1,]
##   id color     score
## 1  1   red 0.2482876
df[,3]
##  [1] 0.24828757 0.06906664 0.34048913 0.08584947 0.23787852 0.15512797
##  [7] 0.31793669 0.33260000 0.22951894 0.06794502 0.30247253 0.75979288
## [13] 0.28629046 0.91565300 0.05544233 0.16628694 0.82199100 0.08954519
## [19] 0.32176353 0.46035658

You can use negative values when indexing to exclude values:

df[,-2]
##    id      score
## 1   1 0.24828757
## 2   2 0.06906664
## 3   3 0.34048913
## 4   4 0.08584947
## 5   5 0.23787852
## 6   6 0.15512797
## 7   7 0.31793669
## 8   8 0.33260000
## 9   9 0.22951894
## 10 10 0.06794502
## 11 11 0.30247253
## 12 12 0.75979288
## 13 13 0.28629046
## 14 14 0.91565300
## 15 15 0.05544233
## 16 16 0.16628694
## 17 17 0.82199100
## 18 18 0.08954519
## 19 19 0.32176353
## 20 20 0.46035658
df[-1:-10,]
##    id color      score
## 11 11 green 0.30247253
## 12 12 green 0.75979288
## 13 13 green 0.28629046
## 14 14  blue 0.91565300
## 15 15  blue 0.05544233
## 16 16  blue 0.16628694
## 17 17  blue 0.82199100
## 18 18  blue 0.08954519
## 19 19  blue 0.32176353
## 20 20  blue 0.46035658

You can also use the names of the columns after a $ or in the indexing:

df$color
##  [1] red   red   red   green green green green green green green green
## [12] green green blue  blue  blue  blue  blue  blue  blue 
## Levels: blue green red

Indexing into a data frame with a single integer or name of the column will give you the column(s) specified as a new data frame.

df['color']
##    color
## 1    red
## 2    red
## 3    red
## 4  green
## 5  green
## 6  green
## 7  green
## 8  green
## 9  green
## 10 green
## 11 green
## 12 green
## 13 green
## 14  blue
## 15  blue
## 16  blue
## 17  blue
## 18  blue
## 19  blue
## 20  blue
df[2:3]
##    color      score
## 1    red 0.24828757
## 2    red 0.06906664
## 3    red 0.34048913
## 4  green 0.08584947
## 5  green 0.23787852
## 6  green 0.15512797
## 7  green 0.31793669
## 8  green 0.33260000
## 9  green 0.22951894
## 10 green 0.06794502
## 11 green 0.30247253
## 12 green 0.75979288
## 13 green 0.28629046
## 14  blue 0.91565300
## 15  blue 0.05544233
## 16  blue 0.16628694
## 17  blue 0.82199100
## 18  blue 0.08954519
## 19  blue 0.32176353
## 20  blue 0.46035658

Instead of index numbers or names, you can also select values by using logical statements. This is most often done to select rows.

df[df$color == "green",]
##    id color      score
## 4   4 green 0.08584947
## 5   5 green 0.23787852
## 6   6 green 0.15512797
## 7   7 green 0.31793669
## 8   8 green 0.33260000
## 9   9 green 0.22951894
## 10 10 green 0.06794502
## 11 11 green 0.30247253
## 12 12 green 0.75979288
## 13 13 green 0.28629046
df[df$score > .5,]
##    id color     score
## 12 12 green 0.7597929
## 14 14  blue 0.9156530
## 17 17  blue 0.8219910
df[df$score > .5 & df$color == "blue",]
##    id color    score
## 14 14  blue 0.915653
## 17 17  blue 0.821991

You can assign names to the rows of a data frame as well as to the columns, and then use those names for indexing and selecting data.

rownames(df)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20"

You can add columns or rows simply by assigning values to them. There are also rbind and cbind (for row bind and column bind) functions that can be useful.

df$year<-1901:1920
df
##    id color      score year
## 1   1   red 0.24828757 1901
## 2   2   red 0.06906664 1902
## 3   3   red 0.34048913 1903
## 4   4 green 0.08584947 1904
## 5   5 green 0.23787852 1905
## 6   6 green 0.15512797 1906
## 7   7 green 0.31793669 1907
## 8   8 green 0.33260000 1908
## 9   9 green 0.22951894 1909
## 10 10 green 0.06794502 1910
## 11 11 green 0.30247253 1911
## 12 12 green 0.75979288 1912
## 13 13 green 0.28629046 1913
## 14 14  blue 0.91565300 1914
## 15 15  blue 0.05544233 1915
## 16 16  blue 0.16628694 1916
## 17 17  blue 0.82199100 1917
## 18 18  blue 0.08954519 1918
## 19 19  blue 0.32176353 1919
## 20 20  blue 0.46035658 1920
df[21,]<-list(21, "green", 0.4, 1921)

Note that we had to use a list for adding a row because there are different data types.
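
An equivalent sketch using rbind(): build the new row as a one-row data frame so that each column keeps its type.

new_row <- data.frame(id=22, color="blue", score=0.9, year=1922)
df <- rbind(df, new_row)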

Reading data files

Working Directory

Before reading or writing files, it’s often useful to set the working directory first so that you don’t have to specify complete file paths.

You can go to the Files tab in the bottom right window in RStudio and find the directory you want. Then under the More menu, there is an option to set the current directory as the working directory. Or you can use the setwd command like:

setwd("~/training/intror")  # ~ stands for your home directory
setwd("/Users/username/Documents/workshop") # mac, absolute path example
setwd("C:/Users/username/Documents/workshop") # windows, absolute path example; use / or \\ in paths, never a single \

In our case, we are working out of the directory team/bootcamp/2018 in the base directory. So we can set our working directory as follows:

setwd("~/team/bootcamp/2018/R session materials")

To check where your working directory is, use getwd():

getwd()
## [1] "C:/Users/kumar/OneDrive - Northwestern University - Student Advantage/Assistantships & Jobs/MSiA/bootcamp-2019"

Reading

Read in a CSV file and save it as a data frame with a name. Below are two examples, reading from a URL and from a local file stored in the working directory, respectively:

# Using a URL
schooldata <- read.csv("https://goo.gl/f4UhMX")

# Using a local file
gapminder <- read.csv("data/gapminder5.csv")

You can view the data frames in RStudio using the View() function.

View(schooldata)
View(gapminder)

You could also use the Import Dataset option in the Environment tab in the top right window in RStudio.

Looking at the help for read.csv, there are a number of different options and different function calls. read.table, read.csv, and read.delim all work in the same basic way and take the same set of arguments, but they have different defaults. Key options to pay attention to include:

  • header: whether the first row of the file has the names of the columns
  • sep: the separator used (comma, tab (enter as \t), etc) in the file
  • na.strings: how missing data is encoded in your file. “NA” is treated as missing by default; blank values are also treated as missing by default in everything but character data.
  • stringsAsFactors: should strings (text data) be converted to factors or kept as is? Example of this below.

Let’s redo the above with a better set of options:

gapminder <- read.csv("data/gapminder5.csv", 
                      stringsAsFactors=FALSE, 
                      strip.white=TRUE, 
                      na.strings=c("NA", ""))

The na.strings option is needed here because, while blanks are treated as missing by default in numeric fields (which includes factors), they are not treated as missing by default in character data.

readr Package

Does all of the above seem annoying or unnecessarily complicated? Others have thought so too.

Look at the readr package (part of the tidyverse), which attempts to smooth over some of the annoyances of reading files into R. The main source of potential problems when using readr functions is that they guess variable types from a subset of the observations, so if you have a strange value further down in your dataset, you might get an error or an unexpected value conversion.

To read in the same data with the same settings as above, using readr (note similar function name, with _ instead of .):

library(readr)
gapminder <- read_csv("data/gapminder5.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   year = col_double(),
##   pop = col_double(),
##   continent = col_character(),
##   lifeExp = col_double(),
##   gdpPercap = col_double()
## )

Options used above are defaults in readr. You get a long message about the column types.

Learn more at the readr website.

Reading different file formats

For Stata, SAS, or SPSS files, try the haven or foreign packages. For Excel files, use the readxl package.
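
A couple of illustrative sketches (the file names here are hypothetical, and the packages must be installed):

# Excel files with readxl
library(readxl)
sales <- read_excel("data/sales.xlsx", sheet=1)

# Stata files with haven
library(haven)
survey <- read_dta("data/survey.dta")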

Special Data Types

The data.table package also has functions for reading in data, which you will learn about on Day 3 of the boot camp. Its fread function is relatively fast for reading a rectangular, standardized data file into R.
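
A minimal sketch of reading the same CSV with fread():

library(data.table)
gapminder_dt <- fread("data/gapminder5.csv")  # guesses the separator and column types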

R also has packages for reading other structured files like XML and JSON, or interfacing with databases. For more on using R with databases, see the R section of the Databases workshop materials from NUIT Research Computing Services.

There are also multiple packages that make collecting data from APIs (either in general or specific APIs like the Census Bureau) easier. There are also packages that interface with Google docs/drive and Dropbox, although those APIs change frequently, so beware when using those packages if they haven’t been updated recently.

Data manipulation in base R

Exploring a data frame

In the previous section, we imported two datasets. For the rest of today, we will focus on the Gapminder data, which is stored in our environment as gapminder. To refresh yourself, you can view the data frame in R using the View() function.

View(gapminder)

You can also see a list of variables using names().

names(gapminder)
## [1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

Other useful functions are dim(), which shows the dimensions of the data frame; str(), which shows the dimensions along with the names of variables and the first few values in each variable; nrow() and ncol(), which show the number of rows and columns; and head(), which shows the first few rows of the data frame (6 rows by default).
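
For example (output omitted here):

dim(gapminder)   # number of rows and columns
str(gapminder)   # variable names, types, and first few values
nrow(gapminder)  # number of rows
head(gapminder)  # first six rows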

When applied to a data frame, the summary() function provides useful summary statistics for each variable (i.e. column) in the data frame. Let’s try it with the Gapminder data:

summary(gapminder)
##    country               year           pop             continent        
##  Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
##  Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
##  Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
##                     Mean   :1980   Mean   :2.960e+07                     
##                     3rd Qu.:1993   3rd Qu.:1.959e+07                     
##                     Max.   :2007   Max.   :1.319e+09                     
##     lifeExp        gdpPercap       
##  Min.   :23.60   Min.   :   241.2  
##  1st Qu.:48.20   1st Qu.:  1202.1  
##  Median :60.71   Median :  3531.8  
##  Mean   :59.47   Mean   :  7215.3  
##  3rd Qu.:70.85   3rd Qu.:  9325.5  
##  Max.   :82.60   Max.   :113523.1

We can also use functions like mean(), median(), var(), sd(), and quantile() to calculate other summary statistics for individual variables. For example, let’s calculate the mean of life expectancy. Recall that we can use the $ operator to call up a variable within a data frame using its name.

mean(gapminder$lifeExp)
## [1] 59.47444

A useful way to examine a discrete or categorical variable is to use a frequency table. These are easy to make in R, using the table() function:

table(gapminder$continent)
## 
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24

prop.table() converts the counts from table() into proportions, showing the proportion of rows in each category:

prop.table(table(gapminder$continent))
## 
##     Africa   Americas       Asia     Europe    Oceania 
## 0.36619718 0.17605634 0.23239437 0.21126761 0.01408451

You can generate a frequency table with more than one variable as well:

table(gapminder$continent, gapminder$year)
##           
##            1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
##   Africa     52   52   52   52   52   52   52   52   52   52   52   52
##   Americas   25   25   25   25   25   25   25   25   25   25   25   25
##   Asia       33   33   33   33   33   33   33   33   33   33   33   33
##   Europe     30   30   30   30   30   30   30   30   30   30   30   30
##   Oceania     2    2    2    2    2    2    2    2    2    2    2    2

Subsetting

Notice that each row in the data frame represents one country in a given year. Perhaps we are interested in analyzing only data from one year. To do this, we will have to “subset” our data frame to include only those rows that we want to keep.

The subset() function lets you select rows and columns you want to keep. You can either name columns or rows, or include a logical statement such that only rows/columns where the statement is true are retained.

subset(data.frame, 
       subset=condition indicating rows to keep,
       select=condition indicating columns to keep)

For example, let’s create a new data frame containing only 2007 data by subsetting the original data frame.

gapminder07 <- subset(gapminder, subset = year==2007)

Look at the number of rows in the new data frame: it is only 142, whereas the original data frame has 1704 rows.

nrow(gapminder07)
## [1] 142

Sorting

The sort() function reorders elements, in ascending order by default. You can flip the order by using the decreasing = TRUE argument.

sort(gapminder07$lifeExp)
##   [1] 39.613 42.082 42.384 42.568 42.592 42.731 43.487 43.828 44.741 45.678
##  [11] 46.242 46.388 46.462 46.859 48.159 48.303 48.328 49.339 49.580 50.430
##  [21] 50.651 50.728 51.542 51.579 52.295 52.517 52.906 52.947 54.110 54.467
##  [31] 54.791 55.322 56.007 56.728 56.735 56.867 58.040 58.420 58.556 59.443
##  [41] 59.448 59.545 59.723 60.022 60.916 62.069 62.698 63.062 63.785 64.062
##  [51] 64.164 64.698 65.152 65.483 65.528 65.554 66.803 67.297 69.819 70.198
##  [61] 70.259 70.616 70.650 70.964 71.164 71.338 71.421 71.688 71.752 71.777
##  [71] 71.878 71.993 72.235 72.301 72.390 72.396 72.476 72.535 72.567 72.777
##  [81] 72.801 72.889 72.899 72.961 73.005 73.338 73.422 73.747 73.923 73.952
##  [91] 74.002 74.143 74.241 74.249 74.543 74.663 74.852 74.994 75.320 75.537
## [101] 75.563 75.635 75.640 75.748 76.195 76.384 76.423 76.442 76.486 77.588
## [111] 77.926 78.098 78.242 78.273 78.332 78.400 78.553 78.623 78.746 78.782
## [121] 78.885 79.313 79.406 79.425 79.441 79.483 79.762 79.829 79.972 80.196
## [131] 80.204 80.546 80.653 80.657 80.745 80.884 80.941 81.235 81.701 81.757
## [141] 82.208 82.603
sort(gapminder07$lifeExp, decreasing=TRUE)
##   [1] 82.603 82.208 81.757 81.701 81.235 80.941 80.884 80.745 80.657 80.653
##  [11] 80.546 80.204 80.196 79.972 79.829 79.762 79.483 79.441 79.425 79.406
##  [21] 79.313 78.885 78.782 78.746 78.623 78.553 78.400 78.332 78.273 78.242
##  [31] 78.098 77.926 77.588 76.486 76.442 76.423 76.384 76.195 75.748 75.640
##  [41] 75.635 75.563 75.537 75.320 74.994 74.852 74.663 74.543 74.249 74.241
##  [51] 74.143 74.002 73.952 73.923 73.747 73.422 73.338 73.005 72.961 72.899
##  [61] 72.889 72.801 72.777 72.567 72.535 72.476 72.396 72.390 72.301 72.235
##  [71] 71.993 71.878 71.777 71.752 71.688 71.421 71.338 71.164 70.964 70.650
##  [81] 70.616 70.259 70.198 69.819 67.297 66.803 65.554 65.528 65.483 65.152
##  [91] 64.698 64.164 64.062 63.785 63.062 62.698 62.069 60.916 60.022 59.723
## [101] 59.545 59.448 59.443 58.556 58.420 58.040 56.867 56.735 56.728 56.007
## [111] 55.322 54.791 54.467 54.110 52.947 52.906 52.517 52.295 51.579 51.542
## [121] 50.728 50.651 50.430 49.580 49.339 48.328 48.303 48.159 46.859 46.462
## [131] 46.388 46.242 45.678 44.741 43.828 43.487 42.731 42.592 42.568 42.384
## [141] 42.082 39.613

The order() function gives you the index positions in sorted order:

order(gapminder07$lifeExp)
##   [1] 122  87 141 113  74   4 142   1  22  75 108  53  28  95 117  78  31
##  [18] 118  18  20  23  14 133  41  17 127  89  43  69  80  36  29  52  11
##  [35]  46  94  42 129 121  77  47  62  19  49  54  88 140 111  90   9  81
##  [52]  59  27  98 109  12  84  70 130  55  51 128  60  61  86  39 101 102
##  [69] 100 132  40  73  37   3  15 120 107  68  66 110  82  26  93  25  16
##  [86]  57 139 137 131  76 112 125  79 138  85 115  13  38   5  99 103   8
## [103]  97  32  83 136   2 106  34  72 116 104 135  33  35 126  24  71 105
## [120]  30  63  44  48 134  10  50  91   7 114  96  92  65  21  45  64 123
## [137] 119   6 124  58  56  67

order() is useful for arranging data frames. Combined with head(), which shows the first 6 rows of a data frame by default, we can use this to view the rows of the data frame with the highest life expectancy:

head(gapminder07[order(gapminder07$lifeExp, decreasing=TRUE),])
## # A tibble: 6 x 6
##   country          year       pop continent lifeExp gdpPercap
##   <chr>           <dbl>     <dbl> <chr>       <dbl>     <dbl>
## 1 Japan            2007 127467972 Asia         82.6    31656.
## 2 Hong Kong China  2007   6980412 Asia         82.2    39725.
## 3 Iceland          2007    301931 Europe       81.8    36181.
## 4 Switzerland      2007   7554661 Europe       81.7    37506.
## 5 Australia        2007  20434176 Oceania      81.2    34435.
## 6 Spain            2007  40448191 Europe       80.9    28821.

Sorting a table is often useful. For example:

sort(table(gapminder07$continent))
## 
##  Oceania Americas   Europe     Asia   Africa 
##        2       25       30       33       52

Adding and removing columns

You can add variables to a data frame in several ways. Here, we will show two standard methods using base R. On Day 3, you will learn about alternatives using the data.table and dplyr approaches.

To demonstrate, let’s first create a vector with the same number of values as the number of rows in the data frame. If you want to learn what is going on in this code, look at the help file for the function sample().

newvar <- sample(1:5000, 1704, replace = FALSE)

You can add a variable/column by using the cbind() function:

gapminder <- cbind(gapminder, newvar)

You can add a variable/column by assigning it to the data frame directly:

gapminder$newvar <- newvar

To remove a variable/column from a data frame, you can assign a NULL value to the variable:

gapminder$newvar <- NULL

You can also remove a variable/column by negatively indexing the data frame:

gapminder <- gapminder[, names(gapminder) != "newvar"]
gapminder <- gapminder[, !(names(gapminder) %in% c("newvar"))] 
# The second method is equivalent to the first, but can be used to remove multiple columns at the same time.

To add rows, you can use the function rbind(). Remember that rows may include different data types, in which case you would need to use the function list().

Recoding variables

To recode a variable, you could make a new column, or overwrite the existing one entirely. For example, let’s create a new variable for life expectancy containing rounded values, using the round() function.

gapminder07$lifeExp_rounded <- round(gapminder07$lifeExp)

If you just want to replace part of a column (or vector), you can assign to a subset. For example, let’s say we want to create a new variable that marks all cases where life expectancy is higher than the mean as “High” and those where it is lower than the mean as “Low”.

# Start by creating a new variable with all missing values
gapminder07$lifeExp_highlow <- NA
# Replace higher-than-mean values with "High"
gapminder07$lifeExp_highlow[gapminder07$lifeExp>mean(gapminder07$lifeExp)] <- "High"
# Replace lower-than-mean values with "Low"
gapminder07$lifeExp_highlow[gapminder07$lifeExp<mean(gapminder07$lifeExp)] <- "Low"

There’s also a recode() function in the dplyr library. You specify the reassignment of values. For example, let’s create a new variable with abbreviated continent names.

library(dplyr)
gapminder07$continent_abrv <- recode(gapminder07$continent,
                                     `Africa`="AF",
                                     `Americas`="AM",
                                     `Asia`="AS",
                                     `Europe`="EU",
                                     `Oceania`="OC")
table(gapminder07$continent_abrv)
## 
## AF AM AS EU OC 
## 52 25 33 30  2

We will return to recode() and other dplyr functions on Day 3. The ifelse() function, covered on Day 2, is also useful for recoding.
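
For instance, the high/low recode above can be written in a single step with ifelse() (a sketch; lifeExp_highlow2 is just an illustrative name):

# ifelse(test, value_if_true, value_if_false) works element-wise over the vector
gapminder07$lifeExp_highlow2 <- ifelse(gapminder07$lifeExp > mean(gapminder07$lifeExp), "High", "Low")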

Aggregating

To compute summary statistics by groups in the data, one option is to use the aggregate() function. For example, we can calculate the mean of life expectancy for each continent:

aggregate(gapminder07$lifeExp ~ gapminder07$continent, FUN=mean)
##   gapminder07$continent gapminder07$lifeExp
## 1                Africa            54.80604
## 2              Americas            73.60812
## 3                  Asia            70.72848
## 4                Europe            77.64860
## 5               Oceania            80.71950

The ~ operator can be read as “by” or “as a function of”, and is used in many contexts. A construction such as y ~ x1 + x2 is referred to as a formula.

We can also aggregate by two variables. For example, let’s use the original Gapminder data (not just the 2007 data) and aggregate by continent and year.

aggregate(gapminder$lifeExp ~ gapminder$year + gapminder$continent, FUN=mean)
##    gapminder$year gapminder$continent gapminder$lifeExp
## 1            1952              Africa          39.13550
## 2            1957              Africa          41.26635
## 3            1962              Africa          43.31944
## 4            1967              Africa          45.33454
## 5            1972              Africa          47.45094
## 6            1977              Africa          49.58042
## 7            1982              Africa          51.59287
## 8            1987              Africa          53.34479
## 9            1992              Africa          53.62958
## 10           1997              Africa          53.59827
## 11           2002              Africa          53.32523
## 12           2007              Africa          54.80604
## 13           1952            Americas          53.27984
## 14           1957            Americas          55.96028
## 15           1962            Americas          58.39876
## 16           1967            Americas          60.41092
## 17           1972            Americas          62.39492
## 18           1977            Americas          64.39156
## 19           1982            Americas          66.22884
## 20           1987            Americas          68.09072
## 21           1992            Americas          69.56836
## 22           1997            Americas          71.15048
## 23           2002            Americas          72.42204
## 24           2007            Americas          73.60812
## 25           1952                Asia          46.31439
## 26           1957                Asia          49.31854
## 27           1962                Asia          51.56322
## 28           1967                Asia          54.66364
## 29           1972                Asia          57.31927
## 30           1977                Asia          59.61056
## 31           1982                Asia          62.61794
## 32           1987                Asia          64.85118
## 33           1992                Asia          66.53721
## 34           1997                Asia          68.02052
## 35           2002                Asia          69.23388
## 36           2007                Asia          70.72848
## 37           1952              Europe          64.40850
## 38           1957              Europe          66.70307
## 39           1962              Europe          68.53923
## 40           1967              Europe          69.73760
## 41           1972              Europe          70.77503
## 42           1977              Europe          71.93777
## 43           1982              Europe          72.80640
## 44           1987              Europe          73.64217
## 45           1992              Europe          74.44010
## 46           1997              Europe          75.50517
## 47           2002              Europe          76.70060
## 48           2007              Europe          77.64860
## 49           1952             Oceania          69.25500
## 50           1957             Oceania          70.29500
## 51           1962             Oceania          71.08500
## 52           1967             Oceania          71.31000
## 53           1972             Oceania          71.91000
## 54           1977             Oceania          72.85500
## 55           1982             Oceania          74.29000
## 56           1987             Oceania          75.32000
## 57           1992             Oceania          76.94500
## 58           1997             Oceania          78.19000
## 59           2002             Oceania          79.74000
## 60           2007             Oceania          80.71950

Statistics

Now that we have a dataset … we can do statistics! You will learn more about particular statistical models and methods over the course of your program. For now, let’s do some basic things to get a feel for how R handles statistical analysis.

Correlations

You can use the cor() function to calculate correlation (Pearson’s r):

cor(gapminder07$lifeExp, gapminder07$gdpPercap)
## [1] 0.6786624

You can also find the covariance:

cov(gapminder07$lifeExp, gapminder07$gdpPercap)
## [1] 105368

T-test

Do countries with high or low life expectancy have different GDP per capita? Apart from simply comparing the means for the two groups, we can use a T-test to evaluate whether the difference between these means is statistically significant.

t.test(gapminder07$gdpPercap~gapminder07$lifeExp_highlow)
## 
##  Welch Two Sample t-test
## 
## data:  gapminder07$gdpPercap by gapminder07$lifeExp_highlow
## t = 10.564, df = 95.704, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  12674.02 18539.14
## sample estimates:
## mean in group High  mean in group Low 
##          17944.685           2338.104

Remember: you can read ~ as “as a function of”. So the above code reads “GDP per capita as a function of life expectancy”, meaning grouped by or explained by.

We don’t have to use the formula syntax. We can specify data for two different groups. Let’s see if GDP per capita is different when comparing the Americas and Asia.

t.test(gapminder07$gdpPercap[gapminder07$continent=="Asia"], gapminder07$gdpPercap[gapminder07$continent=="Americas"])
## 
##  Welch Two Sample t-test
## 
## data:  gapminder07$gdpPercap[gapminder07$continent == "Asia"] and gapminder07$gdpPercap[gapminder07$continent == "Americas"]
## t = 0.46849, df = 55.535, p-value = 0.6413
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4816.823  7756.813
## sample estimates:
## mean of x mean of y 
##  12473.03  11003.03

By storing the output of the T-test (which is a list) as its own object, we can access different parts of the results.

t1 <- t.test(gapminder07$gdpPercap~gapminder07$lifeExp_highlow)
names(t1)
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
## [6] "null.value"  "alternative" "method"      "data.name"
t1$p.value
## [1] 9.507438e-18

Regression

Of course, the two life expectancy “groups” we used above to conduct a T-test are based on a continuous variable indicating life expectancy. We may be more interested in whether this variable predicts GDP per capita rather than the two “groups” that we created using an arbitrary threshold.

The basic syntax for a linear regression is shown below. Note that instead of repeating df$variablename several times, we can indicate the data frame name using the data = argument and simply use variable names.

lm(y ~ x1 + x2 + x3, data=df_name)

Example:

lm(gdpPercap ~ lifeExp, data=gapminder07)
## 
## Call:
## lm(formula = gdpPercap ~ lifeExp, data = gapminder07)
## 
## Coefficients:
## (Intercept)      lifeExp  
##    -36759.4        722.9

The default output isn’t much. You get a lot more with summary():

r1 <- lm(gdpPercap ~ lifeExp, data=gapminder07)
summary(r1)
## 
## Call:
## lm(formula = gdpPercap ~ lifeExp, data = gapminder07)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14473  -7840  -2145   6159  28143 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36759.42    4501.25  -8.166 1.67e-13 ***
## lifeExp        722.90      66.12  10.933  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9479 on 140 degrees of freedom
## Multiple R-squared:  0.4606, Adjusted R-squared:  0.4567 
## F-statistic: 119.5 on 1 and 140 DF,  p-value: < 2.2e-16

Note that a constant (Intercept) term was added automatically.

Let’s try another regression with two independent variables. This time, we will predict life expectancy as a function of GDP per capita and population.

r2 <- lm(lifeExp ~ gdpPercap + pop, data=gapminder07)
summary(r2)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap + pop, data = gapminder07)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.496  -6.119   1.899   7.018  13.383 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.921e+01  1.040e+00  56.906   <2e-16 ***
## gdpPercap   6.416e-04  5.818e-05  11.029   <2e-16 ***
## pop         7.001e-09  5.068e-09   1.381    0.169    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.87 on 139 degrees of freedom
## Multiple R-squared:  0.4679, Adjusted R-squared:  0.4602 
## F-statistic: 61.11 on 2 and 139 DF,  p-value: < 2.2e-16

Writing data files

You will often want to save your work in R as well. There are a few different ways to save:

Writing a data file

We imported the gapminder data earlier in CSV format, and manipulated it in several ways: we subsetted the 2007 data and added the variables lifeExp_rounded, lifeExp_highlow, and continent_abrv.

The best method for making your workflow and analysis reproducible is to write any data sets you create to plain text files.

Let’s try to save our subsetted and manipulated gapminder07 data frame as a CSV. To write a CSV, there are write.csv and write.table functions, similar to their read counterparts. The one trick is that you usually want to NOT write row.names.

write.csv(gapminder07, file="data/gapminder_2007_edited.csv", 
          row.names=FALSE)

Or using the readr package’s equivalent (write_csv never writes row names):

write_csv(gapminder07, "data/gapminder_2007_edited.csv")

Saving R objects

You can use the save function to save multiple objects together in a file. The standard file extension to use is .RData. Example:

save(schooldata, gapminder, 
     file = "workshopobjects.RData")

To later load in saved data, use the load function:

load("workshopobjects.RData")

This can be useful if you’re working with multiple objects and want to be able to pick up your work easily later. But .RData files generally aren’t portable to other programs, so think of them only as internal R working files – not the format you want to keep data in long-term. Loading a .RData file will overwrite objects with the same name already in the environment.

You can also save all the objects in your environment by using the save.image() function, or by clicking the “Save” icon in the Environment pane in RStudio.

Data visualization in base R

We will spend a lot more time later in the boot camp on data visualization, but today we will briefly introduce some functions for visualization that are included in base R. These functions are useful to quickly visualize data in early phases of analysis, and their syntax is often incorporated into other packages. For more advanced and aesthetically pleasing data visualization, you will want to use the ggplot2 package, which we will go over in detail on Day 3.

Histograms

Histograms are a simple and useful way to visualize the distribution of a variable. For example, let’s plot a histogram of life expectancy from the gapminder07 data frame:

hist(gapminder07$lifeExp)

By reading the help file for the hist() function, we can identify several arguments that change the aesthetics of the plot. The breaks = argument suggests the number of breaks (bins) to use on the x-axis.

hist(gapminder07$lifeExp, breaks=20,
     main="Life expectancy (2007 data)", ylab="Frequency", xlab="Life expectancy")

Scatterplots

The simplest way to plot the relationship between two variables is a scatterplot. If you provide two variables to the plot() function in R, it produces a scatterplot. Let’s try it with life expectancy and GDP per capita in the gapminder07 data frame. Recall that ~ means “a function of”, so we will put the y-axis variable on the left and the x-axis variable on the right.

plot(gapminder07$lifeExp ~ gapminder07$gdpPercap)

Again, we can add axes and labels:

plot(gapminder07$lifeExp ~ gapminder07$gdpPercap, main="Life expectancy as a function of GDP per capita (2007 data)", xlab="GDP per capita", ylab="Life expectancy")

Perhaps we want to add a line indicating the mean value of life expectancy. We can do this by using the abline() function to add a line after creating a plot. Adding multiple layers to a plot is much more intuitive and flexible with the ggplot2 package, which we will explore on Day 3.

plot(gapminder07$lifeExp ~ gapminder07$gdpPercap, main="Life expectancy as a function of GDP per capita (2007 data)", xlab="GDP per capita", ylab="Life expectancy")
abline(h = mean(gapminder07$lifeExp))
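
You could also overlay a fitted regression line on the same plot. A sketch, fitting life expectancy on GDP per capita so that the model matches the plot’s axes:

abline(lm(lifeExp ~ gdpPercap, data=gapminder07), col="red")  # add the least-squares fit line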