R is a free software environment for statistical computing and graphics. It is supported by the R Foundation for Statistical Computing and was initially developed in the early 1990s. As a full programming language, it is more flexible than alternatives for statistical computing such as Stata. Because it is designed for data analysis, it suits the needs of data scientists better than most general-purpose programming languages. Because it is open source, it is customizable and extensible, and users worldwide have written thousands of packages for it. Finally, R is widely used and a large community has grown around it, so help is usually easy to find.
For more information, see the R project website. You can get a taste for the R community at R-bloggers, the RStudio community forum, R-Ladies, and Stack Overflow.
Here are a few things to remember about R:
- “R” refers to both the language and the environment (application).
- R runs in memory: objects are loaded in memory.
- It’s expected that you’ll install and use additional packages.
- Packages are open source and user contributed, so use established packages or evaluate quality.
- There are multiple ways to do most things. Some ways are better than others, but sometimes it is a question of style and preference.
- You can, and often will, have more than one dataset open in R at the same time.
RStudio is a software program that makes working in R easier, developed and maintained by a company with the same name. It provides an integrated development environment, or IDE, for R. RStudio helps you organize your workflow and keep track of your work. The top-left pane is where you open, work on, and save script files. The bottom-left pane includes the console, which is where your code actually “runs”: you can run code here directly, or from a script file. The right-hand side panes include tools for managing your environment, workspace, and packages; for plotting and graphics; for accessing help files; and more.
At MSiA, you have access to RStudio Server, which allows you to access RStudio through a web browser and do computation on a server rather than your personal computer.
Much of R’s power comes from contributed packages. You can install and manage packages using the Packages tab in the bottom right pane in RStudio. Or you can install packages with a command:
install.packages("tidyverse")
The RStudio Server Pro install that you are using already has tidyverse
and some other common packages installed at the system-level, so you will not actually need to do this step. But do take note of this material, as you will likely need to install packages in the future. You can easily check what packages have been installed already in the “Packages” tab on the bottom-right pane.
Note that tidyverse is a composite package that will install multiple component packages and their dependencies. It includes dplyr and ggplot2, which we will use extensively on Day 3 of the boot camp. You’ll get a lot of messages as the installation happens.
CRAN (Comprehensive R Archive Network) is the name of the package repository. There are mirrors around the world. You can also install packages that are not on CRAN using the devtools
package.
If you have trouble or get errors when trying to install a package, you might need to specify the repository mirror to download from:
install.packages("tidyverse", repos="http://cran.wustl.edu/")
After you install a package, you have to load it with the library
function to actually use it.
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1 v purrr 0.3.2
## v tibble 2.1.1 v dplyr 0.8.0.1
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Functions are called with functionName(parameters). Multiple parameters are comma-separated. Parameters can be named. For unnamed parameters, the order matters. R functions don’t change the objects passed to them (more on this later). Instead, to store the result of a function, you need to assign its output to an object using the assignment operator <-. For example: object <- functionName(parameter1, parameter2).
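As a small illustration of these calling rules, the sketch below uses the base function round() on a made-up vector `v` (the values are hypothetical, not workshop data). It shows positional versus named arguments, and that the input is left unchanged unless you assign the result.

```r
# Hypothetical example values, not from the workshop data
v <- c(3.14159, 2.71828)

# Positional arguments: order matters (x first, then digits)
round(v, 2)

# Named arguments: order no longer matters
round(digits = 2, x = v)

# round() does not modify v; assign the result to keep it
rounded <- round(v, 2)
v          # still the unrounded values
rounded    # the rounded copy
```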
Remember that all functions come with help files. Well-written packages will include extensive help files which explain the kinds of inputs the function takes, the purpose of each parameter or “argument” in addition to the inputs, and the kinds of output the function produces. They will also often include examples.
To access a help file, enter ?functionName
into the console or use the Help tab in the bottom-right pane of RStudio.
Sometimes different packages will include functions with the same name. To make sure you are using the function from the right package, you can use the following syntax: packageName::functionName().
By now, you should all have completed the Introduction to R course on DataCamp. Congratulations! You already know quite a lot about how R works. To refresh, you covered the following material in your course:
In these lecture notes, we will review all of the above. In the lecture slides, we will move through these topics quickly, and mostly in the form of exercises, focusing our attention on some new material: reading and writing files, basic data manipulation, and basic data visualization.
A good reference for R Basics is the Base R Cheat Sheet.
Remember, to run a line of code from a script file in RStudio:
- Mac: command + return
- Windows: Ctrl + r
Let’s recall how to do some basic arithmetic in R:
2+2
## [1] 4
5%%2
## [1] 1
3.452*6
## [1] 20.712
2^4
## [1] 16
You can use ?Arithmetic
to pull up the help for arithmetic operators.
As a reminder, functions are called with functionName(parameters), and parameters can be passed by position or by name. Some common mathematical functions:
log(10)
## [1] 2.302585
log(16, base=2)
## [1] 4
log10(10)
## [1] 1
sqrt(10)
## [1] 3.162278
exp(10)
## [1] 22026.47
sin(1)
## [1] 0.841471
1 < 2
## [1] TRUE
TRUE == FALSE
## [1] FALSE
'a' != "Boy" # not equal
## [1] TRUE
Note that character vectors/strings can use single or double quotes.
& is and, | is or, and ! is not:
TRUE & FALSE
## [1] FALSE
!TRUE & FALSE
## [1] FALSE
TRUE | FALSE
## [1] TRUE
(2 > 1) & (3 > 2)
## [1] TRUE
You use these to join together conditions.
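For instance, the sketch below joins two comparisons on a made-up vector of ages (the name `age` and the cutoffs are assumptions for illustration). The operators work element by element, so each position gets its own TRUE or FALSE.

```r
# Hypothetical ages, made up for illustration
age <- c(15, 32, 47, 68)

# TRUE where both conditions hold
working_age <- age >= 18 & age < 65

# TRUE where either condition holds
young_or_old <- age < 18 | age >= 65

working_age
young_or_old
```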
Use the <- operator to assign values to variables. = also works but is bad practice and less common.
The right hand side of the assignment can be any valid R expression. The right hand side is fully evaluated before the assignment occurs.
Variable names can contain letters, numbers, underscores and periods. They cannot start with a number or an underscore, and they should not start with a period in regular use. They cannot contain spaces. Different people use different conventions for long variable names; these include periods.between.words, underscores_between_words, and camelCase.
x <- 4
x
## [1] 4
y <- 3/10
y
## [1] 0.3
x + y
## [1] 4.3
myVariable <- x <- 3 + 4 + 7
Note that when you create a variable in RStudio, it shows up in the environment tab in the top-right pane.
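There are a few types of data in R:
- logical: TRUE or FALSE
- integer: whole numbers, such as 2L
- numeric: real (decimal) numbers, such as 4.3
- complex: such as 2+5i
- character: "text data", denoted with single or double quotes
typeof(TRUE)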
## [1] "logical"
typeof("foo")
## [1] "character"
Vectors store multiple values of a single data type. You can create a vector by combining values with the c()
function.
x<-c(1,2,3,4,5)
x<-1:5
Vectors can only hold one type of value. If you combine several types, every element is converted to the most general type present, following the order logical, integer, numeric, complex, character:
x<-c(TRUE, 2, 4.3)
x
## [1] 1.0 2.0 4.3
x<-c(4, "alpha", TRUE)
x
## [1] "4" "alpha" "TRUE"
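You can also convert between types yourself with the as.* family of base functions. The sketch below is only an illustration with made-up values; it checks the type that c() picks and then converts explicitly.

```r
# Implicit conversion: logical + integer + numeric ends up as numeric (double)
mixed <- c(TRUE, 2L, 4.3)
typeof(mixed)

# Explicit conversion with the as.* functions
nums  <- as.numeric(c("4", "7.5"))   # text that looks like numbers
chars <- as.character(c(1, 2, 3))    # numbers to text
flags <- as.logical(c(0, 1, 5))      # 0 is FALSE, any other number is TRUE

nums
chars
flags
```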
Functions and arithmetic operators can apply to vectors as well:
x <- c(1,2,3,4,5)
x+1
## [1] 2 3 4 5 6
x*2
## [1] 2 4 6 8 10
x*x
## [1] 1 4 9 16 25
log(x)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
x < 5
## [1] TRUE TRUE TRUE TRUE FALSE
Some functions will apply to each element of a vector, but others take a vector as a parameter:
log(x)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
sum(x)
## [1] 15
Vectors are one-dimensional and can’t be nested:
c(c(1,2,3), 4, 5)
## [1] 1 2 3 4 5
Vector indexes (and all other indexes in R) start with 1, not 0:
x <- c('a', 'b', 'c', 'd', 'e')
x[1]
## [1] "a"
You can take “slices” of vectors using index brackets:
x[1:3]
## [1] "a" "b" "c"
Or exclude values with a negative sign:
x[-1]
## [1] "b" "c" "d" "e"
Elements are returned in the order that the indices are supplied:
y <- c(5,1)
x[y]
## [1] "e" "a"
You can use a vector of integers or booleans to select from a vector as well:
x[x<'c']
## [1] "a" "b"
x[c(1,3,5)]
## [1] "a" "c" "e"
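The same logical selection is probably most common with numeric data. A minimal sketch, using a made-up vector `scores` (hypothetical values):

```r
scores <- c(72, 95, 88, 61, 90)   # hypothetical values

# A comparison gives one TRUE/FALSE per element
high <- scores > 85

# Indexing with that logical vector keeps the TRUE positions
top_scores <- scores[high]

# Conditions can be combined inside the brackets
mid_high <- scores[scores > 85 & scores < 95]

top_scores
mid_high
```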
Get the length of a vector with length:
length(x)
## [1] 5
See if a value is in a vector with the %in%
operator:
'b' %in% x
## [1] TRUE
Or get the first position of one or more elements in a vector with the match
function:
match(c('b', 'd', 'k'), x)
## [1] 2 4 NA
Use which
to find all positions:
y <- c('a','b','c','a','b','c')
which(y == 'c')
## [1] 3 6
You can also name the elements of a vector:
x<-1:5
names(x)<-c("Ohio","Illinois","Indiana","Michigan","Wisconsin")
x
## Ohio Illinois Indiana Michigan Wisconsin
## 1 2 3 4 5
Which allows you to select values from the vector using the names:
x["Ohio"]
## Ohio
## 1
x[c("Illinois", "Indiana")]
## Illinois Indiana
## 2 3
Missing values (NA)
Before we move on to other data structures, let’s pause to consider how to deal with missing values in a vector (or, later, a data frame). Missing data in R is encoded as NA. Some functions will ignore NA when doing computations. Others will ignore missing values if you tell them to. Others will process NA and give you a result of NA.
tmp <- c(1,2,5,NA,6,NA,2,5,1,1,NA,5)
You can test for NA (is.na). Or you can get the index location of the missing observations within the vector (useful for later selecting observations in your dataset).
is.na(tmp)
## [1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
## [12] FALSE
which(is.na(tmp))
## [1] 4 6 11
It can also be useful to count the number of NAs in a vector:
sum(is.na(tmp))
## [1] 3
Why does this work? How can you sum logical values? This takes advantage of the trick that TRUE=1 and FALSE=0. The function call tries to convert the logicals to numeric, and this is how the conversion works:
as.numeric(c(TRUE, FALSE))
## [1] 1 0
Remember that different functions treat NA differently. With an input vector that includes NA values, mean results in NA. It has an option to exclude missing:
mean(tmp)
## [1] NA
mean(tmp, na.rm=TRUE)
## [1] 3.111111
table behaves differently. It excludes NA by default. You have to tell it to include NA.
table(tmp)
## tmp
## 1 2 5 6
## 3 2 3 1
table(tmp, useNA = "ifany")
## tmp
## 1 2 5 6 <NA>
## 3 2 3 1 3
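When a function has no option for missing values, one approach is to drop them yourself with is.na(). The sketch below reuses the tmp vector from above; the name `observed` is just an assumption for this example.

```r
tmp <- c(1, 2, 5, NA, 6, NA, 2, 5, 1, 1, NA, 5)

# Keep only the non-missing values
observed <- tmp[!is.na(tmp)]
length(observed)

# Gives the same result as mean(tmp, na.rm = TRUE)
mean(observed)

# Arithmetic and comparisons involving NA give NA,
# which is why is.na() is used instead of == NA
NA + 1
NA == NA
```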
NULL is another special type. NULL is usually used to mean undefined. You might get it when a function can’t compute a result. NULL is a single value and can’t be in a vector. (NAs can be in vectors and data.frames.)
c()
## NULL
c(NULL, NULL)
## NULL
The above somewhat surprisingly gives a single NULL
because of the restrictions on how it’s used.
NULL
should not be used for missing data.
NaN, Inf
NaN means “not a number”.
0/0
## [1] NaN
Inf and -Inf are “infinity” and “negative infinity”.
1/0
## [1] Inf
-1/0
## [1] -Inf
Factors are a special type of vector that can be used for categorical variables. Why would we need them? Consider that the values of character vectors sometimes have an order, and we may want to store this information in R. For example, consider a vector containing month names. When we use table(), R arranges the values in alphabetical order.
months<-c("January","March","February","December","January","March")
table(months)
## months
## December February January March
## 1 1 2 2
The factor()
function converts a vector into a factor. Without supplying any additional information, the function infers the possible “levels” that the vector takes.
months_fac <- factor(months)
levels(months_fac)
## [1] "December" "February" "January" "March"
Factors can be ordered, which is useful when you have categorical variables in your data. Let’s create an ordered factor from the months variable. Using the table() function on the factor, we can see one of the benefits of using factors for categorical variables: the values are ordered in a meaningful way rather than alphabetically.
months_fac <- factor(months, levels=c("January","February","March","December"))
table(months_fac)
## months_fac
## January February March December
## 2 1 2 1
Note that you cannot add values to a factor that are not included as one of the levels.
months_fac[5] <- "April"
The best solution to this is to remake the factor. The factor
function will convert months_fac
in the example below back to character data before creating the new factor.
months_fac <- factor(months_fac, levels=c("January","February","March","April","December"))
months_fac[5] <- "April"
months_fac
## [1] January March February December April March
## Levels: January February March April December
Alternatively, when you create the factor for the first time, you can include all possible levels of the factor. This has the added benefit of producing even more meaningful results when using functions such as table().
months_fac <- factor(months, levels=c("January","February","March","April","May","June","July","August","September","October","November","December"))
table(months_fac)
## months_fac
## January February March April May June July
## 2 1 2 0 0 0 0
## August September October November December
## 0 0 0 0 1
Under the hood, factors are stored as integers, with the (ordered) levels attribute providing information about the character value associated with each integer.
typeof(months_fac)
## [1] "integer"
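To see the integer codes directly, you can convert a factor with as.integer(). The sketch below uses made-up factors (`sizes` and `years` are hypothetical names) and also shows a common pitfall: converting a factor of number-like text gives the codes, not the numbers.

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))

size_codes <- as.integer(sizes)   # position of each value in levels(sizes)
size_codes
levels(sizes)                     # lookup table for the codes
as.character(sizes)               # back to the original text

# Pitfall: a factor whose labels look like numbers
years <- factor(c("2007", "1952", "2007"))
year_codes  <- as.integer(years)                 # codes, not the years
year_values <- as.numeric(as.character(years))   # the actual years
year_codes
year_values
```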
Even if you don’t plan to use categorical data, you should know that factors exist because when reading data into R, text strings can be loaded as factors. This was the default for read.csv() and data.frame() before R 4.0.0.
Lists are a bit like complex vectors. An element of a list can hold any other object, including another list. You can keep multi-dimensional and ragged data in R using lists.
l1 <- list(1, "a", TRUE, 1+4i)
l1
## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 1+4i
l2 <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE )
l2
## $title
## [1] "Research Bazaar"
##
## $numbers
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $data
## [1] TRUE
Indexing lists is a little different. [[1]] is the first element of the list as whatever type it was. [1] is a subset of the list – the first element of the list as a list. You can also access list elements by name using the $ operator.
l2[[2]]
## [1] 1 2 3 4 5 6 7 8 9 10
l2[2]
## $numbers
## [1] 1 2 3 4 5 6 7 8 9 10
l2$numbers
## [1] 1 2 3 4 5 6 7 8 9 10
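One way to check the difference between the two bracket styles is to look at the class of what comes back. The sketch below rebuilds the l2 list from above; the `nested` element is a hypothetical addition for illustration.

```r
l2 <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE)

class(l2[2])       # "list": single brackets return a smaller list
class(l2[[2]])     # "integer": double brackets return the element itself

# Because [[ ]] returns the vector, you can keep working with it
total <- sum(l2[[2]])
third <- l2[["numbers"]][3]
total
third

# Lists can hold other lists; chain $ to reach inside
l2$nested <- list(a = 1, b = "x")
l2$nested$b
```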
Matrices in R are two-dimensional arrays. All values of a matrix must be of the same type. You can initialize a matrix using the matrix()
function.
matrix(c('a', 'b', 'c', 'd'), nrow=2)
## [,1] [,2]
## [1,] "a" "c"
## [2,] "b" "d"
y<-matrix(1:25, nrow=5, byrow=TRUE)
y
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
## [3,] 11 12 13 14 15
## [4,] 16 17 18 19 20
## [5,] 21 22 23 24 25
Matrices are used sparingly in R, primarily for numerical calculations or explicit matrix manipulation. You can attach names to rows and columns.
Matrix algebra functions are available:
y%*%y
## [,1] [,2] [,3] [,4] [,5]
## [1,] 215 230 245 260 275
## [2,] 490 530 570 610 650
## [3,] 765 830 895 960 1025
## [4,] 1040 1130 1220 1310 1400
## [5,] 1315 1430 1545 1660 1775
x<-1:5
y%*%x
## [,1]
## [1,] 55
## [2,] 130
## [3,] 205
## [4,] 280
## [5,] 355
y^-1 # element-wise reciprocal (1/y), not a matrix inverse
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.00000000 0.50000000 0.33333333 0.25000000 0.20000000
## [2,] 0.16666667 0.14285714 0.12500000 0.11111111 0.10000000
## [3,] 0.09090909 0.08333333 0.07692308 0.07142857 0.06666667
## [4,] 0.06250000 0.05882353 0.05555556 0.05263158 0.05000000
## [5,] 0.04761905 0.04545455 0.04347826 0.04166667 0.04000000
y * -1
## [,1] [,2] [,3] [,4] [,5]
## [1,] -1 -2 -3 -4 -5
## [2,] -6 -7 -8 -9 -10
## [3,] -11 -12 -13 -14 -15
## [4,] -16 -17 -18 -19 -20
## [5,] -21 -22 -23 -24 -25
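Arithmetic operators such as * and ^ work element by element on matrices. For a true matrix inverse, base R provides solve(). The sketch below uses a small invertible matrix `m` made up for illustration (the 5x5 matrix y above is singular, so it has no inverse).

```r
# Small invertible matrix (hypothetical values)
m <- matrix(c(2, 1, 1, 3), nrow = 2)

m_inv <- solve(m)   # matrix inverse
m %*% m_inv         # identity matrix, up to rounding error

t(m)                # transpose
m * m               # element-wise product, not a matrix product
m %*% m             # matrix product
```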
Elements in a matrix are indexed like mat[row number, col number]. Omitting a value for row or column will give you the entire column or row, respectively.
y[1,1]
## [1] 1
y[1,]
## [1] 1 2 3 4 5
y[,1]
## [1] 1 6 11 16 21
y[1:2,3:4]
## [,1] [,2]
## [1,] 3 4
## [2,] 8 9
y[,c(1,4)]
## [,1] [,2]
## [1,] 1 4
## [2,] 6 9
## [3,] 11 14
## [4,] 16 19
## [5,] 21 24
Using just a single index will get the element from the specified position, as if the matrix were turned into a vector first:
w<-matrix(5:29, nrow=5)
w[7]
## [1] 11
as.vector(w)[7]
## [1] 11
Data frames are the core data structure in R. A data frame is a list of named vectors with the same length. Columns are typically variables and rows are observations. Different columns can have different types of data:
id<-1:20
id
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
color<-c(rep("red", 3), rep("green",10), rep("blue", 7))
color
## [1] "red" "red" "red" "green" "green" "green" "green" "green"
## [9] "green" "green" "green" "green" "green" "blue" "blue" "blue"
## [17] "blue" "blue" "blue" "blue"
score<-runif(20)
score
## [1] 0.20354493 0.50185548 0.35535068 0.52583579 0.57558433 0.81498953
## [7] 0.83673077 0.62897844 0.21323122 0.69859791 0.34753467 0.90691752
## [13] 0.57466982 0.79328765 0.55604629 0.07937945 0.67997536 0.99449463
## [19] 0.68909116 0.57609978
df<-data.frame(id, color, score)
df
## id color score
## 1 1 red 0.20354493
## 2 2 red 0.50185548
## 3 3 red 0.35535068
## 4 4 green 0.52583579
## 5 5 green 0.57558433
## 6 6 green 0.81498953
## 7 7 green 0.83673077
## 8 8 green 0.62897844
## 9 9 green 0.21323122
## 10 10 green 0.69859791
## 11 11 green 0.34753467
## 12 12 green 0.90691752
## 13 13 green 0.57466982
## 14 14 blue 0.79328765
## 15 15 blue 0.55604629
## 16 16 blue 0.07937945
## 17 17 blue 0.67997536
## 18 18 blue 0.99449463
## 19 19 blue 0.68909116
## 20 20 blue 0.57609978
Instead of making individual objects first, we could do it all together:
df<-data.frame(id=1:20,
color=c(rep("red", 3), rep("green",10), rep("blue", 7)),
score=runif(20))
Data frames can be indexed like matrices to retrieve the values.
df[2,2]
## [1] red
## Levels: blue green red
df[1,]
## id color score
## 1 1 red 0.2482876
df[,3]
## [1] 0.24828757 0.06906664 0.34048913 0.08584947 0.23787852 0.15512797
## [7] 0.31793669 0.33260000 0.22951894 0.06794502 0.30247253 0.75979288
## [13] 0.28629046 0.91565300 0.05544233 0.16628694 0.82199100 0.08954519
## [19] 0.32176353 0.46035658
You can use negative values when indexing to exclude values:
df[,-2]
## id score
## 1 1 0.24828757
## 2 2 0.06906664
## 3 3 0.34048913
## 4 4 0.08584947
## 5 5 0.23787852
## 6 6 0.15512797
## 7 7 0.31793669
## 8 8 0.33260000
## 9 9 0.22951894
## 10 10 0.06794502
## 11 11 0.30247253
## 12 12 0.75979288
## 13 13 0.28629046
## 14 14 0.91565300
## 15 15 0.05544233
## 16 16 0.16628694
## 17 17 0.82199100
## 18 18 0.08954519
## 19 19 0.32176353
## 20 20 0.46035658
df[-1:-10,]
## id color score
## 11 11 green 0.30247253
## 12 12 green 0.75979288
## 13 13 green 0.28629046
## 14 14 blue 0.91565300
## 15 15 blue 0.05544233
## 16 16 blue 0.16628694
## 17 17 blue 0.82199100
## 18 18 blue 0.08954519
## 19 19 blue 0.32176353
## 20 20 blue 0.46035658
You can also use the names of the columns after a $
or in the indexing:
df$color
## [1] red red red green green green green green green green green
## [12] green green blue blue blue blue blue blue blue
## Levels: blue green red
Indexing into a data frame with a single integer or name of the column will give you the column(s) specified as a new data frame.
df['color']
## color
## 1 red
## 2 red
## 3 red
## 4 green
## 5 green
## 6 green
## 7 green
## 8 green
## 9 green
## 10 green
## 11 green
## 12 green
## 13 green
## 14 blue
## 15 blue
## 16 blue
## 17 blue
## 18 blue
## 19 blue
## 20 blue
df[2:3]
## color score
## 1 red 0.24828757
## 2 red 0.06906664
## 3 red 0.34048913
## 4 green 0.08584947
## 5 green 0.23787852
## 6 green 0.15512797
## 7 green 0.31793669
## 8 green 0.33260000
## 9 green 0.22951894
## 10 green 0.06794502
## 11 green 0.30247253
## 12 green 0.75979288
## 13 green 0.28629046
## 14 blue 0.91565300
## 15 blue 0.05544233
## 16 blue 0.16628694
## 17 blue 0.82199100
## 18 blue 0.08954519
## 19 blue 0.32176353
## 20 blue 0.46035658
Instead of index numbers or names, you can also select values by using logical statements. This is usually done when selecting rows.
df[df$color == "green",]
## id color score
## 4 4 green 0.08584947
## 5 5 green 0.23787852
## 6 6 green 0.15512797
## 7 7 green 0.31793669
## 8 8 green 0.33260000
## 9 9 green 0.22951894
## 10 10 green 0.06794502
## 11 11 green 0.30247253
## 12 12 green 0.75979288
## 13 13 green 0.28629046
df[df$score > .5,]
## id color score
## 12 12 green 0.7597929
## 14 14 blue 0.9156530
## 17 17 blue 0.8219910
df[df$score > .5 & df$color == "blue",]
## id color score
## 14 14 blue 0.915653
## 17 17 blue 0.821991
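Because the df above is filled with random scores, here is a small sketch with fixed, made-up values (the data frame `people` is hypothetical) so the results are reproducible. It combines conditions with |, uses %in% to match several values, and counts matching rows.

```r
# Hypothetical data with fixed values
people <- data.frame(id    = 1:5,
                     color = c("red", "green", "blue", "green", "red"),
                     score = c(0.2, 0.9, 0.6, 0.4, 0.7),
                     stringsAsFactors = FALSE)

# Rows whose color is one of several values
red_or_blue <- people[people$color %in% c("red", "blue"), ]

# Rows meeting either condition, keeping only two columns
either <- people[people$score > 0.5 | people$color == "green", c("id", "score")]

# Number of rows meeting a condition (TRUE counts as 1)
n_high <- sum(people$score > 0.5)

red_or_blue
either
n_high
```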
You can assign names to the rows of a data frame as well as to the columns, and then use those names for indexing and selecting data.
rownames(df)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20"
You can add columns or rows simply by assigning values to them. There are also rbind
and cbind
(for row bind and column bind) functions that can be useful.
df$year<-1901:1920
df
## id color score year
## 1 1 red 0.24828757 1901
## 2 2 red 0.06906664 1902
## 3 3 red 0.34048913 1903
## 4 4 green 0.08584947 1904
## 5 5 green 0.23787852 1905
## 6 6 green 0.15512797 1906
## 7 7 green 0.31793669 1907
## 8 8 green 0.33260000 1908
## 9 9 green 0.22951894 1909
## 10 10 green 0.06794502 1910
## 11 11 green 0.30247253 1911
## 12 12 green 0.75979288 1912
## 13 13 green 0.28629046 1913
## 14 14 blue 0.91565300 1914
## 15 15 blue 0.05544233 1915
## 16 16 blue 0.16628694 1916
## 17 17 blue 0.82199100 1917
## 18 18 blue 0.08954519 1918
## 19 19 blue 0.32176353 1919
## 20 20 blue 0.46035658 1920
df[21,]<-list(21, "green", 0.4, 1921)
Note that we had to use a list for adding a row because there are different data types.
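As a sketch of the rbind and cbind alternative, the example below builds a tiny data frame with made-up values (`scores_df` and `new_row` are hypothetical names). The new row is itself a one-row data frame with matching column names.

```r
# Hypothetical starting data
scores_df <- data.frame(id    = 1:3,
                        color = c("red", "green", "blue"),
                        score = c(0.2, 0.5, 0.9),
                        stringsAsFactors = FALSE)

# A new row as a one-row data frame with the same column names
new_row <- data.frame(id = 4, color = "green", score = 0.4,
                      stringsAsFactors = FALSE)

# rbind() stacks rows; cbind() adds columns
scores_df <- rbind(scores_df, new_row)
scores_df <- cbind(scores_df, year = 1901:1904)

scores_df
```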
Before reading or writing files, it’s often useful to set the working directory first so that you don’t have to specify complete file paths.
You can go to the Files tab in the bottom right window in RStudio and find the directory you want. Then under the More menu, there is an option to set the current directory as the working directory. Or you can use the setwd
command like:
setwd("~/training/intror") # ~ stands for your home directory
setwd("/Users/username/Documents/workshop") # mac, absolute path example
setwd("C:/Users/username/Documents/workshop") # windows, absolute path example (use / or \\, because a single \ starts an escape sequence in R strings)
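One way to avoid typing path separators by hand is the base function file.path(), which joins path pieces with a forward slash that R accepts on every platform. A minimal sketch, assuming a data folder and file name like the ones used later in these notes:

```r
# Build a relative path from its pieces
data_file <- file.path("data", "gapminder5.csv")
data_file

# Check whether the file exists relative to the current working directory
# (FALSE simply means it is not there; nothing is read or changed)
file.exists(data_file)
```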
In our case, we are working out of the directory team/bootcamp/2018 in the base directory. So we can set our working directory as follows:
setwd("~/team/bootcamp/2018/R session materials")
To check where your working directory is, use getwd():
getwd()
## [1] "C:/Users/kumar/OneDrive - Northwestern University - Student Advantage/Assistantships & Jobs/MSiA/bootcamp-2019"
Read in a CSV file and save it as a data frame with a name. Below are two examples, using a URL and a local file stored in the working directory respectively:
# Using a URL
schooldata <- read.csv("https://goo.gl/f4UhMX")
# Using a local file
gapminder <- read.csv("data/gapminder5.csv")
You can view the data frames in RStudio using the View()
function.
View(schooldata)
View(gapminder)
You could also use the Import Dataset option in the Environment tab in the top right window in RStudio.
Looking at the help for read.csv, there are a number of different options and different function calls. read.table, read.csv, and read.delim all work in the same basic way and take the same set of arguments, but they have different defaults. Key options to pay attention to include:
- header: whether the first row of the file has the names of the columns
- sep: the separator used in the file (comma, tab (enter as \t), etc.)
- na.strings: how missing data is encoded in your file. “NA” is treated as missing by default; blanks are treated as missing by default in everything but character data.
- stringsAsFactors: whether strings (text data) should be converted to factors or kept as is. Example of this below.
Let’s redo the above with a better set of options:
gapminder <- read.csv("data/gapminder5.csv",
stringsAsFactors=FALSE,
strip.white=TRUE,
na.strings=c("NA", ""))
The option na.strings is needed now because while blanks are treated as missing by default in numeric fields (which includes factors), they aren’t by default missing for character data.
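To see what these options do without needing a file, read.csv() can also parse a string through its text argument. The sketch below uses made-up CSV content (`csv_text` is hypothetical) with a blank name and padded whitespace.

```r
# Hypothetical CSV content: row 2 has a blank name, row 3 has extra spaces
csv_text <- "id,name,score
1,Ann,3.5
2,,NA
3, Bob ,4.0"

d <- read.csv(text = csv_text,
              stringsAsFactors = FALSE,   # keep text as character
              strip.white = TRUE,         # trim spaces around unquoted text
              na.strings = c("NA", ""))   # treat blanks as missing too

d
is.na(d$name)   # the blank name is now NA
str(d)
```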
readr Package
Does all of the above seem annoying or unnecessarily complicated? Others have thought so too.
Look at the readr package (part of the tidyverse), which attempts to smooth over some of the annoyances of reading files into R. The main source of potential problems when using readr functions is that they guess variable types from a subset of the observations, so if you have a strange value further down in your dataset, you might get an error or an unexpected value conversion.
To read in the same data with the same settings as above, using readr (note the similar function name, with _ instead of .):
library(readr)
gapminder <- read_csv("data/gapminder5.csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## year = col_double(),
## pop = col_double(),
## continent = col_character(),
## lifeExp = col_double(),
## gdpPercap = col_double()
## )
Options used above are defaults in readr. You get a long message about the column types.
Learn more at the readr website.
For Stata, SAS, or SPSS files, try the haven
or foreign
packages. For Excel files, use the readxl
package.
The data.table package also has functions for reading in data, which you will learn about on Day 3 of the boot camp. The fread function is relatively fast for reading a rectangular standardized data file into R.
R also has packages for reading other structured files like XML and JSON, or interfacing with databases. For more on using R with databases, see the R section of the Databases workshop materials from NUIT Research Computing Services.
There are also multiple packages that make collecting data from APIs (either in general or specific APIs like the Census Bureau) easier. There are also packages that interface with Google docs/drive and Dropbox, although those APIs change frequently, so beware when using those packages if they haven’t been updated recently.
In the previous section, we imported two datasets. For the rest of today, we will focus on the Gapminder data, which is stored in our environment as gapminder
. To refresh yourself, you can view the data frame in R using the View()
function.
View(gapminder)
You can also see a list of variables using names().
names(gapminder)
## [1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
Other useful functions are dim(), which shows the dimensions of the data frame; str(), which shows the dimensions of the data frame along with the names of variables and the first few values in each variable; nrow() and ncol(), which show the number of rows and columns; and head(), which shows the first few rows of the data frame (6 rows by default).
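If you want to try these functions without the workshop files, the sketch below uses mtcars, a small data frame that ships with R. It is only a stand-in here, not part of the workshop data.

```r
# mtcars is a built-in example data frame
dim(mtcars)            # rows, then columns
nrow(mtcars)
ncol(mtcars)
str(mtcars)            # structure: types and first few values per column

head(mtcars)           # first 6 rows by default
head(mtcars, n = 3)    # ask for a different number of rows
tail(mtcars, n = 3)    # last rows instead of first
```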
When applied to a data frame, the summary()
function provides useful summary statistics for each variable (i.e. column) in the data frame. Let’s try it with the Gapminder data:
summary(gapminder)
## country year pop continent
## Length:1704 Min. :1952 Min. :6.001e+04 Length:1704
## Class :character 1st Qu.:1966 1st Qu.:2.794e+06 Class :character
## Mode :character Median :1980 Median :7.024e+06 Mode :character
## Mean :1980 Mean :2.960e+07
## 3rd Qu.:1993 3rd Qu.:1.959e+07
## Max. :2007 Max. :1.319e+09
## lifeExp gdpPercap
## Min. :23.60 Min. : 241.2
## 1st Qu.:48.20 1st Qu.: 1202.1
## Median :60.71 Median : 3531.8
## Mean :59.47 Mean : 7215.3
## 3rd Qu.:70.85 3rd Qu.: 9325.5
## Max. :82.60 Max. :113523.1
We can also use functions like mean(), median(), var(), sd(), and quantile() to calculate other summary statistics for individual variables. For example, let’s calculate the mean of life expectancy. Recall that we can use the $ operator to call up a variable within a data frame using its name.
mean(gapminder$lifeExp)
## [1] 59.47444
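The other summary functions follow the same pattern. The sketch below uses a short made-up vector (`life` is hypothetical) with one missing value, since these functions return NA unless na.rm = TRUE is set.

```r
# Hypothetical values with one missing observation
life <- c(50, 60, 70, 80, NA)

med <- median(life, na.rm = TRUE)
q   <- quantile(life, probs = c(0.25, 0.75), na.rm = TRUE)

med
q
sd(life, na.rm = TRUE)
range(life, na.rm = TRUE)   # minimum and maximum

median(life)                # NA, because the missing value is not removed
```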
A useful way to examine a discrete or categorical variable is to use a frequency table. These are easy to make in R, using the table()
function:
table(gapminder$continent)
##
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
prop.table() takes the output of table() and shows the proportion of rows in each category:
prop.table(table(gapminder$continent))
##
## Africa Americas Asia Europe Oceania
## 0.36619718 0.17605634 0.23239437 0.21126761 0.01408451
You can generate a frequency table with more than one variable as well:
table(gapminder$continent, gapminder$year)
##
## 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## Africa 52 52 52 52 52 52 52 52 52 52 52 52
## Americas 25 25 25 25 25 25 25 25 25 25 25 25
## Asia 33 33 33 33 33 33 33 33 33 33 33 33
## Europe 30 30 30 30 30 30 30 30 30 30 30 30
## Oceania 2 2 2 2 2 2 2 2 2 2 2 2
Notice that each row in the data frame represents one country in a given year. Perhaps we are interested in analyzing only data from one year. To do this, we will have to “subset” our data frame to include only those rows that we want to keep.
The subset()
function lets you select rows and columns you want to keep. You can either name columns or rows, or include a logical statement such that only rows/columns where the statement is true are retained.
subset(data.frame,
subset=condition indicating rows to keep,
select=condition indicating columns to keep)
For example, let’s create a new data frame containing only 2007 data by subsetting the original data frame.
gapminder07 <- subset(gapminder, subset = year==2007)
Look at the number of rows in the new data frame: it is only 142, whereas the original data frame has 1704 rows.
nrow(gapminder07)
## [1] 142
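The select argument works alongside the row condition. The sketch below uses a tiny made-up data frame (`countries` and its values are hypothetical) to keep rows and columns in one call, and shows the bracket-indexing equivalent.

```r
# Hypothetical data, not the Gapminder file
countries <- data.frame(country = c("A", "B", "C", "D"),
                        year    = c(2002, 2007, 2007, 2007),
                        lifeExp = c(60, 72, 48, 81),
                        stringsAsFactors = FALSE)

# Keep 2007 rows with lifeExp above 70, and only two columns
recent_high <- subset(countries,
                      subset = year == 2007 & lifeExp > 70,
                      select = c(country, lifeExp))
recent_high

# Equivalent selection with bracket indexing
same_rows <- countries[countries$year == 2007 & countries$lifeExp > 70,
                       c("country", "lifeExp")]
same_rows
```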
The sort()
function reorders elements, in ascending order by default. You can flip the order by using the decreasing = TRUE
argument.
sort(gapminder07$lifeExp)
## [1] 39.613 42.082 42.384 42.568 42.592 42.731 43.487 43.828 44.741 45.678
## [11] 46.242 46.388 46.462 46.859 48.159 48.303 48.328 49.339 49.580 50.430
## [21] 50.651 50.728 51.542 51.579 52.295 52.517 52.906 52.947 54.110 54.467
## [31] 54.791 55.322 56.007 56.728 56.735 56.867 58.040 58.420 58.556 59.443
## [41] 59.448 59.545 59.723 60.022 60.916 62.069 62.698 63.062 63.785 64.062
## [51] 64.164 64.698 65.152 65.483 65.528 65.554 66.803 67.297 69.819 70.198
## [61] 70.259 70.616 70.650 70.964 71.164 71.338 71.421 71.688 71.752 71.777
## [71] 71.878 71.993 72.235 72.301 72.390 72.396 72.476 72.535 72.567 72.777
## [81] 72.801 72.889 72.899 72.961 73.005 73.338 73.422 73.747 73.923 73.952
## [91] 74.002 74.143 74.241 74.249 74.543 74.663 74.852 74.994 75.320 75.537
## [101] 75.563 75.635 75.640 75.748 76.195 76.384 76.423 76.442 76.486 77.588
## [111] 77.926 78.098 78.242 78.273 78.332 78.400 78.553 78.623 78.746 78.782
## [121] 78.885 79.313 79.406 79.425 79.441 79.483 79.762 79.829 79.972 80.196
## [131] 80.204 80.546 80.653 80.657 80.745 80.884 80.941 81.235 81.701 81.757
## [141] 82.208 82.603
sort(gapminder07$lifeExp, decreasing=TRUE)
## [1] 82.603 82.208 81.757 81.701 81.235 80.941 80.884 80.745 80.657 80.653
## [11] 80.546 80.204 80.196 79.972 79.829 79.762 79.483 79.441 79.425 79.406
## [21] 79.313 78.885 78.782 78.746 78.623 78.553 78.400 78.332 78.273 78.242
## [31] 78.098 77.926 77.588 76.486 76.442 76.423 76.384 76.195 75.748 75.640
## [41] 75.635 75.563 75.537 75.320 74.994 74.852 74.663 74.543 74.249 74.241
## [51] 74.143 74.002 73.952 73.923 73.747 73.422 73.338 73.005 72.961 72.899
## [61] 72.889 72.801 72.777 72.567 72.535 72.476 72.396 72.390 72.301 72.235
## [71] 71.993 71.878 71.777 71.752 71.688 71.421 71.338 71.164 70.964 70.650
## [81] 70.616 70.259 70.198 69.819 67.297 66.803 65.554 65.528 65.483 65.152
## [91] 64.698 64.164 64.062 63.785 63.062 62.698 62.069 60.916 60.022 59.723
## [101] 59.545 59.448 59.443 58.556 58.420 58.040 56.867 56.735 56.728 56.007
## [111] 55.322 54.791 54.467 54.110 52.947 52.906 52.517 52.295 51.579 51.542
## [121] 50.728 50.651 50.430 49.580 49.339 48.328 48.303 48.159 46.859 46.462
## [131] 46.388 46.242 45.678 44.741 43.828 43.487 42.731 42.592 42.568 42.384
## [141] 42.082 39.613
The order()
function gives you the index positions in sorted order:
order(gapminder07$lifeExp)
## [1] 122 87 141 113 74 4 142 1 22 75 108 53 28 95 117 78 31
## [18] 118 18 20 23 14 133 41 17 127 89 43 69 80 36 29 52 11
## [35] 46 94 42 129 121 77 47 62 19 49 54 88 140 111 90 9 81
## [52] 59 27 98 109 12 84 70 130 55 51 128 60 61 86 39 101 102
## [69] 100 132 40 73 37 3 15 120 107 68 66 110 82 26 93 25 16
## [86] 57 139 137 131 76 112 125 79 138 85 115 13 38 5 99 103 8
## [103] 97 32 83 136 2 106 34 72 116 104 135 33 35 126 24 71 105
## [120] 30 63 44 48 134 10 50 91 7 114 96 92 65 21 45 64 123
## [137] 119 6 124 58 56 67
order()
is useful for arranging data frames. Combined with head()
, which shows the first 6 rows of a data frame by default, we can use this to view the rows of the data frame with the highest life expectancy:
head(gapminder07[order(gapminder07$lifeExp, decreasing=TRUE),])
## # A tibble: 6 x 6
## country year pop continent lifeExp gdpPercap
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Japan 2007 127467972 Asia 82.6 31656.
## 2 Hong Kong China 2007 6980412 Asia 82.2 39725.
## 3 Iceland 2007 301931 Europe 81.8 36181.
## 4 Switzerland 2007 7554661 Europe 81.7 37506.
## 5 Australia 2007 20434176 Oceania 81.2 34435.
## 6 Spain 2007 40448191 Europe 80.9 28821.
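order() also accepts several sort keys, which is handy for sorting within groups. A small sketch, again using a made-up data frame (toy is hypothetical):

```r
# Made-up data: two continents, two countries each
toy <- data.frame(
  continent = c("Asia", "Europe", "Asia", "Europe"),
  country   = c("A", "B", "C", "D"),
  lifeExp   = c(70.5, 80.1, 48.9, 77.3)
)

# Sort by continent (ascending), then by life expectancy (descending).
# Negating a numeric column reverses its order within each continent.
sorted <- toy[order(toy$continent, -toy$lifeExp), ]
sorted
```

The negation trick only works for numeric columns; for other types, the decreasing argument applies to all keys at once.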
Sorting a table is often useful. For example:
sort(table(gapminder07$continent))
##
## Oceania Americas Europe Asia Africa
## 2 25 30 33 52
You can add variables to a data frame in several ways. Here, we will show two standard methods using base R. On Day 3, you will learn about alternatives using the data.table
and dplyr
approaches.
To demonstrate, let’s first create a vector with the same number of values as the number of rows in the data frame. If you want to learn what is going on in this code, look at the help file for the function sample()
.
newvar <- sample(1:5000, 1704, replace = FALSE)
You can add a variable/column by using the cbind()
function:
gapminder <- cbind(gapminder, newvar)
You can also add a variable/column by assigning it to the data frame directly:
gapminder$newvar <- newvar
To remove a variable/column from a data frame, you can assign a NULL
value to the variable:
gapminder$newvar <- NULL
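You can also remove a variable/column by excluding it when indexing the data frame. A minus sign only works with numeric positions, not with column names, so compare the names instead:
gapminder <- gapminder[, names(gapminder) != "newvar"]
gapminder <- gapminder[, !(names(gapminder) %in% c("newvar"))]
# The second method is equivalent to the first, but can be used to remove multiple columns at the same time.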
To add rows, you can use the function rbind()
. Remember that rows may include different data types, in which case you would need to use the function list()
.
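As a rough illustration of adding a row with mixed data types, here is a sketch on a made-up data frame (toy and new_row are hypothetical names):

```r
# Made-up data frame with a character column and two numeric columns
toy <- data.frame(country = c("A", "B"),
                  year    = c(2007, 2007),
                  lifeExp = c(70.5, 80.1),
                  stringsAsFactors = FALSE)

# A list can hold different types, one element per column
new_row <- list(country = "C", year = 2007, lifeExp = 48.9)

# Append the new row; column types are preserved
toy <- rbind(toy, new_row)
str(toy)
```

Using c() instead of list() here would coerce every value to character, which is why list() is the safer choice for a mixed-type row.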
To recode a variable, you could make a new column, or overwrite the existing one entirely. For example, let’s create a new variable for life expectancy containing rounded values, using the round()
function.
gapminder07$lifeExp_rounded <- round(gapminder07$lifeExp)
If you just want to replace part of a column (or vector), you can assign to a subset. For example, let’s say we want to create a new variable that marks all cases where life expectancy is higher than the mean as “High” and those where it is lower than the mean as “Low”.
# Start by creating a new variable with all missing values
gapminder07$lifeExp_highlow <- NA
# Replace higher-than-mean values with "High"
gapminder07$lifeExp_highlow[gapminder07$lifeExp>mean(gapminder07$lifeExp)] <- "High"
# Replace lower-than-mean values with "Low"
gapminder07$lifeExp_highlow[gapminder07$lifeExp<mean(gapminder07$lifeExp)] <- "Low"
There’s also a recode()
function in the dplyr
package, in which you specify each reassignment as an old = new pair. For example, let’s create a new variable with abbreviated continent names.
library(dplyr)
gapminder07$continent_abrv <- recode(gapminder07$continent,
`Africa`="AF",
`Americas`="AM",
`Asia`="AS",
`Europe`="EU",
`Oceania`="OC")
table(gapminder07$continent_abrv)
##
## AF AM AS EU OC
## 52 25 33 30 2
We will return to recode()
and other dplyr
functions on Day 3. The ifelse()
function, covered in Day 2, is also useful for recoding.
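For instance, the High/Low recoding shown earlier could be sketched with ifelse() in a single step. The vector below is made up for illustration:

```r
# Made-up life expectancy values
lifeExp <- c(48.9, 70.5, 80.1, 61.2)

# ifelse(test, value_if_true, value_if_false) works element by element
highlow <- ifelse(lifeExp > mean(lifeExp), "High", "Low")
highlow
```

One difference from the subset-assignment approach: a value exactly equal to the mean is labeled "Low" here rather than left as NA.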
To compute summary statistics by groups in the data, one option is to use the aggregate()
function. For example, we can calculate the mean of life expectancy for each continent:
aggregate(gapminder07$lifeExp ~ gapminder07$continent, FUN=mean)
## gapminder07$continent gapminder07$lifeExp
## 1 Africa 54.80604
## 2 Americas 73.60812
## 3 Asia 70.72848
## 4 Europe 77.64860
## 5 Oceania 80.71950
The ~
operator can be read as “by” or “as a function of”, and is used in many contexts. A construction such as y ~ x1 + x2
is referred to as a formula.
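When a formula is used, aggregate() also accepts a data argument, so the data frame name does not have to be repeated and the output columns get cleaner names. A minimal sketch with made-up values (toy is hypothetical):

```r
# Made-up data: two continents, two observations each
toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  lifeExp   = c(70, 50, 80, 78)
)

# Mean life expectancy by continent, using bare column names
means <- aggregate(lifeExp ~ continent, data = toy, FUN = mean)
means
```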
We can also aggregate by two variables. For example, let’s use the original Gapminder data (not just the 2007 data) and aggregate by continent and year.
aggregate(gapminder$lifeExp ~ gapminder$year + gapminder$continent, FUN=mean)
## gapminder$year gapminder$continent gapminder$lifeExp
## 1 1952 Africa 39.13550
## 2 1957 Africa 41.26635
## 3 1962 Africa 43.31944
## 4 1967 Africa 45.33454
## 5 1972 Africa 47.45094
## 6 1977 Africa 49.58042
## 7 1982 Africa 51.59287
## 8 1987 Africa 53.34479
## 9 1992 Africa 53.62958
## 10 1997 Africa 53.59827
## 11 2002 Africa 53.32523
## 12 2007 Africa 54.80604
## 13 1952 Americas 53.27984
## 14 1957 Americas 55.96028
## 15 1962 Americas 58.39876
## 16 1967 Americas 60.41092
## 17 1972 Americas 62.39492
## 18 1977 Americas 64.39156
## 19 1982 Americas 66.22884
## 20 1987 Americas 68.09072
## 21 1992 Americas 69.56836
## 22 1997 Americas 71.15048
## 23 2002 Americas 72.42204
## 24 2007 Americas 73.60812
## 25 1952 Asia 46.31439
## 26 1957 Asia 49.31854
## 27 1962 Asia 51.56322
## 28 1967 Asia 54.66364
## 29 1972 Asia 57.31927
## 30 1977 Asia 59.61056
## 31 1982 Asia 62.61794
## 32 1987 Asia 64.85118
## 33 1992 Asia 66.53721
## 34 1997 Asia 68.02052
## 35 2002 Asia 69.23388
## 36 2007 Asia 70.72848
## 37 1952 Europe 64.40850
## 38 1957 Europe 66.70307
## 39 1962 Europe 68.53923
## 40 1967 Europe 69.73760
## 41 1972 Europe 70.77503
## 42 1977 Europe 71.93777
## 43 1982 Europe 72.80640
## 44 1987 Europe 73.64217
## 45 1992 Europe 74.44010
## 46 1997 Europe 75.50517
## 47 2002 Europe 76.70060
## 48 2007 Europe 77.64860
## 49 1952 Oceania 69.25500
## 50 1957 Oceania 70.29500
## 51 1962 Oceania 71.08500
## 52 1967 Oceania 71.31000
## 53 1972 Oceania 71.91000
## 54 1977 Oceania 72.85500
## 55 1982 Oceania 74.29000
## 56 1987 Oceania 75.32000
## 57 1992 Oceania 76.94500
## 58 1997 Oceania 78.19000
## 59 2002 Oceania 79.74000
## 60 2007 Oceania 80.71950
Now that we have a dataset … we can do statistics! You will learn more about particular statistical models and methods over the course of your program. For now, let’s do some basic things to get a feel for how R handles statistical analysis.
You can use the cor()
function to calculate correlation (Pearson’s r):
cor(gapminder07$lifeExp, gapminder07$gdpPercap)
## [1] 0.6786624
You can also find the covariance:
cov(gapminder07$lifeExp, gapminder07$gdpPercap)
## [1] 105368
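The two measures are closely related: Pearson’s r is the covariance divided by the product of the two standard deviations. A quick check on made-up vectors (x and y are hypothetical):

```r
# Made-up vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)

# Correlation computed by hand from the covariance
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Compare with the built-in function, allowing for floating-point error
all.equal(r_manual, cor(x, y))
```

This is why the covariance depends on the units of the variables while the correlation always falls between -1 and 1.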
Do countries with high or low life expectancy have different GDP per capita? Apart from simply comparing the means for the two groups, we can use a t-test to evaluate whether the difference between these means is statistically significant.
t.test(gapminder07$gdpPercap~gapminder07$lifeExp_highlow)
##
## Welch Two Sample t-test
##
## data: gapminder07$gdpPercap by gapminder07$lifeExp_highlow
## t = 10.564, df = 95.704, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 12674.02 18539.14
## sample estimates:
## mean in group High mean in group Low
## 17944.685 2338.104
Remember: you can read ~
as “as a function of”. So the above code reads “GDP per capita as a function of life expectancy”, meaning grouped by or explained by.
We don’t have to use the formula syntax. We can specify data for two different groups. Let’s see if GDP per capita is different when comparing the Americas and Asia.
t.test(gapminder07$gdpPercap[gapminder07$continent=="Asia"], gapminder07$gdpPercap[gapminder07$continent=="Americas"])
##
## Welch Two Sample t-test
##
## data: gapminder07$gdpPercap[gapminder07$continent == "Asia"] and gapminder07$gdpPercap[gapminder07$continent == "Americas"]
## t = 0.46849, df = 55.535, p-value = 0.6413
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4816.823 7756.813
## sample estimates:
## mean of x mean of y
## 12473.03 11003.03
By storing the output of the t-test (which is a list) as its own object, we can access different parts of the results.
t1 <- t.test(gapminder07$gdpPercap~gapminder07$lifeExp_highlow)
names(t1)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
t1$p.value
## [1] 9.507438e-18
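Other components can be pulled out the same way. The sketch below uses simulated data (toy, tt, and the group means are made up for illustration) to extract the group means and the confidence interval:

```r
# Simulated data: two groups of 10 values with different means
set.seed(1)
toy <- data.frame(
  group = rep(c("High", "Low"), each = 10),
  value = c(rnorm(10, mean = 10), rnorm(10, mean = 5))
)

tt <- t.test(value ~ group, data = toy)

# Group means (a named numeric vector)
tt$estimate

# Confidence interval for the difference in means
tt$conf.int

# The observed difference lies inside that interval
diff_means <- unname(tt$estimate[1] - tt$estimate[2])
diff_means
```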
Of course, the two life expectancy “groups” we used above to conduct a t-test are based on a continuous variable indicating life expectancy. We may be more interested in whether this variable predicts GDP per capita rather than the two “groups” that we created using an arbitrary threshold.
The basic syntax for a linear regression is shown below. Note that instead of repeating df$variablename
several times, we can indicate the data frame name using the data =
argument and simply use variable names.
lm(y ~ x1 + x2 + x3, data=df_name)
Example:
lm(gdpPercap ~ lifeExp, data=gapminder07)
##
## Call:
## lm(formula = gdpPercap ~ lifeExp, data = gapminder07)
##
## Coefficients:
## (Intercept) lifeExp
## -36759.4 722.9
The default output isn’t much. You get a lot more with summary()
:
r1 <- lm(gdpPercap ~ lifeExp, data=gapminder07)
summary(r1)
##
## Call:
## lm(formula = gdpPercap ~ lifeExp, data = gapminder07)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14473 -7840 -2145 6159 28143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36759.42 4501.25 -8.166 1.67e-13 ***
## lifeExp 722.90 66.12 10.933 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9479 on 140 degrees of freedom
## Multiple R-squared: 0.4606, Adjusted R-squared: 0.4567
## F-statistic: 119.5 on 1 and 140 DF, p-value: < 2.2e-16
Note that a constant (Intercept) term was added automatically.
Let’s try another regression with two independent variables. This time, we will predict life expectancy as a function of GDP per capita and population.
r2 <- lm(lifeExp ~ gdpPercap + pop, data=gapminder07)
summary(r2)
##
## Call:
## lm(formula = lifeExp ~ gdpPercap + pop, data = gapminder07)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.496 -6.119 1.899 7.018 13.383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.921e+01 1.040e+00 56.906 <2e-16 ***
## gdpPercap 6.416e-04 5.818e-05 11.029 <2e-16 ***
## pop 7.001e-09 5.068e-09 1.381 0.169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.87 on 139 degrees of freedom
## Multiple R-squared: 0.4679, Adjusted R-squared: 0.4602
## F-statistic: 61.11 on 2 and 139 DF, p-value: < 2.2e-16
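Like the t-test output, a fitted model is an object whose parts can be extracted. As a sketch on made-up data (toy and fit are hypothetical, and y is constructed as exactly 1 + 2x so the result is easy to check), coef() returns the estimated coefficients and predict() applies the model to new data:

```r
# Made-up data where y = 1 + 2x exactly
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(3, 5, 7, 9, 11))

fit <- lm(y ~ x, data = toy)

# Named vector of coefficients: (Intercept) and x
coef(fit)

# Predicted value of y for a new observation with x = 6
pred <- predict(fit, newdata = data.frame(x = 6))
pred
```

The column names in newdata must match the variable names used in the formula.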
You will often want to save your work in R as well. There are a few different ways to save:
We imported the gapminder
data earlier in CSV format, and manipulated it in several ways: we subsetted the 2007 data and added the variables lifeExp_rounded
, lifeExp_highlow
, and continent_abrv
.
The best method for making your workflow and analysis reproducible is to write any data sets you create to plain text files.
Let’s try to save our subsetted and manipulated gapminder07
data frame as a CSV. To write a CSV, there are write.csv
and write.table
functions, similar to their read
counterparts. The one trick is that you usually do not want to write row names.
write.csv(gapminder07, file="data/gapminder_2007_edited.csv",
row.names=FALSE)
Or, using the readr
package’s equivalent:
write_csv(gapminder07, "data/gapminder_2007_edited.csv")
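To see why row.names=FALSE matters, here is a small round-trip sketch. It writes a made-up data frame (toy is hypothetical) to temporary files rather than to the data folder:

```r
# Made-up data frame
toy <- data.frame(country = c("A", "B"), lifeExp = c(70.5, 80.1))

# Write without row names, then read the file back
path <- tempfile(fileext = ".csv")
write.csv(toy, file = path, row.names = FALSE)
back <- read.csv(path, stringsAsFactors = FALSE)
names(back)

# With the default (row.names = TRUE), an extra column of row names
# is written and comes back as an additional column when re-read
path2 <- tempfile(fileext = ".csv")
write.csv(toy, file = path2)
back2 <- read.csv(path2, stringsAsFactors = FALSE)
ncol(back2)
```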
You can use the save
function to save multiple objects together in a file. The standard file extension to use is .RData
. Example:
save(schooldata, gapminder,
file = "workshopobjects.RData")
To later load in saved data, use the load
function:
load("workshopobjects.RData")
This can be useful if you’re working with multiple objects and want to be able to pick up your work easily later. But .RData
files generally aren’t portable to other programs, so think of them only as internal R working files – not the format you want to keep data in long-term. Loading a .RData
file will overwrite objects with the same name already in the environment.
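One way to avoid overwriting, sketched here with a made-up object and a temporary file (x, e, and path are hypothetical), is to load the file into a separate environment using the envir argument of load():

```r
# Save an object to a temporary .RData file
path <- tempfile(fileext = ".RData")
x <- "saved value"
save(x, file = path)

# Change the object in the current session
x <- "current value"

# Load the saved copy into its own environment instead of the workspace
e <- new.env()
loaded_names <- load(path, envir = e)

loaded_names  # names of the objects that were restored
e$x           # the saved copy
x             # the current object is untouched
```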
You can also save all the objects in your environment by using the save.image()
function, or by clicking the “Save” icon in the Environment pane in RStudio.
We will spend a lot more time later in the boot camp on data visualization, but today we will briefly introduce some functions for visualization that are included in base R. These functions are useful to quickly visualize data in early phases of analysis, and their syntax is often incorporated into other packages. For more advanced and aesthetically pleasing data visualization, you will want to use the ggplot2
package, which we will go over in detail on Day 3.
Histograms are a simple and useful way to visualize the distribution of a variable. For example, let’s plot a histogram of life expectancy from the gapminder07
data frame:
hist(gapminder07$lifeExp)
By reading the help file for the hist()
function, we can identify several arguments that can change the aesthetics of the plot. The breaks =
argument suggests the number of bins along the x-axis (when given a single number, R treats it as a suggestion and picks “pretty” break points).
hist(gapminder07$lifeExp, breaks=20,
main="Life expectancy (2007 data)", ylab="Frequency", xlab="Life expectancy")
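For exact control over the bins, breaks can also be given as a vector of break points. The sketch below uses made-up values (x is hypothetical) and plot = FALSE, so hist() returns the bin counts without drawing anything:

```r
# Made-up values
x <- c(1, 1.5, 2.5, 3, 3.5, 7, 9.5)

# Explicit break points at 0, 2, 4, 6, 8, 10
h <- hist(x, breaks = seq(0, 10, by = 2), plot = FALSE)

h$breaks  # the break points actually used
h$counts  # number of values falling in each bin
```

The break points must span the full range of the data; otherwise hist() stops with an error.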
The simplest way to plot the relationship between two variables is a scatterplot. If you provide two variables to the plot()
function in R, it produces a scatterplot. Let’s try it with life expectancy and GDP per capita in the gapminder07
data frame. Recall that ~
means “as a function of”, so we will put the y-axis variable on the left and the x-axis variable on the right.
plot(gapminder07$lifeExp ~ gapminder07$gdpPercap)
Again, we can add a title and axis labels:
plot(gapminder07$lifeExp ~ gapminder07$gdpPercap, main="Life expectancy as a function of GDP per capita (2007 data)", xlab="GDP per capita", ylab="Life expectancy")
Perhaps we want to add a line indicating the mean value of life expectancy. We can do this by using the abline()
function to add a line after creating a plot. Adding multiple layers to a plot is much more intuitive and flexible with the ggplot2
package, which we will explore on Day 3.
plot(gapminder07$lifeExp ~ gapminder07$gdpPercap, main="Life expectancy as a function of GDP per capita (2007 data)", xlab="GDP per capita", ylab="Life expectancy")
abline(h = mean(gapminder07$lifeExp))
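abline() can also draw a fitted regression line when it is given an lm object. A minimal sketch on made-up data (toy and fit are hypothetical):

```r
# Made-up data
toy <- data.frame(gdp  = c(1, 2, 3, 4, 5),
                  life = c(50, 58, 61, 70, 74))

fit <- lm(life ~ gdp, data = toy)

# Scatterplot, using the data argument with a formula
plot(life ~ gdp, data = toy,
     xlab = "GDP per capita (made-up units)", ylab = "Life expectancy")

# Dashed regression line taken from the fitted model
abline(fit, lty = 2)

# Solid horizontal line at the mean, as in the example above
abline(h = mean(toy$life))
```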