You will hear these terms often this week:
Some of you may have completed the Introduction to R course on DataCamp that was recommended. Congratulations! You already know quite a lot about how R works. As we continue, please help your classmates who did not have a chance to complete the course.
The DataCamp course covered the following material:
We will be working on some exercises and using data files in the next few hours. To do this, you will need to fork and clone the bootcamp-2019 GitHub repository.
Let’s walk through how to fork the repo to your GitHub account, and then clone it locally on your computer using the Git workflow within RStudio.
<- is the assignment operator
+, -, /, *, and ^ are the arithmetic operators
y <- 5
y
## [1] 5
y + 2
## [1] 7
log(10) results in 2.302585 (the natural log)
log(10, base=10) sets the base argument explicitly
?log opens the help file for log()
log(10)
## [1] 2.302585
log(10, base=10)
## [1] 1
Open up the file day1_exercises.R in RStudio.
Do the following tasks:
x
log()
here)
x <- 5
x*3
## [1] 15
log(x*3)
## [1] 2.70805
log(x*3)-4
## [1] -1.29195
(log(x*3)-4)^2
## [1] 1.669134
== (is equal)
!= (not equal)
< (less than), > (greater than)
& (and), | (or)
Comparisons return TRUE or FALSE
& is TRUE only when both sides are TRUE; | is TRUE when at least one side is TRUE
1 > 2
## [1] FALSE
(1 + 1) == 2
## [1] TRUE
"eat" != 'drink'
## [1] TRUE
(1==1) & (1==2)
## [1] FALSE
(1==1) | (1==2)
## [1] TRUE
Load the tidyverse package using the function library() (run install.packages("tidyverse") first if you have not done so already)
Open the help file for recode (recall what ? does)
library(tidyverse)
?recode
Remember, if you need to install a package:
install.packages("tidyverse")
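Logical: TRUE or FALSE (when treated as numbers, TRUE = 1 and FALSE = 0)
Complex: 2+5i
Character: "text data", denoted with quotes ('single quotes' and "double quotes" both work)
Use typeof() on an object to check its type
Create a vector with the c() function
Index a vector by position (vec[1]), by a range of positions (vec[1:3]), or by a logical condition (vec[vec>2]).
example <- c(7,8,9)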
example[2]
## [1] 8
example[example>7]
## [1] 8 9
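The points above about types and typeof() can be checked directly in the console. The following is a minimal sketch; the object names true_count and mixed are made up for illustration.

```r
# typeof() reports the storage type of each value
typeof(TRUE)          # "logical"
typeof(2.5)           # "double"
typeof(2L)            # "integer" (the L suffix requests an integer)
typeof(2+5i)          # "complex"
typeof("text data")   # "character"

# Logical values behave as 1 and 0 in arithmetic
true_count <- TRUE + TRUE + FALSE
true_count            # 2

# A vector holds a single type, so mixed inputs are coerced to a common one
mixed <- c(1, "a", TRUE)
typeof(mixed)         # "character"
```

Because of this coercion, a numeric column that contains even one stray text value will be read as character.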
Return to the exercise file and complete the following tasks:
x1
and x2
x1
[]
set.seed(1234)
x1 <- rnorm(5)
x2 <- rnorm(20, mean=0.5)
x1[3]
## [1] 1.084441
x1[x1 < 0]
## [1] -1.207066 -2.345698
x2[x2 > 1]
## [1] 1.006056 1.459494 2.915835
x3 <- x2[1:5]
x1[-3]
## [1] -1.2070657 0.2774292 -2.3456977 0.4291247
Variables in datasets sometimes include missing values. In R, missing values are stored as NA. Vectors containing any data type can contain missing values. Functions deal with missing values differently, and sometimes require arguments to specify how to deal with missing values.
vec <- c(1, 8, NA, 7, 3)
mean(vec)
## [1] NA
mean(vec, na.rm=TRUE)
## [1] 4.75
You can check if a vector contains missing values with the function is.na(). Since this returns a logical vector, you can use sum() or mean() on the result to count the number or proportion of TRUE values.
is.na(vec)
## [1] FALSE FALSE TRUE FALSE FALSE
sum(is.na(vec))
## [1] 1
mean(is.na(vec))
## [1] 0.2
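A few other base R tools are handy once you know where the missing values are. This sketch reuses the vec example from above; vec_complete is a name introduced here for illustration.

```r
vec <- c(1, 8, NA, 7, 3)

# Is there at least one missing value?
anyNA(vec)                 # TRUE

# Position(s) of the missing values
which(is.na(vec))          # 3

# Keep only the non-missing elements
vec_complete <- vec[!is.na(vec)]
vec_complete               # 1 8 7 3

# Many summary functions accept na.rm = TRUE
sum(vec, na.rm = TRUE)     # 19
```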
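Factors are created with factor(), which includes an argument for levels =
List elements are accessed with double brackets, e.g. [[1]]. Single brackets will return the element as a list.
Matrices are created with the matrix() function.
Matrix elements are indexed as mat[row no, col no].
mat <- matrix(data=c(1,2,3,4,5,6,11,12,34), ncol=3)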
mat
## [,1] [,2] [,3]
## [1,] 1 4 11
## [2,] 2 5 12
## [3,] 3 6 34
mat[1,]
## [1] 1 4 11
mat[1,3]
## [1] 11
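Factors and lists are mentioned above without an example, so here is a small sketch. The objects sizes and info are invented for illustration.

```r
# A factor stores categories; levels = fixes their order
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))
levels(sizes)        # "small" "medium" "large"
as.integer(sizes)    # 1 3 2 1 (positions within the levels)

# A list can hold elements of different types and lengths
info <- list(name = "Biden", polls = c(26, 17))

# Double brackets return the element itself
info[[2]]            # 26 17
class(info[[2]])     # "numeric"

# Single brackets return a list of length one
class(info[2])       # "list"
```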
Data frames are created with data.frame(), or by combining vectors with cbind() or rbind().
Data frame elements are indexed as df[row no, col no].
Variables are accessed with df$var or df["var"].
df <- data.frame(candidate=c("Biden","Warren","Sanders"),
poll=c(26,17,17),
age=c(76,70,78))
df
## candidate poll age
## 1 Biden 26 76
## 2 Warren 17 70
## 3 Sanders 17 78
df[1,3]
## [1] 76
df$age
## [1] 76 70 78
df$age[df$candidate=="Biden"]
## [1] 76
df$poll_max <- df$poll+3
df
## candidate poll age poll_max
## 1 Biden 26 76 29
## 2 Warren 17 70 20
## 3 Sanders 17 78 20
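The two ways of accessing a variable do not return the same kind of object. This is a short sketch using the polling values from the example above; the names polls, poll_vector, poll_frame, and leaders are introduced here for illustration.

```r
polls <- data.frame(candidate = c("Biden", "Warren", "Sanders"),
                    poll = c(26, 17, 17))

# $ returns the column as a plain vector
poll_vector <- polls$poll
class(poll_vector)    # "numeric"

# Single brackets with a name return a one-column data frame
poll_frame <- polls["poll"]
class(poll_frame)     # "data.frame"

# A logical condition in the row position keeps matching rows
leaders <- polls[polls$poll > 20, ]
leaders
```

The vector form is what functions like mean() expect.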
mpg
cyl
variable and store this as a new variable cylsq
dim(mtcars) # str(mtcars) also okay here
## [1] 32 11
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
mtcars$mpg
mtcars[4,]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
mtcars$cylsq <- (mtcars$cyl)^2
tidyverse
or data.table
The working directory is the folder where R scripts and projects look for files by default. Since we are using RStudio’s projects feature, your working directory is set already. Check where it is as follows:
getwd()
## [1] "C:/Users/kumar/OneDrive - Northwestern University - Student Advantage/Assistantships & Jobs/MSiA/bootcamp-2019"
If you need to change your working directory, e.g. when you are not working within an RStudio Project, you can go to the Files tab in the bottom right window in RStudio and find the directory you want. Then you can set it as a working directory with an option in the “More” menu. Or you can use the setwd() function.
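As a rough sketch of how setwd() behaves, the code below switches to R's temporary directory and then back. The temporary directory is only a stand-in for whatever folder you actually want, and old_wd and new_wd are names chosen for illustration.

```r
# Remember the current working directory
old_wd <- getwd()

# Switch to another folder (tempdir() is a placeholder for your own path)
setwd(tempdir())
new_wd <- getwd()
new_wd

# Switch back so the rest of the session is unaffected
setwd(old_wd)
getwd()
```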
gapminder <- read.csv("data/gapminder5.csv", stringsAsFactors=FALSE)
gapminder <- read.csv(file = "data/gapminder5.csv",
sep = ",",
stringsAsFactors = FALSE)
Run the read.csv() code provided and see what happens in the Environment tab.
Load the readr package.
Use read_csv() to load the gapminder data. Read the message generated in the console.
library(readr)
## Warning: package 'readr' was built under R version 3.4.4
gapminder <- read_csv("data/gapminder5.csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## year = col_double(),
## pop = col_double(),
## continent = col_character(),
## lifeExp = col_double(),
## gdpPercap = col_double()
## )
Other file types can be read with packages such as haven (Stata, SAS, SPSS) or readxl (Excel)
Open the data with View() and examine other characteristics with str(), dim(), names(), nrow(), and more.
summary() returns summary statistics for all variables.
mean(), median(), var(), sd(), and quantile() operate as expected.
table() creates a frequency table of one or more variables
prop.table() can turn a frequency table into a proportion table
Run summary() on the gapminder data
Find the mean of the variable pop
Create a frequency table of the variable continent with table()
Create a proportion table of continent, using prop.table() (hint: the input of prop.table() is the output of table())
summary(gapminder)
## country year pop continent
## Length:1704 Min. :1952 Min. :6.001e+04 Length:1704
## Class :character 1st Qu.:1966 1st Qu.:2.794e+06 Class :character
## Mode :character Median :1980 Median :7.024e+06 Mode :character
## Mean :1980 Mean :2.960e+07
## 3rd Qu.:1993 3rd Qu.:1.959e+07
## Max. :2007 Max. :1.319e+09
## lifeExp gdpPercap
## Min. :23.60 Min. : 241.2
## 1st Qu.:48.20 1st Qu.: 1202.1
## Median :60.71 Median : 3531.8
## Mean :59.47 Mean : 7215.3
## 3rd Qu.:70.85 3rd Qu.: 9325.5
## Max. :82.60 Max. :113523.1
mean(gapminder$pop)
## [1] 29601212
table(gapminder$continent)
##
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
prop.table(table(gapminder$continent))
##
## Africa Americas Asia Europe Oceania
## 0.36619718 0.17605634 0.23239437 0.21126761 0.01408451
gapminder[gapminder$continent=="Asia",]
The subset() function: subset(gapminder, subset=continent=="Asia")
The sort() function reorders elements, in ascending order by default.
Reverse the order with the decreasing = TRUE argument.
The order() function gives you the index positions in sorted order.
sort() is useful for quickly viewing vectors; order() is useful for arranging data frames.
Create a new data frame called gapminder07 containing only those rows in the gapminder data where year is 2007
Count the countries in each continent in gapminder07, sorted by frequency (hint: use table() and sort())
gapminder07 <- subset(gapminder, subset = year==2007)
sort(table(gapminder07$continent))
##
## Oceania Americas Europe Asia Africa
## 2 25 30 33 52
gapminder07$pop[gapminder07$country=="Mexico"]
## [1] 108700891
head(gapminder07[order(gapminder07$pop, decreasing=TRUE),])
## # A tibble: 6 x 6
## country year pop continent lifeExp gdpPercap
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 China 2007 1318683096 Asia 73.0 4959.
## 2 India 2007 1110396331 Asia 64.7 2452.
## 3 United States 2007 301139947 Americas 78.2 42952.
## 4 Indonesia 2007 223547000 Asia 70.6 3541.
## 5 Brazil 2007 190010647 Americas 72.4 9066.
## 6 Pakistan 2007 169270617 Asia 65.5 2606.
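The difference between sort() and order() is easier to see on a small example. This sketch uses the built-in mtcars data so it runs without the gapminder file; top_mpg and idx are names introduced for illustration.

```r
# sort() returns the values themselves, rearranged
top_mpg <- sort(mtcars$mpg, decreasing = TRUE)
head(top_mpg, 3)

# order() returns the row positions that would sort the vector
idx <- order(mtcars$mpg, decreasing = TRUE)
head(idx, 3)

# Using those positions as a row index rearranges the whole data frame
head(mtcars[idx, c("mpg", "cyl")], 3)
```

Indexing the original vector with the output of order() gives the same result as sort().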
When cleaning or wrangling datasets in RStudio, we will often want to create new variables.
Two ways to add a vector as a new variable in R:
gapminder$newvar <- newvar
gapminder <- cbind(gapminder, newvar)
Removing columns is easy too:
gapminder$newvar <- NULL
gapminder <- gapminder[, names(gapminder) != "newvar"]
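Here is a self-contained sketch of adding and removing columns, using a made-up data frame called scores so it does not depend on the gapminder file. Dropping a column by name is done here with a logical comparison against names(), since a minus sign only works with numeric positions.

```r
# A made-up data frame for illustration
scores <- data.frame(id = 1:3, score = c(90, 85, 77))

# Add a column with $
scores$passed <- scores$score > 80

# Add a column with cbind()
bonus <- c(1, 0, 2)
scores <- cbind(scores, bonus)
names(scores)     # "id" "score" "passed" "bonus"

# Remove a column by assigning NULL
scores$bonus <- NULL

# Remove a column by name, keeping every column whose name differs
scores <- scores[, names(scores) != "passed"]

# Remove a column by position with a negative index
id_only <- scores[, -2, drop = FALSE]
names(id_only)    # "id"
```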
More tools for this are available in dplyr, which you can preview in the lecture notes.
Use the data frame gapminder07 throughout this exercise.
Round the values of the variable lifeExp using round(), and store this as a new variable lifeExp_round
Look at the code that creates the variable lifeExp_over70 and try to understand what it does.
Create a new variable lifeExp_highlow that has the value “High” when life expectancy is over the mean and the value “Low” when it is below the mean.
gapminder07$lifeExp_round <- round(gapminder07$lifeExp)
head(gapminder07$lifeExp_round)
## [1] 44 76 72 43 75 81
gapminder07$lifeExp_highlow <- NA
gapminder07$lifeExp_highlow[gapminder07$lifeExp>mean(gapminder07$lifeExp)] <- "High"
gapminder07$lifeExp_highlow[gapminder07$lifeExp<mean(gapminder07$lifeExp)] <- "Low"
table(gapminder07$lifeExp_highlow)
##
## High Low
## 85 57
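One possible alternative to the three-line solution above is ifelse(), which fills in both labels in a single step. This sketch uses the built-in mtcars data so it runs on its own, and mpg_highlow is a name chosen for illustration.

```r
# Label each car by whether its mpg is above the overall mean
mpg_highlow <- ifelse(mtcars$mpg > mean(mtcars$mpg), "High", "Low")

# Count how many cars fall in each group
table(mpg_highlow)
```

One difference to keep in mind: with ifelse(), a value exactly equal to the mean is labelled "Low", whereas the approach above would leave it as NA.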
The aggregate() function accomplishes this: aggregate(y ~ x, FUN = mean) gives the mean of vector y for each unique group in x.
mean can be replaced by other functions here, such as median.
aggregate(gapminder07$lifeExp ~ gapminder07$continent, FUN = mean)
## gapminder07$continent gapminder07$lifeExp
## 1 Africa 54.80604
## 2 Americas 73.60812
## 3 Asia 70.72848
## 4 Europe 77.64860
## 5 Oceania 80.71950
aggregate(lifeExp ~ continent, data = gapminder07, FUN = mean)
## continent lifeExp
## 1 Africa 54.80604
## 2 Americas 73.60812
## 3 Asia 70.72848
## 4 Europe 77.64860
## 5 Oceania 80.71950
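To illustrate swapping in another function, and grouping by more than one variable, here is a brief sketch on the built-in mtcars data. The object names are introduced for illustration.

```r
# Median miles per gallon for each number of cylinders
median_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = median)
median_by_cyl

# Mean miles per gallon for each combination of cylinders and transmission
mean_by_cyl_am <- aggregate(mpg ~ cyl + am, data = mtcars, FUN = mean)
mean_by_cyl_am
```

Adding a second variable to the right-hand side of the formula produces one row per combination of groups.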
Correlation: cor(); Covariance: cov()
T-test: t.test(var1 ~ var2), where var2 is the grouping variable
Linear regression: lm(y ~ x1 + x2, data = df)
Use gapminder07 for all the below exercises.
You’re using some new functions, so refer to help files whenever you get stuck.
Calculate the correlation between lifeExp and gdpPercap.
Use a t-test to evaluate the difference between gdpPercap in “high” and “low” life expectancy countries. Store the results as t1, and then print out t1.
cor(gapminder07$lifeExp, gapminder07$gdpPercap)
## [1] 0.6786624
t1 <- t.test(gapminder07$gdpPercap~gapminder07$lifeExp_highlow)
t1 <- t.test(gdpPercap~lifeExp_highlow, data=gapminder07)
t1
##
## Welch Two Sample t-test
##
## data: gdpPercap by lifeExp_highlow
## t = 10.564, df = 95.704, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 12674.02 18539.14
## sample estimates:
## mean in group High mean in group Low
## 17944.685 2338.104
Note that t1 is stored as a list. You can now call up the components of the t-test when you need them.
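As a sketch of what calling up the components looks like, the code below runs a t-test on the built-in mtcars data (so it works without gapminder07) and pulls out individual pieces with $. The name t_am is made up for this example.

```r
# Compare mpg between automatic (am = 0) and manual (am = 1) cars
t_am <- t.test(mpg ~ am, data = mtcars)

# See which components are available
names(t_am)

# Extract individual components
t_am$statistic    # the t statistic
t_am$p.value      # the p-value
t_am$estimate     # the two group means
t_am$conf.int     # the confidence interval for the difference
```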
Conduct a linear regression using lm() which predicts lifeExp as a function of gdpPercap and pop. Store the results as reg1.
You can use the df$var syntax, or you can just use variable names and identify the data frame in the data = argument.
Print out reg1.
Run summary() on reg1.
reg1 <- lm(lifeExp ~ gdpPercap + pop, data = gapminder07)
reg1
##
## Call:
## lm(formula = lifeExp ~ gdpPercap + pop, data = gapminder07)
##
## Coefficients:
## (Intercept) gdpPercap pop
## 5.921e+01 6.416e-04 7.001e-09
summary(reg1)
##
## Call:
## lm(formula = lifeExp ~ gdpPercap + pop, data = gapminder07)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.496 -6.119 1.899 7.018 13.383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.921e+01 1.040e+00 56.906 <2e-16 ***
## gdpPercap 6.416e-04 5.818e-05 11.029 <2e-16 ***
## pop 7.001e-09 5.068e-09 1.381 0.169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.87 on 139 degrees of freedom
## Multiple R-squared: 0.4679, Adjusted R-squared: 0.4602
## F-statistic: 61.11 on 2 and 139 DF, p-value: < 2.2e-16
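Like the t-test result, a fitted model can be queried for its parts rather than only printed. The following sketch fits a model on the built-in mtcars data; fit and fit_summary are names introduced for illustration.

```r
# Predict mpg from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficients as a named numeric vector
coef(fit)

# summary() returns an object whose parts can be extracted
fit_summary <- summary(fit)
fit_summary$r.squared       # multiple R-squared
fit_summary$coefficients    # estimates, standard errors, t values, p-values

# Residuals, one per observation
head(residuals(fit))
```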
Use write.csv() from base R or write_csv() from readr to do this.
Save gapminder07 to the “data” subfolder in your working directory using the write.csv function. Set the argument row.names = FALSE.
write.csv(gapminder07, file = "data/gapminder07.csv", row.names = FALSE)
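A quick way to confirm that a file was written as intended is to read it back in. This sketch writes the built-in mtcars data to a temporary file rather than the “data” subfolder, so it runs anywhere; out_file and check are names chosen for illustration.

```r
# Write to a temporary file (a stand-in for a path such as "data/myfile.csv")
out_file <- tempfile(fileext = ".csv")
write.csv(mtcars, file = out_file, row.names = FALSE)

# Read it back and compare the dimensions with the original
check <- read.csv(out_file)
dim(check)      # 32 11
dim(mtcars)     # 32 11
```

With row.names = FALSE the row names (here, the car names) are not written to the file, so no extra column appears when the file is read back.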
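Save your workspace with save.image() or by clicking the “Save” icon in the Environment tab.
Reload it with load() or by opening the .RData file that is created.
Save specific objects to an .RData file with the save() function.
More advanced plots can be made with ggplot2, which we will cover on Day 3.
Histograms are created with hist().
Create a histogram of the variable lifeExp in gapminder07.
Change the breaks = argument from its default setting and see what happens.
hist(gapminder07$lifeExp,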
main="Distribution of life expectancy across countries in 2007",
xlab="Life expectancy", ylab="Frequency")
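Scatterplots are created by passing a formula (y ~ x) to the plot() function in R.
Titles and axis labels can be added to plot() similarly to hist().
abline() can “layer” straight lines on top of a plot() output.
Create a scatterplot with lifeExp on the y-axis and gdpPercap on the x-axis.
Add a horizontal line for the mean of lifeExp onto the plot using abline().
plot(gapminder07$lifeExp ~ gapminder07$gdpPercap)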
plot(gapminder07$lifeExp ~ gapminder07$gdpPercap,
main="Relationship between life expectancy and GDP per capita in 2007",
ylab="Life expectancy", xlab="GDP per capita")
plot(gapminder07$lifeExp ~ gapminder07$gdpPercap,
main="Relationship between life expectancy and GDP per capita in 2007",
ylab="Life expectancy", xlab="GDP per capita")
abline(h = mean(gapminder07$lifeExp))
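abline() accepts more than horizontal lines. The sketch below, on the built-in mtcars data, layers a horizontal line, a vertical line, and a fitted regression line on one scatterplot; fit_line is a name introduced for illustration.

```r
# Scatterplot of miles per gallon against weight
plot(mpg ~ wt, data = mtcars,
     main = "Fuel efficiency and weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Horizontal line at the mean of the y variable
abline(h = mean(mtcars$mpg), lty = 2)

# Vertical line at the mean of the x variable
abline(v = mean(mtcars$wt), lty = 3)

# Regression line: abline() reads the intercept and slope from an lm object
fit_line <- lm(mpg ~ wt, data = mtcars)
abline(fit_line, col = "red")
```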