R Day 3 - Advanced manipulation and dplyr

A new dataset

California energy data

file names: ca_energy_generation.csv and ca_energy_imports.csv
Read these two files into your working environment
They are in the “data” folder

Reading in the data

generation <- read.csv(here::here("data/ca_energy_generation.csv"), 
                       stringsAsFactors = F)
imports <- read.csv(here::here("data/ca_energy_imports.csv"), 
                    stringsAsFactors = F)

Exploring the data

str(generation)

## 'data.frame':    168 obs. of  12 variables:
##  $ datetime   : chr  "2019-09-03 00:00:00" "2019-09-03 01:00:00" "2019-09-03 02:00:00" "2019-09-03 03:00:00" ...
##  $ biogas     : num  239 239 239 239 238 ...
##  $ biomass    : num  407 406 407 407 406 ...
##  $ coal       : num  9 9 9 9.83 9.42 ...
##  $ geothermal : num  967 970 973 974 975 ...
##  $ large_hydro: num  3152 2724 2630 2611 2651 ...
##  $ natural_gas: num  9711 9155 8078 7801 7744 ...
##  $ nuclear    : num  2249 2251 2251 2245 2243 ...
##  $ other      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ small_hydro: num  402 401 414 360 382 ...
##  $ solar      : num  0 0 0 0 0 ...
##  $ wind       : num  2670 2779 2795 2583 2397 ...

Dealing with dates and times

Notice that the first variable in both dataset is the called “datetime”
What class are these variables?

class(generation$datetime)

## [1] "character"

class(imports$datetime)

## [1] "character"

Dealing with dates/times with `lubridate`

The best way to deal with date-time data is to use the lubridate package and the as_datetime function
Recode the datetime variable in the imports dataframe

library(lubridate)
generation$datetime <- as_datetime(generation$datetime)
class(generation$datetime)

## [1] "POSIXct" "POSIXt"

head(generation$datetime)

## [1] "2019-09-03 00:00:00 UTC" "2019-09-03 01:00:00 UTC"
## [3] "2019-09-03 02:00:00 UTC" "2019-09-03 03:00:00 UTC"
## [5] "2019-09-03 04:00:00 UTC" "2019-09-03 05:00:00 UTC"

Dealing with dates/times with `lubridate`

imports$datetime <- as_datetime(imports$datetime)
head(imports$datetime)

## [1] "2019-09-03 00:00:00 UTC" "2019-09-03 01:00:00 UTC"
## [3] "2019-09-03 02:00:00 UTC" "2019-09-03 03:00:00 UTC"
## [5] "2019-09-03 04:00:00 UTC" "2019-09-03 05:00:00 UTC"

Reshaping data

Wide versus long data

Principles of “tidy data” (R for Data Science - Wickham & Grolemund)

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Often, we want to make wide data long (or tidy) for analysis

Wide versus long data

head(generation)

##              datetime   biogas  biomass      coal geothermal large_hydro
## 1 2019-09-03 00:00:00 238.9167 407.0833  9.000000   967.2500    3152.500
## 2 2019-09-03 01:00:00 239.0000 406.5000  9.000000   970.3333    2724.333
## 3 2019-09-03 02:00:00 239.0000 406.8333  9.000000   973.1667    2630.000
## 4 2019-09-03 03:00:00 238.9167 407.1667  9.833333   973.8333    2611.250
## 5 2019-09-03 04:00:00 237.9167 406.0833  9.416667   974.6667    2651.333
## 6 2019-09-03 05:00:00 237.1667 408.9167 10.000000   973.3333    2965.500
##   natural_gas  nuclear other small_hydro solar     wind
## 1    9710.833 2248.750     0    401.6667     0 2670.167
## 2    9155.000 2250.583     0    400.9167     0 2778.833
## 3    8078.333 2250.833     0    413.5833     0 2795.250
## 4    7801.250 2245.000     0    359.9167     0 2582.583
## 5    7744.333 2242.833     0    382.0833     0 2397.417
## 6    8624.583 2242.833     0    395.9167     0 2327.000

Using `reshape2`

melt –> make data long
dcast –> make data wide
recast–> melt then cast data

Reshaping CA energy data

Right now, the generation dataframe has several observations per row

head(generation)

##              datetime   biogas  biomass      coal geothermal large_hydro
## 1 2019-09-03 00:00:00 238.9167 407.0833  9.000000   967.2500    3152.500
## 2 2019-09-03 01:00:00 239.0000 406.5000  9.000000   970.3333    2724.333
## 3 2019-09-03 02:00:00 239.0000 406.8333  9.000000   973.1667    2630.000
## 4 2019-09-03 03:00:00 238.9167 407.1667  9.833333   973.8333    2611.250
## 5 2019-09-03 04:00:00 237.9167 406.0833  9.416667   974.6667    2651.333
## 6 2019-09-03 05:00:00 237.1667 408.9167 10.000000   973.3333    2965.500
##   natural_gas  nuclear other small_hydro solar     wind
## 1    9710.833 2248.750     0    401.6667     0 2670.167
## 2    9155.000 2250.583     0    400.9167     0 2778.833
## 3    8078.333 2250.833     0    413.5833     0 2795.250
## 4    7801.250 2245.000     0    359.9167     0 2582.583
## 5    7744.333 2242.833     0    382.0833     0 2397.417
## 6    8624.583 2242.833     0    395.9167     0 2327.000

`melt` the generation data

melt(df, id.vars = "id")

Specify the variable that doesn’t melt with id.vars

library(reshape2)
long_gen <- melt(generation, id.vars = "datetime",
                           variable.name = "source",
                           value.name = "usage")
head(long_gen)

##              datetime source    usage
## 1 2019-09-03 00:00:00 biogas 238.9167
## 2 2019-09-03 01:00:00 biogas 239.0000
## 3 2019-09-03 02:00:00 biogas 239.0000
## 4 2019-09-03 03:00:00 biogas 238.9167
## 5 2019-09-03 04:00:00 biogas 237.9167
## 6 2019-09-03 05:00:00 biogas 237.1667

`melt` the generation data

long_gen[order(long_gen$datetime)[1:20], ]

##                 datetime      source     usage
## 1    2019-09-03 00:00:00      biogas  238.9167
## 169  2019-09-03 00:00:00     biomass  407.0833
## 337  2019-09-03 00:00:00        coal    9.0000
## 505  2019-09-03 00:00:00  geothermal  967.2500
## 673  2019-09-03 00:00:00 large_hydro 3152.5000
## 841  2019-09-03 00:00:00 natural_gas 9710.8333
## 1009 2019-09-03 00:00:00     nuclear 2248.7500
## 1177 2019-09-03 00:00:00       other    0.0000
## 1345 2019-09-03 00:00:00 small_hydro  401.6667
## 1513 2019-09-03 00:00:00       solar    0.0000
## 1681 2019-09-03 00:00:00        wind 2670.1667
## 2    2019-09-03 01:00:00      biogas  239.0000
## 170  2019-09-03 01:00:00     biomass  406.5000
## 338  2019-09-03 01:00:00        coal    9.0000
## 506  2019-09-03 01:00:00  geothermal  970.3333
## 674  2019-09-03 01:00:00 large_hydro 2724.3333
## 842  2019-09-03 01:00:00 natural_gas 9155.0000
## 1010 2019-09-03 01:00:00     nuclear 2250.5833
## 1178 2019-09-03 01:00:00       other    0.0000
## 1346 2019-09-03 01:00:00 small_hydro  400.9167

Merging data

Merging CA energy data

Sometimes you have data from two (or more) sources that you want to analyze
Need to merge these dataframes together
To merge, need to chose the columns that have common values between the dataframes
Usually a variable with ids or years, or both

Merging the `merge`

merge(x, y, by = c("id", "year"))

Key arguments:
x: first dataframe
y: second dataframe
by: variables to match (must have common name)

More `merge` arguments

merge(x, y, by.x = "id", by.y = "cd", all.x = T, all.y = T)

Advanced arguments:
Use by.x and by.y if the dataframes have different variable names
Use all.x = T if you want to keep all the observation in the first dataframe (unmatched observations in y are dropped!)
Use all.y = T if you want to keep all observations in the second dataframe (umatched observations in x are dropped!)
Use both (or, simply all = T) to keep all observations!

Merge by `datetime`

Use merge to join the generation and imports dataframes, using the datetime variable to match

Merge by `datetime`

Always check your merge!

merged_energy <- merge(generation, imports, by = "datetime")
dim(merged_energy)

## [1] 168  13

head(merged_energy)

##              datetime   biogas  biomass      coal geothermal large_hydro
## 1 2019-09-03 00:00:00 238.9167 407.0833  9.000000   967.2500    3152.500
## 2 2019-09-03 01:00:00 239.0000 406.5000  9.000000   970.3333    2724.333
## 3 2019-09-03 02:00:00 239.0000 406.8333  9.000000   973.1667    2630.000
## 4 2019-09-03 03:00:00 238.9167 407.1667  9.833333   973.8333    2611.250
## 5 2019-09-03 04:00:00 237.9167 406.0833  9.416667   974.6667    2651.333
## 6 2019-09-03 05:00:00 237.1667 408.9167 10.000000   973.3333    2965.500
##   natural_gas  nuclear other small_hydro solar     wind  imports
## 1    9710.833 2248.750     0    401.6667     0 2670.167 7527.333
## 2    9155.000 2250.583     0    400.9167     0 2778.833 6930.333
## 3    8078.333 2250.833     0    413.5833     0 2795.250 7005.833
## 4    7801.250 2245.000     0    359.9167     0 2582.583 7080.583
## 5    7744.333 2242.833     0    382.0833     0 2397.417 7421.667
## 6    8624.583 2242.833     0    395.9167     0 2327.000 7511.667

Try reshaping the merged data!

Our merged dataframe is still wide and untidy
Create a long version called long_merged_energy

Try reshaping the merged data!

long_merged_energy <- melt(merged_energy, id.vars = "datetime",
                           variable.name = "source",
                           value.name = "usage")
head(long_merged_energy)

##              datetime source    usage
## 1 2019-09-03 00:00:00 biogas 238.9167
## 2 2019-09-03 01:00:00 biogas 239.0000
## 3 2019-09-03 02:00:00 biogas 239.0000
## 4 2019-09-03 03:00:00 biogas 238.9167
## 5 2019-09-03 04:00:00 biogas 237.9167
## 6 2019-09-03 05:00:00 biogas 237.1667

dplyr!

dplyr - an idiosyncratic approach to data manipulation

dplyr is an R package for data manipulation developed by Hadley Wickham with help from lots of other R developers
https://dplyr.tidyverse.org/
Principles: consistent syntax, meaningful naming conventions, fewer arguments/more functions

`dplyr` verbs

select –> subset variables
filter –> subset observations based on conditions
mutate –> add new variables
summarize –> reduce multiple observations to a single value (e.g., find the mean)

Trying out `dplyr` verbs - `select`

select by name:
select(gapminder, continent, pop)
select by position:
select(df, c(1, 3, 10))
select by range:
select(df, country:pop) or select(df, 1:3)
drop variables with -
select(df, -gdpPercap)

Trying out `dplyr` verbs - `select`

library(dplyr)
tmp <- select(merged_energy, biogas, biomass, geothermal, solar)
names(tmp)

## [1] "biogas"     "biomass"    "geothermal" "solar"

tmp <- select(merged_energy, -biogas, -biomass, -geothermal, -solar)
names(tmp)

## [1] "datetime"    "coal"        "large_hydro" "natural_gas" "nuclear"    
## [6] "other"       "small_hydro" "wind"        "imports"

select helpers

There are several “select helpers” that make subsetting variables very easy
one_of(), contains(), starts_with(), ends_with(), matches()

tmp <- select(merged_energy, contains("hydro"), starts_with("bio"))
names(tmp)

## [1] "large_hydro" "small_hydro" "biogas"      "biomass"

`filter`

tmp <- filter(merged_energy, imports > 7000)
nrow(tmp)

## [1] 78

head(tmp)

##              datetime   biogas  biomass      coal geothermal large_hydro
## 1 2019-09-03 00:00:00 238.9167 407.0833  9.000000   967.2500    3152.500
## 2 2019-09-03 02:00:00 239.0000 406.8333  9.000000   973.1667    2630.000
## 3 2019-09-03 03:00:00 238.9167 407.1667  9.833333   973.8333    2611.250
## 4 2019-09-03 04:00:00 237.9167 406.0833  9.416667   974.6667    2651.333
## 5 2019-09-03 05:00:00 237.1667 408.9167 10.000000   973.3333    2965.500
## 6 2019-09-03 06:00:00 236.7500 406.0833  9.166667   972.3333    3809.500
##   natural_gas  nuclear other small_hydro    solar     wind  imports
## 1    9710.833 2248.750     0    401.6667  0.00000 2670.167 7527.333
## 2    8078.333 2250.833     0    413.5833  0.00000 2795.250 7005.833
## 3    7801.250 2245.000     0    359.9167  0.00000 2582.583 7080.583
## 4    7744.333 2242.833     0    382.0833  0.00000 2397.417 7421.667
## 5    8624.583 2242.833     0    395.9167  0.00000 2327.000 7511.667
## 6    9544.583 2244.833     0    407.5000 77.16667 2048.833 8172.667

multiple conditions in `filter`

tmp <- filter(merged_energy, imports > 7000, natural_gas < 7000)
nrow(tmp)

## [1] 20

head(tmp)

##              datetime   biogas  biomass     coal geothermal large_hydro
## 1 2019-09-07 03:00:00 240.1667 358.9167 9.166667   964.7500    2413.833
## 2 2019-09-07 06:00:00 238.8333 371.3333 8.500000   971.7500    2649.417
## 3 2019-09-08 00:00:00 236.8333 389.7500 5.416667   970.2500    2697.000
## 4 2019-09-08 01:00:00 237.2500 388.8333 4.750000   969.3333    2654.083
## 5 2019-09-08 02:00:00 237.1667 374.3333 4.083333   969.9167    2217.667
## 6 2019-09-08 03:00:00 237.8333 342.5000 4.000000   972.5000    2090.917
##   natural_gas  nuclear other small_hydro    solar     wind  imports
## 1    6667.417 2255.083     0    378.4167  0.00000 3809.917 7129.667
## 2    6553.917 2257.083     0    408.5000 67.83333 3643.750 7338.250
## 3    6364.333 2251.167     0    403.7500  0.00000 3628.083 8077.583
## 4    5037.917 2254.000     0    403.3333  0.00000 3118.083 8658.583
## 5    4530.750 2255.583     0    385.0833  0.00000 2928.083 8806.583
## 6    4079.500 2257.500     0    377.5000  0.00000 2726.583 8887.167

`mutate`

mutate creates new variables

tmp <- mutate(long_merged_energy, log_usage = log(usage))
head(tmp)

##              datetime source    usage log_usage
## 1 2019-09-03 00:00:00 biogas 238.9167  5.476115
## 2 2019-09-03 01:00:00 biogas 239.0000  5.476464
## 3 2019-09-03 02:00:00 biogas 239.0000  5.476464
## 4 2019-09-03 03:00:00 biogas 238.9167  5.476115
## 5 2019-09-03 04:00:00 biogas 237.9167  5.471920
## 6 2019-09-03 05:00:00 biogas 237.1667  5.468763

`mutate`

tmp <- mutate(long_merged_energy, log_usage = log(usage), usage2 = usage^2, usage3 = usage^3)
head(tmp)

##              datetime source    usage log_usage   usage2   usage3
## 1 2019-09-03 00:00:00 biogas 238.9167  5.476115 57081.17 13637644
## 2 2019-09-03 01:00:00 biogas 239.0000  5.476464 57121.00 13651919
## 3 2019-09-03 02:00:00 biogas 239.0000  5.476464 57121.00 13651919
## 4 2019-09-03 03:00:00 biogas 238.9167  5.476115 57081.17 13637644
## 5 2019-09-03 04:00:00 biogas 237.9167  5.471920 56604.34 13467116
## 6 2019-09-03 05:00:00 biogas 237.1667  5.468763 56248.03 13340157

`summarize`

summarize reduces observations to a single value based on functions - mean, sum, sd, min, max, etc.

# total energy consumption
summarize(long_merged_energy, total = sum(usage, na.rm = T))

##     total
## 1 5223392

# mean energy consumption
summarize(long_merged_energy, mean_cons = mean(usage, na.rm = T))

##   mean_cons
## 1  2590.968

A new operator: the pipe `%>%`

The powerful %>% operator lets you chain together functions
It sends the result of one function to another function
Read %>% as “then”

# take df then filter it then select these variables
# you do not need to repeat the name of the dataframe!
long_merged_energy %>% 
  filter(source == "geothermal") %>% 
  select(-datetime) %>% 
  mutate(log_usage = log(usage)) %>% 
  summarize(mean_log_usage = mean(log_usage, na.rm = T))

##   mean_log_usage
## 1       6.872053

A new operator: the pipe `%>%`

While piping, the piped dataframe is not changed!
To refer to the manipulated dataframe, use .

merged_energy %>% 
  select(-datetime) %>% 
  mutate(total_usage = rowSums(., na.rm = T)) %>% 
  summarize(total_usage = sum(total_usage, na.rm = T))

##   total_usage
## 1     5223392

A new operator: the pipe `%>%`

While piping, the piped dataframe is not changed!
To refer to the manipulated dataframe, use .

merged_energy %>% 
  select(-datetime) %>% 
  mutate(total_usage = rowSums(merged_energy, na.rm = T)) %>% 
  summarize(total_usage = sum(total_usage, na.rm = T))

## Error: Evaluation error: 'x' must be numeric.

Try piping

Using the (wide) merged CA energy data merged_energy do the following with pipes:

Select variables that contain the word “hydro”
Create a new variable called total_hydro that is the sum of the retained hydro variables
Find the mean usage for total_hydro

Try piping

merged_energy %>% 
  select(contains("hydro")) %>% 
  mutate(total_hydro = rowSums(., na.rm = T)) %>%
  summarize(mean_hydro = mean(total_hydro, na.rm = T))

##   mean_hydro
## 1   3387.702

`group_by` and `summarize`

group_by is a powerful function that allows us to perform operations by groups of observations!
Very useful with summarize!

long_merged_energy %>% 
  group_by(source) %>% 
  summarize(sum_usage = sum(usage, na.rm = T))

## # A tibble: 12 x 2
##    source      sum_usage
##    <fct>           <dbl>
##  1 biogas         39427.
##  2 biomass        62890.
##  3 coal            1528.
##  4 geothermal    162114.
##  5 large_hydro   498668.
##  6 natural_gas  1786555.
##  7 nuclear       378618.
##  8 other              0 
##  9 small_hydro    70466.
## 10 solar         660315.
## 11 wind          440125.
## 12 imports      1122687.

`group_by` versus `for` loops

Remember the for loop we wrote yesterday to find the mean life expectancy by year?
MUCH better in dplyr

`group_by` versus `for` loops

gapminder <- read.csv(here::here("data/gapminder5.csv"))

gapminder %>% 
  group_by(year) %>% 
  summarize(mean_le = mean(lifeExp, na.rm = T),
            sd_lf = sd(lifeExp, na.rm = T))

## # A tibble: 12 x 3
##     year mean_le sd_lf
##    <int>   <dbl> <dbl>
##  1  1952    49.1  12.2
##  2  1957    51.5  12.2
##  3  1962    53.6  12.1
##  4  1967    55.7  11.7
##  5  1972    57.6  11.4
##  6  1977    59.6  11.2
##  7  1982    61.5  10.8
##  8  1987    63.2  10.6
##  9  1992    64.2  11.2
## 10  1997    65.0  11.6
## 11  2002    65.7  12.3
## 12  2007    67.0  12.1

Find the mean by source

Use your knowledge of dplyr to find the mean usage for small hydro, large hydro, biogas, and biomass
Start with either the wide or long merged dataset

Find the mean by source

merged_energy %>% 
  select(datetime, contains("hydro"), contains("bio")) %>% 
  melt(id.vars = "datetime",
       variable.name = "source",
       value.name = "usage") %>% 
  group_by(source) %>% 
  summarize(mean_usage = mean(usage, na.rm = T))

## # A tibble: 4 x 2
##   source      mean_usage
##   <fct>            <dbl>
## 1 large_hydro      2968.
## 2 small_hydro       419.
## 3 biogas            235.
## 4 biomass           374.

Merging in `dplyr` with `*_join` functions

dplyr’s philosophy is to have more functions with fewer options
left_join - keeps all observations in the first dataframe
- merge(all.x = T)
right_join - keeps all observations in the second dataframe
- merge(all.y = T)
full_join - keeps all observations in both dataframes
- merge(all = T)
inner_join - keeps only those observations that are matched in both datasets
- merge(all = F)

Comparing `merge` and `inner_join`

tmp <- merge(generation, imports, by = "datetime", all = F)
dim(tmp)

## [1] 168  13

tmp <- inner_join(generation, imports, by = "datetime")
dim(tmp)

## [1] 168  13

Advanced R - Reshaping, merging, and dplyr

Richard Paquin Morel

2019-09-16

R Day 3 - Advanced manipulation and dplyr

A new dataset

Reading in the data

Exploring the data

Dealing with dates and times

Dealing with dates/times with lubridate

Dealing with dates/times with lubridate

Reshaping data

Wide versus long data

Wide versus long data

Using reshape2

Reshaping CA energy data

melt the generation data

melt the generation data

Merging data

Merging CA energy data

Merging the merge

More merge arguments

Merge by datetime

Merge by datetime

Try reshaping the merged data!

Try reshaping the merged data!

dplyr!

dplyr - an idiosyncratic approach to data manipulation

dplyr verbs

Trying out dplyr verbs - select

Trying out dplyr verbs - select

select helpers

filter

multiple conditions in filter

mutate

mutate

summarize

A new operator: the pipe %>%

A new operator: the pipe %>%

A new operator: the pipe %>%

Try piping

Try piping

group_by and summarize

group_by versus for loops

group_by versus for loops

Find the mean by source

Find the mean by source

Merging in dplyr with *_join functions

Comparing merge and inner_join

Dealing with dates/times with `lubridate`

Dealing with dates/times with `lubridate`

Using `reshape2`

`melt` the generation data

`melt` the generation data

Merging the `merge`

More `merge` arguments

Merge by `datetime`

Merge by `datetime`

`dplyr` verbs

Trying out `dplyr` verbs - `select`

Trying out `dplyr` verbs - `select`

`filter`

multiple conditions in `filter`

`mutate`

`mutate`

`summarize`

A new operator: the pipe `%>%`

A new operator: the pipe `%>%`

A new operator: the pipe `%>%`

`group_by` and `summarize`

`group_by` versus `for` loops

`group_by` versus `for` loops

Merging in `dplyr` with `*_join` functions

Comparing `merge` and `inner_join`