Advanced R: Day 3

Data visualization with ggplot2

Richard Paquin Morel
(slides created by Kumar Ramanathan)

2019-09-18

Introduction to ggplot

Part of the tidyverse, ggplot2 is an R package that implements the “grammar of graphics” in R.

In ggplot, graphics are built by supplying data and mapping of data values to aesthetics, and then adding layers that build geometric objects, scales, labels, and more.

Why ggplot?

  • Highly customizable and extensible.
  • Designed to work with dplyr and other tidyverse packages.
  • In use for over 10 years, meaning that help is widely available.
  • Wide variety of available extensions.

Useful resources

Let’s make sure we have the necessary data frames for our examples

  1. If you still have imports, generation, merged_energy, long_gen and long_merged_energy in your Environment, read in gapminder and recreate gapminder07.
  2. Alternatively, or if you are missing any of those objects from your workspace, run the following line of code:

Basics: grammar of graphics

Components of a basic plot

  • data: a data frame, provided to the ggplot() function
  • geometric objects: the objects/shapes that you want to plot, indicated through one of the many available geom functions, such as geom_point() or geom_histogram()
  • aesthetic mapping: the mapping from the data to the geometric objects, provided in an aes() function nested within the chosen geom function
  • connected with the + operator

Example: Scatterplot from Day 1

Task:
Create scatterplot of lifeExp against gdpPercap from gapminder07 data.

Steps:

  1. Supply gapminder07 data frame as input to ggplot() function
  2. Choose a geom; in this case we choose geom_point() to generate points
  3. Supply aesthetic mapping, i.e. specify x and y values, to geom_point().

Example: Scatterplot from Day 1
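
The three steps above can be sketched as follows. The small gapminder07 data frame built here is a hypothetical stand-in with invented values, so that the snippet runs on its own; in the workshop you would use the real gapminder07 object.

```r
library(ggplot2)

# Hypothetical stand-in for gapminder07 (values invented for illustration)
gapminder07 <- data.frame(
  country   = c("A", "B", "C", "D", "E"),
  gdpPercap = c(900, 4500, 12000, 30000, 42000),
  lifeExp   = c(52.1, 64.3, 71.8, 78.9, 80.6)
)

# Step 1: data goes to ggplot(); step 2: geom_point() draws points;
# step 3: aes() inside the geom maps columns to x and y
p <- ggplot(data = gapminder07) +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp))

# Typing p at the console draws the plot.
```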

Exercise 1: Scatterplot

Using the gapminder07 data, create a scatterplot of the natural log of gdpPercap as a function of the natural log of pop. Give it a title and axis labels.

Remember, you will need three functions: ggplot(), geom_point(), and labs().

Exercise 1: Scatterplot

Preparing your data

  • Recall the role of pipes (%>%) in dplyr this morning.
  • dplyr and ggplot2 are designed to work well together.
  • You can use a series of pipes to prepare your data before plotting.
  • To see examples, let’s turn to the California energy data.

Example: Energy generated over time

Task:
Plot a column chart of total energy generated over time.

Steps:

  1. Choose long-format generation data, i.e. long_gen.
  2. Manipulate data frame to calculate total output per date-time.
  3. Pipe manipulated data frame into plot.
  4. Select geom_col() and map appropriate x and y variables.

Example: Energy generated over time
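
A minimal sketch of these steps, assuming long_gen has the columns datetime, source and output. The toy long_gen below is hypothetical and only there to make the snippet self-contained.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for long_gen: one row per datetime and source
long_gen <- data.frame(
  datetime = rep(as.POSIXct("2019-09-03 00:00:00", tz = "UTC") + 3600 * 0:2,
                 each = 2),
  source   = rep(c("large_hydro", "small_hydro"), times = 3),
  output   = c(10, 2, 12, 3, 11, 4)
)

# Step 2: total output per date-time
total_gen <- long_gen %>%
  group_by(datetime) %>%
  summarize(output = sum(output))

# Steps 3 and 4: pipe the summarized data into ggplot() and add geom_col()
p <- total_gen %>%
  ggplot() +
  geom_col(aes(x = datetime, y = output))
```

Because %>% binds more tightly than +, the data frame is piped into ggplot() first and the geom layer is added afterwards.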

Exercise 2: Hydro power generated over time

Task:
Plot a column chart of hydroelectric power generated over time.

Hint: There are two types of hydroelectric sources in the data: large_hydro and small_hydro.

Exercise 2: Hydro power generated over time

Geometric objects

We have already seen examples of two kinds of plots:

  • Scatterplots, generated with geom_point()
  • Column charts, generated with geom_col()

Let’s see a few more.

Line plots: geom_line()

Interlude: changing geom characteristics

  • In the geom functions, we have only been supplying aesthetic mapping (in aes()).
  • Geom functions can also control characteristics of the geom object, outside the aes() function.
  • In most geoms, we can modulate size= and col= (for lines, ggplot2 3.4.0 and later use linewidth= in place of size=).
  • Let’s try it with the line plot we just made.

Interlude: changing geom characteristics
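
As a rough illustration, the line plot can be restyled by passing fixed values outside aes(). The imports data frame and its column names below are assumptions made for this sketch. It uses linewidth=, which ggplot2 3.4.0 and later expect for line thickness; older versions use size= for the same purpose.

```r
library(ggplot2)

# Hypothetical stand-in for the imports data; column names are assumptions
imports <- data.frame(
  datetime = as.POSIXct("2019-09-03 00:00:00", tz = "UTC") + 3600 * 0:5,
  imports  = c(6900, 7100, 7300, 7200, 7000, 6800)
)

# x and y are mapped from data inside aes();
# linewidth and col are fixed values, so they sit outside aes()
p <- ggplot(imports) +
  geom_line(aes(x = datetime, y = imports), linewidth = 0.8, col = "red")
```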

Area plots: geom_area()

Box plots: geom_boxplot()

Multiple geoms in one plot

Task:
Plot a line of large hydro generation over time, and a smoothed line of the same relationship on top of it.

Steps:

  1. Supply the generation data to ggplot().
  2. Create a line with geom_line() with appropriate x and y aesthetics. Make it turquoise for fun.
  3. Create a smoothed line with geom_smooth() with the same x and y aesthetics.
    1. geom_smooth() plots smoothed conditional means (estimated using a loess regression)

Multiple geoms in one plot
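
One way to write the steps above, assuming the generation data has the columns datetime and large_hydro. The values below are invented so that the example runs on its own.

```r
library(ggplot2)

# Hypothetical stand-in for generation (24 hourly values, invented)
generation <- data.frame(
  datetime    = as.POSIXct("2019-09-03 00:00:00", tz = "UTC") + 3600 * 0:23,
  large_hydro = 2000 + 600 * sin(2 * pi * (0:23) / 24) + rep(c(80, -80), 12)
)

# Two geoms share the same x and y mapping; layers are drawn in order,
# so the smoothed line sits on top of the raw line
p <- ggplot(generation) +
  geom_line(aes(x = datetime, y = large_hydro), col = "turquoise") +
  geom_smooth(aes(x = datetime, y = large_hydro))
```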

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Exercise 3: Total output per source

Task:
Create a column chart that shows the total output per source.

  • Change the color of the columns to "darkred".
  • Add a horizontal line indicating the mean output across all sources. Use the cheatsheet to identify the geom function that you need.
  • Add a meaningful title and axis labels using labs().

Exercise 3: Total output per source

Additional layers

Labels

  • We have already seen that the labs() layer can add titles (title=) and axis labels (x= and y=) to a plot.
  • It can also add subtitles (subtitle=) and captions (caption=).

Labels

Scales

  • Scales allow you to manipulate the relationship between data values and aesthetics beyond aesthetic mapping.
    • AKA: scales “control the details of how data values are translated to visual properties” (quote from reference guide).
  • For example:
    • scale_alpha: control transparency settings
    • scale_color and scale_fill: control color palettes and other aspects of data mapped to colors
    • scale_x and scale_y: control the position of axis markers
  • The cheatsheet is your friend!

Scales

Task:
Recreate line plot of imports over time, but label x-axis with hours rather than dates.

Steps:

  1. Check the cheatsheet!
  2. Supply imports data to ggplot().
  3. Add a geom_line() layer and map appropriate x and y variables.
  4. Add a scale_x_datetime() layer and use date_labels= and date_breaks= arguments to modify x-axis labels and breaks.
  5. Add a labs() layer.

Example: Imports over time
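
A sketch of the steps above. The imports data frame, its column names and the label text are assumptions; the format string "%H:%M" prints hours and minutes.

```r
library(ggplot2)

# Hypothetical stand-in for the imports data (48 hourly values, invented)
imports <- data.frame(
  datetime = as.POSIXct("2019-09-03 00:00:00", tz = "UTC") + 3600 * 0:47,
  imports  = 7000 + 500 * cos(2 * pi * (0:47) / 24)
)

p <- ggplot(imports) +
  geom_line(aes(x = datetime, y = imports)) +
  # date_labels sets the label format, date_breaks the spacing between breaks
  scale_x_datetime(date_labels = "%H:%M", date_breaks = "12 hours") +
  labs(title = "Energy imports over time", x = "Hour", y = "Imports")
```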

Themes: presets

ggplot comes with several preset themes that can be added as layers, including:

  • theme_grey(): default
  • theme_bw(): strip colors, including grey gradients
  • theme_dark() and theme_light(): change the background of the coordinate system
  • theme_minimal() and theme_void(): see reference guide for details

Example: Imports over time

Themes: manual

The theme() layer lets you modulate many components of the theme, including:

  • Angle, position, and other characteristics of axis labels (e.g. axis.text.x =).
  • Angle, position, and other characteristics of legends (e.g. legend.position =).
  • Characteristics of the background of the plot (e.g. plot.background =).
  • For a full list of the components that can be modified, see the help file for ggplot2::theme().

Example: Imports over time
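
A small sketch of a theme() layer, using a hypothetical imports data frame with assumed column names. The particular settings are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the imports data
imports <- data.frame(
  datetime = as.POSIXct("2019-09-03 00:00:00", tz = "UTC") + 3600 * 0:5,
  imports  = c(6900, 7100, 7300, 7200, 7000, 6800)
)

p <- ggplot(imports) +
  geom_line(aes(x = datetime, y = imports)) +
  theme(
    axis.text.x     = element_text(angle = 45, hjust = 1), # rotate tick labels
    plot.background = element_rect(fill = "grey95"),       # light background
    legend.position = "bottom" # no visible effect here: this plot has no legend
  )
```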

Coordinate system adjustment

  • coord_cartesian() is the default Cartesian coordinate system used; by explicitly calling this function, you can change the limits of the x and y axes from their defaults
  • coord_flip() swaps the x and y axes
  • coord_fixed() sets a fixed aspect ratio between the x and y axes
  • coord_trans() (renamed coord_transform() in recent ggplot2 releases) lets you transform the Cartesian coordinates using functions like sqrt or log
  • coord_polar() changes the coordinate system to polar coordinates rather than a Cartesian system

Example: coord_flip()
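
A minimal illustration of coord_flip() on a column chart. The totals data frame is hypothetical, with invented values.

```r
library(ggplot2)

# Hypothetical totals per source (values invented)
totals <- data.frame(
  source = c("solar", "wind", "nuclear"),
  output = c(300, 180, 55)
)

# The mapping is unchanged; coord_flip() swaps the axes at drawing time,
# so the columns become horizontal bars
p <- ggplot(totals) +
  geom_col(aes(x = source, y = output)) +
  coord_flip()
```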

Stats

  • There are several stat functions in ggplot that enable you to conduct statistical transformations of your data prior to plotting.
  • stat and geom layers can often be used interchangeably, as each stat layer has a default geom (set through its geom argument) and each geom layer has a default stat.
  • You may want to use stat layers or arguments to override default settings, or to use transformed versions of a variable in your plot.
  • For more, see Chapter 3.7 of the R for data science online textbook.
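
To illustrate the geom/stat pairing, the sketch below draws the same bar chart twice: once through geom_bar(), whose default stat counts rows, and once through stat_count(), whose default geom is a bar. The data frame is hypothetical.

```r
library(ggplot2)

# Hypothetical data: one row per observation
df <- data.frame(
  group = c("renewable", "renewable", "thermal", "hydro", "hydro", "hydro")
)

# geom_bar() defaults to stat = "count"
p_geom <- ggplot(df) + geom_bar(aes(x = group))

# stat_count() defaults to geom = "bar", so it builds the same layer
p_stat <- ggplot(df) + stat_count(aes(x = group))
```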

Visualizing grouped data

  • We often want to visualize multiple variables, or a variable that can be divided into groups, in one graphic.
  • There are two broad principles for visualizing grouped variables in ggplot:
    • Identify a grouping variable in a geom function using the group = argument, and demarcate the distinct objects created per group in some way.
    • Use facets to generate geometric objects in separate coordinate systems for each group.

Colors and fill

Task:
Create a line plot of energy output over time, with separate lines for each source.

Steps:

  1. Supply the data frame long_merged_energy to ggplot().
  2. Add a geom_line() layer and specify x=datetime and y=output.
  3. Also specify that group=source and col=source.
  4. Add a labs() layer.

Colors and fill
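
The steps above might look like this, assuming long_merged_energy has the columns datetime, source and output. The toy data and label text are invented for the sketch.

```r
library(ggplot2)

# Hypothetical stand-in for long_merged_energy
long_merged_energy <- data.frame(
  datetime = rep(as.POSIXct("2019-09-03 06:00:00", tz = "UTC") + 3600 * 0:3,
                 times = 3),
  source   = rep(c("solar", "wind", "imports"), each = 4),
  output   = c(0, 50, 400, 900, 300, 280, 260, 240, 700, 720, 690, 650)
)

# group = source draws one line per source; col = source colors each line
# and adds a legend
p <- ggplot(long_merged_energy) +
  geom_line(aes(x = datetime, y = output, group = source, col = source)) +
  labs(title = "Output by source over time", x = "Time", y = "Output")
```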

Exercise 4: Colors and fill

Task:
Create a line plot that compares generation of wind, solar, and geothermal energy over time.

Bonus: Set the line size to 1.5.

Exercise 4: Colors and fill

Customizing with scale_color layers

Colors and fill

  • col= is used to color objects like lines and points
  • fill= is used to color objects like columns and histograms
    • col= will create colored outlines for such objects
  • After some trial and error, this will become intuitive

Position adjustment

  • Once we introduce a grouping variable to a geom function, we generate multiple geometric objects on the same plot.
  • Colors and fill help us distinguish each object from other objects.
  • Another way to distinguish objects, often used together with color/fill, is position adjustment.

Example: Energy use by day

Task:
Create a column chart of energy use by day, grouped by source.

Steps:

  1. Modify long_merged_energy to summarize output by date and source.
    1. Use mutate() to create a date variable from datetime.
    2. Use group_by() and summarize() to calculate total output per date per source.
  2. Supply modified data frame to ggplot()
  3. Add a geom_col() layer, supply x and y aesthetics.
  4. Specify group=source and fill=source.
  5. Add a labs() layer.

Example: Energy use by day
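
A sketch of the steps above, under the assumption that long_merged_energy has datetime, source and output columns. The toy data is hypothetical, and as.Date() is one of several ways to derive the date.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for long_merged_energy (two days, two sources)
long_merged_energy <- data.frame(
  datetime = rep(as.POSIXct(c("2019-09-03 06:00:00", "2019-09-03 18:00:00",
                              "2019-09-04 06:00:00", "2019-09-04 18:00:00"),
                            tz = "UTC"), times = 2),
  source   = rep(c("solar", "wind"), each = 4),
  output   = c(100, 300, 120, 320, 200, 150, 210, 160)
)

# Step 1: total output per date per source
daily <- long_merged_energy %>%
  mutate(date = as.Date(datetime)) %>%
  group_by(date, source) %>%
  summarize(output = sum(output), .groups = "drop")

# Steps 2 to 5: columns are stacked by source by default
p <- ggplot(daily) +
  geom_col(aes(x = date, y = output, group = source, fill = source)) +
  labs(title = "Energy use by day", x = "Date", y = "Output")
```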

Example: Energy use by day

  • This column chart is useful if our main goal is to examine total output over time across sources.
  • But what if our main goal is to compare trends across sources?
    • We can use position="dodge" in geom_col() to ‘unstack’ the columns for each source.
  • What if our goal is to compare what portion of output per day each source comprises?
    • We can use position="fill" to normalize height of stacked columns.

Example: Energy use by day (pos dodge)

Example: Energy use by day (pos fill)
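
Both position adjustments can be sketched on the same summarized data. The daily data frame below is a hypothetical stand-in with invented values.

```r
library(ggplot2)

# Hypothetical daily totals per source
daily <- data.frame(
  date   = rep(as.Date(c("2019-09-03", "2019-09-04")), each = 2),
  source = rep(c("solar", "wind"), times = 2),
  output = c(400, 350, 440, 370)
)

# position = "dodge": columns for each source sit side by side
p_dodge <- ggplot(daily) +
  geom_col(aes(x = date, y = output, fill = source), position = "dodge")

# position = "fill": stacked columns are rescaled to a common height of 1,
# so each segment shows a share of the daily total
p_fill <- ggplot(daily) +
  geom_col(aes(x = date, y = output, fill = source), position = "fill")
```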

Shapes and linetypes

  • Colors/fill and position adjustment are the most common and intuitive way to distinguish geometric objects from each other.
  • Shapes and linetypes are two other ways.
  • The idea here is to modify what the object actually looks like.

Example: Energy source group, by day

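Task:
Plot energy output by day for each source group (the “group” variable in regroup), distinguishing groups by shape or linetype.
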
Steps:

  1. Prepare data frame that summarizes output by date and “group” from regroup.
  2. Supply modified data frame to ggplot().
  3. Two choices!
    1. Add geom_line() and geom_point() with shape=group.
    2. Add geom_line() with linetype=group.

Example: Energy source group by day

## # A tibble: 6 x 3
## # Groups:   date [1]
##   date       group      output
##   <date>     <chr>       <dbl>
## 1 2019-09-03 hydro      93919.
## 2 2019-09-03 imports   169826.
## 3 2019-09-03 nuclear    53926.
## 4 2019-09-03 other          0 
## 5 2019-09-03 renewable 183304.
## 6 2019-09-03 thermal   313436.

Example: Energy source group, by day
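
The two choices in step 3 can be sketched as below. The daily_groups data frame is a hypothetical stand-in for the summarized data, with invented values and only two groups.

```r
library(ggplot2)

# Hypothetical daily output per source group
daily_groups <- data.frame(
  date   = rep(as.Date(c("2019-09-03", "2019-09-04", "2019-09-05")), times = 2),
  group  = rep(c("renewable", "thermal"), each = 3),
  output = c(183, 176, 190, 313, 305, 298)
)

# Choice 1: lines plus points, with point shape varying by group
p_shape <- ggplot(daily_groups) +
  geom_line(aes(x = date, y = output, group = group)) +
  geom_point(aes(x = date, y = output, shape = group))

# Choice 2: lines only, with linetype varying by group
p_linetype <- ggplot(daily_groups) +
  geom_line(aes(x = date, y = output, linetype = group))
```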

Sizes and alpha

  • size= and alpha= inside aes() modulate the size and transparency of geom objects based on some data value.
  • They are particularly useful when you want to distinguish objects based on some continuous variable.
  • Let’s return to the Gapminder data for an example.

Example: Life expectancy over GDP
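
A sketch of size= and alpha= mapped inside aes(). The gapminder07 data frame is a hypothetical stand-in with invented values, and the choice of which variable drives size or transparency is arbitrary here.

```r
library(ggplot2)

# Hypothetical stand-in for gapminder07 (values invented)
gapminder07 <- data.frame(
  gdpPercap = c(900, 4500, 12000, 30000, 42000),
  lifeExp   = c(52.1, 64.3, 71.8, 78.9, 80.6),
  pop       = c(12e6, 85e6, 190e6, 60e6, 300e6)
)

# Point size follows population; transparency follows life expectancy.
# Both are inside aes() because they vary with the data.
p <- ggplot(gapminder07) +
  geom_point(aes(x = gdpPercap, y = lifeExp, size = pop, alpha = lifeExp))
```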

Exercise 5: Average hourly output by source

Task:
Visualize the average output for each hour of the day, grouped by source.

You need to identify the output per source per hour (e.g. 01:00, 02:00, etc) averaged over all days.

  • You will need to prepare your data using both dplyr and lubridate functions.
  • You can choose which geom(s) to use, and how to demarcate groups.
  • Bonus: use a scale layer to set a color palette (try "Set3") and change the legend name.
  • Remember to add labs()!

Exercise 5: Average hourly output by source

Facets

  • So far we have been visualizing grouped data by changing the appearance or position of the objects.
  • A different approach is to use facets, which create a separate panel, each with its own coordinate system, for each group.

Example: Comparing generation patterns

Task:
Compare energy generation over time, across sources.

How do we do this? Using what we’ve learned so far:

  1. Supply long_gen to ggplot().
  2. Add a geom_line() layer, setting x=datetime and y=output in aes().
  3. Let’s set group=source and col=source in aes().

Example: Comparing generation patterns

Is this a helpful plot?

  • Not if our goal is to compare the patterns of each source.
  • It’s too noisy!
  • Instead of setting col= in aes(), let’s add a facet_wrap() layer.
  • Specifically, facet_wrap(~source), i.e. tell ggplot to “facet by source”.

Example: Comparing generation patterns
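
A minimal sketch of faceting by source, assuming long_gen has datetime, source and output columns. The toy data is hypothetical.

```r
library(ggplot2)

# Hypothetical stand-in for long_gen (three sources, four hours each)
long_gen <- data.frame(
  datetime = rep(as.POSIXct("2019-09-03 06:00:00", tz = "UTC") + 3600 * 0:3,
                 times = 3),
  source   = rep(c("solar", "wind", "nuclear"), each = 4),
  output   = c(0, 50, 400, 900, 300, 280, 260, 240, 2200, 2200, 2210, 2205)
)

# facet_wrap(~source) draws one panel per source instead of coloring the lines
p <- ggplot(long_gen) +
  geom_line(aes(x = datetime, y = output, group = source)) +
  facet_wrap(~source)
```

By default all panels share the same axis ranges. facet_wrap() also takes a scales= argument (for example scales = "free_y") when sources differ greatly in magnitude.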

Exercise 6: Facets

Task:
Compare generation patterns for each source in facets. Color the lines using the “group” variable in regroup.

Remember:

  • You will need to prepare your data. Think about the variables that you need: source/type, output, and group.
  • When you pipe modified data into ggplot, remember that you are grouping in two ways: source/type through facets, and group through color.

Exercise 6: Facets

##     type            datetime   output     group
## 1 biogas 2019-09-03 00:00:00 238.9167 renewable
## 2 biogas 2019-09-03 01:00:00 239.0000 renewable
## 3 biogas 2019-09-03 02:00:00 239.0000 renewable
## 4 biogas 2019-09-03 03:00:00 238.9167 renewable
## 5 biogas 2019-09-03 04:00:00 237.9167 renewable
## 6 biogas 2019-09-03 05:00:00 237.1667 renewable

Exercise 6: Facets

Final notes

Comment your code

Always remember to comment your code!

When writing particularly complex code in dplyr or ggplot2, this includes commenting within a flow of %>% or + operators. See the lecture notes for some examples of this.

Saving images

Three ways to save images created by ggplot:

  • Assign the result of your ggplot code to a named object, which can be saved like other objects in your R workspace
  • Use the “save” icon in the Plots tab of RStudio
  • Call ggsave() after your plotting code; it is a regular function rather than a layer, and by default it saves the last plot displayed
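
A short sketch of saving a plot with ggsave(). It is called as an ordinary function once the plot exists, not chained with +. The data, the temporary file path and the dimensions are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical data for a throwaway plot
df <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))

# Assigning the plot to a name lets us pass it to ggsave() explicitly
p <- ggplot(df) +
  geom_point(aes(x = x, y = y))

# The file extension selects the output format; width and height are in inches
out_file <- tempfile(fileext = ".pdf")
ggsave(filename = out_file, plot = p, width = 6, height = 4)
```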

Break!

This is the end of the R lecture sessions.

After a short break, we will return for a cumulative exercise that combines the skills you have learned since Monday.

Final Exercise

The instructions are in a markdown file in the exercises folder. Create a new RMarkdown file (save it using this naming convention: FinalRExercise_LastnameFirstname.Rmd), in which you will complete the exercise.

You can work in small groups, but write up your code separately.

Raise your hand if you need help!

Optional Submissions

You’ve created several RMarkdown files over the past three days. Since you stored these in a forked repo, it is possible to create a pull request and ‘submit’ these changes to the base repo. This is optional, but gives you a chance to explore Github functionality and share your work with your classmates.

Before you submit your completed exercises, move every new file in your repo to the submissions folder. This ensures that we won’t inadvertently make changes to the session materials. Then, create a new pull request, asking to merge changes from your fork to the base repository.