Part of the tidyverse, ggplot2 is an R package that implements the “grammar of graphics”.
In ggplot, graphics are built by supplying data, mapping data values to aesthetics, and then adding layers that build geometric objects, scales, labels, and more.
dplyr and other tidyverse packages.
imports, generation, merged_energy, long_gen and long_merged_energy in your Environment, read in gapminder and recreate gapminder07.
ggplot() function
geom functions, such as geom_point() or geom_histogram()
aes() function nested within the chosen geom function
+ operator
Task:
Create a scatterplot of lifeExp against gdpPercap from the gapminder07 data.
Steps:
Supply the gapminder07 data frame as input to the ggplot() function.
Choose a geom; in this case we choose geom_point() to generate points.
Pass the aesthetics, i.e. the x and y values, to geom_point().
Using the gapminder07 data, create a scatterplot of the natural log of gdpPercap as a function of the natural log of pop. Give it a title and axis labels.
Remember, you will need three functions: ggplot(), geom_point(), and labs().
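The steps for the first scatterplot (lifeExp against gdpPercap) can be sketched roughly as follows. Since gapminder07 is built in an earlier session, this sketch assumes a small made-up data frame, toy_gap, with the same column names; the values are invented for illustration only.

```r
library(ggplot2)

# Hypothetical stand-in for gapminder07: made-up rows, not real data
toy_gap <- data.frame(
  country   = c("A", "B", "C", "D"),
  gdpPercap = c(800, 4500, 12000, 36000),
  lifeExp   = c(52, 64, 73, 80)
)

p_scatter <- ggplot(toy_gap) +                         # supply the data frame
  geom_point(aes(x = gdpPercap, y = lifeExp)) +        # choose a geom, map x and y
  labs(title = "Life expectancy vs. GDP per capita",   # title and axis labels
       x = "GDP per capita", y = "Life expectancy")

p_scatter
```

Swapping toy_gap for gapminder07 should give the plot the task asks for, provided the column names match.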
We covered the pipe operator (%>%) in dplyr this morning.
dplyr and ggplot2 are designed to work well together.
Task:
Plot a column chart of total energy generated over time.
Steps:
Start with long_gen.
Add geom_col() and map appropriate x and y variables.
Task:
Plot a column chart of hydroelectric power generated over time.
Hint: There are two types of hydroelectric sources in the data: large_hydro and small_hydro.
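One possible way to combine dplyr and ggplot2 for the hydroelectric task is sketched below. It assumes a made-up data frame, toy_gen, shaped like long_gen (columns datetime, source and output); the numbers are invented, and the real data may need different handling.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for long_gen: made-up hourly output for two sources
toy_gen <- data.frame(
  datetime = as.POSIXct("2019-09-03 00:00:00", tz = "UTC") + 3600 * rep(0:2, times = 2),
  source   = rep(c("large_hydro", "small_hydro"), each = 3),
  output   = c(10, 12, 11, 2, 3, 2)
)

hydro_total <- toy_gen %>%
  filter(source %in% c("large_hydro", "small_hydro")) %>%  # keep both hydro sources
  group_by(datetime) %>%
  summarize(output = sum(output))                          # one total per hour

p_hydro <- hydro_total %>%
  ggplot() +
  geom_col(aes(x = datetime, y = output)) +
  labs(title = "Total hydroelectric generation (toy data)",
       x = "Hour", y = "Output (MW)")

p_hydro
```

Summarizing before plotting keeps the pipeline explicit: the data frame handed to ggplot() already holds one row per column to be drawn.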
We have already seen examples of two kinds of plots:
geom_point()
geom_col()
Let’s see a few more.
geom_line()
aes().
aes() function.
size= and col=.
geom_area()
geom_boxplot()
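As a rough illustration of the last two geoms in this list, the sketch below draws a stacked area chart and a boxplot from the same made-up long-format data frame. The name toy_long and its values are assumptions for illustration, not part of the session data.

```r
library(ggplot2)

# Hypothetical long-format data: made-up output for two sources over four hours
toy_long <- data.frame(
  hour   = rep(0:3, times = 2),
  source = rep(c("solar", "wind"), each = 4),
  output = c(0, 5, 9, 4, 6, 7, 5, 8)
)

# Stacked areas: one filled band per source
p_area <- ggplot(toy_long) +
  geom_area(aes(x = hour, y = output, fill = source))

# Boxplots: distribution of output within each source
p_box <- ggplot(toy_long) +
  geom_boxplot(aes(x = source, y = output))

p_area
p_box
```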
Task:
Plot a line of large hydro generation over time, and a smoothed line of the same relationship on top of it.
Steps:
Pass the generation data to ggplot().
Add geom_line() with appropriate x and y aesthetics. Make it turquoise for fun.
Add geom_smooth() with the same x and y aesthetics.
geom_smooth() plots smoothed conditional means (here estimated using a loess regression, the default for fewer than 1,000 observations).
generation %>%
  ggplot() +
  # raw hourly series, drawn in turquoise
  geom_line(aes(x=datetime, y=large_hydro), col="turquoise3") +
  # smoothed conditional mean of the same series
  geom_smooth(aes(x=datetime, y=large_hydro)) +
  labs(title="Hydroelectric (large) generation per hour, Sept 3-9", x="Hour", y="Output (MW)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Task:
Create a column chart that shows the total output per source.
Make the columns "darkred".
Identify the geom function that you need.
Add labels with labs().
The labs() layer can add titles (title=) and axis labels (x= and y=) to a plot.
It can also add subtitles (subtitle=) and captions (caption=).
scale_alpha: control transparency settings
scale_color and scale_fill: control color palettes and other aspects of data mapped to colors
scale_x and scale_y: control the position of axis markers
Task:
Recreate the line plot of imports over time, but label the x-axis with hours rather than dates.
Steps:
Pass the imports data to ggplot().
Add a geom_line() layer and map appropriate x and y variables.
Add a scale_x_datetime() layer and use the date_labels= and date_breaks= arguments to modify the x-axis labels and breaks.
Add a labs() layer.
ggplot comes with several preset themes that can be added as layers, including:
theme_grey(): default
theme_bw(): strip colors, including grey gradients
theme_dark() and theme_light(): change the background of the coordinate system
theme_minimal() and theme_void(): see reference guide for details
The theme() layer lets you modulate many components of the theme, including:
Axis text (axis.text.x =).
Legend position (legend.position =).
Plot background (plot.background =).
For the full list, see ggplot2::theme().
imports %>%
  ggplot() +
  geom_line(aes(x=datetime, y=imports), col="red") +
  # show hours (HH:MM) on the x-axis, with a break every 12 hours
  scale_x_datetime(date_labels="%H:%M", date_breaks="12 hours") +
  labs(title="Energy imports over time in California", subtitle="Hourly data from September 3-9, 2018", x="Hour", y="Amount imported (MW)") +
  # rotate the x-axis tick labels so they do not overlap
  theme(axis.text.x=element_text(angle=45, hjust=1, size=12))
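Preset themes and theme() can be combined. The sketch below, on a made-up toy_imports data frame (an assumption for illustration), applies theme_minimal() first and then adjusts a single component. The order matters: a complete theme such as theme_minimal() replaces any theme() settings added before it.

```r
library(ggplot2)

# Hypothetical stand-in for imports: 24 made-up hourly values
toy_imports <- data.frame(
  datetime = as.POSIXct("2019-09-03 00:00:00", tz = "UTC") + 3600 * 0:23,
  imports  = 7000 + 500 * sin(0:23 / 4)
)

p_theme <- ggplot(toy_imports) +
  geom_line(aes(x = datetime, y = imports)) +
  scale_x_datetime(date_labels = "%H:%M", date_breaks = "6 hours") +
  theme_minimal() +                                          # preset theme first
  theme(axis.text.x = element_text(angle = 45, hjust = 1))   # then override one piece

p_theme
```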
coord_cartesian() is the default Cartesian coordinate system used; by explicitly calling this function, you can change the limits of the x and y axes from their defaults
coord_fixed() sets a fixed aspect ratio between the x and y axes
coord_transform() lets you transform the Cartesian coordinates using functions like sqrt or log
coord_polar() changes the coordinate system to polar coordinates rather than a Cartesian system
There are stat functions in ggplot that enable you to conduct statistical transformations of your data prior to plotting.
stat and geom layers can be used interchangeably, as each stat layer has a geom argument and vice versa.
You can group data within a geom() using the group = argument, and demarcate the distinct objects created per group in some way.
Task:
Create a line plot of energy output over time, with separate lines for each source.
Steps:
Pass long_merged_energy to ggplot().
Add a geom_line() layer and specify x=datetime and y=output.
Within aes(), also set group=source and col=source.
Add a labs() layer.
Task:
Create a line plot that compares generation of wind, solar, and geothermal energy over time.
Bonus: Set the line size to 1.5.
long_merged_energy %>%
  # keep only the three sources being compared
  filter(source %in% c("wind", "solar", "geothermal")) %>%
  ggplot() +
  geom_line(aes(x=datetime, y=output, group=source, col=source), size=1.5) +
  labs(title="Wind vs. Solar vs. Geothermal generation", subtitle="Hourly data from September 3-9, 2018", x="Hour", y="Output (MW)")
# Same plot, with a ColorBrewer palette and a renamed legend
long_merged_energy %>%
  filter(source %in% c("wind", "solar", "geothermal")) %>%
  ggplot() +
  geom_line(aes(x=datetime, y=output, group=source, col=source), size=1.5) +
  scale_color_brewer(palette="Accent", name="Energy source") +
  labs(title="Wind vs. Solar vs. Geothermal generation", subtitle="Hourly data from September 3-9, 2018", x="Hour", y="Output (MW)")
col= is used to color objects like lines and points
fill= is used to color objects like columns and histograms; col= will create colored outlines for such objects
By grouping within a geom function, we generate multiple geometric objects on the same plot.
Task:
Create a column chart of energy use by day, grouped by source.
Steps:
Use long_merged_energy to summarize output by date and source.
Use mutate() to create a date variable from datetime.
Use group_by() and summarize() to calculate total output per date per source.
Pipe the result to ggplot().
Add a geom_col() layer, supply x and y aesthetics.
Also supply group=source and fill=source.
Add a labs() layer.
Use position="dodge" in geom_col() to ‘unstack’ the columns for each source.
Use position="fill" to normalize the height of stacked columns.
Task:
Steps:
regroup.
ggplot().
geom_line() and geom_point() with shape=group.
geom_line() with linetype=group.
# Prepare data
long_merged_energy_regroup <- long_merged_energy %>%
  rename(type = source) %>%
  # attach the group label for each source type
  merge(regroup, by = "type") %>%
  mutate(date=lubridate::date(datetime)) %>%
  # total output per day for each group
  group_by(date, group) %>%
  summarise(output=sum(output))
# Take a look at our prepared data
head(long_merged_energy_regroup)
## # A tibble: 6 x 3
## # Groups: date [1]
## date group output
## <date> <chr> <dbl>
## 1 2019-09-03 hydro 93919.
## 2 2019-09-03 imports 169826.
## 3 2019-09-03 nuclear 53926.
## 4 2019-09-03 other 0
## 5 2019-09-03 renewable 183304.
## 6 2019-09-03 thermal 313436.
long_merged_energy_regroup %>%
  ggplot() +
  geom_line(aes(x=date, y=output, group=group, col=group), size=0.8) +
  geom_point(aes(x=date, y=output, group=group, shape=group)) +
  labs(title="Output by source group over time", subtitle="Data collected during September 3-9, 2018", x="Date", y="Output (MW)")
size= and alpha= inside aes() modulate the size and transparency of geom objects based on some data value.
gapminder07 %>%
  ggplot() +
  # point size follows population, color follows continent
  geom_point(aes(x=log(gdpPercap), y=lifeExp, size=pop, col=continent)) +
  scale_size_continuous(name="Population") +
  scale_color_discrete(name="Continent") +
  labs(title="Life expectancy as a function of GDP per capita in 2007", x="Logged GDP per capita", y="Life expectancy")
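To make the distinction between mapping and setting concrete, here is a small sketch on made-up data (toy_gap and its values are assumptions for illustration). Placing alpha= inside aes() ties transparency to a column. Placing it outside aes() applies one constant value to every point.

```r
library(ggplot2)

# Hypothetical stand-in for gapminder07: made-up rows, not real data
toy_gap <- data.frame(
  gdpPercap = c(800, 4500, 12000, 36000),
  lifeExp   = c(52, 64, 73, 80),
  pop       = c(5e6, 2e7, 6e7, 3e8)
)

# Mapped: alpha varies with pop, and a legend is drawn
p_mapped <- ggplot(toy_gap) +
  geom_point(aes(x = log(gdpPercap), y = lifeExp, alpha = pop))

# Set: every point gets the same transparency, and no legend is drawn
p_set <- ggplot(toy_gap) +
  geom_point(aes(x = log(gdpPercap), y = lifeExp), alpha = 0.5)

p_mapped
p_set
```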
Task:
Visualize the average output for each hour of the day, grouped by source.
You need to identify the output per source per hour (e.g. 01:00, 02:00, etc) averaged over all days.
Use dplyr and lubridate functions.
Decide which geom(s) to use, and how to demarcate groups.
Use a different color palette (e.g. "Set3") and change the legend name.
Don't forget labs()!
ex5 <- long_merged_energy %>%
  # extract the hour of the day (0-23) from each timestamp
  mutate(hour=lubridate::hour(datetime)) %>%
  group_by(hour, source) %>%
  # average over all days, as the task asks
  summarize(output=mean(output)) %>%
  ggplot() +
  geom_area(aes(x=hour, y=output, fill=factor(source))) +
  scale_fill_brewer(palette="Set3", name="Source") +
  labs(title="Average hourly output by source",
       subtitle="Data collected during September 3-9",
       x="Hour of the day", y="Output (MW)") +
  theme_bw()
Task:
Compare energy generation over time, across sources.
How do we do this? Using what we’ve learned so far:
Pass long_gen to ggplot().
Add a geom_line() layer, setting x=datetime and y=output in aes().
Set group=source and col=source in aes().
Is this a helpful plot?
Instead of using col= in aes(), let’s add a facet_wrap() layer.
Add facet_wrap(~source), i.e. tell ggplot to “facet by source”.
Task:
Compare generation patterns for each source in facets. Color the lines using the “group” variable in regroup.
Remember:
long_gen_regroup <- long_gen %>%
  rename(type = source) %>%
  merge(regroup, by="type")
head(long_gen_regroup)
## type datetime output group
## 1 biogas 2019-09-03 00:00:00 238.9167 renewable
## 2 biogas 2019-09-03 01:00:00 239.0000 renewable
## 3 biogas 2019-09-03 02:00:00 239.0000 renewable
## 4 biogas 2019-09-03 03:00:00 238.9167 renewable
## 5 biogas 2019-09-03 04:00:00 237.9167 renewable
## 6 biogas 2019-09-03 05:00:00 237.1667 renewable
long_gen_regroup %>%
  ggplot() +
  geom_line(aes(x=datetime, y=output, group=group, col=group), size=1) +
  scale_color_brewer(palette="Set1", name="Type of energy source") +
  facet_wrap(~type, scales="free") +
  labs(title="Generation over time, by energy source", subtitle="Hourly data from September 3-9, 2018", x="Hour", y="Output (MW)") +
  theme(legend.position = "bottom")
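The simpler “facet by source” version described earlier, without the regroup coloring, might look like the sketch below. It runs on a made-up toy_long data frame (an assumption for illustration). The scales = "free_y" argument gives each panel its own y-axis, which helps when sources differ widely in magnitude.

```r
library(ggplot2)

# Hypothetical stand-in for long_gen: made-up output for three sources
toy_long <- data.frame(
  datetime = as.POSIXct("2019-09-03 00:00:00", tz = "UTC") + 3600 * rep(0:3, times = 3),
  source   = rep(c("solar", "wind", "nuclear"), each = 4),
  output   = c(0, 50, 90, 40, 600, 700, 500, 800, 2200, 2250, 2240, 2260)
)

p_facet <- ggplot(toy_long) +
  geom_line(aes(x = datetime, y = output)) +
  facet_wrap(~source, scales = "free_y")   # one panel per source

p_facet
```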
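Three ways to save images created by ggplot:
ggsave(), called after your plotting code; by default it saves the last plot displayed. (In current versions of ggplot2, ggsave() can no longer be added to a plot with +.)
This is the end of the R lecture sessions.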
After a short break, we will return for a cumulative exercise that combines the skills you have learned since Monday.
The instructions are in a markdown file in the exercises folder. Create a new RMarkdown file (save it using this naming convention: FinalRExercise_LastnameFirstname.Rmd), in which you will complete the exercise.
You can work in small groups, but write up your code separately.
Raise your hand if you need help!
You’ve created several RMarkdown files over the past three days. Since you stored these in a forked repo, it is possible to create a pull request and ‘submit’ these changes to the base repo. This is optional, but gives you a chance to explore GitHub functionality and share your work with your classmates.
Before you submit your completed exercises, move every new file in your repo to the submissions folder. This ensures that we won’t inadvertently make changes to the session materials. Then, create a new pull request, asking to merge changes from your fork to the base repository.
Comment your code
Always remember to comment your code!
When writing particularly complex code in dplyr or ggplot2, this includes commenting within a flow of %>% or + operators. See the lecture notes for some examples of this.