Intro to R: Loops, Conditionals, Functions

Richard Paquin Morel

2019-09-16

Remember our data from this morning…

##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134

gapminder dataset

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

gapminder dataset

  • The dataset covers 142 countries on 5 continents at five-year intervals between 1952 and 2007
  • Let’s fix a few things about the data: the character variables are currently stored as factors, which doesn’t make sense here!
  • Find the factor variables and change them to character vectors
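A minimal sketch of that conversion, assuming a small stand-in data frame called toy (its name and values are invented for illustration, not taken from gapminder):

```r
# Toy stand-in for gapminder; values are made up for illustration
toy <- data.frame(
  country   = c("A", "A", "B", "B"),
  continent = c("X", "X", "Y", "Y"),
  year      = c(1952L, 1957L, 1952L, 1957L),
  lifeExp   = c(40, 42, 70, 72),
  stringsAsFactors = TRUE  # mimic the factor columns shown by str()
)

# Flag the factor columns, then convert only those to character
is_fct <- sapply(toy, is.factor)
toy[is_fct] <- lapply(toy[is_fct], as.character)

str(toy)
```

The same two lines should work on the full dataset, since they only touch columns that is.factor() flags.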

gapminder dataset

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

gapminder dataset

  • What if we want to know the mean life expectancy by country? Or by year?
  • One approach:
## [1] 37.47883
## [1] 68.43292
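The two numbers above look like one mean computed per country, one line at a time. A sketch of that pattern, using an invented stand-in data frame (toy and its values are assumptions, not the gapminder figures):

```r
# Toy stand-in for gapminder; values are made up for illustration
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  lifeExp = c(40, 42, 70, 72),
  stringsAsFactors = FALSE
)

# One line per country: subset lifeExp with a logical test, then average
mean_a <- mean(toy$lifeExp[toy$country == "A"])
mean_b <- mean(toy$lifeExp[toy$country == "B"])

print(mean_a)
print(mean_b)
```

With 142 countries this means 142 nearly identical lines, which is the typing a loop avoids.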

gapminder dataset

  • Or we could save ourselves a lot of typing and time

  • Loops!

Don’t repeat yourself–loop it!

Loops in R – for, while, and the apply family

  • Loops allow you to repeat a function over a set of values or while a condition holds
  • There are several flavors
    • for
    • while
    • the apply family (often preferred to loops because the code is cleaner)
  • Sneak peek: dplyr and data.table offer better approaches to loops

A word of caution about loops

  • It is often a good idea to avoid loops when possible in R
    • They are slow!
  • Try to find other solutions before looping

The anatomy of a for loop

  • for loops repeat a function for all values in a vector – don’t cut and paste!
  • The basic form is: for (i in vector) { function(i) }
    • i is the iterator variable (any valid variable name works, not just i!)
    • The loop repeats once for each value in the vector, and i takes that value on each iteration
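A small sketch of the basic form, not tied to gapminder. The names squares and i are arbitrary choices for illustration:

```r
# Pre-allocate a vector to hold one result per iteration
squares <- numeric(5)

# i takes the values 1, 2, 3, 4, 5 in turn
for (i in 1:5) {
  squares[i] <- i^2
  print(squares[i])
}
```

Pre-allocating squares, rather than growing it inside the loop, is a common habit that keeps loops from getting slower than they need to be.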

The anatomy of a for loop

  • Let’s recover the GDP for each country
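One way to do this with a for loop, assuming GDP is recovered as pop * gdpPercap. The three rows are the first Afghanistan rows shown earlier; the data frame name gap is an assumption:

```r
# First three Afghanistan rows from the table shown earlier
gap <- data.frame(
  country   = "Afghanistan",
  year      = c(1952, 1957, 1962),
  pop       = c(8425333, 9240934, 10267083),
  gdpPercap = c(779.4453, 820.8530, 853.1007),
  stringsAsFactors = FALSE
)

# Create an empty column, then fill it one row at a time
gap$gdp <- NA_real_
for (i in seq_len(nrow(gap))) {
  gap$gdp[i] <- gap$pop[i] * gap$gdpPercap[i]
}

print(gap)
```

seq_len(nrow(gap)) is used instead of 1:nrow(gap) so the loop body is skipped cleanly if the data frame ever has zero rows.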

You try

Create two new variables that hold the natural log (log) of GDP per capita and of population - call them log_gdpPercap and log_pop

Log gdp and log population

##       country year      pop continent lifeExp gdpPercap         gdp
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453  6567086330
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530  7585448670
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007  8758855797
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971  9648014150
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811  9678553274
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134 11697659231
##   log_gdpPercap  log_pop
## 1      6.658583 15.94675
## 2      6.710344 16.03915
## 3      6.748878 16.14445
## 4      6.728864 16.26115
## 5      6.606625 16.38655
## 6      6.667101 16.51555

Avoid loops when possible

  • Loops are useful, but slow
  • Avoid when possible, especially when there is a vectorized function you can use
## [1] TRUE
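The TRUE above presumably comes from checking that a loop and a vectorized call give the same answer. A sketch of that comparison, assuming all.equal() as the check and reusing the first Afghanistan rows (the data frame name gap is an assumption):

```r
gap <- data.frame(
  gdpPercap = c(779.4453, 820.8530, 853.1007)
)

# Loop version: take the log one row at a time
gap$log_gdpPercap <- NA_real_
for (i in seq_len(nrow(gap))) {
  gap$log_gdpPercap[i] <- log(gap$gdpPercap[i])
}

# Vectorized version: log() already works on a whole column at once
gap$vec_log_gdpPercap <- log(gap$gdpPercap)

# Same result, far less code
same <- all.equal(gap$log_gdpPercap, gap$vec_log_gdpPercap)
print(same)
```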

Let’s try something a bit more substantive

  • Has life expectancy increased over time?
  • Find the mean life expectancy by year
## [1] "1952: 49.0576197183099"
## [1] "1957: 51.5074011267606"
## [1] "1962: 53.6092490140845"
## [1] "1967: 55.6782895774648"
## [1] "1972: 57.6473864788732"
## [1] "1977: 59.5701574647887"
## [1] "1982: 61.5331971830986"
## [1] "1987: 63.2126126760563"
## [1] "1992: 64.160338028169"
## [1] "1997: 65.014676056338"
## [1] "2002: 65.6949225352113"
## [1] "2007: 67.0074225352113"
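A sketch of a loop that could produce output in this shape, looping over the unique years and printing one mean per year. The toy data frame and its values are invented for illustration:

```r
# Toy stand-in for gapminder; values are made up for illustration
toy <- data.frame(
  year    = c(1952, 1952, 1957, 1957),
  lifeExp = c(40, 70, 42, 72)
)

year_means <- numeric(0)
for (yr in unique(toy$year)) {
  # Mean life expectancy across all rows for this year
  m <- mean(toy$lifeExp[toy$year == yr])
  year_means[as.character(yr)] <- m
  print(paste0(yr, ": ", m))
}
```

Storing the means in year_means as well as printing them makes the results available for later use.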

Mean life expectancy by continent

  • Try the same, this time for continents!
    • Which continent has the highest mean life expectancy?

Mean life expectancy by continent

## [1] "Asia: 60.0649032323232"
## [1] "Europe: 71.9036861111111"
## [1] "Africa: 48.8653301282051"
## [1] "Americas: 64.6587366666667"
## [1] "Oceania: 74.3262083333333"

double up!

  • It is possible to make nested for loops by defining different iterators
  • What is the mean life expectancy for each continent for each year?
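A sketch of a nested loop for this question, with cont as the outer iterator and yr as the inner one. The toy data frame, its values, and the res matrix are assumptions for illustration:

```r
# Toy stand-in for gapminder; values are made up for illustration
toy <- data.frame(
  continent = c("X", "X", "X", "X", "Y", "Y"),
  year      = c(1952, 1957, 1952, 1957, 1952, 1957),
  lifeExp   = c(40, 42, 50, 54, 70, 72),
  stringsAsFactors = FALSE
)

conts <- unique(toy$continent)
yrs   <- unique(toy$year)

# Matrix to hold one mean per continent-year pair
res <- matrix(NA_real_, nrow = length(conts), ncol = length(yrs),
              dimnames = list(conts, as.character(yrs)))

for (cont in conts) {          # outer loop: continents
  print(paste0("Continent: ", cont))
  for (yr in yrs) {            # inner loop: years within a continent
    m <- mean(toy$lifeExp[toy$continent == cont & toy$year == yr])
    res[cont, as.character(yr)] <- m
    print(paste0(yr, ": ", m))
  }
}
```

Swapping mean() for another summary function such as sd() reuses the same structure.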

the nested for loop

## [1] "Continent: Asia"
## [1] "1952: 46.3143939393939"
## [1] "1957: 49.3185442424242"
## [1] "1962: 51.563223030303"
## [1] "1967: 54.66364"
## [1] "1972: 57.3192690909091"
## [1] "1977: 59.6105563636364"
## [1] "1982: 62.6179393939394"
## [1] "1987: 64.8511818181818"
## [1] "1992: 66.5372121212121"
## [1] "1997: 68.0205151515152"
## [1] "2002: 69.2338787878788"
## [1] "2007: 70.7284848484849"
## [1] "Continent: Europe"
## [1] "1952: 64.4085"
## [1] "1957: 66.7030666666667"
## [1] "1962: 68.5392333333333"
## [1] "1967: 69.7376"
## [1] "1972: 70.7750333333333"
## [1] "1977: 71.9377666666667"
## [1] "1982: 72.8064"
## [1] "1987: 73.6421666666667"
## [1] "1992: 74.4401"
## [1] "1997: 75.5051666666667"
## [1] "2002: 76.7006"
## [1] "2007: 77.6486"
## [1] "Continent: Africa"
## [1] "1952: 39.1355"
## [1] "1957: 41.2663461538462"
## [1] "1962: 43.3194423076923"
## [1] "1967: 45.3345384615385"
## [1] "1972: 47.4509423076923"
## [1] "1977: 49.5804230769231"
## [1] "1982: 51.5928653846154"
## [1] "1987: 53.3447884615385"
## [1] "1992: 53.6295769230769"
## [1] "1997: 53.5982692307692"
## [1] "2002: 53.3252307692308"
## [1] "2007: 54.8060384615385"
## [1] "Continent: Americas"
## [1] "1952: 53.27984"
## [1] "1957: 55.96028"
## [1] "1962: 58.39876"
## [1] "1967: 60.41092"
## [1] "1972: 62.39492"
## [1] "1977: 64.39156"
## [1] "1982: 66.22884"
## [1] "1987: 68.09072"
## [1] "1992: 69.56836"
## [1] "1997: 71.15048"
## [1] "2002: 72.42204"
## [1] "2007: 73.60812"
## [1] "Continent: Oceania"
## [1] "1952: 69.255"
## [1] "1957: 70.295"
## [1] "1962: 71.085"
## [1] "1967: 71.31"
## [1] "1972: 71.91"
## [1] "1977: 72.855"
## [1] "1982: 74.29"
## [1] "1987: 75.32"
## [1] "1992: 76.945"
## [1] "1997: 78.19"
## [1] "2002: 79.74"
## [1] "2007: 80.7195"

nested for loop exercise!

  • Has the gap in life expectancy between countries on different continents narrowed over time?

nested for loop exercise!

  • What is the standard deviation (sd) for life expectancy for each continent for each year?

nested for loop exercise!

## [1] "Continent: Asia"
## [1] "1952: 9.29175069597824"
## [1] "1957: 9.63542861940215"
## [1] "1962: 9.82063194066467"
## [1] "1967: 9.65096458232544"
## [1] "1972: 9.72270004073083"
## [1] "1977: 10.0221969818167"
## [1] "1982: 8.53522140873991"
## [1] "1987: 8.20379188414779"
## [1] "1992: 8.07554897033932"
## [1] "1997: 8.09117060876087"
## [1] "2002: 8.37459538857541"
## [1] "2007: 7.96372447069057"
## [1] "Continent: Europe"
## [1] "1952: 6.36108825405387"
## [1] "1957: 5.29580539238584"
## [1] "1962: 4.30249955966524"
## [1] "1967: 3.79972849846788"
## [1] "1972: 3.2405763693743"
## [1] "1977: 3.12102997680124"
## [1] "1982: 3.21826029893856"
## [1] "1987: 3.16968033940696"
## [1] "1992: 3.20978108986074"
## [1] "1997: 3.10467655135052"
## [1] "2002: 2.92217957861169"
## [1] "2007: 2.9798126601609"
## [1] "Continent: Africa"
## [1] "1952: 5.1515814343277"
## [1] "1957: 5.62012285430095"
## [1] "1962: 5.87536393337021"
## [1] "1967: 6.08267262744012"
## [1] "1972: 6.41625832389558"
## [1] "1977: 6.80819741006083"
## [1] "1982: 7.37594008904693"
## [1] "1987: 7.86408910830706"
## [1] "1992: 9.46107098639753"
## [1] "1997: 9.10338657543333"
## [1] "2002: 9.58649585045544"
## [1] "2007: 9.63078067196179"
## [1] "Continent: Americas"
## [1] "1952: 9.32608188397822"
## [1] "1957: 9.03319227681997"
## [1] "1962: 8.50354373815215"
## [1] "1967: 7.90917103705144"
## [1] "1972: 7.32301680161029"
## [1] "1977: 7.06949561543585"
## [1] "1982: 6.72083381905351"
## [1] "1987: 5.80192884249138"
## [1] "1992: 5.16710380580843"
## [1] "1997: 4.88758389629614"
## [1] "2002: 4.7997054986044"
## [1] "2007: 4.44094763085538"
## [1] "Continent: Oceania"
## [1] "1952: 0.190918830920365"
## [1] "1957: 0.0494974746830535"
## [1] "1962: 0.219203102167821"
## [1] "1967: 0.296984848098351"
## [1] "1972: 0.0282842712474663"
## [1] "1977: 0.898025612106913"
## [1] "1982: 0.636396103067887"
## [1] "1987: 1.4142135623731"
## [1] "1992: 0.869741340859456"
## [1] "1997: 0.905096679918782"
## [1] "2002: 0.890954544295053"
## [1] "2007: 0.729027091403335"

for loops can be slow…very slow

  • Sometimes, people recommend using the apply family of functions as a faster alternative
    • There are some efficiency gains with smaller datasets
    • But, as Patrick Burns, the author of “R Inferno”, states, the apply family is “loop-hiding”
  • apply and its relatives help you write cleaner code, but do not expect much of a speed boost
    • do a search for “for loops versus apply in r” and get a taste for the debate

Three flavors: apply, lapply, sapply

  • Let’s look at these three (there are more!): apply, lapply, sapply

apply

  • apply(matrix, margin, function), where margin is 1 for rows or 2 for columns
  • Let’s say we want to find the mean for each stat in gapminder

##      lifeExp          pop    gdpPercap 
## 5.947444e+01 2.960121e+07 7.215327e+03
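A sketch of a column-wise apply call, using the first three Afghanistan rows from earlier rather than the full dataset (the object names stats and col_means are assumptions):

```r
# Numeric columns only: apply() works on a matrix
stats <- cbind(
  lifeExp   = c(28.801, 30.332, 31.997),
  pop       = c(8425333, 9240934, 10267083),
  gdpPercap = c(779.4453, 820.8530, 853.1007)
)

# Margin 2 = apply the function to each column
col_means <- apply(stats, 2, mean)
print(col_means)
```

For this particular task, base R also provides colMeans(), which gives the same numbers.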

apply versus for

##      lifeExp          pop    gdpPercap 
## 5.947444e+01 2.960121e+07 7.215327e+03
## [1] 59.47444
## [1] 29601212
## [1] 7215.327

lapply and sapply

  • Both lapply and sapply iterate over the elements of a vector or list, rather than over rows or columns
    • Generally, much more common to use in data analysis
  • lapply returns a list
  • sapply returns a simplified result (a vector or matrix, when possible)
    • Word of caution – sapply falls back to returning a list when it cannot simplify, so always check what you get back
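A short sketch of the difference in return types, using numeric columns from the first Afghanistan rows (the names gap, as_list, and as_vec are assumptions). A data frame is a list of columns, so both functions visit one column at a time:

```r
gap <- data.frame(
  lifeExp = c(28.801, 30.332, 31.997),
  pop     = c(8425333, 9240934, 10267083)
)

# lapply: always a list, one element per column
as_list <- lapply(gap, mean)

# sapply: same computation, simplified to a named numeric vector
as_vec <- sapply(gap, mean)

str(as_list)
str(as_vec)
```

Calling mean() on a character column returns NA with a warning, which would explain NA values for columns such as country and continent.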

lapply and sapply

  • lapply(vector, function)
## $country
## [1] NA
## 
## $year
## [1] 1979.5
## 
## $pop
## [1] 29601212
## 
## $continent
## [1] NA
## 
## $lifeExp
## [1] 59.47444
## 
## $gdpPercap
## [1] 7215.327
## 
## $gdp
## [1] 186809560507
## 
## $log_gdpPercap
## [1] 8.158791
## 
## $log_pop
## [1] 15.76611
## 
## $vec_log_gdpPercap
## [1] 8.158791
##           country              year               pop         continent 
##                NA      1.979500e+03      2.960121e+07                NA 
##           lifeExp         gdpPercap               gdp     log_gdpPercap 
##      5.947444e+01      7.215327e+03      1.868096e+11      8.158791e+00 
##           log_pop vec_log_gdpPercap 
##      1.576611e+01      8.158791e+00

Anonymous functions in apply

  • You can do more complex functions within an apply call
  • Add function(x) { ... } to the call; x takes each value in turn, like the iterator in a for loop
##  [1] 49.05762 51.50740 53.60925 55.67829 57.64739 59.57016 61.53320
##  [8] 63.21261 64.16034 65.01468 65.69492 67.00742
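A sketch of an anonymous function inside sapply, computing the mean life expectancy for each year. The toy data frame and its values are invented for illustration:

```r
# Toy stand-in for gapminder; values are made up for illustration
toy <- data.frame(
  year    = c(1952, 1952, 1957, 1957),
  lifeExp = c(40, 70, 42, 72)
)

# x takes each unique year in turn; the function body does the subsetting
means_by_year <- sapply(unique(toy$year), function(x) {
  mean(toy$lifeExp[toy$year == x])
})

print(means_by_year)
```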

the while loop

while loop syntax

  • Similar syntax to the for loop: while (condition) { function }
  • Often, you define an iterator before the loop and increase it on each pass
## [1] "1952: 12.2259557776501"
## [1] "1957: 12.2312861234041"
## [1] "1962: 12.0972450062645"
## [1] "1967: 11.7188577789887"
## [1] "1972: 11.3819531380937"
## [1] "1977: 11.2272293919197"
## [1] "1982: 10.7706178327824"
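A sketch of a while loop in this style, stepping through years five at a time and printing the standard deviation of life expectancy for each. The toy data frame, its values, and the counter n_iter are assumptions for illustration:

```r
# Toy stand-in for gapminder; values are made up for illustration
toy <- data.frame(
  year    = rep(c(1952, 1957, 1962), each = 2),
  lifeExp = c(40, 70, 42, 72, 45, 75)
)

yr <- 1952      # iterator, defined before the loop
n_iter <- 0     # counts how many times the body runs

while (yr <= 1957) {
  s <- sd(toy$lifeExp[toy$year == yr])
  print(paste0(yr, ": ", s))
  yr <- yr + 5          # without this line the condition never becomes FALSE
  n_iter <- n_iter + 1
}
```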

try a while loop

  • What is the standard deviation for life expectancy for each year between 1987 and 2002 (inclusive)?

results

## [1] "1987: 10.5562851721688"
## [1] "1992: 11.2273795265798"
## [1] "1997: 11.5594390582383"
## [1] "2002: 12.2798227122797"

the infinite loop - a while loop cautionary tale

  • Beware! A while loop will run forever if the logical condition is always satisfied!
  • Give it a try: run the previous while loop without increasing the iterator

The if/else conditional!

Start with if

  • Similar to for and while: start with if, then give the condition in parentheses
## [1] 1992
## [1] 1992

if statements

  • We probably got different answers because of random sampling from years
  • We can fix that with set.seed()
## [1] 2002
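A sketch of an if statement on a randomly sampled year, with set.seed() making the draw repeatable. The seed value 1234 and the years vector are assumptions, so the year drawn here need not match the output above:

```r
# The twelve survey years, five years apart
years <- seq(1952, 2007, by = 5)

set.seed(1234)                    # fix the random number generator
random_year <- sample(years, 1)   # draw one year at random

# The body runs only when the condition is TRUE
if (random_year > 1977) {
  print(random_year)
}
```

Re-running the block from set.seed() onward draws the same year every time.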

Adding an else clause

  • What happens if random_year <= 1977? NOTHING!
  • We can add an else statement, telling R what to do when the if condition isn’t met
## [1] 1992
## [1] 1992

putting if and else together

## [1] "sorry, random year is less than 1977"
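A sketch of the combined if/else, with random_year fixed at 1962 (an assumption, instead of sampling) so that the else branch is the one that runs:

```r
random_year <- 1962   # fixed value chosen so the if condition is FALSE

if (random_year > 1977) {
  msg <- paste("random year is", random_year)
} else {
  # Runs whenever the if condition is not met
  msg <- "sorry, random year is less than 1977"
}

print(msg)
```

Note that else sits on the same line as the closing brace of the if block; at the top level of a script, R treats an else on its own line as a syntax error.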

Putting for and if/else together

  • We can add an if/else clause to a for loop

Putting for and if/else together

Which continents have a mean life expectancy greater than 70 years?

## [1] "Mean Life Expectancy in Asia is less than 70"
## [1] "Mean Life Expectancy in Europe is greater than 70"
## [1] "Mean Life Expectancy in Africa is less than 70"
## [1] "Mean Life Expectancy in Americas is less than 70"
## [1] "Mean Life Expectancy in Oceania is greater than 70"
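A sketch of a for loop with an if/else inside it that could print messages in this form. The toy data frame, its values, and the above_70 vector are assumptions for illustration:

```r
# Toy stand-in for gapminder; values are made up for illustration
toy <- data.frame(
  continent = c("X", "X", "X", "X", "Y", "Y"),
  lifeExp   = c(40, 42, 50, 54, 70, 72),
  stringsAsFactors = FALSE
)

above_70 <- logical(0)
for (cont in unique(toy$continent)) {
  m <- mean(toy$lifeExp[toy$continent == cont])
  above_70[cont] <- m > 70
  if (m > 70) {
    print(paste("Mean Life Expectancy in", cont, "is greater than 70"))
  } else {
    print(paste("Mean Life Expectancy in", cont, "is less than 70"))
  }
}
```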

Putting for and if/else together

Write a for loop that reports the mean life expectancy for years greater than or equal to 1987. Make sure the loop prints a message if the condition is not met!

Putting for and if/else together

## [1] "Sorry, year is less than 1987"
## [1] "Sorry, year is less than 1987"
## [1] "Sorry, year is less than 1987"
## [1] "Sorry, year is less than 1987"
## [1] "Sorry, year is less than 1987"
## [1] "Sorry, year is less than 1987"
## [1] "Sorry, year is less than 1987"
## [1] "1987: 63.2126126760563"
## [1] "1992: 64.160338028169"
## [1] "1997: 65.014676056338"
## [1] "2002: 65.6949225352113"
## [1] "2007: 67.0074225352113"

Writing functions

Writing functions

  • Hadley Wickham: If you have to copy-and-paste three times, it is time to write a function
  • John Chambers: Everything that happens in R is the result of a function call – even [
##  [1] 67.500 69.100 70.300 70.800 71.000 72.500 73.800 74.847 76.070 77.340
## [11] 78.670 79.406
##  [1] 67.500 69.100 70.300 70.800 71.000 72.500 73.800 74.847 76.070 77.340
## [11] 78.670 79.406
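A small sketch of the Chambers point: subsetting with [ and calling [ as a function give the same result, which would explain why the same vector prints twice above. The vector x is an arbitrary example:

```r
x <- c(10, 20, 30)

a <- x[2]          # the usual subsetting syntax
b <- `[`(x, 2)     # the same operation written as a function call

print(identical(a, b))

# Arithmetic operators are functions too
three <- `+`(1, 2)
print(three)
```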

The anatomy of a function

  • Writing a function should look familiar – initialize with function
  • Every function has arguments
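A minimal sketch of the parts of a function: a name, arguments (one with a default), and a body whose last value is returned. The function add_two is a made-up example:

```r
# name <- function(arguments) { body }
add_two <- function(x, y = 2) {
  # The value of the last expression is what the function returns
  x + y
}

add_two(3)        # uses the default y = 2
add_two(3, 10)    # overrides the default
```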

Writing a simple function

  • Let’s write a simple function that prints the value of a selected variable in the gapminder dataset

Writing a more substantial function

  • Let’s write a function that prints the mean and standard deviation for life expectancy for a given country in the gapminder dataset
## Country: Bulgaria 
## Mean Life Expectancy: 69.74375 
## SD Life Expectancy: 3.55268
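A sketch of a function that prints a report in this layout. The function name country_summary, its arguments, and the toy data are all assumptions for illustration:

```r
# Toy stand-in for gapminder; values are made up for illustration
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  lifeExp = c(40, 42, 70, 72),
  stringsAsFactors = FALSE
)

country_summary <- function(data, country_name) {
  # Keep only the life expectancy values for the chosen country
  le <- data$lifeExp[data$country == country_name]
  cat("Country:", country_name, "\n")
  cat("Mean Life Expectancy:", mean(le), "\n")
  cat("SD Life Expectancy:", sd(le), "\n")
  # Return the numbers invisibly so they can be stored if needed
  invisible(c(mean = mean(le), sd = sd(le)))
}

res <- country_summary(toy, "A")
```

Passing the data frame as an argument, rather than referring to a global object inside the function, lets the same function work on any data frame with these two columns.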

Create your own function

  • Write a function that reports the mean, median, minimum, and maximum for life expectancy for a continent in gapminder
  • Hint: min, max

Create your own function

## Continent: Asia 
## Minimum Life expectancy: 28.801 
## Maximum Life expectancy: 82.603

Functions and loops, together at last

  • Combining functions and loops saves time

A log-log model relating life expectancy to GDP

Running the function

Loop it!
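A sketch of how a function and a loop might be combined here, assuming the log-log model regresses log(lifeExp) on log(gdpPercap) and is fit separately for each continent. The function name fit_loglog, the per-continent grouping, and the toy data are all assumptions:

```r
# Toy stand-in for gapminder; values are made up for illustration
toy <- data.frame(
  continent = c("X", "X", "X", "X", "Y", "Y", "Y"),
  lifeExp   = c(40, 42, 50, 54, 70, 72, 75),
  gdpPercap = c(500, 550, 1000, 1100, 5000, 5500, 6500),
  stringsAsFactors = FALSE
)

# Fit the model for one continent and return the slope on log(gdpPercap)
fit_loglog <- function(data, cont) {
  rows <- data[data$continent == cont, ]
  fit <- lm(log(lifeExp) ~ log(gdpPercap), data = rows)
  coef(fit)[["log(gdpPercap)"]]
}

# Loop the function over every continent
slopes <- numeric(0)
for (cont in unique(toy$continent)) {
  slopes[cont] <- fit_loglog(toy, cont)
}

print(slopes)
```

In a log-log model the slope reads as an elasticity: the approximate percentage change in life expectancy associated with a one percent change in GDP per capita.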

Reporting analysis with Rmarkdown and GitHub

Rmarkdown & GitHub

  • Rmarkdown creates dynamic reports in HTML, PDF, and Word
  • Combine text (using the markdown language) and R code
  • Rmarkdown runs R code, compiles, and produces a report in chosen format
  • This presentation was created using Rmarkdown!

Day 1, Part 2 Exercise

  1. Open the rmd_exercise_template.Rmd
  2. Save as: Day1Part2RExercise_LastnameFirstname.Rmd.
  3. Read in the gapminder data set
  4. As you answer questions, be sure to annotate your work with as much detail as possible!