Functions in R



Juan C. Rocha
Stockholm Resilience Centre, Stockholm University

Previously

  1. Structure of simple functions
  2. Advanced functions for iteration
    • if, ifelse, for, while
    • apply() and its family (tapply, lapply, sapply, mapply)
  3. Homework
  4. Exercises

Exercise

  1. Open an RStudio session https://github.com/ifetzer/RStudioForTeaching
  2. Load the gapminder data
  3. country is of class factor. Discuss in pairs: what are factors, and when are they useful?
    • Tip: help(factor) & help(levels)
  4. How many countries does the gapminder data contain?
  5. Write a function to calculate the average
    • Test it on small cases
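
One possible shape for the function in step 5, as a minimal sketch. The name avg and the na.rm argument are illustrative choices, not part of the exercise statement.

```r
# avg(): hypothetical helper name; computes the arithmetic mean by hand
avg <- function(x, na.rm = FALSE) {
    if (!is.numeric(x)) {
        stop("`x` must be a numeric vector")
    }
    if (na.rm) {
        x <- x[!is.na(x)]  # drop missing values before averaging
    }
    sum(x) / length(x)
}

# small test cases, compared against base R's mean()
avg(c(1, 2, 3))                 # 2
avg(c(1, NA, 3), na.rm = TRUE)  # 2
all.equal(avg(mtcars$mpg), mean(mtcars$mpg))
```

Checking a hand-written function against a trusted built-in on small inputs is a quick way to catch mistakes before using it on real data.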

Solutions

  1. How many countries does the gapminder data contain?
library(gapminder)
data(gapminder)

class(gapminder$country)
[1] "factor"
levels(gapminder$country) |> 
    length()
[1] 142
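
Counting levels works because a factor stores each distinct category once. The sketch below uses a small toy factor, invented for illustration, to show what levels() and the related helpers return.

```r
# toy factor, invented for illustration
continent <- factor(c("Asia", "Africa", "Asia", "Europe"))

levels(continent)      # distinct categories, sorted alphabetically
nlevels(continent)     # 3, same as length(levels(continent))
as.integer(continent)  # 2 1 2 3: the integer codes stored under the hood
table(continent)       # number of observations per level
```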

Solutions

  1. For every country in the dataset calculate the average life expectancy, the range of the population, and the median GDP per capita

base

countries <- levels(gapminder$country)
# when working with loops, it is a good practice to
# declare the objects that will collect your results
ale <- list() # average life expectancy
pop_rng <- list() # population range
med_gdp <- list() # median gdp

for (i in seq_along(countries)) {
    # rows for the current country; `$` extracts a plain vector from the tibble
    rows <- gapminder[gapminder$country == countries[i], ]
    # mean() is base R's average; a hand-written average function works too
    ale[[i]] <- mean(rows$lifeExp)
    pop_rng[[i]] <- range(rows$pop)
    med_gdp[[i]] <- median(rows$gdpPercap, na.rm = TRUE)
}
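
The same three summaries can be sketched without an explicit index, using split() together with the apply family from the previous class. This is one possible alternative, not the only one; the object name by_country is an illustrative choice.

```r
library(gapminder)

# one data frame per country, named by country
by_country <- split(gapminder, gapminder$country)

# sapply() simplifies to a named numeric vector when each result has length 1
ale     <- sapply(by_country, function(df) mean(df$lifeExp))
med_gdp <- sapply(by_country, function(df) median(df$gdpPercap, na.rm = TRUE))

# range() returns two values per country, so keep the results in a list
pop_rng <- lapply(by_country, function(df) range(df$pop))

head(ale, 3)
pop_rng[["Sweden"]]
```

Because the results carry the country names, individual countries can be looked up by name rather than by position.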

Solutions

  1. Use the result to plot: Average lifeExp vs median gdpPercap

base

plot(x = unlist(ale), y = unlist(med_gdp), 
     xlab = "Average life expectancy",
     ylab = "Median GDP", 
     type = "p", col = "orange")

Solutions

  1. For every year in the dataset calculate the average life expectancy across all countries. Plot the result over time
library(tidyverse) # provides group_by(), summarise() and ggplot()
gapminder |> 
    group_by(year) |> 
    summarise(
        ale = mean(lifeExp)
    ) |> 
    ggplot(aes(x = year, y = ale)) +
    geom_line(color = "orange") +
    labs(x = "Year", 
         y = "Average life expectancy") +
    theme_classic(base_size = 16)

Iteration

  • for (i in something) { do something }: use it when you know how many iterations you need
  • while (condition) { do something }: use it when you don’t know how many iterations it will take; the loop runs as long as the condition is TRUE
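
A minimal sketch contrasting the two loops. The Collatz rule in the second loop is only an illustrative example of a stopping condition whose iteration count is not obvious in advance.

```r
# for: the number of iterations is known in advance (one per element)
squares <- numeric(5)
for (i in seq_along(squares)) {
    squares[i] <- i^2
}
squares  # 1 4 9 16 25

# while: repeat until the condition becomes FALSE
n <- 6
steps <- 0
while (n != 1) {
    # Collatz rule: halve even numbers, otherwise multiply by 3 and add 1
    n <- if (n %% 2 == 0) n / 2 else 3 * n + 1
    steps <- steps + 1
}
steps  # 8
```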

Iteration with purrr

  • Functional programming
    • Functions of functions: functions can be passed as arguments to other functions, returned by other functions, and treated like any other data.
  • Advantages of purrr:
    • First argument is data
    • All functions are type-stable
    • Accepts functions or formulas
    • Extracts elements by name or position
    • Easier to handle errors
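
A short sketch of what "functions of functions" means in practice. The helper power() and the toy list people are invented for illustration.

```r
library(purrr)

# a function can be passed as an argument: mean is applied to every column
col_means <- mtcars |> map_dbl(mean)

# a function can be returned by another function
power <- function(exp) {
    function(x) x^exp
}
cube <- power(3)
cube(2)  # 8

# elements can be extracted by name or by position
people <- list(list(name = "Ana", age = 31),
               list(name = "Bo",  age = 27))
people |> map_chr("name")  # "Ana" "Bo"
people |> map_dbl(2)       # 31 27
```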

Iteration with purrr

All functions are type-stable

  • map(): list
  • map_dbl(): double
  • map_chr(): character
  • map_int(): integer
  • map_raw(): raw vector
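  • map_dfr(): data frame binding rows (superseded in purrr 1.0.0 by map() followed by list_rbind())
  • map_dfc(): data frame binding columns (superseded in purrr 1.0.0 by map() followed by list_cbind())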
  • modify(): same as input
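
A minimal sketch of type stability, using a toy list invented for illustration. The suffix of each function fixes the type of the output, and a result of the wrong type is an error rather than a silent conversion.

```r
library(purrr)

x <- list(a = 1:3, b = c(2.5, 3.5), c = 10)

lens    <- map(x, length)      # always a list
lens_i  <- map_int(x, length)  # always an integer vector
means   <- map_dbl(x, mean)    # always a double vector
classes <- map_chr(x, class)   # always a character vector

# class() returns text, so asking for doubles fails instead of guessing
res <- tryCatch(map_dbl(x, class), error = function(e) "error")
res  # "error"
```

Knowing the output type in advance makes the next step of a pipeline safer than with sapply(), whose return type depends on the input.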

Iteration with purrr

lm(gdpPercap ~ year, data = gapminder)

Call:
lm(formula = gdpPercap ~ year, data = gapminder)

Coefficients:
(Intercept)         year  
  -249693.7        129.8  
gapminder |> 
    ggplot(aes(year, gdpPercap)) +
    geom_smooth(method = "lm") + theme_light(base_size = 10)

Syntactic sugar

x <- list(list(1,2,3),
          list(4,5,6),
          list(7,8,9))

x |> map_dbl(2)
[1] 2 5 8
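# with an anonymous function
models <- mtcars |> 
    split(~cyl) |> 
    map(function(df) lm(mpg ~ wt, data = df))
# the same with purrr's formula shorthand, where . is the current element
models <- mtcars |> 
    split(~cyl) |> 
    map(~lm(mpg ~ wt, data = .))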
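
Once the models are stored in a list, the same tools pull out results from each one. A sketch, assuming R 4.1 or later for the \(x) lambda and the formula method of split(); the names slopes and r2 are illustrative.

```r
library(purrr)

models <- mtcars |>
    split(~cyl) |>
    map(\(df) lm(mpg ~ wt, data = df))

# slope of wt in each cylinder group
slopes <- models |> map_dbl(\(m) coef(m)[["wt"]])

# R-squared of each model: a string extracts the element with that name
r2 <- models |>
    map(summary) |>
    map_dbl("r.squared")

slopes
r2
```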

Dealing with failure

When a function fails, the whole process typically stops. What if you need the iteration to complete while keeping track of the errors?

log("juan")
Error in log("juan"): non-numeric argument to mathematical function
safe_log <- safely(log)
str(safe_log(10))
List of 2
 $ result: num 2.3
 $ error : NULL
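
For comparison, a sketch of what the wrapped function returns when the call fails. The otherwise argument of safely() sets the value placed in result on failure; the name safe_log_na is an illustrative choice.

```r
library(purrr)

safe_log <- safely(log)

res <- safe_log("juan")
res$result                   # NULL: no value was produced
conditionMessage(res$error)  # the captured error message

# otherwise = replaces the default NULL result on failure
safe_log_na <- safely(log, otherwise = NA_real_)
safe_log_na("juan")$result   # NA
```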

Dealing with failure

x <- list(1, 10, "a")
y <- x |> map(safely(log))
str(y)
List of 3
 $ :List of 2
  ..$ result: num 0
  ..$ error : NULL
 $ :List of 2
  ..$ result: num 2.3
  ..$ error : NULL
 $ :List of 2
  ..$ result: NULL
  ..$ error :List of 2
  .. ..$ message: chr "non-numeric argument to mathematical function"
  .. ..$ call   : language .Primitive("log")(x, base)
  .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

Here you can spot the errors with the naked eye, but what if you have too many objects to inspect?

Dealing with failure

y <- transpose(y) # regroup into one list of results and one list of errors
is_ok <- y$error |> map_lgl(is_null)
x[!is_ok]
[[1]]
[1] "a"
y$result[is_ok] |> flatten_dbl()
[1] 0.000000 2.302585
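
When only the values matter and the error objects are not needed, purrr's possibly() is a lighter alternative: it returns a default value on failure. A minimal sketch, where NA_real_ is an assumed choice of default:

```r
library(purrr)

x <- list(1, 10, "a")

# failures return NA instead of stopping the iteration
y <- x |> map_dbl(possibly(log, otherwise = NA_real_))
y  # 0.000000 2.302585 NA

# inputs that failed
x[is.na(y)]
```

The trade-off is that the error message is discarded, so safely() remains the better option when the cause of each failure needs to be inspected.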

Exercise

  • With a data frame of your choice (perhaps your own data!), use tidyverse functions to summarize numeric and non-numeric variables. For numeric variables, extract for example the mean, maximum, minimum, number of observations, and number of missing values.
  • For categorical variables, extract the number of levels, observations, and missing values.
  • Use map() to automate the process for all variables of the same type
    • Tip: keep() and discard()
  • 15 min before the end, share your progress with a colleague
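
A possible starting point for the tip, as a sketch only. The built-in iris data stands in for "a data frame of your choice", and the names num_summary and cat_summary are illustrative.

```r
library(purrr)

# keep() retains the columns for which the predicate is TRUE
num_summary <- iris |>
    keep(is.numeric) |>
    map(\(v) c(mean = mean(v, na.rm = TRUE),
               min  = min(v, na.rm = TRUE),
               max  = max(v, na.rm = TRUE),
               n    = length(v),
               n_na = sum(is.na(v))))

# discard() drops them, leaving the non-numeric columns
cat_summary <- iris |>
    discard(is.numeric) |>
    map(\(v) c(n_levels = nlevels(v),
               n        = length(v),
               n_na     = sum(is.na(v))))

num_summary$Sepal.Length
cat_summary$Species
```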

Homework

Prepare the following for next class:

  • Study the help for the pmap() and walk() functions. Bring an example of when they are useful.
  • Open a GitHub account
  • Create a test repository (public)
  • Clone it to your computer from RStudio
  • If you need some resources: https://happygitwithr.com

All lecture notes are based on Hadley Wickham’s books R for Data Science and Advanced R.