Functions in R



Juan C. Rocha
Stockholm Resilience Centre, Stockholm University

Introduction

To understand computations in R, two slogans are helpful:

“Everything that exists is an object.

Everything that happens is a function call.”

— John Chambers

Functions - where the magic begins

Functions allow you to automate common tasks. Writing a function has three big advantages over using copy-and-paste:

  1. You drastically reduce the chances of making incidental mistakes when you copy and paste.
  2. As requirements change, you only need to update code in one place, instead of many.
  3. You can give a function an evocative name that makes your code easier to understand.

Kitchen recipes

recipe |ˈrɛsɪpi|
noun
a set of instructions for preparing a particular dish, including a list of the ingredients required: a traditional Yorkshire recipe.

- something which is likely to lead to a particular outcome: sky-high interest rates are a recipe for disaster.
- (archaic) a medical prescription: it would be useless to enumerate all the drugs and recipes for their application which have been tried.

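Every recipe has a name, a list of ingredients and a series of steps that guide you to the delicious, expected outcome. A function works the same way: it has a name, a list of arguments (the ingredients) and a body (the steps).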

When do you need functions?

df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
df
             a           b          c           d
1  -1.23603683 -0.46243626 -1.0313782 -0.29592427
2   1.76210526  0.58844462 -0.9459206 -1.89922812
3  -0.63994691  0.34525809 -0.4897386 -1.45808546
4   1.32263801  0.06832497  1.0865047 -1.00989905
5  -0.57967558 -0.08270555  0.2869225 -0.69430865
6  -0.54018597 -1.19619451 -1.4703025 -0.01666479
7   1.49159683 -0.33556735  1.0028291 -1.38519644
8   0.03808189  0.02053097 -0.7886262  0.17721946
9  -2.06158034 -1.58887567 -0.3215548  0.75123582
10  0.06884175 -1.52564230  0.6311851  1.69482153

Code

This is the code snippet that is repeated:

(df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
 [1] 0.2159026 1.0000000 0.3717966 0.8850671 0.3875593 0.3978869 0.9292545
 [8] 0.5491200 0.0000000 0.5571646

This operation has a single input: you only need df$a to run it.

x <- 1:10
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
 [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
 [8] 0.7777778 0.8888889 1.0000000

There is duplication in the code: getting the range of the data takes three separate calls (min() twice and max() once).

Simpler code

Intermediate calculations are pulled into named variables:

rng <- range(x, na.rm = TRUE)       # Now range is calculated only once
(x - rng[1]) / (rng[2] - rng[1])    # and reused when needed
 [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
 [8] 0.7777778 0.8888889 1.0000000

Now that we understand what every step does, we can write a function (the example follows Hadley Wickham's R for Data Science):

rescale01 <- function(x) {              # the ingredients
  rng <- range(x, na.rm = TRUE)         # step 1
  (x - rng[1]) / (rng[2] - rng[1])      # step 2
}
rescale01(c(0, 5, 10))                  # see the result!
[1] 0.0 0.5 1.0
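
With the function in place, the rescaling no longer has to be pasted once per column. The following is a minimal sketch of that idea; the seed and the freshly generated df are assumptions, added only so the block runs on its own.

```r
# Recreate a data frame shaped like the one above.
# The seed is an assumption, used only for reproducibility.
set.seed(42)
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# One call per column instead of one pasted expression per column
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)

# Each column now has minimum 0 and maximum 1
sapply(df, range)
```

If the rescaling rule changes, only the body of rescale01() has to be edited.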

Functions are for humans

  • Comment your code to guide your reader (your future self!)
  • Use meaningful names that clearly describe what the function does; prefer verbs. Use snake_case or camelCase, but be consistent
  • For a family of functions, use a common prefix to take advantage of autocomplete, e.g. str_ in the stringr package.
  • Arguments should be nouns
  • Avoid overriding existing functions and variables
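
As an illustration of these guidelines, here is a small sketch that writes the same function twice. Both function names and the input vector are hypothetical, chosen only for the comparison.

```r
# Hard to read: neither the function name nor the arguments reveal the intent
f <- function(a, b) head(a[!is.na(a)], b)

# Easier to read: a verb phrase for the function, nouns for the arguments
first_non_missing <- function(values, n) {
  non_missing <- values[!is.na(values)]  # drop missing values first
  head(non_missing, n)                   # then keep the first n
}

first_non_missing(c(NA, 4, 8, NA, 15), n = 2)

# Avoid reusing existing names: defining your own `mean` or `c`
# would mask the base R functions for the rest of the session.
```

Both versions return the same values; only the second tells the reader what it is for.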

Time-saving examples

## Bipartite projections according to Newman 2010

# This function takes a bipartite network and returns its one-mode projections (one per node class)
# as objects of class network. It can be modified to return the adjacency matrix with weighted paths,
# i.e. the number of class-2 nodes that a dyad of class-1 nodes is connected to (co-occurrence).
library(network)

mode.1 <- function(net) {
    # Incidence matrix: rows are class-1 nodes, columns are class-2 nodes
    m <- as.matrix.network(net, expand.bipartite = FALSE)

    # Projection onto class 1: entry [i, j] counts the class-2 nodes shared by i and j
    mode1 <- m %*% t(m)
    mat1 <- mode1                    # keep the weighted version
    diag(mode1) <- 0
    mode1[mode1 > 1] <- 1            # binarize
    net1 <- network(mode1, loops = FALSE, directed = FALSE, hyper = FALSE,
                    multiple = FALSE, bipartite = FALSE)
    net1 %e% "paths" <- mat1         # store the weights as an edge attribute

    # Projection onto class 2, built the same way
    mode2 <- t(m) %*% m
    mat2 <- mode2
    diag(mode2) <- 0
    mode2[mode2 > 1] <- 1
    net2 <- network(mode2, loops = FALSE, directed = FALSE, hyper = FALSE,
                    multiple = FALSE, bipartite = FALSE)
    net2 %e% "paths" <- mat2
    return(list(net1, net2))
}
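
The core of the projection is the matrix product of the incidence matrix with its transpose. The following base R sketch shows that step alone on a toy incidence matrix; the node labels (A, B, C and g1, g2, g3) are hypothetical and the network package is not needed here.

```r
# Toy incidence matrix: rows are class-1 nodes, columns are class-2 nodes.
# A 1 means the row node is connected to the column node.
m <- matrix(c(1, 1, 1,
              1, 0, 1,
              0, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("g1", "g2", "g3")))

# Entry [i, j] counts the class-2 nodes that i and j have in common
shared <- m %*% t(m)
diag(shared) <- 0          # ignore self-connections

# Binary adjacency: connected if at least one class-2 node is shared
adjacency <- shared
adjacency[adjacency > 1] <- 1

shared      # A and B share two nodes (g1, g3); A and C share one (g2)
adjacency
```

Swapping the product to t(m) %*% m gives the projection onto the other node class.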

Time-saving examples

A function to quickly visualize a survey

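# this function...
library(dplyr)    # select(), all_of()
library(ggplot2)  # ggplot(), geom_bar(), theme_minimal()

question <- function(dat, q1, q2, q3, fun) {
    # dat = survey data frame
    # q1, q2, q3 = column names of the questions, given as strings
    # fun = function used to aggregate q2 within each combination of q1 and q3
    a0 <- select(dat, col1 = all_of(q1), col2 = all_of(q2), place = all_of(q3))
    g <- ggplot(data = aggregate(col2 ~ col1 + place, data = a0, FUN = fun),
                aes(x = col2, fill = place)) +
        geom_bar(stat = "count", na.rm = TRUE) +
        theme_minimal(base_size = 10, base_family = "Helvetica")
    return(g)
}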

See the result in action on the online report!

Iteration

Reducing code duplication has three main benefits:

  • It’s easier to see the intent of your code, because your eyes are drawn to what is different, not what is the same.

  • It’s easier to respond to changes in requirements. As your needs change, you only need to make changes in one place, rather than remembering to change every place that you copied-and-pasted the code.

  • You’re likely to have fewer bugs because each line of code is used in more places.

Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.

Iteration: the for loop

Remember our previous data frame example? Its columns have been rescaled since we first printed it:

head(df, 3)
          a         b         c         d
1 0.2159026 0.4351076 0.1716689 0.4460995
2 1.0000000 0.8410293 0.2050925 0.0000000
3 0.3717966 0.7470941 0.3835111 0.1227425
# A for loop that calculates the median of each column
output <- vector("double", ncol(df))  # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}
output
[1] 0.4735034 0.5329492 0.4164005 0.3906768
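
The same loop can be wrapped in a function so that the summary to compute becomes an argument. This is a minimal sketch; the helper name col_summary and the toy data frame are assumptions made for illustration.

```r
# Apply a summary function to every column of a data frame
col_summary <- function(data, fun) {
  output <- vector("double", ncol(data))  # 1. output
  for (i in seq_along(data)) {            # 2. sequence
    output[[i]] <- fun(data[[i]])         # 3. body
  }
  names(output) <- names(data)            # label results by column
  output
}

toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))
col_summary(toy, median)  # a = 2, b = 20
col_summary(toy, mean)    # a = 2, b = 30
```

Passing the function as an argument means the loop is written once and reused for the median, the mean, or any other summary that returns a single number.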

Iteration: the while loop

while loops are useful when you don't know in advance how long the input sequence should run for. They are more general: you can rewrite any for loop as a while loop, but you cannot rewrite every while loop as a for loop. Here is a comparison:

for (i in seq_along(x)) {
  # body
}

# Equivalent to
i <- 1
while (i <= length(x)) {
  # body
  i <- i + 1 
}
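
To check the equivalence on a concrete case, the sketch below sums a small vector with both kinds of loop. The vector and the variable names are hypothetical. Note that the while condition has to be i <= length(x), otherwise the last element is never visited.

```r
x <- c(2, 4, 6)

# for loop: the sequence is known up front
total_for <- 0
for (i in seq_along(x)) {
  total_for <- total_for + x[[i]]
}

# while loop: the counter is managed by hand
total_while <- 0
i <- 1
while (i <= length(x)) {
  total_while <- total_while + x[[i]]
  i <- i + 1
}

identical(total_for, total_while)  # both loops visit every element
```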

Iteration: the while loop

An example from Hadley Wickham: find out how many tries it takes to get three heads in a row:

flip <- function() sample(c("T", "H"), 1)

flips <- 0    # no coin has been flipped yet
nheads <- 0

while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
[1] 8
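
Because the result is random, a single run says little. The sketch below wraps the loop in a function and repeats the experiment many times; the function name count_flips_until, the seed and the number of repetitions are assumptions made for illustration.

```r
flip <- function() sample(c("T", "H"), 1)

# Count how many flips it takes to see `target` heads in a row
count_flips_until <- function(target = 3) {
  flips <- 0    # start at 0 so the result equals the number of flips made
  nheads <- 0
  while (nheads < target) {
    if (flip() == "H") {
      nheads <- nheads + 1
    } else {
      nheads <- 0
    }
    flips <- flips + 1
  }
  flips
}

set.seed(1)  # arbitrary seed, only for reproducibility
tries <- replicate(1000, count_flips_until(3))

min(tries)   # can never be below 3
mean(tries)  # the theoretical expectation for a fair coin is 14
```

This is a case where a while loop is the natural choice: the number of iterations is not known before the loop starts.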

Exercise

  1. Open an RStudio session https://github.com/ifetzer/RStudioForTeaching
  2. Load the gapminder data
  3. country is of class factor. Discuss in pairs: what are factors, and when are they useful?
    • Tip: help(factor) & help(levels)
  4. How many countries does the gapminder data have?
  5. Write a function to calculate the average
    • Test it in small cases

Homework

For next class, prepare an explanation of the following functions:

  • apply() and its family (tapply, lapply, sapply, mapply)
  • for
  • ifelse
  • while
  • with

All lecture notes are based on Hadley Wickham’s books R for Data Science and Advanced R.