Functions in R
Juan C. Rocha
Stockholm Resilience Centre, Stockholm University
Introduction
To understand computations in R, two slogans are helpful:
“Everything that exists is an object.
Everything that happens is a function call.”
— John Chambers
Functions - where the magic begins
Functions allow you to automate common tasks. Writing a function has three big advantages over using copy-and-paste:
- You drastically reduce the chances of making incidental mistakes when you copy and paste.
- As requirements change, you only need to update code in one place, instead of many.
- You can give a function an evocative name that makes your code easier to understand.
Kitchen recipes
recipe |ˈrɛsɪpi|
noun
a set of instructions for preparing a particular dish, including a list of the ingredients required: a traditional Yorkshire recipe.
- something which is likely to lead to a particular outcome: sky-high interest rates are a recipe for disaster.
- (archaic) a medical prescription: it would be useless to enumerate all the drugs and recipes for their application which have been tried.
Every recipe has a name, a list of ingredients, and a series of steps that guide you to the delicious, expected outcome.
When do you need functions?
df <- data.frame(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df
a b c d
1 -1.23603683 -0.46243626 -1.0313782 -0.29592427
2 1.76210526 0.58844462 -0.9459206 -1.89922812
3 -0.63994691 0.34525809 -0.4897386 -1.45808546
4 1.32263801 0.06832497 1.0865047 -1.00989905
5 -0.57967558 -0.08270555 0.2869225 -0.69430865
6 -0.54018597 -1.19619451 -1.4703025 -0.01666479
7 1.49159683 -0.33556735 1.0028291 -1.38519644
8 0.03808189 0.02053097 -0.7886262 0.17721946
9 -2.06158034 -1.58887567 -0.3215548 0.75123582
10 0.06884175 -1.52564230 0.6311851 1.69482153
Code
This is the code snippet that is repeated:
(df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
[1] 0.2159026 1.0000000 0.3717966 0.8850671 0.3875593 0.3978869 0.9292545
[8] 0.5491200 0.0000000 0.5571646
This operation takes a single input: you only need df$a to run it.
x <- 1:10
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
[1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
[8] 0.7777778 0.8888889 1.0000000
There is duplication in the code, since the range of the data is computed three times.
Simpler code
Intermediate calculations are pulled into named variables
rng <- range(x, na.rm = TRUE) # Now range is calculated only once
(x - rng[1]) / (rng[2] - rng[1]) # and reused when needed
[1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
[8] 0.7777778 0.8888889 1.0000000
Now that we understand what every step does, Hadley can write a function:
rescale01 <- function(x) { # the ingredients
rng <- range(x, na.rm = TRUE) # step 1
(x - rng[1]) / (rng[2] - rng[1]) # step 2
}
rescale01(c(0, 5, 10)) # see the result!
[1] 0.0 0.5 1.0
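With rescale01() defined, the repeated snippet collapses to one call per column. The following is a minimal sketch, assuming a data frame with the same four columns as the earlier example; set.seed(1) is an addition made here only so the run is reproducible.

```r
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

set.seed(1)  # assumption: fixed seed, only for reproducibility
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

# One readable call per column instead of a copied-and-pasted formula
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)

# Every column now runs from 0 to 1
sapply(df, range)
```

If the rescaling rule changes later, only the body of rescale01() needs editing.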
Functions are for humans
- Comment your code to guide your reader (your future self!)
- Use meaningful names that clearly describe what the function does; prefer verbs. Use snake_case or camelCase, but be consistent
- For a family of functions, use a common prefix and take advantage of autocomplete, e.g. str_ from the stringr package
- Arguments should be nouns
- Avoid overriding existing functions and variables
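As a rough illustration of these conventions, the sketch below uses function and argument names that are all hypothetical and invented for this example.

```r
# Hypothetical names, for illustration only

# A verb for the function, nouns for the arguments
compute_mean <- function(values, na_rm = TRUE) {
  mean(values, na.rm = na_rm)
}

# A shared prefix groups a family of functions and helps autocomplete
rescale_center <- function(values) values - mean(values)
rescale_unit   <- function(values) values / max(abs(values))

# Avoid this: it would mask base R's mean() for the rest of the session
# mean <- function(x) sum(x) / length(x)

compute_mean(c(1, 2, NA, 3))
rescale_center(c(1, 2, 3))
```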
Time-saving examples
## Bipartite projections following Newman 2010
# This function takes a bipartite network and returns its one-mode projections as objects of the class
# network. It can be modified to return the adjacency matrix with weighted paths, i.e. the number of
# class-2 nodes that a dyad of class-1 nodes is connected to (co-occurrence).
library(network)
mode.1 <- function(net) {
  m <- as.matrix.network(net, expand.bipartite = FALSE)  # rectangular incidence matrix
  # Projection onto the first node class
  mode1 <- m %*% t(m)
  mat1 <- mode1                  # keep the weighted paths before binarising
  diag(mode1) <- 0
  mode1[mode1 > 1] <- 1
  net1 <- network(mode1, loops = FALSE, directed = FALSE, hyper = FALSE,
                  multiple = FALSE, bipartite = FALSE)
  set.edge.value(net1, "paths", mat1)
  net1 %e% "paths" <- mat1
  # Projection onto the second node class
  mode2 <- t(m) %*% m
  mat2 <- mode2
  diag(mode2) <- 0
  mode2[mode2 > 1] <- 1
  net2 <- network(mode2, loops = FALSE, directed = FALSE, hyper = FALSE,
                  multiple = FALSE, bipartite = FALSE)
  set.edge.value(net2, "paths", mat2)
  net2 %e% "paths" <- mat2       # was assigned to net1 by mistake
  return(list(net1, net2))
}
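The matrix algebra at the heart of mode.1() can be tried in base R without the network package. The sketch below uses a small incidence matrix invented for illustration (two actors, three events); it shows only the projection step, not the construction of network objects.

```r
# Hypothetical incidence matrix: 2 actors (rows) x 3 events (columns)
m <- matrix(c(1, 1, 0,
              0, 1, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("A", "B"), c("e1", "e2", "e3")))

actors <- m %*% t(m)  # off-diagonal: number of events two actors share
events <- t(m) %*% m  # off-diagonal: number of actors two events share

actors["A", "B"]      # A and B share one event (e2)
events["e1", "e3"]    # e1 and e3 share no actor

# Binarise as in mode.1(): zero the diagonal and cap the weights at 1
adj <- actors
diag(adj) <- 0
adj[adj > 1] <- 1
adj
```

The diagonal of each product counts how many partners a node has in the other class, which is why mode.1() sets it to zero before building the network.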
Time-saving examples
A function to quickly visualize a survey
# this function...
library(dplyr)
library(ggplot2)
question <- function(dat, q1, q2, q3, fun) {
  # dat = survey data; q1, q2, q3 = column names of the questions; fun = aggregation function
  # all_of() selects columns whose names are stored in variables
  a0 <- select(dat, col1 = all_of(q1), col2 = all_of(q2), place = all_of(q3))
  g <- ggplot(data = aggregate(col2 ~ col1 + place, data = a0, FUN = fun),
              aes(x = col2, fill = place)) +
    geom_bar(stat = "count", na.rm = TRUE) +
    theme_minimal(base_size = 10, base_family = "Helvetica")
  return(g)
}
See the result in action on the online report!
Iteration
Reducing code duplication has three main benefits:
- It’s easier to see the intent of your code, because your eyes are drawn to what is different, not what is the same.
- It’s easier to respond to changes in requirements. As your needs change, you only need to make changes in one place, rather than remembering to change every place that you copied-and-pasted the code.
- You’re likely to have fewer bugs because each line of code is used in more places.
Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.
Iteration: the for loop
Remember our previous data frame example?
a b c d
1 0.2159026 0.4351076 0.1716689 0.4460995
2 1.0000000 0.8410293 0.2050925 0.0000000
3 0.3717966 0.7470941 0.3835111 0.1227425
# A for loop that calculates the median of each column
output <- vector("double", ncol(df)) # 1. output
for (i in seq_along(df)) { # 2. sequence
output[[i]] <- median(df[[i]]) # 3. body
}
output
[1] 0.4735034 0.5329492 0.4164005 0.3906768
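The same three-part pattern (output, sequence, body) also works when the loop modifies the data frame in place. This is a minimal sketch, assuming the rescale01() function from earlier and a freshly generated data frame; the seed is an addition for reproducibility, so the numbers will differ from those on the slide.

```r
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

set.seed(1)  # assumption: fixed seed, only for reproducibility
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

# Rescale every column; the data frame itself is the output
for (i in seq_along(df)) {
  df[[i]] <- rescale01(df[[i]])
}

# Median of each rescaled column, as in the loop above
output <- vector("double", ncol(df))
for (i in seq_along(df)) {
  output[[i]] <- median(df[[i]])
}

# sapply() gives the same medians without an explicit loop
all.equal(output, unname(sapply(df, median)))
```

Using seq_along(df) rather than 1:ncol(df) keeps the loop safe when the data frame has zero columns.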
Iteration: the while loop
while loops are useful when you don’t know in advance how long the input sequence should run for. They are more general than for loops: you can rewrite any for loop as a while loop, but you can’t rewrite every while loop as a for loop. Here is a comparison:
for (i in seq_along(x)) {
# body
}
# Equivalent to
i <- 1
while (i <= length(x)) {
# body
i <- i + 1
}
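To check the equivalence with a concrete body, the sketch below sums a short vector both ways. The vector and variable names are made up for illustration. Note the `<=` in the while condition: with `<`, the last element would be skipped.

```r
x <- c(2, 4, 6)  # hypothetical input

# for loop: the sequence is handled for you
total_for <- 0
for (i in seq_along(x)) {
  total_for <- total_for + x[[i]]
}

# while loop: you initialise, test and increment the counter yourself
total_while <- 0
i <- 1
while (i <= length(x)) {
  total_while <- total_while + x[[i]]
  i <- i + 1
}

total_for == total_while  # TRUE, both totals are 12
```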
Iteration: the while loop
An example by Hadley that finds out how many tries it takes to get three heads in a row:
flip <- function() sample(c("T", "H"), 1)
flips <- 0
nheads <- 0
while (nheads < 3) {
if (flip() == "H") {
nheads <- nheads + 1
} else {
nheads <- 0
}
flips <- flips + 1
}
flips
[1] 8
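Because the result is random, a single run says little. One way to explore it is to wrap the loop in a function and repeat it many times. In the sketch below, count_flips(), its target argument, the seed and the number of replicates are all assumptions added for illustration; the counter starts at 0 so that it counts exactly the flips made.

```r
flip <- function() sample(c("T", "H"), 1)

# Hypothetical helper: number of flips until `target` heads in a row
count_flips <- function(target = 3) {
  flips <- 0
  nheads <- 0
  while (nheads < target) {
    if (flip() == "H") {
      nheads <- nheads + 1
    } else {
      nheads <- 0  # a tail resets the streak
    }
    flips <- flips + 1
  }
  flips
}

set.seed(42)  # assumption: fixed seed, only for reproducibility
tries <- replicate(1000, count_flips())

min(tries)   # never below 3: at least three flips are needed
mean(tries)  # for a fair coin the theoretical expectation is 14
```

This is a case where a for loop does not fit naturally, since the number of iterations is not known before the loop starts.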
Exercise
- Open an RStudio session https://github.com/ifetzer/RStudioForTeaching
- Load the gapminder data
- country is of the class factor. Discuss in pairs: what are factors and when are they useful?
- Tip: help(factor) & help(levels)
- How many countries does the gapminder data have?
- Write a function to calculate the average
Homework
Prepare for next class an explanation of the following functions:
- apply() and its family (tapply, lapply, sapply, mapply)
- for
- ifelse
- while
- with
All lecture notes are based on Hadley Wickham’s books R for Data Science and Advanced R.