# 6 R programming

The tools in Chapters 2-5 will allow you to manipulate, summarise and visualise your data in all sorts of ways. But what if you need to compute some statistic that there isn’t a function for? What if you need automatic checks of your data and results? What if you need to repeat the same analysis for a large number of files? This is where the programming tools that you’ll learn about in this chapter, like loops and conditional statements, come in handy. And this is where you take the step from being able to use R for routine analyses to being able to use R for *any* analysis.

After working with the material in this chapter, you will be able to use R to:

- Write your own R functions,
- Use several new pipe operators,
- Use conditional statements to perform different operations depending on whether or not a condition is satisfied,
- Iterate code operations multiple times using loops,
- Iterate code operations multiple times using functionals,
- Measure the performance of your R code.

## 6.1 Functions

Suppose that we wish to compute the mean of a vector `x`

. One way to do this would be to use `sum`

and `length`

:

```
<- 1:100
x # Compute mean:
sum(x)/length(x)
```

Now suppose that we wish to compute the mean of several vectors. We could do this by repeated use of `sum`

and `length`

:

```
<- 1:100
x <- 1:200
y <- 1:300
z
# Compute means:
sum(x)/length(x)
sum(y)/length(y)
sum(z)/length(x)
```

But wait! I made a mistake when I copied the code to compute the mean of `z`

- I forgot to change `length(x)`

to `length(z)`

! This is any easy mistake to make when you repeatedly copy and paste code. In addition, repeating the same code multiple times just doesn’t look good. In this case, it would be much more convenient to have a single function for computing the means. Fortunately, such a function exists - `mean`

:

```
# Compute means
mean(x)
mean(y)
mean(z)
```

As you can see, using `mean`

makes the code shorter and easier to read, and reduces the risk of errors induced by copying and pasting code (we only have to change the argument of one function instead of two).

You’ve already used a ton of different functions in R: functions for computing means, manipulating data, plotting graphics, and more. All these functions have been written by somebody who thought that they needed to repeat a task (e.g. computing a mean or plotting a bar chart) over and over again. And in such cases, it is much more convenient to have a function that does that task than to have to write or copy code every time you want to do it. This is true also for your own work - whenever you need to repeat the same task several times it is probably a good idea to write a function for it. It will reduce the amount of code that you have to write and lessen the risk of errors caused by copying and pasting old code. In this section, you will learn how to write your own functions.

### 6.1.1 Creating functions

Let’s say, for the sake of the example, that we wish to compute the mean of several vectors but that the function `mean`

doesn’t exist. We would therefore like to write our own function for computing the mean of a vector. An R function takes some variables as input (arguments or parameters) and returns an object. Functions are defined using `function`

. The definition follows a particular format:

```
<- function(argument1, argument2, ...)
function_name
{# ...
# Some rows with code that creates some_object
# ...
return(some_object)
}
```

In the case of our function for computing a mean, this could look like:

```
<- function(x)
average
{<- sum(x)/length(x)
avg return(avg)
}
```

This defines a function called `average`

, that takes an object called `x`

as input. It computes the sum of the elements of `x`

, divides that by the number of elements in `x`

, and returns the resulting mean.

If we now make a call to `average(x)`

, our function will compute the mean value of the vector `x`

. Let’s try it out, to see that it works:

```
<- 1:100
x <- 1:200
y average(x)
average(y)
```

### 6.1.2 Local and global variables

Note that despite the fact that the vector was called `x`

in the code we used to define the function, `average`

works regardless of whether the input is called `x`

or `y`

. This is because R distinguishes between *global variables* and *local variables*. A global variable is created in the *global environment* outside a function, and is available to all functions (these are the variables that you can see in the Environment panel in RStudio). A local variable is created in the *local environment* inside a function, and is only available to that particular function. For instance, our `average`

function creates a variable called `avg`

, yet when we attempt to access `avg`

after running `average`

this variable doesn’t seem to exist:

```
average(x)
avg
```

Because `avg`

is a local variable, it is only available inside of the `average`

function. Local variables take precedence over global variables inside the functions to which they belong. Because we named the argument used in the function `x`

, `x`

becomes the name of a local variable in `average`

. As far as `average`

is concerned, there is only one variable named `x`

, and that is whatever object that was given as input to the function, regardless of what its original name was. Any operations performed on the local variable `x`

won’t affect the global variable `x`

at all.

Functions can access global variables:

```
<- function()
y_squared
{return(y^2)
}
<- 2
y y_squared()
```

But operations performed on global variables inside functions won’t affect the global variable:

```
<- function(n)
add_to_y
{<- y + n
y
}
<- 1
y add_to_y(1)
y
```

If you really need to change a global variable inside a function^{39}, you can use an alternative assignment operator, `<<-`

, which assigns a value to the variable in the *parent environment* to the current environment. If you use `<<-`

for assignment inside a function that is called from the global environment, this means that the assignment takes place in the global environment. But if you use `<<-`

in a function (function 1) that is called by another function (function 2), the assignment will take place in the environment for function 2, thus affecting a local variable in function 2. Here is an example of global assignment using `<<-`

:

```
<- function(n)
add_to_y_global
{<<- y + n
y
}
<- 1
y add_to_y_global(1)
y
```

### 6.1.3 Will your function work?

It is always a good idea to test if your function works as intended, and to try to figure out what can cause it to break. Let’s return to our `average`

function:

```
<- function(x)
average
{<- sum(x)/length(x)
avg return(avg)
}
```

We’ve already seen that it seems to work when the input `x`

is a numeric vector. But what happens if we input something else instead?

```
average(c(1, 5, 8)) # Numeric input
average(c(TRUE, TRUE, FALSE)) # Logical input
average(c("Lady Gaga", "Tool", "Dry the River")) # Character input
average(data.frame(x = c(1, 1, 1), y = c(2, 2, 1))) # Numeric df
average(data.frame(x = c(1, 5, 8), y = c("A", "B", "C"))) # Mixed type
```

The first two of these render the desired output (the `logical`

values being represented by 0’s and 1’s), but the rest don’t. Many R functions include checks that the input is of the correct type, or checks to see which method should be applied depending on what data type the input is. We’ll learn how to perform such checks in Section 6.3.

As a side note, it is possible to write functions that don’t end with `return`

. In that case, the output (i.e. what would be written in the Console if you’d run the code there) from the last line of the function will automatically be returned. I prefer to (almost) always use `return`

though, as it is easy to accidentally make the function return nothing by finishing it with a line that yields no output. Below are two examples of how we could have written `average`

without a call to `return`

. The first doesn’t work as intended, because the final (and only) line of the function doesn’t give any output.

```
<- function(x)
average_bad
{<- sum(x)/length(x)
avg
}
<- function(x)
average_ok
{sum(x)/length(x)
}
average_bad(c(1, 5, 8))
average_ok(c(1, 5, 8))
```

### 6.1.4 More on arguments

It is possible to create functions with as many arguments as you like, but if the user has to supply to many arguments to your function, it will become quite unwieldy. It is therefore common to provide default values to arguments, which is done by setting a value in the `function`

call. Here is an example of a function that computes \(x^n\), using \(n=2\) as the default:

```
<- function(x, n = 2)
power_n
{return(x^n)
}
```

If we don’t supply `n`

, `power_n`

uses the default `n = 2`

:

`power_n(3)`

But if we supply an `n`

, `power_n`

will use that instead:

```
power_n(3, 1)
power_n(3, 3)
```

For clarity, you can specify which value corresponds to which argument:

`power_n(x = 2, n = 5)`

…and can then even put the arguments in the wrong order:

`power_n(n = 5, x = 2)`

However, if we only supply `n`

we get an error, because there is no default value for `x`

:

`power_n(n = 5)`

It is possible to pass a function as an argument. Here is a function that takes a vector and a function as input, and applies the function to the first two element of the vector:

```
<- function(x, func)
apply_to_first2
{<- func(x[1:2])
result return(result)
}
```

By supplying different functions to `apply_to_first2`

, we can make it perform different tasks:

```
<- c(4, 5, 6)
x apply_to_first2(x, sqrt)
apply_to_first2(x, is.character)
apply_to_first2(x, power_n)
```

But what if the function that we supply requires additional arguments? Using `apply_to_first2`

with `sum`

and the vector `c(4, 5, 6)`

works fine:

`apply_to_first2(x, sum)`

But if we instead use the vector `c(4, NA, 6)`

the function returns `NA`

:

```
<- c(4, NA, 6)
x apply_to_first2(x, sum)
```

Perhaps we’d like to pass `na.rm = TRUE`

to `sum`

to make sure that we get a `numeric`

result, if at all possible. This can be done by adding `...`

to the list of arguments for both functions, which indicates additional parameters (to be supplied by the user) that will be passed to `func`

:

```
<- function(x, func, ...)
apply_to_first2
{<- func(x[1:2], ...)
result return(result)
}
<- c(4, NA, 6)
x apply_to_first2(x, sum)
apply_to_first2(x, sum, na.rm = TRUE)
```

\[\sim\]

**Exercise 6.1 **Write a function that converts temperature measurements in degrees Fahrenheit to degrees Celsius, and apply it to the `Temp`

column of the `airquality`

data.

(Click here to go to the solution.)

**Exercise 6.2 **Practice writing functions by doing the following:

Write a function that takes a vector as input and returns a vector containing its minimum and its maximum, without using

`min`

and`max`

.Write a function that computes the mean of the squared values of a vector using

`mean`

, and that takes additional arguments that it passes on to`mean`

(e.g.`na.rm`

).

### 6.1.5 Namespaces

It is possible, and even likely, that you will encounter functions in packages that have the same name as functions in other packages. Or, similarly, that there are functions in packages with the same names as those you have written yourself. This is of course a bit of a headache, but it’s actually something that can be overcome without changing the names of the functions. Just like variables can live in different environments, R functions live in *namespaces*, usually corresponding either to the global environment or the package that they belong to. By specifying which namespace to look for the function in, you can use multiple functions that all have the same name.

For example, let’s create a function called `sqrt`

. There is already such a function in the `base`

package^{40} (see `?sqrt`

), but let’s do it anyway:

```
<- function(x)
sqrt
{return(x^10)
}
```

If we now apply `sqrt`

to an object, the function that we just defined will be used:

`sqrt(4)`

But if we want to use the `sqrt`

from `base`

, we can specify that by writing the namespace (which almost always is the package name) followed by `::`

and the function name:

`::sqrt(4) base`

The `::`

notation can also be used to call a function or object from a package without loading the package’s namespace:

```
# Doesn't work if ggplot2 isn't loaded
msleep ::msleep # Works, without loading the ggplot2 namespace! ggplot2
```

When you call a function, R will look for it in all active namespaces, following a particular order. To see the order of the namespaces, you can use `search`

:

`search()`

Note that the global environment is first in this list - meaning that the functions that you define always will be preferred to functions in packages.

All this being said, note that it is bad practice to give your functions and variables the same names as common functions. Don’t name them `mean`

, `c`

or `sqrt`

. Nothing good can ever come from that sort of behaviour.

Nothing.

### 6.1.6 Sourcing other scripts

If you want to reuse a function that you have written in a new script, you can of course copy it into that script. But if you then make changes to your function, you will quickly end up with several different versions of it. A better idea can therefore be to put the function in a separate script, which you then can call in each script where you need the function. This is done using `source`

. If, for instance, you have code that defines some functions in a file called `helper-functions.R`

in your working directory, you can run it (thus defining the functions) when the rest of your code is run by adding `source("helper-functions.R")`

to your code.

Another option is to create an R package containing the function, but that is beyond the scope of this book.

## 6.2 More on pipes

We have seen how the `magrittr`

pipe `%>%`

can be used to chain functions together. But there are also other pipe operators that are useful. In this section we’ll have a look at some of them, and see how you can create functions using pipes.

### 6.2.1 *Ce ne sont pas non plus des pipes*

Although `%>%`

is the most used pipe operator, the `magrittr`

package provides a number of other pipes that are useful in certain situations.

One example is when you want to pass variables rather than an entire dataset to the next function. This is needed for instance if you want to use `cor`

to compute the correlation between two variables, because `cor`

takes two vectors as input instead of a data frame. You can do it uses ordinary `%>%`

pipes:

```
library(magrittr)
%>%
airquality subset(Temp > 80) %>%
cor(.$Temp, .$Wind)} {
```

However, the curly brackets `{}`

and the dots `.`

makes this a little awkward and difficult to read. A better option is to use the `%$%`

pipe, which passes on the names of all variables in your data frame instead:

```
%>%
airquality subset(Temp > 80) %$%
cor(Temp, Wind)
```

If you want to modify a variable using a pipe, you can use the *compound assignment* pipe `%<>%`

. The following three lines all yield exactly the same result:

```
<- 1:8; x <- sqrt(x); x
x <- 1:8; x %>% sqrt -> x; x
x <- 1:8; x %<>% sqrt; x x
```

As long as the first pipe in the pipeline is the compound assignment operator `%<>%`

, you can combine it with other pipes:

```
<- 1:8
x %<>% subset(x > 5) %>% sqrt
x x
```

Sometimes you want to do something in the middle of a pipeline, like creating a plot, before continuing to the next step in the chain. The *tee* operator `%T>%`

can be used to execute a function without passing on its output (if any). Instead, it passes on the output to its left. Here is an example:

```
%>%
airquality subset(Temp > 80) %T>%
%$%
plot cor(Temp, Wind)
```

Note that if we’d used an ordinary pipe `%>%`

instead, we’d get an error:

```
%>%
airquality subset(Temp > 80) %>%
%$%
plot cor(Temp, Wind)
```

The reason is that `cor`

looks for the variables `Temp`

and `Wind`

in the plot object, and not in the data frame. The tee operator takes care of this by passing on the data from its left side.

Remember that if you have a function where data only appears within parentheses, you need to wrap the function in curly brackets:

```
%>%
airquality subset(Temp > 80) %T>%
cat("Number of rows in data:", nrow(.), "\n")} %$%
{cor(Temp, Wind)
```

When using the tee operator, this is true also for call to `ggplot`

, where you additionally need to wrap the plot object in a call to `print`

:

```
library(ggplot2)
%>%
airquality subset(Temp > 80) %T>%
print(ggplot(., aes(Temp, Wind)) + geom_point())} %$%
{cor(Temp, Wind)
```

### 6.2.2 Writing functions with pipes

If you will be reusing the same pipeline multiple times, you may want to create a function for it. Let’s say that you have a data frame containing only `numeric`

variables, and that you want to create a scatterplot matrix (which can be done using `plot`

) and compute the correlations between all variables (using `cor`

). As an example, you could do this for `airquality`

as follows:

`%T>% plot %>% cor airquality `

To define a function for this combination of operators, we simply write:

`<- . %T>% plot %>% cor plot_and_cor `

Note that we don’t have to write `function(...)`

when defining functions with pipes!

We can now use this function just like any other:

```
# With airquality data:
%>% plot_and_cor
airquality plot_and_cor(airquality)
# With bookstore data:
<- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
age <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)
purchase <- data.frame(age, purchase)
bookstore %>% plot_and_cor bookstore
```

**Exercise 6.3 **Write a function that takes a data frame as input and uses pipes to print the number of `NA`

values in the data, remove all rows with `NA`

values and return a summary of the remaining data.

(Click here to go to the solution.)

**Exercise 6.4**Pipes are operators, that is, functions that take two variables as input and can be written without parenthesis (other examples of operators are

`+`

and `*`

). You can define your own operators just as you would any other function. For instance, we can define an operator called `quadratic`

that takes two numbers `a`

and `b`

as input and computes the quadratic expression \((a+b)^2\):
```
`%quadratic%` <- function(a, b) { (a + b)^2 }
2 %quadratic% 3
```

Create an operator called `%against%`

that takes two vectors as input and draws a scatterplot of them.

## 6.3 Checking conditions

Sometimes you’d like your code to perform different operations depending on whether or not a certain condition is fulfilled. Perhaps you want it to do something differently if there is missing data, if the input is a `character`

vector, or if the largest value in a `numeric`

vector is greater than some number. In Section 3.2.3 you learned how to filter data using conditions. In this section you’ll learn how to use conditional statements for a number of other tasks.

### 6.3.1 `if`

and `else`

The most important functions for checking whether a condition is fulfilled are `if`

and `else`

. The basic syntax is

`if(condition) { do something } else { do something else }`

The condition should return a single `logical`

value, so that is evaluates to either `TRUE`

or `FALSE`

. If the condition is fulfilled, i.e. if it is `TRUE`

, the code inside the first pair of curly brackets will run, and if it’s not (`FALSE`

), the code within the second pair of curly brackets will run instead.

As a first example, assume that you want to compute the reciprocal of \(x\), \(1/x\), unless \(x=0\), in which case you wish to print an error message:

```
<- 2
x if(x == 0) { cat("Error! Division by zero.") } else { 1/x }
```

Now try running the same code with `x`

set to `0`

:

```
<- 0
x if(x == 0) { cat("Error! Division by zero.") } else { 1/x }
```

Alternatively, we could check if \(x\neq 0\) and then change the order of the segments within the curly brackets:

```
<- 0
x if(x != 0) { 1/x } else { cat("Error! Division by zero.") }
```

You don’t have to write all of the code on the same line, but you must make sure that the `else`

part is on the same line as the first `}`

:

```
if(x == 0)
{ cat("Error! Division by zero.")
else
}
{ 1/x
}
```

You can also choose not to have an `else`

part at all. In that case, the code inside the curly brackets will run if the condition is satisfied, and if not nothing will happen:

```
<- 0
x if(x == 0) { cat("x is 0.") }
<- 2
x if(x == 0) { cat("x is 0.") }
```

Finally, if you need to check a number of conditions one after another, in order to list different possibilities, you can do so by repeated use of `if`

and `else`

:

```
if(x == 0)
{ cat("Error! Division by zero.")
else if(is.infinite((x)))
}
{cat("Error! Divison by infinity.")
else if(is.na((x)))
}
{cat("Error! Divison by NA.")
else
}
{ 1/x
}
```

### 6.3.2 `&`

& `&&`

Just as when we used conditions for filtering in Sections 3.2.3 and 5.8.2, it is possible to combine several conditions into one using `&`

(AND) and `|`

(OR). However, the `&`

and `|`

operators are vectorised, meaning that they will return a vector of `logical`

values whenever possible. This is not desirable in conditional statements, where the condition must evaluate to a single value. Using a condition that returns a vector results in a warning message:

```
if(c(1, 2) == 2) { cat("The vector contains the number 2.\n") }
if(c(2, 1) == 2) { cat("The vector contains the number 2.\n") }
```

As you can see, only the first element of the `logical`

vector is evaluated by `if`

. Usually, if a condition evaluates to a vector it is because you’ve made an error in your code. Remember, if you really need to evaluate a condition regarding the elements in a vector, you can collapse the resulting `logical`

vector to a single value using `any`

or `all`

.

Some texts recommend using the operators `&&`

and `||`

instead of `&`

and `|`

in conditional statements. These work almost like `&`

and `|`

, but force the condition to evaluate to a single `logical`

. I prefer to use `&`

and `|`

, because I want to be notified if my condition evaluates to a vector - once again, that likely means that there is an error somewhere in my code!

There is however one case where I much prefer `&&`

and `||`

. `&`

and `|`

always evaluate all the conditions that you’re combining, while `&&`

and `||`

don’t: `&&`

stops as soon as it encounters a `FALSE`

and `||`

stops as soon as it encounters a `TRUE`

. Consequently, you can put the conditions you wish to combine in a particular order to make sure that they can be evaluated. For instance, you may want to first check that a variable exists, and then check a property. This can be done using `exists`

to check whether or not it exists - note that the variable name must be written within quotes:

```
# a is a variable that doesn't exist
# Using && works:
if(exists("a") && a > 0)
{cat("The variable exists and is positive.")
else { cat("a doesn't exist or is negative.") }
}
# But using & doesn't, because it attempts to evaluate a>0
# even though a doesn't exist:
if(exists("a") & a > 0)
{cat("The variable exists and is positive.")
else { cat("a doesn't exist or is negative.") } }
```

### 6.3.3 `ifelse`

It is common that you want to assign different values to a variable depending on whether or not a condition is satisfied:

```
<- 2
x
if(x == 0)
{ <- "Error! Division by zero."
reciprocal else
}
{ <- 1/x
reciprocal
}
reciprocal
```

In fact, this situation is so common that there is a special function for it: `ifelse`

:

`<- ifelse(x == 0, "Error! Division by zero.", 1/x) reciprocal `

`ifelse`

evaluates a condition and then returns different answers depending on whether the condition is `TRUE`

or `FALSE`

. It can also be applied to vectors, in which case it checks the condition for each element of the vector and returns an answer for each element:

```
<- c(-1, 1, 2, -2, 3)
x ifelse(x > 0, "Positive", "Negative")
```

### 6.3.4 `switch`

For the sake of readability, it is usually a good idea to try to avoid chains of the type `if() {} else if() {} else if() {} else {}`

. One function that can be useful for this is `switch`

, which lets you list a number of possible results, either by position (a number) or by name:

```
<- 2
position switch(position,
"First position",
"Second position",
"Third position")
<- "First"
name switch(name,
First = "First name",
Second = "Second name",
Third = "Third name")
```

You can for instance use this to decide what function should be applied to your data:

```
<- 1:3
x <- c(3, 5, 4)
y <- "nonparametric2"
method <- switch(method,
cor_xy parametric = cor(x, y, method = "pearson"),
nonparametric1 = cor(x, y, method = "spearman"),
nonparametric2 = cor(x, y, method = "kendall"))
cor_xy
```

### 6.3.5 Failing gracefully

Conditional statements are useful for making sure that the input to a function you’ve written is of the correct type. In Section 6.1.3 we saw that our `average`

function failed if we applied it to a `character`

vector:

```
<- function(x)
average
{<- sum(x)/length(x)
avg return(avg)
}
average(c("Lady Gaga", "Tool", "Dry the River"))
```

By using a conditional statement, we can provide a more informative error message. We can check that the input is `numeric`

and, if it’s not, stop the function and print an error message, using `stop`

:

```
<- function(x)
average
{if(is.numeric(x))
{<- sum(x)/length(x)
avg return(avg)
else
}
{ stop("The input must be a numeric vector.")
}
}
average(c(1, 5, 8))
average(c("Lady Gaga", "Tool", "Dry the River"))
```

\[\sim\]

**Exercise 6.5 **Which of the following conditions are `TRUE`

? First think about the answer, and then check it using R.

```
<- 2
x <- 3
y <- -3 z
```

`x > 2`

`x > y | x > z`

`x > y & x > z`

`abs(x*z) >= y`

(Click here to go to the solution.)

**Exercise 6.6 **Fix the errors in the following code:

```
<- c(1, 2, pi, 8)
x
# Only compute square roots if x exists
# and contains positive values:
if(exists(x)) { if(x > 0) { sqrt(x) } }
```

## 6.4 Iteration using loops

We have already seen how you can use functions to make it easier to repeat the same task over and over. But there is still a part of the puzzle missing - what if, for example, you wish to apply a function to each column of a data frame? What if you wish to apply it to data from a number of files, one at a time? The solution to these problems is to use *iteration*. In this section, we’ll explore how to perform iteration using *loops*.

### 6.4.1 `for`

loops

`for`

loops can be used to run the same code several times, with different settings, e.g. different data, in each iteration. Their use is perhaps best explained by some examples. We create the loop using `for`

, give the name of a *control variable* and a vector containing its values; the control variable controls how many iterations to run; and then write the code that should be repeated in each iteration of the loop. In each iteration, a new value of the control variable is used in the code, and the loop stops when all values have been used.

As a first example, let’s write a `for`

loop that runs a block of code five time, where the block prints the current iteration number:

```
for(i in 1:5)
{cat("Iteration", i, "\n")
}
```

This is equivalent to writing:

```
cat("Iteration", 1, "\n")
cat("Iteration", 2, "\n")
cat("Iteration", 3, "\n")
cat("Iteration", 4, "\n")
cat("Iteration", 5, "\n")
```

The upside is that we didn’t have to copy and edit the same code multiple times - and as you can imagine this benefit becomes even more pronounced if you have more complicated code blocks.

The values for the control variable are given in a vector, and the code block will be run once for each element in the vector - we say the we *loop over the values in the vector*. The vector doesn’t have to be `numeric`

- here is an example with a `character`

vector:

```
for(word in c("one", "two", "five hundred and fifty five"))
{cat("Iteration", word, "\n")
}
```

Of course, loops are used for so much more than merely printing text on the screen. A common use is to perform some computation and then store the result in a vector. In this case, we must first create an empty vector to store the result in, e.g. using `vector`

, which creates an empty vector of a specific type and length:

```
<- vector("numeric", 5)
squares
for(i in 1:5)
{<- i^2
squares[i]
} squares
```

In this case, it would have been both simpler and computationally faster to compute the squared values by running `(1:5)^2`

. This is known as a *vectorised* solution, and is very important in R. We’ll discuss vectorised solutions in detail in Section 6.5.

When creating the values used for the control variable, we often wish to create different sequences of numbers. Two functions that are very useful for this are `seq`

, which creates sequences, and `rep`

, which repeats patterns:

```
seq(0, 100)
seq(0, 100, by = 10)
seq(0, 100, length.out = 21)
rep(1, 4)
rep(c(1, 2), 4)
rep(c(1, 2), c(4, 2))
```

Finally, `seq_along`

can be used to create a sequence of indices for a vector of data frame, which is useful if you wish to iterate some code for each element of a vector or each column of a data frame:

```
seq_along(airquality) # Gives the indices of all column of the data
# frame
seq_along(airquality$Temp) # Gives the indices of all elements of the
# vector
```

Here is an example of how to use `seq_along`

to compute the mean of each column of a data frame:

```
# Compute the mean for each column of the airquality data:
<- vector("double", ncol(airquality))
means
# Loop over the variables in airquality:
for(i in seq_along(airquality))
{<- mean(airquality[[i]], na.rm = TRUE)
means[i]
}
# Check that the results agree with those from the colMeans function:
meanscolMeans(airquality, na.rm = TRUE)
```

The line inside the loop could have read `means[i] <- mean(airquality[,i], na.rm = TRUE)`

, but that would have caused problems if we’d used it with a `data.table`

or `tibble`

object; see Section 5.9.4.

Finally, we can also change the values of the data in each iteration of the loop. Some machine learning methods require that the data is *standardised*, i.e. that all columns have mean 0 and standard deviation 1. This is achieved by subtracting the mean from each variable and dividing by the variable’s standard deviation. We can write a function for this that uses a loop, changing the values of a column in each iteration:

```
<- function(df, ...)
standardise
{for(i in seq_along(df))
{<- (df[[i]] - mean(df[[i]], ...))/sd(df[[i]], ...)
df[[i]]
}return(df)
}
# Try it out:
<- standardise(airquality, na.rm = TRUE)
aqs colMeans(aqs, na.rm = TRUE) # Non-zero due to floating point
# arithmetics!
sd(aqs$Wind)
```

\[\sim\]

**Exercise 6.7 **Practice writing `for`

loops by doing the following:

Compute the mean temperature for each month in the

`airquality`

dataset using a loop rather than an existing function.Use a

`for`

loop to compute the maximum and minimum value of each column of the`airquality`

data frame, storing the results in a data frame.Make your solution to the previous task reusable by writing a function that returns the maximum and minimum value of each column of a data frame.

(Click here to go to the solution.)

**Exercise 6.8 **Use `rep`

or `seq`

to create the following vectors:

`0.25 0.5 0.75 1`

`1 1 1 2 2 5`

(Click here to go to the solution.)

**Exercise 6.9 **As an alternative to `seq_along(airquality)`

and `seq_along(airquality$Temp)`

, we could create the same sequences using `1:ncol(airquality)`

and `1:length(airquality$Temp)`

. Use `x <- c()`

to create a vector of length zero. Then create loops that use `seq_along(x)`

and `1:length(x)`

as values for the control variable. How many iterations are the two loops run? Which solution is preferable?

(Click here to go to the solution.)

**Exercise 6.10 **An alternative to standardisation is *normalisation*, where all `numeric`

variables are rescaled so that their smallest value is 0 and their largest value is 1. Write a function that normalises the variables in a data frame containing `numeric`

columns.

(Click here to go to the solution.)

**Exercise 6.11 **The function `list.files`

can be used to create a vector containing the names of all files in a folder. The `pattern`

argument can be used to supply a regular expression describing a file name pattern. For instance, if `pattern = "\\.csv$"`

is used, only `.csv`

files will be listed.

Create a loop that goes through all `.csv`

files in a folder and prints the names of the variables for each file.

### 6.4.2 Loops within loops

In some situations you’ll want to put a loop inside another loop. Such loops are said to be *nested*. An example is if we want to compute the correlation between all pairs of variables in `airquality`

, and store the result in a matrix:

```
<- matrix(NA, nrow = ncol(airquality),
cor_mat ncol = ncol(airquality))
for(i in seq_along(airquality))
{for(j in seq_along(airquality))
{<- cor(airquality[[i]], airquality[[j]],
cor_mat[i, j] use = "pairwise.complete")
}
}
# Element [i, j] of the matrix now contains the correlation between
# variables i and j:
cor_mat
```

Once again, there is a vectorised solution to this problem, given by `cor(airquality, use = "pairwise.complete")`

. As we will see in Section 6.6, vectorised solutions like this can be several times faster than solutions that use nested loops. In general, solutions involving nested loops tend to be fairly slow - but on the other hand they are often easy and straightforward to implement.

### 6.4.3 Keeping track of what’s happening

Sometimes each iteration of your loop takes a long time to run, and you’ll want to monitor its progress. This can be done using printed messages or a progress bar in the Console panel, or sound notifications. We’ll showcase each of these using a loop containing a call to `Sys.sleep`

, which pauses the execution of R commands for a short time (determined by the user).

First, we can use `cat`

to print a message describing the progress. Adding `\r`

to the end of a string allows us to print all messages on the same line, with each new message replacing the old one:

```
# Print each message on a new same line:
for(i in 1:5)
{cat("Step", i, "out of 5\n")
Sys.sleep(1) # Sleep for 1 second
}
# Replace the previous message with the new one:
for(i in 1:5)
{cat("Step", i, "out of 5\r")
Sys.sleep(1) # Sleep for one second
}
```

Adding a progress bar is a little more complicated, because we must first start the bar by using `txtProgressBar`

and the update it using `setTxtProgressBar`

:

```
<- 1:5
sequence <- txtProgressBar(min = 0, max = max(sequence), style = 3)
pbar for(i in sequence)
{Sys.sleep(1) # Sleep for 1 second
setTxtProgressBar(pbar, i)
}close(pbar)
```

Finally, the `beepr`

package^{41} can be used to play sounds, with the function `beep`

:

```
install.packages("beepr")
library(beepr)
# Play all 11 sounds available in beepr:
for(i in 1:11)
{beep(sound = i)
Sys.sleep(2) # Sleep for 2 seconds
}
```

### 6.4.4 Loops and lists

In our previous examples of loops, it has always been clear from the start how many iterations the loop should run and what the length of the output vector (or data frame) should be. This isn’t always the case. To begin with, let’s consider the case where the length of the output is unknown or difficult to know in advance. Let’s say that we want to go through the `airquality`

data to find days that are extreme in the sense that at least one variable attains its maximum on those days. That is, we wish to find the indices of the maxima of each variable, and store them in a vector. Because several days can have the same temperature or wind speed, there may be more than one such maximal index for each variable. For that reason, we don’t know the length of the output vector in advance.

In such cases, it is usually a good idea to store the result from each iteration in a `list`

(Section 5.2), and then collect the elements from the list once the loop has finished. We can create an empty list with one element for each variable in `airquality`

using `vector`

:

```
# Create an empty list with one element for each variable in
# airquality:
<- vector("list", ncol(airquality))
max_list
# Naming the list elements will help us see which variable the maximal
# indices belong to:
names(max_list) <- names(airquality)
# Loop over the variables to find the maxima:
for(i in seq_along(airquality))
{# Find indices of maximum values:
<- which(airquality[[i]] == max(airquality[[i]],
max_index na.rm = TRUE))
# Add indices to list:
<- max_index
max_list[[i]]
}
# Check results:
max_list
# Collapse to a vector:
<- unlist(max_list) extreme_days
```

(In this case, only the variables `Month`

and `Days`

have duplicate maximum values.)

### 6.4.5 `while`

loops

In some situations, we want to run a loop until a certain condition is met, meaning that we don’t know in advance how many iterations we’ll need. This is more common in numerical optimisation and simulation, but sometimes also occurs in data analyses.

When we don’t know in advance how many iterations that are needed, we can use `while`

loops. Unlike `for`

loops, that iterate a fixed numbers of times, `while`

loops keep iterating as long as some specified condition is met. Here is an example where the loop keeps iterating until `i`

squared is greater than 100:

```
<- 1
i
while(i^2 <= 100)
{cat(i,"squared is", i^2, "\n")
<- i +1
i
}
i
```

The code block inside the loop keeps repeating until the condition `i^2 <= 100`

no longer is satisfied. We have to be a little bit careful with this condition - if we set it in such a way that it is possible that the condition *always* will be satisfied the loop will just keep running and running - creating what is known as an *infinite loop*. If you’ve accidentally created an infinite loop, you can break it by pressing the Stop button at the top of the Console panel in RStudio.

In Section 5.3.3 we saw how `rle`

can be used to find and compute the lengths of runs of equal values in a vector. We can use nested `while`

loops to create something similar. `while`

loops are a good choice here, because we don’t know in advance how many runs are in the vector. Here is an example, which you’ll study in more detail in Exercise 6.12:

```
# Create a vector of 0's and 1's:
<- rep(c(0, 1, 0, 1, 0), c(5, 1, 4, 2, 7))
x
# Create empty vectors where the results will be stored:
<- run_lengths <- c()
run_values
# Set the initial condition:
<- 1
i
# Iterate over the entire vector:
while(i < length(x))
{# A new run starts:
<- 1
run_length cat("A run starts at i =", i, "\n")
# Check how long the run continues:
while(x[i+1] == x[i] & i < length(x))
{<- run_length + 1
run_length <- i + 1
i
}
<- i + 1
i
# Save results:
<- c(run_values, x[i-1])
run_values <- c(run_lengths, run_length)
run_lengths
}
# Present the results:
data.frame(run_values, run_lengths)
```

\[\sim\]

**Exercise 6.12 **Consider the nested `while`

loops in the run length example above. Go through the code and think about what happens in each step. What happens when `i`

is `1`

? When it is `5`

? When it is `6`

? Answer the following questions:

What does the condition for the outer while loop check? Why is it needed?

What does the condition for the inner while loop check? Why is it needed?

What does the line

`run_values <- c(run_values, x[i-1])`

do?

(Click here to go to the solution.)

**Exercise 6.13**The

*control statements*

`break`

and `next`

can be used inside both `for`

and `while`

loops to further control their behaviour. `break`

stops a loop, and `next`

skips to the next iteration of it. Use these functions to modify the following piece of code so that the loop skips to the next iteration if `x[i]`

is `0`

, and breaks if `x[i]`

is `NA`

:
```
<- c(1, 5, 8, 0, 20, 0, 3, NA, 18, 2)
x
for(i in seq_along(x))
{cat("Step", i, "- reciprocal is", 1/x[i], "\n")
}
```

(Click here to go to the solution.)

**Exercise 6.14**Using the

`cor_mat`

computation from Section 6.4.2, write a function that computes all pairwise correlations in a data frame, and uses `next`

to only compute correlations for `numeric`

variables. Test your function by applying it to the `msleep`

data from `ggplot2`

. Could you achieve the same thing without using `next`

?
## 6.5 Iteration using vectorisation and functionals

Many operators and functions in R take vectors as input and handle them in a highly efficient way, usually by passing the vector on to an optimised function written in the C programming language^{42}. So if we want to compute the squares of the numbers in a vector, we don’t need to write a loop:

```
<- vector("numeric", 5)
squares
for(i in 1:5)
{<- i^2
squares[i]
} squares
```

Instead, we can simply apply the `^`

operator, which uses fast C code to compute the squares:

`<- (1:5)^2 squares `

These types of functions and operators are called *vectorised*. They take a vector as input and apply a function to all elements of it, meaning that we can avoid slower solutions utilising loops in R^{43}. Try to use vectorised solutions rather than loops whenever possible - it makes your code both easier to read and faster to run.

A related concept is *functionals*, which are functions that contain a `for`

loop. Instead of writing a `for`

loop, you can use a functional, supplying data, a function that should be applied in each iteration of the loop, and a vector to loop over. This won’t necessarily make your loop run faster, but it does have other benefits:

*Shorter code*: functionals allow you to write shorter code. Some would argue that they also allow you to write code that is easier to read, but that is obviously a matter of taste.*Efficient*: functionals handle memory allocation and other small tasks in an efficient way, meaning that you don’t have to worry about creating a vector of an appropriate size to store the result.*No changes to your environment*: because all operations now take place in the local environment of the functional, you don’t run the risk of accidentally changing variables in your global environment.*No left-overs*:`for`

leaves the control variable (e.g.`i`

) in the environment, functionals do not.*Easy to use with pipes*: because the loop has been wrapped in a function, it lends itself well to being used in a`%>%`

pipeline.

Explicit loops are preferable when:

- You think that they are easier to read and write.
- Your functions take data frames or other non-vector objects as input.
- Each iteration of your loop depends on the results from previous iterations.

In this section, we’ll see how we can apply functionals to obtain elegant alternatives to (explicit) loops.

### 6.5.1 A first example with `apply`

The prototypical functional is `apply`

, which loops over either the rows or the columns of a data frame^{44}. The arguments are a dataset, the margin to loop over (`1`

for rows, `2`

for columns) and then the function to be applied.

In Section 6.4.1 we wrote a `for`

loop for computing the mean value of each column in a data frame:

```
# Compute the mean for each column of the airquality data:
<- vector("double", ncol(airquality))
means
# Loop over the variables in airquality:
for(i in seq_along(airquality))
{<- mean(airquality[[i]], na.rm = TRUE)
means[i] }
```

Using apply, we can reduce this to a single line. We wish to use the `airquality`

data, loop over the columns (margin `2`

) and apply the function `mean`

to each column:

`apply(airquality, 2, mean)`

Rather elegant, don’t you think?

Additional arguments can be passed to the function inside apply by adding them to the end of the function call:

`apply(airquality, 2, mean, na.rm = TRUE)`

\[\sim\]

**Exercise 6.15 **Use `apply`

to compute the maximum and minimum value of each column of the `airquality`

data frame. Can you write a function that allows you to compute both with a single call to `apply`

?

### 6.5.2 Variations on a theme

There are several variations of `apply`

that are tailored to specific problems:

`lapply`

: takes a function and vector/list as input, and returns a list.`sapply`

: takes a function and vector/list as input, and returns a vector or matrix.`vapply`

: a version of`sapply`

with additional checks of the format of the output.`tapply`

: for looping over groups, e.g. when computing grouped summaries.`rapply`

: a recursive version of`tapply`

.`mapply`

: for applying a function to multiple arguments; see Section 6.5.7.`eapply`

: for applying a function to all objects in an environment.

We have already seen several ways to compute the mean temperature for different months in the `airquality`

data (Sections 3.8 and 5.7.7, and Exercise 6.7). The `*apply`

family offer several more:

```
# Create a list:
<- split(airquality$Temp, airquality$Month)
temps
lapply(temps, mean)
sapply(temps, mean)
vapply(temps, mean, vector("numeric", 1))
tapply(airquality$Temp, airquality$Month, mean)
```

There is, as that delightful proverb goes, more than one way to skin a cat.

\[\sim\]

**Exercise 6.16 **Use an `*apply`

function to simultaneously compute the monthly maximum and minimum temperature in the `airquality`

data frame.

(Click here to go to the solution.)

**Exercise 6.17 **Use an `*apply`

function to simultaneously compute the monthly maximum and minimum temperature and windspeed in the `airquality`

data frame.

Hint: start by writing a function that simultaneously computes the maximum and minimum temperature and windspeed for a data frame containing data from a single month.

### 6.5.3 `purrr`

If you are feeling enthusiastic about ~~skinning cats~~ using functionals instead of loops, the tidyverse package `purrr`

is a great addition to your toolbox. It contains a number of specialised alternatives to the `*apply`

functions. More importantly, it also contains certain shortcuts that come in handy when working with functionals. For instance, it is fairly common to define a short function inside of your functional, which is useful for instance when you don’t want the function to take up space in your environment. This can be done a little more elegantly with `purrr`

functions using a shortcut denoted by `~`

. Let’s say that we want to standardise all variables in `airquality`

. The `map`

function is the `purrr`

equivalent of `lapply`

. We can use it with or with the shortcut, and with or without pipes (we mention the use of pipes now because it will be important in what comes next):

```
# Base solution:
lapply(airquality, function(x) { (x-mean(x))/sd(x) })
# Base solution with pipe:
library(magrittr)
%>% lapply(function(x) { (x-mean(x))/sd(x) })
airquality
# purrr solution:
library(purrr)
map(airquality, function(x) { (x-mean(x))/sd(x) })
# We can make the purrr solution less verbose using a shortcut:
map(airquality, ~(.-mean(.))/sd(.))
# purr solution with pipe and shortcut:
%>% map(~(.-mean(.))/sd(.)) airquality
```

Where this shortcut really shines is if you need to use multiple functionals. Let’s say that we want to standardise the `airquality`

variables, compute a `summary`

and then extract columns `2`

and `5`

from the summary (which contains the 1st and 3rd quartile of the data):

```
# Impenetrable base solution:
lapply(lapply(lapply(airquality,
function(x) { (x-mean(x))/sd(x) }),
summary),function(x) { x[c(2, 5)] })
# Base solution with pipe:
%>%
airquality lapply(function(x) { (x-mean(x))/sd(x) }) %>%
lapply(summary) %>%
lapply(function(x) { x[c(2, 5)] })
# purrr solution:
%>%
airquality map(~(.-mean(.))/sd(.)) %>%
map(summary) %>%
map(~.[c(2, 5)])
```

Once you know the meaning of `~`

, the `purrr`

solution is a lot cleaner than the base solutions.

### 6.5.4 Specialised functions

So far, it may seem like `map`

is just like `lapply`

but with a shortcut for defining functions. Which is more or less true. But `purrr`

contains a lot more functionals that you can use, each tailored to a very specific problem.

For instance, if you need to specify that the output should be a vector of a specific type, you can use:

`map_dbl(data, function)`

instead of`vapply(data, function, vector("numeric", length))`

,`map_int(data, function)`

instead of`vapply(data, function, vector("integer", length))`

,`map_chr(data, function)`

instead of`vapply(data, function, vector("character", length))`

,`map_lgl(data, function)`

instead of`vapply(data, function, vector("logical", length))`

.

If you need to specify that the output should be a data frame, you can use:

`map_dfr(data, function)`

instead of`sapply(data, function)`

.

The `~`

shortcut for functions is available for all these `map_*`

functions. In case you need to pass additional arguments to the function inside the functional, just add them at the end of the functional call:

```
%>% map_dbl(max)
airquality %>% map_dbl(max, na.rm = TRUE) airquality
```

Another specialised function is the `walk`

function. It works just like `map`

, but doesn’t return anything. This is useful if you want to apply a function with no output, such as `cat`

or `read.csv`

:

```
# Returns a list of NULL values:
%>% map(~cat("Maximum:", max(.), "\n"))
airquality
# Returns nothing:
%>% walk(~cat("Maximum:", max(.), "\n")) airquality
```

\[\sim\]

**Exercise 6.18 **Use a `map_*`

function to simultaneously compute the monthly maximum and minimum temperature in the `airquality`

data frame, returning a vector.

### 6.5.5 Exploring data with functionals

Functionals are great for creating custom summaries of your data. For instance, if you want to check the data type and number of unique values of each variable in your dataset, you can do that with a functional:

```
library(ggplot2)
%>% map_dfr(~(data.frame(unique_values = length(unique(.)),
diamonds class = class(.))))
```

You can of course combine `purrr`

functionals with functions from other packages, e.g. to replace `length(unique(.))`

with a function from your favourite data manipulation package:

```
# Using uniqueN from data.table:
library(data.table)
<- as.data.table(diamonds)
dia %>% map_dfr(~(data.frame(unique_values = uniqueN(.),
dia class = class(.))))
# Using n_distinct from dplyr:
library(dplyr)
%>% map_dfr(~(data.frame(unique_values = n_distinct(.),
diamonds class = class(.))))
```

When creating summaries it can often be useful to be able to loop over both the elements of a vector and their indices. In `purrr`

, this is done using the usual `map*`

functions, but with an `i`

(for index) in the beginning of their names, e.g. `imap`

and `iwalk`

:

```
# Returns a list of NULL values:
imap(airquality, ~ cat(.y, ": ", median(.x), "\n", sep = ""))
# Returns nothing:
iwalk(airquality, ~ cat(.y, ": ", median(.x), "\n", sep = ""))
```

Note that `.x`

is used to denote the variable, and that `.y`

is used to denote the *name* of the variable. If `i*`

functions are used on vectors without element names, indices are used instead. The names of elements of vectors can be set using `set_names`

:

```
# Without element names:
<- 1:5
x iwalk(x, ~ cat(.y, ": ", exp(.x), "\n", sep = ""))
# Set element names:
<- set_names(x, c("exp(1)", "exp(2)", "exp(3)", "exp(4)", "exp(5)"))
x iwalk(x, ~ cat(.y, ": ", exp(.x), "\n", sep = ""))
```

\[\sim\]

**Exercise 6.19 **Write a function that takes a data frame as input and returns the following information about each variable in the data frame: variable name, number of unique values, data type and number of missing values. The function should, as you will have guessed, use a functional.

(Click here to go to the solution.)

**Exercise 6.20 **In Exercise 6.11 you wrote a function that printed the names and variables for all `.csv`

files in a folder given by `folder_path`

. Use `purrr`

functionals to do the same thing.

### 6.5.6 Keep calm and carry on

Another neat feature of `purrr`

is the `safely`

function, which can be used to wrap a function that will be used inside of a functional, and makes sure that the functional returns a result even if there is an error. For instance, let’s say that we want to compute the logarithm of all variables in the `msleep`

data:

```
library(ggplot2)
msleep
```

Note that some columns are `character`

vectors, which will cause `log`

to throw an error:

```
log(msleep$name)
log(msleep)
lapply(msleep, log)
map(msleep, log)
```

Note that the error messages that we get from `lapply`

and `map`

here don’t give any information about which variable caused the error, making it more difficult to figure out what’s gone wrong.

If first we wrap `log`

with `safely`

, we get a list containing the correct output for the numeric variables, and error messages for the non-numeric variables:

```
<- safely(log)
safe_log lapply(msleep, safe_log)
map(msleep, safe_log)
```

Not only does this tell us where the errors occur, but it also returns the logarithms for all variables that `log`

actually could be applied to.

If you’d like your functional to return some default value, e.g. `NA`

, instead of an error message, you can use `possibly`

instead of `safely`

:

```
<- possibly(log, otherwise = NA)
pos_log map(msleep, pos_log)
```

### 6.5.7 Iterating over multiple variables

A final important case is when you want to iterate over more than one variable. This is often the case when fitting statistical models that should be used for prediction, as you’ll see in Section 8.1.10. Another example is when you wish to create plots for several subsets in your data. For instance, we could create a plot of `carat`

versus `price`

for each combination of `color`

and `cut`

in the `diamonds`

data. To do this for a single combination, we’d use something like this:

```
library(ggplot2)
library(dplyr)
%>% filter(cut == "Fair",
diamonds == "D") %>%
color ggplot(aes(carat, price)) +
geom_point() +
ggtitle("Fair, D")
```

To create such a plot for all combinations of `color`

and `cut`

, we must first create a data frame containing all unique combinations, which can be done using the `distinct`

function from `dplyr`

:

```
<- diamonds %>% distinct(cut, color)
combos <- combos$cut
cuts <- combos$color colours
```

`map2`

and `walk2`

from `purrr`

loop over the elements of two vectors, `x`

and `y`

, say. They combine the first element of `x`

with the first element of `y`

, the second element of `x`

with the second element of `y`

, and so on - meaning that they won’t automatically loop over all combinations of elements. That is the reason why we use `distinct`

above to create two vectors where each pair (`x[i]`

, `y[i]`

) correspond to a combination. Apart from the fact that we add a second vector to the call, `map2`

and `walk2`

work just like `map`

and `walk`

:

```
# Print all pairs:
walk2(cuts, colours, ~cat(.x, .y, "\n"))
# Create a plot for each pair:
<- map2(cuts, colours, ~{
combos_plots %>% filter(cut == .x,
diamonds == .y) %>%
color ggplot(aes(carat, price)) +
geom_point() +
ggtitle(paste(.x, .y, sep =", "))})
# View some plots:
1]]
combos_plots[[30]]
combos_plots[[
# Save all plots in a pdf file, with one plot per page:
pdf("all_combos_plots.pdf", width = 8, height = 8)
combos_plotsdev.off()
```

The base function `mapply`

could also have been used here. If you need to iterate over more than two vectors, you can use `pmap`

or `pwalk`

, which work analogously to `map2`

and `walk2`

.

\[\sim\]

**Exercise 6.21 **Using the `gapminder`

data from the `gapminder`

package, create scatterplots of `pop`

and `lifeExp`

for each combination of `continent`

and `year`

. Save each plot as a separate `.png`

file.

## 6.6 Measuring code performance

There are probably as many ideas about what good code is as there are programmers. Some prefer readable code, others prefer concise code. Some prefer to work with separate functions for each task, while others would rather continue to combine a few basic functions in new ways. Regardless of what you consider to be *good code*, there are a few objective measures that can be used to assess the quality of your code. In addition writing code that works and is bug-free, you’d like your code to be:

*Fast*: meaning that it runs quickly. There are tasks that can take anything from second to weeks depending on what code you write for them. Speed is particularly important if you’re going to run your code many times.*Memory efficient*: meaning that it uses as little of your computer’s memory as possible. Software running on your computer uses its memory - its RAM - to store data. If you’re not careful with RAM, you may end up with a full memory and a sluggish or frozen computer. Memory efficiency is particularly important if you’re working with big datasets, that take up a lot of RAM to begin with.

In this section we’ll have a look at how you can measure the speed and memory efficiency of R functions. A caveat is that while speed and memory efficiency are important, the most important thing is to come up with a solution that works in the first place. You should almost always start by solving a problem, and then worry about speed and memory efficiency, not the other way around. The reasons for this is that efficient code often is more difficult to write, read, and debug, which can slow down process of writing it considerably.

Note also, that speed and memory usage is system dependent - the clock frequency and architecture of your processor and speed and size of your RAM will affect how your code performs, as will what operating system you use and what other programs you are running at the same time. That means that if you wish to compare how two functions perform, you need to compare them on the same system under the same conditions.

As a side note, a great way to speed up functions that use either loops or functionals is parallelisation. We cover that topic in Section 10.2.

### 6.6.1 Timing functions

To measure how long a piece of code takes to run, we can use `system.time`

as follows:

```
<- system.time({
rtime <- rnorm(1e6)
x mean(x)
sd(x)
})
# elapsed is the total time it took to execute the code:
rtime
```

This isn’t the best way of measuring computational time though, and doesn’t allow us to easily compare different functions. Instead, we’ll use the `bench`

package, which contains a function called `mark`

that is very useful for measuring the execution time of functions and blocks of code. Let’s start by installing it:

`install.packages("bench")`

In Section 6.1.1 we wrote a function for computing the mean of a vector:

```
<- function(x)
average
{return(sum(x)/length(x))
}
```

Is this faster or slower than `mean`

? We can use `mark`

to apply both functions to a vector multiple times, and measure how long each execution takes:

```
library(bench)
<- 1:100
x <- mark(mean(x), average(x))
bm # Or use View(bm) if you don't want to print the results in the
bm # Console panel.
```

`mark`

has executed both function `n_itr`

times each, and measured how long each execution took to perform. The execution time varies - in the output you can see the shortest (`min`

) and median (`median`

) execution times, as well as the number of iterations per second (`itr/sec`

). Be a little wary of the units for the execution times so that you don’t get them confused - a millisecond (ms, \(10^{-3}\) seconds) is 1,000 microseconds (µs, 1 µs is \(10^{-6}\) seconds), and 1 microsecond is 1,000 nanoseconds (ns, 1 ns is \(10^{-9}\) seconds).

The result here may surprise you - it appears that `average`

is faster than `mean`

! The reason is that `mean`

does a lot of things that `average`

does not: it checks the data type and gives error messages if the data is of the wrong type (e.g. `character`

), and then traverses the vector twice to lower the risk of errors due to floating point arithmetics. All of this takes time, and makes the function slower (but safer to use).

We can plot the results using the `ggbeeswarm`

package:

```
install.packages("ggbeeswarm")
plot(bm)
```

It is also possible to place blocks of code inside curly brackets, `{ }`

, in `mark`

. Here is an example comparing a vectorised solution for computing the squares of a vector with a solution using a loop:

```
<- 1:100
x <- mark(x^2,
bm
{<- x
y for(i in seq_along(x))
{<- x[i]*x[i]
y[i]
}
y
})
bmplot(bm)
```

Although the above code works, it isn’t the prettiest, and the `bm`

table looks a bit confusing because of the long expression for the code block. I prefer to put the code block inside a function instead:

```
<- function(x)
squares
{<- x
y for(i in seq_along(x))
{<- x[i]*x[i]
y[i]
}return(y)
}
<- 1:100
x <- mark(x^2, squares(x))
bm
bmplot(bm)
```

Note that `squares(x)`

is faster than the original code block:

```
<- mark(squares(x),
bm
{<- x
y for(i in seq_along(x))
{<- x[i]*x[i]
y[i]
}
y
}) bm
```

Functions in R are *compiled* the first time they are run, which often makes them run faster than the same code would have outside of the function. We’ll discuss this further next.

### 6.6.2 Measuring memory usage - and a note on compilation

`mark`

also shows us how much memory is allocated when running different blocks of code, in the `mem_alloc`

column of the output^{45}.

Unfortunately, measuring memory usage is a little tricky. To see why, restart R (yes, really - this is important!), and then run the following code to benchmark `x^2`

versus `squares(x)`

:

```
library(bench)
<- function(x)
squares
{<- x
y for(i in seq_along(x))
{<- x[i]*x[i]
y[i]
}return(y)
}
<- 1:100
x <- mark(x^2, squares(x))
bm bm
```

Judging from the `mem_alloc`

column, it appears that the `squares(x)`

loop not only is slower, but also uses more memory. But wait! Let’s run the code again, just to be sure of the result:

```
<- mark(x^2, squares(x))
bm bm
```

This time out, both functions use less memory, and `squares`

now uses *less* memory than `x^2`

. What’s going on?

Computers can’t read code written in R or most other programming languages directly. Instead, the code must be translated to *machine code* that the computer’s processor uses, in a process known as *compilation*. R uses *just-in-time compilation* of functions and loops^{46}, meaning that it translates the R code for new functions and loops to machine code *during execution*. Other languages, such as C, use ahead-of-time compilation, translating the code *prior to execution*. The latter can make the execution much faster, but some flexibility is lost, and the code needs to be run through a compiler ahead of execution, which also takes time. When doing the just-in-time compilation, R needs to use some of the computer’s memory, which causes the memory usage to be greater the first time the function is run. However, if an R function is run again, it has already been compiled, meaning that R doesn’t have to allocate memory for compilation.

In conclusion, if you want to benchmark the memory usage of functions, make sure to run them once before benchmarking. Alternatively, if your function takes a long time to run, you can compile it without running it using the `cmpfun`

function from the `compiler`

package:

```
library(compiler)
<- cmpfun(squares)
squares squares(1:10)
```

\[\sim\]

**Exercise 6.22 **Write a function for computing the mean of a vector using a `for`

loop. How much slower than `mean`

is it? Which function uses more memory?

(Click here to go to the solution.)

**Exercise 6.23 **We have seen three different ways of filtering a data frame to only keep rows that fulfill a condition: using base R, `data.table`

and `dplyr`

. Suppose that we want to extract all flights from 1 January from the `flights`

data in the `nycflights13`

package:

```
library(data.table)
library(dplyr)
library(nycflights13)
# Read about the data:
?flights
# Make a data.table copy of the data:
<- as.data.table(flights)
flights.dt
# Filtering using base R:
<- flights[flights$month == 1 & flights$day == 1,]
flights0101 # Filtering using data.table:
<- flights.dt[month == 1 & day == 1,]
flights0101 # Filtering using dplyr:
<- flights %>% filter(month ==1, day == 1) flights0101
```

Compare the speed and memory usage of these three approaches. Which has the best performance?

Do you

*really*?↩︎`base`

is automatically loaded when you start R, and contains core functions such as`sqrt`

.↩︎Arguably the best add-on package for R.↩︎

Unlike R, C is a low-level language that allows the user to write highly specialised (and complex) code to perform operations very quickly.↩︎

The vectorised functions often use loops, but loops written in C, which are much faster.↩︎

Actually, over the rows or columns of a matrix -

`apply`

converts the data frame to a`matrix`

object.↩︎But only if your version of R has been

*compiled with memory profiling*. If you are using a standard build of R, i.e. have downloaded the base R binary from R-project.org, you should be good to go. You can check that memory profiling is enabled by checking that`capabilities("profmem")`

returns`TRUE`

. If not, you may need to reinstall R if you wish to enable memory profiling.↩︎Since R 3.4.↩︎