# 10 Advanced topics

This chapter contains brief descriptions of more advanced uses of R. First, we cover more details surrounding packages. We then deal with two topics that are important for computational speed: parallelisation and matrix operations. Finally, there are some tips for how to play well with others (which in this case means using R in combination with programming languages like Python and C++).

After reading this chapter, you will know how to:

- Update and remove R packages,
- Install R packages from other repositories than CRAN,
- Run computations in parallel,
- Perform matrix computations using R,
- Integrate R with other programming languages.

## 10.1 More on packages

### 10.1.1 Loading and auto-installing packages

The best way to load R packages is usually to use `library`

, as we’ve done in the examples in this book. If the package that you’re trying to load isn’t installed, this will return an error message:

`library("theWrongPackageName")`

Alternatively, you can use `require`

to load packages. This will only display a warning, but won’t cause your code to stop executing, which usually would be a problem, if the rest of the code depends on the package^{63}!

However, `require`

also returns a `logical`

: `TRUE`

if the package is installed, and `FALSE`

otherwise. This is useful if you want to load a package, and automatically install it if it doesn’t exist.

To load the `beepr`

package, and install it if it doesn’t already exist, we can use `require`

inside an `if`

condition, as in the code chunk below. If the package exists, the package will be loaded (by `require`

) and `TRUE`

will be returned, and otherwise `FALSE`

will be returned. By using `!`

to turn `FALSE`

into `TRUE`

and vice versa, we can make R install the package if it is missing:

```
if(!require("beepr")) { install.packages("beepr"); library(beepr) }
beep(4)
```

### 10.1.2 Updating R and your packages

You can download new versions of R and RStudio following the same steps as in Section 2.1. On Windows, you can have multiple versions of R installed simultaneously.

To update a specific R package, you can use `install.packages`

. For instance, to update the `beepr`

package, you’d run:

`install.packages("beepr")`

If you make a major update of R, you may have to update most or all of your packages. To update all your packages, you simply run `update.packages()`

. If this fails, you can try the following instead:

```
<- installed.packages()
pkgs <- pkgs[is.na(pkgs[, "Priority"]), 1]
pkgs install.packages(pkgs)
```

### 10.1.3 Alternative repositories

In addition to CRAN, two important sources for R packages are Bioconductor, which contains a large number of packages for bioinformatics, and GitHub, where many developers post development versions of their R packages (which often contain functions and features not yet included in the version of the package that has been posted on CRAN).

To install packages from GitHub, you need the `devtools`

package. You can install it using:

`install.packages("devtools")`

If you for instance want to install the development version of `dplyr`

(which you can find at https://github.com/tidyverse/dplyr), you can then run the following:

```
library(devtools)
install_github("tidyverse/dplyr")
```

Using development versions of packages can be great, because it gives you the most up-to-date version of packages. Bear in mind that they are development versions though, which means that they can be less stable and have more bugs.

To install packages from Bioconductor, you can start by running this code chunk, which installs the `BiocManager`

package that is used to install Bioconductor packages:

```
install.packages("BiocManager")
# Install core packages:
library(BiocManager)
install()
```

You can have a look at the list of packages at

https://www.bioconductor.org/packages/release/BiocViews.html#___Software

If you for instance find the `affyio`

package interesting, you can then install it using:

```
library(BiocManager)
install("affyio")
```

### 10.1.4 Removing packages

This is probably not something that you’ll find yourself doing often, but if you need to uninstall a package, you can do so using `remove.packages`

. Perhaps you’ve installed the development version of a package and want to remove it, so that you can install the stable version again? If you for instance want to uninstall the `beepr`

package^{64}, you’d run the following:

`remove.packages("beepr")`

## 10.2 Speeding up computations with parallelisation

Modern computers have CPU’s with multiple cores and threads, which allows us to speed up computations by performing them in parallel. Some functions in R do this by default, but far from all do. In this section, we’ll have a look at how to run parallel versions of `for`

loops and functionals.

### 10.2.1 Parallelising `for`

loops

First, we’ll have a look at how to parallelise a `for`

loop. We’ll use the `foreach`

, `parallel`

, and `doParallel`

packages, so let’s install them if you haven’t already:

`install.packages(c("foreach", "parallel", "doParallel"))`

To see how many cores that are available on your machine, you can use `detectCores`

:

```
library(parallel)
detectCores()
```

It is unwise to use all available cores for your parallel computation - you’ll always need to reserve at least one for running RStudio and other applications.

To run the steps of a `for`

loop in parallel, we must first use `registerDoParallel`

to *register* the parallel backend to be used. Here is an example where we create 3 workers (and so use 3 cores in parallel^{65}) using `registerDoParallel`

. When we then use `foreach`

to create a `for`

loop, these three workers will execute different steps of the loop in parallel. Note that this wouldn’t work if each step of the loop depended on output from the previous step. `foreach`

returns the output created at the end of each step of the loop in a `list`

(Section 5.2):

```
library(doParallel)
registerDoParallel(3)
<- foreach(i = 1:9) %dopar%
loop_output
{^2
i
}
loop_outputunlist(loop_output) # Convert the list to a vector
```

If the output created at the end of each iteration is a vector, we can collect the output in a `matrix`

object as follows:

```
library(doParallel)
registerDoParallel(3)
<- foreach(i = 1:9) %dopar%
loop_output
{c(i, i^2)
}
loop_outputmatrix(unlist(loop_output), 9, 2, byrow = TRUE)
```

If you have nested loops, you should run the outer loop in parallel, but not the inner loops. The reason for this is that parallelisation only really helps if each step of the loop takes a comparatively long time to run. In fact, there is a small overhead cost associated with assigning different iterations to different cores, meaning that parallel loops can be slower than regular loops if each iteration runs quickly.

An example where each step often takes a while to run is simulation studies. Let’s rewrite the simulation we used to compute the type I error rates of different versions of the t-test in Section 7.5.2 using a parallel `for`

loop instead. First, we define the function as in Section 7.5.2 (minus the progress bar):

```
# Load package used for permutation t-test:
library(MKinfer)
# Create a function for running the simulation:
<- function(n1, n2, distr, level = 0.05, B = 999,
simulate_type_I alternative = "two.sided", ...)
{# Create a data frame to store the results in:
<- data.frame(p_t_test = rep(NA, B),
p_values p_perm_t_test = rep(NA, B),
p_wilcoxon = rep(NA, B))
for(i in 1:B)
{# Generate data:
<- distr(n1, ...)
x <- distr(n2, ...)
y
# Compute p-values:
1] <- t.test(x, y,
p_values[i, alternative = alternative)$p.value
2] <- perm.t.test(x, y,
p_values[i, alternative = alternative,
R = 999)$perm.p.value
3] <- wilcox.test(x, y,
p_values[i, alternative = alternative)$p.value
}
# Return the type I error rates:
return(colMeans(p_values < level))
}
```

Next, we create a parallel version:

```
# Register parallel backend:
library(doParallel)
registerDoParallel(3)
# Create a function for running the simulation in parallel:
<- function(n1, n2, distr, level = 0.05,
simulate_type_I_parallel B = 999,
alternative = "two.sided", ...)
{
<- foreach(i = 1:B) %dopar%
results
{# Generate data:
<- distr(n1, ...)
x <- distr(n2, ...)
y
# Compute p-values:
<- t.test(x, y,
p_val1 alternative = alternative)$p.value
<- perm.t.test(x, y,
p_val2 alternative = alternative,
R = 999)$perm.p.value
<- wilcox.test(x, y,
p_val3 alternative = alternative)$p.value
# Return vector with p-values:
c(p_val1, p_val2, p_val3)
}
# Each element of the results list is now a vector
# with three elements.
# Turn the list into a matrix:
<- matrix(unlist(results), B, 3, byrow = TRUE)
p_values
# Return the type I error rates:
return(colMeans(p_values < level))
}
```

We can now compare how long the two functions take to run using the tools from Section 6.6 (we’ll not use `mark`

in this case, as it requires both functions to yield identical output, which won’t be the case for a simulation):

```
<- system.time(simulate_type_I(20, 20, rlnorm,
time1 B = 999, sdlog = 3))
<- system.time(simulate_type_I_parallel(20, 20, rlnorm,
time2 B = 999, sdlog = 3))
# Compare results:
/time1 time1; time2; time2
```

As you can see, the parallel function is considerably faster. If you have more cores, you can try increasing the value in `registerDoParallel`

and see how that affects the results.

### 10.2.2 Parallelising functionals

The `parallel`

package contains parallelised version of the `apply`

family of functions, with names like `parApply`

, `parLapply`

, and `mclapply`

. Which of these that you should use depends in part on your operating system, as different operating systems handle multicore computations differently. Here is the first example from Section 6.5.3, run in parallel with 3 workers:

```
# Non-parallel version:
lapply(airquality, function(x) { (x-mean(x))/sd(x) })
# Parallel version for Linux/Mac:
library(parallel)
mclapply(airquality, function(x) { (x-mean(x))/sd(x) },
mc.cores = 3)
# Parallel version for Windows (a little slower):
library(parallel)
<- makeCluster(3)
myCluster parLapply(myCluster, airquality, function(x) { (x-mean(x))/sd(x) })
stopCluster(myCluster)
```

Similarly, the `furrr`

package lets just run `purrr`

-functionals in parallel. It relies on a package called `future`

. Let’s install them both:

`install.packages(c("future", "furrr"))`

To run functionals in parallel, we load the `furrr`

package and use `plan`

to set the number of parallel workers:

```
library(furrr)
# Use 3 workers:
plan(multisession, workers = 3)
```

We can then run parallel versions of functions like `map`

and `imap`

, by using functions from `furrr`

with the same names, only with `future_`

added at the beginning. Here is the first example from Section 6.5.3, run in parallel:

```
library(magrittr)
%>% future_map(~(.-mean(.))/sd(.)) airquality
```

Just as for `for`

loops, parallelisation of functionals only really helps if each iteration of the functional takes a comparatively long time to run (and so there is no benefit to using parallelisation in this particular example).

## 10.3 Linear algebra and matrices

Linear algebra is the beating heart of many statistical methods. R has a wide range of functions for creating and manipulating matrices, and doing matrix algebra. In this section, we’ll have a look at some of them.

### 10.3.1 Creating matrices

To create a `matrix`

object, we can use the `matrix`

function. It always coerces all elements to be of the same type (Section 5.1):

```
# Create a 3x2 matrix, one column at a time:
matrix(c(2, -1, 3, 1, -2, 4), 3, 2)
# Create a 3x2 matrix, one row at a time:
# (No real need to include line breaks in the vector with
# the values, but I like to do so to see what the matrix
# will look like!)
matrix(c(2, -1,
3, 1,
-2, 4), 3, 2, byrow = TRUE)
```

Matrix operations require the dimension of the matrices involved to match. To check the dimension of a `matrix`

, we can use `dim`

:

```
<- matrix(c(2, -1, 3, 1, -2, 4), 3, 2)
A dim(A)
```

To create a unit matrix (all 1’s) or a zero matrix (all 0’s), we use `matrix`

with a single value in the first argument:

```
# Create a 3x3 unit matrix:
matrix(1, 3, 3)
# Create a 2x3 zero matrix:
matrix(0, 2, 3)
```

The `diag`

function has three uses. First, it can be used to create a diagonal matrix (if we supply a vector as input). Second, it can be used to create an identity matrix (if we supply a single number as input). Third, it can be used to extract the diagonal from a square matrix (if we supply a matrix as input). Let’s give it a go:

```
# Create a diagonal matrix with 2, 4, 6 along the diagonal:
diag(c(2, 4, 6))
# Create a 9x9 identity matrix:
diag(9)
# Create a square matrix and then extract its diagonal:
<- matrix(1:9, 3, 3)
A
Adiag(A)
```

Similarly, we can use `lower.tri`

and `upper.tri`

to extract a matrix of `logical`

values, describing the location of the lower and upper triangular part of a matrix:

```
# Create a matrix_
<- matrix(1:9, 3, 3)
A
A
# Which are the elements in the lower triangular part?
lower.tri(A)
lower.tri(A)]
A[
# Set the lower triangular part to 0:
lower.tri(A)] <- 0
A[ A
```

To transpose a matrix, use `t`

:

`t(A)`

Matrices can be combined using `cbind`

and `rbind`

:

```
<- matrix(c(1:3, 3:1, 2, 1, 3), 3, 3, byrow = TRUE) # 3x3
A <- matrix(c(2, -1, 3, 1, -2, 4), 3, 2) # 3x2
B
# Add B to the right of A:
cbind(A, B)
# Add the transpose of B below A:
rbind(A, t(B))
# Adding B below A doesn't work, because the dimensions
# don't match:
rbind(A, B)
```

### 10.3.2 Sparse matrices

The `Matrix`

package contains functions for creating and speeding up computations with sparse matrices (i.e. matrices with lots of 0’s), as well as for creating matrices with particular structures. You likely already have it installed, as many other packages rely on it. `Matrix`

distinguishes between sparse and dense matrices:

```
# Load or/and install Matrix:
if(!require("Matrix")) { install.packages("Matrix"); library(Matrix) }
# Create a dense 8x8 matrix using the Matrix package:
<- Matrix(1:64, 8, 8)
A
# Create a copy and randomly replace 40 elements by 0:
<- A
B sample(1:64, 40)] <- 0
B[
B
# Store B as a sparse matrix instead:
<- as(B, "sparseMatrix")
B B
```

To visualise the structure of a sparse matrix, we can use `image`

:

`image(B)`

An example of a slightly larger, \(72\times 72\) sparse matrix is given by `CAex`

:

```
data(CAex)
CAeximage(CAex)
```

`Matrix`

contains additional classes for e.g. symmetric sparse matrices and triangular matrices. See `vignette("Introduction", "Matrix")`

for further details.

### 10.3.3 Matrix operations

In this section, we’ll use the following matrices and vectors to show how to perform various matrix operations:

```
# Matrices:
<- matrix(c(1:3, 3:1, 2, 1, 3), 3, 3, byrow = TRUE) # 3x3
A <- matrix(c(2, -1, 3, 1, -2, 4), 3, 2) # 3x2
B <- matrix(c(4, 1, 1, 2), 2, 2) # Symmetric 2x2
C
# Vectors:
<- 1:9 # Length 9
a <- c(2, -1, 3, 1, -2, 4) # Length 6
b <- 9:1 # Length 9
d <- 4:6 # Length 3 y
```

To perform element-wise addition and subtraction with matrices, use `+`

and `-`

:

```
+ A
A - t(A) A
```

To perform element-wise multiplication, use `*`

:

```
2 * A # Multiply all elements by 2
* A # Square all elements A
```

To perform matrix multiplication, use `%*%`

. Remember that matrix multiplication is non-commutative and so the order of the matrices is important:

```
%*% B # A is 3x3, B is 3x2
A %*% C # B is 3x2, C is 2x2
B %*% A # Won't work, because B is 3x2 and A 3x3! B
```

Given the vectors `a`

, `b`

, and `d`

defined above, we can compute the outer product \(a\otimes b\) using `%o%`

and the dot product \(a\cdot d\) by using `%*%`

and `t`

in the right manner:

```
%o% b # Outer product
a %*% t(b) # Alternative way of getting outer product
a t(a) %*% d # Dot product
```

To find the inverse of a square matrix, we can use `solve`

. To find the generalised Moore-Penrose inverse of any matrix, we can use `ginv`

from `MASS`

:

```
solve(A)
solve(B) # Doesn't work because B isn't square
library(MASS)
ginv(A) # Same as solve(A), because A is non-singular and square
ginv(B)
```

`solve`

can also be used to solve equations systems. To solve the equation \(Ax = y\):

`solve(A, y)`

The eigenvalues and eigenvectors of a square matrix can be found using `eigen`

:

```
eigen(A)
eigen(A)$values # Eigenvalues only
eigen(A)$vectors # Eigenvectors only
```

The singular value decomposition, QR decomposition, and the Choleski factorisation of a matrix are computed as follows:

```
svd(A)
qr(A)
chol(C)
```

`qr`

also provides the rank of the matrix:

```
qr(A)$rank
qr(B)$rank
```

Finally, you can get the determinant^{66} of a matrix using `det`

:

`det(A)`

As a P.S., I’ll also mention the `matlab`

package, which contains functions for running computations using MATLAB-like function calls. This is useful if you want to reuse MATLAB code in R without translating it row-by-row. Incidentally, this also brings us nicely into the next section.

## 10.4 Integration with other programming languages

R is great for a lot of things, but it is obviously not the best choice for every task. There are a number of packages that can be used to harvest the power of other languages, or to integrate your R code with code that you or others have developed in other programming languages. In this section, we’ll mention a few of them.

### 10.4.1 Integration with C++

C++ is commonly used to speed up functions, for instance involving loops that can’t be vectorised or parallelised due to dependences between different iterations. The `Rcpp`

package allows you to easily call C++ functions from R, as well as calling R functions from C++. See `vignette("Rcpp-introduction", "Rcpp")`

(Eddelbuettel & Balamuta, 2018) for details.

An important difference between R and C++ that you should be aware of is that the indexing of vectors (and similar objects) in C++ starts with 0. So the first element of the vector is element 0, the second is element 1, and so forth. Bear this in mind if you pass a vector and a list of indices to C++ functions.

### 10.4.2 Integration with Python

The `reticulate`

package can be used to call Python functions from R. See `vignette("calling_python", "reticulate")`

for some examples.

Some care has to be taken when sending data back and forth between R and Python. In R `NA`

is used to represent missing data and `NaN`

(not a number) is used to represent things that should be numbers but aren’t (e.g. the result of computing `0/0`

). Perfectly reasonable! However, for reasons unknown to mankind, popular Python packages like Pandas, NumPy and SciKit-Learn use `NaN`

instead of `NA`

to represent missing data - but only for `double`

(`numeric`

) variables. `integer`

and `logical`

variables have no way to represent missing data in Pandas. Tread gently if there are `NA`

or `NaN`

values in your data.

Like in C++, the indexing of vectors (and similar objects) in Python starts with 0.

### 10.4.3 Integration with Tensorflow and PyTorch

Tensorflow, Keras and PyTorch are popular frameworks for deep learning. To use Tensorflow and/or Keras with R, you can use the `keras`

package. See `vignette("index", "keras")`

for an introduction and Chollet & Allaire (2018) for a thorough treatise. Similarly, to use PyTorch with R, use the `torch`

package. In both cases, it can take some tampering to get the frameworks to run on a GPU.

### 10.4.4 Integration with Spark

If you need to process large datasets using Spark, you can do so from R using the `sparklyr`

package. It can be used both with local and cloud clusters, and (as the name seems to imply) is easy to integrate with `dplyr`

.