15 Solutions to exercises

Chapter 2

Exercise 2.1

Type the following code into the Console window:

1 * 2 * 3 * 4 * 5 * 6 * 7 * 8 * 9 * 10

The answer is $3,628,800$.

(Click here to return to the exercise.)

Exercise 2.2

To compute the sum and assign it to a, we use:

a <- 924 + 124

To compute the square of a we can use:

a*a

The answer is $1,098,304$.

As you’ll soon see in other examples, the square can also be computed using:

a^2

(Click here to return to the exercise.)

Exercise 2.3

When an invalid character is used in a variable name, an error message is displayed in the Console window. Different characters will render different error messages. For instance, net-income <- income - taxes yields the error message Error in net - income <- income - taxes : object 'net' not found. This may seem a little cryptic (and it is!), but what it means is that R is trying to compute the difference between the variables net and income, because that is how R interprets net-income, and fails because the variable net does not exist. As you become more experienced with R, the error messages will start making more and more sense (at least in most cases).
If you put R code as a comment, it will be treated as a comment, meaning that it won’t run. This is actually hugely useful, for instance when you’re looking for errors in your code - you can comment away lines of code and see if the rest of the code runs without them.
Semicolons can be used to write multiple commands on a single line - both will run as if they were on separate lines. If you like, you can add more semicolons to run even more commands.
The value to the right is assigned to both variables. Note, however, that any operations you perform on one variable won’t affect the other. For instance, if you change the value of one of them, the other will remain unchanged:

income2 <- taxes2 <- 100
income2; taxes2 # Check that both are 100
taxes2 <- 30 # income2 doesn't change
income2; taxes2 # Check values

(Click here to return to the exercise.)

Exercise 2.4

To create the vectors, use c:

height <- c(158, 170, 172, 181, 196)
weight <- c(45, 80, 62, 75, 115)

To combine the two vectors into a data frame, use data.frame

hw_data <- data.frame(height, weight)

(Click here to return to the exercise.)

Exercise 2.5

The vector created using:

x <- 1:5

is $(1,2,3,4,5)$. Similarly,

x <- 5:1

gives us the same vector in reverse order: $(5,4,3,2,1)$. To create the vector $(1,2,3,4,5,4,3,2,1)$ we can therefore use:

x <- c(1:5, 4:1)

(Click here to return to the exercise.)

Exercise 2.6

To compute the mean height, use the mean function:

mean(height)

To compute the correlation between the two variables, use cor:

cor(height, weight)

(Click here to return to the exercise.)

Exercise 2.7

length computes the length (i.e. the number of elements) of a vector. length(height) returns the value 5, because the vector is 5 elements long.
sort sorts a vector. The parameter decreasing can be used to decide whether the elements should be sorted in ascending (sort(weights, decreasing = FALSE)) or descending (sort(weights, decreasing = TRUE)) order. To sort the weights in ascending order, we can use sort(weight). Note, however, that the resulting sorted vector won’t be stored in the variable weight unless we write weight <- sort(weight)!

(Click here to return to the exercise.)

Exercise 2.8

$\sqrt{\pi}=1.772454\ldots$:

sqrt(pi)

$e^2\cdot log(4)=10.24341\ldots$:

exp(2)*log(4)

(Click here to return to the exercise.)

Exercise 2.9

The expression $1/x$ tends to infinity as $x\rightarrow 0$, and so R returns $\infty$ as the answer in this case:

1/0

The division $0/0$ is undefined, and R returns NaN, which stands for Not a Number:

0/0

$\sqrt{-1}$ is undefined (as long as we stick to real numbers), and so R returns NaN. The sqrt function also provides an error message saying that NaN values were produced.

sqrt(-1)

If you want to use complex numbers for some reason, you can write the complex number $a+bi$ as complex(1, a, b). Using complex numbers, the square root of $-1$ is $i$:

sqrt(complex(1, -1, 0))

(Click here to return to the exercise.)

Exercise 2.10

To install the package, we use the install.packages function as follows:

install.packages("palmerpenguins")

(Click here to return to the exercise.)

Exercise 2.11

View the documentation, where the data is described:

library(palmerpenguins)
?penguins

Have a look at the structure of the data:

str(penguins)

This shows you the number of observations (344) and variables (8), and the variable types. There are three different data types here: num (numerical), Factor (factor, i.e. a categorical variable) and int (integer, a numerical variable that only takes integer values).

To compute the descriptive statistics, we can use:

summary(penguins)

In the summary, missing values show up as NA’s. There are some NA’s here, and hence there are missing values.

(Click here to return to the exercise.)

Exercise 2.12

ggplot(msleep, aes(sleep_total, awake)) +
      geom_point()

The points follow a declining line. The reason for this is that at any given time, an animal is either awake or asleep, so the total sleep time plus the awake time is always 24 hours for all animals. Consequently, the points lie on the line given by awake=24-sleep_total.

(Click here to return to the exercise.)

Exercise 2.13

library(palmerpenguins)
ggplot(penguins, aes(bill_length_mm, flipper_length_mm, colour = species)) +
      geom_point() +
      labs(x = "Bill length (mm)",
           y = "Flipper length (mm)")

We can change the opacity of the points by adding an alpha argument to geom_point. This is useful when the plot contains overlapping points:

ggplot(penguins, aes(bill_length_mm, flipper_length_mm, colour = species)) +
      geom_point(alpha = 0.75) +
      labs(x = "Bill length (mm)",
           y = "Flipper length (mm)")

(Click here to return to the exercise.)

Exercise 2.14

To set different shapes for different values of island we use:

library(palmerpenguins)
ggplot(penguins, aes(bill_length_mm, flipper_length_mm,
                     colour = species,
                     shape = island)) +
      geom_point() +
      labs(x = "Bill length (mm)",
           y = "Flipper length (mm)")

We can then change the size of the points as follows. Adding alpha to geom_point makes it easier to distinguish the different points.

ggplot(penguins, aes(bill_length_mm, flipper_length_mm,
                     colour = species,
                     shape = island,
                     size = body_mass_g)) +
      geom_point(alpha = 0.5) +
      labs(x = "Bill length (mm)",
           y = "Flipper length (mm)")

(Click here to return to the exercise.)

Exercise 2.15

Using the scale_axis_log10 options:

ggplot(msleep, aes(bodywt, brainwt, colour = sleep_total)) + 
      geom_point() +
      labs( x= "Body weight (logarithmic scale)",
            y = "Brain weight (logarithmic scale)") +
      scale_x_log10() +
      scale_y_log10()

(Click here to return to the exercise.)

Exercise 2.16

We use facet_wrap(~ species) to create the facetting:

library(palmerpenguins)
ggplot(penguins, aes(bill_length_mm, flipper_length_mm)) +
  geom_point() +
  facet_wrap(~ species)

To set the number of rows, we add an nrow argument to facet_wrap:

ggplot(penguins, aes(bill_length_mm, flipper_length_mm)) +
  geom_point() +
  facet_wrap(~ species, nrow = 2)

(Click here to return to the exercise.)

Exercise 2.17

library(palmerpenguins)
ggplot(penguins, aes(bill_length_mm, species)) +
      geom_boxplot()

To change the colours of the boxes, we add colour (outline colour) and fill (box colour) arguments to geom_boxplot:

ggplot(penguins, aes(bill_length_mm, species)) +
      geom_boxplot(colour = "magenta", fill = "turquoise")

(No, I don’t really recommend using this particular combination of colours.)

geom_jitter can be used to plot the individual observations on top of the histogram. It is often a good idea to set a small size and a low alpha in order not to cover the boxes completely.

ggplot(penguins, aes(bill_length_mm, species)) +
      geom_boxplot(colour = "magenta", fill = "turquoise") +
      geom_jitter(size = 0.5, alpha = 0.25)

If we like, we can also change the height of the jitter:

ggplot(penguins, aes(bill_length_mm, species)) +
      geom_boxplot(colour = "magenta", fill = "turquoise") +
      geom_jitter(size = 0.5, alpha = 0.25, height = 0.1)

(Click here to return to the exercise.)

Exercise 2.18

library(palmerpenguins)
ggplot(penguins, aes(bill_length_mm)) +
      geom_histogram()

Next, we facet the histograms using cut:

ggplot(penguins, aes(bill_length_mm)) +
      geom_histogram() +
      facet_wrap(~ species)

Finally, by reading the documentation ?geom_histogram we find that we can add outlines using the colour argument:

ggplot(penguins, aes(bill_length_mm)) +
      geom_histogram(colour = "black") +
      facet_wrap(~ species)

(Click here to return to the exercise.)

Exercise 2.19

library(palmerpenguins)
ggplot(penguins, aes(species)) +
      geom_bar()

To set different colours for the bars, we can use fill, either to set the colours manually or using default colours (by adding a colour aesthetic):

# Set colours manually:
ggplot(penguins, aes(species)) +
      geom_bar(fill = c("red", "yellow", "blue"))

# Use defaults:
ggplot(penguins, aes(species, fill = species)) +
      geom_bar()

width lets us control the bar width:

ggplot(penguins, aes(species, fill = species)) +
      geom_bar(width = 0.5)

By adding fill = sex to aes we create stacked bar charts:

ggplot(penguins, aes(species, fill = sex)) +
      geom_bar()

By adding position = "dodge" to geom_bar we obtain grouped bar charts:

ggplot(penguins, aes(species, fill = sex)) +
      geom_bar(position = "dodge")

coord_flip flips the coordinate system, yielding a horizontal bar plot:

ggplot(penguins, aes(species, fill = sex)) +
      geom_bar(position = "dodge")
      coord_flip()

(Click here to return to the exercise.)

Exercise 2.20

To save the png file, use

myPlot <- ggplot(msleep, aes(sleep_total, sleep_rem)) +
      geom_point()

ggsave("filename.png", myPlot, width = 4, height = 4)

To change the resolution, we use the dpi argument:

ggsave("filename.png", myPlot, width = 4, height = 4, dpi=600)

(Click here to return to the exercise.)

Exercise 2.21

Both approaches render a character object with the text A rainy day in Edinburgh:

a <- "A rainy day in Edinburgh"
a
class(a)

a <- 'A rainy day in Edinburgh'
a
class(a)

That is, you are free to choose whether to use single or double quotation marks. I tend to use double quotation marks, because I was raised to believe that double quotation marks are superior in every way (well, that, and the fact that I think that they make code easier to read simply because they are easier to notice).

The first two sums are numeric whereas the third is integer

class(1 + 2)    # numeric
class(1L + 2)   # numeric
class (1L + 2L) # integer

If we mix numeric and integer variables, the result is a numeric. But as long as we stick to just integer variables, the result is usually an integer. There are exceptions though - computing 2L/3L won’t result in an integer because… well, because it’s not an integer.

When we run "Hello" + 1 we receive an error message:

> "Hello" + 1
Error in "Hello" + 1 : non-numeric argument to binary operator

In R, binary operators are mathematical operators like +, -, * and / that takes two numbers and returns a number. Because "Hello" is a character and not a numeric, it fails in this case. So, in English the error message reads Error in "Hello" + 1 : trying to perform addition with something that is not a number. Maybe you know a bit of algebra and want to say hey, we can add characters together, like in $a^2+b^2=c^2$!. Which I guess is correct. But R doesn’t do algebraic calculations, but numerical ones - that is, all letters involved in the computations must represent actual numbers. a^2+b^2=c^2 will work only if a, b and c all have numbers assigned to them.

Combining numeric and a logical variables turns out to be very useful in some problems. The result is always numeric, with FALSE being treated as the number 0 and TRUE being treated as the number 1 in the computations:

class(FALSE * 2)
class(TRUE + 1)

(Click here to return to the exercise.)

Exercise 2.22

The functions return information about the data frame:

ncol(airquality) # Number of columns of the data frame
nrow(airquality) # Number of rows of the data frame
dim(airquality) # Number of rows, followed by number of columns
names(airquality) # The name of the variables in the data frame
row.names(airquality) # The name of the rows in the data frame
                      # (indices unless the rows have been named)

(Click here to return to the exercise.)

Exercise 2.23

To create the matrices, we need to set the number of rows nrow, the number of columns ncol and whether to use the elements of the vector x to fill the matrix by rows or by columns (byrow). To create

\[\begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6 \end{pmatrix}\]

we use:

x <- 1:6

matrix(x, nrow = 2, ncol = 3, byrow = TRUE)

And to create

\[\begin{pmatrix} 1 & 4\\ 2 & 5\\ 3 & 6 \end{pmatrix}\]

we use:

x <- 1:6

matrix(x, nrow = 3, ncol = 2, byrow = FALSE)

We’ll do a deep-dive on matrix objects in Section 12.3.

(Click here to return to the exercise.)

Exercise 2.24

In the [i, j] notation, i is the row number and j is the column number. In this case, airquality[, 3], we have j=3 and therefore asks for the 3rd column, not the 3rd row. To get the third row, we’d use airquality[3,] instead.
To extract the first five rows, we can use:

airquality[1:5,]
# or
airquality[c(1, 2, 3, 4, 5),]

First, we use names(airquality) to check the column numbers of the two variables. Wind is column 3 and Temp is column 4, so we can access them using airquality[,3] and airquality[,4] respectively. Thus, we can compute the correlation using:

cor(airquality[,3], airquality[,4])

Alternatively, we could refer to the variables using the column names:

cor(airquality[,"Wind"], airquality[,"Temp"])

To extract all columns except Temp and Wind, we use a minus sign - and a vector containing their indices:

airquality[, -c(3, 4)]

(Click here to return to the exercise.)

Exercise 2.25

To add the new variable, we can use:

bookstore$rev_per_minute <- bookstore$purchase / bookstore$visit_length

By using View(bookstore) or looking at the data in the Console window using bookstore, we see that the customer in question is on row 6 of the data. To replace the value, we can use:

bookstore$purchase[6] <- 16

Note that the value of rev_per_minute hasn’t been changed by this operation. We will therefore need to compute it again, to update its value:

# We can either compute it again for all customers:
bookstore$rev_per_minute <- bookstore$purchase / bookstore$visit_length
# ...or just for customer number 6:
bookstore$rev_per_minute[6] <- bookstore$purchase[6] / bookstore$visit_length[6]

(Click here to return to the exercise.)

Exercise 2.26

The coldest day was the day with the lowest temperature:

airquality[which.min(airquality$Temp),]

We see that the 5th day in the period, May 5, was the coldest, with a temperature of 56 degrees Fahrenheit.

To find out how many days the wind speed was greater than 17 mph, we use sum:

sum(airquality$Wind > 17)

Because there are so few days fulfilling this condition, we could also easily have solved this by just looking at the rows for those days and counting them:

airquality[airquality$Wind > 17,]

Missing data are represented by NA values in R, and so we wish to check how many NA elements there are in the Ozone vector. We do this by combining is.na and sum and find that there are 37 missing values:

sum(is.na(airquality$Ozone))

In this case, we need to use an ampersand & sign to combine the two conditions:

sum(airquality$Temp < 70 & airquality$Wind > 10)

We find that there are 22 such days in the data.

(Click here to return to the exercise.)

Exercise 2.27

We should use the breaks argument to set the interval bounds in cut:

airquality$TempCat <- cut(airquality$Temp,
                          breaks = c(50, 70, 90, 110))

To see the number of days in each category, we can use summary:

summary(airquality$TempCat)

(Click here to return to the exercise.)

Exercise 2.28

First we load and inspect the data:

library(datasauRus)
View(datasaurus_dozen)

Next, we compute summary statistics grouped by dataset:

aggregate(cbind(x, y) ~ dataset, data = datasaurus_dozen, FUN = mean)
aggregate(cbind(x, y) ~ dataset, data = datasaurus_dozen, FUN = sd)

by(datasaurus_dozen[, 2:3], datasaurus_dozen$dataset, cor)

The summary statistics for all datasets are virtually identical.

Next, we make scatterplots. Here is a solution using ggplot2:

library(ggplot2)
ggplot(datasaurus_dozen, aes(x, y, colour = dataset)) +
    geom_point() +
    facet_wrap(~ dataset, ncol = 3)

Clearly, the datasets are very different! This is a great example of how simply computing summary statistics is not enough. They tell a part of the story, yes, but only a part.

(Click here to return to the exercise.)

Exercise 2.29

We can use mutate to add the new variable:

age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)
visit_length <- c(5, 2, 20, 22, 12, 31, 9, 10, 11)
bookstore <- data.frame(age, purchase, visit_length)

library(magrittr)
bookstore |> mutate(rev_per_minute = purchase/visit_length) -> bookstore

(Click here to return to the exercise.)

Exercise 2.30

The variable X represents the empty column between Visit and VAS. In the X.1 column the researchers have made comments on two rows (rows 692 and 1153), causing R to read this otherwise empty column. If we wish, we can remove these columns from the data using the syntax from Section 2.11.1:

vas <- vas[, -c(4, 6)]

We remove the sep = ";" argument:

vas <- read.csv(file_path, dec = ",", skip = 4)

…and receive the following error message:

Error in read.table(file = file, header = header, sep = sep,
                    quote = quote,  : 
  duplicate 'row.names' are not allowed

By default, read.csv uses commas, ,, as column delimiters. In this case it fails to read the file, because it uses semicolons instead.

Next, we remove the dec = "," argument:

vas <- read.csv(file_path, sep = ";", skip = 4)
str(vas)

read.csv reads the data without any error messages, but now VAS has become a character vector. By default, read.csv assumes that the file uses decimal points rather than decimals commas. When we don’t specify that the file has decimal commas, read.csv interprets 0,4 as text rather than a number.

Next, we remove the skip = 4 argument:

vas <- read.csv(file_path, sep = ";", dec = ",")
str(vas)
names(vas)

read.csv looks for column names on the first row that it reads. skip = 4 tells the function to skip the first 4 rows of the .csv file (which in this case were blank or contain other information about the data). When it doesn’t skip those lines, the only text on the first row is Data updated 2020-04-25. This then becomes the name of the first column, and the remaining columns are named X, X.1, X.2, and so on.

Finally, we change skip = 4 to skip = 5:

vas <- read.csv(file_path, sep = ";", dec = ",", skip = 5)
str(vas)
names(vas)

In this case, read.csv skips the first 5 rows, which includes row 5, on which the variable names are given. It still looks for variable names on the first row that it reads though, meaning that the data values from the first observation become variable names instead of data points. An X is added at the beginning of the variable names, because variable names in R cannot begin with a number.

(Click here to return to the exercise.)

Exercise 2.31

We set file_path to the path for vas.csv and read the data as in Exercise 2.30::

vas <- read.csv(file_path, sep = ";", dec = ",", skip = 4)

First, we compute the mean VAS for each patient:

aggregate(VAS ~ ID, data = vas, FUN = mean)

Next, we compute the lowest and highest VAS recorded for each patient:

aggregate(VAS ~ ID, data = vas, FUN = min)
aggregate(VAS ~ ID, data = vas, FUN = max)

Finally, we compute the number of high-VAS days for each patient. One way to do this is to create a logical vector by VAS >= 7 and then compute its sum.

aggregate((VAS >= 7) ~ ID, data = vas, FUN = sum)

(Click here to return to the exercise.)

Exercise 2.32

First, set file_path to the path to projects-email.xlsx. Then we can use read.xlsx from the openxlsx package. The argument sheet lets us select which sheet to read:

library(openxlsx)
emails <- read.xlsx(file_path, sheet = 2)

View(emails)
str(emails)

To obtain a vector containing the email addresses without any duplicates, we apply unique to the vector containing the e-mail addresses. That vector is called E-mail with a hyphen -. We cannot access it using emails$E-mail, because R will interpret that as email$E - mail, and neither the vector email$E nor the variable mail exist. Instead, we can do one of the following:

unique(emails[,3])
unique(emails$"E-mail")

(Click here to return to the exercise.)

Exercise 2.33

We set file_path to the path to vas-transposed.csv and then read it:

vast <- read.csv(file_path)
dim(vast)
View(vast)

It is a data frame with 4 rows and 2366 variables.

Adding row.names = 1 lets us read the row names:

vast <- read.csv(file_path, row.names = 1)
View(vast)

This data frame only contains 2365 variables, because the leftmost column is now the row names and not a variable.

t lets us rotate the data into the format that we are used to. If we only apply t though, the resulting object is a matrix and not a data.frame. If we want it to be a data.frame, we must also make a call to as.data.frame:

vas <- t(vast)
class(vas)

vas <- as.data.frame(t(vast))
class(vas)

(Click here to return to the exercise.)

Chapter 3

Exercise 3.1

We create the matrix and run the test as follows:

bacteria <- matrix(c(15, 3, 17, 8),
            2, 2,
            byrow = TRUE,
            dimnames = list(c("E.coli", "K.pneumoniae"),
            c("Not resistant", "Resistant")))

chisq.test(bacteria)

The p-value is $0.4339$, so we have no evidence against $H_0$.

What about the criteria for running the test? We can check the expected counts:

chisq.test(bacteria)$expected

There is one cell (resistant E.coli) for which the expected count is less than 5. This means that 25 % of the cells have a count below 5, and the criteria for running the test are not met. We should consider using simulation to compute the p-value, or using Fisher’s exact test instead:

chisq.test(bacteria, simulate.p.value = TRUE, B = 9999)
fisher.test(bacteria)

(Click here to return to the exercise.)

Exercise 3.2

We have $x=440$ and $n=998$, and can compute the 99% Wilson confidence interval as follows:

library(MKinfer)
binomCI(x = 440, n = 998, conf.level = 0.99, method = "wilson")

The interval is $(0.40, 0.48)$.

(Click here to return to the exercise.)

Exercise 3.3

First, let’s compute the proportion of herbivores and carnivores that sleep for more than 7 hours a day:

Without pipes:

library(ggplot2)
herbivores <- msleep[msleep$vore == "herbi",]
n1 <- sum(!is.na(herbivores$sleep_total))
x1 <- sum(herbivores$sleep_total > 7, na.rm = TRUE)

carnivores <- msleep[msleep$vore == "carni",]
n2 <- sum(!is.na(carnivores$sleep_total))
x2 <- sum(carnivores$sleep_total > 7, na.rm = TRUE)

With pipes:

library(dplyr)

msleep |>
    filter(vore == "herbi") |> 
    select(sleep_total) |> 
    na.omit() |> 
    nrow() -> n1

msleep |>
    filter(vore == "herbi",
           sleep_total > 7) |> 
    count() |> 
    pull() -> x1

msleep |>
    filter(vore == "carni") |> 
    select(sleep_total) |> 
    na.omit() |> 
    nrow() -> n2

msleep |>
    filter(vore == "carni",
           sleep_total > 7) |> 
    count() |> 
    pull() -> x2

The proportions are 0.625 and 0.68, respectively. Checking ?binomDiffCI, we see that in order to obtain a confidence interval for the difference of the two proportions, we use binomDiffCI as follows:

library(MKinfer)
binomDiffCI(x1, x2, n1-x1, n2-x2)

(Click here to return to the exercise.)

Exercise 3.4

We compute the sample size as follows:

library(MKpower)
ssize.propCI(prop = 0.7, width = 0.05,
             conf.level = 0.95, method = "wilson")

The required sample size is $n=1296$ (the output says $1295.303$, which we round up to $1296$).

(Click here to return to the exercise.)

Exercise 3.5

We use t.test to perform the test.

Without pipes:

library(palmerpenguins)
t.test(body_mass_g ~ sex,
       data = subset(penguins,
                     species == "Chinstrap"))

With pipes:

library(palmerpenguins)
library(dplyr)

penguins |>
  filter(species == "Chinstrap") |> 
  t.test(body_mass_g ~ sex,
         data = _)

The p-value is 2.264e-06, i.e. $0.000002264$, and we reject the null hypothesis that the average weight is the same for females and males.

(Click here to return to the exercise.)

Exercise 3.6

When running the test, we need to use the argument mu = 195 and alternative = "greater".

Without pipes:

library(palmerpenguins)
t.test(flipper_length_mm ~ 1,
       data = subset(penguins,
                species == "Chinstrap" &
                sex == "male"),
       mu = 195,
       alternative = "greater")

# Or, in two steps:
new_data <- penguins[penguins$species ==
                       "Chinstrap" &
                     penguins$sex ==
                       "male",]
t.test(new_data$flipper_length_mm,
       mu = 195,
       alternative = "greater")

With pipes:

library(palmerpenguins)
library(dplyr)

penguins |>
  filter(species == "Chinstrap",
         sex == "male") |> 
  t.test(flipper_length_mm ~ 1,
         data = _,
         mu = 195,
       alternative = "greater")

The p-value is 1.7e-05, i.e. $0.000017$, and we reject the null hypothesis that the average flipper length is less than or equal to 195 mm.

(Click here to return to the exercise.)

Exercise 3.7

First, we assume that delta is 5 and that the standard deviation is 6, and want to find the $n$ required to achieve 95 % power at a 5 % significance level:

power.t.test(power = 0.95, delta = 5, sd = 6, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")

We see than $n$ needs to be at least 18 (17.04 rounded up) to achieve the desired power.

The actual sample size for this dataset was $n=34$. Let’s see what power that gives us:

power.t.test(n = 34, delta = 5, sd = 6, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")

The power is $0.999$. We’re more or less guaranteed to find statistical evidence that the mean is greater than 195 if the true mean is 200!

(Click here to return to the exercise.)

Exercise 3.8

We use t.test to perform each test.

Without pipes:

library(palmerpenguins)

p1 <- t.test(body_mass_g ~ sex,
       data = subset(penguins,
                     species == "Chinstrap"))$p.val
p2 <- t.test(body_mass_g ~ sex,
       data = subset(penguins,
                     species == "Adelie"))$p.val
p3 <- t.test(body_mass_g ~ sex,
       data = subset(penguins,
                     species == "Gentoo"))$p.val

# Unadjusted p-values:
c(p1, p2, p3)

# Adjusted p-values:
p.adjust(c(p1, p2, p3), method = "holm")

The p-values are still very small after adjustment, and we reject all three null hypotheses.

(Click here to return to the exercise.)

Chapter 4

Exercise 4.1

We change the background colour of the entire plot to lightblue.

p + theme(panel.background = element_rect(fill = "lightblue"),
          plot.background = element_rect(fill = "lightblue"))

Next, we change the font of the legend to serif.

p + theme(panel.background = element_rect(fill = "lightblue"),
          plot.background = element_rect(fill = "lightblue"),
          legend.text = element_text(family = "serif"),
          legend.title = element_text(family = "serif"))

We remove the grid:

p + theme(panel.background = element_rect(fill = "lightblue"),
          plot.background = element_rect(fill = "lightblue"),
          legend.text = element_text(family = "serif"),
          legend.title = element_text(family = "serif"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())

Finally, we change the colour of the axis ticks to orange and increase their width:

p + theme(panel.background = element_rect(fill = "lightblue"),
          plot.background = element_rect(fill = "lightblue"),
          legend.text = element_text(family = "serif"),
          legend.title = element_text(family = "serif"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          axis.ticks = element_line(colour = "orange", linewidth = 2))

It doesn’t look all that great, does it? Let’s just stick to the default theme in the remaining examples.

(Click here to return to the exercise.)

Exercise 4.2

We can use the bw argument to control the smoothness of the curves:

ggplot(diamonds, aes(carat, colour = cut)) +
      geom_density(bw = 0.2)

We can fill the areas under the density curves by adding fill to the aes:

ggplot(diamonds, aes(carat, colour = cut, fill = cut)) +
      geom_density(bw = 0.2)

Because the densities overlap, it’d be better to make the fill colours slightly transparent. We add alpha to the geom:

ggplot(diamonds, aes(carat, colour = cut, fill = cut)) +
      geom_density(bw = 0.2, alpha = 0.2)

A similar plot can be created using geom_density_ridges from the ggridges package. Note that you must set y = cut in the aes, because the densities should be separated by cut.

install.packages("ggridges")
library(ggridges)

ggplot(diamonds, aes(carat, cut, fill = cut)) +
      geom_density_ridges()

(Click here to return to the exercise.)

Exercise 4.3

We use xlim to set the boundaries of the x-axis and bindwidth to decrease the bin width:

ggplot(diamonds, aes(carat)) +
      geom_histogram(binwidth = 0.01) +
      xlim(0, 3)

It appears that carat values that are just above multiples of 0.25 are more common than other values. We’ll explore that next.

(Click here to return to the exercise.)

Exercise 4.4

We set the colours using the fill aesthetic:

ggplot(diamonds, aes(cut, price, fill = cut)) +
      geom_violin()

Next, we remove the legend:

ggplot(diamonds, aes(cut, price, fill = cut)) +
      geom_violin() +
      theme(legend.position = "none")

We add boxplots by adding an additional geom to the plot. Increasing the width of the violins and decreasing the width of the boxplots creates a better figure. We also move the fill = cut aesthetic from ggplot to geom_violin so that the boxplots use the default colours instead of different colours for each category.

ggplot(diamonds, aes(cut, price)) +
      geom_violin(aes(fill = cut), width = 1.25) +
      geom_boxplot(width = 0.1, alpha = 0.5) +
      theme(legend.position = "none")

Finally, we can create a horizontal version of the figure in the same way we did for boxplots in Section 2.19: by adding coord_flip() to the plot:

ggplot(diamonds, aes(cut, price)) +
      geom_violin(aes(fill = cut), width = 1.25) +
      geom_boxplot(width = 0.1, alpha = 0.5) +
      theme(legend.position = "none") +
      coord_flip()

(Click here to return to the exercise.)

Exercise 4.5

We can create an interactive scatterplot using:

myPlot <- ggplot(diamonds, aes(x, y,
                 text = paste("Row:", rownames(diamonds)))) +
      geom_point()

ggplotly(myPlot)

There are outliers along the y-axis on rows 24,068 and 49,190. There are also some points for which $x=0$. Examples include rows 11,183 and 49,558. It isn’t clear from the plot, but in total there are 8 such points, 7 of which have both $x=0$ and $y=0$. To view all such diamonds, you can use filter(diamonds, x==0). These observations must be due to data errors, since diamonds can’t have 0 width. The high $y$-values also seem suspicious - carat is a measure of diamond weight, and if these diamonds really were 10 times longer than others then we would probably expect them to have unusually high carat values as well (which they don’t).

(Click here to return to the exercise.)

Exercise 4.6

The two outliers are the only observations for which $y>20$, so we use that as our condition:

ggplot(diamonds, aes(x, y)) +
      geom_point() +
      geom_text(aes(label = ifelse(y > 20, rownames(diamonds), "")),
                hjust = 1.1)

(Click here to return to the exercise.)

Exercise 4.7

# Create a copy of diamonds, then replace x-values greater than 9
# with NA:
diamonds2 <- diamonds
diamonds2$x[diamonds2$x > 9] <- NA

## Create the scatterplot
ggplot(diamonds2, aes(carat, price, colour = is.na(x))) +
      geom_point()

In this plot, we see that virtually all high carat diamonds have missing x values. This seems to indicate that there is a systematic pattern to the missing data (which of course is correct in this case!), and we should proceed with any analyses of x with caution.

(Click here to return to the exercise.)

Exercise 4.8

The code below is an example of what your analysis can look like, with some remarks as comments:

# Investigate missing data
colSums(is.na(flights2))
# Not too much missing data in this dataset!
View(flights2[is.na(flights2$air_time),])
# Flights with missing data tend to have several missing variables.

# Ridge plots to compare different carriers (boxplots, facetted
# histograms and violin plots could also be used)
library(ggridges)
ggplot(flights2, aes(arr_delay, carrier, fill = carrier)) +
      geom_density_ridges() +
      theme(legend.position = "none") +
      xlim(-50, 250)
# Some airlines (e.g. EV) appear to have a larger spread than others

ggplot(flights2, aes(dep_delay, carrier, fill = carrier)) +
      geom_density_ridges() +
      theme(legend.position = "none") +
      xlim(-15, 100)
# Some airlines (e.g. EV) appear to have a larger spread others

ggplot(flights2, aes(air_time, carrier, fill = carrier)) +
      geom_density_ridges() +
      theme(legend.position = "none")
# VX only do long-distance flights, whereas MQ, FL and 9E only do
# shorter flights

# Make scatterplots and label outliers with flight numbers
ggplot(flights2, aes(dep_delay, arr_delay, colour = carrier)) +
      geom_point() +
      geom_text(aes(label = ifelse(arr_delay > 300,
                            paste("Flight", flight), "")),
                vjust = 1.2, hjust = 1)

ggplot(flights2, aes(air_time, arr_delay, colour = carrier)) +
      geom_point() +
      geom_text(aes(label = ifelse(air_time > 400 | arr_delay > 300,
                                   paste("Flight", flight), "")),
                vjust = 1.2, hjust = 1)

(Click here to return to the exercise.)

Exercise 4.9

To decrease the smoothness of the line, we use the span argument in geom_smooth. The default is geom_smooth(span = 0.75). Decreasing this values yields a very different fit:

ggplot(msleep, aes(brainwt, sleep_total)) + 
      geom_point() +
      geom_smooth(span = 0.25) +
      labs(x = "Brain weight (logarithmic scale)",
           y = "Total sleep time") +
      scale_x_log10()

More smoothing is probably preferable in this case. The relationship appears to be fairly weak, and appears to be roughly linear.

We can use the method argument in geom_smooth to fit a straight line using lm instead of LOESS:

ggplot(msleep, aes(brainwt, sleep_total)) + 
      geom_point() +
      geom_smooth(method = "lm") +
      labs(x = "Brain weight (logarithmic scale)",
           y = "Total sleep time") +
      scale_x_log10()

To remove the confidence interval from the plot, we set se = FALSE in geom_smooth:

ggplot(msleep, aes(brainwt, sleep_total)) + 
      geom_point() +
      geom_smooth(method = "lm", se = FALSE) +
      labs(x = "Brain weight (logarithmic scale)",
           y = "Total sleep time") +
      scale_x_log10()

Finally, we can change the colour of the smoothing line using the colour argument:

ggplot(msleep, aes(brainwt, sleep_total)) + 
      geom_point() +
      geom_smooth(method = "lm", se = FALSE, colour = "red") +
      labs(x = "Brain weight (logarithmic scale)",
           y = "Total sleep time") +
      scale_x_log10()

(Click here to return to the exercise.)

Exercise 4.10

Adding the geom_smooth geom with the default settings produces a trend line that does not capture seasonality:

autoplot(a10) +
      geom_smooth()

We can change the axes labels using labs:

autoplot(a10) +
      geom_smooth() +
      labs(x = "Year",
           y = "Sales ($ million)")

ggtitle adds a title to the figure:

autoplot(a10) +
      geom_smooth() +
      labs(x = "Year",
           y = "Sales ($ million)") +
      ggtitle("Anti-diabetic drug sales in Australia")

The colour argument can be passed to autoplot to change the colour of the time series line:

autoplot(a10, colour = "red") +
      geom_smooth() +
      labs(x = "Year",
           y = "Sales ($ million)") +
      ggtitle("Anti-diabetic drug sales in Australia")

(Click here to return to the exercise.)

Exercise 4.11

The text can be added by using annotate(geom = "text", ...). In order not to draw the text on top of the circle, you can shift the x-value of the text (the appropriate shift depends on the size of your plot window):

autoplot(gold) +
      annotate(geom = "point", x = spike_date, y = gold[spike_date], 
               size = 5, shape = 21, colour = "red",
               fill = "transparent") +
      annotate(geom = "text", x = spike_date - 100,
               y = gold[spike_date], 
               label = "Incorrect value!")

We can remove the erroneous value by replacing it with NA in the time series:

gold[spike_date] <- NA
autoplot(gold)

Finally, we can add a reference line using geom_hline:

autoplot(gold) +
      geom_hline(yintercept = 400, colour = "red")

(Click here to return to the exercise.)

Exercise 4.12

We can specify which variables to include in the plot as follows:

autoplot(elecdaily[, c("Demand", "Temperature")], facets = TRUE)

This produces a terrible-looking label for the y-axis, which we can remove by setting the y-label to NULL:

autoplot(elecdaily[, c("Demand", "Temperature")], facets = TRUE) +
      labs(y = NULL)

As before, we can add smoothers using geom_smooth:

autoplot(elecdaily[, c("Demand", "Temperature")], facets = TRUE) +
      geom_smooth() +
      labs(y = NULL)

(Click here to return to the exercise.)

Exercise 4.13

We set the size of the points using geom_point(size):

ggplot(elecdaily2, aes(Temperature, Demand, colour = day)) +
      geom_point(size = 0.5) +
      geom_path()

To add annotations, we use annotate and some code to find the days of the lowest and highest temperatures:

## Lowest temperature
lowest <- which.min(elecdaily2$Temperature)

## Highest temperature
highest <- which.max(elecdaily2$Temperature)

## We shift the y-values of the text so that it appears above
# the points
ggplot(elecdaily2, aes(Temperature, Demand, colour = day)) +
      geom_point(size = 0.5) +
      geom_path() +
      annotate(geom = "text", x = elecdaily2$Temperature[lowest],
               y = elecdaily2$Demand[lowest] + 4, 
               label = elecdaily2$day[lowest]) +
      annotate(geom = "text", x = elecdaily2$Temperature[highest],
               y = elecdaily2$Demand[highest] + 4, 
               label = elecdaily2$day[highest])

(Click here to return to the exercise.)

Exercise 4.14

We can specify aes(group) for a particular geom only as follows:

ggplot(Oxboys, aes(age, height, colour = Subject)) + 
      geom_point() + 
      geom_line(aes(group = Subject)) +
      geom_smooth(method = "lm", colour = "red", se = FALSE)

Subject is now used for grouping the points used to draw the lines (i.e. for geom_line), but not for geom_smooth, which now uses all the points to create a trend line showing the average height of the boys over time.

(Click here to return to the exercise.)

Exercise 4.15

Code for producing the three plots is given below:

library(fma)

# Time series plot
autoplot(writing) +
      geom_smooth() +
      labs(y = "Sales (francs)") +
      ggtitle("Sales of printing and writing paper")

# Seasonal plot
ggseasonplot(writing, year.labels = TRUE, year.labels.left = TRUE) +
      labs(y = "Sales (francs)") +
      ggtitle("Seasonal plot of sales of printing and writing paper")
# There is a huge dip in sales in August, when many French offices are
# closed due to holidays.

# stl-decomposition
autoplot(stl(writing, s.window = 365)) +
      ggtitle("Seasonal decomposition of paper sales time series")

(Click here to return to the exercise.)

Exercise 4.16

We use the cpt.var functions with the default settings:

library(forecast)
library(fpp2)
library(changepoint)
library(ggfortify)

# Plot the time series:
autoplot(elecdaily[,"Demand"])

# Plot points where there are changes in the variance:
autoplot(cpt.var(elecdaily[,"Demand"]))

The variance is greater in the beginning of the year, and then appears to be more or less constant. Perhaps this can be explained by temperature?

# Plot the time series:
autoplot(elecdaily[,"Temperature"])

We see that the high-variance period coincides with peaks and large oscillations in temperature, which would cause the energy demand to increase and decrease more than usual, making the variance greater.

(Click here to return to the exercise.)

Exercise 4.17

By adding a copy of the observation for month 12, with the Month value replaced by 0, we can connect the endpoints to form a continuous curve:

Cape_Town_weather[13,] <- Cape_Town_weather[12,]
Cape_Town_weather$Month[13] <- 0

ggplot(Cape_Town_weather, aes(Month, Temp_C)) +
      geom_line() +
      coord_polar() +
      xlim(0, 12)

(Click here to return to the exercise.)

Exercise 4.18

As for all ggplot2 plots, we can use ggtitle to add a title to the plot:

ggpairs(diamonds[, which(sapply(diamonds, class) == "numeric")],
        aes(colour = diamonds$cut, alpha = 0.5)) +
      ggtitle("Numeric variables in the diamonds dataset")

(Click here to return to the exercise.)

Exercise 4.19

We create the correlogram using ggcorr as follows:

ggcorr(diamonds[, which(sapply(diamonds, class) == "numeric")])

method allows us to control which correlation coefficient to use:

ggcorr(diamonds[, which(sapply(diamonds, class) == "numeric")],
       method = c("pairwise", "spearman"))

nbreaks is used to create a categorical colour scale:

ggcorr(diamonds[, which(sapply(diamonds, class) == "numeric")],
       method = c("pairwise", "spearman"),
       nbreaks = 5)

low and high can be used to control the colours at the endpoints of the scale:

ggcorr(diamonds[, which(sapply(diamonds, class) == "numeric")],
       method = c("pairwise", "spearman"),
       nbreaks = 5,
       low = "yellow", high = "black")

(Yes, the default colours are a better choice!)

(Click here to return to the exercise.)

Exercise 4.20

We replace colour = vore in the aes by fill = vore and add colour = "black", shape = 21 to geom_point. The points now get black borders, which makes them a bit sharper:

ggplot(msleep, aes(brainwt, sleep_total, fill = vore, size = bodywt)) +
      geom_point(alpha = 0.5, colour = "black", shape = 21) +
      labs(x = "log(Brain weight)",
           y = "Sleep total (h)") +
      scale_x_log10() +
      scale_size(range = c(1, 20), trans = "sqrt",
                 name = "Square root of\nbody weight") +
      scale_color_discrete(name = "Feeding behaviour")

We can use ggplotly to create an interactive version of the plot. Adding text to the aes allows us to include more information when hovering points:

library(plotly)
myPlot <- ggplot(msleep, aes(brainwt, sleep_total, fill = vore,
                             size = bodywt, text = name)) +
      geom_point(alpha = 0.5, colour = "black", shape = 21) +
      labs(x = "log(Brain weight)",
           y = "Sleep total (h)") +
      scale_x_log10() +
      scale_size(range = c(1, 20), trans = "sqrt",
                 name = "Square root of\nbody weight") +
      scale_color_discrete(name = "Feeding behaviour")

ggplotly(myPlot)

(Click here to return to the exercise.)

Exercise 4.21

We create the tile plot using geom_tile. By setting fun = max we obtain the highest price in each bin:

ggplot(diamonds, aes(table, depth, z = price)) +
      geom_tile(binwidth = 1, stat = "summary_2d", fun = max) +
      ggtitle("Highest prices for diamonds with different depths
              and tables")

We can create the bin plot using either geom_bin2d or geom_hex:

ggplot(diamonds, aes(carat, price)) +
      geom_bin2d(bins = 50)

Diamonds with carat around 0.3 and price around 1000 have the highest bin counts.

(Click here to return to the exercise.)

Exercise 4.22

VS2 and Ideal is the most common combination:

diamonds2 <- aggregate(carat ~ cut + clarity, data = diamonds,
                       FUN = length)
names(diamonds2)[3] <- "Count"
ggplot(diamonds2, aes(clarity, cut, fill = Count)) +
      geom_tile()

As for continuous variables, we can use geom_tile with the arguments stat = "summary_2d", fun = mean to display the average prices for different combinations. SI2 and Premium is the combination with the highest average price:

ggplot(diamonds, aes(clarity, cut, z = price)) +
      geom_tile(binwidth = 1, stat = "summary_2d", fun = mean) +
      ggtitle("Mean prices for diamonds with different
              clarities and cuts")

(Click here to return to the exercise.)

Exercise 4.23

We create the scatterplot using:

library(gapminder)
library(GGally)

gapminder2007 <- gapminder[gapminder$year == 2007,]

ggpairs(gapminder2007[, c("lifeExp", "pop", "gdpPercap")],
        aes(colour = gapminder2007$continent, alpha = 0.5),
        upper = list(continuous = "na"))

The interactive facetted bubble plot is created using:

library(plotly)

gapminder2007 <- gapminder[gapminder$year == 2007,]

myPlot <- ggplot(gapminder2007, aes(gdpPercap, lifeExp, size = pop,
                                    colour = country)) +
      geom_point(alpha = 0.5) +
      scale_x_log10() +
      scale_size(range = c(2, 15)) +
      scale_colour_manual(values = country_colors) +
      theme(legend.position = "none") +
      facet_wrap(~ continent)

ggplotly(myPlot)

Well done, you just visualised 5 variables in a facetted bubble plot!

(Click here to return to the exercise.)

Exercise 4.24

Fixed wing multi engine Boeings are the most common planes:

library(nycflights13)
library(ggplot2)

planes2 <- aggregate(tailnum ~ type + manufacturer, data = planes,
                       FUN = length)

ggplot(planes2, aes(type, manufacturer, fill = tailnum)) +
      geom_tile()

The fixed wing multi engine Airbus has the highest average number of seats:

ggplot(planes, aes(type, manufacturer, z = seats)) +
      geom_tile(binwidth = 1, stat = "summary_2d", fun = mean) +
      ggtitle("Number of seats for different planes")

The number of seats seems to have increased in the 1980’s, and then reached a plateau:

ggplot(planes, aes(year, seats)) +
      geom_point(aes(colour = engine)) +
      geom_smooth()

The plane with the largest number of seats is not an Airbus, but a Boeing 747-451. It can be found using planes[which.max(planes$seats),] or visually using plotly:

myPlot <- ggplot(planes, aes(year, seats,
                             text = paste("Tail number:", tailnum,
                                          "<br>Manufacturer:",
                                          manufacturer))) +
      geom_point(aes(colour = engine)) +
      geom_smooth()

ggplotly(myPlot)

Finally, we can investigate what engines were used during different time periods in several ways, for instance by differentiating engines by colour in our previous plot:

ggplot(planes, aes(year, seats)) +
      geom_point(aes(colour = engine)) +
      geom_smooth()

(Click here to return to the exercise.)

Exercise 4.25

First, we compute the principal components:

library(ggplot2)

# Compute principal components:
pca <- prcomp(diamonds[, which(sapply(diamonds, class) == "numeric")],
              center = TRUE, scale. = TRUE)

To see the proportion of variance explained by each component, we use summary:

summary(pca)

The first PC accounts for 65.5 % of the total variance. The first two account for 86.9 % and the first three account for 98.3 % of the total variance, meaning that 3 components are needed to account for at least 90 % of the total variance.

To see the loadings, we type:

pca

The first PC appears to measure size: it is dominated by carat, x, y and z, which all are size measurements. The second PC appears is dominated by depth and table and is therefore a summary of those measures.

To compute the correlation, we use cor:

cor(pca$x[,1], diamonds$price)

The (Pearson) correlation is 0.89, which is fairly high. Size is clearly correlated to price!

To see if the first two principal components be used to distinguish between diamonds with different cuts, we make a scatterplot:

autoplot(pca, data = diamonds, colour = "cut")

The points are mostly gathered in one large cloud. Apart from the fact that very large or very small values of the second PC indicates that a diamond has a Fair cut, the first two principal components seem to offer little information about a diamond’s cut.

(Click here to return to the exercise.)

Exercise 4.26

We create the scatterplot with the added arguments:

seeds <- read.table("https://tinyurl.com/seedsdata",
        col.names = c("Area", "Perimeter", "Compactness",
         "Kernel_length", "Kernel_width", "Asymmetry",
         "Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)

pca <- prcomp(seeds[,-8], center = TRUE, scale. = TRUE)

library(ggfortify)
autoplot(pca, data = seeds, colour = "Variety",
         loadings = TRUE, loadings.label = TRUE)

The arrows for Area, Perimeter, Kernel_length, Kernel_width and Groove_length are all about the same length and are close to parallel the x-axis, which shows that these have similar impact on the first principal component but not the second, making the first component a measure of size. Asymmetry and Compactness both affect the second component, making it a measure of shape. Compactness also affects the first component, but not as much as the size variables do.

(Click here to return to the exercise.)

Exercise 4.27

We change the hc_method and hc_metric arguments to use complete linkage and the Manhattan distance:

library(cluster)
library(factoextra)
votes.repub |> scale() |>
                hcut(k = 5, hc_func = "agnes",
                     hc_method = "complete",
                     hc_metric = "manhattan") |> 
                fviz_dend()

fviz_dend produces ggplot2 plots. We can save the plots from both approaches and then plot them side-by-side using patchwork as in Section 4.4:

votes.repub |> scale() |>
                hcut(k = 5, hc_func = "agnes",
                     hc_method = "average",
                     hc_metric = "euclidean") |> 
                fviz_dend() -> dendro1
votes.repub |> scale() |>
                hcut(k = 5, hc_func = "agnes",
                     hc_method = "complete",
                     hc_metric = "manhattan") |> 
                fviz_dend() -> dendro2

library(patchwork)
dendro1 / dendro2

Alaska and Vermont are clustered together in both cases. The red leftmost cluster is similar but not identical, including Alabama, Georgia and Louisiana.

To compare the two dendrograms in a different way, we can use tanglegram. Setting k_labels = 5 and k_branches = 5 gives us 5 coloured clusters:

votes.repub |> scale() |>
                hcut(k = 5, hc_func = "agnes",
                     hc_method = "average",
                     hc_metric = "euclidean") -> clust1
votes.repub |> scale() |>
                hcut(k = 5, hc_func = "agnes",
                     hc_method = "complete",
                     hc_metric = "manhattan") -> clust2

library(dendextend)
tanglegram(as.dendrogram(clust1), 
           as.dendrogram(clust2),
           k_labels = 5,
           k_branches = 5)

Note that the colours of the lines connecting the two dendrograms are unrelated to the colours of the clusters.

(Click here to return to the exercise.)

Exercise 4.28

Using the default settings in agnes, we can do the clustering using:

library(cluster)
library(magrittr)
USArrests |> scale() |>
                agnes() |> 
                plot(which = 2)

Maryland is clustered with New Mexico, Michigan and Arizona, in that order.

(Click here to return to the exercise.)

Exercise 4.29

We draw a heatmap, with the data standardised in the column direction because we wish to cluster the observations rather than the variables:

library(cluster)
library(magrittr)
USArrests |> as.matrix() |> heatmap(scale = "col")

You may want to increase the height of your Plot window so that the names of all states are displayed properly.

The heatmap shows that Maryland, and the states similar to it, has higher crime rates than most other states. There are a few other states with high crime rates in other clusters, but those tend to only have a high rate for one crime (e.g. Georgia, which has a very high murder rate), whereas states in the cluster that Maryland is in have high rates for all or almost all types of violent crime.

(Click here to return to the exercise.)

Exercise 4.30

First, we inspect the data:

library(cluster)
?chorSub

# Scatterplot matrix:
library(GGally)
ggpairs(chorSub)

There are a few outliers, so it may be a good idea to use pam as it is less affected by outliers than kmeans. Next, we draw some plots to help use choose $k$:

library(factoextra)
library(magrittr)
chorSub |> scale() |> 
                fviz_nbclust(pam, method = "wss")
chorSub |> scale() |> 
                fviz_nbclust(pam, method = "silhouette")
chorSub |> scale() |> 
                fviz_nbclust(pam, method = "gap")

There is no pronounced elbow in the WSS plot, although slight changes appear to occur at $k=3$ and $k=7$. Judging by the silhouette plot, $k=3$ may be a good choice, while the gap statistic indicates that $k=7$ would be preferable. Let’s try both values:

# k = 3:
chorSub |> scale() |> 
                pam(k = 3) -> kola_cluster
fviz_cluster(kola_cluster, geom = "point")

# k = 7:
chorSub |> scale() |> 
                pam(k = 7) -> kola_cluster
fviz_cluster(kola_cluster, geom = "point")

Neither choice is clearly superior. Remember that clustering is an exploratory procedure, that we use to try to better understand our data.

The plot for $k=7$ may look a little strange, with two largely overlapping clusters. Bear in mind though, that the clustering algorithm uses all 10 variables and not just the first two principal components, which are what is shown in the plot. The differences between the two clusters isn’t captured by the first two principal components.

(Click here to return to the exercise.)

Exercise 4.31

First, we try to find a good number of clusters:

library(factoextra)
library(magrittr)
USArrests |> scale() |> 
                fviz_nbclust(fanny, method = "wss")
USArrests |> scale() |> 
                fviz_nbclust(fanny, method = "silhouette")

We’ll go with $k=2$ clusters:

library(cluster)
USArrests |> scale() |>
                fanny(k = 2) -> USAclusters

# Show memberships:
USAclusters$membership

Maryland is mostly associated with the first cluster. Its neighbouring state New Jersey is equally associated with both clusters.

(Click here to return to the exercise.)

Exercise 4.32

We do the clustering and plot the resulting clusters:

library(cluster)
library(mclust)
kola_cluster <- Mclust(scale(chorSub))
summary(kola_cluster)

# Plot results with ellipsoids:
library(factoextra)
fviz_cluster(kola_cluster, geom = "point", ellipse.type = "norm")

Three clusters, that overlap substantially when the first two principal components are plotted, are found.

(Click here to return to the exercise.)

Chapter 5

Exercise 5.1

as.logical returns FALSE for 0 and TRUE for all other numbers:

as.logical(0)
as.logical(1)
as.logical(14)
as.logical(-8.889)
as.logical(pi^2 + exp(18))

When the as. functions are applied to vectors, they convert all values in the vector:

as.character(c(1, 2, 3, pi, sqrt(2)))

The is. functions return a logical: TRUE if the variable is of the type and FALSE otherwise:

is.numeric(27)
is.numeric("27")
is.numeric(TRUE)

The is. functions show that NA in fact is a (special type of) logical. This is also verified by the documentation for NA:

is.logical(NA)
is.numeric(NA)
is.character(NA)
?NA

(Click here to return to the exercise.)

Exercise 5.2

We set file_path to the path for vas.csv and load the data as in Exercise 2.30:

vas <- read.csv(file_path, sep = ";", dec = ",", skip = 4)

To split the VAS vector by patient ID, we use split:

vas_split <- split(vas$VAS, vas$ID)

To access the values for patient 212, either of the following works:

vas_split$`212`
vas_split[[12]]

(Click here to return to the exercise.)

Exercise 5.3

To convert the proportions to percentages with one decimal place, we must first multiply them by 100 and then round them:

props <- c(0.1010, 0.2546, 0.6009, 0.0400, 0.0035)
round(100 * props, 1)

The cumulative maxima and minima are computed using cummax and cummin:

cummax(airquality$Temp)
cummin(airquality$Temp)

The minimum during the period occurs on the 5th day, whereas the maximum occurs during day 120.

To find runs of days with temperatures above 80, we use rle:

runs <- rle(airquality$Temp > 80)

To find runs with temperatures above 80, we extract the length of the runs for which runs$values is TRUE:

runs$lengths[runs$values == TRUE]

We see that the longest run was 23 days.

(Click here to return to the exercise.)

Exercise 5.4

On virtually all systems, the largest number that R can represent as a floating point is 1.797693e+308. You can find this by gradually trying larger and larger numbers:

1e+100
# ...
1e+308
1e+309  # The largest number must be between 1e+308 and 1e+309!
# ...
1.797693e+308
1.797694e+308

If we place the ^2 inside sqrt the result becomes 0:

sqrt(2)^2 - 2  # Not 0
sqrt(2^2) - 2  # 0

(Click here to return to the exercise.)

Exercise 5.5

We re-use the solution from Exercise 2.27:

airquality$TempCat <- cut(airquality$Temp,
                          breaks = c(50, 70, 90, 110))

Next, we change the levels’ names:

levels(airquality$TempCat) <- c("Mild", "Moderate", "Hot")

Finally, we combine the last two levels:

levels(airquality$TempCat)[2:3] <- "Hot"

(Click here to return to the exercise.)

Exercise 5.6

1 We start by converting the vore variable to a factor:

library(ggplot2)
str(msleep) # vore is a character vector!

msleep$vore <- factor(msleep$vore)
levels(msleep$vore)

The levels are ordered alphabetically, which is the default in R.

To compute grouped means, we use aggregate:

means <- aggregate(sleep_total ~ vore, data = msleep, FUN = mean)

Finally, we sort the factor levels according to their sleep_total means:

# Check order:
means
# New order: herbi, carni, omni, insecti.

# We could set the new order manually:
msleep$vore <- factor(msleep$vore,
                      levels = c("herbi", "carni", "omni", "insecti"))

# Alternatively, rank and match can be used to get the new order of
# the levels:
?rank
?match
ranks <- rank(means$sleep_total)
new_order <- match(1:4, ranks)

msleep$vore <- factor(msleep$vore,
                      levels = levels(msleep$vore)[new_order])

(Click here to return to the exercise.)

Exercise 5.7

First, we set file_path to the path to handkerchiefs.csv and import it to the data frame pricelist:

pricelist <- read.csv(file_path)

nchar counts the number of characters in strings:

?nchar
nchar(pricelist$Italian.handkerchiefs)

We can use grep and a regular expression to see that there are 2 rows of the Italian.handkerchief column that contain numbers:

grep("[[:digit:]]", pricelist$Italian.handkerchiefs)

To extract the prices in shillings (S) and pence (D) from the Price column and store these in two new numeric variables in our data frame, we use strsplit, unlist and matrix as follows:

# Split strings at the space between the numbers and the letters:
Price_split <- strsplit(pricelist$Price, " ")
Price_split <- unlist(Price_split)
Price_matrix <- matrix(Price_split, nrow = length(Price_split)/4,
                       ncol = 4, byrow = TRUE)

# Add to the data frame:
pricelist$PriceS <- as.numeric(Price_matrix[,1])
pricelist$PriceD <- as.numeric(Price_matrix[,3])

(Click here to return to the exercise.)

Exercise 5.8

We set file_path to the path to oslo-biomarkers.xlsx and load the data:

library(openxlsx)

oslo <- as.data.table(read.xlsx(file_path))

To find out how many patients were included in the study, we use strsplit to split the ID-timepoint string, and then unique:

oslo_id <- unlist(strsplit(oslo$"PatientID.timepoint", "-"))

oslo_id_matrix <- matrix(oslo_id, nrow = length(oslo_id)/2,
                         ncol = 2, byrow = TRUE)

unique(oslo_id_matrix[,1])
length(unique(oslo_id_matrix[,1]))

We see that 118 patients were included in the study.

(Click here to return to the exercise.)

Exercise 5.9

"$g" matches strings ending with g:

contacts$Address[grep("g$", contacts$Address)]

"^[^[[:digit:]]" matches strings beginning with anything but a digit:

contacts$Address[grep("^[^[[:digit:]]", contacts$Address)]

"a(s|l)" matches strings containing either as or al:

contacts$Address[grep("a(s|l)", contacts$Address)]

"[[:lower:]]+[.][[:lower:]]+" matches strings containing any number of lowercase letters, followed by a period ., followed by any number of lowercase letters:

contacts$Address[grep("[[:lower:]]+[.][[:lower:]]+", contacts$Address)]

(Click here to return to the exercise.)

Exercise 5.10

We want to extract all words, i.e. segments of characters separated by white spaces. First, let’s create the string containing example sentences:

x <- "This is an example of a sentence, with 10 words. Here are 4 more!"

Next, we split the string at the spaces:

x_split <- strsplit(x, " ")

Note that x_split is a list. To turn this into a vector, we use unlist

x_split <- unlist(x_split)

Finally, we can use gsub to remove the punctuation marks, so that only the words remain:

gsub("[[:punct:]]", "", x_split)

If you like, you can put all steps on a single row:

gsub("[[:punct:]]", "", unlist(strsplit(x, " ")))

…or reverse the order of the operations:

unlist(strsplit(gsub("[[:punct:]]", "", x), " "))

(Click here to return to the exercise.)

Exercise 5.11

The functions are used to extract the weekday, month and quarter for each date:

weekdays(dates)
months(dates)
quarters(dates)

julian can be used to compute the number of days from a specific date (e.g. 1970-01-01) to each date in the vector:

julian(dates, origin = as.Date("1970-01-01", format = "%Y-%m-%d"))

(Click here to return to the exercise.)

Exercise 5.12

On most systems, converting the three variables to Date objects using as.Date yields correct dates without times:

as.Date(c(time1, time2, time3))

We convert time1 to a Date object and add 1 to it:

as.Date(time1) + 1

The result is 2020-04-02, i.e. adding 1 to the Date object has added 1 day to it.

We convert time3 and time1 to Date objects and subtract them:

as.Date(time3) - as.Date(time1)

The result is a difftime object, printed as Time difference of 2 days. Note that the times are ignored, just as before.

We convert time2 and time1 to Date objects and subtract them:

as.Date(time2) - as.Date(time1)

The result is printed as Time difference of 0 days, because the difference in time is ignored.

We convert the three variables to POSIXct date and time objects using as.POSIXct without specifying the date format:

as.POSIXct(c(time1, time2, time3))

On most systems, this yields correctly displayed dates and times.

We convert time3 and time1 to POSIXct objects and subtract them:

as.POSIXct(time3) - as.POSIXct(time1)

This time out, time is included when the difference is computed, and the output is Time difference of 2.234722 days.

We convert time2 and time1 to POSIXct objects and subtract them:

as.POSIXct(time2) - as.POSIXct(time1)

In this case, the difference is presented in hours: Time difference of 1.166667 hours. In the next step, we take control over the units shown in the output.

difftime can be used to control what units are used for expressing differences between two timepoints:

difftime(as.POSIXct(time3), as.POSIXct(time1), units = "hours")

The out is Time difference of 53.63333 hours.

(Click here to return to the exercise.)

Exercise 5.13

Using the first option, the Date becomes the first day of the quarter. Using the second option, it becomes the last day of the quarter instead. Both can be useful for presentation purposes - which you prefer is a matter of taste.
To convert the quarter-observations to the first day of their respective quarters, we use as.yearqtr as follows:

library(zoo)
as.Date(as.yearqtr(qvec2, format = "Q%q/%y"))
as.Date(as.yearqtr(qvec3, format = "Q%q-%Y"))

%q, %y,and %Y are date tokens. The other letters and symbols in the format argument simply describe other characters included in the format.

(Click here to return to the exercise.)

Exercise 5.14

The x-axis of the data can be changed in multiple ways. A simple approach is the following:

## Create a new data frame with the correct dates and the demand data:
dates <- seq.Date(as.Date("2014-01-01"), as.Date("2014-12-31"),
                  by = "day")
elecdaily2 <- data.frame(dates = dates, demand = elecdaily[,1])

ggplot(elecdaily2, aes(dates, demand)) +
      geom_line()

A more elegant approach relies on the xts package for time series:

library(xts)
## Convert time series to an xts object:
dates <- seq.Date(as.Date("2014-01-01"), as.Date("2014-12-31"),
                  by = "day")
elecdaily3 <- xts(elecdaily, order.by = dates)

autoplot(elecdaily3[,"Demand"])

(Click here to return to the exercise.)

Exercise 5.15

## First, create a data frame with better formatted dates
a102 <- as.data.frame(a10)
a102$Date <- seq.Date(as.Date("1991-07-01"), as.Date("2008-06-01"),
                      by = "month")

## Create the plot object
myPlot <- ggplot(a102, aes(Date, x)) +
      geom_line() +
      labs(x = "Sales")
      
## Create the interactive plot
ggplotly(myPlot)

(Click here to return to the exercise.)

Exercise 5.16

We set file_path to the path for vas.csv and read the data as in Exercise 2.30 and convert it to a data.table (the last step being optional if we’re only using dplyr for this exercise):

vas <- read.csv(file_path, sep = ";", dec = ",", skip = 4)
vas <- as.data.table(vas)

A better option is to achieve the same result in a single line by using the fread function from data.table:

vas <- fread(file_path, sep = ";", dec = ",", skip = 4)

First, we remove the columns X and X.1:

With data.table:

vas[, c("X", "X.1") := NULL]

With dplyr:

vas |> select(-X, -X.1) -> vas

Second, we add a dummy variable called highVAS that indicates whether a patient’s VAS is 7 or greater on any given day:

With data.table:

vas[, highVAS := VAS >= 7]

With dplyr:

vas |> mutate(highVAS = VAS >= 7) -> vas

(Click here to return to the exercise.)

Exercise 5.17

We re-use the solution from Exercise 2.27:

airquality$TempCat <- cut(airquality$Temp,
                          breaks = c(50, 70, 90, 110))

aq <- data.table(airquality)

Next, we change the levels’ names:

With data.table:

new_names = c("Mild", "Moderate",
              "Hot")
aq[.(TempCat = levels(TempCat),
         to = new_names),
       on = "TempCat",
       TempCat := i.to]
aq[,TempCat := droplevels(TempCat)]

With dplyr:

aq |> mutate(TempCat = case_match(TempCat,
              "(50,70]" ~ "Mild",
              "(70,90]" ~ "Moderate",
              "(90,110]" ~ "Hot")) -> aq

Finally, we combine the last two levels:

With data.table:

aq[.(TempCat = c("Moderate", "Hot"),
         to = "Hot"),
    on = "TempCat", TempCat := i.to]
aq[, TempCat := droplevels(TempCat)]

With dplyr:

aq |> mutate(TempCat = case_match(TempCat,
            "Mild" ~ "Mild",
            c("Moderate", "Hot") ~ "Hot"))

(Click here to return to the exercise.)

Exercise 5.18

We set file_path to the path for vas.csv and read the data as in Exercise 2.30 using fread to import it as a data.table:

vas <- fread(file_path, sep = ";", dec = ",", skip = 4)

First, we compute the mean VAS for each patient:

With data.table:

vas[, mean(VAS, na.rm = TRUE), ID]

With dplyr:

vas |> group_by(ID) |>
       summarise(meanVAS = 
          mean(VAS, na.rm = TRUE))

Next, we compute the lowest and highest VAS recorded for each patient:

With data.table:

vas[, .(min = min(VAS,
                  na.rm = TRUE),
        max = max(VAS,
                  na.rm = TRUE)),
        ID]

With dplyr:

vas |> group_by(ID) |>
       summarise(min = min(VAS,
                 na.rm = TRUE),
                 max = max(VAS,
                 na.rm = TRUE))

Finally, we compute the number of high-VAS days for each patient. We can compute the sum directly:

With data.table:

vas[, sum(VAS >= 7, na.rm = TRUE),
    ID]

With dplyr:

vas |> group_by(ID) |>
       summarise(highVASdays = 
        sum(VAS >= 7, na.rm = TRUE))

Alternatively, we can do this by first creating a dummy variable for high-VAS days:

With data.table:

vas[, highVAS := VAS >=7]
vas[, sum(highVAS, na.rm = TRUE),
    ID]

With dplyr:

vas |> mutate(highVAS = VAS >= 7) -> vas
vas |> group_by(ID) |>
       summarise(highVASdays =
                sum(highVAS,
                    na.rm = TRUE))

(Click here to return to the exercise.)

Exercise 5.19

First we load the data and convert it to a data.table (the last step being optional if we’re only using dplyr for this exercise):

library(datasauRus)
dd <- as.data.table(datasaurus_dozen)

Next, we compute summary statistics grouped by dataset:

With data.table:

dd[, .(mean_x = mean(x),
        mean_y = mean(y),
        sd_x = sd(x),
        sd_y = sd(y),
        cor = cor(x,y)),
        dataset]

With dplyr:

dd |> group_by(dataset) |>
       summarise(mean_x = mean(x),
        mean_y = mean(y),
        sd_x = sd(x),
        sd_y = sd(y),
        cor = cor(x,y))

The summary statistics for all datasets are virtually identical.

Next, we make scatterplots. Here is a solution using ggplot2:

library(ggplot2)
ggplot(datasaurus_dozen, aes(x, y, colour = dataset)) +
    geom_point() +
    facet_wrap(~ dataset, ncol = 3)

Clearly, the datasets are very different! This is a great example of how simply computing summary statistics is not enough. They tell a part of the story, yes, but only a part.

(Click here to return to the exercise.)

Exercise 5.20

We set file_path to the path for vas.csv and read the data as in Exercise 2.30 using fread to import it as a data.table:

library(data.table)
vas <- fread(file_path, sep = ";", dec = ",", skip = 4)

To fill in the missing values, we can now do as follows:

With data.table:

vas[, Visit := nafill(
      Visit, "locf")]

With tidyr:

vas |> fill(Visit)

(Click here to return to the exercise.)

Exercise 5.21

We set file_path to the path to ucdp-onesided-191.csv and load the data as a data.table using fread:

library(dplyr)
library(data.table)

ucdp <- fread(file_path)

First, we filter the rows so that only conflicts that took place in Colombia are retained.

With data.table:

colombia <- ucdp[location ==
                   "Colombia",]

With dplyr:

ucdp |> filter(location ==
         "Colombia") -> colombia

To list the number of different actors responsible for attacks, we can use unique:

unique(colombia$actor_name)

We see that there were attacks by 7 different actors during the period.

To find the number of fatalities caused by government attacks on civilians, we first filter the data to only retain rows where the actor name contains the word government:

With data.table:

gov <- ucdp[actor_name %like%
              "[gG]overnment",]

With dplyr:

ucdp |> filter(grepl("[gG]overnment",
                actor_name)
                ) -> gov

It may be of interest to list the governments involved in attacks on civilians:

unique(gov$actor_name)

To estimate the number of fatalities cause by these attacks, we sum the fatalities from each attack:

sum(gov$best_fatality_estimate)

(Click here to return to the exercise.)

Exercise 5.22

We set file_path to the path to oslo-biomarkers.xlsx and load the data:

library(dplyr)
library(data.table)
library(openxlsx)

oslo <- as.data.table(read.xlsx(file_path))

First, we select only the measurements from blood samples taken at 12 months. These are the only observations where the PatientID.timepoint column contains the word months:

With data.table:

oslo[PatientID.timepoint %like%
       "months",]

With dplyr:

oslo |> filter(grepl("months",
           PatientID.timepoint))

Second, we select only the measurements from the patient with ID number 6. Note that we cannot simply search for strings containing a 6, as we then also would find measurements from other patients taken at 6 weeks, as well as patients with a 6 in their ID number, e.g. patient 126. Instead, we search for strings beginning with 6-:

With data.table:

oslo[PatientID.timepoint %like%
       "^6[-]",]

With dplyr:

oslo |> filter(grepl("^6[-]",
           PatientID.timepoint))

(Click here to return to the exercise.)

Exercise 5.23

We set file_path to the path to ucdp-onesided-191.csv and load the data as a data.table using fread:

library(dplyr)
library(data.table)

ucdp <- fread(file_path)

Next, we select the actor_name, year, best_fatality_estimate and location columns:

With data.table:

ucdp[, .(actor_name, year,
         best_fatality_estimate,
         location)]

With dplyr:

ucdp |> select(actor_name, year,
          best_fatality_estimate,
          location)

(Click here to return to the exercise.)

Exercise 5.24

We set file_path to the path to oslo-biomarkers.xlsx and load the data:

library(dplyr)
library(data.table)
library(openxlsx)

oslo <- as.data.table(read.xlsx(file_path))

We then order the data by the PatientID.timepoint column:

With data.table:

oslo[order(PatientID.timepoint),]

With dplyr:

oslo |> arrange(PatientID.timepoint)

Note that because PatientID.timepoint is a character column, the rows are now ordered in alphabetical order, meaning that patient 1 is followed by 100, 101, 102, and so on. To order the patients in numerical order, we must first split the ID and timepoints into two different columns. We’ll see how to do that in the next section, and try it out on the oslo data in Exercise 5.25.

(Click here to return to the exercise.)

Exercise 5.25

We set file_path to the path to oslo-biomarkers.xlsx and load the data:

library(dplyr)
library(tidyr)
library(data.table)
library(openxlsx)

oslo <- as.data.table(read.xlsx(file_path))

First, we split the PatientID.timepoint column:

With data.table:

oslo[, c("PatientID",
         "timepoint") :=
  tstrsplit(PatientID.timepoint, "-",
                 fixed = TRUE)]

With tidyr:

oslo |> separate(PatientID.timepoint,
           into = c("PatientID",
                   "timepoint"),
           sep = "-") -> oslo

Next, we reformat the patient ID to a numeric and sort the table:

With data.table:

oslo[, PatientID :=
       as.numeric(PatientID)]
oslo[order(PatientID),]

With dplyr:

oslo |> mutate(PatientID =
          as.numeric(PatientID)) |> 
          arrange(PatientID)

Finally, we reformat the data from long to wide, keeping the IL-8 and VEGF-A measurements. We store it as oslo2, knowing that we’ll need it again in Exercise 5.26.

With data.table:

oslo2 <- dcast(oslo,
          PatientID ~ timepoint,
          value.var = c("IL-8", "VEGF-A"))

With tidyr:

oslo |> pivot_wider(id_cols =
                        PatientID,
           names_from = timepoint,
           values_from = 
             c("IL-8", "VEGF-A")
           ) -> oslo2

(Click here to return to the exercise.)

Exercise 5.26

We use the oslo2 data frame that we created in Exercise 5.26. In addition, we set file_path to the path to oslo-covariates.xlsx and load the data:

library(dplyr)
library(data.table)
library(openxlsx)

covar <- as.data.table(read.xlsx(file_path))

First, we merge the wide data frame from Exercise 5.25 with the oslo-covariates.xlsx data, using patient ID as key. A left join, where we only keep data for patients with biomarker measurements, seems appropriate here. We see that both datasets have a column named PatientID, which we can use as our key.

With data.table:

merge(oslo2, covar, 
      all.x = TRUE,
      by = "PatientID")

With dplyr:

oslo2 |> left_join(covar,
           by = "PatientID")

Next, we use the oslo-covariates.xlsx data to select data for smokers from the wide data frame using a semijoin. The Smoker.(1=yes,.2=no) column contains information about smoking habits. First we create a table for filtering:

With data.table:

filter_data <-
  covar[`Smoker.(1=yes,.2=no)`
                     == 1,]

With dplyr:

covar |>
  filter(`Smoker.(1=yes,.2=no)`
          == 1) -> filter_data

Next, we perform the semijoin:

With data.table:

setkey(oslo2, PatientID)
oslo2[oslo2[filter_data,
            which = TRUE]]

With dplyr:

oslo2 |> semi_join(filter_data,
            by = "PatientID")

(Click here to return to the exercise.)

Exercise 5.27

We read the HTML file and extract the table:

library(rvest)

wiki <- read_html("https://en.wikipedia.org/wiki/List_of_keytars")
keytars <- html_table(html_nodes(wiki, "table")[[1]], fill = TRUE)

We note that some non-numeric characters cause Dates to be a character vector:

str(keytars)
keytars$Dates

Noting that the first four characters in each element of the vector contain the year, we can use substr to only keep those characters. Finally, we use as.numeric to convert the text to numbers:

keytars$Dates <- substr(keytars$Dates, 1, 4)
keytars$Dates <- as.numeric(keytars$Dates)
keytars$Dates

(Click here to return to the exercise.)

Chapter 6

Exercise 6.1

The formula for converting a temperature $F$ measured in Fahrenheit to a temperature $C$ measured in Celsius is $C=(F-32)\cdot 5/9$. Our function becomes:

FtoC <- function(F)
{
    C <- (F-32)*5/9
    return(C)
}

To apply it to the Temp column of airquality:

FtoC(airquality$Temp)

(Click here to return to the exercise.)

Exercise 6.2

We want out function to take a vector as input and return a vector containing its minimum and the maximum, without using min and max:

minmax <- function(x)
{
      # Sort x so that the minimum becomes the first element
      # and the maximum becomes the last element:
      sorted_x <- sort(x)
      min_x <- sorted_x[1]
      max_x <- sorted_x[length(sorted_x)]
      return(c(min_x, max_x))
}

# Check that it works:
x <- c(3, 8, 1, 4, 5)
minmax(x) # Should be 1 and 8

We want a function that computes the mean of the squared values of a vector using mean, and that takes additional arguments that it passes on to mean (e.g. na.rm):

mean2 <- function(x, ...)
{
      return(mean(x^2, ...))
}

# Check that it works:
x <- c(3, 2, 1)
mean2(x) # Should be 14/3=4.666...

# With NA:
x <- c(3, 2, NA)
mean2(x) # Should be NA
mean2(x, na.rm = TRUE) # Should be 13/2=6.5

(Click here to return to the exercise.)

Exercise 6.3

We use cat to print a message about missing values, sum(is.na(.)) to compute the number of missing values, na.omit to remove rows with missing data and then summary to print the summary:

na_remove <- . %T>% {cat("Missing values:", sum(is.na(.)), "\n")} |>
                    na.omit |> summary
  
na_remove(airquality)

(Click here to return to the exercise.)

Exercise 6.4

The following operator allows us to plot y against x:

`%against%` <- function(y, x) { plot(x, y) }

Let’s try it out:

airquality$Wind %against% airquality$Temp

Or, if we want to use ggplot2 instead of base graphics:

library(ggplot2)
`%against%` <- function(y, x) { 
    df <- data.frame(x, y)
    ggplot(df, aes(x, y)) +
        geom_point()
}

airquality$Wind %against% airquality$Temp

(Click here to return to the exercise.)

Exercise 6.5

FALSE: x is not greater than 2.
TRUE: | means that at least one of the conditions need to be satisfied, and x is greater than z.
FALSE: & means that both conditions must be satisfied, and x is not greater than y.
TRUE: the absolute value of x*z is 6, which is greater than y.

(Click here to return to the exercise.)

Exercise 6.6

There are two errors: the variable name in exists is not between quotes and x > 0 evaluates to a vector an not a single value. The goal is to check that all values in x are positive, so all can be used to collapse the logical vector x > 0:

x <- c(1, 2, pi, 8)

# Only compute square roots if x exists
# and contains positive values:
if(exists("x")) { if(all(x > 0)) { sqrt(x) } }

Alternatively, we can get a better looking solution by using &&:

if(exists("x") && all(x > 0)) { sqrt(x) }

(Click here to return to the exercise.)

Exercise 6.7

To compute the mean temperature for each month in the airquality dataset using a loop, we loop over the 6 months:

months <- unique(airquality$Month)
meanTemp <- vector("numeric", length(months))

for(i in seq_along(months))
{
      # Extract data for month[i]:
      aq <- airquality[airquality$Month == months[i],]
      # Compute mean temperature:
      meanTemp[i] <- mean(aq$Temp)
}

Next, we use a for loop to compute the maximum and minimum value of each column of the airquality data frame, storing the results in a data frame:

results <- data.frame(min = vector("numeric", ncol(airquality)),
                      max = vector("numeric", ncol(airquality)))

for(i in seq_along(airquality))
{
      results$min[i] <- min(airquality[,i], na.rm = TRUE)
      results$max[i] <- max(airquality[,i], na.rm = TRUE)
}
results

# For presentation purposes, we can add the variable names as
# row names:
row.names(results) <- names(airquality)

Finally, we write a function to solve task 2 for any data frame:

minmax <- function(df, ...)
{
       results <- data.frame(min = vector("numeric", ncol(df)),
                             max = vector("numeric", ncol(df)))

      for(i in seq_along(df))
      {
            results$min[i] <- min(df[,i], ...)
            results$max[i] <- max(df[,i], ...)
      }
      
      # For presentation purposes, we add the variable names as
      # row names:
      row.names(results) <- names(airquality)
      
      return(results)
}

# Check that it works:
minmax(airquality)
minmax(airquality, na.rm = TRUE)

(Click here to return to the exercise.)

Exercise 6.8

We can create 0.25 0.5 0.75 1 in two different ways using seq:

seq(0.25, 1, length.out = 4)
seq(0.25, 1, by = 0.25)

We can create 1 1 1 2 2 5 using rep. 1 is repeated 3 times, 2 is repeated 2 times and 5 is repeated a single time:

rep(c(1, 2, 5), c(3, 2, 1))

(Click here to return to the exercise.)

Exercise 6.9

We could create the same sequences using 1:ncol(airquality) and 1:length(airquality$Temp), but if we accidentally apply those solutions to objects with zero length, we would run into trouble! Let’s see what happens:

x <- c()
length(x)

Even though there are no elements in the vector, two iterations are run when we use 1:length(x) to set the values of the control variable:

for(i in 1:length(x)) { cat("Element", i, "of the vector\n") }

The reason is that 1:length(x) yields the vector 0 1, providing two values for the control variable.

If we use seq_along instead, no iterations will be run, because seq_along(x) returns zero values:

for(i in seq_along(x)) { cat("Element", i, "of the vector\n") }

This is the desired behaviour - if there are no elements in the vector then the loop shouldn’t run! seq_along is the safer option, but 1:length(x) is arguably less opaque and therefore easier for humans to read, which also has its benefits.

(Click here to return to the exercise.)

Exercise 6.10

To normalise the variable, we need to map the smallest value to 0 and the largest to 1:

normalise <- function(df, ...)
{
      for(i in seq_along(df))
      {
          df[,i] <- (df[,i] - min(df[,i], ...))/(max(df[,i], ...) -
                                                   min(df[,i], ...))
      }
      return(df)
}

aqn <- normalise(airquality, na.rm = TRUE)
summary(aqn)

(Click here to return to the exercise.)

Exercise 6.11

We set folder_path to the path of the folder (making sure that the path ends with / (or \\ on Windows)). We can then loop over the .csv files in the folder and print the names of their variables as follows:

files <- list.files(folder_path, pattern = "\\.csv$")

for(file in files)
{
      csv_data <- read.csv(paste(folder_path, file, sep = ""))
      cat(file, "\n")
      cat(names(csv_data))
      cat("\n\n")
}

(Click here to return to the exercise.)

Exercise 6.12

The condition in the outer loop, i < length(x), is used to check that the element x[i+1] used in the inner loop actually exists. If i is equal to the length of the vector (i.e. is the last element of the vector) then there is no element x[i+1] and consequently the run cannot go on. If this condition wasn’t included, we would end up with an infinite loop.
The condition in the inner loop, x[i+1] == x[i] & i < length(x), is used to check if the run continues. If [i+1] == x[i] is TRUE then the next element of x is the same as the current, meaning that the run continues. As in the previous condition, i < length(x) is included to make sure that we don’t start looking for elements outside of x, which could create an infinite loop.
The line run_values <- c(run_values, x[i-1]) creates a vector combining the existing elements of run_values with x[i-1]. This allows us to store the results in a vector without specifying its size in advance. Not however that this approach is slower than specifying the vector size in advance, and that you therefore should avoid it when using for loops.

(Click here to return to the exercise.)

Exercise 6.13

We modify the loop so that it skips to the next iteration if x[i] is 0, and breaks if x[i] is NA:

x <- c(1, 5, 8, 0, 20, 0, 3, NA, 18, 2)

for(i in seq_along(x))
{
      if(is.na(x[i])) { break }
      if(x[i] == 0) { next }
      cat("Step", i, "- reciprocal is", 1/x[i], "\n")
}

(Click here to return to the exercise.)

Exercise 6.13

We can put a conditional statement inside each of the loops, to check that both variables are numeric:

cor_func <- function(df)
{
    cor_mat <- matrix(NA, nrow = ncol(df), ncol = ncol(df))
    for(i in seq_along(df))
    {
        if(!is.numeric(df[[i]])) { next }
        for(j in seq_along(df))
        {
            if(!is.numeric(df[[j]])) { next }
            cor_mat[i, j] <- cor(df[[i]], df[[j]],
                                 use = "pairwise.complete")
        }
    }
    return(cor_mat)
}

# Check that it works:
str(ggplot2::msleep)
cor_func(ggplot2::msleep)

An (nicer?) alternative would be to check which columns are numeric and loop over those:

cor_func <- function(df)
{
    cor_mat <- matrix(NA, nrow = ncol(df), ncol = ncol(df))
    indices <- which(sapply(df, class) == "numeric")
    for(i in indices)
    {
        for(j in indices)
        {
            cor_mat[i, j] <- cor(df[[i]], df[[j]],
                                 use = "pairwise.complete")
        }
    }
    return(cor_mat)
}

# Check that it works:
cor_func(ggplot2::msleep)

(Click here to return to the exercise.)

Exercise 6.15

To compute the minima, we can use:

apply(airquality, 2, min, na.rm = TRUE)

To compute the maxima, we can use:

apply(airquality, 2, max, na.rm = TRUE)

We could also write a function that computes both the minimum and the maximum and returns both, and use that with apply:

minmax <- function(x, ...)
{
      return(c(min = min(x, ...), max = max(x, ...)))
}

apply(airquality, 2, minmax, na.rm = TRUE)

(Click here to return to the exercise.)

Exercise 6.16

We can for instance make use of the minmax function that we created in Exercise 6.15:

minmax <- function(x, ...)
{
      return(c(min = min(x, ...), max = max(x, ...)))
}

temps <- split(airquality$Temp, airquality$Month)

sapply(temps, minmax) # or lapply/vapply

# Or:
tapply(airquality$Temp, airquality$Month, minmax)

(Click here to return to the exercise.)

Exercise 6.17

To compute minima and maxima, we can use:

minmax <- function(x, ...)
{
      return(c(min = min(x, ...), max = max(x, ...)))
}

This time out, we want to apply this function to two variables: Temp and Wind. We can do this using apply:

minmax2 <- function(x, ...)
{
      return(apply(x, 2, minmax))
}

tw <- split(airquality[,c("Temp", "Wind")], airquality$Month)

lapply(tw, minmax2)

If we use sapply instead, we lose information about which statistic correspond to which variable, so lapply is a better choice here:

sapply(tw, minmax2)

(Click here to return to the exercise.)

Exercise 6.18

We can for instance make use of the minmax function that we created in Exercise 6.15:

minmax <- function(x, ...)
{
      return(c(min = min(x, ...), max = max(x, ...)))
}

library(purrr)

temps <- split(airquality$Temp, airquality$Month)
temps |> map(minmax)

We can also use a single pipe chain to split the data and apply the functional:

airquality |> split(.$Month) |> map(~minmax(.$Temp))

(Click here to return to the exercise.)

Exercise 6.19

Because we want to use both the variable names and their values, an imap_* function is appropriate here:

data_summary <- function(df)
{
    df |> imap_dfr(~(data.frame(variable = .y,
                                 unique_values = length(unique(.x)),
                                 class = class(.x),
                                 missing_values = sum(is.na(.x)) )))
}

# Check that it works:
library(ggplot2)
data_summary(msleep)

(Click here to return to the exercise.)

Exercise 6.20

We combine map and imap to get the desired result. folder_path is the path to the folder containing the .csv files. We must use set_names to set the file names as element names, otherwise only the index of each file (in the file name vector) will be printed:

list.files(folder_path, pattern = "\\.csv$") |> 
    paste(folder_path, ., sep = "") |> 
    set_names() |> 
    map(read.csv) |> 
    imap(~cat(.y, "\n", names(.x), "\n\n"))

(Click here to return to the exercise.)

Exercise 6.21

First, we load the data and create vectors containing all combinations

library(gapminder)
combos <- gapminder |> distinct(continent, year)
continents <- combos$continent
years <- combos$year

Next, we create the scatterplots:

# Create a plot for each pair:
combos_plots <- map2(continents, years, ~{
                     gapminder |> filter(continent == .x,
                                         year == .y) |> 
                      ggplot(aes(pop, lifeExp)) + geom_point() +
                             ggtitle(paste(.x, .y, sep =" in "))})

If instead we just want to save each scatterplot in a separate file, we can do so by putting ggsave (or png + dev.off) inside a walk2 call:

# Create a plot for each pair:
combos_plots <- walk2(continents, years, ~{
                     gapminder |> filter(continent == .x,
                                         year == .y) |> 
                      ggplot(aes(pop, lifeExp)) + geom_point() +
                             ggtitle(paste(.x, .y, sep =" in "))
                      ggsave(paste(.x, .y, ".png", sep = ""),
                             width = 3, height = 3)})

(Click here to return to the exercise.)

Exercise 6.22

First, we write a function for computing the mean of a vector with a loop:

mean_loop <- function(x)
{
      m <- 0
      n <- length(x)
      for(i in seq_along(x))
      {
            m <- m + x[i]/n
      }
      return(m)
}

Next, we run the functions once, and then benchmark them:

x <- 1:10000
mean_loop(x)
mean(x)

library(bench)
mark(mean(x), mean_loop(x))

mean_loop is several times slower than mean. The memory usage of both functions is negligible.

(Click here to return to the exercise.)

Exercise 6.23

We can compare the three solutions as follows:

library(data.table)
library(dplyr)
library(nycflights13)
library(bench)

# Make a data.table copy of the data:
flights.dt <- as.data.table(flights)

# Wrap the solutions in functions (using global assignment to better
# monitor memory usage):
base_filter <- function() { flights0101 <<- flights[flights$month ==
                                              1 & flights$day == 1,] }
dt_filter <- function() { flights0101 <<- flights.dt[month == 1 &
                                                          day == 1,] }
dplyr_filter <- function() { flights0101 <<- flights |> filter(
                                                month ==1, day == 1) }

# Compile the functions:
library(compiler)
base_filter <- cmpfun(base_filter)
dt_filter <- cmpfun(dt_filter)
dplyr_filter <- cmpfun(dplyr_filter)

# benchmark the solutions:
bm <- mark(base_filter(), dt_filter(), dplyr_filter())
bm
plot(bm)

We see that dplyr is substantially faster and more memory efficient than the base R solution, but that data.table beats them both by a margin.

(Click here to return to the exercise.)

Chapter 7

Exercise 7.1

The parameter replace controls whether or not replacement is used. To draw 5 random numbers with replacement, we use:

sample(1:10, 5, replace = TRUE)

(Click here to return to the exercise.)

Exercise 7.2

As an alternative to sample(1:10, n, replace = TRUE) we could use runif to generate random numbers from 1:10. This can be done in at least three different ways.

Generating (decimal) numbers between $0$ and $10$ and rounding up to the nearest integer:

n <- 10 # Generate 10 numbers
ceiling(runif(n, 0, 10))

Generating (decimal) numbers between $1$ and $11$ and rounding down to the nearest integer:

floor(runif(n, 1, 11))

Generating (decimal) numbers between $0.5$ and $10.5$ and rounding to the nearest integer:

round(runif(n, 0.5, 10.5))

Using sample(1:10, n, replace = TRUE) is more straightforward in this case, and is the recommended approach.

(Click here to return to the exercise.)

Exercise 7.3

First, we compare the histogram of the data to the normal density function:

library(ggplot2)
ggplot(msleep, aes(x = sleep_total)) +
      geom_histogram(colour = "black", aes(y = ..density..)) +
      geom_density(colour = "blue", size = 2) +
      geom_function(fun = dnorm, colour = "red", size = 2,
                    args = list(mean = mean(msleep$sleep_total), 
                                  sd = sd(msleep$sleep_total)))

The density estimate is fairly similar to the normal density, but there appear to be too many low values in the data.

Then a normal Q-Q plot:

ggplot(msleep, aes(sample = sleep_total)) +
        geom_qq() + geom_qq_line()

There are some small deviations from the line, but no large deviations. To decide whether these deviations are large enough to be a concern, it may be a good idea to compare this Q-Q-plot to Q-Q-plots from simulated normal data:

# Create a Q-Q-plot for the total sleep data, and store
# it in a list:
qqplots <- list(ggplot(msleep, aes(sample = sleep_total)) +
  geom_qq() + geom_qq_line() + ggtitle("Actual data"))

# Compute the sample size n: 
n <- sum(!is.na(msleep$sleep_total))

# Generate 8 new datasets of size n from a normal distribution.
# Then draw Q-Q-plots for these and store them in the list:
for(i in 2:9)
{
    generated_data <- data.frame(normal_data = rnorm(n, 10, 1))
    qqplots[[i]] <- ggplot(generated_data, aes(sample = normal_data)) +
      geom_qq() + geom_qq_line() + ggtitle("Simulated data")
}

# Plot the resulting Q-Q-plots side-by-side:
library(patchwork)
(qqplots[[1]] + qqplots[[2]] + qqplots[[3]]) /
  (qqplots[[4]] + qqplots[[5]] + qqplots[[6]]) /
  (qqplots[[7]] + qqplots[[8]] + qqplots[[9]])

The Q-Q-plot for the real data is pretty similar to those from the simulated samples. We can’t rule out the normal distribution.

Nevertheless, perhaps the lognormal distribution would be a better fit? We can compare its density to the histogram, and draw a Q-Q plot:

# Histogram:
ggplot(msleep, aes(x = sleep_total)) +
      geom_histogram(colour = "black", aes(y = ..density..)) +
      geom_density(colour = "blue", size = 2) +
      geom_function(fun = dlnorm, colour = "red", size = 2,
                  args = list(meanlog = mean(log(msleep$sleep_total)), 
                              sdlog = sd(log(msleep$sleep_total))))

# Q-Q plot:
ggplot(msleep, aes(sample = sleep_total)) +
        geom_qq(distribution = qlnorm) +
        geom_qq_line(distribution = qlnorm)

The right tail of the distribution differs greatly from the data. If we have to choose between these two distributions, then the normal distribution seems to be the better choice.

(Click here to return to the exercise.)

Exercise 7.4

The documentation for shapiro.test shows that it takes a vector containing the data as input. So to apply it to the sleeping times data, we use:

library(ggplot2)
shapiro.test(msleep$sleep_total)

The p-value is $0.21$, meaning that we can’t reject the null hypothesis of normality - the test does not indicate that the data is non-normal.

Next, we generate data from a $\chi^2(100)$ distribution, and compare its distribution to a normal density function:

generated_data <- data.frame(x = rchisq(2000, 100))

ggplot(generated_data, aes(x)) +
      geom_histogram(colour = "black", aes(y = ..density..)) +
      geom_density(colour = "blue", size = 2) +
      geom_function(fun = dnorm, colour = "red", size = 2,
                    args = list(mean = mean(generated_data$x), 
                                  sd = sd(generated_data$x)))

The fit is likely to be very good - the data is visually very close to the normal distribution. Indeed, it is rare in practice to find real data that is closer to the normal distribution than this.

However, the Shapiro-Wilk test probably tells a different story:

shapiro.test(generated_data$x)

The lesson here is that if the sample size is large enough, the Shapiro-Wilk test (and any other test for normality, for that matter) is likely to reject normality even if the deviation from normality is tiny. When the sample size is too large, the power of the test is close to 1 even for very small deviations. On the other hand, if the sample size is small, the power of the Shapiro-Wilk test is low, meaning that it can’t be used to detect non-normality.

In summary, you probably shouldn’t use formal tests for normality at all. And I say that as someone who has written two papers introducing new tests for normality!

(Click here to return to the exercise.)

Exercise 7.5

To run the same simulation for different $n$, we will write a function for the simulation, with the sample size n as an argument:

# Function for our custom estimator:
max_min_avg <- function(x)
{
      return((max(x)+min(x))/2)
}

# Function for simulation:
simulate_estimators <- function(n, mu = 0, sigma = 1, B = 1e4)
{
    cat(n, "\n")
    res <- data.frame(x_mean = vector("numeric", B),
                      x_median = vector("numeric", B),
                      x_mma = vector("numeric", B))
    
    # Start progress bar:
    pbar <- txtProgressBar(min = 0, max = B, style = 3)
    
    for(i in seq_along(res$x_mean))
    {
          x <- rnorm(n, mu, sigma)
          res$x_mean[i] <- mean(x)
          res$x_median[i] <- median(x)
          res$x_mma[i] <- max_min_avg(x)
          
          # Update progress bar
          setTxtProgressBar(pbar, i)
    }
    close(pbar)
    
    # Return a list containing the sample size,
    # and the simulation results:
    return(list(n = n,
                bias = colMeans(res-mu),
                vars = apply(res, 2, var)))
}

We could write a for loop to perform the simulation for different values of $n$. Alternatively, we can use a function, as in Section 6.5. Here are two examples of how this can be done:

# Create a vector of samples sizes:
n_vector <- seq(10, 100, 10)

# Run a simulation for each value in n_vector:

# Using base R:
res <- apply(data.frame(n = n_vector), 1, simulate_estimators)

# Using purrr:
library(purrr)
res <- map(n_vector, simulate_estimators)

Next, we want to plot the results. We need to extract the results from the list res and store them in a data frame, so that we can plot them using ggplot2.

simres <- matrix(unlist(res), 10, 7, byrow = TRUE)
simres <- data.frame(simres)
names(simres) <- names(unlist(res))[1:7]
simres

Transforming the data frame from wide to long format (Section 5.11) makes plotting easier.

We can do this using data.table:

library(data.table)
simres2 <- data.table(melt(simres, id.vars = c("n"),
           measure.vars = 2:7))
simres2[, c("measure", "estimator") := tstrsplit(variable,
                                        ".", fixed = TRUE)]

…or with tidyr:

library(tidyr)
simres |> pivot_longer(names(simres)[2:7],
                   names_to = "variable",
                   values_to = "value") |> 
        separate(variable,
                into = c("measure", "estimator"),
                sep = "[.]") -> simres2

We are now ready to plot the results:

library(ggplot2)
# Plot the bias, with a reference line at 0:
ggplot(subset(simres2, measure == "bias"), aes(n, value,
                                               col = estimator)) +
      geom_line() +
      geom_hline(yintercept = 0, linetype = "dashed") +
      ggtitle("Bias")

# Plot the variance:
ggplot(subset(simres2, measure == "vars"), aes(n, value,
                                               col = estimator)) +
      geom_line() +
      ggtitle("Variance")

All three estimators have a bias close to 0 for all values of $n$ (indeed, we can verify analytically that they are unbiased). The mean has the lowest variance for all $n$, with the median as a close competitor. Our custom estimator has a higher variance, that also has a slower decrease as $n$ increases. In summary, based on bias and variance, the mean is the best estimator for the mean of a normal distribution.

(Click here to return to the exercise.)

Exercise 7.6

To perform the same simulation with $t(3)$-distributed data, we can reuse the same code as in Exercise 7.5, only replacing three lines:

The arguments of simulate_estimators (mu and sigma are replaced by the degrees of freedom df of the $t$-distribution,
The line were the data is generated (rt replaces rnorm),
The line were the bias is computed (the mean of the $t$-distribution is always 0).

# Function for our custom estimator:
max_min_avg <- function(x)
{
      return((max(x)+min(x))/2)
}

# Function for simulation:
simulate_estimators <- function(n, df = 3, B = 1e4)
{
    cat(n, "\n")
    res <- data.frame(x_mean = vector("double", B),
                      x_median = vector("double", B),
                      x_mma = vector("double", B))
    
    # Start progress bar:
    pbar <- txtProgressBar(min = 0, max = B, style = 3)
    
    for(i in seq_along(res$x_mean))
    {
          x <- rt(n, df)
          res$x_mean[i] <- mean(x)
          res$x_median[i] <- median(x)
          res$x_mma[i] <- max_min_avg(x)
          
          # Update progress bar
          setTxtProgressBar(pbar, i)
    }
    close(pbar)
    
    # Return a list containing the sample size,
    # and the simulation results:
    return(list(n = n,
                bias = colMeans(res-0),
                vars = apply(res, 2, var)))
}

To perform the simulation, we can then e.g. run the following, which has been copied from the solution to the previous exercise.

# Create a vector of samples sizes:
n_vector <- seq(10, 100, 10)

# Run a simulation for each value in n_vector:
res <- apply(data.frame(n = n_vector), 1, simulate_estimators)

# Reformat the results:
simres <- matrix(unlist(res), 10, 7, byrow = TRUE)
simres <- data.frame(simres)
names(simres) <- names(unlist(res))[1:7]
library(data.table)
simres2 <- data.table(melt(simres, id.vars = c("n"),
           measure.vars = 2:7))
simres2[, c("measure", "estimator") := tstrsplit(variable,
                                        ".", fixed = TRUE)]
# Plot the result
library(ggplot2)
# Plot the bias, with a reference line at 0:
ggplot(subset(simres2, measure == "bias"), aes(n, value,
                                               col = estimator)) +
      geom_line() +
      geom_hline(yintercept = 0, linetype = "dashed") +
      ggtitle("Bias, t(3)-distribution")

# Plot the variance:
ggplot(subset(simres2, measure == "vars"), aes(n, value,
                                               col = estimator)) +
      geom_line() +
      ggtitle("Variance, t(3)-distribution")

The results are qualitatively similar to those for normal data.

(Click here to return to the exercise.)

Exercise 7.7

We will use the functions that we created to simulate the type I error rates and powers of the three tests in Sections @ref(simtypeI} and 7.2.3. Also, we must make sure to load the MKinfer package that contains perm.t.test.

To compare the type I error rates, we only need to supply the function rt for generating data and the parameter df = 3 to clarify that a $t(3)$-distribution should be used:

simulate_type_I(20, 20, rt, B = 9999, df = 3)
simulate_type_I(20, 30, rt, B = 9999, df = 3)

Here are the results from my runs:

# Balanced sample sizes:
     p_t_test p_perm_t_test    p_wilcoxon 
   0.04340434    0.04810481    0.04860486 

# Imbalanced sample sizes:
     p_t_test p_perm_t_test    p_wilcoxon 
   0.04300430    0.04860486    0.04670467

The old-school t-test appears to be a little conservative, with an actual type I error rate close to $0.043$. We can use binomDiffCI from MKinfer to get a confidence interval for the difference in type I error rate between the old-school t-test and the permutation t-test:

B <- 9999
binomDiffCI(B*0.04810481, B*0.04340434, B, B, method = "wilson")

The confidence interval is $(-0.001, 0.010)$. Even though the old-school t-test appeared to have a lower type I error rate, we cannot say for sure, as a difference of 0 is included in the confidence interval. Increasing the number of simulated samples to, say, $99,999$, might be required to detect any differences between the different tests.

Next, we compare the power of the tests. For the function used to simulate data for the second sample, we add a + 1 to shift the distribution to the right (so that the mean difference is 1):

# Balanced sample sizes:
simulate_power(20, 20, function(n) { rt(n, df = 3,) },
                       function(n) { rt(n, df = 3) + 1 },
                       B = 9999)

# Imbalanced sample sizes:
simulate_power(20, 30, function(n) { rt(n, df = 3,) },
                       function(n) { rt(n, df = 3) + 1 },
                       B = 9999)

Here are the results from my runs:

# Balanced sample sizes:
     p_t_test p_perm_t_test    p_wilcoxon 
    0.5127513     0.5272527     0.6524652 

# Imbalanced sample sizes:
     p_t_test p_perm_t_test    p_wilcoxon 
    0.5898590     0.6010601     0.7423742

The Wilcoxon-Mann-Whitney test has the highest power in this example.

(Click here to return to the exercise.)

Exercise 7.8

Both the functions that we created in Section 7.3.1, simulate_power and power.cor.test include ... in their list of arguments, which allows us to pass additional arguments to interior functions. In particular, the line in simulate_power where the p-value for the correlation test is computed, contains this placeholder:

p_values[i] <- cor.test(x[,1], x[,2], ...)$p.value

This means that we can pass the argument method = "spearman" to use the functions to compute the sample size for the Spearman correlation test. Let’s try it:

power.cor.test(n_start = 10, rho = 0.5, power = 0.9,
               method = "spearman")
power.cor.test(n_start = 10, rho = 0.2, power = 0.8,
               method = "spearman")

In my runs, the Pearson correlation test required the sample sizes $n=45$ and $n=200$, whereas the Spearman correlation test required larger sample sizes: $n=50$ and $n=215$.

(Click here to return to the exercise.)

Exercise 7.9

First, we create a function that simulates the expected width of the Clopper-Pearson interval for a given $n$ and $p$:

cp_width <- function(n, p, level = 0.05, B = 999)
{
      widths <- rep(NA, B)
      
      # Start progress bar:
      pbar <- txtProgressBar(min = 0, max = B, style = 3)
      
      for(i in 1:B)
      {
            # Generate binomial data:
            x <- rbinom(1, n, p)
            
            # Compute interval width:
            interval <- binomCI(x, n, conf.level = 0.95,
                                method = "clopper-pearson")$conf.int
            widths[i] <- interval[2] - interval[1]
            
            # Update progress bar:
            setTxtProgressBar(pbar, i)
      }
      
      close(pbar)
      
      return(mean(widths))
}

Next, we create a function with a while loop that finds the sample sizes required to achieve a desired expected width:

cp_ssize <- function(n_start = 10, p, n_incr = 5, level = 0.05,
                     width = 0.1, B = 999)
{
    # Set initial values
    n <- n_start
    width_now <- 1
    
    # Check power for different sample sizes:
    while(width_now > width)
    {
        width_now <- cp_width(n, p, level, B)
        cat("n =", n, " - Width:", width_now, "\n")
        n <- n + n_incr
    }
    
    # Return the result:
    cat("\nWhen n =", n, "the expected with is", round(width, 3), "\n")
    return(n)
}

Finally, we run our simulation for $p=0.1$ (with expected width $0.01$) and $p=0.3$ (expected width $0.05$) and compare the results to the asymptotic answer:

# p = 0.1
# Asymptotic answer:
ssize.propCI(prop = 0.1, width = 0.01, method = "clopper-pearson")

# The asymptotic answer is 14,029 - so we need to set a fairly high
# starting value for n in our simulation!
cp_ssize(n_start = 14020, p = 0.1, n_incr = 1, level = 0.05,
         width = 0.01, B = 9999)

#######

# p = 0.3, width = 0.05
# Asymptotic answer:
ssize.propCI(prop = 0.3, width = 0.1, method = "clopper-pearson")

# The asymptotic answer is 343.
cp_ssize(n_start = 335, p = 0.3, n_incr = 1, level = 0.05,
         width = 0.1, B = 9999)

As you can see, the asymptotic results are very close to those obtained from the simulation, and so using ssize.propCI is preferable in this case, as it is much faster.

(Click here to return to the exercise.)

Exercise 7.10

If we want to assume that the two populations have equal variances, we first have to create a centred dataset, where both groups have mean 0. We can then draw observations from this sample, and shift them by the two group means:

library(ggplot2)
boot_data <- na.omit(subset(msleep,
                  vore == "carni" | vore == "herbi")[,c("sleep_total",
                                                          "vore")])
# Compute group means and sizes:
group_means <- aggregate(sleep_total ~ vore,
                         data = boot_data, FUN = mean)
group_sizes <- aggregate(sleep_total ~ vore,
                         data = boot_data, FUN = length)
n1 <- group_sizes[1, 2]

# Create a centred dataset, where both groups have mean 0:
boot_data$sleep_total[boot_data$vore == "carni"] <- 
     boot_data$sleep_total[boot_data$vore == "carni"] -
                                      group_means[1, 2]
boot_data$sleep_total[boot_data$vore == "herbi"] <- 
     boot_data$sleep_total[boot_data$vore == "herbi"] -
                                      group_means[2, 2]

# Verify that we've centred the two groups:
aggregate(sleep_total ~ vore, data = boot_data, FUN = mean)

# First, we resample from the centred data sets. Then we label
# some observations as carnivores, and add the group mean for
# carnivores to them, and label some as herbivores and add
# that group mean instead. That way both groups are used to
# estimate the variability of the observations.
mean_diff_msleep <- function(data, i)
{ 
    # Create a sample with the same mean as the carnivore group:
    sample1 <- data[i[1:n1], 1] + group_means[1, 2]
    # Create a sample with the same mean as the herbivore group:
    sample2 <- data[i[(n1+1):length(i)], 1] + group_means[2, 2]
    # Compute the difference in means:
    return(mean(sample1$sleep_total) - mean(sample2$sleep_total))
}

library(boot)

# Do the resampling:
boot_res <- boot(boot_data,
                 mean_diff_msleep,
                 9999)

# Compute confidence intervals:
boot.ci(boot_res, type = c("perc", "bca"))

The resulting percentile interval is close to that which we obtained without assuming equal variances. The BCa interval is however very different.

(Click here to return to the exercise.)

Exercise 7.11

We use the percentile confidence interval from the previous exercise to compute p-values as follows (the null hypothesis is that the parameter is 0):

library(boot.pval)
boot.pval(boot_res, type = "perc", theta_null = 0)

A more verbose solution would be to write a while loop:

# The null hypothesis is there the difference is 0:
diff_null <- 0

# Set initial conditions:
in_interval <- TRUE
alpha <- 0

# Find the lowest alpha for which diff_null is in the
# interval:
while(in_interval)
{
    alpha <- alpha + 0.001
    interval <- boot.ci(boot_res, 
                        conf = 1 - alpha,
                        type = "perc")$percent[4:5]
    in_interval <- diff_null > interval[1] & diff_null < interval[2]
}

# Print the p-value:
alpha

The p-value is approximately 0.52, and we can not reject the null hypothesis.

(Click here to return to the exercise.)

Chapter 8

Exercise 8.1

We set file_path to the path of sales-weather.csv. To load the data, fit the model and plot the results, we do the following:

# Load the data:
weather <- read.csv(file_path, sep =";")
View(weather)

# Fit model:
m <- lm(TEMPERATURE ~ SUN_HOURS, data = weather)
summary(m)
                  
# Plot the result:
library(ggplot2)
ggplot(weather, aes(SUN_HOURS, TEMPERATURE)) +
      geom_point() + 
      geom_abline(aes(intercept = coef(m)[1], slope = coef(m)[2]),
                colour = "red")

The coefficient for SUN_HOURS is not significantly non-zero at the 5 % level. The $R^2$ value is 0.035, which is very low. There is little evidence of a connection between the number of sun hours and the temperature during this period.

(Click here to return to the exercise.)

Exercise 8.2

We fit a model using the formula:

m <- lm(mpg ~ ., data = mtcars)
summary(m)

What we’ve just done is to create a model where all variables from the data frame (except mpg) are used as explanatory variables. This is the same model as we’d have obtained using the following (much longer) code:

m <- lm(mpg ~ cyl + disp + hp + drat + wt +
          qsec + vs + am + gear + carb, data = mtcars)

The ~ . shorthand is very useful when you want to fit a model with a lot of explanatory variables.

(Click here to return to the exercise.)

Exercise 8.3

First, we create the dummy variable:

weather$prec_dummy <- factor(weather$PRECIPITATION > 0)

Then, we fit the new model and have a look at the results. We won’t centre the SUN_HOURS variable, as the model is easy to interpret without centring. The intercept corresponds to the expected temperature on a day with 0 SUN_HOURS and no precipitation.

m <- lm(TEMPERATURE ~ SUN_HOURS*prec_dummy, data = weather)
summary(m)

Both SUN_HOURS and the dummy variable are significantly non-zero. In the next section, we’ll have a look at how we can visualise the results of this model.

(Click here to return to the exercise.)

Exercise 8.4

We run the code to create the two data frames. We then fit a model to the first dataset exdata1, and make some plots:

m1 <- lm(y ~ x, data = exdata1)

# Show fitted values in scatterplot:
library(ggplot2)
m1_pred <- data.frame(x = exdata1$x, y_pred = predict(m1))
ggplot(exdata1, aes(x, y)) +
      geom_point() + 
      geom_line(data = m1_pred, aes(x = x, y = y_pred),
                colour = "red") 

# Residual plots:
library(ggfortify)
autoplot(m1, which = 1:6, ncol = 2, label.size = 3)

There are clear signs of nonlinearity here, that can be seen both in the scatterplot and the residuals versus fitted plot.

Next, we do the same for the second dataset:

m2 <- lm(y ~ x, data = exdata2)

# Show fitted values in scatterplot:
m2_pred <- data.frame(x = exdata2$x, y_pred = predict(m2))
ggplot(exdata2, aes(x, y)) +
      geom_point() + 
      geom_line(data = m2_pred, aes(x = x, y = y_pred),
                colour = "red") 

# Residual plots:
library(ggfortify)
autoplot(m2, which = 1:6, ncol = 2, label.size = 3)

There is a strong indication of heteroscedasticity. As is seen e.g. in the scatterplot and in the scale-location plot, the residuals appear to vary more the larger x becomes.

(Click here to return to the exercise.)

Exercise 8.5

First, we plot the observed values against the fitted values for the two models.

# The two models:
m1 <- lm(TEMPERATURE ~ SUN_HOURS, data = weather)
m2 <- lm(TEMPERATURE ~ SUN_HOURS*prec_dummy, data = weather)

n <- nrow(weather)
models <- data.frame(Observed = rep(weather$TEMPERATURE, 2),
                     Fitted = c(predict(m1), predict(m2)),
                     Model = rep(c("Model 1", "Model 2"), c(n, n)))

ggplot(models, aes(Fitted, Observed)) +
      geom_point(colour = "blue") +
      facet_wrap(~ Model, nrow = 3) +
      geom_abline(intercept = 0, slope = 1) +
      labs(x = "Fitted values",
           y = "Observed values")

The first model only predicts values within a fairly narrow interval. The second model does a somewhat better job of predicting high temperatures.

Next, we create residual plots for the second model.

library(ggfortify)
autoplot(m2, which = 1:6, ncol = 2, label.size = 3)

There are no clear trends or signs of heteroscedasticity. There are some deviations from normality in the tail of the residual distribution. There are a few observations - 57, 76 and 83, that have fairly high Cook’s distance. Observation 76 also has a very high leverage. Let’s have a closer look at them:

weather[c(57, 76, 83),]

As we can see using sort(weather$SUN_HOURS) and min(weather$TEMPERATURE), observation 57 corresponds to the coldest day during the period, and observations 76 and 83 to the two days with the highest numbers of sun hours. Neither of them deviate too much from other observations though, so it shouldn’t be a problem that their Cook’s distances are little high.

(Click here to return to the exercise.)

Exercise 8.6

We refit the model using:

library(lmPerm)
m <- lmp(TEMPERATURE ~ SUN_HOURS*prec_dummy, data = weather,
         Ca = 1e-3, maxIter = 1e6)
summary(m)

The main effects are not significant at the 5 % level.

(Click here to return to the exercise.)

Exercise 8.7

We run boxcox to find a suitable Box-Cox transformation for our model:

m <- lm(TEMPERATURE ~ SUN_HOURS*prec_dummy, data = weather)

library(MASS)
boxcox(m)

You’ll notice an error message, saying:

Error in boxcox.default(m) : response variable must be positive

The boxcox method can only be used for non-negative response variables. We can solve this e.g. by transforming the temperature (which currently is in degrees Celsius) to degrees Fahrenheit, or by adding a constant to the temperature (which only will affect the intercept of the model, and not the slope coefficients). Let’s try the latter:

weather$TEMPERATUREplus10 <- weather$TEMPERATURE + 10
m <- lm(TEMPERATUREplus10 ~ SUN_HOURS*prec_dummy, data = weather)

boxcox(m)

The value $\lambda = 1$ is inside the interval indicated by the dotted lines. This corresponds to no transformation at all, meaning that there is no indication that we should transform our response variable.

(Click here to return to the exercise.)

Exercise 8.8

autoplot uses syntax similar to ggplot2, so by adding data = mtcars, colour = "cyl" to autoplot, we can plot different groups in different colours:

mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am)

# Fit model and print ANOVA table:
m <- aov(mpg ~ cyl + am, data = mtcars)

library(ggfortify)
autoplot(m, which = 1:6, ncol = 2, label.size = 3,
         data = mtcars, colour = "cyl")

(Click here to return to the exercise.)

Exercise 8.9

We rerun the analysis:

# Convert variables to factors:
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am)

# Fit model and print ANOVA table:
library(lmPerm)
m <- aovp(mpg ~ cyl + am, data = mtcars)
summary(m)

Unfortunately, if you run this multiple times, the p-values will vary a lot. To fix that, you need to increase the maximum number of iterations allowed, by increasing maxIter, and changing the condition for the accuracy of the p-value by lowering Ca:

m <- aovp(mpg ~ cyl + am, data = mtcars,
          perm = "Prob",
          Ca = 1e-3,
          maxIter = 1e6)
summary(m)

According to ?aovp, the seqs arguments controls which type of table is produced. It’s perhaps not perfectly clear from the documentation, but the default seqs = FALSE corresponds to a type III table, whereas seqs = TRUE corresponds to a type I table:

# Type I table:
m <- aovp(mpg ~ cyl + am, data = mtcars,
          seqs = TRUE,
          perm = "Prob",
          Ca = 1e-3,
          maxIter = 1e6)
summary(m)

(Click here to return to the exercise.)

Exercise 8.10

We can run the test using the usual formula notation:

kruskal.test(mpg ~ cyl, data = mtcars)

The p-value is very low, and we conclude that the fuel consumption differs between the three groups.

(Click here to return to the exercise.)

Exercise 8.11

The easiest way to do this is to use boot_summary:

library(MASS)
m <- rlm(TEMPERATURE ~ SUN_HOURS*prec_dummy, data = weather)

library(boot.pval)
boot_summary(m, type = "perc", method = "residual")

We can also use Boot:

library(car)
boot_res <- Boot(m, method = "residual")

# Compute 95 % confidence intervals using confint
confint(boot_res, type = "perc")

If instead we want to use boot, we begin by fitting the model:

library(MASS)
m <- rlm(TEMPERATURE ~ SUN_HOURS*prec_dummy, data = weather)

Next, we compute the confidence intervals using boot and boot.ci (note that we use rlm inside the coefficients function!):

library(boot)

coefficients <- function(formula, data, i, predictions, residuals) {
      # Create the bootstrap value of response variable by
      # adding a randomly drawn residual to the value of the
      # fitted function for each observation:
      data[,all.vars(formula)[1]] <- predictions + residuals[i]
      
      # Fit a new model with the bootstrap value of the response
      # variable and the original explanatory variables:
      m <- rlm(formula, data = data)
      return(coef(m))
}

# Fit the linear model:
m <- rlm(TEMPERATURE ~ SUN_HOURS*prec_dummy, data = weather)

# Compute scaled and centred residuals:
res <- residuals(m)/sqrt(1 - lm.influence(m)$hat)
res <- res - mean(res)

# Run the bootstrap, extracting the model formula and the
# fitted function from the model m:
boot_res <- boot(data = mtcars, statistic = coefficients,
                R = 999, formula = formula(m),
                predictions = predict(m),
                residuals = res)

# Compute confidence intervals:
boot.ci(boot_res, type = "perc", index = 1) # Intercept
boot.ci(boot_res, type = "perc", index = 2) # Sun hours
boot.ci(boot_res, type = "perc", index = 3) # Precipitation dummy
boot.ci(boot_res, type = "perc", index = 4) # Interaction term

Using the connection between hypothesis tests and confidence intervals, to see whether an effect is significant at the 5 % level, you can check whether 0 is contained in the confidence interval. If not, then the effect is significant.

(Click here to return to the exercise.)

Exercise 8.12

First, we prepare the model and the data:

m <- lm(TEMPERATURE ~ SUN_HOURS*prec_dummy, data = weather)
new_data <- data.frame(SUN_HOURS = 0, prec_dummy = "TRUE")

We can then compute the prediction interval using boot.ci:

boot_pred <- function(data, new_data, model, i,
                      formula, predictions, residuals){
      data[,all.vars(formula)[1]] <- predictions + residuals[i]
      m_boot <- lm(formula, data = data)
      predict(m_boot, newdata = new_data) + 
         sample(residuals(m_boot), nrow(new_data))
}

library(boot)

boot_res <- boot(data = m$model,
                     statistic = boot_pred,
                     R = 999,
                     model = m,
                     new_data = new_data,
                     formula = formula(m),
                     predictions = predict(m),
                     residuals = residuals(m))

# 95 % bootstrap prediction interval:
boot.ci(boot_res, type = "perc")

(Click here to return to the exercise.)

Exercise 8.13

The variable names in sharks.csv contain special characters like ( and /, which are treated differently by different functions for importing data. We therefore present two solutions here.

The first solution uses the base R functions read.csv and not pipes:

We set file_path to the path of shark.csv and then load and inspect the data:

sharks <- read.csv(file_path, sep =";")
View(sharks)

We need to convert the Age variable to a numeric, which will cause us to lose information (“NAs introduced by coercion”) about the age of the persons involved in some attacks, i.e. those with values like 20's and 25 or 28, which cannot be automatically coerced into numbers. Similarly, we’ll convert Sex. and Fatal..Y.N. to factor variables:

Without pipes:

sharks$Age <- as.numeric(sharks$Age)
sharks$Sex. <- factor(sharks$Sex, levels = c("F", "M"))
sharks$Fatal..Y.N. <- factor(sharks$Fatal..Y.N., levels = c("N", "Y"))

We can now fit the model:

m <- glm(Fatal..Y.N. ~ Age + Sex., data = sharks, family = binomial)
summary(m)

Judging from the p-values, there is no evidence that sex and age affect the probability of an attack being fatal.

The second solutions uses the read_delim function from the readr package:

library(readr)
sharks <- read_delim("~/R-kurs/sharks.csv", 
    delim = ";", escape_double = FALSE, trim_ws = TRUE)

## Rows: 585 Columns: 9
## ── Column specification ───────────────────────────────────────────────────
## Delimiter: ";"
## chr (9): Year, Type, Area, Activity, Sex, Age, Injury, Fatal (Y/N), Species
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We’ll convert Sex. and Fatal..Y.N. to factor variables:

sharks |> 
    mutate(Age = as.numeric(Age),
           Sex = factor(Sex, levels = c("F", "M")),
           `Fatal (Y/N)` = factor(`Fatal (Y/N)`, levels = c("N", "Y"))) -> sharks

We can now fit the model:

m <- glm(`Fatal (Y/N)` ~ Age + Sex, data = sharks, family = binomial)
summary(m)

(Click here to return to the exercise.)

Exercise 8.14

We use the same logistic regression model for the wine data as before:

m <- glm(type ~ pH + alcohol, data = wine, family = binomial)

The broom functions work also for generalised linear models. As for linear models, tidy gives the table of coefficients and p-values, glance gives some summary statistics, and augment adds fitted values and residuals to the original dataset:

library(broom)
tidy(m)
glance(m)
augment(m)

(Click here to return to the exercise.)

Exercise 8.15

Using the model m from the other exercise, we can now do the following.

Compute asymptotic confidence intervals:

library(MASS)
confint(m)

Next, we compute bootstrap confidence intervals and p-values. In this case, the response variable is missing for a lot of observations. In order to use the same number of observations in our bootstrapping as when fitting the original model, we need to add a line to remove those observation (as in Section 5.8.2).

library(boot.pval)
# Try fitting the model without removing missing values:
boot_summary(m, type = "perc", method = "case")

# Remove missing values, refit the model, and then run
# boot_summary again:
sharks2 <- na.omit(sharks, "Fatal..Y.N.")
m <- glm(Fatal..Y.N. ~ Age + Sex., data = sharks2,
         family = binomial)
boot_summary(m, type = "perc", method = "case")

If you prefer writing your own bootstrap code, you could proceed as follows:

library(boot)

coefficients <- function(formula, data, predictions, ...) {
  # Remove rows where the response variable is missing:
  data <- na.omit(data, all.vars(formula)[1])
  
  # Check whether the response variable is a factor or
  # numeric, and then resample:
  if(is.factor(data[,all.vars(formula)[1]])) {
      # If the response variable is a factor:
      data[,all.vars(formula)[1]] <- 
      factor(levels(data[,all.vars(formula)[1]])[1 + rbinom(nrow(data),
                                          1, predictions)]) } else {
      # If the response variable is numeric:
      data[,all.vars(formula)[1]] <- 
         unique(data[,all.vars(formula)[1]])[1 + rbinom(nrow(data),
                                              1, predictions)] }
  
      m <- glm(formula, data = data, family = binomial)
      return(coef(m))
}

boot_res <- boot(data = sharks, statistic = coefficients,
                R=999, formula = formula(m),
                predictions = predict(m, type = "response"))

# Compute confidence intervals:
boot.ci(boot_res, type = "perc", index = 1) # Intercept
boot.ci(boot_res, type = "perc", index = 2) # Age
boot.ci(boot_res, type = "perc", index = 3) # Sex.M

# Compute p-values:

# The null hypothesis is that the effect (beta coefficient)
# is 0:
beta_null <- 0

# Set initial conditions:
in_interval <- TRUE
alpha <- 0

# Find the lowest alpha for which beta_null is in the
# interval:
while(in_interval)
{
    # Based on the asymptotic test, we expect the p-value
    # to not be close to 0. We therefore increase alpha by
    # 0.01 instead of 0.001 in each iteration.
    alpha <- alpha + 0.01
    interval <- boot.ci(boot_res, 
                        conf = 1 - alpha,
                        type = "perc", index = 2)$perc[4:5]
    in_interval <- beta_null > interval[1] & beta_null < interval[2]
}

# Print the p-value:
alpha

(Click here to return to the exercise.)

Exercise 8.16

We draw a binned residual plot for our model:

m <- glm(Fatal..Y.N. ~ Age + Sex., data = sharks, family = binomial)

library(arm)
binnedplot(predict(m, type = "response"),
           residuals(m, type = "response"))

There are a few points outside the interval, but not too many. There is not trend, i.e. there is for instance no sign that the model has a worse performance when it predicts a larger probability of a fatal attack.

Next, we plot the Cook’s distances of the observations:

res <- data.frame(Index = 1:length(cooks.distance(m)),
                  CooksDistance = cooks.distance(m))

# Plot index against the Cook's distance to find
# influential points:
ggplot(res, aes(Index, CooksDistance)) +
      geom_point() +
      geom_text(aes(label = ifelse(CooksDistance > 0.05,
                                   rownames(res), "")),
                hjust = 1.1)

There are a few points with a high Cook’s distance. Let’s investigate point 116, which has the highest distance:

sharks[116,]

This observation corresponds to the oldest person in the dataset, and a fatal attack. Being an extreme observation, we’d expect it to have a high Cook’s distance.

(Click here to return to the exercise.)

Exercise 8.17

First, we have a look at the quakes data:

?quakes
View(quakes)

We then fit a Poisson regression model with stations as response variable and mag as explanatory variable:

m <- glm(stations ~ mag, data = quakes, family = poisson)
summary(m)

We plot the fitted values against the observed values, create a binned residual plot, and perform a test of overdispersion:

# Plot observed against fitted:
res <- data.frame(Observed = quakes$stations,
                  Fitted = predict(m, type = "response"))

ggplot(res, aes(Fitted, Observed)) +
      geom_point(colour = "blue") +
      geom_abline(intercept = 0, slope = 1) +
      labs(x = "Fitted values",
           y = "Observed values")

# Binned residual plot:
library(arm)
binnedplot(predict(m, type = "response"),
           residuals(m, type = "response"))

# Test overdispersion
library(AER)
dispersiontest(m, trafo = 1)

Visually, the fit is pretty good. As indicated by the test, there are however signs of overdispersion. Let’s try a negative binomial regression instead.

# Fit NB regression:
library(MASS)
m2 <- glm.nb(stations ~ mag, data = quakes)
summary(m2)

# Compare fit of observed against fitted:
n <- nrow(quakes)
models <- data.frame(Observed = rep(quakes$stations, 2),
                     Fitted = c(predict(m, type = "response"),
                                predict(m2, type = "response")),
                     Model = rep(c("Poisson", "NegBin"),
                                 c(n, n)))

ggplot(models, aes(Fitted, Observed)) +
      geom_point(colour = "blue") +
      facet_wrap(~ Model, nrow = 3) +
      geom_abline(intercept = 0, slope = 1) +
      labs(x = "Fitted values",
           y = "Observed values")

The difference between the models is tiny. We’d probably need to include more variables to get a real improvement of the model.

(Click here to return to the exercise.)

Exercise 8.18

We can get confidence intervals for the $\beta_j$ using boot_summary, as in previous sections. To get bootstrap confidence intervals for the rate ratios $e^{\beta_j}$, we exponentiate the confidence intervals for the $\beta_j$:

library(boot.pval)
boot_table <- boot_summary(m, type = "perc", method = "case")
boot_table

# Confidence intervals for rate ratios:
exp(boot_table[, 2:3])

(Click here to return to the exercise.)

Exercise 8.19

First, we load the data and have a quick look at it:

library(nlme)
?Oxboys
View(Oxboys)

Next, we make a plot for each boy (each subject):

ggplot(Oxboys, aes(age, height, colour = Subject)) +
      geom_point() +
      theme(legend.position = "none") +
      facet_wrap(~ Subject, nrow = 3) +
      geom_smooth(method = "lm", colour = "black", se = FALSE)

Both intercepts and slopes seem to vary between individuals. Are they correlated?

# Collect the coefficients from each linear model:
library(purrr)
Oxboys |>  split(Oxboys$Subject) |> 
              map(~ lm(height ~ age, data = .)) |>  
              map(coef) -> coefficients

# Convert to a data frame:
coefficients <- data.frame(matrix(unlist(coefficients),
                  nrow = length(coefficients),
                  byrow = TRUE),
                  row.names = names(coefficients))
names(coefficients) <- c("Intercept", "Age")

# Plot the coefficients:
ggplot(coefficients, aes(Intercept, Age,
                         colour = row.names(coefficients))) +
      geom_point() +
      geom_smooth(method = "lm", colour = "black", se = FALSE) +
      labs(fill = "Subject")

# Test the correlation:
cor.test(coefficients$Intercept, coefficients$Age)

There is a strong indication that the intercepts and slopes have a positive correlation. We’ll therefore fit a linear mixed model with correlated random intercepts and slopes:

m <- lmer(height ~ age + (1 + age|Subject), data = Oxboys)
summary(m, correlation = FALSE)

(Click here to return to the exercise.)

Exercise 8.20

We’ll use the model that we fitted to the Oxboys data in the previous exercise:

library(lme4)
library(nlme)
m <- lmer(height ~ age + (1 + age|Subject), data = Oxboys)

First, we install broom.mixed:

install.packages("broom.mixed")

Next, we obtain the summary table as a data frame using tidy:

library(broom.mixed)
tidy(m)

As you can see, fixed and random effects are shown in the same table. However, different information is displayed for the two types of variables (just as when we use summary).

Note that if we fit the model after loading the lmerTest, the tidy table also includes p-values:

library(lmerTest)
m <- lmer(height ~ age + (1 + age|Subject), data = Oxboys)
tidy(m)

(Click here to return to the exercise.)

Exercise 8.21

We use the same model as in the previous exercise:

library(nlme)
m <- lmer(height ~ age + (1 + age|Subject), data = Oxboys)

We make some diagnostic plots:

library(ggplot2)
fm <- fortify.merMod(m)

# Plot residuals:
ggplot(fm, aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  labs(x = "Fitted values",
       y = "Residuals")

# Compare the residuals of different subjects:
ggplot(fm, aes(Subject, .resid)) +
  geom_boxplot() +
  coord_flip() +
  labs(y = "Residuals")

# Observed values versus fitted values:
ggplot(fm, aes(.fitted, height)) +
  geom_point(colour = "blue") +
  facet_wrap(~ Subject, nrow = 3) +
  geom_abline(intercept = 0, slope = 1) +
  labs(x = "Fitted values",
       y = "Observed values")

## Q-Q plot of residuals:
ggplot(fm, aes(sample = .resid)) +
  geom_qq() + geom_qq_line()

## Q-Q plot of random effects:
ggplot(ranef(m)$Subject, aes(sample = `(Intercept)`)) +
  geom_qq() + geom_qq_line()
ggplot(ranef(m)$Subject, aes(sample = `age`)) +
  geom_qq() + geom_qq_line()

Overall, the fit seems very good. There may be some heteroscedasticity, but nothing too bad. Some subjects have a larger spread in their residuals, which is to be expected in this case - growth in children is non-constant, and a large negative residual is therefore likely to be followed by a large positive residual, and vice versa. The regression errors and random effects all appear to be normally distributed.

(Click here to return to the exercise.)

Exercise 8.22

To look for an interaction between TVset and Assessor, we draw an interaction plot:

library(lmerTest)
interaction.plot(TVbo$Assessor, TVbo$TVset,
                 response = TVbo$Coloursaturation)

The lines overlap and follow different patterns, so there appears to be an interaction. There are two ways in which we could include this. Which we choose depends on what we think our clusters of correlated measurements are. If only the assessors are clusters, we’d include this as a random slope:

m <- lmer(Coloursaturation ~ TVset*Picture + (1 + TVset|Assessor),
         data = TVbo)
m
anova(m)

In this case, we think that there is a fixed interaction between each pair of assessor and TV set.

However, if we think that the interaction is random and varies between repetitions, the situation is different. In this case the combination of assessor and TV set are clusters of correlated measurements (which could make sense here, because we have repeated measurements for each assessor-TV set pair). We can then include the interaction as a nested random effect:

m <- lmer(Coloursaturation ~ TVset*Picture + (1|Assessor/TVset),
         data = TVbo)
m
anova(m)

Neither of these approaches is inherently superior to the other. Which we choose is a matter of what we think best describes the correlation structure of the data.

In either case, the results are similar, and all fixed effects are significant at the 5 % level.

(Click here to return to the exercise.)

Exercise 8.23

BROOD, INDEX (subject ID number) and LOCATION all seem like they could cause measurements to be correlated, and so are good choices for random effects. To keep the model simple, we’ll only include random intercepts. We fit a mixed Poisson regression using glmer:

library(lme4)
m <- glmer(TICKS ~ YEAR + HEIGHT + (1|BROOD) + (1|INDEX) + (1|LOCATION),
           data = grouseticks, family = poisson)
summary(m, correlation = FALSE)

To compute the bootstrap confidence interval for the effect of HEIGHT, we use boot_summary:

library(boot.pval)
boot_summary(m, type = "perc", R = 100)

(Click here to return to the exercise.)

Chapter 9

Exercise 9.1

The ovarian data comes from a randomised trial comparing two treatments for ovarian cancer:

library(survival)
?ovarian
str(ovarian)

Let’s plot Kaplan-Meier curves to compare the two treatments:

library(ggfortify)
m <- survfit(Surv(futime, fustat) ~ rx, data = ovarian)
autoplot(m)

We get the median survival times as follows:

For the group with rx equal to 2, the estimated median is NA. We can see why by looking at the Kaplan-Meier curve for that group: more than 50 % of the patients were alive at the end of the study, and consequently we cannot estimate the median survival from the Kaplan-Meier curve.

The parametric confidence interval overlap a lot. Let’s compute a bootstrap confidence interval for the difference in the 90 % quantile of the survival times. We set the quantile level using the q argument in bootkm:

library(Hmisc)

# Create a survival object:
survobj <- Surv(ovarian$futime, ovarian$fustat)

# Get bootstrap replicates of the 75 % quantile for the
# survival time for the two groups:
q90_surv_time_1 <- bootkm(survobj[ovarian$rx == 1],
                                  q = 0.90, B = 999)
q90_surv_time_2 <- bootkm(survobj[ovarian$rx == 2],
                                q = 0.90, B = 999)

# 95 % bootstrap confidence interval for the difference in
# 90 % quantile of the survival time distribution:
quantile(q90_surv_time_2 - q90_surv_time_1,
         c(.025,.975), na.rm=TRUE)

The resulting confidence interval is quite wide, which is unsurprising as we have a fairly small sample size.

(Click here to return to the exercise.)

Exercise 9.2

First, we fit a Cox regression model. From ?ovarian we see that the survival/censoring times are given by futime and the censoring status by fustat.

library(survival)
m <- coxph(Surv(futime, fustat) ~ age + rx,
           data = ovarian, model = TRUE)
summary(m)

According to the p-value in the table, which is 0.2, there is no significant difference between the two treatment groups. Put differently, there is no evidence that the hazard ratio for treatment isn’t equal to 1.

To assess the assumption of proportional hazards, we plot the Schoenfeld residuals:

library(survminer)
ggcoxzph(cox.zph(m), var = 1)
ggcoxzph(cox.zph(m), var = 2)

There is no clear trend over time, and the assumption appears to hold.

To compute a bootstrap confidence interval for the hazard ratio for age, we follow the same steps as in the lung example, using censboot_summary:.

library(boot.pval)
censboot_summary(m)

All values in the confidence interval are positive, meaning that we are fairly sure that the hazard increases with age.

(Click here to return to the exercise.)

Exercise 9.3

First, we fit the model:

m <- coxph(Surv(futime, status) ~ age + type + trt,
           cluster = id, data = retinopathy)
summary(m)

To check the assumption of proportional hazards, we make a residual plot:

library(survminer)
ggcoxzph(cox.zph(m), var = 1)
ggcoxzph(cox.zph(m), var = 2)
ggcoxzph(cox.zph(m), var = 3)

As there are no trends over time, there is no evidence against the assumption of proportional hazards.

(Click here to return to the exercise.)

Exercise 9.4

We fit the model:

m <- coxph(Surv(futime, status) ~ age + type + trt + frailty(id),
           data = retinopathy)
summary(m)

(Click here to return to the exercise.)

While the p-values for the parameters change, the conclusions do not: trt is only explanatory variable with a significant effect.

Exercise 9.5

We fit the model using survreg:

library(survival)
m <- survreg(Surv(futime, fustat) ~ ., data = ovarian,
                dist = "loglogistic")

To get the estimated effect on survival times, we exponentiate the coefficients:

exp(coef(m))

According to the model, the survival time increases 1.8 times for patients in treatment group 2, compared to patients in treatment group 1. Running summary(m) shows that the p-value for rx is 0.05, meaning that the result isn’t significant at the 5 % level (albeit with the smallest possible margin!).

(Click here to return to the exercise.)

Exercise 9.6

We set file_path to the path to il2rb.csv and then load the data (note that it uses a decimal comma!):

biomarkers <- read.csv(file_path, sep = ";", dec = ",")

Next, we check which measurements that are nondetects, and impute the detection limit 0.25:

censored <- is.na(biomarkers$IL2RB)
biomarkers$IL2RB[censored] <- 0.25

# Check the proportion of nondetects:
mean(censored)

27.5 % of the observations are left-censored.

To compute bootstrap confidence intervals for the mean of the biomarker level distribution under the assumption of lognormality, we can now use elnormAltCensored:

elnormAltCensored(biomarkers$IL2RB, censored, method = "mle",
              ci = TRUE,  ci.method = "bootstrap",
              n.bootstraps = 999)$interval$limits

(Click here to return to the exercise.)

Exercise 9.7

We set file_path to the path to il2rb.csv and then load and prepare the data:

biomarkers <- read.csv(file_path, sep = ";", dec = ",")
censored <- is.na(biomarkers$IL2RB)
biomarkers$IL2RB[censored] <- 0.25

Based on the recommendations in Zhang et al. (2009), we can now run a Wilcoxon-Mann-Whitney test. Because we’ve imputed the LoD for the nondetects, all observations are included in the test:

wilcox.test(IL2RB ~ Group, data = biomarkers)

The p-value is 0.42, and we do not reject the null hypothesis that there is no difference in location.

(Click here to return to the exercise.)

Chapter 10

Exercise 10.1

First, we have a look at the data:

?ability.cov
ability.cov

We can imagine several different latent variables that could explain how well the participants performed in these tests: general ability, visual ability, verbal ability, and so on. Let’s use a scree plot to determine how many factors to use:

library(psych)
scree(ability.cov$cov, pc = FALSE)

2 or 3 factors seem like a good choice here. Let’s try both:

# 2-factor model:
ab_fa2 <- fa(ability.cov$cov, nfactors = 2,
             rotate = "oblimin", fm = "ml")
fa.diagram(ab_fa2, simple = FALSE)

# 3-factor model:
ab_fa3 <- fa(ability.cov$cov, nfactors = 3,
             rotate = "oblimin", fm = "ml")
fa.diagram(ab_fa3, simple = FALSE)

In the 2-factor model, one factor is primarily associated with the visual variables (which we interpret as the factor describing visual ability), whereas the other primarily is associated with reading and vocabulary (verbal ability). Both are associated with the measure of general intelligence.

In the 3-factor model, there is still a factor associated with reading and vocabulary. There are two factors associated with the visual tests: one with block design and mazes and one with picture completion and general intelligence.

(Click here to return to the exercise.)

Exercise 10.2

First, we have a look at the data:

library(poLCA)
?cheating
View(cheating)

Next, we perform a latent class analysis with GPA as a covariate:

m <- poLCA(cbind(LIEEXAM, LIEPAPER,
                 FRAUD, COPYEXAM) ~ GPA,
           data = cheating, nclass = 2)

The two classes roughly correspond to cheaters and non-cheaters. From the table showing the relationship with GPA, we see students with high GPA’s are less likely to be cheaters.

(Click here to return to the exercise.)

Exercise 10.3

First, we note that the sample size is small ($n=30$), so we should take any conclusions with a fistful of salt.

We specify the model and then fit it using cfa:

model <- "Career_opportunity  =~ advance + learning + raises 
          Employee_treatment =~ complaints + rating + raises + learning + privileges"

library(lavaan)
m <- cfa(model, data = attitude)

Next, we look at the measures of goodness-of-fit:

summary(m, fit.measures = TRUE)

These all indicate a good fit. The $\chi^2$-test is not significant ($p=0.409$). In addition, we have $\mbox{CFI}=0.999>0.9$, $\mbox{TLI}=0.997>0.95$, $\mbox{RMSEA}=0.026<0.06$, and $\mbox{SRMR}=0.059<0.08$. Our conclusion is that the model fits the data well, indicating that it gives a reasonable description of the dependence structure.

Finally, we plot the path diagram of the model:

library(semPlot)
semPaths(m, exoVar = FALSE, residuals = FALSE, fixedStyle = 1)

(Click here to return to the exercise.)

Exercise 10.4

We specify the model and then fit it using sem:

model <- "
  # Measurement model:
  Visual  =~ x1 + x2 + x3 
  Textual =~ x4 + x5 + x6
  Speed   =~ x7 + x8 + x9
  Maturity = ~ ageyr + sex
  
  # Latent variable regressions:
  Visual ~ Maturity
  Textual ~ Maturity
  Speed ~ Maturity"

library(lavaan)
m <- sem(model, data = HolzingerSwineford1939,
         std.lv = TRUE)
summary(m, fit.measures = TRUE)

The results indicate that the theory may not be correct. The “Maturity” variable is not significantly associated with age and sex. The endogenous variables are not significantly associated with “Maturity”. The $\chi^2$ test, the TLI, and the RMSEA indicate problems with the model fit.

Finally, we plot the path diagram of the model:

library(semPlot)
semPaths(m)

(Click here to return to the exercise.)

Exercise 10.5

We set file_path to the path to treegrowth.csv to import the data, specify the model and then fit it using sem:

treegrowth <- read.csv(file_path)

model <- "
  # Measurement model:
  Environment =~ x1 + x2 + x3 + x4 + x5
  Growth =~ y1 + y2 + y3 + y4
  
  # Latent variable regressions:
  Growth ~ Environment"

library(lavaan)
m <- sem(model, data = treegrowth,
         std.lv = TRUE)
summary(m, fit.measures = TRUE)

The goodness-of-fit measures all indicate a very good fit. All relations are significant, except for the relation between x5 and the Environment variable.

Finally, we plot the path diagram of the model:

library(semPlot)
semPaths(m)

(Click here to return to the exercise.)

Exercise 10.6

We set file_path to the path to treegrowth.csv to import the data, specify the model and then fit it using sem:

treegrowth <- read.csv(file_path)

model <- "# Model:
          x4 ~ a*x1
          y1 ~ b*x4 + c*x1

          # Indirect and total effects:
          indirect_effect := a*b
          total_effect := indirect_effect + c"

library(lavaan)
m <- sem(model, data = treegrowth,
         std.lv = TRUE)
summary(m)

The indirect effect is significant, so there is mediation here. However, it the indirect effect is only a small proportion of the total effect.

(Click here to return to the exercise.)

Exercise 10.7

We set file_path to the path to treegrowth.csv to import the data, specify the model and then fit it using sem:

treegrowth <- read.csv(file_path)

model <- "# Model:
          x4 ~ a1*x1 + a2*x2 + a12*x1:x2
          y1 ~ b*x4 + c1*x1 + c2*x2 + c12*x1:x2

          # Moderated effects:
          mediated_moderation := a12*b
          total_moderated_effect := mediated_moderation + c12"

library(lavaan)
m <- sem(model, data = treegrowth,
         std.lv = TRUE)
summary(m)

The moderated effects are not significant, and we conclude that there is no statistical evidence for moderation.

(Click here to return to the exercise.)

Chapter 11

Exercise 11.1

We load the data and compute the expected values using the formula $y = 2x_1-x_2+x_3\cdot x_2$:

exdata <- data.frame(x1 = c(0.87, -1.03, 0.02, -0.25, -1.09, 0.74,
          0.09, -1.64, -0.32, -0.33, 1.40, 0.29, -0.71, 1.36, 0.64,
          -0.78, -0.58, 0.67, -0.90, -1.52, -0.11, -0.65, 0.04,
          -0.72, 1.71, -1.58, -1.76, 2.10, 0.81, -0.30),
          x2 = c(1.38, 0.14, 1.46, 0.27, -1.02, -1.94, 0.12, -0.64,
          0.64, -0.39, 0.28, 0.50, -1.29, 0.52, 0.28, 0.23, 0.05,
          3.10, 0.84, -0.66, -1.35, -0.06, -0.66, 0.40, -0.23,
          -0.97, -0.78, 0.38, 0.49, 0.21),
          x3 = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
          1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1),
          y = c(3.47, -0.80, 4.57, 0.16, -1.77, -6.84, 1.28, -0.52,
          1.00, -2.50, -1.99, 1.13, -4.26, 1.16, -0.69, 0.89, -1.01,
          7.56, 2.33, 0.36, -1.11, -0.53, -1.44, -0.43, 0.69, -2.30,
          -3.55, 0.99, -0.50, -1.67))

exdata$Ey = 2*exdata$x2 - exdata$x3 + exdata$x3*exdata$x2

Next, we plot the expected values against the actual values:

library(ggplot2)
ggplot(exdata, aes(Ey, y)) + geom_point()

The points seem to follow a straight line, and a linear model seems appropriate.

Next, we fit a linear model to the first 20 observations:

m <- lm(y ~ x1 + x2 + x3, data = exdata[1:20,])
summary(m)

The $R^2$-value is pretty high: 0.91. x1 and x2 both have low p-values, as does the F-test for the regression. We can check the model fit by comparing the fitted values to the actual values. We add a red line that the points should follow if we have a good fit:

ggplot(exdata[1:20,], aes(y, predict(m))) + 
   geom_point() +
   geom_abline(intercept = 0, slope = 1, col = "red")

The model seems to be pretty good! Now let’s see how well it does when faced with new data.

We make predictions for all 10 observations:

exdata$predictions <- predict(m, exdata)

We can plot the results for the last 10 observations, which weren’t used when we fitted the model:

ggplot(exdata[21:30,], aes(y, predictions)) + 
   geom_point() +
   geom_abline(intercept = 0, slope = 1, col = "red")

The results are much worse than before! The correlation between the predicted values and the actual values is very low:

cor(exdata[21:30,]$y, exdata[21:30,]$predictions)

Despite the good in-sample performance (as indicated e.g. by the high $R^2$), the model doesn’t seem to be very useful for prediction.

Perhaps you noted that the effect of x3 wasn’t significant in the model. Perhaps the performance will improve if we remove it? Let’s try!

m <- lm(y ~ x1 + x2, data = exdata[1:20,])
summary(m)

The p-values and $R^2$ still look very promising. Let’s make predictions for the new observations and check the results:

exdata$predictions <- predict(m, exdata)

ggplot(exdata[21:30,], aes(y, predictions)) + 
   geom_point() +
   geom_abline(intercept = 0, slope = 1, col = "red")

cor(exdata[21:30,]$y, exdata[21:30,]$predictions)

The predictions are no better than before - indeed, the correlation between the actual and predicted values is even lower this time out!

Finally, we fit a correctly specified model and evaluate the results:

m <- lm(y ~ x1 + x2 + x3*x2, data = exdata[1:20,])
summary(m)

exdata$predictions <- predict(m, exdata)

ggplot(exdata[21:30,], aes(y, predictions)) + 
   geom_point() +
   geom_abline(intercept = 0, slope = 1, col = "red")

cor(exdata[21:30,]$y, exdata[21:30,]$predictions)

The predictive performance of the model remains low, which shows that model misspecification wasn’t the (only) reason for the poor performance of the previous models.

(Click here to return to the exercise.)

Exercise 11.2

We set file_path to the path to estates.xlsx and then load the data:

library(openxlsx)
estates <- read.xlsx(file_path)

View(estates)

There are a lot of missing values which can cause problems when fitting the model, so let’s remove those:

estates <- na.omit(estates)

Next, we fit a linear model and evaluate it with LOOCV using caret and train:

library(caret)
tc <- trainControl(method = "LOOCV")
m <- train(selling_price ~ .,
           data = estates,
           method = "lm",
           trControl = tc)

The $RMSE$ is 547 and the $MAE$ is 395 kSEK. The average selling price in the data (mean(estates$selling_price)) is 2843 kSEK, meaning that the $MAE$ is approximately 13 % of the mean selling price. This is not unreasonably high for this application. Prediction errors are definitely expected here, given the fact that we have relatively few variables - the selling price can be expected to depend on several things not captured by the variables in our data (proximity to schools, access to public transport, and so on). Moreover, houses in Sweden are not sold at fixed prices, but subject to bidding, which can cause prices to fluctuate a lot. All in all, and $MAE$ of 395 is pretty good, and, at the very least, the model seems useful for getting a ballpark figure for the price of a house.

(Click here to return to the exercise.)

Exercise 11.3

We set file_path to the path to estates.xlsx and then load and clean the data:

library(openxlsx)
estates <- read.xlsx(file_path)
estates <- na.omit(estates)

Next, we evaluate the model with 10-fold cross-validation a few times:

library(caret)
# Run this several times:
tc <- trainControl(method = "cv" , number = 10)
m <- train(selling_price ~ .,
           data = estates,
           method = "lm",
           trControl = tc)
m$results

In my runs, the $MAE$ ranged from to 391 to 405. Not a massive difference on the scale of the data, but there is clearly some variability in the results.

Next, we run repeated 10-fold cross-validations a few times:

# Run this several times:
tc <- trainControl(method = "repeatedcv",
                   number = 10, repeats = 100)
m <- train(selling_price ~ .,
           data = estates,
           method = "lm",
           trControl = tc)
m$results

In my runs the $MAE$ varied between 396.0 and 397.4. There is still some variability, but it is much smaller than for a simple 10-fold cross-validation.

(Click here to return to the exercise.)

Exercise 11.4

We set file_path to the path to estates.xlsx and then load and clean the data:

library(openxlsx)
estates <- read.xlsx(file_path)
estates <- na.omit(estates)

Next, we evaluate the model with the bootstrap a few times:

library(caret)
# Run this several times:
tc <- trainControl(method = "boot",
                   number = 999)
m <- train(selling_price ~ .,
           data = estates,
           method = "lm",
           trControl = tc)
m$results

In my run, the $MAE$ varied between 410.0 and 411.8, meaning that the variability is similar to the with repeated 10-fold cross-validation. When I increased the number of bootstrap samples to 9,999, the $MAE$ stabilised around 411.7.

(Click here to return to the exercise.)

Exercise 11.5

We load and format the data as in the beginning of Section 11.1.7. We can then fit the two models using train:

library(caret)
tc <- trainControl(method = "repeatedcv",
                   number = 10, repeats = 100,
                   savePredictions = TRUE,
                   classProbs = TRUE)

# Model 1 - two variables:
m <- train(type ~ pH + alcohol,
           data = wine,
           trControl = tc,
           method = "glm",
           family = "binomial")

# Model 2 - four variables:
m2 <- train(type ~ pH + alcohol + fixed.acidity + residual.sugar,
           data = wine,
           trControl = tc,
           method = "glm",
           family = "binomial")

To compare the models, we use evalm to plot ROC and calibration curves:

library(MLeval)
plots <- evalm(list(m, m2),
               gnames = c("Model 1", "Model 2"))

# ROC:
plots$roc

# Calibration curves:
plots$cc

Model 2 performs much better, both in terms of $AUC$ and calibration. Adding two more variables has both increased the predictive performance of the model (a much higher $AUC$) and lead to a better-calibrated model.

(Click here to return to the exercise.)

Exercise 11.9

First, we load and clean the data:

library(openxlsx)
estates <- read.xlsx(file_path)
estates <- na.omit(estates)

Next, we fit a ridge regression model and evaluate it with LOOCV using caret and train:

library(caret)
tc <- trainControl(method = "LOOCV")
m <- train(selling_price ~ .,
           data = estates,
           method = "glmnet", 
           tuneGrid = expand.grid(alpha = 0,
                                  lambda = seq(0, 10, 0.1)),
           trControl = tc) 

# Results for the best model:
m$results[which(m$results$lambda == m$finalModel$lambdaOpt),]

Noticing that the $\lambda$ that gave the best $RMSE$ was 10, which was the maximal $\lambda$ that we investigated, we rerun the code, allowing for higher values of $\lambda$:

m <- train(selling_price ~ .,
           data = estates,
           method = "glmnet", 
           tuneGrid = expand.grid(alpha = 0,
                                  lambda = seq(10, 120, 1)),
           trControl = tc) 

# Results for the best model:
m$results[which(m$results$lambda == m$finalModel$lambdaOpt),]

The $RMSE$ is 549 and the $MAE$ is 399. In this case, ridge regression did not improve the performance of the model compared to an ordinary linear regression.

(Click here to return to the exercise.)

Exercise 11.10

We load and format the data as in the beginning of Section 11.1.7.

We can now fit the models using train, making sure to add family = "binomial":

library(caret)
tc <- trainControl(method = "cv",
                   number = 10,
                   savePredictions = TRUE,
                   classProbs = TRUE)
m1 <- train(type ~ pH + alcohol + fixed.acidity + residual.sugar,
           data = wine,
           method = "glmnet", 
           family = "binomial",
           tuneGrid = expand.grid(alpha = 0,
                                  lambda = seq(0, 10, 0.1)),
           trControl = tc) 

m1

The best value for $\lambda$ is 0, meaning that no regularisation is used.

Next, we add summaryFunction = twoClassSummary and metric = "ROC", which means that $AUC$ and not accuracy will be used to find the optimal $\lambda$:

tc <- trainControl(method = "cv",
                   number = 10,
                   summaryFunction = twoClassSummary,
                   savePredictions = TRUE,
                   classProbs = TRUE)
m2 <- train(type ~ pH + alcohol + fixed.acidity + residual.sugar,
           data = wine,
           method = "glmnet", 
           family = "binomial",
           tuneGrid = expand.grid(alpha = 0,
                                  lambda = seq(0, 10, 0.1)),
           metric = "ROC",
           trControl = tc) 

m2

The best value for $\lambda$ is still 0. For this dataset, both accuracy and $AUC$ happened to give the same $\lambda$, but that isn’t always the case.

(Click here to return to the exercise.)

Exercise 11.11

First, we load and clean the data:

library(openxlsx)
estates <- read.xlsx(file_path)
estates <- na.omit(estates)

Next, we fit a lasso model and evaluate it with LOOCV using caret and train:

library(caret)
tc <- trainControl(method = "LOOCV")
m <- train(selling_price ~ .,
           data = estates,
           method = "glmnet", 
           tuneGrid = expand.grid(alpha = 1,
                                  lambda = seq(0, 10, 0.1)),
           trControl = tc) 

# Results for the best model:
m$results[which(m$results$lambda == m$finalModel$lambdaOpt),]

The $RMSE$ is 545 and the $MAE$ is 394. Both are a little lower than for the ordinary linear regression, but the difference is small in this case. To see which variables have been removed, we can use:

coef(m$finalModel, m$finalModel$lambdaOpt)

Note that this data isn’t perfectly suited to the lasso, because most variables are useful in explaining the selling price. Where the lasso really shines in problems where a lot of the variables, perhaps even most, aren’t useful in explaining the response variable. We’ll see an example of that in the next exercise.

(Click here to return to the exercise.)

Exercise 11.12

We try fitting a linear model to the data:

m <- lm(y ~ ., data = simulated_data)
summary(m)

There are no error messages, but summary reveals that there were problems: Coefficients: (101 not defined because of singularities) and for half the variables we don’t get estimates of the coefficients. It is not possible to fit ordinary linear models when there are more variables than observations (there is no unique solution to the least squares equations from which we obtain the coefficient estimates), which leads to this strange-looking output.

Lasso models can be used even when the number of variables is greater than the number of observations - regularisation ensures that there will be a unique solution. We fit a lasso model using caret and train:

library(caret)
tc <- trainControl(method = "LOOCV")
m <- train(y ~ .,
           data = simulated_data,
           method = "glmnet", 
           tuneGrid = expand.grid(alpha = 1,
                                  lambda = seq(0, 10, 0.1)),
           trControl = tc)

Next, we have a look at what variables have non-zero coefficients:

rownames(coef(m$finalModel, m$finalModel$lambdaOpt))[
         coef(m$finalModel, m$finalModel$lambdaOpt)[,1]!= 0]

Your mileage may vary (try running the simulation more than once!), but it is likely that the lasso will have picked at least the first four of the explanatory variables, probably along with some additional variables. Try changing the ratio between n and p in your experiment, or the size of the coefficients used when generating y, and see what happens.

(Click here to return to the exercise.)

Exercise 11.13

First, we load and clean the data:

library(openxlsx)
estates <- read.xlsx(file_path)
estates <- na.omit(estates)

Next, we fit an elastic net model and evaluate it with LOOCV using caret and train:

library(caret)
tc <- trainControl(method = "LOOCV")
m <- train(selling_price ~ .,
           data = estates,
           method = "glmnet", 
           tuneGrid = expand.grid(alpha = seq(0, 1, 0.2),
                                  lambda = seq(10, 20, 1)),
           trControl = tc) 

# Print best choices of alpha and lambda:
m$bestTune

# Print the RMSE and MAE for the best model:
m$results[which(rownames(m$results) == rownames(m$bestTune)),]

We get a slight improvement over the lasso, with an $RMSE$ of 543.5 and an $MAE$ of 393.

(Click here to return to the exercise.)

Exercise 11.14

We load and format the data as in the beginning of Section 11.1.7. We can then fit the model using train. We set summaryFunction = twoClassSummary and metric = "ROC" to use $AUC$ to find the optimal $k$.

library(caret)
tc <- trainControl(method = "repeatedcv",
                   number = 10, repeats = 100,
                   summaryFunction = twoClassSummary,
                   savePredictions = TRUE,
                   classProbs = TRUE)

m <- train(type ~ pH + alcohol + fixed.acidity + residual.sugar,
           data = wine,
           trControl = tc,
           method = "rpart",
           metric = "ROC",
           tuneGrid = expand.grid(cp = 0))

m

Next, we plot the resulting decision tree:

library(rpart.plot)
prp(m$finalModel)

The tree is pretty large. The parameter cp, called a complexity parameter, can be used to prune the tree, i.e. to make it smaller. Let’s try setting a larger value for cp:

m <- train(type ~ pH + alcohol + fixed.acidity + residual.sugar,
           data = wine,
           trControl = tc,
           method = "rpart",
           metric = "ROC",
           tuneGrid = expand.grid(cp = 0.1))
prp(m$finalModel)

That was way too much pruning - now the tree is too small! Try a value somewhere in-between:

m <- train(type ~ pH + alcohol + fixed.acidity + residual.sugar,
           data = wine,
           trControl = tc,
           method = "rpart",
           metric = "ROC",
           tuneGrid = expand.grid(cp = 0.01))
prp(m$finalModel)

That seems like a good compromise. The tree is small enough for us to understand and discuss, but hopefully large enough that it still has a high $AUC$.

For presentation and interpretability purposes we can experiment with manually setting different values of cp. We can also let train find an optimal value of cp for us, maximising for instance the $AUC$. We’ll use tuneGrid = expand.grid(cp = seq(0, 0.01, 0.001)) to find a good choice of cp somewhere between 0 and 0.01:

m <- train(type ~ pH + alcohol + fixed.acidity + residual.sugar,
           data = wine,
           trControl = tc,
           method = "rpart",
           metric = "ROC",
           tuneGrid = expand.grid(cp = seq(0, 0.01, 0.001)))
m
prp(m$finalModel)

In some cases, increasing cp can increase the $AUC$, but not here - a cp of 0 turns out to be optimal in this instance.

Finally, to visually evaluate the model, we use evalm to plot ROC and calibration curves:

library(MLeval)
plots <- evalm(m, gnames = "Decision tree")

# ROC:
plots$roc

# 95 % Confidence interval for AUC:
plots$optres[[1]][13,]

# Calibration curves:
plots$cc

(Click here to return to the exercise.)

Exercise 11.15

We set file_path to the path of bacteria.csv, then load and format the data as in Section 11.3.3:

bacteria <- read.csv(file_path)
bacteria$Time <- as.POSIXct(bacteria$Time, format = "%H:%M:%S")

Next, we fit a regression tree model using rows 45 to 90:

library(caret)
tc <- trainControl(method = "LOOCV")

m <- train(OD ~ Time,
           data = bacteria[45:90,],
           trControl = tc,
           method = "rpart",
           tuneGrid = expand.grid(cp = 0))

Finally, we make predictions for the entire dataset and compare the results to the actual outcomes:

bacteria$Predicted <- predict(m, bacteria)

library(ggplot2)
ggplot(bacteria, aes(Time, OD)) +
        geom_line() +
        geom_line(aes(Time, Predicted), colour = "red")

Regression trees are unable to extrapolate beyond the training data. By design, they will make constant predictions whenever the values of the explanatory variables go beyond those in the training data. Bear this in mind if you use tree-based models for predictions!

(Click here to return to the exercise.)

Exercise 11.16

First, we load the data as in Section 4.11:

# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table("https://tinyurl.com/seedsdata",
        col.names = c("Area", "Perimeter", "Compactness",
         "Kernel_length", "Kernel_width", "Asymmetry",
         "Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)

Next, we fit a classification tree model with Kernel_length and Compactness as explanatory variables:

library(caret)
tc <- trainControl(method = "LOOCV")

m <- train(Variety ~ Kernel_length + Compactness,
           data = seeds,
           trControl = tc,
           method = "rpart",
           tuneGrid = expand.grid(cp = 0))

Finally, we plot the decision boundaries:

contour_data <- expand.grid(
  Kernel_length = seq(min(seeds$Kernel_length), max(seeds$Kernel_length),
                      length = 500),
  Compactness = seq(min(seeds$Compactness), max(seeds$Compactness), length = 500))

predictions <- data.frame(contour_data,
                          Variety = as.numeric(predict(m, contour_data)))

library(ggplot2)
ggplot(seeds, aes(Kernel_length, Compactness, colour = Variety)) +
      geom_point(size = 2) +
      stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
                   data = predictions, colour = "black")

The decision boundaries seem pretty good - most points in the lower left part belong to variety 3, most in the middle to variety 1, and most to the right to variety 2.

(Click here to return to the exercise.)

Exercise 11.17

We load and format the data as in the beginning of Section 11.1.7. We can then fit the models using train (fitting m2 takes a while):

library(caret)
tc <- trainControl(method = "cv",
                   number = 10, 
                   summaryFunction = twoClassSummary,
                   savePredictions = TRUE,
                   classProbs = TRUE)

m1 <- train(type ~ .,
           data = wine,
           trControl = tc,
           method = "rpart",
           metric = "ROC",
           tuneGrid = expand.grid(cp = c(0, 0.1, 0.01)))

m2 <- train(type ~ .,
           data = wine,
           trControl = tc,
           method = "rf",
           metric = "ROC",
           tuneGrid = expand.grid(mtry = 2:6))

Next, we compare the results of the best models:

m1
m2

And finally, a visual comparison:

library(MLeval)
plots <- evalm(list(m1, m2),
               gnames = c("Decision tree", "Random forest"))

# ROC:
plots$roc

# Calibration curves:
plots$cc

The calibration curves may look worrisome, but the main reason that they deviate from the straight line is that almost all observations have predicted probabilities close to either 0 or 1. To see this, we can have a quick look at the histogram of the predicted probabilities that the wines are white:

hist(predict(m2, type ="prob")[,2])

We used 10-fold cross-validation here, as using repeated cross-validation would take too long (at least in this case, where we only study this data as an example). As we’ve seen before, that means that the performance metrics can vary a lot between runs, so we shouldn’t read too much into the difference we found here.

(Click here to return to the exercise.)

Exercise 11.18

We set file_path to the path of bacteria.csv, then load and format the data as in Section 11.3.3:

bacteria <- read.csv(file_path)
bacteria$Time <- as.POSIXct(bacteria$Time, format = "%H:%M:%S")

Next, we fit a random forest using rows 45 to 90:

library(caret)
tc <- trainControl(method = "LOOCV")

m <- train(OD ~ Time,
           data = bacteria[45:90,],
           trControl = tc,
           method = "rf",
           tuneGrid = expand.grid(mtry = 1))

Finally, we make predictions for the entire dataset and compare the results to the actual outcomes:

bacteria$Predicted <- predict(m, bacteria)

library(ggplot2)
ggplot(bacteria, aes(Time, OD)) +
        geom_line() +
        geom_line(aes(Time, Predicted), colour = "red")

The model does very well for the training data, but fails to extrapolate beyond it. Because random forests are based on decision trees, they give constant predictions whenever the values of the explanatory variables go beyond those in the training data.

(Click here to return to the exercise.)

Exercise 11.19

First, we load the data as in Section 4.11:

# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table("https://tinyurl.com/seedsdata",
        col.names = c("Area", "Perimeter", "Compactness",
         "Kernel_length", "Kernel_width", "Asymmetry",
         "Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)

Next, we fit a random forest model with Kernel_length and Compactness as explanatory variables:

library(caret)
tc <- trainControl(method = "LOOCV")

m <- train(Variety ~ Kernel_length + Compactness,
           data = seeds,
           trControl = tc,
           method = "rf",
           tuneGrid = expand.grid(mtry = 1:2))

Finally, we plot the decision boundaries:

contour_data <- expand.grid(
  Kernel_length = seq(min(seeds$Kernel_length), max(seeds$Kernel_length),
                      length = 500),
  Compactness = seq(min(seeds$Compactness), max(seeds$Compactness), length = 500))

predictions <- data.frame(contour_data,
                          Variety = as.numeric(predict(m, contour_data)))

library(ggplot2)
ggplot(seeds, aes(Kernel_length, Compactness, colour = Variety)) +
      geom_point(size = 2) +
      stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
                   data = predictions, colour = "black")

The decision boundaries are much more complex and flexible than those for the decision tree of Exercise 11.16. Perhaps they are too flexible, and the model has overfitted to the training data?

(Click here to return to the exercise.)

Exercise 11.20

We load and format the data as in the beginning of Section 11.1.7. We can then fit the model using train. Try a large number of parameter values to see if you can get a high $AUC$. You can try using a simple 10-fold cross-validation to find reasonable candidate values for the parameters, and then rerun the tuning with a replicated 10-fold cross-validation with parameter values close to those that were optimal in your first search.

library(caret)
tc <- trainControl(method = "cv",
                   number = 10,
                   summaryFunction = twoClassSummary,
                   savePredictions = TRUE,
                   classProbs = TRUE)

m <- train(type ~ pH + alcohol + fixed.acidity + residual.sugar,
           data = wine,
           trControl = tc,
           method = "gbm",
           metric = "ROC",
           tuneGrid = expand.grid(
                 interaction.depth = 1:5,
                 n.trees = seq(20, 200, 20),
                 shrinkage = seq(0.01, 0.1, 0.01),
                 n.minobsinnode = c(10, 20, 30)),
           verbose = FALSE)

ggplot(m)

(Click here to return to the exercise.)

Exercise 11.21

We set file_path to the path of bacteria.csv, then load and format the data as in Section 11.3.3:

bacteria <- read.csv(file_path)
bacteria$Time <- as.POSIXct(bacteria$Time, format = "%H:%M:%S")

Next, we fit a boosted trees model using rows 45 to 90:

library(caret)
tc <- trainControl(method = "LOOCV")

m <- train(OD ~ Time,
           data = bacteria[45:90,],
           trControl = tc,
           method = "gbm")

Finally, we make predictions for the entire dataset and compare the results to the actual outcomes:

bacteria$Predicted <- predict(m, bacteria)

library(ggplot2)
ggplot(bacteria, aes(Time, OD)) +
        geom_line() +
        geom_line(aes(Time, Predicted), colour = "red")

The model does OK for the training data, but fails to extrapolate beyond it. Because boosted trees models are based on decision trees, they give constant predictions whenever the values of the explanatory variables go beyond those in the training data.

(Click here to return to the exercise.)

Exercise 11.22

First, we load the data as in Section 4.11:

# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table("https://tinyurl.com/seedsdata",
        col.names = c("Area", "Perimeter", "Compactness",
         "Kernel_length", "Kernel_width", "Asymmetry",
         "Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)

Next, we fit a boosted trees model with Kernel_length and Compactness as explanatory variables:

library(caret)
tc <- trainControl(method = "LOOCV")

m <- train(Variety ~ Kernel_length + Compactness,
           data = seeds,
           trControl = tc,
           method = "gbm",
           verbose = FALSE)

Finally, we plot the decision boundaries:

contour_data <- expand.grid(
  Kernel_length = seq(min(seeds$Kernel_length), max(seeds$Kernel_length),
                      length = 500),
  Compactness = seq(min(seeds$Compactness), max(seeds$Compactness), length = 500))

predictions <- data.frame(contour_data,
                          Variety = as.numeric(predict(m, contour_data)))

library(ggplot2)
ggplot(seeds, aes(Kernel_length, Compactness, colour = Variety)) +
      geom_point(size = 2) +
      stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
                   data = predictions, colour = "black")

The decision boundaries are much complex and flexible than those for the decision tree of Exercise 11.16, but does not appear to have overfitted like the random forest in Exercise 11.19.

(Click here to return to the exercise.)

Exercise 11.23

We set file_path to the path of bacteria.csv, then load and format the data as in Section 11.3.3:

bacteria <- read.csv(file_path)
bacteria$Time <- as.numeric(as.POSIXct(bacteria$Time, format = "%H:%M:%S"))

First, we fit a decision tree using rows 45 to 90:

library(caret)
tc <- trainControl(method = "LOOCV")

m <- train(OD ~ Time,
           data = bacteria[45:90,],
           trControl = tc,
           method = "rpart",
           tuneGrid = expand.grid(cp = 0))

Next, we fit a model tree using rows 45 to 90. The only explanatory variable available to us is Time, and we want to use that both for the models in the nodes and for the splits:

library(partykit)
m2 <- lmtree(OD ~ Time | Time, data = bacteria[45:90,])

library(ggparty)
autoplot(m2)

Next, we make predictions for the entire dataset and compare the results to the actual outcomes. We plot the predictions from the decision tree in red and those from the model tree in blue:

bacteria$Predicted_dt <- predict(m, bacteria)
bacteria$Predicted_mt <- predict(m2, bacteria)

library(ggplot2)
ggplot(bacteria, aes(Time, OD)) +
        geom_line() +
        geom_line(aes(Time, Predicted_dt), colour = "red") +
      geom_line(aes(Time, Predicted_mt), colour = "blue")

Neither model does particularly well (but fail in different ways).

Next, we repeat the same steps, but use observations 20 to 120 for fitting the models:

m <- train(OD ~ Time,
           data = bacteria[20:120,],
           trControl = tc,
           method = "rpart",
           tuneGrid = expand.grid(cp = 0))

m2 <- lmtree(OD ~ Time | Time, data = bacteria[20:120,])

autoplot(m2)

bacteria$Predicted_dt <- predict(m, bacteria)
bacteria$Predicted_mt <- predict(m2, bacteria)

library(ggplot2)
ggplot(bacteria, aes(Time, OD)) +
        geom_line() +
        geom_line(aes(Time, Predicted_dt), colour = "red") +
      geom_line(aes(Time, Predicted_mt), colour = "blue")

As we can see from the plot of the model tree, it (correctly!) identifies different time phases in which the bacteria grow at different speeds. It therefore also managed to make better extrapolation than the decision tree, which predicts no growth as Time is increased beyond what was seen in the training data.

(Click here to return to the exercise.)

Exercise 11.24

We load and format the data as in the beginning of Section 11.1.7. We can then fit the model using train as follows:

library(caret)
tc <- trainControl(method = "repeatedcv",
                   number = 10, repeats = 100,
                   summaryFunction = twoClassSummary,
                   savePredictions = TRUE,
                   classProbs = TRUE)

m <- train(type ~ .,
           data = wine,
           trControl = tc,
           method = "qda",
           metric = "ROC")

m

To round things off, we evaluate the model using evalm:

library(MLeval)
plots <- evalm(m, gnames = "QDA")

# ROC:
plots$roc

# 95 % Confidence interval for AUC:
plots$optres[[1]][13,]

# Calibration curves:
plots$cc

(Click here to return to the exercise.)

Exercise 11.25

First, we load the data as in Section 4.11:

# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table("https://tinyurl.com/seedsdata",
        col.names = c("Area", "Perimeter", "Compactness",
         "Kernel_length", "Kernel_width", "Asymmetry",
         "Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)

Next, we fit LDA and QDA models with Kernel_length and Compactness as explanatory variables:

library(caret)
tc <- trainControl(method = "LOOCV")

m1 <- train(Variety ~ Kernel_length + Compactness,
           data = seeds,
           trControl = tc,
           method = "lda")

m2 <- train(Variety ~ Kernel_length + Compactness,
           data = seeds,
           trControl = tc,
           method = "qda")

Next, we plot the decision boundaries in the same scatterplot (LDA is black and QDA is orange):

contour_data <- expand.grid(
  Kernel_length = seq(min(seeds$Kernel_length), max(seeds$Kernel_length),
                      length = 500),
  Compactness = seq(min(seeds$Compactness), max(seeds$Compactness), length = 500))

predictions1 <- data.frame(contour_data,
                          Variety = as.numeric(predict(m1, contour_data)))
predictions2 <- data.frame(contour_data,
                          Variety = as.numeric(predict(m2, contour_data)))

library(ggplot2)
ggplot(seeds, aes(Kernel_length, Compactness, colour = Variety)) +
      geom_point(size = 2) +
      stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
                   data = predictions1, colour = "black") +
      stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
                   data = predictions2, colour = "orange")

The decision boundaries are fairly similar and seem pretty reasonable. QDA offers more flexible non-linear boundaries, but the difference isn’t huge.

(Click here to return to the exercise.)

Exercise 11.26

First, we load the data as in Section 4.11:

# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table("https://tinyurl.com/seedsdata",
        col.names = c("Area", "Perimeter", "Compactness",
         "Kernel_length", "Kernel_width", "Asymmetry",
         "Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)

Next, we fit the MDA model with Kernel_length and Compactness as explanatory variables:

library(caret)
tc <- trainControl(method = "LOOCV")

m <- train(Variety ~ Kernel_length + Compactness,
           data = seeds,
           trControl = tc,
           method = "mda")

Finally, we plot the decision boundaries:

contour_data <- expand.grid(
  Kernel_length = seq(min(seeds$Kernel_length), max(seeds$Kernel_length),
                      length = 500),
  Compactness = seq(min(seeds$Compactness), max(seeds$Compactness), length = 500))

predictions <- data.frame(contour_data,
                          Variety = as.numeric(predict(m, contour_data)))

library(ggplot2)
ggplot(seeds, aes(Kernel_length, Compactness, colour = Variety)) +
      geom_point(size = 2) +
      stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
                   data = predictions, colour = "black")

The decision boundaries are similar to those of QDA.

(Click here to return to the exercise.)

Exercise 11.27

We load and format the data as in the beginning of Section 11.1.7. We’ll go with a polynomial kernel and compare polynomials of degree 2 and 3. We can fit the model using train as follows:

library(caret)
tc <- trainControl(method = "cv",
                   number = 10, 
                   summaryFunction = twoClassSummary,
                   savePredictions = TRUE,
                   classProbs = TRUE)

m <- train(type ~  .,
           data = wine,
           trControl = tc,
           method = "svmPoly",
           tuneGrid = expand.grid(C = 1,
                                  degree = 2:3,
                                  scale = 1),
           metric = "ROC")

And, as usual, we can then plot ROC and calibration curves:

library(MLeval)
plots <- evalm(m, gnames = "SVM poly")

# ROC:
plots$roc

# 95 % Confidence interval for AUC:
plots$optres[[1]][13,]

# Calibration curves:
plots$cc

(Click here to return to the exercise.)

Exercise 11.28

We set file_path to the path of bacteria.csv, then load and format the data as in Section 11.3.3:

bacteria <- read.csv(file_path)
bacteria$Time <- as.POSIXct(bacteria$Time, format = "%H:%M:%S")

Next, we fit an SVM with a polynomial kernel using rows 45 to 90:

library(caret)
tc <- trainControl(method = "LOOCV")

m <- train(OD ~ Time,
           data = bacteria[45:90,],
           trControl = tc,
           method = "svmPoly")

Finally, we make predictions for the entire dataset and compare the results to the actual outcomes:

bacteria$Predicted <- predict(m, bacteria)

library(ggplot2)
ggplot(bacteria, aes(Time, OD)) +
        geom_line() +
        geom_line(aes(Time, Predicted), colour = "red")

Similar to the linear model in Section 11.3.3, the SVM model does not extrapolate too well outside the training data. Unlike tree-based models, however, it does not yield constant predictions for values of the explanatory variable that are outside the range in the training data. Instead, the fitted function is assumed to follow the same shape as in the training data.

Next, we repeat the same steps using the data from rows 20 to 120:

library(caret)
tc <- trainControl(method = "LOOCV")

m <- train(OD ~ Time,
           data = bacteria[20:120,],
           trControl = tc,
           method = "svmPoly")

bacteria$Predicted <- predict(m, bacteria)

ggplot(bacteria, aes(Time, OD)) +
        geom_line() +
        geom_line(aes(Time, Predicted), colour = "red")

The results are disappointing. Using a different kernel could improve the results though, so go ahead and give that a try!

(Click here to return to the exercise.)

Exercise 11.29

First, we load the data as in Section 4.11:

# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table("https://tinyurl.com/seedsdata",
        col.names = c("Area", "Perimeter", "Compactness",
         "Kernel_length", "Kernel_width", "Asymmetry",
         "Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)

Next, we two different SVM models with Kernel_length and Compactness as explanatory variables:

library(caret)
tc <- trainControl(method = "cv",
                   number = 10)

m1 <- train(Variety ~ Kernel_length + Compactness,
           data = seeds,
           trControl = tc,
           method = "svmPoly")

m2 <- train(Variety ~ Kernel_length + Compactness,
           data = seeds,
           trControl = tc,
           method = "svmRadialCost")

Next, we plot the decision boundaries in the same scatterplot (the polynomial kernel is black and the radial basis kernel is orange):

contour_data <- expand.grid(
  Kernel_length = seq(min(seeds$Kernel_length), max(seeds$Kernel_length),
                      length = 500),
  Compactness = seq(min(seeds$Compactness), max(seeds$Compactness), length = 500))

predictions1 <- data.frame(contour_data,
                          Variety = as.numeric(predict(m1, contour_data)))
predictions2 <- data.frame(contour_data,
                          Variety = as.numeric(predict(m2, contour_data)))

library(ggplot2)
ggplot(seeds, aes(Kernel_length, Compactness, colour = Variety)) +
      geom_point(size = 2) +
      stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
                   data = predictions1, colour = "black") +
      stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
                   data = predictions2, colour = "orange")

It is likely the case that the polynomial kernel gives a similar results to e.g. MDA, whereas the radial basis kernel gives more flexible decision boundaries.

(Click here to return to the exercise.)

Exercise 11.30

library(caret)
tc <- trainControl(method = "cv",
                   number = 10,
                   summaryFunction = twoClassSummary,
                   savePredictions = TRUE,
                   classProbs = TRUE)

m <- train(type ~ pH + alcohol + fixed.acidity + residual.sugar,
           data = wine,
           trControl = tc,
           method = "knn",
           metric = "ROC",
           tuneLength = 15,
           preProcess = c("center","scale"))

m

To visually evaluate the model, we use evalm to plot ROC and calibration curves:

library(MLeval)
plots <- evalm(m, gnames = "kNN")

# ROC:
plots$roc

# 95 % Confidence interval for AUC:
plots$optres[[1]][13,]

# Calibration curves:
plots$cc

The performance is as good as, or a little better than, the best logistic regression model from Exercise 11.5. We shouldn’t make too much of any differences though, as the models were evaluated in different ways - we used repeated 10-fold cross-validation for the logistics models and a simple 10-fold cross-validation here (because repeated cross-validation would be too slow in this case).

(Click here to return to the exercise.)

Exercise 11.31

First, we load the data as in Section 4.11:

# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table("https://tinyurl.com/seedsdata",
        col.names = c("Area", "Perimeter", "Compactness",
         "Kernel_length", "Kernel_width", "Asymmetry",
         "Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)

Next, we two different kNN models with Kernel_length and Compactness as explanatory variables:

library(caret)
tc <- trainControl(method = "LOOCV")

m <- train(Variety ~ Kernel_length + Compactness,
           data = seeds,
           trControl = tc,
           method = "knn",
           tuneLength = 15,
           preProcess = c("center","scale"))

Next, we plot the decision boundaries:

contour_data <- expand.grid(
  Kernel_length = seq(min(seeds$Kernel_length), max(seeds$Kernel_length),
                      length = 500),
  Compactness = seq(min(seeds$Compactness), max(seeds$Compactness), length = 500))

predictions <- data.frame(contour_data,
                          Variety = as.numeric(predict(m, contour_data)))

library(ggplot2)
ggplot(seeds, aes(Kernel_length, Compactness, colour = Variety)) +
      geom_point(size = 2) +
      stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
                   data = predictions, colour = "black")

The decision boundaries are quite “wiggly”, which will always be the case when there are enough points in the sample.

(Click here to return to the exercise.)

Exercise 11.32

We start by plotting the time series:

library(forecast)
library(fma)

autoplot(writing) +
      labs(y = "Sales (francs)") +
      ggtitle("Sales of printing and writing paper")

Next, we fit an ARIMA model after removing the seasonal component:

tsmod <- stlm(writing, s.window = "periodic",
              modelfunction = auto.arima)

The residuals look pretty good for this model:

checkresiduals(tsmod)

Finally, we make a forecast for the next 36 months, adding the seasonal component back and using bootstrap prediction intervals:

autoplot(forecast(tsmod, h = 36, bootstrap = TRUE))

(Click here to return to the exercise.)

’ else ’ ’`