# 8 Regression models

Regression models, in which explanatory variables are used to model the behaviour of a response variable, is without a doubt the most commonly used class of models in the statistical toolbox. In this chapter, we will have a look at different types of regression models tailored to many different sorts of data and applications.

After reading this chapter, you will be able to use R to:

- Fit and evaluate linear and generalised linear models,
- Fit and evaluate mixed models,
- Fit survival analysis models,
- Analyse data with left-censored observations,
- Create matched samples.

## 8.1 Linear models

Being flexible enough to handle different types of data, yet simple enough to be useful and interpretable, linear models are among the most important tools in the statistics toolbox. In this section, we’ll discuss how to fit and evaluate linear models in R.

### 8.1.1 Fitting linear models

We had a quick glance at linear models in Section 3.7. There we used the `mtcars`

data:

```
?mtcarsView(mtcars)
```

First, we plotted fuel consumption (`mpg`

) against gross horsepower (`hp`

):

```
library(ggplot2)
ggplot(mtcars, aes(hp, mpg)) +
geom_point()
```

Given \(n\) observations of \(p\) explanatory variables (also known as predictors, covariates, independent variables, and features), the linear model is:

\[y_i=\beta_0 +\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip} + \epsilon_i,\qquad i=1,\ldots,n\] where \(\epsilon_i\) is a random error with mean 0, meaning that the model also can be written as:

\[E(y_i)=\beta_0 +\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip},\qquad i=1,\ldots,n\]
We fitted a linear model using `lm`

, with `mpg`

as the response variable and `hp`

as the explanatory variable:

```
<- lm(mpg ~ hp, data = mtcars)
m summary(m)
```

We added the fitted line to the scatterplot by using `geom_abline`

:

```
# Check model coefficients:
coef(m)
# Add regression line to plot:
ggplot(mtcars, aes(hp, mpg)) +
geom_point() +
geom_abline(aes(intercept = coef(m)[1], slope = coef(m)[2]),
colour = "red")
```

We had a look at some diagnostic plots given by applying `plot`

to our fitted model `m`

:

`plot(m)`

Finally, we added another variable, the car weight `wt`

, to the model:

```
<- lm(mpg ~ hp + wt, data = mtcars)
m summary(m)
```

Next, we’ll look at what more R has to offer when it comes to regression. Before that though, it’s a good idea to do a quick exercise to make sure that you remember how to fit linear models.

\[\sim\]

**Exercise 8.1 **The `sales-weather.csv`

data from Section 5.12 describes the weather in a region during the first quarter of 2020. Download the file from the book’s web page. Fit a linear regression model with `TEMPERATURE`

as the response variable and `SUN_HOURS`

as an explanatory variable. Plot the results. Is there a connection?

You’ll return to and expand this model in the next few exercises, so make sure to save your code.

(Click here to go to the solution.)

**Exercise 8.2 **Fit a linear model to the `mtcars`

data using the formula `mpg ~ .`

. What happens? What is `~ .`

a shorthand for?

### 8.1.2 Interactions and polynomial terms

It seems plausible that there could be an *interaction* between gross horsepower and weight. We can include an interaction term by adding `hp:wt`

to the formula:

```
<- lm(mpg ~ hp + wt + hp:wt, data = mtcars)
m summary(m)
```

Alternatively, to include the main effects of `hp`

and `wt`

along with the interaction effect, we can use `hp*wt`

as a shorthand for `hp + wt + hp:wt`

to write the model formula more concisely:

```
<- lm(mpg ~ hp*wt, data = mtcars)
m summary(m)
```

It is often recommended to centre the explanatory variables in regression models, i.e. to shift them so that they all have mean 0. There are a number of benefits to this: for instance that the intercept then can be interpreted as the expected value of the response variable when all explanatory variables are equal to their means, i.e. in an average case^{55}. It can also reduce any multicollinearity in the data, particularly when including interactions or polynomial terms in the model. Finally, it can reduce problems with numerical instability that may arise due to floating point arithmetics. Note however, that there is no need to centre the response variable^{56}.

Centring the explanatory variables can be done using `scale`

:

```
# Create a new data frame, leaving the response variable mpg
# unchanged, while centring the explanatory variables:
<- data.frame(mpg = mtcars[,1],
mtcars_scaled scale(mtcars[,-1], center = TRUE,
scale = FALSE))
<- lm(mpg ~ hp*wt, data = mtcars_scaled)
m summary(m)
```

If we wish to add a polynomial term to the model, we can do so by wrapping the polynomial in `I()`

. For instance, to add a quadratic effect in the form of the square weight of a vehicle to the model, we’d use:

```
<- lm(mpg ~ hp*wt + I(wt^2), data = mtcars_scaled)
m summary(m)
```

### 8.1.3 Dummy variables

Categorical variables can be included in regression models by using *dummy variables*. A dummy variable takes the values 0 and 1, indicating that an observation either belongs to a category (1) or not (0). If the original categorical variable has more than two categories, \(c\) categories, say, the number of dummy variables included in the regression model should be \(c-1\) (with the last category corresponding to all dummy variables being 0). R does this automatically for us if we include a `factor`

variable in a regression model:

```
# Make cyl a categorical variable:
$cyl <- factor(mtcars$cyl)
mtcars
<- lm(mpg ~ hp*wt + cyl, data = mtcars)
m summary(m)
```

Note how only two categories, 6 cylinders and 8 cylinders, are shown in the summary table. The third category, 4 cylinders, corresponds to both those dummy variables being 0. Therefore, the coefficient estimates for `cyl6`

and `cyl8`

are relative to the remaining *reference category* `cyl4`

. For instance, compared to `cyl4`

cars, `cyl6`

cars have a higher fuel consumption, with their `mpg`

being \(1.26\) lower.

We can control which category is used as the reference category by setting the order of the `factor`

variable, as in Section 5.4. The first `factor`

level is always used as the reference, so if for instance we want to use `cyl6`

as our reference category, we’d do the following:

```
# Make cyl a categorical variable with cyl6 as
# reference variable:
$cyl <- factor(mtcars$cyl, levels =
mtcarsc(6, 4, 8))
<- lm(mpg ~ hp*wt + cyl, data = mtcars)
m summary(m)
```

Dummy variables are frequently used for modelling differences between different groups. Including only the dummy variable corresponds to using different intercepts for different groups. If we also include an interaction with the dummy variable, we can get different slopes for different groups. Consider the model \[E(y_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\beta_{12} x_{i1}x_{i2},\qquad i=1,\ldots,n\] where \(x_1\) is numeric and \(x_2\) is a dummy variable. Then the intercept and slope changes depending on the value of \(x_2\) as follows:

\[E(y_i)=\beta_0+\beta_1 x_{i1},\qquad \mbox{if } x_2=0,\] \[E(y_i)=(\beta_0+\beta_2)+(\beta_1+\beta_{12}) x_{i1},\qquad \mbox{if } x_2=1.\] This yields a model where the intercept and slope differs between the two groups that \(x_2\) represents.

\[\sim\]

**Exercise 8.3 **Return to the weather model from Exercise 8.1. Create a dummy variable for precipitation (zero precipitation or non-zero precipitation) and add it to your model. Also include an interaction term between the precipitation dummy and the number of sun hours. Are any of the coefficients significantly non-zero?

### 8.1.4 Model diagnostics

There are a few different ways in which we can plot the fitted model. First, we can of course make a scatterplot of the data and add a curve showing the fitted values corresponding to the different points. These can be obtained by running `predict(m)`

with our fitted model `m`

.

```
# Fit two models:
$cyl <- factor(mtcars$cyl)
mtcars<- lm(mpg ~ hp + wt, data = mtcars) # Simple model
m1 <- lm(mpg ~ hp*wt + cyl, data = mtcars) # Complex model
m2
# Create data frames with fitted values:
<- data.frame(hp = mtcars$hp, mpg_pred = predict(m1))
m1_pred <- data.frame(hp = mtcars$hp, mpg_pred = predict(m2))
m2_pred
# Plot fitted values:
library(ggplot2)
ggplot(mtcars, aes(hp, mpg)) +
geom_point() +
geom_line(data = m1_pred, aes(x = hp, y = mpg_pred),
colour = "red") +
geom_line(data = m2_pred, aes(x = hp, y = mpg_pred),
colour = "blue")
```

We could also plot the observed values against the fitted values:

```
<- nrow(mtcars)
n <- data.frame(Observed = rep(mtcars$mpg, 2),
models Fitted = c(predict(m1), predict(m2)),
Model = rep(c("Model 1", "Model 2"), c(n, n)))
ggplot(models, aes(Fitted, Observed)) +
geom_point(colour = "blue") +
facet_wrap(~ Model, nrow = 3) +
geom_abline(intercept = 0, slope = 1) +
xlab("Fitted values") + ylab("Observed values")
```

Linear models are fitted and analysed using a number of assumptions, most of which are assessed by looking at plots of the model residuals, \(y_i-\hat{y}_i\), where \(\hat{y}_i\) is the fitted value for observation \(i\). Some important assumptions are:

- The model is
*linear in the parameters*: we check this by looking for non-linear patterns in the residuals, or in the plot of observed against fitted values. *The observations are independent*: which can be difficult to assess visually. We’ll look at models that are designed to handle correlated observations in Sections 8.4 and 9.6.*Homoscedasticity*: that the random errors all have the same variance. We check this by looking for non-constant variance in the residuals. The opposite of homoscedasticity is heteroscedasticity.*Normally distributed random errors*: this assumption is important if we want to use the traditional parametric p-values, confidence intervals and prediction intervals. If we use permutation p-values or bootstrap intervals (as we will later in this chapter), we no longer need this assumption.

Additionally, residual plots can be used to find influential points that (possibly) have a large impact on the model coefficients (influence is measured using *Cook’s distance* and potential influence using *leverage*). We’ve already seen that we can use `plot(m)`

to create some diagnostic plots. To get more and better-looking plots, we can use the `autoplot`

function for `lm`

objects from the `ggfortify`

package:

```
library(ggfortify)
autoplot(m1, which = 1:6, ncol = 2, label.size = 3)
```

In each of the plots, we look for the following:

- Residuals versus fitted: look for patterns that can indicate non-linearity, e.g. that the residuals all are high in some areas and low in others. The blue line is there to aid the eye - it should ideally be relatively close to a straight line (in this case, it isn’t perfectly straight, which could indicate a mild non-linearity).
- Normal Q-Q: see if the points follow the line, which would indicate that the residuals (which we for this purpose can think of as estimates of the random errors) follow a normal distribution.
- Scale-Location: similar to the residuals versus fitted plot, this plot shows whether the residuals are evenly spread for different values of the fitted values. Look for patterns in how much the residuals vary - if they e.g. vary more for large fitted values, then that is a sign of heteroscedasticity. A horizontal blue line is a sign of homoscedasticity.
- Cook’s distance: look for points with high values. A commonly-cited rule-of-thumb (Cook & Weisberg, 1982) says that values above 1 indicate points with a high influence.
- Residuals versus leverage: look for points with a high residual and high leverage. Observations with a high residual but low leverage deviate from the fitted model but don’t affect it much. Observations with a high residual and a high leverage likely have a strong influence on the model fit, meaning that the fitted model could be quite different if these points were removed from the dataset.
- Cook’s distance versus leverage: look for observations with a high Cook’s distance and a high leverage, which are likely to have a strong influence on the model fit.

A formal test for heteroscedasticity, the Breusch-Pagan test, is available in the `car`

package as a complement to graphical inspection. A low p-value indicates statistical evidence for heteroscedasticity. To run the test, we use `ncvTest`

(where “ncv” stands for non-constant variance):

```
install.packages("car")
library(car)
ncvTest(m1)
```

A common problem in linear regression models is multicollinearity, i.e. explanatory variables that are strongly correlated. Multicollinearity can cause your \(\beta\) coefficients and p-values to change greatly if there are small changes in the data, rendering them unreliable. To check if you have multicollinearity in your data, you can create a scatterplot matrix of your explanatory variables, as in Section 4.8.1:

```
library(GGally)
ggpairs(mtcars[, -1])
```

In this case, there are some highly correlated pairs, `hp`

and `disp`

among them. As a numerical measure of collinearity, we can use the generalised variance inflation factor (GVIF), given by the `vif`

function in the `car`

package:

```
library(car)
<- lm(mpg ~ ., data = mtcars)
m vif(m)
```

A high GVIF indicates that a variable is highly correlated with other explanatory variables in the dataset. Recommendations for what a “high GVIF” is varies, from 2.5 to 10 or more.

You can mitigate problems related to multicollinearity by:

- Removing one or more of the correlated variables from the model (because they are strongly correlated, they measure almost the same thing anyway!),
- Centring your explanatory variables (particularly if you include polynomial terms),
- Using a regularised regression model (which we’ll do in Section 9.4).

\[\sim\]

**Exercise 8.4**Below are two simulated datasets. One exhibits a nonlinear dependence between the variables, and the other exhibits heteroscedasticity. Fit a model with

`y`

as the response variable and `x`

as the explanatory variable for each dataset, and make some residual plots. Which dataset suffers from which problem?
```
<- data.frame(
exdata1 x = c(2.99, 5.01, 8.84, 6.18, 8.57, 8.23, 8.48, 0.04, 6.80,
7.62, 7.94, 6.30, 4.21, 3.61, 7.08, 3.50, 9.05, 1.06,
0.65, 8.66, 0.08, 1.48, 2.96, 2.54, 4.45),
y = c(5.25, -0.80, 4.38, -0.75, 9.93, 13.79, 19.75, 24.65,
6.84, 11.95, 12.24, 7.97, -1.20, -1.76, 10.36, 1.17,
15.41, 15.83, 18.78, 12.75, 24.17, 12.49, 4.58, 6.76,
-2.92))
<- data.frame(
exdata2 x = c(5.70, 8.03, 8.86, 0.82, 1.23, 2.96, 0.13, 8.53, 8.18,
6.88, 4.02, 9.11, 0.19, 6.91, 0.34, 4.19, 0.25, 9.72,
9.83, 6.77, 4.40, 4.70, 6.03, 5.87, 7.49),
y = c(21.66, 26.23, 19.82, 2.46, 2.83, 8.86, 0.25, 16.08,
17.67, 24.86, 8.19, 28.45, 0.52, 19.88, 0.71, 12.19,
0.64, 25.29, 26.72, 18.06, 10.70, 8.27, 15.49, 15.58,
19.17))
```

(Click here to go to the solution.)

**Exercise 8.5 **We continue our investigation of the weather models from Exercises 8.1 and 8.3.

Plot the observed values against the fitted values for the two models that you’ve fitted. Does either model seem to have a better fit?

Create residual plots for the second model from Exercise 8.3. Are there any influential points? Any patterns? Any signs of heteroscedasticity?

### 8.1.5 Transformations

If your data displays signs of heteroscedasticity or non-normal residuals, you can sometimes use a Box-Cox transformation (Box & Cox, 1964) to mitigate those problems. The Box-Cox transformation is applied to your dependent variable \(y\). What it looks like is determined by a parameter \(\lambda\). The transformation is defined as \(\frac{y_i^\lambda-1}{\lambda}\) if \(\lambda\neq 0\) and \(\ln(y_i)\) if \(\lambda=0\). \(\lambda=1\) corresponds to no transformation at all. The `boxcox`

function in `MASS`

is useful for finding an appropriate choice of \(\lambda\). Choose a \(\lambda\) that is close to the peak (inside the interval indicated by the outer dotted lines) of the curve plotted by `boxcox`

:

```
<- lm(mpg ~ hp + wt, data = mtcars)
m
library(MASS)
boxcox(m)
```

In this case, the curve indicates that \(\lambda=0\), which corresponds to a log-transformation, could be a good choice. Let’s give it a go:

```
$logmpg <- log(mtcars$mpg)
mtcars<- lm(logmpg ~ hp + wt, data = mtcars)
m_bc summary(m_bc)
library(ggfortify)
autoplot(m_bc, which = 1:6, ncol = 2, label.size = 3)
```

The model fit seems to have improved after the transformation. The downside is that we now are modelling the log-mpg rather than mpg, which make the model coefficients a little difficult to interpret.

\[\sim\]

**Exercise 8.6 **Run `boxcox`

with your model from Exercise 8.3. Does it indicate that a transformation can be useful for your model?

### 8.1.6 Alternatives to `lm`

Non-normal regression errors can sometimes be an indication that you need to transform your data, that your model is missing an important explanatory variable, that there are interaction effects that aren’t accounted for, or that the relationship between the variables is non-linear. But sometimes, you get non-normal errors simply because the errors are non-normal.

The p-values reported by `summary`

are computed under the assumption of normally distributed regression errors, and can be sensitive to deviations from normality. An alternative is to use the `lmp`

function from the `lmPerm`

package, which provides permutation test p-values instead. This doesn’t affect the model fitting in any way - the only difference is how the p-values are computed. Moreover, the syntax for `lmp`

is identical to that of `lm`

:

```
# First, install lmPerm:
install.packages("lmPerm")
# Get summary table with permutation p-values:
library(lmPerm)
<- lmp(mpg ~ hp + wt, data = mtcars)
m summary(m)
```

In some cases, you need to change the arguments of `lmp`

to get reliable p-values. We’ll have a look at that in Exercise 8.12. Relatedly, in Section 8.1.7 we’ll see how to construct bootstrap confidence intervals for the parameter estimates.

Another option that does affect the model fitting is to use a *robust* regression model based on M-estimators. Such models tend to be less sensitive to outliers, and can be useful if you are concerned about the influence of deviating points. The `rlm`

function in `MASS`

is used for this. As was the case for `lmp`

, the syntax for `rlm`

is identical to that of `lm`

:

```
library(MASS)
<- rlm(mpg ~ hp + wt, data = mtcars)
m summary(m)
```

Another option is to use Bayesian estimation, which we’ll discuss in Section 8.1.13.

\[\sim\]

**Exercise 8.7 **Refit your model from Exercise 8.3 using `lmp`

. Are the two main effects still significant?

### 8.1.7 Bootstrap confidence intervals for regression coefficients

Assuming normality, we can obtain parametric confidence intervals for the model coefficients using `confint`

:

```
<- lm(mpg ~ hp + wt, data = mtcars)
m
confint(m)
```

I usually prefer to use bootstrap confidence intervals, which we can obtain using `boot`

and `boot.ci`

, as we’ll do next. Note that the only random part in the linear model
\[y_i=\beta_0 +\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip} + \epsilon_i,\qquad i=1,\ldots,n\]
is the error term \(\epsilon_i\). In most cases, it is therefore this term (and this term only) that we wish to resample. The explanatory variables should remain constant throughout the resampling process; the inference is conditioned on the values of the explanatory variables.

To achieve this, we’ll resample from the model residuals, and add those to the values predicted by the fitted function, which creates new bootstrap values of the response variable. We’ll then fit a linear model to these values, from which we obtain observations from the bootstrap distribution of the model coefficients.

It turns out that the bootstrap performs better if we resample not from the original residuals \(e_1,\ldots,e_n\), but from scaled and centred residuals \(r_i-\bar{r}\), where each \(r_i\) is a scaled version of residual \(e_i\), scaled by the leverage \(h_i\):

\[r_i=\frac{e_i}{\sqrt{1-h_i}},\]
see Chapter 6 of Davison & Hinkley (1997) for details. The leverages can be computed using `lm.influence`

.

We implement this procedure in the code below (and will then have a look at convenience functions that help us achieve the same thing more easily). It makes use of `formula`

, which can be used to extract the model formula from regression models:

```
library(boot)
<- function(formula, data, i, predictions, residuals) {
coefficients # Create the bootstrap value of response variable by
# adding a randomly drawn scaled residual to the value of
# the fitted function for each observation:
all.vars(formula)[1]] <- predictions + residuals[i]
data[,
# Fit a new model with the bootstrap value of the response
# variable and the original explanatory variables:
<- lm(formula, data = data)
m return(coef(m))
}
# Fit the linear model:
<- lm(mpg ~ hp + wt, data = mtcars)
m
# Compute scaled and centred residuals:
<- residuals(m)/sqrt(1 - lm.influence(m)$hat)
res <- res - mean(res)
res
# Run the bootstrap, extracting the model formula and the
# fitted function from the model m:
<- boot(data = mtcars, statistic = coefficients,
boot_res R = 999, formula = formula(m),
predictions = predict(m),
residuals = res)
# Compute 95 % confidence intervals:
boot.ci(boot_res, type = "perc", index = 1) # Intercept
boot.ci(boot_res, type = "perc", index = 2) # hp
boot.ci(boot_res, type = "perc", index = 3) # wt
```

The argument `index`

in `boot.ci`

should be the row number of the parameter in the table given by `summary`

. The intercept is on the first row, and so its `index`

is 1, `hp`

is on the second row and its `index`

is 2, and so on.

Clearly, the above code is a little unwieldy. Fortunately, the `car`

package contains a function called `Boot`

that can be used to bootstrap regression models in the exact same way:

```
library(car)
<- Boot(m, method = "residual", R = 9999)
boot_res
# Compute 95 % confidence intervals:
confint(boot_res, type = "perc")
```

Finally, the most convenient approach is to use `boot_summary`

from the `boot.pval`

package. It provides a data frame with estimates, bootstrap confidence intervals, and bootstrap p-values (computed using interval inversion) for the model coefficients. The arguments specify what interval type and resampling strategy to use (more on the latter in Exercise 8.9):

```
library(boot.pval)
boot_summary(m, type = "perc", method = "residual", R = 9999)
```

\[\sim\]

**Exercise 8.8 **Refit your model from Exercise 8.3 using a robust regression estimator with `rlm`

. Compute confidence intervals for the coefficients of the robust regression model.

(Click here to go to the solution.)

**Exercise 8.9 **In an alternative bootstrap scheme for regression models, often referred to as *case resampling*, the observations (or cases) \((y_i, x_{i1},\ldots,x_{ip})\) are resampled instead of the residuals. This approach can be applied when the explanatory variables can be treated as being random (but measured without error) rather than fixed. It can also be useful for models with heteroscedasticity, as it doesn’t rely on assumptions about constant variance (which, on the other hand, makes it less efficient if the errors actually are homoscedastic).

Read the documentation for `boot_summary`

to see how you can compute confidence intervals for the coefficients in the model `m <- lm(mpg ~ hp + wt, data = mtcars)`

using case resampling. Do they differ substantially from those obtained using residual resampling in this case?

### 8.1.8 Alternative summaries with `broom`

The `broom`

package contains some useful functions when working with linear models (and many other common models), which allow us to get various summaries of the model fit in useful formats. Let’s install it:

`install.packages("broom")`

A model fitted with `m`

is stored as a `list`

with lots of elements:

```
<- lm(mpg ~ hp + wt, data = mtcars)
m str(m)
```

How can we access the information about the model? For instance, we may want to get the summary table from `summary`

, but as a data frame rather than as printed text. Here are two ways of doing this, using `summary`

and the `tidy`

function from `broom`

:

```
# Using base R:
summary(m)$coefficients
# Using broom:
library(broom)
tidy(m)
```

`tidy`

is the better option if you want to retrieve the table as part of a pipeline. For instance, if you want to adjust the p-values for multiplicity using Bonferroni correction (Section 7.2.5), you could do as follows:

```
library(magrittr)
%>%
mtcars lm(mpg ~ hp + wt, data = .) %>%
tidy() %$%
p.adjust(p.value, method = "bonferroni")
```

If you prefer bootstrap p-values, you can use `boot_summary`

from `boot.pval`

similarly. That function also includes an argument for adjusting the p-values for multiplicity:

```
library(boot.pval)
lm(mpg ~ hp + wt, data = mtcars) %>%
boot_summary(adjust.method = "bonferroni")
```

Another useful function in `broom`

is `glance`

, which lets us get some summary statistics about the model:

`glance(m)`

Finally, `augment`

can be used to add predicted values, residuals, and Cook’s distances to the dataset used for fitting the model, which of course can be very useful for model diagnostics:

```
# To get the data frame with predictions and residuals added:
augment(m)
# To plot the observed values against the fitted values:
library(ggplot2)
%>%
mtcars lm(mpg ~ hp + wt, data = .) %>%
augment() %>%
ggplot(aes(.fitted, mpg)) +
geom_point() +
xlab("Fitted values") + ylab("Observed values")
```

### 8.1.9 Variable selection

A common question when working with linear models is what variables to include in your model. Common practices for variable selection include stepwise regression methods, where variables are added to or removed from the model depending on p-values, \(R^2\) values, or information criteria like AIC or BIC.

**Don’t ever do this if your main interest is p-values.** Stepwise regression increases the risk of type I errors, renders the p-values of your final model invalid, and can lead to over-fitting; see e.g. Smith (2018). Instead, you should let your research hypothesis guide your choice of variables, or base your choice on a pilot study.

If your main interest is prediction, then that is a completely different story. For predictive models, it is usually recommended that variable selection and model fitting should be done simultaneously. This can be done using regularised regression models, to which Section 9.4 is devoted.

### 8.1.10 Prediction

An important use of linear models is prediction. In R, this is done using `predict`

. By providing a fitted model and a new dataset, we can get predictions.

Let’s use one of the models that we fitted to the `mtcars`

data to make predictions for two cars that aren’t from the 1970’s. Below, we create a data frame with data for a 2009 Volvo XC90 D3 AWD (with a fuel consumption of 29 mpg) and a 2019 Ferrari Roma (15.4 mpg):

```
<- data.frame(hp = c(161, 612), wt = c(4.473, 3.462),
new_cars row.names = c("Volvo XC90", "Ferrari Roma"))
```

To get the model predictions for these new cars, we run the following:

`predict(m, new_cars)`

`predict`

also lets us obtain prediction intervals for our prediction, under the assumption of normality^{57}. To get 90 % prediction intervals, we add `interval = "prediction"`

and `level = 0.9`

:

```
<- lm(mpg ~ hp + wt, data = mtcars)
m predict(m, new_cars,
interval = "prediction",
level = 0.9)
```

If we were using a transformed \(y\)-variable, we’d probably have to transform the predictions back to the original scale for them to be useful:

```
$logmpg <- log(mtcars$mpg)
mtcars<- lm(logmpg ~ hp + wt, data = mtcars)
m_bc
<- predict(m_bc, new_cars,
preds interval = "prediction",
level = 0.9)
# Predictions for log-mpg:
preds# Transform back to original scale:
exp(preds)
```

The `lmp`

function that we used to compute permutation p-values does not offer confidence intervals. We can however compute bootstrap prediction intervals using the code below. Prediction intervals try to capture two sources of uncertainty:

*Model uncertainty*, which we will capture by resampling the data and make predictions for the expected value of the observation,*Random noise*, i.e. that almost all observations deviate from their expected value. We will capture this by resampling residuals from the fitted bootstrap models.

Consequently, the value that we generate in each bootstrap replication will be the sum of a prediction and a resampled residual (see Davison & Hinkley (1997), Section 6.3, for further details):

```
<- function(data, new_data, model, i,
boot_pred
formula, predictions, residuals){# Resample residuals and fit new model:
all.vars(formula)[1]] <- predictions + residuals[i]
data[,<- lm(formula, data = data)
m_boot
# We use predict to get an estimate of the
# expectation of new observations, and then
# add resampled residuals to also include the
# natural variation around the expectation:
predict(m_boot, newdata = new_data) +
sample(residuals, nrow(new_data))
}
library(boot)
<- lm(mpg ~ hp + wt, data = mtcars)
m
# Compute scaled and centred residuals:
<- residuals(m)/sqrt(1 - lm.influence(m)$hat)
res <- res - mean(res)
res
<- boot(data = m$model,
boot_res statistic = boot_pred,
R = 999,
model = m,
new_data = new_cars,
formula = formula(m),
predictions = predict(m),
residuals = res)
# 90 % bootstrap prediction intervals:
boot.ci(boot_res, type = "perc", index = 1, conf = 0.9) # Volvo
boot.ci(boot_res, type = "perc", index = 2, conf = 0.9) # Ferrari
```

\[\sim\]

**Exercise 8.10 **Use your model from Exercise 8.3 to compute a bootstrap prediction interval for the temperature on a day with precipitation but no sun hours.

### 8.1.11 Prediction for multiple datasets

In certain cases, we wish to fit different models to different subsets of the data. Functionals like `apply`

and `map`

(Section 6.5) are handy when you want to fit several models at once. Below is an example of how we can use `split`

(Section 5.2.1) and tools from the `purrr`

package (Section 6.5.3) to fit the models simultaneously, as well as for computing the fitted values in a single line of code:

```
# Split the dataset into three groups depending on the
# number of cylinders:
library(magrittr)
<- mtcars %>% split(.$cyl)
mtcars_by_cyl
# Fit a linear model to each subgroup:
library(purrr)
<- mtcars_by_cyl %>% map(~ lm(mpg ~ hp + wt, data = .))
models
# Compute the fitted values for each model:
map2(models, mtcars_by_cyl, predict)
```

We’ll make use of this approach when we study linear mixed models in Section 8.4.

### 8.1.12 ANOVA

Linear models are also used for analysis of variance (*ANOVA*) models to test whether there are differences among the means of different groups. We’ll use the `mtcars`

data to give some examples of this. Let’s say that we want to investigate whether the mean fuel consumption (`mpg`

) of cars differs depending on the number of cylinders (`cyl`

), and that we want to include the type of transmission (`am`

) as a blocking variable.

To get an ANOVA table for this problem, we must first convert the explanatory variables to `factor`

variables, as the variables in `mtcars`

all `numeric`

(despite some of them being categorical). We can then use `aov`

to fit the model, and then `summary`

:

```
# Convert variables to factors:
$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am)
mtcars
# Fit model and print ANOVA table:
<- aov(mpg ~ cyl + am, data = mtcars)
m summary(m)
```

(`aov`

actually uses `lm`

to fit the model, but by using `aov`

we specify that we want an ANOVA table to be printed by `summary`

.)

When there are different numbers of observations in the groups in an ANOVA, so that we have an unbalanced design, the sums of squares used to compute the test statistics can be computed in at least three different ways, commonly called type I, II and III. See Herr (1986) for an overview and discussion of this.

`summary`

prints a type I ANOVA table, which isn’t the best choice for unbalanced designs. We can however get type II or III tables by instead using `Anova`

from the `car`

package to print the table:

```
library(car)
Anova(m, type = "II")
Anova(m, type = "III") # Default in SAS and SPSS.
```

As a guideline, for unbalanced designs, you should use type II tables if there are no interactions, and type III tables if there are interactions. To look for interactions, we can use `interaction.plot`

to create a two-way interaction plot:

`interaction.plot(mtcars$am, mtcars$cyl, response = mtcars$mpg)`

In this case, there is no sign of an interaction between the two variables, as the lines are more or less parallel. A type II table is therefore probably the best choice here.

We can obtain diagnostic plots the same way we did for other linear models:

```
library(ggfortify)
autoplot(m, which = 1:6, ncol = 2, label.size = 3)
```

To find which groups that have significantly different means, we can use a post hoc test like Tukey’s HSD, available through the `TukeyHSD`

function:

`TukeyHSD(m)`

We can visualise the results of Tukey’s HSD with `plot`

, which shows 95 % confidence intervals for the mean differences:

```
# When the difference isn't significant, the dashed line indicating
# "no differences" falls within the confidence interval for
# the difference:
plot(TukeyHSD(m, "am"))
# When the difference is significant, the dashed line does not
# fall within the confidence interval:
plot(TukeyHSD(m, "cyl"))
```

\[\sim\]

**Exercise 8.11 **Return to the residual plots that you created with `autoplot`

. Figure out how you can plot points belonging to different `cyl`

groups in different colours.

(Click here to go to the solution.)

**Exercise 8.12 **The `aovp`

function in the `lmPerm`

package can be utilised to perform permutation tests instead of the classical parametric ANOVA tests. Rerun the analysis in the example above, using `aovp`

instead. Do the conclusions change? What happens if you run your code multiple times? Does using `summary`

on a model fitted using `aovp`

generate a type I, II or III table by default? Can you change what type of table it produces?

(Click here to go to the solution.)

**Exercise 8.13 **In the case of a one-way ANOVA (i.e. ANOVA with a single explanatory variable), the Kruskal-Wallis test can be used as a nonparametric option. It is available in `kruskal.test`

. Use the Kruskal-Wallis test to run a one-way ANOVA for the `mtcars`

data, with `mpg`

as the response variable and `cyl`

as an explanatory variable.

### 8.1.13 Bayesian estimation of linear models

We can fit Bayesian linear models using the `rstanarm`

package. To fit a model to the `mtcars`

data using all explanatory variables, we can use `stan_glm`

in place of `lm`

as follows:

```
library(rstanarm)
<- stan_glm(mpg ~ ., data = mtcars)
m
# Print the estimates:
coef(m)
```

Next, we can plot the posterior distributions of the effects:

`plot(m, "dens", pars = names(coef(m)))`

To get 95 % credible intervals for the effects, we can use `posterior_interval`

:

```
posterior_interval(m,
pars = names(coef(m)),
prob = 0.95)
```

We can also plot them using `plot`

:

```
plot(m, "intervals",
pars = names(coef(m)),
prob = 0.95)
```

Finally, we can use \(\hat{R}\) to check model convergence. It should be less than 1.1 if the fitting has converged:

`plot(m, "rhat")`

Like for `lm`

, `residuals(m)`

provides the model residuals, which can be used for diagnostics. For instance, we can plot the residuals against the fitted values to look for signs of non-linearity, adding a curve to aid the eye:

```
<- data.frame(Fitted = predict(m),
model_diag Residual = residuals(m))
library(ggplot2)
ggplot(model_diag, aes(Fitted, Residual)) +
geom_point() +
geom_smooth(se = FALSE)
```

For fitting ANOVA models, we can instead use `stan_aov`

with the argument `prior = R2(location = 0.5)`

to fit the model.

## 8.2 Ethical issues in regression modelling

The p-hacking problem, discussed in Section 7.4, is perhaps particularly prevalent in regression modelling. Regression analysis often involves a large number of explanatory variables, and practitioners often try out several different models (e.g. by performing stepwise variable selection; see Section 8.1.9). Because so many hypotheses are tested, often in many different but similar models, there is a large risk of false discoveries.

In any regression analysis, there is a risk of finding *spurious relationships*. These are dependencies between the response variable and an explanatory variable that either are non-causal or are purely coincidental. As an example of the former, consider the number of deaths by drowning, which is strongly correlated with ice cream sales. Not because ice cream cause people to drown, but because both are affected by the weather: we are more likely to go swimming or buy ice cream on hot days. Lurking variables, like the temperature in the ice cream-drowning example, are commonly referred to as *confounding factors*. An effect may be statistically significant, but that does not necessarily mean that it is meaningful.

\[\sim\]

**Exercise 8.14**

*Discuss the following.*You are tasked with analysing a study on whether Vitamin D protects against the flu. One group of patients are given Vitamin D supplements, and one group is given a placebo. You plan on fitting a regression model to estimate the effect of the vitamin supplements, but note that some confounding factors that you have reason to believe are of importance, such as age and ethnicity, are missing from the data. You can therefore not include them as explanatory variables in the model. Should you still fit the model?

**Exercise 8.15**

*Discuss the following.*You are fitting a linear regression model to a dataset from a medical study on a new drug which potentially can have serious side effects. The test subjects take a risk by participating in the study. Each observation in the dataset corresponds to a test subject. Like all ordinary linear regression models, your model gives more weight to observations that deviate from the average (and have a high leverage or Cook’s distance). Given the risks involved for the test subjects, is it fair to give different weight to data from different individuals? Is it OK to remove outliers because they influence the results too much, meaning that the risk that the subject took was for nought?

## 8.3 Generalised linear models

Generalised linear models, abbreviated GLM, are (yes) a generalisation of the linear model, that can be used when your response variable has a non-normal error distribution. Typical examples are when your response variable is binary (only takes two values, e.g. 0 or 1), or a count of something. Fitting GLM’s is more or less entirely analogous to fitting linear models in R, but model diagnostics are very different. In this section we will look at some examples of how it can be done.

### 8.3.1 Modelling proportions: Logistic regression

As the first example of binary data, we will consider the wine quality dataset `wine`

from Cortez et al. (2009), which is available in the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Wine+Quality. It contains measurements on white and red vinho verde wine samples from northern Portugal.

We start by loading the data. It is divided into two separate `.csv`

files, one for white wines and one for red, which we have to merge:

```
# Import data about white and red wines:
<- read.csv("https://tinyurl.com/winedata1",
white sep = ";")
<- read.csv("https://tinyurl.com/winedata2",
red sep = ";")
# Add a type variable:
$type <- "white"
white$type <- "red"
red
# Merge the datasets:
<- rbind(white, red)
wine $type <- factor(wine$type)
wine
# Check the result:
summary(wine)
```

We are interested in seeing if measurements like pH (`pH`

) and alcohol content (`alcohol`

) can be used to determine the colours of the wine. The colour is represented by the `type`

variable, which is binary.

Our model is that the `type`

of a randomly selected wine is binomial \(Bin(1, \pi_i)\)-distributed (Bernoulli distributed), where \(\pi_i\) depends on explanatory variables like pH and alcohol content. A common model for this situation is a *logistic regression model*. Given \(n\) observations of \(p\) explanatory variables, the model is:

\[\log\Big(\frac{\pi_i}{1-\pi_i}\Big)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip},\qquad i=1,\ldots,n\]
Where we in linear regression models model the expected value of the response variable as a linear function of the explanatory variables, we now model the expected value of a *function of the expected value of the response variable* (that is, a function of \(\pi_i\)). In GLM terminology, this function is known as a *link function*.

Logistic regression models can be fitted using the `glm`

function. To specify what our model is, we use the argument `family = binomial`

:

```
<- glm(type ~ pH + alcohol, data = wine, family = binomial)
m summary(m)
```

The p-values presented in the summary table are based on a Wald test known to have poor performance unless the sample size is very large (Agresti, 2013). In this case, with a sample size of 6,497, it is probably safe to use, but for smaller sample sizes, it is preferable to use a bootstrap test instead, which you will do in Exercise 8.18.

The coefficients of a logistic regression model aren’t as straightforward to interpret as those in a linear model. If we let \(\beta\) denote a coefficient corresponding to an explanatory variable \(x\), then:

- If \(\beta\) is positive, then \(\pi_i\) increases when \(x_i\) increases.
- If \(\beta\) is negative, then \(\pi_i\) decreases when \(x_i\) increases.
- \(e^\beta\) is the
*odds ratio*, which shows how much the odds \(\frac{\pi_i}{1-\pi_1}\) change when \(x_i\) is increased 1 step.

We can extract the coefficients and odds ratios using `coef`

:

```
coef(m) # Coefficients, beta
exp(coef(m)) # Odds ratios
```

To find the fitted probability that an observation belongs to the second class we can use `predict(m, type = "response")`

:

```
# Check which class is the second one:
levels(wine$type)
# "white" is the second class!
# Get fitted probabilities:
<- predict(m, type = "response")
probs
# Check what the average prediction is for
# the two groups:
mean(probs[wine$type == "red"])
mean(probs[wine$type == "white"])
```

It turns out that the model predicts that most wines are white - even the red ones! The reason may be that we have more white wines (4,898) than red wines (1,599) in the dataset. Adding more explanatory variables could perhaps solve this problem. We’ll give that a try in the next section.

\[\sim\]

**Exercise 8.16 **Download `sharks.csv`

file from the book’s web page. It contains information about shark attacks in South Africa. Using data on attacks that occurred in 2000 or later, fit a logistic regression model to investigate whether the age and sex of the individual that was attacked affect the probability of the attack being fatal.

Note: save the code for your model, as you will return to it in the subsequent exercises.

(Click here to go to the solution.)

**Exercise 8.17 **In Section 8.1.8 we saw how some functions from the `broom`

package could be used to get summaries of linear models. Try using them with the `wine`

data model that we created above. Do the `broom`

functions work for generalised linear models as well?

### 8.3.2 Bootstrap confidence intervals

In a logistic regression, the response variable \(y_i\) is a binomial (or Bernoulli) random variable with success probability \(\pi_i\). In this case, we don’t want to resample residuals to create confidence intervals, as it turns out that this can lead to predicted probabilities outside the range \((0,1)\). Instead, we can either use the case resampling strategy described in Exercise 8.9 or use a parametric bootstrap approach where we generate new binomial variables (Section 7.1.2) to construct bootstrap confidence intervals.

To use case resampling, we can use `boot_summary`

from `boot.pval`

:

```
library(boot.pval)
<- glm(type ~ pH + alcohol, data = wine, family = binomial)
m
boot_summary(m, type = "perc", method = "case")
```

In the parametric approach, for each observation, the fitted success probability from the logistic model will be used to sample new observations of the response variable. This method can work well if the model is well-specified but tends to perform poorly for misspecified models, so make sure to carefully perform model diagnostics (as described in the next section) before applying it. To use the parametric approach, we can do as follows:

```
library(boot)
<- function(formula, data, predictions, ...) {
coefficients # Check whether the response variable is a factor or
# numeric, and then resample:
if(is.factor(data[,all.vars(formula)[1]])) {
# If the response variable is a factor:
all.vars(formula)[1]] <-
data[,factor(levels(data[,all.vars(formula)[1]])[1 + rbinom(nrow(data),
1, predictions)]) } else {
# If the response variable is numeric:
all.vars(formula)[1]] <-
data[,unique(data[,all.vars(formula)[1]])[1 + rbinom(nrow(data),
1, predictions)] }
<- glm(formula, data = data, family = binomial)
m return(coef(m))
}
<- glm(type ~ pH + alcohol, data = wine, family = binomial)
m
<- boot(data = wine, statistic = coefficients,
boot_res R = 999,
formula = formula(m),
predictions = predict(m, type = "response"))
# Compute confidence intervals:
boot.ci(boot_res, type = "perc", index = 1) # Intercept
boot.ci(boot_res, type = "perc", index = 2) # pH
boot.ci(boot_res, type = "perc", index = 3) # Alcohol
```

\[\sim\]

**Exercise 8.18 **Use the model that you fitted to the `sharks.csv`

data in Exercise 8.16 for the following:

When the

`MASS`

package is loaded, you can use`confint`

to obtain (asymptotic) confidence intervals for the parameters of a GLM. Use it to compute confidence intervals for the parameters of your model for the`sharks.csv`

data.Compute parametric bootstrap confidence intervals and p-values for the parameters of your logistic regression model for the

`sharks.csv`

data. Do they differ from the intervals obtained using`confint`

? Note that there are a lot of missing values for the response variable. Think about how that will affect your bootstrap intervals and adjust your code accordingly.Use the confidence interval inversion method of Section 7.7.3 to compute bootstrap p-values for the effect of age.

### 8.3.3 Model diagnostics

It is notoriously difficult to assess model fit for GLM’s, because the behaviour of the residuals is very different from residuals in ordinary linear models. In the case of logistic regression, the response variable is always 0 or 1, meaning that there will be two bands of residuals:

```
# Store deviance residuals:
<- glm(type ~ pH + alcohol, data = wine, family = binomial)
m <- data.frame(Predicted <- predict(m),
res <- residuals(m, type ="deviance"),
Residuals Index = 1:nrow(m$data),
CooksDistance = cooks.distance(m))
# Plot fitted values against the deviance residuals:
library(ggplot2)
ggplot(res, aes(Predicted, Residuals)) +
geom_point()
# Plot index against the deviance residuals:
ggplot(res, aes(Index, Residuals)) +
geom_point()
```

Plots of raw residuals are of little use in logistic regression models. A better option is to use a binned residual plot, in which the observations are grouped into bins based on their fitted value. The average residual in each bin can then be computed, which will tell us if which parts of the model have a poor fit. A function for this is available in the `arm`

package:

```
install.packages("arm")
library(arm)
binnedplot(predict(m, type = "response"),
residuals(m, type = "response"))
```

The grey lines show confidence bounds which are supposed to contain about 95 % of the bins. If too many points fall outside these bounds, it’s a sign that we have a poor model fit. In this case, there are a few points outside the bounds. Most notably, the average residuals are fairly large for the observations with the lowest fitted values, i.e. among the observations with the lowest predicted probability of being white wines.

Let’s compare the above plot to that for a model with more explanatory variables:

```
<- glm(type ~ pH + alcohol + fixed.acidity + residual.sugar,
m2 data = wine, family = binomial)
binnedplot(predict(m2, type = "response"),
residuals(m2, type = "response"))
```

This looks much better - adding more explanatory variable appears to have improved the model fit.

It’s worth repeating that if your main interest is hypothesis testing, you shouldn’t fit multiple models and then pick the one that gives the best results. However, if you’re doing an exploratory analysis or are interested in predictive modelling, you can and should try different models. It can then be useful to do a formal hypothesis test of the null hypothesis that `m`

and `m2`

fit the data equally well, against the alternative that `m2`

has a better fit. If both fit the data equally well, we’d prefer `m`

, since it is a simpler model. We can use `anova`

to perform a likelihood ratio *deviance test* (see Section 12.4 for details), which tests this:

`anova(m, m2, test = "LRT")`

The p-value is very low, and we conclude that `m2`

has a better model fit.

Another useful function is `cooks.distance`

, which can be used to compute the Cook’s distance for each observation, which is useful for finding influential observations. In this case, I’ve chosen to print the row numbers for the observations with a Cook’s distance greater than 0.004 - this number has been arbitrarily chosen in order only to highlight the observations with the highest Cook’s distance.

```
<- data.frame(Index = 1:length(cooks.distance(m)),
res CooksDistance = cooks.distance(m))
# Plot index against the Cook's distance to find
# influential points:
ggplot(res, aes(Index, CooksDistance)) +
geom_point() +
geom_text(aes(label = ifelse(CooksDistance > 0.004,
rownames(res), "")),
hjust = 1.1)
```

\[\sim\]

**Exercise 8.19 **Investigate the residuals for your `sharks.csv`

model. Are there any problems with the model fit? Any influential points?

### 8.3.4 Prediction

Just as for linear models, we can use `predict`

to make predictions for new observations using a GLM. To begin with, let’s randomly sample 10 rows from the `wine`

data and fit a model using all data except those ten observations:

```
# Randomly select 10 rows from the wine data:
<- sample(1:nrow(wine), 10)
rows
<- glm(type ~ pH + alcohol, data = wine[-rows,], family = binomial) m
```

We can now use `predict`

to make predictions for the ten observations:

```
<- predict(m, wine[rows,])
preds preds
```

Those predictions look a bit strange though - what are they? By default, `predict`

returns predictions on the scale of the link function. That’s not really what we want in most cases - instead, we are interested in the predicted probabilities. To get those, we have to add the argument `type = "response"`

to the call:

```
<- predict(m, wine[rows,], type = "response")
preds preds
```

Logistic regression models are often used for prediction, in what is known as classification. Section 9.1.7 is concerned with how to evaluate the predictive performance of logistic regression and other classification models.

### 8.3.5 Modelling count data

Logistic regression is but one of many types of GLM’s used in practice. One important example is Cox regression, which is used for survival data. We’ll return to that model in Section 8.5. For now, we’ll consider count data instead. Let’s have a look at the shark attack data in `sharks.csv`

, available on the book’s website. It contains data about shark attacks in South Africa, downloaded from The Global Shark Attack File (http://www.sharkattackfile.net/incidentlog.htm). To load it, we download the file and set `file_path`

to the path of `sharks.csv`

:

```
<- read.csv(file_path, sep =";")
sharks
# Compute number of attacks per year:
<- aggregate(Type ~ Year, data = sharks, FUN = length)
attacks
# Keep data for 1960-2019:
<- subset(attacks, Year >= 1960) attacks
```

The number of attacks in a year is not binary but a count that, in principle, can take any non-negative integer as its value. Are there any trends over time for the number of reported attacks?

```
# Plot data from 1960-2019:
library(ggplot2)
ggplot(attacks, aes(Year, Type)) +
geom_point() +
ylab("Number of attacks")
```

No trend is evident. To confirm this, let’s fit a regression model with `Type`

(the number of attacks) as the response variable and `Year`

as an explanatory variable. For count data like this, a good first model to use is *Poisson regression*. Let \(\mu_i\) denote the expected value of the response variable given the explanatory variables. Given \(n\) observations of \(p\) explanatory variables, the Poisson regression model is:

\[\log(\mu_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip},\qquad i=1,\ldots,n\]
To fit it, we use `glm`

as before, but this time with `family = poisson`

:

```
<- glm(Type ~ Year, data = attacks, family = poisson)
m summary(m)
```

We can add the curve corresponding to the fitted model to our scatterplot as follows:

```
<- data.frame(Year = attacks$Year, at_pred =
attacks_pred predict(m, type = "response"))
ggplot(attacks, aes(Year, Type)) +
geom_point() +
ylab("Number of attacks") +
geom_line(data = attacks_pred, aes(x = Year, y = at_pred),
colour = "red")
```

The fitted model seems to confirm our view that there is no trend over time in the number of attacks.

For model diagnostics, we can use a binned residual plot and a plot of Cook’s distance to find influential points:

```
# Binned residual plot:
library(arm)
binnedplot(predict(m, type = "response"),
residuals(m, type = "response"))
# Plot index against the Cook's distance to find
# influential points:
<- data.frame(Index = 1:nrow(m$data),
res CooksDistance = cooks.distance(m))
ggplot(res, aes(Index, CooksDistance)) +
geom_point() +
geom_text(aes(label = ifelse(CooksDistance > 0.1,
rownames(res), "")),
hjust = 1.1)
```

A common problem in Poisson regression models is excess zeros, i.e. more observations with value 0 than what is predicted by the model. To check the distribution of counts in the data, we can draw a histogram:

```
ggplot(attacks, aes(Type)) +
geom_histogram(binwidth = 1, colour = "black")
```

If there are a lot of zeroes in the data, we should consider using another model, such as a hurdle model or a zero-inflated Poisson regression. Both of these are available in the `pscl`

package.

Another common problem is *overdispersion*, which occurs when there is more variability in the data than what is predicted by the GLM. A formal test of overdispersion (Cameron & Trivedi, 1990) is provided by `dispersiontest`

in the `AER`

package. The null hypothesis is that there is no overdispersion, and the alternative that there is overdispersion:

```
install.packages("AER")
library(AER)
dispersiontest(m, trafo = 1)
```

There are several alternative models that can be considered in the case of overdispersion. One of them is *negative binomial regression*, which uses the same link function as Poisson regression. We can fit it using the `glm.nb`

function from `MASS`

:

```
library(MASS)
<- glm.nb(Type ~ Year, data = attacks)
m_nb
summary(m_nb)
```

For the shark attack data, the predictions from the two models are virtually identical, meaning that both are equally applicable in this case:

```
<- data.frame(Year = attacks$Year, at_pred =
attacks_pred predict(m, type = "response"))
<- data.frame(Year = attacks$Year, at_pred =
attacks_pred_nb predict(m_nb, type = "response"))
ggplot(attacks, aes(Year, Type)) +
geom_point() +
ylab("Number of attacks") +
geom_line(data = attacks_pred, aes(x = Year, y = at_pred),
colour = "red") +
geom_line(data = attacks_pred_nb, aes(x = Year, y = at_pred),
colour = "blue", linetype = "dashed")
```

Finally, we can obtain bootstrap confidence intervals e.g. using case resampling, using `boot_summary`

:

```
library(boot.pval)
boot_summary(m_nb, type = "perc", method = "case")
```

\[\sim\]

**Exercise 8.20 **The `quakes`

dataset, available in base R, contains information about seismic events off Fiji. Fit a Poisson regression model with `stations`

as the response variable and `mag`

as an explanatory variable. Are there signs of overdispersion? Does using a negative binomial model improve the model fit?

### 8.3.6 Modelling rates

Poisson regression models, and related models like negative binomial regression, can not only be used to model count data. They can also be used to model *rate data*, such as the number of cases per capita or the number of cases per unit area. In that case, we need to include an *exposure* variable \(N\) that describes e.g. the population size or area corresponding to each observation. The model will be that:

\[\log(\mu_i/N_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip},\qquad i=1,\ldots,n.\] Because \(\log(\mu_i/N_i)=\log(\mu_i)-\log(N_i)\), this can be rewritten as:

\[\log(\mu_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip}+\log(N_i),\qquad i=1,\ldots,n.\]
In other words, we should include \(\log(N_i)\) on the right-hand side of our model, *with a known coefficient* equal to 1. In regression, such a term is known as an *offset*. We can add it to our model using the `offset`

function.

As an example, we’ll consider the `ships`

data from the `MASS`

package. It describes the number of damage incidents for different ship types operating in the 1960’s and 1970’s, and includes information about how many months each ship type was in service (i.e. each ship type’s *exposure*):

```
library(MASS)
?shipsView(ships)
```

For our example, we’ll use ship `type`

as the explanatory variable, `incidents`

as the response variable and `service`

as the exposure variable. First, we remove observations with 0 exposure (by definition, these can’t be involved in incidents, and so there is no point in including them in the analysis). Then, we fit the model using `glm`

and `offset`

:

```
<- ships[ships$service != 0,]
ships
<- glm(incidents ~ type + offset(log(service)),
m data = ships,
family = poisson)
summary(m)
```

Model diagnostics can be performed as in the previous sections.

Rate models are usually interpreted in terms of the rate ratios \(e^{\beta_j}\), which describe the multiplicative increases of the intensity of rates when \(x_j\) is increased by one unit. To compute the rate ratios for our model, we use `exp`

:

`exp(coef(m))`

\[\sim\]

**Exercise 8.21**Compute bootstrap confidence intervals for the ratio ratios in the model for the

`ships`

data.
### 8.3.7 Bayesian estimation of generalised linear models

We can fit a Bayesian GLM with the `rstanarm`

package, using `stan_glm`

in the same way we did for linear models. Let’s look at an example with the `wine`

data. First, we load and prepare the data:

```
# Import data about white and red wines:
<- read.csv("https://tinyurl.com/winedata1",
white sep = ";")
<- read.csv("https://tinyurl.com/winedata2",
red sep = ";")
$type <- "white"
white$type <- "red"
red<- rbind(white, red)
wine $type <- factor(wine$type) wine
```

Now, we fit a Bayesian logistic regression model:

```
library(rstanarm)
<- stan_glm(type ~ pH + alcohol, data = wine, family = binomial)
m
# Print the estimates:
coef(m)
```

Next, we can plot the posterior distributions of the effects:

`plot(m, "dens", pars = names(coef(m)))`

To get 95 % credible intervals for the effects, we can use `posterior_interval`

. We can also use `plot`

to visualise them:

```
posterior_interval(m,
pars = names(coef(m)),
prob = 0.95)
plot(m, "intervals",
pars = names(coef(m)),
prob = 0.95)
```

Finally, we can use \(\hat{R}\) to check model convergence. It should be less than 1.1 if the fitting has converged:

`plot(m, "rhat")`

## 8.4 Mixed models

Mixed models are used in regression problems where measurements have been made on clusters of related units. As the first example of this, we’ll use a dataset from the `lme4`

package, which also happens to contain useful methods for mixed models. Let’s install it:

`install.packages("lme4")`

The `sleepstudy`

dataset from `lme4`

contains data from a study on reaction times in a sleep deprivation study. The participants were restricted to 3 hours of sleep per night, and their average reaction time on a series of tests were measured each day during the 9 days that the study lasted:

```
library(lme4)
?sleepstudystr(sleepstudy)
```

Let’s start our analysis by making boxplots showing reaction times for each subject. We’ll also superimpose the observations for each participant on top of their boxplots:

```
library(ggplot2)
ggplot(sleepstudy, aes(Subject, Reaction)) +
geom_boxplot() +
geom_jitter(aes(colour = Subject),
position = position_jitter(0.1))
```

We are interested in finding out if the reaction times increase when the participants have been starved for sleep for a longer period. Let’s try plotting reaction times against days, adding a regression line:

```
ggplot(sleepstudy, aes(Days, Reaction, colour = Subject)) +
geom_point() +
geom_smooth(method = "lm", colour = "black", se = FALSE)
```

As we saw in the boxplots, and can see in this plot too, some participants always have comparatively high reaction times, whereas others always have low values. There are clear differences between individuals, and the measurements for each individual will be correlated. This violates a fundamental assumption of the traditional linear model, namely that all observations are independent.

In addition to this, it also seems that the reaction times change in different ways for different participants, as can be seen if we facet the plot by test subject:

```
ggplot(sleepstudy, aes(Days, Reaction, colour = Subject)) +
geom_point() +
theme(legend.position = "none") +
facet_wrap(~ Subject, nrow = 3) +
geom_smooth(method = "lm", colour = "black", se = FALSE)
```

Both the intercept and the slope of the average reaction time differs between individuals. Because of this, the fit given by the single model can be misleading. Moreover, the fact that the observations are correlated will cause problems for the traditional intervals and tests. We need to take this into account when we estimate the overall intercept and slope.

One approach could be to fit a single model for each subject. That doesn’t seem very useful though. We’re not really interested in these particular test subjects, but in how sleep deprivation affects reaction times in an average person. It would be much better to have a single model that somehow incorporates the correlation between measurements made on the same individual. That is precisely what a linear mixed regression model does.

### 8.4.1 Fitting a linear mixed model

A linear mixed model (LMM) has two types of effects (explanatory variables):

*Fixed effects*, which are non-random. These are usually the variables of primary interest in the data. In the`sleepstudy`

example,`Days`

is a fixed effect.*Random effects*, which represent nuisance variables that cause measurements to be correlated. These are usually not of interest in and of themselves, but are something that we need to include in the model to account for correlations between measurements. In the`sleepstudy`

example,`Subject`

is a random effect.

Linear mixed models can be fitted using `lmer`

from the `lme4`

package. The syntax is the same as for `lm`

, with the addition of random effects. These can be included in different ways. Let’s have a look at them.

First, we can include a *random intercept*, which gives us a model where the intercept (but not the slope) varies between test subjects. In our example, the formula for this is:

```
library(lme4)
<- lmer(Reaction ~ Days + (1|Subject), data = sleepstudy) m1
```

Alternatively, we could include a *random slope* in the model, in which case the slope (but not the intercept) varies between test subjects. The formula would be:

`<- lmer(Reaction ~ Days + (0 + Days|Subject), data = sleepstudy) m2 `

Finally, we can include both a random intercept and random slope in the model. This can be done in two different ways, as we can model the intercept and slope as being correlated or uncorrelated:

```
# Correlated random intercept and slope:
<- lmer(Reaction ~ Days + (1 + Days|Subject), data = sleepstudy)
m3
# Uncorrelated random intercept and slope:
<- lmer(Reaction ~ Days + (1|Subject) + (0 + Days|Subject),
m4 data = sleepstudy)
```

Which model should we choose? Are the intercepts and slopes correlated? It could of course be the case that individuals with a high intercept have a smaller slope - or a greater slope! To find out, we can fit different linear models to each subject, and then make a scatterplot of their intercepts and slopes. To fit a model to each subject, we use `split`

and `map`

as in Section 8.1.11:

```
# Collect the coefficients from each linear model:
library(purrr)
%>% split(.$Subject) %>%
sleepstudy map(~ lm(Reaction ~ Days, data = .)) %>%
map(coef) -> coefficients
# Convert to a data frame:
<- data.frame(matrix(unlist(coefficients),
coefficients nrow = length(coefficients),
byrow = TRUE),
row.names = names(coefficients))
names(coefficients) <- c("Intercept", "Days")
# Plot the coefficients:
ggplot(coefficients, aes(Intercept, Days,
colour = row.names(coefficients))) +
geom_point() +
geom_smooth(method = "lm", colour = "black", se = FALSE) +
labs(fill = "Subject")
# Test the correlation:
cor.test(coefficients$Intercept, coefficients$Days)
```

The correlation test is not significant, and judging from the plot, there is little indication that the intercept and slope are correlated. We saw earlier that both the intercept and the slope seem to differ between subjects, and so `m4`

seems like the best choice here. Let’s stick with that, and look at a summary table for the model.

`summary(m4, correlation = FALSE)`

I like to add `correlation = FALSE`

here, which suppresses some superfluous output from `summary`

.

You’ll notice that unlike the `summary`

table for linear models, there are no p-values! This is a deliberate design choice from the `lme4`

developers, who argue that the approximate test available aren’t good enough for small sample sizes (Bates et al., 2015).

Using the bootstrap, as we will do in Section 8.4.3, is usually the best approach for mixed models. If you *really* want some quick p-values, you can load the `lmerTest`

package, which adds p-values computed using the Satterthwaite approximation (Kuznetsova et al., 2017). This is better than the usual approximate test, but still not perfect.

```
install.packages("lmerTest")
library(lmerTest)
<- lmer(Reaction ~ Days + (1|Subject) + (0 + Days|Subject),
m4 data = sleepstudy)
summary(m4, correlation = FALSE)
```

If we need to extract the model coefficients, we can do so using `fixef`

(for the fixed effects) and `ranef`

(for the random effects):

```
fixef(m4)
ranef(m4)
```

If we want to extract the variance components from the model, we can use `VarCorr`

:

`VarCorr(m4)`

Let’s add the lines from the fitted model to our facetted plot, to compare the results of our mixed model to the lines that were fitted separately for each individual:

```
<- coef(m4)$Subject
mixed_mod $Subject <- row.names(mixed_mod)
mixed_mod
ggplot(sleepstudy, aes(Days, Reaction)) +
geom_point() +
theme(legend.position = "none") +
facet_wrap(~ Subject, nrow = 3) +
geom_smooth(method = "lm", colour = "cyan", se = FALSE,
size = 0.8) +
geom_abline(aes(intercept = `(Intercept)`, slope = Days,
color = "magenta"),
data = mixed_mod, size = 0.8)
```

Notice that the lines differ. The intercept and slopes have been *shrunk* toward the global effects, i.e. toward the average of all lines.

\[\sim\]

**Exercise 8.22 **Consider the `Oxboys`

data from the `nlme`

package. Does a mixed model seem appropriate here? If so, is the intercept and slope for different subjects correlated? Fit a suitable model, with `height`

as the response variable.

Save the code for your model, as you will return to it in the next few exercises.

(Click here to go to the solution.)

**Exercise 8.23 **The `broom.mixed`

package allows you to get summaries of mixed models as data frames, just as `broom`

does for linear and generalised linear models. Install it and use it to get the summary table for the model for the `Oxboys`

data that you created in the previous exercise. How are fixed and random effects included in the table?

### 8.4.2 Model diagnostics

As for any linear model, residual plots are useful for diagnostics for linear mixed models. Of particular interest are signs of heteroscedasticity, as homoscedasticity is assumed in the mixed model. We’ll use `fortify.merMod`

to turn the model into an object that can be used with `ggplot2`

, and then create some residual plots:

```
library(ggplot2)
<- fortify.merMod(m4)
fm4
# Plot residuals:
ggplot(fm4, aes(.fitted, .resid)) +
geom_point() +
geom_hline(yintercept = 0) +
xlab("Fitted values") + ylab("Residuals")
# Compare the residuals of different subjects:
ggplot(fm4, aes(Subject, .resid)) +
geom_boxplot() +
coord_flip() +
ylab("Residuals")
# Observed values versus fitted values:
ggplot(fm4, aes(.fitted, Reaction)) +
geom_point(colour = "blue") +
facet_wrap(~ Subject, nrow = 3) +
geom_abline(intercept = 0, slope = 1) +
xlab("Fitted values") + ylab("Observed values")
## Q-Q plot of residuals:
ggplot(fm4, aes(sample = .resid)) +
geom_qq() + geom_qq_line()
## Q-Q plot of random effects:
ggplot(ranef(m4)$Subject, aes(sample = `(Intercept)`)) +
geom_qq() + geom_qq_line()
ggplot(ranef(m4)$Subject, aes(sample = `Days`)) +
geom_qq() + geom_qq_line()
```

The normality assumption appears to be satisfied, but there are some signs of heteroscedasticity in the boxplots of the residuals for the different subjects.

\[\sim\]

**Exercise 8.24 **Return to your mixed model for the `Oxboys`

data from Exercise 8.22. Make diagnostic plots for the model. Are there any signs of heteroscedasticity or non-normality?

### 8.4.3 Bootstrapping

Summary tables, including p-values, for the fixed effects are available through `boot_summary`

:

```
library(boot.pval)
boot_summary(m4, type = "perc")
```

`boot_summary`

calls a function called `bootMer`

, which performs parametric resampling from the model. In case you want to call it directly, you can do as follows:

```
<- bootMer(m4, fixef, nsim = 999)
boot_res
library(boot)
boot.ci(boot_res, type = "perc", index = 1) # Intercept
boot.ci(boot_res, type = "perc", index = 2) # Days
```

### 8.4.4 Nested random effects and multilevel/hierarchical models

In many cases, a random factor is *nested* within another. To see an example of this, consider the `Pastes`

data from `lme4`

:

```
library(lme4)
?Pastesstr(Pastes)
```

We are interested in the `strength`

of a chemical product. There are ten delivery batches (`batch`

), and three casks within each delivery (`cask`

). Because of variations in manufacturing, transportation, storage, and so on, it makes sense to include random effects for both `batch`

and `cask`

in a linear mixed model. However, each cask only appears within a single batch, which makes the `cask`

effect *nested* within `batch`

.

Models that use nested random factors are commonly known as *multilevel models* (the random factors exist at different “levels”), or *hierarchical models* (there is a hierarchy between the random factors). These aren’t really any different from other mixed models, but depending on how the data is structured, we may have to be a bit careful to get the nesting right when we fit the model with `lmer`

.

If the two effects weren’t nested, we could fit a model using:

```
# Incorrect model:
<- lmer(strength ~ (1|batch) + (1|cask),
m1 data = Pastes)
summary(m1, correlation = FALSE)
```

However, because the casks are labelled `a`

, `b`

, and `c`

within each batch, we’ve now fitted a model where casks from different batches are treated as being equal! To clarify that the labels `a`

, `b`

, and `c`

belong to different casks in different batches, we need to include the nesting in our formula. This is done as follows:

```
# Cask in nested within batch:
<- lmer(strength ~ (1|batch/cask),
m2 data = Pastes)
summary(m2, correlation = FALSE)
```

Equivalently, we can also use:

```
<- lmer(strength ~ (1|batch) + (1|batch:cask),
m3 data = Pastes)
summary(m3, correlation = FALSE)
```

### 8.4.5 ANOVA with random effects

The `lmerTest`

package provides ANOVA tables that allow us to use random effects in ANOVA models. To use it, simply load `lmerTest`

before fitting a model with `lmer`

, and then run `anova(m, type = "III")`

(or replace `III`

with `II`

or `I`

if you want a type II or type I ANOVA table instead).

As an example, consider the `TVbo`

data from `lmerTest`

. 3 types of TV sets were compared by 8 assessors for 4 different pictures. To see if there is a difference in the mean score for the colour balance of the TV sets, we can fit a mixed model. We’ll include a random intercept for the assessor. This is a balanced design (in which case the results from all three types of tables coincide):

```
library(lmerTest)
# TV data:
?TVbo
# Fit model with both fixed and random effects:
<- lmer(Colourbalance ~ TVset*Picture + (1|Assessor),
m data = TVbo)
# View fitted model:
m
# All three types of ANOVA table give the same results here:
anova(m, type = "III")
anova(m, type = "II")
anova(m, type = "I")
```

The interaction effect is significant at the 5 % level. As for other ANOVA models, we can visualise this with an interaction plot:

```
interaction.plot(TVbo$TVset, TVbo$Picture,
response = TVbo$Colourbalance)
```

\[\sim\]

**Exercise 8.25 **Fit a mixed effects ANOVA to the `TVbo`

data, using `Coloursaturation`

as the response variable, `TVset`

and `Picture`

as fixed effects, and `Assessor`

as a random effect. Does there appear to be a need to include the interaction between `Assessor`

and `TVset`

as a random effect? If so, do it.

### 8.4.6 Generalised linear mixed models

Everything that we have just done for the linear mixed models carries over to *generalised linear mixed models* (GLMM), which are GLM’s with both fixed and random effects.

A common example is the *item response model*, which plays an important role in psychometrics. This model is frequently used in psychological tests containing multiple questions or sets of questions (“items”), where both the subject the item are considered random effects. As an example, consider the `VerbAgg`

data from `lme4`

:

```
library(lme4)
?VerbAggView(VerbAgg)
```

We’ll use the binary version of the response, `r2`

, and fit a logistic mixed regression model to the data, to see if it can be used to explain the subjects’ responses. The formula syntax is the same as for linear mixed models, but now we’ll use `glmer`

to fit a GLMM. We’ll include `Anger`

and `Gender`

as fixed effects (we are interested in seeing how these affect the response) and `item`

and `id`

as random effects with random slopes (we believe that answers to the same item and answers from the same individual may be correlated):

```
<- glmer(r2 ~ Anger + Gender + (1|item) + (1|id),
m data = VerbAgg, family = binomial)
summary(m, correlation = FALSE)
```

We can plot the fitted random effects for `item`

to verify that there appear to be differences between the different items:

```
<- coef(m)$item
mixed_mod $item <- row.names(mixed_mod)
mixed_mod
ggplot(mixed_mod, aes(`(Intercept)`, item)) +
geom_point() +
xlab("Random intercept")
```

The `situ`

variable, describing situation type, also appears interesting. Let’s include it as a fixed effect. Let’s also allow different situational (random) effects for different respondents. It seems reasonable that such responses are random rather than fixed (as in the solution to Exercise 8.25), and we do have repeated measurements of these responses. We’ll therefore also include `situ`

as a random effect nested within `id`

:

```
<- glmer(r2 ~ Anger + Gender + situ + (1|item) + (1|id/situ),
m data = VerbAgg, family = binomial)
summary(m, correlation = FALSE)
```

Finally, we’d like to obtain bootstrap confidence intervals for fixed effects. Because this is a fairly large dataset (\(n=7,584\)) this can take a looong time to run, so stretch your legs and grab a cup of coffee or two while you wait:

```
library(boot.pval)
boot_summary(m, type = "perc", R = 100)
# Ideally, R should be greater, but for the sake of
# this example, we'll use a low number.
```

\[\sim\]

**Exercise 8.26 **Consider the `grouseticks`

data from the `lme4`

package (Elston et al., 2001). Fit a mixed Poisson regression model to the data, with `TICKS`

as the response variable and `YEAR`

and `HEIGHT`

as fixed effects. What variables are suitable to use for random effects? Compute a bootstrap confidence interval for the effect of `HEIGHT`

.

### 8.4.7 Bayesian estimation of mixed models

From a numerical point of view, using Bayesian modelling with `rstanarm`

is preferable to frequentist modelling with `lme4`

if you have complex models with many random effects. Indeed, for some models, `lme4`

will return a warning message about a singular fit, basically meaning that the model is too complex, whereas `rstanarm`

, powered by the use of a prior distribution, always will return a fitted model regardless of complexity.

After loading `rstanarm`

, fitting a Bayesian linear mixed model with a weakly informative prior is as simple as substituting `lmer`

with `stan_lmer`

:

```
library(lme4)
library(rstanarm)
<- stan_lmer(Reaction ~ Days + (1|Subject) + (0 + Days|Subject),
m4 data = sleepstudy)
# Print the results:
m4
```

To plot the posterior distributions for the coefficients of the fixed effects, we can use `plot`

, specifying which effects we are interested in using `pars`

:

`plot(m4, "dens", pars = c("(Intercept)", "Days"))`

To get 95 % credible intervals for the fixed effects, we can use `posterior_interval`

as follows:

```
posterior_interval(m4,
pars = c("(Intercept)", "Days"),
prob = 0.95)
```

We can also plot them using `plot`

:

```
plot(m4, "intervals",
pars = c("(Intercept)", "Days"),
prob = 0.95)
```

Finally, we’ll check that the model fitting has converged:

`plot(m4, "rhat")`

## 8.5 Survival analysis

Many studies are concerned with the duration of time until an event happens: time until a machine fails, time until a patient diagnosed with a disease dies, and so on. In this section we will consider some methods for *survival analysis* (also known as reliability analysis in engineering and duration analysis in economics), which is used for analysing such data. The main difficulty here is that studies often end before all participants have had events, meaning that some observations are *right-censored* - for these observations, we don’t know when the event happened, but only that it happened after the end of the study.

The `survival`

package contains a number of useful methods for survival analysis. Let’s install it:

`install.packages("survival")`

We will study the lung cancer data in `lung`

:

```
library(survival)
?lungView(lung)
```

The survival times of the patients consist of two parts: `time`

(the time from diagnosis until either death or the end of the study) and `status`

(1 if the observations is censored, 2 if the patient died before the end of the study). To combine these so that they can be used in a survival analysis, we must create a `Surv`

object:

`Surv(lung$time, lung$status)`

Here, a `+`

sign after a value indicates right-censoring.

### 8.5.1 Comparing groups

Survival times are best visualised using Kaplan-Meier curves that show the proportion of surviving patients. Let’s compare the survival times of women and men. We first fit a survival model using `survfit`

, and then draw the Kaplan-Meier curve (with parametric confidence intervals) using `autoplot`

from `ggfortify`

:

```
library(ggfortify)
library(survival)
<- survfit(Surv(time, status) ~ sex, data = lung)
m autoplot(m)
```

To print the values for the survival curves at different time points, we can use `summary`

:

`summary(m)`

To test for differences between two groups, we can use the logrank test (also known as the Mantel-Cox test), given by `survfit`

:

`survdiff(Surv(time, status) ~ sex, data = lung)`

Another option is the Peto-Peto test, which puts more weight on early events (deaths, in the case of the `lung`

data), and therefore is suitable when such events are of greater interest. In contrast, the logrank test puts equal weights on all events regardless of when they occur. The Peto-Peto test is obtained by adding the argument `rho = 1`

:

`survdiff(Surv(time, status) ~ sex, rho = 1, data = lung)`

The `Hmisc`

package contains a function for obtaining confidence intervals based on the Kaplan-Meier estimator, called `bootkm`

. This allows us to get confidence intervals for the quantiles (including the median) of the survival distribution for different groups, as well as for differences between the quantiles of different groups. First, let’s install it:

`install.packages("Hmisc")`

We can now use `bootkm`

to compute bootstrap confidence intervals for survival times based on the `lung`

data. We’ll compute an interval for the median survival time for females, and one for the difference in median survival time between females and males:

```
library(Hmisc)
# Create a survival object:
<- Surv(lung$time, lung$status)
survobj
# Get bootstrap replicates of the median survival time for
# the two groups:
<- bootkm(survobj[lung$sex == 2],
median_surv_time_female q = 0.5, B = 999)
<- bootkm(survobj[lung$sex == 1],
median_surv_time_male q = 0.5, B = 999)
# 95 % bootstrap confidence interval for the median survival time
# for females:
quantile(median_surv_time_female,
c(.025,.975), na.rm=TRUE)
# 95 % bootstrap confidence interval for the difference in median
# survival time:
quantile(median_surv_time_female - median_surv_time_male,
c(.025,.975), na.rm=TRUE)
```

To obtain confidence intervals for other quantiles, we simply change the argument `q`

in `bootkm`

.

**Exercise 8.27 **Consider the `ovarian`

data from the `survival`

package. Plot Kaplan-Meier curves comparing the two treatment groups. Compute a bootstrap confidence interval for the difference in the 75 % quantile for the survival time for the two groups.

### 8.5.2 The Cox proportional hazards model

The hazard function, or hazard rate, is the rate of events at time \(t\) if a subject has survived until time \(t\). The higher the hazard, the greater the probability of an event. Hazard rates play an integral part in survival analysis, particularly in regression models. To model how the survival times are affected by different explanatory variables, we can use a *Cox proportional hazards model* (Cox, 1972), fitted using `coxph`

:

```
<- coxph(Surv(time, status) ~ age + sex, data = lung)
m summary(m)
```

The exponentiated coefficients show the hazard ratios, i.e. the relative increases (values greater than 1) or decreases (values below 1) of the hazard rate when a covariate is increased one step while all others are kept fixed:

`exp(coef(m))`

In this case, the hazard increases with age (multiply the hazard by 1.017 for each additional year that the person has lived), and is lower for women (`sex=2`

) than for men (`sex=1`

).

The `censboot_summary`

function from `boot.pval`

provides a table of estimates, bootstrap confidence intervals, and bootstrap p-values for the model coefficients. The `coef`

argument can be used to specify whether to print confidence intervals for the coefficients or for the exponentiated coeffientes (i.e. the hazard ratios):

```
library(boot.pval)
# Original coefficients:
censboot_summary(m, type = "perc", coef = "raw")
# Exponentiated coefficients:
censboot_summary(m, type = "perc", coef = "exp")
```

To manually obtain bootstrap confidence intervals for the exponentiated coefficients, we can use the `censboot`

function from `boot`

as follows:

```
# Function to get the bootstrap replicates of the exponentiated
# coefficients:
<- function(data, formula) {
boot_fun <- coxph(formula, data = data)
m_boot return(exp(coef(m_boot)))
}
# Run the resampling:
library(boot)
<- censboot(lung[,c("time", "status", "age", "sex")],
boot_res R = 999,
boot_fun, formula =
formula(Surv(time, status) ~ age + sex))
# Compute the percentile bootstrap confidence intervals:
boot.ci(boot_res, type = "perc", index = 1) # Age
boot.ci(boot_res, type = "perc", index = 2) # Sex
```

As the name implies, the Cox proportional hazards model relies on the assumption of *proportional hazards*, which essentially means that the effect of the explanatory variables is constant over time. This can be assessed visually by plotting the model residuals, using `cox.zph`

and the `ggcoxzph`

function from the `survminer`

package. Specifically, we will plot the scaled Schoenfeld (1982) residuals, which measure the difference between the observed covariates and the expected covariates given the risk at the time of an event. If the proportional hazards assumption holds, then there should be no trend over time for these residuals. Use the trend line to aid the eye:

```
install.packages("survminer")
library(survminer)
ggcoxzph(cox.zph(m), var = 1) # age
ggcoxzph(cox.zph(m), var = 2) # sex
# Formal p-values for a test of proportional
# hazards, for each variable:
cox.zph(m)
```

In this case, there are no apparent trends over time (which is in line with the corresponding formal hypothesis tests), indicating that the proportional hazards model could be applicable here.

\[\sim\]

**Exercise 8.28 **Consider the `ovarian`

data from the `survival`

package.

- Use a Cox proportional hazards regression to test whether there is a difference between the two treatment groups, adjusted for age.
- Compute bootstrap confidence interval for the hazard ratio of age.

(Click here to go to the solution.)

**Exercise 8.29 **Consider the `retinopathy`

data from the `survival`

package. We are interested in a mixed survival model, where `id`

is used to identify patients and `type`

, `trt`

and `age`

are fixed effects. Fit a mixed Cox proportional hazards regression (add `cluster = id`

to the call to `coxph`

to include this as a random effect). Is the assumption of proportional hazards fulfilled?

### 8.5.3 Accelerated failure time models

In many cases, the proportional hazards assumption does not hold. In such cases we can turn to *accelerated failure time models* (Wei, 1992), for which the effect of covariates is to accelerate or decelerate the life course of a subject.

While the proportional hazards model is semiparametric, accelerated failure time models are typically fully parametric, and thus involve stronger assumptions about an underlying distribution. When fitting such a model using the `survreg`

function, we must therefore specify what distribution to use. Two common choices are the Weibull distribution and the log-logistic distribution. The Weibull distribution is commonly used in engineering, e.g. in reliability studies. The hazard function of Weibull models is always monotonic, i.e. either always increasing or always decreasing. In contrast, the log-logistic distribution allows the hazard function to be non-monotonic, making it more flexible, and often more appropriate for biological studies. Let’s fit both types of models to the `lung`

data and have a look at the results:

```
library(survival)
# Fit Weibull model:
<- survreg(Surv(time, status) ~ age + sex, data = lung,
m_w dist = "weibull")
summary(m_w)
# Fit log-logistic model:
<- survreg(Surv(time, status) ~ age + sex, data = lung,
m_ll dist = "loglogistic")
summary(m_ll)
```

Interpreting the coefficients of accelerated failure time models is easier than interpreting coefficients from proportional hazards models. The exponentiated coefficients show the relative increase or decrease in the expected survival times when a covariate is increased one step while all others are kept fixed:

`exp(coef(m_ll))`

In this case, according to the log-logistic model, the expected survival time decreases by 1.4 % (i.e. multiply by \(0.986\)) for each additional year that the patient has lived. The expected survival time for females (`sex=2`

) is 61.2 % higher than for males (multiply by \(1.612\)).

To obtain bootstrap confidence intervals and p-values for the effects, we follow the same procedure as for the Cox model, using `censboot_summary`

. Here is an example for the log-logistic accelerated failure time model:

```
library(boot.pval)
# Original coefficients:
censboot_summary(m_ll, type = "perc", coef = "raw")
# Exponentiated coefficients:
censboot_summary(m_ll, type = "perc", coef = "exp")
```

We can also use `censboot`

:

```
# Function to get the bootstrap replicates of the exponentiated
# coefficients:
<- function(data, formula, distr) {
boot_fun <- survreg(formula, data = data, dist = distr)
m_boot return(exp(coef(m_boot)))
}
# Run the resampling:
library(boot)
<- censboot(lung[,c("time", "status", "age", "sex")],
boot_res R = 999,
boot_fun, formula =
formula(Surv(time, status) ~ age + sex),
distr = "loglogistic")
# Compute the percentile bootstrap confidence intervals:
boot.ci(boot_res, type = "perc", index = 2) # Age
boot.ci(boot_res, type = "perc", index = 3) # Sex
```

\[\sim\]

**Exercise 8.30 **Consider the `ovarian`

data from the `survival`

package. Fit a log-logistic accelerated failure time model to the data, using all available explanatory variables. What is the estimated difference in survival times between the two treatment groups?

### 8.5.4 Bayesian survival analysis

At the time of this writing, the latest release of `rstanarm`

does not contain functions for fitting survival analysis models. You can check whether this still is the case by running `?stan_surv`

in the Console. If you don’t find the documentation for the `stan_surv`

function, you will have to install the development version of the package from GitHub (which contains such functions), using the following code:

```
# Check if the devtools package is installed, and start
# by installing it otherwise:
if (!require(devtools)) {
install.packages("devtools")
}library(devtools)
# Download and install the development version of the package:
install_github("stan-dev/rstanarm", build_vignettes = FALSE)
```

Now, let’s have a look at how to fit a Bayesian model to the `lung`

data from `survival`

:

```
library(survival)
library(rstanarm)
# Fit proportional hazards model using cubic M-splines (similar
# but not identical to the Cox model!):
<- stan_surv(Surv(time, status) ~ age + sex, data = lung)
m m
```

Fitting a survival model with a random effect works similarly, and uses the same syntax as `lme4`

. Here is an example with the `retinopathy`

data:

```
<- stan_surv(Surv(futime, status) ~ age + type + trt + (1|id),
m data = retinopathy)
m
```

### 8.5.5 Power estimates for the logrank test

The `spower`

function in `Hmisc`

can be used to compute the power of the logrank test in different scenarios using simulation. The helper functions `Weibull2`

, `Lognorm2`

, and `Gompertz2`

can be used to define Weibull, lognormal and Gomperts distributions to sample from, using survival probabilities at different time points rather than the traditional parameters of those distributions. We’ll look at an example involving the Weibull distribution here. Additional examples can be found in the function’s documentation (`?spower`

).

Let’s simulate the power of a 3-year follow-up study with two arms (i.e. two groups, control and intervention). First, we define a Weibull distribution for (compliant) control patients. Let’s say that their 1-year survival is 0.9 and their 3-year survival is 0.6. To define a Weibull distribution that corresponds to these numbers, we use `Weibull2`

as follows:

`<- Weibull2(c(1, 3), c(.9, .6)) weib_dist `

We’ll assume that the treatment has no effect for the first 6 months, and that it then has a constant effect, leading to a hazard ratio of 0.75 (so the hazard ratio is 1 if the time in years is less than or equal to 0.5, and 0.75 otherwise). Moreover, we’ll assume that there is a constant drop-out rate, such that 20 % of the patients can be expected to drop out during the three years. Finally, there is no drop-in. We define a function to simulate survival times under these conditions:

```
# In the functions used to define the hazard ratio, drop-out
# and drop-in, t denotes time in years:
<- Quantile2(weib_dist,
sim_func hratio = function(t) { ifelse(t <= 0.5, 1, 0.75) },
dropout = function(t) { 0.2*t/3 },
dropin = function(t) { 0 })
```

Next, we define a function for the censoring distribution, which is assumed to be the same for both groups. Let’s say that each follow-up is done at a random time point between 2 and 3 years. We’ll therefore use a uniform distribution on the interval \((2,3)\) for the censoring distribution:

```
<- function(n)
rcens
{runif(n, 2, 3)
}
```

Finally, we define two helper functions required by `spower`

and then run the simulation study. The output is the simulated power using the settings that we’ve just created.

```
# Define helper functions:
<- function(n) { sim_func(n, "control") }
rcontrol <- function(n) { sim_func(n, "intervention") }
rinterv
# Simulate power when both groups have sample size 300:
spower(rcontrol, rinterv, rcens, nc = 300, ni = 300,
test = logrank, nsim = 999)
# Simulate power when both groups have sample size 450:
spower(rcontrol, rinterv, rcens, nc = 450, ni = 450,
test = logrank, nsim = 999)
# Simulate power when the control group has size 100
# and the intervention group has size 300:
spower(rcontrol, rinterv, rcens, nc = 100, ni = 300,
test = logrank, nsim = 999)
```

## 8.6 Left-censored data and nondetects

Survival data is typically right-censored. Left-censored data, on the other hand, is common in medical research (e.g. in biomarker studies) and environmental chemistry (e.g. measurements of chemicals in water), where some measurements fall below the laboratory’s *detection limits* (or limit of detection, LoD). Such data also occur in studies in economics. A measurement below the detection limit, a *nondetect*, is still more informative than having no measurement at all - we may not know the exact value, but we know that the measurement is below a given threshold.

In principle, all methods that are applicable to survival analysis can also be used for left-censored data (although the interpretation of coefficients and parameters may differ), but in practice the distributions of lab measurements and economic variables often differ from those that typically describe survival times. In this section we’ll look at methods tailored to the kind of left-censored data that appears in applications in the aforementioned fields.

### 8.6.1 Estimation

The `EnvStats`

package contains a number of functions that can be used to compute descriptive statistics and estimating parameters of distributions from data with nondetects. Let’s install it:

`install.packages("EnvStats")`

Estimates of the mean and standard deviation of a normal distribution that take the censoring into account in the right way can be obtained with `enormCensored`

, which allows us to use several different estimators (the details surrounding the available estimators can be found using `?enormCensored`

). Analogous functions are available for other distributions, for instance `elnormAltCensored`

for the lognormal distribution, `egammaCensored`

for the gamma distribution, and `epoisCensored`

for the Poisson distribution.

To illustrate the use of `enormCensored`

, we will generate data from a normal distribution. We know the true mean and standard deviation of the distribution, and can compute the estimates for the generated sample. We will then pretend that there is a detection limit for this data, and artificially left-censor about 20 % of it. This allows us to compare the estimates for the full sample and the censored sample, to see how the censoring affects the estimates. Try running the code below a few times:

```
# Generate 50 observations from a N(10, 9)-distribution:
<- rnorm(50, 10, 3)
x
# Estimate the mean and standard deviation:
<- mean(x)
mean_full <- sd(x)
sd_full
# Censor all observations below the "detection limit" 8
# and replace their values by 8:
<- x<8
censored <- 8
x[censored]
# The proportion of censored observations is:
mean(censored)
# Estimate the mean and standard deviation in a naive
# manner, using the ordinary estimators with all
# nondetects replaced by 8:
<- mean(x)
mean_cens_naive <- sd(x)
sd_cens_naive
# Estimate the mean and standard deviation using
# different estimators that take the censoring
# into account:
library(EnvStats)
# Maximum likelihood estimate:
<- enormCensored(x, censored,
estimates_mle method = "mle")
# Biased-corrected maximum likelihood estimate:
<- enormCensored(x, censored,
estimates_bcmle method = "bcmle")
# Regression on order statistics, ROS, estimate:
<- enormCensored(x, censored,
estimates_ros method = "ROS")
# Compare the different estimates:
mean_full; sd_full
mean_cens_naive; sd_cens_naive$parameters
estimates_mle$parameters
estimates_bcmle$parameters estimates_ros
```

The naive estimators tend to be biased for data with nondetects (sometimes very biased!). Your mileage may vary depending on e.g. the sample size and the amount of censoring, but in general, the estimators that take censoring into account will fare much better.

After we have obtained estimates for the parameters of the normal distribution, we can plot the data against the fitted distribution to check the assumption of normality:

```
library(ggplot2)
# Compare to histogram, including a bar for nondetects:
ggplot(data.frame(x), aes(x)) +
geom_histogram(colour = "black", aes(y = ..density..)) +
geom_function(fun = dnorm, colour = "red", size = 2,
args = list(mean = estimates_mle$parameters[1],
sd = estimates_mle$parameters[2]))
# Compare to histogram, excluding nondetects:
<- x[!censored]
x_noncens ggplot(data.frame(x_noncens), aes(x_noncens)) +
geom_histogram(colour = "black", aes(y = ..density..)) +
geom_function(fun = dnorm, colour = "red", size = 2,
args = list(mean = estimates_mle$parameters[1],
sd = estimates_mle$parameters[2]))
```

To obtain percentile and BCa bootstrap confidence intervals for the mean, we can add the options `ci = TRUE`

and `ci.method = "bootstrap"`

:

```
# Using 999 bootstrap replicates:
enormCensored(x, censored, method = "mle",
ci = TRUE, ci.method = "bootstrap",
n.bootstraps = 999)$interval$limits
```

\[\sim\]

**Exercise 8.31 **Download the `il2rb.csv`

data from the book’s web page. It contains measurements of the biomarker IL-2RB made in serum samples from two groups of patients. The values that are missing are in fact nondetects, with detection limit 0.25.

Under the assumption that the biomarker levels follow a lognormal distribution, compute bootstrap confidence intervals for the mean of the distribution for the control group. What proportion of the data is left-censored?

### 8.6.2 Tests of means

When testing the difference between two groups’ means, nonparametric tests like the Wilcoxon-Mann-Whitney test often perform very well for data with nondetects, unlike the t-test (Zhang et al., 2009). For data with a high degree of censoring (e.g. more than 50 %), most tests perform poorly. For multivariate tests of mean vectors the situation is the opposite, with Hotelling’s \(T^2\) (Section 7.2.6) being a much better option than nonparametric tests (Thulin, 2016).

\[\sim\]

**Exercise 8.32 **Return to the `il2rb.csv`

data from Exercise 8.32. Test the hypothesis that there is no difference in location between the two groups.

### 8.6.3 Censored regression

Censored regression models can be used when the response variable is censored. A common model in economics is the Tobit regression model (Tobin, 1958), which is a linear regression model with normal errors, tailored to left-censored data. It can be fitted using `survreg`

.

As an example, consider the `EPA.92c.zinc.df`

dataset available in `EnvStats`

. It contains measurements of zinc concentrations from five wells, made on 8 samples from each well, half of which are nondetects. Let’s say that we are interested in comparing these five wells (so that the wells aren’t random effects). Let’s also assume that the 8 samples were collected at different time points, and that we want to investigate whether the concentrations change over time. Such changes could be non-linear, so we’ll include the sample number as a factor. To fit a Tobit model to this data, we use `survreg`

as follows.

```
library(EnvStats)
.92c.zinc.df
?EPA
# Note that in Surv, in the vector describing censoring 0 means
# censoring and 1 no censoring. This is the opposite of the
# definition used in EPA.92c.zinc.df$Censored, so we use the !
# operator to change 0's to 1's and vice versa.
library(survival)
<- survreg(Surv(Zinc, !Censored, type = "left") ~ Sample + Well,
m data = EPA.92c.zinc.df, dist = "gaussian")
summary(m)
```

Similarly, we can fit a model under the assumption of lognormality:

```
<- survreg(Surv(Zinc, !Censored, type = "left") ~ Sample + Well,
m data = EPA.92c.zinc.df, dist = "lognormal")
summary(m)
```

Fitting regression models where the explanatory variables are censored is more challenging. For prediction, a good option is models based on decision trees, studied in Section 9.5. For testing whether there is a trend over time, tests based on Kendall’s correlation coefficient can be useful. `EnvStats`

provides two functions for this - `kendallTrendTest`

for testing a monotonic trend, and `kendallSeasonalTrendTest`

for testing a monotonic trend within seasons.

## 8.7 Creating matched samples

*Matching* is used to balance the distribution of explanatory variables in the groups that are being compared. This is often required in observational studies, where the treatment variable is not randomly assigned, but determined by some external factor(s) that may be related to the treatment. For instance, if you wish to study the effect of smoking on mortality, you can recruit a group of smokers and non-smokers and follow them for a few years. But both mortality and smoking are related to *confounding* variables such as age and gender, meaning that imbalances in the age and gender distributions of smokers and non-smokers can bias the results. There are several methods for creating balanced or *matched samples* that seek to mitigate this bias, including *propensity score matching*, which we’ll use here. The `MatchIt`

and `optmatch`

packages contain the functions that we need for this.

To begin with, let’s install the two packages:

`install.packages(c("MatchIt", "optmatch"))`

We will illustrate the use of the packages using the `lalonde`

dataset, that is shipped with the `MatchIt`

package:

```
library(MatchIt)
data(lalonde)
?lalondeView(lalonde)
```

Note that the data has row names, which are useful e.g. for identifying which individuals have been paired - we can access them using `rownames(lalonde)`

.

### 8.7.1 Propensity score matching

To perform automated propensity score matching, we will use the `matchit`

function, which computes propensity scores and then matches participants from the treatment and control groups using these. Matches can be found in several ways. We’ll consider two of them here. As input, the `matchit`

function takes a formula describing the treatment variable and potential confounders, what datasets to use, which method to use and what ratio of control to treatment participants to use.

A common method is *nearest neighbour matching*, where each participant is matched to the participant in the other group with the most similar propensity score. By default, it starts by finding a match for the participant in the treatment group that has the largest propensity score, then finds a match for the participant in the treatment groups with the second largest score, and so on. Two participants cannot be matched with the same participant in the control group. The nearest neighbour match is *locally optimal* in the sense that it find the best (still) available match for each participant in the treatment group, ignoring if that match in fact would be even better for another participant in the treatment group.

To perform propensity score matching using nearest neighbour matching with 1 match each, evaluate the results and then extract the matched samples we can use `matchit`

as follows:

```
<- matchit(treat ~ re74 + re75 + age + educ + married,
matches data = lalonde, method = "nearest", ratio = 1)
summary(matches)
plot(matches)
plot(matches, type = "hist")
<- match.data(matches)
matched_data summary(matched_data)
```

To view the matched pairs, you can use:

`$match.matrix matches`

To view the values of the `re78`

variable of the matched pairs, use:

```
<- "re78"
varName <- lalonde[row.names(matches$match.matrix), varName]
resMatrix for(i in 1:ncol(matches$match.matrix))
{<- cbind(resMatrix, lalonde[matches$match.matrix[,i],
resMatrix
varName])
}rownames(resMatrix) <- row.names(matches$match.matrix)
View(resMatrix)
```

As an alternative to nearest neighbour-matching, *optimal matching* can be used. This is similar to nearest neighbour-matching, but strives to obtain *globally optimal* matches rather than locally optimal. This means that each participant in the treatment group is paired with a participant in the control group, while also taking into account how similar the latter participant is to other participants in the treatment group.

To perform propensity score matching using optimal matching with 2 matches each:

```
<- matchit(treat ~ re74 + re75 + age + educ + married,
matches data = lalonde, method = "optimal", ratio = 2)
summary(matches)
plot(matches)
plot(matches, type = "hist")
<- match.data(matches)
matched_data summary(matched_data)
```

You may also want to find all controls that match participants in the treatment group exactly. This is called exact matching:

```
<- matchit(treat ~ re74 + re75 + age + educ + married,
matches data = lalonde, method = "exact")
summary(matches)
plot(matches)
plot(matches, type = "hist")
<- match.data(matches)
matched_data summary(matched_data)
```

Participants with no exact matches won’t be included in `matched_data`

.

### 8.7.2 Stepwise matching

At times you will want to combine the above approaches. For instance, you may want to have an exact match for age, and then an approximate match using the propensity scores for other variables. This is also achievable but requires the matching to be done in several steps. To first match the participant exactly on age and then 1-to-2 via nearest-neighbour propensity score matching on `re74`

and `re75`

we can use a loop:

```
# Match exactly one age:
<- matchit(treat ~ age, data = lalonde, method = "exact")
matches <- match.data(matches)
matched_data
# Match the first subclass 1-to-2 via nearest-neighbour propensity
# score matching:
<- matchit(treat ~ re74 + re75,
matches2 data = matched_data[matched_data$subclass == 1,],
method = "nearest", ratio = 2)
<- match.data(matches2, weights = "weights2",
matched_data2 subclass = "subclass2")
<- matches2$match.matrix
matchlist
# Match the remaining subclasses in the same way:
for(i in 2:max(matched_data$subclass))
{<- matchit(treat ~ re74 + re75,
matches2 data = matched_data[matched_data$subclass == i,],
method = "nearest", ratio = 2)
<- rbind(matched_data2, match.data(matches2,
matched_data2 weights = "weights2",
subclass = "subclass2"))
<- rbind(matchlist, matches2$match.matrix)
matchlist
}
# Check results:
View(matchlist)
View(matched_data2)
```

If the variables aren’t centred, the intercept is the expected value of the response variable when all explanatory variables are 0. This isn’t always realistic or meaningful.↩︎

On the contrary, doing so will usually only serve to make interpretation more difficult.↩︎

Prediction intervals provide interval estimates for the new observations. They incorporate both the uncertainty associated with our model estimates, and the fact that the new observation is likely to deviate slightly from its expected value.↩︎