8 Regression models

Regression models, in which explanatory variables are used to model the behaviour of a response variable, are without a doubt the most commonly used class of models in the statistical toolbox. In this chapter, we will have a look at different types of regression models tailored to many different sorts of data and applications.

After reading this chapter, you will be able to use R to:

  • Fit and evaluate linear and generalised linear models,
  • Fit and evaluate mixed models,
  • Fit survival analysis models,
  • Analyse data with left-censored observations,
  • Create matched samples.

8.1 Linear models

Being flexible enough to handle different types of data, yet simple enough to be useful and interpretable, linear models are among the most important tools in the statistics toolbox. In this section, we’ll discuss how to fit and evaluate linear models in R.

8.1.1 Fitting linear models

We had a quick glance at linear models in Section 3.7. There we used the mtcars data:

?mtcars
View(mtcars)

First, we plotted fuel consumption (mpg) against gross horsepower (hp):

library(ggplot2)
ggplot(mtcars, aes(hp, mpg)) +
      geom_point()

Given \(n\) observations of \(p\) explanatory variables (also known as predictors, covariates, independent variables, and features), the linear model is:

\[y_i=\beta_0 +\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip} + \epsilon_i,\qquad i=1,\ldots,n\] where \(\epsilon_i\) is a random error with mean 0, meaning that the model can also be written as:

\[E(y_i)=\beta_0 +\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip},\qquad i=1,\ldots,n\] We fitted a linear model using lm, with mpg as the response variable and hp as the explanatory variable:

m <- lm(mpg ~ hp, data = mtcars)
summary(m)

We added the fitted line to the scatterplot by using geom_abline:

# Check model coefficients:
coef(m)

# Add regression line to plot:
ggplot(mtcars, aes(hp, mpg)) +
      geom_point() + 
      geom_abline(aes(intercept = coef(m)[1], slope = coef(m)[2]),
                colour = "red")

We had a look at some diagnostic plots given by applying plot to our fitted model m:

plot(m)

Finally, we added another variable, the car weight wt, to the model:

m <- lm(mpg ~ hp + wt, data = mtcars)
summary(m)

Next, we’ll look at what more R has to offer when it comes to regression. Before that though, it’s a good idea to do a quick exercise to make sure that you remember how to fit linear models.

\[\sim\]

Exercise 8.1 The sales-weather.csv data from Section 5.12 describes the weather in a region during the first quarter of 2020. Download the file from the book’s web page. Fit a linear regression model with TEMPERATURE as the response variable and SUN_HOURS as an explanatory variable. Plot the results. Is there a connection?

You’ll return to and expand this model in the next few exercises, so make sure to save your code.

(Click here to go to the solution.)


Exercise 8.2 Fit a linear model to the mtcars data using the formula mpg ~ .. What happens? What is ~ . a shorthand for?

(Click here to go to the solution.)

8.1.2 Interactions and polynomial terms

It seems plausible that there could be an interaction between gross horsepower and weight. We can include an interaction term by adding hp:wt to the formula:

m <- lm(mpg ~ hp + wt + hp:wt, data = mtcars)
summary(m)

Alternatively, to include the main effects of hp and wt along with the interaction effect, we can use hp*wt as a shorthand for hp + wt + hp:wt to write the model formula more concisely:

m <- lm(mpg ~ hp*wt, data = mtcars)
summary(m)

It is often recommended to centre the explanatory variables in regression models, i.e. to shift them so that they all have mean 0. There are a number of benefits to this: for instance, the intercept can then be interpreted as the expected value of the response variable when all explanatory variables are equal to their means, i.e. in an average case56. Centring can also reduce multicollinearity in the data, particularly when interactions or polynomial terms are included in the model. Finally, it can reduce problems with numerical instability that may arise due to floating point arithmetic. Note, however, that there is no need to centre the response variable57.

Centring the explanatory variables can be done using scale:

# Create a new data frame, leaving the response variable mpg
# unchanged, while centring the explanatory variables:
mtcars_scaled <- data.frame(mpg = mtcars[,1],
                            scale(mtcars[,-1], center = TRUE,
                                         scale = FALSE))
m <- lm(mpg ~ hp*wt, data = mtcars_scaled)
summary(m)

If we wish to add a polynomial term to the model, we can do so by wrapping it in I(). For instance, to add a quadratic effect in the form of the squared weight of a vehicle, we’d use:

m <- lm(mpg ~ hp*wt + I(wt^2), data = mtcars_scaled)
summary(m)

8.1.3 Dummy variables

Categorical variables can be included in regression models by using dummy variables. A dummy variable takes the values 0 and 1, indicating that an observation either belongs to a category (1) or not (0). If the original categorical variable has more than two categories, say \(c\) categories, the number of dummy variables included in the regression model should be \(c-1\) (with the remaining category corresponding to all dummy variables being 0). R does this automatically for us if we include a factor variable in a regression model:

# Make cyl a categorical variable:
mtcars$cyl <- factor(mtcars$cyl)

m <- lm(mpg ~ hp*wt + cyl, data = mtcars)
summary(m)

Note how only two categories, 6 cylinders and 8 cylinders, are shown in the summary table. The third category, 4 cylinders, corresponds to both those dummy variables being 0. Therefore, the coefficient estimates for cyl6 and cyl8 are relative to the remaining reference category cyl4. For instance, compared to cyl4 cars, cyl6 cars have a higher fuel consumption, with their mpg being \(1.26\) lower.
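To see exactly which dummy variables R created, we can inspect the model’s design matrix using model.matrix. Here, the cyl6 and cyl8 columns are the two dummy variables:

# Inspect the dummy coding used in the model:
head(model.matrix(m))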

We can control which category is used as the reference category by setting the order of the factor variable, as in Section 5.4. The first factor level is always used as the reference, so if for instance we want to use cyl6 as our reference category, we’d do the following:

# Make cyl a categorical variable with cyl6 as
# the reference category:
mtcars$cyl <- factor(mtcars$cyl, levels =
                       c(6, 4, 8))

m <- lm(mpg ~ hp*wt + cyl, data = mtcars)
summary(m)

Dummy variables are frequently used for modelling differences between different groups. Including only the dummy variable corresponds to using different intercepts for different groups. If we also include an interaction with the dummy variable, we can get different slopes for different groups. Consider the model \[E(y_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\beta_{12} x_{i1}x_{i2},\qquad i=1,\ldots,n\] where \(x_1\) is numeric and \(x_2\) is a dummy variable. Then the intercept and slope change depending on the value of \(x_2\) as follows:

\[E(y_i)=\beta_0+\beta_1 x_{i1},\qquad \mbox{if } x_2=0,\] \[E(y_i)=(\beta_0+\beta_2)+(\beta_1+\beta_{12}) x_{i1},\qquad \mbox{if } x_2=1.\] This yields a model where the intercept and slope differ between the two groups that \(x_2\) represents.
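To make this concrete, here is a small sketch using the mtcars data, with the transmission type am (0 = automatic, 1 = manual) acting as the dummy variable:

# Different intercepts and slopes for automatic and manual cars:
m_am <- lm(mpg ~ wt * am, data = mtcars)
coef(m_am)

# Intercept and slope for automatic cars (am = 0):
coef(m_am)[1:2]
# Intercept and slope for manual cars (am = 1):
coef(m_am)[1:2] + coef(m_am)[3:4]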

\[\sim\]

Exercise 8.3 Return to the weather model from Exercise 8.1. Create a dummy variable for precipitation (zero precipitation or non-zero precipitation) and add it to your model. Also include an interaction term between the precipitation dummy and the number of sun hours. Are any of the coefficients significantly non-zero?

(Click here to go to the solution.)

8.1.4 Model diagnostics

There are a few different ways in which we can plot the fitted model. First, we can of course make a scatterplot of the data and add a curve showing the fitted values corresponding to the different points. These can be obtained by running predict(m) with our fitted model m.

# Fit two models:
mtcars$cyl <- factor(mtcars$cyl)
m1 <- lm(mpg ~ hp + wt, data = mtcars) # Simple model
m2 <- lm(mpg ~ hp*wt + cyl, data = mtcars) # Complex model

# Create data frames with fitted values:
m1_pred <- data.frame(hp = mtcars$hp, mpg_pred = predict(m1))
m2_pred <- data.frame(hp = mtcars$hp, mpg_pred = predict(m2))

# Plot fitted values:
library(ggplot2)
ggplot(mtcars, aes(hp, mpg)) +
      geom_point() + 
      geom_line(data = m1_pred, aes(x = hp, y = mpg_pred),
                colour = "red") + 
      geom_line(data = m2_pred, aes(x = hp, y = mpg_pred),
                colour = "blue")

We could also plot the observed values against the fitted values:

n <- nrow(mtcars)
models <- data.frame(Observed = rep(mtcars$mpg, 2),
                     Fitted = c(predict(m1), predict(m2)),
                     Model = rep(c("Model 1", "Model 2"), c(n, n)))

ggplot(models, aes(Fitted, Observed)) +
      geom_point(colour = "blue") +
      facet_wrap(~ Model, nrow = 3) +
      geom_abline(intercept = 0, slope = 1) +
      xlab("Fitted values") + ylab("Observed values")

Linear models are fitted and analysed using a number of assumptions, most of which are assessed by looking at plots of the model residuals, \(y_i-\hat{y}_i\), where \(\hat{y}_i\) is the fitted value for observation \(i\). Some important assumptions are:

  • The model is linear in the parameters: we check this by looking for non-linear patterns in the residuals, or in the plot of observed against fitted values.
  • The observations are independent: this can be difficult to assess visually. We’ll look at models that are designed to handle correlated observations in Sections 8.4 and 9.6.
  • Homoscedasticity: that the random errors all have the same variance. We check this by looking for non-constant variance in the residuals. The opposite of homoscedasticity is heteroscedasticity.
  • Normally distributed random errors: this assumption is important if we want to use the traditional parametric p-values, confidence intervals and prediction intervals. If we use permutation p-values or bootstrap intervals (as we will later in this chapter), we no longer need this assumption.

Additionally, residual plots can be used to find influential points that (possibly) have a large impact on the model coefficients (influence is measured using Cook’s distance and potential influence using leverage). We’ve already seen that we can use plot(m) to create some diagnostic plots. To get more and better-looking plots, we can use the autoplot function for lm objects from the ggfortify package:

library(ggfortify)
autoplot(m1, which = 1:6, ncol = 2, label.size = 3)

In each of the plots, we look for the following:

  • Residuals versus fitted: look for patterns that can indicate non-linearity, e.g. that the residuals all are high in some areas and low in others. The blue line is there to aid the eye - it should ideally be relatively close to a straight line (in this case, it isn’t perfectly straight, which could indicate a mild non-linearity).
  • Normal Q-Q: see if the points follow the line, which would indicate that the residuals (which we for this purpose can think of as estimates of the random errors) follow a normal distribution.
  • Scale-Location: similar to the residuals versus fitted plot, this plot shows whether the residuals are evenly spread for different values of the fitted values. Look for patterns in how much the residuals vary - if they e.g. vary more for large fitted values, then that is a sign of heteroscedasticity. A horizontal blue line is a sign of homoscedasticity.
  • Cook’s distance: look for points with high values. A commonly-cited rule-of-thumb (Cook & Weisberg, 1982) says that values above 1 indicate points with a high influence.
  • Residuals versus leverage: look for points with a high residual and high leverage. Observations with a high residual but low leverage deviate from the fitted model but don’t affect it much. Observations with a high residual and a high leverage likely have a strong influence on the model fit, meaning that the fitted model could be quite different if these points were removed from the dataset.
  • Cook’s distance versus leverage: look for observations with a high Cook’s distance and a high leverage, which are likely to have a strong influence on the model fit.

A formal test for heteroscedasticity, the Breusch-Pagan test, is available in the car package as a complement to graphical inspection. A low p-value indicates statistical evidence for heteroscedasticity. To run the test, we use ncvTest (where “ncv” stands for non-constant variance):

install.packages("car")
library(car)
ncvTest(m1)

A common problem in linear regression models is multicollinearity, i.e. explanatory variables that are strongly correlated. Multicollinearity can cause your \(\beta\) coefficients and p-values to change greatly if there are small changes in the data, rendering them unreliable. To check if you have multicollinearity in your data, you can create a scatterplot matrix of your explanatory variables, as in Section 4.8.1:

library(GGally)
ggpairs(mtcars[, -1])

In this case, there are some highly correlated pairs, hp and disp among them. As a numerical measure of collinearity, we can use the generalised variance inflation factor (GVIF), given by the vif function in the car package:

library(car)
m <- lm(mpg ~ ., data = mtcars)
vif(m)

A high GVIF indicates that a variable is highly correlated with other explanatory variables in the dataset. Recommendations for what counts as a “high GVIF” vary, from 2.5 to 10 or more.

You can mitigate problems related to multicollinearity by:

  • Removing one or more of the correlated variables from the model, as sketched below (because they are strongly correlated, they measure almost the same thing anyway!),
  • Centring your explanatory variables (particularly if you include polynomial terms),
  • Using a regularised regression model (which we’ll do in Section 9.4).
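
As a sketch of the first of these options, we can refit the mtcars model without disp (which we saw was strongly correlated with hp) and check whether the GVIF values decrease:

# Refit the model without disp and compare the GVIF values:
m_reduced <- lm(mpg ~ . - disp, data = mtcars)
vif(m_reduced)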

\[\sim\]

Exercise 8.4 Below are two simulated datasets. One exhibits a nonlinear dependence between the variables, and the other exhibits heteroscedasticity. Fit a model with y as the response variable and x as the explanatory variable for each dataset, and make some residual plots. Which dataset suffers from which problem?
exdata1 <- data.frame(
   x = c(2.99, 5.01, 8.84, 6.18, 8.57, 8.23, 8.48, 0.04, 6.80,
         7.62, 7.94, 6.30, 4.21, 3.61, 7.08, 3.50, 9.05, 1.06,
         0.65, 8.66, 0.08, 1.48, 2.96, 2.54, 4.45),
   y = c(5.25, -0.80, 4.38, -0.75, 9.93, 13.79, 19.75, 24.65,
         6.84, 11.95, 12.24, 7.97, -1.20, -1.76, 10.36, 1.17,
         15.41, 15.83, 18.78, 12.75, 24.17, 12.49, 4.58, 6.76,
         -2.92))

exdata2 <- data.frame(
   x = c(5.70, 8.03, 8.86, 0.82, 1.23, 2.96, 0.13, 8.53, 8.18,
         6.88, 4.02, 9.11, 0.19, 6.91, 0.34, 4.19, 0.25, 9.72,
         9.83, 6.77, 4.40, 4.70, 6.03, 5.87, 7.49),
   y = c(21.66, 26.23, 19.82, 2.46, 2.83, 8.86, 0.25, 16.08,
         17.67, 24.86, 8.19, 28.45, 0.52, 19.88, 0.71, 12.19,
         0.64, 25.29, 26.72, 18.06, 10.70, 8.27, 15.49, 15.58,
         19.17))

(Click here to go to the solution.)


Exercise 8.5 We continue our investigation of the weather models from Exercises 8.1 and 8.3.

  1. Plot the observed values against the fitted values for the two models that you’ve fitted. Does either model seem to have a better fit?

  2. Create residual plots for the second model from Exercise 8.3. Are there any influential points? Any patterns? Any signs of heteroscedasticity?

(Click here to go to the solution.)

8.1.5 Transformations

If your data displays signs of heteroscedasticity or non-normal residuals, you can sometimes use a Box-Cox transformation (Box & Cox, 1964) to mitigate those problems. The Box-Cox transformation is applied to your dependent variable \(y\), and its shape is determined by a parameter \(\lambda\). The transformation is defined as \(\frac{y_i^\lambda-1}{\lambda}\) if \(\lambda\neq 0\) and \(\ln(y_i)\) if \(\lambda=0\); \(\lambda=1\) corresponds to no transformation at all. The boxcox function in MASS is useful for finding an appropriate choice of \(\lambda\). Choose a \(\lambda\) that is close to the peak of the curve plotted by boxcox (inside the interval indicated by the outer dotted lines):

m <- lm(mpg ~ hp + wt, data = mtcars)

library(MASS)
boxcox(m)
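
If we want to extract the \(\lambda\) that maximises the profile log-likelihood programmatically, we can use the fact that boxcox (invisibly) returns the grid of \(\lambda\) values and the corresponding log-likelihoods:

# boxcox returns a list with lambda values (x) and
# profile log-likelihoods (y):
bc <- boxcox(m)
lambda <- bc$x[which.max(bc$y)]
lambda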

In this case, the curve indicates that \(\lambda=0\), which corresponds to a log-transformation, could be a good choice. Let’s give it a go:

mtcars$logmpg <- log(mtcars$mpg)
m_bc <- lm(logmpg ~ hp + wt, data = mtcars)
summary(m_bc)

library(ggfortify)
autoplot(m_bc, which = 1:6, ncol = 2, label.size = 3)

The model fit seems to have improved after the transformation. The downside is that we are now modelling the log-mpg rather than mpg, which makes the model coefficients a little difficult to interpret.
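If we want interpretable effect sizes despite the transformation, we can exponentiate the coefficients: because the model is linear on the log scale, \(e^{\beta_j}\) describes the multiplicative change in mpg when the \(j\)th explanatory variable is increased by one unit:

# Multiplicative effects on mpg (rather than additive
# effects on log-mpg):
exp(coef(m_bc))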

\[\sim\]

Exercise 8.6 Run boxcox with your model from Exercise 8.3. Does it indicate that a transformation can be useful for your model?

(Click here to go to the solution.)

8.1.6 Alternatives to lm

Non-normal regression errors can sometimes be an indication that you need to transform your data, that your model is missing an important explanatory variable, that there are interaction effects that aren’t accounted for, or that the relationship between the variables is non-linear. But sometimes, you get non-normal errors simply because the errors are non-normal.

The p-values reported by summary are computed under the assumption of normally distributed regression errors, and can be sensitive to deviations from normality. An alternative is to use the lmp function from the lmPerm package, which provides permutation test p-values instead. This doesn’t affect the model fitting in any way - the only difference is how the p-values are computed. Moreover, the syntax for lmp is identical to that of lm:

# First, install lmPerm:
install.packages("lmPerm")

# Get summary table with permutation p-values:
library(lmPerm)
m <- lmp(mpg ~ hp + wt, data = mtcars)
summary(m)

In some cases, you need to change the arguments of lmp to get reliable p-values. We’ll have a look at that in Exercise 8.12. Relatedly, in Section 8.1.7 we’ll see how to construct bootstrap confidence intervals for the parameter estimates.

Another option that does affect the model fitting is to use a robust regression model based on M-estimators. Such models tend to be less sensitive to outliers, and can be useful if you are concerned about the influence of deviating points. The rlm function in MASS is used for this. As was the case for lmp, the syntax for rlm is identical to that of lm:

library(MASS)
m <- rlm(mpg ~ hp + wt, data = mtcars)
summary(m)

Another option is to use Bayesian estimation, which we’ll discuss in Section 8.1.13.

\[\sim\]

Exercise 8.7 Refit your model from Exercise 8.3 using lmp. Are the two main effects still significant?

(Click here to go to the solution.)

8.1.7 Bootstrap confidence intervals for regression coefficients

Assuming normality, we can obtain parametric confidence intervals for the model coefficients using confint:

m <- lm(mpg ~ hp + wt, data = mtcars)

confint(m)

I usually prefer to use bootstrap confidence intervals, which we can obtain using boot and boot.ci, as we’ll do next. Note that the only random part in the linear model \[y_i=\beta_0 +\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip} + \epsilon_i,\qquad i=1,\ldots,n\] is the error term \(\epsilon_i\). In most cases, it is therefore this term (and this term only) that we wish to resample. The explanatory variables should remain constant throughout the resampling process; the inference is conditioned on the values of the explanatory variables.

To achieve this, we’ll resample from the model residuals, and add those to the values predicted by the fitted function, which creates new bootstrap values of the response variable. We’ll then fit a linear model to these values, from which we obtain observations from the bootstrap distribution of the model coefficients.

It turns out that the bootstrap performs better if we resample not from the original residuals \(e_1,\ldots,e_n\), but from scaled and centred residuals \(r_i-\bar{r}\), where each \(r_i\) is a scaled version of residual \(e_i\), scaled by the leverage \(h_i\):

\[r_i=\frac{e_i}{\sqrt{1-h_i}},\] see Chapter 6 of Davison & Hinkley (1997) for details. The leverages can be computed using lm.influence.

We implement this procedure in the code below (and will then have a look at convenience functions that help us achieve the same thing more easily). It makes use of formula, which can be used to extract the model formula from regression models:

library(boot)

coefficients <- function(formula, data, i, predictions, residuals) {
      # Create the bootstrap value of response variable by
      # adding a randomly drawn scaled residual to the value of
      # the fitted function for each observation:
      data[,all.vars(formula)[1]] <- predictions + residuals[i]
      
      # Fit a new model with the bootstrap value of the response
      # variable and the original explanatory variables:
      m <- lm(formula, data = data)
      return(coef(m))
}

# Fit the linear model:
m <- lm(mpg ~ hp + wt, data = mtcars)

# Compute scaled and centred residuals:
res <- residuals(m)/sqrt(1 - lm.influence(m)$hat)
res <- res - mean(res)

# Run the bootstrap, extracting the model formula and the
# fitted function from the model m:
boot_res <- boot(data = mtcars, statistic = coefficients,
                R = 999, formula = formula(m),
                predictions = predict(m),
                residuals = res)

# Compute 95 % confidence intervals:
boot.ci(boot_res, type = "perc", index = 1) # Intercept
boot.ci(boot_res, type = "perc", index = 2) # hp
boot.ci(boot_res, type = "perc", index = 3) # wt

The argument index in boot.ci should be the row number of the parameter in the table given by summary. The intercept is on the first row, and so its index is 1, hp is on the second row and its index is 2, and so on.

Clearly, the above code is a little unwieldy. Fortunately, the car package contains a function called Boot that can be used to bootstrap regression models in the exact same way:

library(car)

boot_res <- Boot(m, method = "residual", R = 9999)

# Compute 95 % confidence intervals:
confint(boot_res, type = "perc")

Finally, the most convenient approach is to use boot_summary from the boot.pval package. It provides a data frame with estimates, bootstrap confidence intervals, and bootstrap p-values (computed using interval inversion) for the model coefficients. The arguments specify what interval type and resampling strategy to use (more on the latter in Exercise 8.9):

library(boot.pval)
boot_summary(m, type = "perc", method = "residual", R = 9999)

\[\sim\]

Exercise 8.8 Refit your model from Exercise 8.3 using a robust regression estimator with rlm. Compute confidence intervals for the coefficients of the robust regression model.

(Click here to go to the solution.)


Exercise 8.9 In an alternative bootstrap scheme for regression models, often referred to as case resampling, the observations (or cases) \((y_i, x_{i1},\ldots,x_{ip})\) are resampled instead of the residuals. This approach can be applied when the explanatory variables can be treated as being random (but measured without error) rather than fixed. It can also be useful for models with heteroscedasticity, as it doesn’t rely on assumptions about constant variance (which, on the other hand, makes it less efficient if the errors actually are homoscedastic).

Read the documentation for boot_summary to see how you can compute confidence intervals for the coefficients in the model m <- lm(mpg ~ hp + wt, data = mtcars) using case resampling. Do they differ substantially from those obtained using residual resampling in this case?

(Click here to go to the solution.)

8.1.8 Alternative summaries with broom

The broom package contains some useful functions when working with linear models (and many other common models), which allow us to get various summaries of the model fit in useful formats. Let’s install it:

install.packages("broom")

A model fitted with lm is stored as a list with lots of elements:

m <- lm(mpg ~ hp + wt, data = mtcars)
str(m)

How can we access the information about the model? For instance, we may want to get the summary table from summary, but as a data frame rather than as printed text. Here are two ways of doing this, using summary and the tidy function from broom:

# Using base R:
summary(m)$coefficients

# Using broom:
library(broom)
tidy(m)

tidy is the better option if you want to retrieve the table as part of a pipeline. For instance, if you want to adjust the p-values for multiplicity using Bonferroni correction (Section 7.2.5), you could do as follows:

library(magrittr)
mtcars %>% 
    lm(mpg ~ hp + wt, data = .) %>%
    tidy() %$% 
    p.adjust(p.value, method = "bonferroni")

If you prefer bootstrap p-values, you can use boot_summary from boot.pval similarly. That function also includes an argument for adjusting the p-values for multiplicity:

library(boot.pval)
lm(mpg ~ hp + wt, data = mtcars) %>%
    boot_summary(adjust.method = "bonferroni")

Another useful function in broom is glance, which lets us get some summary statistics about the model:

glance(m)

Finally, augment can be used to add predicted values, residuals, and Cook’s distances to the dataset used for fitting the model, which of course can be very useful for model diagnostics:

# To get the data frame with predictions and residuals added:
augment(m)

# To plot the observed values against the fitted values:
library(ggplot2)
mtcars %>% 
    lm(mpg ~ hp + wt, data = .) %>%
    augment() %>% 
    ggplot(aes(.fitted, mpg)) +
      geom_point() +
      xlab("Fitted values") + ylab("Observed values")

8.1.9 Variable selection

A common question when working with linear models is what variables to include in your model. Common practices for variable selection include stepwise regression methods, where variables are added to or removed from the model depending on p-values, \(R^2\) values, or information criteria like AIC or BIC.

Don’t ever do this if your main interest is p-values. Stepwise regression increases the risk of type I errors, renders the p-values of your final model invalid, and can lead to over-fitting; see e.g. Smith (2018). Instead, you should let your research hypothesis guide your choice of variables, or base your choice on a pilot study.

If your main interest is prediction, then that is a completely different story. For predictive models, it is usually recommended that variable selection and model fitting should be done simultaneously. This can be done using regularised regression models, to which Section 9.4 is devoted.

8.1.10 Prediction

An important use of linear models is prediction. In R, this is done using predict. By providing a fitted model and a new dataset, we can get predictions.

Let’s use one of the models that we fitted to the mtcars data to make predictions for two cars that aren’t from the 1970’s. Below, we create a data frame with data for a 2009 Volvo XC90 D3 AWD (with a fuel consumption of 29 mpg) and a 2019 Ferrari Roma (15.4 mpg):

new_cars <- data.frame(hp = c(161, 612), wt = c(4.473, 3.462),
                       row.names = c("Volvo XC90", "Ferrari Roma"))

To get the model predictions for these new cars, we run the following:

predict(m, new_cars)

predict also lets us obtain prediction intervals for our prediction, under the assumption of normality58. To get 90 % prediction intervals, we add interval = "prediction" and level = 0.9:

m <- lm(mpg ~ hp + wt, data = mtcars)
predict(m, new_cars,
        interval = "prediction",
        level = 0.9)

If we were using a transformed \(y\)-variable, we’d probably have to transform the predictions back to the original scale for them to be useful:

mtcars$logmpg <- log(mtcars$mpg)
m_bc <- lm(logmpg ~ hp + wt, data = mtcars)

preds <- predict(m_bc, new_cars,
        interval = "prediction",
        level = 0.9)

# Predictions for log-mpg:
preds
# Transform back to original scale:
exp(preds)

The lmp function that we used to compute permutation p-values does not offer prediction intervals. We can however compute bootstrap prediction intervals using the code below. Prediction intervals try to capture two sources of uncertainty:

  • Model uncertainty, which we will capture by resampling the data and making predictions for the expected value of the observation,
  • Random noise, i.e. the fact that almost all observations deviate from their expected value. We will capture this by resampling residuals from the fitted bootstrap models.

Consequently, the value that we generate in each bootstrap replication will be the sum of a prediction and a resampled residual (see Davison & Hinkley (1997), Section 6.3, for further details):

boot_pred <- function(data, new_data, model, i,
                      formula, predictions, residuals){
      # Resample residuals and fit new model:
      data[,all.vars(formula)[1]] <- predictions + residuals[i]
      m_boot <- lm(formula, data = data)
      
      # We use predict to get an estimate of the
      # expectation of new observations, and then
      # add resampled residuals to also include the
      # natural variation around the expectation:
      predict(m_boot, newdata = new_data) + 
         sample(residuals, nrow(new_data))
}

library(boot)

m <- lm(mpg ~ hp + wt, data = mtcars)

# Compute scaled and centred residuals:
res <- residuals(m)/sqrt(1 - lm.influence(m)$hat)
res <- res - mean(res)

boot_res <- boot(data = m$model,
                     statistic = boot_pred,
                     R = 999,
                     model = m,
                     new_data = new_cars,
                     formula = formula(m),
                     predictions = predict(m),
                     residuals = res)

# 90 % bootstrap prediction intervals:
boot.ci(boot_res, type = "perc", index = 1, conf = 0.9) # Volvo
boot.ci(boot_res, type = "perc", index = 2, conf = 0.9) # Ferrari

\[\sim\]

Exercise 8.10 Use your model from Exercise 8.3 to compute a bootstrap prediction interval for the temperature on a day with precipitation but no sun hours.

(Click here to go to the solution.)

8.1.11 Prediction for multiple datasets

In certain cases, we wish to fit different models to different subsets of the data. Functionals like apply and map (Section 6.5) are handy when you want to fit several models at once. Below is an example of how we can use split (Section 5.2.1) and tools from the purrr package (Section 6.5.3) to fit the models simultaneously, as well as for computing the fitted values in a single line of code:

# Split the dataset into three groups depending on the
# number of cylinders:
library(magrittr)
mtcars_by_cyl <- mtcars %>% split(.$cyl)

# Fit a linear model to each subgroup:
library(purrr)
models <- mtcars_by_cyl %>% map(~ lm(mpg ~ hp + wt, data = .))

# Compute the fitted values for each model:
map2(models, mtcars_by_cyl, predict)

We’ll make use of this approach when we study linear mixed models in Section 8.4.

8.1.12 ANOVA

Linear models are also used for analysis of variance (ANOVA), in which we test whether the means of different groups differ. We’ll use the mtcars data to give some examples of this. Let’s say that we want to investigate whether the mean fuel consumption (mpg) of cars differs depending on the number of cylinders (cyl), and that we want to include the type of transmission (am) as a blocking variable.

To get an ANOVA table for this problem, we must first convert the explanatory variables to factor variables, as the variables in mtcars are all numeric (despite some of them being categorical). We can then use aov to fit the model, followed by summary:

# Convert variables to factors:
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am)

# Fit model and print ANOVA table:
m <- aov(mpg ~ cyl + am, data = mtcars)
summary(m)

(aov actually uses lm to fit the model, but by using aov we specify that we want an ANOVA table to be printed by summary.)
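
Indeed, we can get the same sequential ANOVA table by fitting the model with lm and passing it to anova:

# Equivalent to summary(aov(mpg ~ cyl + am, data = mtcars)):
anova(lm(mpg ~ cyl + am, data = mtcars))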

When there are different numbers of observations in the groups in an ANOVA, so that we have an unbalanced design, the sums of squares used to compute the test statistics can be computed in at least three different ways, commonly called type I, II and III. See Herr (1986) for an overview and discussion of this.

summary prints a type I ANOVA table, which isn’t the best choice for unbalanced designs. We can however get type II or III tables by instead using Anova from the car package to print the table:

library(car)
Anova(m, type = "II")
Anova(m, type = "III") # Default in SAS and SPSS.

As a guideline, for unbalanced designs, you should use type II tables if there are no interactions, and type III tables if there are interactions. To look for interactions, we can use interaction.plot to create a two-way interaction plot:

interaction.plot(mtcars$am, mtcars$cyl, response = mtcars$mpg)

In this case, there is no sign of an interaction between the two variables, as the lines are more or less parallel. A type II table is therefore probably the best choice here.

We can obtain diagnostic plots the same way we did for other linear models:

library(ggfortify)
autoplot(m, which = 1:6, ncol = 2, label.size = 3)

To find which groups have significantly different means, we can use a post hoc test like Tukey’s HSD, available through the TukeyHSD function:

TukeyHSD(m)

We can visualise the results of Tukey’s HSD with plot, which shows 95 % confidence intervals for the mean differences:

# When the difference isn't significant, the dashed line indicating
# "no differences" falls within the confidence interval for
# the difference:
plot(TukeyHSD(m, "am"))

# When the difference is significant, the dashed line does not
# fall within the confidence interval:
plot(TukeyHSD(m, "cyl"))

\[\sim\]

Exercise 8.11 Return to the residual plots that you created with autoplot. Figure out how you can plot points belonging to different cyl groups in different colours.

(Click here to go to the solution.)


Exercise 8.12 The aovp function in the lmPerm package can be utilised to perform permutation tests instead of the classical parametric ANOVA tests. Rerun the analysis in the example above, using aovp instead. Do the conclusions change? What happens if you run your code multiple times? Does using summary on a model fitted using aovp generate a type I, II or III table by default? Can you change what type of table it produces?

(Click here to go to the solution.)


Exercise 8.13 In the case of a one-way ANOVA (i.e. ANOVA with a single explanatory variable), the Kruskal-Wallis test can be used as a nonparametric option. It is available in kruskal.test. Use the Kruskal-Wallis test to run a one-way ANOVA for the mtcars data, with mpg as the response variable and cyl as an explanatory variable.

(Click here to go to the solution.)

8.1.13 Bayesian estimation of linear models

We can fit Bayesian linear models using the rstanarm package. To fit a model to the mtcars data using all explanatory variables, we can use stan_glm in place of lm as follows:

library(rstanarm)
m <- stan_glm(mpg ~ ., data = mtcars)

# Print the estimates:
coef(m)

Next, we can plot the posterior distributions of the effects:

plot(m, "dens", pars = names(coef(m)))

To get 95 % credible intervals for the effects, we can use posterior_interval:

posterior_interval(m, 
        pars = names(coef(m)),
        prob = 0.95)

We can also plot them using plot:

plot(m, "intervals",
        pars = names(coef(m)),
        prob = 0.95)

Finally, we can use \(\hat{R}\) to check model convergence. It should be less than 1.1 if the fitting has converged:

plot(m, "rhat")

Like for lm, residuals(m) provides the model residuals, which can be used for diagnostics. For instance, we can plot the residuals against the fitted values to look for signs of non-linearity, adding a curve to aid the eye:

model_diag <- data.frame(Fitted = predict(m),
                         Residual = residuals(m))

library(ggplot2)
ggplot(model_diag, aes(Fitted, Residual)) +
      geom_point() +
      geom_smooth(se = FALSE)

For fitting ANOVA models, we can instead use stan_aov with the argument prior = R2(location = 0.5) to fit the model.
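
A minimal sketch, using the ANOVA example from Section 8.1.12 (this assumes that cyl and am have already been converted to factors):

library(rstanarm)

# Bayesian ANOVA model; stan_aov requires a prior on R^2:
m_aov <- stan_aov(mpg ~ cyl + am, data = mtcars,
                  prior = R2(location = 0.5))
m_aov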

8.2 Ethical issues in regression modelling

The p-hacking problem, discussed in Section 7.4, is perhaps particularly prevalent in regression modelling. Regression analysis often involves a large number of explanatory variables, and practitioners often try out several different models (e.g. by performing stepwise variable selection; see Section 8.1.9). Because so many hypotheses are tested, often in many different but similar models, there is a large risk of false discoveries.

In any regression analysis, there is a risk of finding spurious relationships. These are dependencies between the response variable and an explanatory variable that either are non-causal or purely coincidental. As an example of the former, consider the number of deaths by drowning, which is strongly correlated with ice cream sales. Not because ice cream causes people to drown, but because both are affected by the weather: we are more likely to go swimming or buy ice cream on hot days. Lurking variables, like the temperature in the ice cream-drowning example, are commonly referred to as confounding factors. An effect may be statistically significant, but that does not necessarily mean that it is meaningful.

\[\sim\]

Exercise 8.14 Discuss the following. You are tasked with analysing a study on whether Vitamin D protects against the flu. One group of patients are given Vitamin D supplements, and one group is given a placebo. You plan on fitting a regression model to estimate the effect of the vitamin supplements, but note that some confounding factors that you have reason to believe are of importance, such as age and ethnicity, are missing from the data. You can therefore not include them as explanatory variables in the model. Should you still fit the model?


Exercise 8.15 Discuss the following. You are fitting a linear regression model to a dataset from a medical study on a new drug which potentially can have serious side effects. The test subjects take a risk by participating in the study. Each observation in the dataset corresponds to a test subject. Like all ordinary linear regression models, your model gives more weight to observations that deviate from the average (and have a high leverage or Cook’s distance). Given the risks involved for the test subjects, is it fair to give different weight to data from different individuals? Is it OK to remove outliers because they influence the results too much, meaning that the risk that the subject took was for nought?

8.3 Generalised linear models

Generalised linear models, abbreviated GLM, are (yes) a generalisation of the linear model that can be used when your response variable has a non-normal error distribution. Typical examples are when your response variable is binary (only takes two values, e.g. 0 or 1), or is a count of something. Fitting GLM’s is more or less entirely analogous to fitting linear models in R, but model diagnostics are very different. In this section, we will look at some examples of how it can be done.

8.3.1 Modelling proportions: Logistic regression

As the first example of binary data, we will consider the wine quality dataset wine from Cortez et al. (2009), which is available in the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Wine+Quality. It contains measurements on white and red vinho verde wine samples from northern Portugal.

We start by loading the data. It is divided into two separate .csv files, one for white wines and one for red, which we have to merge:

# Import data about white and red wines:
white <- read.csv("https://tinyurl.com/winedata1",
                  sep = ";")
red <- read.csv("https://tinyurl.com/winedata2",
                  sep = ";")

# Add a type variable:
white$type <- "white"
red$type <- "red"

# Merge the datasets:
wine <- rbind(white, red)
wine$type <- factor(wine$type)

# Check the result:
summary(wine)

We are interested in seeing if measurements like pH (pH) and alcohol content (alcohol) can be used to determine the colour of a wine. The colour is represented by the type variable, which is binary.

Our model is that the type of a randomly selected wine is binomial \(Bin(1, \pi_i)\)-distributed (Bernoulli distributed), where \(\pi_i\) depends on explanatory variables like pH and alcohol content. A common model for this situation is a logistic regression model. Given \(n\) observations of \(p\) explanatory variables, the model is:

\[\log\Big(\frac{\pi_i}{1-\pi_i}\Big)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip},\qquad i=1,\ldots,n\] Whereas in linear regression models we model the expected value of the response variable as a linear function of the explanatory variables, we now model a function of the expected value of the response variable (that is, a function of \(\pi_i\)) as a linear function. In GLM terminology, this function is known as a link function.

Logistic regression models can be fitted using the glm function. To specify what our model is, we use the argument family = binomial:

m <- glm(type ~ pH + alcohol, data = wine, family = binomial)
summary(m)

The p-values presented in the summary table are based on a Wald test known to have poor performance unless the sample size is very large (Agresti, 2013). In this case, with a sample size of 6,497, it is probably safe to use, but for smaller sample sizes, it is preferable to use a bootstrap test instead, which you will do in Exercise 8.18.

The coefficients of a logistic regression model aren’t as straightforward to interpret as those in a linear model. If we let \(\beta\) denote a coefficient corresponding to an explanatory variable \(x\), then:

  • If \(\beta\) is positive, then \(\pi_i\) increases when \(x_i\) increases.
  • If \(\beta\) is negative, then \(\pi_i\) decreases when \(x_i\) increases.
  • \(e^\beta\) is the odds ratio, which shows how much the odds \(\frac{\pi_i}{1-\pi_i}\) change when \(x_i\) is increased by one unit.

We can extract the coefficients and odds ratios using coef:

coef(m)      # Coefficients, beta
exp(coef(m)) # Odds ratios

To find the fitted probability that an observation belongs to the second class we can use predict(m, type = "response"):

# Check which class is the second one:
levels(wine$type) 
# "white" is the second class!

# Get fitted probabilities:
probs <- predict(m, type = "response")

# Check what the average prediction is for
# the two groups:
mean(probs[wine$type == "red"])
mean(probs[wine$type == "white"])

It turns out that the model predicts that most wines are white - even the red ones! The reason may be that we have more white wines (4,898) than red wines (1,599) in the dataset. Adding more explanatory variables could perhaps solve this problem. We’ll give that a try in the next section.

\[\sim\]

Exercise 8.16 Download the sharks.csv file from the book’s web page. It contains information about shark attacks in South Africa. Using data on attacks that occurred in 2000 or later, fit a logistic regression model to investigate whether the age and sex of the individual that was attacked affect the probability of the attack being fatal.

Note: save the code for your model, as you will return to it in the subsequent exercises.

(Click here to go to the solution.)


Exercise 8.17 In Section 8.1.8 we saw how some functions from the broom package could be used to get summaries of linear models. Try using them with the wine data model that we created above. Do the broom functions work for generalised linear models as well?

(Click here to go to the solution.)

8.3.2 Bootstrap confidence intervals

In a logistic regression, the response variable \(y_i\) is a binomial (or Bernoulli) random variable with success probability \(\pi_i\). In this case, we don’t want to resample residuals to create confidence intervals, as it turns out that this can lead to predicted probabilities outside the range \((0,1)\). Instead, we can either use the case resampling strategy described in Exercise 8.9 or use a parametric bootstrap approach where we generate new binomial variables (Section 7.1.2) to construct bootstrap confidence intervals.

To use case resampling, we can use boot_summary from boot.pval:

library(boot.pval)

m <- glm(type ~ pH + alcohol, data = wine, family = binomial)

boot_summary(m, type = "perc", method = "case")

In the parametric approach, for each observation, the fitted success probability from the logistic model will be used to sample new observations of the response variable. This method can work well if the model is well-specified but tends to perform poorly for misspecified models, so make sure to carefully perform model diagnostics (as described in the next section) before applying it. To use the parametric approach, we can do as follows:

library(boot)

coefficients <- function(formula, data, predictions, ...) {
  # Check whether the response variable is a factor or
  # numeric, and then resample:
  if(is.factor(data[,all.vars(formula)[1]])) {
      # If the response variable is a factor:
      data[,all.vars(formula)[1]] <-
         factor(levels(data[,all.vars(formula)[1]])[1 + rbinom(nrow(data),
                                              1, predictions)]) } else {
      # If the response variable is numeric:
      data[,all.vars(formula)[1]] <-
         unique(data[,all.vars(formula)[1]])[1 + rbinom(nrow(data),
                                              1, predictions)] }
  
      m <- glm(formula, data = data, family = binomial)
      return(coef(m))
}

m <- glm(type ~ pH + alcohol, data = wine, family = binomial)

boot_res <- boot(data = wine, statistic = coefficients,
                R = 999, 
                formula = formula(m),
                predictions = predict(m, type = "response"))

# Compute confidence intervals:
boot.ci(boot_res, type = "perc", index = 1) # Intercept
boot.ci(boot_res, type = "perc", index = 2) # pH
boot.ci(boot_res, type = "perc", index = 3) # Alcohol

\[\sim\]

Exercise 8.18 Use the model that you fitted to the sharks.csv data in Exercise 8.16 for the following:

  1. When the MASS package is loaded, you can use confint to obtain (asymptotic) confidence intervals for the parameters of a GLM. Use it to compute confidence intervals for the parameters of your model for the sharks.csv data.

  2. Compute parametric bootstrap confidence intervals and p-values for the parameters of your logistic regression model for the sharks.csv data. Do they differ from the intervals obtained using confint? Note that there are a lot of missing values for the response variable. Think about how that will affect your bootstrap intervals and adjust your code accordingly.

  3. Use the confidence interval inversion method of Section 7.7.3 to compute bootstrap p-values for the effect of age.

(Click here to go to the solution.)

8.3.3 Model diagnostics

It is notoriously difficult to assess model fit for GLM’s, because the behaviour of the residuals is very different from residuals in ordinary linear models. In the case of logistic regression, the response variable is always 0 or 1, meaning that there will be two bands of residuals:

# Store deviance residuals:
m <- glm(type ~ pH + alcohol, data = wine, family = binomial)
res <- data.frame(Predicted = predict(m),
                  Residuals = residuals(m, type = "deviance"),
                  Index = 1:nrow(m$data),
                  CooksDistance = cooks.distance(m))

# Plot fitted values against the deviance residuals:
library(ggplot2)
ggplot(res, aes(Predicted, Residuals)) +
      geom_point()

# Plot index against the deviance residuals:
ggplot(res, aes(Index, Residuals)) +
      geom_point()

Plots of raw residuals are of little use in logistic regression models. A better option is to use a binned residual plot, in which the observations are grouped into bins based on their fitted values. The average residual in each bin can then be computed, which will tell us which parts of the model have a poor fit. A function for this is available in the arm package:

install.packages("arm")

library(arm)
binnedplot(predict(m, type = "response"),
           residuals(m, type = "response"))

The grey lines show confidence bounds which are supposed to contain about 95 % of the bins. If too many points fall outside these bounds, it’s a sign that we have a poor model fit. In this case, there are a few points outside the bounds. Most notably, the average residuals are fairly large for the observations with the lowest fitted values, i.e. among the observations with the lowest predicted probability of being white wines.

Let’s compare the above plot to that for a model with more explanatory variables:

m2 <- glm(type ~ pH + alcohol + fixed.acidity + residual.sugar,
          data = wine, family = binomial)

binnedplot(predict(m2, type = "response"),
           residuals(m2, type = "response"))

This looks much better - adding more explanatory variables appears to have improved the model fit.

It’s worth repeating that if your main interest is hypothesis testing, you shouldn’t fit multiple models and then pick the one that gives the best results. However, if you’re doing an exploratory analysis or are interested in predictive modelling, you can and should try different models. It can then be useful to do a formal hypothesis test of the null hypothesis that m and m2 fit the data equally well, against the alternative that m2 has a better fit. If both fit the data equally well, we’d prefer m, since it is a simpler model. We can use anova to perform a likelihood ratio deviance test (see Section 12.4 for details), which tests this:

anova(m, m2, test = "LRT")

The p-value is very low, and we conclude that m2 has a better model fit.

Another useful function is cooks.distance, which computes the Cook’s distance for each observation and can be used to find influential observations. In this case, I’ve chosen to print the row numbers for the observations with a Cook’s distance greater than 0.004 - this cutoff has been arbitrarily chosen, in order only to highlight the observations with the highest Cook’s distance.

res <- data.frame(Index = 1:length(cooks.distance(m)),
                  CooksDistance = cooks.distance(m))

# Plot index against the Cook's distance to find
# influential points:
ggplot(res, aes(Index, CooksDistance)) +
      geom_point() +
      geom_text(aes(label = ifelse(CooksDistance > 0.004,
                                   rownames(res), "")),
                hjust = 1.1)

\[\sim\]

Exercise 8.19 Investigate the residuals for your sharks.csv model. Are there any problems with the model fit? Any influential points?

(Click here to go to the solution.)

8.3.4 Prediction

Just as for linear models, we can use predict to make predictions for new observations using a GLM. To begin with, let’s randomly sample 10 rows from the wine data and fit a model using all data except those ten observations:

# Randomly select 10 rows from the wine data:
rows <- sample(1:nrow(wine), 10)

m <- glm(type ~ pH + alcohol, data = wine[-rows,], family = binomial)

We can now use predict to make predictions for the ten observations:

preds <- predict(m, wine[rows,])
preds

Those predictions look a bit strange though - what are they? By default, predict returns predictions on the scale of the link function. That’s not really what we want in most cases - instead, we are interested in the predicted probabilities. To get those, we have to add the argument type = "response" to the call:

preds <- predict(m, wine[rows,], type = "response")
preds

Logistic regression models are often used for prediction, in what is known as classification. Section 9.1.7 is concerned with how to evaluate the predictive performance of logistic regression and other classification models.
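
As a small sketch of this, we can turn the fitted probabilities into class predictions by applying a cutoff, e.g. 0.5. Recall from Section 8.3.1 that the probabilities are for the second level of type, i.e. "white":

# Classify wines as white if the predicted probability
# exceeds 0.5, and as red otherwise:
pred_class <- ifelse(preds > 0.5, "white", "red")

# Compare the predictions to the observed types:
table(pred_class, wine$type[rows])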

8.3.5 Modelling count data

Logistic regression is but one of many types of GLM’s used in practice. One important example is Cox regression, which is used for survival data. We’ll return to that model in Section 8.5. For now, we’ll consider count data instead. Let’s have a look at the shark attack data in sharks.csv, available on the book’s website. It contains data about shark attacks in South Africa, downloaded from The Global Shark Attack File (http://www.sharkattackfile.net/incidentlog.htm). To load it, we download the file and set file_path to the path of sharks.csv:

sharks <- read.csv(file_path, sep = ";")

# Compute number of attacks per year:
attacks <- aggregate(Type ~ Year, data = sharks, FUN = length)

# Keep data for 1960-2019:
attacks <- subset(attacks, Year >= 1960)

The number of attacks in a year is not binary but a count that, in principle, can take any non-negative integer as its value. Are there any trends over time for the number of reported attacks?

# Plot data from 1960-2019:
library(ggplot2)
ggplot(attacks, aes(Year, Type)) +
      geom_point() +
      ylab("Number of attacks")

No trend is evident. To confirm this, let’s fit a regression model with Type (the number of attacks) as the response variable and Year as an explanatory variable. For count data like this, a good first model to use is Poisson regression. Let \(\mu_i\) denote the expected value of the response variable given the explanatory variables. Given \(n\) observations of \(p\) explanatory variables, the Poisson regression model is:

\[\log(\mu_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip},\qquad i=1,\ldots,n\] To fit it, we use glm as before, but this time with family = poisson:

m <- glm(Type ~ Year, data = attacks, family = poisson)
summary(m)

We can add the curve corresponding to the fitted model to our scatterplot as follows:

attacks_pred <- data.frame(Year = attacks$Year, at_pred =
                             predict(m, type = "response"))

ggplot(attacks, aes(Year, Type)) +
      geom_point() +
      ylab("Number of attacks") +
      geom_line(data = attacks_pred, aes(x = Year, y = at_pred),
                colour = "red")

The fitted model seems to confirm our view that there is no trend over time in the number of attacks.

For model diagnostics, we can use a binned residual plot and a plot of Cook’s distance to find influential points:

# Binned residual plot:
library(arm)
binnedplot(predict(m, type = "response"),
           residuals(m, type = "response"))


# Plot index against the Cook's distance to find
# influential points:
res <- data.frame(Index = 1:nrow(m$data),
                  CooksDistance = cooks.distance(m))
ggplot(res, aes(Index, CooksDistance)) +
      geom_point() +
      geom_text(aes(label = ifelse(CooksDistance > 0.1,
                                   rownames(res), "")),
                hjust = 1.1)

A common problem in Poisson regression models is excess zeros, i.e. more observations with value 0 than what is predicted by the model. To check the distribution of counts in the data, we can draw a histogram:

ggplot(attacks, aes(Type)) +
      geom_histogram(binwidth = 1, colour = "black")

If there are a lot of zeroes in the data, we should consider using another model, such as a hurdle model or a zero-inflated Poisson regression. Both of these are available in the pscl package.
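
Neither of these models is needed for the shark attack data (years with zero attacks don’t even appear in attacks, since we computed the counts by aggregating rows). As a hedged sketch of what fitting them could look like, here is a small simulated example with structural zeros; the data and variable names are made up purely for illustration:

install.packages("pscl")

library(pscl)

# Simulate count data where roughly 30 % of the observations
# are structural zeros:
set.seed(1)
exdata <- data.frame(x = runif(200, 0, 2))
exdata$y <- rpois(200, exp(0.5 + exdata$x)) *
            rbinom(200, 1, 0.7)

# Zero-inflated Poisson regression:
m_zip <- zeroinfl(y ~ x, data = exdata)
summary(m_zip)

# Hurdle model:
m_hurdle <- hurdle(y ~ x, data = exdata)
summary(m_hurdle)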

Another common problem is overdispersion, which occurs when there is more variability in the data than what is predicted by the GLM. A formal test of overdispersion (Cameron & Trivedi, 1990) is provided by dispersiontest in the AER package. The null hypothesis is that there is no overdispersion, and the alternative that there is overdispersion:

install.packages("AER")

library(AER)
dispersiontest(m, trafo = 1)

There are several alternative models that can be considered in the case of overdispersion. One of them is negative binomial regression, which uses the same link function as Poisson regression. We can fit it using the glm.nb function from MASS:

library(MASS)
m_nb <- glm.nb(Type ~ Year, data = attacks)

summary(m_nb)

For the shark attack data, the predictions from the two models are virtually identical, meaning that both are equally applicable in this case:

attacks_pred <- data.frame(Year = attacks$Year, at_pred =
                             predict(m, type = "response"))
attacks_pred_nb <- data.frame(Year = attacks$Year, at_pred =
                             predict(m_nb, type = "response"))

ggplot(attacks, aes(Year, Type)) +
      geom_point() +
      ylab("Number of attacks") +
      geom_line(data = attacks_pred, aes(x = Year, y = at_pred),
                colour = "red") +
      geom_line(data = attacks_pred_nb, aes(x = Year, y = at_pred),
                colour = "blue", linetype = "dashed")

Finally, we can obtain bootstrap confidence intervals, e.g. using case resampling, with boot_summary:

library(boot.pval)
boot_summary(m_nb, type = "perc", method = "case")

\[\sim\]

Exercise 8.20 The quakes dataset, available in base R, contains information about seismic events off Fiji. Fit a Poisson regression model with stations as the response variable and mag as an explanatory variable. Are there signs of overdispersion? Does using a negative binomial model improve the model fit?

(Click here to go to the solution.)

8.3.6 Modelling rates

Poisson regression models, and related models like negative binomial regression, can not only be used to model count data. They can also be used to model rate data, such as the number of cases per capita or the number of cases per unit area. In that case, we need to include an exposure variable \(N\) that describes e.g. the population size or area corresponding to each observation. The model then becomes:

\[\log(\mu_i/N_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip},\qquad i=1,\ldots,n.\] Because \(\log(\mu_i/N_i)=\log(\mu_i)-\log(N_i)\), this can be rewritten as:

\[\log(\mu_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip}+\log(N_i),\qquad i=1,\ldots,n.\] In other words, we should include \(\log(N_i)\) on the right-hand side of our model, with a known coefficient equal to 1. In regression, such a term is known as an offset. We can add it to our model using the offset function.

As an example, we’ll consider the ships data from the MASS package. It describes the number of damage incidents for different ship types operating in the 1960s and 1970s, and includes information about how many months each ship type was in service (i.e. each ship type’s exposure):

library(MASS)
?ships
View(ships)

For our example, we’ll use ship type as the explanatory variable, incidents as the response variable and service as the exposure variable. First, we remove observations with 0 exposure (by definition, these can’t be involved in incidents, and so there is no point in including them in the analysis). Then, we fit the model using glm and offset:

ships <- ships[ships$service != 0,]

m <- glm(incidents ~ type + offset(log(service)),
         data = ships,
         family = poisson)

summary(m)

Model diagnostics can be performed as in the previous sections.
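For instance, we can reuse the binned residual plot from the previous section:

# Binned residual plot for the rate model:
library(arm)
binnedplot(predict(m, type = "response"),
           residuals(m, type = "response"))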

Rate models are usually interpreted in terms of the rate ratios \(e^{\beta_j}\), which describe the multiplicative change in the rate when \(x_j\) is increased by one unit while all other variables are kept fixed. To compute the rate ratios for our model, we use exp:

exp(coef(m))
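If we want quick interval estimates to accompany these point estimates, one option is to exponentiate profile likelihood confidence intervals for the coefficients (the exercise below covers bootstrap intervals instead):

# Approximate confidence intervals for the rate ratios:
exp(confint(m))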

\[\sim\]

Exercise 8.21 Compute bootstrap confidence intervals for the rate ratios in the model for the ships data.

(Click here to go to the solution.)

8.3.7 Bayesian estimation of generalised linear models

We can fit a Bayesian GLM with the rstanarm package, using stan_glm in the same way we did for linear models. Let’s look at an example with the wine data. First, we load and prepare the data:

# Import data about white and red wines:
white <- read.csv("https://tinyurl.com/winedata1",
                  sep = ";")
red <- read.csv("https://tinyurl.com/winedata2",
                  sep = ";")
white$type <- "white"
red$type <- "red"
wine <- rbind(white, red)
wine$type <- factor(wine$type)

Now, we fit a Bayesian logistic regression model:

library(rstanarm)
m <- stan_glm(type ~ pH + alcohol, data = wine, family = binomial)

# Print the estimates:
coef(m)

Next, we can plot the posterior distributions of the effects:

plot(m, "dens", pars = names(coef(m)))

To get 95 % credible intervals for the effects, we can use posterior_interval. We can also use plot to visualise them:

posterior_interval(m, 
        pars = names(coef(m)),
        prob = 0.95)

plot(m, "intervals",
        pars = names(coef(m)),
        prob = 0.95)

Finally, we can use \(\hat{R}\) to check model convergence. It should be less than 1.1 if the fitting has converged:

plot(m, "rhat")

8.4 Mixed models

Mixed models are used in regression problems where measurements have been made on clusters of related units. As a first example, we’ll use a dataset from the lme4 package, which also contains useful methods for fitting mixed models. Let’s install it:

install.packages("lme4")

The sleepstudy dataset from lme4 contains data on reaction times from a sleep deprivation study. The participants were restricted to 3 hours of sleep per night, and their average reaction time on a series of tests was measured on each day of the study (days 0-9):

library(lme4)
?sleepstudy
str(sleepstudy)

Let’s start our analysis by making boxplots showing reaction times for each subject. We’ll also superimpose the observations for each participant on top of their boxplots:

library(ggplot2)
ggplot(sleepstudy, aes(Subject, Reaction)) +
      geom_boxplot() + 
      geom_jitter(aes(colour = Subject), 
                  position = position_jitter(0.1))

We are interested in finding out if the reaction times increase when the participants have been deprived of sleep for a longer period. Let’s try plotting reaction times against days, adding a regression line:

ggplot(sleepstudy, aes(Days, Reaction, colour = Subject)) +
      geom_point() +
      geom_smooth(method = "lm", colour = "black", se = FALSE)

As we saw in the boxplots, and can see in this plot too, some participants always have comparatively high reaction times, whereas others always have low values. There are clear differences between individuals, and the measurements for each individual will be correlated. This violates a fundamental assumption of the traditional linear model, namely that all observations are independent.

In addition to this, it also seems that the reaction times change in different ways for different participants, as can be seen if we facet the plot by test subject:

ggplot(sleepstudy, aes(Days, Reaction, colour = Subject)) +
      geom_point() +
      theme(legend.position = "none") +
      facet_wrap(~ Subject, nrow = 3) +
      geom_smooth(method = "lm", colour = "black", se = FALSE)

Both the intercept and the slope of the average reaction time differ between individuals. Because of this, the fit given by a single model can be misleading. Moreover, the fact that the observations are correlated will cause problems for the traditional intervals and tests. We need to take this into account when we estimate the overall intercept and slope.

One approach could be to fit a separate model for each subject. That doesn’t seem very useful though. We’re not really interested in these particular test subjects, but in how sleep deprivation affects reaction times in an average person. It would be much better to have a single model that somehow incorporates the correlation between measurements made on the same individual. That is precisely what a linear mixed regression model does.

8.4.1 Fitting a linear mixed model

A linear mixed model (LMM) has two types of effects (explanatory variables):

  • Fixed effects, which are non-random. These are usually the variables of primary interest in the data. In the sleepstudy example, Days is a fixed effect.
  • Random effects, which represent nuisance variables that cause measurements to be correlated. These are usually not of interest in and of themselves, but are something that we need to include in the model to account for correlations between measurements. In the sleepstudy example, Subject is a random effect.

Linear mixed models can be fitted using lmer from the lme4 package. The syntax is the same as for lm, with the addition of random effects. These can be included in different ways. Let’s have a look at them.

First, we can include a random intercept, which gives us a model where the intercept (but not the slope) varies between test subjects. In our example, the formula for this is:

library(lme4)
m1 <- lmer(Reaction ~ Days + (1|Subject), data = sleepstudy)

Alternatively, we could include a random slope in the model, in which case the slope (but not the intercept) varies between test subjects. The formula would be:

m2 <- lmer(Reaction ~ Days + (0 + Days|Subject), data = sleepstudy)

Finally, we can include both a random intercept and random slope in the model. This can be done in two different ways, as we can model the intercept and slope as being correlated or uncorrelated:

# Correlated random intercept and slope:
m3 <- lmer(Reaction ~ Days + (1 + Days|Subject), data = sleepstudy)

# Uncorrelated random intercept and slope:
m4 <- lmer(Reaction ~ Days + (1|Subject) + (0 + Days|Subject),
           data = sleepstudy)

Which model should we choose? Are the intercepts and slopes correlated? It could of course be the case that individuals with a high intercept have a smaller slope - or a greater slope! To find out, we can fit different linear models to each subject, and then make a scatterplot of their intercepts and slopes. To fit a model to each subject, we use split and map as in Section 8.1.11:

# Collect the coefficients from each linear model:
library(purrr)
sleepstudy %>% split(.$Subject) %>% 
              map(~ lm(Reaction ~ Days, data = .)) %>% 
              map(coef) -> coefficients

# Convert to a data frame:
coefficients <- data.frame(matrix(unlist(coefficients),
                  nrow = length(coefficients),
                  byrow = TRUE),
                  row.names = names(coefficients))
names(coefficients) <- c("Intercept", "Days")

# Plot the coefficients:
ggplot(coefficients, aes(Intercept, Days,
                         colour = row.names(coefficients))) +
      geom_point() +
      geom_smooth(method = "lm", colour = "black", se = FALSE) +
      labs(colour = "Subject")

# Test the correlation:
cor.test(coefficients$Intercept, coefficients$Days)

The correlation test is not significant, and judging from the plot, there is little indication that the intercept and slope are correlated. We saw earlier that both the intercept and the slope seem to differ between subjects, and so m4 seems like the best choice here. Let’s stick with that, and look at a summary table for the model.

summary(m4, correlation = FALSE)

I like to add correlation = FALSE here, which suppresses some superfluous output from summary.

You’ll notice that unlike the summary table for linear models, there are no p-values! This is a deliberate design choice by the lme4 developers, who argue that the approximate tests available aren’t good enough for small sample sizes (Bates et al., 2015).

Using the bootstrap, as we will do in Section 8.4.3, is usually the best approach for mixed models. If you really want some quick p-values, you can load the lmerTest package, which adds p-values computed using the Satterthwaite approximation (Kuznetsova et al., 2017). This is better than the usual approximate test, but still not perfect.

install.packages("lmerTest")

library(lmerTest)
m4 <- lmer(Reaction ~ Days + (1|Subject) + (0 + Days|Subject),
           data = sleepstudy)
summary(m4, correlation = FALSE)

If we need to extract the model coefficients, we can do so using fixef (for the fixed effects) and ranef (for the random effects):

fixef(m4)
ranef(m4)

If we want to extract the variance components from the model, we can use VarCorr:

VarCorr(m4)

Let’s add the lines from the fitted model to our facetted plot, to compare the results of our mixed model to the lines that were fitted separately for each individual:

mixed_mod <- coef(m4)$Subject
mixed_mod$Subject <- row.names(mixed_mod)

ggplot(sleepstudy, aes(Days, Reaction)) +
      geom_point() +
      theme(legend.position = "none") +
      facet_wrap(~ Subject, nrow = 3) +
      geom_smooth(method = "lm", colour = "cyan", se = FALSE,
                  size = 0.8) +
      geom_abline(aes(intercept = `(Intercept)`, slope = Days),
                  colour = "magenta", data = mixed_mod,
                  size = 0.8)

Notice that the lines differ. The intercept and slopes have been shrunk toward the global effects, i.e. toward the average of all lines.
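To see the shrinkage numerically, we can compare the coefficients from the separate per-subject models that we fitted earlier (stored in coefficients) with those from the mixed model:

# Per-subject least squares estimates:
head(coefficients)

# Mixed model estimates, shrunk toward the average line:
head(coef(m4)$Subject)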

\[\sim\]

Exercise 8.22 Consider the Oxboys data from the nlme package. Does a mixed model seem appropriate here? If so, are the intercept and slope for different subjects correlated? Fit a suitable model, with height as the response variable.

Save the code for your model, as you will return to it in the next few exercises.

(Click here to go to the solution.)


Exercise 8.23 The broom.mixed package allows you to get summaries of mixed models as data frames, just as broom does for linear and generalised linear models. Install it and use it to get the summary table for the model for the Oxboys data that you created in the previous exercise. How are fixed and random effects included in the table?

(Click here to go to the solution.)

8.4.2 Model diagnostics

As for any linear model, residual plots are useful for diagnostics for linear mixed models. Of particular interest are signs of heteroscedasticity, as homoscedasticity is assumed in the mixed model. We’ll use fortify.merMod to turn the model into an object that can be used with ggplot2, and then create some residual plots:

library(ggplot2)
fm4 <- fortify.merMod(m4)

# Plot residuals:
ggplot(fm4, aes(.fitted, .resid)) +
      geom_point() +
      geom_hline(yintercept = 0) +
      xlab("Fitted values") + ylab("Residuals")

# Compare the residuals of different subjects:
ggplot(fm4, aes(Subject, .resid)) +
      geom_boxplot() +
      coord_flip() +
      ylab("Residuals")

# Observed values versus fitted values:
ggplot(fm4, aes(.fitted, Reaction)) +
      geom_point(colour = "blue") +
      facet_wrap(~ Subject, nrow = 3) +
      geom_abline(intercept = 0, slope = 1) +
      xlab("Fitted values") + ylab("Observed values")

## Q-Q plot of residuals:
ggplot(fm4, aes(sample = .resid)) +
        geom_qq() + geom_qq_line()

## Q-Q plot of random effects:
ggplot(ranef(m4)$Subject, aes(sample = `(Intercept)`)) +
        geom_qq() + geom_qq_line()
ggplot(ranef(m4)$Subject, aes(sample = `Days`)) +
        geom_qq() + geom_qq_line()

The normality assumption appears to be satisfied, but there are some signs of heteroscedasticity in the boxplots of the residuals for the different subjects.

\[\sim\]

Exercise 8.24 Return to your mixed model for the Oxboys data from Exercise 8.22. Make diagnostic plots for the model. Are there any signs of heteroscedasticity or non-normality?

(Click here to go to the solution.)

8.4.3 Bootstrapping

Summary tables, including p-values, for the fixed effects are available through boot_summary:

library(boot.pval)
boot_summary(m4, type = "perc")

boot_summary calls lme4’s bootMer function, which performs parametric resampling from the model. In case you want to call bootMer directly, you can do as follows:

boot_res <- bootMer(m4, fixef, nsim = 999)

library(boot)
boot.ci(boot_res, type = "perc", index = 1) # Intercept
boot.ci(boot_res, type = "perc", index = 2) # Days

8.4.4 Nested random effects and multilevel/hierarchical models

In many cases, a random factor is nested within another. To see an example of this, consider the Pastes data from lme4:

library(lme4)
?Pastes
str(Pastes)

We are interested in the strength of a chemical product. There are ten delivery batches (batch), and three casks within each delivery (cask). Because of variations in manufacturing, transportation, storage, and so on, it makes sense to include random effects for both batch and cask in a linear mixed model. However, each cask only appears within a single batch, which makes the cask effect nested within batch.
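We can see this by cross-tabulating the two factors - the labels a, b, and c are reused in every batch, even though they refer to different physical casks:

# Each cask label appears in every batch:
xtabs(~ cask + batch, data = Pastes)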

Models that use nested random factors are commonly known as multilevel models (the random factors exist at different “levels”), or hierarchical models (there is a hierarchy between the random factors). These aren’t really any different from other mixed models, but depending on how the data is structured, we may have to be a bit careful to get the nesting right when we fit the model with lmer.

If the two effects weren’t nested, we could fit a model using:

# Incorrect model:
m1 <- lmer(strength ~ (1|batch) + (1|cask),
           data = Pastes)
summary(m1, correlation = FALSE)

However, because the casks are labelled a, b, and c within each batch, we’ve now fitted a model where casks with the same label from different batches are treated as being the same cask! To clarify that the labels a, b, and c refer to different casks in different batches, we need to include the nesting in our formula. This is done as follows:

# Cask is nested within batch:
m2 <- lmer(strength ~ (1|batch/cask),
           data = Pastes)
summary(m2, correlation = FALSE)

Equivalently, we can also use:

m3 <- lmer(strength ~ (1|batch) + (1|batch:cask),
           data = Pastes)
summary(m3, correlation = FALSE)

8.4.5 ANOVA with random effects

The lmerTest package provides ANOVA tables that allow us to use random effects in ANOVA models. To use it, simply load lmerTest before fitting a model with lmer, and then run anova(m, type = "III") (or replace III with II or I if you want a type II or type I ANOVA table instead).

As an example, consider the TVbo data from lmerTest. Three types of TV sets were compared by eight assessors for four different pictures. To see if there is a difference in the mean score for the colour balance of the TV sets, we can fit a mixed model. We’ll include a random intercept for the assessor. This is a balanced design (in which case the results from all three types of tables coincide):

library(lmerTest)

# TV data:
?TVbo

# Fit model with both fixed and random effects:
m <- lmer(Colourbalance ~ TVset*Picture + (1|Assessor),
         data = TVbo)

# View fitted model:
m

# All three types of ANOVA table give the same results here:
anova(m, type = "III")
anova(m, type = "II")
anova(m, type = "I")

The interaction effect is significant at the 5 % level. As for other ANOVA models, we can visualise this with an interaction plot:

interaction.plot(TVbo$TVset, TVbo$Picture,
                 response = TVbo$Colourbalance)

\[\sim\]

Exercise 8.25 Fit a mixed effects ANOVA to the TVbo data, using Coloursaturation as the response variable, TVset and Picture as fixed effects, and Assessor as a random effect. Does there appear to be a need to include the interaction between Assessor and TVset as a random effect? If so, do it.

(Click here to go to the solution.)

8.4.6 Generalised linear mixed models

Everything that we have just done for the linear mixed models carries over to generalised linear mixed models (GLMM), which are GLM’s with both fixed and random effects.

A common example is the item response model, which plays an important role in psychometrics. This model is frequently used in psychological tests containing multiple questions or sets of questions (“items”), where both the subject and the item are considered random effects. As an example, consider the VerbAgg data from lme4:

library(lme4)
?VerbAgg
View(VerbAgg)

We’ll use the binary version of the response, r2, and fit a logistic mixed regression model to the data, to see if it can be used to explain the subjects’ responses. The formula syntax is the same as for linear mixed models, but now we’ll use glmer to fit a GLMM. We’ll include Anger and Gender as fixed effects (we are interested in seeing how these affect the response) and item and id as random intercepts (we believe that answers to the same item and answers from the same individual may be correlated):

m <- glmer(r2 ~ Anger + Gender + (1|item) + (1|id),
           data = VerbAgg, family = binomial)
summary(m, correlation = FALSE)

We can plot the fitted random effects for item to verify that there appear to be differences between the different items:

mixed_mod <- coef(m)$item
mixed_mod$item <- row.names(mixed_mod)

ggplot(mixed_mod, aes(`(Intercept)`, item)) +
      geom_point() +
      xlab("Random intercept")

The situ variable, describing situation type, also appears interesting. Let’s include it as a fixed effect. Let’s also allow different situational (random) effects for different respondents. It seems reasonable that such responses are random rather than fixed (as in the solution to Exercise 8.25), and we do have repeated measurements of these responses. We’ll therefore also include situ as a random effect nested within id:

m <- glmer(r2 ~ Anger + Gender + situ + (1|item) + (1|id/situ),
           data = VerbAgg, family = binomial)
summary(m, correlation = FALSE)

Finally, we’d like to obtain bootstrap confidence intervals for the fixed effects. Because this is a fairly large dataset (\(n=7,584\)), this can take a looong time to run, so stretch your legs and grab a cup of coffee or two while you wait:

library(boot.pval)
boot_summary(m, type = "perc", R = 100)
# Ideally, R should be greater, but for the sake of
# this example, we'll use a low number.

\[\sim\]

Exercise 8.26 Consider the grouseticks data from the lme4 package (Elston et al., 2001). Fit a mixed Poisson regression model to the data, with TICKS as the response variable and YEAR and HEIGHT as fixed effects. What variables are suitable to use for random effects? Compute a bootstrap confidence interval for the effect of HEIGHT.

(Click here to go to the solution.)

8.4.7 Bayesian estimation of mixed models

From a numerical point of view, using Bayesian modelling with rstanarm is preferable to frequentist modelling with lme4 if you have complex models with many random effects. Indeed, for some models lme4 will return a warning message about a singular fit, basically meaning that the model is too complex, whereas rstanarm, aided by its use of a prior distribution, will always return a fitted model regardless of complexity.

After loading rstanarm, fitting a Bayesian linear mixed model with a weakly informative prior is as simple as substituting lmer with stan_lmer:

library(lme4)
library(rstanarm)
m4 <- stan_lmer(Reaction ~ Days + (1|Subject) + (0 + Days|Subject),
           data = sleepstudy)

# Print the results:
m4

To plot the posterior distributions for the coefficients of the fixed effects, we can use plot, specifying which effects we are interested in using pars:

plot(m4, "dens", pars = c("(Intercept)", "Days"))

To get 95 % credible intervals for the fixed effects, we can use posterior_interval as follows:

posterior_interval(m4, 
        pars = c("(Intercept)", "Days"),
        prob = 0.95)

We can also plot them using plot:

plot(m4, "intervals",
        pars = c("(Intercept)", "Days"),
        prob = 0.95)

Finally, we’ll check that the model fitting has converged:

plot(m4, "rhat")

8.5 Survival analysis

Many studies are concerned with the duration of time until an event happens: time until a machine fails, time until a patient diagnosed with a disease dies, and so on. In this section we will consider some methods for survival analysis (also known as reliability analysis in engineering and duration analysis in economics), which is used for analysing such data. The main difficulty here is that studies often end before all participants have had events, meaning that some observations are right-censored - for these observations, we don’t know when the event happened, but only that it happened after the end of the study.

The survival package contains a number of useful methods for survival analysis. Let’s install it:

install.packages("survival")

We will study the lung cancer data in lung:

library(survival)
?lung
View(lung)

The survival times of the patients consist of two parts: time (the time from diagnosis until either death or the end of the study) and status (1 if the observation is censored, 2 if the patient died before the end of the study). To combine these so that they can be used in a survival analysis, we must create a Surv object:

Surv(lung$time, lung$status)

Here, a + sign after a value indicates right-censoring.

8.5.1 Comparing groups

Survival times are best visualised using Kaplan-Meier curves that show the proportion of surviving patients. Let’s compare the survival times of women and men. We first fit a survival model using survfit, and then draw the Kaplan-Meier curve (with pointwise confidence intervals) using autoplot from ggfortify:

library(ggfortify)
library(survival)
m <- survfit(Surv(time, status) ~ sex, data = lung)
autoplot(m)

To print the values for the survival curves at different time points, we can use summary:

summary(m)

To test for differences between two groups, we can use the logrank test (also known as the Mantel-Cox test), given by survdiff:

survdiff(Surv(time, status) ~ sex, data = lung)

Another option is the Peto-Peto test, which puts more weight on early events (deaths, in the case of the lung data), and therefore is suitable when such events are of greater interest. In contrast, the logrank test puts equal weights on all events regardless of when they occur. The Peto-Peto test is obtained by adding the argument rho = 1:

survdiff(Surv(time, status) ~ sex, rho = 1, data = lung)

The Hmisc package contains a function for obtaining confidence intervals based on the Kaplan-Meier estimator, called bootkm. This allows us to get confidence intervals for the quantiles (including the median) of the survival distribution for different groups, as well as for differences between the quantiles of different groups. First, let’s install it:

install.packages("Hmisc")

We can now use bootkm to compute bootstrap confidence intervals for survival times based on the lung data. We’ll compute an interval for the median survival time for females, and one for the difference in median survival time between females and males:

library(Hmisc)

# Create a survival object:
survobj <- Surv(lung$time, lung$status)

# Get bootstrap replicates of the median survival time for
# the two groups:
median_surv_time_female <- bootkm(survobj[lung$sex == 2],
                                  q = 0.5, B = 999)
median_surv_time_male <- bootkm(survobj[lung$sex == 1],
                                q = 0.5, B = 999)

# 95 % bootstrap confidence interval for the median survival time
# for females:
quantile(median_surv_time_female,
         c(.025,.975), na.rm=TRUE)

# 95 % bootstrap confidence interval for the difference in median
# survival time:
quantile(median_surv_time_female - median_surv_time_male,
         c(.025,.975), na.rm=TRUE)

To obtain confidence intervals for other quantiles, we simply change the argument q in bootkm.
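For instance, to get a confidence interval for the 0.25 quantile of the survival time for females:

# 95 % bootstrap confidence interval for the 0.25 quantile of
# the survival time for females:
q25_surv_time_female <- bootkm(survobj[lung$sex == 2],
                               q = 0.25, B = 999)
quantile(q25_surv_time_female, c(.025,.975), na.rm=TRUE)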

\[\sim\]

Exercise 8.27 Consider the ovarian data from the survival package. Plot Kaplan-Meier curves comparing the two treatment groups. Compute a bootstrap confidence interval for the difference in the 75 % quantile for the survival time for the two groups.

(Click here to go to the solution.)

8.5.2 The Cox proportional hazards model

The hazard function, or hazard rate, is the rate of events at time \(t\) if a subject has survived until time \(t\). The higher the hazard, the greater the probability of an event. Hazard rates play an integral part in survival analysis, particularly in regression models. To model how the survival times are affected by different explanatory variables, we can use a Cox proportional hazards model (Cox, 1972), fitted using coxph:

m <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(m)

The exponentiated coefficients show the hazard ratios, i.e. the relative increases (values greater than 1) or decreases (values below 1) of the hazard rate when a covariate is increased one step while all others are kept fixed:

exp(coef(m))

In this case, the hazard increases with age (multiply the hazard by 1.017 for each additional year that the person has lived), and is lower for women (sex=2) than for men (sex=1).

The censboot_summary function from boot.pval provides a table of estimates, bootstrap confidence intervals, and bootstrap p-values for the model coefficients. The coef argument can be used to specify whether to print confidence intervals for the coefficients or for the exponentiated coefficients (i.e. the hazard ratios):

# censboot_summary requires us to use model = TRUE
# when fitting our regression model:
m <- coxph(Surv(time, status) ~ age + sex,
           data = lung, model = TRUE)

library(boot.pval)
# Original coefficients:
censboot_summary(m, type = "perc", coef = "raw")
# Exponentiated coefficients:
censboot_summary(m, type = "perc", coef = "exp")

To manually obtain bootstrap confidence intervals for the exponentiated coefficients, we can use the censboot function from boot as follows:

# Function to get the bootstrap replicates of the exponentiated
# coefficients:
boot_fun <- function(data, formula) {
     m_boot <- coxph(formula, data = data)
     return(exp(coef(m_boot)))
}

# Run the resampling:
library(boot)
boot_res <- censboot(lung[,c("time", "status", "age", "sex")],
                     boot_fun, R = 999, 
                     formula = 
                       formula(Surv(time, status) ~ age + sex))

# Compute the percentile bootstrap confidence intervals:
boot.ci(boot_res, type = "perc", index = 1) # Age
boot.ci(boot_res, type = "perc", index = 2) # Sex

As the name implies, the Cox proportional hazards model relies on the assumption of proportional hazards, which essentially means that the effect of the explanatory variables is constant over time. This can be assessed visually by plotting the model residuals, using cox.zph and the ggcoxzph function from the survminer package. Specifically, we will plot the scaled Schoenfeld (1982) residuals, which measure the difference between the observed covariates and the expected covariates given the risk at the time of an event. If the proportional hazards assumption holds, then there should be no trend over time for these residuals. The plots include a trend line to aid the eye:

install.packages("survminer")
library(survminer)

ggcoxzph(cox.zph(m), var = 1) # age
ggcoxzph(cox.zph(m), var = 2) # sex

# Formal p-values for a test of proportional
# hazards, for each variable:
cox.zph(m)

In this case, there are no apparent trends over time (which is in line with the corresponding formal hypothesis tests), indicating that the proportional hazards model could be applicable here.

\[\sim\]

Exercise 8.28 Consider the ovarian data from the survival package.

  1. Use a Cox proportional hazards regression to test whether there is a difference between the two treatment groups, adjusted for age.
  2. Compute bootstrap confidence interval for the hazard ratio of age.

(Click here to go to the solution.)


Exercise 8.29 Consider the retinopathy data from the survival package. We are interested in a mixed survival model, where id is used to identify patients and type, trt and age are fixed effects. Fit a mixed Cox proportional hazards regression (add cluster = id to the call to coxph to include this as a random effect). Is the assumption of proportional hazards fulfilled?

(Click here to go to the solution.)

8.5.3 Accelerated failure time models

In many cases, the proportional hazards assumption does not hold. In such cases we can turn to accelerated failure time models (Wei, 1992), for which the effect of covariates is to accelerate or decelerate the life course of a subject.

While the proportional hazards model is semiparametric, accelerated failure time models are typically fully parametric, and thus involve stronger assumptions about an underlying distribution. When fitting such a model using the survreg function, we must therefore specify what distribution to use. Two common choices are the Weibull distribution and the log-logistic distribution. The Weibull distribution is commonly used in engineering, e.g. in reliability studies. The hazard function of Weibull models is always monotonic, i.e. either always increasing or always decreasing. In contrast, the log-logistic distribution allows the hazard function to be non-monotonic, making it more flexible, and often more appropriate for biological studies. Let’s fit both types of models to the lung data and have a look at the results:

library(survival)

# Fit Weibull model:
m_w <- survreg(Surv(time, status) ~ age + sex, data = lung,
             dist = "weibull", model = TRUE)
summary(m_w)

# Fit log-logistic model:
m_ll <- survreg(Surv(time, status) ~ age + sex, data = lung,
             dist = "loglogistic", model = TRUE)
summary(m_ll)

Interpreting the coefficients of accelerated failure time models is easier than interpreting coefficients from proportional hazards models. The exponentiated coefficients show the relative increase or decrease in the expected survival times when a covariate is increased one step while all others are kept fixed:

exp(coef(m_ll))

In this case, according to the log-logistic model, the expected survival time decreases by 1.4 % (i.e. multiply by \(0.986\)) for each additional year that the patient has lived. The expected survival time for females (sex=2) is 61.2 % higher than for males (multiply by \(1.612\)).

To obtain bootstrap confidence intervals and p-values for the effects, we follow the same procedure as for the Cox model, using censboot_summary. Here is an example for the log-logistic accelerated failure time model:

library(boot.pval)
# Original coefficients:
censboot_summary(m_ll, type = "perc", coef = "raw")
# Exponentiated coefficients:
censboot_summary(m_ll, type = "perc", coef = "exp")

We can also use censboot:

# Function to get the bootstrap replicates of the exponentiated
# coefficients:
boot_fun <- function(data, formula, distr) {
     m_boot <- survreg(formula, data = data, dist = distr)
     return(exp(coef(m_boot)))
}

# Run the resampling:
library(boot)
boot_res <- censboot(lung[,c("time", "status", "age", "sex")],
                     boot_fun, R = 999, 
                     formula = 
                       formula(Surv(time, status) ~ age + sex),
                     distr = "loglogistic")

# Compute the percentile bootstrap confidence intervals:
boot.ci(boot_res, type = "perc", index = 2) # Age
boot.ci(boot_res, type = "perc", index = 3) # Sex

\[\sim\]

Exercise 8.30 Consider the ovarian data from the survival package. Fit a log-logistic accelerated failure time model to the data, using all available explanatory variables. What is the estimated difference in survival times between the two treatment groups?

(Click here to go to the solution.)

8.5.4 Bayesian survival analysis

At the time of this writing, the latest release of rstanarm does not contain functions for fitting survival analysis models. You can check whether this still is the case by running ?stan_surv in the Console. If you don’t find the documentation for the stan_surv function, you will have to install the development version of the package from GitHub (which contains such functions), using the following code:

# Check if the devtools package is installed, and start
# by installing it otherwise:
if (!require(devtools)) {
     install.packages("devtools")
}
library(devtools)
# Download and install the development version of the package:
install_github("stan-dev/rstanarm", build_vignettes = FALSE)

Now, let’s have a look at how to fit a Bayesian model to the lung data from survival:

library(survival)
library(rstanarm)

# Fit proportional hazards model using cubic M-splines (similar
# but not identical to the Cox model!):
m <- stan_surv(Surv(time, status) ~ age + sex, data = lung)
m

Fitting a survival model with a random effect works similarly, and uses the same syntax as lme4. Here is an example with the retinopathy data:

m <- stan_surv(Surv(futime, status) ~ age + type + trt + (1|id),
                data = retinopathy)
m

8.5.5 Multivariate survival analysis

Some trials involve multiple time-to-event outcomes that need to be assessed simultaneously in a multivariate analysis. Examples include studies of the time until each of several correlated symptoms or comorbidities occur. This is analogous to the multivariate testing problem of Section 7.2.6, but with right-censored data. To test for group differences for a vector of right-censored outcomes, a multivariate version of the logrank test described in Persson et al. (2019) can be used. It is available through the MultSurvTests package:

install.packages("MultSurvTests")

As an example, we’ll use the diabetes dataset from MultSurvTests. It contains two time-to-event outcomes: time until blindness in a treated eye and in an untreated eye.

library(MultSurvTests)
# Diabetes data:
?diabetes

We’ll compare two groups that received two different treatments. The survival times (time until blindness) and censoring statuses of the two groups are put into matrices called z and delta.z, which are used as input for the test function perm_mvlogrank:

# Survival times for the two groups:
x <- as.matrix(subset(diabetes, LASER==1)[,c(6,8)])
y <- as.matrix(subset(diabetes, LASER==2)[,c(6,8)])

# Censoring status for the two groups:
delta.x <- as.matrix(subset(diabetes, LASER==1)[,c(7,9)])
delta.y <- as.matrix(subset(diabetes, LASER==2)[,c(7,9)])

# Create the input for the test:
z <- rbind(x, y)
delta.z <- rbind(delta.x, delta.y)

# Run the test with 499 permutations:
perm_mvlogrank(B = 499, z, delta.z, n1 = nrow(x))

8.5.6 Power estimates for the logrank test

The spower function in Hmisc can be used to compute the power of the univariate logrank test in different scenarios using simulation. The helper functions Weibull2, Lognorm2, and Gompertz2 can be used to define Weibull, lognormal, and Gompertz distributions to sample from, using survival probabilities at different time points rather than the traditional parameters of those distributions. We’ll look at an example involving the Weibull distribution here. Additional examples can be found in the function’s documentation (?spower).

Let’s simulate the power of a 3-year follow-up study with two arms (i.e. two groups, control and intervention). First, we define a Weibull distribution for (compliant) control patients. Let’s say that their 1-year survival is 0.9 and their 3-year survival is 0.6. To define a Weibull distribution that corresponds to these numbers, we use Weibull2 as follows:

weib_dist <- Weibull2(c(1, 3), c(.9, .6))

We’ll assume that the treatment has no effect for the first 6 months, and that it then has a constant effect, leading to a hazard ratio of 0.75 (so the hazard ratio is 1 if the time in years is less than or equal to 0.5, and 0.75 otherwise). Moreover, we’ll assume that there is a constant drop-out rate, such that 20 % of the patients can be expected to drop out during the three years. Finally, there is no drop-in. We define a function to simulate survival times under these conditions:

# In the functions used to define the hazard ratio, drop-out
# and drop-in, t denotes time in years:
sim_func <- Quantile2(weib_dist, 
      hratio = function(t) { ifelse(t <= 0.5, 1, 0.75) },
      dropout = function(t) { 0.2*t/3 },
      dropin = function(t) { 0 })

Next, we define a function for the censoring distribution, which is assumed to be the same for both groups. Let’s say that each follow-up is done at a random time point between 2 and 3 years. We’ll therefore use a uniform distribution on the interval \((2,3)\) for the censoring distribution:

rcens <- function(n)
{
   runif(n, 2, 3)
}

Finally, we define two helper functions required by spower and then run the simulation study. The output is the simulated power using the settings that we’ve just created.

# Define helper functions:
rcontrol <- function(n) { sim_func(n, "control") }
rinterv  <- function(n) { sim_func(n, "intervention") }

# Simulate power when both groups have sample size 300:
spower(rcontrol, rinterv, rcens, nc = 300, ni = 300, 
       test = logrank, nsim = 999)

# Simulate power when both groups have sample size 450:
spower(rcontrol, rinterv, rcens, nc = 450, ni = 450, 
       test = logrank, nsim = 999)

# Simulate power when the control group has size 100
# and the intervention group has size 300:
spower(rcontrol, rinterv, rcens, nc = 100, ni = 300, 
       test = logrank, nsim = 999)

8.6 Left-censored data and nondetects

Survival data is typically right-censored. Left-censored data, on the other hand, is common in medical research (e.g. in biomarker studies) and environmental chemistry (e.g. measurements of chemicals in water), where some measurements fall below the laboratory’s detection limit (or limit of detection, LoD). Such data also occurs in economic studies. A measurement below the detection limit, a nondetect, is still more informative than having no measurement at all - we may not know the exact value, but we know that the measurement is below a given threshold.

In principle, all methods that are applicable to survival analysis can also be used for left-censored data (although the interpretation of coefficients and parameters may differ), but in practice the distributions of lab measurements and economic variables often differ from those that typically describe survival times. In this section we’ll look at methods tailored to the kind of left-censored data that appears in applications in the aforementioned fields.

8.6.1 Estimation

The EnvStats package contains a number of functions that can be used to compute descriptive statistics and estimating parameters of distributions from data with nondetects. Let’s install it:

install.packages("EnvStats")

Estimates of the mean and standard deviation of a normal distribution that take the censoring into account in the right way can be obtained with enormCensored, which allows us to use several different estimators (the details surrounding the available estimators can be found using ?enormCensored). Analogous functions are available for other distributions, for instance elnormAltCensored for the lognormal distribution, egammaCensored for the gamma distribution, and epoisCensored for the Poisson distribution.

To illustrate the use of enormCensored, we will generate data from a normal distribution. We know the true mean and standard deviation of the distribution, and can compute the estimates for the generated sample. We will then pretend that there is a detection limit for this data, and artificially left-censor about 20 % of it. This allows us to compare the estimates for the full sample and the censored sample, to see how the censoring affects the estimates. Try running the code below a few times:

# Generate 50 observations from a N(10, 9)-distribution:
x <- rnorm(50, 10, 3)

# Estimate the mean and standard deviation:
mean_full <- mean(x)
sd_full <- sd(x)

# Censor all observations below the "detection limit" 8
# and replace their values by 8:
censored <- x<8
x[censored] <- 8

# The proportion of censored observations is:
mean(censored)

# Estimate the mean and standard deviation in a naive
# manner, using the ordinary estimators with all
# nondetects replaced by 8:
mean_cens_naive <- mean(x)
sd_cens_naive <- sd(x)

# Estimate the mean and standard deviation using
# different estimators that take the censoring
# into account:

library(EnvStats)
# Maximum likelihood estimate:
estimates_mle <- enormCensored(x, censored,
                               method = "mle")
# Biased-corrected maximum likelihood estimate:
estimates_bcmle <- enormCensored(x, censored,
                               method = "bcmle")
# Regression on order statistics, ROS, estimate:
estimates_ros <- enormCensored(x, censored,
                               method = "ROS")

# Compare the different estimates:
mean_full; sd_full
mean_cens_naive; sd_cens_naive
estimates_mle$parameters
estimates_bcmle$parameters
estimates_ros$parameters

The naive estimators tend to be biased for data with nondetects (sometimes very biased!). Your mileage may vary depending on e.g. the sample size and the amount of censoring, but in general, the estimators that take censoring into account will fare much better.

After we have obtained estimates for the parameters of the normal distribution, we can plot the data against the fitted distribution to check the assumption of normality:

library(ggplot2)
# Compare to histogram, including a bar for nondetects:
ggplot(data.frame(x), aes(x)) +
      geom_histogram(colour = "black", aes(y = ..density..)) +
      geom_function(fun = dnorm, colour = "red", size = 2,
                 args = list(mean = estimates_mle$parameters[1], 
                                sd = estimates_mle$parameters[2]))

# Compare to histogram, excluding nondetects:
x_noncens <- x[!censored]
ggplot(data.frame(x_noncens), aes(x_noncens)) +
      geom_histogram(colour = "black", aes(y = ..density..)) +
      geom_function(fun = dnorm, colour = "red", size = 2,
                 args = list(mean = estimates_mle$parameters[1], 
                                sd = estimates_mle$parameters[2]))

To obtain percentile and BCa bootstrap confidence intervals for the mean, we can add the options ci = TRUE and ci.method = "bootstrap":

# Using 999 bootstrap replicates:
enormCensored(x, censored, method = "mle",
              ci = TRUE,  ci.method = "bootstrap",
              n.bootstraps = 999)$interval$limits

\[\sim\]

Exercise 8.31 Download the il2rb.csv data from the book’s web page. It contains measurements of the biomarker IL-2RB made in serum samples from two groups of patients. The values that are missing are in fact nondetects, with detection limit 0.25.

Under the assumption that the biomarker levels follow a lognormal distribution, compute bootstrap confidence intervals for the mean of the distribution for the control group. What proportion of the data is left-censored?

(Click here to go to the solution.)

8.6.2 Tests of means

When testing the difference between two groups’ means, nonparametric tests like the Wilcoxon-Mann-Whitney test often perform very well for data with nondetects, unlike the t-test (Zhang et al., 2009). For data with a high degree of censoring (e.g. more than 50 %), most tests perform poorly. For multivariate tests of mean vectors the situation is the opposite, with Hotelling’s \(T^2\) (Section 7.2.6) being a much better option than nonparametric tests (Thulin, 2016).
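As a minimal sketch of how such a test can be carried out, we can reuse the approach from Section 8.6.1: simulate two groups, replace all values below a hypothetical detection limit by the limit itself, and run wilcox.test:

# Simulate two groups and censor values below the
# detection limit 8:
x <- rnorm(50, 10, 3)
y <- rnorm(50, 12, 3)
x[x < 8] <- 8
y[y < 8] <- 8

# Wilcoxon-Mann-Whitney test of the difference between
# the groups:
wilcox.test(x, y)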

\[\sim\]

Exercise 8.32 Return to the il2rb.csv data from Exercise 8.31. Test the hypothesis that there is no difference in location between the two groups.

(Click here to go to the solution.)

8.6.3 Censored regression

Censored regression models can be used when the response variable is censored. A common model in economics is the Tobit regression model (Tobin, 1958), which is a linear regression model with normal errors, tailored to left-censored data. It can be fitted using survreg.

As an example, consider the EPA.92c.zinc.df dataset available in EnvStats. It contains measurements of zinc concentrations from five wells, made on 8 samples from each well, half of which are nondetects. Let’s say that we are interested in comparing these five wells (so the wells are treated as fixed effects rather than random effects). Let’s also assume that the 8 samples were collected at different time points, and that we want to investigate whether the concentrations change over time. Such changes could be non-linear, so we’ll include the sample number as a factor. To fit a Tobit model to this data, we use survreg as follows.

library(EnvStats)
?EPA.92c.zinc.df

# Note that in Surv, in the vector describing censoring 0 means
# censoring and 1 no censoring. This is the opposite of the 
# definition used in EPA.92c.zinc.df$Censored, so we use the !
# operator to change 0's to 1's and vice versa.
library(survival)
m <- survreg(Surv(Zinc, !Censored, type = "left") ~ Sample + Well,
                data = EPA.92c.zinc.df, dist = "gaussian")
summary(m)

Similarly, we can fit a model under the assumption of lognormality:

m <- survreg(Surv(Zinc, !Censored, type = "left") ~ Sample + Well,
                data = EPA.92c.zinc.df, dist = "lognormal")
summary(m)

Fitting regression models where the explanatory variables are censored is more challenging. For prediction, a good option is models based on decision trees, studied in Section 9.5. For testing whether there is a trend over time, tests based on Kendall’s correlation coefficient can be useful. EnvStats provides two functions for this - kendallTrendTest for testing a monotonic trend, and kendallSeasonalTrendTest for testing a monotonic trend within seasons.
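As a sketch, a trend test for the zinc data could look as follows (here we assume that the sample number can be treated as a numeric time index; note that kendallTrendTest treats the nondetects as if they were observed at their recorded values):

library(EnvStats)
# Test for a monotonic trend in zinc concentration over time:
kendallTrendTest(Zinc ~ as.numeric(Sample),
                 data = EPA.92c.zinc.df)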

8.7 Creating matched samples

Matching is used to balance the distribution of explanatory variables in the groups that are being compared. This is often required in observational studies, where the treatment variable is not randomly assigned, but determined by external factors that may also be related to the outcome. For instance, if you wish to study the effect of smoking on mortality, you can recruit a group of smokers and non-smokers and follow them for a few years. But both mortality and smoking are related to confounding variables such as age and gender, meaning that imbalances in the age and gender distributions of smokers and non-smokers can bias the results. There are several methods for creating balanced or matched samples that seek to mitigate this bias, including propensity score matching, which we’ll use here. The MatchIt and optmatch packages contain the functions that we need for this.

To begin with, let’s install the two packages:

install.packages(c("MatchIt", "optmatch"))

We will illustrate the use of the packages using the lalonde dataset, which is shipped with the MatchIt package:

library(MatchIt)
data(lalonde)
?lalonde
View(lalonde)

Note that the data has row names, which are useful e.g. for identifying which individuals have been paired - we can access them using rownames(lalonde).

8.7.1 Propensity score matching

To perform automated propensity score matching, we will use the matchit function, which computes propensity scores and then matches participants from the treatment and control groups using these. Matches can be found in several ways. We’ll consider two of them here. As input, the matchit function takes a formula describing the treatment variable and potential confounders, what dataset to use, which method to use, and what ratio of control to treatment participants to use.

A common method is nearest neighbour matching, where each participant is matched to the participant in the other group with the most similar propensity score. By default, it starts by finding a match for the participant in the treatment group that has the largest propensity score, then finds a match for the participant in the treatment group with the second largest score, and so on. Two treatment-group participants cannot be matched with the same participant in the control group. The nearest neighbour match is locally optimal in the sense that it finds the best (still) available match for each participant in the treatment group, ignoring whether that match would in fact be even better for another participant in the treatment group.

To perform propensity score matching using nearest neighbour matching with one match per treated participant, evaluate the results, and then extract the matched samples, we can use matchit as follows:

matches <- matchit(treat ~ re74 + re75 + age + educ + married,
                 data = lalonde, method = "nearest", ratio = 1)

summary(matches)
plot(matches)
plot(matches, type = "hist")

matched_data <- match.data(matches)
summary(matched_data)

To view the matched pairs, you can use:

matches$match.matrix

To view the values of the re78 variable of the matched pairs, use:

varName <- "re78"
resMatrix <- lalonde[row.names(matches$match.matrix), varName]
for(i in 1:ncol(matches$match.matrix))
{
    resMatrix <- cbind(resMatrix, lalonde[matches$match.matrix[,i],
                                          varName])
}
rownames(resMatrix) <- row.names(matches$match.matrix)
View(resMatrix)

As an alternative to nearest neighbour matching, optimal matching can be used. This is similar to nearest neighbour matching, but strives to obtain globally optimal matches rather than locally optimal ones. This means that each participant in the treatment group is paired with a participant in the control group, while also taking into account how similar the latter participant is to other participants in the treatment group.

To perform propensity score matching using optimal matching with 2 matches each:

matches <- matchit(treat ~ re74 + re75 + age + educ + married,
                   data = lalonde, method = "optimal", ratio = 2)

summary(matches)
plot(matches)
plot(matches, type = "hist")

matched_data <- match.data(matches)
summary(matched_data)

You may also want to find all controls that match participants in the treatment group exactly. This is called exact matching:

matches <- matchit(treat ~ re74 + re75 + age + educ + married,
                   data = lalonde, method = "exact")

summary(matches)
plot(matches)
plot(matches, type = "hist")

matched_data <- match.data(matches)
summary(matched_data)

Participants with no exact matches won’t be included in matched_data.

8.7.2 Stepwise matching

At times you will want to combine the above approaches. For instance, you may want to have an exact match for age, and then an approximate match using the propensity scores for other variables. This is also achievable, but requires the matching to be done in several steps. To first match the participants exactly on age and then 1-to-2 via nearest-neighbour propensity score matching on re74 and re75, we can use a loop:

# Match exactly on age:
matches <- matchit(treat ~ age, data = lalonde, method = "exact")
matched_data <- match.data(matches)

# Match the first subclass 1-to-2 via nearest-neighbour propensity
# score matching: 
matches2 <- matchit(treat ~ re74 + re75,
                    data = matched_data[matched_data$subclass == 1,],
                    method = "nearest", ratio = 2)
matched_data2 <- match.data(matches2, weights = "weights2",
                            subclass = "subclass2")
matchlist <- matches2$match.matrix

# Match the remaining subclasses in the same way:
for(i in 2:max(matched_data$subclass))
{
    matches2 <- matchit(treat ~ re74 + re75,
                 data = matched_data[matched_data$subclass == i,],
                 method = "nearest", ratio = 2)
    matched_data2 <- rbind(matched_data2, match.data(matches2,
                                          weights = "weights2",
                                          subclass = "subclass2"))
    matchlist <- rbind(matchlist, matches2$match.matrix)
}

# Check results:
View(matchlist)
View(matched_data2)

  1. If the variables aren’t centred, the intercept is the expected value of the response variable when all explanatory variables are 0. This isn’t always realistic or meaningful.

  2. On the contrary, doing so will usually only serve to make interpretation more difficult.

  3. Prediction intervals provide interval estimates for the new observations. They incorporate both the uncertainty associated with our model estimates, and the fact that the new observation is likely to deviate slightly from its expected value.