# 7 Modern classical statistics

“Modern classical” may sound like a contradiction, but it is in fact anything but. Classical statistics covers topics like estimation, quantification of uncertainty, and hypothesis testing - all of which are at the heart of data analysis. Since the advent of modern computers, much has happened in this field that has yet to make it to the standard textbooks of introductory courses in statistics. This chapter attempts to bridge part of that gap by dealing with those classical topics, but with a modern approach that uses more recent advances in statistical theory and computational methods. Particular focus is put on how simulation can be used for analyses and for evaluating the properties of statistical procedures.

Whenever it is feasible, our aim in this chapter and the next is to:

• Use hypothesis tests that are based on permutations or the bootstrap rather than tests based on strict assumptions about the distribution of the data or asymptotic distributions,
• To complement estimates and hypothesis tests with computing confidence intervals based on sound methods (including the bootstrap),
• Offer easy-to-use Bayesian methods as an alternative to frequentist tools.

After reading this chapter, you will be able to use R to:

• Generate random numbers,
• Perform simulations to assess the performance of statistical methods,
• Perform hypothesis tests,
• Compute confidence intervals,
• Make sample size computations,
• Report statistical results.

## 7.1 Simulation and distributions

A random variable is a variable whose value describes the outcome of a random phenomenon. A (probability) distribution is a mathematical function that describes the probability of different outcomes for a random variable. Random variables and distributions are at the heart of probability theory and most, if not all, statistical models.

As we shall soon see, they are also invaluable tools when evaluating statistical methods. A key component of modern statistical work is simulation, in which we generate artificial data that can be used both in the analysis of real data (e.g. in permutation tests and bootstrap confidence intervals, topics that we’ll explore in this chapter) and for assessing different methods. Simulation is possible only because we can generate random numbers, so let’s begin by having a look at how we can generate random numbers in R.

### 7.1.1 Generating random numbers

The function sample can be used to randomly draw a number of elements from a vector. For instance, we can use it to draw 2 random numbers from the first ten integers: $$1, 2, \ldots, 9, 10$$:

sample(1:10, 2)

Try running the above code multiple times. You’ll get different results each time, because each time it runs the random number generator is in a different state. In most cases, this is desirable (if the results were the same each time we used sample, it wouldn’t be random), but not if we want to replicate a result at some later stage.

When we are concerned about reproducibility, we can use set.seed to fix the state of the random number generator:

# Each run generates different results:
sample(1:10, 2); sample(1:10, 2)

# To get the same result each time, set the seed to a
set.seed(314); sample(1:10, 2)
set.seed(314); sample(1:10, 2)

We often want to use simulated data from a probability distribution, such as the normal distribution. The normal distribution is defined by its mean $$\mu$$ and its variance $$\sigma^2$$ (or, equivalently, its standard deviation $$\sigma$$). There are special functions for generating data from different distributions - for the normal distribution it is called rnorm. We specify the number of observations that we want to generate (n) and the parameters of the distribution (the mean mu and the standard deviation sigma):

rnorm(n = 10, mu = 2, sigma = 1)

# A shorter version:
rnorm(10, 2, 1)

Similarly, there are functions that can be used compute the quantile function, density function, and cumulative distribution function (CDF) of the normal distribution. Here are some examples for a normal distribution with mean 2 and standard deviation 1:

qnorm(0.9, 2, 1)    # Upper 90 % quantile of distribution
dnorm(2.5, 2, 1)    # Density function f(2.5)
pnorm(2.5, 2, 1)    # Cumulative distribution function F(2.5)

$\sim$

Exercise 7.1 Sampling can be done with or without replacement. If replacement is used, an observation can be drawn more than once. Check the documentation for sample. How can you change the settings to sample with replacement? Draw 5 random numbers from the first ten integers, with replacement.

### 7.1.2 Some common distributions

Next, we provide the syntax for random number generation, quantile functions, density/probability functions and cumulative distribution functions for some of the most commonly used distributions. This section is mainly intended as a reference, for you to look up when you need to use one of these distributions - so there is no need to run all the code chunks below right now.

Normal distribution $$N(\mu, \sigma^2)$$ with mean $$\mu$$ and variance $$\sigma^2$$:

rnorm(n, mu, sigma)    # Generate n random numbers
qnorm(0.95, mu, sigma) # Upper 95 % quantile of distribution
dnorm(x, mu, sigma)    # Density function f(x)
pnorm(x, mu, sigma)    # Cumulative distribution function F(X)

Continuous uniform distribution $$U(a,b)$$ on the interval $$(a,b)$$, with mean $$\frac{a+b}{2}$$ and variance $$\frac{(b-a)^2}{12}$$:

runif(n, a, b)    # Generate n random numbers
qunif(0.95, a, b) # Upper 95 % quantile of distribution
dunif(x, a, b)    # Density function f(x)
punif(x, a, b)    # Cumulative distribution function F(X)

Exponential distribution $$Exp(m)$$ with mean $$m$$ and variance $$m^2$$:

rexp(n, 1/m)    # Generate n random numbers
qexp(0.95, 1/m) # Upper 95 % quantile of distribution
dexp(x, 1/m)    # Density function f(x)
pexp(x, 1/m)    # Cumulative distribution function F(X)

Gamma distribution $$\Gamma(\alpha, \beta)$$ with mean $$\frac{\alpha}{\beta}$$ and variance $$\frac{\alpha}{\beta^2}$$:

rgamma(n, alpha, beta)    # Generate n random numbers
qgamma(0.95, alpha, beta) # Upper 95 % quantile of distribution
dgamma(x, alpha, beta)    # Density function f(x)
pgamma(x, alpha, beta)    # Cumulative distribution function F(X)

Lognormal distribution $$LN(\mu, \sigma^2)$$ with mean $$\exp(\mu+\sigma^2/2)$$ and variance $$(\exp(\sigma^2)-1)\exp(2\mu+\sigma^2)$$:

rlnorm(n, mu, sigma)    # Generate n random numbers
qlnorm(0.95, mu, sigma) # Upper 95 % quantile of distribution
dlnorm(x, mu, sigma)    # Density function f(x)
plnorm(x, mu, sigma)    # Cumulative distribution function F(X)

t-distribution $$t(\nu)$$ with mean 0 (for $$\nu>1$$) and variance $$\frac{\nu}{\nu-2}$$ (for $$\nu>2$$):

rt(n, nu)    # Generate n random numbers
qt(0.95, nu) # Upper 95 % quantile of distribution
dt(x, nu)    # Density function f(x)
pt(x, nu)    # Cumulative distribution function F(X)

Chi-squared distribution $$\chi^2(k)$$ with mean $$k$$ and variance $$2k$$:

rchisq(n, k)    # Generate n random numbers
qchisq(0.95, k) # Upper 95 % quantile of distribution
dchisq(x, k)    # Density function f(x)
pchisq(x, k)    # Cumulative distribution function F(X)

F-distribution $$F(d_1, d_2)$$ with mean $$\frac{d_2}{d_2-2}$$ (for $$d_2>2$$) and variance $$\frac{2d_2^2(d_1+d_2-2)}{d_1(d_2-2)^2(d_2-4)}$$ (for $$d_2>4$$):

rf(n, d1, d2)    # Generate n random numbers
qf(0.95, d1, d2) # Upper 95 % quantile of distribution
df(x, d1, d2)    # Density function f(x)
pf(x, d1, d2)    # Cumulative distribution function F(X)

Beta distribution $$Beta(\alpha,\beta)$$ with mean $$\frac{\alpha}{\alpha+\beta}$$ and variance $$\frac{\alpha \beta}{(\alpha+\beta)^2 (\alpha+\beta+1)}$$:

rbeta(n, alpha, beta)    # Generate n random numbers
qbeta(0.95, alpha, beta) # Upper 95 % quantile of distribution
dbeta(x, alpha, beta)    # Density function f(x)
pbeta(x, alpha, beta)    # Cumulative distribution function F(X)

Binomial distribution $$Bin(n,p)$$ with mean $$np$$ and variance $$np(1-p)$$:

rbinom(n, n, p)    # Generate n random numbers
qbinom(0.95, n, p) # Upper 95 % quantile of distribution
dbinom(x, n, p)    # Probability function f(x)
pbinom(x, n, p)    # Cumulative distribution function F(X)

Poisson distribution $$Po(\lambda)$$ with mean $$\lambda$$ and variance $$\lambda$$:

rpois(n, lambda)    # Generate n random numbers
qpois(0.95, lambda) # Upper 95 % quantile of distribution
dpois(x, lambda)    # Probability function f(x)
ppois(x, lambda)    # Cumulative distribution function F(X)

Negative binomial distribution $$NegBin(r, p)$$ with mean $$\frac{rp}{1-p}$$ and variance $$\frac{rp}{(1-p)^2}$$:

rnbinom(n, r, p)    # Generate n random numbers
qnbinom(0.95, r, p) # Upper 95 % quantile of distribution
dnbinom(x, r, p)    # Probability function f(x)
pnbinom(x, r, p)    # Cumulative distribution function F(X)

Multivariate normal distribution with mean vector $$\mu$$ and covariance matrix $$\Sigma$$:

library(MASS)
mvrnorm(n, mu, Sigma) # Generate n random numbers

$\sim$

Exercise 7.2 Use runif and (at least) one of round, ceiling and floor to generate observations from a discrete random variable on the integers $$1, 2, 3, 4, 5, 6, 7, 8, 9, 10$$.

### 7.1.3 Assessing distributional assumptions

So how can we know that the functions for generating random observations from distributions work? And when working with real data, how can we know what distribution fits the data? One answer is that we can visually compare the distribution of the generated (or real) data to the target distribution. This can for instance be done by comparing a histogram of the data to the target distribution’s density function.

To do so, we must add aes(y = ..density..)) to the call to geom_histogram, which rescales the histogram to have area 1 (just like a density function has). We can then add the density function using geom_function:

# Generate data from a normal distribution with mean 10 and
# standard deviation 1
generated_data <- data.frame(normal_data = rnorm(1000, 10, 1))

library(ggplot2)
# Compare to histogram:
ggplot(generated_data, aes(x = normal_data)) +
geom_histogram(colour = "black", aes(y = ..density..)) +
geom_function(fun = dnorm, colour = "red", size = 2,
args = list(mean = mean(generated_data$normal_data), sd = sd(generated_data$normal_data)))

Try increasing the number of observations generated. As the number of observations increase, the histogram should start to look more and more like the density function.

We could also add a density estimate for the generated data, to further aid the eye here - we’d expect this to be close to the theoretical density function:

# Compare to density estimate:
ggplot(generated_data, aes(x = normal_data)) +
geom_histogram(colour = "black", aes(y = ..density..)) +
geom_density(colour = "blue", size = 2) +
geom_function(fun = dnorm, colour = "red", size = 2,
args = list(mean = mean(generated_data$normal_data), sd = sd(generated_data$normal_data)))

If instead we wished to compare the distribution of the data to a $$\chi^2$$ distribution, we would change the value of fun and args in geom_function accordingly:

# Compare to density estimate:
ggplot(generated_data, aes(x = normal_data)) +
geom_histogram(colour = "black", aes(y = ..density..)) +
geom_density(colour = "blue", size = 2) +
geom_function(fun = dchisq, colour = "red", size = 2,
args = list(df = mean(generated_data$normal_data))) Note that the values of args have changed. args should always be a list containing values for the parameters of the distribution: mu and sigma for the normal distribution and df for the $$\chi^2$$ distribution (the same as in Section 7.1.2). Another option is to draw a quantile-quantile plot, or Q-Q plot for short, which compares the theoretical quantiles of a distribution to the empirical quantiles of the data, showing each observation as a point. If the data follows the theorised distribution, then the points should lie more or less along a straight line. To draw a Q-Q plot for a normal distribution, we use the geoms geom_qq and geom_qq_line: # Q-Q plot for normality: ggplot(generated_data, aes(sample = normal_data)) + geom_qq() + geom_qq_line() For all other distributions, we must provide the quantile function of the distribution (many of which can be found in Section 7.1.2): # Q-Q plot for the lognormal distribution: ggplot(generated_data, aes(sample = normal_data)) + geom_qq(distribution = qlnorm) + geom_qq_line(distribution = qlnorm) Q-Q-plots can be a little difficult to read. There will always be points deviating from the line - in fact, that’s expected. So how much must they deviate before we rule out a distributional assumption? Particularly when working with real data, I like to compare the Q-Q-plot of my data to Q-Q-plots of simulated samples from the assumed distribution, to get a feel for what kind of deviations can appear if the distributional assumption holds. Here’s an example of how to do this, for the normal distribution: # Look at solar radiation data for May from the airquality # dataset: May <- airquality[airquality$Month == 5,]

# Create a Q-Q-plot for the solar radiation data, and store
# it in a list:
qqplots <- list(ggplot(May, aes(sample = Solar.R)) +
geom_qq() + geom_qq_line() + ggtitle("Actual data"))

# Compute the sample size n:
n <- sum(!is.na(May$Temp)) # Generate 8 new datasets of size n from a normal distribution. # Then draw Q-Q-plots for these and store them in the list: for(i in 2:9) { generated_data <- data.frame(normal_data = rnorm(n, 10, 1)) qqplots[[i]] <- ggplot(generated_data, aes(sample = normal_data)) + geom_qq() + geom_qq_line() + ggtitle("Simulated data") } # Plot the resulting Q-Q-plots side-by-side: library(patchwork) (qqplots[] + qqplots[] + qqplots[]) / (qqplots[] + qqplots[] + qqplots[]) / (qqplots[] + qqplots[] + qqplots[]) You can run the code several times, to get more examples of what Q-Q-plots can look like when the distributional assumption holds. In this case, the tail points in the Q-Q-plot for the solar radiation data deviate from the line more than the tail points in most simulated examples do, and personally, I’d be reluctant to assume that the data comes from a normal distribution. $\sim$ Exercise 7.3 Investigate the sleeping times in the msleep data from the ggplot2 package. Do they appear to follow a normal distribution? A lognormal distribution? (Click here to go to the solution.) Exercise 7.4 Another approach to assessing distributional assumptions for real data is to use formal hypothesis tests. One example is the Shapiro-Wilk test for normality, available in shapiro.test. The null hypothesis is that the data comes from a normal distribution, and the alternative is that it doesn’t (meaning that a low p-value is supposed to imply non-normality). 1. Apply shapiro.test to the sleeping times in the msleep dataset. According to the Shapiro-Wilk test, is the data normally distributed? 2. Generate 2,000 observations from a $$\chi^2(100)$$ distribution. Compare the histogram of the generated data to the density function of a normal distribution. Are they similar? What are the results when you apply the Shapiro-Wilk test to the data? (Click here to go to the solution.) ### 7.1.4 Monte Carlo integration In this chapter, we will use simulation to compute p-values and confidence intervals, to compare different statistical methods, and to perform sample size computations. Another important use of simulation is in Monte Carlo integration, in which random numbers are used for numerical integration. It plays an important role in for instance statistical physics, computational biology, computational linguistics, and Bayesian statistics; fields that require the computation of complicated integrals. To create an example of Monte Carlo integration, let’s start by writing a function, circle, that defines a quarter-circle on the unit square. We will then plot it using the geom geom_function: circle <- function(x) { return(sqrt(1-x^2)) } ggplot(data.frame(x = c(0, 1)), aes(x)) + geom_function(fun = circle) Let’s say that we are interest in computing the area under quarter-circle. We can highlight the area in our plot using geom_area: ggplot(data.frame(x = seq(0, 1, 1e-4)), aes(x)) + geom_area(aes(x = x, y = ifelse(x^2 + circle(x)^2 <= 1, circle(x), 0)), fill = "pink") + geom_function(fun = circle) To find the area, we will generate a large number of random points uniformly in the unit square. By the law of large numbers, the proportion of points that end up under the quarter-circle should be close to the area under the quarter-circle49. To do this, we generate 10,000 random values for the $$x$$ and $$y$$ coordinates of each point using the $$U(0,1)$$ distribution, that is, using runif: B <- 1e4 unif_points <- data.frame(x = runif(B), y = runif(B)) Next, we add the points to our plot: ggplot(unif_points, aes(x, y)) + geom_area(aes(x = x, y = ifelse(x^2 + circle(x)^2 <= 1, circle(x), 0)), fill = "pink") + geom_point(size = 0.5, alpha = 0.25, colour = ifelse(unif_points$x^2 + unif_points$y^2 <= 1, "red", "black")) + geom_function(fun = circle) Note the order in which we placed the geoms - we plot the points after the area so that the pink colour won’t cover the points, and the function after the points so that the points won’t cover the curve. To estimate the area, we compute the proportion of points that are below the curve: mean(unif_points$x^2 + unif_points$y^2 <= 1) In this case, we can also compute the area exactly: $$\int_0^1\sqrt{1-x^2}dx=\pi/4=0.7853\ldots$$. For more complicated integrals, however, numerical integration methods like Monte Carlo integration may be required. That being said, there are better numerical integration methods for low-dimensional integrals like this one. Monte Carlo integration is primarily used for higher-dimensional integrals, where other techniques fail. ## 7.2 Student’s t-test revisited For decades teachers all over the world have been telling the story of William Sealy Gosset: the head brewer at Guinness who derived the formulas used for the t-test and, following company policy, published the results under the pseudonym “Student.” Gosset’s work was hugely important, but the passing of time has rendered at least parts of it largely obsolete. His distributional formulas were derived out of necessity: lacking the computer power that we have available to us today, he was forced to impose the assumption of normality on the data, in order to derive the formulas he needed to be able to carry out his analyses. Today we can use simulation to carry out analyses with fewer assumptions. As an added bonus, these simulation techniques often happen to result in statistical methods with better performance than Student’s t-test and other similar methods. ### 7.2.1 The old-school t-test The really old-school way of performing a t-test - the way statistical pioneers like Gosset and Fisher would have done it - is to look up p-values using tables covering several pages. There haven’t really been any excuses for doing that since the advent of the personal computer though, so let’s not go further into that. The “modern” version of the old-school t-test uses numerical evaluation of the formulas for Student’s t-distribution to compute p-values and confidence intervals. Before we delve into more modern approaches, let’s look at how we can run an old-school t-test in R. In Section 3.6 we used t.test to run a t-test to see if there is a difference in how long carnivores and herbivores sleep, using the msleep data from ggplot250. First, we subtracted a subset of the data corresponding to carnivores and herbivores, and then we ran the test. There are in fact several different ways of doing this, and it is probably a good idea to have a look at them. In the approach used in Section 3.6 we created two vectors, using bracket notation, and then used those as arguments for t.test: library(ggplot2) carnivores <- msleep[msleep$vore == "carni",]
herbivores <- msleep[msleep$vore == "herbi",] t.test(carnivores$sleep_total, herbivores$sleep_total) Alternatively, we could have used formula notation, as we e.g. did for the linear model in Section 3.7. We’d then have to use the data argument in t.test to supply the data. By using subset, we can do the subsetting simultaneously: t.test(sleep_total ~ vore, data = subset(msleep, vore == "carni" | vore == "herbi")) Unless we are interested in keeping the vectors carnivores and herbivores for other purposes, this latter approach is arguably more elegant. Speaking of elegance, the data argument also makes it easy to run a t-test using pipes. Here is an example, where we use filter from dplyr to do the subsetting: library(dplyr) msleep %>% filter(vore == "carni" | vore == "herbi") %>% t.test(sleep_total ~ vore, data = .) We could also use the magrittr pipe %$% from Section 6.2 to pass the variables from the filtered subset of msleep, avoiding the data argument:

library(magrittr)
msleep %>% filter(vore == "carni" | vore == "herbi") %$% t.test(sleep_total ~ vore) There are even more options than this - the point that I’m trying to make here is that like most functions in R, you can use functions for classical statistics in many different ways. In what follows, I will show you one or two of these, but don’t hesitate to try out other approaches if they seem better to you. What we just did above was a two-sided t-test, where the null hypothesis was that there was no difference in means between the groups, and the alternative hypothesis that there was a difference. We can also perform one-sided tests using the alternative argument. alternative = "greater" means that the alternative is that the first group has a greater mean, and alternative = "less" means that the first group has a smaller mean. Here is an example with the former: t.test(sleep_total ~ vore, data = subset(msleep, vore == "carni" | vore == "herbi"), alternative = "greater") By default, R uses the Welch two-sample t-test, meaning that it is not assumed that the groups have equal variances. If you don’t want to make that assumption, you can add var.equal = TRUE: t.test(sleep_total ~ vore, data = subset(msleep, vore == "carni" | vore == "herbi"), var.equal = TRUE) In addition to two-sample t-tests, t.test can also be used for one-sample tests and paired t-tests. To perform a one-sample t-test, all we need to do is to supply a single vector with observations, along with the value of the mean $$\mu$$ under the null hypothesis. I usually sleep for about 7 hours each night, and so if I want to test whether that is true for an average mammal, I’d use the following: t.test(msleep$sleep_total, mu = 7)

As we can see from the output, your average mammal sleeps for 10.4 hours per day. Moreover, the p-value is quite low - apparently, I sleep unusually little for a mammal!

As for paired t-tests, we can perform them by supplying two vectors (where element 1 of the first vector corresponds to element 1 of the second vector, and so on) and the argument paired = TRUE. For instance, using the diamonds data from ggplot2, we could run a test to see if the length x of diamonds with a fair quality of the cut on average equals the width y:

fair_diamonds <- subset(diamonds, cut == "Fair")
t.test(fair_diamonds$x, fair_diamonds$y, paired = TRUE)

$\sim$

Exercise 7.5 Load the VAS pain data vas.csv from Exercise 3.8. Perform a one-sided t-test to see test the null hypothesis that the average VAS among the patients during the time period is less than or equal to 6.

### 7.2.2 Permutation tests

Maybe it was a little harsh to say that Gosset’s formulas have become obsolete. The formulas are mathematical approximations to the distribution of the test statistics under the null hypothesis. The truth is that they work very well as long as your data is (nearly) normally distributed. The two-sample test also works well for non-normal data as long as you have balanced sample sizes, that is, equally many observations in both groups. However, for one-sample tests, and two-sample tests with imbalanced sample sizes, there are better ways to compute p-values and confidence intervals than to use Gosset’s traditional formulas.

The first option that we’ll look at is permutation tests. Let’s return to our mammal sleeping times example, where we wanted to investigate whether there are differences in how long carnivores and herbivores sleep on average:

t.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))

There are 19 carnivores and 32 herbivores - 51 animals in total. If there are no differences between the two groups, the vore labels offer no information about how long the animals sleep each day. Under the null hypothesis, the assignment of vore labels to different animals is therefore for all intents and purposes random. To find the distribution of the test statistic under the null hypothesis, we could look at all possible ways to assign 19 animals the label carnivore and 32 animals the label herbivore. That is, look at all permutations of the labels. The probability of a result at least as extreme as that obtained in our sample (in the direction of the alternative), i.e. the p-value, would then be the proportion of permutations that yield a result at least extreme as that in our sample. This is known as a permutation test.

Permutation tests were known to the likes of Gosset and Fisher (Fisher’s exact test is a common example), but because the number of permutations of labels often tend to become quite large (76,000 billion, in our carnivore-herbivore example), they lacked the means actually to use them. 76,000 billion permutations may be too many even today, but we can obtain very good approximations of the p-values of permutation tests using simulation.

The idea is that we look at a large number of randomly selected permutations, and check for how many of them we obtain a test statistic that is more extreme than the sample test statistic. The law of large number guarantees that this proportion will converge to the permutation test p-value as the number of randomly selected permutations increases.

Let’s have a go!

# Filter the data, to get carnivores and herbivores:
data <- subset(msleep, vore == "carni" | vore == "herbi")

# Compute the sample test statistic:
sample_t <- t.test(sleep_total ~ vore, data = data)$statistic # Set the number of random permutations and create a vector to # store the result in: B <- 9999 permutation_t <- vector("numeric", B) # Start progress bar: pbar <- txtProgressBar(min = 0, max = B, style = 3) # Compute the test statistic for B randomly selected permutations for(i in 1:B) { # Draw a permutation of the labels: data$vore <- sample(data$vore, length(data$vore),
replace = FALSE)

# Compute statistic for permuted sample:
permutation_t[i] <- t.test(sleep_total ~ vore,
data = data)$statistic # Update progress bar setTxtProgressBar(pbar, i) } close(pbar) # In this case, with a two-sided alternative hypothesis, a # "more extreme" test statistic is one that has a larger # absolute value than the sample test statistic. # Compute approximate permutation test p-value: mean(abs(permutation_t) > abs(sample_t)) In this particular example, the resulting p-value is pretty close to that from the old-school t-test. However, we will soon see examples where the two versions of the t-test differ more. You may ask why we used 9,999 permutations and not 10,000. The reason is that we avoid p-values that are equal to traditional significance levels like 0.05 and 0.01 this way. If we’d used 10,000 permutations, 500 of which yielded a statistics that had a larger absolute value than the sample statistic, then the p-value would have been exactly 0.05, which would cause some difficulties in trying to determine whether or not the result was significant at the 5 % level. This cannot happen when we use 9,999 permutations instead (500 statistics with a large absolute value yields the p-value $$0.050005>0.05$$, and 499 yields the p-value $$0.0499<0.05$$). Having to write a for loop every time we want to run a t-test seems unnecessarily complicated. Fortunately, others have tread this path before us. The MKinfer package contains a function to perform (approximate) permutation t-tests, which also happens to be faster than our implementation above. Let’s install it: install.packages("MKinfer") The function for the permutation t-test, perm.t.test, works exactly like t.test. In all the examples from Section 7.2.1 we can replace t.test with perm.t.test to run a permutation t-test instead. Like so: library(MKinfer) perm.t.test(sleep_total ~ vore, data = subset(msleep, vore == "carni" | vore == "herbi")) Note that two p-values and confidence intervals are presented: one set from the permutations and one from the old-school approach - so make sure that you look at the right ones! You may ask how many randomly selected permutations we need to get an accurate approximation of the permutation test p-value. By default, perm.t.test uses 9,999 permutations (you can change that number using the argument R), which is widely considered to be a reasonable number. If you are running a permutation test with a much more complex (and computationally intensive) statistic, you may have to use a lower number, but avoid that if you can. ### 7.2.3 The bootstrap A popular method for computing p-values and confidence intervals that resembles the permutation approach is the bootstrap. Instead of drawing permuted samples, new observations are drawn with replacement from the original sample, and then labels are randomly allocated to them. That means that each randomly drawn sample will differ not only in the permutation of labels, but also in what observations are included - some may appear more than once and some not at all. We will have a closer look at the bootstrap in Section 7.7, where we will learn how to use it for creating confidence intervals and computing p-values for any test statistic. For now, we’ll just note that MKinfer offers a bootstrap version of the t-test, boot.t.test : library(MKinfer) boot.t.test(sleep_total ~ vore, data = subset(msleep, vore == "carni" | vore == "herbi")) Both perm.test and boot.test have a useful argument called symmetric, the details of which are discussed in depth in Section 12.3. ### 7.2.4 Saving the output When we run a t-test, the results are printed in the Console. But we can also store the results in a variable, which allows us to access e.g. the p-value of the test: library(ggplot2) carnivores <- msleep[msleep$vore == "carni",]
herbivores <- msleep[msleep$vore == "herbi",] test_result <- t.test(sleep_total ~ vore, data = subset(msleep, vore == "carni" | vore == "herbi")) test_result What does the resulting object look like? str(test_result) As you can see, test_result is a list containing different parameters and vectors for the test. To get the p-value, we can run the following: test_result$p.value

### 7.2.5 Multiple testing

Some programming tools from Section 6.4 can be of use if we wish to perform multiple t-tests. For example, maybe we want to make pairwise comparisons of the sleeping times of all the different feeding behaviours in msleep: carnivores, herbivores, insectivores and omnivores. To find all possible pairs, we can use a nested for loop (Section 6.4.2). Note how the indices i and j that we loop over are set so that we only run the test for each combination once:

library(MKinfer)

# List the different feeding behaviours (ignoring NA's):
vores <- na.omit(unique(msleep$vore)) B <- length(vores) # Compute the number of pairs, and create an appropriately # sized data frame to store the p-values in: n_comb <- choose(B, 2) p_values <- data.frame(group1 = vector("character", n_comb), group2 = vector("character", n_comb), p = vector("numeric", n_comb)) # Loop over all pairs: k <- 1 # Counter variable for(i in 1:(B-1)) { for(j in (i+1):B) { # Run a t-test for the current pair: test_res <- perm.t.test(sleep_total ~ vore, data = subset(msleep, vore == vores[i] | vore == vores[j])) # Store the p-value: p_values[k, ] <- c(vores[i], vores[j], test_res$p.value)
# Increase the counter variable:
k <- k + 1
}
}

To view the p-values for each pairwise test, we can now run:

p_values

When we run multiple tests, the risk for a type I error increases, to the point where we’re virtually guaranteed to get a significant result. We can reduce the risk of false positive results and adjust the p-values for multiplicity using for instance Bonferroni correction, Holm’s method (an improved version of the standard Bonferroni approach), or the Benjamini-Hochberg approach (which controls the false discovery rate and is useful if you for instance are screening a lot of variables for differences), using p.adjust:

p.adjust(p_values$p, method = "bonferroni") p.adjust(p_values$p, method = "holm")
herbivores <- msleep[msleep$vore == "herbi",] wilcox.test(carnivores$sleep_total, herbivores$sleep_total) Or use a formula: wilcox.test(sleep_total ~ vore, data = subset(msleep, vore == "carni" | vore == "herbi")) ### 7.3.2 Tests for correlation To test the null hypothesis that two numerical variables are correlated, we can use cor.test. Let’s try it with sleeping times and brain weight, using the msleep data again: library(ggplot2) cor.test(msleep$sleep_total, msleep$brainwt, use = "pairwise.complete") The setting use = "pairwise.complete" means that NA values are ignored. cor.test doesn’t have a data argument, so if you want to use it in a pipeline I recommend using the %$% pipe (Section 6.2) to pass on the vectors from your data frame:

library(magrittr)
msleep %$% cor.test(sleep_total, brainwt, use = "pairwise.complete") The test we just performed uses the Pearson correlation coefficient as its test statistic. If you prefer, you can use the nonparametric Spearman and Kendall correlation coefficients in the test instead, by changing the value of method: # Spearman test of correlation: cor.test(msleep$sleep_total, msleep$brainwt, use = "pairwise.complete", method = "spearman") These tests are all based on asymptotic approximations, which among other things causes the Pearson correlation test perform poorly for non-normal data. In Section 7.7 we will create a bootstrap version of the correlation test, which has better performance. ### 7.3.3$$\chi^2$$-tests $$\chi^2$$ (chi-squared) tests are most commonly used to test whether two categorical variables are independent. To use it, we must first construct a contingency table, i.e. a table showing the counts for different combinations of categories, typically using table. Here is an example with the diamonds data from ggplot2: library(ggplot2) table(diamonds$cut, diamonds$color) The null hypothesis of our test is that the quality of the cut (cut) and the colour of the diamond (color) are independent, with the alternative being that they are dependent. We use chisq.test with the contingency table as input to run the $$\chi^2$$ test of independence: chisq.test(table(diamonds$cut, diamonds$color)) By default, chisq.test uses an asymptotic approximation of the p-value. For small sample sizes, it is almost often better to use permutation p-values by setting simulate.p.value = TRUE (but here the sample is not small, and so the computation of the permutation test will take a while): chisq.test(table(diamonds$cut, diamonds$color), simulate.p.value = TRUE) As with t.test, we can use pipes to perform the test if we like: library(magrittr) diamonds %$% table(cut, color) %>%
chisq.test()

If both of the variables are binary, i.e. only take two values, the power of the test can be approximated using power.prop.test. Let’s say that we have two variables, $$X$$ and $$Y$$, taking the values 0 and 1. Assume that we collect $$n$$ observations with $$X=0$$ and $$n$$ with $$X=1$$. Furthermore, let p1 be the probability that $$Y=1$$ if $$X=0$$ and p2 be the probability that $$Y=1$$ if $$X=1$$. We can then use power.prop.test as follows:

# Assume that n = 50, p1 = 0.4 and p2 = 0.5 and compute the power:
power.prop.test(n = 50, p1 = 0.4, p2 = 0.5, sig.level = 0.05)

# Assume that p1 = 0.4 and p2 = 0.5 and that we want 85 % power.
# To compute the sample size required:
power.prop.test(power = 0.85, p1 = 0.4, p2 = 0.5, sig.level = 0.05)

### 7.3.4 Confidence intervals for proportions

The different t-test functions provide confidence intervals for means and differences of means. But what about proportions? The binomCI function in the MKinfer package allows us to compute confidence intervals for proportions from binomial experiments using a number of methods. The input is the number of “successes” x, the sample size n, and the method to be used.

Let’s say that we want to compute a confidence interval for the proportion of herbivore mammals that sleep for more than 7 hours a day.

library(ggplot2)
herbivores <- msleep[msleep$vore == "herbi",] # Compute the number of animals for which we know the sleeping time: n <- sum(!is.na(herbivores$sleep_total))

# Compute the number of "successes", i.e. the number of animals
# that sleep for more than 7 hours:
x <- sum(herbivores$sleep_total > 7, na.rm = TRUE) The estimated proportion is x/n, which in this case is 0.625. We’d like to quantify the uncertainty in this estimate by computing a confidence interval. The standard Wald method, taught in most introductory courses, can be computed using: library(MKinfer) binomCI(x, n, conf.level = 0.95, method = "wald") Don’t do that though! The Wald interval is known to be severely flawed (Brown et al., 2001), and much better options are available. If the proportion can be expected to be close to 0 or 1, the Clopper-Pearson interval is recommended, and otherwise the Wilson interval is the best choice (Thulin, 2014a): binomCI(x, n, conf.level = 0.95, method = "clopper-pearson") binomCI(x, n, conf.level = 0.95, method = "wilson") An excellent Bayesian credible interval is the Jeffreys interval, which uses the weakly informative Jeffreys prior: binomCI(x, n, conf.level = 0.95, method = "jeffreys") The ssize.propCI function in MKpower can be used to compute the sample size needed to obtain a confidence interval with a given width51. It relies on asymptotic formulas that are highly accurate, as you later on will verify in Exercise 7.17. library(MKpower) # Compute the sample size required to obtain an interval with # width 0.1 if the true proportion is 0.4: ssize.propCI(prop = 0.4, width = 0.1, method = "wilson") ssize.propCI(prop = 0.4, width = 0.1, method = "clopper-pearson") $\sim$ Exercise 7.7 The function binomDiffCI from MKinfer can be used to compute a confidence interval for the difference of two proportions. Using the msleep data, use it to compute a confidence interval for the difference between the proportion of herbivores that sleep for more than 7 hours a day and the proportion of carnivores that sleep for more than 7 hours a day. (Click here to go to the solution.) ## 7.4 Ethical issues in statistical inference The use and misuse of statistical inference offer many ethical dilemmas. Some common issues related to ethics and good statistical practice are discussed below. As you read them and work with the associated exercises, consider consulting the ASA’s ethical guidelines, presented in Section 3.11. ### 7.4.1 p-hacking and the file-drawer problem Hypothesis tests are easy to misuse. If you run enough tests on your data, you are almost guaranteed to end up with significant results - either due to chance or because some of the null hypotheses you test are false. The process of trying lots of different tests (different methods, different hypotheses, different sub-groups) in search of significant results is known as p-hacking or data dredging. This greatly increases the risk of false findings, and can often produce misleading results. Many practitioners inadvertently resort to p-hacking, by mixing exploratory data analysis and hypothesis testing, or by coming up with new hypotheses to test as they work with their data. This can be avoided by planning your analyses in advance, a practice that in fact is required in medical trials. On the other end of the spectrum, there is the file-drawer problem, in which studies with negative (i.e. not statistically significant) results aren’t published or reported, but instead are stored in the researcher’s file-drawers. There are many reasons for this, one being that negative results usually are seen as less important and less worthy of spending time on. Simply put, negative results just aren’t news. If your study shows that eating kale every day significantly reduces the risk of cancer, then that is news, something that people are interested in learning, and something that can be published in a prestigious journal. However, if your study shows that a daily serving of kale has no impact on the risk of cancer, that’s not news, people aren’t really interested in hearing it, and it may prove difficult to publish your findings. But what if 100 different researchers carried out the same study? If eating kale doesn’t affect the risk of cancer, then we can still expect 5 out of these researchers to get significant results (using a 5 % significance level). If only those researchers publish their results, that may give the impressions that there is strong evidence of the cancer-preventing effect of kale backed up by several papers, even though the majority of studies actually indicated that there was no such effect. $\sim$ Exercise 7.8 Discuss the following. You are helping a research team with statistical analysis of data that they have collected. You agree on five hypotheses to test. None of the tests turns out significant. Fearing that all their hard work won’t lead anywhere, your collaborators then ask you to carry out five new tests. Neither turns out significant. Your collaborators closely inspect the data and then ask you to carry out ten more tests, two of which are significant. The team wants to publish these significant results in a scientific journal. Should you agree to publish them? If so, what results should be published? Should you have put your foot down and told them not to run more tests? Does your answer depend on how long it took the research team to collect the data? What if the team won’t get funding for new projects unless they publish a paper soon? What if other research teams competing for the same grants do their analyses like this? Exercise 7.9 Discuss the following. You are working for a company that is launching a new product, a hair-loss treatment. In a small study, the product worked for 19 out of 22 participants (86 %). You compute a 95 % Clopper-Pearson confidence interval (Section 7.3.4) for the proportion of successes and find that it is (0.65, 0.97). Based on this, the company wants to market the product as being 97 % effective. Is that acceptable to you? If not, how should it be marketed? Would your answer change if the product was something else (new running shoes that make you faster, a plastic film that protects smartphone screens from scratches, or contraceptives)? What if the company wanted to market it as being 86 % effective instead? Exercise 7.10 Discuss the following. You have worked long and hard on a project. In the end, to see if the project was a success, you run a hypothesis test to check if two variables are correlated. You find that they are not (p = 0.15). However, if you remove three outliers, the two variables are significantly correlated (p = 0.03). What should you do? Does your answer change if you only have to remove one outlier to get a significant result? If you have to remove ten outliers? 100 outliers? What if the p-value is 0.051 before removing the outliers and 0.049 after removing the outliers? Exercise 7.11 Discuss the following. You are analysing data from an experiment to see if there is a difference between two treatments. You estimate52 that given the sample size and the expected difference in treatment effects, the power of the test that you’ll be using, i.e. the probability of rejecting the null hypothesis if it is false, is about 15 %. Should you carry out such an analysis? If not, how high does the power need to be for the analysis to be meaningful? ### 7.4.2 Reproducibility An analysis is reproducible if it can be reproduced by someone else. By producing reproducible analyses, we make it easier for others to scrutinise our work. We also make all the steps in the data analysis transparent. This can act as a safeguard against data fabrication and data dredging. In order to make an analysis reproducible, we need to provide at least two things. First, the data - all unedited data files in their original format. This also includes metadata with information required to understand the data (e.g. codebooks explaining variable names and codes used for categorical variables). Second, the computer code used to prepare and analyse the data. This includes any wrangling and preliminary testing performed on the data. As long as we save our data files and code, data wrangling and analyses in R are inherently reproducible, in contrast to the same tasks carried out in menu-based software such as Excel. However, if reports are created using a word processor, there is always a risk that something will be lost along the way. Perhaps numbers are copied by hand (which may introduce errors), or maybe the wrong version of a figure is pasted into the document. R Markdown (Section 4.1) is a great tool for creating completely reproducible reports, as it allows you to integrate R code for data wrangling, analyses, and graphics in your report-writing. This reduces the risk of manually inserting errors, and allows you to share your work with others easily. $\sim$ Exercise 7.12 Discuss the following. You are working on a study at a small-town hospital. The data involves biomarker measurements for a number of patients, and you show that patients with a sexually transmittable disease have elevated levels of some of the biomarkers. The data also includes information about the patients: their names, ages, ZIP codes, heights, and weights. The research team wants to publish your results and make the analysis reproducible. Is it ethically acceptable to share all your data? Can you make the analysis reproducible without violating patient confidentiality? ## 7.5 Evaluating statistical methods using simulation An important use of simulation is in the evaluation of statistical methods. In this section, we will see how simulation can be used to compare the performance of two estimators, as well as the type I error rate and power of hypothesis tests. ### 7.5.1 Comparing estimators Let’s say that we want to estimate the mean $$\mu$$ of a normal distribution. We could come up with several different estimators for $$\mu$$: • The sample mean $$\bar{x}$$, • The sample median $$\tilde{x}$$, • The average of the largest and smallest value in the sample: $$\frac{x_{max}+x_{min}}{2}$$. In this particular case (under normality), statistical theory tells us that the sample mean is the best estimator53. But how much better is it, really? And what if we didn’t know statistical theory - could we use simulation to find out which estimator to use? To begin with, let’s write a function that computes the estimate $$\frac{x_{max}+x_{min}}{2}$$: max_min_avg <- function(x) { return((max(x)+min(x))/2) } Next, we’ll generate some data from a $$N(0,1)$$ distribution and compute the three estimates: x <- rnorm(25) x_mean <- mean(x) x_median <- median(x) x_mma <- max_min_avg(x) x_mean; x_median; x_mma As you can see, the estimates given by the different approaches differ, so clearly the choice of estimator matters. We can’t determine which to use based on a single sample though. Instead, we typically compare the long-run properties of estimators, such as their bias and variance. The bias is the difference between the mean of the estimator and the parameter it seeks to estimate. An estimator is unbiased if its bias is 0, which is considered desirable at least in this setting. Among unbiased estimators, we prefer the one that has the smallest variance. So how can we use simulation to compute the bias and variance of estimators? The key to using simulation here is to realise that x_mean is an observation of the random variable $$\bar{X}= \frac{1}{25}(X_1+X_2+\cdots+X_{25})$$ where each $$X_i$$ is $$N(0, 1)$$-distributed. We can generate observations of $$X_i$$ (using rnorm), and can therefore also generate observations of $$\bar{X}$$. That means that we can obtain an arbitrarily large sample of observations of $$\bar{X}$$, which we can use to estimate its mean and variance. Here is an example: # Set the parameters for the normal distribution: mu <- 0 sigma <- 1 # We will generate 10,000 observations of the estimators: B <- 1e4 res <- data.frame(x_mean = vector("numeric", B), x_median = vector("numeric", B), x_mma = vector("numeric", B)) # Start progress bar: pbar <- txtProgressBar(min = 0, max = B, style = 3) for(i in seq_along(res$x_mean))
{
x <- rnorm(25, mu, sigma)
res$x_mean[i] <- mean(x) res$x_median[i] <- median(x)
res$x_mma[i] <- max_min_avg(x) # Update progress bar setTxtProgressBar(pbar, i) } close(pbar) # Compare the estimators: colMeans(res-mu) # Bias apply(res, 2, var) # Variances All three estimators appear to be unbiased (even if the simulation results aren’t exactly 0, they are very close). The sample mean has the smallest variance (and is therefore preferable!), followed by the median. The $$\frac{x_{max}+x_{min}}{2}$$ estimator has the worst performance, which is unsurprising as it ignores all information not contained in the extremes of the dataset. In Section 7.5.5 we’ll discuss how to choose the number of simulated samples to use in your simulations. For now, we’ll just note that the estimate of the estimators’ biases becomes more stable as the number of simulated samples increases, as can be seen from this plot, which utilises cumsum, described in Section 5.3.3: # Compute estimates of the bias of the sample mean for each # iteration: res$iterations <- 1:B
res$x_mean_bias <- cumsum(res$x_mean)/1:B - mu

# Plot the results:
library(ggplot2)
ggplot(res, aes(iterations, x_mean_bias)) +
geom_line() +
xlab("Number of iterations") +
ylab("Estimated bias")

# Cut the x-axis to better see the oscillations for smaller
# numbers of iterations:
ggplot(res, aes(iterations, x_mean_bias)) +
geom_line() +
xlab("Number of iterations") +
ylab("Estimated bias") +
xlim(0, 1000)

$\sim$

Exercise 7.13 Repeat the above simulation for different samples sizes $$n$$ between 10 and 100. Plot the resulting variances as a function of $$n$$.

Exercise 7.14 Repeat the simulation in 7.13, but with a $$t(3)$$ distribution instead of the normal distribution. Which estimator is better in this case?

### 7.5.2 Type I error rate of hypothesis tests

In the same vein that we just compared estimators, we can also compare hypothesis tests or confidence intervals. Let’s have a look at the former, and evaluate how well the old-school two-sample t-test fares compared to a permutation t-test and the Wilcoxon-Mann-Whitney test.

For our first comparison, we will compare the type I error rate of the three tests, i.e. the risk of rejecting the null hypothesis if the null hypothesis is true. Nominally, this is the significance level $$\alpha$$, which we set to be 0.05.

We write a function for such a simulation, to which we can pass the sizes n1 and n2 of the two samples, as well as a function distr to generate data:

# Load package used for permutation t-test:
library(MKinfer)

# Create a function for running the simulation:
simulate_type_I <- function(n1, n2, distr, level = 0.05, B = 999,
alternative = "two.sided", ...)
{
# Create a data frame to store the results in:
p_values <- data.frame(p_t_test = vector("numeric", B),
p_perm_t_test = vector("numeric", B),
p_wilcoxon = vector("numeric", B))

# Start progress bar:
pbar <- txtProgressBar(min = 0, max = B, style = 3)

for(i in 1:B)
{
# Generate data:
x <- distr(n1, ...)
y <- distr(n2, ...)

# Compute p-values:
p_values[i, 1] <- t.test(x, y,
alternative = alternative)$p.value p_values[i, 2] <- perm.t.test(x, y, alternative = alternative, R = 999)$perm.p.value
p_values[i, 3] <- wilcox.test(x, y,
alternative = alternative)$p.value # Update progress bar: setTxtProgressBar(pbar, i) } close(pbar) # Return the type I error rates: return(colMeans(p_values < level)) } First, let’s try it with normal data. The simulation takes a little while to run, primarily because of the permutation t-test, so you may want to take a short break while you wait. simulate_type_I(20, 20, rnorm, B = 9999) Next, let’s try it with a lognormal distribution, both with balanced and imbalanced sample sizes. Increasing the parameter $$\sigma$$ (sdlog) increases the skewness of the lognormal distribution (i.e. makes it more asymmetric and therefore less similar to the normal distribution), so let’s try that to. In case you are in a rush, the results from my run of this code block can be found below it. simulate_type_I(20, 20, rlnorm, B = 9999, sdlog = 1) simulate_type_I(20, 20, rlnorm, B = 9999, sdlog = 3) simulate_type_I(20, 30, rlnorm, B = 9999, sdlog = 1) simulate_type_I(20, 30, rlnorm, B = 9999, sdlog = 3) My results were: # Normal distribution, n1 = n2 = 20: p_t_test p_perm_t_test p_wilcoxon 0.04760476 0.04780478 0.04680468 # Lognormal distribution, n1 = n2 = 20, sigma = 1: p_t_test p_perm_t_test p_wilcoxon 0.03320332 0.04620462 0.04910491 # Lognormal distribution, n1 = n2 = 20, sigma = 3: p_t_test p_perm_t_test p_wilcoxon 0.00830083 0.05240524 0.04590459 # Lognormal distribution, n1 = 20, n2 = 30, sigma = 1: p_t_test p_perm_t_test p_wilcoxon 0.04080408 0.04970497 0.05300530 # Lognormal distribution, n1 = 20, n2 = 30, sigma = 3: p_t_test p_perm_t_test p_wilcoxon 0.01180118 0.04850485 0.05240524  What’s noticeable here is that the permutation t-test and the Wilcoxon-Mann-Whitney test have type I error rates that are close to the nominal 0.05 in all five scenarios, whereas the t-test has too low a type I error rate when the data comes from a lognormal distribution. This makes the test too conservative in this setting. Next, let’s compare the power of the tests. ### 7.5.3 Power of hypothesis tests The power of a test is the probability of rejecting the null hypothesis if it is false. To estimate that, we need to generate data under the alternative hypothesis. For two-sample tests of the mean, the code is similar to what we used for the type I error simulation above, but we now need two functions for generating data - one for each group, because the groups differ under the alternative hypothesis. Bear in mind that the alternative hypothesis for the two-sample test is that the two distributions differ in location, so the two functions for generating data should reflect that. # Load package used for permutation t-test: library(MKinfer) # Create a function for running the simulation: simulate_power <- function(n1, n2, distr1, distr2, level = 0.05, B = 999, alternative = "two.sided") { # Create a data frame to store the results in: p_values <- data.frame(p_t_test = vector("numeric", B), p_perm_t_test = vector("numeric", B), p_wilcoxon = vector("numeric", B)) # Start progress bar: pbar <- txtProgressBar(min = 0, max = B, style = 3) for(i in 1:B) { # Generate data: x <- distr1(n1) y <- distr2(n2) # Compute p-values: p_values[i, 1] <- t.test(x, y, alternative = alternative)$p.value
p_values[i, 2] <- perm.t.test(x, y,
alternative = alternative,
R = 999)$perm.p.value p_values[i, 3] <- wilcox.test(x, y, alternative = alternative)$p.value

# Update progress bar:
setTxtProgressBar(pbar, i)
}

close(pbar)

# Return power:
return(colMeans(p_values < level))
}

Let’s try this out with lognormal data, where the difference in the log means is 1:

# Balanced sample sizes:
simulate_power(20, 20, function(n) { rlnorm(n,
meanlog = 2, sdlog = 1) },
function(n) { rlnorm(n,
meanlog = 1, sdlog = 1) },
B = 9999)

# Imbalanced sample sizes:
simulate_power(20, 30, function(n) { rlnorm(n,
meanlog = 2, sdlog = 1) },
function(n) { rlnorm(n,
meanlog = 1, sdlog = 1) },
B = 9999)

Here are the results from my runs:

# Balanced sample sizes:
p_t_test p_perm_t_test    p_wilcoxon
0.6708671     0.7596760     0.8508851

# Imbalanced sample sizes:
p_t_test p_perm_t_test    p_wilcoxon
0.6915692     0.7747775     0.9041904

Among the three, the Wilcoxon-Mann-Whitney test appears to be preferable for lognormal data, as it manages to obtain the correct type I error rate (unlike the old-school t-test) and has the highest power (although we would have to consider more scenarios, including different samples sizes, other differences of means, and different values of $$\sigma$$ to say for sure!).

Remember that both our estimates of power and type I error rates are proportions, meaning that we can use binomial confidence intervals to quantify the uncertainty in the estimates from our simulation studies. Let’s do that for the lognormal setting with balanced sample sizes, using the results from my runs. The number of simulated samples were 9,999. For the t-test, the estimated type I error rate was 0.03320332, which corresponds to $$0.03320332\cdot9,999=332$$ “successes.” Similarly, there were 6,708 “successes” in the power study. The confidence intervals become:

library(MKinfer)
binomCI(332, 9999, conf.level = 0.95, method = "clopper-pearson")
binomCI(6708, 9999, conf.level = 0.95, method = "wilson")

$\sim$

Exercise 7.15 Repeat the simulation study of type I error rate and power for the old school t-test, permutation t-test and the Wilcoxon-Mann-Whitney test with $$t(3)$$-distributed data. Which test has the best performance? How much lower is the type I error rate of the old-school t-test compared to the permutation t-test in the case of balanced sample sizes?

### 7.5.4 Power of some tests of location

The MKpower package contains functions for quickly performing power simulations for the old-school t-test and Wilcoxon-Mann-Whitney test in different settings. The arguments rx and ry are used to pass functions used to generate the random numbers, in line with the simulate_power function that we created above.

For the t-test, we can use sim.power.t.test:

library(MKpower)
sim.power.t.test(nx = 25, rx = rnorm, rx.H0 = rnorm,
ny = 25, ry = function(x) { rnorm(x, mean = 0.8) },
ry.H0 = rnorm)

For the Wilcoxon-Mann-Whitney test, we can use sim.power.wilcox.test for power simulations:

library(MKpower)
sim.power.wilcox.test(nx = 10, rx = rnorm, rx.H0 = rnorm,
ny = 15,
ry = function(x) { rnorm(x, mean = 2) },
ry.H0 = rnorm)

### 7.5.5 Some advice on simulation studies

There are two things that you need to decide when performing a simulation study:

• How many scenarios to include, i.e. how many different settings for the model parameters to study, and
• How many iterations to use, i.e. how many simulated samples to create for each scenario.

The number of scenarios is typically determined by what the purpose of the study is. If you only are looking to compare two tests for a particular sample size and a particular difference in means, then maybe you only need that one scenario. On the other hand, if you want to know which of the two tests that is preferable in general, or for different sample sizes, or for different types of distributions, then you need to cover more scenarios. In that case, the number of scenarios may well be determined by how much time you have available or how many you can fit into your report.

As for the number of iterations to run, that also partially comes down to computational power. If each iteration takes a long while to run, it may not be feasible to run tens of thousands of iterations (some advice for speeding up simulations by using parallelisation can be found in Section 10.2). In the best of all possible worlds, you have enough computational power available, and can choose the number of iterations freely. In such cases, it is often a good idea to use confidence intervals to quantify the uncertainty in your estimate of power, bias, or whatever it is that you are studying. For instance, the power of a test is estimated as the proportion of simulations in which the null hypothesis was rejected. This is a binomial experiment, and a confidence interval for the power can be obtained using the methods described in Section 7.3.4. Moreover, the ssize.propCI function described in said section can be used to determine the number of simulations that you need to obtain a confidence interval that is short enough for you to feel that you have a good idea about the actual power of the test.

As an example, if a small pilot simulation indicates that the power is about 0.8 and you want a confidence interval with width 0.01, the number of simulations needed can be computed as follows:

library(MKpower)
ssize.propCI(prop = 0.8, width = 0.01, method = "wilson")

In this case, you’d need 24,592 iterations to obtain the desired accuracy.

## 7.6 Sample size computations using simulation

Using simulation to compare statistical methods is a key tool in methodological statistical research and when assessing new methods. In applied statistics, a use of simulation that is just as important is sample size computations. In this section we’ll have a look at how simulations can be useful in determining sample sizes.

### 7.6.1 Writing your own simulation

Suppose that we want to perform a correlation test and want to know how many observations we need to collect. As in the previous section, we can write a function to compute the power of the test:

simulate_power <- function(n, distr, level = 0.05, B = 999, ...)
{
p_values <- vector("numeric", B)

# Start progress bar:
pbar <- txtProgressBar(min = 0, max = B, style = 3)

for(i in 1:B)
{
# Generate bivariate data:
x <- distr(n)

# Compute p-values:
p_values[i] <- cor.test(x[,1], x[,2], ...)$p.value # Update progress bar: setTxtProgressBar(pbar, i) } close(pbar) return(mean(p_values < level)) } Under the null hypothesis of no correlation, the correlation coefficient is 0. We want to find a sample size that will give us 90 % power at the 5 % significance level, for different hypothesised correlations. We will generate data from a bivariate normal distribution, because it allows us to easily set the correlation of the generated data. Note that the mean and variance of the marginal normal distributions are nuisance variables, which can be set to 0 and 1, respectively, without loss of generality (because the correlation test is invariant under scaling and shifts in location). First, let’s try our power simulation function: library(MASS) # Contains mvrnorm function for generating data rho <- 0.5 # The correlation between the variables mu <- c(0, 0) Sigma <- matrix(c(1, rho, rho, 1), 2, 2) simulate_power(50, function(n) { mvrnorm(n, mu, Sigma) }, B = 999) To find the sample size we need, we will write a new function containing a while loop (see Section 6.4.5), that performs the simulation for increasing values of $$n$$ until the test has achieved the desired power: library(MASS) power.cor.test <- function(n_start = 10, rho, n_incr = 5, power = 0.9, B = 999, ...) { # Set parameters for the multivariate normal distribution: mu <- c(0, 0) Sigma <- matrix(c(1, rho, rho, 1), 2, 2) # Set initial values n <- n_start power_cor <- 0 # Check power for different sample sizes: while(power_cor < power) { power_cor <- simulate_power(n, function(n) { mvrnorm(n, mu, Sigma) }, B = B, ...) cat("n =", n, " - Power:", power_cor, "\n") n <- n + n_incr } # Return the result: cat("\nWhen n =", n, "the power is", round(power_cor, 2), "\n") return(n) } Let’s try it out with different settings: power.cor.test(n_start = 10, rho = 0.5, power = 0.9) power.cor.test(n_start = 10, rho = 0.2, power = 0.8) As expected, larger sample sizes are required to detect smaller correlations. ### 7.6.2 The Wilcoxon-Mann-Whitney test The sim.ssize.wilcox.test in MKpower can be used to quickly perform sample size computations for the Wilcoxon-Mann-Whitney test, analogously to how we used sim.power.wilcox.test in Section 7.5.4: library(MKpower) sim.ssize.wilcox.test(rx = rnorm, ry = function(x) rnorm(x, mean = 2), power = 0.8, n.min = 3, n.max = 10, step.size = 1) $\sim$ Exercise 7.16 Modify the functions we used to compute the sample sizes for the Pearson correlation test to instead compute sample sizes for the Spearman correlation tests. For bivariate normal data, are the required sample sizes lower or higher than those of the Pearson correlation test? (Click here to go to the solution.) Exercise 7.17 In Section 7.3.4 we had a look at some confidence intervals for proportions, and saw how ssize.propCI can be used to compute sample sizes for such intervals using asymptotic approximations. Write a function to compute the exact sample size needed for the Clopper-Pearson interval to achieve a desired expected (average) width. Compare your results to those from the asymptotic approximations. Are the approximations good enough to be useful? (Click here to go to the solution.) ## 7.7 Bootstrapping The bootstrap can be used formany things, most notably for constructing confidence intervals and running hypothesis tests. These tend to perform better than traditional parametric methods, such as the old-school t-test and its associated confidence interval, when the distributional assumptions of the parametric methods aren’t met. Confidence intervals and hypothesis tests are always based on a statistic, i.e. a quantity that we compute from the samples. The statistic could be the sample mean, a proportion, the Pearson correlation coefficient, or something else. In traditional parametric methods, we start by assuming that our data follows some distribution. For different reasons, including mathematical tractability, a common assumption is that the data is normally distributed. Under that assumption, we can then derive the distribution of the statistic that we are interested in analytically, like Gosset did for the t-test. That distribution can then be used to compute confidence intervals and p-values. When using a bootstrap method, we follow the same steps, but use the observed data and simulation instead. Rather than making assumptions about the distribution54, we use the empirical distribution of the data. Instead of analytically deriving a formula that describes the statistic’s distribution, we find a good approximation of the distribution of the statistic by using simulation. We can then use that distribution to obtain confidence intervals and p-values, just as in the parametric case. The simulation step is important. We use a process known as resampling, where we repeatedly draw new observations with replacement from the original sample. We draw $$B$$ samples this way, each with the same size $$n$$ as the original sample. Each randomly drawn sample - called a bootstrap sample - will include different observations. Some observations from the original sample may appear more than once in a specific bootstrap sample, and some not at all. For each bootstrap sample, we compute the statistic in which we are interested. This gives us $$B$$ observations of this statistic, which together form what is called the bootstrap distribution of the statistic. I recommend using $$B=9,999$$ or greater, but we’ll use smaller $$B$$ in some examples, to speed up the computations. ### 7.7.1 A general approach The Pearson correlation test is known to be sensitive to deviations from normality. We can construct a more robust version of it using the bootstrap. To illustrate the procedure, we will use the sleep_total and brainwt variables from the msleep data. Here is the result from the traditional parametric Pearson correlation test: library(ggplot2) msleep %$% cor.test(sleep_total, brainwt, use = "pairwise.complete")

To find the bootstrap distribution of the Pearson correlation coefficient, we can use resampling with a for loop (Section 6.4.1):

# Extract the data that we are interested in:
mydata <- na.omit(msleep[,c("sleep_total", "brainwt")])

# Resampling using a for loop:
B <- 999 # Number of bootstrap samples
statistic <- vector("numeric", B)
for(i in 1:B)
{
# Draw row numbers for the bootstrap sample:
row_numbers <- sample(1:nrow(mydata), nrow(mydata),
replace = TRUE)

# Obtain the bootstrap sample:
sample <- mydata[row_numbers,]

# Compute the statistic for the bootstrap sample:
statistic[i] <- cor(sample[, 1], sample[, 2])
}

# Plot the bootstrap distribution of the statistic:
ggplot(data.frame(statistic), aes(statistic)) +
geom_histogram(colour = "black")

Because this is such a common procedure, there are R packages that let’s us do resampling without having to write a for loop. In the remainder of the section, we will use the boot package to draw bootstrap samples. It also contains convenience functions that allows us to get confidence intervals from the bootstrap distribution quickly. Let’s install it:

install.packages("boot")

The most important function in this package is boot, which does the resampling. As input, it takes the original data, the number $$B$$ of bootstrap samples to draw (called R here), and a function that computes the statistic of interest. This function should take the original data (mydata in our example above) and the row numbers of the sampled observation for a particular bootstrap sample (row_numbers in our example) as input.

For the correlation coefficient, the function that we input can look like this:

cor_boot <- function(data, row_numbers, method = "pearson")
{
# Obtain the bootstrap sample:
sample <- data[row_numbers,]

# Compute and return the statistic for the bootstrap sample:
return(cor(sample[, 1], sample[, 2], method = method))
}

To get the bootstrap distribution of the Pearson correlation coefficient for our data, we can now use boot as follows:

library(boot)

# Base solution:
boot_res <- boot(na.omit(msleep[,c("sleep_total", "brainwt")]),
cor_boot,
999)

Next, we can plot the bootstrap distribution of the statistic computed in cor_boot:

plot(boot_res)

If you prefer, you can of course use a pipeline for the resampling instead:

library(boot)
library(dplyr)

# With pipes:
msleep %>% select(sleep_total, brainwt) %>%
drop_na %>%
boot(cor_boot, 999) -> boot_res

### 7.7.2 Bootstrap confidence intervals

The next step is to use boot.ci to compute bootstrap confidence intervals. This is as simple as running:

boot.ci(boot_res)

Four intervals are presented: normal, basic, percentile and BCa. The details concerning how these are computed based on the bootstrap distribution are presented in Section 12.1. It is generally agreed that the percentile and BCa intervals are preferable to the normal and basic intervals; see e.g. Davison & Hinkley (1997) and Hall (1992); but which performs the best varies.

We also receive a warning message:

Warning message:
In boot.ci(boot_res) : bootstrap variances needed for studentized
intervals

A fifth type of confidence interval, the studentised interval, requires bootstrap estimates of the standard error of the test statistic. These are obtained by running an inner bootstrap, i.e. by bootstrapping each bootstrap sample to get estimates of the variance of the test statistic. Let’s create a new function that does this, and then compute the bootstrap confidence intervals:

cor_boot_student <- function(data, i, method = "pearson")
{
sample <- data[i,]

correlation <- cor(sample[, 1], sample[, 2], method = method)

inner_boot <- boot(sample, cor_boot, 100)
variance <- var(inner_boot$t) return(c(correlation, variance)) } library(ggplot2) library(boot) boot_res <- boot(na.omit(msleep[,c("sleep_total", "brainwt")]), cor_boot_student, 999) # Show bootstrap distribution: plot(boot_res) # Compute confidence intervals - including studentised: boot.ci(boot_res) While theoretically appealing (Hall, 1992), studentised intervals can be a little erratic in practice. I prefer to use percentile and BCa intervals instead. For two-sample problems, we need to make sure that the number of observations drawn from each sample is the same as in the original data. The strata argument in boot is used to achieve this. Let’s return to the example studied in Section 7.2, concerning the difference in how long carnivores and herbivores sleep. Let’s say that we want a confidence interval for the difference of two means, using the msleep data. The simplest approach is to create a Welch-type interval, where we allow the two populations to have different variances. We can then resample from each population separately: # Function that computes the mean for each group: mean_diff_msleep <- function(data, i) { sample1 <- subset(data[i, 1], data[i, 2] == "carni") sample2 <- subset(data[i, 1], data[i, 2] == "herbi") return(mean(sample1[]) - mean(sample2[])) } library(ggplot2) # Load the data library(boot) # Load bootstrap functions # Create the data set to resample from: boot_data <- na.omit(subset(msleep, vore == "carni" | vore == "herbi")[,c("sleep_total", "vore")]) # Do the resampling - we specify that we want resampling from two # populations by using strata: boot_res <- boot(boot_data, mean_diff_msleep, 999, strata = factor(boot_data$vore))

# Compute confidence intervals:
boot.ci(boot_res, type = c("perc", "bca"))

$\sim$

Exercise 7.18 Let’s continue the example with a confidence interval for the difference in how long carnivores and herbivores sleep. How can you create a confidence interval under the assumption that the two groups have equal variances?

### 7.7.3 Bootstrap hypothesis tests

Writing code for bootstrap hypothesis tests can be a little tricky, because the resampling must be done under the null hypothesis. The process is greatly simplified by computing p-values using confidence interval inversion instead. This approach exploits the equivalence between confidence intervals and hypothesis tests, detailed in Section 12.2. It relies on the fact that:

• The p-value of the test for the parameter $$\theta$$ is the smallest $$\alpha$$ such that $$\theta$$ is not contained in the corresponding $$1-\alpha$$ confidence interval.
• For a test for the parameter $$\theta$$ with significance level $$\alpha$$, the set of values of $$\theta$$ that aren’t rejected by the test (when used as the null hypothesis) is a $$1-\alpha$$ confidence interval for $$\theta$$.

Here is an example of how we can use a while loop (Section 6.4.5) for confidence interval inversion, in order to test the null hypothesis that the Pearson correlation between sleeping time and brain weight is $$\rho=-0.2$$. It uses the studentised confidence interval that we created in the previous section:

# Compute the studentised confidence interval:
cor_boot_student <- function(data, i, method = "pearson")
{
sample <- data[i,]

correlation <- cor(sample[, 1], sample[, 2], method = method)

inner_boot <- boot(sample, cor_boot, 100)
variance <- var(inner_boot$t) return(c(correlation, variance)) } library(ggplot2) library(boot) boot_res <- boot(na.omit(msleep[,c("sleep_total", "brainwt")]), cor_boot_student, 999) # Now, a hypothesis test: # The null hypothesis: rho_null <- -0.2 # Set initial conditions: in_interval <- TRUE alpha <- 0 # Find the lowest alpha for which rho_null is in the interval: while(in_interval) { # Increase alpha a small step: alpha <- alpha + 0.001 # Compute the 1-alpha confidence interval, and extract # its bounds: interval <- boot.ci(boot_res, conf = 1 - alpha, type = "stud")$student[4:5]

# Check if the null value for rho is greater than the lower
# interval bound and smaller than the upper interval bound,
# i.e. if it is contained in the interval:
in_interval <- rho_null > interval & rho_null < interval
}
# The loop will finish as soon as it reaches a value of alpha such
# that rho_null is not contained in the interval.

# Print the p-value:
alpha

The boot.pval package contains a function computing p-values through inversion of bootstrap confidence intervals. We can use it to obtain a bootstrap p-value without having to write a while loop. It works more or less analogously to boot.ci. The arguments to the boot.pval function is the boot object (boot_res), the type of interval to use ("stud"), and the value of the parameter under the null hypothesis (-0.2):

install.packages("boot.pval")
library(boot.pval)
boot.pval(boot_res, type = "stud", theta_null = -0.2)

Confidence interval inversion fails in spectacular ways for certain tests for parameters of discrete distributions (Thulin & Zwanzig, 2017), so be careful if you plan on using this approach with count data.

$\sim$

Exercise 7.19 With the data from Exercise 7.18, invert a percentile confidence interval to compute the p-value of the corresponding test of the null hypothesis that there is no difference in means. What are the results?

### 7.7.4 The parametric bootstrap

In some cases, we may be willing to make distributional assumptions about our data. We can then use the parametric bootstrap, in which the resampling is done not from the original sample, but the theorised distribution (with parameters estimated from the original sample). Here is an example for the bootstrap correlation test, where we assume a multivariate normal distribution for the data. Note that we no longer include an index as an argument in the function cor_boot, because the bootstrap samples won’t be drawn directly from the original data:

cor_boot <- function(data, method = "pearson")
{
return(cor(data[, 1], data[, 2], method = method))
}

library(MASS)
generate_data <- function(data, mle)
{
return(mvrnorm(nrow(data), mle[], mle[]))
}

library(ggplot2)
library(boot)

filtered_data <- na.omit(msleep[,c("sleep_total", "brainwt")])
boot_res <- boot(filtered_data,
cor_boot,
R = 999,
sim = "parametric",
ran.gen = generate_data,
mle = list(colMeans(filtered_data),
cov(filtered_data)))

# Show bootstrap distribution:
plot(boot_res)

# Compute bootstrap percentile confidence interval:
boot.ci(boot_res, type = "perc")

The BCa interval implemented in boot.ci is not valid for parametric bootstrap samples, so running boot.ci(boot_res) without specifying the interval type will render an error55. Percentile intervals work just fine, though.

## 7.8 Reporting statistical results

Carrying out a statistical analysis is only the first step. After that, you probably need to communicate your results to others: your boss, your colleagues, your clients, the public… This section contains some tips for how best to do that.

### 7.8.1 What should you include?

When reporting your results, it should always be clear:

• How the data was collected,
• If, how, and why any observations were removed from the data prior to the analysis,
• What method was used for the analysis (including a reference unless it is a routine method),
• If any other analyses were performed/attempted on the data, and if you don’t report their results, why.

Let’s say that you’ve estimate some parameter, for instance the mean sleeping time of mammals, and want to report the results. The first thing to think about is that you shouldn’t include too many decimals: don’t give the mean with 5 decimals if sleeping times only were measured with one decimal.

BAD: The mean sleeping time of mammals was found to be 10.43373.

GOOD: The mean sleeping time of mammals was found to be 10.4.

It is common to see estimates reported with standard errors or standard deviations:

BAD: The mean sleeping time of mammals was found to be 10.3 ($$\sigma=4.5$$).

or

BAD: The mean sleeping time of mammals was found to be 10.3 (standard error 0.49).

or

BAD: The mean sleeping time of mammals was found to be $$10.3 \pm 0.49$$.

Although common, this isn’t a very good practice. Standard errors/deviations are included to give some indication of the uncertainty of the estimate, but are very difficult to interpret. In most cases, they will probably cause the reader to either overestimate or underestimate the uncertainty in your estimate. A much better option is to present the estimate with a confidence interval, which quantifies the uncertainty in the estimate in an interpretable manner:

GOOD: The mean sleeping time of mammals was found to be 10.3 (95 % percentile bootstrap confidence interval: 9.5-11.4).

Similarly, it is common to include error bars representing standard deviations and standard errors e.g. in bar charts. This questionable practice becomes even more troublesome because a lot of people fail to indicate what the error bars represent. If you wish to include error bars in your figures, they should always represent confidence intervals, unless you have a very strong reason for them to represent something else. In the latter case, make sure that you clearly explain what the error bars represent.

If the purpose of your study is to describe differences between groups, you should present a confidence interval for the difference between the groups, rather than one confidence interval (or error bar) for each group. It is possible for the individual confidence intervals to overlap even if there is a significant difference between the two groups, so reporting group-wise confidence intervals will only lead to confusion. If you are interested in the difference, then of course the difference is what you should report a confidence interval for.

BAD: There was no significant difference between the sleeping times of carnivores (mean 10.4, 95 % percentile bootstrap confidence interval: 8.4-12.5) and herbivores (mean 9.5, 95 % percentile bootstrap confidence interval: 8.1-12.6).

GOOD: There was no significant difference between the sleeping times of carnivores (mean 10.4) and herbivores (mean 9.5), with the 95 % percentile bootstrap confidence interval for the difference being (-1.8, 3.5).

### 7.8.2 Citing R packages

In statistical reports, it is often a good idea to specify what version of a software or a package that you used, for the sake of reproducibility (indeed, this is a requirement in some scientific journals). To get the citation information for the version of R that you are running, simply type citation(). To get the version number, you can use R.Version as follows:

citation()
R.Version()\$version.string

To get the citation and version information for a package, use citation and packageVersion as follows:

citation("ggplot2")
packageVersion("ggplot2")

1. In general, the proportion of points that fall below the curve will be proportional to the area under the curve relative to the area of the sample space. In this case the sample space is the unit square, which has area 1, meaning that the relative area is the same as the absolute area.↩︎

2. Note that this is not a random sample of mammals, and so one of the fundamental assumptions behind the t-test isn’t valid in this case. For the purpose of showing how to use the t-test, the data is good enough though.↩︎

3. Or rather, a given expected, or average, width. The width of the interval is a function of a random variable, and is therefore also random.↩︎

4. We’ll discuss methods for producing such estimates in Section 7.5.3.↩︎

5. At least in terms of mean squared error.↩︎

6. Well, sometimes we make assumptions about the distribution and use the bootstrap. This is known as the parametric bootstrap, and is discussed in Section 7.7.4.↩︎

7. If you really need a BCa interval for the parametric bootstrap, you can find the formulas for it in Davison & Hinkley (1997).↩︎