# 3 Transforming, summarising, and analysing data

Most datasets are stored as tables, with rows and columns. In this chapter we’ll see how you can import and export such data, and how it is stored in R. We’ll also discuss how you can transform, summarise, and analyse your data.

After working with the material in this chapter, you will be able to use R to:

• Distinguish between different data types,
• Import data from Excel spreadsheets and csv text files,
• Compute descriptive statistics for subgroups in your data,
• Find interesting points in your data,
• Modify variables in your data,
• Remove variables from your data,
• Save and export your data,
• Work with RStudio projects,
• Run t-tests and fit linear models,
• Use %>% pipes to chain functions together.

The chapter ends with a discussion of ethical guidelines for statistical work.

## 3.1 Data frames and data types

### 3.1.1 Types and structures

We have already seen that different kinds of data require different kinds of statistical methods. For numeric data we create boxplots and compute means, but for categorical data we don't. Instead, we produce bar charts and display the data in tables. It is no surprise, then, that R also treats different kinds of data differently.

In programming, a variable's *data type* describes what kind of object is assigned to it. We can assign many different types of objects to the variable a: it could for instance contain a number, text, or a data frame. In order to treat a correctly, R needs to know what data type its assigned object has. In some programming languages, you have to explicitly state what data type a variable has, but not in R. This makes programming in R simpler and faster, but it can cause problems if a variable turns out to have a different data type than what you thought.¹⁴

R has six basic data types. For most people, it suffices to know about the first three in the list below:

• numeric: numbers like 1 and 16.823 (sometimes also called double).
• logical: true/false values (boolean): either TRUE or FALSE.
• character: text, e.g. "a", "Hello! I'm Ada." and "name@domain.com".
• integer: integer numbers, denoted in R by the letter L: 1L, 55L.
• complex: complex numbers, like 2+3i. Rarely used in statistical work.
• raw: used to hold raw bytes. Don’t fret if you don’t know what that means. You can have a long and meaningful career in statistics, data science, or pretty much any other field without ever having to worry about raw bytes. We won’t discuss raw objects again in this book.

In addition, these can be combined into special data types sometimes called data structures, examples of which include vectors and data frames. Important data structures include factor, which is used to store categorical data, and the awkwardly named POSIXct which is used to store date and time data.
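As a brief preview (we'll work with these types properly later in the book; the values here are made up for illustration), here is how a factor and a POSIXct object can be created and inspected:

# A factor stores categorical data:
treatment <- factor(c("control", "drug", "control"))
class(treatment)
levels(treatment)

# A POSIXct object stores date and time data:
timestamp <- as.POSIXct("2020-05-04 09:30:00")
class(timestamp)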

To check what type of object a variable is, you can use the class function:

x <- 6
y <- "Scotland"
z <- TRUE

class(x)
class(y)
class(z)

What happens if we use class on a vector?

numbers <- c(6, 9, 12)
class(numbers)

class returns the data type of the elements of the vector. So what happens if we put objects of different type together in a vector?

all_together <- c(x, y, z)
all_together
class(all_together)

In this case, R has coerced the objects in the vector to all be of the same type. Sometimes that is desirable, and sometimes it is not. The lesson here is to be careful when you create a vector from different objects. We’ll learn more about coercion and how to change data types in Section 5.1.
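To give a flavour of what coercion does, logical values can be turned into numbers, and numbers into text, but not always the other way around:

c(TRUE, 5)            # Coerced to numeric: 1 5
c(5, "five")          # Coerced to character: "5" "five"
as.character(6)       # Explicit conversion: "6"
as.numeric("banana")  # Returns NA, with a warning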

### 3.1.2 Types of tables

The basis for most data analyses in R is the data frame: a spreadsheet-like table with rows and columns containing data. You encountered some data frames in the previous chapter. Have a quick look at them to remind yourself of what they look like:

# Bookstore example
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)
bookstore <- data.frame(age, purchase)
View(bookstore)

# Animal sleep data
library(ggplot2)
View(msleep)

# Diamonds data
View(diamonds)

Notice that all three data frames follow the same format: each column represents a variable (e.g. age) and each row represents an observation (e.g. an individual). This is the standard way to store data in R (as well as the standard format in statistics in general). In what follows, we will use the terms column and variable interchangeably, to describe the columns/variables in a data frame.

This kind of table can be stored in R as different types of objects - that is, in several different ways. As you’d expect, the different types of objects have different properties and can be used with different functions. Here’s the run-down of four common types:

• matrix: a table where all columns must contain objects of the same type (e.g. all numeric or all character). Uses less memory than other types and allows for much faster computations, but is difficult to use for certain types of data manipulation, plotting and analyses.
• data.frame: the most common type, where different columns can contain different types (e.g. one numeric column, one character column).
• data.table: an enhanced version of data.frame.
• tbl_df (“tibble”): another enhanced version of data.frame.

First of all, in most cases it doesn't matter which of these four you use to store your data. In fact, they all look similar to the user. Have a look at the following datasets (WorldPhones and airquality come with base R):

# First, an example of data stored in a matrix:
?WorldPhones
class(WorldPhones)
View(WorldPhones)

# Next, an example of data stored in a data frame:
?airquality
class(airquality)
View(airquality)

# Finally, an example of data stored in a tibble:
library(ggplot2)
?msleep
class(msleep)
View(msleep)

That being said, in some cases it really matters which one you use. Some functions require that you input a matrix, while others may break or work differently from what was intended if you input a tibble instead of an ordinary data frame. Luckily, you can convert objects into other types:

WorldPhonesDF <- as.data.frame(WorldPhones)
class(WorldPhonesDF)

airqualityMatrix <- as.matrix(airquality)
class(airqualityMatrix)
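Similar conversion functions exist for the two enhanced types. As a sketch, assuming that the data.table and tibble packages have been installed:

library(data.table)
airqualityDT <- as.data.table(airquality)
class(airqualityDT)

library(tibble)
airqualityTbl <- as_tibble(airquality)
class(airqualityTbl)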

$\sim$

Exercise 3.1 The following tasks are all related to data types and data structures:

1. Create a text variable using e.g. a <- "A rainy day in Edinburgh". Check that it gets the correct type. What happens if you use single quote marks instead of double quotes when you create the variable?

2. What data types are the sums 1 + 2, 1L + 2 and 1L + 2L?

3. What happens if you add a numeric to a character, e.g. "Hello" + 1?

4. What happens if you perform mathematical operations involving a numeric and a logical, e.g. FALSE * 2 or TRUE + 1?

Exercise 3.2 What do the functions ncol, nrow, dim, names, and row.names return when applied to a data frame?

Exercise 3.3 matrix tables can be created from vectors using the function of the same name. Using the vector x <- 1:6, use matrix to create the following matrices:

$\begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6 \end{pmatrix}$

and

$\begin{pmatrix} 1 & 4\\ 2 & 5\\ 3 & 6 \end{pmatrix}.$

Remember to check ?matrix to find out how to set the dimensions of the matrix, and how it is filled with the numbers from the vector!

## 3.2 Vectors in data frames

In the next few sections, we will explore the airquality dataset. It contains daily air quality measurements from New York during a period of five months:

• Ozone: mean ozone concentration (ppb),
• Solar.R: solar radiation (Langley),
• Wind: average wind speed (mph),
• Temp: maximum daily temperature in degrees Fahrenheit,
• Month: numeric month (May=5, June=6, and so on),
• Day: numeric day of the month (1-31).
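A quick way to get an overview of the dataset before we start is to use str and summary:

str(airquality)      # The type and first few values of each variable
summary(airquality)  # Descriptive statistics for each variable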

There are lots of things that would be interesting to look at in this dataset. What was the mean temperature during the period? Which day was the hottest? Which was the windiest? On which days was the temperature above 90 degrees Fahrenheit? To answer these questions, we need to be able to access the vectors inside the data frame. We also need to be able to quickly and automatically screen the data in order to find interesting observations (e.g. the hottest day).

### 3.2.1 Accessing vectors and elements

In Section 2.6, we learned how to compute the mean of a vector. We also learned that to compute the mean of a vector that is stored inside a data frame¹⁵ we could use a dollar sign: data_frame_name$vector_name. Here is an example with the airquality data:

# Extract the Temp vector:
airquality$Temp

# Compute the mean temperature:
mean(airquality$Temp)

If we want to grab a particular element from a vector, we must use its index within square brackets: [index]. The first element in the vector has index 1, the second has index 2, the third index 3, and so on. To access the fifth element in the Temp vector in the airquality data frame, we can use:

airquality$Temp[5]
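The square brackets also accept a vector of indices, in case we want to grab several elements at once:

airquality$Temp[1:3]         # The first three elements
airquality$Temp[c(1, 3, 5)]  # The 1st, 3rd and 5th elements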

The square brackets can also be applied directly to the data frame. The syntax for this follows that used for matrices in mathematics: airquality[i, j] means the element in the i-th row and j-th column of airquality. We can also leave out either i or j to extract an entire row or column from the data frame. Here are some examples:

# First, we check the order of the columns:
names(airquality)
# We see that Temp is the 4th column.

airquality[5, 4]      # The 5th element from the 4th column,
                      # i.e. the same as airquality$Temp[5]
airquality[5, ]       # The 5th row of the data
airquality[, 4]       # The 4th column of the data, like airquality$Temp
airquality[, c(2, 4, 6)]         # The 2nd, 4th and 6th columns
airquality[, -2]      # All columns except the 2nd one
airquality[, c("Temp", "Wind")]  # The Temp and Wind columns

$\sim$

Exercise 3.4 The following tasks all involve using the [i, j] notation for extracting data from data frames:

1. Why does airquality[, 3] not return the third row of airquality?

2. Extract the first five rows from airquality. Hint: a fast way of creating the vector c(1, 2, 3, 4, 5) is to write 1:5.

3. Compute the correlation between the Temp and Wind vectors of airquality without referring to them using $.

4. Extract all columns from airquality except Temp and Wind.

### 3.2.2 Use your dollars

The $ operator can be used not just to extract data from a data frame, but also to manipulate it. Let's return to our bookstore data frame, and see how we can make changes to it using the dollar sign.

age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)
bookstore <- data.frame(age, purchase)

Perhaps there was a data entry error - the second customer was actually 18 years old and not 48. We can assign a new value to that element by referring to it in either of two ways:

bookstore$age[2] <- 18
# or
bookstore[2, 1] <- 18

We could also change an entire column if we like. For instance, if we wish to change the age vector to months instead of years, we could use

bookstore$age <- bookstore$age * 12

What if we want to add another variable to the data, for instance the length of the customers’ visits in minutes? There are several ways to accomplish this, one of which involves the dollar sign:

bookstore$visit_length <- c(5, 2, 20, 22, 12, 31, 9, 10, 11)
bookstore

As you can see, the new data has now been added to a new column in the data frame.

$\sim$

Exercise 3.5 Use the bookstore data frame to do the following:

1. Add a new variable rev_per_minute which is the ratio between purchase and the visit length.

2. Oh no, there's been an error in the data entry! Replace the purchase amount for the 80-year-old customer with 16.

### 3.2.3 Using conditions

A few paragraphs ago, we were asking which was the hottest day in the airquality data. Let's find out! We already know how to find the maximum value in the Temp vector:

max(airquality$Temp)

But can we find out which day this corresponds to? We could of course manually go through all 153 days, e.g. by using View(airquality), but that would be tiresome and wouldn't be feasible at all for larger datasets. A better option is therefore to use the function which.max:

which.max(airquality$Temp)

which.max returns the index of the observation with the maximum value. If there is more than one observation attaining this value, it only returns the first of these. We've just used which.max to find out that day 120 was the hottest during the period. If we want to have a look at the entire row for that day, we can use

airquality[120, ]

Alternatively, we could place the call to which.max inside the brackets. Because which.max(airquality$Temp) returns the number 120, this yields the same result as the previous line:

airquality[which.max(airquality$Temp), ]

Were we looking for the day with the lowest temperature, we'd use which.min analogously. In fact, we could use any function or computation that returns an index in the same way, placing it inside the brackets to get the corresponding rows or columns. This is extremely useful if we want to extract observations with certain properties, for instance all days where the temperature was above 90 degrees. We do this using conditions, i.e. by giving statements that we wish to be fulfilled. As a first example of a condition, we use the following, which checks if the temperature exceeds 90 degrees:

airquality$Temp > 90

For each element in airquality$Temp this returns either TRUE (if the condition is fulfilled, i.e. when the temperature is greater than 90) or FALSE (if the condition isn't fulfilled, i.e. when the temperature is 90 or lower). If we place the condition inside brackets following the name of the data frame, we will extract only the rows corresponding to those elements which were marked with TRUE:

airquality[airquality$Temp > 90, ]

If you prefer, you can also store the TRUE or FALSE values in a new variable:

airquality$Hot <- airquality$Temp > 90
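Because Hot is a logical vector, it can then be used as a condition on its own:

airquality[airquality$Hot, ]  # Same rows as airquality[airquality$Temp > 90, ]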

There are several logical operators and functions which are useful when stating conditions in R. Here are some examples:

a <- 3
b <- 8

a == b     # Check if a equals b
a > b      # Check if a is greater than b
a < b      # Check if a is less than b
a >= b     # Check if a is equal to or greater than b
a <= b     # Check if a is equal to or less than b
a != b     # Check if a is not equal to b
is.na(a)   # Check if a is NA
a %in% c(1, 4, 9) # Check if a equals at least one of 1, 4, 9

When checking a condition for all elements in a vector, we can use which to get the indices of the elements that fulfill the condition:

which(airquality$Temp > 90)

If we want to know whether all elements in a vector fulfill the condition, we can use all:

all(airquality$Temp > 90)

In this case, it returns FALSE, meaning that not all days had a temperature above 90 (phew!). Similarly, if we wish to know whether at least one day had a temperature above 90, we can use any:

any(airquality$Temp > 90)

To find how many elements fulfill a condition, we can use sum:

sum(airquality$Temp > 90)

Why does this work? Remember that sum computes the sum of the elements in a vector, and that when logical values are used in computations, they are treated as 0 (FALSE) or 1 (TRUE). Because the condition returns a vector of logical values, the sum of them becomes the number of 1’s - the number of TRUE values - i.e. the number of elements that fulfill the condition.
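We can verify this with a small made-up example:

hot_days <- c(TRUE, FALSE, TRUE, TRUE)
sum(hot_days)   # 3: the number of TRUE values
mean(hot_days)  # 0.75: the proportion of TRUE values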

To find the proportion of elements that fulfill a condition, we can count how many elements fulfill it and then divide by how many elements are in the vector. This is exactly what happens if we use mean:

mean(airquality$Temp > 90)

## 3.6 Running a t-test

Let's say that we want to compare the sleeping times of carnivores and herbivores in the msleep data. We can use conditions to extract the two groups, and then compare their mean sleeping times with a two-sample t-test using t.test:

carnivores <- msleep[msleep$vore == "carni", ]
herbivores <- msleep[msleep$vore == "herbi", ]

t.test(carnivores$sleep_total, herbivores$sleep_total)

The output contains a lot of useful information, including the p-value ($0.53$) and a 95 % confidence interval. t.test has a number of useful arguments that we can use to tailor the test to our taste. For instance, we can change the confidence level of the confidence interval (to 90 %, say), use a one-sided alternative hypothesis ("carnivores sleep more than herbivores," i.e. the mean of the first group is greater than that of the second group), and perform the test under the assumption of equal variances in the two samples:

t.test(carnivores$sleep_total, herbivores$sleep_total,
       conf.level = 0.90,
       alternative = "greater",
       var.equal = TRUE)

We'll explore t.test and related functions further in Section 7.2.

## 3.7 Fitting a linear regression model

The mtcars data from Henderson and Velleman (1981) has become one of the classic datasets in R, and a part of the initiation rite for new R users is to use the mtcars data to fit a linear regression model. The data describes fuel consumption, number of cylinders, and other information about cars from the 1970s:

?mtcars
View(mtcars)

Let's have a look at the relationship between gross horsepower (hp) and fuel consumption (mpg):

library(ggplot2)
ggplot(mtcars, aes(hp, mpg)) +
      geom_point()

The relationship doesn't appear to be perfectly linear, but nevertheless, we can try fitting a linear regression model to the data. This can be done using lm. We fit a model with mpg as the response variable and hp as the explanatory variable:

m <- lm(mpg ~ hp, data = mtcars)

The first argument is a formula, saying that mpg is a function of hp, i.e.

$mpg = \beta_0 + \beta_1 \cdot hp.$

A summary of the model is obtained using summary. Among other things, it includes the estimated parameters, p-values and the coefficient of determination $R^2$:

summary(m)

We can add the fitted line to the scatterplot by using geom_abline, which lets us add a straight line with a given intercept and slope - we take these to be the coefficients from the fitted model, given by coef:

# Check model coefficients:
coef(m)

# Add regression line to plot:
ggplot(mtcars, aes(hp, mpg)) +
      geom_point() +
      geom_abline(aes(intercept = coef(m)[1], slope = coef(m)[2]),
                  colour = "red")

Diagnostic plots for the residuals are obtained using plot:

plot(m)

If we wish to add further variables to the model, we simply add them to the right-hand side of the formula in the function call:

m2 <- lm(mpg ~ hp + wt, data = mtcars)
summary(m2)

In this case, the model becomes

$mpg = \beta_0 + \beta_1 \cdot hp + \beta_2 \cdot wt.$

There is much more to be said about linear models in R. We'll return to them in Section 8.1.

$\sim$

Exercise 3.11 Fit a linear regression model to the mtcars data, using mpg as the response variable and hp, wt, cyl, and am as explanatory variables. Are all four explanatory variables significant?

## 3.8 Grouped summaries

Being able to compute the mean temperature for the airquality data during the entire period is great, but it would be even better if we also had a way to compute it for each month. The aggregate function can be used to create that kind of grouped summary. To begin with, let's compute the mean temperature for each month. Using aggregate, we do this as follows:

aggregate(Temp ~ Month, data = airquality, FUN = mean)

The first argument is a formula, similar to what we used for lm, saying that we want a summary of Temp grouped by Month. Similar formulas are also used in other R functions, for instance when building regression models. In the second argument, data, we specify in which data frame the variables are found, and in the third, FUN, we specify which function should be used to compute the summary.

By default, mean returns NA if there are missing values. In airquality, Ozone contains missing values, but when we compute the grouped means the results are not NA:

aggregate(Ozone ~ Month, data = airquality, FUN = mean)

The reason is that aggregate removes NA values before computing the grouped summaries.

It is also possible to compute summaries for multiple variables at the same time. For instance, we can compute the standard deviations (using sd) of Temp and Wind, grouped by Month:

aggregate(cbind(Temp, Wind) ~ Month, data = airquality, FUN = sd)

aggregate can also be used to count the number of observations in the groups. For instance, we can count the number of days in each month. In order to do so, we put a variable with no NA values on the left-hand side of the formula, and use length, which returns the length of a vector:

aggregate(Temp ~ Month, data = airquality, FUN = length)

Another function that can be used to compute grouped summaries is by. The results are the same, but the output is not as nicely formatted. Here's how to use it to compute the mean temperature grouped by month:

by(airquality$Temp, airquality$Month, mean)

What makes by useful is that, unlike aggregate, it is easy to use with functions that take more than one variable as input. If we want to compute the correlation between Wind and Temp grouped by month, we can do that as follows:

# Check that Wind and Temp are in columns 3 and 4:
names(airquality)

by(airquality[, 3:4], airquality$Month, cor)

For each month, this outputs a correlation matrix, which shows both the correlation between Wind and Temp and the correlation of the variables with themselves (which is always 1).
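If we only want the single correlation between Wind and Temp for each month, rather than the full matrix, one option is to pass an anonymous function to by (a brief sketch):

by(airquality[, 3:4], airquality$Month,
   function(df) cor(df$Wind, df$Temp))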

$\sim$

Exercise 3.12 Load the VAS pain data vas.csv from Exercise 3.8. Then do the following:

1. Compute the mean VAS for each patient.

2. Compute the lowest and highest VAS recorded for each patient.

3. Compute the number of high-VAS days, defined as days where the VAS was at least 7, for each patient.

Exercise 3.13 Install the datasauRus package using install.packages("datasauRus") (note the capital R!). It contains the dataset datasaurus_dozen. Check its structure and then do the following:

1. Compute the mean of x, mean of y, standard deviation of x, standard deviation of y, and correlation between x and y, grouped by dataset. Are there any differences between the 12 datasets?

2. Make a scatterplot of x against y for each dataset (use facetting!). Are there any differences between the 12 datasets?

## 3.9 Using %>% pipes

Consider the code you used to solve part 1 of Exercise 3.5:

bookstore$rev_per_minute <- bookstore$purchase / bookstore$visit_length

Wouldn't it be more convenient if you didn't have to write the bookstore$ part each time? To just say once that you are manipulating bookstore, and have R implicitly understand that all the variables involved reside in that data frame? Yes. Yes, it would. Fortunately, R has tools that will let you do just that.

### 3.9.1 Ceci n’est pas une pipe

The magrittr package²¹ adds a set of tools called pipes to R. Pipes are operators that let you improve your code's readability and restructure your code so that it is read from left to right instead of from the inside out. Let's start by installing the package:

install.packages("magrittr")

Now, let’s say that we are interested in finding out what the mean wind speed (in m/s rather than mph) on hot days (temperature above 80, say) in the airquality data is, aggregated by month. We could do something like this:

# Extract hot days:
airquality2 <- airquality[airquality$Temp > 80, ]

# Convert wind speed to m/s:
airquality2$Wind <- airquality2$Wind * 0.44704

# Compute mean wind speed for each month:
hot_wind_means <- aggregate(Wind ~ Month, data = airquality2,
                            FUN = mean)

There is nothing wrong with this code per se. We create a copy of airquality (because we don't want to change the original data), change the units of the wind speed, and then compute the grouped means. A downside is that we end up with a copy of airquality that we maybe won't need again. We could avoid that by putting all the operations inside of aggregate:

# More compact:
hot_wind_means <- aggregate(Wind*0.44704 ~ Month,
                            data = airquality[airquality$Temp > 80, ],
                            FUN = mean)

The problem with this is that it is a little difficult to follow because we have to read the code from the inside out. When we run the code, R will first extract the hot days, then convert the wind speed to m/s, and then compute the grouped means - so the operations happen in an order that is the opposite of the order in which we wrote them.

magrittr introduces a new operator, %>%, called a pipe, which can be used to chain functions together. Calls that you would otherwise write as

new_variable <- function_2(function_1(your_data))

can be written as

your_data %>% function_1 %>% function_2 -> new_variable

so that the operations are written in the order they are performed. Some prefer the former style, which is more like mathematics, but many prefer the latter, which is more like natural language (particularly for those of us who are used to reading from left to right).

Three operations are required to solve the airquality wind speed problem:

1. Extract the hot days,
2. Convert the wind speed to m/s,
3. Compute the grouped means.

Previously, we used operations that were not function calls, like airquality2$Wind <- airquality2$Wind * 0.44704. To solve this problem using pipes, we instead need functions that carry out the same operations.

A function that lets us extract the hot days is subset:

subset(airquality, Temp > 80)

The magrittr function inset lets us convert the wind speed:

library(magrittr)
inset(airquality, "Wind", value = airquality$Wind * 0.44704)

And finally, aggregate can be used to compute the grouped means. We could use these functions step by step:

# Extract hot days:
airquality2 <- subset(airquality, Temp > 80)

# Convert wind speed to m/s:
airquality2 <- inset(airquality2, "Wind",
                     value = airquality2$Wind * 0.44704)

# Compute mean wind speed for each month:
hot_wind_means <- aggregate(Wind ~ Month, data = airquality2,
                            FUN = mean)

But, because we have functions to perform the operations, we can instead use %>% pipes to chain them together in a pipeline. Pipes automatically send the output from the previous function as the first argument to the next, so that the data flows from left to right, which makes the code more concise. They also let us refer to the output from the previous function as ., which saves even more space. The resulting code is:

airquality %>%
      subset(Temp > 80) %>%
      inset("Wind", value = .$Wind * 0.44704) %>%
      aggregate(Wind ~ Month, data = ., FUN = mean) -> hot_wind_means

You can read the %>% operator as then: take the airquality data, then subset it, then convert the Wind variable, then compute the grouped means. Once you wrap your head around the idea of reading the operations from left to right, this code is arguably clearer and easier to read. Note that we used the right-assignment operator -> to assign the result to hot_wind_means, in keeping with the idea that the data flows from left to right.

### 3.9.2 Aliases and placeholders

In the remainder of the book, we will use pipes in some situations where they make the code easier to write or read. Pipes don't always make code easier to read, though, as can be seen if we use them to compute $\exp(\log(2))$:

# Standard solution:
exp(log(2))

# magrittr solution:
2 %>% log %>% exp

If you need to use binary operators like +, ^ and <, magrittr has a number of aliases that you can use. For instance, add works as an alias for +:

x <- 2
exp(x + 2)
x %>% add(2) %>% exp

Here are a few more examples:

x <- 2
# Base solution;           magrittr solution
exp(x - 2);                x %>% subtract(2) %>% exp
exp(x * 2);                x %>% multiply_by(2) %>% exp
exp(x / 2);                x %>% divide_by(2) %>% exp
exp(x^2);                  x %>% raise_to_power(2) %>% exp
head(airquality[, 1:4]);   airquality %>% extract(, 1:4) %>% head
airquality$Temp[1:5];      airquality %>% use_series(Temp) %>% extract(1:5)

In simple cases like these it is usually preferable to use the base R solution - the point here is that if you need to perform this kind of operation inside a pipeline, the aliases make it easy to do so. For a complete list of aliases, see ?extract.

If the function does not take the output from the previous function as its first argument, you can use . as a placeholder, just as we did in the airquality problem. Here is another example:

cat(paste("The current time is", Sys.time()))
Sys.time() %>% paste("The current time is", .) %>% cat

If the data only appears inside parentheses, you need to wrap the function in curly brackets {}; otherwise %>% will try to pass it as the first argument to the function:

airquality %>% cat("Number of rows in data:", nrow(.)) # Doesn't work
airquality %>% {cat("Number of rows in data:", nrow(.))} # Works!

In addition to the magrittr pipes, from version 4.1 R also offers a native pipe, |>, which can be used in lieu of %>% without loading any packages. Nevertheless, we’ll use %>% pipes in the remainder of the book, partially because they are more commonly used (meaning that you are more likely to encounter them when looking at other people’s code), and partially because magrittr also offers some other useful pipe operators. You’ll see plenty of examples of how pipes can be used in Chapters 5-9, and learn about other pipe operators in Section 6.2.
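As a brief illustration, here is the same operation with both operators (note that the native pipe requires the right-hand side to be a function call with parentheses):

# magrittr pipe:
airquality %>% subset(Temp > 80) %>% head
# Native pipe, available in R 4.1 and later:
airquality |> subset(Temp > 80) |> head()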

$\sim$

Exercise 3.14 Rewrite the following function calls using pipes, with x <- 1:8:

1. sqrt(mean(x))

2. mean(sqrt(x))

3. sort(x^2-5)[1:2]

Exercise 3.15 Using the bookstore data:
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)
visit_length <- c(5, 2, 20, 22, 12, 31, 9, 10, 11)
bookstore <- data.frame(age, purchase, visit_length)

Add a new variable rev_per_minute which is the ratio between purchase and the visit length, using a pipe.

## 3.10 Flavours of R: base and tidyverse

R is a programming language, and just like any language, it has different dialects. When you read about R online, you’ll frequently see people mentioning the words “base” and “tidyverse.” These are the two most common dialects of R. Base R is just that, R in its purest form. The tidyverse is a collection of add-on packages for working with different types of data. The two are fully compatible, and you can mix and match as much as you like. Both ggplot2 and magrittr are part of the tidyverse.

In recent years, the tidyverse has been heavily promoted as being “modern” R which “makes data science faster, easier and more fun.” You should believe the hype. The tidyverse is marvellous. But if you only learn tidyverse R, you will miss out on much of what R has to offer. Base R is just as marvellous, and can definitely make data science as fast, easy and fun as the tidyverse. Besides, nobody uses just base R anyway - there are a ton of non-tidyverse packages that extend and enrich R in exciting new ways. Perhaps “extended R” or “superpowered R” would be better names for the non-tidyverse dialect.

Anyone who tells you to just learn one of these dialects is wrong. Both are great, they work extremely well together, and they are similar enough that you shouldn’t limit yourself to just mastering one of them. This book will show you both base R and tidyverse solutions to problems, so that you can decide for yourself which is faster, easier, and more fun.

A defining property of the tidyverse is that there are separate functions for everything, which is perfect for code that relies on pipes. In contrast, base R uses fewer functions, but with more parameters, to perform the same tasks. If you use tidyverse solutions, there is a good chance that there exists a function which, with its default settings, performs exactly the task you're going to do. This is great (once again, especially if you want to use pipes), but it means that there are many more functions to master for tidyverse users, whereas you can make do with far fewer in base R. You will spend more time looking up function arguments when working with base R (which fortunately is fairly straightforward using the ? documentation), but on the other hand, looking up arguments for a function whose name you know is easier than finding a function that does something very specific whose name you don't know. There are advantages and disadvantages to both approaches.
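To make the contrast concrete, here is the same grouped summary written in both dialects (a sketch; the tidyverse version assumes that the dplyr package, a core tidyverse member, is installed):

# Base R: one function, controlled by its arguments:
aggregate(Temp ~ Month, data = airquality, FUN = mean)

# tidyverse: several small functions chained with pipes:
library(dplyr)
airquality %>%
      group_by(Month) %>%
      summarise(mean_temp = mean(Temp))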

## 3.11 Ethics and good statistical practice

Throughout this book, there will be sections devoted to ethics. Good statistical practice is intertwined with good ethical practice. Both require transparent assumptions, reproducible results, and valid interpretations.

One of the most commonly cited ethical guidelines for statistical work is The American Statistical Association's Ethical Guidelines for Statistical Practice (Committee on Professional Ethics of the American Statistical Association, 2018), a shortened version of which is presented below.²² The full ethical guidelines are available at https://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx

• Professional Integrity and Accountability. The ethical statistician uses methodology and data that are relevant and appropriate; without favoritism or prejudice; and in a manner intended to produce valid, interpretable, and reproducible results. The ethical statistician does not knowingly accept work for which he/she is not sufficiently qualified, is honest with the client about any limitation of expertise, and consults other statisticians when necessary or in doubt. It is essential that statisticians treat others with respect.
• Integrity of data and methods. The ethical statistician is candid about any known or suspected limitations, defects, or biases in the data that may affect the integrity or reliability of the statistical analysis. Objective and valid interpretation of the results requires that the underlying analysis recognizes and acknowledges the degree of reliability and integrity of the data.
• Responsibilities to Science/Public/Funder/Client. The ethical statistician supports valid inferences, transparency, and good science in general, keeping the interests of the public, funder, client, or customer in mind (as well as professional colleagues, patients, the public, and the scientific community).
• Responsibilities to Research Subjects. The ethical statistician protects and respects the rights and interests of human and animal subjects at all stages of their involvement in a project. This includes respondents to the census or to surveys, those whose data are contained in administrative records, and subjects of physically or psychologically invasive research.
• Responsibilities to Research Team Colleagues. Science and statistical practice are often conducted in teams made up of professionals with different professional standards. The statistician must know how to work ethically in this environment.
• Responsibilities to Other Statisticians or Statistics Practitioners. The practice of statistics requires consideration of the entire range of possible explanations for observed phenomena, and distinct observers drawing on their own unique sets of experiences can arrive at different and potentially diverging judgments about the plausibility of different explanations. Even in adversarial settings, discourse tends to be most successful when statisticians treat one another with mutual respect and focus on scientific principles, methodology, and the substance of data interpretations.
• Responsibilities Regarding Allegations of Misconduct. The ethical statistician understands the differences between questionable statistical, scientific, or professional practices and practices that constitute misconduct. The ethical statistician avoids all of the above and knows how each should be handled.
• Responsibilities of Employers, Including Organizations, Individuals, Attorneys, or Other Clients Employing Statistical Practitioners. Those employing any person to analyze data are implicitly relying on the profession’s reputation for objectivity. However, this creates an obligation on the part of the employer to understand and respect statisticians’ obligation of objectivity.

Similar ethical guidelines for statisticians have been put forward by the International Statistical Institute (https://www.isi-web.org/about-isi/policies/professional-ethics), the United Nations Statistics Division (https://unstats.un.org/unsd/dnss/gp/fundprinciples.aspx), and the Data Science Association (http://www.datascienceassn.org/code-of-conduct.html). For further reading on ethics in statistics, see Franks (2020) and Fleming & Bruce (2021).

$\sim$

Exercise 3.16 Discuss the following. In the introduction to American Statistical Association’s Ethical Guidelines for Statistical Practice, it is stated that “using statistics in pursuit of unethical ends is inherently unethical.” What is considered unethical depends on social, moral, political, and religious values, and ultimately you must decide for yourself what you consider to be unethical ends. Which (if any) of the following do you consider to be unethical?

1. Using statistical analysis to help a company that harms the environment through its production processes. Does it matter to you what the purpose of the analysis is?
2. Using statistical analysis to help a tobacco or liquor manufacturing company. Does it matter to you what the purpose of the analysis is?
3. Using statistical analysis to help a bank identify which loan applicants are likely to default on their loans.
4. Using statistical analysis of social media profiles to identify terrorists.
5. Using statistical analysis of social media profiles to identify people likely to protest against the government.
6. Using statistical analysis of social media profiles to identify people to target with political adverts.
7. Using statistical analysis of social media profiles to target ads at people likely to buy a bicycle.
8. Using statistical analysis of social media profiles to target ads at people likely to gamble at a new online casino. Does it matter to you if it’s an ad for the casino or for help for gambling addiction?

14. And the subsequent troubleshooting makes programming in R more difficult and slower.

15. This works regardless of whether this is a regular data.frame, a data.table or a tibble.


21. Arguably the best-named R package.

22. The excerpt is from the version of the guidelines dated April 2018, and presented here with permission from the ASA.