# 4 Exploratory data analysis and unsupervised learning

Exploratory data analysis (EDA) is a process in which we summarise and visually explore a dataset. An important part of EDA is unsupervised learning, which is a collection of methods for finding interesting subgroups and patterns in our data. Unlike statistical hypothesis testing, which is used to reject hypotheses, EDA can be used to generate hypotheses (which can then be confirmed or rejected by new studies). Another purpose of EDA is to find outliers and incorrect observations, which can lead to a cleaner and more useful dataset. In EDA we ask questions about our data, and then try to answer them using summary statistics and graphics. Some questions will prove to be important, and some will not. The key to finding important questions is to ask a lot of questions. This chapter will provide you with a wide range of tools for question-asking.

After working with the material in this chapter, you will be able to use R to:

• Create reports using R Markdown,
• Customise the look of your plots,
• Visualise the distribution of a variable,
• Create interactive plots,
• Detect and label outliers,
• Investigate patterns in missing data,
• Visualise trends,
• Plot time series data,
• Visualise multiple variables at once using scatterplot matrices, correlograms and bubble plots,
• Visualise multivariate data using principal component analysis,
• Use unsupervised learning techniques for clustering,
• Use factor analysis to find latent variables in your data.

## 4.1 Reports with R Markdown

R Markdown files can be used to create nicely formatted documents using R that are easy to export to other formats, like HTML, Word or PDF. They allow you to mix R code with results and text. They can be used to create reproducible reports that are easy to update with new data, because they include the code for creating tables and figures. Additionally, they can be used as notebooks for keeping track of your work and your thoughts as you carry out an analysis. You can even use them for writing books - in fact, this entire book was written using R Markdown.

It is often a good idea to use R Markdown for exploratory analyses, as it allows you to write down your thoughts and comments as the analysis progresses, as well as to save the results of the exploratory journey. For that reason, we’ll start this chapter by having a look at some examples of what you can do using R Markdown. You can use R Markdown or ordinary R scripts for the analyses in the remainder of the chapter. The R code used is the same and the results are identical, but if you use R Markdown, you can also save the output of the analysis in a nicely formatted document.
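As an aside, the Knit button in RStudio is not the only way to create the report: the rmarkdown package can render a document from the R console as well. In the sketch below, "report.Rmd" is a hypothetical file name used for illustration:

```r
# Render an R Markdown file to the output format specified in its
# YAML header (this is what RStudio's Knit button does behind the
# scenes). "report.Rmd" is a hypothetical file name:
library(rmarkdown)
render("report.Rmd")
```

This can be handy when you want to re-create a report as part of a script, without opening the file in RStudio.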

### 4.1.1 A first example

When you create a new R Markdown document in RStudio by clicking File > New File > R Markdown in the menu, a document similar to the one below is created:

---
title: "Untitled"
author: "Måns Thulin"
date: "10/20/2020"
output: html_document
---

{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

## R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax
for authoring HTML, PDF, and MS Word documents. For more details on
using R Markdown see <http://rmarkdown.rstudio.com>.

When you click the **Knit** button a document will be generated that
includes both content as well as the output of any embedded R code
chunks within the document. You can embed an R code chunk like this:

{r cars}
summary(cars)

## Including Plots

You can also embed plots, for example:

{r pressure, echo=FALSE}
plot(pressure)

Note that the echo = FALSE parameter was added to the code chunk to
prevent printing of the R code that generated the plot.

Press the Knit button at the top of the Script panel to create an HTML document using this Markdown file. It will be saved in the same folder as your Markdown file. Once the HTML document has been created, it will open so that you can see the results. You may have to install additional packages for this to work, in which case RStudio will prompt you.

Now, let’s have a look at what the different parts of the Markdown document do. The first part is called the document header or YAML header. It contains information about the document, including its title, the name of its author, and the date on which it was first created:

---
title: "Untitled"
author: "Måns Thulin"
date: "10/20/2020"
output: html_document
---

The part that says output: html_document specifies what type of document should be created when you press Knit. In this case, it’s set to html_document, meaning that an HTML document will be created. By changing this to output: word_document you can create a .docx Word document instead. By changing it to output: pdf_document, you can create a .pdf document using LaTeX (you’ll have to install LaTeX if you haven’t already; RStudio will notify you if that is the case).

The second part sets the default behaviour of code chunks included in the document, specifying that the output from running the chunks should be printed unless otherwise specified:

{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)


The third part contains the first proper section of the document. First, a header is created using ##. Then there is some text with formatting: angle brackets (< >) around a URL are used to create a link, and ** around text is used to make it bold. Finally, there is a code chunk, delimited by three backticks:

## R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax
for authoring HTML, PDF, and MS Word documents. For more details on
using R Markdown see
<http://rmarkdown.rstudio.com>.

When you click the **Knit** button a document will be generated that
includes both content as well as the output of any embedded R code
chunks within the document. You can embed an R code chunk like this:

{r cars}
summary(cars)


The fourth and final part contains another section, this time with a figure created using R. A setting is added to the code chunk used to create the figure, which prevents the underlying code from being printed in the document:

## Including Plots

You can also embed plots, for example:

{r pressure, echo=FALSE}
plot(pressure)


Note that the echo = FALSE parameter was added to the code chunk to
prevent printing of the R code that generated the plot.

In the next few sections, we will look at how formatting and code chunks work in R Markdown.

### 4.1.2 Formatting text

To create plain text in a Markdown file, you simply write plain text. If you wish to add some formatting to your text, you can use the following:

• _italics_ or *italics*: to create text in italics.
• __bold__ or **bold**: to create bold text.
• [linked text](http://www.modernstatisticswithr.com): to create linked text.
• code: to include inline code in your document.
• $a^2 + b^2 = c^2$: to create inline equations like $a^2 + b^2 = c^2$ using LaTeX syntax.
• $$a^2 + b^2 = c^2$$: to create a centred equation on a new line, like $$a^2 + b^2 = c^2.$$

To add headers and subheaders, and to divide your document into sections, start a new line with #’s as follows:

# Header text
## Subheader text
### Subsubheader text
...and so on.

### 4.1.3 Lists, tables, and images

To create a bulleted list, you can use * as follows. Note that you need a blank line between your list and the previous paragraph to begin a list.

* Item 1
* Item 2
    + Sub-item 1
    + Sub-item 2
* Item 3

yielding:

• Item 1
• Item 2
    • Sub-item 1
    • Sub-item 2
• Item 3

To create an ordered list, use:

1. First item
2. Second item
    i) Sub-item 1
    ii) Sub-item 2
3. Item 3

yielding:

1. First item
2. Second item
    i. Sub-item 1
    ii. Sub-item 2
3. Item 3

To create a table, use | and --------- as follows:

Column 1  | Column 2
--------- | ---------
Content  | More content
Even more  | And some here
Even more?  | Yes!

which yields the following output:

Column 1     Column 2
-----------  --------------
Content      More content
Even more    And some here
Even more?   Yes!

To include an image, use the same syntax as when creating linked text with a link to the image path (either local or on the web), but with a ! in front:

![](https://www.r-project.org/Rlogo.png)

![Put some text here if you want a caption](https://www.r-project.org/Rlogo.png)

which yields the R logo image, the second time with a caption below it.

### 4.1.4 Code chunks

The simplest way to define a code chunk is to simply write:

{r}
plot(pressure)


In RStudio, Ctrl+Alt+I is a keyboard shortcut for inserting this kind of code chunk.

We can add a name and a caption to the chunk; the name lets us reference the figure created by the chunk:

{r pressureplot, fig.cap = "Plot of the pressure data."}
plot(pressure)


As we can see in Figure \@ref(fig:pressureplot), the relationship between
temperature and pressure resembles a banana.

This yields the following output:

plot(pressure)

As we can see in Figure 4.1, the relationship between temperature and pressure resembles a banana.

In addition, you can add settings to the chunk header to control its behaviour. For instance, you can include a code chunk without running it by adding eval = FALSE:

{r, eval = FALSE}
plot(pressure)


Commonly used chunk settings include:

• echo = FALSE to run the code without printing it,
• eval = FALSE to print the code without running it,
• results = "hide" to hide printed output,
• fig.show = "hide" to hide plots,
• warning = FALSE to suppress warning messages from being printed in your document,
• message = FALSE to suppress other messages from being printed in your document,
• include = FALSE to run a chunk without showing the code or results in the document,
• error = TRUE to continue rendering your R Markdown document even if there is an error in the chunk (by default, document creation stops if there is an error).
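Several of these settings can be combined in a single chunk header, separated by commas. As a sketch (the chunk name quiet-summary is made up for this example), a chunk that hides its code and suppresses messages and warnings, while still showing its results, could look like this:

```
{r quiet-summary, echo = FALSE, message = FALSE, warning = FALSE}
summary(cars)
```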

Data frames can be printed either as in the Console or as a nicely formatted table. For example,

{r, echo = FALSE}
head(airquality)


yields:

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

whereas

{r, echo = FALSE}
knitr::kable(
  head(airquality),
  caption = "Some data I found."
)


yields:

Table 4.1: Some data I found.
 Ozone   Solar.R   Wind   Temp   Month   Day
------   -------   ----   ----   -----   ---
    41       190    7.4     67       5     1
    36       118    8.0     72       5     2
    12       149   12.6     74       5     3
    18       313   11.5     62       5     4
    NA        NA   14.3     56       5     5
    28        NA   14.9     66       5     6

Further help and documentation for R Markdown can be found through the RStudio menus, by clicking Help > Cheatsheets > R Markdown Cheat Sheet or Help > Cheatsheets > R Markdown Reference Guide.

## 4.2 Customising ggplot2 plots

We’ll be using ggplot2 a lot in this chapter, so before we get started with exploratory analyses, we’ll take some time to learn how we can customise the look of ggplot2-plots.

Consider the following facetted plot from Section 2.7.4:

library(ggplot2)

ggplot(msleep, aes(brainwt, sleep_total)) +
  geom_point() +
  xlab("Brain weight (logarithmic scale)") +
  ylab("Total sleep time") +
  scale_x_log10() +
  facet_wrap(~ vore)

It looks nice, sure, but there may be things that you’d like to change. Maybe you’d like the background of the plot to be white instead of grey, or maybe you’d like to use a different font. These, and many other things, can be modified using themes.

### 4.2.1 Using themes

ggplot2 comes with a number of basic themes. All are fairly similar, but differ in things like background colour, grids and borders. You can add them to your plot using theme_themeName(), where themeName is the name of the theme. Here are some examples:

p <- ggplot(msleep, aes(brainwt, sleep_total, colour = vore)) +
  geom_point() +
  xlab("Brain weight (logarithmic scale)") +
  ylab("Total sleep time") +
  scale_x_log10() +
  facet_wrap(~ vore)

# Create plot with different themes:
p + theme_grey() # The default theme
p + theme_bw()
p + theme_linedraw()
p + theme_light()
p + theme_dark()
p + theme_minimal()
p + theme_classic()
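If you find yourself adding the same theme to every plot, you can set it once for the whole session instead. The sketch below uses theme_set, which is part of ggplot2 itself; the choice of theme_minimal is just an example:

```r
library(ggplot2)

# Make theme_minimal() the default for all subsequent plots.
# theme_set returns the previous default, which we store so that
# we can restore it later:
old_theme <- theme_set(theme_minimal())

# ...create some plots here...

# Restore the previous default theme:
theme_set(old_theme)
```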

More themes are available in other packages, for instance ggthemes:

install.packages("ggthemes")
library(ggthemes)

# Create plot with different themes from ggthemes:
p + theme_tufte() # Minimalist Tufte theme
p + theme_wsj() # Wall Street Journal
p + theme_solarized() + scale_colour_solarized() # Solarized colours

Further themes can be found in the hrbrthemes package:

install.packages("hrbrthemes")
library(hrbrthemes)

# Create plot with different themes from hrbrthemes:
p + theme_ipsum() # Ipsum theme
p + theme_ft_rc() # Suitable for use with dark RStudio themes
p + theme_modern_rc() # Suitable for use with dark RStudio themes

### 4.2.2 Theme settings

The point of using a theme is that you get a combination of colours, fonts and other choices that are supposed to go well together, meaning that you don’t have to spend too much time on picking combinations. But if you like, you can override the defaults and customise any and all parts of a theme.

Let’s start by creating a ggplot object:

p <- ggplot(msleep, aes(brainwt, sleep_total, colour = vore)) +
  geom_point() +
  xlab("Brain weight (logarithmic scale)") +
  ylab("Total sleep time") +
  scale_x_log10()

You can change the colour palette, i.e. the list of colours used for plotting, using scale_colour_brewer - see ?scale_colour_brewer for a list of the available palettes. Here are some examples:

p + scale_colour_brewer(palette = "Set1")
p + scale_colour_brewer(palette = "Set2")
p + scale_colour_brewer(palette = "Accent")
p + scale_colour_brewer(palette = "Spectral")
p + scale_colour_brewer(palette = "RdBu")

You can use theme to remove the legend, or change its position:

# No legend:
p + theme(legend.position = "none")

# Legend below figure:
p + theme(legend.position = "bottom")

# Legend inside plot:
p + theme(legend.position = c(0.9, 0.7))

In the last example, the vector c(0.9, 0.7) gives the relative coordinates of the legend, with c(0, 0) representing the bottom left corner and c(1, 1) the upper right corner. Try to change the coordinates to different values between 0 and 1 and see what happens.

theme has a lot of other settings, including for the colours of the background, the grid and the text in the plot. Here are a few examples that you can use as starting point for experimenting with the settings:

p + theme(panel.grid.major = element_line(colour = "black"),
          panel.grid.minor = element_line(colour = "red",
                                          linetype = "dotted"),
          panel.background = element_rect(colour = "blue", size = 2),
          plot.background = element_rect(fill = "yellow"),
          axis.text = element_text(family = "mono", colour = "purple"),
          axis.title = element_text(family = "serif", size = 14))

To find a complete list of settings, see ?theme, ?element_line (lines), ?element_rect (borders and backgrounds), ?element_text (text), and ?element_blank (for suppressing plotting of elements). As before, you can use colors() to get a list of built-in colours, or use colour hex codes.

$\sim$

Exercise 4.1 Use the documentation for theme and the element_... functions to change the plot object p created above as follows:

1. Change the background colour of the entire plot to lightblue.

2. Change the font of the legend to serif.

3. Remove the grid.

4. Change the colour of the axis ticks to orange and make them thicker.

## 4.3 Exploring distributions

It is often useful to visualise the distribution of a numerical variable. Comparing the distributions of different groups can lead to important insights. Visualising distributions is also important when checking assumptions used for various statistical tests (sometimes called initial data analysis). In this section we will illustrate how this can be done using the diamonds data from the ggplot2 package, which you started to explore in Chapter 2.

### 4.3.1 Density plots and frequency polygons

We already know how to visualise the distribution of the data by dividing it into bins and plotting a histogram:

library(ggplot2)
ggplot(diamonds, aes(carat)) +
  geom_histogram(colour = "black")

A similar plot is created using frequency polygons, which use lines instead of bars to display the counts in the bins:

ggplot(diamonds, aes(carat)) +
  geom_freqpoly()

An advantage with frequency polygons is that they can be used to compare groups, e.g. diamonds with different cuts, without facetting:

ggplot(diamonds, aes(carat, colour = cut)) +
  geom_freqpoly()

It is clear from this figure that there are more diamonds with ideal cuts than diamonds with fair cuts in the data. The polygons have roughly the same shape, except perhaps for the polygon for diamonds with fair cuts.

In some cases we are more interested in the shape of the distribution than in the actual counts in the different bins. Density plots are similar to frequency polygons, but show an estimate of the density function of the underlying random variable. These estimates are smooth curves that are scaled so that the area below them is 1 (i.e. scaled to be proper density functions):

ggplot(diamonds, aes(carat, colour = cut)) +
  geom_density()

From this figure, it becomes clear that low-carat diamonds tend to have better cuts, which wasn’t obvious from the frequency polygons. However, the plot does not provide any information about how common different cuts are. Use density plots if you’re more interested in the shape of a variable’s distribution, and frequency polygons if you’re more interested in counts.
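To see what “scaled so that the area below them is 1” means in practice, we can check it numerically. This is an aside rather than part of the analysis; it uses the built-in faithful dataset and base R’s density function so that the example is self-contained:

```r
# Kernel density estimate of eruption times from the built-in
# faithful dataset:
d <- density(faithful$eruptions)

# Approximate the area under the estimated curve with the
# trapezoidal rule:
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area  # approximately 1
```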

$\sim$

Exercise 4.2 Using the density plot created above and the documentation for geom_density, do the following:

1. Increase the smoothness of the density curves.

2. Fill the area under the density curves with the same colour as the curves themselves.

3. Make the colours that fill the areas under the curves transparent.

4. The figure still isn’t that easy to interpret. Install and load the ggridges package, which is an extension of ggplot2 that allows you to make so-called ridge plots (density plots that are separated along the y-axis, similar to facetting). Read the documentation for geom_density_ridges and use it to make a ridge plot of diamond prices for different cuts.

Exercise 4.3 Return to the histogram created by ggplot(diamonds, aes(carat)) + geom_histogram() above. As there are very few diamonds with carat greater than 3, cut the x-axis at 3. Then decrease the bin width to 0.01. Do any interesting patterns emerge?

Exercise 4.3 leads us to ask why diamonds with carat values that are multiples of 0.25 are more common than others. Perhaps price is involved? Unfortunately, a plot of carat versus price is not that informative:

ggplot(diamonds, aes(carat, price)) +
  geom_point(size = 0.05)

Maybe we could compute the average price in each bin of the histogram? In that case, we need to extract the bin breaks from the histogram somehow. We could then create a new categorical variable using the breaks with cut (as we did in Exercise 3.7). It turns out that extracting the bins is much easier using base graphics than ggplot2, so let’s do that:

# Extract information from a histogram with bin width 0.01,
# which corresponds to 481 breaks:
carat_br <- hist(diamonds$carat, breaks = 481, right = FALSE,
                 plot = FALSE)

# Of interest to us are:
# carat_br$breaks, which contains the breaks for the bins
# carat_br$mid, which contains the midpoints of the bins
# (useful for plotting!)

# Create categories for each bin:
diamonds$carat_cat <- cut(diamonds$carat, 481, right = FALSE)

We now have a variable, carat_cat, that shows which bin each observation belongs to. Next, we’d like to compute the mean for each bin. This is a grouped summary: mean by category. After we’ve computed the bin means, we could then plot them against the bin midpoints. Let’s try it:

means <- aggregate(price ~ carat_cat, data = diamonds, FUN = mean)
plot(carat_br$mid, means$price)

That didn’t work as intended. We get an error message when attempting to plot the results:

Error in xy.coords(x, y, xlabel, ylabel, log) :
  'x' and 'y' lengths differ

The error message implies that the number of bins and the number of mean values that have been computed differ. But we’ve just computed the mean for each bin, haven’t we? So what’s going on?

By default, aggregate ignores groups for which there are no values when computing grouped summaries. In this case, there are a lot of empty bins - there is for instance no observation in the [4.99,5) bin. In fact, only 272 out of the 481 bins are non-empty.

We can solve this in different ways. One way is to remove the empty bins. We can do this using the match function, which returns the positions of matching values in two vectors. If we use it with the bins from the grouped summary and the vector containing all bins, we can find the indices of the non-empty bins. This requires the use of the levels function, that you’ll learn more about in Section 5.4:

means <- aggregate(price ~ carat_cat, data = diamonds, FUN = mean)
id <- match(means$carat_cat, levels(diamonds$carat_cat))

Finally, we’ll also add some vertical lines to our plot, to call attention to multiples of 0.25. Using base graphics is faster here:

plot(carat_br$mid[id], means$price, cex = 0.5)

# Add vertical lines at multiples of 0.25:
abline(v = c(0.5, 0.75, 1, 1.25, 1.5))

But we can of course stick to ggplot2 if we like:

library(ggplot2)
d2 <- data.frame(bin = carat_br$mid[id],
                 mean_price = means$price)
ggplot(d2, aes(bin, mean_price)) +
  geom_point(size = 0.5) +
  geom_vline(xintercept = c(0.5, 0.75, 1, 1.25, 1.5))
is.na(msleep$sleep_rem)

We can count the number of missing values for each variable using:

colSums(is.na(msleep))

Here, colSums computes the sum of is.na(msleep) for each column of msleep (remember that in summation, TRUE counts as 1 and FALSE as 0), yielding the number of missing values for each variable. In total there are 136 missing values in the dataset:

sum(is.na(msleep))

You’ll notice that ggplot2 prints a warning in the Console when you create a plot with missing data:

ggplot(msleep, aes(brainwt, sleep_total)) +
  geom_point() +
  scale_x_log10()

Sometimes data are missing simply because the information is not yet available (for instance, the brain weight of the mountain beaver could be missing because no one has ever weighed the brain of a mountain beaver). In other cases, data can be missing because something about them is different (for instance, values for a male patient in a medical trial can be missing because the patient died, or because some values were only collected for female patients). It is therefore of interest to see if there are any differences in non-missing variables between subjects that have missing data and subjects that don’t.

In msleep, all animals have recorded values for sleep_total and bodywt. To check if the animals that have missing brainwt values differ from the others, we can plot them in a different colour in a scatterplot:

ggplot(msleep, aes(bodywt, sleep_total, colour = is.na(brainwt))) +
  geom_point() +
  scale_x_log10()

(If is.na(brainwt) is TRUE, then the brain weight is missing in the dataset.) In this case, there are no obvious differences between the animals with missing data and those without.

$\sim$

Exercise 4.7 Create a version of the diamonds dataset where the x value is missing for all diamonds with $x > 9$. Make a scatterplot of carat versus price, with points where the x value is missing plotted in a different colour. How would you interpret this plot?

(Click here to go to the solution.)
### 4.4.4 Exploring data

The nycflights13 package contains data about flights to and from three airports in New York, USA, in 2013. As a summary exercise, we will study a subset of these, namely all flights departing from New York on 1 January that year:

install.packages("nycflights13")
library(nycflights13)
flights2 <- flights[flights$month == 1 & flights$day == 1,]

$\sim$

Exercise 4.8 Explore the flights2 dataset, focusing on delays and the amount of time spent in the air. Are there any differences between the different carriers? Are there missing data? Are there any outliers?

(Click here to go to the solution.)

## 4.6 Exploring time series

Before we have a look at time series, you should install four useful packages: forecast, nlme, fma and fpp2. The first contains useful functions for plotting time series data, and the latter three contain datasets that we’ll use.

install.packages(c("nlme", "forecast", "fma", "fpp2"),
                 dependencies = TRUE)

The a10 dataset contains information about the monthly anti-diabetic drug sales in Australia during the period July 1991 to June 2008. By checking its structure, we see that it is saved as a time series object:

library(fpp2)
str(a10)

ggplot2 requires that data is saved as a data frame in order for it to be plotted. In order to plot the time series, we could first convert it to a data frame and then plot each point using geom_point:

a10_df <- data.frame(time = time(a10), sales = a10)
ggplot(a10_df, aes(time, sales)) +
  geom_point()

It is however usually preferable to plot time series using lines instead of points. This is done using a different geom, geom_line:

ggplot(a10_df, aes(time, sales)) +
  geom_line()

Having to convert the time series object to a data frame is a little awkward. Luckily, there is a way around this.
ggplot2 offers a function called autoplot, which automatically draws an appropriate plot for certain types of data, including time series objects:

autoplot(a10)

We can still add other geoms, axis labels and other things just as before. autoplot has simply replaced the ggplot(data, aes()) + geom part that would be the first two rows of the ggplot2 figure, and has implicitly converted the data to a data frame.

$\sim$

Exercise 4.10 Using the autoplot(a10) figure, do the following:

1. Add a smoothed line describing the trend in the data. Make sure that it is smooth enough not to capture the seasonal variation in the data.

2. Change the label of the x-axis to “Year” and the label of the y-axis to “Sales ($ million)”.

3. Check the documentation for the ggtitle function. What does it do? Use it with the figure.

4. Change the colour of the time series line to red.

### 4.6.1 Annotations and reference lines

We sometimes wish to add text or symbols to plots, for instance to highlight interesting observations. Consider the following time series plot of daily morning gold prices, based on the gold data from the forecast package:

library(forecast)
autoplot(gold)

There is a sharp spike a few weeks before day 800, which is due to an incorrect value in the data series. We’d like to add a note about that to the plot. First, we wish to find out on which day the spike appears. This can be done by checking the data manually, or using some code:

spike_date <- which.max(gold)

To add a circle around that point, we add a call to annotate to the plot:

autoplot(gold) +
  annotate(geom = "point", x = spike_date, y = gold[spike_date],
           size = 5, shape = 21, colour = "red",
           fill = "transparent")

annotate can be used to annotate the plot with both geometrical objects and text (and can therefore be used as an alternative to geom_text).

$\sim$

Exercise 4.11 Using the figure created above and the documentation for annotate, do the following:

1. Add the text “Incorrect value” next to the circle.

2. Create a second plot where the incorrect value has been removed.

3. Read the documentation for the geom geom_hline. Use it to add a red reference line to the plot, at $y = 400$.

### 4.6.2 Longitudinal data

Multiple time series with identical time points, known as longitudinal data or panel data, are common in many fields. One example of this is given by the elecdaily time series from the fpp2 package, which contains information about electricity demand in Victoria, Australia during 2014. As with a single time series, we can plot these data using autoplot:

library(fpp2)
autoplot(elecdaily)

In this case, it is probably a good idea to facet the data, i.e. to plot each series in a separate figure:

autoplot(elecdaily, facets = TRUE)

$\sim$

Exercise 4.12 Make the following changes to the autoplot(elecdaily, facets = TRUE):

1. Remove the WorkDay variable from the plot (it describes whether or not a given date is a work day, and while it is useful for modelling purposes, we do not wish to include it in our figure).

2. Add smoothed trend lines to the time series plots.

### 4.6.3 Path plots

Another option for plotting multiple time series is path plots. A path plot is a scatterplot where the points are connected with lines in the order that they appear in the data (which, for time series data, should correspond to time). The lines and points can be coloured to represent time.

To make a path plot of Temperature versus Demand for the elecdaily data, we first convert the time series object to a data frame and create a scatterplot:

library(fpp2)
ggplot(as.data.frame(elecdaily), aes(Temperature, Demand)) +
  geom_point()

Next, we connect the points by lines using the geom_path geom:

ggplot(as.data.frame(elecdaily), aes(Temperature, Demand)) +
  geom_point() +
  geom_path()

The resulting figure is quite messy. Using colour to indicate the passing of time helps a little. For this, we need to add the day of the year to the data frame. To get the values right, we use nrow which gives us the number of rows in the data frame.

elecdaily2 <- as.data.frame(elecdaily)
elecdaily2$day <- 1:nrow(elecdaily2)

ggplot(elecdaily2, aes(Temperature, Demand, colour = day)) +
  geom_point() +
  geom_path()

It becomes clear from the plot that temperatures were the highest at the beginning of the year, and lower in the winter months (July-August).

$\sim$

Exercise 4.13 Make the following changes to the plot you created above:

1. Decrease the size of the points.

2. Add text annotations showing the dates of the highest and lowest temperatures, next to the corresponding points in the figure.

(Click here to go to the solution.)

### 4.6.4 Spaghetti plots

In cases where we’ve observed multiple subjects over time, we often wish to visualise their individual time series together using so-called spaghetti plots. With ggplot2 this is done using the geom_line geom. To illustrate this, we use the Oxboys data from the nlme package, showing the heights of 26 boys over time.

library(nlme)
ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_point() +
  geom_line()

The first two aes arguments specify the x- and y-axes, and the third specifies that there should be one line per subject (i.e. per boy) rather than a single line interpolating all points. The latter would be a rather useless figure that looks like this:

ggplot(Oxboys, aes(age, height)) +
  geom_point() +
  geom_line() +
  ggtitle("A terrible plot")

Returning to the original plot, if we wish to be able to identify which time series corresponds to which boy, we can add a colour aesthetic:

ggplot(Oxboys, aes(age, height, group = Subject, colour = Subject)) +
  geom_point() +
  geom_line()

Note that the boys are ordered by height, rather than subject number, in the legend. Now, imagine that we wish to add a trend line describing the general growth trend for all boys.
The growth appears approximately linear, so it seems sensible to use geom_smooth(method = "lm") to add the trend:

ggplot(Oxboys, aes(age, height, group = Subject, colour = Subject)) +
  geom_point() +
  geom_line() +
  geom_smooth(method = "lm", colour = "red", se = FALSE)

Unfortunately, because we have specified in the aesthetics that the data should be grouped by Subject, geom_smooth produces one trend line for each boy. The “problem” is that when we specify an aesthetic in the ggplot call, it is used for all geoms.

$\sim$

Exercise 4.14 Figure out how to produce a spaghetti plot of the Oxboys data with a single red trend line based on the data from all 26 boys.

(Click here to go to the solution.)

### 4.6.5 Seasonal plots and decompositions

The forecast package includes a number of useful functions for working with time series. One of them is ggseasonplot, which allows us to easily create a spaghetti plot of different periods of a time series with seasonality, i.e. with patterns that repeat seasonally over time. It works similarly to the autoplot function, in that it replaces the ggplot(data, aes) + geom part of the code.

library(forecast)
library(fpp2)
ggseasonplot(a10)

This function is very useful when visually inspecting seasonal patterns. The year.labels and year.labels.left arguments remove the legend in favour of putting the years at the end and beginning of the lines:

ggseasonplot(a10, year.labels = TRUE, year.labels.left = TRUE)

As always, we can add more things to our plot if we like:

ggseasonplot(a10, year.labels = TRUE, year.labels.left = TRUE) +
  ylab("Sales ($ million)") +
ggtitle("Seasonal plot of anti-diabetic drug sales")

When working with seasonal time series, it is common to decompose the series into a seasonal component, a trend component and a remainder. In R, this is typically done using the stl function, which uses repeated LOESS smoothing to decompose the series. There is an autoplot function for stl objects:

autoplot(stl(a10, s.window = 365))

This plot, too, can be manipulated in the same way as other ggplot objects. You can access the different parts of the decomposition as follows:

stl(a10, s.window = 365)$time.series[,"seasonal"]
stl(a10, s.window = 365)$time.series[,"trend"]
stl(a10, s.window = 365)$time.series[,"remainder"]

$\sim$

Exercise 4.15 Investigate the writing dataset from the fma package graphically. Make a time series plot with a smoothed trend line, a seasonal plot and an stl decomposition plot. Add appropriate plot titles and labels to the axes. Can you see any interesting patterns?

(Click here to go to the solution.)

### 4.6.6 Detecting changepoints

The changepoint package contains a number of methods for detecting changepoints in time series, i.e. time points at which either the mean or the variance of the series changes. Finding changepoints can be important for detecting changes in the process underlying the time series. The ggfortify package extends ggplot2 by adding autoplot functions for a variety of tools, including those in changepoint. Let's install the packages:

install.packages(c("changepoint", "ggfortify"))

We can now look at some examples with the anti-diabetic drug sales data:

library(forecast)
library(fpp2)
library(changepoint)
library(ggfortify)

# Plot the time series:
autoplot(a10)

# Remove the seasonal part and plot the series again:
a10_ns <- a10 - stl(a10, s.window = 365)$time.series[,"seasonal"]
autoplot(a10_ns)

# Plot points where there are changes in the mean:
autoplot(cpt.mean(a10_ns))

# Choosing a different method for finding changepoints
# changes the result:
autoplot(cpt.mean(a10_ns, method = "BinSeg"))

# Plot points where there are changes in the variance:
autoplot(cpt.var(a10_ns))

# Plot points where there are changes in either the mean or the variance:
autoplot(cpt.meanvar(a10_ns))

As you can see, the different methods from changepoint yield different results. The results for changes in the mean are a bit dubious - which isn't all that strange, since we are using methods that look for jumps in the mean on a series where the increase actually is more or less continuous. The changepoint for the variance looks more reliable - there is a clear change towards the end of the series, where the sales become more volatile. We won't go into detail about the different methods here, but note that the documentation at ?cpt.mean, ?cpt.var, and ?cpt.meanvar contains descriptions of and references for the available methods.
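If you want to work with the detected changepoints numerically rather than just graphically, the changepoint package provides accessor functions for its fitted objects. A small sketch, reusing the deseasonalised series created above (cpts returns the changepoint locations and param.est the per-segment parameter estimates):

```r
library(forecast)
library(fpp2)
library(changepoint)

# Remove the seasonal component, as before:
a10_ns <- a10 - stl(a10, s.window = 365)$time.series[, "seasonal"]

# Fit a changepoint model for the variance:
fit <- cpt.var(a10_ns)

# cpts() gives the locations of the detected changepoints
# (as indices into the series), and param.est() the estimated
# parameter (here, the variance) for each segment:
cpts(fit)
param.est(fit)
```

The same accessors work for objects returned by cpt.mean and cpt.meanvar.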

$\sim$

Exercise 4.16 Are there any changepoints for variance in the Demand time series in elecdaily? Can you explain why the behaviour of the series changes?

### 4.6.7 Interactive time series plots

The plotly package can be used to create interactive time series plots. As before, you create a ggplot2 object as usual, assign it to a variable, and then call the ggplotly function. Here is an example with the elecdaily data:

library(plotly)
library(fpp2)
myPlot <- autoplot(elecdaily[,"Demand"])

ggplotly(myPlot)

When you hover the mouse pointer over a point, a box appears, which displays information about that data point. Unfortunately, the date formatting isn’t great in this example - dates are shown as weeks with decimal points. We’ll see how to fix this in Section 5.6.
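Until then, one workaround is to plot a data frame with proper Date values instead of the ts object. This is only a sketch, relying on the fact that elecdaily contains daily data starting on 1 January 2014:

```r
library(fpp2)
library(ggplot2)
library(plotly)

# Rebuild the demand series as a data frame with real dates
# (elecdaily holds daily data starting on 1 January 2014):
elec_df <- data.frame(
  Date = seq(as.Date("2014-01-01"), by = "day",
             length.out = nrow(elecdaily)),
  Demand = as.numeric(elecdaily[, "Demand"]))

myPlot <- ggplot(elec_df, aes(Date, Demand)) + geom_line()
ggplotly(myPlot)
```

The hover boxes now display properly formatted dates.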

## 4.7 Visualising multiple variables

### 4.7.1 Scatterplot matrices

When we have a large enough number of numeric variables in our data, plotting scatterplots of all pairs of variables becomes tedious. Luckily there are some R functions that speed up this process.

The GGally package is an extension of ggplot2 which contains several functions for plotting multivariate data. They work similarly to the autoplot functions that we have used in previous sections. One of these is ggpairs, which creates a scatterplot matrix, that is, a grid with scatterplots of all pairs of variables in the data. In addition, it plots density estimates (along the diagonal) and shows the (Pearson) correlation for each pair. Let's start by installing GGally:

install.packages("GGally")

To create a scatterplot matrix for the airquality dataset, simply write:

library(GGally)
ggpairs(airquality)

(Enlarging your Plot window can make the figure look better.)

If we want to create a scatterplot matrix but only want to include some of the variables in a dataset, we can do so by providing a vector with variable names. Here is an example for the animal sleep data msleep:

ggpairs(msleep[, c("sleep_total", "sleep_rem", "sleep_cycle", "awake",
"brainwt", "bodywt")])

Optionally, if we wish to create a scatterplot matrix involving all numeric variables, we can replace the vector of variable names with some R code that extracts the columns containing numeric variables:

ggpairs(msleep[, which(sapply(msleep, class) == "numeric")])

(You’ll learn more about the sapply function in Section 6.5.)

The resulting plot is identical to the previous one, because the list of names contained all the numeric variables. The grab-all-numeric-variables approach is often convenient, because we don't have to write out all the variable names. On the other hand, it isn't helpful if we only want to include some of the numeric variables.
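Note that the comparison sapply(msleep, class) == "numeric" works here because all numeric columns in msleep happen to be doubles; columns of class "integer" would be missed. A slightly more robust sketch uses is.numeric, which is TRUE for both doubles and integers:

```r
library(ggplot2)  # msleep ships with ggplot2
library(GGally)

# is.numeric() is TRUE for both double and integer columns:
numeric_cols <- sapply(msleep, is.numeric)
ggpairs(msleep[, numeric_cols])
```

For msleep the two approaches select the same columns, but for datasets with integer columns they can differ.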

If we include a categorical variable in the list of variables (such as the feeding behaviour vore), the matrix will include a bar plot of the categorical variable as well as boxplots and facetted histograms to show differences between different categories in the continuous variables:

ggpairs(msleep[, c("vore", "sleep_total", "sleep_rem", "sleep_cycle",
"awake", "brainwt", "bodywt")])

Alternatively, we can use a categorical variable to colour points and density estimates using aes(colour = ...). The syntax for this follows the same pattern as a standard ggplot call - ggpairs(data, aes). The only exception is that if the categorical variable is not included in the data argument, we must specify which data frame it belongs to:

ggpairs(msleep[, c("sleep_total", "sleep_rem", "sleep_cycle", "awake",
                   "brainwt", "bodywt")],
        aes(colour = msleep$vore, alpha = 0.5))

As a side note, if all variables in your data frame are numeric, and you only are looking for a quick-and-dirty scatterplot matrix without density estimates and correlations, you can also use the base R plot function:

plot(airquality)

$\sim$

Exercise 4.17 Create a scatterplot matrix for all numeric variables in diamonds. Differentiate different cuts by colour. Add a suitable title to the plot. (diamonds is a fairly large dataset, so it may take a minute or so for R to create the plot.)

(Click here to go to the solution.)

### 4.7.2 3D scatterplots

The plotly package lets us make three-dimensional scatterplots with the plot_ly function, which can be a useful alternative to scatterplot matrices in some cases. Here is an example using the airquality data:

library(plotly)
plot_ly(airquality, x = ~Ozone, y = ~Wind, z = ~Temp, color = ~factor(Month))

Note that you can drag and rotate the plot to see it from different angles.

### 4.7.3 Correlograms

Scatterplot matrices are not a good choice when we have too many variables, partially because the plot window needs to be very large to fit all the variables, and partially because it becomes difficult to get a good overview of the data. In such cases we can instead use a correlogram, where the strength of the correlation between each pair of variables is plotted instead of scatterplots. It is effectively a visualisation of the correlation matrix of the data, where the strengths and signs of the correlations are represented by different colours. The GGally package contains the function ggcorr, which can be used to create a correlogram:

ggcorr(msleep[, c("sleep_total", "sleep_rem", "sleep_cycle", "awake",
                  "brainwt", "bodywt")])

$\sim$

Exercise 4.18 Using the diamonds dataset and the documentation for ggcorr, do the following:

1. Create a correlogram for all numeric variables in the dataset.

2. The Pearson correlation that ggcorr uses by default isn't always the best choice.
A commonly used alternative is the Spearman correlation. Change the type of correlation used to create the plot to the Spearman correlation.

3. Change the colour scale to a categorical scale with 5 categories.

4. Change the colours on the scale to go from yellow (low correlation) to black (high correlation).

(Click here to go to the solution.)

### 4.7.4 Adding more variables to scatterplots

We have already seen how scatterplots can be used to visualise two continuous variables and one categorical variable, by plotting the two continuous variables against each other and using the categorical variable to set the colours of the points. There are, however, more ways to incorporate information about additional variables into a scatterplot.

So far, we have set three aesthetics in our scatterplots: x, y and colour. Two other important aesthetics are shape and size, which, as you'd expect, allow us to control the shape and size of the points. As a first example using the msleep data, we use feeding behaviour (vore) to set the shapes used for the points:

ggplot(msleep, aes(brainwt, sleep_total, shape = vore)) +
  geom_point() +
  scale_x_log10()

The plot looks a little nicer if we increase the size of the points (since the size is a fixed value rather than an aesthetic mapped to a variable, we set it in the geom rather than in the aes):

ggplot(msleep, aes(brainwt, sleep_total, shape = vore)) +
  geom_point(size = 2) +
  scale_x_log10()

Another option is to let size represent a continuous variable, in what is known as a bubble plot:

ggplot(msleep, aes(brainwt, sleep_total, colour = vore, size = bodywt)) +
  geom_point() +
  scale_x_log10()

The size of each "bubble" now represents the weight of the animal. Because some animals are much heavier (i.e. have higher bodywt values) than most others, almost all points are quite small. There are a couple of things we can do to remedy this. First, we can transform bodywt, e.g. using the square root transformation sqrt(bodywt), to decrease the differences between large and small animals. This can be done by adding scale_size(trans = "sqrt") to the plot.
Second, we can also use scale_size to control the range of point sizes (e.g. from size 1 to size 20). This will cause some points to overlap, so we add alpha = 0.5 to the geom to make the points transparent:

ggplot(msleep, aes(brainwt, sleep_total, colour = vore, size = bodywt)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_size(range = c(1, 20), trans = "sqrt")

This produces a fairly nice-looking plot, but it would look even better if we changed the axis labels and legend texts. We can change the legend text for the size scale by adding the name argument to scale_size. Including a \n in the text lets us create a line break - you'll learn more tricks like that in Section 5.5. Similarly, we can use scale_colour_discrete to change the legend text for the colours:

ggplot(msleep, aes(brainwt, sleep_total, colour = vore, size = bodywt)) +
  geom_point(alpha = 0.5) +
  xlab("log(Brain weight)") +
  ylab("Sleep total (h)") +
  scale_x_log10() +
  scale_size(range = c(1, 20), trans = "sqrt",
             name = "Square root of\nbody weight") +
  scale_colour_discrete(name = "Feeding behaviour")

$\sim$

Exercise 4.19 Using the bubble plot created above, do the following:

1. Replace colour = vore in the aes by fill = vore and add colour = "black", shape = 21 to geom_point. What happens?

2. Use ggplotly to create an interactive version of the bubble plot above, where variable information and the animal name are displayed when you hover over a point.

(Click here to go to the solution.)

### 4.7.5 Overplotting

Let's make a scatterplot of table versus depth based on the diamonds dataset:

ggplot(diamonds, aes(table, depth)) +
  geom_point()

This plot is cluttered. There are too many points, which makes it difficult to see if, for instance, high table values are more common than low table values. In this section, we'll look at some ways to deal with this problem, known as overplotting.
The first thing we can try is to decrease the point size:

ggplot(diamonds, aes(table, depth)) +
  geom_point(size = 0.1)

This helps a little, but now the outliers become a bit difficult to spot. We can try changing the opacity using alpha instead:

ggplot(diamonds, aes(table, depth)) +
  geom_point(alpha = 0.2)

This is also better than the original plot, but neither plot is great. Instead of plotting each individual point, perhaps we can plot the counts or densities in different regions of the plot instead? Effectively, this would be a 2D version of a histogram. There are several ways of doing this in ggplot2. First, we can bin the points and count the numbers in each bin, using geom_bin2d:

ggplot(diamonds, aes(table, depth)) +
  geom_bin2d()

By default, geom_bin2d uses 30 bins. Increasing that number can sometimes give us a better idea about the distribution of the data:

ggplot(diamonds, aes(table, depth)) +
  geom_bin2d(bins = 50)

If you prefer, you can get a similar plot with hexagonal bins by using geom_hex instead:

ggplot(diamonds, aes(table, depth)) +
  geom_hex(bins = 50)

As an alternative to bin counts, we can compute a two-dimensional density estimate and draw a contour plot showing the levels of the density:

ggplot(diamonds, aes(table, depth)) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon", colour = "white")

The fill = ..level.. bit above probably looks a little strange to you. It means that an internal variable (the level of the contours) is used to choose the fill colours. It also means that we've reached deep into the internals of ggplot2! We can use a similar approach to show a summary statistic for a third variable in a plot. For instance, we may want to plot the average price as a function of table and depth.
This is called a tile plot:

ggplot(diamonds, aes(table, depth, z = price)) +
  geom_tile(binwidth = 1, stat = "summary_2d", fun = mean) +
  ggtitle("Mean prices for diamonds with different depths and tables")

$\sim$

Exercise 4.20 The following tasks involve the diamonds dataset:

1. Create a tile plot of table versus depth, showing the highest price for a diamond in each bin.

2. Create a bin plot of carat versus price. What type of diamonds have the highest bin counts?

(Click here to go to the solution.)

### 4.7.6 Categorical data

When visualising a pair of categorical variables, plots similar to those in the previous section prove to be useful. One option is to use the geom_count geom. We illustrate this with an example using diamonds, showing how common different combinations of colours and cuts are:

ggplot(diamonds, aes(color, cut)) +
  geom_count()

However, it is often better to use colour rather than point size to visualise counts, which we can do using a tile plot. First, we have to compute the counts, using aggregate. We now wish to have two grouping variables, color and cut, which we can put on the right-hand side of the formula as follows:

diamonds2 <- aggregate(carat ~ cut + color, data = diamonds, FUN = length)
diamonds2

diamonds2 is now a data frame containing the different combinations of color and cut, along with counts of how many diamonds belong to each combination (labelled carat, because we put carat in our formula). Next, we can plot the counts using geom_tile:

ggplot(diamonds2, aes(color, cut, fill = carat)) +
  geom_tile()

It is also possible to combine point size and colours:

ggplot(diamonds2, aes(color, cut, colour = carat, size = carat)) +
  geom_count()

$\sim$

Exercise 4.21 Using the diamonds dataset, do the following:

1. Use a plot to find out what the most common combination of cut and clarity is.

2. Use a plot to find out which combination of cut and clarity has the highest average price.
(Click here to go to the solution.)

### 4.7.7 Putting it all together

In the next two exercises, you will practise what you have learned so far by investigating the gapminder and planes datasets. First, load the corresponding packages and have a look at the documentation for each dataset:

install.packages("gapminder")
library(gapminder)
?gapminder

library(nycflights13)
?planes

$\sim$

Exercise 4.22 Do the following using the gapminder dataset:

1. Create a scatterplot matrix showing life expectancy, population and GDP per capita for all countries, using the data from the year 2007. Use colours to differentiate countries from different continents.

2. Create an interactive bubble plot, showing information about each country when you hover over the points. Use data from the year 2007. Put log(GDP per capita) on the x-axis and life expectancy on the y-axis. Let population determine point size. Plot each country in a different colour and facet by continent. Tip: the gapminder package provides a pretty colour scheme for different countries, called country_colors. You can use that scheme by adding scale_colour_manual(values = country_colors) to your plot.

(Click here to go to the solution.)

Exercise 4.23 Use graphics to answer the following questions regarding the planes dataset:

1. What is the most common combination of manufacturer and plane type in the dataset?

2. Which combination of manufacturer and plane type has the highest average number of seats?

3. Do the numbers of seats on planes change over time? Which plane had the highest number of seats?

4. Does the type of engine used change over time?

(Click here to go to the solution.)

## 4.8 Principal component analysis

If there are a lot of variables in your data, it can often be difficult to detect differences between groups, or to create a perspicuous visualisation.
A useful tool in this context is principal component analysis (PCA for short), which can reduce high-dimensional data to a lower number of variables that can be visualised in one or two scatterplots. The idea is to compute new variables, called principal components, that are linear combinations of the original variables. These are constructed with two goals in mind: the principal components should capture as much of the variance in the data as possible, and each principal component should be uncorrelated with the other components. You can then plot the principal components to get a low-dimensional representation of your data, which hopefully captures most of its variation.

By design, the number of principal components computed is the same as the number of original variables, with the first component having the largest variance, the second the second-largest variance, and so on. We hope that it will suffice to use just the first few of these to represent most of the variation in the data, but this is not guaranteed. Principal component analysis is more likely to yield a useful result if several variables are correlated.

To illustrate the principles of PCA, we will use a dataset from Charytanowicz et al. (2010), containing measurements on wheat kernels for three varieties of wheat (a description of the variables is available at http://archive.ics.uci.edu/ml/datasets/seeds). We are interested in finding out whether these measurements can be used to distinguish between the varieties.
The data is stored in a .txt file, which we import using read.table (which works just like read.csv, but is tailored to text files). We then convert the Variety column to a categorical factor variable (which you'll learn more about in Section 5.4):

# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table(
  "http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt",
  col.names = c("Area", "Perimeter", "Compactness", "Kernel_length",
                "Kernel_width", "Asymmetry", "Groove_length", "Variety"))

seeds$Variety <- factor(seeds$Variety)

If we make a scatterplot matrix of all the variables, it becomes evident that there are differences between the varieties, but that no single pair of variables is enough to separate them:

library(ggplot2)
library(GGally)
ggpairs(seeds[, -8], aes(colour = seeds$Variety, alpha = 0.2))

Moreover, for presentation purposes, the amount of information in the scatterplot matrix is a bit overwhelming. It would be nice to be able to present the data in a single scatterplot, without losing too much information. We'll therefore compute the principal components using the prcomp function. It is usually recommended that PCA be performed on standardised data. We don't have to standardise the data ourselves; we can let prcomp do it for us using the arguments center = TRUE and scale. = TRUE:

# Compute principal components:
pca <- prcomp(seeds[,-8], center = TRUE, scale. = TRUE)

To see the loadings of the components, i.e. how much each variable contributes to each component, simply type the name of the object that prcomp created:

pca

The first principal component is more or less a weighted average of all variables, but has stronger weights on Area, Perimeter, Kernel_length, Kernel_width, and Groove_length, all of which are measures of size. We can therefore interpret it as a size variable. The second component has higher loadings for Compactness and Asymmetry, meaning that it mainly measures those shape features. In Exercise 4.25 you’ll see how the loadings can be visualised in a biplot.
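The loadings can also be extracted programmatically: prcomp stores them as the matrix pca$rotation, with one column per component. A quick sketch using the pca object created above:

```r
# The loadings matrix, one column per principal component:
pca$rotation

# Loadings for the first component only:
pca$rotation[, 1]

# The variable with the largest (absolute) contribution to PC1:
rownames(pca$rotation)[which.max(abs(pca$rotation[, 1]))]
```

This is handy when there are too many variables to read the printed loadings comfortably.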

To see how much of the variance each component represents, use summary:

summary(pca)

The first principal component accounts for 71.87 % of the variance, and the first three combined account for 98.67 %.

To visualise this, we can draw a scree plot, which shows the variance of each principal component - the total variance of the data is the sum of the variances of the principal components:

screeplot(pca, type = "lines")

We can use this to choose how many principal components to use when visualising or summarising our data. In that case, we look for an "elbow", i.e. a bend after which increasing the number of components doesn't increase the amount of variance explained by much.
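The numbers behind the scree plot can also be computed directly from the standard deviations stored in the prcomp object - a small sketch using the pca object created above:

```r
# Proportion of variance accounted for by each component:
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
prop_var

# Cumulative proportions - the same numbers that summary(pca)
# reports in its "Cumulative Proportion" row:
cumsum(prop_var)
```

These proportions always sum to 1, and are non-increasing by construction.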

We can access the values of the principal components using pca$x. Let's check that the first two components really are uncorrelated:

cor(pca$x[,1], pca$x[,2])

In this case, almost all of the variance is summarised by the first two or three principal components. It appears that we have successfully reduced the data from 7 variables to 2-3, which should make visualisation much easier. The ggfortify package contains an autoplot function for PCA objects, which creates a scatterplot of the first two principal components:

library(ggfortify)
autoplot(pca, data = seeds, colour = "Variety")

That is much better! The groups are almost completely separated, which shows that the variables can be used to discriminate between the three varieties. The first principal component accounts for 71.87 % of the total variance in the data, and the second for 17.11 %.

If you like, you can plot other pairs of principal components than components 1 and 2. In this case, component 3 may be of interest, as its variance is almost as high as that of component 2. You can specify which components to plot with the x and y arguments:

# Plot the 2nd and 3rd principal components:
autoplot(pca, data = seeds, colour = "Variety", x = 2, y = 3)

Here, the separation is nowhere near as clear as in the previous figure. In this particular example, plotting the first two principal components is the better choice.

Judging from these plots, it appears that the kernel measurements can be used to discriminate between the three varieties of wheat. In Chapters 7 and 9 you'll learn how to use R to build models that do just that, e.g. by predicting which variety of wheat a kernel comes from given its measurements. If we wanted to build a statistical model for this purpose, we could use the original measurements. But we could also try using the first two principal components as the only input to the model.
Principal component analysis is very useful as a pre-processing tool for creating simpler models based on fewer variables (or ostensibly simpler models, because the new variables are typically more difficult to interpret than the original ones).

$\sim$

Exercise 4.24 Use principal components on the carat, x, y, z, depth, and table variables in the diamonds data, and answer the following questions:

1. How much of the total variance does the first principal component account for? How many components are needed to account for at least 90 % of the total variance?

2. Judging by the loadings, what do the first two principal components measure?

3. What is the correlation between the first principal component and price?

4. Can the first two principal components be used to distinguish between diamonds with different cuts?

(Click here to go to the solution.)

Exercise 4.25 Return to the scatterplot of the first two principal components for the seeds data, created above. Adding the arguments loadings = TRUE and loadings.label = TRUE to the autoplot call creates a biplot, which shows the loadings for the principal components on top of the scatterplot. Create a biplot and compare the result to those obtained by looking at the loadings numerically. Do the conclusions from the two approaches agree?

(Click here to go to the solution.)

## 4.9 Cluster analysis

Cluster analysis is concerned with grouping observations into groups, clusters, that in some sense are similar. Numerous methods, approaching the problem from different angles, are available for this task. Many of them are found in the cluster package, which ships with R. In this section, we'll have a look at a smorgasbord of clustering techniques.

### 4.9.1 Hierarchical clustering

As a first example where clustering can be of interest, we'll consider the votes.repub data from cluster.
It describes the proportion of votes for the Republican candidate in US presidential elections from 1856 to 1976 in 50 different states:

library(cluster)
?votes.repub
View(votes.repub)

We are interested in finding subgroups - clusters - of states with similar voting patterns. To find clusters of similar observations (states, in this case), we could start by assigning each observation to its own cluster. We'd then start with 50 clusters, one for each observation. Next, we could merge the two clusters that are the most similar, yielding 49 clusters, one of which consists of two observations and 48 of which consist of a single observation. We could repeat this process, merging the two most similar clusters in each iteration, until only a single cluster is left. This would give us a hierarchy of clusters, which could be plotted in a tree-like structure, where observations from the same cluster appear on the same branch. Like this:

clusters_agnes <- agnes(votes.repub)
plot(clusters_agnes, which = 2)

This type of plot is known as a dendrogram. We've just used agnes, a function from cluster that can be used to carry out hierarchical clustering in the manner described above. There are a couple of things that need to be clarified, though. First, how do we measure how similar two $$p$$-dimensional observations $$x$$ and $$y$$ are? agnes provides two measures of distance between points:

• metric = "euclidean" (the default), which uses the Euclidean $$L_2$$ distance $$||x-y||=\sqrt{\sum_{i=1}^p(x_i-y_i)^2}$$,

• metric = "manhattan", which uses the Manhattan $$L_1$$ distance $$||x-y||=\sum_{i=1}^p|x_i-y_i|$$.

Note that neither of these will work well if you have categorical variables in your data. If all your variables are binary, i.e. categorical with two levels, you can use mona instead of agnes. Second, how do we measure how similar two clusters of observations are? agnes offers a number of options here.
Among them are:

• method = "average" (the default), unweighted average linkage, which uses the average distance between points from the two clusters,

• method = "single", single linkage, which uses the smallest distance between points from the two clusters,

• method = "complete", complete linkage, which uses the largest distance between points from the two clusters,

• method = "ward", Ward's method, which uses the within-cluster variance to compare different possible clusterings, with the clustering that has the lowest within-cluster variance being chosen.

Regardless of which of these you use, it is often a good idea to standardise the numeric variables in your dataset so that they all have the same variance. If you don't, your distance measure is likely to be dominated by variables with large variances, while variables with low variances will have little or no impact on the clustering. To standardise your data, you can use scale:

# Perform clustering on standardised data:
clusters_agnes <- agnes(scale(votes.repub))

# Plot dendrogram:
plot(clusters_agnes, which = 2)

At this point, we're starting to use several functions one after another, so this looks like a perfect job for a pipeline. To carry out the same analysis using %>% pipes, we write:

library(magrittr)
votes.repub %>% scale() %>% agnes() %>% plot(which = 2)

We can now try changing the metric and clustering method used, as described above. Let's use the Manhattan distance and complete linkage:

votes.repub %>% scale() %>%
  agnes(metric = "manhattan", method = "complete") %>%
  plot(which = 2)

We can change the look of the dendrogram by adding hang = -1, which causes all observations to be placed at the same level:

votes.repub %>% scale() %>%
  agnes(metric = "manhattan", method = "complete") %>%
  plot(which = 2, hang = -1)

As an alternative to agnes, we can consider diana. agnes is an agglomerative method, which starts with many small clusters and then merges them step by step.
diana, in contrast, is a divisive method, which starts with one large cluster and then step by step splits it into smaller clusters:

votes.repub %>% scale() %>% diana() %>% plot(which = 2)

You can change the distance measure used by setting metric in the diana call. Euclidean distance is the default.

To wrap this section up, we'll look at two packages that are useful for plotting the results of hierarchical clustering: dendextend and factoextra:

install.packages(c("dendextend", "factoextra"))

To compare the dendrograms produced by different methods (or by the same method with different settings), we can draw a tanglegram, where the dendrograms are plotted against each other, using tanglegram from dendextend:

library(dendextend)

# Create clusters using agnes:
votes.repub %>% scale() %>% agnes() -> clusters_agnes

# Create clusters using diana:
votes.repub %>% scale() %>% diana() -> clusters_diana

# Compare the results:
tanglegram(as.dendrogram(clusters_agnes),
           as.dendrogram(clusters_diana))

Some clusters are quite similar here, whereas others are very different. A lot of the time, we are interested in finding a comparatively small number of clusters, $$k$$. In hierarchical clustering, we can reduce the number of clusters by "cutting" the dendrogram tree. To do so using the factoextra package, we first use hcut to cut the tree into $$k$$ parts, and then fviz_dend to plot the dendrogram, with each cluster plotted in a different colour. If, for instance, we want $$k=5$$ clusters and want to use agnes with average linkage and Euclidean distance for the clustering, we'd do the following:

library(factoextra)
votes.repub %>% scale() %>%
  hcut(k = 5, hc_func = "agnes",
       hc_method = "average",
       hc_metric = "euclidean") %>%
  fviz_dend()

There is no inherent meaning to the colours - they are simply a way to visually distinguish between the clusters. Hierarchical clustering is especially suitable for data with named observations.
For other types of data, other methods may be better. We will consider some alternatives next.

$\sim$

Exercise 4.26 Continue the last example above by changing the clustering method to complete linkage with the Manhattan distance.

1. Do any of the 5 coloured clusters remain the same?

2. How can you produce a tanglegram with 5 coloured clusters, to better compare the results from the two clusterings?

(Click here to go to the solution.)

Exercise 4.27 The USArrests data contains statistics on violent crime rates in 50 US states. Perform a hierarchical cluster analysis of the data. Which states is Maryland clustered with?

(Click here to go to the solution.)

### 4.9.2 Centroid-based clustering

Let's return to the seeds data that we explored in Section 4.8:

# Download the data:
seeds <- read.table(
  "http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt",
  col.names = c("Area", "Perimeter", "Compactness", "Kernel_length",
                "Kernel_width", "Asymmetry", "Groove_length", "Variety"))

seeds$Variety <- factor(seeds$Variety)

We know that there are three varieties of seeds in this dataset, but what if we didn't? Or what if we'd lost the labels, and didn't know which seeds are of which type? There are no row names for this data, so plotting a dendrogram may not be that useful. Instead, we can use $$k$$-means clustering, where the points are clustered into $$k$$ clusters based on their distances to the cluster means, or centroids. When performing $$k$$-means clustering (using the algorithm of Hartigan & Wong (1979), which is the default in the function that we'll use), the data is first split into $$k$$ initial clusters. Points are then moved between clusters, one at a time, based on how close they are (as measured by Euclidean distance) to the mean of each cluster.
The algorithm finishes when no point can be moved between clusters without increasing the average distance between points and the means of their clusters.

To run a $$k$$-means clustering in R, we can use kmeans. Let’s start by using $$k=3$$ clusters:

# First, we standardise the data, and then we do a
# k-means clustering.
# We ignore variable 8, Variety, which is the group label.
library(magrittr)
seeds[, -8] %>% scale() %>%
kmeans(centers = 3) -> seeds_cluster
seeds_cluster

To visualise the results, we’ll plot the first two principal components, using colour to show the clusters. Moreover, we’ll plot the different varieties as different shapes, to see if the clusters found correspond to the different varieties:

# Compute principal components:
pca <- prcomp(seeds[,-8])

library(ggfortify)
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
shape = "Variety", size = 2, alpha = 0.75)

In this case, the clusters more or less overlap with the varieties! Of course, in a lot of cases we don’t know the number of clusters beforehand. What happens if we change $$k$$?

First, we try $$k=2$$:

seeds[, -8] %>% scale() %>%
kmeans(centers = 2) -> seeds_cluster
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
shape = "Variety", size = 2, alpha = 0.75)

Next, $$k=4$$:

seeds[, -8] %>% scale() %>%
kmeans(centers = 4) -> seeds_cluster
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
shape = "Variety", size = 2, alpha = 0.75)

And finally, a larger number of clusters, say $$k=12$$:

seeds[, -8] %>% scale() %>%
kmeans(centers = 12) -> seeds_cluster
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
shape = "Variety", size = 2, alpha = 0.75)

If it weren’t for the fact that the different varieties are shown as different shapes, we’d have no way of telling, based on this plot alone, which choice of $$k$$ is preferable here.

Before we go into methods for choosing $$k$$ though, we’ll mention pam, an alternative to $$k$$-means that works in the same way, but uses median-like points, called medoids, instead of cluster means. This makes it more robust to outliers. pam is found in the cluster package. Let’s try it with $$k=3$$ clusters:

library(cluster)
seeds[, -8] %>% scale() %>%
pam(k = 3) -> seeds_cluster
autoplot(pca, data = seeds, colour = seeds_cluster$clustering,
shape = "Variety", size = 2, alpha = 0.75)
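A key difference between pam and kmeans is that pam’s cluster centres, the medoids, are actual observations from the dataset, whereas a k-means centroid usually is not. A small sketch on simulated data (not the seeds data) illustrating this:

```r
library(cluster)

set.seed(1)
# Two separated groups of 10 points each:
x <- rbind(matrix(rnorm(20), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))

fit <- pam(x, k = 2)
fit$id.med    # row numbers of the observations chosen as medoids
fit$medoids   # the medoids are rows of x, not averages
```

Because the medoids are actual data points, they can be reported as representative examples of each cluster.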

For both kmeans and pam, there are visual tools that can help us choose the value of $$k$$ in the factoextra package. Let’s install it:

install.packages("factoextra")

The fviz_nbclust function in factoextra can be used to obtain plots that can guide the choice of $$k$$. It takes three arguments as input: the data, the clustering function (e.g. kmeans or pam) and the method used for evaluating different choices of $$k$$. There are three options for the latter: "wss", "silhouette" and "gap_stat".

method = "wss" yields a plot that relies on the within-cluster sum of squares, WSS, which is a measure of the within-cluster variation: the smaller it is, the more compact the clusters are. The WSS is plotted for several choices of $$k$$, and we look for an “elbow”, just as we did when using a scree plot for PCA. That is, we look for the value of $$k$$ such that increasing $$k$$ further doesn’t improve the WSS much. Let’s have a look at an example, using pam for clustering:

library(factoextra)
fviz_nbclust(scale(seeds[, -8]), pam, method = "wss")

# Or, using a pipeline instead:
library(magrittr)
seeds[, -8] %>% scale() %>%
fviz_nbclust(pam, method = "wss")

$$k=3$$ seems like a good choice here.
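The numbers behind the WSS plot are easy to compute ourselves: kmeans stores the total within-cluster sum of squares in tot.withinss, so we can loop over candidate values of $$k$$ with base R alone. A sketch using simulated data with three groups:

```r
set.seed(1)
# Three simulated clusters in two dimensions:
x <- rbind(matrix(rnorm(60), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2),
           matrix(rnorm(60, mean = 8), ncol = 2))

# WSS for k = 1, ..., 8 (nstart = 10 reruns the algorithm from
# several random initialisations and keeps the best result):
wss <- sapply(1:8, function(k)
  kmeans(x, centers = k, nstart = 10)$tot.withinss)

plot(1:8, wss, type = "b", xlab = "k", ylab = "WSS")
```

With three clearly separated groups, the curve should show an elbow around $$k=3$$.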

method = "silhouette" produces a silhouette plot. The silhouette value measures how similar a point is to the other points in its own cluster, compared to points in other clusters. The closer to 1 this value is, the better. In a silhouette plot, the average silhouette value for the points in the data is plotted against $$k$$:

fviz_nbclust(scale(seeds[, -8]), pam, method = "silhouette")

Judging by this plot, $$k=2$$ appears to be the best choice.
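The average silhouette width shown in this plot can also be computed directly with silhouette from the cluster package, which takes a vector of cluster labels and a distance matrix. A sketch on simulated data:

```r
library(cluster)

set.seed(1)
# Two separated groups of 20 points each:
x <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

cl <- kmeans(x, centers = 2, nstart = 10)$cluster
sil <- silhouette(cl, dist(x))
summary(sil)$avg.width  # average silhouette width; closer to 1 is better
```

plot(sil) draws the classic silhouette plot for a single clustering, showing the silhouette value of each individual point.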

Finally, method = "gap_stat" yields a plot of the gap statistic (Tibshirani et al., 2001), which is based on comparing the WSS to its expected value under a null distribution obtained using the bootstrap (Section 7.6). Higher values of the gap statistic are preferable:

fviz_nbclust(scale(seeds[, -8]), pam, method = "gap_stat")

In this case, $$k=3$$ gives the best value.
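Under the hood, fviz_nbclust computes the gap statistic with clusGap from the cluster package, which can also be called directly. The argument B is the number of bootstrap samples; a small value keeps this sketch fast. Simulated data with two groups:

```r
library(cluster)

set.seed(1)
# Two separated groups of 20 points each:
x <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

# Gap statistic for k = 1, ..., 5, using k-means for the clustering:
gap <- clusGap(x, FUNcluster = kmeans, nstart = 10,
               K.max = 5, B = 20)
gap$Tab[, "gap"]  # higher values are preferable
```

Printing the clusGap object also reports the value of $$k$$ suggested by the selection rule of Tibshirani et al.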

In addition to plots for choosing $$k$$, factoextra provides the function fviz_cluster for creating PCA-based plots, with an option to add convex hulls or ellipses around the clusters:

# First, find the clusters:
seeds[, -8] %>% scale() %>%
kmeans(centers = 3) -> seeds_cluster

# Plot clusters and their convex hulls:
library(factoextra)
fviz_cluster(seeds_cluster, data = seeds[, -8])

# Without row numbers:
fviz_cluster(seeds_cluster, data = seeds[, -8], geom = "point")

# With ellipses based on the multivariate normal distribution:
fviz_cluster(seeds_cluster, data = seeds[, -8],
geom = "point", ellipse.type = "norm")

Note that in this plot, the shapes correspond to the clusters and not the varieties of seeds.

$\sim$

Exercise 4.28 The chorSub data from cluster contains measurements of 10 chemicals in 61 geological samples from the Kola Peninsula. Cluster this data using either kmeans or pam (does either seem to be a better choice here?). What is a good choice of $$k$$ here? Visualise the results.

(Click here to go to the solution.)

### 4.9.3 Fuzzy clustering

An alternative to $$k$$-means clustering is fuzzy clustering, in which each point is “spread out” over the $$k$$ clusters instead of being placed in a single cluster. The more similar a point is to the other observations in a cluster, the higher its membership in that cluster. Points can have a high degree of membership in several clusters, which is useful in applications where points should be allowed to belong to more than one cluster. An important example is genetics, where genes can encode proteins with more than one function. If each point corresponds to a gene, it then makes sense to allow the points to belong to several clusters, which potentially can be associated with different functions. The opposite of fuzzy clustering is hard clustering, in which each point belongs to exactly one cluster.
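To make the distinction concrete, compare the membership matrices of the two approaches: in hard clustering each row of the matrix is an indicator vector, while in fuzzy clustering each row spreads the point’s membership across the clusters. A toy illustration with made-up numbers for two points and three clusters:

```r
# Hard clustering: each point belongs to exactly one cluster
hard <- rbind(c(1, 0, 0),
              c(0, 1, 0))

# Fuzzy clustering: membership degrees spread over the clusters
fuzzy <- rbind(c(0.80, 0.15, 0.05),
               c(0.10, 0.45, 0.45))

# In both cases, the memberships for each point sum to 1:
rowSums(hard)
rowSums(fuzzy)
```

The second point in the fuzzy matrix is equally associated with clusters 2 and 3 - exactly the kind of ambiguity that hard clustering cannot express.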

fanny from cluster can be used to perform fuzzy clustering:

library(cluster)
library(magrittr)
seeds[, -8] %>% scale() %>%
fanny(k = 3) -> seeds_cluster

# Check membership of each cluster for the different points:
seeds_cluster$membership

# Plot the closest hard clustering:
library(factoextra)
fviz_cluster(seeds_cluster, geom = "point")

As for kmeans and pam, we can use fviz_nbclust to determine how many clusters to use:

seeds[, -8] %>% scale() %>%
fviz_nbclust(fanny, method = "wss")

seeds[, -8] %>% scale() %>%
fviz_nbclust(fanny, method = "silhouette")

# Producing the gap statistic plot takes a while here, so
# you may want to skip it in this case:
seeds[, -8] %>% scale() %>%
fviz_nbclust(fanny, method = "gap_stat")

$\sim$

Exercise 4.29 Do a fuzzy clustering of the USArrests data. Is Maryland strongly associated with a single cluster, or with several clusters? What about New Jersey?

(Click here to go to the solution.)

### 4.9.4 Model-based clustering

As a last option, we’ll consider model-based clustering, in which each cluster is assumed to come from a multivariate normal distribution. This yields ellipsoidal clusters. Mclust from the mclust package fits such a model, called a Gaussian finite mixture model, using the EM algorithm (Scrucca et al., 2016). First, let’s install the package:

install.packages("mclust")

Now, let’s cluster the seeds data. The number of clusters is chosen as part of the clustering procedure:

library(mclust)
seeds_cluster <- Mclust(scale(seeds[, -8]))
summary(seeds_cluster)

# Plot results with ellipsoids:
library(factoextra)
fviz_cluster(seeds_cluster, geom = "point",
ellipse.type = "norm")

$\sim$

Exercise 4.30 Return to the chorSub data from Exercise 4.28. Cluster it using a Gaussian finite mixture model. How many clusters do you find?

(Click here to go to the solution.)

### 4.9.5 Comparing clusters

Having found some interesting clusters, we are often interested in exploring the differences between them. To do so, we must first extract the cluster labels from our clustering. These are contained in the variable clustering for the cluster methods with female names (agnes, diana, pam, and fanny), in cluster for kmeans, and in classification for Mclust.
We can then add those labels to our data frame and use them when plotting. For instance, using the seeds data, we can compare the areas of seeds from different clusters:

# Cluster the seeds using k-means with k=3:
library(magrittr)
seeds[, -8] %>% scale() %>%
kmeans(centers = 3) -> seeds_cluster

# Add the results to the data frame:
seeds$clusters <- factor(seeds_cluster$cluster)
# Instead of $cluster, we’d use $clustering for agnes, pam,
# and fanny objects, and $classification for an Mclust object.

# Compare the areas of the 3 clusters using boxplots:
library(ggplot2)
ggplot(seeds, aes(x = Area, group = clusters, fill = clusters)) +
geom_boxplot()

# Or using density estimates:
ggplot(seeds, aes(x = Area, group = clusters, fill = clusters)) +
geom_density(alpha = 0.7)

We can also create a scatterplot matrix to look at all variables simultaneously:

library(GGally)
ggpairs(seeds[, -8], aes(colour = seeds$clusters, alpha = 0.2))

## 4.10 Exploratory factor analysis

The purpose of factor analysis is to describe the correlation structure for a set of observable variables by a smaller number of unobservable underlying variables, called factors or latent variables. These are thought to explain the values of the observed variables in a causal manner. Factor analysis is a popular tool in psychometrics, where it is used, for instance, to identify latent variables that explain people’s results on different tests, e.g. related to personality, intelligence, or attitude.

We’ll use the psych package, along with the associated package GPArotation, for our analyses. Let’s install them:

install.packages(c("psych", "GPArotation"))

For our first example of factor analysis, we’ll be using the attitude data that comes with R. It describes the outcome of a survey of employees at a financial organisation. Have a look at its documentation to read about the variables in the dataset:

?attitude
attitude

To fit a factor analysis model to these data, we can use fa from psych. fa requires us to specify the number of factors used in the model. We’ll get back to how to choose the number of factors, but for now, let’s go with 2:

library(psych)
# Fit factor model:
attitude_fa <- fa(attitude, nfactors = 2,
rotate = "oblimin", fm = "ml")

fa does two things for us. First, it fits a factor model to the data, which yields a table of factor loadings, i.e. the correlations between the two unobserved factors and the observed variables. However, there are an infinite number of mathematically valid factor models for any given dataset. The factors are therefore rotated according to some rule, to obtain a factor model that hopefully allows for an easy and useful interpretation. There are several methods that can be used to fit the factor model (set using the fm argument in fa) and to rotate the solution (set using rotate). We’ll look at some of the options shortly.

First, we’ll print the result, showing the factor loadings (after rotation). We’ll also plot the resulting model using fa.diagram, showing the correlation between the factors and the observed variables:

# Print results:
attitude_fa

# Plot results:
fa.diagram(attitude_fa, simple = FALSE)

The first factor is correlated with the variables advance, learning and raises. We can perhaps interpret this factor as measuring the employees’ career opportunities at the organisation. The second factor is strongly correlated with complaints and (overall) rating, but also, to a lesser degree, with raises, learning and privileges. This can maybe be interpreted as measuring how the employees feel that they are treated at the organisation.

We can also see that the two factors are correlated. In some cases, it makes sense to expect the factors to be uncorrelated. In that case, we can change the rotation method used, from oblimin (which yields oblique rotations, allowing for correlations - usually a good default) to varimax, which yields uncorrelated factors:

attitude_fa <- fa(attitude, nfactors = 2,
rotate = "varimax", fm = "ml")
fa.diagram(attitude_fa, simple = FALSE)

In this case, the results are fairly similar.
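As a cross-check, maximum-likelihood factor analysis with varimax rotation is also available in base R through factanal from stats, which requires no additional packages:

```r
# Maximum-likelihood factor analysis with base R;
# varimax rotation is the default:
attitude_fanal <- factanal(attitude, factors = 2)

# Factor loadings (small loadings are not printed):
attitude_fanal$loadings
```

The loadings should be broadly similar to those from fa with fm = "ml" and rotate = "varimax", though the signs and order of the factors may differ.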

The fm = "ml" setting means that maximum likelihood estimation of the factor model is performed, under the assumption of a normal distribution for the data. Maximum likelihood estimation is widely recommended for estimation of factor models, and can often work well even for non-normal data (Costello & Osborne, 2005). However, there are cases where it fails to find useful factors. fa offers several different estimation methods. A good alternative is minres, which often works well when maximum likelihood fails:

attitude_fa <- fa(attitude, nfactors = 2,
rotate = "oblimin", fm = "minres")
fa.diagram(attitude_fa, simple = FALSE)

Once again, the results are similar to what we saw before. In other examples, the results differ more. When choosing which estimation method and rotation to use, bear in mind that in an exploratory study, there is no harm in playing around with a few different methods. After all, your purpose is to generate hypotheses rather than to confirm them, and looking at the data in a few different ways will help you do that.

To determine the number of factors that are appropriate for a particular dataset, we can draw a scree plot with scree. This is interpreted in the same way as for principal components analysis (Section 4.8) and centroid-based clustering (Section 4.9.2) - we look for an “elbow” in the plot, which tells us at which point adding more factors no longer contributes much to the model:

scree(attitude, pc = FALSE)

A useful alternative version of this is provided by fa.parallel, which adds lines showing what the scree plot would look like for randomly generated uncorrelated data of the same size as the original dataset:

fa.parallel(attitude, fm = "ml", fa = "fa")

Some older texts recommend that only factors with an eigenvalue (the y-axis in the scree plot) greater than 1 be kept in the model. It is widely agreed that this so-called Kaiser rule is inappropriate (Costello & Osborne, 2005), as it runs the risk of leaving out important factors.
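For reference, the eigenvalues that the Kaiser rule refers to are those of the correlation matrix, which can be computed directly in base R:

```r
# Eigenvalues of the correlation matrix of the attitude data:
ev <- eigen(cor(attitude))$values
ev

# The Kaiser rule would keep only factors with eigenvalue > 1:
sum(ev > 1)
```

The eigenvalues always sum to the number of variables (here 7), since that is the trace of the correlation matrix.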

Similarly, some older texts also recommend using principal components analysis to fit factor models. While the two are mathematically similar in that both in some sense reduce the dimensionality of the data, PCA and factor analysis are designed to target different problems. Factor analysis is concerned with an underlying causal structure where the unobserved factors affect the observed variables. In contrast, PCA simply seeks to create a small number of variables that summarise the variation in the data, which can work well even if there are no unobserved factors affecting the variables.

$\sim$

Exercise 4.31 Factor analysis only relies on the covariance or correlation matrix of your data. When using fa and other functions for factor analysis, you can input either a data frame or a covariance/correlation matrix. Read about the ability.cov data that comes shipped with R, and perform a factor analysis of it.

(Click here to go to the solution.)

1. See ?theme_grey for a list of available themes.↩︎

2. Note that it is not just the price nor just the carat of these diamonds that makes them outliers, but the unusual combination of price and carat!↩︎

3. LOESS, LOcally Estimated Scatterplot Smoothing, is a non-parametric regression method that fits a polynomial to local areas of the data. (Click here to visit the Wikipedia page on LOESS)↩︎

4. GAM, Generalised Additive Model, is a generalised linear model where the response variable is a linear function of smooth functions of the predictors. (Click here to visit the Wikipedia page on GAM)↩︎

5. Time series objects are a special class of vectors, with data sampled at equispaced points in time. Each observation can have a year/date/time associated with it.↩︎

6. A linear combination is a weighted sum of the form $$a_1x_1+a_2x_2+\cdots+a_px_p$$. If you like, you can think of principal components as weighted averages of variables, computed for each row in your data.↩︎

7. Just to be clear, 5 is just an arbitrary number here. We could of course want 4, 14, or any other number of clusters.↩︎