# 4 Exploratory data analysis and unsupervised learning

Exploratory data analysis (EDA) is a process in which we summarise and visually explore a dataset. An important part of EDA is unsupervised learning, which is a collection of methods for finding interesting subgroups and patterns in our data. Unlike statistical hypothesis testing, which is used to reject hypotheses, EDA can be used to *generate* hypotheses (which can then be confirmed or rejected by new studies). Another purpose of EDA is to find outliers and incorrect observations, which can lead to a cleaner and more useful dataset. In EDA we ask questions about our data, and then try to answer them using summary statistics and graphics. Some questions will prove to be important, and some will not. The key to finding important questions is to ask a lot of questions. This chapter will provide you with a wide range of tools for question-asking.

After working with the material in this chapter, you will be able to use R to:

- Create reports using R Markdown,
- Customise the look of your plots,
- Visualise the distribution of a variable,
- Create interactive plots,
- Detect and label outliers,
- Investigate patterns in missing data,
- Visualise trends,
- Plot time series data,
- Visualise multiple variables at once using scatterplot matrices, correlograms and bubble plots,
- Visualise multivariate data using principal component analysis,
- Use unsupervised learning techniques for clustering,
- Use factor analysis to find latent variables in your data.

## 4.1 Reports with R Markdown

R Markdown files can be used to create nicely formatted documents using R, that are easy to export to other formats, like HTML, Word or pdf. They allow you to mix R code with results and text. They can be used to create reproducible reports, that are easy to update with new data, because they include the code for creating tables and figures. Additionally, they can be used as notebooks for keeping track of your work and your thoughts as you carry out an analysis. You can even use them for writing books - in fact, this entire book was written using R Markdown.

It is often a good idea to use R Markdown for exploratory analyses, as it allows you to write down your thoughts and comments as the analysis progresses, as well as to save the results of the exploratory journey. For that reason, we’ll start this chapter by having a look at some examples of what you can do using R Markdown. You can use either R Markdown or ordinary R scripts for the analyses in the remainder of the chapter, according to your preference. The R code used is the same and the results are identical, but if you use R Markdown, you can also save the output of the analysis in a nicely formatted document.

### 4.1.1 A first example

When you create a new R Markdown document in RStudio by clicking *File > New File > R Markdown* in the menu, a document similar to that below is created :

```
---
title: "Untitled"
author: "Måns Thulin"
date: "10/20/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax
for authoring HTML, PDF, and MS Word documents. For more details on
using R Markdown see
<http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that
includes both content as well as the output of any embedded R code
chunks within the document. You can embed an R code chunk like this:
```{r cars}
summary(cars)
```
## Including Plots
You can also embed plots, for example:
```{r pressure, echo=FALSE}
plot(pressure)
```
Note that the `echo = FALSE` parameter was added to the code chunk to
prevent printing of the R code that generated the plot.
```

Press the *Knit* button at the top of the Script panel to create an HTML document using this Markdown file. It will be saved in the same folder as your Markdown file. Once the HTML document has been created, it will open so that you can see the results. You may have to install additional packages for this to work, in which case RStudio will prompt you.

Now, let’s have a look at what the different parts of the Markdown document do. The first part is called the *document header* or *YAML header*. It contains information about the document, including its title, the name of its author, and the date on which it was first created:

```
---
title: "Untitled"
author: "Måns Thulin"
date: "10/20/2020"
output: html_document
---
```

The part that says `output: html_document`

specifies what type of document that should be created when you press *Knit*. In this case, it’s set to `html_document`

, meaning that an HTML document will be created. By changing this to `output: word_document`

you can create a `.docx`

Word document instead. By changing it to `output: pdf_document`

, you can create a `.pdf`

document using LaTeX (you’ll have to install LaTeX if you haven’t already - RStudio will notify you if that is the case).

The second part sets the default behaviour of *code chunks* included in the document, specifying that the output from running the chunks should be printed unless otherwise specified:

```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```

The third part contains the first proper section of the document. First, a header is created using `##`

. Then there is some text with formatting: `< >`

is used to create a link and `**`

is used to get **bold text**. Finally, there is a code chunk, delimited by `````

:

```
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax
for authoring HTML, PDF, and MS Word documents. For more details on
using R Markdown see
<http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that
includes both content as well as the output of any embedded R code
chunks within the document. You can embed an R code chunk like this:
```{r cars}
summary(cars)
```
```

The fourth and final part contains another section, this time with a figure created using R. A setting is added to the code chunk used to create the figure, which prevents the underlying code from being printed in the document:

```
## Including Plots
You can also embed plots, for example:
```{r pressure, echo=FALSE}
plot(pressure)
```
Note that the `echo = FALSE` parameter was added to the code chunk to
prevent printing of the R code that generated the plot.
```

In the next few sections, we will look at how formatting and code chunks work in R Markdown.

### 4.1.2 Formatting text

To create plain text in a Markdown file, you simply have to write plain text. If you wish to add some formatting to you text, you can use the following:

`_italics_`

or`*italics*`

: to create text in*italics*.`__bold__`

or`**bold**`

: to create**bold**text.`[linked text](http://www.modernstatisticswithr.com)`

: to create linked text.``code``

: to include inline`code`

in your document.`$a^2 + b^2 = c^2$`

: to create inline equations like \(a^2 + b^2 = c^2\) using LaTeX syntax.`$$a^2 + b^2 = c^2$$`

: to create a centred equation on a new line, like \[a^2 + b^2 = c^2.\]

To add headers and subheaders, and to divide your document into section, start a new line with `#`

’s as follows:

```
# Header text
## Sub-header text
### Sub-sub-header text
...and so on.
```

### 4.1.3 Lists, tables, and images

To create a bullet lists, you can use `*`

as follows. Note that you need a blank line between your list and the previous paragraph to begin a list.

```
* Item 1
* Item 2
+ Sub-item 1
+ Sub-item 2
* Item 3
```

yielding:

- Item 1
- Item 2
- Sub-item 1
- Sub-item 2

- Item 3

To create an ordered list, use:

```
1. First item
2. Second item
i) Sub-item 1
ii) Sub-item 2
3. Item 3
```

yielding:

- First item
- Second item

- Sub-item 1
- Sub-item 2

- Item 3

To create a table, use `|`

and `---------`

as follows:

```
Column 1 | Column 2
--------- | ---------
Content | More content
Even more | And some here
Even more? | Yes!
```

which yields the following output:

Column 1 | Column 2 |
---|---|

Content | More content |

Even more | And some here |

Even more? | Yes! |

To include an image, use the same syntax as when creating linked text with a link to the image path (either local or on the web), but with a `!`

in front:

```
![](https://www.r-project.org/Rlogo.png)
![Put some text here if you want a caption](https://www.r-project.org/Rlogo.png)
```

which yields the following:

### 4.1.4 Code chunks

The simplest way to define a code chunk is to simply write:

```
```{r}
plot(pressure)
```
```

In RStudio, Ctrl+Alt+I is a keyboard shortcut for inserting this kind of code chunk.

We can add a name and a caption to the chunk, which lets us reference objects created by the chunk:

```
```{r pressureplot, fig.cap = "Plot of the pressure data."}
plot(pressure)
```
As we can see in Figure \@ref(fig:cars-plot), the relationship between
temperature and pressure resembles a banana.
```

This yields the following output:

`plot(pressure)`

As we can see in Figure 4.1, the relationship between temperature and pressure resembles a banana.

In addition, you can add settings to the chunk header to control its behaviour. For instance, you can include a code chunk without running it by adding `echo = FALSE`

:

```
```{r, eval = FALSE}
plot(pressure)
```
```

You can add the following settings to your chunks:

`echo = FALSE`

to run the code without printing it,`eval = FALSE`

to print the code without running it,`results = "hide"`

to hide printed output,`fig.show = "hide"`

to hide plots,`warning = FALSE`

to suppress warning messages from being printed in your document,`message = FALSE`

to suppress other messages from being printed in your document,`include = FALSE`

to run a chunk without showing the code or results in the document,`error = TRUE`

to continue running your R Markdown document even if there is an error in the chunk (by default, the documentation creation stops if there is an error).

Data frames can be printed either as in the Console or as a nicely formatted table. For example,

```
```{r, echo = FALSE}
head(airquality)
```
```

yields:

```
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
```

whereas

```
```{r, echo = FALSE}
knitr::kable(
head(airquality),
caption = "Some data I found."
)
```
```

yields the table below.

Ozone | Solar.R | Wind | Temp | Month | Day |
---|---|---|---|---|---|

41 | 190 | 7.4 | 67 | 5 | 1 |

36 | 118 | 8.0 | 72 | 5 | 2 |

12 | 149 | 12.6 | 74 | 5 | 3 |

18 | 313 | 11.5 | 62 | 5 | 4 |

NA | NA | 14.3 | 56 | 5 | 5 |

28 | NA | 14.9 | 66 | 5 | 6 |

Further help and documentation for R Markdown can be found through the RStudio menus, by clicking *Help > Cheatsheets > R Markdown Cheat Sheet* or *Help > Cheatsheets > R Markdown Reference Guide*.

## 4.2 Customising `ggplot2`

plots

We’ll be using `ggplot2`

a lot in this chapter, so before we get started with exploratory analyses, we’ll take some time to learn how we can customise the look of `ggplot2`

-plots.

Consider the following facetted plot from Section 2.7.4:

```
library(ggplot2)
ggplot(msleep, aes(brainwt, sleep_total)) +
geom_point() +
xlab("Brain weight (logarithmic scale)") +
ylab("Total sleep time") +
scale_x_log10() +
facet_wrap(~ vore)
```

It looks nice, sure, but there may be things that you’d like to change. Maybe you’d like the background of the plot to be white instead of grey, or maybe you’d like to use a different font. These, and many other things, can be modified using *themes*.

### 4.2.1 Using themes

`ggplot2`

comes with a number of basic themes. All are fairly similar, but differ in things like background colour, grids and borders. You can add them to your plot using `theme_themeName`

, where `themeName`

is the name of the theme^{21}. Here are some examples:

```
<- ggplot(msleep, aes(brainwt, sleep_total, colour = vore)) +
p geom_point() +
xlab("Brain weight (logarithmic scale)") +
ylab("Total sleep time") +
scale_x_log10() +
facet_wrap(~ vore)
# Create plot with different themes:
+ theme_grey() # The default theme
p + theme_bw()
p + theme_linedraw()
p + theme_light()
p + theme_dark()
p + theme_minimal()
p + theme_classic() p
```

There are several packages available that contain additional themes. Let’s download a few:

```
install.packages("ggthemes")
library(ggthemes)
# Create plot with different themes from ggthemes:
+ theme_tufte() # Minimalist Tufte theme
p + theme_wsj() # Wall Street Journal
p + theme_solarized() + scale_colour_solarized() # Solarized colours
p
##############################
install.packages("hrbrthemes")
library(hrbrthemes)
# Create plot with different themes from hrbrthemes:
+ theme_ipsum() # Ipsum theme
p + theme_ft_rc() # Suitable for use with dark RStudio themes
p + theme_modern_rc() # Suitable for use with dark RStudio themes p
```

### 4.2.2 Theme settings and colour palettes

The point of using a theme is that you get a combination of colours, fonts and other choices that are supposed to go well together, meaning that you don’t have to spend too much time on picking combinations. But if you like, you can override the default options and customise any and all parts of a theme.

Let’s start by creating a `ggplot`

object:

```
<- ggplot(msleep, aes(brainwt, sleep_total, colour = vore)) +
p geom_point() +
xlab("Brain weight (logarithmic scale)") +
ylab("Total sleep time") +
scale_x_log10()
```

You can change the *colour palette*, i.e. the list of colours used for plotting, using `scale_colour_brewer`

. Three types of colour palettes are available:

**Sequential palettes**: that range from a colour to white. These are useful for representing ordinal (i.e. ordered) categorical variables and numerical variables.**Diverging palettes**: these range from one colour to another, with white in between. Diverging palettes are useful when there is a meaningful middle or 0 value (e.g. when your variables represents temperatures or profit/loss), which can be mapped to white.**Qualitative palettes**: which contain multiple distinct colours. These are useful for nominal (i.e. with no natural ordering) categorical variables.

See `?scale_colour_brewer`

or http://www.colorbrewer2.org for a list of the available palettes. Here are some examples:

```
# Sequential palette:
+ scale_colour_brewer(palette = "OrRd")
p # Diverging palette:
+ scale_colour_brewer(palette = "RdBu")
p # Qualitative palette:
+ scale_colour_brewer(palette = "Set1") p
```

In this case, because `vore`

is a nominal categorical variable, a qualitative palette is arguably the best choice.

You can use `theme`

to remove the legend, or change its position:

```
# No legend:
+ theme(legend.position = "none")
p
# Legend below figure:
+ theme(legend.position = "bottom")
p
# Legend inside plot:
+ theme(legend.position = c(0.9, 0.7)) p
```

In the last example, the vector `c(0.9, 0.7)`

gives the relative coordinates of the legend, with `c(0 0)`

representing the bottom left corner of the plot and `c(1, 1)`

the upper right corner. Try to change the coordinates to different values between `0`

and `1`

and see what happens.

`theme`

has a lot of other settings, including for the colours of the background, the grid and the text in the plot. Here are a few examples that you can use as starting point for experimenting with the settings:

```
+ theme(panel.grid.major = element_line(colour = "black"),
p panel.grid.minor = element_line(colour = "purple",
linetype = "dotted"),
panel.background = element_rect(colour = "red", size = 2),
plot.background = element_rect(fill = "yellow"),
axis.text = element_text(family = "mono", colour = "blue"),
axis.title = element_text(family = "serif", size = 14))
```

To find a complete list of settings, see `?theme`

, `?element_line`

(lines), `?element_rect`

(borders and backgrounds), `?element_text`

(text), and `element_blank`

(for suppressing plotting of elements). As before, you can use `colors()`

to get a list of built-in colours, or use colour hex codes.

\[\sim\]

**Exercise 4.1 **Use the documentation for `theme`

and the `element_...`

functions to change the plot object `p`

created above as follows:

Change the background colour of the the entire plot to

`lightblue`

.Change the font of the legend to

`serif`

.Remove the grid.

Change the colour of the axis ticks to

`orange`

and make them thicker.

## 4.3 Exploring distributions

It is often useful to visualise the distribution of a numerical variable. Comparing the distributions of different groups can lead to important insights. Visualising distributions is also important when checking assumptions used for various statistical tests (sometimes called *initial data analysis*). In this section we will illustrate how this can be done using the `diamonds`

data from the `ggplot2`

package, which you started to explore in Chapter 2.

### 4.3.1 Density plots and frequency polygons

We already know how to visualise the distribution of the data by dividing it into bins and plotting a histogram:

```
library(ggplot2)
ggplot(diamonds, aes(carat)) +
geom_histogram(colour = "black")
```

A similar plot is created using frequency polygons, which uses lines instead of bars to display the counts in the bins:

```
ggplot(diamonds, aes(carat)) +
geom_freqpoly()
```

An advantage with frequency polygons is that they can be used to compare groups, e.g. diamonds with different cuts, without facetting:

```
ggplot(diamonds, aes(carat, colour = cut)) +
geom_freqpoly()
```

It is clear from this figure that there are more diamonds with ideal cuts than diamonds with fair cuts in the data. The polygons have roughly the same shape, except perhaps for the polygon for diamonds with fair cuts.

In some cases we are more interested in the shape of the distribution that in the actual counts in the different bins. Density plots are similar to frequency polygons, but show an estimate of the density function of the underlying random variable. These estimates are smooth curves that are scaled so that the area below them is 1 (i.e. scaled to be proper density functions):

```
ggplot(diamonds, aes(carat, colour = cut)) +
geom_density()
```

From this figure, it becomes clear that low-carat diamonds tend to have better cuts, which wasn’t obvious from the frequency polygons. However, the plot does not provide any information about *how* common different cuts are. Use density plots if you’re more interested in the shape of a variable’s distribution, and frequency polygons if you’re more interested in counts.

\[\sim\]

**Exercise 4.2 **Using the density plot created above and the documentation for `geom_density`

, do the following:

Increase the smoothness of the density curves.

Fill the area under the density curves with the same colour as the curves themselves.

Make the colours that fill the areas under the curves transparent.

The figure still isn’t that easy to interpret. Install and load the

`ggridges`

package, which is an extension of`ggplot2`

that allows you to make so-called ridge plots (density plots that are separated along the y-axis, similar to facetting). Read the documentation for`geom_density_ridges`

and use it to make a ridge plot of diamond prices for different cuts.

(Click here to go to the solution.)

**Exercise 4.3 **Return to the histogram created by `ggplot(diamonds, aes(carat)) + geom_histogram()`

above. As there are very few diamonds with carat greater than 3, cut the x-axis at 3. Then decrease the bin width to 0.01. Do any interesting patterns emerge?

### 4.3.2 Asking questions

Exercise 4.3 causes us to ask why diamonds with carat values that are multiples of 0.25 are more common that others. Perhaps price is involved? Unfortunately, a plot of carat versus price is not that informative:

```
ggplot(diamonds, aes(carat, price)) +
geom_point(size = 0.05)
```

Maybe we could compute the average price in each bin of the histogram? In that case, we need to extract the bin breaks from the histogram somehow. We could then create a new categorical variable using the breaks with `cut`

(as we did in Exercise 3.7). It turns out that extracting the bins is much easier using base graphics than `ggplot2`

, so let’s do that:

```
# Extract information from a histogram with bin width 0.01,
# which corresponds to 481 breaks:
<- hist(diamonds$carat, breaks = 481, right = FALSE,
carat_br plot = FALSE)
# Of interest to us are:
# carat_br$breaks, which contains the breaks for the bins
# carat_br$mid, which contains the midpoints of the bins
# (useful for plotting!)
# Create categories for each bin:
$carat_cat <- cut(diamonds$carat, 481, right = FALSE) diamonds
```

We now have a variable, `carat_cat`

, that shows which bin each observation belongs to. Next, we’d like to compute the mean for each bin. This is a grouped summary - mean by category. After we’ve computed the bin means, we could then plot them against the bin midpoints. Let’s try it:

```
<- aggregate(price ~ carat_cat, data = diamonds, FUN = mean)
means
plot(carat_br$mid, means$price)
```

That didn’t work as intended. We get an error message when attempting to plot the results:

```
in xy.coords(x, y, xlabel, ylabel, log) :
Error 'x' and 'y' lengths differ
```

The error message implies that the number of bins and the number of mean values that have been computed differ. But we’ve just computed the mean for each bin, haven’t we? So what’s going on?

By default, `aggregate`

ignores groups for which there are no values when computing grouped summaries. In this case, there are a lot of empty bins - there is for instance no observation in the `[4.99,5)`

bin. In fact, only 272 out of the 481 bins are non-empty.

We can solve this in different ways. One way is to remove the empty bins. We can do this using the `match`

function, which returns the positions of *matching values* in two vectors. If we use it with the bins from the grouped summary and the vector containing all bins, we can find the indices of the non-empty bins. This requires the use of the `levels`

function, that you’ll learn more about in Section 5.4:

```
<- aggregate(price ~ carat_cat, data = diamonds, FUN = mean)
means
<- match(means$carat_cat, levels(diamonds$carat_cat)) id
```

Finally, we’ll also add some vertical lines to our plot, to call attention to multiples of 0.25.

Using base graphics is faster here:

```
plot(carat_br$mid[id], means$price,
cex = 0.5)
# Add vertical lines at multiples
# of 0.25:
abline(v = c(0.5, 0.75, 1,
1.25, 1.5))
```

But we can of course stick to `ggplot2`

if we like:

```
library(ggplot2)
<- data.frame(
d2 bin = carat_br$mid[id],
mean = means$price)
ggplot(d2, aes(bin, mean)) +
geom_point() +
geom_vline(xintercept =
c(0.5, 0.75, 1,
1.25, 1.5))
# geom_vline add vertical lines at
# multiples of 0.25
```

It appears that there are small jumps in the prices at at least some of the 0.25-marks. This explains why there are more diamonds just above these marks than just below.

The above example illustrates three important things regarding exploratory data analysis:

Plots (in our case, the histogram) often lead to new questions.

A lot of the time we must transform, summarise or otherwise manipulate our data to answer a question. Sometimes this is straightforward and sometimes it means diving deep into R code.

Sometimes the thing that we’re trying to do doesn’t work straight away. There is almost always a solution though (and oftentimes more than one!). The more you work with R, the more problem-solving tricks you will learn.

### 4.3.3 Violin plots

Density curves can also be used as alternatives to boxplots. In Exercise 2.16, you created boxplots to visualise price differences between diamonds of different cuts:

```
ggplot(diamonds, aes(cut, price)) +
geom_boxplot()
```

Instead of using a boxplot, we can use a violin plot. Each group is represented by a “violin”, given by a rotated and duplicated density plot:

```
ggplot(diamonds, aes(cut, price)) +
geom_violin()
```

Compared to boxplots, violin plots capture the entire distribution of the data rather than just a few numerical summaries. If you like numerical summaries (and you should!) you can add the median and the quartiles (corresponding to the borders of the box in the boxplot) using the `draw_quantiles`

argument:

```
ggplot(diamonds, aes(cut, price)) +
geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))
```

\[\sim\]

**Exercise 4.4 **Using the first boxplot created above, i.e. `ggplot(diamonds, aes(cut, price)) + geom_violin()`

, do the following:

Add some colour to the plot by giving different colours to each violin.

Because the categories are shown along the x-axis, we don’t really need the legend. Remove it.

Both boxplots and violin plots are useful. Maybe we can have the best of both worlds? Add the corresponding boxplot inside each violin. Hint: the

`width`

and`alpha`

arguments in`geom_boxplot`

are useful for creating a nice-looking figure here.Flip the coordinate system to create horizontal violins and boxes instead.

### 4.3.4 Combine multiple plots into a single graphic

When exploring data with many variables, you’ll often want to make the same kind of plot (e.g. a violin plot) for several variables. It will frequently make sense to place these side-by-side in the same plot window. The `patchwork`

package extends `ggplot2`

by letting you do just that. Let’s install it:

`install.packages("patchwork")`

To use it, save each plot as a plot object and then add them together:

```
<- ggplot(diamonds, aes(cut, carat, fill = cut)) +
plot1 geom_violin() +
theme(legend.position = "none")
<- ggplot(diamonds, aes(cut, price, fill = cut)) +
plot2 geom_violin() +
theme(legend.position = "none")
library(patchwork)
+ plot2 plot1
```

You can also arrange the plots on multiple lines, with different numbers of plots on each line. This is particularly useful if you are combining different types of plots in a single plot window. In this case, you separate plots that are same line by `|`

and mark the beginning of a new line with `/`

:

```
# Create two more plot objects:
<- ggplot(diamonds, aes(cut, depth, fill = cut)) +
plot3 geom_violin() +
theme(legend.position = "none")
<- ggplot(diamonds, aes(carat, fill = cut)) +
plot4 geom_density(alpha = 0.5) +
theme(legend.position = c(0.9, 0.6))
# One row with three plots and one row with a single plot:
| plot2 | plot3) / plot4
(plot1
# One column with three plots and one column with a single plot:
/ plot2 / plot3) | plot4 (plot1
```

(You may need to enlarge your plot window for this to look good!)

## 4.4 Outliers and missing data

### 4.4.1 Detecting outliers

Both boxplots and scatterplots are useful for detecting deviating observations - often called outliers. Outliers can be caused by measurement errors or errors in the data input, but can also be interesting rare cases that can provide valuable insights about the process that generated the data. Either way, it is often of interest to detect outliers, for instance because that may influence the choice of what statistical tests to use.

Let’s return to the scatterplot of diamond carats versus prices:

```
ggplot(diamonds, aes(carat, price)) +
geom_point()
```

There are some outliers which we may want to study further. For instance, there is a surprisingly cheap 5 carat diamond, and some cheap 3 carat diamonds^{22}. But how can we identify those points?

One option is to use the `plotly`

package to make an interactive version of the plot, where we can hover interesting points to see more information about them. Start by installing it:

`install.packages("plotly")`

To use `plotly`

with a `ggplot`

graphic, we store the graphic in a variable and then use it as input to the `ggplotly`

function. The resulting (interactive!) plot takes a little longer than usual to load. Try hovering the points:

```
<- ggplot(diamonds, aes(carat, price)) +
myPlot geom_point()
library(plotly)
ggplotly(myPlot)
```

By default, `plotly`

only shows the carat and price of each diamond. But we can add more information to the box by adding a `text`

aesthetic:

```
<- ggplot(diamonds, aes(carat, price, text = paste("Row:",
myPlot rownames(diamonds)))) +
geom_point()
ggplotly(myPlot)
```

We can now find the row numbers of the outliers visually, which is very useful when exploring data.

\[\sim\]

**Exercise 4.5 **The variables `x`

and `y`

in the `diamonds`

data describe the length and width of the diamonds (in mm). Use an interactive scatterplot to identify outliers in these variables. Check prices, carat and other information and think about if any of the outliers can be due to data errors.

### 4.4.2 Labelling outliers

Interactive plots are great when exploring a dataset, but are not always possible to use in other contexts, e.g. for printed reports and some presentations. In these other cases, we can instead annotate the plot with notes about outliers. One way to do this is to use a geom called `geom_text`

.

For instance, we may want to add the row numbers of outliers to a plot. To do so, we use `geom_text`

along with a condition that specifies which points we should add annotations for. Like in Section 3.2.3, if we e.g. wish to add row numbers for diamonds with carats greater than four, our condition would be `carat > 4`

. The `ifelse`

function, which we’ll look closer at in Section 6.3, is perfect to use here. The syntax will be `ifelse(condition, what text to write if condition is satisfied, what text to write else)`

. To add row names for observations that satisfy the condition but not for other observations, we use `ifelse(condition, rownames(diamonds), "")`

. If instead we wanted to print the price of the diamonds, we’d use `ifelse(condition, price, "")`

instead.

Here are some different examples of conditions used to plot text:

```
## Using the row number (the 5 carat diamond is on row 27,416)
ggplot(diamonds, aes(carat, price)) +
geom_point() +
geom_text(aes(label = ifelse(rownames(diamonds) == 27416,
rownames(diamonds), "")),
hjust = 1.1)
## (hjust=1.1 shifts the text to the left of the point)
## Plot text next to all diamonds with carat>4
ggplot(diamonds, aes(carat, price)) +
geom_point() +
geom_text(aes(label = ifelse(carat > 4, rownames(diamonds), "")),
hjust = 1.1)
## Plot text next to 3 carat diamonds with a price below 7500
ggplot(diamonds, aes(carat, price)) +
geom_point() +
geom_text(aes(label = ifelse(carat == 3 & price < 7500,
rownames(diamonds), "")),
hjust = 1.1)
```

\[\sim\]

**Exercise 4.6 **Create a static (i.e. non-interactive) scatterplot of `x`

versus `y`

from the `diamonds`

data. Label the diamonds with suspiciously high \(y\)-values.

### 4.4.3 Missing data

Like many datasets, the mammal sleep data `msleep`

contains a lot of missing values, represented by `NA`

(Not Available) in R. This becomes evident when we have a look at the data:

```
library(ggplot2)
View(msleep)
```

We can check if a particular observation is missing using the `is.na`

function:

```
is.na(msleep$sleep_rem[4])
is.na(msleep$sleep_rem)
```

We can count the number of missing values for each variable using:

`colSums(is.na(msleep))`

Here, `colSums`

computes the sum of `is.na(msleep)`

for each column of `msleep`

(remember than in summation, `TRUE`

counts as `1`

and `FALSE`

as `0`

), yielding the number of missing values for each variable. In total there are 136 missing values in the dataset:

`sum(is.na(msleep))`

You’ll notice that `ggplot2`

prints a warning in the Console when you create a plot with missing data:

```
ggplot(msleep, aes(brainwt, sleep_total)) +
geom_point() +
scale_x_log10()
```

Sometimes data are missing simply because the information is not yet available (for instance, the brain weight of the mountain beaver could be missing because no one has ever weighed the brain of a mountain beaver). In other cases, data can be missing because something about them are different (for instance, values for a male patient in a medical trial can be missing because the patient died, or because some values only were collected for female patients). It is therefore of interest to see if there are any differences in non-missing variables between subjects that have missing data and subjects that don’t.

In `msleep`

, all animals have recorded values for `sleep_total`

and `bodywt`

. To check if the animals that have missing `brainwt`

values differ from the others, we can plot them in a different colour in a scatterplot:

```
ggplot(msleep, aes(bodywt, sleep_total, colour = is.na(brainwt))) +
geom_point() +
scale_x_log10()
```

(If `is.na(brainwt)`

is `TRUE`

then the brain weight is missing in the dataset.) In this case, there are no obvious differences between the animals with missing data and those without.

\[\sim\]

**Exercise 4.7 **Create a version of the diamonds dataset where the `x`

value is missing for all diamonds with \(x>9\). Make a scatterplot of carat versus price, with points where the `x`

value is missing plotted in a different colour. How would you interpret this plot?

### 4.4.4 Exploring data

The `nycflights13`

package contains data about flights to and from three airports in New York, USA, in 2013. As a summary exercise, we will study a subset of these, namely all flights departing from New York on 1 January that year:

```
install.packages("nycflights13")
library(nycflights13)
<- flights[flights$month == 1 & flights$day == 1,] flights2
```

\[\sim\]

**Exercise 4.8 **Explore the `flights2`

dataset, focusing on delays and amount of time spent in the air. Are there any differences between the different carriers? Are there missing data? Are there any outliers?

## 4.5 Trends in scatterplots

Let’s return to a familiar example - the relationship between animal brain size and sleep times:

```
ggplot(msleep, aes(brainwt, sleep_total)) +
geom_point() +
xlab("Brain weight (logarithmic scale)") +
ylab("Total sleep time") +
scale_x_log10()
```

There appears to be a decreasing trend in the plot. To aid the eye, we can add a smoothed line by adding a new geom, `geom_smooth`

, to the figure:

```
ggplot(msleep, aes(brainwt, sleep_total)) +
geom_point() +
geom_smooth() +
xlab("Brain weight (logarithmic scale)") +
ylab("Total sleep time") +
scale_x_log10()
```

This technique is useful for bivariate data as well as for time series, which we’ll delve into next.

By default, `geom_smooth`

adds a line fitted using either LOESS^{23} or GAM^{24}, as well as the corresponding 95 % confidence interval describing the uncertainty in the estimate. There are several useful arguments that can be used with `geom_smooth`

. You will explore some of these in the exercise below.

\[\sim\]

**Exercise 4.9 **Check the documentation for `geom_smooth`

. Starting with the plot of brain weight vs. sleep time created above, do the following:

Decrease the degree of smoothing for the LOESS line that was fitted to the data. What is better in this case, more or less smoothing?

Fit a straight line to the data instead of a non-linear LOESS line.

Remove the confidence interval from the plot.

Change the colour of the fitted line to red.

## 4.6 Exploring time series

Before we have a look at time series, you should install four useful packages: `forecast`

, `nlme`

, `fma`

and `fpp2`

. The first contains useful functions for plotting time series data and the latter three contain datasets that we’ll use.

```
install.packages(c("nlme", "forecast", "fma", "fpp2"),
dependencies = TRUE)
```

The `a10`

dataset contains information about the monthly anti-diabetic drug sales in Australia during the period July 1991 to June 2008. By checking its structure, we see that it is saved as a time series object^{25}:

```
library(fpp2)
str(a10)
```

`ggplot2`

requires that data is saved as a data frame in order for it to be plotted. In order to plot the time series, we could first convert it to a data frame and then plot each point using `geom_points`

:

```
<- data.frame(time = time(a10), sales = a10)
a10_df ggplot(a10_df, aes(time, sales)) +
geom_point()
```

It is however usually preferable to plot time series using lines instead of points. This is done using a different geom: `geom_line`

:

```
ggplot(a10_df, aes(time, sales)) +
geom_line()
```

Having to convert the time series object to a data frame is a little awkward. Luckily, there is a way around this. `ggplot2`

offers a function called `autoplot`

, that automatically draws an appropriate plot for certain types of data. `forecast`

extends this function to time series objects:

```
library(forecast)
autoplot(a10)
```

We can still add other geoms, axis labels and other things just as before. `autoplot`

has simply replaced the `ggplot(data, aes()) + geom`

part that would be the first two rows of the `ggplot2`

figure, and has implicitly converted the data to a data frame.

\[\sim\]

**Exercise 4.10 **Using the `autoplot(a10)`

figure, do the following:

Add a smoothed line describing the trend in the data. Make sure that it is smooth enough

*not*to capture the seasonal variation in the data.Change the label of the x-axis to “Year” and the label of the y-axis to “Sales ($ million).”

Check the documentation for the

`ggtitle`

function. What does it do? Use it with the figure.Change the colour of the time series line to red.

(Click here to go to the solution.)

### 4.6.1 Annotations and reference lines

We sometimes wish to add text or symbols to plots, for instance to highlight interesting observations. Consider the following time series plot of daily morning gold prices, based on the `gold`

data from the `forecast`

package:

```
library(forecast)
autoplot(gold)
```

There is a sharp spike a few weeks before day 800, which is due to an incorrect value in the data series. We’d like to add a note about that to the plot. First, we wish to find out on which day the spike appears. This can be done by checking the data manually, or using some code:

`<- which.max(gold) spike_date `

To add a circle around that point, we add a call to `annotate`

to the plot:

```
autoplot(gold) +
annotate(geom = "point", x = spike_date, y = gold[spike_date],
size = 5, shape = 21,
colour = "red",
fill = "transparent")
```

`annotate`

can be used to annotate the plot with both geometrical objects and text (and can therefore be used as an alternative to `geom_text`

).

\[\sim\]

**Exercise 4.11 **Using the figure created above and the documentation for `annotate`

, do the following:

Add the text “Incorrect value” next to the circle.

Create a second plot where the incorrect value has been removed.

Read the documentation for the geom

`geom_hline`

. Use it to add a red reference line to the plot, at \(y=400\).

### 4.6.2 Longitudinal data

Multiple time series with identical time points, known as longitudinal data or panel data, are common in many fields. One example of this is given by the `elecdaily`

time series from the `fpp2`

package, which contains information about electricity demand in Victoria, Australia during 2014. As with a single time series, we can plot these data using `autoplot`

:

```
library(fpp2)
autoplot(elecdaily)
```

In this case, it is probably a good idea to facet the data, i.e. to plot each series in a separate figure:

`autoplot(elecdaily, facets = TRUE)`

\[\sim\]

**Exercise 4.12 **Make the following changes to the `autoplot(elecdaily, facets = TRUE)`

:

Remove the

`WorkDay`

variable from the plot (it describes whether or not a given date is a work day, and while it is useful for modelling purposes, we do not wish to include it in our figure).Add smoothed trend lines to the time series plots.

### 4.6.3 Path plots

Another option for plotting multiple time series is path plots. A path plot is a scatterplot where the points are connected with lines in the order that they appear in the data (which, for time series data, should correspond to time). The lines and points can be coloured to represent time.

To make a path plot of Temperature versus Demand for the `elecdaily`

data, we first convert the time series object to a data frame and create a scatterplot:

```
library(fpp2)
ggplot(as.data.frame(elecdaily), aes(Temperature, Demand)) +
geom_point()
```

Next, we connect the points by lines using the `geom_path`

geom:

```
ggplot(as.data.frame(elecdaily), aes(Temperature, Demand)) +
geom_point() +
geom_path()
```

The resulting figure is quite messy. Using colour to indicate the passing of time helps a little. For this, we need to add the day of the year to the data frame. To get the values right, we use `nrow`

which gives us the number of rows in the data frame.

```
<- as.data.frame(elecdaily)
elecdaily2 $day <- 1:nrow(elecdaily2)
elecdaily2
ggplot(elecdaily2, aes(Temperature, Demand, colour = day)) +
geom_point() +
geom_path()
```

It becomes clear from the plot that temperatures were the highest at the beginning of the year, and lower in the winter months (July-August).

\[\sim\]

**Exercise 4.13 **Make the following changes to the plot you created above:

Decrease the size of the points.

Add text annotations showing the dates of the highest and lowest temperatures, next to the corresponding points in the figure.

### 4.6.4 Spaghetti plots

In cases where we’ve observed multiple subjects over time, we often wish to visualise their individual time series together using so-called spaghetti plots. With `ggplot2`

this is done using the `geom_line`

geom. To illustrate this, we use the `Oxboys`

data from the `nlme`

package, showing the heights of 26 boys over time.

```
library(nlme)
ggplot(Oxboys, aes(age, height, group = Subject)) +
geom_point() +
geom_line()
```

The first two `aes`

arguments specify the x and y-axes, and the third specifies that there should be one line per subject (i.e. per boy) rather than a single line interpolating all points. The latter would be a rather useless figure that looks like this:

```
ggplot(Oxboys, aes(age, height)) +
geom_point() +
geom_line() +
ggtitle("A terrible plot")
```

Returning to the original plot, if we wish to be able to identify which time series corresponds to which boy, we can add a colour aesthetic:

```
ggplot(Oxboys, aes(age, height, group = Subject, colour = Subject)) +
geom_point() +
geom_line()
```

Note that the boys are ordered by height, rather than subject number, in the legend.

Now, imagine that we wish to add a trend line describing the general growth trend for all boys. The growth appears approximately linear, so it seems sensible to use `geom_smooth(method = "lm")`

to add the trend:

```
ggplot(Oxboys, aes(age, height, group = Subject, colour = Subject)) +
geom_point() +
geom_line() +
geom_smooth(method = "lm", colour = "red", se = FALSE)
```

Unfortunately, because we have specified in the aesthetics that the data should be grouped by `Subject`

, `geom_smooth`

produces one trend line for each boy. The “problem” is that when we specify an aesthetic in the `ggplot`

call, it is used for all geoms.

\[\sim\]

**Exercise 4.14 **Figure out how to produce a spaghetti plot of the `Oxboys`

data with a single red trend line based on the data from all 26 boys.

### 4.6.5 Seasonal plots and decompositions

The `forecast`

packages includes a number of useful functions when working with time series. One of them is `ggseasonplot`

, which allows us to easily create a spaghetti plot of different periods of a time series with seasonality, i.e. with patterns that repeat seasonally over time. It works similar to the `autoplot`

function, in that it replaces the `ggplot(data, aes) + geom`

part of the code.

```
library(forecast)
library(fpp2)
ggseasonplot(a10)
```

This function is very useful when visually inspecting seasonal patterns.

The `year.labels`

and `year.labels.left`

arguments removes the legend in favour of putting the years at the end and beginning of the lines:

`ggseasonplot(a10, year.labels = TRUE, year.labels.left = TRUE)`

As always, we can add more things to our plot if we like:

```
ggseasonplot(a10, year.labels = TRUE, year.labels.left = TRUE) +
ylab("Sales ($ million)") +
ggtitle("Seasonal plot of anti-diabetic drug sales")
```

When working with seasonal time series, it is common to decompose the series into a seasonal component, a trend component and a remainder. In R, this is typically done using the `stl`

function, which uses repeated LOESS smoothing to decompose the series. There is an `autoplot`

function for `stl`

objects:

`autoplot(stl(a10, s.window = 365))`

This plot can too be manipulated in the same way as other `ggplot`

objects. You can access the different parts of the decomposition as follows:

```
stl(a10, s.window = 365)$time.series[,"seasonal"]
stl(a10, s.window = 365)$time.series[,"trend"]
stl(a10, s.window = 365)$time.series[,"remainder"]
```

\[\sim\]

**Exercise 4.15 **Investigate the `writing`

dataset from the `fma`

package graphically. Make a time series plot with a smoothed trend line, a seasonal plot and an `stl`

-decomposition plot. Add appropriate plot titles and labels to the axes. Can you see any interesting patterns?

### 4.6.6 Detecting changepoints

The `changepoint`

package contains a number of methods for detecting *changepoints* in time series, i.e. timepoints at which either the mean or the variance of the series changes. Finding changepoints can be important for detecting changes in the process underlying the time series. The `ggfortify`

package extends `ggplot2`

by adding `autoplot`

functions for a variety of tools, including those in `changepoint`

. Let’s install the packages:

`install.packages(c("changepoint", "ggfortify"))`

We can now look at some examples with the anti-diabetic drug sales data:

```
library(forecast)
library(fpp2)
library(changepoint)
library(ggfortify)
# Plot the time series:
autoplot(a10)
# Remove the seasonal part and plot the series again:
<- a10 - stl(a10, s.window = 365)$time.series[,"seasonal"]
a10_ns autoplot(a10_ns)
# Plot points where there are changes in the mean:
autoplot(cpt.mean(a10_ns))
# Choosing a different method for finding changepoints
# changes the result:
autoplot(cpt.mean(a10_ns, method = "BinSeg"))
# Plot points where there are changes in the variance:
autoplot(cpt.var(a10_ns))
# Plot points where there are changes in either the mean or
# the variance:
autoplot(cpt.meanvar(a10_ns))
```

As you can see, the different methods from `changepoint`

all yield different results. The results for changes in the mean are a bit dubious - which isn’t all that strange as we are using a method that looks for jumps in the mean on a time series where the increase actually is more or less continuous. The changepoint for the variance looks more reliable - there is a clear change towards the end of the series where the sales become more volatile. We won’t go into details about the different methods here, but mention that the documentation at `?cpt.mean`

, `?cpt.var`

, and `?cpt.meanvar`

contains descriptions of and references for the available methods.

\[\sim\]

**Exercise 4.16 **Are there any changepoints for variance in the `Demand`

time series in `elecdaily`

? Can you explain why the behaviour of the series changes?

### 4.6.7 Interactive time series plots

The `plotly`

packages can be used to create interactive time series plots. As before, you create a `ggplot2`

object as usual, assigning it to a variable and then call the `ggplotly`

function. Here is an example with the `elecdaily`

data:

```
library(plotly)
library(fpp2)
<- autoplot(elecdaily[,"Demand"])
myPlot
ggplotly(myPlot)
```

When you hover the mouse pointer over a point, a box appears, which displays information about that data point. Unfortunately, the date formatting isn’t great in this example - dates are shown as weeks with decimal points. We’ll see how to fix this in Section 5.6.

## 4.7 Using polar coordinates

Most plots are made using *Cartesian coordinates systems*, in which the axes are orthogonal to each other and values are placed in an even spacing along each axis. In some cases, nonlinear axes (e.g. log-transformed) are used instead, as we have already seen.

Another option is to use a *polar* coordinate system, in which positions are specified using an angle and a (radial) distance from the origin. Here is an example of a polar scatterplot:

```
ggplot(msleep, aes(sleep_rem, sleep_total, colour = vore)) +
geom_point() +
xlab("REM sleep (circular axis)") +
ylab("Total sleep time (radial axis)") +
coord_polar()
```

### 4.7.1 Visualising periodic data

Polar coordinates are particularly useful when the data is periodic. Consider for instance the following dataset, describing monthly weather averages for Cape Town, South Africa:

```
<- data.frame(
Cape_Town_weather Month = 1:12,
Temp_C = c(22, 23, 21, 18, 16, 13, 13, 13, 14, 16, 18, 20),
Rain_mm = c(20, 20, 30, 50, 70, 90, 100, 70, 50, 40, 20, 20),
Sun_h = c(11, 10, 9, 7, 6, 6, 5, 6, 7, 9, 10, 11))
```

We can visualise the monthly average temperature using lines in a Cartesian coordinate system:

```
ggplot(Cape_Town_weather, aes(Month, Temp_C)) +
geom_line()
```

What this plot doesn’t show is that the 12th month and the 1st month actually are consecutive months. If we instead use polar coordinates, this becomes clearer:

```
ggplot(Cape_Town_weather, aes(Month, Temp_C)) +
geom_line() +
coord_polar()
```

To improve the presentation, we can change the scale of the x-axis (i.e. the circular axis) so that January and December aren’t plotted at the same angle:

```
ggplot(Cape_Town_weather, aes(Month, Temp_C)) +
geom_line() +
coord_polar() +
xlim(0, 12)
```

\[\sim\]

**Exercise 4.17 **In the plot that we just created, the last and first month of the year aren’t connected. You can fix manually this by adding a cleverly designed faux data point to `Cape_Town_weather`

. How?

### 4.7.2 Pie charts

Consider the stacked bar chart that we plotted in Section 2.8:

```
ggplot(msleep, aes(factor(1), fill = vore)) +
geom_bar()
```

What would happen if we plotted this figure in a polar coordinate system instead? If we map height of the bars (the y-axis of the Cartesian coordinate system) to both the angle and the radial distance, we end up with a pie chart:

```
ggplot(msleep, aes(factor(1), fill = vore)) +
geom_bar() +
coord_polar(theta = "y")
```

There are many arguments against using pie charts for visualisations. Most boil down to the fact that the same information is easier to interpret when conveyed as a bar chart. This is at least partially due to the fact that most people are more used to reading plots in Cartesian coordinates than in polar coordinates.

If we make a similar transformation of a grouped bar chart, we get a different type of pie chart, in which the height of the bars are mapped to both the angle and the radial distance^{26}:

```
# Cartestian bar chart:
ggplot(msleep, aes(vore, fill = vore)) +
geom_bar()
# Polar bar chart:
ggplot(msleep, aes(vore, fill = vore)) +
geom_bar() +
coord_polar()
```

## 4.8 Visualising multiple variables

### 4.8.1 Scatterplot matrices

When we have a large enough number of `numeric`

variables in our data, plotting scatterplots of all pairs of variables becomes tedious. Luckily there are some R functions that speed up this process.

The `GGally`

package is an extension to `ggplot2`

which contains several functions for plotting multivariate data. They work similar to the `autoplot`

functions that we have used in previous sections. One of these is `ggpairs`

, which creates a scatterplot matrix, that is, a grid with scatterplots of all pairs of variables in `data`

. In addition, it also plots density estimates (along the diagonal) and shows the (Pearson) correlation for each pair. Let’s start by installing `GGally`

:

`install.packages("GGally")`

To create a scatterplot matrix for the `airquality`

dataset, simply write:

```
library(GGally)
ggpairs(airquality)
```

(Enlarging your Plot window can make the figure look better.)

If we want to create a scatterplot matrix but only want to include some of the variables in a dataset, we can do so by providing a vector with variable names. Here is an example for the animal sleep data `msleep`

:

```
ggpairs(msleep[, c("sleep_total", "sleep_rem", "sleep_cycle", "awake",
"brainwt", "bodywt")])
```

Optionally, if we wish to create a scatterplot involving all `numeric`

variables, we can replace the vector with variable names with some R code that extracts the columns containing `numeric`

variables:

`ggpairs(msleep[, which(sapply(msleep, class) == "numeric")])`

(You’ll learn more about the `sapply`

function in Section 6.5.)

The resulting plot is identical to the previous one, because the list of names contained all `numeric`

variables. The grab-all-numeric-variables approach is often convenient, because we don’t have to write all the variable names. On the other hand, it’s not very helpful in case we only want to include some of the `numeric`

variables.

If we include a categorical variable in the list of variables (such as the feeding behaviour `vore`

), the matrix will include a bar plot of the categorical variable as well as boxplots and facetted histograms to show differences between different categories in the continuous variables:

```
ggpairs(msleep[, c("vore", "sleep_total", "sleep_rem", "sleep_cycle",
"awake", "brainwt", "bodywt")])
```

Alternatively, we can use a categorical variable to colour points and density estimates using `aes(colour = ...)`

. The syntax for this is follows the same pattern as that for a standard `ggplot`

call - `ggpairs(data, aes)`

. The only exception is that if the categorical variable is not included in the `data`

argument, we much specify which data frame it belongs to:

```
ggpairs(msleep[, c("sleep_total", "sleep_rem", "sleep_cycle", "awake",
"brainwt", "bodywt")],
aes(colour = msleep$vore, alpha = 0.5))
```

As a side note, if all variables in your data frame are `numeric`

, and if you only are looking for a quick-and-dirty scatterplot matrix without density estimates and correlations, you can also use the base R `plot`

:

`plot(airquality)`

\[\sim\]

**Exercise 4.18 **Create a scatterplot matrix for all `numeric`

variables in `diamonds`

. Differentiate different cuts by colour. Add a suitable title to the plot. (`diamonds`

is a fairly large dataset and it may take a minute or so for R to create the plot.)

### 4.8.2 3D scatterplots

The `plotly`

package lets us make three-dimensional scatterplots with the `plot_ly`

function, which can be a useful alternative to scatterplot matrices in some cases. Here is an example using the `airquality`

data:

```
library(plotly)
plot_ly(airquality, x = ~Ozone, y = ~Wind, z = ~Temp,
color = ~factor(Month))
```

Note that you can drag and rotate the plot, to see it from different angles.

### 4.8.3 Correlograms

Scatterplot matrices are not a good choice when we have too many variables, partially because the plot window needs to be very large to fit all variables and partially because it becomes difficult to get a good overview of the data. In such cases, a correlogram, where the strength of the correlation between each pair of variables is plotted instead of scatterplots, can be used instead. It is effectively a visualisation of the correlation matrix of the data, where the strengths and signs of the correlations are represented by different colours.

The `GGally`

package contains the function `ggcorr`

, which can be used to create a correlogram:

```
ggcorr(msleep[, c("sleep_total", "sleep_rem", "sleep_cycle", "awake",
"brainwt", "bodywt")])
```

\[\sim\]

**Exercise 4.19 **Using the `diamonds`

dataset and the documentation for `ggcorr`

, do the following:

Create a correlogram for all

`numeric`

variables in the dataset.The Pearson correlation that

`ggcorr`

uses by default isn’t always the best choice. A commonly used alternative is the Spearman correlation. Change the type of correlation used to create the plot to the Spearman correlation.Change the colour scale from a categorical scale with 5 categories.

Change the colours on the scale to go from yellow (low correlation) to black (high correlation).

### 4.8.4 Adding more variables to scatterplots

We have already seen how scatterplots can be used to visualise two continuous and one categorical variable, by plotting the two continuous variables against each other and using the categorical variable to set the colours of the points. There are however more ways we can incorporate information about additional variables into a scatterplot.

So far, we have set three aesthethics in our scatterplots: `x`

, `y`

and `colour`

. Two other important aesthetics are `shape`

and `size`

which, as you’d expect, allow us to control the shape and size of the points. As a first example using the `msleep`

data, we use feeding behaviour (`vore`

) to set the shapes used for the points:

```
ggplot(msleep, aes(brainwt, sleep_total, shape = vore)) +
geom_point() +
scale_x_log10()
```

The plot looks a little nicer if we increase the `size`

of the points:

```
ggplot(msleep, aes(brainwt, sleep_total, shape = vore, size = 2)) +
geom_point() +
scale_x_log10()
```

Another option is to let `size`

represent a continuous variable, in what is known as a bubble plot:

```
ggplot(msleep, aes(brainwt, sleep_total, colour = vore,
size = bodywt)) +
geom_point() +
scale_x_log10()
```

The size of each “bubble” now represents the weight of the animal. Because some animals are much heavier (i.e. have higher `bodywt`

values) than most others, almost all points are quite small. There are a couple of things we can do to remedy this. First, we can transform `bodywt`

, e.g. using the square root transformation `sqrt(bodywt)`

, to decrease the differences between large and small animals. This can be done by adding `scale_size(trans = "sqrt")`

to the plot. Second, we can also use `scale_size`

to control the range of point sizes (e.g. from size 1 to size 20). This will cause some points to overlap, so we add `alpha = 0.5`

to the geom, to make the points transparent:

```
ggplot(msleep, aes(brainwt, sleep_total, colour = vore,
size = bodywt)) +
geom_point(alpha = 0.5) +
scale_x_log10() +
scale_size(range = c(1, 20), trans = "sqrt")
```

This produces a fairly nice-looking plot, but it’d look even better if we changed the axes labels and legend texts. We can change the legend text for the size scale by adding the argument `name`

to `scale_size`

. Including a `\n`

in the text lets us create a line break - you’ll learn more tricks like that in Section 5.5. Similarly, we can use `scale_colour_discrete`

to change the legend text for the colours:

```
ggplot(msleep, aes(brainwt, sleep_total, colour = vore,
size = bodywt)) +
geom_point(alpha = 0.5) +
xlab("log(Brain weight)") +
ylab("Sleep total (h)") +
scale_x_log10() +
scale_size(range = c(1, 20), trans = "sqrt",
name = "Square root of\nbody weight") +
scale_colour_discrete(name = "Feeding behaviour")
```

\[\sim\]

**Exercise 4.20 **Using the bubble plot created above, do the following:

Replace

`colour = vore`

in the`aes`

by`fill = vore`

and add`colour = "black", shape = 21`

to`geom_point`

. What happens?Use

`ggplotly`

to create an interactive version of the bubble plot above, where variable information and the animal name are displayed when you hover a point.

### 4.8.5 Overplotting

Let’s make a scatterplot of `table`

versus `depth`

based on the `diamonds`

dataset:

```
ggplot(diamonds, aes(table, depth)) +
geom_point()
```

This plot is cluttered. There are too many points, which makes it difficult to see if, for instance, high `table`

values are more common than low `table`

values. In this section, we’ll look at some ways to deal with this problem, known as overplotting.

The first thing we can try is to decrease the point size:

```
ggplot(diamonds, aes(table, depth)) +
geom_point(size = 0.1)
```

This helps a little, but now the outliers become a bit difficult to spot. We can try changing the opacity using `alpha`

instead:

```
ggplot(diamonds, aes(table, depth)) +
geom_point(alpha = 0.2)
```

This is also better than the original plot, but neither plot is great. Instead of plotting each individual point, maybe we can try plotting the counts or densities in different regions of the plot instead? Effectively, this would be a 2D version of a histogram. There are several ways of doing this in `ggplot2`

.

First, we bin the points and count the numbers in each bin, using `geom_bin2d`

:

```
ggplot(diamonds, aes(table, depth)) +
geom_bin2d()
```

By default, `geom_bin2d`

uses 30 bins. Increasing that number can sometimes give us a better idea about the distribution of the data:

```
ggplot(diamonds, aes(table, depth)) +
geom_bin2d(bins = 50)
```

If you prefer, you can get a similar plot with hexagonal bins by using `geom_hex`

instead:

```
ggplot(diamonds, aes(table, depth)) +
geom_hex(bins = 50)
```

As an alternative to bin counts, we could create a 2-dimensional density estimate and create a contour plot showing the levels of the density:

```
ggplot(diamonds, aes(table, depth)) +
stat_density_2d(aes(fill = ..level..), geom = "polygon",
colour = "white")
```

The `fill = ..level..`

bit above probably looks a little strange to you. It means that an internal function (the level of the contours) is used to choose the fill colours. It also means that we’ve reached a point where we’re reaching deep into the depths of `ggplot2`

!

We can use a similar approach to show a summary statistic for a third variable in a plot. For instance, we may want to plot the average price as a function of `table`

and `depth`

. This is called a tile plot:

```
ggplot(diamonds, aes(table, depth, z = price)) +
geom_tile(binwidth = 1, stat = "summary_2d", fun = mean) +
ggtitle("Mean prices for diamonds with different depths and
tables")
```

\[\sim\]

**Exercise 4.21 **The following tasks involve the `diamonds`

dataset:

Create a tile plot of

`table`

versus`depth`

, showing the highest price for a diamond in each bin.Create a bin plot of

`carat`

versus`price`

. What type of diamonds have the highest bin counts?

### 4.8.6 Categorical data

When visualising a pair of categorical variables, plots similar to those in the previous section prove to be useful. One way of doing this is to use the `geom_count`

geom. We illustrate this with an example using `diamonds`

, showing how common different combinations of colours and cuts are:

```
ggplot(diamonds, aes(color, cut)) +
geom_count()
```

However, it is often better to use colour rather than point size to visualise counts, which we can do using a tile plot. First we have to compute the counts though, using `aggregate`

. We now wish to have two grouping variables, `color`

and `cut`

, which we can put on the right-hand side of the formula as follows:

```
<- aggregate(carat ~ cut + color, data = diamonds,
diamonds2 FUN = length)
diamonds2
```

`diamonds2`

is now a data frame containing the different combinations of `color`

and `cut`

along with counts of how many diamonds that belong to each combination (labelled `carat`

, because we put `carat`

in our formula). Let’s change the name of the last column from `carat`

to `Count`

:

`names(diamonds2)[3] <- "Count"`

Next, we can plot the counts using `geom_tile`

:

```
ggplot(diamonds2, aes(color, cut, fill = Count)) +
geom_tile()
```

It is also possible to combine point size and colours:

```
ggplot(diamonds2, aes(color, cut, colour = Count, size = Count)) +
geom_count()
```

\[\sim\]

**Exercise 4.22 **Using the `diamonds`

dataset, do the following:

Use a plot to find out what the most common combination of cut and clarity is.

Use a plot to find out which combination of cut and clarity that has the highest average price.

### 4.8.7 Putting it all together

In the next two exercises, you will repeat what you have learned so far by investigating the `gapminder`

and `planes`

datasets. First, load the corresponding libraries and have a look at the documentation for each dataset:

```
install.packages("gapminder")
library(gapminder)
?gapminder
library(nycflights13)
?planes
```

\[\sim\]

**Exercise 4.23 **Do the following using the `gapminder`

dataset:

Create a scatterplot matrix showing life expectancy, population and GDP per capita for all countries, using the data from the year 2007. Use colours to differentiate countries from different continents. Note: you’ll probably need to add the argument

`upper = list(continuous = "na")`

when creating the scatterplot matrix. By default, correlations are shown above the diagonal, but the fact that there only are two countries from Oceania will cause a problem there - at least 3 points are needed for a correlation test.Create an interactive bubble plot, showing information about each country when you hover the points. Use data from the year 2007. Put log(GDP per capita) on the x-axis and life expetancy on the y-axis. Let population determine point size. Plot each country in a different colour and facet by continent. Tip: the

`gapminder`

package provides a pretty color scheme for different countries, called`country_colors`

. You can use that scheme by adding`scale_colour_manual(values = country_colors)`

to your plot.

(Click here to go to the solution.)

**Exercise 4.24 **Use graphics to answer the following questions regarding the `planes`

dataset:

What is the most common combination of manufacturer and plane type in the dataset?

Which combination of manufacturer and plane type has the highest average number of seats?

Do the numbers of seats on planes change over time? Which plane had the highest number of seats?

Does the type of engine used change over time?

## 4.9 Principal component analysis

If there are a lot of variables in your data, it can often be difficult to detect differences between groups, or to create a perspicuous visualisation. A useful tool in this context is *principal component analysis* (PCA, for short), which can reduce high-dimensional data to a lower number of variables that can be visualised in one or two scatterplots. The idea is to compute new variables, called *principal components*, that are linear combinations of the original variables^{27}. These are constructed with two goals in mind: the principal components should capture as much of the variance in the data as possible, and each principal component should be uncorrelated to the other components. You can then plot the principal components to get a low-dimensional representation of your data, which hopefully captures most of its variation.

By design, the number of principal components computed are as many as the original number of variables, with the first having the largest variance, the second the second largest variance, and so on. We hope that it will suffice to use just the first few of these to represent most of the variation in the data, but this is not guaranteed. Principal component analysis is more likely to yield a useful result if several variables are correlated.

To illustrate the principles of PCA we will use a dataset from Charytanowicz et al. (2010), containing measurements on wheat kernels for three varieties of wheat. A description of the variables is available at

http://archive.ics.uci.edu/ml/datasets/seeds

We are interested to find out if these measurements can be used to distinguish between the varieties. The data is stored in a `.txt`

file, which we import using `read.table`

(which works just like `read.csv`

, but is tailored to text files) and convert the `Variety`

column to a categorical `factor`

variable (which you’ll learn more about in Section 5.4):

```
# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
<- read.table("https://tinyurl.com/seedsdata",
seeds col.names = c("Area", "Perimeter", "Compactness",
"Kernel_length", "Kernel_width", "Asymmetry",
"Groove_length", "Variety"))
$Variety <- factor(seeds$Variety) seeds
```

If we make a scatterplot matrix of all variables, it becomes evident that there are differences between the varieties, but that no single pair of variables is enough to separate them:

```
library(ggplot2)
library(GGally)
ggpairs(seeds[, -8], aes(colour = seeds$Variety, alpha = 0.2))
```

Moreover, for presentation purposes the amount of information in the scatterplot matrix is a bit overwhelming. It would be nice to be able to present the data in a single scatterplot, without losing too much information. We’ll therefore compute the principal components using the `prcomp`

function. It is usually recommended that PCA is performed using standardised data, i.e. using data that has been scaled to have mean 0 and standard deviation 1. We don’t have to standardise the data ourselves, but can let `prcomp`

do that for us using the arguments `center = TRUE`

(to get mean 0) and `scale. = TRUE`

(to get standard deviation 1):

```
# Compute principal components:
<- prcomp(seeds[,-8], center = TRUE, scale. = TRUE) pca
```

To see the *loadings* of the components, i.e. how much each variable contributes to the components, simply type the name of the object `prcomp`

created:

` pca`

The first principal component is more or less a weighted average of all variables, but has stronger weights on `Area`

, `Perimeter`

, `Kernel_length`

, `Kernel_width`

, and `Groove_length`

, all of which are measures of size. We can therefore interpret it as a size variable. The second component has higher loadings for `Compactness`

and `Asymmetry`

, meaning that it mainly measures those shape features. In Exercise 4.26 you’ll see how the loadings can be visualised in a *biplot*.

To see how much of the variance each component represents, use `summary`

:

`summary(pca)`

The first principal component accounts for 71.87 % of the variance, and the first three combined account for 98.67 %.

To visualise this, we can draw a *scree plot*, which shows the variance of each principal component - the total variance of the data is the sum of the variances of the principal components:

`screeplot(pca, type = "lines")`

We can use this to choose how many principal components to use when visualising or summarising our data. In that case, we look for an “elbow,” i.e. a bend after which increases the number of components doesn’t increase amount of variance explained much.

We can access the values of the principal components using `pca$x`

. Let’s check that the first two components really are uncorrelated:

`cor(pca$x[,1], pca$x[,2])`

In this case, almost all of the variance is summarised by the first two or three principal components. It appears that we have successfully reduced the data from 7 variables to 2-3, which should make visualisation much easier. The `ggfortify`

package contains an `autoplot`

function for PCA objects, that creates a scatterplot of the first two principal components:

```
library(ggfortify)
autoplot(pca, data = seeds, colour = "Variety")
```

That is much better! The groups are almost completely separated, which shows that the variables can be used to discriminate between the three varieties. The first principal component accounts for 71.87 % of the total variance in the data, and the second for 17.11 %.

If you like, you can plot other pairs of principal components than just components 1 and 2. In this case, component 3 may be of interest, as its variance is almost as high as that of component 2. You can specify which components to plot with the `x`

and `y`

arguments:

```
# Plot 2nd and 3rd PC:
autoplot(pca, data = seeds, colour = "Variety",
x = 2, y = 3)
```

Here, the separation is nowhere near as clear as in the previous figure. In this particular example, plotting the first two principal components is the better choice.

Judging from these plots, it appears that the kernel measurements can be used to discriminate between the three varieties of wheat. In Chapters 7 and 9 you’ll learn how to use R to build models that can be used to do just that, e.g. by predicting which variety of wheat a kernel comes from given its measurements. If we wanted to build a statistical model that could be used for this purpose, we could use the original measurements. But we could also try using the first two principal components as the only input to the model. Principal component analysis is very useful as a pre-processing tool, used to create simpler models that are based on fewer variables (or ostensibly simpler, because the new variables are typically more difficult to interpret than the original ones).

\[\sim\]

**Exercise 4.25 **Use principal components on the `carat`

, `x`

, `y`

, `z`

, `depth`

, and `table`

variables in the `diamonds`

data, and answer the following questions:

How much of the total variance does the first principal component account for? How many components are needed to account for at least 90 % of the total variance?

Judging by the loadings, what do the first two principal components measure?

What is the correlation between the first principal component and

`price`

?Can the first two principal components be used to distinguish between diamonds with different cuts?

(Click here to go to the solution.)

**Exercise 4.26 **Return to the scatterplot of the first two principal components for the `seeds`

data, created above. Adding the arguments `loadings = TRUE`

and `loadings.label = TRUE`

to the `autoplot`

call creates a *biplot*, which shows the loadings for the principal components on top of the scatterplot. Create a biplot and compare the result to those obtained by looking at the loadings numerically. Do the conclusions from the two approaches agree?

## 4.10 Cluster analysis

Cluster analysis is concerned with grouping observations into groups, *clusters*, that in some sense are similar. Numerous methods are available for this task, approaching the problem from different angles. Many of these are available in the `cluster`

package, that ships with R. In this section, we’ll have a look at a smorgasbord of clustering techniques.

### 4.10.1 Hierarchical clustering

As a first example where clustering can be of interest, we’ll consider the `votes.repub`

data from `cluster`

. It describes the proportion of votes for the Republican candidate in US presidential elections from 1856 to 1976 in 50 different states:

```
library(cluster)
?votes.repubView(votes.repub)
```

We are interested in finding subgroups - clusters - of states with similar voting patterns.

To find clusters of similar observations (states, in this case), we could start by assigning each observation to its own cluster. We’d then start with 50 clusters, one for each observation. Next, we could merge the two clusters that are the most similar, yielding 49 clusters, one of which consisted of two observations and 48 consisting of a single observation. We could repeat this process, merging the two most similar clusters in each iteration, until only a single cluster was left. This would give us a *hierarchy* of clusters, which could be plotted in a tree-like structure, where observations from the same cluster would be one the same branch. Like this:

```
<- agnes(votes.repub)
clusters_agnes plot(clusters_agnes, which = 2)
```

This type of plot is known as a *dendrogram*

We’ve just used `agnes`

, a function from `cluster`

that can be used to carry out *hierarchical clustering* in the manner described above. There are a couple of things that need to be clarified though.

First, how do we measure how similar two \(p\)-dimensional observations \(x\) and \(y\) are? `agnes`

provides two measures of distance between points:

`metric = "euclidean"`

(the default), uses the Euclidean \(L_2\) distance \(||x-y||=\sqrt{\sum_{i=1}^p(x_i-y_i)^2}\),`metric = "manhattan"`

, the default, uses the Manhattan \(L_1\) distance \(||x-y||={\sum_{i=1}^p|x_i-y_i|}\).

Note that neither of these will work create if you have categorical variables in your data. If all your variables are binary, i.e. categorical with two values, you can use `mona`

instead of `agnes`

for hierarchical clustering.

Second, how do we measure how similar two clusters of observations are? `agnes`

offers a number of options here. Among them are:

`method = "average`

(the default), unweighted average linkage, uses the average distance between points from the two clusters,`method = "single`

, single linkage, uses the smallest distance between points from the two clusters,`method = "complete`

, complete linkage, uses the largest distance between points from the two clusters,`method = "ward"`

, Ward’s method, uses the within-cluster variance to compare different possible clusterings, with the clustering with the lowest within-cluster variance being chosen.

Regardless of which of these that you use, it is often a good idea to standardise the numeric variables in your dataset so that they all have the same variance. If you don’t, your distance measure is likely to be dominated by variables with larger variance, while variables with low variances will have little or no impact on the clustering. To standardise your data, you can use `scale`

:

```
# Perform clustering on standardised data:
<- agnes(scale(votes.repub))
clusters_agnes # Plot dendrogram:
plot(clusters_agnes, which = 2)
```

At this point, we’re starting to use several functions after another, and so this looks like a perfect job for a pipeline. To carry out the same analysis uses `%>%`

pipes, we write:

```
library(magrittr)
%>% scale() %>%
votes.repub agnes() %>%
plot(which = 2)
```

We can now try changing the metric and clustering method used as described above. Let’s use the Manhattan distance and complete linkage:

```
%>% scale() %>%
votes.repub agnes(metric = "manhattan", method = "complete") %>%
plot(which = 2)
```

We can change the look of the dendrogram by adding `hang = -1`

, which causes all observations to be placed at the same level:

```
%>% scale() %>%
votes.repub agnes(metric = "manhattan", method = "complete") %>%
plot(which = 2, hang = -1)
```

As an alternative to `agnes`

, we can consider `diana`

. `agnes`

is an *agglomerative* methods, which starts with a lot of clusters and then merge them step-by-step. `diana`

, in contrast, is a *divisive* method, which starts with one large cluster and then step-by-step splits it into several smaller clusters.

```
%>% scale() %>%
votes.repub diana() %>%
plot(which = 2)
```

You can change the distance measure used by setting `metric`

in the `diana`

call. Euclidean distance is the default.

To wrap this section up, we’ll look at two packages that are useful for plotting the results of hierarchical clustering: `dendextend`

and `factoextra`

:

`install.packages(c("dendextend", "factoextra"))`

To compare the dendrograms from produced by different methods (or the same method with different settings), in a *tanglegram*, where the dendrograms are plotted against each other, we can use `tanglegram`

from `dendextend`

:

```
library(dendextend)
# Create clusters using agnes:
%>% scale() %>%
votes.repub agnes() -> clusters_agnes
# Create clusters using diana:
%>% scale() %>%
votes.repub diana() -> clusters_diana
# Compare the results:
tanglegram(as.dendrogram(clusters_agnes),
as.dendrogram(clusters_diana))
```

Some clusters are quite similar here, whereas others are very different.

A lot of the time, we are interested in finding a comparatively small number of clusters, \(k\). In hierarchical clustering, we can reduce the number of clusters by “cutting” the dendrogram tree. To do so using the `factoextra`

package, we first use `hcut`

to cut the tree into \(k\) parts, and then `fviz_dend`

to plot the dendrogram, with each cluster plotted in a different colour. If, for instance, we want \(k=5\) clusters^{28} and want to use `agnes`

with average linkage and Euclidean distance for the clustering, we’d do the following:

```
library(factoextra)
%>% scale() %>%
votes.repub hcut(k = 5, hc_func = "agnes",
hc_method = "average",
hc_metric = "euclidean") %>%
fviz_dend()
```

There is no inherent meaning to the colours - they are simply a way to visually distinguish between clusters.

Hierarchical clustering is especially suitable for data with named observations. For other types of data, other methods may be better. We will consider some alternatives next.

\[\sim\]

**Exercise 4.27 **Continue the last example above by changing the clustering method to complete linkage with the Manhattan distance.

Do any of the 5 coloured clusters remain the same?

How can you produce a tanglegram with 5 coloured clusters, to better compare the results from the two clusterings?

(Click here to go to the solution.)

**Exercise 4.28 **The `USArrests`

data contains statistics on violent crime rates in 50 US states. Perform a hierarchical cluster analysis of the data. Which states are Maryland clustered with?

### 4.10.2 Centroid-based clustering

Let’s return to the `seeds`

data that we explored in Section 4.9:

```
# Download the data:
<- read.table("https://tinyurl.com/seedsdata",
seeds col.names = c("Area", "Perimeter", "Compactness",
"Kernel_length", "Kernel_width", "Asymmetry",
"Groove_length", "Variety"))
$Variety <- factor(seeds$Variety) seeds
```

We know that there are three varieties of seeds in this dataset, but what if we didn’t? Or what if we’d lost the labels, and don’t know what seeds are of what type? There are no rows names for this data, and plotting a dendrogram may therefore not be that useful. Instead, we can use \(k\)-means clustering, where the the points are clustered into \(k\) clusters based on their distances to the cluster means, or *centroids*.

When performing \(k\)-means clustering (using the algorithm of Hartigan & Wong (1979) that is the default in the function that we’ll use), the data is split into \(k\) clusters based on their distance to the mean of all points. Points are then moved between clusters, one at a time, based on how close they are (as measured by Euclidean distance) to the mean of each cluster. The algorithm finishes when no point can be moved between clusters without increasing the average distance between points and the means of their clusters.

To run a \(k\)-means clustering in R, we can use `kmeans`

. Let’s start by using \(k=3\) clusters:

```
# First, we standardise the data, and then we do a k-means
# clustering.
# We ignore variable 8, Variety, which is the group label.
library(magrittr)
-8] %>% scale() %>%
seeds[, kmeans(centers = 3) -> seeds_cluster
seeds_cluster
```

To visualise the results, we’ll plot the first two principal components. We’ll use colour to show the clusters. Moreover, we’ll plot the different varieties in different shapes, to see if the clusters found correspond to different varieties:

```
# Compute principal components:
<- prcomp(seeds[,-8])
pca library(ggfortify)
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
shape = "Variety", size = 2, alpha = 0.75)
```

In this case, the clusters more or less overlap with the varieties! Of course, in a lot of cases we don’t know the number of clusters beforehand. What happens if we change \(k\)?

First, we try \(k=2\):

```
-8] %>% scale() %>%
seeds[, kmeans(centers = 2) -> seeds_cluster
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
shape = "Variety", size = 2, alpha = 0.75)
```

Next, \(k=4\):

```
-8] %>% scale() %>%
seeds[, kmeans(centers = 4) -> seeds_cluster
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
shape = "Variety", size = 2, alpha = 0.75)
```

And finally, a larger number of clusters, say \(k=12\):

```
-8] %>% scale() %>%
seeds[, kmeans(centers = 12) -> seeds_cluster
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
shape = "Variety", size = 2, alpha = 0.75)
```

If it weren’t from the fact that the different varieties were shown as different shapes, we’d have no way to say, based on this plot alone, which choice of \(k\) that is preferable here. Before we go into methods for choosing \(k\) though, we’ll mention `pam`

, and alternative to \(k\)-means that works in the same way, but uses median-like points, *medoids* instead of cluster means. This makes it more robust to outliers. Let’s try it with \(k=3\) clusters:

```
-8] %>% scale() %>%
seeds[, pam(k = 3) -> seeds_cluster
autoplot(pca, data = seeds, colour = seeds_cluster$clustering,
shape = "Variety", size = 2, alpha = 0.75)
```

For both `kmeans`

and `pam`

, there are visual tools that can help us choose the value of \(k\) in the `factoextra`

package. Let’s install it:

`install.packages("factoextra")`

The `fviz_nbclust`

function in `factoextra`

can be used to obtain plots that can guide the choice of \(k\). It takes three arguments as input: the data, the clustering function (e.g. `kmeans`

or `pam`

) and the method used for evaluating different choices of \(k\). There are three options for the latter: `"wss"`

, `"silhouette"`

and `"gap_stat"`

.

`method = "wss"`

yields a plot that relies on the within-cluster sum of squares, WSS, which is a measure of the within-cluster variation. The smaller this is, the more compact are the clusters. The WSS is plotted for several choices of \(k\), and we look for an “elbow,” just as we did when using a scree plot for PCA. That is, we look for the value of \(k\) such that increasing \(k\) further doesn’t improve the WSS much. Let’s have a look at an example, using `pam`

for clustering:

```
library(factoextra)
fviz_nbclust(scale(seeds[, -8]), pam, method = "wss")
# Or, using a pipeline instead:
library(magrittr)
-8] %>% scale() %>%
seeds[, fviz_nbclust(pam, method = "wss")
```

\(k=3\) seems like a good choice here.

`method = "silhouette"`

produces a silhouette plot. The silhouette value measures how similar a point is compared to other points in its cluster. The closer to 1 this value is, the better. In a silhouette plot, the average silhouette value for points in the data are plotted against \(k\):

`fviz_nbclust(scale(seeds[, -8]), pam, method = "silhouette")`

Judging by this plot, \(k=2\) appears to be the best choice.

Finally, `method = "gap_stat"`

yields a plot of the gap statistic (Tibshirani et al., 2001), which is based on comparing the WSS to its expected value under a null distribution obtained using the bootstrap (Section 7.6). Higher values of the gap statistic are preferable:

`fviz_nbclust(scale(seeds[, -8]), pam, method = "gap_stat")`

In this case, \(k=3\) gives the best value.

In addition to plots for choosing \(k\), `factoextra`

provides the function `fviz_cluster`

for creating PCA-based plots, with an option to add convex hulls or ellipses around the clusters:

```
# First, find the clusters:
-8] %>% scale() %>%
seeds[, kmeans(centers = 3) -> seeds_cluster
# Plot clusters and their convex hulls:
library(factoextra)
fviz_cluster(seeds_cluster, data = seeds[, -8])
# Without row numbers:
fviz_cluster(seeds_cluster, data = seeds[, -8], geom = "point")
# With ellipses based on the multivariate normal distribution:
fviz_cluster(seeds_cluster, data = seeds[, -8],
geom = "point", ellipse.type = "norm")
```

Note that in this plot, the shapes correspond to the clusters and not the varieties of seeds.

\[\sim\]

**Exercise 4.29 **The `chorSub`

data from `cluster`

contains measurements of 10 chemicals in 61 geological samples from the Kola Peninsula. Cluster this data using using either `kmeans`

or `pam`

(does either seem to be a better choice here?). What is a good choice of \(k\) here? Visualise the results.

### 4.10.3 Fuzzy clustering

An alternative to \(k\)-means clustering is *fuzzy clustering*, in which each point is “spread out” over the \(k\) clusters instead of being placed in a single cluster. The more similar it is to other observations in a cluster, the higher is its membership in that cluster. Points can have a high degree of membership to several clusters, which is useful in applications where points should be allowed to belong to more than one cluster. An important example is genetics, where genes can encode proteins with more than one function. If each point corresponds to a gene, it then makes sense to allow the points to belong to several clusters, which potentially can be associated with different functions. The opposite of fuzzy clustering is *hard clustering*, in which each point only belongs to one cluster.

`fanny`

from `cluster`

can be used to perform fuzzy clustering:

```
library(cluster)
library(magrittr)
-8] %>% scale() %>%
seeds[, fanny(k = 3) -> seeds_cluster
# Check membership of each cluster for the different points:
$membership
seeds_cluster
# Plot the closest hard clustering:
library(factoextra)
fviz_cluster(seeds_cluster, geom = "point")
```

As for `kmeans`

and `pam`

, we can use `fviz_nbclust`

to determine how many clusters to use:

```
-8] %>% scale() %>%
seeds[, fviz_nbclust(fanny, method = "wss")
-8] %>% scale() %>%
seeds[, fviz_nbclust(fanny, method = "silhouette")
# Producing the gap statistic plot takes a while here, so
# you may want to skip it in this case:
-8] %>% scale() %>%
seeds[, fviz_nbclust(fanny, method = "gap")
```

\[\sim\]

**Exercise 4.30 **Do a fuzzy clustering of the `USArrests`

data. Is Maryland strongly associated with a single cluster, or with several clusters? What about New Jersey?

### 4.10.4 Model-based clustering

As a last option, we’ll consider model-based clustering, in which each cluster is assumed to come from a multivariate normal distribution. This will yield ellipsoidal clusters. `Mclust`

from the `mclust`

package fits such a model, called a Gaussian finite mixture model, using the EM-algorithm (Scrucca et al., 2016). First, let’s install the package:

`install.packages("mclust")`

Now, let’s cluster the `seeds`

data. The number of clusters is chosen as part of the clustering procedure.

```
library(mclust)
<- Mclust(scale(seeds[, -8]))
seeds_cluster summary(seeds_cluster)
# Plot results with ellipsoids:
library(factoextra)
fviz_cluster(seeds_cluster, geom = "point", ellipse.type = "norm")
```

Gaussian finite mixture models are based on the assumption that the data is numerical. For categorical data, we can use latent class analysis, which we’ll discuss in Section 4.11.2, instead.

\[\sim\]

**Exercise 4.31 **Return to the `chorSub`

data from Exercise 4.29. Cluster it using a Gaussian finite mixture model. How many clusters do you find?

### 4.10.5 Comparing clusters

Having found some interesting clusters, we are often interested in exploring differences between the clusters. To do so, we much first extract the cluster labels from our clustering (which are contained in the variables `clustering`

for methods with Western female names, `cluster`

for `kmeans`

, and `classification`

for `Mclust`

). We can then add those labels to our data frame and use it when plotting.

For instance, using the `seeds`

data, we can compare the area of seeds from different clusters:

```
# Cluster the seeds using k-means with k=3:
library(cluster)
-8] %>% scale() %>%
seeds[, kmeans(centers = 3) -> seeds_cluster
# Add the results to the data frame:
$clusters <- factor(seeds_cluster$cluster)
seeds# Instead of $cluster, we'd use $clustering for agnes, pam, and fanny
# objects, and $classification for an Mclust object.
# Compare the areas of the 3 clusters using boxplots:
library(ggplot2)
ggplot(seeds, aes(x = Area, group = clusters, fill = clusters)) +
geom_boxplot()
# Or using density estimates:
ggplot(seeds, aes(x = Area, group = clusters, fill = clusters)) +
geom_density(alpha = 0.7)
```

We can also create a scatterplot matrix to look at all variables simultaneously:

```
library(GGally)
ggpairs(seeds[, -8], aes(colour = seeds$clusters, alpha = 0.2))
```

## 4.11 Exploratory factor analysis

The purpose of *factor analysis* is to describe and understand the correlation structure for a set of observable variables through a smaller number of unobservable underlying variables, called *factors* or *latent variables*. These are thought to explain the values of the observed variables in a causal manner. Factor analysis is a popular tool in psychometrics, where it for instance is used to identify latent variables that explain people’s results on different tests, e.g. related to personality, intelligence, or attitude.

### 4.11.1 Factor analysis

We’ll use the `psych`

package, along with the associated package `GPArotation`

, for our analyses. Let’s install them:

`install.packages(c("psych", "GPArotation"))`

For our first example of factor analysis, we’ll be using the `attitude`

data that comes with R. It describes the outcome of a survey of employees at a financial organisation. Have a look at its documentation to read about the variables in the dataset:

```
?attitude attitude
```

To fit a factor analysis model to these data, we can use `fa`

from `psych`

. `fa`

requires us to specify the number of factors used in the model. We’ll get back to how to choose the number of factors, but for now, let’s go with 2:

```
library(psych)
# Fit factor model:
<- fa(attitude, nfactors = 2,
attitude_fa rotate = "oblimin", fm = "ml")
```

`fa`

does two things for us. First, it fits a factor model to the data, which yields a table of *factor loadings*, i.e. the correlation between the two unobserved factors and the observed variables. However, there are an infinite number of mathematically valid factor models for any given dataset. The factors are therefore *rotated* according to some rule to obtain a factor model that hopefully allows for easy and useful interpretation. There are several methods that can be used to fit the factor model (set using the `fm`

argument in `fa`

) and for rotation the solution (set using `rotate`

). We’ll look at some of the options shortly.

First, we’ll print the result, showing the factor loadings (after rotation). We’ll also plot the resulting model using `fa.diagram`

, showing the correlation between the factors and the observed variables:

```
# Print results:
attitude_fa
# Plot results:
fa.diagram(attitude_fa, simple = FALSE)
```

The first factor is correlated to the variables `advance`

, `learning`

and `raises`

. We can perhaps interpret this factor as measuring the employees’ career opportunity at the organisation. The second factor is strongly correlated to `complaints`

and (overall) `rating`

, but also to a lesser degree correlated to `raises`

, `learning`

and `privileges`

. This can maybe be interpreted as measuring how the employees’ feel that they are treated at the organisation.

We can also see that the two factors are correlated. In some cases, it makes sense to expect the factors to be uncorrelated. In that case, we can change the rotation method used, from `oblimin`

(which yields *oblique rotations*, allowing for correlations - usually a good default) to `varimax`

, which yields uncorrelated factors:

```
<- fa(attitude, nfactors = 2,
attitude_fa rotate = "varimax", fm = "ml")
fa.diagram(attitude_fa, simple = FALSE)
```

In this case, the results are fairly similar.

The `fm = "ml"`

setting means that maximum likelihood estimation of the factor model is performed, under the assumption of a normal distribution for the data. Maximum likelihood estimation is widely recommended for estimation of factor models, and can often work well even for non-normal data (Costello & Osborne, 2005). However, there are cases where it fails to find useful factors. `fa`

offers several different estimation methods. A good alternative is `minres`

, which often works well when maximum likelihood fails:

```
<- fa(attitude, nfactors = 2,
attitude_fa rotate = "oblimin", fm = "minres")
fa.diagram(attitude_fa, simple = FALSE)
```

Once again, the results are similar to what we saw before. In other examples, the results differ more. When choosing which estimation method and rotation to use, bear in mind that in an exploratory study, there is no harm in playing around with a few different methods. After all, your purpose is to generate hypotheses rather than to confirm them, and looking at the data in a few different ways will help you do that.

To determine the number of factors that are appropriate for a particular dataset, we can draw a scree plot with `scree`

. This is interpreted in the same way as for principal components analysis (Section 4.9) and centroid-based clustering (Section 4.10.2) - we look for an “elbow” in the plot, which tells us at which point adding more factors no longer contributes much to the model:

`scree(attitude, pc = FALSE)`

A useful alternative version of this is provided by `fa.parallel`

, which adds lines showing what the scree plot would look like for randomly generated uncorrelated data of the same size as the original dataset. As long as the blue line, representing the actual data, is higher than the red line, representing randomly generated data, adding more factors improves the model:

`fa.parallel(attitude, fm = "ml", fa = "fa")`

Some older texts recommend that only factors with an eigenvalue (the y-axis in the scree plot) greater than 1 be kept in the model. It is widely agreed that this so-called Kaiser rule is inappropriate (Costello & Osborne, 2005), as it runs the risk of leaving out important factors.

Similarly, some older texts also recommend using principal components analysis to fit factor models. While the two are mathematically similar in that both in some sense reduce the dimensionality of the data, PCA and factor analysis are designed to target different problems. Factor analysis is concerned with an underlying causal structure where the unobserved factors affect the observed variables. In contrast, PCA simply seeks to create a small number of variables that summarise the variation in the data, which can work well even if there are no unobserved factors affecting the variables.

\[\sim\]

**Exercise 4.32 **Factor analysis only relies on the covariance or correlation matrix of your data. When using `fa`

and other functions for factor analysis, you can input either a data frame or a covariance/correlation matrix. Read about the `ability.cov`

data that comes shipped with R, and perform a factor analysis of it.

### 4.11.2 Latent class analysis

When there is a single categorical latent variable, factor analysis overlaps with clustering, which we studied in Section 4.10. Whether we think of the values of the latent variable as clusters, classes, factor levels, or something else is mainly a philosophical question - from a mathematical perspective it doesn’t matter what name we use for them.

When observations from the same cluster are assumed to be uncorrelated, the resulting model is called *latent profile analysis*, which typically is handled using model-based clustering (Section 4.10.4). The special case where the observed variables are categorical, is instead known as *latent class analysis*. This is common e.g. in analyses of survey data, and we’ll have a look at such an example in this section. The package that we’ll use for our analyses is called `poLCA`

- let’s install it:

`install.packages("poLCA")`

The National Mental Health Services Survey is an annual survey collecting information about mental health treatment facilities in the US. We’ll analyse data from the 2019 survey, courtesy of the Substance Abuse and Mental Health Data Archive, and try to find latent classes. Download `nmhss-puf-2019.csv`

from the book’s web page, and set `file_path`

to its path. We can then load and look at a summary of the data using:

```
<- read.csv(file_path)
nmhss summary(nmhss)
```

All variables are categorical (except perhaps for the first one, which is an identifier). According to the survey’s documentation, negative values are used to represent missing values. For binary variables, `0`

means no/non-presence and `1`

means yes/presence.

Next, we’ll load the `poLCA`

package and read the documentation for the function that we’ll use for the analysis.

```
library(poLCA)
?poLCA
```

As you can see in the description of the `data`

argument, the observed variables (called *manifest variables* here) are only allowed to contain consecutive integer values, starting from 1. Moreover, missing values should be represented by `NA`

, and not by negative numbers (just as elsewhere in R!). We therefore need to make two changes to our data:

- Change negative values to
`NA`

, - Change the levels of binary variables so that
`1`

means no/non-presence and`2`

means yes/presence.

In our example, we’ll look at variables describing what treatments are available at the different facilities. Let’s create a new data frame for those variables:

```
<- nmhss[, names(nmhss)[17:30]]
treatments summary(treatments)
```

To make the changes to the data that we need, we can do the following:

```
# Change negative values to NA:
< 0] <- NA
treatments[treatments
# Change binary variables from 0 and 1 to
# 1 and 2:
<- treatments + 1
treatments
# Check the results:
summary(treatments)
```

We are now ready to get started with out analysis. To begin with, we will try to find classes based on whether or not the facilities offer the following five treatments:

`TREATPSYCHOTHRPY`

: Facility offers individual psychotherapy,`TREATFAMTHRPY`

: Facility offers couples/family therapy,`TREATGRPTHRPY`

: Facility offers group therapy,`TREATCOGTHRPY`

: Facility offers cognitive behavioural therapy,`TREATPSYCHOMED`

: Facility offers psychotropic medication.

The `poLCA`

function needs three inputs: a formula describing what observed variables to use, a data frame containing the observations, and `nclass`

, the number of latent classes to find. To begin with, let’s try two classes:

```
<- poLCA(cbind(TREATPSYCHOTHRPY, TREATFAMTHRPY,
m
TREATGRPTHRPY, TREATCOGTHRPY,~ 1,
TREATPSYCHOMED) data = treatments, nclass = 2)
```

The output shows the probabilities of 1’s (no/non-presence) and 2’s (yes/presence) for the two classes. So, for instance, from the output

```
$TREATPSYCHOTHRPY
Pr(1) Pr(2)
1: 0.6628 0.3372
class 2: 0.0073 0.9927 class
```

we gather that 34 % of facilities belonging to the first class offer individual psychotherapy, whereas 99 % of facilities from the second class offer individual psychotherapy. Looking at the other variables, we see that the second class always has high probabilities of offering therapies, while the first class doesn’t. Interpreting this, we’d say that the second class contains facilities that offer a wide variety of treatments, and the first facilities that only offer some therapies. Finally, we see from the output that 88 % of the facilities belong to the second class:

```
Estimated class population shares 0.1167 0.8833
```

We can visualise the class differences in a plot:

`plot(m)`

To see which classes different observations belong to, we can use:

`$predclass m`

Just as in a cluster analysis, it is often a good idea to run the analysis with different numbers of classes. Next, let’s try 3 classes:

```
<- poLCA(cbind(TREATPSYCHOTHRPY, TREATFAMTHRPY,
m
TREATGRPTHRPY, TREATCOGTHRPY,~ 1,
TREATPSYCHOMED) data = treatments, nclass = 3)
```

This time, we run into numerical problems - the model estimation has failed, as indicated by the following warning message:

`: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND ALERT`

`poLCA`

fits the model using a method known as the *EM algorithm*, which find maximum likelihood estimates numerically. First, the observations are randomly assigned to the classes. Step by step, the observations are then moved between classes, until the optimal split has been found. It can however happen that more steps are needed to find the optimum (by default 1,000 steps are used), or that we end up with unfortunate initial class assignments that prevent the algorithm from finding the optimum. To attenuate this problem, we can increase the number of steps used, or run the algorithm multiple times, each with new initial class assignments. The `poLCA`

arguments for this are `maxiter`

, which controls the number of steps (or iterations) used, and `nrep`

, which controls the number of repetitions with different initial assignments. We’ll increase both, and see if that helps. Note that this means that the algorithm will take longer to run:

```
<- poLCA(cbind(TREATPSYCHOTHRPY, TREATFAMTHRPY,
m
TREATGRPTHRPY, TREATCOGTHRPY,~ 1,
TREATPSYCHOMED) data = treatments, nclass = 3,
maxiter = 2500, nrep = 5)
```

These setting should do the trick for this dataset, and you probably won’t see a warning message this time. If you do, try increasing either number and run the code again.

The output that you get can differ between runs - in particular, the order of the classes can differ depending on initial assignments. Here is part of the output from my run:

```
$TREATPSYCHOTHRPY
Pr(1) Pr(2)
1: 0.0076 0.9924
class 2: 0.0068 0.9932
class 3: 0.6450 0.3550
class
$TREATFAMTHRPY
Pr(1) Pr(2)
1: 0.1990 0.8010
class 2: 0.0223 0.9777
class 3: 0.9435 0.0565
class
$TREATGRPTHRPY
Pr(1) Pr(2)
1: 0.0712 0.9288
class 2: 0.3753 0.6247
class 3: 0.4935 0.5065
class
$TREATCOGTHRPY
Pr(1) Pr(2)
1: 0.0291 0.9709
class 2: 0.0515 0.9485
class 3: 0.5885 0.4115
class
$TREATPSYCHOMED
Pr(1) Pr(2)
1: 0.0825 0.9175
class 2: 1.0000 0.0000
class 3: 0.3406 0.6594
class
Estimated class population shares 0.8059 0.0746 0.1196
```

We can interpret this as follows:

- Class 1 (81 % of facilities): Offer all treatments, including psychotropic medication.
- Class 2 (7 % of facilities): Offer all treatments, except for psychotropic medication.
- Class 3 (12 % of facilities): Only offer some treatments, which may include psychotropic medication.

You can either let interpretability guide your choice of how many classes to include in your analysis, our use model fit measures like \(AIC\) and \(BIC\), which are printed in the output and can be obtained from the model using:

```
$aic
m$bic m
```

The lower these are, the better is the model fit.

If you like, you can add a covariate to your latent class analysis, which allows you to simultaneously find classes and study their relationship with the covariate. Let’s add the variable `PAYASST`

, which says whether a facility offers treatment at no charge or minimal payment to clients who cannot afford to pay, to our data, and then use that as a covariate.

```
# Add PAYASST variable to data, then change negative values
# to NA's:
$PAYASST <- nmhss$PAYASST
treatments$PAYASST[treatments$PAYASST < 0] <- NA
treatments
# Run LCA with covariate:
<- poLCA(cbind(TREATPSYCHOTHRPY, TREATFAMTHRPY,
m
TREATGRPTHRPY, TREATCOGTHRPY,~ PAYASST,
TREATPSYCHOMED) data = treatments, nclass = 3,
maxiter = 2500, nrep = 5)
```

My output from this model includes the following tables:

```
=========================================================
for 3 latent classes:
Fit =========================================================
2 / 1
Pr(>|t|)
Coefficient Std. error t value 0.10616 0.18197 0.583 0.570
(Intercept) 0.43302 0.11864 3.650 0.003
PAYASST =========================================================
3 / 1
Pr(>|t|)
Coefficient Std. error t value 1.88482 0.20605 9.147 0
(Intercept) 0.59124 0.10925 5.412 0
PAYASST =========================================================
```

The interpretation is that both class 2 and class 3 differ significantly from class 1 (the p-values in the `Pr(>|t|)`

column are low), with the positive coefficients for `PAYASST`

telling us that class 2 and 3 facilities are more likely to offer pay assistance than class 1 facilities.

\[\sim\]

**Exercise 4.33 **The `cheating`

dataset from `poLCA`

contains students’ answers to four questions about cheating, along with their grade point averages (GPA). Perform a latent class analysis using GPA as a covariate. What classes do you find? Does having a high GPA increase the probability of belonging to either class?

See

`?theme_grey`

for a list of available themes.↩︎Note that it is not just the prices nor just the carats of these diamonds that make them outliers, but the unusual combinations of prices and carats!↩︎

LOESS, LOcally Estimated Scatterplot Smoothing, is a non-parametric regression method that fits a polynomial to local areas of the data.↩︎

GAM, Generalised Additive Model, is a generalised linear model where the response variable is a linear function of smooth functions of the predictors.↩︎

Time series objects are a special class of vectors, with data sampled a equispaced points in time. Each observation can have a year/date/time associated with it.↩︎

Florence Nightingale, who famously pioneered the use of the pie chart, drew her pie charts this way.↩︎

A linear combination is a weighted sum of the form \(a_1x_1+a_2x_2+\cdots+a_px_p\). If you like, you can think of principal components as weighted averages of variables, computed for each row in your data.↩︎

Just to be clear, 5 is just an arbitrary number here. We could of course want 4, 14, or any other number of clusters.↩︎