1 Introduction

1.1 Welcome to R

Welcome to the wonderful world of R!

R is not like other statistical software packages. It is free, versatile, fast, and modern. It has a large and friendly community of users who help answer questions and develop new R tools. With more than 17,000 add-on packages available, R offers more functions for data analysis than any other statistical software. This includes specialised tools for disciplines as varied as political science, environmental chemistry, and astronomy, and new methods come to R long before they come to other programs. R makes it easy to construct reproducible analyses and workflows, so that you can rerun the same analysis whenever you need to.

R is not like other programming languages. It was developed by statisticians as a tool for data analysis and not by software engineers as a tool for other programming tasks. It is designed from the ground up to handle data, and it shows. But it is also flexible enough to be used to create interactive web pages, automated reports, and application programming interfaces (APIs).

Simply put, R is currently the best tool there is for data analysis.

1.2 About this book

This book was born out of lecture notes and materials that I created for courses at the University of Edinburgh, Uppsala University, Dalarna University, the Swedish University of Agricultural Sciences, and Karolinska Institutet. It can be used as a textbook, for self-study, or as a reference manual for R. No background in programming is assumed.

This book is not meant to be read cover to cover. Rather, it is intended to serve as a guide to what to do next as you explore R. Think of it as a conversation, where you and I discuss different topics related to data analysis and data wrangling. At times I’ll do the talking, introducing concepts and posing questions. At times you’ll do the talking, working with exercises and discovering all that R has to offer. The best way to learn R is to use R. You should strive for active learning, meaning that you should spend more time with R and less time stuck with your nose in a book. Together we will strive for an exploratory approach, where the text guides you to discoveries and the exercises challenge you to go further. This is how I’ve been teaching R since 2008, and I hope that it’s a way that you will find works well for you.

The book contains more than 200 exercises. Apart from a number of open-ended questions about ethical issues, all exercises involve R code. These exercises all have worked solutions available on the book’s webpage, http://www.modernstatisticswithr.com. I highly recommend that you work through all the exercises, as they are central to the approach to learning that this book seeks to support: using R to solve problems is a much better way to learn the language than just reading about how to use R to solve problems. Once you have finished an exercise (or attempted but failed to finish it), read the proposed solution; it may differ from what you came up with and will sometimes contain comments that you may find interesting. Treat the proposed solutions as a part of our conversation. As you work with the exercises and compare your solutions to the proposed ones, you will gain more and more experience working with R and build your own library of examples of how problems can be solved.

Some books on R focus entirely on data science – data wrangling and exploratory data analysis – ignoring the many great tools R has to offer for deeper data analyses. Others focus on predictive modelling or classical statistics but ignore data handling, which is a vital part of modern statistical work. Many introductory books on statistical methods put too little focus on recent advances in computational statistics and advocate methods that have become obsolete. Far too few books contain discussions of ethical issues in statistical practice. This book aims to cover all of these topics and show you state-of-the-art tools for each of them. It covers data science and (modern!) classical statistics as well as predictive modelling and machine learning, and deals with important topics that rarely appear in other introductory texts, such as simulation. It is written for R 4.3 or later and will teach you powerful add-on packages like data.table, dplyr, ggplot2, and caret.

The expanded second edition includes new and updated examples throughout the book, and new material on, among other things, fundamental statistical concepts, survival analysis, and structural equation models.

The book is organised as follows:

Chapter 2 covers basic concepts and shows how to use R to import and handle data, compute descriptive statistics, create nice-looking plots, and use pipes.

Chapter 3 is concerned with concepts like hypothesis testing and confidence intervals, and how to use common statistical methods like the t-test and \(\chi^2\)-tests in R. It also covers the basics of computer-intensive methods like permutation tests and the bootstrap.

Chapter 4 covers exploratory data analysis using statistical graphics, as well as unsupervised learning techniques like principal components analysis and clustering. It also contains an introduction to R Markdown, a powerful markup language that can be used, e.g., to create reports.

Chapter 5 describes how to deal with messy data – including filtering, rearranging and merging datasets – and different data types.

Chapter 6 deals with programming in R and covers concepts such as iteration, conditional statements and functions.

Chapters 4-6 can be read in any order.

Chapter 7 is concerned with simulation and its role in modern statistics. This includes sample size computations, the bootstrap, and evaluation of statistical methods. It is not required reading for understanding most of the material in subsequent chapters, but the principles presented in this chapter are used in some of the more advanced examples.

Chapter 8 deals with various regression models, including linear, generalised linear and mixed models. It also includes material on multiple imputation of missing values, along with methods for creating matched samples.

Chapter 9 is about survival analysis and methods for analysing different kinds of censored data, such as data with detection limits.

Chapter 10 describes exploratory factor analysis and confirmatory analyses using structural equation models, as well as mediation analysis.

Chapter 11 covers predictive modelling, including regularised regression, machine learning techniques, and an introduction to forecasting using time series models. Much focus is given to cross-validation and ways to evaluate the performance of predictive models.

Chapter 12 gives an overview of more advanced topics, including parallel computing, matrix computations, and integration with other programming languages.

Chapter 13 covers debugging, i.e., how to spot and fix errors in your code. It includes a list of more than 25 common error and warning messages, and advice on how to resolve them.

Chapter 14 covers some mathematical aspects of methods used in Chapters 7-11.

Finally, Chapter 15, available online, contains fully worked solutions to all exercises in the book.

The datasets that are used for the examples and exercises can be downloaded from:

http://www.modernstatisticswithr.com/data.zip

I have opted not to put the datasets in an R package, because I want you to practice loading data from files, as this is what you’ll be doing whenever you use R for real work.

This book is available both in print and as an open access online book at: http://www.modernstatisticswithr.com.

I am indebted to the team at Chapman & Hall for their work on this book, and to the numerous readers who have provided feedback on various drafts. My sincerest thanks go out to all of you. Any remaining errors are, obviously, entirely my own fault.

Finally, there are countless packages and statistical methods that deserve a mention but aren’t included in the book. Like any author, I’ve had to draw the line somewhere. If you feel that something is missing, feel free to get in touch, and I’ll gladly consider it for future revisions.