1 Introduction

1.1 Welcome to R

Welcome to the wonderful world of R!

R is not like other statistical software packages. It is free, versatile, fast, and modern. It has a large and friendly community of users who help answer questions and develop new R tools. With more than 17,000 add-on packages available, R offers more functions for data analysis than any other statistical software. This includes specialised tools for disciplines as varied as political science, environmental chemistry, and astronomy, and new methods come to R long before they come to other programs. R makes it easy to construct reproducible analyses and workflows, so that you can rerun the same analysis whenever you need to.

R is not like other programming languages. It was developed by statisticians as a tool for data analysis and not by software engineers as a tool for other programming tasks. It is designed from the ground up to handle data, and that shows. But it is also flexible enough to be used to create interactive web pages, automated reports, and APIs.

R is, simply put, currently the best tool there is for data analysis.

1.2 About this book

This book was born out of lecture notes and materials that I created for courses at the University of Edinburgh, Uppsala University, Dalarna University, the Swedish University of Agricultural Sciences, and Karolinska Institutet. It can be used as a textbook, for self-study, or as a reference manual for R. No background in programming is assumed.

This is not a book that has been written with the intention that you should read it from cover to cover. Rather, it is intended to serve as a guide to what to do next as you explore R. Think of it as a conversation, where you and I discuss different topics related to data analysis and data wrangling. At times I’ll do the talking, introducing concepts and posing questions. At times you’ll do the talking, working with exercises and discovering all that R has to offer. The best way to learn R is to use R. You should strive for active learning, meaning that you should spend more time with R and less time stuck with your nose in a book. Together we will strive for an exploratory approach, where the text guides you to discoveries and the exercises challenge you to go further. This is how I’ve been teaching R since 2008, and I hope that it’s a way that you will find works well for you.

The book contains more than 200 exercises. Apart from a number of open-ended questions about ethical issues, all exercises involve R code, and all of these have worked solutions. I highly recommend that you work through all the exercises, as they are central to the approach to learning that this book seeks to support: using R to solve problems is a much better way to learn the language than just reading about how to use R to solve problems. Once you have finished an exercise (or attempted but failed to finish it), read the proposed solution, as it may differ from what you came up with and will sometimes contain comments that you may find interesting. Treat the proposed solutions as a part of our conversation. As you work with the exercises and compare your solutions to those in the back of the book, you will gain more and more experience working with R and build your own library of examples of how problems can be solved.

Some books on R focus entirely on data science - data wrangling and exploratory data analysis - ignoring the many great tools R has to offer for deeper data analyses. Others focus on predictive modelling or classical statistics but ignore data-handling, which is a vital part of modern statistical work. Many introductory books on statistical methods put too little focus on recent advances in computational statistics and advocate methods that have become obsolete. Far too few books contain discussions of ethical issues in statistical practice. This book aims to cover all of these topics and show you the state-of-the-art tools for all these tasks. It covers data science and (modern!) classical statistics as well as predictive modelling and machine learning, and deals with important topics that rarely appear in other introductory texts, such as simulation. It is written for R 4.0 or later and will teach you powerful add-on packages like data.table, dplyr, ggplot2, and caret.

The book is organised as follows:

Chapter 2 covers basic concepts and shows how to use R to compute descriptive statistics and create nice-looking plots.

Chapter 3 is concerned with how to import and handle data in R, and how to perform routine statistical analyses.

Chapter 4 covers exploratory data analysis using statistical graphics, as well as unsupervised learning techniques like principal components analysis and clustering. It also contains an introduction to R Markdown, a powerful markup language that can be used e.g. to create reports.

Chapter 5 describes how to deal with messy data - including filtering, rearranging and merging datasets - and different data types.

Chapter 6 deals with programming in R, and covers concepts such as iteration, conditional statements and functions.

Chapters 4-6 can be read in any order.

Chapter 7 is concerned with classical statistical topics like estimation, confidence intervals, hypothesis tests, and sample size computations. Frequentist methods are presented alongside Bayesian methods utilising weakly informative priors. It also covers simulation and important topics in computational statistics, such as the bootstrap and permutation tests.

Chapter 8 deals with various regression models, including linear, generalised linear and mixed models. Survival models and methods for analysing different kinds of censored data are also included, along with methods for creating matched samples.

Chapter 9 covers predictive modelling, including regularised regression, machine learning techniques, and an introduction to forecasting using time series models. Much focus is given to cross-validation and ways to evaluate the performance of predictive models.

Chapter 10 gives an overview of more advanced topics, including parallel computing, matrix computations, and integration with other programming languages.

Chapter 11 covers debugging, i.e. how to spot and fix errors in your code. It includes a list of more than 25 common error and warning messages, and advice on how to resolve them.

Chapter 12 covers some mathematical aspects of methods used in Chapters 7-9.

Finally, Chapter 13 contains fully worked solutions to all exercises in the book.

The datasets that are used for the examples and exercises can be downloaded from

http://www.modernstatisticswithr.com/data.zip

I have opted not to put the datasets in an R package, because I want you to practice loading data from files, as this is what you’ll be doing whenever you use R for real work.

This book is available both in print and as an open access online book. The digital version of the book is offered under the Creative Commons CC BY-NC-SA 4.0 license, meaning that you are free to redistribute and build upon the material for non-commercial purposes, as long as appropriate credit is given to the author. The source for the book is available on its GitHub page (https://github.com/mthulin/mswr-book).

I am indebted to the numerous readers who have provided feedback on drafts of this book. My sincerest thanks go out to all of you. Any remaining misprints are, obviously, entirely my own fault.

Finally, there are countless packages and statistical methods that deserve a mention but aren’t included in the book. Like any author, I’ve had to draw the line somewhere. If you feel that something is missing, feel free to post an issue on the book’s GitHub page, and I’ll gladly consider it for future revisions.