References

Agresti, A. (2013). Categorical Data Analysis. Wiley.

Andersen, P.K., Gill, R.D. (1982). Cox’s regression model for counting processes: a large sample study. Annals of Statistics, 10, 1100-1120.

Barnard, J., Rubin, D.B. (1999). Small sample degrees of freedom with multiple imputation. Biometrika, 86, 948-955.

Bates, D., Mächler, M., Bolker, B., Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1-48.

Boehmke, B., Greenwell, B. (2019). Hands-On Machine Learning with R. CRC Press.

Bollen, K.A. (1989). Structural Equations with Latent Variables. Wiley Series in Probability and Mathematical Statistics. Wiley.

Box, G.E.P. (2013). An Accidental Statistician. Wiley.

Box, G.E.P., Cox, D.R. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological), 26(2), 211-243.

Breiman, L., Friedman, J., Olshen, R.A., Stone, C.J. (1984). Classification and Regression Trees. CRC Press.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Brown, L.D., Cai, T.T., DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science, 16(2), 101-117.

Buolamwini, J., Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. Proceedings of Machine Learning Research, 81, 1-15.

Cameron, A.C., Trivedi, P.K. (1990). Regression-based tests for overdispersion in the Poisson model. Journal of Econometrics, 46(3), 347-364.

Casella, G., Berger, R.L. (2002). Statistical Inference. Brooks/Cole.

Charytanowicz, M., Niewczas, J., Kulczycki, P., Kowalski, P.A., Lukasik, S., Zak, S. (2010). A complete gradient clustering algorithm for features analysis of X-ray images. In: Pietka, E., Kawa, J. (eds.), Information Technologies in Biomedicine, Springer-Verlag, Berlin-Heidelberg, 15-24.

Chollet, F., Allaire, J.J. (2022). Deep Learning with R. Second edition. Manning.

Cochran, W.G. (1954). Some methods of strengthening the common \(\chi^2\) tests. Biometrics, 10, 417-451.

Committee on Professional Ethics of the American Statistical Association. (2018). Ethical Guidelines for Statistical Practice. https://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx

Cook, R.D., Weisberg, S. (1982). Residuals and Influence in Regression. Chapman & Hall.

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553.

Costello, A.B., Osborne, J. (2005). Best practices in exploratory factor analysis: Four recommendations for getting the most from your analysis. Practical Assessment, Research, and Evaluation, 10(1), 7.

Cox, D.R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2), 187-202.

Dastin, J. (2018). Amazon scraps secret AI recruiting tool that showed bias against women. Reuters.

Davison, A.C., Hinkley, D.V. (1997). Bootstrap Methods and their Application. Cambridge University Press.

Delacre, M., Lakens, D., Leys, C. (2017). Why psychologists should by default use Welch’s t-test instead of Student’s t-test. International Review of Social Psychology, 30(1).

Drymon, M.M. (2008). Disguised As the Devil: How Lyme Disease Created Witches and Changed History. Wythe Avenue Press.

Eck, K., Hultman, L. (2007). One-sided violence against civilians in war: Insights from new fatality data. Journal of Peace Research, 44(2), 233-246.

Eddelbuettel, D., Balamuta, J.J. (2018). Extending R with C++: a brief introduction to Rcpp. The American Statistician, 72(1), 28-36.

Efron, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 78(382), 316-331.

Elston, D.A., Moss, R., Boulinier, T., Arrowsmith, C., Lambin, X. (2001). Analysis of aggregation, a worked example: numbers of ticks on red grouse chicks. Parasitology, 122(5), 563-569.

Fine, J.P., Gray, R.J. (1999). A proportional hazards model for the subdistribution of a competing risk. Journal of the American Statistical Association, 94(446), 496-509.

Fisher, R.A. (1935). The Design of Experiments. Oliver & Boyd.

Fleming, G., Bruce, P.C. (2021). Responsible Data Science: Transparency and Fairness in Algorithms. Wiley.

Franks, B. (Ed.) (2020). 97 Things About Ethics Everyone in Data Science Should Know. O’Reilly Media.

Friedman, J.H. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis, 38(4), 367-378.

Gao, L.L., Bien, J., Witten, D. (2022). Selective inference for hierarchical clustering. Journal of the American Statistical Association, DOI: 10.1080/01621459.2022.2116331.

Groll, A., Tutz, G. (2014). Variable selection for generalized linear mixed models by L1-penalized estimation. Statistics and Computing, 24(2), 137-154.

Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer Science & Business Media.

Hartigan, J.A., Wong, M.A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1), 100-108.

Henderson, H.V., Velleman, P.F. (1981). Building multiple regression models interactively. Biometrics, 37, 391-411.

Herr, D.G. (1986). On the history of ANOVA in unbalanced, factorial designs: the first 30 years. The American Statistician, 40(4), 265-270.

Hoerl, A.E., Kennard, R.W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.

Holzinger, K., Swineford, F. (1939). A Study in Factor Analysis: The Stability of a Bifactor Solution. Supplementary Educational Monograph, no. 48. University of Chicago Press.

Hu, L., Bentler, P.M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Structural Equation Modeling, 6(1), 1-55.

Hyndman, R.J., Athanasopoulos, G. (2018). Forecasting: Principles and Practice. OTexts.

Imai, K., Keele, L., Yamamoto, T. (2010). Identification, inference, and sensitivity analysis for causal mediation effects. Statistical Science, 25(1), 51-71.

James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R. Springer.

Kuznetsova, A., Brockhoff, P.B., Christensen, R.H. (2017). lmerTest package: tests in linear mixed effects models. Journal of Statistical Software, 82(13), 1-26.

Liero, H., Zwanzig, S. (2012). Introduction to the Theory of Statistical Inference. CRC Press.

Liu, X., Swenson, N.G., Lin, D., Mi, X., Umaña, M.N., Schmid, B., Ma, K. (2016). Linking individual-level functional traits to tree growth in a subtropical forest. Ecology (Durham), 97(9), 2396-2405.

Long, J.D., Teetor, P. (2019). The R Cookbook. O’Reilly Media.

Moen, A., Lind, A.L., Thulin, M., Kamali-Moghaddam, M., Roe, C., Gjerstad, J., Gordh, T. (2016). Inflammatory serum protein profiling of patients with lumbar radicular pain one year after disc herniation. International Journal of Inflammation, 2016, Article ID 3874964.

Persson, I., Arnroth, L., Thulin, M. (2019). Multivariate two-sample permutation tests for trials with multiple time-to-event outcomes. Pharmaceutical Statistics, 18(4), 476-485.

Pettersson, T., Högbladh, S., Öberg, M. (2019). Organized violence, 1989-2018 and peace agreements. Journal of Peace Research, 56(4), 589-603.

Picard, R.R., Cook, R.D. (1984). Cross-validation of regression models. Journal of the American Statistical Association, 79(387), 575-583.

Prentice, R.L., Williams, B.J., Peterson, A.V. (1981). On the regression analysis of multivariate failure time data. Biometrika, 68, 373-379.

Rasch, D., Kubinger, K.D., Moder, K. (2011). The two-sample t test: pre-testing its assumptions does not pay off. Statistical Papers, 52(1), 219.

Recht, B., Roelofs, R., Schmidt, L., Shankar, V. (2019). Do ImageNet classifiers generalize to ImageNet?. arXiv:1902.10811.

Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.

Schoenfeld, D. (1982). Partial residuals for the proportional hazards regression model. Biometrika, 69(1), 239-241.

Scrucca, L., Fop, M., Murphy, T.B., Raftery, A.E. (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1), 289.

Smith, G. (2018). Step away from stepwise. Journal of Big Data, 5(1), 32.

Thulin, M. (2014a). The cost of using exact confidence intervals for a binomial proportion. Electronic Journal of Statistics, 8, 817-840.

Thulin, M. (2014b). On Confidence Intervals and Two-Sided Hypothesis Testing. PhD thesis. Department of Mathematics, Uppsala University.

Thulin, M. (2014c). Decision-theoretic justifications for Bayesian hypothesis testing using credible sets. Journal of Statistical Planning and Inference, 146, 133-138.

Thulin, M. (2016). Two-sample tests and one-way MANOVA for multivariate biomarker data with nondetects. Statistics in Medicine, 35(20), 3623-3644.

Thulin, M., Zwanzig, S. (2017). Exact confidence intervals and hypothesis tests for parameters of discrete distributions. Bernoulli, 23(1), 479-502.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.

Tibshirani, R., Walther, G., Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.

Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econometrica, 26, 24-36.

Wasserstein, R.L., Lazar, N.A. (2016). The ASA statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129-133.

Wei, L.J. (1992). The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Statistics in Medicine, 11(14-15), 1871-1879.

Wickham, H. (2019). Advanced R. CRC Press.

Wickham, H., Bryan, J. (2023). R Packages. O’Reilly Media.

Wickham, H., Grolemund, G. (2017). R for Data Science. O’Reilly Media.

Wickham, H., Navarro, D., Lin Pedersen, T. (forthcoming). ggplot2: Elegant Graphics for Data Analysis. Third edition.

Wilke, C.O. (2019). Fundamentals of Data Visualization. O’Reilly Media.

Xie, Y., Allaire, J.J., Grolemund, G. (2018). R Markdown: The Definitive Guide. Chapman & Hall.

Zeileis, A., Hothorn, T., Hornik, K. (2008). Model-based recursive partitioning. Journal of Computational and Graphical Statistics, 17(2), 492-514.

Zhang, D., Fan, C., Zhang, J., Zhang, C.-H. (2009). Nonparametric methods for measurements below detection limit. Statistics in Medicine, 28, 700-715.

Zhang, Y., Yang, Y. (2015). Cross-validation for selecting a model selection procedure. Journal of Econometrics, 187(1), 95-112.

Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Methodological), 67(2), 301-320.

Further reading

The books listed below are highly recommended. Each either partially overlaps with the content of this book or serves as a natural next step after you finish reading it. All of them are available for free online.

  • The R Cookbook (https://rc2e.com/) by Long & Teetor (2019) contains tons of examples of how to perform common tasks in R.
  • R for Data Science (https://r4ds.had.co.nz/) by Wickham & Grolemund (2017) is similar in scope to Chapters 2-6 of this book, but with less focus on statistics and greater focus on tidyverse functions.
  • Advanced R (http://adv-r.had.co.nz/) by Wickham (2019) deals with advanced R topics, delving further into object-oriented programming, functions, and increasing the performance of your code.
  • R Packages (https://r-pkgs.org/) by Wickham & Bryan (2023) describes how to create your own R packages.
  • ggplot2: Elegant Graphics for Data Analysis (https://ggplot2-book.org/) by Wickham, Navarro & Lin Pedersen (forthcoming) is an in-depth treatise on ggplot2.
  • Fundamentals of Data Visualization (https://clauswilke.com/dataviz/) by Wilke (2019) is a software-agnostic text on data visualisation, with tons of useful advice.
  • R Markdown: The Definitive Guide (https://bookdown.org/yihui/rmarkdown/) by Xie et al. (2018) describes how to use R Markdown for reports, presentations, dashboards, and more.
  • An Introduction to Statistical Learning with Applications in R (https://www.statlearning.com/) by James et al. (2021) provides an introduction to methods for regression and classification, with examples in R (but not using caret).
  • Hands-On Machine Learning with R (https://bradleyboehmke.github.io/HOML/) by Boehmke & Greenwell (2019) covers a large number of machine learning methods.
  • Forecasting: Principles and Practice (https://otexts.com/fpp2/) by Hyndman & Athanasopoulos (2018) deals with forecasting and time series models in R.
  • Deep Learning with R (https://livebook.manning.com/book/deep-learning-with-r/) by Chollet & Allaire (2022) delves into neural networks and deep learning, including computer vision and generative models.

Online resources