R

R is a language for programming with data and a working environment for statistics, graphics, simulation, and modeling. Tilman M. Davies's The Book of R teaches these pieces in a deliberate order: first the language, then programming structures, then probability and statistics, then statistical modeling, and finally richer graphics. These notes follow that arc while keeping each page focused on the concepts, R idioms, worked examples, and cross-links needed for SJ Wiki review.

The course begins with interactive work in the console and scripts, then builds outward from vectors to matrices, lists, data frames, factors, files, functions, apply-style iteration, descriptive statistics, probability distributions, inference, regression, generalized models, object-oriented behavior, and plotting systems. The pages are original study notes based on the textbook's scope rather than a replacement for the book's exercises or prose.

Definitions

R is both a programming language and a statistical computing environment. It evaluates expressions interactively, stores objects in environments, and provides a large standard library for data manipulation, probability, modeling, and graphics.

Base R refers to the language and packages distributed with R itself, including core data structures, graphics, statistics, and utilities. Much of The Book of R is intentionally base-R-first so readers understand the language before relying on contributed packages.

CRAN is the Comprehensive R Archive Network, the main distribution network for R and contributed packages. Packages such as ggplot2, spreadsheet readers, or 3D graphics tools are installed once from CRAN or another repository with install.packages() and then loaded in each session with library().
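As a minimal sketch of the install-once, load-per-session pattern, the snippet below checks whether a package is available before loading it. The variable names `pkg` and `has_pkg` are illustrative, and the install step is only suggested in a message because it needs a network connection.

```r
# Check whether a contributed package is installed before loading it.
pkg <- "ggplot2"
has_pkg <- requireNamespace(pkg, quietly = TRUE)

if (has_pkg) {
  # character.only = TRUE lets library() take the name from a variable
  library(pkg, character.only = TRUE)
} else {
  message("Not installed. Try: install.packages(\"", pkg, "\")")
}
```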

A script is a saved .R file containing commands that can be rerun. Reproducible analysis depends on scripts more than console history or saved workspace images.

An object is a value bound to a name. Objects include vectors, matrices, arrays, lists, data frames, factors, functions, fitted models, hypothesis-test results, and plots.
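The following sketch shows several of these object kinds side by side. The names `x`, `m`, `f`, `fit`, and `tt` are invented for illustration; only the built-in mtcars data set is assumed.

```r
# Each assignment binds a name to a value; class() reports the kind of object.
x   <- c(1.5, 2.5, 4)                 # numeric vector
m   <- matrix(1:6, nrow = 2)          # integer matrix
f   <- function(v) mean(v)            # functions are objects too
fit <- lm(mpg ~ wt, data = mtcars)    # a fitted model
tt  <- t.test(mtcars$mpg)             # a hypothesis-test result

class(x)       # "numeric"
is.matrix(m)   # TRUE
class(f)       # "function"
class(fit)     # "lm"
class(tt)      # "htest"
```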

A data frame is R's main rectangular data object: a list of equal-length columns, where each column can have its own type or class.
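A small sketch can make the "list of equal-length columns" definition concrete. The `plants` data and its column names are hypothetical, chosen only for illustration.

```r
# Hypothetical plant data; three columns with three different classes.
plants <- data.frame(
  id        = 1:4,
  treatment = factor(c("control", "control", "fertilizer", "fertilizer")),
  height_cm = c(10.2, 11.1, 14.8, NA)
)

is.list(plants)         # TRUE: a data frame is built on a list
sapply(plants, class)   # one class per column
lengths(plants)         # every column has the same length
```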

A model formula is R's compact syntax for statistical models, such as mpg ~ wt + hp, meaning "model mpg using wt and hp."
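A formula is itself an object that can be stored and inspected before any model is fitted. The sketch below uses the built-in mtcars data; the names `f` and `fit` are illustrative.

```r
# Store the formula as an object, then inspect it.
f <- mpg ~ wt + hp

class(f)      # "formula"
all.vars(f)   # "mpg" "wt" "hp"

# The same formula object can be handed to a modeling function.
fit <- lm(f, data = mtcars)
names(coef(fit))   # "(Intercept)" "wt" "hp"
```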

Key results

The textbook's table of contents supports five large blocks:

| Textbook block | Chapters | Wiki coverage |
|---|---|---|
| The language | 1-8 | setup, vectors, indexing, matrices, nonnumeric values, data frames, classes, files, base plotting |
| Programming | 9-12 | function calls, scoping, conditions, loops, writing functions, apply family, errors and visibility |
| Statistics and probability | 13-16 | descriptive statistics, data visualization, probability, common distributions |
| Testing and modeling | 17-22 | confidence intervals, hypothesis tests, ANOVA ideas, linear regression, multiple regression, diagnostics, model selection |
| Advanced graphics | 23-26 | devices, customization, grammar of graphics, color, contours, surfaces, and 3D plotting concepts |

The page sequence in this wiki is:

| Position | Page | Main role |
|---|---|---|
| 2 | Getting started with R | Console, RStudio workflow, packages, help, scripts |
| 3 | Vectors, arithmetic, and comparison | Numeric and character vectors, vectorization, logical tests |
| 4 | Indexing, names, and recycling | Subsetting, named lookup, replacement, recycling rules |
| 5 | Matrices and arrays | Rectangular atomic data, dimensions, matrix algebra |
| 6 | Lists and data frames | Heterogeneous containers and tabular data |
| 7 | Factors and categorical data | Levels, ordered categories, model contrasts |
| 8 | Special values, classes, and coercion | NA, NaN, Inf, NULL, type, class, conversion |
| 9 | Reading and writing data | CSV, spreadsheet, RDS/RData, graphics files |
| 10 | Control flow, functions, and scoping | if, loops, functions, environments |
| 11 | Apply family | apply, lapply, sapply, vapply, mapply |
| 12 | Probability distributions | d/p/q/r distribution functions and simulation |
| 13 | Descriptive statistics | Center, spread, tables, grouped summaries |
| 14 | Base graphics | Procedural plotting and file devices |
| 15 | ggplot2 graphics | Grammar of graphics, mappings, geoms, facets |
| 16 | Statistical inference | Confidence intervals, hypothesis tests, categorical tests |
| 17 | Linear and generalized models | lm, diagnostics, prediction, glm overview |
| 18 | Advanced graphics and 3D plots | Color, contours, surfaces, higher-dimensional plots |
| 19 | Object-oriented R | S3, S4, generic functions, class-based behavior |

The notes intentionally do not create a separate full tidyverse page because the textbook's main contributed graphics package coverage is ggplot2, not a full modern dplyr/tidyr/magrittr workflow. The ggplot2 page links that grammar-of-graphics material to the rest of the R course.

Visual

| Learning question | First page to read | Then read |
|---|---|---|
| How do I start an R analysis reproducibly? | Setup | Files, functions |
| Why does R operate on whole columns at once? | Vectors | Indexing, apply |
| How do tables and categories work? | Data frames | Factors, classes |
| How do I summarize and plot data? | Descriptive statistics | Base graphics, ggplot2 |
| How do I run tests and models? | Inference | Linear and generalized models |
| Why does summary() change behavior? | Classes | Object-oriented R |
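The last question in the table can be previewed with a short sketch: summary() is a generic function, so what it returns depends on the class of its argument. The objects `x`, `g`, and `fit` are illustrative.

```r
x   <- c(2, 4, 4, 7)                      # numeric vector
g   <- factor(c("a", "b", "a", "a"))      # factor with two levels
fit <- lm(mpg ~ wt, data = mtcars)        # fitted linear model

summary(x)            # numeric: quartiles, min, max, and mean
summary(g)            # factor: counts per level (a = 3, b = 1)
class(summary(fit))   # "summary.lm": a model-specific summary object
```

The Classes and Object-oriented R pages explain the dispatch mechanism behind this behavior.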

Worked example 1: Choosing a reading path for a data analysis task

Problem: a student has a CSV file with plant measurements and treatment groups. They need to read the file, clean missing values, summarize growth by treatment, plot the result, and fit a simple model. Choose a path through the wiki pages.

Method:

  1. Identify the first technical need: importing a CSV file.
  2. Identify the data structure after import: a data frame with numeric and categorical columns.
  3. Identify cleaning issues: missing values and class conversion.
  4. Identify summaries and visualizations.
  5. Identify inference or modeling.
  6. Map each need to a page.

Page path:

reading-and-writing-data
-> lists-and-data-frames
-> factors-and-categorical-data
-> special-values-classes-coercion
-> descriptive-statistics
-> base-graphics or ggplot2-graphics
-> statistical-inference
-> linear-and-generalized-models

Checked answer: the path starts with file import because no analysis can happen until the data is a reliable R object. It then moves to data frames and factors because treatment groups should be categorical. Missing-value and coercion checks happen before summaries. Plotting and modeling come after the data's structure is known. This path mirrors the textbook's order but skips pages that are not immediately needed, such as arrays or 3D plots.

The important study habit is to follow the dependency chain. Models depend on clean variables; clean variables depend on correct import; correct import depends on paths, delimiters, missing-value codes, and classes.
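The dependency chain can be sketched end to end with a tiny invented data set. Everything below is hypothetical: the file contents, the column names, and the use of -99 as a missing-value code are assumptions made so the example is self-contained.

```r
# Write a small hypothetical CSV to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "plant,treatment,growth",
  "1,control,4.1",
  "2,control,3.8",
  "3,fertilizer,6.0",
  "4,fertilizer,-99",
  "5,fertilizer,5.6",
  "6,control,4.4"
), csv_path)

# Import, declaring the missing-value code used in the file.
plants <- read.csv(csv_path, na.strings = c("NA", "-99"))

# Check structure and make the grouping variable categorical.
plants$treatment <- factor(plants$treatment)
str(plants)
colSums(is.na(plants))   # count of missing values per column

# Summarize growth by treatment; the formula interface drops NA rows.
aggregate(growth ~ treatment, data = plants, FUN = mean)

# Plot (uncomment in an interactive session).
# boxplot(growth ~ treatment, data = plants)

# Fit a simple model; lm() omits incomplete rows by default.
fit <- lm(growth ~ treatment, data = plants)
coef(fit)
```

Each step only works because the one before it produced a reliable object, which is the point of following the page path in order.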

Worked example 2: Translating a textbook chapter into R objects

Problem: the textbook chapter on simple linear regression introduces an equation, fitted coefficients, residuals, inference, and predictions. Translate those ideas into R objects and functions.

Method:

  1. Represent the response and predictor as data frame columns.
  2. Represent the model equation with a formula.
  3. Fit the model with lm.
  4. Extract coefficients, residuals, fitted values, and predictions.
  5. Use class-aware summaries and plots for interpretation.

# Fit mpg as a linear function of weight, using the built-in mtcars data.
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)
# (Intercept)          wt
#   37.285126   -5.344472

head(fitted(fit), 3)
#     Mazda RX4 Mazda RX4 Wag    Datsun 710
#      23.28261      21.91977      24.88595

head(resid(fit), 3)
#     Mazda RX4 Mazda RX4 Wag    Datsun 710
#     -2.282610     -0.919770     -2.085952

predict(fit, newdata = data.frame(wt = 3))
#        1
# 21.25171

Checked answer: the fitted equation is

$$\widehat{mpg} = 37.285126 - 5.344472 \cdot wt$$

For wt = 3, the prediction is 37.285126 - 5.344472 * 3 = 21.251710, matching predict. The object fit stores much more than the printed coefficients; it is an S3 object of class "lm" that works with summary, plot, resid, fitted, and predict.
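The hand calculation can also be checked in R itself. This is a small sketch; the names `b`, `manual`, and `auto` are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)
b   <- coef(fit)

# Prediction by hand: intercept + slope * wt
manual <- unname(b["(Intercept)"] + b["wt"] * 3)
auto   <- unname(predict(fit, newdata = data.frame(wt = 3)))
all.equal(manual, auto)   # TRUE

# Residuals are observed values minus fitted values.
all.equal(unname(resid(fit)), unname(mtcars$mpg - fitted(fit)))   # TRUE

# The class attribute is what lets summary(), plot(), and predict() find lm methods.
class(fit)                # "lm"
```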

This translation pattern works throughout the course: mathematical ideas become R objects; R functions operate on those objects; class-specific methods present the results.

Code

# A small map of the R notes. This is useful as a checklist for review.

r_notes <- data.frame(
  order = 1:6,
  stage = c(
    "Language basics",
    "Core data structures",
    "Import and cleaning",
    "Programming",
    "Statistics",
    "Graphics and modeling"
  ),
  pages = c(
    "setup, vectors, indexing",
    "matrices, lists, data frames, factors",
    "files, special values, classes",
    "control flow, functions, apply",
    "descriptive stats, probability, inference",
    "base graphics, ggplot2, lm, glm, OOP"
  )
)

print(r_notes)

Common pitfalls

  • Reading the modeling pages before understanding vectors, data frames, factors, and missing values.
  • Treating RStudio as the language. RStudio is an editor and workflow environment; R is the language doing the computation.
  • Memorizing function names without understanding object structure. str() often explains more than another guessed command.
  • Depending on saved workspace state instead of scripts that run from top to bottom.
  • Skipping graphics until the end. Plots are part of data checking, not just final presentation.
  • Treating the textbook's ggplot2 coverage as a complete tidyverse course. These notes cover ggplot2 where the book does, while base R remains the backbone.
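As a brief illustration of the str() advice above, the sketch below inspects a data frame and a fitted model instead of guessing at their contents. It assumes only the built-in mtcars data.

```r
# Inspect structure first: column types and a preview of values.
str(mtcars[, 1:3])

# A fitted model is a list-like object; names() shows what it stores.
fit <- lm(mpg ~ wt, data = mtcars)
names(fit)
str(fit$coefficients)   # named numeric vector of length 2
```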

Connections