Reading and Writing Data
Analysis starts when data enters R and becomes an object with known columns, classes, and missing-value conventions. The Book of R introduces built-in data sets, text tables, spreadsheet workbooks, web files, graphics output, and ad hoc object storage. The details vary by file type, but the same discipline applies: know the path, know the delimiter, inspect the imported object, and save outputs in a format suited to their next use.
The safest workflow is to separate raw input from derived output. Raw files should remain unchanged. R scripts should read raw files, perform explicit cleaning steps, and write cleaned data, tables, plots, or serialized R objects to predictable locations. This keeps the analysis reproducible and makes mistakes easier to trace.
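A minimal sketch of that discipline, using temporary files as stand-ins for a project's data/raw and data/clean folders:

```r
# Stand-in paths; a real project would use stable locations like data/raw/survey.csv.
raw_path   <- tempfile(fileext = ".csv")
clean_path <- tempfile(fileext = ".csv")
writeLines("id,group,response\n1,control,5.1\n2,treated,.", raw_path)

raw <- read.csv(raw_path, na.strings = ".")     # raw file is read, never edited
clean <- raw[!is.na(raw$response), ]            # explicit cleaning step in the script
write.csv(clean, clean_path, row.names = FALSE) # derived output goes to a new file
```

The raw file is only ever an input; every derived version can be regenerated by rerunning the script.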
Definitions
A CSV file is a comma-separated text file. It is commonly read with read.csv() or read.table(sep = ",", header = TRUE), and written with write.csv().
A delimited text file stores rows as lines and columns separated by a delimiter such as comma, tab, semicolon, or pipe. read.table() is the flexible base R reader; read.delim() is a tab-delimited convenience wrapper.
An Excel workbook is a spreadsheet file such as .xlsx. Base R does not read .xlsx directly; contributed packages such as readxl are commonly used.
An RData file stores one or more R objects by name. save() writes named objects to .RData, and load() restores them into an environment. An RDS file stores one object without forcing its name; saveRDS() and readRDS() are often cleaner for scripted workflows.
A graphics device is an output target for plots. Devices include the interactive plot pane, png(), pdf(), and other file devices. Plot code runs while the device is open, and dev.off() closes the file.
Key results
Import decisions should be explicit:
| Format | Read function | Write function | Preserves R classes well? | Good use |
|---|---|---|---|---|
| CSV | read.csv | write.csv | Partly | Exchange with spreadsheets and other tools |
| Tab-delimited | read.delim | write.table | Partly | Plain text data transfer |
| Excel .xlsx | readxl::read_excel | writexl::write_xlsx or other package | Partly | Spreadsheet collaboration |
| RData | load | save | Yes | Multiple R objects |
| RDS | readRDS | saveRDS | Yes | One object with explicit assignment |
| Image/PDF plot | Not usually read as data | png, pdf, jpeg | Not data | Reports and figures |
After every import, inspect structure and dimensions:
dim(dat)
names(dat)
str(dat)
summary(dat)
head(dat)
CSV import can go wrong when files use non-comma delimiters, decimal commas, embedded commas in quoted strings, nonstandard missing codes, or inconsistent header rows. Use arguments such as sep, header, na.strings, stringsAsFactors, and colClasses when defaults do not match the file.
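Base R covers the common delimiter variants directly. A small sketch of reading a European-style file, where the delimiter is a semicolon and the decimal mark is a comma:

```r
# Semicolon-delimited text with decimal commas.
txt <- "id;value\n1;3,14\n2;2,72"

# read.csv2 assumes sep = ";" and dec = ","
eu <- read.csv2(textConnection(txt))

# Equivalent explicit call via read.table:
eu2 <- read.table(textConnection(txt), sep = ";", dec = ",", header = TRUE)

str(eu$value)  # numeric, not character
```

If the wrong reader is used, the values arrive as character strings like "3,14", which is exactly the failure str() catches.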
RData is convenient but can hide object names. load("file.RData") creates objects in the current environment and returns their names invisibly. RDS is usually more explicit:
clean_data <- readRDS("clean_data.rds")
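The contrast can be seen in a few lines (the object name here is illustrative):

```r
tmp <- tempfile(fileext = ".RData")
scores <- c(10, 20, 30)
save(scores, file = tmp)

rm(scores)            # remove the object from the environment
loaded <- load(tmp)   # restores 'scores' under its saved name
loaded                # the restored names, returned as a character vector
```

With load(), the reader must inspect the return value or the environment to learn what appeared; with readRDS(), the assignment makes the name explicit at the call site.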
Visual
| Checkpoint | Command | Failure it catches |
|---|---|---|
| File exists | file.exists(path) | Wrong working directory or path |
| Row/column shape | dim(dat) | Header parsed as data, delimiter wrong |
| Column names | names(dat) | Unexpected names or duplicates |
| Column classes | str(dat) | Numeric data imported as character |
| Missing values | colSums(is.na(dat)) | Unrecognized missing-value codes |
Worked example 1: Reading a CSV string and cleaning classes
Problem: a small CSV stores patient id, group, and response. The missing response code is ".". Read it, convert group to a factor, convert response to numeric, and compute group means.
Method:
- Represent the CSV text with textConnection for a self-contained example.
- Use read.csv with na.strings = ".".
- Inspect the result.
- Convert group to a factor with explicit levels.
- Aggregate response by group.
- Check one mean manually.
txt <- "id,group,response
1,control,5.1
2,treated,6.3
3,control,.
4,treated,6.9"
dat <- read.csv(textConnection(txt), na.strings = ".")
dat$group <- factor(dat$group, levels = c("control", "treated"))
str(dat)
# 'data.frame': 4 obs. of 3 variables:
# $ id : int 1 2 3 4
# $ group : Factor w/ 2 levels "control","treated": 1 2 1 2
# $ response: num 5.1 6.3 NA 6.9
aggregate(response ~ group, data = dat, FUN = mean, na.rm = TRUE)
# group response
# 1 control 5.1
# 2 treated 6.6
Checked answer: the treated responses are 6.3 and 6.9, so their mean is (6.3 + 6.9) / 2 = 6.6. The control group has one observed response, 5.1, because the other is missing.
The example is small, but the import policy is realistic: missing codes are declared during reading, then class choices are made explicitly.
Worked example 2: Saving a cleaned object and a plot
Problem: create a cleaned mtcars subset with model names and four variables. Save it as RDS, read it back, and write a PNG scatterplot.
Method:
- Build the cleaned data frame.
- Save one object with saveRDS.
- Read it back with explicit assignment.
- Open a PNG graphics device.
- Draw the plot.
- Close the device with dev.off().
cars <- mtcars[, c("mpg", "wt", "hp", "cyl")]
cars$model <- rownames(mtcars)
tmp_rds <- tempfile(fileext = ".rds")
tmp_png <- tempfile(fileext = ".png")
saveRDS(cars, tmp_rds)
cars2 <- readRDS(tmp_rds)
identical(cars, cars2)
# [1] TRUE
png(tmp_png, width = 700, height = 500)
plot(cars2$wt, cars2$mpg, pch = 19, xlab = "Weight", ylab = "MPG")
dev.off()
Checked answer: identical(cars, cars2) returns TRUE, so the RDS round trip preserved the object. The plot file path is temporary in this example, but a project script would use a stable path such as "figures/mpg_vs_weight.png".
The key habit is to close file graphics devices. If dev.off() is forgotten, the file may be incomplete or locked until the session ends.
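One defensive pattern is to wrap the device in a small helper so the file is closed even when the plotting code errors. The helper below is a sketch, not a base R function:

```r
# Hypothetical helper: guarantees dev.off() runs via on.exit.
save_plot <- function(path, draw, width = 700, height = 500) {
  png(path, width = width, height = height)
  on.exit(dev.off())   # runs when the function exits, even on error
  draw()
  invisible(path)
}

tmp <- tempfile(fileext = ".png")
save_plot(tmp, function() plot(mtcars$wt, mtcars$mpg, pch = 19,
                               xlab = "Weight", ylab = "MPG"))
```

Passing the plot code as a function keeps the open/close bookkeeping in one place.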
Code
# Reusable import checker for rectangular data.
check_import <- function(df) {
stopifnot(is.data.frame(df))
report <- data.frame(
variable = names(df),
class = vapply(df, function(x) paste(class(x), collapse = "/"), character(1)),
missing = vapply(df, function(x) sum(is.na(x)), integer(1)),
unique_values = vapply(df, function(x) length(unique(x)), integer(1)),
row.names = NULL
)
list(
dimensions = dim(df),
names = names(df),
report = report,
preview = head(df)
)
}
txt <- "site,count,status
A,10,ok
B,.,check
C,15,ok"
example <- read.csv(textConnection(txt), na.strings = ".")
print(check_import(example))
The checker returns a list rather than only printing text because the pieces may be useful later. dimensions can be compared with an expected row and column count. report can be written to a quality-control file. preview can be displayed in an interactive session. Returning structured information makes the function useful in both exploratory and scripted contexts.
File import is also where reproducibility often breaks. A script that reads "mydata.csv" depends on the working directory; a script that reads "data/mydata.csv" inside an RStudio project is clearer. A script that says na.strings = c("", "NA", ".") documents missing-value conventions; a script that accepts defaults leaves readers guessing. If an Excel file is required, the package dependency should be visible near the top of the script, and the sheet name or sheet number should be specified.
For saving, choose the format according to the next user. CSV is excellent for exchanging rectangular data with non-R tools, but it loses some R-specific classes and attributes. RDS is excellent for preserving one cleaned R object inside an R workflow. RData can save multiple objects, but it can also clutter an environment when loaded. Plot files are outputs, not data; save them from code so they can be regenerated after any change.
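The class loss from a CSV round trip is easy to demonstrate. In R 4.0 and later, read.csv() defaults to stringsAsFactors = FALSE, so a factor written out comes back as character:

```r
df <- data.frame(g = factor(c("a", "b")), x = c(1.5, 2.5))
p <- tempfile(fileext = ".csv")

write.csv(df, p, row.names = FALSE)
back <- read.csv(p)

class(df$g)    # "factor"
class(back$g)  # "character" under R >= 4.0 defaults
```

An RDS round trip of the same data frame would preserve the factor, its levels, and any other attributes.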
Finally, never overwrite raw data as part of cleaning. Write cleaned data to a new file or object. That single rule makes it possible to rerun the analysis, audit decisions, and recover when a cleaning assumption changes.
For larger projects, keep a small data dictionary beside the import code. It should name each column, its intended class, allowed missing codes, units, and any factor levels. The import script can then be checked against that dictionary: numeric columns should be numeric, categorical columns should have expected levels, and dates should parse without unexpected NAs. This is a practical way to turn informal knowledge about a file into reproducible validation.
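A sketch of dictionary-driven validation; the dictionary contents and the check_classes helper are illustrative, not a standard API:

```r
# Hypothetical data dictionary: one row per expected column.
dictionary <- data.frame(
  variable = c("id", "group", "response"),
  class    = c("integer", "factor", "numeric"),
  stringsAsFactors = FALSE
)

# Compare each column's actual class against the dictionary.
check_classes <- function(df, dict) {
  actual <- vapply(df[dict$variable], function(x) class(x)[1], character(1))
  data.frame(variable = dict$variable,
             expected = dict$class,
             actual   = actual,
             ok       = actual == dict$class,
             row.names = NULL)
}

dat <- data.frame(id = 1:3,
                  group = factor(c("a", "b", "a")),
                  response = c(1.5, NA, 2.5))
check_classes(dat, dictionary)
```

Run immediately after import, a check like this turns the dictionary from documentation into an executable test.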
It is also useful to separate input paths and output paths. A common project layout has data/raw, data/clean, figures, and results. The exact names are less important than the separation. Raw data are read-only inputs; clean data are reproducible outputs; figures and result tables are generated artifacts. This organization makes scripts easier to rerun and reduces the chance of confusing source material with derived material.
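Creating that layout can itself be scripted, so a fresh clone of the project starts with the expected folders. A sketch, using a temporary directory as the project root:

```r
root <- tempfile("project")   # stand-in for a real project directory

# Create the standard layout; showWarnings = FALSE makes this safe to rerun.
for (d in c("data/raw", "data/clean", "figures", "results")) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}
```

Because dir.create() is idempotent here, the same lines can sit at the top of the main script without harm.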
Common pitfalls
- Reading a file from the wrong working directory. Check getwd() and file.exists(path).
- Letting a missing code such as ".", "NA ", or "-99" become ordinary data.
- Trusting printed output without checking str(). Numeric-looking columns may be character.
- Saving analysis state only as .RData. Scripts plus raw data are more reproducible.
- Forgetting row.names = FALSE when writing CSV files for non-R users.
- Opening a graphics device and forgetting dev.off().
- Using Excel import code without recording the package dependency.