Notes on Hadley Wickham’s 2014 paper “Tidy Data”, which lays out the philosophy behind the tidyverse suite of tools in R.

Philosophy of Data

The paper describes a structure and semantics for datasets that facilitate manipulation, visualization, and modeling: in short, a philosophy of data.

Tidy data table

Variables

Variables often have relationships between them, such as “z is a linear combination of x and y”.

Observations

Observations are often grouped, then aggregations are computed per group and compared.
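
As a rough illustration of grouping and aggregating, here is a minimal sketch in plain Python. The rows, the column names "group" and "value", and the helper group_means are all made up for this example and do not come from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tidy observations: one dict per observation.
observations = [
    {"group": "a", "value": 1.0},
    {"group": "a", "value": 3.0},
    {"group": "b", "value": 10.0},
    {"group": "b", "value": 20.0},
]


def group_means(rows, key, value):
    """Group rows by `key`, then average `value` within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: mean(v) for k, v in groups.items()}


means = group_means(observations, "group", "value")
print(means)
```

Because each row is one observation, the grouping key and the aggregated variable can both be named directly, which is what makes the per-group comparison straightforward.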

Tidy Data

In tidy data:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

This structure is closely related to Codd’s 3rd normal form.

Messy data

Common pathologies of messy data:

  • Column headers are values
  • Multiple variables are stored in a single column
  • Variables are stored in both rows and columns
  • Multiple types of observational units are stored in one table
  • A single observational unit is split into multiple tables
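
To make the first pathology concrete, here is a small sketch of "melting" a table whose column headers are values. The table contents, the column names, and the melt helper are hypothetical stand-ins written for these notes, not the paper's data or the R function.

```python
# Hypothetical messy table: the income brackets are values of a variable,
# but they appear as column headers.
messy = [
    {"religion": "A", "<$10k": 5, "$10-20k": 8},
    {"religion": "B", "<$10k": 2, "$10-20k": 4},
]


def melt(rows, id_vars, var_name, value_name):
    """Turn every non-identifier column into its own (variable, value) row."""
    tidy_rows = []
    for row in rows:
        ids = {k: row[k] for k in id_vars}
        for column, value in row.items():
            if column not in id_vars:
                tidy_rows.append({**ids, var_name: column, value_name: value})
    return tidy_rows


tidy = melt(messy, id_vars=["religion"], var_name="income", value_name="freq")
for row in tidy:
    print(row)
```

After melting, each row is a single (religion, income, freq) observation, so the three variables each have their own column.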

Tools

Manipulation

Common data manipulations:

  • filter
  • transform (computed variables)
  • aggregate
  • sort
  • join
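
The five manipulations above can be sketched on a tidy table held as a list of dicts. This is only an illustration in plain Python; the tables, column names, and values are invented, and the tidyverse tools express the same ideas with dedicated functions.

```python
from collections import defaultdict

# Hypothetical tidy tables: one dict per observation, one key per variable.
measurements = [
    {"site": "north", "day": 1, "temp_c": 10.0},
    {"site": "north", "day": 2, "temp_c": 14.0},
    {"site": "south", "day": 1, "temp_c": 20.0},
    {"site": "south", "day": 2, "temp_c": 24.0},
]
sites = [
    {"site": "north", "elevation_m": 900},
    {"site": "south", "elevation_m": 100},
]

# filter: keep observations that satisfy a condition
day_two = [r for r in measurements if r["day"] == 2]

# transform: add a variable computed from existing ones
with_f = [{**r, "temp_f": r["temp_c"] * 9 / 5 + 32} for r in measurements]

# aggregate: collapse many observations into one value per group
by_site = defaultdict(list)
for r in measurements:
    by_site[r["site"]].append(r["temp_c"])
mean_temp = {site: sum(v) / len(v) for site, v in by_site.items()}

# sort: reorder observations by a variable
warmest_first = sorted(measurements, key=lambda r: r["temp_c"], reverse=True)

# join: combine two tables on a shared variable
elevation = {s["site"]: s["elevation_m"] for s in sites}
joined = [{**r, "elevation_m": elevation[r["site"]]} for r in measurements]
```

Every step takes a tidy table in and, apart from the aggregate summary, hands a tidy table back, so the steps can be chained in any order.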

Modeling

Modeling typically creates a mapping between sets of predictors and responses.

Model output is unique to the type of model: compare the coefficients of a linear model with the structure representing a random forest.
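
A small sketch of that contrast, in plain Python: a least-squares line returns two coefficients, while a tree-based model is a nested structure. The data, the fit_line helper, and the hand-written stump are assumptions for illustration; a single unfitted stump is a heavy simplification of a random forest, which is an ensemble of many fitted trees.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return {"intercept": mean_y - slope * mean_x, "slope": slope}


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
linear_model = fit_line(xs, ys)  # output: a pair of coefficients

# A hand-written decision stump (not fitted to the data): the "model output"
# is a split rule plus leaf values rather than coefficients.
stump = {"split_on": "x", "threshold": 1.5, "left": 2.0, "right": 6.0}


def predict_stump(tree, x):
    """Follow the single split to a leaf value."""
    return tree["left"] if x < tree["threshold"] else tree["right"]


print(linear_model)
print(predict_stump(stump, 1.0), predict_stump(stump, 2.0))
```

Both objects map a predictor to a response, but they share no common shape, which is why model output is harder to fit into a single tidy convention than the input data.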

See also: