Tidy Data

Notes on Hadley Wickham’s 2014 paper “Tidy Data”, which lays out the philosophy behind the tidyverse suite of tools in R.

Philosophy of Data

The paper describes a structure and semantics for datasets that facilitate manipulation, visualization, and modeling: in short, a philosophy of data.

Tidy data table

Variables often have relationships between them, such as “z is a linear combination of x and y”.

Observations are often grouped, and aggregates are then computed and compared across groups.
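As a rough illustration of grouping and aggregating, the sketch below averages one variable within each group using only the Python standard library. The dataset, the column names `group` and `value`, and the helper `group_means` are all hypothetical and do not come from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tidy dataset: one dict per observation, one key per variable.
rows = [
    {"group": "a", "value": 1.0},
    {"group": "a", "value": 3.0},
    {"group": "b", "value": 10.0},
    {"group": "b", "value": 20.0},
]

def group_means(records, by, column):
    """Group records by the `by` variable and average `column` within each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[by]].append(record[column])
    return {key: mean(values) for key, values in groups.items()}

means = group_means(rows, by="group", column="value")
```

Because every observation is a row and every variable is a key, the grouping variable and the aggregated variable can be named directly, and the per-group results can be compared side by side.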

Tidy Data is:

…closely related to Codd’s 3rd Normal Form.
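In the paper's terms, each variable forms a column, each observation forms a row, and each type of observational unit forms a table. The sketch below shows one way a "wide" table whose column headers are really values (here, years) might be reshaped into that form, an operation the paper calls melting. The data, the column names, and the helper `melt` are hypothetical and written with the standard library only.

```python
# Hypothetical "wide" table: the year columns hold values of a variable
# (year) in the header row instead of in a column of their own.
wide = [
    {"country": "A", "1999": 10, "2000": 12},
    {"country": "B", "1999": 7, "2000": 9},
]

def melt(records, id_vars, var_name, value_name):
    """Turn every non-identifier column into its own (variable, value) row."""
    tidy = []
    for record in records:
        ids = {key: record[key] for key in id_vars}
        for column, value in record.items():
            if column not in id_vars:
                tidy.append({**ids, var_name: column, value_name: value})
    return tidy

tidy = melt(wide, id_vars=["country"], var_name="year", value_name="cases")
```

After melting, each row describes a single observation (one country in one year), and `country`, `year`, and `cases` are each a variable in their own right.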

Common pathologies of messy data:

Common data manipulations:

Modeling typically creates a mapping between sets of predictors and responses.
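As a minimal sketch of such a mapping, the following fits a straight line from one predictor to one response by ordinary least squares. The data and the helper names `fit_line` and `predict` are hypothetical, and a single-predictor linear model is chosen purely for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data lying exactly on the line y = 1 + 2x.
predictor = [0.0, 1.0, 2.0, 3.0]
response = [1.0, 3.0, 5.0, 7.0]

intercept, slope = fit_line(predictor, response)

def predict(x):
    """The fitted mapping from a predictor value to a predicted response."""
    return intercept + slope * x
```

Here the whole fitted model is captured by two numbers, the intercept and the slope, which is what makes its output easy to store and compare.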

Model output is specific to the type of model: compare the coefficients of a linear model with the collection of decision trees that makes up a random forest.