Tidy Data
Notes on Hadley Wickham’s 2014 paper “Tidy Data”, which lays out the philosophy behind the tidyverse suite of tools in R.
Philosophy of Data
The paper describes a structure and semantics for datasets that facilitate manipulation, visualization, and modeling: in short, a philosophy of data.
Variables
Variables often have relationships between them, such as “z is a linear combination of x and y”.
Observations
Observations are often grouped, and aggregations are then computed for each group and compared.
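As a minimal sketch of grouping and aggregating, the snippet below uses plain Python rather than the tidyverse. The `observations` data, the variable names `treatment` and `result`, and the helper `group_mean` are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: one dict per row, one key per variable.
observations = [
    {"treatment": "a", "result": 4.0},
    {"treatment": "a", "result": 6.0},
    {"treatment": "b", "result": 1.0},
    {"treatment": "b", "result": 3.0},
]

def group_mean(rows, by, value):
    """Group rows by the `by` variable and return the mean of `value` per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row[value])
    return {key: mean(values) for key, values in groups.items()}

means = group_mean(observations, by="treatment", value="result")
```

Because each row is one observation and each key is one variable, the grouping variable and the aggregated variable can both be referred to by name.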
Tidy Data
Data is tidy when:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
This is closely related to Codd’s 3rd Normal Form, with the constraints framed in statistical language.
Messy data
Common pathologies of messy data:
- Column headers are values, not variable names
- Multiple variables are stored in a single column
- Variables are stored in both rows and columns
- Multiple types of observational units are stored in one table
- A single observational unit is split across multiple tables
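The first pathology is usually repaired by turning the offending column headers into the values of a new variable (often called melting or pivoting longer). The sketch below is in plain Python, not the tidyverse. The `messy` table, its counts, and the `melt` helper are hypothetical and only illustrate the reshaping.

```python
# Hypothetical messy table: "low" and "high" are values of an income
# variable, but they appear as column headers. Counts are made up.
messy = [
    {"group": "A", "low": 27, "high": 34},
    {"group": "B", "low": 12, "high": 27},
]

def melt(rows, id_vars, var_name, value_name):
    """Turn every non-id column into one output row holding a
    (var_name, value_name) pair alongside the id variables."""
    tidy = []
    for row in rows:
        ids = {key: row[key] for key in id_vars}
        for column, value in row.items():
            if column not in id_vars:
                tidy.append({**ids, var_name: column, value_name: value})
    return tidy

tidy = melt(messy, id_vars=["group"], var_name="income", value_name="freq")
```

After melting, each row is one observation (one group at one income level) and each of `group`, `income`, and `freq` is a column.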
Tools
Manipulation
Common data manipulations:
- filter
- transform (computed variables)
- aggregate
- sort
- join
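Each of these manipulations can be sketched in a few lines once the data is tidy. The example below uses plain Python lists of dicts instead of tidyverse functions. The tables `measurements` and `sites` and all of their variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical tidy tables: one row per observation, one key per variable.
measurements = [
    {"site": "north", "year": 2020, "height_cm": 150.0},
    {"site": "north", "year": 2021, "height_cm": 170.0},
    {"site": "south", "year": 2020, "height_cm": 120.0},
    {"site": "south", "year": 2021, "height_cm": 180.0},
]
sites = [
    {"site": "north", "region": "highland"},
    {"site": "south", "region": "coast"},
]

# filter: keep observations that satisfy a condition
recent = [row for row in measurements if row["year"] >= 2021]

# transform: add a variable computed from existing ones
with_metres = [{**row, "height_m": row["height_cm"] / 100} for row in measurements]

# aggregate: collapse many values into one summary per group
totals = defaultdict(float)
for row in measurements:
    totals[row["site"]] += row["height_cm"]

# sort: reorder observations by a variable
tallest_first = sorted(measurements, key=lambda row: row["height_cm"], reverse=True)

# join: attach variables from another table via a shared key
region_by_site = {row["site"]: row["region"] for row in sites}
joined = [{**row, "region": region_by_site[row["site"]]} for row in measurements]
```

Every step takes rows of named variables in and gives rows of named variables back, so the steps can be chained in any order.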
Modeling
Modeling typically creates a mapping between sets of predictors and responses.
Model output is specific to the type of model: compare the coefficients of a linear model with the structure representing a random forest.
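To make the predictor-to-response mapping concrete, here is a minimal sketch of a one-predictor linear model fitted by ordinary least squares in plain Python. The helpers `fit_line` and `predict` and the data are hypothetical. The fitted output here is just two coefficients, whereas a tree-based model would need a quite different structure.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return {"intercept": y_bar - slope * x_bar, "slope": slope}

def predict(model, x):
    """Map a predictor value to a predicted response using the fitted coefficients."""
    return model["intercept"] + model["slope"] * x

# Hypothetical data constructed to lie exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
model = fit_line(xs, ys)
```

The input is tidy (one row per observation, with `xs` and `ys` as columns), but the output is a model-specific object, not another tidy table.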