Tidy data for librarians

Key Points

Using spreadsheet programs for data wrangling
  • We will discuss good practices for data entry and formatting

  • We will not discuss analysis or visualisation

Formatting data tables in Spreadsheets
  • Use one column for one variable

  • Use one row for one observation

  • Use one cell for one value

Formatting problems
  • Don’t use multiple tables in one sheet

  • Don’t use multiple tabs in a file

  • Fill in zero when you mean zero

  • Use a blank cell to indicate a null value

  • Don’t use formatting to convey information or make the data look pretty

  • Don’t put units or comments in cells

  • Don’t combine several values in one cell

  • Take care over column names

  • Avoid including special characters in your data file

  • Put metadata (units, legends etc.) in a separate file

Dates as data
  • Excel is notoriously bad at handling dates.

  • Dates can be stored more robustly as YEAR, MONTH, DAY or YEAR, DAY-OF-YEAR in separate columns.

Basic quality assurance and control, and data manipulation in spreadsheets
  • Use data validation tools to minimise the possibility of input errors.

  • Use sorting and conditional formatting to identify possibly errors.

Exporting data from spreadsheets
  • Use .csv file format for data storage and processing

Caveats of popular data and file formats
  • Be careful when using commas in values