Library Carpentry: OpenRefine

Key Points

Introduction to OpenRefine
  • OpenRefine is ‘a tool for working with messy data’

  • OpenRefine works best with data in a simple tabular format

  • OpenRefine can help you split data up into more granular parts

  • OpenRefine can help you match local data up to other data sets

  • OpenRefine can help you enhance a data set with data from other sources

Importing data into OpenRefine
  • Use the Create Project option to import data

  • You can control how data imports using options on the import screen

  • Several file types may be imported into OpenRefine

Layout of OpenRefine, Rows vs Records
  • OpenRefine uses rows and columns to display data

  • Most options for working with data in OpenRefine are accessed through a drop-down menu at the top of a data column

  • When you select an option in a particular column (e.g. to make a change to the data), it will affect all the cells in that column

  • OpenRefine has a Records mode which links together multiple rows into a single record

  • Split and join multi-valued cells to modify the individual values within them

  • When creating multi-valued cells in your data, choose a separator that will not appear in the data values
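
The split-and-join idea above can be sketched outside OpenRefine in a few lines of Python. This is only an illustration, not how OpenRefine implements it: the separator, the helper names `split_cell` and `join_values`, and the sample data are all assumptions made for the example.

```python
SEPARATOR = "|"  # assumed separator; choose one that never occurs in the values


def split_cell(cell, sep=SEPARATOR):
    """Split a multi-valued cell into a list of individual values."""
    return [part.strip() for part in cell.split(sep)]


def join_values(values, sep=SEPARATOR):
    """Join individual values back into one multi-valued cell."""
    for value in values:
        if sep in value:
            # A separator inside a value would corrupt the cell on the next split.
            raise ValueError(f"separator {sep!r} appears in value {value!r}")
    return sep.join(values)


# Hypothetical cell holding three author names.
authors = "Smith, J.|Jones, A.|Lee, K."
parts = split_cell(authors)   # work on each value separately
parts[1] = "Jones, Alice"     # modify one individual value
rejoined = join_values(parts)
```

A comma would be a poor separator for this data, because each name already contains one: splitting "Smith, J." on a comma yields two fragments rather than one name.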

Faceting and filtering
  • You can use facets and filters to explore your data

  • You can use facets and filters to work with a subset of data in OpenRefine

  • You can correct common data issues from a Facet

  • Clustering is a way of finding variant forms of the same piece of data within a dataset (e.g. different spellings of a name)

  • There are a number of different Clustering algorithms that work in different ways and will produce different results

  • The best clustering algorithm to use will depend on the data

  • Using clustering you can replace varying forms of the same data with a single consistent value
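
To make the clustering idea concrete, here is a simplified Python sketch in the spirit of key-collision ("fingerprint") clustering. It is an approximation for illustration only: OpenRefine's own fingerprint method performs additional normalization, and the function names and sample values below are assumptions.

```python
import string
from collections import Counter, defaultdict


def fingerprint(value):
    """Build a normalized key: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a key; keep only groups with more than one variant."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


def merge(values, clusters):
    """Replace every variant in a cluster with that cluster's most frequent form."""
    replacement = {}
    for group in clusters:
        chosen = Counter(group).most_common(1)[0][0]
        for variant in group:
            replacement[variant] = chosen
    return [replacement.get(value, value) for value in values]


# Hypothetical column with variant spellings of one name.
names = ["John Smith", "Smith, John", "John Smith", "smith john ", "Jane Doe"]
clusters = cluster(names)
cleaned_names = merge(names, clusters)
```

A different key function (for example one based on phonetic codes or edit distance) would group the same values differently, which is why the choice of algorithm depends on the data.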

Working with columns and sorting
  • You can reorder, rename and remove columns in OpenRefine

  • Sorting in OpenRefine always sorts all rows

  • The original order of rows in OpenRefine is maintained during a sort until you use the option to Reorder Rows Permanently

Introduction to Transformations
  • Common transformations are available through the Menu option

Writing Transformations
  • You can alter data in OpenRefine based on specific instructions

  • You can preview the results of your GREL expression

Transformations - Undo and Redo
  • You can use Undo and Redo to retrace your steps

  • You can save and apply a set of steps to a new set of data using the ‘Extract’ and ‘Apply’ features

Transforming Strings, Numbers, Dates and Booleans
  • You can alter data in OpenRefine based on specific instructions

  • You can expand the data editing functions that are built into OpenRefine by building your own

Transformations - Handling Arrays
  • Arrays cannot appear directly in an OpenRefine cell

  • Arrays can be used in many ways using GREL expressions
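
The pattern behind these two points is: turn the cell's string into an array, work on the array, then turn it back into a string so it can be stored in the cell. The Python sketch below mirrors that round trip. It is an analogy rather than GREL itself; the closest GREL expression is probably something like `value.split("|").uniques().sort().join("|")`, and the function name and separator here are assumptions.

```python
def sort_and_dedupe(cell, sep="|"):
    """Split a cell into an array, remove duplicates, sort, and join back to a string."""
    values = cell.split(sep)             # string -> array
    unique_sorted = sorted(set(values))  # array operations: dedupe, then sort
    return sep.join(unique_sorted)       # array -> string, so it fits in a cell


# Hypothetical multi-valued cell with a repeated value.
subjects = "b|a|c|a"
tidied = sort_and_dedupe(subjects)
```

The final join is the important step: without it the result is an array, which cannot be stored directly in a cell.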

Exporting data
  • You can export your data in a variety of formats

Looking Up Data
  • OpenRefine can look up custom URLs to fetch data based on what’s in an OpenRefine project

  • Such API calls can be custom built, or one can use existing Reconciliation services to enrich data

  • OpenRefine can be further enhanced by installing extensions
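
Looking up data by URL comes down to building one request URL per cell value. The Python sketch below shows only that URL-building step and makes no network request. The endpoint `https://api.example.org/lookup`, the `q` parameter, and the function name are hypothetical placeholders, not a real service.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.org/lookup"  # hypothetical endpoint for illustration


def build_lookup_url(cell_value, base_url=BASE_URL):
    """Build a lookup URL from a cell value, escaping characters unsafe in URLs."""
    query = urlencode({"q": cell_value.strip()})  # "q" is an assumed parameter name
    return f"{base_url}?{query}"


# Hypothetical cell value containing spaces, which must be encoded.
url = build_lookup_url("Journal of Documentation")
```

Encoding the value matters because titles and names often contain spaces, ampersands, or other characters that would otherwise break the URL.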