This lesson is in the early stages of development (Alpha version)

Data cleaning & transformation with dplyr

Overview

Teaching: 50 min
Exercises: 30 min
Questions
  • How can I select specific rows and/or columns from a data frame?

  • How can I combine multiple commands into a single command?

  • How can create new columns or remove existing columns from a data frame?

  • How can I reformat a dataframe to meet my needs?

Objectives
  • Describe the purpose of an R package and the dplyr and tidyr packages.

  • Select certain columns in a data frame with the dplyr function select.

  • Select certain rows in a data frame according to filtering conditions with the dplyr function filter.

  • Link the output of one dplyr function to the input of another function with the ‘pipe’ operator %>%.

  • Add new columns to a data frame that are functions of existing columns with mutate.

  • Use the split-apply-combine concept for data analysis.

  • Use summarize, group_by, and count to split a data frame into groups of observations, apply a summary statistics for each group, and then combine the results.

  • Describe the concept of a wide and a long table format and for which purpose those formats are useful.

  • Describe what key-value pairs are.

  • Reshape a data frame from long to wide format and back with the spread and gather commands from the tidyr package.

  • Export a data frame to a csv file.

Getting set up

Open your R Project file

If you have not already done so, open your R Project file (library_carpentry.Rproj) created in the Before We Start lesson.

If you did not complete that step then do the following:

dir.create("data")
dir.create("data_output")
dir.create("fig_output")
download.file("https://ndownloader.figshare.com/files/22031487",
              "data/books.csv", mode = "wb")

Load the tidyverse and data frame into your R session

Load the tidyverse

library(tidyverse)
── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ ggplot2 3.2.1     ✔ forcats 0.4.0
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

And the books data we saved in the previous lesson.

books <- read_csv("data/books.csv")  # load the data and assign it to books

Transforming data with dplyr

We are now entering the data cleaning and transforming phase. While it is possible to do much of the following using Base R functions (in other words, without loading an external package) dplyr makes it much easier. Like many of the most useful R packages, dplyr was developed by data scientist http://hadley.nz/.

dplyr is a package for making tabular data manipulation easier by using a limited set of functions that can be combined to extract and summarize insights from your data. It pairs nicely with tidyr which enables you to swiftly convert between different data formats (long vs. wide) for plotting and analysis.

dplyr is also part of the tidyverse. Let’s make sure we are all on the same page by loading the tidyverse and the books dataset we downloaded earlier.

We’re going to learn some of the most common dplyr functions:

Renaming variables

It is often necessary to rename variables to make them more meaningful. If you print the names of the sample books dataset you can see that some of the vector names are not particularly helpful:

names(books)  # print names of the books data frame to the console
 [1] "CALL...BIBLIO." "X245.ab"        "X245.c"         "LOCATION"      
 [5] "TOT.CHKOUT"     "LOUTDATE"       "SUBJECT"        "ISN"           
 [9] "CALL...ITEM."   "X008.Date.One"  "BCODE2"         "BCODE1"        

There are many ways to rename variables in R, but the rename() function in the dplyr package is the easiest and most straightforward. The new variable name comes first. See help(rename).

Here we rename the X245.ab variable. Make sure you assign the output to your books value, otherwise it will just print it to the console. In other words, we are overwriting the previous books value with the new one, with X245.ab renamed to title.

# rename the . Make sure you return (<-) the output to your 
# variable, otherwise it will just print it to the console
books <- rename(books,
                title = X245.ab)

Side note:

Where does X245.ab come from? That is the MARC field 245|ab. However, because R variables cannot start with a number, R automatically inserted and X, and because pipes | are not allowed in variable names, R replaced it with a period. {: .callout}

# rename multiple variables at once
books <- rename(books,
                author = X245.c,
                callnumber = CALL...BIBLIO.,
                isbn = ISN,
                pubyear = X008.Date.One,
                subCollection = BCODE1,
                format = BCODE2,
                location = LOCATION,
                tot_chkout = TOT.CHKOUT,
                loutdate = LOUTDATE,
                subject = SUBJECT)
books
# A tibble: 10,000 x 12
   callnumber title author location tot_chkout loutdate subject isbn 
   <chr>      <chr> <chr>  <chr>         <dbl> <chr>    <chr>   <chr>
 1 001.94 Do… Berm… writt… juv               6 11-21-2… Reader… 0789…
 2 001.942 B… Inva… writt… juv               2 02-07-2… Reader… 0789…
 3 027.073 A… Down… by Ka… juv               3 10-16-2… Packho… 0060…
 4 133.5 Hua… The … by Ch… juv               6 11-22-2… Astrol… 0060…
 5 170 She 2… Judg… illus… juv               7 04-10-2… Childr… 0060…
 6 170.44 Sh… Judg… illus… juv               6 11-12-2… Conduc… 0060…
 7 220.9505 … A yo… retol… juv               4 12-01-2… Bible … 0060…
 8 225.9505 … God'… retol… juv               2 08-06-2… Bible … 0689…
 9 292.13 Mc… Roma… retol… juv               4 04-03-2… Mythol… 0689…
10 292.211 M… Gree… retol… juv              13 11-16-2… Gods, … 0689…
# … with 9,990 more rows, and 4 more variables: CALL...ITEM. <chr>,
#   pubyear <chr>, format <chr>, subCollection <chr>

Rename CALL...ITEM.

  1. Use rename() to rename the CALL...ITEM. column to callnumber2. Remember to add the period to the end of the CALL...ITEM. value

Solution

books <- rename(books,
                callnumber2 = CALL...ITEM.)

Recoding values

It is often necessary to recode or reclassify values in your data. For example, in the sample dataset provided to you, the sub_collection (formerly BCODE1) and format (formerly BCODE2) variables contain single characters.

Sub-Collection (formerly BCODE1) export from Sierra
Sub-Collection (formerly BCODE1) export from Sierra
Format (formerly BCODE2) export from Sierra
Format (formerly BCODE2) export from Sierra

You can do this easily using the recode() function, also in the dplyr package. Unlike rename(), the old value comes first here. Also notice that we are overwriting the books$subCollection variable.

# first print to the console all of the unique values you will need to recode
unique(books$subCollection)
FALSE  [1] "j" "b" "u" "r" "-" "s" "c" "z" "a" "t"
books$subCollection <- recode(books$subCollection,
                                      "-" = "general collection",
                                      u = "government documents",
                                      r = "reference",
                                      b = "k-12 materials",
                                      j = "juvenile",
                                      s = "special collections",
                                      c = "computer files",
                                      t = "theses",
                                      a = "archives",
                                      z = "reserves")
books 
FALSE # A tibble: 10,000 x 12
FALSE    callnumber title author location tot_chkout loutdate subject isbn 
FALSE    <chr>      <chr> <chr>  <chr>         <dbl> <chr>    <chr>   <chr>
FALSE  1 001.94 Do… Berm… writt… juv               6 11-21-2… Reader… 0789…
FALSE  2 001.942 B… Inva… writt… juv               2 02-07-2… Reader… 0789…
FALSE  3 027.073 A… Down… by Ka… juv               3 10-16-2… Packho… 0060…
FALSE  4 133.5 Hua… The … by Ch… juv               6 11-22-2… Astrol… 0060…
FALSE  5 170 She 2… Judg… illus… juv               7 04-10-2… Childr… 0060…
FALSE  6 170.44 Sh… Judg… illus… juv               6 11-12-2… Conduc… 0060…
FALSE  7 220.9505 … A yo… retol… juv               4 12-01-2… Bible … 0060…
FALSE  8 225.9505 … God'… retol… juv               2 08-06-2… Bible … 0689…
FALSE  9 292.13 Mc… Roma… retol… juv               4 04-03-2… Mythol… 0689…
FALSE 10 292.211 M… Gree… retol… juv              13 11-16-2… Gods, … 0689…
FALSE # … with 9,990 more rows, and 4 more variables: callnumber2 <chr>,
FALSE #   pubyear <chr>, format <chr>, subCollection <chr>

Do the same for the format column. Note that you must put "5" and "4" into quotation marks for the function to operate correctly.

books$format <- recode(books$format,
                              a = "book",
                              e = "serial",
                              w = "microform",
                              s = "e-gov doc",
                              o = "map",
                              n = "database",
                              k = "cd-rom",
                              m = "image",
                              "5" = "kit/object",
                              "4" = "online video")

Subsetting dataframes

Subsetting using filter() in the dplyr package

In the last lesson we learned how to subset a data frame using brackets. As with other R functions, the dplyr package makes it much more straightforward, using the filter() function.

Here we will create a subset of books called booksOnly, which includes only those items where the format is books. Notice that we use two equal signs == as the logical operator:

booksOnly <- filter(books, format == "book") # filter books to return only those items where the format is books

You can also use multiple filter conditions. Here, the order matters: first we filter to include only books, then of the results, we include only items that have more than zero checkouts.

bookCheckouts <- filter(books,
                        format == "book",
                        tot_chkout > 0)

How many items were removed? You can find out functionally with:

nrow(books) - nrow(bookCheckouts)
FALSE [1] 5733

You can then check the summary statistics of checkouts for books with more than zero checkouts. Notice how different these numbers are from the previous lesson, when we kept zero in. The median is now 3 and the mean is 5.

summary(bookCheckouts$tot_chkout)
FALSE    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
FALSE   1.000   2.000   3.000   5.281   6.000 113.000
booksJuv <- filter(books,
                   format == "book",
                   subCollection ==  "juvenile")

Filtering with filter()

  1. Use filter() to create a data frame called booksJuv consisting of format books and subCollection juvenile materials.

  2. Use mean() to check the average number of checkouts for the booksJuv data frame.

Solution

booksJuv <- filter(books,
                   format == "book",
                   subCollection ==  "juvenile")
    mean(booksJuv$tot_chkout)
[1] 10.41404

Selecting variables

The select() function allows you to keep or remove specific columns It also provides a convenient way to reorder variables.

# specify the variables you want to keep by name
booksTitleCheckouts <- dplyr::select(books, title, tot_chkout)
booksTitleCheckouts
# A tibble: 10,000 x 2
   title                                                         tot_chkout
   <chr>                                                              <dbl>
 1 Bermuda Triangle /                                                     6
 2 Invaders from outer space :|real-life stories of UFOs /                2
 3 Down Cut Shin Creek :|the pack horse librarians of Kentucky /          3
 4 The Chinese book of animal powers /                                    6
 5 Judge Judy Sheindlin's Win or lose by how you choose! /                7
 6 Judge Judy Sheindlin's You can't judge a book by its cover :…          6
 7 A young child's Bible /                                                4
 8 God's Kingdom :|stories from the New Testament /                       2
 9 Roman myths /                                                          4
10 Greek gods and goddesses /                                            13
# … with 9,990 more rows
# specify the variables you want to remove with a -
books <- select(books, -location)

# reorder columns, combined with everything()
booksReordered <- select(books, title, tot_chkout, loutdate, everything())

Ordering data

The arrange() function in the dplyr package allows you to sort your data by alphabetical or numerical order.

booksTitleArrange <- arrange(books, title)

# use desc() to sort a variable in descending order
booksHighestChkout <- arrange(books, desc(tot_chkout))
booksHighestChkout
# A tibble: 10,000 x 11
   callnumber title author tot_chkout loutdate subject isbn  callnumber2
   <chr>      <chr> <chr>       <dbl> <chr>    <chr>   <chr> <chr>      
 1 E Cro 2000 Clic… by Do…        113 01-23-2… Cows -… 0689… E Cro 2000 
 2 PZ7.W6367… The … by Da…        106 03-07-2… Pigs -… 0618… 398.2452 W…
 3 <NA>       Cook… Janet…        103 03-13-2… Cake -… 0152… E Ste 1999 
 4 PZ7.D5455… Beca… Kate …         79 03-27-2… Dogs -… 0763… Fic Dic 20…
 5 PZ7.C6775… Upto… Bryan…         69 02-05-2… Harlem… 9780… E Col 2000 
 6 <NA>       <NA>  <NA>           64 08-23-2… <NA>    <NA>  #1  ENC. C…
 7 F379.N59 … Thro… Ruby …         63 11-01-2… Bridge… 0590… 920 Bri 19…
 8 PZ7.C9413… Bud,… Chris…         63 04-03-2… Runawa… 0385… Fic Cur 19…
 9 E Mar 1992 Brow… by Bi…         61 02-16-2… Color … 0805… E Mar 1992 
10 PZ7.P338 … A ye… Richa…         47 03-26-2… Grandm… 0803… Fic Pec 20…
# … with 9,990 more rows, and 3 more variables: pubyear <chr>,
#   format <chr>, subCollection <chr>
# order data based on multiple variables (e.g. sort first by checkout, then by publication year)
booksChkoutYear <- arrange(books, desc(tot_chkout), desc(pubyear))

Creating new variables

The mutate() function allows you to create new variables. Here, we use the str_sub() function from the stringr package to extract the first character of the callnumber variable (the call number class) and put it into a new column called call_class.

booksLC <- mutate(books
                  , call_class = str_sub(callnumber, 1, 1))

There are two numbers because you must specify a start and an end value–here, we start with the first character, and end with the first character.

mutate() is also helpful to coerce a column from one data type to another. For example, we can see there are some errors in the pubyear variable–some dates are 19zz or uuuu. As a result, this variable was read in as a character rather than an integer.

books <- books %>%
  mutate(pubyear = as.integer(pubyear))
Warning: NAs introduced by coercion

We see the error message NAs introduced by coercion. This is because non-numerical variables become NA and the remainder become integers.

Putting it all together with %>%

The Pipe Operator %>% is loaded with the tidyverse. It takes the output of one statement and makes it the input of the next statement. You can think of it as “then” in natural language. So instead of making a bunch of intermediate data frames and cluttering up your workspace, you can run multiple functions at once. You can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.

So in the following example, the books tibble is first called, then the format is filtered to include only book, then only the title and tot_chkout columns are selected, and finally the data is rearranged from most to least checkouts.

myBooks <- books %>%
  filter(format == "book") %>%
  select(title, tot_chkout) %>%
  arrange(desc(tot_chkout))
myBooks
# A tibble: 6,983 x 2
   title                                      tot_chkout
   <chr>                                           <dbl>
 1 Click, clack, moo :|cows that type /              113
 2 The three pigs /                                  106
 3 Cook-a-doodle-doo! /                              103
 4 Because of Winn-Dixie /                            79
 5 Uptown /                                           69
 6 Through my eyes /                                  63
 7 Bud, not Buddy /                                   63
 8 Brown bear, brown bear, what do you see? /         61
 9 A year down yonder /                               47
10 Wemberly worried /                                 43
# … with 6,973 more rows

Playing with pipes %>%

  1. Create a new data frame booksKids with these conditions:
    • filter() to include subCollection juvenile & k-12 materials and format books.
      • select() only title, call number, total checkouts, and publication year
      • arrange() by total checkouts in descending order
  2. Use mean() to check the average number of checkouts for the booksJuv data frame.

Solution

booksKids <- books %>%
      filter(subCollection == "juvenile",
      format == "book") %>%
    select(title, callnumber, tot_chkout, pubyear) %>%
    arrange(desc(tot_chkout))

Split-apply-combine data analysis and the summarize() function

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

The summarize() function

group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics.

So to compute the average checkouts by format:

books %>%
  group_by(format) %>%
  summarize(mean_checkouts = mean(tot_chkout))
# A tibble: 10 x 2
   format       mean_checkouts
   <chr>                 <dbl>
 1 book                3.23   
 2 cd-rom              0.333  
 3 database            0      
 4 e-gov doc           0.0402 
 5 image               0.0275 
 6 kit/object          1.33   
 7 map                10.6    
 8 microform           0.00122
 9 online video        0      
10 serial              0      

Books and maps have the highest, and as we would expect, databases, online videos, and serials have zero checkouts.

Here is a more complex example:

books %>% 
  filter(format == "book") %>%
  mutate(call_class = str_sub(callnumber, 1, 1)) %>%
  group_by(call_class) %>%
  summarize(count = n(),
            sum_tot_chkout = sum(tot_chkout)) %>%
  arrange(desc(sum_tot_chkout))
# A tibble: 34 x 3
   call_class count sum_tot_chkout
   <chr>      <int>          <dbl>
 1 E            487           3114
 2 <NA>         459           3024
 3 H           1142           2902
 4 P            800           2645
 5 F            240           1306
 6 Q            333           1305
 7 B            426           1233
 8 R            193            981
 9 L            358            862
10 5             60            838
# … with 24 more rows

Let’s break this down step by step:

Exporting data

Now that you have learned how to use dplyr to extract information from or summarize your raw data, you may want to export these new data sets to share them with your collaborators or for archival.

Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames.

Before using write_csv(), we are going to create a new folder, data_output, in our working directory that will store this generated dataset. We don’t want to write generated datasets in the same directory as our raw data. It’s good practice to keep them separate. The data folder should only contain the raw, unaltered data, and should be left alone to make sure we don’t delete or modify it. In contrast, our script will generate the contents of the data_output directory, so even if the files it contains are deleted, we can always re-generate them.

In preparation for our next lesson on plotting, we are going to create a version of the dataset with most of the changes we made above. We will first read in the original, then make all the changes with pipes.

books_reformatted <- read_csv("./data/books.csv") %>%
  rename(title = X245.ab,
         author = X245.c,
         callnumber = CALL...BIBLIO.,
         isbn = ISN,
         pubyear = X008.Date.One,
         subCollection = BCODE1,
         format = BCODE2,
         location = LOCATION,
         tot_chkout = TOT.CHKOUT,
         loutdate = LOUTDATE,
         subject = SUBJECT,
         callnumber2 = CALL...ITEM.) %>%
  mutate(pubyear = as.integer(pubyear),
         call_class = str_sub(callnumber, 1, 1),
         subCollection = recode(subCollection,
                                "-" = "general collection",
                                u = "government documents",
                                r = "reference",
                                b = "k-12 materials",
                                j = "juvenile",
                                s = "special collections",
                                c = "computer files",
                                t = "theses",
                                a = "archives",
                                z = "reserves"),
         format = recode(format,
                         a = "book",
                         e = "serial",
                         w = "microform",
                         s = "e-gov doc",
                         o = "map",
                         n = "database",
                         k = "cd-rom",
                         m = "image",
                         "5" = "kit/object",
                         "4" = "online video"))

This chunk of code read the CSV, renamed the variables, used mutate() in combination with recode() to recode the format and subCollection values, used mutate() in combination with as.integer() to coerce pubyear to integer, and used mutate() in combination with str_sub to create the new varable call_class.

We now write it to a CSV and put it in the data/output sub-directory:

write_csv(books_reformatted, "./data_output/books_reformatted.csv")

Help with dplyr

Key Points

  • Use the dplyr package to manipulate dataframes.

  • Use select() to choose variables from a dataframe.

  • Use filter() to choose data based on values.

  • Use group_by() and summarize() to work with subsets of data.

  • Use mutate() to create new variables.