Starting with Data
OverviewTeaching: 50 min
Exercises: 30 minQuestions
What is a data.frame?
How can I read a complete csv file into R?
How can I get basic summary information about my dataset?
How can I change the way R treats strings in my dataset?
Why would I want strings to be treated differently?
How are dates represented in R and how can I change the format?Objectives
Describe what a data frame is.
Load external data from a .csv file into a data frame.
Summarize the contents of a data frame.
Describe the difference between a factor and a string.
Convert between strings and factors.
Reorder and rename factors.
Change how character strings are handled in a data frame.
Examine and change date formats.
Open your Rproj file
First, open your R Project file (
library_carpentry.Rproj) created in the
Before We Start lesson.
If you did not complete that step, do the following:
- Under the
Filemenu, click on
New project, choose
New directory, then
- Enter the name
library_carpentryfor this new folder (or “directory”). This will be your working directory for the rest of the day.
- Click on
- Create a new file where we will type our scripts. Go to File > New File > R
script. Click the save icon on your toolbar and save your script as
Presentation of the data
This data was downloaded from the University of Houston–Clear Lake Integrated Library System in 2018. It is a relatively random sample of books from the catalog. It consists of 10,000 observations of 11 variables.
These variables are:
CALL...BIBLIO.: Bibliographic call number. Most of these are cataloged with the Library of Congress classification, but there are also items cataloged in the Dewey Decimal System (including fiction and non-fiction), and Superintendent of Documents call numbers. Character.
X245.ab: The title and remainder of title. Exported from MARC tag 245|ab fields. Separated by a
|pipe character. Character.
X245.c: The author (statement of responsibility). Exported from MARC tag 245
TOT.CHKOUT: The total number of checkouts. Integer.
LOUTDATE: The last date the item was checked out. Date. YYYY-MM-DDThh:mmTZD
SUBJECT: Bibliographic subject in Library of Congress Subject Headings. Separated by a
|pipe character. Character.
ISN: ISBN or ISSN. Exported from MARC field 020
CALL...ITEM: Item call number. Most of these are
NAbut there are some secondary call numbers.
X008.Date.One: Date of publication. Date. YYYY
BCODE2: Item format. Character.
Getting data into R
Ways to get data into R
In order to use your data in R, you must import it and turn it into an R object. There are many ways to get data into R.
- Manually: You can manually create it using the
data.frame()function in Base R, or the
tibble()function in the tidyverse.
- Import it from a file Below is a very incomplete list
- Text: TXT (
- Tabular data: CSV, TSV (
- Excel: XLSX (
- Google sheets: (
- Statistics program: SPSS, SAS (
- Databases: MySQL (
- Text: TXT (
- Gather it from the web: You can connect to webpages, servers, or APIs
directly from within R, or you can create a data scraped from HTML webpages
rvestpackage. For example
Organizing your working directory
Using a consistent folder structure across your projects will help keep things organized and make it easy to find/file things in the future. This can be especially helpful when you have multiple projects. In general, you might create directories (folders) for scripts, data, and documents. Here are some examples of suggested directories:
data/Use this folder to store your raw data and intermediate datasets. For the sake of transparency and provenance, you should always keep a copy of your raw data accessible and do as much of your data cleanup and preprocessing programmatically (i.e., with scripts, rather than manually) as possible.
data_output/When you need to modify your raw data, it might be useful to store the modified versions of the datasets in a different folder.
documents/Used for outlines, drafts, and other text.
fig_output/This folder can store the graphics that are generated by your scripts.
scripts/A place to keep your R scripts for different analyses or plotting.
You may want additional directories or subdirectories depending on your project needs, but these should form the backbone of your working directory.
The working directory
The working directory is an important concept to understand. It is the place on your computer where R will look for and save files. When you write code for your project, your scripts should refer to files in relation to the root of your working directory and only to files within this structure.
Using RStudio projects makes this easy and ensures that your working directory
is set up properly. If you need to check it, you can use
getwd(). If for some
reason your working directory is not what it should be, you can change it in the
RStudio interface by navigating in the file browser to where your working directory
should be, clicking on the blue gear icon “More”, and selecting “Set As Working
Directory”. Alternatively, you can use
reset your working directory. However, your scripts should not include this line,
because it will fail on someone else’s computer.
Setting your working directory with
Some points to note about setting your working directory:
The directory must be in quotation marks.
On Windows computers, directories in file paths are separated with a backslash
\. However, in R, you must use a forward slash
/. You can copy and paste from the Windows Explorer window directly into R and use find/replace (Ctrl/Cmd + F) in R Studio to replace all backslashes with forward slashes.
On Mac computers, open the Finder and navigate to the directory you wish to set as your working directory. Right click on that folder and press the options key on your keyboard. The ‘Copy “Folder Name”’ option will transform into ‘Copy “Folder Name” as Pathname. It will copy the path to the folder to the clipboard. You can then paste this into your
setwd()function. You do not need to replace backslashes with forward slashes.
After you set your working directory, you can use
./to represent it. So if you have a folder in your directory called
data, you can use read.csv(“./data”) to represent that sub-directory.
Downloading the data and getting set up
Now that you have set your working directory, we will create our folder
structure using the
For this lesson we will use the following folders in our working directory:
fig_output/. Let’s write them all in
lowercase to be consistent. We can create them using the RStudio interface by
clicking on the “New Folder” button in the file pane (bottom right), or directly
from R by typing at console:
dir.create("data") dir.create("data_output") dir.create("fig_output")
Go to the Figshare page for this curriculum and download the dataset called
books.csv”. The direct download link is:
https://ndownloader.figshare.com/files/22031487. Place this downloaded file in
data/ you just created. Alternatively, you can do this directly from R by copying and
pasting this in your terminal (your instructor can place this chunk of code in
download.file("https://ndownloader.figshare.com/files/22031487", "data/books.csv", mode = "wb")
Now if you navigate to your
data folder, the
books.csv file should be there.
We now need to load it into our R session.
R has some base functions for reading a local data file into your R
read.csv(), but these have some
idiosyncrasies that were improved upon in the
readr package, which is
installed and loaded with
library(tidyverse) # loads the core tidyverse, including dplyr, readr, ggplot2, purrr
── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1 ✔ purrr 0.3.2 ✔ tibble 2.1.3 ✔ dplyr 0.8.3 ✔ tidyr 0.8.3 ✔ stringr 1.4.0 ✔ ggplot2 3.2.1 ✔ forcats 0.4.0
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag()
To get our sample data into our R session, we will use the
and assign it to the
books <- read_csv("./data/books.csv")
You will see the message
Parsed with column specification, followed by each
column name and its data type. When you execute
read_csv on a data file, it
looks through the first 1000 rows of each column and guesses the data type for
each column as it reads it into R. For example, in this dataset, it reads
col_character (character), and
col_double. You have
the option to specify the data type for a column manually by using the
col_types argument in
You should now have an R object called
books in the Environment pane: 10000
observations of 12 variables. We will be using this data file in the next
read_csv()assumes that fields are delineated by commas, however, in several countries, the comma is used as a decimal separator and the semicolon (;) is used as a field delineator. If you want to read in this type of files in R, you can use the
read_csv2function. It behaves exactly like
read_csvbut uses different parameters for the decimal and the field separators. If you are working with another format, they can be both specified by the user. Check out the help for
?read_csvto learn more. There is also the
read_tsv()for tab-separated data files, and
read_delim()allows you to specify more details about the structure of your file.
What are data frames and tibbles?
Data frames are the de facto data structure for tabular data in
R, and what
we use for data processing, statistics, and plotting.
A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.
A data frame can be created by hand, but most commonly they are generated by the
read_table(); in other words, when importing
spreadsheets from your hard drive (or the web).
A tibble is an extension of
R data frames used by the tidyverse. When the
data is read using
read_csv(), it is stored in an object of class
data.frame. You can see the class of an object with
Inspecting data frames
When calling a
tbl_df object (like
interviews here), there is already a lot of information about our data frame being displayed such as the number of rows, the number of columns, the names of the columns, and as we just saw the class of data stored in each column. However, there are functions to extract this information from data frames. Here is a non-exhaustive list of some of these
functions. Let’s try them out!
dim(books)- returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
nrow(books)- returns the number of rows
ncol(books)- returns the number of columns
head(books)- shows the first 6 rows
tail(books)- shows the last 6 rows
names(books)- returns the column names (synonym of
View(books)- look at the data in the viewer
str(books)- structure of the object and information about the class, length and content of each column
summary(books)- summary statistics for each column
Note: most of these functions are “generic”, they can be used on other types of objects besides data frames.
map() function from
purrr is a useful way of running a function on all
variables in a data frame or list. If you loaded the
tidyverse at the
beginning of the session, you also loaded
purrr. Here we call
map_chr(), which will return a character vector of the classes
for each variable.
CALL...BIBLIO. X245.ab X245.c LOCATION TOT.CHKOUT "character" "character" "character" "character" "numeric" LOUTDATE SUBJECT ISN CALL...ITEM. X008.Date.One "character" "character" "character" "character" "character" BCODE2 BCODE1 "character" "character"
Indexing and subsetting data frames
books data frame has 2 dimensions: rows (observations) and columns
(variables). If we want to extract some specific data from it, we need to
specify the “coordinates” we want from it. In the last session, we used square
[ ] to subset values from vectors. Here we will do the same thing for
data frames, but we can now add a second dimension. Row numbers come first,
followed by column numbers. However, note that different ways of specifying
these coordinates lead to results with different classes.
## first element in the first column of the data frame (as a vector) books[1, 1] ## first element in the 6th column (as a vector) books[1, 6] ## first column of the data frame (as a vector) books[] ## first column of the data frame (as a data.frame) books ## first three elements in the 7th column (as a vector) books[1:3, 7] ## the 3rd row of the data frame (as a data.frame) books[3, ] ## equivalent to head_books <- head(books) head_books <- books[1:6, ]
The dollar sign
$ is used to distinguish a specific variable (column, in Excel-speak) in a data frame:
head(books$X245.ab) # print the first six book titles
 "Bermuda Triangle /"  "Invaders from outer space :|real-life stories of UFOs /"  "Down Cut Shin Creek :|the pack horse librarians of Kentucky /"  "The Chinese book of animal powers /"  "Judge Judy Sheindlin's Win or lose by how you choose! /"  "Judge Judy Sheindlin's You can't judge a book by its cover :|cool rules for school /"
# print the mean number of checkouts mean(books$TOT.CHKOUT)
unique() to see all the distinct values in a variable:
 "a" "w" "s" "m" "e" "4" "k" "5" "n" "o"
Take that one step further with
table() to get quick frequency counts on a variable:
table(books$BCODE2) # frequency counts on a variable
4 5 a e k m n o s w 1 3 6983 68 3 109 2 21 1988 822
You can combine
table() with relational operators:
table(books$TOT.CHKOUT > 50) # how many books have 50 or more checkouts?
FALSE TRUE 9991 9
duplicated() will give you the a logical vector of duplicated values.
duplicated(books$ISN) # a TRUE/FALSE vector of duplicated values in the ISN column !duplicated(books$ISN) # you can put an exclamation mark before it to get non-duplicated values table(duplicated(books$ISN)) # run a table of duplicated values which(duplicated(books$ISN)) # get row numbers of duplicated values
Exploring missing values
You may also need to know the number of missing values:
sum(is.na(books)) # How many total missing values?
colSums(is.na(books)) # Total missing values per column
CALL...BIBLIO. X245.ab X245.c LOCATION TOT.CHKOUT 561 12 2801 0 0 LOUTDATE SUBJECT ISN CALL...ITEM. X008.Date.One 0 63 2934 7980 158 BCODE2 BCODE1 0 0
table(is.na(books$ISN)) # use table() and is.na() in combination
FALSE TRUE 7066 2934
booksNoNA <- na.omit(books) # Return only observations that have no missing values
View(books)to examine the data frame. Use the small arrow buttons in the variable name to sort tot_chkout by the highest checkouts. What item has the most checkouts?
What is the class of the TOT.CHKOUT variable?
is.na()to find out how many NA values are in the ISN variable.
summary(books$TOT.CHKOUT). What can we infer when we compare the mean, median, and max?
hist()will print a rudimentary histogram, which displays frequency counts. Call
hist(books$TOT.CHKOUT). What is this telling us?
Click, clack, moo : cows that type.
The median is 0, indicating that, consistent with all book circulation I have seen, the majority of items have 0 checkouts.
As we saw in
summary(), the majority of items have a small number of checkouts
R contains a number of operators you can use to compare
help(Comparison) to read the R help file. Note that two equal
==) are used for evaluating equality (because one equals sign (
is used for assigning variables).
||Less Than or Equal To|
||Greater Than or Equal To|
||Not Equal To|
||Has a Match In|
||Is Not NA|
Sometimes you need to do multiple logical tests (think Boolean logic). Use
help(Logic) to read the help file.
||Are some values true?|
||Are all values true?|
Use read.csv to read tabular data in R.
Use factors to represent categorical data in R.