Formatting data tables in Spreadsheets
Last updated on 2024-10-29 | Edit this page
Overview
Questions
- How should data be formatted in spreadsheets?
Objectives
- Describe best practices for data entry and formatting in spreadsheets.
- Apply best practices to arrange variables and observations in a spreadsheet.
The most common mistake made is treating the program like it is a notebook by relying on context, notes in the margin, spatial layout of data and fields to convey information. As humans, we can (usually) interpret these things, but computers are unintelligent, and unless we explain to the computer what every single thing means (and that can be hard!) it will not be able to see how our data fit together.
Using the power of computers, we can manage and analyze data in much more effective and faster ways, but to use that power, we have to set up our data for the computer to be able to understand it (and computers are very literal).
This is why it’s extremely important to set up well-formatted tables from the outset before you even start collecting data to analyse. Data organization is the foundation of your data-related work. Unorganized data can make it harder to work with your data, so you should be mindful of your data organization when doing your data entry. You’ll want to organize your data in a way that allows other programs and people to easily understand and use the data.
Callout
Note: the best layouts/formats (as well as software and interfaces) for data entry and data analysis might be different. It is important to take this into account, and ideally automate the conversion from one to another.
Keeping track of your analyses
When working with spreadsheets during data clean up or analyses, it’s very easy to end up with a spreadsheet that looks very different from the one you started with. In order to be able to reproduce your analyses or figure out what you did when your leadership team ask for a different analysis, you should:
- create a new file with your cleaned or analyzed data. Don’t modify the original dataset, or you will never know where you started!
- keep track of the steps you took in your clean up or analysis. You should track these steps as you would any step in an experiment. We recommend that you do this in a plain text file stored in the same folder as the data file.
- record the steps you followed in your data cleanup or analysis. Just as you would document key actions in a research project, it’s important to keep a clear record of your workflow. We suggest saving this documentation in a plain text file within the same folder as your data file for easy access and organization.
This might be an example of the setup as you work:
We will put these principles into practice today during your exercises.
Structuring data in spreadsheets
The cardinal rules of using spreadsheet programs for data:
- Put all your variables in columns - the thing you’re measuring, like ‘length’ or ‘attendance’.
- Put each observation in its own row.
- Don’t combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that’s the only way you’ll want to be able to use or sort that data.
- Leave the raw data raw - don’t mess with it!
- Export the cleaned data to a text based format like CSV. This ensures that anyone can use the data, and is the format required by most data repositories.
For instance, we have data from attendance and instruction for previous research data management workshops. Different people have entered data into a single spreadsheet. They keep track of things like date, number of attendees, and who delivered the workshop.
If they were to keep track of the data like this:
the problem is that the number of attendees of different types (post-graduate researcher (PGR), post-doctoral research associate (PDRA), and other) are in the same field. So if they wanted to look at attendance by post-graduate researchers, it would be hard to set up the data to do this. If instead we put attendee categories in different columns, you can see that it would be much easier.
Columns for variables and rows for observations
The rule of thumb, when setting up a datasheet, is columns = variables, rows = observations, cells = data (values).
So, instead we should have:
Exercise
We’re going to take a messy version of some library training data and clean it up.
- First download the data
- Open up the data in a spreadsheet program.
- You can see that there are three tabs. Various people have recorded training attendance statistics over 2016 and 2017, and they have kept track of the data in their own way. Now you’re being asked to evaluate the training programme and you want to be able to start doing statistics with the data.
- With the person next to you, work on the messy data so that a computer will be able to understand it. Clean up the 2016 and 2017 tabs, and put them all together in one spreadsheet.
- After you go through this exercise, we’ll discuss as a group what you think was wrong with this data and how you fixed it.
Important
Do not forget of our first piece of advice: create a new file for the cleaned data, and never modify the original (raw) data.
An excellent reference, in particular with regard to R scripting is
Resource
Hadley Wickham, Tidy Data, Vol. 59, Issue 10, Sep 2014, Journal of Statistical Software. http://www.jstatsoft.org/v59/i10.
Key Points
- Use one column for one variable
- Use one row for one observation
- Use one cell for one value
- Never modify your raw data. Always make a copy before making any changes.
- Keep all of the steps you take to clean your data in a plain text file.