Oceanographic data encompasses a wide variety of data formats, file sizes, and states of data completeness. Data of interest may be available from public repositories, collected on an individual basis, or some combination of these, and each type has its own set of challenges. This “10 Things” guide introduces 10 topics relevant to making oceanographic data FAIR: findable, accessible, interoperable, and reusable.
The goal of this lesson is to introduce oceanographers to FAIR data practices in their research workflow through 10 guided activities.
There are numerous data repositories for finding oceanographic data. Many of these are from official “data centers” and generally have well-organized and well-documented datasets available for free and public use.
At some point, you may want or need to deposit your own data into a data repository, so that others may find and build upon your work. Many funding agencies now require data collected or created with the grant funds to be shared with the broader community. For instance, the National Science Foundation (NSF) Division of Ocean Sciences (OCE) mandates sharing of data as well as metadata files and any derived data products. Finding the “right” repository for your data can be overwhelming, but there are resources available to help pick the best location for your data. For instance, OCE has a list of approved repositories in which to submit final data products.
High-quality metadata (information about the data, such as creator, keywords, units, flags, etc.) significantly improves data discovery. While metadata most often describes the data itself, it can also include information about the machines/instruments used, such as make, model, and manufacturer, as well as process metadata: details about any cleaning/analysis steps and scripts used to create data products.
Using controlled vocabularies in metadata allows for serendipitous discovery in user searches. Additionally, using a metadata schema to mark up a dataset can make your data findable to the world.
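As a minimal sketch of what such schema markup can look like, the snippet below builds a schema.org Dataset record as JSON-LD. Every field value here is a hypothetical placeholder, and the keywords use CF-style controlled-vocabulary terms:

```python
import json

# A minimal schema.org "Dataset" record expressed as JSON-LD.
# All field values are hypothetical placeholders.
record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example CTD profiles, 2020",
    "description": "Hypothetical temperature and salinity profiles.",
    # CF standard names serve as controlled-vocabulary keywords.
    "keywords": ["sea_water_temperature", "sea_water_salinity"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Person", "name": "A. Researcher"},
}

# Embedding this JSON-LD in a dataset landing page lets search
# engines and data aggregators index the dataset.
jsonld = json.dumps(record, indent=2)
print(jsonld)
```

Services such as Google Dataset Search harvest exactly this kind of embedded schema.org markup, which is what makes marked-up data findable to the world.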
Read this walkthrough of how to “FAIRify” a dataset using the data cleaning tool OpenRefine: https://docs.google.com/document/d/1hQ0KBnMpQq93-HQnVa1AR5v4esk6BRlG6NvnnzJuAPQ/edit#heading=h.v3puannmxh4u
Permanent identifiers (PIDs) are a necessary step for keeping track of data. Web links can break, or “rot”, and tracking down data based on a general description can be extremely challenging. A permanent identifier like a digital object identifier (DOI) is a unique ID assigned to a dataset to ensure that properly managed data does not get lost or misidentified. Additionally, a DOI makes it easier to cite and track the impact of datasets, much like cited journal articles.
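As a quick sketch, any DOI can be turned into a stable, citable link by prefixing it with the doi.org resolver (the DOI shown is the one used in the activity in this guide):

```python
# A DOI becomes a resolvable link via the doi.org proxy.
doi = "10.6075/J03N21KQ"
url = "https://doi.org/" + doi
print(url)  # https://doi.org/10.6075/J03N21KQ
```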
Identifiers exist for researchers as well: an ORCID is essentially a DOI for an individual researcher. This ensures that if you have a common name, change your name, change your affiliation, or otherwise change your author information, you still get credit for your own work and maintain a full, identifiable list of your scientific contributions.
Go to re3data.org and search for a data repository related to your research subject area. From the repository you choose, pick a dataset. Does it have a DOI? What is it? Who is the creator of that dataset? What is the ORCID of the author?
You’ve been given this DOI: 10.6075/J03N21KQ
Citing data properly is equally as important as citing journal articles and other papers. In general, a data citation should include: author/creator, date of publication, title of dataset, publisher/organization (for instance, NOAA), and unique identifier (preferably DOI).
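As a sketch, those elements can be assembled into a citation string; the function name and all example values below are hypothetical:

```python
def format_data_citation(creator, year, title, publisher, doi):
    """Build a simple data citation: creator, date, title, publisher, DOI."""
    return f"{creator} ({year}). {title}. {publisher}. https://doi.org/{doi}"

# Hypothetical example values:
citation = format_data_citation(
    "Smith, J.", 2021, "Glider temperature profiles", "NOAA", "10.1234/example"
)
print(citation)
# Smith, J. (2021). Glider temperature profiles. NOAA. https://doi.org/10.1234/example
```

Repositories often provide a ready-made citation of this shape on the dataset landing page; when one is offered, prefer it over assembling your own.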
Long-term data stewardship is essential for keeping data open and accessible well beyond the life of the original project.
Oceanographic data can include everything from maps and images to high-dimensional numeric data. Some data are saved in common, near-universal formats (such as CSV files), while others require specialized knowledge and software to open properly (e.g., netCDF). Explore the intrinsic characteristics of a dataset that influence the choice of format, such as a time series versus a regular 3-D grid of temperature varying in time; robust ways to connect the data with its metadata; size factors, such as binary versus ASCII files; and consider why the best format for storing or archiving data is not necessarily the best format for distributing it.
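To make the size trade-off between binary and ASCII concrete, this stdlib-only sketch stores 100,000 hypothetical temperature values both ways and compares the byte counts:

```python
import struct

# 100,000 hypothetical temperature values near 15 °C.
values = [15.0 + 0.000001 * i for i in range(100_000)]

# ASCII: one decimal string per line, as in a CSV column
# (9 characters per value at 6 decimal places, plus newlines).
ascii_bytes = len("\n".join(f"{v:.6f}" for v in values).encode())

# Binary: each value packed as a fixed 8-byte IEEE 754 double,
# as netCDF and other binary formats store floating-point data.
binary_bytes = len(struct.pack(f"{len(values)}d", *values))

print(ascii_bytes, binary_bytes)  # the binary encoding is smaller here
```

The gap widens as you demand more decimal precision from the text form, while the binary form stays at a fixed 8 bytes per value and round-trips exactly.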
Good data organization is the foundation of your research project. Data often has a longer lifespan than the project it is originally associated with and may be reused for follow-up projects or by other researchers. Data is critical to solving research questions, but lots of data are lost or poorly managed. Poor data organization can directly impact the project or future reuse.
Some research institutions and research funders now require a Data Management Plan (DMP) for new research projects. Let’s talk about the importance of a DMP and what a DMP should cover. Think about it: would you be able to create a DMP?
A Data Management Plan (DMP) documents how data will be managed, stored and shared during and after a research project. Some research funders are now requesting that researchers submit a DMP as part of their project proposal.
There are many Data Management Plan (DMP) templates in the DMPTool.
There are two aspects to reusability: reusable data, and reusable derived data/process products.
Reusable data is the result of successful implementation of the other “Things” discussed so far. Reusable data (1) has a license which specifies reuse scenarios, (2) is in a domain-suitable format and an “open” format when possible, and (3) is associated with extensive metadata consistent with community and domain standards.
What is often overlooked in terms of reusability are the products created to automate research steps. Whether it’s using the command line, Python, R, or some other programming platform, automation scripts in and of themselves are a useful output that can be reused. For example, data cleaning scripts can be reapplied to datasets that are continually updated, rather than starting from scratch each time. Modeling scripts can be re-used and adapted as parameters are updated. Additionally, these research automation products make any data-related decision you made explicit: if future data users have questions about exclusions, aggregations, or derivations, the methodology used is transparent in these products.
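As a sketch of such a reusable cleaning script (the column names and QC-flag convention here are hypothetical), note how the exclusion rule is stated once, in code, where future data users can read it:

```python
import csv
import io

def clean_rows(reader, max_flag=2):
    """Keep only rows whose QC flag is at or below max_flag.

    The exclusion rule lives here, in one place, so future users can
    see exactly which observations were dropped and why. Re-running
    this on an updated dataset reapplies the same rule automatically.
    """
    for row in reader:
        if int(row["qc_flag"]) <= max_flag:
            yield row

# Hypothetical input: temperature observations with numeric QC flags.
raw = (
    "time,temp_c,qc_flag\n"
    "2021-01-01,14.8,1\n"
    "2021-01-02,99.9,4\n"
    "2021-01-03,15.1,2\n"
)
cleaned = list(clean_rows(csv.DictReader(io.StringIO(raw))))
print([row["temp_c"] for row in cleaned])  # ['14.8', '15.1']
```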
When working with your data, a range of proprietary and open source tools is available for conducting your research analysis.
Open source tools are software tools whose source code is openly published for use and/or modification by anyone, free of charge. There are many advantages to using open source tools:
Caution: be selective with the tools you use
There are additional benefits you may hear about when using open source tools:
Keep in mind that while, in an ideal world, every tool would deliver all of these benefits, not every open source tool does. When selecting an open source tool, choose a package with a large community of users and developers and a demonstrated record of long-term support.
If open source tools are not an option and commercial software is necessary for your project, there are benefits and issues to consider when using proprietary or commercial software tools.
Reproducibility increases impact, credibility, and reuse.
Read through the following best practices to make your work reproducible.
Making your project reproducible from the start of the project is ideal.
This is useful not only for anyone else who wants to test your analysis - often the primary beneficiary is you!
Research projects often take months, if not years, to complete, so by keeping reproducibility in mind from the beginning, you can save yourself time and energy later on.
Think about a project you have completed or are currently working on.
APIs (Application Programming Interfaces) allow programmatic access to many databases and tools. They can directly access or query existing data, without the need to download entire datasets, which can be very large.
Certain software platforms, such as R and Python, often have packages available to facilitate access to large, frequently used database APIs. For instance, the R package “rnoaa” can access and import various NOAA data sources directly from the R console. You can think of it as using an API from the comfort of a tool you’re already familiar with. This not only saves time and computer memory, but also ensures that as databases are updated, so are your results: re-running your code automatically pulls in new data (unless you have specified a more restricted date range).
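In the same spirit, an API query can often be expressed as a plain URL. The sketch below constructs an ERDDAP tabledap request against the Spray glider server used in the activity in this guide; the variable names and the time constraint are assumptions that may differ from the dataset's actual variables, and no network request is made here:

```python
from urllib.parse import quote

# Base dataset URL on the Spray ERDDAP server; the variables and
# constraint below are hypothetical examples.
base = "https://spraydata.ucsd.edu/erddap/tabledap/binnedCUGN90"
variables = ["time", "depth", "temperature"]
constraint = "time>=2020-01-01"

# ERDDAP tabledap queries take the form:
#   <dataset>.<format>?<comma-separated variables>&<constraints>
query_url = f"{base}.csv?{','.join(variables)}&{quote(constraint, safe='=>')}"
print(query_url)
```

The extension (`.csv` here, or `.nc`, `.json`, etc.) tells ERDDAP which format to return, and the constraints limit the download to just the rows you need rather than the entire dataset.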
On the ERDDAP server for Spray Underwater Glider data, select temperature data for the line 90 (https://spraydata.ucsd.edu/erddap/tabledap/binnedCUGN90.html).