Top 10 FAIR Data & Software Things

Geoscience


Sprinters:

John Brown, Janice Chan, Niamh Quigley (Curtin University, Perth, Western Australia)

Audience:

Researchers

Things

Findable

Thing 1: Data sharing and discovery

Thing 6: Vocabularies for data description

Thing 7: Identifiers and linked data

Thing 10: Spatial data

Accessible

Thing 2: Long-lived data: curation & preservation

Thing 3: Data citation for access & attribution

Thing 4: DOIs and citation metrics

Interoperable

Thing 4: DOIs and citation metrics

Thing 6: Vocabularies for data description

Thing 7: Identifiers and linked data

Thing 9: Exploring APIs and Apps

Reusable

Thing 5: Licensing data for reuse

Thing 8: What are publishers & funders saying about data?

Thing 1: Data sharing and discovery

Activity 1: Data discovery

Data repositories enable others to find existing data by publishing data descriptions (“metadata”) about the data they hold, much like a library catalogue describes the resources held in a library. Also, repositories often provide access to the data itself and some even provide ways for users to explore that data. Many research funding requirements reference researchers depositing their data into data repositories (which we’ll discuss later in Thing 8).

Data portals or aggregators draw together research data records from a number of repositories. Because of the huge amounts of data available, they sometimes focus on data from one discipline or geographic region. The EU Open Data Portal is an example that aggregates metadata records from over 30 European national data repositories, and the US Government’s Open Data portal data.gov aggregates from over 100 US government agencies.

  1. Look at this Data.gov.au record from Geoscience Australia: Lord Howe Rise Marine Survey 2017.
    • Examine the Description and Additional Info fields to see the ways that Geoscience Australia has made this record findable to other researchers. If you knew about this data portal, would you be able to easily find this dataset if it was relevant to your research?
  2. Spend a few minutes exploring the Scottish Spatial Data Infrastructure Metadata Portal.
    • Try browsing or searching on a topic of interest.
    • Explore a record and see where it came from and if there’s a way to contact the creator.
    • Have a look at the map and see if you can find and add a map layer relating to fishing.
  3. Look at EarthChem.

Consider: If your research appeared in the right data portal or repository, what things might result from that for yourself? What about your discipline?

Activity 2: Finding data repositories

  1. Choose one of the specialised data repositories below, or find another data repository on re3data.org (perhaps one outside your particular focus area) and spend some time browsing around your chosen repository to get a feel for the data available.
  2. Think about how the data here differs from data you are familiar with, for example, in format, size and access method.

Consider: Could you apply a dataset from one of these repositories to your own work? Would you need to change file formats or learn a new software package?

Thing 2: Long-lived data: curation & preservation

Activity 1: Preserving born digital objects

Information sources that were commonly used in the past, such as maps and handwritten observation notes, can easily survive for years, decades or even centuries. However, because most current research is done on computers, it’s important to remember that digital items require special care to keep them usable over time.

  1. This video (2.5 min) from the US Library of Congress shows the vulnerability of “born digital” objects like research data: they are fragile; they are dependent on software and hardware; and they require active management.
  2. Look at the ANDS page on file formats.

Consider: If your research was put into a time capsule and unearthed in 50 years’ time, would future researchers be able to determine if your research is still useful to them? If you were allowed to update the time capsule every 5 years, what would you change to make it easier for those unearthing it?

Activity 2: Readme files

One way that researchers can ensure their data is useful in the future is to package their data with an explanation that can be opened without any special software. These explanatory files mean that anyone who finds the data will know if the data is useful to them and hopefully won’t have any questions for the original researcher, who may no longer be available or may not remember the details. The files are usually called “readme” files in the hope that by reading the file, all the important questions will be answered.

  1. Read the Guide to writing “readme” style metadata from the Cornell Research Data Management Service Group and create a readme.txt file for one of your own datasets. Don’t forget to include notes on software versions used, methodology and any special things you’d tell a colleague if you were giving them the data yourself!
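As a starting point for the readme activity above, here is a minimal Python sketch that assembles a plain-text readme from a few fields. The field names, layout and example values are illustrative assumptions, not the structure recommended by the Cornell guide; adapt them to whatever that guide and your discipline suggest.

```python
from pathlib import Path


def build_readme(title, creator, software, methodology, notes):
    """Return plain-text readme content for a dataset.

    The layout is a hypothetical example, not a standard.
    `software` is a list of (name, version) pairs; `notes` is a list of strings.
    """
    lines = [
        f"Title: {title}",
        f"Creator: {creator}",
        "",
        "Software and versions:",
    ]
    lines += [f"  - {name} {version}" for name, version in software]
    lines += ["", "Methodology:", f"  {methodology}", "", "Notes for reusers:"]
    lines += [f"  - {note}" for note in notes]
    return "\n".join(lines) + "\n"


def write_readme(path, text):
    """Save the readme as UTF-8 plain text so it opens without special software."""
    Path(path).write_text(text, encoding="utf-8")


# Hypothetical example values, for illustration only.
example = build_readme(
    title="Example sediment core measurements",
    creator="A. Researcher",
    software=[("QGIS", "3.28"), ("Python", "3.11")],
    methodology="Cores were logged at 1 cm intervals.",
    notes=["Depths are in metres below the sea floor."],
)
print(example)
```

Calling write_readme("readme.txt", example) would save the text next to your data files.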

Thing 3: Data citation for access & attribution

Activity 1: Citing research data

When authors cite an article they have used ideas from, they formally and publicly acknowledge the work of the earlier author. Data citation works in the same way: when new work cites a dataset, the researchers who created that dataset get formal and public credit for their contribution. Along with books, journals and other scholarly works, it is now possible to formally cite research datasets and even the software that was used to create or analyse the data.

  1. Have a look at the Geophysical, hydraulic and mechanical properties of synthetic versus natural sandstones under variable stress conditions dataset from the British Geological Survey (https://www.bgs.ac.uk/services/ngdc/citedData/catalogue/a59128b5-8e7f-4100-b0ff-87325438435b.html). If someone wanted to use this dataset for further research, would they know how to give credit to the creator of the original dataset?
  2. Find a DOI of a dataset from one of the repositories you found in Thing 1 and enter it into the DOI Citation Formatter: https://citation.crosscite.org/. If you saw the citation, would you know how to go about accessing the data?
  3. Read the article, “Sharing Detailed Research Data Is Associated with Increased Citation Rate” – why would it be that papers that make their data openly available get better citation counts? Would you feel more confident citing another person’s work if you knew?
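Formatted citations like those produced by the DOI Citation Formatter can also be requested programmatically through DOI content negotiation. The sketch below only builds the request and does not send it. The function name is hypothetical, the DOI is the Research Data Australia example used in Thing 4, and whether a particular DOI returns a formatted citation depends on its registration agency, so treat this as an illustration rather than a guaranteed recipe.

```python
from urllib.parse import quote
from urllib.request import Request


def citation_request(doi, style="apa"):
    """Build (but do not send) a request for a formatted citation of a DOI.

    Uses the text/x-bibliography media type from DOI content negotiation;
    `style` names a citation style such as "apa".
    """
    url = "https://doi.org/" + quote(doi, safe="/")
    headers = {"Accept": f"text/x-bibliography; style={style}"}
    return Request(url, headers=headers)


req = citation_request("10.4225/06/577F022BA6954")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would perform the lookup over the network; that step is left out here so the example runs offline.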

Consider: Data citation is a relatively new concept in the scholarly landscape and as yet, is not routinely done by researchers, or demanded by journals. What could be done to encourage routine citation of research data and software associated with research outputs?

Activity 2: Citing software

The increase in available computational power over the last 50 years has led to a massive increase in the usage of computational analysis methods in geoscience.

As such techniques become more commonplace, it’s important to distinguish between the data itself, the tools used to analyse data and any discrete components within those tools. In some cases, a particular function of the software is critical to the analysis process; in other cases the critical part is an interchangeable block of code within that software package. Recognising the difference between these two is important as it changes who gets credit for their previous work and who gets left unsung.

It’s not always easy to know which to cite, but trying to give recognition for the creation of software and software components can make huge impacts on the career of a researcher, especially if they create scientific software!

  1. Read the How to cite software guide from the MIT Libraries (https://libguides.mit.edu/c.php?g=551454&p=3900280).
  2. Read the blog post Adding CITATION to your R package.

Consider: If you wrote a package of code for a computer program to run and made it freely available to your colleagues to solve a problem in your field, would they know how they could give you credit in their work? Would they think that you would want attribution?

Thing 4: DOIs and citation metrics

DOIs are unique identifiers that enable data citation, metrics for data and related research objects, and impact metrics. Citation analysis and citation metrics are important to the academic community. Find out where data fits in the citation picture.

Activity 1: DOIs

Digital Object Identifiers (DOIs) are a type of ‘persistent identifier’. DOIs are unique identifiers that provide persistent access to published articles, datasets, software versions and a range of other research inputs and outputs. There are over 120 million Digital Object Identifiers (DOIs) in use, and in 2016 DOIs were “resolved” (clicked on) over 5 billion times!

Each DOI is unique but a typical DOI looks like this: http://dx.doi.org/10.4225/06/577F022BA6954

  1. Start by watching this short 4.5-minute video Persistent identifiers and data citation explained from Research Data Netherlands. It gives you a succinct, clear explanation of how DOIs underpin data citation.
  2. Have a look at the poster Building a culture of data citation and follow the arrows to see how DOIs are attached to data sets and are used in data citation.
  3. Let’s go to a Research Data Australia data record which shows how DOIs are used. Click on this DOI to ‘resolve’ the DOI and take us to the record: http://dx.doi.org/10.4225/06/577F022BA6954.
  4. Click on the Cite icon on the upper left of the record (under the green Access the data tab). No matter where the DOI appears it always resolves back to its original dataset record to avoid duplication. i.e. many records, one copy.
  5. DOIs can also be applied to grey literature, a term that refers to research that is either unpublished or has been published in non-commercial form, such as government reports. For example, reports like this: http://doi.org/10.4225/06/583d354b89060.
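A DOI consists of a prefix (beginning with "10.") and a suffix, separated by the first slash. The following Python sketch splits the two example DOIs from this activity into those parts and rebuilds a resolver link. The function names are hypothetical and the validation is deliberately minimal.

```python
from urllib.parse import urlparse


def split_doi(doi_or_url):
    """Return (prefix, suffix) for a bare DOI or a DOI resolver URL."""
    text = doi_or_url.strip()
    if text.lower().startswith(("http://", "https://")):
        # Drop the resolver host and keep only the DOI itself.
        text = urlparse(text).path.lstrip("/")
    prefix, _, suffix = text.partition("/")
    if not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi_or_url!r}")
    return prefix, suffix


def resolver_url(doi_or_url):
    """Return an https://doi.org/ link for the given DOI."""
    prefix, suffix = split_doi(doi_or_url)
    return f"https://doi.org/{prefix}/{suffix}"


for example in (
    "http://dx.doi.org/10.4225/06/577F022BA6954",
    "http://doi.org/10.4225/06/583d354b89060",
):
    print(split_doi(example), resolver_url(example))
```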

Activity 2: IGSNs

The International Geo Sample Number (IGSN) is designed to provide an unambiguous, globally unique persistent identifier for physical samples. It facilitates the location, identification, and citation of physical samples used in research.

Each IGSN is unique, but a typical IGSN looks like this: IEEVB00C3. The first five characters of the IGSN represent a name space (a unique user code) that uniquely identifies the person or institution that registers the sample. The last four characters of the IGSN are a random string of alphanumeric characters (0-9, A-Z).

  1. Start by reading this brief introduction to IGSN.
  2. Review the scope and capability of each IGSN allocation agent listed on the IGSN website and consider which allocation agent is most appropriate for your samples.
  3. Have a look at an IGSN record https://app.geosamples.org/sample/igsn/IEEVB00C3 which displays what information about the sample was recorded.
  4. Now have a look at how IGSNs are referenced in a dataset record http://get.iedadata.org/doi/100548.
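The IGSN layout described in this activity (a five-character name space followed by four alphanumeric characters) can be sketched in Python as follows. The nine-character pattern is an assumption taken from that description only; IGSNs issued by other allocation agents may be structured differently, so this is not a general validator.

```python
import re

# Assumption from the description above: nine characters, 0-9 and A-Z.
IGSN_PATTERN = re.compile(r"[A-Z0-9]{9}")


def split_igsn(igsn):
    """Return (name_space, sample_code) for an IGSN in the layout above."""
    code = igsn.strip().upper()
    if not IGSN_PATTERN.fullmatch(code):
        raise ValueError(f"unexpected IGSN format: {igsn!r}")
    return code[:5], code[5:]


name_space, sample_code = split_igsn("IEEVB00C3")
print(name_space, sample_code)
```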

Consider: How are you managing your physical samples? The ANDS IGSN minting service may be used by Australian researchers at no cost. Do you know of a service provider in your region?

Activity 3: Altmetrics

Data citation best practice, as discussed in Thing 3, enables citation metrics for data to be tracked and analysed. Data citations are available from the Clarivate Data Citation Index, which is a commercial product.

Altmetrics are alternative measures that help you understand the influence of your work. They include metrics such as the number of views, downloads, and mentions in policy documents, social media and social bookmarking platforms associated with any research output that has a DOI or other persistent identifier. Because of their immediacy, altmetrics can be an early indicator of the impact or reach of a dataset, long before formal citation metrics can be assessed.

  1. Start by looking at the altmetrics for this phylogenomics article published in Science. Note the usage statistics, including number and pattern of downloads, for this article since it was published in November 2014.
  2. Now click on the “donut” or the link to ‘See More Details’ to see the wealth of information available.
  3. Look also at the associated data in Dryad noting that the data has been assigned a DOI. Can you see how many times the data has been downloaded and the record viewed (scroll down to the bottom of the record)?

By way of comparison, as of early November 2018:

Consider: Do you think altmetrics for data have value in academic settings? Why, or why not?

Thing 5: Licensing data for reuse

Understand the importance of data licensing, learn about Creative Commons and find out how enabling reuse of data can speed up research and innovation.

Activity 1: Why license research data?

Consider this scenario: You’ve found a dataset you are interested in. You’ve downloaded it. Excellent! But do you know what you can and cannot do with the data? The answer lies in data licensing. Licensing is critical to enabling data to be reused and cited.

  1. Start by reading this brief introduction to licensing research data.
  2. Now watch this Creative Commons Licensing introductory video or have a closer look at the Understanding CC Licences poster.
  3. Check out the licence chooser from Creative Commons, which walks you through the decision of which licence is appropriate for your purpose.

Consider: If you were considering licensing a dataset on something which may have commercial value to others - what licence would you apply?

Activity 2: Data licences: unlock data for innovation

Enabling reuse of data can speed up research and innovation. Licensing is critical to enabling data reuse.

  1. Start by watching this video (4.30 min) in which Dr Kevin Cullen from the University of New South Wales explains the University’s approach to licensing, which aims to strengthen its relationship with business and industry.
  2. Check out the data standards of Geoscience Australia, which refers to the Australian government policy on Public Data. Which Creative Commons licence is applied to government data by default?
  3. Since November 2009, Geoscience Australia has officially adopted Creative Commons Attribution as the default licence for its website. That means thousands of products and datasets available through the website are free to be reused.
  4. See the range of data products and licences available from the British Geological Survey.

Does your institution have policies or guidelines around data licensing?

Activity 3: Data licensing in practice

Not all research data that is shared is licensed for reuse. It should be!

  1. Explore the following data repositories:
  2. Or review the following example records:
  3. Do all data repositories or metadata catalogues enable users to refine a search by licence? Look closely at the specific licensing information on a small sample of those records with ‘open’ licences. How easy or difficult is it to work out if the data can or can’t be reused, e.g. for commercial purposes or with international collaborators?

Consider: Assigning open licences is not routine. Suggest one tip for encouraging uptake of ‘open’ licensing.

Thing 6: Vocabularies for data description

In addition to selecting a metadata standard or schema, whenever possible you should also use a controlled vocabulary.

Activity 1: What is controlled vocabulary?

A controlled vocabulary provides a consistent way to describe data - location, time, place name, subject. Read this short explanation of controlled vocabularies.

Controlled vocabularies significantly improve data discovery. They make data more shareable with researchers in the same discipline because everyone is ‘talking the same language’ when searching for specific data, e.g. plants, animals, medical conditions, places etc.

If you have time, have a look at Controlling your Language: a Directory of Metadata Vocabularies from JISC in the UK. Make sure you scroll down to 5. Conclusion - it’s worth a read.

Activity 2: Controlled vocabularies in action

We are going to see some controlled vocabularies in action in the Atlas of Living Australia (ALA).

  1. Do a search in the ALA search engine. Type “whale” in the search box and click on search. Choose one of the records listed and click on the (red text) View record link.
  2. Any metadata field where you see Supplied… tells you that the information supplied by the person who submitted the record (often a ‘citizen scientist’) has been changed to the controlled vocabulary being used in metadata fields e.g. Observer, Record date and Common name.
  3. Have a scroll down the record and consider how many of the metadata fields probably have a controlled vocabulary in use (e.g. taxonomy, geospatial etc.).
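The idea of mapping a supplied value to a controlled term can be sketched in a few lines of Python. The vocabulary below is a tiny hypothetical example made up for illustration; it is not the vocabulary used by the Atlas of Living Australia, and real services use much richer matching.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted variants.
VOCABULARY = {
    "Humpback Whale": {"humpback", "humpback whale", "megaptera novaeangliae"},
    "Southern Right Whale": {"southern right whale", "eubalaena australis"},
}


def to_preferred(supplied):
    """Return the preferred term for a supplied value, or None if unknown."""
    key = " ".join(supplied.lower().split())  # normalise case and spacing
    for preferred, variants in VOCABULARY.items():
        if key in variants:
            return preferred
    return None


def describe(supplied):
    """Keep both the supplied value and the controlled term, as a record might."""
    return {"supplied": supplied, "common_name": to_preferred(supplied)}


print(describe("  Humpback  whale "))
print(describe("dolphin"))
```

Keeping the supplied value alongside the standardised one means nothing the submitter wrote is lost, while searches can rely on the controlled term.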

If you have time: have a browse around the stunning level of data description and data contained in the Atlas of Living Australia.

Activity 3: Geoscience vocabularies

Explore some examples of vocabularies used in geoscience:

Consider: Do you use controlled vocabularies to describe your data? How would you encourage other researchers to use them?

Thing 7: Identifiers and linked data

ORCID is a unique identifier for researchers. Many research data repositories record your ORCID when you submit research data for publication.

Activity 1: Check your ORCID

In your ORCID record, datasets you have published will be displayed in the Works section.

Log into ORCID now and check your details are up to date, including:

If you don’t already have an ORCID, you can get one. This Curtin University webpage has information on how to get the most out of your ORCID.

Activity 2: Get more from your ORCID

ORCID populates your ORCID record from many sources, one of which is peer review activities. Publishers such as the American Geophysical Union Publications now send details of peer review activities to ORCID.

Activity 3: Identifiers and linked data

Because they are unique identifiers, ORCIDs can be used to link data from different datasets together. GeoLink is a network of Linked Data from multiple data repositories.

  1. Go to the portal for the GeoLink demo.
  2. Choose an entity e.g. Datasets, Cruises, Vessels, Instruments, Researchers and explore! The Help guide is here.
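To illustrate how a shared identifier links records across sources, here is a minimal Python sketch that groups records from two collections by ORCID. All records are hypothetical; the ORCID shown is the identifier ORCID publishes for its fictitious example researcher. This is a plain dictionary join for illustration, not how GeoLink itself is implemented.

```python
from collections import defaultdict

# Hypothetical records from two separate repositories.
COLLECTIONS = {
    "datasets": [
        {"title": "Example cruise bathymetry", "orcid": "0000-0002-1825-0097"},
        {"title": "Example seismic lines", "orcid": None},  # no identifier recorded
    ],
    "samples": [
        {"title": "Example dredge sample", "orcid": "0000-0002-1825-0097"},
    ],
}


def link_by_orcid(collections):
    """Group record titles from every collection under the creator's ORCID."""
    linked = defaultdict(list)
    for source, records in collections.items():
        for record in records:
            if record["orcid"]:  # records without an ORCID cannot be linked
                linked[record["orcid"]].append((source, record["title"]))
    return dict(linked)


for orcid, items in link_by_orcid(COLLECTIONS).items():
    print(orcid, items)
```

The record without an ORCID is left out of the result, which shows why recording identifiers at deposit time matters.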

Thing 8: What are publishers & funders saying about data?

Geoscience research data is a world heritage. Researchers share the responsibility with research institutions and funders of ensuring their data is well-documented, preserved and openly available.

Many publishers have special requirements for the citation of data in publications. This can be in the form of compliance with a data policy, author guidelines or the completion of a Data Availability Statement.

Activity 1: Research data and scholarly publishing

Have a look at the Nature Data Availability Statement examples or the PLOS Data Availability policy to get an idea of what publishers expect.

COPDESS is The Coalition for Publishing Data in the Earth and Space Sciences, and they have collected links to author instructions and data policies for some geoscience journals, publishers and funders.

Activity 2: Research funders and data sharing

Activity 1 has shown us that it’s becoming more common for journals and publishers to demand your data be made available when you seek to publish. However, if your research is publicly funded, it’s almost guaranteed that your grant and funding obligations will require you to make your data publicly available at the end of your project: the outputs of research funded by a population should be made available to that population.

The Australian Research Council’s data management requirements state that funded researchers are expected to follow the OECD Principles and Guidelines for Access to Research Data from Public Funding. Similar principles are outlined by UK Research and Innovation (UKRI) in its Guidance on best practice in the management of research data document.

Consider: If you were on a funding panel and were asked to assess a grant with a clear plan for making the data openly available, would you rate the future impact of that proposal better or worse than one with a poorly defined plan?

Thing 9: Exploring APIs and applications

Geoscience has many specialised services, applications and APIs that can be used to directly access and harness existing research data. Some are free and some are subscription-based, but your research institution may have access.

Activity 1: Try an app

Activity 2: APIs

APIs (Application Programming Interfaces) are software services that allow you to access structured data or systems held by someone else. These are usually provided so that developers can access data held by an organisation on demand, rather than them having to hold an entire dataset (which may not be possible due to security, space requirements or if the dataset is constantly changing). Some companies charge for using their APIs, but many research-oriented organisations provide their APIs for free so that other organisations can link in to their knowledge.
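To make the idea concrete, here is a minimal Python sketch of working with a search API. It assumes a data portal built on CKAN, whose Action API offers a package_search endpoint; this section does not name a specific API, so the choice of CKAN, the portal address data.example.org and the sample response are all illustrative assumptions. The sketch builds the request URL and parses a canned response, so it runs without network access.

```python
import json
from urllib.parse import urlencode


def search_url(base_url, query, rows=5):
    """Build a CKAN package_search URL for a free-text query."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text):
    """Extract dataset titles from a CKAN-style JSON search response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError("the API reported a failed request")
    return [dataset["title"] for dataset in payload["result"]["results"]]


# Hypothetical response in the shape CKAN returns, for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"title": "Example marine survey", "license_title": "CC BY 4.0"},
            {"title": "Example gravity grid", "license_title": "CC BY 4.0"},
        ],
    },
})

print(search_url("https://data.example.org", "marine survey"))
print(dataset_titles(SAMPLE_RESPONSE))
```

In real use you would fetch the URL (for example with urllib.request.urlopen) and pass the response body to dataset_titles.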

Consider: If you could systematically access and integrate the data provided from one of the sources above, can you think of a way you could enrich the outputs of your own research?

Thing 10: Spatial data

The importance of spatial data is ever increasing. Many of the societal challenges we face today such as food scarcity and economic growth are inherently linked to big spatial data. In fact, it is often said that 80% of all research data has a geographic or spatial component. It is useful then, for all of us to have an understanding of spatial data.

Activity 1: Spatial data: Maps and more

  1. Start by watching this incredible, inspiring video (3.59 min) from the University of Wollongong’s PetaJakarta project. It shows innovative ways of combining social media and geospatial data to save lives.
  2. Now read The Application of Geographic Information Science in Earth Sciences.
  3. This video combines a range of different data visualisations depicting the human impacts on our environment.
  4. Geospatial data is fundamental to Australia’s economic future. Check out this very short article about how Geoscience Australia is mapping the mineral potential of our continent - a world first!

Just for fun: Enter your address in the Atlas of Living Australia and see what birds and plants have been reported in your street or suburb. You may be surprised at how ‘alive’ your street is!

Consider: Why do you think these geospatial visualisations are so powerful?

Activity 2: Spatial data concepts

There are many types and sources of geospatial data. If you are new to the world of geospatial data, you will probably appreciate some ‘busting’ of the jargon of geospatial data.

  1. Start by reading this Fundamentals Chapter to learn more about maps, projections, coordinate systems, datums and GIS.
  2. Want more? Continue with this blog about Finding and Making Sense of Geospatial Data on the Internet which explains some basic geospatial data file formats and concepts.
  3. Prefer watching? Most of these concepts are also explained in this video.
  4. Read more about two important aspects of spatial data: scale and resolution.

Consider: How would you give an explanation of two new terms you have just learnt?

Activity 3: Using and visualising spatial data

Spatial data can be used in many ways, and there are many tools that you can use to manipulate and display spatial data.

You can try one of the tools below. Do one, or do them all and compare the results.

  1. 13 Free GIS Software Options: Map the World in Open Source
    • Browse through this site for ideas for free, open source geospatial software; the descriptions often include discipline specific advice. Download one and try your hand at mapping.
  2. Spatial data visualisation with R: For those who have done the R modules in Software Carpentry - this might be a good activity to flex your R muscles! Want more? Here are some more R tutorials.
  3. Create a map using Google Fusion Tables: This offers lots of features, but you need a Google account. The excellent Google Fusion tutorial uses butterfly data to show you how to import data, map the data and customise your map.

The Open Geospatial Consortium (OGC) is an international not-for-profit organization that develops open standards for the geospatial community. Through its global membership, the OGC has developed several standards for sharing geospatial data. Some of the most commonly used standards are:

  1. Web Map Service (WMS): A standard web protocol to query and access geo-registered static map images as a web service. The outputs are images that can be displayed in a browser application.
  2. Web Feature Service (WFS): A standard web protocol to query and extract geographic features of a map; these are typically attributes of a map. The latest version of WFS (3.0, Dec 2017) has created a lot of excitement in the community.
  3. Web Coverage Service (WCS): Provides access to geospatial information representing phenomena that are variable over space and time, such as satellite images or aerial photos. The service delivers a raster image that can be further interpreted and processed.

GeoServer is the most popular open source reference implementation of the WMS, WFS and WCS standards.
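As an illustration of how a WMS map image is requested, the Python sketch below builds a GetMap URL using the standard WMS 1.3.0 request parameters. The server address and layer name are hypothetical placeholders; a real service advertises its layers, formats and supported coordinate reference systems in its GetCapabilities response. CRS:84 is used here so the bounding box is given in longitude, latitude order.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url, layer, bbox, width=800, height=400,
                   image_format="image/png"):
    """Build a WMS 1.3.0 GetMap URL.

    `bbox` is (min_lon, min_lat, max_lon, max_lat) in CRS:84.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value asks for the layer's default style
        "CRS": "CRS:84",
        "BBOX": ",".join(str(value) for value in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical server and layer; the box roughly covers Australia.
url = wms_getmap_url(
    "https://example.org/geoserver/wms",
    "demo:surface_geology",
    (112.0, -44.0, 154.0, -10.0),
)
print(url)
```

Opening such a URL in a browser returns a static image, which is the behaviour described for WMS above; WFS and WCS requests follow a similar key-value pattern with their own operations.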

Consider: The data world is hungry for Geospatial tools and metadata and there is growing demand for people with these skills. How can these skills be encouraged in your institution?

References:

ANDS 23 (Research Data) Things https://www.ands.org.au/working-with-data/skills/23-research-data-things/all23

10 Eco Data Things https://www.ands.org.au/__data/assets/pdf_file/0003/1376121/10-Eco-Data-Things_handout.pdf