Thing 1: Data sharing and discovery
Thing 2: Long-lived data: curation & preservation
Thing 3: Data citation for access & attribution
Thing 4: DOIs and citation metrics
Thing 5: Licensing data for reuse
Thing 6: Vocabularies for data description
Thing 7: Identifiers and linked data
Thing 8: What are publishers & funders saying about data?
Thing 9: Exploring APIs and Apps
Thing 10: Spatial data
Data repositories enable others to find existing data by publishing descriptions (“metadata”) of the data they hold, much as a library catalogue describes the resources held in a library. Repositories also often provide access to the data itself, and some even provide ways for users to explore that data. Many research funders’ requirements now refer to researchers depositing their data into data repositories (which we’ll discuss later in Thing 8).
Data portals or aggregators draw together research data records from a number of repositories. Because of the huge amounts of data available, they sometimes focus on data from one discipline or geographic region. For example, the EU Open Data Portal aggregates metadata records from over 30 European national data repositories, and the US Government’s open data portal, data.gov, aggregates records from over 100 US government agencies.
Consider: If your research appeared in the right data portal or repository, what might result from that for you? What about for your discipline?
Consider: Could you apply a dataset from one of these repositories to your own work? Would you need to change file formats or learn a new software package?
Information sources that were commonly used in the past, such as maps and handwritten observation notes, can easily survive for years, decades or even centuries. However, because most current research is done on computers, it’s important to remember that digital items require special care to keep them usable over time.
Consider: If your research was put into a time capsule and unearthed in 50 years’ time, would future researchers be able to determine if your research is still useful to them? If you were allowed to update the time capsule every 5 years, what would you change to make it easier for those unearthing it?
One way that researchers can ensure their data is useful in the future is to package it with an explanation that can be opened without any specialised software. These explanatory files mean that anyone who finds the data will know whether it is useful to them, and hopefully won’t need to ask the original researcher, who may not be available or may not remember the details. The files are usually called “readme” files, in the hope that by reading the file all the important questions will be answered.
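A minimal sketch of what such a readme might contain; the file names, field labels and contact details below are invented for illustration, not a prescribed standard:

```text
README for: coastal_erosion_survey_2018 (illustrative example)

Creator(s):      J. Citizen, Example University
Contact:         j.citizen@example.edu
Date collected:  2018-01-15 to 2018-03-20
Description:     Beach profile measurements at three sites, collected with
                 an RTK GPS; one CSV file per site.
Files:           site_A.csv, site_B.csv, site_C.csv
Column units:    distance_m (metres), elevation_m (metres above datum)
Licence:         CC BY 4.0
Related DOI:     (add the dataset DOI once it is minted)
```

Because it is plain text, a file like this will still open long after the analysis software has changed.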
When authors cite an article whose ideas they have used, they formally and publicly acknowledge the work of the earlier author. Data citation works in the same way: by citing data created by earlier researchers, authors give those researchers formal and public credit for their contribution to the new work. Along with books, journals and other scholarly works, it is now possible to formally cite research datasets and even the software that was used to create or analyse the data.
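As an illustration, a dataset citation in the general style recommended by DataCite (creator, publication year, title, publisher, identifier) might look like the following; the dataset, repository and DOI are invented for the example:

```text
Citizen, J. (2018): Beach profile survey, New South Wales coast, 2018.
Example University Data Repository. https://doi.org/10.0000/example
```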
Consider: Data citation is a relatively new concept in the scholarly landscape and as yet, is not routinely done by researchers, or demanded by journals. What could be done to encourage routine citation of research data and software associated with research outputs?
The increase in available computational power over the last 50 years has led to a massive increase in the usage of computational analysis methods in geoscience.
As such techniques become more commonplace, it’s important to distinguish between the data itself, the tools used to analyse data and any discrete components within those tools. In some cases, a particular function of the software is critical to the analysis process; in other cases the critical part is an interchangeable block of code within that software package. Recognising the difference between these two is important as it changes who gets credit for their previous work and who gets left unsung.
It’s not always easy to know which to cite, but giving recognition for the creation of software and software components can have a huge impact on the career of a researcher, especially one who creates scientific software!
Consider: If you wrote a code package that solves a problem in your field and made it freely available to your colleagues, would they know how to give you credit in their work? Would it occur to them that you would want attribution?
DOIs are unique identifiers that enable data citation and citation metrics for data and related research objects. Citation analysis and citation metrics are important to the academic community, so it is worth finding out where data fits into the citation picture.
Digital Object Identifiers (DOIs) are a type of ‘persistent identifier’. They are unique identifiers that provide persistent access to published articles, datasets, software versions and a range of other research inputs and outputs. There are over 120 million DOIs in use, and in 2016 DOIs were “resolved” (clicked on) over 5 billion times!
Each DOI is unique but a typical DOI looks like this: http://dx.doi.org/10.4225/06/577F022BA6954
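Because every DOI resolves through the central doi.org service, DOIs can also be used programmatically. Below is a small sketch that fetches citation metadata for the example DOI above using content negotiation and the Python requests library; it assumes the DOI’s registration agency supports the CSL JSON format (major agencies such as DataCite and Crossref do):

```python
import requests

# Resolve a DOI via content negotiation to retrieve its citation metadata
# as CSL JSON, rather than being redirected to the landing page.
doi = "10.4225/06/577F022BA6954"  # the example DOI shown above
response = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
response.raise_for_status()
metadata = response.json()

print(metadata.get("title"))      # dataset title
print(metadata.get("publisher"))  # publishing repository
```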
International Geo Sample Numbers (IGSNs) are designed to provide an unambiguous, globally unique, persistent identifier for physical samples. They facilitate the location, identification and citation of physical samples used in research.
Each IGSN is unique, but a typical IGSN looks like this: IEEVB00C3. The first five characters of the IGSN represent a namespace (a unique user code) that identifies the person or institution registering the sample. The last four characters are a random string of alphanumeric characters (0-9, A-Z).
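Given that structure, an IGSN can be split into its two parts with simple string handling; a sketch using the example above (the variable names are ours, not part of any IGSN tooling):

```python
# Split an IGSN into its namespace and sample-specific parts,
# following the structure described above.
igsn = "IEEVB00C3"

namespace = igsn[:5]     # unique user code for the registering person or institution
sample_code = igsn[5:]   # random alphanumeric string identifying the sample

print(f"namespace: {namespace}, sample code: {sample_code}")
# namespace: IEEVB, sample code: 00C3
```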
Consider: How are you managing your physical samples? The ANDS IGSN minting service may be used by Australian researchers at no cost. Do you know of a service provider in your region?
Data citation best practice, as discussed in Thing 3, enables citation metrics for data to be tracked and analysed. Data citations are available from the Clarivate Data Citation Index which is a commercial product.
Altmetrics are an alternative measure that helps you understand the influence of your work. They cover metrics such as the number of views, downloads and mentions in policy documents, social media and social bookmarking platforms associated with any research output that has a DOI or other persistent identifier. Because of their immediacy, altmetrics can be an early indicator of the impact or reach of a dataset, long before formal citation metrics can be assessed.
By way of comparison, as of early November 2018:
Consider: Do you think altmetrics for data have value in academic settings? Why, or why not?
Understand the importance of data licensing, learn about Creative Commons and find out how enabling reuse of data can speed up research and innovation.
Consider this scenario: You’ve found a dataset you are interested in. You’ve downloaded it. Excellent! But do you know what you can and cannot do with the data? The answer lies in data licensing. Licensing is critical to enabling data to be reused and cited.
Consider: If you were licensing a dataset that may have commercial value to others, what licence would you apply?
Enabling reuse of data can speed up research and innovation. Licensing is critical to enabling data reuse.
Does your institution have policies or guidelines around data licensing?
Not all research data that is shared is licensed for reuse. It should be!
Consider: Assigning open licences is not yet routine. Suggest one tip for encouraging uptake of ‘open’ licensing.
In addition to selecting a metadata standard or schema, whenever possible you should also use a controlled vocabulary.
A controlled vocabulary provides a consistent way to describe data - location, time, place name, subject. Read this short explanation of controlled vocabularies.
Controlled vocabularies significantly improve data discovery. They make data more shareable with researchers in the same discipline because everyone is ‘talking the same language’ when searching for specific data, e.g. plants, animals, medical conditions, places, etc.
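As a toy sketch of the idea: if dataset keywords are checked against an agreed vocabulary before publication, everyone searches with the same terms. The vocabulary below is invented for illustration and is not a real geoscience thesaurus:

```python
# Check dataset keywords against a small, invented controlled vocabulary.
# In practice the terms would come from a published thesaurus or vocabulary service.
controlled_vocabulary = {"basalt", "granite", "sandstone", "limestone"}

dataset_keywords = ["Basalt", "volcanic rock", "Granite"]

for keyword in dataset_keywords:
    if keyword.lower() in controlled_vocabulary:
        print(f"'{keyword}' is an accepted vocabulary term")
    else:
        print(f"'{keyword}' is not in the vocabulary - map it to an accepted term")
```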
If you have time, have a look at Controlling your Language: a Directory of Metadata Vocabularies from JISC in the UK. Make sure you scroll down to 5. Conclusion - it’s worth a read.
We are going to see some controlled vocabularies in action in the Atlas of Living Australia (ALA).
If you have time: have a browse around the stunning level of data description and data contained in the Atlas of Living Australia.
Explore some examples of vocabularies used in geoscience:
Consider: Do you use controlled vocabularies to describe your data? How would you encourage other researchers to use them?
ORCID is a unique identifier for researchers. Many research data repositories record your ORCID when you submit research data for publication.
In your ORCID record, datasets you have published will be displayed in the Works section.
Log into ORCID now and check your details are up to date, including:
If you don’t already have an ORCID you can get one; this Curtin University webpage has information on how to get the most out of your ORCID.
Your ORCID record can be populated from many sources, one of which is peer review activity. Publishers such as the American Geophysical Union Publications now send details of peer review activities to ORCID.
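Your public ORCID record can also be read programmatically through ORCID’s public API; below is a sketch using the v3.0 works endpoint and the Python requests library. The ORCID iD shown is a placeholder example (substitute your own), and the JSON field names reflect the v3.0 API:

```python
import requests

# Fetch the list of works (articles, datasets, software, ...) attached to an
# ORCID record via the public API.
orcid_id = "0000-0002-1825-0097"  # placeholder example iD; use your own
response = requests.get(
    f"https://pub.orcid.org/v3.0/{orcid_id}/works",
    headers={"Accept": "application/json"},
    timeout=30,
)
response.raise_for_status()
works = response.json()

# Each "group" contains one work, summarised in "work-summary".
for group in works.get("group", []):
    summary = group["work-summary"][0]
    print(summary["type"], "-", summary["title"]["title"]["value"])
```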
Because they are unique identifiers, ORCIDs can be used to link data from different datasets together. GeoLink is a network of Linked Data from multiple data repositories.
Geoscience research data is part of the world’s heritage. Researchers share with research institutions and funders the responsibility for ensuring their data is well documented, preserved and openly available.
Many publishers have special requirements for the citation of data in publications. These requirements can take the form of compliance with a data policy, author guidelines or the completion of a Data Availability Statement.
COPDESS, the Coalition for Publishing Data in the Earth and Space Sciences, has collected links to author instructions and data policies for some geoscience journals, publishers and funders.
Activity 1 has shown us that it’s becoming more common for journals and publishers to require that your data be made available when you seek to publish. However, if your research is publicly funded, it’s almost guaranteed that your grant and funding obligations will require you to make your data publicly available at the end of your project: the outputs of research funded by the public should be made available to the public.
The Australian Research Council’s data management requirements state that funded researchers are expected to follow the OECD Principles and Guidelines for Access to Research Data from Public Funding. Similar principles are outlined by UK Research and Innovation (UKRI) in their Guidance on best practice in the management of research data document.
Consider: If you were on a funding panel and were asked to assess a grant with a clear plan for making the data openly available, would you rate the future impact of that proposal better or worse than one with a poorly defined plan?
Geosciences has many specialised services, applications and APIs which can be used to directly access and harness existing research data. Some are free, and some are subscription-based, but your research institution may have access.
APIs (Application Programming Interfaces) are software services that allow you to access structured data or systems held by someone else. They are usually provided so that developers can query an organisation’s data on demand, rather than having to hold an entire copy of the dataset (which may not be possible because of security constraints, storage requirements, or because the dataset is constantly changing). Some companies charge for using their APIs, but many research-oriented organisations provide theirs for free so that other organisations can link in to their knowledge.
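As a small sketch of what this looks like in practice, the data.gov portal mentioned in Thing 1 runs on the CKAN platform, whose search API can be queried over HTTP. The example below assumes CKAN’s standard package_search action and the Python requests library; the search term is arbitrary:

```python
import requests

# Search the data.gov catalogue (a CKAN instance) for datasets matching a
# keyword, using CKAN's package_search action.
response = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "groundwater", "rows": 5},
    timeout=30,
)
response.raise_for_status()
result = response.json()["result"]

print("Datasets found:", result["count"])
for dataset in result["results"]:
    print("-", dataset["title"])
```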
Consider: If you could systematically access and integrate the data provided from one of the sources above, can you think of a way you could enrich the outputs of your own research?
The importance of spatial data is ever increasing. Many of the societal challenges we face today, such as food scarcity and economic growth, are inherently linked to big spatial data. In fact, it is often said that 80% of all research data has a geographic or spatial component. It is useful, then, for all of us to have an understanding of spatial data.
Just for fun: Enter your address in the Atlas of Living Australia and see what birds and plants have been reported in your street or suburb. You may be surprised at how ‘alive’ your street is!
Consider: Why do you think these geospatial visualisations are so powerful?
There are many types and sources of geospatial data. If you are new to this world, you will probably appreciate some ‘busting’ of geospatial jargon.
Consider: How would you explain two new terms you have just learnt?
Spatial data can be used in many ways, and there are many tools that you can use to manipulate and display spatial data.
You can try one of the tools below. Do one, or do them all and compare the results.
The Open Geospatial Consortium (OGC) is an international not-for-profit organization that develops open standards for the geospatial community. Through its dedicated global membership, the OGC has developed several standards for sharing geospatial data. Some of the most commonly used are the Web Map Service (WMS), Web Feature Service (WFS) and Web Coverage Service (WCS).
GeoServer is the most popular open-source reference implementation of the WMS, WFS and WCS standards.
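To make the standards concrete, here is a sketch of a WMS GetMap request of the kind a GeoServer instance would answer. The host and layer name are placeholders; the query parameters are the standard WMS 1.3.0 ones (note that with EPSG:4326 in WMS 1.3.0 the bounding box is given in latitude,longitude order):

```python
import requests

# Request a rendered map image from a (hypothetical) GeoServer WMS endpoint
# using standard WMS 1.3.0 GetMap parameters.
wms_url = "https://example.org/geoserver/wms"  # placeholder host

params = {
    "service": "WMS",
    "version": "1.3.0",
    "request": "GetMap",
    "layers": "demo:geology",            # placeholder workspace:layer
    "styles": "",
    "crs": "EPSG:4326",
    "bbox": "-44.0,112.0,-10.0,154.0",   # roughly Australia: minLat,minLon,maxLat,maxLon
    "width": 800,
    "height": 600,
    "format": "image/png",
}

response = requests.get(wms_url, params=params, timeout=60)
response.raise_for_status()

with open("map.png", "wb") as f:
    f.write(response.content)
```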
Consider: The data world is hungry for Geospatial tools and metadata and there is growing demand for people with these skills. How can these skills be encouraged in your institution?
ANDS 23 (Research Data) Things https://www.ands.org.au/working-with-data/skills/23-research-data-things/all23