Kristina Hettne, Peter Verhaar (Centre for Digital Scholarship at Leiden University), Ben Companjen, Laurents Sesink, Fieke Schoots (Centre for Digital Scholarship at Leiden University, reviewer), Erik Schultes (GO FAIR, reviewer), Rajaram Kaliyaperumal (Leiden Universitair Medisch Centrum, reviewer), Erzsebet Toth-Czifra (DARIAH, reviewer), Ricardo de Miranda Azevedo (Maastricht University, reviewer), Sanne Muurling (Leiden University Library, reviewer).
This document offers a concise overview of the ten topics that are most essential for scholars in the field of historical research who aim to publish their data set in accordance with the FAIR principles. In historical research, research data mostly consists of databases (spreadsheets, relational databases), text corpora, images, interviews, sound recordings or video materials.
To ensure that data sets can be found, scholars need to deposit their data sets and all the associated metadata in a repository which assigns persistent identifiers.
Data repositories enable researchers to share their data sets. The following data repositories accept data sets in the field of history:
A number of additional data repositories can be found at re3data.org by clicking Browse > Browse by subject > History.
Choosing a repository that complies with the CoreTrustSeal criteria for long-term repositories is recommended, as this helps to guarantee the durable findability of the data.
Once a data repository has been selected, the data set can be submitted, together with the metadata describing this data set. Metadata is commonly described as data about data. In the context of data management, it is structured information about a data set which describes characteristics such as the quality, the format and the contents. Most repositories require a minimum set of metadata, such as the name of the creator, the title and the year of creation. Check what kind of metadata the repository of your choice asks for. Remember that the effort you put into metadata will contribute to the findability of your dataset.
Metadata are often captured using a fixed metadata schema. A schema is a set of fields which can be used to record a particular type of information. The format of the metadata is often prescribed by the data repository which will manage the data set.
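A metadata record of this kind can be sketched as a simple set of field-value pairs. The field names below loosely follow Dublin Core conventions and the set of required fields is illustrative; the repository you deposit in will prescribe its own schema.

```python
# A minimal, illustrative metadata record. The required fields shown
# here (creator, title, date) mirror the typical minimum that
# repositories ask for; real schemas differ per repository.
REQUIRED_FIELDS = {"creator", "title", "date"}

record = {
    "creator": "Jansen, A.",                          # name of the creator
    "title": "Grain prices in Holland, 1590-1620",    # hypothetical dataset
    "date": "2023",                                   # year of creation
    "format": "text/csv",
    "description": "Transcribed market price lists.",
}

def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent from a metadata record."""
    return required - record.keys()

print(missing_fields(record))            # empty set: record meets the minimum
print(missing_fields({"title": "x"}))    # creator and date are missing
```

A check like this is what a repository's submission form effectively performs when it refuses a deposit with incomplete metadata.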
Datasets need to be deposited in repositories that assign persistent identifiers (PIDs) to ensure that online references to publications, research data, and persons remain available in the future. A PID is a specific type of Uniform Resource Identifier (URI), which is managed by an organisation that links a persistent identification code to the most recent Uniform Resource Locator (URL).
Academic journals mostly work with DOIs. DOIs are globally unique identifiers that provide persistent access to publications, datasets, software applications, and a wide range of other research results. DOI has been an ISO standard since 2012. A typical DOI looks as follows: https://doi.org/10.17026/dans-x4b-uy8q. When users click on this DOI, it is resolved to an actual web address.
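The shape of a DOI is regular enough to sketch in code: a prefix that always starts with "10." followed by a registrant code, then a slash and a suffix chosen by the registrant. The pattern below is a simplified sketch, not a complete validator.

```python
import re

# Simplified pattern for a DOI name: "10.", a numeric registrant code,
# a slash, and a registrant-chosen suffix. Real DOI syntax is more
# permissive; this is only an illustration of the structure.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def resolver_url(doi):
    """Turn a bare DOI name into the URL accepted by the DOI resolver."""
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a well-formed DOI: {doi!r}")
    return f"https://doi.org/{doi}"

print(resolver_url("10.17026/dans-x4b-uy8q"))
```

Requesting the resulting URL is what happens behind the scenes when a reader clicks a DOI: the resolver redirects to the landing page currently registered for that identifier.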
Next to identifiers for data sets and for publications, it is also possible to create PIDs for people. The Open Researcher and Contributor ID (ORCID) is an international system for the persistent identification of academic authors. It is a non-proprietary system, managed by an international consortium consisting of universities, national libraries, research institutes and data repositories. When your research results are associated with an ORCID, this information can be exchanged effectively across databases, across countries and across academic disciplines. You always retain full control over your own ORCID iD. It is the de facto standard when submitting a research article or grant application, or depositing research data.
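An ORCID iD is not an arbitrary number: its last character is a check digit computed with the ISO 7064 MOD 11-2 algorithm, which lets software catch most typing errors. The sketch below validates the structure and check digit, using the well-known example iD 0000-0002-1825-0097.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the check character for the first 15 digits of an ORCID iD
    using ISO 7064 MOD 11-2."""
    total = 0
    for d in base_digits:
        total = (total + int(d)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid: str) -> bool:
    """Check the structure and check digit of an ORCID iD such as
    0000-0002-1825-0097 (the hyphens are part of the display format)."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15]

print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False: wrong check digit
```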
The FAIR principles stipulate that data and metadata ought to be “retrievable by their identifier using a standardised communication protocol” (requirement A1). This requirement does not necessarily imply that the data should be fully available in open access. It principally means that there needs to be a protocol that users may follow to obtain access to the data set. There can be many good reasons for limiting access to a file. Public accessibility may be difficult because of privacy laws or copyright protection regulations, for example.
The accessibility of the data may occasionally be complicated by the fact that the data have been stored in a so-called proprietary format, i.e. a format that is owned exclusively by a single company. For formats which are associated with specific software applications, it can be difficult to guarantee long-term usability, accessibility and preservation. For this reason, the DANS EASY archive in The Netherlands works with a list of ‘preferred formats’.
Well-structured and well-organised data can evidently be reused much more easily. This section explains how researchers can organize their data in such a way that they can be analysed effectively with data science tools. Many historians capture their data in spreadsheets. As is explained by Broman and Woo (2018), there are a number of important principles to bear in mind when you work with spreadsheets.
Once you have developed a suitable data model, you are also advised to develop a data dictionary which documents the model. This document may contain the following information:
Read Karl Broman and Kara H. Woo, “Data organization in spreadsheets”.
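The core of Broman and Woo's advice can be illustrated with a small, hypothetical table: a single header row, one observation per row, one value per cell, an explicit missing-value code such as "NA", and no merged cells or colour-coded meaning. The column names and figures below are illustrative.

```python
import csv
import io

# A sketch of a "tidy" spreadsheet layout in the spirit of Broman and
# Woo (2018): rectangular data, one header row, one value per cell,
# and an explicit "NA" code for missing values. All values invented.
rows = [
    {"record_id": "1", "city": "Leiden",  "year": "1622", "population": "44745"},
    {"record_id": "2", "city": "Haarlem", "year": "1622", "population": "39455"},
    {"record_id": "3", "city": "Delft",   "year": "1622", "population": "NA"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["record_id", "city", "year", "population"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Data stored this way can be read directly by analysis tools without manual clean-up, which is precisely what makes it reusable.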
Tim Berners-Lee, the inventor of the Web, argued that there are five levels of open data. Creators of data can earn five stars by following the steps below.
When researchers have published their well-structured and well-organised data set in a data repository under a public license, as explained in things 1 to 5 above, they will have arrived at a data set that can be awarded three stars according to Berners-Lee’s scheme. This section and the following section explain how you can enhance the interoperability of your data sets even further by working with RDF and with persistent identifiers.
As a first step, it can be useful to explore whether some of the general topics that you focus on have already been assigned persistent identifiers or URIs. Many researchers and institutions have developed shared vocabularies and ontologies to standardise terminology. In many cases, the terms which have been defined have also been assigned persistent identifiers. Such shared vocabularies can make it clear that we are talking about the same thing when we exchange knowledge.
Historical research often concentrates on people, events, organisations and locations. The following ontologies and shared vocabularies concentrate on entities such as these:
Where possible, try to use terms that have been defined in these existing ontologies in your own data set. An example where a specific vocabulary (the VOC glossary) was used to mark up a dataset can be found here. The dataset is part of a project to reconstruct the domestic market for colonial goods in the Dutch Republic.
The fourth and the fifth star in Berners-Lee’s model can be awarded when the data are stored in a format in which the topics, their properties and their characteristics are identified using URIs wherever possible. More concretely, this implies that you record your data using the Resource Description Framework (RDF). RDF, simply put, is a technology which enables you to publish the contents of a database via the web. It is based on a simple data model which assumes that all statements about resources can be reduced to a basic form, consisting of a subject, a predicate and an object. RDF assertions are also known as triples. In a FAIR data model, elements of data are organised and identified using PIDs. The same goes for the relations between these elements. The FAIR data model is a graphical view of the data that acts as a metadata key to a spreadsheet, but it can also be used as a guide to expose data as a linked data graph in RDF format.
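The subject-predicate-object form can be sketched without any RDF library by writing a statement in N-Triples syntax, the simplest RDF serialisation. The URIs below are illustrative placeholders, not real identifiers from a published vocabulary.

```python
# A sketch of the RDF data model: one statement consisting of a
# subject, a predicate and an object, each identified by a URI, written
# as a line of N-Triples. The example.org URIs are illustrative only.
def triple(subject, predicate, obj):
    """Format one subject-predicate-object statement as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."

statement = triple(
    "https://example.org/person/hugo-grotius",  # subject: the topic
    "https://example.org/vocab/bornIn",         # predicate: the relation
    "https://example.org/place/delft",          # object: the value
)
print(statement)
```

In a real data set the placeholder URIs would be replaced by persistent identifiers from shared vocabularies, so that any other data set using the same URIs can be linked to yours automatically.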
Existing data sets can be converted to RDF by making use of the FAIRifier software. This application is based on OpenRefine. Other examples of tools to generate RDF are Karma and RML. In the FAIRifier, it is possible to upload a CSV file. After this, the data set can be connected to elements from existing ontologies.
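The kind of conversion that these tools automate can be sketched in a few lines: every CSV row becomes a subject, and every column is mapped to a predicate URI. The vocabulary URIs and the sample rows below are illustrative, not output of the FAIRifier itself.

```python
import csv
import io

# A minimal sketch of CSV-to-RDF conversion, the step that tools such
# as the FAIRifier, Karma and RML perform at scale. Column names are
# mapped to illustrative predicate URIs; literal values are quoted.
CSV_DATA = """id,name,birthplace
p1,Hugo Grotius,Delft
p2,Anna Maria van Schurman,Cologne
"""

PREDICATES = {
    "name": "https://example.org/vocab/name",
    "birthplace": "https://example.org/vocab/birthplace",
}
BASE = "https://example.org/person/"

def csv_to_ntriples(text):
    """Turn each CSV row into one triple per mapped column."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = BASE + row["id"]
        for column, predicate in PREDICATES.items():
            lines.append(f'<{subject}> <{predicate}> "{row[column]}" .')
    return lines

triples = csv_to_ntriples(CSV_DATA)
print("\n".join(triples))
```

A tool like the FAIRifier adds what this sketch lacks: an interface for connecting columns to terms from existing ontologies rather than made-up URIs.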
A license describes the conditions under which your data or software is (re)usable. Picking a license can be a daunting process because of the common feeling that if you do not pick the right license something will go wrong. However, keep in mind that if you do not choose a license for your data or software, others cannot legally use or reuse it. A copyright expert can help you, but to get you going you can try out the activities listed below.
If you deposit your data in a repository there will be default options available.
When you have made use of someone else’s data, you are strongly recommended to attribute the original creators of these data by including a proper reference. Data sets, and even software applications, can be cited in the same way as textual publications such as articles and monographs. Structured data citations can also be used to calculate metrics about the reuse of the data. Data citations, regardless of citation style, typically contain the authors, the year, the title, the publisher and a persistent identifier.
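Because a data citation has a small, fixed set of components, assembling one from structured metadata is straightforward. The sketch below uses hypothetical creators and reuses the DOI shown earlier as an example identifier; the exact order and punctuation depend on the citation style you follow.

```python
# A sketch that assembles a data citation of the form
# "Authors (Year). Title. Publisher. PID" from a metadata dict.
# Field names, authors and layout are illustrative.
def format_data_citation(meta):
    """Build a citation string from author, year, title, publisher, PID."""
    authors = "; ".join(meta["authors"])
    return (f"{authors} ({meta['year']}). {meta['title']}. "
            f"{meta['publisher']}. {meta['identifier']}")

citation = format_data_citation({
    "authors": ["Jansen, A.", "de Vries, B."],   # hypothetical creators
    "year": 2021,
    "title": "Grain prices in Holland, 1590-1620",
    "publisher": "DANS",
    "identifier": "https://doi.org/10.17026/dans-x4b-uy8q",
})
print(citation)
```

Because the persistent identifier is part of the citation, reuse of the data can be traced and counted in the same way as citations of articles.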
Policies for data availability can come from publishers, funders and universities. These policies are listed on the respective websites, but finding them is not always straightforward. FAIRsharing is a repository of standards, databases and policies with the possibility to filter information for a specific research domain. It started as an initiative for the life sciences but is rapidly expanding its content to other disciplines as well.