Kristina Hettne, Peter Verhaar (Centre for Digital Scholarship at Leiden University), Ben Companjen, Laurents Sesink, Fieke Schoots (Centre for Digital Scholarship at Leiden University, reviewer), Erik Schultes (GO FAIR, reviewer), Rajaram Kaliyaperumal (Leiden Universitair Medisch Centrum, reviewer), Erzsebet Toth-Czifra (DARIAH, reviewer), Ricardo de Miranda Azevedo (Maastricht University, reviewer), Sanne Muurling (Leiden University Library, reviewer).
This document offers a concise overview of the ten topics that are most essential for scholars in the field of historical research who aim to publish their data set in accordance with the FAIR principles. In historical research, research data mostly consists of databases (spreadsheets, relational databases), text corpora, images, interviews, sound recordings or video materials.
To ensure that data sets can be found, scholars need to deposit their data sets and all the associated metadata in a repository which assigns persistent identifiers.
Data repositories enable researchers to share their data sets. The following data repositories accept data sets in the field of history:
A number of additional data repositories can be found at re3data.org by clicking Browse > Browse by subject > History.
Choosing a repository that complies with the CoreTrustSeal criteria for long-term repositories is recommended, as this helps to guarantee the durable findability of the data.
Once a data repository has been selected, the data set can be submitted, together with the metadata describing this data set. Metadata is commonly described as data about data. In the context of data management, it is structured information about a data set which describes characteristics such as the quality, the format and the contents. Most repositories require a minimum set of metadata, such as the name of the creator, the title and the year of creation. Check what kind of metadata the repository of your choice asks for. Remember that the effort you put into metadata will contribute to the findability of your dataset.
Metadata are often captured using a fixed metadata schema. A schema is a set of fields which can be used to record a particular type of information. The format of the metadata is often prescribed by the data repository which will manage the data set.
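A metadata record of this kind can be sketched as a simple set of field-value pairs. The field names below loosely follow Dublin Core conventions and the set of required fields is illustrative; the repository you deposit in will prescribe its own schema.

```python
# A minimal, illustrative metadata record. The required fields shown
# here (creator, title, date) mirror the typical minimum that
# repositories ask for; real schemas differ per repository.
REQUIRED_FIELDS = {"creator", "title", "date"}

record = {
    "creator": "Jansen, A.",                          # name of the creator
    "title": "Grain prices in Holland, 1590-1620",    # hypothetical dataset
    "date": "2023",                                   # year of creation
    "format": "text/csv",
    "description": "Transcribed market price lists.",
}

def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent from a metadata record."""
    return required - record.keys()

print(missing_fields(record))            # empty set: record meets the minimum
print(missing_fields({"title": "x"}))    # creator and date are missing
```

A check like this is what a repository's submission form effectively performs when it refuses a deposit with incomplete metadata.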
Datasets need to be deposited in repositories that assign persistent identifiers (PIDs) to ensure that online references to publications, research data, and persons remain available in the future. A PID is a specific type of Uniform Resource Identifier (URI), which is managed by an organisation that links a persistent identification code to the most recent Uniform Resource Locator (URL).
Academic journals mostly work with DOIs. DOIs are globally unique identifiers that provide persistent access to publications, datasets, software applications, and a wide range of other research results. DOI has been an ISO standard since 2012. A typical DOI looks as follows: https://doi.org/10.17026/dans-x4b-uy8q. When users click on this DOI, it is resolved to an actual web address.
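The shape of a DOI is regular enough to sketch in code: a prefix that always starts with "10." followed by a registrant code, then a slash and a suffix chosen by the registrant. The pattern below is a simplified sketch, not a complete validator.

```python
import re

# Simplified pattern for a DOI name: "10.", a numeric registrant code,
# a slash, and a registrant-chosen suffix. Real DOI syntax is more
# permissive; this is only an illustration of the structure.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def resolver_url(doi):
    """Turn a bare DOI name into the URL accepted by the DOI resolver."""
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a well-formed DOI: {doi!r}")
    return f"https://doi.org/{doi}"

print(resolver_url("10.17026/dans-x4b-uy8q"))
```

Requesting the resulting URL is what happens behind the scenes when a reader clicks a DOI: the resolver redirects to the landing page currently registered for that identifier.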
Next to identifiers for data sets and for publications, it is also possible to create PIDs for people. The Open Researcher and Contributor ID (ORCID) is an international system for the persistent identification of academic authors. It is a non-proprietary system, managed by an international consortium consisting of universities, national libraries, research institutes and data repositories. When your research results are associated with an ORCID, this information can be exchanged effectively across databases, across countries and across academic disciplines. You always retain full control over your own ORCID iD. It is the de facto standard when submitting a research article or grant application, or depositing research data.
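An ORCID iD is not an arbitrary number: its last character is a check digit computed with the ISO 7064 MOD 11-2 algorithm, which lets software catch most typing errors. The sketch below validates the structure and check digit, using the well-known example iD 0000-0002-1825-0097.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the check character for the first 15 digits of an ORCID iD
    using ISO 7064 MOD 11-2."""
    total = 0
    for d in base_digits:
        total = (total + int(d)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid: str) -> bool:
    """Check the structure and check digit of an ORCID iD such as
    0000-0002-1825-0097 (the hyphens are part of the display format)."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15]

print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False: wrong check digit
```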
The FAIR principles stipulate that data and metadata ought to be “retrievable by their identifier using a standardised communication protocol” (requirement A1). This requirement does not necessarily imply that the data should be fully available in open access. It principally means that there needs to be a protocol that users may follow to obtain access to the data set. There can be many good reasons for limiting access to a file. Public accessibility may be difficult because of privacy laws or copyright protection regulations, for example.
The accessibility of the data may occasionally be complicated by the fact that the data have been stored in a so-called proprietary format, i.e. a format that is owned exclusively by a single company. For formats which are associated with specific software applications, it can be difficult to guarantee long-term usability, accessibility and preservation. For this reason, the DANS EASY archive in The Netherlands works with a list of ‘preferred formats’.
Well-structured and well-organised data can evidently be reused much more easily. This section explains how researchers can organize their data in such a way that they can be analysed effectively with data science tools. Many historians capture their data in spreadsheets. As is explained by Broman and Woo (2018), there are a number of important principles to bear in mind when you work with spreadsheets.
Once you have developed a suitable data model, you are also advised to develop a data dictionary which documents the model. This document may contain the following information:
Read Karl Broman and Kara H. Woo, “Data organization in spreadsheets”.
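The core of Broman and Woo's advice can be illustrated with a small, hypothetical table: a single header row, one observation per row, one value per cell, an explicit missing-value code such as "NA", and no merged cells or colour-coded meaning. The column names and figures below are illustrative.

```python
import csv
import io

# A sketch of a "tidy" spreadsheet layout in the spirit of Broman and
# Woo (2018): rectangular data, one header row, one value per cell,
# and an explicit "NA" code for missing values. All values invented.
rows = [
    {"record_id": "1", "city": "Leiden",  "year": "1622", "population": "44745"},
    {"record_id": "2", "city": "Haarlem", "year": "1622", "population": "39455"},
    {"record_id": "3", "city": "Delft",   "year": "1622", "population": "NA"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["record_id", "city", "year", "population"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Data stored this way can be read directly by analysis tools without manual clean-up, which is precisely what makes it reusable.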
Tim Berners-Lee, the inventor of the Web, argued that there are five levels of open data. Creators of data can earn five stars by following the steps below.
When researchers have published their well-structured and well-organised data set in a data repository under a public license, as explained in things 1 to 5 above, they will have arrived at a data set that can be awarded three stars according to Berners-Lee’s scheme. This section and the following section explain how you can enhance the interoperability of your data sets even further by working with RDF and with persistent identifiers.
As a first step, it can be useful to explore whether some of the general topics that you focus on have already been assigned persistent identifiers or URIs. Many researchers and institutions have developed shared vocabularies and ontologies to standardise terminology. In many cases, the terms which have been defined have also been assigned persistent identifiers. Such shared vocabularies can make it clear that we are talking about the same thing when we exchange knowledge.
Historical research often concentrates on people, events, organisations and locations. The following ontologies and shared vocabularies concentrate on entities such as these:
Where possible, try to use terms that have been defined in these existing ontologies in your own data set. An example where a specific vocabulary (the VOC glossary) was used to mark up a dataset can be found here. The dataset is part of a project to reconstruct the domestic market for colonial goods in the Dutch Republic.
The fourth and the fifth star in Berners-Lee’s model can be awarded when the data are stored in a format in which the topics, their properties and their characteristics are identified using URIs wherever possible. More concretely, this implies that you record your data using the Resource Description Framework (RDF). RDF, simply put, is a technology which enables you to publish the contents of a database via the web. It is based on a simple data model which assumes that all statements about resources can be reduced to a basic form, consisting of a subject, a predicate and an object. RDF assertions are also known as triples. In a FAIR data model, elements of data are organised and identified using PIDs. The same goes for the relations between these elements. The FAIR data model is a graphical view of the data that acts as a metadata key to a spreadsheet, but it can also be used as a guide to expose data as a linked data graph in RDF format.
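The subject-predicate-object form can be sketched without any RDF library by writing a statement in N-Triples syntax, the simplest RDF serialisation. The URIs below are illustrative placeholders, not real identifiers from a published vocabulary.

```python
# A sketch of the RDF data model: one statement consisting of a
# subject, a predicate and an object, each identified by a URI, written
# as a line of N-Triples. The example.org URIs are illustrative only.
def triple(subject, predicate, obj):
    """Format one subject-predicate-object statement as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."

statement = triple(
    "https://example.org/person/hugo-grotius",  # subject: the topic
    "https://example.org/vocab/bornIn",         # predicate: the relation
    "https://example.org/place/delft",          # object: the value
)
print(statement)
```

In a real data set the placeholder URIs would be replaced by persistent identifiers from shared vocabularies, so that any other data set using the same URIs can be linked to yours automatically.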
Existing data sets can be converted to RDF by making use of the FAIRifier software. This application is based on OpenRefine. Other examples of tools to generate RDF are Karma and RML. In the FAIRifier, it is possible to upload a CSV file. After this, the data set can be connected to elements from existing ontologies.
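The kind of conversion that these tools automate can be sketched in a few lines: every CSV row becomes a subject, and every column is mapped to a predicate URI. The vocabulary URIs and the sample rows below are illustrative, not output of the FAIRifier itself.

```python
import csv
import io

# A minimal sketch of CSV-to-RDF conversion, the step that tools such
# as the FAIRifier, Karma and RML perform at scale. Column names are
# mapped to illustrative predicate URIs; literal values are quoted.
CSV_DATA = """id,name,birthplace
p1,Hugo Grotius,Delft
p2,Anna Maria van Schurman,Cologne
"""

PREDICATES = {
    "name": "https://example.org/vocab/name",
    "birthplace": "https://example.org/vocab/birthplace",
}
BASE = "https://example.org/person/"

def csv_to_ntriples(text):
    """Turn each CSV row into one triple per mapped column."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = BASE + row["id"]
        for column, predicate in PREDICATES.items():
            lines.append(f'<{subject}> <{predicate}> "{row[column]}" .')
    return lines

triples = csv_to_ntriples(CSV_DATA)
print("\n".join(triples))
```

A tool like the FAIRifier adds what this sketch lacks: an interface for connecting columns to terms from existing ontologies rather than made-up URIs.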
A license describes the conditions under which your data or software is (re)usable. Picking a license can be a daunting process because of the common feeling that if you do not pick the right license something will go wrong. However, keep in mind that if you do not choose a license for your data or software, others cannot legally use or reuse it. A copyright expert can help you, but to get you going you can try out the activities listed below.
If you deposit your data in a repository there will be default options available.
When you have made use of someone else’s data, you are strongly recommended to attribute the original creators of these data by including a proper reference. Data sets, and even software applications, can be cited in the same way as textual publications such as articles and monographs. Structured data citations can also be used to calculate metrics about the reuse of the data. Data citations, regardless of citation style, typically contain the authors, the year, the title, the publisher and a persistent identifier.
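Because a data citation has a small, fixed set of components, assembling one from structured metadata is straightforward. The sketch below uses hypothetical creators and reuses the DOI shown earlier as an example identifier; the exact order and punctuation depend on the citation style you follow.

```python
# A sketch that assembles a data citation of the form
# "Authors (Year). Title. Publisher. PID" from a metadata dict.
# Field names, authors and layout are illustrative.
def format_data_citation(meta):
    """Build a citation string from author, year, title, publisher, PID."""
    authors = "; ".join(meta["authors"])
    return (f"{authors} ({meta['year']}). {meta['title']}. "
            f"{meta['publisher']}. {meta['identifier']}")

citation = format_data_citation({
    "authors": ["Jansen, A.", "de Vries, B."],   # hypothetical creators
    "year": 2021,
    "title": "Grain prices in Holland, 1590-1620",
    "publisher": "DANS",
    "identifier": "https://doi.org/10.17026/dans-x4b-uy8q",
})
print(citation)
```

Because the persistent identifier is part of the citation, reuse of the data can be traced and counted in the same way as citations of articles.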
Policies for data availability can come from publishers, funders and universities. These policies are listed on the respective websites, but finding them is not always straightforward. FAIRsharing is a repository of standards, databases and policies with the possibility to filter information for a specific research domain. It started as an initiative for the life sciences but is rapidly expanding its content to other disciplines as well.