Biodiversity

Sprinters:

Silvia Di Giorgio, Akinyemi Mandela Fasemore, Konrad Förstner, Till Sauerwein, Eva Seidlmayer (ZB MED - Information Center for Life Science, Cologne, Germany), Ilja Zeitlin, Susannah Bacon, Chris Erdmann (Library Carpentry/The Carpentries/California Digital Library)

Audience:

Researchers

Things

Findability

Thing 1: Identifiers

To make data findable, it has to be identified uniquely and persistently.

A digital object identifier (DOI) is a unique, case-insensitive, alphanumeric character sequence and can be very helpful for this purpose. You can reach the identified digital object by using the DOI as a URL: append the DOI to https://doi.org/ in the browser's address bar (e.g. https://doi.org/10.1109/5.771073). Also, see: ANDS Guide: Digital Object Identifier (DOI) System for Research Data.
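The DOI properties described above (a prefix and a suffix separated by a forward slash, case-insensitivity, and use as a URL) can be illustrated with a minimal Python sketch. The function names are hypothetical, and the check is purely syntactic: it assumes only that DOI prefixes start with "10." and does not confirm that a DOI is actually registered.

```python
from urllib.parse import quote

RESOLVER = "https://doi.org/"  # public DOI resolver named in the text


def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI into (prefix, suffix) at the first forward slash."""
    prefix, sep, suffix = doi.strip().partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"no prefix/suffix separator in {doi!r}")
    return prefix, suffix


def looks_like_doi(doi: str) -> bool:
    """Rough syntactic check: the prefix must be '10.' followed by digits."""
    try:
        prefix, _ = split_doi(doi)
    except ValueError:
        return False
    if not prefix.startswith("10."):
        return False
    registrant = prefix[3:]
    return all(part.isdigit() for part in registrant.split("."))


def same_doi(a: str, b: str) -> bool:
    """DOIs are case-insensitive, so compare them case-folded."""
    return a.strip().lower() == b.strip().lower()


def doi_url(doi: str) -> str:
    """Build the resolver URL for a DOI."""
    return RESOLVER + quote(doi.strip(), safe="/")


print(split_doi("10.1109/5.771073"))
print(doi_url("10.1109/5.771073"))
```

A syntactically plausible DOI can still fail to resolve, so the resolver remains the authoritative check.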

NOTE:
PUBLISSO is the DataCite agency that issues DOIs for the life sciences:
https://www.publisso.de/wir-fuer-sie/doi-service/

Exercise:
Below is a list of DOIs and a list of documents. Can you match each DOI to the right document? HINT: Start from https://www.doi.org/!

10.1103/PhysRev.48.73
10.5962/bhl.title.28875

On the origin of species
The Particle Problem in the General Theory of Relativity

Which of these is not a valid DOI?

10.1037/arc0000014
12.1093/fMicb.2018.00257
10.1101/468025

HINT: Check the prefix (before the forward slash)!

Which part indicates the publishing institution? The prefix or the suffix of a DOI?

ORCID Exercise: ORCID provides a persistent identifier for authors that helps avoid author name ambiguity. Use ORCIDs from the start of data creation, i.e. attach the data creator's name and ORCID to the dataset as a metadata field. Include ORCIDs with datasets in repositories (e.g. in the Sequence Read Archive (SRA), include the ORCID of the data creator). This allows data provenance (the origins, custody, and ownership of research data) to be tracked.
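As a minimal sketch of attaching an ORCID to dataset metadata, the Python below validates an ORCID iD before storing it. The metadata field names (`name`, `orcid`) and function names are hypothetical. The check character follows the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers, and the sample iD is the fictitious example researcher used in ORCID's own documentation.

```python
def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length, digits and check character of an ORCID iD."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16:
        return False
    base, check = compact[:15], compact[15]
    if not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_character(base) == check


def creator_entry(name: str, orcid: str) -> dict:
    """Build a creator record; the field names are illustrative only."""
    if not is_valid_orcid(orcid):
        raise ValueError(f"invalid ORCID iD: {orcid!r}")
    return {"name": name, "orcid": orcid}


# Fictitious example iD from ORCID's documentation.
print(creator_entry("Carberry, Josiah", "0000-0002-1825-0097"))
```

The checksum only catches typing errors; it does not show that the iD belongs to the named person.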

Go through the Getting Started with ORCID Integration.

Thing 2: Citations

Zenodo, for example, is a tool that makes scientific data and publications easier to cite.
It supports various data and license types. It also supports source code from GitHub repositories.

See https://zenodo.org/

Exercise:

Use the Zenodo Sandbox to upload an example dataset, software program, etc.
https://sandbox.zenodo.org/

Questions:

Which metadata fields do you have to add when uploading data and why?
Which fields are mandatory and which ones are not?
What identifiers can you use?

Uploading to Zenodo (Sandbox)
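Uploads can also be scripted. The Python below is a sketch that only assembles the metadata for a sandbox deposition and sends nothing. The endpoint path and field names (`upload_type`, `title`, `description`, `creators`) are assumptions based on Zenodo's REST API documentation and should be checked against the current version. All values are placeholders.

```python
import json

# Assumed sandbox endpoint for creating a deposition (requires an access token).
SANDBOX_DEPOSIT_URL = "https://sandbox.zenodo.org/api/deposit/depositions"


def build_deposition(title: str, description: str, creators: list,
                     upload_type: str = "dataset") -> dict:
    """Assemble the JSON body for a new deposition (nothing is sent)."""
    if not creators:
        raise ValueError("at least one creator is expected")
    return {
        "metadata": {
            "upload_type": upload_type,
            "title": title,
            "description": description,
            "creators": creators,
        }
    }


payload = build_deposition(
    title="Example biodiversity survey (test upload)",
    description="Placeholder dataset used to try out the sandbox.",
    creators=[{"name": "Doe, Jane", "affiliation": "Example Institute"}],
)
print(json.dumps(payload, indent=2))
```

Comparing this payload with the web upload form is one way to approach the questions above about which fields exist and which identifiers can be attached.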

Thing 3: Wikidata

Wikidata provides a common source of open data which can be used by Wikimedia projects such as Wikipedia, and by anyone else, under a public domain license.

Exercise:
Go to Wikidata and find the publication date of the book “On the origin of species”.

Switch over to the linked dataset of the author of the book and see his other publications.
What did he publish in 1839?
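The same lookup can be expressed as a query against the Wikidata Query Service. The Python below is a sketch that only builds the SPARQL text and the request URL, without sending it. It assumes the property P577 ("publication date"), the public endpoint https://query.wikidata.org/sparql, and an exact match on the item's English label, so the capitalisation in the label must match Wikidata's. The function names are hypothetical.

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def publication_date_query(title: str, lang: str = "en") -> str:
    """Build a SPARQL query for items with this exact label and a publication date."""
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?work ?date WHERE {\n"
        f'  ?work rdfs:label "{escaped}"@{lang} ;\n'
        "        wdt:P577 ?date .\n"  # P577 = publication date
        "}\n"
        "LIMIT 10"
    )


def query_url(sparql: str) -> str:
    """Build a GET URL that asks the endpoint for JSON results."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


sparql = publication_date_query("On the Origin of Species")
print(sparql)
print(query_url(sparql))
```

Pasting the printed query into the query service's web interface should return the same information as browsing to the item by hand.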

Thing 4: Registry of Research Data Repositories (re3data)

This project aims to accelerate scientific discovery and enhance the integrity, transparency, and reproducibility of data. To enable FAIR data sharing, data need to be deposited in a repository that is taking steps to make data as open and FAIR as possible. What counts as FAIR is not yet clear-cut, and there is no such thing as a FAIR stamp, although the CoreTrustSeal certification provides a good indication. Therefore, under the auspices of the Enabling FAIR Data Project, the American Geophysical Union (AGU), re3data, and DataCite have decided to develop new tools to assist researchers with finding an appropriate repository for their data:

Exercise:

How many entries are returned on re3data for a query specific to your research topic?
If you filter under “Subject”, what do you find?
Do you think something is missing from the results? If so, suggest a repository.

Try the “browse by Subject” view of the re3data database, which gives a good overview of the wide landscape of research data repositories: https://www.re3data.org/browse/by-subject/

Accessibility

Thing 5: Bioschemas

bioschemas.org aims to improve data interoperability in the life sciences. It does this by encouraging people in the life sciences to use schema.org markup, so that their websites and services contain consistently structured information (metadata). This structured information then makes it easier to discover, collate and analyse distributed data.

Exercises can be found on the Bioschemas website under “tutorials” and “how to”.

https://bioschemas.gitbook.io/training-portal/
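To make the idea of schema.org markup concrete, the Python below builds a small JSON-LD description of a dataset and wraps it in the script tag used to embed such markup in a web page. It is a sketch that uses only generic schema.org `Dataset` properties. Bioschemas profiles add their own required and recommended properties, which are not covered here, and all values are placeholders.

```python
import json


def dataset_jsonld(name: str, description: str, url: str,
                   license_url: str, keywords: list) -> dict:
    """Describe a dataset with generic schema.org terms (not a full Bioschemas profile)."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "keywords": keywords,
    }


def as_script_tag(markup: dict) -> str:
    """Wrap JSON-LD in the script element used to embed it in HTML."""
    return ('<script type="application/ld+json">\n'
            + json.dumps(markup, indent=2)
            + "\n</script>")


markup = dataset_jsonld(
    name="Example species occurrence records",
    description="Placeholder description of a biodiversity dataset.",
    url="https://example.org/datasets/occurrences",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    keywords=["biodiversity", "occurrence data"],
)
print(as_script_tag(markup))
```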

Thing 6: Licenses

Knowing the appropriate licenses to use for your data can help others understand how they can use your data and can also help with improving accessibility.

Exercise:

Use the Creative Commons License Tool to select the appropriate license for the following intentions:
Allow your work to be adapted and also allow it to be used commercially.

Thing 7: Availability via torrents

The era of Big Data is finally upon us. A prerequisite for accessibility is availability. Well-established sharing protocols such as BitTorrent can help keep data available without depending on a single host or location. Using the torrent protocol for scientific data offers the following advantages:

Immutability
Distribution capabilities (lower cost for distributing the data)
No sole maintainer (we don’t have to rely only on one specific maintainer because data can be cloned and maintained across the peer-networks)

The Magnet URI scheme defines the format of magnet links, a de facto standard for identifying files by their content, via cryptographic hash value rather than by their location.
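The structure of a magnet link can be sketched in a few lines of Python. The function below assembles a link from a BitTorrent v1 info-hash, an optional display name and optional tracker URLs. It assumes the info-hash has already been produced by a torrent client or library; the hash used in the example is a placeholder and does not identify a real torrent. The function name is hypothetical.

```python
import hashlib
import string
from urllib.parse import quote


def magnet_link(info_hash_hex: str, display_name: str = "", trackers: tuple = ()) -> str:
    """Assemble a magnet URI from a 40-character hex BitTorrent v1 info-hash."""
    info_hash = info_hash_hex.strip().lower()
    if len(info_hash) != 40 or any(c not in string.hexdigits for c in info_hash):
        raise ValueError("expected a 40-character hexadecimal info-hash")
    parts = ["xt=urn:btih:" + info_hash]  # xt = "exact topic", the content hash
    if display_name:
        parts.append("dn=" + quote(display_name, safe=""))  # dn = display name
    for tracker in trackers:
        parts.append("tr=" + quote(tracker, safe=""))  # tr = tracker URL
    return "magnet:?" + "&".join(parts)


# Placeholder only: a real info-hash is the SHA-1 of the torrent's bencoded
# "info" dictionary, not of an arbitrary string or of the raw file.
placeholder_hash = hashlib.sha1(b"placeholder, not a real torrent").hexdigest()
link = magnet_link(placeholder_hash, display_name="survey data.csv")
print(link)
```

Because the link carries a hash of the content rather than a location, any peer holding the same data can serve it.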

Including a magnet link directly in the publication makes the underlying data accessible from the publication itself. For more information, read:

Exercise:

Upload any small data set of your choice with the above link.
Share with a colleague a link to access it over torrent.

Interoperability

Thing 8: ELIXIR platforms

Standardisation of life science data helps ensure interoperability across different subfields. ELIXIR is an intergovernmental organisation that brings together life science resources from across Europe.

ELIXIR Interoperability Platform

Exercise:
Use the ELIXIR software bio.tools to find the author of the RNA-seq Python pipeline “READemption”.

Thing 9: Research data management

Bio2RDF is a large network of linked data for the life sciences. The database provides interlinked life science data using semantic web technologies. To learn more about Bio2RDF, read Bio2RDF: towards a mashup to build bioinformatics knowledge systems.

http://bio2rdf.org/

The German Federation for Biological Data (GFBio) is the authoritative, national contact point for issues concerning the management and standardisation of biological and environmental research data during the entire data life cycle (from acquisition to archiving and data publication). GFBio mediates expertise and services between the GFBio data centers and the scientific community, covering all areas of research data management.

https://www.gfbio.org/

Thing 10: Machine-readability

Make the data accessible via an API, in a structured data format that can be automatically read and processed by a computer. See the Open Data Handbook Glossary - Machine readable.

Exercise - Crossref:

Pick the DOI of a publication of your choice.
Open a web browser and enter the following URL, replacing DOI with the DOI of the publication:
https://api.crossref.org/works/DOI

Example: https://api.crossref.org/works/10.1371/journal.pcbi.1004668
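The browser exercise can also be done programmatically. The Python below is a sketch that builds the Crossref URL for a DOI, defines a helper that would fetch the JSON record, and reads the first title from a record. It assumes the response wraps the work under a `message` key with `title` as a list. The sample record is hand-written to show that shape and is not real API output, and the function names are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works/"


def works_url(doi: str) -> str:
    """Build the Crossref REST API URL for a DOI."""
    return CROSSREF_WORKS + quote(doi.strip(), safe="/")


def fetch_work(doi: str, timeout: float = 10.0) -> dict:
    """Download and parse the record (needs network access; not called here)."""
    with urlopen(works_url(doi), timeout=timeout) as response:
        return json.load(response)


def first_title(record: dict) -> Optional[str]:
    """Return the first title in a record, or None if there is none."""
    titles = record.get("message", {}).get("title") or []
    return titles[0] if titles else None


# Hand-written stand-in with the assumed shape of a response.
sample_record = {"status": "ok", "message": {"DOI": "10.1234/example", "title": ["Example title"]}}

print(works_url("10.1371/journal.pcbi.1004668"))
print(first_title(sample_record))
```

Calling `first_title(fetch_work("10.1371/journal.pcbi.1004668"))` would run the same steps against the live API.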

Exercise - DataCite:

Pick the DOI of a dataset in Zenodo.
Open https://api.datacite.org/works/DOI, replacing DOI with the DOI of the Zenodo entry.

Example: https://api.datacite.org/works/10.5281/zenodo.1574835

Reusability

Thing 11: Digitalization

If the methods to record complex experiments are prone to error, so that reproducible results cannot be guaranteed, how can you ever be sure you’re dealing with real insights and not random information? The electronic lab notebook provides the missing infrastructure for data recording, retrieval and integrity. An electronic lab notebook must be able to create, import, store and retrieve all important data types in digital format. For more information, read:

Exercise:
Explore the demo lab notebook at https://demo.elabftw.net/experiments.php

Thing 12: Containers

In scientific work, we often deal with large amounts of data that have to be processed before publication. One important aspect of the reproducibility challenge is ensuring that a computational analysis can be reproduced, even in different environments. For more information, read:

Grüning, Björn, et al. “Practical computational reproducibility in the life sciences.” Cell Systems 6.6 (2018): 631-635.

Exercise:

Learn Docker & Containers using Interactive Browser-Based Scenarios: https://www.katacoda.com/courses/docker

Thing 13: Blockchain for Life Science

Blockchain technology has the potential to be a technical solution to the current reproducibility crisis in science, and could “reduce waste and make more research results true”. See:

Living document example:
See Blockchain for Open Science – the living document: https://www.blockchainforscience.com/2017/02/23/blockchain-for-open-science-the-living-document/

Supplementary information:

Research Data Infrastructure for the Life Sciences (NFDI4Life)
NFDI4Life brings together research communities across the life sciences domain in the context of the planned National Research Data Infrastructure (NFDI). As a response to the increasing scientific and societal demand for data and data analysis, NFDI4Life brings together scientific communities and research data infrastructures broadly covering the life sciences with particular focus on the subdomains biology, medicine (with veterinary medicine), epidemiology, nutrition, agricultural and environmental science as well as biodiversity research.

https://www.nfdi4life.de/

Carpentries Community
The Carpentries develops and teaches workshops on the fundamental data skills needed to conduct research.

https://carpentries.org/

GO FAIR Initiative

GO FAIR follows a bottom-up open implementation strategy for the European Open Science Cloud (EOSC) as part of a broader global Internet of FAIR Data & Services.

FAIRDOM
FAIRDOM supports researchers, students, trainers, funders and publishers in making their data, operating procedures and models Findable, Accessible, Interoperable and Reusable (FAIR).

https://fair-dom.org/about-fairdom/