Silvia Di Giorgio, Akinyemi Mandela Fasemore, Konrad Förstner, Till Sauerwein, Eva Seidlmayer (ZB MED - Information Center for Life Science, Cologne, Germany), Ilja Zeitlin, Susannah Bacon, Chris Erdmann (Library Carpentry/The Carpentries/California Digital Library)
Researchers
To make data findable, each dataset has to be assigned a unique and persistent identifier.
NOTE:
PUBLISSO is the DataCite agency that issues DOIs for the life sciences:
https://www.publisso.de/wir-fuer-sie/doi-service/
Exercise:
For easy look up, we have a list of DOIs below. Can you match the right document to the appropriate DOI? HINT: Start from here https://www.doi.org/!
Which of these is not a valid DOI?
Which part indicates the publishing institution? The prefix or the suffix of a DOI?
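When working through the questions above, it can help to take a DOI apart programmatically. A DOI consists of a prefix beginning with `10.` and a suffix, separated by the first slash. The following is a minimal sketch; the helper name `split_doi` is made up for this illustration, and the check is deliberately loose: it only tests the shape of the string, not whether the DOI is actually registered.

```python
def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI-shaped string into (prefix, suffix) at the first slash."""
    doi = doi.strip()
    # Also accept the resolver and "doi:" forms that often appear in citations.
    for lead in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(lead):
            doi = doi[len(lead):]
            break
    prefix, separator, suffix = doi.partition("/")
    # Shape check only: prefix must start with "10." and a suffix must exist.
    if not separator or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI-shaped string: {doi!r}")
    return prefix, suffix


prefix, suffix = split_doi("10.5281/zenodo.1574835")
print(prefix)  # 10.5281
print(suffix)  # zenodo.1574835
```

A string that passes this check can still be unregistered, so resolving it via https://www.doi.org/ remains the real test.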
ORCID Exercise: ORCID is a persistent identifier that authors register for themselves to avoid author name ambiguity. Use ORCIDs from the start of data creation, i.e. attach the data creator's name and ORCID to the dataset as a metadata field. Include ORCIDs with datasets in repositories (e.g. in the Sequence Read Archive (SRA), include the ORCID of the data creator). This allows data provenance (the origins, custody, and ownership of research data) to be tracked.
Go through the Getting Started with ORCID Integration.
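Before an ORCID is stored in a metadata field, it can be checked for typing errors. The last character of an ORCID iD is a check character computed with the ISO 7064 MOD 11-2 algorithm. The sketch below implements that check; the function names and the `dataset_metadata` record are assumptions made for illustration and do not follow any particular repository schema. The iD used is the one ORCID publishes for the fictitious researcher Josiah Carberry.

```python
def orcid_check_character(first_15_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in first_15_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the iD has 16 characters and a matching check character."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_character(compact[:15]) == compact[15].upper()


# Hypothetical metadata record; the field names are illustrative only.
dataset_metadata = {
    "title": "Example dataset",
    "creator": {"name": "Josiah Carberry", "orcid": "0000-0002-1825-0097"},
}

print(is_valid_orcid(dataset_metadata["creator"]["orcid"]))  # True
```

A passing checksum only shows that the iD is well-formed; it does not prove that the iD belongs to the named person.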
Zenodo, for example, is a tool that makes scientific data and publications easier to cite.
It supports various data and license types. It also supports source code from GitHub repositories.
Exercise:
Questions:
Wikidata provides a common source of open data which can be used by Wikimedia projects such as Wikipedia, and by anyone else, under a public domain license.
Exercise:
Go to Wikidata and find the publication date of the book “On the Origin of Species”.
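The same lookup can also be done by a machine through the Wikidata Query Service, which accepts SPARQL queries. The following sketch builds such a query using the Wikidata property P577 (publication date). The function names and the User-Agent string are assumptions made for this example, and the label match is exact, so the title has to be spelled exactly as it is on Wikidata.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_publication_date_query(title: str, lang: str = "en") -> str:
    """Build a SPARQL query for the publication date (P577) of items labelled `title`."""
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item ?date WHERE {\n"
        f'  ?item rdfs:label "{escaped}"@{lang} ;\n'
        "        wdt:P577 ?date .\n"
        "}\n"
        "LIMIT 10"
    )


def build_request_url(query: str) -> str:
    """Encode the query as a GET request URL that asks for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{WIKIDATA_SPARQL_ENDPOINT}?{params}"


def fetch_publication_dates(title: str) -> list[str]:
    """Run the query against the live endpoint (requires network access)."""
    request = urllib.request.Request(
        build_request_url(build_publication_date_query(title)),
        # Illustrative User-Agent; replace it with one describing your own script.
        headers={"User-Agent": "fair-lesson-example/0.1 (illustrative script)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        data = json.load(response)
    return [row["date"]["value"] for row in data["results"]["bindings"]]
```

Calling `fetch_publication_dates("On the Origin of Species")` on a machine with internet access should return the dates recorded on Wikidata. Several items can share a label, so it is worth checking the returned items by hand.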
This project aims to accelerate scientific discovery and enhance the integrity, transparency, and reproducibility of data. To enable FAIR data sharing, data need to be deposited in a repository that is taking steps to make data as open and FAIR as possible. What counts as FAIR is not yet clear-cut, and there is no such thing as a FAIR stamp, although the CoreTrustSeal certification provides a good indication. Therefore, under the auspices of the Enabling FAIR Data Project, the American Geophysical Union (AGU), re3data, and DataCite have decided to develop new tools to assist researchers in finding an appropriate repository for their data:
Exercise:
Try the “Browse by subject” entry point of the re3data database, which gives a good overview of the wide landscape of research data repositories: https://www.re3data.org/browse/by-subject/
bioschemas.org aims to improve data interoperability in the life sciences. It does this by encouraging people in the life sciences to use schema.org markup, so that their websites and services contain consistently structured information (metadata). This structured information then makes it easier to discover, collate and analyse distributed data.
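To give an idea of what such markup looks like, the sketch below generates a small schema.org `Dataset` description as JSON-LD and wraps it in the script tag that web pages use to embed it. This is a generic schema.org example under stated assumptions, not a validated Bioschemas profile; the function names and all example values (name, description, URL) are hypothetical.

```python
import json


def dataset_jsonld(name: str, description: str, url: str, license_url: str) -> str:
    """Serialise a minimal schema.org Dataset description as JSON-LD."""
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
    }
    return json.dumps(record, indent=2)


def as_script_tag(jsonld: str) -> str:
    """Wrap JSON-LD in the script element used to embed it in an HTML page."""
    return f'<script type="application/ld+json">\n{jsonld}\n</script>'


# Hypothetical example values, for illustration only.
markup = dataset_jsonld(
    name="Example RNA-seq read counts",
    description="Read counts per gene for an example experiment.",
    url="https://example.org/datasets/rna-seq-counts",
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
)
print(as_script_tag(markup))
```

The Bioschemas profiles specify which properties are required or recommended for each type of life science resource, so a real page would follow the relevant profile rather than this minimal set.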
Exercises can be found on the Bioschemas website under “tutorials” and “how to”.
Knowing the appropriate licenses to use for your data can help others understand how they can use your data and can also help with improving accessibility.
Exercise:
The era of Big Data is finally upon us. A prerequisite for accessibility is availability. Well-established sharing protocols such as BitTorrent can keep data available independently of any single server, for as long as at least one peer continues to share them. Using the torrent protocol for scientific data offers some of the advantages below:
The Magnet URI scheme defines the format of magnet links, a de facto standard for identifying files by their content, via cryptographic hash value rather than by their location.
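A minimal sketch of what such a link looks like: the `xt` ("exact topic") parameter carries the hash and the optional `dn` parameter a display name. The function names are made up for this example. Note that a real BitTorrent info-hash is the SHA-1 digest of the bencoded `info` dictionary of a torrent file; the digest of raw bytes computed here is only a stand-in to show the idea of content-based identification.

```python
import hashlib
from urllib.parse import parse_qs, quote

HEX_DIGITS = set("0123456789abcdefABCDEF")


def build_magnet_link(info_hash_hex: str, display_name: str) -> str:
    """Build a magnet link from a 40-character hexadecimal SHA-1 digest."""
    if len(info_hash_hex) != 40 or not set(info_hash_hex) <= HEX_DIGITS:
        raise ValueError("expected a 40-character hexadecimal SHA-1 digest")
    return f"magnet:?xt=urn:btih:{info_hash_hex.lower()}&dn={quote(display_name)}"


def parse_magnet_link(link: str) -> dict[str, list[str]]:
    """Return the parameters of a magnet link as a dict of lists."""
    scheme, _, query = link.partition("?")
    if scheme != "magnet:":
        raise ValueError("not a magnet link")
    return parse_qs(query)


# Stand-in digest (NOT a real torrent info-hash): identical bytes always
# give the same hash, wherever the file happens to be stored.
content = b"example dataset bytes"
stand_in_hash = hashlib.sha1(content).hexdigest()
link = build_magnet_link(stand_in_hash, "dataset.tar.gz")
print(link)
print(parse_magnet_link(link)["dn"])  # ['dataset.tar.gz']
```

Because the identifier is derived from the content, anyone who obtains the data can verify that they match the link, whichever peer supplied them.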
Including a magnet link directly in the publication gives readers direct access to all of the underlying data. For more information, read:
Exercise:
Standardisation of life science data helps to ensure interoperability across different subfields. ELIXIR is an intergovernmental organisation that brings together life science resources from across Europe.
Exercise:
Use the ELIXIR software bio.tools to find the author of the RNA-seq Python pipeline “READemption”.
Bio2RDF is a large network of linked data for the life sciences. The database provides interlinked life science data using semantic web technologies. To learn more about Bio2RDF, read Bio2RDF: towards a mashup to build bioinformatics knowledge systems.
The German Federation for Biological Data (GFBio) is the authoritative, national contact point for issues concerning the management and standardisation of biological and environmental research data during the entire data life cycle (from acquisition to archiving and data publication). GFBio mediates expertise and services between the GFBio data centers and the scientific community, covering all areas of research data management.
Make the data accessible via an API, in a structured data format that can be automatically read and processed by a computer. See the Open Data Handbook Glossary - Machine readable.
Example: https://api.crossref.org/works/10.1371/journal.pcbi.1004668
Example: https://api.datacite.org/works/10.5281/zenodo.1574835
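The Crossref example above can be read by a program in a few lines. The sketch below builds the request URL for a DOI and pulls the title out of the JSON response, which Crossref wraps in a `message` object. The function names are assumptions made for this example, and `sample` is a hypothetical, heavily trimmed response used so that the parsing step can be tried without network access.

```python
import json
import urllib.request
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi: str) -> str:
    """Build the Crossref REST API URL for a single DOI."""
    return CROSSREF_WORKS + quote(doi)


def extract_title(payload: dict) -> str:
    """Return the first title in a Crossref work response, or '' if absent."""
    titles = payload.get("message", {}).get("title", [])
    return titles[0] if titles else ""


def fetch_title(doi: str) -> str:
    """Fetch the metadata for `doi` from the live API (requires network access)."""
    with urllib.request.urlopen(crossref_url(doi), timeout=30) as response:
        return extract_title(json.load(response))


# Hypothetical, trimmed response in the general shape Crossref returns.
sample = json.loads(
    '{"status": "ok", "message": {"DOI": "10.1234/example", "title": ["An example title"]}}'
)
print(crossref_url("10.1371/journal.pcbi.1004668"))
print(extract_title(sample))  # An example title
```

Calling `fetch_title("10.1371/journal.pcbi.1004668")` on a machine with internet access retrieves the title of the article referenced in the first example.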
If the methods to record complex experiments are prone to error, so that reproducible results cannot be guaranteed, how can you ever be sure you’re dealing with real insights and not random information? The electronic lab notebook provides the missing infrastructure for data recording, retrieval and integrity. An electronic lab notebook must be able to create, import, store and retrieve all important data types in digital format. For more information, read:
Exercise:
Explore the demo lab notebook at https://demo.elabftw.net/experiments.php
In most scientific fields, large amounts of data have to be processed before publication. One important aspect of the reproducibility challenge is ensuring that a computational analysis can be reproduced, even in a different environment. For more information, read:
Learn Docker & Containers using Interactive Browser-Based Scenarios: https://www.katacoda.com/courses/docker
Blockchain technology has the potential to be a technical solution to the current reproducibility crisis in science, and could “reduce waste and make more research results true”. See:
Living document example:
See Blockchain for Open Science – the living document: https://www.blockchainforscience.com/2017/02/23/blockchain-for-open-science-the-living-document/
Research Data Infrastructure for the Life Sciences (NFDI4Life)
NFDI4Life brings together research communities across the life sciences domain in the context of the planned National Research Data Infrastructure (NFDI). In response to the increasing scientific and societal demand for data and data analysis, it unites scientific communities and research data infrastructures that broadly cover the life sciences, with a particular focus on the subdomains of biology, medicine (including veterinary medicine), epidemiology, nutrition, agricultural and environmental science, and biodiversity research.
Carpentries Community
The Carpentries develops and teaches workshops on the fundamental data skills needed to conduct research.
GO FAIR Initiative
GO FAIR follows a bottom-up open implementation strategy for the European Open Science Cloud (EOSC) as part of a broader global Internet of FAIR Data & Services.
FAIRDOM
FAIRDOM supports researchers, students, trainers, funders and publishers in making their data, operating procedures and models Findable, Accessible, Interoperable and Reusable (FAIR).