This lesson is still being designed and assembled (Pre-Alpha version)

Library Carpentry: FAIR Data and Software

Introduction

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What does the acronym “FAIR” stand for, and what does it mean?

  • How can library services contribute to FAIR research?

Objectives
  • Articulate the purpose and value of making research FAIR

  • Understand that library services impact various parts of the research lifecycle

Goals of this lesson:

Library services across the research lifecycle

Libraries actively help researchers navigate the requirements, demands, and tools that make up the research data management landscape, particularly when it comes to the organization, preservation, and sharing of research data and software.

They play a vital role in directly supporting the academic enterprise by promoting data sharing and reproducibility. Through their research training and services, particularly for early career researchers and graduate students, libraries are driving a cultural shift towards more effective data and software stewardship.

As a trusted partner, and with an embedded understanding of their communities, libraries foster collaboration and facilitate coordination between community stakeholders and are a critical part of the discussion.

Research Data Life Cycle

El-Gebali, Sara. (2020, September 29). Research Data Life Cycle. Zenodo. http://doi.org/10.5281/zenodo.4057867

FAIR in one sentence

The FAIR data principles are all about how machines and humans communicate with each other. They are not a standard, but a set of principles for developing robust, extensible infrastructure which facilitates discovery, access and reuse of research data and software.

Where did FAIR come from?

The FAIR data principles emerged from a FORCE11 workshop in 2014. This was formalised in 2016 when these were published in Scientific Data: FAIR Guiding Principles for scientific data management and stewardship. In this article, the authors provide general guidance on machine-actionability and improvements that can be made to streamline the findability, accessibility, interoperatbility, and reuability (FAIR) of digital assets.

“as open as possible, as closed as necessary”

FAIR brings all the stakeholders together

We all win when the outputs of research are properly managed, preserved and reusable. This is applicable from big data of genomic expression all the way through to the ‘small data’ of qualitative research.

Research is increasingly dependent on computational support and yet there are still many bottlenecks in the process. The overall aim of FAIR is to cut down on the inefficient processes in research by taking advantage of linked resources and the exchange of data so that all stakeholders in the research ecosystem, can automate repetitive, boring, error-prone tasks.

Examples of Library Services implementing the FAIR principles

Further reading following this lesson

TIB Hannover has provided the following FAIR guide with examples: TIB Hannover FAIR Principles Guide

The European Commission gives tips on implementing FAIR: Six Recommendations for Implementation of FAIR Practice

How does “FAIR” translate to your institution or workplace?

Group exercise Use an etherpad / whiteboard

Key Points

  • The FAIR principles set out how to make data more usable, by humans and machines, by making it Findable, Accessible, Interoperable and Reusable.

  • Librarians have key expertise in information management that can help researchers navigate the process of making their research more FAIR


Findable

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is a persistent identifier or PID?

  • What types of PIDs are there?

Objectives
  • Explain what globally unique, persistent, resolvable identifiers are and how they make data and metadata findable

  • Articulate what metadata is and how metadata makes data findable

  • Articulate how metadata can be explicitly linked to data and vice versa

  • Understand how and where to find data discovery platforms

  • Articulate the role of data repositories in enabling findable data

For data & software to be findable:

F1. (meta)data are assigned a globally unique and eternally persistent identifier or PID
F2. data are described with rich metadata
F3. (meta)data are registered or indexed in a searchable resource
F4. metadata specify the data identifier

Persistent identifiers (PIDs) 101

A persistent identifier (PID) is a long-lasting reference to a (digital or physical) resource:

Different types of PIDs

PIDs have community support, organizational commitment and technical infrastructure to ensure persistence of identifiers. They often are created to respond to a community need. For instance, the International Standard Book Number or ISBN was created to assign unique numbers to books, is used by book publishers, and is managed by the International ISBN Agency. Another type of PID, the Open Researcher and Contributor ID or ORCID (iD) was created to help with author disambiguation by providing unique identifiers for authors. The ODIN Project identifies additional PIDs along with Wikipedia’s page on PIDs.

Digital Object Identifiers (DOIs)

The DOI is a common identifier used for academic, professional, and governmental information such as articles, datasets, reports, and other supplemental information. The International DOI Foundation (IDF) is the agency that oversees DOIs. CrossRef and Datacite are two prominent not-for-profit registries that provide services to create or mint DOIs. Both have membership models where their clients are able to mint DOIs distinguished by their prefix. For example, DataCite features a statistics page where you can see registrations by members.

Anatomy of a DOI

A DOI has three main parts:

Anatomy of a DOI

In the example above, the prefix is used by the Australian National Data Service (ANDS) now called the Australia Research Data Commons (ARDC) and the suffix is a unique identifier for an object at Griffith University. DataCite provides DOI display guidance so that they are easy to recognize and use, for both humans and machines.

Challenge

arXiv is a preprint repository for physics, math, computer science and related disciplines. It allows researchers to share and access their work before it is formally published. Visit the arXiv new papers page for Machine Learning. Choose any paper by clicking on the ‘pdf’ link next to it. Now use control + F or command + F and search for ‘http’. Did the author use DOIs for their data and software?

Solution

Authors will often link to platforms such as GitHub where they have shared their software and/or they will link to their website where they are hosting the data used in the paper. The danger here is that platforms like GitHub and personal websites are not permanent. Instead, authors can use repositories to deposit and preserve their data and software while minting a DOI. Links to software sharing platforms or personal websites might move but DOIs will always resolve to information about the software and/or data. See DataCite’s Best Practices for a Tombstone Page.

Rich Metadata

More and more services are using common schemas such as DataCite’s Metadata Schema or Dublin Core to foster greater use and discovery. A schema provides an overall structure for the metadata and describes core metadata properties. While DataCite’s Metadata Schema is more general, there are discipline specific schemas such as Data Documentation Initiative (DDI) and Darwin Core.

Thanks to schemas, the process of adding metadata has been standardised to some extent but there is still room for error. For instance, DataCite reports that links between papers and data are still very low. Publishers and authors are missing this opportunity.

Challenges: Automatic ORCID profile update when DOI is minted RelatedIdentifiers linking papers, data, software in Zenodo

Connecting research outputs

DOIs are everywhere. Examples.

Resource IDs (articles, data, software, …) Researcher IDs Organisation IDs, Funder IDs Projects IDs Instrument IDs Ship cruises IDs Physical sample IDs, DMP IDs… videos images 3D models grey literature

Connecting Research Outputs

https://support.datacite.org/docs/connecting-research-outputs

Bullet points about the current state of linking… https://blog.datacite.org/citation-analysis-scholix-rda/

Provenance?

Provenance means validation & credibility – a researcher should comply to good scientific practices and be sure about what should get a PID (and what not). Metadata is central to visibility and citability – metadata behind a PID should be provided with consideration. Policies behind a PID system ensure persistence in the WWW - point. At least metadata will be available for a long time. Machine readability will be an essential part of future discoverability – resources should be checked and formats should be adjusted (as far possible). Metrics (e.g. altmetrics) are supported by PID systems.

Publishing behaviour of researchers

According to:

Technische Informationsbibliothek (TIB) (conducted by engage AG) (2017): Questionnaire and Dataset of the TIB Survey 2017 on information procurment and pubishing behaviour of researchers in the natural sciences and engineering. Technische Informationsbibliothek (TIB). DOI: https://doi.org/10.22000/54

Choosing the right repository

Ask your colleagues & collaborators Look for institutional repository at your own institution

determining the right repo for your reseearch data are kept safe in a secure environment data are regularly backed up and preserved (long-term) for future use data can be easily discovered by search engines and included in online catalogues intellectual property rights and licencing of data are managed access to data can be administered and usage monitored the visibility of data can be enhanced enables more use and citation citation of data increases researchers scientific reputation Decision for or against a specific repository depends on various criteria, e.g. Data quality Discipline Institutional requirements Reputation (researcher and/or repository) Visibility of research Legal terms and conditions Data value (FAIR Principles) Exit strategy (tested?) Certificate (based only on documents?)

Some recommendations: → look for the usage of PIDs → look for the usage of standards (DataCite, Dublin Core, discipline-specific metadata → look for licences offered → look for certifications (DSA / Core Trust Seal, DINI/nestor, WDS, …)

Searching re3data w/ exercise https://www.re3data.org/ Out of more than 2115 repository systems listed in re3data.org in July 2018, only 809 (less than 39 %!) state to provide a PID service, with 524 of them using the DOI system

Search open access repos http://v2.sherpa.ac.uk/opendoar/

FAIRSharing https://fairsharing.org/databases/

Data Journals

Another method available to researchers to cite and give credit to research data is to author works in data journals or supplemental approaches used by publishers, societies, disciplines, and/or journals.

Articles in data journals allow authors to:

Examples:

Also, the following study discusses data journals in depth and reviews over 100 data journals: Candela, L. , Castelli, D. , Manghi, P. and Tani, A. (2015), Data Journals: A Survey. J Assn Inf Sci Tec, 66: 1747-1762. doi:10.1002/asi.23358

How does your discipline share data

Does your discipline have a data journal? Or some other mechanism to share data? For example, the American Astronomical Society (AAS) via the publisher IOP Physics offers a supplment series as a way for astronomers to publish data.

List recent publications re: benefits of data sharing / software sharing

Questions: Is FAIRSharing vs re3data comparison slide from TIB findability slides needed here? Should we include recent thread about handle system vs DOIs in IRs (costs) Zenodo-GitHub linking is listed in another episode, right? Include guidance for Google schema indexing…

Notes:
Note about authors being proactive and working with the journals/societies to improve papers referencing data, software…

Tombstone

Key Points

  • First key point.


Accessible

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question

Objectives
  • Understand what a protocol is

  • Understand authentication protocols and their role in FAIR

  • Articulate the value of landing pages

  • Explain closed, open and mediated access to data

For data & software to be accessible:

A1.  (meta)data are retrievable by their identifier using a standardized communications protocol
A1.1 the protocol is open, free, and universally implementable
A1.2 the protocol allows for an authentication and authorization procedure, where necessary
A2. metadata remain accessible, even when the data are no longer available

What is a protocol?

Simply put, it’s an access method of exchanging data over a computer network. Each protocol has its rules for how data is formatted, compressed, checked for errors. Research repositories often use the OAI-PMH or REST API protocols to interface with data in the repository. The following image from TutorialEdge.net: What is a RESTful API by Elliot Forbes provides a useful overview of how RESTful interfaces work:

TutorialEdge.net: What is a RESTful API? by Elliot Forbes

Zenodo offers a visual interface for seeing how formats such as DataCite XML will look like when requested for records such as the following record from the Biodiversity Literature Repository:

Formiche di Madagascar raccolte dal Sig. A. Mocquerys nei pressi della Baia di Antongil (1897-1898).

Wikipedia has a list of commonly used network protocols but check the service you are using for documentation on the protocols it uses and whether it corresponds with the FAIR Principles. For instance, see Zenodo’s Principles page.

Contributor information

Alternatively, for sensitive/protected data, if the protocol cannot guarantee secure access, an e-mail or other contact information of a person/data manager should be provided, via the metadata, with whom access to the data can be discussed. The DataCite metadata schema includes contributor type and name as fields where contact information is included. Collaborative projects such as THOR, FREYA, and ODIN are working towards improving the interoperability and exchange of metadata such as contributor information.

Author disambiguation and authentication

Across the research ecosystem, publishers, repositories, funders, research information systems, have recognized the need to address the problem of author disambiguation. The illustrative example below of the many variations of the name Jens Åge Smærup Sørensen demonstrations the challenge of wrangling the correct name for each individual author or contributor:

Jens Åge Smærup Sørensen

Thankfully, a number of research systems are now integrating ORCID into their authentication systems. Zenodo provides the login ORCID authentication option. Once logged in, your ORCID will be assigned to your authored and deposited works.

Exercise to create ORCID account and authenticate via Zenodo

  1. Register for an ORCID.
  2. You will receive a confirmation email. Click the link in the email to establish your unique 16-digit ORCID.
  3. Go to Zenodo and select Log in (if you are new to Zenodo select Sign up).
  4. Go to linked accounts and click the Connect button next to ORCID.

Next time you log into Zenodo you will be able to ‘Log in with ORCID’:

Zenodo login with ORCID and GitHub authentication

Understanding whether something is open, free, and universally implementable

ORCID features a principles page where we can assess where it lies on the spectrum of these criteria. Can you identify statements that speak to these conditions: open, free, and universally implemetable?

Answers:

Challenge Questions:

Tombstones, a very grave subject

There are a variety of reasons why a placeholder with metadata or tombstone of the removed research object exists including but not limited to staff removal, spam, request from owner, data center does not exist is still, etc. A tombstone page is needed when data and software is no longer accessible. A tombstone page communicates that the record is gone, why it is gone, and in case you really must know, there is a copy of the metadata for the record. A tombstone page should include: DOI, date of deaccession, reason for deaccession, message explaining the data center’s policies, and a message that a copy of the metadata is kept for record keeping purposes as well as checksums of the files. Zenodo offers us further explanation of the reasoning behind tombstone pages.

DataCite offers statistics where the failure to resolve DOIs after a certain number of attempts is reported (see DataCite statistics support pagefor more information). In the case of Zenodo and the GitHub issue above, the hidden field reveals thousands of records that are a result of spam.

DataCite Statistics Page

If a DOI is no longer available and the data center does not have the resources to create a tombstone page, DataCite provides a generic tombstone page.

See the following tombstone examples:

Discussion of tombstones

Key Points

  • First key point.


Interoperable

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What does interoperability mean?

  • What is a controlled vocabulary, a metadata schema and linked data?

  • How do I describe data so that humans and computers can understand?

Objectives
  • Explain what makes data and software (more) interoperable for machines

  • Identify widely used metadata standards for research, including generic and discipline-focussed examples

  • Explain the role of controlled vocabularies for encoding data and for annotating metadata in enabling interoperability

  • Understand how linked data standards and conventions for metadata schema documentation relate to interoperability

For data & software to be interoperable:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data

What is interoperability for data and software?

Shared understanding of concepts, for humans as well as machines.

What does it mean to be machine readable vs human readable?

According to the Open Data Handbook:

Human Readable
“Data in a format that can be conveniently read by a human. Some human-readable formats, such as PDF, are not machine-readable as they are not structured data, i.e. the representation of the data on disk does not represent the actual relationships present in the data.”

Machine Readable
“Data in a data format that can be automatically read and processed by a computer, such as CSV, JSON, XML, etc. Machine-readable data must be structured data. Compare human-readable. Non-digital material (for example printed or hand-written documents) is by its non-digital nature not machine-readable. But even digital material need not be machine-readable. For example, consider a PDF document containing tables of data. These are definitely digital but are not machine-readable because a computer would struggle to access the tabular information - even though they are very human readable. The equivalent tables in a format such as a spreadsheet would be machine readable. As another example scans (photographs) of text are not machine-readable (but are human readable!) but the equivalent text in a format such as a simple ASCII text file can machine readable and processable.”

Software uses community accepted standards and platforms, making it possible for users to run the software. Top 10 FAIR things for research software

Describing data and software with shared, controlled vocabularies

See

Representing knowledge in data and software

See https://librarycarpentry.org/Top-10-FAIR//2018/12/01/historical-research/#thing-5-data-structuring-and-organisation.

Beyond the PDF

Publishers, librarians, researchers, developers, funders, they have all been working towards a future where we can move beyond the PDF, from ‘static and disparate data and knowledge representations to richly integrated content which grows and changes the more we learn.” Research objects of the future will capture all aspects of scholarship: hypotheses, data, methods, results, presentations etc.) that are semantically enriched, interoperable and easily transmitted and comprehended. Attribution, Evaluation, Archiving, Impact https://sites.google.com/site/beyondthepdf/

Beyond the PDF has now grown into FORCE… Towards a vision where research will move from document- to knowledge-based information flows semantic descriptions of research data & their structures aggregation, development & teaching of subject-specific vocabularies, ontologies & knowledge graphs Paper of the Future https://www.authorea.com/users/23/articles/8762-the-paper-of-the-future to Jupyter Notebooks/Stencilia https://stenci.la/

Knowledge representation languages

provide machine-readable (meta)data with a well-established formalism structured, using discipline-established vocabularies / ontologies / thesauri (RDF extensible knowledge representation model, OWL, JSON LD, schema.org) offer (meta)data ingest from relevant sources (Document Information Dictionary or Extensible Metadata Platform from PDF) provide as precise & complete metadata as possible look for metrics to evaluate the FAIRness of a controlled vocabulary / ontology / thesaurus often do not (yet) exist assist in their development clearly identify relationships between datasets in the metadata (e.g. “is new version of”, “is supplement to”, “relates to”, etc.) request support regarding these tasks from the repositories in your field of study for software: follow established code style guides (thanks to @npch!)

Adding qualified references among data and software

support referencing metadata fields between datasets via a schema (relatedIdentifer, relationType)

Data Science and Digital Libraries => (research) knowledge graph(s) Scientific Data Management Visual Analytics to expose information within videos as keywords => av.tib.eu Scientific Knowledge Engineering => ontologies

Example: → Automatic ORCID profile update when DOI is minted DataCite – CrossRef – ORCID collaboration → PID of choice for RDM: Here: The Digital Object Identifier (DOI)

Detour: Replication / Reproducibility Crisis doi.org/10.1073/pnas.1708272114 doi.org/10.1371/journal.pbio.1002165 doi.org/10.12688/f1000research.11334.1 Examples of science failing due to software errors/bugs: figshare.com/authors/Neil_Chue_Hong/96503

“[…] around 70% of research relies on software […] if almost a half of that software is untested, this is a huge risk to the reliability of research results.” Results from a US survey about Research Software Engineers URSSI.us/blog/2018/06/21/results-from-a-us-survey-about-research-software-engineers (Daniel S. Katz, Sandra Gesing, Olivier Philippe, and Simon Hettrick) Olivier Philippe, Martin Hammitzsch, Stephan Janosch, Anelda van der Walt, Ben van Werkhoven, Simon Hettrick, Daniel S. Katz, Katrin Leinweber, Sandra Gesing, Stephan Druskat. 2018. doi.org/10.5281/zenodo.1194669

Code style guides & formatters (thanks to Neil Chu Hong) faster than manual/menial formatting code looks the same, regardless of author can be automated enforced to keep diffs focussed PyPI.org/project/pycodestyle, /black, etc. ROpenSci packaging guide style.tidyverse.org Google.GitHub.io/styleguide

If others can use your code, convey the meaning of updates with SemVer.org (CC BY 3.0) “version number[ changes] convey meaning about the underlying code” (Tom Preston-Werner)

Exercise Python & R Carpentries lessons

Linked Data

Top 10 FAIR things: Linked Open Data

Linked data example Triples - RDF - SPARQL Wikidata exercise

Standards: https://fairsharing.org/standards/ schema.org: http://schema.org/

ISA framework: ‘Investigation’ (the project context), ‘Study’ (a unit of research) and ‘Assay’ (analytical measurement) - https://isa-tools.github.io/

Example of schema.org: rOpenSci/codemetar

Modularity http://bioschemas.org

codemeta croswalks to other standards https://codemeta.github.io/crosswalk/

DCAT https://www.w3.org/TR/vocab-dcat/

Using community accepted code style guidelines such as PEP 8 for Python (PEP 8 itself is FAIR)

Scholix - related indentifiers - Zenodo example linking data/software to papers https://dliservice.research-infrastructures.eu/#/ https://authorcarpentry.github.io/dois-citation-data/01-register-doi.html

Should vocabularies from reusable episode be moved here?

Key Points

  • Understand that FAIR is about both humans and machines understanding data.

  • Interoperability means choosing a data format or knowledge representation language that helps machines to understand the data.


Reusable

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question

Objectives
  • Explain machine readability in terms of file naming conventions and providing provenance metadata

  • Explain how data citation works in practice

  • Understand key components of a data citation

  • Explore domain-relevant community standards including metadata standards

  • Understand how proper licensing is essential for reusability

  • Know about some of the licenses commonly used for data and software

For data & software to be reusable:

R1. (meta)data have a plurality of accurate and relevant attributes
R1.1 (meta)data are released with a clear and accessible data usage licence
R1.2 (meta)data are associated with their provenance
R1.3 (meta)data meet domain-relevant community standards

What does it mean to be machine readable vs human readable?

According to the Open Data Handbook:

Human Readable “Data in a format that can be conveniently read by a human. Some human-readable formats, such as PDF, are not machine-readable as they are not structured data, i.e. the representation of the data on disk does not represent the actual relationships present in the data.”

Machine Readable “Data in a data format that can be automatically read and processed by a computer, such as CSV, JSON, XML, etc. Machine-readable data must be structured data. Compare human-readable.

Non-digital material (for example printed or hand-written documents) is by its non-digital nature not machine-readable. But even digital material need not be machine-readable. For example, consider a PDF document containing tables of data. These are definitely digital but are not machine-readable because a computer would struggle to access the tabular information - even though they are very human readable. The equivalent tables in a format such as a spreadsheet would be machine readable.

As another example scans (photographs) of text are not machine-readable (but are human readable!) but the equivalent text in a format such as a simple ASCII text file can machine readable and processable.”

File naming best practices

A file name should be unique, consistent and descriptive. This allows for increased visibility and discoverability and can be used to easily classify and sort files. Remember, a file name is the primary identifier to the file and its contents.

Do’s and Don’ts of file naming:

Do’s:

Examples:

Directory structures and README files

A clear directory structure will make it easier to locate files and versions and this is particularly important when collaborating with others. Consider a hierarchical file structure starting from broad topics to more specific ones nested inside, restricting the level of folders to 3 or 4 with a limited number of items inside each of them.

The UK data services offers an example of directory structure and naming: https://ukdataservice.ac.uk/manage-data/format/organising.aspx

For others to reuse your research, it is important to include a README file and to organize your files in a logical way. Consider the following file structure examples from Dryad:

Dryad File Structures

It is also good practice to include README files to describe how the data was collected, processed, and analyzed. In other words, README files help others correctly interpret and reanalyze your data. A README file can include file names/directory structure, glossary/definitions of acronyms/terms, description of the parameters/variables and units of measurement, report precision/accuracy/uncertainty in measurements, standards/calibrations used, environment/experimental conditions, quality assurance/quality control applied, known problems, research date information, description of relationships/dependencies, additional resources/references, methods/software/data used, example records, and other supplemental information.

Disciplinary Data Formats

Many disciplines have developed formal metadata standards that enable re-use of data; however, these standards are not universal and often it requires background knowledge to indentify, contextualize, and interpret the underlying data. Interoperability between disciplines is still a challenge based on the continued use of custom metadata schmes, and the development of new, incompatiable standards. Thankfully, DataCite is providing a common, overarching metadata standard across disciplinary datasets, albeit at a generic vs granular level.

In the meantime, the Research Data Alliance (RDA) Metadata Standards Directory - Working Group developed a collaborative, open directory of metadata standards, applicable to scientific data, to help the research community learn about metadata standards, controlled vocabularies, and the underlying elements across the different disciplines, to potentially help with mapping data elements from different sources.

Exercise/Quiz?

Metadata Standards Directory
Features: Standards, Extensions, Tools, and Use Cases

Quality Control

Quality control is a fundamental step in research, which ensures the integrity of the data and could affect its use and reuse and is required in order to identify potential problems.

It is therefore essential to outline how data collection will be controlled at various stages (data collection,digitisation or data entry, checking and analysis).

Versioning

In order to keep track of changes made to a file/dataset, versioning can be an efficient way to see who did what and when, in collaborative work this can be very useful.

A version control strategy will allow you to easily detect the most current/final version, organize, manage and record any edits made while working on the document/data, drafting, editing and analysis.

Consider the following practices:

Example: UK Data service version control guide: https://www.ukdataservice.ac.uk/manage-data/format/versioning.aspx

Research vocabularies

Research Vocabularies Australia https://vocabs.ands.org.au/ AGROVOC & VocBench http://aims.fao.org/vest-registry/vocabularies/agrovoc Dimensions Fields of Research https://dimensions.freshdesk.com/support/solutions/articles/23000012844-what-are-fields-of-research-

Versioning/SHA https://swcarpentry.github.io/git-novice/reference

Binder - executable environment, making your code immediately reproducible by anyone, anywhere. https://blog.jupyter.org/binder-2-0-a-tech-guide-2017-fd40515a3a84

Narrative & Documentation Jupyter Notebooks https://www.contentful.com/blog/2018/06/01/create-interactive-tutorials-jupyter-notebooks/

Licenses Licenses rarely used From GitHub https://blog.github.com/2015-03-09-open-source-license-usage-on-github-com/

Lack of licenses provide friction, understanding of whether can reuse Peter Murray Project - ContentMine - The Right to Read is the Right to Mine - OpenMinTed Creative Commons Wizard and GitHub software licensing wizards (highlight attribution, non commercial)

Lessons to teach with this episode Data Carpentry - tidy data/data organization with spreadsheets https://datacarpentry.org/lessons/ Library Carpentry - intro to data/tidy data

Exercise? Reference Management w/ Zotero or other

demo: import Zenodo.org/record/1308061 into Zotero demo: RStudio > Packages > Update, run PANGAEA example, then install updates https://tibhannover.github.io/2018-07-09-FAIR-Data-and-Software/FAIR-remix-PANGAEA/index.html

Useful content for Licenses Note: TIB Hannover Slides https://docs.google.com/presentation/d/1mSeanQqO0Y2khA8KK48wtQQ_JGYncGexjnspzs7cWLU/edit#slide=id.g3a64c782ff_1_138

Additional licensing resources: Choose an open source license: https://choosealicense.com/ 4 Simple recommendations for Open Source Software https://softdev4research.github.io/4OSS-lesson/ Use a license: https://softdev4research.github.io/4OSS-lesson/03-use-license/index.html Top 10 FAIR Imaging https://librarycarpentry.org/Top-10-FAIR//2019/06/27/imaging/ Licensing your work: https://librarycarpentry.org/Top-10-FAIR//2019/06/27/imaging/#9-licensing-your-work The Turing Way a Guide for reproducible Research: https://the-turing-way.netlify.app/welcome Licensing https://the-turing-way.netlify.app/reproducible-research/licensing.html The Open Science Training Handbook: https://open-science-training-handbook.gitbook.io/book/ Open Licensing and file formats https://open-science-training-handbook.gitbook.io/book/open-science-basics/open-licensing-and-file-formats#6-open-licensing-and-file-formats DCC How to license research data https://www.dcc.ac.uk/guidance/how-guides/license-research-data

Exercise- Thanks, but no Thanks!

In groups of 2-3 discuss and note down;

Key Points

  • First key point.


Assessment

Overview

Teaching: 10 min
Exercises: 50 min
Questions
  • How can I assess the FAIRness of myself, an organisation, a service, a community… ?

  • Which FAIR assessment tools exist to understand how FAIR you are?

Objectives
  • Assess the current FAIRness level of myself, organisation, service, community…

  • Know about available tools for assessing FAIRness.

  • Understand the next steps you can take to being FAIRer.

Reasons for assessment

FAIR is a journey. Technology and the way people work is shifting often and what might be FAIR today might not be months, years from now. A FAIR assessment now is a snapshot in time. Nevertheless, individuals, organizations, disciplines, services, countries, and communities will look to how FAIR they are. The reasons are various, including gaining a better understanding, comparing with others, making improvements, and participating further in the scholarly ecosystem, to name a handful. Ultimately, an assessment can be a helpful guide on the path to becoming more FAIR.

Mirror, mirror on the wall, who is the FAIRest one of all?

Mirror, mirror on the wall, who is the FAIRest one of all? - In March 2020, Theuringen FDM-TAGE offered awards to the FAIRest datasets based on the FAIR principles. The FAIRest Dataset winners were announced in June 2020. What is FAIR about the winning datasets? Is there anything else that can be done to make them FAIRer?

FAIR is a vision, NOT a standard

The FAIR principles are a way of reaching for best data and software practices, coming to a convergence on what those are, and how to get there. They are NOT rules. They are NOT a standard. They are NOT a requirement. The principles were not meant to be prescriptive but instead offer a vision to optimise data/software sharing and reuse by humans and machines.

Inconsistent interpretations

The lack of information on how to implement the FAIR principles have led to inconsistent interpretations. Jacobsen, A., de Miranda Azevedo, R., Juty, N., Batista, D., Coles, S., Cornet, R., … & Goble, C. (2020). FAIR principles: interpretations and implementation considerations describes implementation considerations.

Types of assessment

Depending on your needs, whether you want to assess yourself, a service, your organization or community, or even your country or region, FAIR assessment or evaluation tools are available to help guide you in your path towards FAIR betterment. The following are some resources and exercises to help you get started.

Individual assessment

How FAIR are you? The FAIRsFAIR project has developed an assessment tool called FAIR-Aware that both helps you understand the principles and also how you can improve the FAIRness of your research. Before taking the assessment, have a target dataset/software in mind to prepare you for the questions which include questions about yourself and 10 questions about FAIR. Each question provides additional information and guidance and helps you assess your current FAIRness level along with potential actions to take. The assessment takes 10 to 30 minutes to complete depending on your familiarity with the subject and issues covered.

Challenge

Encourage your workshop participants to review the episodes in this FAIR lesson and take the FAIR-Aware assessment ahead of time. In person (or virtual), ask the participants to split up into groups and to highlight some of their key questions/findings from the FAIR-Aware assessment. Ask them to note their questions/findings/anything else in the session’s collaborative notes. After a duration, ask the groups to return to the main group and call on each group (leader) to summarise their discussion. Synthesise some of the key points and discuss next steps on how participants can address their FAIRness moving forward.

Alternatively, the Australian Research Data Commons (ARDC) FAIR data assessment tool and/or the How FAIR are your data? checklist by Jones and Grootveld are also available and can be substituted for the FAIR-Aware assessment tool.

Evaluate the FAIRness of digital resources

How FAIR is your service and the digital resources you share? How can your service enable greater machine discoverability and (re)use of its digital resoruces? Evaluation of your service’s FAIRness lies on a continuum based on the behaviors and norms of your community. Frameworks and tools to assess services are currently under development and what options are available should be paired with the evaulation of what makes sense to your community.

FAIR Evaluation Services

The FAIR Evaluation Service is available to assess the FAIRness of your digital resources. Developed by the Maturity Indicator Authoring Group, FAIR Maturity Indicators are available to test your service via a submission process. The rationale for the Service are explained in Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud. To get started, the Group can also be reached via their Service.

As a service provider, for example a data repository, you might want to assess the FAIRness of datasets in your systems. You can do this by using one of the resources at FAIRassist or you can start your assessment manually (as a group exercise). Some infrastructure providers have provided overviews of how their services enable FAIR.

Challenge

Encourage your workshop participants to review the episodes in this FAIR lesson and then review the FAIR principles responses/statements from Zenodo and Figshare above before the workshop. Again, ahead of the workshop, ask the participants to develop similar responses/statements for a service at their organisation, in their community. An outline with brief bullet points is best. Pre-assign workshop participants to groups and ask them to share their responses/statements with each other. Then in person (or virtual), ask the participants to split up into their pre-assigned groups and to discuss each other’s responses/statements. Ask them to note their questions/findings/anything else in the session’s collaborative notes. After a duration, ask the groups to return to the main group and call on each group (leader) to summarise their discussion. Synthesise some of the key points and discuss next steps on how participants can address their FAIRness moving forward.

Quantifying FAIR

In a recent DataONE webinar titled Quatifying FAIR, Jones and Slaughter describe tests that have been conducted to assess the FAIRness of digital resources across their services. The MetaDIG tool is referenced, used to check the quality of metadata in these services. Based on this work, DataONE also lists a Make your data FAIR tool as coming soon

Community assessment

Communities can also assess how FAIR they are and develop goals and/or action plans for advancing FAIR. Communities can range from topical to regional and even organisational. A recently published report from the Directorate-General for Research and Innovation (European Commission) titled “Turning FAIR into reality” is an invaluable resource for creating action plans and turning FAIR into reality for communities. The report includes a survey and analysis of what is needed to implement FAIR, concrete recommendations and actions for stakeholders, and example case studies to learn from. Ultimately, the report serves as a useful framework for mapping a community’s next steps towards a FAIRer future.

Challenge

Encourage your workshop participants to first read:

Collins, Sandra, et al. “Turning FAIR into reality: Final report and action plan from the European Commission expert group on FAIR data.” (2018). https://op.europa.eu/en/publication-detail/-/publication/7769a148-f1f6-11e8-9982-01aa75ed71a1

Leverage the recommendations (pictured below) and organise group discussions on these themes. Ask the participants to brainstorm initial responses to these themes ahead of the workshop and collect their initial responses in a collaborative document, structured by the themes. During the workshop, ask the participants to join their groups based on the themes they initially responded to and discuss each other’s responses. Ask them to note their questions/findings/anything else in the session’s collaborative notes. After a duration, ask the groups to return to the main group and call on each group (leader) to summarise their discussion. Synthesise some of the key points and discuss next steps on how participants can address their FAIRness moving forward.

Index to FAIR Action Plan recommendations

Challenge

Alternatively, start your community off on the path to FAIR by setting up an initial study group. The group’s goal, to scan, and possibly survey, work that has been done by community members (or like communities) on example/case studies, policies, recommendation, guidance, etc to collect resources to help inform future FAIR discussions/initiatives. Consider structuring your group work to produce guidance, e.g. Top 10 FAIR Data and Software Things:

Paula Andrea Martinez, Christopher Erdmann, Natasha Simons, Reid Otsuji, Stephanie Labou, Ryan Johnson, … Eliane Fankhauser. (2019, February). Top 10 FAIR Data & Software Things. Zenodo. http://doi.org/10.5281/zenodo.3409968

Other assessment tools

To see a list of additional resources for the assessment and/or evaluation of digital objects against the FAIR principles, see FAIRassist.

Planning

Data and software management plans are also a helpful tool for planning out how you/your group will manage data and software throughout your project but they also provide a mechanism for assessment at different checkpoints. Revisiting your plan at different checkpoints allows you to review how well you are doing, incorporate findings, and make improvements. This allows your plan to evolve, be more actionable, and less static.

Resources and examples include:

Data:

Software:

Challenge

Use one of the resources above to draft a plan for your research project (as an individual and/or group). As an individual, ask a colleague to review your draft, provide feedback, and discuss. As a group, outline the plan questions in a collaborative document and work together to draft responses, then discuss together as a group. Consider publishing your plan to share with others.

Resources

This is a developing area, so if you have any resources that you would like to share, please add them to this lesson via a pull request or GitHub issue.

Key Points

  • Assessments and plans are helpful tools for understanding next steps/action plans for becoming more FAIR.


Software

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question

Objectives
  • The objective of this lesson is to get learners up to speed on how the FAIR principles apply to software, and to make learners aware of accepted best practices.

Applying FAIR to software

The FAIR principles have been developed with research data in mind, however they are relevant to all digital objects resulting from the research process - which includes software. The discussion around FAIRification of software - particularly with respect to how the principles can be adapted, reinterpreted, or expanded - is ongoing. This position paper (https://doi.org/10.3233/DS-190026) elaborates on this discussion and provides an overview on how the 15 FAIR Guiding Principles can be applied to software.

BTW what do we mean by software here? Software can refer to programming scripts and packages such as those for R & Python, as well as programs and webtools/webservices such as …… For the most part, this lesson focuses on the former.

Note/Acknowledgements

This lesson draft, complied during the FAIR lesson sprint @CarpentryCon2020, draws heavily from the Top 10 FAIR Data & Software Things + FAIR Software Guide (see reference list).

Findable

For software to be findable,

  1. Software should be well documented to to increase findability and eventually, reusability. This metadata can start with a minimum set of information that includes a short description and meaningful keywords. This documentation be further improved by applying metadata standards such as those from CodeMeta (https://codemeta.github.io/terms/) or DataCite.

  2. Software should be registered in a dedicated registry such as Research Software Directory, rOpenSci project, Zenodo. These registries will usually request the previously mentioned metadata descriptors, and they are optimized to show up in search engine results. question to be resolved: how is a registry different from a repository?

  3. When attaching identifiers to software, they should be unqiue and persistent. They should point only the one version and location of your software. A software registry usually provides a PID, otherwise you can obtain one from another organization and include it in your metadata yourself. See: Software Hertigage archive.

  4. Your software can also be hosted/deposited in an open and publicly accessible repository, preferably with version control and citation guidelines. Examples: Zenodo or GitHub (see GitHub doc on citing code, but note the issue about attaching identifiers and the ‘permanence’ of GitHub). Still have to check the difference between registries and repositories.

Accessible

For software to be accessible,

  1. What does it mean to make metadata machine and human readable? This should be covered in a previous lesson and referred back here…

  2. The (executable) software and metadata should be publicly accessible and downloadable. These downloadable and executable files can be hosted or deposited in a repository / registry, as well as project websites.

Interoperable

For softare to be interoperable,

  1. Interoperability is increased when there is documentation that outlines functions of the software. This means providing clear and concise descriptions of all the operations available with the software, including input and output functions and the data types associated with them.

  2. If community standards are available, input and output formats should adhere to them. This makes it easy to exchange data between software (the data itself will be better linked too).

Reusable

For software to be reusable,

  1. Software should be sufficiently documented beyond metadata. This could include instructions, troubleshooting, statements on dependencies, examples or tutorials.

  2. Include a license. This informs potential users on how they may/may not (re)use the software.

  3. Include a citation guideline, if you’d like to receive credit or acknowledgement for your work.

  4. Combine best practices for software development with FAIR. This would include working out of a repository with version control (which allows for tracking changes and transparency), following code standards, building modual code etc.

  5. In addition to 4, you can use a software quality checklist. There are several available and you can select which is most suitable for your software. A good checklist will include items which allow for a granular evaluation of the software, explains the rationale behind the check, and explains how to implement the check. It can cover documentation, testing, standardization of code etc. You quality checklist could be maintained in your README file.

References


Lessons to teach in connection with this section and exercises

Software Carpentry: Version Control with Git or Library Carpentry: Intro to Git Making your Code Citeable Does an exercise or lesson exist that we can point to involving Software Heritage?

Key Points

  • First key point.