This “10 Things” guide aims to promote the use of the FAIR principles in the bioimaging community. The FAIR principles are described in the context of bioimaging, and the activities are optional. The guide aims to empower researchers, scientists and health professionals to incorporate data best practices throughout the research cycle, in order to improve research data quality and the reproducibility and reusability of research outputs.
Imaging researchers, neuroscientists, clinicians, microscopists, platform engineers, graduate students and computational scientists working on image analysis and processing.
To inform data producers and users about the FAIR principles applied to bioimaging and suggest activities to apply to their research.
1. What is FAIR?
2. What are publishers and funders saying about data access?
3. Data sharing and discovery
4. Reusable data repositories for the image community
5. Managing and sharing sensitive data
6. Persistent identifiers
7. Describing data: metadata
8. Reusable data best practices
9. Licensing your work
10. Data citation for access and attribution
The term FAIR, as detailed in 15 principles, stands for Findable, Accessible, Interoperable and Reusable. The FAIR principles are guidelines to motivate and enhance the reusability of data by facilitating its discovery, integration and evaluation. In this context, “data” refers to all research-oriented digital objects (including data, metadata, software, workflows and packages). Wilkinson et al. pioneered the definition of the guiding principles, “emphasising the capacity of computational systems to Find, Access, Interoperate and Reuse data with none or minimal human intervention”, which is also referred to as the machine-actionable FAIR principles. FAIR is also connected with the open research and data management movements, as Higman et al. describe in “Three camps, one destination: the intersection of research data management, FAIR and Open”.
“FAIRness is a prerequisite for proper data management and data stewardship”
Communities are motivated to apply the FAIR principles to research activities and to enable people and machines to find, read, use and reuse research data and research outputs. For instance, in 2018 the Enabling FAIR Data Project, a coalition of stakeholders representing the international Earth and space science community, set out to develop standards that will connect researchers, publishers and data repositories in this community to enable FAIR data on a large scale. This project will accelerate scientific discovery and enhance the integrity, transparency and reproducibility of these data. In imaging, on 1 March 2019, Euro-BioImaging and other research infrastructures, including ELIXIR-Europe, joined forces as part of the European Open Science Cloud project to publish research data via FAIR databases. Community participation from academia, industry, small and medium-sized enterprises (SMEs) and regional bio-clusters is paramount for the success of this four-year project (starting in 2020). The imminent global uptake of the FAIR principles across different scientific domains can only motivate us to move forward, promote and apply them.
Activity 1: In 2018, CODATA, the Committee on Data for Science and Technology, shared news of an important milestone, the “Enabling FAIR Data Project and Commitment Statement”. Take a look at the partners: do you recognise any partners in the imaging discipline?
Activity 2: Can you think of the benefits of making your data FAIR? And how you can align your current data practices to the FAIR principles? Consider the following resources when addressing the activity above:
The following examples and statements are meant to motivate organisations and researchers to adopt the next steps towards FAIR. As a disclaimer, most of the examples are from Australian stakeholders, as the guide is being developed in Australia; nonetheless, international examples have also been included.
“Nature journals require sharing research materials because their core business is ensuring research quality and promoting research to the widest readership” (Nature Genetics, 2004).
In 2014, the Nature Publishing Group welcomed its newest journal, Scientific Data, a peer-reviewed, open-access publication designed to provide a better way to share and explain data. Scientific Data promotes reproducible, collaborative science and due credit to scientists.
PLOS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction at the time of publication. PLOS suggests using FAIRsharing to index resources, for example their own PLOS list of recommended resources.
The eLife journal policy states: “wherever possible, authors should make major datasets available using domain-specific public archives, or generic databases, e.g. FAIRsharing page for eLife recommended repositories and standards”.
Funders like the European Commission have drafted Guidelines on FAIR Data Management for the H2020 programme (European Commission, 2016): “those projects funded in this scheme must submit a version of this FAIR Data Management Plan (DMP)”.
The Australian 2016 National Research Infrastructure Roadmap has two FAIR highlights: 1. Australia must stay at the forefront of international developments and should continue to engage with internationally recognised initiatives… such as the FAIR guiding principles. 2. The Australian National Data Service (ANDS) (now ARDC) has been foundational, in many cases providing leading international policies and practices to support researchers and institutions in making data FAIR.
In 2018, the European Commission Directorate-General for Research and Innovation published the report and action plan Turning FAIR into reality to implement FAIR and provide concrete recommendations and actions for stakeholders in Europe and beyond.
Starting in September 2016, all research papers accepted for publication in Nature and an initial 12 other Nature Research titles will be required to include information on whether and how others can access the underlying data (Nature, “Announcement: Where are the data?”).
The Australian Research Council (ARC) Open Access Policy Version 2017.1 states “Author(s) should consider selecting publishers and research outlets, which have policies supporting the F.A.I.R. principles, as well as immediate or early availability of Publications via Open Access, in order to maximise the availability and impact of their ARC Funded Research.”
Policy Statement (2017) on FAIR Access to Australia’s Research Outputs (https://www.fair-access.net.au/fair-statement). Headline: By 2020, Australian publicly funded researchers and research organisations will have in place policies, standards and practices to make publicly funded research outputs findable, accessible, interoperable and reusable to the Australian and international community.
The (Australian) National Health and Medical Research Council (NHMRC) promotes the highest quality in the research that it funds, based on international best practice. The NHMRC lists the FAIR principles under useful resources for publication and reporting of research outcomes.
In late 2017, the Australian Health Research Alliance (AHRA) committed to developing a coordinated national approach to Data Driven Healthcare Improvement, leveraging data registration, linkage, integration, storage, security, access, management and analysis capabilities.
In 2018, the Enabling FAIR Data Commitment Statement was formalised by a significant group of stakeholders (repositories, publishers, societies, communities, institutions, funding agencies and organisations, and researchers) to support and promulgate open and FAIR data principles and practices in their core science activities and policies.
Wiley’s data sharing and citation policies and services support the growing movement to make research more open, because this leads to a fairer, more efficient and accountable research landscape, driving a more effective and faster pace of discovery (Wiley, “Data sharing and citation”).
The Genomics Health Futures Mission (GHFM) Projects Grant Opportunity guidelines (2019) state: “research projects proposals with plans to manage genomic and/or phenomic data in alignment with the FAIR principles for research data are preferred”.
“All disciplines should follow the geosciences and demand best practice for publishing and sharing data” (Stall et al., 2019).
“Grant makers, professional organisations, research journals, publishers, and other entities in the research field increasingly stress the ethics as well as societal and practical benefits of data sharing, and require researchers to do so within a reasonable time after data collection ends.” (Dijkers, 2019).
“Both researchers and the broader community stand to benefit from the knowledge produced through publicly funded research” (ARC open access policy). Data sharing is well connected with the concept of reproducibility.
Activity 1: Look at slides 2-11 on the motivation for neuroimaging reproducibility. What is your opinion about the statement “Data + Workflow specification + Execution environment = Results”?
Activity 2: (Infographic) Research data may be discovered (findable) and shared (accessible) in many ways. Start by looking at some data sharing trends across countries and research disciplines. Consider your own current data sharing practices, and those of your project team(s). How FAIR are they?
Activity 3: How can data be shared and discovered? Think about open, mediated and restricted access data repositories. What examples of these types of repositories are you aware of? Discuss your answers with others.
How to walk towards FAIR?
Imagine if you were able to obtain extra datasets for your existing research project, or start a new project reusing publicly available datasets. You can do this by exploring the following resources.
Neurosciences: data repositories recommended by the journal Scientific Data that accept human-derived data. In addition, NeuroMorpho.org and G-Node also accept data from other organisms. Please note that human-subject data submitted to OpenNeuro must be de-identified, while the Functional Connectomes Project International Neuroimaging Data-Sharing Initiative (FCP/INDI) can handle sensitive patient data.
Data registries and catalogues:
- re3data.org, a registry of some 2000 data repositories.
- Research Data Australia.
- FAIRsharing.org, which offers a catalogue of databases described according to the BioDBcore guidelines.
- OpenAIRE content provider, European Open Science Cloud, Google Public Data and Google Dataset Search.
- Open Knowledge Maps, for open access publications.
This and the previous section intend to show that it is becoming more common for funding agencies and publishers to require research data to be made accessible via appropriate repositories. This list is a starting point for you to find out what data already exist in your research area. If you want to share your data, or find data relevant to your research, take a detailed look at the examples provided; most, if not all, will have guides on how to share data.
Activity 1: (Find a repository) Go to: https://fairsharing.org/biodbcore/?q=imaging and browse or search to find repositories relevant to your research. Try for example, searching on “neuroimaging”. Explore at least one repository you find. How well does it support the FAIR data principles? Tip: look for things such as persistent identifiers, clear descriptions, licence information, download options, file formats.
A clarification: FAIR data is not necessarily “open” data. There are some good reasons why some data should not be open, for example to protect intellectual property, commercialisation, national security, personal privacy or endangered species. However, it may still be possible to provide mediated access to such data, or to publish a description of the data so that others can discover its existence. To align with the FAIR principles, your “research data should be as open as possible, as closed as necessary”.
The FAIR principles encourage us to disseminate data as widely as possible, in the most effective manner and at the earliest opportunity. This statement takes into account any restrictions relating to privacy, confidentiality, intellectual property, embargo period, or cultural sensitivities, that need to be addressed, discussed and clarified before sharing any data. In the planning phase of a research project, researchers need to consider at least making project metadata publicly accessible.
If you need examples and more information, check the OpenAIRE sensitive data guide, the ANDS guide to publishing and sharing sensitive data, and the Earth Science Information Partners (ESIP) tutorial on handling sensitive data. The Australian Bureau of Statistics (ABS) describes the application of the Five Safes framework, and Table 2 provides examples at different levels of accessibility.
Activity 1: Read “Promoting FAIR principles in the healthcare field” by the Digital Curation Centre (DCC), January 2019. Highlights: patient data are sensitive, which raises additional concerns such as security and the anonymisation of data subjects; although these are not the primary concern from a technical perspective, they are a major component to be considered. For more information visit FAIR4health.eu.
Activity 2: Think about when and how people can share data along the research cycle. Keep in mind that it is strongly recommended to release metadata (a description) of the project to comply with the FAIR principles, even if you cannot share the data itself. Institutional or domain-specific repositories should be able to store the metadata of your project and then link that information via registries (see the previous section).
De-identification / Anonymisation
Sharing sensitive data should seek to minimise the risk of exposing confidential information. Sometimes restrictions on sharing can be resolved by de-identification or anonymisation of the data. Anonymisation is sometimes used interchangeably with de-identification; ANDS clarifies the difference between these terms.
Activity 3: Look at The Future of Privacy Forum’s visual guide to practical data de-identification.
Optional extra information: 1. Open de-identification tools by Open Brain Consent (Halchenko, Y. et al., 2018). 2. The Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule establishes national standards (US) to protect individuals’ medical records and personal health information, and provides guidance about methods for de-identification. 3. Anonymization of DICOM Electronic Medical Records (Newhauser et al., 2015).
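As a rough illustration of the de-identification idea described above, the following sketch removes direct identifiers from a metadata record and replaces the subject identifier with a keyed hash. The field names, the secret key and the example record are all hypothetical. Note that a keyed hash produces a pseudonym rather than a fully anonymised record, and real imaging data (for example DICOM headers, or facial features in head scans) requires dedicated tools and advice from your ethics or data governance office.

```python
import hashlib
import hmac

# Hypothetical set of fields treated as direct identifiers in this sketch.
DIRECT_IDENTIFIERS = {"patient_name", "date_of_birth", "address"}


def pseudonymise(subject_id: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for subject_id using a keyed SHA-256 hash."""
    digest = hmac.new(secret_key, subject_id.encode("utf-8"), hashlib.sha256)
    return "sub-" + digest.hexdigest()[:12]


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers and replace the subject ID with a pseudonym."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["subject_id"] = pseudonymise(record["subject_id"], secret_key)
    return cleaned


# Hypothetical record; the key must be stored separately from the shared data.
example_record = {
    "subject_id": "hospital-000123",
    "patient_name": "Jane Example",
    "date_of_birth": "1980-01-01",
    "modality": "MRI",
}
print(deidentify(example_record, b"keep-this-key-private"))
```

The same subject always maps to the same pseudonym as long as the key is unchanged, which keeps longitudinal scans linkable without exposing the original identifier.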
Identifiers are essential to human-machine interoperation. Assigning globally unique persistent identifiers “is arguably the most important FAIR principle, because it will be hard to achieve other aspects of FAIR without them” (F1). Persistent identifiers (PIDs) help find and collect data accurately, and enable proper citation and the collection of citation metrics about the use of a dataset, article or data generator (e.g. instrument, software, workflow). For researchers, persistent identifiers enable the disambiguation of people and the linking of existing works.
For digital objects (files, datasets, publications, software, etc.):
Disclaimer: there is a wide range of PIDs available; we cite only two examples for each type.
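To make the idea of machine-actionable identifiers more concrete, here is a minimal sketch that handles two widely used PIDs: a DOI for a digital object and an ORCID iD for a person. The function names are illustrative. The DOI pattern is a simplified assumption rather than the full DOI syntax, and the ORCID check follows the ISO 7064 MOD 11-2 check digit that ORCID documents for its identifiers. The DOI used is the one given for this guide, and the ORCID iD is the sample identifier used in ORCID's own documentation.

```python
import re
from urllib.parse import quote

# Simplified DOI shape (an assumption): "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into a resolvable https://doi.org/ link."""
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"Not a DOI-like string: {doi!r}")
    return "https://doi.org/" + quote(doi, safe="/")


def orcid_is_valid(orcid: str) -> bool:
    """Check the final character of an ORCID iD (ISO 7064 MOD 11-2)."""
    chars = orcid.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", chars):
        return False
    total = 0
    for digit in chars[:15]:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return chars[15] == expected


print(doi_to_url("10.17605/OSF.IO/ZKJ4R"))    # https://doi.org/10.17605/OSF.IO/ZKJ4R
print(orcid_is_valid("0000-0002-1825-0097"))  # True
print(orcid_is_valid("0000-0002-1825-0098"))  # False
```

A check like this only confirms that an identifier is well formed; it does not confirm that the identifier is registered or points to the object you expect.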
Activity 1: OpenAIRE/FREYA/ORCID guide for researchers “How can identifiers improve the dissemination of your research outputs?”
Activity 3: (Discuss in pairs, 5 min) The Joint Declaration of Data Citation Principles from FORCE11 https://www.force11.org/datacitationprinciples
“Metadata (information about data) provides means for discovering data objects as well as providing other useful information about the data objects such as experimental parameters, creation conditions, etc.” (Rajasekar & Moore, 2001).
Why is building and using metadata relevant? Because it supports the discovery, understanding and organisation of research data across different communities.
Some aspects of metadata are worth keeping in mind whether you produce, read or reuse it. Creating, using and reusing metadata emphasises the need for a standard vocabulary, so that it can be properly interpreted by either humans or software. Hence, metadata items need to be precisely defined. A defined list of agreed terms constitutes a controlled vocabulary, which is usually led by a user community. Controlled vocabularies help data integration when, for example, ambiguities may exist in the terms used in different datasets and across different repositories. If the data are to be reused outside this community, additional information may be required.
Controlled vocabularies are part of a model called an ontology. An ontology has controlled vocabularies and the glue to link the terms, providing an effective means whereby human and electronic agents can communicate unambiguously about concepts. This connects to the FAIR Interoperability principle I1. The goal of making data interoperable is to enable members of disparate communities to reuse and understand digital information over time.
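The role of a controlled vocabulary in data integration can be sketched in a few lines. The vocabulary below is hypothetical and deliberately tiny; a real project would draw its preferred terms and synonyms from a community-maintained vocabulary or ontology rather than defining them locally.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted synonyms.
CONTROLLED_VOCABULARY = {
    "magnetic resonance imaging": {"mri", "mr imaging"},
    "confocal microscopy": {"confocal", "clsm"},
    "computed tomography": {"ct", "cat scan"},
}


def build_lookup(vocabulary: dict) -> dict:
    """Map every synonym (and each preferred term itself) to the preferred term."""
    lookup = {}
    for preferred, synonyms in vocabulary.items():
        lookup[preferred.lower()] = preferred
        for synonym in synonyms:
            lookup[synonym.lower()] = preferred
    return lookup


def normalise(term: str, lookup: dict):
    """Return the preferred term, or None when the term is not in the vocabulary."""
    return lookup.get(term.strip().lower())


LOOKUP = build_lookup(CONTROLLED_VOCABULARY)

# Two datasets describing the same modality with different free-text labels.
print(normalise("MRI", LOOKUP))          # magnetic resonance imaging
print(normalise(" MR imaging ", LOOKUP)) # magnetic resonance imaging
print(normalise("ultrasound", LOOKUP))   # None
```

Returning None for unknown terms, instead of guessing, makes gaps in the vocabulary visible so that they can be raised with the community that maintains it.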
Metadata for imaging should include a standard terminology and tools for describing physiological, clinical, demographic and genetic changes. The main recommendation is to share metadata per project whenever possible, even if the data is not yet available. Remember that metadata can be stored in general purpose repositories.
We can group metadata into two types: automatically created metadata and manually created metadata.
a. Why ontologies?
“By expressing image annotation in machine computable form as a formal ontology, human knowledge can be brought to bear on effective search and interpretation of image data, especially across multiple disciplines, scales, and modalities” (Eliceiri et al. 2012). Keep in mind that if privacy is an issue, any (meta)data can be listed under embargo.
Implementation, adoption and harvesting of metadata require defined ontologies. Due to the increased demand for quantitative analysis and for robust curation and sharing of image data, the need for full ontologies and annotations is growing.
More examples: “Ontologies for Neuroscience” (Larson & Martone, 2009) describes three domain-specific ontologies and how they build on top of each other. The authors also note that these ontologies were built from existing domain-specific vocabularies with the help of the Open Biological Ontologies (OBO) community (Smith et al., 2007). For example, a subset of OBO is the EDAM Ontology for bio-imaging (Kalaš et al., 2019). The Neuroscience Information Framework has developed a comprehensive vocabulary, the NIF Standard ontology (NIFSTD), for annotating and searching neuroscience resources. Plant et al., 2011 provide an overview of what is needed to implement metadata that follows domain-specific ontologies, using microscopy cell image data as an example. The BioPortal of the National Center for Biomedical Ontology (NCBO) provides access to more than 270 biomedical ontologies and controlled terminologies (Musen et al., 2012), and includes some of those cited before. The Ontology for Biomedical Investigations (OBI; Bandrowski et al., 2016; obi-ontology.org) enables communication between existing ontologies.
b. Controlled vocabularies
Domain-specific controlled vocabularies form a wider landscape than ontologies, too wide to cover here, so some more generic vocabulary examples are given. Schema.org is widely used to build controlled vocabularies; a more specific example is bioschemas.org, a collection of specifications that provide guidelines to facilitate a more consistent adoption of schema.org within the life sciences. Research Vocabularies Australia is a public database of controlled vocabularies; at the time of writing this guide, no specific bioimaging vocabularies were found. Maybe that is something you can help with?
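As an illustration of how schema.org terms can describe an imaging dataset, the sketch below builds a small JSON-LD record of type Dataset. All values are hypothetical placeholders, including the DOI; the ORCID iD is the sample identifier from ORCID's documentation. The properties used here are general schema.org terms, and a Bioschemas profile may recommend or require a different set, so treat this as a starting point only.

```python
import json


def dataset_jsonld(name, description, doi_url, license_url,
                   creator_name, creator_orcid, keywords, technique):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": doi_url,
        "license": license_url,
        "creator": {
            "@type": "Person",
            "name": creator_name,
            "@id": f"https://orcid.org/{creator_orcid}",
        },
        "keywords": keywords,
        "measurementTechnique": technique,
    }


# Hypothetical example values; replace them with your own project details.
record = dataset_jsonld(
    name="Example 7T brain MRI dataset",
    description="De-identified structural scans from a hypothetical study.",
    doi_url="https://doi.org/10.0000/example-dataset",  # placeholder DOI
    license_url="https://creativecommons.org/licenses/by/4.0/",
    creator_name="Jane Example",
    creator_orcid="0000-0002-1825-0097",
    keywords=["neuroimaging", "MRI", "FAIR"],
    technique="magnetic resonance imaging",
)
print(json.dumps(record, indent=2))
```

Embedding a record like this in a dataset landing page is one way of making the description readable by both people and harvesting software.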
c. Storing and publishing metadata
Where to store and publish metadata? The short answer is that it depends on which institution you are from. Your library, research officer or data steward are the best sources of information. Keep in mind FAIR principle A2, “Metadata are accessible, even when the data are no longer available”, which reinforces the need to share at least the metadata. Some options are:
- The repositories listed in the section “Reusable data repositories for the image community”, as a first place to look.
- FAIRsharing.org databases for imaging, for a broad view.
- The ARDC Research Data Australia (RDA), which harvests institutional repositories and can therefore act as a generic repository.
- The CSIRO Data Access Portal (for projects related to CSIRO).
- The DataCite Metadata Store, which allows users to register DataCite DOIs and associated metadata.
- Zenodo, which provides a DOI and versioning capabilities.
Activity 1: (Discuss in pairs) Have a look at the metadata stored at Research Data Australia for the 7T Magnetom instrument. It contains simple but important public metadata and a PID.
Activity 2: Where to store metadata? from ARDC.
Here is a suggested list of data best practices to apply to your research outputs. These will improve the reusability of data and software by others, including yourself in the future. Remember, making data publicly available for others to reuse is the goal, but not all data must be shared with everyone. Adding terms and conditions of accessibility is an option to consider. To share data, you can make use of the public infrastructures already mentioned (section “4. Reusable data repositories”) or use the data repository provided by your institution. To get started, there are a few things you should keep in mind.
a. Provenance - Provenance is usually a manually produced metadata file (it can also be produced automatically). It is important for the future reuse of data and should contain descriptors such as the data producer, date history (log of changes) and a data dictionary. Primary data ought to be read-only.
b. File formats - Most file formats are defined by the data producer (e.g. instrument or software). Whenever possible, you should try to convert data to formats that are publicly accessible.
- DICOM (Digital Imaging and Communications in Medicine): mostly used in neurosciences; it can be converted to NIfTI (Neuroimaging Informatics Technology Initiative) or BIDS format.
- Bio-Formats.
- HDF5 (Hierarchical Data Format version 5): an open source file format that supports large, complex, heterogeneous data; used by MINC and Huygens Software.
- TIFF: extensively used in microscopy.
c. Data structures - Keep consistent file and folder naming conventions across linked projects.
d. Data curation - This should be included in your data quality workflow as part of the process; ideally it will be automated.
f. Containerisation - For data processing pipelines, e.g. Singularity or Docker, or use virtual environments such as the Characterisation Virtual Lab.
g. Protocols - Search for publicly shared imaging protocols. For example, Protocol Exchange from Nature Protocols is an open resource where the community of scientists pool their experimental know-how to help accelerate research.
h. Create documentation - A README file helps ensure that your data can be correctly interpreted and reanalysed by others. For example, the DataDryad README is an example of minimum documentation.
i. Benchmarks or checksums
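Two of the practices above, provenance and checksums, can be combined in one small manifest file. The sketch below is only one possible layout: the manifest fields (producer, created, files) and the example file name are assumptions rather than a community standard. It records a SHA-256 checksum for every file in a data folder and later reports files that have gone missing or changed.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path, producer: str) -> dict:
    """Record who produced the data, when, and a checksum for every file."""
    files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    return {
        "producer": producer,
        "created": datetime.now(timezone.utc).isoformat(),
        "files": {p.relative_to(data_dir).as_posix(): sha256_of(p) for p in files},
    }


def changed_files(data_dir: Path, manifest: dict) -> list:
    """Return recorded files that are missing or no longer match their checksum."""
    changed = []
    for name, recorded in manifest["files"].items():
        path = data_dir / name
        if not path.is_file() or sha256_of(path) != recorded:
            changed.append(name)
    return changed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)
        # Hypothetical primary data file, created only for this demonstration.
        (data_dir / "sub-01_T1w.nii").write_bytes(b"example image bytes")
        manifest = build_manifest(data_dir, producer="Example Imaging Facility")
        print(json.dumps(manifest, indent=2))
        print(changed_files(data_dir, manifest))  # prints [] as nothing changed
```

Storing the manifest outside the data folder, or alongside the metadata you publish, lets anyone who downloads the data verify that their copy matches the primary data.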
Activity 1: A brain imaging case study that provides direct evidence of the impact of open sharing on data use and resulting publications over a seven-year period (2010-2017). “We dispel the myth that scientific findings using shared data cannot be published in high-impact journals and demonstrate rapid growth in the publication of such journal articles” (Milham, M. P. et al., 2018).
Activity 2 (Discussion + Action): What Can You Do?
- Contribute your data – Previously published datasets.
- Release some or all of the project metadata – your call, as a simple rule, the more the better!
- Curate existing datasets to make available in the future - you set the upload schedule.
- Contribute your scripts/code
- Have discussions with your team members about licensing and sharing.
- Create a data management plan.
Activity 3: Go through the questions from the Horizon2020 guide to create a FAIR Data Management Plan and see if you can already answer many of them.
Recommended extra reading: Best Practices in Data Analysis and Sharing in Neuroimaging using MRI, Ten Simple Rules for Creating a Good Data Management Plan, Ten Simple Rules for Reproducible Computational Research and Ten principles for machine-actionable data management plans. These papers will help you connect all the concepts that you have learned so far.
Licensing your work or research outputs to be open access (research output here means data, metadata, code and workflows) allows you, as author or contributor, to enable reuse and appropriate attribution of the work. If there is no license attached to your work, you are actually preventing anyone from legally reusing it. Did you know that no license means no permissions? Also, if you find research outputs that you want to reuse, you should only reuse them according to their license.
Be aware that you have the right to choose the license that best suits your purpose. There are many different licenses, and versions of them, that can be applied to data and software. Some licenses are applicable only in certain countries, so think of applying an international license. Be aware that the data repository that you use might ask you to accept its “terms and conditions”, which affect how you might use or share data by expanding, modifying or limiting the intended purpose of your own license. You can have multiple licenses for different purposes or different audiences. Finally, not every part of your work or research outputs needs to be publicly available or licensed, but the more you share, the better.
Activity 1: “What if you don’t choose a license?” explains the consequences and gives you a few reasons to think about licensing your work. If you are interested in reading about the GitHub terms and conditions, take 5 extra minutes.
Activity 2: (Flowcharts as a survey) The ARDC has guides about licensing for three specific scenarios: a) Data creator flowchart, b) Data supplier flowchart and c) Data users flowchart. If you want to know more about licensing and copyright for data reuse, visit the ANDS (now joined into ARDC) page.
A few types of licenses:
- Creative Commons (CC) licenses are very easy to apply and are broadly reused. CC is strongly promoted in the United States, but it is an internationally recognised license creator. CC is good for: a) very simple, factual datasets; b) data to be used automatically. Watch out for the version in use (version 4 or later is recommended), for attribution stacking, and for the Non Commercial (NC), Share Alike (SA) and No Derivatives (ND) conditions: NC should only be used with dual licensing, SA reduces interoperability and ND severely restricts reuse. To help you decide, use https://creativecommons.org/choose/.
- Copyleft is a general method for making a program (or other work) free (in the sense of freedom, not “zero price”), and requiring all modified and extended versions of the program to be free as well.
- Open Data Commons also provides licenses specifically for open data, good for most databases and datasets, e.g. the Open Data Commons Open Database Licence (ODC-ODbL) or the Open Data Commons Attribution License (ODC-By).
- Licenses specific to software include the Mozilla Public Licence (MPL), the MIT Licence and the GNU General Public Licence (GPL); see also the list of open source licenses by category. To help you choose a license for software, look at the descriptions at https://choosealicense.com/.
Acknowledgement: most of the licenses cited in this section were first mentioned in License Research Data from the Digital Curation Centre (DCC).
Citation analysis and citation metrics are important to the academic community because they give recognition to researchers and their work. Data citation continues the tradition of acknowledging other people’s work and ideas. It also helps make research data more findable and accessible. It is now common practice for authors to formally cite the research datasets and associated software that underpin their research findings.
Activity 1: (Video, 12 mins) Responsible Data Use: Citation and Credit.
Activity 2: How to cite data and software? This example from Dryad clearly shows how to cite the dataset that underpins a journal article as well as the article itself. Note that both citations include a Digital Object Identifier (DOI).
Activity 3: What to cite and why? See the ARDC guidance on citing data and software for more information.
We acknowledge Chris Erdmann for reviewing the first version of this document, and Jose Manzano Patron for adding important resources to the third version of this document.
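A dataset citation can be assembled from the same metadata that a repository already holds. The sketch below loosely follows the common “Creator (Year). Title. Version. Publisher. Identifier” pattern; the exact order and punctuation are an assumption, so check the style required by your journal or repository. The dataset, authors and DOI shown are hypothetical.

```python
def format_data_citation(creators, year, title, publisher, doi, version=None):
    """Assemble a simple dataset citation string that ends with a DOI link."""
    authors = "; ".join(creators)
    parts = [f"{authors} ({year}). {title}."]
    if version:
        parts.append(f"Version {version}.")
    parts.append(f"{publisher}.")
    parts.append(f"https://doi.org/{doi}")
    return " ".join(parts)


# Hypothetical dataset details, used only to show the output format.
citation = format_data_citation(
    creators=["Doe, J.", "Roe, R."],
    year=2019,
    title="Example brain MRI dataset",
    publisher="Example Repository",
    doi="10.0000/example",  # placeholder DOI
    version="1.0",
)
print(citation)
# Doe, J.; Roe, R. (2019). Example brain MRI dataset. Version 1.0. Example Repository. https://doi.org/10.0000/example
```

Including the version and the DOI makes it clear exactly which release of the dataset was used, which matters when a dataset is updated after publication.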
This document is also available via the Open Science Framework as a pre-print and it is citable with the following DOI 10.17605/OSF.IO/ZKJ4R where versions of it in docx and odt have been saved.