The (Re)usable Data Project

Inspired by the efforts of scientists around the world and the game-changing efforts of projects like the Creative Commons, the Wikimedia Foundation, and the Free Software movement, we hope to engage the larger community in an open and fruitful discussion on issues concerning the use and reuse of scientific data, including the balance between openness and making ends meet in an increasingly competitive environment.

If you would like to join our efforts to highlight the use and reuse of data in the sciences, please feel free to contact us on our tracker or create a pull request against our repository.


Who we are

We are not lawyers and this is not legal advice: all institutions and groups have their own perspectives and counsel. We are a group of scientists, engineers, and specialists who are concerned about the use and reuse of increasingly interconnected, derived, and reprocessed data. We want to make sure that data-driven scientific endeavors can work with one another in meaningful ways without undue legal concerns.

The (Re)usable Data Project is meant to provide a resource that looks at some of the issues around the reuse of scientific data and to open a conversation about how to deal with them.

We also want to actively work with the community in considering our criteria and in making sure that our information about scientific data resources is up-to-date and correct. If you have any questions, concerns, or see any problems, please open a ticket on our GitHub tracker.


What this is

The initial driving concern of this project is the use and reuse of biological and biomedical data. However, this is a general problem in the scientific community and needs to be addressed directly.
For each resource, using our criteria, we attempt to objectively assign zero to five stars for how well we believe a resource's data may be built upon, edited, modified, and redistributed.
Roughly speaking:

  • 5 stars ★ ★ ★ ★ ★
    The license unambiguously allows the unfettered (re)use and redistribution of the data.
  • 4 stars ★ ★ ★ ★
    The license unambiguously allows (re)use and redistribution of the data under some terms.
  • 3 stars ★ ★ ★
    The license is clearly stated, unambiguous, and of a standard type, and has clear access, but has terms that may greatly impact the (re)use and redistribution of the data.
  • 2.5 stars or fewer ★ ★ ½ - ∅
    There are likely issues in definitively finding the license, ambiguities in the license that hamper further analysis, issues with clean data access, or terms that require legal advice.
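
The bands above can be sketched as a small lookup. This is only an illustrative sketch, not the project's grading code: the function name `describe_grade` is hypothetical, and the choice to place half-star grades such as 4.5 or 3.5 in the band below them is an assumption, since the list above does not say where they fall.

```python
def describe_grade(stars: float) -> str:
    """Map a star grade (0 to 5, in half-star steps) to its rough band.

    Assumption: half-star grades fall into the band below them
    (for example, 4.5 is treated like 4); the page does not specify this.
    """
    # Reject values outside 0..5 or not on a half-star step.
    if not 0 <= stars <= 5 or (stars * 2) != int(stars * 2):
        raise ValueError("stars must be between 0 and 5 in half-star steps")
    if stars >= 5:
        return "unfettered (re)use and redistribution"
    if stars >= 4:
        return "(re)use and redistribution under some terms"
    if stars >= 3:
        return "clear, standard license with terms that may greatly impact (re)use"
    return "likely issues with finding the license, ambiguity, access, or terms"


if __name__ == "__main__":
    for grade in (5, 4.5, 3, 2.5, 0):
        print(grade, "->", describe_grade(grade))
```

Keeping the thresholds in one function makes the assumption about half stars easy to change if the criteria pages define the bands differently.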

If you see any problems with our determinations or would like to make corrections or clarifications, please open a ticket for us on our issue tracker.


Our criteria

This is a short overview of the criteria that we use when evaluating a resource's data license for use and reuse. We have attempted to balance many needs (credit, mutability, commercialization, redistribution, etc.) and focused on trying to objectively see how licenses can interact across resources.

To learn more about how we look at resource data licenses, please see our criteria and license type pages.

  • Clearly stated
    A clearly stated, unambiguous, and hopefully standard license for data use is critical for any (re)use of data: if there is no license to be found, then rights are unclear and one needs to assume the default: all rights reserved.
  • Comprehensive and non-negotiated
    Data that is mixed under different licenses, only partially available, or must be in some way negotiated creates barriers to the (re)use of data.
  • Accessible
    Data must be accessible in a reasonable and transparent manner to be useful to the broader community.
  • Avoid restrictions on kinds of (re)use
    Data should be able to be copied, built upon, edited, and modified as freely as possible.
  • Avoid restrictions on who may (re)use
    Data should be available to as many people as possible for their (re)use.


Our sources data

Name, Tags, Grade, Description, License Info, License Issues
  • BGee (data)
    Tags: biomedical, x-species, expression data
    Grade: (none shown)
    Description: Bgee is a database to retrieve and compare gene expression patterns in multiple animal species, produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data).
    License info: unknown
    License issues:
      Criteria A.1.2: After some searching, we were unable to determine the license of the non-ontology portions of the BGee data. No explicit license statements were apparent, causing a cascade of criteria violations. In spite of this, access itself is clear, so one star is awarded accordingly for C.
  • BGee (ontology)
    Tags: biomedical, expression data, ontology
    Grade: ★ ★ ★ ★
    Description: Bgee is a database to retrieve and compare gene expression patterns in multiple animal species, produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data).
    License info: copyleft
    License issues:
      Criteria D.1.2: By using the GPLv3, there may be issues in mixing and redistributing this data with licenses that have incompatible terms.
  • Comparative Toxicogenomics Database (CTD)
    Tags: biology, x-species, disease-gene association
    Grade: ★ ★ ½
    Description: CTD promotes understanding about the effects of environmental chemicals on human health by integrating data from curated scientific literature to describe chemical interactions with genes and proteins, and associations between diseases and chemicals, and diseases and genes/proteins.
    License info: restrictive
    License issues:
      Criteria A.2: Custom license with interesting use restrictions.
      Criteria B.1: For quality control purposes, you must provide CTD with periodic access to your publication of our data.
      Criteria D.1.2: Given the four statements in the Additional Terms of Data Use, notably number 4, it looks like any downstream user would have to renegotiate with CTD.
      Criteria E.1.1: Without negotiation: "It is to be used only for research and educational purposes."
  • dbGaP (public)
    Tags: biology, human, genotype-phenotype
    Grade: (none shown)
    Description: The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype in humans. Provides authorized access to protected and raw data (e.g., Genotype-Tissue Expression (GTEx) project).
    License info: unknown
    License issues:
      Criteria A.1.1: Per the dbGaP data use certification, 'The terms and conditions of using dbGaP data vary by study'. All terms and conditions are to align with NIH GDS.
      Criteria C.1: Cannot access all the data.
      Criteria C.2: Access methods are not transparent.
  • ENCODE
    Tags: biology, genomic resource, genomic elements
    Grade: ★ ★ ★ ★ ½
    Description: The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.
    License info: permissive
    License issues:
      Criteria A.2: Easy and perfectly (re)usable, yet custom.
  • Fantom5
    Tags: biology, human, gene expression
    Grade: ★ ★ ★ ★ ★
    Description: We are complex multicellular organisms composed of ~400 distinct cell types. This diversity of cell types allows us to see, think, hear, fight infections, etc., yet all of this is encoded in the same genome. The difference between all these cells is what parts of the genome they use – for instance, neurons use different genes than muscle cells, and therefore they work very differently. In FANTOM5, we have systematically investigated exactly what are the sets of genes used in virtually all cell types across the human body, and the genomic regions which determine where the genes are read from. We aim to use this information to build transcriptional regulatory models for every primary cell type that makes up a human.
    License info: permissive
    License issues: none listed
  • FlyBase
    Tags: biology, MOD, genotype-phenotype association
    Grade: ★ ★ ½
    Description: FlyBase is the model organism database providing integrated genetic, genomic, phenomic, and biological data for Drosophila melanogaster.
    License info: copyright
    License issues:
      Criteria A.2: Copyright statement includes 'This publication may be copied for non-commercial, scientific uses by individuals or organizations (including for-profit organizations). FlyBase is freely distributed to the scientific community on the understanding that it will not be used for commercial gain by any organization. Any commercial use of this publication, or any parts thereof, is expressly prohibited without permission in writing from the FlyBase consortium.'
      Criteria B.1: Copyright statement includes 'Certain portions of FlyBase are copyrighted separately.'
      Criteria B.2.1: Copyright statement includes 'Certain portions of FlyBase are copyrighted separately.'
      Criteria D.1.1: As stated, the copyright may be interpreted by a non-legal professional to mean that the contents may be reused/remixed in a non-commercial context.
      Criteria E.1.1: As stated, the copyright may be interpreted by a non-legal professional to mean that the contents may be reused/remixed in a non-commercial context.
  • GTEx
    Tags: biology, human, gene expression
    Grade: ★ ★ ½
    Description: The Genotype-Tissue Expression (GTEx) project aims to provide to the scientific community a resource with which to study human gene expression and regulation and its relationship to genetic variation. This project will collect and analyze multiple human tissues from donors who are also densely genotyped, to assess genetic variation within their genomes. By analyzing global RNA expression within individual tissues and treating the expression levels of genes as quantitative traits, variations in gene expression that are highly correlated with genetic variation can be identified as expression quantitative trait loci, or eQTLs.
    License info: permissive
    License issues:
      Criteria A.2: Custom license based on the various datasets and NIH Genomic Data Sharing Policy.
      Criteria C.1: No API or URL to access all data groupings with a single action.
      Criteria C.2: No API or URL and therefore no reasonable and transparent access.
      Criteria D.1.1: As stated, the copyright may be interpreted by a non-legal professional to mean that the contents may be reused/remixed for research/scientific purposes.
      Criteria E.1.1: As stated, the copyright may be interpreted by a non-legal professional to mean that the contents may be reused/remixed for research/scientific purposes.
  • Human Phenotype Ontology (HPO)
    Tags: biology, human, disease-phenotype association
    Grade: ★ ★ ½
    Description: A curated database of human hereditary syndromes from OMIM, Orphanet, and DECIPHER mapped to classes of the human phenotype ontology. Various meta-attributes such as frequency, references and negations are associated with each annotation. These are presently limited to rare Mendelian diseases.
    License info: restrictive
    License issues:
      Criteria A.2: HPO is copyrighted to protect ontologies and all changes must be made by HPO developers.
      Criteria D.1.2: Restricted downstream use. May not be edited.
      Criteria E.1.2: Restricted downstream use translates to agents as well.
  • International Mouse Phenotyping Consortium (IMPC)
    Tags: biology, mouse, genotype-phenotype association
    Grade: (none shown)
    Description: The International Mouse Phenotyping Consortium (IMPC) is generating a knockout mouse strain for every protein coding gene by using the embryonic stem cell resource generated by the International Knockout Mouse Consortium (IKMC). Systematic broad-based phenotyping is performed by each IMPC center using standardized procedures found within the International Mouse Phenotyping Resource of Standardised Screens (IMPReSS) resource. Gene-to-phenotype associations are made by a versioned statistical analysis.
    License info: copyright
    License issues:
      Criteria A.1.2: Could not find licensing information in a reasonable location. Determined ARR by default.
      Criteria D.1.2: Restricted downstream use per ARR.
      Criteria E.1.2: All downstream agents restricted per ARR.
  • Kyoto Encyclopedia of Genes and Genomes (KEGG)
    Tags: biology, genomic resource, gene-pathway association, disease-gene association, orthology
    Grade: (none shown)
    Description: KEGG is an integrated database resource consisting of seventeen main databases including systems, genomic, chemical, and health information.
    License info: restrictive
    License issues:
      Criteria A.1.1: Has multiple licenses (which are kind of hard to find).
      Criteria B.1: Needs negotiation.
      Criteria D.1.2: Does not explicitly allow re-use, just use.
      Criteria E.1.1: Certain groups are forbidden, but academics can use the data.
  • Mouse Genome Informatics (MGI)
    Tags: biology, MOD, genotype-phenotype association, disease-model association, gene expression
    Grade: ★ ★ ★ ★
    Description: MGI is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease.
    License info: permissive
    License issues:
      Criteria A.2: Custom license.
      Criteria E.1.1: Distinguishes groups, allowing for research/academic use. Commercial groups can negotiate.
  • Mouse Phenome Database (MPD)
    Tags: biology, MOD, genotype (strain)-phenotype association
    Grade: ★ ★ ★ ★ ½
    Description: The Mouse Phenome Database is a collaborative standardized collection of measured data on laboratory mouse strains, and includes: baseline phenotype data sets; studies of drug, diet, disease and aging effect; protocols, projects, and publications; and SNP, variation and gene expression studies. MPD collects data for classical inbred strains, other fixed-genotype strains, derived lines and populations that are openly acquirable (strain panel examples). Strains can be from JAX-Mice or from any other vendor that's a recognized breeding source.
    License info: permissive
    License issues:
      Criteria A.2: Custom license, yet consistent.
  • MyGene.info
    Tags: biology, genomic resource, gene definition
    Grade: ★ ★
    Description: MyGene.info provides simple-to-use REST web services to query/retrieve gene annotation data.
    License info: copyright
    License issues:
      Criteria A.2: Custom license, but consistent.
      Criteria B.2.1: Scope is incomplete - they claim no responsibility for data from other sources on their site.
      Criteria D.1.2: No re-use allowed, just use.
      Criteria E.1.2: No re-use allowed, just use.
  • National Center for Biotechnology Information (Gene)
    Tags: biology, genomic resource, gene definition, taxon definition, gene-publication association
    Grade: ★ ½
    Description: Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide.
    License info: unknown
    License issues:
      Criteria A.2: The license apparently uses language to declare something similar to "public domain", but with the caveat that it may contain data that is otherwise.
      Criteria B.1: This is judged to be a violation as any (re)use would depend on negotiating with all upstream copyright holders, which are not presented.
      Criteria B.2.1: It is implied that their license does not cover all data.
      Criteria B.2.2: Could not find an explicit "clean" version of the data in the downloads.
      Criteria D.1.2: There is no apparent way to guarantee usage without possibly violating copyright.
      Criteria E.1.2: There is no apparent way to guarantee usage without possibly violating copyright.
  • Online Mendelian Inheritance in Animals (OMIA)
    Tags: biology, veterinary x-species, gene-disease association
    Grade: ★ ★ ★
    Description: Online Mendelian Inheritance in Animals (OMIA) is a catalogue/compendium of inherited disorders, other (single-locus) traits, and genes in 215 (non-model) animal species.
    License info: copyright
    License issues:
      Criteria D.1.2: The license seems to be a standard ARR, with no exception for any kind of bulk (re)use.
      Criteria E.1.2: The license seems to be a standard ARR, with no exception for any kind of bulk (re)use.
  • Orphanet portal for rare diseases and orphan drugs (open access subset)
    Tags: biomedical, human, disease-gene association, disease-phenotype association, disease classification, ontology
    Grade: ★ ★ ★
    Description: Orphanet provides reference information on rare diseases and orphan drugs to help improve the diagnosis, care and treatment of patients with rare diseases.
    License info: restrictive
    License issues:
      Criteria D.1.2: The CC-BY-ND license prevents derivation.
      Criteria E.1.2: The CC-BY-ND license prevents derivation.
  • Protein ANalysis THrough Evolutionary Relationships Classification System (PANTHER)
    Tags: biology, genomic resource, orthology
    Grade: ★ ★ ★
    Description: The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System was designed to classify proteins (and their genes) according to evolutionary family/subfamily, molecular function, biological process, and pathway. The PANTHER Classifications are the result of human curation as well as sophisticated bioinformatics algorithms.
    License info: copyright
    License issues:
      Criteria D.1.2: Given the all rights reserved copyright statement, any downstream reuse would require negotiation.
      Criteria E.1.2: Given the all rights reserved copyright statement, all user/agent types would need to negotiate downstream reuse.
  • Reactome
    Tags: biology, pathway, pathway data
    Grade: ★ ★ ★ ★ ½
    Description: Reactome is a free, open-source, curated and peer reviewed pathway database. Our goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic research, genome analysis, modeling, systems biology and education.
    License info: permissive
    License issues:
      Criteria B.2.2: KEGG gene and pathway annotations used to construct the Reactome Functional Interaction (FI) Network are not licensed under CC-BY-4.0. There is a comment that "the recipient may not distribute this data to other users without a license from Pathway Solutions, Inc."
  • WormBase
    Tags: biology, model organism genome sequences
    Grade: (none shown)
    Description: WormBase is an international consortium dedicated to providing the research community with accurate, current, accessible information concerning the genetics, genomics and biology of C. elegans and related nematodes.
    License info: unknown
    License issues:
      Criteria A.1.1: A single license is not provided; rather, data users are instructed that they are responsible for identifying and complying with licensing and copyright restrictions for each piece of information in the database.
  • Zebrafish Information Network (ZFIN)
    Tags: biology, model organism database
    Grade: ★ ★
    Description: The Zebrafish Information Resource is the community database resource for the laboratory use of zebrafish which develops and supports integrated zebrafish genetic, genomic and developmental information, maintains the definitive reference data sets of zebrafish research information toward facilitation of the use of zebrafish as a model for human biology.
    License info: restrictive
    License issues:
      Criteria A.2: Custom license with non-academic and non-research use restrictions.
      Criteria B.1: The license explicitly requires intervention for downstream reuse and redistribution.
      Criteria D.1.2: The license requires written permission for redistribution.
      Criteria E.1.2: The license requires written permission for redistribution even for academic and non-commercial parties.

Contact us

All copyrightable materials on this site are © 2017 the (Re)usable Data Project under the CC-BY 4.0 license.
ReusableData.org is funded by the National Center for Advancing Translational Sciences (NCATS) OT3 TR002019 as part of the Biomedical Data Translator project.
The (Re)usable Data Project would like to acknowledge the assistance of many more people than can be listed here. Please visit the about page for the full list.