Beyond Open and Closed: A Proposed Tiered Classification Framework for Biological Datasets


Key Takeaways

  • The governance of biological data is a critical and unresolved challenge in science policy.
  • The traditional binary 'open'/'closed' distinction for biological data is inadequate.
  • A tiered classification framework offers a more nuanced approach to data access based on sensitivity and dual-use potential.
  • This framework balances scientific openness with biosecurity imperatives more effectively.
  • Tiers range from 'Unrestricted' for low-risk data to 'Restricted Access' for high-risk, dual-use datasets.
  • Implementing such a system has significant implications for research, biosecurity, and global equity.

The governance of biological data has never been more consequential — or more contested. The exponential growth of genomic, proteomic, metabolomic, and ecological datasets over the past two decades has created a scientific commons of extraordinary richness, enabling discoveries that would have been impossible in any previous era. At the same time, the same data that powers vaccine development, biodiversity monitoring, and personalised medicine can, in the wrong hands or the wrong context, enable the identification of pathogen vulnerabilities, the reconstruction of dangerous organisms, or the targeting of specific populations based on their genetic profiles.

The dominant framework for managing this tension — a binary distinction between "open" data and "restricted" data — is increasingly inadequate. It fails to capture the enormous variation in sensitivity across biological datasets, it creates perverse incentives that either over-restrict valuable scientific resources or under-protect genuinely dangerous information, and it provides no principled basis for resolving the inevitable conflicts between scientific openness and biosecurity imperatives. A more sophisticated approach is needed.

The Case for Tiered Classification

The concept of tiered data classification is not new — it has been applied to national security information, personal health data, and financial records for decades. What is new is its systematic application to biological datasets, driven by the recognition that the dual-use potential of biological information exists on a spectrum rather than a binary. A genome sequence of a non-pathogenic soil bacterium poses fundamentally different risks from a complete genome sequence of a select agent pathogen with annotated virulence factors. A population-level SNP dataset from a public health study poses different risks from a dataset linking genetic variants to immune evasion mechanisms in a novel virus.

A tiered classification framework for biological datasets would assign each dataset to one of several tiers based on a structured assessment of its sensitivity profile. The number and definition of tiers is a matter of ongoing debate, but a representative framework might include the following structure:

| Tier | Classification | Access Model | Example Dataset Types |
|------|----------------|--------------|-----------------------|
| 1 | Unrestricted | Fully open, no registration required | Environmental metagenomes, non-pathogenic organism genomes, published phenotypic data |
| 2 | Open with Attribution | Open access with mandatory registration and use declaration | Human population genomics (anonymised), agricultural pathogen surveillance data |
| 3 | Controlled Access | Application-based access with institutional review | Human clinical genomics, pathogen genome sequences with virulence annotations |
| 4 | Restricted Access | Access limited to vetted institutions with biosafety oversight | Select agent genomics, gain-of-function research datasets, dual-use protein structure data |
| 5 | Classified | Access restricted to authorised government/defence entities | Weaponisation-relevant biological information, classified biodefence research |
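To make the tier structure concrete, the table above could be encoded as a simple data model. This is an illustrative sketch only: the class names, the `requires_review` rule, and the example dataset are my assumptions, not part of any existing repository's schema.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Tiers from the table above; a higher value means more restricted."""
    UNRESTRICTED = 1
    OPEN_WITH_ATTRIBUTION = 2
    CONTROLLED_ACCESS = 3
    RESTRICTED_ACCESS = 4
    CLASSIFIED = 5


# Access model wording taken directly from the table above.
ACCESS_MODELS = {
    Tier.UNRESTRICTED: "Fully open, no registration required",
    Tier.OPEN_WITH_ATTRIBUTION: "Open access with mandatory registration and use declaration",
    Tier.CONTROLLED_ACCESS: "Application-based access with institutional review",
    Tier.RESTRICTED_ACCESS: "Access limited to vetted institutions with biosafety oversight",
    Tier.CLASSIFIED: "Access restricted to authorised government/defence entities",
}


@dataclass
class Dataset:
    name: str
    tier: Tier

    def requires_human_review(self) -> bool:
        # Tiers 3 and above all involve some form of application,
        # institutional review, or vetting before access is granted.
        return self.tier >= Tier.CONTROLLED_ACCESS


soil = Dataset("soil_metagenome_2023", Tier.UNRESTRICTED)
print(soil.requires_human_review())  # False: fully open tier
```

Encoding the tiers as an ordered enum makes the monotonic structure explicit: any policy check that applies at tier N also applies at every tier above it.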

This framework is not merely theoretical. Elements of it are already being implemented, imperfectly and inconsistently, across different jurisdictions and data repositories. The NIH's Genomic Data Sharing Policy, the Global Initiative on Sharing All Influenza Data (GISAID), and the European Genome-phenome Archive (EGA) all embody different approaches to tiered access, but none of them operates within a coherent, internationally harmonised framework.

The Sensitivity Assessment Challenge

The most technically demanding aspect of any tiered classification system is the sensitivity assessment — the process by which a dataset is assigned to a tier. This assessment must consider multiple dimensions of risk simultaneously, including the intrinsic properties of the biological information (does it describe a dangerous pathogen? does it contain information that could be used to enhance transmissibility or lethality?), the context of collection (was it collected under conditions of informed consent? does it contain information that could identify individuals or populations?), and the potential for misuse (could the information be combined with other publicly available data to enable harm?).
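One minimal way to make this multi-dimensional assessment concrete is a scoring rubric that maps the three dimensions to a provisional tier. Everything below is an illustrative assumption, not an established standard: the 0-3 scales, the "highest dimension dominates" rule, and the compounding-risk bump are all invented for the sketch, and any real system would treat the output as decision support for a human review board.

```python
def provisional_tier(intrinsic: int, context: int, misuse: int) -> int:
    """Map three 0-3 risk scores to a provisional tier 1-5.

    intrinsic: hazard of the biological information itself
    context:   collection/consent and identifiability concerns
    misuse:    potential for harmful combination with other data

    Illustrative rule: the highest single dimension sets the baseline
    tier, with a one-tier bump when two or more dimensions are
    simultaneously elevated (score >= 2).
    """
    scores = (intrinsic, context, misuse)
    if any(not 0 <= s <= 3 for s in scores):
        raise ValueError("each score must be in the range 0..3")
    base = max(scores) + 1                  # scores 0..3 -> tiers 1..4
    if sum(s >= 2 for s in scores) >= 2:
        base += 1                           # compounding risk across dimensions
    return min(base, 5)


# Non-pathogenic environmental sample: low on every dimension.
print(provisional_tier(0, 0, 0))  # 1
# Annotated select-agent genome: high intrinsic and misuse risk.
print(provisional_tier(3, 1, 3))  # 5
```

The point of the sketch is structural, not numerical: because misuse potential depends on what else is publicly available, any such rubric has to be re-run as the surrounding data landscape changes, which is why the assessment cannot be a one-time event at deposit.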

Machine learning is increasingly being applied to automate elements of this assessment. Natural language processing models can scan dataset metadata and associated publications to flag datasets that describe dangerous pathogens, dual-use research of concern, or sensitive human subjects. Graph-based models can assess the information hazard potential of a dataset by mapping its relationships to other datasets and publications in the scientific literature. These tools are not yet mature enough to replace human expert review, but they are beginning to provide valuable decision support for data governance bodies.
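At its simplest, the NLP-based decision support described above reduces to screening dataset metadata against a watchlist of sensitive terms. A production system would use curated ontologies and trained classifiers rather than substring matching, so the sketch below is a caricature of the pipeline's shape; the term list and flag labels are invented for illustration.

```python
# Illustrative watchlist mapping trigger phrases to flag labels.
# A real system would draw on curated resources (e.g. select-agent
# lists, DURC policy categories) and trained text classifiers.
FLAG_TERMS = {
    "select agent": "restricted-pathogen",
    "gain-of-function": "dual-use-research",
    "virulence factor": "pathogen-annotation",
    "identifiable": "human-subjects",
}


def screen_metadata(text: str) -> list[str]:
    """Return sorted flag labels whose trigger phrases appear in the text."""
    lowered = text.lower()
    return sorted({label for term, label in FLAG_TERMS.items() if term in lowered})


meta = "Genome assembly of a select agent with annotated virulence factors."
print(screen_metadata(meta))  # ['pathogen-annotation', 'restricted-pathogen']
```

Flagged datasets would then be routed to human expert review rather than auto-classified, which matches the role the text assigns to these tools: decision support, not decision making.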

Equity, Access, and the Global South

Any tiered classification framework for biological data must grapple seriously with the equity implications of access restrictions. The history of biological data governance is not a neutral one: datasets collected from populations in the Global South have repeatedly been used to generate scientific value that accrues primarily to institutions in high-income countries, while the populations that contributed the data have had limited access to the resulting knowledge or the benefits it generates. Tiered access controls, if poorly designed, risk exacerbating this asymmetry by creating additional barriers that disproportionately affect researchers in low- and middle-income countries.

A well-designed tiered framework must therefore include explicit provisions for equitable access — mechanisms that ensure researchers from data-contributing communities have priority access to the data they helped generate, that capacity-building support is provided to enable compliance with access requirements, and that the governance bodies that make access decisions are representative of the full diversity of the global scientific community. The Nagoya Protocol on Access and Benefit-Sharing provides a partial model for these provisions in the context of genetic resources, but its application to digital sequence information remains contested and unresolved.

Toward an International Biological Data Governance Architecture

The development of a coherent, internationally harmonised tiered classification framework for biological datasets will require sustained engagement across multiple governance domains: the Convention on Biological Diversity, the Biological Weapons Convention, the World Health Organization's pandemic preparedness frameworks, and the emerging international governance discussions around AI and biotechnology. It will require the active participation of the scientific community, not just as technical advisors but as advocates for governance frameworks that serve the long-term interests of science and society.

The stakes are high. Biological data is the raw material of the biotechnology revolution — the substrate from which new medicines, new materials, new agricultural tools, and new biosecurity capabilities are being built. Getting its governance right is not a peripheral concern for science policy. It is one of the central challenges of the coming decade.
