Beyond Open and Closed: A Proposed Tiered Classification Framework for Biological Datasets


Key Takeaways

  • The governance of biological data is a critical and unresolved challenge in science policy.
  • The traditional binary 'open'/'closed' distinction for biological data is inadequate.
  • A tiered classification framework offers a more nuanced approach to data access based on sensitivity and dual-use potential.
  • This framework balances scientific openness with biosecurity imperatives more effectively.
  • Tiers range from 'Unrestricted' for low-risk data to 'Restricted Access' for high-risk, dual-use datasets.
  • Implementing such a system has significant implications for research, biosecurity, and global equity.

The governance of biological data has never been more consequential — or more contested. The exponential growth of genomic, proteomic, metabolomic, and ecological datasets over the past two decades has created a scientific commons of extraordinary richness, enabling discoveries that would have been impossible in any previous era. At the same time, the same data that powers vaccine development, biodiversity monitoring, and personalised medicine can, in the wrong hands or the wrong context, enable the identification of pathogen vulnerabilities, the reconstruction of dangerous organisms, or the targeting of specific populations based on their genetic profiles.

The dominant framework for managing this tension — a binary distinction between "open" data and "restricted" data — is increasingly inadequate. It fails to capture the enormous variation in sensitivity across biological datasets, it creates perverse incentives that either over-restrict valuable scientific resources or under-protect genuinely dangerous information, and it provides no principled basis for resolving the inevitable conflicts between scientific openness and biosecurity imperatives. A more sophisticated approach is needed.

The Case for Tiered Classification

The concept of tiered data classification is not new — it has been applied to national security information, personal health data, and financial records for decades. What is new is its systematic application to biological datasets, driven by the recognition that the dual-use potential of biological information exists on a spectrum rather than a binary. A genome sequence of a non-pathogenic soil bacterium poses fundamentally different risks from a complete genome sequence of a select agent pathogen with annotated virulence factors. A population-level SNP dataset from a public health study poses different risks from a dataset linking genetic variants to immune evasion mechanisms in a novel virus.

A tiered classification framework for biological datasets would assign each dataset to one of several tiers based on a structured assessment of its sensitivity profile. The number and definition of tiers is a matter of ongoing debate, but a representative framework might include the following structure:

| Tier | Classification | Access Model | Example Dataset Types |
|------|----------------|--------------|-----------------------|
| 1 | Unrestricted | Fully open, no registration required | Environmental metagenomes, non-pathogenic organism genomes, published phenotypic data |
| 2 | Open with Attribution | Open access with mandatory registration and use declaration | Human population genomics (anonymised), agricultural pathogen surveillance data |
| 3 | Controlled Access | Application-based access with institutional review | Human clinical genomics, pathogen genome sequences with virulence annotations |
| 4 | Restricted Access | Access limited to vetted institutions with biosafety oversight | Select agent genomics, gain-of-function research datasets, dual-use protein structure data |
| 5 | Classified | Access restricted to authorised government/defence entities | Weaponisation-relevant biological information, classified biodefence research |
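To make the tier structure concrete, the table above could be encoded as a simple data model. This is an illustrative sketch only: the class names, the `requires_review` rule, and the example dataset are my assumptions, not part of any existing repository's schema.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Tiers from the table above; a higher value means more restricted."""
    UNRESTRICTED = 1
    OPEN_WITH_ATTRIBUTION = 2
    CONTROLLED_ACCESS = 3
    RESTRICTED_ACCESS = 4
    CLASSIFIED = 5


# Access model wording taken directly from the table above.
ACCESS_MODELS = {
    Tier.UNRESTRICTED: "Fully open, no registration required",
    Tier.OPEN_WITH_ATTRIBUTION: "Open access with mandatory registration and use declaration",
    Tier.CONTROLLED_ACCESS: "Application-based access with institutional review",
    Tier.RESTRICTED_ACCESS: "Access limited to vetted institutions with biosafety oversight",
    Tier.CLASSIFIED: "Access restricted to authorised government/defence entities",
}


@dataclass
class Dataset:
    name: str
    tier: Tier

    def requires_human_review(self) -> bool:
        # Tiers 3 and above all involve some form of application,
        # institutional review, or vetting before access is granted.
        return self.tier >= Tier.CONTROLLED_ACCESS


soil = Dataset("soil_metagenome_2023", Tier.UNRESTRICTED)
print(soil.requires_human_review())  # False: fully open tier
```

Encoding the tiers as an ordered enum makes the monotonic structure explicit: any policy check that applies at tier N also applies at every tier above it.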

This framework is not merely theoretical. Elements of it are already being implemented, imperfectly and inconsistently, across different jurisdictions and data repositories. The NIH's Genomic Data Sharing Policy, the Global Initiative on Sharing All Influenza Data (GISAID), and the European Genome-phenome Archive (EGA) all embody different approaches to tiered access, but none of them operates within a coherent, internationally harmonised framework.

The Sensitivity Assessment Challenge

The most technically demanding aspect of any tiered classification system is the sensitivity assessment — the process by which a dataset is assigned to a tier. This assessment must consider multiple dimensions of risk simultaneously, including the intrinsic properties of the biological information (does it describe a dangerous pathogen? does it contain information that could be used to enhance transmissibility or lethality?), the context of collection (was it collected under conditions of informed consent? does it contain information that could identify individuals or populations?), and the potential for misuse (could the information be combined with other publicly available data to enable harm?).
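One minimal way to make this multi-dimensional assessment concrete is a scoring rubric that maps the three dimensions to a provisional tier. Everything below is an illustrative assumption, not an established standard: the 0-3 scales, the "highest dimension dominates" rule, and the compounding-risk bump are all invented for the sketch, and any real system would treat the output as decision support for a human review board.

```python
def provisional_tier(intrinsic: int, context: int, misuse: int) -> int:
    """Map three 0-3 risk scores to a provisional tier 1-5.

    intrinsic: hazard of the biological information itself
    context:   collection/consent and identifiability concerns
    misuse:    potential for harmful combination with other data

    Illustrative rule: the highest single dimension sets the baseline
    tier, with a one-tier bump when two or more dimensions are
    simultaneously elevated (score >= 2).
    """
    scores = (intrinsic, context, misuse)
    if any(not 0 <= s <= 3 for s in scores):
        raise ValueError("each score must be in the range 0..3")
    base = max(scores) + 1                  # scores 0..3 -> tiers 1..4
    if sum(s >= 2 for s in scores) >= 2:
        base += 1                           # compounding risk across dimensions
    return min(base, 5)


# Non-pathogenic environmental sample: low on every dimension.
print(provisional_tier(0, 0, 0))  # 1
# Annotated select-agent genome: high intrinsic and misuse risk.
print(provisional_tier(3, 1, 3))  # 5
```

The point of the sketch is structural, not numerical: because misuse potential depends on what else is publicly available, any such rubric has to be re-run as the surrounding data landscape changes, which is why the assessment cannot be a one-time event at deposit.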

Machine learning is increasingly being applied to automate elements of this assessment. Natural language processing models can scan dataset metadata and associated publications to flag datasets that describe dangerous pathogens, dual-use research of concern, or sensitive human subjects. Graph-based models can assess the information hazard potential of a dataset by mapping its relationships to other datasets and publications in the scientific literature. These tools are not yet mature enough to replace human expert review, but they are beginning to provide valuable decision support for data governance bodies.
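At its simplest, the NLP-based decision support described above reduces to screening dataset metadata against a watchlist of sensitive terms. A production system would use curated ontologies and trained classifiers rather than substring matching, so the sketch below is a caricature of the pipeline's shape; the term list and flag labels are invented for illustration.

```python
# Illustrative watchlist mapping trigger phrases to flag labels.
# A real system would draw on curated resources (e.g. select-agent
# lists, DURC policy categories) and trained text classifiers.
FLAG_TERMS = {
    "select agent": "restricted-pathogen",
    "gain-of-function": "dual-use-research",
    "virulence factor": "pathogen-annotation",
    "identifiable": "human-subjects",
}


def screen_metadata(text: str) -> list[str]:
    """Return sorted flag labels whose trigger phrases appear in the text."""
    lowered = text.lower()
    return sorted({label for term, label in FLAG_TERMS.items() if term in lowered})


meta = "Genome assembly of a select agent with annotated virulence factors."
print(screen_metadata(meta))  # ['pathogen-annotation', 'restricted-pathogen']
```

Flagged datasets would then be routed to human expert review rather than auto-classified, which matches the role the text assigns to these tools: decision support, not decision making.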

Equity, Access, and the Global South

Any tiered classification framework for biological data must grapple seriously with the equity implications of access restrictions. The history of biological data governance is not a neutral one: datasets collected from populations in the Global South have repeatedly been used to generate scientific value that accrues primarily to institutions in high-income countries, while the populations that contributed the data have had limited access to the resulting knowledge or the benefits it generates. Tiered access controls, if poorly designed, risk exacerbating this asymmetry by creating additional barriers that disproportionately affect researchers in low- and middle-income countries.

A well-designed tiered framework must therefore include explicit provisions for equitable access — mechanisms that ensure researchers from data-contributing communities have priority access to the data they helped generate, that capacity-building support is provided to enable compliance with access requirements, and that the governance bodies that make access decisions are representative of the full diversity of the global scientific community. The Nagoya Protocol on Access and Benefit-Sharing provides a partial model for these provisions in the context of genetic resources, but its application to digital sequence information remains contested and unresolved.

Toward an International Biological Data Governance Architecture

The development of a coherent, internationally harmonised tiered classification framework for biological datasets will require sustained engagement across multiple governance domains: the Convention on Biological Diversity, the Biological Weapons Convention, the World Health Organization's pandemic preparedness frameworks, and the emerging international governance discussions around AI and biotechnology. It will require the active participation of the scientific community, not just as technical advisors but as advocates for governance frameworks that serve the long-term interests of science and society.

The stakes are high. Biological data is the raw material of the biotechnology revolution — the substrate from which new medicines, new materials, new agricultural tools, and new biosecurity capabilities are being built. Getting its governance right is not a peripheral concern for science policy. It is one of the central challenges of the coming decade.
