The relationship between machine learning and biotechnology is not simply one of tool and application — it is a relationship of mutual transformation. The scale and complexity of biological data generated by modern high-throughput technologies have driven the development of new machine learning architectures and training paradigms specifically adapted to biological problems. In turn, those architectures have enabled biological discoveries that were not merely difficult but conceptually impossible with classical analytical methods. Understanding this relationship requires more than a catalogue of applications: it requires a methodological literacy — an understanding of how specific ML and DL approaches work, what assumptions they make, and what kinds of biological problems they are genuinely suited to address.
This post provides that methodological grounding, organised around the principal families of ML and DL methods and their specific roles in contemporary biotechnology.
## The Methodological Landscape: A Taxonomy
Before examining specific methods, it is useful to establish a taxonomic framework. Machine learning methods can be broadly organised along two axes: the nature of the learning signal (supervised, unsupervised, semi-supervised, or reinforcement learning) and the architectural complexity of the model (from linear models and decision trees through ensemble methods to deep neural networks). These axes are not independent — deep learning architectures are most commonly trained in a supervised or self-supervised fashion — but the distinction is useful for understanding why particular methods are chosen for particular problems.
| Method Family | Learning Paradigm | Key Biotechnology Applications | Representative Models |
|---|---|---|---|
| Linear/logistic regression | Supervised | GWAS effect estimation, biomarker discovery | Ridge, LASSO, ElasticNet |
| Decision trees / ensembles | Supervised | Drug toxicity prediction, phenotype classification | Random Forest, XGBoost, LightGBM |
| Support vector machines | Supervised | Protein subcellular localisation, splice site prediction | SVM-RBF, SVM-linear |
| Clustering algorithms | Unsupervised | Single-cell transcriptomics, metagenomics | k-means, DBSCAN, Leiden algorithm |
| Dimensionality reduction | Unsupervised | Visualisation, feature extraction | PCA, UMAP, t-SNE, autoencoders |
| Convolutional neural networks | Supervised / self-supervised | Genomic sequence analysis, microscopy image analysis | DeepBind, Basset, Cellpose |
| Recurrent neural networks / LSTMs | Supervised | Protein sequence modelling, time-series omics | DanQ, UniRep |
| Transformer / attention models | Self-supervised / supervised | Protein language models, genomic foundation models | ESM-2, DNABERT-2, Enformer, AlphaFold |
| Graph neural networks | Supervised / self-supervised | Drug-target interaction, metabolic network analysis | GCN, GAT, MPNN, DimeNet |
| Generative models (VAE, GAN, diffusion) | Unsupervised / self-supervised | De novo drug design, protein sequence generation | REINVENT, RFdiffusion, ProteinMPNN |
| Reinforcement learning | Reward-based | Drug optimisation, gene circuit design | REINFORCE, PPO |
| Bayesian methods | Probabilistic | Uncertainty quantification, active learning | Gaussian processes, Bayesian neural networks |
## Supervised Learning: The Workhorse of Predictive Biotechnology
Supervised learning — in which a model learns to map inputs to outputs from labelled training examples — remains the most widely deployed paradigm in biotechnology, for the straightforward reason that many of the most important biological questions are prediction problems: given this genomic sequence, what is the binding affinity of this transcription factor? Given this molecular structure, what is the probability of hepatotoxicity? Given this protein sequence, what is the subcellular localisation?
Gradient boosting ensembles (XGBoost, LightGBM, CatBoost) have become the default choice for tabular biological data — datasets in which each sample is described by a fixed set of features, such as gene expression matrices, molecular descriptors, or clinical variables. These methods build an ensemble of decision trees sequentially, with each tree trained to correct the residual errors of the previous ensemble. Their advantages for biological applications are substantial: they handle missing data gracefully, are relatively robust to feature scaling, provide feature importance estimates that support biological interpretation, and achieve competitive performance with far less training data than deep neural networks. In drug discovery, gradient boosting models trained on molecular fingerprints and physicochemical descriptors have been used for ADMET (absorption, distribution, metabolism, excretion, toxicity) prediction with performance that frequently matches or exceeds deep learning approaches on datasets of fewer than 10,000 compounds.
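To make the sequential residual-fitting idea concrete, the sketch below runs a toy gradient-boosting loop on a single feature, using depth-1 "stumps" and squared-error loss. The data and hyperparameters are invented for illustration; production libraries such as XGBoost and LightGBM add regularisation, histogram binning, and multi-feature trees on top of this same principle.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error on the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds=50, lr=0.3):
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)          # each tree corrects the ensemble
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

# Invented example: a binary phenotype that switches on above an
# expression level of roughly 1.0.
xs = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 2.0, 2.3]
ys = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
model = gradient_boost(xs, ys)
```

The learning rate deliberately shrinks each tree's contribution, so the ensemble approaches the target geometrically rather than overfitting in one step.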
Convolutional neural networks (CNNs) applied to one-dimensional biological sequences represent one of the most productive intersections of deep learning and genomics. The key insight is that biological sequences — DNA, RNA, and protein — share a structural property with images: local patterns (motifs, domains, secondary structure elements) are meaningful regardless of their position in the sequence, and hierarchical combinations of local patterns produce higher-order functional features. CNNs exploit this property through learnable convolutional filters that scan the sequence and detect patterns at multiple scales. DeepBind, one of the earliest successful applications of CNNs to genomics, learned sequence-specific binding preferences for hundreds of transcription factors and RNA-binding proteins directly from SELEX and ChIP-seq data, outperforming classical position weight matrix approaches by capturing non-linear dependencies between positions within binding sites.
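The filter-scanning idea can be shown in a few lines: one-hot encode a DNA sequence, then slide a single filter across it and record the match score at each offset. The filter weights here are hand-set to favour the motif "TATA" purely for illustration; a real CNN such as DeepBind learns hundreds of such filters jointly from data.

```python
ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-dimensional indicator vectors."""
    return [[1.0 if base == a else 0.0 for a in ALPHABET] for base in seq]

def conv1d(encoded, filt):
    """Valid 1-D convolution: dot product of the filter at each offset."""
    k = len(filt)
    scores = []
    for i in range(len(encoded) - k + 1):
        s = sum(encoded[i + j][c] * filt[j][c] for j in range(k) for c in range(4))
        scores.append(s)
    return scores

# Illustrative filter rewarding T, A, T, A at successive positions
# (position-weight-matrix style), not taken from any published model.
tata_filter = [
    [0.0, 0.0, 0.0, 1.0],  # T
    [1.0, 0.0, 0.0, 0.0],  # A
    [0.0, 0.0, 0.0, 1.0],  # T
    [1.0, 0.0, 0.0, 0.0],  # A
]

scores = conv1d(one_hot("GGCTATAAAG"), tata_filter)
best = max(range(len(scores)), key=scores.__getitem__)  # offset of the motif
```

Because the same filter is applied at every offset, the detector finds the motif wherever it occurs, which is exactly the positional invariance the paragraph above describes.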
Transformer architectures have become the dominant paradigm for sequence modelling in biotechnology, displacing recurrent neural networks for most applications due to their ability to model long-range dependencies without the vanishing gradient problems that limit LSTMs. The transformer's self-attention mechanism computes, for each position in a sequence, a weighted sum of representations from all other positions, with weights determined by the learned compatibility between position representations. This allows the model to capture dependencies between residues that are far apart in sequence but close in three-dimensional structure — a property that is critical for protein structure prediction and has been central to the success of AlphaFold 2.
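Scaled dot-product self-attention can be written out directly, for intuition: each position's new representation is a softmax-weighted average of every position's value vector. This sketch omits the learned query/key/value projections, multiple heads, and residual connections of a real transformer; the toy embeddings are invented.

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def self_attention(queries, keys, values):
    d = len(keys[0])
    out = []
    for q in queries:
        # Compatibility of this query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum over ALL positions' value vectors, near or far.
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Three toy residue embeddings; positions 0 and 2 are similar, so position 0
# attends strongly to position 2 regardless of sequence distance.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
out = self_attention(x, x, x)
```

The key property for biology is visible in the loop structure: attention weights depend only on representation similarity, never on sequence separation, which is why contacts between sequence-distant residues can be modelled directly.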
## Self-Supervised Learning and Biological Foundation Models
The most significant methodological development in biological machine learning over the past five years has been the emergence of large-scale self-supervised pre-training — the training of models on vast quantities of unlabelled biological sequence data using objectives that do not require human annotation. This approach, pioneered in natural language processing with BERT and GPT, has been adapted to biological sequences with transformative results.
The core insight is that the evolutionary record — the approximately 250 million protein sequences in UniProt, the billions of nucleotide sequences in GenBank — constitutes an implicit annotation of biological function. Sequences that have been conserved across evolutionary time must be encoding functional information; the statistical regularities of sequence variation across homologues encode the constraints imposed by structure and function. A model trained to predict masked residues in protein sequences — the masked language modelling objective used by ESM-2 — must, in order to do so accurately, learn representations that capture these functional constraints.
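The data-preparation side of that objective is simple to sketch: hide a random subset of residues and record what the model must recover. Only this setup step is shown; the predictive model itself is a large transformer and is out of scope here. The sequence and mask rate are illustrative.

```python
import random

MASK = "<mask>"

def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Mask a random subset of residues; return tokens and the recovery targets."""
    rng = random.Random(seed)
    tokens = list(seq)
    targets = {}  # position -> residue the model must predict
    for i, residue in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = residue
            tokens[i] = MASK
    return tokens, targets

# Arbitrary example sequence (not a real annotated protein).
tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

Training then minimises the cross-entropy between the model's predicted distribution at each masked position and the hidden residue, which forces the representations to internalise the evolutionary constraints described above.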
The resulting foundation models encode biological knowledge in high-dimensional representations that can be fine-tuned for specific downstream tasks with minimal labelled data. ESM-2, trained on clustered UniRef protein sequences at scales up to 15 billion parameters, produces residue-level representations that encode structural information with sufficient fidelity to enable zero-shot prediction of mutational effects, contact maps, and even three-dimensional coordinates. DNABERT-2, trained on multi-species genomic sequences, produces nucleotide-level representations that capture regulatory grammar across species. The Nucleotide Transformer, trained on thousands of genomes spanning the tree of life, enables fine-tuning for regulatory element prediction, splice site identification, and chromatin accessibility prediction with state-of-the-art performance on benchmarks.
## Deep Learning for Structural Biology: The AlphaFold Revolution
No development in biological machine learning has had a more immediate and transformative impact on the practice of science than AlphaFold 2, the deep learning system developed by DeepMind that predicts protein three-dimensional structure from amino acid sequence with accuracy approaching experimental methods. Understanding the methodological innovations that make AlphaFold 2 work is essential for appreciating both its power and its limitations.
AlphaFold 2 combines three key architectural innovations. First, it uses a multiple sequence alignment (MSA) of evolutionarily related sequences as input, encoding the co-evolutionary information that constrains the three-dimensional structure. Second, it processes this information through an Evoformer module — a stack of transformer blocks that jointly update the MSA representation and a pairwise representation of residue-residue relationships, allowing information to flow between the sequence and structure representations. Third, it uses an equivariant structure module that directly predicts the three-dimensional coordinates of each residue in a frame-invariant manner, ensuring that the predicted structure is consistent regardless of the orientation of the input.
The practical implications of AlphaFold 2 for biotechnology are profound. The AlphaFold Protein Structure Database now contains predicted structures for over 200 million proteins, providing structural hypotheses for proteins that have never been crystallised or characterised by cryo-EM. These structures are being used to identify drug binding sites, to understand the structural consequences of disease-associated mutations, and to design novel proteins with specified structural properties.
| Structural Biology Tool | Methodology | Key Capability | Limitation |
|---|---|---|---|
| AlphaFold 2 | Evoformer + equivariant structure module | Single-chain structure prediction from sequence | Limited accuracy for intrinsically disordered regions |
| AlphaFold-Multimer | Extended Evoformer for multi-chain inputs | Protein complex structure prediction | Accuracy decreases with complex size |
| RoseTTAFold | Three-track network (1D/2D/3D) | Simultaneous reasoning over sequence, distance, and coordinate representations | Generally lower accuracy than AlphaFold 2 |
| ESMFold | ESM-2 language model + folding head | Ultra-fast structure prediction (no MSA required) | Lower accuracy than AlphaFold 2 for divergent sequences |
| RFdiffusion | Diffusion model on protein backbone frames | De novo protein backbone design | Requires separate sequence design step |
| ProteinMPNN | Message-passing neural network | Sequence design for fixed backbone | Does not model backbone flexibility |
## Graph Neural Networks for Molecular and Network Biology
Graph neural networks (GNNs) have emerged as the natural architecture for biological problems in which the data has an inherent graph structure — molecular graphs, protein interaction networks, metabolic networks, and knowledge graphs. A GNN operates by iteratively updating the representation of each node in a graph by aggregating information from its neighbours, allowing information to propagate across the graph and enabling the model to learn representations that capture both local and global graph structure.
In drug discovery, GNNs operating on molecular graphs — in which atoms are nodes and bonds are edges — have achieved state-of-the-art performance on molecular property prediction tasks including solubility, binding affinity, and toxicity. The message-passing neural network (MPNN) framework, which generalises many GNN architectures, has been particularly influential: it defines a general class of models in which node representations are updated by aggregating messages from neighbouring nodes, with the messages computed by a learnable function of the sender and receiver representations and the edge features.
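One round of message passing can be traced by hand on a tiny molecular graph (ethanol's heavy atoms, C-C-O). In the spirit of the MPNN framework, each atom's representation is updated from an aggregate of its bonded neighbours' representations; real MPNNs use learned message and update networks, whereas here both are fixed (sum, then concatenation) purely for illustration.

```python
# Node features: [is_carbon, is_oxygen]; edges are undirected bonds.
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}  # C, C, O
bonds = [(0, 1), (1, 2)]

def neighbours(node):
    return [b for a, b in bonds if a == node] + [a for a, b in bonds if b == node]

def message_passing_round(feats):
    updated = {}
    for node, h in feats.items():
        # Aggregate: sum the neighbours' feature vectors (the "messages").
        agg = [0.0, 0.0]
        for nb in neighbours(node):
            agg = [a + x for a, x in zip(agg, feats[nb])]
        # Update: concatenate self features with the aggregated messages.
        updated[node] = h + agg
    return updated

h1 = message_passing_round(features)
# After one round, the central carbon (node 1) already "knows" it is bonded
# to one carbon and one oxygen; further rounds propagate information farther.
```

Stacking k such rounds gives each node a receptive field of its k-hop neighbourhood, which is how GNNs capture both local chemistry and longer-range graph context.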
In systems biology, GNNs operating on protein interaction networks and metabolic networks are being used to predict gene essentiality, identify synthetic lethal interactions, and model the propagation of perturbations through biological networks. The ability of GNNs to learn from the topology of biological networks — not just the properties of individual nodes — makes them particularly well-suited to problems in which the network context is biologically meaningful.
## Generative Models: Designing Novel Biological Molecules
The application of generative deep learning to biological molecule design represents one of the most exciting frontiers in computational biotechnology. Rather than predicting the properties of existing molecules, generative models learn the distribution of biologically relevant molecules and can sample novel molecules from that distribution — molecules that may have never existed in nature or been synthesised in a laboratory.
Variational autoencoders (VAEs) encode molecules into a continuous latent space and decode points in that space back to molecular structures. By navigating the latent space — interpolating between known molecules, optimising in the direction of desired properties — it is possible to generate novel molecules with specified property profiles. The REINVENT drug design platform uses a recurrent neural network generative model trained with reinforcement learning to generate SMILES strings that optimise multi-objective scoring functions incorporating predicted binding affinity, ADMET properties, and synthetic accessibility.
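The latent-space navigation step is itself just vector arithmetic. The sketch below interpolates linearly between the latent codes of two known molecules; the codes are arbitrary invented vectors, not outputs of any trained model, and the decoder that would map each intermediate point back to a structure is deliberately not implemented.

```python
def interpolate(z_a, z_b, steps=5):
    """Linear interpolation between two latent vectors, endpoints included."""
    path = []
    for k in range(steps):
        t = k / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path

z_known_active = [0.2, -1.1, 0.7]    # hypothetical latent code, molecule A
z_known_soluble = [1.0, 0.4, -0.3]   # hypothetical latent code, molecule B
path = interpolate(z_known_active, z_known_soluble)
# Each point in `path` would be passed to a trained VAE decoder to obtain a
# candidate molecular structure blending the two parents' properties.
```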
Diffusion models have recently emerged as the most powerful generative architecture for three-dimensional molecular design. RFdiffusion generates protein backbone structures by learning to reverse a diffusion process that progressively adds noise to protein backbone coordinates. Starting from pure noise, the model iteratively denoises to produce a coherent protein backbone that satisfies specified geometric constraints. RFdiffusion has been used to design novel protein binders, enzyme active sites, and symmetric protein assemblies with experimental success rates that far exceed previous computational design methods.
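The forward half of that process is easy to make concrete on toy one-dimensional "backbone coordinates": each step shrinks the signal and adds Gaussian noise (a variance-preserving schedule with a constant beta, chosen here for simplicity). RFdiffusion's contribution is the learned reverse of this process over full rigid-body residue frames; that trained denoiser is not reproduced here.

```python
import math
import random

def forward_diffuse(coords, n_steps, beta=0.05, seed=0):
    """Apply n_steps of the forward (noising) diffusion process."""
    rng = random.Random(seed)
    x = list(coords)
    for _ in range(n_steps):
        # Variance-preserving step: scale signal down, add scaled noise.
        x = [math.sqrt(1 - beta) * xi + math.sqrt(beta) * rng.gauss(0.0, 1.0)
             for xi in x]
    return x

coords = [0.0, 1.5, 3.0, 4.5, 6.0]        # idealised 1-D backbone positions
noised = forward_diffuse(coords, n_steps=100)

# Fraction of the original signal surviving 100 steps: (1 - beta)^(100/2).
retained = (1 - 0.05) ** 50
```

After enough steps the sample is statistically indistinguishable from pure noise, which is the starting point from which the learned reverse process generates a new backbone.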
## Reinforcement Learning for Biological Optimisation
Reinforcement learning (RL) — in which an agent learns to take actions in an environment to maximise a cumulative reward — is increasingly being applied to biological optimisation problems in which the objective is to find a sequence of decisions that leads to a desired biological outcome. The key advantage of RL over supervised learning for these problems is that it does not require labelled examples of optimal decisions: it learns from the consequences of its own actions, making it applicable to problems in which the optimal solution is unknown.
In drug discovery, RL has been used to guide the generation of molecular structures that optimise multi-objective scoring functions. The REINFORCE algorithm and its variants train a generative model to produce molecules that score highly on a reward function combining predicted binding affinity, drug-likeness, and synthetic accessibility. In synthetic biology, RL is being used to design gene circuit architectures and metabolic pathway configurations that achieve specified biological objectives.
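The REINFORCE estimator can be demonstrated on a two-action "bandit" standing in for a single molecule-generation decision: the policy is a softmax over two choices, and the reward function (a hypothetical stand-in, not a real scoring function) prefers action 1. Systems like REINVENT apply the same gradient estimator to whole sequences of SMILES tokens with composite chemistry-aware rewards.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [l - m for l in logits]
    z = sum(math.exp(e) for e in exps)
    return [math.exp(e) / z for e in exps]

def reward(action):
    return 1.0 if action == 1 else 0.0   # hypothetical stand-in scoring function

def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        action = 0 if rng.random() < probs[0] else 1   # sample from the policy
        r = reward(action)
        # REINFORCE update for a softmax policy:
        # grad of log pi(action) w.r.t. logit a is (1[a == action] - probs[a]).
        for a in range(2):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * r * grad
    return softmax(logits)

probs = train()  # after training, the policy should strongly prefer action 1
```

Note that no labelled examples of "correct" actions appear anywhere: the policy improves purely from the rewards its own samples receive, which is the property the paragraph above highlights.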
## Uncertainty Quantification and the Reliability of Biological Predictions
One of the most important methodological challenges in biological machine learning is uncertainty quantification — the estimation of how confident a model is in its predictions. In many biological applications, the cost of acting on an incorrect prediction is high: a drug candidate predicted to be non-toxic that turns out to be hepatotoxic, a gene therapy vector predicted to have high specificity that causes off-target editing.
Bayesian neural networks provide principled uncertainty estimates but are computationally expensive and difficult to train. Deep ensembles — collections of independently trained neural networks whose predictions are aggregated — provide a practical approximation to Bayesian uncertainty that has been shown to be well-calibrated for many biological applications. Conformal prediction, a distribution-free framework for constructing prediction intervals with guaranteed coverage, is gaining traction in biological machine learning as a method for providing statistically rigorous uncertainty estimates.
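Split conformal prediction is simple enough to show in full: hold out a calibration set, compute the absolute residuals of any point predictor on it, and use their finite-sample-corrected (1 - alpha) quantile as a symmetric interval half-width. The predictor and data below are toy examples; the coverage guarantee itself is distribution-free and model-agnostic.

```python
import math

def conformal_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Symmetric split-conformal prediction interval around predict(x_new)."""
    residuals = sorted(abs(y - predict(x)) for x, y in zip(calib_x, calib_y))
    n = len(residuals)
    # Finite-sample-corrected quantile index: ceil((n+1)(1-alpha)) - 1.
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    q = residuals[k]
    center = predict(x_new)
    return center - q, center + q

predict = lambda x: 2.0 * x                       # toy point predictor
calib_x = [float(i) for i in range(20)]
calib_y = [2.0 * x + ((-1) ** i) * 0.5            # toy labels with +/-0.5 noise
           for i, x in enumerate(calib_x)]
lo, hi = conformal_interval(predict, calib_x, calib_y, x_new=5.0)
```

Under the exchangeability assumption, intervals built this way contain the true value with probability at least 1 - alpha, regardless of how wrong the underlying model is; a badly miscalibrated model simply produces wider intervals.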
## Methodological Challenges Specific to Biological Data
Distribution shift is perhaps the most pervasive challenge. Models trained on existing biological data are evaluated on held-out data from the same distribution, but deployed on data from a different distribution — novel chemical scaffolds, organisms from different phylogenetic clades, patients from different demographic groups. Prospective validation — evaluating models on data collected after the model was trained — is the gold standard for assessing real-world performance, but it is rarely performed in academic publications.
Data leakage through inappropriate train-test splitting is a systematic source of over-optimistic performance estimates in biological machine learning. Random splits of a compound library routinely place close structural analogues of training molecules in the test set, so a model can score well by memorising familiar chemotypes rather than generalising. Scaffold-based splitting, which ensures that the test set contains molecular scaffolds not present in the training set, provides a more realistic estimate of generalisation performance.
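The mechanics of a scaffold split reduce to grouped assignment: group compounds by a scaffold identifier and send whole groups to train or test, so no scaffold appears in both. The scaffold strings below are hypothetical labels; in practice they would come from a structural definition such as Bemis-Murcko scaffold extraction in RDKit, and the largest-groups-to-train heuristic shown is one common convention among several.

```python
from collections import defaultdict

def scaffold_split(compounds, test_fraction=0.2):
    """Assign whole scaffold groups to train or test (largest groups to train)."""
    groups = defaultdict(list)
    for cid, scaffold in compounds:
        groups[scaffold].append(cid)
    train, test = [], []
    train_target = (1 - test_fraction) * len(compounds)
    for scaffold in sorted(groups, key=lambda s: -len(groups[s])):
        if len(train) < train_target:
            train.extend(groups[scaffold])
        else:
            test.extend(groups[scaffold])    # rare scaffolds land in the test set
    return train, test

compounds = [  # (compound id, scaffold label) -- all hypothetical
    ("c1", "benzimidazole"), ("c2", "benzimidazole"), ("c3", "benzimidazole"),
    ("c4", "quinoline"), ("c5", "quinoline"),
    ("c6", "indole"), ("c7", "indole"),
    ("c8", "pyrazole"), ("c9", "chromone"), ("c10", "oxazole"),
]
train, test = scaffold_split(compounds)
```

Because entire scaffold groups move together, the test set is guaranteed to contain only chemotypes the model has never seen, which is exactly the generalisation question a realistic evaluation should ask.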
Interpretability — the ability to understand why a model makes a particular prediction — is a methodological requirement in biological applications that is often underweighted in favour of predictive performance. Attention weights, saliency maps, SHAP values, and concept-based explanations provide different levels and types of interpretability, each with its own assumptions and limitations.
## Conclusion: Toward a Methodologically Literate Biotechnology
The integration of machine learning and deep learning into biotechnology is not a passing trend — it is a fundamental methodological shift that is reshaping every domain of the discipline. But the full potential of this shift will only be realised if the biological scientists who use these methods develop genuine methodological literacy: an understanding not just of what the methods do, but of how they work, what assumptions they make, and what their limitations demand of the scientists who deploy them.
This literacy requires engagement with the methodological literature of machine learning — not just the biological applications literature — and a willingness to apply the same critical standards to computational claims that are applied to experimental ones. It requires attention to the quality of training data, the appropriateness of evaluation protocols, the calibration of uncertainty estimates, and the interpretability of model predictions. And it requires the intellectual humility to recognise that a model that performs well on a benchmark may perform poorly in the laboratory, and that the gap between the two is where the most important methodological work remains to be done.
