Senior Data Scientist · PhD · Computational Biology

Amy Francis

Data science, ML & drug discovery - for everyone.

I'm a Senior Data Scientist bridging computational biology and experimental research, with expertise in machine learning, bioinformatics, and tool development across cancer genomics, immunology, and drug discovery. I think a lot about how AI is changing science - and about making sure those changes reach every researcher, not just those with a machine learning background. My work tries to do that: building practical tools, leading interdisciplinary teams, and figuring out how to get the most out of modern AI in real biological research.

Cancer Genomics · Drug Discovery · Protein ML · RNA-seq · Foundation Models · AI in Science · 2× Hackathon Winner


About me

Where biology meets machine learning.

I came from the biology side, and I've always thought the most interesting work happens where disciplines overlap.

I'm a Senior Data Scientist at Nexus BioQuest, a contract research organisation in Bristol, where I work across data science, machine learning, and analytical tool development in support of research programmes spanning pharmaceutical and biotech clients.

My PhD at the University of Bristol, funded by a competitive Cancer Research UK studentship, focused on predicting the functional impact of genetic variants in cancer genomes. I've since worked at Roche in Basel and Zürich, exploring protein language models and antibody optimisation - and published four peer-reviewed papers across cancer genomics, variant prediction, and drug discovery.

Beyond the code, I've led interdisciplinary teams to back-to-back hackathon victories - at Cambridge and the Wellcome Collection - and co-organised Bristol's first AI in Health meeting, which led to two interdisciplinary research grants. I believe the best science happens at the edges of disciplines, and I love building the teams and environments where that becomes possible.

Based in: Bristol, UK

Experience: 5+ years

Current role: Senior Data Scientist, Nexus BioQuest

PhD: Bioinformatics & ML, Bristol

Funded by: Cancer Research UK

Interests: AI in science, open collaboration

A perspective on AI in science

AI is going to be part of everyday science.
It should work for everyone.

The tools exist. The clinical evidence is starting to follow. The harder question is who can actually use them.

"Some of the most powerful tools in the history of biology are sitting in research papers and GitHub repos that most bench scientists have never heard of. That feels like a problem worth working on."

AlphaFold 2 more or less solved single-chain protein structure prediction - a problem that had been open for 50 years - with its CASP14 results in late 2020 and the paper and code release in 2021. AI-designed drugs are now reaching clinical trials. The industry is reorganising around this, with pharma companies partnering with specialist AI firms and embedding NVIDIA infrastructure directly into their R&D pipelines. These are not future possibilities. They are happening now, and the pace is accelerating.

But using these tools well still requires an unusual combination of skills - enough ML to run and adapt the models, enough compute to work with them, and enough domain knowledge to ask the right questions. Most biologists have one of those things, maybe two. That gap is real, and it matters. I came into data science from the biology side, so I know what it feels like to have a question you cannot answer because the tools are out of reach. That is what drives most of what I build and write about.

The deeper analysis - the specific partnerships, what the clinical evidence actually shows, what NVIDIA's infrastructure deals mean in practice, and the honest open questions about whether this produces better drugs - is in the blog.

Foundation models reshaping biology

AlphaFold 2 & 3

DeepMind / Google

Predicted structures for 200M+ proteins by 2022. AlphaFold 3 (2024) extends to DNA, RNA, and small molecules - relevant to structure-based drug design.

ESM-2 & ESMFold

Meta AI

Protein language models trained on 250M sequences. Enable zero-shot fitness and function prediction - models I've used directly in my own research at Roche.

Evo 1 & 2

Arc Institute

Genomic foundation models trained across the tree of life on DNA sequences. Enable generative design of genes, regulatory elements, and CRISPR guides.

RFdiffusion

Baker Lab, UW

Diffusion model for de novo protein design - binders, enzymes, vaccine candidates - designed from scratch and experimentally validated.

BioNeMo

NVIDIA

An open platform wrapping biological foundation models - ESMFold, DiffDock, MolMIM - behind accessible APIs. Used by Amgen, Genentech, AstraZeneca, GSK, and Novo Nordisk. Over 100 firms on the platform as of 2024.

DiffDock & MolMIM

NVIDIA / MIT

DiffDock predicts protein-ligand binding poses using diffusion - faster than traditional docking. MolMIM generates novel small molecules optimised for target properties.

Geneformer

Broad Institute

Transformer pre-trained on 30M single-cell transcriptomes. Supports in silico gene perturbation and network inference - useful for target identification without wet-lab experiments.

scGPT

University of Toronto

Single-cell foundation model pre-trained on 33M cells. Enables cell type annotation, perturbation prediction, and gene regulatory network inference.

TxGNN

Harvard / Broad

Graph neural network trained on biomedical knowledge graphs for drug indication and contraindication prediction - a practical tool for repurposing existing compounds.

Pharma.AI / INS018_055

Insilico Medicine

End-to-end AI drug discovery platform. Used to design INS018_055, the first fully AI-generated drug to show efficacy in a Phase IIa trial (Nature Biotechnology, 2024). Target to Phase I in under 30 months.

Projects & Hackathons

Things I've built and worked on.

A mix of published tools, hackathon projects, and pipelines built for real research problems.

🏆 1st Place · Cambridge

Mapping novel compounds to biological pathways

Led the winning team at GetSeen Ventures' AI × Cancer Bio Hackathon. Used transformer encoders on SMILES strings and high-content image embeddings from the RxRx3-core dataset to predict molecular pathways. Ongoing collaboration likely to result in publication.

Transformers · SMILES · Image Embeddings · Drug Discovery

View project

🏆 1st Place · Wellcome Collection

Deep learning for protein fitness prediction

Led the winning team at the Roche & HDR Hackathon. Encoded protein sequences with pre-trained language models (ESM, AntiBERT) and explored CNNs to model sequence-function relationships using deep mutational scanning (DMS) data from ProteinGym. Secured a Roche AI internship as a direct result.

ESM · AntiBERT · CNNs · DMS · PyTorch

View project

Flow Cytometry

Automated flow cytometry analysis pipeline

Built a post-acquisition flow cytometry analysis pipeline with an intuitive Streamlit interface, applying unsupervised ML - clustering and dimensionality reduction - to high-dimensional cytometry data to uncover cell population patterns and accelerate downstream reporting.

Python · Streamlit · Scikit-learn · Unsupervised ML

View project
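As a rough illustration of the unsupervised step in the flow cytometry project above (a sketch, not the pipeline's actual code): cytometry intensities are commonly compressed with an arcsinh transform before clustering, and events are then grouped by similarity. The example uses only the Python standard library, synthetic two-marker data, an assumed cofactor of 150, and a small hand-written k-means standing in for the scikit-learn routines the project lists. All function and variable names are hypothetical.

```python
import math
import random

def arcsinh_transform(events, cofactor=150.0):
    """Compress raw intensities. The cofactor is an assumed, tunable value."""
    return [[math.asinh(v / cofactor) for v in event] for event in events]

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first event, then repeatedly add
    the event farthest from every centroid chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(points, k, n_iter=20):
    """Plain k-means: returns (centroids, labels) after n_iter passes."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(n_iter):
        # Assign every event to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its assigned events.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Synthetic data: a dim and a bright population measured on two markers.
rng = random.Random(0)
dim_events = [[rng.gauss(200, 30), rng.gauss(200, 30)] for _ in range(50)]
bright_events = [[rng.gauss(5000, 500), rng.gauss(8000, 800)] for _ in range(50)]

events = arcsinh_transform(dim_events + bright_events)
centroids, labels = kmeans(events, k=2)
population_sizes = [labels.count(c) for c in range(2)]
print(population_sizes)
```

Real cytometry data has many more markers and far less tidy populations, which is where dimensionality reduction (UMAP, t-SNE) and more robust clustering methods come in.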

Cancer Genomics

DrivR-Base - variant annotation toolkit

Published a data mining toolkit integrating molecular annotations for SNVs, creating a centralised resource that reduces redundancy and accelerates machine learning model development for variant effect prediction.

Python · R · Docker · COSMIC · gnomAD

Read paper

Antibody Optimisation

Predicting mutation impact on antibody-antigen binding

At Roche pRED, used TensorFlow models grounded in global epistasis and pre-trained protein language models to predict binding affinity from deep mutational scanning data. Ongoing collaboration with University of Oslo, aiming for publication.

TensorFlow · ESM · HPC · DMS

Global Epistasis repo
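For readers unfamiliar with global epistasis, the core idea can be sketched in a few lines: mutation effects add up on a hidden (latent) scale, and a single monotonic nonlinearity maps that latent value to the measured phenotype. The sketch below is an assumption-laden toy, not the Roche models: the mutation names and effect sizes are made up, the nonlinearity is assumed to be a sigmoid, and nothing is fitted to data.

```python
import math

def latent_phenotype(mutations, effects, wildtype_latent=0.0):
    """Additive model: wild-type latent value plus one effect per mutation."""
    return wildtype_latent + sum(effects[m] for m in mutations)

def observed_phenotype(latent, lower=0.0, upper=1.0):
    """Assumed global nonlinearity: a sigmoid bounded by lower and upper."""
    return lower + (upper - lower) / (1.0 + math.exp(-latent))

# Hypothetical per-mutation effects on the latent scale.
effects = {"A10G": -1.5, "S52T": -2.0}
WT_LATENT = 3.0  # hypothetical wild-type latent value

def predict(mutations):
    return observed_phenotype(latent_phenotype(mutations, effects, WT_LATENT))

wildtype = predict([])
single_a = predict(["A10G"])
single_b = predict(["S52T"])
double = predict(["A10G", "S52T"])

# What the double mutant would score if effects simply added on the
# observed scale rather than the latent one.
expected_additive = wildtype + (single_a - wildtype) + (single_b - wildtype)

print(f"double mutant: {double:.3f}, naive additive guess: {expected_additive:.3f}")
```

The double mutant scores lower than the naive additive guess even though the latent effects are perfectly additive. That apparent interaction comes entirely from the nonlinearity, which is what "global" epistasis refers to.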

Community

Bristol AI in Health Meeting

Co-organised Bristol's first interdisciplinary AI in Health Meeting in collaboration with the Elizabeth Blackwell Institute. Facilitated cross-disciplinary collaboration that resulted in two interdisciplinary grants for applied AI projects.

Leadership · Event Organisation · Grant Facilitation

Learn more

Skills & Tools

The craft behind the work.

Picked up across academia, industry, and a few hackathons.

Languages & Systems

Python
R
SQL
Linux / HPC
Cloud platforms
Docker

Machine Learning

Scikit-learn
XGBoost / SVMs
Neural Networks
Foundation & Language Models
TensorFlow / PyTorch
MLflow

RNA-seq & Transcriptomics

Alignment: STAR, HISAT2
QC: FastQC, MultiQC, Trimmomatic
Quantification: featureCounts, Salmon
Differential expression: DESeq2, edgeR, limma
Unsupervised ML for cell population definition
Immunology cell type characterisation
UMAP / t-SNE dimensionality reduction
Pathway & gene set enrichment analysis

Bioinformatics & Data

Flow Cytometry
Deep Mutational Scanning
CRISPR / Image Analysis
COSMIC, gnomAD, TCGA
Protein / DNA Sequences
Proteomics (Olink)

Visualisation & Comms

Streamlit
Matplotlib / Seaborn
Scientific writing
Client consultation
Conference presenting

Leadership & Collaboration

Interdisciplinary team leadership
Hackathon team lead (2× winner)
Cross-functional collaboration
User-centred design
Grant facilitation
CRO client management

Publications

Peer-reviewed research.

Four published works spanning cancer genomics, variant effect prediction, and drug discovery.

Journal Article · 2026

CanDrivR-CS: A Cancer-Specific Machine Learning Framework for Distinguishing Recurrent and Rare Variants

Bioinformatics Advances - Accepted & Published

doi.org/10.1093/bioadv/vbag008

Application Note · 2024

DrivR-Base: A Feature Extraction Toolkit For Variant Effect Prediction Model Construction

Bioinformatics - Accepted & Published

doi.org/10.1093/bioinformatics/btae197

Review Article · 2023

Predicting Pathogenicity from Non-Coding Mutations

Nature Biomedical Engineering - Accepted & Published

doi.org/10.1038/s41551-022-00996-x

Online Report · 2024

Toxicity Prediction for Drug Discovery

The Alan Turing Institute, Data Study Group

doi.org/10.5281/zenodo.13882192

Writing

Thinking out loud.

AI is going to reshape how biology is done. I think scientists at every level need to understand what is actually happening - not the hype version, and not the version that assumes a computer science degree. I write here to try to bridge that gap: covering real tools, real evidence, and real implications, in language that a working scientist can use. Four themes: opinion, industry analysis, technical walkthroughs, and practical guides.

Opinion · Published

Great science needs more than scientists

On what actually makes interdisciplinary teams work - diversity of background, the friction that comes with it, and why that friction is usually the point. Draws on three very different collaborative experiences.

Read post →

Opinion · Coming soon

What makes a team actually work - on feedback, trust, and creating space for honest challenge

The leadership questions that rarely get asked directly: how to build teams where people feel safe enough to disagree, how to give feedback that lands well, and why psychological safety is not a soft metric.

Industry & Research · Published

How AI is changing drug discovery - and what the evidence actually says

The industry is reorganising fast - specialist AI partnerships, NVIDIA infrastructure embedded in pharma R&D, and the first AI-designed drugs reaching clinical trials. What is actually happening, what the evidence shows, and the honest open questions.

Read post →

Technical · Coming soon

What NVIDIA's BioNeMo actually means for wet-lab scientists

A practical look at what BioNeMo, DiffDock, and the broader NVIDIA biology stack make possible - and what still needs a specialist to get right.

Technical · Coming soon

Using ESM-2 to predict protein fitness - a walkthrough with real data

A hands-on look at Meta's protein language model: what it actually does, how to run it, and how to interpret the output in a biological context.

Technical · Coming soon

Unsupervised ML for immune cell population discovery in RNA-seq data

From alignment to UMAP - how to use clustering and dimensionality reduction to define cell populations without prior labels, and what to watch out for when you do.

Tutorials & Guides · Coming soon

RNA-seq from scratch - a practical guide for biologists who have never touched the command line

A step-by-step walkthrough of a full RNA-seq pipeline - from raw FASTQ files to differential expression results - written for researchers with domain knowledge but no computational background.

Tutorials & Guides · Coming soon

What AlphaFold actually gives you - and how to make sense of it

A plain-language guide to interpreting AlphaFold output: confidence scores, what pLDDT means in practice, and how to decide whether a predicted structure is actually useful for your question.

Always happy to talk science.