New data compression technique transforms pangenomics research
Researchers at UC San Diego have developed PanMAN, a novel data structure that compresses pangenomic information up to 1,391 times more efficiently than existing formats whilst simultaneously encoding detailed evolutionary and mutational histories. The technique, published in Nature Genetics, enabled construction of a comprehensive SARS-CoV-2 pangenome from 8 million sequences using just 366 MB of storage.
Engineers at the University of California San Diego have unveiled a breakthrough approach to handling the vast quantities of genetic data generated by modern sequencing technologies. The new compression technique, called Pangenome Mutation-Annotated Network (PanMAN), addresses critical scalability challenges in pangenomics whilst significantly expanding what information can be represented.
Pangenomics studies collections of genomes from a single species rather than relying on one reference genome, providing a more complete picture of natural variation. However, existing data structures struggle to scale to millions of sequences and typically capture only genetic variation, not the evolutionary relationships and mutational events that produced it.
Achieving unprecedented compression ratios
The research team, led by Professor Yatish Turakhia from UC San Diego’s Department of Electrical and Computer Engineering, evaluated PanMAN across six microbial species including SARS-CoV-2, HIV, respiratory syncytial virus, Mycobacterium tuberculosis, Escherichia coli and Klebsiella pneumoniae.
PanMAN demonstrated consistently superior compression compared to existing variation-preserving formats. The technique achieved compression ratios of 19.4 to 468 times over the Graphical Fragment Assembly (GFA) format, 6.1 to 147 times over VG format, 3.5 to 40 times over GBZ, 26.0 to 541 times over PanGraph, and 52.1 to 1,391 times over tskit format.
“Our compressive technique with PanMANs allows doing more with less, greatly improving the scale and scope of current pangenomic analysis,” said Turakhia, the study’s corresponding author.
The compression gains proved particularly dramatic for SARS-CoV-2, which had the largest collection size and lowest genetic diversity amongst the datasets tested. Importantly, compression performance generally improved as more sequences were added, demonstrating excellent scalability.
Evolutionary compression with enhanced representation
PanMAN’s efficiency stems from its use of “evolutionary compression”, a concept introduced in the UShER tool during the COVID-19 pandemic. Rather than storing complete sequences, PanMAN maintains phylogenetic trees with a root sequence and branches annotated with mutations. Any sequence can be reconstructed by applying mutations along the path from root to tip.
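The reconstruction step described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not PanMAN's actual implementation: the `Node` class, the `apply_mutations` and `reconstruct` functions, and the tuple-based mutation encoding are all hypothetical, and positions are plain offsets into the parent's sequence rather than PanMAN's own coordinate system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical mutation encoding (not PanMAN's on-disk format):
#   ("sub", pos, base)    replace the base at pos
#   ("ins", pos, bases)   insert bases before pos
#   ("del", pos, length)  delete `length` bases starting at pos
# Positions are 0-based offsets into the parent's sequence.
Mutation = Tuple[str, int, object]


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    mutations: List[Mutation] = field(default_factory=list)


def apply_mutations(seq: str, mutations: List[Mutation]) -> str:
    # Apply from the highest position down so earlier offsets stay valid.
    for kind, pos, payload in sorted(mutations, key=lambda m: m[1], reverse=True):
        if kind == "sub":
            seq = seq[:pos] + payload + seq[pos + 1:]
        elif kind == "ins":
            seq = seq[:pos] + payload + seq[pos:]
        elif kind == "del":
            seq = seq[:pos] + seq[pos + payload:]
        else:
            raise ValueError(f"unknown mutation kind: {kind}")
    return seq


def reconstruct(node: Node, root_sequence: str) -> str:
    # Collect branches from the tip back to the root, then replay them root-first.
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    seq = root_sequence
    for n in reversed(path):
        seq = apply_mutations(seq, n.mutations)
    return seq


# Toy tree: root -> A -> B, with a made-up 8-base root sequence.
root = Node("root")
a = Node("A", parent=root, mutations=[("sub", 2, "T"), ("del", 5, 2)])
b = Node("B", parent=a, mutations=[("ins", 0, "GG")])
print(reconstruct(b, "ACGTACGT"))  # prints GGACTTAT
```

In this toy tree only the root sequence and three mutations are stored, yet every node's full sequence can be recovered. A mutation placed on a branch near the root is stored once and inherited by all descendants, which is where the space saving comes from when many closely related genomes share most of their history.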
“The data structures used for pangenomics research are critical because they determine not only how efficiently genetic data is represented, but also what the data can represent,” explained Sumit Walia, an electrical engineering PhD candidate and co-first author of the study.
Unlike previous formats, PanMAN employs a three-level, reference-free coordinate system that handles insertions, deletions and structural rearrangements. The system uses blocks representing homologous genome segments, with lower levels tracking specific nucleotide positions within blocks.
PanMAN extends beyond single trees to represent complex genetic events. The format uses network edges to connect multiple mutation-annotated trees (PanMATs), storing recombination and horizontal gene transfer events by recording breakpoint coordinates in parent sequences.
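One way to picture a network edge is as a small record naming two parent sequences and a pair of breakpoints. The sketch below is purely illustrative: the `NetworkEdge` fields, the `recombinant_sequence` function and the node names are hypothetical, and it assumes both parents share one simple coordinate space, which is a simplification of what the format has to handle.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class NetworkEdge:
    # Hypothetical record of one recombination or gene-transfer event.
    acceptor: str  # parent supplying the backbone
    donor: str     # parent supplying the transferred segment
    start: int     # first breakpoint (0-based, inclusive)
    end: int       # second breakpoint (0-based, exclusive)


def recombinant_sequence(edge: NetworkEdge, sequences: Dict[str, str]) -> str:
    # Keep the acceptor's sequence outside the breakpoints and
    # splice in the donor's sequence between them.
    backbone = sequences[edge.acceptor]
    segment = sequences[edge.donor][edge.start:edge.end]
    return backbone[:edge.start] + segment + backbone[edge.end:]


# Made-up parent sequences, one from each of two trees.
sequences = {"tree1/tipX": "AAAAAAAA", "tree2/tipY": "CCCCCCCC"}
edge = NetworkEdge(acceptor="tree1/tipX", donor="tree2/tipY", start=2, end=5)
print(recombinant_sequence(edge, sequences))  # prints AACCCAAA
```

Storing only the parents and breakpoints means the recombinant does not need to be written out in full: it can be rebuilt from sequences the trees already encode.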
Demonstrating comprehensive SARS-CoV-2 analysis
The researchers constructed a PanMAN containing over 8 million SARS-CoV-2 genomes, requiring only 366 MB of disk space. This represents 5.3 times better compression than the AGC format and 3,032 times better than multiple sequence alignment in FASTA format.
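For a sense of scale, the reported figures can be turned into rough absolute sizes. This is a back-of-the-envelope calculation, not data from the study: it assumes both ratios are measured against the 366 MB PanMAN on the same set of genomes, treats 1 MB as 10^6 bytes, and uses exactly 8 million genomes although the article says "over 8 million".

```python
PANMAN_MB = 366          # reported PanMAN size on disk
GENOMES = 8_000_000      # "over 8 million", so this is a lower bound
AGC_RATIO = 5.3          # reported advantage over AGC
MSA_FASTA_RATIO = 3032   # reported advantage over MSA in FASTA format

# Implied sizes of the alternative formats (assumption: same dataset).
agc_mb = PANMAN_MB * AGC_RATIO
msa_tb = PANMAN_MB * MSA_FASTA_RATIO / 1_000_000

# Average storage cost per genome in the PanMAN.
bytes_per_genome = PANMAN_MB * 1_000_000 / GENOMES

print(f"AGC: about {agc_mb / 1000:.1f} GB")
print(f"MSA FASTA: about {msa_tb:.2f} TB")
print(f"PanMAN: at most about {bytes_per_genome:.0f} bytes per genome")
```

Under these assumptions the same collection would occupy roughly 1.9 GB in AGC and about 1.1 TB as an aligned FASTA file, while the PanMAN averages under 50 bytes per genome.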
Analysis of 20,000 SARS-CoV-2 sequences spanning 1,000 Pango lineages revealed PanMAN’s ability to capture detailed mutational landscapes. The format identified 91,871 substitutions, 52,954 insertions and 67,618 deletions. Insertions and deletions, though each less frequent than substitutions, affected approximately four times more genomic sites.
The research confirmed all previously reported lineage-defining indels for variants of concern, including Omicron sublineages BA.1 and BA.2, Delta B.1.617.2, Gamma P.1 and Alpha B.1.1.7. Analysis revealed that indels longer than 12 bases were extremely rare in the spike protein, with strong purifying selection against frame-shifting mutations evident from the prevalence of indels in multiples of three nucleotides.
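The reasoning behind the multiples-of-three observation is that an indel whose length is divisible by three adds or removes whole codons and leaves the downstream reading frame intact, whereas any other length shifts it. A minimal sketch of that check follows; the function names and the list of indel lengths are made up for illustration and are not taken from the study.

```python
from collections import Counter
from typing import Iterable


def is_frame_preserving(indel_length: int) -> bool:
    # Whole codons are three nucleotides, so only multiples of three keep the frame.
    return indel_length % 3 == 0


def frame_summary(indel_lengths: Iterable[int]) -> Counter:
    # Tally how many indels keep the reading frame and how many shift it.
    return Counter(
        "in-frame" if is_frame_preserving(n) else "frameshift"
        for n in indel_lengths
    )


# Hypothetical indel lengths, in nucleotides.
example_lengths = [3, 6, 9, 1, 12, 3]
print(frame_summary(example_lengths))
```

A tally dominated by the in-frame class, as in this toy input, is the kind of pattern the article describes as evidence of selection against frame-shifting mutations.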
“Extending compressive pangenomics to human genomes can fundamentally transform how we store, analyse, and share large-scale human genetic data,” said Turakhia. “Besides enabling studies of human genetic diversity, disease, and evolution at unprecedented scale and speed, it can depict detailed evolutionary and mutational histories which shape diverse human populations, something that current representations do not capture.”
The team has developed panmanUtils, a software toolkit supporting common analyses and ensuring interoperability with existing tools. The researchers are now working to extend their approach from microbial genomes to human genomes, supported by a Jacobs School Early Career Faculty Development Award.
Reference:
Walia, S., Motwani, H., Tseng, Y.-H., et al. (2026). Compressive pangenomics using mutation-annotated networks. Nature Genetics. https://doi.org/10.1038/s41588-025-02478-7





