EPISODE · Dec 1, 2025 · 20 MIN
215: Protein Set Transformer for high-diversity viromics
from Base by Base · host Gustavo Barra
Martin et al., Nat Commun (2025) - Protein Set Transformer (PST) is a protein-based genome language model that represents genomes as sets of proteins to improve genome and protein representations across diverse viral datasets.

Key terms: viromics, protein-language-model, genome-embeddings, triplet-loss, host-prediction.

Study Highlights: PST embeds proteins with ESM2, concatenates positional and strand vectors, contextualizes proteins with a multi-head attention encoder, and produces genome embeddings via a learnable weighted decoder pooling. The foundation PST-TL models were pretrained on >100k dereplicated viral genomes encoding >6M proteins using a triplet-loss objective with PointSwap augmentation and evaluated on IMG/VR v4 and MGnify soil virus test sets. PST-TL outperformed other protein- and nucleotide-based methods at recovering genome–genome relationships, including remote relationships, and its protein embeddings clustered structural capsid folds and late-gene functional modules. PST improved annotation transfer for hypothetical proteins via embedding and structure-aware clustering and boosted viral host-species prediction when used in a graph link-prediction framework.

Conclusion: PST provides transferable genome- and protein-level embeddings that strengthen representation, annotation, and host-prediction tasks for diverse viral and microbial genomics applications.

Music: Enjoy the music based on this article at the end of the episode.

First author: Martin
Journal: Nat Commun (2025)
DOI: 10.1038/s41467-025-66049-4
Reference: Martin, C., Gitter, A., Anantharaman, K. Protein Set Transformer: a protein-based genome language model to power high-diversity viromics. Nat Commun (2025). https://doi.org/10.1038/s41467-025-66049-4

License: This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/

Support: Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00
Official website: https://basebybase.com

On PaperCast Base by Base you’ll discover the latest in genomics, functional genomics, structural genomics, and proteomics.

Episode link: https://basebybase.com/episodes/protein-set-transformer

QC: This episode was checked against the original article PDF and publication metadata for the episode release published on 2025-12-01.

QC Scope:
- article metadata and core scientific claims from the narration
- excludes analogies, intro/outro, and music
- transcript coverage: Substantive audit of the core scientific claims and results described in the PST work (architecture, training, data, evaluation, functional insights, host prediction, generalizability, and biosafety/licensing) as presented in the transcript.
- transcript topics: PST architecture: genome modeled as set of proteins with context; Protein embeddings and genome-position/strand augmentation; Triplet loss training: Chamfer distance and PointSwap; Pretraining data scale: >100k viral genomes, >6M proteins; Performance: PST-TL outperforms PST-CTX and PST-MLM; Remote relation detection: ASI correlations with PST embeddings

QC Summary:
- factual score: 10/10
- metadata score: 10/10
- supported core claims: 8
- claims flagged for review: 0
- metadata checks passed: 4
- metadata issues found: 0

Metadata Audited:
- article_doi
- article_title
- article_journal
- license

Factual Items Audited:
- Genomes are modeled as sets of proteins with context (not as linear sequences)
- ESM2 protein embeddings are augmented with two learnable vectors representing protein position and coding strand
- Training uses triplet loss with Chamfer distance and PointSwap augmentation

Chapters:
(00:00:00) - Deep Learning in Viral Biology
(00:02:31) - Preliminary insights into viral biology
(00:08:01) - PSTTL: The Hidden Genome of Viruses
(00:11:29) - PSTTL: The Virality Model
(00:14:22) - Preston 2, Context-aware viral evolution
(00:16:19) - Signs and Numbers in the Code
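
Illustration: The notes above mention triplet-loss training together with a Chamfer distance over genomes treated as sets of proteins. As a rough illustration only, the sketch below shows a generic symmetric Chamfer distance between two sets of vectors and a generic margin-based triplet loss in NumPy. It is not the authors' implementation: the function names, the squared-Euclidean metric, the `margin` value, and the way the two pieces relate are all assumptions made for this example, and PointSwap augmentation is not shown.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between two sets of vectors.

    Each row of `a` and `b` stands in for one protein embedding
    (hypothetical toy data, not real ESM2 output).
    """
    # Pairwise squared Euclidean distances, shape (len(a), len(b)).
    diff = a[:, None, :] - b[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # Mean nearest-neighbor distance from a to b, plus from b to a.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())


def triplet_loss(anchor: np.ndarray, positive: np.ndarray,
                 negative: np.ndarray, margin: float = 1.0) -> float:
    """Margin-based triplet loss on three single embedding vectors.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive (in squared distance).
    """
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Two toy "genomes" as sets of 2-D protein vectors.
    genome_a = np.array([[0.0, 0.0], [1.0, 0.0]])
    genome_b = np.array([[0.0, 0.0]])
    print(chamfer_distance(genome_a, genome_b))

    # Toy genome-level embeddings for one triplet.
    anchor = np.array([0.0, 0.0])
    positive = np.array([1.0, 0.0])
    negative = np.array([0.0, 1.0])
    print(triplet_loss(anchor, positive, negative))
```

Because the Chamfer distance compares each vector with its nearest neighbor in the other set, it does not depend on row order, which suits a set-of-proteins view of a genome.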