Tech

Ligo Bio finds natural protein folds far more redundant than previously thought

New analysis suggests reusable structural neighbourhoods number closer to 25,000, not 2.3 million, with implications for enzyme design and biomolecular modelling.

Author

Owen Mercer

Markets and Finance Editor

Published

Draft

Source: Hacker News · original

Research note challenges assumption that scaling training data yields proportional structural diversity for generative AI

A recent research note from Ligo Bio argues that natural protein sequence space is highly redundant in terms of structural folds, challenging the prevailing assumption that scaling training data for generative biomolecular models yields proportional structural diversity. The authors contend that simply folding more natural sequences does not provide the novel structural variety required to improve generative models for enzyme design.

Ligo researchers propose a data engineering method using spectral bisection to split multi-domain proteins into compact fragments for more accurate clustering. This approach addresses the limitations of previous fast clustering methods, such as Foldseek, which the authors argue overestimate structural novelty by failing to account for disordered regions and multi-domain linkers in predicted structures.
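To illustrate the general idea of spectral bisection, the sketch below splits a chain into two groups using the sign of the Fiedler vector of a residue contact graph. It is a minimal example and not the authors' pipeline: the function name `spectral_bisect`, the use of one coordinate per residue, the unweighted contact graph and the 8 Å cutoff are all assumptions made for illustration, and the note's own fragment criteria are not reproduced here.

```python
import numpy as np


def spectral_bisect(coords, cutoff=8.0):
    """Split residues into two groups via the Fiedler vector.

    coords: (n, 3) array-like, one point per residue (assumed, e.g. C-alpha).
    cutoff: contact distance in angstroms (assumed value, not from the note).
    Returns a boolean array of length n; True and False mark the two halves.
    """
    coords = np.asarray(coords, dtype=float)

    # Pairwise distances between residues.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Unweighted contact graph: an edge wherever two residues are close.
    adjacency = (dist < cutoff).astype(float)
    np.fill_diagonal(adjacency, 0.0)

    # Graph Laplacian L = D - A.
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency

    # eigh returns eigenvalues in ascending order; the eigenvector of the
    # second-smallest eigenvalue (the Fiedler vector) marks the weakest cut.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]

    return fiedler >= 0
```

Applied recursively, with some stopping rule for compactness, a split like this would separate domains joined by a sparse linker. The sketch assumes the contact graph is connected; a disconnected graph would need its components handled first.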

Their analysis suggests the number of reusable structural neighbourhoods is closer to 25,000, roughly two orders of magnitude below the 2.3 million reported by previous fast clustering methods. The study indicates that the top 1,000 clusters contain 71.5% of the fragments, a heavily skewed distribution in which most fragments fall into a small number of structural neighbourhoods.
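A figure like the 71.5% share can be read as the fraction of all fragments that belong to the k largest clusters. The short sketch below shows that calculation; the function name `top_k_share` and the toy cluster sizes in the usage comment are hypothetical and are not data from the note.

```python
def top_k_share(cluster_sizes, k):
    """Return the fraction of all members held by the k largest clusters."""
    ordered = sorted(cluster_sizes, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("cluster_sizes must contain at least one member")
    return sum(ordered[:k]) / total


# Toy example with made-up sizes: the two largest of five clusters
# hold 80 of 100 members.
share = top_k_share([50, 30, 10, 5, 5], k=2)
```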

The note discusses implications for training generative models for enzyme design, suggesting that adding more structures derived from natural sequences may not help if the model mostly sees the same scaffold families. The authors propose a sampling strategy with a balancing exponent that weights each cluster by the square root of its size, rather than sampling uniformly over clusters or uniformly over individual members, to better capture fold diversity.
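One way to picture the balancing exponent is as a single parameter that interpolates between the two uniform schemes. In the sketch below, which is an illustration under stated assumptions and not the authors' code, a cluster of size n is drawn with probability proportional to n raised to `alpha`: `alpha = 0` gives uniform sampling over clusters, `alpha = 1` is equivalent to uniform sampling over members, and `alpha = 0.5` gives the square-root weighting described above. The names `cluster_weights`, `sample_member` and `alpha` are hypothetical.

```python
import random


def cluster_weights(cluster_sizes, alpha=0.5):
    """Normalised sampling probability per cluster, proportional to size**alpha."""
    raw = [size ** alpha for size in cluster_sizes]
    total = sum(raw)
    return [weight / total for weight in raw]


def sample_member(clusters, alpha=0.5, rng=random):
    """Pick a cluster by balanced weight, then a member uniformly within it.

    clusters: list of non-empty lists of members.
    rng: the random module or a random.Random instance.
    """
    weights = cluster_weights([len(cluster) for cluster in clusters], alpha)
    chosen = rng.choices(clusters, weights=weights, k=1)[0]
    return rng.choice(chosen)
```

With one cluster of 100 members and one singleton, square-root weighting draws the large cluster about 91% of the time (10/11), compared with 50% under uniform cluster sampling and about 99% under uniform member sampling.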

While the study notes that models trained with MGnify-scale data show improved performance in antibody-antigen prediction, where coevolutionary signals are absent, it raises questions about whether models can learn to explore outside the natural fold manifold. The findings suggest that for enzyme design, engineering active-site neighbourhoods on familiar scaffolds may be more effective than seeking globally novel backbones.
