Tech

Ligo Bio finds natural protein folds far more redundant than previously thought

New analysis suggests reusable structural neighbourhoods number closer to 25,000, not 2.3 million, with implications for enzyme design and biomolecular modelling.

Author
Owen Mercer
Markets and Finance Editor
Published
Draft
Source: Hacker News
Research note challenges assumption that scaling training data yields proportional structural diversity for generative AI

A recent research note from Ligo Bio argues that natural protein sequence space is highly redundant in terms of structural folds, challenging the prevailing assumption that scaling training data for generative biomolecular models yields proportional structural diversity. The authors contend that simply folding more natural sequences does not provide the novel structural variety required to improve generative models for enzyme design.

Ligo researchers propose a data engineering method using spectral bisection to split multi-domain proteins into compact fragments for more accurate clustering. This approach addresses the limitations of previous fast clustering methods, such as Foldseek, which the authors argue overestimate structural novelty by failing to account for disordered regions and multi-domain linkers in predicted structures.
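The note is not quoted here in enough detail to reproduce Ligo's pipeline, but the general idea of spectral bisection can be sketched as follows. This is a minimal illustration, not the authors' implementation: the residue contact graph, the 8 Å cutoff, the sign-based split on the Fiedler vector, and the function name `spectral_bisect` are all assumptions made for the example, and the coordinates are synthetic.

```python
import numpy as np

def spectral_bisect(coords, cutoff=8.0):
    """Split residues into two groups using the Fiedler vector of a contact graph.

    coords: (n, 3) array of per-residue coordinates (e.g. C-alpha atoms).
    cutoff: contact distance in angstroms (illustrative value, an assumption).
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    # Pairwise distances, then an unweighted contact graph.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    adj = (dists < cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)
    # Sequence neighbours are always connected, which keeps the graph connected.
    idx = np.arange(n - 1)
    adj[idx, idx + 1] = 1.0
    adj[idx + 1, idx] = 1.0
    # Graph Laplacian L = D - A; eigh returns eigenvalues in ascending order.
    lap = np.diag(adj.sum(axis=1)) - adj
    _, eigvecs = np.linalg.eigh(lap)
    fiedler = eigvecs[:, 1]  # eigenvector of the second-smallest eigenvalue
    labels = (fiedler > 0).astype(int)
    # Eigenvector sign is arbitrary, so fix residue 0 to label 0.
    if labels[0] == 1:
        labels = 1 - labels
    return labels

# Synthetic "two-domain" chain: two compact 27-residue blobs joined by a linker.
grid = np.array([(x, y, z) for x in range(3) for y in range(3) for z in range(3)],
                dtype=float) * 3.8
linker = np.array([(11.4 + 3.8 * k, 7.6, 7.6) for k in range(7)])
coords = np.vstack([grid, linker, grid + np.array([38.0, 0.0, 0.0])])

labels = spectral_bisect(coords)
```

On this toy input the two blobs land in different groups and the cut falls somewhere along the linker. A production pipeline would presumably apply the split recursively and check each fragment for compactness before clustering, which this sketch does not attempt.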

Their analysis suggests the number of reusable structural neighbourhoods is closer to 25,000, roughly 90 times fewer than the 2.3 million reported by previous fast clustering methods. The study indicates that the top 1,000 clusters contain 71.5% of the fragments, a heavily skewed distribution in which most fragments sit in a small number of structural neighbourhoods.
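A concentration figure of this kind is the share of all fragments that fall in the k largest clusters. The short sketch below shows the calculation on made-up cluster sizes; the function name `top_k_share` and the toy numbers are illustrative assumptions and are not Ligo's data.

```python
import numpy as np

def top_k_share(cluster_sizes, k):
    """Fraction of all members that belong to the k largest clusters."""
    sizes = np.sort(np.asarray(cluster_sizes, dtype=float))[::-1]  # descending
    return sizes[:k].sum() / sizes.sum()

# Toy cluster sizes (hypothetical): 1,000 fragments spread over 10 clusters.
toy_sizes = [500, 300, 100, 50, 30, 10, 5, 3, 1, 1]
share = top_k_share(toy_sizes, k=2)  # two largest clusters hold 800 of 1,000
```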

The note discusses implications for training generative models for enzyme design, suggesting that adding more structures derived from natural sequences may not help if the model mostly sees the same scaffold families. To better capture fold diversity, the authors propose a sampling strategy with a balancing exponent that weights each cluster by the square root of its size, rather than sampling uniformly over clusters or uniformly over members.
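One way to read the balancing exponent is as a power applied to cluster size: an exponent of 0 gives every cluster equal weight, an exponent of 1 is equivalent to sampling members uniformly, and 0.5 gives the square-root weighting described in the note. The sketch below illustrates that reading; the function names, the parameter name `alpha`, and the toy cluster sizes are assumptions for the example rather than details taken from the note.

```python
import numpy as np

def cluster_sampling_probs(cluster_sizes, alpha=0.5):
    """Probability of drawing each cluster, proportional to size ** alpha."""
    sizes = np.asarray(cluster_sizes, dtype=float)
    weights = sizes ** alpha
    return weights / weights.sum()

def sample_member(clusters, alpha=0.5, rng=None):
    """Pick a cluster with size**alpha weighting, then a member uniformly within it."""
    rng = np.random.default_rng() if rng is None else rng
    probs = cluster_sampling_probs([len(c) for c in clusters], alpha)
    chosen = rng.choice(len(clusters), p=probs)
    return clusters[chosen][rng.integers(len(clusters[chosen]))]

# Toy cluster sizes (hypothetical): one huge family, one medium, one singleton.
toy_sizes = [10_000, 100, 1]
sqrt_probs = cluster_sampling_probs(toy_sizes, alpha=0.5)     # 100 : 10 : 1
uniform_cluster = cluster_sampling_probs(toy_sizes, alpha=0)  # 1 : 1 : 1
uniform_member = cluster_sampling_probs(toy_sizes, alpha=1)   # 10000 : 100 : 1
```

With these toy sizes the largest family is drawn about 90% of the time under square-root weighting, compared with about 99% under member-uniform sampling and 33% under cluster-uniform sampling, which is the middle ground the exponent is meant to provide.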

While the study notes that models trained with MGnify-scale data show improved performance in antibody-antigen prediction, where coevolutionary signals are absent, it raises questions about whether models can learn to explore outside the natural fold manifold. The findings suggest that for enzyme design, engineering active-site neighbourhoods on familiar scaffolds may be more effective than seeking globally novel backbones.
