Meta lets a 15-billion-parameter language model learn to design "brand-new" proteins from scratch; LeCun: the results are amazing

2025-03-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)11/24 Report--

Meta's latest achievement! Its trained language model now plays the role of creator: it can design and generate proteins from scratch. Will the ultimate mystery of life be uncovered by artificial intelligence?

AI has made new progress in biomedicine. Yes, this time it again has to do with proteins.

The difference: in the past, AI predicted protein structures; this time it has begun to design and generate protein structures of its own. If AI used to be a "predictor", it is fair to say it has now evolved into a "creator".

The study comes from the protein research team at FAIR, Meta's AI research lab. Yann LeCun, Meta's long-time chief AI scientist, was quick to share his team's results and spoke highly of them.

The two papers on bioRxiv present what LeCun called "amazing" results in protein design and generation. The system uses a simulated annealing algorithm to find an amino acid sequence that folds in a way that satisfies the desired shape or constraints (such as symmetry).
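The simulated annealing step can be sketched in a few lines. This is a minimal, self-contained illustration, not the papers' actual system: the `energy` function here is a toy stand-in (counting mismatches against a fixed target string) for the real objective, which scores how well a sequence's predicted fold satisfies the desired shape or constraints.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def energy(seq):
    # Toy stand-in for the real objective: count mismatches against a
    # fixed target. The papers instead score how well the predicted
    # fold of `seq` matches the desired shape or constraints.
    target = "MKTAYIAKQRQISFV"
    return sum(a != b for a, b in zip(seq, target))

def anneal(length=15, steps=2000, t_start=2.0, t_end=0.01):
    seq = "".join(random.choice(AMINO_ACIDS) for _ in range(length))
    best_seq, best_e = seq, energy(seq)
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        i = random.randrange(length)
        cand = seq[:i] + random.choice(AMINO_ACIDS) + seq[i + 1:]
        delta = energy(cand) - energy(seq)
        # Metropolis rule: always accept improvements, sometimes accept
        # worse moves early on to escape local minima.
        if delta <= 0 or random.random() < math.exp(-delta / t):
            seq = cand
            if energy(seq) < best_e:
                best_seq, best_e = seq, energy(seq)
    return best_seq, best_e

random.seed(0)
designed, final_energy = anneal()
```

With enough steps the loop drives the energy toward zero; in the real system each energy evaluation involves a far more expensive structure prediction.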

You guessed it: both papers build on ESM2, the large language model for protein structure prediction and discovery that Meta released not long ago.

ESM2 is a large model with 15 billion parameters. As the model scaled from 8 million to 15 billion parameters, the information in its internal representations became sufficient to predict three-dimensional structure at atomic resolution.

By using a large language model to learn evolutionary patterns, accurate structure predictions can be generated end to end directly from protein sequences, while running up to 60 times faster than the current state-of-the-art methods at comparable accuracy.

In fact, with this new structure prediction capability, Meta predicted the structures of more than 600 million metagenomic proteins for its atlas in just two weeks, using a cluster of roughly 2,000 GPUs.

Alex Rives of Meta AI, a co-author of both papers, said that the generality of the ESM2 language model extends beyond natural proteins: it can programmatically generate complex and modular protein structures.

A "special programming language" for protein design

As the saying goes, to do good work one must first sharpen one's tools.

In order to make protein design and generation more efficient, researchers have developed a high-level programming language for protein design on the basis of previous results (mainly ESM2).

Paper address: https://www.biorxiv.org/content/10.1101/2022.12.21.521526v1

This result makes it possible to program the generation of large proteins and complexes with complex and modular structures, said Alex Rives, corresponding author of the paper "A high-level programming language for generative protein design", who led the study, on social media.

Brian Hie, a researcher at Stanford University and one of the authors of the paper, also explained the main ideas and results of this article on Twitter.

In short, the paper describes how generative machine learning enables the modular design of complex proteins, controlled by a high-level programming language for protein design.

The main idea, he says, is not to use sequence or structure building blocks directly, but to place modularity at a higher level of abstraction and let black-box optimization produce the specific design. Each optimization step predicts the structure at the atomic level.

Compared with previous protein design methods, the new approach lets designers specify arbitrary, non-differentiable constraints, ranging from specified atomic coordinates to abstract design goals such as symmetry.

For programmability, it is important that constraints are modular. For example, the figure below shows the same constraint applied hierarchically at two levels of symmetry.

These constraints are also easy to recombine. For example, constraints on atomic coordinates can be combined with symmetry constraints, or two different forms of two-level symmetry can be combined to program an asymmetric composite structure.
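One way to picture this composability: each constraint is an energy term over a candidate structure, and combining constraints just means summing their energies. The sketch below is a toy illustration under that assumption; `coordinate_constraint` and `symmetry_constraint` are hypothetical stand-ins, not the paper's actual implementation.

```python
import math

def pairwise_distances(seg):
    # All intra-segment residue-residue distances
    # (invariant to rotation and translation).
    return [math.dist(seg[i], seg[j])
            for i in range(len(seg)) for j in range(i + 1, len(seg))]

def coordinate_constraint(target_coords):
    # Penalize deviation of each residue from a specified target position.
    def term(coords):
        return sum(math.dist(c, t) for c, t in zip(coords, target_coords))
    return term

def symmetry_constraint(n_fold):
    # Crude n-fold symmetry proxy: every segment should share the
    # internal distance pattern of the first segment.
    def term(coords):
        k = len(coords) // n_fold
        segments = [coords[i * k:(i + 1) * k] for i in range(n_fold)]
        ref = pairwise_distances(segments[0])
        return sum(abs(a - b)
                   for seg in segments[1:]
                   for a, b in zip(pairwise_distances(seg), ref))
    return term

def compile_program(constraints):
    # "Compiling" a program here just means summing its energy terms.
    def total_energy(coords):
        return sum(term(coords) for term in constraints)
    return total_energy

# Two identical triangular segments: a perfectly 2-fold "symmetric" toy.
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0),
          (5, 5, 0), (6, 5, 0), (5, 6, 0)]
program = compile_program([symmetry_constraint(2),
                           coordinate_constraint(coords)])
```

A structure satisfying every term gets energy 0; the optimizer then only ever sees the single summed energy, which is what makes the constraints freely recombinable.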

Brian Hie believes that this result is a step towards more controllable, regular and expressive protein design. He also thanked Meta AI and other partners for their joint efforts.

Making protein design "like building blocks"

In the paper, the researchers argue that protein design would benefit from the regularity, simplicity, and programmability provided by a basic set of abstractions, just like those used in architecture, machines, circuits, and software engineering.

But unlike these artificial creations, proteins cannot be broken down into parts that are easily recombined, because the local structure of a sequence is entangled with its global context. Classical de novo protein design attempts to identify a set of basic structural components and then assemble them into higher-order structures.

Similarly, traditional protein engineering often recombines fragments or domains of natural protein sequences into hybrid chimeras. However, existing methods have not achieved the high combinatorial complexity required for true programmability.

This paper shows that modern generative models achieve the classical goals of modularity and programmability at a new level of combinatorial complexity. Modularity and programmability are placed at a higher level of abstraction, where the generative model bridges the gap between human intuition and the generation of specific sequences and structures.

In this setting, the protein designer only needs to recombine high-level instructions; the task of producing proteins that satisfy them is left to the generative model.

The researchers propose a programming language for generative protein design that lets designers specify intuitive, modular, hierarchical programs. High-level programs are translated into low-level sequences and structures by a generative model. The approach leverages advances in protein language models, which can learn structural information and protein design principles.

The implementation in this study is based on an energy-based generative model, as shown in the figure above.

First, a protein designer specifies a high-level program consisting of a set of hierarchical constraints (figure A).

The program is then compiled into an energy function that evaluates compatibility with the constraints, which can be arbitrary and non-differentiable (figure B).

Structural constraints are applied by incorporating atomic structure prediction (enabled by the language model) into the energy function. This method can generate a wide range of complex designs (figure C).

Generating protein sequences from scratch

In the paper "Language models generalize beyond natural proteins", Tom Sercu of the Meta AI team said the work accomplishes two main tasks.

Paper address: https://www.biorxiv.org/content/10.1101/2022.12.21.521521v1

The first task is designing a sequence for a given backbone structure. With the language model, the design succeeded for every target; without the language model, sequence design succeeded for only about 1 in 20.

The second task is unconstrained generation. The team proposed a new method to sample (sequence, structure) pairs from the energy landscape defined by the language model.

Sampling across different topologies further improves the experimental success rate (up to 71 out of 129, or 55%).

To demonstrate that the generated proteins go beyond natural proteins, the team searched for the language-model-generated sequences in a sequence database covering all known natural proteins.

The search turned up no matches, and the predicted structures of the generated sequences differ from those of natural sequences.

Sercu said that protein structures can be designed using the ESM2 protein language model alone. The team experimentally tested 228 proteins, with a success rate of 67%!

Sercu believes that protein language models trained only on sequences can learn deep patterns that connect sequences and structures, and can be used to design proteins from scratch, beyond the design space of natural exploration.

Exploring the deep grammar of proteins

In the paper, Meta researchers say that although the language model is trained only on sequences, it can still design proteins whose deep grammatical structure goes beyond the limits of natural proteins.

If the square in figure A represents the space of all protein sequences, natural protein sequences occupy the gray region, covering only a small part of it. To extend beyond natural sequences, the language model needs access to the underlying design patterns.

The team needed to do two things: first, design protein backbones from scratch (de novo); second, generate protein sequences from scratch.

The team trained ESM2 as a masked language model on millions of diverse natural proteins spanning evolution.
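The masked-language-model objective can be illustrated as follows. This sketches only the data-side masking step, in the BERT style, with an assumed 15% mask rate; it is not ESM2's actual training code.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "<mask>"

def mask_sequence(seq, rate=0.15, seed=0):
    # BERT-style masking, in the spirit of ESM2's training objective:
    # hide a fraction of residues and ask the model to recover each
    # hidden residue from the surrounding sequence context.
    rng = random.Random(seed)
    tokens, labels = [], []
    for aa in seq:
        if rng.random() < rate:
            tokens.append(MASK)   # the model sees a blank here...
            labels.append(aa)     # ...and is trained to predict this residue
        else:
            tokens.append(aa)
            labels.append(None)   # no loss at unmasked positions
    return tokens, labels

tokens, labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

To predict a hidden residue well, the model must exploit statistical dependencies across the whole sequence, which is where the structural information discussed below comes from.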

After training, information about the tertiary structure of proteins can be identified in the model's internal attention states. Through a linear projection, the researchers converted the attention between a pair of positions in a protein sequence into a distribution over inter-residue distances.

The researchers say the ability of language models to predict protein structure points to a deeper structure underlying natural protein sequences, and to the possibility of a deep grammar that such models can learn.

The results suggest that, over the course of evolution, the vast number of protein sequences has come to encode biological structure and function, revealing the design and construction of proteins, and that this construction can be reproduced by machine models that learn from protein sequences.

The deep grammar of proteins, which the language model successfully predicted across six experiments, reconciles two seemingly contradictory sets of findings: the model's understanding of natural proteins depends on its training data, yet the model can predict and explore beyond the known families of natural proteins.

If the scaling laws of protein language models continue to hold, the generative capability of AI language models can be expected to keep improving.
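A scaling law of this kind is usually a power law, loss ≈ a·N^(−b) in the parameter count N. The sketch below fits such a curve by least squares in log-log space, purely to illustrate the extrapolation; the (parameter count, loss) pairs are made up for this example and are not Meta's numbers.

```python
import math

# Made-up (parameter count, validation loss) pairs for illustration only.
points = [(8e6, 10.0), (35e6, 8.1), (150e6, 6.6),
          (650e6, 5.3), (3e9, 4.3), (15e9, 3.5)]

# Fit loss ~ a * N**(-b) by ordinary least squares in log-log space.
xs = [math.log(n) for n, _ in points]
ys = [math.log(loss) for _, loss in points]
n = len(points)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
b = -sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = math.exp(mean_y + b * mean_x)

def predicted_loss(n_params):
    # Extrapolate the fitted power law to a larger model.
    return a * n_params ** (-b)
```

If the fitted exponent b stays positive at larger scales, loss keeps falling as models grow, which is the premise behind expecting further gains.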

The team said that, thanks to the underlying grammar of protein structure, machine models will come to learn rarer protein structures, expanding the models' predictive power and exploration space.

A year ago, DeepMind open-sourced AlphaFold2, causing a sensation across Nature, Science, and the biology and AI communities.

A year later, AI prediction models have sprung up like bamboo shoots after a spring rain, rapidly filling gaps in the field of protein structure.

If human beings give artificial intelligence life, is artificial intelligence the last piece of jigsaw puzzle to complete the mysteries of human life?

Reference:

https://twitter.com/TomSercu/status/1606075975891972096

https://twitter.com/BrianHie/status/1606074806620737536

https://www.biorxiv.org/content/10.1101/2022.12.21.521521v1

https://www.biorxiv.org/content/10.1101/2022.12.21.521526v1

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era). Editor: Editorial Department.
