New AI language model nach0 bridges biomedical text and chemical data
Researchers from Insilico Medicine and NVIDIA have developed a novel large language model (LLM) called nach0 that can understand and generate both biomedical text and chemical structural data.
This multi-domain, multi-task transformer was trained on a massive dataset combining PubMed abstracts, patent descriptions, and simplified molecular-input line-entry system (SMILES) representations of chemical structures.
While existing biomedical LLMs like BioBERT and SciFive excel at natural language processing of biomedical text, they lack integrated abilities to work with chemical structure data. Conversely, chemically aware models to date have had limited training on diverse biomedical text sources. Nach0 is the first LLM designed from the ground up to operate fluently across both domains.
The model was trained on 100 million biomedical documents from PubMed and patent sources totalling 355 million text tokens, as well as 2.9 billion patent SMILES strings converted into 4.7 billion chemical tokens. Special annotation symbols were used to encode the SMILES representations.
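To make the idea of chemical tokens concrete, here is a minimal sketch of how a SMILES string might be split into tokens and wrapped in annotation symbols so that they cannot be confused with ordinary words. The regular expression is a commonly used SMILES tokenization pattern, and the `<sm_...>` wrapper and the function names are assumptions chosen for illustration. The exact scheme nach0 uses is defined in its pre-processing scripts.

```python
import re

# A commonly used pattern for splitting SMILES into atom, bond, ring and
# branch tokens. Illustrative only; not necessarily the pattern nach0 uses.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, rejecting unrecognized characters."""
    tokens = SMILES_TOKEN_RE.findall(smiles)
    # If any character was skipped, the tokens will not rebuild the input.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES string: {smiles!r}")
    return tokens


def annotate_smiles(smiles, prefix="<sm_", suffix=">"):
    """Wrap every SMILES token in an annotation symbol (assumed format)."""
    return "".join(f"{prefix}{tok}{suffix}" for tok in tokenize_smiles(smiles))


if __name__ == "__main__":
    # Acetyl chloride: "Cl" stays one token rather than "C" followed by "l".
    print(tokenize_smiles("CC(=O)Cl"))
    print(annotate_smiles("CC(=O)Cl"))
```

Marking each token this way keeps the chemical vocabulary separate from the text vocabulary, so a "C" in a molecule is never treated as the letter C in a sentence.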
What tasks can nach0 perform?
Nach0 can perform a variety of tasks spanning natural language processing like document classification and question answering, molecular property prediction, molecular generation, reagent prediction, and cross-domain capabilities such as description-guided molecular design.
In benchmark evaluations, nach0 significantly outperformed general LLMs like ChatGPT on molecular tasks while delivering competitive performance on biomedical text processing compared to models such as FLAN and SciFive.
Automating molecular discovery
Case studies demonstrated nach0’s ability to generate molecular structures for potential diabetes drugs based on prompts describing the desired biological activity, mechanism of action, synthesis route, and properties. The model produced chemically sensible molecules satisfying the criteria within minutes.
“Nach0 represents a major advance in automating molecular discovery and design through natural language interaction,” said Alex Zhavoronkov, CEO of Insilico Medicine. “We envision further enhancing it with protein sequence data and using transfer learning to specialize for new applications.”
The model leverages the NVIDIA BioNeMo platform, taking advantage of its data loading, natural language processing, and generative AI capabilities optimized for biology and chemistry workloads.
As models like nach0 continue to evolve, they may provide powerful molecular design assistance while reducing the need for extensive human supervision compared to traditional computational chemistry methods.
“We anticipate that as nach0 evolves, it will require less supervision, and it will be able to simply generate and validate promising therapeutic options for medicinal chemists,” said Maksim Kuznetsov, a senior research scientist at Insilico and one of the paper’s lead authors.
The nach0 framework is available for research purposes:
nach0 base is available via: https://huggingface.co/insilicomedicine/nach0_base;
nach0 large is available via: https://huggingface.co/insilicomedicine/nach0_large;
for pre-processing scripts, see: https://github.com/insilicomedicine/nach0.
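For readers who want to try the released checkpoints, the sketch below shows one way they might be loaded with the Hugging Face Transformers library. It assumes the checkpoints are encoder-decoder models compatible with the library's `AutoTokenizer` and `AutoModelForSeq2SeqLM` classes, which this article does not confirm. The helper names are hypothetical, and the prompt format each task expects should be taken from the model cards and pre-processing scripts.

```python
# Checkpoint names taken from the Hugging Face URLs listed above.
NACH0_CHECKPOINTS = {
    "base": "insilicomedicine/nach0_base",
    "large": "insilicomedicine/nach0_large",
}


def checkpoint_id(size="base"):
    """Map a size label to its Hugging Face repository name."""
    try:
        return NACH0_CHECKPOINTS[size]
    except KeyError:
        raise ValueError(f"unknown nach0 size: {size!r}") from None


def load_nach0(size="base"):
    """Download and return (tokenizer, model) for the chosen checkpoint.

    Assumption: the checkpoint loads through the generic seq2seq auto classes.
    """
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_id = checkpoint_id(size)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def generate(prompt, tokenizer, model, max_new_tokens=64):
    """Run one prompt through the model and decode the generated text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `load_nach0("base")` downloads the smaller checkpoint on first use. The returned pair can then be passed to `generate` along with a task prompt.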
Read the research paper: “nach0: multimodal natural and chemical languages foundation model,” published in Chemical Science. doi: https://doi.org/10.1039/D4SC00966E