A collaboration between NASA’s Interagency Implementation and Advanced Concepts Team (IMPACT) and IBM (International Business Machines) has produced INDUS, a suite of large language models (LLMs) built for five science domains: Earth science, biological and physical sciences, heliophysics, planetary sciences, and astrophysics. The models are intended to support scientific research and knowledge discovery in these fields.
The INDUS suite consists of encoders and sentence transformers trained on scientific corpora curated from diverse data sources. The encoders convert natural language text into numerical representations that a model can process. They were trained on a corpus of 60 billion tokens spanning astrophysics, planetary science, Earth science, biological and physical sciences, and heliophysics. The IMPACT-IBM team also developed a custom tokenizer that recognizes scientific terms, which improves the models' overall performance.
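To illustrate why a domain-aware vocabulary matters, here is a minimal sketch of greedy longest-match subword tokenization. It is a toy, not the INDUS tokenizer: the function `wordpiece`, both vocabularies, and the `##` continuation prefix are assumptions chosen for illustration, and the article does not say which tokenization algorithm INDUS uses.

```python
def wordpiece(word, vocab):
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabularies: a general one, and one extended with science terms.
GENERAL_VOCAB = {"helio", "##physics", "magnet", "##o", "##sphere"}
SCIENCE_VOCAB = GENERAL_VOCAB | {"heliophysics", "magnetosphere"}

for term in ("heliophysics", "magnetosphere"):
    print(term, wordpiece(term, GENERAL_VOCAB), wordpiece(term, SCIENCE_VOCAB))
```

With the general vocabulary each term is broken into several fragments, while the science vocabulary keeps it as a single token, so the encoder sees one unit per scientific term.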
INDUS models outperform open-source LLMs on benchmarks for biomedical tasks, scientific question-answering, and Earth science entity recognition. The team fine-tuned the sentence transformer models on millions of text pairs, including titles paired with abstracts and questions paired with answers. INDUS can process researcher queries and retrieve the relevant information needed to answer them accurately, which demonstrates its usefulness for scientific knowledge retrieval.
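The retrieval step described here can be sketched as embedding a query and a set of passages, then ranking passages by cosine similarity. This is a minimal sketch under loud assumptions: `embed` is a stand-in bag-of-words encoder, not an INDUS sentence transformer, and the passages and query are invented for illustration. A real sentence transformer would return dense vectors, but the ranking logic would look much the same.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in encoder (assumption): word counts instead of a learned dense vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, top_k=1):
    """Return the top_k passages most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(query_vec, embed(p)), reverse=True)
    return ranked[:top_k]


# Hypothetical passages and query, for illustration only.
PASSAGES = [
    "solar wind interacts with the magnetosphere of earth",
    "microgravity affects plant growth on the space station",
    "sea surface temperature is measured by satellite radiometers",
]

print(retrieve("how does microgravity affect plant growth", PASSAGES))
```

Note that the stand-in encoder only matches exact words, so "affect" and "affects" do not count as a match. Fine-tuning on text pairs is meant to produce embeddings that place such related queries and passages close together.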
INDUS has been adopted by several NASA groups, including the Biological and Physical Sciences (BPS) Division and the Goddard Earth Sciences Data and Information Services Center (GES-DISC). Applications built on it include chatbots for intuitive data search, dataset categorization, and knowledge graph integration. These applications make it easier and faster for users to find and analyze scientific data.
In line with NASA and IBM’s commitment to open and transparent artificial intelligence, the INDUS models are openly available on Hugging Face. The team has also released benchmark datasets for named entity recognition, extractive question answering (QA), and information retrieval. The INDUS encoder models can be adapted to applications across science domains, and the sentence transformer models can serve as the retrieval component in retrieval-augmented generation (RAG) applications.
The NASA-IBM collaboration on INDUS marks a significant milestone for scientific research and knowledge discovery. Its domain-specific training, benchmark performance, and range of applications show how such models could change the way researchers access, analyze, and draw insights from large volumes of scientific data. Releasing the models and benchmarks openly extends those benefits to the global scientific community.