
Google has announced DataGemma, the first open models designed to connect LLMs with extensive real-world data drawn from Google's Data Commons. The models answer questions involving numerical facts more accurately than the base models they build on.
DataGemma addresses one of modern artificial intelligence's most significant problems: hallucinations in large language models (LLMs). The models come in two variants, each built around a different grounding method: DataGemma-RAG-27B-IT and DataGemma-RIG-27B-IT.
Data Commons: At the Heart of DataGemma
DataGemma answers queries using information from Data Commons, a publicly available knowledge graph containing more than 240 billion data points across a vast range of statistical variables. It sources data from organizations such as the United Nations, the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), and national census bureaus.
By combining these datasets into one unified set of tools and AI models, Data Commons helps policymakers, researchers, and enterprises looking for precise insights.
Data Commons includes information on a wide range of topics, including economics, public health, environmental statistics, and demographic trends. Through a natural-language interface, users can engage with this massive dataset by posing questions such as which nations have made the greatest progress in expanding access to renewable energy, or how income levels correlate with health outcomes in particular areas.
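For readers who want to explore the graph directly, Data Commons also offers a public Python client. The minimal sketch below assumes the `datacommons` package (`pip install datacommons`); the place identifier `geoId/06` (California) and the `Count_Person` statistical variable are illustrative choices, not examples from Google's announcement.

```python
import datacommons as dc

# Fetch the latest value of a statistical variable for a place.
# "geoId/06" is the Data Commons identifier for California.
population = dc.get_stat_value("geoId/06", "Count_Person")
print(f"Latest population of California: {population}")

# Fetch the full time series for the same variable: a dict
# mapping observation dates to values.
series = dc.get_stat_series("geoId/06", "Count_Person")
for date, value in sorted(series.items()):
    print(date, value)
```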
According to the company, DataGemma will enhance the capabilities of Gemma models by leveraging Data Commons knowledge and improve LLM reasoning and factuality through two different methods: RIG (Retrieval-Interleaved Generation) and RAG (Retrieval-Augmented Generation).
RIG augments Google's Gemma 2 language model by interleaving generation with queries to trusted sources: when the model produces a statistical claim, the corresponding figure is fetched from Data Commons so the claim can be fact-checked against real data. With RAG, relevant material from Data Commons is retrieved before generation begins, letting the language model take in context beyond its training set and produce outputs that are more comprehensive and informative.
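To make the distinction concrete, here is a minimal, self-contained sketch of both patterns. The `FACT_STORE` dictionary, the `[DC(...)]` marker syntax, and all figures are stand-ins invented for illustration; DataGemma's actual annotation format and retrieval pipeline differ.

```python
import re

# Toy stand-in for a Data Commons lookup (illustrative values only).
FACT_STORE = {
    "population of california 2022": "39,029,342",
}

def dc_lookup(query: str) -> str:
    return FACT_STORE.get(query.lower(), "[no data found]")

# --- RIG: the model emits inline queries next to its own numbers; a
# post-processing step resolves each query against the fact store so a
# trusted figure can confirm or replace the model's guess.
def resolve_rig(draft: str) -> str:
    return re.sub(r"\[DC\((.+?)\)\]", lambda m: dc_lookup(m.group(1)), draft)

draft = "California had roughly 39 million residents ([DC(population of california 2022)])."
print(resolve_rig(draft))

# --- RAG: relevant statistics are retrieved *before* generation and
# prepended to the prompt, so the model answers from grounded context.
def build_rag_prompt(question: str) -> str:
    facts = "\n".join(f"{k}: {v}" for k, v in FACT_STORE.items())
    return f"Use only these statistics:\n{facts}\n\nQuestion: {question}\nAnswer:"

print(build_rag_prompt("What was California's population in 2022?"))
```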
Outcomes and Future Directions
While research on the RIG and RAG approaches is still in its early stages, initial results indicate promising gains in the accuracy of LLMs when handling numerical facts. Google aims to continue refining both approaches, scale up these efforts, put them through rigorous testing, and eventually integrate the expanded functionality into both Gemma and Gemini models.
DataGemma models are available to researchers and developers: Google has released the model weights on Hugging Face.
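As a hedged starting point, the checkpoints can be loaded with the standard Hugging Face `transformers` workflow. The model ID below reflects the Hugging Face listing but should be verified on the model page, where access may require accepting the license; `device_map="auto"` additionally needs the `accelerate` package, and a 27B model demands substantial GPU memory.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID per the Hugging Face listing; verify on the model page.
model_id = "google/datagemma-rag-27b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" shards the model across available devices.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Which US states saw the largest population growth last decade?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```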
Stay Tuned to The Future Talk for more AI news and insights!