Semantic Croissant
Slava Tykhonov has been an innovator in the Graph+AI space for decades. He has been a lead R&D engineer, and is now an ambassador, of Dataverse — a distributed public repository of datasets enabling science across the world. Today Slava is Head of AI and Interoperability at CODATA, the Committee on Data of the International Science Council, which builds data and AI infrastructure for scientific ecosystems.
Alexy discusses Graph AI with Slava Tykhonov
In this interview, we go over various technologies that form the semantic layer for Graph AI. First of all, datasets must be understandable for both humans and agents. There should be rich metadata that describes the format, nature, provenance, licensing, and other aspects of the underlying data. Croissant is a format for describing datasets, now stewarded by ML Commons. Slava is one of the original authors of the Croissant 1.0 standard.
Ontologies enrich the context that agents give the LLMs. But using ontologies properly across domains requires alignment. The Cross-Domain Interoperability Framework (CDIF) addresses that problem.
Zenodo description of CDIF: https://zenodo.org/records/17711820
A big missing piece of the original Croissant is semantics, such as ontological categorization in subject domains. The JSON-LD format can represent arbitrarily nested terms, but you still need to figure out who defines and maintains domain ontologies, where the authoritative ones reside, and how to discover them and route domain-specific queries to them.
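To make this concrete, here is a minimal sketch of a Croissant-style JSON-LD record extended with an ontology annotation. The field names follow schema.org conventions used by Croissant, but the dataset itself, the ontology namespace, and the concept IRI are purely illustrative:

```python
import json

# Hypothetical sketch of a Croissant-style JSON-LD record extended with a
# domain ontology term. The dataset name, ontology namespace, and concept
# IRI are illustrative, not part of any published vocabulary.
dataset = {
    "@context": {
        "@vocab": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "Dataset",
    "name": "coffee-market-prices",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # Semantic extension: link the dataset subject to an authoritative
    # ontology concept so agents can route domain-specific queries to it.
    "about": {
        "@id": "http://example.org/ontology/agriculture#CoffeeCommodity",
        "name": "coffee commodity",
    },
}

print(json.dumps(dataset, indent=2))
```

The `about` block is where the semantic layer lives: an agent that resolves the concept IRI learns which domain authority to consult for that subject.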
Semantic Croissant adds ontologies, authority discovery, and domain routing to Croissant. Aligning business domains raises the question of access: many of these domains are proprietary, with the ontologies reflecting proprietary information and queries touching privileged data. It’s important to track queries and also enable fine-grained permissions for the agents running them.
Enter DID and ODRL. Both are established technologies coming from the W3C. DIDs are Decentralized IDs, globally unique and traceable: https://www.w3.org/TR/did-1.0/. ODRL is the Open Digital Rights Language: https://www.w3.org/TR/odrl-model/. DID can carry a payload. Slava wraps every prompt in a DID, and a response can be wrapped as well. This way, a complete trace and provenance of the query session is preserved. Since the DID identifies and proves ownership, the ODRL descriptor can be attached to control access in detail, enabling the agent to access relevant data.
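The wrapping pattern described above can be sketched as follows. This is an illustrative envelope, not a conformant DID-method or ODRL implementation: the `did:example:` identifiers, the agent DID, and the policy fields are assumptions modeled loosely on the W3C vocabularies.

```python
import json
import uuid
from datetime import datetime, timezone

def wrap_prompt(prompt, owner_did):
    """Wrap an LLM prompt in a DID-identified envelope with an ODRL policy.

    Illustrative only: mints a did:example identifier via UUID rather than
    a real DID method, and attaches a minimal ODRL-style agreement."""
    prompt_did = f"did:example:{uuid.uuid4()}"
    return {
        "id": prompt_did,
        "controller": owner_did,          # who owns and can prove this prompt
        "created": datetime.now(timezone.utc).isoformat(),
        "payload": {"prompt": prompt},
        # ODRL-style policy: only the named agent may execute this prompt.
        "odrl:hasPolicy": {
            "@type": "odrl:Agreement",
            "odrl:permission": [{
                "odrl:target": prompt_did,
                "odrl:assignee": "did:example:agent-42",
                "odrl:action": "odrl:execute",
            }],
        },
    }

envelope = wrap_prompt("Predict coffee prices for Q3.", "did:example:slava")
print(json.dumps(envelope, indent=2))
```

A response can be wrapped the same way, with its envelope pointing back at the prompt DID, which is what gives the full session its provenance trace.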
Finally, we talk about Palefire, a Graph+Vector AI system that allows for more precise Q&A. It can be used to answer fairly involved and specific questions, such as predicting the price of coffee given market and weather uncertainty.
https://github.com/agstack/palefire
The full transcript follows.
Summary
Alexy Khrabrov, founder of the Community Research Center for Reliable AI, and Vyacheslav Tykhonov, Head of AI and Interoperability at CODATA, discussed CODATA's use of graph approaches, proprietary models, and multilingual frameworks to enhance AI reliability, particularly addressing issues like disambiguation and hallucination. Vyacheslav Tykhonov detailed CODATA's projects, including the "ask dataverse" question-answering system for the Dataverse repository, the Croissant standard (including semantic extensions for multilingual search), and the Cross-Domain Interoperability Framework (CDIF), which uses graph-expressed variables to facilitate data exchange between domains. The conversation also covered the use of Decentralized Identifiers (DID) and Open Digital Rights Language (ODRL) to trace LLM interactions for prompt marketplaces, and the Pale Fire project, which integrates Qdrant vector-graph storage for complex predictions and enhanced concept identity resolution.
Details
Introductions and Background: Alexy Khrabrov, a community leader in open source science and the founder of the Community Research Center for Reliable AI at Northeastern University, introduced Vyacheslav Tykhonov (00:00:00). Vyacheslav Tykhonov, referred to as Slava, is the Head of AI and Interoperability at CODATA, an R&D engineer, an ambassador of Dataverse, and co-author of the Croissant standard 1.0 for machine learning data annotation, as well as the creator of the Semantic Croissant standard. Their discussion focused on using graph approaches to enhance the reliability of AI, specifically addressing issues like disambiguation, hallucinations, and tracing interactions with large language models (LLMs) and various data sets (00:08:41).
Overview of CODATA's Mission and Projects: CODATA, part of the International Science Council, provides advisory services on data and artificial intelligence to governmental and institutional organizations, including supporting the United Nations with projects like detecting hazards. They use AI, train proprietary models, and develop multilingual frameworks to translate and recognize hazard information in various languages automatically. CODATA recently secured two European Commission-funded projects related to AI, which cover Croissant and a cross-domain interoperability framework (00:09:52).
Dataverse and Innovations in Search Reliability: Dataverse is an open-source data repository originally maintained by Harvard University since 2006, functioning as a platform where researchers can upload, describe, and publish data sets to make them citable (00:12:17). Vyacheslav Tykhonov, as an ambassador of Dataverse, developed an innovative search tool that uses graph data for more precise querying (00:11:11). This tool was initially a straightforward application that connected an LLM and ingested data from Croissant, using it as a JSON-LD navigation mechanism to locate information efficiently, such as authors or keywords, within the graph (00:13:49).
The Ask Dataverse Service and Distributed Querying: Following the initial implementation, the search tool was extended into a question-answering system called "ask dataverse," allowing users to chat with the data repository and ask questions about the data sets. The service is implemented in a distributed manner, functioning as a navigation system similar to a car's GPS, where Croissant metadata guides the service to find information across different countries and repositories (00:15:10). This allows the service to route queries properly based on the Croissant metadata, which currently indexes around 750,000 distributed data sets (00:17:06).
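The "navigation system" idea above can be sketched as a simple router: given an index of Croissant metadata for distributed repositories, route a query only to the repositories whose metadata matches it. The index entries and repository URLs below are made up for illustration:

```python
# Illustrative router sketch: use Croissant metadata as a navigation layer
# to decide which distributed repositories can answer a query.
# The index entries and repository URLs are invented for this example.
CROISSANT_INDEX = [
    {"repo": "https://dataverse.nl", "name": "dutch-census",
     "keywords": ["census", "population"]},
    {"repo": "https://dataverse.harvard.edu", "name": "coffee-prices",
     "keywords": ["coffee", "commodities"]},
]

def route_query(term):
    """Return repositories whose Croissant metadata mentions the term."""
    term = term.lower()
    return [
        entry["repo"]
        for entry in CROISSANT_INDEX
        if term in entry["name"].lower()
        or term in (k.lower() for k in entry["keywords"])
    ]

print(route_query("coffee"))  # ['https://dataverse.harvard.edu']
```

A real deployment would of course query a much richer metadata graph (the interview mentions roughly 750,000 indexed datasets), but the routing principle is the same: metadata first, data second.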
Croissant Standard and Semantic Extensions: Croissant is an annotation standard defining a JSON metadata structure that describes various properties of a data set (00:18:05). A critical proposed extension is the inclusion of controlled vocabulary support and ontology alignment to address shortcomings in the original 1.0 specification. This extension is crucial for building multilingual search capabilities, as it references controlled vocabularies that contain all translations and variations of specific terms, allowing AI models to use the relationships to create a graph for precise data understanding (00:19:16).
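The multilingual mechanism can be illustrated with a toy controlled vocabulary: each concept IRI carries its translations, so a query in any listed language resolves to the same graph node. The concept IRI and labels here are invented:

```python
# Toy controlled vocabulary: one concept IRI with its translations, so a
# multilingual query resolves to the same graph node. The IRI and labels
# are illustrative, not from a published vocabulary.
VOCABULARY = {
    "http://example.org/vocab/temperature": {
        "en": "temperature", "nl": "temperatuur", "de": "Temperatur",
    },
}

def resolve(label):
    """Map any translated label back to its canonical concept IRI."""
    label = label.lower()
    for concept, labels in VOCABULARY.items():
        if label in (v.lower() for v in labels.values()):
            return concept
    return None

print(resolve("Temperatuur"))  # http://example.org/vocab/temperature
```

The payoff is that a Dutch query and an English query land on the same concept, and the graph relationships attached to that concept apply to both.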
Cross-Domain Interoperability Framework (CDIF) and Semantic Croissant: The Cross-Domain Interoperability Framework (CDIF) was created to facilitate the transparent transition of data between different domains. CDIF defines semantics, allowing data sets to be represented with the different ontologies and controlled vocabularies used in specific domains, ensuring precise understanding of the data content and relationships (00:20:28). Data is repackaged in a structured format, supplied with Croissant for provenance information, and incorporates a concept called "variable cascade," which defines complex indicators using very precise variables including units of measurement and class hierarchies, expressed as a graph (00:21:41).
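One way to picture the variable cascade is an indicator defined from precise variables, each carrying a unit and a position in a class hierarchy, flattened into graph triples. The class names and the indicator below are invented for illustration, not taken from CDIF:

```python
from dataclasses import dataclass, field

# Hedged sketch of the "variable cascade" idea: a complex indicator is
# defined from precise variables that carry units of measurement and a
# class hierarchy, so the whole definition is expressible as a graph.
# The indicator, variables, and class names are illustrative.
@dataclass
class Variable:
    name: str
    unit: str           # e.g. a unit code such as "Cel" for Celsius
    parent_class: str   # position in the domain class hierarchy

@dataclass
class Indicator:
    name: str
    variables: list = field(default_factory=list)

    def as_triples(self):
        """Flatten the cascade into subject-predicate-object triples."""
        triples = []
        for v in self.variables:
            triples.append((self.name, "hasVariable", v.name))
            triples.append((v.name, "hasUnit", v.unit))
            triples.append((v.name, "subClassOf", v.parent_class))
        return triples

heat_stress = Indicator("heat-stress-index", [
    Variable("air_temperature", "Cel", "MeteorologicalMeasure"),
    Variable("relative_humidity", "%", "MeteorologicalMeasure"),
])
print(heat_stress.as_triples())
```

Because the definition bottoms out in units and classes, two domains that disagree on the indicator name can still verify they are computing it from the same measurable quantities.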
Ontology Alignment and Transformation for Domain Mapping: Ontology alignment is the process of creating semantic mappings between different standards to enable data transformation, which is essential when integrating data from different domains. For example, converting metadata from a standard like CodeMeta to Croissant requires mapping fields, such as 'title' to 'name' and 'license' to 'digital properties' (00:22:51). This process involves creating and applying transformation steps to the JSON metadata so that the data is ready for consumption by other applications and is represented within a queryable graph (00:23:49).
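A minimal version of such a transformation step is a field map applied to JSON metadata. The mapping below is deliberately simplified (real CodeMeta-to-Croissant alignment involves nested structures, not just renames):

```python
# Simplified sketch of an ontology-alignment transform: a field mapping
# from a CodeMeta-like record to a Croissant-like one, applied to JSON
# metadata. The mapping is illustrative, not a complete crosswalk.
FIELD_MAP = {
    "title": "name",
    "author": "creator",
    # 'license' keeps its name here; per the discussion it may instead
    # land under digital-properties metadata in a fuller mapping.
}

def align(record, field_map):
    """Rename mapped fields; pass unmapped fields through unchanged."""
    return {field_map.get(k, k): v for k, v in record.items()}

codemeta = {"title": "palefire", "license": "Apache-2.0", "version": "0.1"}
print(align(codemeta, FIELD_MAP))
# {'name': 'palefire', 'license': 'Apache-2.0', 'version': '0.1'}
```

Chaining several such steps, with validation in between, is what gets the metadata into a shape other applications and the query graph can consume.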
Importance of Units of Measure for Scientific Data: Defining variables with precise units of measure is crucial for scientific data, especially in fields like climate change research, which often tracks temperature in different scales such as Fahrenheit, Celsius, or Kelvin (00:23:49). Harmonization and transformation of units of measurement are essential steps before integrating data from diverse sources, which is particularly challenging since units of measurement are frequently not explicitly indicated in the original data (00:24:54). A human-in-the-loop process, aided by AI predictions for units of measurement, is necessary to verify the data and ensure it is "AI ready" for integration (00:25:53).
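The harmonization step for the temperature example is straightforward to sketch: normalize readings reported in Fahrenheit, Celsius, or Kelvin to one scale before integration. The hard part in practice, as noted above, is knowing the unit at all; here it is assumed to be present:

```python
# Sketch of unit harmonization before data integration: normalize
# temperatures reported in Fahrenheit, Celsius, or Kelvin to Celsius.
# Assumes the unit is known; in real data it often must be predicted
# and verified by a human in the loop first.
def to_celsius(value, unit):
    unit = unit.upper()
    if unit in ("C", "CELSIUS"):
        return value
    if unit in ("F", "FAHRENHEIT"):
        return (value - 32.0) * 5.0 / 9.0
    if unit in ("K", "KELVIN"):
        return value - 273.15
    raise ValueError(f"unknown unit: {unit!r}")

# Mixed-unit observations, as they might arrive from different sources.
readings = [(212.0, "F"), (100.0, "C"), (373.15, "K")]
print([round(to_celsius(v, u), 2) for v, u in readings])  # [100.0, 100.0, 100.0]
```

The `ValueError` branch is the important one: a record whose unit cannot be determined should be flagged for human review rather than silently integrated.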
Semantic Croissant and the Decentralized Identifier (DID) Architecture: To address the need for aligning large, proprietary domain ontologies (e.g., in automotive supply chains) and tracing LLM interactions, the concept of a Decentralized Identifier (DID) coupled with the Open Digital Rights Language (ODRL) was introduced (00:29:58). The DID serves as a globally unique, verifiable identifier, acting as a "glue" in the knowledge graph to connect various knowledge sources and concepts, even if the knowledge remains in proprietary systems. ODRL defines access permissions, ensuring that proprietary knowledge can be queried only by authorized entities, which can include non-human AI agents (00:30:51) (00:33:07).
DID Application in Tracing LLM Interactions and Building Prompt Marketplaces: The DIDs are being used to trace and snapshot LLM interactions, assigning a unique identifier to every LLM prompt to make it globally resolvable and shareable. This system allows for the creation of reference models, the training of new models, and the construction of a graph that can be used to compress knowledge for smaller models or implement specific skills (00:34:17) (00:38:02). The wrapped, verifiable prompt allows for the vision of a marketplace or repository of skills and prompts, where ownership and intellectual property can be proven and access restricted using ODRL (00:40:20).
The Pale Fire Project and Vector-Graph Integration: The Pale Fire project, donated to the Linux Foundation's AgStack project, originated from an AI-powered framework called "ghostwriter engine" used for the ask dataverse service. This engine integrates both graph and vector storage technologies in a complementary way, allowing for similarity measures between the knowledge graph and the vector store to determine concept identity (00:41:16). Pale Fire's architecture enables the combination of structured data from the knowledge graph with information from the vector store to improve accuracy and understand if an AI-discovered concept is synonymous with an accepted official terminology (00:44:33).
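The concept-identity check can be illustrated with a similarity threshold: compare the embedding of an AI-discovered term against embeddings of official vocabulary concepts, and treat a close-enough match as the same graph node. The vectors, the concept name, and the threshold below are toy assumptions, not Pale Fire's actual internals:

```python
import math

# Illustrative concept identity resolution in the Pale Fire spirit:
# compare a candidate embedding against official vocabulary embeddings;
# above a threshold, treat them as the same concept. The vectors and
# threshold are toy values, not real model output.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

OFFICIAL = {"sea_surface_temperature": [0.9, 0.1, 0.2]}

def resolve_identity(candidate, threshold=0.85):
    """Return the matching official concept, or None if nothing is close."""
    best = max(OFFICIAL, key=lambda c: cosine(candidate, OFFICIAL[c]))
    return best if cosine(candidate, OFFICIAL[best]) >= threshold else None

print(resolve_identity([0.88, 0.12, 0.25]))  # sea_surface_temperature
```

When the resolver returns a match, the AI-discovered term can be merged into the knowledge graph under the official concept instead of spawning a duplicate node.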
Use Cases for Pale Fire and Future Predictions: Pale Fire's ability to transition and compress data between a knowledge graph and a vector store can be used for complex use cases, such as describing phenomena like earthquakes from spreadsheet data or predicting coffee prices based on news and reports (00:45:38). This approach collects and ingests information into a knowledge graph, applies ontologies and controlled vocabularies, and uses LLMs to provide predictions. The technology is highly generalizable and can be used for various prediction systems and deep research by connecting non-obvious entities and relationships (00:46:42).
Selection of Qdrant as a Vector Database: Qdrant was selected for its vector-graph integration capability following contact with its CEO, which inspired a move toward Retrieval-Augmented Generation (RAG) (00:48:07). Qdrant is seen as an efficient and fast solution for AI agents, especially when optimized for high-speed performance by running the database in memory (00:49:41). Qdrant's capability to use DIDs as block identifiers and facilitate neighborhood queries makes it essential for connecting AI agents and discovering unknown information that looks "something like that" (00:50:48).
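The storage pattern described — vectors carrying a DID in their payload, queried by nearest-neighbor similarity — can be mimicked in a few lines of stdlib Python. A real system would use Qdrant itself (which, for what it's worth, does support running fully in memory); this sketch only shows the shape of the pattern, with invented DIDs and toy two-dimensional vectors:

```python
import math

# Stdlib-only sketch of the vector-store pattern described: each point
# carries a DID in its payload and is retrieved by cosine similarity.
# A real deployment would use Qdrant; DIDs and vectors here are toys.
class InMemoryVectorStore:
    def __init__(self):
        self.points = []  # list of (vector, payload) pairs

    def upsert(self, vector, payload):
        self.points.append((vector, payload))

    def search(self, query, limit=3):
        """Return the `limit` nearest points by cosine similarity."""
        def score(point):
            vec, _ = point
            dot = sum(x * y for x, y in zip(query, vec))
            return dot / (math.hypot(*query) * math.hypot(*vec))
        return sorted(self.points, key=score, reverse=True)[:limit]

store = InMemoryVectorStore()
store.upsert([1.0, 0.0], {"did": "did:example:concept-1", "label": "coffee"})
store.upsert([0.0, 1.0], {"did": "did:example:concept-2", "label": "weather"})

hits = store.search([0.9, 0.1], limit=1)
print(hits[0][1]["label"])  # coffee
```

The payload-carried DID is what links a similarity hit back into the knowledge graph, which is the "neighborhood query" bridge between the two stores that the interview describes.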