Building a Knowledge Graph from Unstructured Data Using Neo4j and REBEL
Knowledge graphs organize and integrate information across domains and sources, enabling a deeper understanding of complex relationships and interdependencies.
By representing data as nodes (entities) and edges (relationships), knowledge graphs facilitate semantic querying and advanced analytics, allowing for the extraction of insights that might otherwise remain hidden within unstructured data.
This capability is extremely valuable for a range of applications, from enhancing search engine results to driving personalized recommendations, and from improving artificial intelligence systems to enabling breakthroughs in scientific research. Knowledge graphs thus serve as a foundational technology for constructing a more interconnected and intelligent digital ecosystem, where data becomes not only more accessible but also more meaningful.
Creating knowledge graphs from structured data is relatively straightforward because the data is already organized into defined formats, such as tables with rows and columns in relational databases, or objects in document-based stores. These formats lend themselves well to translation into nodes and edges, where entities and their attributes are clearly delineated, and relationships are explicitly stated or can easily be inferred.
However, constructing knowledge graphs from unstructured data presents a much greater challenge. Unstructured data, such as text documents, lacks a predefined data model, which makes extracting entities and relationships for the graph a complex task.
Few tools can create knowledge graphs directly from unstructured data. This is a significant barrier for many organizations, particularly those without the resources to invest in the technology and talent needed to build and manage these complex systems.
We found a robust solution for converting unstructured data into a knowledge graph using an end-to-end AI model called REBEL.
REBEL is a text-to-text model developed by Babelscape by fine-tuning the BART model. It is designed to parse sentences that contain entities and implicit relationships, converting them into explicit relational triplets. REBEL was trained on over 200 distinct relation types, using a bespoke dataset drawn from Wikipedia abstracts and the relational data of Wikidata. This dataset was filtered with the help of a RoBERTa-based Natural Language Inference model to ensure the quality and relevance of the entities and relations included.
The dataset utilized for REBEL's pre-training is accessible on the Hugging Face Hub, as detailed in the paper that outlines its development process. REBEL has demonstrated impressive performance across several benchmarks in both Relation Extraction and Relation Classification tasks.
For those interested in exploring or utilizing REBEL, the model is available on the Hugging Face Hub platform.
Neo4j is a highly regarded graph database platform that has been instrumental in the development and deployment of knowledge graphs. It is designed to store and retrieve data structured in graphs rather than tables, making it ideal for representing complex networks of data with interconnecting relationships.
Neo4j knowledge graphs enable organizations to uncover and leverage intricate connections in their data, allowing for powerful queries and analyses that traditional relational databases struggle with.
The platform offers a rich set of tools and features, including the Cypher query language, which is specifically tailored for graph operations. This makes it possible to intuitively model, query, and visualize relationships within the data. Neo4j's knowledge graph capabilities are applied across various industries for use cases such as recommendation systems, fraud detection, artificial intelligence, and more, demonstrating its flexibility and robustness in handling connected data at scale.
To get a Neo4j instance to connect to, you can launch a database on Neo4j’s Aura platform.
Upon launching a new instance, you can download a text file that will contain the username, password, and the URL of your database. This will be needed later to connect to the database.
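As a minimal sketch of reading that file, the snippet below assumes it consists of `KEY=VALUE` lines (for example `NEO4J_URI`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`) with optional `#` comment lines. The key names and the file layout are assumptions, so check them against the file you actually downloaded.

```python
from pathlib import Path


def parse_credentials(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    creds = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so passwords containing '=' survive.
        key, value = line.split("=", 1)
        creds[key.strip()] = value.strip()
    return creds


def load_credentials(path: str) -> dict[str, str]:
    """Read a credentials file from disk; the path is supplied by the caller."""
    return parse_credentials(Path(path).read_text(encoding="utf-8"))
```

Keeping the parsing separate from the file read makes it easy to check the logic against a pasted sample before pointing it at the real file.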
First, install all the dependencies needed to execute the program.
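The exact dependency list is not given here. A plausible set, assumed for the sketches in this post, is the Neo4j Python driver, Hugging Face Transformers, and PyTorch, which could be installed with `pip install neo4j transformers torch`. The short check below reports which of these assumed packages are still missing from your environment.

```python
import importlib.util

# Assumed package list; adjust it to match what your own code imports.
REQUIRED = ["neo4j", "transformers", "torch"]


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```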
Let’s establish a connection with Neo4j’s graph db.
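A minimal connection sketch using the official `neo4j` Python driver might look like the following. The function name `connect` is hypothetical, and the URI, username, and password are the values from your credentials file.

```python
def connect(uri: str, user: str, password: str):
    """Open a Neo4j driver and fail fast if the instance is unreachable."""
    # Imported here so this module still loads when the driver is not installed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    # Raises an exception if the URI or the credentials are wrong.
    driver.verify_connectivity()
    return driver
```

Calling `verify_connectivity()` up front surfaces bad credentials immediately, rather than on the first query. Remember to call `driver.close()` when you are done.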
Let’s define a text splitter that will break our unstructured data into smaller chunks to be processed by the REBEL model.
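One simple way to do this, sketched below, is to split on sentence boundaries and pack sentences into chunks up to a character limit. The function name `split_text` and the 500-character default are assumptions, not values taken from REBEL's documentation.

```python
import re


def split_text(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence boundaries keeps each entity pair and its relation in the same chunk more often than a fixed-width cut would.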
Now let’s load the text that we want to insert into our knowledge graph. The popular ‘Dune’ universe makes an interesting test case, since it is complex, with many entities and subplots.
Then we load the REBEL model.
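A sketch of this step with Hugging Face Transformers follows, assuming the `Babelscape/rebel-large` checkpoint. The function names and the generation settings (beam count, maximum length) are illustrative choices rather than required values.

```python
MODEL_NAME = "Babelscape/rebel-large"


def load_rebel(model_name: str = MODEL_NAME):
    """Download (or load from cache) the REBEL tokenizer and model."""
    # Imported here so this module still loads without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    return tokenizer, model


def generate_raw(tokenizer, model, text: str, num_beams: int = 3) -> list[str]:
    """Return REBEL's raw decoded outputs for one chunk of text."""
    inputs = tokenizer(text, max_length=256, truncation=True, return_tensors="pt")
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=256,
        num_beams=num_beams,
        num_return_sequences=num_beams,
    )
    # Keep special tokens: <triplet>, <subj> and <obj> carry the structure.
    return tokenizer.batch_decode(outputs, skip_special_tokens=False)
```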
This helper function (available from the model’s Hugging Face page itself) takes the text and returns a triplet containing an entity pair and their relationship.
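The version below is adapted from the parsing helper on the model card and lightly restructured, so treat it as a sketch rather than the exact original. It relies on REBEL's linearized output format, `<triplet> head <subj> tail <obj> relation`, and returns each triplet as a dictionary with `head`, `type`, and `tail` keys.

```python
def extract_triplets(text: str) -> list[dict[str, str]]:
    """Parse REBEL's decoded output into head/type/tail dictionaries."""
    triplets = []
    head, tail, relation = "", "", ""
    current = None  # which field the next plain token belongs to

    def flush():
        if head and tail and relation:
            triplets.append(
                {"head": head.strip(), "type": relation.strip(), "tail": tail.strip()}
            )

    cleaned = text.replace("<s>", "").replace("<pad>", "").replace("</s>", "")
    for token in cleaned.split():
        if token == "<triplet>":  # a new head entity starts
            flush()
            head, tail, relation = "", "", ""
            current = "head"
        elif token == "<subj>":  # a tail entity for the current head starts
            flush()
            tail, relation = "", ""
            current = "tail"
        elif token == "<obj>":  # the relation label starts
            relation = ""
            current = "relation"
        elif current == "head":
            head += " " + token
        elif current == "tail":
            tail += " " + token
        elif current == "relation":
            relation += " " + token
    flush()
    return triplets
```

For example, `extract_triplets("<s><triplet> Dune <subj> Denis Villeneuve <obj> director</s>")` yields one triplet with head `Dune`, type `director`, and tail `Denis Villeneuve`.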
We shall also create a knowledge graph class to store information about the nodes and the edges.
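A minimal version of such a class is sketched below. The class name, attribute names, and the `head`/`type`/`tail` dictionary shape are assumptions chosen for illustration.

```python
class KnowledgeGraph:
    """In-memory store for entities (nodes) and relations (edges)."""

    def __init__(self):
        self.entities: set[str] = set()
        self.relations: list[dict[str, str]] = []

    def add_relation(self, head: str, rel_type: str, tail: str) -> bool:
        """Add a relation unless it already exists; return True if added."""
        relation = {"head": head, "type": rel_type, "tail": tail}
        if relation in self.relations:
            return False
        self.relations.append(relation)
        self.entities.update((head, tail))
        return True
```

Deduplicating at insert time matters because overlapping chunks and multiple beams often produce the same triplet more than once.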
Next, we wrap the above into a single function that takes text as input and returns the knowledge graph for that text.
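The sketch below shows the shape of such a wrapper. To keep it self-contained, the splitter and the extractor are passed in as callables; in practice these would be your text splitter and a function that runs REBEL and parses its output. The name `text_to_graph` and the plain-dictionary graph are hypothetical simplifications.

```python
from typing import Callable

Triplet = dict[str, str]


def text_to_graph(
    text: str,
    split: Callable[[str], list[str]],
    extract: Callable[[str], list[Triplet]],
) -> dict:
    """Split text into chunks, extract triplets per chunk, merge without duplicates."""
    graph = {"entities": set(), "relations": []}
    for chunk in split(text):
        for triplet in extract(chunk):
            if triplet in graph["relations"]:
                continue  # the same fact can appear in several chunks
            graph["relations"].append(triplet)
            graph["entities"].update((triplet["head"], triplet["tail"]))
    return graph
```

Passing the two steps in as arguments also makes it easy to swap in a different splitter or model later without touching the merging logic.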
To insert the Dune universe data into Neo4j’s graph, use the code below:
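One way to write this step is sketched below, under some assumptions: every node gets the label `Entity` and a `name` property, and each relation label is turned into an uppercase relationship type such as `DIRECTOR`. Cypher cannot take a relationship type as a query parameter, so the type is sanitized and interpolated into the query, while entity names are passed as parameters. The label, property, and function names are hypothetical.

```python
import re


def relation_type(label: str) -> str:
    """Turn a free-text relation label into a safe Cypher relationship type."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", label).strip("_").upper()
    if not cleaned or cleaned[0].isdigit():
        cleaned = f"REL_{cleaned}".rstrip("_")
    return cleaned


def build_merge(head: str, rel_type: str, tail: str):
    """Return a MERGE query and its parameters for one relation."""
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:{relation_type(rel_type)}]->(t)"
    )
    return query, {"head": head, "tail": tail}


def insert_relations(driver, relations) -> None:
    """Write each head/type/tail dictionary to Neo4j through the given driver."""
    with driver.session() as session:
        for rel in relations:
            query, params = build_merge(rel["head"], rel["type"], rel["tail"])
            session.run(query, params).consume()
```

Using `MERGE` rather than `CREATE` means re-running the script does not duplicate nodes or relationships.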
Now the knowledge graph has been populated. We can head over to Neo4j’s console to study our graph.
To visualize the complete graph at once, we use this query:
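A query along these lines returns every relationship together with its two end nodes. It is kept in a Python constant here so it can also be run through the driver; to use it in the Neo4j console, paste the contents of the string.

```python
# Matches every relationship and returns it with its start and end nodes.
# Nodes that have no relationships at all are not returned by this pattern.
FULL_GRAPH_QUERY = "MATCH (n)-[r]->(m) RETURN n, r, m"
```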
Let’s zoom into an area of the graph:
To view the name of the node instead of the id, we can click on Node on the right and change the caption to <type>.
To visualize a node and all the other nodes interconnected to it, we use the following query:
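A sketch of such a query is given below, assuming nodes carry the label `Entity` and a `name` property (both are assumptions about how the data was inserted). The undirected pattern `-[r]-` picks up incoming and outgoing relationships alike.

```python
# Returns the chosen node plus every node directly connected to it.
# The Entity label and the name property are assumed, not confirmed.
NEIGHBOURHOOD_QUERY = "MATCH (n:Entity {name: 'Dune'})-[r]-(m) RETURN n, r, m"
```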
In this case, our center of interest is the `Dune` node.
To list all the director relationships:
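Assuming REBEL's `director` relation was stored as a relationship of type `DIRECTOR`, pointing from the work to the person, on nodes with a `name` property, the query could look like this. The type name, direction, and property are assumptions, so adjust them to your own graph.

```python
# Lists each work alongside its director as plain name columns.
DIRECTOR_QUERY = (
    "MATCH (work)-[:DIRECTOR]->(person) "
    "RETURN work.name AS work, person.name AS director"
)
```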
Similarly, if we want to view all the Father relationships of the characters in ‘Dune’:
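Under the assumption that the relationship type is `FATHER` and that it points from child to father, a query returning the nodes and relationships for visualization might be:

```python
# Returns child and father nodes with the FATHER relationship between them.
# The FATHER type name and its direction are assumed.
FATHER_QUERY = "MATCH (child)-[r:FATHER]->(father) RETURN child, r, father"
```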
If we want to count the total number of Father relationships:
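A counting query, together with a small hypothetical helper that runs it through the Python driver, could be sketched as follows. The `FATHER` relationship type is again an assumption about how the relations were stored.

```python
FATHER_COUNT_QUERY = "MATCH ()-[r:FATHER]->() RETURN count(r) AS father_count"


def count_fathers(driver) -> int:
    """Run the count query and return the number of FATHER relationships."""
    with driver.session() as session:
        # An aggregate without grouping always returns exactly one row.
        record = session.run(FATHER_COUNT_QUERY).single()
        return record["father_count"]
```

In the Neo4j console, pasting the contents of `FATHER_COUNT_QUERY` on its own gives the same number.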
You are all set now to create a knowledge graph from your own unique dataset. As you can see from this blog, knowledge graphs are a powerful tool to visualize and study large datasets.
Feel free to reach out to founders@superteams.ai if you need any assistance in designing your AI workflows. We have some of the very best AI engineers working with us.
Hope you enjoyed reading this article.