In this blog, we will use CocoIndex to extract relationships/ontologies with an LLM and build a knowledge graph in Neo4j. We will walk through it step by step, using a graph to represent the relationships between core concepts of the CocoIndex documentation.
Source code - https://github.com/cocoindex-io/cocoindex
If you like our work, it would mean a lot to us if you could support CocoIndex on GitHub with a star 🥥🤗.
Install PostgreSQL if you don't have it. CocoIndex uses PostgreSQL to manage the data index internally for incremental processing. Supporting other databases is on our roadmap; if you are interested in one, please let us know by creating a GitHub issue.
Install Neo4j if you don't have it.
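If you'd rather run Neo4j in Docker, here is a minimal sketch using the official image (the `neo4j/cocoindex` credentials match the connection spec we configure later in this post):

```sh
docker run -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/cocoindex neo4j:5
```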
Install/configure an LLM API. In this example, we use OpenAI, so you need to configure your OpenAI API key before running the example. Alternatively, you can switch to Ollama, which runs LLM models locally; you can get it ready by following this guide.
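Since the example loads environment variables with `load_dotenv` (see the main function at the end of this post), one way to configure the key is a `.env` file in the project directory; the value below is a placeholder:

```sh
# Placeholder value; replace with your own key.
OPENAI_API_KEY=sk-...
```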
```python
import dataclasses

import cocoindex
from dotenv import load_dotenv


@cocoindex.flow_def(name="DocsToKG")
def docs_to_kg_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    """
    Define an example flow that extracts triples from files and builds a knowledge graph.
    """
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="../../docs/docs/core",
                                    included_patterns=["*.md", "*.mdx"]))
```
In this example, we are going to process the CocoIndex documentation markdown files (.md, .mdx) from the docs/core directory. You can change the path to point at the documentation you want to process.
`flow_builder.add_source` will create a table with the following subfields; see the documentation here.

- `filename` (key, type: `str`): the filename of the file, e.g. `dir1/file1.md`
- `content` (type: `str` if binary is False, otherwise `bytes`): the content of the file

```python
    document_node = data_scope.add_collector()
    entity_relationship = data_scope.add_collector()
    entity_mention = data_scope.add_collector()
```
We are going to add three collectors at the root scope:

- `document_node`: the document nodes, e.g. core/basics.mdx (https://cocoindex.io/docs/core/basics)
- `entity_relationship`: the relationships between entities, e.g. Indexing flow and Data are related to each other (an indexing flow has two aspects: data and operations on data).
- `entity_mention`: the mentions of entities in the documents; for example, document core/basics.mdx mentions Indexing flow, Retrieval ...

We will define a `DocumentSummary` data class to extract the summary of a document with structured output.
```python
@dataclasses.dataclass
class DocumentSummary:
    """Describe a summary of a document."""
    title: str
    summary: str
```
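For instance, for a document about CocoIndex basics, the LLM's structured output would be an instance along these lines (the values here are made up for illustration):

```python
DocumentSummary(
    title="CocoIndex Basics",  # hypothetical extracted title
    summary="An indexing flow has two aspects: data and operations on data.",  # hypothetical
)
```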
And then, within the flow, let's use `cocoindex.functions.ExtractByLlm` for structured output.
```python
    with data_scope["documents"].row() as doc:
        doc["summary"] = doc["content"].transform(
            cocoindex.functions.ExtractByLlm(
                llm_spec=cocoindex.LlmSpec(
                    api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"),
                output_type=DocumentSummary,
                instruction="Please summarize the content of the document."))
        document_node.collect(
            filename=doc["filename"], title=doc["summary"]["title"],
            summary=doc["summary"]["summary"])
```
Here, we are processing each document and using an LLM to extract a summary of the document. We then collect the title and summary information into the document_node collector. For detailed information about cocoindex.functions.ExtractByLlm, please refer to the documentation.
Note that if you want to use a local model, like Ollama, you can replace the llm_spec with the following spec:
```python
# Replace the llm_spec above with this one to use the Ollama API instead of OpenAI
llm_spec=cocoindex.LlmSpec(
    api_type=cocoindex.LlmApiType.OLLAMA, model="llama3.2"),
```
CocoIndex allows you to choose components like LEGO :)
For each document, we will perform simple syntax-based chunking. This is optional; we find that a reasonable chunk size produces better-quality results, since it gives the LLM content of a size it can understand and process well.
doc["chunks"] = doc["content"].transform(
cocoindex.functions.SplitRecursively(),
language="markdown", chunk_size=10000)
Next, let's define a data class to represent relationships (triples) for the LLM extraction.
```python
@dataclasses.dataclass
class Relationship:
    """Describe a relationship between two nodes."""
    subject: str
    predicate: str
    object: str
```
In a knowledge graph triple (Subject, Predicate, Object):
- `subject`: Represents the entity the statement is about (e.g., 'CocoIndex').
- `predicate`: Describes the type of relationship or property connecting the subject and object (e.g., 'supports').
- `object`: Represents the entity or value that the subject is related to via the predicate (e.g., 'Incremental Processing').
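Putting that together, the example above corresponds to the following instance (illustrative only, not part of the flow):

```python
# The triple ('CocoIndex', 'supports', 'Incremental Processing') as a Relationship.
example = Relationship(
    subject="CocoIndex",
    predicate="supports",
    object="Incremental Processing",
)
```

Next, we will use `cocoindex.functions.ExtractByLlm` to extract the relationships from the document.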
with doc["chunks"].row() as chunk:
chunk["relationships"] = chunk["text"].transform(
cocoindex.functions.ExtractByLlm(
llm_spec=cocoindex.LlmSpec(
api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"),
# Replace by this spec below, to use Ollama API instead of OpenAI
# llm_spec=cocoindex.LlmSpec(
# api_type=cocoindex.LlmApiType.OLLAMA, model="llama3.2"),
output_type=list[Relationship],
instruction=(
"Please extract relationships from CocoIndex documents. "
"Focus on concepts and ingnore specific examples. "
"Each relationship should be a tuple of (subject, predicate, object).")))
Here, we are processing each chunk and using an LLM to extract relationships from the chunked text. For detailed information about `cocoindex.functions.ExtractByLlm`, please refer to the documentation.
For each relationship, we will embed the subject and object for retrieval.
with chunk["relationships"].row() as relationship:
relationship["subject_embedding"] = relationship["subject"].transform(
cocoindex.functions.SentenceTransformerEmbed(
model="sentence-transformers/all-MiniLM-L6-v2"))
relationship["object_embedding"] = relationship["object"].transform(
cocoindex.functions.SentenceTransformerEmbed(
model="sentence-transformers/all-MiniLM-L6-v2"))
For each relationship, after the transformation, we will use the collectors to collect the fields.
```python
                entity_relationship.collect(
                    id=cocoindex.GeneratedField.UUID,
                    subject=relationship["subject"],
                    subject_embedding=relationship["subject_embedding"],
                    object=relationship["object"],
                    object_embedding=relationship["object_embedding"],
                    predicate=relationship["predicate"],
                )
                entity_mention.collect(
                    id=cocoindex.GeneratedField.UUID, entity=relationship["subject"],
                    filename=doc["filename"], location=chunk["location"],
                )
                entity_mention.collect(
                    id=cocoindex.GeneratedField.UUID, entity=relationship["object"],
                    filename=doc["filename"], location=chunk["location"],
                )
```
- The `entity_relationship` collector collects relationships between subjects and objects.
- The `entity_mention` collector collects mentions of entities (as subjects or objects) in the documents separately.

At the root scope, we will configure the Neo4j connection:
```python
    conn_spec = cocoindex.add_auth_entry(
        "Neo4jConnection",
        cocoindex.storages.Neo4jConnection(
            uri="bolt://localhost:7687",
            user="neo4j",
            password="cocoindex",
        ))
```
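The hardcoded credentials above are fine for a local demo. If you share the project, here is a minimal sketch of pulling them from environment variables instead (the `NEO4J_*` variable names are our own convention, not something CocoIndex defines):

```python
import os

# Hypothetical NEO4J_* variable names; defaults match the local demo above.
conn_spec = cocoindex.add_auth_entry(
    "Neo4jConnection",
    cocoindex.storages.Neo4jConnection(
        uri=os.environ.get("NEO4J_URI", "bolt://localhost:7687"),
        user=os.environ.get("NEO4J_USER", "neo4j"),
        password=os.environ.get("NEO4J_PASSWORD", "cocoindex"),
    ))
```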
And then we will export the collectors to the Neo4j database.
```python
    document_node.export(
        "document_node",
        cocoindex.storages.Neo4j(
            connection=conn_spec,
            mapping=cocoindex.storages.NodeMapping(label="Document")),
        primary_key_fields=["filename"],
        foreign_key_fields=["title", "summary"],
    )
```
This exports `document_node` (filename, title, summary - collected above) to the Neo4j database, creating Neo4j nodes with the label Document using `cocoindex.storages.NodeMapping`. This is a simple node export: in the data flow we collect exactly one document node per document, so it is a clean 1:1 mapping - each document produces exactly one Neo4j node, with no deduplication required.
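Once the flow has run, you can sanity-check these nodes in the Neo4j Browser, for example:

```cypher
MATCH (d:Document)
RETURN d.filename, d.title, d.summary
```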
Next, we will export the entity_relationship to the Neo4j database.
```python
    entity_relationship.export(
        "entity_relationship",
        cocoindex.storages.Neo4j(
            connection=conn_spec,
            mapping=cocoindex.storages.RelationshipMapping(
                rel_type="RELATIONSHIP",
                source=cocoindex.storages.NodeReferenceMapping(
                    label="Entity",
                    fields=[
                        cocoindex.storages.TargetFieldMapping(
                            source="subject", target="value"),
                        cocoindex.storages.TargetFieldMapping(
                            source="subject_embedding", target="embedding"),
                    ]
                ),
                target=cocoindex.storages.NodeReferenceMapping(
                    label="Entity",
                    fields=[
                        cocoindex.storages.TargetFieldMapping(
                            source="object", target="value"),
                        cocoindex.storages.TargetFieldMapping(
                            source="object_embedding", target="embedding"),
                    ]
                ),
                nodes_storage_spec={
                    "Entity": cocoindex.storages.NodeStorageSpec(
                        primary_key_fields=["value"],
                        vector_indexes=[
                            cocoindex.VectorIndexDef(
                                field_name="embedding",
                                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
                            ),
                        ],
                    ),
                },
            ),
        ),
        primary_key_fields=["id"],
    )
```
This code exports the entity_relationship data to a Neo4j database. Let's break down what's happening:
We're calling the export method on the `entity_relationship` data collector, with three parameters:

- The name `entity_relationship` for this export.
- The `cocoindex.storages.Neo4j` spec, carrying the connection and the mapping.
- The primary key fields (`id` in this case, which is generated by `cocoindex.GeneratedField.UUID`) for each exported relationship.

The `RelationshipMapping` defines (documentation):
- The relationship type RELATIONSHIP; this is just a label for what kind of relationship it is.
- The source node configuration:
  - The label Entity.
  - A `NodeReferenceMapping` to create a reference to the source node of the relationship. In addition, it maps fields from the data collector to the Neo4j node. It defines two pairs of mappings:
    - `subject` field from the data collector -> `value` field in the Neo4j node
    - `subject_embedding` field from the data collector -> `embedding` field in the Neo4j node
- The target node configuration:
  - The label Entity. In this example, we use an LLM to extract entities (key concepts like data indexing, data types, etc.) and find relationships between them, so the source and target are the same kind of node and share the same Entity label.
  - A `NodeReferenceMapping` to create a reference to the target node of the relationship. In addition, it maps fields from the data collector to the Neo4j node. It defines two pairs of mappings:
    - `object` field from the data collector -> `value` field in the Neo4j node
    - `object_embedding` field from the data collector -> `embedding` field in the Neo4j node

Note how `NodeReferenceMapping` creates node references. Unlike the Document label, whose nodes come from rows collected by `document_node`, nodes with the Entity label are created from the rows collected for relationships (using the fields specified in the `NodeReferenceMapping`). Different relationships may involve the same entity, and CocoIndex uses a node's primary key (`value` for Entity) to decide its identity, creating exactly one node to be shared by all such relationships. For example, if several relationships have CocoIndex as their subject or object, they all attach to a single Entity node whose value is CocoIndex.
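Once the flow has run, you can check this in the Neo4j Browser with a query along these lines (the entity value CocoIndex is an illustrative guess at what the LLM extracts; `predicate` is the remaining collected field, stored on the relationship):

```cypher
MATCH (e:Entity {value: "CocoIndex"})-[r:RELATIONSHIP]->(o:Entity)
RETURN e.value, r.predicate, o.value
```

All matched relationships share the same source Entity node.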
Next, let's export the entity_mention to the Neo4j database.
```python
    entity_mention.export(
        "entity_mention",
        cocoindex.storages.Neo4j(
            connection=conn_spec,
            mapping=cocoindex.storages.RelationshipMapping(
                rel_type="MENTION",
                source=cocoindex.storages.NodeReferenceMapping(
                    label="Document",
                ),
                target=cocoindex.storages.NodeReferenceMapping(
                    label="Entity",
                    fields=[cocoindex.storages.TargetFieldMapping(
                        source="entity", target="value")],
                ),
            ),
        ),
        primary_key_fields=["id"],
    )
```
This code exports the entity_mention data to the Neo4j database. Let's break down what's happening:

We're calling the export method on the `entity_mention` data collector, with three parameters:

- The name `entity_mention` for this export.
- The `cocoindex.storages.Neo4j` spec, carrying the connection and the mapping.
- The primary key fields (`id` in this case) for each exported mention relationship.

The `RelationshipMapping` defines how to create relationships in Neo4j from the collected data. It specifies the relationship type and configures both the source and target nodes that will be connected by this relationship.
- The relationship type is MENTION, which represents that a document mentions an entity.
- The source node configuration:
  - The label Document.
  - A `NodeReferenceMapping` that maps the `filename` field from the data collector -> `filename` field in the Neo4j node.
- The target node configuration:
  - The label Entity. Note that this is different from the Document label in the source node configuration: they are indeed different kinds of nodes in the graph. A document node (e.g., core/basics.mdx) contains the content of the document, while an entity node (e.g., CocoIndex) contains the entity information.
  - A `NodeReferenceMapping` that maps the `entity` field from the data collector to the `value` field in the Neo4j node.

Finally, the main function for the flow initializes the CocoIndex flow and runs it.
```python
@cocoindex.main_fn()
def _run():
    pass

if __name__ == "__main__":
    load_dotenv(override=True)
    _run()
```
🎉 Now you are all set!
Install the dependencies:
```sh
pip install -e .
```
Run the following commands to set up and update the index:

```sh
python main.py cocoindex setup
python main.py cocoindex update
```
You'll see the index update states in the terminal. For example, you'll see output like the following:

```
documents: 3 added, 0 removed, 0 updated
```
After the knowledge graph is built, you can explore it in the Neo4j Browser.

For the dev environment, you can connect to the Neo4j Browser using the credentials configured above:

- username: `neo4j`
- password: `cocoindex`

You can open it at http://localhost:7474 and run the following Cypher query to get all relationships:

```cypher
MATCH p=()-->() RETURN p
```
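You can also query across the two kinds of nodes; for example, to list which entities each document mentions (using the MENTION relationship type defined above):

```cypher
MATCH (d:Document)-[:MENTION]->(e:Entity)
RETURN d.filename, collect(DISTINCT e.value) AS entities
```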
We are constantly improving, and more examples and blogs are coming soon! If you like our work, please support CocoIndex on GitHub with a star 🥥🤗.