DataFlow-KG: An LLM-Driven Knowledge Graph Processing Library
Build, enrich, reason over, and operationalize knowledge graphs with composable DataFlow-KG operators.
DataFlow-KG is an LLM-driven knowledge graph processing library built on top of the DataFlow ecosystem. It is designed to provide reusable, extensible, and modular operators for knowledge graph construction, reasoning, retrieval, querying, and domain-specific applications.
Rather than treating KG workflows as isolated scripts, DataFlow-KG organizes graph capabilities into operator packages by graph type and application scenario. These operators can be composed into larger pipelines, including but not limited to:
- knowledge graph construction
- graph reasoning
- graph retrieval
- domain-specific knowledge graph applications
DataFlow-KG aims to serve as a unified infrastructure layer for research and development on graph-centric LLM applications.
DataFlow-KG provides reusable operators that can be flexibly composed into pipelines for graph construction, graph enrichment, reasoning, retrieval, and task-specific graph processing. Operators are not standalone utilities. They are designed to be assembled into end-to-end workflows, enabling scalable and reproducible graph data engineering.
The library supports a broad range of graph settings in one framework, including general KG, commonsense KG, temporal KG, multimodal KG, hyper-relational KG, Graph RAG, and domain-specific KGs. As an extension of DataFlow, DataFlow-KG follows the same design philosophy of composable operators and pipeline-based processing, making it easy to integrate with broader data preparation workflows.
The framework is designed for both research scenarios and practical vertical applications, supporting graph processing tasks from foundational KG construction to specialized domain deployment.
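The idea of composing small operators into a pipeline can be sketched in a few lines of plain Python. This is an illustration only: `deduplicate`, `add_inverse_edges`, and `compose` are hypothetical stand-ins written for this example, not DataFlow-KG operators or any part of its API.

```python
from typing import Callable

# Hypothetical types for illustration; NOT part of the DataFlow-KG API.
Triple = tuple[str, str, str]                    # (head, relation, tail)
Operator = Callable[[list[Triple]], list[Triple]]


def deduplicate(triples: list[Triple]) -> list[Triple]:
    """Drop repeated triples while preserving first-seen order."""
    seen: set[Triple] = set()
    unique: list[Triple] = []
    for triple in triples:
        if triple not in seen:
            seen.add(triple)
            unique.append(triple)
    return unique


def add_inverse_edges(triples: list[Triple]) -> list[Triple]:
    """Enrich the graph with inverse relations from a small fixed mapping."""
    inverse_of = {"capital_of": "has_capital"}   # assumed mapping for the demo
    enriched = list(triples)
    for head, relation, tail in triples:
        if relation in inverse_of:
            enriched.append((tail, inverse_of[relation], head))
    return enriched


def compose(*operators: Operator) -> Operator:
    """Chain operators so the output of each one feeds the next."""
    def pipeline(triples: list[Triple]) -> list[Triple]:
        for operator in operators:
            triples = operator(triples)
        return triples
    return pipeline


pipeline = compose(deduplicate, add_inverse_edges)
result = pipeline([
    ("Paris", "capital_of", "France"),
    ("Paris", "capital_of", "France"),
])
```

Because each step takes and returns the same data shape, steps can be reordered, removed, or swapped without touching the others, which is the property the operator-and-pipeline design relies on.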
conda create -n dfkg python=3.10
conda activate dfkg
pip install uv
uv pip install dataflow-kg
If you want to enable local GPU inference, use:
conda create -n dfkg python=3.10
conda activate dfkg
pip install uv
uv pip install dataflow-kg[vllm]
DataFlow-KG supports Python >= 3.10.
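If you are unsure which interpreter the active environment uses, the Python >= 3.10 requirement can be checked from Python itself. The helper below is a small sketch written for this guide (the name `python_is_supported` is hypothetical) and uses only the standard library.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in this guide


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if python_is_supported() else "too old"
    print(f"Python {current}: {status}")
```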
You can check whether the installation is successful with:
dfkg -v
If the installation is correct and DataFlow-KG is the latest release, you will see something like:
open-dataflow-kg codebase version: 0.0.2
Checking for updates...
Local version: 0.0.2
PyPI newest version: 0.0.2
You are using the latest version: 0.0.2.
In addition, the dfkg env command can be used to inspect the current hardware and software environment, which is useful for bug reporting:
dfkg env
DataFlow-KG follows a code generation + custom modification + script execution workflow. In practice, you initialize a project with the CLI, customize the generated pipeline script if needed, and then run the Python file to execute your workflow.
You can get started in three steps.
Run the following command in an empty directory:
dfkg init
Pipelines with the same name across different folders are usually incremental variants with different dependency requirements:
| Directory | Required Resources |
|---|---|
| cpu_pipelines | CPU only |
| api_pipelines | CPU + LLM API |
| gpu_pipelines | CPU + API + local GPU |
Tip: If you are new to DataFlow-KG, start with api_pipelines. Later, if you have a local GPU, you can replace LLMServing with a local model backend.
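The resource requirements above amount to a simple decision rule. As a sketch, the hypothetical helper below (not part of the `dfkg` CLI) maps the resources you have to the most capable directory you can run; the directory names come from the table, everything else is an assumption made for illustration.

```python
def choose_pipeline_dir(has_llm_api: bool, has_local_gpu: bool) -> str:
    """Pick the most capable pipeline directory the available resources allow.

    Hypothetical helper: the directory names follow the table in this guide,
    where gpu_pipelines needs both an LLM API and a local GPU.
    """
    if has_llm_api and has_local_gpu:
        return "gpu_pipelines"
    if has_llm_api:
        return "api_pipelines"
    # A GPU without an LLM API does not satisfy the gpu_pipelines row,
    # so fall back to the CPU-only variants.
    return "cpu_pipelines"
```

For example, `choose_pipeline_dir(has_llm_api=True, has_local_gpu=False)` returns `"api_pipelines"`, matching the recommended starting point for new users.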
Go into any pipeline directory, for example:
cd api_pipelines
Open one of the generated Python pipeline files. In most cases, you only need to check two configurations:
self.storage = FileStorage(
first_entry_file_name="<path_to_dataset>"
)
By default, this points to the provided example dataset, so you can run it directly. You can also replace it with your own dataset path.
If you are using an API-based serving backend, set the API key first.
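A missing key is easier to diagnose before a pipeline starts than after its first request fails. The sketch below shows one way to check for it from Python using only the standard library. The helper name `require_api_key` is hypothetical and not part of DataFlow-KG; only the `DF_API_KEY` variable name comes from this guide.

```python
import os

API_KEY_ENV_VAR = "DF_API_KEY"  # variable name used by the setup commands in this guide


def require_api_key(environ=None) -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical preflight helper; pass a dict as `environ` to test it
    without touching the real process environment.
    """
    env = os.environ if environ is None else environ
    key = env.get(API_KEY_ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(
            f"{API_KEY_ENV_VAR} is not set; export it before running an API pipeline."
        )
    return key
```

The platform-specific commands for setting the variable are: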
Linux / macOS
export DF_API_KEY=sk-xxxxx
Windows CMD
set DF_API_KEY=sk-xxxxx
PowerShell
$env:DF_API_KEY="sk-xxxxx"
Then run the pipeline script:
python xxx_pipeline.py
DataFlow-KG is released under the Apache License 2.0.
If you use DataFlow-KG in your research, please cite:
@misc{dataflowkg2026,
title={DataFlow-KG: LLM-Driven Knowledge Graph Processing Library},
author={DataFlow-KG Team},
year={2026},
howpublished={\url{https://github.com/OpenDCAI/DataFlow-KG}}
}