Graph Tokenization for Bridging Graphs and Transformers
[中文文档 / Chinese README] · [Paper (ICLR 2026)]
Branches:
- `release` — clean code for reproducing the paper experiments.
- `dev` — full development version with utility scripts, benchmarks, and internal docs.
The success of large pretrained Transformers is closely tied to tokenizers, which convert raw input into discrete symbols. GraphTokenizer extends this paradigm to graph-structured data by introducing a general graph tokenization framework. It converts arbitrary labeled graphs into discrete token sequences, enabling standard off-the-shelf Transformer models (e.g., BERT, GTE) to be applied directly to graph data without any architectural modifications.
The framework combines reversible graph serialization with Byte Pair Encoding (BPE), the de facto standard tokenizer in large language models. To better capture structural information, the serialization process is guided by global statistics of graph substructures, ensuring that frequently occurring substructures appear as adjacent symbols in the resulting sequence — an ideal input for BPE to discover a meaningful vocabulary of structural graph tokens. The entire process is reversible: the original graph can be faithfully reconstructed from its token sequence.
Framework overview. (A) Substructure frequencies (labeled-edge patterns) are collected from the training graphs. (B) Structure-guided reversible serialization via frequency-guided Eulerian circuit — at each node, the next edge is selected according to the frequency priority (e.g., at the red C node, the C–C pattern has the highest frequency, so that edge is traversed first). (C) A BPE vocabulary is trained on the serialized corpus; BPE iteratively merges the most frequent adjacent symbol pairs into new tokens, compressing sequences to ~10% of their original length while preserving common substructures.
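The greedy merging in step (C) is easy to sketch. The toy snippet below (symbols and corpus invented for illustration; not the repository's C++/Numba BPE engine) repeatedly fuses the most frequent adjacent symbol pair, which is exactly why placing frequent substructures next to each other in the serialization pays off:

```python
from collections import Counter

def bpe_train(sequences, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    merges = []
    seqs = [list(s) for s in sequences]
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # no pair worth merging
            break
        merges.append((a, b))
        merged = a + b
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs

# Hypothetical serialized label walks: the frequent C-C pattern gets merged first.
corpus = ["CCCCO", "CCCN", "CCCC"]
merges, compressed = bpe_train(corpus, num_merges=3)
# e.g. "CCCC" collapses to a single structural token
```

Because the serialization keeps common substructures adjacent, a handful of merges already compresses the walks into a small number of reusable structural tokens.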
Labeled Graphs → Structure-Guided Serialization → BPE Tokenization → Transformer → Predictions
- General Graph Tokenization Framework. Combines reversible graph serialization with BPE to create a bidirectional interface between graphs and sequence models. By decoupling the encoding of graph structure from the model architecture, it enables standard off-the-shelf Transformers to process graph data without any architectural modifications.
- Structure-Guided Serialization for BPE. A deterministic serialization mechanism guided by global substructure statistics. It addresses the ordering ambiguity inherent in graphs (permutation invariance) and systematically arranges frequent substructures into adjacent symbol patterns — precisely the input that BPE's greedy merging strategy is designed to exploit.
- State-of-the-Art on 14 Benchmarks. Achieves SOTA results across diverse graph classification and regression benchmarks spanning molecular, biomedical, social, academic, and synthetic domains. Scaling from a compact BERT-small to a larger GTE backbone yields consistent gains, demonstrating that graph tokenization can leverage the proven scaling behavior of Transformers.
Classification (↑ higher is better) and regression (↓ lower is better) results:
| Model | molhiv (AUC↑) | p-func (AP↑) | mutag (Acc↑) | coildel (Acc↑) | dblp (Acc↑) | qm9 (MAE↓) | zinc (MAE↓) | aqsol (MAE↓) | p-struct (MAE↓) |
|---|---|---|---|---|---|---|---|---|---|
| GCN | 74.0 | 53.2 | 79.7 | 74.6 | 76.6 | 0.134 | 0.399 | 1.345 | 0.342 |
| GIN | 76.1 | 61.4 | 80.4 | 72.0 | 73.8 | 0.176 | 0.379 | 2.053 | 0.338 |
| GAT | 72.1 | 51.2 | 80.1 | 74.4 | 76.3 | 0.114 | 0.445 | 1.388 | 0.316 |
| GatedGCN | 80.6 | 51.2 | 83.6 | 83.7 | 86.0 | 0.096 | 0.370 | 0.940 | 0.312 |
| GraphGPS | 78.5 | 53.5 | 84.3 | 80.5 | 71.6 | 0.084 | 0.310 | 1.587 | 0.251 |
| Exphormer | 82.3 | 64.5 | 82.7 | **91.5** | 84.9 | 0.080 | 0.281 | 0.749 | 0.251 |
| GraphMamba | 81.2 | 67.7 | 85.0 | 74.5 | 87.6 | 0.083 | 0.209 | 1.133 | 0.248 |
| GCN+ | 80.1 | 72.6 | 88.7 | 88.9 | 89.6 | 0.077 | **0.116** | 0.712 | 0.244 |
| GT+BERT | 82.6 | 68.5 | 87.5 | 74.1 | 93.2 | 0.122 | 0.241 | 0.648 | 0.247 |
| GT+GTE | **87.4** | **73.1** | **90.1** | 89.6 | **93.6** | **0.071** | 0.131 | **0.609** | **0.242** |
Results are the mean over 5 independent runs; bold marks the best result per column. See the paper for full results on all 14 datasets, including DD, Twitter, Proteins, Colors-3, and Synthetic.
| Method | Reversible | Deterministic | Applicable to |
|---|---|---|---|
| Freq-Guided Eulerian (Feuler) | ✅ | ✅ | Any labeled graph |
| Freq-Guided CPP (FCPP) | ✅ | ✅ | Any labeled graph |
| Eulerian circuit | ✅ | ❌ | Any labeled graph |
| Chinese Postman (CPP) | ✅ | ❌ | Any labeled graph |
| Canonical SMILES | ✅ | ✅ | Molecular graphs only |
| DFS / BFS / Topo | ❌ | ❌ | Any graph |
The default method is Feuler (Frequency-Guided Eulerian circuit), which provides both reversibility and determinism with O(|E|) time complexity.
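As a rough illustration of the idea (a minimal sketch on a toy labeled 4-cycle, not the repository's implementation), the snippet below runs Hierholzer's algorithm with a frequency-based edge priority and then checks reversibility: consecutive nodes in the walk recover exactly the original edge set.

```python
from collections import defaultdict

def feuler_walk(edges, labels, freq):
    """Hierholzer's algorithm; at each node, try the unused edge whose
    labeled pattern (label[u], label[v]) has the highest global frequency."""
    adj = defaultdict(list)
    for i, (u, v) in enumerate(edges):
        adj[u].append((i, v))
        adj[v].append((i, u))
    # Sort neighbor lists so high-frequency patterns are traversed first.
    for u in adj:
        adj[u].sort(key=lambda e: -freq.get((labels[u], labels[e[1]]), 0))
    used = set()
    stack, walk = [edges[0][0]], []
    while stack:
        u = stack[-1]
        nxt = next(((i, v) for i, v in adj[u] if i not in used), None)
        if nxt is None:
            walk.append(stack.pop())
        else:
            used.add(nxt[0])
            stack.append(nxt[1])
    return walk[::-1]  # node sequence; consecutive pairs are the edges

# Toy labeled graph (assumed for the example): a 4-cycle, all degrees even.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = {0: "C", 1: "C", 2: "O", 3: "C"}
freq = {("C", "C"): 10, ("C", "O"): 3, ("O", "C"): 3}
walk = feuler_walk(edges, labels, freq)
# Reversibility: consecutive walk pairs recover exactly the edge set.
recovered = {tuple(sorted(p)) for p in zip(walk, walk[1:])}
assert recovered == {tuple(sorted(e)) for e in edges}
```

Each edge is visited exactly once, so the traversal is O(|E|), and because ties are broken by the fixed frequency priority rather than arbitrary adjacency order, the resulting walk is deterministic for a given frequency table.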
GraphTokenizer/
├── prepare_data_new.py # Data preprocessing: serialization + BPE training + vocab
├── run_pretrain.py # Pre-training entry point (MLM)
├── run_finetune.py # Fine-tuning entry point (regression/classification)
├── batch_pretrain_simple.py # Batch pre-training across datasets/methods/GPUs
├── batch_finetune_simple.py # Batch fine-tuning
├── aggregate_results.py # Collect and tabulate experiment results
├── config.py # Centralized configuration management
├── config/default_config.yml # Default config values
├── src/
│ ├── algorithms/
│ │ ├── serializer/ # Graph serialization (Freq-Euler, Euler, DFS, BFS, Topo, SMILES, CPP, ...)
│ │ └── compression/ # BPE engine (C++ / Numba / Python backends)
│ ├── data/ # Unified data interface and per-dataset loaders
│ │ └── loader/ # Per-dataset loaders (QM9, ZINC, AQSOL, MNIST, Peptides, ...)
│ ├── models/ # Model definitions
│ │ ├── bert/ # BERT encoder, vocab manager, data pipeline
│ │ ├── gte/ # GTE encoder (Alibaba-NLP/gte-multilingual-base)
│ │ └── unified_encoder.py # Unified encoder interface
│ ├── training/ # Training pipelines (pretrain, finetune, evaluation)
│ └── utils/ # Logging, metrics, visualization
├── gte_model/ # Local GTE model config (for offline use)
├── final/ # Paper experiment scripts and plotting code
└── docs/ # Documentation
git clone https://github.com/BUPT-GAMMA/GraphTokenizer.git
cd GraphTokenizer
# Install in development mode
pip install -e .
# Build the C++ BPE backend (optional but recommended for speed)
python setup.py build_ext --inplace

Key dependencies: torch, dgl, networkx, rdkit, transformers, pybind11, pandas.
Serialize graphs and train a BPE tokenizer:
python prepare_data_new.py \
--datasets qm9test \
--methods feuler \
--bpe_merges 2000

This loads the dataset, serializes all graphs with the chosen method (e.g., the frequency-guided Eulerian circuit), trains a BPE model on the resulting sequences, and builds a vocabulary. All artifacts are cached for reuse.
Pre-train a Transformer encoder with Masked Language Modeling (MLM):
python run_pretrain.py \
--dataset qm9test \
--method feuler \
--experiment_group my_experiment \
--epochs 100 \
--batch_size 256

Fine-tune the pre-trained model on downstream graph prediction tasks:
python run_finetune.py \
--dataset qm9test \
--method feuler \
--experiment_group my_experiment \
--target_property homo \
--epochs 200 \
--batch_size 64

Run experiments across multiple datasets, serialization methods, and GPUs in parallel:
python batch_pretrain_simple.py \
--datasets qm9,zinc,mutagenicity \
--methods feuler,eulerian,cpp \
--bpe_scenarios all,raw \
--gpus 0,1
python batch_finetune_simple.py \
--datasets qm9,zinc,mutagenicity \
--methods feuler,eulerian,cpp \
--bpe_scenarios all,raw \
--gpus 0,1

Scripts for all paper experiments are in the final/ directory:
- Main experiments — final/exp1_main/run/: pre-training and fine-tuning commands for all 14 datasets
- Efficiency analysis — final/exp1_speed/: serialization speed, token length stats, training throughput
- Multi-sampling comparison — final/exp2_mult_seralize_comp/: effect of multiple serialization samples
- BPE vocabulary visualization — final/exp4_bpe_vocab_visual/: codebook inspection and visualization
- Configuration Guide — config file structure and parameters
- Experiment Guide — how to design and run experiments
- BPE Usage Guide — BPE engine API and usage
If you find this work useful, please cite our paper:
@inproceedings{guo2026graphtokenizer,
title={Graph Tokenization for Bridging Graphs and Transformers},
author={Guo, Zeyuan and Diao, Enmao and Yang, Cheng and Shi, Chuan},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026}
}
