Graph Tokenization for Bridging Graphs and Transformers
[中文文档 / Chinese README] · [Paper (ICLR 2026)]
Branches:
- `release` — clean code for reproducing the paper experiments.
- `dev` — full development version with utility scripts, benchmarks, and internal docs.
The success of large pretrained Transformers is closely tied to tokenizers, which convert raw input into discrete symbols. GraphTokenizer extends this paradigm to graph-structured data by introducing a general graph tokenization framework. It converts arbitrary labeled graphs into discrete token sequences, enabling standard off-the-shelf Transformer models (e.g., BERT, GTE) to be applied directly to graph data without any architectural modifications.
The framework combines reversible graph serialization with Byte Pair Encoding (BPE), the de facto standard tokenizer in large language models. To better capture structural information, the serialization process is guided by global statistics of graph substructures, ensuring that frequently occurring substructures appear as adjacent symbols in the resulting sequence — an ideal input for BPE to discover a meaningful vocabulary of structural graph tokens. The entire process is reversible: the original graph can be faithfully reconstructed from its token sequence.
Framework overview. (A) Substructure frequencies (labeled-edge patterns) are collected from the training graphs. (B) Structure-guided reversible serialization via frequency-guided Eulerian circuit — at each node, the next edge is selected according to the frequency priority (e.g., at the red C node, the C–C pattern has the highest frequency, so that edge is traversed first). (C) A BPE vocabulary is trained on the serialized corpus; BPE iteratively merges the most frequent adjacent symbol pairs into new tokens, compressing sequences to ~10% of their original length while preserving common substructures.
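The greedy merging in step (C) is easy to sketch. The toy snippet below (symbols and corpus invented for illustration; not the repository's C++/Numba BPE engine) repeatedly fuses the most frequent adjacent symbol pair, which is exactly why placing frequent substructures next to each other in the serialization pays off:

```python
from collections import Counter

def bpe_train(sequences, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    merges = []
    seqs = [list(s) for s in sequences]
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # no pair worth merging
            break
        merges.append((a, b))
        merged = a + b
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs

# Hypothetical serialized label walks: the frequent C-C pattern gets merged first.
corpus = ["CCCCO", "CCCN", "CCCC"]
merges, compressed = bpe_train(corpus, num_merges=3)
# e.g. "CCCC" collapses to a single structural token
```

Because the serialization keeps common substructures adjacent, a handful of merges already compresses the walks into a small number of reusable structural tokens.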
Labeled Graphs → Structure-Guided Serialization → BPE Tokenization → Transformer → Predictions
- General Graph Tokenization Framework. Combines reversible graph serialization with BPE to create a bidirectional interface between graphs and sequence models. By decoupling the encoding of graph structure from the model architecture, it enables standard off-the-shelf Transformers to process graph data without any architectural modifications.
- Structure-Guided Serialization for BPE. A deterministic serialization mechanism guided by global substructure statistics. It addresses the ordering ambiguity inherent in graphs (permutation invariance) and systematically arranges frequent substructures into adjacent symbol patterns — precisely the input that BPE's greedy merging strategy is designed to exploit.
- State-of-the-Art on 14 Benchmarks. Achieves SOTA results across diverse graph classification and regression benchmarks spanning molecular, biomedical, social, academic, and synthetic domains. Scaling from a compact BERT-small to a larger GTE backbone yields consistent gains, demonstrating that graph tokenization can leverage the proven scaling behavior of Transformers.
Classification (↑ higher is better) and regression (↓ lower is better) results:
| Model | molhiv (AUC↑) | p-func (AP↑) | mutag (Acc↑) | coildel (Acc↑) | dblp (Acc↑) | qm9 (MAE↓) | zinc (MAE↓) | aqsol (MAE↓) | p-struct (MAE↓) |
|---|---|---|---|---|---|---|---|---|---|
| GCN | 74.0 | 53.2 | 79.7 | 74.6 | 76.6 | 0.134 | 0.399 | 1.345 | 0.342 |
| GIN | 76.1 | 61.4 | 80.4 | 72.0 | 73.8 | 0.176 | 0.379 | 2.053 | 0.338 |
| GAT | 72.1 | 51.2 | 80.1 | 74.4 | 76.3 | 0.114 | 0.445 | 1.388 | 0.316 |
| GatedGCN | 80.6 | 51.2 | 83.6 | 83.7 | 86.0 | 0.096 | 0.370 | 0.940 | 0.312 |
| GraphGPS | 78.5 | 53.5 | 84.3 | 80.5 | 71.6 | 0.084 | 0.310 | 1.587 | 0.251 |
| Exphormer | 82.3 | 64.5 | 82.7 | **91.5** | 84.9 | 0.080 | 0.281 | 0.749 | 0.251 |
| GraphMamba | 81.2 | 67.7 | 85.0 | 74.5 | 87.6 | 0.083 | 0.209 | 1.133 | 0.248 |
| GCN+ | 80.1 | 72.6 | 88.7 | 88.9 | 89.6 | 0.077 | **0.116** | 0.712 | 0.244 |
| GT+BERT | 82.6 | 68.5 | 87.5 | 74.1 | 93.2 | 0.122 | 0.241 | 0.648 | 0.247 |
| GT+GTE | **87.4** | **73.1** | **90.1** | 89.6 | **93.6** | **0.071** | 0.131 | **0.609** | **0.242** |
Results are the mean over 5 independent runs; bold marks the best result per column. See the paper for full results on all 14 datasets, including DD, Twitter, Proteins, Colors-3, and Synthetic.
| Method | Reversible | Deterministic | Applicable to |
|---|---|---|---|
| Freq-Guided Eulerian (Feuler) | ✅ | ✅ | Any labeled graph |
| Freq-Guided CPP (FCPP) | ✅ | ✅ | Any labeled graph |
| Eulerian circuit | ✅ | ❌ | Any labeled graph |
| Chinese Postman (CPP) | ✅ | ❌ | Any labeled graph |
| Canonical SMILES | ✅ | ✅ | Molecular graphs only |
| DFS / BFS / Topo | ❌ | ❌ | Any graph |
The default method is Feuler (Frequency-Guided Eulerian circuit), which provides both reversibility and determinism with O(|E|) time complexity.
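As a rough illustration of the idea (a minimal sketch on a toy labeled 4-cycle, not the repository's implementation), the snippet below runs Hierholzer's algorithm with a frequency-based edge priority and then checks reversibility: consecutive nodes in the walk recover exactly the original edge set.

```python
from collections import defaultdict

def feuler_walk(edges, labels, freq):
    """Hierholzer's algorithm; at each node, try the unused edge whose
    labeled pattern (label[u], label[v]) has the highest global frequency."""
    adj = defaultdict(list)
    for i, (u, v) in enumerate(edges):
        adj[u].append((i, v))
        adj[v].append((i, u))
    # Sort neighbor lists so high-frequency patterns are traversed first.
    for u in adj:
        adj[u].sort(key=lambda e: -freq.get((labels[u], labels[e[1]]), 0))
    used = set()
    stack, walk = [edges[0][0]], []
    while stack:
        u = stack[-1]
        nxt = next(((i, v) for i, v in adj[u] if i not in used), None)
        if nxt is None:
            walk.append(stack.pop())
        else:
            used.add(nxt[0])
            stack.append(nxt[1])
    return walk[::-1]  # node sequence; consecutive pairs are the edges

# Toy labeled graph (assumed for the example): a 4-cycle, all degrees even.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = {0: "C", 1: "C", 2: "O", 3: "C"}
freq = {("C", "C"): 10, ("C", "O"): 3, ("O", "C"): 3}
walk = feuler_walk(edges, labels, freq)
# Reversibility: consecutive walk pairs recover exactly the edge set.
recovered = {tuple(sorted(p)) for p in zip(walk, walk[1:])}
assert recovered == {tuple(sorted(e)) for e in edges}
```

Each edge is visited exactly once, so the traversal is O(|E|), and because ties are broken by the fixed frequency priority rather than arbitrary adjacency order, the resulting walk is deterministic for a given frequency table.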
GraphTokenizer/
├── prepare_data_new.py # Data preprocessing: serialization + BPE training + vocab
├── run_pretrain.py # Pre-training entry point (MLM)
├── run_finetune.py # Fine-tuning entry point (regression/classification)
├── batch_pretrain_simple.py # Batch pre-training across datasets/methods/GPUs
├── batch_finetune_simple.py # Batch fine-tuning
├── aggregate_results.py # Collect and tabulate experiment results
├── config.py # Centralized configuration management
├── config/default_config.yml # Default config values
├── src/
│ ├── algorithms/
│ │ ├── serializer/ # Graph serialization (Freq-Euler, Euler, DFS, BFS, Topo, SMILES, CPP, ...)
│ │ └── compression/ # BPE engine (C++ / Numba / Python backends)
│ ├── data/ # Unified data interface and per-dataset loaders
│ │ └── loader/ # Per-dataset loaders (QM9, ZINC, AQSOL, MNIST, Peptides, ...)
│ ├── models/ # Model definitions
│ │ ├── bert/ # BERT encoder, vocab manager, data pipeline
│ │ ├── gte/ # GTE encoder (Alibaba-NLP/gte-multilingual-base)
│ │ └── unified_encoder.py # Unified encoder interface
│ ├── training/ # Training pipelines (pretrain, finetune, evaluation)
│ └── utils/ # Logging, metrics, visualization
├── gte_model/ # Local GTE model config (for offline use)
├── final/ # Paper experiment scripts and plotting code
└── docs/ # Documentation
git clone https://github.com/BUPT-GAMMA/GraphTokenizer.git
cd GraphTokenizer
# Install in development mode
pip install -e .
# Build the C++ BPE backend (optional but recommended for speed)
python setup.py build_ext --inplace

Key dependencies: torch, dgl, networkx, rdkit, transformers, pybind11, pandas.
Serialize graphs and train a BPE tokenizer:
python prepare_data_new.py \
--datasets qm9test \
--methods feuler \
--bpe_merges 2000

This loads the dataset, serializes all graphs with the chosen method (e.g., the frequency-guided Eulerian circuit), trains a BPE model on the resulting sequences, and builds a vocabulary. All artifacts are cached for reuse.
Pre-train a Transformer encoder with Masked Language Modeling (MLM):
python run_pretrain.py \
--dataset qm9test \
--method feuler \
--experiment_group my_experiment \
--epochs 100 \
--batch_size 256

Fine-tune the pre-trained model on downstream graph prediction tasks:
python run_finetune.py \
--dataset qm9test \
--method feuler \
--experiment_group my_experiment \
--target_property homo \
--epochs 200 \
--batch_size 64

Run experiments across multiple datasets, serialization methods, and GPUs in parallel:
python batch_pretrain_simple.py \
--datasets qm9,zinc,mutagenicity \
--methods feuler,eulerian,cpp \
--bpe_scenarios all,raw \
--gpus 0,1
python batch_finetune_simple.py \
--datasets qm9,zinc,mutagenicity \
--methods feuler,eulerian,cpp \
--bpe_scenarios all,raw \
--gpus 0,1

Scripts for all paper experiments are in the final/ directory:
- Main experiments — final/exp1_main/run/: pre-training and fine-tuning commands for all 14 datasets
- Efficiency analysis — final/exp1_speed/: serialization speed, token length stats, training throughput
- Multi-sampling comparison — final/exp2_mult_seralize_comp/: effect of multiple serialization samples
- BPE vocabulary visualization — final/exp4_bpe_vocab_visual/: codebook inspection and visualization
- Configuration Guide — config file structure and parameters
- Experiment Guide — how to design and run experiments
- BPE Usage Guide — BPE engine API and usage
If you find this work useful, please cite our paper:
@inproceedings{guo2026graphtokenizer,
title={Graph Tokenization for Bridging Graphs and Transformers},
author={Guo, Zeyuan and Diao, Enmao and Yang, Cheng and Shi, Chuan},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026}
}
