TableMind++ extends the TableMind framework with an uncertainty-aware inference pipeline that mitigates hallucinations in multi-turn table reasoning. Building on two-stage training (SFT + RAPO), TableMind++ introduces three inference-time mechanisms: (1) Memory-Guided Plan Pruning to reduce epistemic uncertainty by validating plans against a dual-memory bank, (2) Confidence-Based Action Refinement to manage aleatoric uncertainty via token-level probability monitoring, and (3) Dual-Weighted Trajectory Aggregation to synthesise reliable consensus across multiple reasoning paths.
Large Language Models struggle with precise numerical operations on tables. TableMind++ addresses this through a two-stage training strategy followed by a dynamic uncertainty-aware inference framework:
- (Stage 1: SFT Warm-up) Supervised fine-tuning on 200 high-quality distilled trajectories to bootstrap tool-use and plan-action-reflect capabilities.
- (Stage 2: RFT with RAPO) Reinforcement Fine-Tuning with Rank-Aware Policy Optimization (RAPO), a group-based policy gradient algorithm that identifies misaligned trajectories and amplifies learning signals through rank-aware advantage weighting.
- (Inference: Uncertainty-Aware Framework) Three novel inference mechanisms:
- Memory-Guided Plan Pruning: retrieves historical trajectories from a dual-memory bank (M⁺/M⁻) and filters plans based on contrastive structural similarity scores.
- Confidence-Based Action Refinement: monitors token-level probabilities of semantic tokens (identifiers, literals) and triggers self-correction when confidence falls below threshold τ.
- Dual-Weighted Trajectory Aggregation: weights each trajectory by σ(S_con) × C(h_i) and derives the final answer via weighted voting.
- Autonomous Plan-Action-Reflect Agent: Internalises deliberate multi-step reasoning within a lightweight Qwen3-8B backbone.
- RAPO: Rank-aware policy gradient that increases update weight for misaligned winner-loser trajectory pairs.
- Multi-Perspective Reward Design: R_format + R_acc + R_tool with curriculum decay e^{-τs}(β·I_success - C·N_turns²).
- Dual-Memory Bank: Offline self-generated trajectories split into M⁺ (correct) and M⁻ (deceptive) for structural plan validation.
- Token-Level Confidence: Lexical analysis identifies semantic tokens; geometric-mean log-probability avoids probability dilution.
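As a quick illustration of the curriculum-decayed tool reward, here is a minimal sketch; the function name and argument layout are ours, not the repository's API, and the symbols follow e^{-τs}(β·I_success - C·N_turns²) with the paper's defaults τ=0.05, β=0.5, C=0.01:

```python
import math

def tool_reward(step: int, success: bool, n_turns: int,
                tau: float = 0.05, beta: float = 0.5, c: float = 0.01) -> float:
    """Curriculum-decayed tool reward: e^{-tau*s} * (beta*I_success - c*n_turns^2).

    The decay e^{-tau*s} shrinks the tool bonus as training step s grows,
    while the quadratic turn penalty discourages excessive tool calls.
    """
    return math.exp(-tau * step) * (beta * float(success) - c * n_turns ** 2)

# Early in training, a successful 2-turn rollout earns close to beta:
r = tool_reward(step=0, success=True, n_turns=2)  # 1.0 * (0.5 - 0.04) = 0.46
```

Later in training the same rollout earns exponentially less, so the policy gradient is increasingly dominated by R_acc.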
```
TableMind-PP/
├── agent_r1/                    # Training framework (RAPO + multi-turn RL)
│   ├── llm_agent/               # LLM generation utilities
│   ├── src/                     # Core RL training code
│   │   ├── core_algos.py        # RAPO advantage computation
│   │   ├── agent_ray_trainer.py # Ray-based distributed trainer
│   │   ├── reward_score/        # Multi-perspective reward functions
│   │   │   ├── tqa.py           # R_format + R_acc + R_tool for QA tasks
│   │   │   └── tfv.py           # R_format + R_acc + R_tool for fact verification
│   │   └── config/
│   │       └── agent_trainer.yaml
│   ├── tool/                    # Tool execution environment
│   │   ├── tools/python_tool.py # Python sandbox (via SandboxFusion)
│   │   └── envs/nous.py         # NousToolEnv: tool call parsing & dispatch
│   └── vllm_infer/              # Basic single-pass inference
├── inference/                   # TableMind++ uncertainty-aware inference
│   ├── memory_builder.py        # SemanticParser + MemoryBank (M⁺/M⁻)
│   ├── plan_pruner.py           # Memory-guided plan pruning (Levenshtein)
│   ├── action_refiner.py        # Token-level confidence scoring & refinement
│   ├── trajectory_aggregator.py # Dual-weighted trajectory aggregation
│   └── tablemind_pp.py          # Main inference orchestrator
├── scripts/
│   ├── build_memory.py          # Offline dual-memory bank construction
│   └── evaluate.py              # Benchmark evaluation (WTQ/TabMWP/TabFact/HiTab/FinQA)
├── csv_files/                   # CSV data files for sandbox execution
├── environment.yml              # Conda environment (CUDA 12.4, PyTorch 2.6)
├── run_train.sh                 # Stage 2 RFT training entry point
└── run_inference.sh             # Full inference pipeline entry point
```
```bash
conda env create -f environment.yml
conda activate tableMind-pp
```

Follow the official guide: SandboxFusion

```bash
# Run SandboxFusion in a tmux session (default: http://localhost:8080)
tmux new-session -d -s sandbox "sandbox-fusion serve --port 8080"
```

Fine-tune on 200 distilled synthetic trajectories for 1 epoch with lr=1e-6. Use any standard SFT framework (e.g. HuggingFace Trainer, LLaMA-Factory) on the SFT dataset.
```bash
# Edit run_train.sh to set BASE_MODEL, PROJECT_NAME, EXPERIMENT_NAME, CSV_FILE_PATH
bash run_train.sh
```

Key hyperparameters (matching the paper):
| Parameter | Value |
|---|---|
| Backbone | Qwen3-8B |
| Learning rate | 1e-6 |
| Group size G | 8 |
| Max turns | 3 |
| Temperature | 1.0 |
| R_tool: τ | 0.05 |
| R_tool: β | 0.5 |
| R_tool: C | 0.01 |
| RAPO: ε_low | 0.2 |
| RAPO: ε_high | 0.28 |
| GPUs | 4× A800 |
The full TableMind++ inference pipeline runs in two steps after training:
```bash
# Start vLLM server with the trained model
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/trained/tablemind \
    --served-model-name tablemind \
    --port 8000

# Build memory bank from training data
python scripts/build_memory.py \
    --model-path tablemind \
    --train-data data/train.parquet \
    --output memory_bank.pkl \
    --encoder BAAI/bge-m3
```

Edit `run_inference.sh` to set MODEL_PATH, DATASET, etc.
```bash
bash run_inference.sh

# Or run evaluation directly:
python scripts/evaluate.py \
    --data-path data/test.parquet \
    --memory-bank memory_bank.pkl \
    --dataset WTQ \
    --num-candidates 16 \
    --top-k-memory 5 \
    --retention-ratio 0.5 \
    --confidence-threshold 0.8
```

Key inference hyperparameters (paper defaults from Table 6):
| Parameter | Value | Description |
|---|---|---|
| N | 16 | Candidate plans sampled |
| K | 5 | Memory prototypes retrieved |
| ρ | 0.5 | Plan pruning retention ratio |
| τ | 0.8 | Confidence threshold for action refinement |
| Model | WikiTQ | TabMWP | TabFact | HiTab | FinQA |
|---|---|---|---|---|---|
| GPT-5 | 77.42 | 96.12 | 90.05 | 44.52 | 28.93 |
| Deepseek-R1 | 74.63 | 98.03 | 86.25 | 76.08 | 37.42 |
| Table-R1 | 74.86 | 96.02 | 87.17 | 64.76 | 41.27 |
| TableMind | 76.82 | 99.27 | 91.85 | 71.95 | 42.02 |
| TableMind++ | 78.07 | 99.57 | 93.73 | 73.69 | 45.48 |
RAPO builds on GRPO with three enhancements:
- No KL penalty: removes reference policy constraint for larger exploration.
- Token-level normalisation: normalises by sequence length to prevent length bias.
- Asymmetric clipping: ε_low = 0.2, ε_high = 0.28 promotes generation diversity.
The rank-aware advantage weight γ_{w,l} is increased for misaligned pairs, where the model assigns higher confidence to the lower-reward trajectory:

γ_{w,l} = 1 + α · I[log P(o_w) < log P(o_l)]
A'_i = γ_i · (R_i - mean(R)) / std(R)
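A compact sketch of this advantage computation; for brevity it collapses the pairwise weight γ_{w,l} into a per-trajectory weight and assumes α = 0.5 (the actual implementation lives in `core_algos.py`):

```python
import numpy as np

def rapo_advantages(rewards, logprobs, alpha=0.5):
    """Group-normalised advantages with rank-aware up-weighting.

    rewards:  per-trajectory scalar rewards within one sampled group.
    logprobs: the policy's sequence log-probabilities for each trajectory.
    A winner's weight is raised by alpha whenever some lower-reward
    trajectory receives a *higher* log-probability (a misaligned pair).
    """
    rewards = np.asarray(rewards, dtype=float)
    logprobs = np.asarray(logprobs, dtype=float)
    gamma = np.ones_like(rewards)
    for w in range(len(rewards)):
        for l in range(len(rewards)):
            if rewards[w] > rewards[l] and logprobs[w] < logprobs[l]:
                gamma[w] = 1.0 + alpha  # misaligned pair: amplify signal
    # Group-relative baseline, as in GRPO
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return gamma * adv
```

With rewards [1.0, 0.0] and log-probabilities [-5.0, -1.0], the winner is under-confident, so its normalised advantage is scaled by 1.5.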
- Parse plans into action sequences using keyword-to-primitive mapping (FILTER, GROUP, AGGREGATE, SORT, JOIN, COMPUTE, SELECT, MERGE, PIVOT, RENAME).
- Retrieve top-K similar historical instances via cosine similarity on bge-m3 embeddings.
- Compute contrastive score S_con(p_i) = D⁻(p_i) - D⁺(p_i) using Levenshtein edit distance over action sequences.
- Retain the top ρ = 50% of candidates.
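The steps above can be sketched as follows; taking D⁺/D⁻ as the minimum edit distance to the retrieved M⁺/M⁻ prototypes is our assumption, and the function names are illustrative:

```python
def edit_distance(a, b):
    """Levenshtein distance between two action-primitive sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (x != y))  # substitution (free on match)
            prev = cur
    return dp[-1]

def contrastive_score(plan, pos_protos, neg_protos):
    """S_con = D^-(p) - D^+(p): high when near correct plans, far from deceptive ones."""
    d_pos = min(edit_distance(plan, p) for p in pos_protos)
    d_neg = min(edit_distance(plan, n) for n in neg_protos)
    return d_neg - d_pos

def prune(plans, pos_protos, neg_protos, rho=0.5):
    """Keep the top rho fraction of candidate plans by contrastive score."""
    ranked = sorted(plans, reverse=True,
                    key=lambda p: contrastive_score(p, pos_protos, neg_protos))
    return ranked[: max(1, int(len(ranked) * rho))]
```

For example, with M⁺ = [[FILTER, AGGREGATE]] and M⁻ = [[SORT, SORT, SORT]], a candidate [FILTER, AGGREGATE] scores +3 and survives pruning, while [SORT, SORT, SORT] scores -3 and is dropped.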
Compute C(a) = exp(mean_{i ∈ K} log P(a_i)) over semantically significant tokens only (identifiers, function names, numeric/string literals), excluding boilerplate Python syntax. If C(a) < τ, trigger a self-correction prompt before sandbox execution.
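A minimal sketch of this geometric-mean confidence check; the boolean semantic-token mask is illustrative (the repository derives it via lexical analysis), and the threshold default matches the paper's 0.8:

```python
import math

def action_confidence(token_logprobs, is_semantic):
    """Geometric-mean probability over semantic tokens only.

    token_logprobs: log P for each generated token of the action.
    is_semantic:    mask keeping identifiers/literals, dropping boilerplate.
    Averaging log-probabilities before exponentiating avoids the
    probability dilution of multiplying many near-1 syntax tokens.
    """
    sem = [lp for lp, keep in zip(token_logprobs, is_semantic) if keep]
    if not sem:
        return 1.0  # no semantic tokens: nothing to distrust
    return math.exp(sum(sem) / len(sem))

def needs_refinement(token_logprobs, is_semantic, tau=0.8):
    """True when the action should be re-generated before sandbox execution."""
    return action_confidence(token_logprobs, is_semantic) < tau
```

E.g. semantic tokens with probabilities 0.9 and 0.4 give C(a) = sqrt(0.36) = 0.6 < 0.8, so refinement fires even though most tokens are high-confidence.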
w_i = σ(S_con(p_i)) · C(h_i)
ŷ = argmax_y Σ_i I(y_i = y) · w_i
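These two formulas can be sketched directly, assuming σ(·) denotes the logistic sigmoid and each trajectory carries its answer, contrastive score, and action confidence:

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def aggregate(trajectories):
    """Dual-weighted vote: w_i = sigmoid(S_con(p_i)) * C(h_i).

    trajectories: iterable of (answer, s_con, confidence) triples.
    Returns the answer whose summed weight is largest.
    """
    votes = defaultdict(float)
    for answer, s_con, conf in trajectories:
        votes[answer] += sigmoid(s_con) * conf
    return max(votes, key=votes.get)
```

Two low-quality trajectories voting "12" (weight ≈ 0.27 total) lose to a single well-grounded trajectory voting "15" (weight ≈ 0.86), which is exactly the point: consensus is weighted by reliability, not headcount.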
@article{cheng2025tablemindpp,
title={TableMind++: An Uncertainty-Aware Programmatic Agent for Tool-Augmented Table Reasoning},
author={Cheng, Mingyue and Yu, Shuo and Jiang, Chuang and Tao, Xiaoyu and Mao, Qingyang and Ouyang, Jie and Liu, Qi and Chen, Enhong},
journal={arXiv},
year={2025}
}
