TableMind++: An Uncertainty-Aware Programmatic Agent for Tool-Augmented Table Reasoning

📖 Abstract

TableMind++ extends the TableMind framework with an uncertainty-aware inference pipeline that mitigates hallucinations in multi-turn table reasoning. Building on two-stage training (SFT + RAPO), TableMind++ introduces three inference-time mechanisms: (1) Memory-Guided Plan Pruning to reduce epistemic uncertainty by validating plans against a dual-memory bank, (2) Confidence-Based Action Refinement to manage aleatoric uncertainty via token-level probability monitoring, and (3) Dual-Weighted Trajectory Aggregation to synthesise reliable consensus across multiple reasoning paths.

🌟 Overview

Large Language Models struggle with precise numerical operations on tables. TableMind++ addresses this through a two-stage training strategy followed by a dynamic uncertainty-aware inference framework:

  1. (Stage 1: SFT Warm-up) Supervised fine-tuning on 200 high-quality distilled trajectories to bootstrap tool-use and plan-action-reflect capabilities.
  2. (Stage 2: RFT with RAPO) Reinforcement Fine-Tuning with Rank-Aware Policy Optimization (RAPO), a group-based policy gradient algorithm that identifies misaligned trajectories and amplifies learning signals through rank-aware advantage weighting.
  3. (Inference: Uncertainty-Aware Framework) Three novel inference mechanisms:
    • Memory-Guided Plan Pruning: retrieves historical trajectories from a dual-memory bank (M⁺/M⁻) and filters plans based on contrastive structural similarity scores.
    • Confidence-Based Action Refinement: monitors token-level probabilities of semantic tokens (identifiers, literals) and triggers self-correction when confidence falls below a threshold τ.
    • Dual-Weighted Trajectory Aggregation: weights each trajectory by σ(S_con) × C(h_i) and derives the final answer via weighted voting.

βš™οΈ Key Features

  • Autonomous Plan-Action-Reflect Agent: Internalises deliberate multi-step reasoning within a lightweight Qwen3-8B backbone.
  • RAPO: Rank-aware policy gradient that increases update weight for misaligned winner-loser trajectory pairs.
  • Multi-Perspective Reward Design: R_format + R_acc + R_tool with curriculum decay e^{-ρs}(β·I_success - C·N_turns²).
  • Dual-Memory Bank: Offline self-generated trajectories split into M⁺ (correct) and M⁻ (deceptive) for structural plan validation.
  • Token-Level Confidence: Lexical analysis identifies semantic tokens; geometric-mean log-probability avoids probability dilution.
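As a concrete reading of the curriculum-decayed tool reward e^{-ρs}(β·I_success - C·N_turns²), here is a minimal sketch; the function name and signature are illustrative, not the repository's actual reward API:

```python
import math

def tool_reward(success: bool, n_turns: int, step: int,
                rho: float = 0.05, beta: float = 0.5, c: float = 0.01) -> float:
    """Curriculum-decayed tool reward: e^{-rho*s} * (beta*I_success - C*n_turns^2).

    `step` is the training step s; defaults match the paper's
    rho=0.05, beta=0.5, C=0.01. The success bonus decays as training
    progresses, while the quadratic turn penalty discourages long tool loops.
    """
    return math.exp(-rho * step) * (beta * float(success) - c * n_turns ** 2)
```

Early in training a single successful tool call earns close to β; the same call late in training earns almost nothing, shifting pressure toward answer accuracy.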

πŸ—‚οΈ Repository Structure

TableMind-PP/
├── agent_r1/                    # Training framework (RAPO + multi-turn RL)
│   ├── llm_agent/               # LLM generation utilities
│   ├── src/                     # Core RL training code
│   │   ├── core_algos.py        # RAPO advantage computation
│   │   ├── agent_ray_trainer.py # Ray-based distributed trainer
│   │   ├── reward_score/        # Multi-perspective reward functions
│   │   │   ├── tqa.py           # R_format + R_acc + R_tool for QA tasks
│   │   │   └── tfv.py           # R_format + R_acc + R_tool for fact verification
│   │   └── config/
│   │       └── agent_trainer.yaml
│   ├── tool/                    # Tool execution environment
│   │   ├── tools/python_tool.py # Python sandbox (via SandboxFusion)
│   │   └── envs/nous.py         # NousToolEnv: tool call parsing & dispatch
│   └── vllm_infer/              # Basic single-pass inference
├── inference/                   # TableMind++ uncertainty-aware inference
│   ├── memory_builder.py        # SemanticParser + MemoryBank (M⁺/M⁻)
│   ├── plan_pruner.py           # Memory-guided plan pruning (Levenshtein)
│   ├── action_refiner.py        # Token-level confidence scoring & refinement
│   ├── trajectory_aggregator.py # Dual-weighted trajectory aggregation
│   └── tablemind_pp.py          # Main inference orchestrator
├── scripts/
│   ├── build_memory.py          # Offline dual-memory bank construction
│   └── evaluate.py              # Benchmark evaluation (WTQ/TabMWP/TabFact/HiTab/FinQA)
├── csv_files/                   # CSV data files for sandbox execution
├── environment.yml              # Conda environment (CUDA 12.4, PyTorch 2.6)
├── run_train.sh                 # Stage 2 RFT training entry point
└── run_inference.sh             # Full inference pipeline entry point

🚀 Quick Start

Environment Setup

conda env create -f environment.yml
conda activate tableMind-pp

Sandbox Fusion (required for code execution)

Follow the official guide: SandboxFusion

# Run SandboxFusion in a tmux session (default: http://localhost:8080)
tmux new-session -d -s sandbox "sandbox-fusion serve --port 8080"

πŸ› οΈ Training

Stage 1: SFT Warm-up

Fine-tune on 200 distilled synthetic trajectories for 1 epoch with lr=1e-6. Use any standard SFT framework (e.g. HuggingFace Trainer, LLaMA-Factory) on the SFT dataset.

Stage 2: Reinforcement Fine-Tuning (RAPO)

# Edit run_train.sh to set BASE_MODEL, PROJECT_NAME, EXPERIMENT_NAME, CSV_FILE_PATH
bash run_train.sh

Key hyperparameters (matching the paper):

Parameter      Value
Backbone       Qwen3-8B
Learning rate  1e-6
Group size G   8
Max turns      3
Temperature    1.0
R_tool: ρ      0.05
R_tool: β      0.5
R_tool: C      0.01
RAPO: ε_low    0.2
RAPO: ε_high   0.28
GPUs           4× A800

πŸ” Inference

The full TableMind++ inference pipeline runs in two steps after training:

Step 1: Build the Dual-Memory Bank (offline, once)

# Start vLLM server with the trained model
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/trained/tablemind \
    --served-model-name tablemind \
    --port 8000

# Build memory bank from training data
python scripts/build_memory.py \
    --model-path tablemind \
    --train-data data/train.parquet \
    --output memory_bank.pkl \
    --encoder BAAI/bge-m3

Step 2: Run TableMind++ Evaluation

# Edit run_inference.sh to set MODEL_PATH, DATASET, etc.
bash run_inference.sh

# Or run evaluation directly:
python scripts/evaluate.py \
    --data-path data/test.parquet \
    --memory-bank memory_bank.pkl \
    --dataset WTQ \
    --num-candidates 16 \
    --top-k-memory 5 \
    --retention-ratio 0.5 \
    --confidence-threshold 0.8

Key inference hyperparameters (paper defaults from Table 6):

Parameter  Value  Description
N          16     Candidate plans sampled
K          5      Memory prototypes retrieved
ρ          0.5    Plan pruning retention ratio
τ          0.8    Confidence threshold for action refinement

📊 Main Results

Model        WikiTQ  TabMWP  TabFact  HiTab  FinQA
GPT-5        77.42   96.12   90.05    44.52  28.93
Deepseek-R1  74.63   98.03   86.25    76.08  37.42
Table-R1     74.86   96.02   87.17    64.76  41.27
TableMind    76.82   99.27   91.85    71.95  42.02
TableMind++  78.07   99.57   93.73    73.69  45.48

πŸ“ Method Details

RAPO Algorithm

RAPO builds on GRPO with three enhancements:

  1. No KL penalty: removes reference policy constraint for larger exploration.
  2. Token-level normalisation: normalises by sequence length to prevent length bias.
  3. Asymmetric clipping: ε_low=0.2, ε_high=0.28 promotes generation diversity.

The rank-aware advantage weight γ_w,l is increased for misaligned pairs where the model assigns higher confidence to a lower-reward trajectory:

γ_w,l = 1 + α · I[log P(o_w) < log P(o_l)]
A'_i  = γ_i · (R_i - mean(R)) / std(R)
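A minimal sketch of the rank-aware advantage computation, assuming a simple pairwise scan to flag misaligned winner/loser trajectories within a group; the pairing scheme and the value of α here are assumptions, not the repository's core_algos.py implementation:

```python
import numpy as np

def rapo_advantages(rewards, logps, alpha=0.5):
    """Group-standardized advantages with rank-aware up-weighting.

    rewards: per-trajectory rewards R_i within one sampled group.
    logps:   per-trajectory sequence log-probabilities log P(o_i).
    A trajectory is "misaligned" when it out-rewards another trajectory
    that the policy nonetheless assigns a higher log-probability; its
    advantage weight gamma is raised from 1 to 1 + alpha.
    """
    rewards = np.asarray(rewards, dtype=float)
    logps = np.asarray(logps, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    gamma = np.ones_like(adv)
    for w in range(len(rewards)):          # candidate winner o_w
        for l in range(len(rewards)):      # candidate loser o_l
            if rewards[w] > rewards[l] and logps[w] < logps[l]:
                gamma[w] = 1.0 + alpha     # amplify the misaligned winner
    return gamma * adv
```

With two trajectories where the higher-reward one has the lower log-probability, its positive advantage is scaled by 1 + α, amplifying the corrective gradient signal.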

Memory-Guided Plan Pruning

  1. Parse plans into action sequences using keyword-to-primitive mapping (FILTER, GROUP, AGGREGATE, SORT, JOIN, COMPUTE, SELECT, MERGE, PIVOT, RENAME).
  2. Retrieve top-K similar historical instances via cosine similarity on bge-m3 embeddings.
  3. Compute contrastive score: S_con(p_i) = D⁻(p_i) - D⁺(p_i) using Levenshtein edit distance.
  4. Retain top ρ=50% of candidates.
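The contrastive pruning score in steps 1-4 can be sketched as follows; averaging the edit distance over the retrieved M⁺ and M⁻ prototypes is an assumption about how D⁺ and D⁻ aggregate the top-K neighbours:

```python
def levenshtein(a, b):
    """Edit distance between two action-primitive sequences (single-row DP)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            # deletion, insertion, or substitution (free if primitives match)
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev = cur
    return dp[n]

def contrastive_score(plan, pos_protos, neg_protos):
    """S_con(p) = D^-(p) - D^+(p): mean distance to deceptive (M^-) prototypes
    minus mean distance to correct (M^+) ones. Higher means the plan is
    structurally closer to known-good plans, so it survives pruning."""
    d_pos = sum(levenshtein(plan, p) for p in pos_protos) / len(pos_protos)
    d_neg = sum(levenshtein(plan, p) for p in neg_protos) / len(neg_protos)
    return d_neg - d_pos
```

Ranking candidates by this score and keeping the top ρ fraction implements step 4.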

Confidence-Based Action Refinement

Compute C(a) = exp(mean_{i ∈ K} log P(a_i)) over semantically significant tokens only (identifiers, function names, numeric/string literals), excluding boilerplate Python syntax. If C(a) < τ, trigger a self-correction prompt before sandbox execution.
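A minimal sketch of this confidence score; the boilerplate token set below is a toy stand-in for the paper's lexical analysis:

```python
import math

# Toy stand-in for lexical filtering of non-semantic Python tokens.
PY_BOILERPLATE = {"def", "return", "import", "for", "in", "if",
                  "(", ")", ":", "=", ","}

def action_confidence(tokens, logprobs):
    """Geometric-mean probability over semantically significant tokens.

    Averaging log-probabilities and exponentiating once avoids the
    dilution that multiplying raw probabilities over a long sequence
    would cause (a product shrinks with length even when every token
    is confident).
    """
    lps = [lp for tok, lp in zip(tokens, logprobs)
           if tok.strip() not in PY_BOILERPLATE]
    if not lps:
        return 1.0  # no semantic tokens: nothing to refine
    return math.exp(sum(lps) / len(lps))
```

In use, an action whose score falls below τ = 0.8 would be routed through a self-correction prompt before it reaches the sandbox.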

Dual-Weighted Trajectory Aggregation

w_i = σ(S_con(p_i)) · C(h_i)
ŷ   = argmax_{y} Σ_i I(y_i = y) · w_i
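The voting rule can be sketched as:

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def aggregate(answers, s_con, conf):
    """Dual-weighted voting: trajectory i votes for its answer y_i with
    weight sigma(S_con(p_i)) * C(h_i); the answer with the largest total
    weight wins."""
    votes = defaultdict(float)
    for y, s, c in zip(answers, s_con, conf):
        votes[y] += sigmoid(s) * c
    return max(votes, key=votes.get)
```

A confident answer from a structurally suspect plan (low S_con) and a shaky answer from a well-matched plan both get discounted; only trajectories strong on both axes dominate the vote.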

📄 Citation

@article{cheng2025tablemindpp,
  title={TableMind++: An Uncertainty-Aware Programmatic Agent for Tool-Augmented Table Reasoning},
  author={Cheng, Mingyue and Yu, Shuo and Jiang, Chuang and Tao, Xiaoyu and Mao, Qingyang and Ouyang, Jie and Liu, Qi and Chen, Enhong},
  journal={arXiv},
  year={2025}
}
