Table Metric Study

Meta-evaluation of table extraction metrics against human judgment, accompanying the paper:

Beyond String Matching: Semantic Evaluation of PDF Table Extraction

This repository provides implementations of rule-based table metrics (TEDS, GriTS, SCORE), LLM-as-a-judge scoring, a human evaluation interface, and the correlation analysis showing that LLM-based evaluation agrees with human judgment substantially better than rule-based metrics do.

Results

The dataset includes over 1,500 human quality ratings on 518 table pairs (three annotators per pair). LLM-based judges agree with human judgment substantially more closely than rule-based metrics do: the best rule-based Pearson r is 0.701 (GriTS-Con), while five of the eight LLM judges exceed 0.90.

Correlation of each metric with the averaged human scores:

Metric	Pearson r	Spearman ρ	Kendall τ	Cost / 1k pairs ($)
TEDS	0.684	0.717	0.558	—
TEDS struct.	0.627	0.720	0.579	—
GriTS-Top	0.633	0.735	0.597	—
GriTS-Con	0.701	0.745	0.598	—
GriTS-Avg	0.698	0.765	0.606	—
SCORE Index	0.558	0.684	0.561	—
SCORE Content	0.642	0.657	0.524	—
SCORE Content Shifted	0.648	0.653	0.526	—
SCORE-Avg	0.637	0.687	0.541	—
LLM: claude-opus-4.6	0.939	0.891	0.804	7.60
LLM: gemma-4-31b-it	0.929	0.884	0.796	0.18
LLM: gemini-3-flash-preview	0.924	0.892	0.803	0.78
LLM: gemma-4-26b-a4b-it	0.909	0.861	0.766	0.54
LLM: gemini-3.1-flash-lite-preview	0.909	0.851	0.754	0.36
LLM: gpt-5.4-nano	0.809	0.799	0.683	0.28
LLM: deepseek-v3.2	0.780	0.805	0.699	0.42
LLM: mistral-small-2603	0.756	0.799	0.685	0.28

LLM judge costs are based on OpenRouter pricing as of 2026-04-24.
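The three correlation coefficients reported above can be sketched in plain Python as follows. This is an illustrative sketch, not the code in `correlation_analysis.py`: the function names are hypothetical, the toy values are made up, and the tie handling (average ranks for Spearman, the tau-b variant for Kendall) is an assumption about how ties in 0–10 ratings might be treated.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; assumes neither sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b via an O(n^2) pair scan (fine for a few hundred pairs)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom


# Toy values for illustration only (not taken from the dataset).
metric_scores = [0.91, 0.40, 0.75, 0.10]
human_means = [7.67, 3.0, 9.0, 1.0]
print(pearson(metric_scores, human_means))
print(spearman(metric_scores, human_means))
print(kendall_tau_b(metric_scores, human_means))
```

Pearson rewards a linear relationship with the human scores, while Spearman and Kendall only look at ordering, which is why a metric can rank differently across the three columns.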

Prompt sensitivity

Same 518 extractions, three prompt variants:

tuned — the prompt used for the main results. Response schema {errors[], score}.

You are a strict table evaluator. Your task is to determine if the extracted table correctly represents the ground truth table, focusing on content accuracy, structural preservation, and information completeness. The extracted table was parsed from the rendered table. Disregard LaTeX-specific elements in the ground truth (e.g., comments, styling commands, font formatting) that have no effect on content or structure.

Ground Truth Table (LaTeX):
{gt_table}

Extracted Table:
{extracted_table}

Evaluate the extracted table using the following criteria:
1. Content accuracy: Are all cell values, headers, and data correctly preserved?
2. Structure preservation: Are all rows and columns present, and can each value be unambiguously mapped to its row/column headers? Broken or ambiguous associations count as errors.

Note: Different output formats (markdown, HTML, plain text) are acceptable as long as no information is lost. Apply this key test: Could a reader who sees ONLY the extracted table — without access to the ground truth — unambiguously reconstruct every cell-to-header mapping and all content of the original table? If not, consider the parsing as failed and assign a low score.

First, enumerate up to 5 of the most significant errors and ambiguities found. Then assign a score from 0 to 10, where 10 is a perfect match.

tuned_no_cot — identical to tuned with the final sentence replaced by "Assign a score from 0 to 10, where 10 is a perfect match." and response schema reduced to {score}. Isolates the CoT scaffold.

naive — "Rate how well the extracted table matches the ground truth table on a scale from 0 to 10." Response schema {score}.
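A judge reply can be checked against the response schemas named above before its score is stored. The sketch below is a hypothetical validator, not the logic in `compute_llm_scores.py`: the function name is invented, and it assumes that `score` is an integer and that `errors` is a list of strings. The 0 to 10 range and the limit of five errors come from the tuned prompt.

```python
import json


def parse_judge_response(raw: str, with_errors: bool = True) -> dict:
    """Validate a judge reply against {errors[], score} or {score}.

    with_errors=True matches the tuned variant; False matches
    tuned_no_cot and naive, whose schema carries only the score.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge response must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    # Assumption: scores are integers on the 0-10 scale.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")

    result = {"score": score}
    if with_errors:
        errors = data.get("errors")
        # Assumption: each error is a free-text string.
        if not isinstance(errors, list) or not all(
            isinstance(e, str) for e in errors
        ):
            raise ValueError("errors must be a list of strings")
        if len(errors) > 5:
            raise ValueError("the tuned prompt asks for at most 5 errors")
        result["errors"] = errors
    return result


print(parse_judge_response('{"errors": ["merged header lost"], "score": 7}'))
print(parse_judge_response('{"score": 9}', with_errors=False))
```

Rejecting malformed replies early keeps out-of-range or non-numeric scores out of the correlation analysis.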

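Both deltas are differences in Pearson r relative to the naive baseline: Δ content = tuned_no_cot − naive measures the effect of the engineered prompt content alone, and Δ full = tuned − naive measures the combined effect of that content and the CoT scaffold.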

Model	Pearson (tuned)	Pearson (tuned_no_cot)	Pearson (naive)	Δ content	Δ full
gemini-3.1-flash-lite-preview	0.909	0.863	0.807	+0.056	+0.102
gpt-5.4-nano	0.809	0.795	0.718	+0.076	+0.091
gemini-3-flash-preview	0.924	0.907	0.851	+0.056	+0.073
gemma-4-26b-a4b-it	0.909	0.777	0.856	−0.078	+0.054
claude-opus-4.6	0.939	0.933	0.891	+0.042	+0.048
gemma-4-31b-it	0.929	0.916	0.907	+0.009	+0.022
deepseek-v3.2	0.780	0.671	0.779	−0.108	+0.001
mistral-small-2603	0.756	0.842	0.803	+0.040	−0.046

The CoT scaffold and the engineered prompt content interact: gemma-4-26b and deepseek-v3.2 show negative Δ content but positive Δ full, while mistral-small shows the inverse. Of the 24 (model × variant) cells, only deepseek-v3.2/tuned_no_cot (0.671) falls below the best rule-based Pearson r (GriTS-Con, 0.701).
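The deltas and the two observations above can be rechecked from the table with a few lines of Python. This sketch works from the rounded three-decimal Pearson values, so a few recomputed deltas differ from the table by 0.001; the table was presumably computed from unrounded correlations. The variable and function names are illustrative.

```python
# Pearson r per model: (tuned, tuned_no_cot, naive), copied from the table.
PEARSON = {
    "gemini-3.1-flash-lite-preview": (0.909, 0.863, 0.807),
    "gpt-5.4-nano": (0.809, 0.795, 0.718),
    "gemini-3-flash-preview": (0.924, 0.907, 0.851),
    "gemma-4-26b-a4b-it": (0.909, 0.777, 0.856),
    "claude-opus-4.6": (0.939, 0.933, 0.891),
    "gemma-4-31b-it": (0.929, 0.916, 0.907),
    "deepseek-v3.2": (0.780, 0.671, 0.779),
    "mistral-small-2603": (0.756, 0.842, 0.803),
}
VARIANTS = ("tuned", "tuned_no_cot", "naive")
BEST_RULE_BASED = 0.701  # GriTS-Con, Pearson r


def deltas(tuned, no_cot, naive):
    """Return (delta_content, delta_full), both relative to naive."""
    return round(no_cot - naive, 3), round(tuned - naive, 3)


# Models where the two deltas have opposite signs.
sign_flips = [
    model
    for model, scores in PEARSON.items()
    if deltas(*scores)[0] * deltas(*scores)[1] < 0
]

# (model, variant) cells that fall below the best rule-based metric.
below_rule_based = [
    (model, variant, r)
    for model, scores in PEARSON.items()
    for variant, r in zip(VARIANTS, scores)
    if r < BEST_RULE_BASED
]

for model, scores in PEARSON.items():
    print(model, deltas(*scores))
print(sign_flips)
print(below_rule_based)
```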

Project Structure

File	Description
`all_tables.json`	Central dataset: ground truth tables, parser extractions, all metric scores, and human ratings
`compute_metrics.py`	Compute rule-based metrics (TEDS, GriTS, SCORE) for all extractions
`compute_llm_scores.py`	LLM-as-a-judge scoring via OpenRouter API
`latex_to_html_claude.py`	Convert LaTeX ground truth tables to HTML (required by rule-based metrics)
`human_eval.py`	Gradio web UI for human annotation (0–10 scoring)
`correlation_analysis.py`	Correlation analysis and scatter plots (generates paper figures)
`scorers/`	Metric implementations (TEDS, GriTS, SCORE, table normalization)

Reproducing

Requires Python 3.12+ and uv. All scripts can be run via uv run python <script>.py.

uv sync

System dependencies for rule-based metrics and human evaluation UI:

pdflatex and pdftoppm (e.g., via TeX Live)
latexmlc (for LaTeX-to-HTML normalization)
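Before running the metric or annotation scripts, it can help to confirm that these tools are on the PATH. The following is a small convenience sketch and not part of the repository; the function name is hypothetical, and only the three executable names come from the list above.

```python
import shutil

# Executable names taken from the system dependency list.
REQUIRED_TOOLS = ("pdflatex", "pdftoppm", "latexmlc")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```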

LLM scoring requires an OpenRouter API key (export OPENROUTER_API_KEY=...).

Data Format

Each entry in all_tables.json pairs a ground truth table with its parser extractions, metric scores, and human ratings:

{
  "gt_id": "000_00",
  "gt_table": "\\begin{tabular}...",
  "gt_table_html": "<table>...</table>",
  "complexity": "simple | moderate | complex",
  "extractions": [
    {
      "parser": "gemini_3_flash",
      "extracted_table": "...",
      "metrics": { "teds": 0.91, "grits_top": 0.89, "grits_con": 0.87, ... },
      "llm_scores": [
        { "judge_model": "google/gemini-3-flash-preview", "score": 9, "errors": [...] }
      ],
      "human_scores": [8, 8, 7]
    }
  ]
}
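To correlate a metric with human judgment, each extraction has to be reduced to a pair of numbers: the metric value and the mean of its human scores. The sketch below shows one way to do that for entries shaped like the example above. It assumes the top level of all_tables.json is a list of such entries; the function name and the inline sample (which fills the elided fields with placeholder values) are illustrative.

```python
from statistics import mean


def flatten(entries, metric="teds"):
    """Return (gt_id, parser, metric value, mean human score) per extraction.

    Extractions without the requested metric or without human scores
    are skipped.
    """
    rows = []
    for entry in entries:
        for extraction in entry["extractions"]:
            humans = extraction.get("human_scores") or []
            value = extraction.get("metrics", {}).get(metric)
            if not humans or value is None:
                continue
            rows.append(
                (entry["gt_id"], extraction["parser"], value, mean(humans))
            )
    return rows


# Inline sample mirroring the documented entry shape (placeholder content).
sample = [
    {
        "gt_id": "000_00",
        "complexity": "simple",
        "extractions": [
            {
                "parser": "gemini_3_flash",
                "metrics": {"teds": 0.91, "grits_top": 0.89, "grits_con": 0.87},
                "human_scores": [8, 8, 7],
            }
        ],
    }
]

for row in flatten(sample, metric="grits_con"):
    print(row)

# With the real file (assuming a top-level list):
#   import json
#   with open("all_tables.json") as f:
#       rows = flatten(json.load(f), metric="grits_con")
```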
