6 changes: 3 additions & 3 deletions README.ja.md
@@ -82,16 +82,16 @@ exstruct input.xlsx -o out.json --pretty # write pretty JSON to a file
 exstruct input.xlsx --format yaml # YAML (requires pyyaml)
 exstruct input.xlsx --format toon # TOON (requires python-toon)
 exstruct input.xlsx --sheets-dir sheets/ # write one file per sheet
-exstruct input.xlsx --auto-page-breaks-dir auto_areas/ # COM only (shown only in environments where it is available)
+exstruct input.xlsx --auto-page-breaks-dir auto_areas/ # always shown; execution requires standard/verbose + Excel COM
 exstruct input.xlsx --alpha-col # output column keys as A, B, ..., AA
 exstruct input.xlsx --include-backend-metadata # include shape/chart backend metadata
 exstruct input.xlsx --mode light # cells + table candidates only
 exstruct input.xlsx --mode libreoffice # best-effort extraction of shapes/connectors/charts without COM
 exstruct input.xlsx --pdf --image # PDF and PNG (Excel COM required)
```

-Auto page-break area export is supported from both the API and the CLI (Excel/COM required), and the CLI shows `--auto-page-breaks-dir` only in environments where it is available.
-`mode=libreoffice` rejects `--pdf` / `--image` / `--auto-page-breaks-dir` with an early error. These features assume `standard` or `verbose` + Excel COM.
+Auto page-break area export is supported from both the API and the CLI (Excel/COM required). The CLI always shows `--auto-page-breaks-dir` and validates it at execution time.
+`mode=libreoffice` rejects `--pdf` / `--image` / `--auto-page-breaks-dir` with an early error, and `mode=light` also rejects `--auto-page-breaks-dir`. These features assume `standard` or `verbose` + Excel COM.
 By default, the CLI keeps column keys as 0-based numeric strings (`"0"`, `"1"`, ...), as before. Specify `--alpha-col` if you need Excel-style keys (`"A"`, `"B"`, ...).
 By default, the CLI also omits `provenance` / `approximation_level` / `confidence` for shapes/charts. Specify `--include-backend-metadata` if you need them.
 Note: MCP `exstruct_extract` defaults to `options.alpha_col=true`, which differs from the CLI default (`false`).
6 changes: 3 additions & 3 deletions README.md
@@ -93,16 +93,16 @@ exstruct input.xlsx -o out.json --pretty # write pretty JSON to a file
 exstruct input.xlsx --format yaml # YAML (requires pyyaml)
 exstruct input.xlsx --format toon # TOON (requires python-toon)
 exstruct input.xlsx --sheets-dir sheets/ # write one file per sheet
-exstruct input.xlsx --auto-page-breaks-dir auto_areas/ # COM only; shown only when available
+exstruct input.xlsx --auto-page-breaks-dir auto_areas/ # always shown; execution requires standard/verbose + Excel COM
 exstruct input.xlsx --alpha-col # output column keys as A, B, ..., AA
 exstruct input.xlsx --include-backend-metadata # include shape/chart backend metadata
 exstruct input.xlsx --mode light # cells + table candidates only
 exstruct input.xlsx --mode libreoffice # best-effort extraction of shapes/connectors/charts without COM
 exstruct input.xlsx --pdf --image # PDF and PNGs (Excel COM required)
```

-Auto page-break export is available from both the API and the CLI when Excel/COM is available. The CLI exposes `--auto-page-breaks-dir` only in COM-capable environments.
-`mode=libreoffice` rejects `--pdf`, `--image`, and `--auto-page-breaks-dir` early. Use `standard` or `verbose` with Excel COM for those features.
+Auto page-break export is available from both the API and the CLI when Excel/COM is available. The CLI always exposes `--auto-page-breaks-dir`, but validates it at execution time.
+`mode=libreoffice` rejects `--pdf`, `--image`, and `--auto-page-breaks-dir` early, and `mode=light` also rejects `--auto-page-breaks-dir`. Use `standard` or `verbose` with Excel COM for those features.
 By default, the CLI keeps legacy 0-based numeric string column keys (`"0"`, `"1"`, ...). Use `--alpha-col` when you need Excel-style keys (`"A"`, `"B"`, ...).
 By default, serialized shape/chart output omits backend metadata (`provenance`, `approximation_level`, `confidence`) to reduce token usage. Use `--include-backend-metadata` or the corresponding Python/MCP option when you need it.
 Note: MCP `exstruct_extract` defaults to `options.alpha_col=true`, which differs from the CLI default (`false`).
75 changes: 75 additions & 0 deletions dev-docs/adr/ADR-0008-extraction-cli-runtime-capability-validation.md
@@ -0,0 +1,75 @@
# ADR-0008: Extraction CLI Runtime Capability Validation

## Status

`accepted`

## Background

The extraction CLI currently probes Excel COM availability while building its
argument parser so that `--auto-page-breaks-dir` is shown only on COM-capable
hosts. On Windows this probe may instantiate `xlwings.App()` and launch Excel
even for lightweight commands such as `exstruct --help`.

This behavior creates two policy problems that are likely to recur:

- parser construction performs host-dependent side effects and adds startup
latency before the user has requested any COM-only behavior
- help output changes by host capability even though the CLI syntax itself is
stable across hosts

ExStruct already treats extraction-mode validation as part of the product
contract in ADR-0001, and treats rich-backend fallback as a runtime concern in
ADR-0002. We need a CLI-facing policy for how capability-gated extraction flags
should be exposed and validated without reintroducing startup probes.

## Decision

- Extraction CLI parser construction must remain side-effect-free and must not
probe COM or launch Excel.
- Capability-gated extraction flags may remain visible in help output when
their syntax is stable across hosts and their requirements can be validated at
execution time.
- `--auto-page-breaks-dir` is always exposed in extraction CLI help.
- `--auto-page-breaks-dir` is validated only when the user requests it:
  - `mode="libreoffice"` keeps the existing invalid-combination validation
  - `mode="light"` is rejected explicitly because auto page-break export
    requires COM-backed `standard` or `verbose`
  - `mode="standard"` / `mode="verbose"` require Excel COM and fail with an
    explicit runtime error when COM is unavailable
- Ordinary extraction without requested COM-only side outputs keeps the existing
fallback policy from ADR-0002.
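
The validation rules in this decision can be read as a small decision table. The sketch below is illustrative only: `check_auto_page_breaks`, `CapabilityError`, and the `com_available` flag are hypothetical names chosen for this example, not ExStruct's API, and the sketch collapses the `libreoffice` and `light` cases into one branch for brevity.

```python
from __future__ import annotations


class CapabilityError(RuntimeError):
    """Hypothetical error type for an unmet runtime requirement."""


def check_auto_page_breaks(mode: str, requested: bool, com_available: bool) -> None:
    """Raise CapabilityError when auto page-break export cannot run.

    All names here are assumptions made for illustration.
    """
    if not requested:
        # Nothing was requested, so ordinary extraction keeps its fallback policy.
        return
    message = (
        "--auto-page-breaks-dir requires --mode standard or --mode verbose "
        "with Excel COM."
    )
    if mode in ("libreoffice", "light"):
        # These modes never produce auto page-break areas.
        raise CapabilityError(message)
    if not com_available:
        # standard/verbose were requested, but the host has no Excel COM.
        raise CapabilityError(message)
```

The point of this shape is that the host probe result is only consulted after the flag has been requested and the mode has been accepted.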

## Consequences

- `exstruct --help` and parser construction become faster and stop triggering
Excel startup side effects.
- CLI help becomes consistent across hosts because it documents syntax instead
of reflecting a startup-time environment probe.
- Users on unsupported hosts may still see `--auto-page-breaks-dir`, but they
now receive an actionable runtime error instead of hidden syntax or silent
export skipping.
- Future extraction CLI features that are host-capability-gated should prefer
execution-time validation over parser-time probing unless the syntax itself is
host-specific.

## Rationale

- Tests:
  - `tests/cli/test_cli.py`
- Code:
  - `src/exstruct/cli/main.py`
  - `src/exstruct/cli/availability.py`
- Related specs:
  - `docs/cli.md`
  - `README.md`
  - `README.ja.md`
  - `dev-docs/specs/excel-extraction.md`

## Supersedes

- None

## Superseded by

- None
1 change: 1 addition & 0 deletions dev-docs/adr/README.md
@@ -42,3 +42,4 @@ ADRs record what was decided, under which constraints, and which trade-offs were
 | `ADR-0005` | PathPolicy Safety Boundary | `accepted` | `safety` |
 | `ADR-0006` | Public Edit API and Host-Owned Safety Boundary | `accepted` | `editing` |
 | `ADR-0007` | Editing CLI as Public Operational Interface | `accepted` | `editing` |
+| `ADR-0008` | Extraction CLI Runtime Capability Validation | `accepted` | `cli` |
3 changes: 3 additions & 0 deletions dev-docs/adr/decision-map.md
@@ -6,6 +6,7 @@ This document is a human-readable map for navigating ADRs by domain.

 - `ADR-0001` Extraction Mode Responsibility Boundaries (`accepted`)
 - `ADR-0002` Rich Backend Fallback Policy (`accepted`)
+- `ADR-0008` Extraction CLI Runtime Capability Validation (`accepted`)

 ## mode

@@ -33,6 +34,7 @@ This document is a human-readable map for navigating ADRs by domain.
 - `ADR-0003` Output Serialization Omission Policy (`accepted`)
 - `ADR-0004` Patch Backend Selection Policy (`accepted`)
 - `ADR-0007` Editing CLI as Public Operational Interface (`accepted`)
+- `ADR-0008` Extraction CLI Runtime Capability Validation (`accepted`)

 ## mcp

@@ -57,6 +59,7 @@ This document is a human-readable map for navigating ADRs by domain.
 ## cli

 - `ADR-0007` Editing CLI as Public Operational Interface (`accepted`)
+- `ADR-0008` Extraction CLI Runtime Capability Validation (`accepted`)

 ## Supersession Relationships

16 changes: 16 additions & 0 deletions dev-docs/adr/index.yaml
@@ -102,3 +102,19 @@ adrs:
       - docs/api.md
       - README.md
       - README.ja.md
+  - id: ADR-0008
+    title: Extraction CLI Runtime Capability Validation
+    status: accepted
+    path: dev-docs/adr/ADR-0008-extraction-cli-runtime-capability-validation.md
+    primary_domain: cli
+    domains:
+      - cli
+      - extraction
+      - compatibility
+    supersedes: []
+    superseded_by: []
+    related_specs:
+      - docs/cli.md
+      - README.md
+      - README.ja.md
+      - dev-docs/specs/excel-extraction.md
1 change: 1 addition & 0 deletions dev-docs/specs/excel-extraction.md
@@ -56,6 +56,7 @@ What is extracted:

 - print_areas are retrieved via pre-com (openpyxl); COM only supplements missing parts
 - auto_page_breaks are retrieved via COM only
+- The extraction CLI always exposes auto page-break export syntax and validates the required mode/runtime at execution time instead of probing COM during parser construction

 ## Colors Map

3 changes: 3 additions & 0 deletions docs/cli.md
@@ -120,6 +120,7 @@ empty workbook.
 | `--include-backend-metadata` | Include shape/chart backend metadata (`provenance`, `approximation_level`, `confidence`) in structured output. |
 | `--sheets-dir DIR` | Write one file per sheet (format follows `--format`). |
 | `--print-areas-dir DIR` | Write one file per print area (format follows `--format`). |
+| `--auto-page-breaks-dir DIR` | Write one file per auto page-break area. The flag is always shown in help, but execution requires `--mode standard` or `--mode verbose` with Excel COM. |

 ## Common workflows

@@ -157,6 +158,8 @@ exstruct sample.xlsx --pdf --image --dpi 144 -o out.json
   are usually the better fit.
 - On non-COM environments, prefer `--mode libreoffice` for best-effort rich extraction on `.xlsx/.xlsm`, or `--mode light` for minimal extraction.
 - `--mode libreoffice` is best-effort, not a strict subset of COM output. It does not render PDFs/PNGs and does not compute auto page-break areas in v1.
+- `--auto-page-breaks-dir` is always shown in help output and is validated at execution time.
 - `--mode libreoffice` combined with `--pdf`, `--image`, or `--auto-page-breaks-dir` fails early with a configuration error instead of silently ignoring the option.
+- `--mode light` also rejects `--auto-page-breaks-dir`; use `--mode standard` or `--mode verbose` with Excel COM for auto page-break export.
 - `--sheets-dir` and `--print-areas-dir` accept existing or new directories (created if missing).
 - `--alpha-col` switches row column keys from legacy numeric strings (`"0"`, `"1"`, ...) to Excel-style keys (`"A"`, `"B"`, ...). CLI default is disabled for backward compatibility.
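
To make the difference between the two key styles concrete, here is a minimal sketch of the mapping from legacy 0-based numeric string keys to Excel-style letters. `column_key` and `to_alpha_keys` are hypothetical helpers written for this example, and the sketch does not claim to match how ExStruct implements `--alpha-col` internally.

```python
from __future__ import annotations


def column_key(index: int) -> str:
    """Convert a 0-based column index to an Excel-style letter key."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = ""
    n = index + 1  # Excel letters form a bijective base-26 system on 1-based numbers
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def to_alpha_keys(row: dict[str, object]) -> dict[str, object]:
    """Rewrite keys such as "0", "1", "26" as "A", "B", "AA"."""
    return {column_key(int(key)): value for key, value in row.items()}
```

For example, `to_alpha_keys({"0": "id", "26": "note"})` yields a row keyed by `"A"` and `"AA"`.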
2 changes: 1 addition & 1 deletion src/exstruct/cli/availability.py
@@ -22,7 +22,7 @@ class ComAvailability(BaseModel):


 def get_com_availability() -> ComAvailability:
-    """Detect whether Excel COM is available for CLI features.
+    """Detect whether Excel COM is available for runtime CLI features.

     Returns:
         ComAvailability describing whether COM features can be used.
63 changes: 40 additions & 23 deletions src/exstruct/cli/main.py
@@ -7,8 +7,9 @@
 import sys

 from exstruct import process_excel
-from exstruct.cli.availability import ComAvailability, get_com_availability
+from exstruct.cli.availability import get_com_availability
 from exstruct.cli.edit import is_edit_subcommand, run_edit_cli
+from exstruct.constraints import validate_libreoffice_process_request


 def _ensure_utf8_stdout() -> None:
@@ -29,33 +30,21 @@ def _ensure_utf8_stdout() -> None:
         return


-def _add_auto_page_breaks_argument(
-    parser: argparse.ArgumentParser, availability: ComAvailability
-) -> None:
-    """Add auto page-break export option when COM is available."""
-    if not availability.available:
-        return
+def _add_auto_page_breaks_argument(parser: argparse.ArgumentParser) -> None:
+    """Add the auto page-break export option to the extraction CLI."""
     parser.add_argument(
         "--auto-page-breaks-dir",
         type=Path,
         help=(
             "Optional directory to write one file per auto page-break area "
-            "(Excel COM only; not supported in libreoffice mode)."
+            "(format follows --format; requires --mode standard or "
+            "--mode verbose with Excel COM)."
         ),
     )


-def build_parser(
-    availability: ComAvailability | None = None,
-) -> argparse.ArgumentParser:
-    """Build the CLI argument parser.
-
-    Args:
-        availability: Optional COM availability for tests or overrides.
-
-    Returns:
-        Configured argument parser.
-    """
+def build_parser() -> argparse.ArgumentParser:
+    """Build the CLI argument parser."""
     parser = argparse.ArgumentParser(
         description="CLI for ExStruct extraction.",
         epilog=(
@@ -130,10 +119,7 @@ def build_parser(
         type=Path,
         help="Optional directory to write one file per print area (format follows --format).",
     )
-    resolved_availability = (
-        availability if availability is not None else get_com_availability()
-    )
-    _add_auto_page_breaks_argument(parser, resolved_availability)
+    _add_auto_page_breaks_argument(parser)
     parser.add_argument(
         "--alpha-col",
         action="store_true",
@@ -150,6 +136,36 @@ def build_parser(
     return parser


+def _validate_auto_page_breaks_request(args: argparse.Namespace) -> None:
+    """Validate runtime requirements for auto page-break export."""
+    auto_page_breaks_dir = getattr(args, "auto_page_breaks_dir", None)
+    if auto_page_breaks_dir is None:
+        return
+
+    message = (
+        "--auto-page-breaks-dir requires --mode standard or --mode verbose "
+        "with Excel COM."
+    )
+    if args.mode == "libreoffice":
+        validate_libreoffice_process_request(
+            args.input,
+            mode=args.mode,
+            include_auto_page_breaks=True,
+            pdf=args.pdf,
+            image=args.image,
+        )
+        return
+    if args.mode == "light":
+        raise RuntimeError(message)
+
+    availability = get_com_availability()
+    if availability.available:
+        return
+
+    reason = f" Reason: {availability.reason}" if availability.reason else ""
+    raise RuntimeError(f"{message}{reason}")
+
+
 def main(argv: list[str] | None = None) -> int:
     """Run the CLI entrypoint.

@@ -173,6 +189,7 @@ def main(argv: list[str] | None = None) -> int:
         return 0

     try:
+        _validate_auto_page_breaks_request(args)
         process_excel(
             file_path=input_path,
             output_path=args.output,