
Add model-agnostic analysis, response-cleaning, prompt/version tracking, incremental runs, and Chinese prompts#1

Open: Huberyky wants to merge 1 commit into main from codex/add-support-for-non-openai-models



@Huberyky
Owner

Motivation

  • Enable non-OpenAI (OpenAI‑compatible) models by restoring cost estimates and sanitizing vendor reasoning traces so downstream JSON parsing is robust.
  • Provide built‑in reliability/validation/robustness analytics so multi‑run outputs can produce standard measurement diagnostics used in social‑science research.
  • Improve reproducibility and incremental workflows by binding prompt versions to outputs, reporting parse success rates, and supporting prompt language selection (e.g., Chinese templates).

Description

  • Added a new analysis module gabriel.analysis exposing reliability, validate, and robustness helpers for Krippendorff's α, ICC, Pearson/Spearman, MAE, F1/Cohen's κ, stratified reports, and bootstrap CIs.
  • Introduced gabriel.utils.model_utils with strip_reasoning_tags, prompt_hash, write_run_metadata, and load_incremental_cache to strip vendor CoT traces (e.g., <think>...</think>), compute prompt hashes, and manage incremental reuse.
  • Extended the core tasks (rate, classify, extract) to accept prompt_language and incremental flags. Each task now writes prompt‑hash metadata to run_metadata.json, sanitizes raw responses with strip_reasoning_tags before parsing, produces a per‑task parse report (*_parse_report.csv), and attempts an incremental merged read when incremental=True.
  • Added Chinese Jinja2 prompt templates (ratings_prompt_zh.jinja2, classification_prompt_zh.jinja2, extraction_prompt_zh.jinja2) and wired template selection by prompt_language.
  • Added pricing entries for DeepSeek / Qwen families to MODEL_PRICING so cost estimates remain available when using OpenAI‑compatible base_urls.
  • Small plumbing: exported the new top‑level functions (gabriel.reliability, gabriel.validate, gabriel.robustness) and added the corresponding unit tests and utilities.
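
As an illustration of the sanitization and hashing steps described above, here is a minimal sketch. It is not the code in this PR: the function names are borrowed from the description, but their bodies, the length parameter, the choice of SHA‑256, and the assumption that reasoning traces always appear as <think>...</think> blocks are all hypothetical.

```python
import hashlib
import json
import re

# Assumption: vendor reasoning traces are wrapped in <think>...</think>.
# Other vendors may use different tags; this pattern only covers that one case.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL | re.IGNORECASE)


def strip_reasoning_tags(text: str) -> str:
    """Remove <think>...</think> blocks and trim surrounding whitespace."""
    return _THINK_BLOCK.sub("", text).strip()


def prompt_hash(prompt: str, length: int = 12) -> str:
    """Return a short, stable hex digest of the rendered prompt text.

    Assumption: a truncated SHA-256 digest is enough to tell prompt versions apart.
    """
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:length]


# A raw response with a reasoning trace in front of the JSON payload.
raw = '<think>The text is fairly positive.</think>\n{"rating": 4}'
cleaned = strip_reasoning_tags(raw)
parsed = json.loads(cleaned)  # parsing succeeds once the trace is removed
```

Stripping the trace before calling json.loads is what keeps the parse from failing on the leading non‑JSON text, and hashing the rendered prompt gives a compact identifier that could be stored alongside the outputs.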

Testing

  • Ran targeted unit tests: pytest -q tests/test_analysis_extensions.py tests/test_imports.py tests/test_discover_exports.py; all tests passed.
  • Performed static sanity checks: python -m py_compile $(rg --files src/gabriel -g '*.py' | tr '\n' ' ') succeeded (no syntax errors).
  • Verified template and runtime integration by running the new analysis and response‑sanitization flows in task code paths (parse reports created and incremental merge logic exercised during tests).
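
For reference, the parse success rate that a parse report summarizes could be computed roughly as follows. This is a sketch under stated assumptions: parse_success_summary and its output keys are hypothetical, and the actual columns of *_parse_report.csv are not shown in this description.

```python
import json
from typing import Iterable


def parse_success_summary(responses: Iterable[str]) -> dict:
    """Count how many already-sanitized responses parse as JSON.

    Hypothetical helper; the keys below are illustrative, not the PR's schema.
    """
    total = 0
    parsed = 0
    for raw in responses:
        total += 1
        try:
            json.loads(raw)
            parsed += 1
        except json.JSONDecodeError:
            # Leave unparseable responses out of the success count.
            pass
    rate = parsed / total if total else 0.0
    return {"total": total, "parsed": parsed, "success_rate": rate}


summary = parse_success_summary(
    ['{"label": "a"}', "not json", '{"label": "b"}', "{broken"]
)
```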

Codex Task

