shallwe16623/Limira_benchmark

OKA-SR

Local OKA-SR pipeline for building legal motion-to-dismiss casepacks, running OpenAI-compatible model providers, and scoring structured state-recovery output.

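If you have no coding or U.S. legal background and would like a Chinese-language introduction to the project's goals, file structure, task design, local and API run procedures, scoring labels, and data-safety boundaries, read docs/oka_sr_project_intro_zh.md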

Quick Start

cd oka_sr
python3 -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"

oka inventory --source ../case_download/outputs/2024_ip_mtd_relaxed
oka extract
oka mask --provider mock
oka gold --provider mock
oka flip --provider mock
oka build-casepacks
oka eval --providers mock --run-id mock-smoke
oka score --run runs/mock-smoke
oka report --run runs/mock-smoke
oka audit
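
The Quick Start steps above can also be driven from a script that stops at the first failing step. The following is a minimal sketch, not part of the repository: the helper names `quick_start_commands` and `run_pipeline` are hypothetical, and only the `oka` subcommands and flags already shown above are used.

```python
import subprocess
from typing import Callable, List


def quick_start_commands(source: str, run_id: str, provider: str = "mock") -> List[List[str]]:
    """Return the Quick Start steps as argv lists, in the documented order."""
    run_dir = f"runs/{run_id}"
    return [
        ["oka", "inventory", "--source", source],
        ["oka", "extract"],
        ["oka", "mask", "--provider", provider],
        ["oka", "gold", "--provider", provider],
        ["oka", "flip", "--provider", provider],
        ["oka", "build-casepacks"],
        ["oka", "eval", "--providers", provider, "--run-id", run_id],
        ["oka", "score", "--run", run_dir],
        ["oka", "report", "--run", run_dir],
        ["oka", "audit"],
    ]


def run_pipeline(commands: List[List[str]], runner: Callable = subprocess.run) -> None:
    """Run each step in order; check=True raises on the first non-zero exit."""
    for argv in commands:
        print("$", " ".join(argv))
        runner(argv, check=True)
```

Calling `run_pipeline(quick_start_commands("../case_download/outputs/2024_ip_mtd_relaxed", "mock-smoke"))` from inside the activated virtual environment would mirror the mock smoke run. The `runner` parameter exists only so the step list can be inspected or tested without invoking the CLI.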

To merge the 20-case case_new expansion into the local private working dataset, run:

oka inventory --source ../case_new/mtd_pr_order_casepack
oka extract
oka mask --provider mock
oka gold --provider mock
oka flip --provider mock
oka build-casepacks
oka audit

The current technical baseline attempts 29 source cases after the expansion. The automated quality gate promotes 16 source cases, producing 64 static prediction instances and 32 ARC-style world-model episodes. Excluded cases are listed with machine-readable reasons in reports/quality_promotion_ledger.jsonl.
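
To see why cases were excluded, the ledger can be summarized by reason. This is a rough sketch under a stated assumption: the ledger schema is not documented here, so the field name `reason` is hypothetical and should be replaced with whatever key the file actually uses.

```python
import json
from collections import Counter
from typing import Iterable


def tally_reasons(lines: Iterable[str], reason_field: str = "reason") -> Counter:
    """Count JSONL records per exclusion reason.

    `reason_field` is an assumed key name, not a documented one.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the ledger
        record = json.loads(line)
        counts[str(record.get(reason_field, "<missing>"))] += 1
    return counts
```

Passing an open file handle for reports/quality_promotion_ledger.jsonl as `lines` and printing `tally_reasons(...).most_common()` gives a quick view of which gate conditions exclude the most cases.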

For local Qwen through Ollama:

brew install ollama
oka ollama start --model qwen3:0.6b --pull
oka eval --providers ollama_qwen3_0_6b --quality-reviewed --run-id qwen3-0_6b-reviewed
oka score --run runs/qwen3-0_6b-reviewed
oka report --run runs/qwen3-0_6b-reviewed

The small local Qwen provider applies deterministic prompt truncation to long Document Mode inputs so that the reviewed subset can run on a laptop-sized model. Use ollama_qwen3_8b for a local evaluation with less aggressive truncation.
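
The truncation strategy itself is not described here. As an illustration only, one deterministic approach is to keep a fixed share of the head and tail of the prompt within a character budget. The function name, the marker text, and the head/tail split below are all assumptions and may differ from what the provider actually does.

```python
def truncate_prompt(text: str, max_chars: int,
                    marker: str = "\n[... truncated ...]\n") -> str:
    """Keep the start and end of `text` so the result is at most `max_chars` long.

    Deterministic: the same input and budget always give the same output.
    """
    if max_chars <= len(marker):
        raise ValueError("max_chars must exceed the marker length")
    if len(text) <= max_chars:
        return text
    budget = max_chars - len(marker)
    head = (budget + 1) // 2   # head gets the extra character when budget is odd
    tail = budget - head
    # text[-0:] would return the whole string, so handle tail == 0 explicitly.
    return text[:head] + marker + (text[-tail:] if tail else "")
```

A head-and-tail split is chosen here because the opening and closing passages of a long document often carry framing and conclusions; a production implementation would more likely budget by tokens than by characters.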

For a stronger sub-10B Qwen local run:

ollama pull qwen3.5:9b
oka eval --providers ollama_qwen3_5_9b --quality-reviewed --run-id qwen3_5-9b-reviewed
oka score --run runs/qwen3_5-9b-reviewed
oka report --run runs/qwen3_5-9b-reviewed

For the DeepSeek v4-pro cloud baseline, keep the API key in the shell environment only:

export DEEPSEEK_API_KEY="<your-deepseek-api-key>"
oka eval --providers deepseek_v4_pro --quality-reviewed --run-id deepseek_v4_pro-reviewed
oka score --run runs/deepseek_v4_pro-reviewed
oka report --run runs/deepseek_v4_pro-reviewed
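
Keeping the key in the environment means wrapper scripts should read it from there and fail early when it is absent, rather than storing it in a file. A minimal sketch, assuming a hypothetical helper `require_api_key` that is not part of the `oka` CLI:

```python
import os


def require_api_key(env_var: str = "DEEPSEEK_API_KEY") -> str:
    """Return the key from the environment, or raise before any request is made."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it in the shell before running oka eval"
        )
    return key
```

The error message names only the variable and never echoes the key, so it stays out of logs and run outputs.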

ARC-Style World Model Episodes

The world-model path sits beside the static casepack benchmark. It builds multi-step episodes where a provider requests sanitized materials, declares a legal state, predicts the base outcome, predicts the state-flip transition, and is scored on outcome, world-model recovery, transition correctness, evidence grounding, and action efficiency.

oka wm-build --quality-reviewed
oka wm-eval --providers mock --run-id wm-mock-reviewed
oka wm-score --run runs/wm-mock-reviewed
oka wm-report --run runs/wm-mock-reviewed
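
The five scoring dimensions named above can be pictured as one record per episode. The sketch below is illustrative only: the class name, the [0, 1] range, and the unweighted mean are assumptions, and the actual wm-score output format and aggregation may differ.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EpisodeScore:
    """One episode's scores on the five dimensions (assumed to lie in [0, 1])."""
    outcome: float
    world_model_recovery: float
    transition_correctness: float
    evidence_grounding: float
    action_efficiency: float

    def __post_init__(self) -> None:
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def mean(self) -> float:
        """Unweighted average across dimensions (a hypothetical aggregate)."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)
```

Keeping the dimensions as separate fields, rather than storing only an aggregate, makes it possible to tell a model that predicts outcomes well but grounds them poorly from one with the opposite profile.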

The v1 world-model report is a research technical baseline only. It does not claim final legal prediction capability, and its action-efficiency score is relative to a reference action baseline rather than a formal human baseline. The mock provider is a reference smoke baseline that uses private local casepack answers to verify the runner/scorer pipeline; use Qwen or DeepSeek providers for actual model behavior.

See docs/arc_world_model_benchmark_zh.md for the Chinese design note.

After running oka audit, prefer oka eval --quality-reviewed ... for any run whose results you plan to report. The full casepack file remains available for diagnostics; the quality-reviewed file excludes cases whose order, gold, or flip artifacts did not pass the Codex quality gate.

For the DeepSeek chat provider:

export DEEPSEEK_API_KEY=...
oka eval --providers deepseek_chat

Repository Data Policy

This repository intentionally excludes raw PDFs, extracted source text, private gold files, hidden answer keys, private casepacks, and raw run outputs. The checked-in public data file is a sanitized reviewed-subset stub:

data/casepacks_public_visible/casepacks_visible_quality_reviewed_safe.jsonl

Use the local private data/ directory to reproduce full Document Mode runs.

Public Release Candidates

Release candidates are built from allowlisted public-safe files only. The GitHub package contains source code, tests, docs, safe public casepacks, public world-model episodes, selected reports, checksums, and a release manifest. The Hugging Face package contains dataset files, schema notes, metrics, checksums, and a dataset card.

oka release-build --target github --output dist/github
oka release-build --target huggingface --output dist/huggingface
oka release-verify --target github --input dist/github
oka release-verify --target huggingface --input dist/huggingface

The release verifier rejects raw case folders, raw PDFs, extracted text, private gold, hidden answer keys, local source paths, raw logs, identifier-like public IDs, judge/counsel signature names, and obvious case/docket URL leakage. The public-safe Document materials are release-safe digest views rather than long source-text clones; private full-text runs remain local.
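
A few of these rejection rules can be approximated with simple path and content checks. The sketch below is not the verifier's implementation: the function name and patterns are hypothetical, it covers only three of the listed categories, and it flags every URL rather than only case or docket URLs.

```python
import re
from pathlib import PurePosixPath
from typing import List

# Assumed patterns for illustration; the real verifier's rules are not shown here.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)
LOCAL_PATH_PATTERN = re.compile(r"(?:^|[\s\"'])(?:/Users/|/home/)\S+")


def find_violations(relative_path: str, text: str) -> List[str]:
    """Return labels for release-safety problems found in one candidate file."""
    problems: List[str] = []
    if PurePosixPath(relative_path).suffix.lower() == ".pdf":
        problems.append("raw PDF")
    if URL_PATTERN.search(text):
        problems.append("URL in content")
    if LOCAL_PATH_PATTERN.search(text):
        problems.append("local source path")
    return problems
```

A check of this kind complements the allowlist rather than replacing it: the allowlist decides which files may ship, and content scanning catches leaks inside files that were allowed.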
