5 changes: 5 additions & 0 deletions Dockerfile
@@ -12,6 +12,11 @@ RUN pip install --no-cache-dir --upgrade pip \

COPY . .

# Run as non-root for security
RUN useradd --create-home --shell /bin/bash appuser \
&& chown -R appuser:appuser /app
USER appuser

EXPOSE 7860

HEALTHCHECK --interval=30s --timeout=5s --start-period=5s --retries=3 \
111 changes: 78 additions & 33 deletions README.md
@@ -27,6 +27,8 @@ lawyers, procurement teams, and founders. Key challenges for an AI agent:
- **Partial-progress rewards**: improving a clause partially (e.g., adding a
liability cap without addressing IP ownership) deserves more reward than doing
nothing — but less than resolving every risk.
- **Multi-turn dynamics**: the counterparty pushes back on proposals, requiring
adaptive negotiation strategies across multiple rounds.

---

@@ -39,6 +41,9 @@ lawyers, procurement teams, and founders. Key challenges for an AI agent:
| `hard_conflicting_obligations` | Hard (4/5) | Performance/Changes | HIGH | Yes |
| `easy_compliance_agreement` | Easy+ (2/5) | Compliance | LOW | No |
| `hard_intellectual_property` | Hard+ (5/5) | IP Ownership | HIGH | Yes |
| `medium_confidentiality_nda` | Medium+ (3/5) | Confidentiality | MODERATE | Yes |
| `hard_termination_convenience` | Hard++ (4/5) | Termination | HIGH | Yes |
| `expert_data_protection` | Expert (5/5) | Data Protection | HIGH | Yes |

### Task descriptions

@@ -63,6 +68,43 @@ improvement: adding explicit breach-notification obligations ("+6% bonus for
the customer provides specifications. The agent must rewrite to assign IP to the
customer and limit the supplier to a scoped license.

**medium_confidentiality_nda** — An overbroad NDA with perpetual obligations and
no carve-outs for public information. The agent must narrow the scope, add a
time limit (3 years), and carve out publicly available and independently
developed information.

**hard_termination_convenience** — A one-sided termination clause allowing only
the Supplier to terminate at will with 5-day notice, while the Customer has no
termination rights and waives all remedies. The agent must establish mutual
termination, add a 30-day cure period, and include transition/wind-down
provisions.

**expert_data_protection** — A clause giving the Supplier blanket authority to
process personal data, transfer it to any jurisdiction, engage sub-processors
without notice, and waive data-subject rights. The agent must add DPA
requirements, 72-hour breach notification, sub-processor consent, data-subject
rights assistance, and data deletion obligations.

---

## Opponent Simulation

Each task includes **opponent responses** keyed by action type. When the agent
takes an action (e.g., `FLAG_RISK`, `EDIT_CLAUSE`), the counterparty replies
with a contextually appropriate pushback or counter-proposal, creating realistic
multi-turn negotiation dynamics:

```
agent → FLAG_RISK
opponent → "Our legal team considers this standard. What specific cap do you propose?"
agent → EDIT_CLAUSE (with cap at 12 months)
opponent → "We can accept a cap but consequential damages must remain."
agent → PROPOSE_COUNTER (addressing consequential damages)
...
```

Opponent replies appear in the `negotiation_history` and in `info.opponent_reply`.
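
The exchange above can be sketched as a lookup keyed by action type. This is a minimal illustration, not the code in `environment.py`: the `Negotiation` class, the `OPPONENT_RESPONSES` table, and the fallback reply are hypothetical names, and only the history line format and the `opponent_reply` key are taken from this README.

```python
from dataclasses import dataclass, field

# Hypothetical reply table; the real tasks define their own responses per action type.
OPPONENT_RESPONSES = {
    "FLAG_RISK": "Our legal team considers this standard. What specific cap do you propose?",
    "EDIT_CLAUSE": "We can accept a cap but consequential damages must remain.",
}
DEFAULT_REPLY = "We will need to review this internally."  # assumed fallback


@dataclass
class Negotiation:
    """Tracks a negotiation and answers each agent action with a canned reply."""

    history: list = field(default_factory=list)
    step_count: int = 0

    def step(self, action_type: str, content: str = "") -> dict:
        self.step_count += 1
        # Record the agent turn in the "role|message" format used by the observation.
        self.history.append(
            f"agent|step={self.step_count} action={action_type} content_len={len(content)}"
        )
        # Look up the counterparty reply for this action type.
        reply = OPPONENT_RESPONSES.get(action_type, DEFAULT_REPLY)
        self.history.append(f"opponent|[Counterparty] {reply}")
        return {"opponent_reply": reply}
```

Keying replies by action type keeps the simulation deterministic, which makes grader output reproducible across runs.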

---

## Observation Space
@@ -72,12 +114,13 @@ Every call to `/reset` or `/step` returns an `Observation`:
```json
{
"contract_text": "string — the current clause text (may be rewritten after EDIT/PROPOSE)",
-"clause_type": "string — e.g. liability, term_renewal, intellectual_property",
+"clause_type": "string — e.g. liability, term_renewal, intellectual_property, confidentiality, termination, data_protection",
"risk_level": "float ∈ (0, 1) — observed risk density (0=safe, 1=highly risky)",
"step_count": "int — steps taken so far (0 = just reset)",
"negotiation_history": [
"opponent|[Counterparty] Unlimited indemnity is standard.",
"agent|step=1 action=FLAG_RISK content_len=0",
"opponent|[Counterparty] Our legal team considers this standard.",
"..."
]
}
@@ -109,16 +152,20 @@ Sending empty content returns a validation error and a near-zero reward.
Every step returns a scalar `reward ∈ [0.001, 0.999]`, computed as:

```
-reward = 0.40 × correctness
-       + 0.30 × improvement
-       + 0.30 × risk_alignment
+reward = 0.35 × correctness
+       + 0.25 × improvement
+       + 0.25 × risk_alignment
+       + 0.10 × semantic_similarity
+       + 0.05 × completeness
```
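
The updated five-component sum can be sketched as follows. This is a minimal illustration under stated assumptions: `combine_reward` and `WEIGHTS` are hypothetical names, each component is assumed to lie in [0, 1], and clamping the result to the documented reward bounds is an assumption about how `graders.py` behaves.

```python
# Weights from the updated formula; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.35,
    "improvement": 0.25,
    "risk_alignment": 0.25,
    "semantic_similarity": 0.10,
    "completeness": 0.05,
}
REWARD_MIN, REWARD_MAX = 0.001, 0.999  # documented reward bounds


def combine_reward(components: dict) -> float:
    """Weighted sum of component scores, clamped to the reward bounds.

    Missing components count as 0.0 (an assumption made for this sketch).
    """
    raw = sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
    return min(REWARD_MAX, max(REWARD_MIN, raw))
```

With this clamp, a perfect action never returns exactly 1.0 and an action that scores zero on every component never returns exactly 0.0.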

| Component | What it measures |
|-----------|-----------------|
-| **Correctness** (40%) | For EDIT/PROPOSE: how much risky language was *removed* from the original. For FLAG/REJECT/ACCEPT: how many risk keywords are identified in context. |
-| **Improvement** (30%) | How well the proposed edit matches safe keywords and the expected safe rewrite. |
-| **Risk Alignment** (30%) | Whether the chosen action is appropriate for the current risk level (e.g., editing a HIGH-risk clause scores 0.92×; accepting it scores 0.20×). |
+| **Correctness** (35%) | For EDIT/PROPOSE: how much risky language was *removed* from the original. For FLAG/REJECT/ACCEPT: how many risk keywords are identified in context. |
+| **Improvement** (25%) | How well the proposed edit matches safe keywords and the expected safe rewrite. |
+| **Risk Alignment** (25%) | Whether the chosen action is appropriate for the current risk level (e.g., editing a HIGH-risk clause scores 0.92×; accepting it scores 0.20×). |
+| **Semantic Similarity** (10%) | Combined Jaccard + cosine similarity between the rewrite and the expected safe edit. |
+| **Completeness** (5%) | Fraction of required legal elements present in the rewritten clause (e.g., liability cap, notice period, cure clause). |
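
The two new components could be computed roughly as below. This is a sketch, not the `semantic_similarity` or `clause_completeness_score` implementation exported by this PR: the function names, the word-level tokenizer, the equal 50/50 blend of Jaccard and cosine, and the substring test for required elements are all assumptions.

```python
import math
import re
from collections import Counter


def _tokens(text: str) -> list:
    """Lowercase word tokens; a deliberately simple tokenizer for the sketch."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: str, b: str) -> float:
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    sa, sb = set(_tokens(a)), set(_tokens(b))
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(_tokens(a)), Counter(_tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def semantic_similarity_sketch(rewrite: str, expected: str) -> float:
    """Equal-weight blend of Jaccard and cosine (the 50/50 split is assumed)."""
    return 0.5 * jaccard(rewrite, expected) + 0.5 * cosine(rewrite, expected)


def completeness_sketch(rewrite: str, required_elements: list) -> float:
    """Fraction of required phrases that appear in the rewrite (substring match)."""
    if not required_elements:
        return 0.0
    text = rewrite.lower()
    hits = sum(1 for phrase in required_elements if phrase.lower() in text)
    return hits / len(required_elements)
```

Jaccard ignores word frequency while cosine accounts for it, so blending the two penalizes both missing vocabulary and heavy repetition of a few matching terms.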

### Task-specific adjustments

@@ -129,6 +176,9 @@ reward = 0.40 × correctness
| Hard | −50% penalty when hidden trap markers remain in the proposed text |
| Easy+ | +6% bonus for including breach-notification language |
| Hard+ | −45% penalty for unresolved IP traps; +7% bonus for explicit customer ownership |
| Medium+ | +8% bonus for well-scoped NDA; −30% for accepting overbroad terms |
| Hard++ | −45% penalty for unresolved one-sided termination; +9% for cure-period language |
| Expert | −50% penalty for missing data-protection safeguards; +10% for GDPR language (requires ≥2 indicators) |

Blocked accepts (accepting HIGH-risk text) are clamped to `0.001`.
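
One way to read the adjustment table is sketched below. The README does not say whether the percentages are multiplicative or additive, so this sketch assumes they scale the base reward multiplicatively; `adjust_reward` and `is_success` are hypothetical helpers, and only the `0.001` clamp and the `0.50` success threshold come from this document.

```python
REWARD_MIN, REWARD_MAX = 0.001, 0.999  # documented reward bounds
SUCCESS_THRESHOLD = 0.50               # episode succeeds when score >= 0.50


def adjust_reward(base: float, *, bonus: float = 0.0, penalty: float = 0.0,
                  blocked_accept: bool = False) -> float:
    """Apply a task-specific bonus and penalty to a base reward.

    Assumption: both are multiplicative fractions, so bonus=0.10 means +10%
    and penalty=0.50 means -50%.
    """
    if blocked_accept:
        # Accepting HIGH-risk text is clamped to the minimum reward.
        return REWARD_MIN
    adjusted = base * (1.0 + bonus) * (1.0 - penalty)
    return min(REWARD_MAX, max(REWARD_MIN, adjusted))


def is_success(score: float) -> bool:
    """An episode counts as successful at or above the threshold."""
    return score >= SUCCESS_THRESHOLD
```

Under this reading, an Expert-task rewrite with a base reward of 0.50 and the GDPR bonus would score about 0.55, while the same base with the missing-safeguards penalty would score 0.25.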

@@ -139,22 +189,6 @@ An episode is considered successful if `score ≥ 0.50`.

---

-## Reference Baseline Scores
-
-Measured over 1 episode per task with `Qwen/Qwen2.5-72B-Instruct`:
-
-| Task | Avg reward/step | Episode score |
-|------|----------------|---------------|
-| `easy_unlimited_liability` | 0.64 | 0.64 |
-| `medium_auto_renewal` | 0.58 | 0.58 |
-| `hard_conflicting_obligations` | 0.45 | 0.45 |
-| `easy_compliance_agreement` | 0.61 | 0.61 |
-| `hard_intellectual_property` | 0.42 | 0.42 |
-
-A random agent achieves approximately 0.28 average per step across all tasks.
-
----
-
## API Endpoints

| Method | Path | Description |
@@ -175,7 +209,7 @@ A random agent achieves approximately 0.28 average per step across all tasks.

```bash
pip install -e ".[dev]"
-python -m unittest discover contract_env/tests/ -v
+python -m pytest contract_env/tests/ -v # 78 tests
```

### Run the server
@@ -188,8 +222,14 @@ uvicorn contract_env.server.app:app --host 0.0.0.0 --port 7860

```bash
export HF_TOKEN="your-huggingface-token"
-python inference.py --benchmark # one episode per task (5 total)
+python inference.py --benchmark # one episode per task (8 total)
python inference.py --episodes 3 # run 3 episodes cycling through tasks

# Retry any task that scores below 0.4:
python inference.py --benchmark --retry-low 0.4

# Against the Docker API server (for competition evaluation):
python inference.py --benchmark --mode api
```

### Docker
@@ -200,6 +240,9 @@ docker run -p 7860:7860 \
-e HF_TOKEN=your-token \
-e MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
contract-negotiation-env

# Then run inference against the Docker server:
python inference.py --benchmark --mode api
```

---
@@ -212,7 +255,9 @@ docker run -p 7860:7860 \
| `API_BASE_URL` | No | `https://router.huggingface.co/v1` | LLM API endpoint |
| `MODEL_NAME` | No | `Qwen/Qwen2.5-72B-Instruct` | Model identifier |
| `BENCHMARK` | No | `contract_negotiation` | Benchmark name in [START] log line |
| `ENV_SERVER_URL` | No | `http://localhost:7860` | Docker server URL (for `--mode api`) |
| `PORT` | No | `7860` | Server port |
| `CORS_ORIGINS` | No | `*` | Comma-separated allowed CORS origins |
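
The variables and defaults in this table could be read as shown below. This is a sketch only: `load_settings` is a hypothetical helper, the rows hidden by the collapsed part of the diff are left out, and splitting `CORS_ORIGINS` on commas follows the table description without being confirmed against `app.py`.

```python
import os


def load_settings(env=None) -> dict:
    """Read the documented environment variables, falling back to the table defaults."""
    env = os.environ if env is None else env
    origins = env.get("CORS_ORIGINS", "*")
    return {
        "api_base_url": env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model_name": env.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        "benchmark": env.get("BENCHMARK", "contract_negotiation"),
        "env_server_url": env.get("ENV_SERVER_URL", "http://localhost:7860"),
        "port": int(env.get("PORT", "7860")),
        # Comma-separated list; surrounding whitespace and empty entries are dropped.
        "cors_origins": [o.strip() for o in origins.split(",") if o.strip()],
    }
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise in tests without touching the process environment.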

---

@@ -221,19 +266,19 @@ docker run -p 7860:7860 \
```
contract_env/
├── env/
-│ ├── environment.py # ContractEnv — reset/step/state, 7-step episodes
-│ ├── graders.py # evaluate_action() + 5 task-specific grader functions
+│ ├── environment.py # ContractEnv — reset/step/state, 7-step episodes, opponent simulation
+│ ├── graders.py # evaluate_action() + 8 task-specific grader functions + semantic/completeness scoring
│ ├── models.py # Pydantic v2 models: Action, Observation, Reward
-│ └── tasks.py # 5 NegotiationTask definitions with metadata
+│ └── tasks.py # 8 NegotiationTask definitions with metadata + opponent responses
├── server/
│ └── app.py # FastAPI server (port 7860)
├── tests/
-│ ├── test_graders.py # 13 unit tests covering all grader edge cases
+│ ├── test_graders.py # Grader unit tests covering all edge cases + new metrics
│ ├── test_api.py # API endpoint tests
-│ └── test_smoke.py # Smoke tests
-└── client.py # HTTP client helper
-inference.py # LLM-driven baseline agent
-openenv.yaml # OpenEnv manifest (spec_version: 1)
+│ └── test_smoke.py # Smoke tests including opponent simulation + opponent stance parsing
+└── client.py # HTTP client helper with from_docker_image() support
+inference.py # LLM-driven agent with opponent-aware multi-turn strategy + HTTP mode
+openenv.yaml # OpenEnv manifest (spec_version: 1, 8 graded tasks, action_space)
Dockerfile # Python 3.10-slim container, port 7860
verify_graders.py # Pre-submission grader validation script
```
15 changes: 14 additions & 1 deletion contract_env/env/__init__.py
@@ -5,6 +5,13 @@
grade_easy,
grade_medium,
grade_hard,
grade_easy_plus,
grade_hard_plus,
grade_medium_plus,
grade_hard_plus2,
grade_expert,
clause_completeness_score,
semantic_similarity,
TASK_GRADERS,
GRADED_TASKS,
NUM_GRADED_TASKS,
@@ -18,7 +25,6 @@
validate_all_tasks_have_graders,
GRADED_TASK_IDS,
GRADED_TASK_NAMES,
-NUM_GRADED_TASKS as TASKS_NUM_GRADED,
)

__all__ = [
@@ -33,6 +39,13 @@
"grade_easy",
"grade_medium",
"grade_hard",
"grade_easy_plus",
"grade_hard_plus",
"grade_medium_plus",
"grade_hard_plus2",
"grade_expert",
"clause_completeness_score",
"semantic_similarity",
"TASK_GRADERS",
"GRADED_TASKS",
"NUM_GRADED_TASKS",