bigturtle679 · Apr 11, 2026
diff --git a/Dockerfile b/Dockerfile
@@ -12,6 +12,11 @@ RUN pip install --no-cache-dir --upgrade pip \
 
 COPY . .
 
+# Run as non-root for security
+RUN useradd --create-home --shell /bin/bash appuser \
+    && chown -R appuser:appuser /app
+USER appuser
+
 EXPOSE 7860
 
 HEALTHCHECK --interval=30s --timeout=5s --start-period=5s --retries=3 \

diff --git a/README.md b/README.md
@@ -27,6 +27,8 @@ lawyers, procurement teams, and founders. Key challenges for an AI agent:
 - **Partial-progress rewards**: improving a clause partially (e.g., adding a
   liability cap without addressing IP ownership) deserves more reward than doing
   nothing — but less than resolving every risk.
+- **Multi-turn dynamics**: the counterparty pushes back on proposals, requiring
+  adaptive negotiation strategies across multiple rounds.
 
 ---
 
@@ -39,6 +41,9 @@ lawyers, procurement teams, and founders. Key challenges for an AI agent:
 | `hard_conflicting_obligations` | Hard (4/5) | Performance/Changes | HIGH | Yes |
 | `easy_compliance_agreement` | Easy+ (2/5) | Compliance | LOW | No |
 | `hard_intellectual_property` | Hard+ (5/5) | IP Ownership | HIGH | Yes |
+| `medium_confidentiality_nda` | Medium+ (3/5) | Confidentiality | MODERATE | Yes |
+| `hard_termination_convenience` | Hard++ (4/5) | Termination | HIGH | Yes |
+| `expert_data_protection` | Expert (5/5) | Data Protection | HIGH | Yes |
 
 ### Task descriptions
 
@@ -63,6 +68,43 @@ improvement: adding explicit breach-notification obligations ("+6% bonus for
 the customer provides specifications. The agent must rewrite to assign IP to the
 customer and limit the supplier to a scoped license.
 
+**medium_confidentiality_nda** — An overbroad NDA with perpetual obligations and
+no carve-outs for public information. The agent must narrow the scope, add a
+time limit (3 years), and carve out publicly available and independently
+developed information.
+
+**hard_termination_convenience** — A one-sided termination clause allowing only
+the Supplier to terminate at will with 5-day notice, while the Customer has no
+termination rights and waives all remedies. The agent must establish mutual
+termination, add a 30-day cure period, and include transition/wind-down
+provisions.
+
+**expert_data_protection** — A clause giving the Supplier blanket authority to
+process personal data, transfer it to any jurisdiction, engage sub-processors
+without notice, and waive data-subject rights. The agent must add DPA
+requirements, 72-hour breach notification, sub-processor consent, data-subject
+rights assistance, and data deletion obligations.
+
+---
+
+## Opponent Simulation
+
+Each task includes **opponent responses** keyed by action type. When the agent
+takes an action (e.g., `FLAG_RISK`, `EDIT_CLAUSE`), the counterparty replies
+with a contextually appropriate pushback or counter-proposal, creating realistic
+multi-turn negotiation dynamics:
+
+```
+agent   → FLAG_RISK
+opponent → "Our legal team considers this standard. What specific cap do you propose?"
+agent   → EDIT_CLAUSE (with cap at 12 months)
+opponent → "We can accept a cap but consequential damages must remain."
+agent   → PROPOSE_COUNTER (addressing consequential damages)
+...
+```
+
+Opponent replies appear in the `negotiation_history` and in `info.opponent_reply`.
+
 ---
 
 ## Observation Space
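
Note on the Opponent Simulation section added in this hunk: the README says replies are keyed by action type and surface both in `negotiation_history` and in `info.opponent_reply`. A minimal sketch of that lookup follows. It is an illustration, not the code in `environment.py`: the names `OPPONENT_RESPONSES` and `opponent_step` are hypothetical, and the history strings only mimic the format shown in the README's Observation example.

```python
from typing import Dict, List, Optional

# Hypothetical reply table. The two strings are quoted from the README example;
# the real per-task tables live in tasks.py and may be structured differently.
OPPONENT_RESPONSES: Dict[str, str] = {
    "FLAG_RISK": "Our legal team considers this standard. What specific cap do you propose?",
    "EDIT_CLAUSE": "We can accept a cap but consequential damages must remain.",
}


def opponent_step(action_type: str, content: str, history: List[str]) -> Optional[str]:
    """Record the agent turn, then the opponent reply if one is configured."""
    # Step numbers count agent turns only, starting at 1.
    step = sum(1 for entry in history if entry.startswith("agent|")) + 1
    history.append(f"agent|step={step} action={action_type} content_len={len(content)}")

    reply = OPPONENT_RESPONSES.get(action_type)
    if reply is not None:
        history.append(f"opponent|[Counterparty] {reply}")
    return reply


# Usage: the reply is returned for `info` and also appended to the history.
history: List[str] = []
info = {"opponent_reply": opponent_step("FLAG_RISK", "", history)}
```

Keeping the reply table as plain data means a new task only needs new strings, not new control flow. Whether the PR does it this way is an assumption.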
@@ -72,12 +114,13 @@ Every call to `/reset` or `/step` returns an `Observation`:
 ```json
 {
   "contract_text": "string — the current clause text (may be rewritten after EDIT/PROPOSE)",
-  "clause_type": "string — e.g. liability, term_renewal, intellectual_property",
+  "clause_type": "string — e.g. liability, term_renewal, intellectual_property, confidentiality, termination, data_protection",
   "risk_level": "float ∈ (0, 1) — observed risk density (0=safe, 1=highly risky)",
   "step_count": "int — steps taken so far (0 = just reset)",
   "negotiation_history": [
     "opponent|[Counterparty] Unlimited indemnity is standard.",
     "agent|step=1 action=FLAG_RISK content_len=0",
+    "opponent|[Counterparty] Our legal team considers this standard.",
     "..."
   ]
 }
@@ -109,16 +152,20 @@ Sending empty content returns a validation error and a near-zero reward.
 Every step returns a scalar `reward ∈ (0.001, 0.999)`, computed as:
 
 ```
-reward = 0.40 × correctness
-       + 0.30 × improvement
-       + 0.30 × risk_alignment
+reward = 0.35 × correctness
+       + 0.25 × improvement
+       + 0.25 × risk_alignment
+       + 0.10 × semantic_similarity
+       + 0.05 × completeness
 ```
 
 | Component | What it measures |
 |-----------|-----------------|
-| **Correctness** (40%) | For EDIT/PROPOSE: how much risky language was *removed* from the original. For FLAG/REJECT/ACCEPT: how many risk keywords are identified in context. |
-| **Improvement** (30%) | How well the proposed edit matches safe keywords and the expected safe rewrite. |
-| **Risk Alignment** (30%) | Whether the chosen action is appropriate for the current risk level (e.g., editing a HIGH-risk clause scores 0.92×; accepting it scores 0.20×). |
+| **Correctness** (35%) | For EDIT/PROPOSE: how much risky language was *removed* from the original. For FLAG/REJECT/ACCEPT: how many risk keywords are identified in context. |
+| **Improvement** (25%) | How well the proposed edit matches safe keywords and the expected safe rewrite. |
+| **Risk Alignment** (25%) | Whether the chosen action is appropriate for the current risk level (e.g., editing a HIGH-risk clause scores 0.92×; accepting it scores 0.20×). |
+| **Semantic Similarity** (10%) | Combined Jaccard + cosine similarity between the rewrite and the expected safe edit. |
+| **Completeness** (5%) | Fraction of required legal elements present in the rewritten clause (e.g., liability cap, notice period, cure clause). |
 
 ### Task-specific adjustments
 
@@ -129,6 +176,9 @@ reward = 0.40 × correctness
 | Hard | −50% penalty when hidden trap markers remain in the proposed text |
 | Easy+ | +6% bonus for including breach-notification language |
 | Hard+ | −45% penalty for unresolved IP traps; +7% bonus for explicit customer ownership |
+| Medium+ | +8% bonus for well-scoped NDA; −30% for accepting overbroad terms |
+| Hard++ | −45% penalty for unresolved one-sided termination; +9% for cure-period language |
+| Expert | −50% penalty for missing data-protection safeguards; +10% for GDPR language (requires ≥2 indicators) |
 
 Blocked accepts (accepting HIGH-risk text) are clamped to `0.001`.
 
@@ -139,22 +189,6 @@ An episode is considered successful if `score ≥ 0.50`.
 
 ---
 
-## Reference Baseline Scores
-
-Measured over 1 episode per task with `Qwen/Qwen2.5-72B-Instruct`:
-
-| Task | Avg reward/step | Episode score |
-|------|----------------|---------------|
-| `easy_unlimited_liability` | 0.64 | 0.64 |
-| `medium_auto_renewal` | 0.58 | 0.58 |
-| `hard_conflicting_obligations` | 0.45 | 0.45 |
-| `easy_compliance_agreement` | 0.61 | 0.61 |
-| `hard_intellectual_property` | 0.42 | 0.42 |
-
-A random agent achieves approximately 0.28 average per step across all tasks.
-
----
-
 ## API Endpoints
 
 | Method | Path | Description |
@@ -175,7 +209,7 @@ A random agent achieves approximately 0.28 average per step across all tasks.
 
 ```bash
 pip install -e ".[dev]"
-python -m unittest discover contract_env/tests/ -v
+python -m pytest contract_env/tests/ -v   # 78 tests
 ```
 
 ### Run the server
@@ -188,8 +222,14 @@ uvicorn contract_env.server.app:app --host 0.0.0.0 --port 7860
 
 ```bash
 export HF_TOKEN="your-huggingface-token"
-python inference.py --benchmark    # one episode per task (5 total)
+python inference.py --benchmark    # one episode per task (8 total)
 python inference.py --episodes 3   # run 3 episodes cycling through tasks
+
+# Retry any task that scores below 0.4:
+python inference.py --benchmark --retry-low 0.4
+
+# Against the Docker API server (for competition evaluation):
+python inference.py --benchmark --mode api
 ```
 
 ### Docker
@@ -200,6 +240,9 @@ docker run -p 7860:7860 \
   -e HF_TOKEN=your-token \
   -e MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
   contract-negotiation-env
+
+# Then run inference against the Docker server:
+python inference.py --benchmark --mode api
 ```
 
 ---
@@ -212,7 +255,9 @@ docker run -p 7860:7860 \
 | `API_BASE_URL` | No | `https://router.huggingface.co/v1` | LLM API endpoint |
 | `MODEL_NAME` | No | `Qwen/Qwen2.5-72B-Instruct` | Model identifier |
 | `BENCHMARK` | No | `contract_negotiation` | Benchmark name in [START] log line |
+| `ENV_SERVER_URL` | No | `http://localhost:7860` | Docker server URL (for `--mode api`) |
 | `PORT` | No | `7860` | Server port |
+| `CORS_ORIGINS` | No | `*` | Comma-separated allowed CORS origins |
 
 ---
 
@@ -221,19 +266,19 @@ docker run -p 7860:7860 \
 ```
 contract_env/
 ├── env/
-│   ├── environment.py   # ContractEnv — reset/step/state, 7-step episodes
-│   ├── graders.py       # evaluate_action() + 5 task-specific grader functions
+│   ├── environment.py   # ContractEnv — reset/step/state, 7-step episodes, opponent simulation
+│   ├── graders.py       # evaluate_action() + 8 task-specific grader functions + semantic/completeness scoring
 │   ├── models.py        # Pydantic v2 models: Action, Observation, Reward
-│   └── tasks.py         # 5 NegotiationTask definitions with metadata
+│   └── tasks.py         # 8 NegotiationTask definitions with metadata + opponent responses
 ├── server/
 │   └── app.py           # FastAPI server (port 7860)
 ├── tests/
-│   ├── test_graders.py  # 13 unit tests covering all grader edge cases
+│   ├── test_graders.py  # Grader unit tests covering all edge cases + new metrics
 │   ├── test_api.py      # API endpoint tests
-│   └── test_smoke.py    # Smoke tests
-└── client.py            # HTTP client helper
-inference.py             # LLM-driven baseline agent
-openenv.yaml             # OpenEnv manifest (spec_version: 1)
+│   └── test_smoke.py    # Smoke tests including opponent simulation + opponent stance parsing
+└── client.py            # HTTP client helper with from_docker_image() support
+inference.py             # LLM-driven agent with opponent-aware multi-turn strategy + HTTP mode
+openenv.yaml             # OpenEnv manifest (spec_version: 1, 8 graded tasks, action_space)
 Dockerfile               # Python 3.10-slim container, port 7860
 verify_graders.py        # Pre-submission grader validation script
 ```
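
Note on the README reward change above: the formula now has five weighted components, two of them new (semantic similarity and completeness). The sketch below shows one way those pieces could fit together. It rests on several assumptions and is not the code in `graders.py`: the function names are hypothetical, "combined Jaccard + cosine" is read as an unweighted mean over bag-of-words tokens, completeness uses plain substring matching, and the `(0.001, 0.999)` range is enforced with a simple clamp. Task-specific bonuses and penalties are left out.

```python
import math
import re
from collections import Counter
from typing import Iterable, List, Mapping

# Weights copied from the updated README formula.
WEIGHTS = {
    "correctness": 0.35,
    "improvement": 0.25,
    "risk_alignment": 0.25,
    "semantic_similarity": 0.10,
    "completeness": 0.05,
}


def _tokens(text: str) -> List[str]:
    """Lowercased alphanumeric tokens (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sketch_similarity(rewrite: str, expected: str) -> float:
    """Mean of Jaccard (token sets) and cosine (term counts); the mean is an assumption."""
    a, b = _tokens(rewrite), _tokens(expected)
    if not a or not b:
        return 0.0
    set_a, set_b = set(a), set(b)
    jaccard = len(set_a & set_b) / len(set_a | set_b)
    count_a, count_b = Counter(a), Counter(b)
    dot = sum(count_a[t] * count_b[t] for t in set_a & set_b)
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    return 0.5 * jaccard + 0.5 * (dot / (norm_a * norm_b))


def sketch_completeness(text: str, required: Iterable[str]) -> float:
    """Fraction of required phrases found in the text (case-insensitive substring match)."""
    phrases = [p.lower() for p in required]
    if not phrases:
        return 0.0
    lowered = text.lower()
    return sum(p in lowered for p in phrases) / len(phrases)


def combine_reward(components: Mapping[str, float]) -> float:
    """Weighted sum of the five components, clamped to [0.001, 0.999]."""
    raw = sum(weight * components[name] for name, weight in WEIGHTS.items())
    return min(0.999, max(0.001, raw))


# Usage with made-up component scores:
example = combine_reward({
    "correctness": 0.8,
    "improvement": 0.6,
    "risk_alignment": 0.92,
    "semantic_similarity": 0.5,
    "completeness": 1.0,
})
print(round(example, 2))  # 0.76
```

With these weights the two new components together account for 15% of the score, so a rewrite that removes risky language but drifts from the expected safe edit loses at most that share before any task-specific adjustment.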
diff --git a/contract_env/env/__init__.py b/contract_env/env/__init__.py
@@ -5,6 +5,13 @@
     grade_easy,
     grade_medium,
     grade_hard,
+    grade_easy_plus,
+    grade_hard_plus,
+    grade_medium_plus,
+    grade_hard_plus2,
+    grade_expert,
+    clause_completeness_score,
+    semantic_similarity,
     TASK_GRADERS,
     GRADED_TASKS,
     NUM_GRADED_TASKS,
@@ -18,7 +25,6 @@
     validate_all_tasks_have_graders,
     GRADED_TASK_IDS,
     GRADED_TASK_NAMES,
-    NUM_GRADED_TASKS as TASKS_NUM_GRADED,
 )
 
 __all__ = [
@@ -33,6 +39,13 @@
     "grade_easy",
     "grade_medium",
     "grade_hard",
+    "grade_easy_plus",
+    "grade_hard_plus",
+    "grade_medium_plus",
+    "grade_hard_plus2",
+    "grade_expert",
+    "clause_completeness_score",
+    "semantic_similarity",
     "TASK_GRADERS",
     "GRADED_TASKS",
     "NUM_GRADED_TASKS",