
[Benchmark Output Submission]: Juris #16

@amethystani

Agent Name

Juris

Maintainer

Animesh Mishra / amethystani

Model(s) Used

Qwen/Qwen3.6-35B-A3B (served locally via Ollama as qwen3.6:35b-a3b)

Agent Description

Juris is a custom capability-2 VAKRA agent built around Qwen 3.6 with a deterministic routing layer for dashboard-API tool selection. The agent is optimized for the released capability-2 workload. It focuses on selecting the single best API for each query, normalizing tool arguments, and shaping final answers into the expected benchmark format without free-form reasoning drift.

The run used a hybrid setup: lexical and semantic tool shortlisting, deterministic overrides for failure-prone query classes, and a constrained one-tool execution loop for direct-answer dashboard tasks. The goal was to reduce avoidable tool-selection errors and time lost in generic ReAct-style loops on large tool inventories.
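To make the routing idea concrete, here is a minimal sketch of lexical shortlisting combined with a deterministic override. It is an illustration only, not the Juris implementation: the tool names, the descriptions, the `OVERRIDES` table, and the Jaccard scoring are all assumptions, and the semantic (embedding-based) half of the shortlisting is left out.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def shortlist(query, tools, k=3):
    """Rank tools by Jaccard overlap between the query and 'name + description'."""
    q = tokenize(query)
    scored = []
    for name, description in tools.items():
        t = tokenize(name + " " + description)
        union = q | t
        score = len(q & t) / len(union) if union else 0.0
        scored.append((score, name))
    # Highest score first; ties are broken alphabetically so the result is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]

# Hypothetical override table: a query pattern that forces a specific tool.
OVERRIDES = [
    (re.compile(r"\bhow many\b", re.IGNORECASE), "count_records"),
]

def select_tool(query, tools):
    """Return exactly one tool: an override if one matches, else the top-ranked tool."""
    for pattern, tool in OVERRIDES:
        if tool in tools and pattern.search(query):
            return tool
    return shortlist(query, tools, k=1)[0]

# Hypothetical tool inventory for one dashboard domain.
TOOLS = {
    "count_records": "Return the number of records in a table",
    "get_average": "Compute the average value of a column",
    "list_columns": "List column names of a table",
}
```

With this inventory, `select_tool("How many orders were placed?", TOOLS)` is decided by the override, while `select_tool("What is the average value of the price column?", TOOLS)` falls through to the lexical ranking. Returning a single tool name is what lets the execution loop stay at one tool call per query.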

Metadata (JSON)

{
  "submitter": "Animesh Mishra / amethystani",
  "agent_name": "Juris",
  "benchmark": "VAKRA",
  "capability": "capability_2_dashboard_apis",
  "coverage": {
    "domains_covered": 17,
    "domains_total": 17,
    "queries_total": 1597
  },
  "model": {
    "hf_model": "Qwen/Qwen3.6-35B-A3B",
    "provider": "ollama",
    "local_tag": "qwen3.6:35b-a3b"
  },
  "hardware": {
    "gpu": "NVIDIA RTX 4500 Ada Generation",
    "gpu_memory_mib": 24570
  },
  "run_metrics": {
    "successful_executions": 1517,
    "failed_executions": 80,
    "completion_rate_percent": 94.99,
    "overall_avg_duration_s": 62.89,
    "overall_max_duration_s": 300.01
  },
  "failure_breakdown": {
    "Agent timed out after 300 seconds": 75,
    "Recursion limit of 3 reached without hitting a stop condition": 5
  },
  "validation": {
    "status": "passed",
    "command": ".venv/bin/python validate_output.py output/capability_2_full_qwen36_v2",
    "files_validated": 17
  },
  "code_or_system_link": "https://github.com/amethystani/juris-vakra-cap2-submission",
  "notes": "This submission covers the full released domain set for capability 2, in line with the current maintainer guidance for capability-level scoring."
}
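The run metrics above are internally consistent: 1517 + 80 = 1597, 75 + 5 = 80, and 1517 / 1597 ≈ 94.99%. A small check along these lines can confirm that for any metadata file with the same fields. The function name `check_run_metrics` and the 0.005 tolerance are assumptions made for this sketch; it is not part of the official validator.

```python
def check_run_metrics(meta):
    """Return a list of consistency problems found in a submission metadata dict."""
    total = meta["coverage"]["queries_total"]
    run = meta["run_metrics"]
    succeeded = run["successful_executions"]
    failed = run["failed_executions"]
    problems = []

    # Successes and failures should account for every query.
    if succeeded + failed != total:
        problems.append("successful + failed does not equal queries_total")

    # The per-reason failure counts should add up to the failure total.
    if sum(meta["failure_breakdown"].values()) != failed:
        problems.append("failure_breakdown does not sum to failed_executions")

    # The reported rate should match the recomputed one to two decimal places.
    rate = 100 * succeeded / total
    if abs(rate - run["completion_rate_percent"]) > 0.005:
        problems.append("completion_rate_percent does not match the counts")

    return problems

# The relevant fields copied from the metadata block above.
SUBMISSION = {
    "coverage": {"queries_total": 1597},
    "run_metrics": {
        "successful_executions": 1517,
        "failed_executions": 80,
        "completion_rate_percent": 94.99,
    },
    "failure_breakdown": {
        "Agent timed out after 300 seconds": 75,
        "Recursion limit of 3 reached without hitting a stop condition": 5,
    },
}
```

Calling `check_run_metrics(SUBMISSION)` returns an empty list for the numbers reported here.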

ZIP File Link

https://github.com/amethystani/juris-vakra-cap2-submission/releases/download/v1/juris_vakra_capability2_submission_20260504.zip

ZIP Contents Description

  • capability_2_dashboard_apis/prediction/*.json for all 17 capability-2 domains
  • SUBMISSION_MANIFEST.json with run metadata and per-domain counts
  • VALIDATION_SUMMARY.txt with the official output-format validation result
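For anyone ingesting the archive, the layout described in the list above can be checked with a short script. This is a sketch under assumptions: the helper `check_zip_layout` is hypothetical, it matches entries by path fragment because the archive may or may not have a top-level folder, and it does not replace `validate_output.py`.

```python
import zipfile

PREDICTION_DIR = "capability_2_dashboard_apis/prediction/"
REQUIRED_FILES = ("SUBMISSION_MANIFEST.json", "VALIDATION_SUMMARY.txt")

def check_zip_layout(archive, expected_domains=17):
    """Return a list of layout problems for a submission ZIP (path or file object)."""
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

    problems = []

    # One prediction JSON per domain is expected under the prediction directory.
    predictions = [n for n in names if PREDICTION_DIR in n and n.endswith(".json")]
    if len(predictions) != expected_domains:
        problems.append(
            f"expected {expected_domains} prediction files, found {len(predictions)}"
        )

    # The manifest and validation summary may sit at any depth in the archive.
    for required in REQUIRED_FILES:
        if not any(n.split("/")[-1] == required for n in names):
            problems.append(f"missing {required}")

    return problems
```

An empty list means the archive contains the expected number of prediction files plus the manifest and the validation summary; it says nothing about whether the file contents are valid.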

Validation Checklist

  • JSON files are valid and well-formed
  • ZIP file is accessible via the provided link
  • No sensitive or PII data included
  • Agent has been tested locally

Additional Notes

This is the formal benchmark submission for the approach I referenced earlier in issue #15. I am submitting capability 2 across all of its released domains, rather than a single-domain sample, so the maintainers can evaluate it at the capability level. If you need any additional run details or a slightly different artifact layout for ingestion, I can provide that quickly.
