Agent Name
Juris
Maintainer
Animesh Mishra / amethystani
Model(s) Used
Qwen/Qwen3.6-35B-A3B (served locally via Ollama as qwen3.6:35b-a3b)
Agent Description
Juris is a custom capability-2 VAKRA agent built around Qwen 3.6 with a deterministic routing layer for dashboard-API tool selection. The agent is optimized for the released capability-2 workload and focuses on three things: selecting the single best API for each query, normalizing tool arguments, and shaping final answers into the expected benchmark format without free-form reasoning drift.
The run used a hybrid setup: lexical and semantic tool shortlisting, deterministic overrides for failure-prone query classes, and a constrained one-tool execution loop for direct-answer dashboard tasks. The goal was to reduce avoidable tool-selection errors and time lost in generic ReAct-style loops on large tool inventories.
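The routing idea described above can be illustrated with a minimal sketch. This is not the code used in the run: the tool inventory, the override pattern, and the function names (`shortlist`, `route`) are hypothetical, and plain token-overlap (Jaccard) scoring stands in for the lexical half of the shortlisting. The semantic half and the model call are omitted.

```python
import re
from typing import Dict, List, Optional

# Hypothetical tool inventory; names and descriptions are illustrative only.
TOOLS: Dict[str, str] = {
    "get_sales_by_region": "Return total sales grouped by region",
    "get_top_customers": "Return the top customers ranked by revenue",
    "get_order_count": "Return the number of orders in a date range",
}

# Hypothetical deterministic overrides: if the pattern matches the query,
# the paired tool is selected without any scoring.
OVERRIDES = [
    (re.compile(r"\bhow many orders\b", re.IGNORECASE), "get_order_count"),
]


def tokenize(text: str) -> set:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist(query: str, tools: Dict[str, str], k: int = 2) -> List[str]:
    """Rank tools by Jaccard overlap between query tokens and tool tokens."""
    q = tokenize(query)
    scored = []
    for name, desc in tools.items():
        t = tokenize(name.replace("_", " ") + " " + desc)
        union = q | t
        score = len(q & t) / len(union) if union else 0.0
        scored.append((score, name))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


def route(query: str, tools: Dict[str, str]) -> Optional[str]:
    """Apply overrides first, then fall back to the top lexical candidate."""
    for pattern, tool in OVERRIDES:
        if pattern.search(query) and tool in tools:
            return tool
    candidates = shortlist(query, tools, k=1)
    return candidates[0] if candidates else None
```

In a setup like this, the override list handles query classes that are known to be misrouted, and the shortlist keeps the candidate set small before any model call is made.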
Metadata (JSON)
{
  "submitter": "Animesh Mishra / amethystani",
  "agent_name": "Juris",
  "benchmark": "VAKRA",
  "capability": "capability_2_dashboard_apis",
  "coverage": {
    "domains_covered": 17,
    "domains_total": 17,
    "queries_total": 1597
  },
  "model": {
    "hf_model": "Qwen/Qwen3.6-35B-A3B",
    "provider": "ollama",
    "local_tag": "qwen3.6:35b-a3b"
  },
  "hardware": {
    "gpu": "NVIDIA RTX 4500 Ada Generation",
    "gpu_memory_mib": 24570
  },
  "run_metrics": {
    "successful_executions": 1517,
    "failed_executions": 80,
    "completion_rate_percent": 94.99,
    "overall_avg_duration_s": 62.89,
    "overall_max_duration_s": 300.01
  },
  "failure_breakdown": {
    "Agent timed out after 300 seconds": 75,
    "Recursion limit of 3 reached without hitting a stop condition": 5
  },
  "validation": {
    "status": "passed",
    "command": ".venv/bin/python validate_output.py output/capability_2_full_qwen36_v2",
    "files_validated": 17
  },
  "code_or_system_link": "https://github.com/amethystani/juris-vakra-cap2-submission",
  "notes": "This submission covers the full released domain set for capability 2, in line with the current maintainer guidance for capability-level scoring."
}
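The run metrics in the metadata above are internally consistent, and the arithmetic can be rechecked with a short script. This is a sketch only: `check_metrics` is a hypothetical helper, not part of the submission or of `validate_output.py`, and the dictionary below copies just the relevant fields from the metadata.

```python
from typing import List

# Relevant fields copied from the submission metadata.
metadata = {
    "coverage": {"queries_total": 1597},
    "run_metrics": {
        "successful_executions": 1517,
        "failed_executions": 80,
        "completion_rate_percent": 94.99,
    },
    "failure_breakdown": {
        "Agent timed out after 300 seconds": 75,
        "Recursion limit of 3 reached without hitting a stop condition": 5,
    },
}


def check_metrics(meta: dict) -> List[str]:
    """Return a list of inconsistencies found in the reported counts."""
    problems = []
    total = meta["coverage"]["queries_total"]
    ok = meta["run_metrics"]["successful_executions"]
    failed = meta["run_metrics"]["failed_executions"]
    reported_rate = meta["run_metrics"]["completion_rate_percent"]

    # Successes plus failures should account for every query.
    if ok + failed != total:
        problems.append("successes + failures != queries_total")
    # The per-reason failure counts should add up to the failure total.
    if sum(meta["failure_breakdown"].values()) != failed:
        problems.append("failure_breakdown does not sum to failed_executions")
    # The completion rate should equal successes / total, to two decimals.
    if abs(100 * ok / total - reported_rate) >= 0.005:
        problems.append("completion_rate_percent does not match the counts")
    return problems


print(check_metrics(metadata))
```

Here 1517 + 80 = 1597, 75 + 5 = 80, and 1517 / 1597 rounds to 94.99 percent, so the check reports no problems.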
ZIP File Link
https://github.com/amethystani/juris-vakra-cap2-submission/releases/download/v1/juris_vakra_capability2_submission_20260504.zip
ZIP Contents Description
capability_2_dashboard_apis/prediction/*.json for all 17 capability-2 domains
SUBMISSION_MANIFEST.json with run metadata and per-domain counts
VALIDATION_SUMMARY.txt with the official output-format validation result
Validation Checklist
Additional Notes
This is the formal benchmark submission for the approach I referenced earlier in issue #15. I am submitting capability 2 across all of its released domains, rather than a single-domain sample, so the maintainers can evaluate it at the capability level. If you need any additional run details or a slightly different artifact layout for ingestion, I can provide that quickly.