A Modular Platform for Agent Inference, Evaluation, and Training
AgentInferKit is an open-source platform for building, running, and analyzing LLM/VLM/Agent systems across text, multimodal, RAG, and tool-use settings. It is designed for agent inference today, and built to extend toward agent reasoning, agent training, and RL-based optimization in the future.
AgentInferKit follows a three-layer design:
- Platform Layer: unified model access, inference execution, tool simulation, batch evaluation, visualization, and engineering management
- Data Layer: dataset organization, preprocessing, standardization, versioning, and custom data loading
- Experiment Layer: benchmark protocols, controlled comparisons, and research-oriented analysis
At the current stage, the project mainly focuses on the platform layer and data layer.
- Unified access to API models, local models, and multimodal models
- Pluggable reasoning strategies: Direct, CoT, Long-CoT, ToT, ReAct, Self-Refine, Self-Consistency
- Built-in RAG pipeline with chunking, indexing, retrieval, and evidence tracking
- Stateful tool simulation environment with world state, registry, and side-effect replay
- Tool categories: file search, calendar, issue tracking — all with in-process sandbox execution
- Batch inference, single-sample debugging, logging, retry, and resume
- Configurable evaluation with metrics for text, retrieval, and agent tasks
- Experiment runners for stateful agent flows and stateless baselines
- Research-friendly visualization for predictions, traces, evidence, and errors
- Modular architecture for future extension to training and RL
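To illustrate the pluggable-strategy idea above, here is a minimal sketch of what a strategy interface can look like. All names (`Strategy`, `Direct`, `CoT`, `answer`) are hypothetical and not the project's actual API:

```python
from typing import Callable, Protocol

# A model is reduced to a prompt -> completion function for this sketch.
Model = Callable[[str], str]


class Strategy(Protocol):
    def run(self, model: Model, question: str) -> str: ...


class Direct:
    """Send the question as-is."""
    def run(self, model: Model, question: str) -> str:
        return model(question)


class CoT:
    """Append a step-by-step instruction before sending."""
    def run(self, model: Model, question: str) -> str:
        return model(f"{question}\nLet's think step by step.")


def answer(strategy: Strategy, model: Model, question: str) -> str:
    # Strategies are interchangeable at the call site.
    return strategy.run(model, question)


# Echo model for demonstration; a real adapter would call an LLM provider.
echo: Model = lambda prompt: f"[model saw] {prompt}"
print(answer(CoT(), echo, "What is 2+2?"))
```

The point of the `Protocol` is that new strategies plug in without touching the call site.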
AgentInferKit currently targets the following task types:
- Text QA
- Knowledge-oriented text exam
- Image understanding
- API / function calling
- Retrieval-augmented reasoning
- Prompt-based reasoning strategy comparison
- Stateful tool-use agent evaluation
The platform layer provides the engineering foundation of the project, including:
- model adapters
- reasoning strategies
- RAG pipeline
- task runners
- tool simulation (`toolsim/`)
  - `core/` — world state, environment, registry, constants, side effects
  - `execution/` — stateful executor and tracer
  - `tools/` — file, search, calendar, issue tools
  - `evaluators/` — call-level and state-level evaluators
  - `runners/` — experiment and comparison runners
  - `backends/` — mock and sandbox backends
  - `adapters/` — stateful runtime adapter
  - `legacy/` — legacy executor and tracer
- evaluators
- visualization dashboard
- config and logging system
The data layer standardizes heterogeneous data into reusable benchmark assets, including:
- QA data
- text-exam data
- image understanding data
- agent API function calling data
The data layer is designed to make data runnable, evaluable, traceable, and versioned.
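To make the standardization idea concrete, a toy normalizer that maps heterogeneous QA records onto one schema. The field names here are illustrative; the project's actual schemas live under data/ and may differ:

```python
def normalize_qa(record: dict) -> dict:
    """Map a heterogeneous QA record onto a minimal common schema."""
    return {
        "id": str(record.get("id") or record.get("qid")),
        "question": record.get("question") or record.get("query", ""),
        "answer": record.get("answer") or record.get("gold", ""),
        # Everything else is preserved as metadata for traceability.
        "meta": {k: v for k, v in record.items()
                 if k not in {"id", "qid", "question", "query", "answer", "gold"}},
    }


raw = {"qid": 7, "query": "Capital of France?", "gold": "Paris", "source": "demo"}
print(normalize_qa(raw))
```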
```bash
# Clone the repo
git clone https://github.com/CodeSoul-co/AgentInferKit.git
cd AgentInferKit

# Create conda environment
conda create -n benchmark python=3.11 -y
conda activate benchmark
pip install -r requirements.txt

# Configure API key
cp .env.example .env
# Edit .env and fill in your DEEPSEEK_API_KEY
```

Start the API server:

```bash
PYTHONPATH=$(pwd) uvicorn src.main:app --host 0.0.0.0 --port 8000
```

Open http://localhost:8000/docs in your browser to see all API endpoints.
Direct mode (fast, concise):

```bash
curl -s -X POST http://localhost:8000/chat/complete \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "deepseek-chat",
    "strategy": "direct",
    "messages": [{"role": "user", "content": "What is machine learning?"}]
  }' | python3 -m json.tool
```

Chain-of-Thought mode (step-by-step reasoning):

```bash
curl -s -X POST http://localhost:8000/chat/complete \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "deepseek-chat",
    "strategy": "cot",
    "messages": [{"role": "user", "content": "A train travels 120km in 2 hours. What is its speed?"}]
  }' | python3 -m json.tool
```

Streaming mode (token-by-token output):

```bash
curl -N -X POST http://localhost:8000/chat/stream \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "deepseek-chat",
    "strategy": "direct",
    "messages": [{"role": "user", "content": "Write a short poem about spring"}]
  }'
```

Run a stateful tool-simulation experiment:

```bash
PYTHONPATH=$(pwd) python -m toolsim.runners.experiment_runner \
  --backend sandbox \
  --config configs/toolsim/demo_flow.yaml
```

Run a batch experiment:

```bash
PYTHONPATH=$(pwd) python scripts/run_experiment.py \
  --config configs/experiments/demo_exam_direct.yaml
```

Results are saved to outputs/predictions/ and outputs/metrics/.
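For downstream analysis, the saved metrics can be collected programmatically. A minimal sketch, assuming the metrics are stored as one JSON file per run (the actual file layout under outputs/metrics/ may differ):

```python
import json
from pathlib import Path


def load_metrics(metrics_dir: str) -> dict:
    """Collect every *.json metrics file in a directory, keyed by filename stem."""
    results = {}
    for path in Path(metrics_dir).glob("*.json"):
        results[path.stem] = json.loads(path.read_text())
    return results
```

This makes it easy to compare runs in a notebook, e.g. `load_metrics("outputs/metrics")`.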
We ran 5 exam questions (math, physics, CS) with two strategies:
| Metric | Direct | CoT |
|---|---|---|
| Accuracy | 80% (4/5) | 100% (5/5) |
| Avg Latency | 2.2s | 10.7s |
| Avg Tokens | 69.6 | 281.4 |
CoT reasoning improves accuracy at the cost of higher latency and token usage.
```
AgentInferKit/
├── src/
│   ├── adapters/        # LLM provider adapters (DeepSeek, OpenAI, Anthropic, Qwen)
│   ├── strategies/      # Inference strategies (direct, cot, long_cot, tot, react, self_refine, self_consistency)
│   ├── rag/             # RAG pipeline (chunker, embedder, milvus_store, retriever, pipeline)
│   ├── runners/         # Task runners (qa, exam, batch, agent)
│   ├── evaluators/      # Metrics (text, choice, rag, efficiency)
│   ├── toolsim/         # Tool simulation environment
│   │   ├── core/        # World state, environment, registry, constants, side effects
│   │   ├── execution/   # Stateful executor and tracer
│   │   ├── tools/       # File, search, calendar, issue tools
│   │   ├── evaluators/  # Call-level and state-level evaluators
│   │   ├── runners/     # Experiment and comparison runners
│   │   ├── backends/    # Mock and sandbox backends
│   │   ├── adapters/    # Stateful runtime adapter
│   │   └── legacy/      # Legacy executor and tracer
│   ├── api/             # FastAPI routes (chat, datasets, results, system)
│   └── utils/           # Shared utilities
├── scripts/             # CLI scripts (run_experiment, build_chunks, build_index, build_mcq)
├── configs/             # YAML configs for models and experiments
├── data/                # Datasets and schemas
└── outputs/             # Experiment results (gitignored)
```
| Endpoint | Method | Description |
|---|---|---|
| `/chat/complete` | POST | Single chat completion with strategy selection |
| `/chat/stream` | POST | Streaming chat completion (SSE) |
| `/datasets` | GET | List available datasets |
| `/datasets/upload` | POST | Upload a new dataset |
| `/results/{id}/metrics` | GET | Get experiment metrics |
| `/results/{id}/predictions` | GET | Get experiment predictions |
| `/results/compare` | POST | Compare multiple experiments |
| `/api/v1/system/health` | GET | Health check |
Full interactive docs at: http://localhost:8000/docs
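On the client side, the `/chat/stream` response can be consumed line by line. A minimal parser sketch, assuming the stream follows the common SSE convention of `data:` lines carrying JSON with a `token` field and a `[DONE]` sentinel (the actual payload format is not confirmed here):

```python
import json
from typing import Iterable, Iterator


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each 'data:' line in an SSE stream."""
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            # Skip keep-alive blanks and the terminal sentinel.
            if payload and payload != "[DONE]":
                yield json.loads(payload)


# Example with a canned stream:
stream = ['data: {"token": "Hello"}', "", 'data: {"token": " world"}', "data: [DONE]"]
tokens = [event["token"] for event in iter_sse_json(stream)]
print("".join(tokens))  # -> Hello world
```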
The toolsim module provides a fully in-process, deterministic simulation environment for tool-use agents. Each tool operates against an in-memory WorldState with simulated time, enabling reproducible experiments without external services.
| Component | Description |
|---|---|
| `WorldState` | In-memory entity store with time simulation |
| `Environment` | Tool registry, backend dispatch, execution context |
| `FileTools` | File snapshot and reindex with delayed search refresh |
| `SearchTools` | Entity-based search index (file, calendar, issue) |
| `CalendarTools` | CRUD for calendar events with status transitions |
| `IssueTools` | Issue lifecycle: create, assign, comment, close/reopen |
| `StatefulExecutor` | Executes tool calls with side-effect replay |
| `StatefulTracer` | Records full execution traces for evaluation |
| `SandboxBackend` | Sandboxed execution for untrusted tool code |
- Call-level: Success/failure counts, phase transitions, argument validation
- State-level: Entity existence, field values, indexed search hits, goal satisfaction
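A toy version of the state-level check: assert that expected entities exist in the world and carry the expected field values. The real evaluators live in `toolsim/evaluators/`; the names and shapes here are purely illustrative:

```python
def check_state(world: dict, expectations: list) -> bool:
    """Return True iff every expected entity exists with the expected field values."""
    for exp in expectations:
        entity = world.get(exp["entity_id"])
        if entity is None:
            return False  # expected entity was never created
        if any(entity.get(k) != v for k, v in exp.get("fields", {}).items()):
            return False  # entity exists but a field has the wrong value
    return True


world = {"issue-1": {"status": "closed", "assignee": "alice"}}
goal = [{"entity_id": "issue-1", "fields": {"status": "closed"}}]
print(check_state(world, goal))  # -> True
```

State-level checks like this are robust to *how* the agent reached the goal, unlike call-level checks that score each tool call.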
| Provider | Model | Status |
|---|---|---|
| DeepSeek | deepseek-chat | Verified |
| OpenAI | gpt-4o, gpt-4o-mini | Ready (needs API key) |
| Anthropic | claude-3.5-sonnet | Ready (needs API key) |
| Qwen | qwen-plus | Ready (needs API key) |
| Strategy | Key | Description |
|---|---|---|
| Direct | `direct` | Simple prompt, fast response |
| Chain-of-Thought | `cot` | Step-by-step reasoning |
| Long CoT | `long_cot` | Extended multi-step reasoning |
| Tree-of-Thought | `tot` | Multiple reasoning paths + evaluation |
| ReAct | `react` | Reasoning + tool actions interleaved |
| Self-Refine | `self_refine` | Generate -> critique -> improve loop |
| Self-Consistency | `self_consistency` | Multiple paths + majority voting |
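The self-consistency strategy reduces to sampling several reasoning paths and keeping the majority answer. A minimal sketch (the project's actual implementation in `src/strategies/` may differ):

```python
from collections import Counter
from typing import Callable


def self_consistency(sample: Callable[[], str], n: int = 5) -> str:
    """Sample n answers and return the most common one (majority voting)."""
    votes = Counter(sample() for _ in range(n))
    return votes.most_common(1)[0][0]


# Deterministic stand-in for a stochastic model call:
answers = iter(["42", "41", "42", "42", "40"])
print(self_consistency(lambda: next(answers), n=5))  # -> 42
```

In practice `sample` would be a temperature-sampled model call, and answers would be extracted from each reasoning path before voting.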
Contributions are welcome, especially in:
- model adapters
- task runners
- evaluators
- RAG pipelines
- tool simulation
- visualization
- data preprocessing
- documentation
```bibtex
@misc{agentinferkit,
  title={AgentInferKit: A Modular Platform for Agent Inference, Evaluation, and Training},
  author={CodeSoul-co},
  year={2026},
  howpublished={GitHub repository}
}
```