- Core Models
- Task Parser
- API Client
- Retry Logic
- Model Executor
- LLM Judge
- Model Comparison
- Cost Tracker
- Error Handling
## Core Models

Location: `taskbench.core.models`

All core models are Pydantic `BaseModel` subclasses with automatic validation.

### TaskDefinition

Represents a user-defined evaluation task.
Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Unique identifier for the task |
| description | str | Yes | Human-readable task description |
| input_type | str | Yes | Type of input data: "transcript", "text", "csv", "json" |
| output_format | str | Yes | Expected output format: "csv", "json", "markdown" |
| evaluation_criteria | List[str] | Yes | List of criteria for evaluation |
| constraints | Dict[str, Any] | No | Constraints that outputs must satisfy |
| examples | List[Dict[str, Any]] | No | Example inputs and expected outputs |
| judge_instructions | str | Yes | Instructions for the LLM-as-judge evaluator |
Validation:
- `input_type` must be one of: "transcript", "text", "csv", "json"
- `output_format` must be one of: "csv", "json", "markdown"
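Because these fields are validated at construction time, invalid values fail fast. The allowed-values check can be sketched in plain Python (illustrative only — the library performs this via Pydantic validators):

```python
ALLOWED_INPUT_TYPES = {"transcript", "text", "csv", "json"}
ALLOWED_OUTPUT_FORMATS = {"csv", "json", "markdown"}

def check_task_fields(input_type: str, output_format: str) -> None:
    """Raise ValueError on disallowed values (illustrative sketch)."""
    if input_type not in ALLOWED_INPUT_TYPES:
        raise ValueError(f"input_type must be one of {sorted(ALLOWED_INPUT_TYPES)}")
    if output_format not in ALLOWED_OUTPUT_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(ALLOWED_OUTPUT_FORMATS)}")

check_task_fields("transcript", "csv")  # OK
# check_task_fields("audio", "csv")     # would raise ValueError
```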
Example:

```python
from taskbench.core.models import TaskDefinition

task = TaskDefinition(
    name="lecture_concept_extraction",
    description="Extract teaching concepts from lecture transcripts",
    input_type="transcript",
    output_format="csv",
    evaluation_criteria=[
        "Timestamp accuracy",
        "Duration compliance",
        "Concept clarity"
    ],
    constraints={
        "min_duration_minutes": 2,
        "max_duration_minutes": 7,
        "required_csv_columns": ["concept", "start_time", "end_time"]
    },
    examples=[],
    judge_instructions="Evaluate based on accuracy, format, and compliance..."
)
```

### CompletionResponse

API response from an LLM completion.
Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| content | str | Yes | The model's response text |
| model | str | Yes | Model identifier (e.g., "anthropic/claude-sonnet-4.5") |
| input_tokens | int | Yes | Number of input tokens consumed |
| output_tokens | int | Yes | Number of output tokens generated |
| total_tokens | int | Yes | Total tokens (input + output) |
| latency_ms | float | Yes | Response latency in milliseconds |
| timestamp | datetime | No | When the response was received (auto-generated) |
Example:

```python
from taskbench.core.models import CompletionResponse

response = CompletionResponse(
    content="concept,start_time,end_time\n01_Introduction,00:00:00,00:05:30",
    model="anthropic/claude-sonnet-4.5",
    input_tokens=1500,
    output_tokens=500,
    total_tokens=2000,
    latency_ms=2345.67
)

print(f"Model: {response.model}")
print(f"Tokens: {response.total_tokens}")
print(f"Latency: {response.latency_ms:.2f}ms")
```

### EvaluationResult

Single model evaluation result.
Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| model_name | str | Yes | Model identifier |
| task_name | str | Yes | Task identifier |
| output | str | Yes | Model's output for the task |
| input_tokens | int | Yes | Input tokens consumed |
| output_tokens | int | Yes | Output tokens generated |
| total_tokens | int | Yes | Total tokens used |
| cost_usd | float | Yes | Cost in USD for this evaluation |
| latency_ms | float | Yes | Execution latency in milliseconds |
| timestamp | datetime | No | When the evaluation was performed (auto-generated) |
| status | str | No | Evaluation status: "success" or "failed" (default: "success") |
| error | Optional[str] | No | Error message if status is "failed" |
Example:

```python
from taskbench.core.models import EvaluationResult

result = EvaluationResult(
    model_name="anthropic/claude-sonnet-4.5",
    task_name="lecture_concept_extraction",
    output="concept,start_time,end_time\n...",
    input_tokens=1500,
    output_tokens=500,
    total_tokens=2000,
    cost_usd=0.36,
    latency_ms=2345.67,
    status="success"
)

if result.status == "success":
    print(f"Output: {result.output[:100]}...")
    print(f"Cost: ${result.cost_usd:.4f}")
```

### JudgeScore

LLM-as-judge scoring result.
Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| model_evaluated | str | Yes | Model that was evaluated |
| accuracy_score | int | Yes | Accuracy score (0-100) |
| format_score | int | Yes | Format compliance score (0-100) |
| compliance_score | int | Yes | Constraint compliance score (0-100) |
| overall_score | int | Yes | Overall score (0-100) |
| violations | List[str] | No | List of constraint violations found |
| reasoning | str | Yes | Detailed explanation of the scores |
| timestamp | datetime | No | When the evaluation was performed (auto-generated) |
Validation:
- All scores must be in range 0-100
Example:

```python
from taskbench.core.models import JudgeScore

score = JudgeScore(
    model_evaluated="anthropic/claude-sonnet-4.5",
    accuracy_score=95,
    format_score=100,
    compliance_score=90,
    overall_score=95,
    violations=["One segment slightly over 7 minutes"],
    reasoning="Excellent extraction with minor duration issue..."
)

print(f"Overall Score: {score.overall_score}/100")
print(f"Violations: {len(score.violations)}")
for violation in score.violations:
    print(f"  - {violation}")
```

### ModelConfig

Model pricing and configuration.
Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| model_id | str | Yes | Unique model identifier for API calls |
| display_name | str | Yes | Human-readable model name |
| input_price_per_1m | float | Yes | Input price per 1M tokens in USD |
| output_price_per_1m | float | Yes | Output price per 1M tokens in USD |
| context_window | int | Yes | Maximum context window size in tokens |
| provider | str | Yes | Model provider (e.g., "Anthropic", "OpenAI") |
Example:

```python
from taskbench.core.models import ModelConfig

config = ModelConfig(
    model_id="anthropic/claude-sonnet-4.5",
    display_name="Claude Sonnet 4.5",
    input_price_per_1m=3.00,
    output_price_per_1m=15.00,
    context_window=200000,
    provider="Anthropic"
)

print(f"{config.display_name} ({config.provider})")
print(f"Input: ${config.input_price_per_1m}/1M tokens")
print(f"Output: ${config.output_price_per_1m}/1M tokens")
```

## Task Parser

Location: `taskbench.core.task`

### TaskParser

Parser and validator for task definitions.

#### `load_from_yaml()`

Load a task definition from a YAML file.
Parameters:
- `yaml_path` (str): Path to the YAML file containing the task definition
Returns:
TaskDefinition: Parsed task definition object
Raises:
- FileNotFoundError: If the YAML file doesn't exist
- yaml.YAMLError: If the YAML is malformed
- ValidationError: If the YAML doesn't match the TaskDefinition schema
Example:

```python
from taskbench.core.task import TaskParser

parser = TaskParser()
task = parser.load_from_yaml("tasks/lecture_analysis.yaml")

print(f"Loaded task: {task.name}")
print(f"Input type: {task.input_type}")
print(f"Output format: {task.output_format}")
```

#### `validate_task()`

Validate a task definition for logical consistency.
Parameters:
- `task` (TaskDefinition): Task definition to validate
Returns:
- Tuple[bool, List[str]]: (is_valid, list_of_errors)
  - `is_valid`: True if the task is valid, False otherwise
  - `list_of_errors`: List of error messages (empty if valid)
Validation Checks:
- Evaluation criteria list is non-empty
- Judge instructions are non-empty
- Min/max constraints satisfy min < max
- CSV output has required_csv_columns constraint
- All constraint values have correct types
Example:

```python
from taskbench.core.task import TaskParser

parser = TaskParser()
task = parser.load_from_yaml("tasks/my_task.yaml")

is_valid, errors = parser.validate_task(task)
if not is_valid:
    print("Validation errors:")
    for error in errors:
        print(f"  - {error}")
else:
    print("Task is valid!")
```

#### `save_to_yaml()`

Save a task definition to a YAML file.
Parameters:
- `task` (TaskDefinition): Task definition to save
- `yaml_path` (str): Path where the YAML file should be saved
Raises:
IOError: If the file cannot be written
Example:

```python
from taskbench.core.task import TaskParser
from taskbench.core.models import TaskDefinition

parser = TaskParser()
task = TaskDefinition(
    name="custom_task",
    description="My custom task",
    input_type="text",
    output_format="json",
    evaluation_criteria=["Accuracy"],
    judge_instructions="Evaluate the output..."
)

parser.save_to_yaml(task, "tasks/custom_task.yaml")
print("Task saved successfully!")
```

## API Client

Location: `taskbench.api.client`

### OpenRouterClient

Async HTTP client for the OpenRouter API.

#### `__init__()`

Initialize the OpenRouter client.
Parameters:
- `api_key` (str): OpenRouter API key
- `base_url` (str): Base URL for the OpenRouter API (default: official endpoint)
- `timeout` (float): Request timeout in seconds (default: 120s)
Example:

```python
from taskbench.api.client import OpenRouterClient

async with OpenRouterClient(api_key="your-key") as client:
    # Use client for API calls
    pass
```

#### `complete()`

```python
complete(model: str, prompt: str, max_tokens: int = 1000, temperature: float = 0.7, **kwargs) -> CompletionResponse
```
Send a completion request to OpenRouter.
Parameters:
- `model` (str): Model identifier (e.g., "anthropic/claude-sonnet-4.5")
- `prompt` (str): The prompt to send to the model
- `max_tokens` (int): Maximum tokens to generate (default: 1000)
- `temperature` (float): Sampling temperature 0-1 (default: 0.7)
- `**kwargs`: Additional parameters to pass to the API
Returns:
CompletionResponse: Response with model output and metadata
Raises:
- AuthenticationError: If the API key is invalid
- RateLimitError: If the rate limit is exceeded
- BadRequestError: If the request is malformed
- OpenRouterError: For other API errors
Example:

```python
from taskbench.api.client import OpenRouterClient

async with OpenRouterClient(api_key="your-key") as client:
    response = await client.complete(
        model="anthropic/claude-sonnet-4.5",
        prompt="Explain Python lists in 2 sentences",
        max_tokens=100,
        temperature=0.7
    )
    print(f"Response: {response.content}")
    print(f"Tokens: {response.total_tokens}")
    print(f"Latency: {response.latency_ms:.2f}ms")
```

#### `complete_with_json()`

```python
complete_with_json(model: str, prompt: str, max_tokens: int = 1000, temperature: float = 0.7, **kwargs) -> CompletionResponse
```
Request a completion in JSON mode.
Adds JSON formatting instructions to the prompt and validates that the response is valid JSON.
Parameters:
- Same as `complete()`
Returns:
CompletionResponse: Response with JSON content (cleaned of markdown blocks)
Raises:
- OpenRouterError: If the response is not valid JSON
- Other exceptions same as `complete()`
Example:

```python
from taskbench.api.client import OpenRouterClient
import json

async with OpenRouterClient(api_key="your-key") as client:
    response = await client.complete_with_json(
        model="anthropic/claude-sonnet-4.5",
        prompt="List 3 programming languages with their year of creation",
        max_tokens=500,
        temperature=0.5
    )
    data = json.loads(response.content)
    print(json.dumps(data, indent=2))
```

#### `close()`

Close the HTTP client and clean up resources.
Example:

```python
from taskbench.api.client import OpenRouterClient

client = OpenRouterClient(api_key="your-key")
# Use client...
await client.close()
```

### Exceptions

- `OpenRouterError`: Base exception for OpenRouter API errors.
- `RateLimitError`: Raised when the API rate limit is exceeded (HTTP 429).
- `AuthenticationError`: Raised when API authentication fails (HTTP 401).
- `BadRequestError`: Raised when the request is malformed (HTTP 400).

## Retry Logic

Location: `taskbench.api.retry`

### RateLimiter

Token bucket rate limiter for API requests.

#### `__init__()`

Initialize the rate limiter.
Parameters:
- `max_requests_per_minute` (int): Maximum requests allowed per minute
Example:

```python
from taskbench.api.retry import RateLimiter

limiter = RateLimiter(max_requests_per_minute=60)
```

#### `acquire()`

Acquire permission to make a request. Sleeps if making a request now would exceed the rate limit.
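Conceptually, the limiter can be sketched as a token bucket that refills at `max_requests_per_minute / 60` tokens per second (a synchronous, illustrative sketch — not the library's actual async implementation):

```python
import time

class TokenBucketSketch:
    """Illustrative token bucket: starts full, refills continuously."""

    def __init__(self, max_requests_per_minute: int):
        self.capacity = max_requests_per_minute
        self.tokens = float(max_requests_per_minute)
        self.refill_rate = max_requests_per_minute / 60.0  # tokens per second
        self.last = time.monotonic()

    def wait_time(self) -> float:
        """Seconds to sleep before the next request is allowed (0.0 if allowed now)."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 0.0
        return (1 - self.tokens) / self.refill_rate
```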
Example:

```python
from taskbench.api.retry import RateLimiter

limiter = RateLimiter(max_requests_per_minute=60)

async def make_api_call():
    await limiter.acquire()  # Wait if rate limit would be exceeded
    # Make API request
    pass
```

### `retry_with_backoff()`

Decorator for retrying async functions with exponential backoff.

```python
retry_with_backoff(max_retries: int = 3, base_delay: float = 1.0, max_delay: float = 60.0, retryable_exceptions: Optional[Set[Type[Exception]]] = None, non_retryable_exceptions: Optional[Set[Type[Exception]]] = None)
```
Parameters:
- `max_retries` (int): Maximum number of retry attempts (default: 3)
- `base_delay` (float): Initial delay in seconds (default: 1.0)
- `max_delay` (float): Maximum delay in seconds (default: 60.0)
- `retryable_exceptions` (Set[Type[Exception]]): Exceptions to retry (default: RateLimitError, OpenRouterError, TimeoutError, ConnectionError)
- `non_retryable_exceptions` (Set[Type[Exception]]): Exceptions to never retry (default: AuthenticationError, BadRequestError, ValueError, TypeError)
Returns:
- Decorated function with retry logic
Retry Strategy:
- Exponential backoff: delay = min(base_delay * (2 ** attempt), max_delay)
- Retries transient errors (rate limits, timeouts, server errors)
- Immediately raises non-retryable errors (auth, bad requests)
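The schedule the delay formula produces can be sketched standalone:

```python
def backoff_delays(max_retries=3, base_delay=1.0, max_delay=60.0):
    """Delay before each retry attempt: exponential, capped at max_delay."""
    return [min(base_delay * (2 ** attempt), max_delay) for attempt in range(max_retries)]

print(backoff_delays())                # [1.0, 2.0, 4.0]
print(backoff_delays(base_delay=2.0))  # [2.0, 4.0, 8.0]
```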
Example:

```python
from taskbench.api.retry import retry_with_backoff
from taskbench.api.client import OpenRouterClient

@retry_with_backoff(max_retries=3, base_delay=2.0)
async def make_api_call(client: OpenRouterClient):
    return await client.complete(
        model="anthropic/claude-sonnet-4.5",
        prompt="Hello, world!"
    )

# If the call fails with a retryable error, it will retry up to 3 times
# with delays of 2s, 4s, 8s
```

### `with_rate_limit()`

Decorator to enforce rate limiting on async functions.
Parameters:
- `limiter` (RateLimiter): RateLimiter instance to use
Returns:
- Decorated function with rate limiting
Example:

```python
from taskbench.api.retry import RateLimiter, with_rate_limit

limiter = RateLimiter(max_requests_per_minute=60)

@with_rate_limit(limiter)
async def make_request():
    # This function will automatically respect the rate limit
    pass
```

## Model Executor

Location: `taskbench.evaluation.executor`

### ModelExecutor

Execute tasks on LLM models and collect results.

#### `__init__()`

Initialize the model executor.
Parameters:
- `api_client` (OpenRouterClient): OpenRouter client for making API calls
- `cost_tracker` (CostTracker): Cost tracker for calculating costs
Example:

```python
from taskbench.api.client import OpenRouterClient
from taskbench.evaluation.cost import CostTracker
from taskbench.evaluation.executor import ModelExecutor

async with OpenRouterClient(api_key="your-key") as client:
    cost_tracker = CostTracker()
    executor = ModelExecutor(client, cost_tracker)
```

#### `build_prompt()`

Build a comprehensive prompt from the task definition and input data.
Parameters:
- `task` (TaskDefinition): Task definition describing the task
- `input_data` (str): Input data to process
Returns:
str: Complete prompt string to send to the model
Prompt Structure:
- Task name and description
- Output format requirements
- CRITICAL CONSTRAINTS section (emphasized)
- Examples of good outputs
- Evaluation criteria
- Input data
- Final instructions
Example:

```python
from taskbench.evaluation.executor import ModelExecutor
from taskbench.core.task import TaskParser

parser = TaskParser()
task = parser.load_from_yaml("tasks/lecture_analysis.yaml")
input_data = "Lecture transcript content..."

# Assuming executor is already initialized
prompt = executor.build_prompt(task, input_data)
print(prompt[:500])  # Preview first 500 chars
```

#### `execute()`

```python
execute(model_id: str, task: TaskDefinition, input_data: str, max_tokens: int = 2000, temperature: float = 0.7) -> EvaluationResult
```
Execute a task on a single model.
Parameters:
- `model_id` (str): Model identifier (e.g., "anthropic/claude-sonnet-4.5")
- `task` (TaskDefinition): Task definition describing the task
- `input_data` (str): Input data to process
- `max_tokens` (int): Maximum tokens to generate (default: 2000)
- `temperature` (float): Sampling temperature (default: 0.7)
Returns:
EvaluationResult: Evaluation result with output and metadata
Error Handling:
- On success: Returns EvaluationResult with status="success"
- On error: Returns EvaluationResult with status="failed" and error message
Example:

```python
from taskbench.evaluation.executor import ModelExecutor
from taskbench.core.task import TaskParser

parser = TaskParser()
task = parser.load_from_yaml("tasks/lecture_analysis.yaml")
with open("data/transcript.txt") as f:
    input_data = f.read()

# Assuming executor is already initialized
result = await executor.execute(
    model_id="anthropic/claude-sonnet-4.5",
    task=task,
    input_data=input_data,
    max_tokens=2000,
    temperature=0.7
)

if result.status == "success":
    print(f"Output: {result.output[:200]}...")
    print(f"Cost: ${result.cost_usd:.4f}")
else:
    print(f"Error: {result.error}")
```

#### `evaluate_multiple()`

```python
evaluate_multiple(model_ids: List[str], task: TaskDefinition, input_data: str, max_tokens: int = 2000, temperature: float = 0.7) -> List[EvaluationResult]
```
Execute a task on multiple models with progress tracking.
Parameters:
- `model_ids` (List[str]): List of model identifiers
- `task` (TaskDefinition): Task definition describing the task
- `input_data` (str): Input data to process
- `max_tokens` (int): Maximum tokens to generate per model (default: 2000)
- `temperature` (float): Sampling temperature (default: 0.7)
Returns:
List[EvaluationResult]: List of evaluation results, one per model
Features:
- Displays progress bar with Rich
- Shows real-time status updates
- Prints summary after completion
Example:

```python
from taskbench.evaluation.executor import ModelExecutor

# Assuming executor is already initialized
results = await executor.evaluate_multiple(
    model_ids=[
        "anthropic/claude-sonnet-4.5",
        "openai/gpt-4o",
        "qwen/qwen-2.5-72b-instruct"
    ],
    task=task,
    input_data=input_data,
    max_tokens=2000,
    temperature=0.7
)

for result in results:
    if result.status == "success":
        print(f"{result.model_name}: ${result.cost_usd:.4f}")
```

## LLM Judge

Location: `taskbench.evaluation.judge`

### LLMJudge

Use an LLM to evaluate model outputs.

#### `__init__()`

Initialize the LLM judge.
Parameters:
- `api_client` (OpenRouterClient): OpenRouter client for making API calls
- `judge_model` (str): Model to use as judge (default: "anthropic/claude-sonnet-4.5")
Example:

```python
from taskbench.api.client import OpenRouterClient
from taskbench.evaluation.judge import LLMJudge

async with OpenRouterClient(api_key="your-key") as client:
    judge = LLMJudge(client, judge_model="anthropic/claude-sonnet-4.5")
```

#### `build_judge_prompt()`

Build the evaluation prompt for the judge model.
Parameters:
- `task` (TaskDefinition): Task definition with evaluation criteria
- `model_output` (str): The output to evaluate
- `input_data` (str): Original input data for context
Returns:
str: Complete judge prompt
Prompt Structure:
- Judge role and task description
- Evaluation criteria from task
- Constraints to check
- Original input data (for context)
- Model output to evaluate
- Judge instructions from task
- JSON response format specification
Example:

```python
from taskbench.evaluation.judge import LLMJudge

# Assuming judge is already initialized
prompt = judge.build_judge_prompt(task, result.output, input_data)
```

#### `evaluate()`

Evaluate a model's output using LLM-as-judge.
Parameters:
- `task` (TaskDefinition): Task definition with evaluation criteria
- `result` (EvaluationResult): Evaluation result to evaluate
- `input_data` (str): Original input data
Returns:
JudgeScore: Score with accuracy, format, compliance scores and violations
Raises:
Exception: If judge fails to return valid JSON
Judge Configuration:
- Uses JSON mode for structured output
- Temperature: 0.3 (for consistency)
- Max tokens: 2000
Example:

```python
from taskbench.evaluation.judge import LLMJudge

# Assuming judge is already initialized
score = await judge.evaluate(
    task=task,
    result=result,
    input_data=input_data
)

print(f"Overall Score: {score.overall_score}/100")
print(f"Accuracy: {score.accuracy_score}/100")
print(f"Format: {score.format_score}/100")
print(f"Compliance: {score.compliance_score}/100")
print(f"Violations: {score.violations}")
print(f"Reasoning: {score.reasoning}")
```

#### `parse_violations()`

Categorize violations by type.
Parameters:
- `violations` (List[str]): List of violation strings
Returns:
Dict[str, List[str]]: Dictionary mapping violation types to specific violations
Categories:
- `under_min`: Below minimum requirements
- `over_max`: Exceeds maximum limits
- `format`: Format specification violations
- `missing_field`: Required fields absent
- `other`: Miscellaneous issues
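One plausible implementation of this categorization is simple keyword matching (an illustrative sketch; the library's actual logic may differ):

```python
def categorize_violations(violations):
    """Bucket violation strings by keyword (illustrative sketch only)."""
    categories = {"under_min": [], "over_max": [], "format": [], "missing_field": [], "other": []}
    for v in violations:
        text = v.lower()
        if "under" in text or "below" in text:
            categories["under_min"].append(v)
        elif "over" in text or "exceed" in text:
            categories["over_max"].append(v)
        elif "missing" in text:
            categories["missing_field"].append(v)
        elif "format" in text:
            categories["format"].append(v)
        else:
            categories["other"].append(v)
    return categories
```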
Example:

```python
from taskbench.evaluation.judge import LLMJudge

judge = LLMJudge(client)
violations = [
    "Segment duration under 2 minutes",
    "Missing required CSV column: end_time",
    "Timestamp format invalid"
]

categorized = judge.parse_violations(violations)
print(categorized)
# {
#   "under_min": ["Segment duration under 2 minutes"],
#   "missing_field": ["Missing required CSV column: end_time"],
#   "format": ["Timestamp format invalid"],
#   "over_max": [],
#   "other": []
# }
```

## Model Comparison

Location: `taskbench.evaluation.judge`

### ModelComparison

Compare and rank model evaluation results.

#### `compare_results()`

Combine results and scores into comparison data.
Parameters:
- `results` (List[EvaluationResult]): List of evaluation results
- `scores` (List[JudgeScore]): List of corresponding judge scores
Returns:
List[Dict[str, Any]]: List of dicts with combined data, sorted by overall_score descending
Raises:
ValueError: If results and scores lists have different lengths
Comparison Data Fields:
- `rank`: Ranking (1 = best)
- `model`: Model identifier
- `overall_score`: Overall score (0-100)
- `accuracy_score`, `format_score`, `compliance_score`: Subscores
- `violations`: Number of violations
- `violation_list`: List of violation strings
- `cost_usd`: Cost in USD
- `tokens`: Total tokens used
- `latency_ms`: Latency in milliseconds
- `status`: Evaluation status
- `reasoning`: Judge's detailed reasoning
Example:

```python
from taskbench.evaluation.judge import ModelComparison

comparison = ModelComparison.compare_results(results, scores)
for item in comparison:
    print(f"Rank {item['rank']}: {item['model']}")
    print(f"  Score: {item['overall_score']}/100")
    print(f"  Cost: ${item['cost_usd']:.4f}")
    print(f"  Violations: {item['violations']}")
```

#### `identify_best()`

Identify the model with the highest overall score.
Parameters:
- `comparison` (List[Dict[str, Any]]): Comparison data from compare_results()
Returns:
str: Model identifier of the best model
Example:

```python
from taskbench.evaluation.judge import ModelComparison

comparison = ModelComparison.compare_results(results, scores)
best_model = ModelComparison.identify_best(comparison)
print(f"Best model: {best_model}")
```

#### `identify_best_value()`

Identify the model with the best score/cost ratio.
Parameters:
- `comparison` (List[Dict[str, Any]]): Comparison data from compare_results()
- `max_cost` (float, optional): Optional maximum cost filter
Returns:
str: Model identifier with best value
Value Calculation:
- If cost > 0: value_score = overall_score / cost_usd
- If cost = 0: value_score = overall_score * 1000 (free models get bonus)
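This rule can be sketched standalone:

```python
def value_score(overall_score, cost_usd):
    """Score-per-dollar; free models get a large fixed multiplier instead."""
    if cost_usd > 0:
        return overall_score / cost_usd
    return overall_score * 1000  # free models get a bonus

print(value_score(80, 0.0))  # 80000
```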
Example:

```python
from taskbench.evaluation.judge import ModelComparison

comparison = ModelComparison.compare_results(results, scores)

# Best value overall
best_value = ModelComparison.identify_best_value(comparison)
print(f"Best value: {best_value}")

# Best value under $0.50
best_cheap = ModelComparison.identify_best_value(comparison, max_cost=0.50)
print(f"Best value under $0.50: {best_cheap}")
```

#### `generate_comparison_table()`

Generate a Rich table for comparison display.
Parameters:
- `comparison` (List[Dict[str, Any]]): Comparison data from compare_results()
Returns:
rich.table.Table: Rich Table object
Table Columns:
- Rank
- Model (short name)
- Score (color-coded: green >=90, yellow >=80, red <80)
- Violations (color-coded: green =0, yellow <=2, red >2)
- Cost (USD)
- Tokens
- Value (P/PP/PPP rating based on score/cost ratio)
Example:

```python
from rich.console import Console
from taskbench.evaluation.judge import ModelComparison

console = Console()
comparison = ModelComparison.compare_results(results, scores)
table = ModelComparison.generate_comparison_table(comparison)
console.print(table)
```

## Cost Tracker

Location: `taskbench.evaluation.cost`

### CostTracker

Calculate and track costs for LLM evaluations.

#### `__init__()`

Initialize the cost tracker.
Parameters:
- `models_config_path` (str): Path to the YAML file containing model pricing (default: "config/models.yaml")
Raises:
- FileNotFoundError: If the config file doesn't exist
- ValueError: If the config file is invalid
Example:

```python
from taskbench.evaluation.cost import CostTracker

tracker = CostTracker("config/models.yaml")
```

#### `calculate_cost()`

Calculate the cost for a specific API call.
Parameters:
- `model_id` (str): Model identifier (e.g., "anthropic/claude-sonnet-4.5")
- `input_tokens` (int): Number of input tokens consumed
- `output_tokens` (int): Number of output tokens generated
Returns:
float: Cost in USD, rounded to $0.01 precision
Raises:
ValueError: If model_id is not found in pricing database
Formula:
```
cost = (input_tokens / 1,000,000) * input_price_per_1m
     + (output_tokens / 1,000,000) * output_price_per_1m
```
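As a standalone check of this formula, using the $3.00/$15.00 per-1M prices from the ModelConfig example above:

```python
def calculate_cost(input_tokens, output_tokens, input_price_per_1m, output_price_per_1m):
    """Token cost in USD, per the formula above."""
    cost = (input_tokens / 1_000_000) * input_price_per_1m \
         + (output_tokens / 1_000_000) * output_price_per_1m
    return round(cost, 6)

print(calculate_cost(1000, 500, 3.00, 15.00))  # 0.0105
```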
Example:

```python
from taskbench.evaluation.cost import CostTracker

tracker = CostTracker()
cost = tracker.calculate_cost(
    model_id="anthropic/claude-sonnet-4.5",
    input_tokens=1000,
    output_tokens=500
)
print(f"Cost: ${cost:.4f}")
# Cost: $0.0105
# Calculation: (1000/1M * $3.00) + (500/1M * $15.00) = $0.0030 + $0.0075 = $0.0105
```

#### `track_evaluation()`

Track an evaluation result for cost analysis.
Parameters:
- `result` (EvaluationResult): Evaluation result to track
Example:

```python
from taskbench.evaluation.cost import CostTracker

tracker = CostTracker()
tracker.track_evaluation(result)
```

#### `get_total_cost()`

Get the total cost of all tracked evaluations.
Returns:
float: Total cost in USD
Example:

```python
from taskbench.evaluation.cost import CostTracker

tracker = CostTracker()
# ... track some evaluations ...
total = tracker.get_total_cost()
print(f"Total cost: ${total:.2f}")
```

#### `get_cost_breakdown()`

Get the per-model cost breakdown.
Returns:
Dict[str, float]: Dictionary mapping model names to their total costs
Example:

```python
from taskbench.evaluation.cost import CostTracker

tracker = CostTracker()
# ... track some evaluations ...
breakdown = tracker.get_cost_breakdown()
for model, cost in breakdown.items():
    print(f"{model}: ${cost:.4f}")
```

#### `get_statistics()`

Get comprehensive cost statistics.
Returns:
- Dict[str, Any]: Dictionary with statistics:
  - `total_cost`: Total cost in USD
  - `total_tokens`: Total tokens across all evaluations
  - `total_evaluations`: Number of evaluations tracked
  - `avg_cost_per_eval`: Average cost per evaluation
  - `avg_tokens_per_eval`: Average tokens per evaluation
  - `cost_by_model`: Per-model cost breakdown
Example:

```python
from taskbench.evaluation.cost import CostTracker

tracker = CostTracker()
# ... track some evaluations ...
stats = tracker.get_statistics()
print(f"Total cost: ${stats['total_cost']:.2f}")
print(f"Total tokens: {stats['total_tokens']:,}")
print(f"Total evaluations: {stats['total_evaluations']}")
print(f"Average cost: ${stats['avg_cost_per_eval']:.4f}")
print(f"Average tokens: {stats['avg_tokens_per_eval']:,}")
```

#### `get_model_config()`

Get the configuration for a specific model.
Parameters:
- `model_id` (str): Model identifier
Returns:
Optional[ModelConfig]: ModelConfig if found, None otherwise
Example:

```python
from taskbench.evaluation.cost import CostTracker

tracker = CostTracker()
config = tracker.get_model_config("anthropic/claude-sonnet-4.5")
if config:
    print(f"{config.display_name}")
    print(f"Input: ${config.input_price_per_1m}/1M tokens")
    print(f"Output: ${config.output_price_per_1m}/1M tokens")
```

#### `list_models()`

Get a list of all available models.
Returns:
List[ModelConfig]: List of all model configurations
Example:

```python
from taskbench.evaluation.cost import CostTracker

tracker = CostTracker()
models = tracker.list_models()
for model in models:
    print(f"{model.display_name} ({model.provider})")
    print(f"  Input: ${model.input_price_per_1m}/1M")
    print(f"  Output: ${model.output_price_per_1m}/1M")
```

## Error Handling

Exception hierarchy:

```
Exception
├── OpenRouterError (base for all API errors)
│   ├── AuthenticationError (401)
│   ├── BadRequestError (400)
│   └── RateLimitError (429)
├── FileNotFoundError (task/config files)
├── yaml.YAMLError (YAML parsing)
└── pydantic.ValidationError (data validation)
```
## Best Practices

- Always use async context managers for API clients:

  ```python
  async with OpenRouterClient(api_key="key") as client:
      # client will be properly closed even if errors occur
      pass
  ```

- Check evaluation status before using results:
  ```python
  if result.status == "success":
      process(result.output)
  else:
      print(f"Error: {result.error}")
  ```

- Validate tasks before running evaluations:
  ```python
  is_valid, errors = parser.validate_task(task)
  if not is_valid:
      for error in errors:
          print(f"Error: {error}")
      return
  ```

- Handle missing models gracefully:
  ```python
  try:
      cost = tracker.calculate_cost(model_id, input_tokens, output_tokens)
  except ValueError as e:
      print(f"Model not found: {e}")
  ```

- Use retry decorators for resilience:
  ```python
  @retry_with_backoff(max_retries=3)
  async def robust_api_call():
      return await client.complete(...)
  ```

## Complete Example

Here's a complete example using all major components:
```python
import asyncio

from taskbench.api.client import OpenRouterClient
from taskbench.core.task import TaskParser
from taskbench.evaluation.cost import CostTracker
from taskbench.evaluation.executor import ModelExecutor
from taskbench.evaluation.judge import LLMJudge, ModelComparison
from rich.console import Console

async def main():
    console = Console()

    # Load task
    parser = TaskParser()
    task = parser.load_from_yaml("tasks/lecture_analysis.yaml")

    # Validate task
    is_valid, errors = parser.validate_task(task)
    if not is_valid:
        console.print("[red]Task validation failed:[/red]")
        for error in errors:
            console.print(f"  - {error}")
        return

    # Load input
    with open("data/transcript.txt") as f:
        input_data = f.read()

    # Initialize components
    async with OpenRouterClient(api_key="your-key") as client:
        cost_tracker = CostTracker()
        executor = ModelExecutor(client, cost_tracker)
        judge = LLMJudge(client)

        # Evaluate models
        model_ids = [
            "anthropic/claude-sonnet-4.5",
            "openai/gpt-4o",
            "qwen/qwen-2.5-72b-instruct"
        ]
        results = await executor.evaluate_multiple(
            model_ids=model_ids,
            task=task,
            input_data=input_data
        )

        # Judge results
        scores = []
        for result in results:
            if result.status == "success":
                score = await judge.evaluate(task, result, input_data)
                scores.append(score)
            else:
                scores.append(None)

        # Compare results
        valid_results = [r for r, s in zip(results, scores) if s is not None]
        valid_scores = [s for s in scores if s is not None]
        comparison = ModelComparison.compare_results(valid_results, valid_scores)
        table = ModelComparison.generate_comparison_table(comparison)
        console.print(table)

        # Show best models
        best_model = ModelComparison.identify_best(comparison)
        best_value = ModelComparison.identify_best_value(comparison)
        console.print(f"\nBest Overall: {best_model}")
        console.print(f"Best Value: {best_value}")

        # Cost statistics
        stats = cost_tracker.get_statistics()
        console.print(f"\nTotal Cost: ${stats['total_cost']:.2f}")
        console.print(f"Total Tokens: {stats['total_tokens']:,}")

if __name__ == "__main__":
    asyncio.run(main())
```

This completes the API reference documentation for LLM TaskBench.