
External API Reference

This document describes the public APIs that external applications rely on. These endpoints provide access to puzzle data, AI model analysis, user feedback, and performance metrics.

🔄 Recent Changes (Sept 2025): All artificial API result limits have been removed or significantly increased to support external applications.

🚀 NEW: ARC API Client for Python Researchers

Simple Python client for contributing analyses to ARC Explainer encyclopedia.

Installation & Usage

# Copy tools/api-client/arc_client.py into your project first:
#   cp tools/api-client/arc_client.py your_project/
from arc_client import contribute_to_arc_explainer

# One-line contribution to the encyclopedia (no API key required)
result = contribute_to_arc_explainer(
    "3a25b0d8", analysis_result, "grok-4-2025-10-13",
    "https://arc-explainer-staging.up.railway.app"
)

Features:

  • One-line integration for any Python researcher
  • Current October 2025 model names (no deprecated models)
  • Uses existing POST /api/puzzle/save-explained/:puzzleId endpoint
  • Model-specific functions: contribute_grok4_analysis(), contribute_gpt5_analysis()
  • Batch processing for multiple puzzles
  • Minimal dependencies: only the requests library

Complete Documentation: tools/api-client/README.md


Authentication

⚠️ NO AUTHENTICATION REQUIRED ⚠️

Most API endpoints are publicly accessible and require NO authentication. A small set of ARC3 community submission moderation endpoints is intentionally token-gated for safety (see "ARC3 Community" below).

Core Data Endpoints SUPER IMPORTANT!!

Puzzle Management

  • GET /api/puzzle/list - Get paginated list of all puzzles

    • Query params: page, limit, source (ARC1, ARC1-Eval, ARC2, ARC2-Eval)
    • Response: Paginated puzzle list with metadata
    • Limits: No artificial limits - returns all puzzles by default
  • GET /api/puzzle/overview - Get puzzle statistics and overview

    • Response: Puzzle counts by source, difficulty distribution
    • Limits: No limits
  • GET /api/puzzle/task/:taskId - Get specific puzzle data by ID

    • Params: taskId (string) - Puzzle identifier
    • Response: Complete puzzle data with input/output grids
    • Limits: Single puzzle fetch - no limits
  • POST /api/puzzle/analyze/:taskId/:model - Analyze puzzle with specific AI model

    • Params: taskId (string), model (string) - Model name
    • Body: Analysis configuration options (see Debate Mode below for debate-specific options). For conversation chaining via the Responses API, include previousResponseId to continue a prior analysis.
    • Response: Analysis result with explanation and predictions
    • Limits: No limits
    • Debate Mode: Include originalExplanation and customChallenge in body to generate debate rebuttals
  • POST /api/stream/analyze - Prepare Server-Sent Events analysis stream

    • Body: Same analysis options accepted by the non-streaming POST endpoint (temperature, promptId, omitAnswer, reasoning options, etc.) plus taskId and modelKey
    • Response: { sessionId, expiresInSeconds, expiresAt } referencing the cached payload stored on the server for the follow-up SSE request. expiresAt is an ISO timestamp representing the handshake expiration window.
    • Notes: Payloads are discarded automatically when the stream completes, errors, or is cancelled, and they auto-expire after 60 seconds if the SSE connection is never opened
  • GET /api/stream/analyze/:taskId/:modelKey/:sessionId - Start Server-Sent Events stream for token-by-token analysis

    • Params: taskId (string), modelKey (string), sessionId (string) returned from the POST handshake
    • Query: No longer accepts large option blobs; the server retrieves the cached payload prepared during the POST handshake
    • Safety: If the taskId/modelKey tuple does not match the cached payload, the server rejects the connection and clears the pending session to avoid leaks.
    • Response: SSE channel emitting stream.init, stream.chunk, stream.status, stream.complete, stream.error. The initial stream.init payload includes expiresAt so clients can display the remaining handshake time.
    • Notes: Enabled when STREAMING_ENABLED=true; defaults to true in development builds so SSE works out of the box. Currently implemented for GPT-5 mini/nano and Grok-4(-Fast) models.
    • Client: The createAnalysisStream utility in client/src/lib/streaming/analysisStream.ts provides a typed wrapper and performs the POST handshake automatically before opening the SSE connection

📘 Streaming configuration — Set STREAMING_ENABLED=false to disable SSE globally (frontend and backend). Leaving it unset keeps streaming enabled in development and requires explicit opt-out in production.
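
For clients outside the bundled TypeScript helper, the two-step handshake can be driven directly over HTTP. A minimal Python sketch using only the requests library (the base URL, puzzle ID, and model key are illustrative, and SSE parsing is simplified):

import json
import requests

BASE = "https://arc-explainer-staging.up.railway.app"  # illustrative deployment

# Step 1: POST handshake caches the analysis payload server-side.
resp = requests.post(f"{BASE}/api/stream/analyze", json={
    "taskId": "00d62c1b",          # illustrative puzzle ID
    "modelKey": "gpt-5-mini",      # illustrative model key
    "promptId": "solver",
}).json()
session = resp.get("data", resp)   # unwrap the standard envelope if present
session_id = session["sessionId"]

# Step 2: open the SSE stream before the ~60-second handshake expires.
url = f"{BASE}/api/stream/analyze/00d62c1b/gpt-5-mini/{session_id}"
with requests.get(url, stream=True) as stream:
    event = None
    for raw in stream.iter_lines(decode_unicode=True):
        if raw.startswith("event:"):
            event = raw.split(":", 1)[1].strip()
        elif raw.startswith("data:") and event:
            payload = json.loads(raw.split(":", 1)[1])
            if event == "stream.chunk":
                print(payload)
            elif event in ("stream.complete", "stream.error"):
                break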

  • GET /api/puzzle/:puzzleId/has-explanation - Check if puzzle has existing explanation

    • Params: puzzleId (string)
    • Response: Boolean indicating explanation existence
    • Limits: No limits
  • POST /api/puzzle/reinitialize - Reinitialize puzzle database

    • Admin endpoint: Reloads all ARC puzzle data
    • Limits: No limits
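
Putting the core endpoints together, a minimal non-streaming round trip in Python (base URL and model name are illustrative; URL-encode the model segment if it contains a slash, e.g. openai%2Fo4-mini):

import requests

BASE = "https://arc-explainer-staging.up.railway.app"  # illustrative deployment

# Fetch the raw puzzle grids
task = requests.get(f"{BASE}/api/puzzle/task/00d62c1b").json()

# Run a one-shot analysis; body options such as promptId and temperature are optional
analysis = requests.post(
    f"{BASE}/api/puzzle/analyze/00d62c1b/gpt-4o",
    json={"promptId": "solver", "temperature": 0.2},
).json()
print(analysis)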

AI Model Analysis SUPER IMPORTANT!!

  • GET /api/models - List all available AI models and providers
    • Limits: No limits

Model Dataset Performance Analysis ✨ NEW!

  • GET /api/model-dataset/datasets - Get all available ARC datasets dynamically

    • Response: Array of DatasetInfo objects with name, puzzleCount, and path
    • Discovery: Automatically scans data/ directory for JSON puzzle files
    • Examples: evaluation (400 puzzles), training (400 puzzles), evaluation2 (117 puzzles), etc.
    • Limits: No limits - returns all discovered datasets
  • GET /api/model-dataset/models - Get all models that have attempted puzzles

    • Response: Array of model names from database explanations table
    • Data Source: Distinct model_name values with existing attempts
    • Limits: No limits - returns all models with database entries
  • GET /api/model-dataset/performance/:modelName/:datasetName - Get model performance on specific dataset

    • Params: modelName (string), datasetName (string) - Any model and any dataset
    • Response: ModelDatasetPerformance with categorized puzzle results:
      • solved[]: Puzzle IDs where is_prediction_correct = true OR multi_test_all_correct = true
      • failed[]: Puzzle IDs attempted but incorrect
      • notAttempted[]: Puzzle IDs with no database entries for this model
      • summary: Counts and success rate percentage
    • Query Logic: Uses exact same logic as puzzle-analysis.ts script
    • Dynamic: Works with ANY model name and ANY dataset discovered from filesystem
    • Limits: No limits
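
For example, a short Python sketch that discovers datasets and models, then pulls one model's categorized results (model and dataset names are illustrative):

import requests

BASE = "https://arc-explainer-staging.up.railway.app"  # illustrative deployment

datasets = requests.get(f"{BASE}/api/model-dataset/datasets").json()
models = requests.get(f"{BASE}/api/model-dataset/models").json()

perf = requests.get(f"{BASE}/api/model-dataset/performance/gpt-4o/evaluation2").json()
data = perf.get("data", perf)  # unwrap the standard envelope if present
print(len(data["solved"]), "solved,",
      len(data["failed"]), "failed,",
      len(data["notAttempted"]), "not attempted")
print("summary:", data["summary"])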

DEPRECATED BATCH ENDPOINTS (never worked correctly):

  • POST /api/model/batch-analyze - Start batch analysis across multiple puzzles
  • GET /api/model/batch-status/:sessionId - Get batch analysis progress
  • POST /api/model/batch-control/:sessionId - Control batch analysis (pause/resume/stop)
  • GET /api/model/batch-results/:sessionId - Get batch analysis results
  • GET /api/model/batch-sessions - Get all batch analysis sessions

Explanation Management SUPER IMPORTANT!!

  • GET /api/puzzle/:puzzleId/explanations - Get all explanations for a puzzle
    • Query params: correctness (optional) - Filter by 'correct', 'incorrect', or 'all'
    • Limits: No limits - returns all explanations
    • Use case: ModelDebate page uses ?correctness=incorrect to show only wrong answers for debate
  • GET /api/puzzle/:puzzleId/explanation - Get single explanation for a puzzle
    • Limits: Single result - no limits
  • POST /api/puzzle/save-explained/:puzzleId - Save AI-generated explanation
    • Limits: No limits
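
For instance, fetching only incorrect explanations for debate, sketched in Python (puzzle ID illustrative):

import requests

BASE = "https://arc-explainer-staging.up.railway.app"  # illustrative deployment

resp = requests.get(
    f"{BASE}/api/puzzle/e8dc4411/explanations",
    params={"correctness": "incorrect"},  # 'correct', 'incorrect', or 'all'
).json()
explanations = resp.get("data", resp)
print(f"{len(explanations)} incorrect explanations available for debate")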

Debate & Rebuttal Tracking ✨ NEW! (September 2025)

Generate Debate Rebuttal

  • POST /api/puzzle/analyze/:taskId/:model - Generate AI challenge to existing explanation
    • Debate Mode Body:
      {
        "originalExplanation": {
          "id": 123,
          "modelName": "gpt-4o",
          "patternDescription": "...",
          "solvingStrategy": "...",
          "hints": ["..."],
          "confidence": 85,
          "isPredictionCorrect": false
        },
        "customChallenge": "Focus on edge cases in corners",
        "temperature": 0.2,
        "promptId": "debate"
      }
    • Response: New explanation with rebuttingExplanationId set to original explanation's ID
    • Use case: AI-vs-AI debate where one model critiques another's reasoning
    • Database: Stores relationship in rebutting_explanation_id column

Query Debate Chains

  • GET /api/explanations/:id/chain - Get full rebuttal chain for an explanation

    • Params: id (number) - Explanation ID to get debate chain for
    • Response: Array of ExplanationData objects in chronological order (original → rebuttals)
    • Use case: Display complete debate thread showing which AIs challenged which
    • Database: Uses recursive CTE query to walk rebuttal relationships
    • Limits: No limits - returns entire chain regardless of depth
    • Example Response:
      {
        "success": true,
        "data": [
          { "id": 100, "modelName": "gpt-4o", "rebuttingExplanationId": null },
          { "id": 101, "modelName": "claude-3.5-sonnet", "rebuttingExplanationId": 100 },
          { "id": 102, "modelName": "gemini-2.5-pro", "rebuttingExplanationId": 101 }
        ]
      }
  • GET /api/explanations/:id/original - Get parent explanation that a rebuttal is challenging

    • Params: id (number) - Rebuttal explanation ID
    • Response: Single ExplanationData object or 404 if not a rebuttal
    • Use case: Navigate from challenge back to original explanation
    • Database: Joins on rebutting_explanation_id foreign key
    • Returns 404: If explanation is not a rebuttal or parent doesn't exist
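
A small Python sketch that walks a debate chain with the endpoints above (explanation ID illustrative):

import requests

BASE = "https://arc-explainer-staging.up.railway.app"  # illustrative deployment

chain = requests.get(f"{BASE}/api/explanations/102/chain").json()
for entry in chain.get("data", chain):
    parent = entry["rebuttingExplanationId"]
    label = f"rebuts #{parent}" if parent else "original"
    print(f"#{entry['id']} {entry['modelName']} ({label})")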

Conversation Chaining (Responses API) ✨ NEW! (October 2025)

Multi-turn conversations with full context retention using provider-native conversation chaining.

How It Works

  1. Each AI analysis returns a providerResponseId in the response
  2. Pass previousResponseId in the next analysis request to maintain context
  3. Provider automatically retrieves ALL previous reasoning and responses (server-side)
  4. Previous reasoning is retrieved server-side at no extra token cost (30-day provider retention); new turns still consume tokens as usual

Supported Providers

  • OpenAI: o-series models (o3, o4, o4-mini) and GPT-5
  • xAI: Grok-4 models
  • Provider Compatibility: Response IDs only work within the same provider
    • OpenAI ID → OpenAI models ✅
    • xAI ID → xAI models ✅
    • Cross-provider chaining ❌ (will start new conversation)

API Usage

// Request 1: Initial analysis
POST /api/puzzle/analyze/00d62c1b/openai%2Fo4-mini
Body: { "promptId": "solver" }
Response: { "providerResponseId": "resp_abc123", ... }

// Request 2: Follow-up with full context
POST /api/puzzle/analyze/00d62c1b/openai%2Fo4-mini
Body: { 
  "promptId": "solver",
  "previousResponseId": "resp_abc123"  // Maintains context
}
Response: { "providerResponseId": "resp_def456", ... }

Database Storage

  • Column: provider_response_id (text) in explanations table
  • Frontend Field: providerResponseId in ExplanationData type
  • Mapping: Automatically handled by useExplanation hook

Get Eligible Explanations for Discussion

  • GET /api/discussion/eligible - Get recent explanations eligible for conversation chaining
    • Query params: limit (default 20), offset (default 0)
    • Eligibility Criteria:
      • Has provider_response_id in database
      • Created within last 30 days (provider retention window)
      • NO model type restrictions - any model with response ID is eligible
    • Response: Array of eligible explanations with metadata
      {
        "explanations": [
          {
            "id": 29432,
            "puzzleId": "e8dc4411",
            "modelName": "openai/o4-mini",
            "provider": "openai",
            "createdAt": "2025-10-06T12:00:00Z",
            "daysOld": 3,
            "hasProviderResponseId": true,
            "confidence": 85,
            "isCorrect": true
          }
        ],
        "total": 1,
        "limit": 20,
        "offset": 0
      }
    • Use case: PuzzleDiscussion landing page shows recent eligible analyses
    • Limits: Server-side pagination with configurable limit

Documentation

  • docs/API_Conversation_Chaining.md - Complete usage guide
  • docs/Responses_API_Chain_Storage_Analysis.md - Technical implementation details

User Feedback VERY IMPORTANT!!

  • POST /api/feedback - Submit user feedback on explanations
    • Limits: No limits
  • GET /api/explanation/:explanationId/feedback - Get feedback for specific explanation
    • Limits: No limits
  • GET /api/puzzle/:puzzleId/feedback - Get all feedback for a puzzle
    • Limits: No limits
  • GET /api/feedback - Get all feedback with optional filtering
    • Query params: limit (max 10000, increased from 1000), offset, filters
    • Limits: Maximum 10000 results per request (previously 1000)
  • GET /api/feedback/stats - Get feedback summary statistics
    • Limits: No limits
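
A sketch of paging through feedback with the raised limit (filter parameters beyond limit/offset are deployment-specific):

import requests

BASE = "https://arc-explainer-staging.up.railway.app"  # illustrative deployment

offset, total = 0, 0
while True:
    page = requests.get(f"{BASE}/api/feedback",
                        params={"limit": 10000, "offset": offset}).json()
    rows = page.get("data", page)
    if not rows:
        break
    total += len(rows)
    offset += len(rows)
print("total feedback rows:", total)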

Analytics and Metrics Endpoints SUPER IMPORTANT!!

Performance Statistics SUPER IMPORTANT!!

Accuracy Statistics

🚨 CRITICAL CHANGE (Sept 30, 2025): Solver accuracy and debate accuracy are now tracked separately to prevent data pollution.

  • GET /api/feedback/accuracy-stats - Primary solver accuracy endpoint - Pure 1-shot puzzle-solving accuracy

    • Response: PureAccuracyStats with modelAccuracyRankings[] (used by AccuracyLeaderboard)
    • Sort Order: Ascending by accuracy (worst performers first - "Models Needing Improvement")
    • Data Source: is_prediction_correct and multi_test_all_correct boolean fields only
    • Filtering: WHERE rebutting_explanation_id IS NULL - EXCLUDES debate rebuttals
    • Use Case: Fair apples-to-apples model comparison for pure puzzle solving (no contextual advantage)
    • 🔄 CHANGED: No longer limited to 10 results - returns ALL models with stats
  • GET /api/feedback/debate-accuracy-stats - Debate challenger accuracy - Success rate for AI challenges/rebuttals

    • Response: PureAccuracyStats with modelAccuracyRankings[] (same structure as solver accuracy)
    • Sort Order: Descending by accuracy (best performers first - "Top Debate Challengers")
    • Data Source: is_prediction_correct and multi_test_all_correct boolean fields only
    • Filtering: WHERE rebutting_explanation_id IS NOT NULL - ONLY debate rebuttals
    • Use Case: Identify which models excel at challenging/critiquing incorrect explanations
    • Research Value: Compare solver vs. critique capabilities across models
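
To contrast solving and critiquing for the same models, both endpoints can be joined client-side. A minimal Python sketch using the documented modelAccuracyRankings fields:

import requests

BASE = "https://arc-explainer-staging.up.railway.app"  # illustrative deployment

def rankings(path):
    stats = requests.get(f"{BASE}{path}").json()
    data = stats.get("data", stats)  # unwrap the standard envelope if present
    return {r["modelName"]: r["accuracyPercentage"]
            for r in data["modelAccuracyRankings"]}

solver = rankings("/api/feedback/accuracy-stats")
debate = rankings("/api/feedback/debate-accuracy-stats")
for model in sorted(solver.keys() & debate.keys()):
    print(f"{model}: solver {solver[model]:.1f}% vs debate {debate[model]:.1f}%")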

Trustworthiness Statistics

  • GET /api/puzzle/performance-stats - Trustworthiness and confidence reliability metrics
    • Response: PerformanceLeaderboards with trustworthinessLeaders[], speedLeaders[], efficiencyLeaders[]
    • Data Source: trustworthiness_score field (AI confidence reliability)
    • 🔄 CHANGED: No longer limited to 10 results - returns ALL models with stats

Combined Analytics

  • GET /api/puzzle/accuracy-stats - DEPRECATED - Mixed accuracy/trustworthiness data
    • Warning: Despite name, contains trustworthiness-filtered results
  • GET /api/puzzle/general-stats - General model statistics (mixed data from MetricsRepository)
  • GET /api/puzzle/raw-stats - Infrastructure and database performance metrics
  • GET /api/metrics/comprehensive-dashboard - Combined analytics dashboard from all repositories

Cost Statistics NEW - September 2025

🚨 CRITICAL: Cost calculations completely refactored for proper domain separation. All cost endpoints now use dedicated CostRepository following SRP principles.

  • GET /api/metrics/costs/models - Get cost summaries for all models

    • Response: Array of ModelCostSummary objects with normalized model names
    • Data: Total cost, average cost, attempts, min/max costs per model
    • Business Rules: Uses consistent model name normalization (removes :free, :beta, :alpha suffixes)
    • Limits: No limits - returns all models with cost data
  • GET /api/metrics/costs/models/:modelName - Get detailed cost summary for specific model

    • Params: modelName (normalized automatically - "claude-3.5-sonnet" matches "claude-3.5-sonnet:beta")
    • Response: Single ModelCostSummary object
    • Limits: Single model result
  • GET /api/metrics/costs/models/:modelName/trends?days=30 - Get cost trends over time for model

    • Query params: days (1-365, default: 30) - Time range for trend analysis
    • Response: Array of CostTrend objects with daily cost data
    • Use case: Cost optimization and pattern analysis
    • Limits: Maximum 365 days of historical data
  • GET /api/metrics/costs/system/stats - Get system-wide cost statistics

    • Response: Total system cost, total requests, average cost per request, unique models, cost-bearing requests
    • Use case: Financial reporting and system cost analysis
    • Limits: System-wide aggregated data only
  • GET /api/metrics/costs/models/map - Get cost map for cross-repository integration

    • Response: Object with modelName → {totalCost, avgCost, attempts} mapping
    • Use case: Internal cross-repository data integration (used by MetricsRepository)
    • Limits: No limits

🔄 Data Consistency: All cost endpoints now return identical values for the same model (eliminated previous inconsistencies between UI components).

⚙️ Performance: Cost queries optimized with database indexes on (model_name, estimated_cost) and (created_at, estimated_cost, model_name).
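
For example, pulling a model's 30-day cost trend in Python (model name illustrative; the server normalizes suffixed variants):

import requests

BASE = "https://arc-explainer-staging.up.railway.app"  # illustrative deployment

trends = requests.get(
    f"{BASE}/api/metrics/costs/models/claude-3.5-sonnet/trends",
    params={"days": 30},  # 1-365, default 30
).json()
for day in trends.get("data", trends):
    print(day)  # one CostTrend object per day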

Model Comparison & Analysis ✨

Model-to-Model Comparison

  • GET /api/metrics/compare - Compare specific models on a dataset

    • Query params: model1 (required), model2 (required), model3 (optional), model4 (optional), dataset (required)
    • Response: ModelComparisonResult with detailed puzzle-by-puzzle comparison
    • Data Structure:
      {
        summary: {
          totalPuzzles: number;
          model1Name: string;
          model2Name: string;
          model3Name?: string;
          model4Name?: string;
          dataset: string;
          allCorrect: number;        // All models got it right
          allIncorrect: number;      // All models got it wrong
          allNotAttempted: number;   // No model tried
          threeCorrect?: number;     // Exactly 3 correct (4-model comparison)
          twoCorrect?: number;       // Exactly 2 correct
          oneCorrect?: number;       // Exactly 1 correct
          model1OnlyCorrect: number; // Only model 1 correct
          model2OnlyCorrect: number; // Only model 2 correct
          model3OnlyCorrect?: number;
          model4OnlyCorrect?: number;
      
          // Attempt-union stats (used by /scoring)
          // If you compare two attempt-suffixed models of the same base model
          // (e.g. "some-model-attempt1" vs "some-model-attempt2"), the server returns
          // union metrics for the base model name.
          attemptUnionStats: Array<{
            baseModelName: string;
            attemptModelNames: string[];
            unionAccuracyPercentage: number;
            unionCorrectCount: number;
            totalPuzzles: number;
            totalTestPairs?: number;
            puzzlesCounted?: number;
            puzzlesFullySolved?: number;
            datasetTotalPuzzles?: number;
            datasetTotalTestPairs?: number;
          }>;
        },
        details: PuzzleComparisonDetail[];  // Per-puzzle results
      }
    • Use Case: Head-to-head model performance comparison on specific datasets
    • Example: /api/metrics/compare?model1=gpt-5-pro&model2=grok-4&dataset=evaluation2
    • Limits: Up to 4 models simultaneously, any dataset from data/ directory
    • Union puzzle IDs: For union scoring, clients can compute the solved puzzle/test-pair IDs by scanning details[] and treating a puzzle/test pair as solved if any compared attempt is correct (see the sketch after this list).
  • POST /api/puzzle/analyze-list - Analyze specific puzzles across ALL models

    • Body: { puzzleIds: string[] } - Array of puzzle IDs (max 500)
    • Response: PuzzleListAnalysisResponse with model-puzzle matrix
    • Data Structure:
      {
        modelPuzzleMatrix: Array<{
          modelName: string;
          puzzleStatuses: Array<{
            puzzleId: string;
            status: 'correct' | 'incorrect' | 'not_attempted';
          }>;
        }>;
        puzzleResults: Array<{
          puzzle_id: string;
          correct_models: string[];
          total_attempts: number;
        }>;
        summary: {
          totalPuzzles: number;
          totalModels: number;
          perfectModels: number;      // Got ALL puzzles correct
          partialModels: number;      // Got some correct, some wrong
          notAttemptedModels: number; // Never tried any
        };
      }
    • Use Case: Check which models solved specific user-selected puzzles (inverse of model comparison)
    • Limits: Max 500 puzzle IDs per request
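
As noted in the union scoring bullet above, the solved set can be derived client-side from details[]. A hypothetical Python sketch; the puzzleId and per-model correctness fields below are assumed names, not the documented PuzzleComparisonDetail schema:

import requests

BASE = "https://arc-explainer-staging.up.railway.app"  # illustrative deployment

result = requests.get(f"{BASE}/api/metrics/compare", params={
    "model1": "some-model-attempt1",
    "model2": "some-model-attempt2",
    "dataset": "evaluation2",
}).json()
data = result.get("data", result)

# HYPOTHETICAL field names: 'puzzleId' and a per-model 'correct' flag are
# assumptions about PuzzleComparisonDetail, not the documented schema.
union_solved = {
    d["puzzleId"]
    for d in data["details"]
    if any(m.get("correct") for m in d.get("models", []))
}
print(len(union_solved), "puzzles solved by at least one attempt")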

Model Analysis

  • GET /api/puzzle/confidence-stats - Model confidence analysis
    • Limits: No limits
  • GET /api/puzzle/worst-performing - Identify problematic puzzles
    • Query params: limit (max 500, increased from 50), sortBy, accuracy filters
    • 🔄 CHANGED: Maximum limit increased from 50 to 500 results

Worm 🐛 Arena & SnakeBench API

Worm Arena (LLM Snake) and the embedded SnakeBench backend expose a small, public API surface for running matches, listing replays, querying stats, and streaming tournaments:

  • POST /api/snakebench/run-match – Run a single Worm Arena match between two models
  • POST /api/snakebench/run-batch – Run a bounded batch of matches (small count)
  • GET /api/snakebench/games / /api/snakebench/games/:gameId – List games and fetch full replay JSON
  • GET /api/snakebench/health – Embedded SnakeBench health check (Python/backend/runner)
  • GET /api/snakebench/stats – Global Worm Arena stats (total games, active models, apples, total cost)
  • GET /api/snakebench/model-rating / /api/snakebench/model-history – Per-model TrueSkill snapshot + match history
  • GET /api/snakebench/leaderboard / /api/snakebench/trueskill-leaderboard – Leaderboards for Worm Arena models
  • GET /api/snakebench/greatest-hits – Curated list of “greatest hits” games (longest, most expensive, highest-scoring)
  • POST /api/wormarena/prepare – Prepare live Worm Arena batch session (multi-opponent or legacy count-based)
  • GET /api/wormarena/stream/:sessionId – SSE stream for live Worm Arena batches and single matches

All of these endpoints are public with no authentication, consistent with the rest of ARC Explainer. For detailed request/response schemas and SSE event types, see:

  • docs/reference/api/SnakeBench_WormArena_API.md

Solution Submission (Community Features)

  • GET /api/puzzles/:puzzleId/solutions - Get community solutions for puzzle
  • POST /api/puzzles/:puzzleId/solutions - Submit community solution
  • POST /api/solutions/:solutionId/vote - Vote on community solutions
  • GET /api/solutions/:solutionId/votes - Get solution vote counts

Prompt Management

  • GET /api/prompts - Get available prompt templates
  • POST /api/prompt-preview - Preview AI prompt before analysis
  • POST /api/prompt/preview/:provider/:taskId - Preview prompt for specific provider

Conversation Chaining (Responses API)

Multi‑turn conversations with provider‑managed context retention using response IDs (a condensed recap of the Conversation Chaining section above, plus storage details).

  • Each analysis returns providerResponseId in the payload
  • Subsequent requests may include previousResponseId to continue the chain
  • Supported: OpenAI o‑series/GPT‑5 and xAI Grok‑4 (same‑provider chains only)
  • Retention typically 30 days (when store: true); new requests still consume tokens

Example request body:

{
  "promptId": "solver",
  "previousResponseId": "resp_abc123"
}

Storage and indexing:

-- Stored on each explanation row
provider_response_id TEXT DEFAULT NULL;

-- Recommended indexes
CREATE INDEX IF NOT EXISTS idx_explanations_provider_response_id
  ON explanations(provider_response_id) WHERE provider_response_id IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_explanations_created_recent
  ON explanations(created_at DESC);

Administrative Endpoints

Health and Recovery

  • GET /api/health/database - Database connection status
  • GET /api/admin/recovery-stats - Data recovery statistics
  • POST /api/admin/recover-multiple-predictions - Recover missing prediction data

Validation

  • POST /api/puzzle/validate - Validate puzzle data structure

🔄 Major Changes for External Applications

Removed Limits

  • Analytics endpoints: No longer return only top 10 results
  • Performance stats: All trustworthiness, speed, and efficiency data returned
  • Accuracy rankings: Complete model accuracy data available

Increased Limits

  • Feedback endpoint: Maximum limit increased from 1000 to 10000 results
  • Worst-performing puzzles: Maximum limit increased from 50 to 500 results
  • Batch results: Configurable limits up to 10000 results (note the batch endpoints themselves are deprecated; see above)

No Change (Already Unlimited)

  • Puzzle list: Returns all puzzles without pagination by default
  • Individual puzzle data: No limits on single puzzle fetches
  • Model listings: No limits on available models

Response Format

All API endpoints return JSON responses in this format:

{
  "success": true,
  "data": { /* response data */ },
  "message": "Operation completed successfully",
  "timestamp": "2025-01-01T00:00:00.000Z"
}

Error responses:

{
  "success": false,
  "error": "Error description",
  "details": "Additional error information",
  "timestamp": "2025-01-01T00:00:00.000Z"
}
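
Because every endpoint shares this envelope, external clients can unwrap it in one place. A minimal Python helper:

import requests

def api_get(base: str, path: str, **params):
    """GET an ARC Explainer endpoint and unwrap the standard envelope."""
    body = requests.get(f"{base}{path}", params=params).json()
    if not body.get("success", True):
        raise RuntimeError(body.get("error", "unknown error"))
    return body.get("data", body)

Usage is then uniform across endpoints, e.g. api_get(BASE, "/api/puzzle/list", source="ARC2-Eval").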

Data Models and Examples

Key Response Interfaces USE THESE!!!

PureAccuracyStats (from /api/feedback/accuracy-stats)

interface PureAccuracyStats {
  totalSolverAttempts: number;
  totalCorrectPredictions: number; 
  overallAccuracyPercentage: number;
  modelAccuracyRankings: ModelAccuracyRanking[]; // Now returns ALL models, not just 10
}

interface ModelAccuracyRanking {
  modelName: string;
  totalAttempts: number;
  correctPredictions: number;
  accuracyPercentage: number;
  singleTestAttempts: number;
  singleCorrectPredictions: number;
  singleTestAccuracy: number;
  multiTestAttempts: number; 
  multiCorrectPredictions: number;
  multiTestAccuracy: number;
}

PerformanceLeaderboards (from /api/puzzle/performance-stats)

interface PerformanceLeaderboards {
  trustworthinessLeaders: Array<{ // Now returns ALL models, not just 10
    modelName: string;
    avgTrustworthiness: number;
    avgConfidence: number;
    avgProcessingTime: number;
    avgCost: number;
    totalCost: number;
  }>;
  speedLeaders: Array<{ // Now returns ALL models, not just 10
    modelName: string;
    avgProcessingTime: number;
    totalAttempts: number;
    avgTrustworthiness: number;
  }>;
  efficiencyLeaders: Array<{ // Now returns ALL models, not just 10
    modelName: string;
    costEfficiency: number;
    tokenEfficiency: number;
    avgTrustworthiness: number;
    totalAttempts: number;
  }>;
  overallTrustworthiness: number;
}

FeedbackStats (from /api/feedback/stats)

interface FeedbackStats {
  totalFeedback: number;
  helpfulPercentage: number;
  topModels: Array<{
    modelName: string;
    feedbackCount: number;
    helpfulCount: number;
    notHelpfulCount: number;
    helpfulPercentage: number;
  }>;
  feedbackByModel: Record<string, {
    helpful: number;
    notHelpful: number;
  }>;
}

Admin Dashboard & Ingestion ✨ NEW! (October 2025)

Admin Dashboard Stats

  • GET /api/admin/quick-stats - Dashboard statistics

    • Response: { totalModels, totalExplanations, databaseConnected, lastIngestion, timestamp }
    • Use case: Admin Hub homepage quick stats
    • Limits: No limits
  • GET /api/admin/recent-activity - Recent ingestion activity

    • Response: Array of last 10 ingestion runs with stats
    • Limits: Fixed at 10 most recent runs

HuggingFace Dataset Ingestion

  • POST /api/admin/validate-ingestion - Pre-flight validation before ingestion

    • Body: { datasetName, baseUrl }
    • Response: Validation result with checks (URL accessible, token present, DB connected, etc.)
    • Use case: Validate configuration before starting ingestion
    • Limits: No limits
  • POST /api/admin/start-ingestion - Start HuggingFace dataset ingestion

    • Body: { datasetName, baseUrl, source, limit, delay, dryRun, forceOverwrite, verbose }
    • Response: { success, message, config } (202 Accepted - async operation)
    • Use case: Import external model predictions from HuggingFace datasets
    • Limits: No limits
    • Note: Returns immediately; ingestion runs in background
  • GET /api/admin/ingestion-history - Complete ingestion run history

    • Response: Array of all ingestion runs with full details
    • Limits: No limits - returns all historical runs
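
Because start-ingestion returns 202 and runs in the background, a typical flow validates, starts, then polls history. A Python sketch with illustrative values:

import requests

BASE = "https://arc-explainer-staging.up.railway.app"  # illustrative deployment
config = {
    "datasetName": "example-dataset",                     # illustrative values
    "baseUrl": "https://huggingface.co/datasets/example",
    "source": "ARC2-Eval",
    "dryRun": True,
}

# Pre-flight validation
check = requests.post(f"{BASE}/api/admin/validate-ingestion",
                      json={"datasetName": config["datasetName"],
                            "baseUrl": config["baseUrl"]}).json()
print(check)

# Start the run; returns immediately with 202 Accepted
resp = requests.post(f"{BASE}/api/admin/start-ingestion", json=config)
assert resp.status_code == 202

# Poll completed runs later
history = requests.get(f"{BASE}/api/admin/ingestion-history").json()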

Ingestion Data Model

interface IngestionRun {
  id: number;
  datasetName: string;
  baseUrl: string;
  source: string;  // ARC1-Eval, ARC2-Eval, etc.
  totalPuzzles: number;
  successful: number;
  failed: number;
  skipped: number;
  durationMs: number;
  dryRun: boolean;
  accuracyPercent: number | null;
  startedAt: string;
  completedAt: string;
  errorLog: string | null;
}

RE-ARC Bench (Dataset Generation & Evaluation) ✨ NEW! (December 2025)

Self-service platform for generating unique ARC evaluation datasets and scoring solver submissions. Contributed by David Lu (@conundrumer).

Dataset Generation

  • POST /api/rearc/generate - Generate unique 120-task evaluation dataset
    • Response: Streaming gzip JSON download
    • Headers: Content-Disposition: attachment; filename="re-arc_test_challenges-{timestamp}.json"
    • Rate Limit: 2 requests per 5 minutes
    • Notes: Each request generates a cryptographically unique dataset. Task IDs encode the generation seed via XOR, enabling stateless verification without server-side storage.

Submission Evaluation

  • POST /api/rearc/evaluate - Evaluate solver submission against generated dataset
    • Content-Type: multipart/form-data with JSON submission file
    • Response: Server-Sent Events stream
    • Rate Limit: 20 requests per 5 minutes
    • SSE Events:
      • progress - { current: number, total: number } - Evaluation progress
      • complete - { type: "score", score: number } - Final score (0.0-1.0)
      • complete - { type: "mismatches", mismatches: [...] } - Test pair count mismatches
      • error - { message: string } - Validation or processing error
    • Scoring: Uses official ARC Prize competition rules - pair solved if ANY of 2 attempts correct
    • Caching: LRU cache provides ~100x speedup on repeated evaluations of same dataset

Submission Format

[
  {  // Test Pair 0
    "attempt_1": [[0, 1], [2, 3]],
    "attempt_2": [[0, 1], [2, 3]]
  },
  {  // Test Pair 1
    "attempt_1": [[4, 5]],
    "attempt_2": [[4, 5]]
  }
]
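
End to end: download a generated dataset, solve it offline, then upload the submission and read the SSE events. A hedged Python sketch; the multipart field name "submission" is an assumption, not documented:

import requests

BASE = "https://arc-explainer-staging.up.railway.app"  # illustrative deployment

# 1. Generate a unique dataset (rate limit: 2 requests per 5 minutes)
with requests.post(f"{BASE}/api/rearc/generate", stream=True) as resp:
    with open("re-arc_test_challenges.json.gz", "wb") as f:
        for chunk in resp.iter_content(chunk_size=65536):
            f.write(chunk)

# 2. ...solve the 120 tasks offline and write submission.json in the format above...

# 3. Evaluate (rate limit: 20 requests per 5 minutes) and stream SSE events.
# NOTE: the multipart field name "submission" is an assumption.
with open("submission.json", "rb") as f:
    with requests.post(f"{BASE}/api/rearc/evaluate",
                       files={"submission": f}, stream=True) as sse:
        for line in sse.iter_lines(decode_unicode=True):
            if line:
                print(line)  # progress / complete(score|mismatches) / error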

Security Notes

  • Task solutions are inaccessible without server-side RE_ARC_SEED_PEPPER environment variable
  • HMAC-SHA256 seed derivation prevents dataset regeneration without server access
  • Each production deployment should use a unique pepper value

Authentication

Most endpoints are publicly accessible with no authentication; the only exceptions are the token-gated ARC3 community moderation endpoints (see "ARC3 Community" below).

Rate Limiting

Most endpoints have no explicit rate limiting; the RE-ARC Bench endpoints are the exception (2 generate and 20 evaluate requests per 5 minutes, as noted above). Consider adding rate limiting for production use with external integrations.

WebSocket Integration

The Saturn Visual Solver provides real-time updates via WebSockets:

  • Connection endpoint: ws://localhost:5000
  • Event types: progress, image-update, completion, error
  • Session-based communication using sessionId

Important Notes for External Applications

  • Complete Data Access: Analytics endpoints now return complete datasets instead of arbitrary top-10 limits
  • Higher Limits: Feedback and batch endpoints support much larger result sets
  • Backward Compatibility: All existing query parameters continue to work
  • Performance: Database queries have been optimized to handle larger result sets efficiently
  • Database Dependency: Most endpoints require a PostgreSQL connection; the server falls back to in-memory mode if one is unavailable
  • Token Tracking: API calls with AI models consume tokens and incur costs tracked in the database

Real-time and Advanced Features

Saturn Visual Solver

  • POST /api/saturn/analyze/:taskId - Analyze puzzle with Saturn visual solver
  • POST /api/saturn/analyze-with-reasoning/:taskId - Saturn analysis with reasoning steps
  • GET /api/saturn/status/:sessionId - Get Saturn analysis progress
  • WebSocket: Real-time Saturn solver progress updates

ARC3 Community

Public endpoints (no auth)

  • GET /api/arc3-community/games - List approved and playable games (includes official ARCEngine games)
  • GET /api/arc3-community/games/featured - Featured games for the ARC3 landing page
  • GET /api/arc3-community/games/:gameId - Fetch a single approved/playable game by ID
  • POST /api/arc3-community/session/start - Start a play session for a game (official or approved community)
  • POST /api/arc3-community/session/:sessionGuid/action - Send an action to an active session
  • POST /api/arc3-community/submissions - Submit a single .py file for review (stored as status='pending', non-playable until approved)

Admin-only endpoints (token-gated)

These endpoints require a server-configured token:

  • Server env var: ARC3_COMMUNITY_ADMIN_TOKEN
  • Request header: X-ARC3-Admin-Token: <token> (or Authorization: Bearer <token>)

Endpoints:

  • GET /api/arc3-community/submissions?status=pending|approved|rejected|archived - List DB submissions (pending by default)
  • GET /api/arc3-community/submissions/:submissionId/source - Fetch stored source for a submission (including pending)
  • POST /api/arc3-community/submissions/:submissionId/publish - Approve a submission (sets status='approved', is_playable=true)
  • POST /api/arc3-community/submissions/:submissionId/reject - Reject a submission (sets status='rejected', is_playable=false)