Releases: mann1x/osync
osync v1.3.0
What's New
Code Quality & Reliability
- Fixed HttpClient leaks in QcCommand and BenchCommand — now uses shared static instance instead of creating new clients per request
- Fixed race conditions in BufferedPipeStream with proper locking
- Fixed blocking async calls — converted to proper async/await patterns
- Fixed resource disposal — added IDisposable to ChatSession and StreamHelpers
- Extracted JudgeArgumentParser from Program.cs to reduce god-class complexity
- Fixed LogFileWriter Spectre markup sanitization regex for correct bracket handling
- Fixed ManageCommand TUI showing hardcoded v1.1.6 instead of actual app version
- Single-file publish — binaries are now self-contained single executables
Comprehensive Test Suite (120 unit tests)
First proper test coverage for internal logic:
- BenchScoring — scoring algorithms, formatting, statistics, rankings (45 tests)
- QcScoring — weighted formula, judgment blending, edge cases (14 tests)
- CloudJudgeProviderFactory — all 15 provider aliases, parsing, env vars (20 tests)
- JudgeArgumentParser — local/remote/cloud URL parsing (11 tests)
- LogFileWriter — ANSI escape and Spectre markup sanitization (14 tests)
- BufferedPipeStream — async pipe stream with backpressure (8 tests)
- ThrottledStream — bandwidth limiting and constructor validation (8 tests)
SpecFlow Integration Test Infrastructure
- Base test model changed to `tinyllama` for faster test runs
- Auto-download of test model in BeforeTestRun hook
- AfterScenario cleanup for models created during tests
- New step definitions: ServerSteps, ChatSteps
- Feature files normalized to use `{model}` and `{RemoteServer}` variables
- Stdin support in OsyncRunner for interactive test scenarios
Downloads
- osync-windows-x64.exe — Windows x64 single-file executable
- osync-linux-x64 — Linux x64 single-file executable (run `chmod +x osync-linux-x64` after download)
v1.2.9
- Real-Time Monitor Improvements - Enhanced monitoring dashboard
- Added `-Hi` shortcut for `--history` argument (e.g., `osync monitor -Hi 10`)
- Plain integers in `--history` are now treated as minutes (e.g., `-Hi 10` = 10 minutes)
- Moved date/time from header to status bar (right-justified), reclaiming one line of vertical space
- Status bar now always pinned to bottom of terminal window
- Added osync version with build number in Ollama panel (e.g., `osync v1.2.9 (b20260116-1814)`)
- Reduced screen flicker during refresh by overwriting content in place
- Fixed display artifacts when models load/unload (consistent table structure)
- Fixed graph scaling to properly fill the graph width based on history duration
- Time axis shows seconds when history < 5 minutes (avoids repeated minute labels)
- Fixed model expiration time calculation (was showing wrong time due to timezone handling)
- Ollama process metrics now aggregate ALL related processes (ollama serve, runners, llama-server)
- Shows process count in Ollama panel when multiple processes exist (e.g., "28 procs, 3.9 GB")
- Dynamic graph headers: Shows "Ollama" (blue/orange) when per-process data available, "System" (yellow/steel) when using system-wide fallback
- VRAM fallback: Uses Ollama API SizeVram when nvidia-smi can't report per-process VRAM (Windows/WDDM)
- GPU utilization fallback: Uses total GPU utilization when per-process utilization unavailable
- NvAPIWrapper integration (Windows only): Additional NVIDIA GPU data source as fallback
- D3DKMT per-process GPU monitoring (Windows only): Native Windows kernel API for accurate per-process GPU utilization and VRAM tracking, works with any GPU vendor (NVIDIA, AMD, Intel)
- Improved process name matching for Ollama detection (supports ollama_llama_server, llama-server, runner)
- New Bench Command - Context tracking benchmark with dynamic story-based tests
- Generates stories with embedded facts across multiple categories
- Tests model's ability to track context through conversation
- Question types: New (current category facts), Old (retrieval from previous categories)
- Tool calling support for function execution tests
- Judge evaluation for answer quality assessment
- `--enablethinking` and `--thinklevel` arguments for thinking models (qwen3, deepseek-r1)
- `--no-unloadall` argument to skip unloading all models before testing
- `--overwrite` argument to skip file overwrite prompts
- `--generate-suite` to create custom test suite JSON files with `-T` and `-O` options
- Configurable context length, temperature, seed, and other parameters
- Progress bar during testing with timing statistics (Last/Avg/Max response times)
- Improved pull progress display with download speed and ETA
- Thinking token tracking: separate tracking for model thinking/reasoning tokens with verbose output
- Character consistency: story generator maintains consistent name-to-animal mapping across all chapters
- Optimized message flow: instructions and context combined with first question (avoids model confusion)
- Context length management: auto-detects model max context, configurable overhead (2K normal, 4K thinking)
- Two-phase HuggingFace rate limit retry: 50 quick retries (2s delay), then 50 slow retries (30s/API delay)
- Auto-backup: creates .backup.zip of existing results file before continuing (protects against data loss)
- `--mode=parallel` for parallel judgment - judges answers in background while testing continues
- Parallel mode dual progress bars: test progress bar shows test model metrics (Avg/Max time, p:/e: tok/s), judge progress bar shows judge model metrics separately
- Pre-flight check caching: thinking detection, context settings, and tools validation cached per model (SHA256 verified)
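The two-phase HuggingFace retry schedule listed above can be pictured as a simple delay generator. This is a minimal Python sketch, not osync's actual implementation; the function name `retry_delays` is hypothetical, and reading "30s/API delay" as "30 seconds unless the API supplies its own delay" is an assumption.

```python
from typing import Iterator, Optional


def retry_delays(api_delay: Optional[float] = None,
                 quick: int = 50, slow: int = 50,
                 quick_delay: float = 2.0,
                 slow_delay: float = 30.0) -> Iterator[float]:
    """Yield the wait (in seconds) before each retry attempt."""
    # Phase 1: quick retries with a short fixed delay.
    for _ in range(quick):
        yield quick_delay
    # Phase 2: slow retries; prefer a server-provided delay if one is known
    # (assumed interpretation of "30s/API delay").
    wait = api_delay if api_delay is not None else slow_delay
    for _ in range(slow):
        yield wait
```

A caller would sleep for each yielded value between attempts and give up once the generator is exhausted after 100 retries.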
- New BenchView Command - View and export context benchmark results
- Multiple output formats: table (console), json, md (markdown), html, pdf
- Category breakdown with accuracy percentages
- Question type analysis (New vs Old fact retrieval)
- Tool usage statistics when applicable
- `--overwrite` argument to skip file overwrite prompts
- Increased num_predict Limits - Larger token limits for improved response quality
- Bench pre-flight check: 512 → 2048 (fixes issues with some models)
- Bench test responses: 2048 → 16384 (configurable via test suite `numPredict`)
- Bench judge: added 8192 limit (was missing)
- QC judge: 800 → 8192 (reduces truncated judge responses)
- Bench Test Suite Configuration - New `numPredict` field in bench test suite JSON
- Controls maximum tokens generated per test response
- Default: 16384 tokens
- Can be customized per test suite for different use cases
- Enhanced Process Status (`ps`) - Extended system monitoring
- Shows Ollama process CPU and memory usage when running locally
- GPU monitoring for NVIDIA cards (uses nvidia-smi): utilization, memory, temperature, power
- GPU monitoring for AMD cards (uses rocm-smi): utilization, memory, temperature, power
- Ollama-specific VRAM usage per GPU (shows percentage of total GPU memory)
- Color-coded output: green (0-50%), yellow (50-75%), orange (75-90%), red (90-100%)
- Temperature color coding: green (<60°C), yellow (60-70°C), orange (70-80°C), red (>80°C)
- Automatically detects available GPU monitoring tools
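The color bands above amount to two small threshold lookups. The Python sketch below is illustrative only: the function names are hypothetical, and because the listed ranges share their boundary values (50%, 75%, 70°C, and so on), assigning each boundary to the higher band is an assumption.

```python
def usage_color(percent: float) -> str:
    """Map a utilization percentage to the color bands listed above."""
    if percent < 50:
        return "green"
    if percent < 75:
        return "yellow"
    if percent < 90:
        return "orange"
    return "red"


def temperature_color(celsius: float) -> str:
    """Map a GPU temperature in degrees Celsius to a color band."""
    if celsius < 60:
        return "green"
    if celsius < 70:
        return "yellow"
    if celsius < 80:
        return "orange"
    return "red"
```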
- Load/Unload Command Improvements
- Load command now shows proper "Model not found" error instead of misleading connection error
- Both commands verify status using /api/ps after operation completes
- Better error messages for different HTTP status codes (404, 500, etc.)
- CLI Improvements
- Shortened `osync -h` output to show only global options and available commands
- Use `osync <command> -h` for detailed help on specific commands
- Fixed ANSI color bleed on Linux/macOS - terminal no longer stays green after osync exits
- Explicit ANSI reset sequence (`\x1b[0m`) on exit prevents color leakage to shell prompt
- Bug Fixes
- Fixed nvidia-smi parsing for power and memory values on systems with non-English locales
- Fixed GPU stats display to properly match ollama processes to their respective GPUs by UUID
- Fixed copy command not detecting IP addresses as remote servers (e.g., `osync cp model 192.168.0.100`)
- Fixed copy to remote destination requiring model name - now uses source model name when destination has no model
- Fixed HuggingFace model copy using source model name when destination has no explicit name
- Fixed `qc` and `bench` commands silently exiting when remote test server is unreachable - now shows clear "Could not connect to server" error message
- Fixed spurious ANSI escape characters (`←[0m`) appearing after command output on Windows
- QC Command Updates
- `--enablethinking` and `--thinklevel` arguments for thinking models (qwen3, deepseek-r1)
- `--no-unloadall` argument to skip unloading all models before testing
- `--overwrite` argument to overwrite existing output file without prompting
- Fixed HttpClient timeout modification error ("This instance has already started one or more requests")
- Per-request timeout handling allows dynamic timeout extension during retries
- Improved pull progress display with download speed and ETA for `--ondemand` mode
- Fixed OutOfMemoryException during JSON serialization of large results (uses streaming)
- Fixed model preloading hanging issue - switched from `/api/chat` to lightweight `/api/generate` call
- Smart model loading: detects if model is already loaded, skips unload and just resets keep_alive timer
- Two-phase HuggingFace rate limit retry: 50 quick retries (2s delay), then 50 slow retries (30s/API delay)
- `--fix` argument to recover corrupted/malformed JSON results files (outputs to .fixed.json)
- Multi-strategy recovery: structural analysis finds last valid QuestionResult, then rebuilds proper JSON closures
- Handles corrupted closing sequences (e.g., missing array brackets, extra braces)
- Reports recovery statistics: truncated arrays/objects, fixed closures, removed bytes
- Auto-backup: creates .backup.zip of existing results file before continuing (protects against data loss)
- Atomic file saves: write to temp file then rename, prevents corruption on cancellation
- Force exit (double Ctrl+C) now saves results before exiting
- Fixed Spectre.Console markup errors when loading corrupted JSON files (proper escape of exception messages)
- Fixed logprobs detection after model preload by using separate HTTP connections
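The "write to temp file then rename" approach to atomic saves mentioned above is a general technique. Here is a minimal Python sketch of it, assuming JSON results and a temp file created in the same directory as the target; `save_results_atomically` is a hypothetical name and this is not osync's own code.

```python
import json
import os
import tempfile


def save_results_atomically(path: str, results: dict) -> None:
    """Write results to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(results, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, path)  # atomic swap: old file or new file, never half
    except BaseException:
        # On any failure (including Ctrl+C), drop the temp file and keep the old results.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process is interrupted mid-write, the existing results file is left untouched because only the temp file was being modified.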
- QcView Command Updates
- `--overwrite` argument to overwrite existing output file without prompting
- Fixed OutOfMemoryException when loading large JSON results files (uses streaming deserialization)
- BenchView Command Updates
- Fixed OutOfMemoryException when loading large JSON results files (uses streaming deserialization)
- Multiple results files comparison - Pass comma-separated files to compare different models
- Test suite digest validation ensures all files used identical test suite
- Default output filename computed from input filenames (e.g., `file1-file2.html`)
- Enhanced HTML output - qcview-style dark theme with toggle, collapsible sections
- Description field spans full width in header
- Q&A details always shown in HTML and PDF (no longer requires `--details`)
- Full answers, model thinking, judgment reasons, and tools used
- Subcategory table with two-row header (category spanning subcategories)
- Average speed per category table with response times
- PDF improvements - Header in table format with all metadata including versions
- All tables have proper borders, all three summary tables included
- Bench Command Updates
- **Test s...
v1.2.8
- Cloud Provider Support for Judge Models - Use cloud AI providers for `--judge` and `--judgebest`
- Support for 9 providers: Anthropic Claude, OpenAI, Google Gemini, Azure OpenAI, Mistral AI, Cohere, Together AI, HuggingFace, and Replicate
- Syntax: `@provider[:token]/model` (e.g., `@claude/claude-sonnet-4-20250514`, `@openai/gpt-4o`)
- API keys loaded from environment variables by default, can be specified explicitly
- Connection and model validation before testing starts
- Cloud provider info (name, API version) recorded in results for traceability
- QcView displays cloud provider badges in HTML/PDF output (only for cloud, not Ollama)
- New `--help-cloud` option for detailed provider documentation
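To make the `@provider[:token]/model` syntax concrete, here is a small Python sketch of how such a spec could be split into its parts. It is an illustration under assumptions rather than osync's parser: `CloudJudgeSpec` and `parse_cloud_judge` are hypothetical names, the token is assumed to contain no `/`, and `sk-test` in the usage note is a made-up placeholder.

```python
from typing import NamedTuple, Optional


class CloudJudgeSpec(NamedTuple):
    provider: str
    token: Optional[str]  # None means "fall back to an environment variable"
    model: str


def parse_cloud_judge(spec: str) -> CloudJudgeSpec:
    """Split '@provider[:token]/model' into its three parts."""
    if not spec.startswith("@"):
        raise ValueError("cloud judge specs start with '@'")
    # Split on the first '/', so model names may themselves contain '/'.
    head, sep, model = spec[1:].partition("/")
    if not sep or not head or not model:
        raise ValueError(f"expected @provider[:token]/model, got {spec!r}")
    provider, _, token = head.partition(":")
    return CloudJudgeSpec(provider, token or None, model)
```

For example, `parse_cloud_judge("@openai:sk-test/gpt-4o")` yields provider `openai`, token `sk-test`, and model `gpt-4o`, while `@claude/claude-sonnet-4-20250514` yields no explicit token.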
- PDF Text Rendering Fix - Fixed text corruption in Q&A answers for PDF output
- Resolved character scrambling issue with certain text patterns (e.g., Python format strings)
- Uses line-by-line Text elements to prevent text reordering
- Added Courier monospace font for code content for better readability
- Added text sanitization to handle problematic Unicode characters
- QcView File Access Check - Moved file overwrite confirmation before progress bar
- Prevents concurrent display errors when output file already exists
- Applies to all output formats (JSON, Markdown, HTML, PDF)
v1.2.7
- Separate Best Answer Judge Model (--judgebest) - New command-line argument for best answer determination
- Use a different model for best answer judgment vs similarity scoring (--judge)
- `--judgebest` can be used alone or combined with `--judge` for different models
- Same configuration options as `--judge`: local model name or `http://host:port/model` for remote
- New system prompt focused purely on qualitative best answer determination
- Supports both serial and parallel execution modes
- Works with `--rejudge` to re-run only best answer judgment with new model
- Version Tracking in QC Results - Record osync and Ollama versions in test results
- `OsyncVersion` - Version of osync used for testing
- `OllamaVersion` - Ollama server version for test quantizations
- `OllamaJudgeVersion` - Ollama version for judge server (similarity scoring)
- `OllamaJudgeBestAnswerVersion` - Ollama version for best answer judge server
- Versions captured automatically from Ollama `/api/version` endpoint
- QCView Output Updates - All output formats updated with new information
- Table output shows Best Answer Judge model (when different from Judge) and versions
- JSON output includes all version fields and JudgeModelBestAnswer
- Markdown output includes Best Answer Judge and versions in header
- HTML output shows Best Answer Judge in info grid and versions row
- PDF output includes Best Answer Judge and versions in header tables
- Manage TUI Multi-Select Delete Fix - Fixed batch delete for multiple selected models
- Multi-selection delete now works correctly (previously only deleted single model)
- Added batch confirmation dialog showing count and list of models to be deleted
- Judge Retry Output Improvements - Better visibility into retry attempts during judgment
- Both judge and judgebest operations now show retry warnings with error codes at each attempt
- Displays retry delay countdown before each retry attempt
- Fixed Copy to Remote Server - Resolved HTTP 500 errors when loading copied models
- Fixed `stop` parameter serialization (now correctly sent as array instead of string)
- Fixed numeric/boolean parameter type conversion (top_k, temperature, seed, etc.)
- New `ConvertParameterValue` helper ensures correct JSON types for all Ollama model parameters
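The kind of type conversion described above can be sketched in Python as follows. This is a hypothetical analog of the `ConvertParameterValue` helper, not its actual code; the parameter groupings in `INT_PARAMS` and `FLOAT_PARAMS` are assumptions made for illustration.

```python
# Assumed groupings for illustration; the real helper may classify differently.
INT_PARAMS = {"top_k", "seed", "num_ctx", "num_predict"}
FLOAT_PARAMS = {"temperature", "top_p", "repeat_penalty"}


def convert_parameter_value(name: str, value):
    """Coerce a model parameter into the JSON type the server expects."""
    if name == "stop":
        # Stop sequences must be sent as an array, even when there is only one.
        return value if isinstance(value, list) else [value]
    if name in INT_PARAMS:
        return int(value)
    if name in FLOAT_PARAMS:
        return float(value)
    if isinstance(value, str) and value.lower() in ("true", "false"):
        return value.lower() == "true"
    return value
```

Sending `"40"` for `top_k` or a bare string for `stop` is the sort of mismatch that could plausibly produce the HTTP 500 errors this release addresses.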
- Fixed HuggingFace Model Copy - Correct path resolution for `hf.co/...` models
- HuggingFace models now use correct manifest path (not under registry.ollama.ai)
- Fixed cross-platform path separator handling for model paths with forward slashes
- Load/Unload URL Format Support - Both commands now accept URL format with embedded model name
- Supports `osync load http://host:port/modelname` in addition to `osync load modelname -d host`
- Same URL parsing for unload command
- Fixed --rejudge Model Pulling - Rejudge mode no longer attempts to download test models
- When using `--rejudge` with existing results, only the judge model is needed
- Wildcard expansion now filters to only tags present in the results file
- Skips model verification for all existing results in rejudge mode
- Properly queues partial results for re-judgment without resuming tests
v1.2.6
- QcView Multiple Output Formats - Export results to various formats
- Markdown (.md) - Tables formatted for GitHub and documentation
- HTML (.html) - Interactive report with dark/light theme toggle, collapsible Q&A sections, color-coded scores
- PDF (.pdf) - Professional report using QuestPDF library with:
- Summary tables with color-coded scores for all metrics
- Scores by category table
- Rankings tables (by score, eval speed, perplexity, best answers)
- Full Q&A pages for each quantization with judgment details
- Use `-Fo md`, `-Fo html`, or `-Fo pdf` to select format
- QcView Repository URL - New `--repo` argument to specify model source repository
- URL is displayed in output headers and included in JSON export
- Can be saved during `qc` testing and overridden in `qcview`
- Headless/Background Mode Fix - Fixed console errors when running qc command in background
- Console.WindowWidth and Console.ReadKey now properly handled in headless environments
- Prevents "The handle is invalid" errors when running without a terminal
- New Version Command - Added `osync version` (alias `-v`) to display version info
- Shows osync version number and build timestamp
- `--verbose` flag displays detailed info: binary path, installation status, shell type/version, tab completion status
- Detects bash, zsh, PowerShell (Core/Desktop), and cmd shells
- Smart installation detection: when running a different binary, compares version AND build timestamp with installed version
- Reports if installed version is older/newer (e.g., `installed v1.2.6 (b20260110-1156) is older`)
- Fixed tab completion detection to match actual script markers in profiles
- Model Digest Tracking - QC results now include SHA256 digest for each tested model
- Full digest (`Digest`) and short digest (`ShortDigest`, first 12 chars) stored in results JSON
- Automatically populated from local Ollama or HuggingFace registry
- Backfill: missing digests are automatically retrieved when loading existing results files
- Fixed Model ID Display - IDs now show first 12 chars of manifest SHA256 (matches `ollama ls`)
- `osync ls` and manage TUI compute SHA256 of manifest file content
- ID column width increased from 8 to 12 characters
- Consistent with `ollama ls` output for easy cross-reference
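Deriving a 12-character ID from the manifest content, as described above, comes down to hashing the file bytes and keeping the first 12 hex characters. A minimal Python sketch, with the hypothetical name `model_id_from_manifest` and the assumption that the digest is rendered as lowercase hex:

```python
import hashlib


def model_id_from_manifest(manifest_bytes: bytes, length: int = 12) -> str:
    """Return the first `length` hex characters of the manifest's SHA256."""
    return hashlib.sha256(manifest_bytes).hexdigest()[:length]
```

The caller would read the manifest file in binary mode and pass its raw bytes, so that the hash covers exactly what is stored on disk.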
- Improved osync ps Output - Dynamic console width and better model name display
- Detects console width and adjusts column sizes dynamically
- Model names now truncated from beginning to preserve full tag (e.g., `...0B-A3B-Instruct-GGUF:Q4_K_S`)
- Better visibility of quantization tags for HuggingFace models with long paths
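Truncating from the beginning rather than the end is what keeps the quantization tag visible. The following Python sketch shows the idea; `truncate_left` is a hypothetical name, and the three-dot prefix is inferred from the example above rather than from the osync source.

```python
def truncate_left(name: str, width: int) -> str:
    """Shorten a name to `width` characters by dropping its beginning."""
    if len(name) <= width:
        return name
    if width <= 3:
        # Too narrow for an ellipsis: just keep the tail.
        return name[-width:] if width > 0 else ""
    # Keep the tail (where the tag lives) and mark the cut with "...".
    return "..." + name[-(width - 3):]
```

With a hypothetical model name such as `hf.co/unsloth/Qwen3-30B-A3B-Instruct-GGUF:Q4_K_S` and a 30-character column, the result ends in the full `:Q4_K_S` tag.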
- Load Command Timing - Shows elapsed time and API-reported load duration
- Displays total elapsed time and Ollama's `load_duration` from response
- Example: `✓ Model 'model:tag' loaded successfully (2m 15s) (API: 2m 5s)`
- QCView Table Alignment Fix - Tag and Quant columns now left-aligned instead of centered
- Timeout Handling Improvements - Better handling of HTTP timeouts during testing
- Timeouts are now properly distinguished from user cancellation (no longer shows "Operation cancelled by user")
- Timeouts trigger retry with exponential backoff instead of immediate failure
- After retry attempts exhausted, prompts user: y=cancel, n=double timeout and retry
- Allows recovery from slow model responses without losing progress
- Improved On-Demand Model Cleanup - Fixed critical bug where models were deleted during testing
- Models with incomplete test results are NEVER cleaned up (preserves for resume)
- Fixed cleanup to protect incomplete models regardless of error type (timeout, cancellation, etc.)
- On-demand status tracking is now consistent when resuming interrupted tests
- Fixed HuggingFace Wildcard Tag Detection - Now detects all quantization formats
- Added support for XL variants (Q2_K_XL, Q3_K_XL, Q4_K_XL, Q5_K_XL, Q6_K_XL, Q8_K_XL)
- Added support for TQ ternary quantization (TQ1_0)
- Fixed HuggingFace Model Quant Column - QC results now correctly show quantization type
- Ollama returns `"quantization_level": "unknown"` for HuggingFace models
- Now extracts quantization type from model name/tag when API returns "unknown"
- Enhanced Quantization Display with Tensor Analysis - Quant column now shows dominant tensor quantization
- Analyzes transformer block weight tensors only (excludes embeddings, output, and norms)
- Calculates weighted percentage by tensor size (elements × bits per weight)
- Displays format like `Q4_0 (87%)` or `Q6_K (81% Q8_0)` showing actual tensor distribution
- Uses Ollama API `verbose=true` to fetch tensor metadata
- Fixed: Extract quant type from model name before tensor analysis for correct formatting
- Fixed: Filter to transformer weights only (Q8_0 embeddings/output were skewing results)
- Fixed: Unknown tensor types shown with "?" suffix (e.g., `Q3_K?`) to indicate uncertainty
- Supports all quantization types: Q*_K variants, IQ (importance matrix), and TQ (ternary)
- Fixed QC Model Validation - Relaxed overly strict parameter size comparison
- Parameter size formatting varies between models (e.g., "999.89M" vs "1,000M" for same model)
- Now only warns on family mismatch instead of blocking testing
- Testing continues even with warnings
- Improved Judge API Retry Strategy - More resilient handling of judge server errors
- Increased retry attempts from 5 to 25 for judge API calls
- Delay ramps from 5 seconds to 30 seconds progressively
- Shows warning and skips judgment only after all retries exhausted (instead of failing)
- Better handles overloaded or slow judge servers (HTTP 500 errors)
- Fixed Base Model Re-Pull When Adding Quants - Skip base model if results already exist
- When adding new quants to existing test results without `-b`, no longer tries to pull the base model
- If results file contains any base model results (even partial), the base is skipped entirely
- Improved base model detection: automatically identifies base by common patterns (fp16, f16, bf16, etc.)
- Use `--force` to re-run the base model if needed
- Improved osync ls Wildcard Handling - Better shell expansion handling on Linux/macOS
- Default behavior: `osync ls code` matches models starting with "code" (prefix match, same as `code*`)
- Suffix match: `osync ls *q4_k_m` finds all models ending with "q4_k_m" (useful for finding by quantization)
- Contains match: `osync ls *code*` finds models containing "code" anywhere in the name
- Shell expansion handling: detects when shell expanded unquoted wildcards and shows helpful warning
- Suggests using quotes to prevent expansion: `osync ls 'gemma*'`
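The three matching modes above (prefix by default, suffix, contains) can be modeled with a single glob match once a bare pattern is treated as `pattern*`. This Python sketch uses the standard `fnmatch` module for illustration; `matches_model` and the sample model names are hypothetical, and case handling is left exact because the notes do not specify it for `osync ls`.

```python
from fnmatch import fnmatchcase


def matches_model(pattern: str, name: str) -> bool:
    """Return True if a model name matches an `osync ls`-style pattern."""
    if "*" not in pattern:
        # No wildcard given: treat "code" as the prefix match "code*".
        pattern += "*"
    # fnmatchcase also interprets "?" and "[...]"; assumed harmless for model names.
    return fnmatchcase(name, pattern)
```

On Linux/macOS the pattern should reach the program unexpanded, which is why quoting it (as in `osync ls 'gemma*'`) matters.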
- Wildcard Tag Expansion for osync pull - Pull multiple models with tag patterns
- Supports wildcards in tags: `osync pull gemma3:1b-it-q*` pulls all matching tags
- Works with HuggingFace: `osync pull hf.co/unsloth/gemma-3-1b-it-GGUF:IQ2*`
- Works with remote servers: `osync pull -d http://server:11434 gemma3:1b-it-q*`
- Automatically resolves available tags from Ollama registry or HuggingFace API
- Shows list of matching tags before pulling
- Judge Best Answer Tracking - QC judge now evaluates which response is qualitatively better
- Judge model returns `bestanswer`: A (base better), B (quant better), or AB (tie)
- Verbose output shows best answer for each judgment: `Score: 75% (27/50 54%) Best: AB`
- Handles edge cases: normalizes various formats (ResponseA, Response_A, Tie, identical, etc.)
- Results automatically re-judged if `--judge` is active and `bestanswer` is missing
- QcView Judge Best Column - New column showing quant win statistics
- Format: `67% (B:10 A:5 =:3)` showing quant won 67% of non-tie comparisons
- B = quant better, A = base better, = = tie
- Best percentage excludes ties (only counts decisive wins/losses)
- Color-coded: green (>=60%), yellow (40-60%), red (<40%)
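The Judge Best figure can be reproduced from the three counts: ties are excluded, so the percentage is B / (A + B). The Python sketch below works through the example above; the function names are hypothetical, and the `n/a` output for the all-ties case and the placement of the 40% and 60% boundaries are assumptions.

```python
def best_percentage(b_wins: int, a_wins: int):
    """Share of decisive comparisons won by the quant, or None if all were ties."""
    decisive = a_wins + b_wins
    if decisive == 0:
        return None
    return 100 * b_wins / decisive


def best_color(pct: float) -> str:
    """Color band for the Judge Best column."""
    if pct >= 60:
        return "green"
    if pct >= 40:
        return "yellow"
    return "red"


def format_best(b_wins: int, a_wins: int, ties: int) -> str:
    """Render the column text, e.g. '67% (B:10 A:5 =:3)'."""
    pct = best_percentage(b_wins, a_wins)
    head = "n/a" if pct is None else f"{round(pct)}%"
    return f"{head} (B:{b_wins} A:{a_wins} =:{ties})"
```

With 10 quant wins, 5 base wins, and 3 ties, the decisive total is 15 and 10/15 rounds to 67%, which falls in the green band.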
- Enhanced JSON Output - Additional statistics in JSON export
- Per-question `BestAnswer` field (A/B/AB)
- Per-quantization: `BestCount`, `WorstCount`, `TieCount`, `BestPercentage`, `WorstPercentage`, `TiePercentage`
- Per-category: `CategoryBestStats` with counts and percentages
- QcView Metrics-Only Mode - New `--metricsonly` argument to ignore judgment data
- Shows only metrics-based scores (token similarity, logprobs divergence, perplexity, length consistency)
- Useful for comparing pure model output quality without judge influence
- Works with all output formats (table, json, md, html, pdf)
- Automatic Judge Context Length - Judge model context is now auto-calculated by default
- When `--judge-ctxsize` is 0 (new default), calculates: test_ctx × 2 + 2048
- Ensures judge has enough context for both base and quantized responses plus prompt
- Can still be manually overridden with explicit value
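As a worked example of the formula above, the Python sketch below computes the judge context for a given test context. The function name `judge_context_size` is hypothetical; only the formula and the "0 means auto" rule come from the notes.

```python
def judge_context_size(test_ctx: int, judge_ctxsize: int = 0) -> int:
    """Resolve the judge context length; 0 means auto-calculate."""
    if judge_ctxsize > 0:
        return judge_ctxsize  # explicit override wins
    # Room for the base response, the quantized response, and the prompt.
    return test_ctx * 2 + 2048
```

With a test context of 4096 this gives 4096 × 2 + 2048 = 10240 tokens.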
- PDF Generation Progress Bar - Visual progress indicator when generating PDF reports
- Shows progress through Q&A pages for each quantization
- Useful for large test results files with many questions
- PDF Layout Improvements - Better page break handling in PDF reports
- Ranking tables use ShowEntire() to prevent splitting across pages
- Speed columns simplified to show only percentage (removed tok/s to prevent wrapping)
- Category scores section moves entirely to next page if it won't fit
- Rankings organized into paired rows (Final Score + Eval Speed, Perplexity + Prompt Speed, Best Answers)
- Added Prompt Speed ranking table with vs Base percentage column
- Manage TUI Batch Delete Fix - Fixed multi-selection delete not working
- Delete now properly handles multiple selected models (Ctrl+D with checkmarks)
- Confirmation dialog shows ...
v1.2.4
- New --rejudge Argument for QC Command - Re-run judgment without re-testing
- New `--rejudge` argument to re-run judgment process for existing test results
- Unlike `--force` which re-runs both testing and judgment, `--rejudge` only re-runs judgment
- Useful for re-evaluating results with a different judge model or updated prompts
- Improved Judge Response Parsing - More robust handling of judge model responses
- Case-insensitive JSON property matching for Reason/reason/REASON fields
- Multiple regex patterns with increasing leniency for fallback parsing
- Truncated JSON repair to handle incomplete responses from models
- Increased num_predict from 200 to 800 to reduce truncated responses
- Full raw JSON output displayed when reason parsing fails (for debugging)
- Improved Judge Scoring Accuracy - Fixed score interpretation issues
- Changed JSON schema score type from "integer" to "number" for better model compatibility
- Added explicit prompt instructions for 1-100 integer scoring (not 0.0-1.0 decimal)
- Score normalization to handle both 0.0-1.0 and 1-100 ranges from different models
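Score normalization across the two ranges might look like the Python sketch below. It is one possible rule, not the documented one: `normalize_score` is a hypothetical name, values in [0.0, 1.0] are assumed to be fractions (so a raw `1` is read as 1.0, i.e. 100), and clamping to 1-100 is also an assumption.

```python
def normalize_score(raw: float) -> int:
    """Map a judge score given as 0.0-1.0 or 1-100 onto an integer 1-100."""
    # Assumed heuristic: anything within [0, 1] is a fraction of 100.
    scaled = raw * 100 if 0.0 <= raw <= 1.0 else raw
    # Clamp so out-of-range replies still produce a usable score.
    return max(1, min(100, round(scaled)))
```

The ambiguity at exactly 1 is inherent to accepting both ranges, which is presumably why the prompt was also changed to request 1-100 integers explicitly.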
- Fixed HuggingFace Model Verification in On-Demand Mode - Registry check now supports HuggingFace models
- On-demand mode (`--ondemand`) now properly verifies HuggingFace models (`hf.co/...`)
- Checks HuggingFace API to verify repository and GGUF file existence
- Supports various GGUF filename patterns for tag matching
- Fixed Base Model Name Handling - Full model names now preserved for base tag
- Base model specified as full model name (e.g., `-b qwen3-coder:30b-a3b-fp16` or `-b hf.co/namespace/repo:tag`) is now used as-is
- Previously, only the tag portion was extracted and combined with `-M` model name
- Wildcard Tag Selection for QC Command - Dynamically select quantizations with patterns
- Support for wildcard patterns (`*`) in `-Q` argument (e.g., `Q4*`, `IQ*`, `*`)
- Fetches available tags from HuggingFace API for `hf.co/...` models
- Scrapes available tags from Ollama website for Ollama registry models
- Case-insensitive pattern matching
- New `ModelTagResolver` class for reusable tag resolution across commands
- Improved On-Demand Cleanup - Models pulled on-demand are now properly cleaned up on failure
- On-demand models are removed when testing fails or is cancelled
- Cleanup happens in exception handlers to ensure no orphaned models remain
- Models tracked at class level for reliable cleanup across failures
- Preload failures now also trigger cleanup of on-demand models
- Improved Model Preload - Better error handling and retry logic for model loading
- Added retry logic (3 attempts with exponential backoff) for transient failures
- Shows actual error message when preload fails (HTTP status, error details)
- Uses configurable timeout (`--timeout`) for model loading
- Handles timeout, connection errors, and server errors gracefully
- Fixed Model Name Case Sensitivity - Handle Ollama storing HuggingFace tags with different case
- After pulling, resolves actual model name stored by Ollama (case-insensitive lookup)
- Fixes issue where `Q4_0` is stored as `q4_0` causing preload to fail
v1.2.3
- On-Demand Mode for QC Command - Pull models automatically and remove after testing
- New `--ondemand` argument to enable on-demand model management
- Models missing from the Ollama instance are automatically pulled from the registry
- Models that were already present are NOT removed (only on-demand pulled models)
- After testing and judgment complete, on-demand models are removed to free disk space
- State is persisted in results file for proper cleanup on resume
- Works with both local and remote Ollama servers
- Ideal for testing large models or many quantizations without consuming permanent storage
- Context Length Support for QC Command - Configure context length (num_ctx) for testing and judgment
- Default test context length: 4096, default judge context length: 12288
- Suite-level `contextLength` property in built-in test suites (v1base, v1quick, v1code)
- External JSON test suites support `contextLength` at suite, category, and question levels
- Hierarchical override system: question > category > suite
- Console output displays context length at startup and when overridden
- New `--judge-ctxsize` argument to configure judge model context length
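The question > category > suite precedence can be expressed as "use the most specific value that is set". A minimal Python sketch, assuming unset levels are represented as `None` and falling back to the stated default of 4096; `effective_context_length` is a hypothetical name.

```python
from typing import Optional


def effective_context_length(suite: Optional[int] = None,
                             category: Optional[int] = None,
                             question: Optional[int] = None,
                             default: int = 4096) -> int:
    """Pick the most specific contextLength that is set."""
    # Checked from most to least specific: question, then category, then suite.
    for level in (question, category, suite):
        if level is not None:
            return level
    return default
```

For instance, a suite value of 4096 with a category override of 8192 resolves to 8192 for every question in that category that does not set its own value.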
- Improved Console Output - Context length settings displayed during test execution
- Shows test and judge context lengths at the beginning of testing
- Displays override notifications when context length changes (e.g., "Context length changed to 8192 (from category Code)")
- Fixed Linux Terminal Display Issues - Resolved ANSI color rendering problems
- Fixed white box display issue in interactive REPL mode on Linux terminals
- Downgraded PrettyConsole (3.1.0 → 2.0.0) and Spectre.Console (0.54.0 → 0.49.1) for compatibility
- Improved Installer - Streamlined installation process
- Installer now only copies the main executable (no longer copies all directory files)
- Added mechanism for platform-specific optional dependencies
- Removed unnecessary libuv.dylib dependency for macOS (not needed in .NET 8+)
- Fixed Bash Completion on Linux
- Fixed tab completion for model names containing colons (e.g., `osync ls qwen2:`)
- Fixed file tab completion for qcview command
v1.2.2
- Improved Judgment Prompt Format - Better compatibility with more models
- Instructions now in both system prompt and user message for redundancy
- Clear text markers for RESPONSE A and RESPONSE B instead of JSON encoding
- Question included for context with clear delimiters
- Explicit rules to prevent models from judging quality/correctness instead of similarity
- Verbose Judgment Output - New `--verbose` flag to show judgment details
- Displays question ID, score (color-coded), and first 4 lines of reason
- Works with both serial and parallel judgment modes
- Helps debug and understand judge model scoring
- Fixed Parallel Verbose Output - Verbose output now displays during parallel judgment execution
- Previously showed all results after completion; now shows each result as it completes
- Fixed Serial Verbose Progress - Progress bar now displays alongside verbose output in serial mode
- Improved Cancellation Handling - Ctrl+C now immediately stops judgment without retrying
- Cancellation exceptions are no longer retried 5 times
- Judgment loop checks for cancellation before each question
- Missing Reason Retry - Judge API retries up to 5 times when response contains score but no reason
- Warning displayed if reason still missing after all retries
v1.2.1
- Bug Fix: Base Model Detection - Fixed issue where base model wasn't correctly identified when using full model names
- Base tag is now properly normalized when specified with full path (e.g., `user/model:tag`)
- Existing results files with missing `IsBase` flag are automatically repaired on load
- Judgment now correctly runs for quantizations that need it
- Bug Fix: Output Filename Sanitization - Model names with `/` or `\` are now converted to `-` in default output filename
- Prevents file path issues when model name contains directory separators
- Improved Startup Output - Output file path is now displayed early in the execution
- Shows right after loading test suite, before judge model verification
- Cancellation Improvements - Better handling of Ctrl+C during API calls
- Cancellation token now passed to HTTP requests for immediate cancellation
- Wrapped cancellation exceptions are properly detected
v1.2.0
- Coding Test Suite - New `v1code` test suite for evaluating code generation quality
- 50 challenging coding questions across 5 languages: Python, C++, C#, TypeScript, Rust
- Double token output limit (8192) for longer code responses
- Questions include instruction to limit response size
- Available as `-T v1code` or via external `v1code.json` file
- Configurable Token Output - Test suites now support custom `numPredict` values
- Each test suite can specify its own maximum token output
- External JSON test suites support `numPredict` property (default: 4096)
- Displayed in test suite info when non-default value is used
- Improved Model Existence Check - Pull command now uses Ollama registry API for faster, more reliable model validation
- Uses `registry.ollama.ai/v2/` manifest endpoint instead of HTML scraping
- Properly handles both library models and user models
- Faster response times and more accurate error messages
- True Independent Parallel Judgment - Testing continues to next quantization while judgment runs in background
- Testing no longer waits for judgment to complete before moving to next quantization
- Background judgment tasks tracked and awaited at the end with progress display
- Progress bars show real-time status for both testing and judgment
- Improved Progress Display - Better visibility into parallel operations
- Dual progress bars during testing (Testing + Judging) in parallel mode
- Background judgment status shown after each quantization completes
- Final wait screen shows progress for all pending judgment tasks
- Configurable API Timeout - Added `--timeout` argument for testing and judgment API calls
- Default increased from 300 to 600 seconds for longer code generation
- Configurable via `--timeout <seconds>` argument
- Applies to both test model and judge model API calls
- Resume Support - Gracefully handle interruptions and resume from where you left off
- Press Ctrl+C to save partial results and exit cleanly
- Re-run the same command to resume testing from the last saved question
- Partial quantization results are preserved in the JSON file
- Progress bar shows resumed position when continuing
- Missing judgments are automatically detected and re-run on resume
- UI Improvements
- Unified color scheme: lime for good scores (80%+) and performance above 100%
- Orange color for performance below 100%