Skip to content

Releases: mann1x/osync

osync v1.3.0

13 Apr 01:32

Choose a tag to compare

What's New

Code Quality & Reliability

  • Fixed HttpClient leaks in QcCommand and BenchCommand — now uses shared static instance instead of creating new clients per request
  • Fixed race conditions in BufferedPipeStream with proper locking
  • Fixed blocking async calls — converted to proper async/await patterns
  • Fixed resource disposal — added IDisposable to ChatSession and StreamHelpers
  • Extracted JudgeArgumentParser from Program.cs to reduce god-class complexity
  • Fixed LogFileWriter Spectre markup sanitization regex for correct bracket handling
  • Fixed ManageCommand TUI showing hardcoded v1.1.6 instead of actual app version
  • Single-file publish — binaries are now self-contained single executables

Comprehensive Test Suite (120 unit tests)

First proper test coverage for internal logic:

  • BenchScoring — scoring algorithms, formatting, statistics, rankings (45 tests)
  • QcScoring — weighted formula, judgment blending, edge cases (14 tests)
  • CloudJudgeProviderFactory — all 15 provider aliases, parsing, env vars (20 tests)
  • JudgeArgumentParser — local/remote/cloud URL parsing (11 tests)
  • LogFileWriter — ANSI escape and Spectre markup sanitization (14 tests)
  • BufferedPipeStream — async pipe stream with backpressure (8 tests)
  • ThrottledStream — bandwidth limiting and constructor validation (8 tests)

SpecFlow Integration Test Infrastructure

  • Base test model changed to tinyllama for faster test runs
  • Auto-download of test model in BeforeTestRun hook
  • AfterScenario cleanup for models created during tests
  • New step definitions: ServerSteps, ChatSteps
  • Feature files normalized to use {model} and {RemoteServer} variables
  • Stdin support in OsyncRunner for interactive test scenarios

Downloads

  • osync-windows-x64.exe — Windows x64 single-file executable
  • osync-linux-x64 — Linux x64 single-file executable (run chmod +x osync-linux-x64 after download)

v1.2.9

18 Jan 20:12

Choose a tag to compare

v1.2.9

  • Real-Time Monitor Improvements - Enhanced monitoring dashboard
    • Added -Hi shortcut for --history argument (e.g., osync monitor -Hi 10)
    • Plain integers in --history are now treated as minutes (e.g., -Hi 10 = 10 minutes)
    • Moved date/time from header to status bar (right-justified), reclaiming one line of vertical space
    • Status bar now always pinned to bottom of terminal window
    • Added osync version with build number in Ollama panel (e.g., osync v1.2.9 (b20260116-1814))
    • Reduced screen flicker during refresh by overwriting content in place
    • Fixed display artifacts when models load/unload (consistent table structure)
    • Fixed graph scaling to properly fill the graph width based on history duration
    • Time axis shows seconds when history < 5 minutes (avoids repeated minute labels)
    • Fixed model expiration time calculation (was showing wrong time due to timezone handling)
    • Ollama process metrics now aggregate ALL related processes (ollama serve, runners, llama-server)
    • Shows process count in Ollama panel when multiple processes exist (e.g., "28 procs, 3.9 GB")
    • Dynamic graph headers: Shows "Ollama" (blue/orange) when per-process data available, "System" (yellow/steel) when using system-wide fallback
    • VRAM fallback: Uses Ollama API SizeVram when nvidia-smi can't report per-process VRAM (Windows/WDDM)
    • GPU utilization fallback: Uses total GPU utilization when per-process utilization unavailable
    • NvAPIWrapper integration (Windows only): Additional NVIDIA GPU data source as fallback
    • D3DKMT per-process GPU monitoring (Windows only): Native Windows kernel API for accurate per-process GPU utilization and VRAM tracking, works with any GPU vendor (NVIDIA, AMD, Intel)
    • Improved process name matching for Ollama detection (supports ollama_llama_server, llama-server, runner)
  • New Bench Command - Context tracking benchmark with dynamic story-based tests
    • Generates stories with embedded facts across multiple categories
    • Tests model's ability to track context through conversation
    • Question types: New (current category facts), Old (retrieval from previous categories)
    • Tool calling support for function execution tests
    • Judge evaluation for answer quality assessment
    • --enablethinking and --thinklevel arguments for thinking models (qwen3, deepseek-r1)
    • --no-unloadall argument to skip unloading all models before testing
    • --overwrite argument to skip file overwrite prompts
    • --generate-suite to create custom test suite JSON files with -T and -O options
    • Configurable context length, temperature, seed, and other parameters
    • Progress bar during testing with timing statistics (Last/Avg/Max response times)
    • Improved pull progress display with download speed and ETA
    • Thinking token tracking: separate tracking for model thinking/reasoning tokens with verbose output
    • Character consistency: story generator maintains consistent name-to-animal mapping across all chapters
    • Optimized message flow: instructions and context combined with first question (avoids model confusion)
    • Context length management: auto-detects model max context, configurable overhead (2K normal, 4K thinking)
    • Two-phase HuggingFace rate limit retry: 50 quick retries (2s delay), then 50 slow retries (30s/API delay)
    • Auto-backup: creates .backup.zip of existing results file before continuing (protects against data loss)
    • --mode=parallel for parallel judgment - judges answers in background while testing continues
    • Parallel mode dual progress bars: test progress bar shows test model metrics (Avg/Max time, p:/e: tok/s), judge progress bar shows judge model metrics separately
    • Pre-flight check caching: thinking detection, context settings, and tools validation cached per model (SHA256 verified)
  • New BenchView Command - View and export context benchmark results
    • Multiple output formats: table (console), json, md (markdown), html, pdf
    • Category breakdown with accuracy percentages
    • Question type analysis (New vs Old fact retrieval)
    • Tool usage statistics when applicable
    • --overwrite argument to skip file overwrite prompts
  • Increased num_predict Limits - Larger token limits for improved response quality
    • Bench pre-flight check: 512 → 2048 (fixes issues with some models)
    • Bench test responses: 2048 → 16384 (configurable via test suite numPredict)
    • Bench judge: added 8192 limit (was missing)
    • QC judge: 800 → 8192 (reduces truncated judge responses)
  • Bench Test Suite Configuration - New numPredict field in bench test suite JSON
    • Controls maximum tokens generated per test response
    • Default: 16384 tokens
    • Can be customized per test suite for different use cases
  • Enhanced Process Status (ps) - Extended system monitoring
    • Shows Ollama process CPU and memory usage when running locally
    • GPU monitoring for NVIDIA cards (uses nvidia-smi): utilization, memory, temperature, power
    • GPU monitoring for AMD cards (uses rocm-smi): utilization, memory, temperature, power
    • Ollama-specific VRAM usage per GPU (shows percentage of total GPU memory)
    • Color-coded output: green (0-50%), yellow (50-75%), orange (75-90%), red (90-100%)
    • Temperature color coding: green (<60°C), yellow (60-70°C), orange (70-80°C), red (>80°C)
    • Automatically detects available GPU monitoring tools
  • Load/Unload Command Improvements
    • Load command now shows proper "Model not found" error instead of misleading connection error
    • Both commands verify status using /api/ps after operation completes
    • Better error messages for different HTTP status codes (404, 500, etc.)
  • CLI Improvements
    • Shortened osync -h output to show only global options and available commands
    • Use osync <command> -h for detailed help on specific commands
    • Fixed ANSI color bleed on Linux/macOS - terminal no longer stays green after osync exits
    • Explicit ANSI reset sequence (\x1b[0m) on exit prevents color leakage to shell prompt
  • Bug Fixes
    • Fixed nvidia-smi parsing for power and memory values on systems with non-English locales
    • Fixed GPU stats display to properly match ollama processes to their respective GPUs by UUID
    • Fixed copy command not detecting IP addresses as remote servers (e.g., osync cp model 192.168.0.100)
    • Fixed copy to remote destination requiring model name - now uses source model name when destination has no model
    • Fixed HuggingFace model copy using source model name when destination has no explicit name
    • Fixed qc and bench commands silently exiting when remote test server is unreachable - now shows clear "Could not connect to server" error message
    • Fixed spurious ANSI escape characters (←[0m) appearing after command output on Windows
  • QC Command Updates
    • --enablethinking and --thinklevel arguments for thinking models (qwen3, deepseek-r1)
    • --no-unloadall argument to skip unloading all models before testing
    • --overwrite argument to overwrite existing output file without prompting
    • Fixed HttpClient timeout modification error ("This instance has already started one or more requests")
    • Per-request timeout handling allows dynamic timeout extension during retries
    • Improved pull progress display with download speed and ETA for --ondemand mode
    • Fixed OutOfMemoryException during JSON serialization of large results (uses streaming)
    • Fixed model preloading hanging issue - switched from /api/chat to lightweight /api/generate call
    • Smart model loading: detects if model is already loaded, skips unload and just resets keep_alive timer
    • Two-phase HuggingFace rate limit retry: 50 quick retries (2s delay), then 50 slow retries (30s/API delay)
    • --fix argument to recover corrupted/malformed JSON results files (outputs to .fixed.json)
      • Multi-strategy recovery: structural analysis finds last valid QuestionResult, then rebuilds proper JSON closures
      • Handles corrupted closing sequences (e.g., missing array brackets, extra braces)
      • Reports recovery statistics: truncated arrays/objects, fixed closures, removed bytes
    • Auto-backup: creates .backup.zip of existing results file before continuing (protects against data loss)
    • Atomic file saves: write to temp file then rename, prevents corruption on cancellation
    • Force exit (double Ctrl+C) now saves results before exiting
    • Fixed Spectre.Console markup errors when loading corrupted JSON files (proper escape of exception messages)
    • Fixed logprobs detection after model preload by using separate HTTP connections
  • QcView Command Updates
    • --overwrite argument to overwrite existing output file without prompting
    • Fixed OutOfMemoryException when loading large JSON results files (uses streaming deserialization)
  • BenchView Command Updates
    • Fixed OutOfMemoryException when loading large JSON results files (uses streaming deserialization)
    • Multiple results files comparison - Pass comma-separated files to compare different models
    • Test suite digest validation ensures all files used identical test suite
    • Default output filename computed from input filenames (e.g., file1-file2.html)
    • Enhanced HTML output - qcview-style dark theme with toggle, collapsible sections
    • Description field spans full width in header
    • Q&A details always shown in HTML and PDF (no longer requires --details)
    • Full answers, model thinking, judgment reasons, and tools used
    • Subcategory table with two-row header (category spanning subcategories)
    • Average speed per category table with response times
    • PDF improvements - Header in table format with all metadata including versions
    • All tables have proper borders, all three summary tables included
  • Bench Command Updates
    • **Test s...
Read more

v1.2.8

13 Jan 21:06

Choose a tag to compare

v1.2.8

  • Cloud Provider Support for Judge Models - Use cloud AI providers for --judge and --judgebest
    • Support for 9 providers: Anthropic Claude, OpenAI, Google Gemini, Azure OpenAI, Mistral AI, Cohere, Together AI, HuggingFace, and Replicate
    • Syntax: @provider[:token]/model (e.g., @claude/claude-sonnet-4-20250514, @openai/gpt-4o)
    • API keys loaded from environment variables by default, can be specified explicitly
    • Connection and model validation before testing starts
    • Cloud provider info (name, API version) recorded in results for traceability
    • QcView displays cloud provider badges in HTML/PDF output (only for cloud, not Ollama)
    • New --help-cloud option for detailed provider documentation
  • PDF Text Rendering Fix - Fixed text corruption in Q&A answers for PDF output
    • Resolved character scrambling issue with certain text patterns (e.g., Python format strings)
    • Uses line-by-line Text elements to prevent text reordering
    • Added Courier monospace font for code content for better readability
    • Added text sanitization to handle problematic Unicode characters
  • QcView File Access Check - Moved file overwrite confirmation before progress bar
    • Prevents concurrent display errors when output file already exists
    • Applies to all output formats (JSON, Markdown, HTML, PDF)

v1.2.7

12 Jan 23:00

Choose a tag to compare

v1.2.7

  • Separate Best Answer Judge Model (--judgebest) - New command-line argument for best answer determination
    • Use a different model for best answer judgment vs similarity scoring (--judge)
    • --judgebest can be used alone or combined with --judge for different models
    • Same configuration options as --judge: local model name or http://host:port/model for remote
    • New system prompt focused purely on qualitative best answer determination
    • Supports both serial and parallel execution modes
    • Works with --rejudge to re-run only best answer judgment with new model
  • Version Tracking in QC Results - Record osync and Ollama versions in test results
    • OsyncVersion - Version of osync used for testing
    • OllamaVersion - Ollama server version for test quantizations
    • OllamaJudgeVersion - Ollama version for judge server (similarity scoring)
    • OllamaJudgeBestAnswerVersion - Ollama version for best answer judge server
    • Versions captured automatically from Ollama /api/version endpoint
  • QCView Output Updates - All output formats updated with new information
    • Table output shows Best Answer Judge model (when different from Judge) and versions
    • JSON output includes all version fields and JudgeModelBestAnswer
    • Markdown output includes Best Answer Judge and versions in header
    • HTML output shows Best Answer Judge in info grid and versions row
    • PDF output includes Best Answer Judge and versions in header tables
  • Manage TUI Multi-Select Delete Fix - Fixed batch delete for multiple selected models
    • Multi-selection delete now works correctly (previously only deleted single model)
    • Added batch confirmation dialog showing count and list of models to be deleted
  • Judge Retry Output Improvements - Better visibility into retry attempts during judgment
    • Both judge and judgebest operations now show retry warnings with error codes at each attempt
    • Displays retry delay countdown before each retry attempt
  • Fixed Copy to Remote Server - Resolved HTTP 500 errors when loading copied models
    • Fixed stop parameter serialization (now correctly sent as array instead of string)
    • Fixed numeric/boolean parameter type conversion (top_k, temperature, seed, etc.)
    • New ConvertParameterValue helper ensures correct JSON types for all Ollama model parameters
  • Fixed HuggingFace Model Copy - Correct path resolution for hf.co/... models
    • HuggingFace models now use correct manifest path (not under registry.ollama.ai)
    • Fixed cross-platform path separator handling for model paths with forward slashes
  • Load/Unload URL Format Support - Both commands now accept URL format with embedded model name
    • Supports osync load http://host:port/modelname in addition to osync load modelname -d host
    • Same URL parsing for unload command
  • Fixed --rejudge Model Pulling - Rejudge mode no longer attempts to download test models
    • When using --rejudge with existing results, only the judge model is needed
    • Wildcard expansion now filters to only tags present in the results file
    • Skips model verification for all existing results in rejudge mode
    • Properly queues partial results for re-judgment without resuming tests

v1.2.6

12 Jan 10:08

Choose a tag to compare

v1.2.6

  • QcView Multiple Output Formats - Export results to various formats
    • Markdown (.md) - Tables formatted for GitHub and documentation
    • HTML (.html) - Interactive report with dark/light theme toggle, collapsible Q&A sections, color-coded scores
    • PDF (.pdf) - Professional report using QuestPDF library with:
      • Summary tables with color-coded scores for all metrics
      • Scores by category table
      • Rankings tables (by score, eval speed, perplexity, best answers)
      • Full Q&A pages for each quantization with judgment details
    • Use -Fo md, -Fo html, or -Fo pdf to select format
  • QcView Repository URL - New --repo argument to specify model source repository
    • URL is displayed in output headers and included in JSON export
    • Can be saved during qc testing and overridden in qcview
  • Headless/Background Mode Fix - Fixed console errors when running qc command in background
    • Console.WindowWidth and Console.ReadKey now properly handled in headless environments
    • Prevents "The handle is invalid" errors when running without a terminal
  • New Version Command - Added osync version (alias -v) to display version info
    • Shows osync version number and build timestamp
    • --verbose flag displays detailed info: binary path, installation status, shell type/version, tab completion status
    • Detects bash, zsh, PowerShell (Core/Desktop), and cmd shells
    • Smart installation detection: when running a different binary, compares version AND build timestamp with installed version
    • Reports if installed version is older/newer (e.g., installed v1.2.6 (b20260110-1156) is older)
    • Fixed tab completion detection to match actual script markers in profiles
  • Model Digest Tracking - QC results now include SHA256 digest for each tested model
    • Full digest (Digest) and short digest (ShortDigest, first 12 chars) stored in results JSON
    • Automatically populated from local Ollama or HuggingFace registry
    • Backfill: missing digests are automatically retrieved when loading existing results files
  • Fixed Model ID Display - IDs now show first 12 chars of manifest SHA256 (matches ollama ls)
    • osync ls and manage TUI compute SHA256 of manifest file content
    • ID column width increased from 8 to 12 characters
    • Consistent with ollama ls output for easy cross-reference
  • Improved osync ps Output - Dynamic console width and better model name display
    • Detects console width and adjusts column sizes dynamically
    • Model names now truncated from beginning to preserve full tag (e.g., ...0B-A3B-Instruct-GGUF:Q4_K_S)
    • Better visibility of quantization tags for HuggingFace models with long paths
  • Load Command Timing - Shows elapsed time and API-reported load duration
    • Displays total elapsed time and Ollama's load_duration from response
    • Example: ✓ Model 'model:tag' loaded successfully (2m 15s) (API: 2m 5s)
  • QCView Table Alignment Fix - Tag and Quant columns now left-aligned instead of centered
  • Timeout Handling Improvements - Better handling of HTTP timeouts during testing
    • Timeouts are now properly distinguished from user cancellation (no longer shows "Operation cancelled by user")
    • Timeouts trigger retry with exponential backoff instead of immediate failure
    • After retry attempts exhausted, prompts user: y=cancel, n=double timeout and retry
    • Allows recovery from slow model responses without losing progress
  • Improved On-Demand Model Cleanup - Fixed critical bug where models were deleted during testing
    • Models with incomplete test results are NEVER cleaned up (preserves for resume)
    • Fixed cleanup to protect incomplete models regardless of error type (timeout, cancellation, etc.)
    • On-demand status tracking is now consistent when resuming interrupted tests
  • Fixed HuggingFace Wildcard Tag Detection - Now detects all quantization formats
    • Added support for XL variants (Q2_K_XL, Q3_K_XL, Q4_K_XL, Q5_K_XL, Q6_K_XL, Q8_K_XL)
    • Added support for TQ ternary quantization (TQ1_0)
  • Fixed HuggingFace Model Quant Column - QC results now correctly show quantization type
    • Ollama returns "quantization_level": "unknown" for HuggingFace models
    • Now extracts quantization type from model name/tag when API returns "unknown"
  • Enhanced Quantization Display with Tensor Analysis - Quant column now shows dominant tensor quantization
    • Analyzes transformer block weight tensors only (excludes embeddings, output, and norms)
    • Calculates weighted percentage by tensor size (elements × bits per weight)
    • Displays format like Q4_0 (87%) or Q6_K (81% Q8_0) showing actual tensor distribution
    • Uses Ollama API verbose=true to fetch tensor metadata
    • Fixed: Extract quant type from model name before tensor analysis for correct formatting
    • Fixed: Filter to transformer weights only (Q8_0 embeddings/output were skewing results)
    • Fixed: Unknown tensor types shown with "?" suffix (e.g., Q3_K?) to indicate uncertainty
    • Supports all quantization types: Q*_K variants, IQ (importance matrix), and TQ (ternary)
  • Fixed QC Model Validation - Relaxed overly strict parameter size comparison
    • Parameter size formatting varies between models (e.g., "999.89M" vs "1,000M" for same model)
    • Now only warns on family mismatch instead of blocking testing
    • Testing continues even with warnings
  • Improved Judge API Retry Strategy - More resilient handling of judge server errors
    • Increased retry attempts from 5 to 25 for judge API calls
    • Delay ramps from 5 seconds to 30 seconds progressively
    • Shows warning and skips judgment only after all retries exhausted (instead of failing)
    • Better handles overloaded or slow judge servers (HTTP 500 errors)
  • Fixed Base Model Re-Pull When Adding Quants - Skip base model if results already exist
    • When adding new quants to existing test results without -b, no longer tries to pull the base model
    • If results file contains any base model results (even partial), the base is skipped entirely
    • Improved base model detection: automatically identifies base by common patterns (fp16, f16, bf16, etc.)
    • Use --force to re-run the base model if needed
  • Improved osync ls Wildcard Handling - Better shell expansion handling on Linux/macOS
    • Default behavior: osync ls code matches models starting with "code" (prefix match, same as code*)
    • Suffix match: osync ls *q4_k_m finds all models ending with "q4_k_m" (useful for finding by quantization)
    • Contains match: osync ls *code* finds models containing "code" anywhere in the name
    • Shell expansion handling: detects when shell expanded unquoted wildcards and shows helpful warning
    • Suggests using quotes to prevent expansion: osync ls 'gemma*'
  • Wildcard Tag Expansion for osync pull - Pull multiple models with tag patterns
    • Supports wildcards in tags: osync pull gemma3:1b-it-q* pulls all matching tags
    • Works with HuggingFace: osync pull hf.co/unsloth/gemma-3-1b-it-GGUF:IQ2*
    • Works with remote servers: osync pull -d http://server:11434 gemma3:1b-it-q*
    • Automatically resolves available tags from Ollama registry or HuggingFace API
    • Shows list of matching tags before pulling
  • Judge Best Answer Tracking - QC judge now evaluates which response is qualitatively better
    • Judge model returns bestanswer: A (base better), B (quant better), or AB (tie)
    • Verbose output shows best answer for each judgment: Score: 75% (27/50 54%) Best: AB
    • Handles edge cases: normalizes various formats (ResponseA, Response_A, Tie, identical, etc.)
    • Results automatically re-judged if --judge is active and bestanswer is missing
  • QcView Judge Best Column - New column showing quant win statistics
    • Format: 67% (B:10 A:5 =:3) showing quant won 67% of non-tie comparisons
    • B = quant better, A = base better, = = tie
    • Best percentage excludes ties (only counts decisive wins/losses)
    • Color-coded: green (>=60%), yellow (40-60%), red (<40%)
  • Enhanced JSON Output - Additional statistics in JSON export
    • Per-question BestAnswer field (A/B/AB)
    • Per-quantization: BestCount, WorstCount, TieCount, BestPercentage, WorstPercentage, TiePercentage
    • Per-category: CategoryBestStats with counts and percentages
  • QcView Metrics-Only Mode - New --metricsonly argument to ignore judgment data
    • Shows only metrics-based scores (token similarity, logprobs divergence, perplexity, length consistency)
    • Useful for comparing pure model output quality without judge influence
    • Works with all output formats (table, json, md, html, pdf)
  • Automatic Judge Context Length - Judge model context is now auto-calculated by default
    • When --judge-ctxsize is 0 (new default), calculates: test_ctx × 2 + 2048
    • Ensures judge has enough context for both base and quantized responses plus prompt
    • Can still be manually overridden with explicit value
  • PDF Generation Progress Bar - Visual progress indicator when generating PDF reports
    • Shows progress through Q&A pages for each quantization
    • Useful for large test results files with many questions
  • PDF Layout Improvements - Better page break handling in PDF reports
    • Ranking tables use ShowEntire() to prevent splitting across pages
    • Speed columns simplified to show only percentage (removed tok/s to prevent wrapping)
    • Category scores section moves entirely to next page if it won't fit
    • Rankings organized into paired rows (Final Score + Eval Speed, Perplexity + Prompt Speed, Best Answers)
    • Added Prompt Speed ranking table with vs Base percentage column
  • Manage TUI Batch Delete Fix - Fixed multi-selection delete not working
    • Delete now properly handles multiple selected models (Ctrl+D with checkmarks)
    • Confirmation dialog shows ...
Read more

v1.2.4

09 Jan 17:14

Choose a tag to compare

v1.2.4

  • New --rejudge Argument for QC Command - Re-run judgment without re-testing
    • New --rejudge argument to re-run judgment process for existing test results
    • Unlike --force which re-runs both testing and judgment, --rejudge only re-runs judgment
    • Useful for re-evaluating results with a different judge model or updated prompts
  • Improved Judge Response Parsing - More robust handling of judge model responses
    • Case-insensitive JSON property matching for Reason/reason/REASON fields
    • Multiple regex patterns with increasing leniency for fallback parsing
    • Truncated JSON repair to handle incomplete responses from models
    • Increased num_predict from 200 to 800 to reduce truncated responses
    • Full raw JSON output displayed when reason parsing fails (for debugging)
  • Improved Judge Scoring Accuracy - Fixed score interpretation issues
    • Changed JSON schema score type from "integer" to "number" for better model compatibility
    • Added explicit prompt instructions for 1-100 integer scoring (not 0.0-1.0 decimal)
    • Score normalization to handle both 0.0-1.0 and 1-100 ranges from different models
  • Fixed HuggingFace Model Verification in On-Demand Mode - Registry check now supports HuggingFace models
    • On-demand mode (--ondemand) now properly verifies HuggingFace models (hf.co/...)
    • Checks HuggingFace API to verify repository and GGUF file existence
    • Supports various GGUF filename patterns for tag matching
  • Fixed Base Model Name Handling - Full model names now preserved for base tag
    • Base model specified as full model name (e.g., -b qwen3-coder:30b-a3b-fp16 or -b hf.co/namespace/repo:tag) is now used as-is
    • Previously, only the tag portion was extracted and combined with -M model name
  • Wildcard Tag Selection for QC Command - Dynamically select quantizations with patterns
    • Support for wildcard patterns (*) in -Q argument (e.g., Q4*, IQ*, *)
    • Fetches available tags from HuggingFace API for hf.co/... models
    • Scrapes available tags from Ollama website for Ollama registry models
    • Case-insensitive pattern matching
    • New ModelTagResolver class for reusable tag resolution across commands
  • Improved On-Demand Cleanup - Models pulled on-demand are now properly cleaned up on failure
    • On-demand models are removed when testing fails or is cancelled
    • Cleanup happens in exception handlers to ensure no orphaned models remain
    • Models tracked at class level for reliable cleanup across failures
    • Preload failures now also trigger cleanup of on-demand models
  • Improved Model Preload - Better error handling and retry logic for model loading
    • Added retry logic (3 attempts with exponential backoff) for transient failures
    • Shows actual error message when preload fails (HTTP status, error details)
    • Uses configurable timeout (--timeout) for model loading
    • Handles timeout, connection errors, and server errors gracefully
  • Fixed Model Name Case Sensitivity - Handle Ollama storing HuggingFace tags with different case
    • After pulling, resolves actual model name stored by Ollama (case-insensitive lookup)
    • Fixes issue where Q4_0 is stored as q4_0 causing preload to fail

v1.2.3

09 Jan 07:28

Choose a tag to compare

v1.2.3

  • On-Demand Mode for QC Command - Pull models automatically and remove after testing
    • New --ondemand argument to enable on-demand model management
    • Models missing from the Ollama instance are automatically pulled from the registry
    • Models that were already present are NOT removed (only on-demand pulled models)
    • After testing and judgment complete, on-demand models are removed to free disk space
    • State is persisted in results file for proper cleanup on resume
    • Works with both local and remote Ollama servers
    • Ideal for testing large models or many quantizations without consuming permanent storage
  • Context Length Support for QC Command - Configure context length (num_ctx) for testing and judgment
    • Default test context length: 4096, default judge context length: 12288
    • Suite-level contextLength property in built-in test suites (v1base, v1quick, v1code)
    • External JSON test suites support contextLength at suite, category, and question levels
    • Hierarchical override system: question > category > suite
    • Console output displays context length at startup and when overridden
    • New --judge-ctxsize argument to configure judge model context length
  • Improved Console Output - Context length settings displayed during test execution
    • Shows test and judge context lengths at the beginning of testing
    • Displays override notifications when context length changes (e.g., "Context length changed to 8192 (from category Code)")
  • Fixed Linux Terminal Display Issues - Resolved ANSI color rendering problems
    • Fixed white box display issue in interactive REPL mode on Linux terminals
    • Downgraded PrettyConsole (3.1.0 → 2.0.0) and Spectre.Console (0.54.0 → 0.49.1) for compatibility
  • Improved Installer - Streamlined installation process
    • Installer now only copies the main executable (no longer copies all directory files)
    • Added mechanism for platform-specific optional dependencies
    • Removed unnecessary libuv.dylib dependency for macOS (not needed in .NET 8+)
  • Fixed Bash Completion on Linux
    • Fixed tab completion for model names containing colons (e.g., osync ls qwen2:)
    • Fixed file tab completion for qcview command

v1.2.2

03 Jan 16:58

Choose a tag to compare

v1.2.2

  • Improved Judgment Prompt Format - Better compatibility with more models
    • Instructions now in both system prompt and user message for redundancy
    • Clear text markers for RESPONSE A and RESPONSE B instead of JSON encoding
    • Question included for context with clear delimiters
    • Explicit rules to prevent models from judging quality/correctness instead of similarity
  • Verbose Judgment Output - New --verbose flag to show judgment details
    • Displays question ID, score (color-coded), and first 4 lines of reason
    • Works with both serial and parallel judgment modes
    • Helps debug and understand judge model scoring
  • Fixed Parallel Verbose Output - Verbose output now displays during parallel judgment execution
    • Previously showed all results after completion; now shows each result as it completes
  • Fixed Serial Verbose Progress - Progress bar now displays alongside verbose output in serial mode
  • Improved Cancellation Handling - Ctrl+C now immediately stops judgment without retrying
    • Cancellation exceptions are no longer retried 5 times
    • Judgment loop checks for cancellation before each question
  • Missing Reason Retry - Judge API retries up to 5 times when response contains score but no reason
    • Warning displayed if reason still missing after all retries

v1.2.1

03 Jan 01:13

Choose a tag to compare

v1.2.1

  • Bug Fix: Base Model Detection - Fixed issue where base model wasn't correctly identified when using full model names
    • Base tag is now properly normalized when specified with full path (e.g., user/model:tag)
    • Existing results files with missing IsBase flag are automatically repaired on load
    • Judgment now correctly runs for quantizations that need it
  • Bug Fix: Output Filename Sanitization - Model names with / or \ are now converted to - in default output filename
    • Prevents file path issues when model name contains directory separators
  • Improved Startup Output - Output file path is now displayed early in the execution
    • Shows right after loading test suite, before judge model verification
  • Cancellation Improvements - Better handling of Ctrl+C during API calls
    • Cancellation token now passed to HTTP requests for immediate cancellation
    • Wrapped cancellation exceptions are properly detected

v1.2.0

01 Jan 12:13

Choose a tag to compare

v1.2.0

  • Coding Test Suite - New v1code test suite for evaluating code generation quality
    • 50 challenging coding questions across 5 languages: Python, C++, C#, TypeScript, Rust
    • Double token output limit (8192) for longer code responses
    • Questions include instruction to limit response size
    • Available as -T v1code or via external v1code.json file
  • Configurable Token Output - Test suites now support custom numPredict values
    • Each test suite can specify its own maximum token output
    • External JSON test suites support numPredict property (default: 4096)
    • Displayed in test suite info when non-default value is used
  • Improved Model Existence Check - Pull command now uses Ollama registry API for faster, more reliable model validation
    • Uses registry.ollama.ai/v2/ manifest endpoint instead of HTML scraping
    • Properly handles both library models and user models
    • Faster response times and more accurate error messages
  • True Independent Parallel Judgment - Testing continues to next quantization while judgment runs in background
    • Testing no longer waits for judgment to complete before moving to next quantization
    • Background judgment tasks tracked and awaited at the end with progress display
    • Progress bars show real-time status for both testing and judgment
  • Improved Progress Display - Better visibility into parallel operations
    • Dual progress bars during testing (Testing + Judging) in parallel mode
    • Background judgment status shown after each quantization completes
    • Final wait screen shows progress for all pending judgment tasks
  • Configurable API Timeout - Added --timeout argument for testing and judgment API calls
    • Default increased from 300 to 600 seconds for longer code generation
    • Configurable via --timeout <seconds> argument
    • Applies to both test model and judge model API calls
  • Resume Support - Gracefully handle interruptions and resume from where you left off
    • Press Ctrl+C to save partial results and exit cleanly
    • Re-run the same command to resume testing from the last saved question
    • Partial quantization results are preserved in the JSON file
    • Progress bar shows resumed position when continuing
    • Missing judgments are automatically detected and re-run on resume
  • UI Improvements
    • Unified color scheme: lime for good scores (80%+) and performance above 100%
    • Orange color for performance below 100%