Merged
1 change: 1 addition & 0 deletions ai_testbench/.gitignore
@@ -56,3 +56,4 @@ native_libs/

# MLCEngine virtualenv
.mlc_venv/
benchmark_results/
7 changes: 7 additions & 0 deletions ai_testbench/README.md
@@ -84,12 +84,19 @@ flutter build linux --release
./build/linux/x64/release/bundle/ai_testbench --headless-correction --model-dir models/ --output correction.json
```

**Timer/alarm benchmark** (uses extended 5-intent prompt with timer/alarm cases):
```bash
./build/linux/x64/release/bundle/ai_testbench --headless-timer --model Qwen3.5-2B-Q4_K_M.gguf
```
Comment on lines +87 to +90

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify that documented flags are actually parsed/handled in ai_testbench sources.
# Expected: Matches for --headless-timer and --prompt-timer in argument parsing/dispatch logic.

rg -n --type=dart -C3 -- '--headless-timer|--prompt-timer' ai_testbench/lib
rg -n --type=dart -C3 'headless.*timer|prompt.*timer' ai_testbench/lib

Repository: ZSWatch/ZSWatch-App

Length of output: 2005


🏁 Script executed:

# Find the main entry point and argument parsing
fd -type f -name "main.dart" ai_testbench/lib
fd -type f -name "*main*.dart" ai_testbench/lib | head -20

Repository: ZSWatch/ZSWatch-App

Length of output: 404


🏁 Script executed:

# Search for argument parsing patterns and the main function
rg -n "args|arguments|parse.*arg" ai_testbench/lib --type=dart | head -30

Repository: ZSWatch/ZSWatch-App

Length of output: 2737


🏁 Script executed:

# Look for where headless-timer and prompt-timer might be handled in the code
rg -n "headless\|prompt" ai_testbench/lib --type=dart -B2 -A2 | head -50

Repository: ZSWatch/ZSWatch-App

Length of output: 45


🏁 Script executed:

cat -n ai_testbench/lib/main.dart

Repository: ZSWatch/ZSWatch-App

Length of output: 3303


🏁 Script executed:

# Search for --headless-timer and --prompt-timer specifically
rg -n '\--headless-timer|\--prompt-timer' ai_testbench/lib

Repository: ZSWatch/ZSWatch-App

Length of output: 185


🏁 Script executed:

# Also check the README to confirm what flags are documented
sed -n '87,99p' ai_testbench/README.md

Repository: ZSWatch/ZSWatch-App

Length of output: 638


Remove or implement --headless-timer and --prompt-timer flags.

These flags are documented in the README (lines 87–90, 98–99) but are not handled in the CLI argument parser. The main.dart entry point has no branches checking for these flags, and no code wiring them to the benchmark logic. Either implement the missing argument handling or remove the documentation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ai_testbench/README.md` around lines 87 - 90, The README documents flags
`--headless-timer` and `--prompt-timer` that are not parsed or used; update the
CLI parsing in main.dart to accept these flags (e.g., add options for
"--headless-timer" and "--prompt-timer" to the argument parser) and wire them
into the benchmark invocation (pass the parsed booleans into the benchmark
runner or TimerAlarmBenchmark class and branch the execution where benchmarks
are selected), or alternatively remove the flags from the README; specifically,
modify the argument parsing logic in main.dart (and any
BenchmarkRunner/TimerAlarmBenchmark constructors or run methods) to expose and
forward the two flags so the timer/alarm benchmark executes when set.
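
If the flags are implemented rather than removed, the dispatch could look roughly like the following. This is a minimal sketch, not the PR's code: `BenchMode`, `selectMode`, and `useTimerPrompt` are hypothetical names, and the semantics assumed for `--headless-timer` and `--prompt-timer` are taken only from the README table descriptions. Keeping the mapping in a pure function lets it be tested without starting Flutter.

```dart
/// Hypothetical modes; the names are illustrative and not from the PR.
enum BenchMode { router, time, correction, timer, structured, gui }

/// Maps raw CLI args to a mode, checking the more specific flags first.
/// `List.contains` compares whole strings, so `--headless-time` does not
/// match `--headless-timer`. The timer branch is the assumed missing piece.
BenchMode selectMode(List<String> args) {
  if (args.contains('--headless-router')) return BenchMode.router;
  if (args.contains('--headless-time')) return BenchMode.time;
  if (args.contains('--headless-correction')) return BenchMode.correction;
  if (args.contains('--headless-timer')) return BenchMode.timer;
  if (args.contains('--headless')) return BenchMode.structured;
  return BenchMode.gui;
}

/// Whether the 5-intent prompt should be used (assumed semantics):
/// implied by `--headless-timer`, or requested with `--prompt-timer`.
bool useTimerPrompt(List<String> args) =>
    args.contains('--headless-timer') || args.contains('--prompt-timer');
```

With this shape, `main` would switch on `selectMode(args)` and pass `useTimerPrompt(args)` to whichever runner handles the structured and timer benchmarks.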


### CLI Options

| Flag | Description |
|------|-------------|
| `--headless` | Run structured extraction benchmark (all models) |
| `--headless-time` | Run time extraction benchmark |
| `--headless-timer` | Run timer/alarm benchmark (5-intent prompt, timer/alarm cases only) |
| `--prompt-timer` | Use the 5-intent prompt (with `--headless` to run all cases for regression testing) |
Comment on lines +87 to +99

Copilot AI Mar 29, 2026

The README documents a --headless-timer mode, but ai_testbench/lib/main.dart doesn't handle this flag (only --headless, --headless-time, --headless-correction, --headless-router). Either implement --headless-timer (and wire it to ModelBenchmarkService.timerAlarmCases + the appropriate prompt) or remove/update the documentation.

Suggested change

Current:
**Timer/alarm benchmark** (uses extended 5-intent prompt with timer/alarm cases):
```bash
./build/linux/x64/release/bundle/ai_testbench --headless-timer --model Qwen3.5-2B-Q4_K_M.gguf
```
### CLI Options
| Flag | Description |
|------|-------------|
| `--headless` | Run structured extraction benchmark (all models) |
| `--headless-time` | Run time extraction benchmark |
| `--headless-timer` | Run timer/alarm benchmark (5-intent prompt, timer/alarm cases only) |
| `--prompt-timer` | Use the 5-intent prompt (with `--headless` to run all cases for regression testing) |

Proposed:
### CLI Options
| Flag | Description |
|------|-------------|
| `--headless` | Run structured extraction benchmark (all models) |
| `--headless-time` | Run time extraction benchmark |

Comment on lines +87 to +99

Copilot AI Mar 29, 2026

The CLI options list --headless-timer and --prompt-timer, but there’s no implementation for these flags in the CLI entrypoints (e.g., ai_testbench/lib/main.dart / benchmark_main.dart). Please align the docs with the implemented flags, or add support for these options in the CLI parser/runner.

Suggested change

Current:
**Timer/alarm benchmark** (uses extended 5-intent prompt with timer/alarm cases):
```bash
./build/linux/x64/release/bundle/ai_testbench --headless-timer --model Qwen3.5-2B-Q4_K_M.gguf
```
### CLI Options
| Flag | Description |
|------|-------------|
| `--headless` | Run structured extraction benchmark (all models) |
| `--headless-time` | Run time extraction benchmark |
| `--headless-timer` | Run timer/alarm benchmark (5-intent prompt, timer/alarm cases only) |
| `--prompt-timer` | Use the 5-intent prompt (with `--headless` to run all cases for regression testing) |

Proposed:
### CLI Options
| Flag | Description |
|------|-------------|
| `--headless` | Run structured extraction benchmark (all models) |
| `--headless-time` | Run time extraction benchmark |

| `--headless-correction` | Run correction benchmark |
| `--model <name>` | Filter to a specific model filename |
| `--model-dir <path>` | Path to directory containing `.gguf` files (default: `models/`) |
2 changes: 2 additions & 0 deletions ai_testbench/lib/benchmark_main.dart
@@ -269,6 +269,8 @@ Map<String, dynamic> _serializeModelResult(BenchmarkModelResult result) {
'titleLanguageDetail': caseResult.titleLanguageDetail,
'timeResolutionCorrect': caseResult.timeResolutionCorrect,
'timeResolutionDetail': caseResult.timeResolutionDetail,
'durationMatch': caseResult.durationMatch,
'durationDetail': caseResult.durationDetail,
'intent': caseResult.intent,
'title': caseResult.title,
'datetimeOriginal': caseResult.datetimeOriginal,
7 changes: 7 additions & 0 deletions ai_testbench/lib/main.dart
@@ -4,11 +4,18 @@ import 'package:flutter/material.dart';

import 'benchmark_main.dart' as model_bench;
import 'correction_main.dart';
import 'router_benchmark_main.dart';
import 'screens/testbench_screen.dart';
import 'screens/time_extraction_screen.dart';
import 'time_extraction_main.dart';

void main(List<String> args) async {
  // Headless mode: run router pre-classifier benchmark
  if (args.contains('--headless-router')) {
    await runRouterBenchmark(args);
    exit(exitCode);
  }

  // Headless mode: run time extraction tests from CLI
  if (args.contains('--headless-time')) {
    await runHeadlessTimeExtraction(args);
285 changes: 285 additions & 0 deletions ai_testbench/lib/router_benchmark_main.dart
@@ -0,0 +1,285 @@
import 'dart:convert';
import 'dart:io';

import 'package:chrono_ai_flow/chrono_ai_flow.dart';
import 'package:flutter/material.dart';

import 'services/llm_service.dart';

/// Benchmarks the two-stage router approach:
/// 1. Router prompt classifies input as timer_alarm / voice_memo / mixed
/// 2. Routes to dedicated timer/alarm prompt OR original 3-intent prompt
///
/// Measures: router accuracy, router latency, total pipeline latency.
Future<void> runRouterBenchmark(List<String> args) async {
  WidgetsFlutterBinding.ensureInitialized();

  final modelFilter = _readArg(args, '--model');
  final modelDir = _readArg(args, '--model-dir') ?? 'models';

  final modelPaths = Directory(modelDir)
      .listSync()
      .whereType<File>()
      .map((f) => f.path)
      .where((p) => p.toLowerCase().endsWith('.gguf'))
      .where(
          (p) => modelFilter == null || p.toLowerCase().contains(modelFilter.toLowerCase()))
      .toList()
    ..sort();

  if (modelPaths.isEmpty) {
    stdout.writeln('[RouterBench] No .gguf models found');
    exitCode = 1;
    return;
Comment on lines +20 to +33

⚠️ Potential issue | 🟡 Minor

Check that modelDir exists before calling listSync().

If the directory is missing, this throws a FileSystemException and the friendly "No .gguf models found" path never runs.

🛠️ Proposed fix
-  final modelPaths = Directory(modelDir)
+  final modelDirectory = Directory(modelDir);
+  if (!modelDirectory.existsSync()) {
+    stdout.writeln('[RouterBench] Model directory not found: $modelDir');
+    exitCode = 1;
+    return;
+  }
+
+  final modelPaths = modelDirectory
       .listSync()
       .whereType<File>()
       .map((f) => f.path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ai_testbench/lib/router_benchmark_main.dart` around lines 20 - 33, Check that
Directory(modelDir) exists and is a directory before calling listSync() to avoid
a FileSystemException; update the modelPaths construction to first call
Directory(modelDir).existsSync() (or wrap Directory(modelDir).listSync() in a
try/catch) and when the directory is missing or an exception occurs, write a
friendly message via stdout.writeln('[RouterBench] No .gguf models found' or a
specific error) and set exitCode = 1 then return; adjust code around modelPaths,
Directory(modelDir).listSync(), and the early-return branch so the friendly
message runs when the directory is absent or listSync() fails.
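
An alternative to the inline fix is to move model discovery into a small helper that treats a missing directory the same as an empty one. The sketch below assumes that behavior is acceptable to callers; `findModelPaths` is a hypothetical name and is not part of the PR. It uses only `dart:io`.

```dart
import 'dart:io';

/// Returns the sorted `.gguf` paths under [modelDir], or an empty list when
/// the directory does not exist, so the caller can print one friendly message
/// instead of crashing with a FileSystemException.
/// Hypothetical helper; filter semantics mirror the code under review.
List<String> findModelPaths(String modelDir, {String? modelFilter}) {
  final dir = Directory(modelDir);
  if (!dir.existsSync()) return const [];
  final filter = modelFilter?.toLowerCase();
  return dir
      .listSync()
      .whereType<File>()
      .map((f) => f.path)
      .where((p) => p.toLowerCase().endsWith('.gguf'))
      .where((p) => filter == null || p.toLowerCase().contains(filter))
      .toList()
    ..sort();
}
```

The caller can then keep its existing `modelPaths.isEmpty` branch unchanged, at the cost of not distinguishing "directory missing" from "no models in directory" in the message.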

  }

  final modelPath = modelPaths.first;
  stdout.writeln('[RouterBench] Using model: ${modelPath.split('/').last}');

  final llm = LlmService()
    ..setModel(modelPath)
    ..nCtx = 4096
    ..nThreads = Platform.numberOfProcessors
    ..maxTokens = 32 // Router output is tiny: {"route":"timer_alarm"}
    ..temperature = 0.1
    ..topP = 1.0
    ..presencePenalty = 2.0
    ..enableThinking = false;

  final parser = const ChronoLlmParser();
  final referenceTime = DateTime(2026, 3, 11, 10, 15);

  // Test cases with expected route
  final cases = <_RouterTestCase>[
    // Timer cases → timer_alarm
    _RouterTestCase('Set a timer for 8 minutes', 'timer_alarm'),
    _RouterTestCase('Timer for 5 minutes for pasta', 'timer_alarm'),
    _RouterTestCase('Set a 30 second timer', 'timer_alarm'),
    _RouterTestCase('Set a timer for one and a half hours', 'timer_alarm'),
    _RouterTestCase('Sätt en timer på 10 minuter', 'timer_alarm'),
    _RouterTestCase('Timer på 5 minuter för äggen', 'timer_alarm'),
    _RouterTestCase('Stell einen Timer auf 15 Minuten', 'timer_alarm'),
    _RouterTestCase('In 10 minutes', 'timer_alarm'),
    _RouterTestCase('30 minutes', 'timer_alarm'),

    // Alarm cases → timer_alarm
    _RouterTestCase('Set an alarm for 7:30 AM', 'timer_alarm'),
    _RouterTestCase('Alarm at 6 AM, wake up', 'timer_alarm'),
    _RouterTestCase('Wake me up tomorrow at 5:30', 'timer_alarm'),
    _RouterTestCase('Ställ ett alarm klockan 7', 'timer_alarm'),
    _RouterTestCase('Wecker auf 7 Uhr stellen', 'timer_alarm'),
    _RouterTestCase('7 AM', 'timer_alarm'),

    // Reminder cases → voice_memo (NOT timer/alarm despite having time)
    _RouterTestCase('Remind me in 30 minutes to check the oven', 'voice_memo'),
    _RouterTestCase('Remind me at 3 PM to call the dentist', 'voice_memo'),
    _RouterTestCase(
        'Påminn mig om 10 minuter att stänga av ugnen', 'voice_memo'),
    _RouterTestCase(
        'Påminn mig klockan 15 att ringa tandläkaren', 'voice_memo'),

    // Event/note cases → voice_memo
    _RouterTestCase('Meeting with John next Tuesday at 2 pm', 'voice_memo'),
    _RouterTestCase('Buy milk and bread', 'voice_memo'),
    _RouterTestCase('Köp mjölk och bröd på vägen hem', 'voice_memo'),
    _RouterTestCase(
        'Tandläkare den 15 mars klockan halv 10', 'voice_memo'),
    _RouterTestCase(
        'Fika med Anna imorgon klockan 10 och sen lämna in paketet',
        'voice_memo'),

    // Mixed cases
    _RouterTestCase(
        'Set a timer for 10 minutes and an alarm for 7 AM tomorrow', 'timer_alarm'),
    _RouterTestCase('Set an alarm for 6:30 and buy milk', 'mixed'),
    _RouterTestCase(
        'Sätt en timer på 5 minuter och påminn mig klockan 3 att ringa tandläkaren',
        'mixed'),
  ];

  stdout.writeln('[RouterBench] Running ${cases.length} router cases...\n');

  // ── Stage 1: Router prompt benchmark ──
  int routerCorrect = 0;
  final routerTimes = <Duration>[];

  for (final tc in cases) {
    final prompt = ChronoPromptTemplate.render(
      ChronoPromptTemplate.routerTemplate,
      transcript: tc.transcript,
      now: referenceTime,
    );

    final result = await llm.generate(prompt).timeout(
      const Duration(seconds: 30),
      onTimeout: () => const InferenceResult(
          output: '{"route":"timeout"}', elapsed: Duration(seconds: 30)),
    );

    routerTimes.add(result.elapsed);
    final route = _parseRoute(parser.sanitizeModelOutput(result.output));
    final correct = route == tc.expectedRoute;
    if (correct) routerCorrect++;

    final status = correct ? 'OK' : 'FAIL';
    stdout.writeln(
        ' $status route=$route (expected=${tc.expectedRoute}) '
        '${result.elapsed.inMilliseconds}ms "${tc.transcript}"');
  }

  final avgRouterMs =
      routerTimes.fold<int>(0, (s, d) => s + d.inMilliseconds) ~/
          routerTimes.length;
  stdout.writeln(
      '\n[RouterBench] Router accuracy: $routerCorrect/${cases.length}');
  stdout.writeln('[RouterBench] Router avg latency: ${avgRouterMs}ms');
Comment on lines +130 to +135

⚠️ Potential issue | 🟠 Major

The latency summaries are comparing different case sets.

avgRouterMs is computed from all router cases, avgTotalMs from the timer_alarm subset, and avgSingleMs from the first five voice_memo cases. That makes both Avg extract only and Overhead apples-to-oranges metrics.

Also applies to: 199-207, 212-242

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ai_testbench/lib/router_benchmark_main.dart` around lines 130 - 135, The
latency summaries mix different case sets (avgRouterMs uses routerTimes,
avgTotalMs uses the timer_alarm subset, and avgSingleMs uses the first five
voice_memo cases), producing invalid comparisons; update the calculations so
comparisons use the same consistent dataset or compute and label per-subset
metrics explicitly: e.g., create clear collections like routerTimes (all router
cases), timerAlarmTimes (cases where case.type == 'timer_alarm'), voiceMemoTimes
(cases where case.type == 'voice_memo') and compute avgRouterMs,
avgTimerAlarmMs, avgVoiceMemoMs from those collections, then use matching counts
(routerCorrect vs routerCases.length or subsetCorrect vs subset.length) when
printing summaries (references: avgRouterMs, avgTotalMs, avgSingleMs,
routerTimes, routerCorrect, cases, timer_alarm, voice_memo).
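
One way to get comparable numbers is to record each latency together with the case's expected route and average per route, so every printed figure names the subset it came from. The following is a minimal sketch under that assumption; `averageMs` and `averageMsByRoute` are hypothetical helpers, and passing routes and latencies as parallel lists is a choice made for this example only.

```dart
/// Integer mean of [times] in milliseconds; null for an empty list so a
/// missing subset cannot be mistaken for a 0 ms result.
int? averageMs(List<Duration> times) {
  if (times.isEmpty) return null;
  final total = times.fold<int>(0, (sum, d) => sum + d.inMilliseconds);
  return total ~/ times.length;
}

/// Groups per-case latencies by expected route and averages each group.
/// [routes] and [times] are parallel lists (an assumption of this sketch).
Map<String, int> averageMsByRoute(List<String> routes, List<Duration> times) {
  if (routes.length != times.length) {
    throw ArgumentError('routes and times must have the same length');
  }
  final grouped = <String, List<Duration>>{};
  for (var i = 0; i < routes.length; i++) {
    grouped.putIfAbsent(routes[i], () => []).add(times[i]);
  }
  // Every group has at least one entry, so averageMs never returns null here.
  return grouped.map((route, ds) => MapEntry(route, averageMs(ds)!));
}
```

Router-only and router-plus-extract averages computed this way for the same `timer_alarm` subset can be subtracted meaningfully; a single-pass baseline would still need to be measured on those same cases before an overhead figure is reported.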


  // ── Stage 2: Full two-stage pipeline on timer/alarm cases ──
  stdout.writeln('\n[RouterBench] Running full two-stage pipeline on timer/alarm cases...\n');

  // Reconfigure for extraction (more tokens needed)
  llm.maxTokens = 384;

  final timerCases = cases
      .where((c) => c.expectedRoute == 'timer_alarm')
      .toList();

  int extractionCorrect = 0;
  final totalTimes = <Duration>[];

  for (final tc in timerCases) {
    final sw = Stopwatch()..start();

    // Stage 1: Router
    llm.maxTokens = 32;
    final routerPrompt = ChronoPromptTemplate.render(
      ChronoPromptTemplate.routerTemplate,
      transcript: tc.transcript,
      now: referenceTime,
    );
    final routerResult = await llm.generate(routerPrompt).timeout(
      const Duration(seconds: 30),
      onTimeout: () => const InferenceResult(
          output: '{"route":"timeout"}', elapsed: Duration(seconds: 30)),
    );
    final routerMs = routerResult.elapsed.inMilliseconds;

    // Stage 2: Timer/alarm extraction
    llm.maxTokens = 384;
    final extractPrompt = ChronoPromptTemplate.render(
      ChronoPromptTemplate.timerAlarmTemplate,
      transcript: tc.transcript,
      now: referenceTime,
    );
    final extractResult = await llm.generate(extractPrompt).timeout(
      const Duration(seconds: 60),
      onTimeout: () => const InferenceResult(
          output: '[]', elapsed: Duration(seconds: 60)),
    );
    final extractMs = extractResult.elapsed.inMilliseconds;

    sw.stop();
    totalTimes.add(sw.elapsed);

    final parseResult = parser.parse(extractResult.output);
    final first = parseResult.extractions.isNotEmpty
        ? parseResult.extractions.first
        : null;
    final intentOk = first != null &&
        (first.intent == 'timer' || first.intent == 'alarm');
    if (intentOk) extractionCorrect++;

    stdout.writeln(
        ' ${intentOk ? "OK" : "FAIL"} intent=${first?.intent ?? "null"} '
        'dur=${first?.durationSeconds ?? "null"} '
        'router=${routerMs}ms extract=${extractMs}ms '
        'total=${sw.elapsedMilliseconds}ms "${tc.transcript}"');
Comment on lines +150 to +196

⚠️ Potential issue | 🟠 Major

Stage 2 isn't measuring the actual two-stage pipeline.

routerResult is timed but never used to decide whether extraction should run, so router misclassifications do not count as pipeline failures. Success is also only first.intent == timer || alarm, which lets the multi-item timer+alarm case pass even if the second extraction is missing or the parsed duration/time is wrong.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ai_testbench/lib/router_benchmark_main.dart` around lines 150 - 196, The
router result is measured but not used to gate extraction, so misrouted cases
are counted as successes; also success only checks first.intent which misses
multi-item or wrong-duration failures. Change the loop to parse
routerResult.output (e.g., inspect routerResult.output JSON or
ChronoPromptTemplate router response) and only run extraction when the router
route indicates timer/alarm; if router indicates a different route mark the case
failed immediately and count router time. After running parser.parse on
extractResult.output, make intentOk require that the router-declared route
matches the parsed extractions and that at least one extraction has intent ==
'timer' or 'alarm' with a non-null duration/time (handling multi-item by
scanning all parseResult.extractions). Ensure timing (routerMs and extractMs)
reflect whether extraction actually ran and that pipeline failures include
router misclassification or missing/invalid extractions.
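
The stricter success check described above could be isolated in a pure function, roughly as sketched here. `Extraction` is a minimal stand-in for the type returned by `parser.parse` in `chrono_ai_flow`; only `intent` and `durationSeconds` appear in the PR code, while the `datetime` field, `pipelineOk`, and `expectedCount` are assumptions made for illustration.

```dart
/// Minimal stand-in for a parsed extraction. The real type comes from
/// chrono_ai_flow and its fields may differ (`datetime` is an assumption).
class Extraction {
  final String intent;
  final int? durationSeconds;
  final String? datetime;
  const Extraction(this.intent, {this.durationSeconds, this.datetime});
}

/// True only if the router chose `timer_alarm` AND at least [expectedCount]
/// extractions are timers with a duration or alarms with a time.
bool pipelineOk(String route, List<Extraction> extractions,
    {int expectedCount = 1}) {
  // A misroute counts as a pipeline failure, whatever extraction returns.
  if (route != 'timer_alarm') return false;
  final valid = extractions.where((e) =>
      (e.intent == 'timer' && e.durationSeconds != null) ||
      (e.intent == 'alarm' && e.datetime != null));
  return valid.length >= expectedCount;
}
```

In the benchmark loop, the route would come from `_parseRoute(...)` on the router output, extraction would be skipped when the route is not `timer_alarm`, and the combined timer-plus-alarm case would pass `expectedCount: 2`.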

  }

  final avgTotalMs =
      totalTimes.fold<int>(0, (s, d) => s + d.inMilliseconds) ~/
          totalTimes.length;
  stdout.writeln(
      '\n[RouterBench] Extraction accuracy: $extractionCorrect/${timerCases.length}');
  stdout.writeln('[RouterBench] Avg total (router+extract): ${avgTotalMs}ms');
  stdout.writeln('[RouterBench] Avg router only: ${avgRouterMs}ms');
  stdout.writeln(
      '[RouterBench] Avg extract only: ${avgTotalMs - avgRouterMs}ms');

  // ── Stage 3: Compare with single-pass original prompt on voice_memo cases ──
  stdout.writeln('\n[RouterBench] Comparing single-pass original prompt latency...\n');

  final voiceMemoCases = cases
      .where((c) => c.expectedRoute == 'voice_memo')
      .take(5)
      .toList();

  llm.maxTokens = 384;
  final singlePassTimes = <Duration>[];

  for (final tc in voiceMemoCases) {
    final prompt = ChronoPromptTemplate.render(
      ChronoPromptTemplate.compactTemplate,
      transcript: tc.transcript,
      now: referenceTime,
    );
    final result = await llm.generate(prompt).timeout(
      const Duration(seconds: 90),
      onTimeout: () => const InferenceResult(
          output: 'timeout', elapsed: Duration(seconds: 90)),
    );
    singlePassTimes.add(result.elapsed);
    stdout.writeln(
        ' single-pass: ${result.elapsed.inMilliseconds}ms "${tc.transcript}"');
  }

  final avgSingleMs =
      singlePassTimes.fold<int>(0, (s, d) => s + d.inMilliseconds) ~/
          singlePassTimes.length;
  stdout.writeln('\n[RouterBench] Avg single-pass (original prompt): ${avgSingleMs}ms');
  stdout.writeln('[RouterBench] Avg two-stage (router+extract): ${avgTotalMs}ms');
  stdout.writeln(
      '[RouterBench] Overhead: ${avgTotalMs - avgSingleMs}ms (${((avgTotalMs - avgSingleMs) / avgSingleMs * 100).toStringAsFixed(0)}%)');

  llm.dispose();
  stdout.writeln('\n[RouterBench] Done.');
}

String _parseRoute(String raw) {
  // Try to parse JSON
  try {
    final start = raw.indexOf('{');
    if (start == -1) return _guessRoute(raw);
    final end = raw.lastIndexOf('}');
    if (end == -1) return _guessRoute(raw);
    final json = jsonDecode(raw.substring(start, end + 1)) as Map<String, dynamic>;
    final route = (json['route'] as String?)?.trim().toLowerCase() ?? '';
    if (route == 'timer_alarm' || route == 'voice_memo' || route == 'mixed') {
      return route;
    }
    return _guessRoute(raw);
  } catch (_) {
    return _guessRoute(raw);
  }
}

String _guessRoute(String raw) {
  final lower = raw.toLowerCase();
  if (lower.contains('timer_alarm')) return 'timer_alarm';
  if (lower.contains('voice_memo')) return 'voice_memo';
  if (lower.contains('mixed')) return 'mixed';
  return 'unknown';
}

String? _readArg(List<String> args, String name) {
  for (var i = 0; i < args.length - 1; i++) {
    if (args[i] == name) return args[i + 1];
  }
  return null;
}

class _RouterTestCase {
  final String transcript;
  final String expectedRoute;
  const _RouterTestCase(this.transcript, this.expectedRoute);
}