Merged
115 changes: 70 additions & 45 deletions ai_testbench/bin/test_time_extraction.dart
@@ -251,7 +251,9 @@ void main(List<String> args) async {

for (var i = 0; i < testCases.length; i++) {
final tc = testCases[i];
print('─── Test ${i + 1}/${testCases.length}: ${tc.name} ───────────────────────');
print(
'─── Test ${i + 1}/${testCases.length}: ${tc.name} ───────────────────────',
);
print(' Input: "${tc.transcript}"');

// Build prompt
@@ -283,13 +285,15 @@ void main(List<String> args) async {
}
} catch (e) {
stderr.writeln(' ERROR during generation: $e');
results.add(TestResult(
testCase: tc,
llmDuration: genSw.elapsed,
tokenCount: tokenCount,
status: TestStatus.fail,
failures: ['LLM generation error: $e'],
));
results.add(
TestResult(
testCase: tc,
llmDuration: genSw.elapsed,
tokenCount: tokenCount,
status: TestStatus.fail,
failures: ['LLM generation error: $e'],
),
);
print('');
continue;
}
@@ -300,10 +304,14 @@ void main(List<String> args) async {
// Strip end-of-turn tokens
raw = raw.replaceAll('<|im_end|>', '').trim();
// Strip thinking blocks (Qwen3 models may use these)
raw = raw.replaceAll(RegExp(r'<think>.*?</think>', dotAll: true), '').trim();
raw = raw
.replaceAll(RegExp(r'<think>.*?</think>', dotAll: true), '')
.trim();
Comment on lines +307 to +309

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Strip unclosed <think> blocks too.

Current cleanup only removes closed think blocks. Unclosed blocks can still poison JSON extraction.

Suggested fix
     raw = raw
         .replaceAll(RegExp(r'<think>.*?</think>', dotAll: true), '')
+        .replaceAll(RegExp(r'<think>.*', dotAll: true), '')
         .trim();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ai_testbench/bin/test_time_extraction.dart` around lines 307 - 309, The
cleanup currently only removes closed think blocks for the variable raw in the
replaceAll call; update the regex used when cleaning raw (the replaceAll on raw)
to also match unclosed <think> blocks by allowing the match to end at either the
closing tag or the end of the string (e.g., change the pattern to accept
(</think>|$) with dotAll enabled), so any '<think>' without a corresponding
'</think>' is removed as well.
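The single-pattern variant described in the prompt above could look roughly like the sketch below. The helper name `stripThinkBlocks` and the sample strings are hypothetical; the PR itself inlines the `replaceAll` call on `raw`.

```dart
/// Hypothetical helper (not part of the PR): removes closed and unclosed
/// <think> blocks from raw model output.
String stripThinkBlocks(String raw) {
  // dotAll lets `.` match newlines. Without multiLine, `$` matches only at
  // the end of the input, so an unclosed block is consumed up to the end.
  final thinkBlock = RegExp(r'<think>.*?(?:</think>|$)', dotAll: true);
  return raw.replaceAll(thinkBlock, '').trim();
}

void main() {
  // Closed block: only the block itself is removed.
  print(stripThinkBlocks('<think>plan</think>{"intent":"reminder"}'));
  // Unclosed block: everything from <think> to the end is removed.
  print(stripThinkBlocks('{"intent":"reminder"}<think>never closed'));
}
```

One consequence worth noting: if an unclosed `<think>` appears before the JSON, the JSON is removed along with it, so the later parse step fails on empty input instead of on leftover thinking text.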


final secs = genSw.elapsed.inMilliseconds / 1000;
print(' LLM time: ${secs.toStringAsFixed(2)}s (~${(tokenCount / secs).toStringAsFixed(1)} tok/s)');
print(
' LLM time: ${secs.toStringAsFixed(2)}s (~${(tokenCount / secs).toStringAsFixed(1)} tok/s)',
);
Comment on lines +312 to +314

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard tok/s calculation against zero elapsed time.

Line 313 divides by zero when generation finishes within the same millisecond. Dart's double division does not throw, so the log shows Infinity (or NaN when tokenCount is also 0).

Suggested fix
     final secs = genSw.elapsed.inMilliseconds / 1000;
+    final tokPerSec = secs > 0 ? (tokenCount / secs) : 0.0;
     print(
-      '  LLM time: ${secs.toStringAsFixed(2)}s (~${(tokenCount / secs).toStringAsFixed(1)} tok/s)',
+      '  LLM time: ${secs.toStringAsFixed(2)}s (~${tokPerSec.toStringAsFixed(1)} tok/s)',
     );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ai_testbench/bin/test_time_extraction.dart` around lines 312 - 314, The log
printing of token/s in the print call that uses secs and tokenCount can divide
by zero when secs is zero; update the expression used in that print (the string
interpolation in the print call) to guard the division by checking secs > 0 and
computing a safe rate (e.g., compute rate = secs > 0 ? tokenCount / secs : 0)
and use that rate in the (~${...} tok/s) portion so you never produce
Infinity/NaN while keeping the rest of the formatting (toStringAsFixed) intact.
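As a rough illustration of the guard, the rate calculation can be pulled into a small function. The name `tokensPerSecond` and the sample values are hypothetical; the PR computes the rate inline inside the `print` call.

```dart
/// Hypothetical helper (not part of the PR): tokens per second, falling
/// back to 0.0 when no elapsed time was measured.
double tokensPerSecond(int tokenCount, Duration elapsed) {
  // Same unit conversion as the reviewed code: whole milliseconds to seconds.
  final secs = elapsed.inMilliseconds / 1000;
  // Dividing by 0.0 would yield Infinity (or NaN for 0 tokens) without throwing.
  return secs > 0 ? tokenCount / secs : 0.0;
}

void main() {
  final normal = tokensPerSecond(50, const Duration(seconds: 2));
  final instant = tokensPerSecond(50, Duration.zero);
  print('~${normal.toStringAsFixed(1)} tok/s'); // ~25.0 tok/s
  print('~${instant.toStringAsFixed(1)} tok/s'); // ~0.0 tok/s
}
```

Reporting 0.0 for an unmeasurable interval is one choice among several; printing a placeholder such as `n/a` would avoid implying that the model produced no tokens.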


if (verbose) {
print(' Raw output:');
@@ -341,29 +349,31 @@ void main(List<String> args) async {
if (!verbose) {
print(' Raw output: $raw');
}
results.add(TestResult(
testCase: tc,
llmDuration: genSw.elapsed,
tokenCount: tokenCount,
status: TestStatus.fail,
failures: ['JSON parse failed: $e'],
));
results.add(
TestResult(
testCase: tc,
llmDuration: genSw.elapsed,
tokenCount: tokenCount,
status: TestStatus.fail,
failures: ['JSON parse failed: $e'],
),
);
print('');
continue;
}

// Resolve time expression with chrono
ResolvedTime? resolvedTime;
// Try English translation first, fall back to original expression
final timeExpr = llmResult.datetimeExpressionEnglish ??
final timeExpr =
llmResult.datetimeExpressionEnglish ??
llmResult.datetimeExpressionOriginal;
if (timeExpr != null) {
resolvedTime = resolver.resolve(
timeExpr,
referenceDate: referenceTime,
);
resolvedTime = resolver.resolve(timeExpr, referenceDate: referenceTime);
if (resolvedTime != null) {
print(' Chrono parse: ${resolvedTime.dateTime} (via ${resolvedTime.method})');
print(
' Chrono parse: ${resolvedTime.dateTime} (via ${resolvedTime.method})',
);
} else {
print(' Chrono parse: FAILED — could not resolve "$timeExpr"');
}
@@ -378,7 +388,8 @@ void main(List<String> args) async {
final intentMatch = _intentMatches(llmResult.intent, tc.expectedIntent);
if (!intentMatch) {
failures.add(
'Intent mismatch: got "${llmResult.intent}", expected "${tc.expectedIntent}"');
'Intent mismatch: got "${llmResult.intent}", expected "${tc.expectedIntent}"',
);
}

// Check 2: Time expression present/absent
@@ -389,7 +400,8 @@ void main(List<String> args) async {
if (tc.expectedTimeEnglish == null &&
llmResult.datetimeExpressionEnglish != null) {
failures.add(
'Expected no time expression but got "${llmResult.datetimeExpressionEnglish}"');
'Expected no time expression but got "${llmResult.datetimeExpressionEnglish}"',
);
}

// Check 3: Chrono parse succeeded when expected
@@ -398,24 +410,28 @@ void main(List<String> args) async {
}
if (tc.expectedDateTime == null && resolvedTime != null) {
failures.add(
'Expected no resolved time but got ${resolvedTime.dateTime}');
'Expected no resolved time but got ${resolvedTime.dateTime}',
);
}

// Check 4: DateTime accuracy
if (tc.expectedDateTime != null && resolvedTime != null) {
final diff =
resolvedTime.dateTime.difference(tc.expectedDateTime!).inMinutes.abs();
final diff = resolvedTime.dateTime
.difference(tc.expectedDateTime!)
.inMinutes
.abs();
if (diff > tc.toleranceMinutes) {
failures.add(
'DateTime mismatch: got ${resolvedTime.dateTime}, expected ${tc.expectedDateTime} (diff: ${diff}min, tolerance: ${tc.toleranceMinutes}min)');
'DateTime mismatch: got ${resolvedTime.dateTime}, expected ${tc.expectedDateTime} (diff: ${diff}min, tolerance: ${tc.toleranceMinutes}min)',
);
}
}

final status = failures.isEmpty
? TestStatus.pass
: (failures.length == 1 && !failures.first.contains('Intent'))
? TestStatus.partial
: TestStatus.fail;
? TestStatus.partial
: TestStatus.fail;

if (failures.isEmpty) {
print(' ✅ PASS');
@@ -429,15 +445,17 @@ void main(List<String> args) async {
print(' Expected: ${tc.expectedDateTime}');
}

results.add(TestResult(
testCase: tc,
llmResult: llmResult,
resolvedTime: resolvedTime,
llmDuration: genSw.elapsed,
tokenCount: tokenCount,
status: status,
failures: failures,
));
results.add(
TestResult(
testCase: tc,
llmResult: llmResult,
resolvedTime: resolvedTime,
llmDuration: genSw.elapsed,
tokenCount: tokenCount,
status: status,
failures: failures,
),
);

print('');
}
@@ -449,12 +467,18 @@ void main(List<String> args) async {
final partial = results.where((r) => r.status == TestStatus.partial).length;
final failed = results.where((r) => r.status == TestStatus.fail).length;
final totalLlmTime = results.fold<Duration>(
Duration.zero, (sum, r) => sum + r.llmDuration);
Duration.zero,
(sum, r) => sum + r.llmDuration,
);

print('╔══════════════════════════════════════════════════════════╗');
print('║ Results: $passed passed, $partial partial, $failed failed '
'out of ${testCases.length} tests');
print('║ Total LLM time: ${(totalLlmTime.inMilliseconds / 1000).toStringAsFixed(1)}s');
print(
'║ Results: $passed passed, $partial partial, $failed failed '
'out of ${testCases.length} tests',
);
print(
'║ Total LLM time: ${(totalLlmTime.inMilliseconds / 1000).toStringAsFixed(1)}s',
);
print('║ Model: $modelFile');
print('╚══════════════════════════════════════════════════════════╝');

@@ -463,7 +487,8 @@ void main(List<String> args) async {
print('');
print('Failed/partial tests:');
for (final r in results.where(
(r) => r.status == TestStatus.fail || r.status == TestStatus.partial)) {
(r) => r.status == TestStatus.fail || r.status == TestStatus.partial,
)) {
print(' ${r.testCase.name}:');
for (final f in r.failures) {
print(' - $f');
97 changes: 52 additions & 45 deletions ai_testbench/lib/benchmark_main.dart
@@ -37,7 +37,9 @@ Future<void> main(List<String> args) async {
if (filteredModelPaths.isEmpty) {
stdout.writeln('[BenchmarkRunner] No matching .gguf files found');
if (config.modelFilter != null) {
stdout.writeln('[BenchmarkRunner] Model filter: ${config.modelFilter}');
stdout.writeln(
'[BenchmarkRunner] Model filter: ${config.modelFilter}',
);
}
exitCode = 1;
return;
@@ -48,7 +50,9 @@ Future<void> main(List<String> args) async {
caseLimit: config.caseLimit,
);
if (selectedCases.isEmpty) {
stdout.writeln('[BenchmarkRunner] No benchmark cases matched the request');
stdout.writeln(
'[BenchmarkRunner] No benchmark cases matched the request',
);
if (config.caseFilter != null) {
stdout.writeln('[BenchmarkRunner] Case filter: ${config.caseFilter}');
}
@@ -77,11 +81,7 @@ Future<void> main(List<String> args) async {
}
}

runApp(
BenchmarkApp(
modelDirectory: modelDir,
),
);
runApp(BenchmarkApp(modelDirectory: modelDir));
}

class _RunnerConfig {
@@ -113,13 +113,17 @@ _RunnerConfig _parseConfig(List<String> args) {
return null;
}

final modelDir = readValue('--model-dir') ?? Directory('models').absolute.path;
final modelDir =
readValue('--model-dir') ?? Directory('models').absolute.path;
final outputPath = readValue('--output');
final modelFilter = readValue('--model');
final caseFilter = readValue('--case');
final caseLimitValue = readValue('--case-limit');
final caseLimit = caseLimitValue == null ? null : int.tryParse(caseLimitValue);
final headless = hasFlag('--headless') || Platform.environment['AI_BENCH_HEADLESS'] == '1';
final caseLimit = caseLimitValue == null
? null
: int.tryParse(caseLimitValue);
final headless =
hasFlag('--headless') || Platform.environment['AI_BENCH_HEADLESS'] == '1';

return _RunnerConfig(
headless: headless,
@@ -219,16 +223,20 @@ Future<void> _runHeadlessBenchmark({
'finishedAt': finishedAt.toIso8601String(),
'modelCount': results.length,
'caseCount': selectedCases.length,
if (modelFilter != null && modelFilter.isNotEmpty) 'modelFilter': modelFilter,
if (modelFilter != null && modelFilter.isNotEmpty)
'modelFilter': modelFilter,
if (caseFilter != null && caseFilter.isNotEmpty) 'caseFilter': caseFilter,
'results': results.map(_serializeModelResult).toList(growable: false),
};

final resolvedOutputPath = outputPath ??
final resolvedOutputPath =
outputPath ??
'${Directory.current.path}${Platform.pathSeparator}benchmark_results_${DateTime.now().millisecondsSinceEpoch}.json';
final outputFile = File(resolvedOutputPath);
outputFile.parent.createSync(recursive: true);
outputFile.writeAsStringSync(const JsonEncoder.withIndent(' ').convert(report));
outputFile.writeAsStringSync(
const JsonEncoder.withIndent(' ').convert(report),
);

stdout.writeln('[BenchmarkRunner] Headless benchmark complete');
stdout.writeln('[BenchmarkRunner] Results written to ${outputFile.path}');
@@ -258,42 +266,41 @@ Map<String, dynamic> _serializeModelResult(BenchmarkModelResult result) {
'totalCases': result.cases.length,
'avgTokensPerSecond': result.avgTokensPerSecond,
'totalElapsedMs': result.totalElapsed.inMilliseconds,
'cases': result.cases.map((caseResult) {
return <String, dynamic>{
'caseName': caseResult.caseName,
'passed': caseResult.passed,
'validJson': caseResult.validJson,
'intentMatch': caseResult.intentMatch,
'timePresenceMatch': caseResult.timePresenceMatch,
'titleLanguageMatch': caseResult.titleLanguageMatch,
'titleLanguageDetail': caseResult.titleLanguageDetail,
'timeResolutionCorrect': caseResult.timeResolutionCorrect,
'timeResolutionDetail': caseResult.timeResolutionDetail,
'durationMatch': caseResult.durationMatch,
'durationDetail': caseResult.durationDetail,
'intent': caseResult.intent,
'title': caseResult.title,
'datetimeOriginal': caseResult.datetimeOriginal,
'datetimeEnglish': caseResult.datetimeEnglish,
'elapsedMs': caseResult.elapsed.inMilliseconds,
'tokensPerSecond': caseResult.tokensPerSecond,
'outputPreview': caseResult.outputPreview,
'error': caseResult.error,
'extractedCount': caseResult.extractedCount,
'expectedCount': caseResult.expectedCount,
'countMatch': caseResult.countMatch,
if (caseResult.itemFailures.isNotEmpty)
'itemFailures': caseResult.itemFailures,
};
}).toList(growable: false),
'cases': result.cases
.map((caseResult) {
return <String, dynamic>{
'caseName': caseResult.caseName,
'passed': caseResult.passed,
'validJson': caseResult.validJson,
'intentMatch': caseResult.intentMatch,
'timePresenceMatch': caseResult.timePresenceMatch,
'titleLanguageMatch': caseResult.titleLanguageMatch,
'titleLanguageDetail': caseResult.titleLanguageDetail,
'timeResolutionCorrect': caseResult.timeResolutionCorrect,
'timeResolutionDetail': caseResult.timeResolutionDetail,
'durationMatch': caseResult.durationMatch,
'durationDetail': caseResult.durationDetail,
'intent': caseResult.intent,
'title': caseResult.title,
'datetimeOriginal': caseResult.datetimeOriginal,
'datetimeEnglish': caseResult.datetimeEnglish,
'elapsedMs': caseResult.elapsed.inMilliseconds,
'tokensPerSecond': caseResult.tokensPerSecond,
'outputPreview': caseResult.outputPreview,
'error': caseResult.error,
'extractedCount': caseResult.extractedCount,
'expectedCount': caseResult.expectedCount,
'countMatch': caseResult.countMatch,
if (caseResult.itemFailures.isNotEmpty)
'itemFailures': caseResult.itemFailures,
};
})
.toList(growable: false),
};
}

class BenchmarkApp extends StatelessWidget {
const BenchmarkApp({
super.key,
required this.modelDirectory,
});
const BenchmarkApp({super.key, required this.modelDirectory});

final String modelDirectory;
