This document defines target performance thresholds for the SynthLink Catalog’s five evaluation metrics. These criteria represent near-optimal system behavior when answering complex, multi-hop research questions.
Each response's five metric scores are combined into a single aggregate score using the weighted sum:
Aggregate Score = 0.3 * F1 + 0.2 * P@5 + 0.3 * RQS + 0.1 * FCS + 0.1 * IE
Target Aggregate Score: ≥ 0.90
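
As a minimal sketch of the weighted sum above, the snippet below computes the aggregate score from five per-metric values and checks it against the 0.90 target. The weights and the target come from this document; the dictionary keys, the function names, and the small floating-point tolerance `eps` are illustrative assumptions rather than part of the catalog.

```python
# Weights copied from the aggregate formula; the key names are hypothetical.
WEIGHTS = {"f1": 0.3, "p_at_5": 0.2, "rqs": 0.3, "fcs": 0.1, "ie": 0.1}
TARGET_AGGREGATE = 0.90


def aggregate_score(scores):
    """Weighted sum of the five metric scores, each expected in [0, 1]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing metric scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def meets_target(scores, eps=1e-9):
    """True if the aggregate reaches the target (eps absorbs float rounding)."""
    return aggregate_score(scores) >= TARGET_AGGREGATE - eps


# Lower bound of each per-metric target listed in this document.
lower_bounds = {"f1": 0.90, "p_at_5": 0.80, "rqs": 0.90, "fcs": 1.00, "ie": 0.50}
print(round(aggregate_score(lower_bounds), 2))  # 0.85
print(meets_target(lower_bounds))               # False
```

One consequence worth noting: a system that sits exactly at the lower bound of every per-metric target scores 0.85 in aggregate, so reaching 0.90 requires exceeding the minimum on at least some metrics.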
- Definition: Harmonic mean of token-level precision and recall against the reference summary.
- Calculation:
  Precision = |P ∩ T| / |P|
  Recall = |P ∩ T| / |T|
  F1 = 2 * Precision * Recall / (Precision + Recall)
- Target Range: 0.90 – 1.00
- Rationale: Indicates comprehensive coverage of key facts with minimal omissions.
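
The F1 calculation above could be sketched as follows. This assumes that P is the set of tokens in the system's answer and T the set of tokens in the reference summary, and that tokens are obtained by lowercasing and splitting on whitespace; the document does not specify a tokenizer, so that choice and the function name `token_f1` are assumptions.

```python
def token_f1(prediction, reference):
    """Set-based token F1, following |P ∩ T| / |P| and |P ∩ T| / |T|."""
    # Assumed tokenization: lowercase, whitespace split, duplicates collapsed.
    p = set(prediction.lower().split())
    t = set(reference.lower().split())
    overlap = len(p & t)
    if overlap == 0:
        # Covers empty inputs and disjoint token sets (avoids division by zero).
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(t)
    return 2 * precision * recall / (precision + recall)


# Four of five distinct tokens overlap, so precision = recall = 0.8.
print(round(token_f1("the cat sat on the mat", "the cat lay on the mat"), 2))  # 0.8
```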
- Definition: Proportion of top-five retrieved documents that are relevant.
- Calculation:
  P@5 = |D_pred ∩ D_relevant| / 5
- Target Range: 0.80 – 1.00
- Rationale: Ensures robust source selection for evidence chaining.
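
A minimal sketch of the P@5 calculation above, assuming documents are identified by string IDs and that `retrieved` is already ordered by rank. The function name and the example IDs are hypothetical. The denominator stays fixed at 5, as in the formula, even when fewer than five documents are returned.

```python
def precision_at_5(retrieved, relevant):
    """Share of the top-five retrieved document IDs that are relevant."""
    top_five = set(retrieved[:5])       # D_pred: the five highest-ranked IDs
    hits = len(top_five & set(relevant))
    return hits / 5                     # fixed denominator, per the formula


ranked = ["d1", "d7", "d3", "d9", "d4", "d2"]  # "d2" is ranked sixth and ignored
gold = {"d1", "d2", "d3", "d4"}
print(precision_at_5(ranked, gold))  # 0.6
```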
- Definition: Fraction of predefined reasoning steps present in the system’s explanation, capped at 1.0.
- Calculation:
  RQS = min((1/n) * sum_{i=1..n} I(s_i present), 1.0)
  - Here, I(condition) is 1 if the step is present, otherwise 0.
- Target Range: 0.90 – 1.00
- Rationale: Validates that all critical inference steps are articulated.
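
The RQS formula above might be computed along these lines. The document does not say how a step is judged "present", so this sketch substitutes a case-insensitive substring match as a placeholder; a real evaluation would likely use human annotation or a semantic matcher. The function name and the example steps are hypothetical.

```python
def rqs(expected_steps, explanation):
    """Fraction of predefined reasoning steps found in the explanation, capped at 1.0."""
    if not expected_steps:
        raise ValueError("RQS needs at least one predefined step (n >= 1)")
    text = explanation.lower()
    # Placeholder indicator I(s_i present): plain substring match (an assumption).
    present = sum(1 for step in expected_steps if step.lower() in text)
    return min(present / len(expected_steps), 1.0)


steps = ["identify the enzyme", "link enzyme to pathway", "infer the drug target"]
answer = "First we identify the enzyme, then link enzyme to pathway."
print(round(rqs(steps, answer), 2))  # 0.67
```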
- Definition: Proportion of factual claims that are verifiable against cited sources.
- Calculation:
  FCS = (1/m) * sum_{j=1..m} I(c_j verified)
- Target Value: 1.00
- Rationale: Guarantees that all assertions are grounded in evidence, especially for speculative topics.
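
As a sketch of the FCS formula above, the function below takes one boolean per factual claim, indicating whether that claim was verified against its cited source. How claims are extracted and verified is outside what this document specifies, so the input format is an assumption, as is the decision to raise an error when there are no claims (m = 0), a case the formula leaves undefined.

```python
def fcs(claim_verified):
    """Share of factual claims marked as verified against cited sources."""
    flags = list(claim_verified)
    if not flags:
        # Assumption: treat m = 0 as an error rather than picking a default score.
        raise ValueError("FCS is undefined when there are no claims")
    return sum(1 for verified in flags if verified) / len(flags)


print(fcs([True, True, False, True]))  # 0.75, below the 1.00 target
print(fcs([True, True, True]))         # 1.0, meets the target
```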
- Definition: Inverse of the number of iterations (k) required to produce the first correct answer, with k capped at 5.
- Calculation:
  IE = 1 / min(k, 5)
- Target Range: 0.50 – 1.00
- Rationale: Measures the system’s convergence speed on complex queries.
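
The IE formula above is small enough to show directly. The sketch assumes k is a positive integer count of iterations; the function name is hypothetical.

```python
def iteration_efficiency(k):
    """IE = 1 / min(k, 5), where k is the iteration of the first correct answer."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1 / min(k, 5)


for k in (1, 2, 3, 7):
    print(k, round(iteration_efficiency(k), 2))
# 1 1.0
# 2 0.5
# 3 0.33
# 7 0.2
```

Under this formula the 0.50 – 1.00 target range corresponds to a correct answer within the first two iterations, and the cap keeps the score from dropping below 0.2.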
Document last updated: May 2025