| title | 🧪 Testing the Explainable AI (XAI) |
|---|---|
| nav_order | 9 |
The Explainable AI (XAI) layer is evaluated using four distinct clinical scenarios generated by the Neo4j inference engine. These tests are designed to challenge the LLM's ability to handle logical negations (blocking symptoms), differential weighting, and confidence assessment.
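For orientation, one engine payload might look roughly like the sketch below. The field names (candidates, normalized_score, passed_filter, blocking_symptoms, denied_symptoms) are assumptions inferred from the flags referenced throughout this report, and the scores are purely illustrative:

```python
# Hypothetical sketch of the TC1 payload handed to the LLM.
# Field names and scores are assumptions, not the project's actual schema.
scenario_tc1 = {
    "candidates": [
        {"disease": "Marburg hemorrhagic fever", "normalized_score": 0.95,
         "passed_filter": False, "blocking_symptoms": ["maculopapular rash", "chills"]},
        {"disease": "nonparalytic poliomyelitis", "normalized_score": 0.91,
         "passed_filter": True, "blocking_symptoms": []},
        {"disease": "West Nile fever", "normalized_score": 0.88,
         "passed_filter": False, "blocking_symptoms": ["maculopapular rash"]},
        {"disease": "Ebola virus disease", "normalized_score": 0.84,
         "passed_filter": True, "blocking_symptoms": []},
        {"disease": "poliomyelitis", "normalized_score": 0.80,
         "passed_filter": True, "blocking_symptoms": []},
    ],
    # Symptoms the patient explicitly denied during intake.
    "denied_symptoms": ["maculopapular rash", "chills"],
}
```

Note that in this sketch the two failed-filter candidates carry the highest raw scores, which is exactly the trap these scenarios are built to probe.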
Test case 1 (TC1). Focus: Can the model correctly identify why a high-scoring disease was excluded?
| Field | Value |
|---|---|
| Most Likely | nonparalytic poliomyelitis (Passed Filter: True) |
| Differentials | Ebola virus disease and poliomyelitis (Passed Filter: True) |
| Excluded conditions | Marburg hemorrhagic fever and West Nile fever (Passed Filter: False) |
| Blocking symptoms | maculopapular rash and/or chills (patient denied both) |
Success criteria: The reasoning must explicitly cite the absence of the maculopapular rash as the reason for excluding West Nile fever, and the absence of both the rash and chills for Marburg hemorrhagic fever.
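A criterion like this can be checked mechanically with simple string matching; the function below is an illustrative sketch, not the project's actual evaluation harness:

```python
def mentions_exclusion_reason(reasoning: str, disease: str, symptoms: list) -> bool:
    """True if the reasoning text names the disease together with every
    blocking symptom whose absence justified its exclusion."""
    text = reasoning.lower()
    return disease.lower() in text and all(s.lower() in text for s in symptoms)

# A reasoning snippet shaped like the expected model output:
sample = ("West Nile fever was excluded because the patient denied a "
          "maculopapular rash; Marburg hemorrhagic fever was excluded due "
          "to the absence of both maculopapular rash and chills.")
```

Calling mentions_exclusion_reason(sample, "West Nile fever", ["maculopapular rash"]) passes, while a reasoning paragraph that never names the rash would fail the check.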
Test case 2 (TC2). Focus: Distinguishing between diseases with very similar symptom profiles.
| Field | Value |
|---|---|
| Most Likely | hepatitis E (Passed Filter: True) |
| Differentials | hepatitis B, hepatitis C and hepatitis A (Passed Filter: True) |
| Excluded conditions | hepatitis D (Passed Filter: False) |
| Blocking symptoms | drowsiness and confusion (patient denied both) |
Success criteria: The model assigns high confidence to hepatitis E, explains that hepatitis A, B, and C are less likely due to lower symptomatic alignment, and mentions the absence of drowsiness and confusion as the reason for excluding hepatitis D.
Test case 3 (TC3). Focus: Managing high-severity cases with low disease coverage.
| Field | Value |
|---|---|
| Most Likely | West Nile encephalitis (Passed Filter: True) |
| Differentials | Powassan encephalitis and Eastern equine encephalitis (Passed Filter: True) |
| Excluded conditions | Japanese encephalitis and St. Louis encephalitis (Passed Filter: False) |
| Blocking symptoms | spastic paralysis (patient denied) |
Success criteria: The model identifies West Nile encephalitis as the primary diagnosis despite low confidence, explicitly recommends further testing, and mentions the absence of spastic paralysis as the reason for excluding Japanese encephalitis and St. Louis encephalitis.
Test case 4 (TC4). Focus: Handling overlapping symptoms where no conditions are excluded.
| Field | Value |
|---|---|
| Most Likely | Powassan encephalitis (Passed Filter: True) |
| Differentials | nonparalytic poliomyelitis, La Crosse encephalitis and poliomyelitis (Passed Filter: True) |
| Excluded conditions | - |
| Blocking symptoms | - |
Success criteria: Model identifies seizure as the clinical tie-breaker that elevates Powassan encephalitis over the competing candidates.
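The tie-breaker criterion can be framed as a simple set difference: among closely matched candidates, find a reported symptom that only the leading disease covers. A minimal sketch, with a hypothetical symptoms field per candidate:

```python
def tie_breaker_symptoms(top: dict, rivals: list, reported: set) -> set:
    """Reported symptoms covered by the top candidate but by none of the
    rivals; these are candidates for a clinical tie-breaker like 'seizure'."""
    covered_by_rivals = set().union(*(set(r["symptoms"]) for r in rivals))
    return (set(top["symptoms"]) & reported) - covered_by_rivals
```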
Overall assessment: Llama 3.2 (3b) shows strong JSON structural compliance but hallucinates reasoning around the clinical filter logic provided in the input data.
| Metric | Result | Commentary |
|---|---|---|
| JSON structural integrity | ✅ 100% | Perfectly followed the schema and maintained all keys |
| Exclusion logic (blocking symptoms) | - | Correctly identified the Hepatitis D exclusion; failed on Japanese encephalitis |
| Internal consistency | ❌ Fail | In TC3, contradicts the input filter logic |
| Clinical tone | ✅ High | Professional language / medical vocabulary |
1. Negation & filter logic (TC1 & TC3)
In TC3, the model correctly identifies West Nile encephalitis as the most likely diagnosis. However, it fails the logical constraint test by placing Japanese encephalitis and St. Louis encephalitis into the differential diagnosis category, completely ignoring the passed_filter: false flag. These conditions should have been moved to excluded_conditions due to the blocking symptom spastic paralysis, which the patient explicitly denied. Furthermore, the model incorrectly justifies their inclusion by claiming they have lower scores, when in fact they had higher raw scores and were ruled out by the filter.
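This class of error is mechanically detectable: every candidate flagged passed_filter: false must appear in excluded_conditions and nowhere else. A minimal validator sketch, assuming output keys (most_likely, differentials, excluded_conditions) named after the arrays discussed in this report:

```python
def filter_consistency_errors(engine_input: dict, llm_output: dict) -> list:
    """List every disease the LLM placed in a category that contradicts
    the symbolic passed_filter flag from the inference engine."""
    errors = []
    excluded = set(llm_output.get("excluded_conditions", []))
    viable = set(llm_output.get("differentials", [])) | {llm_output.get("most_likely")}
    for cand in engine_input["candidates"]:
        name, passed = cand["disease"], cand["passed_filter"]
        if not passed and name not in excluded:
            errors.append(f"{name}: passed_filter is false but missing from excluded_conditions")
        if not passed and name in viable:
            errors.append(f"{name}: passed_filter is false but listed as viable")
        if passed and name in excluded:
            errors.append(f"{name}: passed_filter is true but listed as excluded")
    return errors
```

Run against the TC3 output described above, this would flag Japanese encephalitis and St. Louis encephalitis twice each: once for missing from excluded_conditions and once for sitting in the differentials.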
2. Differential comparison (TC2 & TC4)
TC2 (Hepatitis panel) was handled well. The model correctly identified Hepatitis D's exclusion based on the absence of drowsiness and confusion.
In TC4, the model ranked Powassan encephalitis correctly but did not explicitly name seizure as the clinical tie-breaker.
3. Safety & clinical recommendations
Across all test cases the model consistently appended relevant next steps (e.g., CSF analysis, PCR for flavivirus). This suggests its medical pre-training knowledge is being used to enrich explanations beyond the provided JSON — a useful behaviour, as long as it does not override or contradict the input data.
Overall assessment: Llama 3 (8b) successfully generates valid JSON structures and writes highly professional, clinically coherent paragraphs. However, it exhibits a severe inability to map the passed_filter flag to the appropriate JSON arrays, producing contradictory outputs in which the text explains that a disease is excluded while the array still lists it as a viable differential.
| Metric | Result | Commentary |
|---|---|---|
| JSON structural integrity | ✅ 100% | Perfectly followed the schema and maintained all keys |
| Exclusion logic (blocking symptoms) | ❌ 25% | Recognized blocking symptoms in text, but placed diseases in the wrong categories |
| Internal consistency | ❌ Fail | Massive semantic disconnect — the text frequently contradicts the JSON arrays |
| Clinical tone | ✅ High | Professional language / medical vocabulary |
1. Negation & filter logic (TC1, TC2 & TC3)
In TC1, it places Marburg hemorrhagic fever in the differentials, completely ignoring the passed_filter: false flag. It also misplaces poliomyelitis in excluded_conditions when it belongs in the differentials.
In TC2, it places Hepatitis D in differentials, but then correctly writes in the exclusion_criteria paragraph that Hepatitis D was "rejected due to the presence of blocking symptoms".
In TC3, it completely inverts the logic: it places the excluded diseases (Japanese encephalitis and St. Louis encephalitis) into the differentials array, and puts valid differentials (Powassan encephalitis and Eastern equine encephalitis) into the excluded_conditions array.
2. Hallucinated constraints (TC4)
In TC4, all diseases passed the filter (no blocking symptoms were triggered). However, the model forced La Crosse encephalitis and primary amebic meningoencephalitis into the excluded_conditions array. To justify this, it hallucinated that blocking symptoms were present, stating: "La Crosse... excluded due to their lower scores and the presence of blocking symptoms." It clearly confused "missing symptoms" with "blocking symptoms" to force a logical narrative.
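This failure mode is also easy to flag automatically: if every candidate passed the filter, any mention of blocking symptoms in the exclusion paragraph is by definition ungrounded. A minimal sketch, assuming a candidates list carrying passed_filter flags:

```python
def ungrounded_blocking_claim(engine_input: dict, exclusion_text: str) -> bool:
    """True when no blocking symptom was triggered (all candidates passed
    the filter) yet the explanation still invokes blocking symptoms."""
    all_passed = all(c["passed_filter"] for c in engine_input["candidates"])
    return all_passed and "blocking symptom" in exclusion_text.lower()
```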
Overall assessment: Qwen 2.5 (14b) shows a significant leap in performance compared to Llama 3 (8b), particularly in logical consistency and adherence to rigid system rules. While Llama often "hallucinated" reasons for exclusion to justify its placement errors, Qwen demonstrates much stricter compliance with the passed_filter flag.
| Metric | Result | Commentary |
|---|---|---|
| JSON structural integrity | ✅ 100% | Perfectly followed the schema and maintained all keys |
| Exclusion logic (blocking symptoms) | ✅ 100% | Accurately identifies passed_filter: false and correctly maps diseases to excluded_conditions |
| Internal consistency | ✅ High | Textual reasoning directly supports the content of the JSON arrays |
| Clinical tone | ✅ High | Professional language / medical vocabulary |
1. Precise handling of negation (TC1, TC2 & TC3)
In TC1, correctly places Marburg hemorrhagic fever in excluded_conditions. In the text, it explicitly states the disease is excluded due to the absence of a rash and chills, showing that the model "understands" that passed_filter: false takes priority over a high score.
In TC2, Hepatitis D is correctly excluded. The reasoning is precise: it notes that drowsiness and confusion are mandatory markers that are missing, which directly justifies its placement in the excluded list.
In TC3, the model successfully resolves the "inversion" problem. Japanese encephalitis and St. Louis encephalitis are correctly classified as excluded, with a clear explanation regarding the blocking symptom (spastic paralysis).
Overall assessment: Phi-4 (14b) shows a high level of logical maturity. Unlike smaller models that often prioritize conversational fluency over data constraints, Phi-4 treats the provided symbolic filters as hard requirements.
| Metric | Result | Commentary |
|---|---|---|
| JSON structural integrity | ✅ 100% | Perfectly followed the schema and maintained all keys |
| Exclusion logic (blocking symptoms) | ✅ 100% | Accurately identifies passed_filter: false and correctly maps diseases to excluded_conditions |
| Internal consistency | ✅ High | Textual reasoning directly supports the content of the JSON arrays |
| Clinical tone | ✅ High | Professional language / medical vocabulary |
1. Precise handling of negation (TC1, TC2 & TC3)
In cases where a disease has a high normalized_score but is marked as passed_filter: false, Phi-4 successfully excludes that disease.
For example, in TC1, even though Marburg hemorrhagic fever matches 5 out of 5 symptoms, Phi-4 correctly excludes it, explicitly stating that the absence of a maculopapular rash and chills is the deciding factor. TC2 and TC3 show the same behaviour.
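That precedence, hard filter first and score ranking second, is what the evaluation expects of every model. A minimal sketch, assuming candidate dicts with the normalized_score and passed_filter fields referenced in this report:

```python
def rank_candidates(candidates: list) -> tuple:
    """Apply passed_filter as a hard gate before any score-based ranking,
    so even a 5/5 symptom match cannot survive a failed filter."""
    viable = [c for c in candidates if c["passed_filter"]]
    excluded = [c for c in candidates if not c["passed_filter"]]
    viable.sort(key=lambda c: c["normalized_score"], reverse=True)
    return viable, excluded
```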
| Model | JSON Integrity | Exclusion Logic | Internal Consistency | Clinical Tone |
|---|---|---|---|---|
| Llama 3.2 (3b) | ✅ 100% | - | ❌ Fail | ✅ High |
| Llama 3 (8b) | ✅ 100% | ❌ 25% | ❌ Fail | ✅ High |
| Qwen 2.5 (14b) | ✅ 100% | ✅ 100% | ✅ High | ✅ High |
| Phi-4 (14b) | ✅ 100% | ✅ 100% | ✅ High | ✅ High |