| title | 🧪 Testing the Explainable AI (XAI) |
|---|---|
| nav_order | 9 |
The Explainable AI (XAI) layer is evaluated using four distinct clinical scenarios generated by the Neo4j inference engine. These tests are designed to challenge the LLM's ability to handle logical negations (blocking symptoms), differential weighting, and confidence assessment.
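For orientation, one engine payload might look roughly like the sketch below. The field names (candidates, normalized_score, passed_filter, blocking_symptoms, denied_symptoms) are assumptions inferred from the flags referenced throughout this report, and the scores are purely illustrative:

```python
# Hypothetical sketch of the TC1 payload handed to the LLM.
# Field names and scores are assumptions, not the project's actual schema.
scenario_tc1 = {
    "candidates": [
        {"disease": "Marburg hemorrhagic fever", "normalized_score": 0.95,
         "passed_filter": False, "blocking_symptoms": ["maculopapular rash", "chills"]},
        {"disease": "nonparalytic poliomyelitis", "normalized_score": 0.91,
         "passed_filter": True, "blocking_symptoms": []},
        {"disease": "West Nile fever", "normalized_score": 0.88,
         "passed_filter": False, "blocking_symptoms": ["maculopapular rash"]},
        {"disease": "Ebola virus disease", "normalized_score": 0.84,
         "passed_filter": True, "blocking_symptoms": []},
        {"disease": "poliomyelitis", "normalized_score": 0.80,
         "passed_filter": True, "blocking_symptoms": []},
    ],
    # Symptoms the patient explicitly denied during intake.
    "denied_symptoms": ["maculopapular rash", "chills"],
}
```

Note that in this sketch the two failed-filter candidates carry the highest raw scores, which is exactly the trap these scenarios are built to probe.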
Test case 1 (TC1). Focus: Can the model correctly identify why a high-scoring disease was excluded?
| Field | Value |
|---|---|
| Most Likely | nonparalytic poliomyelitis (Passed Filter: True) |
| Differentials | Ebola virus disease and poliomyelitis (Passed Filter: True) |
| Excluded conditions | Marburg hemorrhagic fever and West Nile fever (Passed Filter: False) |
| Blocking symptoms | maculopapular rash and/or chills (patient denied both) |
Success criteria: The reasoning must explicitly cite the absence of the maculopapular rash as the reason for excluding West Nile fever, and the absence of both the rash and chills for Marburg hemorrhagic fever.
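A criterion like this can be checked mechanically with simple string matching; the function below is an illustrative sketch, not the project's actual evaluation harness:

```python
def mentions_exclusion_reason(reasoning: str, disease: str, symptoms: list) -> bool:
    """True if the reasoning text names the disease together with every
    blocking symptom whose absence justified its exclusion."""
    text = reasoning.lower()
    return disease.lower() in text and all(s.lower() in text for s in symptoms)

# A reasoning snippet shaped like the expected model output:
sample = ("West Nile fever was excluded because the patient denied a "
          "maculopapular rash; Marburg hemorrhagic fever was excluded due "
          "to the absence of both maculopapular rash and chills.")
```

Calling mentions_exclusion_reason(sample, "West Nile fever", ["maculopapular rash"]) passes, while a reasoning paragraph that never names the rash would fail the check.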
Test case 2 (TC2). Focus: Distinguishing between diseases with very similar symptom profiles.
| Field | Value |
|---|---|
| Most Likely | hepatitis E (Passed Filter: True) |
| Differentials | hepatitis B, hepatitis C and hepatitis A (Passed Filter: True) |
| Excluded conditions | hepatitis D (Passed Filter: False) |
| Blocking symptoms | drowsiness and confusion (patient denied both) |
Success criteria: The model assigns high confidence to hepatitis E, explains that hepatitis A, B, and C are less likely due to lower symptomatic alignment, and mentions the absence of drowsiness and confusion as the reason for excluding hepatitis D.
Test case 3 (TC3). Focus: Managing high-severity cases with low disease coverage.
| Field | Value |
|---|---|
| Most Likely | West Nile encephalitis (Passed Filter: True) |
| Differentials | Powassan encephalitis and Eastern equine encephalitis (Passed Filter: True) |
| Excluded conditions | Japanese encephalitis and St. Louis encephalitis (Passed Filter: False) |
| Blocking symptoms | spastic paralysis (patient denied) |
Success criteria: The model identifies West Nile encephalitis as the primary diagnosis despite low confidence, explicitly recommends further testing, and mentions the absence of spastic paralysis as the reason for excluding Japanese encephalitis and St. Louis encephalitis.
Test case 4 (TC4). Focus: Handling overlapping symptoms where no conditions are excluded.
| Field | Value |
|---|---|
| Most Likely | Powassan encephalitis (Passed Filter: True) |
| Differentials | nonparalytic poliomyelitis, La Crosse encephalitis and poliomyelitis (Passed Filter: True) |
| Excluded conditions | - |
| Blocking symptoms | - |
Success criteria: Model identifies seizure as the clinical tie-breaker that elevates Powassan encephalitis over the competing candidates.
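The tie-breaker criterion can be framed as a simple set difference: among closely matched candidates, find a reported symptom that only the leading disease covers. A minimal sketch, with a hypothetical symptoms field per candidate:

```python
def tie_breaker_symptoms(top: dict, rivals: list, reported: set) -> set:
    """Reported symptoms covered by the top candidate but by none of the
    rivals; these are candidates for a clinical tie-breaker like 'seizure'."""
    covered_by_rivals = set().union(*(set(r["symptoms"]) for r in rivals))
    return (set(top["symptoms"]) & reported) - covered_by_rivals
```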
Overall assessment: Llama 3.2 (3b) shows strong JSON structural compliance but hallucinates reasoning around the clinical filter logic provided in the input data.
| Metric | Result | Commentary |
|---|---|---|
| JSON structural integrity | ✅ 100% | Perfectly followed the schema and maintained all keys |
| Exclusion logic (blocking symptoms) | - | Correctly identified the Hepatitis D exclusion; failed on Japanese encephalitis |
| Internal consistency | ❌ Fail | In TC3, contradicts the input filter logic |
| Clinical tone | ✅ High | Professional language / medical vocabulary |
1. Negation & filter logic (TC1 & TC3)
In TC3, the model correctly identifies West Nile encephalitis as the most likely diagnosis. However, it fails the logical constraint test by placing Japanese encephalitis and St. Louis encephalitis into the differential diagnosis category, completely ignoring the passed_filter: false flag. These conditions should have been moved to excluded_conditions due to the blocking symptom spastic paralysis, which the patient explicitly denied. Furthermore, the model incorrectly justifies their inclusion by claiming they have lower scores, when in fact they had higher raw scores and were ruled out by the filter.
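This class of error is mechanically detectable: every candidate flagged passed_filter: false must appear in excluded_conditions and nowhere else. A minimal validator sketch, assuming output keys (most_likely, differentials, excluded_conditions) named after the arrays discussed in this report:

```python
def filter_consistency_errors(engine_input: dict, llm_output: dict) -> list:
    """List every disease the LLM placed in a category that contradicts
    the symbolic passed_filter flag from the inference engine."""
    errors = []
    excluded = set(llm_output.get("excluded_conditions", []))
    viable = set(llm_output.get("differentials", [])) | {llm_output.get("most_likely")}
    for cand in engine_input["candidates"]:
        name, passed = cand["disease"], cand["passed_filter"]
        if not passed and name not in excluded:
            errors.append(f"{name}: passed_filter is false but missing from excluded_conditions")
        if not passed and name in viable:
            errors.append(f"{name}: passed_filter is false but listed as viable")
        if passed and name in excluded:
            errors.append(f"{name}: passed_filter is true but listed as excluded")
    return errors
```

Run against the TC3 output described above, this would flag Japanese encephalitis and St. Louis encephalitis twice each: once for missing from excluded_conditions and once for sitting in the differentials.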
2. Differential comparison (TC2 & TC4)
TC2 (Hepatitis panel) was handled well. The model correctly identified Hepatitis D's exclusion based on the absence of drowsiness and confusion.
In TC4, the model ranked Powassan encephalitis correctly but did not explicitly name seizure as the clinical tie-breaker.
3. Safety & clinical recommendations
Across all test cases the model consistently appended relevant next steps (e.g., CSF analysis, PCR for flavivirus). This suggests its medical pre-training knowledge is being used to enrich explanations beyond the provided JSON — a useful behaviour, as long as it does not override or contradict the input data.
Overall assessment: Llama 3 (8b) successfully generates valid JSON structures and writes highly professional, clinically coherent paragraphs. However, it exhibits a severe inability to map the passed_filter flag to the appropriate JSON arrays, producing contradictory outputs in which the text explains that a disease is excluded while the array still lists it as a viable differential.
| Metric | Result | Commentary |
|---|---|---|
| JSON structural integrity | ✅ 100% | Perfectly followed the schema and maintained all keys |
| Exclusion logic (blocking symptoms) | ❌ 25% | Recognized blocking symptoms in text, but placed diseases in the wrong categories |
| Internal consistency | ❌ Fail | Massive semantic disconnect — the text frequently contradicts the JSON arrays |
| Clinical tone | ✅ High | Professional language / medical vocabulary |
1. Negation & filter logic (TC1, TC2 & TC3)
In TC1, it places Marburg hemorrhagic fever in the differentials, completely ignoring the passed_filter: false flag. It also misplaces poliomyelitis in excluded_conditions when it belongs in the differentials.
In TC2, it places Hepatitis D in differentials, but then correctly writes in the exclusion_criteria paragraph that Hepatitis D was "rejected due to the presence of blocking symptoms".
In TC3, it completely inverts the logic: it places the excluded diseases (Japanese encephalitis and St. Louis encephalitis) into the differentials array, and puts valid differentials (Powassan encephalitis and Eastern equine encephalitis) into the excluded_conditions array.
2. Hallucinated constraints (TC4)
In TC4, all diseases passed the filter (no blocking symptoms were triggered). However, the model forced La Crosse encephalitis and primary amebic meningoencephalitis into the excluded_conditions array. To justify this, it hallucinated that blocking symptoms were present, stating: "La Crosse... excluded due to their lower scores and the presence of blocking symptoms." It clearly confused "missing symptoms" with "blocking symptoms" to force a logical narrative.
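This failure mode is also easy to flag automatically: if every candidate passed the filter, any mention of blocking symptoms in the exclusion paragraph is by definition ungrounded. A minimal sketch, assuming a candidates list carrying passed_filter flags:

```python
def ungrounded_blocking_claim(engine_input: dict, exclusion_text: str) -> bool:
    """True when no blocking symptom was triggered (all candidates passed
    the filter) yet the explanation still invokes blocking symptoms."""
    all_passed = all(c["passed_filter"] for c in engine_input["candidates"])
    return all_passed and "blocking symptom" in exclusion_text.lower()
```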
Overall assessment: Qwen 2.5 (14b) shows a significant leap in performance compared to Llama 3 (8b), particularly in logical consistency and adherence to rigid system rules. While Llama often "hallucinated" reasons for exclusion to justify its placement errors, Qwen demonstrates much stricter compliance with the passed_filter flag.
| Metric | Result | Commentary |
|---|---|---|
| JSON structural integrity | ✅ 100% | Perfectly followed the schema and maintained all keys |
| Exclusion logic (blocking symptoms) | ✅ 100% | Accurately identifies passed_filter: false and correctly maps diseases to excluded_conditions |
| Internal consistency | ✅ High | Textual reasoning directly supports the content of the JSON arrays |
| Clinical tone | ✅ High | Professional language / medical vocabulary |
1. Precise handling of negation (TC1, TC2 & TC3)
In TC1, correctly places Marburg hemorrhagic fever in excluded_conditions. In the text, it explicitly states the disease is excluded due to the absence of a rash and chills, showing that the model "understands" that passed_filter: false takes priority over a high score.
In TC2, Hepatitis D is correctly excluded. The reasoning is precise: it notes that drowsiness and confusion are mandatory markers that are missing, which directly justifies its placement in the excluded list.
In TC3, the model successfully resolves the "inversion" problem. Japanese encephalitis and St. Louis encephalitis are correctly classified as excluded, with a clear explanation regarding the blocking symptom (spastic paralysis).
Overall assessment: Phi-4 (14b) shows a high level of logical maturity. Unlike smaller models that often prioritize conversational fluency over data constraints, Phi-4 treats the provided symbolic filters as hard requirements.
| Metric | Result | Commentary |
|---|---|---|
| JSON structural integrity | ✅ 100% | Perfectly followed the schema and maintained all keys |
| Exclusion logic (blocking symptoms) | ✅ 100% | Accurately identifies passed_filter: false and correctly maps diseases to excluded_conditions |
| Internal consistency | ✅ High | Textual reasoning directly supports the content of the JSON arrays |
| Clinical tone | ✅ High | Professional language / medical vocabulary |
1. Precise handling of negation (TC1, TC2 & TC3)
In cases where a disease has a high normalized_score but is marked as passed_filter: false, Phi-4 successfully excludes that disease.
For example, in TC1, even though Marburg hemorrhagic fever matches 5 out of 5 symptoms, Phi-4 correctly excludes it, explicitly stating that the absence of a maculopapular rash and chills is the deciding factor. TC2 and TC3 show the same behaviour.
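That precedence, hard filter first and score ranking second, is what the evaluation expects of every model. A minimal sketch, assuming candidate dicts with the normalized_score and passed_filter fields referenced in this report:

```python
def rank_candidates(candidates: list) -> tuple:
    """Apply passed_filter as a hard gate before any score-based ranking,
    so even a 5/5 symptom match cannot survive a failed filter."""
    viable = [c for c in candidates if c["passed_filter"]]
    excluded = [c for c in candidates if not c["passed_filter"]]
    viable.sort(key=lambda c: c["normalized_score"], reverse=True)
    return viable, excluded
```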
| Model | JSON Integrity | Exclusion Logic | Internal Consistency | Clinical Tone |
|---|---|---|---|---|
| Llama 3.2 (3b) | ✅ 100% | - | ❌ Fail | ✅ High |
| Llama 3 (8b) | ✅ 100% | ❌ 25% | ❌ Fail | ✅ High |
| Qwen 2.5 (14b) | ✅ 100% | ✅ 100% | ✅ High | ✅ High |
| Phi-4 (14b) | ✅ 100% | ✅ 100% | ✅ High | ✅ High |