This repository contains the code and experiments for evaluating Bayesian coherence in large language models (LLMs) within an in-context learning (ICL) paradigm. The project investigates whether LLMs perform coherent Bayesian inference when estimating latent (only indirectly observable) probabilities, rather than merely reproducing observable frequencies.
LLMs often appear to behave "Bayesian-like" in simple settings, but it remains unclear whether their probability judgments are globally coherent—that is, whether they respect Bayes' theorem when combining multiple learned probabilities. Rather than focusing only on prediction accuracy, this work explicitly measures coherence between probability estimates produced by the model.
- Instruction-fine-tuned models using chat templates can accurately infer the held-out transition probability, reaching Bayes-optimal accuracy with sufficient evidence.
- Despite high accuracy, these models often exhibit systematic Bayesian incoherence, driven by misestimation of intermediate conditional probabilities.
- Base (non-instruction-tuned) models fail to reliably infer the latent transition, and their coherence deviations follow a different pattern from those of instruction-fine-tuned models.
- Coherence is generally worse for indirectly observable probabilities than for directly observable ones.
- Causal and non-causal factorizations do not yield systematic differences in coherence.
We design a synthetic causal data-generating process with three binary random variables:
LLMs observe sequences of shuffled triplets
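
As a rough illustration of such a data-generating process, the sketch below samples binary triplets following the factorization $P(x)$, $P(y \mid x)$, $P(z \mid x, y)$ used elsewhere in this README. The function name `sample_context`, all parameter values, and the reading of "shuffled" as randomizing the order of triplets in the context are assumptions made for illustration, not the repository's actual generator.

```python
import random


def sample_context(n, p_x=0.3, p_y_given_x=(0.2, 0.8),
                   p_z_given_xy=((0.1, 0.6), (0.4, 0.9)), seed=0):
    """Sample n binary (x, y, z) triplets from a hypothetical causal process.

    p_x                -- P(x = 1)
    p_y_given_x[x]     -- P(y = 1 | x)
    p_z_given_xy[x][y] -- P(z = 1 | x, y)
    All values are illustrative placeholders.
    """
    rng = random.Random(seed)
    triplets = []
    for _ in range(n):
        x = int(rng.random() < p_x)
        y = int(rng.random() < p_y_given_x[x])
        z = int(rng.random() < p_z_given_xy[x][y])
        triplets.append((x, y, z))
    # Assumption: "shuffled" means the order of triplets is randomized.
    rng.shuffle(triplets)
    return triplets


context = sample_context(200, seed=1)
```

Fixing the seed keeps a context reproducible, which makes it easier to compare models on identical evidence.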
From a single context, we extract the model's estimates of:
- Marginal probabilities (e.g., $P(x)$)
- Conditional probabilities (e.g., $P(y \mid x)$, $P(z \mid x, y)$)
- The held-out posterior $P(x \mid y, z)$
All probabilities are obtained directly from token log-likelihoods.
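
One way to turn token log-likelihoods into a probability for a binary variable is to renormalize over the two candidate tokens. The sketch below assumes exactly that: the helper name `binary_prob_from_logliks` is hypothetical, and restricting the normalization to the two outcome tokens is an assumption rather than a documented detail of this pipeline.

```python
import math


def binary_prob_from_logliks(loglik_one, loglik_zero):
    """Return P(value = 1) by renormalizing two token log-likelihoods.

    Assumption: probability mass on all other tokens is ignored.
    The maximum is subtracted before exponentiating for numerical stability.
    """
    m = max(loglik_one, loglik_zero)
    e_one = math.exp(loglik_one - m)
    e_zero = math.exp(loglik_zero - m)
    return e_one / (e_one + e_zero)


# Example with made-up log-likelihoods for the tokens "1" and "0".
p_one = binary_prob_from_logliks(math.log(0.6), math.log(0.2))
```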
To quantify coherence, we introduce a causal coherence score that compares the model's direct estimate of the posterior with its Bayes-consistent reconstruction from factorized probabilities:
A score of zero indicates perfect Bayesian coherence. Deviations reveal systematic inconsistencies in the model's probabilistic reasoning.
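
For concreteness, Bayes' theorem gives the reconstruction $P(x \mid y, z) = \frac{P(x)\,P(y \mid x)\,P(z \mid x, y)}{\sum_{x'} P(x')\,P(y \mid x')\,P(z \mid x', y)}$. The sketch below computes this reconstruction from factorized estimates and compares it with a direct estimate. The function names are hypothetical, and the signed difference is only one plausible form of a score that is zero under perfect coherence; the exact definition used in this project may differ.

```python
def bayes_reconstruction(p_x, p_y_given_x, p_z_given_xy, y, z):
    """Rebuild P(x = 1 | y, z) from factorized estimates via Bayes' theorem.

    p_x                -- estimate of P(x = 1)
    p_y_given_x[x]     -- estimate of P(y = 1 | x)
    p_z_given_xy[x][y] -- estimate of P(z = 1 | x, y)
    """
    def joint(x):
        px = p_x if x == 1 else 1 - p_x
        py = p_y_given_x[x] if y == 1 else 1 - p_y_given_x[x]
        pz = p_z_given_xy[x][y] if z == 1 else 1 - p_z_given_xy[x][y]
        return px * py * pz

    j_one, j_zero = joint(1), joint(0)
    return j_one / (j_one + j_zero)


def coherence_gap(direct_posterior, p_x, p_y_given_x, p_z_given_xy, y, z):
    """Signed gap between the direct and reconstructed posterior (assumed form)."""
    reconstructed = bayes_reconstruction(p_x, p_y_given_x, p_z_given_xy, y, z)
    return direct_posterior - reconstructed


# Illustrative numbers only: all estimates below are made up.
gap = coherence_gap(0.9, 0.5, (0.2, 0.8), ((0.5, 0.5), (0.5, 0.5)), y=1, z=1)
```

A positive gap here would mean the model's direct posterior estimate exceeds what its own marginal and conditional estimates imply.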
We additionally compare:
- Held-out vs. observable (baseline) transitions
- Causal vs. non-causal factorizations
- Base vs. instruction-fine-tuned models
- Chat-template vs. raw autoregressive prompting
Bayesian-like behavior in LLMs is not an inherent consequence of autoregressive training or model scale, but is strongly influenced by instruction fine-tuning and prompt format. Even when prediction accuracy is high, probability estimates may remain globally incoherent, suggesting that LLMs approximate probabilistic structure in a task- and prompt-dependent manner rather than implementing Bayes' rule explicitly.
- Synthetic data generation for the causal inference task
- Prompting and evaluation pipelines for multiple LLMs
- Bayesian coherence and accuracy metrics
- Statistical analysis and visualization scripts