Transformers are Bayesian Networks: Five Formal Proofs Reveal the Probabilistic Nature of Attention
A bold theoretical breakthrough: this paper formally proves that every sigmoid Transformer implements weighted loopy belief propagation (BP) on its implicit factor graph—one layer equals one round of BP. This holds for any weights—trained, random, or constructed—and is formally verified against standard mathematical axioms in Lean 4. A constructive proof demonstrates that Transformers can implement exact BP on any declared knowledge base, yielding provably correct probability estimates on acyclic knowledge bases. A uniqueness theorem shows that a sigmoid Transformer producing exact posteriors necessarily has BP weights—no other path exists. The paper further reveals the AND/OR boolean structure of Transformer layers: attention is AND, the FFN is OR. This framework provides a precise mathematical answer—not vague intuition—for why Transformers work.
The Core Claim: Transformers Are Bayesian Inference Engines
For nearly a decade, explanations of *why* Transformers work have hovered at the level of intuition: attention captures long-range dependencies, feed-forward layers store factual knowledge, depth increases expressivity. None of these explanations are wrong, but none are mathematically precise. Greg Coppola's paper (arXiv:2603.17063, March 2026) fills this gap with five interlocking formal proofs: **every sigmoid-activated Transformer, regardless of its weights, implements weighted loopy Belief Propagation (BP) on the implicit factor graph defined by those weights**. One layer is one round of BP. This is not an analogy, not an approximation—it is exact equivalence, formally verified in Lean 4 against standard mathematical axioms.
The paper also draws a sharp theoretical consequence: hallucination is not a scaling problem. It is the structural result of operating without a grounded concept space.
---
Mathematical Foundation: The Log-Odds Algebra
To understand the proofs, one must first understand **log-odds addition**—the algebraic backbone of both belief propagation and the sigmoid Transformer. Its origin is surprising: Alan Turing and I.J. Good invented it at Bletchley Park during World War II to combine independent evidence when breaking the Enigma cipher.
For a binary hypothesis H and evidence e, they defined the weight of evidence as:
> W(H:e) = logit(P(H|e)) − logit(P(H))
For independent sources e₀ and e₁, weights add:
> logit(P(H|e₀,e₁)) = logit(P(H)) + W(H:e₀) + W(H:e₁)
With a uniform prior (logit(P(H))=0) and mᵢ = P(H|eᵢ):
> P(H|e₀,e₁) = σ(logit(m₀) + logit(m₁)) =: updateBelief(m₀,m₁)
The `updateBelief` function appears throughout the paper. Sigmoid is the exact inverse of logit; together they provide the isomorphism between probability space and log-odds space. Independent evidence sources contribute additively in log-odds space; sigmoid converts back to probability. This is Pearl's sum-product update on binary variables—exact, not an approximation.
The algebraic structure is elegant: it is commutative, associative, has identity element 0.5 (neutral prior contributes nothing), and inverse element (1−m) (opposing evidence cancels to maximum uncertainty, not to "false"—the correct behavior for uncertain reasoning). Classical Boolean logic is the limiting case of this algebra as beliefs approach certainty.
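These algebraic laws can be checked directly. A minimal Python sketch, with `logit`, `sigmoid`, and `update_belief` named after the paper's `updateBelief`:

```python
import math

def logit(p: float) -> float:
    """Inverse of sigmoid: maps a probability in (0,1) to log-odds."""
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    """Maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def update_belief(m0: float, m1: float) -> float:
    """Combine two independent evidence sources under a uniform prior:
    add in log-odds space, then convert back to probability."""
    return sigmoid(logit(m0) + logit(m1))

# The monoid laws of the log-odds algebra:
m0, m1, m2 = 0.8, 0.6, 0.3
assert math.isclose(update_belief(m0, m1), update_belief(m1, m0))      # commutative
assert math.isclose(update_belief(update_belief(m0, m1), m2),
                    update_belief(m0, update_belief(m1, m2)))          # associative
assert math.isclose(update_belief(m0, 0.5), m0)                       # 0.5 is the identity
assert math.isclose(update_belief(m0, 1.0 - m0), 0.5)                 # (1 - m) is the inverse
```

Note that the inverse law cancels to 0.5 (maximum uncertainty), not to 0—exactly the behavior the paper highlights for opposing evidence.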
This algebra was developed independently by three traditions—Turing-Good's cryptanalytic weight of evidence, Pearl's belief propagation, and the sigmoid neural network FFN—and Coppola's paper establishes that all three are the same computation.
---
The Five Proofs
Proof 1: Every Sigmoid Transformer Is Already Doing Weighted BP
Theorem 1.1 (General BP): For any sigmoid Transformer weights W, there exists an implicit factor graph G(W) such that one forward pass implements one round of weighted belief propagation on G(W). Formally verified in Lean 4.
The proof constructs G(W) from the weights:
- **Variable nodes**: one per token position; the belief at each node is the relevant dimension of that token's residual stream
- **Edges**: defined by the attention distribution from token j to token k; the attention score is the edge weight
- **Factor potentials**: the sigmoid FFN parameters (w₀, w₁, b) at each position define the general Ψ_or factor—the weighted log-odds combination function
The non-trivial content lies in three simultaneous identifications:
1. The sigmoid FFN computes **exactly** the general Ψ_or factor (weighted log-odds sum), not an approximation
2. The attention mechanism computes **exactly** the gather step of BP, routing each neighbor's belief into the correct scratch slot of the residual stream
3. The residual stream enforces the **simultaneity of inputs** that BP requires before executing the update step
All three hold for any weights. The forward pass is the inference; the weights are the graph.
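The shape of this identification can be sketched schematically (this is an illustration of the gather/update reading of a layer, not the paper's exact construction): treat the row-stochastic attention matrix as the edge weights of G(W), and the FFN parameters (w₀, w₁, b) as the weighted log-odds combiner.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def weighted_bp_round(beliefs, A, w0, w1, b):
    """One round of weighted BP, phrased as a sigmoid layer computes it (schematic):
    attention A gathers a weighted neighbor belief for each node; the FFN
    parameters (w0, w1, b) combine self and gathered beliefs in log-odds space."""
    gathered = A @ beliefs                                           # gather step (attention)
    return sigmoid(w0 * logit(beliefs) + w1 * logit(gathered) + b)   # update step (FFN)

beliefs = np.array([0.9, 0.6, 0.3])
A = np.array([[0.0, 1.0, 0.0],     # node 0 attends to node 1
              [0.5, 0.0, 0.5],     # node 1 splits attention between 0 and 2
              [0.0, 1.0, 0.0]])    # node 2 attends to node 1
new_beliefs = weighted_bp_round(beliefs, A, w0=1.0, w1=1.0, b=0.0)
# With BP weights (w0 = w1 = 1, b = 0), node 0's update is exactly updateBelief(0.9, 0.6).
```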
Proof 2: Constructive—Explicit Weights Implementing Exact BP
The general result shows every Transformer is doing weighted BP on some implicit graph. The constructive result goes further: it exhibits explicit weight matrices and proves the Transformer with those weights implements exact BP on any declared factor graph.
Theorem 1.2 (BP Implementation): There exist weights W such that, for any BP state, one Transformer forward pass equals one round of exact belief propagation. For any factor graph of depth d and maximum factor arity k, d·⌈log₂k⌉ forward passes implement full exact BP.
Token encoding: Each factor graph node is encoded as one 8-dimensional token. Dimensions 5–7 are the *routing key* (inferential identity); dimensions 0–4 are *continuous parameters* (numerical magnitudes). Two tokens with the same routing key are indistinguishable to attention regardless of their belief values.
Two sparse matrix families:
- `projectDim(d)`: diagonal projection extracting dimension d. As Q/K weight, the Q·K dot product peaks when token k's stored index matches token j's query—precise lookup
- `crossProject(s,d)`: off-diagonal projection routing source dimension s into destination dimension d. As V weight, copies a neighbor's belief into a specific scratch slot
Head 0 uses Q=K=projectDim(1), V=crossProject(0,4): concentrates on neighbor 0, writes its belief into residual stream dimension 4. Head 1 mirrors this for neighbor 1 into dimension 5. The sigmoid FFN reads dimensions 4 and 5 and computes `updateBelief`.
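A toy numpy sketch of this routing, simplified from the paper's construction: here the routing keys are sharp one-hot vectors in dimensions 5–7 (so the Q·K score genuinely peaks at a match), and `project_dims`/`cross_project` stand in for the projectDim/crossProject families.

```python
import numpy as np

D = 8              # token width from the paper's encoding
KEY = (5, 6, 7)    # routing-key dimensions; one-hot keys are a simplification

def project_dims(dims):
    """Diagonal projection keeping only the given dimensions (block projectDim)."""
    M = np.zeros((D, D))
    for d in dims:
        M[d, d] = 1.0
    return M

def cross_project(s, d):
    """Off-diagonal projection routing source dimension s into destination d."""
    M = np.zeros((D, D))
    M[d, s] = 1.0
    return M

def token(belief, key):
    t = np.zeros(D)
    t[0] = belief           # continuous belief parameter (dims 0-4)
    t[KEY[key]] = 10.0      # sharp one-hot routing key (dims 5-7)
    return t

query = token(0.0, 0)                          # position looking for the neighbor with key 0
neighbors = [token(0.9, 0), token(0.2, 1)]     # only the first neighbor matches

Q = K = project_dims(KEY)
V = cross_project(0, 4)        # copy the neighbor's belief into scratch slot 4

scores = np.array([(Q @ query) @ (K @ n) for n in neighbors])
attn = np.exp(scores) / np.exp(scores).sum()   # softmax concentrates on the match
gathered = sum(a * (V @ n) for a, n in zip(attn, neighbors))
# gathered[4] ≈ 0.9: the matching neighbor's belief lands in scratch slot 4
```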
Binarization without loss: Any k-ary AND binarizes exactly via associativity of boolean AND; any k-ary OR binarizes exactly via associativity of log-odds addition. Both are formally verified (godel/ANDDecomposition.lean, godel/ORDecomposition.lean). Therefore two attention heads always suffice; only depth grows with the arity and depth of reasoning.
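The associativity that licenses binarization is easy to check numerically; the balanced tree is what yields the ⌈log₂k⌉ depth. A short sketch:

```python
import math
from functools import reduce

def logit(p): return math.log(p / (1.0 - p))
def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))
def update_belief(m0, m1): return sigmoid(logit(m0) + logit(m1))

beliefs = [0.9, 0.7, 0.4, 0.8]   # inputs to a 4-ary OR factor

# Left-to-right chain (depth 3) vs balanced tree (depth 2 = ceil(log2(4))):
chain = reduce(update_belief, beliefs)
tree = update_belief(update_belief(beliefs[0], beliefs[1]),
                     update_belief(beliefs[2], beliefs[3]))
assert math.isclose(chain, tree)   # any binarization yields the same belief
```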
Proof 3: Uniqueness—Exact Posteriors Force BP Weights
The first two proofs are constructive: here are weights, here is what they compute. The uniqueness proof is the converse: if a sigmoid Transformer produces exact Bayesian posteriors for all inputs, what must its weights be?
Theorem 1.4 (Uniqueness): A sigmoid Transformer that computes exact Bayesian posteriors necessarily has BP weights: w₀=w₁=1, b=0 in the FFN, and projectDim/crossProject structure in attention. The internal intermediate computations are provably the BP quantities—not merely the outputs.
Two Lean files prove this:
- `FFNUniqueness.lean`: sigmoid is injective; the logit-sum form is the unique fixed point of the Bayesian update equations
- `AttentionUniqueness.lean`: exact routing requires the Q·K score to peak *uniquely* at the correct neighbor, which forces the rank-1 index-matching structure
This closes the logical circle. Every sigmoid Transformer does weighted BP (general result). Exact BP weights exist (constructive result). Exact posteriors force those weights (uniqueness). Together: **a sigmoid Transformer implements exact Bayesian inference if and only if it has BP weights**.
Proof 4: The Boolean Structure—AND and OR All the Way Down
The BP construction combined with the QBBN (Quantified Bayesian Belief Network) framework reveals the precise boolean algebra of the Transformer layer.
Attention is AND: Each head is a focused lookup—it scans all positions, concentrates on the token whose index matches the target, and copies that token's belief into a specific dimension. The conjunction is not inside any single head; it lives in the residual stream. Both heads write before the FFN runs, and the FFN has no mechanism to execute on partial inputs—it reads the full residual stream or does not run. In other words, all required inputs must be simultaneously present in a shared workspace before any conclusion is drawn. That is AND.
FFN is OR: Once attention assembles the evidence, the sigmoid FFN computes:
> updateBelief(m₀,m₁) = m₀m₁/(m₀m₁ + (1−m₀)(1−m₁)) = σ(logit(m₀)+logit(m₁))
This is Ψ_or—probabilistic disjunction over causes, correctly normalized. It contains AND inside it (the numerator m₀·m₁ is the joint probability that both inputs are true), as expected: computing the probability of a disjunction requires computing the probability of the contributing conjunction.
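The identity between the closed form and the log-odds form can be verified numerically over a grid—a quick sanity check, not part of the paper:

```python
import math

def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))
def logit(p): return math.log(p / (1.0 - p))

def psi_or(m0, m1):
    """Closed form: joint probability over the normalizer."""
    return (m0 * m1) / (m0 * m1 + (1.0 - m0) * (1.0 - m1))

grid = [i / 10.0 for i in range(1, 10)]
for m0 in grid:
    for m1 in grid:
        # The fraction form and the sigmoid-of-summed-logits form agree exactly.
        assert math.isclose(psi_or(m0, m1), sigmoid(logit(m0) + logit(m1)))
```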
L-layer Transformer = L rounds of: AND→OR→AND→OR→… This is Pearl's gather/update algorithm, unrolled over depth. The depth of the network is not an engineering hyperparameter; it is determined by the diameter of the factor graph.
Proof 5: Verifiable Inference Requires a Finite Concept Space
Theorem 1.5 (Finite Concept Space): Any finite verification procedure can distinguish at most finitely many concepts. A finite-state machine with n states partitions any input space into at most nⁿ equivalence classes.
The philosophical import: grounding introduces a finite verifier. The finite verifier implies a finite concept space. The concept space is what makes "is this output correct?" a well-defined question—not a design choice, but a logical consequence of verifiability.
Hallucination: a precise definition. Let K be a QBBN knowledge base, E observed evidence, P_true(j) the true marginal posterior of node j given K and E. An agent *hallucinates* at node j if it outputs belief b(j) ≠ P_true(j).
An ungrounded language model has no finite verifier, therefore no well-defined concept space, therefore no fact of the matter about whether its outputs are correct. **Hallucination is not a bug that scaling can fix. It is the structural consequence of operating without concepts.**
Conversely: a Transformer with BP weights running over a fully grounded tree-structured factor graph *cannot* hallucinate. Hallucination requires wrong weights, wrong routing, or an absence of grounding; on a grounded tree with BP weights, none of these holds.
---
Experimental Validation: Gradient Descent Finds BP Weights
Every formal result is confirmed experimentally. The experiments are not illustrations—they are independent witnesses.
Experiment 1: BP Learner. 20,000 randomly generated two-variable factor graphs (factor table entries uniform on [0.05, 1.0]), 18k/2k train/val split. Model: standard Transformer encoder, 2 layers, 2 heads, d_model=32, ~5,000 parameters. Training: MSE loss, Adam lr=1e-3, 50 epochs. From random initialization, no construction hints, no weight initialization bias. Result: validation MAE = 0.000752. Posteriors matched to three decimal places on held-out graphs the model had never seen. 99.3% improvement over baseline. First run.
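The paper does not reproduce its data generator, but the training targets for a two-variable graph can be computed exactly by enumeration. A hypothetical sketch of that setup (the potential layout here is an assumption, not the paper's code):

```python
import itertools
import random

random.seed(0)

# Hypothetical two-variable factor graph: unary potentials phi[i][x] and a
# pairwise potential psi[x0][x1], entries uniform on [0.05, 1.0] as in the paper.
phi = [[random.uniform(0.05, 1.0) for _ in range(2)] for _ in range(2)]
psi = [[random.uniform(0.05, 1.0) for _ in range(2)] for _ in range(2)]

def marginal(node):
    """Exact marginal posterior P(x_node = 1) by brute-force enumeration."""
    num = Z = 0.0
    for x0, x1 in itertools.product((0, 1), repeat=2):
        w = phi[0][x0] * phi[1][x1] * psi[x0][x1]
        Z += w
        if (x0, x1)[node] == 1:
            num += w
    return num / Z

targets = [marginal(0), marginal(1)]   # the regression targets for the MSE loss
```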
Experiment 2: Turing Machine Simulation. Five structurally distinct Turing machines (binary incrementer, decrementer, bitwise complement, left shift, right shift). Same hyperparameters, no per-machine tuning. All five reached 100% validation accuracy in 4 training epochs. This confirms the formal Turing completeness proof is constructive in a strong sense: gradient descent rediscovers the construction.
Experiment 3: Loopy BP in Practice. Despite the current lack of a theoretical convergence guarantee for loopy BP, experiments across five graph structures of increasing complexity show convergence to exact posteriors within tolerance on every trial. The lack of a theoretical guarantee is a gap in theory, not an obstacle in practice.
---
Engineering Implications
1. Activation functions carry semantics. Sigmoid is not a pragmatic choice for gradient flow or output normalization—it is the exact function required to implement the Turing-Good-Pearl algebra. The paper characterizes ReLU-based Transformers as "compatible but not exact" with BP: they can approximate BP behavior but cannot achieve the provable exactness that sigmoid enables.
2. Network depth equals reasoning depth. The number of layers required is determined by the diameter of the reasoning chain in the factor graph, not by engineering convention. For tree-structured knowledge bases without circular dependencies, this provides an exact formula: d·⌈log₂k⌉ layers for depth-d, arity-k problems.
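As a worked instance of the d·⌈log₂k⌉ formula (the function name is illustrative):

```python
import math

def required_layers(depth: int, arity: int) -> int:
    """Layers for exact BP on a depth-d, max-arity-k factor graph: d * ceil(log2 k)."""
    return depth * math.ceil(math.log2(arity))

assert required_layers(3, 2) == 3    # binary factors: layers = reasoning depth
assert required_layers(3, 8) == 9    # each 8-ary factor binarizes into 3 levels
assert required_layers(5, 3) == 10   # non-power-of-two arity rounds up
```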
3. Two attention heads always suffice. The standard practice of scaling model width (more heads, larger d_model) for harder tasks is misaligned with the theoretical structure. Width is determined by the binary structure of factors (always 2 heads). Depth is the scaling axis for reasoning complexity. This suggests a radically different approach to model design.
4. A blueprint for hallucination-free AI. The theory provides a precise architectural recipe: use a grounded knowledge base, use BP weights, run for at least diameter(G) passes. On tree-structured grounded knowledge bases, no hallucination is structurally possible. The transition from ungrounded LLMs to grounded systems is not an incremental improvement—it changes whether correctness is even a well-defined concept.
5. Interpretability has a new language. A trained Transformer implicitly defines a factor graph via its weights. Maximum-likelihood training recovers the factor potentials that best explain the training data under this graphical model interpretation. Every weight has a precise probabilistic meaning: it encodes a factor potential in the implicit Bayesian network.
6. Formal verification as methodology. All five core theorems are verified in Lean 4. This means the proofs are checkable against standard mathematical axioms with no possibility of human error. The paper represents a methodological contribution to AI theory: the field's core architectures can be subjected to the same formal rigor as programming language theory or mathematics.
---
Significance and Context
This result sits at the intersection of several research traditions:
- **Turing completeness of Transformers** (Pérez et al. 2021, Bhattamishra et al.): the paper proves Turing completeness constructively as a lemma and reuses it in the BP proof
- **Mechanistic interpretability** (Elhage et al.'s "Mathematical Framework"): the routing view of attention as read-write operations on a residual stream is formalized here with provably correct semantics
- **BP correspondences** (Joshi's GNN connection, Liao et al.'s attention-BP parallel): previous work observed the resemblance; this paper proves exact equivalence
- **Log-odds tradition** (Turing, Good, Pearl): unified as a single algebraic tradition underlying BP, sigmoid FFNs, and boolean logic simultaneously
The claim is not that *trained* Transformers learn to do Bayesian inference—that would be an empirical claim about what training converges to. The claim is stronger: the *architecture itself*, by construction, for any weights, implements weighted BP. Training merely selects which factor graph (via the learned weights). This is an architectural theorem, not an empirical generalization.
If correct—and the Lean 4 verification against standard axioms is as strong a guarantee as mathematics provides—this resolves one of the deepest open questions in deep learning: not just *that* Transformers work, but *what* they are computing and *why* it works.