Generates a synthetic attention energy matrix J, then samples from the Boltzmann distribution using the selected backend. Watch the Gibbs chain converge in the animation — then explore the interactive analysis plots: scroll to zoom, drag to pan, hover for values.
Actual per-token softmax attention matrices from frozen LLaMA 3.2-3B captured on A100 — real tokenization, real weights, no synthetic data. Token strings from real tokenization appear on both axes so you can see exactly which tokens attend to which tokens at every layer and head.
Prompt: The capital of France is Paris. The capital of Germany is
Layer × head thermodynamic landscape measured directly from LLaMA 3.2-3B frozen weights. Select a prompt type to see how different linguistic contexts produce different attention heat profiles across all 28 transformer layers.
What Is Thermobridge?
Standard transformer models compute attention using a function called softmax — a formula that converts raw energy scores into a probability distribution over tokens. Thermobridge replaces that computation with Boltzmann sampling: instead of computing the exact probability of each attention pattern, it draws samples from the same probability distribution using controlled randomness.
The key insight: these are mathematically the same thing. The softmax formula and the Boltzmann distribution from physics share the same equation. Thermobridge is not an approximation of softmax — it is an alternative implementation of the same underlying math, one that runs naturally on probabilistic computing hardware.
No fine-tuning. No architectural changes. No retraining. Every frozen model checkpoint in existence is already compatible.
| Item | Summary |
|---|---|
| Core claim | Softmax attention ≡ Boltzmann distribution — mathematically exact |
| Bridge method | Draw K samples from the Boltzmann distribution; average converges to softmax |
| Error bound | KL(softmax ‖ bridge) → 0 at rate 1/K |
| Requires retraining? | No — works on any frozen transformer weights |
| Hardware target | Extropic THRML / thermodynamic sampling units (TSUs) |
| Patent | USPTO Provisional 64/019,999 |
What You're Looking At
The Gibbs Chain Animation
The canvas shows a live Boltzmann sampler running in your browser. Think of it as watching the thermodynamic process happening in real time.
The grid is an S × S attention window — the same size as the energy matrix J you configured. Each row is one query token asking "which key tokens should I attend to?" Each column is a candidate key token.
- 🟡 Gold dot — where the sampler landed on this sweep. This is the token being attended to right now for that query position.
- Amber glow — accumulated probability across all sweeps. Brighter = attended to more often across the full run.
- Dark dots — positions with little accumulated attention weight.
As the sweep count K increases, the amber glow pattern converges toward the softmax distribution. You can see convergence happening — random at first, then stabilizing into a coherent pattern that matches the Softmax panel in the analysis plots.
The Analysis Plots
Six interactive panels. Scroll to zoom, drag to pan, hover any cell or bar for the exact value.
| Panel | What it shows | What to look for |
|---|---|---|
| Energy matrix J | Raw attention scores Q·Kᵀ/√d. Brighter = stronger affinity between tokens. | The "landscape" the sampler is climbing. |
| Softmax p(i→j) | The exact Boltzmann distribution. Mathematical ground truth. | This is what thermobridge is converging toward. |
| Bridge [backend, K] | The sampled approximation after K draws. | Should visually match Softmax as K increases. |
| Absolute error | \|bridge − softmax\| per cell. | Should be near zero everywhere at high K. |
| Row sums | Each row of the bridge output should sum to exactly 1.0. | Gold = valid. Red = outside 5% tolerance (try higher K). |
| KL per row | Kullback-Leibler divergence between softmax and bridge, per query. | Gold bars = converged. Red = not yet (increase K). |
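The last three panels can be reproduced offline from any pair of matrices. The following is a minimal sketch, not the demo's actual code: the function name `diagnostics`, the dictionary keys, and the `eps` floor that keeps the logarithm finite when a bridge cell is zero are all assumptions made for illustration. The 5% row-sum tolerance is the one quoted in the table.

```python
import math

def diagnostics(softmax_P, bridge_P, tol=0.05, eps=1e-12):
    """Per-cell error, row-sum validity, and per-row KL(softmax || bridge).

    softmax_P and bridge_P are lists of rows (lists of floats) of equal shape.
    eps is a hypothetical floor for empty bridge cells, not a documented value.
    """
    abs_error = [[abs(b - s) for s, b in zip(s_row, b_row)]
                 for s_row, b_row in zip(softmax_P, bridge_P)]
    row_sums = [sum(b_row) for b_row in bridge_P]
    # A row is "valid" when its sum is within tol of 1.0
    row_ok = [abs(rs - 1.0) <= tol for rs in row_sums]
    kl_per_row = [
        sum(s * math.log(s / max(b, eps)) for s, b in zip(s_row, b_row) if s > 0)
        for s_row, b_row in zip(softmax_P, bridge_P)
    ]
    return {"abs_error": abs_error, "row_sums": row_sums,
            "row_ok": row_ok, "kl_per_row": kl_per_row}

# Made-up 2 x 2 example: the second row matches exactly, the first does not
report = diagnostics([[0.5, 0.5], [0.9, 0.1]], [[0.25, 0.75], [0.9, 0.1]])
print(report["kl_per_row"])
```

A row whose KL value stays large while its row sum is valid usually means K is too small for that query, which is the situation the red bars flag.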
The Controls
| Control | What it does |
|---|---|
| Sequence length S | Size of the attention window. Larger = more complex energy landscape. |
| Samples K | Number of Gibbs sweeps. Higher K → lower error, longer run. Try 1000 to see near-perfect convergence. |
| Animation speed | Steps rendered per animation frame. Pure visual — does not change the math or results. |
| Backend | Which sampling algorithm. See The Backends section below. |
| Energy template | Shape of the synthetic J matrix — random noise, diagonal structure (local attention), or block structure (local + global). |
| Temperature / strength | Scales J values. Higher temperature = flatter distribution (more uncertainty). Lower = sharper peaks (confident attention). |
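The effect of the temperature control can be illustrated with a few lines of arithmetic. This sketch assumes the control divides every entry of J by a temperature T before the softmax; the demo may implement the scaling differently, and the example row is made up.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy in nats; zero-probability cells contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

J_row = [2.0, 1.0, 0.0, -1.0]  # one made-up row of the energy matrix
for T in (0.25, 1.0, 4.0):
    p = softmax([x / T for x in J_row])  # assumed scaling: J / T
    print(f"T={T}: max p = {max(p):.3f}, entropy = {entropy(p):.3f} nats")
```

Higher T pushes the entropy toward its maximum of log(S) for S keys, which is the flat, uncertain regime; lower T concentrates the mass on the largest energy.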
The Theory
Step 1 — Softmax Attention
In a transformer, every token generates a query vector Q ("what am I looking for?") and a key vector K ("what do I contain?"). The attention energy between token i and token j is their dot product, divided by the square root of the key dimension d_k:
J(i, j) = Qᵢ · Kⱼ / √d_k
Softmax converts these raw energy scores into a probability distribution — how much attention token i should pay to token j:
p(i → j) = exp( J(i,j) ) / Σₖ exp( J(i,k) )
The model then takes a weighted sum of value vectors V using these probabilities. That weighted sum is the attention output: what each token "decides to carry forward" based on what it attended to.
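The three formulas of this step fit in a short, dependency-free sketch. It is illustrative only: the toy Q, K, V values are made up, the function name `attention` is hypothetical, and causal masking and multiple heads are left out for brevity.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def attention(Q, K, V):
    """Single-head attention on lists of vectors; returns (J, P, output)."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    # Energy matrix: J(i, j) = Q_i . K_j / sqrt(d_k)
    J = [[scale * sum(q * k for q, k in zip(q_i, k_j)) for k_j in K] for q_i in Q]
    # Row-wise softmax gives p(i -> j)
    P = [softmax(row) for row in J]
    # Output for token i is the P-weighted sum of the value vectors
    d_v = len(V[0])
    out = [[sum(P[i][j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
           for i in range(len(Q))]
    return J, P, out

# Toy example: 3 tokens, d_k = 2, d_v = 2 (all values are made up)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
J, P, out = attention(Q, K, V)
for row in P:
    print([round(x, 3) for x in row])
```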
Step 2 — The Boltzmann Connection
In statistical physics, the Boltzmann distribution describes the probability of finding a physical system in a particular energy state when it has reached thermal equilibrium:
p(state j) = exp( −Eⱼ / kT ) / Z
where Eⱼ is the energy of state j, kT is temperature × Boltzmann's constant, and Z is a normalizing constant (the "partition function") that makes the probabilities sum to 1.
These formulas are identical. Set Eⱼ = −J(i,j) and kT = 1. Softmax attention is a Boltzmann distribution over the attention energy landscape.
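Written out, the substitution is a single step. The summation index m below is a notational choice made here so that it does not clash with the Boltzmann constant k; it plays the same role as the index in the softmax denominator of Step 1.

```latex
\begin{align*}
p(\text{state } j)
  &= \frac{\exp(-E_j / kT)}{Z},
  \qquad Z = \sum_{m} \exp(-E_m / kT) \\
  % substitute E_j = -J(i,j) and kT = 1
  &= \frac{\exp\big(J(i,j)\big)}{\sum_{m} \exp\big(J(i,m)\big)}
   = p(i \to j)
\end{align*}
```

Each query row i therefore defines its own Boltzmann distribution over the key positions j, with its own partition function Z.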
This equivalence was formally proven by Kajitsuka & Sato (arXiv:2307.14023), who showed that the softmax function in transformers is exactly a Boltzmann operator — and that this operator preserves the mathematical distinctness of different input sequences.
Step 3 — Sampling vs. Computing
Standard attention computes the Boltzmann distribution exactly and uses it as weights.
Thermobridge samples from the same Boltzmann distribution and uses the empirical sample frequencies as weights instead.
By the law of large numbers, sample frequencies converge to the true probabilities as the number of samples K increases. For independent draws, the error in each frequency shrinks like 1/√K, and the KL divergence shrinks like 1/K; you can watch this in the KL per row panel. This is not a heuristic; it is a mathematical guarantee.
Think of it like estimating a coin's bias. You could calculate the theoretical probability from the coin's physical properties (softmax), or you could flip it 1000 times and count heads (thermobridge). Both give you the same answer — the sampling approach just requires enough flips.
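The coin-flip picture translates directly into code. The sketch below estimates one attention row from K independent categorical draws and compares it with the exact softmax; the function name `bridge_row` and the example energies are hypothetical, and the draws use Python's standard `random` module rather than any hardware sampler.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def bridge_row(energies, K, rng):
    """Estimate one softmax row from K independent categorical draws."""
    p = softmax(energies)
    counts = [0] * len(p)
    for j in rng.choices(range(len(p)), weights=p, k=K):
        counts[j] += 1
    # Empirical frequencies stand in for the attention weights
    return [c / K for c in counts]

rng = random.Random(0)               # fixed seed so the run is repeatable
energies = [2.0, 1.0, 0.0, -1.0]     # one made-up row of J
exact = softmax(energies)
for K in (10, 100, 10_000):
    est = bridge_row(energies, K, rng)
    worst = max(abs(a - b) for a, b in zip(est, exact))
    print(f"K={K:6d}  largest per-cell error = {worst:.4f}")
```

Because the estimate is a set of normalized counts, every row sums to 1 regardless of K; only the per-cell accuracy depends on the number of draws.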
Step 4 — Why Sampling Unlocks New Hardware
Probabilistic computing hardware — specifically thermodynamic sampling units (TSUs) like Extropic's THRML chips — implements Boltzmann sampling physically, through thermal noise at the transistor level. These devices don't compute softmax; they are Boltzmann samplers at the hardware level.
Thermobridge is the adapter that makes this work for transformers. A frozen LLaMA model running through thermobridge becomes a valid workload for TSU hardware — no simulation, no approximation, just physics implementing mathematics directly.
The Backends
All three backends sample from the same Boltzmann distribution. They differ in how they sample — which affects speed, numerical properties, and hardware mapping.
Exact — Multinomial Sampling
Computes the full softmax distribution analytically, then draws K independent categorical samples. This is the mathematical ground truth: each draw is perfectly independent and unbiased. Most accurate; most CPU-intensive. Use this to establish the reference baseline.
Gumbel-Max Trick
Adds independent Gumbel-distributed noise to each energy score J(i, j) (the attention logits), then takes the argmax over j. This provably produces a sample from the same Boltzmann distribution as softmax, without ever computing the softmax explicitly. It is computationally efficient. The argmax itself is not differentiable; the related Gumbel-softmax relaxation is the differentiable variant used in training scenarios. Visually, it should converge identically to Exact.
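A minimal sketch of the Gumbel-max draw follows. It assumes standard Gumbel noise generated as −log(−log U) with U uniform on (0, 1); the function name `gumbel_max_sample` and the example energies are hypothetical and not taken from the repository.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax, used here only as the reference."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def gumbel_max_sample(energies, rng):
    """Return argmax_j (J_j + G_j) with G_j ~ Gumbel(0, 1), one draw."""
    best_j, best_val = 0, -math.inf
    for j, e in enumerate(energies):
        u = rng.random()
        while u == 0.0:          # rng.random() can return 0.0; log(0) is undefined
            u = rng.random()
        g = -math.log(-math.log(u))
        if e + g > best_val:
            best_j, best_val = j, e + g
    return best_j

rng = random.Random(0)
energies = [2.0, 1.0, 0.0, -1.0]   # one made-up row of J
K = 20_000
counts = [0] * len(energies)
for _ in range(K):
    counts[gumbel_max_sample(energies, rng)] += 1
freqs = [c / K for c in counts]
print([round(f, 3) for f in freqs])
print([round(p, 3) for p in softmax(energies)])
```

Note that the sampler never exponentiates or normalizes the energies; the softmax call appears only to print the reference row.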
THRML / Block Gibbs
Simulates a block Gibbs Markov chain — the algorithm that THRML thermodynamic hardware executes physically through thermal noise.
Each "sweep" updates every query position by sampling from its conditional distribution, given the current state of all other positions. After a warmup period, the chain mixes to the stationary Boltzmann distribution and produces valid samples.
The animation specifically shows the THRML/Block Gibbs process: each gold dot is one Gibbs draw, and the amber accumulation shows the chain converging to stationarity. On real THRML hardware, this process happens in nanoseconds via physical thermalization rather than software simulation.
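The sweep, warmup, and accumulation loop described above can be imitated with a simple Markov chain. The sketch below is a stand-in, not the THRML update rule: it uses a Metropolis step with a uniform proposal, which has the row-wise Boltzmann distribution as its stationary distribution. The function name `metropolis_bridge`, the warmup length, and the example J are assumptions for illustration.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax, used here only as the reference."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def metropolis_bridge(J, sweeps, warmup, rng):
    """Run one chain per query row; return visit frequencies after warmup."""
    state = [rng.randrange(len(row)) for row in J]   # random initial key per query
    counts = [[0] * len(row) for row in J]
    for t in range(warmup + sweeps):
        for i, row in enumerate(J):                  # one sweep touches every query
            proposal = rng.randrange(len(row))
            delta = row[proposal] - row[state[i]]
            # Accept with probability min(1, exp(delta)); higher J is more likely
            if delta >= 0 or rng.random() < math.exp(delta):
                state[i] = proposal
            if t >= warmup:                          # accumulate only after warmup
                counts[i][state[i]] += 1
    return [[c / sweeps for c in row] for row in counts]

rng = random.Random(0)
J = [[2.0, 1.0, 0.0, -1.0], [0.0, 0.0, 3.0, 0.0]]   # made-up 2 x 4 energies
bridge = metropolis_bridge(J, sweeps=20_000, warmup=500, rng=rng)
for est, row in zip(bridge, J):
    print([round(x, 3) for x in est], [round(x, 3) for x in softmax(row)])
```

Successive states of a chain are correlated, so a chain of K sweeps generally carries less information than K independent draws; that is one reason a chain-based backend can need a larger K for the same error.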
Thermodynamic Specific Heat — Cv Observable
Kim (2026, arXiv:2602.08216) derives that each frozen attention head at inference time has a measurable thermodynamic specific heat:
Cv = Var_softmax( Q·Kᵀ / √d_k )
= E_p[logits²] − E_p[logits]²
This is the variance of the scaled attention logits under their own softmax distribution. It measures how "thermodynamically active" each attention head is — heads with diffuse, spread-out attention have high Cv; heads that sharply attend to one token have low Cv.
Physical meaning: High Cv means the attention distribution is spread across many candidate tokens. A finite-K sampler will miss parts of that distribution on each draw, leading to higher KL error. Low Cv means the distribution is peaked — even K=1 sample lands on the right token most of the time.
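The Cv formula is a plain weighted variance, so it can be computed for any row of scaled logits. This is a sketch under the definition quoted above; the function name `specific_heat` and the example rows are hypothetical, and it operates on one query row at a time rather than on captured model activations.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def specific_heat(logits):
    """Variance of the scaled logits under their own softmax distribution."""
    p = softmax(logits)
    mean = sum(p_j * x for p_j, x in zip(p, logits))
    # Centered form of E[x^2] - E[x]^2, which avoids cancellation error
    return sum(p_j * (x - mean) ** 2 for p_j, x in zip(p, logits))

# Made-up rows of scaled logits Q.K^T / sqrt(d_k)
print(specific_heat([1.0, -1.0]))             # two competing keys
print(specific_heat([8.0, 0.0, 0.0, 0.0]))    # one dominant key
print(specific_heat([0.0, 0.0, 0.0, 0.0]))    # identical logits
```

Two details follow from the definition: a row dominated by a single key gives a value near zero, and a row of identical logits gives exactly zero, because a constant has no variance even though the attention is maximally spread.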
Empirical measurements — LLaMA 3.2-3B on A100 (Tab 2 of this demo):
| Measurement | Value |
|---|---|
| Cv range across all heads, layer 18 | [0.18, 1.48] |
| Predicted range (Kim 2026 theory) | [0.1, 2.5] ✓ |
| Pearson r(Cv, KL) at layer 18 | 0.8241, p = 6.1 × 10⁻²⁵ |
| Observations | 96 (4 prompts × 24 heads) |
| Cv peak layer (all prompts) | Layers 9–11 (semantic integration) |
| Total measurements | 3,360 (5 prompts × 28 layers × 24 heads) |
The r = 0.8241 result is the key empirical bridge between Kim's thermodynamic framework and thermobridge's sampling fidelity: the same physical quantity that describes attention disorder in transformer training also predicts how many samples K you need for accurate inference. This is not an approximation — it is a direct consequence of the Boltzmann-softmax equivalence.
This is the first published measurement of Kim's Cv observable at inference time on a frozen pretrained transformer. Kim's original results were measured during training only.
Why This Matters
Every Existing Model Is Already Compatible
The transformer architecture — used in GPT, LLaMA, Mistral, Gemini, Claude, and every major language model — internally computes attention as a Boltzmann distribution. Thermobridge doesn't change this; it reveals it. Any frozen checkpoint from any model family can be run through thermobridge immediately.
A New Hardware Execution Path
The AI hardware industry is investing heavily in probabilistic and thermodynamic computing as an energy-efficient alternative to GPU clusters. These systems can't run standard floating-point transformer inference. With thermobridge, they can — because the attention mechanism is already Boltzmann sampling at its mathematical core.
Principled Stochasticity
At finite K, thermobridge introduces controlled randomness into attention — not noise, but thermal fluctuations around the correct distribution. This connects transformer inference to the thermodynamics literature on energy-efficient computation and opens questions about temperature, annealing, and inference-time scaling that don't exist in the deterministic softmax world.
Technical Reference
Patent: USPTO Provisional Application 64/019,999
Foundational theorem (Kajitsuka & Sato, 2023, arXiv:2307.14023): The contextual map of single-layer self-attention is a Boltzmann operator, and this map is injective — meaning distinct input sequences map to distinct Boltzmann distributions. The softmax-Boltzmann equivalence is exact, not approximate.
Thermodynamic specific heat (Kim, 2026, arXiv:2602.08216): Cv = Var_ρ(E)/T² where E = −Q·K attention energies, T = √d_k, ρ = softmax(QKᵀ/√d_k). Simplifies to Cv = Var_softmax(scaled logits) — directly computable from captured QK activations.
Energy function (attention logits):
J(i, j) = ( Qᵢ · Kⱼ ) / √d_k
where Q, K are the frozen query and key projection outputs at the target transformer layer.
Bridge output (empirical sample mean):
p̂(i → j) = (1/K) Σₖ 𝟙[ sample_k(i) = j ]
KL convergence:
KL( softmax ‖ bridge_K ) ≈ C / K
where C depends on how spread out the softmax distribution is (its entropy). Confirmed empirically: increase K from 10 to 2000 in Tab 1 and watch the KL bars shrink toward zero.
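The C / K trend can also be checked outside the demo. The sketch below averages KL(softmax ‖ bridge_K) over repeated runs of independent draws and prints K · KL, which should stay roughly constant if the 1/K law holds. For independent draws, a second-order expansion of the KL divergence suggests a constant near (S − 1)/2 for a row with S keys. The function names, the example energies, the trial count, and the `eps` floor for empty cells are assumptions for illustration.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def kl_softmax_vs_bridge(p, p_hat, eps=1e-12):
    """KL(p || p_hat); eps is a hypothetical floor for cells with zero counts."""
    return sum(p_j * math.log(p_j / max(q_j, eps))
               for p_j, q_j in zip(p, p_hat) if p_j > 0)

def mean_kl(energies, K, trials, rng):
    """Average KL over several independent K-sample bridge estimates."""
    p = softmax(energies)
    total = 0.0
    for _ in range(trials):
        counts = [0] * len(p)
        for j in rng.choices(range(len(p)), weights=p, k=K):
            counts[j] += 1
        total += kl_softmax_vs_bridge(p, [c / K for c in counts])
    return total / trials

rng = random.Random(0)
energies = [1.0, 0.5, 0.0, -0.5]   # one made-up row of J with S = 4 keys
results = {K: mean_kl(energies, K, trials=300, rng=rng) for K in (100, 400, 1600)}
for K, kl in results.items():
    print(f"K={K:5d}  mean KL = {kl:.5f}  K * KL = {K * kl:.2f}")
```

At very small K some cells receive no samples at all, and the result is then dominated by the floor value rather than by the 1/K law, so the comparison is most meaningful once every key has been drawn at least a few times.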
GitHub: github.com/whtetigr2/TASB