Say we have a hard-to-compute probability distribution $p$ and we're looking to approximate it with $q$. How good is that approximation? Kullback-Leibler divergence, written $KL(p\parallel q)$, measures exactly that.

* $p$ is called the reference, target, or true distribution
* $q$ is called the approximate, proposal, surrogate, or model distribution

### KL calculation

$KL(p\parallel q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \sum_x p(x) \log{p(x)} - \sum_x p(x) \log{q(x)} = H(p,q) - H(p)$

where $H(p)$ is the entropy of $p$ and $H(p,q)$ is the cross-entropy.

### KL intuition

The easiest way to intuit KL is "what's the inefficiency in encoding a stream generated by $p$ with an encoding that is optimized for $q$?", because:

* $H(p)$ is just "how many bits do you need per symbol from $p$", via (e.g. Huffman) coding optimized for $p$.
* $H(p,q)$ (cross-entropy) is "how many bits do you need per symbol from $p$", via coding optimized for $q$.

This should be easy to see from the respective definitions: when you get a token $x$ (which happens with probability $p(x)$), you "pay" $-\log q(x)$.

But there are other ways to interpret it:

* How much information is lost when $q$ is used to approximate $p$.
* The average "Bayesian surprise" you experience if you think the world follows $q$ and it follows $p$ instead.
* If an LLM outputs tokens from $q$, the excess loss (beyond the irreducible loss) it incurs when the text actually comes from $p$.

### KL is not symmetrical

Intuitively, we think of KL as a "distance", and we expect distances to be symmetrical. However, $KL(a\parallel b) \neq KL(b\parallel a)$ in the general case. The easiest way to see this is the following example.

- $a$ is a highly concentrated distribution: (0.9, 0.1, 0, 0, 0)
- $b$ is uniform: (0.2, 0.2, 0.2, 0.2, 0.2)
- **KL(a||b) ≈ 1.284**: relatively small, because $b$ puts nonzero mass everywhere $a$ does.
- **KL(b||a) = ∞**: infinite, because $b$ puts probability on values where $a$ has zero mass (violating [Cromwell's rule](https://en.wikipedia.org/wiki/Cromwell%27s_rule)!).

So $b$ approximates $a$ better than $a$ approximates $b$. If an LLM, for instance, put 0 on a token and that token actually showed up, it would get infinite loss. [[KL asymmetry and loaded dice]]. (I'll create a little playground for this.)

It can be hard to build intuition for why this happens. Mathematically: when computing $KL(p\parallel q)$, each term is weighted by $p(x)$; when computing $KL(q\parallel p)$, each term is weighted by $q(x)$. So changing a high probability (0.8→0.4) gets more weight than changing a lower probability (0.4→0.8). Intuitively:

> If you’d like an explicit example, consider the difference between me and Einstein. We both sleep and eat a lot, but if you’re watching Einstein he has a small chance p of inventing a new theory of physics. Say my distribution over sleeping, eating, and inventing new physics is {0.7, 0.3, 0} while Einstein’s distribution is {0.7, 0.2, 0.1}. If you’re expecting Einstein, but you get me, you’re not that surprised; we both spend a lot of time sleeping and eating. But if you expect me, and you get Einstein, you will be very surprised when you see him invent a new theory of physics. ([View Highlight](https://read.readwise.io/read/01jp2xkjqrhz1xnxe9qgzw96bf))

### Variational Inference

In variational inference, we're trying to build a $q$ that approximates $p$. To state the super obvious first: the best approximation is just $q = p$, which zeroes both the forward and reverse divergences. But in variational inference, $q$ is forced to belong to a constrained family of distributions.
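Before working through the example, here's a minimal, self-contained Python sketch of the computations above (the helper functions and their names are my own, not from any library). It uses natural logs, so results are in nats, and it reproduces the 1.284 / ∞ pair from the asymmetry section:

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), in nats (0 log 0 taken as 0)."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x); infinite if q(x) = 0 where p(x) > 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue          # 0 * log(anything) contributes nothing
        if qx == 0:
            return math.inf   # p has mass where q has none
        total -= px * math.log(qx)
    return total

def kl(p, q):
    """KL(p || q) = H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

a = [0.9, 0.1, 0.0, 0.0, 0.0]   # highly concentrated
b = [0.2, 0.2, 0.2, 0.2, 0.2]   # uniform

print(kl(a, b))  # ~1.284: b covers a's support, so every term stays finite
print(kl(b, a))  # inf: a puts zero mass on values b cares about
```

The same zero-handling conventions are what drive the mass-covering vs mode-seeking behaviour in the example that follows.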
Let's go with an example:

* $p$ is a distribution with $p(A)=0.9$, $p(B)=0.1$
* $q$ must be uniform across the states it covers. In other words, it's either a coin flip over $\{A, B\}$ or a full bet on one of them.

### Forward and Reverse divergences, mass-covering or mode-seeking

$KL(p\parallel q)$ is called the forward, inclusive, or "mass-covering" divergence.
$KL(q \parallel p)$ is called the reverse, exclusive, or "mode-seeking" divergence.

Why these names? It's about what happens when you try to optimize $q$ to approximate $p$. In the example above, if you optimize the reverse divergence, $q$ will set $q(A)=1$ ("seek the mode"):

- $q \sim Uni(\{A\}) \Rightarrow KL(q\parallel p) = 1×\log(1/0.9) + 0×\log(0/0.1) ≈ 0.105$ (taking $0\log 0 = 0$)
- $q \sim Uni(\{A,B\}) \Rightarrow KL(q\parallel p) = 0.5×\log(0.5/0.9) + 0.5×\log(0.5/0.1) ≈ 0.511$

This is because there's much more to be gained by assigning the probability mass to the mode and minimizing the first term in the sum than by covering the second.

But for this $q$, look at what happens to the forward divergence!

- $q \sim Uni(\{A\}) \Rightarrow KL(p\parallel q) = 0.9×\log(0.9/1) + 0.1×\log(0.1/0) = \infty$

Hence the $q$ (even in the restricted family) that optimizes the forward divergence is called "mass-covering": it has to assign *some* probability everywhere $p$ is nonzero. This is called "covering the support".

### Tractability

What we actually care about is $KL(p\parallel q)$, but for the same reasons $p$ is intractable, $KL(p\parallel q)$ will also often be intractable (it's an expectation under $p$). So traditionally we use $KL(q\parallel p)$, and very specifically, in inference we use it to approximate the posterior: $KL(q \parallel p(x \mid D))$. Even that turns out not to be fully tractable, because the **posterior** still requires the **marginal**, but with some shuffling around we can get to the [[Evidence Lower Bound (ELBO)]], which we can optimize instead; the gap between it and the log evidence is exactly this KL.

### KL Divergence in the brain

From [[Information Theory for Intelligent People]]:

> [They](https://www.sciencedirect.com/science/article/pii/S0042698908004380#fig2) did something very clever: they took a movie clip, and measured the KL divergence spatially over the screen. Formally, they divided up the screen into little patches. In each patch, they computed the probability distribution over pixel colors at time t (that’s the p distribution) and then again at time t plus one second (that’s the q distribution). Then they could compute the KL divergence between p and q in each patch, and (since the movie clip lasts more than a few seconds), actually see how it evolves over time. Amazingly, their eye-tracking devices showed that people tended to look not (say) at the bright patches, or the uncertain patches, but at the high KL-divergence patches. We tend, in other words, to look towards those parts of a video that violate our expectations (rather than, for example, those parts that just have high uncertainty—we get bored of television fuzz).

#published 2025-03-02
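For completeness, a self-contained Python check of the restricted-family example above (helper and variable names are my own; natural logs, so values are in nats):

```python
import math

def kl(p, q):
    """KL(p || q) in nats, with 0 log 0 = 0 and infinity when p > 0 where q = 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total

p = [0.9, 0.1]              # the target: p(A) = 0.9, p(B) = 0.1

# the restricted family: uniform over the states q covers
q_full_bet_A = [1.0, 0.0]   # full bet on A
q_coin_flip  = [0.5, 0.5]   # coin flip over {A, B}

# reverse KL(q || p): mode-seeking -- the full bet on A wins
print(kl(q_full_bet_A, p))  # ~0.105
print(kl(q_coin_flip, p))   # ~0.511

# forward KL(p || q): mass-covering -- the full bet on A blows up
print(kl(p, q_full_bet_A))  # inf, because q(B) = 0 while p(B) > 0
print(kl(p, q_coin_flip))   # ~0.368, finite
```

Minimizing the reverse divergence picks the full bet on A (0.105 < 0.511), while the forward divergence is infinite for exactly that choice: mode-seeking vs mass-covering in one place.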