On the cover: Cross Entropy Loss
Cross Entropy Loss is a loss function that measures the difference between a model’s predicted probability distribution and the true probability distribution. Equivalently, it measures how surprised the model is, on average, when it sees the true labels.
That just sounds like a bunch of jargon, doesn’t it? Well, let’s break it down. All of this is rooted in Information Theory. Claude Shannon, who essentially invented the entire field, asked a deceptively simple question:
“How do we measure information in a way that makes sense?”
He proposed three axioms for an information measure (H(p)):
Shannon’s Axioms
- Continuity: (H(p)) should change smoothly as probabilities change.
- Maximization: For a fixed number of outcomes, entropy is maximized when all outcomes are equally likely. (A perfectly fair die is the most chaotic die.)
- Additivity (a.k.a. the “Chain Rule”): If a decision can be broken into independent parts, total entropy is the weighted sum of entropies. [ H(X, Y) = H(X) + H(Y|X) ]
Shannon proved that the only function satisfying these axioms (up to a constant factor) is:
[ H(p_1,\dots,p_n) = -K \sum_i p_i \log p_i ]
Set (K = 1) and we get the familiar form.
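To see the formula in action, here’s a minimal NumPy sketch that computes (H(p)) with (K = 1) and checks the maximization axiom; the dice probabilities are just made-up examples.

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(p) = -sum_i p_i log p_i (0 log 0 treated as 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability outcomes: p log p -> 0 as p -> 0
    return -np.sum(p * np.log(p)) / np.log(base)

fair_die   = np.ones(6) / 6
loaded_die = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])

print(entropy(fair_die))    # ~2.585 bits = log2(6), the maximum for 6 outcomes
print(entropy(loaded_die))  # ~2.16 bits, strictly less
```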
🔥 2. Self-Information: Surprise in a Single Number
Shannon also defined the information content of a single event:
[ I(x) = -\log p(x) ]
Properties:
- When (p(x)) is small → big surprise
- When (p(x)) is large → small surprise
- When (p(x) = 1) → (I(x) = 0) (no surprise)
Example: Your friend says “gravity still works today.” Probability = 1. Surprise = 0 bits.
Your friend says “the stock market went up because Mercury is in retrograde.” Probability ≈ 0. Surprise ≈ ∞ bits.
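A quick sketch of the same idea in code; the probabilities are invented for illustration.

```python
import numpy as np

def self_information(p, base=2):
    """Surprise I(x) = -log p(x) of a single event, in bits by default."""
    return -np.log(p) / np.log(base)

print(self_information(1.0))   # 0.0 bits   -> "gravity still works today"
print(self_information(0.5))   # 1.0 bit    -> a fair coin flip
print(self_information(1e-9))  # ~29.9 bits -> a very surprising headline
```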
🔥 3. Entropy: Expected Surprise
Entropy measures the average amount of self-information:
[ H(p) = \mathbb{E}_{x \sim p}[ -\log p(x) ] = -\sum_x p(x) \log p(x) ]
Alternative continuous form:
[ H(p) = -\int p(x) \log p(x) \, dx ]
Entropy reflects how uncertain the world is.
- High entropy → chaotic, unpredictable
- Low entropy → predictable, boring (like a model that always predicts “cat”)
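Because entropy is literally an expectation, you can also estimate it by sampling: draw outcomes from (p), average their surprise, and you recover (H(p)). A minimal sketch; the distribution and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])  # fairly predictable world: mostly outcome 0

# Closed form: H(p) = -sum_x p(x) log p(x)
closed_form = -np.sum(p * np.log2(p))

# Monte Carlo: average the surprise -log p(x) over samples x ~ p
samples = rng.choice(len(p), size=100_000, p=p)
monte_carlo = np.mean(-np.log2(p[samples]))

print(closed_form)   # ~1.157 bits
print(monte_carlo)   # close to 1.157, up to sampling noise
```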
🥊 4. Cross Entropy: When Your Beliefs Are Wrong
Cross entropy is:
[ H(p, q) = -\sum_x p(x) \log q(x) ]
Compare with entropy:
[ H(p) = -\sum_x p(x) \log p(x) ]
Interpretation:
“How many bits do I need to encode messages drawn from (p) if I assume the world is (q)?”
This is exactly what we do in machine learning:
- (p) = true labels
- (q) = model’s predicted distribution
For one-hot labels, if true class is (y):
[ H(p,q) = - \log q(y) ]
This is the famous per-sample classification loss.
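A minimal sketch of that collapse: with a one-hot (p), the full sum (H(p,q)) reduces to the single term (-\log q(y)). The predicted probabilities below are made up.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x); eps guards against log(0)."""
    return -np.sum(p * np.log(q + eps))

q = np.array([0.1, 0.7, 0.2])  # model's predicted distribution over 3 classes
p = np.array([0.0, 1.0, 0.0])  # one-hot truth: class 1

print(cross_entropy(p, q))     # ~0.357
print(-np.log(q[1]))           # identical: the sum collapses to -log q(y)
```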
💣 5. Negative Log-Likelihood (NLL): The ML Version
Likelihood for a sample (x):
[ \mathcal{L} = q(x) ]
Negative log-likelihood loss:
[ \text{NLL} = -\log q(x) ]
Averaged over the dataset:
[ \frac{1}{N} \sum_{i=1}^N -\log q(x_i) ]
This is exactly the empirical cross entropy:
[ H_{\text{empirical}}(p, q) ]
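In practice a classifier outputs logits, so the loss is computed with a numerically stable log-softmax and then averaged over the batch. Here’s a minimal NumPy sketch of that computation (the logits and labels are invented); library losses like PyTorch’s `cross_entropy` do essentially the same thing under the hood.

```python
import numpy as np

def nll_from_logits(logits, labels):
    """Average negative log-likelihood (= empirical cross entropy) over a batch."""
    # Numerically stable log-softmax: subtract the row-wise max before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out log q(y_i) for each sample's true label y_i, negate, and average.
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2,  3.0]])
labels = np.array([0, 2])
print(nll_from_logits(logits, labels))  # mean of the per-sample -log q(y_i)
```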
Why log? A few reasons (the first two are demonstrated in the sketch after this list):
- Turns products into sums
- Penalizes confident wrong predictions heavily
- Leads to convex/log-concave objectives (nice gradients)
- Sine/cosine would oscillate and destroy the loss landscape (and your model)
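Two of those points are easy to verify numerically: a long product of likelihoods underflows to zero while the sum of logs stays finite, and (-\log q) explodes as a wrong prediction gets more confident. A quick sketch with arbitrary numbers:

```python
import numpy as np

likelihoods = np.full(1000, 0.1)      # 1000 samples, each with likelihood 0.1
print(np.prod(likelihoods))           # 0.0 -- the product underflows
print(-np.sum(np.log(likelihoods)))   # ~2302.6 -- the sum of logs is fine

# Confidently wrong predictions are punished hard:
for q_true in [0.5, 0.1, 0.01, 1e-6]:
    print(q_true, -np.log(q_true))    # 0.69, 2.30, 4.61, 13.8
```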
🧩 6. KL Divergence: Distance Between Belief Systems
Defined as:
[ D_{\text{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} ]
Equivalent form:
[ D_{\text{KL}}(p \,\|\, q) = H(p, q) - H(p) ]
Interpretation:
Extra surprise caused by using the wrong model (q)
Properties:
- (D_{\text{KL}}(p \,\|\, q) \ge 0)
- (D_{\text{KL}}(p \,\|\, q) = 0) only when (p = q)
- Not symmetric: [ D_{\text{KL}}(p \,\|\, q) \neq D_{\text{KL}}(q \,\|\, p) ]
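A minimal sketch that checks both the identity (D_{\text{KL}}(p \,\|\, q) = H(p,q) - H(p)) and the asymmetry; the two distributions are made up.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x)/q(x)), assuming p, q > 0 everywhere."""
    return np.sum(p * np.log(p / q))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

cross_entropy = -np.sum(p * np.log(q))
entropy       = -np.sum(p * np.log(p))

print(kl(p, q))                 # ~0.40
print(cross_entropy - entropy)  # same number: H(p, q) - H(p)
print(kl(q, p))                 # ~0.37 -- a different value, KL is not symmetric
```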
KL divergence is everywhere:
- Variational inference
- Regularization in VAEs
- Policy updates in RL (Trust Region methods)
- Comparing distributions
- Aligning model predictions with truth
🎯 7. Importance Sampling: When Rare Events Refuse to Show Up
Suppose you want to estimate:
[ \mu = \mathbb{E}_{x \sim p}[f(x)] ]
But sampling from (p) rarely hits the events that matter (e.g., finance tail losses).
Start by writing the expectation as an integral:
[ \mu = \int f(x) p(x) \, dx ]
Multiply & divide by (q(x)):
[ \mu = \int f(x) \frac{p(x)}{q(x)} q(x) \, dx ]
Now sample from (q):
[ \hat{\mu} = \frac{1}{N} \sum_{i=1}^N f(x_i) \, w(x_i) ]
where [ w(x) = \frac{p(x)}{q(x)} ] is the importance weight.
Interpretation:
You sample from a convenient distribution (q) and then “fix” the bias using the ratio (p/q).
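Here’s a minimal sketch of the trick on a toy tail event: estimating (P(X > 4)) for a standard normal by sampling from a normal shifted into the tail. The shift of 4 and the sample size are arbitrary illustrative choices, and the finance example below follows the same recipe.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 10_000
true_prob = norm.sf(4)                 # P(X > 4) ~ 3.17e-5

# Naive Monte Carlo: sample from p = N(0, 1). Almost nothing lands past 4.
x = rng.standard_normal(N)
naive = np.mean(x > 4)                 # often exactly 0.0 at this sample size

# Importance sampling: sample from q = N(4, 1), which hits the tail half the time,
# then reweight each sample by w(x) = p(x) / q(x).
y = rng.normal(loc=4.0, scale=1.0, size=N)
w = norm.pdf(y) / norm.pdf(y, loc=4.0)
importance = np.mean((y > 4) * w)

print(true_prob, naive, importance)    # the IS estimate lands near 3.17e-5
```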
Example: Rare Events in Finance
Want the probability of a 5-sigma loss:
- Sampling from the real distribution won’t give you enough 5-sigma events
- You choose a biased distribution that produces more extreme returns
- Apply importance weights to correct the bias
- Estimator now converges with far fewer samples
Importance sampling is basically:
“Cheat smartly, and then correct the cheat mathematically.”
🔗 8. Deep Connections
All these concepts are unified:
- Self-information: surprise of one event
- Entropy: expected surprise under true distribution
- Cross entropy: expected surprise under model distribution
- NLL: empirical cross entropy
- KL divergence: the price you pay for being wrong
- Importance sampling: fixing the price by reweighting surprise
The common thread? They all revolve around:
[ -\log(\text{probability}) ]
This quantity is the currency of information.