June 5, 2026 13 min read

From Probability Distributions to Loss Functions

Learn how maximum likelihood connects distributions to loss functions, why raw likelihood underflows, and how negative log-likelihood makes training possible.

Stories by Sagar Kharel

The Tryout

A player walks into basketball tryouts.

The coach wants to answer one question:

Should this player make the team?

One free throw will not answer that.

A single shot can be lucky.

A single miss can be nerves.

So the coach asks for the same shot again and again.

Same player.
Same hoop.
Same spot on the court.

Still, every shot is a little different.

One drifts left.
One lands almost centered.
One drifts right.


The Pattern Behind the Shots

During the tryout, the coach is collecting numbers.

Not just whether the ball went in.

Where did it land?

A scoring shot is marked as 0.

A miss to the left becomes negative.

A miss to the right becomes positive.

That gives us a list of errors:

$$\epsilon = \text{landing position} - \text{scoring position}$$

At the end of the tryout, the coach plots those errors.

Why?

To evaluate the player from the full pattern, not from one lucky make or one nervous miss.

Most errors gather near 0.

A few errors drift farther away.

Wild misses are rare.

The notebook no longer looks like random noise.

It has a shape.

And that shape is what the coach wants to describe.


Which Curve Explains the Tryout?

The coach has plotted the notebook.

The shots are not random chaos.

They have a shape.

Now the coach wants to describe that shape with a curve.

For this player, a bell-shaped curve is a good starting guess.

In math, we write that as:

$$\epsilon \sim \mathcal{N}(\mu, \sigma^2)$$

Read it as:

The shot errors follow a normal distribution.

But there is not just one possible bell curve.

Each curve is one guess.

A guess for $\mu$.
A guess for $\sigma^2$.

One guess can lean left.
One guess can be too narrow.
One guess can be too wide.
One guess can fit the shot pattern better than the rest.

Same tryout data. Different candidate curves.

So the question is not:

Can we draw a curve?

The question is:

Which curve should win?

That is where maximum likelihood enters.


Maximum Likelihood: Scoring the Curve

The plot shows several possible curves.

So the coach picks one curve by choosing two settings:

  • $\mu$: where the shots usually land
  • $\sigma^2$: how spread out the misses are

That gives him one candidate curve.

Now he can score each shot using the Normal curve:

$$p\left(\epsilon^{(i)} \mid \mu, \sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left(\epsilon^{(i)} - \mu\right)^2}{2\sigma^2} \right)$$

Read it as three pieces:

  • $p(\cdot)$: the score from this curve.
  • $\epsilon^{(i)}$: one shot error from the notebook.
  • $\mu, \sigma^2$: the curve settings being tested.

In plain English: once the coach picks a curve with $\mu$ and $\sigma^2$, this asks what score one shot error gets.

Scoring a shot

For the coach, scoring one shot means asking:

Was that miss normal for this kind of player?

A shot that lands close to the hoop fits a steady-player curve.
A shot that lands far from the hoop makes that curve harder to trust.

One shot does not decide the roster spot.

But it does help judge the curve.

So the coach repeats this for every shot in the notebook.
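Scoring one shot can be sketched in a few lines of Python. This is a minimal illustration, not the article's own code: the function name `normal_pdf`, the candidate settings, and the two example errors are all made up for the demo.

```python
import math

def normal_pdf(eps, mu, sigma2):
    """Score one shot error eps under the Normal curve N(mu, sigma2)."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma2)
    exponent = -((eps - mu) ** 2) / (2.0 * sigma2)
    return coeff * math.exp(exponent)

# Hypothetical candidate curve: centered on the hoop, variance 1.
mu, sigma2 = 0.0, 1.0

near = normal_pdf(0.1, mu, sigma2)  # a shot that lands close to the hoop
far = normal_pdf(3.0, mu, sigma2)   # a wild miss

print(near, far)  # the near shot gets the higher score
```

The near shot scores higher than the wild miss, which is the code version of "that miss was normal for this kind of player."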

Scoring the Whole Tryout

One shot is only one clue.

A steady player can have one bad miss.
A risky player can have one lucky shot.

So the coach does not judge the player from one shot.

He asks a bigger question:

Does the whole tryout look like the kind of player this curve is describing?

We know how to score one shot.

To score the whole tryout, we do the same thing for every shot, then combine the scores:

$$p(\text{all shots}) = p(\text{shot}_1) \times \cdots \times p(\text{shot}_n)$$

$$L(\mu, \sigma^2) = p(\epsilon^{(1)} \mid \mu, \sigma^2) \times \cdots \times p(\epsilon^{(n)} \mid \mu, \sigma^2)$$

This whole-tryout score is called the likelihood.

Higher likelihood means the observed tryout fits the curve better.

Now each curve can be compared by the same score.

So the coach changes the settings and scores again.

Different $\mu$.
Different $\sigma^2$.

Then another.
And another.

Each curve tells a different story about the same player.

The curve with the highest likelihood wins.

That is maximum likelihood.

It means:

Choose the curve under which the whole tryout looks least surprising.

Same tryout data. Different candidate curves.

The numbers above each curve are its settings.

Different $\mu$.
Different $\sigma$.

Same tryout data.

The picture labels $\sigma$ because it is easier to read as the curve's width.
The formulas use $\sigma^2$, the variance.

That is the curve maximum likelihood would choose.
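Here is a small sketch of the coach's comparison in Python. The notebook of errors and the four candidate curves are invented for illustration, and the helper names (`normal_pdf`, `likelihood`) are assumptions rather than anything from a library. The last lines use a standard fact about the Normal curve: its maximum likelihood settings are the sample mean and the divide-by-n sample variance.

```python
import math

def normal_pdf(eps, mu, sigma2):
    """Score one shot error under N(mu, sigma2)."""
    return math.exp(-((eps - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def likelihood(errors, mu, sigma2):
    """Whole-tryout score: the product of the per-shot scores."""
    return math.prod(normal_pdf(e, mu, sigma2) for e in errors)

# Made-up notebook of shot errors, for illustration only.
errors = [-0.4, 0.1, 0.3, -0.2, 0.0, 0.5, -0.3, 0.2]

# Hypothetical candidate curves as (mu, sigma2) pairs.
candidates = {
    "shifted left": (-1.0, 0.09),
    "too narrow": (0.0, 0.01),
    "too wide": (0.0, 4.0),
    "close fit": (0.0, 0.09),
}

scores = {name: likelihood(errors, mu, s2) for name, (mu, s2) in candidates.items()}
best = max(scores, key=scores.get)
print(best)  # close fit

# Closed-form maximum likelihood settings for a Normal curve.
n = len(errors)
mu_hat = sum(errors) / n
sigma2_hat = sum((e - mu_hat) ** 2 for e in errors) / n
print(mu_hat, sigma2_hat)
```

Among the four guesses, the close-fit curve wins. The closed-form settings score at least as high as any hand-picked candidate, because they are the peak of the likelihood itself.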


The Quiet Assumption: IID

Before we multiply all those probabilities together, we are quietly making one important promise:

Each shot is a fresh repeat of the same situation.

Same player.
Same hoop.
Same distance.

But each shot still lands a little differently.

That idea shows up everywhere.

  • Weight readings: same person, same scale — but each reading wiggles a little because of posture, floor position, or timing.
  • Pancake batter: same batter, same pan, same ladle — but each pancake spreads a little differently.
  • Thermostat readings: same room, same thermostat — but the number flickers a little as air moves, sunlight shifts, or the heater kicks on.

Each attempt has a similar setup.
But the result is a little different.

That is the gut feeling behind IID.

IID stands for:

Independent and identically distributed.

That sounds formal, but the idea is simple.

Independent means one attempt does not control the next one.

A miss does not magically shove the next shot left.
A weird scale reading does not force the next reading to be weird.
A lopsided pancake does not curse the next pancake.

Each attempt gets its own little randomness.

Identically distributed means every attempt comes from the same setup.

We are not mixing free throws, half-court shots, and blindfolded shots.
We are not mixing bathroom scales from five different houses.
We are not mixing pancakes made with different pans and different batter.

Same world.
Same rulebook.
Fresh randomness each time.

That is why multiplying the probabilities makes sense.

We are treating the whole tryout as one repeated experiment, not a pile of unrelated accidents.

Once that promise is in place, the multiplication has a home.

Each shot contributes one probability.

The full tryout becomes:

$$p(\text{all shots}) = p(\text{shot}_1) \times p(\text{shot}_2) \times p(\text{shot}_3) \times \cdots \times p(\text{shot}_n)$$

The hidden assumption is not the multiplication itself.

The hidden assumption is this:

Same setup, fresh wiggle.

That is IID.

And now the next problem appears.

The math asks for a product.

The computer struggles because that product becomes tiny.


The Product Problem: Underflow

On a whiteboard, maximum likelihood is flawless.

It is a principled way to estimate a pattern.

But the likelihood score has a problem.

It is built by multiplying many probability scores together.

And probability scores are usually numbers between 0 and 1.

What happens when you multiply numbers smaller than 1?

The product keeps shrinking.

Tiny enough that the computer may round the likelihood down to $0$.

That is called underflow.
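Underflow is easy to reproduce. The sketch below assumes standard 64-bit Python floats and a made-up score of 0.1 for every shot.

```python
import math

# Hypothetical per-shot scores: every shot gets a score of 0.1.
few = math.prod([0.1] * 300)    # about 1e-300, still representable
many = math.prod([0.1] * 1000)  # true value is 1e-1000, too small for float64

print(few)   # tiny but nonzero
print(many)  # 0.0
```

Once the product hits 0.0, every curve that underflows looks equally bad, even if one of them fits far better than another.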

This is not just a basketball problem.

It shows up directly in deep learning.

A classifier gives a probability to the correct label for each example.

A language model gives a probability to the next correct token.

Training asks:

How likely was the whole batch under the model?

That means multiplying many small probability scores together.

Example after example.

Token after token.

So the product problem is not hypothetical.

It is sitting inside the training loop.

The goal is still right.

Pick the model that makes the data most likely.

But the product form is fragile.

So we rewrite the score without changing the goal.

That is the log trick.


The Log Trick

We still want the same thing:

Pick the curve with the highest likelihood.

But the raw likelihood is fragile because it multiplies many tiny scores.

So we take the log of the likelihood.

Why is that allowed?

Monotonic Nature

Because log is monotonic.

Monotonic means the order does not change.

If one number is bigger than another before the log, it is still bigger after the log.

For example:

$$100 > 10$$

Take the log of both sides:

$$\log(100) > \log(10)$$

Using base 10, that becomes:

$$2 > 1$$

It changes the scale, not the winner.

Fixing the Underflow

The second reason is the real rescue:

Log turns multiplication into addition.

$$\log(a \cdot b \cdot c) = \log a + \log b + \log c$$

For the computer, this changes the job.

Instead of carrying one fragile product, it carries a sum.

That fixes the underflow problem.

$$\begin{aligned} 10^{-1300} &< 10^{-700} \\ \log(10^{-1300}) &< \log(10^{-700}) \\ -2993 &< -1612 \end{aligned}$$

Here the log is the natural log, so $\log(10^{-1300}) = -1300 \ln 10 \approx -2993$ and $\log(10^{-700}) \approx -1612$.

Without the log trick, both raw likelihoods can underflow to $0$.

The computer loses the difference.

With the log trick, the numbers become safer to carry, but the ordering stays the same.

Same winner. No vanishing product.
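The rescue can be seen side by side in Python. The two curves and their per-shot scores (0.5 and 0.4) are invented for the demo, and 64-bit floats are assumed.

```python
import math

n = 2000  # number of shots in a hypothetical long notebook

# Made-up setup: curve A scores every shot 0.5, curve B scores every shot 0.4.
raw_a = math.prod([0.5] * n)
raw_b = math.prod([0.4] * n)

# Same information, carried as a sum of logs instead of a product.
log_a = sum(math.log(0.5) for _ in range(n))
log_b = sum(math.log(0.4) for _ in range(n))

print(raw_a, raw_b)   # 0.0 0.0
print(log_a > log_b)  # True
```

The raw products tie at zero, so the computer cannot tell the curves apart. The log-likelihoods are ordinary negative numbers, and curve A still wins.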

Negative Log Likelihood

There is one final flip.

Log-likelihood is a mountain.

Higher is better.

The best curve sits at the peak.

But training optimizers are usually written to roll downhill.

They expect a loss.

A loss is the number training tries to make smaller.

Lower is better.

So we flip the mountain into a valley.

We multiply by $-1$:

$$-\log L(\mu, \sigma^2)$$

This is negative log-likelihood.

A believable curve gets a small loss.

A surprised curve gets a large loss.

Same best curve.

Opposite landscape.

Maximum likelihood climbs to the highest score.

Negative log-likelihood rolls down to the lowest penalty.
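A minimal sketch of the flip, using an invented notebook and invented candidate curves. The helper names `log_normal_pdf` and `nll` are assumptions. The log of the Normal score is written out directly, so no tiny number is ever multiplied.

```python
import math

def log_normal_pdf(eps, mu, sigma2):
    """Log of the Normal score for one shot error."""
    return -0.5 * math.log(2.0 * math.pi * sigma2) - (eps - mu) ** 2 / (2.0 * sigma2)

def log_likelihood(errors, mu, sigma2):
    return sum(log_normal_pdf(e, mu, sigma2) for e in errors)

def nll(errors, mu, sigma2):
    """Negative log-likelihood: the same score, flipped into a loss."""
    return -log_likelihood(errors, mu, sigma2)

# Made-up notebook and hypothetical (mu, sigma2) candidates.
errors = [-0.4, 0.1, 0.3, -0.2, 0.0, 0.5, -0.3, 0.2]
candidates = [(-1.0, 0.09), (0.0, 0.01), (0.0, 4.0), (0.0, 0.09)]

best_by_ll = max(candidates, key=lambda c: log_likelihood(errors, *c))
best_by_nll = min(candidates, key=lambda c: nll(errors, *c))

print(best_by_ll, best_by_nll)  # the same curve both times
```

Climbing the log-likelihood and rolling down the negative log-likelihood land on the same candidate.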


From Normal Likelihood to MSE

This next section is the full derivation.

You can skim the algebra and go directly to the birth of a loss function.

But this is where MSE falls out of the Normal curve.

Now we can use the full machinery.

The coach has a whole notebook of errors.

For one shot, the Normal rulebook gives one probability:

$$p(\epsilon^{(i)})$$

For the whole notebook, likelihood multiplies all those shot probabilities:

$$L = \prod_{i=1}^{n} p(\epsilon^{(i)})$$

But raw likelihood has the product problem.

So we apply the negative log:

$$-\log(L) = -\log\left( \prod_{i=1}^{n} p(\epsilon^{(i)}) \right)$$

The log turns the product into a sum:

$$-\log(L) = -\sum_{i=1}^{n} \log p(\epsilon^{(i)}) \tag{1}$$

Equation (1) is a sum over shots.

So let’s simplify one repeated term first:

$$-\log p(\epsilon^{(i)})$$

Once we know this piece, we can sum it over all $n$ shots.

For a zero-centered Normal error:

$$\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$$

the probability density is:

$$p(\epsilon^{(i)}) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right)$$

Now take the negative log:

$$-\log p(\epsilon^{(i)}) = -\log\left[ \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right) \right]$$

The log splits the product into two pieces.

It separates the front fraction from the exponential:

$$-\log p(\epsilon^{(i)}) = -\log\left( \frac{1}{\sigma\sqrt{2\pi}} \right) - \log\left[ \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right) \right]$$

Now clean up the first piece:

$$-\log\left( \frac{1}{\sigma\sqrt{2\pi}} \right) = \log(\sigma\sqrt{2\pi})$$

And clean up the second piece.

The log cancels the exponential:

$$-\log\left[ \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right) \right] = \frac{(\epsilon^{(i)})^2}{2\sigma^2}$$

So one shot’s negative log-likelihood becomes:

$$-\log p(\epsilon^{(i)}) = \underbrace{ \log(\sigma\sqrt{2\pi}) }_{\text{front term}} + \underbrace{ \frac{(\epsilon^{(i)})^2}{2\sigma^2} }_{\text{squared error piece}}$$

That is the penalty for one shot.

Now return to Equation (1):

$$-\log(L) = -\sum_{i=1}^{n} \log p(\epsilon^{(i)})$$

We just simplified the repeated term:

$$-\log p(\epsilon^{(i)}) = \log(\sigma\sqrt{2\pi}) + \frac{(\epsilon^{(i)})^2}{2\sigma^2}$$

Now plug that back into Equation (1):

$$-\log(L) = \sum_{i=1}^{n} \left[ \log(\sigma\sqrt{2\pi}) + \frac{(\epsilon^{(i)})^2}{2\sigma^2} \right]$$

Split the sum:

$$-\log(L) = \sum_{i=1}^{n} \log(\sigma\sqrt{2\pi}) + \sum_{i=1}^{n} \frac{(\epsilon^{(i)})^2}{2\sigma^2}$$

The first part is the same for every shot, so it adds up to $n$ copies:

$$\sum_{i=1}^{n} \log(\sigma\sqrt{2\pi}) = n\log(\sigma\sqrt{2\pi})$$

The second part has a fixed scale, so we can pull it outside the sum:

$$\sum_{i=1}^{n} \frac{(\epsilon^{(i)})^2}{2\sigma^2} = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (\epsilon^{(i)})^2$$

So the full negative log-likelihood is:

$$-\log(L) = \underbrace{ n\log(\sigma\sqrt{2\pi}) }_{\substack{\text{same front}\\\text{term for all shots}}} + \underbrace{ \frac{1}{2\sigma^2} }_{\substack{\text{fixed}\\\text{scale}}} \hspace{0.6em} \underbrace{ \sum_{i=1}^{n} (\epsilon^{(i)})^2 }_{\substack{\text{squared}\\\text{errors}}}$$
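If the algebra feels slippery, a quick numerical check helps. The sketch below compares the shot-by-shot sum from Equation (1) with the closed form, using a made-up notebook and an assumed $\sigma = 0.3$. The function names are hypothetical.

```python
import math

def nll_sum(errors, sigma):
    """Equation (1): add up -log p for every shot, zero-centered Normal."""
    total = 0.0
    for e in errors:
        p = math.exp(-(e ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        total += -math.log(p)
    return total

def nll_closed_form(errors, sigma):
    """Front term times n, plus the scaled sum of squared errors."""
    n = len(errors)
    front = n * math.log(sigma * math.sqrt(2 * math.pi))
    squared = sum(e ** 2 for e in errors) / (2 * sigma ** 2)
    return front + squared

# Made-up notebook of shot errors and an assumed spread.
errors = [-0.4, 0.1, 0.3, -0.2, 0.0, 0.5, -0.3, 0.2]
sigma = 0.3

a = nll_sum(errors, sigma)
b = nll_closed_form(errors, sigma)
print(a, b)  # the two values agree up to floating-point rounding
```

Both routes give the same penalty, so the simplification did not change the score.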

Now translate the basketball notebook into model language.

In the tryout story, the coach measured error like this:

$$\epsilon^{(i)} = \text{actual landing}^{(i)} - \text{target}^{(i)}$$

In model language, we use EGO:

Error = Ground truth − Output

The ground truth is:

$$y^{(i)}$$

The model output is:

$$\hat{y}^{(i)}$$

So the prediction error is:

$$\epsilon^{(i)} = y^{(i)} - \hat{y}^{(i)}$$

Same notebook idea.

Different symbols.

The error is still the gap between what happened and what the model expected.

Now substitute that into the negative log-likelihood:

$$-\log(L) = \underbrace{ n\log(\sigma\sqrt{2\pi}) }_{\substack{\text{same front}\\\text{term for all shots}}} + \underbrace{ \frac{1}{2\sigma^2} }_{\substack{\text{fixed}\\\text{scale}}} \hspace{0.6em} \underbrace{ \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 }_{\substack{\text{prediction}\\\text{errors}}}$$

Throwing Away the Dead Weight

Now look at this from the training optimizer’s point of view.

The optimizer is trying to improve the predictions, $\hat{y}$.

But it cannot change $n$.

It cannot change $\pi$.

And if we treat $\sigma$ as fixed, it cannot change $\sigma$ either.

So this part is dead weight for training:

$$n\log(\sigma\sqrt{2\pi})$$

It does not change when the prediction changes.

The scale factor also does not decide which prediction wins:

$$\frac{1}{2\sigma^2}$$

It stretches the loss, but it does not change the best prediction.

So the only part that still matters for choosing the best model is:

$$\sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$$

That is the sum of squared errors.

But the raw sum grows with dataset size.

A 1,000-shot notebook naturally creates a bigger penalty pile than a 10-shot notebook.

So we average the pile by dividing by $n$.

That gives us the Mean Squared Error:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$$

That is the full derivation.

The Normal rulebook has turned into MSE.

One small note: dropping the $\sigma$ terms is safe here because we treated $\sigma$ as fixed. If the model is also learning $\sigma$, those terms matter.
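The claim that the dropped pieces do not change the winner can be checked with a toy model. Everything here is an assumption made for the demo: the ground-truth values, the one-dial model that predicts the same constant `c` for every example, and the grid of dial settings.

```python
import math

def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def gaussian_nll(y, y_hat, sigma):
    """Full Normal negative log-likelihood with a fixed sigma."""
    n = len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return n * math.log(sigma * math.sqrt(2 * math.pi)) + sse / (2 * sigma ** 2)

# Made-up ground truth values.
y = [2.0, 2.5, 3.5, 4.0]

# Hypothetical one-dial model: predict the same constant c for every example.
grid = [i / 100 for i in range(0, 601)]

best_mse = min(grid, key=lambda c: mse(y, [c] * len(y)))
best_nll = min(grid, key=lambda c: gaussian_nll(y, [c] * len(y), sigma=1.0))
best_nll_wide = min(grid, key=lambda c: gaussian_nll(y, [c] * len(y), sigma=5.0))

print(best_mse, best_nll, best_nll_wide)  # 3.0 3.0 3.0
```

All three searches pick the same dial setting, the mean of `y`. Changing the fixed $\sigma$ changes the size of the loss, not the location of its minimum.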


The Birth of a Loss Function

We knew the tryout looked like a bell curve.
We had the Normal distribution formula.

So why couldn’t we stop there?

Because the Normal distribution gave the coach a shape, not a final answer.

It still left him with many possible curves.

We already saw them earlier:
shifted left, too narrow, too wide, best fit.

Same tryout.
Different explanations.

But the coach is not really trying to pick a curve.

He is trying to answer the question from the beginning:

Is this player reliable enough for my team?

The curve is how he estimates that reliability.
Pick the wrong curve, and the coach may misread the player.

So he needs a fair way to choose between the curves.

So how should he choose?

Each curve has to defend itself against the actual tryout.

A too-wide curve makes the player look more scattered than he is.
A shifted curve makes the player look biased to one side.
A too-narrow curve makes normal misses look alarming.

The right curve does not make the player better.
It makes the tryout easier to read.

Maximum likelihood is the rule for choosing that curve:

Pick the curve under which the actual tryout looks least surprising.

Now the coach knows how to score a guess.

But there are too many possibilities to check by hand.

Which center?
Which spread?
Which curve makes the actual tryout least surprising?

That is machine learning:

spin the dials until we find the settings that best explain the data.

And that is exactly where the math becomes fragile.

The raw likelihood score is fragile because it multiplies many tiny probabilities.

The log trick makes the score safe.

The negative makes the score trainable.

Together, the log and the negative give us negative log-likelihood.

Now the direction has changed.

Likelihood was a score we wanted to make bigger.

Negative log-likelihood is a penalty we want to make smaller.

That is the birth of a loss function.

Training can now ask a simple question:

Which settings minimize this penalty?

For the Normal distribution, that penalty simplifies into Mean Squared Error.

So MSE is not a random formula.

It is what Normal negative log-likelihood becomes when we turn “most believable curve” into “smallest loss.”

Machine learning turns a hidden pattern in the data into a loss that code can optimize.


The Coach’s Whiteboard

Notebook

The measured data. In our story, it is the pile of shot errors.
