May 23, 2026 16 min read

Regularization: Stop Your Model From Chasing Every Buzz

YieldMaster looks perfect on spring records, but perfect training charts can hide overfitting. Learn how regularization adds restraint: L1 cuts, L2 calms.

Stories by Sagar Kharel

The Mystery of the Harvest 🍯

At the end of every season, the beekeeper faced the same mystery.

Some hives overflowed with heavy, golden frames.

Others looked busy all spring but produced almost nothing.

The answer had to be hiding somewhere in the apiary.

So the beekeeper started keeping records.

Every day, he wrote down the weather. He measured hive weight. He logged humidity changes. He listened to the hive’s buzz. He counted mites when he inspected the frames.

By harvest time, the notes filled a shoebox.

So the beekeeper carried it to TechFables.

He set it in front of the engineers and asked:

Can AI predict how much honey a hive will produce?

The engineers decided to find out.

They built YieldMaster to read the apiary’s clues and forecast the honey yield.

Its job was simple:

Read the hive. Predict the harvest.

A Quick Note Before We Train

Before YieldMaster can train, the beekeeper’s shoebox has to become clean training data.

That preparation step is called data preprocessing.

If you want the full setup, start here:

Read the setup first: Data Preprocessing with HiveDoctor

That article walks through the full pipeline of cleaning data, handling missing values, and preventing leakage.

This article zooms in on regularization.

Inside the Spring Records

Before YieldMaster can train, the engineers need to understand what is inside the cleaned shoebox.

For this harvest problem, each row is one spring morning from one hive.

Each column is a clue from the apiary.

H17  -2  +1.2
H19  -1  +1.8
H20  +2.6
H22  +3.1
H23  +2.4
H25  +3.6

The clues are things the engineers can know before harvest.

The target is the thing they want YieldMaster to predict: honey yield.

The Illusion of Perfection

The beekeeper pointed to the spring records.

“Warmer mornings usually mean better harvests,” he said.

“When the bees wake earlier, they start foraging sooner.”

The junior engineer looked at the table.

The beekeeper’s hunch seemed to be there in the numbers.

As morning temperature rose, honey yield usually rose too.

“It might be the perfect clue,” he said.

So he built the first version of YieldMaster using just one input:

Morning temp → Honey yield

A few hours later, he rushed into the apiary holding a chart.

“I think I built the perfect model.”

On the training chart, YieldMaster looked brilliant.

Not just the general upward trend.

Every point.
Every dip.
Every strange little wiggle.

The line bent itself through each record as if it were stitching the spring together by hand.

But the lead architect looked closer.

“Did it learn how temperature affects honey production?”

“Or did it just memorize those six spring mornings?”

Model Flexibility: Too Stiff, Sweet Spot, Too Eager

The junior engineer built YieldMaster by choosing how much freedom the model should have.

That freedom is called model flexibility: how far YieldMaster is allowed to bend.

Low flexibility: the model can only draw a stiff, simple pattern.
Medium flexibility: the model can follow the main curve.
High flexibility: the model can twist itself through individual records.

To judge each curve, the engineer uses mean squared error, or MSE.

MSE asks: How far are YieldMaster’s predictions from the real honey-yield values?

Lower MSE means the curve sits closer to the spring records.

That sounds good. But there is a trap:

A model can drive training MSE down by memorizing the records instead of learning the pattern.

Use the slider to replay the junior engineer’s choice.

[Interactive chart: polynomial degree slider, shown at degree 1. Too stiff: it sees the rise, but misses the bend.]

1. Too Stiff: Underfitting

At low flexibility (Degree 1), YieldMaster is too stiff.

It sees the harvest rising, but misses the bend.

The model has a stubborn belief:

A simple line is good enough.

But the honey-yield pattern curves.

So the model misses important structure.

This is underfitting.

It has high bias because its belief is too simple.

It has low variance because the stiff line barely changes, even when the training records change. Swap a few spring records, and YieldMaster still draws almost the same line.

Stable? Yes. Useful? Not enough.

2. The Sweet Spot

At degree 3, YieldMaster starts to see the real harvest pattern.

The curve bends enough to follow the main shape.

It catches the rise.
It respects the dip.
But it does not chase every record.

The MSE drops to about 2.03, so the model is much closer to the training records.

But it still keeps restraint.

This is the sweet spot:

Flexible enough to learn.
Not so flexible that it memorizes.

3. Too Eager: Overfitting

At degree 5, YieldMaster looks brilliant on the training chart.

This is the moment the junior engineer gets excited.

He wanted a lower MSE.

Now he has it: Training MSE: 0.00

The curve bends through every point.

Every dip.

Every bump.

Every strange little wiggle.

The line stitches the spring records together.

It looks like the perfect model.

But that is exactly the illusion.

YieldMaster may not be learning how temperature affects honey yield.

It may only be memorizing those six spring mornings.

This is overfitting.

It has high variance because small changes in the training records can produce a very different curve.

It looks perfect on the data it already saw.

But a new morning could make it fail.

That is the trap.

The Core Idea: Generalization

The visualization showed the trap.

A flexible model can stitch itself through every training point and start treating noise like truth.

That can make the training MSE look perfect.

But the real goal is not to win on records the model already saw.

The real goal is generalization.

Generalization means the model can make good predictions on new examples, not just the training examples.

For YieldMaster, that means:

Can it predict honey yield for a new spring morning?

Not just the six mornings on the chart.

That is why the perfect-looking curve is suspicious.

It may have learned the training records too well.

It may not have learned the harvest pattern.
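One way to check generalization is to hide a record, train on the rest, and predict the hidden one. The sketch below does that for a stiff line and for a curve that passes through every record it is given. The six (temperature, yield) pairs are made up for illustration and are not claimed to match the chart; `fit_line`, `fit_interpolant`, and `held_out_mse` are hypothetical helper names. With one record hidden, the eager curve passes through the remaining five, so it is a degree-4 polynomial in the held-out rounds.

```python
# Hypothetical (morning temperature, honey yield) records, made up for illustration.
temps = [17.0, 19.0, 20.0, 22.0, 23.0, 25.0]
yields = [1.2, 1.8, 2.6, 3.1, 2.4, 3.6]


def fit_line(xs, ys):
    """Stiff model: closed-form least-squares line. Returns a predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    spread = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / spread
    return lambda x: mean_y + slope * (x - mean_x)


def fit_interpolant(xs, ys):
    """Eager model: Lagrange polynomial passing through every given record."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for k, xk in enumerate(xs):
                if k != i:
                    basis *= (x - xk) / (xi - xk)
            total += yi * basis
        return total
    return predict


def training_mse(fit, xs, ys):
    """Error on the same records the model was fitted to."""
    model = fit(xs, ys)
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def held_out_mse(fit, xs, ys):
    """Leave-one-out: hide each record in turn, fit on the rest, predict the hidden one."""
    squared_errors = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        squared_errors.append((model(xs[i]) - ys[i]) ** 2)
    return sum(squared_errors) / len(squared_errors)


for name, fit in [("line", fit_line), ("interpolant", fit_interpolant)]:
    print(f"{name:12s} train MSE = {training_mse(fit, temps, yields):.3f}  "
          f"held-out MSE = {held_out_mse(fit, temps, yields):.3f}")
```

On these made-up numbers, the eager curve wins on training error and loses on held-out error. That gap is the signature of memorizing instead of learning.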

Regularization is the restraint we add next.

It changes the model’s goal from: $\text{fit the data}$

to: $\text{fit the data} + \text{pay for complexity}$

In plain English:

Listen to the hive, but do not panic at every buzz.

The Math Behind Restraint

Degrees of Freedom

The flexible curve can twist through the six spring records because it has more knobs to turn.

In our harvest story:

$x$ is the morning temperature
$y$ is the honey yield
$h_\theta(x)$ is YieldMaster’s prediction

A simple model might be just a straight line.

Degree 1: no bends, only a tilt.

$$h_\theta(x) = \theta_0 + \theta_1x$$

Because it is so stiff, the model has very little room to adjust.

A more flexible model adds more terms.

Degree 4: more terms, more room to curve.

$$h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4$$

Now YieldMaster has five weights it can adjust:

$\theta_0$ sets the baseline
$\theta_1$ controls the line’s tilt
$\theta_2$, $\theta_3$, and $\theta_4$ add extra ways to curve

Each extra term gives YieldMaster another mathematical knob to turn.

That extra flexibility can help the model follow the real harvest pattern.

But it also creates risk.

If $\theta_2$, $\theta_3$, and $\theta_4$ grow too large, the model can start chasing tiny, meaningless wiggles in the training data.
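To see why large higher-order weights are risky, here is a small sketch that evaluates the polynomial for two hypothetical sets of degree-4 weights. They are identical except for the size of $\theta_4$. The weights, the helper names `predict` and `swing`, and the input value are all illustrative assumptions, with $x$ taken to be in rescaled units.

```python
def predict(theta, x):
    """Evaluate h_theta(x) = theta[0] + theta[1]*x + theta[2]*x**2 + ... (Horner's rule)."""
    result = 0.0
    for weight in reversed(theta):
        result = result * x + weight
    return result


def swing(theta, x, nudge=0.1):
    """How far the prediction moves when x is nudged slightly."""
    return abs(predict(theta, x + nudge) - predict(theta, x))


# Hypothetical degree-4 weights: identical except for the size of theta_4.
calm_theta = [1.0, 0.5, 0.1, 0.0, 0.01]
eager_theta = [1.0, 0.5, 0.1, 0.0, 5.0]

print(len(calm_theta), "weights for a degree-4 curve")
print("calm swing: ", round(swing(calm_theta, 1.0), 4))
print("eager swing:", round(swing(eager_theta, 1.0), 4))
```

The same small nudge in $x$ moves the eager curve far more than the calm one. A large $\theta_4$ turns tiny input changes into big prediction changes.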

That is where the scorecard matters.

How does the engineer decide whether a curve is good?

The Scorecard: MSE

First, the engineer measures training error with mean squared error, or MSE:

$$\text{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

MSE asks: How far were YieldMaster’s predictions from the real honey yields?

In plain English, MSE says:

Get as close as possible to the training records.
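As a small worked example with made-up numbers (not taken from the apiary records), suppose YieldMaster predicts 1.0, 2.0, and 3.0 for three mornings whose real yields were 1.5, 2.0, and 2.0. With $m = 3$:

$$\text{MSE} = \frac{(1.0 - 1.5)^2 + (2.0 - 2.0)^2 + (3.0 - 2.0)^2}{3} = \frac{0.25 + 0 + 1}{3} \approx 0.42$$

The miss of 1.0 costs four times as much as the miss of 0.5, because each error is squared before averaging.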

That is useful.

But it is also the junior engineer’s trap.

If MSE is the only scorecard, the degree-5 curve looks like a masterpiece.

It reaches: Training MSE: 0.00

The Architect’s Restraint

To fix the trap, the lead architect adds a second cost.

Not to punish complexity.

To make complexity earn its place.

This restraint is called regularization.

Regularization means adding a penalty when the model becomes too complex.

The model can still bend.

It can still use extra terms.

But every extra twist now has a price.

So the goal changes from:

$\text{fit the data}$

to:

$\text{fit the data} + \text{pay for complexity}$

The New Scorecard: Regularization

So how does the model pay for complexity?

It pays through a regularization penalty.

MSE asks one question:

How close did the model get to the training records?

Regularization adds a second question:

How much complexity did the model spend to get there?

MSE was the old scorecard.

The objective is the full scorecard:

$$\text{Objective} = \text{MSE} + \text{regularization penalty}$$

You may also hear this called the loss or cost.

Different names, same job here:

It is the score YieldMaster is trying to make as small as possible.

The Tug-of-War

Now the model faces a tug-of-war:

The MSE part rewards fitting the training data.
The penalty part charges the model for becoming too complex.

In the equation, that charge appears as a penalty.

In the visualization, the same idea appears as a weight budget.

If the penalty is small, YieldMaster gets a bigger budget and can chase every buzz.

If the penalty grows, the budget shrinks and extra wiggles become expensive.

Now the model is forced to ask:

Is this extra twist really worth the cost?

That is the heart of regularization.

It does not ban complexity.

It puts a price on it.
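The tug-of-war can be sketched with two hypothetical candidates: an eager curve with perfect training MSE and large weights, and a calm curve with a slightly worse MSE and small weights. All numbers here are invented for illustration, and "complexity" stands for a single total weight size. The code only asks which candidate has the lower objective as the penalty strength `lam` grows.

```python
def objective(mse, complexity, lam):
    """Full scorecard: fit error plus the price paid for complexity."""
    return mse + lam * complexity


# Hypothetical candidates: training MSE and total weight size.
eager = {"mse": 0.00, "complexity": 40.0}  # stitches every record, huge weights
calm = {"mse": 0.12, "complexity": 3.0}    # misses a little, small weights


def winner(lam):
    """Name of the candidate with the lower objective at this penalty strength."""
    eager_score = objective(eager["mse"], eager["complexity"], lam)
    calm_score = objective(calm["mse"], calm["complexity"], lam)
    return "eager" if eager_score < calm_score else "calm"


for lam in (0.0, 0.001, 0.01, 0.1):
    print(f"lambda = {lam}: the {winner(lam)} curve has the lower objective")
```

With no penalty, the eager curve wins on MSE alone. Once the price per unit of complexity is high enough, the calm curve becomes the better deal.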

Two Ways to Charge Complexity

Here, regularization always adds a penalty.

But not every penalty behaves the same way.

The two common penalties are L1 and L2.

They both charge the model for complexity.

They just charge it differently.

L1 Regularization: The Strict Editor

L1 regularization charges the model for the sum of absolute weight sizes.

It is also called Lasso when used with linear regression.

The objective becomes:

$$\text{Objective} = \text{MSE} + \lambda \sum_{j=1}^{n} |\theta_j|$$

Here:

$\theta_j$ is a model weight
$\lambda$ (lambda) controls how strong the penalty is

Small $\lambda$ means gentle restraint.

Large $\lambda$ means stronger restraint.
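A minimal sketch of that objective in code, assuming hypothetical weights and a hypothetical training MSE. Following the sum from $j = 1$, the baseline weight $\theta_0$ is not charged. The function name `l1_objective` is illustrative.

```python
def l1_objective(mse, theta, lam):
    """MSE + lam * sum(|theta_j|) for j >= 1; theta[0] (the baseline) is not charged."""
    penalty = lam * sum(abs(weight) for weight in theta[1:])
    return mse + penalty


theta = [2.0, 1.5, -0.5, 0.25, 0.0]  # hypothetical theta_0 .. theta_4
training_mse = 0.12                  # hypothetical fit error

for lam in (0.0, 0.1, 1.0):
    print(f"lambda = {lam}: objective = {l1_objective(training_mse, theta, lam):.3f}")
```

The negative weight is charged by its size, not its sign, and the weight that is already zero costs nothing.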

L1 is different because it can push some weights all the way to exactly zero.

When a weight becomes zero, that part of the model stops mattering.

Earlier, YieldMaster’s degree-4 model looked like this:

$$h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4$$

Each term has a weight.

If L1 pushes $\theta_4$ to zero, this part disappears:

$$\theta_4x^4 = 0 \cdot x^4 = 0$$

So the model effectively becomes:

$$h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3$$

The fourth-degree term is still written in the original design.

But it no longer contributes anything.

L1 has cut one knob from the model.

That is why L1 can act like feature selection or term selection.

It does not just lower the volume.

It can cut the term.

L1 is a strict editor: if a clue or curve term does not earn its place, cut it.
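Why exactly zero, and not just small? A one-weight sketch shows the mechanism. Assume a simplified objective for a single weight, $(\theta - z)^2 + \lambda|\theta|$, where $z$ is the value the data alone would choose. This simplification is an assumption for illustration, not YieldMaster's full objective. Its minimizer is known as soft-thresholding: move $z$ toward zero by $\lambda / 2$, and stop at zero if you get there.

```python
def l1_best_weight(z, lam):
    """Minimizer of (theta - z)**2 + lam * abs(theta), known as soft-thresholding."""
    cut = lam / 2.0
    if z > cut:
        return z - cut
    if z < -cut:
        return z + cut
    return 0.0  # the penalty outweighs anything this weight could gain


unpenalized = [2.0, 0.3, -0.1]  # hypothetical weights the data alone would choose
print([l1_best_weight(z, lam=1.0) for z in unpenalized])
```

The strong weight survives at a smaller size. The two weak weights fall inside the cut and land on exactly zero.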

Why Is It Called LASSO?

The name LASSO is an acronym:

Least Absolute Shrinkage and Selection Operator

The method was introduced by statistician Robert Tibshirani in 1996.

The name does two jobs at once.

First, it describes the math:

Least Absolute: the penalty uses absolute weight sizes, like $\sum |\theta_j|$
Shrinkage: the penalty pushes weights closer to zero
Selection: weak weights can become exactly zero
Operator: the mathematical rule doing the work

That word selection is the key.

L1 does not only shrink weak parts of the model.

It can remove them.

Second, the name gives us the right picture.

Like a lasso tightening around a noisy herd of clues, L1 can keep the useful parts and leave weak parts out of the equation.

For YieldMaster, that might mean keeping the main temperature pattern while cutting a noisy term like $\theta_4x^4$.

Try It: Add L1 Penalty

This is the same degree-5 polynomial from earlier.

Here, $\lambda$ is the penalty strength.

When $\lambda = 0$, there is no restraint.

The model can still reach almost zero training error because it bends through the studied records too perfectly.

That looks impressive. But we do not fully trust it.

So now we add an L1 penalty.

This is the LASSO idea in motion: tighten the budget until weak terms drop out.

Use the slider to increase $\lambda$.

As $\lambda$ grows, every extra weight becomes more expensive.

Watch three things:

the weights shrink
weak weights can disappear
the curve stops chasing every wiggle

In the widget, watch weak weights like $\theta_4$ or $\theta_5$ drop toward zero.

When that happens, YieldMaster is not saying:

This term is quieter.

It is saying:

This term is out.
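For readers without the widget, here is a rough sketch of the same experiment using coordinate descent, one common way to minimize an L1 objective. Everything specific is an assumption: the six records are made up, the temperature is rescaled to [-1, 1] before taking powers, the baseline is left unpenalized, and 200 sweeps is an arbitrary stopping point rather than a convergence guarantee.

```python
# Hypothetical (morning temperature, honey yield) records, made up for illustration.
temps = [17.0, 19.0, 20.0, 22.0, 23.0, 25.0]
yields = [1.2, 1.8, 2.6, 3.1, 2.4, 3.6]
DEGREE = 5


def features(temp):
    """Powers u, u**2, ..., u**5 of the temperature rescaled to [-1, 1]."""
    u = (temp - 21.0) / 4.0
    return [u ** power for power in range(1, DEGREE + 1)]


def soft_threshold(value, cut):
    """Move value toward zero by cut, stopping at exactly zero."""
    if value > cut:
        return value - cut
    if value < -cut:
        return value + cut
    return 0.0


def fit_lasso(xs, ys, lam, sweeps=200):
    """Coordinate descent on MSE + lam * sum(|theta_j|); the baseline is unpenalized."""
    rows = [features(x) for x in xs]
    m = len(ys)
    baseline, weights = sum(ys) / m, [0.0] * DEGREE
    for _ in range(sweeps):
        fitted = [sum(w * f for w, f in zip(weights, row)) for row in rows]
        baseline = sum(y - f for y, f in zip(ys, fitted)) / m
        for j in range(DEGREE):
            rho = 0.0   # how strongly term j lines up with what the other terms miss
            size = 0.0  # sum of squares of term j
            for row, y in zip(rows, ys):
                others = baseline + sum(
                    w * f for k, (w, f) in enumerate(zip(weights, row)) if k != j
                )
                rho += row[j] * (y - others)
                size += row[j] ** 2
            weights[j] = soft_threshold(rho, lam * m / 2.0) / size
    return baseline, weights


for lam in (0.01, 0.1, 2.0):
    _, weights = fit_lasso(temps, yields, lam)
    active = sum(1 for w in weights if w != 0.0)
    print(f"lambda = {lam}: {active} active terms, weights = {[round(w, 3) for w in weights]}")
```

At the strongest setting shown, the penalty outweighs every term's pull on this data, so all five curve weights sit at exactly zero and only the baseline remains.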

[Interactive chart: L1 penalty (λ) slider, starting at λ = 0.0, where degree 5 can still stitch every studied record. Readout: Objective = MSE + λΣ|θ|.]

Same degree. Fewer active terms. Less noise.

L1 Geometry: The Diamond

Imagine a simple x-y plane.

The x-axis is one weight: $\theta_1$. The y-axis is another weight: $\theta_2$.

For L1, the allowed weights must satisfy:

$$|\theta_1| + |\theta_2| \le \text{budget}$$

If the budget is 1, the boundary touches four points:

( 1,  0)
( 0,  1)
(-1,  0)
( 0, -1)

Connect those points, and you get a diamond.

That diamond is the L1 budget boundary.

This is the budget view of the same penalty we saw in the objective.

Penalty form says:

Large weights are expensive.

Budget form says:

Weights must stay inside this allowed region.

Stronger penalty. Smaller diamond.

[Interactive chart: L1 budget slider, shown at budget 1.00. Shrink the budget and the best allowed point moves toward a corner.]

The important part is the sharp corners.

Each corner sits on an axis.

And sitting on an axis means one weight is exactly zero.

So when the best allowed point lands on a corner, one active term disappears.

That is why L1 can act like feature selection or term selection.

In the first widget, weak weights like $\theta_4$ or $\theta_5$ can disappear.

In the geometry widget, the same idea appears as one weight landing on an axis.

The model is not saying:

This term matters less.

It is saying:

This term is out.
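The corner effect can be sketched numerically under one simplifying assumption: the error contours are perfect circles, so the best allowed point is simply the point inside the budget that is nearest to the unpenalized weights. Real error contours are usually stretched ellipses, so treat this as an illustration of the geometry only. The function name and the sample points are hypothetical.

```python
import math


def project_onto_l1_budget(point, budget):
    """Nearest point whose absolute values sum to at most budget (the diamond)."""
    if sum(abs(v) for v in point) <= budget:
        return list(point)  # already inside the diamond
    magnitudes = sorted((abs(v) for v in point), reverse=True)
    running, tau = 0.0, 0.0
    for count, magnitude in enumerate(magnitudes, start=1):
        running += magnitude
        candidate = (running - budget) / count
        if magnitude > candidate:
            tau = candidate  # shared amount to shave off every surviving weight
    return [math.copysign(max(abs(v) - tau, 0.0), v) for v in point]


# One strong weight and one weak weight: the weak one is cut entirely.
print(project_onto_l1_budget([2.0, 0.3], budget=1.0))

# Two weights of similar strength: both survive, on an edge of the diamond.
print(project_onto_l1_budget([0.9, 0.8], budget=1.0))
```

When one weight is much weaker than the other, the nearest allowed point is a corner, and the weak weight becomes exactly zero.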

L2 Regularization: The Quiet Volume Knob

L2 regularization charges the model for the sum of squared weight sizes.

It is also called Ridge when used with linear regression.

The objective becomes:

$$\text{Objective} = \text{MSE} + \lambda \sum_{j=1}^{n} \theta_j^2$$

Here:

$\theta_j$ is a model weight
$\lambda$ (lambda) controls how strong the penalty is

Small $\lambda$ means gentle restraint.

Large $\lambda$ means stronger restraint.

L2 is different from L1.

L1 can push weak weights all the way to zero.

L2 usually keeps the weights alive, but makes large weights expensive.

Why?

Because L2 squares the weights:

2²  = 4
5²  = 25
10² = 100

A large weight does not just cost more.

It costs much more.

So if one term starts shouting too loudly, L2 pushes back.

For YieldMaster, that might mean a higher-order weight like $\theta_4$ gets smaller.

The $\theta_4x^4$ term can still contribute.

But it no longer gets to run the whole curve.

L2 says:

Keep the term, but turn the volume down.

That is why L2 often creates smoother, steadier models.

It does not cut every suspicious term.

It lowers the panic.
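A one-weight sketch shows why L2 turns the volume down without cutting. Assume a simplified objective for a single weight, $(\theta - z)^2 + \lambda\theta^2$, where $z$ is the value the data alone would choose. This is an illustrative simplification, not YieldMaster's full objective. Setting the derivative to zero gives $\theta = z / (1 + \lambda)$.

```python
def l2_best_weight(z, lam):
    """Minimizer of (theta - z)**2 + lam * theta**2: scale z down by (1 + lam)."""
    return z / (1.0 + lam)


unpenalized = [2.0, 0.3, -0.1]  # hypothetical weights the data alone would choose
print([l2_best_weight(z, lam=1.0) for z in unpenalized])
```

Every weight is divided by the same factor. Dividing a nonzero number never produces zero, so even the weakest weight stays in the model.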

Try It: Add L2 Penalty

This is the same degree-5 polynomial from earlier.

Here, $\lambda$ is the penalty strength.

When $\lambda = 0$, there is no restraint.

The model can still reach almost zero training error because it bends through the studied records too perfectly.

That looks impressive.

But we do not fully trust it.

So now we add an L2 penalty.

Use the slider to increase $\lambda$.

As $\lambda$ grows, large weights become much more expensive.

Watch three things:

the weights shrink
the curve becomes calmer
no term gets cleanly erased

That is the key difference from L1.

L1 can cut weak terms completely.

L2 usually keeps the terms, but turns their volume down.

In the widget, watch the weights shrink together.

Even at a strong penalty, the weak weights do not snap to zero the way they did with L1.
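For readers without the widget, here is a rough sketch of the same experiment. With an L2 penalty the best weights solve a linear system, so the sketch builds that system and solves it with plain Gaussian elimination. The six records are made up, the temperature is rescaled to [-1, 1] before taking powers, and the baseline $\theta_0$ is left unpenalized. All of these are assumptions for illustration.

```python
# Hypothetical (morning temperature, honey yield) records, made up for illustration.
temps = [17.0, 19.0, 20.0, 22.0, 23.0, 25.0]
yields = [1.2, 1.8, 2.6, 3.1, 2.4, 3.6]
DEGREE = 5


def design_row(temp):
    """1, u, u**2, ..., u**5 for the temperature rescaled to [-1, 1]."""
    u = (temp - 21.0) / 4.0
    return [u ** power for power in range(DEGREE + 1)]


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution


def fit_ridge(xs, ys, lam):
    """Minimize MSE + lam * sum(theta_j**2 for j >= 1) via the normal equations."""
    rows = [design_row(x) for x in xs]
    m, n = len(rows), DEGREE + 1
    gram = [[sum(row[i] * row[j] for row in rows) for j in range(n)] for i in range(n)]
    for j in range(1, n):
        gram[j][j] += lam * m  # the penalty adds to the diagonal, skipping the baseline
    moment = [sum(row[i] * y for row, y in zip(rows, ys)) for i in range(n)]
    return solve(gram, moment)


def training_mse(theta):
    """Fit error of the given weights on the records above."""
    predictions = [sum(w * f for w, f in zip(theta, design_row(x))) for x in temps]
    return sum((p - y) ** 2 for p, y in zip(predictions, yields)) / len(yields)


for lam in (0.0, 0.01, 1.0):
    theta = fit_ridge(temps, yields, lam)
    print(f"lambda = {lam}: MSE = {training_mse(theta):.4f}, "
          f"weights = {[round(w, 3) for w in theta[1:]]}")
```

As the penalty grows, the curve weights get smaller and the training error rises a little, but none of the weights becomes exactly zero on this data.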

[Interactive chart: L2 penalty (λ) slider, starting at λ = 0.0, where degree 5 can still stitch every studied record. Readout: Objective = MSE + λΣθ².]

Same degree. Smaller weights. Calmer curve.

L2 Geometry: The Circle

Now imagine the same simple x-y plane.

The x-axis is one weight: $\theta_1$.

The y-axis is another weight: $\theta_2$.

For L2, the allowed weights must satisfy:

$$\theta_1^2 + \theta_2^2 \le \text{budget}$$

If the budget is 1, the boundary becomes:

$$\theta_1^2 + \theta_2^2 = 1$$

That is a circle.

The circle is the L2 budget boundary.

Penalty form says:

Large weights are expensive.

Budget form says:

Weights must stay inside this round allowed region.

Stronger penalty. Smaller circle.

[Interactive chart: L2 budget slider, shown at radius 1.00. Increase the budget and the circle grows, so the allowed point moves outward.]

The important part is the smooth edge.

L1 had sharp diamond corners.

Those corners sat on the axes, where one weight became exactly zero.

L2 is different.

The circle has no corners.

So the best allowed point usually lands somewhere along the smooth curve, not directly on an axis.

That means the weights usually shrink instead of disappearing.

In the first widget, the weights got smaller together.

In the geometry widget, the same idea appears as a point sliding along the circle.

The model is not saying:

This term is out.

It is saying:

This term can stay, but quieter.
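The smooth-edge effect can be sketched under one simplifying assumption: the error contours are perfect circles, so the best allowed point is the point inside the budget that is nearest to the unpenalized weights. Real error contours are usually stretched ellipses, so this illustrates the geometry only. The function name and the sample point are hypothetical.

```python
import math


def project_onto_l2_budget(point, budget):
    """Nearest point whose squares sum to at most budget (a disc of radius sqrt(budget))."""
    squared_size = sum(v * v for v in point)
    if squared_size <= budget:
        return list(point)  # already inside the circle
    scale = math.sqrt(budget / squared_size)
    return [v * scale for v in point]  # pull straight back toward the origin


# One strong weight and one weak weight: both shrink, neither is cut.
print(project_onto_l2_budget([2.0, 0.3], budget=1.0))
```

Both weights are scaled by the same factor, so the weak weight gets smaller but stays nonzero. The point lands on the curve, not on an axis.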

L1 vs. L2: Which One Do You Pick?

Both L1 and L2 fight overfitting.

They just fight differently.

Use L1 when you suspect some clues or terms are mostly noise.

L1 can push weak weights to zero, so the model stops using them.

This term is out.

Use L2 when most clues are useful, but some are shouting too loudly.

L2 shrinks large weights, so the model becomes smoother and steadier.

This term can stay, but quieter.

For YieldMaster:

if a higher-order term is mostly noise, L1 can cut it ✂️
if a higher-order term is useful but too powerful, L2 can calm it 🎛️

And when the real world gives you both problems, there is a third option: Elastic Net.

Elastic Net combines L1 and L2.

It can cut weak terms and calm the survivors.
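A one-weight sketch shows both effects at once. Assume a simplified objective for a single weight, $(\theta - z)^2 + \lambda_1|\theta| + \lambda_2\theta^2$, where $z$ is the value the data alone would choose. The two separate strengths `l1_strength` and `l2_strength` are an assumption made for this illustration, and libraries often parameterize Elastic Net differently. The minimizer first applies the L1 cut, then the L2 scaling.

```python
def elastic_net_best_weight(z, l1_strength, l2_strength):
    """Minimizer of (theta - z)**2 + l1_strength*|theta| + l2_strength*theta**2."""
    cut = l1_strength / 2.0
    if z > cut:
        survivor = z - cut
    elif z < -cut:
        survivor = z + cut
    else:
        return 0.0  # L1 part: weak weights are cut
    return survivor / (1.0 + l2_strength)  # L2 part: survivors are calmed


unpenalized = [2.0, 0.3, -0.1]  # hypothetical weights the data alone would choose
print([elastic_net_best_weight(z, 1.0, 1.0) for z in unpenalized])
```

The weak weights are cut to exactly zero, and the strong weight ends up smaller than either penalty alone would leave it.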

Memory hook:

L1 ✂️ cuts. L2 🎛️ calms.

YieldMaster Field Notes

Underfitting

YieldMaster is too stiff. It sees honey yield rising, but refuses to believe the harvest curve bends. Very confident. Very wrong.

