
May 7, 2026

Confusion Matrix, Precision, Recall and F1 - Sherlock’s Fraud Case

Sherlock Holmes investigates fraud while uncovering accuracy, precision, recall, and F1 score through the confusion matrix.

The Case

At first—nothing unusual.

Then it wouldn’t stop.

Transactions flagged. Money missing. Customers calling.

Not one bank. Many.

The Crisis

Fraud teams were good.

But this was overwhelming.

Too many cases. Too little time.

Meet Sherlock 🕵️‍♂️

Banks call you for help.

Using thousands of past fraud cases, you build Sherlock — a detective trained to make the next call:

🚩 Fraud
✔️ Not Fraud


Confusion Matrix

Every prediction compares:

  • 🏛️ What actually happened
  • 🕵️ What Sherlock predicted

There are only four possible outcomes.

👏 If you can CLAP, you already understand this.

When they agree → correct
When they don’t → mistakes
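
To make the four boxes concrete, here is a minimal Python sketch (the toy labels below are invented for illustration) using scikit-learn’s confusion_matrix:

```python
from sklearn.metrics import confusion_matrix

# 1 = Fraud, 0 = Not Fraud (toy data, invented for illustration)
actual    = [1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1, 0, 0, 0]

# Rows = actual, columns = predicted.
# labels=[1, 0] puts Fraud first, matching the tables below:
#   [[TP, FN],
#    [FP, TN]]
print(confusion_matrix(actual, predicted, labels=[1, 0]))
# [[2 2]
#  [1 3]]
```

Every metric in this article is built from just these four counts.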

Your Turn

If you can redraw the four boxes, you’re ready for the quiz.

Accuracy

Now, the real question:

How often is Sherlock right?

Insight: Accuracy asks
Out of all predictions, how many were correct?

Balanced World

Sherlock reviews 5,000 transactions labeled by the bank.

  • 🚩 Fraud: 2,500
  • ✔️ Not Fraud: 2,500

But Sherlock hasn’t had coffee yet 🕵️ ☕

So every prediction is:

Not Fraud

                    Predicted: Fraud   Predicted: Not Fraud
Actual: Fraud       TP = 0             FN = 2,500
Actual: Not Fraud   FP = 0             TN = 2,500

Accuracy

How often was Sherlock right?

\text{Accuracy} = \frac{\text{Correct predictions}}{\text{Total predictions}} = \frac{0 + 2{,}500}{0 + 2{,}500 + 0 + 2{,}500} = \frac{2{,}500}{5{,}000} = 50\%
Sherlock flipped a coin.
Low accuracy shows Sherlock is not reliable.

Sherlock Wakes Up ☕

                    Predicted: Fraud   Predicted: Not Fraud
Actual: Fraud       TP = 2,250         FN = 250
Actual: Not Fraud   FP = 250           TN = 2,250

Accuracy

How often was Sherlock right?

\text{Accuracy} = \frac{\text{Correct predictions}}{\text{Total predictions}} = \frac{2{,}250 + 2{,}250}{2{,}250 + 2{,}250 + 250 + 250} = \frac{4{,}500}{5{,}000} = 90\%
Sherlock was right 90% of the time.
High accuracy shows Sherlock is reliable on both fraud and non-fraud cases.
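
Both scenarios boil down to the same four counts. Here is a minimal sketch in plain Python using the numbers above:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Before coffee: Sherlock predicts Not Fraud every time
print(accuracy(tp=0, tn=2_500, fp=0, fn=2_500))      # 0.5 -> a coin flip

# After coffee
print(accuracy(tp=2_250, tn=2_250, fp=250, fn=250))  # 0.9 -> 90%
```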

The Real World: Data Is Imbalanced

Sherlock reviews 10,000 transactions labeled by the bank.

  • 🚩 Fraud: 100
  • ✔️ Not Fraud: 9,900

But Sherlock hasn’t had coffee yet 🕵️ ☕

So every prediction is:

Not Fraud

                    Predicted: Fraud   Predicted: Not Fraud
Actual: Fraud       TP = 0             FN = 100
Actual: Not Fraud   FP = 0             TN = 9,900

Accuracy

How often was Sherlock right?

\text{Accuracy} = \frac{\text{Correct predictions}}{\text{Total predictions}} = \frac{0 + 9{,}900}{0 + 100 + 0 + 9{,}900} = \frac{9{,}900}{10{,}000} = 99\%
Sherlock was right 99% of the time just by saying Not Fraud.
99% accuracy, yet he caught 0 out of 100 fraud cases.
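
Here is the same trap as a sketch in code (recall is defined formally later in the article, but the failure is already visible):

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp: int, fn: int) -> float:
    # Of all actual fraud, how much was caught? Guard against an empty row.
    return tp / (tp + fn) if (tp + fn) else 0.0

# Sleepy Sherlock on imbalanced data: he always predicts Not Fraud
tp, fn, fp, tn = 0, 100, 0, 9_900

print(accuracy(tp, tn, fp, fn))  # 0.99 -> 99% accurate...
print(recall(tp, fn))            # 0.0  -> ...yet 0 of 100 frauds caught
```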

The 99% Illusion

Here, 99% of the dots are green.

Close your eyes. Point anywhere. Guess “green.”

You’ll be 99% right — just like Sherlock. Sounds impressive, right?

But you still catch zero fraud.

Warning: In imbalanced data, accuracy can look good by always picking the majority—while doing nothing useful.

The Insight

Insight: Accuracy has a blind spot: it treats both “Model is Wrong” boxes (the off-diagonals, FP and FN) as identical penalties.

A false alarm (FP) → flagging a normal transaction as fraud; you annoy a customer.

A missed fraud (FN) → letting real fraud go through; the bank loses money!

Same penalty.

But in reality— they are not equal. Missing fraud costs more.

Accuracy Works Well If

  • Data is balanced
  • Mistakes have similar cost
  • You want a quick overview

Avoid Accuracy When

  • Rare events matter
  • Data is heavily imbalanced
  • One mistake is more costly

Error

The leftover Sherlock couldn’t solve

Sherlock closes the case file.

“90% solved.”

Looks impressive.

But then…

He pauses.

“What about the rest?”

                    Predicted: Fraud   Predicted: Not Fraud
Actual: Fraud       TP = 2,250         FN = 250
Actual: Not Fraud   FP = 250           TN = 2,250

Error

How often was Sherlock wrong?

\text{Error} = \frac{\text{Wrong predictions}}{\text{Total predictions}} = \frac{FP + FN}{TP + TN + FP + FN} = \frac{250 + 250}{2{,}250 + 2{,}250 + 250 + 250} = \frac{500}{5{,}000} = 10\%

Insight: These are the cases Sherlock got wrong. The entire off-diagonal is made of wrong predictions.

The flip

Accuracy shows how often Sherlock is right.
Error shows what escaped him.

No new math. Just what’s left behind.
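
As a sketch, error really is just the flip:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)

def error(tp: int, tn: int, fp: int, fn: int) -> float:
    # The off-diagonal (FP + FN) over everything
    return (fp + fn) / (tp + tn + fp + fn)

tp, tn, fp, fn = 2_250, 2_250, 250, 250
print(accuracy(tp, tn, fp, fn))  # 0.9
print(error(tp, tn, fp, fn))     # 0.1, exactly 1 - accuracy
```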

The punch

Sherlock solved 90%.

The remaining 10%…

That’s where the real crimes slipped through.

⚠️ Caution

Error has the same blind spot as accuracy. Sherlock’s mistakes may look small…
while the real crimes still slip through.


Insight:
Error is not new work — it’s just the part accuracy couldn’t cover.

Precision

🧠 PREcision starts from PREdiction.

Takeaway from this animation

The stem of the P appears first.

Only the Predicted Positive column survives — vertical like the stem of the P.

Insight: Precision is all about the model’s Prediction

Because this is the Predicted Positive column, we already know one thing:

The model labeled every case as fraud.

But reality does not always agree.

For this column:

  • The bank says some are fraud
  • The bank says some are not fraud

So Precision asks:

Of all the cases Sherlock predicted as fraud,
how many were truly fraud?

The loop of the P is the part where Sherlock was actually correct.

Precision
\text{Precision} = \frac{\text{Loop of P}}{\text{Everything inside P}} = \frac{TP}{TP + FP}
Of everything the model predicts as positive, how much is actually positive?
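
A minimal sketch of the formula, using Awake Sherlock’s counts from the balanced example:

```python
def precision(tp: int, fp: int) -> float:
    """Loop of P over everything inside P: TP / (TP + FP)."""
    return tp / (tp + fp)

# Awake Sherlock's predicted-fraud column: 2,250 true hits, 250 false alarms
print(precision(tp=2_250, fp=250))  # 0.9
```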

Recall

🧠 The tail of the R points right — toward the horizontal Actual Positive row.

Takeaway from this animation

The animation starts with the P shape from Precision.

But then the tail appears and turns the P into an R.

That slanted leg stretches across the horizontal row.

Only the Actual Positive row survives.

Insight: Recall is about Actual Positives

Because this is the Actual Positive row, we already know one thing:

Every case here is fraud.

But Sherlock does not catch all of them as fraud.

  • Some cases Sherlock says are fraud
  • Some cases Sherlock says are not fraud

So Recall asks:

Of all the cases that were truly fraud,
how many did Sherlock catch as fraud?

The loop of the R is the part where Sherlock successfully caught the fraud.

Recall
\text{Recall} = \frac{\text{Loop of R}}{\text{Everything inside R}} = \frac{TP}{TP + FN}
Of everything actually positive, how much did the model correctly predict as positive?
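
And the mirror-image sketch for recall, using the same counts:

```python
def recall(tp: int, fn: int) -> float:
    """Loop of R over everything inside R: TP / (TP + FN)."""
    return tp / (tp + fn)

# Awake Sherlock's actual-fraud row: 2,250 caught, 250 missed
print(recall(tp=2_250, fn=250))  # 0.9
```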

⚖️ Precision vs Recall Tradeoff

🕵️ ☕ Careful Sherlock

First morning coffee.

Sherlock becomes calm and extremely careful.

He only says “fraud” when he is very, very confident.

But look what happens inside the confusion matrix.

                    Predicted: Fraud   Predicted: Not Fraud
Actual: Fraud       TP = 40            FN = 60
Actual: Not Fraud   FP = 5             TN = 895

Precision Consequence

Out of all the cases Sherlock predicted as fraud:

  • 40 were actually fraud
  • Only 5 were innocent

So when Sherlock says fraud:

He is usually right.

\text{Precision} = \frac{\text{Loop of P}}{\text{Everything inside P}} = \frac{40}{40 + 5} = 88.9\%

Recall Consequence

But there were actually 100 fraud cases.

Sherlock only caught 40 of them.

\text{Recall} = \frac{\text{Loop of R}}{\text{Everything inside R}} = \frac{40}{40 + 60} = 40\%

The Tradeoff

✅ High Precision ❌ Low Recall

Insight:
Careful Sherlock rarely accuses innocent people.
But many real fraud cases escape unnoticed.


🕵️ ☕☕ Aggressive Sherlock

Too much coffee.

Sherlock now suspects almost everyone.

He labels almost everything as fraud.

Look what happens inside the confusion matrix.

                    Predicted: Fraud   Predicted: Not Fraud
Actual: Fraud       TP = 95            FN = 5
Actual: Not Fraud   FP = 300           TN = 600

Precision Consequence

Out of all the cases Sherlock predicted as fraud:

  • Only 95 were actually fraud
  • 300 were innocent!

So when Sherlock says fraud:

Many innocent people get accused.

\text{Precision} = \frac{\text{Loop of P}}{\text{Everything inside P}} = \frac{95}{95 + 300} \approx 24\%

Recall Consequence

There were actually 100 fraud cases.

Sherlock successfully caught 95 of them.

\text{Recall} = \frac{\text{Loop of R}}{\text{Everything inside R}} = \frac{95}{95 + 5} = 95\%

⚖️ The Tradeoff

✅ High Recall ❌ Low Precision

Insight: Aggressive Sherlock catches almost every fraud case. But innocent people constantly get accused.
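
Putting both personalities side by side, a sketch computed from the two matrices above:

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

detectives = {
    # name: (TP, FN, FP, TN)
    "Careful":    (40, 60, 5, 895),
    "Aggressive": (95, 5, 300, 600),
}

for name, (tp, fn, fp, tn) in detectives.items():
    print(f"{name:>10}: precision={precision(tp, fp):.1%}  recall={recall(tp, fn):.1%}")
#    Careful: precision=88.9%  recall=40.0%
# Aggressive: precision=24.1%  recall=95.0%
```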

🤔 So should we use Precision or Recall?

Now let’s imagine we are building a cancer detection model. Positive = cancer, Negative = no cancer.

Now we need to ask:

Which mistake hurts more?

Optimizing for Precision

Precision focuses on the Predicted Positive column.

These are the patients where the model says:

“Cancer.”

But this column is still a mix:

  • Some patients truly have cancer
  • Some patients do not have cancer

The false alarm case is painful.

A healthy patient may hear scary news, go through more tests, wait anxiously, and later discover:

“You do not have cancer.”

That is not harmless.

But from the doctor’s perspective, the patient was at least investigated further.

The bigger danger is not here.

Optimizing for Recall

Recall focuses on the Actual Positive row.

These are the patients who truly have cancer.

But the model can still split them into two groups:

  • Some are correctly labeled cancer
  • Some are wrongly labeled no cancer

That second case is dangerous.

If the model says:

“No cancer.”

the doctor may not order more tests.

But the patient actually has cancer.

Now the disease can continue unnoticed.

That mistake can cost a life.

The tradeoff

So choosing a model is often a tradeoff.

From the doctor’s perspective:

Missing real cancer is far more dangerous than ordering extra tests.

That pushes the system toward Recall.

Doctors would rather investigate further than accidentally send a cancer patient home untreated.


But from the patient’s perspective:

False alarms are not free either.

A healthy patient may experience emotional stress and expensive medical tests.

That pushes the system toward Precision.

Insight: Different problems care about different mistakes.
Which mistake is more dangerous in your situation?
And whose pain is your model trying to minimize?


F1 Score

Now we hit a problem.

If we optimize too much for Precision:

  • Sherlock becomes too careful and real fraud cases get missed

If we optimize too much for Recall:

  • Sherlock accuses almost everyone and innocent people suffer

So how do we combine Precision and Recall into one score?

F1 and Harmonic Mean

Suppose Sherlock has

Precision = 1.0 and Recall = 0.1

Arithmetic Mean says: \frac{1.0 + 0.1}{2} = 0.55

But Sherlock still misses most fraud cases.


F1 uses Harmonic Mean as:

HM = \frac{2}{\frac{1}{A} + \frac{1}{B}} = \frac{2AB}{A + B}

Using the same precision and recall:

HM = \frac{2 \times 1.0 \times 0.1}{1.0 + 0.1} \approx 0.18

0.18 instead of 0.55 — much harsher, and much more honest.
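
A quick sketch comparing the two means on these numbers; this harmonic mean of precision and recall is exactly the F1 score:

```python
def arithmetic_mean(a: float, b: float) -> float:
    return (a + b) / 2

def harmonic_mean(a: float, b: float) -> float:
    # Drops toward the weaker value; collapses to 0 if either side is 0
    return 2 * a * b / (a + b) if (a + b) else 0.0

precision, recall = 1.0, 0.1
print(arithmetic_mean(precision, recall))  # 0.55 -> flattering
print(harmonic_mean(precision, recall))    # 0.1818... -> the honest F1
```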


Observe Precision, Recall, and the gap between Arithmetic Mean and Harmonic Mean.

Case                    Precision   Recall   Arithmetic   Harmonic   Visual Insight
Perfect harmony         1.0         1.0      1.00         1.00       AM = HM
Similar values          0.9         0.8      0.85         0.85       AM ≈ HM
One side becomes weak   1.0         0.1      0.55         0.18       🔻 HM exposes weakness
One side dies           1.0         0.0      0.50         0.00       💀 HM collapses
Both sides low          0.2         0.2      0.20         0.20       Balanced but weak → HM still stays weak

Insight: Harmonic Mean is suspicious of one-trick detectives.

Sherlock should both catch fraud and avoid falsely accusing innocent people.

Mechanically, that means F1 keeps pulling him toward the TP corner — the place where fraud is caught without too many false alarms.

Precision vs Recall vs F1

Does that mean we should always optimize for F1 score instead?

Not necessarily.

F1 is useful when we want a balance between Precision and Recall.

But sometimes one mistake matters far more than the other.

In cancer detection, missing real cancer may be far more dangerous than false alarms.

In spam detection, a few spam emails slipping through may be acceptable if important emails are never blocked.

So the right metric depends on:

Which mistake your system can afford to make.


🧾 Detective Cheat Sheet

Metric      Formula                           Question it answers
Accuracy    (TP + TN) / (TP + TN + FP + FN)   Out of all predictions, how many were correct?
Error       (FP + FN) / (TP + TN + FP + FN)   How often was Sherlock wrong?
Precision   TP / (TP + FP)                    Of all predicted fraud, how much was truly fraud?
Recall      TP / (TP + FN)                    Of all true fraud, how much did Sherlock catch?
F1          2PR / (P + R)                     The harmonic balance of Precision and Recall


Quiz

86% of people love a quiz after learning! Are you one of them?
