Confusion Matrix, Precision, Recall and F1 - Sherlock’s Fraud Case
Sherlock Holmes investigates fraud while uncovering accuracy, precision, recall, and F1 score through the confusion matrix.
The Case
At first—nothing unusual.
Then it wouldn’t stop.
Transactions flagged. Money missing. Customers calling.
Not one bank. Many.
The Crisis
Fraud teams were good.
But this was overwhelming.
Too many cases. Too little time.
Meet Sherlock 🕵️‍♂️
Banks call you for help.
Using thousands of past fraud cases, you build Sherlock — a detective trained to make the next call:
🚩 Fraud
✔️ Not Fraud
Confusion Matrix
Every prediction compares:
- 🏛️ What actually happened
- 🕵️ What Sherlock predicted
There are only four possible outcomes.
👏 If you can CLAP, you already understand this.
When they agree → correct
Where they don’t → mistakes
Your Turn
If you can redraw the four boxes — you’re ready for the quiz.
Accuracy
Now the real question
How often is Sherlock right?
Insight: Accuracy asks
Out of all predictions, how many were correct?
Balanced World
Sherlock reviews 5,000 transactions labeled by the bank.
- 🚩 Fraud: 2,500
- ✔️ Not Fraud: 2,500
But Sherlock hasn’t had coffee yet 🕵️ ☕
So every prediction is:
Not Fraud
| | Predicted: Fraud | Predicted: Not Fraud |
|---|---|---|
| Actual: Fraud | TP = 0 | FN = 2,500 |
| Actual: Not Fraud | FP = 0 | TN = 2,500 |
Sherlock Wakes Up ☕
| | Predicted: Fraud | Predicted: Not Fraud |
|---|---|---|
| Actual: Fraud | TP = 2,250 | FN = 250 |
| Actual: Not Fraud | FP = 250 | TN = 2,250 |
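The two matrices above can be checked with a few lines of plain Python (a minimal sketch; the helper name `accuracy` is just for illustration):

```python
def accuracy(tp, fn, fp, tn):
    """Fraction of all predictions that were correct (the diagonal of the matrix)."""
    return (tp + tn) / (tp + fn + fp + tn)

# Sleepy Sherlock: predicts "Not Fraud" for everything.
print(accuracy(tp=0, fn=2500, fp=0, tn=2500))      # 0.5

# Awake Sherlock: catches most fraud.
print(accuracy(tp=2250, fn=250, fp=250, tn=2250))  # 0.9
```

On the balanced data, the sleepy detective scores a coin flip, and waking up genuinely helps.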
The Real World: Data Is Imbalanced
Sherlock reviews 10,000 transactions labeled by the bank.
- 🚩 Fraud: 100
- ✔️ Not Fraud: 9,900
But Sherlock hasn’t had coffee yet 🕵️ ☕
So every prediction is:
Not Fraud
| | Predicted: Fraud | Predicted: Not Fraud |
|---|---|---|
| Actual: Fraud | TP = 0 | FN = 100 |
| Actual: Not Fraud | FP = 0 | TN = 9,900 |
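Run the same calculation on this imbalanced matrix (same sketch `accuracy` helper as before, an illustrative name):

```python
def accuracy(tp, fn, fp, tn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fn + fp + tn)

# Sleepy Sherlock on imbalanced data: never flags a single fraud,
# yet the score looks stellar.
print(accuracy(tp=0, fn=100, fp=0, tn=9900))  # 0.99
```

A 99% score while catching zero fraud: this is the trap the next section exposes.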
The 99% Illusion
In this picture, 99% of the dots are green (Not Fraud).
Close your eyes. Point anywhere. Guess “green.”
You’ll be 99% right — just like Sherlock. Sounds impressive, right?
But you still catch zero fraud.
Warning: In imbalanced data, accuracy can look good by always picking the majority—while doing nothing useful.
The Insight
Insight: Accuracy has a blind spot: it treats both “Model is Wrong” boxes (the off-diagonal cells, FP and FN) as identical penalties.
A false alarm (FP) → flagging a normal transaction as fraud; you annoy a customer.
A missed fraud (FN) → letting a real fraud go through; the bank loses money!
Same penalty.
But in reality— they are not equal. Missing fraud costs more.
Accuracy Works Well if
- Data is balanced
- Mistakes have similar cost
- You want a quick overview
Avoid Accuracy When
- Rare events matter
- Data is heavily imbalanced
- One mistake is more costly
Error
The leftover Sherlock couldn’t solve
Sherlock closes the case file.
“90% solved.”
Looks impressive.
But then…
He pauses.
“What about the rest?”
| | Predicted: Fraud | Predicted: Not Fraud |
|---|---|---|
| Actual: Fraud | TP = 2,250 | FN = 250 |
| Actual: Not Fraud | FP = 250 | TN = 2,250 |
Insight: These are the cases Sherlock got wrong. The entire off-diagonal (FP and FN) is made of wrong predictions.
The flip
Accuracy shows how often Sherlock is right.
Error shows what escaped him.
No new math. Just what’s left behind.
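The flip really is just subtraction, as a quick sketch shows (the helper name `error_rate` is illustrative):

```python
def error_rate(tp, fn, fp, tn):
    """Off-diagonal share: the cases Sherlock got wrong."""
    total = tp + fn + fp + tn
    return (fn + fp) / total

# Awake Sherlock's matrix: 250 + 250 mistakes out of 5,000 cases.
err = error_rate(tp=2250, fn=250, fp=250, tn=2250)
print(err)  # 0.1

# Same number, flipped: error is whatever accuracy left behind.
acc = (2250 + 2250) / 5000
print(abs((1 - acc) - err) < 1e-9)  # True
```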
The punch
Sherlock solved 90%.
The remaining 10%…
That’s where the real crimes slipped through.
⚠️ Caution
Error has the same blind spot as accuracy. Sherlock’s mistakes may look small…
while the real crimes still slip through.
Insight:
Error is not new work — it’s just the part accuracy couldn’t cover.
Precision
🧠 PREcision starts from PREdiction.
Takeaway from this animation
The stem of the P appears first.
Only the Predicted Positive column survives — vertical like the stem of the P.
Insight: Precision is all about the model’s Predictions
Because this is the Predicted Positive column, we already know one thing:
The model labeled every case as fraud.
But reality does not always agree.
For this column:
- The bank says some are fraud
- The bank says some are not fraud
So Precision asks:
Of all the cases Sherlock predicted as fraud,
how many were truly fraud?
The loop of the P is the part where Sherlock was actually correct.
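Precision reads straight off that column, as a minimal sketch (the function name `precision` is illustrative):

```python
def precision(tp, fp):
    """Of everything predicted fraud (the P's stem), how much was truly fraud?"""
    return tp / (tp + fp)

# Awake Sherlock: 2,250 true frauds among 2,500 fraud predictions.
print(precision(tp=2250, fp=250))  # 0.9
```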
Recall
🧠 The tail of the R points right — toward the horizontal Actual Positive row.
Takeaway from this animation
The animation starts with the P shape from Precision.
But then the tail appears and turns the P into an R.
That slanted leg stretches across the horizontal row.
Only the Actual Positive row survives.
Insight: Recall is about Actual Positives
Because this is the Actual Positive row, we already know one thing:
Every case here is fraud.
But Sherlock does not catch all of them as fraud.
- Some cases Sherlock says are fraud
- Some cases Sherlock says are not fraud
So Recall asks:
Of all the cases that were truly fraud,
how many did Sherlock catch as fraud?
The loop of the R is the part where Sherlock successfully caught the fraud.
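Recall reads off the Actual Positive row the same way (again a sketch; `recall` is an illustrative name):

```python
def recall(tp, fn):
    """Of all true fraud cases (the R's row), how many did Sherlock catch?"""
    return tp / (tp + fn)

# Awake Sherlock: 2,250 caught out of 2,500 true frauds.
print(recall(tp=2250, fn=250))  # 0.9
```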
⚖️ Precision vs Recall Tradeoff
🕵️ ☕ Careful Sherlock
First morning coffee.
Sherlock becomes calm and extremely careful.
He only says “fraud” when he is very, very confident.
But look what happens inside the confusion matrix.
| | Predicted Fraud | Predicted Not Fraud |
|---|---|---|
| Actual Fraud | TP = 40 | FN = 60 |
| Actual Not Fraud | FP = 5 | TN = 895 |
Precision Consequence
Out of all the cases Sherlock predicted as fraud:
- 40 were actually fraud
- Only 5 were innocent
So when Sherlock says fraud:
He is usually right.
Recall Consequence
But there were actually 100 fraud cases.
Sherlock only caught 40 of them.
The Tradeoff
✅ High Precision ❌ Low Recall
Insight:
Careful Sherlock rarely accuses innocent people.
But many real fraud cases escape unnoticed.
🕵️ ☕☕ Aggressive Sherlock
Too much coffee.
Sherlock now suspects almost everyone.
He labels almost everything as fraud.
Look what happens inside the confusion matrix.
| | Predicted Fraud | Predicted Not Fraud |
|---|---|---|
| Actual Fraud | TP = 95 | FN = 5 |
| Actual Not Fraud | FP = 300 | TN = 600 |
Precision Consequence
Out of all the cases Sherlock predicted as fraud:
- Only 95 were actually fraud
- 300 were innocent!
So when Sherlock says fraud:
Many innocent people get accused.
Recall Consequence
There were actually 100 fraud cases.
Sherlock successfully caught 95 of them.
⚖️ The Tradeoff
✅ High Recall ❌ Low Precision
Insight: Aggressive Sherlock catches almost every fraud case. But innocent people constantly get accused.
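Plugging both matrices into the two formulas makes the tradeoff concrete (a sketch; the helper names are illustrative):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Careful Sherlock (one coffee): TP=40, FN=60, FP=5
print(precision(40, 5), recall(40, 60))    # ≈ 0.889, 0.4

# Aggressive Sherlock (too much coffee): TP=95, FN=5, FP=300
print(precision(95, 300), recall(95, 5))   # ≈ 0.241, 0.95
```

Each detective maxes out one metric by sacrificing the other, which is exactly why we need a single score that punishes imbalance.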
🤔 So should we use Precision or Recall?
Now let’s imagine we are building a cancer detection model. Positive = cancer, Negative = no cancer.
Now we need to ask:
Which mistake hurts more?
Optimizing for Precision
Precision focuses on the Predicted Positive column.
These are the patients where the model says:
“Cancer.”
But this column is still a mix:
- Some patients truly have cancer
- Some patients do not have cancer
The false alarm case is painful.
A healthy patient may hear scary news, go through more tests, wait anxiously, and later discover:
“You do not have cancer.”
That is not harmless.
But from the doctor’s perspective, the patient was still investigated further.
The bigger danger is not here.
Optimizing for Recall
Recall focuses on the Actual Positive row.
These are the patients who truly have cancer.
But the model can still split them into two groups:
- Some are correctly labeled cancer
- Some are wrongly labeled no cancer
That second case is dangerous.
If the model says:
“No cancer.”
the doctor may not order more tests.
But the patient actually has cancer.
Now the disease can continue unnoticed.
That mistake can cost a life.
The tradeoff
So choosing a model is often a tradeoff.
From the doctor’s perspective:
Missing real cancer is far more dangerous than ordering extra tests.
That pushes the system toward Recall.
Doctors would rather investigate further than accidentally send a cancer patient home untreated.
But from the patient’s perspective:
False alarms are not free either.
A healthy patient may experience: emotional stress, expensive medical tests
That pushes the system toward Precision.
Insight: Different problems care about different mistakes.
Which mistake is more dangerous in your situation?
And whose pain is your model trying to minimize?
F1 Score
Now we hit a problem.
If we optimize too much for Precision:
- Sherlock becomes too careful and real fraud cases get missed
If we optimize too much for Recall:
- Sherlock accuses almost everyone and innocent people suffer
So how do we combine Precision and Recall into one score?
F1 and Harmonic Mean
Suppose Sherlock has
Precision = 1.0 and Recall = 0.1
Arithmetic Mean says:
(1.0 + 0.1) / 2 = 0.55
But Sherlock still misses most fraud cases.
F1 uses the Harmonic Mean instead:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Using the same precision and recall:
F1 = 2 × (1.0 × 0.1) / (1.0 + 0.1) ≈ 0.18
0.18 instead of 0.55 — much harsher, and much more honest.
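The gap between the two means is easy to verify in a few lines (a sketch; the helper name `f1` is illustrative):

```python
def f1(p, r):
    """Harmonic mean of precision and recall (defined as 0 when both are 0)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

p, r = 1.0, 0.1
print((p + r) / 2)  # 0.55  (arithmetic mean flatters the one-trick detective)
print(f1(p, r))     # ≈ 0.18 (harmonic mean exposes the weak side)
print(f1(1.0, 0.0)) # 0.0   (if one side dies, F1 collapses)
```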
Observe Precision, Recall, and the gap between Arithmetic Mean and Harmonic Mean.
| Case | Precision | Recall | Arithmetic | Harmonic | Visual Insight |
|---|---|---|---|---|---|
| Perfect harmony | 1.0 | 1.0 | 1.00 | 1.00 | AM = HM |
| Similar values | 0.9 | 0.8 | 0.85 | 0.85 | AM ≈ HM |
| One side becomes weak | 1.0 | 0.1 | 0.55 | 0.18 | 🔻 HM exposes weakness |
| One side dies | 1.0 | 0.0 | 0.50 | 0.00 | 💀 HM collapses |
| Both sides low | 0.2 | 0.2 | 0.20 | 0.20 | Balanced but weak → HM still stays weak |
Insight: Harmonic Mean is suspicious of one-trick detectives.
Sherlock should both catch fraud and avoid falsely accusing innocent people.
Mechanically, that means F1 keeps pulling him toward the TP corner — the place where fraud is caught without too many false alarms.
Precision vs Recall vs F1
Does that mean we should always optimize for F1 score instead?
Not necessarily.
F1 is useful when we want a balance between Precision and Recall.
But sometimes one mistake matters far more than the other.
In cancer detection, missing real cancer may be far more dangerous than false alarms.
In spam detection, a few spam emails slipping through may be acceptable if important emails are never blocked.
So the right metric depends on:
Which mistake your system can afford to make.
🧾 Detective Cheat Sheet
Confusion Matrix
A grid that compares truth vs prediction.
Quiz
86% of people love a quiz after learning! Are you one of them?