The Launch Log Comes First
A rocket team keeps a small launch log.
Before every launch attempt, Mission Control records a few clues and makes one call:
GO or HOLD?
Deep learning notation starts here.
Not with symbols.
From One Neuron to the Whole Network
In the last article, we used this launch log to learn how one neuron thinks:
- inputs come in
- weights decide how much to listen
- bias nudges the score
- activation turns the score into GO or HOLD
Start with Part 1: How one rocket neuron turns Wind and Fuel into GO or HOLD.
In this article, we keep the same rocket model and zoom out.
Instead of reading one neuron, we will learn how to read the whole network.
At first, the arrows look like spaghetti.
But they are not random.
Every arrow has an address.
The Three Parts of the Network
Now let’s zoom out from one neuron to a small network.
For our rocket model, the network has one hidden layer.
The input layer (in) receives the launch clues:
$$x = \begin{bmatrix} \text{Wind} & \text{Fuel} \end{bmatrix} = \begin{bmatrix} x_1 & x_2 \end{bmatrix}$$
The hidden layer (h) sits between the input clues and the final answer.
It is called hidden because we do not directly program what each middle neuron should notice.
Mission Control provides the inputs:
Wind and Fuel
And the training data provides the target:
GO or HOLD
But no one tells the hidden layer:
watch Wind this way, combine Fuel that way, create this exact signal.
Those middle signals are learned.
That is why the layer is hidden: it is the model’s internal logic, not a column we manually design.
The output layer (out) turns the final signal into the prediction:
GO or HOLD?
So the network flow is:
input layer (in)
↓
hidden layer (h)
↓
output layer (out)
Our example has one hidden layer.
A network with zero or one hidden layer is often called a shallow neural network.
A network with more than one hidden layer is often called a deep neural network.
A deeper network simply repeats the hidden-layer idea:
input layer
↓
hidden layer
↓
⋯
↓
hidden layer
↓
output layer
With one hidden layer, the network gets one middle round to combine Wind and Fuel.
With more hidden layers, it can build patterns on top of patterns.
In practice, networks can go very deep — dozens, hundreds, or even more layers.
But depth is not magic.
More layers can give the network more room to learn, but they can also make it harder to train and harder to understand.
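As a small illustration of the shallow versus deep distinction, the sketch below describes a network by a list of layer sizes. The function name `describe_depth`, the list layout, and the single output unit are assumptions made for this example; only the 2 inputs and 3 hidden units come from the rocket model.

```python
def describe_depth(layer_sizes):
    """Classify a network given [input size, hidden sizes..., output size]."""
    hidden_layers = len(layer_sizes) - 2  # everything between the two bookends
    if hidden_layers < 0:
        raise ValueError("need at least an input layer and an output layer")
    # Zero or one hidden layer is often called shallow; more than one, deep.
    return "shallow" if hidden_layers <= 1 else "deep"


rocket_net = [2, 3, 1]  # Wind and Fuel in, 3 hidden units, 1 output (assumed)
print(describe_depth(rocket_net))    # shallow
print(describe_depth([2, 3, 3, 1]))  # deep
```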
The Notation Map
Layer Addresses
First we need to agree on how we count the layers.
Textbooks often number layers starting at 0:
- layer 0 → input layer
- layer 1 → 1st hidden layer
- layer 2 → 2nd hidden layer
- layer L → output layer
We will use in and out for the input and output layer — the bookends of the network.
The middle hidden layers will use numbers: 1, 2, 3, and so on.
input layer (0/in)
↓
hidden layer (1)
↓
output layer (L/out)
General Convention
Most neural network notation looks like:
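$$\text{symbol}_{\text{subscript}}^{(\text{superscript})} \qquad \text{superscript} \leftarrow \text{layer address}, \quad \text{subscript} \leftarrow \text{unit address}$$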
So the superscript tells us where in the network we are.
The subscript tells us which circle inside that layer we mean.
W needs one extra rule because it connects two layers.
We will handle that when we get to the arrows.
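1. Inputs: x
Inputs are the raw launch clues entering the network.
$$x_i^{(\text{in})} \qquad \text{superscript} \leftarrow \text{input layer}, \quad \text{subscript} \leftarrow i\text{-th feature in the input}$$
So $x_2^{(\text{in})}$ means the second input feature.
This is the Fuel circle in our input layer.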
2. Activations: a
Activations are the signals produced by neurons.
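$$a_i^{(l)} \qquad \text{superscript} \leftarrow \text{layer } l \text{ (e.g., 1, 2, or h)}, \quad \text{subscript} \leftarrow i\text{-th unit inside that layer}$$
So $a_2^{(1)}$ means the second activation in the first hidden layer.
This is the second circle in layer 1.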
3. Biases: b
Bias is the starting nudge for a layer.
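$$b^{(l)} \qquad \text{superscript} \leftarrow \text{layer } l \text{ (e.g., h or out)}$$
There is no subscript here because $b^{(l)}$ names the whole bias vector for that layer.
So $b^{(1)}$ means the bias vector for layer 1.
The number of entries in the bias vector matches the number of neurons in that layer.
Our layer 1 has three neurons, so there are 3 entries:
$$b^{(1)} = \begin{bmatrix} b_1^{(1)} & b_2^{(1)} & b_3^{(1)} \end{bmatrix}$$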
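One way to make the address idea concrete is to store each layer's values under its layer address and look up a unit by its subscript. This is only a sketch: the numbers and the helper `unit` are made up for illustration, and only the shapes (2 inputs, 3 hidden units) come from the rocket model.

```python
# Values keyed by layer address (the superscript).
x = {"in": [12.0, 0.8]}     # x^(in): Wind, Fuel (made-up numbers)
b = {1: [0.1, -0.2, 0.05]}  # b^(1): one bias per hidden neuron (made-up numbers)


def unit(values, layer, i):
    """Return the i-th unit of a layer, counting from 1 like the subscript."""
    return values[layer][i - 1]


fuel = unit(x, "in", 2)  # x_2^(in), the Fuel circle
b3 = unit(b, 1, 3)       # b_3^(1), the bias of the third hidden neuron
print(fuel, b3, len(b[1]))
```

The `i - 1` is there because the notation counts units from 1, while Python lists count from 0.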
4. Weights: W
Weights represent the connections between neurons.
They live on the arrows between layers.
Because weights connect two different layers, their notation is slightly different.
A weight connects a unit in layer $l$ to a unit in layer $l+1$:
$$\text{layer } l \rightarrow \text{layer } l+1$$
For the superscript, we use the destination layer:
$$W^{(l+1)} \qquad \text{superscript} \leftarrow \text{destination layer}$$
Since the superscript already points to the destination layer, the subscript follows the same idea:
$$W_{\text{destination},\ \text{source}}$$
So:
$$W_{j,k}^{(l+1)} \qquad \text{superscript} \leftarrow \text{destination layer}, \quad \text{subscript} \leftarrow \text{destination unit } j,\ \text{source unit } k$$
In words, $W_{j,k}^{(l+1)}$ is the weight going to unit $j$ in layer $l+1$, from unit $k$ in layer $l$.
The matrix shape follows the same rule:
$$W = \text{output units} \times \text{input units}$$
In plain English:
how many outputs do we want from how many inputs?
Memory hook:
Where are you going? Destination first.
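The destination-first rule maps directly onto a nested list: the outer index picks the destination unit, the inner index picks the source unit. The sketch below uses made-up weight values, and the helper name `weight` is an assumption for illustration.

```python
# W^(1): 3 hidden units (destinations) x 2 inputs (sources). Values are made up.
W1 = [
    [0.2, -0.5],  # weights going to hidden unit 1
    [0.7, 0.1],   # weights going to hidden unit 2
    [-0.3, 0.4],  # weights going to hidden unit 3
]


def weight(W, j, k):
    """Return W_{j,k}: to destination unit j, from source unit k (both 1-based)."""
    return W[j - 1][k - 1]


n_out, n_in = len(W1), len(W1[0])
print((n_out, n_in))     # (3, 2): output units x input units
print(weight(W1, 3, 2))  # 0.4: to hidden unit 3, from input 2 (Fuel)
```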
In pictures
Think of notation as an address system.
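For values like $x$, $a$, and $b$, we are describing where a value lives:
$$\text{value}_{\text{unit}}^{(\text{layer})} \qquad \text{superscript} \leftarrow \text{layer where the value lives}, \quad \text{subscript} \leftarrow \text{unit in that layer}$$
For weights $W$, we are describing an arrow between two places:
$$W_{j,k}^{(l+1)} \qquad \text{superscript} \leftarrow \text{destination layer}, \quad \text{subscript} \leftarrow \text{destination unit } j,\ \text{source unit } k$$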
1. The Shapes
At this point, we only know the shapes we are working with:
- Our input $x$ is a single launch attempt.
  - As a row, it carries 2 launch clues: Wind and Fuel.
  - So its shape is $1 \times 2$.
- We want our hidden layer to have 3 units.
- Based on our outputs × inputs rule, our weight matrix $W$ must be $3 \times 2$.
If we try to multiply a $1 \times 2$ row by a $3 \times 2$ matrix, the inner dimensions do not match:
$$(1 \times 2)(3 \times 2)$$
So we use the transpose, $W^T$, which is $2 \times 3$.
Now the multiplication works:
$$x \cdot W^T = (1 \times 2)(2 \times 3) = 1 \times 3$$
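The shape check can be tried out with plain nested lists. This is a minimal sketch: the helpers `shape`, `transpose`, and `matmul` are written here for illustration, and the numbers are made up.

```python
def shape(M):
    """Return (rows, columns) of a nested-list matrix."""
    return (len(M), len(M[0]))


def transpose(M):
    """Swap rows and columns."""
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    """Matrix product that refuses mismatched inner dimensions."""
    if shape(A)[1] != shape(B)[0]:
        raise ValueError(f"inner dimensions do not match: {shape(A)} and {shape(B)}")
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


x = [[12.0, 0.8]]                           # 1 x 2: one launch attempt
W = [[0.2, -0.5], [0.7, 0.1], [-0.3, 0.4]]  # 3 x 2: outputs x inputs

try:
    matmul(x, W)                            # (1 x 2)(3 x 2) fails
except ValueError as err:
    print(err)

z = matmul(x, transpose(W))                 # (1 x 2)(2 x 3) works
print(shape(z))                             # (1, 3)
```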
The Elephant in the Room: Why use $W^T$?
You might remember that in our single-neuron story, we multiplied the launch logbook by the weights like this:
$$Xw$$
That worked because $w$ was a simple column vector: one neuron, one weight column.
But now the hidden layer has many neurons, so our weights have grown into a full matrix $W$: many neurons, one weight matrix.
Our outputs × inputs rule stores each neuron's weights as a row of $W$, so we transpose to turn those rows into columns that the input row can multiply.
2. The Linear Combinations
Now let’s open up the multiplication.
We are building the first hidden layer, so every weight gets a (1) superscript.
For now, do not worry about the exact subscripts.
Just notice the shape:
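$$
x \cdot W^T =
\begin{bmatrix} x_1 & x_2 \end{bmatrix}
\begin{bmatrix}
w^{(1)} & w^{(1)} & w^{(1)} \\
w^{(1)} & w^{(1)} & w^{(1)}
\end{bmatrix}
$$
If we multiply the row by each column, we get three separate equations for our three new scores:
$$
\begin{aligned}
x_1 w^{(1)} + x_2 w^{(1)} &= z_1^{(1)} \\
x_1 w^{(1)} + x_2 w^{(1)} &= z_2^{(1)} \\
x_1 w^{(1)} + x_2 w^{(1)} &= z_3^{(1)}
\end{aligned}
$$
So the multiplication gives us three scores, one for each hidden unit, and they sit side by side:
$$z^{(1)} = \begin{bmatrix} z_1^{(1)} & z_2^{(1)} & z_3^{(1)} \end{bmatrix}$$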
One launch attempt gives one row of hidden-layer scores.
3. Mapping the Rows and Columns
Now the matrix starts to tell us what it is doing.
To make the map clear, let’s color-code the weights based on the score they build:
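- one color for the weights that build $z_1^{(1)}$
- one color for the weights that build $z_2^{(1)}$
- one color for the weights that build $z_3^{(1)}$
$$
\begin{array}{cl}
\begin{bmatrix} x_1 & x_2 \end{bmatrix}
\begin{bmatrix}
\textcolor{red}{w^{(1)}} & \textcolor{green}{w^{(1)}} & \textcolor{blue}{w^{(1)}} \\
\textcolor{red}{w^{(1)}} & \textcolor{green}{w^{(1)}} & \textcolor{blue}{w^{(1)}}
\end{bmatrix}
&
\begin{array}{l} \leftarrow \text{multiplies } x_1 \\ \leftarrow \text{multiplies } x_2 \end{array}
\\[4pt]
\begin{array}{ccc} \downarrow & \downarrow & \downarrow \\ z_1^{(1)} & z_2^{(1)} & z_3^{(1)} \end{array}
&
\end{array}
$$
Read the arrows:
- The first row of weights meets $x_1$.
- The second row of weights meets $x_2$.
- Each column builds one hidden-layer score.
$$
\begin{aligned}
x_1 \textcolor{red}{w^{(1)}} + x_2 \textcolor{red}{w^{(1)}} &= z_1^{(1)} \\
x_1 \textcolor{green}{w^{(1)}} + x_2 \textcolor{green}{w^{(1)}} &= z_2^{(1)} \\
x_1 \textcolor{blue}{w^{(1)}} + x_2 \textcolor{blue}{w^{(1)}} &= z_3^{(1)}
\end{aligned}
$$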
That is the map:
Rows line up with the input clues.
Columns build the hidden-layer scores.
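The row and column reading can be traced in a few lines. This is a sketch with made-up integer values; the variable name `WT` stands for the transposed weight matrix.

```python
x = [2, 3]  # x1, x2 (made-up values)
WT = [
    [1, 0, 1],  # row 1: the weights that multiply x1
    [0, 1, 1],  # row 2: the weights that multiply x2
]

# Each column of W^T builds one hidden-layer score.
z = []
for col in range(3):
    score = x[0] * WT[0][col] + x[1] * WT[1][col]
    z.append(score)

print(z)  # [2, 3, 5]
```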
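4. Filling the Destination Slot: $w_{?,\_}^{(1)}$
We are already building the first hidden layer, so the superscript stays (1).
Now add the first subscript: the destination unit.
The first column builds $z_1^{(1)}$, so its weights get destination index 1.
The second column builds $z_2^{(1)}$, so its weights get destination index 2.
The third column builds $z_3^{(1)}$, so its weights get destination index 3.
$$
\begin{array}{c}
\begin{bmatrix}
w_{1,\_}^{(1)} & w_{2,\_}^{(1)} & w_{3,\_}^{(1)} \\
w_{1,\_}^{(1)} & w_{2,\_}^{(1)} & w_{3,\_}^{(1)}
\end{bmatrix}
\\[4pt]
\begin{array}{ccc} \downarrow & \downarrow & \downarrow \\ z_1^{(1)} & z_2^{(1)} & z_3^{(1)} \end{array}
\end{array}
$$
Now the equations carry that same destination index.
$$
\begin{aligned}
x_1 w_{1,\_}^{(1)} + x_2 w_{1,\_}^{(1)} &= z_1^{(1)} \\
x_1 w_{2,\_}^{(1)} + x_2 w_{2,\_}^{(1)} &= z_2^{(1)} \\
x_1 w_{3,\_}^{(1)} + x_2 w_{3,\_}^{(1)} &= z_3^{(1)}
\end{aligned}
$$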
That is the first subscript:
destination unit first
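5. Filling the Source Slot: $w_{j,\_}^{(1)}$
The second subscript tells us where the weight is coming from.
- weights in the first row come from source 1
- weights in the second row come from source 2
$$
\begin{array}{cl}
\begin{bmatrix}
w_{1,1}^{(1)} & w_{2,1}^{(1)} & w_{3,1}^{(1)} \\
w_{1,2}^{(1)} & w_{2,2}^{(1)} & w_{3,2}^{(1)}
\end{bmatrix}
&
\begin{array}{l} \leftarrow \text{source 1} \\ \leftarrow \text{source 2} \end{array}
\\[4pt]
\begin{array}{ccc} \downarrow & \downarrow & \downarrow \\ z_1^{(1)} & z_2^{(1)} & z_3^{(1)} \end{array}
&
\end{array}
$$
Now every weight has a full address:
$$
\begin{aligned}
x_1 w_{1,1}^{(1)} + x_2 w_{1,2}^{(1)} &= z_1^{(1)} \\
x_1 w_{2,1}^{(1)} + x_2 w_{2,2}^{(1)} &= z_2^{(1)} \\
\overbrace{x_1}^{\text{source 1}} w_{3,1}^{(1)} + \overbrace{x_2}^{\text{source 2}} w_{3,2}^{(1)} &= z_3^{(1)}
\end{aligned}
$$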
That is the full weight address:
destination unit first, source input second
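With both subscripts filled in, each score can be computed straight from the addresses. In the sketch below, `W[j - 1][k - 1]` plays the role of $w_{j,k}^{(1)}$; the helper `score` and the integer values are made up for illustration.

```python
W = [[1, 0], [0, 1], [1, 1]]  # 3 x 2: row = destination unit, column = source
x = [2, 3]                    # x1, x2 (made-up values)


def score(W, x, j):
    """z_j = sum over sources k of x_k * w_{j,k}, with j counted from 1."""
    return sum(x[k] * W[j - 1][k] for k in range(len(x)))


z = [score(W, x, j) for j in (1, 2, 3)]

# Transposing only swaps the two indices: entry (k, j) of W^T is entry (j, k) of W.
WT = [list(col) for col in zip(*W)]
swapped = all(WT[k][j] == W[j][k] for j in range(3) for k in range(2))

print(z)        # [2, 3, 5]
print(swapped)  # True
```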
Scoring the Entire Logbook at Once
So far, we scored one launch attempt:
$$x \cdot W^T = 1 \times 3$$
That gave us one row of hidden-layer scores:
$$z^{(1)} = \begin{bmatrix} z_1^{(1)} & z_2^{(1)} & z_3^{(1)} \end{bmatrix}$$
But the launch logbook has more than one row.
If we score the whole logbook, lowercase x becomes capital X:
$$X = n \times 2$$
- n launch attempts.
- 2 input clues: Wind and Fuel.
Because we are still building 3 hidden units, the weight matrix stays the same.
$$W^T = 2 \times 3$$
So the multiplication becomes:
$$X \cdot W^T = (n \times 2)(2 \times 3) = n \times 3$$
The inner 2s still match.
The output is now:
one score row per launch attempt
For two launch attempts, it looks like this:
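$$
\underbrace{\begin{bmatrix} x_1 & x_2 \\ x_1 & x_2 \end{bmatrix}}_{X}
\cdot
\underbrace{\begin{bmatrix}
w_{1,1}^{(1)} & w_{2,1}^{(1)} & w_{3,1}^{(1)} \\
w_{1,2}^{(1)} & w_{2,2}^{(1)} & w_{3,2}^{(1)}
\end{bmatrix}}_{W^T}
=
\begin{bmatrix}
z_1^{(1)} & z_2^{(1)} & z_3^{(1)} \\
z_1^{(1)} & z_2^{(1)} & z_3^{(1)}
\end{bmatrix}
\begin{array}{l} \leftarrow \text{launch 1} \\ \leftarrow \text{launch 2} \end{array}
$$
Every entry keeps the layer superscript (1); the row, not the superscript, tells us which launch a score belongs to.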
That is the trick.
We do not score Launch 1, then come back and score Launch 2.
One matrix multiplication gives hidden-layer scores for the whole logbook at once.
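To see the two-launch case with numbers, here is a small worked example. All values are made up for illustration: launch 1 has clues (2, 3), launch 2 has clues (4, 1), and the weights are chosen so the arithmetic is easy to check by hand.

$$
\begin{bmatrix} 2 & 3 \\ 4 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}
=
\begin{bmatrix}
2 \cdot 1 + 3 \cdot 0 & 2 \cdot 0 + 3 \cdot 1 & 2 \cdot 1 + 3 \cdot 1 \\
4 \cdot 1 + 1 \cdot 0 & 4 \cdot 0 + 1 \cdot 1 & 4 \cdot 1 + 1 \cdot 1
\end{bmatrix}
=
\begin{bmatrix} 2 & 3 & 5 \\ 4 & 1 & 5 \end{bmatrix}
\begin{array}{l} \leftarrow \text{launch 1} \\ \leftarrow \text{launch 2} \end{array}
$$

The shapes follow the same pattern as before: $(2 \times 2)(2 \times 3) = 2 \times 3$, one row of three scores per launch.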
Vectorization: Scoring the Whole Page at Once
In the last section, we stopped scoring one launch attempt at a time and scored the whole logbook instead.
That move has a name: vectorization.
Vectorization is where linear algebra starts doing the heavy lifting.
Instead of treating the launch logbook as separate rows, we treat it as one matrix.
That lets one matrix multiplication score the whole logbook in one pass:
$$X \cdot W^T = Z^{(1)}$$
In our rocket story:
- lowercase x is one launch attempt
- capital X is the whole launch logbook
So vectorization is the jump from one row to many rows:
$$x \rightarrow X$$
One launch attempt becomes the whole logbook.
And the formula barely changes:
$$x \cdot W^T \longrightarrow X \cdot W^T$$
That is why linear algebra is so powerful in neural networks.
It lets the model stop doing this:
- score launch 1
- score launch 2
- score launch 3
and start doing this:
score the whole logbook at once
The shapes show the speed trick:
$$(n \times 2)(2 \times 3) = n \times 3$$
That output means:
one score row per launch attempt
Vectorization is not just cleaner notation.
It is how the same calculation scales from one launch attempt to the whole logbook.
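The sketch below checks that scoring launches one at a time and scoring the whole logbook give the same result. It uses plain Python lists and made-up values, and the helper `matmul` is written here for illustration. Plain Python still loops internally, so this shows the equivalence rather than the speed gain; in an array library such as NumPy, the whole-logbook product is typically written `X @ W.T`.

```python
def matmul(A, B):
    """Plain-Python matrix product: rows of A against columns of B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


W = [[1, 0], [0, 1], [1, 1]]         # 3 x 2, destination first (made-up values)
WT = [list(col) for col in zip(*W)]  # 2 x 3
X = [[2, 3], [4, 1], [0, 5]]         # 3 launches x 2 clues (made-up values)

# One launch at a time: x . W^T for each row, collected in a list.
one_by_one = [matmul([x], WT)[0] for x in X]

# Whole logbook at once: X . W^T
Z1 = matmul(X, WT)

print(Z1 == one_by_one)     # True
print(len(Z1), len(Z1[0]))  # 3 3, i.e. n x 3 with n = 3
```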
Mission Control Briefing
Input Layer (in)
The nervous intern with the clipboard: Wind and Fuel. It does not decide anything; it just reports the facts.