Neural Networks & Transformers: A Visual Introduction

The core building blocks: linear units, nonlinearity, loss & gradient descent, softmax & cross‑entropy, dot‑product attention, positional encodings, layer norm & residuals, and receptive fields. Each section has an explainer, a formula, and a small visual.

Linear Units & Nonlinearity

Every neural network, no matter how massive, is built from one absurdly simple building block: the neuron. A neuron takes some inputs, multiplies each by a weight, adds them up, and then squishes the result through a nonlinear function. That's it. The entire deep learning revolution rests on this.

Why does the nonlinearity matter so much? Without it, stacking layers would be pointless — two linear transformations in a row are just one linear transformation in disguise. The activation function (ReLU, GELU, tanh) is what lets a network bend and curve to fit real data. It's the difference between drawing a straight line and drawing anything you want.

Formally, a single neuron computes: $$y = \phi(\mathbf w\cdot\mathbf x + b).$$ Without $\phi$, stacks of linear layers collapse to one linear map. With $\phi$, you can approximate arbitrarily complex functions.

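If you'd like to see the formula in code, here's a minimal NumPy sketch of a single neuron (the inputs and weights are made-up values, just for illustration):

```python
import numpy as np

def relu(z):
    # the nonlinearity phi: zero out negatives, pass positives through
    return np.maximum(0.0, z)

def neuron(x, w, b, phi=relu):
    # y = phi(w . x + b): weighted sum, plus bias, through the squish
    return phi(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.8, 0.2, -0.5])   # weights
b = 0.1                          # bias
print(neuron(x, w, b))           # pre-activation is -0.7, so ReLU gives 0.0
```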

Softmax & Cross‑Entropy

When a network needs to pick one option out of many — "is this a cat, dog, or bird?" — it produces raw scores called logits. But raw scores aren't probabilities. Softmax fixes that: it exponentiates each score and normalizes them so they sum to one. Bigger logit, bigger probability. Simple and differentiable.

Now, how do we measure how wrong the network is? That's where cross-entropy comes in. It computes $-\log p_t$, the negative log of the probability assigned to the correct class. Why log probability? Because it punishes confident wrong answers far more than uncertain ones. If the network says "I'm 99% sure this is a dog" and it's actually a cat, the loss explodes. That harsh penalty is exactly what drives fast learning.

Softmax turns logits into probabilities: $$p_i=\frac{e^{z_i}}{\sum_j e^{z_j}}.$$ For a target class $t$, cross‑entropy is $L=-\log p_t$.

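A minimal NumPy sketch of both pieces (the logits are invented for illustration):

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(logits, target):
    # L = -log p_t: the negative log probability of the correct class
    return -np.log(softmax(logits)[target])

logits = np.array([2.0, 0.5, -1.0])   # raw scores: cat, dog, bird
print(softmax(logits))                # probabilities that sum to 1
print(cross_entropy(logits, 0))       # small loss: the right class got high probability
print(cross_entropy(logits, 2))       # large loss: a confident wrong answer is punished
```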

Dot‑Product Attention

Attention is the mechanism that lets a network decide what to focus on. Think of it like a soft dictionary lookup: the query says "what am I looking for?", the keys say "here's what I have", and the dot product between them measures how well each key matches the query. High match? Pay attention. Low match? Ignore it.

The scores get scaled (divided by $\sqrt{d}$ to keep gradients well-behaved) and then passed through softmax to become proper weights. The result is a smooth, differentiable way for the network to route information — no hard choices, just weighted blending. This is the beating heart of the transformer.

Formally: the attention weight of query $q_i$ on key $k_j$ is the softmax of the scaled dot products, $$\alpha_{ij}=\operatorname{softmax}_j\!\left(\frac{\mathbf q_i\cdot\mathbf k_j}{\sqrt d}\right).$$
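
Here's the weight computation as a NumPy sketch (random vectors stand in for learned queries and keys):

```python
import numpy as np

def attention_weights(Q, K):
    # scaled dot products, then a softmax over each row of scores
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # how well each key matches each query
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries of dimension 8
K = rng.normal(size=(4, 8))   # 4 keys
W = attention_weights(Q, K)
print(W.sum(axis=-1))         # each query's weights sum to 1
```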

Positional Encodings

Here's a problem: attention treats its inputs as a set, not a sequence. Without positional information, "the cat sat on the mat" and "the mat sat on the cat" look identical to the network. That's the bag-of-words problem, and it's a dealbreaker for language.

The fix is elegant: add a unique positional signal to each token's embedding before it enters the network. The original transformer uses sinusoidal functions at different frequencies — think of it like giving each position a unique fingerprint made of overlapping waves. Low-frequency waves encode coarse position (beginning vs. end), high-frequency waves encode fine position (this word vs. the next).

Sinusoidal bands encode position at multiple scales, letting the network distinguish word order without any learned parameters.
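
A small sketch of the sinusoidal scheme from the original transformer (shapes are arbitrary; this version assumes `d_model` is even):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    # even dims get sin, odd dims get cos, at geometrically spaced frequencies:
    # pe[pos, 2i] = sin(pos / 10000^(2i/d_model)), pe[pos, 2i+1] = cos(...)
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # small i: fast waves, fine position
    pe[:, 1::2] = np.cos(angles)   # large i: slow waves, coarse position
    return pe

pe = sinusoidal_pe(seq_len=50, d_model=16)
print(pe.shape)                   # (50, 16): one fingerprint per position
print(np.allclose(pe[3], pe[7]))  # False: positions 3 and 7 get different fingerprints
```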

Layer Norm & Residuals

Deep networks have a dirty secret: they're really hard to train. As signals pass through dozens of layers, activations can drift wildly — some exploding, others vanishing. Layer normalization fixes this by re-centering and re-scaling activations at each layer, keeping everything in a well-behaved range. It's like recalibrating your instruments between each measurement.

Residual connections are the other essential trick. Instead of asking each layer to compute the full output, you let it compute a small correction and add that to the input: $\text{output} = x + f(x)$. This gives gradients a highway straight back to early layers during backpropagation, solving the vanishing gradient problem. Together, layer norm and residuals are what make it possible to stack 100+ layers without everything falling apart.
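
A bare-bones sketch of both tricks (the learned gain and bias that real layer norm adds are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # re-center and re-scale each activation vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual(x, f):
    # output = x + f(x): the layer only has to learn a correction
    return x + f(layer_norm(x))

x = np.random.default_rng(1).normal(size=(4, 8)) * 100   # wildly drifting activations
print(layer_norm(x).std(axis=-1))                        # back to ~1 after normalization
```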

Receptive Fields

A receptive field is the region of input that a particular neuron can "see." In a convolutional network, each neuron only looks at a small local patch — a 3x3 window, say. Stack more layers and the receptive field grows, but it's always bounded. Attention flips this on its head: every token can attend to every other token in a single layer. The receptive field is the entire sequence, immediately.

This matters because it determines what relationships the network can capture. Convolutions are great for local patterns (edges, textures) but struggle with long-range dependencies. Attention handles global context naturally, which is why transformers dominate language tasks where a word's meaning can depend on something paragraphs away.
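
To put a number on the contrast: with stride-1 convolutions, each layer widens the receptive field by only kernel - 1 inputs. A quick sketch of that arithmetic:

```python
def conv_receptive_field(num_layers, kernel=3):
    # stacked stride-1 convolutions: each layer adds (kernel - 1) to the view
    return 1 + num_layers * (kernel - 1)

for n in (1, 2, 5, 10):
    print(f"{n} conv layers see {conv_receptive_field(n)} inputs")
# one attention layer sees the entire sequence, whatever its length
```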

Gradient Descent (2D Bowl)

Training a neural network means finding the set of weights that minimizes a loss function. Imagine the loss as a landscape — hills, valleys, ridges — and your current weights as a ball sitting somewhere on that surface. Gradient descent simply says: look which way is downhill, and take a step in that direction.

The learning rate controls how big each step is. Too small, and you'll crawl toward the minimum painfully slowly. Too large, and you'll overshoot, bouncing around the valley or even flying off entirely. The sweet spot depends on the curvature of the landscape. This demo lets you feel that tradeoff: try cranking the learning rate up and watch the optimizer go haywire.
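
The update rule is just $w \leftarrow w - \eta\,\nabla L(w)$. Here's a tiny sketch on a bowl-shaped loss (the function and learning rates are illustrative):

```python
import numpy as np

def grad(p):
    # loss f(x, y) = x^2 + 4 y^2: a bowl, steeper along y
    return np.array([2.0 * p[0], 8.0 * p[1]])

def descend(lr, steps=25):
    p = np.array([2.0, 1.0])     # starting weights
    for _ in range(steps):
        p = p - lr * grad(p)     # step downhill, scaled by the learning rate
    return p

print(descend(lr=0.05))   # crawls steadily toward the minimum at (0, 0)
print(descend(lr=0.30))   # overshoots along the steep y direction and blows up
```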

A Tiny MLP (Two ReLUs)

A multilayer perceptron (MLP) is what you get when you stack neurons into layers. This tiny one has just two hidden neurons with ReLU activations, and their outputs get combined into a single value. Despite its simplicity, it can already produce piecewise-linear functions — straight segments joined at kinks — which is surprisingly powerful.

Each ReLU neuron contributes a "hinge" to the output. With two of them, you can make a bump, a notch, or a ramp. Add more neurons and you can approximate any continuous function to arbitrary precision. That's the universal approximation theorem in action. Play with the weights below to see how two simple hinges combine into richer shapes.

Equation: $$y = v_1\,\mathrm{ReLU}(w_1 x + b_1) + v_2\,\mathrm{ReLU}(w_2 x + b_2) + c.$$
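
Here's the same network as a NumPy sketch, with one illustrative weight setting (a ramp that rises and then flattens; try other values to make bumps and notches):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tiny_mlp(x, w1=1.0, b1=0.0, v1=1.0, w2=1.0, b2=-1.0, v2=-1.0, c=0.0):
    # two hinges, combined linearly; each ReLU contributes one kink
    return v1 * relu(w1 * x + b1) + v2 * relu(w2 * x + b2) + c

x = np.linspace(-2.0, 2.0, 9)
print(tiny_mlp(x))   # flat at 0, rises from x=0 to x=1, then plateaus at 1
```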

A Toy Tokenizer (BPE)

Before a transformer can process text, it needs to break it into pieces — tokens. But which pieces? Individual characters are too granular (the model has to learn spelling from scratch). Whole words are too coarse (you'd need a token for every word ever written, including "un-tokenizable"). Byte Pair Encoding (BPE) finds a sweet spot by starting with characters and greedily merging the most frequent pairs.

The clever part: BPE automatically discovers sub-word units that balance vocabulary size and coverage. Common words like "the" become single tokens. Rare words get split into familiar chunks — "tokenizer" might become "token" + "izer." This means the model never encounters a truly unknown word, and common patterns get efficient single-token representations. Step through the merges below to watch vocabulary emerge from raw characters.

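A toy version of the merge loop in Python (the word list is invented, and real BPE weights pairs by corpus frequency rather than counting each word once):

```python
from collections import Counter

def bpe_merges(words, num_merges=5):
    # start from characters; greedily merge the most frequent adjacent pair
    tokens = {w: list(w) for w in words}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in tokens.values():
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        for w, toks in tokens.items():
            merged, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == (a, b):
                    merged.append(a + b)   # replace the pair with its merged token
                    i += 2
                else:
                    merged.append(toks[i])
                    i += 1
            tokens[w] = merged
    return merges, tokens

merges, tokens = bpe_merges(["the", "them", "then", "token", "tokens"])
print(merges)   # frequent pairs like ('t','h') merge first
print(tokens)   # each word as its current token sequence
```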

Attention: Values Aggregation

Computing attention weights is only half the story. Once we know how much to attend to each position, we use those weights to create a weighted blend of value vectors. Each token contributes its value in proportion to its attention weight — highly attended tokens dominate the output, while ignored tokens contribute almost nothing.

This is where information actually flows. The attention weights decide "who to listen to," and the values carry "what they have to say." The result is a new representation for each position that mixes information from across the sequence. It's how a word like "bank" can absorb context from "river" or "money" depending on the sentence.
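
Completing the sketch from the weights section: the same scaled dot-product scores, now used to blend value vectors (all matrices random, for shape only):

```python
import numpy as np

def attend(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    W = e / e.sum(axis=-1, keepdims=True)   # who to listen to
    return W @ V                            # what they have to say, blended

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))   # 4 tokens, dimension 8
print(attend(Q, K, V).shape)           # (4, 8): one blended vector per position
```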

Multi‑Head Attention (Weights)

One attention head can only focus on one type of relationship at a time. But language is rich — a word might simultaneously need to attend to its syntactic subject, the verb it modifies, and a coreferent pronoun three sentences back. Multi-head attention solves this by running several attention mechanisms in parallel, each with its own learned queries, keys, and values.

Each head is free to specialize. In practice, researchers have found that different heads learn to track different linguistic phenomena: some handle positional adjacency, others track syntactic dependencies, and some capture semantic similarity. Their outputs are concatenated and projected back down, giving the network a multi-faceted view of the sequence in a single layer.
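
A shape-level sketch of the mechanism (random matrices stand in for the learned projections; a trained model would learn $W_Q$, $W_K$, $W_V$, and the output projection):

```python
import numpy as np

def multi_head_attention(X, num_heads, rng):
    # each head gets its own Q/K/V projections (random here, for shapes only)
    seq, d = X.shape
    dh = d // num_heads            # per-head dimension
    heads = []
    for _ in range(num_heads):
        Wq, Wk, Wv = (rng.normal(size=(d, dh)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(dh)
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A = e / e.sum(axis=-1, keepdims=True)
        heads.append(A @ V)
    Wo = rng.normal(size=(d, d))
    return np.concatenate(heads, axis=-1) @ Wo   # concat heads, project back down

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))   # 6 tokens, model dim 16
print(multi_head_attention(X, num_heads=4, rng=rng).shape)   # (6, 16)
```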

Position‑wise Feed‑Forward (FFN)

After attention has mixed information across positions, each token passes through a small feed-forward network — independently and identically. This is where the transformer does its "thinking" per token. The FFN typically expands the representation to 4x its size, applies a nonlinearity (ReLU or GELU), then projects back down.

Why is this needed? Attention is great at routing and blending information, but it's fundamentally a weighted average — a linear operation over values. The FFN adds the nonlinear processing power needed to actually transform representations. Recent research suggests these layers act as key-value memories, storing factual knowledge the model has learned during training.
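
A minimal sketch with the usual 4x expansion and a GELU (using the common tanh approximation; the weights are random placeholders):

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def ffn(x, W1, b1, W2, b2):
    # expand to 4x the width, nonlinearity, project back down;
    # applied to every token independently and identically
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, hidden = 16, 64   # hidden = 4 * d
W1, b1 = rng.normal(size=(d, hidden)) * 0.1, np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, d)) * 0.1, np.zeros(d)
x = rng.normal(size=(6, d))            # 6 tokens
print(ffn(x, W1, b1, W2, b2).shape)    # (6, 16): same shape per token
```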

Putting It Together: A Transformer Block

Now we assemble all the pieces. A single transformer block follows a clean recipe: layer norm, multi-head self-attention, add a residual connection, then layer norm again, feed-forward network, and another residual connection. That's it. Stack 12, 24, or 96 of these blocks and you get GPT, BERT, or any modern large language model.

The elegance is in the modularity. Attention handles cross-token communication ("what context matters here?"), the FFN handles per-token computation ("given this context, what does it mean?"), layer norm keeps signals stable, and residual connections let gradients flow. Each block refines the representation a little further, building increasingly abstract understanding of the input.
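
To make the recipe concrete, here's a sketch chaining the pieces above into one pre-norm block. It uses single-head attention with identity projections and reuses one set of FFN weights across blocks, simplifications a real model wouldn't make:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def self_attention(X):
    # single-head self-attention with identity projections, for brevity
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ X

def ffn(X, W1, W2):
    return np.maximum(0.0, X @ W1) @ W2

def transformer_block(X, W1, W2):
    X = X + self_attention(layer_norm(X))   # norm -> attention -> residual
    X = X + ffn(layer_norm(X), W1, W2)      # norm -> FFN -> residual
    return X

rng = np.random.default_rng(0)
d = 16
W1 = rng.normal(size=(d, 4 * d)) * 0.1
W2 = rng.normal(size=(4 * d, d)) * 0.1
X = rng.normal(size=(6, d))   # 6 tokens
for _ in range(3):            # stacking blocks refines the representation
    X = transformer_block(X, W1, W2)
print(X.shape)                # (6, 16)
```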

Connections
  • Dot products & linear maps → Toolkit.
  • Optimization intuition → Primer.
  • Softmax appears in attention & classification.