Neural Networks & Transformers: A Visual Introduction
The core building blocks: linear units, nonlinearity, loss & gradient descent, softmax & cross‑entropy, dot‑product attention, positional encodings, layer norm & residuals, and receptive fields. Each section has an explainer, a formula, and a small visual.
How this connects to earlier pieces:
- Dot product & linear maps → Toolkit.
- Optimization landscape intuition → Primer: Potentials.
- Attention weights use softmax and dot products → see below; also Toolkit: Fourier for oscillatory ideas.
Linear Units & Nonlinearity
A single neuron computes a weighted sum and passes it through a nonlinearity: $$y = \phi(\mathbf w\cdot\mathbf x + b).$$ Without $\phi$, stacks of linear layers collapse to one linear map. With $\phi$ (ReLU, GELU, tanh), you can fit bends in the data.
Decision line: ….
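To make the collapse argument concrete, here is a minimal NumPy sketch (the weights, bias, and inputs are made-up illustration values): one ReLU neuron, plus a check that two stacked linear layers without $\phi$ reduce to a single matrix.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# A single neuron: weighted sum plus bias, then a nonlinearity.
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 3.0])
y = relu(w @ x + b)          # y = phi(w . x + b)

# Without a nonlinearity, two linear layers collapse to one linear map:
W1 = np.array([[1.0, 2.0], [0.0, -1.0]])
W2 = np.array([[0.5, 1.0], [2.0, 0.0]])
x2 = np.array([0.3, -0.7])
two_layers = W2 @ (W1 @ x2)   # linear applied after linear
one_layer  = (W2 @ W1) @ x2   # the same map as a single matrix
assert np.allclose(two_layers, one_layer)
print(y, two_layers)
```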
Softmax & Cross‑Entropy
Softmax turns logits into probabilities: $$p_i=\frac{e^{z_i}}{\sum_j e^{z_j}}.$$ For a target class $t$, cross‑entropy is $L=-\log p_t$.
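A small NumPy sketch of both formulas, with arbitrary illustration logits; subtracting the max logit before exponentiating avoids overflow and leaves the probabilities unchanged.

```python
import numpy as np

def softmax(z):
    # Softmax is invariant to adding a constant to all logits,
    # so subtracting the max is safe and numerically stable.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, target):
    # L = -log p_t for the target class t.
    p = softmax(logits)
    return -np.log(p[target])

logits = np.array([2.0, 0.5, -1.0])     # made-up logits for three classes
print(softmax(logits))                  # probabilities summing to 1
print(cross_entropy(logits, target=0))  # loss if class 0 is correct
```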
Dot‑Product Attention
Attention weights for one query $q_i$ over keys $K$ are a softmax of scaled dot products: $$\alpha_i=\operatorname{softmax}\!\left(\frac{q_i K^\top}{\sqrt{d_k}}\right),$$ and the output mixes the values accordingly: $\sum_j \alpha_{ij}\,\mathbf v_j$.
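A minimal NumPy sketch of this formula; the query, key, and value sizes are arbitrary illustration choices, and each query's weights are checked to sum to 1.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # one row of scores per query
    weights = softmax(scores, axis=-1) # attention weights per query
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries, d_k = 4 (illustrative sizes)
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 3))   # 5 values, d_v = 3
out, w = attention(Q, K, V)
print(w.sum(axis=-1))         # each query's weights sum to 1
```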
Positional Encodings
Sinusoidal bands encode position at multiple scales: $$PE_{(pos,\,2i)}=\sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),\qquad PE_{(pos,\,2i+1)}=\cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right).$$ Fast bands distinguish neighboring positions; slow bands encode coarse, long-range position.
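A short NumPy sketch of these bands, assuming an even $d_{\text{model}}$; the position count and model dimension are illustration values.

```python
import numpy as np

def sinusoidal_pe(n_positions, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even columns: sine bands
    pe[:, 1::2] = np.cos(angles)   # odd columns: cosine bands
    return pe

pe = sinusoidal_pe(n_positions=50, d_model=16)  # illustrative sizes
print(pe.shape)   # (50, 16): one row per position
```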
Layer Norm & Residuals
Layer norm rescales each token's feature vector to zero mean and unit variance, then applies a learned scale and shift: $$\mathrm{LN}(\mathbf x)=\gamma\odot\frac{\mathbf x-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta.$$ Residual connections add a block's input back to its output, $y=\mathbf x+f(\mathbf x)$, keeping gradients flowing through deep stacks.
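A minimal sketch, assuming the common pre-norm layout $y=\mathbf x+f(\mathrm{LN}(\mathbf x))$; the sublayer $f$ here is a toy stand-in, and the token and feature counts are illustration values.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the feature dimension of each token, then scale & shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def residual_block(x, f, gamma, beta):
    # Pre-norm residual connection: y = x + f(LayerNorm(x)).
    return x + f(layer_norm(x, gamma, beta))

d = 8
x = np.random.default_rng(1).normal(size=(3, d))  # 3 tokens, 8 features
gamma, beta = np.ones(d), np.zeros(d)
f = lambda h: np.maximum(0.0, h)                  # toy stand-in sublayer
y = residual_block(x, f, gamma, beta)
print(y.shape)   # (3, 8): same shape as the input, as residuals require
```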
Receptive Fields
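Assuming this section is about how stacking local operations (e.g. 3×3 convolutions) widens the window of input that each output position sees, here is the standard receptive-field recurrence as a small sketch; the layer list is a made-up example.

```python
def receptive_field(layers):
    # layers: list of (kernel_size, stride) for each stacked layer.
    # Recurrence: r <- r + (k - 1) * j, where j is the cumulative
    # stride ("jump") measured in input positions.
    r, j = 1, 1
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r

# Three stacked 3-wide, stride-1 layers see a 7-position input window.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))   # -> 7
```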
Gradient Descent (2D Bowl)
…
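A minimal sketch of the update rule $\theta \leftarrow \theta - \eta\,\nabla L(\theta)$ on a quadratic bowl; the curvatures, learning rate, and starting point are illustration values, not those of the live visual.

```python
import numpy as np

def loss(theta, A):
    # A convex quadratic "bowl": L(theta) = 1/2 theta^T A theta.
    return 0.5 * theta @ A @ theta

def grad(theta, A):
    return A @ theta

A = np.diag([1.0, 10.0])          # elongated bowl (made-up curvatures)
theta = np.array([4.0, 2.0])      # arbitrary starting point
lr = 0.05                         # learning rate eta
for step in range(200):
    theta = theta - lr * grad(theta, A)   # theta <- theta - eta * dL/dtheta
print(theta, loss(theta, A))      # converges toward the minimum at the origin
```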
A Tiny MLP (Two ReLUs)
Equation: …
Gradients at x₀: …
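A minimal sketch of a one-input, two-ReLU, one-output MLP with hand-computed gradients at a point $x_0$; every parameter value and $x_0$ itself are made up for illustration and need not match the live visual's readouts.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# y = a1*relu(w1*x + b1) + a2*relu(w2*x + b2) + c   (two hidden ReLU units)
w = np.array([1.5, -2.0])   # hidden weights (illustration values)
b = np.array([-0.5, 1.0])   # hidden biases
a = np.array([1.0, 0.8])    # output weights
c = 0.1                     # output bias

def mlp(x):
    h = relu(w * x + b)
    return a @ h + c

# Gradients at a chosen input x0, by the chain rule:
x0 = 0.7
z = w * x0 + b                  # pre-activations of the two units
gate = (z > 0).astype(float)    # ReLU gate: 1 where a unit is on, else 0
dy_da = relu(z)                 # dy/da_k = relu(z_k)
dy_dw = a * gate * x0           # dy/dw_k = a_k * 1[z_k > 0] * x0
dy_db = a * gate                # dy/db_k = a_k * 1[z_k > 0]
dy_dx = np.sum(a * gate * w)    # dy/dx0  = sum_k a_k * 1[z_k > 0] * w_k
print(mlp(x0), dy_dw, dy_db, dy_da, dy_dx)
```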