Neural Networks & Transformers: A Visual Introduction

The core building blocks: linear units, nonlinearity, loss & gradient descent, softmax & cross‑entropy, dot‑product attention, positional encodings, layer norm & residuals, and receptive fields. Each section has an explainer, a formula, and a small visual.

Linear Units & Nonlinearity

A single neuron computes a weighted sum and passes it through a nonlinearity: $$y = \phi(\mathbf w\cdot\mathbf x + b).$$ Without $\phi$, stacks of linear layers collapse to one linear map. With $\phi$ (ReLU, GELU, tanh), you can fit bends in the data.

[Interactive visual: the unit's decision line over its inputs.]
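
A tiny numerical sketch of the formula, with made-up weights (not the ones in the visual) and ReLU as $\phi$:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Illustrative parameters for a single unit with a 2-D input.
w = np.array([1.5, -2.0])
b = 0.5

x = np.array([0.8, 0.3])
pre = w @ x + b      # weighted sum: w . x + b  ->  0.6 + 0.5 = 1.1
y = relu(pre)        # nonlinearity phi = ReLU (positive input passes through)
print(pre, y)
```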

Softmax & Cross‑Entropy

Softmax turns logits into probabilities: $$p_i=\frac{e^{z_i}}{\sum_j e^{z_j}}.$$ For a target class $t$, cross‑entropy is $L=-\log p_t$.

[Interactive visual: logits, their softmax probabilities, and the resulting cross‑entropy.]
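
A short sketch of both formulas on illustrative logits; the max is subtracted before exponentiating, which leaves the probabilities unchanged but avoids overflow:

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability; probabilities unchanged
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])   # illustrative logits z
p = softmax(logits)
t = 0                                  # target class index
loss = -np.log(p[t])                   # cross-entropy L = -log p_t
print(p.round(3), round(loss, 3))
```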

Dot‑Product Attention

Weights for one query $q_i$ over keys $K$ are the softmax of scaled dot products: $$\alpha_i=\mathrm{softmax}\!\left(\frac{q_i K^\top}{\sqrt{d_k}}\right).$$
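
A sketch of the weights for one query over five keys, using random illustrative vectors:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d_k = 8
rng = np.random.default_rng(0)
q_i = rng.normal(size=d_k)          # one query vector
K = rng.normal(size=(5, d_k))       # five key vectors

scores = K @ q_i / np.sqrt(d_k)     # scaled dot products q_i . k_j / sqrt(d_k)
alpha = softmax(scores)             # attention weights over the 5 keys
print(alpha.round(3), alpha.sum())  # the weights sum to 1
```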

Positional Encodings

Sinusoidal bands encode position at multiple scales: $$PE_{(pos,\,2i)}=\sin\!\left(\frac{pos}{10000^{2i/d}}\right),\qquad PE_{(pos,\,2i+1)}=\cos\!\left(\frac{pos}{10000^{2i/d}}\right).$$ Early dimensions oscillate quickly with position; later ones vary slowly, so together they cover many scales.
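
A sketch that builds the standard sinusoidal table (positions down the rows, sine/cosine bands across the columns); the sizes are illustrative:

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(num_positions)[:, None]         # (P, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d/2) band index
    angles = pos / (10000 ** (2 * i / d_model))     # (P, d/2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dims: sine bands
    pe[:, 1::2] = np.cos(angles)                    # odd dims: matching cosine bands
    return pe

pe = sinusoidal_pe(num_positions=50, d_model=16)
print(pe.shape)   # (50, 16)
```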

Layer Norm & Residuals
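
A minimal sketch, assuming the standard formulation: normalize each token's features to zero mean and unit variance, rescale with a learned gain and bias, and add the sub‑layer's output back onto its input (the residual):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the feature dimension of each token, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

d = 8
x = np.random.default_rng(1).normal(size=(4, d))   # 4 tokens, d features each
gamma, beta = np.ones(d), np.zeros(d)

def sublayer(h):                  # stand-in for attention or a feed-forward layer
    return 0.1 * h

out = x + sublayer(layer_norm(x, gamma, beta))     # residual: x + f(LN(x))
print(out.shape)
```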

Receptive Fields
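
For stacked convolutions, the receptive field grows by $(k-1)$ times the cumulative stride at each layer. A sketch of that recurrence, with illustrative (kernel, stride) pairs:

```python
# Receptive-field recurrence for a stack of convolutions:
#   r <- r + (k - 1) * jump,   jump <- jump * s
layers = [(3, 1), (3, 1), (3, 2), (3, 1)]   # illustrative (kernel, stride) pairs

r, jump = 1, 1
for k, s in layers:
    r += (k - 1) * jump
    jump *= s
    print(f"kernel={k} stride={s} -> receptive field {r}")
```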

Gradient Descent (2D Bowl)
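
A sketch of plain gradient descent on an illustrative bowl $f(x,y)=x^2+3y^2$, whose gradient is $(2x,\,6y)$:

```python
import numpy as np

def grad(p):                      # gradient of f(x, y) = x^2 + 3y^2
    return np.array([2 * p[0], 6 * p[1]])

p = np.array([2.0, 1.5])          # starting point
lr = 0.1                          # learning rate
for step in range(25):
    p = p - lr * grad(p)          # step downhill along the negative gradient
print(p)                          # close to the minimum at (0, 0)
```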

A Tiny MLP (Two ReLUs)

[Interactive visual: the network's equation with its current parameters, and the gradients at a chosen point x₀.]
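
A minimal sketch of such a model, assuming the parameterization $y = a_1\,\mathrm{ReLU}(w_1 x + b_1) + a_2\,\mathrm{ReLU}(w_2 x + b_2) + c$ (an illustrative choice, not necessarily the one in the visual), with the gradients at a point x₀ worked out by hand:

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

# Illustrative parameters: two hidden ReLU units, scalar input and output.
w1, b1, a1 = 1.0, -0.5, 2.0
w2, b2, a2 = -1.5, 1.0, 1.0
c = 0.2

def forward(x):
    h1, h2 = relu(w1 * x + b1), relu(w2 * x + b2)
    return a1 * h1 + a2 * h2 + c

x0 = 1.2
# ReLU's derivative is 1 where its input is positive, 0 otherwise.
g1 = 1.0 if w1 * x0 + b1 > 0 else 0.0
g2 = 1.0 if w2 * x0 + b2 > 0 else 0.0
dy_dw1 = a1 * g1 * x0                 # chain rule through the first unit
dy_dw2 = a2 * g2 * x0                 # chain rule through the second unit
dy_dx  = a1 * g1 * w1 + a2 * g2 * w2  # sensitivity to the input itself
print(forward(x0), dy_dw1, dy_dw2, dy_dx)
```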

A Toy Tokenizer (BPE)

BPE starts from characters and repeatedly merges the most frequent adjacent pair of tokens into a new token.

[Interactive visual: the current tokens, the top candidate pairs, and the merges applied so far.]
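
A toy version of the loop the visual steps through: count adjacent pairs, merge the most frequent pair everywhere it occurs, repeat. The sample text and merge count are illustrative:

```python
from collections import Counter

def bpe_merges(text, num_merges=5):
    tokens = list(text)                            # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))   # count adjacent pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]   # most frequent pair
        merges.append(a + b)
        merged, i = [], 0
        while i < len(tokens):                     # replace every (a, b) occurrence
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_merges("low lower lowest", num_merges=4)
print(merges)   # merged symbols in order, e.g. 'lo' then 'low' ...
print(tokens)
```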

Attention: Values Aggregation
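
Given weights like those in the previous section, each output is the weighted average of the value vectors. A sketch with made-up weights and values:

```python
import numpy as np

alpha = np.array([0.1, 0.6, 0.2, 0.1])              # illustrative attention weights (sum to 1)
V = np.random.default_rng(2).normal(size=(4, 8))    # one value vector per key

out = alpha @ V          # weighted sum of the rows of V
print(out.shape)         # (8,) -- same size as a single value vector
```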

Multi‑Head Attention (Weights)
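
A sketch of the per-head weight maps, assuming the model dimension is split evenly across heads and each head gets its own query/key projections (random illustrative matrices):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
T, d_model, n_heads = 6, 16, 4
d_k = d_model // n_heads

X = rng.normal(size=(T, d_model))               # token representations
Wq = rng.normal(size=(n_heads, d_model, d_k))   # per-head query projections
Wk = rng.normal(size=(n_heads, d_model, d_k))   # per-head key projections

weights = []
for h in range(n_heads):
    Q, K = X @ Wq[h], X @ Wk[h]                      # (T, d_k) each
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)     # (T, T) weights for head h
    weights.append(A)
weights = np.stack(weights)                          # (n_heads, T, T)
print(weights.shape, weights.sum(axis=-1).round(3))  # each row sums to 1
```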

Position‑wise Feed‑Forward (FFN)
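
A sketch of the position-wise feed-forward layer: the same two-layer MLP (expand, apply ReLU, project back) is applied independently at every position; the 4x expansion is a common but illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
T, d_model, d_ff = 6, 16, 64                  # d_ff is typically ~4x d_model

X = rng.normal(size=(T, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # Applied row by row: every position is transformed with the same weights.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

print(ffn(X).shape)   # (6, 16) -- same shape in, same shape out
```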

Putting It Together: A Transformer Block
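
A compact sketch wiring the pieces above into one block, assuming a pre-norm layout ($x + \mathrm{Attn}(\mathrm{LN}(x))$, then $x + \mathrm{FFN}(\mathrm{LN}(x))$), a single head, and no masking; all sizes and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
T, d = 6, 16

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    m, v = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - m) / np.sqrt(v + eps)

Wq = rng.normal(size=(d, d)) / np.sqrt(d)
Wk = rng.normal(size=(d, d)) / np.sqrt(d)
Wv = rng.normal(size=(d, d)) / np.sqrt(d)
Wo = rng.normal(size=(d, d)) / np.sqrt(d)
W1 = rng.normal(size=(d, 4 * d)) / np.sqrt(d)
W2 = rng.normal(size=(4 * d, d)) / np.sqrt(4 * d)

def attention(x):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    A = softmax(Q @ K.T / np.sqrt(d))     # (T, T) attention weights
    return (A @ V) @ Wo                   # aggregate values, then project

def ffn(x):
    return np.maximum(0.0, x @ W1) @ W2

def block(x):
    x = x + attention(layer_norm(x))      # attention sub-layer with residual
    x = x + ffn(layer_norm(x))            # feed-forward sub-layer with residual
    return x

X = rng.normal(size=(T, d))
print(block(X).shape)                     # (6, 16): shape preserved through the block
```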

Connections
  • Dot products & linear maps → Toolkit.
  • Optimization intuition → Primer.
  • Softmax appears in attention & classification.