Neural Network Playground
Build a small neural network, pick a dataset, hit Play, and watch backpropagation carve a decision boundary in real time. The 2D classification problems that logistic regression and the perceptron can’t solve (XOR, spirals) become solvable once you stack a hidden layer.
Theory & exercises · backprop derivation, watch-outs, ideas to try
The math, in four passes
1. Forward pass.
Each layer transforms its input by a weighted sum + bias + activation:
$$ z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)} \qquad a^{(l)} = \phi(z^{(l)}) $$
Here $a^{(0)} = x$ is the input, and the final layer’s $a$ is the prediction $\hat{y}$.
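The forward pass can be sketched in a few lines of plain Python. This is a minimal illustration, not the playground's own code: the 2-3-2 layer sizes, the hand-picked weights, and the choice of ReLU for the hidden layer with softmax at the output are all assumptions made for the example.

```python
import math

def relu(z):
    return [max(0.0, v) for v in z]

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def affine(W, a, b):
    """z = W a + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + bi for row, bi in zip(W, b)]

def forward(x, layers):
    """layers is a list of (W, b, phi); returns every activation, a[0] = x."""
    activations = [x]
    for W, b, phi in layers:
        z = affine(W, activations[-1], b)
        activations.append(phi(z))
    return activations

# Hypothetical 2-3-2 network with hand-picked (untrained) weights.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0], relu),
    ([[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]], [0.0, 0.0], softmax),
]

activations = forward([1.0, 0.0], layers)
y_hat = activations[-1]
print(y_hat)  # roughly [0.881, 0.119]
```

Keeping every intermediate activation, rather than only the output, is deliberate: the backward pass needs $a^{(l-1)}$ for each layer.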
2. Loss — cross-entropy.
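$$ L = -\sum_i t_i \log \hat{y}_i $$
For two-class problems, $t \in \{(1,0),(0,1)\}$ is a one-hot vector. Only the true class has $t_i = 1$, so the loss reduces to the negative log of the probability the network assigns to the correct class: close to 0 when that probability is near 1, and growing without bound as it approaches 0.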
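To get a feel for the numbers, here is a small sketch of the loss on three made-up predictions for the same target. The `eps` clamp is an assumption added to avoid `log(0)`; it is not part of the formula above.

```python
import math

def cross_entropy(t, y_hat, eps=1e-12):
    # eps guards against log(0); an assumption for numerical safety.
    return -sum(ti * math.log(max(yi, eps)) for ti, yi in zip(t, y_hat))

t = [1.0, 0.0]  # true class is the first one

confident_right = cross_entropy(t, [0.9, 0.1])  # -ln 0.9, about 0.105
unsure = cross_entropy(t, [0.5, 0.5])           # ln 2, about 0.693
confident_wrong = cross_entropy(t, [0.1, 0.9])  # -ln 0.1, about 2.303

print(confident_right, unsure, confident_wrong)
```

A coin-flip prediction costs about 0.69, which is a handy reference when reading the playground's loss curve: a two-class network stuck near that value has learned essentially nothing.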
3. Backward pass (chain rule).
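Start at the output. With a softmax output layer and cross-entropy loss, the output error simplifies to $\delta^{(L)} = \hat{y} - t$ (the superscript $(L)$ is the index of the last layer, not the loss). Then propagate back through the layers: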
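$$ \delta^{(l)} = \big(W^{(l+1)}\big)^\top\,\delta^{(l+1)} \,\odot\, \phi'(z^{(l)}) $$
$\odot$ is the elementwise product. Each $\delta^{(l)}$ is the gradient of the loss with respect to that layer’s pre-activations, $\partial L / \partial z^{(l)}$: informally, how much each neuron’s input was off by.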
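One way to convince yourself the recursion is right is to compute the deltas for a tiny network and compare them against a finite-difference estimate. The sketch below assumes one tanh hidden layer and a softmax output; the weights, input, and target are arbitrary values chosen for illustration.

```python
import math

def affine(W, a, b):
    return [sum(w * x for w, x in zip(row, a)) + bi for row, bi in zip(W, b)]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_deltas(x, t, W1, b1, W2, b2):
    # Forward pass: tanh hidden layer, softmax output (both assumed here).
    z1 = affine(W1, x, b1)
    a1 = [math.tanh(v) for v in z1]
    y_hat = softmax(affine(W2, a1, b2))
    loss = -sum(ti * math.log(yi) for ti, yi in zip(t, y_hat))

    # Output delta: y_hat - t.
    delta2 = [yi - ti for yi, ti in zip(y_hat, t)]
    # Hidden delta: (W2^T delta2) elementwise-times phi'(z1),
    # where phi'(z) = 1 - tanh(z)^2 for tanh.
    back = [sum(W2[k][j] * delta2[k] for k in range(len(W2)))
            for j in range(len(a1))]
    delta1 = [bj * (1.0 - aj * aj) for bj, aj in zip(back, a1)]
    return loss, delta1, delta2

# Arbitrary illustrative values, not taken from the playground.
W1 = [[0.5, -0.3], [0.8, 0.2], [-0.6, 0.9]]
b1 = [0.1, -0.2, 0.0]
W2 = [[0.7, -0.4, 0.3], [-0.5, 0.6, 0.2]]
b2 = [0.0, 0.1]
x, t = [1.0, -1.0], [1.0, 0.0]

loss, delta1, delta2 = loss_and_deltas(x, t, W1, b1, W2, b2)

# Finite-difference check: dL/db1[j] should match delta1[j].
h = 1e-6
numeric_delta1 = []
for j in range(len(b1)):
    up = b1[:j] + [b1[j] + h] + b1[j + 1:]
    down = b1[:j] + [b1[j] - h] + b1[j + 1:]
    loss_up = loss_and_deltas(x, t, W1, up, W2, b2)[0]
    loss_down = loss_and_deltas(x, t, W1, down, W2, b2)[0]
    numeric_delta1.append((loss_up - loss_down) / (2 * h))

print(delta1)
print(numeric_delta1)
```

The two printed lists should agree to several decimal places. Nudging a bias is the easiest check because the bias gradient is exactly $\delta$.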
4. Weight update.
Once you have all $\delta$s, the gradient is mechanical:
$$ \frac{\partial L}{\partial W^{(l)}} = \delta^{(l)}\,\big(a^{(l-1)}\big)^\top \qquad \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)} $$
Step opposite each gradient, scaled by the learning rate $\eta$: $W \leftarrow W - \eta \cdot \partial L / \partial W$, and likewise for $b$. Repeat for many epochs.
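Here is a minimal sketch of one update on a single softmax layer, to show the outer product and the step in action. The zero initial weights, the input `a_prev`, and `eta=0.1` are illustrative assumptions; in a deeper network the same update is applied to every layer using that layer's own $\delta$.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, b, a_prev):
    z = [sum(w * x for w, x in zip(row, a_prev)) + bi for row, bi in zip(W, b)]
    return softmax(z)

def loss(W, b, a_prev, t):
    y_hat = predict(W, b, a_prev)
    return -sum(ti * math.log(yi) for ti, yi in zip(t, y_hat))

def sgd_step(W, b, a_prev, t, eta):
    y_hat = predict(W, b, a_prev)
    delta = [yi - ti for yi, ti in zip(y_hat, t)]
    # dL/dW = delta * a_prev^T (outer product), dL/db = delta.
    grad_W = [[d * a for a in a_prev] for d in delta]
    grad_b = delta
    new_W = [[w - eta * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad_W)]
    new_b = [bi - eta * g for bi, g in zip(b, grad_b)]
    return new_W, new_b

# Illustrative starting point: all-zero weights, one training example.
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
a_prev, t = [1.0, 2.0], [1.0, 0.0]

before = loss(W, b, a_prev, t)             # ln 2: the network is a coin flip
W, b = sgd_step(W, b, a_prev, t, eta=0.1)
after = loss(W, b, a_prev, t)              # lower after one step

print(before, after)
```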
Try this
The XOR upgrade
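Pick XOR. Remove all hidden layers (just input → output) and train: the loss stalls near chance level (roughly $\ln 2 \approx 0.69$ for two balanced classes), because a single layer can only draw a straight-line boundary. Add one hidden layer with 4 neurons and retrain. Now it works. That’s the multi-layer unlock in one experiment.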
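The same point can be checked without any training. The sketch below assumes the idealized four-corner version of XOR (inputs in {0, 1}), which is simpler than the playground's point clouds. It searches a grid of linear classifiers, then evaluates a two-unit ReLU hidden layer with hand-picked weights. The weights are chosen by hand for illustration, not learned.

```python
from itertools import product

# Idealized four-point XOR (an assumption; the playground uses point clouds).
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def linear_predict(w1, w2, b, x):
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0

def hidden_predict(x):
    # Hand-picked weights, not trained: two ReLU hidden units.
    h1 = max(0.0, x[0] + x[1])
    h2 = max(0.0, x[0] + x[1] - 1.0)
    return 1 if h1 - 2.0 * h2 > 0.5 else 0

# Try every (w1, w2, b) on a coarse grid from -3 to 3.
grid = [i * 0.5 for i in range(-6, 7)]
best_linear = max(
    sum(linear_predict(w1, w2, b, x) == y for x, y in XOR)
    for w1, w2, b in product(grid, repeat=3)
)
hidden_correct = sum(hidden_predict(x) == y for x, y in XOR)

print(best_linear, hidden_correct)  # 3 4
```

No straight line gets more than three of the four corners right, while two hidden units are already enough to represent the function. Whether gradient descent finds such weights is a separate question, which is what the playground lets you watch.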
Vanishing gradients
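On the spiral dataset, add 4 hidden layers with sigmoid activation. Watch the loss barely budge: the sigmoid’s derivative never exceeds 0.25, so the gradient shrinks at every layer it passes back through. Switch all activations to ReLU, whose derivative is 1 for active units, and the loss drops fast. This is a large part of why ReLU and its variants are the default in modern deep nets.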
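A back-of-the-envelope calculation shows the scale of the effect. This sketch looks only at the activation-derivative factor that backprop multiplies in at each layer and deliberately ignores the weight matrices, so treat it as a rough bound rather than the actual gradient in the playground.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

depth = 4

# Best case for sigmoid: its derivative peaks at z = 0, where it is 0.25.
sigmoid_factor = sigmoid_prime(0.0) ** depth   # 0.25 ** 4

# ReLU on an active unit (z > 0) passes the gradient through unchanged.
relu_factor = relu_prime(1.0) ** depth

print(sigmoid_factor, relu_factor)  # 0.00390625 1.0
```

Even in the best case, four sigmoid layers scale the gradient reaching the first layer by less than half a percent. Saturated units (large positive or negative z) make it far smaller.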
Learning rate too high → divergence
Push η to 0.4. The loss spikes erratically; the boundary thrashes. Halve it and watch convergence smooth out.
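The mechanism is easiest to see on a toy problem. The sketch below runs gradient descent on a one-dimensional quadratic loss instead of a neural network; the curvature of 6.0 is a hypothetical value picked so that η = 0.4 overshoots and η = 0.2 does not. The playground's real loss surface has its own thresholds.

```python
def gradient_descent(eta, curvature=6.0, w0=1.0, steps=10):
    """Minimize L(w) = 0.5 * curvature * w**2 and record the loss per step."""
    w = w0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * curvature * w * w)
        w -= eta * curvature * w  # gradient of L is curvature * w
    return losses

high = gradient_descent(eta=0.4)  # each step multiplies w by 1 - 2.4 = -1.4
low = gradient_descent(eta=0.2)   # each step multiplies w by 1 - 1.2 = -0.2

print(high[:4])
print(low[:4])
```

With the larger step, every update jumps past the minimum and lands further away on the other side, so the loss grows. Halving η makes the overshoot shrink each time instead.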
Capacity vs data
Add 3 hidden layers of 16 neurons each. Place just 8 data points by hand. Train. The network memorizes those 8 points perfectly but the decision boundary looks insane — classic overfitting on tiny data.
Batch size = 1 vs 64
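Drop batch size to 1 (pure SGD). The loss curve gets jagged and the boundary jitters, because each update follows a single example’s gradient. Push it to 64: the loss is smoother, but each update costs more compute and an epoch contains fewer updates. Mini-batch is the trade-off.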
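The jaggedness is a matter of averaging. As a rough simulation, the sketch below stands in for per-example gradients with Gaussian noise around a fixed "true" gradient (a simplifying assumption, not how the playground computes anything) and measures how much the batch-averaged gradient wobbles at each batch size.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def batch_gradient(batch_size, true_grad=1.0, noise=1.0):
    # Simulated per-example gradients: true value plus Gaussian noise.
    samples = [random.gauss(true_grad, noise) for _ in range(batch_size)]
    return sum(samples) / batch_size

def spread(batch_size, trials=2000):
    estimates = [batch_gradient(batch_size) for _ in range(trials)]
    return statistics.pstdev(estimates)

spread_1 = spread(1)
spread_64 = spread(64)

print(spread_1, spread_64)
```

Averaging over B examples cuts the standard deviation of the gradient estimate by about a factor of √B, so a batch of 64 should come out roughly eight times steadier than a batch of 1.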
Spirals need depth
With 1 hidden layer of 4 neurons, the spiral barely separates. Try 2 layers × 8 neurons each. The boundary curls. That curl is what depth bought you (to rule out width alone, compare against a single layer of 8 neurons).