Neural Network Playground

Build a small neural network, pick a dataset, and hit Play to watch backpropagation carve a decision boundary in real time. The 2D classification problems that logistic regression and the perceptron can’t solve (XOR, spirals) become solvable once you stack a hidden layer.

[Interactive canvas: decision boundary · click anywhere to add a training point]
[Chart: training loss per epoch, with live readouts for epoch, loss, and accuracy]

Build & train

Architecture

Activation

Hyperparameters

Learning rate η: 0.010
Batch size: 32
Training speed: Normal

Run

Dataset presets

Spiral
XOR
Circle
Moons
Theory & exercises · backprop derivation, watch-outs, ideas to try

The math, in four passes

1. Forward pass.

Each layer transforms its input by a weighted sum + bias + activation:

$$ z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)} \qquad a^{(l)} = \phi(z^{(l)}) $$

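Here $a^{(0)} = x$ is the input, and the final layer’s activation $a^{(L)}$ is the prediction $\hat{y}$. For classification the output activation is typically softmax, so $\hat{y}$ is a probability distribution over the classes.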
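
The forward-pass equation above can be sketched in a few lines of plain Python. This is a minimal illustration, not the playground's actual code: the 2 → 3 → 2 layer sizes, the hand-picked weights, and the helper names `affine` and `forward` are all assumptions made for the example, as is the choice of ReLU for the hidden layer and softmax for the output.

```python
import math

def relu(z):
    return [max(0.0, v) for v in z]

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def affine(W, a, b):
    """Compute z = W a + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]

def forward(layers, x):
    """layers is a list of (W, b, phi) triples; returns the final activation."""
    a = x
    for W, b, phi in layers:
        a = phi(affine(W, a, b))
    return a

# Hypothetical 2 -> 3 -> 2 network with hand-picked (not trained) weights.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0], relu),
    ([[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]], [0.0, 0.0], softmax),
]
y_hat = forward(layers, [1.0, 0.0])
print(y_hat)  # two class probabilities that sum to 1
```

Each pass through the loop is one application of $a^{(l)} = \phi(W^{(l)} a^{(l-1)} + b^{(l)})$; only the last layer's activation function differs.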

2. Loss — cross-entropy.

$$ L = -\sum_i t_i \log \hat{y}_i $$

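For two-class problems, the target $t \in \{(1,0),(0,1)\}$ is a one-hot vector, so the sum keeps only the term for the true class: $L = -\log \hat{y}_c$, where $c$ is the correct class. The loss is near 0 when the network gives that class a probability close to 1 and grows without bound as that probability approaches 0.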
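
A small worked example may make the scale of the loss concrete. The three predictions below are made-up values chosen for illustration, and the `eps` clamp is an added safeguard against `log(0)` that is not part of the formula above.

```python
import math

def cross_entropy(t, y_hat, eps=1e-12):
    """L = -sum_i t_i * log(y_hat_i), clamping y_hat away from zero."""
    return -sum(ti * math.log(max(yi, eps)) for ti, yi in zip(t, y_hat))

target = [1, 0]  # the true class is the first one

confident_right = cross_entropy(target, [0.9, 0.1])
unsure = cross_entropy(target, [0.5, 0.5])
confident_wrong = cross_entropy(target, [0.1, 0.9])

print(round(confident_right, 4))  # 0.1054
print(round(unsure, 4))           # 0.6931, i.e. ln 2
print(round(confident_wrong, 4))  # 2.3026
```

A loss near $\ln 2 \approx 0.693$ on a two-class problem therefore means the network is doing no better than a coin flip.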

3. Backward pass (chain rule).

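Start at the output. With a softmax output and cross-entropy loss, the output error simplifies to $\delta^{(L)} = \hat{y} - t$ (the superscript $L$ indexes the last layer, not the loss). Then propagate back through the layers: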

$$ \delta^{(l)} = \big(W^{(l+1)}\big)^\top\,\delta^{(l+1)} \,\odot\, \phi'(z^{(l)}) $$

$\odot$ is the elementwise product. Formally, $\delta^{(l)} = \partial L / \partial z^{(l)}$: how much the loss changes when that layer’s pre-activations are nudged.
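
For readers who want to see where the output error $\hat{y} - t$ comes from, here is a short derivation. It assumes the output layer is a softmax, which the page implies but does not state, and writes $z$ for the output pre-activations $z^{(L)}$. The softmax derivative is

$$ \frac{\partial \hat{y}_i}{\partial z_j} = \hat{y}_i \big(\mathbb{1}[i=j] - \hat{y}_j\big) $$

Substituting into the cross-entropy loss gives

$$ \frac{\partial L}{\partial z_j} = -\sum_i \frac{t_i}{\hat{y}_i}\,\frac{\partial \hat{y}_i}{\partial z_j} = -\sum_i t_i \big(\mathbb{1}[i=j] - \hat{y}_j\big) = \hat{y}_j \sum_i t_i - t_j = \hat{y}_j - t_j $$

The last step uses $\sum_i t_i = 1$, which holds because $t$ is one-hot.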

4. Weight update.

Once you have all $\delta$s, the gradient is mechanical:

$$ \frac{\partial L}{\partial W^{(l)}} = \delta^{(l)}\,\big(a^{(l-1)}\big)^\top \qquad \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)} $$

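Step against each gradient, scaled by the learning rate $\eta$: $W \leftarrow W - \eta \, \partial L / \partial W$ and $b \leftarrow b - \eta \, \partial L / \partial b$. The formulas above are for a single example; with mini-batches, average the per-example gradients before stepping. Repeat for many epochs.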
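
The four passes can be put together in one sketch of a single training step. Everything specific here is an assumption for illustration: the 2 → 3 → 2 shape, the tanh hidden activation (chosen because its derivative, $1 - a^2$, is easy to write), the softmax output, the starting weights, and the helper names. It processes one example, not a mini-batch.

```python
import math

def matvec(W, a):
    return [sum(w * x for w, x in zip(row, a)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = [u + v for u, v in zip(matvec(W1, x), b1)]
    a1 = [math.tanh(v) for v in z1]
    z2 = [u + v for u, v in zip(matvec(W2, a1), b2)]
    return a1, softmax(z2)

def loss(params, x, t):
    _, y_hat = forward(params, x)
    return -sum(ti * math.log(yi) for ti, yi in zip(t, y_hat))

def gradients(params, x, t):
    W1, b1, W2, b2 = params
    a1, y_hat = forward(params, x)
    delta2 = [y - ti for y, ti in zip(y_hat, t)]  # output error: y_hat - t
    # (W2)^T delta2, then elementwise times tanh'(z1) = 1 - a1^2
    back = [sum(W2[k][j] * delta2[k] for k in range(len(W2)))
            for j in range(len(a1))]
    delta1 = [g * (1.0 - a * a) for g, a in zip(back, a1)]
    dW2 = [[d * a for a in a1] for d in delta2]   # delta2 (a1)^T
    dW1 = [[d * xi for xi in x] for d in delta1]  # delta1 x^T
    return dW1, delta1, dW2, delta2               # bias gradients are the deltas

def step_matrix(W, dW, eta):
    return [[w - eta * g for w, g in zip(row, grow)] for row, grow in zip(W, dW)]

def step_vector(b, db, eta):
    return [v - eta * g for v, g in zip(b, db)]

def sgd_step(params, grads, eta):
    W1, b1, W2, b2 = params
    dW1, db1, dW2, db2 = grads
    return (step_matrix(W1, dW1, eta), step_vector(b1, db1, eta),
            step_matrix(W2, dW2, eta), step_vector(b2, db2, eta))

# Hypothetical starting weights and a single training example.
params = (
    [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]], [0.0, 0.1, -0.1],
    [[0.4, -0.7, 0.2], [-0.5, 0.3, 0.6]], [0.0, 0.0],
)
x, t = [1.0, -1.0], [1, 0]

before = loss(params, x, t)
new_params = sgd_step(params, gradients(params, x, t), eta=0.1)
after = loss(new_params, x, t)
print(before, after)  # the loss after the step should be lower
```

A useful habit when writing backprop by hand is to compare one entry of the analytic gradient against a finite difference of `loss`; if they disagree, a transpose or an activation derivative is usually the culprit.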

Try this

The XOR upgrade

Pick XOR. Remove all hidden layers (just input → output) and train: the loss plateaus well above zero, because the model is now a linear classifier. Add one hidden layer with 4 neurons and it works. That’s the multi-layer unlock in one experiment.
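
The same contrast can be checked offline on the four canonical XOR points. The sketch below is an illustration under stated assumptions: it searches a coarse grid of linear classifiers (the grid itself is arbitrary) and compares them with a tiny ReLU network whose weights are hand-picked, not learned. The hand-built network needs only 2 hidden units; the 4 suggested above likely just give gradient descent more room to find a solution.

```python
from itertools import product

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def linear_predict(w1, w2, b, x):
    """No hidden layer: a single linear decision boundary."""
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0

def hidden_predict(x):
    """One hidden layer of two ReLU units with hand-picked weights."""
    h1 = max(0.0, x[0] + x[1])
    h2 = max(0.0, x[0] + x[1] - 1.0)
    return 1 if h1 - 2.0 * h2 > 0.5 else 0

# Try every (w1, w2, b) on a coarse grid from -2 to 2 in steps of 0.5.
grid = [i / 2 for i in range(-4, 5)]
best_linear = max(
    sum(linear_predict(w1, w2, b, x) == y for x, y in XOR)
    for w1, w2, b in product(grid, repeat=3)
)
hidden_correct = sum(hidden_predict(x) == y for x, y in XOR)

print(best_linear, hidden_correct)  # prints: 3 4
```

No linear boundary gets more than 3 of the 4 points right, while the hidden layer bends the boundary enough to get all 4.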

Vanishing gradients

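On the spiral dataset, add 4 hidden layers with sigmoid activation and watch the loss barely budge. Switch all activations to ReLU and the loss drops fast. Sigmoid’s derivative never exceeds 0.25, so the backpropagated signal shrinks at every layer; this is a major reason ReLU is the default in modern deep nets.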
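
A rough back-of-the-envelope calculation shows the effect. This sketch looks only at the $\phi'(z)$ factor in the backward-pass formula and deliberately ignores the weight matrices, so it is a simplified picture, not a full gradient computation. The depth of 4 matches the experiment; the sample pre-activation values are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

depth = 4

# Best case for sigmoid: every unit sits at z = 0, where the slope peaks at 0.25.
sigmoid_factor = sigmoid_prime(0.0) ** depth
# An active ReLU unit (z > 0) passes the gradient through unchanged.
relu_factor = relu_prime(1.0) ** depth

print(sigmoid_factor)  # 0.00390625
print(relu_factor)     # 1.0
```

Even in sigmoid's best case the activation derivatives alone shrink the signal by a factor of 256 over four layers, and saturated units (large $|z|$) make it far worse.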

Learning rate too high → divergence

Push η to 0.4. The loss spikes erratically; the boundary thrashes. Halve it and watch convergence smooth out.
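
The mechanism is easiest to see on a one-parameter toy problem. The sketch below runs gradient descent on the quadratic $f(w) = \tfrac{c}{2} w^2$, where each step multiplies $w$ by $(1 - \eta c)$, so the iterates blow up once $\eta > 2/c$. The curvature $c = 6$ is a made-up value chosen so that 0.4 diverges and 0.2 converges; the playground's real loss surface is not a quadratic and has its own threshold.

```python
def run_gd(eta, curvature=6.0, w0=1.0, steps=20):
    """Gradient descent on f(w) = 0.5 * curvature * w^2; returns the loss history."""
    w = w0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * curvature * w * w)
        w -= eta * curvature * w  # gradient of f is curvature * w
    return losses

diverging = run_gd(0.4)   # step factor 1 - 0.4 * 6 = -1.4, magnitude above 1
converging = run_gd(0.2)  # step factor 1 - 0.2 * 6 = -0.2, magnitude below 1

print(diverging[0], diverging[-1])
print(converging[0], converging[-1])
```

With the larger step, $w$ overshoots the minimum and lands farther away on the other side each time, which is the thrashing you see in the boundary.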

Capacity vs data

Add 3 hidden layers of 16 neurons each. Place just 8 data points by hand. Train. The network memorizes those 8 points perfectly but the decision boundary looks insane — classic overfitting on tiny data.
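
Counting parameters makes the mismatch explicit. The sketch assumes 2 inputs (the point's coordinates) and 2 one-hot outputs, which matches the two-class setup described in the theory section; `count_parameters` is a helper name made up for this example.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected net with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Assumed shape: 2 inputs, three hidden layers of 16 neurons, 2 outputs.
n_params = count_parameters([2, 16, 16, 16, 2])
n_points = 8

print(n_params)             # 626
print(n_params / n_points)  # 78.25 parameters per training point
```

With dozens of free parameters per data point, many very different boundaries fit the 8 points exactly, and nothing in the data says which one to prefer.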

Batch size = 1 vs 32

Drop batch size to 1 (pure SGD): the loss curve gets jagged and the boundary jitters, because each step follows a single example’s noisy gradient. Push it to 64 for a smoother loss curve but fewer weight updates per epoch. Mini-batch is the trade-off.
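
The jitter comes from how noisy each gradient estimate is. The sketch below simulates this with made-up numbers: it draws 1024 hypothetical per-example gradient values for a single parameter (the Gaussian distribution is an assumption, not something measured from the playground), then measures how much the mini-batch average varies from batch to batch.

```python
import random
import statistics

random.seed(0)

# Hypothetical per-example gradients for one parameter.
per_example_grads = [random.gauss(1.0, 2.0) for _ in range(1024)]

def batch_gradient_std(batch_size, trials=500):
    """Standard deviation of the mini-batch gradient estimate across random batches."""
    estimates = [
        statistics.mean(random.choices(per_example_grads, k=batch_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(estimates)

noise_1 = batch_gradient_std(1)
noise_32 = batch_gradient_std(32)

print(noise_1, noise_32)  # the batch-size-1 estimate is several times noisier
```

Averaging over a batch of size $B$ cuts the noise by roughly $\sqrt{B}$, so going from 1 to 32 gives about a 5 to 6 times steadier gradient at the cost of 32 times fewer updates per epoch.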

Spirals need depth

With 1 hidden layer of 4 neurons, the spiral barely separates. Try 2 layers × 8 neurons each and the boundary curls. That curl is what depth bought you.

In one glance

⚠️ Watch out: Vanishing gradients · High lr → divergence · Overfit small data · Bad random init · Too many hidden layers
🔧 In practice: torch.nn.Linear · F.relu · CrossEntropyLoss · torch.optim.Adam · batch_norm · dropout

Frequently asked

A neural network is a stack of layers: input, one or more hidden, output. Each layer takes a weighted sum of its inputs, adds a bias, and runs the result through an activation function (ReLU, sigmoid, tanh). Training adjusts every weight and bias by gradient descent on a loss function, usually cross-entropy for classification. The network can learn non-linear boundaries that single linear classifiers (logistic regression, perceptron) provably cannot.
Backpropagation computes the gradient of the loss with respect to every weight and bias in the network, layer by layer, working backward from the output. Once you have those gradients, gradient descent nudges each parameter in the direction that reduces the loss. Backprop is the chain rule applied carefully, reusing each layer’s result for the layer before it, and it’s what makes deep learning practical.
If you have zero hidden layers, you’re effectively running logistic regression — and logistic regression can’t separate XOR. Add at least one hidden layer with 4+ neurons and the network will learn the non-linear boundary. Try removing all hidden layers and watch the model fail; then add one and watch it succeed.
ReLU is the default in modern hidden layers: it is cheap to compute, doesn’t saturate for positive inputs, and produces sparse activations. Sigmoid and tanh saturate at both ends, which causes vanishing gradients in deep networks. For a binary output layer, sigmoid is still the usual choice (it gives a clean probability). For deeper hidden layers, ReLU or one of its variants (Leaky ReLU, GELU) is almost always the right pick.