Logistic Regression, visualized

Logistic regression learns a straight line that separates two classes, then squashes the signed distance through a sigmoid to produce a probability. Click the chart to drop points, hit Train, and watch the boundary rotate while the loss curve falls on the right. For threshold/AUC/precision-recall exploration, head to ROC & AUC.

Panels: data, decision boundary, and probability field (toggle) · loss curve (cross-entropy + L2)
Readouts: log loss · ΔLoss · accuracy · steps · N points
Controls: preset

Tune the model

Parameters: what gradient descent learns. You can also drag these by hand to draw a boundary.
w₀: slope along x
w₁: slope along y
b: offset / intercept

Hyperparameters: you set these. They control how training behaves.
η: learning rate (gradient-descent step size)
λ: L2 strength (penalty on big weights)

Probability field: what the model thinks everywhere, not just along the line

Evaluate

Threshold τ — turn the probability into a hard prediction. Drag the slider — the confusion matrix below moves with it.
τ: predict Class 1 when σ(w·x + b) ≥ τ

Confusion matrix at threshold τ

| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | TN | FP · type I |
| Actual 1 | FN · type II | TP |

The math, derived

1. The model.

Combine the inputs linearly, then squash through a sigmoid to get a probability:

$$ z \;=\; w_0\,x + w_1\,y + b \qquad \hat{p} \;=\; \sigma(z) \;=\; \frac{1}{1 + e^{-z}} $$

$\hat{p} \in (0, 1)$ is interpreted as the probability of class 1. Predict class 1 when $\hat{p} \ge \tau$.
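The model fits in a few lines. This is a minimal sketch, not the page's actual implementation: the function names are made up for illustration, and the sigmoid is split into two branches only to avoid overflow for large negative $z$. The hard prediction uses the ≥ τ rule from the threshold control.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^{-z}), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) stays below 1
    return ez / (1.0 + ez)

def predict_proba(w0: float, w1: float, b: float, x: float, y: float) -> float:
    """Probability of class 1 for the point (x, y)."""
    return sigmoid(w0 * x + w1 * y + b)

def predict(w0: float, w1: float, b: float, x: float, y: float, tau: float = 0.5) -> int:
    """Hard prediction: class 1 when the probability reaches the threshold tau."""
    return 1 if predict_proba(w0, w1, b, x, y) >= tau else 0

# A point on the line w0*x + w1*y + b = 0 has z = 0, so its probability is 0.5.
p_on_line = predict_proba(1.0, -1.0, 0.0, 2.0, 2.0)
```

Points on one side of the line get $\hat{p} > 0.5$ and points on the other side get $\hat{p} < 0.5$; scaling all three parameters by the same positive factor leaves the line in place but sharpens the probabilities.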

2. The loss — binary cross-entropy.

For each example $(x_i, y_i)$ with label $t_i \in \{0, 1\}$:

$$ L \;=\; -\frac{1}{N} \sum_{i=1}^{N} \Big[\, t_i \log \hat{p}_i + (1 - t_i)\log(1 - \hat{p}_i) \,\Big] $$

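Cross-entropy is convex in $(w_0, w_1, b)$, so there are no local minima to get stuck in: with a small enough $\eta$ and enough steps, gradient descent approaches the global optimum. (On perfectly separable data without L2, that optimum sits at infinitely large weights, so the loss keeps creeping toward zero instead of settling.)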
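As a rough sketch, the loss can be computed directly from the formula. The function name and the clipping constant `eps` are assumptions added here to keep `log` away from zero; the page does not say how it handles that case.

```python
import math

def binary_cross_entropy(p_hat, t, eps=1e-12):
    """Mean binary cross-entropy for probabilities p_hat and labels t in {0, 1}."""
    total = 0.0
    for p, ti in zip(p_hat, t):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0 (assumed safeguard)
        total -= ti * math.log(p) + (1 - ti) * math.log(1.0 - p)
    return total / len(t)

# Three toy cases with labels [1, 0]; the probabilities are made up for illustration.
loss_uninformed = binary_cross_entropy([0.5, 0.5], [1, 0])         # ln 2 per example
loss_confident_right = binary_cross_entropy([0.99, 0.01], [1, 0])
loss_confident_wrong = binary_cross_entropy([0.01, 0.99], [1, 0])
```

An uninformed model that always outputs 0.5 scores $\ln 2 \approx 0.693$, which is also the loss at zero weights before any training.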

3. The gradient.

The chain rule makes the gradient surprisingly clean — the sigmoid’s derivative cancels nicely against the log:

$$ \frac{\partial L}{\partial w_0} \;=\; \frac{1}{N}\sum_i (\hat{p}_i - t_i)\,x_i \qquad \frac{\partial L}{\partial w_1} \;=\; \frac{1}{N}\sum_i (\hat{p}_i - t_i)\,y_i \qquad \frac{\partial L}{\partial b} \;=\; \frac{1}{N}\sum_i (\hat{p}_i - t_i) $$
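The cancellation can be spelled out per example. Here $\ell_i = -\big[t_i \log \hat{p}_i + (1 - t_i)\log(1 - \hat{p}_i)\big]$ is shorthand introduced for this sketch (it is the bracketed term of the loss for example $i$):

$$ \frac{\partial \ell_i}{\partial \hat{p}_i} \;=\; -\frac{t_i}{\hat{p}_i} + \frac{1 - t_i}{1 - \hat{p}_i} \;=\; \frac{\hat{p}_i - t_i}{\hat{p}_i\,(1 - \hat{p}_i)} \qquad \frac{\partial \hat{p}_i}{\partial z_i} \;=\; \sigma(z_i)\big(1 - \sigma(z_i)\big) \;=\; \hat{p}_i\,(1 - \hat{p}_i) $$

$$ \frac{\partial \ell_i}{\partial z_i} \;=\; \frac{\partial \ell_i}{\partial \hat{p}_i}\,\frac{\partial \hat{p}_i}{\partial z_i} \;=\; \hat{p}_i - t_i \qquad \frac{\partial z_i}{\partial w_0} \;=\; x_i $$

The factor $\hat{p}_i(1 - \hat{p}_i)$ from the sigmoid cancels the denominator from the log, leaving only the residual. Averaging $(\hat{p}_i - t_i)\,x_i$ over the $N$ examples gives the $w_0$ gradient.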

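This has the same form as the linear-regression gradient: a residual $(\hat{p}_i - t_i)$ times the input. Add $+\,2\lambda w_0$ to the $w_0$ gradient (and $+\,2\lambda w_1$ to the $w_1$ gradient) if you want L2 regularization; the bias $b$ is not penalized.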
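One way to gain confidence in these formulas is to compare them against finite differences. The sketch below is an illustration, not the page's code: the helper names and the four data points are made up, and the penalty is assumed to be $\lambda(w_0^2 + w_1^2)$ with no penalty on $b$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w0, w1, b, pts, lam=0.0):
    """Mean cross-entropy plus lam * (w0^2 + w1^2); pts holds (x, y, t) triples."""
    total = 0.0
    for x, y, t in pts:
        p = sigmoid(w0 * x + w1 * y + b)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pts) + lam * (w0 * w0 + w1 * w1)

def gradient(w0, w1, b, pts, lam=0.0):
    """Analytic gradient: mean of (p_hat - t) times each input, plus 2*lam*w."""
    g0 = g1 = gb = 0.0
    for x, y, t in pts:
        err = sigmoid(w0 * x + w1 * y + b) - t  # residual (p_hat - t)
        g0 += err * x
        g1 += err * y
        gb += err
    n = len(pts)
    return g0 / n + 2 * lam * w0, g1 / n + 2 * lam * w1, gb / n  # bias not penalized

# Toy data and an arbitrary parameter point, both made up for the check.
pts = [(1.0, 2.0, 1), (-1.5, 0.5, 0), (0.3, -0.7, 1), (2.0, -1.0, 0)]
w = (0.4, -0.2, 0.1)
h = 1e-6

def shifted(k, delta):
    """Copy of w with component k moved by delta."""
    return tuple(w[j] + (delta if j == k else 0.0) for j in range(3))

analytic = gradient(*w, pts, lam=0.05)
numeric = tuple(
    (loss(*shifted(k, h), pts, lam=0.05) - loss(*shifted(k, -h), pts, lam=0.05)) / (2 * h)
    for k in range(3)
)
```

If `analytic` and `numeric` agree to several decimal places, the closed-form gradient matches the loss it is supposed to differentiate.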

4. The update.

Step opposite the gradient, scaled by the learning rate:

$$ w_0 \leftarrow w_0 - \eta\,\frac{\partial L}{\partial w_0} \qquad w_1 \leftarrow w_1 - \eta\,\frac{\partial L}{\partial w_1} \qquad b \leftarrow b - \eta\,\frac{\partial L}{\partial b} $$

Repeat until the loss curve flattens. If it overshoots and bounces, lower $\eta$. If it crawls, raise $\eta$ (carefully).
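Putting the four steps together gives a complete training loop. This is a minimal sketch under stated assumptions: the two Gaussian blobs, the seed, and the values `eta = 0.1`, `lam = 0.01`, and 200 steps are all illustrative choices, not the page's presets.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(w0, w1, b, pts, lam):
    """Regularized loss and its gradient for a list of (x, y, t) triples."""
    n = len(pts)
    total = g0 = g1 = gb = 0.0
    for x, y, t in pts:
        p = sigmoid(w0 * x + w1 * y + b)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
        g0 += (p - t) * x
        g1 += (p - t) * y
        gb += p - t
    loss = total / n + lam * (w0 * w0 + w1 * w1)
    return loss, (g0 / n + 2 * lam * w0, g1 / n + 2 * lam * w1, gb / n)

# Two overlapping blobs: class 0 around (-1, -1), class 1 around (+1, +1).
random.seed(0)
pts = ([(random.gauss(-1, 1), random.gauss(-1, 1), 0) for _ in range(50)]
       + [(random.gauss(1, 1), random.gauss(1, 1), 1) for _ in range(50)])

eta, lam = 0.1, 0.01
w0 = w1 = b = 0.0
losses = []
for _ in range(200):
    loss, (g0, g1, gb) = loss_and_grad(w0, w1, b, pts, lam)
    losses.append(loss)   # record the loss before this step
    w0 -= eta * g0        # step opposite the gradient
    w1 -= eta * g1
    b -= eta * gb
```

Plotting `losses` reproduces the falling loss curve; it starts at $\ln 2$ because all parameters begin at zero.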

Try this

The XOR wall

Hit XOR and Auto-train. Watch the boundary thrash: no single line separates the four XOR corners. This limitation of single-layer models is what motivated multi-layer perceptrons in the 1980s.
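A brute-force check makes the limit concrete. The sketch below reduces XOR to four idealized corner points (an assumption; the preset uses clusters) and tries many random lines, keeping the best accuracy found. The sampling range and count are arbitrary.

```python
import random

# Four XOR corners as (x, y, label): class 1 at top-left and bottom-right.
xor = [(0.0, 1.0, 1), (1.0, 0.0, 1), (0.0, 0.0, 0), (1.0, 1.0, 0)]

def accuracy(w0, w1, b, pts):
    """Fraction of points whose side of the line w0*x + w1*y + b = 0 matches the label."""
    hits = sum((1 if w0 * x + w1 * y + b >= 0 else 0) == t for x, y, t in pts)
    return hits / len(pts)

random.seed(0)
best = max(
    accuracy(random.uniform(-5, 5), random.uniform(-5, 5), random.uniform(-5, 5), xor)
    for _ in range(20000)
)
# best is 0.75: a line can get three corners right, never all four.
```

No amount of training changes this ceiling, because it is a property of the model family rather than of the optimizer.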

Learning rate explosion

On blobs, crank η to 0.8+ and train. Loss spikes and the weights swing wildly; the boundary may flip back and forth from step to step. Halve η and retry.

Regularization tightens the line

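On overlap, train without L2, then with λ = 0.05. The boundary is a straight line either way; what changes is that the regularized weights are smaller, so the probability field fades more gradually across the line. The model is humbler.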
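The shrinking effect is easy to measure by training twice on the same data. This is a sketch with assumed settings: the overlapping blobs, the seed, `eta = 0.1`, and 2000 steps are illustrative stand-ins for the overlap preset.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(pts, lam, eta=0.1, steps=2000):
    """Plain gradient descent on cross-entropy + lam * (w0^2 + w1^2)."""
    w0 = w1 = b = 0.0
    n = len(pts)
    for _ in range(steps):
        g0 = g1 = gb = 0.0
        for x, y, t in pts:
            err = sigmoid(w0 * x + w1 * y + b) - t
            g0 += err * x
            g1 += err * y
            gb += err
        w0 -= eta * (g0 / n + 2 * lam * w0)
        w1 -= eta * (g1 / n + 2 * lam * w1)
        b -= eta * gb / n  # bias is not penalized
    return w0, w1, b

# Heavily overlapping blobs (made-up stand-in for the overlap preset).
random.seed(1)
pts = ([(random.gauss(-0.7, 1.0), random.gauss(-0.7, 1.0), 0) for _ in range(60)]
       + [(random.gauss(0.7, 1.0), random.gauss(0.7, 1.0), 1) for _ in range(60)])

plain = train(pts, lam=0.0)
regularized = train(pts, lam=0.05)
norm_plain = math.hypot(plain[0], plain[1])
norm_regularized = math.hypot(regularized[0], regularized[1])
```

A smaller weight norm means $z$ grows more slowly with distance from the line, which is why the regularized probability field looks softer.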

Threshold-induced trade-off

Slide τ from 0.3 to 0.7. The confusion matrix shifts: lower τ catches more positives (TP up, FN down) but also more false alarms (FP up). For threshold sweeping with ROC/AUC, see the metrics page.
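The trade-off shows up even on a handful of points. In this sketch the scores and labels are made up for illustration, and `confusion` is a hypothetical helper that applies the ≥ τ rule.

```python
def confusion(p_hat, t, tau):
    """Return (TN, FP, FN, TP) when predicting class 1 for p_hat >= tau."""
    tn = fp = fn = tp = 0
    for p, ti in zip(p_hat, t):
        pred = 1 if p >= tau else 0
        if pred == 1 and ti == 1:
            tp += 1
        elif pred == 1 and ti == 0:
            fp += 1
        elif pred == 0 and ti == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Made-up model outputs and true labels.
p_hat = [0.10, 0.35, 0.40, 0.55, 0.62, 0.80, 0.90, 0.25]
t = [0, 0, 1, 0, 1, 1, 1, 0]

low = confusion(p_hat, t, 0.3)   # (2, 2, 0, 4): every positive caught, two false alarms
mid = confusion(p_hat, t, 0.5)   # (3, 1, 1, 3)
high = confusion(p_hat, t, 0.7)  # (4, 0, 2, 2): no false alarms, two positives missed
```

Raising τ moves counts from the Predicted 1 column to the Predicted 0 column: FP turns into TN, and TP turns into FN.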

Imbalanced reality

The imbalanced preset has 20 negatives and 100 positives. Notice the roughly 95% accuracy from the start: always predicting the majority class already scores 100/120 ≈ 83%, so most of that figure is the trivial baseline. Accuracy lies on imbalanced data; use F1 or AUC instead.
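The baseline is worth computing explicitly. The sketch below uses only the class counts quoted above and a trivial "always predict 1" model; balanced accuracy is added here as one example of a metric that exposes the problem.

```python
# Labels for the quoted counts: 20 negatives, 100 positives.
t = [0] * 20 + [1] * 100
pred = [1] * len(t)  # trivial model: always predict the majority class

accuracy = sum(p == ti for p, ti in zip(pred, t)) / len(t)

# Recall per class: the share of each class that the model identifies.
pos_recall = sum(p == 1 and ti == 1 for p, ti in zip(pred, t)) / t.count(1)
neg_recall = sum(p == 0 and ti == 0 for p, ti in zip(pred, t)) / t.count(0)
balanced_accuracy = (pos_recall + neg_recall) / 2
```

Here `accuracy` is about 0.83 while `neg_recall` is 0 and `balanced_accuracy` is 0.5: the model looks good on accuracy without ever finding a negative.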

Hand-set the boundary

Without training, slide w₀, w₁, b to draw a line that separates the data by eye. Compare your accuracy to gradient descent’s — humans can match it on easy data.

In one glance

⚠️ Watch out: Linear-only (can’t do XOR) · No feature scaling · Ignoring class imbalance · Accuracy on a 95/5 split · High η diverges
🔧 In practice: sklearn.linear_model.LogisticRegression · class_weight='balanced' · StandardScaler first · L2 by default (C=1.0) · cross_val_score
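The practice items combine into a short scikit-learn pipeline. This is one possible arrangement, not a recommendation from the page: the synthetic 95/5 dataset, the 5-fold split, and the `roc_auc` scoring choice are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data with a 95/5 class split (stand-in for real data).
X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)

# Scale features first, then fit an L2-regularized model with reweighted classes.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, class_weight="balanced"),
)

# AUC instead of accuracy, since accuracy is misleading on a 95/5 split.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no statistics leak from the held-out fold. In scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` means a stronger penalty.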

Frequently asked

What is logistic regression?

A linear classifier for binary outcomes. It computes $z = w_0 x + w_1 y + b$, then squashes $z$ through the sigmoid to get a probability $\hat{p} = \sigma(z) = 1/(1 + e^{-z})$. Predict class 1 if $\hat{p} \ge \tau$ (typically $\tau = 0.5$). Training fits $w_0, w_1, b$ by minimizing cross-entropy loss via gradient descent: the same idea as linear regression, but with a probability output.

Why can’t logistic regression solve XOR?

XOR has four clusters at the corners: class 1 at top-left and bottom-right, class 0 at top-right and bottom-left. No single straight line can separate them, and a linear classifier cannot fit data that is not linearly separable. This is the canonical motivation for neural networks (multiple stacked linear + activation layers) or kernel methods.

What does L2 regularization do?

It adds a $\lambda(w_0^2 + w_1^2)$ term to the loss, which penalizes large weights. The effect: the probability changes more gently across the decision boundary, the model resists overfitting noisy data, and the extra $2\lambda w$ gradient term pulls the weights toward zero every step. On the overlap dataset, try $\lambda = 0.05$: the probability field becomes smoother and accuracy is more stable.

Why cross-entropy instead of MSE?

MSE on sigmoid probabilities gives a non-convex loss surface and slow training. Cross-entropy is convex for logistic regression and pairs naturally with the sigmoid: the per-example gradient simplifies to $(\hat{p}_i - t_i)\,x_i$, which is what the algorithm uses. Cross-entropy also penalizes confident-wrong predictions much harder than MSE would.
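To put numbers on the confident-wrong penalty, take a single example with true label $t = 1$ that the model scores at $\hat{p} = 0.01$ (values chosen for illustration):

$$ \text{cross-entropy:}\; -\log(0.01) \approx 4.61 \qquad \text{squared error:}\; (1 - 0.01)^2 \approx 0.98 $$

As $\hat{p} \to 0$ the cross-entropy grows without bound, while the squared error can never exceed 1.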