ROC, AUC & PR, visualized

Evaluation

A trained classifier outputs probabilities; a threshold τ turns those into class labels. The ROC curve plots TPR against FPR as τ sweeps from 0 to 1; the PR curve plots precision against recall over the same sweep. Drag the threshold slider below to move the magenta point on both curves. For training internals see Logistic Regression →.

[Interactive demo]
KPIs: ROC AUC · PR AUC (avg precision)
Plots: ROC curve (TPR vs FPR) · Precision-Recall curve · Dataset & decision boundary (Class 0, Class 1, operating point)
Threshold τ: drag to move the magenta point on both curves; the confusion matrix below shifts with it
Confusion matrix at τ: rows Actual 0 / Actual 1, columns Pred 0 / Pred 1
Metrics at τ: Accuracy · Precision · Recall · F1 · N points
Parameters (what training learns): w₀ (slope along x) · w₁ (slope along y) · b (offset / intercept)
Hyperparameter (you set it): η (learning rate)
Preset: dataset preset selector
Theory & exercises · math derivation, watch-outs, and ideas to try

The math, derived

1. The four counts.

Pick a threshold $\tau$. For each example with probability $\hat{p}_i$ and label $t_i$, the predicted class is $\hat{y}_i = \mathbf{1}[\hat{p}_i \ge \tau]$. Tally:

$$ \text{TP, FP, FN, TN} \;=\; \text{counts of } (\hat{y},\, t) \in \{(1,1),(1,0),(0,1),(0,0)\} $$
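The tally can be sketched in a few lines of plain Python. The function name `confusion_counts` and the toy scores and labels are made up for illustration; they are not taken from the demo.

```python
def confusion_counts(probs, labels, tau):
    """Return (TP, FP, FN, TN) for predictions y_hat = 1[p >= tau]."""
    tp = fp = fn = tn = 0
    for p, t in zip(probs, labels):
        y_hat = 1 if p >= tau else 0
        if y_hat == 1 and t == 1:
            tp += 1
        elif y_hat == 1 and t == 0:
            fp += 1
        elif y_hat == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical toy data: 4 positives, 4 negatives.
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(confusion_counts(probs, labels, tau=0.5))   # (3, 1, 1, 3)
print(confusion_counts(probs, labels, tau=0.25))  # (4, 2, 0, 2)
```

Lowering τ from 0.5 to 0.25 turns the one false negative into a true positive, at the price of one extra false positive.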

2. The two rates.

$$ \text{TPR (recall)} \;=\; \frac{TP}{TP + FN} \qquad \text{FPR} \;=\; \frac{FP}{FP + TN} \qquad \text{Precision} \;=\; \frac{TP}{TP + FP} $$

As $\tau$ falls, more examples are predicted positive — TPR rises (good!), FPR rises (bad), precision usually falls. The whole game is in the trade-off.
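To see the trade-off in numbers, here is a small sketch that evaluates the three rates at a few thresholds. The helper `rates`, the toy data, and the convention of reporting precision as 1.0 when nothing is predicted positive are all assumptions made for this example.

```python
def rates(probs, labels, tau):
    """Return (TPR, FPR, precision) at threshold tau."""
    preds = [1 if p >= tau else 0 for p in probs]
    tp = sum(1 for y, t in zip(preds, labels) if y == 1 and t == 1)
    fp = sum(1 for y, t in zip(preds, labels) if y == 1 and t == 0)
    fn = sum(1 for y, t in zip(preds, labels) if y == 0 and t == 1)
    tn = sum(1 for y, t in zip(preds, labels) if y == 0 and t == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    # Precision is undefined when nothing is predicted positive.
    # Reporting 1.0 is one convention among several (an assumption here).
    precision = tp / (tp + fp) if tp + fp else 1.0
    return tpr, fpr, precision


# Hypothetical toy data: 4 positives, 4 negatives.
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for tau in (0.75, 0.5, 0.25):
    tpr, fpr, prec = rates(probs, labels, tau)
    print(f"tau={tau:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}  precision={prec:.2f}")
# tau=0.75  TPR=0.50  FPR=0.00  precision=1.00
# tau=0.50  TPR=0.75  FPR=0.25  precision=0.75
# tau=0.25  TPR=1.00  FPR=0.50  precision=0.67
```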

3. ROC AUC — sweep all thresholds.

Plot $(\text{FPR}(\tau),\, \text{TPR}(\tau))$ for every $\tau \in [0, 1]$. The area under the curve is

$$ \text{AUC} \;=\; \int_0^1 \text{TPR}(\text{FPR}^{-1}(u))\, du \;=\; \Pr\big[\,\hat{p}_{+} > \hat{p}_{-}\,\big] $$

The last equality is the key intuition: AUC is the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative one (ties, if any, count as half). It is threshold-independent.
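The two sides of that equality can be checked numerically. The sketch below computes AUC once as the fraction of correctly ordered (positive, negative) pairs and once as the trapezoidal area under the swept ROC curve. Function names and data are illustrative assumptions; in practice scikit-learn's `roc_auc_score` would typically be used instead.

```python
def auc_pairwise(probs, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for p, t in zip(probs, labels) if t == 1]
    neg = [p for p, t in zip(probs, labels) if t == 0]
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_trapezoid(probs, labels):
    """Area under the ROC curve built by sweeping tau over the distinct scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # tau above every score: nothing predicted positive
    for tau in sorted(set(probs), reverse=True):
        tp = sum(1 for p, t in zip(probs, labels) if p >= tau and t == 1)
        fp = sum(1 for p, t in zip(probs, labels) if p >= tau and t == 0)
        points.append((fp / n_neg, tp / n_pos))  # (FPR, TPR)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical toy data: 4 positives, 4 negatives.
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(auc_pairwise(probs, labels))   # 0.8125 (13 of 16 pairs ordered correctly)
print(auc_trapezoid(probs, labels))  # 0.8125
```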

4. PR AUC — the imbalance-aware sibling.

Plot precision vs recall as $\tau$ sweeps. PR AUC (a.k.a. average precision):

$$ \text{AP} \;=\; \sum_{i} \big(\text{recall}_i - \text{recall}_{i-1}\big)\,\text{precision}_i $$

On a 99/1 imbalanced dataset, ROC AUC can look high even when most positive predictions are wrong: with so many negatives, a small FPR still means a large number of false positives. PR AUC stays sensitive because true negatives don’t enter precision or recall at all.
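A hedged sketch of both points in plain Python: `average_precision` implements the step sum above, and a synthetic 99/1 ranking shows ROC AUC and AP diverging. The ranking is hypothetical (1,000 examples, 10 positives, 50 negatives scored above every positive), and the function names are made up for this example. scikit-learn's `average_precision_score` and `roc_auc_score` are the usual tools for real data.

```python
def average_precision(probs, labels):
    """AP = sum_i (recall_i - recall_{i-1}) * precision_i, sweeping tau downward."""
    order = sorted(zip(probs, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    ap = prev_recall = 0.0
    tp = fp = 0
    i = 0
    while i < len(order):
        tau = order[i][0]
        # Examples tied at this score flip to "positive" together.
        while i < len(order) and order[i][0] == tau:
            if order[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def roc_auc(probs, labels):
    """Pairwise-ranking form of ROC AUC; ties count as half."""
    pos = [p for p, t in zip(probs, labels) if t == 1]
    neg = [p for p, t in zip(probs, labels) if t == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


# Synthetic ranking (an assumption for illustration): rank 0 has the highest
# score, and the 10 positives sit at ranks 50..59, below 50 negatives.
n = 1000
labels = [1 if 50 <= r < 60 else 0 for r in range(n)]
probs = [(n - r) / n for r in range(n)]

print(f"ROC AUC = {roc_auc(probs, labels):.3f}")            # ROC AUC = 0.949
print(f"AP      = {average_precision(probs, labels):.3f}")  # AP      = 0.097
```

Each positive outranks 940 of the 990 negatives, so ROC AUC is about 0.95. Yet the first positive only appears after 50 false alarms, so precision never exceeds 10/60 and AP stays below 0.1.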

5. F1 — the harmonic mean.

If you have to pick one $\tau$, F1 combines precision and recall into a single score (penalizing extreme imbalances between the two):

$$ F_1 \;=\; \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$

Pick $\tau$ to maximize F1 on held-out data when you don’t have a domain-specific cost matrix for FP vs FN.
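A minimal sketch of that search, assuming the candidate thresholds are simply the distinct predicted scores. The names `f1_at` and `best_f1_threshold` and the toy data are hypothetical. The code uses the identity $F_1 = 2TP / (2TP + FP + FN)$, which is algebraically the same as the harmonic mean above.

```python
def f1_at(probs, labels, tau):
    """F1 at threshold tau, written as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for p, t in zip(probs, labels) if p >= tau and t == 1)
    fp = sum(1 for p, t in zip(probs, labels) if p >= tau and t == 0)
    fn = sum(1 for p, t in zip(probs, labels) if p < tau and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(probs, labels):
    """Try every distinct score as a threshold and keep the F1-maximizing one."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda tau: f1_at(probs, labels, tau))


# Hypothetical toy data: 4 positives, 4 negatives.
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

tau_star = best_f1_threshold(probs, labels)
print(tau_star, f1_at(probs, labels, tau_star))  # 0.3 0.8
print(f1_at(probs, labels, 0.5))                 # 0.75
```

On this toy set the F1-optimal threshold is 0.3, not 0.5. On real data the search should run on a validation split, not on the data used to report the final score.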

Try this

Operating-point sweep

Slide τ from 0.05 to 0.95 and watch the magenta dot travel along the ROC curve. As τ → 1, FPR = 0 and TPR = 0 (nothing is predicted positive). As τ → 0, both are 1.

Imbalanced data → use PR

Hit imbalanced. Auto-train a bit. Notice ROC AUC stays high (~0.9+) while PR AUC dips. The PR view tells the truth about minority-class performance.

Find the F1-optimal threshold

On the overlap dataset after training, slide τ until F1 peaks. It’s usually not at 0.5; where it peaks depends on class balance and on how the predicted probabilities are distributed.

Perfect separability

Hit near-perfect and Auto-train. ROC curve hugs the top-left, AUC ≈ 1.0. Confusion matrix shows zero (or near-zero) off-diagonals at τ = 0.5.

An "anti-classifier"

Hit blobs, then manually set w₀ = -1, w₁ = -1, b = 0 (without training). ROC AUC drops below 0.5 — the model is anti-correlated with the truth. Inverting predictions would beat it.
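The "inverting predictions would beat it" claim follows from a symmetry: replacing every score $\hat{p}$ with $1 - \hat{p}$ reverses the ranking, so the new AUC is one minus the old one. A small sketch with made-up scores (not the demo's blobs data):

```python
def auc_pairwise(probs, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for p, t in zip(probs, labels) if t == 1]
    neg = [p for p, t in zip(probs, labels) if t == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical anti-correlated scores: positives mostly get LOW probabilities.
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

auc = auc_pairwise(probs, labels)
auc_flipped = auc_pairwise([1.0 - p for p in probs], labels)

print(auc)          # 0.1875
print(auc_flipped)  # 0.8125
```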

Confusion-matrix sweep

At each τ, the four cells trade off. Lower τ: TP and FP up, TN and FN down. The four metrics in the KPI row each respond differently — F1 most stably, accuracy least.

In one glance

⚠️ Watch out: ROC AUC on imbalanced data · τ = 0.5 by default · Single metric reporting · No baseline · Cherry-picked τ
🔧 In practice: roc_auc_score · average_precision_score · classification_report · precision_recall_curve · f1_score

Frequently asked

What does ROC AUC actually measure?

The probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. AUC = 1.0 means perfect ranking; AUC = 0.5 is random; below 0.5 means the model is anti-correlated (you’d do better flipping the predictions). Crucially, AUC is threshold-independent: it measures the model’s ability to rank, not its accuracy at any single τ.

Why prefer PR AUC on imbalanced data?

The ROC curve’s x-axis, FPR, is normalized by the number of negatives. On imbalanced data (e.g., a 99/1 split) that denominator is huge, so even many false positives barely move the curve, and ROC AUC can look deceptively high while most positive predictions are wrong. Precision and recall ignore true negatives entirely, so PR AUC stays sensitive to performance on the minority class. Rule of thumb: imbalanced data → look at PR AUC, not ROC AUC.

How do I choose the threshold τ?

Depends on which error is more expensive. If false positives are costly (flagging legit transactions as fraud), raise τ. If false negatives are costly (missing a cancer diagnosis), lower τ. F1 balances both. For calibrated probabilities, τ = 0.5 is a starting point, but for most real problems the optimal τ is found by maximizing F1 on held-out data or by a cost-benefit analysis.

Why does the ROC curve look like a staircase?

Each step corresponds to one example (or a group of examples with tied scores) changing its predicted class as the threshold sweeps past its predicted probability. With more data points the steps shrink and the curve looks smoother. With fewer points it’s blocky: useful for seeing the discrete nature of the sweep, but cosmetically rougher.