ML Pipeline, end to end
A real ML project moves through six stages: dataset → EDA → preprocess → train → evaluate → deploy. This page walks one synthetic example through all six. Pick a dataset on Step 1, click through, watch three models train head-to-head, and end up at a tiny live-prediction demo. Everything runs in your browser.
The lifecycle is a cycle, not a line
Deployment is not the end. In production, you monitor inputs and predictions, detect drift, and feed insights back into preprocessing or retraining. Models decay; pipelines loop.
- Dataset
- EDA
- Preprocess
- Train
- Evaluate
- Deploy
Pick a dataset
Every pipeline starts with a question — and the dataset that might answer it.
A dataset is rows (samples) × columns (features + target). It constrains every downstream choice — feature space, model family, evaluation metric, even feasibility.
Iris Flowers
150 samples · 4 features · 3 classes
Wine Quality
200 samples · 6 features · binary
Heart Disease
180 samples · 5 features · binary
Explore the data
Sense-make before you model. Tukey wrote a whole book on this in 1977 for a reason.
EDA = looking at the data before modeling — stats, distributions, class balance, missing values. Bad EDA → bad model, and you won’t catch it from a test-accuracy number alone.
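A first EDA pass can be sketched in a few lines of standard-library Python. The rows below are hypothetical toy values (not one of this page's datasets), `None` marks a missing value, and `summarize` is an illustrative helper name rather than anything the demo defines.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical toy rows: (feature_a, feature_b, label); None = missing value.
rows = [
    (5.1, 3.5, "a"),
    (4.9, None, "a"),
    (6.2, 2.9, "b"),
    (5.9, 3.0, "b"),
    (6.7, 3.1, "b"),
]

def summarize(rows, n_features):
    """Return per-feature stats, missing-value counts, and class counts."""
    report = {"features": [], "classes": Counter(r[-1] for r in rows)}
    for j in range(n_features):
        # Compute statistics over the values that are actually present.
        present = [r[j] for r in rows if r[j] is not None]
        report["features"].append({
            "missing": len(rows) - len(present),
            "mean": mean(present),
            "std": stdev(present),
            "min": min(present),
            "max": max(present),
        })
    return report

report = summarize(rows, n_features=2)
for j, stats in enumerate(report["features"]):
    print(f"feature {j}: {stats}")
print("class balance:", dict(report["classes"]))
```

Even this small report surfaces the things a single test-accuracy number hides: a missing value in the second feature and an uneven class split.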
Prepare the data
Scale, split, encode. Do it in the right order — or you’ll leak the test set.
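Scale numerics, encode categoricals, and impute missing values, but split the data before fitting any of these transforms. Fit each transform on the training set only, then apply it to the test set; otherwise test-set statistics leak into training. This is one of the most common pipeline bugs.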
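The split-then-fit order can be sketched with scikit-learn as follows. The data here is synthetic and generated only for illustration; the 80/20 ratio and the choice of `StandardScaler` are assumptions, not necessarily what this page's demo does.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 3 numeric features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = (X[:, 0] > 50.0).astype(int)

# 1) Split first, stratifying so both sets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2) Fit the scaler on training rows only.
scaler = StandardScaler().fit(X_train)

# 3) Apply the same training-derived mean and std to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Calling `fit` on the full `X` before splitting would let the test rows influence the mean and standard deviation, which is exactly the leak described above.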
Split summary
Rule of thumb: 80/20 for medium datasets. For small data, use k-fold cross-validation.
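For the small-data case, k-fold cross-validation might look like the sketch below. It uses scikit-learn's bundled Iris data rather than this page's synthetic version, and the model (scaled logistic regression) and k = 5 are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Wrapping the scaler and model in one pipeline means the scaler is
# re-fit on the training folds only, inside every split.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
mean_score = scores.mean()

print("per-fold accuracy:", scores)
print("mean accuracy:", mean_score)
```

Every sample is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split.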
Train the models
Fit parameters that minimize loss. Then check whether you’ve overfit.
Training fits parameters (weights, splits) by minimizing a loss. You set hyperparameters (learning rate, tree depth, regularization strength); training finds the rest. We run three baselines: different model families, different assumptions.
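A head-to-head run of three baselines could be sketched like this. The page does not name its three models, so logistic regression, a shallow decision tree, and k-nearest neighbors are assumptions chosen to represent three different families; the data is synthetic and the hyperparameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem: 200 samples, 6 features.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters (C, max_depth, n_neighbors) are set by hand;
# fit() learns the parameters (weights, split thresholds).
models = {
    "logistic_regression": LogisticRegression(C=1.0, max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train_acc": model.score(X_train, y_train),
        "test_acc": model.score(X_test, y_test),
    }

for name, r in results.items():
    print(f"{name}: train={r['train_acc']:.2f} test={r['test_acc']:.2f}")
```

A large gap between training and test accuracy for any one model is the overfitting signal to look for.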
Evaluate honestly
Accuracy is the wrong metric most of the time. Pick one that matches your problem.
Pick metrics that match your problem: accuracy for balanced classes, F1 for imbalanced classes, AUC for ranking, RMSE for regression. Always compare against a baseline: on a 95/5 class split, 95% accuracy is exactly what a trivial model that always predicts the majority class achieves.
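The 95/5 case can be worked through in plain Python. The labels and both prediction vectors below are constructed by hand for illustration; `accuracy` and `f1_score` are small local helpers, not library functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# Trivial baseline: always predict the majority class.
majority_pred = [0] * 100
baseline_acc = accuracy(y_true, majority_pred)
baseline_f1 = f1_score(y_true, majority_pred)

# Hypothetical model: 3 false positives, 4 of 5 positives found.
model_pred = [1] * 3 + [0] * 92 + [1] * 4 + [0] * 1
model_acc = accuracy(y_true, model_pred)
model_f1 = f1_score(y_true, model_pred)

print(f"baseline: acc={baseline_acc:.2f} f1={baseline_f1:.2f}")
print(f"model:    acc={model_acc:.2f} f1={model_f1:.2f}")
```

The two accuracies differ by a single point (0.95 versus 0.96), while F1 separates a model that finds no positives from one that finds most of them.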
Deploy — and keep watching
Shipping is when ML starts, not when it ends. Models drift; data shifts; performance decays.
Deployment is the start of the loop, not the end. Serve predictions, monitor inputs and outputs, detect drift, retrain. Models decay; pipelines loop.
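One crude way to flag data drift on a single numeric feature is to compare the mean of a live window against the training-time reference. This is a simple heuristic chosen for illustration, not this page's method; the function names, the threshold of 3.0, and the deterministic toy values are all assumptions.

```python
from math import sqrt
from statistics import mean, stdev

def mean_shift_score(reference, live):
    """Shift of the live mean, in standard errors of a live-sized window.

    Uses the reference standard deviation and ignores uncertainty in the
    reference mean, so treat it as a rough alarm rather than a formal test.
    """
    standard_error = stdev(reference) / sqrt(len(live))
    return abs(mean(live) - mean(reference)) / standard_error

def has_drifted(reference, live, threshold=3.0):
    """Flag drift when the mean shift exceeds the chosen threshold."""
    return mean_shift_score(reference, live) > threshold

# Deterministic toy values so the example is reproducible.
reference = [float(i % 10) for i in range(1000)]  # training-time feature
stable = [float(i % 10) for i in range(200)]      # same distribution
shifted = [v + 3.0 for v in stable]               # mean moved up by 3

print("stable drifted: ", has_drifted(reference, stable))
print("shifted drifted:", has_drifted(reference, shifted))
```

In a real system a check like this would run per feature on a schedule, and a raised flag would feed a retraining trigger rather than a print statement.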
Beyond this demo
Toy → real data
Real data is messy: missing values, label noise, leakage, multi-modal. Try a Kaggle dataset or scikit-learn's fetch_openml.
Hand-rolled → sklearn
Production uses sklearn, xgboost, lightgbm, pytorch — built for edge cases and scale.
One split → k-fold CV
Real evaluation uses 5- or 10-fold cross-validation, a separate validation set for hyperparameter tuning, and multiple random seeds to estimate variance.
Form → production
Registries, CI/CD, A/B tests, shadow modes, drift detectors, retraining triggers, rollback playbooks.
ML Pipeline Glossary — 20 terms
- Feature
- An input column. Numeric, categorical, or derived (engineered) from raw signals.
- Target / label
- The output column you’re trying to predict. Sometimes called y.
- Sample / example
- One row of the dataset — a feature vector and (in supervised learning) its label.
- Parameter
- A value the model learns during training (e.g., a weight in linear regression).
- Hyperparameter
- A value you set before training (e.g., learning rate, k in KNN, tree depth).
- Loss function
- A scalar that measures how wrong the model is. Training minimizes it.
- Gradient descent
- Standard optimizer: take steps opposite to the gradient of the loss to lower it.
- Overfitting
- The model memorized the training set, so test accuracy is much worse than training accuracy.
- Underfitting
- The model is too simple to capture the pattern — high error everywhere.
- Bias-variance tradeoff
- Simple models have high bias / low variance; complex models, the opposite.
- Regularization
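- A constraint that discourages overly complex models, most often a penalty added to the loss to keep parameters small (L1, L2). Dropout is a related technique that randomly disables units during training rather than adding a penalty term.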
- Cross-validation (k-fold)
- Split the data into k chunks (folds); train on k-1 of them and test on the one held out; rotate until every fold has been the test fold once; average the scores.
- Stratified sampling
- Split that preserves class ratios — essential for imbalanced data.
- Class imbalance
- One class dominates (e.g., 95/5). Accuracy becomes a useless metric.
- Data leakage
- Test-set information bleeds into training. Inflates reported accuracy; ruins production.
- Inference
- Running the trained model on new data to produce predictions.
- Pipeline (sklearn)
- An object that chains preprocessing steps and a final estimator. Calling fit fits every step on the training data only, which keeps transforms from leaking test or validation data.
- Feature engineering
- Hand-crafting new features (ratios, log transforms, interactions) from existing ones.
- Drift
- Input distribution (data drift) or input–output relationship (concept drift) shifts after deployment.
- Baseline
- A trivially simple model (majority class, mean, linear). Beat it before claiming you have an ML model.
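To make the glossary's parameter, hyperparameter, loss function, and gradient descent entries concrete, here is a minimal sketch that fits a line by gradient descent on mean squared error. The data points, learning rate, and step count are illustrative choices, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on mean squared error.

    w and b are parameters (learned); lr and steps are hyperparameters (set).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every sample at the current w and b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the loss mean(error**2) with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step opposite to the gradient to lower the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

w, b = fit_line(xs, ys)
print(f"w={w:.4f} b={b:.4f}")
```

With a learning rate that is too large for the data the updates diverge instead of converging, which is why it is tuned as a hyperparameter.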
Frequently asked
Scaling to [0, 1] (or z-score scaling to mean 0 / std 1) puts every feature on equal footing.
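Both scalings are short enough to write out by hand. The sketch below uses hypothetical `ages` values and local helper names; in a real pipeline the minimum, maximum, mean, and standard deviation would be computed on the training set only and reused for new data.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Shift to mean 0 and rescale to (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mu) / sigma for v in values]

ages = [20.0, 30.0, 40.0, 50.0, 60.0]  # hypothetical feature column

scaled_01 = min_max_scale(ages)
scaled_z = z_score_scale(ages)

print(scaled_01)
print([round(z, 3) for z in scaled_z])
```

Distance-based models such as KNN are the usual motivation: without scaling, a feature measured in large units dominates the distance.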