ML Pipeline, end to end

Lifecycle walkthrough

A real ML project moves through six stages: dataset → EDA → preprocess → train → evaluate → deploy. This page walks one synthetic example through all six. Pick a dataset on Step 1, click through, watch three models train head-to-head, and end up at a tiny live-prediction demo. Everything runs in your browser.

The lifecycle is a cycle, not a line

Deployment is not the end. In production, you monitor inputs and predictions, detect drift, and feed insights back into preprocessing or retraining. Models decay; pipelines loop.

Lifecycle diagram: Dataset (step 1: source · label · version) → EDA (step 2: describe · plot · sanity-check) → Prep (step 3: clean · scale · split) → Train (step 4: fit parameters · validate) → Eval (step 5: metric on test set) → Deploy (step 6: serve predictions) → Monitor (drift) → back to Dataset (retrain on drift · new labels · new data)
Step 01 · of 06

Pick a dataset

Every pipeline starts with a question — and the dataset that might answer it.

A dataset is rows (samples) × columns (features + target). It constrains every downstream choice — feature space, model family, evaluation metric, even feasibility.
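
As a minimal sketch of the rows × columns view, the snippet below loads scikit-learn's bundled Iris data and inspects the feature matrix and the target. Using `load_iris` is an assumption made for illustration: the demo on this page generates its own synthetic data.

```python
from collections import Counter

from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the demo's synthetic set.
iris = load_iris()
X, y = iris.data, iris.target  # X: samples x features, y: one label per sample

n_samples, n_features = X.shape
class_counts = Counter(y.tolist())  # number of samples in each class

print(f"{n_samples} samples, {n_features} features, {len(class_counts)} classes")
print("samples per class:", dict(class_counts))
```

Checking the shape and the per-class counts first is cheap, and it flags the "too few samples per class" problem before any modeling starts.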

🌺

Iris Flowers

150 samples · 4 features · 3 classes

🍷

Wine Quality

200 samples · 6 features · binary

❤️

Heart Disease

180 samples · 5 features · binary

⚠️ Watch out: Available ≠ appropriate · Too few samples / class · No data versioning · Benchmark ≠ production
🔧 In practice: DVC · Feast · Tecton · data registry
Step 02 · of 06

Explore the data

Sense-make before you model. Tukey wrote a whole book on this in 1977 for a reason.

EDA = looking at the data before modeling — stats, distributions, class balance, missing values. Bad EDA → bad model, and you won’t catch it from a test-accuracy number alone.

Samples: rows in the table
Features: input columns
Classes: target categories
Missing: 0 (synthetic data, no gaps)
Charts: feature distributions (means) · class distribution
⚠️ Watch out: Skip EDA → fit() · Ignore class balance · Trust column names · Outliers = typos
🔧 In practice: DataFrame.describe() · seaborn.pairplot · ydata-profiling (formerly pandas-profiling) · plot target first
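
The checks named in this step (summary statistics, missing values, class balance) can be sketched in a few lines of pandas. This is one possible first pass, not the page's own implementation. scikit-learn's Iris data is assumed as a stand-in for the synthetic demo data.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: Iris from scikit-learn stands in for the demo's synthetic data.
df = load_iris(as_frame=True).frame  # four feature columns plus a "target" column

summary = df.describe()               # per-column count, mean, std, quartiles
missing_per_column = df.isna().sum()  # number of missing values in each column
class_share = df["target"].value_counts(normalize=True)  # class balance

print(summary.loc[["mean", "std", "min", "max"]])
print("missing values:", int(missing_per_column.sum()))
print(class_share)
```

Looking at `class_share` before training tells you whether plain accuracy will be a fair metric later on.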
Step 03 · of 06

Prepare the data

Scale, split, encode. Do it in the right order — or you’ll leak the test set.

Scale numerics, encode categoricals, and impute missing values, but split first and fit every transform on the training set only. Otherwise statistics from the test set leak into training. This is one of the most common pipeline bugs.
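
A minimal sketch of the leak-free order, assuming scikit-learn and its bundled Iris data as a stand-in for the demo's synthetic set: split first, then let a pipeline fit the scaler on the training rows only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# 1. Split first, stratified so both sides keep the class ratios.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. The pipeline fits the scaler on the training rows only ...
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# 3. ... and reuses those training statistics when it transforms the test rows.
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline also means the same transform is applied automatically at inference time.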

Train / Test split 80/20

Rule of thumb: 80/20 for medium datasets. For small data, use k-fold cross-validation.
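
For the small-data case, stratified k-fold cross-validation might look like the sketch below. The model choice (KNN with k = 5) and the Iris data are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline, so it is refit on each fold's training part.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("per-fold accuracy:", scores.round(3))
print(f"mean {scores.mean():.3f}, std {scores.std():.3f}")
```

Reporting the spread across folds alongside the mean gives a rough sense of how much a single 80/20 split could mislead.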

⚠️ Watch out: Test-set leakage · Drop NaN blindly · High-cardinality one-hot · No inference-time transforms · Unstratified imbalanced split
🔧 In practice: sklearn.pipeline.Pipeline · ColumnTransformer · StratifiedKFold · SimpleImputer
Step 04 · of 06

Train the models

Fit parameters that minimize loss. Then check whether you’ve overfit.

Training fits a model's parameters (weights, tree splits) by minimizing a loss. You set the hyperparameters (learning rate, tree depth, regularization strength); training finds the rest. We run three baselines from different model families, each with different assumptions.
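
A sketch of the three-baseline comparison using scikit-learn. The demo's own models are hand-rolled, so the estimators, the hyperparameter values, and the Iris data below are all assumptions. Each model reports training and test accuracy side by side so that a large gap is visible.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters (C, n_neighbors, max_depth) are set by hand here;
# fit() then learns or stores the rest (weights, training points, tree splits).
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)
    ),
    "K-Nearest Neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    results[name] = (train_acc, test_acc)
    # A large train-minus-test gap is a hint of overfitting.
    print(f"{name:20s} train={train_acc:.3f} test={test_acc:.3f} gap={train_acc - test_acc:+.3f}")
```

Fixing `random_state` for the split and for the tree keeps the comparison reproducible from run to run.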

Training progress: Logistic Regression · K-Nearest Neighbors · Decision Tree
⚠️ Watch out: Mix params/hyperparams · No random_state · Skip baseline · Silent overfit · No learning curves
🔧 In practice: GridSearchCV · Optuna · early_stopping · class_weight · cross_val_score
Step 05 · of 06

Evaluate honestly

Accuracy is the wrong metric most of the time. Pick one that matches your problem.

Pick metrics that match your problem: accuracy for balanced classes, F1 for imbalanced classes, AUC for ranking, RMSE for regression. Always compare against a baseline: on a 95/5 class split, a trivial model that always predicts the majority class already scores 95% accuracy.
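
The 95/5 case can be worked through directly. The labels below are hypothetical, built only to match that ratio, and the "model" is a constant majority-class prediction.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Hypothetical labels: 95 negatives and 5 positives (a 95/5 split).
y_true = [0] * 95 + [1] * 5

# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 reports F1 as 0 when the positive class is never predicted.
f1_minority = f1_score(y_true, y_pred, pos_label=1, zero_division=0)
balanced = balanced_accuracy_score(y_true, y_pred)

print(f"accuracy          = {accuracy:.2f}")
print(f"F1 (minority)     = {f1_minority:.2f}")
print(f"balanced accuracy = {balanced:.2f}")
```

The trivial model scores 0.95 accuracy, but 0.00 F1 on the minority class and 0.50 balanced accuracy, which is why a single accuracy number says little on imbalanced data.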

Charts: model comparison (test accuracy) · feature importance (logistic |w|) · winner
⚠️ Watch out: Accuracy on imbalanced · Single metric · No baseline · Cherry-pick seed · No CI
🔧 In practice: classification_report · confusion_matrix · roc_auc_score · cross_val_score
Step 06 · of 06

Deploy — and keep watching

Shipping is when ML starts, not when it ends. Models drift; data shifts; performance decays.

Deployment starts the loop rather than ending it. Serve predictions, monitor inputs and outputs, detect drift, and retrain when performance degrades.

Live demo: make a prediction (predicted class · confidence % · model summary metrics)
⚠️ Watch out: No monitoring · No shadow period · No rollback plan · Model as static artifact
🔧 In practice: MLflow · BentoML · SageMaker · Evidently · Arize
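
To make "detect drift" concrete, here is a deliberately simple heuristic in plain Python: measure how far the mean of a live batch has moved from the training mean, in units of the training standard deviation. The function names, the threshold, and the sample values are all hypothetical. Dedicated monitoring tools use richer statistical tests than this.

```python
from statistics import mean, stdev


def mean_shift_score(reference, live):
    """Shift of the live mean from the reference mean, in reference std units."""
    ref_std = stdev(reference)
    if ref_std == 0:
        raise ValueError("reference feature is constant; the score is undefined")
    return abs(mean(live) - mean(reference)) / ref_std


def has_drifted(reference, live, threshold=0.5):
    # The threshold is a hypothetical starting point; tune it per feature.
    return mean_shift_score(reference, live) > threshold


# Hypothetical values of one feature at training time and in production.
training_values = [4.9, 5.1, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9]
stable_batch = [5.0, 4.9, 5.1, 5.0]
shifted_batch = [5.9, 6.1, 6.0, 6.2]

print("stable batch drifted: ", has_drifted(training_values, stable_batch))
print("shifted batch drifted:", has_drifted(training_values, shifted_batch))
```

A check like this only catches shifts in the mean of one feature. Changes in spread, in category frequencies, or in the input-output relationship need other signals, such as delayed labels.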

Beyond this demo

Toy → real data

Real data is messy — missing values, label noise, leakage, multi-modal. Try Kaggle or fetch_openml.

Hand-rolled → sklearn

Production uses sklearn, xgboost, lightgbm, pytorch — built for edge cases and scale.

One split → k-fold CV

Real evaluation uses k-fold cross-validation (k = 5 or 10), a separate validation set for tuning, and multiple seeds to estimate variance.

Form → production

Registries, CI/CD, A/B tests, shadow modes, drift detectors, retraining triggers, rollback playbooks.

ML Pipeline Glossary — 20 terms
Feature
An input column. Numeric, categorical, or derived (engineered) from raw signals.
Target / label
The output column you’re trying to predict. Sometimes called y.
Sample / example
One row of the dataset — a feature vector and (in supervised learning) its label.
Parameter
A value the model learns during training (e.g., a weight in linear regression).
Hyperparameter
A value you set before training (e.g., learning rate, k in KNN, tree depth).
Loss function
A scalar that measures how wrong the model is. Training minimizes it.
Gradient descent
Standard optimizer: take steps opposite to the gradient of the loss to lower it.
Overfitting
The model fits noise or memorizes the training set, so test accuracy is much worse than training accuracy.
Underfitting
The model is too simple to capture the pattern — high error everywhere.
Bias-variance tradeoff
Simple models have high bias / low variance; complex models, the opposite.
Regularization
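A constraint that discourages overly complex fits. Most often a penalty added to the loss to keep parameters small (L1, L2); dropout regularizes neural networks differently, by randomly zeroing activations during training.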
Cross-validation (k-fold)
Split the data into k chunks (folds); train on k-1 and test on the held-out fold; rotate until every fold has served as the test fold once; average the scores.
Stratified sampling
Split that preserves class ratios — essential for imbalanced data.
Class imbalance
One class dominates (e.g., 95/5). Accuracy becomes a misleading metric.
Data leakage
Test-set information bleeds into training. Inflates reported accuracy; ruins production.
Inference
Running the trained model on new data to produce predictions.
Pipeline (sklearn)
An object that chains preprocessing and a final estimator with safe fit/transform.
Feature engineering
Hand-crafting new features (ratios, log transforms, interactions) from existing ones.
Drift
Input distribution (data drift) or input–output relationship (concept drift) shifts after deployment.
Baseline
A trivially simple model (majority class, mean, linear). Beat it before claiming you have an ML model.
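
Several glossary terms (parameter, hyperparameter, loss function, gradient descent, regularization) fit into one small worked example. The sketch below fits a one-parameter model y = w · x with plain gradient descent, with and without an L2 penalty. The data points and hyperparameter values are made up for illustration.

```python
# Hypothetical toy data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]


def loss(w, l2=0.0):
    """Mean squared error of the model y = w * x, plus an L2 penalty on w."""
    mse = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return mse + l2 * w**2


def gradient(w, l2=0.0):
    """Derivative of loss(w, l2) with respect to the parameter w."""
    d_mse = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return d_mse + 2 * l2 * w


def fit(learning_rate=0.05, steps=200, l2=0.0):
    w = 0.0  # the parameter: learned during training
    for _ in range(steps):
        w -= learning_rate * gradient(w, l2)  # step against the gradient
    return w


# learning_rate, steps and l2 are hyperparameters: chosen before training.
w_plain = fit()
w_ridge = fit(l2=1.0)  # the penalty pulls w toward zero
print(f"w without penalty: {w_plain:.3f}")
print(f"w with L2 penalty: {w_ridge:.3f}")
```

With these numbers the unpenalized fit converges to w ≈ 1.990, the least-squares solution. The penalized fit settles at a smaller value, w ≈ 1.756, showing how regularization trades a little training error for smaller parameters.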

Frequently asked

What are the stages of an ML pipeline?
The canonical stages are: data acquisition, exploratory data analysis (EDA), preprocessing (cleaning, encoding, scaling, splitting), model training, evaluation, and deployment. This page walks one synthetic example through all six in order. It has the same shape as a real production ML workflow, just compressed and with toy data.
Why split the data into training and test sets?
To estimate how the model will perform on unseen data. Evaluating on the same examples you trained on is meaningless because the model can just memorize them. A held-out test set gives an unbiased estimate of generalization, provided it is never used for tuning. 80/20 is a common default; smaller datasets often use cross-validation for more robust estimates.
Why scale features?
Many algorithms (logistic regression, KNN, neural networks) are sensitive to feature scale. If one feature ranges 0–10000 and another 0–1, the larger one will dominate the loss gradient or distance metric. Min-max scaling to [0, 1] (or z-score scaling to mean 0 / std 1) puts every feature on equal footing.
What is feature importance?
A score showing how much each feature contributed to the model’s predictions. For logistic regression we use the absolute value of the learned weight: features with bigger weights move the prediction more, provided the features are on comparable scales. Tree models compute it from the impurity reduction (information gain) that splits on each feature achieve. Use it to spot which inputs actually matter and which you could drop.
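
A hedged sketch of both kinds of feature importance using scikit-learn, with Iris as a stand-in dataset. Averaging |w| across the three classes is an assumption made here to get one number per feature; the page does not say how it aggregates weights for multi-class problems.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X, y, names = data.data, data.target, data.feature_names

# Scale first so the weight magnitudes are comparable across features.
linear = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
linear.fit(X, y)
coef = linear.named_steps["logisticregression"].coef_  # shape: (classes, features)
linear_importance = np.abs(coef).mean(axis=0)  # assumption: mean |w| over classes

# Tree importance: impurity reduction credited to each feature, normalized to sum to 1.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_importance = tree.feature_importances_

for name, lin, tr in zip(names, linear_importance, tree_importance):
    print(f"{name:20s} |w|={lin:.3f} tree={tr:.3f}")
```

The two scores are on different scales and need not rank features identically, so treat either one as a hint rather than a verdict on which inputs to drop.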