What does an ML pipeline include?

The canonical stages are: data acquisition, exploratory data analysis (EDA), preprocessing (cleaning, encoding, scaling, splitting), model training, model evaluation, and deployment. This page walks one synthetic example through all six in order — same shape as a real production ML workflow, just compressed and with toy data.

ML Pipeline, end to end

Lifecycle walkthrough

A real ML project moves through six stages: dataset → EDA → preprocess → train → evaluate → deploy. Pick a dataset in Step 1, click through, watch three models train head-to-head, and end at a tiny live-prediction demo. Everything runs in your browser.

The lifecycle is a cycle, not a line

Deployment is not the end. In production, you monitor inputs and predictions, detect drift, and feed insights back into preprocessing or retraining. Models decay; pipelines loop.

📊 Dataset → 🔍 EDA → ⚙️ Preprocess → 🎓 Train → 📈 Evaluate → 🚀 Deploy

Step 01 of 06

Pick a dataset

Every pipeline starts with a question — and the dataset that might answer it.

A dataset is rows (samples) × columns (features + target). It constrains every downstream choice — feature space, model family, evaluation metric, even feasibility.
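As a minimal sketch of that rows × columns shape, the snippet below builds a small synthetic table in plain Python. The function name `make_toy_dataset`, the Gaussian class centers, and the seed are illustrative assumptions, not the generator this page actually uses.

```python
import random

def make_toy_dataset(n_per_class=50, n_features=4, n_classes=3, seed=0):
    """Return (X, y): X is a list of feature rows, y the matching class labels."""
    rng = random.Random(seed)
    X, y = [], []
    for label in range(n_classes):
        for _ in range(n_per_class):
            # Assumption: each class is a Gaussian blob centered at its label value.
            X.append([rng.gauss(label, 0.5) for _ in range(n_features)])
            y.append(label)
    return X, y

X, y = make_toy_dataset()
print(len(X), "samples,", len(X[0]), "features,", len(set(y)), "classes")
```

With the defaults this mirrors the Iris card's shape: 150 rows, 4 feature columns, and a 3-class target.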

🌺 Iris Flowers: 150 samples · 4 features · 3 classes

🍷 Wine Quality: 200 samples · 6 features · binary

❤️ Heart Disease: 180 samples · 5 features · binary

⚠️ Watch out: Available ≠ appropriate · Too few samples per class · No data versioning · Benchmark ≠ production

🔧 In practice: DVC · Feast · Tecton · data registry

Step 02 of 06

Explore the data

Sense-make before you model. Tukey wrote a whole book on this in 1977 for a reason.

EDA = looking at the data before modeling — stats, distributions, class balance, missing values. Bad EDA → bad model, and you won’t catch it from a test-accuracy number alone.
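A hand-rolled sketch of those checks (per-feature stats, missing values, class balance) might look like the following. The helper name `summarize` and the five toy rows are hypothetical; on real data you would more likely reach for pandas.

```python
from collections import Counter
from statistics import mean, stdev

def summarize(X, y, feature_names):
    """Print per-feature stats, missing-value counts, and class balance."""
    for j, name in enumerate(feature_names):
        column = [row[j] for row in X]
        present = [v for v in column if v is not None]
        missing = len(column) - len(present)
        print(f"{name}: mean={mean(present):.2f} std={stdev(present):.2f} "
              f"min={min(present):.2f} max={max(present):.2f} missing={missing}")
    counts = Counter(y)
    for label, count in sorted(counts.items()):
        print(f"class {label}: {count} ({count / len(y):.0%})")
    return counts

# Hypothetical toy rows: two features, one missing value, imbalanced labels.
X = [[5.1, 3.5], [4.9, None], [6.2, 2.9], [5.9, 3.0], [6.7, 3.1]]
y = ["a", "a", "a", "a", "b"]
counts = summarize(X, y, ["length", "width"])
```

Even on five rows the summary surfaces two things a test-accuracy number would hide: a gap in `width` and a 4-to-1 class imbalance.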

Samples: rows in the table

Features: input columns

Classes: target categories

Missing: none (synthetic data, no gaps)

Feature distributions (means)

Class distribution

⚠️ Watch out: Skip EDA → fit() · Ignore class balance · Trust column names · Outliers = typos

🔧 In practice: DataFrame.describe() · seaborn.pairplot · pandas-profiling (now ydata-profiling) · plot target first

Step 03 of 06

Prepare the data

Scale, split, encode. Do it in the right order — or you’ll leak the test set.

Scale numerics, encode categoricals, and impute missing values, but split first and fit every transform on the training set only; otherwise test-set statistics leak into training. This is one of the most common pipeline bugs.
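To make the ordering concrete, here is a minimal sketch of leak-free min-max scaling: the ranges are learned from training rows and then reused, unchanged, on test rows. The helper names `fit_min_max` and `apply_min_max` and the toy numbers are assumptions for illustration.

```python
def fit_min_max(rows):
    """Learn per-feature (min, max) from the rows given: training rows only."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def apply_min_max(rows, ranges):
    """Scale rows with previously learned ranges; constant features map to 0.0."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, (lo, hi) in zip(row, ranges)
        ])
    return scaled

train = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
test = [[2.5, 700.0]]  # 700 lies outside the training range

ranges = fit_min_max(train)                  # statistics come from train only
train_scaled = apply_min_max(train, ranges)
test_scaled = apply_min_max(test, ranges)    # same ranges, never refit on test
print(test_scaled)  # [[0.75, 1.25]]
```

A scaled test value above 1 is not a bug. It shows the test row was never consulted when the ranges were learned.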

Normalize features (min-max to [0, 1]) · Stratified split (preserve class ratios)

Train / Test split 80/20
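A stratified 80/20 split can be sketched in a few lines: shuffle within each class, then take the same fraction of every class for the test set. The function name `stratified_split` and the seed are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_split(y, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split by the same fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

y = [0] * 90 + [1] * 10          # 90/10 class imbalance
train_idx, test_idx = stratified_split(y)
n_minority_in_test = sum(y[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_minority_in_test)  # 80 20 2
```

A plain random split of the same data could easily put zero or five minority samples in the test set; stratifying pins it at the expected two.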

Rule of thumb: 80/20 for medium datasets. For small data, use k-fold cross-validation.

⚠️ Watch out: Test-set leakage · Drop NaN blindly · High-cardinality one-hot · No inference-time transforms · Unstratified imbalanced split

🔧 In practice: sklearn.pipeline.Pipeline · ColumnTransformer · StratifiedKFold · SimpleImputer
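As a hedged sketch of how some of these tools fit together, the example below chains imputation, scaling, and a classifier in one scikit-learn `Pipeline` and scores it with stratified cross-validation. Inside each fold the imputer and scaler are fit on that fold's training rows only. The synthetic blobs, step names, and seeds are assumptions; `ColumnTransformer` is left out because this toy data has numeric columns only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Hypothetical data: two Gaussian blobs, with a few values knocked out as NaN.
X = np.vstack([rng.normal(0.0, 1.0, (60, 3)), rng.normal(2.0, 1.0, (60, 3))])
y = np.array([0] * 60 + [1] * 60)
X[rng.integers(0, 120, 10), rng.integers(0, 3, 10)] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaNs with the train median
    ("scale", MinMaxScaler()),                     # min-max to [0, 1]
    ("model", LogisticRegression()),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)
print(scores.mean())
```

Because the transforms live inside the pipeline object, the same fitted steps are reapplied at prediction time, which also covers the "no inference-time transforms" pitfall.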

Step 04 of 06

Train the models

Fit parameters that minimize loss. Then check whether you’ve overfit.

Training fits parameters (weights, split thresholds) by minimizing a loss. You set the hyperparameters (learning rate, tree depth, regularization strength); training finds the rest. We run three baselines from different model families, each with different assumptions.
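The split between parameters and hyperparameters is easiest to see in a hand-rolled logistic regression. In this sketch `w` and `b` are learned, while `lr` and `epochs` are chosen up front. The toy data and default values are assumptions; this is not the page's actual training code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent on mean log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0                      # parameters: learned
    for _ in range(epochs):                    # lr and epochs: hyperparameters
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b)
            error = p - label                  # d(log loss)/dz for one sample
            for j in range(d):
                grad_w[j] += error * row[j] / n
            grad_b += error / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def log_loss(X, y, w, b):
    """Mean log loss; eps guards against log(0)."""
    eps = 1e-12
    total = 0.0
    for row, label in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(X)

X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
loss_before = log_loss(X, y, [0.0, 0.0], 0.0)
loss_after = log_loss(X, y, w, b)
print(loss_before, "->", loss_after)
```

The loss starts at ln 2 (a coin flip) and falls as the weights move. A falling training loss alone says nothing about overfitting, which is why the next step evaluates on held-out data.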

Training progress

Logistic Regression · K-Nearest Neighbors · Decision Tree

⚠️ Watch out: Mix params/hyperparams · No random_state · Skip baseline · Silent overfit · No learning curves

🔧 In practice: GridSearchCV · Optuna · early_stopping · class_weight · cross_val_score

Step 05 of 06

Evaluate honestly

Accuracy is the wrong metric most of the time. Pick one that matches your problem.

Pick metrics that match your problem: accuracy for balanced classes, F1 for imbalanced classes, AUC for ranking, RMSE for regression. Always compare against a baseline: with a 95/5 class balance, a trivial model that always predicts the majority class already scores 95% accuracy.
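The 95/5 case is worth working through. The sketch below scores a majority-class baseline with both accuracy and F1; the helper names are illustrative, and the labels are synthetic.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one class; 0.0 when a ratio is undefined."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [0] * 95 + [1] * 5                      # 95/5 class balance
majority = Counter(y_true).most_common(1)[0][0]
baseline_pred = [majority] * len(y_true)         # always predict the majority class

baseline_acc = accuracy(y_true, baseline_pred)
baseline_prf = precision_recall_f1(y_true, baseline_pred)
print(baseline_acc)   # 0.95
print(baseline_prf)   # (0.0, 0.0, 0.0)
```

Accuracy rewards the baseline for ignoring the minority class entirely; F1 on the minority class exposes it.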

Model comparison · test accuracy

Feature importance (logistic |w|)

⚠️ Watch out: Accuracy on imbalanced · Single metric · No baseline · Cherry-pick seed · No confidence intervals

🔧 In practice: classification_report · confusion_matrix · roc_auc_score · cross_val_score

Step 06 of 06

Deploy — and keep watching

Shipping is when ML starts, not when it ends. Models drift; data shifts; performance decays.

After launch, the model serves predictions while you monitor inputs and outputs, detect drift, and retrain when performance slips.
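One way to sketch "detect drift" for a single numeric input is the population stability index (PSI), a binned comparison between a reference sample and live data. PSI is a swapped-in illustration, not something this demo computes; the function name, bin count, and the 1e-6 floor are assumptions. A commonly quoted rule of thumb treats PSI above roughly 0.25 as a shift worth investigating.

```python
import math

def population_stability_index(reference, live, n_bins=10):
    """PSI between two samples of one numeric feature, binned on the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0    # fall back to 1.0 for a constant feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            index = int((v - lo) / width)
            counts[min(max(index, 0), n_bins - 1)] += 1   # clamp out-of-range values
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = bin_fractions(reference), bin_fractions(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

reference = [i / 100 for i in range(100)]          # roughly uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]      # squeezed into [0.5, 1)

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
print(psi_same, psi_shifted)
```

Identical samples score exactly 0.0, while the squeezed sample scores far above the rule-of-thumb threshold. A production monitor would run a check like this per feature on a schedule and alert or trigger retraining.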

Make a prediction

Model summary metrics

⚠️ Watch out: No monitoring · No shadow period · No rollback plan · Model as static artifact

🔧 In practice: MLflow · BentoML · SageMaker · Evidently · Arize

Beyond this demo

Toy → real data

Real data is messy — missing values, label noise, leakage, multi-modal. Try Kaggle or fetch_openml.

Hand-rolled → sklearn

Production uses sklearn, xgboost, lightgbm, pytorch — built for edge cases and scale.

One split → k-fold CV

Real evaluation uses k-fold cross-validation (k = 5 or 10 is typical), a separate validation set for tuning, and multiple seeds to measure variance.
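A bare-bones sketch of the fold bookkeeping, repeated over a few seeds, is shown below. The generator name `k_fold_indices` is hypothetical, and scikit-learn's `KFold` or `StratifiedKFold` would normally do this job.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]      # k near-equal chunks
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

for seed in (0, 1, 2):   # repeating over seeds shows how much scores vary
    sizes = [(len(tr), len(va)) for tr, va in k_fold_indices(23, k=5, seed=seed)]
    print(seed, sizes)
```

In a full evaluation you would train and score a model inside the loop, then report the mean and spread across folds and seeds rather than a single number.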

Form → production

Registries, CI/CD, A/B tests, shadow modes, drift detectors, retraining triggers, rollback playbooks.

ML Pipeline Glossary — 20 terms

Feature: An input column. Numeric, categorical, or derived (engineered) from raw signals.
Target / label: The output column you’re trying to predict. Sometimes called y.
Sample / example: One row of the dataset — a feature vector and (in supervised learning) its label.
Parameter: A value the model learns during training (e.g., a weight in linear regression).
Hyperparameter: A value you set before training (e.g., learning rate, k in KNN, tree depth).
Loss function: A scalar that measures how wrong the model is. Training minimizes it.
Gradient descent: Standard optimizer: take steps opposite to the gradient of the loss to lower it.
Overfitting: The model memorized the training set, so test accuracy is much worse than train.
Underfitting: The model is too simple to capture the pattern — high error everywhere.
Bias-variance tradeoff: Simple models have high bias / low variance; complex models, the opposite.
Regularization: A constraint that discourages overly complex fits. L1 and L2 add a penalty on parameter size to the loss; dropout instead randomly disables units during training.
Cross-validation (k-fold): Split the data into k chunks; train on k-1 and test on the remaining one; rotate so each chunk is held out once; average the scores.
Stratified sampling: Split that preserves class ratios — essential for imbalanced data.
Class imbalance: One class dominates (e.g., 95/5). Accuracy becomes misleading, because always predicting the majority class already scores high.
Data leakage: Test-set information bleeds into training. Inflates reported accuracy, so production performance falls short of it.
Inference: Running the trained model on new data to produce predictions.
Pipeline (sklearn): An object that chains preprocessing steps and a final estimator, so each transform is fit on training data only and reapplied unchanged at prediction time.
Feature engineering: Hand-crafting new features (ratios, log transforms, interactions) from existing ones.
Drift: Input distribution (data drift) or input–output relationship (concept drift) shifts after deployment.
Baseline: A trivially simple model (majority class, mean, linear). Beat it before claiming you have an ML model.

Frequently asked

Why split the data into training and test sets?

To estimate how the model will perform on unseen data. Evaluating on the examples you trained on tells you little, because the model can simply memorize them. A held-out test set gives an unbiased estimate of generalization, provided it is never used for fitting or tuning. 80/20 is a common default; smaller datasets often use cross-validation for more robust estimates.
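Memorization is easy to demonstrate with a 1-nearest-neighbor classifier, which stores the training set verbatim. In this sketch the noisy two-class data, the seed, and the helper names are all assumptions.

```python
import random

def predict_1nn(train_X, train_y, row):
    """Return the label of the closest training row (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, other)) for other in train_X]
    return train_y[distances.index(min(distances))]

def accuracy(train_X, train_y, X, y):
    hits = sum(predict_1nn(train_X, train_y, row) == label for row, label in zip(X, y))
    return hits / len(y)

rng = random.Random(0)
# Hypothetical noisy data: two heavily overlapping classes.
X = [[rng.gauss(label, 1.5), rng.gauss(label, 1.5)] for label in (0, 1) for _ in range(100)]
y = [label for label in (0, 1) for _ in range(100)]

order = list(range(200))
rng.shuffle(order)
train, test = order[:160], order[160:]
train_X, train_y = [X[i] for i in train], [y[i] for i in train]
test_X, test_y = [X[i] for i in test], [y[i] for i in test]

train_acc = accuracy(train_X, train_y, train_X, train_y)
test_acc = accuracy(train_X, train_y, test_X, test_y)
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

Every training row is its own nearest neighbor, so training accuracy is perfect by construction. Only the held-out rows reveal how well the model separates the two overlapping classes.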

Why scale features?

Many algorithms (logistic regression, KNN, neural networks) are sensitive to feature scale. If one feature ranges 0–10000 and another 0–1, the larger one dominates the loss gradient or the distance metric. Min-max scaling to [0, 1] (or z-score scaling to mean 0, standard deviation 1) puts every feature on a comparable footing.
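A small worked example shows the effect on a distance metric. The three rows and the feature ranges below are made up for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_row(row, ranges):
    """Scale each value to [0, 1] using that feature's (min, max) range."""
    return [(value - lo) / (hi - lo) for value, (lo, hi) in zip(row, ranges)]

# Hypothetical rows: [feature on a 0-10000 scale, feature on a 0-1 scale]
a, b, c = [5000.0, 0.1], [5100.0, 0.9], [6000.0, 0.1]
ranges = [(0.0, 10000.0), (0.0, 1.0)]

raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)
sa, sb, sc = (min_max_row(row, ranges) for row in (a, b, c))
scaled_ab, scaled_ac = euclidean(sa, sb), euclidean(sa, sc)

print("raw:   ", raw_ab, raw_ac)        # b looks closer to a
print("scaled:", scaled_ab, scaled_ac)  # c is closer once scales match
```

On raw values the second feature barely registers, so `b` looks like the nearest neighbor of `a` despite differing across almost the whole 0–1 range. After scaling, the ranking flips.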

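What is feature importance?

A score showing how much each feature contributes to the model’s predictions. For logistic regression we use the absolute value of the learned weight: features with bigger weights move the prediction more, a comparison that is only fair when the features share a scale (as they do after min-max scaling). Tree models compute it from the total impurity reduction (information gain) of the splits that use each feature. Use it to spot which inputs actually matter and which you could drop.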
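For the logistic-regression case, the ranking takes only a few lines. The feature names and weights below are hypothetical, not values from this demo, and the comparison assumes all inputs were scaled to the same range.

```python
def rank_by_abs_weight(feature_names, weights):
    """Sort features by |weight|, largest first; assumes features share one scale."""
    pairs = zip(feature_names, weights)
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)

# Hypothetical weights from a logistic regression trained on min-max scaled inputs.
names = ["petal_length", "petal_width", "sepal_length", "sepal_width"]
weights = [2.4, -3.1, 0.3, -0.9]

ranking = rank_by_abs_weight(names, weights)
for name, weight in ranking:
    print(f"{name:>13}: |w| = {abs(weight):.1f}")
```

The sign is dropped on purpose: a large negative weight moves the prediction as strongly as a large positive one, just in the other direction.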
