Machine Learning
Masterclass
Zero to Advanced — 77 Topics · Deep Theory + Practical Python · FAANG-level Interview Prep
Phase 1 — Basics
ML Course Introduction
This course teaches Machine Learning from first principles to production-level systems. It follows the exact order used in real-world ML pipelines: data → preprocessing → modeling → evaluation → deployment thinking.
| Pillar | What it means | Example |
|---|---|---|
| Data | The raw input — the lifeblood of ML | CSV of house prices |
| Algorithm | The "brain" that finds patterns | Linear Regression |
| Prediction | Output after the model learns | Predicted price: $450k |
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

# Assumed already known: numpy, pandas, matplotlib, seaborn
# This course focuses on: sklearn + ML theory + real projects
print("scikit-learn version:", sklearn.__version__)

# Quick dataset sanity check
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
print(df.head())
print("Shape:", df.shape)
```
- ML is about learning patterns from data, not writing explicit rules
- This course follows the real-world pipeline order
- You already know numpy/pandas/matplotlib — we skip those basics
- scikit-learn is the primary ML library throughout
What is Machine Learning?
Machine Learning is a subset of AI where algorithms learn from data to make predictions or decisions — without being explicitly programmed for each scenario. Formally: a computer program "learns" from experience E with respect to task T if performance measure P improves with E (Tom Mitchell, 1997).
A rule-based system for predicting house prices needs explicit rules: IF 3 bedrooms AND near school AND 1500 sqft THEN price = $400k. This is brittle — it can't generalize.
An ML model looks at 100,000 past sales, finds the pattern between features and prices automatically, and
predicts prices for houses it's never seen. That's learning.
| Type | Definition | Key Word | Example |
|---|---|---|---|
| Supervised | Learns from labeled data (input→output pairs) | "Teacher present" | Spam detection, house prices |
| Unsupervised | Finds hidden patterns in unlabeled data | "No teacher" | Customer segmentation, anomaly detection |
| Semi-supervised | Mix of labeled + unlabeled data | "Partial teacher" | Image labeling at scale |
| Reinforcement | Agent learns by reward/punishment signals | "Trial and error" | Game playing, robotics |
| Sub-type | Output Type | Examples |
|---|---|---|
| Regression | Continuous number | House price, temperature, salary |
| Classification | Category/label | Spam/not spam, disease/no disease |
```python
import numpy as np
from sklearn.linear_model import LinearRegression  # Supervised
from sklearn.cluster import KMeans                 # Unsupervised

# ── SUPERVISED: We have labels ──────────────────────────
# X = house size (sqft), y = price ($1000s)
X_sup = np.array([[1000], [1500], [2000], [2500], [3000]])
y = np.array([200, 280, 350, 430, 500])  # ← LABELS

model_sup = LinearRegression()
model_sup.fit(X_sup, y)
pred = model_sup.predict([[1800]])
print(f"Supervised — Predicted price for 1800 sqft: ${pred[0]:.0f}k")

# ── UNSUPERVISED: No labels, find natural groups ─────────
# Customer data: [age, spending_score] — no target!
X_uns = np.array([[22, 80], [25, 75], [45, 20],
                  [48, 15], [35, 50], [40, 45]])
model_uns = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = model_uns.fit_predict(X_uns)  # ← Discovers groups
print("Unsupervised — Cluster assignments:", clusters)
# Output: [0 0 1 1 0 0] or similar — groups found automatically!
```
- Treating regression as classification (or vice versa) — always ask: "Is my output a number or a category?"
- Assuming more data always = better model — quality > quantity
- Ignoring the problem type — not all ML problems need the same approach
- ML = algorithms that learn patterns from data without explicit programming
- 3 main types: Supervised (labeled), Unsupervised (unlabeled), Reinforcement (rewards)
- Supervised splits into: Regression (continuous output) and Classification (categorical output)
- The key question always is: "What is my output type? Do I have labels?"
For each scenario below, identify: (1) Type of ML and (2) Supervised sub-type if applicable:
- Predicting tomorrow's stock price
- Grouping news articles by topic automatically
- Detecting whether a tumor is malignant
- Teaching a robot to walk
Answers: Supervised (Regression) | Unsupervised | Supervised (Classification) | Reinforcement
Complete ML Roadmap
The ML Roadmap is the end-to-end workflow that every ML project follows — from raw data to a deployed model. Understanding this pipeline is fundamental; every topic in this course maps to a step in this process.
```python
"""
THE COMPLETE ML PIPELINE
─────────────────────────────────────────────────────
STEP 1: Problem Definition
    → What are we predicting? What's the metric of success?
STEP 2: Data Collection
    → SQL, APIs, web scraping, Kaggle, company databases
STEP 3: Exploratory Data Analysis (EDA)
    → Understand distributions, correlations, missing data
STEP 4: Data Preprocessing (MOST TIME IS SPENT HERE)
    → Clean missing values, encode categories, scale features,
      remove outliers, handle duplicates
STEP 5: Feature Engineering & Selection
    → Create new features, select the most informative ones
    → Techniques: Backward Elimination, Forward Selection
STEP 6: Model Training
    → Split data (train/test), select algorithm, fit model
    → Regression: Linear, Ridge, Lasso, Polynomial
    → Classification: Logistic, SVM, Trees, KNN, Naive Bayes
    → Clustering: K-Means, DBSCAN, Hierarchical
STEP 7: Evaluation & Tuning
    → Regression: R², MSE, RMSE
    → Classification: Accuracy, F1, ROC-AUC, Confusion Matrix
    → Tuning: GridSearchCV, Cross-Validation
[DEPLOY] → Flask/FastAPI, Docker, Cloud
─────────────────────────────────────────────────────
"""

# Time allocation in real-world ML projects:
time_allocation = {
    "Problem Definition": "5%",
    "Data Collection": "10%",
    "EDA": "15%",
    "Data Preprocessing": "40%",  # ← MOST work happens here
    "Modeling": "20%",
    "Evaluation & Tuning": "10%",
}
for step, pct in time_allocation.items():
    print(f"  {step:30s} → {pct}")
```
| Problem Type | Algorithm Choices | When to Use |
|---|---|---|
| Regression | Linear, Ridge, Lasso, Polynomial, SVR, Decision Tree | Continuous numeric output |
| Classification | Logistic, SVM, KNN, Naive Bayes, Decision Tree, Random Forest | Categorical output |
| Clustering | K-Means, DBSCAN, Hierarchical | No labels, find groups |
| Association | Apriori, FP-Growth | Market basket analysis |
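Apriori itself isn't bundled with scikit-learn (libraries such as mlxtend provide it). As a minimal, illustrative sketch of the idea behind association mining, here is a pure-Python count of frequently co-occurring item pairs over made-up baskets:

```python
from itertools import combinations
from collections import Counter

# Toy market-basket data (illustrative)
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]

# Count how often each item pair co-occurs: this is the "support"
# counting step that Apriori performs level by level
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 3  # keep pairs appearing in at least 3 of 5 baskets
frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent)
```

Real Apriori prunes the candidate space level by level using the support threshold; this sketch only shows the pair-counting step.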
- Every ML project follows the same 7-step pipeline
- Data preprocessing takes ~40% of total project time
- Choosing the wrong algorithm matters less than poor data quality
- Always define your evaluation metric BEFORE building models
Phase 2 — Data Preprocessing
Types of Variables
Variables (features/columns) in a dataset have different measurement scales. Knowing this determines which preprocessing steps and algorithms you can use. This is one of the most foundational concepts in statistics and ML.
| Type | Sub-type | Properties | Examples | Allowed Operations |
|---|---|---|---|---|
| Categorical (Qualitative) | Nominal | No order, just names | Color: Red/Blue/Green, Country | Mode, frequency count |
| Categorical (Qualitative) | Ordinal | Has order, no equal gaps | Rating: Low/Med/High, Education level | Mode, median, comparison |
| Numerical (Quantitative) | Discrete | Countable integers | Number of children, room count | All arithmetic |
| Numerical (Quantitative) | Continuous | Any real value in a range | Height, temperature, price | All arithmetic + calculus |
```python
import pandas as pd
import numpy as np

# Create a sample dataset with mixed variable types
df = pd.DataFrame({
    'age': [25, 32, 28, 45, 30],                                # Numerical - Discrete
    'salary': [50000.5, 72000.0, 63000.25, 95000.0, 68000.75],  # Numerical - Continuous
    'city': ['NY', 'LA', 'NY', 'Chicago', 'LA'],                # Categorical - Nominal
    'education': ['BSc', 'MSc', 'BSc', 'PhD', 'MSc'],           # Categorical - Ordinal
    'satisfied': [1, 0, 1, 1, 0]                                # Binary (special case)
})

# ── Step 1: Quick dtype overview ──────────────────────
print("--- dtypes ---")
print(df.dtypes)

# ── Step 2: Identify categorical vs numerical ─────────
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"\nCategorical columns: {categorical_cols}")
print(f"Numerical columns: {numerical_cols}")

# ── Step 3: Cardinality check (unique value count) ────
# High cardinality nominal → might need target encoding
# Low cardinality nominal  → safe for one-hot encoding
print("\n--- Cardinality (unique values per column) ---")
for col in df.columns:
    print(f"  {col:15s}: {df[col].nunique()} unique values")

# ── Step 4: Check continuous vs discrete numerics ─────
print("\n--- Numerical Analysis ---")
print(df[numerical_cols].describe())
```
- Treating ordinal as nominal: Encoding "Low/Med/High" as [0,1,2] loses order info if you one-hot encode it — use OrdinalEncoder instead
- Treating numeric-coded categoricals as numeric: ZIP codes, user IDs are stored as integers but are nominal — never use them as continuous features
- Ignoring binary variables: Binary (0/1) doesn't need any encoding but often needs balancing
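On the last point: a quick balance check for a binary column is value_counts(normalize=True). A small sketch with made-up data:

```python
import pandas as pd

# Illustrative binary target with a 90/10 imbalance
y = pd.Series([0] * 90 + [1] * 10, name="churned")

# Fraction of each class
ratios = y.value_counts(normalize=True)
print(ratios)  # 0 → 0.9, 1 → 0.1

# A simple rule of thumb: flag anything below a chosen minority threshold
if ratios.min() < 0.2:
    print("Imbalanced: consider class weights or resampling")
```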
Load the Titanic dataset (e.g., via seaborn.load_dataset('titanic')) and classify every column as: Nominal, Ordinal, Discrete, or Continuous. Then identify which need encoding and which need scaling.
Data Cleaning
Data cleaning is the process of detecting and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset. "Garbage in, garbage out" — no algorithm can save you from dirty data.
| Issue | Example | Fix |
|---|---|---|
| Missing values | NaN in age column | Impute or drop (Topics 6–9) |
| Duplicates | Same row appears twice | drop_duplicates() (Topic 18) |
| Outliers | Age = 999 | IQR/Z-Score (Topics 14–15) |
| Wrong data types | Price stored as "object" | astype() (Topic 19) |
| Inconsistent categories | "Male", "male", "M" all mean same | str.lower() + map() |
| Noise | Typos, extra spaces | str.strip(), str.replace() |
```python
import pandas as pd
import numpy as np

# Create dirty dataset for demo
data = {
    'name': ['Alice', 'Bob', 'Alice', 'Charlie', ' Dave ', None],
    'age': [25, 30, 25, 999, 28, 35],                             # 999 = outlier
    'gender': ['F', 'M', 'F', 'Male', 'm', 'F'],                  # inconsistent
    'salary': ['50000', '72000', '50000', '95000', None, '68000'],  # wrong dtype
}
df = pd.DataFrame(data)
print("=== DIRTY DATA ===")
print(df)
print()

# ── AUDIT FUNCTION: Run this at start of every project ──
def audit_dataframe(df):
    print("📊 SHAPE:", df.shape)
    print("\n📌 DTYPES:\n", df.dtypes)
    print("\n❌ MISSING VALUES:")
    missing = df.isnull().sum()
    print(missing[missing > 0])
    print(f"\n🔁 DUPLICATES: {df.duplicated().sum()}")
    print("\n📈 NUMERICAL STATS:\n", df.describe())

audit_dataframe(df)

# ── FIX 1: Strip whitespace from string columns ───────
df['name'] = df['name'].str.strip()

# ── FIX 2: Standardize categories ─────────────────────
# Normalize gender: 'Male'/'m'/'M' → 'male', 'F'/'female' → 'female'
df['gender'] = df['gender'].str.lower().map({
    'm': 'male', 'male': 'male',
    'f': 'female', 'female': 'female'
})

# ── FIX 3: Fix data types ─────────────────────────────
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')
# errors='coerce' → invalid values become NaN instead of crashing

# ── FIX 4: Replace impossible values with NaN ─────────
df.loc[df['age'] > 120, 'age'] = np.nan  # Age 999 → NaN

print("\n=== CLEANED DATA ===")
print(df)
```
- Always run an audit (shape, dtypes, nulls, duplicates) at project start
- Inconsistent categories are sneaky bugs — always standardize with str.lower()
- Use pd.to_numeric(errors='coerce') to safely convert mixed-type columns
- Replace domain-impossible values (age=999) with NaN before imputation
Missing Values — Concept & Detection
Missing values are data points that were not recorded or stored. They appear as NaN (Not a
Number), None, null, or empty strings in pandas. Most ML algorithms cannot
handle missing values — you must deal with them before training.
| Type | Abbreviation | Meaning | Example | Recommended Fix |
|---|---|---|---|---|
| Missing Completely at Random | MCAR | Missing for no reason — purely random | Survey respondent accidentally skipped a question | Safe to drop or impute |
| Missing at Random | MAR | Missing depends on OTHER observed variables | Males less likely to report salary than females | Impute using other features |
| Missing Not at Random | MNAR | Missing depends on the value itself | High earners not reporting salary | Model the missingness; risky to impute |
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# ── Create sample dataset with missing values ──────────
np.random.seed(42)
n = 100
df = pd.DataFrame({
    'age': np.random.choice([25, 30, 35, np.nan], n, p=[0.3, 0.3, 0.3, 0.1]),
    'salary': np.random.choice([50000, 60000, np.nan], n, p=[0.4, 0.4, 0.2]),
    'city': np.random.choice(['NY', 'LA', None], n, p=[0.4, 0.4, 0.2]),
    'score': np.random.randint(50, 100, n).astype(float),  # no missing
})

# ── Method 1: Count missing values ────────────────────
print("--- Missing Value Counts ---")
print(df.isnull().sum())

# ── Method 2: Percentage missing ──────────────────────
print("\n--- Missing Value % ---")
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
print(missing_pct)

# ── Method 3: Heatmap visualization ───────────────────
plt.figure(figsize=(8, 4))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title("Missing Value Heatmap\n(Yellow = Missing, Purple = Present)")
plt.tight_layout()
plt.show()

# ── Method 4: Threshold-based decision ────────────────
# Rule of thumb:
#   <5%   → usually safe to drop rows or impute simply
#   5-30% → impute carefully (mean/median/mode)
#   >30%  → consider dropping the COLUMN or using model-based imputation
#   >50%  → column is likely unusable
threshold = 30
cols_to_drop = missing_pct[missing_pct > threshold].index.tolist()
print(f"\nColumns with >{threshold}% missing (consider dropping): {cols_to_drop}")
```
- Imputing before train/test split: This causes data leakage — fit the imputer on train data only, then transform both
- Treating all missingness equally: MCAR → drop safely; MNAR → dropping biases your model
- Ignoring that "" (empty string) is not NaN: Always check with df[col].eq("").sum() too
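The empty-string pitfall above in a concrete sketch (toy data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"city": ["NY", "", "LA", "", None]})

# isnull() sees only None/NaN; the empty strings are invisible to it
print(df["city"].isnull().sum())  # 1
print(df["city"].eq("").sum())    # 2

# Convert empty strings to NaN so all missingness is counted together
df["city"] = df["city"].replace("", np.nan)
print(df["city"].isnull().sum())  # 3
```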
- Missing values appear as NaN/None — most ML algorithms reject them
- 3 types: MCAR (safe), MAR (impute with other features), MNAR (risky)
- Always measure % missing before deciding to drop or impute
- Columns with >50% missing are usually not worth keeping
Handling Missing Values — Dropping
Dropping removes rows or columns that contain missing values. It's the simplest approach but must be used carefully — dropping too aggressively shrinks your dataset and can introduce bias.
Drop ROWS when: Missing % is small (<5%), data is MCAR, and losing rows doesn't bias results significantly.
Drop COLUMNS when: Missing % is very high (>50%), column is non-critical, or column is duplicate information.
NEVER drop when: The missingness is MNAR (high earners hiding salary) — you'd systematically bias your model.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, np.nan],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, np.nan, np.nan, np.nan, np.nan],  # 80% missing
})
print("Original:\n", df)

# ── Option 1: Drop any row with ANY missing value ─────
df1 = df.dropna()  # how='any' is the default
print("\ndropna() — rows with any NaN removed:\n", df1)

# ── Option 2: Drop rows where ALL values are missing ──
df2 = df.dropna(how='all')
print("\ndropna(how='all') — only fully-empty rows removed:\n", df2)

# ── Option 3: Drop rows where specific columns have NaN
df3 = df.dropna(subset=['A', 'B'])  # Only care about A and B
print("\ndropna(subset=['A','B']):\n", df3)

# ── Option 4: Drop columns with too many missing values
threshold = 0.5  # Keep columns with at least 50% non-missing values
df4 = df.dropna(axis=1, thresh=int(threshold * len(df)))
print("\nDropped columns with >50% missing:\n", df4)
# C is dropped (80% missing), A and B are kept

# ── Option 5: Threshold-based row dropping ────────────
# Keep rows that have at least 2 non-NaN values
df5 = df.dropna(thresh=2)
print("\ndropna(thresh=2) — rows with at least 2 valid values:\n", df5)

# ── Best Practice: Track what you dropped ─────────────
original_size = len(df)
cleaned = df.dropna()
dropped_count = original_size - len(cleaned)
print(f"\nDropped {dropped_count}/{original_size} rows ({dropped_count/original_size*100:.1f}%)")
```
- Using dropna() on the full dataset before train/test split — always split first
- Dropping rows when >20% of data is lost — consider imputation instead
- Forgetting to reset the index after dropping — use .reset_index(drop=True)
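A minimal sketch of the safe ordering from the first point: split first, then drop within each split, then reset the index. The toy data is made up:

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "feature": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0],
    "target":  [0, 1, 0, 1, 0, 1, 0, 1],
})

# 1) Split FIRST, so no information flows between train and test
train, test = train_test_split(df, test_size=0.25, random_state=42)

# 2) Drop rows with missing values inside each split independently,
#    and reset the index so later positional operations don't misalign
train = train.dropna().reset_index(drop=True)
test = test.dropna().reset_index(drop=True)

print(len(train), len(test))
```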
Handling Missing Values — Categorical Imputation
Imputation means filling in missing values with estimated substitutes rather than dropping them. For categorical variables, common strategies are filling with the mode (most frequent value) or a special "Unknown" / "Missing" category.
| Strategy | For what type? | When to use |
|---|---|---|
| Mode (most frequent) | Categorical | Data is roughly uniform, small % missing |
| "Unknown" / "Missing" label | Categorical | When absence itself is informative (MNAR) |
| Mean imputation | Numerical | Data is roughly symmetric, no major outliers |
| Median imputation | Numerical | Data is skewed or has outliers (recommended over mean) |
| KNN imputation | Both | When other features can predict the missing value |
| Forward/Backward fill | Time series | Sequential/temporal data |
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'city': ['NY', 'LA', None, 'NY', None, 'LA', 'NY'],
    'product': ['A', None, 'B', 'A', 'B', None, 'A'],
    'salary': [50000, np.nan, 60000, np.nan, 55000, 70000, np.nan],
})
print("Before:\n", df)

# ── Strategy 1: Fill categorical with Mode ────────────
city_mode = df['city'].mode()[0]  # .mode() returns a Series, take [0]
df_mode = df.copy()
df_mode['city'] = df_mode['city'].fillna(city_mode)
print(f"\nMode imputation for city (mode='{city_mode}'):\n", df_mode['city'])

# ── Strategy 2: Fill with "Unknown" (when missing is informative) ──
df_unk = df.copy()
df_unk['city'] = df_unk['city'].fillna('Unknown')
df_unk['product'] = df_unk['product'].fillna('Unknown')
print("\nUnknown imputation:\n", df_unk[['city', 'product']])

# ── Numerical: Mean vs Median ─────────────────────────
print("\n--- Numerical Imputation ---")
df_num = df.copy()
# Mean — affected by outliers
df_num['salary_mean_imp'] = df_num['salary'].fillna(df_num['salary'].mean())
# Median — robust to outliers (PREFERRED for skewed data)
df_num['salary_median_imp'] = df_num['salary'].fillna(df_num['salary'].median())
print(df_num[['salary', 'salary_mean_imp', 'salary_median_imp']])

# ── Group-wise imputation (ADVANCED + POWERFUL) ───────
# Fill salary with the median salary for that CITY group
df['salary_group_imp'] = df.groupby('city')['salary'].transform(
    lambda x: x.fillna(x.median())
)
print("\nGroup-wise imputation:\n", df[['city', 'salary', 'salary_group_imp']])
```
- Using mean for skewed numerical data — always prefer median for salary, prices, counts
- Imputing with global statistics when group-wise would be more accurate
- Forgetting to fit imputation on train set only — never use test set statistics
Handling Missing Values — Scikit-Learn Imputers
Scikit-learn provides production-ready imputer classes that follow the fit()/transform() API —
ensuring imputation parameters are learned from training data only, preventing data leakage.
```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # must import first
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = pd.DataFrame({
    'age': [25, np.nan, 35, 28, np.nan, 45, 22, 30],
    'salary': [np.nan, 60000, 75000, np.nan, 55000, 90000, 48000, 65000],
    'score': [80, 90, np.nan, 70, 85, np.nan, 95, 88],
})

# ── CRITICAL: Split BEFORE imputation ─────────────────
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

# ── SimpleImputer ─────────────────────────────────────
# strategy options: 'mean', 'median', 'most_frequent', 'constant'
si = SimpleImputer(strategy='median')
si.fit(X_train)                     # Learn medians from TRAIN only
X_train_si = si.transform(X_train)  # Fill train
X_test_si = si.transform(X_test)    # Fill test using TRAIN medians (no leakage)
print("SimpleImputer (median) — train statistics:", si.statistics_)

# ── KNNImputer ────────────────────────────────────────
# Uses K nearest neighbors to estimate missing values
# More accurate than SimpleImputer but slower
knn_imp = KNNImputer(n_neighbors=2)
knn_imp.fit(X_train)
X_train_knn = knn_imp.transform(X_train)
X_test_knn = knn_imp.transform(X_test)
print("\nKNNImputer result (first row):", X_train_knn[0])

# ── IterativeImputer (MICE) ───────────────────────────
# Regresses each feature with missing values on all other features
# Most powerful but computationally expensive
iter_imp = IterativeImputer(max_iter=10, random_state=42)
iter_imp.fit(X_train)
X_train_iter = iter_imp.transform(X_train)
print("\nIterativeImputer result (first row):", X_train_iter[0])

# ── When to use which? ────────────────────────────────
# SimpleImputer    → fast, simple, good for numerical + categorical
# KNNImputer       → better for correlated features, medium datasets
# IterativeImputer → best quality, expensive; use on small/medium data
```
- Calling fit_transform() on both train and test — only fit on train
- Using KNNImputer on large datasets without considering its O(n²) memory and compute cost
- Always use sklearn imputers in production — they prevent data leakage
- SimpleImputer for quick work; KNNImputer for correlated data; IterativeImputer for best quality
- The golden rule: fit on train data, transform both train and test
One Hot Encoding & Dummy Variables
One Hot Encoding (OHE) converts a nominal categorical variable into multiple binary (0/1) columns — one per unique category. This allows ML algorithms (which expect numbers) to use categorical data without imposing any false ordering.
If we encode Color as: Red=1, Blue=2, Green=3 — the model thinks Green > Blue > Red. That's wrong! Colors have no numeric relationship.
OHE creates: [is_Red, is_Blue, is_Green] = [1,0,0], [0,1,0], [0,0,1] — now there's no false ordering.
If you have 3 colors [Red, Blue, Green], you only need 2 dummy variables. The 3rd is perfectly predictable from the other two (if not Red and not Blue → Green). Including all 3 causes multicollinearity. Solution: drop_first=True
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'size': ['S', 'M', 'L', 'M', 'S'],
    'price': [10, 20, 30, 15, 25]
})

# ── Method 1: pd.get_dummies (fast for EDA) ───────────
# drop_first=True avoids the dummy variable trap
df_dummies = pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int)
print("pd.get_dummies (drop_first=True):")
print(df_dummies)
# Creates: color_Green, color_Red (Blue is the dropped reference)

# ── Method 2: sklearn OneHotEncoder (production use) ──
# MUST use this in pipelines to avoid data leakage
X = df[['color', 'size']]
X_train, X_test = train_test_split(X, test_size=0.4, random_state=42)

ohe = OneHotEncoder(
    drop='first',            # Avoid dummy trap
    sparse_output=False,     # Return dense array
    handle_unknown='ignore'  # Don't crash on unseen categories in test
)
# Fit on TRAIN only — learn the categories from training data
ohe.fit(X_train)
X_train_enc = ohe.transform(X_train)
X_test_enc = ohe.transform(X_test)

print("\nsklearn OHE categories learned:", ohe.categories_)
print("Feature names:", ohe.get_feature_names_out())
print("Encoded train:\n", X_train_enc)
```
- Not using drop_first=True — causes multicollinearity in linear models
- Using pd.get_dummies in production — it doesn't handle unseen categories in test data
- OHE on high-cardinality columns (e.g., ZIP codes with 10,000 values) — use Target Encoding instead
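Target Encoding, mentioned in the last point, replaces each category with a target statistic (typically the mean) learned from training data only. A minimal sketch; the column and values are made up:

```python
import pandas as pd

train = pd.DataFrame({
    "zip":   ["10001", "10001", "90210", "90210", "60601"],
    "price": [300, 340, 900, 860, 500],
})
test = pd.DataFrame({"zip": ["10001", "60601", "99999"]})  # 99999 is unseen

# Mean target per category, learned from TRAIN only
means = train.groupby("zip")["price"].mean()
global_mean = train["price"].mean()

train["zip_enc"] = train["zip"].map(means)
# Unseen categories in test fall back to the global mean
test["zip_enc"] = test["zip"].map(means).fillna(global_mean)

print(train)
print(test)
```

In practice, plain mean encoding can overfit rare categories; libraries such as category_encoders add smoothing and cross-fold schemes to mitigate this.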
Label Encoding
Label Encoding assigns an integer to each unique category: "cat" → 0, "dog" → 1, "rabbit" → 2. It's compact but introduces an artificial ordering. Only use it for the target variable or for tree-based models that don't assume ordering.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'animal': ['cat', 'dog', 'rabbit', 'cat', 'dog'],
    'target': ['spam', 'ham', 'spam', 'ham', 'spam']
})

le = LabelEncoder()

# For TARGET variable (y) — safe use case
df['target_encoded'] = le.fit_transform(df['target'])
print("Target encoding:")
print(df[['target', 'target_encoded']])
print("Classes:", le.classes_)  # Shows mapping: ham=0, spam=1

# For FEATURES — only safe with tree-based models
df['animal_encoded'] = le.fit_transform(df['animal'])
print("\nAnimal encoding (risky for linear models!):")
print(df[['animal', 'animal_encoded']])
# cat=0, dog=1, rabbit=2
# Linear model would think rabbit(2) > dog(1) > cat(0) — WRONG!

# Decode back from integers
print("\nDecode: [0,2,1] →", le.inverse_transform([0, 2, 1]))
```
- Using LabelEncoder on nominal features with linear models — the model thinks 0 < 1 < 2, but "cat < dog < rabbit" has no meaning. Use OHE for nominal features in linear models.
- LabelEncoder is fine for: target variables, and nominal features in tree models (Random Forest, XGBoost handle it fine)
- Label encoding = integer per category. Simple but imposes ordering.
- Safe for: target variables, tree-based model features
- Unsafe for: nominal features in linear/distance-based models
Ordinal Encoding
Ordinal Encoding assigns integers to categories preserving their natural order. Unlike LabelEncoder (which assigns arbitrary order), OrdinalEncoder lets you specify the order: Low=0, Medium=1, High=2.
```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'education': ['BSc', 'PhD', 'HSC', 'MSc', 'BSc'],
    'size': ['M', 'XL', 'S', 'L', 'XS'],
})

# ── WRONG: LabelEncoder assigns arbitrary order ───────
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
print("LabelEncoder (arbitrary):", le.fit_transform(df['education']))
# Could give: BSc=0, HSC=1, MSc=2, PhD=3 — alphabetical, not education level!

# ── RIGHT: OrdinalEncoder with explicit ordering ──────
# You tell it the correct order for each column
enc = OrdinalEncoder(categories=[
    ['HSC', 'BSc', 'MSc', 'PhD'],  # education order
    ['XS', 'S', 'M', 'L', 'XL']    # size order
])
encoded = enc.fit_transform(df)

df_enc = df.copy()
df_enc['education_enc'] = encoded[:, 0]  # HSC=0, BSc=1, MSc=2, PhD=3
df_enc['size_enc'] = encoded[:, 1]       # XS=0, S=1, M=2, L=3, XL=4
print("\nOrdinalEncoder (correct order):")
print(df_enc)
```
- Use OrdinalEncoder when categories have a meaningful order (Low/Med/High, education levels)
- Always specify the category order explicitly — don't let sklearn guess
- Use OHE for nominal (unordered) categories, OrdinalEncoder for ordered ones
Outliers — Concept & Handling
Outliers are data points that differ significantly from other observations. They can be genuine (a CEO's salary in an employee dataset) or errors (age = 999). Outliers distort statistical measures and can severely degrade model performance.
| Type | Description | Example | Action |
|---|---|---|---|
| Point Outlier | Single value far from rest | Income of $10M in a $50k dataset | Cap, remove, or model separately |
| Contextual Outlier | Normal globally, abnormal in context | 30°C in winter (not summer) | Context-aware handling |
| Collective Outlier | A group of values that are collectively abnormal | 5 identical transactions in 1 second | Anomaly detection |
| Model | Sensitivity to Outliers |
|---|---|
| Linear Regression | Very sensitive — outliers pull the regression line |
| Logistic Regression | Moderately sensitive |
| Decision Trees | Less sensitive — splits are threshold-based |
| Random Forest | Robust — averages many trees |
| SVM | Sensitive — support vectors can shift |
| KNN | Very sensitive — distances distorted |
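The contrast between the table's most and least sensitive models can be sketched on synthetic data: inject one extreme point and see how far each model's prediction moves at a location far from the outlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = np.arange(1, 21, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + rng.normal(0, 1, 20)  # clean, roughly linear data

y_out = y.copy()
y_out[-1] = 500.0  # inject one extreme outlier at x=20

# Fit each model on clean data and on data with the outlier
lin_clean = LinearRegression().fit(X, y)
lin_dirty = LinearRegression().fit(X, y_out)
tree_clean = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_dirty = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y_out)

# How much does the prediction at x=10 (far from the outlier) move?
x_probe = np.array([[10.0]])
lin_shift = abs(lin_dirty.predict(x_probe)[0] - lin_clean.predict(x_probe)[0])
tree_shift = abs(tree_dirty.predict(x_probe)[0] - tree_clean.predict(x_probe)[0])
print(f"Linear Regression shift at x=10: {lin_shift:.1f}")
print(f"Decision Tree shift at x=10:     {tree_shift:.1f}")
# The regression line gets dragged toward the outlier; the tree
# typically isolates the outlier in its own leaf, so x=10 barely moves
```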
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)
data = np.concatenate([
    np.random.normal(50, 10, 95),  # Normal data
    [150, -30, 200, -50, 180]      # Outliers injected
])
df = pd.DataFrame({'salary': data})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Boxplot — best for outlier visualization
axes[0].boxplot(df['salary'])
axes[0].set_title('Boxplot\n(whiskers = 1.5×IQR, dots = outliers)')

# 2. Histogram — shows skew caused by outliers
axes[1].hist(df['salary'], bins=30, color='steelblue')
axes[1].set_title('Histogram')

# 3. Scatter plot — shows position of outliers
axes[2].scatter(range(len(df)), df['salary'], alpha=0.5)
axes[2].set_title('Scatter Plot')

plt.tight_layout()
plt.show()

# Describe to spot issues
print(df.describe())
# If max >> 75th percentile → likely outliers
# If min << 25th percentile → likely outliers
```
- Outliers = data points far from the rest — genuine or errors
- Linear models and KNN are most sensitive; tree-based models are more robust
- Always visualize first (boxplot, histogram) before deciding how to handle them
- Two detection methods: IQR (robust) and Z-Score (assumes normality)
IQR Method for Outlier Detection
The Interquartile Range (IQR) method is a robust, non-parametric outlier detection technique. IQR = Q3 − Q1. Points beyond 1.5×IQR from the quartiles are flagged as outliers. It doesn't assume any distribution.
Lower Bound = Q1 − 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR
Outlier if: value < Lower Bound OR value > Upper Bound
```python
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'salary': list(np.random.normal(60000, 10000, 95)) +
              [200000, -5000, 180000, -10000, 190000]
})

# ── Step 1: Calculate IQR ─────────────────────────────
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Q1={Q1:.0f}, Q3={Q3:.0f}, IQR={IQR:.0f}")
print(f"Lower bound: {lower_bound:.0f}")
print(f"Upper bound: {upper_bound:.0f}")

# ── Step 2: Identify outliers ─────────────────────────
is_outlier = (df['salary'] < lower_bound) | (df['salary'] > upper_bound)
print(f"\nOutliers found: {is_outlier.sum()}")
print(df[is_outlier])

# ── Step 3: Choose handling strategy ──────────────────
# STRATEGY A: Remove outliers
df_removed = df[~is_outlier].copy()
print(f"\nAfter removal: {len(df_removed)} rows (was {len(df)})")

# STRATEGY B: Cap/Winsorize (clip to bounds)
# This preserves sample size but limits extreme values
df_capped = df.copy()
df_capped['salary'] = df_capped['salary'].clip(lower=lower_bound, upper=upper_bound)
print(f"\nAfter capping: max={df_capped['salary'].max():.0f}, min={df_capped['salary'].min():.0f}")

# STRATEGY C: Replace with NaN → then impute
df_nan = df.copy()
df_nan.loc[is_outlier, 'salary'] = np.nan
print(f"\nAfter NaN replacement: {df_nan['salary'].isnull().sum()} missing values")

# ── Function to apply IQR to all numerical columns ────
def remove_outliers_iqr(df, multiplier=1.5):
    df_clean = df.copy()
    num_cols = df.select_dtypes(include=[np.number]).columns
    for col in num_cols:
        # Bounds from the original data; mask built on the current frame
        # so row filtering stays aligned across iterations
        Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        IQR = Q3 - Q1
        mask = (df_clean[col] >= Q1 - multiplier * IQR) & \
               (df_clean[col] <= Q3 + multiplier * IQR)
        df_clean = df_clean[mask]
    return df_clean.reset_index(drop=True)
```
Z-Score Method for Outlier Detection
The Z-Score measures how many standard deviations a data point is from the mean. Points with |Z| > 3 are typically considered outliers. Assumes the data is approximately normally distributed.
Z = (x − mean) / std
Outlier if |Z| > 3 (roughly 0.3% of data in a normal distribution)
```python
import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(42)
df = pd.DataFrame({
    'height': list(np.random.normal(170, 10, 95)) + [300, 50, 280, -10, 320]
})

# ── Method 1: Manual Z-Score ──────────────────────────
df['z_score'] = (df['height'] - df['height'].mean()) / df['height'].std()
threshold = 3
outliers_z = df[df['z_score'].abs() > threshold]
print(f"Z-Score outliers (|z|>{threshold}): {len(outliers_z)}")
print(outliers_z)

# ── Method 2: scipy zscore (same result) ──────────────
z_scores = np.abs(stats.zscore(df['height']))
df_clean = df[z_scores <= threshold]
print(f"\nAfter removal: {len(df_clean)} rows")

# ── Modified Z-Score (robust, uses median) ────────────
# Better than the standard Z-score when outliers are present
median = df['height'].median()
mad = (df['height'] - median).abs().median()  # Median Absolute Deviation
modified_z = 0.6745 * (df['height'] - median) / mad
df_mod_clean = df[modified_z.abs() <= 3.5]
print(f"\nModified Z-Score clean: {len(df_mod_clean)} rows")

# ── IQR vs Z-Score comparison ─────────────────────────
# IQR:        Non-parametric, robust, works for any distribution
# Z-Score:    Parametric, assumes normality, sensitive to extreme outliers
# Modified Z: Best of both — robust + uses normal-distribution logic
```
- Z-Score = (x - mean) / std — outlier if |z| > 3
- Assumes normality — use IQR for skewed data
- Modified Z-Score (using MAD) is more robust — recommended over standard Z-Score
Feature Scaling — Standardization
Standardization (Z-score normalization) transforms features so they have mean = 0 and standard deviation = 1. This allows algorithms that use distances or gradients to treat all features equally regardless of their original scale.
Result: mean = 0, std = 1
Consider: age (20–60) and salary (30,000–200,000). In KNN, distance is dominated by salary simply because it has larger numbers. The model effectively ignores age. Standardization puts both on equal footing.
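To see this numerically, here is a minimal sketch (hypothetical age/salary values) comparing Euclidean distances before and after standardization. Unscaled, a $1,000 salary gap drowns out a 35-year age gap; scaled, age contributes meaningfully:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two people: similar salary, very different age
a = np.array([25, 60000.0])   # [age, salary]
b = np.array([60, 61000.0])

# Unscaled Euclidean distance — dominated by the salary gap (1000 vs 35)
d_raw = np.linalg.norm(a - b)
print(f"Raw distance: {d_raw:.1f}")

# After standardization (fit on a small illustrative sample)
X = np.array([[25, 60000], [60, 61000], [40, 80000], [30, 45000]], dtype=float)
Xs = StandardScaler().fit_transform(X)
d_scaled = np.linalg.norm(Xs[0] - Xs[1])
print(f"Scaled distance: {d_scaled:.2f}")  # age difference now dominates, as it should
```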
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'age': [25, 35, 45, 28, 52],
    'salary': [50000, 75000, 95000, 60000, 110000],
    'score': [80, 90, 70, 85, 95],
})
X = df.values
X_train, X_test = train_test_split(X, test_size=0.4, random_state=42)

# ── StandardScaler ────────────────────────────────────
scaler = StandardScaler()
# Fit on train, transform both — NEVER fit on test!
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Original train data:")
print(pd.DataFrame(X_train, columns=df.columns))
print("\nStandardized train data:")
print(pd.DataFrame(X_train_scaled, columns=df.columns).round(3))
print(f"\nMean after scaling (should be ~0): {X_train_scaled.mean(axis=0).round(3)}")
print(f"Std after scaling (should be ~1): {X_train_scaled.std(axis=0).round(3)}")

# ── Access scaler parameters ──────────────────────────
print(f"\nLearned means: {scaler.mean_.round(2)}")
print(f"Learned stds: {scaler.scale_.round(2)}")

# ── Inverse transform ─────────────────────────────────
X_back = scaler.inverse_transform(X_train_scaled)
print("\nInverse transform (back to original):")
print(pd.DataFrame(X_back, columns=df.columns).round(1))
| Scaling Method | Use When | Not when |
|---|---|---|
| StandardScaler (Standardization) | Algorithms that assume roughly Gaussian inputs (SVM, Linear/Logistic Regression, PCA); tolerates mild outliers better than MinMax | You need values bounded in [0,1] |
| MinMaxScaler (Normalization) | Neural networks; algorithms sensitive to magnitude; when you need bounded [0,1] range | Data has significant outliers |
- Fitting scaler on the full dataset before train/test split — causes data leakage
- Scaling the target variable y (for regression) — don't, unless you undo it with inverse_transform
- Forgetting to scale test data with the SAME scaler that was fit on train
Feature Scaling — Normalization (MinMaxScaler)
Normalization (Min-Max Scaling) scales all values to the range [0, 1] (or any specified range). It preserves the original distribution shape but compresses it into the specified range.
Result: All values between 0 and 1
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.model_selection import train_test_split

# Dataset with an outlier to show differences
data = pd.DataFrame({'salary': [30000, 45000, 50000, 60000, 200000]})  # 200000 is an outlier
X_train, X_test = train_test_split(data, test_size=0.4, random_state=42)

# ── MinMaxScaler ──────────────────────────────────────
mms = MinMaxScaler(feature_range=(0, 1))
mms.fit(X_train)
print("MinMax scaled (outlier pulls everything to bottom):")
print(pd.DataFrame(mms.transform(data), columns=['salary_minmax']).round(3))

# ── RobustScaler ──────────────────────────────────────
# Uses median and IQR — robust to outliers!
# X_scaled = (X - median) / IQR
rs = RobustScaler()
rs.fit(X_train)
print("\nRobust scaled (better with outliers):")
print(pd.DataFrame(rs.transform(data), columns=['salary_robust']).round(3))

# ── Comparison table ──────────────────────────────────
ss = StandardScaler()
ss.fit(X_train)
comparison = pd.DataFrame({
    'original': data['salary'].values,
    'standardized': ss.transform(data).flatten(),
    'normalized(MinMax)': mms.transform(data).flatten(),
    'robust': rs.transform(data).flatten()
})
print("\nScaling Comparison:")
print(comparison.round(3))

# Summary:
# StandardScaler → mean=0, std=1 — distorted by outlier
# MinMaxScaler   → [0,1] — distorted by outlier
# RobustScaler   → median-centered — handles outliers best
Duplicate Data Handling
Duplicate rows are identical or near-identical records in a dataset. They can arise from data entry errors, merging datasets, or web scraping. Duplicates bias models by over-representing certain patterns and inflate accuracy metrics.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id':     [1, 2, 2, 3, 4, 4, 5],
    'name':   ['Alice', 'Bob', 'Bob', 'Charlie', 'Dave', 'Dave', 'Eve'],
    'salary': [50000, 60000, 60000, 70000, 80000, 80001, 55000],
    # Note: Dave has slightly different salary — near duplicate
})

# ── Exact duplicates ──────────────────────────────────
print("Exact duplicates:", df.duplicated().sum())
print(df[df.duplicated(keep=False)])  # Show all occurrences

# ── Duplicates based on subset of columns ─────────────
# Same person (name+salary close), different id
print("\nDuplicates by name:", df.duplicated(subset=['name']).sum())

# ── Remove exact duplicates ───────────────────────────
# keep='first': keep first occurrence (default)
# keep='last':  keep last occurrence
# keep=False:   drop ALL duplicates
df_clean = df.drop_duplicates(keep='first').reset_index(drop=True)
print("\nAfter removing exact duplicates:")
print(df_clean)

# ── Remove subset duplicates ──────────────────────────
df_clean2 = df.drop_duplicates(subset=['name'], keep='first').reset_index(drop=True)
print("\nAfter removing name duplicates:")
print(df_clean2)

# ── Near-duplicate detection ──────────────────────────
# Rows sharing a name but differing slightly (e.g. salary off by 1)
# are candidates for fuzzy matching or manual review
df_sorted = df.sort_values('name')
grouped = df_sorted.groupby('name').filter(lambda x: len(x) > 1)
print("\nNear-duplicate candidates (same name):")
print(grouped)
Changing Data Types
Data types define how data is stored and what operations are valid. Incorrect data types cause errors, performance issues, and wrong calculations. Converting to the right type is a fundamental preprocessing step.
import pandas as pd
import numpy as np

# Common raw data type issues
df = pd.DataFrame({
    'age':       ['25', '30', 'bad', '28'],                                 # should be int, but string
    'salary':    ['$50,000', '$72,000', '$63,000', '$95,000'],              # currency string
    'date':      ['2023-01-15', '2023-03-22', '2023-07-01', '2023-11-30'],  # string date
    'is_active': ['True', 'False', 'True', 'True']                          # string bool
})
print("Original dtypes:\n", df.dtypes)

# ── Fix 1: String → numeric (safe with coerce) ─────────
df['age_clean'] = pd.to_numeric(df['age'], errors='coerce')  # 'bad' → NaN instead of crashing

# ── Fix 2: Currency string → float ────────────────────
df['salary_clean'] = (df['salary']
                      .str.replace('$', '', regex=False)
                      .str.replace(',', '', regex=False)
                      .astype(float))

# ── Fix 3: String → datetime ──────────────────────────
df['date_clean'] = pd.to_datetime(df['date'])
# Now you can do: df['date_clean'].dt.year, .month, .day, .dayofweek
df['year'] = df['date_clean'].dt.year
df['month'] = df['date_clean'].dt.month

# ── Fix 4: String → boolean ───────────────────────────
df['is_active_bool'] = df['is_active'].map({'True': True, 'False': False})

# ── Fix 5: Downcast for memory efficiency ─────────────
# int64 → int32 saves memory on large datasets
df['age_int32'] = df['age_clean'].astype('Int32')  # nullable int

print("\nCleaned dtypes:\n", df.dtypes)
print("\nCleaned data:\n", df[['age_clean', 'salary_clean', 'year', 'month', 'is_active_bool']])
Function Transformer
FunctionTransformer wraps any Python function into a scikit-learn compatible transformer. This allows you to apply custom transformations (like log, square root, domain-specific functions) inside sklearn Pipelines.
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# Skewed data — log transform helps normalize it
salary = np.array([20000, 30000, 35000, 50000, 200000, 500000])

# ── Basic FunctionTransformer ─────────────────────────
log_transformer = FunctionTransformer(
    func=np.log1p,         # log(1 + x) — handles zeros safely
    inverse_func=np.expm1  # exp(x) - 1 — for inverse transform
)
log_salary = log_transformer.transform(salary.reshape(-1, 1))
print("Original salary:", salary)
print("Log transformed: ", log_salary.flatten().round(2))

# ── Custom function transformer ───────────────────────
def clip_outliers(X, lower_pct=5, upper_pct=95):
    """Clip values to specified percentile range."""
    lower = np.percentile(X, lower_pct)
    upper = np.percentile(X, upper_pct)
    return np.clip(X, lower, upper)

clip_transformer = FunctionTransformer(clip_outliers)
clipped = clip_transformer.transform(salary.reshape(-1, 1))
print("\nClipped (5th-95th pct):", clipped.flatten())

# ── In a Pipeline ─────────────────────────────────────
X = np.random.lognormal(10, 1, (100, 1))
y = np.random.randn(100)
pipeline = Pipeline([
    ('log_transform', FunctionTransformer(np.log1p)),
    ('model', LinearRegression()),
])
pipeline.fit(X, y)
print("\nPipeline with log transform fitted successfully!")
print("Coef:", pipeline.named_steps['model'].coef_)
- FunctionTransformer makes any function sklearn-pipeline-compatible
- log1p is the go-to for right-skewed distributions (salary, prices, counts)
- Always define inverse_func if you need to inverse-transform predictions
Phase 3 — Feature Selection
Backward Elimination
Backward Elimination is a wrapper feature selection method that starts with all features and iteratively removes the least significant one (highest p-value) until all remaining features meet a significance threshold.
1. Start with ALL features
2. Train model, compute p-values for each feature
3. If max p-value > threshold (usually 0.05): remove that feature
4. Retrain with remaining features
5. Repeat until all p-values < threshold
6. Remaining features = selected set
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_diabetes

# Load dataset
data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# ── Backward Elimination ──────────────────────────────
def backward_elimination(X, y, significance_level=0.05):
    """
    Iteratively removes features with p-value > significance_level.
    Returns list of selected feature names.
    """
    cols = list(X.columns)
    while True:
        X_with_const = sm.add_constant(X[cols])  # Add intercept
        model = sm.OLS(y, X_with_const).fit()    # Ordinary Least Squares
        # Get p-values (exclude the constant)
        pvalues = model.pvalues.drop('const')
        max_pval = pvalues.max()
        if max_pval > significance_level:
            worst_feat = pvalues.idxmax()
            print(f"Removing '{worst_feat}' (p-value: {max_pval:.4f})")
            cols.remove(worst_feat)
        else:
            break
    return cols

selected = backward_elimination(X, y, significance_level=0.05)
print(f"\n✅ Selected features ({len(selected)}): {selected}")

# ── Using mlxtend SequentialFeatureSelector (wrapper) ─
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

sfs = SequentialFeatureSelector(
    LinearRegression(),
    k_features=5,    # Select top 5 features
    forward=False,   # False = backward elimination
    scoring='r2',
    cv=5
)
sfs.fit(X.values, y)
print("\nmlxtend Backward SFS selected features:", sfs.k_feature_names_)
Forward Selection
Forward Selection starts with zero features and iteratively adds the most significant feature at each step until no remaining feature improves the model above the threshold.
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from mlxtend.feature_selection import SequentialFeatureSelector

data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# ── Manual Forward Selection ──────────────────────────
def forward_selection(X, y, significance_level=0.05):
    import statsmodels.api as sm
    remaining = list(X.columns)
    selected = []
    while remaining:
        best_pval = 1.0
        best_feat = None
        for feat in remaining:
            cols = selected + [feat]
            X_const = sm.add_constant(X[cols])
            model = sm.OLS(y, X_const).fit()
            pval = model.pvalues[feat]
            if pval < best_pval:
                best_pval = pval
                best_feat = feat
        if best_feat and best_pval < significance_level:
            selected.append(best_feat)
            remaining.remove(best_feat)
            print(f"Added '{best_feat}' (p-value: {best_pval:.4f})")
        else:
            break
    return selected

selected = forward_selection(X, y)
print(f"\n✅ Forward selected ({len(selected)}): {selected}")

# ── mlxtend (cleaner, cross-validated) ───────────────
sfs_fwd = SequentialFeatureSelector(
    LinearRegression(),
    k_features=5,
    forward=True,   # Forward selection
    scoring='r2',
    cv=5
)
sfs_fwd.fit(X.values, y)
print("\nmlxtend Forward SFS features:", sfs_fwd.k_feature_names_)

# When to use Forward vs Backward?
# Forward: many features, want to build minimal set
# Backward: moderate features, start broad
# Both: O(n²) complexity — use filter methods first for huge feature sets
Phase 4 — Model Training
Train-Test Split
Train-test split divides the dataset into two sets: a training set (model learns from this) and a test set (used to evaluate final performance on unseen data). This simulates real-world deployment where the model encounters data it was never trained on.
If you train and evaluate on the same data, the model can "memorize" it (overfit) and report 100% accuracy — but fail completely on new data. Test set = honest evaluation of real-world performance.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# ── Basic split ────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% test, 80% train
    random_state=42,   # Reproducibility
    stratify=y         # IMPORTANT: maintain class distribution
)
print(f"Train size: {X_train.shape}, Test size: {X_test.shape}")

# ── Why stratify=y is critical for classification ─────
# Without stratify: test might have 0 samples of class 2!
print("\nClass distribution in full dataset:")
print(pd.Series(y).value_counts(normalize=True).round(3))
print("Class distribution in train:")
print(pd.Series(y_train).value_counts(normalize=True).round(3))
print("Class distribution in test:")
print(pd.Series(y_test).value_counts(normalize=True).round(3))
# With stratify=y, all three should be ~33% each

# ── Train/Validation/Test split ───────────────────────
# For hyperparameter tuning: 60/20/20
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
print(f"\nTrain: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
# Train: 90, Val: 30, Test: 30 (out of 150)

# ── Common split ratios ───────────────────────────────
# Small dataset (<1000): 70/30 or 60/20/20
# Medium (1000-10000): 80/20 or 70/10/20
# Large (>10000): 90/10 (more train data = better model)
- Not using stratify=y for classification — imbalanced splits mislead evaluation
- Preprocessing (scaling, imputing) the full dataset before splitting — data leakage!
- Not setting random_state — results are not reproducible
- Making the test set too small — noisy evaluation; too large — less data for training
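The leakage mistake above is easy to demonstrate. A minimal sketch (synthetic data, assumed values) showing that a scaler fit on the full dataset learns different parameters than one fit only on the training split — the "leaky" scaler has quietly seen the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(50, 10, size=(100, 1))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# WRONG: the scaler sees test data — its statistics leak into training
leaky = StandardScaler().fit(X)
# RIGHT: the scaler learns only from the training split
clean = StandardScaler().fit(X_train)

print("Leaky mean:", round(leaky.mean_[0], 3))
print("Clean mean:", round(clean.mean_[0], 3))
# The learned parameters differ: the leaky scaler encodes information
# about the test set that a deployed model would never have
```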
Phase 5 — Regression
Regression Analysis
Regression is a supervised learning task that predicts a continuous numerical output based on one or more input features. The goal is to find the mathematical relationship between inputs and the target variable.
| Type | Formula/Description | Use When |
|---|---|---|
| Simple Linear | y = mx + b | 1 feature, linear relationship |
| Multiple Linear | y = b₀ + b₁x₁ + b₂x₂ + ... | Multiple features, linear relationship |
| Polynomial | y = b₀ + b₁x + b₂x² + ... | Curved/nonlinear relationship |
| Ridge (L2) | Linear + L2 penalty | Multicollinearity, prevent overfitting |
| Lasso (L1) | Linear + L1 penalty | Feature selection built-in |
| Metric | Formula | Interpretation | Range |
|---|---|---|---|
| MAE | mean(|y - ŷ|) | Average absolute error — intuitive, robust | [0, ∞) |
| MSE | mean((y - ŷ)²) | Penalizes large errors heavily | [0, ∞) |
| RMSE | √MSE | Same unit as target — most interpretable | [0, ∞) |
| R² | 1 - SS_res/SS_tot | % variance explained (1 = perfect) | (-∞, 1] |
| Adj. R² | 1 - (1-R²)(n-1)/(n-k-1) | R² penalized for extra features | (-∞, 1] |
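The metric formulas in the table can be checked by hand. A minimal sketch (toy `y_true`/`y_pred` values chosen for illustration) computing MAE, MSE, RMSE, and R² from their definitions and confirming they match sklearn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

mae  = np.mean(np.abs(y_true - y_pred))   # mean(|y - ŷ|)
mse  = np.mean((y_true - y_pred) ** 2)    # mean((y - ŷ)²)
rmse = np.sqrt(mse)                       # same unit as y
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                  # 1 - SS_res/SS_tot

# sklearn agrees with the manual formulas
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
print(f"MAE={mae}, MSE={mse}, RMSE={rmse:.4f}, R²={r2:.4f}")
```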
Linear Regression — Theory
Linear Regression models the relationship between input X and output y as a straight line: y = β₀ + β₁X + ε. It finds the line that minimizes the sum of squared residuals (differences between actual and predicted values).
🏠 House price analogy: "For every additional 100 sqft, price increases by $10,000." That relationship IS a linear regression. β₀ = base price (intercept), β₁ = price per sqft (slope).
Closed-form solution: β = (XᵀX)⁻¹ Xᵀy
No iteration needed — there's an exact mathematical solution. In practice this makes linear regression fast when the number of features is modest: inverting XᵀX scales with the cube of the feature count, so for very wide data iterative solvers (gradient descent) are used instead.
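The closed-form solution can be verified in a few lines. A minimal sketch (synthetic data; `np.linalg.solve` used instead of an explicit inverse for numerical stability) showing the normal equation recovers the same coefficients as sklearn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 4.0 + 2.5 * X[:, 0] + rng.normal(0, 1, 100)  # true: β₀=4.0, β₁=2.5

# Closed-form OLS: β = (XᵀX)⁻¹ Xᵀy (prepend a column of 1s for the intercept)
Xb = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(f"Normal equation: β₀={beta[0]:.3f}, β₁={beta[1]:.3f}")

# sklearn reaches the same solution
lr = LinearRegression().fit(X, y)
print(f"sklearn:         β₀={lr.intercept_:.3f}, β₁={lr.coef_[0]:.3f}")
```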
| Assumption | What It Means | How to Check |
|---|---|---|
| Linearity | X and y have linear relationship | Scatter plot |
| Independence | Observations are independent | Domain knowledge |
| Homoscedasticity | Residuals have constant variance | Residual vs fitted plot |
| Normality of residuals | Residuals are normally distributed | QQ-plot |
| No multicollinearity | Features not highly correlated with each other | Correlation matrix, VIF |
Linear Regression — Practical
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

# ── Generate synthetic house price data ───────────────
np.random.seed(42)
n = 200
sqft = np.random.uniform(500, 3000, n)
price = 100 + 0.15 * sqft + np.random.normal(0, 20, n)  # True: y = 100 + 0.15x + noise
df = pd.DataFrame({'sqft': sqft, 'price': price})

# ── Step 1: Prepare data ──────────────────────────────
X = df[['sqft']]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ── Step 2: Train model ───────────────────────────────
model = LinearRegression()
model.fit(X_train, y_train)
print(f"Intercept (β₀): {model.intercept_:.2f}")
print(f"Coefficient (β₁): {model.coef_[0]:.2f}")
print(f"Equation: price = {model.intercept_:.2f} + {model.coef_[0]:.2f} × sqft")

# ── Step 3: Predict ───────────────────────────────────
y_pred = model.predict(X_test)

# ── Step 4: Evaluate ─────────────────────────────────
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"\nMAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²: {r2:.4f}")  # Should be ~0.97+ with clean linear data

# ── Step 5: Residual Analysis ─────────────────────────
residuals = y_test - y_pred
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Actual vs Predicted
axes[0].scatter(y_test, y_pred, alpha=0.5)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
axes[0].set_xlabel('Actual'); axes[0].set_ylabel('Predicted')
axes[0].set_title('Actual vs Predicted\n(Perfect = on the red line)')

# Plot 2: Residuals vs Fitted
axes[1].scatter(y_pred, residuals, alpha=0.5)
axes[1].axhline(y=0, color='r', linestyle='--')
axes[1].set_xlabel('Fitted Values'); axes[1].set_ylabel('Residuals')
axes[1].set_title('Residuals vs Fitted\n(Good = random scatter around 0)')

# Plot 3: Residual histogram
axes[2].hist(residuals, bins=20, color='steelblue')
axes[2].set_title('Residual Distribution\n(Good = roughly normal, centered at 0)')
plt.tight_layout(); plt.show()
- Not plotting residuals — you'll miss heteroscedasticity and nonlinearity
- R² alone is a misleading metric — always check residual plots too
- Negative R² is possible (model worse than predicting mean) — something is very wrong
Load sklearn's California Housing dataset. Build a linear regression predicting house prices. Print MAE, RMSE, R². Plot Actual vs Predicted. Is the relationship linear? What does the residual plot tell you?
Multiple Linear Regression
Multiple Linear Regression extends simple linear regression to use multiple input features simultaneously. Each feature gets its own coefficient (weight) representing its independent contribution to the prediction.
β₀ = intercept, βᵢ = coefficient for feature xᵢ
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.datasets import fetch_california_housing
import matplotlib.pyplot as plt
import seaborn as sns

# ── Load dataset ──────────────────────────────────────
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='Price')
print("Features:", X.columns.tolist())
print("Shape:", X.shape)

# ── Check multicollinearity with correlation matrix ───
plt.figure(figsize=(10, 8))
corr = X.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Feature Correlation Matrix\n(High values = multicollinearity risk)')
plt.tight_layout(); plt.show()

# ── Train/test split ──────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ── Fit model ─────────────────────────────────────────
model = LinearRegression()
model.fit(X_train, y_train)

# ── Coefficients interpretation ───────────────────────
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
}).sort_values('Coefficient', ascending=False)
print("\nCoefficients (show each feature's independent effect):")
print(coef_df.to_string(index=False))

# ── Evaluate ─────────────────────────────────────────
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"\nR²: {r2:.4f}, RMSE: {rmse:.4f}")

# ── VIF: Variance Inflation Factor for multicollinearity ──
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

X_const = sm.add_constant(X_train)
vif_data = pd.DataFrame()
vif_data["Feature"] = X_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_const.values, i)
                   for i in range(X_const.shape[1])]
print("\nVIF (>10 = multicollinearity problem):")
print(vif_data.sort_values('VIF', ascending=False))
- Not checking multicollinearity: VIF > 10 = features are redundant, coefficients become unreliable
- Interpreting coefficients without scaling: Compare magnitudes only after standardizing features
- Adding too many features: More features → more risk of overfitting and multicollinearity
Polynomial Regression
Polynomial Regression extends linear regression by adding polynomial terms (x², x³, ...) as new features. Despite fitting a curve, it is still a linear model because it is linear in its parameters — only the features are transformed.
Key insight: this is linear regression on [x, x², x³] as features
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline

np.random.seed(42)
X = np.random.uniform(-3, 3, 100)
y = 0.5 * X**3 - 2 * X**2 + X + np.random.normal(0, 2, 100)
X = X.reshape(-1, 1)

# ── Compare degrees ───────────────────────────────────
fig, axes = plt.subplots(1, 4, figsize=(18, 4), sharey=True)
X_plot = np.linspace(-3, 3, 300).reshape(-1, 1)
for i, degree in enumerate([1, 2, 3, 10]):
    # Pipeline: transform features → fit linear regression
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('model', LinearRegression())
    ])
    pipe.fit(X, y)
    y_plot = pipe.predict(X_plot)
    r2 = r2_score(y, pipe.predict(X))
    axes[i].scatter(X, y, alpha=0.4, s=20)
    axes[i].plot(X_plot, y_plot, color='red', lw=2)
    axes[i].set_title(f"Degree {degree}\nR² = {r2:.3f}")
    if degree == 10:
        axes[i].set_title(f"Degree {degree}\nR² = {r2:.3f}\n⚠️ OVERFITTING")
plt.suptitle('Polynomial Regression: Degree Comparison', y=1.02)
plt.tight_layout()
plt.show()

# ── PolynomialFeatures explained ─────────────────────
poly = PolynomialFeatures(degree=2, include_bias=False)
X_2feat = np.array([[2, 3]])  # [x1, x2]
X_poly = poly.fit_transform(X_2feat)
print("Input: [x1, x2]")
print("Poly degree 2 output: [x1, x2, x1², x1·x2, x2²]")
print("Feature names:", poly.get_feature_names_out(['x1', 'x2']))
print("Values:", X_poly)
- High degree polynomial = overfitting (memorizes noise, fails on test data)
- Not scaling features — polynomial terms create extremely large values without scaling
- Using polynomial regression when you should use a tree-based model — trees handle nonlinearity naturally
Cost Function
A cost function (loss function) quantifies how wrong the model's predictions are. During training, the algorithm adjusts model parameters to minimize this cost. Understanding the cost function is essential — it defines what "learning" means.
| Cost Function | Formula | Properties |
|---|---|---|
| MSE (OLS) | Σ(y−ŷ)²/n | Penalizes large errors heavily; differentiable everywhere; default for linear regression |
| MAE | Σ|y−ŷ|/n | Robust to outliers; not differentiable at 0 |
| Huber Loss | MSE if |error|<δ, MAE otherwise | Best of both worlds — use for outlier-prone data |
import numpy as np
import matplotlib.pyplot as plt

# Simple 1D regression to visualize cost landscape
np.random.seed(42)
X = np.random.uniform(0, 10, 50)
y = 3 * X + np.random.normal(0, 3, 50)

# Cost = MSE for different values of slope (β₁)
slopes = np.linspace(-2, 8, 200)
mse_costs = []
for slope in slopes:
    y_pred = slope * X  # intercept=0 for simplicity
    mse = np.mean((y - y_pred)**2)
    mse_costs.append(mse)

# The minimum of this U-shaped curve = optimal β₁
optimal_slope = slopes[np.argmin(mse_costs)]
print(f"Optimal slope found by scanning: {optimal_slope:.2f}")

plt.figure(figsize=(8, 4))
plt.plot(slopes, mse_costs, 'b-', lw=2)
plt.axvline(x=optimal_slope, color='r', linestyle='--', label=f'Min at β={optimal_slope:.2f}')
plt.xlabel('Slope (β₁)'); plt.ylabel('MSE Cost')
plt.title('MSE Cost Landscape\n(Training finds the bottom of this bowl)')
plt.legend(); plt.tight_layout(); plt.show()

# ── Gradient Descent from scratch ────────────────────
# This is the idea behind sklearn's SGDRegressor
# (which uses stochastic rather than full-batch updates)
def gradient_descent(X, y, learning_rate=0.01, epochs=1000):
    n = len(X)
    beta = 0.0  # start at 0
    costs = []
    for epoch in range(epochs):
        y_pred = beta * X
        cost = np.mean((y - y_pred)**2)
        # Gradient of MSE w.r.t. beta = -2/n * Σ(y - ŷ)*x
        gradient = -2 / n * np.sum((y - y_pred) * X)
        beta -= learning_rate * gradient  # Update step
        costs.append(cost)
    return beta, costs

beta_found, costs = gradient_descent(X, y)
print(f"Gradient Descent found β = {beta_found:.4f} (true = 3.0)")
R² and Adjusted R²
R² (coefficient of determination) measures the proportion of variance in y explained by the model. Adjusted R² penalizes adding irrelevant features, making it the preferred metric for model comparison.
SS_res = Σ(y − ŷ)² (residual sum of squares)
SS_tot = Σ(y − ȳ)² (total sum of squares)
Adjusted R² = 1 − (1−R²) × (n−1) / (n−k−1) where n = samples, k = number of features
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

housing = fetch_california_housing()
X_full = housing.data
y = housing.target
X_train, X_test, y_train, y_test = train_test_split(X_full, y, test_size=0.2, random_state=42)

# ── R² from scratch ───────────────────────────────────
def manual_r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred)**2)
    ss_tot = np.sum((y_true - np.mean(y_true))**2)
    return 1 - (ss_res / ss_tot)

# ── Adjusted R² ──────────────────────────────────────
def adjusted_r2(r2, n, k):
    """
    r2: R² value
    n: number of samples
    k: number of features (predictors)
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# ── Compare R² vs Adj-R² as we add noise features ─────
# Adding random noise features should NOT improve Adj-R²
print(f"{'Features':>10} {'R²':>10} {'Adj R²':>10}")
print("-" * 35)
for n_noise in [0, 5, 20, 50]:
    np.random.seed(42)
    if n_noise > 0:
        noise = np.random.randn(X_train.shape[0], n_noise)
        X_tr = np.hstack([X_train, noise])
        noise_t = np.random.randn(X_test.shape[0], n_noise)
        X_ts = np.hstack([X_test, noise_t])
    else:
        X_tr, X_ts = X_train, X_test
    model = LinearRegression()
    model.fit(X_tr, y_train)
    y_pred = model.predict(X_ts)
    r2 = r2_score(y_test, y_pred)
    n = X_ts.shape[0]
    k = X_tr.shape[1]
    adj = adjusted_r2(r2, n, k)
    print(f"{X_train.shape[1]+n_noise:>10} {r2:>10.4f} {adj:>10.4f}")

# Key insight:
# Training R² never decreases as you add features (even pure noise!)
# Adj R² penalizes extra features — it DROPS if they are irrelevant
# → Always use Adj R² for model comparison
Best Fit Line
The "best fit line" (regression line) is the line that minimizes the sum of squared vertical distances between the data points and the line itself. It passes through (x̄, ȳ) and is uniquely defined by the OLS solution.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

np.random.seed(42)
X = np.random.uniform(1, 10, 30)
y = 2 * X + np.random.normal(0, 2, 30)

model = LinearRegression().fit(X.reshape(-1, 1), y)
y_pred = model.predict(X.reshape(-1, 1))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ── Plot 1: Scatter + best fit line + residuals ───────
axes[0].scatter(X, y, color='steelblue', zorder=5, label='Data')
X_line = np.linspace(X.min(), X.max(), 100)
axes[0].plot(X_line, model.predict(X_line.reshape(-1, 1)), 'r-', lw=2, label='Best Fit Line')
# Draw residuals as vertical lines
for xi, yi, yp in zip(X, y, y_pred):
    axes[0].plot([xi, xi], [yi, yp], 'g--', alpha=0.4, lw=1)
axes[0].axvline(x=X.mean(), color='purple', linestyle=':', label=f'x̄ = {X.mean():.1f}')
axes[0].axhline(y=y.mean(), color='orange', linestyle=':', label=f'ȳ = {y.mean():.1f}')
axes[0].set_title(f'Best Fit Line\nGreen lines = residuals (what we minimize)\nβ₀={model.intercept_:.2f}, β₁={model.coef_[0]:.2f}')
axes[0].legend()

# ── Plot 2: Squared residuals (what OLS actually minimizes) ──
residuals = y - y_pred
axes[1].bar(range(len(residuals)), residuals**2, color='coral')
axes[1].set_title(f'Squared Residuals\nTotal = {(residuals**2).sum():.2f} (OLS minimizes this)')
axes[1].set_xlabel('Sample Index')
axes[1].set_ylabel('Squared Residual')
plt.tight_layout()
plt.show()

print(f"Best fit: y = {model.intercept_:.2f} + {model.coef_[0]:.2f}x")
print(f"True:     y = 0 + 2x")
print("Close! Noise causes small deviation.")
Lasso (L1) & Ridge (L2) — Theory
Ridge and Lasso are regularized versions of linear regression that add a penalty term to prevent overfitting by discouraging large coefficients. They're essential when you have many features or multicollinearity.
When a linear model overfits, it assigns large coefficients to noise features. Regularization adds a penalty for large coefficients, forcing the model to be simpler. This trades a tiny bit of bias for a large reduction in variance.
Bias-Variance Tradeoff: Underfitting = High Bias. Overfitting = High Variance. Regularization = Find the sweet spot.
Ridge: Minimize Σ(y − ŷ)² + λΣβᵢ² (L2 penalty)
Lasso: Minimize Σ(y − ŷ)² + λΣ|βᵢ| (L1 penalty)
Elastic Net: Ridge + Lasso combined
| Property | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty | Sum of squared coefficients | Sum of absolute coefficients |
| Effect on coefficients | Shrinks toward 0, but rarely exactly 0 | Can shrink to EXACTLY 0 (feature selection!) |
| Feature selection | No — keeps all features small | Yes — eliminates irrelevant features |
| Best for | Multicollinearity; all features are relevant | When you suspect many features are irrelevant |
| Hyperparameter α (λ) | Higher α = more shrinkage | Higher α = more features set to 0 |
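Elastic Net appears in the comparison above but gets no code of its own. A minimal sketch (the synthetic data and hyperparameters here are illustrative, not tuned): only the first two features carry signal, and the combined penalty shrinks the noise coefficients toward (or to) zero.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features matter; the other three are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# l1_ratio blends the penalties: 1.0 = pure Lasso, 0.0 = pure Ridge
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(StandardScaler().fit_transform(X), y)
print("Coefficients:", enet.coef_.round(2))
```

The signal coefficients survive (slightly shrunk), while the noise coefficients collapse — the Lasso-like part of the penalty doing feature selection, the Ridge-like part stabilizing correlated features.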
Lasso & Ridge — Practical (Continued)
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score

housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
sc = StandardScaler()
Xtr = sc.fit_transform(X_train)
Xte = sc.transform(X_test)

# ── Coefficient Path: how coefficients shrink with alpha ─
alphas = np.logspace(-3, 3, 100)
ridge_coefs, lasso_coefs = [], []
for a in alphas:
    ridge_coefs.append(Ridge(alpha=a).fit(Xtr, y_train).coef_)
    lasso_coefs.append(Lasso(alpha=a, max_iter=5000).fit(Xtr, y_train).coef_)
ridge_coefs = np.array(ridge_coefs)
lasso_coefs = np.array(lasso_coefs)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
for i, name in enumerate(X.columns):
    ax1.plot(alphas, ridge_coefs[:, i], label=name)
    ax2.plot(alphas, lasso_coefs[:, i], label=name)
ax1.set_xscale('log'); ax1.set_title('Ridge Coefficient Path\n(All shrink but stay nonzero)')
ax1.set_xlabel('Alpha (regularization strength)'); ax1.legend(fontsize=8)
ax2.set_xscale('log'); ax2.set_title('Lasso Coefficient Path\n(Features drop to EXACT zero)')
ax2.set_xlabel('Alpha (regularization strength)'); ax2.legend(fontsize=8)
plt.tight_layout(); plt.show()

# ── Auto-tune alpha with Cross-Validation ─────────────
# RidgeCV / LassoCV test many alphas internally via CV
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(Xtr, y_train)
print(f"RidgeCV best alpha: {ridge_cv.alpha_:.4f}")
print(f"RidgeCV R²: {r2_score(y_test, ridge_cv.predict(Xte)):.4f}")

lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=5000)
lasso_cv.fit(Xtr, y_train)
print(f"\nLassoCV best alpha: {lasso_cv.alpha_:.4f}")
print(f"LassoCV R²: {r2_score(y_test, lasso_cv.predict(Xte)):.4f}")
n_zero = np.sum(lasso_cv.coef_ == 0)
print(f"LassoCV zeroed out {n_zero}/{X.shape[1]} features (automatic feature selection!)")
```
- Not scaling before Ridge/Lasso — regularization penalizes coefficient magnitude; unscaled features get unfairly penalized
- Using `alpha=0` in Ridge — that's just OLS, no regularization
- Forgetting that Lasso can set ALL features to zero if alpha is too high
- Ridge shrinks coefficients but keeps all features — good for multicollinearity
- Lasso does feature selection by zeroing coefficients — good when many features are irrelevant
- Elastic Net combines both — best for correlated feature groups
- Always scale features before applying any regularized model
- Use RidgeCV / LassoCV to auto-select optimal alpha via cross-validation
Load the California housing dataset (the Boston dataset has been removed from scikit-learn). Add 10 random noise features to X. Compare R² and Adjusted R² of OLS, Ridge, and Lasso. Show which features Lasso zeroes out. Which model is best, and why?
⚡ Project 1 — After Regression
Real-World Project: House Price Predictor
Dataset: California Housing (sklearn) or Kaggle Ames Housing
Goal: Build a production-grade regression pipeline with all preprocessing steps chained together.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# ── 1. LOAD DATA ──────────────────────────────────────
data = fetch_california_housing(as_frame=True)
df = data.frame
print("Shape:", df.shape)
print(df.describe().round(2))

# ── 2. FEATURE ENGINEERING ────────────────────────────
df['rooms_per_household'] = df['AveRooms'] / df['AveOccup']
df['bedrooms_per_room'] = df['AveBedrms'] / df['AveRooms']
df['population_per_household'] = df['Population'] / df['AveOccup']

feature_cols = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
                'AveOccup', 'Latitude', 'Longitude', 'rooms_per_household',
                'bedrooms_per_room', 'population_per_household']
X = df[feature_cols]
y = df['MedHouseVal']

# ── 3. SPLIT ──────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ── 4. BUILD PIPELINES ────────────────────────────────
pipe_lr = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])
pipe_ridge = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])
pipe_lasso = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', Lasso(alpha=0.01))
])

# ── 5. EVALUATE ALL ───────────────────────────────────
print(f"\n{'Model':15} {'R²':>8} {'RMSE':>10} {'MAE':>10} {'CV R²':>10}")
print("-" * 60)
for name, pipe in [('LinearReg', pipe_lr), ('Ridge', pipe_ridge), ('Lasso', pipe_lasso)]:
    pipe.fit(X_train, y_train)
    yp = pipe.predict(X_test)
    r2 = r2_score(y_test, yp)
    rmse = np.sqrt(mean_squared_error(y_test, yp))
    mae = mean_absolute_error(y_test, yp)
    cv = cross_val_score(pipe, X_train, y_train, cv=5, scoring='r2').mean()
    print(f"{name:15} {r2:8.4f} {rmse:10.4f} {mae:10.4f} {cv:10.4f}")

# ── 6. RESIDUAL PLOT for best model ───────────────────
pipe_ridge.fit(X_train, y_train)
y_pred = pipe_ridge.predict(X_test)
residuals = y_test - y_pred

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred, alpha=0.3, s=10)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.title('Actual vs Predicted (Ridge)'); plt.xlabel('Actual'); plt.ylabel('Predicted')
plt.subplot(1, 2, 2)
plt.hist(residuals, bins=50, color='steelblue')
plt.title('Residual Distribution')
plt.tight_layout(); plt.show()
```
Phase 6 — Classification
Classification Overview
Classification is supervised learning where the output is a discrete category/label rather than a continuous number. The model learns a decision boundary that separates classes.
| Type | Output | Example | Algorithm |
|---|---|---|---|
| Binary | 2 classes (0/1) | Spam / Not Spam | Logistic Regression |
| Multiclass | 3+ mutually exclusive classes | Dog / Cat / Bird | Softmax, Decision Tree |
| Multilabel | Multiple labels per sample | Movie: Action + Comedy | One-vs-Rest wrapper |
| Algorithm | How it works | Best for |
|---|---|---|
| Logistic Regression | Sigmoid probability on linear boundary | Linearly separable, interpretability needed |
| Decision Tree | Splits on feature thresholds | Nonlinear, interpretable rules |
| Random Forest | Average of many trees | High accuracy, feature importance |
| SVM | Maximize margin between classes | High-dimensional data, small datasets |
| KNN | Majority vote of k nearest neighbors | Small datasets, nonlinear boundaries |
| Naive Bayes | Probabilistic, Bayes theorem | Text classification, very fast |
- Classification = predicting a label/category, not a number
- Binary (2 classes) is simplest; multiclass uses one-vs-rest or softmax extensions
- Choose algorithm based on: dataset size, linearity, interpretability needs
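To make the "choose your algorithm" advice concrete, here is a quick sketch (synthetic data, default hyperparameters — not a benchmark) comparing three of the algorithms above with cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

scores = {}
for name, clf in [('LogReg', LogisticRegression(max_iter=1000)),
                  ('Tree',   DecisionTreeClassifier(random_state=42)),
                  ('KNN',    KNeighborsClassifier())]:
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:8} CV accuracy: {scores[name]:.3f}")
```

On real problems the ranking depends on dataset size, linearity, and noise — the point is that a few lines of `cross_val_score` give an honest first comparison.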
Logistic Regression — Binary
Logistic Regression is a classification algorithm (misleadingly named) that uses the sigmoid function to output a probability between 0 and 1. It models the log-odds of the positive class as a linear function of inputs.
Linear regression gives output ∈ (−∞, +∞) — useless as a probability. We need output in (0, 1). The sigmoid (logistic) function squashes any real number into (0, 1), and the output is the probability of class 1. If P ≥ 0.5 → predict class 1; else → class 0.
P(y=1 | X) = σ(Xβ), where σ(z) = 1 / (1 + e^(−z))
Decision: ŷ = 1 if P ≥ threshold (default 0.5), else 0
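The squashing behavior is easy to verify numerically — a minimal sketch of the sigmoid and the default 0.5 threshold:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

for z in [-10, -2, 0, 2, 10]:
    p = sigmoid(z)
    print(f"z = {z:>3} → P = {p:.4f} → class {int(p >= 0.5)}")
```

Large negative scores give probabilities near 0, large positive scores near 1, and z = 0 sits exactly at the 0.5 decision boundary.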
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, recall_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

# ── Load binary classification dataset ────────────────
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target  # 0=malignant, 1=benign
print("Classes:", data.target_names)
print("Class balance:", pd.Series(y).value_counts())

# ── Prepare ───────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# ── Train logistic regression ─────────────────────────
# C = inverse regularization strength (C = 1/alpha)
# solver: 'lbfgs' (default), 'liblinear' (good for small data)
model = LogisticRegression(C=1.0, solver='lbfgs', max_iter=1000, random_state=42)
model.fit(X_train_s, y_train)

# ── Evaluate ──────────────────────────────────────────
y_pred = model.predict(X_test_s)
y_proba = model.predict_proba(X_test_s)[:, 1]  # P(class=1)
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# ── Probability threshold analysis ────────────────────
# Default threshold = 0.5. In medical diagnosis, a lower threshold
# catches more true positives (higher recall, lower precision)
thresholds = np.arange(0.1, 0.9, 0.1)
print(f"\n{'Threshold':>12} {'Accuracy':>10} {'Recall':>10}")
for t in thresholds:
    pred_t = (y_proba >= t).astype(int)
    acc = accuracy_score(y_test, pred_t)
    rec = recall_score(y_test, pred_t)
    print(f"{t:>12.1f} {acc:>10.4f} {rec:>10.4f}")
# In cancer detection: lower threshold = catch more cases (higher recall)
# but more false alarms (lower precision)

# ── Coefficients — feature importance ─────────────────
coef_df = pd.DataFrame({'Feature': X.columns,
                        'Coefficient': model.coef_[0]}).sort_values('Coefficient')
print("\nTop positive features (push toward benign, class 1):")
print(coef_df.tail(5))
```
- Using logistic regression on highly nonlinear data — use trees or SVM instead
- Ignoring the threshold — 0.5 is not always optimal; tune it based on recall/precision needs
- Not scaling features — logistic regression with regularization is scale-sensitive
Logistic Regression — Multiple Input Features
Multiple-input logistic regression uses many features simultaneously to compute the log-odds. Each feature gets its own coefficient, and the sigmoid is applied to the linear combination.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (roc_curve, auc, classification_report,
                             ConfusionMatrixDisplay)
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Pipeline: scale → logistic
pipe = Pipeline([('sc', StandardScaler()),
                 ('lr', LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)
y_prob = pipe.predict_proba(X_test)[:, 1]
y_pred = pipe.predict(X_test)

# ── ROC Curve ─────────────────────────────────────────
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(fpr, tpr, lw=2, label=f'ROC (AUC = {roc_auc:.3f})')
axes[0].plot([0, 1], [0, 1], 'k--', label='Random (AUC=0.5)')
axes[0].set_xlabel('False Positive Rate'); axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve\n(Closer to top-left = better model)')
axes[0].legend()

# ── Confusion Matrix ──────────────────────────────────
ConfusionMatrixDisplay.from_predictions(y_test, y_pred,
                                        display_labels=data.target_names,
                                        ax=axes[1], colorbar=False)
axes[1].set_title('Confusion Matrix')
plt.tight_layout(); plt.show()

print(f"ROC-AUC: {roc_auc:.4f}")
print(classification_report(y_test, y_pred))
# AUC: 0.5 = random, 1.0 = perfect. Aim for >0.85 in practice
```
Logistic Regression — Polynomial Features
Adding polynomial features to logistic regression allows it to learn nonlinear decision boundaries. Without this, logistic regression can only separate classes with a straight line/plane.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_circles

# Circular data — not linearly separable!
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=42)

def plot_boundary(model, X, y, ax, title):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='k', s=20)
    acc = model.score(X, y)
    ax.set_title(f'{title}\nAccuracy: {acc:.3f}')

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Linear boundary — fails on circles
pipe_lin = Pipeline([('sc', StandardScaler()), ('lr', LogisticRegression())])
pipe_lin.fit(X, y)
plot_boundary(pipe_lin, X, y, ax1, 'Linear Logistic Reg\n(FAILS on circles)')

# Polynomial degree 3 — learns circular boundary!
pipe_poly = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(C=1.0))
])
pipe_poly.fit(X, y)
plot_boundary(pipe_poly, X, y, ax2, 'Polynomial (degree=3)\n(Learns circular boundary!)')
plt.tight_layout(); plt.show()
```
Logistic Regression — Multiclass
Multiclass classification extends binary logistic regression to 3+ classes using two strategies: One-vs-Rest (OvR) — trains one binary classifier per class vs all others — and Softmax (Multinomial) — outputs a probability distribution over all classes simultaneously.
All probabilities sum to 1: Σ P(y=k|X) = 1
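A from-scratch sketch of the softmax transform (the raw scores below are invented for illustration, not from a trained model):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; result is unchanged
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw linear scores for 3 classes
probs = softmax(scores)
print(probs.round(3), "sum =", probs.sum())
```

The largest score gets the largest probability, and the outputs always sum to 1 — exactly the property stated above.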
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
print("Classes:", iris.target_names)
print("Samples per class:", np.bincount(y))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
sc = StandardScaler()
X_tr = sc.fit_transform(X_train)
X_te = sc.transform(X_test)

# ── Strategy 1: One-vs-Rest (OvR) ─────────────────────
# For k classes: trains k binary classifiers
# Class with highest probability wins
# (note: the multi_class argument is deprecated in newer scikit-learn versions)
lr_ovr = LogisticRegression(multi_class='ovr', C=1.0, max_iter=1000)
lr_ovr.fit(X_tr, y_train)
print(f"\nOvR Accuracy: {accuracy_score(y_test, lr_ovr.predict(X_te)):.4f}")

# ── Strategy 2: Softmax (multinomial) ─────────────────
# Trains ONE model jointly across all classes
# Better when classes are mutually exclusive
lr_soft = LogisticRegression(multi_class='multinomial', solver='lbfgs',
                             C=1.0, max_iter=1000)
lr_soft.fit(X_tr, y_train)
y_pred_soft = lr_soft.predict(X_te)
print(f"Softmax Accuracy: {accuracy_score(y_test, y_pred_soft):.4f}")

# ── Probability output (Softmax) ──────────────────────
proba = lr_soft.predict_proba(X_te[:5])
print("\nProbabilities for first 5 test samples (3 classes):")
df_proba = pd.DataFrame(proba, columns=iris.target_names).round(3)
df_proba['Prediction'] = iris.target_names[y_pred_soft[:5]]
df_proba['True Label'] = iris.target_names[y_test[:5]]
print(df_proba)

print("\nDetailed Report:")
print(classification_report(y_test, y_pred_soft, target_names=iris.target_names))
```
Phase 7 — Evaluation Metrics
Confusion Matrix
A Confusion Matrix is a table that visualizes the complete performance of a classification model by showing how many predictions fell into each category: True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN).
| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | TP — Correct positive prediction | FN — Missed positive (Type II Error) |
| Actual: Negative | FP — Wrong alarm (Type I Error) | TN — Correct negative prediction |
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
sc = StandardScaler()
X_tr = sc.fit_transform(X_tr); X_te = sc.transform(X_te)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# ── Raw confusion matrix ──────────────────────────────
cm = confusion_matrix(y_te, y_pred)
print("Confusion Matrix:\n", cm)
tn, fp, fn, tp = cm.ravel()
print(f"\nTN={tn} FP={fp}")
print(f"FN={fn} TP={tp}")

# ── All derived metrics from confusion matrix ─────────
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)       # = Sensitivity = True Positive Rate
specificity = tn / (tn + fp)  # True Negative Rate
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)          # False Positive Rate

print(f"\nAccuracy:    {accuracy:.4f}")
print(f"Precision:   {precision:.4f} (Of all predicted positive, how many actually are?)")
print(f"Recall:      {recall:.4f} (Of all actual positive, how many did we catch?)")
print(f"Specificity: {specificity:.4f} (Of all actual negative, how many correctly rejected?)")
print(f"F1-Score:    {f1:.4f} (Harmonic mean of precision + recall)")

# ── Heatmap visualization ─────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
ConfusionMatrixDisplay.from_predictions(y_te, y_pred,
                                        display_labels=data.target_names, ax=axes[0])
axes[0].set_title('Counts')
ConfusionMatrixDisplay.from_predictions(y_te, y_pred,
                                        display_labels=data.target_names,
                                        ax=axes[1], normalize='true')
axes[1].set_title('Normalized (Recall per class)')
plt.tight_layout(); plt.show()
```
Precision, Recall, F1-Score
Precision: Of all the samples we predicted as positive, what fraction is truly positive? (How careful we are.)
Recall: Of all truly positive samples, what fraction did we correctly identify? (How thorough we are.)
F1-Score: Harmonic mean of Precision and Recall — penalizes extreme imbalances between them.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F-beta = (1+β²) × (P × R) / (β²P + R) — β>1 weights recall more
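Plugging hypothetical counts into these formulas (TP=80, FP=10, FN=20 are made up for illustration):

```python
tp, fp, fn = 80, 10, 20

precision = tp / (tp + fp)   # 80/90  ≈ 0.889
recall = tp / (tp + fn)      # 80/100 = 0.800
f1 = 2 * precision * recall / (precision + recall)

beta = 2  # F2 weights recall more heavily
f2 = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} F2={f2:.3f}")
```

Because recall is lower than precision here, F2 (which emphasizes recall) comes out below F1 — the β weighting visibly shifts the score toward the weaker side.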
High Precision, Low Recall: Only flag cases you're very sure about. Miss some. (Spam: rarely mark real email as spam, but let some spam through.)
High Recall, Low Precision: Catch everything, accept false alarms. (Cancer screening: catch all cancer cases, even if some false alarms need follow-up.)
Control it: Lower the threshold → higher recall, lower precision. Higher threshold → higher precision, lower recall.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (precision_score, recall_score, f1_score, fbeta_score,
                             precision_recall_curve, average_precision_score,
                             classification_report)
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, test_size=0.2,
                                          stratify=data.target, random_state=42)
sc = StandardScaler()
model = LogisticRegression(max_iter=1000).fit(sc.fit_transform(X_tr), y_tr)
y_prob = model.predict_proba(sc.transform(X_te))[:, 1]
y_pred = model.predict(sc.transform(X_te))

# ── Key metrics ───────────────────────────────────────
print(f"Precision:  {precision_score(y_te, y_pred):.4f}")
print(f"Recall:     {recall_score(y_te, y_pred):.4f}")
print(f"F1-Score:   {f1_score(y_te, y_pred):.4f}")
print(f"F2-Score:   {fbeta_score(y_te, y_pred, beta=2):.4f} (weights recall more)")
print(f"F0.5-Score: {fbeta_score(y_te, y_pred, beta=0.5):.4f} (weights precision more)")

# ── Precision-Recall Curve ────────────────────────────
precision, recall, thresholds = precision_recall_curve(y_te, y_prob)
avg_prec = average_precision_score(y_te, y_prob)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(recall, precision, lw=2, label=f'AP = {avg_prec:.3f}')
axes[0].set_xlabel('Recall'); axes[0].set_ylabel('Precision')
axes[0].set_title('Precision-Recall Curve\n(Area = Average Precision)')
axes[0].legend()

# ── Threshold vs Precision/Recall ─────────────────────
axes[1].plot(thresholds, precision[:-1], label='Precision')
axes[1].plot(thresholds, recall[:-1], label='Recall')
axes[1].set_xlabel('Threshold')
axes[1].set_title('Precision vs Recall at different thresholds\n(Move threshold → tradeoff changes)')
axes[1].legend()
plt.tight_layout(); plt.show()

# Per-class report (essential for multiclass)
print("\nClassification Report:")
print(classification_report(y_te, y_pred, target_names=data.target_names))
```
- Accuracy is misleading on imbalanced data — use F1, Precision, Recall
- Precision = how trustworthy your positive predictions are
- Recall = how complete your positive detections are
- F-beta: β>1 weights recall; β<1 weights precision — choose based on business cost
Imbalanced Dataset Handling
An imbalanced dataset has a significant difference in class frequencies (e.g., 98% class 0, 2% class 1). Models trained on such data develop a bias toward the majority class and effectively ignore the minority class — which is usually the one we care most about (fraud, disease, defects).
| Strategy | How It Works | Pros / Cons |
|---|---|---|
| Class Weights | Penalize misclassifying minority class more | Simple, no data change; always try first |
| Oversampling (SMOTE) | Synthetically generate new minority samples | More data; risk of overfitting |
| Undersampling | Randomly remove majority samples | Fast; loses real information |
| SMOTE + Undersampling | Combine both | Balanced; best of both |
| Threshold tuning | Lower decision threshold for minority | No data change; tune for business metric |
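The class-weight formula sklearn uses can be checked directly with `compute_class_weight` (the 95/5 split below mirrors the example in the table, chosen for illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 95 + [1] * 5)  # 95% majority, 5% minority
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
# w_i = n_samples / (n_classes * n_i) → [100/(2*95), 100/(2*5)]
print(dict(zip([0, 1], weights.round(3))))
```

Misclassifying a minority sample now costs about 19× more than a majority sample, which is exactly how `class_weight='balanced'` counteracts the imbalance.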
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# ── Create heavily imbalanced dataset ─────────────────
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05],  # 95% class 0, 5% class 1
                           random_state=42)
print("Class distribution:", np.bincount(y))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr); X_te_s = sc.transform(X_te)

# ── 1. No correction (baseline) ───────────────────────
lr_base = LogisticRegression().fit(X_tr_s, y_tr)
print("\n--- Baseline (no correction) ---")
print(classification_report(y_te, lr_base.predict(X_te_s)))

# ── 2. class_weight='balanced' ────────────────────────
# sklearn auto-computes weights: w_i = n_samples / (n_classes * n_i)
lr_bal = LogisticRegression(class_weight='balanced').fit(X_tr_s, y_tr)
print("--- class_weight='balanced' ---")
print(classification_report(y_te, lr_bal.predict(X_te_s)))

# ── 3. SMOTE Oversampling (requires imbalanced-learn) ─
try:
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline as ImbPipeline
    smote_pipe = ImbPipeline([
        ('smote', SMOTE(random_state=42)),
        ('lr', LogisticRegression())
    ])
    smote_pipe.fit(X_tr_s, y_tr)
    print("--- SMOTE + Logistic Regression ---")
    print(classification_report(y_te, smote_pipe.predict(X_te_s)))
except ImportError:
    print("Install: pip install imbalanced-learn")

# ── 4. Threshold tuning ───────────────────────────────
prob = lr_bal.predict_proba(X_te_s)[:, 1]
for t in [0.3, 0.4, 0.5]:
    pred_t = (prob >= t).astype(int)
    f1 = f1_score(y_te, pred_t)
    print(f"Threshold {t}: F1={f1:.4f}")
```
- Applying SMOTE before train/test split — synthetic samples from test data leak into training
- Using accuracy as the metric — use F1, ROC-AUC, or G-Mean for imbalanced data
- Not trying `class_weight='balanced'` first — it's free and often sufficient
Phase 8 — Probabilistic Models
Naive Bayes — Theory
Naive Bayes is a probabilistic classifier based on Bayes' Theorem with a "naive" assumption that all features are conditionally independent given the class. Despite this simplification, it works remarkably well — especially for text classification.
Naive Independence Assumption: P(X|y) = P(x₁|y) × P(x₂|y) × ... × P(xₙ|y)
Decision: ŷ = argmax_k P(y=k) × Π P(xᵢ|y=k)
Email Spam Example: P(spam | "free", "money", "click") ∝ P(spam) × P("free"|spam) × P("money"|spam) × P("click"|spam). We multiply individual word probabilities. We call it "naive" because words in a real email are NOT truly independent — but this assumption still works very well in practice.
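The spam example can be computed by hand. The priors and per-word likelihoods below are invented for illustration (in practice they would be estimated from a training corpus):

```python
# Hypothetical priors and word likelihoods — illustrative numbers only
p_spam, p_ham = 0.4, 0.6
like_spam = {'free': 0.30, 'money': 0.20, 'click': 0.25}
like_ham  = {'free': 0.02, 'money': 0.05, 'click': 0.04}

words = ['free', 'money', 'click']
score_spam, score_ham = p_spam, p_ham
for w in words:
    # "Naive" step: multiply likelihoods as if words were independent
    score_spam *= like_spam[w]
    score_ham *= like_ham[w]

# Normalize the two unnormalized scores to get the posterior
p_posterior = score_spam / (score_spam + score_ham)
print(f"P(spam | 'free','money','click') = {p_posterior:.4f}")
```

Even though each individual likelihood is small, the product for spam dwarfs the product for ham, so the posterior lands decisively on spam — the argmax rule above in miniature.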
| Variant | Feature Distribution Assumption | Best For |
|---|---|---|
| GaussianNB | Features follow Gaussian (normal) distribution | Continuous features, numerical data |
| MultinomialNB | Features are counts or frequencies | Text classification (word counts, TF-IDF) |
| BernoulliNB | Features are binary (0/1) | Text (word presence/absence) |
| ComplementNB | Multinomial variant using complement-class statistics | Imbalanced text classification |
Naive Bayes — Practical
```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# ══ PART 1: GaussianNB (continuous features) ══════════
iris = load_iris()
X, y = iris.data, iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

gnb = GaussianNB()
gnb.fit(X_tr, y_tr)
print("GaussianNB on Iris:")
print(f"  Accuracy: {accuracy_score(y_te, gnb.predict(X_te)):.4f}")
print(f"  CV Score: {cross_val_score(gnb, X, y, cv=5).mean():.4f}")
# Learned class priors (relative frequency of each class)
print(f"  Class priors: {gnb.class_prior_.round(3)}")
print(f"  Class means (theta):\n{gnb.theta_.round(2)}")

# ══ PART 2: MultinomialNB (text classification) ═══════
print("\n--- Text Classification with MultinomialNB ---")
texts = [
    "free money win prize",
    "win big cash now free",
    "click free offer today",
    "meeting tomorrow project update",
    "report deadline next week",
    "team lunch calendar invite",
    "schedule call project review",
    "free gift card win money",
    "urgent response required now",
]
labels = [1, 1, 1, 0, 0, 0, 0, 1, 1]  # 1=spam, 0=ham

pipe_mnb = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('model', MultinomialNB(alpha=1.0))  # alpha = Laplace smoothing
])
pipe_mnb.fit(texts, labels)

new_emails = ["free offer click win", "project deadline report"]
predictions = pipe_mnb.predict(new_emails)
probas = pipe_mnb.predict_proba(new_emails)
for email, pred, prob in zip(new_emails, predictions, probas):
    label = "SPAM" if pred == 1 else "HAM"
    print(f"  '{email}' → {label} (P(spam)={prob[1]:.3f})")

# ── Laplace Smoothing explanation ─────────────────────
# alpha=1.0 adds 1 to all counts — prevents P(word|class)=0
# for words not seen in training data (zero-probability problem)
print("\nalpha=0: no smoothing (risk of zero prob)")
print("alpha=1: Laplace smoothing (standard)")
print("alpha>1: more smoothing, more uniform distribution")
```
- Naive Bayes is extremely fast, scales to huge datasets, great for text
- GaussianNB for continuous features; MultinomialNB for word counts
- Laplace smoothing (alpha) prevents zero probabilities for unseen features
- Despite naive assumption, often competitive with complex models for text
Phase 9 — Advanced Models
Decision Tree — Classification (Theory)
A Decision Tree is a flowchart-like model that makes predictions by recursively splitting the feature space based on threshold conditions. Each internal node tests a feature, each branch represents an outcome, and each leaf node predicts a class.
Think of 20 questions: "Is the animal a mammal? → Yes → Does it live in water? → No → Does it have stripes? → Yes → Tiger!" A decision tree asks the most informative questions first, each question splitting the data into increasingly pure groups.
| Criterion | Formula | Used in | Intuition |
|---|---|---|---|
| Gini Impurity | 1 − Σ pᵢ² | CART, sklearn default | Probability of misclassifying a random sample. 0 = pure, 0.5 = maximally impure (binary) |
| Entropy (Info Gain) | −Σ pᵢ log₂(pᵢ) | ID3, C4.5 | Average bits needed to encode class labels. 0 = pure, 1 = maximally uncertain (binary) |
| Log Loss | Cross-entropy | sklearn 1.1+ | Better calibrated probability estimates |
Gini at node = 1 − (p_class0² + p_class1² + ...)
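The two impurity formulas in the table, as a minimal sketch over a node's class proportions:

```python
import numpy as np

def gini(p):
    # Gini impurity: 1 − Σ pᵢ²
    p = np.asarray(p)
    return 1 - np.sum(p**2)

def entropy(p):
    # Entropy: −Σ pᵢ log₂(pᵢ), skipping zero-probability classes
    p = np.asarray(p)
    p = p[p > 0]  # avoid log2(0)
    return -np.sum(p * np.log2(p))

print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # pure node → both 0
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # maximally impure (binary)
```

A pure node scores 0 on both criteria; a 50/50 binary node hits the maxima the table states (Gini 0.5, entropy 1.0). The tree chooses the split that reduces these values the most.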
Decision Tree — Practical
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ── 1. Full tree (overfits) ───────────────────────────
dt_full = DecisionTreeClassifier(random_state=42)  # No limits = grows fully
dt_full.fit(X_tr, y_tr)
print(f"Full Tree — depth:{dt_full.get_depth()}, leaves:{dt_full.get_n_leaves()}")
print(f"  Train Acc: {accuracy_score(y_tr, dt_full.predict(X_tr)):.4f}")
print(f"  Test Acc:  {accuracy_score(y_te, dt_full.predict(X_te)):.4f}")

# ── 2. Depth-limited tree (better generalization) ─────
dt_lim = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_lim.fit(X_tr, y_tr)
print("\nLimited Tree (max_depth=3):")
print(f"  Train Acc: {accuracy_score(y_tr, dt_lim.predict(X_tr)):.4f}")
print(f"  Test Acc:  {accuracy_score(y_te, dt_lim.predict(X_te)):.4f}")

# ── 3. Visualize tree ─────────────────────────────────
fig, ax = plt.subplots(figsize=(16, 6))
plot_tree(dt_lim, feature_names=iris.feature_names, class_names=iris.target_names,
          filled=True, rounded=True, ax=ax, fontsize=9)
plt.title('Decision Tree (max_depth=3)\nColor intensity = class purity')
plt.tight_layout(); plt.show()

# ── 4. Text rules (human-readable) ────────────────────
rules = export_text(dt_lim, feature_names=iris.feature_names)
print("\nTree Rules (human-readable):")
print(rules)

# ── 5. Feature importance ─────────────────────────────
imp = pd.DataFrame({
    'Feature': iris.feature_names,
    'Importance': dt_lim.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nFeature Importances (Gini-based):")
print(imp)
# Importance = total Gini impurity reduction from splits on this feature

# ── 6. Depth vs accuracy tradeoff ─────────────────────
depths = range(1, 15)
train_accs, test_accs = [], []
for d in depths:
    dt = DecisionTreeClassifier(max_depth=d, random_state=42).fit(X_tr, y_tr)
    train_accs.append(accuracy_score(y_tr, dt.predict(X_tr)))
    test_accs.append(accuracy_score(y_te, dt.predict(X_te)))

plt.figure(figsize=(8, 4))
plt.plot(depths, train_accs, 'b-o', label='Train')
plt.plot(depths, test_accs, 'r-o', label='Test')
plt.axvline(x=3, color='g', linestyle='--', label='Sweet spot')
plt.xlabel('Max Depth'); plt.ylabel('Accuracy')
plt.title('Depth vs Accuracy: Overfitting Curve')
plt.legend(); plt.tight_layout(); plt.show()
```
Pre-Pruning & Post-Pruning
Pruning controls decision tree complexity to prevent overfitting. Pre-pruning stops growth early using constraints. Post-pruning grows the full tree then removes unnecessary branches using a complexity penalty (cost-complexity pruning).
| Parameter | Effect | Typical Value |
|---|---|---|
| max_depth | Maximum tree depth | 3–10 |
| min_samples_split | Min samples required to split a node | 2–20 |
| min_samples_leaf | Min samples required at leaf | 1–10 |
| max_features | Max features considered per split | "sqrt" for classification |
| max_leaf_nodes | Max number of leaf nodes | None or 10–100 |
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# ── PRE-PRUNING ───────────────────────────────────────
dt_pre = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,   # need ≥10 samples to even try splitting
    min_samples_leaf=5,     # each leaf must have ≥5 samples
    max_leaf_nodes=20,      # cap total leaves
    random_state=42
)
dt_pre.fit(X_tr, y_tr)
print(f"Pre-pruned: depth={dt_pre.get_depth()}, test_acc={accuracy_score(y_te, dt_pre.predict(X_te)):.4f}")

# ── POST-PRUNING: Cost-Complexity Pruning (CCP) ────────
# ccp_alpha = complexity parameter.
# Higher alpha → more aggressive pruning → simpler tree
# Find optimal alpha using the effective alphas of the full tree
dt_full = DecisionTreeClassifier(random_state=42)
path = dt_full.cost_complexity_pruning_path(X_tr, y_tr)
ccp_alphas = path.ccp_alphas[:-1]  # Exclude last (trivial single-root tree)

train_scores, test_scores, n_leaves = [], [], []
for alpha in ccp_alphas:
    dt = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_tr, y_tr)
    train_scores.append(accuracy_score(y_tr, dt.predict(X_tr)))
    test_scores.append(accuracy_score(y_te, dt.predict(X_te)))
    n_leaves.append(dt.get_n_leaves())

best_idx = np.argmax(test_scores)
best_alpha = ccp_alphas[best_idx]
print(f"\nBest CCP alpha: {best_alpha:.5f} → Test acc: {test_scores[best_idx]:.4f}")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
ax1.plot(ccp_alphas, train_scores, 'b-', label='Train')
ax1.plot(ccp_alphas, test_scores, 'r-', label='Test')
ax1.axvline(x=best_alpha, color='g', linestyle='--', label=f'Best α={best_alpha:.5f}')
ax1.set_xlabel('ccp_alpha'); ax1.set_title('Post-Pruning: Accuracy vs Alpha'); ax1.legend()
ax2.plot(ccp_alphas, n_leaves, 'purple')
ax2.set_xlabel('ccp_alpha'); ax2.set_ylabel('Number of Leaves')
ax2.set_title('More alpha → Simpler tree')
plt.tight_layout(); plt.show()
Decision Tree — Regression
Decision Tree Regression partitions the feature space into rectangular regions and predicts the mean of training samples in each region. Instead of Gini/Entropy, it minimizes MSE (mean squared error) at each split.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 150)).reshape(-1, 1)
y = np.sin(X.ravel()) + np.random.normal(0, 0.2, 150)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

fig, axes = plt.subplots(1, 3, figsize=(16, 4))
X_plot = np.linspace(0, 10, 500).reshape(-1, 1)
for i, depth in enumerate([2, 5, 20]):
    dtr = DecisionTreeRegressor(max_depth=depth, random_state=42)
    dtr.fit(X_tr, y_tr)
    y_plot = dtr.predict(X_plot)
    rmse = np.sqrt(mean_squared_error(y_te, dtr.predict(X_te)))
    r2 = r2_score(y_te, dtr.predict(X_te))
    axes[i].scatter(X_tr, y_tr, alpha=0.4, s=15, label='Train')
    axes[i].scatter(X_te, y_te, alpha=0.4, s=15, color='orange', label='Test')
    axes[i].plot(X_plot, y_plot, 'r-', lw=2, label='Prediction')
    axes[i].set_title(f"depth={depth}\nRMSE={rmse:.3f} R²={r2:.3f}")
    axes[i].legend(fontsize=8)
# depth=2: underfits (blocky step function)
# depth=5: good fit
# depth=20: overfits (jagged, memorizes noise)
plt.suptitle('Decision Tree Regression: Step-Function Predictions')
plt.tight_layout(); plt.show()

# ── Key insight: DT predictions are step functions ─────
# Each region gets the MEAN of training samples in it
# This is why DTs can't extrapolate beyond training range!
- Decision Trees cannot extrapolate — for any input beyond the training range, the prediction is just the mean of the leaf that input falls into, i.e. a flat line. Use linear models or neural nets when extrapolation matters.
- Uncontrolled depth leads to extreme overfitting — always set max_depth or apply cost-complexity pruning (ccp_alpha)
K-Nearest Neighbors (KNN) — Classification
KNN is a lazy, non-parametric algorithm — it stores all training data and, to predict a new point, finds the K closest training points and predicts the majority class (classification) or mean value (regression). No explicit training step.
"Tell me who your neighbors are, and I'll tell you who you are." To classify a new patient, find the 5 most similar patients in the medical database and predict the majority diagnosis. The model IS the data.
Distance (Euclidean, the default): d(a,b) = √Σ(aᵢ − bᵢ)²
Distance (Manhattan): d(a,b) = Σ|aᵢ − bᵢ|
Prediction: ŷ = majority_class(k nearest neighbors)
| Property | Details |
|---|---|
| Training cost | O(1) — just stores data |
| Prediction cost | O(n·d) — computes distance to ALL training points |
| Memory | O(n) — stores entire training set |
| Curse of dimensionality | Performance degrades sharply with many features — all points become equidistant |
| Feature scaling | REQUIRED — distance-based, so scale matters critically |
| k too small | Overfitting — noisy, jagged boundary |
| k too large | Underfitting — blurry boundary, ignores local structure |
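The cost claims in the table can be made concrete with a from-scratch sketch. This is a minimal illustration under invented toy data (the `knn_predict` helper is made up for this example, not sklearn's implementation): note there is no training step, and every prediction scans the entire stored dataset.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # "Training" is just storing the data; all work happens at prediction time.
    # O(n·d) per query: distance from x_new to EVERY stored point.
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                        # indices of k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote

# Toy data: two well-separated 2-D classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

print(knn_predict(X, y, np.array([0.1, 0.2])))  # query near class 0's center
print(knn_predict(X, y, np.array([2.9, 3.1])))  # query near class 1's center
```

Because the model "is" the data, memory is O(n) and the only way to speed up queries is a spatial index (sklearn's KD-tree / ball-tree options).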
KNN — Practical
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline

iris = load_iris()
X, y = iris.data, iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ── Find optimal k ────────────────────────────────────
k_range = range(1, 31)
cv_scores = []
for k in k_range:
    pipe = Pipeline([('sc', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=k))])
    scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

best_k = k_range[np.argmax(cv_scores)]
print(f"Best k = {best_k} (CV accuracy = {max(cv_scores):.4f})")

plt.figure(figsize=(10, 4))
plt.plot(k_range, cv_scores, 'b-o', markersize=5)
plt.axvline(x=best_k, color='r', linestyle='--', label=f'Best k={best_k}')
plt.xlabel('k'); plt.ylabel('CV Accuracy')
plt.title('Elbow Method: Finding Optimal k\n(k=1: overfitting, high k: underfitting)')
plt.legend(); plt.tight_layout(); plt.show()

# ── Final model with best k ───────────────────────────
best_pipe = Pipeline([
    ('sc', StandardScaler()),
    ('knn', KNeighborsClassifier(
        n_neighbors=best_k,
        weights='distance',   # closer neighbors vote more
        metric='euclidean'    # try 'manhattan', 'minkowski'
    ))
])
best_pipe.fit(X_tr, y_tr)
print(classification_report(y_te, best_pipe.predict(X_te), target_names=iris.target_names))

# ── Weights comparison ────────────────────────────────
for w in ['uniform', 'distance']:
    p = Pipeline([('sc', StandardScaler()),
                  ('knn', KNeighborsClassifier(n_neighbors=best_k, weights=w))])
    p.fit(X_tr, y_tr)
    acc = p.score(X_te, y_te)
    print(f"weights='{w}': acc={acc:.4f}")
KNN — Regression
KNN Regression predicts the average (or weighted average) of the k nearest neighbors' target values rather than majority vote. Simple and powerful for locally smooth functions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 200)).reshape(-1, 1)
y = np.sin(X.ravel()) + 0.5*np.cos(2*X.ravel()) + np.random.normal(0, 0.15, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
X_plot = np.linspace(0, 10, 500).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(16, 4))
for i, k in enumerate([1, 7, 50]):
    pipe = Pipeline([('sc', StandardScaler()),
                     ('knn', KNeighborsRegressor(n_neighbors=k, weights='distance'))])
    pipe.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, pipe.predict(X_te)))
    r2 = r2_score(y_te, pipe.predict(X_te))
    axes[i].scatter(X, y, alpha=0.3, s=10)
    axes[i].plot(X_plot, pipe.predict(X_plot), 'r-', lw=2)
    axes[i].set_title(f"k={k}\nRMSE={rmse:.3f} R²={r2:.3f}")
plt.suptitle('KNN Regression: k=1 overfits, large k underfits')
plt.tight_layout(); plt.show()
Support Vector Machine (SVM) — Theory
SVM finds the optimal hyperplane that maximizes the margin between classes. The margin is the distance between the hyperplane and the nearest data points from each class (called support vectors). Maximizing margin = maximizing generalization.
You have red and blue dots on a table. SVM finds the widest possible "road" between them. Only the dots closest to the road (support vectors) determine the boundary — all other points are irrelevant. This makes SVM memory-efficient at prediction time and robust: moving any non-support point doesn't change the boundary at all.
Soft Margin: Minimize ½||w||² + C·Σξᵢ (C = regularization)
Kernel trick: K(x,z) = φ(x)·φ(z) — maps to higher-dimensional space
| Kernel | Formula | Use When |
|---|---|---|
| Linear | K(x,z) = x·z | Linearly separable, high-dimensional (text) |
| RBF (Gaussian) | K(x,z) = exp(−γ||x−z||²) | Default; nonlinear data; medium-sized datasets |
| Polynomial | K(x,z) = (x·z + r)^d | When polynomial features expected |
| Sigmoid | K(x,z) = tanh(αx·z + c) | Neural network-like behavior |
| Parameter | Small Value | Large Value |
|---|---|---|
| C (regularization) | Wider margin, more misclassifications allowed (underfitting) | Narrow margin, fewer misclassifications (overfitting) |
| γ (RBF kernel) | Large decision region, smooth boundary (underfitting) | Small region per point, jagged boundary (overfitting) |
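The kernel table above can be verified numerically. The sketch below (a minimal illustration; the `poly_kernel` and `phi` helpers are invented for this demo) shows the kernel trick for the degree-2 polynomial kernel in 2-D: evaluating K(x,z) = (x·z + 1)² directly gives exactly the same number as explicitly mapping both points into the 6-dimensional feature space and taking a dot product.

```python
import numpy as np

def poly_kernel(x, z):
    # Polynomial kernel with r=1, d=2: K(x,z) = (x·z + 1)²
    return (x @ z + 1) ** 2

def phi(x):
    # Explicit feature map for that kernel in 2-D:
    # φ(x) = [1, √2·x₁, √2·x₂, x₁², √2·x₁x₂, x₂²]
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     x1**2, np.sqrt(2)*x1*x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(poly_kernel(x, z))   # (4 + 1)² = 25.0 — computed in 2-D
print(phi(x) @ phi(z))     # 25.0 — same value via the explicit 6-D map
```

The trick is that SVM only ever needs dot products, so it gets the expressive power of the high-dimensional space while computing everything in the original low-dimensional one.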
SVM — Practical
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.datasets import load_breast_cancer, make_moons

# ── Part 1: Kernel comparison on moons dataset ────────
X_m, y_m = make_moons(n_samples=300, noise=0.2, random_state=42)
fig, axes = plt.subplots(1, 3, figsize=(16, 4))
for i, (kernel, C) in enumerate([('linear', 1), ('rbf', 1), ('poly', 1)]):
    pipe = Pipeline([('sc', StandardScaler()), ('svc', SVC(kernel=kernel, C=C))])
    pipe.fit(X_m, y_m)
    h = 0.02
    x_min, x_max = X_m[:, 0].min()-0.5, X_m[:, 0].max()+0.5
    y_min, y_max = X_m[:, 1].min()-0.5, X_m[:, 1].max()+0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = pipe.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    axes[i].contourf(xx, yy, Z, alpha=0.3)
    axes[i].scatter(X_m[:, 0], X_m[:, 1], c=y_m, s=20, edgecolors='k')
    axes[i].set_title(f'Kernel: {kernel}\nAcc={pipe.score(X_m, y_m):.3f}')
plt.tight_layout(); plt.show()

# ── Part 2: Full pipeline on real data + GridSearch ───
data = load_breast_cancer()
X, y = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

pipe_svm = Pipeline([('sc', StandardScaler()), ('svc', SVC(kernel='rbf', probability=True))])

# GridSearch for C and gamma
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 'auto', 0.01]}
gs = GridSearchCV(pipe_svm, param_grid, cv=5, scoring='accuracy')
gs.fit(X_tr, y_tr)
print(f"Best params: {gs.best_params_}")
print(f"Best CV accuracy: {gs.best_score_:.4f}")
print(f"Test accuracy: {gs.score(X_te, y_te):.4f}")
print(classification_report(y_te, gs.predict(X_te), target_names=data.target_names))
SVM — Regression (SVR)
Support Vector Regression (SVR) fits a tube of width 2ε around the regression line. Points inside the tube incur no penalty. Only points outside the ε-tube contribute to the loss. This makes SVR robust to outliers.
Subject to: |yᵢ − (w·xᵢ + b)| ≤ ε + ξᵢ
ε = tube width; ξ = slack variables for points outside tube
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 150)).reshape(-1, 1)
y = np.sin(X.ravel()) + np.random.normal(0, 0.2, 150)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
X_plot = np.linspace(0, 10, 500).reshape(-1, 1)

pipe_svr = Pipeline([
    ('sc', StandardScaler()),
    ('svr', SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1))
])
pipe_svr.fit(X_tr, y_tr)
y_pred = pipe_svr.predict(X_te)
rmse = np.sqrt(mean_squared_error(y_te, y_pred))
r2 = r2_score(y_te, y_pred)
print(f"SVR — RMSE: {rmse:.4f}, R²: {r2:.4f}")

plt.figure(figsize=(10, 4))
plt.scatter(X_tr, y_tr, c='steelblue', alpha=0.5, s=20, label='Train')
plt.scatter(X_te, y_te, c='orange', alpha=0.5, s=20, label='Test')
y_plot = pipe_svr.predict(X_plot)
plt.plot(X_plot, y_plot, 'r-', lw=2, label='SVR (RBF)')
plt.fill_between(X_plot.ravel(), y_plot-0.1, y_plot+0.1,
                 alpha=0.2, color='red', label='ε-tube (ε=0.1)')
plt.title(f'SVR RBF — RMSE={rmse:.3f}, R²={r2:.3f}')
plt.legend(); plt.tight_layout(); plt.show()
- SVM finds the maximum-margin hyperplane — defined only by support vectors
- Kernel trick allows nonlinear boundaries without explicitly transforming data
- C controls margin width / regularization; γ controls RBF kernel "reach"
- Always scale features before SVM — it's distance-based
⚡ Project 2 — After Classification
Real-World Project: Customer Churn Predictor
Goal: Predict which telecom customers will leave. Compare Logistic Regression, SVM, Decision Tree, KNN — all in pipelines. Use confusion matrix, F1, ROC-AUC to choose the best model.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score, roc_auc_score, classification_report

# Simulate churn data (imbalanced: 80/20)
X, y = make_classification(n_samples=2000, n_features=12, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    'Logistic Reg': Pipeline([('sc', StandardScaler()),
                              ('m', LogisticRegression(class_weight='balanced', max_iter=1000))]),
    'Decision Tree': Pipeline([('m', DecisionTreeClassifier(max_depth=5, class_weight='balanced',
                                                            random_state=42))]),
    'KNN': Pipeline([('sc', StandardScaler()), ('m', KNeighborsClassifier(n_neighbors=7))]),
    'SVM': Pipeline([('sc', StandardScaler()), ('m', SVC(class_weight='balanced', probability=True))]),
    'Naive Bayes': Pipeline([('sc', StandardScaler()), ('m', GaussianNB())]),
}

results = []
print(f"{'Model':18} {'F1':>8} {'ROC-AUC':>10} {'CV-F1':>10}")
print("-"*50)
for name, pipe in models.items():
    pipe.fit(X_tr, y_tr)
    yp = pipe.predict(X_te)
    yprb = pipe.predict_proba(X_te)[:, 1]
    f1 = f1_score(y_te, yp)
    auc = roc_auc_score(y_te, yprb)
    cvf1 = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring='f1').mean()
    print(f"{name:18} {f1:8.4f} {auc:10.4f} {cvf1:10.4f}")
    results.append({'model': name, 'f1': f1, 'auc': auc})

# Best model report
best = max(results, key=lambda x: x['auc'])
print(f"\n🏆 Best model by ROC-AUC: {best['model']} (AUC={best['auc']:.4f})")
Phase 10 — Model Tuning
Model Parameters vs Hyperparameters
Parameters are values the model learns from training data (e.g., linear regression coefficients β). Hyperparameters are configuration settings set BEFORE training that control the learning process (e.g., max_depth, C, k). You tune hyperparameters; the model learns parameters.
| Algorithm | Parameters (learned) | Hyperparameters (set by you) |
|---|---|---|
| Linear Regression | β₀, β₁, ..., βₙ (coefficients) | fit_intercept, positive |
| Ridge/Lasso | β (coefficients) | alpha (λ), max_iter |
| Logistic Regression | w (weights), b (bias) | C, solver, max_iter |
| Decision Tree | Split thresholds, leaf values | max_depth, min_samples_split, criterion |
| SVM | w, b, support vectors | C, γ, kernel |
| KNN | None (lazy learner) | k, weights, metric |
| Naive Bayes | Class priors, feature likelihoods | alpha (smoothing), var_smoothing |
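The distinction in the table is visible directly in scikit-learn's API. A minimal sketch with Ridge (synthetic data from make_regression; any estimator would do): the hyperparameter exists before fit, the parameters only exist after it.

```python
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=0)

# Hyperparameter: set by YOU, before training
model = Ridge(alpha=1.0)
print(model.get_params()['alpha'])   # 1.0 — a configuration choice, not learned

# Parameters: learned FROM the data during fit()
model.fit(X, y)
print(model.coef_)        # β₁..βₙ coefficients (learned)
print(model.intercept_)   # β₀ (learned)
```

A useful convention to remember: in sklearn, anything learned during `fit` ends with a trailing underscore (`coef_`, `intercept_`, `feature_importances_`); anything passed to the constructor is a hyperparameter.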
GridSearchCV & RandomizedSearchCV
GridSearchCV exhaustively tests every combination of hyperparameter values. RandomizedSearchCV randomly samples a fixed number of combinations. Both use cross-validation to evaluate each combination on training data without touching the test set.
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
from sklearn.datasets import load_breast_cancer
from scipy.stats import randint

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, test_size=0.2,
                                          stratify=data.target, random_state=42)

# ── GridSearchCV: Exhaustive ───────────────────────────
# 3 × 3 × 2 = 18 combinations × 5 folds = 90 model fits
pipe = Pipeline([('sc', StandardScaler()), ('svc', SVC(probability=True))])
param_grid = {
    'svc__C': [0.1, 1.0, 10.0],
    'svc__gamma': ['scale', 'auto', 0.01],
    'svc__kernel': ['rbf', 'linear']
}
gs = GridSearchCV(pipe, param_grid, cv=5, scoring='f1',
                  n_jobs=-1, verbose=1)  # n_jobs=-1: use all CPU cores
gs.fit(X_tr, y_tr)
print(f"GridSearch Best params: {gs.best_params_}")
print(f"GridSearch Best CV F1: {gs.best_score_:.4f}")
print(f"GridSearch Test F1: {f1_score(y_te, gs.predict(X_te)):.4f}")

# ── RandomizedSearchCV: Faster for large search spaces ─
# Instead of testing all combos, randomly sample n_iter combinations
param_dist = {
    'max_depth': randint(2, 20),          # Random int in [2, 20)
    'min_samples_split': randint(2, 30),
    'min_samples_leaf': randint(1, 15),
    'criterion': ['gini', 'entropy'],
}
rs = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,        # Test 50 random combinations
    cv=5, scoring='f1', n_jobs=-1, random_state=42
)
rs.fit(X_tr, y_tr)
print(f"\nRandomizedSearch Best params: {rs.best_params_}")
print(f"RandomizedSearch Best CV F1: {rs.best_score_:.4f}")

# ── Results DataFrame ─────────────────────────────────
cv_results = pd.DataFrame(rs.cv_results_)
print("\nTop 5 configurations:")
print(cv_results.nlargest(5, 'mean_test_score')[['mean_test_score', 'std_test_score', 'params']])
Cross Validation — Theory
Cross Validation (CV) is a resampling technique for evaluating model performance and tuning hyperparameters. It splits data into k folds, trains on k-1 folds and validates on the remaining fold, rotating k times to use every sample as validation exactly once.
| Technique | How it works | Best for |
|---|---|---|
| K-Fold CV | Split into k equal folds, rotate validation | Standard choice; large datasets |
| Stratified K-Fold | K-Fold preserving class distribution | Classification; imbalanced data |
| Leave-One-Out (LOO) | k = n (each sample is a fold) | Very small datasets; expensive |
| Time Series Split | Respects temporal order — no future leakage | Time series data |
| Repeated K-Fold | Run K-Fold r times with different splits | More stable estimate on small data |
k=5 or k=10 is standard. Larger k = less bias, more variance in estimate.
A single train/test split is highly dependent on the random split. You might get lucky (test set is easy) or unlucky (test set is hard). CV averages over k different splits, giving a much more reliable performance estimate. The std of CV scores also tells you model stability.
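The table's TimeSeriesSplit row deserves a concrete look, since ordinary K-Fold would leak future information into training. A minimal sketch on ten samples in time order (the toy array is invented for this demo): each fold trains only on indices that come before its test window.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 samples, already in temporal order
tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # The training window always ends where the test window begins — no future leakage
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
# Fold 0: train=[0, 1, 2, 3] test=[4, 5]
# Fold 1: train=[0, 1, 2, 3, 4, 5] test=[6, 7]
# Fold 2: train=[0, 1, 2, 3, 4, 5, 6, 7] test=[8, 9]
```

Note the expanding training window: later folds see more history, which mirrors how a deployed forecasting model is periodically refit on all data to date.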
Cross Validation — Practical
import numpy as np
from sklearn.model_selection import (KFold, StratifiedKFold, cross_val_score,
                                     cross_validate, RepeatedStratifiedKFold,
                                     GridSearchCV)
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

iris = load_iris()
X, y = iris.data, iris.target
pipe = Pipeline([('sc', StandardScaler()),
                 ('dt', DecisionTreeClassifier(max_depth=4, random_state=42))])

# ── 1. Basic K-Fold ───────────────────────────────────
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kf = cross_val_score(pipe, X, y, cv=kf, scoring='accuracy')
print(f"K-Fold(5)     : {scores_kf.mean():.4f} ± {scores_kf.std():.4f}")

# ── 2. Stratified K-Fold (for classification) ─────────
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_skf = cross_val_score(pipe, X, y, cv=skf, scoring='accuracy')
print(f"Stratified(5) : {scores_skf.mean():.4f} ± {scores_skf.std():.4f}")

# ── 3. Get multiple metrics at once ───────────────────
cv_results = cross_validate(pipe, X, y, cv=5, scoring=['accuracy', 'f1_macro'],
                            return_train_score=True)
print(f"\ncross_validate:")
print(f"  Train acc: {cv_results['train_accuracy'].mean():.4f}")
print(f"  Test acc:  {cv_results['test_accuracy'].mean():.4f}")
print(f"  Test F1:   {cv_results['test_f1_macro'].mean():.4f}")
# If train >> test → overfitting signal!

# ── 4. Repeated Stratified K-Fold ─────────────────────
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores_r = cross_val_score(pipe, X, y, cv=rskf)
print(f"\nRepeated K-Fold (5×3): {scores_r.mean():.4f} ± {scores_r.std():.4f}")

# ── 5. Nested CV: unbiased estimate with tuning ───────
# Outer CV evaluates the model; inner CV tunes hyperparameters
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
dt = DecisionTreeClassifier(random_state=42)
gs = GridSearchCV(dt, {'max_depth': [2, 3, 5, 7]}, cv=inner_cv, scoring='accuracy')
nested_scores = cross_val_score(gs, X, y, cv=outer_cv, scoring='accuracy')
print(f"\nNested CV (unbiased estimate): {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
- CV gives reliable performance estimates by averaging over multiple splits
- Always use StratifiedKFold for classification to preserve class ratios
- cross_validate returns both train and test scores — compare them to detect overfitting
- Nested CV is the gold standard: outer CV evaluates, inner CV tunes hyperparameters
Phase 11 — Unsupervised Learning
Unsupervised Learning Overview
Unsupervised learning discovers hidden patterns in data without labeled outputs. The model must find structure — groups, outliers, or latent representations — entirely from the input data itself.
| Category | Goal | Algorithms | Example |
|---|---|---|---|
| Clustering | Group similar data points | K-Means, DBSCAN, Hierarchical | Customer segmentation |
| Dimensionality Reduction | Compress features while retaining structure | PCA, t-SNE, UMAP | Visualize high-dim data |
| Anomaly Detection | Identify outliers/unusual samples | Isolation Forest, LOF | Fraud detection |
| Association Rules | Find co-occurring patterns | Apriori, FP-Growth | Market basket analysis |
Evaluating unsupervised models is hard — there's no correct answer. Internal metrics (Silhouette Score, Davies-Bouldin) measure cluster quality without labels. External metrics (Adjusted Rand Index) work if you do have ground truth labels for validation.
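Both metric families can be shown side by side. A minimal sketch on synthetic blobs (make_blobs is used here precisely because it hands back ground-truth labels, which real unsupervised problems rarely have): silhouette needs only the data and the cluster assignments, while Adjusted Rand Index compares assignments against the true labels.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
X_sc = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_sc)

# Internal metric: no ground truth needed — usable on any clustering
print(f"Silhouette:          {silhouette_score(X_sc, labels):.3f}")
# External metric: needs true labels — only for validation/benchmarking
print(f"Adjusted Rand Index: {adjusted_rand_score(y_true, labels):.3f}")
```

ARI is corrected for chance (random labelings score ≈ 0), which is why it's preferred over raw accuracy when cluster IDs don't line up with label IDs.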
K-Means Clustering — Theory
K-Means partitions n samples into k clusters by iteratively assigning each point to the nearest centroid, then recomputing centroids as cluster means. It minimizes the total within-cluster sum of squared distances (inertia).
1. Randomly initialize k centroids
2. Assign each point to nearest centroid (by Euclidean distance)
3. Recompute each centroid as mean of its assigned points
4. Repeat steps 2–3 until convergence (centroids don't move or max_iter reached)
5. Result: k cluster assignments + k centroid positions
μᵢ = (1/|Cᵢ|) Σₓ∈Cᵢ x (centroid = cluster mean)
Objective: minimize inertia J = Σᵢ Σₓ∈Cᵢ ||x − μᵢ||²
| Limitation | Problem | Workaround |
|---|---|---|
| Must specify k | k is unknown in practice | Elbow method, Silhouette analysis |
| Sensitive to initialization | Different runs → different results | K-Means++ init (default in sklearn) |
| Assumes spherical clusters | Fails on elongated/ring shapes | DBSCAN, GMM for arbitrary shapes |
| Sensitive to outliers | Outliers pull centroids | K-Medoids (uses median), remove outliers first |
| Scale-dependent | Large-scale features dominate | Always scale features before K-Means |
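Steps 1–4 of the algorithm fit in a few lines of NumPy. This is a from-scratch sketch for intuition (plain random initialization on invented toy data, not sklearn's k-means++ implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=42):
    # Minimal K-Means: random init (step 1), then alternate assign/update
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]       # 1. init from data points
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)                             # 2. assign to nearest centroid
        new = np.array([X[labels == j].mean(axis=0)
                        for j in range(k)])                   # 3. recompute centroids
        if np.allclose(new, centroids):                       # 4. stop when centroids freeze
            break
        centroids = new
    return labels, centroids

# Two obvious clusters around (0,0) and (5,5)
X = np.vstack([np.random.default_rng(0).normal(0, 0.3, (50, 2)),
               np.random.default_rng(1).normal(5, 0.3, (50, 2))])
labels, cents = kmeans(X, k=2)
print(np.sort(cents[:, 0]).round(1))  # centroid x-coords should land near 0 and 5
```

A real implementation also reruns with several initializations (sklearn's `n_init`) and guards against empty clusters; this sketch omits both for clarity.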
K-Means — Practical
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# ── Generate clusterable data ─────────────────────────
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# ── Elbow Method: Find optimal k ──────────────────────
inertias, sil_scores = [], []
k_range = range(2, 11)
for k in k_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_scaled, labels))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(k_range, inertias, 'b-o')
ax1.set_xlabel('k'); ax1.set_ylabel('Inertia (WCSS)')
ax1.set_title('Elbow Method\n(Look for the "elbow" bend)')
ax2.plot(k_range, sil_scores, 'r-o')
ax2.set_xlabel('k'); ax2.set_ylabel('Silhouette Score')
ax2.set_title('Silhouette Score\n(Higher = better separated clusters)')
plt.tight_layout(); plt.show()

best_k = k_range[np.argmax(sil_scores)]
print(f"Optimal k by Silhouette: {best_k}")

# ── Final K-Means ─────────────────────────────────────
km_final = KMeans(n_clusters=best_k, init='k-means++', n_init=10, random_state=42)
labels = km_final.fit_predict(X_scaled)

plt.figure(figsize=(8, 5))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='tab10', s=20, alpha=0.7)
plt.scatter(km_final.cluster_centers_[:, 0], km_final.cluster_centers_[:, 1],
            c='red', s=300, marker='*', label='Centroids')
plt.title(f'K-Means with k={best_k}\nSilhouette={silhouette_score(X_scaled, labels):.3f}')
plt.legend(); plt.tight_layout(); plt.show()

# ── Cluster analysis: describe each cluster ───────────
df = pd.DataFrame(X, columns=['feature1', 'feature2'])
df['cluster'] = labels
print("\nCluster statistics:")
print(df.groupby('cluster').agg(['mean', 'std', 'count']))
Hierarchical Clustering — Theory
Hierarchical clustering builds a dendrogram (tree) of nested clusters. Unlike K-Means, you don't specify k upfront — you can cut the tree at any level to get different numbers of clusters. Two approaches: Agglomerative (bottom-up: start with n clusters, merge) and Divisive (top-down: start with 1 cluster, split).
| Linkage | Distance between clusters | Creates |
|---|---|---|
| Single (min) | Minimum pairwise distance | Elongated, chaining clusters |
| Complete (max) | Maximum pairwise distance | Compact, equal-sized clusters |
| Average | Average of all pairwise distances | Balance between single/complete |
| Ward | Minimizes total within-cluster variance (default) | Compact, roughly equal clusters; usually best |
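The "cut the tree at any level" idea can be shown with SciPy's `fcluster`: one linkage computation yields every possible k. A minimal sketch on synthetic blobs (the data and sizes are invented for this demo):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=42)
Z = linkage(X, method='ward')  # build the full merge tree ONCE

# Cut the same tree at different levels — no refitting needed
for k in [2, 3, 4]:
    labels = fcluster(Z, t=k, criterion='maxclust')  # labels start at 1
    print(f"k={k}: cluster sizes = {np.bincount(labels)[1:].tolist()}")
```

This is the practical advantage over K-Means: exploring several values of k costs one tree construction, not one full refit per k.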
Agglomerative Clustering — Practical
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage

X, _ = make_blobs(n_samples=200, centers=4, random_state=42)
X_sc = StandardScaler().fit_transform(X)

# ── Dendrogram ─────────────────────────────────────────
Z = linkage(X_sc, method='ward')  # Full linkage tree
plt.figure(figsize=(12, 4))
dendrogram(Z, truncate_mode='level', p=5, leaf_rotation=90, leaf_font_size=8)
plt.title('Dendrogram (Ward Linkage)\nCut at any height to get k clusters')
plt.xlabel('Sample Index'); plt.ylabel('Distance')
plt.tight_layout(); plt.show()

# ── Compare linkage methods ───────────────────────────
fig, axes = plt.subplots(1, 4, figsize=(18, 4))
for i, link in enumerate(['single', 'complete', 'average', 'ward']):
    agg = AgglomerativeClustering(n_clusters=4, linkage=link)
    labels = agg.fit_predict(X_sc)
    sil = silhouette_score(X_sc, labels)
    axes[i].scatter(X_sc[:, 0], X_sc[:, 1], c=labels, cmap='tab10', s=20)
    axes[i].set_title(f'Linkage: {link}\nSilhouette={sil:.3f}')
plt.tight_layout(); plt.show()
# Ward usually gives the best silhouette score
DBSCAN — Theory
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) discovers clusters as dense regions separated by sparse regions. It doesn't require specifying k and can find arbitrarily shaped clusters while labeling outliers as noise.
| Term | Definition |
|---|---|
| ε (eps) | Radius of neighborhood around each point |
| min_samples | Minimum points required to form a dense core |
| Core point | Has ≥ min_samples neighbors within radius ε |
| Border point | Within ε of a core point but has fewer than min_samples neighbors |
| Noise point | Neither core nor border — labeled as −1 (outlier) |
DBSCAN — Practical
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons, make_blobs

# ── DBSCAN vs K-Means on moon-shaped data ─────────────
X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
X_sc = StandardScaler().fit_transform(X_moons)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
km = KMeans(n_clusters=2, random_state=42).fit_predict(X_sc)
ax1.scatter(X_sc[:, 0], X_sc[:, 1], c=km, cmap='coolwarm', s=20)
ax1.set_title('K-Means (k=2)\n❌ FAILS on moons')
db = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_sc)
ax2.scatter(X_sc[:, 0], X_sc[:, 1], c=db, cmap='coolwarm', s=20)
ax2.set_title('DBSCAN (eps=0.3)\n✅ Correctly identifies moons')
plt.tight_layout(); plt.show()

# ── How to choose eps: K-distance plot ────────────────
X2, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X2_sc = StandardScaler().fit_transform(X2)
nbrs = NearestNeighbors(n_neighbors=5).fit(X2_sc)
distances, _ = nbrs.kneighbors(X2_sc)
k_dist = np.sort(distances[:, 4])[::-1]  # 5th nearest neighbor distance, sorted

plt.figure(figsize=(8, 4))
plt.plot(k_dist)
plt.axhline(y=0.5, color='r', linestyle='--', label='eps ≈ 0.5 (elbow)')
plt.title('K-Distance Plot (k=5)\nChoose eps at the "elbow" of the curve')
plt.xlabel('Points (sorted)'); plt.ylabel('5th Nearest Neighbor Distance')
plt.legend(); plt.tight_layout(); plt.show()

# ── DBSCAN for anomaly detection ──────────────────────
db_final = DBSCAN(eps=0.5, min_samples=5).fit(X2_sc)
labels = db_final.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Noise points (anomalies): {n_noise}")
print(f"Noise indices: {np.where(labels == -1)[0]}")
Silhouette Score
Silhouette Score measures how similar each point is to its own cluster vs other clusters. It's an internal evaluation metric — no ground truth labels needed. Range: [−1, 1]; higher = better-defined clusters.
a(i) = mean distance from point i to the other points in its own cluster (intra-cluster cohesion)
b(i) = mean distance from point i to the points in the nearest other cluster (inter-cluster separation)
s(i) = (b(i) − a(i)) / max(a(i), b(i))
s = 1: perfect cluster assignment | s = 0: overlapping clusters | s = −1: likely wrong cluster
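To make the formula concrete, here is a hand computation of s(i) on a tiny made-up dataset, checked against sklearn's silhouette_samples (the points and labels are hypothetical, chosen only so the two clusters are obvious):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, pairwise_distances

# Two well-separated toy clusters (hypothetical data)
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

D = pairwise_distances(X)
s_manual = []
for i in range(len(X)):
    same = labels == labels[i]
    same[i] = False                        # exclude the point itself from a(i)
    a = D[i][same].mean()                  # cohesion: mean distance within own cluster
    b = min(D[i][labels == c].mean()       # separation: mean distance to nearest other cluster
            for c in set(labels) if c != labels[i])
    s_manual.append((b - a) / max(a, b))

print(np.allclose(s_manual, silhouette_samples(X, labels)))  # True
```

Every point sits deep inside its own tight cluster, far from the other one, so all six scores come out close to 1.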
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X_sc = StandardScaler().fit_transform(X)

# ── Silhouette plot for different k values ────────────
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()
for idx, k in enumerate([2, 3, 4, 5, 6, 7]):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_sc)
    avg_sil = silhouette_score(X_sc, labels)
    sample_sils = silhouette_samples(X_sc, labels)
    ax = axes[idx]
    y_lower = 10
    for c in range(k):
        sil_c = sorted(sample_sils[labels == c])
        size_c = len(sil_c)
        y_upper = y_lower + size_c
        color = cm.nipy_spectral(float(c) / k)
        ax.fill_betweenx(np.arange(y_lower, y_upper), 0, sil_c, color=color)
        y_lower = y_upper + 10
    ax.axvline(x=avg_sil, color='red', linestyle='--')
    ax.set_title(f'k={k}, avg_sil={avg_sil:.3f}')
    ax.set_xlabel('Silhouette coefficient')
plt.suptitle('Silhouette Analysis: Wide, uniform blades = good clustering')
plt.tight_layout(); plt.show()
# k=4 should show the best uniform, wide silhouette blades
| Algorithm | Needs k? | Cluster Shape | Outliers | Speed |
|---|---|---|---|---|
| K-Means | Yes | Spherical only | No | Fast |
| Hierarchical | No (cut later) | Any | No | O(n² log n) |
| DBSCAN | No | Any arbitrary | Yes (label -1) | O(n log n) |
⚡ Project 3 — After Clustering
Mini Project: Customer Market Segmentation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

np.random.seed(42)
n = 500
df = pd.DataFrame({
    'age': np.random.normal(40, 15, n).clip(18, 80),
    'annual_income': np.random.lognormal(10.5, 0.5, n),
    'spending_score': np.random.normal(50, 25, n).clip(1, 100),
    'purchase_freq': np.random.poisson(5, n),
    'loyalty_years': np.random.exponential(3, n).clip(0, 15),
})

# 1. Scale
sc = StandardScaler()
X_sc = sc.fit_transform(df)

# 2. PCA for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_sc)

# 3. Find optimal k via silhouette
sil_scores = [silhouette_score(X_sc, KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_sc))
              for k in range(2, 8)]
best_k = np.argmax(sil_scores) + 2

# 4. Final segmentation
km = KMeans(n_clusters=best_k, n_init=10, random_state=42)
df['segment'] = km.fit_predict(X_sc)

# 5. Visualize
plt.figure(figsize=(10, 5))
for seg in df['segment'].unique():
    mask = df['segment'] == seg
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=f'Segment {seg}', s=20, alpha=0.7)
plt.title(f'Customer Segments (PCA 2D) — {best_k} segments')
plt.legend(); plt.tight_layout(); plt.show()

# 6. Profile segments (double brackets: select a list of columns)
print("\nSegment Profiles:")
print(df.groupby('segment')[['age', 'annual_income', 'spending_score', 'loyalty_years']].mean().round(1))
Phase 12 — Association Rule Learning
Association Rule Learning
Association Rule Learning discovers co-occurrence patterns in transactional data: "If a customer buys X, they also tend to buy Y." Used for market basket analysis, recommendation systems, and web clickstream analysis.
| Metric | Formula | Meaning | Range |
|---|---|---|---|
| Support | P(X ∪ Y) | How often {X,Y} appears together in all transactions | [0,1] |
| Confidence | P(Y|X) = P(X∪Y)/P(X) | If X is bought, probability that Y is also bought | [0,1] |
| Lift | Confidence / P(Y) | How much more likely Y is given X, vs random. >1 = positive association | [0,∞) |
| Conviction | (1−P(Y))/(1−Confidence) | How much X being present increases certainty of Y | [0,∞) |
Support({beer,diapers}) = 0.30 (30% of transactions contain both)
Confidence(diapers→beer) = 0.70 (70% of diaper buyers also bought beer)
Lift = 0.70 / 0.50 = 1.4 (40% more likely than random)
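These three metrics are simple enough to compute from scratch. The sketch below uses a made-up 10-basket dataset (its numbers differ from the beer/diapers figures above, which come from a different example):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

# Hypothetical mini-dataset: 10 baskets
tx = [
    ['diapers', 'beer'], ['diapers', 'beer'], ['diapers', 'beer', 'milk'],
    ['diapers'], ['diapers', 'milk'],
    ['beer'], ['beer', 'milk'],
    ['milk'], ['bread'], ['bread', 'milk'],
]

sup_both = support({'diapers', 'beer'}, tx)   # 3/10 = 0.30
conf = sup_both / support({'diapers'}, tx)    # 0.30 / 0.50 = 0.60
lift = conf / support({'beer'}, tx)           # 0.60 / 0.50 = 1.2
print(sup_both, conf, lift)
```

Lift of 1.2 means diaper buyers are 20% more likely than a random customer to buy beer in this toy data, a genuine (if weak) positive association.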
Apriori Algorithm — Theory
Apriori generates frequent itemsets (item combinations meeting minimum support) by starting with 1-item sets and iteratively building larger sets. Uses the Apriori principle: any subset of a frequent itemset must also be frequent — this prunes the search space dramatically.
1. Generate all 1-item sets, filter by min_support → frequent 1-items
2. Join frequent k-items to generate (k+1)-item candidates
3. Prune any candidate whose subset is NOT frequent (Apriori pruning)
4. Filter candidates by min_support
5. Repeat until no new frequent itemsets found
6. Generate rules from all frequent itemsets using min_confidence
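The frequent-itemset steps (1–5) can be sketched in a few lines of plain Python. This is a teaching sketch, not the optimized mlxtend implementation used in the practical section:

```python
def apriori_itemsets(transactions, min_support):
    """Teaching sketch of Apriori steps 1-5: returns {frozenset: support}."""
    tx = [frozenset(t) for t in transactions]
    n = len(tx)
    sup = lambda s: sum(s <= t for t in tx) / n

    # Step 1: frequent 1-itemsets
    current = {frozenset([i]) for t in tx for i in t}
    current = {s for s in current if sup(s) >= min_support}
    result = {s: sup(s) for s in current}

    while current:
        # Step 2: join frequent k-itemsets into (k+1)-item candidates
        candidates = {a | b for a in current for b in current
                      if len(a | b) == len(a) + 1}
        # Step 3: Apriori pruning - every k-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(c - {i} in current for i in c)}
        # Steps 4-5: support filter, repeat until no new frequent itemsets
        current = {c for c in candidates if sup(c) >= min_support}
        result.update({c: sup(c) for c in current})
    return result

tx = [['a', 'b', 'c'], ['a', 'b'], ['a', 'c'], ['b', 'c'], ['a', 'b', 'c']]
freq = apriori_itemsets(tx, min_support=0.6)
print(len(freq))   # 6 itemsets: a, b, c, ab, ac, bc (abc has support 0.4, pruned)
```

Note how {a,b,c} is generated as a candidate (all its 2-subsets are frequent) but then fails the support filter, exactly the join-then-filter flow of steps 2–4.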
Apriori — Practical
import pandas as pd
# pip install mlxtend
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# ── Sample grocery transactions ───────────────────────
transactions = [
    ['milk', 'bread', 'butter'],
    ['milk', 'bread'],
    ['milk', 'butter', 'eggs'],
    ['bread', 'butter', 'eggs', 'cheese'],
    ['milk', 'bread', 'butter', 'eggs'],
    ['cheese', 'butter'],
    ['milk', 'eggs', 'bread'],
    ['milk', 'cheese', 'bread'],
    ['butter', 'eggs', 'milk'],
    ['bread', 'milk', 'cheese', 'eggs'],
    ['milk', 'bread', 'butter', 'cheese'],
]

# ── One-hot encode transactions ───────────────────────
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# ── Generate frequent itemsets with min_support=0.3 ───
frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)

# ── Generate rules with min_confidence=0.7 ────────────
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.7)

# ── Display rules sorted by lift ──────────────────────
rules = rules.sort_values('lift', ascending=False)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
FP-Growth Algorithm — Theory
FP-Growth (Frequent Pattern Growth) is a faster alternative to Apriori that avoids repeated database scans. It compresses the entire transaction database into a compact FP-tree (prefix tree), then mines frequent patterns directly from this tree — no candidate generation needed.
Instead of repeatedly reading raw transactions, FP-Growth reads the database exactly twice: once to find frequent single items, once to build a compressed prefix tree. Transactions sharing common prefixes share nodes in the tree — like a trie/prefix tree. Mining patterns then happens entirely in memory on this compact tree, recursively building "conditional pattern bases".
| Property | Apriori | FP-Growth |
|---|---|---|
| Database scans | k scans (one per itemset size) | Exactly 2 scans |
| Candidate generation | Yes — generates then prunes | No — divides problem recursively |
| Memory | Lower (no tree structure) | Higher (FP-tree in RAM) |
| Speed | Slow on large data | 10–100× faster than Apriori |
| Implementation | Simpler to understand | More complex |
| Best for | Small datasets, education | Production, large transactions |
Pass 1: Scan database → count item frequencies → discard items below min_support → sort
frequent items by frequency (descending)
Pass 2: Build FP-tree — insert each transaction (only frequent items, in sorted order) into
prefix tree. Shared prefixes share nodes; each node stores item name + count.
Mining: For each frequent item, extract its conditional pattern base (all paths ending at
that item), build a conditional FP-tree, recurse. Each recursive call produces frequent itemsets containing
that item.
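The two database passes can be illustrated with a minimal FP-tree builder. This sketch covers construction only (the recursive conditional-pattern mining is omitted for brevity), and the FPNode class is a made-up helper, not part of any library:

```python
from collections import Counter

class FPNode:
    """Minimal prefix-tree node: item label, count, children keyed by item."""
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count items, drop infrequent ones, rank by descending frequency
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    rank = {i: r for r, i in enumerate(sorted(freq, key=freq.get, reverse=True))}
    # Pass 2: insert each transaction (frequent items only, in rank order);
    # shared prefixes reuse existing nodes, which is where compression comes from
    root = FPNode(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, freq

def tree_size(node):
    return 1 + sum(tree_size(c) for c in node.children.values())

tx = [['m', 'b'], ['m', 'b', 'e'], ['m', 'e'], ['b', 'e'], ['m', 'b', 'e']]
root, freq = build_fp_tree(tx, min_count=3)
print(tree_size(root))            # 7 nodes hold 12 inserted items -> compression
print(root.children['m'].count)   # 4: four transactions start with the 'm' prefix
```

Mining would then walk upward from each item's nodes to collect its conditional pattern base and recurse, entirely in memory.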
FP-Growth — Practical
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.frequent_patterns import fpgrowth, apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# ── Generate larger synthetic transaction dataset ─────
np.random.seed(42)
items = ['milk', 'bread', 'butter', 'eggs', 'cheese',
         'yogurt', 'juice', 'cereal', 'coffee', 'tea']
transactions = []
for _ in range(500):
    # Simulate real shopping patterns with biased item probabilities
    weights = [0.7, 0.65, 0.5, 0.55, 0.35, 0.3, 0.4, 0.25, 0.45, 0.3]
    chosen = [item for item, w in zip(items, weights) if np.random.random() < w]
    if len(chosen) >= 2:
        transactions.append(chosen)

# Encode
te = TransactionEncoder()
df_enc = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)
print(f"Dataset: {len(transactions)} transactions × {len(te.columns_)} items")

# ── Speed benchmark: FP-Growth vs Apriori ─────────────
min_sup = 0.2
t0 = time.time()
fp_itemsets = fpgrowth(df_enc, min_support=min_sup, use_colnames=True)
t_fp = time.time() - t0
t0 = time.time()
ap_itemsets = apriori(df_enc, min_support=min_sup, use_colnames=True)
t_ap = time.time() - t0
print(f"\nFP-Growth: {len(fp_itemsets)} itemsets in {t_fp:.4f}s")
print(f"Apriori:   {len(ap_itemsets)} itemsets in {t_ap:.4f}s")
print(f"Speedup:   {t_ap/t_fp:.2f}×")

# ── Generate rules from FP-Growth itemsets ────────────
rules = association_rules(fp_itemsets, metric='lift', min_threshold=1.1)
rules['antecedents'] = rules['antecedents'].apply(lambda x: ', '.join(list(x)))
rules['consequents'] = rules['consequents'].apply(lambda x: ', '.join(list(x)))
top_rules = rules.nlargest(10, 'lift')[
    ['antecedents', 'consequents', 'support', 'confidence', 'lift']
]
print("\nTop 10 Rules by Lift:")
print(top_rules.to_string(index=False))

# ── Heatmap: Item co-occurrence matrix ────────────────
cooc = df_enc.T.dot(df_enc)
cooc_norm = cooc / cooc.values.diagonal()  # column j divided by count(j)
plt.figure(figsize=(8, 6))
sns.heatmap(cooc_norm, annot=True, fmt='.2f', cmap='YlOrRd',
            linewidths=0.5, vmin=0, vmax=1)
plt.title('Item Co-occurrence Matrix\n(value = P(row bought | col bought))')
plt.tight_layout(); plt.show()
- FP-Growth is generally preferred over Apriori in production — faster and fewer DB scans
- Both produce identical rules — the algorithm differs, not the output
- Real business value: product placement, "customers also bought", promotions
- Always evaluate rules with Lift, not just Confidence — lift > 1 means genuine co-occurrence
Download a real retail dataset (e.g. UCI Online Retail Dataset). Run FP-Growth with min_support=0.05. Find the top 5 rules by lift. What business action would you recommend based on each rule?
Phase 13 — Ensemble Learning
Ensemble Learning — Overview
Ensemble Learning combines multiple models (weak learners) to produce a stronger, more accurate and robust prediction than any single model alone. The key insight: models make different errors, and combining them cancels out individual mistakes.
A single expert doctor may be wrong. But if 100 doctors independently diagnose the same patient and you take the majority vote, accuracy dramatically improves — individual errors cancel out. This is the ensemble principle. Even weak individual models (slightly better than random) can combine into a very strong model.
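The doctor analogy is a direct application of the Condorcet jury theorem, and it is easy to verify numerically: if each of n independent voters is correct with probability 0.6, the majority vote's accuracy follows a simple binomial sum (pure math, no dataset assumed):

```python
import math

def majority_accuracy(p, n):
    """P(majority of n independent voters is correct), each correct w.p. p."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Accuracy grows steadily with the number of independent weak voters
for n in [1, 5, 25, 101]:
    print(f"{n:3d} voters -> {majority_accuracy(0.6, n):.3f}")
```

One 60%-accurate voter stays at 0.600, while 101 of them exceed 0.95 combined. The catch, as noted below, is the independence assumption: correlated models gain far less.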
| Family | Core Idea | Algorithms | Strength |
|---|---|---|---|
| Bagging | Train models on bootstrap samples in parallel, average outputs | Random Forest, BaggingClassifier | Reduces variance (overfitting) |
| Boosting | Train models sequentially, each focusing on previous errors | AdaBoost, Gradient Boosting, XGBoost | Reduces bias (underfitting) |
| Stacking | Train base models, use their predictions as input to a meta-model | StackingClassifier/Regressor | Combines heterogeneous models |
- Ensemble = many weak learners combining into one strong learner
- Bagging: parallel training on bootstrap samples → reduces variance
- Boosting: sequential training fixing errors → reduces bias
- Stacking: base models feed a meta-learner → most flexible, most powerful
- Diversity between models is essential — correlated models don't help each other
Voting Methods — Max, Average, Weighted
Voting ensembles combine predictions from multiple different algorithm types (heterogeneous ensemble). Each model votes, and the final prediction is determined by: Hard Voting (majority class), Soft Voting (average probabilities), or Weighted Voting (trusted models vote more).
| Strategy | How It Works | When to Use | Requires |
|---|---|---|---|
| Hard Voting | Majority class label wins (mode) | Classification; when probabilities unavailable | predict() from each model |
| Soft Voting | Average predicted probabilities; argmax wins | Classification; more accurate than hard when models are calibrated | predict_proba() from each model |
| Weighted Voting | Better models get higher vote weight | When you know which models are stronger | weights= parameter |
| Average (Regression) | Mean of all model predictions | Regression; baseline ensemble | predict() from each model |
Soft Voting: ŷ = argmax_k [ (1/n) Σ P_i(y=k|X) ]
Weighted: ŷ = argmax_k [ Σ wᵢ × P_i(y=k|X) ] where Σwᵢ = 1
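A tiny numpy illustration of these two formulas, using made-up probabilities from three models on a single two-class sample, shows how soft voting can overturn a hard-vote majority:

```python
import numpy as np

# P[i, k] = model i's predicted probability for class k (hypothetical values)
P = np.array([[0.60, 0.40],    # model 1 votes class 0, weakly
              [0.55, 0.45],    # model 2 votes class 0, weakly
              [0.05, 0.95]])   # model 3 votes class 1, very confidently

hard = np.bincount(np.argmax(P, axis=1)).argmax()  # majority of labels -> class 0
soft = np.argmax(P.mean(axis=0))                   # mean probs [0.40, 0.60] -> class 1

w = np.array([0.5, 0.3, 0.2])                      # weighted soft voting, Σw = 1
weighted = np.argmax(w @ P)                        # [0.475, 0.525] -> class 1

print(hard, soft, weighted)                        # 0 1 1
```

Two weak "class 0" votes outnumber one vote under hard voting, but the third model's high confidence dominates once probabilities are averaged, which is why soft voting tends to win when the base models are well calibrated.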
Voting Regression — Practical
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# ── Define individual base models ─────────────────────
# Note: VotingRegressor requires estimators that support fit/predict
# Wrap scale-sensitive models in a Pipeline
r1 = ('ridge', Pipeline([('sc', StandardScaler()), ('m', Ridge(alpha=1.0))]))
r2 = ('dtree', DecisionTreeRegressor(max_depth=6, random_state=42))
r3 = ('knn',   Pipeline([('sc', StandardScaler()), ('m', KNeighborsRegressor(n_neighbors=7))]))
r4 = ('svr',   Pipeline([('sc', StandardScaler()), ('m', SVR(C=10, gamma='scale'))]))

# ── VotingRegressor: uniform average ──────────────────
vr_uniform = VotingRegressor(estimators=[r1, r2, r3, r4])

# ── VotingRegressor: weighted (Ridge is best, give it more weight)
vr_weighted = VotingRegressor(estimators=[r1, r2, r3, r4], weights=[3, 2, 1, 2])

# ── Evaluate all models ───────────────────────────────
results = {}
for name, model in [('Ridge', r1[1]), ('DTree', r2[1]), ('KNN', r3[1]),
                    ('SVR', r4[1]), ('VotingUniform', vr_uniform),
                    ('VotingWeighted', vr_weighted)]:
    model.fit(X_tr, y_tr)
    yp = model.predict(X_te)
    results[name] = {'RMSE': np.sqrt(mean_squared_error(y_te, yp)),
                     'R²': r2_score(y_te, yp)}
df_res = pd.DataFrame(results).T
print(df_res.round(4))

# ── Bar chart comparison ──────────────────────────────
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
colors = ['steelblue'] * 4 + ['#f59e0b', '#10b981']
ax1.bar(df_res.index, df_res['RMSE'], color=colors)
ax1.set_title('RMSE Comparison\n(Lower = Better)')
ax1.tick_params(axis='x', rotation=25)
ax2.bar(df_res.index, df_res['R²'], color=colors)
ax2.set_title('R² Comparison\n(Higher = Better)')
ax2.tick_params(axis='x', rotation=25)
plt.tight_layout(); plt.show()
# Ensemble should outperform all individual models
Voting Classification — Practical
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, f1_score
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ── Base estimators ───────────────────────────────────
lr  = Pipeline([('sc', StandardScaler()), ('m', LogisticRegression(max_iter=1000))])
dt  = DecisionTreeClassifier(max_depth=5, random_state=42)
knn = Pipeline([('sc', StandardScaler()), ('m', KNeighborsClassifier(n_neighbors=7))])
gnb = Pipeline([('sc', StandardScaler()), ('m', GaussianNB())])
svc = Pipeline([('sc', StandardScaler()), ('m', SVC(probability=True))])
estimators = [('lr', lr), ('dt', dt), ('knn', knn), ('gnb', gnb), ('svc', svc)]

# ── Hard vs Soft Voting ───────────────────────────────
vc_hard = VotingClassifier(estimators=estimators, voting='hard')
vc_soft = VotingClassifier(estimators=estimators, voting='soft')

# ── Evaluate all models ───────────────────────────────
results = []
for name, model in estimators + [('VotingHard', vc_hard), ('VotingSoft', vc_soft)]:
    model.fit(X_tr, y_tr)
    yp = model.predict(X_te)
    results.append({'Model': name,
                    'Test Acc': accuracy_score(y_te, yp),
                    'Test F1': f1_score(y_te, yp),
                    'CV F1': cross_val_score(model, X_tr, y_tr, cv=5, scoring='f1').mean()})
df_r = pd.DataFrame(results)
print(df_r.sort_values('Test F1', ascending=False).to_string(index=False))

# ── Visualize ensemble improvement ────────────────────
fig, ax = plt.subplots(figsize=(10, 5))
colors = ['steelblue'] * 5 + ['#f59e0b', '#10b981']
ax.barh(df_r['Model'], df_r['Test F1'], color=colors)
ax.axvline(x=df_r['Test F1'][:5].max(), color='red', linestyle='--',
           label='Best individual model')
ax.set_xlabel('F1-Score')
ax.set_title('Voting Ensemble vs Individual Models\n(Ensemble bars in gold/green)')
ax.legend(); plt.tight_layout(); plt.show()
- Using correlated models in ensemble — if all models make the same mistakes, voting doesn't help. Mix: linear + tree + distance-based models for diversity.
- Using Hard Voting with poorly calibrated models — their class boundaries don't align in probability space
- Not scaling inside each pipeline — VotingClassifier calls each model separately, so each needs its own scaler
Bagging & Random Forest — Theory
Bagging (Bootstrap Aggregating) trains multiple models on different bootstrap samples (random samples with replacement) of the training data, then aggregates predictions. Random Forest extends bagging by also randomizing the feature subset at each split — creating maximum diversity among trees.
A single Decision Tree has high variance: σ²(single tree). If n trees make independent predictions, the variance of their mean = σ²/n. With 100 trees, variance drops 100×. In practice trees are correlated (same data), but Random Forest's feature randomization breaks this correlation, achieving near-independent variance reduction.
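The σ²/n claim can be checked with a quick Monte Carlo simulation of idealized, fully independent "trees" (a pure noise model, no real data or estimator involved):

```python
import numpy as np

rng = np.random.default_rng(42)
truth, sigma = 10.0, 2.0            # true value, per-tree noise std (σ² = 4)
n_trees = 100

# Each column is one tree's prediction: truth + independent noise
preds = truth + rng.normal(0.0, sigma, size=(100_000, n_trees))

var_single = preds[:, 0].var()             # ≈ σ² = 4
var_ensemble = preds.mean(axis=1).var()    # ≈ σ²/n = 0.04
print(var_single, var_ensemble)
```

Averaging 100 independent predictors cuts variance by a factor of 100, matching σ²/n. Real trees trained on overlapping bootstrap samples are correlated, so the actual reduction is smaller, which is precisely the gap Random Forest's feature randomization attacks.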
| Property | Bagging (BaggingClassifier) | Random Forest |
|---|---|---|
| Bootstrap sampling | Yes | Yes |
| Feature randomization per split | No — all features considered | Yes — random subset of sqrt(n) features |
| Tree correlation | High (same features) | Low (different feature subsets) |
| Diversity | Moderate | High — best decorrelation |
| Out-of-Bag (OOB) evaluation | Yes (with oob_score=True) | Yes (with oob_score=True) |
| Feature importance | Depends on base estimator | Built-in (Gini importance) |
Bootstrap sample of size n: P(point included) = 1 − (1 − 1/n)^n → 1 − 1/e ≈ 63.2% unique points per sample (rest are duplicates)
OOB samples: ~36.8% left out per tree → free validation set
Feature subset per split: max_features = sqrt(p) for classification, p/3 for regression
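The 63.2% figure is easy to confirm empirically by drawing one large bootstrap sample and counting distinct indices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Bootstrap sample: draw n indices from [0, n) WITH replacement
boot_idx = rng.integers(0, n, size=n)
frac_unique = np.unique(boot_idx).size / n
print(frac_unique)   # ≈ 0.632 = 1 - 1/e
```

The remaining ~36.8% of points never appear in that tree's sample; those are exactly the out-of-bag points used for the free OOB score.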
Bagging — Classification Practical
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ── 1. Single Decision Tree (baseline) ────────────────
single_dt = DecisionTreeClassifier(max_depth=None, random_state=42)
single_dt.fit(X_tr, y_tr)
print("Single DTree:")
print(f"  Train acc: {accuracy_score(y_tr, single_dt.predict(X_tr)):.4f}")
print(f"  Test acc:  {accuracy_score(y_te, single_dt.predict(X_te)):.4f}")

# ── 2. Bagging with DTree base estimator ──────────────
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=100,            # number of trees
    max_samples=0.8,             # use 80% of data per tree
    max_features=0.8,            # use 80% of features per tree
    bootstrap=True,              # with replacement
    bootstrap_features=False,
    oob_score=True,              # free OOB evaluation
    n_jobs=-1, random_state=42
)
bag.fit(X_tr, y_tr)
print("\nBaggingClassifier:")
print(f"  OOB score: {bag.oob_score_:.4f} (free cross-validation estimate)")
print(f"  Test acc:  {accuracy_score(y_te, bag.predict(X_te)):.4f}")

# ── 3. Random Forest ──────────────────────────────────
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,          # let trees grow fully (bagging controls variance)
    max_features='sqrt',     # sqrt(n_features) per split — key RF innovation
    min_samples_split=2,
    min_samples_leaf=1,
    bootstrap=True, oob_score=True,
    n_jobs=-1, random_state=42
)
rf.fit(X_tr, y_tr)
print("\nRandomForestClassifier:")
print(f"  OOB score: {rf.oob_score_:.4f}")
print(f"  Test acc:  {accuracy_score(y_te, rf.predict(X_te)):.4f}")
print(f"  Test F1:   {f1_score(y_te, rf.predict(X_te)):.4f}")
print("\nClassification Report:")
print(classification_report(y_te, rf.predict(X_te), target_names=data.target_names))

# ── 4. n_estimators vs OOB score curve ────────────────
oob_scores = []
n_range = range(10, 201, 10)
for n in n_range:
    model = RandomForestClassifier(n_estimators=n, oob_score=True,
                                   n_jobs=-1, random_state=42)
    model.fit(X_tr, y_tr)
    oob_scores.append(model.oob_score_)

plt.figure(figsize=(10, 4))
plt.plot(n_range, oob_scores, 'b-o', markersize=4)
plt.xlabel('n_estimators'); plt.ylabel('OOB Score')
plt.title('OOB Score vs Number of Trees\n(Score stabilizes — no overfitting with more trees!)')
plt.tight_layout(); plt.show()

# ── 5. Feature Importance ─────────────────────────────
imp = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(imp['Feature'][:15], imp['Importance'][:15], color='steelblue')
plt.xlabel('Feature Importance (Mean Decrease in Gini Impurity)')
plt.title('Top 15 Feature Importances — Random Forest\n(Higher = more useful for prediction)')
plt.gca().invert_yaxis()
plt.tight_layout(); plt.show()

print("\nTop 5 most important features:")
print(imp.head(5).to_string(index=False))
- More trees never overfit in Random Forest — unlike a single tree, adding more trees reduces variance monotonically. After ~100–200 trees, gains are marginal; it's a compute/accuracy tradeoff.
- Setting max_features=None in RF — this removes the feature randomization that makes RF special. You'd just get regular Bagging.
- Confusing feature_importances_ with "causal importance" — RF importance measures how useful a feature is for prediction, not causation. Correlated features split importance between them.
Bagging — Regression Practical
Bagging Regression uses the same bootstrap aggregating principle for continuous targets. The ensemble prediction is the mean of all base regressor predictions. Random Forest Regressor is the most widely used variant.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              GradientBoostingRegressor)
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
df = housing.frame
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# ── Model zoo ─────────────────────────────────────────
models = {
    'Single DTree': DecisionTreeRegressor(max_depth=None, random_state=42),
    'BaggingDTree': BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=42),
        n_estimators=100, oob_score=True, n_jobs=-1, random_state=42
    ),
    'RandomForest': RandomForestRegressor(
        n_estimators=200, max_features='sqrt',
        oob_score=True, n_jobs=-1, random_state=42
    ),
    'GradientBoost': GradientBoostingRegressor(   # bonus: show boosting too
        n_estimators=200, learning_rate=0.1, max_depth=4, random_state=42
    ),
}

results = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    yp = model.predict(X_te)
    results.append({'Model': name,
                    'RMSE': np.sqrt(mean_squared_error(y_te, yp)),
                    'MAE': mean_absolute_error(y_te, yp),
                    'R²': r2_score(y_te, yp),
                    'OOB': getattr(model, 'oob_score_', np.nan)})
df_res = pd.DataFrame(results)
print(df_res.to_string(index=False))

# ── Visualize 1: RMSE and R² bar charts ───────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
colors = ['#64748b', 'steelblue', '#10b981', '#f59e0b']
axes[0].bar(df_res['Model'], df_res['RMSE'], color=colors)
axes[0].set_title('RMSE: Lower is Better\n(Ensemble beats single tree consistently)')
axes[0].tick_params(axis='x', rotation=15)
axes[1].bar(df_res['Model'], df_res['R²'], color=colors)
axes[1].set_title('R²: Higher is Better')
axes[1].tick_params(axis='x', rotation=15)
plt.tight_layout(); plt.show()

# ── Visualize 2: Actual vs Predicted (Random Forest) ──
rf_model = models['RandomForest']
y_pred_rf = rf_model.predict(X_te)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(y_te, y_pred_rf, alpha=0.2, s=8, color='steelblue')
axes[0].plot([y_te.min(), y_te.max()], [y_te.min(), y_te.max()], 'r--')
axes[0].set_xlabel('Actual'); axes[0].set_ylabel('Predicted')
axes[0].set_title(f'Random Forest: Actual vs Predicted\nR²={r2_score(y_te, y_pred_rf):.4f}')
# Feature importance
imp = pd.DataFrame({'Feature': X.columns,
                    'Importance': rf_model.feature_importances_}).sort_values('Importance', ascending=True)
axes[1].barh(imp['Feature'], imp['Importance'], color='steelblue')
axes[1].set_title('Feature Importances\n(Random Forest Regression)')
plt.tight_layout(); plt.show()

# ── Hyperparameter guide: key RF regression params ────
print("\n--- Key RandomForestRegressor Hyperparameters ---")
params_guide = {
    'n_estimators':      "100–500. More = better up to diminishing returns. Never overfits.",
    'max_features':      "'sqrt' is the classification default; regression defaults to all features. Tune: try 1/3 to 2/3 of features.",
    'max_depth':         "None (grow full) is fine for RF. Set only if memory is a concern.",
    'min_samples_split': "2–10. Increase to reduce overfitting on noisy data.",
    'min_samples_leaf':  "1–5. Larger = smoother predictions, less sensitive to noise.",
    'bootstrap':         "True always for Bagging. False = random subspaces only.",
}
for p, desc in params_guide.items():
    print(f"  {p:25s}: {desc}")
- Treating feature_importances_ as definitive — for correlated features, importance is split arbitrarily between correlated variables. Use permutation importance (sklearn's permutation_importance) for more reliable estimates.
- Not using n_jobs=-1 — Random Forest is embarrassingly parallelizable. Without it, training 200 trees sequentially is 4–8× slower than needed.
- Tuning n_estimators to "exactly 100" for performance — always plot the OOB curve and stop when it flattens, not at an arbitrary number.
- Bagging trains parallel models on bootstrap samples → reduces variance via averaging
- Random Forest = Bagging + random feature subsets at each split → maximum tree diversity
- OOB score gives free cross-validated performance estimate using left-out samples
- More trees never hurt RF — only diminishing returns after ~200 trees
- Feature importances reveal predictive power but can be misleading for correlated features
Build a complete Random Forest pipeline on the California Housing dataset: (1) Feature engineering, (2) RF with OOB score, (3) Permutation importance to validate feature importances, (4) GridSearchCV for max_features and min_samples_leaf, (5) Compare final RF to single DTree and Ridge — report RMSE, R², and training time for each.
🏆 Course Complete — Final Master Summary
Complete ML Masterclass — Final Summary
| Phase | Topics | Key Takeaway |
|---|---|---|
| Basics (1–3) | ML Introduction, Types, Roadmap | Know the 7-step pipeline cold. Supervised/Unsupervised distinction. |
| Data Preprocessing (4–20) | Variables, Cleaning, Missing, Encoding, Outliers, Scaling, FunctionTransformer | 80% of ML work. Data quality beats algorithm choice. |
| Feature Selection (21–22) | Backward/Forward Selection | Wrapper methods are model-aware but O(n²). Use filter first. |
| Model Training (23) | Train-Test Split | Always split BEFORE any preprocessing. Stratify for classification. |
| Regression (24–33) | Linear → Poly → Ridge/Lasso | Always plot residuals. Regularization requires scaling. Use AdjR². |
| Classification (34–38) | Logistic, Binary/Multi-class | Sigmoid outputs probability. Threshold tuning = precision/recall tradeoff. |
| Evaluation (39–41) | Confusion Matrix, F1, Imbalanced | Accuracy is useless on imbalanced data. Use F1, ROC-AUC, lift. |
| Naive Bayes (42–43) | Theory + Practical | Best for text. Naive independence assumption works surprisingly well. |
| Advanced Models (44–53) | DTree, KNN, SVM | DTree: interpretable but high variance. KNN: lazy, scale-sensitive. SVM: margin maximization, kernel trick. |
| Model Tuning (54–57) | Hyperparams, GridSearch, CV | Nested CV is gold standard. Always tune on train set, evaluate on test. |
| Unsupervised (58–65) | K-Means, Hierarchical, DBSCAN, Silhouette | K-Means: fast, spherical. DBSCAN: arbitrary shapes, finds outliers. Silhouette: no labels needed. |
| Association (66–70) | Apriori, FP-Growth | FP-Growth = production standard. Lift > 1 = genuine rule. |
| Ensemble (71–77) | Voting, Bagging, Random Forest | RF = best general-purpose model. Diversity + averaging = power. OOB = free CV. |
1. Always split data BEFORE any preprocessing step — prevent data leakage.
2. Fit transformers on train data only. Transform train AND test with the same fit.
3. Plot residuals — R² alone is never enough to validate a regression model.
4. For imbalanced data: never use accuracy. Use F1, ROC-AUC, G-Mean.
5. Scale features before: KNN, SVM, Logistic Regression, Ridge, Lasso, PCA.
6. Decision Trees are interpretable but high variance — use Random Forest for
production.
7. Hyperparameter tuning belongs in the training set — nested CV for unbiased
evaluation.
8. For clustering, always use Silhouette analysis — never just pick k arbitrarily.
9. More trees in Random Forest never overfit — stop at ~200 when OOB curve flattens.
10. Ensemble diversity matters more than individual model accuracy.
| # | Question | Topic |
|---|---|---|
| 1 | Walk me through an end-to-end ML project | Topic 3: Roadmap |
| 2 | What is data leakage and how do you prevent it? | Topic 23: Train-Test Split |
| 3 | Explain the bias-variance tradeoff | Topic 32: Ridge/Lasso |
| 4 | Why use Adjusted R² over R²? | Topic 30 |
| 5 | Lasso vs Ridge — mathematical difference and when to choose | Topic 32 |
| 6 | Why is logistic regression called regression? | Topic 35 |
| 7 | Precision vs Recall tradeoff — give a business example | Topic 40 |
| 8 | Why does Naive Bayes work despite the independence assumption? | Topic 42 |
| 9 | Gini Impurity vs Entropy — when does it matter? | Topic 44 |
| 10 | What is the curse of dimensionality and how does it affect KNN? | Topic 48 |
| 11 | What are support vectors and why do only they define the boundary? | Topic 51 |
| 12 | Parameters vs Hyperparameters — with examples | Topic 54 |
| 13 | GridSearch vs RandomizedSearch — when to use each? | Topic 55 |
| 14 | What is OOB error in Random Forest? | Topic 76 |
| 15 | K-Means limitations and how DBSCAN addresses them | Topics 59/63 |
| 16 | What makes ensemble methods work mathematically? | Topic 71 |
| 17 | Random Forest vs Gradient Boosting — when to choose? | Topic 77 |
| 18 | What is the Apriori principle? | Topic 67 |
| 19 | Why is Soft Voting better than Hard Voting? | Topic 72 |
| 20 | How do you handle imbalanced datasets? | Topic 41 |
| Topic | Why It Matters | Where to Start |
|---|---|---|
| XGBoost / LightGBM | Win 90% of Kaggle tabular competitions | xgboost.readthedocs.io |
| Neural Networks / Deep Learning | Images, text, audio, sequences | fast.ai, PyTorch |
| Feature Engineering | Often the highest ROI activity in ML | Kaggle competitions |
| Model Explainability (SHAP) | Required in regulated industries | shap.readthedocs.io |
| ML Pipelines (sklearn Pipeline) | Production-ready, prevents leakage | sklearn Pipeline docs |
| Time Series (ARIMA, Prophet) | Finance, forecasting, IoT | statsmodels, Prophet |
| NLP (TF-IDF, Transformers) | Text classification, sentiment, chatbots | HuggingFace Transformers |
| MLOps (MLflow, Docker) | Deploy and monitor models in production | mlflow.org |
Using everything you've learned, build a complete end-to-end ML system:
- Pick a real dataset from Kaggle (tabular, CSV format)
- EDA: distribution plots, correlation heatmap, missing value audit
- Preprocessing Pipeline: imputation, encoding, scaling — all in sklearn Pipeline
- Baseline model: Linear/Logistic Regression — establish a benchmark
- Advanced models: Random Forest, XGBoost — GridSearchCV on train only
- Evaluation: confusion matrix (if classification), residual plot (if regression), CV scores, test set final evaluation
- Explainability: feature importance + SHAP values
- Deployment: wrap best model in a Flask API that accepts JSON and returns predictions
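The preprocessing, baseline, and tuning steps above can be wired together in a single sketch. This is a compressed example (not the full capstone) using the built-in breast-cancer dataset, a `Pipeline` to prevent leakage, and `GridSearchCV` restricted to the training set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so each CV fold
# fits the scaler on its own training portion only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])

# Tuning touches ONLY the training set
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="f1")
grid.fit(X_train, y_train)

# One-time final evaluation on the untouched test set
test_score = grid.score(X_test, y_test)
print("best C:", grid.best_params_["clf__C"], f"| test F1: {test_score:.3f}")
```

The same skeleton extends to the full project: swap in your Kaggle dataset, add imputation and encoding steps to the pipeline, and try Random Forest or XGBoost as the estimator.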
Complete this and you have a portfolio project worthy of FAANG interviews. You now know enough to build real ML systems.