Machine Learning
Masterclass

Zero to Advanced — 77 Topics · Deep Theory + Practical Python · FAANG-level Interview Prep


Phase 1 — Basics

#01

ML Course Introduction

Definition

This course teaches Machine Learning from first principles to production-level systems. It follows the exact order used in real-world ML pipelines: data → preprocessing → modeling → evaluation → deployment thinking.

The 3 Pillars of ML
Pillar     | What it means                       | Example
Data       | The raw input — the lifeblood of ML | CSV of house prices
Algorithm  | The "brain" that finds patterns     | Linear Regression
Prediction | Output after the model learns       | Predicted price: $450k
Your Toolkit
environment_check.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

# Assumed already known: numpy, pandas, matplotlib, seaborn
# This course focuses on: sklearn + ML theory + real projects

print("scikit-learn version:", sklearn.__version__)

# Quick dataset sanity check
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
print(df.head())
print("Shape:", df.shape)
Summary
  • ML is about learning patterns from data, not writing explicit rules
  • This course follows the real-world pipeline order
  • You already know numpy/pandas/matplotlib — we skip those basics
  • scikit-learn is the primary ML library throughout
#02

What is Machine Learning?

Definition

Machine Learning is a subset of AI where algorithms learn from data to make predictions or decisions — without being explicitly programmed for each scenario. Formally: a computer program "learns" from experience E with respect to task T if performance measure P improves with E (Tom Mitchell, 1997).

Intuition
🏠 Analogy: Real Estate Agent vs Rule Book

A rule-based system for predicting house prices needs explicit rules: IF 3 bedrooms AND near school AND 1500 sqft THEN price = $400k. This is brittle — it can't generalize.

An ML model looks at 100,000 past sales, finds the pattern between features and prices automatically, and predicts prices for houses it's never seen. That's learning.

Types of ML
Type            | Definition                                    | Key Word          | Example
Supervised      | Learns from labeled data (input→output pairs) | "Teacher present" | Spam detection, house prices
Unsupervised    | Finds hidden patterns in unlabeled data       | "No teacher"      | Customer segmentation, anomaly detection
Semi-supervised | Mix of labeled + unlabeled data               | "Partial teacher" | Image labeling at scale
Reinforcement   | Agent learns by reward/punishment signals     | "Trial and error" | Game playing, robotics
Supervised Learning Breakdown
Sub-type       | Output Type       | Examples
Regression     | Continuous number | House price, temperature, salary
Classification | Category/label    | Spam/not spam, disease/no disease
Code: Supervised vs Unsupervised
ml_types_demo.py
import numpy as np
from sklearn.linear_model import LinearRegression     # Supervised
from sklearn.cluster import KMeans                    # Unsupervised

# ── SUPERVISED: We have labels ──────────────────────────
# X = house size (sqft), y = price ($1000s)
X_sup = np.array([[1000],[1500],[2000],[2500],[3000]])
y     = np.array([200, 280, 350, 430, 500])            # ← LABELS

model_sup = LinearRegression()
model_sup.fit(X_sup, y)
pred = model_sup.predict([[1800]])
print(f"Supervised — Predicted price for 1800sqft: ${pred[0]:.0f}k")

# ── UNSUPERVISED: No labels, find natural groups ─────────
# Customer data: [age, spending_score] — no target!
X_uns = np.array([[22,80],[25,75],[45,20],
                   [48,15],[35,50],[40,45]])

model_uns = KMeans(n_clusters=2, n_init=10, random_state=42)
clusters  = model_uns.fit_predict(X_uns)            # ← Discovers groups
print("Unsupervised — Cluster assignments:", clusters)
# Output: [0 0 1 1 0 0] or similar — groups found automatically!
Common Mistakes
  • Treating regression as classification (or vice versa) — always ask: "Is my output a number or a category?"
  • Assuming more data always = better model — quality > quantity
  • Ignoring the problem type — not all ML problems need the same approach
Interview Insights
Q: What's the difference between AI, ML, and Deep Learning?
A: AI ⊃ ML ⊃ Deep Learning. AI = any technique making machines intelligent. ML = AI that learns from data automatically. Deep Learning = ML using multi-layered neural networks that can learn from raw, unstructured data (images, text).
Q: What's the difference between supervised and unsupervised learning?
A: Supervised = we have ground-truth labels and train a model to predict them. Unsupervised = no labels — we find hidden structure/patterns in data. Key distinction: labels vs no labels.
Summary
  • ML = algorithms that learn patterns from data without explicit programming
  • 3 main types: Supervised (labeled), Unsupervised (unlabeled), Reinforcement (rewards)
  • Supervised splits into: Regression (continuous output) and Classification (categorical output)
  • The key question always is: "What is my output type? Do I have labels?"
🏋️ Mini Practice Task

For each scenario below, identify: (1) Type of ML and (2) Supervised sub-type if applicable:

  • Predicting tomorrow's stock price
  • Grouping news articles by topic automatically
  • Detecting whether a tumor is malignant
  • Teaching a robot to walk

Answers: Supervised (Regression) | Unsupervised (Clustering) | Supervised (Classification) | Reinforcement

#03

Complete ML Roadmap

Definition

The ML Roadmap is the end-to-end workflow that every ML project follows — from raw data to a deployed model. Understanding this pipeline is fundamental; every topic in this course maps to a step in this process.

The 7-Step ML Pipeline
ml_pipeline_overview.py
"""
THE COMPLETE ML PIPELINE
─────────────────────────────────────────────────────
STEP 1: Problem Definition
  → What are we predicting? What's the metric of success?
  
STEP 2: Data Collection
  → SQL, APIs, web scraping, Kaggle, company databases
  
STEP 3: Exploratory Data Analysis (EDA)
  → Understand distributions, correlations, missing data
  
STEP 4: Data Preprocessing (MOST TIME IS SPENT HERE)
  → Clean missing values, encode categories, scale features,
    remove outliers, handle duplicates
    
STEP 5: Feature Engineering & Selection
  → Create new features, select the most informative ones
  → Techniques: Backward Elimination, Forward Selection
  
STEP 6: Model Training
  → Split data (train/test), select algorithm, fit model
  → Regression: Linear, Ridge, Lasso, Polynomial
  → Classification: Logistic, SVM, Trees, KNN, Naive Bayes
  → Clustering: K-Means, DBSCAN, Hierarchical
  
STEP 7: Evaluation & Tuning
  → Regression: R², MSE, RMSE
  → Classification: Accuracy, F1, ROC-AUC, Confusion Matrix
  → Tuning: GridSearchCV, CrossValidation
  
[DEPLOY] → Flask/FastAPI, Docker, Cloud
─────────────────────────────────────────────────────
"""

# Time allocation in real-world ML projects:
time_allocation = {
    "Problem Definition": "5%",
    "Data Collection": "10%",
    "EDA": "15%",
    "Data Preprocessing": "40%",   # ← MOST work happens here
    "Modeling": "20%",
    "Evaluation & Tuning": "10%",
}

for step, pct in time_allocation.items():
    print(f"  {step:30s} → {pct}")
Algorithms Map
Problem Type   | Algorithm Choices                                             | When to Use
Regression     | Linear, Ridge, Lasso, Polynomial, SVR, Decision Tree          | Continuous numeric output
Classification | Logistic, SVM, KNN, Naive Bayes, Decision Tree, Random Forest | Categorical output
Clustering     | K-Means, DBSCAN, Hierarchical                                 | No labels, find groups
Association    | Apriori, FP-Growth                                            | Market basket analysis
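The map above can be sketched as a tiny selection helper. This is a hypothetical convenience: the `pick_estimator` name and the one-default-estimator-per-row choices are illustrative, not a scikit-learn API.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

# One illustrative default per row of the algorithms map
DEFAULT_ESTIMATORS = {
    "regression": LinearRegression,
    "classification": LogisticRegression,
    "clustering": KMeans,
}

def pick_estimator(problem_type: str):
    """Return an unfitted default estimator for the given problem type."""
    try:
        return DEFAULT_ESTIMATORS[problem_type]()
    except KeyError:
        raise ValueError(f"Unknown problem type: {problem_type!r}")

print(type(pick_estimator("regression")).__name__)  # LinearRegression
```

In a real project you would try several candidates per row, not a single default; the point is only that the first branching question is the problem type.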
Interview Insights
Q: Walk me through a complete ML project from data to production.
A: State the 7 steps clearly: (1) Define problem & success metric, (2) Collect data, (3) EDA to understand distributions, (4) Preprocess — handle nulls, encode categoricals, scale features, (5) Feature selection, (6) Train multiple models + tune hyperparameters via CrossValidation, (7) Evaluate on held-out test set, then deploy via REST API. This answer alone scores you higher than 80% of candidates.
Summary
  • Every ML project follows the same 7-step pipeline
  • Data preprocessing takes ~40% of total project time
  • Choosing the wrong algorithm matters less than poor data quality
  • Always define your evaluation metric BEFORE building models

Phase 2 — Data Preprocessing

#04

Types of Variables

Definition

Variables (features/columns) in a dataset are measured on different scales. The measurement scale determines which preprocessing steps and algorithms you can use. This is one of the most foundational concepts in statistics and ML.

Variable Taxonomy
Type                      | Sub-type   | Properties                | Examples                              | Allowed Operations
Categorical (Qualitative) | Nominal    | No order, just names      | Color: Red/Blue/Green, Country        | Mode, frequency count
Categorical (Qualitative) | Ordinal    | Has order, no equal gaps  | Rating: Low/Med/High, Education level | Mode, median, comparison
Numerical (Quantitative)  | Discrete   | Countable integers        | Number of children, room count        | All arithmetic
Numerical (Quantitative)  | Continuous | Any real value in a range | Height, temperature, price            | All arithmetic + calculus
Code: Detecting Variable Types
variable_types.py
import pandas as pd
import numpy as np

# Create a sample dataset with mixed variable types
df = pd.DataFrame({
    'age': [25, 32, 28, 45, 30],              # Numerical - Discrete
    'salary': [50000.5, 72000.0, 63000.25,
               95000.0, 68000.75],              # Numerical - Continuous
    'city': ['NY','LA','NY','Chicago','LA'],  # Categorical - Nominal
    'education': ['BSc','MSc','BSc',
                  'PhD','MSc'],               # Categorical - Ordinal
    'satisfied': [1,0,1,1,0]                 # Binary (special case)
})

# ── Step 1: Quick dtype overview ──────────────────────
print("--- dtypes ---")
print(df.dtypes)

# ── Step 2: Identify categorical vs numerical ─────────
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_cols   = df.select_dtypes(include=[np.number]).columns.tolist()

print(f"\nCategorical columns: {categorical_cols}")
print(f"Numerical columns:   {numerical_cols}")

# ── Step 3: Cardinality check (unique value count) ────
# High cardinality nominal → might need target encoding
# Low cardinality nominal  → safe for one-hot encoding
print("\n--- Cardinality (unique values per column) ---")
for col in df.columns:
    print(f"  {col:15s}: {df[col].nunique()} unique values")

# ── Step 4: Check continuous vs discrete numerics ─────
print("\n--- Numerical Analysis ---")
print(df[numerical_cols].describe())
Common Mistakes
  • Treating ordinal as nominal: Encoding "Low/Med/High" as [0,1,2] loses order info if you one-hot encode it — use OrdinalEncoder instead
  • Treating numeric-coded categoricals as numeric: ZIP codes, user IDs are stored as integers but are nominal — never use them as continuous features
  • Ignoring binary variables: Binary (0/1) doesn't need any encoding but often needs balancing
Interview Insights
Q: Can you use a nominal variable directly in a linear model?
A: No. Linear models need numeric input. Nominal variables must be encoded first — typically via one-hot encoding. Ordinal variables can be label-encoded since their order is meaningful.
🏋️ Mini Practice Task

Load the Titanic dataset (e.g. seaborn.load_dataset('titanic'), or pd.read_csv on a downloaded copy) and classify every column as: Nominal, Ordinal, Discrete, or Continuous. Then identify which need encoding and which need scaling.

#05

Data Cleaning

Definition

Data cleaning is the process of detecting and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset. "Garbage in, garbage out" — no algorithm can save you from dirty data.

Types of Data Quality Issues
Issue                   | Example                           | Fix
Missing values          | NaN in age column                 | Impute or drop (Topics 6–9)
Duplicates              | Same row appears twice            | drop_duplicates() (Topic 18)
Outliers                | Age = 999                         | IQR/Z-Score (Topics 14–15)
Wrong data types        | Price stored as "object"          | astype() (Topic 19)
Inconsistent categories | "Male", "male", "M" all mean same | str.lower() + map()
Noise                   | Typos, extra spaces               | str.strip(), str.replace()
Code: Comprehensive Data Cleaning Audit
data_cleaning.py
import pandas as pd
import numpy as np

# Create dirty dataset for demo
data = {
    'name': ['Alice', 'Bob', 'Alice', 'Charlie', '  Dave  ', None],
    'age': [25, 30, 25, 999, 28, 35],           # 999 = outlier
    'gender': ['F', 'M', 'F', 'Male', 'm', 'F'], # inconsistent
    'salary': ['50000', '72000', '50000',
               '95000', None, '68000'],           # wrong dtype
}
df = pd.DataFrame(data)

print("=== DIRTY DATA ===")
print(df)
print()

# ── AUDIT FUNCTION: Run this at start of every project ──
def audit_dataframe(df):
    print("📊 SHAPE:", df.shape)
    print("\n📌 DTYPES:\n", df.dtypes)
    print("\n❌ MISSING VALUES:")
    missing = df.isnull().sum()
    print(missing[missing > 0])
    print(f"\n🔁 DUPLICATES: {df.duplicated().sum()}")
    print("\n📈 NUMERICAL STATS:\n", df.describe())

audit_dataframe(df)

# ── FIX 1: Strip whitespace from string columns ───────
df['name'] = df['name'].str.strip()

# ── FIX 2: Standardize categories ─────────────────────
# Normalize gender: 'Male'/'m'/'M' → 'male', 'F'/'female' → 'female'
df['gender'] = df['gender'].str.lower().map({
    'm': 'male', 'male': 'male',
    'f': 'female', 'female': 'female'
})

# ── FIX 3: Fix data types ─────────────────────────────
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')
# errors='coerce' → invalid values become NaN instead of crashing

# ── FIX 4: Replace impossible values with NaN ─────────
df.loc[df['age'] > 120, 'age'] = np.nan   # Age 999 → NaN

print("\n=== CLEANED DATA ===")
print(df)
Summary
  • Always run an audit (shape, dtypes, nulls, duplicates) at project start
  • Inconsistent categories are sneaky bugs — always standardize with str.lower()
  • Use pd.to_numeric(errors='coerce') to safely convert mixed-type columns
  • Replace domain-impossible values (age=999) with NaN before imputation
#06

Missing Values — Concept & Detection

Definition

Missing values are data points that were not recorded or stored. They appear as NaN (Not a Number), None, null, or empty strings in pandas. Most ML algorithms cannot handle missing values — you must deal with them before training.

3 Types of Missingness (Critical for choosing the right fix)
Type                         | Abbreviation | Meaning                                     | Example                                           | Recommended Fix
Missing Completely at Random | MCAR         | Missing for no reason — purely random       | Survey respondent accidentally skipped a question | Safe to drop or impute
Missing at Random            | MAR          | Missing depends on OTHER observed variables | Males less likely to report salary than females   | Impute using other features
Missing Not at Random        | MNAR         | Missing depends on the value itself         | High earners not reporting salary                 | Model the missingness; risky to impute
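Before choosing a fix, it helps to see why missing values must be handled at all: most scikit-learn estimators refuse to fit on NaN. A minimal sketch, with LinearRegression standing in for any standard estimator:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [np.nan], [4.0]])  # one missing feature value
y = np.array([1.0, 2.0, 3.0, 4.0])

try:
    LinearRegression().fit(X, y)
except ValueError as e:
    print("Refused to fit:", e)  # sklearn reports that the input contains NaN
```

A few estimators (e.g. HistGradientBoostingRegressor) accept NaN natively, but they are the exception; plan on imputing or dropping first.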
Code: Detecting & Visualizing Missing Values
missing_values_detection.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# ── Create sample dataset with missing values ──────────
np.random.seed(42)
n = 100
df = pd.DataFrame({
    'age':     np.random.choice([25,30,35,np.nan], n, p=[0.3,0.3,0.3,0.1]),
    'salary':  np.random.choice([50000,60000,np.nan], n, p=[0.4,0.4,0.2]),
    'city':    np.random.choice(['NY','LA',None], n, p=[0.4,0.4,0.2]),
    'score':   np.random.randint(50, 100, n).astype(float),  # no missing
})

# ── Method 1: Count missing values ────────────────────
print("--- Missing Value Counts ---")
print(df.isnull().sum())

# ── Method 2: Percentage missing ─────────────────────
print("\n--- Missing Value % ---")
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
print(missing_pct)

# ── Method 3: Heatmap visualization ──────────────────
plt.figure(figsize=(8, 4))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title("Missing Value Heatmap\n(Yellow = Missing, Purple = Present)")
plt.tight_layout()
plt.show()

# ── Method 4: Threshold-based decision ───────────────
# Rule of thumb:
# <5%  → usually safe to drop rows or impute simply
# 5-30% → impute carefully (mean/median/mode)
# >30%  → consider dropping the COLUMN or using model-based imputation
# >50%  → column is likely unusable

threshold = 30
cols_to_drop = missing_pct[missing_pct > threshold].index.tolist()
print(f"\nColumns with >{threshold}% missing (consider dropping): {cols_to_drop}")
Common Mistakes
  • Imputing before train/test split: This causes data leakage — fit the imputer on train data only, then transform both
  • Treating all missingness equally: MCAR → drop safely; MNAR → dropping biases your model
  • Ignoring that "" (empty string) is not NaN: Always check with df[col].eq("").sum() too
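A minimal sketch of the empty-string pitfall from the last bullet: isnull() does not count "", so normalize empty strings to NaN before any imputation.

```python
import numpy as np
import pandas as pd

s = pd.Series(['NY', '', 'LA', None, ''])

print(s.isnull().sum())        # 1 — only None counts as missing
print(s.eq('').sum())          # 2 — empty strings hide from isnull()

# Normalize: turn empty strings into real NaN
s_clean = s.replace('', np.nan)
print(s_clean.isnull().sum())  # 3 — now all missing values are visible
```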
Summary
  • Missing values appear as NaN/None — most ML algorithms reject them
  • 3 types: MCAR (safe), MAR (impute with other features), MNAR (risky)
  • Always measure % missing before deciding to drop or impute
  • Columns with >50% missing are usually not worth keeping
#07

Handling Missing Values — Dropping

Definition

Dropping removes rows or columns that contain missing values. It's the simplest approach but must be used carefully — dropping too aggressively shrinks your dataset and can introduce bias.

When to Drop (Decision Tree)

Drop ROWS when: Missing % is small (<5%), data is MCAR, and losing rows doesn't bias results significantly.

Drop COLUMNS when: Missing % is very high (>50%), column is non-critical, or column is duplicate information.

NEVER drop when: The missingness is MNAR (high earners hiding salary) — you'd systematically bias your model.

Code: dropna() — All Options
dropping_missing.py
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, np.nan],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, np.nan, np.nan, np.nan, np.nan],  # 80% missing
})
print("Original:\n", df)

# ── Option 1: Drop any row with ANY missing value ─────
df1 = df.dropna()  # how='any' is default
print("\ndropna() — rows with any NaN removed:\n", df1)

# ── Option 2: Drop rows where ALL values are missing ──
df2 = df.dropna(how='all')
print("\ndropna(how='all') — only fully-empty rows removed:\n", df2)

# ── Option 3: Drop rows where specific columns have NaN
df3 = df.dropna(subset=['A', 'B'])  # Only care about A and B
print("\ndropna(subset=['A','B']):\n", df3)

# ── Option 4: Drop columns with too many missing values
# thresh = minimum number of non-NaN values a column needs to be KEPT
min_valid = int(0.5 * len(df))   # here: at least 2 of 5 values present
df4 = df.dropna(axis=1, thresh=min_valid)
print("\nDropped columns with too few valid values:\n", df4)
# C is dropped (only 1 of 5 values present), A and B are kept

# ── Option 5: Threshold-based row dropping ────────────
# Keep rows that have at least 2 non-NaN values
df5 = df.dropna(thresh=2)
print("\ndropna(thresh=2) — rows with at least 2 valid values:\n", df5)

# ── Best Practice: Track what you dropped ─────────────
original_size = len(df)
cleaned = df.dropna()
dropped_count = original_size - len(cleaned)
print(f"\nDropped {dropped_count}/{original_size} rows ({dropped_count/original_size*100:.1f}%)")
Common Mistakes
  • Using dropna() on the full dataset before train/test split — always split first
  • Dropping rows when >20% of data is lost — consider imputation instead
  • Forgetting to reset the index after dropping: use .reset_index(drop=True)
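The three bullets combined into one hedged sketch (the column names and split sizes are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5, np.nan, 7, 8],
    'B': [1, 2, 3, 4, np.nan, 6, 7, 8],
})

# 1) Split FIRST, so the test set never influences cleaning decisions
train, test = train_test_split(df, test_size=0.25, random_state=42)

# 2) Drop missing rows on each part separately,
# 3) and reset the index so it is contiguous again
train = train.dropna().reset_index(drop=True)
test  = test.dropna().reset_index(drop=True)

# 4) Track how much data was lost — rethink if it's more than ~20%
print(f"Train rows kept: {len(train)}, Test rows kept: {len(test)}")
```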
#08

Handling Missing Values — Categorical Imputation

Definition

Imputation means filling in missing values with estimated substitutes rather than dropping them. For categorical variables, common strategies are filling with the mode (most frequent value) or a special "Unknown" / "Missing" category.

Imputation Strategies
Strategy                    | For what type? | When to use
Mode (most frequent)        | Categorical    | Data is roughly uniform, small % missing
"Unknown" / "Missing" label | Categorical    | When absence itself is informative (MNAR)
Mean imputation             | Numerical      | Data is roughly symmetric, no major outliers
Median imputation           | Numerical      | Data is skewed or has outliers (recommended over mean)
KNN imputation              | Both           | When other features can predict the missing value
Forward/Backward fill       | Time series    | Sequential/temporal data
Code: Manual Categorical Imputation
categorical_imputation.py
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'city': ['NY', 'LA', None, 'NY', None, 'LA', 'NY'],
    'product': ['A', None, 'B', 'A', 'B', None, 'A'],
    'salary': [50000, np.nan, 60000, np.nan, 55000, 70000, np.nan],
})

print("Before:\n", df)

# ── Strategy 1: Fill categorical with Mode ────────────
city_mode = df['city'].mode()[0]          # .mode() returns a Series, take [0]
df_mode = df.copy()
df_mode['city'] = df_mode['city'].fillna(city_mode)
print(f"\nMode imputation for city (mode='{city_mode}'):\n", df_mode['city'])

# ── Strategy 2: Fill with "Unknown" (when missing is informative) ──
df_unk = df.copy()
df_unk['city'] = df_unk['city'].fillna('Unknown')
df_unk['product'] = df_unk['product'].fillna('Unknown')
print("\nUnknown imputation:\n", df_unk[['city', 'product']])

# ── Numerical: Mean vs Median ──────────────────────────
print("\n--- Numerical Imputation ---")
df_num = df.copy()

# Mean — affected by outliers
df_num['salary_mean_imp'] = df_num['salary'].fillna(df_num['salary'].mean())

# Median — robust to outliers (PREFERRED for skewed data)
df_num['salary_median_imp'] = df_num['salary'].fillna(df_num['salary'].median())

print(df_num[['salary', 'salary_mean_imp', 'salary_median_imp']])

# ── Group-wise imputation (ADVANCED + POWERFUL) ────────
# Fill salary with the median salary for that CITY group
df['salary_group_imp'] = df.groupby('city')['salary'].transform(
    lambda x: x.fillna(x.median()))
# Caveat: rows whose city is itself missing belong to no group and
# stay NaN; impute or flag 'city' first if that matters
print("\nGroup-wise imputation:\n", df[['city', 'salary', 'salary_group_imp']])
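The strategy table earlier also lists forward/backward fill for time series, which the script above doesn't show. A minimal sketch using pandas ffill()/bfill():

```python
import numpy as np
import pandas as pd

# Daily readings with gaps — sequential data, so neighbors are informative
ts = pd.Series([10.0, np.nan, np.nan, 13.0, np.nan],
               index=pd.date_range('2024-01-01', periods=5))

print(ts.ffill())  # carry last seen value forward: 10, 10, 10, 13, 13
print(ts.bfill())  # pull next seen value backward: 10, 13, 13, 13, NaN (no later value)
```

ffill leaks no future information, so it is usually the safer choice for forecasting features; bfill uses future values and can cause leakage.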
Common Mistakes
  • Using mean for skewed numerical data — always prefer median for salary, prices, counts
  • Imputing with global statistics when group-wise would be more accurate
  • Forgetting to fit imputation on train set only — never use test set statistics
#09

Handling Missing Values — Scikit-Learn Imputers

Definition

Scikit-learn provides production-ready imputer classes that follow the fit()/transform() API — ensuring imputation parameters are learned from training data only, preventing data leakage.

Code: SimpleImputer, KNNImputer, IterativeImputer
sklearn_imputers.py
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # must import first
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = pd.DataFrame({
    'age':    [25, np.nan, 35, 28, np.nan, 45, 22, 30],
    'salary': [np.nan, 60000, 75000, np.nan, 55000, 90000, 48000, 65000],
    'score':  [80, 90, np.nan, 70, 85, np.nan, 95, 88],
})

# ── CRITICAL: Split BEFORE imputation ─────────────────
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

# ── SimpleImputer ─────────────────────────────────────
# strategy options: 'mean', 'median', 'most_frequent', 'constant'
si = SimpleImputer(strategy='median')
si.fit(X_train)                     # Learn medians from TRAIN only
X_train_si = si.transform(X_train) # Fill train
X_test_si  = si.transform(X_test)  # Fill test using TRAIN medians (no leakage)
print("SimpleImputer (median) — train statistics:", si.statistics_)

# ── KNNImputer ────────────────────────────────────────
# Uses K nearest neighbors to estimate missing value
# More accurate than SimpleImputer but slower
knn_imp = KNNImputer(n_neighbors=2)
knn_imp.fit(X_train)
X_train_knn = knn_imp.transform(X_train)
X_test_knn  = knn_imp.transform(X_test)
print("\nKNNImputer result (first row):", X_train_knn[0])

# ── IterativeImputer (MICE) ───────────────────────────
# Regresses each feature with missing values on all other features
# Most powerful but computationally expensive
iter_imp = IterativeImputer(max_iter=10, random_state=42)
iter_imp.fit(X_train)
X_train_iter = iter_imp.transform(X_train)
print("\nIterativeImputer result (first row):", X_train_iter[0])

# ── When to use which? ───────────────────────────────
# SimpleImputer  → fast, simple, good for numerical + categorical
# KNNImputer     → better for correlated features, medium dataset
# IterativeImputer → best quality, expensive, use on small/medium data
Common Mistakes
  • Calling fit_transform() on both train and test — only fit on train
  • Using KNNImputer on large datasets without thinking about O(n²) memory
Summary
  • Always use sklearn imputers in production — they prevent data leakage
  • SimpleImputer for quick work; KNNImputer for correlated data; IterativeImputer for best quality
  • The golden rule: fit on train data, transform both train and test
#10

One Hot Encoding & Dummy Variables

Definition

One Hot Encoding (OHE) converts a nominal categorical variable into multiple binary (0/1) columns — one per unique category. This allows ML algorithms (which expect numbers) to use categorical data without imposing any false ordering.

Intuition

If we encode Color as: Red=1, Blue=2, Green=3 — the model thinks Green > Blue > Red. That's wrong! Colors have no numeric relationship.

OHE creates: [is_Red, is_Blue, is_Green] = [1,0,0], [0,1,0], [0,0,1] — now there's no false ordering.

Dummy Variable Trap
⚠️ The Dummy Variable Trap (Multicollinearity)

If you have 3 colors [Red, Blue, Green], you only need 2 dummy variables. The 3rd is perfectly predictable from the other two (if not Red and not Blue → Green). Including all 3 causes multicollinearity. Solution: drop_first=True

Code: pd.get_dummies vs sklearn OneHotEncoder
one_hot_encoding.py
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'color': ['Red','Blue','Green','Red','Blue'],
    'size':  ['S','M','L','M','S'],
    'price': [10,20,30,15,25]
})

# ── Method 1: pd.get_dummies (fast for EDA) ───────────
# drop_first=True avoids dummy variable trap
df_dummies = pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int)
print("pd.get_dummies (drop_first=True):")
print(df_dummies)
# Creates: color_Green, color_Red  (Blue is the dropped reference)

# ── Method 2: sklearn OneHotEncoder (production use) ─
# MUST use this in pipelines to avoid data leakage
X = df[['color', 'size']]
X_train, X_test = train_test_split(X, test_size=0.4, random_state=42)

ohe = OneHotEncoder(
    drop='first',         # Avoid dummy trap
    sparse_output=False,  # Return dense array
    handle_unknown='ignore'  # Don't crash on unseen categories in test
)

# Fit on TRAIN only — learn the categories from training data
ohe.fit(X_train)

X_train_enc = ohe.transform(X_train)
X_test_enc  = ohe.transform(X_test)

print("\nsklearn OHE categories learned:", ohe.categories_)
print("Feature names:", ohe.get_feature_names_out())
print("Encoded train:\n", X_train_enc)
Common Mistakes
  • Not using drop_first=True — causes multicollinearity in linear models
  • Using pd.get_dummies in production — it doesn't handle unseen categories in test data
  • OHE on high-cardinality columns (eg: ZIP codes with 10,000 values) — use Target Encoding instead
Interview Insights
Q: What is the dummy variable trap and how do you avoid it?
A: When you OHE a column with k categories into k dummy variables, perfect multicollinearity exists because the k-th column is always predictable from the other k-1. Fix: drop one category (drop_first=True). This dropped category becomes the "reference level" — all effects are relative to it.
#11

Label Encoding

Definition

Label Encoding assigns an integer to each unique category: "cat" → 0, "dog" → 1, "rabbit" → 2. It's compact but introduces an artificial ordering. Only use it for the target variable or for tree-based models that don't assume ordering.

Code: LabelEncoder
label_encoding.py
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'animal': ['cat','dog','rabbit','cat','dog'],
    'target': ['spam','ham','spam','ham','spam']
})

le = LabelEncoder()

# For TARGET variable (y) — safe use case
df['target_encoded'] = le.fit_transform(df['target'])
print("Target encoding:")
print(df[['target', 'target_encoded']])
print("Classes:", le.classes_)  # Shows mapping: ham=0, spam=1

# For FEATURES — only safe with tree-based models
df['animal_encoded'] = le.fit_transform(df['animal'])
print("\nAnimal encoding (risky for linear models!):")
print(df[['animal', 'animal_encoded']])
# cat=0, dog=1, rabbit=2
# Linear model would think rabbit(2) > dog(1) > cat(0) — WRONG!

# Decode back from integers
print("\nDecode: [0,2,1] →", le.inverse_transform([0,2,1]))
Common Mistakes
  • Using LabelEncoder on nominal features with linear models — the model thinks 0 < 1 < 2, but "cat < dog < rabbit" has no meaning. Use OHE for nominal features in linear models.
  • LabelEncoder is fine for: target variables, and nominal features in tree models (Random Forest, XGBoost handle it fine)
Summary
  • Label encoding = integer per category. Simple but imposes ordering.
  • Safe for: target variables, tree-based model features
  • Unsafe for: nominal features in linear/distance-based models
#12

Ordinal Encoding

Definition

Ordinal Encoding assigns integers to categories preserving their natural order. Unlike LabelEncoder (which assigns arbitrary order), OrdinalEncoder lets you specify the order: Low=0, Medium=1, High=2.

Code: OrdinalEncoder with Custom Order
ordinal_encoding.py
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'education': ['BSc', 'PhD', 'HSC', 'MSc', 'BSc'],
    'size':      ['M',   'XL', 'S',   'L',   'XS'],
})

# ── WRONG: LabelEncoder assigns arbitrary order ───────
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
print("LabelEncoder (arbitrary):", le.fit_transform(df['education']))
# Could give: BSc=0, HSC=1, MSc=2, PhD=3 — alphabetical, not education level!

# ── RIGHT: OrdinalEncoder with explicit ordering ──────
# You tell it the correct order for each column
enc = OrdinalEncoder(categories=[
    ['HSC', 'BSc', 'MSc', 'PhD'],  # education order
    ['XS', 'S', 'M', 'L', 'XL']     # size order
])

encoded = enc.fit_transform(df)
df_enc = df.copy()
df_enc['education_enc'] = encoded[:, 0]  # HSC=0, BSc=1, MSc=2, PhD=3
df_enc['size_enc']      = encoded[:, 1]  # XS=0, S=1, M=2, L=3, XL=4

print("\nOrdinalEncoder (correct order):")
print(df_enc)
Summary
  • Use OrdinalEncoder when categories have a meaningful order (Low/Med/High, education levels)
  • Always specify the category order explicitly — don't let sklearn guess
  • Use OHE for nominal (unordered) categories, OrdinalEncoder for ordered ones
#13

Outliers — Concept & Handling

Definition

Outliers are data points that differ significantly from other observations. They can be genuine (a CEO's salary in an employee dataset) or errors (age = 999). Outliers distort statistical measures and can severely degrade model performance.

Types of Outliers
Type               | Description                                      | Example                              | Action
Point Outlier      | Single value far from rest                       | Income of $10M in a $50k dataset     | Cap, remove, or model separately
Contextual Outlier | Normal globally, abnormal in context             | 30°C in winter (not summer)          | Context-aware handling
Collective Outlier | A group of values that are collectively abnormal | 5 identical transactions in 1 second | Anomaly detection
Effects on Models
Model               | Sensitivity to Outliers
Linear Regression   | Very sensitive — outliers pull the regression line
Logistic Regression | Moderately sensitive
Decision Trees      | Less sensitive — splits are threshold-based
Random Forest       | Robust — averages many trees
SVM                 | Sensitive — support vectors can shift
KNN                 | Very sensitive — distances distorted
Code: Visualizing Outliers
outlier_visualization.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)
data = np.concatenate([
    np.random.normal(50, 10, 95),  # Normal data
    [150, -30, 200, -50, 180]       # Outliers injected
])
df = pd.DataFrame({'salary': data})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Boxplot — best for outlier visualization
axes[0].boxplot(df['salary'])
axes[0].set_title('Boxplot\n(whiskers = 1.5×IQR, dots = outliers)')

# 2. Histogram — shows skew caused by outliers
axes[1].hist(df['salary'], bins=30, color='steelblue')
axes[1].set_title('Histogram')

# 3. Scatter plot — shows position of outliers
axes[2].scatter(range(len(df)), df['salary'], alpha=0.5)
axes[2].set_title('Scatter Plot')

plt.tight_layout()
plt.show()

# Describe to spot issues
print(df.describe())
# If max >> 75th percentile → likely outliers
# If min << 25th percentile → likely outliers
Summary
  • Outliers = data points far from the rest — genuine or errors
  • Linear models and KNN are most sensitive; tree-based models are more robust
  • Always visualize first (boxplot, histogram) before deciding how to handle them
  • Two detection methods: IQR (robust) and Z-Score (assumes normality)
#14

IQR Method for Outlier Detection

Definition

The Interquartile Range (IQR) method is a robust, non-parametric outlier detection technique. IQR = Q3 − Q1. Points beyond 1.5×IQR from the quartiles are flagged as outliers. It doesn't assume any distribution.

Lower Bound = Q1 − 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR
Outlier if: value < Lower Bound OR value > Upper Bound
Code: IQR Detection & Handling Strategies
iqr_method.py
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'salary': list(np.random.normal(60000, 10000, 95)) + [200000, -5000, 180000, -10000, 190000]
})

# ── Step 1: Calculate IQR ──────────────────────────────
Q1  = df['salary'].quantile(0.25)
Q3  = df['salary'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Q1={Q1:.0f}, Q3={Q3:.0f}, IQR={IQR:.0f}")
print(f"Lower bound: {lower_bound:.0f}")
print(f"Upper bound: {upper_bound:.0f}")

# ── Step 2: Identify outliers ─────────────────────────
is_outlier = (df['salary'] < lower_bound) | (df['salary'] > upper_bound)
print(f"\nOutliers found: {is_outlier.sum()}")
print(df[is_outlier])

# ── Step 3: Choose handling strategy ─────────────────

# STRATEGY A: Remove outliers
df_removed = df[~is_outlier].copy()
print(f"\nAfter removal: {len(df_removed)} rows (was {len(df)})")

# STRATEGY B: Cap/Winsorize (clip to bounds)
# This preserves sample size but limits extreme values
df_capped = df.copy()
df_capped['salary'] = df_capped['salary'].clip(lower=lower_bound, upper=upper_bound)
print(f"\nAfter capping: max={df_capped['salary'].max():.0f}, min={df_capped['salary'].min():.0f}")

# STRATEGY C: Replace with NaN → then impute
df_nan = df.copy()
df_nan.loc[is_outlier, 'salary'] = np.nan
print(f"\nAfter NaN replacement: {df_nan['salary'].isnull().sum()} missing values")

# ── Function to apply IQR to all numerical columns ────
def remove_outliers_iqr(df, multiplier=1.5):
    """Drop rows that fall outside the IQR bounds in ANY numerical column."""
    num_cols = df.select_dtypes(include=[np.number]).columns
    mask = pd.Series(True, index=df.index)
    for col in num_cols:
        Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        IQR = Q3 - Q1
        mask &= (df[col] >= Q1 - multiplier*IQR) & (df[col] <= Q3 + multiplier*IQR)
    return df[mask].reset_index(drop=True)

print(f"\nremove_outliers_iqr: {len(remove_outliers_iqr(df))} rows kept (was {len(df)})")
Interview Insights
Q: Why prefer IQR over Z-Score for outlier detection?
A: IQR is non-parametric — it doesn't assume a normal distribution. Z-Score uses mean and std, which are themselves skewed by outliers, making it circular. Use IQR for skewed data (salary, prices); Z-Score for data you know is approximately normal.
#15

Z-Score Method for Outlier Detection

Definition

The Z-Score measures how many standard deviations a data point is from the mean. Points with |Z| > 3 are typically considered outliers. Assumes the data is approximately normally distributed.

Z = (X − μ) / σ
Outlier if |Z| > 3 (roughly 0.3% of data in a normal distribution)
Code: Z-Score Outlier Detection
zscore_outliers.py
import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(42)
df = pd.DataFrame({
    'height': list(np.random.normal(170, 10, 95)) + [300, 50, 280, -10, 320]
})

# ── Method 1: Manual Z-Score ──────────────────────────
df['z_score'] = (df['height'] - df['height'].mean()) / df['height'].std()

threshold = 3
outliers_z = df[df['z_score'].abs() > threshold]
print(f"Z-Score outliers (|z|>{threshold}): {len(outliers_z)}")
print(outliers_z)

# ── Method 2: scipy zscore (same result) ──────────────
z_scores = np.abs(stats.zscore(df['height']))
df_clean = df[(z_scores <= threshold)]
print(f"\nAfter removal: {len(df_clean)} rows")

# ── Modified Z-Score (robust, uses median) ────────────
# Better than standard Z-score when outliers are present
median = df['height'].median()
mad    = (df['height'] - median).abs().median()  # Median Absolute Deviation
modified_z = 0.6745 * (df['height'] - median) / mad
df_mod_clean = df[modified_z.abs() <= 3.5]
print(f"\nModified Z-Score clean: {len(df_mod_clean)} rows")

# ── IQR vs Z-Score comparison ─────────────────────────
# IQR:      Non-parametric, robust, works for any distribution
# Z-Score:  Parametric, assumes normality, sensitive to extreme outliers
# Modified Z: Best of both - robust + uses normal distribution logic
Summary
  • Z-Score = (x - mean) / std — outlier if |z| > 3
  • Assumes normality — use IQR for skewed data
  • Modified Z-Score (using MAD) is more robust — recommended over standard Z-Score
#16

Feature Scaling — Standardization

Definition

Standardization (Z-score normalization) transforms features so they have mean = 0 and standard deviation = 1. This allows algorithms that use distances or gradients to treat all features equally regardless of their original scale.

X_scaled = (X − μ) / σ
Result: mean = 0, std = 1
Why does scaling matter?

Consider: age (20–60) and salary (30,000–200,000). In KNN, distance is dominated by salary simply because it has larger numbers. The model effectively ignores age. Standardization puts both on equal footing.
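The age/salary example can be checked numerically. A minimal sketch with made-up numbers: measure how much of the squared Euclidean distance between two people comes from age, before and after standardization:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# age, salary for three people; person 0 and 1 differ a lot in age,
# only slightly in salary
X = np.array([
    [25.0, 50_000.0],
    [60.0, 52_000.0],
    [40.0, 80_000.0],
])

diff = X[0] - X[1]
raw_dist = np.linalg.norm(diff)
age_share_before = diff[0] ** 2 / raw_dist ** 2

Xs = StandardScaler().fit_transform(X)
sdiff = Xs[0] - Xs[1]
scaled_dist = np.linalg.norm(sdiff)
age_share_after = sdiff[0] ** 2 / scaled_dist ** 2

print(f"Raw distance {raw_dist:.1f}: age contributes {age_share_before:.1%}")
print(f"Scaled distance {scaled_dist:.2f}: age contributes {age_share_after:.1%}")
# Before scaling the salary gap swamps the age gap; after scaling both count.
```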

Code: StandardScaler
standardization.py
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'age':    [25, 35, 45, 28, 52],
    'salary': [50000, 75000, 95000, 60000, 110000],
    'score':  [80, 90, 70, 85, 95],
})

X = df.values
X_train, X_test = train_test_split(X, test_size=0.4, random_state=42)

# ── StandardScaler ────────────────────────────────────
scaler = StandardScaler()

# Fit on train, transform both — NEVER fit on test!
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

print("Original train data:")
print(pd.DataFrame(X_train, columns=df.columns))

print("\nStandardized train data:")
print(pd.DataFrame(X_train_scaled, columns=df.columns).round(3))

print(f"\nMean after scaling (should be ~0): {X_train_scaled.mean(axis=0).round(3)}")
print(f"Std after scaling (should be ~1):  {X_train_scaled.std(axis=0).round(3)}")

# ── Access scaler parameters ──────────────────────────
print(f"\nLearned means: {scaler.mean_.round(2)}")
print(f"Learned stds:  {scaler.scale_.round(2)}")

# ── Inverse transform ─────────────────────────────────
X_back = scaler.inverse_transform(X_train_scaled)
print("\nInverse transform (back to original):")
print(pd.DataFrame(X_back, columns=df.columns).round(1))
When to use Standardization vs Normalization
Scaling Method Use When Not when
StandardScaler (Standardization) Algorithms that assume roughly normal or centered inputs (SVM, Linear/Logistic Regression, PCA); tolerates mild outliers better than MinMax You need values in a bounded [0,1] range
MinMaxScaler (Normalization) Neural networks; algorithms sensitive to magnitude; when you need bounded [0,1] range Data has significant outliers
Common Mistakes
  • Fitting scaler on the full dataset before train/test split — causes data leakage
  • Scaling the target variable y (for regression) — don't, unless you undo it with inverse_transform
  • Forgetting to scale test data with the SAME scaler that was fit on train
#17

Feature Scaling — Normalization (MinMaxScaler)

Definition

Normalization (Min-Max Scaling) scales all values to the range [0, 1] (or any specified range). It preserves the original distribution shape but compresses it into the specified range.

X_scaled = (X − X_min) / (X_max − X_min)
Result: All values between 0 and 1
Code: MinMaxScaler + RobustScaler
normalization.py
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.model_selection import train_test_split

# Dataset with an outlier to show differences
data = pd.DataFrame({'salary': [30000, 45000, 50000, 60000, 200000]})
# 200000 is an outlier

X_train, X_test = train_test_split(data, test_size=0.4, random_state=42)

# ── MinMaxScaler ──────────────────────────────────────
mms = MinMaxScaler(feature_range=(0, 1))
mms.fit(X_train)
print("MinMax scaled (outlier pulls everything to bottom):")
print(pd.DataFrame(mms.transform(data), columns=['salary_minmax']).round(3))

# ── RobustScaler ──────────────────────────────────────
# Uses median and IQR — robust to outliers!
# X_scaled = (X - median) / IQR
rs = RobustScaler()
rs.fit(X_train)
print("\nRobust scaled (better with outliers):")
print(pd.DataFrame(rs.transform(data), columns=['salary_robust']).round(3))

# ── Visual comparison ─────────────────────────────────
ss = StandardScaler()
ss.fit(X_train)

comparison = pd.DataFrame({
    'original': data['salary'].values,
    'standardized': ss.transform(data).flatten(),
    'normalized(MinMax)': mms.transform(data).flatten(),
    'robust': rs.transform(data).flatten()
})
print("\nScaling Comparison:")
print(comparison.round(3))

# Summary:
# StandardScaler → mean=0, std=1  — distorted by outlier
# MinMaxScaler   → [0,1]           — distorted by outlier
# RobustScaler   → median-centered — handles outliers best
Interview Insights
Q: When would you use RobustScaler over StandardScaler?
A: When the dataset contains significant outliers. StandardScaler's mean and std are influenced by outliers, distorting the scaling. RobustScaler uses median and IQR — both resistant to outliers — giving a more stable scaling. Rule of thumb: RobustScaler when the data contains outliers; StandardScaler when the data is roughly normal without extreme values.
#18

Duplicate Data Handling

Definition

Duplicate rows are identical or near-identical records in a dataset. They can arise from data entry errors, merging datasets, or web scraping. Duplicates bias models by over-representing certain patterns and inflate accuracy metrics.

Code: Detect and Handle Duplicates
duplicates.py
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id':     [1, 2, 2, 3, 4, 4, 5],
    'name':   ['Alice','Bob','Bob','Charlie','Dave','Dave','Eve'],
    'salary': [50000,60000,60000,70000,80000,80001,55000],
    # Note: Dave has slightly different salary — near duplicate
})

# ── Exact duplicates ──────────────────────────────────
print("Exact duplicates:", df.duplicated().sum())
print(df[df.duplicated(keep=False)])  # Show all occurrences

# ── Duplicates based on subset of columns ─────────────
# Same person (name+salary close), different id
print("\nDuplicates by name:", df.duplicated(subset=['name']).sum())

# ── Remove exact duplicates ───────────────────────────
# keep='first': keep first occurrence (default)
# keep='last':  keep last occurrence
# keep=False:   drop ALL duplicates
df_clean = df.drop_duplicates(keep='first').reset_index(drop=True)
print("\nAfter removing exact duplicates:")
print(df_clean)

# ── Remove subset duplicates ──────────────────────────
df_clean2 = df.drop_duplicates(subset=['name'], keep='first').reset_index(drop=True)
print("\nAfter removing name duplicates:")
print(df_clean2)

# ── Near-duplicate detection ──────────────────────────
# Simple heuristic: rows sharing a name are near-duplicate candidates.
# A stricter check could also require salaries within a small tolerance.
df_sorted = df.sort_values('name')
grouped = df_sorted.groupby('name').filter(lambda x: len(x) > 1)
print("\nNear-duplicate candidates (same name):")
print(grouped)
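As a sketch of the fuzzy check mentioned in the comments, the snippet below reuses the toy frame and flags same-name rows whose salaries differ by at most 5% from the previous row:

```python
import pandas as pd

# Same toy frame as above
df = pd.DataFrame({
    'id':     [1, 2, 2, 3, 4, 4, 5],
    'name':   ['Alice', 'Bob', 'Bob', 'Charlie', 'Dave', 'Dave', 'Eve'],
    'salary': [50000, 60000, 60000, 70000, 80000, 80001, 55000],
})

# Sort so candidate duplicates are adjacent, then compare each row
# to the previous one within the same name
df_sorted = df.sort_values(['name', 'salary']).reset_index(drop=True)
same_name = df_sorted['name'].eq(df_sorted['name'].shift())
rel_diff = (df_sorted['salary'] - df_sorted['salary'].shift()).abs() / df_sorted['salary'].shift()
near_dup = same_name & (rel_diff <= 0.05)

df_deduped = df_sorted[~near_dup].reset_index(drop=True)
print(f"Near-duplicates flagged: {near_dup.sum()}")   # second Bob, second Dave
print(df_deduped)
```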
#19

Changing Data Types

Definition

Data types define how data is stored and what operations are valid. Incorrect data types cause errors, performance issues, and wrong calculations. Converting to the right type is a fundamental preprocessing step.

Code: Type Conversion Patterns
data_types.py
import pandas as pd
import numpy as np

# Common raw data type issues
df = pd.DataFrame({
    'age':      ['25', '30', 'bad', '28'],  # should be int, but string
    'salary':   ['$50,000', '$72,000',
                 '$63,000', '$95,000'],      # currency string
    'date':     ['2023-01-15', '2023-03-22',
                 '2023-07-01', '2023-11-30'], # string date
    'is_active':['True','False','True','True'] # string bool
})
print("Original dtypes:\n", df.dtypes)

# ── Fix 1: String → numeric (safe with coerce) ─────────
df['age_clean'] = pd.to_numeric(df['age'], errors='coerce')
# 'bad' → NaN instead of crashing

# ── Fix 2: Currency string → float ────────────────────
df['salary_clean'] = (df['salary']
                       .str.replace('$', '', regex=False)
                       .str.replace(',', '', regex=False)
                       .astype(float))

# ── Fix 3: String → datetime ──────────────────────────
df['date_clean'] = pd.to_datetime(df['date'])
# Now you can do: df['date_clean'].dt.year, .month, .day, .dayofweek
df['year']  = df['date_clean'].dt.year
df['month'] = df['date_clean'].dt.month

# ── Fix 4: String → boolean ───────────────────────────
df['is_active_bool'] = df['is_active'].map({'True': True, 'False': False})

# ── Fix 5: Downcast for memory efficiency ─────────────
# int64 → int32 saves memory on large datasets
df['age_int32'] = df['age_clean'].astype('Int32')  # nullable int

print("\nCleaned dtypes:\n", df.dtypes)
print("\nCleaned data:\n", df[['age_clean','salary_clean','year','month','is_active_bool']])
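The memory point in Fix 5 can be quantified with memory_usage (a sketch on a synthetic frame; exact byte counts vary with pandas version):

```python
import numpy as np
import pandas as pd

n = 1_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'age':  rng.integers(18, 90, n),            # stored as int64 by default
    'city': rng.choice(['NY', 'LA', 'SF'], n),  # stored as Python objects
})

mem_before = df.memory_usage(deep=True).sum()

df['age']  = df['age'].astype('int8')       # every age fits in -128..127
df['city'] = df['city'].astype('category')  # 3 labels -> 1-byte codes

mem_after = df.memory_usage(deep=True).sum()
print(f"Before: {mem_before / 1e6:.1f} MB  ->  After: {mem_after / 1e6:.1f} MB")
```

Downcasting integers and converting low-cardinality strings to `category` routinely cuts memory by an order of magnitude on large frames.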
#20

Function Transformer

Definition

FunctionTransformer wraps any Python function into a scikit-learn compatible transformer. This allows you to apply custom transformations (like log, square root, domain-specific functions) inside sklearn Pipelines.

Code: FunctionTransformer Usage
function_transformer.py
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# Skewed data — log transform helps normalize it
salary = np.array([20000, 30000, 35000, 50000, 200000, 500000])

# ── Basic FunctionTransformer ─────────────────────────
log_transformer = FunctionTransformer(
    func=np.log1p,          # log(1 + x) — handles zeros safely
    inverse_func=np.expm1   # exp(x) - 1 — for inverse transform
)

log_salary = log_transformer.transform(salary.reshape(-1, 1))
print("Original salary:", salary)
print("Log transformed: ", log_salary.flatten().round(2))

# ── Custom function transformer ───────────────────────
def clip_outliers(X, lower_pct=5, upper_pct=95):
    """Clip values to specified percentile range."""
    lower = np.percentile(X, lower_pct)
    upper = np.percentile(X, upper_pct)
    return np.clip(X, lower, upper)

clip_transformer = FunctionTransformer(clip_outliers)
clipped = clip_transformer.transform(salary.reshape(-1, 1))
print("\nClipped (5th-95th pct):", clipped.flatten())

# ── In a Pipeline ─────────────────────────────────────
X = np.random.lognormal(10, 1, (100, 1))
y = np.random.randn(100)

pipeline = Pipeline([
    ('log_transform', FunctionTransformer(np.log1p)),
    ('model',         LinearRegression()),
])
pipeline.fit(X, y)
print("\nPipeline with log transform fitted successfully!")
print("Coef:", pipeline.named_steps['model'].coef_)
Summary
  • FunctionTransformer makes any function sklearn-pipeline-compatible
  • log1p is the go-to for right-skewed distributions (salary, prices, counts)
  • Always define inverse_func if you need to inverse-transform predictions
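When the skewed variable is the target rather than a feature, scikit-learn's TransformedTargetRegressor pairs func and inverse_func automatically, so predictions come back on the original scale. A sketch on synthetic right-skewed targets:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
# exp(...) makes the target right-skewed, like salaries or prices
y = np.exp(1 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(0, 0.1, 200))

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,         # applied to y before fitting
    inverse_func=np.expm1  # applied to predictions automatically
)
model.fit(X, y)
pred = model.predict(X[:3])
print("Predictions (original scale):", pred.round(2))
print("Actual targets:              ", y[:3].round(2))
```

The inverse transform is applied inside `predict`, so callers never see log-scale values.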

Phase 3 — Feature Selection

#21

Backward Elimination

Definition

Backward Elimination is a wrapper feature selection method that starts with all features and iteratively removes the least significant one (highest p-value) until all remaining features meet a significance threshold.

Algorithm Steps

1. Start with ALL features
2. Train model, compute p-values for each feature
3. If max p-value > threshold (usually 0.05): remove that feature
4. Retrain with remaining features
5. Repeat until all p-values < threshold
6. Remaining features = selected set

Code: Backward Elimination with statsmodels
backward_elimination.py
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_diabetes

# Load dataset
data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# ── Backward Elimination ──────────────────────────────
def backward_elimination(X, y, significance_level=0.05):
    """
    Iteratively removes features with p-value > significance_level.
    Returns list of selected feature names.
    """
    cols = list(X.columns)
    
    while True:
        X_with_const = sm.add_constant(X[cols])  # Add intercept
        model = sm.OLS(y, X_with_const).fit()    # Ordinary Least Squares
        
        # Get p-values (exclude constant at index 0)
        pvalues = model.pvalues.drop('const')
        max_pval = pvalues.max()
        
        if max_pval > significance_level:
            worst_feat = pvalues.idxmax()
            print(f"Removing '{worst_feat}' (p-value: {max_pval:.4f})")
            cols.remove(worst_feat)
        else:
            break
    
    return cols

selected = backward_elimination(X, y, significance_level=0.05)
print(f"\n✅ Selected features ({len(selected)}): {selected}")

# ── Using mlxtend SequentialFeatureSelector (wrapper) ─
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

sfs = SequentialFeatureSelector(
    LinearRegression(),
    k_features=5,              # Select top 5 features
    forward=False,             # backward=True
    scoring='r2',
    cv=5
)
sfs.fit(X.values, y)
print("\nmlxtend Backward SFS selected features:", sfs.k_feature_names_)
Interview Insights
Q: What's the difference between wrapper, filter, and embedded feature selection?
A: Filter = select features by statistical tests independent of model (correlation, chi-squared). Wrapper = use a model to evaluate subsets — computationally expensive but model-aware (forward/backward selection). Embedded = feature selection happens during model training (Lasso L1 regularization, tree importance). For production: start with filter (fast), refine with embedded.
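The filter and embedded families from the answer each take only a few lines in scikit-learn. A sketch on the same diabetes data (k=5 and alpha=1.0 are arbitrary choices):

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso

data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Filter: rank features by a univariate F-statistic; no model involved
skb = SelectKBest(score_func=f_regression, k=5).fit(X, y)
filter_feats = list(X.columns[skb.get_support()])
print("Filter (SelectKBest):", filter_feats)

# Embedded: the L1 penalty zeroes out weak coefficients during training
lasso = Lasso(alpha=1.0).fit(X, y)
embedded_feats = list(X.columns[lasso.coef_ != 0])
print("Embedded (Lasso):    ", embedded_feats)
```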
#22

Forward Selection

Definition

Forward Selection starts with zero features and iteratively adds the most significant feature at each step until no remaining feature improves the model above the threshold.

Code: Forward Selection
forward_selection.py
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from mlxtend.feature_selection import SequentialFeatureSelector

data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# ── Manual Forward Selection ──────────────────────────
def forward_selection(X, y, significance_level=0.05):
    import statsmodels.api as sm
    remaining = list(X.columns)
    selected  = []
    
    while remaining:
        best_pval = 1.0
        best_feat = None
        
        for feat in remaining:
            cols = selected + [feat]
            X_const = sm.add_constant(X[cols])
            model   = sm.OLS(y, X_const).fit()
            pval    = model.pvalues[feat]
            
            if pval < best_pval:
                best_pval = pval
                best_feat = feat
        
        if best_feat and best_pval < significance_level:
            selected.append(best_feat)
            remaining.remove(best_feat)
            print(f"Added '{best_feat}' (p-value: {best_pval:.4f})")
        else:
            break
    
    return selected

selected = forward_selection(X, y)
print(f"\n✅ Forward selected ({len(selected)}): {selected}")

# ── mlxtend (cleaner, cross-validated) ───────────────
sfs_fwd = SequentialFeatureSelector(
    LinearRegression(),
    k_features=5,
    forward=True,   # Forward selection
    scoring='r2',
    cv=5
)
sfs_fwd.fit(X.values, y)
print("\nmlxtend Forward SFS features:", sfs_fwd.k_feature_names_)

# When to use Forward vs Backward?
# Forward:   many features, want to build minimal set
# Backward:  moderate features, start broad
# Both:      O(n²) complexity — use filter first for huge feature sets

Phase 4 — Model Training

#23

Train-Test Split

Definition

Train-test split divides the dataset into two sets: a training set (model learns from this) and a test set (used to evaluate final performance on unseen data). This simulates real-world deployment where the model encounters data it was never trained on.

The Core Problem It Solves
⚠️ Why you MUST split data

If you train and evaluate on the same data, the model can "memorize" it (overfit) and report 100% accuracy — but fail completely on new data. Test set = honest evaluation of real-world performance.

Code: train_test_split — All Patterns
train_test_split.py
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# ── Basic split ────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% test, 80% train
    random_state=42,    # Reproducibility
    stratify=y          # IMPORTANT: maintain class distribution
)

print(f"Train size: {X_train.shape}, Test size: {X_test.shape}")

# ── Why stratify=y is critical for classification ─────
# Without stratify: test might have 0 samples of class 2!
print("\nClass distribution in full dataset:")
print(pd.Series(y).value_counts(normalize=True).round(3))
print("Class distribution in train:")
print(pd.Series(y_train).value_counts(normalize=True).round(3))
print("Class distribution in test:")
print(pd.Series(y_test).value_counts(normalize=True).round(3))
# With stratify=y, all three should be ~33% each

# ── Train/Validation/Test split ───────────────────────
# For hyperparameter tuning: 60/20/20
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test     = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

print(f"\nTrain: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
# Train: 90, Val: 30, Test: 30 (out of 150)

# ── Common split ratios ───────────────────────────────
# Small dataset (<1000):   70/30 or 60/20/20
# Medium (1000-10000):     80/20 or 70/10/20
# Large (>10000):          90/10 (more train data = better model)
Common Mistakes
  • Not using stratify=y for classification — imbalanced splits mislead evaluation
  • Preprocessing (scaling, imputing) the full dataset before splitting — data leakage!
  • Not setting random_state — results are not reproducible
  • Making the test set too small — noisy evaluation; too large — less data for training
Interview Insights
Q: What is data leakage and how does train-test split prevent it?
A: Data leakage = when information from outside the training set influences model training or evaluation, giving an overly optimistic performance estimate. If you impute missing values using the full dataset mean (including test data), the model has "seen" test data indirectly. Solution: fit all transformers on train data only, then apply to test. Train-test split enforces this discipline.
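The leakage point translates directly into code: learn imputation statistics from the training rows only, then apply them to both splits. A sketch with a synthetic column containing NaNs:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
X = rng.normal(50, 10, size=(100, 1))
X[rng.choice(100, size=10, replace=False)] = np.nan   # inject missing values

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# WRONG: computing the fill value on ALL rows lets test data leak in
leaky_mean = np.nanmean(X)

# RIGHT: learn the fill value from the training rows only
imp = SimpleImputer(strategy='mean')
imp.fit(X_train)
X_test_imputed = imp.transform(X_test)

print(f"Full-data mean (leaky): {leaky_mean:.3f}")
print(f"Train-only mean:        {imp.statistics_[0]:.3f}")
# Any gap between the two is information borrowed from the test set.
```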

Phase 5 — Regression

#24

Regression Analysis

Definition

Regression is a supervised learning task that predicts a continuous numerical output based on one or more input features. The goal is to find the mathematical relationship between inputs and the target variable.

Types of Regression
Type Formula/Description Use When
Simple Linear y = mx + b 1 feature, linear relationship
Multiple Linear y = b₀ + b₁x₁ + b₂x₂ + ... Multiple features, linear relationship
Polynomial y = b₀ + b₁x + b₂x² + ... Curved/nonlinear relationship
Ridge (L2) Linear + L2 penalty Multicollinearity, prevent overfitting
Lasso (L1) Linear + L1 penalty Feature selection built-in
Regression Evaluation Metrics
Metric Formula Interpretation Range
MAE mean(|y - ŷ|) Average absolute error — intuitive, robust [0, ∞)
MSE mean((y - ŷ)²) Penalizes large errors heavily [0, ∞)
RMSE √MSE Same unit as target — most interpretable [0, ∞)
R² 1 - SS_res/SS_tot % variance explained (1 = perfect) (-∞, 1]
Adj. R² 1 - (1-R²)(n-1)/(n-k-1) R² penalized for extra features (-∞, 1]
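The metric formulas in the table can be verified against sklearn on a tiny hand-checkable example (arbitrary numbers):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 8.5])

mae  = np.mean(np.abs(y_true - y_pred))        # mean(|y - ŷ|)
mse  = np.mean((y_true - y_pred) ** 2)         # mean((y - ŷ)²)
rmse = np.sqrt(mse)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2   = 1 - ss_res / ss_tot                     # 1 - SS_res/SS_tot

print(f"Manual:  MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}, R²={r2:.3f}")
print(f"sklearn: MAE={mean_absolute_error(y_true, y_pred):.3f}, "
      f"MSE={mean_squared_error(y_true, y_pred):.3f}, "
      f"R²={r2_score(y_true, y_pred):.3f}")
```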
#25

Linear Regression — Theory

Definition

Linear Regression models the relationship between input X and output y as a straight line: y = β₀ + β₁X + ε. It finds the line that minimizes the sum of squared residuals (differences between actual and predicted values).

Intuition

🏠 House price analogy: "For every additional 100 sqft, price increases by $10,000." That relationship IS a linear regression. β₀ = base price (intercept), β₁ = price per sqft (slope).

How It Learns: Ordinary Least Squares
Minimize: J(β) = Σ(yᵢ − ŷᵢ)² = Σ(yᵢ − β₀ − β₁xᵢ)²

Closed-form solution: β = (XᵀX)⁻¹ Xᵀy

No iteration needed — there's an exact mathematical solution. This is why linear regression is fast even on large datasets.
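The closed-form solution can be checked directly against scikit-learn. A sketch on synthetic data; it solves the normal equations with np.linalg.solve rather than forming the explicit inverse:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 + 2.0 * X.ravel() + rng.normal(0, 0.5, size=100)

# Normal equations: solve (XᵀX) β = Xᵀy, with a ones column for the intercept
X_design = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print(f"Normal equations: intercept={beta[0]:.3f}, slope={beta[1]:.3f}")

lr = LinearRegression().fit(X, y)
print(f"sklearn:          intercept={lr.intercept_:.3f}, slope={lr.coef_[0]:.3f}")
```

Both recover the same coefficients; sklearn uses a numerically stabler least-squares routine internally, but the underlying math is the formula above.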

5 Key Assumptions (MUST know for interviews)
Assumption What It Means How to Check
Linearity X and y have linear relationship Scatter plot
Independence Observations are independent Domain knowledge
Homoscedasticity Residuals have constant variance Residual vs fitted plot
Normality of residuals Residuals are normally distributed QQ-plot
No multicollinearity Features not highly correlated with each other Correlation matrix, VIF
Interview Insights
Q: Does linear regression assume normality of the INPUT features?
A: No! A very common misconception. Linear regression assumes normality of the RESIDUALS (errors), not the input features. Your features can be any distribution. However, if residuals are highly non-normal, the p-values and confidence intervals won't be valid.
#26

Linear Regression — Practical

Code: Full Linear Regression Pipeline
linear_regression_practical.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# ── Generate synthetic house price data ───────────────
np.random.seed(42)
n = 200
sqft   = np.random.uniform(500, 3000, n)
price  = 100 + 0.15 * sqft + np.random.normal(0, 20, n)  # True: y = 100 + 0.15x + noise

df = pd.DataFrame({'sqft': sqft, 'price': price})

# ── Step 1: Prepare data ──────────────────────────────
X = df[['sqft']]
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ── Step 2: Train model ───────────────────────────────
model = LinearRegression()
model.fit(X_train, y_train)

print(f"Intercept (β₀): {model.intercept_:.2f}")
print(f"Coefficient (β₁): {model.coef_[0]:.2f}")
print(f"Equation: price = {model.intercept_:.2f} + {model.coef_[0]:.2f} × sqft")

# ── Step 3: Predict ───────────────────────────────────
y_pred = model.predict(X_test)

# ── Step 4: Evaluate ─────────────────────────────────
mae  = mean_absolute_error(y_test, y_pred)
mse  = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2   = r2_score(y_test, y_pred)

print(f"\nMAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.4f}")  # Should be ~0.97+ with clean linear data

# ── Step 5: Residual Analysis ─────────────────────────
residuals = y_test - y_pred

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Actual vs Predicted
axes[0].scatter(y_test, y_pred, alpha=0.5)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
axes[0].set_xlabel('Actual'); axes[0].set_ylabel('Predicted')
axes[0].set_title('Actual vs Predicted\n(Perfect = on the red line)')

# Plot 2: Residuals vs Fitted
axes[1].scatter(y_pred, residuals, alpha=0.5)
axes[1].axhline(y=0, color='r', linestyle='--')
axes[1].set_xlabel('Fitted Values'); axes[1].set_ylabel('Residuals')
axes[1].set_title('Residuals vs Fitted\n(Good = random scatter around 0)')

# Plot 3: Residual histogram
axes[2].hist(residuals, bins=20, color='steelblue')
axes[2].set_title('Residual Distribution\n(Good = roughly normal, centered at 0)')

plt.tight_layout(); plt.show()
Common Mistakes
  • Not plotting residuals — you'll miss heteroscedasticity and nonlinearity
  • R² alone is a misleading metric — always check residual plots too
  • Negative R² is possible (model worse than predicting mean) — something is very wrong
🏋️ Mini Practice Task

Load sklearn's California Housing dataset. Build a linear regression predicting house prices. Print MAE, RMSE, R². Plot Actual vs Predicted. Is the relationship linear? What does the residual plot tell you?

#27

Multiple Linear Regression

Definition

Multiple Linear Regression extends simple linear regression to use multiple input features simultaneously. Each feature gets its own coefficient (weight) representing its independent contribution to the prediction.

ŷ = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ... + βₙxₙ
β₀ = intercept, βᵢ = coefficient for feature xᵢ
Code: Multiple Linear Regression
multiple_linear_regression.py
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.datasets import fetch_california_housing
import matplotlib.pyplot as plt
import seaborn as sns

# ── Load dataset ──────────────────────────────────────
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='Price')

print("Features:", X.columns.tolist())
print("Shape:", X.shape)

# ── Check multicollinearity with correlation matrix ───
plt.figure(figsize=(10, 8))
corr = X.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Feature Correlation Matrix\n(High values = multicollinearity risk)')
plt.tight_layout(); plt.show()

# ── Train/test split ──────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ── Fit model ─────────────────────────────────────────
model = LinearRegression()
model.fit(X_train, y_train)

# ── Coefficients interpretation ───────────────────────
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
}).sort_values('Coefficient', ascending=False)
print("\nCoefficients (show each feature's independent effect):")
print(coef_df.to_string(index=False))

# ── Evaluate ─────────────────────────────────────────
y_pred = model.predict(X_test)
r2   = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"\nR²: {r2:.4f}, RMSE: {rmse:.4f}")

# ── VIF: Variance Inflation Factor for multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

X_const = sm.add_constant(X_train)
vif_data = pd.DataFrame()
vif_data["Feature"] = X_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_const.values, i)
                   for i in range(X_const.shape[1])]
print("\nVIF (>10 = multicollinearity problem):")
print(vif_data.sort_values('VIF', ascending=False))
Common Mistakes
  • Not checking multicollinearity: VIF > 10 = features are redundant, coefficients become unreliable
  • Interpreting coefficients without scaling: Compare magnitudes only after standardizing features
  • Adding too many features: More features → more risk of overfitting and multicollinearity
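The second mistake above — compare coefficient magnitudes only after standardizing — can be demonstrated with a small sketch (synthetic data; the feature names are purely illustrative). Changing one feature's units from kilometers to meters shrinks its raw coefficient 1000×, while the standardized coefficients stay comparable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
size_m2 = rng.uniform(50, 200, n)            # house size in square meters
dist_km = rng.uniform(0.1, 20, n)            # distance to city center, km
y = 3.0 * size_m2 - 15.0 * dist_km + rng.normal(0, 10, n)

# Store distance in METERS — same information, different units
X = np.column_stack([size_m2, dist_km * 1000])

raw = LinearRegression().fit(X, y).coef_
std = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_

# The raw distance coefficient is tiny purely because of the unit change,
# so raw magnitudes say nothing about relative importance
print("raw coefficients:         ", raw.round(4))
print("standardized coefficients:", std.round(2))
```

On the standardized scale both features have coefficients of comparable magnitude; on the raw scale the distance coefficient looks negligible.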
#28

Polynomial Regression

Definition

Polynomial Regression extends linear regression by adding polynomial terms (x², x³, ...) as new features. Although it fits a curve, it is still a linear model — the parameters enter linearly; only the features are transformed.

ŷ = β₀ + β₁x + β₂x² + β₃x³ + ... + βₙxⁿ
Key insight: this is linear regression on [x, x², x³] as features
Code: Polynomial Regression
polynomial_regression.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline

np.random.seed(42)
X = np.random.uniform(-3, 3, 100)
y = 0.5 * X**3 - 2 * X**2 + X + np.random.normal(0, 2, 100)
X = X.reshape(-1, 1)

# ── Compare degrees ───────────────────────────────────
fig, axes = plt.subplots(1, 4, figsize=(18, 4), sharey=True)
X_plot = np.linspace(-3, 3, 300).reshape(-1, 1)

for i, degree in enumerate([1, 2, 3, 10]):
    # Pipeline: transform features → fit linear regression
    pipe = Pipeline([
        ('poly',  PolynomialFeatures(degree=degree, include_bias=False)),
        ('model', LinearRegression())
    ])
    pipe.fit(X, y)
    y_plot = pipe.predict(X_plot)
    r2 = r2_score(y, pipe.predict(X))
    
    axes[i].scatter(X, y, alpha=0.4, s=20)
    axes[i].plot(X_plot, y_plot, color='red', lw=2)
    axes[i].set_title(f"Degree {degree}\nR² = {r2:.3f}")
    
    if degree == 10:
        axes[i].set_title(f"Degree {degree}\nR² = {r2:.3f}\n⚠️ OVERFITTING")

plt.suptitle('Polynomial Regression: Degree Comparison', y=1.02)
plt.tight_layout()
plt.show()

# ── PolynomialFeatures explained ─────────────────────
poly = PolynomialFeatures(degree=2, include_bias=False)
X_2feat = np.array([[2, 3]])  # [x1, x2]
X_poly = poly.fit_transform(X_2feat)
print("Input: [x1, x2]")
print("Poly degree 2 output: [x1, x2, x1², x1·x2, x2²]")
print("Feature names:", poly.get_feature_names_out(['x1','x2']))
print("Values:", X_poly)
Common Mistakes
  • High degree polynomial = overfitting (memorizes noise, fails on test data)
  • Not scaling features — polynomial terms create extremely large values without scaling
  • Using polynomial regression when you should use a tree-based model — trees handle nonlinearity naturally
#29

Cost Function

Definition

A cost function (loss function) quantifies how wrong the model's predictions are. During training, the algorithm adjusts model parameters to minimize this cost. Understanding the cost function is essential — it defines what "learning" means.

Cost Functions for Regression
Cost Function Formula Properties
MSE (OLS) Σ(y−ŷ)²/n Penalizes large errors heavily; differentiable everywhere; default for linear regression
MAE Σ|y−ŷ|/n Robust to outliers; not differentiable at 0
Huber Loss Quadratic (MSE-like) for |error| ≤ δ, linear (MAE-like) beyond Best of both worlds — use for outlier-prone data
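The three losses in the table behave very differently on the same residuals when one of them is an outlier. A minimal sketch (made-up residual values):

```python
import numpy as np

residuals = np.array([0.5, -1.0, 0.8, -0.3, 10.0])  # last value = outlier

mse = np.mean(residuals**2)          # dominated by the outlier's squared error
mae = np.mean(np.abs(residuals))     # outlier counts linearly

def huber(r, delta=1.0):
    # Quadratic inside |r| <= delta, linear outside — outliers grow slowly
    quad = 0.5 * r**2
    lin  = delta * (np.abs(r) - 0.5 * delta)
    return np.mean(np.where(np.abs(r) <= delta, quad, lin))

print(f"MSE:   {mse:.2f}")
print(f"MAE:   {mae:.2f}")
print(f"Huber: {huber(residuals):.2f}")
```

MSE blows up on the single outlier, MAE treats it like any other error, and Huber lands in between — which is exactly why Huber is the robust compromise.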
Code: Visualizing Cost Functions + Gradient Descent
cost_function.py
import numpy as np
import matplotlib.pyplot as plt

# Simple 1D regression to visualize cost landscape
np.random.seed(42)
X = np.random.uniform(0, 10, 50)
y = 3 * X + np.random.normal(0, 3, 50)

# Cost = MSE for different values of slope (β₁)
slopes = np.linspace(-2, 8, 200)
mse_costs = []

for slope in slopes:
    y_pred = slope * X      # intercept=0 for simplicity
    mse = np.mean((y - y_pred)**2)
    mse_costs.append(mse)

# The minimum of this U-shaped curve = optimal β₁
optimal_slope = slopes[np.argmin(mse_costs)]
print(f"Optimal slope found by scanning: {optimal_slope:.2f}")

plt.figure(figsize=(8, 4))
plt.plot(slopes, mse_costs, 'b-', lw=2)
plt.axvline(x=optimal_slope, color='r', linestyle='--', label=f'Min at β={optimal_slope:.2f}')
plt.xlabel('Slope (β₁)'); plt.ylabel('MSE Cost')
plt.title('MSE Cost Landscape\n(Training finds the bottom of this bowl)')
plt.legend(); plt.tight_layout(); plt.show()

# ── Gradient Descent from scratch ────────────────────
# (batch version; sklearn's SGDRegressor uses a stochastic variant of this)
def gradient_descent(X, y, learning_rate=0.01, epochs=1000):
    n = len(X)
    beta = 0.0  # start at 0
    costs = []
    
    for epoch in range(epochs):
        y_pred = beta * X
        cost   = np.mean((y - y_pred)**2)
        # Gradient of MSE w.r.t. beta = -2/n * Σ(y - ŷ)*x
        gradient = -2 / n * np.sum((y - y_pred) * X)
        beta -= learning_rate * gradient  # Update step
        costs.append(cost)
    
    return beta, costs

beta_found, costs = gradient_descent(X, y)
print(f"Gradient Descent found β = {beta_found:.4f} (true = 3.0)")
Interview Insights
Q: Why does linear regression use MSE instead of MAE as cost function?
A: MSE has a convenient mathematical property: its gradient is smooth and differentiable everywhere, making it easy to optimize analytically (closed-form OLS solution) and with gradient descent. MAE's gradient is discontinuous at zero. Additionally, MSE's quadratic penalty gives a unique optimal solution. However, MSE is more sensitive to outliers — Huber loss is a robust alternative.
#30

R² and Adjusted R²

Definition

R² (coefficient of determination) measures the proportion of variance in y explained by the model. Adjusted R² penalizes adding irrelevant features, making it the preferred metric for model comparison.

R² = 1 − SS_res / SS_tot
SS_res = Σ(y − ŷ)² (residual sum of squares)
SS_tot = Σ(y − ȳ)² (total sum of squares)

Adjusted R² = 1 − (1−R²) × (n−1) / (n−k−1) where n = samples, k = number of features
Code: R² and Adjusted R²
r_squared.py
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

housing = fetch_california_housing()
X_full = housing.data
y      = housing.target

X_train, X_test, y_train, y_test = train_test_split(X_full, y, test_size=0.2, random_state=42)

# ── R² from scratch ───────────────────────────────────
def manual_r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred)**2)
    ss_tot = np.sum((y_true - np.mean(y_true))**2)
    return 1 - (ss_res / ss_tot)

# ── Adjusted R² ──────────────────────────────────────
def adjusted_r2(r2, n, k):
    """
    r2: R² value
    n:  number of samples
    k:  number of features (predictors)
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# ── Compare R² vs Adj-R² as we add noise features ─────
# Adding random noise features should NOT improve Adj-R²
print(f"{'Features':>10} {'R²':>10} {'Adj R²':>10}")
print("-" * 35)

for n_noise in [0, 5, 20, 50]:
    np.random.seed(42)
    if n_noise > 0:
        noise = np.random.randn(X_train.shape[0], n_noise)
        X_tr  = np.hstack([X_train, noise])
        noise_t = np.random.randn(X_test.shape[0], n_noise)
        X_ts  = np.hstack([X_test, noise_t])
    else:
        X_tr, X_ts = X_train, X_test
    
    model = LinearRegression()
    model.fit(X_tr, y_train)
    y_pred = model.predict(X_ts)
    
    r2  = r2_score(y_test, y_pred)
    n   = X_ts.shape[0]
    k   = X_tr.shape[1]
    adj = adjusted_r2(r2, n, k)
    
    print(f"{X_train.shape[1]+n_noise:>10} {r2:>10.4f} {adj:>10.4f}")

# Key insight:
# TRAINING R² never decreases as you add features (even pure noise!)
# On the TEST set (as here), noise features tend to drag R² down slightly —
# and Adj R² additionally penalizes the extra k, so it drops faster
# → Always use Adj R² (or held-out metrics) for model comparison
Interview Insights
Q: When would R² be misleading?
A: (1) Comparing models with different numbers of features — use Adjusted R² instead. (2) R² can be high even when the model is wrong — always check residual plots. (3) R² = 0 doesn't mean no relationship, just that a linear relationship explains nothing. (4) For time series with trends, R² can be misleadingly high.
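Point (2) above — high R² with a wrong model — is easy to demonstrate (a sketch on synthetic data): a straight line fit to clearly quadratic data still scores a high R², and only the residuals reveal the misspecification.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = x**2 + rng.normal(0, 3, 200)          # truly quadratic relationship

model = LinearRegression().fit(x.reshape(-1, 1), y)
y_pred = model.predict(x.reshape(-1, 1))
r2 = r2_score(y, y_pred)
print(f"R² of a LINEAR fit to quadratic data: {r2:.3f}")  # deceptively high

# The residuals have a systematic U-shape: positive at both ends,
# negative in the middle — a correct model leaves no such pattern
residuals = y - y_pred
print("mean residual in left/middle/right thirds:",
      round(residuals[:66].mean(), 1),
      round(residuals[66:133].mean(), 1),
      round(residuals[133:].mean(), 1))
```

The lesson: a single summary number can hide a badly misspecified model — always plot residuals.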
#31

Best Fit Line

Definition

The "best fit line" (regression line) is the line that minimizes the sum of squared vertical distances between the data points and the line itself. It passes through (x̄, ȳ) and is uniquely defined by the OLS solution.

Code: Best Fit Line Visualization + Residuals
best_fit_line.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

np.random.seed(42)
X = np.random.uniform(1, 10, 30)
y = 2 * X + np.random.normal(0, 2, 30)

model = LinearRegression().fit(X.reshape(-1,1), y)
y_pred = model.predict(X.reshape(-1,1))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ── Plot 1: Scatter + best fit line + residuals ───────
axes[0].scatter(X, y, color='steelblue', zorder=5, label='Data')
X_line = np.linspace(X.min(), X.max(), 100)
axes[0].plot(X_line, model.predict(X_line.reshape(-1,1)), 'r-', lw=2, label='Best Fit Line')

# Draw residuals as vertical lines
for xi, yi, yp in zip(X, y, y_pred):
    axes[0].plot([xi, xi], [yi, yp], 'g--', alpha=0.4, lw=1)

axes[0].axvline(x=X.mean(), color='purple', linestyle=':', label=f'x̄ = {X.mean():.1f}')
axes[0].axhline(y=y.mean(), color='orange', linestyle=':', label=f'ȳ = {y.mean():.1f}')
axes[0].set_title(f'Best Fit Line\nGreen lines = residuals (what we minimize)\nβ₀={model.intercept_:.2f}, β₁={model.coef_[0]:.2f}')
axes[0].legend()

# ── Plot 2: Residuals squared (what OLS actually minimizes) ──
residuals = y - y_pred
axes[1].bar(range(len(residuals)), residuals**2, color='coral')
axes[1].set_title(f'Squared Residuals\nTotal = {(residuals**2).sum():.2f} (OLS minimizes this)')
axes[1].set_xlabel('Sample Index')
axes[1].set_ylabel('Squared Residual')

plt.tight_layout()
plt.show()

print(f"Best fit: y = {model.intercept_:.2f} + {model.coef_[0]:.2f}x")
print(f"True:     y = 0 + 2x")
print(f"Close! Noise causes small deviation.")
#32

Lasso (L1) & Ridge (L2) — Theory

Definition

Ridge and Lasso are regularized versions of linear regression that add a penalty term to prevent overfitting by discouraging large coefficients. They're essential when you have many features or multicollinearity.

The Core Idea: Why Regularization?

When a linear model overfits, it assigns large coefficients to noise features. Regularization adds a penalty for large coefficients, forcing the model to be simpler. This trades a tiny bit of bias for a large reduction in variance.

Bias-Variance Tradeoff: Underfitting = High Bias. Overfitting = High Variance. Regularization = Find the sweet spot.

OLS: Minimize Σ(y − ŷ)²

Ridge: Minimize Σ(y − ŷ)² + λΣβᵢ² (L2 penalty)

Lasso: Minimize Σ(y − ŷ)² + λΣ|βᵢ| (L1 penalty)

Elastic Net: Ridge + Lasso combined
Ridge vs Lasso — Key Differences
Property Ridge (L2) Lasso (L1)
Penalty Sum of squared coefficients Sum of absolute coefficients
Effect on coefficients Shrinks toward 0, but rarely exactly 0 Can shrink to EXACTLY 0 (feature selection!)
Feature selection No — keeps all features small Yes — eliminates irrelevant features
Best for Multicollinearity; all features are relevant When you suspect many features are irrelevant
Hyperparameter α (λ) Higher α = more shrinkage Higher α = more features set to 0
Interview Insights
Q: Why does Lasso produce sparse models (exact zeros) but Ridge doesn't?
A: Geometrically: Ridge's constraint region is a circle (smooth), and the OLS objective function's elliptical contours usually touch it at non-zero points. Lasso's constraint region is a diamond with corners — the corners lie on axes, where coefficients are exactly zero. The OLS contours are likely to first touch the diamond at a corner. Algebraically: L1 gradient doesn't exist at 0, so the subgradient condition allows exact zeros.
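The sparsity claim can be checked numerically (a sketch on synthetic data; the alpha values are illustrative): with 5 informative features and 15 pure-noise features, Lasso zeroes most of the noise while Ridge shrinks everything but zeroes nothing.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 20))
true_coef = np.zeros(20)
true_coef[:5] = [3, -2, 1.5, 4, -1]        # only the first 5 features matter
y = X @ true_coef + rng.normal(0, 1, 300)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge: coefficients shrink but essentially never hit exactly zero
# Lasso: most noise-feature coefficients are set to EXACTLY zero
print("Ridge exact zeros:", int(np.sum(ridge.coef_ == 0)), "/ 20")
print("Lasso exact zeros:", int(np.sum(lasso.coef_ == 0)), "/ 20")
```

The informative features survive Lasso's soft-thresholding because their true effects are much larger than the penalty level.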
#33

Lasso & Ridge — Practical (Continued)

Code: Coefficient Paths + RidgeCV / LassoCV
ridge_lasso_paths.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score

housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
sc = StandardScaler()
Xtr = sc.fit_transform(X_train)
Xte = sc.transform(X_test)

# ── Coefficient Path: how coefficients shrink with alpha ─
alphas = np.logspace(-3, 3, 100)
ridge_coefs, lasso_coefs = [], []
for a in alphas:
    ridge_coefs.append(Ridge(alpha=a).fit(Xtr, y_train).coef_)
    lasso_coefs.append(Lasso(alpha=a, max_iter=5000).fit(Xtr, y_train).coef_)

ridge_coefs = np.array(ridge_coefs)
lasso_coefs = np.array(lasso_coefs)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
for i, name in enumerate(X.columns):
    ax1.plot(alphas, ridge_coefs[:, i], label=name)
    ax2.plot(alphas, lasso_coefs[:, i], label=name)

ax1.set_xscale('log'); ax1.set_title('Ridge Coefficient Path\n(All shrink but stay nonzero)')
ax1.set_xlabel('Alpha (regularization strength)'); ax1.legend(fontsize=8)
ax2.set_xscale('log'); ax2.set_title('Lasso Coefficient Path\n(Features drop to EXACT zero)')
ax2.set_xlabel('Alpha (regularization strength)'); ax2.legend(fontsize=8)
plt.tight_layout(); plt.show()

# ── Auto-tune alpha with Cross-Validation ─────────────
# RidgeCV / LassoCV test many alphas internally via CV
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(Xtr, y_train)
print(f"RidgeCV best alpha: {ridge_cv.alpha_:.4f}")
print(f"RidgeCV R²: {r2_score(y_test, ridge_cv.predict(Xte)):.4f}")

lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=5000)
lasso_cv.fit(Xtr, y_train)
print(f"\nLassoCV best alpha: {lasso_cv.alpha_:.4f}")
print(f"LassoCV R²: {r2_score(y_test, lasso_cv.predict(Xte)):.4f}")

n_zero = np.sum(lasso_cv.coef_ == 0)
print(f"LassoCV zeroed out {n_zero}/{X.shape[1]} features (automatic feature selection!)")
Common Mistakes
  • Not scaling before Ridge/Lasso — regularization penalizes coefficient magnitude; unscaled features get unfairly penalized
  • Using alpha=0 in Ridge — that's just OLS, no regularization
  • Forgetting that Lasso can set ALL features to zero if alpha is too high
Interview Insights
Q: When would you choose Elastic Net over Ridge or Lasso alone?
A: When you have many correlated features. Lasso arbitrarily picks one from each correlated group and zeros the rest (unstable). Ridge keeps all but doesn't do selection. Elastic Net (l1_ratio between 0 and 1) groups correlated features together and selects or drops them as a group — best of both worlds. Tune l1_ratio: closer to 1 = more Lasso-like, closer to 0 = more Ridge-like.
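The grouping behavior described above can be sketched with ElasticNetCV on synthetic data (two nearly identical features carrying the same signal; the l1_ratio grid is illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)
x1 = signal + rng.normal(0, 0.05, n)       # two near-duplicate features
x2 = signal + rng.normal(0, 0.05, n)
x3 = rng.normal(size=n)                    # irrelevant feature
X = np.column_stack([x1, x2, x3])
y = 2 * signal + rng.normal(0, 0.5, n)

# ElasticNetCV tunes both alpha and l1_ratio via cross-validation
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=0)
enet.fit(X, y)
print("chosen l1_ratio:", enet.l1_ratio_)
print("chosen alpha:   ", round(enet.alpha_, 4))
print("coefficients:   ", enet.coef_.round(3))
```

The L2 part of the penalty encourages the correlated pair to share the weight (their coefficients sum to roughly the true combined effect of 2), rather than Lasso's behavior of arbitrarily picking one and zeroing the other.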
Summary
  • Ridge shrinks coefficients but keeps all features — good for multicollinearity
  • Lasso does feature selection by zeroing coefficients — good when many features are irrelevant
  • Elastic Net combines both — best for correlated feature groups
  • Always scale features before applying any regularized model
  • Use RidgeCV / LassoCV to auto-select optimal alpha via cross-validation
🏋️ Mini Practice Task

Load the California housing dataset (the old Boston dataset has been removed from scikit-learn). Add 10 random noise features to X. Compare R² and Adjusted R² of: OLS, Ridge, Lasso. Show which features Lasso zeroes out. Which model is best and why?

⚡ Project 1 — After Regression

🚀P1

Real-World Project: House Price Predictor

🏠 House Price Prediction Pipeline

Dataset: California Housing (sklearn) or Kaggle Ames Housing

Goal: Build a production-grade regression pipeline with all preprocessing steps chained together.

Code: End-to-End Regression Pipeline
project1_house_price.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# ── 1. LOAD DATA ──────────────────────────────────────
data = fetch_california_housing(as_frame=True)
df = data.frame
print("Shape:", df.shape)
print(df.describe().round(2))

# ── 2. FEATURE ENGINEERING ───────────────────────────
df['rooms_per_household']   = df['AveRooms']   / df['AveOccup']
df['bedrooms_per_room']     = df['AveBedrms']  / df['AveRooms']
df['population_per_household'] = df['Population'] / df['AveOccup']

feature_cols = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                'Population', 'AveOccup', 'Latitude', 'Longitude',
                'rooms_per_household', 'bedrooms_per_room', 'population_per_household']

X = df[feature_cols]
y = df['MedHouseVal']

# ── 3. SPLIT ─────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ── 4. BUILD PIPELINES ───────────────────────────────
pipe_lr = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
    ('model',   LinearRegression())
])

pipe_ridge = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
    ('model',   Ridge(alpha=1.0))
])

pipe_lasso = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
    ('model',   Lasso(alpha=0.01))
])

# ── 5. EVALUATE ALL ──────────────────────────────────
print(f"\n{'Model':15} {'R²':>8} {'RMSE':>10} {'MAE':>10} {'CV R²':>10}")
print("-" * 60)

for name, pipe in [('LinearReg', pipe_lr), ('Ridge', pipe_ridge), ('Lasso', pipe_lasso)]:
    pipe.fit(X_train, y_train)
    yp   = pipe.predict(X_test)
    r2   = r2_score(y_test, yp)
    rmse = np.sqrt(mean_squared_error(y_test, yp))
    mae  = mean_absolute_error(y_test, yp)
    cv   = cross_val_score(pipe, X_train, y_train, cv=5, scoring='r2').mean()
    print(f"{name:15} {r2:8.4f} {rmse:10.4f} {mae:10.4f} {cv:10.4f}")

# ── 6. RESIDUAL PLOT for best model ──────────────────
pipe_ridge.fit(X_train, y_train)
y_pred = pipe_ridge.predict(X_test)
residuals = y_test - y_pred

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred, alpha=0.3, s=10)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.title('Actual vs Predicted (Ridge)'); plt.xlabel('Actual'); plt.ylabel('Predicted')

plt.subplot(1, 2, 2)
plt.hist(residuals, bins=50, color='steelblue')
plt.title('Residual Distribution')
plt.tight_layout(); plt.show()

Phase 6 — Classification

#34

Classification Overview

Definition

Classification is supervised learning where the output is a discrete category/label rather than a continuous number. The model learns a decision boundary that separates classes.

Types of Classification
Type Output Example Algorithm
Binary 2 classes (0/1) Spam / Not Spam Logistic Regression
Multiclass 3+ mutually exclusive classes Dog / Cat / Bird Softmax, Decision Tree
Multilabel Multiple labels per sample Movie: Action + Comedy One-vs-Rest wrapper
Classification Algorithms Overview
Algorithm How it works Best for
Logistic Regression Sigmoid probability on linear boundary Linearly separable, interpretability needed
Decision Tree Splits on feature thresholds Nonlinear, interpretable rules
Random Forest Average of many trees High accuracy, feature importance
SVM Maximize margin between classes High-dimensional data, small datasets
KNN Majority vote of k nearest neighbors Small datasets, nonlinear boundaries
Naive Bayes Probabilistic, Bayes theorem Text classification, very fast
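The "best for" column above can be sanity-checked with a quick sketch on a nonlinear toy dataset (two interleaved moons; accuracy values will vary with noise and split): the linear model lags, while a tree and KNN pick up the curved boundary with no feature engineering.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Nonlinear, partially overlapping classes
X, y = make_moons(n_samples=500, noise=0.25, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

accs = {}
for name, clf in [('LogisticRegression', LogisticRegression()),
                  ('DecisionTree', DecisionTreeClassifier(max_depth=5, random_state=42)),
                  ('KNN (k=5)', KNeighborsClassifier(n_neighbors=5))]:
    accs[name] = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name:20} test accuracy: {accs[name]:.3f}")
```

No single algorithm wins everywhere — this kind of quick comparison is the practical way to shortlist candidates for a given dataset.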
Summary
  • Classification = predicting a label/category, not a number
  • Binary (2 classes) is simplest; multiclass uses one-vs-rest or softmax extensions
  • Choose algorithm based on: dataset size, linearity, interpretability needs
#35

Logistic Regression — Binary

Definition

Logistic Regression is a classification algorithm (misleadingly named) that uses the sigmoid function to output a probability between 0 and 1. It models the log-odds of the positive class as a linear function of inputs.

Intuition

Linear regression gives output ∈ (−∞, +∞) — useless as a probability. We need output in [0, 1]. The sigmoid (logistic) function squashes any real number into (0, 1), interpreted as the probability of class 1. If P > 0.5 → predict class 1; else → class 0.

σ(z) = 1 / (1 + e⁻ᶻ) where z = β₀ + β₁x₁ + β₂x₂ + ...
P(y=1 | X) = σ(Xβ)
Decision: ŷ = 1 if P ≥ threshold (default 0.5), else 0
Code: Binary Logistic Regression
logistic_binary.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

# ── Load binary classification dataset ────────────────
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target   # 0=malignant, 1=benign

print("Classes:", data.target_names)
print("Class balance:", pd.Series(y).value_counts())

# ── Prepare ───────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                       random_state=42, stratify=y)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# ── Train logistic regression ─────────────────────────
# C = inverse regularization strength (C = 1/alpha)
# solver: 'lbfgs' (default), 'liblinear' (good for small data)
model = LogisticRegression(C=1.0, solver='lbfgs', max_iter=1000, random_state=42)
model.fit(X_train_s, y_train)

# ── Evaluate ─────────────────────────────────────────
y_pred      = model.predict(X_test_s)
y_proba     = model.predict_proba(X_test_s)[:, 1]  # P(class=1)

print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# ── Probability threshold analysis ───────────────────
# Default threshold = 0.5. In medical diagnosis, lower threshold
# = catch more true positives (higher recall, lower precision)
from sklearn.metrics import recall_score

thresholds = np.arange(0.1, 0.9, 0.1)
print(f"\n{'Threshold':>12} {'Accuracy':>10} {'Recall':>13}")
for t in thresholds:
    pred_t = (y_proba >= t).astype(int)
    acc    = accuracy_score(y_test, pred_t)
    rec    = recall_score(y_test, pred_t)
    print(f"{t:>12.1f} {acc:>10.4f} {rec:>13.4f}")
# In cancer detection: lower threshold = catch more cases (higher recall)
# but more false alarms (lower precision)

# ── Coefficients — feature importance ─────────────────
coef_df = pd.DataFrame({'Feature': X.columns,
                         'Coefficient': model.coef_[0]}).sort_values('Coefficient')
print("\nTop positive features (push toward benign, class 1):")
print(coef_df.tail(5))
Common Mistakes
  • Using logistic regression on highly nonlinear data — use trees or SVM instead
  • Ignoring the threshold — 0.5 is not always optimal; tune it based on recall/precision needs
  • Not scaling features — logistic regression with regularization is scale-sensitive
Interview Insights
Q: Why is it called regression if it's used for classification?
A: Because internally it still performs a linear regression on log-odds: log(P/(1-P)) = β₀ + β₁x₁ + ... The sigmoid is just applied on top to squash the output to [0,1] for probability interpretation. The "regression" refers to the underlying linear model, not the output type.
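The claim above is directly checkable: the model's log-odds really are the linear part Xβ + β₀, and the probability is just the sigmoid of that. A small verification sketch:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000).fit(X, y)

log_odds = X @ model.coef_[0] + model.intercept_[0]   # the "regression" part
p = 1 / (1 + np.exp(-log_odds))                       # sigmoid on top

# Matches sklearn's predict_proba for class 1 exactly
print(np.allclose(p, model.predict_proba(X)[:, 1]))   # prints True
```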
#36

Logistic Regression — Multiple Input Features

Definition

Multiple-input logistic regression uses many features simultaneously to compute the log-odds. Each feature gets its own coefficient, and the sigmoid is applied to the linear combination.

Code: Full Pipeline + ROC Curve
logistic_multi_input.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (roc_curve, auc, classification_report,
                               ConfusionMatrixDisplay)
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                       stratify=y, random_state=42)
# Pipeline: scale → logistic
pipe = Pipeline([('sc', StandardScaler()), ('lr', LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)
y_prob = pipe.predict_proba(X_test)[:, 1]
y_pred = pipe.predict(X_test)

# ── ROC Curve ─────────────────────────────────────────
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(fpr, tpr, lw=2, label=f'ROC (AUC = {roc_auc:.3f})')
axes[0].plot([0,1],[0,1], 'k--', label='Random (AUC=0.5)')
axes[0].set_xlabel('False Positive Rate'); axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve\n(Closer to top-left = better model)')
axes[0].legend()

# ── Confusion Matrix ──────────────────────────────────
ConfusionMatrixDisplay.from_predictions(y_test, y_pred,
    display_labels=data.target_names, ax=axes[1], colorbar=False)
axes[1].set_title('Confusion Matrix')
plt.tight_layout(); plt.show()

print(f"ROC-AUC: {roc_auc:.4f}")
print(classification_report(y_test, y_pred))
# AUC: 0.5 = random, 1.0 = perfect. Aim for >0.85 in practice
#37

Logistic Regression — Polynomial Features

Definition

Adding polynomial features to logistic regression allows it to learn nonlinear decision boundaries. Without this, logistic regression can only separate classes with a straight line/plane.

Code: Nonlinear Boundary with PolynomialFeatures
logistic_polynomial.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_circles

# Circular data — not linearly separable!
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=42)

def plot_boundary(model, X, y, ax, title):
    h = 0.02
    x_min, x_max = X[:,0].min()-0.5, X[:,0].max()+0.5
    y_min, y_max = X[:,1].min()-0.5, X[:,1].max()+0.5
    xx, yy = np.meshgrid(np.arange(x_min,x_max,h), np.arange(y_min,y_max,h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    ax.scatter(X[:,0], X[:,1], c=y, cmap='RdYlBu', edgecolors='k', s=20)
    acc = model.score(X, y)
    ax.set_title(f'{title}\nAccuracy: {acc:.3f}')

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Linear boundary — fails on circles
pipe_lin = Pipeline([('sc', StandardScaler()), ('lr', LogisticRegression())])
pipe_lin.fit(X, y)
plot_boundary(pipe_lin, X, y, ax1, 'Linear Logistic Reg\n(FAILS on circles)')

# Polynomial degree 3 — learns circular boundary!
pipe_poly = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('sc',   StandardScaler()),
    ('lr',   LogisticRegression(C=1.0))
])
pipe_poly.fit(X, y)
plot_boundary(pipe_poly, X, y, ax2, 'Polynomial (degree=3)\n(Learns circular boundary!)')

plt.tight_layout(); plt.show()
#38

Logistic Regression — Multiclass

Definition

Multiclass classification extends binary logistic regression to 3+ classes using two strategies: One-vs-Rest (OvR) — trains one binary classifier per class vs all others — and Softmax (Multinomial) — outputs a probability distribution over all classes simultaneously.

Softmax: P(y=k|X) = e^(z_k) / Σ e^(z_j) for all classes j
All probabilities sum to 1: Σ P(y=k|X) = 1
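Before the sklearn version, the softmax formula itself is worth seeing from scratch (a minimal sketch; the scores are made-up class logits). The max-subtraction is the standard numerical-stability trick and doesn't change the result:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability (same output)
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # z_k for 3 classes
probs = softmax(scores)
print(probs.round(3))                # highest score → highest probability
print("sum:", probs.sum())           # always exactly 1
```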
Code: OvR vs Softmax Multiclass
logistic_multiclass.py
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
print("Classes:", iris.target_names)
print("Samples per class:", np.bincount(y))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                       stratify=y, random_state=42)
sc = StandardScaler()
X_tr = sc.fit_transform(X_train)
X_te = sc.transform(X_test)

# ── Strategy 1: One-vs-Rest (OvR) ─────────────────────
# For k classes: trains k binary classifiers
# Class with highest probability wins
# NOTE: multi_class= is deprecated in recent scikit-learn (1.5+);
# wrap with sklearn.multiclass.OneVsRestClassifier there instead
lr_ovr = LogisticRegression(multi_class='ovr', C=1.0, max_iter=1000)
lr_ovr.fit(X_tr, y_train)
print(f"\nOvR Accuracy: {accuracy_score(y_test, lr_ovr.predict(X_te)):.4f}")

# ── Strategy 2: Softmax (multinomial) ─────────────────
# Trains ONE model jointly across all classes
# Better when classes are mutually exclusive
# NOTE: 'multinomial' is already the default behavior in recent
# scikit-learn, where the multi_class= parameter is deprecated
lr_soft = LogisticRegression(multi_class='multinomial', solver='lbfgs',
                              C=1.0, max_iter=1000)
lr_soft.fit(X_tr, y_train)
y_pred_soft = lr_soft.predict(X_te)
print(f"Softmax Accuracy: {accuracy_score(y_test, y_pred_soft):.4f}")

# ── Probability output (Softmax) ──────────────────────
proba = lr_soft.predict_proba(X_te[:5])
print("\nProbabilities for first 5 test samples (3 classes):")
df_proba = pd.DataFrame(proba, columns=iris.target_names).round(3)
df_proba['Prediction'] = iris.target_names[y_pred_soft[:5]]
df_proba['True Label'] = iris.target_names[y_test[:5]]
print(df_proba)

print("\nDetailed Report:")
print(classification_report(y_test, y_pred_soft, target_names=iris.target_names))
Interview Insights
Q: When would you use OvR vs Softmax for multiclass?
A: Softmax is preferred when classes are mutually exclusive (exactly one class is true per sample) — it models the joint distribution. OvR is better for multilabel scenarios or when individual class boundaries are very different. In practice for logistic regression, Softmax (multinomial) almost always performs better or equally and is more theoretically sound.

Phase 7 — Evaluation Metrics

#39

Confusion Matrix

Definition

A Confusion Matrix is a table that visualizes the complete performance of a classification model by showing how many predictions fell into each category: True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN).

Anatomy of the Confusion Matrix
Predicted: Positive Predicted: Negative
Actual: Positive TP — Correct positive prediction FN — Missed positive (Type II Error)
Actual: Negative FP — Wrong alarm (Type I Error) TN — Correct negative prediction
Code: Full Confusion Matrix Analysis
confusion_matrix.py
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
sc = StandardScaler()
X_tr = sc.fit_transform(X_tr); X_te = sc.transform(X_te)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# ── Raw confusion matrix ──────────────────────────────
cm = confusion_matrix(y_te, y_pred)
print("Confusion Matrix:\n", cm)

tn, fp, fn, tp = cm.ravel()
print(f"\nTN={tn}  FP={fp}")
print(f"FN={fn}  TP={tp}")

# ── All derived metrics from confusion matrix ─────────
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)  # = Sensitivity = True Positive Rate
specificity = tn / (tn + fp)  # True Negative Rate
f1        = 2 * precision * recall / (precision + recall)
fpr       = fp / (fp + tn)  # False Positive Rate

print(f"\nAccuracy:    {accuracy:.4f}")
print(f"Precision:   {precision:.4f}  (Of all predicted positive, how many are truly positive?)")
print(f"Recall:      {recall:.4f}  (Of all actual positive, how many did we catch?)")
print(f"Specificity: {specificity:.4f}  (Of all actual negative, how many correctly rejected?)")
print(f"F1-Score:    {f1:.4f}  (Harmonic mean of precision + recall)")

# ── Heatmap visualization ─────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
ConfusionMatrixDisplay.from_predictions(y_te, y_pred,
    display_labels=data.target_names, ax=axes[0])
axes[0].set_title('Counts')

ConfusionMatrixDisplay.from_predictions(y_te, y_pred,
    display_labels=data.target_names, ax=axes[1], normalize='true')
axes[1].set_title('Normalized (Recall per class)')
plt.tight_layout(); plt.show()
Interview Insights
Q: Why is accuracy a bad metric for imbalanced datasets?
A: If 99% of emails are not spam, a model that predicts "not spam" for everything achieves 99% accuracy — but detects zero actual spam. Accuracy = (TP+TN)/Total, which is dominated by the majority class. Use F1-score, Precision, Recall, or ROC-AUC for imbalanced data — they expose the model's actual behavior on the minority class.
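A tiny sketch of this trap — the dataset and its ~1% spam rate are made up for illustration: an always-negative classifier scores near-perfect accuracy while catching nothing.

```python
# The 99%-accuracy trap: a model that always predicts "not spam"
# achieves high accuracy but zero recall on the minority class.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

rng = np.random.default_rng(42)
y_true = (rng.random(10_000) < 0.01).astype(int)  # ~1% positive (spam)
y_pred = np.zeros_like(y_true)                    # always predict "not spam"

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")              # ~0.99
print(f"Recall:   {recall_score(y_true, y_pred):.4f}")                # 0 — detects nothing
print(f"F1:       {f1_score(y_true, y_pred, zero_division=0):.4f}")   # 0
```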
#40

Precision, Recall, F1-Score

Definitions

Precision: Of all the samples we predicted as positive, what fraction is truly positive? (How careful we are.)
Recall: Of all truly positive samples, what fraction did we correctly identify? (How thorough we are.)
F1-Score: Harmonic mean of Precision and Recall — penalizes extreme imbalances between them.

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F-beta = (1+β²) × (P × R) / (β²P + R) — β>1 weights recall more
The Precision-Recall Tradeoff
⚖️ Precision vs Recall Tradeoff

High Precision, Low Recall: Only flag cases you're very sure about. Miss some. (Spam: rarely mark real email as spam, but let some spam through.)
High Recall, Low Precision: Catch everything, accept false alarms. (Cancer screening: catch all cancer cases, even if some false alarms need follow-up.)
Control it: Lower the threshold → higher recall, lower precision. Higher threshold → higher precision, lower recall.

Code: Precision-Recall Curve + F-beta
precision_recall_f1.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (precision_score, recall_score, f1_score,
                               fbeta_score, precision_recall_curve,
                               average_precision_score, classification_report)
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target,
                                             test_size=0.2, stratify=data.target, random_state=42)
sc = StandardScaler()
model = LogisticRegression(max_iter=1000).fit(sc.fit_transform(X_tr), y_tr)
y_prob = model.predict_proba(sc.transform(X_te))[:, 1]
y_pred = model.predict(sc.transform(X_te))

# ── Key metrics ───────────────────────────────────────
print(f"Precision:  {precision_score(y_te, y_pred):.4f}")
print(f"Recall:     {recall_score(y_te, y_pred):.4f}")
print(f"F1-Score:   {f1_score(y_te, y_pred):.4f}")
print(f"F2-Score:   {fbeta_score(y_te, y_pred, beta=2):.4f}  (weights recall more)")
print(f"F0.5-Score: {fbeta_score(y_te, y_pred, beta=0.5):.4f}  (weights precision more)")

# ── Precision-Recall Curve ────────────────────────────
precision, recall, thresholds = precision_recall_curve(y_te, y_prob)
avg_prec = average_precision_score(y_te, y_prob)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(recall, precision, lw=2, label=f'AP = {avg_prec:.3f}')
axes[0].set_xlabel('Recall'); axes[0].set_ylabel('Precision')
axes[0].set_title('Precision-Recall Curve\n(Area = Average Precision)')
axes[0].legend()

# ── Threshold vs Precision/Recall ─────────────────────
axes[1].plot(thresholds, precision[:-1], label='Precision')
axes[1].plot(thresholds, recall[:-1],    label='Recall')
axes[1].set_xlabel('Threshold')
axes[1].set_title('Precision vs Recall at different thresholds\n(Move threshold → tradeoff changes)')
axes[1].legend()
plt.tight_layout(); plt.show()

# Per-class report (essential for multiclass)
print("\nClassification Report:")
print(classification_report(y_te, y_pred, target_names=data.target_names))
Summary
  • Accuracy is misleading on imbalanced data — use F1, Precision, Recall
  • Precision = how trustworthy your positive predictions are
  • Recall = how complete your positive detections are
  • F-beta: β>1 weights recall; β<1 weights precision — choose based on business cost
#41

Imbalanced Dataset Handling

Definition

An imbalanced dataset has a significant difference in class frequencies (e.g., 98% class 0, 2% class 1). Models trained on such data develop a bias toward the majority class and effectively ignore the minority class — which is usually the one we care most about (fraud, disease, defects).

Handling Strategies
Strategy How It Works Pros / Cons
Class Weights Penalize misclassifying minority class more Simple, no data change; always try first
Oversampling (SMOTE) Synthetically generate new minority samples More data; risk of overfitting
Undersampling Randomly remove majority samples Fast; loses real information
SMOTE + Undersampling Combine both Balanced; best of both
Threshold tuning Lower decision threshold for minority No data change; tune for business metric
Code: Class Weights + SMOTE
imbalanced_handling.py
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# ── Create heavily imbalanced dataset ─────────────────
X, y = make_classification(n_samples=1000, n_features=10,
                            weights=[0.95, 0.05],  # 95% class 0, 5% class 1
                            random_state=42)
print("Class distribution:", np.bincount(y))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr); X_te_s = sc.transform(X_te)

# ── 1. No correction (baseline) ───────────────────────
lr_base = LogisticRegression().fit(X_tr_s, y_tr)
print("\n--- Baseline (no correction) ---")
print(classification_report(y_te, lr_base.predict(X_te_s)))

# ── 2. class_weight='balanced' ────────────────────────
# sklearn auto-computes weights: w_i = n_samples/(n_classes * n_i)
lr_bal = LogisticRegression(class_weight='balanced').fit(X_tr_s, y_tr)
print("--- class_weight='balanced' ---")
print(classification_report(y_te, lr_bal.predict(X_te_s)))

# ── 3. SMOTE Oversampling (requires imbalanced-learn) ─
try:
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline as ImbPipeline

    smote_pipe = ImbPipeline([
        ('smote', SMOTE(random_state=42)),
        ('lr',    LogisticRegression())
    ])
    smote_pipe.fit(X_tr_s, y_tr)
    print("--- SMOTE + Logistic Regression ---")
    print(classification_report(y_te, smote_pipe.predict(X_te_s)))
except ImportError:
    print("Install: pip install imbalanced-learn")

# ── 4. Threshold tuning ────────────────────────────────
prob = lr_bal.predict_proba(X_te_s)[:, 1]
for t in [0.3, 0.4, 0.5]:
    pred_t = (prob >= t).astype(int)
    f1 = f1_score(y_te, pred_t)
    print(f"Threshold {t}: F1={f1:.4f}")
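The table's Undersampling row can be sketched without imbalanced-learn — plain numpy, reusing the same make_classification setup as above. Note the resampling touches the training set only; the test set stays as-is.

```python
# Random undersampling: drop majority-class training rows until both
# classes have equal counts. Fast, but discards real data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

rng = np.random.default_rng(42)
idx_maj = np.where(y_tr == 0)[0]
idx_min = np.where(y_tr == 1)[0]
keep = rng.choice(idx_maj, size=len(idx_min), replace=False)  # subsample majority
idx_bal = np.concatenate([keep, idx_min])

print("Before:", np.bincount(y_tr), "→ After:", np.bincount(y_tr[idx_bal]))
lr = LogisticRegression(max_iter=1000).fit(X_tr[idx_bal], y_tr[idx_bal])
print(f"F1 on untouched test set: {f1_score(y_te, lr.predict(X_te)):.4f}")
```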
Common Mistakes
  • Applying SMOTE before train/test split — synthetic samples from test data leak into training
  • Using accuracy as the metric — use F1, ROC-AUC, or G-Mean for imbalanced data
  • Not trying class_weight='balanced' first — it's free and often sufficient

Phase 8 — Probabilistic Models

#42

Naive Bayes — Theory

Definition

Naive Bayes is a probabilistic classifier based on Bayes' Theorem with a "naive" assumption that all features are conditionally independent given the class. Despite this simplification, it works remarkably well — especially for text classification.

Bayes' Theorem: P(y|X) = P(X|y) × P(y) / P(X)

Naive Independence Assumption: P(X|y) = P(x₁|y) × P(x₂|y) × ... × P(xₙ|y)

Decision: ŷ = argmax_k P(y=k) × Π P(xᵢ|y=k)
Intuition

Email Spam Example: P(spam | "free", "money", "click") ∝ P(spam) × P("free"|spam) × P("money"|spam) × P("click"|spam). We multiply individual word probabilities. We call it "naive" because words in a real email are NOT truly independent — but this assumption still works very well in practice.
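The multiplication above, worked by hand. The word probabilities here are made-up illustrative numbers, not values learned from data.

```python
# Naive Bayes by hand: multiply per-word likelihoods by the class
# prior, then normalize the two scores into a posterior.
p_spam, p_ham = 0.3, 0.7                               # priors P(y) (assumed)
p_word_spam = {'free': 0.30, 'money': 0.20, 'click': 0.15}
p_word_ham  = {'free': 0.02, 'money': 0.03, 'click': 0.05}

words = ['free', 'money', 'click']
score_spam, score_ham = p_spam, p_ham
for w in words:
    score_spam *= p_word_spam[w]   # P(spam) · Π P(word|spam)
    score_ham  *= p_word_ham[w]    # P(ham)  · Π P(word|ham)

posterior = score_spam / (score_spam + score_ham)  # normalize
print(f"Unnormalized: spam={score_spam:.6f}, ham={score_ham:.6f}")
print(f"P(spam | 'free','money','click') = {posterior:.4f}")
```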

Naive Bayes Variants
Variant Feature Distribution Assumption Best For
GaussianNB Features follow Gaussian (normal) distribution Continuous features, numerical data
MultinomialNB Features are counts or frequencies Text classification (word counts, TF-IDF)
BernoulliNB Features are binary (0/1) Text (word presence/absence)
ComplementNB Extension of Multinomial Imbalanced text classification
Interview Insights
Q: Why does Naive Bayes work well despite the independence assumption being almost always wrong?
A: Even when the independence assumption is violated, Naive Bayes still makes accurate classification decisions because the RANKING of class probabilities is often preserved — you don't need exact probabilities, just the correct argmax. Additionally, with limited data, the naive assumption prevents overfitting by reducing model complexity.
#43

Naive Bayes — Practical

Code: GaussianNB + Text Classification with MultinomialNB
naive_bayes_practical.py
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, accuracy_score
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline

# ══ PART 1: GaussianNB (continuous features) ══════════
iris = load_iris()
X, y = iris.data, iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

gnb = GaussianNB()
gnb.fit(X_tr, y_tr)
print("GaussianNB on Iris:")
print(f"  Accuracy: {accuracy_score(y_te, gnb.predict(X_te)):.4f}")
print(f"  CV Score: {cross_val_score(gnb, X, y, cv=5).mean():.4f}")

# Learned class priors (relative frequency of each class)
print(f"  Class priors: {gnb.class_prior_.round(3)}")
print(f"  Class means (theta):\n{gnb.theta_.round(2)}")

# ══ PART 2: MultinomialNB (text classification) ═══════
print("\n--- Text Classification with MultinomialNB ---")

texts = [
    "free money win prize", "win big cash now free", "click free offer today",
    "meeting tomorrow project update", "report deadline next week",
    "team lunch calendar invite", "schedule call project review",
    "free gift card win money", "urgent response required now",
]
labels = [1,1,1, 0,0,0,0, 1,1]  # 1=spam, 0=ham

pipe_mnb = Pipeline([
    ('tfidf',  TfidfVectorizer()),
    ('model', MultinomialNB(alpha=1.0))  # alpha=Laplace smoothing
])
pipe_mnb.fit(texts, labels)

new_emails = ["free offer click win", "project deadline report"]
predictions = pipe_mnb.predict(new_emails)
probas      = pipe_mnb.predict_proba(new_emails)

for email, pred, prob in zip(new_emails, predictions, probas):
    label = "SPAM" if pred == 1 else "HAM"
    print(f"  '{email}' → {label} (P(spam)={prob[1]:.3f})")

# ── Laplace Smoothing explanation ─────────────────────
# alpha=1.0 adds 1 to all counts — prevents P(word|class)=0
# for words not seen in training data (zero-probability problem)
print("\nalpha=0: no smoothing (risk of zero prob)")
print("alpha=1: Laplace smoothing (standard)")
print("alpha>1: more smoothing, more uniform distribution")
Summary
  • Naive Bayes is extremely fast, scales to huge datasets, great for text
  • GaussianNB for continuous features; MultinomialNB for word counts
  • Laplace smoothing (alpha) prevents zero probabilities for unseen features
  • Despite naive assumption, often competitive with complex models for text

Phase 9 — Advanced Models

#44

Decision Tree — Classification (Theory)

Definition

A Decision Tree is a flowchart-like model that makes predictions by recursively splitting the feature space based on threshold conditions. Each internal node tests a feature, each branch represents an outcome, and each leaf node predicts a class.

Intuition

Think of 20 questions: "Is the animal a mammal? → Yes → Does it live in water? → No → Does it have stripes? → Yes → Tiger!" A decision tree asks the most informative questions first, each question splitting the data into increasingly pure groups.

How Trees Split: Splitting Criteria
Criterion Formula Used in Intuition
Gini Impurity 1 − Σ pᵢ² CART, sklearn default Probability of misclassifying a random sample. 0 = pure, 0.5 = maximally impure (binary)
Entropy (Info Gain) −Σ pᵢ log₂(pᵢ) ID3, C4.5 Average bits needed to encode class labels. 0 = pure, 1 = maximally uncertain (binary)
Log Loss Cross-entropy sklearn 1.1+ Better calibrated probability estimates
Information Gain = Entropy(parent) − Σ [|child|/|parent| × Entropy(child)]
Gini at node = 1 − (p_class0² + p_class1² + ...)
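These formulas are easy to check numerically — a sketch evaluating Gini, Entropy, and Information Gain for one candidate split of a balanced parent node.

```python
# Parent: 10 samples (5 vs 5). Split into children [4,1] and [1,4].
import numpy as np

def gini(counts):
    p = np.asarray(counts) / np.sum(counts)
    return 1 - np.sum(p ** 2)

def entropy(counts):
    p = np.asarray(counts) / np.sum(counts)
    p = p[p > 0]                     # convention: 0·log2(0) = 0
    return -np.sum(p * np.log2(p))

parent, left, right = [5, 5], [4, 1], [1, 4]

print(f"Gini: parent={gini(parent):.3f}, left={gini(left):.3f}, right={gini(right):.3f}")

H_parent   = entropy(parent)                              # 1 bit (max for binary)
H_children = 0.5 * entropy(left) + 0.5 * entropy(right)   # weighted by |child|/|parent|
print(f"Information Gain = {H_parent - H_children:.3f} bits")
```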
Interview Insights
Q: Gini vs Entropy — which is better?
A: In practice, almost no difference in tree structure or performance. Gini is slightly faster (no log computation). Entropy can sometimes create more balanced trees. sklearn defaults to Gini. The choice of max_depth and min_samples_split has far more impact than Gini vs Entropy.
Q: What's the main weakness of a single decision tree?
A: High variance (overfitting). A deep tree memorizes training noise. Changing even a few training samples can produce a very different tree. This is why ensemble methods like Random Forest (averaging many trees) drastically improve stability.
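The instability claim can be demonstrated directly — a sketch that fits unconstrained trees on bootstrap resamples and counts where their test predictions disagree (the dataset choice here is illustrative).

```python
# High variance in action: the "same" tree, trained on slightly
# different resamples of the training data, disagrees on test points.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

rng = np.random.default_rng(0)
preds = []
for _ in range(10):
    idx = rng.choice(len(X_tr), size=len(X_tr), replace=True)  # bootstrap sample
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    preds.append(tree.predict(X_te))

preds = np.array(preds)
disagree = (preds != preds[0]).any(axis=0).mean()  # fraction of unstable test points
print(f"Test points where 10 bootstrap trees disagree: {disagree:.1%}")
# Averaging these trees (= a mini Random Forest) smooths this variance out.
```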
#45

Decision Tree — Practical

Code: Full Decision Tree + Visualization
decision_tree_practical.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ── 1. Full tree (overfits) ───────────────────────────
dt_full = DecisionTreeClassifier(random_state=42)  # No limits = grows fully
dt_full.fit(X_tr, y_tr)
print(f"Full Tree — depth:{dt_full.get_depth()}, leaves:{dt_full.get_n_leaves()}")
print(f"  Train Acc: {accuracy_score(y_tr, dt_full.predict(X_tr)):.4f}")
print(f"  Test  Acc: {accuracy_score(y_te, dt_full.predict(X_te)):.4f}")

# ── 2. Depth-limited tree (better generalization) ──────
dt_lim = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_lim.fit(X_tr, y_tr)
print(f"\nLimited Tree (max_depth=3):")
print(f"  Train Acc: {accuracy_score(y_tr, dt_lim.predict(X_tr)):.4f}")
print(f"  Test  Acc: {accuracy_score(y_te, dt_lim.predict(X_te)):.4f}")

# ── 3. Visualize tree ─────────────────────────────────
fig, ax = plt.subplots(figsize=(16, 6))
plot_tree(dt_lim, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True,
          rounded=True, ax=ax, fontsize=9)
plt.title('Decision Tree (max_depth=3)\nColor intensity = class purity')
plt.tight_layout(); plt.show()

# ── 4. Text rules (human-readable) ───────────────────
rules = export_text(dt_lim, feature_names=iris.feature_names)
print("\nTree Rules (human-readable):")
print(rules)

# ── 5. Feature importance ─────────────────────────────
imp = pd.DataFrame({
    'Feature': iris.feature_names,
    'Importance': dt_lim.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nFeature Importances (Gini-based):")
print(imp)
# Importance = total Gini impurity reduction from splits on this feature

# ── 6. Depth vs accuracy tradeoff ─────────────────────
depths = range(1, 15)
train_accs, test_accs = [], []
for d in depths:
    dt = DecisionTreeClassifier(max_depth=d, random_state=42).fit(X_tr, y_tr)
    train_accs.append(accuracy_score(y_tr, dt.predict(X_tr)))
    test_accs.append(accuracy_score(y_te, dt.predict(X_te)))

plt.figure(figsize=(8,4))
plt.plot(depths, train_accs, 'b-o', label='Train')
plt.plot(depths, test_accs,  'r-o', label='Test')
plt.axvline(x=3, color='g', linestyle='--', label='Sweet spot')
plt.xlabel('Max Depth'); plt.ylabel('Accuracy')
plt.title('Depth vs Accuracy: Overfitting Curve')
plt.legend(); plt.tight_layout(); plt.show()
#46

Pre-Pruning & Post-Pruning

Definition

Pruning controls decision tree complexity to prevent overfitting. Pre-pruning stops growth early using constraints. Post-pruning grows the full tree then removes unnecessary branches using a complexity penalty (cost-complexity pruning).

Pre-Pruning Parameters in sklearn
Parameter Effect Typical Value
max_depth Maximum tree depth 3–10
min_samples_split Min samples required to split a node 2–20
min_samples_leaf Min samples required at leaf 1–10
max_features Max features considered per split "sqrt" for classification
max_leaf_nodes Max number of leaf nodes None or 10–100
Code: Pre-Pruning + Cost-Complexity Post-Pruning
pruning.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target,
                                             test_size=0.2, random_state=42)

# ── PRE-PRUNING ───────────────────────────────────────
dt_pre = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,  # need ≥10 samples to even try splitting
    min_samples_leaf=5,   # each leaf must have ≥5 samples
    max_leaf_nodes=20,    # cap total leaves
    random_state=42
)
dt_pre.fit(X_tr, y_tr)
print(f"Pre-pruned: depth={dt_pre.get_depth()}, test_acc={accuracy_score(y_te, dt_pre.predict(X_te)):.4f}")

# ── POST-PRUNING: Cost-Complexity Pruning (CCP) ────────
# ccp_alpha = complexity parameter.
# Higher alpha → more aggressive pruning → simpler tree
# Find optimal alpha using effective_alphas

dt_full = DecisionTreeClassifier(random_state=42)
path = dt_full.cost_complexity_pruning_path(X_tr, y_tr)
ccp_alphas = path.ccp_alphas[:-1]  # Exclude last (trivial root)

train_scores, test_scores, n_leaves = [], [], []
for alpha in ccp_alphas:
    dt = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_tr, y_tr)
    train_scores.append(accuracy_score(y_tr, dt.predict(X_tr)))
    test_scores.append(accuracy_score(y_te, dt.predict(X_te)))
    n_leaves.append(dt.get_n_leaves())

best_idx = np.argmax(test_scores)
best_alpha = ccp_alphas[best_idx]
print(f"\nBest CCP alpha: {best_alpha:.5f} → Test acc: {test_scores[best_idx]:.4f}")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
ax1.plot(ccp_alphas, train_scores, 'b-', label='Train')
ax1.plot(ccp_alphas, test_scores,  'r-', label='Test')
ax1.axvline(x=best_alpha, color='g', linestyle='--', label=f'Best α={best_alpha:.5f}')
ax1.set_xlabel('ccp_alpha'); ax1.set_title('Post-Pruning: Accuracy vs Alpha'); ax1.legend()

ax2.plot(ccp_alphas, n_leaves, 'purple')
ax2.set_xlabel('ccp_alpha'); ax2.set_ylabel('Number of Leaves')
ax2.set_title('More alpha → Simpler tree')
plt.tight_layout(); plt.show()
#47

Decision Tree — Regression

Definition

Decision Tree Regression partitions the feature space into rectangular regions and predicts the mean of training samples in each region. Instead of Gini/Entropy, it minimizes MSE (mean squared error) at each split.

Code: Decision Tree Regressor
decision_tree_regression.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 150)).reshape(-1, 1)
y = np.sin(X.ravel()) + np.random.normal(0, 0.2, 150)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

fig, axes = plt.subplots(1, 3, figsize=(16, 4))
X_plot = np.linspace(0, 10, 500).reshape(-1, 1)

for i, depth in enumerate([2, 5, 20]):
    dtr = DecisionTreeRegressor(max_depth=depth, random_state=42)
    dtr.fit(X_tr, y_tr)
    y_plot = dtr.predict(X_plot)
    rmse   = np.sqrt(mean_squared_error(y_te, dtr.predict(X_te)))
    r2     = r2_score(y_te, dtr.predict(X_te))
    
    axes[i].scatter(X_tr, y_tr, alpha=0.4, s=15, label='Train')
    axes[i].scatter(X_te, y_te, alpha=0.4, s=15, color='orange', label='Test')
    axes[i].plot(X_plot, y_plot, 'r-', lw=2, label='Prediction')
    axes[i].set_title(f"depth={depth}\nRMSE={rmse:.3f}  R²={r2:.3f}")
    axes[i].legend(fontsize=8)

# depth=2: underfits (blocky step function)
# depth=5: good fit
# depth=20: overfits (jagged, memorizes noise)
plt.suptitle('Decision Tree Regression: Step-Function Predictions')
plt.tight_layout(); plt.show()

# ── Key insight: DT predictions are step functions ─────
# Each region gets the MEAN of training samples in it
# This is why DTs can't extrapolate beyond training range!
Common Mistakes
  • Decision Trees cannot extrapolate — predictions outside the training range collapse to the edge leaf's mean. Use linear models (or neural nets) when extrapolation matters.
  • Not controlling depth leads to extreme overfitting — always use CCP or set max_depth
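A tiny sketch of the extrapolation failure: a tree trained on a perfectly linear target still flatlines outside the training range.

```python
# Trees predict the mean of a leaf's training samples — outside the
# training range every query falls into the same edge leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2 * X.ravel()                      # perfectly linear target, y = 2x

tree = DecisionTreeRegressor(max_depth=5).fit(X, y)
print("Inside range  x=5  →", tree.predict([[5]])[0].round(2))   # ≈ 10
print("Outside range x=20 →", tree.predict([[20]])[0])           # edge leaf's mean
print("Outside range x=99 →", tree.predict([[99]])[0])           # same constant
# A linear model would return 40 and 198 here — the tree cannot.
```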
#48

K-Nearest Neighbors (KNN) — Classification

Definition

KNN is a lazy, non-parametric algorithm — it stores all training data and, to predict a new point, finds the K closest training points and predicts the majority class (classification) or mean value (regression). No explicit training step.

Intuition

"Tell me who your neighbors are, and I'll tell you who you are." To classify a new patient, find the 5 most similar patients in the medical database and predict the majority diagnosis. The model IS the data.

Distance (Euclidean): d(a,b) = √[Σ(aᵢ − bᵢ)²]
Distance (Manhattan): d(a,b) = Σ|aᵢ − bᵢ|
Prediction: ŷ = majority_class(k nearest neighbors)
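The prediction rule above, implemented from scratch on a toy 2-D dataset (the points are made up to form two obvious clusters).

```python
# KNN in three steps: Euclidean distance to every training point,
# take the k closest, majority vote among their labels.
import numpy as np
from collections import Counter

X_train = np.array([[1, 1], [2, 1], [1, 2],     # class-0 cluster
                    [6, 6], [7, 6], [6, 7]])    # class-1 cluster
y_train = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(x_new, k=3):
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                        # indices of k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote

print(knn_predict(np.array([1.5, 1.5])))  # near class-0 cluster → 0
print(knn_predict(np.array([6.5, 6.5])))  # near class-1 cluster → 1
```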
KNN Key Properties
Property Details
Training cost O(1) — just stores data
Prediction cost O(n·d) — computes distance to ALL training points
Memory O(n) — stores entire training set
Curse of dimensionality Performance degrades sharply with many features — all points become equidistant
Feature scaling REQUIRED — distance-based, so scale matters critically
k too small Overfitting — noisy, jagged boundary
k too large Underfitting — blurry boundary, ignores local structure
Interview Insights
Q: What is the curse of dimensionality and how does it affect KNN?
A: In high dimensions, the Euclidean distance between any two points becomes nearly equal — all neighbors become "equally far". To cover the same fraction of data space, you need exponentially more points as dimensions grow. This makes nearest neighbors meaningless in high-dimensional spaces. Solutions: PCA dimensionality reduction before KNN, or use algorithms that don't rely on distance (trees).
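Distance concentration can be observed numerically — a sketch measuring the relative contrast between a query's nearest and farthest neighbor as dimensionality grows (uniform random data, illustrative only).

```python
# As d grows, min and max neighbor distances converge: the relative
# contrast (max-min)/min shrinks toward 0, so "nearest" loses meaning.
import numpy as np

rng = np.random.default_rng(42)
contrasts = {}
for d in [2, 10, 100, 1000]:
    pts   = rng.random((1000, d))            # 1000 uniform points in [0,1]^d
    query = rng.random(d)
    dists = np.linalg.norm(pts - query, axis=1)
    contrasts[d] = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:>4}: (max-min)/min distance = {contrasts[d]:.3f}")
```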
#49

KNN — Practical

Code: KNN with Optimal K Selection
knn_practical.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline

iris = load_iris()
X, y = iris.data, iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ── Find optimal k ────────────────────────────────────
k_range = range(1, 31)
cv_scores = []
for k in k_range:
    pipe = Pipeline([('sc', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=k))])
    scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

best_k = k_range[np.argmax(cv_scores)]
print(f"Best k = {best_k} (CV accuracy = {max(cv_scores):.4f})")

plt.figure(figsize=(10, 4))
plt.plot(k_range, cv_scores, 'b-o', markersize=5)
plt.axvline(x=best_k, color='r', linestyle='--', label=f'Best k={best_k}')
plt.xlabel('k'); plt.ylabel('CV Accuracy')
plt.title('Choosing k via Cross-Validation\n(k=1: overfitting, high k: underfitting)')
plt.legend(); plt.tight_layout(); plt.show()

# ── Final model with best k ───────────────────────────
best_pipe = Pipeline([
    ('sc',  StandardScaler()),
    ('knn', KNeighborsClassifier(
        n_neighbors=best_k,
        weights='distance',  # closer neighbors vote more
        metric='euclidean'    # try 'manhattan', 'minkowski'
    ))
])
best_pipe.fit(X_tr, y_tr)
print(classification_report(y_te, best_pipe.predict(X_te), target_names=iris.target_names))

# ── Weights comparison ────────────────────────────────
for w in ['uniform', 'distance']:
    p = Pipeline([('sc', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=best_k, weights=w))])
    p.fit(X_tr, y_tr)
    acc = p.score(X_te, y_te)
    print(f"weights='{w}': acc={acc:.4f}")
#50

KNN — Regression

Definition

KNN Regression predicts the average (or weighted average) of the k nearest neighbors' target values rather than majority vote. Simple and powerful for locally smooth functions.

Code: KNN Regressor
knn_regression.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 200)).reshape(-1,1)
y = np.sin(X.ravel()) + 0.5*np.cos(2*X.ravel()) + np.random.normal(0, 0.15, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
X_plot = np.linspace(0, 10, 500).reshape(-1,1)

fig, axes = plt.subplots(1, 3, figsize=(16,4))
for i, k in enumerate([1, 7, 50]):
    pipe = Pipeline([('sc', StandardScaler()), ('knn', KNeighborsRegressor(n_neighbors=k, weights='distance'))])
    pipe.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, pipe.predict(X_te)))
    r2   = r2_score(y_te, pipe.predict(X_te))
    axes[i].scatter(X, y, alpha=0.3, s=10)
    axes[i].plot(X_plot, pipe.predict(X_plot), 'r-', lw=2)
    axes[i].set_title(f"k={k}\nRMSE={rmse:.3f}  R²={r2:.3f}")
plt.suptitle('KNN Regression: k=1 overfits, large k underfits')
plt.tight_layout(); plt.show()
#51

Support Vector Machine (SVM) — Theory

Definition

SVM finds the optimal hyperplane that maximizes the margin between classes. The margin is the distance between the hyperplane and the nearest data points from each class (called support vectors). Maximizing margin = maximizing generalization.

Intuition

You have red and blue dots on a table. SVM finds the widest possible "road" between them. Only the dots closest to the road (support vectors) determine the boundary — all other points are irrelevant. This makes SVM very efficient and robust.

Hard Margin: Maximize margin = 2/||w||, subject to yᵢ(w·xᵢ + b) ≥ 1

Soft Margin: Minimize ½||w||² + C·Σξᵢ (C = regularization)

Kernel trick: K(x,z) = φ(x)·φ(z) — maps to higher-dimensional space
Kernels
Kernel Formula Use When
Linear K(x,z) = x·z Linearly separable, high-dimensional (text)
RBF (Gaussian) K(x,z) = exp(−γ||x−z||²) Default; nonlinear data; medium-sized datasets
Polynomial K(x,z) = (x·z + r)^d When polynomial features expected
Sigmoid K(x,z) = tanh(αx·z + c) Neural network-like behavior
Hyperparameter C and γ
Parameter Small Value Large Value
C (regularization) Wider margin, more misclassifications allowed (underfitting) Narrow margin, fewer misclassifications (overfitting)
γ (RBF kernel) Large decision region, smooth boundary (underfitting) Small region per point, jagged boundary (overfitting)
Interview Insights
Q: What are support vectors and why do only they matter?
A: Support vectors are the training points closest to the decision boundary — they "support" (define) the hyperplane. Points farther from the boundary are irrelevant to the decision. This is why SVMs are efficient: removing non-support vector points doesn't change the model. It also makes SVMs robust — the boundary is defined by the hardest-to-classify points.
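A sketch of this property on a linearly separable toy dataset from make_blobs (an assumed setup): inspect the fitted support vectors, then refit on only those points and check that predictions barely change.

```python
# Only the support vectors define the boundary: refitting on just
# those points reproduces (essentially) the same classifier.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.2, random_state=42)

svc = SVC(kernel='linear', C=1.0).fit(X, y)
sv_idx = svc.support_                       # indices of the support vectors
print(f"Support vectors: {len(sv_idx)} of {len(X)} points")

svc_sv = SVC(kernel='linear', C=1.0).fit(X[sv_idx], y[sv_idx])  # refit on SVs only
same = (svc.predict(X) == svc_sv.predict(X)).mean()
print(f"Predictions agree on {same:.1%} of points")
```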
#52

SVM — Practical

Code: SVM Classification + Kernel Comparison
svm_practical.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.datasets import load_breast_cancer, make_moons

# ── Part 1: Kernel comparison on moons dataset ────────
X_m, y_m = make_moons(n_samples=300, noise=0.2, random_state=42)

fig, axes = plt.subplots(1, 3, figsize=(16, 4))
for i, (kernel, C) in enumerate([('linear',1),('rbf',1),('poly',1)]):
    pipe = Pipeline([('sc', StandardScaler()), ('svc', SVC(kernel=kernel, C=C))])
    pipe.fit(X_m, y_m)
    h = 0.02
    x_min, x_max = X_m[:,0].min()-0.5, X_m[:,0].max()+0.5
    y_min, y_max = X_m[:,1].min()-0.5, X_m[:,1].max()+0.5
    xx, yy = np.meshgrid(np.arange(x_min,x_max,h), np.arange(y_min,y_max,h))
    Z = pipe.predict(np.c_[xx.ravel(),yy.ravel()]).reshape(xx.shape)
    axes[i].contourf(xx, yy, Z, alpha=0.3)
    axes[i].scatter(X_m[:,0], X_m[:,1], c=y_m, s=20, edgecolors='k')
    axes[i].set_title(f'Kernel: {kernel}\nAcc={pipe.score(X_m,y_m):.3f}')
plt.tight_layout(); plt.show()

# ── Part 2: Full pipeline on real data + GridSearch ───
data = load_breast_cancer()
X, y = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

pipe_svm = Pipeline([('sc', StandardScaler()), ('svc', SVC(kernel='rbf', probability=True))])

# GridSearch for C and gamma
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 'auto', 0.01]}
gs = GridSearchCV(pipe_svm, param_grid, cv=5, scoring='accuracy')
gs.fit(X_tr, y_tr)
print(f"Best params: {gs.best_params_}")
print(f"Best CV accuracy: {gs.best_score_:.4f}")
print(f"Test accuracy: {gs.score(X_te, y_te):.4f}")
print(classification_report(y_te, gs.predict(X_te), target_names=data.target_names))
#53

SVM — Regression (SVR)

Definition

Support Vector Regression (SVR) fits a tube of width 2ε around the regression line. Points inside the tube incur no penalty. Only points outside the ε-tube contribute to the loss. This makes SVR robust to outliers.

SVR: Minimize ½||w||² + C·Σ(ξᵢ + ξᵢ*)
Subject to: |yᵢ − (w·xᵢ + b)| ≤ ε + ξᵢ
ε = tube width; ξ = slack variables for points outside tube
Code: SVR with RBF Kernel
svr_practical.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 150)).reshape(-1,1)
y = np.sin(X.ravel()) + np.random.normal(0, 0.2, 150)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
X_plot = np.linspace(0, 10, 500).reshape(-1,1)

pipe_svr = Pipeline([
    ('sc',  StandardScaler()),
    ('svr', SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1))
])
pipe_svr.fit(X_tr, y_tr)
y_pred = pipe_svr.predict(X_te)

rmse = np.sqrt(mean_squared_error(y_te, y_pred))
r2   = r2_score(y_te, y_pred)
print(f"SVR — RMSE: {rmse:.4f}, R²: {r2:.4f}")

plt.figure(figsize=(10, 4))
plt.scatter(X_tr, y_tr, c='steelblue', alpha=0.5, s=20, label='Train')
plt.scatter(X_te, y_te, c='orange', alpha=0.5, s=20, label='Test')
y_plot = pipe_svr.predict(X_plot)
plt.plot(X_plot, y_plot, 'r-', lw=2, label='SVR (RBF)')
plt.fill_between(X_plot.ravel(), y_plot-0.1, y_plot+0.1, alpha=0.2, color='red', label='ε-tube')
plt.title(f'SVR RBF — RMSE={rmse:.3f}, R²={r2:.3f}')
plt.legend(); plt.tight_layout(); plt.show()
Summary
  • SVM finds the maximum-margin hyperplane — defined only by support vectors
  • Kernel trick allows nonlinear boundaries without explicitly transforming data
  • C controls margin width / regularization; γ controls RBF kernel "reach"
  • Always scale features before SVM — it's distance-based

⚡ Project 2 — After Classification

🚀P2

Real-World Project: Customer Churn Predictor

📉 Customer Churn Prediction

Goal: Predict which telecom customers will leave. Compare Logistic Regression, SVM, Decision Tree, KNN — all in pipelines. Use confusion matrix, F1, ROC-AUC to choose the best model.

Code: Multi-Model Churn Pipeline
project2_churn.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score, roc_auc_score, classification_report

# Simulate churn data (imbalanced: 80/20)
X, y = make_classification(n_samples=2000, n_features=12,
                            weights=[0.8,0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                             stratify=y, random_state=42)

models = {
    'Logistic Reg':   Pipeline([('sc',StandardScaler()),('m',LogisticRegression(class_weight='balanced',max_iter=1000))]),
    'Decision Tree':  Pipeline([('m',DecisionTreeClassifier(max_depth=5,class_weight='balanced',random_state=42))]),
    'KNN':            Pipeline([('sc',StandardScaler()),('m',KNeighborsClassifier(n_neighbors=7))]),
    'SVM':            Pipeline([('sc',StandardScaler()),('m',SVC(class_weight='balanced',probability=True))]),
    'Naive Bayes':    Pipeline([('sc',StandardScaler()),('m',GaussianNB())]),
}

results = []
print(f"{'Model':18} {'F1':>8} {'ROC-AUC':>10} {'CV-F1':>10}")
print("-"*50)
for name, pipe in models.items():
    pipe.fit(X_tr, y_tr)
    yp   = pipe.predict(X_te)
    yprb = pipe.predict_proba(X_te)[:,1]
    f1   = f1_score(y_te, yp)
    auc  = roc_auc_score(y_te, yprb)
    cvf1 = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring='f1').mean()
    print(f"{name:18} {f1:8.4f} {auc:10.4f} {cvf1:10.4f}")
    results.append({'model':name,'f1':f1,'auc':auc})

# Best model report
best = max(results, key=lambda x: x['auc'])
print(f"\n🏆 Best model by ROC-AUC: {best['model']} (AUC={best['auc']:.4f})")

Phase 10 — Model Tuning

#54

Model Parameters vs Hyperparameters

Definitions

Parameters are values the model learns from training data (e.g., linear regression coefficients β). Hyperparameters are configuration settings set BEFORE training that control the learning process (e.g., max_depth, C, k). You tune hyperparameters; the model learns parameters.

Examples by Algorithm
Algorithm Parameters (learned) Hyperparameters (set by you)
Linear Regression β₀, β₁, ..., βₙ (coefficients) fit_intercept, positive
Ridge/Lasso β (coefficients) alpha (λ), max_iter
Logistic Regression w (weights), b (bias) C, solver, max_iter
Decision Tree Split thresholds, leaf values max_depth, min_samples_split, criterion
SVM w, b, support vectors C, γ, kernel
KNN None (lazy learner) k, weights, metric
Naive Bayes Class priors, feature likelihoods alpha (smoothing), var_smoothing
Interview Insights
Q: If you increase max_depth in a Decision Tree, are you tuning a parameter or hyperparameter?
A: Hyperparameter. max_depth is set before training — it controls the learning process but is not learned FROM data. The split thresholds and leaf values that result from training are the actual parameters. This distinction matters: parameters are optimized by the algorithm internally; hyperparameters are optimized by YOU, typically via cross-validation (GridSearchCV).
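The distinction is visible directly in scikit-learn: get_params() lists the hyperparameters you set, while trailing-underscore attributes hold what fit() learned. A minimal sketch using Ridge with made-up data:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hyperparameters: chosen BEFORE training
model = Ridge(alpha=1.0)
print("Hyperparameters:", model.get_params())      # alpha, fit_intercept, ...

# Parameters: learned FROM data during fit()
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 4.2, 5.9, 8.1])
model.fit(X, y)
print("Learned coefficient (β₁):", model.coef_)
print("Learned intercept   (β₀):", model.intercept_)
```

The trailing underscore (coef_, intercept_, tree_, cluster_centers_) is sklearn's convention for attributes estimated from data — i.e., the parameters.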
#55

GridSearchCV & RandomizedSearchCV

Definition

GridSearchCV exhaustively tests every combination of hyperparameter values. RandomizedSearchCV randomly samples a fixed number of combinations. Both use cross-validation to evaluate each combination on training data without touching the test set.

Code: GridSearchCV + RandomizedSearchCV
gridsearch.py
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from scipy.stats import uniform, randint

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target,
                                             test_size=0.2, stratify=data.target, random_state=42)

# ── GridSearchCV: Exhaustive ───────────────────────────
# 3 × 3 × 2 = 18 combinations × 5 folds = 90 model fits
pipe = Pipeline([('sc', StandardScaler()), ('svc', SVC(probability=True))])
param_grid = {
    'svc__C':      [0.1, 1.0, 10.0],
    'svc__gamma':  ['scale', 'auto', 0.01],
    'svc__kernel': ['rbf', 'linear']
}

gs = GridSearchCV(pipe, param_grid, cv=5, scoring='f1',
                   n_jobs=-1, verbose=1)  # n_jobs=-1: use all CPU cores
gs.fit(X_tr, y_tr)

print(f"GridSearch Best params: {gs.best_params_}")
print(f"GridSearch Best CV F1:  {gs.best_score_:.4f}")
from sklearn.metrics import f1_score
print(f"GridSearch Test F1:    {f1_score(y_te, gs.predict(X_te)):.4f}")

# ── RandomizedSearchCV: Faster for large search spaces ─
# Instead of testing all combos, randomly sample n_iter combinations
param_dist = {
    'max_depth':       randint(2, 20),        # Random int [2, 20)
    'min_samples_split': randint(2, 30),
    'min_samples_leaf':  randint(1, 15),
    'criterion':        ['gini', 'entropy'],
}

rs = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,          # Test 50 random combinations
    cv=5,
    scoring='f1',
    n_jobs=-1,
    random_state=42
)
rs.fit(X_tr, y_tr)
print(f"\nRandomizedSearch Best params: {rs.best_params_}")
print(f"RandomizedSearch Best CV F1:  {rs.best_score_:.4f}")

# ── Results DataFrame ─────────────────────────────────
cv_results = pd.DataFrame(rs.cv_results_)
print("\nTop 5 configurations:")
print(cv_results.nlargest(5, 'mean_test_score')[['mean_test_score','std_test_score','params']])
Interview Insights
Q: When should you use GridSearch vs RandomizedSearch vs Bayesian Optimization?
A: GridSearch: small search space (few params, few values), exhaustive is feasible. RandomizedSearch: large search space, faster, often finds good solutions with n_iter ≈ 50-100. Bayesian Optimization (Optuna, scikit-optimize): most efficient — uses past results to guide next search; best for expensive models and large spaces. For production: always start with Randomized, then refine with Grid around the best region.
#56

Cross Validation — Theory

Definition

Cross Validation (CV) is a resampling technique for evaluating model performance and tuning hyperparameters. It splits data into k folds, trains on k-1 folds and validates on the remaining fold, rotating k times to use every sample as validation exactly once.

CV Variants
Technique How it works Best for
K-Fold CV Split into k equal folds, rotate validation Standard choice; large datasets
Stratified K-Fold K-Fold preserving class distribution Classification; imbalanced data
Leave-One-Out (LOO) k = n (each sample is a fold) Very small datasets; expensive
Time Series Split Respects temporal order — no future leakage Time series data
Repeated K-Fold Run K-Fold r times with different splits More stable estimate on small data
CV Score = mean(score_fold₁, score_fold₂, ..., score_fold_k)
k=5 or k=10 is standard. Larger k = less bias, more variance in estimate.
Why CV is critical

A single train/test split is highly dependent on the random split. You might get lucky (test set is easy) or unlucky (test set is hard). CV averages over k different splits, giving a much more reliable performance estimate. The std of CV scores also tells you model stability.
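To make this concrete, here is a small experiment (using the breast cancer dataset that appears elsewhere in this course) comparing the spread of ten single train/test splits against one averaged 5-fold CV estimate:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Ten single train/test splits: the score swings with the random split
single_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    single_scores.append(pipe.fit(X_tr, y_tr).score(X_te, y_te))
print(f"10 single splits: min={min(single_scores):.3f}, max={max(single_scores):.3f}")

# One 5-fold CV run: an averaged, more stable estimate (± std shows stability)
cv_scores = cross_val_score(pipe, X, y, cv=5)
print(f"5-fold CV:        {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
```

The min–max gap across single splits is exactly the "lucky vs unlucky split" problem; the CV mean ± std summarizes it in one number.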

#57

Cross Validation — Practical

Code: All CV Variants + Nested CV
cross_validation.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import (KFold, StratifiedKFold, cross_val_score,
                                        cross_validate, TimeSeriesSplit,
                                        RepeatedStratifiedKFold)
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

iris = load_iris()
X, y = iris.data, iris.target
pipe = Pipeline([('sc',StandardScaler()),('dt',DecisionTreeClassifier(max_depth=4,random_state=42))])

# ── 1. Basic K-Fold ───────────────────────────────────
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kf = cross_val_score(pipe, X, y, cv=kf, scoring='accuracy')
print(f"K-Fold(5)     : {scores_kf.mean():.4f} ± {scores_kf.std():.4f}")

# ── 2. Stratified K-Fold (for classification) ─────────
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_skf = cross_val_score(pipe, X, y, cv=skf, scoring='accuracy')
print(f"Stratified(5) : {scores_skf.mean():.4f} ± {scores_skf.std():.4f}")

# ── 3. Get multiple metrics at once ───────────────────
cv_results = cross_validate(pipe, X, y, cv=5,
                             scoring=['accuracy','f1_macro'],
                             return_train_score=True)
print(f"\ncross_validate:")
print(f"  Train acc: {cv_results['train_accuracy'].mean():.4f}")
print(f"  Test acc:  {cv_results['test_accuracy'].mean():.4f}")
print(f"  Test F1:   {cv_results['test_f1_macro'].mean():.4f}")
# If train >> test → overfitting signal!

# ── 4. Repeated Stratified K-Fold ─────────────────────
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores_r = cross_val_score(pipe, X, y, cv=rskf)
print(f"\nRepeated K-Fold (5×3): {scores_r.mean():.4f} ± {scores_r.std():.4f}")

# ── 5. Nested CV: unbiased estimate with tuning ────────
# Outer CV evaluates model; Inner CV tunes hyperparameters
from sklearn.model_selection import GridSearchCV

inner_cv  = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
outer_cv  = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

dt = DecisionTreeClassifier(random_state=42)
gs = GridSearchCV(dt, {'max_depth':[2,3,5,7]}, cv=inner_cv, scoring='accuracy')
nested_scores = cross_val_score(gs, X, y, cv=outer_cv, scoring='accuracy')
print(f"\nNested CV (unbiased estimate): {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
Summary
  • CV gives reliable performance estimates by averaging over multiple splits
  • Always use StratifiedKFold for classification to preserve class ratios
  • cross_validate returns both train and test scores — compare them to detect overfitting
  • Nested CV is the gold standard: outer CV evaluates, inner CV tunes hyperparameters

Phase 11 — Unsupervised Learning

#58

Unsupervised Learning Overview

Definition

Unsupervised learning discovers hidden patterns in data without labeled outputs. The model must find structure — groups, outliers, or latent representations — entirely from the input data itself.

Unsupervised Learning Categories
Category Goal Algorithms Example
Clustering Group similar data points K-Means, DBSCAN, Hierarchical Customer segmentation
Dimensionality Reduction Compress features while retaining structure PCA, t-SNE, UMAP Visualize high-dim data
Anomaly Detection Identify outliers/unusual samples Isolation Forest, LOF Fraud detection
Association Rules Find co-occurring patterns Apriori, FP-Growth Market basket analysis
Key Challenge: No Ground Truth

Evaluating unsupervised models is hard — there's no correct answer. Internal metrics (Silhouette Score, Davies-Bouldin) measure cluster quality without labels. External metrics (Adjusted Rand Index) work if you do have ground truth labels for validation.

#59

K-Means Clustering — Theory

Definition

K-Means partitions n samples into k clusters by iteratively assigning each point to the nearest centroid, then recomputing centroids as cluster means. It minimizes the total within-cluster sum of squared distances (inertia).

Algorithm (Lloyd's Algorithm)

1. Randomly initialize k centroids
2. Assign each point to nearest centroid (by Euclidean distance)
3. Recompute each centroid as mean of its assigned points
4. Repeat steps 2–3 until convergence (centroids don't move or max_iter reached)
5. Result: k cluster assignments + k centroid positions

Inertia = Σᵢ Σₓ∈Cᵢ ||x − μᵢ||²
μᵢ = (1/|Cᵢ|) Σₓ∈Cᵢ x (centroid = cluster mean)
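Steps 1–4 translate almost line-for-line into NumPy. This is a minimal sketch of Lloyd's algorithm — random-point initialization rather than K-Means++, and it ignores the empty-cluster edge case:

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, seed=42):
    """Minimal Lloyd's algorithm: assign -> recompute -> repeat until stable."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random init
    for _ in range(n_iter):
        # step 2: assign each point to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # step 4: converged when centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    inertia = ((X - centroids[labels]) ** 2).sum()   # within-cluster SSE
    return labels, centroids, inertia
```

sklearn's KMeans wraps this same loop with K-Means++ initialization and multiple restarts (n_init).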
K-Means Limitations
Limitation Problem Workaround
Must specify k k is unknown in practice Elbow method, Silhouette analysis
Sensitive to initialization Different runs → different results K-Means++ init (default in sklearn)
Assumes spherical clusters Fails on elongated/ring shapes DBSCAN, GMM for arbitrary shapes
Sensitive to outliers Outliers pull centroids K-Medoids (uses median), remove outliers first
Scale-dependent Large-scale features dominate Always scale features before K-Means
#60

K-Means — Practical

Code: K-Means + Elbow Method + Optimal k
kmeans_practical.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# ── Generate clusterable data ─────────────────────────
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# ── Elbow Method: Find optimal k ──────────────────────
inertias, sil_scores = [], []
k_range = range(2, 11)
for k in k_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_scaled, labels))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(k_range, inertias, 'b-o')
ax1.set_xlabel('k'); ax1.set_ylabel('Inertia (WCSS)')
ax1.set_title('Elbow Method\n(Look for the "elbow" bend)')

ax2.plot(k_range, sil_scores, 'r-o')
ax2.set_xlabel('k'); ax2.set_ylabel('Silhouette Score')
ax2.set_title('Silhouette Score\n(Higher = better separated clusters)')
plt.tight_layout(); plt.show()

best_k = k_range[np.argmax(sil_scores)]
print(f"Optimal k by Silhouette: {best_k}")

# ── Final K-Means ─────────────────────────────────────
km_final = KMeans(n_clusters=best_k, init='k-means++', n_init=10, random_state=42)
labels = km_final.fit_predict(X_scaled)

plt.figure(figsize=(8, 5))
plt.scatter(X_scaled[:,0], X_scaled[:,1], c=labels, cmap='tab10', s=20, alpha=0.7)
plt.scatter(km_final.cluster_centers_[:,0], km_final.cluster_centers_[:,1],
           c='red', s=300, marker='*', label='Centroids')
plt.title(f'K-Means with k={best_k}\nSilhouette={silhouette_score(X_scaled,labels):.3f}')
plt.legend(); plt.tight_layout(); plt.show()

# ── Cluster analysis: describe each cluster ───────────
df = pd.DataFrame(X, columns=['feature1','feature2'])
df['cluster'] = labels
print("\nCluster statistics:")
print(df.groupby('cluster').agg(['mean','std','count']))
#61

Hierarchical Clustering — Theory

Definition

Hierarchical clustering builds a dendrogram (tree) of nested clusters. Unlike K-Means, you don't specify k upfront — you can cut the tree at any level to get different numbers of clusters. Two approaches: Agglomerative (bottom-up: start with n clusters, merge) and Divisive (top-down: start with 1 cluster, split).

Linkage Criteria (how to measure cluster-to-cluster distance)
Linkage Distance between clusters Creates
Single (min) Minimum pairwise distance Elongated, chaining clusters
Complete (max) Maximum pairwise distance Compact, equal-sized clusters
Average Average of all pairwise distances Balance between single/complete
Ward Minimizes total within-cluster variance (default) Compact, roughly equal clusters; usually best
#62

Agglomerative Clustering — Practical

Code: Dendrogram + Agglomerative Clustering
hierarchical_clustering.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage

X, _ = make_blobs(n_samples=200, centers=4, random_state=42)
X_sc = StandardScaler().fit_transform(X)

# ── Dendrogram ─────────────────────────────────────────
Z = linkage(X_sc, method='ward')  # Full linkage tree
plt.figure(figsize=(12, 4))
dendrogram(Z, truncate_mode='level', p=5,
           leaf_rotation=90, leaf_font_size=8)
plt.title('Dendrogram (Ward Linkage)\nCut at any height to get k clusters')
plt.xlabel('Sample Index'); plt.ylabel('Distance')
plt.tight_layout(); plt.show()

# ── Compare linkage methods ───────────────────────────
fig, axes = plt.subplots(1, 4, figsize=(18, 4))
for i, link in enumerate(['single','complete','average','ward']):
    agg = AgglomerativeClustering(n_clusters=4, linkage=link)
    labels = agg.fit_predict(X_sc)
    sil = silhouette_score(X_sc, labels)
    axes[i].scatter(X_sc[:,0], X_sc[:,1], c=labels, cmap='tab10', s=20)
    axes[i].set_title(f'Linkage: {link}\nSilhouette={sil:.3f}')
plt.tight_layout(); plt.show()
# Ward usually gives the best silhouette score
#63

DBSCAN — Theory

Definition

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) discovers clusters as dense regions separated by sparse regions. It doesn't require specifying k and can find arbitrarily shaped clusters while labeling outliers as noise.

Core Concepts
Term Definition
ε (eps) Radius of neighborhood around each point
min_samples Minimum points required to form a dense core
Core point Has ≥ min_samples neighbors within radius ε
Border point Within ε of a core point but has fewer than min_samples neighbors
Noise point Neither core nor border — labeled as −1 (outlier)
Interview Insights
Q: When would you choose DBSCAN over K-Means?
A: (1) You don't know k upfront. (2) Clusters have arbitrary shapes (rings, crescents). (3) You need outlier detection (DBSCAN labels noise points). (4) Clusters have varying densities — actually, this is a challenge for DBSCAN too; HDBSCAN handles it better. K-Means is faster and scales better; DBSCAN handles irregular shapes and anomalies.
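A fitted sklearn DBSCAN exposes enough to recover all three point types: core_sample_indices_ marks the core points, label −1 marks noise, and everything else is a border point. A quick sketch on synthetic blobs (eps and min_samples here are illustrative choices):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X_sc = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5).fit(X_sc)

core_mask = np.zeros(len(X_sc), dtype=bool)
core_mask[db.core_sample_indices_] = True     # ≥ min_samples neighbors within eps
noise_mask = db.labels_ == -1                 # reachable from no core point: label -1
border_mask = ~core_mask & ~noise_mask        # in a cluster, but not dense enough to be core

print(f"Core: {core_mask.sum()}, Border: {border_mask.sum()}, Noise: {noise_mask.sum()}")
```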
#64

DBSCAN — Practical

Code: DBSCAN + Parameter Tuning + Outlier Detection
dbscan_practical.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_moons, make_blobs

# ── DBSCAN vs K-Means on moon-shaped data ─────────────
X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
X_sc = StandardScaler().fit_transform(X_moons)

from sklearn.cluster import KMeans
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

km = KMeans(n_clusters=2, random_state=42).fit_predict(X_sc)
ax1.scatter(X_sc[:,0], X_sc[:,1], c=km, cmap='coolwarm', s=20)
ax1.set_title('K-Means (k=2)\n❌ FAILS on moons')

db = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_sc)
ax2.scatter(X_sc[:,0], X_sc[:,1], c=db, cmap='coolwarm', s=20)
ax2.set_title('DBSCAN (eps=0.3)\n✅ Correctly identifies moons')
plt.tight_layout(); plt.show()

# ── How to choose eps: K-distance plot ────────────────
from sklearn.neighbors import NearestNeighbors

X2, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X2_sc = StandardScaler().fit_transform(X2)

nbrs = NearestNeighbors(n_neighbors=5).fit(X2_sc)
distances, _ = nbrs.kneighbors(X2_sc)
k_dist = np.sort(distances[:, 4])[::-1]  # 5th nearest neighbor distance, sorted

plt.figure(figsize=(8, 4))
plt.plot(k_dist)
plt.axhline(y=0.5, color='r', linestyle='--', label='eps ≈ 0.5 (elbow)')
plt.title('K-Distance Plot (k=5)\nChoose eps at the "elbow" of the curve')
plt.xlabel('Points (sorted)'); plt.ylabel('5th Nearest Neighbor Distance')
plt.legend(); plt.tight_layout(); plt.show()

# ── DBSCAN for anomaly detection ─────────────────────
db_final = DBSCAN(eps=0.5, min_samples=5).fit(X2_sc)
labels = db_final.labels_
n_clusters  = len(set(labels)) - (1 if -1 in labels else 0)
n_noise     = list(labels).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Noise points (anomalies): {n_noise}")
print(f"Noise indices: {np.where(labels == -1)[0]}")
#65

Silhouette Score

Definition

Silhouette Score measures how similar each point is to its own cluster vs other clusters. It's an internal evaluation metric — no ground truth labels needed. Range: [−1, 1]; higher = better-defined clusters.

s(i) = (b(i) − a(i)) / max(a(i), b(i))
a(i) = mean distance to own cluster (intra-cluster cohesion)
b(i) = mean distance to nearest other cluster (inter-cluster separation)
s = 1: perfect cluster assignment | s = 0: overlapping | s = -1: wrong cluster
Code: Silhouette Analysis + Visualization
silhouette_score.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import matplotlib.cm as cm

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X_sc = StandardScaler().fit_transform(X)

# ── Silhouette plot for different k values ─────────────
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

for idx, k in enumerate([2,3,4,5,6,7]):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_sc)
    avg_sil = silhouette_score(X_sc, labels)
    sample_sils = silhouette_samples(X_sc, labels)
    
    ax = axes[idx]
    y_lower = 10
    for c in range(k):
        sil_c = sorted(sample_sils[labels == c])
        size_c = len(sil_c)
        y_upper = y_lower + size_c
        color = cm.nipy_spectral(float(c)/k)
        ax.fill_betweenx(np.arange(y_lower, y_upper), 0, sil_c, color=color)
        y_lower = y_upper + 10
    
    ax.axvline(x=avg_sil, color='red', linestyle='--')
    ax.set_title(f'k={k}, avg_sil={avg_sil:.3f}')
    ax.set_xlabel('Silhouette coefficient')

plt.suptitle('Silhouette Analysis: Wide, uniform blades = good clustering')
plt.tight_layout(); plt.show()
# k=4 should show the best uniform, wide silhouette blades
Summary — Clustering Comparison
Algorithm Needs k? Cluster Shape Outliers Speed
K-Means Yes Spherical only No Fast
Hierarchical No (cut later) Any No O(n² log n)
DBSCAN No Any arbitrary Yes (label -1) O(n log n)

⚡ Project 3 — After Clustering

🚀P3

Mini Project: Customer Market Segmentation

Code: Full Segmentation Pipeline
project3_segmentation.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

np.random.seed(42)
n = 500
df = pd.DataFrame({
    'age':             np.random.normal(40, 15, n).clip(18, 80),
    'annual_income':   np.random.lognormal(10.5, 0.5, n),
    'spending_score':  np.random.normal(50, 25, n).clip(1, 100),
    'purchase_freq':   np.random.poisson(5, n),
    'loyalty_years':   np.random.exponential(3, n).clip(0, 15),
})

# 1. Scale
sc = StandardScaler()
X_sc = sc.fit_transform(df)

# 2. PCA for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_sc)

# 3. Find optimal k
sil_scores = [silhouette_score(X_sc, KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_sc))
              for k in range(2, 8)]
best_k = np.argmax(sil_scores) + 2

# 4. Final segmentation
km = KMeans(n_clusters=best_k, n_init=10, random_state=42)
df['segment'] = km.fit_predict(X_sc)

# 5. Visualize
plt.figure(figsize=(10, 5))
for seg in df['segment'].unique():
    mask = df['segment'] == seg
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=f'Segment {seg}', s=20, alpha=0.7)
plt.title(f'Customer Segments (PCA 2D) — {best_k} segments')
plt.legend(); plt.tight_layout(); plt.show()

# 6. Profile segments
print("\nSegment Profiles:")
print(df.groupby('segment')[['age', 'annual_income', 'spending_score', 'loyalty_years']].mean().round(1))

Phase 12 — Association Rule Learning

#66

Association Rule Learning

Definition

Association Rule Learning discovers co-occurrence patterns in transactional data: "If a customer buys X, they also tend to buy Y." Used for market basket analysis, recommendation systems, and web clickstream analysis.

Key Metrics
Metric Formula Meaning Range
Support P(X ∪ Y) How often {X,Y} appears together in all transactions [0,1]
Confidence P(Y|X) = P(X∪Y)/P(X) If X is bought, probability that Y is also bought [0,1]
Lift Confidence / P(Y) How much more likely Y is given X, vs random. >1 = positive association [0,∞)
Conviction (1−P(Y))/(1−Confidence) How much X being present increases certainty of Y [0,∞)
🛒 Classic Example: Beer & Diapers

Support({beer,diapers}) = 0.30 (30% of transactions contain both)
Confidence(diapers→beer) = 0.70 (70% of diaper buyers also bought beer)
Lift = 0.70 / 0.50 = 1.4 (40% more likely than random)
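All three metrics are a few lines of plain Python. The helper and mini transaction log below are hypothetical, just to show the arithmetic:

```python
def rule_metrics(transactions, X, Y):
    """Support, confidence and lift for the rule X -> Y (X, Y are sets of items)."""
    sets = [set(t) for t in transactions]
    n = len(sets)
    n_x  = sum(X <= t for t in sets)          # transactions containing all of X
    n_y  = sum(Y <= t for t in sets)
    n_xy = sum((X | Y) <= t for t in sets)    # transactions containing X and Y
    support    = n_xy / n                     # P(X ∪ Y)
    confidence = n_xy / n_x                   # P(Y | X)
    lift       = confidence / (n_y / n)       # confidence vs baseline P(Y)
    return support, confidence, lift

# Hypothetical mini transaction log
transactions = [
    ['beer', 'diapers', 'chips'], ['beer', 'diapers'], ['diapers', 'milk'],
    ['beer', 'chips'], ['beer', 'diapers', 'milk'], ['milk', 'bread'],
]
s, c, l = rule_metrics(transactions, {'diapers'}, {'beer'})
print(f"diapers → beer: support={s:.2f}, confidence={c:.2f}, lift={l:.2f}")
```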

#67

Apriori Algorithm — Theory

Definition

Apriori generates frequent itemsets (item combinations meeting minimum support) by starting with 1-item sets and iteratively building larger sets. Uses the Apriori principle: any subset of a frequent itemset must also be frequent — this prunes the search space dramatically.

Algorithm Steps

1. Generate all 1-item sets, filter by min_support → frequent 1-items
2. Join frequent k-items to generate (k+1)-item candidates
3. Prune any candidate whose subset is NOT frequent (Apriori pruning)
4. Filter candidates by min_support
5. Repeat until no new frequent itemsets found
6. Generate rules from all frequent itemsets using min_confidence

Interview Insights
Q: What is the Apriori principle and how does it speed up the algorithm?
A: Apriori principle: if an itemset is infrequent, all its supersets are infrequent too. So if {milk, bread} has support below threshold, we NEVER need to check {milk, bread, butter} — it will also be infrequent. This allows pruning entire branches of the search space before computing their support, reducing computation from O(2ⁿ) to much less in practice.
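A toy version of the join-and-prune loop makes the principle tangible (an educational sketch, not a production implementation — use mlxtend's apriori for real data):

```python
from itertools import combinations

def apriori_sketch(transactions, min_support=0.3):
    """Educational Apriori: grow itemsets level by level with subset pruning."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)

    def support(itemset):
        return sum(itemset <= t for t in sets) / n

    # Level 1: frequent single items
    singles = {i for t in sets for i in t}
    current = {frozenset([i]) for i in singles if support(frozenset([i])) >= min_support}

    frequent, k = {}, 1
    while current:
        frequent.update({fs: support(fs) for fs in current})
        # Join: combine frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune (Apriori principle): every k-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        current = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent

txns = [['a','b','c'], ['a','b'], ['a','c'], ['b','c'], ['a','b','c']]
for itemset, sup in sorted(apriori_sketch(txns, 0.4).items(), key=lambda kv: -kv[1]):
    print(set(itemset), f"support={sup:.2f}")
```

The prune step is where the speedup lives: a candidate is discarded without ever counting its support if any of its subsets failed the threshold.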
#68

Apriori — Practical

Code: Apriori with mlxtend
apriori_practical.py
import pandas as pd
import numpy as np
# pip install mlxtend
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# ── Sample grocery transactions ────────────────────────
transactions = [
    ['milk','bread','butter'],
    ['milk','bread'],
    ['milk','butter','eggs'],
    ['bread','butter','eggs','cheese'],
    ['milk','bread','butter','eggs'],
    ['cheese','butter'],
    ['milk','eggs','bread'],
    ['milk','cheese','bread'],
    ['butter','eggs','milk'],
    ['bread','milk','cheese','eggs'],
    ['milk','bread','butter','cheese'],
]
# ── One-hot encode transactions ─────────────────────────
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)
# ── Generate frequent itemsets with min_support=0.3 ───────
frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)
# ── Generate rules with min_confidence=0.7 ───────────────────────────────
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.7)
# ── Display rules sorted by lift ───────────────────────────────
rules = rules.sort_values('lift', ascending=False)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
#69

FP-Growth Algorithm — Theory

Definition

FP-Growth (Frequent Pattern Growth) is a faster alternative to Apriori that avoids repeated database scans. It compresses the entire transaction database into a compact FP-tree (prefix tree), then mines frequent patterns directly from this tree — no candidate generation needed.

Intuition
🌳 The FP-Tree Idea

Instead of repeatedly reading raw transactions, FP-Growth reads the database exactly twice: once to find frequent single items, once to build a compressed prefix tree. Transactions sharing common prefixes share nodes in the tree — like a trie/prefix tree. Mining patterns then happens entirely in memory on this compact tree, recursively building "conditional pattern bases".

FP-Growth vs Apriori Comparison
Property Apriori FP-Growth
Database scans k scans (one per itemset size) Exactly 2 scans
Candidate generation Yes — generates then prunes No — divides problem recursively
Memory Lower (no tree structure) Higher (FP-tree in RAM)
Speed Slow on large data 10–100× faster than Apriori
Implementation Simpler to understand More complex
Best for Small datasets, education Production, large transactions
FP-Growth Algorithm Steps

Pass 1: Scan database → count item frequencies → discard items below min_support → sort frequent items by frequency (descending)

Pass 2: Build FP-tree — insert each transaction (only frequent items, in sorted order) into prefix tree. Shared prefixes share nodes; each node stores item name + count.

Mining: For each frequent item, extract its conditional pattern base (all paths ending at that item), build a conditional FP-tree, recurse. Each recursive call produces frequent itemsets containing that item.
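The two passes and the prefix-sharing idea can be sketched with a tiny FP-tree builder (construction only — the recursive mining step is omitted for brevity):

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support_count=2):
    # Pass 1: count item frequencies, keep only frequent items
    counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in counts.items() if c >= min_support_count}

    # Pass 2: insert each transaction (frequent items only, sorted by
    # descending global frequency) into the prefix tree
    root = FPNode(None, None)
    for t in transactions:
        path = sorted((i for i in t if i in frequent),
                      key=lambda i: (-counts[i], i))
        node = root
        for item in path:
            # shared prefixes reuse existing nodes — this is the compression
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, counts

def print_tree(node, depth=0):
    if node.item is not None:
        print('  ' * (depth - 1) + f"{node.item}:{node.count}")
    for child in node.children.values():
        print_tree(child, depth + 1)

root, _ = build_fp_tree([['a','b'], ['a','c'], ['a','b','c'], ['b']])
print_tree(root)
```

Note how ['a','b'] and ['a','b','c'] share the a→b prefix: one branch stores both transactions, which is why the tree is so much smaller than the raw database.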

Interview Insights
Q: When does FP-Growth degrade in performance?
A: When the FP-tree cannot fit in memory (very low min_support on huge data → very large tree), or when the database is sparse (few shared prefixes → tree is nearly the same size as original data). In those cases, disk-based variants like H-Mine or Eclat are better. For most real-world grocery/retail datasets, FP-Growth is the clear winner.
#70

FP-Growth — Practical

Code: FP-Growth vs Apriori Speed Comparison
fpgrowth_practical.py
import pandas as pd
import numpy as np
import time
from mlxtend.frequent_patterns import fpgrowth, apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
import matplotlib.pyplot as plt

# ── Generate larger synthetic transaction dataset ──────
np.random.seed(42)
items = ['milk','bread','butter','eggs','cheese',
         'yogurt','juice','cereal','coffee','tea']

transactions = []
for _ in range(500):
    # Simulate real shopping patterns: each item included with its own biased probability
    weights = [0.7,0.65,0.5,0.55,0.35,0.3,0.4,0.25,0.45,0.3]
    chosen = [item for item, w in zip(items, weights) if np.random.random() < w]
    if len(chosen) >= 2:
        transactions.append(chosen)

# Encode
te = TransactionEncoder()
df_enc = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)
print(f"Dataset: {len(transactions)} transactions × {len(te.columns_)} items")

# ── Speed benchmark: FP-Growth vs Apriori ─────────────
min_sup = 0.2

t0 = time.time()
fp_itemsets = fpgrowth(df_enc, min_support=min_sup, use_colnames=True)
t_fp = time.time() - t0

t0 = time.time()
ap_itemsets = apriori(df_enc, min_support=min_sup, use_colnames=True)
t_ap = time.time() - t0

print(f"\nFP-Growth: {len(fp_itemsets)} itemsets in {t_fp:.4f}s")
print(f"Apriori:   {len(ap_itemsets)} itemsets in {t_ap:.4f}s")
print(f"Speedup:   {t_ap/t_fp:.2f}×")

# ── Generate rules from FP-Growth itemsets ────────────
rules = association_rules(fp_itemsets, metric='lift', min_threshold=1.1)
rules['antecedents'] = rules['antecedents'].apply(lambda x: ', '.join(list(x)))
rules['consequents'] = rules['consequents'].apply(lambda x: ', '.join(list(x)))

top_rules = rules.nlargest(10, 'lift')[
    ['antecedents','consequents','support','confidence','lift']
]
print("\nTop 10 Rules by Lift:")
print(top_rules.to_string(index=False))

# ── Heatmap: Item co-occurrence matrix ────────────────
import seaborn as sns

cooc = df_enc.astype(int).T.dot(df_enc.astype(int))  # integer counts, not booleans
# Row-normalize by each item's own frequency: entry (row, col) = P(col | row)
cooc_norm = cooc.div(np.diag(cooc.values), axis=0)

plt.figure(figsize=(8, 6))
sns.heatmap(cooc_norm, annot=True, fmt='.2f', cmap='YlOrRd',
            linewidths=0.5, vmin=0, vmax=1)
plt.title('Item Co-occurrence Matrix\n(value = P(col bought | row bought))')
plt.tight_layout(); plt.show()
Summary — Association Rules
  • FP-Growth is generally preferred over Apriori in production — faster, and needs only two database scans
  • Both produce identical rules — the algorithm differs, not the output
  • Real business value: product placement, "customers also bought", promotions
  • Always evaluate rules with Lift, not just Confidence — lift > 1 means genuine co-occurrence
🏋️ Mini Practice Task

Download a real retail dataset (e.g. UCI Online Retail Dataset). Run FP-Growth with min_support=0.05. Find the top 5 rules by lift. What business action would you recommend based on each rule?

Phase 13 — Ensemble Learning

#71

Ensemble Learning — Overview

Definition

Ensemble Learning combines multiple models (weak learners) to produce a stronger, more accurate and robust prediction than any single model alone. The key insight: models make different errors, and combining them cancels out individual mistakes.

Intuition
🧑‍⚖️ The Wisdom of Crowds

A single expert doctor may be wrong. But if 100 doctors independently diagnose the same patient and you take the majority vote, accuracy dramatically improves — individual errors cancel out. This is the ensemble principle. Even weak individual models (slightly better than random) can combine into a very strong model.
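The crowd effect is easy to verify numerically. A minimal simulation, assuming 101 independent classifiers that are each right 60% of the time (numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_models, n_trials = 0.6, 101, 20_000  # each model right 60% of the time

# Each row is one case; each column is one model's correct(1)/wrong(0) call
correct = rng.random((n_trials, n_models)) < p
majority_correct = correct.sum(axis=1) > n_models // 2

print(f"Single model accuracy:  {p:.2f}")
print(f"Majority-vote accuracy: {majority_correct.mean():.3f}")  # ≈ 0.98
```

The catch, covered below, is independence: if the voters' errors are correlated, the majority vote gains far less.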

Three Ensemble Families
Family Core Idea Algorithms Strength
Bagging Train models on bootstrap samples in parallel, average outputs Random Forest, BaggingClassifier Reduces variance (overfitting)
Boosting Train models sequentially, each focusing on previous errors AdaBoost, Gradient Boosting, XGBoost Reduces bias (underfitting)
Stacking Train base models, use their predictions as input to a meta-model StackingClassifier/Regressor Combines heterogeneous models
Interview Insights
Q: What makes ensemble methods work? What's the mathematical intuition?
A: For averaging to reduce error, two conditions must hold: (1) Each model must be better than random (accuracy > 50% for binary classification). (2) Models must make DIFFERENT errors (low correlation between errors). If all models make the same mistakes, averaging doesn't help. This is why diversity is critical: Random Forest uses feature randomness and data bootstrapping to decorrelate trees. The variance of the ensemble mean is σ²/n when model errors are uncorrelated — just 4 models halve the standard error.
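The σ²/n claim can be checked directly. A sketch with synthetic, uncorrelated model errors (σ = 2 and 16 models are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma, n_models = 2.0, 16

# Each column: one model's error around the true value (mean 0, std sigma)
errors = rng.normal(0.0, sigma, size=(200_000, n_models))

var_single   = errors[:, 0].var()          # ≈ sigma² = 4
var_ensemble = errors.mean(axis=1).var()   # ≈ sigma²/n = 0.25

print(f"Single-model variance:  {var_single:.3f}")
print(f"16-model mean variance: {var_ensemble:.3f}")
```

With correlated errors the reduction would be weaker — variance of the mean only falls to ρσ² + (1−ρ)σ²/n for pairwise correlation ρ.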
Summary
  • Ensemble = many weak learners combining into one strong learner
  • Bagging: parallel training on bootstrap samples → reduces variance
  • Boosting: sequential training fixing errors → reduces bias
  • Stacking: base models feed a meta-learner → most flexible, most powerful
  • Diversity between models is essential — correlated models don't help each other
#72

Voting Methods — Max, Average, Weighted

Definition

Voting ensembles combine predictions from multiple different algorithm types (heterogeneous ensemble). Each model votes, and the final prediction is determined by: Hard Voting (majority class), Soft Voting (average probabilities), or Weighted Voting (trusted models vote more).

Voting Strategies Explained
Strategy How It Works When to Use Requires
Hard Voting Majority class label wins (mode) Classification; when probabilities unavailable predict() from each model
Soft Voting Average predicted probabilities; argmax wins Classification; more accurate than hard when models are calibrated predict_proba() from each model
Weighted Voting Better models get higher vote weight When you know which models are stronger weights= parameter
Average (Regression) Mean of all model predictions Regression; baseline ensemble predict() from each model
Hard Voting: ŷ = mode(ŷ₁, ŷ₂, ..., ŷₙ)
Soft Voting: ŷ = argmax_k [ (1/n) Σ P_i(y=k|X) ]
Weighted: ŷ = argmax_k [ Σ wᵢ × P_i(y=k|X) ] where Σwᵢ = 1
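These three rules can disagree on the same input. A toy example with made-up per-model probabilities (not from fitted models):

```python
import numpy as np

# Rows: 3 models; columns: P(class 0), P(class 1) for one sample
P = np.array([[0.40, 0.60],
              [0.55, 0.45],
              [0.45, 0.55]])

# Hard voting: each model's argmax, then majority (mode)
votes = P.argmax(axis=1)                   # [1, 0, 1]
hard = np.bincount(votes).argmax()         # class 1 (2 votes to 1)

# Soft voting: average probabilities, then argmax
soft = P.mean(axis=0).argmax()             # mean = [0.467, 0.533] → class 1

# Weighted voting: model 2 is trusted most (weights sum to 1)
w = np.array([0.1, 0.7, 0.2])
weighted = (w @ P).argmax()                # [0.515, 0.485] → class 0

print(hard, soft, weighted)  # 1 1 0 — weighting can flip the decision
```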
Interview Insights
Q: Why is Soft Voting usually better than Hard Voting?
A: Soft voting uses probability estimates — richer information than just class labels. Consider 3 models: Models A and B barely predict class 1 (P=0.51 each), while Model C confidently predicts class 0 (P=0.05 for class 1). Hard voting picks class 1 (2 votes vs 1). Soft voting averages probabilities → P̄(class 1) = (0.51+0.51+0.05)/3 ≈ 0.36 → picks class 0, respecting C's confidence. Soft voting is more nuanced. However, it requires well-calibrated probabilities — if models are poorly calibrated (SVM without Platt scaling, for example), hard voting may actually be more reliable.
#73

Voting Regression — Practical

Code: VotingRegressor with Multiple Models
voting_regressor.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# ── Define individual base models ─────────────────────
# Note: VotingRegressor requires estimators that support fit/predict
# Wrap scale-sensitive models in a Pipeline
r1 = ('ridge',   Pipeline([('sc',StandardScaler()), ('m',Ridge(alpha=1.0))]))
r2 = ('dtree',   DecisionTreeRegressor(max_depth=6, random_state=42))
r3 = ('knn',     Pipeline([('sc',StandardScaler()), ('m',KNeighborsRegressor(n_neighbors=7))]))
r4 = ('svr',     Pipeline([('sc',StandardScaler()), ('m',SVR(C=10, gamma='scale'))]))

# ── VotingRegressor: uniform average ──────────────────
vr_uniform = VotingRegressor(estimators=[r1, r2, r3, r4])

# ── VotingRegressor: weighted (Ridge is best, give it more weight)
vr_weighted = VotingRegressor(estimators=[r1, r2, r3, r4],
                               weights=[3, 2, 1, 2])

# ── Evaluate all models ───────────────────────────────
results = {}
for name, model in [('Ridge', r1[1]), ('DTree', r2[1]),
                     ('KNN',  r3[1]), ('SVR',  r4[1]),
                     ('VotingUniform', vr_uniform),
                     ('VotingWeighted', vr_weighted)]:
    model.fit(X_tr, y_tr)
    yp   = model.predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, yp))
    r2   = r2_score(y_te, yp)
    results[name] = {'RMSE': rmse, 'R²': r2}

df_res = pd.DataFrame(results).T
print(df_res.round(4))

# ── Bar chart comparison ──────────────────────────────
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
colors = ['steelblue']*4 + ['#f59e0b', '#10b981']

ax1.bar(df_res.index, df_res['RMSE'], color=colors)
ax1.set_title('RMSE Comparison\n(Lower = Better)')
ax1.tick_params(axis='x', rotation=25)

ax2.bar(df_res.index, df_res['R²'], color=colors)
ax2.set_title('R² Comparison\n(Higher = Better)')
ax2.tick_params(axis='x', rotation=25)
plt.tight_layout(); plt.show()
# The ensemble typically matches or beats the best individual model
#74

Voting Classification — Practical

Code: Hard vs Soft Voting + Decision Boundary
voting_classifier.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
X, y = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                             stratify=y, random_state=42)

# ── Base estimators ───────────────────────────────────
lr  = Pipeline([('sc',StandardScaler()),('m',LogisticRegression(max_iter=1000))])
dt  = DecisionTreeClassifier(max_depth=5, random_state=42)
knn = Pipeline([('sc',StandardScaler()),('m',KNeighborsClassifier(n_neighbors=7))])
gnb = Pipeline([('sc',StandardScaler()),('m',GaussianNB())])
svc = Pipeline([('sc',StandardScaler()),('m',SVC(probability=True))])

estimators = [('lr',lr),('dt',dt),('knn',knn),('gnb',gnb),('svc',svc)]

# ── Hard vs Soft Voting ───────────────────────────────
vc_hard = VotingClassifier(estimators=estimators, voting='hard')
vc_soft = VotingClassifier(estimators=estimators, voting='soft')

# ── Evaluate all models ───────────────────────────────
results = []
for name, model in estimators + [('VotingHard', vc_hard), ('VotingSoft', vc_soft)]:
    model.fit(X_tr, y_tr)
    yp  = model.predict(X_te)
    acc = accuracy_score(y_te, yp)
    f1  = f1_score(y_te, yp)
    cv  = cross_val_score(model, X_tr, y_tr, cv=5, scoring='f1').mean()
    results.append({'Model':name, 'Test Acc':acc, 'Test F1':f1, 'CV F1':cv})

df_r = pd.DataFrame(results)
print(df_r.sort_values('Test F1', ascending=False).to_string(index=False))

# ── Visualize ensemble improvement ────────────────────
fig, ax = plt.subplots(figsize=(10, 5))
colors = ['steelblue']*5 + ['#f59e0b', '#10b981']
bars = ax.barh(df_r['Model'], df_r['Test F1'], color=colors)
ax.axvline(x=df_r['Test F1'][:5].max(), color='red', linestyle='--',
          label='Best individual model')
ax.set_xlabel('F1-Score')
ax.set_title('Voting Ensemble vs Individual Models\n(Ensemble bars in gold/green)')
ax.legend(); plt.tight_layout(); plt.show()
Common Mistakes
  • Using correlated models in ensemble — if all models make the same mistakes, voting doesn't help. Mix: linear + tree + distance-based models for diversity.
  • Using Soft Voting with poorly calibrated models — averaging distorted probabilities can hurt; hard voting is safer there
  • Not scaling inside each pipeline — VotingClassifier calls each model separately, so each needs its own scaler
#75

Bagging & Random Forest — Theory

Definition

Bagging (Bootstrap Aggregating) trains multiple models on different bootstrap samples (random samples with replacement) of the training data, then aggregates predictions. Random Forest extends bagging by also randomizing the feature subset at each split — creating maximum diversity among trees.

Why Bagging Reduces Variance
📊 Variance Reduction Math

A single Decision Tree has high variance: σ²(single tree). If n trees make independent predictions, the variance of their mean is σ²/n. With 100 trees, variance drops 100×. In practice trees are correlated (they see the same data), but Random Forest's feature randomization reduces this correlation, recovering much of the ideal variance reduction.

Random Forest vs Bagging
Property Bagging (BaggingClassifier) Random Forest
Bootstrap sampling Yes Yes
Feature randomization per split No — all features considered Yes — random subset of √p of the p features
Tree correlation High (same features) Low (different feature subsets)
Diversity Moderate High — best decorrelation
Out-of-Bag (OOB) evaluation Yes (with oob_score=True) Yes (with oob_score=True)
Feature importance Depends on base estimator Built-in (Gini importance)
Bootstrap: sample n points WITH replacement from training set
~63.2% unique points per sample (rest are duplicates)
OOB samples: ~36.8% left out per tree → free validation set

Feature subset per split: max_features = sqrt(p) for classification, p/3 for regression
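The 63.2%/36.8% split follows from P(a point is never drawn in n tries) = (1 − 1/n)ⁿ → 1/e. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

boot = rng.integers(0, n, size=n)          # bootstrap: n draws WITH replacement
unique_frac = np.unique(boot).size / n

print(f"Unique (in-bag) fraction: {unique_frac:.4f}")   # ≈ 1 - 1/e ≈ 0.632
print(f"Out-of-bag fraction:      {1 - unique_frac:.4f}")  # ≈ 0.368
```

Those out-of-bag points are exactly what `oob_score=True` uses as a free validation set in the practical sections below.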
Interview Insights
Q: What is Out-of-Bag (OOB) error and why is it useful?
A: Each tree in a Random Forest is trained on ~63.2% of the data (bootstrap sample). The remaining ~36.8% (out-of-bag samples) were never seen by that tree. We can evaluate each tree on its OOB samples, then average — this gives a free, approximately cross-validated error estimate WITHOUT needing a separate validation set. OOB error is very close to 5-fold CV error and requires no extra compute.
Q: When would you use a single Decision Tree over Random Forest?
A: When interpretability is paramount — a single tree produces human-readable rules. Random Forest is a black box. In regulated industries (banking credit decisions, medical guidelines) where you must explain each decision, a single pruned tree is often legally required. For pure predictive accuracy, Random Forest almost always wins.
#76

Bagging — Classification Practical

Code: BaggingClassifier + RandomForestClassifier — Full Analysis
bagging_classification.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.datasets import load_breast_cancer
import seaborn as sns

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                             stratify=y, random_state=42)

# ── 1. Single Decision Tree (baseline) ───────────────
single_dt = DecisionTreeClassifier(max_depth=None, random_state=42)
single_dt.fit(X_tr, y_tr)
print("Single DTree:")
print(f"  Train acc: {accuracy_score(y_tr, single_dt.predict(X_tr)):.4f}")
print(f"  Test  acc: {accuracy_score(y_te, single_dt.predict(X_te)):.4f}")

# ── 2. Bagging with DTree base estimator ──────────────
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=100,         # number of trees
    max_samples=0.8,          # use 80% of data per tree
    max_features=0.8,         # use 80% of features per tree
    bootstrap=True,           # with replacement
    bootstrap_features=False,
    oob_score=True,            # free OOB evaluation
    n_jobs=-1,
    random_state=42
)
bag.fit(X_tr, y_tr)
print(f"\nBaggingClassifier:")
print(f"  OOB score:  {bag.oob_score_:.4f}  (free cross-validation estimate)")
print(f"  Test acc:   {accuracy_score(y_te, bag.predict(X_te)):.4f}")

# ── 3. Random Forest ──────────────────────────────────
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,            # let trees grow fully (bagging controls variance)
    max_features='sqrt',      # sqrt(n_features) per split — key RF innovation
    min_samples_split=2,
    min_samples_leaf=1,
    bootstrap=True,
    oob_score=True,
    n_jobs=-1,
    random_state=42
)
rf.fit(X_tr, y_tr)
print(f"\nRandomForestClassifier:")
print(f"  OOB score:  {rf.oob_score_:.4f}")
print(f"  Test acc:   {accuracy_score(y_te, rf.predict(X_te)):.4f}")
print(f"  Test F1:    {f1_score(y_te, rf.predict(X_te)):.4f}")
print("\nClassification Report:")
print(classification_report(y_te, rf.predict(X_te), target_names=data.target_names))

# ── 4. n_estimators vs OOB score curve ────────────────
oob_scores = []
n_range = range(10, 201, 10)
for n in n_range:
    model = RandomForestClassifier(n_estimators=n, oob_score=True,
                                    n_jobs=-1, random_state=42)
    model.fit(X_tr, y_tr)
    oob_scores.append(model.oob_score_)

plt.figure(figsize=(10, 4))
plt.plot(n_range, oob_scores, 'b-o', markersize=4)
plt.xlabel('n_estimators'); plt.ylabel('OOB Score')
plt.title('OOB Score vs Number of Trees\n(Score stabilizes — no overfitting with more trees!)')
plt.tight_layout(); plt.show()

# ── 5. Feature Importance ─────────────────────────────
imp = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(imp['Feature'][:15], imp['Importance'][:15], color='steelblue')
plt.xlabel('Feature Importance (Mean Decrease in Gini Impurity)')
plt.title('Top 15 Feature Importances — Random Forest\n(Higher = more useful for prediction)')
plt.gca().invert_yaxis()
plt.tight_layout(); plt.show()

print("\nTop 5 most important features:")
print(imp.head(5).to_string(index=False))
Common Mistakes
  • Fearing that more trees will overfit — unlike deepening a single tree, adding trees to a Random Forest reduces variance monotonically. After ~100–200 trees, gains are marginal; it's a compute/accuracy tradeoff.
  • Setting max_features=None in RF — this removes the feature randomization that makes RF special. You'd just get regular Bagging.
  • Confusing feature_importances_ with "causal importance" — RF importance measures how useful a feature is for prediction, not causation. Correlated features split importance between them.
Interview Insights
Q: Why doesn't Random Forest overfit with very deep trees?
A: Two reasons. First, averaging many uncorrelated predictions reduces variance even when individual trees have high variance (are overfit). Each tree overfits to its particular bootstrap sample, but their errors cancel when averaged. Second, the feature randomization at each split ensures trees are different enough that their individual overfitting patterns don't align. The ensemble "washes out" individual tree noise.
#77

Bagging — Regression Practical

Definition

Bagging Regression uses the same bootstrap aggregating principle for continuous targets. The ensemble prediction is the mean of all base regressor predictions. Random Forest Regressor is the most widely used variant.

Code: BaggingRegressor + RandomForestRegressor — Complete Analysis
bagging_regression.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                                GradientBoostingRegressor)
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.datasets import fetch_california_housing
import seaborn as sns

housing = fetch_california_housing(as_frame=True)
df = housing.frame
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# ── Model zoo ─────────────────────────────────────────
models = {
    'Single DTree': DecisionTreeRegressor(max_depth=None, random_state=42),
    'BaggingDTree': BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=42),
        n_estimators=100, oob_score=True, n_jobs=-1, random_state=42
    ),
    'RandomForest': RandomForestRegressor(
        n_estimators=200, max_features='sqrt',
        oob_score=True, n_jobs=-1, random_state=42
    ),
    'GradientBoost': GradientBoostingRegressor(  # bonus: show boosting too
        n_estimators=200, learning_rate=0.1,
        max_depth=4, random_state=42
    ),
}

results = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    yp   = model.predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, yp))
    mae  = mean_absolute_error(y_te, yp)
    r2   = r2_score(y_te, yp)
    oob  = getattr(model, 'oob_score_', np.nan)
    results.append({'Model':name, 'RMSE':rmse, 'MAE':mae, 'R²':r2, 'OOB':oob})

df_res = pd.DataFrame(results)
print(df_res.to_string(index=False))

# ── Visualize 1: RMSE bar chart ────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
colors = ['#64748b', 'steelblue', '#10b981', '#f59e0b']
axes[0].bar(df_res['Model'], df_res['RMSE'], color=colors)
axes[0].set_title('RMSE: Lower is Better\n(Ensemble beats single tree consistently)')
axes[0].tick_params(axis='x', rotation=15)

axes[1].bar(df_res['Model'], df_res['R²'], color=colors)
axes[1].set_title('R²: Higher is Better')
axes[1].tick_params(axis='x', rotation=15)
plt.tight_layout(); plt.show()

# ── Visualize 2: Actual vs Predicted (Random Forest) ──
rf_model = models['RandomForest']
y_pred_rf = rf_model.predict(X_te)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(y_te, y_pred_rf, alpha=0.2, s=8, color='steelblue')
axes[0].plot([y_te.min(), y_te.max()], [y_te.min(), y_te.max()], 'r--')
axes[0].set_xlabel('Actual'); axes[0].set_ylabel('Predicted')
axes[0].set_title(f'Random Forest: Actual vs Predicted\nR²={r2_score(y_te,y_pred_rf):.4f}')

# Feature importance
imp = (pd.DataFrame({'Feature': X.columns,
                     'Importance': rf_model.feature_importances_})
         .sort_values('Importance', ascending=True))
axes[1].barh(imp['Feature'], imp['Importance'], color='steelblue')
axes[1].set_title('Feature Importances\n(Random Forest Regression)')
plt.tight_layout(); plt.show()

# ── Hyperparameter tuning: key RF regression params ───
print("\n--- Key RandomForestRegressor Hyperparameters ---")
params_guide = {
    'n_estimators':      "100–500. More = better up to diminishing returns. Never overfits.",
    'max_features':      "'sqrt' or 'log2' for classification; for regression try 1/3 to 2/3 of features.",
    'max_depth':         "None (grow full) is fine for RF. Set only if memory is a concern.",
    'min_samples_split': "2–10. Increase to reduce overfitting on noisy data.",
    'min_samples_leaf':  "1–5. Larger = smoother predictions, less sensitive to noise.",
    'bootstrap':         "True always for Bagging. False = random subspaces only.",
}
for p, desc in params_guide.items():
    print(f"  {p:25s}: {desc}")
Common Mistakes
  • Treating feature_importances_ as definitive — for correlated features, importance is split arbitrarily between correlated variables. Use permutation importance (sklearn's permutation_importance) for more reliable estimates.
  • Not using n_jobs=-1 — Random Forest is embarrassingly parallelizable. Without it, training 200 trees sequentially is 4–8× slower than needed.
  • Fixing n_estimators at an arbitrary round number like 100 — plot the OOB curve and stop where it flattens, not at a number picked in advance.
Interview Insights
Q: Random Forest vs Gradient Boosting — which to choose?
A: Random Forest: parallel training (faster), more robust to noise, less hyperparameter tuning, better when features are noisy or data is limited. Gradient Boosting (XGBoost/LightGBM): typically higher accuracy on structured/tabular data, sequential so slower, more sensitive to hyperparameters, prone to overfit without careful tuning. In Kaggle competitions, XGBoost/LightGBM almost always win on tabular data. In production with limited tuning time, Random Forest is safer. Start with RF, then try GBM if you need more performance.
Q: What is the bias-variance decomposition of Random Forest?
A: Each individual tree has low bias (fully grown trees can memorize data) but high variance (sensitive to training data). Averaging n trees keeps bias the same but reduces variance by factor ~n (if uncorrelated). The feature randomization in RF ensures low correlation between trees, maximizing the variance reduction. So RF = low bias + dramatically reduced variance = excellent generalization.
Summary — Ensemble Methods
  • Bagging trains parallel models on bootstrap samples → reduces variance via averaging
  • Random Forest = Bagging + random feature subsets at each split → maximum tree diversity
  • OOB score gives free cross-validated performance estimate using left-out samples
  • More trees never hurt RF — only diminishing returns after ~200 trees
  • Feature importances reveal predictive power but can be misleading for correlated features
🏋️ Final Practice Task

Build a complete Random Forest pipeline on the California Housing dataset: (1) Feature engineering, (2) RF with OOB score, (3) Permutation importance to validate feature importances, (4) GridSearchCV for max_features and min_samples_leaf, (5) Compare final RF to single DTree and Ridge — report RMSE, R², and training time for each.

🏆 Course Complete — Final Master Summary

🏆

Complete ML Masterclass — Final Summary

✅ All 77 Topics Complete
Full Syllabus Mastered
Phase Topics Key Takeaway
Basics (1–3) ML Introduction, Types, Roadmap Know the 7-step pipeline cold. Supervised/Unsupervised distinction.
Data Preprocessing (4–20) Variables, Cleaning, Missing, Encoding, Outliers, Scaling, FunctionTransformer 80% of ML work. Data quality beats algorithm choice.
Feature Selection (21–22) Backward/Forward Selection Wrapper methods are model-aware but O(n²). Use filter first.
Model Training (23) Train-Test Split Always split BEFORE any preprocessing. Stratify for classification.
Regression (24–33) Linear → Poly → Ridge/Lasso Always plot residuals. Regularization requires scaling. Use AdjR².
Classification (34–38) Logistic, Binary/Multi-class Sigmoid outputs probability. Threshold tuning = precision/recall tradeoff.
Evaluation (39–41) Confusion Matrix, F1, Imbalanced Accuracy is useless on imbalanced data. Use F1, ROC-AUC, lift.
Naive Bayes (42–43) Theory + Practical Best for text. Naive independence assumption works surprisingly well.
Advanced Models (44–53) DTree, KNN, SVM DTree: interpretable but high variance. KNN: lazy, scale-sensitive. SVM: margin maximization, kernel trick.
Model Tuning (54–57) Hyperparams, GridSearch, CV Nested CV is gold standard. Always tune on train set, evaluate on test.
Unsupervised (58–65) K-Means, Hierarchical, DBSCAN, Silhouette K-Means: fast, spherical. DBSCAN: arbitrary shapes, finds outliers. Silhouette: no labels needed.
Association (66–70) Apriori, FP-Growth FP-Growth = production standard. Lift > 1 = genuine rule.
Ensemble (71–77) Voting, Bagging, Random Forest RF = best general-purpose model. Diversity + averaging = power. OOB = free CV.
The 10 Rules Every ML Engineer Must Know

1. Always split data BEFORE any preprocessing step — prevent data leakage.
2. Fit transformers on train data only. Transform train AND test with the same fit.
3. Plot residuals — R² alone is never enough to validate a regression model.
4. For imbalanced data: never use accuracy. Use F1, ROC-AUC, G-Mean.
5. Scale features before: KNN, SVM, Logistic Regression, Ridge, Lasso, PCA.
6. Decision Trees are interpretable but high variance — use Random Forest for production.
7. Hyperparameter tuning belongs in the training set — nested CV for unbiased evaluation.
8. For clustering, always use Silhouette analysis — never just pick k arbitrarily.
9. More trees in Random Forest never overfit — stop at ~200 when OOB curve flattens.
10. Ensemble diversity matters more than individual model accuracy.

Top 20 FAANG Interview Questions Covered
# Question Topic
1 Walk me through an end-to-end ML project Topic 3: Roadmap
2 What is data leakage and how do you prevent it? Topic 23: Train-Test Split
3 Explain the bias-variance tradeoff Topic 32: Ridge/Lasso
4 Why use Adjusted R² over R²? Topic 30
5 Lasso vs Ridge — mathematical difference and when to choose Topic 32
6 Why is logistic regression called regression? Topic 35
7 Precision vs Recall tradeoff — give a business example Topic 40
8 Why does Naive Bayes work despite the independence assumption? Topic 42
9 Gini Impurity vs Entropy — when does it matter? Topic 44
10 What is the curse of dimensionality and how does it affect KNN? Topic 48
11 What are support vectors and why do only they define the boundary? Topic 51
12 Parameters vs Hyperparameters — with examples Topic 54
13 GridSearch vs RandomizedSearch — when to use each? Topic 55
14 What is OOB error in Random Forest? Topic 76
15 K-Means limitations and how DBSCAN addresses them Topics 59/63
16 What makes ensemble methods work mathematically? Topic 71
17 Random Forest vs Gradient Boosting — when to choose? Topic 77
18 What is the Apriori principle? Topic 67
19 Why is Soft Voting better than Hard Voting? Topic 72
20 How do you handle imbalanced datasets? Topic 41
What to Study Next (Beyond This Syllabus)
Topic Why It Matters Where to Start
XGBoost / LightGBM Win 90% of Kaggle tabular competitions xgboost.readthedocs.io
Neural Networks / Deep Learning Images, text, audio, sequences fast.ai, PyTorch
Feature Engineering Often the highest ROI activity in ML Kaggle competitions
Model Explainability (SHAP) Required in regulated industries shap.readthedocs.io
ML Pipelines (sklearn Pipeline) Production-ready, prevents leakage sklearn Pipeline docs
Time Series (ARIMA, Prophet) Finance, forecasting, IoT statsmodels, Prophet
NLP (TF-IDF, Transformers) Text classification, sentiment, chatbots HuggingFace Transformers
MLOps (MLflow, Docker) Deploy and monitor models in production mlflow.org
🚀 Capstone Challenge — Build an AI Startup MVP

Using everything you've learned, build a complete end-to-end ML system:

  1. Pick a real dataset from Kaggle (tabular, CSV format)
  2. EDA: distribution plots, correlation heatmap, missing value audit
  3. Preprocessing Pipeline: imputation, encoding, scaling — all in sklearn Pipeline
  4. Baseline model: Linear/Logistic Regression — establish a benchmark
  5. Advanced models: Random Forest, XGBoost — GridSearchCV on train only
  6. Evaluation: confusion matrix (if classification), residual plot (if regression), CV scores, test set final evaluation
  7. Explainability: feature importance + SHAP values
  8. Deployment: wrap best model in a Flask API that accepts JSON and returns predictions

Complete this and you have a portfolio project worthy of FAANG interviews. You now know enough to build real ML systems.