Machine Learning
Masterclass
Zero to Advanced — 77 Topics · Deep Theory + Practical Python · FAANG-level Interview Prep
Phase 1 — Basics
ML Course Introduction
This course teaches Machine Learning from first principles to production-level systems. It follows the exact order used in real-world ML pipelines: data → preprocessing → modeling → evaluation → deployment thinking.
| Pillar | What it means | Example |
|---|---|---|
| Data | The raw input — the lifeblood of ML | CSV of house prices |
| Algorithm | The "brain" that finds patterns | Linear Regression |
| Prediction | Output after the model learns | Predicted price: $450k |
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

# Assumed already known: numpy, pandas, matplotlib, seaborn
# This course focuses on: sklearn + ML theory + real projects
print("scikit-learn version:", sklearn.__version__)

# Quick dataset sanity check
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
print(df.head())
print("Shape:", df.shape)
```
- ML is about learning patterns from data, not writing explicit rules
- This course follows the real-world pipeline order
- You already know numpy/pandas/matplotlib — we skip those basics
- scikit-learn is the primary ML library throughout
What is Machine Learning?
Machine Learning is a subset of AI where algorithms learn from data to make predictions or decisions — without being explicitly programmed for each scenario. Formally: a computer program "learns" from experience E with respect to task T if performance measure P improves with E (Tom Mitchell, 1997).
A rule-based system for predicting house prices needs explicit rules: IF 3 bedrooms AND near school AND 1500 sqft THEN price = $400k. This is brittle — it can't generalize.
An ML model looks at 100,000 past sales, finds the pattern between features and prices automatically, and
predicts prices for houses it's never seen. That's learning.
| Type | Definition | Key Word | Example |
|---|---|---|---|
| Supervised | Learns from labeled data (input→output pairs) | "Teacher present" | Spam detection, house prices |
| Unsupervised | Finds hidden patterns in unlabeled data | "No teacher" | Customer segmentation, anomaly detection |
| Semi-supervised | Mix of labeled + unlabeled data | "Partial teacher" | Image labeling at scale |
| Reinforcement | Agent learns by reward/punishment signals | "Trial and error" | Game playing, robotics |
| Sub-type | Output Type | Examples |
|---|---|---|
| Regression | Continuous number | House price, temperature, salary |
| Classification | Category/label | Spam/not spam, disease/no disease |
```python
import numpy as np
from sklearn.linear_model import LinearRegression  # Supervised
from sklearn.cluster import KMeans                 # Unsupervised

# ── SUPERVISED: We have labels ──────────────────────────
# X = house size (sqft), y = price ($1000s)
X_sup = np.array([[1000], [1500], [2000], [2500], [3000]])
y = np.array([200, 280, 350, 430, 500])  # ← LABELS

model_sup = LinearRegression()
model_sup.fit(X_sup, y)
pred = model_sup.predict([[1800]])
print(f"Supervised — Predicted price for 1800 sqft: ${pred[0]:.0f}k")

# ── UNSUPERVISED: No labels, find natural groups ─────────
# Customer data: [age, spending_score] — no target!
X_uns = np.array([[22, 80], [25, 75], [45, 20],
                  [48, 15], [35, 50], [40, 45]])
model_uns = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = model_uns.fit_predict(X_uns)  # ← Discovers groups
print("Unsupervised — Cluster assignments:", clusters)
# Output: [0 0 1 1 0 0] or similar — groups found automatically!
```
- Treating regression as classification (or vice versa) — always ask: "Is my output a number or a category?"
- Assuming more data always = better model — quality > quantity
- Ignoring the problem type — not all ML problems need the same approach
- ML = algorithms that learn patterns from data without explicit programming
- 3 main types: Supervised (labeled), Unsupervised (unlabeled), Reinforcement (rewards)
- Supervised splits into: Regression (continuous output) and Classification (categorical output)
- The key question always is: "What is my output type? Do I have labels?"
For each scenario below, identify: (1) Type of ML and (2) Supervised sub-type if applicable:
- Predicting tomorrow's stock price
- Grouping news articles by topic automatically
- Detecting whether a tumor is malignant
- Teaching a robot to walk
Answers: Supervised (Regression) | Unsupervised | Supervised (Classification) | Reinforcement
Complete ML Roadmap
The ML Roadmap is the end-to-end workflow that every ML project follows — from raw data to a deployed model. Understanding this pipeline is fundamental; every topic in this course maps to a step in this process.
```python
"""
THE COMPLETE ML PIPELINE
─────────────────────────────────────────────────────
STEP 1: Problem Definition
    → What are we predicting? What's the metric of success?
STEP 2: Data Collection
    → SQL, APIs, web scraping, Kaggle, company databases
STEP 3: Exploratory Data Analysis (EDA)
    → Understand distributions, correlations, missing data
STEP 4: Data Preprocessing (MOST TIME IS SPENT HERE)
    → Clean missing values, encode categories, scale features,
      remove outliers, handle duplicates
STEP 5: Feature Engineering & Selection
    → Create new features, select the most informative ones
    → Techniques: Backward Elimination, Forward Selection
STEP 6: Model Training
    → Split data (train/test), select algorithm, fit model
    → Regression: Linear, Ridge, Lasso, Polynomial
    → Classification: Logistic, SVM, Trees, KNN, Naive Bayes
    → Clustering: K-Means, DBSCAN, Hierarchical
STEP 7: Evaluation & Tuning
    → Regression: R², MSE, RMSE
    → Classification: Accuracy, F1, ROC-AUC, Confusion Matrix
    → Tuning: GridSearchCV, Cross-Validation
[DEPLOY] → Flask/FastAPI, Docker, Cloud
─────────────────────────────────────────────────────
"""

# Time allocation in real-world ML projects:
time_allocation = {
    "Problem Definition": "5%",
    "Data Collection": "10%",
    "EDA": "15%",
    "Data Preprocessing": "40%",  # ← MOST work happens here
    "Modeling": "20%",
    "Evaluation & Tuning": "10%",
}
for step, pct in time_allocation.items():
    print(f"  {step:30s} → {pct}")
```
| Problem Type | Algorithm Choices | When to Use |
|---|---|---|
| Regression | Linear, Ridge, Lasso, Polynomial, SVR, Decision Tree | Continuous numeric output |
| Classification | Logistic, SVM, KNN, Naive Bayes, Decision Tree, Random Forest | Categorical output |
| Clustering | K-Means, DBSCAN, Hierarchical | No labels, find groups |
| Association | Apriori, FP-Growth | Market basket analysis |
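Apriori itself isn't bundled with scikit-learn (libraries such as mlxtend provide it). As a minimal, illustrative sketch of the idea behind association mining, here is a pure-Python count of frequently co-occurring item pairs over made-up baskets:

```python
from itertools import combinations
from collections import Counter

# Toy market-basket data (illustrative)
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]

# Count how often each item pair co-occurs: this is the "support"
# counting step that Apriori performs level by level
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 3  # keep pairs appearing in at least 3 of 5 baskets
frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent)
```

Real Apriori prunes the candidate space level by level using the support threshold; this sketch only shows the pair-counting step.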
- Every ML project follows the same 7-step pipeline
- Data preprocessing takes ~40% of total project time
- Choosing the wrong algorithm matters less than poor data quality
- Always define your evaluation metric BEFORE building models
Phase 2 — Data Preprocessing
Types of Variables
Variables (features/columns) in a dataset have different measurement scales. Knowing this determines which preprocessing steps and algorithms you can use. This is one of the most foundational concepts in statistics and ML.
| Type | Sub-type | Properties | Examples | Allowed Operations |
|---|---|---|---|---|
| Categorical (Qualitative) | Nominal | No order, just names | Color: Red/Blue/Green, Country | Mode, frequency count |
| Categorical (Qualitative) | Ordinal | Has order, no equal gaps | Rating: Low/Med/High, Education level | Mode, median, comparison |
| Numerical (Quantitative) | Discrete | Countable integers | Number of children, room count | All arithmetic |
| Numerical (Quantitative) | Continuous | Any real value in a range | Height, temperature, price | All arithmetic + calculus |
```python
import pandas as pd
import numpy as np

# Create a sample dataset with mixed variable types
df = pd.DataFrame({
    'age': [25, 32, 28, 45, 30],                                # Numerical - Discrete
    'salary': [50000.5, 72000.0, 63000.25, 95000.0, 68000.75],  # Numerical - Continuous
    'city': ['NY', 'LA', 'NY', 'Chicago', 'LA'],                # Categorical - Nominal
    'education': ['BSc', 'MSc', 'BSc', 'PhD', 'MSc'],           # Categorical - Ordinal
    'satisfied': [1, 0, 1, 1, 0]                                # Binary (special case)
})

# ── Step 1: Quick dtype overview ──────────────────────
print("--- dtypes ---")
print(df.dtypes)

# ── Step 2: Identify categorical vs numerical ─────────
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"\nCategorical columns: {categorical_cols}")
print(f"Numerical columns: {numerical_cols}")

# ── Step 3: Cardinality check (unique value count) ────
# High cardinality nominal → might need target encoding
# Low cardinality nominal  → safe for one-hot encoding
print("\n--- Cardinality (unique values per column) ---")
for col in df.columns:
    print(f"  {col:15s}: {df[col].nunique()} unique values")

# ── Step 4: Check continuous vs discrete numerics ─────
print("\n--- Numerical Analysis ---")
print(df[numerical_cols].describe())
```
- Treating ordinal as nominal: Encoding "Low/Med/High" as [0,1,2] loses order info if you one-hot encode it — use OrdinalEncoder instead
- Treating numeric-coded categoricals as numeric: ZIP codes, user IDs are stored as integers but are nominal — never use them as continuous features
- Ignoring binary variables: Binary (0/1) doesn't need any encoding but often needs balancing
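On the last point: a quick balance check for a binary column is value_counts(normalize=True). A small sketch with made-up data:

```python
import pandas as pd

# Illustrative binary target with a 90/10 imbalance
y = pd.Series([0] * 90 + [1] * 10, name="churned")

# Fraction of each class
ratios = y.value_counts(normalize=True)
print(ratios)  # 0 → 0.9, 1 → 0.1

# A simple rule of thumb: flag anything below a chosen minority threshold
if ratios.min() < 0.2:
    print("Imbalanced: consider class weights or resampling")
```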
Load the Titanic dataset (e.g., via seaborn.load_dataset('titanic')) and classify every column as: Nominal, Ordinal, Discrete, or Continuous. Then identify which need encoding and which need scaling.
Data Cleaning
Data cleaning is the process of detecting and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset. "Garbage in, garbage out" — no algorithm can save you from dirty data.
| Issue | Example | Fix |
|---|---|---|
| Missing values | NaN in age column | Impute or drop (Topics 6–9) |
| Duplicates | Same row appears twice | drop_duplicates() (Topic 18) |
| Outliers | Age = 999 | IQR/Z-Score (Topics 14–15) |
| Wrong data types | Price stored as "object" | astype() (Topic 19) |
| Inconsistent categories | "Male", "male", "M" all mean same | str.lower() + map() |
| Noise | Typos, extra spaces | str.strip(), str.replace() |
```python
import pandas as pd
import numpy as np

# Create dirty dataset for demo
data = {
    'name': ['Alice', 'Bob', 'Alice', 'Charlie', ' Dave ', None],
    'age': [25, 30, 25, 999, 28, 35],                             # 999 = outlier
    'gender': ['F', 'M', 'F', 'Male', 'm', 'F'],                  # inconsistent
    'salary': ['50000', '72000', '50000', '95000', None, '68000'],  # wrong dtype
}
df = pd.DataFrame(data)
print("=== DIRTY DATA ===")
print(df)
print()

# ── AUDIT FUNCTION: Run this at start of every project ──
def audit_dataframe(df):
    print("📊 SHAPE:", df.shape)
    print("\n📌 DTYPES:\n", df.dtypes)
    print("\n❌ MISSING VALUES:")
    missing = df.isnull().sum()
    print(missing[missing > 0])
    print(f"\n🔁 DUPLICATES: {df.duplicated().sum()}")
    print("\n📈 NUMERICAL STATS:\n", df.describe())

audit_dataframe(df)

# ── FIX 1: Strip whitespace from string columns ───────
df['name'] = df['name'].str.strip()

# ── FIX 2: Standardize categories ─────────────────────
# Normalize gender: 'Male'/'m'/'M' → 'male', 'F'/'female' → 'female'
df['gender'] = df['gender'].str.lower().map({
    'm': 'male', 'male': 'male',
    'f': 'female', 'female': 'female'
})

# ── FIX 3: Fix data types ─────────────────────────────
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')
# errors='coerce' → invalid values become NaN instead of crashing

# ── FIX 4: Replace impossible values with NaN ─────────
df.loc[df['age'] > 120, 'age'] = np.nan  # Age 999 → NaN

print("\n=== CLEANED DATA ===")
print(df)
```
- Always run an audit (shape, dtypes, nulls, duplicates) at project start
- Inconsistent categories are sneaky bugs — always standardize with str.lower()
- Use pd.to_numeric(errors='coerce') to safely convert mixed-type columns
- Replace domain-impossible values (age=999) with NaN before imputation
Missing Values — Concept & Detection
Missing values are data points that were not recorded or stored. They appear as NaN (Not a
Number), None, null, or empty strings in pandas. Most ML algorithms cannot
handle missing values — you must deal with them before training.
| Type | Abbreviation | Meaning | Example | Recommended Fix |
|---|---|---|---|---|
| Missing Completely at Random | MCAR | Missing for no reason — purely random | Survey respondent accidentally skipped a question | Safe to drop or impute |
| Missing at Random | MAR | Missing depends on OTHER observed variables | Males less likely to report salary than females | Impute using other features |
| Missing Not at Random | MNAR | Missing depends on the value itself | High earners not reporting salary | Model the missingness; risky to impute |
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# ── Create sample dataset with missing values ──────────
np.random.seed(42)
n = 100
df = pd.DataFrame({
    'age': np.random.choice([25, 30, 35, np.nan], n, p=[0.3, 0.3, 0.3, 0.1]),
    'salary': np.random.choice([50000, 60000, np.nan], n, p=[0.4, 0.4, 0.2]),
    'city': np.random.choice(['NY', 'LA', None], n, p=[0.4, 0.4, 0.2]),
    'score': np.random.randint(50, 100, n).astype(float),  # no missing
})

# ── Method 1: Count missing values ────────────────────
print("--- Missing Value Counts ---")
print(df.isnull().sum())

# ── Method 2: Percentage missing ──────────────────────
print("\n--- Missing Value % ---")
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
print(missing_pct)

# ── Method 3: Heatmap visualization ───────────────────
plt.figure(figsize=(8, 4))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title("Missing Value Heatmap\n(Yellow = Missing, Purple = Present)")
plt.tight_layout()
plt.show()

# ── Method 4: Threshold-based decision ────────────────
# Rule of thumb:
#   <5%   → usually safe to drop rows or impute simply
#   5-30% → impute carefully (mean/median/mode)
#   >30%  → consider dropping the COLUMN or using model-based imputation
#   >50%  → column is likely unusable
threshold = 30
cols_to_drop = missing_pct[missing_pct > threshold].index.tolist()
print(f"\nColumns with >{threshold}% missing (consider dropping): {cols_to_drop}")
```
- Imputing before train/test split: This causes data leakage — fit the imputer on train data only, then transform both
- Treating all missingness equally: MCAR → drop safely; MNAR → dropping biases your model
- Ignoring that "" (empty string) is not NaN: Always check with df[col].eq("").sum() too
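The empty-string pitfall above in a concrete sketch (toy data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"city": ["NY", "", "LA", "", None]})

# isnull() sees only None/NaN; the empty strings are invisible to it
print(df["city"].isnull().sum())  # 1
print(df["city"].eq("").sum())    # 2

# Convert empty strings to NaN so all missingness is counted together
df["city"] = df["city"].replace("", np.nan)
print(df["city"].isnull().sum())  # 3
```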
- Missing values appear as NaN/None — most ML algorithms reject them
- 3 types: MCAR (safe), MAR (impute with other features), MNAR (risky)
- Always measure % missing before deciding to drop or impute
- Columns with >50% missing are usually not worth keeping
Handling Missing Values — Dropping
Dropping removes rows or columns that contain missing values. It's the simplest approach but must be used carefully — dropping too aggressively shrinks your dataset and can introduce bias.
Drop ROWS when: Missing % is small (<5%), data is MCAR, and losing rows doesn't bias results significantly.
Drop COLUMNS when: Missing % is very high (>50%), column is non-critical, or column is duplicate information.
NEVER drop when: The missingness is MNAR (high earners hiding salary) — you'd systematically bias your model.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, np.nan],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, np.nan, np.nan, np.nan, np.nan],  # 80% missing
})
print("Original:\n", df)

# ── Option 1: Drop any row with ANY missing value ─────
df1 = df.dropna()  # how='any' is the default
print("\ndropna() — rows with any NaN removed:\n", df1)

# ── Option 2: Drop rows where ALL values are missing ──
df2 = df.dropna(how='all')
print("\ndropna(how='all') — only fully-empty rows removed:\n", df2)

# ── Option 3: Drop rows where specific columns have NaN
df3 = df.dropna(subset=['A', 'B'])  # Only care about A and B
print("\ndropna(subset=['A','B']):\n", df3)

# ── Option 4: Drop columns with too many missing values
threshold = 0.5  # Keep columns with at least 50% non-missing values
df4 = df.dropna(axis=1, thresh=int(threshold * len(df)))
print("\nDropped columns with >50% missing:\n", df4)
# C is dropped (80% missing), A and B are kept

# ── Option 5: Threshold-based row dropping ────────────
# Keep rows that have at least 2 non-NaN values
df5 = df.dropna(thresh=2)
print("\ndropna(thresh=2) — rows with at least 2 valid values:\n", df5)

# ── Best Practice: Track what you dropped ─────────────
original_size = len(df)
cleaned = df.dropna()
dropped_count = original_size - len(cleaned)
print(f"\nDropped {dropped_count}/{original_size} rows ({dropped_count/original_size*100:.1f}%)")
```
- Using dropna() on the full dataset before train/test split — always split first
- Dropping rows when >20% of data is lost — consider imputation instead
- Forgetting to reset the index after dropping — use .reset_index(drop=True)
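A minimal sketch of the safe ordering from the first point: split first, then drop within each split, then reset the index. The toy data is made up:

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "feature": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0],
    "target":  [0, 1, 0, 1, 0, 1, 0, 1],
})

# 1) Split FIRST, so no information flows between train and test
train, test = train_test_split(df, test_size=0.25, random_state=42)

# 2) Drop rows with missing values inside each split independently,
#    and reset the index so later positional operations don't misalign
train = train.dropna().reset_index(drop=True)
test = test.dropna().reset_index(drop=True)

print(len(train), len(test))
```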
Handling Missing Values — Categorical Imputation
Imputation means filling in missing values with estimated substitutes rather than dropping them. For categorical variables, common strategies are filling with the mode (most frequent value) or a special "Unknown" / "Missing" category.
| Strategy | For what type? | When to use |
|---|---|---|
| Mode (most frequent) | Categorical | Data is roughly uniform, small % missing |
| "Unknown" / "Missing" label | Categorical | When absence itself is informative (MNAR) |
| Mean imputation | Numerical | Data is roughly symmetric, no major outliers |
| Median imputation | Numerical | Data is skewed or has outliers (recommended over mean) |
| KNN imputation | Both | When other features can predict the missing value |
| Forward/Backward fill | Time series | Sequential/temporal data |
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'city': ['NY', 'LA', None, 'NY', None, 'LA', 'NY'],
    'product': ['A', None, 'B', 'A', 'B', None, 'A'],
    'salary': [50000, np.nan, 60000, np.nan, 55000, 70000, np.nan],
})
print("Before:\n", df)

# ── Strategy 1: Fill categorical with Mode ────────────
city_mode = df['city'].mode()[0]  # .mode() returns a Series, take [0]
df_mode = df.copy()
df_mode['city'] = df_mode['city'].fillna(city_mode)
print(f"\nMode imputation for city (mode='{city_mode}'):\n", df_mode['city'])

# ── Strategy 2: Fill with "Unknown" (when missing is informative) ──
df_unk = df.copy()
df_unk['city'] = df_unk['city'].fillna('Unknown')
df_unk['product'] = df_unk['product'].fillna('Unknown')
print("\nUnknown imputation:\n", df_unk[['city', 'product']])

# ── Numerical: Mean vs Median ─────────────────────────
print("\n--- Numerical Imputation ---")
df_num = df.copy()
# Mean — affected by outliers
df_num['salary_mean_imp'] = df_num['salary'].fillna(df_num['salary'].mean())
# Median — robust to outliers (PREFERRED for skewed data)
df_num['salary_median_imp'] = df_num['salary'].fillna(df_num['salary'].median())
print(df_num[['salary', 'salary_mean_imp', 'salary_median_imp']])

# ── Group-wise imputation (ADVANCED + POWERFUL) ───────
# Fill salary with the median salary for that CITY group
df['salary_group_imp'] = df.groupby('city')['salary'].transform(
    lambda x: x.fillna(x.median())
)
print("\nGroup-wise imputation:\n", df[['city', 'salary', 'salary_group_imp']])
```
- Using mean for skewed numerical data — always prefer median for salary, prices, counts
- Imputing with global statistics when group-wise would be more accurate
- Forgetting to fit imputation on train set only — never use test set statistics
Handling Missing Values — Scikit-Learn Imputers
Scikit-learn provides production-ready imputer classes that follow the fit()/transform() API —
ensuring imputation parameters are learned from training data only, preventing data leakage.
```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # must import first
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = pd.DataFrame({
    'age': [25, np.nan, 35, 28, np.nan, 45, 22, 30],
    'salary': [np.nan, 60000, 75000, np.nan, 55000, 90000, 48000, 65000],
    'score': [80, 90, np.nan, 70, 85, np.nan, 95, 88],
})

# ── CRITICAL: Split BEFORE imputation ─────────────────
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

# ── SimpleImputer ─────────────────────────────────────
# strategy options: 'mean', 'median', 'most_frequent', 'constant'
si = SimpleImputer(strategy='median')
si.fit(X_train)                     # Learn medians from TRAIN only
X_train_si = si.transform(X_train)  # Fill train
X_test_si = si.transform(X_test)    # Fill test using TRAIN medians (no leakage)
print("SimpleImputer (median) — train statistics:", si.statistics_)

# ── KNNImputer ────────────────────────────────────────
# Uses K nearest neighbors to estimate missing values
# More accurate than SimpleImputer but slower
knn_imp = KNNImputer(n_neighbors=2)
knn_imp.fit(X_train)
X_train_knn = knn_imp.transform(X_train)
X_test_knn = knn_imp.transform(X_test)
print("\nKNNImputer result (first row):", X_train_knn[0])

# ── IterativeImputer (MICE) ───────────────────────────
# Regresses each feature with missing values on all other features
# Most powerful but computationally expensive
iter_imp = IterativeImputer(max_iter=10, random_state=42)
iter_imp.fit(X_train)
X_train_iter = iter_imp.transform(X_train)
print("\nIterativeImputer result (first row):", X_train_iter[0])

# ── When to use which? ────────────────────────────────
# SimpleImputer    → fast, simple, good for numerical + categorical
# KNNImputer       → better for correlated features, medium datasets
# IterativeImputer → best quality, expensive; use on small/medium data
```
- Calling fit_transform() on both train and test — only fit on train
- Using KNNImputer on large datasets without considering its O(n²) memory and compute cost
- Always use sklearn imputers in production — they prevent data leakage
- SimpleImputer for quick work; KNNImputer for correlated data; IterativeImputer for best quality
- The golden rule: fit on train data, transform both train and test
One Hot Encoding & Dummy Variables
One Hot Encoding (OHE) converts a nominal categorical variable into multiple binary (0/1) columns — one per unique category. This allows ML algorithms (which expect numbers) to use categorical data without imposing any false ordering.
If we encode Color as: Red=1, Blue=2, Green=3 — the model thinks Green > Blue > Red. That's wrong! Colors have no numeric relationship.
OHE creates: [is_Red, is_Blue, is_Green] = [1,0,0], [0,1,0], [0,0,1] — now there's no false ordering.
If you have 3 colors [Red, Blue, Green], you only need 2 dummy variables. The 3rd is perfectly predictable from the other two (if not Red and not Blue → Green). Including all 3 causes multicollinearity. Solution: drop_first=True
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'size': ['S', 'M', 'L', 'M', 'S'],
    'price': [10, 20, 30, 15, 25]
})

# ── Method 1: pd.get_dummies (fast for EDA) ───────────
# drop_first=True avoids the dummy variable trap
df_dummies = pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int)
print("pd.get_dummies (drop_first=True):")
print(df_dummies)
# Creates: color_Green, color_Red (Blue is the dropped reference)

# ── Method 2: sklearn OneHotEncoder (production use) ──
# MUST use this in pipelines to avoid data leakage
X = df[['color', 'size']]
X_train, X_test = train_test_split(X, test_size=0.4, random_state=42)

ohe = OneHotEncoder(
    drop='first',            # Avoid dummy trap
    sparse_output=False,     # Return dense array
    handle_unknown='ignore'  # Don't crash on unseen categories in test
)
# Fit on TRAIN only — learn the categories from training data
ohe.fit(X_train)
X_train_enc = ohe.transform(X_train)
X_test_enc = ohe.transform(X_test)

print("\nsklearn OHE categories learned:", ohe.categories_)
print("Feature names:", ohe.get_feature_names_out())
print("Encoded train:\n", X_train_enc)
```
- Not using drop_first=True — causes multicollinearity in linear models
- Using pd.get_dummies in production — it doesn't handle unseen categories in test data
- OHE on high-cardinality columns (e.g., ZIP codes with 10,000 values) — use Target Encoding instead
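Target Encoding, mentioned in the last point, replaces each category with a target statistic (typically the mean) learned from training data only. A minimal sketch; the column and values are made up:

```python
import pandas as pd

train = pd.DataFrame({
    "zip":   ["10001", "10001", "90210", "90210", "60601"],
    "price": [300, 340, 900, 860, 500],
})
test = pd.DataFrame({"zip": ["10001", "60601", "99999"]})  # 99999 is unseen

# Mean target per category, learned from TRAIN only
means = train.groupby("zip")["price"].mean()
global_mean = train["price"].mean()

train["zip_enc"] = train["zip"].map(means)
# Unseen categories in test fall back to the global mean
test["zip_enc"] = test["zip"].map(means).fillna(global_mean)

print(train)
print(test)
```

In practice, plain mean encoding can overfit rare categories; libraries such as category_encoders add smoothing and cross-fold schemes to mitigate this.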
Label Encoding
Label Encoding assigns an integer to each unique category: "cat" → 0, "dog" → 1, "rabbit" → 2. It's compact but introduces an artificial ordering. Only use it for the target variable or for tree-based models that don't assume ordering.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'animal': ['cat', 'dog', 'rabbit', 'cat', 'dog'],
    'target': ['spam', 'ham', 'spam', 'ham', 'spam']
})

le = LabelEncoder()

# For TARGET variable (y) — safe use case
df['target_encoded'] = le.fit_transform(df['target'])
print("Target encoding:")
print(df[['target', 'target_encoded']])
print("Classes:", le.classes_)  # Shows mapping: ham=0, spam=1

# For FEATURES — only safe with tree-based models
df['animal_encoded'] = le.fit_transform(df['animal'])
print("\nAnimal encoding (risky for linear models!):")
print(df[['animal', 'animal_encoded']])
# cat=0, dog=1, rabbit=2
# Linear model would think rabbit(2) > dog(1) > cat(0) — WRONG!

# Decode back from integers
print("\nDecode: [0,2,1] →", le.inverse_transform([0, 2, 1]))
```
- Using LabelEncoder on nominal features with linear models — the model thinks 0 < 1 < 2, but "cat < dog < rabbit" has no meaning. Use OHE for nominal features in linear models.
- LabelEncoder is fine for: target variables, and nominal features in tree models (Random Forest, XGBoost handle it fine)
- Label encoding = integer per category. Simple but imposes ordering.
- Safe for: target variables, tree-based model features
- Unsafe for: nominal features in linear/distance-based models
Ordinal Encoding
Ordinal Encoding assigns integers to categories preserving their natural order. Unlike LabelEncoder (which assigns arbitrary order), OrdinalEncoder lets you specify the order: Low=0, Medium=1, High=2.
```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'education': ['BSc', 'PhD', 'HSC', 'MSc', 'BSc'],
    'size': ['M', 'XL', 'S', 'L', 'XS'],
})

# ── WRONG: LabelEncoder assigns arbitrary order ───────
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
print("LabelEncoder (arbitrary):", le.fit_transform(df['education']))
# Could give: BSc=0, HSC=1, MSc=2, PhD=3 — alphabetical, not education level!

# ── RIGHT: OrdinalEncoder with explicit ordering ──────
# You tell it the correct order for each column
enc = OrdinalEncoder(categories=[
    ['HSC', 'BSc', 'MSc', 'PhD'],  # education order
    ['XS', 'S', 'M', 'L', 'XL']    # size order
])
encoded = enc.fit_transform(df)

df_enc = df.copy()
df_enc['education_enc'] = encoded[:, 0]  # HSC=0, BSc=1, MSc=2, PhD=3
df_enc['size_enc'] = encoded[:, 1]       # XS=0, S=1, M=2, L=3, XL=4
print("\nOrdinalEncoder (correct order):")
print(df_enc)
```
- Use OrdinalEncoder when categories have a meaningful order (Low/Med/High, education levels)
- Always specify the category order explicitly — don't let sklearn guess
- Use OHE for nominal (unordered) categories, OrdinalEncoder for ordered ones
Outliers — Concept & Handling
Outliers are data points that differ significantly from other observations. They can be genuine (a CEO's salary in an employee dataset) or errors (age = 999). Outliers distort statistical measures and can severely degrade model performance.
| Type | Description | Example | Action |
|---|---|---|---|
| Point Outlier | Single value far from rest | Income of $10M in a $50k dataset | Cap, remove, or model separately |
| Contextual Outlier | Normal globally, abnormal in context | 30°C in winter (not summer) | Context-aware handling |
| Collective Outlier | A group of values that are collectively abnormal | 5 identical transactions in 1 second | Anomaly detection |
| Model | Sensitivity to Outliers |
|---|---|
| Linear Regression | Very sensitive — outliers pull the regression line |
| Logistic Regression | Moderately sensitive |
| Decision Trees | Less sensitive — splits are threshold-based |
| Random Forest | Robust — averages many trees |
| SVM | Sensitive — support vectors can shift |
| KNN | Very sensitive — distances distorted |
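The contrast between the table's most and least sensitive models can be sketched on synthetic data: inject one extreme point and see how far each model's prediction moves at a location far from the outlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = np.arange(1, 21, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + rng.normal(0, 1, 20)  # clean, roughly linear data

y_out = y.copy()
y_out[-1] = 500.0  # inject one extreme outlier at x=20

# Fit each model on clean data and on data with the outlier
lin_clean = LinearRegression().fit(X, y)
lin_dirty = LinearRegression().fit(X, y_out)
tree_clean = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_dirty = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y_out)

# How much does the prediction at x=10 (far from the outlier) move?
x_probe = np.array([[10.0]])
lin_shift = abs(lin_dirty.predict(x_probe)[0] - lin_clean.predict(x_probe)[0])
tree_shift = abs(tree_dirty.predict(x_probe)[0] - tree_clean.predict(x_probe)[0])
print(f"Linear Regression shift at x=10: {lin_shift:.1f}")
print(f"Decision Tree shift at x=10:     {tree_shift:.1f}")
# The regression line gets dragged toward the outlier; the tree
# typically isolates the outlier in its own leaf, so x=10 barely moves
```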
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)
data = np.concatenate([
    np.random.normal(50, 10, 95),  # Normal data
    [150, -30, 200, -50, 180]      # Outliers injected
])
df = pd.DataFrame({'salary': data})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Boxplot — best for outlier visualization
axes[0].boxplot(df['salary'])
axes[0].set_title('Boxplot\n(whiskers = 1.5×IQR, dots = outliers)')

# 2. Histogram — shows skew caused by outliers
axes[1].hist(df['salary'], bins=30, color='steelblue')
axes[1].set_title('Histogram')

# 3. Scatter plot — shows position of outliers
axes[2].scatter(range(len(df)), df['salary'], alpha=0.5)
axes[2].set_title('Scatter Plot')

plt.tight_layout()
plt.show()

# Describe to spot issues
print(df.describe())
# If max >> 75th percentile → likely outliers
# If min << 25th percentile → likely outliers
```
- Outliers = data points far from the rest — genuine or errors
- Linear models and KNN are most sensitive; tree-based models are more robust
- Always visualize first (boxplot, histogram) before deciding how to handle them
- Two detection methods: IQR (robust) and Z-Score (assumes normality)
IQR Method for Outlier Detection
The Interquartile Range (IQR) method is a robust, non-parametric outlier detection technique. IQR = Q3 − Q1. Points beyond 1.5×IQR from the quartiles are flagged as outliers. It doesn't assume any distribution.
Lower Bound = Q1 − 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR
Outlier if: value < Lower Bound OR value > Upper Bound
```python
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'salary': list(np.random.normal(60000, 10000, 95)) +
              [200000, -5000, 180000, -10000, 190000]
})

# ── Step 1: Calculate IQR ─────────────────────────────
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Q1={Q1:.0f}, Q3={Q3:.0f}, IQR={IQR:.0f}")
print(f"Lower bound: {lower_bound:.0f}")
print(f"Upper bound: {upper_bound:.0f}")

# ── Step 2: Identify outliers ─────────────────────────
is_outlier = (df['salary'] < lower_bound) | (df['salary'] > upper_bound)
print(f"\nOutliers found: {is_outlier.sum()}")
print(df[is_outlier])

# ── Step 3: Choose handling strategy ──────────────────
# STRATEGY A: Remove outliers
df_removed = df[~is_outlier].copy()
print(f"\nAfter removal: {len(df_removed)} rows (was {len(df)})")

# STRATEGY B: Cap/Winsorize (clip to bounds)
# This preserves sample size but limits extreme values
df_capped = df.copy()
df_capped['salary'] = df_capped['salary'].clip(lower=lower_bound, upper=upper_bound)
print(f"\nAfter capping: max={df_capped['salary'].max():.0f}, min={df_capped['salary'].min():.0f}")

# STRATEGY C: Replace with NaN → then impute
df_nan = df.copy()
df_nan.loc[is_outlier, 'salary'] = np.nan
print(f"\nAfter NaN replacement: {df_nan['salary'].isnull().sum()} missing values")

# ── Function to apply IQR to all numerical columns ────
def remove_outliers_iqr(df, multiplier=1.5):
    df_clean = df.copy()
    num_cols = df.select_dtypes(include=[np.number]).columns
    for col in num_cols:
        # Bounds from the original data; mask built on the current frame
        # so row filtering stays aligned across iterations
        Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        IQR = Q3 - Q1
        mask = (df_clean[col] >= Q1 - multiplier * IQR) & \
               (df_clean[col] <= Q3 + multiplier * IQR)
        df_clean = df_clean[mask]
    return df_clean.reset_index(drop=True)
```
Z-Score Method for Outlier Detection
The Z-Score measures how many standard deviations a data point is from the mean. Points with |Z| > 3 are typically considered outliers. Assumes the data is approximately normally distributed.
Z = (x − mean) / std
Outlier if |Z| > 3 (roughly 0.3% of data in a normal distribution)
```python
import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(42)
df = pd.DataFrame({
    'height': list(np.random.normal(170, 10, 95)) + [300, 50, 280, -10, 320]
})

# ── Method 1: Manual Z-Score ──────────────────────────
df['z_score'] = (df['height'] - df['height'].mean()) / df['height'].std()
threshold = 3
outliers_z = df[df['z_score'].abs() > threshold]
print(f"Z-Score outliers (|z|>{threshold}): {len(outliers_z)}")
print(outliers_z)

# ── Method 2: scipy zscore (same result) ──────────────
z_scores = np.abs(stats.zscore(df['height']))
df_clean = df[z_scores <= threshold]
print(f"\nAfter removal: {len(df_clean)} rows")

# ── Modified Z-Score (robust, uses median) ────────────
# Better than the standard Z-score when outliers are present
median = df['height'].median()
mad = (df['height'] - median).abs().median()  # Median Absolute Deviation
modified_z = 0.6745 * (df['height'] - median) / mad
df_mod_clean = df[modified_z.abs() <= 3.5]
print(f"\nModified Z-Score clean: {len(df_mod_clean)} rows")

# ── IQR vs Z-Score comparison ─────────────────────────
# IQR:        Non-parametric, robust, works for any distribution
# Z-Score:    Parametric, assumes normality, sensitive to extreme outliers
# Modified Z: Best of both — robust + uses normal-distribution logic
```
- Z-Score = (x - mean) / std — outlier if |z| > 3
- Assumes normality — use IQR for skewed data
- Modified Z-Score (using MAD) is more robust — recommended over standard Z-Score
Feature Scaling — Standardization
Standardization (Z-score normalization) transforms features so they have mean = 0 and standard deviation = 1. This allows algorithms that use distances or gradients to treat all features equally regardless of their original scale.
Result: mean = 0, std = 1
Consider: age (20–60) and salary (30,000–200,000). In KNN, distance is dominated by salary simply because it has larger numbers. The model effectively ignores age. Standardization puts both on equal footing.
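To see this numerically, here is a minimal sketch (hypothetical age/salary values) comparing Euclidean distances before and after standardization. Unscaled, a $1,000 salary gap drowns out a 35-year age gap; scaled, age contributes meaningfully:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two people: similar salary, very different age
a = np.array([25, 60000.0])   # [age, salary]
b = np.array([60, 61000.0])

# Unscaled Euclidean distance — dominated by the salary gap (1000 vs 35)
d_raw = np.linalg.norm(a - b)
print(f"Raw distance: {d_raw:.1f}")

# After standardization (fit on a small illustrative sample)
X = np.array([[25, 60000], [60, 61000], [40, 80000], [30, 45000]], dtype=float)
Xs = StandardScaler().fit_transform(X)
d_scaled = np.linalg.norm(Xs[0] - Xs[1])
print(f"Scaled distance: {d_scaled:.2f}")  # age difference now dominates, as it should
```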
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'age': [25, 35, 45, 28, 52],
    'salary': [50000, 75000, 95000, 60000, 110000],
    'score': [80, 90, 70, 85, 95],
})
X = df.values
X_train, X_test = train_test_split(X, test_size=0.4, random_state=42)

# ── StandardScaler ────────────────────────────────────
scaler = StandardScaler()
# Fit on train, transform both — NEVER fit on test!
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Original train data:")
print(pd.DataFrame(X_train, columns=df.columns))
print("\nStandardized train data:")
print(pd.DataFrame(X_train_scaled, columns=df.columns).round(3))
print(f"\nMean after scaling (should be ~0): {X_train_scaled.mean(axis=0).round(3)}")
print(f"Std after scaling (should be ~1): {X_train_scaled.std(axis=0).round(3)}")

# ── Access scaler parameters ──────────────────────────
print(f"\nLearned means: {scaler.mean_.round(2)}")
print(f"Learned stds: {scaler.scale_.round(2)}")

# ── Inverse transform ─────────────────────────────────
X_back = scaler.inverse_transform(X_train_scaled)
print("\nInverse transform (back to original):")
print(pd.DataFrame(X_back, columns=df.columns).round(1))
| Scaling Method | Use When | Not when |
|---|---|---|
| StandardScaler (Standardization) | Algorithms that assume roughly Gaussian inputs (SVM, Linear/Logistic Regression, PCA); tolerates mild outliers better than MinMax | You need values bounded in [0,1] |
| MinMaxScaler (Normalization) | Neural networks; algorithms sensitive to magnitude; when you need bounded [0,1] range | Data has significant outliers |
- Fitting scaler on the full dataset before train/test split — causes data leakage
- Scaling the target variable y (for regression) — don't, unless you undo it with inverse_transform
- Forgetting to scale test data with the SAME scaler that was fit on train
Feature Scaling — Normalization (MinMaxScaler)
Normalization (Min-Max Scaling) scales all values to the range [0, 1] (or any specified range). It preserves the original distribution shape but compresses it into the specified range.
Result: All values between 0 and 1
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.model_selection import train_test_split

# Dataset with an outlier to show differences
data = pd.DataFrame({'salary': [30000, 45000, 50000, 60000, 200000]})  # 200000 is an outlier
X_train, X_test = train_test_split(data, test_size=0.4, random_state=42)

# ── MinMaxScaler ──────────────────────────────────────
mms = MinMaxScaler(feature_range=(0, 1))
mms.fit(X_train)
print("MinMax scaled (outlier pulls everything to bottom):")
print(pd.DataFrame(mms.transform(data), columns=['salary_minmax']).round(3))

# ── RobustScaler ──────────────────────────────────────
# Uses median and IQR — robust to outliers!
# X_scaled = (X - median) / IQR
rs = RobustScaler()
rs.fit(X_train)
print("\nRobust scaled (better with outliers):")
print(pd.DataFrame(rs.transform(data), columns=['salary_robust']).round(3))

# ── Comparison table ──────────────────────────────────
ss = StandardScaler()
ss.fit(X_train)
comparison = pd.DataFrame({
    'original': data['salary'].values,
    'standardized': ss.transform(data).flatten(),
    'normalized(MinMax)': mms.transform(data).flatten(),
    'robust': rs.transform(data).flatten()
})
print("\nScaling Comparison:")
print(comparison.round(3))

# Summary:
# StandardScaler → mean=0, std=1 — distorted by outlier
# MinMaxScaler   → [0,1] — distorted by outlier
# RobustScaler   → median-centered — handles outliers best
Duplicate Data Handling
Duplicate rows are identical or near-identical records in a dataset. They can arise from data entry errors, merging datasets, or web scraping. Duplicates bias models by over-representing certain patterns and inflate accuracy metrics.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id':     [1, 2, 2, 3, 4, 4, 5],
    'name':   ['Alice', 'Bob', 'Bob', 'Charlie', 'Dave', 'Dave', 'Eve'],
    'salary': [50000, 60000, 60000, 70000, 80000, 80001, 55000],
    # Note: Dave has slightly different salary — near duplicate
})

# ── Exact duplicates ──────────────────────────────────
print("Exact duplicates:", df.duplicated().sum())
print(df[df.duplicated(keep=False)])  # Show all occurrences

# ── Duplicates based on subset of columns ─────────────
# Same person (name+salary close), different id
print("\nDuplicates by name:", df.duplicated(subset=['name']).sum())

# ── Remove exact duplicates ───────────────────────────
# keep='first': keep first occurrence (default)
# keep='last':  keep last occurrence
# keep=False:   drop ALL duplicates
df_clean = df.drop_duplicates(keep='first').reset_index(drop=True)
print("\nAfter removing exact duplicates:")
print(df_clean)

# ── Remove subset duplicates ──────────────────────────
df_clean2 = df.drop_duplicates(subset=['name'], keep='first').reset_index(drop=True)
print("\nAfter removing name duplicates:")
print(df_clean2)

# ── Near-duplicate detection ──────────────────────────
# Rows sharing a name but differing slightly (e.g. salary off by 1)
# are candidates for fuzzy matching or manual review
df_sorted = df.sort_values('name')
grouped = df_sorted.groupby('name').filter(lambda x: len(x) > 1)
print("\nNear-duplicate candidates (same name):")
print(grouped)
Changing Data Types
Data types define how data is stored and what operations are valid. Incorrect data types cause errors, performance issues, and wrong calculations. Converting to the right type is a fundamental preprocessing step.
import pandas as pd
import numpy as np

# Common raw data type issues
df = pd.DataFrame({
    'age':       ['25', '30', 'bad', '28'],                                 # should be int, but string
    'salary':    ['$50,000', '$72,000', '$63,000', '$95,000'],              # currency string
    'date':      ['2023-01-15', '2023-03-22', '2023-07-01', '2023-11-30'],  # string date
    'is_active': ['True', 'False', 'True', 'True']                          # string bool
})
print("Original dtypes:\n", df.dtypes)

# ── Fix 1: String → numeric (safe with coerce) ─────────
df['age_clean'] = pd.to_numeric(df['age'], errors='coerce')  # 'bad' → NaN instead of crashing

# ── Fix 2: Currency string → float ────────────────────
df['salary_clean'] = (df['salary']
                      .str.replace('$', '', regex=False)
                      .str.replace(',', '', regex=False)
                      .astype(float))

# ── Fix 3: String → datetime ──────────────────────────
df['date_clean'] = pd.to_datetime(df['date'])
# Now you can do: df['date_clean'].dt.year, .month, .day, .dayofweek
df['year'] = df['date_clean'].dt.year
df['month'] = df['date_clean'].dt.month

# ── Fix 4: String → boolean ───────────────────────────
df['is_active_bool'] = df['is_active'].map({'True': True, 'False': False})

# ── Fix 5: Downcast for memory efficiency ─────────────
# int64 → int32 saves memory on large datasets
df['age_int32'] = df['age_clean'].astype('Int32')  # nullable int

print("\nCleaned dtypes:\n", df.dtypes)
print("\nCleaned data:\n", df[['age_clean', 'salary_clean', 'year', 'month', 'is_active_bool']])
Function Transformer
FunctionTransformer wraps any Python function into a scikit-learn compatible transformer. This allows you to apply custom transformations (like log, square root, domain-specific functions) inside sklearn Pipelines.
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# Skewed data — log transform helps normalize it
salary = np.array([20000, 30000, 35000, 50000, 200000, 500000])

# ── Basic FunctionTransformer ─────────────────────────
log_transformer = FunctionTransformer(
    func=np.log1p,         # log(1 + x) — handles zeros safely
    inverse_func=np.expm1  # exp(x) - 1 — for inverse transform
)
log_salary = log_transformer.transform(salary.reshape(-1, 1))
print("Original salary:", salary)
print("Log transformed: ", log_salary.flatten().round(2))

# ── Custom function transformer ───────────────────────
def clip_outliers(X, lower_pct=5, upper_pct=95):
    """Clip values to specified percentile range."""
    lower = np.percentile(X, lower_pct)
    upper = np.percentile(X, upper_pct)
    return np.clip(X, lower, upper)

clip_transformer = FunctionTransformer(clip_outliers)
clipped = clip_transformer.transform(salary.reshape(-1, 1))
print("\nClipped (5th-95th pct):", clipped.flatten())

# ── In a Pipeline ─────────────────────────────────────
X = np.random.lognormal(10, 1, (100, 1))
y = np.random.randn(100)
pipeline = Pipeline([
    ('log_transform', FunctionTransformer(np.log1p)),
    ('model', LinearRegression()),
])
pipeline.fit(X, y)
print("\nPipeline with log transform fitted successfully!")
print("Coef:", pipeline.named_steps['model'].coef_)
- FunctionTransformer makes any function sklearn-pipeline-compatible
- log1p is the go-to for right-skewed distributions (salary, prices, counts)
- Always define inverse_func if you need to inverse-transform predictions
Phase 3 — Feature Selection
Backward Elimination
Backward Elimination is a wrapper feature selection method that starts with all features and iteratively removes the least significant one (highest p-value) until all remaining features meet a significance threshold.
1. Start with ALL features
2. Train model, compute p-values for each feature
3. If max p-value > threshold (usually 0.05): remove that feature
4. Retrain with remaining features
5. Repeat until all p-values < threshold
6. Remaining features = selected set
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_diabetes

# Load dataset
data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# ── Backward Elimination ──────────────────────────────
def backward_elimination(X, y, significance_level=0.05):
    """
    Iteratively removes features with p-value > significance_level.
    Returns list of selected feature names.
    """
    cols = list(X.columns)
    while True:
        X_with_const = sm.add_constant(X[cols])  # Add intercept
        model = sm.OLS(y, X_with_const).fit()    # Ordinary Least Squares
        # Get p-values (exclude the constant)
        pvalues = model.pvalues.drop('const')
        max_pval = pvalues.max()
        if max_pval > significance_level:
            worst_feat = pvalues.idxmax()
            print(f"Removing '{worst_feat}' (p-value: {max_pval:.4f})")
            cols.remove(worst_feat)
        else:
            break
    return cols

selected = backward_elimination(X, y, significance_level=0.05)
print(f"\n✅ Selected features ({len(selected)}): {selected}")

# ── Using mlxtend SequentialFeatureSelector (wrapper) ─
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

sfs = SequentialFeatureSelector(
    LinearRegression(),
    k_features=5,    # Select top 5 features
    forward=False,   # False = backward elimination
    scoring='r2',
    cv=5
)
sfs.fit(X.values, y)
print("\nmlxtend Backward SFS selected features:", sfs.k_feature_names_)
Forward Selection
Forward Selection starts with zero features and iteratively adds the most significant feature at each step until no remaining feature improves the model above the threshold.
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from mlxtend.feature_selection import SequentialFeatureSelector

data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# ── Manual Forward Selection ──────────────────────────
def forward_selection(X, y, significance_level=0.05):
    import statsmodels.api as sm
    remaining = list(X.columns)
    selected = []
    while remaining:
        best_pval = 1.0
        best_feat = None
        for feat in remaining:
            cols = selected + [feat]
            X_const = sm.add_constant(X[cols])
            model = sm.OLS(y, X_const).fit()
            pval = model.pvalues[feat]
            if pval < best_pval:
                best_pval = pval
                best_feat = feat
        if best_feat and best_pval < significance_level:
            selected.append(best_feat)
            remaining.remove(best_feat)
            print(f"Added '{best_feat}' (p-value: {best_pval:.4f})")
        else:
            break
    return selected

selected = forward_selection(X, y)
print(f"\n✅ Forward selected ({len(selected)}): {selected}")

# ── mlxtend (cleaner, cross-validated) ───────────────
sfs_fwd = SequentialFeatureSelector(
    LinearRegression(),
    k_features=5,
    forward=True,   # Forward selection
    scoring='r2',
    cv=5
)
sfs_fwd.fit(X.values, y)
print("\nmlxtend Forward SFS features:", sfs_fwd.k_feature_names_)

# When to use Forward vs Backward?
# Forward: many features, want to build minimal set
# Backward: moderate features, start broad
# Both: O(n²) complexity — use filter methods first for huge feature sets
Phase 4 — Model Training
Train-Test Split
Train-test split divides the dataset into two sets: a training set (model learns from this) and a test set (used to evaluate final performance on unseen data). This simulates real-world deployment where the model encounters data it was never trained on.
If you train and evaluate on the same data, the model can "memorize" it (overfit) and report 100% accuracy — but fail completely on new data. Test set = honest evaluation of real-world performance.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# ── Basic split ────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% test, 80% train
    random_state=42,   # Reproducibility
    stratify=y         # IMPORTANT: maintain class distribution
)
print(f"Train size: {X_train.shape}, Test size: {X_test.shape}")

# ── Why stratify=y is critical for classification ─────
# Without stratify: test might have 0 samples of class 2!
print("\nClass distribution in full dataset:")
print(pd.Series(y).value_counts(normalize=True).round(3))
print("Class distribution in train:")
print(pd.Series(y_train).value_counts(normalize=True).round(3))
print("Class distribution in test:")
print(pd.Series(y_test).value_counts(normalize=True).round(3))
# With stratify=y, all three should be ~33% each

# ── Train/Validation/Test split ───────────────────────
# For hyperparameter tuning: 60/20/20
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
print(f"\nTrain: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
# Train: 90, Val: 30, Test: 30 (out of 150)

# ── Common split ratios ───────────────────────────────
# Small dataset (<1000): 70/30 or 60/20/20
# Medium (1000-10000): 80/20 or 70/10/20
# Large (>10000): 90/10 (more train data = better model)
- Not using stratify=y for classification — imbalanced splits mislead evaluation
- Preprocessing (scaling, imputing) the full dataset before splitting — data leakage!
- Not setting random_state — results are not reproducible
- Making the test set too small — noisy evaluation; too large — less data for training
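The leakage mistake above is easy to demonstrate. A minimal sketch (synthetic data, assumed values) showing that a scaler fit on the full dataset learns different parameters than one fit only on the training split — the "leaky" scaler has quietly seen the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(50, 10, size=(100, 1))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# WRONG: the scaler sees test data — its statistics leak into training
leaky = StandardScaler().fit(X)
# RIGHT: the scaler learns only from the training split
clean = StandardScaler().fit(X_train)

print("Leaky mean:", round(leaky.mean_[0], 3))
print("Clean mean:", round(clean.mean_[0], 3))
# The learned parameters differ: the leaky scaler encodes information
# about the test set that a deployed model would never have
```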
Phase 5 — Regression
Regression Analysis
Regression is a supervised learning task that predicts a continuous numerical output based on one or more input features. The goal is to find the mathematical relationship between inputs and the target variable.
| Type | Formula/Description | Use When |
|---|---|---|
| Simple Linear | y = mx + b | 1 feature, linear relationship |
| Multiple Linear | y = b₀ + b₁x₁ + b₂x₂ + ... | Multiple features, linear relationship |
| Polynomial | y = b₀ + b₁x + b₂x² + ... | Curved/nonlinear relationship |
| Ridge (L2) | Linear + L2 penalty | Multicollinearity, prevent overfitting |
| Lasso (L1) | Linear + L1 penalty | Feature selection built-in |
| Metric | Formula | Interpretation | Range |
|---|---|---|---|
| MAE | mean(|y - ŷ|) | Average absolute error — intuitive, robust | [0, ∞) |
| MSE | mean((y - ŷ)²) | Penalizes large errors heavily | [0, ∞) |
| RMSE | √MSE | Same unit as target — most interpretable | [0, ∞) |
| R² | 1 - SS_res/SS_tot | % variance explained (1 = perfect) | (-∞, 1] |
| Adj. R² | 1 - (1-R²)(n-1)/(n-k-1) | R² penalized for extra features | (-∞, 1] |
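The metric formulas in the table can be checked by hand. A minimal sketch (toy `y_true`/`y_pred` values chosen for illustration) computing MAE, MSE, RMSE, and R² from their definitions and confirming they match sklearn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

mae  = np.mean(np.abs(y_true - y_pred))   # mean(|y - ŷ|)
mse  = np.mean((y_true - y_pred) ** 2)    # mean((y - ŷ)²)
rmse = np.sqrt(mse)                       # same unit as y
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                  # 1 - SS_res/SS_tot

# sklearn agrees with the manual formulas
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
print(f"MAE={mae}, MSE={mse}, RMSE={rmse:.4f}, R²={r2:.4f}")
```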
Linear Regression — Theory
Linear Regression models the relationship between input X and output y as a straight line: y = β₀ + β₁X + ε. It finds the line that minimizes the sum of squared residuals (differences between actual and predicted values).
🏠 House price analogy: "For every additional 100 sqft, price increases by $10,000." That relationship IS a linear regression. β₀ = base price (intercept), β₁ = price per sqft (slope).
Closed-form solution: β = (XᵀX)⁻¹ Xᵀy
No iteration needed — there's an exact mathematical solution. In practice this makes linear regression fast when the number of features is modest: inverting XᵀX scales with the cube of the feature count, so for very wide data iterative solvers (gradient descent) are used instead.
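The closed-form solution can be verified in a few lines. A minimal sketch (synthetic data; `np.linalg.solve` used instead of an explicit inverse for numerical stability) showing the normal equation recovers the same coefficients as sklearn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 4.0 + 2.5 * X[:, 0] + rng.normal(0, 1, 100)  # true: β₀=4.0, β₁=2.5

# Closed-form OLS: β = (XᵀX)⁻¹ Xᵀy (prepend a column of 1s for the intercept)
Xb = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(f"Normal equation: β₀={beta[0]:.3f}, β₁={beta[1]:.3f}")

# sklearn reaches the same solution
lr = LinearRegression().fit(X, y)
print(f"sklearn:         β₀={lr.intercept_:.3f}, β₁={lr.coef_[0]:.3f}")
```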
| Assumption | What It Means | How to Check |
|---|---|---|
| Linearity | X and y have linear relationship | Scatter plot |
| Independence | Observations are independent | Domain knowledge |
| Homoscedasticity | Residuals have constant variance | Residual vs fitted plot |
| Normality of residuals | Residuals are normally distributed | QQ-plot |
| No multicollinearity | Features not highly correlated with each other | Correlation matrix, VIF |
Linear Regression — Practical
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

# ── Generate synthetic house price data ───────────────
np.random.seed(42)
n = 200
sqft = np.random.uniform(500, 3000, n)
price = 100 + 0.15 * sqft + np.random.normal(0, 20, n)  # True: y = 100 + 0.15x + noise
df = pd.DataFrame({'sqft': sqft, 'price': price})

# ── Step 1: Prepare data ──────────────────────────────
X = df[['sqft']]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ── Step 2: Train model ───────────────────────────────
model = LinearRegression()
model.fit(X_train, y_train)
print(f"Intercept (β₀): {model.intercept_:.2f}")
print(f"Coefficient (β₁): {model.coef_[0]:.2f}")
print(f"Equation: price = {model.intercept_:.2f} + {model.coef_[0]:.2f} × sqft")

# ── Step 3: Predict ───────────────────────────────────
y_pred = model.predict(X_test)

# ── Step 4: Evaluate ─────────────────────────────────
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"\nMAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²: {r2:.4f}")  # Should be ~0.97+ with clean linear data

# ── Step 5: Residual Analysis ─────────────────────────
residuals = y_test - y_pred
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Actual vs Predicted
axes[0].scatter(y_test, y_pred, alpha=0.5)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
axes[0].set_xlabel('Actual'); axes[0].set_ylabel('Predicted')
axes[0].set_title('Actual vs Predicted\n(Perfect = on the red line)')

# Plot 2: Residuals vs Fitted
axes[1].scatter(y_pred, residuals, alpha=0.5)
axes[1].axhline(y=0, color='r', linestyle='--')
axes[1].set_xlabel('Fitted Values'); axes[1].set_ylabel('Residuals')
axes[1].set_title('Residuals vs Fitted\n(Good = random scatter around 0)')

# Plot 3: Residual histogram
axes[2].hist(residuals, bins=20, color='steelblue')
axes[2].set_title('Residual Distribution\n(Good = roughly normal, centered at 0)')
plt.tight_layout(); plt.show()
- Not plotting residuals — you'll miss heteroscedasticity and nonlinearity
- R² alone is a misleading metric — always check residual plots too
- Negative R² is possible (model worse than predicting mean) — something is very wrong
Load sklearn's California Housing dataset. Build a linear regression predicting house prices. Print MAE, RMSE, R². Plot Actual vs Predicted. Is the relationship linear? What does the residual plot tell you?
Multiple Linear Regression
Multiple Linear Regression extends simple linear regression to use multiple input features simultaneously. Each feature gets its own coefficient (weight) representing its independent contribution to the prediction.
β₀ = intercept, βᵢ = coefficient for feature xᵢ
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.datasets import fetch_california_housing
import matplotlib.pyplot as plt
import seaborn as sns

# ── Load dataset ──────────────────────────────────────
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='Price')
print("Features:", X.columns.tolist())
print("Shape:", X.shape)

# ── Check multicollinearity with correlation matrix ───
plt.figure(figsize=(10, 8))
corr = X.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Feature Correlation Matrix\n(High values = multicollinearity risk)')
plt.tight_layout(); plt.show()

# ── Train/test split ──────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ── Fit model ─────────────────────────────────────────
model = LinearRegression()
model.fit(X_train, y_train)

# ── Coefficients interpretation ───────────────────────
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
}).sort_values('Coefficient', ascending=False)
print("\nCoefficients (show each feature's independent effect):")
print(coef_df.to_string(index=False))

# ── Evaluate ─────────────────────────────────────────
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"\nR²: {r2:.4f}, RMSE: {rmse:.4f}")

# ── VIF: Variance Inflation Factor for multicollinearity ──
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

X_const = sm.add_constant(X_train)
vif_data = pd.DataFrame()
vif_data["Feature"] = X_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_const.values, i)
                   for i in range(X_const.shape[1])]
print("\nVIF (>10 = multicollinearity problem):")
print(vif_data.sort_values('VIF', ascending=False))
- Not checking multicollinearity: VIF > 10 = features are redundant, coefficients become unreliable
- Interpreting coefficients without scaling: Compare magnitudes only after standardizing features
- Adding too many features: More features → more risk of overfitting and multicollinearity
Polynomial Regression
Polynomial Regression extends linear regression by adding polynomial terms (x², x³, ...) as new features. Despite fitting a curve, it is still a linear model because it is linear in its parameters — only the features are transformed.
Key insight: this is linear regression on [x, x², x³] as features
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline

np.random.seed(42)
X = np.random.uniform(-3, 3, 100)
y = 0.5 * X**3 - 2 * X**2 + X + np.random.normal(0, 2, 100)
X = X.reshape(-1, 1)

# ── Compare degrees ───────────────────────────────────
fig, axes = plt.subplots(1, 4, figsize=(18, 4), sharey=True)
X_plot = np.linspace(-3, 3, 300).reshape(-1, 1)
for i, degree in enumerate([1, 2, 3, 10]):
    # Pipeline: transform features → fit linear regression
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('model', LinearRegression())
    ])
    pipe.fit(X, y)
    y_plot = pipe.predict(X_plot)
    r2 = r2_score(y, pipe.predict(X))
    axes[i].scatter(X, y, alpha=0.4, s=20)
    axes[i].plot(X_plot, y_plot, color='red', lw=2)
    axes[i].set_title(f"Degree {degree}\nR² = {r2:.3f}")
    if degree == 10:
        axes[i].set_title(f"Degree {degree}\nR² = {r2:.3f}\n⚠️ OVERFITTING")
plt.suptitle('Polynomial Regression: Degree Comparison', y=1.02)
plt.tight_layout()
plt.show()

# ── PolynomialFeatures explained ─────────────────────
poly = PolynomialFeatures(degree=2, include_bias=False)
X_2feat = np.array([[2, 3]])  # [x1, x2]
X_poly = poly.fit_transform(X_2feat)
print("Input: [x1, x2]")
print("Poly degree 2 output: [x1, x2, x1², x1·x2, x2²]")
print("Feature names:", poly.get_feature_names_out(['x1', 'x2']))
print("Values:", X_poly)
- High degree polynomial = overfitting (memorizes noise, fails on test data)
- Not scaling features — polynomial terms create extremely large values without scaling
- Using polynomial regression when you should use a tree-based model — trees handle nonlinearity naturally
Cost Function
A cost function (loss function) quantifies how wrong the model's predictions are. During training, the algorithm adjusts model parameters to minimize this cost. Understanding the cost function is essential — it defines what "learning" means.
| Cost Function | Formula | Properties |
|---|---|---|
| MSE (OLS) | Σ(y−ŷ)²/n | Penalizes large errors heavily; differentiable everywhere; default for linear regression |
| MAE | Σ|y−ŷ|/n | Robust to outliers; not differentiable at 0 |
| Huber Loss | MSE if |error|<δ, MAE otherwise | Best of both worlds — use for outlier-prone data |
import numpy as np
import matplotlib.pyplot as plt

# Simple 1D regression to visualize cost landscape
np.random.seed(42)
X = np.random.uniform(0, 10, 50)
y = 3 * X + np.random.normal(0, 3, 50)

# Cost = MSE for different values of slope (β₁)
slopes = np.linspace(-2, 8, 200)
mse_costs = []
for slope in slopes:
    y_pred = slope * X  # intercept=0 for simplicity
    mse = np.mean((y - y_pred)**2)
    mse_costs.append(mse)

# The minimum of this U-shaped curve = optimal β₁
optimal_slope = slopes[np.argmin(mse_costs)]
print(f"Optimal slope found by scanning: {optimal_slope:.2f}")

plt.figure(figsize=(8, 4))
plt.plot(slopes, mse_costs, 'b-', lw=2)
plt.axvline(x=optimal_slope, color='r', linestyle='--', label=f'Min at β={optimal_slope:.2f}')
plt.xlabel('Slope (β₁)'); plt.ylabel('MSE Cost')
plt.title('MSE Cost Landscape\n(Training finds the bottom of this bowl)')
plt.legend(); plt.tight_layout(); plt.show()

# ── Gradient Descent from scratch ────────────────────
# This is the idea behind sklearn's SGDRegressor
# (which uses stochastic rather than full-batch updates)
def gradient_descent(X, y, learning_rate=0.01, epochs=1000):
    n = len(X)
    beta = 0.0  # start at 0
    costs = []
    for epoch in range(epochs):
        y_pred = beta * X
        cost = np.mean((y - y_pred)**2)
        # Gradient of MSE w.r.t. beta = -2/n * Σ(y - ŷ)*x
        gradient = -2 / n * np.sum((y - y_pred) * X)
        beta -= learning_rate * gradient  # Update step
        costs.append(cost)
    return beta, costs

beta_found, costs = gradient_descent(X, y)
print(f"Gradient Descent found β = {beta_found:.4f} (true = 3.0)")
R² and Adjusted R²
R² (coefficient of determination) measures the proportion of variance in y explained by the model. Adjusted R² penalizes adding irrelevant features, making it the preferred metric for model comparison.
SS_res = Σ(y − ŷ)² (residual sum of squares)
SS_tot = Σ(y − ȳ)² (total sum of squares)
Adjusted R² = 1 − (1−R²) × (n−1) / (n−k−1) where n = samples, k = number of features
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

housing = fetch_california_housing()
X_full = housing.data
y = housing.target
X_train, X_test, y_train, y_test = train_test_split(X_full, y, test_size=0.2, random_state=42)

# ── R² from scratch ───────────────────────────────────
def manual_r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred)**2)
    ss_tot = np.sum((y_true - np.mean(y_true))**2)
    return 1 - (ss_res / ss_tot)

# ── Adjusted R² ──────────────────────────────────────
def adjusted_r2(r2, n, k):
    """
    r2: R² value
    n: number of samples
    k: number of features (predictors)
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# ── Compare R² vs Adj-R² as we add noise features ─────
# Adding random noise features should NOT improve Adj-R²
print(f"{'Features':>10} {'R²':>10} {'Adj R²':>10}")
print("-" * 35)
for n_noise in [0, 5, 20, 50]:
    np.random.seed(42)
    if n_noise > 0:
        noise = np.random.randn(X_train.shape[0], n_noise)
        X_tr = np.hstack([X_train, noise])
        noise_t = np.random.randn(X_test.shape[0], n_noise)
        X_ts = np.hstack([X_test, noise_t])
    else:
        X_tr, X_ts = X_train, X_test
    model = LinearRegression()
    model.fit(X_tr, y_train)
    y_pred = model.predict(X_ts)
    r2 = r2_score(y_test, y_pred)
    n = X_ts.shape[0]
    k = X_tr.shape[1]
    adj = adjusted_r2(r2, n, k)
    print(f"{X_train.shape[1]+n_noise:>10} {r2:>10.4f} {adj:>10.4f}")

# Key insight:
# Training R² never decreases as you add features (even pure noise!)
# Adj R² penalizes extra features — it DROPS if they are irrelevant
# → Always use Adj R² for model comparison
Best Fit Line
The "best fit line" (regression line) is the line that minimizes the sum of squared vertical distances between the data points and the line itself. It passes through (x̄, ȳ) and is uniquely defined by the OLS solution.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

np.random.seed(42)
X = np.random.uniform(1, 10, 30)
y = 2 * X + np.random.normal(0, 2, 30)

model = LinearRegression().fit(X.reshape(-1, 1), y)
y_pred = model.predict(X.reshape(-1, 1))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ── Plot 1: Scatter + best fit line + residuals ───────
axes[0].scatter(X, y, color='steelblue', zorder=5, label='Data')
X_line = np.linspace(X.min(), X.max(), 100)
axes[0].plot(X_line, model.predict(X_line.reshape(-1, 1)), 'r-', lw=2, label='Best Fit Line')
# Draw residuals as vertical lines
for xi, yi, yp in zip(X, y, y_pred):
    axes[0].plot([xi, xi], [yi, yp], 'g--', alpha=0.4, lw=1)
axes[0].axvline(x=X.mean(), color='purple', linestyle=':', label=f'x̄ = {X.mean():.1f}')
axes[0].axhline(y=y.mean(), color='orange', linestyle=':', label=f'ȳ = {y.mean():.1f}')
axes[0].set_title(f'Best Fit Line\nGreen lines = residuals (what we minimize)\nβ₀={model.intercept_:.2f}, β₁={model.coef_[0]:.2f}')
axes[0].legend()

# ── Plot 2: Squared residuals (what OLS actually minimizes) ──
residuals = y - y_pred
axes[1].bar(range(len(residuals)), residuals**2, color='coral')
axes[1].set_title(f'Squared Residuals\nTotal = {(residuals**2).sum():.2f} (OLS minimizes this)')
axes[1].set_xlabel('Sample Index')
axes[1].set_ylabel('Squared Residual')
plt.tight_layout()
plt.show()

print(f"Best fit: y = {model.intercept_:.2f} + {model.coef_[0]:.2f}x")
print(f"True:     y = 0 + 2x")
print("Close! Noise causes small deviation.")
Lasso (L1) & Ridge (L2) — Theory
Ridge and Lasso are regularized versions of linear regression that add a penalty term to prevent overfitting by discouraging large coefficients. They're essential when you have many features or multicollinearity.
When a linear model overfits, it assigns large coefficients to noise features. Regularization adds a penalty for large coefficients, forcing the model to be simpler. This trades a tiny bit of bias for a large reduction in variance.
Bias-Variance Tradeoff: Underfitting = High Bias. Overfitting = High Variance. Regularization = Find the sweet spot.
Ridge: Minimize Σ(y − ŷ)² + λΣβᵢ² (L2 penalty)
Lasso: Minimize Σ(y − ŷ)² + λΣ|βᵢ| (L1 penalty)
Elastic Net: Ridge + Lasso combined
| Property | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty | Sum of squared coefficients | Sum of absolute coefficients |
| Effect on coefficients | Shrinks toward 0, but rarely exactly 0 | Can shrink to EXACTLY 0 (feature selection!) |
| Feature selection | No — keeps all features small | Yes — eliminates irrelevant features |
| Best for | Multicollinearity; all features are relevant | When you suspect many features are irrelevant |
| Hyperparameter α (λ) | Higher α = more shrinkage | Higher α = more features set to 0 |
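Elastic Net appears in the comparison above but gets no code of its own. A minimal sketch (the synthetic data and hyperparameters here are illustrative, not tuned): only the first two features carry signal, and the combined penalty shrinks the noise coefficients toward (or to) zero.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features matter; the other three are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# l1_ratio blends the penalties: 1.0 = pure Lasso, 0.0 = pure Ridge
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(StandardScaler().fit_transform(X), y)
print("Coefficients:", enet.coef_.round(2))
```

The signal coefficients survive (slightly shrunk), while the noise coefficients collapse — the Lasso-like part of the penalty doing feature selection, the Ridge-like part stabilizing correlated features.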
Lasso & Ridge — Practical (Continued)
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score

housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
sc = StandardScaler()
Xtr = sc.fit_transform(X_train)
Xte = sc.transform(X_test)

# ── Coefficient Path: how coefficients shrink with alpha ─
alphas = np.logspace(-3, 3, 100)
ridge_coefs, lasso_coefs = [], []
for a in alphas:
    ridge_coefs.append(Ridge(alpha=a).fit(Xtr, y_train).coef_)
    lasso_coefs.append(Lasso(alpha=a, max_iter=5000).fit(Xtr, y_train).coef_)
ridge_coefs = np.array(ridge_coefs)
lasso_coefs = np.array(lasso_coefs)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
for i, name in enumerate(X.columns):
    ax1.plot(alphas, ridge_coefs[:, i], label=name)
    ax2.plot(alphas, lasso_coefs[:, i], label=name)
ax1.set_xscale('log'); ax1.set_title('Ridge Coefficient Path\n(All shrink but stay nonzero)')
ax1.set_xlabel('Alpha (regularization strength)'); ax1.legend(fontsize=8)
ax2.set_xscale('log'); ax2.set_title('Lasso Coefficient Path\n(Features drop to EXACT zero)')
ax2.set_xlabel('Alpha (regularization strength)'); ax2.legend(fontsize=8)
plt.tight_layout(); plt.show()

# ── Auto-tune alpha with Cross-Validation ─────────────
# RidgeCV / LassoCV test many alphas internally via CV
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(Xtr, y_train)
print(f"RidgeCV best alpha: {ridge_cv.alpha_:.4f}")
print(f"RidgeCV R²: {r2_score(y_test, ridge_cv.predict(Xte)):.4f}")

lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=5000)
lasso_cv.fit(Xtr, y_train)
print(f"\nLassoCV best alpha: {lasso_cv.alpha_:.4f}")
print(f"LassoCV R²: {r2_score(y_test, lasso_cv.predict(Xte)):.4f}")
n_zero = np.sum(lasso_cv.coef_ == 0)
print(f"LassoCV zeroed out {n_zero}/{X.shape[1]} features (automatic feature selection!)")
```
- Not scaling before Ridge/Lasso — regularization penalizes coefficient magnitude; unscaled features get unfairly penalized
- Using `alpha=0` in Ridge — that's just OLS, no regularization
- Forgetting that Lasso can set ALL features to zero if alpha is too high
- Ridge shrinks coefficients but keeps all features — good for multicollinearity
- Lasso does feature selection by zeroing coefficients — good when many features are irrelevant
- Elastic Net combines both — best for correlated feature groups
- Always scale features before applying any regularized model
- Use RidgeCV / LassoCV to auto-select optimal alpha via cross-validation
Load the California housing dataset (the Boston dataset has been removed from scikit-learn). Add 10 random noise features to X. Compare R² and Adjusted R² of OLS, Ridge, and Lasso. Show which features Lasso zeroes out. Which model is best, and why?
⚡ Project 1 — After Regression
Real-World Project: House Price Predictor
Dataset: California Housing (sklearn) or Kaggle Ames Housing
Goal: Build a production-grade regression pipeline with all preprocessing steps chained together.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# ── 1. LOAD DATA ──────────────────────────────────────
data = fetch_california_housing(as_frame=True)
df = data.frame
print("Shape:", df.shape)
print(df.describe().round(2))

# ── 2. FEATURE ENGINEERING ────────────────────────────
df['rooms_per_household'] = df['AveRooms'] / df['AveOccup']
df['bedrooms_per_room'] = df['AveBedrms'] / df['AveRooms']
df['population_per_household'] = df['Population'] / df['AveOccup']

feature_cols = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
                'AveOccup', 'Latitude', 'Longitude', 'rooms_per_household',
                'bedrooms_per_room', 'population_per_household']
X = df[feature_cols]
y = df['MedHouseVal']

# ── 3. SPLIT ──────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ── 4. BUILD PIPELINES ────────────────────────────────
pipe_lr = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])
pipe_ridge = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])
pipe_lasso = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', Lasso(alpha=0.01))
])

# ── 5. EVALUATE ALL ───────────────────────────────────
print(f"\n{'Model':15} {'R²':>8} {'RMSE':>10} {'MAE':>10} {'CV R²':>10}")
print("-" * 60)
for name, pipe in [('LinearReg', pipe_lr), ('Ridge', pipe_ridge), ('Lasso', pipe_lasso)]:
    pipe.fit(X_train, y_train)
    yp = pipe.predict(X_test)
    r2 = r2_score(y_test, yp)
    rmse = np.sqrt(mean_squared_error(y_test, yp))
    mae = mean_absolute_error(y_test, yp)
    cv = cross_val_score(pipe, X_train, y_train, cv=5, scoring='r2').mean()
    print(f"{name:15} {r2:8.4f} {rmse:10.4f} {mae:10.4f} {cv:10.4f}")

# ── 6. RESIDUAL PLOT for best model ───────────────────
pipe_ridge.fit(X_train, y_train)
y_pred = pipe_ridge.predict(X_test)
residuals = y_test - y_pred

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred, alpha=0.3, s=10)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.title('Actual vs Predicted (Ridge)'); plt.xlabel('Actual'); plt.ylabel('Predicted')
plt.subplot(1, 2, 2)
plt.hist(residuals, bins=50, color='steelblue')
plt.title('Residual Distribution')
plt.tight_layout(); plt.show()
```
Phase 6 — Classification
Classification Overview
Classification is supervised learning where the output is a discrete category/label rather than a continuous number. The model learns a decision boundary that separates classes.
| Type | Output | Example | Algorithm |
|---|---|---|---|
| Binary | 2 classes (0/1) | Spam / Not Spam | Logistic Regression |
| Multiclass | 3+ mutually exclusive classes | Dog / Cat / Bird | Softmax, Decision Tree |
| Multilabel | Multiple labels per sample | Movie: Action + Comedy | One-vs-Rest wrapper |
| Algorithm | How it works | Best for |
|---|---|---|
| Logistic Regression | Sigmoid probability on linear boundary | Linearly separable, interpretability needed |
| Decision Tree | Splits on feature thresholds | Nonlinear, interpretable rules |
| Random Forest | Average of many trees | High accuracy, feature importance |
| SVM | Maximize margin between classes | High-dimensional data, small datasets |
| KNN | Majority vote of k nearest neighbors | Small datasets, nonlinear boundaries |
| Naive Bayes | Probabilistic, Bayes theorem | Text classification, very fast |
- Classification = predicting a label/category, not a number
- Binary (2 classes) is simplest; multiclass uses one-vs-rest or softmax extensions
- Choose algorithm based on: dataset size, linearity, interpretability needs
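To make the "choose your algorithm" advice concrete, here is a quick sketch (synthetic data, default hyperparameters — not a benchmark) comparing three of the algorithms above with cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

scores = {}
for name, clf in [('LogReg', LogisticRegression(max_iter=1000)),
                  ('Tree',   DecisionTreeClassifier(random_state=42)),
                  ('KNN',    KNeighborsClassifier())]:
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:8} CV accuracy: {scores[name]:.3f}")
```

On real problems the ranking depends on dataset size, linearity, and noise — the point is that a few lines of `cross_val_score` give an honest first comparison.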
Logistic Regression — Binary
Logistic Regression is a classification algorithm (misleadingly named) that uses the sigmoid function to output a probability between 0 and 1. It models the log-odds of the positive class as a linear function of inputs.
Linear regression gives output ∈ (−∞, +∞) — useless as a probability. We need output in (0, 1). The sigmoid (logistic) function squashes any real number into (0, 1), and the output is the probability of class 1. If P ≥ 0.5 → predict class 1; else → class 0.
P(y=1 | X) = σ(Xβ), where σ(z) = 1 / (1 + e^(−z))
Decision: ŷ = 1 if P ≥ threshold (default 0.5), else 0
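The squashing behavior is easy to verify numerically — a minimal sketch of the sigmoid and the default 0.5 threshold:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

for z in [-10, -2, 0, 2, 10]:
    p = sigmoid(z)
    print(f"z = {z:>3} → P = {p:.4f} → class {int(p >= 0.5)}")
```

Large negative scores give probabilities near 0, large positive scores near 1, and z = 0 sits exactly at the 0.5 decision boundary.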
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, recall_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

# ── Load binary classification dataset ────────────────
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target  # 0=malignant, 1=benign
print("Classes:", data.target_names)
print("Class balance:", pd.Series(y).value_counts())

# ── Prepare ───────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# ── Train logistic regression ─────────────────────────
# C = inverse regularization strength (C = 1/alpha)
# solver: 'lbfgs' (default), 'liblinear' (good for small data)
model = LogisticRegression(C=1.0, solver='lbfgs', max_iter=1000, random_state=42)
model.fit(X_train_s, y_train)

# ── Evaluate ──────────────────────────────────────────
y_pred = model.predict(X_test_s)
y_proba = model.predict_proba(X_test_s)[:, 1]  # P(class=1)
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# ── Probability threshold analysis ────────────────────
# Default threshold = 0.5. In medical diagnosis, a lower threshold
# catches more true positives (higher recall, lower precision)
thresholds = np.arange(0.1, 0.9, 0.1)
print(f"\n{'Threshold':>12} {'Accuracy':>10} {'Recall':>10}")
for t in thresholds:
    pred_t = (y_proba >= t).astype(int)
    acc = accuracy_score(y_test, pred_t)
    rec = recall_score(y_test, pred_t)
    print(f"{t:>12.1f} {acc:>10.4f} {rec:>10.4f}")
# In cancer detection: lower threshold = catch more cases (higher recall)
# but more false alarms (lower precision)

# ── Coefficients — feature importance ─────────────────
coef_df = pd.DataFrame({'Feature': X.columns,
                        'Coefficient': model.coef_[0]}).sort_values('Coefficient')
print("\nTop positive features (push toward benign, class 1):")
print(coef_df.tail(5))
```
- Using logistic regression on highly nonlinear data — use trees or SVM instead
- Ignoring the threshold — 0.5 is not always optimal; tune it based on recall/precision needs
- Not scaling features — logistic regression with regularization is scale-sensitive
Logistic Regression — Multiple Input Features
Multiple-input logistic regression uses many features simultaneously to compute the log-odds. Each feature gets its own coefficient, and the sigmoid is applied to the linear combination.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (roc_curve, auc, classification_report,
                             ConfusionMatrixDisplay)
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Pipeline: scale → logistic
pipe = Pipeline([('sc', StandardScaler()),
                 ('lr', LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)
y_prob = pipe.predict_proba(X_test)[:, 1]
y_pred = pipe.predict(X_test)

# ── ROC Curve ─────────────────────────────────────────
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(fpr, tpr, lw=2, label=f'ROC (AUC = {roc_auc:.3f})')
axes[0].plot([0, 1], [0, 1], 'k--', label='Random (AUC=0.5)')
axes[0].set_xlabel('False Positive Rate'); axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve\n(Closer to top-left = better model)')
axes[0].legend()

# ── Confusion Matrix ──────────────────────────────────
ConfusionMatrixDisplay.from_predictions(y_test, y_pred,
                                        display_labels=data.target_names,
                                        ax=axes[1], colorbar=False)
axes[1].set_title('Confusion Matrix')
plt.tight_layout(); plt.show()

print(f"ROC-AUC: {roc_auc:.4f}")
print(classification_report(y_test, y_pred))
# AUC: 0.5 = random, 1.0 = perfect. Aim for >0.85 in practice
```
Logistic Regression — Polynomial Features
Adding polynomial features to logistic regression allows it to learn nonlinear decision boundaries. Without this, logistic regression can only separate classes with a straight line/plane.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_circles

# Circular data — not linearly separable!
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=42)

def plot_boundary(model, X, y, ax, title):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='k', s=20)
    acc = model.score(X, y)
    ax.set_title(f'{title}\nAccuracy: {acc:.3f}')

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Linear boundary — fails on circles
pipe_lin = Pipeline([('sc', StandardScaler()), ('lr', LogisticRegression())])
pipe_lin.fit(X, y)
plot_boundary(pipe_lin, X, y, ax1, 'Linear Logistic Reg\n(FAILS on circles)')

# Polynomial degree 3 — learns circular boundary!
pipe_poly = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(C=1.0))
])
pipe_poly.fit(X, y)
plot_boundary(pipe_poly, X, y, ax2, 'Polynomial (degree=3)\n(Learns circular boundary!)')
plt.tight_layout(); plt.show()
```
Logistic Regression — Multiclass
Multiclass classification extends binary logistic regression to 3+ classes using two strategies: One-vs-Rest (OvR) — trains one binary classifier per class vs all others — and Softmax (Multinomial) — outputs a probability distribution over all classes simultaneously.
All probabilities sum to 1: Σ P(y=k|X) = 1
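A from-scratch sketch of the softmax transform (the raw scores below are invented for illustration, not from a trained model):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; result is unchanged
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw linear scores for 3 classes
probs = softmax(scores)
print(probs.round(3), "sum =", probs.sum())
```

The largest score gets the largest probability, and the outputs always sum to 1 — exactly the property stated above.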
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
print("Classes:", iris.target_names)
print("Samples per class:", np.bincount(y))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
sc = StandardScaler()
X_tr = sc.fit_transform(X_train)
X_te = sc.transform(X_test)

# ── Strategy 1: One-vs-Rest (OvR) ─────────────────────
# For k classes: trains k binary classifiers
# Class with highest probability wins
# (note: the multi_class argument is deprecated in newer scikit-learn versions)
lr_ovr = LogisticRegression(multi_class='ovr', C=1.0, max_iter=1000)
lr_ovr.fit(X_tr, y_train)
print(f"\nOvR Accuracy: {accuracy_score(y_test, lr_ovr.predict(X_te)):.4f}")

# ── Strategy 2: Softmax (multinomial) ─────────────────
# Trains ONE model jointly across all classes
# Better when classes are mutually exclusive
lr_soft = LogisticRegression(multi_class='multinomial', solver='lbfgs',
                             C=1.0, max_iter=1000)
lr_soft.fit(X_tr, y_train)
y_pred_soft = lr_soft.predict(X_te)
print(f"Softmax Accuracy: {accuracy_score(y_test, y_pred_soft):.4f}")

# ── Probability output (Softmax) ──────────────────────
proba = lr_soft.predict_proba(X_te[:5])
print("\nProbabilities for first 5 test samples (3 classes):")
df_proba = pd.DataFrame(proba, columns=iris.target_names).round(3)
df_proba['Prediction'] = iris.target_names[y_pred_soft[:5]]
df_proba['True Label'] = iris.target_names[y_test[:5]]
print(df_proba)

print("\nDetailed Report:")
print(classification_report(y_test, y_pred_soft, target_names=iris.target_names))
```
Phase 7 — Evaluation Metrics
Confusion Matrix
A Confusion Matrix is a table that visualizes the complete performance of a classification model by showing how many predictions fell into each category: True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN).
| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | TP — Correct positive prediction | FN — Missed positive (Type II Error) |
| Actual: Negative | FP — Wrong alarm (Type I Error) | TN — Correct negative prediction |
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
sc = StandardScaler()
X_tr = sc.fit_transform(X_tr); X_te = sc.transform(X_te)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# ── Raw confusion matrix ──────────────────────────────
cm = confusion_matrix(y_te, y_pred)
print("Confusion Matrix:\n", cm)
tn, fp, fn, tp = cm.ravel()
print(f"\nTN={tn} FP={fp}")
print(f"FN={fn} TP={tp}")

# ── All derived metrics from confusion matrix ─────────
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)       # = Sensitivity = True Positive Rate
specificity = tn / (tn + fp)  # True Negative Rate
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)          # False Positive Rate

print(f"\nAccuracy:    {accuracy:.4f}")
print(f"Precision:   {precision:.4f} (Of all predicted positive, how many actually are?)")
print(f"Recall:      {recall:.4f} (Of all actual positive, how many did we catch?)")
print(f"Specificity: {specificity:.4f} (Of all actual negative, how many correctly rejected?)")
print(f"F1-Score:    {f1:.4f} (Harmonic mean of precision + recall)")

# ── Heatmap visualization ─────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
ConfusionMatrixDisplay.from_predictions(y_te, y_pred,
                                        display_labels=data.target_names, ax=axes[0])
axes[0].set_title('Counts')
ConfusionMatrixDisplay.from_predictions(y_te, y_pred,
                                        display_labels=data.target_names,
                                        ax=axes[1], normalize='true')
axes[1].set_title('Normalized (Recall per class)')
plt.tight_layout(); plt.show()
```
Precision, Recall, F1-Score
Precision: Of all the samples we predicted as positive, what fraction is truly positive? (How careful we are.)
Recall: Of all truly positive samples, what fraction did we correctly identify? (How thorough we are.)
F1-Score: Harmonic mean of Precision and Recall — penalizes extreme imbalances between them.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F-beta = (1+β²) × (P × R) / (β²P + R) — β>1 weights recall more
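Plugging hypothetical counts into these formulas (TP=80, FP=10, FN=20 are made up for illustration):

```python
tp, fp, fn = 80, 10, 20

precision = tp / (tp + fp)   # 80/90  ≈ 0.889
recall = tp / (tp + fn)      # 80/100 = 0.800
f1 = 2 * precision * recall / (precision + recall)

beta = 2  # F2 weights recall more heavily
f2 = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} F2={f2:.3f}")
```

Because recall is lower than precision here, F2 (which emphasizes recall) comes out below F1 — the β weighting visibly shifts the score toward the weaker side.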
High Precision, Low Recall: Only flag cases you're very sure about. Miss some. (Spam: rarely mark real email as spam, but let some spam through.)
High Recall, Low Precision: Catch everything, accept false alarms. (Cancer screening: catch all cancer cases, even if some false alarms need follow-up.)
Control it: Lower the threshold → higher recall, lower precision. Higher threshold → higher precision, lower recall.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (precision_score, recall_score, f1_score, fbeta_score,
                             precision_recall_curve, average_precision_score,
                             classification_report)
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, test_size=0.2,
                                          stratify=data.target, random_state=42)
sc = StandardScaler()
model = LogisticRegression(max_iter=1000).fit(sc.fit_transform(X_tr), y_tr)
y_prob = model.predict_proba(sc.transform(X_te))[:, 1]
y_pred = model.predict(sc.transform(X_te))

# ── Key metrics ───────────────────────────────────────
print(f"Precision:  {precision_score(y_te, y_pred):.4f}")
print(f"Recall:     {recall_score(y_te, y_pred):.4f}")
print(f"F1-Score:   {f1_score(y_te, y_pred):.4f}")
print(f"F2-Score:   {fbeta_score(y_te, y_pred, beta=2):.4f} (weights recall more)")
print(f"F0.5-Score: {fbeta_score(y_te, y_pred, beta=0.5):.4f} (weights precision more)")

# ── Precision-Recall Curve ────────────────────────────
precision, recall, thresholds = precision_recall_curve(y_te, y_prob)
avg_prec = average_precision_score(y_te, y_prob)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(recall, precision, lw=2, label=f'AP = {avg_prec:.3f}')
axes[0].set_xlabel('Recall'); axes[0].set_ylabel('Precision')
axes[0].set_title('Precision-Recall Curve\n(Area = Average Precision)')
axes[0].legend()

# ── Threshold vs Precision/Recall ─────────────────────
axes[1].plot(thresholds, precision[:-1], label='Precision')
axes[1].plot(thresholds, recall[:-1], label='Recall')
axes[1].set_xlabel('Threshold')
axes[1].set_title('Precision vs Recall at different thresholds\n(Move threshold → tradeoff changes)')
axes[1].legend()
plt.tight_layout(); plt.show()

# Per-class report (essential for multiclass)
print("\nClassification Report:")
print(classification_report(y_te, y_pred, target_names=data.target_names))
```
- Accuracy is misleading on imbalanced data — use F1, Precision, Recall
- Precision = how trustworthy your positive predictions are
- Recall = how complete your positive detections are
- F-beta: β>1 weights recall; β<1 weights precision — choose based on business cost
Imbalanced Dataset Handling
An imbalanced dataset has a significant difference in class frequencies (e.g., 98% class 0, 2% class 1). Models trained on such data develop a bias toward the majority class and effectively ignore the minority class — which is usually the one we care most about (fraud, disease, defects).
| Strategy | How It Works | Pros / Cons |
|---|---|---|
| Class Weights | Penalize misclassifying minority class more | Simple, no data change; always try first |
| Oversampling (SMOTE) | Synthetically generate new minority samples | More data; risk of overfitting |
| Undersampling | Randomly remove majority samples | Fast; loses real information |
| SMOTE + Undersampling | Combine both | Balanced; best of both |
| Threshold tuning | Lower decision threshold for minority | No data change; tune for business metric |
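The class-weight formula sklearn uses can be checked directly with `compute_class_weight` (the 95/5 split below mirrors the example in the table, chosen for illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 95 + [1] * 5)  # 95% majority, 5% minority
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
# w_i = n_samples / (n_classes * n_i) → [100/(2*95), 100/(2*5)]
print(dict(zip([0, 1], weights.round(3))))
```

Misclassifying a minority sample now costs about 19× more than a majority sample, which is exactly how `class_weight='balanced'` counteracts the imbalance.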
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# ── Create heavily imbalanced dataset ─────────────────
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05],  # 95% class 0, 5% class 1
                           random_state=42)
print("Class distribution:", np.bincount(y))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr); X_te_s = sc.transform(X_te)

# ── 1. No correction (baseline) ───────────────────────
lr_base = LogisticRegression().fit(X_tr_s, y_tr)
print("\n--- Baseline (no correction) ---")
print(classification_report(y_te, lr_base.predict(X_te_s)))

# ── 2. class_weight='balanced' ────────────────────────
# sklearn auto-computes weights: w_i = n_samples / (n_classes * n_i)
lr_bal = LogisticRegression(class_weight='balanced').fit(X_tr_s, y_tr)
print("--- class_weight='balanced' ---")
print(classification_report(y_te, lr_bal.predict(X_te_s)))

# ── 3. SMOTE Oversampling (requires imbalanced-learn) ─
try:
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline as ImbPipeline
    smote_pipe = ImbPipeline([
        ('smote', SMOTE(random_state=42)),
        ('lr', LogisticRegression())
    ])
    smote_pipe.fit(X_tr_s, y_tr)
    print("--- SMOTE + Logistic Regression ---")
    print(classification_report(y_te, smote_pipe.predict(X_te_s)))
except ImportError:
    print("Install: pip install imbalanced-learn")

# ── 4. Threshold tuning ───────────────────────────────
prob = lr_bal.predict_proba(X_te_s)[:, 1]
for t in [0.3, 0.4, 0.5]:
    pred_t = (prob >= t).astype(int)
    f1 = f1_score(y_te, pred_t)
    print(f"Threshold {t}: F1={f1:.4f}")
```
- Applying SMOTE before train/test split — synthetic samples from test data leak into training
- Using accuracy as the metric — use F1, ROC-AUC, or G-Mean for imbalanced data
- Not trying `class_weight='balanced'` first — it's free and often sufficient
Phase 8 — Probabilistic Models
Naive Bayes — Theory
Naive Bayes is a probabilistic classifier based on Bayes' Theorem with a "naive" assumption that all features are conditionally independent given the class. Despite this simplification, it works remarkably well — especially for text classification.
Naive Independence Assumption: P(X|y) = P(x₁|y) × P(x₂|y) × ... × P(xₙ|y)
Decision: ŷ = argmax_k P(y=k) × Π P(xᵢ|y=k)
Email Spam Example: P(spam | "free", "money", "click") ∝ P(spam) × P("free"|spam) × P("money"|spam) × P("click"|spam). We multiply individual word probabilities. We call it "naive" because words in a real email are NOT truly independent — but this assumption still works very well in practice.
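The spam example can be computed by hand. The priors and per-word likelihoods below are invented for illustration (in practice they would be estimated from a training corpus):

```python
# Hypothetical priors and word likelihoods — illustrative numbers only
p_spam, p_ham = 0.4, 0.6
like_spam = {'free': 0.30, 'money': 0.20, 'click': 0.25}
like_ham  = {'free': 0.02, 'money': 0.05, 'click': 0.04}

words = ['free', 'money', 'click']
score_spam, score_ham = p_spam, p_ham
for w in words:
    # "Naive" step: multiply likelihoods as if words were independent
    score_spam *= like_spam[w]
    score_ham *= like_ham[w]

# Normalize the two unnormalized scores to get the posterior
p_posterior = score_spam / (score_spam + score_ham)
print(f"P(spam | 'free','money','click') = {p_posterior:.4f}")
```

Even though each individual likelihood is small, the product for spam dwarfs the product for ham, so the posterior lands decisively on spam — the argmax rule above in miniature.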
| Variant | Feature Distribution Assumption | Best For |
|---|---|---|
| GaussianNB | Features follow Gaussian (normal) distribution | Continuous features, numerical data |
| MultinomialNB | Features are counts or frequencies | Text classification (word counts, TF-IDF) |
| BernoulliNB | Features are binary (0/1) | Text (word presence/absence) |
| ComplementNB | Multinomial variant using complement-class statistics | Imbalanced text classification |
Naive Bayes — Practical
```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# ══ PART 1: GaussianNB (continuous features) ══════════
iris = load_iris()
X, y = iris.data, iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

gnb = GaussianNB()
gnb.fit(X_tr, y_tr)
print("GaussianNB on Iris:")
print(f"  Accuracy: {accuracy_score(y_te, gnb.predict(X_te)):.4f}")
print(f"  CV Score: {cross_val_score(gnb, X, y, cv=5).mean():.4f}")
# Learned class priors (relative frequency of each class)
print(f"  Class priors: {gnb.class_prior_.round(3)}")
print(f"  Class means (theta):\n{gnb.theta_.round(2)}")

# ══ PART 2: MultinomialNB (text classification) ═══════
print("\n--- Text Classification with MultinomialNB ---")
texts = [
    "free money win prize",
    "win big cash now free",
    "click free offer today",
    "meeting tomorrow project update",
    "report deadline next week",
    "team lunch calendar invite",
    "schedule call project review",
    "free gift card win money",
    "urgent response required now",
]
labels = [1, 1, 1, 0, 0, 0, 0, 1, 1]  # 1=spam, 0=ham

pipe_mnb = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('model', MultinomialNB(alpha=1.0))  # alpha = Laplace smoothing
])
pipe_mnb.fit(texts, labels)

new_emails = ["free offer click win", "project deadline report"]
predictions = pipe_mnb.predict(new_emails)
probas = pipe_mnb.predict_proba(new_emails)
for email, pred, prob in zip(new_emails, predictions, probas):
    label = "SPAM" if pred == 1 else "HAM"
    print(f"  '{email}' → {label} (P(spam)={prob[1]:.3f})")

# ── Laplace Smoothing explanation ─────────────────────
# alpha=1.0 adds 1 to all counts — prevents P(word|class)=0
# for words not seen in training data (zero-probability problem)
print("\nalpha=0: no smoothing (risk of zero prob)")
print("alpha=1: Laplace smoothing (standard)")
print("alpha>1: more smoothing, more uniform distribution")
```
- Naive Bayes is extremely fast, scales to huge datasets, great for text
- GaussianNB for continuous features; MultinomialNB for word counts
- Laplace smoothing (alpha) prevents zero probabilities for unseen features
- Despite naive assumption, often competitive with complex models for text
Phase 9 — Advanced Models
Decision Tree — Classification (Theory)
A Decision Tree is a flowchart-like model that makes predictions by recursively splitting the feature space based on threshold conditions. Each internal node tests a feature, each branch represents an outcome, and each leaf node predicts a class.
Think of 20 questions: "Is the animal a mammal? → Yes → Does it live in water? → No → Does it have stripes? → Yes → Tiger!" A decision tree asks the most informative questions first, each question splitting the data into increasingly pure groups.
| Criterion | Formula | Used in | Intuition |
|---|---|---|---|
| Gini Impurity | 1 − Σ pᵢ² | CART, sklearn default | Probability of misclassifying a random sample. 0 = pure, 0.5 = maximally impure (binary) |
| Entropy (Info Gain) | −Σ pᵢ log₂(pᵢ) | ID3, C4.5 | Average bits needed to encode class labels. 0 = pure, 1 = maximally uncertain (binary) |
| Log Loss | Cross-entropy | sklearn 1.1+ | Better calibrated probability estimates |
Gini at node = 1 − (p_class0² + p_class1² + ...)
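The two impurity formulas in the table, as a minimal sketch over a node's class proportions:

```python
import numpy as np

def gini(p):
    # Gini impurity: 1 − Σ pᵢ²
    p = np.asarray(p)
    return 1 - np.sum(p**2)

def entropy(p):
    # Entropy: −Σ pᵢ log₂(pᵢ), skipping zero-probability classes
    p = np.asarray(p)
    p = p[p > 0]  # avoid log2(0)
    return -np.sum(p * np.log2(p))

print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # pure node → both 0
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # maximally impure (binary)
```

A pure node scores 0 on both criteria; a 50/50 binary node hits the maxima the table states (Gini 0.5, entropy 1.0). The tree chooses the split that reduces these values the most.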
Decision Tree — Practical
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ── 1. Full tree (overfits) ───────────────────────────
dt_full = DecisionTreeClassifier(random_state=42)  # No limits = grows fully
dt_full.fit(X_tr, y_tr)
print(f"Full Tree — depth:{dt_full.get_depth()}, leaves:{dt_full.get_n_leaves()}")
print(f"  Train Acc: {accuracy_score(y_tr, dt_full.predict(X_tr)):.4f}")
print(f"  Test Acc:  {accuracy_score(y_te, dt_full.predict(X_te)):.4f}")

# ── 2. Depth-limited tree (better generalization) ─────
dt_lim = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_lim.fit(X_tr, y_tr)
print("\nLimited Tree (max_depth=3):")
print(f"  Train Acc: {accuracy_score(y_tr, dt_lim.predict(X_tr)):.4f}")
print(f"  Test Acc:  {accuracy_score(y_te, dt_lim.predict(X_te)):.4f}")

# ── 3. Visualize tree ─────────────────────────────────
fig, ax = plt.subplots(figsize=(16, 6))
plot_tree(dt_lim, feature_names=iris.feature_names, class_names=iris.target_names,
          filled=True, rounded=True, ax=ax, fontsize=9)
plt.title('Decision Tree (max_depth=3)\nColor intensity = class purity')
plt.tight_layout(); plt.show()

# ── 4. Text rules (human-readable) ────────────────────
rules = export_text(dt_lim, feature_names=iris.feature_names)
print("\nTree Rules (human-readable):")
print(rules)

# ── 5. Feature importance ─────────────────────────────
imp = pd.DataFrame({
    'Feature': iris.feature_names,
    'Importance': dt_lim.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nFeature Importances (Gini-based):")
print(imp)
# Importance = total Gini impurity reduction from splits on this feature

# ── 6. Depth vs accuracy tradeoff ─────────────────────
depths = range(1, 15)
train_accs, test_accs = [], []
for d in depths:
    dt = DecisionTreeClassifier(max_depth=d, random_state=42).fit(X_tr, y_tr)
    train_accs.append(accuracy_score(y_tr, dt.predict(X_tr)))
    test_accs.append(accuracy_score(y_te, dt.predict(X_te)))

plt.figure(figsize=(8, 4))
plt.plot(depths, train_accs, 'b-o', label='Train')
plt.plot(depths, test_accs, 'r-o', label='Test')
plt.axvline(x=3, color='g', linestyle='--', label='Sweet spot')
plt.xlabel('Max Depth'); plt.ylabel('Accuracy')
plt.title('Depth vs Accuracy: Overfitting Curve')
plt.legend(); plt.tight_layout(); plt.show()
```
Pre-Pruning & Post-Pruning
Pruning controls decision tree complexity to prevent overfitting. Pre-pruning stops growth early using constraints. Post-pruning grows the full tree then removes unnecessary branches using a complexity penalty (cost-complexity pruning).
| Parameter | Effect | Typical Value |
|---|---|---|
| max_depth | Maximum tree depth | 3–10 |
| min_samples_split | Min samples required to split a node | 2–20 |
| min_samples_leaf | Min samples required at leaf | 1–10 |
| max_features | Max features considered per split | "sqrt" for classification |
| max_leaf_nodes | Max number of leaf nodes | None or 10–100 |
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# ── PRE-PRUNING ───────────────────────────────────────
dt_pre = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,   # need ≥10 samples to even try splitting
    min_samples_leaf=5,     # each leaf must have ≥5 samples
    max_leaf_nodes=20,      # cap total leaves
    random_state=42
)
dt_pre.fit(X_tr, y_tr)
print(f"Pre-pruned: depth={dt_pre.get_depth()}, test_acc={accuracy_score(y_te, dt_pre.predict(X_te)):.4f}")

# ── POST-PRUNING: Cost-Complexity Pruning (CCP) ────────
# ccp_alpha = complexity parameter.
# Higher alpha → more aggressive pruning → simpler tree
# Find optimal alpha using the effective alphas of the full tree
dt_full = DecisionTreeClassifier(random_state=42)
path = dt_full.cost_complexity_pruning_path(X_tr, y_tr)
ccp_alphas = path.ccp_alphas[:-1]  # Exclude last (trivial single-root tree)

train_scores, test_scores, n_leaves = [], [], []
for alpha in ccp_alphas:
    dt = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_tr, y_tr)
    train_scores.append(accuracy_score(y_tr, dt.predict(X_tr)))
    test_scores.append(accuracy_score(y_te, dt.predict(X_te)))
    n_leaves.append(dt.get_n_leaves())

best_idx = np.argmax(test_scores)
best_alpha = ccp_alphas[best_idx]
print(f"\nBest CCP alpha: {best_alpha:.5f} → Test acc: {test_scores[best_idx]:.4f}")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
ax1.plot(ccp_alphas, train_scores, 'b-', label='Train')
ax1.plot(ccp_alphas, test_scores, 'r-', label='Test')
ax1.axvline(x=best_alpha, color='g', linestyle='--', label=f'Best α={best_alpha:.5f}')
ax1.set_xlabel('ccp_alpha'); ax1.set_title('Post-Pruning: Accuracy vs Alpha'); ax1.legend()
ax2.plot(ccp_alphas, n_leaves, 'purple')
ax2.set_xlabel('ccp_alpha'); ax2.set_ylabel('Number of Leaves')
ax2.set_title('More alpha → Simpler tree')
plt.tight_layout(); plt.show()
Decision Tree — Regression
Decision Tree Regression partitions the feature space into rectangular regions and predicts the mean of training samples in each region. Instead of Gini/Entropy, it minimizes MSE (mean squared error) at each split.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 150)).reshape(-1, 1)
y = np.sin(X.ravel()) + np.random.normal(0, 0.2, 150)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

fig, axes = plt.subplots(1, 3, figsize=(16, 4))
X_plot = np.linspace(0, 10, 500).reshape(-1, 1)
for i, depth in enumerate([2, 5, 20]):
    dtr = DecisionTreeRegressor(max_depth=depth, random_state=42)
    dtr.fit(X_tr, y_tr)
    y_plot = dtr.predict(X_plot)
    rmse = np.sqrt(mean_squared_error(y_te, dtr.predict(X_te)))
    r2 = r2_score(y_te, dtr.predict(X_te))
    axes[i].scatter(X_tr, y_tr, alpha=0.4, s=15, label='Train')
    axes[i].scatter(X_te, y_te, alpha=0.4, s=15, color='orange', label='Test')
    axes[i].plot(X_plot, y_plot, 'r-', lw=2, label='Prediction')
    axes[i].set_title(f"depth={depth}\nRMSE={rmse:.3f} R²={r2:.3f}")
    axes[i].legend(fontsize=8)
# depth=2: underfits (blocky step function)
# depth=5: good fit
# depth=20: overfits (jagged, memorizes noise)
plt.suptitle('Decision Tree Regression: Step-Function Predictions')
plt.tight_layout(); plt.show()

# ── Key insight: DT predictions are step functions ─────
# Each region gets the MEAN of training samples in it
# This is why DTs can't extrapolate beyond training range!
- Decision Trees cannot extrapolate — for any input beyond the training range, the prediction is just the mean of the leaf that input falls into, i.e. a flat line. Use linear models or neural nets when extrapolation matters.
- Uncontrolled depth leads to extreme overfitting — always set max_depth or apply cost-complexity pruning (ccp_alpha)
K-Nearest Neighbors (KNN) — Classification
KNN is a lazy, non-parametric algorithm — it stores all training data and, to predict a new point, finds the K closest training points and predicts the majority class (classification) or mean value (regression). No explicit training step.
"Tell me who your neighbors are, and I'll tell you who you are." To classify a new patient, find the 5 most similar patients in the medical database and predict the majority diagnosis. The model IS the data.
Distance (Euclidean, the default): d(a,b) = √Σ(aᵢ − bᵢ)²
Distance (Manhattan): d(a,b) = Σ|aᵢ − bᵢ|
Prediction: ŷ = majority_class(k nearest neighbors)
| Property | Details |
|---|---|
| Training cost | O(1) — just stores data |
| Prediction cost | O(n·d) — computes distance to ALL training points |
| Memory | O(n) — stores entire training set |
| Curse of dimensionality | Performance degrades sharply with many features — all points become equidistant |
| Feature scaling | REQUIRED — distance-based, so scale matters critically |
| k too small | Overfitting — noisy, jagged boundary |
| k too large | Underfitting — blurry boundary, ignores local structure |
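The cost claims in the table can be made concrete with a from-scratch sketch. This is a minimal illustration under invented toy data (the `knn_predict` helper is made up for this example, not sklearn's implementation): note there is no training step, and every prediction scans the entire stored dataset.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # "Training" is just storing the data; all work happens at prediction time.
    # O(n·d) per query: distance from x_new to EVERY stored point.
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                        # indices of k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote

# Toy data: two well-separated 2-D classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

print(knn_predict(X, y, np.array([0.1, 0.2])))  # query near class 0's center
print(knn_predict(X, y, np.array([2.9, 3.1])))  # query near class 1's center
```

Because the model "is" the data, memory is O(n) and the only way to speed up queries is a spatial index (sklearn's KD-tree / ball-tree options).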
KNN — Practical
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline

iris = load_iris()
X, y = iris.data, iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ── Find optimal k ────────────────────────────────────
k_range = range(1, 31)
cv_scores = []
for k in k_range:
    pipe = Pipeline([('sc', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=k))])
    scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

best_k = k_range[np.argmax(cv_scores)]
print(f"Best k = {best_k} (CV accuracy = {max(cv_scores):.4f})")

plt.figure(figsize=(10, 4))
plt.plot(k_range, cv_scores, 'b-o', markersize=5)
plt.axvline(x=best_k, color='r', linestyle='--', label=f'Best k={best_k}')
plt.xlabel('k'); plt.ylabel('CV Accuracy')
plt.title('Elbow Method: Finding Optimal k\n(k=1: overfitting, high k: underfitting)')
plt.legend(); plt.tight_layout(); plt.show()

# ── Final model with best k ───────────────────────────
best_pipe = Pipeline([
    ('sc', StandardScaler()),
    ('knn', KNeighborsClassifier(
        n_neighbors=best_k,
        weights='distance',   # closer neighbors vote more
        metric='euclidean'    # try 'manhattan', 'minkowski'
    ))
])
best_pipe.fit(X_tr, y_tr)
print(classification_report(y_te, best_pipe.predict(X_te), target_names=iris.target_names))

# ── Weights comparison ────────────────────────────────
for w in ['uniform', 'distance']:
    p = Pipeline([('sc', StandardScaler()),
                  ('knn', KNeighborsClassifier(n_neighbors=best_k, weights=w))])
    p.fit(X_tr, y_tr)
    acc = p.score(X_te, y_te)
    print(f"weights='{w}': acc={acc:.4f}")
KNN — Regression
KNN Regression predicts the average (or weighted average) of the k nearest neighbors' target values rather than majority vote. Simple and powerful for locally smooth functions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 200)).reshape(-1, 1)
y = np.sin(X.ravel()) + 0.5*np.cos(2*X.ravel()) + np.random.normal(0, 0.15, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
X_plot = np.linspace(0, 10, 500).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(16, 4))
for i, k in enumerate([1, 7, 50]):
    pipe = Pipeline([('sc', StandardScaler()),
                     ('knn', KNeighborsRegressor(n_neighbors=k, weights='distance'))])
    pipe.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, pipe.predict(X_te)))
    r2 = r2_score(y_te, pipe.predict(X_te))
    axes[i].scatter(X, y, alpha=0.3, s=10)
    axes[i].plot(X_plot, pipe.predict(X_plot), 'r-', lw=2)
    axes[i].set_title(f"k={k}\nRMSE={rmse:.3f} R²={r2:.3f}")
plt.suptitle('KNN Regression: k=1 overfits, large k underfits')
plt.tight_layout(); plt.show()
Support Vector Machine (SVM) — Theory
SVM finds the optimal hyperplane that maximizes the margin between classes. The margin is the distance between the hyperplane and the nearest data points from each class (called support vectors). Maximizing margin = maximizing generalization.
You have red and blue dots on a table. SVM finds the widest possible "road" between them. Only the dots closest to the road (support vectors) determine the boundary — all other points are irrelevant. This makes SVM memory-efficient at prediction time and robust: moving any non-support point doesn't change the boundary at all.
Soft Margin: Minimize ½||w||² + C·Σξᵢ (C = regularization)
Kernel trick: K(x,z) = φ(x)·φ(z) — maps to higher-dimensional space
| Kernel | Formula | Use When |
|---|---|---|
| Linear | K(x,z) = x·z | Linearly separable, high-dimensional (text) |
| RBF (Gaussian) | K(x,z) = exp(−γ||x−z||²) | Default; nonlinear data; medium-sized datasets |
| Polynomial | K(x,z) = (x·z + r)^d | When polynomial features expected |
| Sigmoid | K(x,z) = tanh(αx·z + c) | Neural network-like behavior |
| Parameter | Small Value | Large Value |
|---|---|---|
| C (regularization) | Wider margin, more misclassifications allowed (underfitting) | Narrow margin, fewer misclassifications (overfitting) |
| γ (RBF kernel) | Large decision region, smooth boundary (underfitting) | Small region per point, jagged boundary (overfitting) |
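The kernel table above can be verified numerically. The sketch below (a minimal illustration; the `poly_kernel` and `phi` helpers are invented for this demo) shows the kernel trick for the degree-2 polynomial kernel in 2-D: evaluating K(x,z) = (x·z + 1)² directly gives exactly the same number as explicitly mapping both points into the 6-dimensional feature space and taking a dot product.

```python
import numpy as np

def poly_kernel(x, z):
    # Polynomial kernel with r=1, d=2: K(x,z) = (x·z + 1)²
    return (x @ z + 1) ** 2

def phi(x):
    # Explicit feature map for that kernel in 2-D:
    # φ(x) = [1, √2·x₁, √2·x₂, x₁², √2·x₁x₂, x₂²]
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     x1**2, np.sqrt(2)*x1*x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(poly_kernel(x, z))   # (4 + 1)² = 25.0 — computed in 2-D
print(phi(x) @ phi(z))     # 25.0 — same value via the explicit 6-D map
```

The trick is that SVM only ever needs dot products, so it gets the expressive power of the high-dimensional space while computing everything in the original low-dimensional one.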
SVM — Practical
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.datasets import load_breast_cancer, make_moons

# ── Part 1: Kernel comparison on moons dataset ────────
X_m, y_m = make_moons(n_samples=300, noise=0.2, random_state=42)
fig, axes = plt.subplots(1, 3, figsize=(16, 4))
for i, (kernel, C) in enumerate([('linear', 1), ('rbf', 1), ('poly', 1)]):
    pipe = Pipeline([('sc', StandardScaler()), ('svc', SVC(kernel=kernel, C=C))])
    pipe.fit(X_m, y_m)
    h = 0.02
    x_min, x_max = X_m[:, 0].min()-0.5, X_m[:, 0].max()+0.5
    y_min, y_max = X_m[:, 1].min()-0.5, X_m[:, 1].max()+0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = pipe.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    axes[i].contourf(xx, yy, Z, alpha=0.3)
    axes[i].scatter(X_m[:, 0], X_m[:, 1], c=y_m, s=20, edgecolors='k')
    axes[i].set_title(f'Kernel: {kernel}\nAcc={pipe.score(X_m, y_m):.3f}')
plt.tight_layout(); plt.show()

# ── Part 2: Full pipeline on real data + GridSearch ───
data = load_breast_cancer()
X, y = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

pipe_svm = Pipeline([('sc', StandardScaler()), ('svc', SVC(kernel='rbf', probability=True))])

# GridSearch for C and gamma
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 'auto', 0.01]}
gs = GridSearchCV(pipe_svm, param_grid, cv=5, scoring='accuracy')
gs.fit(X_tr, y_tr)
print(f"Best params: {gs.best_params_}")
print(f"Best CV accuracy: {gs.best_score_:.4f}")
print(f"Test accuracy: {gs.score(X_te, y_te):.4f}")
print(classification_report(y_te, gs.predict(X_te), target_names=data.target_names))
SVM — Regression (SVR)
Support Vector Regression (SVR) fits a tube of width 2ε around the regression line. Points inside the tube incur no penalty. Only points outside the ε-tube contribute to the loss. This makes SVR robust to outliers.
Subject to: |yᵢ − (w·xᵢ + b)| ≤ ε + ξᵢ
ε = tube width; ξ = slack variables for points outside tube
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 150)).reshape(-1, 1)
y = np.sin(X.ravel()) + np.random.normal(0, 0.2, 150)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
X_plot = np.linspace(0, 10, 500).reshape(-1, 1)

pipe_svr = Pipeline([
    ('sc', StandardScaler()),
    ('svr', SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1))
])
pipe_svr.fit(X_tr, y_tr)
y_pred = pipe_svr.predict(X_te)
rmse = np.sqrt(mean_squared_error(y_te, y_pred))
r2 = r2_score(y_te, y_pred)
print(f"SVR — RMSE: {rmse:.4f}, R²: {r2:.4f}")

plt.figure(figsize=(10, 4))
plt.scatter(X_tr, y_tr, c='steelblue', alpha=0.5, s=20, label='Train')
plt.scatter(X_te, y_te, c='orange', alpha=0.5, s=20, label='Test')
y_plot = pipe_svr.predict(X_plot)
plt.plot(X_plot, y_plot, 'r-', lw=2, label='SVR (RBF)')
plt.fill_between(X_plot.ravel(), y_plot-0.1, y_plot+0.1,
                 alpha=0.2, color='red', label='ε-tube (ε=0.1)')
plt.title(f'SVR RBF — RMSE={rmse:.3f}, R²={r2:.3f}')
plt.legend(); plt.tight_layout(); plt.show()
- SVM finds the maximum-margin hyperplane — defined only by support vectors
- Kernel trick allows nonlinear boundaries without explicitly transforming data
- C controls margin width / regularization; γ controls RBF kernel "reach"
- Always scale features before SVM — it's distance-based
⚡ Project 2 — After Classification
Real-World Project: Customer Churn Predictor
Goal: Predict which telecom customers will leave. Compare Logistic Regression, SVM, Decision Tree, KNN — all in pipelines. Use confusion matrix, F1, ROC-AUC to choose the best model.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score, roc_auc_score, classification_report

# Simulate churn data (imbalanced: 80/20)
X, y = make_classification(n_samples=2000, n_features=12, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    'Logistic Reg': Pipeline([('sc', StandardScaler()),
                              ('m', LogisticRegression(class_weight='balanced', max_iter=1000))]),
    'Decision Tree': Pipeline([('m', DecisionTreeClassifier(max_depth=5, class_weight='balanced',
                                                            random_state=42))]),
    'KNN': Pipeline([('sc', StandardScaler()), ('m', KNeighborsClassifier(n_neighbors=7))]),
    'SVM': Pipeline([('sc', StandardScaler()), ('m', SVC(class_weight='balanced', probability=True))]),
    'Naive Bayes': Pipeline([('sc', StandardScaler()), ('m', GaussianNB())]),
}

results = []
print(f"{'Model':18} {'F1':>8} {'ROC-AUC':>10} {'CV-F1':>10}")
print("-"*50)
for name, pipe in models.items():
    pipe.fit(X_tr, y_tr)
    yp = pipe.predict(X_te)
    yprb = pipe.predict_proba(X_te)[:, 1]
    f1 = f1_score(y_te, yp)
    auc = roc_auc_score(y_te, yprb)
    cvf1 = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring='f1').mean()
    print(f"{name:18} {f1:8.4f} {auc:10.4f} {cvf1:10.4f}")
    results.append({'model': name, 'f1': f1, 'auc': auc})

# Best model report
best = max(results, key=lambda x: x['auc'])
print(f"\n🏆 Best model by ROC-AUC: {best['model']} (AUC={best['auc']:.4f})")
Phase 10 — Model Tuning
Model Parameters vs Hyperparameters
Parameters are values the model learns from training data (e.g., linear regression coefficients β). Hyperparameters are configuration settings set BEFORE training that control the learning process (e.g., max_depth, C, k). You tune hyperparameters; the model learns parameters.
| Algorithm | Parameters (learned) | Hyperparameters (set by you) |
|---|---|---|
| Linear Regression | β₀, β₁, ..., βₙ (coefficients) | fit_intercept, positive |
| Ridge/Lasso | β (coefficients) | alpha (λ), max_iter |
| Logistic Regression | w (weights), b (bias) | C, solver, max_iter |
| Decision Tree | Split thresholds, leaf values | max_depth, min_samples_split, criterion |
| SVM | w, b, support vectors | C, γ, kernel |
| KNN | None (lazy learner) | k, weights, metric |
| Naive Bayes | Class priors, feature likelihoods | alpha (smoothing), var_smoothing |
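The distinction in the table is visible directly in scikit-learn's API. A minimal sketch with Ridge (synthetic data from make_regression; any estimator would do): the hyperparameter exists before fit, the parameters only exist after it.

```python
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=0)

# Hyperparameter: set by YOU, before training
model = Ridge(alpha=1.0)
print(model.get_params()['alpha'])   # 1.0 — a configuration choice, not learned

# Parameters: learned FROM the data during fit()
model.fit(X, y)
print(model.coef_)        # β₁..βₙ coefficients (learned)
print(model.intercept_)   # β₀ (learned)
```

A useful convention to remember: in sklearn, anything learned during `fit` ends with a trailing underscore (`coef_`, `intercept_`, `feature_importances_`); anything passed to the constructor is a hyperparameter.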
GridSearchCV & RandomizedSearchCV
GridSearchCV exhaustively tests every combination of hyperparameter values. RandomizedSearchCV randomly samples a fixed number of combinations. Both use cross-validation to evaluate each combination on training data without touching the test set.
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
from sklearn.datasets import load_breast_cancer
from scipy.stats import randint

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, test_size=0.2,
                                          stratify=data.target, random_state=42)

# ── GridSearchCV: Exhaustive ───────────────────────────
# 3 × 3 × 2 = 18 combinations × 5 folds = 90 model fits
pipe = Pipeline([('sc', StandardScaler()), ('svc', SVC(probability=True))])
param_grid = {
    'svc__C': [0.1, 1.0, 10.0],
    'svc__gamma': ['scale', 'auto', 0.01],
    'svc__kernel': ['rbf', 'linear']
}
gs = GridSearchCV(pipe, param_grid, cv=5, scoring='f1',
                  n_jobs=-1, verbose=1)  # n_jobs=-1: use all CPU cores
gs.fit(X_tr, y_tr)
print(f"GridSearch Best params: {gs.best_params_}")
print(f"GridSearch Best CV F1: {gs.best_score_:.4f}")
print(f"GridSearch Test F1: {f1_score(y_te, gs.predict(X_te)):.4f}")

# ── RandomizedSearchCV: Faster for large search spaces ─
# Instead of testing all combos, randomly sample n_iter combinations
param_dist = {
    'max_depth': randint(2, 20),          # Random int in [2, 20)
    'min_samples_split': randint(2, 30),
    'min_samples_leaf': randint(1, 15),
    'criterion': ['gini', 'entropy'],
}
rs = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,        # Test 50 random combinations
    cv=5, scoring='f1', n_jobs=-1, random_state=42
)
rs.fit(X_tr, y_tr)
print(f"\nRandomizedSearch Best params: {rs.best_params_}")
print(f"RandomizedSearch Best CV F1: {rs.best_score_:.4f}")

# ── Results DataFrame ─────────────────────────────────
cv_results = pd.DataFrame(rs.cv_results_)
print("\nTop 5 configurations:")
print(cv_results.nlargest(5, 'mean_test_score')[['mean_test_score', 'std_test_score', 'params']])
Cross Validation — Theory
Cross Validation (CV) is a resampling technique for evaluating model performance and tuning hyperparameters. It splits data into k folds, trains on k-1 folds and validates on the remaining fold, rotating k times to use every sample as validation exactly once.
| Technique | How it works | Best for |
|---|---|---|
| K-Fold CV | Split into k equal folds, rotate validation | Standard choice; large datasets |
| Stratified K-Fold | K-Fold preserving class distribution | Classification; imbalanced data |
| Leave-One-Out (LOO) | k = n (each sample is a fold) | Very small datasets; expensive |
| Time Series Split | Respects temporal order — no future leakage | Time series data |
| Repeated K-Fold | Run K-Fold r times with different splits | More stable estimate on small data |
k=5 or k=10 is standard. Larger k = less bias, more variance in estimate.
A single train/test split is highly dependent on the random split. You might get lucky (test set is easy) or unlucky (test set is hard). CV averages over k different splits, giving a much more reliable performance estimate. The std of CV scores also tells you model stability.
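The table's TimeSeriesSplit row deserves a concrete look, since ordinary K-Fold would leak future information into training. A minimal sketch on ten samples in time order (the toy array is invented for this demo): each fold trains only on indices that come before its test window.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 samples, already in temporal order
tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # The training window always ends where the test window begins — no future leakage
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
# Fold 0: train=[0, 1, 2, 3] test=[4, 5]
# Fold 1: train=[0, 1, 2, 3, 4, 5] test=[6, 7]
# Fold 2: train=[0, 1, 2, 3, 4, 5, 6, 7] test=[8, 9]
```

Note the expanding training window: later folds see more history, which mirrors how a deployed forecasting model is periodically refit on all data to date.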
Cross Validation — Practical
import numpy as np
from sklearn.model_selection import (KFold, StratifiedKFold, cross_val_score,
                                     cross_validate, RepeatedStratifiedKFold,
                                     GridSearchCV)
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

iris = load_iris()
X, y = iris.data, iris.target
pipe = Pipeline([('sc', StandardScaler()),
                 ('dt', DecisionTreeClassifier(max_depth=4, random_state=42))])

# ── 1. Basic K-Fold ───────────────────────────────────
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kf = cross_val_score(pipe, X, y, cv=kf, scoring='accuracy')
print(f"K-Fold(5)     : {scores_kf.mean():.4f} ± {scores_kf.std():.4f}")

# ── 2. Stratified K-Fold (for classification) ─────────
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_skf = cross_val_score(pipe, X, y, cv=skf, scoring='accuracy')
print(f"Stratified(5) : {scores_skf.mean():.4f} ± {scores_skf.std():.4f}")

# ── 3. Get multiple metrics at once ───────────────────
cv_results = cross_validate(pipe, X, y, cv=5, scoring=['accuracy', 'f1_macro'],
                            return_train_score=True)
print(f"\ncross_validate:")
print(f"  Train acc: {cv_results['train_accuracy'].mean():.4f}")
print(f"  Test acc:  {cv_results['test_accuracy'].mean():.4f}")
print(f"  Test F1:   {cv_results['test_f1_macro'].mean():.4f}")
# If train >> test → overfitting signal!

# ── 4. Repeated Stratified K-Fold ─────────────────────
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores_r = cross_val_score(pipe, X, y, cv=rskf)
print(f"\nRepeated K-Fold (5×3): {scores_r.mean():.4f} ± {scores_r.std():.4f}")

# ── 5. Nested CV: unbiased estimate with tuning ───────
# Outer CV evaluates the model; inner CV tunes hyperparameters
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
dt = DecisionTreeClassifier(random_state=42)
gs = GridSearchCV(dt, {'max_depth': [2, 3, 5, 7]}, cv=inner_cv, scoring='accuracy')
nested_scores = cross_val_score(gs, X, y, cv=outer_cv, scoring='accuracy')
print(f"\nNested CV (unbiased estimate): {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
- CV gives reliable performance estimates by averaging over multiple splits
- Always use StratifiedKFold for classification to preserve class ratios
- cross_validate returns both train and test scores — compare them to detect overfitting
- Nested CV is the gold standard: outer CV evaluates, inner CV tunes hyperparameters
Phase 11 — Unsupervised Learning
Unsupervised Learning Overview
Unsupervised learning discovers hidden patterns in data without labeled outputs. The model must find structure — groups, outliers, or latent representations — entirely from the input data itself.
| Category | Goal | Algorithms | Example |
|---|---|---|---|
| Clustering | Group similar data points | K-Means, DBSCAN, Hierarchical | Customer segmentation |
| Dimensionality Reduction | Compress features while retaining structure | PCA, t-SNE, UMAP | Visualize high-dim data |
| Anomaly Detection | Identify outliers/unusual samples | Isolation Forest, LOF | Fraud detection |
| Association Rules | Find co-occurring patterns | Apriori, FP-Growth | Market basket analysis |
Evaluating unsupervised models is hard — there's no correct answer. Internal metrics (Silhouette Score, Davies-Bouldin) measure cluster quality without labels. External metrics (Adjusted Rand Index) work if you do have ground truth labels for validation.
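Both metric families can be shown side by side. A minimal sketch on synthetic blobs (make_blobs is used here precisely because it hands back ground-truth labels, which real unsupervised problems rarely have): silhouette needs only the data and the cluster assignments, while Adjusted Rand Index compares assignments against the true labels.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
X_sc = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_sc)

# Internal metric: no ground truth needed — usable on any clustering
print(f"Silhouette:          {silhouette_score(X_sc, labels):.3f}")
# External metric: needs true labels — only for validation/benchmarking
print(f"Adjusted Rand Index: {adjusted_rand_score(y_true, labels):.3f}")
```

ARI is corrected for chance (random labelings score ≈ 0), which is why it's preferred over raw accuracy when cluster IDs don't line up with label IDs.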
K-Means Clustering — Theory
K-Means partitions n samples into k clusters by iteratively assigning each point to the nearest centroid, then recomputing centroids as cluster means. It minimizes the total within-cluster sum of squared distances (inertia).
1. Randomly initialize k centroids
2. Assign each point to nearest centroid (by Euclidean distance)
3. Recompute each centroid as mean of its assigned points
4. Repeat steps 2–3 until convergence (centroids don't move or max_iter reached)
5. Result: k cluster assignments + k centroid positions
μᵢ = (1/|Cᵢ|) Σₓ∈Cᵢ x (centroid = cluster mean)
Objective: minimize inertia J = Σᵢ Σₓ∈Cᵢ ||x − μᵢ||²
| Limitation | Problem | Workaround |
|---|---|---|
| Must specify k | k is unknown in practice | Elbow method, Silhouette analysis |
| Sensitive to initialization | Different runs → different results | K-Means++ init (default in sklearn) |
| Assumes spherical clusters | Fails on elongated/ring shapes | DBSCAN, GMM for arbitrary shapes |
| Sensitive to outliers | Outliers pull centroids | K-Medoids (uses median), remove outliers first |
| Scale-dependent | Large-scale features dominate | Always scale features before K-Means |
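Steps 1–4 of the algorithm fit in a few lines of NumPy. This is a from-scratch sketch for intuition (plain random initialization on invented toy data, not sklearn's k-means++ implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=42):
    # Minimal K-Means: random init (step 1), then alternate assign/update
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]       # 1. init from data points
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)                             # 2. assign to nearest centroid
        new = np.array([X[labels == j].mean(axis=0)
                        for j in range(k)])                   # 3. recompute centroids
        if np.allclose(new, centroids):                       # 4. stop when centroids freeze
            break
        centroids = new
    return labels, centroids

# Two obvious clusters around (0,0) and (5,5)
X = np.vstack([np.random.default_rng(0).normal(0, 0.3, (50, 2)),
               np.random.default_rng(1).normal(5, 0.3, (50, 2))])
labels, cents = kmeans(X, k=2)
print(np.sort(cents[:, 0]).round(1))  # centroid x-coords should land near 0 and 5
```

A real implementation also reruns with several initializations (sklearn's `n_init`) and guards against empty clusters; this sketch omits both for clarity.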
K-Means — Practical
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# ── Generate clusterable data ─────────────────────────
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# ── Elbow Method: Find optimal k ──────────────────────
inertias, sil_scores = [], []
k_range = range(2, 11)
for k in k_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_scaled, labels))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(k_range, inertias, 'b-o')
ax1.set_xlabel('k'); ax1.set_ylabel('Inertia (WCSS)')
ax1.set_title('Elbow Method\n(Look for the "elbow" bend)')
ax2.plot(k_range, sil_scores, 'r-o')
ax2.set_xlabel('k'); ax2.set_ylabel('Silhouette Score')
ax2.set_title('Silhouette Score\n(Higher = better separated clusters)')
plt.tight_layout(); plt.show()

best_k = k_range[np.argmax(sil_scores)]
print(f"Optimal k by Silhouette: {best_k}")

# ── Final K-Means ─────────────────────────────────────
km_final = KMeans(n_clusters=best_k, init='k-means++', n_init=10, random_state=42)
labels = km_final.fit_predict(X_scaled)

plt.figure(figsize=(8, 5))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='tab10', s=20, alpha=0.7)
plt.scatter(km_final.cluster_centers_[:, 0], km_final.cluster_centers_[:, 1],
            c='red', s=300, marker='*', label='Centroids')
plt.title(f'K-Means with k={best_k}\nSilhouette={silhouette_score(X_scaled, labels):.3f}')
plt.legend(); plt.tight_layout(); plt.show()

# ── Cluster analysis: describe each cluster ───────────
df = pd.DataFrame(X, columns=['feature1', 'feature2'])
df['cluster'] = labels
print("\nCluster statistics:")
print(df.groupby('cluster').agg(['mean', 'std', 'count']))
Hierarchical Clustering — Theory
Hierarchical clustering builds a dendrogram (tree) of nested clusters. Unlike K-Means, you don't specify k upfront — you can cut the tree at any level to get different numbers of clusters. Two approaches: Agglomerative (bottom-up: start with n clusters, merge) and Divisive (top-down: start with 1 cluster, split).
| Linkage | Distance between clusters | Creates |
|---|---|---|
| Single (min) | Minimum pairwise distance | Elongated, chaining clusters |
| Complete (max) | Maximum pairwise distance | Compact, equal-sized clusters |
| Average | Average of all pairwise distances | Balance between single/complete |
| Ward | Minimizes total within-cluster variance (default) | Compact, roughly equal clusters; usually best |
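The "cut the tree at any level" idea can be shown with SciPy's `fcluster`: one linkage computation yields every possible k. A minimal sketch on synthetic blobs (the data and sizes are invented for this demo):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=42)
Z = linkage(X, method='ward')  # build the full merge tree ONCE

# Cut the same tree at different levels — no refitting needed
for k in [2, 3, 4]:
    labels = fcluster(Z, t=k, criterion='maxclust')  # labels start at 1
    print(f"k={k}: cluster sizes = {np.bincount(labels)[1:].tolist()}")
```

This is the practical advantage over K-Means: exploring several values of k costs one tree construction, not one full refit per k.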
Agglomerative Clustering — Practical
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage

X, _ = make_blobs(n_samples=200, centers=4, random_state=42)
X_sc = StandardScaler().fit_transform(X)

# ── Dendrogram ─────────────────────────────────────────
Z = linkage(X_sc, method='ward')  # Full linkage tree
plt.figure(figsize=(12, 4))
dendrogram(Z, truncate_mode='level', p=5, leaf_rotation=90, leaf_font_size=8)
plt.title('Dendrogram (Ward Linkage)\nCut at any height to get k clusters')
plt.xlabel('Sample Index'); plt.ylabel('Distance')
plt.tight_layout(); plt.show()

# ── Compare linkage methods ───────────────────────────
fig, axes = plt.subplots(1, 4, figsize=(18, 4))
for i, link in enumerate(['single', 'complete', 'average', 'ward']):
    agg = AgglomerativeClustering(n_clusters=4, linkage=link)
    labels = agg.fit_predict(X_sc)
    sil = silhouette_score(X_sc, labels)
    axes[i].scatter(X_sc[:, 0], X_sc[:, 1], c=labels, cmap='tab10', s=20)
    axes[i].set_title(f'Linkage: {link}\nSilhouette={sil:.3f}')
plt.tight_layout(); plt.show()
# Ward usually gives the best silhouette score
DBSCAN — Theory
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) discovers clusters as dense regions separated by sparse regions. It doesn't require specifying k and can find arbitrarily shaped clusters while labeling outliers as noise.
| Term | Definition |
|---|---|
| ε (eps) | Radius of neighborhood around each point |
| min_samples | Minimum points required to form a dense core |
| Core point | Has ≥ min_samples neighbors within radius ε |
| Border point | Within ε of a core point but has fewer than min_samples neighbors |
| Noise point | Neither core nor border — labeled as −1 (outlier) |
DBSCAN — Practical
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons, make_blobs

# ── DBSCAN vs K-Means on moon-shaped data ─────────────
X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
X_sc = StandardScaler().fit_transform(X_moons)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
km = KMeans(n_clusters=2, random_state=42).fit_predict(X_sc)
ax1.scatter(X_sc[:, 0], X_sc[:, 1], c=km, cmap='coolwarm', s=20)
ax1.set_title('K-Means (k=2)\n❌ FAILS on moons')
db = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_sc)
ax2.scatter(X_sc[:, 0], X_sc[:, 1], c=db, cmap='coolwarm', s=20)
ax2.set_title('DBSCAN (eps=0.3)\n✅ Correctly identifies moons')
plt.tight_layout(); plt.show()

# ── How to choose eps: K-distance plot ────────────────
X2, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X2_sc = StandardScaler().fit_transform(X2)
nbrs = NearestNeighbors(n_neighbors=5).fit(X2_sc)
distances, _ = nbrs.kneighbors(X2_sc)
k_dist = np.sort(distances[:, 4])[::-1]  # 5th nearest neighbor distance, sorted

plt.figure(figsize=(8, 4))
plt.plot(k_dist)
plt.axhline(y=0.5, color='r', linestyle='--', label='eps ≈ 0.5 (elbow)')
plt.title('K-Distance Plot (k=5)\nChoose eps at the "elbow" of the curve')
plt.xlabel('Points (sorted)'); plt.ylabel('5th Nearest Neighbor Distance')
plt.legend(); plt.tight_layout(); plt.show()

# ── DBSCAN for anomaly detection ──────────────────────
db_final = DBSCAN(eps=0.5, min_samples=5).fit(X2_sc)
labels = db_final.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Noise points (anomalies): {n_noise}")
print(f"Noise indices: {np.where(labels == -1)[0]}")
Silhouette Score
Silhouette Score measures how similar each point is to its own cluster vs other clusters. It's an internal evaluation metric — no ground truth labels needed. Range: [−1, 1]; higher = better-defined clusters.
a(i) = mean distance from point i to the other points in its own cluster (intra-cluster cohesion)
b(i) = mean distance from point i to the points in the nearest other cluster (inter-cluster separation)
s(i) = (b(i) − a(i)) / max(a(i), b(i))
s = 1: perfect cluster assignment | s = 0: overlapping clusters | s = −1: likely wrong cluster
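To make the formula concrete, here is a hand computation of s(i) on a tiny made-up dataset, checked against sklearn's silhouette_samples (the points and labels are hypothetical, chosen only so the two clusters are obvious):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, pairwise_distances

# Two well-separated toy clusters (hypothetical data)
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

D = pairwise_distances(X)
s_manual = []
for i in range(len(X)):
    same = labels == labels[i]
    same[i] = False                        # exclude the point itself from a(i)
    a = D[i][same].mean()                  # cohesion: mean distance within own cluster
    b = min(D[i][labels == c].mean()       # separation: mean distance to nearest other cluster
            for c in set(labels) if c != labels[i])
    s_manual.append((b - a) / max(a, b))

print(np.allclose(s_manual, silhouette_samples(X, labels)))  # True
```

Every point sits deep inside its own tight cluster, far from the other one, so all six scores come out close to 1.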
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X_sc = StandardScaler().fit_transform(X)

# ── Silhouette plot for different k values ────────────
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()
for idx, k in enumerate([2, 3, 4, 5, 6, 7]):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_sc)
    avg_sil = silhouette_score(X_sc, labels)
    sample_sils = silhouette_samples(X_sc, labels)
    ax = axes[idx]
    y_lower = 10
    for c in range(k):
        sil_c = sorted(sample_sils[labels == c])
        size_c = len(sil_c)
        y_upper = y_lower + size_c
        color = cm.nipy_spectral(float(c) / k)
        ax.fill_betweenx(np.arange(y_lower, y_upper), 0, sil_c, color=color)
        y_lower = y_upper + 10
    ax.axvline(x=avg_sil, color='red', linestyle='--')
    ax.set_title(f'k={k}, avg_sil={avg_sil:.3f}')
    ax.set_xlabel('Silhouette coefficient')
plt.suptitle('Silhouette Analysis: Wide, uniform blades = good clustering')
plt.tight_layout(); plt.show()
# k=4 should show the best uniform, wide silhouette blades
| Algorithm | Needs k? | Cluster Shape | Outliers | Speed |
|---|---|---|---|---|
| K-Means | Yes | Spherical only | No | Fast |
| Hierarchical | No (cut later) | Any | No | O(n² log n) |
| DBSCAN | No | Any arbitrary | Yes (label -1) | O(n log n) |
⚡ Project 3 — After Clustering
Mini Project: Customer Market Segmentation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

np.random.seed(42)
n = 500
df = pd.DataFrame({
    'age': np.random.normal(40, 15, n).clip(18, 80),
    'annual_income': np.random.lognormal(10.5, 0.5, n),
    'spending_score': np.random.normal(50, 25, n).clip(1, 100),
    'purchase_freq': np.random.poisson(5, n),
    'loyalty_years': np.random.exponential(3, n).clip(0, 15),
})

# 1. Scale
sc = StandardScaler()
X_sc = sc.fit_transform(df)

# 2. PCA for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_sc)

# 3. Find optimal k via silhouette
sil_scores = [silhouette_score(X_sc, KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_sc))
              for k in range(2, 8)]
best_k = np.argmax(sil_scores) + 2

# 4. Final segmentation
km = KMeans(n_clusters=best_k, n_init=10, random_state=42)
df['segment'] = km.fit_predict(X_sc)

# 5. Visualize
plt.figure(figsize=(10, 5))
for seg in df['segment'].unique():
    mask = df['segment'] == seg
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=f'Segment {seg}', s=20, alpha=0.7)
plt.title(f'Customer Segments (PCA 2D) — {best_k} segments')
plt.legend(); plt.tight_layout(); plt.show()

# 6. Profile segments (double brackets: select a list of columns)
print("\nSegment Profiles:")
print(df.groupby('segment')[['age', 'annual_income', 'spending_score', 'loyalty_years']].mean().round(1))
Phase 12 — Association Rule Learning
Association Rule Learning
Association Rule Learning discovers co-occurrence patterns in transactional data: "If a customer buys X, they also tend to buy Y." Used for market basket analysis, recommendation systems, and web clickstream analysis.
| Metric | Formula | Meaning | Range |
|---|---|---|---|
| Support | P(X ∪ Y) | How often {X,Y} appears together in all transactions | [0,1] |
| Confidence | P(Y|X) = P(X∪Y)/P(X) | If X is bought, probability that Y is also bought | [0,1] |
| Lift | Confidence / P(Y) | How much more likely Y is given X, vs random. >1 = positive association | [0,∞) |
| Conviction | (1−P(Y))/(1−Confidence) | How much X being present increases certainty of Y | [0,∞) |
Support({beer,diapers}) = 0.30 (30% of transactions contain both)
Confidence(diapers→beer) = 0.70 (70% of diaper buyers also bought beer)
Lift = 0.70 / 0.50 = 1.4 (40% more likely than random)
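These three metrics are simple enough to compute from scratch. The sketch below uses a made-up 10-basket dataset (its numbers differ from the beer/diapers figures above, which come from a different example):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

# Hypothetical mini-dataset: 10 baskets
tx = [
    ['diapers', 'beer'], ['diapers', 'beer'], ['diapers', 'beer', 'milk'],
    ['diapers'], ['diapers', 'milk'],
    ['beer'], ['beer', 'milk'],
    ['milk'], ['bread'], ['bread', 'milk'],
]

sup_both = support({'diapers', 'beer'}, tx)   # 3/10 = 0.30
conf = sup_both / support({'diapers'}, tx)    # 0.30 / 0.50 = 0.60
lift = conf / support({'beer'}, tx)           # 0.60 / 0.50 = 1.2
print(sup_both, conf, lift)
```

Lift of 1.2 means diaper buyers are 20% more likely than a random customer to buy beer in this toy data, a genuine (if weak) positive association.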
Apriori Algorithm — Theory
Apriori generates frequent itemsets (item combinations meeting minimum support) by starting with 1-item sets and iteratively building larger sets. Uses the Apriori principle: any subset of a frequent itemset must also be frequent — this prunes the search space dramatically.
1. Generate all 1-item sets, filter by min_support → frequent 1-items
2. Join frequent k-items to generate (k+1)-item candidates
3. Prune any candidate whose subset is NOT frequent (Apriori pruning)
4. Filter candidates by min_support
5. Repeat until no new frequent itemsets found
6. Generate rules from all frequent itemsets using min_confidence
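The frequent-itemset steps (1–5) can be sketched in a few lines of plain Python. This is a teaching sketch, not the optimized mlxtend implementation used in the practical section:

```python
def apriori_itemsets(transactions, min_support):
    """Teaching sketch of Apriori steps 1-5: returns {frozenset: support}."""
    tx = [frozenset(t) for t in transactions]
    n = len(tx)
    sup = lambda s: sum(s <= t for t in tx) / n

    # Step 1: frequent 1-itemsets
    current = {frozenset([i]) for t in tx for i in t}
    current = {s for s in current if sup(s) >= min_support}
    result = {s: sup(s) for s in current}

    while current:
        # Step 2: join frequent k-itemsets into (k+1)-item candidates
        candidates = {a | b for a in current for b in current
                      if len(a | b) == len(a) + 1}
        # Step 3: Apriori pruning - every k-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(c - {i} in current for i in c)}
        # Steps 4-5: support filter, repeat until no new frequent itemsets
        current = {c for c in candidates if sup(c) >= min_support}
        result.update({c: sup(c) for c in current})
    return result

tx = [['a', 'b', 'c'], ['a', 'b'], ['a', 'c'], ['b', 'c'], ['a', 'b', 'c']]
freq = apriori_itemsets(tx, min_support=0.6)
print(len(freq))   # 6 itemsets: a, b, c, ab, ac, bc (abc has support 0.4, pruned)
```

Note how {a,b,c} is generated as a candidate (all its 2-subsets are frequent) but then fails the support filter, exactly the join-then-filter flow of steps 2–4.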
Apriori — Practical
import pandas as pd
# pip install mlxtend
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# ── Sample grocery transactions ───────────────────────
transactions = [
    ['milk', 'bread', 'butter'],
    ['milk', 'bread'],
    ['milk', 'butter', 'eggs'],
    ['bread', 'butter', 'eggs', 'cheese'],
    ['milk', 'bread', 'butter', 'eggs'],
    ['cheese', 'butter'],
    ['milk', 'eggs', 'bread'],
    ['milk', 'cheese', 'bread'],
    ['butter', 'eggs', 'milk'],
    ['bread', 'milk', 'cheese', 'eggs'],
    ['milk', 'bread', 'butter', 'cheese'],
]

# ── One-hot encode transactions ───────────────────────
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# ── Generate frequent itemsets with min_support=0.3 ───
frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)

# ── Generate rules with min_confidence=0.7 ────────────
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.7)

# ── Display rules sorted by lift ──────────────────────
rules = rules.sort_values('lift', ascending=False)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
FP-Growth Algorithm — Theory
FP-Growth (Frequent Pattern Growth) is a faster alternative to Apriori that avoids repeated database scans. It compresses the entire transaction database into a compact FP-tree (prefix tree), then mines frequent patterns directly from this tree — no candidate generation needed.
Instead of repeatedly reading raw transactions, FP-Growth reads the database exactly twice: once to find frequent single items, once to build a compressed prefix tree. Transactions sharing common prefixes share nodes in the tree — like a trie/prefix tree. Mining patterns then happens entirely in memory on this compact tree, recursively building "conditional pattern bases".
| Property | Apriori | FP-Growth |
|---|---|---|
| Database scans | k scans (one per itemset size) | Exactly 2 scans |
| Candidate generation | Yes — generates then prunes | No — divides problem recursively |
| Memory | Lower (no tree structure) | Higher (FP-tree in RAM) |
| Speed | Slow on large data | 10–100× faster than Apriori |
| Implementation | Simpler to understand | More complex |
| Best for | Small datasets, education | Production, large transactions |
Pass 1: Scan database → count item frequencies → discard items below min_support → sort
frequent items by frequency (descending)
Pass 2: Build FP-tree — insert each transaction (only frequent items, in sorted order) into
prefix tree. Shared prefixes share nodes; each node stores item name + count.
Mining: For each frequent item, extract its conditional pattern base (all paths ending at
that item), build a conditional FP-tree, recurse. Each recursive call produces frequent itemsets containing
that item.
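The two database passes can be illustrated with a minimal FP-tree builder. This sketch covers construction only (the recursive conditional-pattern mining is omitted for brevity), and the FPNode class is a made-up helper, not part of any library:

```python
from collections import Counter

class FPNode:
    """Minimal prefix-tree node: item label, count, children keyed by item."""
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count items, drop infrequent ones, rank by descending frequency
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    rank = {i: r for r, i in enumerate(sorted(freq, key=freq.get, reverse=True))}
    # Pass 2: insert each transaction (frequent items only, in rank order);
    # shared prefixes reuse existing nodes, which is where compression comes from
    root = FPNode(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, freq

def tree_size(node):
    return 1 + sum(tree_size(c) for c in node.children.values())

tx = [['m', 'b'], ['m', 'b', 'e'], ['m', 'e'], ['b', 'e'], ['m', 'b', 'e']]
root, freq = build_fp_tree(tx, min_count=3)
print(tree_size(root))            # 7 nodes hold 12 inserted items -> compression
print(root.children['m'].count)   # 4: four transactions start with the 'm' prefix
```

Mining would then walk upward from each item's nodes to collect its conditional pattern base and recurse, entirely in memory.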
FP-Growth — Practical
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.frequent_patterns import fpgrowth, apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# ── Generate larger synthetic transaction dataset ─────
np.random.seed(42)
items = ['milk', 'bread', 'butter', 'eggs', 'cheese',
         'yogurt', 'juice', 'cereal', 'coffee', 'tea']
transactions = []
for _ in range(500):
    # Simulate real shopping patterns with biased item probabilities
    weights = [0.7, 0.65, 0.5, 0.55, 0.35, 0.3, 0.4, 0.25, 0.45, 0.3]
    chosen = [item for item, w in zip(items, weights) if np.random.random() < w]
    if len(chosen) >= 2:
        transactions.append(chosen)

# Encode
te = TransactionEncoder()
df_enc = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)
print(f"Dataset: {len(transactions)} transactions × {len(te.columns_)} items")

# ── Speed benchmark: FP-Growth vs Apriori ─────────────
min_sup = 0.2
t0 = time.time()
fp_itemsets = fpgrowth(df_enc, min_support=min_sup, use_colnames=True)
t_fp = time.time() - t0
t0 = time.time()
ap_itemsets = apriori(df_enc, min_support=min_sup, use_colnames=True)
t_ap = time.time() - t0
print(f"\nFP-Growth: {len(fp_itemsets)} itemsets in {t_fp:.4f}s")
print(f"Apriori:   {len(ap_itemsets)} itemsets in {t_ap:.4f}s")
print(f"Speedup:   {t_ap/t_fp:.2f}×")

# ── Generate rules from FP-Growth itemsets ────────────
rules = association_rules(fp_itemsets, metric='lift', min_threshold=1.1)
rules['antecedents'] = rules['antecedents'].apply(lambda x: ', '.join(list(x)))
rules['consequents'] = rules['consequents'].apply(lambda x: ', '.join(list(x)))
top_rules = rules.nlargest(10, 'lift')[
    ['antecedents', 'consequents', 'support', 'confidence', 'lift']
]
print("\nTop 10 Rules by Lift:")
print(top_rules.to_string(index=False))

# ── Heatmap: Item co-occurrence matrix ────────────────
cooc = df_enc.T.dot(df_enc)
cooc_norm = cooc / cooc.values.diagonal()  # column j divided by count(j)
plt.figure(figsize=(8, 6))
sns.heatmap(cooc_norm, annot=True, fmt='.2f', cmap='YlOrRd',
            linewidths=0.5, vmin=0, vmax=1)
plt.title('Item Co-occurrence Matrix\n(value = P(row bought | col bought))')
plt.tight_layout(); plt.show()
- FP-Growth is generally preferred over Apriori in production — faster and fewer DB scans
- Both produce identical rules — the algorithm differs, not the output
- Real business value: product placement, "customers also bought", promotions
- Always evaluate rules with Lift, not just Confidence — lift > 1 means genuine co-occurrence
Download a real retail dataset (e.g. UCI Online Retail Dataset). Run FP-Growth with min_support=0.05. Find the top 5 rules by lift. What business action would you recommend based on each rule?
Phase 13 — Ensemble Learning
Ensemble Learning — Overview
Ensemble Learning combines multiple models (weak learners) to produce a stronger, more accurate and robust prediction than any single model alone. The key insight: models make different errors, and combining them cancels out individual mistakes.
A single expert doctor may be wrong. But if 100 doctors independently diagnose the same patient and you take the majority vote, accuracy dramatically improves — individual errors cancel out. This is the ensemble principle. Even weak individual models (slightly better than random) can combine into a very strong model.
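The doctor analogy is a direct application of the Condorcet jury theorem, and it is easy to verify numerically: if each of n independent voters is correct with probability 0.6, the majority vote's accuracy follows a simple binomial sum (pure math, no dataset assumed):

```python
import math

def majority_accuracy(p, n):
    """P(majority of n independent voters is correct), each correct w.p. p."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Accuracy grows steadily with the number of independent weak voters
for n in [1, 5, 25, 101]:
    print(f"{n:3d} voters -> {majority_accuracy(0.6, n):.3f}")
```

One 60%-accurate voter stays at 0.600, while 101 of them exceed 0.95 combined. The catch, as noted below, is the independence assumption: correlated models gain far less.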
| Family | Core Idea | Algorithms | Strength |
|---|---|---|---|
| Bagging | Train models on bootstrap samples in parallel, average outputs | Random Forest, BaggingClassifier | Reduces variance (overfitting) |
| Boosting | Train models sequentially, each focusing on previous errors | AdaBoost, Gradient Boosting, XGBoost | Reduces bias (underfitting) |
| Stacking | Train base models, use their predictions as input to a meta-model | StackingClassifier/Regressor | Combines heterogeneous models |
- Ensemble = many weak learners combining into one strong learner
- Bagging: parallel training on bootstrap samples → reduces variance
- Boosting: sequential training fixing errors → reduces bias
- Stacking: base models feed a meta-learner → most flexible, most powerful
- Diversity between models is essential — correlated models don't help each other
Voting Methods — Max, Average, Weighted
Voting ensembles combine predictions from multiple different algorithm types (heterogeneous ensemble). Each model votes, and the final prediction is determined by: Hard Voting (majority class), Soft Voting (average probabilities), or Weighted Voting (trusted models vote more).
| Strategy | How It Works | When to Use | Requires |
|---|---|---|---|
| Hard Voting | Majority class label wins (mode) | Classification; when probabilities unavailable | predict() from each model |
| Soft Voting | Average predicted probabilities; argmax wins | Classification; more accurate than hard when models are calibrated | predict_proba() from each model |
| Weighted Voting | Better models get higher vote weight | When you know which models are stronger | weights= parameter |
| Average (Regression) | Mean of all model predictions | Regression; baseline ensemble | predict() from each model |
Soft Voting: ŷ = argmax_k [ (1/n) Σ P_i(y=k|X) ]
Weighted: ŷ = argmax_k [ Σ wᵢ × P_i(y=k|X) ] where Σwᵢ = 1
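A tiny numpy illustration of these two formulas, using made-up probabilities from three models on a single two-class sample, shows how soft voting can overturn a hard-vote majority:

```python
import numpy as np

# P[i, k] = model i's predicted probability for class k (hypothetical values)
P = np.array([[0.60, 0.40],    # model 1 votes class 0, weakly
              [0.55, 0.45],    # model 2 votes class 0, weakly
              [0.05, 0.95]])   # model 3 votes class 1, very confidently

hard = np.bincount(np.argmax(P, axis=1)).argmax()  # majority of labels -> class 0
soft = np.argmax(P.mean(axis=0))                   # mean probs [0.40, 0.60] -> class 1

w = np.array([0.5, 0.3, 0.2])                      # weighted soft voting, Σw = 1
weighted = np.argmax(w @ P)                        # [0.475, 0.525] -> class 1

print(hard, soft, weighted)                        # 0 1 1
```

Two weak "class 0" votes outnumber one vote under hard voting, but the third model's high confidence dominates once probabilities are averaged, which is why soft voting tends to win when the base models are well calibrated.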
Voting Regression — Practical
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# ── Define individual base models ─────────────────────
# Note: VotingRegressor requires estimators that support fit/predict
# Wrap scale-sensitive models in a Pipeline
r1 = ('ridge', Pipeline([('sc', StandardScaler()), ('m', Ridge(alpha=1.0))]))
r2 = ('dtree', DecisionTreeRegressor(max_depth=6, random_state=42))
r3 = ('knn',   Pipeline([('sc', StandardScaler()), ('m', KNeighborsRegressor(n_neighbors=7))]))
r4 = ('svr',   Pipeline([('sc', StandardScaler()), ('m', SVR(C=10, gamma='scale'))]))

# ── VotingRegressor: uniform average ──────────────────
vr_uniform = VotingRegressor(estimators=[r1, r2, r3, r4])

# ── VotingRegressor: weighted (Ridge is best, give it more weight)
vr_weighted = VotingRegressor(estimators=[r1, r2, r3, r4], weights=[3, 2, 1, 2])

# ── Evaluate all models ───────────────────────────────
results = {}
for name, model in [('Ridge', r1[1]), ('DTree', r2[1]), ('KNN', r3[1]),
                    ('SVR', r4[1]), ('VotingUniform', vr_uniform),
                    ('VotingWeighted', vr_weighted)]:
    model.fit(X_tr, y_tr)
    yp = model.predict(X_te)
    results[name] = {'RMSE': np.sqrt(mean_squared_error(y_te, yp)),
                     'R²': r2_score(y_te, yp)}
df_res = pd.DataFrame(results).T
print(df_res.round(4))

# ── Bar chart comparison ──────────────────────────────
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
colors = ['steelblue'] * 4 + ['#f59e0b', '#10b981']
ax1.bar(df_res.index, df_res['RMSE'], color=colors)
ax1.set_title('RMSE Comparison\n(Lower = Better)')
ax1.tick_params(axis='x', rotation=25)
ax2.bar(df_res.index, df_res['R²'], color=colors)
ax2.set_title('R² Comparison\n(Higher = Better)')
ax2.tick_params(axis='x', rotation=25)
plt.tight_layout(); plt.show()
# Ensemble should outperform all individual models
Voting Classification — Practical
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, f1_score
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ── Base estimators ───────────────────────────────────
lr  = Pipeline([('sc', StandardScaler()), ('m', LogisticRegression(max_iter=1000))])
dt  = DecisionTreeClassifier(max_depth=5, random_state=42)
knn = Pipeline([('sc', StandardScaler()), ('m', KNeighborsClassifier(n_neighbors=7))])
gnb = Pipeline([('sc', StandardScaler()), ('m', GaussianNB())])
svc = Pipeline([('sc', StandardScaler()), ('m', SVC(probability=True))])
estimators = [('lr', lr), ('dt', dt), ('knn', knn), ('gnb', gnb), ('svc', svc)]

# ── Hard vs Soft Voting ───────────────────────────────
vc_hard = VotingClassifier(estimators=estimators, voting='hard')
vc_soft = VotingClassifier(estimators=estimators, voting='soft')

# ── Evaluate all models ───────────────────────────────
results = []
for name, model in estimators + [('VotingHard', vc_hard), ('VotingSoft', vc_soft)]:
    model.fit(X_tr, y_tr)
    yp = model.predict(X_te)
    results.append({'Model': name,
                    'Test Acc': accuracy_score(y_te, yp),
                    'Test F1': f1_score(y_te, yp),
                    'CV F1': cross_val_score(model, X_tr, y_tr, cv=5, scoring='f1').mean()})
df_r = pd.DataFrame(results)
print(df_r.sort_values('Test F1', ascending=False).to_string(index=False))

# ── Visualize ensemble improvement ────────────────────
fig, ax = plt.subplots(figsize=(10, 5))
colors = ['steelblue'] * 5 + ['#f59e0b', '#10b981']
ax.barh(df_r['Model'], df_r['Test F1'], color=colors)
ax.axvline(x=df_r['Test F1'][:5].max(), color='red', linestyle='--',
           label='Best individual model')
ax.set_xlabel('F1-Score')
ax.set_title('Voting Ensemble vs Individual Models\n(Ensemble bars in gold/green)')
ax.legend(); plt.tight_layout(); plt.show()
- Using correlated models in ensemble — if all models make the same mistakes, voting doesn't help. Mix: linear + tree + distance-based models for diversity.
- Using Hard Voting with poorly calibrated models — their class boundaries don't align in probability space
- Not scaling inside each pipeline — VotingClassifier calls each model separately, so each needs its own scaler
Bagging & Random Forest — Theory
Bagging (Bootstrap Aggregating) trains multiple models on different bootstrap samples (random samples with replacement) of the training data, then aggregates predictions. Random Forest extends bagging by also randomizing the feature subset at each split — creating maximum diversity among trees.
A single Decision Tree has high variance: σ²(single tree). If n trees make independent predictions, the variance of their mean = σ²/n. With 100 trees, variance drops 100×. In practice trees are correlated (same data), but Random Forest's feature randomization breaks this correlation, achieving near-independent variance reduction.
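The σ²/n claim can be checked with a quick Monte Carlo simulation of idealized, fully independent "trees" (a pure noise model, no real data or estimator involved):

```python
import numpy as np

rng = np.random.default_rng(42)
truth, sigma = 10.0, 2.0            # true value, per-tree noise std (σ² = 4)
n_trees = 100

# Each column is one tree's prediction: truth + independent noise
preds = truth + rng.normal(0.0, sigma, size=(100_000, n_trees))

var_single = preds[:, 0].var()             # ≈ σ² = 4
var_ensemble = preds.mean(axis=1).var()    # ≈ σ²/n = 0.04
print(var_single, var_ensemble)
```

Averaging 100 independent predictors cuts variance by a factor of 100, matching σ²/n. Real trees trained on overlapping bootstrap samples are correlated, so the actual reduction is smaller, which is precisely the gap Random Forest's feature randomization attacks.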
| Property | Bagging (BaggingClassifier) | Random Forest |
|---|---|---|
| Bootstrap sampling | Yes | Yes |
| Feature randomization per split | No — all features considered | Yes — random subset of sqrt(n) features |
| Tree correlation | High (same features) | Low (different feature subsets) |
| Diversity | Moderate | High — best decorrelation |
| Out-of-Bag (OOB) evaluation | Yes (with oob_score=True) | Yes (with oob_score=True) |
| Feature importance | Depends on base estimator | Built-in (Gini importance) |
Bootstrap sample of size n: P(point included) = 1 − (1 − 1/n)^n → 1 − 1/e ≈ 63.2% unique points per sample (rest are duplicates)
OOB samples: ~36.8% left out per tree → free validation set
Feature subset per split: max_features = sqrt(p) for classification, p/3 for regression
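The 63.2% figure is easy to confirm empirically by drawing one large bootstrap sample and counting distinct indices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Bootstrap sample: draw n indices from [0, n) WITH replacement
boot_idx = rng.integers(0, n, size=n)
frac_unique = np.unique(boot_idx).size / n
print(frac_unique)   # ≈ 0.632 = 1 - 1/e
```

The remaining ~36.8% of points never appear in that tree's sample; those are exactly the out-of-bag points used for the free OOB score.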
Bagging — Classification Practical
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ── 1. Single Decision Tree (baseline) ────────────────
single_dt = DecisionTreeClassifier(max_depth=None, random_state=42)
single_dt.fit(X_tr, y_tr)
print("Single DTree:")
print(f"  Train acc: {accuracy_score(y_tr, single_dt.predict(X_tr)):.4f}")
print(f"  Test acc:  {accuracy_score(y_te, single_dt.predict(X_te)):.4f}")

# ── 2. Bagging with DTree base estimator ──────────────
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=100,            # number of trees
    max_samples=0.8,             # use 80% of data per tree
    max_features=0.8,            # use 80% of features per tree
    bootstrap=True,              # with replacement
    bootstrap_features=False,
    oob_score=True,              # free OOB evaluation
    n_jobs=-1, random_state=42
)
bag.fit(X_tr, y_tr)
print("\nBaggingClassifier:")
print(f"  OOB score: {bag.oob_score_:.4f} (free cross-validation estimate)")
print(f"  Test acc:  {accuracy_score(y_te, bag.predict(X_te)):.4f}")

# ── 3. Random Forest ──────────────────────────────────
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,          # let trees grow fully (bagging controls variance)
    max_features='sqrt',     # sqrt(n_features) per split — key RF innovation
    min_samples_split=2,
    min_samples_leaf=1,
    bootstrap=True, oob_score=True,
    n_jobs=-1, random_state=42
)
rf.fit(X_tr, y_tr)
print("\nRandomForestClassifier:")
print(f"  OOB score: {rf.oob_score_:.4f}")
print(f"  Test acc:  {accuracy_score(y_te, rf.predict(X_te)):.4f}")
print(f"  Test F1:   {f1_score(y_te, rf.predict(X_te)):.4f}")
print("\nClassification Report:")
print(classification_report(y_te, rf.predict(X_te), target_names=data.target_names))

# ── 4. n_estimators vs OOB score curve ────────────────
oob_scores = []
n_range = range(10, 201, 10)
for n in n_range:
    model = RandomForestClassifier(n_estimators=n, oob_score=True,
                                   n_jobs=-1, random_state=42)
    model.fit(X_tr, y_tr)
    oob_scores.append(model.oob_score_)

plt.figure(figsize=(10, 4))
plt.plot(n_range, oob_scores, 'b-o', markersize=4)
plt.xlabel('n_estimators'); plt.ylabel('OOB Score')
plt.title('OOB Score vs Number of Trees\n(Score stabilizes — no overfitting with more trees!)')
plt.tight_layout(); plt.show()

# ── 5. Feature Importance ─────────────────────────────
imp = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(imp['Feature'][:15], imp['Importance'][:15], color='steelblue')
plt.xlabel('Feature Importance (Mean Decrease in Gini Impurity)')
plt.title('Top 15 Feature Importances — Random Forest\n(Higher = more useful for prediction)')
plt.gca().invert_yaxis()
plt.tight_layout(); plt.show()

print("\nTop 5 most important features:")
print(imp.head(5).to_string(index=False))
- More trees never overfit in Random Forest — unlike a single tree, adding more trees reduces variance monotonically. After ~100–200 trees, gains are marginal; it's a compute/accuracy tradeoff.
- Setting max_features=None in RF — this removes the feature randomization that makes RF special. You'd just get regular Bagging.
- Confusing feature_importances_ with "causal importance" — RF importance measures how useful a feature is for prediction, not causation. Correlated features split importance between them.
Bagging — Regression Practical
Bagging Regression uses the same bootstrap aggregating principle for continuous targets. The ensemble prediction is the mean of all base regressor predictions. Random Forest Regressor is the most widely used variant.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              GradientBoostingRegressor)
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
df = housing.frame
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# ── Model zoo ─────────────────────────────────────────
models = {
    'Single DTree': DecisionTreeRegressor(max_depth=None, random_state=42),
    'BaggingDTree': BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=42),
        n_estimators=100, oob_score=True, n_jobs=-1, random_state=42
    ),
    'RandomForest': RandomForestRegressor(
        n_estimators=200, max_features='sqrt',
        oob_score=True, n_jobs=-1, random_state=42
    ),
    'GradientBoost': GradientBoostingRegressor(   # bonus: show boosting too
        n_estimators=200, learning_rate=0.1, max_depth=4, random_state=42
    ),
}

results = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    yp = model.predict(X_te)
    results.append({'Model': name,
                    'RMSE': np.sqrt(mean_squared_error(y_te, yp)),
                    'MAE': mean_absolute_error(y_te, yp),
                    'R²': r2_score(y_te, yp),
                    'OOB': getattr(model, 'oob_score_', np.nan)})
df_res = pd.DataFrame(results)
print(df_res.to_string(index=False))

# ── Visualize 1: RMSE and R² bar charts ───────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
colors = ['#64748b', 'steelblue', '#10b981', '#f59e0b']
axes[0].bar(df_res['Model'], df_res['RMSE'], color=colors)
axes[0].set_title('RMSE: Lower is Better\n(Ensemble beats single tree consistently)')
axes[0].tick_params(axis='x', rotation=15)
axes[1].bar(df_res['Model'], df_res['R²'], color=colors)
axes[1].set_title('R²: Higher is Better')
axes[1].tick_params(axis='x', rotation=15)
plt.tight_layout(); plt.show()

# ── Visualize 2: Actual vs Predicted (Random Forest) ──
rf_model = models['RandomForest']
y_pred_rf = rf_model.predict(X_te)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(y_te, y_pred_rf, alpha=0.2, s=8, color='steelblue')
axes[0].plot([y_te.min(), y_te.max()], [y_te.min(), y_te.max()], 'r--')
axes[0].set_xlabel('Actual'); axes[0].set_ylabel('Predicted')
axes[0].set_title(f'Random Forest: Actual vs Predicted\nR²={r2_score(y_te, y_pred_rf):.4f}')
# Feature importance
imp = pd.DataFrame({'Feature': X.columns,
                    'Importance': rf_model.feature_importances_}).sort_values('Importance', ascending=True)
axes[1].barh(imp['Feature'], imp['Importance'], color='steelblue')
axes[1].set_title('Feature Importances\n(Random Forest Regression)')
plt.tight_layout(); plt.show()

# ── Hyperparameter guide: key RF regression params ────
print("\n--- Key RandomForestRegressor Hyperparameters ---")
params_guide = {
    'n_estimators':      "100–500. More = better up to diminishing returns. Never overfits.",
    'max_features':      "'sqrt' is the classification default; regression defaults to all features. Tune: try 1/3 to 2/3 of features.",
    'max_depth':         "None (grow full) is fine for RF. Set only if memory is a concern.",
    'min_samples_split': "2–10. Increase to reduce overfitting on noisy data.",
    'min_samples_leaf':  "1–5. Larger = smoother predictions, less sensitive to noise.",
    'bootstrap':         "True always for Bagging. False = random subspaces only.",
}
for p, desc in params_guide.items():
    print(f"  {p:25s}: {desc}")
- Treating feature_importances_ as definitive — for correlated features, importance is split arbitrarily between correlated variables. Use permutation importance (sklearn's permutation_importance) for more reliable estimates.
- Not using n_jobs=-1 — Random Forest is embarrassingly parallelizable. Without it, training 200 trees sequentially is 4–8× slower than needed.
- Tuning n_estimators to "exactly 100" for performance — always plot the OOB curve and stop when it flattens, not at an arbitrary number.
- Bagging trains parallel models on bootstrap samples → reduces variance via averaging
- Random Forest = Bagging + random feature subsets at each split → maximum tree diversity
- OOB score gives free cross-validated performance estimate using left-out samples
- More trees never hurt RF — only diminishing returns after ~200 trees
- Feature importances reveal predictive power but can be misleading for correlated features
Build a complete Random Forest pipeline on the California Housing dataset: (1) Feature engineering, (2) RF with OOB score, (3) Permutation importance to validate feature importances, (4) GridSearchCV for max_features and min_samples_leaf, (5) Compare final RF to single DTree and Ridge — report RMSE, R², and training time for each.
🏆 Course Complete — Final Master Summary
Complete ML Masterclass — Final Summary
| Phase | Topics | Key Takeaway |
|---|---|---|
| Basics (1–3) | ML Introduction, Types, Roadmap | Know the 7-step pipeline cold. Supervised/Unsupervised distinction. |
| Data Preprocessing (4–20) | Variables, Cleaning, Missing, Encoding, Outliers, Scaling, FunctionTransformer | 80% of ML work. Data quality beats algorithm choice. |
| Feature Selection (21–22) | Backward/Forward Selection | Wrapper methods are model-aware but O(n²). Use filter first. |
| Model Training (23) | Train-Test Split | Always split BEFORE any preprocessing. Stratify for classification. |
| Regression (24–33) | Linear → Poly → Ridge/Lasso | Always plot residuals. Regularization requires scaling. Use AdjR². |
| Classification (34–38) | Logistic, Binary/Multi-class | Sigmoid outputs probability. Threshold tuning = precision/recall tradeoff. |
| Evaluation (39–41) | Confusion Matrix, F1, Imbalanced | Accuracy is useless on imbalanced data. Use F1, ROC-AUC, lift. |
| Naive Bayes (42–43) | Theory + Practical | Best for text. Naive independence assumption works surprisingly well. |
| Advanced Models (44–53) | DTree, KNN, SVM | DTree: interpretable but high variance. KNN: lazy, scale-sensitive. SVM: margin maximization, kernel trick. |
| Model Tuning (54–57) | Hyperparams, GridSearch, CV | Nested CV is gold standard. Always tune on train set, evaluate on test. |
| Unsupervised (58–65) | K-Means, Hierarchical, DBSCAN, Silhouette | K-Means: fast, spherical. DBSCAN: arbitrary shapes, finds outliers. Silhouette: no labels needed. |
| Association (66–70) | Apriori, FP-Growth | FP-Growth = production standard. Lift > 1 = genuine rule. |
| Ensemble (71–77) | Voting, Bagging, Random Forest | RF = best general-purpose model. Diversity + averaging = power. OOB = free CV. |
1. Always split data BEFORE any preprocessing step — prevent data leakage.
2. Fit transformers on train data only. Transform train AND test with the same fit.
3. Plot residuals — R² alone is never enough to validate a regression model.
4. For imbalanced data: never use accuracy. Use F1, ROC-AUC, G-Mean.
5. Scale features before: KNN, SVM, Logistic Regression, Ridge, Lasso, PCA.
6. Decision Trees are interpretable but high variance — use Random Forest for
production.
7. Hyperparameter tuning belongs in the training set — nested CV for unbiased
evaluation.
8. For clustering, always use Silhouette analysis — never just pick k arbitrarily.
9. More trees in Random Forest never overfit — stop at ~200 when OOB curve flattens.
10. Ensemble diversity matters more than individual model accuracy.
| # | Question | Topic |
|---|---|---|
| 1 | Walk me through an end-to-end ML project | Topic 3: Roadmap |
| 2 | What is data leakage and how do you prevent it? | Topic 23: Train-Test Split |
| 3 | Explain the bias-variance tradeoff | Topic 32: Ridge/Lasso |
| 4 | Why use Adjusted R² over R²? | Topic 30 |
| 5 | Lasso vs Ridge — mathematical difference and when to choose | Topic 32 |
| 6 | Why is logistic regression called regression? | Topic 35 |
| 7 | Precision vs Recall tradeoff — give a business example | Topic 40 |
| 8 | Why does Naive Bayes work despite the independence assumption? | Topic 42 |
| 9 | Gini Impurity vs Entropy — when does it matter? | Topic 44 |
| 10 | What is the curse of dimensionality and how does it affect KNN? | Topic 48 |
| 11 | What are support vectors and why do only they define the boundary? | Topic 51 |
| 12 | Parameters vs Hyperparameters — with examples | Topic 54 |
| 13 | GridSearch vs RandomizedSearch — when to use each? | Topic 55 |
| 14 | What is OOB error in Random Forest? | Topic 76 |
| 15 | K-Means limitations and how DBSCAN addresses them | Topics 59/63 |
| 16 | What makes ensemble methods work mathematically? | Topic 71 |
| 17 | Random Forest vs Gradient Boosting — when to choose? | Topic 77 |
| 18 | What is the Apriori principle? | Topic 67 |
| 19 | Why is Soft Voting better than Hard Voting? | Topic 72 |
| 20 | How do you handle imbalanced datasets? | Topic 41 |
| Topic | Why It Matters | Where to Start |
|---|---|---|
| XGBoost / LightGBM | Win 90% of Kaggle tabular competitions | xgboost.readthedocs.io |
| Neural Networks / Deep Learning | Images, text, audio, sequences | fast.ai, PyTorch |
| Feature Engineering | Often the highest ROI activity in ML | Kaggle competitions |
| Model Explainability (SHAP) | Required in regulated industries | shap.readthedocs.io |
| ML Pipelines (sklearn Pipeline) | Production-ready, prevents leakage | sklearn Pipeline docs |
| Time Series (ARIMA, Prophet) | Finance, forecasting, IoT | statsmodels, Prophet |
| NLP (TF-IDF, Transformers) | Text classification, sentiment, chatbots | HuggingFace Transformers |
| MLOps (MLflow, Docker) | Deploy and monitor models in production | mlflow.org |
Using everything you've learned, build a complete end-to-end ML system:
- Pick a real dataset from Kaggle (tabular, CSV format)
- EDA: distribution plots, correlation heatmap, missing value audit
- Preprocessing Pipeline: imputation, encoding, scaling — all in sklearn Pipeline
- Baseline model: Linear/Logistic Regression — establish a benchmark
- Advanced models: Random Forest, XGBoost — GridSearchCV on train only
- Evaluation: confusion matrix (if classification), residual plot (if regression), CV scores, test set final evaluation
- Explainability: feature importance + SHAP values
- Deployment: wrap best model in a Flask API that accepts JSON and returns predictions
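The preprocessing, baseline, and tuning steps above can be wired together in a single sketch. This is a compressed example (not the full capstone) using the built-in breast-cancer dataset, a `Pipeline` to prevent leakage, and `GridSearchCV` restricted to the training set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so each CV fold
# fits the scaler on its own training portion only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])

# Tuning touches ONLY the training set
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="f1")
grid.fit(X_train, y_train)

# One-time final evaluation on the untouched test set
test_score = grid.score(X_test, y_test)
print("best C:", grid.best_params_["clf__C"], f"| test F1: {test_score:.3f}")
```

The same skeleton extends to the full project: swap in your Kaggle dataset, add imputation and encoding steps to the pipeline, and try Random Forest or XGBoost as the estimator.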
Complete this and you have a portfolio project worthy of FAANG interviews. You now know enough to build real ML systems.