Chapter 45: Machine Learning Basics with scikit-learn
Machine learning sounds intimidating. It isn't.
At its core, machine learning is this: you show a program examples, and it learns a pattern. Then you give it new data it's never seen, and it predicts the answer.
You've been doing something similar your whole life. You've seen thousands of emails. You know what spam looks like. A spam filter does the same thing — except it learned from millions of examples instead of yours.
scikit-learn is Python's standard machine learning library. It has a simple, consistent API and covers most practical use cases.
Setup
pip install scikit-learn pandas numpy matplotlib seaborn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# from sklearn we'll import specific parts as we go
The Two Types of Problems
Supervised learning — you have labeled examples. You learn to predict the label for new data.
- Regression: predict a number (house price, temperature, salary)
- Classification: predict a category (spam/not spam, cat/dog, disease/healthy)
Unsupervised learning — you have data with no labels. You find hidden structure.
- Clustering: group similar data points
- Dimensionality reduction: compress features while keeping information
This chapter focuses on supervised learning — it's where most beginners start and where most practical ML problems live.
The scikit-learn API
Every model in scikit-learn follows the same four-step pattern:
from sklearn.some_module import SomeModel
model = SomeModel() # 1. Create the model
model.fit(X_train, y_train) # 2. Train it on your data
predictions = model.predict(X_test) # 3. Make predictions
score = model.score(X_test, y_test) # 4. Evaluate it
Learn this pattern once and you can use any of scikit-learn's 40+ algorithms.
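Here is that pattern as runnable code, using a k-nearest-neighbours classifier on the bundled iris dataset (any scikit-learn estimator slots into the same four lines):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = KNeighborsClassifier(n_neighbors=3)  # 1. create the model
model.fit(X_train, y_train)                  # 2. train it
predictions = model.predict(X_test)          # 3. predict on unseen data
print(model.score(X_test, y_test))           # 4. evaluate (accuracy for classifiers)
```

Swap `KNeighborsClassifier` for any other estimator and the four lines stay the same.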
Step 1 — Preparing Your Data
Load a dataset
from sklearn.datasets import load_iris, fetch_california_housing
import pandas as pd
# Built-in datasets (great for learning)
iris = load_iris(as_frame=True)
housing = fetch_california_housing(as_frame=True)
X = housing.data # features (DataFrame)
y = housing.target # labels (Series)
print(X.shape) # (20640, 8) — 20,640 rows, 8 features
print(y.shape) # (20640,)
print(X.head())
Train/test split
You must evaluate your model on data it has never seen. Always split your data first:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for testing, 80% for training
    random_state=42,  # reproducible split
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
# Train: (16512, 8), Test: (4128, 8)
Feature scaling
Many algorithms (linear models, SVMs, k-nearest neighbours) are sensitive to feature scale. A feature measured in millions dominates one measured in units.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # learn mean/std, then scale
X_test_scaled = scaler.transform(X_test) # use SAME mean/std as train
Critical rule: fit the scaler on training data only. Never fit on test data — that would be "data leakage," letting test information sneak into training.
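To see the rule in action, here's a tiny synthetic sketch: the test data is deliberately shifted relative to the training data, and scaling with the training statistics preserves that shift instead of hiding it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 1))
X_test = rng.normal(loc=7.0, scale=2.0, size=(50, 1))  # shifted distribution

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # statistics learned here only
X_test_s = scaler.transform(X_test)        # the same statistics reused

print(f"scaled train mean: {X_train_s.mean():.3f}")  # essentially zero
print(f"scaled test mean:  {X_test_s.mean():.3f}")   # clearly nonzero: the shift survives
```

Fitting the scaler on the combined data would drag both means toward zero, quietly leaking test information and making evaluation look better than it should.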
Linear Regression — Predict a Number
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# ── Toy example ───────────────────────────────────────────────────────────────
np.random.seed(42)
X_toy = np.random.rand(100, 1) * 10
y_toy = 2.5 * X_toy.ravel() + np.random.randn(100) * 3
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)
print(f"Coefficient (slope): {model.coef_[0]:.2f}") # ~2.5
print(f"Intercept: {model.intercept_:.2f}")
print(f"R^2 score: {r2_score(y_te, y_pred):.3f}") # 1.0 = perfect
print(f"RMSE: {np.sqrt(mean_squared_error(y_te, y_pred)):.3f}")
# Plot
plt.figure(figsize=(7, 4))
plt.scatter(X_te, y_te, alpha=0.6, label="Actual")
plt.plot(X_te, y_pred, color="red", linewidth=2, label="Predicted")
plt.title("Linear Regression")
plt.legend()
plt.tight_layout()
plt.show()
On real data — California housing prices
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
housing = fetch_california_housing(as_frame=True)
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
model = LinearRegression()
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
print(f"R^2: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
# R^2: 0.606
# RMSE: 0.745
R^2 of 0.606 means the model explains about 60% of the variance in house prices. Not great — the relationship is non-linear. Let's try a better model.
Decision Trees and Random Forests — Flexible and Powerful
A decision tree splits the data on feature values, like a flowchart:
Is MedInc > 3.5?
  Yes -> Is HouseAge > 20?
           Yes -> predict 2.8
           No  -> predict 3.5
  No  -> predict 1.2
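You can print the rules a real tree learns. A sketch on the bundled diabetes dataset (chosen because it ships with scikit-learn, so no download is needed); max_depth=2 keeps the printout small, while real trees grow much deeper:

```python
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor, export_text

data = load_diabetes(as_frame=True)
X, y = data.data, data.target

tree = DecisionTreeRegressor(max_depth=2, random_state=42)
tree.fit(X, y)

# export_text prints the learned flowchart: one split per line,
# with a predicted value at each leaf.
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
```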
A random forest builds hundreds of decision trees on random subsets of the data and features, then averages their predictions. This reduces overfitting and greatly improves accuracy.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(
    n_estimators=100,  # number of trees
    max_depth=None,    # let trees grow fully
    random_state=42,
    n_jobs=-1,         # use all CPU cores
)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print(f"R^2: {r2_score(y_test, y_pred_rf):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_rf)):.3f}")
# R^2: 0.805
# RMSE: 0.505
# Feature importance — which features matter most?
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values().plot(kind="barh", figsize=(7, 4))
plt.title("Feature Importances")
plt.tight_layout()
plt.show()
The random forest jumps from 0.606 to 0.805 R^2 — no manual feature engineering, no scaling required.
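The "no scaling required" part is easy to verify: tree splits only compare a feature value against a threshold, and rescaling preserves those comparisons. A sketch on the bundled diabetes dataset (smaller than the housing data, so it runs quickly):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Same forest, once on raw features, once on standardized features.
raw = RandomForestRegressor(n_estimators=100, random_state=42)
raw.fit(X_tr, y_tr)

scaler = StandardScaler().fit(X_tr)
scaled = RandomForestRegressor(n_estimators=100, random_state=42)
scaled.fit(scaler.transform(X_tr), y_tr)

r2_raw = raw.score(X_te, y_te)
r2_scaled = scaled.score(scaler.transform(X_te), y_te)
print(round(r2_raw, 4), round(r2_scaled, 4))  # the two R^2 scores match
```

Contrast this with a linear model or SVM, where the unscaled version can perform much worse.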
Classification — Predict a Category
Logistic Regression
Despite the name, logistic regression is a classification algorithm. It predicts the probability that something belongs to a class.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target # 0=malignant, 1=benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
model = LogisticRegression(max_iter=1000)
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
print(classification_report(y_test, y_pred, target_names=data.target_names))
Output:
              precision    recall  f1-score   support

   malignant       0.97      0.93      0.95        43
      benign       0.96      0.98      0.97        71

    accuracy                           0.96       114
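Since logistic regression actually models probabilities, you can ask for them directly with predict_proba and choose your own decision threshold. A self-contained sketch that rebuilds the same split and model as above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_s, y_train)

proba = model.predict_proba(X_test_s)  # shape (n_samples, 2); rows sum to 1
print(proba[:3].round(3))

# Custom threshold: only call a tumor benign (class 1) when we're
# at least 90% sure; everything else gets flagged for review.
cautious = (proba[:, 1] >= 0.90).astype(int)
```

In medical settings you would often tune the threshold like this to trade precision for recall on the dangerous class.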
Understanding the metrics
- Precision — of all the times we said "malignant," how often were we right?
- Recall — of all actual malignant cases, how many did we catch?
- F1-score — harmonic mean of precision and recall. Use it when classes are imbalanced.
- Accuracy — overall percentage correct. Misleading when classes are very unequal.
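These metrics are just ratios of counts from the confusion matrix. A worked sketch with made-up numbers (37 malignant cases caught, 3 missed, 2 benign cases wrongly flagged):

```python
tp = 37  # actual malignant, predicted malignant
fp = 2   # actual benign, predicted malignant
fn = 3   # actual malignant, predicted benign

precision = tp / (tp + fp)  # how trustworthy a "malignant" call is
recall = tp / (tp + fn)     # how many real cases we caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.949 recall=0.925 f1=0.937
```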
Confusion matrix
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(
    cm, annot=True, fmt="d", cmap="Blues",
    xticklabels=data.target_names,
    yticklabels=data.target_names,
)
plt.title("Confusion Matrix")
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()
Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
print(classification_report(y_test, y_pred_rf, target_names=data.target_names))
# accuracy: 0.96-0.97
Cross-Validation — A More Reliable Score
One train/test split can be lucky or unlucky. Cross-validation gives a more reliable estimate:
from sklearn.model_selection import cross_val_score
model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"CV scores: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")
# CV scores: [0.974 0.956 0.965 0.965 0.982]
# Mean: 0.968
# Std: 0.009
5-fold CV splits the data into 5 parts, trains on 4 and tests on 1, rotates 5 times. The mean score and standard deviation tell you how well and how consistently the model performs.
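Under the hood this is just a loop. A sketch of 5-fold CV written by hand, using plain KFold (note that cross_val_score actually defaults to stratified folds for classifiers):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    model = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
    model.fit(X[train_idx], y[train_idx])                 # train on 4 folds
    scores.append(model.score(X[test_idx], y[test_idx]))  # score on the held-out 5th

print([round(s, 3) for s in scores])  # one accuracy per rotation
```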
Pipelines — Prevent Data Leakage
Manually applying a scaler before cross-validation is dangerous — if you fit the scaler on all the data before the CV split, you've leaked test information into training.
The fix is a Pipeline that chains preprocessing and model together:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"Pipeline CV: {scores.mean():.3f} ± {scores.std():.3f}")
The pipeline scales inside each fold — the scaler never sees the test fold. This is the correct way to evaluate models.
Hyperparameter Tuning with GridSearchCV
Every model has hyperparameters — settings you choose, not learned from data. Finding the best settings is called hyperparameter tuning.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
    verbose=1,
)
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_:.3f}")
best_model = grid.best_estimator_
GridSearchCV tries every combination of parameters (3x3x2 = 18 combinations here) and picks the best using cross-validation.
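You can check the arithmetic with ParameterGrid, which enumerates exactly the combinations GridSearchCV will try:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
}

combos = list(ParameterGrid(param_grid))
print(len(combos))      # 18 combinations (3 x 3 x 2)
print(len(combos) * 5)  # 90 model fits in total with cv=5
```

That multiplier is why grids get expensive fast: every new value of every parameter multiplies the number of fits.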
For large grids, use RandomizedSearchCV instead — it samples a fixed number of random combinations, which is much faster:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_dist = {
    "n_estimators": randint(50, 500),
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": randint(2, 20),
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=20,  # try 20 random combinations
    cv=5,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X, y)
print(random_search.best_params_)
Unsupervised Learning — Clustering
When you don't have labels, clustering finds natural groups:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Generate sample data
from sklearn.datasets import make_blobs
X, true_labels = make_blobs(n_samples=300, centers=4, random_state=42)
# Scale
scaler = StandardScaler()
X_s = scaler.fit_transform(X)
# Fit KMeans
kmeans = KMeans(n_clusters=4, random_state=42, n_init="auto")
kmeans.fit(X_s)
labels = kmeans.labels_
# Plot
plt.figure(figsize=(8, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="tab10", alpha=0.7, s=40)
centers = scaler.inverse_transform(kmeans.cluster_centers_)  # back to original units
plt.scatter(
    centers[:, 0], centers[:, 1],
    marker="X", s=200, color="black", label="Centroids"
)
plt.title("KMeans Clustering")
plt.legend()
plt.tight_layout()
plt.show()
Full Project — House Price Predictor
"""
house_price_predictor.py
Predict California median house values using a
Random Forest with a Pipeline and GridSearchCV.
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
def load_data():
    housing = fetch_california_housing(as_frame=True)
    X, y = housing.data, housing.target
    print(f"Dataset: {X.shape[0]:,} samples, {X.shape[1]} features")
    print(f"Target range: ${y.min()*100_000:,.0f} -- ${y.max()*100_000:,.0f}")
    return X, y

def build_pipeline():
    return Pipeline([
        ("scaler", StandardScaler()),
        ("model", RandomForestRegressor(
            n_estimators=200,
            max_depth=None,
            random_state=42,
            n_jobs=-1,
        )),
    ])

def evaluate(model, X_test, y_test):
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"R^2 Score: {r2:.3f} (1.0 = perfect)")
    print(f"RMSE: ${rmse * 100_000:,.0f}")
    return y_pred

def plot_predictions(y_test, y_pred):
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    # Predicted vs Actual
    axes[0].scatter(y_test, y_pred, alpha=0.3, s=10, color="steelblue")
    lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
    axes[0].plot(lims, lims, "r--", linewidth=2, label="Perfect prediction")
    axes[0].set_xlabel("Actual Price ($100k)")
    axes[0].set_ylabel("Predicted Price ($100k)")
    axes[0].set_title("Predicted vs Actual")
    axes[0].legend()
    # Residuals
    residuals = y_test - y_pred
    axes[1].hist(residuals, bins=50, color="coral", edgecolor="white", alpha=0.8)
    axes[1].axvline(0, color="black", linewidth=2)
    axes[1].set_title("Residuals Distribution")
    axes[1].set_xlabel("Prediction Error ($100k)")
    axes[1].set_ylabel("Count")
    plt.tight_layout()
    plt.savefig("house_predictions.png", dpi=150)
    plt.show()

def feature_importance(model, feature_names):
    rf = model.named_steps["model"]
    importances = pd.Series(rf.feature_importances_, index=feature_names)
    importances.sort_values().plot(kind="barh", figsize=(8, 4), color="steelblue")
    plt.title("Feature Importances")
    plt.tight_layout()
    plt.savefig("feature_importance.png", dpi=150)
    plt.show()

if __name__ == "__main__":
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    pipe = build_pipeline()
    print("\nTraining model...")
    pipe.fit(X_train, y_train)
    print("\n=== Test Set Results ===")
    y_pred = evaluate(pipe, X_test, y_test)
    print("\n=== 5-Fold Cross-Validation ===")
    cv_scores = cross_val_score(pipe, X, y, cv=5, scoring="r2", n_jobs=-1)
    print(f"CV R^2 scores: {cv_scores.round(3)}")
    print(f"Mean: {cv_scores.mean():.3f} Std: {cv_scores.std():.3f}")
    plot_predictions(y_test, y_pred)
    feature_importance(pipe, X.columns)
Output:
Dataset: 20,640 samples, 8 features
Target range: $14,999 -- $500,001
=== Test Set Results ===
R^2 Score: 0.817 (1.0 = perfect)
RMSE: $52,831
=== 5-Fold Cross-Validation ===
CV R^2 scores: [0.808 0.812 0.821 0.806 0.810]
Mean: 0.811 Std: 0.005
Choosing the Right Algorithm
| Problem | First try | If you need more |
|---|---|---|
| Predict a number | Linear Regression | Random Forest, Gradient Boosting |
| Classify into 2+ classes | Logistic Regression | Random Forest, SVM |
| Find groups | KMeans | DBSCAN, Agglomerative |
| Reduce dimensions | PCA | UMAP, t-SNE |
Start simple. Add complexity only when the simpler model doesn't perform well enough.
What You Learned in This Chapter
- Machine learning has two main types: supervised (predict from labeled examples) and unsupervised (find structure without labels).
- The scikit-learn API is always: model.fit(X_train, y_train) -> model.predict(X_test) -> model.score(X_test, y_test).
- Always split data with train_test_split. Never evaluate on training data.
- Scale features with StandardScaler. Fit it only on training data.
- Linear Regression predicts continuous values. Evaluate with R^2 and RMSE.
- Random Forest is often the best starting point — handles non-linearity, doesn't require scaling, shows feature importances.
- Logistic Regression classifies. Evaluate with precision, recall, F1, and a confusion matrix.
- Cross-validation (cross_val_score) gives more reliable performance estimates than a single split.
- Pipelines chain preprocessing and models together to prevent data leakage during cross-validation.
- GridSearchCV and RandomizedSearchCV automate hyperparameter tuning.
- KMeans clusters data without labels.
What's Next?
Chapter 46 covers Automation and Scripting — using Python to automate repetitive tasks: renaming files, sending emails, processing PDFs, working with Excel, and scheduling tasks to run daily.