Chapter 45: Machine Learning Basics with scikit-learn
Machine learning sounds intimidating. It isn't.
At its core, machine learning is this: you show a program examples, and it learns a pattern. Then you give it new data it's never seen, and it predicts the answer.
You've been doing something similar your whole life. You've seen thousands of emails. You know what spam looks like. A spam filter does the same thing — except it learned from millions of examples instead of yours.
scikit-learn is Python's standard machine learning library. It has a simple, consistent API and covers most practical use cases.
Setup
pip install scikit-learn pandas numpy matplotlib seaborn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# from sklearn we'll import specific parts as we go
The Two Types of Problems
Supervised learning — you have labeled examples. You learn to predict the label for new data.
- Regression: predict a number (house price, temperature, salary)
- Classification: predict a category (spam/not spam, cat/dog, disease/healthy)
Unsupervised learning — you have data with no labels. You find hidden structure.
- Clustering: group similar data points
- Dimensionality reduction: compress features while keeping information
This chapter focuses on supervised learning — it's where most beginners start and where most practical ML problems live.
The scikit-learn API
Every model in scikit-learn follows the same four-step pattern:
from sklearn.some_module import SomeModel
model = SomeModel() # 1. Create the model
model.fit(X_train, y_train) # 2. Train it on your data
predictions = model.predict(X_test) # 3. Make predictions
score = model.score(X_test, y_test) # 4. Evaluate it
Learn this pattern once and you can use any of scikit-learn's 40+ algorithms.
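Here is that pattern as runnable code, using a k-nearest-neighbours classifier on the bundled iris dataset (any scikit-learn estimator slots into the same four lines):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = KNeighborsClassifier(n_neighbors=3)  # 1. create the model
model.fit(X_train, y_train)                  # 2. train it
predictions = model.predict(X_test)          # 3. predict on unseen data
print(model.score(X_test, y_test))           # 4. evaluate (accuracy for classifiers)
```

Swap `KNeighborsClassifier` for any other estimator and the four lines stay the same.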
Step 1 — Preparing Your Data
Load a dataset
from sklearn.datasets import load_iris, fetch_california_housing
import pandas as pd
# Built-in datasets (great for learning)
iris = load_iris(as_frame=True)
housing = fetch_california_housing(as_frame=True)
X = housing.data # features (DataFrame)
y = housing.target # labels (Series)
print(X.shape) # (20640, 8) — 20,640 rows, 8 features
print(y.shape) # (20640,)
print(X.head())
Train/test split
You must evaluate your model on data it has never seen. Always split your data first:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for testing, 80% for training
    random_state=42,  # reproducible split
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
# Train: (16512, 8), Test: (4128, 8)
Feature scaling
Many algorithms (linear models, SVMs, k-nearest neighbours) are sensitive to feature scale. A feature measured in millions dominates one measured in units.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # learn mean/std, then scale
X_test_scaled = scaler.transform(X_test) # use SAME mean/std as train
Critical rule: fit the scaler on training data only. Never fit on test data — that would be "data leakage," letting test information sneak into training.
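To see the rule in action, here's a tiny synthetic sketch: the test data is deliberately shifted relative to the training data, and scaling with the training statistics preserves that shift instead of hiding it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 1))
X_test = rng.normal(loc=7.0, scale=2.0, size=(50, 1))  # shifted distribution

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # statistics learned here only
X_test_s = scaler.transform(X_test)        # the same statistics reused

print(f"scaled train mean: {X_train_s.mean():.3f}")  # essentially zero
print(f"scaled test mean:  {X_test_s.mean():.3f}")   # clearly nonzero: the shift survives
```

Fitting the scaler on the combined data would drag both means toward zero, quietly leaking test information and making evaluation look better than it should.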
Linear Regression — Predict a Number
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# ── Toy example ───────────────────────────────────────────────────────────────
np.random.seed(42)
X_toy = np.random.rand(100, 1) * 10
y_toy = 2.5 * X_toy.ravel() + np.random.randn(100) * 3
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)
print(f"Coefficient (slope): {model.coef_[0]:.2f}") # ~2.5
print(f"Intercept: {model.intercept_:.2f}")
print(f"R^2 score: {r2_score(y_te, y_pred):.3f}") # 1.0 = perfect
print(f"RMSE: {np.sqrt(mean_squared_error(y_te, y_pred)):.3f}")
# Plot
plt.figure(figsize=(7, 4))
plt.scatter(X_te, y_te, alpha=0.6, label="Actual")
plt.plot(X_te, y_pred, color="red", linewidth=2, label="Predicted")
plt.title("Linear Regression")
plt.legend()
plt.tight_layout()
plt.show()
On real data — California housing prices
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
housing = fetch_california_housing(as_frame=True)
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
model = LinearRegression()
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
print(f"R^2: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
# R^2: 0.606
# RMSE: 0.745
R^2 of 0.606 means the model explains about 60% of the variance in house prices. Not great — the relationship is non-linear. Let's try a better model.
Decision Trees and Random Forests — Flexible and Powerful
A decision tree splits the data on feature values, like a flowchart:
Is MedInc > 3.5?
  Yes -> Is HouseAge > 20?
           Yes -> predict 2.8
           No  -> predict 3.5
  No  -> predict 1.2
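You can print the rules a real tree learns. A sketch on the bundled diabetes dataset (chosen because it ships with scikit-learn, so no download is needed); max_depth=2 keeps the printout small, while real trees grow much deeper:

```python
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor, export_text

data = load_diabetes(as_frame=True)
X, y = data.data, data.target

tree = DecisionTreeRegressor(max_depth=2, random_state=42)
tree.fit(X, y)

# export_text prints the learned flowchart: one split per line,
# with a predicted value at each leaf.
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
```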
A random forest builds hundreds of decision trees on random subsets of the data and features, then averages their predictions. This reduces overfitting and greatly improves accuracy.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(
    n_estimators=100,  # number of trees
    max_depth=None,    # let trees grow fully
    random_state=42,
    n_jobs=-1,         # use all CPU cores
)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print(f"R^2: {r2_score(y_test, y_pred_rf):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_rf)):.3f}")
# R^2: 0.805
# RMSE: 0.505
# Feature importance — which features matter most?
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values().plot(kind="barh", figsize=(7, 4))
plt.title("Feature Importances")
plt.tight_layout()
plt.show()
The random forest jumps from 0.606 to 0.805 R^2 — no manual feature engineering, no scaling required.
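The "no scaling required" part is easy to verify: tree splits only compare a feature value against a threshold, and rescaling preserves those comparisons. A sketch on the bundled diabetes dataset (smaller than the housing data, so it runs quickly):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Same forest, once on raw features, once on standardized features.
raw = RandomForestRegressor(n_estimators=100, random_state=42)
raw.fit(X_tr, y_tr)

scaler = StandardScaler().fit(X_tr)
scaled = RandomForestRegressor(n_estimators=100, random_state=42)
scaled.fit(scaler.transform(X_tr), y_tr)

r2_raw = raw.score(X_te, y_te)
r2_scaled = scaled.score(scaler.transform(X_te), y_te)
print(round(r2_raw, 4), round(r2_scaled, 4))  # the two R^2 scores match
```

Contrast this with a linear model or SVM, where the unscaled version can perform much worse.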
Classification — Predict a Category
Logistic Regression
Despite the name, logistic regression is a classification algorithm. It predicts the probability that something belongs to a class.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target # 0=malignant, 1=benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
model = LogisticRegression(max_iter=1000)
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
print(classification_report(y_test, y_pred, target_names=data.target_names))
Output:
              precision    recall  f1-score   support

   malignant       0.97      0.93      0.95        43
      benign       0.96      0.98      0.97        71

    accuracy                           0.96       114
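Since logistic regression actually models probabilities, you can ask for them directly with predict_proba and choose your own decision threshold. A self-contained sketch that rebuilds the same split and model as above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_s, y_train)

proba = model.predict_proba(X_test_s)  # shape (n_samples, 2); rows sum to 1
print(proba[:3].round(3))

# Custom threshold: only call a tumor benign (class 1) when we're
# at least 90% sure; everything else gets flagged for review.
cautious = (proba[:, 1] >= 0.90).astype(int)
```

In medical settings you would often tune the threshold like this to trade precision for recall on the dangerous class.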
Understanding the metrics
- Precision — of all the times we said "malignant," how often were we right?
- Recall — of all actual malignant cases, how many did we catch?
- F1-score — harmonic mean of precision and recall. Use it when classes are imbalanced.
- Accuracy — overall percentage correct. Misleading when classes are very unequal.
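These metrics are just ratios of counts from the confusion matrix. A worked sketch with made-up numbers (37 malignant cases caught, 3 missed, 2 benign cases wrongly flagged):

```python
tp = 37  # actual malignant, predicted malignant
fp = 2   # actual benign, predicted malignant
fn = 3   # actual malignant, predicted benign

precision = tp / (tp + fp)  # how trustworthy a "malignant" call is
recall = tp / (tp + fn)     # how many real cases we caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.949 recall=0.925 f1=0.937
```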
Confusion matrix
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(
    cm, annot=True, fmt="d", cmap="Blues",
    xticklabels=data.target_names,
    yticklabels=data.target_names,
)
plt.title("Confusion Matrix")
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()
Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
print(classification_report(y_test, y_pred_rf, target_names=data.target_names))
# accuracy: 0.96-0.97
Cross-Validation — A More Reliable Score
One train/test split can be lucky or unlucky. Cross-validation gives a more reliable estimate:
from sklearn.model_selection import cross_val_score
model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"CV scores: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")
# CV scores: [0.974 0.956 0.965 0.965 0.982]
# Mean: 0.968
# Std: 0.009
5-fold CV splits the data into 5 parts, trains on 4 and tests on 1, rotates 5 times. The mean score and standard deviation tell you how well and how consistently the model performs.
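Under the hood this is just a loop. A sketch of 5-fold CV written by hand, using plain KFold (note that cross_val_score actually defaults to stratified folds for classifiers):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    model = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
    model.fit(X[train_idx], y[train_idx])                 # train on 4 folds
    scores.append(model.score(X[test_idx], y[test_idx]))  # score on the held-out 5th

print([round(s, 3) for s in scores])  # one accuracy per rotation
```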
Pipelines — Prevent Data Leakage
Manually applying a scaler before cross-validation is dangerous — if you fit the scaler on all the data before the CV split, you've leaked test information into training.
The fix is a Pipeline that chains preprocessing and model together:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"Pipeline CV: {scores.mean():.3f} ± {scores.std():.3f}")
The pipeline scales inside each fold — the scaler never sees the test fold. This is the correct way to evaluate models.
Hyperparameter Tuning with GridSearchCV
Every model has hyperparameters — settings you choose, not learned from data. Finding the best settings is called hyperparameter tuning.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
    verbose=1,
)
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_:.3f}")
best_model = grid.best_estimator_
GridSearchCV tries every combination of parameters (3x3x2 = 18 combinations here) and picks the best using cross-validation.
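You can check the arithmetic with ParameterGrid, which enumerates exactly the combinations GridSearchCV will try:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
}

combos = list(ParameterGrid(param_grid))
print(len(combos))      # 18 combinations (3 x 3 x 2)
print(len(combos) * 5)  # 90 model fits in total with cv=5
```

That multiplier is why grids get expensive fast: every new value of every parameter multiplies the number of fits.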
For large grids, use RandomizedSearchCV instead — it samples a fixed number of random combinations, which is much faster:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_dist = {
    "n_estimators": randint(50, 500),
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": randint(2, 20),
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=20,  # try 20 random combinations
    cv=5,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X, y)
print(random_search.best_params_)
Unsupervised Learning — Clustering
When you don't have labels, clustering finds natural groups:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Generate sample data
from sklearn.datasets import make_blobs
X, true_labels = make_blobs(n_samples=300, centers=4, random_state=42)
# Scale
scaler = StandardScaler()
X_s = scaler.fit_transform(X)
# Fit KMeans
kmeans = KMeans(n_clusters=4, random_state=42, n_init="auto")
kmeans.fit(X_s)
labels = kmeans.labels_
# Plot
plt.figure(figsize=(8, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="tab10", alpha=0.7, s=40)
centers = scaler.inverse_transform(kmeans.cluster_centers_)  # back to original units
plt.scatter(
    centers[:, 0], centers[:, 1],
    marker="X", s=200, color="black", label="Centroids"
)
plt.title("KMeans Clustering")
plt.legend()
plt.tight_layout()
plt.show()
Full Project — House Price Predictor
"""
house_price_predictor.py
Predict California median house values using a
Random Forest with a Pipeline and GridSearchCV.
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
def load_data():
    housing = fetch_california_housing(as_frame=True)
    X, y = housing.data, housing.target
    print(f"Dataset: {X.shape[0]:,} samples, {X.shape[1]} features")
    print(f"Target range: ${y.min()*100_000:,.0f} -- ${y.max()*100_000:,.0f}")
    return X, y

def build_pipeline():
    return Pipeline([
        ("scaler", StandardScaler()),
        ("model", RandomForestRegressor(
            n_estimators=200,
            max_depth=None,
            random_state=42,
            n_jobs=-1,
        )),
    ])

def evaluate(model, X_test, y_test):
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"R^2 Score: {r2:.3f} (1.0 = perfect)")
    print(f"RMSE: ${rmse * 100_000:,.0f}")
    return y_pred

def plot_predictions(y_test, y_pred):
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    # Predicted vs Actual
    axes[0].scatter(y_test, y_pred, alpha=0.3, s=10, color="steelblue")
    lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
    axes[0].plot(lims, lims, "r--", linewidth=2, label="Perfect prediction")
    axes[0].set_xlabel("Actual Price ($100k)")
    axes[0].set_ylabel("Predicted Price ($100k)")
    axes[0].set_title("Predicted vs Actual")
    axes[0].legend()
    # Residuals
    residuals = y_test - y_pred
    axes[1].hist(residuals, bins=50, color="coral", edgecolor="white", alpha=0.8)
    axes[1].axvline(0, color="black", linewidth=2)
    axes[1].set_title("Residuals Distribution")
    axes[1].set_xlabel("Prediction Error ($100k)")
    axes[1].set_ylabel("Count")
    plt.tight_layout()
    plt.savefig("house_predictions.png", dpi=150)
    plt.show()

def feature_importance(model, feature_names):
    rf = model.named_steps["model"]
    importances = pd.Series(rf.feature_importances_, index=feature_names)
    importances.sort_values().plot(kind="barh", figsize=(8, 4), color="steelblue")
    plt.title("Feature Importances")
    plt.tight_layout()
    plt.savefig("feature_importance.png", dpi=150)
    plt.show()

if __name__ == "__main__":
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    pipe = build_pipeline()
    print("\nTraining model...")
    pipe.fit(X_train, y_train)
    print("\n=== Test Set Results ===")
    y_pred = evaluate(pipe, X_test, y_test)
    print("\n=== 5-Fold Cross-Validation ===")
    cv_scores = cross_val_score(pipe, X, y, cv=5, scoring="r2", n_jobs=-1)
    print(f"CV R^2 scores: {cv_scores.round(3)}")
    print(f"Mean: {cv_scores.mean():.3f} Std: {cv_scores.std():.3f}")
    plot_predictions(y_test, y_pred)
    feature_importance(pipe, X.columns)
Output:
Dataset: 20,640 samples, 8 features
Target range: $14,999 -- $500,001
=== Test Set Results ===
R^2 Score: 0.817 (1.0 = perfect)
RMSE: $52,831
=== 5-Fold Cross-Validation ===
CV R^2 scores: [0.808 0.812 0.821 0.806 0.810]
Mean: 0.811 Std: 0.005
Choosing the Right Algorithm
| Problem | First try | If you need more |
|---|---|---|
| Predict a number | Linear Regression | Random Forest, Gradient Boosting |
| Classify into 2+ classes | Logistic Regression | Random Forest, SVM |
| Find groups | KMeans | DBSCAN, Agglomerative |
| Reduce dimensions | PCA | UMAP, t-SNE |
Start simple. Add complexity only when the simpler model doesn't perform well enough.
What You Learned in This Chapter
- Machine learning has two main types: supervised (predict from labeled examples) and unsupervised (find structure without labels).
- The scikit-learn API is always: model.fit(X_train, y_train) -> model.predict(X_test) -> model.score(X_test, y_test).
- Always split data with train_test_split. Never evaluate on training data.
- Scale features with StandardScaler. Fit it only on training data.
- Linear Regression predicts continuous values. Evaluate with R^2 and RMSE.
- Random Forest is often the best starting point — handles non-linearity, doesn't require scaling, shows feature importances.
- Logistic Regression classifies. Evaluate with precision, recall, F1, and a confusion matrix.
- Cross-validation (cross_val_score) gives more reliable performance estimates than a single split.
- Pipelines chain preprocessing and models together to prevent data leakage during cross-validation.
- GridSearchCV and RandomizedSearchCV automate hyperparameter tuning.
- KMeans clusters data without labels.
What's Next?
Chapter 46 covers Automation and Scripting — using Python to automate repetitive tasks: renaming files, sending emails, processing PDFs, working with Excel, and scheduling tasks to run daily.