Python: Zero to Hero

Chapter 33: Data Science with Python

Python became the language of data science for one reason: the ecosystem. numpy for fast array math. pandas for working with tabular data. matplotlib and seaborn for visualization. scikit-learn for machine learning. These libraries, used together, let you go from raw data to trained model to chart in a few hundred lines of code.

This chapter teaches you each one, then shows how they work together in a complete data analysis workflow.

The Data Science Workflow

Every data science project follows the same steps:

1. Load data         — read CSV, database, API, JSON
2. Explore           — understand shape, types, missing values, distributions
3. Clean             — fix missing values, wrong types, duplicates, outliers
4. Analyze           — compute statistics, find patterns, test hypotheses
5. Visualize         — charts that reveal what numbers hide
6. Model             — train a machine learning model (if needed)
7. Evaluate          — measure how good the model is
8. Communicate       — share findings clearly
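Steps 1 through 4 compress to just a few lines of pandas. A toy sketch with an invented in-memory table (a real project would start from something like pd.read_csv):

```python
import pandas as pd

# 1. Load — here from an in-memory dict; in practice, read from CSV, SQL, etc.
df = pd.DataFrame({"dept": ["Eng", "HR", "Eng"], "salary": [95000, None, 110000]})

# 2. Explore — how big is it, and what's missing?
print(df.shape, df["salary"].isnull().sum())

# 3. Clean — fill the missing salary with the column median
df["salary"] = df["salary"].fillna(df["salary"].median())

# 4. Analyze — average salary per department
summary = df.groupby("dept")["salary"].mean()
print(summary)
```

The rest of the chapter fills in each step in detail.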

numpy — Fast Array Mathematics

numpy is the foundation of the Python data science stack. It provides the ndarray — a typed, fixed-size, multidimensional array stored in contiguous memory. Operations on numpy arrays run in C — 10-100x faster than Python loops.

pip install numpy
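That speed claim is easy to check yourself. A rough timing sketch, summing a million numbers both ways (exact numbers vary by machine, but the gap is consistently large):

```python
import time
import numpy as np

a = np.arange(1_000_000)

# Pure-Python loop over the array
t0 = time.perf_counter()
total_loop = 0
for x in a:
    total_loop += x
t_loop = time.perf_counter() - t0

# Vectorized — the loop runs in C
t0 = time.perf_counter()
total_np = a.sum()
t_np = time.perf_counter() - t0

print(f"loop: {t_loop:.4f}s  numpy: {t_np:.5f}s  same result: {total_loop == total_np}")
```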

Creating arrays

import numpy as np

# From a list
a = np.array([1, 2, 3, 4, 5])
print(a)           # [1 2 3 4 5]
print(a.dtype)     # int64
print(a.shape)     # (5,)
print(a.ndim)      # 1

# Specify dtype
b = np.array([1.0, 2.0, 3.0], dtype=np.float32)

# Built-in constructors
zeros  = np.zeros(5)                  # [0. 0. 0. 0. 0.]
ones   = np.ones((3, 4))              # 3x4 matrix of 1s
empty  = np.empty((2, 2))             # uninitialized (fast)
rng    = np.arange(0, 10, 2)          # [0 2 4 6 8]
linsp  = np.linspace(0, 1, 5)         # [0.   0.25 0.5  0.75 1.  ]
eye    = np.eye(3)                    # 3x3 identity matrix
rand   = np.random.rand(3, 3)         # 3x3 random floats [0, 1)
randn  = np.random.randn(3, 3)        # 3x3 standard normal
randint= np.random.randint(0, 10, (3, 3))  # 3x3 random ints
# (newer code often uses a Generator instead: rng = np.random.default_rng())

Vectorized operations

a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

# Element-wise — no loops needed
print(a + b)       # [11 22 33 44 55]
print(a * b)       # [10 40 90 160 250]
print(a ** 2)      # [ 1  4  9 16 25]
print(np.sqrt(a))  # [1.   1.41 1.73 2.   2.24]

# Broadcasting — operate on arrays of different shapes
matrix = np.ones((3, 4))
row    = np.array([1, 2, 3, 4])
print(matrix + row)   # adds row to each row of matrix

col = np.array([[1], [2], [3]])
print(matrix + col)   # adds col to each column

Indexing, slicing, and masking

a = np.array([10, 20, 30, 40, 50])

print(a[0])      # 10
print(a[-1])     # 50
print(a[1:4])    # [20 30 40]
print(a[::2])    # [10 30 50]

# 2D indexing
m = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(m[1, 2])    # 6   (row 1, col 2)
print(m[:, 1])    # [1 5 9]  (all rows, col 1)
print(m[0, :])    # [0 1 2 3]  (row 0, all cols)
print(m[1:, 2:])  # [[6 7] [10 11]]  (submatrix)

# Boolean masking
a = np.array([1, -2, 3, -4, 5])
mask = a > 0
print(mask)    # [ True False  True False  True]
print(a[mask]) # [1 3 5]
print(a[a > 0]) # same thing — inline mask
a[a < 0] = 0   # set negatives to zero in-place
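Closely related to boolean masking is np.where, which selects element-wise between two values, or (with a single argument) returns the indices where a condition holds:

```python
import numpy as np

a = np.array([1, -2, 3, -4, 5])

# np.where(condition, value_if_true, value_if_false) — element-wise selection
clipped = np.where(a < 0, 0, a)
print(clipped)          # [1 0 3 0 5]

# With one argument, np.where returns the indices where the condition is True
idx = np.where(a > 0)[0]
print(idx)              # [0 2 4]
```

Unlike `a[a < 0] = 0`, np.where returns a new array and leaves the original untouched.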

Aggregation and statistics

data = np.array([4, 7, 13, 2, 9, 1, 5, 8])

print(np.sum(data))     # 49
print(np.mean(data))    # 6.125
print(np.median(data))  # 6.0
print(np.std(data))     # 3.69
print(np.var(data))     # 13.61
print(np.min(data))     # 1
print(np.max(data))     # 13
print(np.argmin(data))  # 5  (index of minimum)
print(np.argmax(data))  # 2  (index of maximum)
print(np.percentile(data, [25, 50, 75]))  # quartiles

# Axis-wise for 2D
m = np.array([[1, 2, 3], [4, 5, 6]])
print(np.sum(m, axis=0))   # [5 7 9]  — sum each column
print(np.sum(m, axis=1))   # [ 6 15]  — sum each row

Linear algebra

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A @ B)          # matrix multiplication
print(A.T)            # transpose
print(np.linalg.det(A))          # determinant: -2.0
print(np.linalg.inv(A))          # inverse
eigenvalues, eigenvectors = np.linalg.eig(A)
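One more np.linalg tool worth knowing: solve, for linear systems. A small sketch (preferred over computing the inverse explicitly, as it is both faster and more numerically accurate):

```python
import numpy as np

# Solve the system  x + 2y = 5,  3x + 4y = 6  — i.e. find v such that A @ v = b
A = np.array([[1, 2], [3, 4]])
b = np.array([5, 6])

v = np.linalg.solve(A, b)   # better than np.linalg.inv(A) @ b
print(v)                    # [-4.   4.5]

# Verify: A @ v should reproduce b
print(np.allclose(A @ v, b))   # True
```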

pandas — Tabular Data

pandas adds two data structures on top of numpy: Series (1D) and DataFrame (2D table). A DataFrame is like a spreadsheet or database table — columns have names and types, rows have an index.

pip install pandas
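Before the DataFrame examples, a quick look at a Series on its own. Every column you pull out of a DataFrame is a Series, so its behavior carries over directly:

```python
import pandas as pd

# A Series is a 1D array where every value has an index label
s = pd.Series([95000, 82000, 110000],
              index=["Alice", "Bob", "Carlos"], name="salary")

print(s["Bob"])        # 82000 — label-based lookup
print(s.mean())        # 95666.66...
print(s[s > 90000])    # boolean filtering works just like numpy
```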

Creating DataFrames

import pandas as pd

# From a dict
df = pd.DataFrame({
    "name":   ["Alice", "Bob", "Carlos", "Diana"],
    "age":    [30, 25, 35, 28],
    "salary": [95000, 82000, 110000, 78000],
    "dept":   ["Eng", "HR", "Eng", "Sales"],
})

print(df)
#      name  age  salary   dept
# 0   Alice   30   95000    Eng
# 1     Bob   25   82000     HR
# 2  Carlos   35  110000    Eng
# 3   Diana   28   78000  Sales

print(df.shape)   # (4, 4)
print(df.dtypes)
# name       object
# age         int64
# salary      int64
# dept       object

Loading data from files

# CSV — most common
df = pd.read_csv("data.csv")
df = pd.read_csv("data.csv", sep=";", encoding="utf-8", index_col="id")

# Excel
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# JSON
df = pd.read_json("data.json")

# From a SQL query
import sqlite3
conn = sqlite3.connect("myapp.db")
df   = pd.read_sql("SELECT * FROM users", conn)

# Save back to file
df.to_csv("output.csv", index=False)
df.to_excel("output.xlsx", index=False)
df.to_json("output.json", orient="records")

Exploring data

df = pd.read_csv("sales.csv")

print(df.head())          # first 5 rows
print(df.tail(3))         # last 3 rows
print(df.shape)           # (rows, columns)
print(df.columns.tolist()) # column names
print(df.dtypes)           # column types
df.info()                  # shape + types + null counts (prints directly, returns None)
print(df.describe())       # count, mean, std, min, quartiles, max

# Missing values
print(df.isnull().sum())   # null count per column
print(df.isnull().mean())  # null fraction per column

Selecting data

# Select a column — returns a Series
names = df["name"]

# Select multiple columns — returns a DataFrame
subset = df[["name", "salary"]]

# Select rows by position (like list slicing)
first_three = df.iloc[0:3]
last_row    = df.iloc[-1]
cell        = df.iloc[0, 1]    # row 0, column 1

# Select rows by label/condition (loc)
row         = df.loc[0]        # row with index 0
eng_staff   = df.loc[df["dept"] == "Eng"]

# Boolean filtering
high_earners = df[df["salary"] > 90000]
eng_seniors  = df[(df["dept"] == "Eng") & (df["age"] > 30)]
hr_or_sales  = df[df["dept"].isin(["HR", "Sales"])]
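An alternative worth knowing for compound filters is query(), which takes the condition as a string. Many people find it more readable than chained `&`/`|` masks:

```python
import pandas as pd

df = pd.DataFrame({
    "name":   ["Alice", "Bob", "Carlos", "Diana"],
    "age":    [30, 25, 35, 28],
    "salary": [95000, 82000, 110000, 78000],
    "dept":   ["Eng", "HR", "Eng", "Sales"],
})

# Equivalent to df[(df["dept"] == "Eng") & (df["age"] > 30)]
eng_seniors = df.query("dept == 'Eng' and age > 30")
print(eng_seniors["name"].tolist())   # ['Carlos']

# Reference Python variables inside the string with @
threshold = 90000
high = df.query("salary > @threshold")
```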

Cleaning data

# Drop rows with any null values
df_clean = df.dropna()

# Drop rows where specific column is null
df_clean = df.dropna(subset=["email"])

# Fill null values — assign back rather than using inplace=True
# (chained inplace fillna on a column is deprecated in modern pandas)
df["age"]   = df["age"].fillna(df["age"].median())
df["notes"] = df["notes"].fillna("")

# Drop duplicate rows
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["email"])   # deduplicate by email

# Rename columns
df = df.rename(columns={"dept": "department", "salary": "annual_salary"})

# Change data types
df["age"]    = df["age"].astype(int)
df["date"]   = pd.to_datetime(df["date"])
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")  # NaN for bad values

# Strip whitespace from string columns
df["name"] = df["name"].str.strip()
df["name"] = df["name"].str.title()   # Title Case

# Remove outliers (beyond 3 standard deviations)
col  = df["salary"]
mask = (col - col.mean()).abs() <= 3 * col.std()
df   = df[mask]
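The 3-standard-deviation rule assumes roughly bell-shaped data; extreme outliers also inflate the mean and std it relies on. A common robust alternative is the IQR fence, which flags values more than 1.5 × IQR outside the quartiles. A sketch on invented data:

```python
import pandas as pd

s = pd.Series([50, 52, 55, 53, 51, 54, 200])   # 200 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr  = q3 - q1
mask = s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(s[mask].tolist())   # 200 is dropped, the rest survive
```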

Transforming data

# Add a computed column
df["monthly_salary"] = df["salary"] / 12
df["senior"]         = df["age"] >= 35

# Apply a function to a column
df["name_upper"] = df["name"].str.upper()
df["tax"]        = df["salary"].apply(lambda s: s * 0.3 if s > 100000 else s * 0.2)

# Apply a function to every cell
df_normalized = df[["age", "salary"]].apply(
    lambda col: (col - col.min()) / (col.max() - col.min())
)

# Map values
dept_map = {"Eng": "Engineering", "HR": "Human Resources", "Sales": "Sales"}
df["dept_full"] = df["dept"].map(dept_map)
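A note on performance: .apply with a lambda runs row by row in Python. For a simple numeric condition like the tax rule above, np.where is vectorized and usually much faster:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [95000, 82000, 110000, 78000]})

# Vectorized equivalent of .apply(lambda s: s * 0.3 if s > 100000 else s * 0.2)
df["tax"] = np.where(df["salary"] > 100000,
                     df["salary"] * 0.3,
                     df["salary"] * 0.2)
print(df["tax"].tolist())   # [19000.0, 16400.0, 33000.0, 15600.0]
```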

Grouping and aggregation

# Group by one column
dept_stats = df.groupby("dept").agg({
    "salary": ["mean", "min", "max", "count"],
    "age":    "mean",
})
print(dept_stats)

# Rename the multi-level columns
dept_stats.columns = ["avg_salary", "min_salary", "max_salary", "headcount", "avg_age"]

# Group by multiple columns
df.groupby(["dept", "senior"])["salary"].mean()

# Pivot table
pivot = df.pivot_table(
    values  = "salary",
    index   = "dept",
    columns = "senior",
    aggfunc = "mean",
)
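Renaming multi-level columns by position, as shown earlier, breaks silently if the agg spec changes. Named aggregation produces flat, named columns directly:

```python
import pandas as pd

df = pd.DataFrame({
    "dept":   ["Eng", "HR", "Eng", "Sales"],
    "salary": [95000, 82000, 110000, 78000],
})

# Named aggregation: new_column=("source_column", "aggfunc")
dept_stats = df.groupby("dept").agg(
    avg_salary=("salary", "mean"),
    max_salary=("salary", "max"),
    headcount=("salary", "count"),
)
print(dept_stats)   # flat columns, no renaming step needed
```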

Sorting and ranking

df_sorted = df.sort_values("salary", ascending=False)
df_sorted = df.sort_values(["dept", "salary"], ascending=[True, False])

df["salary_rank"] = df["salary"].rank(ascending=False).astype(int)

Merging DataFrames

users  = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carlos"]})
orders = pd.DataFrame({"user_id": [1, 1, 2], "amount": [50, 30, 120]})

# Inner join — only rows that match in both
merged = pd.merge(users, orders, left_on="id", right_on="user_id")

# Left join — all rows from left, NaN where no match in right
merged = pd.merge(users, orders, left_on="id", right_on="user_id", how="left")

# Stack DataFrames vertically
combined = pd.concat([df1, df2], ignore_index=True)
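When auditing a merge, how="outer" combined with indicator=True keeps every row and labels where each one came from, which makes unmatched rows easy to spot:

```python
import pandas as pd

users  = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carlos"]})
orders = pd.DataFrame({"user_id": [1, 1, 2], "amount": [50, 30, 120]})

# indicator=True adds a _merge column: "both", "left_only", or "right_only"
merged = pd.merge(users, orders, left_on="id", right_on="user_id",
                  how="outer", indicator=True)
print(merged["_merge"].value_counts())
# Carlos shows up as "left_only" — he has no orders
```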

matplotlib — Data Visualization

matplotlib is Python's core plotting library. Most of the Python visualization ecosystem, including seaborn and pandas' .plot(), builds on top of it.

pip install matplotlib seaborn
import matplotlib.pyplot as plt
import numpy as np

# Line plot
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

plt.figure(figsize=(10, 4))
plt.plot(x, y, color="steelblue", linewidth=2, label="sin(x)")
plt.plot(x, np.cos(x), color="coral", linestyle="--", label="cos(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Sine and Cosine")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("trig.png", dpi=150)
plt.show()

Common chart types

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Bar chart
depts  = ["Eng", "HR", "Sales"]
counts = [15, 8, 12]
axes[0, 0].bar(depts, counts, color="steelblue")
axes[0, 0].set_title("Headcount by Department")

# Horizontal bar
axes[0, 1].barh(depts, counts, color="coral")
axes[0, 1].set_title("Headcount (horizontal)")

# Histogram
data = np.random.normal(70000, 15000, 500)
axes[0, 2].hist(data, bins=30, color="mediumpurple", edgecolor="white")
axes[0, 2].set_title("Salary Distribution")
axes[0, 2].set_xlabel("Salary ($)")

# Scatter plot
x = np.random.rand(100)
y = x * 2 + np.random.randn(100) * 0.3
axes[1, 0].scatter(x, y, alpha=0.6, c="steelblue", edgecolors="white")
axes[1, 0].set_title("Scatter Plot")

# Pie chart
sizes  = [45, 25, 30]
labels = ["Eng", "HR", "Sales"]
axes[1, 1].pie(sizes, labels=labels, autopct="%1.1f%%", startangle=90)
axes[1, 1].set_title("Department Share")

# Box plot
data_by_group = [np.random.normal(loc, 5, 50) for loc in [60, 70, 80]]
axes[1, 2].boxplot(data_by_group, labels=["Q1", "Q2", "Q3"])
axes[1, 2].set_title("Score by Quarter")

plt.tight_layout()
plt.savefig("charts.png", dpi=150)
plt.show()

seaborn — statistical visualization

seaborn builds on matplotlib with a better default style and higher-level statistical charts:

import seaborn as sns
import matplotlib.pyplot as plt   # needed for plt.subplots below
import pandas as pd
import numpy as np

# Use the built-in tips dataset for demonstration
tips = sns.load_dataset("tips")

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Distribution plot
sns.histplot(tips["total_bill"], kde=True, ax=axes[0, 0])
axes[0, 0].set_title("Bill Distribution")

# Scatter with regression line
sns.regplot(x="total_bill", y="tip", data=tips, ax=axes[0, 1])
axes[0, 1].set_title("Tip vs Bill")

# Box plot by category
sns.boxplot(x="day", y="total_bill", data=tips, ax=axes[1, 0])
axes[1, 0].set_title("Bills by Day")

# Heatmap — correlation matrix
numeric_cols = tips.select_dtypes(include="number")
corr = numeric_cols.corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", ax=axes[1, 1])
axes[1, 1].set_title("Correlation Matrix")

plt.tight_layout()
plt.savefig("seaborn_charts.png", dpi=150)
plt.show()

scikit-learn — Machine Learning

scikit-learn provides a consistent API for dozens of machine learning algorithms. Every model follows the same pattern: fit(), predict(), score().

pip install scikit-learn

The scikit-learn API

from sklearn.model_selection import train_test_split
from sklearn.preprocessing   import StandardScaler
from sklearn.linear_model    import LinearRegression
from sklearn.metrics         import mean_squared_error, r2_score
import numpy as np

# 1. Prepare data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 14.2, 15.9])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 2. Train model
model = LinearRegression()
model.fit(X_train, y_train)

# 3. Predict
y_pred = model.predict(X_test)

# 4. Evaluate
mse  = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2   = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.3f}")
print(f"R^2:   {r2:.3f}")
print(f"Coefficient: {model.coef_[0]:.3f}")
print(f"Intercept:   {model.intercept_:.3f}")

A classification example

from sklearn.datasets        import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing   import StandardScaler
from sklearn.ensemble        import RandomForestClassifier
from sklearn.metrics         import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load built-in dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler  = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)   # use same scaler — never fit on test data

# Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Cross-validation — more robust estimate of performance
cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=iris.target_names,
            yticklabels=iris.target_names)
plt.title("Confusion Matrix")
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.tight_layout()
plt.savefig("confusion_matrix.png")

Pipelines — the right way to combine steps

A Pipeline chains preprocessing and modeling steps, preventing data leakage and making deployment clean:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Suppose our features have mixed types
numeric_features     = ["age", "salary"]
categorical_features = ["dept", "location"]

# Preprocessing for each type
numeric_transformer     = Pipeline([("scaler", StandardScaler())])
categorical_transformer = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer,     numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

# Full pipeline: preprocess -> model
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier",   GradientBoostingClassifier(random_state=42)),
])

# Hyperparameter search over the whole pipeline
param_grid = {
    "classifier__n_estimators":  [50, 100, 200],
    "classifier__learning_rate": [0.05, 0.1, 0.2],
    "classifier__max_depth":     [3, 4, 5],
}

grid_search = GridSearchCV(
    full_pipeline,
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,     # use all CPU cores
    verbose=1,
)

# grid_search.fit(X_train, y_train)
# print(f"Best params: {grid_search.best_params_}")
# print(f"Best score:  {grid_search.best_score_:.3f}")
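The grid search above is left unfitted because the mixed-type DataFrame is hypothetical. To see the same pattern run end to end, here is a minimal runnable variant on the built-in iris data. With all-numeric features the preprocessor reduces to a scaler; LogisticRegression and the C grid are choices made for this sketch, not the only options:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# All-numeric features, so the pipeline is just scaler -> model
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# The scaler is re-fit inside every CV fold — no leakage
grid = GridSearchCV(pipe, {"classifier__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)
print(f"Best C: {grid.best_params_['classifier__C']}, "
      f"test accuracy: {grid.score(X_test, y_test):.3f}")
```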

Project: End-to-End Sales Analysis

Let's tie everything together in a complete analysis workflow:

"""
sales_analysis.py — End-to-end data analysis of sales data.
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from pathlib import Path


# ── 1. Generate synthetic data ────────────────────────────────────────────────

np.random.seed(42)
rng = pd.date_range("2024-01-01", "2024-12-31", freq="D")

n = 500
data = pd.DataFrame({
    "date":      np.random.choice(rng, n),
    "product":   np.random.choice(["Laptop", "Phone", "Tablet", "Watch"], n,
                                   p=[0.2, 0.4, 0.3, 0.1]),
    "region":    np.random.choice(["North", "South", "East", "West"], n),
    "units":     np.random.randint(1, 20, n),
    "unit_price":np.random.choice([999.99, 699.99, 499.99, 299.99], n,
                                   p=[0.2, 0.4, 0.3, 0.1]),
    "rep":       np.random.choice([f"Rep{i:02d}" for i in range(1, 11)], n),
})

# Inject some missing values and noise
data.loc[np.random.choice(n, 20), "unit_price"] = np.nan
data.loc[np.random.choice(n, 5),  "units"]      = -1   # bad data


# ── 2. Clean ──────────────────────────────────────────────────────────────────

# Remove impossible values
data = data[data["units"] > 0]

# Fill missing prices with median per product
data["unit_price"] = data.groupby("product")["unit_price"].transform(
    lambda x: x.fillna(x.median())
)

# Add computed columns
data["revenue"] = data["units"] * data["unit_price"]
data["month"]   = pd.to_datetime(data["date"]).dt.month
data["quarter"] = pd.to_datetime(data["date"]).dt.quarter

print(f"Loaded {len(data):,} records after cleaning")
print(f"Total revenue: ${data['revenue'].sum():,.0f}\n")


# ── 3. Analyze ────────────────────────────────────────────────────────────────

# Revenue by product
by_product = (
    data.groupby("product")["revenue"]
    .agg(["sum", "mean", "count"])
    .rename(columns={"sum": "total", "mean": "avg_order", "count": "orders"})
    .sort_values("total", ascending=False)
)
print("Revenue by product:")
for product, row in by_product.iterrows():
    print(f"  {product:<8}  ${row['total']:>10,.0f}  ({row['orders']} orders)")

# Revenue by region
by_region = data.groupby("region")["revenue"].sum().sort_values(ascending=False)

# Monthly trend
monthly = data.groupby("month")["revenue"].sum()

# Top sales reps
top_reps = data.groupby("rep")["revenue"].sum().nlargest(5)


# ── 4. Visualize ──────────────────────────────────────────────────────────────

sns.set_theme(style="whitegrid", palette="muted")
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle("2024 Sales Dashboard", fontsize=16, fontweight="bold")

# Revenue by product (bar)
by_product["total"].plot(kind="bar", ax=axes[0, 0], color="steelblue", edgecolor="white")
axes[0, 0].set_title("Total Revenue by Product")
axes[0, 0].set_xlabel("")
axes[0, 0].yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f"${x/1e3:.0f}K"))
axes[0, 0].tick_params(axis="x", rotation=0)

# Revenue by region (horizontal bar)
by_region.plot(kind="barh", ax=axes[0, 1], color="coral", edgecolor="white")
axes[0, 1].set_title("Revenue by Region")
axes[0, 1].xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f"${x/1e3:.0f}K"))

# Monthly revenue trend (line)
monthly.plot(kind="line", ax=axes[1, 0], marker="o", color="mediumpurple", linewidth=2)
axes[1, 0].set_title("Monthly Revenue")
axes[1, 0].set_xlabel("Month")
axes[1, 0].yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f"${x/1e3:.0f}K"))

# Top reps (bar)
top_reps.plot(kind="bar", ax=axes[1, 1], color="mediumseagreen", edgecolor="white")
axes[1, 1].set_title("Top 5 Sales Reps")
axes[1, 1].set_xlabel("")
axes[1, 1].yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f"${x/1e3:.0f}K"))
axes[1, 1].tick_params(axis="x", rotation=30)

plt.tight_layout()
plt.savefig("sales_dashboard.png", dpi=150, bbox_inches="tight")
plt.show()
print("Dashboard saved to sales_dashboard.png")


# ── 5. Simple forecast with scikit-learn ──────────────────────────────────────

from sklearn.linear_model    import LinearRegression
from sklearn.metrics         import mean_absolute_percentage_error

# Use month as a feature to predict revenue
X = monthly.index.values.reshape(-1, 1)
y = monthly.values

model = LinearRegression()
model.fit(X, y)

# Predict next 3 months (months 13, 14, 15)
next_months = np.array([[13], [14], [15]])
forecast    = model.predict(next_months)

print("\nForecast for next 3 months:")
for month, pred in zip([13, 14, 15], forecast):
    print(f"  Month {month}: ${pred:,.0f}")

# MAPE on training data
mape = mean_absolute_percentage_error(y, model.predict(X))
print(f"\nTraining MAPE: {mape:.1%}")

The Data Science Toolkit Summary

Library        Purpose                      Key objects
numpy          Fast array math              ndarray
pandas         Tabular data                 DataFrame, Series
matplotlib     Low-level plotting           Figure, Axes
seaborn        Statistical visualization    High-level wrappers
scikit-learn   Machine learning             Pipeline, fit/predict/score
scipy          Scientific computing         Statistics, optimization, signal
statsmodels    Statistical modeling         Regression, time series
plotly         Interactive charts           Web-ready visualizations
jupyter        Interactive notebooks        .ipynb files

What You Learned in This Chapter

  • The data science workflow: load -> explore -> clean -> analyze -> visualize -> model -> evaluate.
  • numpy provides the ndarray — vectorized operations run in C, 10-100x faster than Python loops. Key tools: broadcasting, boolean masking, reshape, argmin/argmax, np.linalg.
  • pandas DataFrame is the workhorse for tabular data. read_csv/to_csv, head/info/describe, iloc/loc, boolean filtering, fillna/dropna/drop_duplicates, groupby/agg, merge/concat, apply/map.
  • Never fit a scaler or encoder on test data — only on training data.
  • matplotlib — plt.figure, plt.plot/bar/hist/scatter/pie, axes.set_title/xlabel/ylabel, plt.savefig.
  • seaborn — histplot, regplot, boxplot, heatmap — higher-level, better defaults.
  • scikit-learn — train_test_split, StandardScaler, fit/predict/score, classification_report, cross_val_score, Pipeline, GridSearchCV.
  • A Pipeline chains preprocessing and modeling — prevents data leakage and makes deployment clean.

What's Next?

Chapter 34 covers Packaging and Distribution — pyproject.toml, building installable packages, publishing to PyPI, and setting up a professional Python project that others can install with pip install yourpackage.

© 2026 Abhilash Sahoo. Python: Zero to Hero.