Chapter 43: Data Science with Pandas
NumPy is great for numbers. But real-world data has labels, mixed types, missing values, and dates. That's what pandas is built for.
pandas gives you the DataFrame — a table with named columns and rows. Think of it as a Python-powered spreadsheet that you control entirely with code. You can load a CSV, clean it, filter it, group it, and visualise it in fewer than twenty lines.
Setup
pip install pandas openpyxl xlrd
import pandas as pd
import numpy as np
openpyxl reads .xlsx files. xlrd reads older .xls files.
The Two Core Objects
Series — a labelled 1D array
# From a list
s = pd.Series([10, 20, 30, 40])
print(s)
# 0 10
# 1 20
# 2 30
# 3 40
# dtype: int64
# With custom index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"]) # 20
# From a dict
s = pd.Series({"jan": 100, "feb": 85, "mar": 92})
print(s.index) # Index(['jan', 'feb', 'mar'], dtype='object')
DataFrame — a labelled 2D table
# From a dict of lists
df = pd.DataFrame({
"name": ["Alice", "Bob", "Carol", "Dave"],
"age": [28, 34, 22, 45],
"score": [88.5, 72.0, 95.3, 61.8],
"passed": [True, True, True, False],
})
print(df)
# name age score passed
# 0 Alice 28 88.5 True
# 1 Bob 34 72.0 True
# 2 Carol 22 95.3 True
# 3 Dave 45 61.8 False
Every column is a Series. The df itself is a collection of aligned Series sharing the same index.
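You can verify this directly — a quick sketch with a toy DataFrame:

```python
import pandas as pd

# Toy DataFrame, same shape of idea as the df above
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "age": [28, 34, 22, 45],
})

ages = df["age"]                     # a single column is a Series
print(type(ages).__name__)           # Series
print(ages.index.equals(df.index))   # True — the column shares the DataFrame's index
```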
Reading and Writing Data
CSV
# Read
df = pd.read_csv("sales.csv")
df = pd.read_csv("sales.csv", index_col="id") # use a column as the index
df = pd.read_csv("sales.csv", parse_dates=["date"]) # parse date columns
# Write
df.to_csv("output.csv", index=False) # index=False omits the row numbers
Excel
df = pd.read_excel("report.xlsx", sheet_name="Q1")
df.to_excel("output.xlsx", sheet_name="Results", index=False)
JSON
df = pd.read_json("data.json")
df.to_json("output.json", orient="records", indent=2)
SQL
import sqlite3
conn = sqlite3.connect("sales.db")
df = pd.read_sql("SELECT * FROM orders WHERE total > 100", conn)
df.to_sql("summary", conn, if_exists="replace", index=False)
conn.close()
Exploring a DataFrame
df = pd.read_csv("sales.csv")
df.shape # (1000, 8) — rows, columns
df.dtypes # data type of each column
df.columns # column names
df.index # row labels
df.head(5) # first 5 rows
df.tail(5) # last 5 rows
df.sample(5) # 5 random rows
df.info() # column names, non-null counts, dtypes, memory
df.describe() # count, mean, std, min, quartiles, max for numeric cols
df.describe(include="all") # includes non-numeric columns too
df.value_counts("category") # frequency of each category
df.nunique() # number of unique values per column
df.isnull().sum() # count missing values per column
Selecting Data
Columns
df["name"] # single column -> Series
df[["name", "score"]] # multiple columns -> DataFrame
Rows with .iloc — by position
df.iloc[0] # first row -> Series
df.iloc[0:5] # first 5 rows
df.iloc[0, 2] # row 0, column 2 (by position)
df.iloc[:, 1:4] # all rows, columns 1-3
Rows with .loc — by label
df.loc[0] # row with index label 0
df.loc[0:5] # rows 0 through 5 (inclusive!)
df.loc[0, "name"] # row 0, column "name"
df.loc[:, "name":"score"] # all rows, columns "name" through "score"
The key difference: .iloc uses integer positions (exclusive end), .loc uses labels (inclusive end).
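A quick sketch makes the difference visible (toy data with a default integer index, so the labels happen to match the positions):

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40]})

print(len(df.iloc[0:2]))  # 2 — positions 0 and 1, end excluded
print(len(df.loc[0:2]))   # 3 — labels 0, 1 and 2, end included
```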
Boolean filtering
# Single condition
df[df["score"] > 80]
# Multiple conditions — use & (and) and | (or), not Python's and/or
df[(df["score"] > 80) & (df["passed"] == True)]
df[(df["age"] < 30) | (df["score"] > 90)]
# isin — filter on a list of values
df[df["city"].isin(["London", "Paris", "Berlin"])]
# String methods
df[df["name"].str.startswith("A")]
df[df["email"].str.contains("@gmail")]
# query() — readable string syntax
df.query("score > 80 and age < 35")
df.query("city in ['London', 'Paris']")
Modifying Data
Adding and changing columns
# New column from calculation
df["tax"] = df["price"] * 0.2
df["total"] = df["price"] + df["tax"]
# Conditional column
df["grade"] = df["score"].apply(
lambda x: "A" if x >= 90 else ("B" if x >= 80 else "C")
)
# Using np.where — faster than apply for a simple two-way choice
import numpy as np
df["result"] = np.where(df["score"] >= 60, "pass", "fail")
# For a plain boolean, skip np.where — the comparison alone is enough
df["passed"] = df["score"] >= 60
# np.select — multiple conditions
conditions = [df["score"] >= 90, df["score"] >= 80, df["score"] >= 70]
choices = ["A", "B", "C"]
df["grade"] = np.select(conditions, choices, default="D")
Renaming columns
df.rename(columns={"name": "full_name", "score": "exam_score"}, inplace=True)
df.columns = [c.lower().replace(" ", "_") for c in df.columns] # clean all names
Dropping columns and rows
df.drop(columns=["tax", "unnecessary_col"], inplace=True)
df.drop(index=[0, 1, 2], inplace=True) # drop rows by index
df.dropna(inplace=True) # drop rows with any NaN
df.dropna(subset=["score", "name"]) # only drop if these cols are NaN
The apply function
# For element-wise string cleanup, vectorised .str methods beat apply
df["name"] = df["name"].str.strip().str.title()
# Apply a function to each row (axis=1)
def classify(row):
if row["score"] >= 90 and row["age"] < 30:
return "Star student"
return "Regular"
df["category"] = df.apply(classify, axis=1)
apply is flexible but slow on large DataFrames. Prefer vectorised operations (df["col"] * 2, np.where, .str methods) when possible.
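As a sketch of that advice, the row-wise classify function above can be replaced by a vectorised mask that produces the same column (toy data; np.where stands in for the if/else):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [95, 70, 92], "age": [25, 40, 31]})

# Row-wise apply: one Python function call per row
slow = df.apply(
    lambda row: "Star student" if row["score"] >= 90 and row["age"] < 30 else "Regular",
    axis=1,
)

# Vectorised equivalent: one boolean mask, evaluated in compiled code
fast = np.where((df["score"] >= 90) & (df["age"] < 30), "Star student", "Regular")

print((slow == fast).all())  # True — identical results
```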
Handling Missing Data
df.isnull().sum() # count NaN per column
df.isnull().sum() / len(df) # fraction missing
# Fill missing values — assign back; inplace=True on a selected column is
# deprecated (it may silently operate on a copy)
df["score"] = df["score"].fillna(df["score"].mean())  # fill with mean
df["city"] = df["city"].fillna("Unknown")             # fill with constant
df["score"] = df["score"].ffill()                     # forward fill
df["score"] = df["score"].bfill()                     # backward fill
# Drop rows with missing values
df.dropna(inplace=True) # any NaN -> drop row
df.dropna(subset=["score"], inplace=True) # NaN in 'score' -> drop row
df.dropna(thresh=5, inplace=True) # keep rows with at least 5 non-NaN
Sorting
df.sort_values("score", ascending=False) # descending
df.sort_values(["grade", "score"], ascending=[True, False]) # multi-column
df.sort_index() # sort by row index
Grouping and Aggregation
groupby splits the data into groups, applies a function to each group, and combines the results — the split-apply-combine pattern.
# Simple groupby
df.groupby("department")["salary"].mean()
# Multiple aggregations
df.groupby("department").agg(
avg_salary = ("salary", "mean"),
max_salary = ("salary", "max"),
employee_count = ("name", "count"),
)
# Group by multiple columns
df.groupby(["department", "level"])["salary"].mean()
# Apply custom function to each group
df.groupby("category")["score"].apply(lambda s: s.max() - s.min())
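To make the split-apply-combine idea concrete, here is the last example done by hand on toy data — iterate the groups (split), compute per group (apply), collect the results (combine):

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "B", "A", "B"],
    "score": [80, 60, 95, 90],
})

# Split-apply-combine spelled out
manual = {
    cat: int(group["score"].max() - group["score"].min())  # apply
    for cat, group in df.groupby("category")               # split
}                                                          # combine into a dict
print(manual)  # {'A': 15, 'B': 30}

# The one-line groupby equivalent gives the same answer
auto = df.groupby("category")["score"].apply(lambda s: s.max() - s.min())
print(auto.to_dict() == manual)  # True
```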
Merging and Joining
Just like SQL JOINs:
customers = pd.DataFrame({
"id": [1, 2, 3, 4],
"name": ["Alice", "Bob", "Carol", "Dave"],
})
orders = pd.DataFrame({
"order_id": [101, 102, 103, 104, 105],
"customer_id": [1, 2, 1, 3, 5],
"amount": [50, 30, 80, 20, 60],
})
# Inner join — only matching rows
pd.merge(orders, customers, left_on="customer_id", right_on="id")
# Left join — all orders, even without matching customer
pd.merge(orders, customers, left_on="customer_id", right_on="id", how="left")
# Right join, outer join also available
pd.merge(orders, customers, left_on="customer_id", right_on="id", how="outer")
# on= works when both DataFrames share the key column name —
# here that means renaming customers' "id" first
pd.merge(orders, customers.rename(columns={"id": "customer_id"}), on="customer_id")
# Concat — stack DataFrames vertically
combined = pd.concat([df_q1, df_q2, df_q3], ignore_index=True)
Working with Dates
# Parse dates on load
df = pd.read_csv("sales.csv", parse_dates=["date"])
# Convert a column to datetime
df["date"] = pd.to_datetime(df["date"])
# Extract components
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["weekday"] = df["date"].dt.day_name()
df["quarter"] = df["date"].dt.quarter
# Date arithmetic
df["days_since"] = (pd.Timestamp.today() - df["date"]).dt.days
# Filter by date range
start = pd.Timestamp("2024-01-01")
end = pd.Timestamp("2024-03-31")
q1 = df[(df["date"] >= start) & (df["date"] <= end)]
# Resample time series (group by time period)
df.set_index("date").resample("ME")["revenue"].sum() # monthly totals ("ME" = month-end; use "M" on pandas < 2.2)
df.set_index("date").resample("W")["revenue"].mean() # weekly averages
Practical Project — Sales Analysis
"""
sales_analysis.py
End-to-end analysis of a sales dataset using pandas.
"""
import pandas as pd
import numpy as np
from io import StringIO
# ── Sample data (in a real project, use pd.read_csv("sales.csv")) ────────────
RAW = """date,product,category,units,price,region
2024-01-05,Widget A,Electronics,10,29.99,North
2024-01-12,Widget B,Electronics,5,49.99,South
2024-01-15,Gadget X,Home,8,19.99,North
2024-02-03,Widget A,Electronics,15,29.99,East
2024-02-14,Gadget Y,Home,12,14.99,West
2024-02-20,Widget B,Electronics,3,49.99,North
2024-03-01,Gadget X,Home,20,19.99,South
2024-03-10,Widget A,Electronics,7,29.99,West
2024-03-22,Gadget Y,Home,9,14.99,East
2024-03-30,Widget B,Electronics,11,49.99,South
"""
# ── Load and prepare ──────────────────────────────────────────────────────────
df = pd.read_csv(StringIO(RAW), parse_dates=["date"])
df["revenue"] = df["units"] * df["price"]
df["month"] = df["date"].dt.to_period("M")
print("=== Data Overview ===")
print(df.dtypes)
print(f"\nRows: {len(df)}, Columns: {len(df.columns)}")
print(f"Date range: {df['date'].min().date()} -> {df['date'].max().date()}")
print(f"Missing values:\n{df.isnull().sum()}")
# ── Summary statistics ────────────────────────────────────────────────────────
print("\n=== Revenue Summary ===")
print(df["revenue"].describe().round(2))
# ── By category ──────────────────────────────────────────────────────────────
print("\n=== Revenue by Category ===")
by_cat = (
df.groupby("category")
.agg(
total_revenue = ("revenue", "sum"),
total_units = ("units", "sum"),
avg_price = ("price", "mean"),
num_orders = ("date", "count"),
)
.round(2)
.sort_values("total_revenue", ascending=False)
)
print(by_cat)
# ── Monthly trend ─────────────────────────────────────────────────────────────
print("\n=== Monthly Revenue ===")
monthly = df.groupby("month")["revenue"].sum().reset_index() # columns: month, revenue
for _, row in monthly.iterrows():
bar = "█" * int(row["revenue"] / 50)
print(f" {row['month']} {bar} ${row['revenue']:,.2f}")
# ── Top products ──────────────────────────────────────────────────────────────
print("\n=== Top Products by Revenue ===")
top_products = (
df.groupby("product")["revenue"]
.sum()
.sort_values(ascending=False)
.head(5)
)
for product, rev in top_products.items():
print(f" {product:15s} ${rev:,.2f}")
# ── Regional breakdown ────────────────────────────────────────────────────────
print("\n=== Revenue by Region ===")
by_region = df.pivot_table(
values="revenue",
index="region",
columns="category",
aggfunc="sum",
fill_value=0,
).round(2)
print(by_region)
# ── Save results ──────────────────────────────────────────────────────────────
df.to_csv("sales_clean.csv", index=False)
print("\nSaved to sales_clean.csv")
Output:
=== Data Overview ===
date datetime64[ns]
product object
category object
units int64
price float64
region object
revenue float64
month period[M]
Rows: 10, Columns: 8
=== Revenue by Category ===
             total_revenue  total_units  avg_price  num_orders
category
Electronics        1909.49           51      39.99           6
Home                874.51           49      17.49           4
=== Monthly Revenue ===
 2024-01 ██████████████ $709.77
 2024-02 ███████████████ $779.70
 2024-03 █████████████████████████ $1,294.53
=== Top Products by Revenue ===
 Widget A        $959.68
 Widget B        $949.81
 Gadget X        $559.72
 Gadget Y        $314.79
What You Learned in This Chapter
- Series is a labelled 1D array. DataFrame is a labelled 2D table. Every column is a Series.
- pd.read_csv, pd.read_excel, pd.read_json, pd.read_sql load data; .to_csv and .to_excel save it.
- df.info(), df.describe(), df.head(), df.isnull().sum() explore a dataset.
- df["col"] selects one column; df[["a", "b"]] selects multiple. df.iloc[row, col] selects by position; df.loc[label, "col"] selects by label.
- Boolean masks — df[df["score"] > 80] — filter rows without loops.
- df["new"] = df["a"] + df["b"] adds computed columns; apply() runs arbitrary functions.
- fillna and dropna handle missing values.
- groupby("col").agg(...) splits data, computes per-group stats, and combines results.
- pd.merge(left, right, on=..., how=...) joins DataFrames like SQL; pd.concat stacks them vertically.
- df["date"].dt.year, .dt.month, .dt.day_name() extract date components; .resample() aggregates time series by period.
What's Next?
Chapter 44 covers Data Visualization — turning the DataFrames you just built into charts and graphs. You'll use matplotlib for control and seaborn for beautiful statistical plots with almost no code.