Chapter 43: Data Science with Pandas
NumPy is great for numbers. But real-world data has labels, mixed types, missing values, and dates. That's what pandas is built for.
pandas gives you the DataFrame — a table with named columns and rows. Think of it as a Python-powered spreadsheet that you control entirely with code. You can load a CSV, clean it, filter it, group it, and visualise it in fewer than twenty lines.
Setup
pip install pandas openpyxl xlrd
import pandas as pd
import numpy as np
openpyxl reads .xlsx files. xlrd reads older .xls files.
The Two Core Objects
Series — a labelled 1D array
# From a list
s = pd.Series([10, 20, 30, 40])
print(s)
# 0 10
# 1 20
# 2 30
# 3 40
# dtype: int64
# With custom index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"]) # 20
# From a dict
s = pd.Series({"jan": 100, "feb": 85, "mar": 92})
print(s.index) # Index(['jan', 'feb', 'mar'], dtype='object')
DataFrame — a labelled 2D table
# From a dict of lists
df = pd.DataFrame({
"name": ["Alice", "Bob", "Carol", "Dave"],
"age": [28, 34, 22, 45],
"score": [88.5, 72.0, 95.3, 61.8],
"passed": [True, True, True, False],
})
print(df)
# name age score passed
# 0 Alice 28 88.5 True
# 1 Bob 34 72.0 True
# 2 Carol 22 95.3 True
# 3 Dave 45 61.8 False
Every column is a Series. The df itself is a collection of aligned Series sharing the same index.
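You can verify this directly — a quick sketch with a toy DataFrame:

```python
import pandas as pd

# Toy DataFrame, same shape of idea as the df above
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "age": [28, 34, 22, 45],
})

ages = df["age"]                     # a single column is a Series
print(type(ages).__name__)           # Series
print(ages.index.equals(df.index))   # True — the column shares the DataFrame's index
```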
Reading and Writing Data
CSV
# Read
df = pd.read_csv("sales.csv")
df = pd.read_csv("sales.csv", index_col="id") # use a column as the index
df = pd.read_csv("sales.csv", parse_dates=["date"]) # parse date columns
# Write
df.to_csv("output.csv", index=False) # index=False omits the row numbers
Excel
df = pd.read_excel("report.xlsx", sheet_name="Q1")
df.to_excel("output.xlsx", sheet_name="Results", index=False)
JSON
df = pd.read_json("data.json")
df.to_json("output.json", orient="records", indent=2)
SQL
import sqlite3
conn = sqlite3.connect("sales.db")
df = pd.read_sql("SELECT * FROM orders WHERE total > 100", conn)
df.to_sql("summary", conn, if_exists="replace", index=False)
conn.close()
Exploring a DataFrame
df = pd.read_csv("sales.csv")
df.shape # (1000, 8) — rows, columns
df.dtypes # data type of each column
df.columns # column names
df.index # row labels
df.head(5) # first 5 rows
df.tail(5) # last 5 rows
df.sample(5) # 5 random rows
df.info() # column names, non-null counts, dtypes, memory
df.describe() # count, mean, std, min, quartiles, max for numeric cols
df.describe(include="all") # includes non-numeric columns too
df.value_counts("category") # frequency of each category
df.nunique() # number of unique values per column
df.isnull().sum() # count missing values per column
Selecting Data
Columns
df["name"] # single column -> Series
df[["name", "score"]] # multiple columns -> DataFrame
Rows with .iloc — by position
df.iloc[0] # first row -> Series
df.iloc[0:5] # first 5 rows
df.iloc[0, 2] # row 0, column 2 (by position)
df.iloc[:, 1:4] # all rows, columns 1-3
Rows with .loc — by label
df.loc[0] # row with index label 0
df.loc[0:5] # rows 0 through 5 (inclusive!)
df.loc[0, "name"] # row 0, column "name"
df.loc[:, "name":"score"] # all rows, columns "name" through "score"
The key difference: .iloc uses integer positions (exclusive end), .loc uses labels (inclusive end).
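A quick sketch makes the difference visible (toy data with a default integer index, so the labels happen to match the positions):

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40]})

print(len(df.iloc[0:2]))  # 2 — positions 0 and 1, end excluded
print(len(df.loc[0:2]))   # 3 — labels 0, 1 and 2, end included
```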
Boolean filtering
# Single condition
df[df["score"] > 80]
# Multiple conditions — use & (and) and | (or), not Python's and/or
df[(df["score"] > 80) & (df["passed"] == True)]
df[(df["age"] < 30) | (df["score"] > 90)]
# isin — filter on a list of values
df[df["city"].isin(["London", "Paris", "Berlin"])]
# String methods
df[df["name"].str.startswith("A")]
df[df["email"].str.contains("@gmail")]
# query() — readable string syntax
df.query("score > 80 and age < 35")
df.query("city in ['London', 'Paris']")
Modifying Data
Adding and changing columns
# New column from calculation
df["tax"] = df["price"] * 0.2
df["total"] = df["price"] + df["tax"]
# Conditional column
df["grade"] = df["score"].apply(
lambda x: "A" if x >= 90 else ("B" if x >= 80 else "C")
)
# Using np.where — faster than apply for a simple two-way choice
import numpy as np
df["result"] = np.where(df["score"] >= 60, "pass", "fail")
# For a plain boolean, skip np.where — the comparison alone is enough
df["passed"] = df["score"] >= 60
# np.select — multiple conditions
conditions = [df["score"] >= 90, df["score"] >= 80, df["score"] >= 70]
choices = ["A", "B", "C"]
df["grade"] = np.select(conditions, choices, default="D")
Renaming columns
df.rename(columns={"name": "full_name", "score": "exam_score"}, inplace=True)
df.columns = [c.lower().replace(" ", "_") for c in df.columns] # clean all names
Dropping columns and rows
df.drop(columns=["tax", "unnecessary_col"], inplace=True)
df.drop(index=[0, 1, 2], inplace=True) # drop rows by index
df.dropna(inplace=True) # drop rows with any NaN
df.dropna(subset=["score", "name"]) # only drop if these cols are NaN
The apply function
# For element-wise string cleanup, vectorised .str methods beat apply
df["name"] = df["name"].str.strip().str.title()
# Apply a function to each row (axis=1)
def classify(row):
if row["score"] >= 90 and row["age"] < 30:
return "Star student"
return "Regular"
df["category"] = df.apply(classify, axis=1)
apply is flexible but slow on large DataFrames. Prefer vectorised operations (df["col"] * 2, np.where, .str methods) when possible.
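As a sketch of that advice, the row-wise classify function above can be replaced by a vectorised mask that produces the same column (toy data; np.where stands in for the if/else):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [95, 70, 92], "age": [25, 40, 31]})

# Row-wise apply: one Python function call per row
slow = df.apply(
    lambda row: "Star student" if row["score"] >= 90 and row["age"] < 30 else "Regular",
    axis=1,
)

# Vectorised equivalent: one boolean mask, evaluated in compiled code
fast = np.where((df["score"] >= 90) & (df["age"] < 30), "Star student", "Regular")

print((slow == fast).all())  # True — identical results
```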
Handling Missing Data
df.isnull().sum() # count NaN per column
df.isnull().sum() / len(df) # fraction missing
# Fill missing values — assign back; inplace=True on a selected column is
# deprecated (it may silently operate on a copy)
df["score"] = df["score"].fillna(df["score"].mean())  # fill with mean
df["city"] = df["city"].fillna("Unknown")             # fill with constant
df["score"] = df["score"].ffill()                     # forward fill
df["score"] = df["score"].bfill()                     # backward fill
# Drop rows with missing values
df.dropna(inplace=True) # any NaN -> drop row
df.dropna(subset=["score"], inplace=True) # NaN in 'score' -> drop row
df.dropna(thresh=5, inplace=True) # keep rows with at least 5 non-NaN
Sorting
df.sort_values("score", ascending=False) # descending
df.sort_values(["grade", "score"], ascending=[True, False]) # multi-column
df.sort_index() # sort by row index
Grouping and Aggregation
groupby splits the data into groups, applies a function to each group, and combines the results — the split-apply-combine pattern.
# Simple groupby
df.groupby("department")["salary"].mean()
# Multiple aggregations
df.groupby("department").agg(
avg_salary = ("salary", "mean"),
max_salary = ("salary", "max"),
employee_count = ("name", "count"),
)
# Group by multiple columns
df.groupby(["department", "level"])["salary"].mean()
# Apply custom function to each group
df.groupby("category")["score"].apply(lambda s: s.max() - s.min())
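To make the split-apply-combine idea concrete, here is the last example done by hand on toy data — iterate the groups (split), compute per group (apply), collect the results (combine):

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "B", "A", "B"],
    "score": [80, 60, 95, 90],
})

# Split-apply-combine spelled out
manual = {
    cat: int(group["score"].max() - group["score"].min())  # apply
    for cat, group in df.groupby("category")               # split
}                                                          # combine into a dict
print(manual)  # {'A': 15, 'B': 30}

# The one-line groupby equivalent gives the same answer
auto = df.groupby("category")["score"].apply(lambda s: s.max() - s.min())
print(auto.to_dict() == manual)  # True
```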
Merging and Joining
Just like SQL JOINs:
customers = pd.DataFrame({
"id": [1, 2, 3, 4],
"name": ["Alice", "Bob", "Carol", "Dave"],
})
orders = pd.DataFrame({
"order_id": [101, 102, 103, 104, 105],
"customer_id": [1, 2, 1, 3, 5],
"amount": [50, 30, 80, 20, 60],
})
# Inner join — only matching rows
pd.merge(orders, customers, left_on="customer_id", right_on="id")
# Left join — all orders, even without matching customer
pd.merge(orders, customers, left_on="customer_id", right_on="id", how="left")
# Right join, outer join also available
pd.merge(orders, customers, left_on="customer_id", right_on="id", how="outer")
# on= works when both DataFrames share the key column name —
# here that means renaming customers' "id" first
pd.merge(orders, customers.rename(columns={"id": "customer_id"}), on="customer_id")
# Concat — stack DataFrames vertically
combined = pd.concat([df_q1, df_q2, df_q3], ignore_index=True)
Working with Dates
# Parse dates on load
df = pd.read_csv("sales.csv", parse_dates=["date"])
# Convert a column to datetime
df["date"] = pd.to_datetime(df["date"])
# Extract components
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["weekday"] = df["date"].dt.day_name()
df["quarter"] = df["date"].dt.quarter
# Date arithmetic
df["days_since"] = (pd.Timestamp.today() - df["date"]).dt.days
# Filter by date range
start = pd.Timestamp("2024-01-01")
end = pd.Timestamp("2024-03-31")
q1 = df[(df["date"] >= start) & (df["date"] <= end)]
# Resample time series (group by time period)
df.set_index("date").resample("ME")["revenue"].sum() # monthly totals ("ME" = month-end; use "M" on pandas < 2.2)
df.set_index("date").resample("W")["revenue"].mean() # weekly averages
Practical Project — Sales Analysis
"""
sales_analysis.py
End-to-end analysis of a sales dataset using pandas.
"""
import pandas as pd
import numpy as np
from io import StringIO
# ── Sample data (in a real project, use pd.read_csv("sales.csv")) ────────────
RAW = """date,product,category,units,price,region
2024-01-05,Widget A,Electronics,10,29.99,North
2024-01-12,Widget B,Electronics,5,49.99,South
2024-01-15,Gadget X,Home,8,19.99,North
2024-02-03,Widget A,Electronics,15,29.99,East
2024-02-14,Gadget Y,Home,12,14.99,West
2024-02-20,Widget B,Electronics,3,49.99,North
2024-03-01,Gadget X,Home,20,19.99,South
2024-03-10,Widget A,Electronics,7,29.99,West
2024-03-22,Gadget Y,Home,9,14.99,East
2024-03-30,Widget B,Electronics,11,49.99,South
"""
# ── Load and prepare ──────────────────────────────────────────────────────────
df = pd.read_csv(StringIO(RAW), parse_dates=["date"])
df["revenue"] = df["units"] * df["price"]
df["month"] = df["date"].dt.to_period("M")
print("=== Data Overview ===")
print(df.dtypes)
print(f"\nRows: {len(df)}, Columns: {len(df.columns)}")
print(f"Date range: {df['date'].min().date()} -> {df['date'].max().date()}")
print(f"Missing values:\n{df.isnull().sum()}")
# ── Summary statistics ────────────────────────────────────────────────────────
print("\n=== Revenue Summary ===")
print(df["revenue"].describe().round(2))
# ── By category ──────────────────────────────────────────────────────────────
print("\n=== Revenue by Category ===")
by_cat = (
df.groupby("category")
.agg(
total_revenue = ("revenue", "sum"),
total_units = ("units", "sum"),
avg_price = ("price", "mean"),
num_orders = ("date", "count"),
)
.round(2)
.sort_values("total_revenue", ascending=False)
)
print(by_cat)
# ── Monthly trend ─────────────────────────────────────────────────────────────
print("\n=== Monthly Revenue ===")
monthly = df.groupby("month")["revenue"].sum().reset_index() # columns: month, revenue
for _, row in monthly.iterrows():
bar = "█" * int(row["revenue"] / 50)
print(f" {row['month']} {bar} ${row['revenue']:,.2f}")
# ── Top products ──────────────────────────────────────────────────────────────
print("\n=== Top Products by Revenue ===")
top_products = (
df.groupby("product")["revenue"]
.sum()
.sort_values(ascending=False)
.head(5)
)
for product, rev in top_products.items():
print(f" {product:15s} ${rev:,.2f}")
# ── Regional breakdown ────────────────────────────────────────────────────────
print("\n=== Revenue by Region ===")
by_region = df.pivot_table(
values="revenue",
index="region",
columns="category",
aggfunc="sum",
fill_value=0,
).round(2)
print(by_region)
# ── Save results ──────────────────────────────────────────────────────────────
df.to_csv("sales_clean.csv", index=False)
print("\nSaved to sales_clean.csv")
Output:
=== Data Overview ===
date datetime64[ns]
product object
category object
units int64
price float64
region object
revenue float64
month period[M]
Rows: 10, Columns: 8
=== Revenue by Category ===
             total_revenue  total_units  avg_price  num_orders
category
Electronics        1909.49           51      39.99           6
Home                874.51           49      17.49           4
=== Monthly Revenue ===
 2024-01 ██████████████ $709.77
 2024-02 ███████████████ $779.70
 2024-03 █████████████████████████ $1,294.53
=== Top Products by Revenue ===
 Widget A        $959.68
 Widget B        $949.81
 Gadget X        $559.72
 Gadget Y        $314.79
What You Learned in This Chapter
- Series is a labelled 1D array. DataFrame is a labelled 2D table. Every column is a Series.
- pd.read_csv, pd.read_excel, pd.read_json, pd.read_sql load data; .to_csv and .to_excel save it.
- df.info(), df.describe(), df.head(), df.isnull().sum() explore a dataset.
- df["col"] selects one column; df[["a", "b"]] selects multiple. df.iloc[row, col] selects by position; df.loc[label, "col"] selects by label.
- Boolean masks — df[df["score"] > 80] — filter rows without loops.
- df["new"] = df["a"] + df["b"] adds computed columns; apply() runs arbitrary functions.
- fillna and dropna handle missing values.
- groupby("col").agg(...) splits data, computes per-group stats, and combines results.
- pd.merge(left, right, on=..., how=...) joins DataFrames like SQL; pd.concat stacks them vertically.
- df["date"].dt.year, .dt.month, .dt.day_name() extract date components; .resample() aggregates time series by period.
What's Next?
Chapter 44 covers Data Visualization — turning the DataFrames you just built into charts and graphs. You'll use matplotlib for control and seaborn for beautiful statistical plots with almost no code.