Chapter 42: Data Science
Python lists are convenient. But when you need to do math on a million numbers, they're too slow.
NumPy fixes that. It gives you arrays — like lists, but stored as compact blocks of memory and operated on at the speed of C. One NumPy operation replaces a Python for loop and runs 10 to 100 times faster.
Every data science library in Python — pandas, scikit-learn, matplotlib, TensorFlow — is built on NumPy. Learn NumPy and everything else clicks into place.
Setup
pip install numpy
import numpy as np # np is the universal alias — always use it
Arrays — The Core Concept
A NumPy array (ndarray) is a grid of values, all the same type, stored in contiguous memory.
import numpy as np
# 1D array — like a list
a = np.array([1, 2, 3, 4, 5])
print(a) # [1 2 3 4 5]
print(type(a)) # <class 'numpy.ndarray'>
# 2D array — like a list of lists (a matrix)
m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(m)
# [[1 2 3]
# [4 5 6]]
Key properties
a = np.array([[1, 2, 3],
              [4, 5, 6]])
print(a.shape) # (2, 3) — 2 rows, 3 columns
print(a.ndim) # 2 — number of dimensions
print(a.size) # 6 — total number of elements
print(a.dtype) # int64 — data type of elements
Creating Arrays
# All zeros
np.zeros((3, 4)) # 3x4 matrix of 0.0
# All ones
np.ones((2, 3)) # 2x3 matrix of 1.0
# Filled with a value
np.full((2, 2), 7) # [[7 7] [7 7]]
# Identity matrix
np.eye(3) # 3x3 identity
# Evenly spaced values (like range)
np.arange(0, 10, 2) # [0 2 4 6 8]
np.arange(5) # [0 1 2 3 4]
# Evenly spaced values (by count)
np.linspace(0, 1, 5) # [0. 0.25 0.5 0.75 1. ]
# Random numbers
np.random.seed(42) # for reproducibility
np.random.rand(3, 3) # uniform [0, 1)
np.random.randn(3, 3) # standard normal
np.random.randint(0, 10, size=(3, 3)) # random ints
# From a Python list — with explicit dtype
np.array([1.0, 2.5, 3.7]) # float64
np.array([1, 2, 3], dtype=np.float32)
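The np.random.seed / np.random.rand calls above use NumPy's legacy global random state. Newer code often prefers the Generator API, which keeps the state in an explicit object instead of a hidden global; a minimal sketch of the equivalents:

```python
import numpy as np

rng = np.random.default_rng(42)       # seeded generator object
u = rng.random((3, 3))                # uniform [0, 1) — like rand
n = rng.standard_normal((3, 3))       # standard normal — like randn
i = rng.integers(0, 10, size=(3, 3))  # random ints — like randint
```

Either style works throughout this chapter; the legacy functions are kept here because most existing tutorials and codebases still use them.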
Data Types
a = np.array([1, 2, 3])
print(a.dtype) # int64 (on most 64-bit systems)
b = np.array([1.0, 2.0])
print(b.dtype) # float64
# Common dtypes:
# np.int8, np.int16, np.int32, np.int64
# np.float32, np.float64
# np.bool_
# np.complex128
# Convert dtype
a_float = a.astype(np.float32)
Choosing smaller dtypes (like float32 instead of float64) halves memory usage — important for large datasets.
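To see the saving concretely, compare .nbytes (total bytes of array data) for the same million-element array at two precisions:

```python
import numpy as np

big = np.zeros(1_000_000)       # float64 by default: 8 bytes per element
small = big.astype(np.float32)  # 4 bytes per element

print(big.nbytes)    # 8000000
print(small.nbytes)  # 4000000
```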
Indexing and Slicing
1D arrays — just like lists
a = np.array([10, 20, 30, 40, 50])
a[0] # 10
a[-1] # 50
a[1:4] # [20 30 40]
a[::2] # [10 30 50]
2D arrays
m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
m[0] # [1 2 3] — first row
m[0, 1] # 2 — row 0, column 1
m[:, 1] # [2 5 8] — all rows, column 1
m[1:, :2] # [[4 5] [7 8]] — rows 1+, columns 0-1
Boolean indexing — filter without a loop
a = np.array([10, 25, 3, 47, 8, 30])
mask = a > 20
print(mask) # [False True False True False True]
print(a[mask]) # [25 47 30]
# One-liner
print(a[a > 20]) # [25 47 30]
# Set all values below 10 to 0
a[a < 10] = 0
print(a) # [10 25 0 47 0 30]
Fancy indexing — select by index list
a = np.array([100, 200, 300, 400, 500])
indices = [0, 2, 4]
print(a[indices]) # [100 300 500]
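A close cousin of boolean indexing is np.where, which returns the *indices* where a condition holds rather than the values (it reappears in the sensor-data example later in this chapter):

```python
import numpy as np

a = np.array([10, 25, 3, 47, 8, 30])
idx = np.where(a > 20)[0]  # indices where the condition is True
print(idx)                 # [1 3 5]
print(a[idx])              # [25 47 30] — same values a[a > 20] gives
```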
Vectorized Operations — No Loops Needed
This is the key insight: NumPy operations apply to the whole array at once, without a Python for loop.
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])
# Element-wise operations
a + b # [11 22 33 44 55]
a * b # [10 40 90 160 250]
a ** 2 # [ 1 4 9 16 25]
np.sqrt(a) # [1. 1.41 1.73 2. 2.24]
# Scalar operations — broadcast to every element
a + 100 # [101 102 103 104 105]
a * 2 # [ 2 4 6 8 10]
# Comparison
a > 3 # [False False False True True]
Speed comparison
import time
import numpy as np
n = 10_000_000
# Python list
lst = list(range(n))
start = time.time()
result = [x * 2 for x in lst]
print(f"Python list: {time.time() - start:.3f}s")
# NumPy array
arr = np.arange(n)
start = time.time()
result = arr * 2
print(f"NumPy array: {time.time() - start:.3f}s")
Output:
Python list: 0.847s
NumPy array: 0.008s
NumPy is ~100x faster for this operation.
Broadcasting
Broadcasting lets NumPy operate on arrays of different shapes without making copies.
# Add a scalar to a 2D array
m = np.array([[1, 2, 3],
              [4, 5, 6]])
m + 10
# [[11 12 13]
# [14 15 16]]
# Add a 1D array to each row of a 2D array
row = np.array([1, 2, 3])
m + row
# [[ 2 4 6]
# [ 5 7 9]]
# Add a column vector to each column
col = np.array([[10], [20]]) # shape (2, 1)
m + col
# [[11 12 13]
# [24 25 26]]
Broadcasting rules (simplified): dimensions are compared right-to-left. A dimension of 1 is stretched to match the other.
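For example, a (2, 1) column and a (3,) row broadcast to (2, 3): comparing right-to-left, the column's trailing 1 stretches to 3, and the row's missing leading dimension is treated as 1 and stretches to 2:

```python
import numpy as np

col = np.array([[10], [20]])  # shape (2, 1)
row = np.array([1, 2, 3])     # shape (3,)

result = col + row            # (2, 1) + (3,) -> (2, 3)
print(result)
# [[11 12 13]
#  [21 22 23]]
```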
Mathematical Functions
a = np.array([1, 4, 9, 16, 25])
np.sqrt(a) # [1. 2. 3. 4. 5.]
np.log(a) # natural log
np.log2(a)
np.log10(a)
np.exp(a) # e^x
np.abs(a) # absolute value
np.sin(a)
np.cos(a)
np.floor(a)
np.ceil(a)
np.round(a, 2)
Aggregation — Summary Statistics
a = np.array([4, 7, 2, 9, 1, 5, 8, 3, 6])
np.sum(a) # 45
np.mean(a) # 5.0
np.median(a) # 5.0
np.std(a) # standard deviation: 2.58
np.var(a) # variance: 6.67
np.min(a) # 1
np.max(a) # 9
np.argmin(a) # 4 — index of minimum
np.argmax(a) # 3 — index of maximum
np.cumsum(a) # running total
np.sort(a) # sorted copy
# 2D aggregation — specify axis
m = np.array([[1, 2, 3],
              [4, 5, 6]])
np.sum(m) # 21 — total
np.sum(m, axis=0) # [5 7 9] — sum each column
np.sum(m, axis=1) # [ 6 15] — sum each row
np.mean(m, axis=0) # [2.5 3.5 4.5] — mean of each column
Reshaping and Combining
a = np.arange(12) # [0 1 2 3 4 5 6 7 8 9 10 11]
# Reshape — total elements must be the same
a.reshape(3, 4) # 3 rows, 4 columns
a.reshape(2, 2, 3) # 3D: 2x2x3
a.reshape(-1, 3) # -1 means "figure it out" -> (4, 3)
# Flatten back to 1D
m = a.reshape(3, 4)
m.flatten() # copy
m.ravel() # view (faster, shares memory)
# Transpose
m.T # swap rows and columns
# Stack arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.vstack([a, b]) # [[1 2 3] [4 5 6]] — stack vertically
np.hstack([a, b]) # [1 2 3 4 5 6] — stack horizontally
np.concatenate([a, b]) # [1 2 3 4 5 6]
# Split
np.split(a, 3) # [array([1]), array([2]), array([3])]
Linear Algebra
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
# Matrix multiplication
A @ B # [[19 22] [43 50]]
np.dot(A, B) # same
# Determinant
np.linalg.det(A) # -2.0
# Inverse
np.linalg.inv(A) # [[-2. 1. ] [ 1.5 -0.5]]
# Eigenvalues and eigenvectors
vals, vecs = np.linalg.eig(A)
# Solve Ax = b
b = np.array([1, 2])
x = np.linalg.solve(A, b) # solution to Ax = b
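np.linalg.solve is both faster and more numerically stable than computing the inverse and multiplying. A quick sanity check with np.allclose confirms the two routes agree and that the solution actually satisfies Ax = b:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([1.0, 2.0])

x = np.linalg.solve(A, b)
print(np.allclose(A @ x, b))                  # True — A @ x reproduces b
print(np.allclose(x, np.linalg.inv(A) @ b))   # True — same answer, more work
```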
Working with Real Data
import numpy as np
# Load from a text file
data = np.loadtxt("data.csv", delimiter=",", skiprows=1)
# Load from a binary file (much faster for large data)
np.save("data.npy", data)
loaded = np.load("data.npy")
# Generate sample sensor data: 1000 readings with noise
np.random.seed(42)
time_points = np.linspace(0, 10, 1000) # 0 to 10 seconds
signal = np.sin(2 * np.pi * time_points) # clean sine wave
noise = np.random.randn(1000) * 0.1 # random noise
readings = signal + noise
# Find peaks
peaks = np.where(readings > 0.9)[0]
print(f"Found {len(peaks)} peaks above 0.9")
print(f"Mean reading: {readings.mean():.4f}")
print(f"Std deviation: {readings.std():.4f}")
Practical Project — Grade Analysis
"""
grade_analysis.py
Analyse student exam scores using NumPy.
"""
import numpy as np
def letter_grade(score: float) -> str:
    if score >= 90: return "A"
    if score >= 80: return "B"
    if score >= 70: return "C"
    if score >= 60: return "D"
    return "F"
def analyse_grades(scores: np.ndarray) -> dict:
    """Return summary statistics for an array of scores."""
    return {
        "count": len(scores),
        "mean": round(float(scores.mean()), 2),
        "median": round(float(np.median(scores)), 2),
        "std": round(float(scores.std()), 2),
        "min": float(scores.min()),
        "max": float(scores.max()),
        "passing": int((scores >= 60).sum()),
        "failing": int((scores < 60).sum()),
    }
def grade_distribution(scores: np.ndarray) -> dict:
    """Count scores in each letter-grade bucket."""
    labels = ["A", "B", "C", "D", "F"]
    lows = [90, 80, 70, 60, 0]
    highs = [101, 90, 80, 70, 60]  # 101 so a perfect 100 still counts as an A
    dist = {}
    for label, low, high in zip(labels, lows, highs):
        dist[label] = int(((scores >= low) & (scores < high)).sum())
    return dist
def normalise(scores: np.ndarray) -> np.ndarray:
    """Scale scores to [0, 1]."""
    min_s, max_s = scores.min(), scores.max()
    if max_s == min_s:
        return np.zeros_like(scores, dtype=float)
    return (scores - min_s) / (max_s - min_s)
def curve(scores: np.ndarray, target_mean: float = 75.0) -> np.ndarray:
    """Shift scores so the mean hits target_mean."""
    shift = target_mean - scores.mean()
    curved = scores + shift
    return np.clip(curved, 0, 100)  # cap at 0-100
# ── Demo ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    np.random.seed(99)

    # Simulate 30 students, 4 exams
    raw = np.random.normal(loc=68, scale=15, size=(30, 4))
    raw = np.clip(raw, 0, 100).round(1)

    # Final score = weighted average (later exams count more)
    weights = np.array([0.2, 0.2, 0.25, 0.35])
    finals = raw @ weights  # matrix multiply: (30, 4) @ (4,) -> (30,)

    stats = analyse_grades(finals)
    dist = grade_distribution(finals)

    print("=== Grade Analysis ===")
    print(f"Students: {stats['count']}")
    print(f"Mean: {stats['mean']:.1f} Median: {stats['median']:.1f}")
    print(f"Std dev: {stats['std']:.1f}")
    print(f"Min: {stats['min']:.1f} Max: {stats['max']:.1f}")
    print(f"Passing: {stats['passing']} Failing: {stats['failing']}")

    print("\nGrade Distribution:")
    for grade, count in dist.items():
        bar = "█" * count
        print(f" {grade}: {bar} ({count})")

    # Apply a curve if the mean falls below 70
    if stats["mean"] < 70:
        curved = curve(finals, target_mean=72)
        print(f"\nCurve applied: mean {stats['mean']:.1f} -> {curved.mean():.1f}")
Output:
=== Grade Analysis ===
Students: 30
Mean: 67.8 Median: 68.4
Std dev: 13.2
Min: 38.1 Max: 94.7
Passing: 22 Failing: 8
Grade Distribution:
A: ██ (2)
B: ████ (4)
C: ████████ (8)
D: ████████ (8)
F: ████████ (8)
Curve applied: mean 67.8 -> 72.0
Common Pitfalls
Views vs copies
a = np.array([1, 2, 3, 4, 5])
# Slices return VIEWS — modifying them modifies the original
b = a[1:4]
b[0] = 99
print(a) # [ 1 99 3 4 5] <- a is changed!
# Make a copy if you don't want this
c = a[1:4].copy()
c[0] = 0
print(a) # [ 1 99 3 4 5] <- writing to the copy left a alone
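When you're unsure whether two arrays share memory, np.shares_memory gives a definitive answer:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
view = a[1:4]         # slicing returns a view
dup = a[1:4].copy()   # explicit copy owns its own data

print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, dup))   # False
```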
Integer division
a = np.array([1, 2, 3])
print(a / 2) # [0.5 1. 1.5] — true division, float result
print(a // 2) # [0 1 1] — floor division, int result
Shape mismatches
a = np.array([1, 2, 3]) # shape (3,)
b = np.array([[1], [2], [3]]) # shape (3, 1)
# a + b -> shape (3, 3) due to broadcasting — often unexpected!
When in doubt, use .reshape() to make shapes explicit.
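For instance, reshaping b from (3, 1) down to (3,) makes the intended element-wise addition explicit instead of triggering the (3, 3) broadcast:

```python
import numpy as np

a = np.array([1, 2, 3])        # shape (3,)
b = np.array([[1], [2], [3]])  # shape (3, 1)

print((a + b).shape)              # (3, 3) — broadcast, probably not intended
print((a + b.reshape(-1)).shape)  # (3,)   — element-wise, as intended
print(a + b.reshape(-1))          # [2 4 6]
```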
What You Learned in This Chapter
- NumPy arrays (ndarray) store homogeneous data in contiguous memory, making math 10 to 100 times faster than Python lists. .shape, .dtype, .size, and .ndim describe an array.
- np.zeros, np.ones, np.arange, np.linspace, and np.random.* create arrays.
- Indexing with [row, col], slicing with [start:stop:step], and boolean indexing a[a > 5] all work without loops.
- All arithmetic operators (+, -, *, /, **) and math functions (np.sqrt, np.log, np.sin) work element-wise.
- Broadcasting applies operations between arrays of compatible shapes without copying data.
- np.sum, np.mean, np.std, np.min, np.max, np.argmin, and np.argmax summarise arrays. Use axis= for row/column operations.
- reshape, flatten, vstack, hstack, and concatenate restructure arrays.
- @ or np.dot multiplies matrices. np.linalg handles determinants, inverses, and solvers.
- Slices are views — modifying them modifies the original. Use .copy() to avoid this.
What's Next?
Chapter 43 covers pandas — the library built on top of NumPy that adds labels, DataFrames, and powerful tools for reading, cleaning, and transforming real-world tabular data.