Chapter 42: Data Science
Python lists are convenient. But when you need to do math on a million numbers, they're too slow.
NumPy fixes that. It gives you arrays — like lists, but stored as compact blocks of memory and operated on at the speed of C. One NumPy operation replaces a Python for loop and runs 10 to 100 times faster.
Every data science library in Python — pandas, scikit-learn, matplotlib, TensorFlow — is built on NumPy. Learn NumPy and everything else clicks into place.
Setup
pip install numpy
import numpy as np # np is the universal alias — always use it
Arrays — The Core Concept
A NumPy array (ndarray) is a grid of values, all the same type, stored in contiguous memory.
import numpy as np
# 1D array — like a list
a = np.array([1, 2, 3, 4, 5])
print(a) # [1 2 3 4 5]
print(type(a)) # <class 'numpy.ndarray'>
# 2D array — like a list of lists (a matrix)
m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(m)
# [[1 2 3]
# [4 5 6]]
Key properties
a = np.array([[1, 2, 3],
              [4, 5, 6]])
print(a.shape) # (2, 3) — 2 rows, 3 columns
print(a.ndim) # 2 — number of dimensions
print(a.size) # 6 — total number of elements
print(a.dtype) # int64 — data type of elements
Creating Arrays
# All zeros
np.zeros((3, 4)) # 3x4 matrix of 0.0
# All ones
np.ones((2, 3)) # 2x3 matrix of 1.0
# Filled with a value
np.full((2, 2), 7) # [[7 7] [7 7]]
# Identity matrix
np.eye(3) # 3x3 identity
# Evenly spaced values (like range)
np.arange(0, 10, 2) # [0 2 4 6 8]
np.arange(5) # [0 1 2 3 4]
# Evenly spaced values (by count)
np.linspace(0, 1, 5) # [0. 0.25 0.5 0.75 1. ]
# Random numbers
np.random.seed(42) # for reproducibility
np.random.rand(3, 3) # uniform [0, 1)
np.random.randn(3, 3) # standard normal
np.random.randint(0, 10, size=(3, 3)) # random ints
# From a Python list — with explicit dtype
np.array([1.0, 2.5, 3.7]) # float64
np.array([1, 2, 3], dtype=np.float32)
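The np.random.seed / np.random.rand calls above use NumPy's legacy global random state. Newer code often prefers the Generator API, which keeps the state in an explicit object instead of a hidden global; a minimal sketch of the equivalents:

```python
import numpy as np

rng = np.random.default_rng(42)       # seeded generator object
u = rng.random((3, 3))                # uniform [0, 1) — like rand
n = rng.standard_normal((3, 3))       # standard normal — like randn
i = rng.integers(0, 10, size=(3, 3))  # random ints — like randint
```

Either style works throughout this chapter; the legacy functions are kept here because most existing tutorials and codebases still use them.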
Data Types
a = np.array([1, 2, 3])
print(a.dtype) # int64 (on most 64-bit systems)
b = np.array([1.0, 2.0])
print(b.dtype) # float64
# Common dtypes:
# np.int8, np.int16, np.int32, np.int64
# np.float32, np.float64
# np.bool_
# np.complex128
# Convert dtype
a_float = a.astype(np.float32)
Choosing smaller dtypes (like float32 instead of float64) halves memory usage — important for large datasets.
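To see the saving concretely, compare .nbytes (total bytes of array data) for the same million-element array at two precisions:

```python
import numpy as np

big = np.zeros(1_000_000)       # float64 by default: 8 bytes per element
small = big.astype(np.float32)  # 4 bytes per element

print(big.nbytes)    # 8000000
print(small.nbytes)  # 4000000
```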
Indexing and Slicing
1D arrays — just like lists
a = np.array([10, 20, 30, 40, 50])
a[0] # 10
a[-1] # 50
a[1:4] # [20 30 40]
a[::2] # [10 30 50]
2D arrays
m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
m[0] # [1 2 3] — first row
m[0, 1] # 2 — row 0, column 1
m[:, 1] # [2 5 8] — all rows, column 1
m[1:, :2] # [[4 5] [7 8]] — rows 1+, columns 0-1
Boolean indexing — filter without a loop
a = np.array([10, 25, 3, 47, 8, 30])
mask = a > 20
print(mask) # [False True False True False True]
print(a[mask]) # [25 47 30]
# One-liner
print(a[a > 20]) # [25 47 30]
# Set all values below 10 to 0
a[a < 10] = 0
print(a) # [10 25 0 47 0 30]
Fancy indexing — select by index list
a = np.array([100, 200, 300, 400, 500])
indices = [0, 2, 4]
print(a[indices]) # [100 300 500]
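A close cousin of boolean indexing is np.where, which returns the *indices* where a condition holds rather than the values (it reappears in the sensor-data example later in this chapter):

```python
import numpy as np

a = np.array([10, 25, 3, 47, 8, 30])
idx = np.where(a > 20)[0]  # indices where the condition is True
print(idx)                 # [1 3 5]
print(a[idx])              # [25 47 30] — same values a[a > 20] gives
```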
Vectorized Operations — No Loops Needed
This is the key insight: NumPy operations apply to the whole array at once, without a Python for loop.
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])
# Element-wise operations
a + b # [11 22 33 44 55]
a * b # [10 40 90 160 250]
a ** 2 # [ 1 4 9 16 25]
np.sqrt(a) # [1. 1.41 1.73 2. 2.24]
# Scalar operations — broadcast to every element
a + 100 # [101 102 103 104 105]
a * 2 # [ 2 4 6 8 10]
# Comparison
a > 3 # [False False False True True]
Speed comparison
import time
import numpy as np
n = 10_000_000
# Python list
lst = list(range(n))
start = time.time()
result = [x * 2 for x in lst]
print(f"Python list: {time.time() - start:.3f}s")
# NumPy array
arr = np.arange(n)
start = time.time()
result = arr * 2
print(f"NumPy array: {time.time() - start:.3f}s")
Output:
Python list: 0.847s
NumPy array: 0.008s
NumPy is ~100x faster for this operation.
Broadcasting
Broadcasting lets NumPy operate on arrays of different shapes without making copies.
# Add a scalar to a 2D array
m = np.array([[1, 2, 3],
              [4, 5, 6]])
m + 10
# [[11 12 13]
# [14 15 16]]
# Add a 1D array to each row of a 2D array
row = np.array([1, 2, 3])
m + row
# [[ 2 4 6]
# [ 5 7 9]]
# Add a column vector to each column
col = np.array([[10], [20]]) # shape (2, 1)
m + col
# [[11 12 13]
# [24 25 26]]
Broadcasting rules (simplified): dimensions are compared right-to-left. A dimension of 1 is stretched to match the other.
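For example, a (2, 1) column and a (3,) row broadcast to (2, 3): comparing right-to-left, the column's trailing 1 stretches to 3, and the row's missing leading dimension is treated as 1 and stretches to 2:

```python
import numpy as np

col = np.array([[10], [20]])  # shape (2, 1)
row = np.array([1, 2, 3])     # shape (3,)

result = col + row            # (2, 1) + (3,) -> (2, 3)
print(result)
# [[11 12 13]
#  [21 22 23]]
```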
Mathematical Functions
a = np.array([1, 4, 9, 16, 25])
np.sqrt(a) # [1. 2. 3. 4. 5.]
np.log(a) # natural log
np.log2(a)
np.log10(a)
np.exp(a) # e^x
np.abs(a) # absolute value
np.sin(a)
np.cos(a)
np.floor(a)
np.ceil(a)
np.round(a, 2)
Aggregation — Summary Statistics
a = np.array([4, 7, 2, 9, 1, 5, 8, 3, 6])
np.sum(a) # 45
np.mean(a) # 5.0
np.median(a) # 5.0
np.std(a) # standard deviation: 2.58
np.var(a) # variance: 6.67
np.min(a) # 1
np.max(a) # 9
np.argmin(a) # 4 — index of minimum
np.argmax(a) # 3 — index of maximum
np.cumsum(a) # running total
np.sort(a) # sorted copy
# 2D aggregation — specify axis
m = np.array([[1, 2, 3],
              [4, 5, 6]])
np.sum(m) # 21 — total
np.sum(m, axis=0) # [5 7 9] — sum each column
np.sum(m, axis=1) # [ 6 15] — sum each row
np.mean(m, axis=0) # [2.5 3.5 4.5] — mean of each column
Reshaping and Combining
a = np.arange(12) # [0 1 2 3 4 5 6 7 8 9 10 11]
# Reshape — total elements must be the same
a.reshape(3, 4) # 3 rows, 4 columns
a.reshape(2, 2, 3) # 3D: 2x2x3
a.reshape(-1, 3) # -1 means "figure it out" -> (4, 3)
# Flatten back to 1D
m = a.reshape(3, 4)
m.flatten() # copy
m.ravel() # view (faster, shares memory)
# Transpose
m.T # swap rows and columns
# Stack arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.vstack([a, b]) # [[1 2 3] [4 5 6]] — stack vertically
np.hstack([a, b]) # [1 2 3 4 5 6] — stack horizontally
np.concatenate([a, b]) # [1 2 3 4 5 6]
# Split
np.split(a, 3) # [array([1]), array([2]), array([3])]
Linear Algebra
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
# Matrix multiplication
A @ B # [[19 22] [43 50]]
np.dot(A, B) # same
# Determinant
np.linalg.det(A) # -2.0
# Inverse
np.linalg.inv(A) # [[-2. 1. ] [ 1.5 -0.5]]
# Eigenvalues and eigenvectors
vals, vecs = np.linalg.eig(A)
# Solve Ax = b
b = np.array([1, 2])
x = np.linalg.solve(A, b) # solution to Ax = b
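np.linalg.solve is both faster and more numerically stable than computing the inverse and multiplying. A quick sanity check with np.allclose confirms the two routes agree and that the solution actually satisfies Ax = b:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([1.0, 2.0])

x = np.linalg.solve(A, b)
print(np.allclose(A @ x, b))                  # True — A @ x reproduces b
print(np.allclose(x, np.linalg.inv(A) @ b))   # True — same answer, more work
```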
Working with Real Data
import numpy as np
# Load from a text file
data = np.loadtxt("data.csv", delimiter=",", skiprows=1)
# Load from a binary file (much faster for large data)
np.save("data.npy", data)
loaded = np.load("data.npy")
# Generate sample sensor data: 1000 readings with noise
np.random.seed(42)
time_points = np.linspace(0, 10, 1000) # 0 to 10 seconds
signal = np.sin(2 * np.pi * time_points) # clean sine wave
noise = np.random.randn(1000) * 0.1 # random noise
readings = signal + noise
# Find peaks
peaks = np.where(readings > 0.9)[0]
print(f"Found {len(peaks)} peaks above 0.9")
print(f"Mean reading: {readings.mean():.4f}")
print(f"Std deviation: {readings.std():.4f}")
Practical Project — Grade Analysis
"""
grade_analysis.py
Analyse student exam scores using NumPy.
"""
import numpy as np
def letter_grade(score: float) -> str:
    if score >= 90: return "A"
    if score >= 80: return "B"
    if score >= 70: return "C"
    if score >= 60: return "D"
    return "F"
def analyse_grades(scores: np.ndarray) -> dict:
    """Return summary statistics for an array of scores."""
    return {
        "count": len(scores),
        "mean": round(float(scores.mean()), 2),
        "median": round(float(np.median(scores)), 2),
        "std": round(float(scores.std()), 2),
        "min": float(scores.min()),
        "max": float(scores.max()),
        "passing": int((scores >= 60).sum()),
        "failing": int((scores < 60).sum()),
    }
def grade_distribution(scores: np.ndarray) -> dict:
    """Count scores in each letter-grade bucket."""
    labels = ["A", "B", "C", "D", "F"]
    lows = [90, 80, 70, 60, 0]
    highs = [101, 90, 80, 70, 60]  # 101 so a perfect 100 still counts as an A
    dist = {}
    for label, low, high in zip(labels, lows, highs):
        dist[label] = int(((scores >= low) & (scores < high)).sum())
    return dist
def normalise(scores: np.ndarray) -> np.ndarray:
    """Scale scores to [0, 1]."""
    min_s, max_s = scores.min(), scores.max()
    if max_s == min_s:
        return np.zeros_like(scores, dtype=float)
    return (scores - min_s) / (max_s - min_s)
def curve(scores: np.ndarray, target_mean: float = 75.0) -> np.ndarray:
    """Shift scores so the mean hits target_mean."""
    shift = target_mean - scores.mean()
    curved = scores + shift
    return np.clip(curved, 0, 100)  # cap at 0-100
# ── Demo ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    np.random.seed(99)

    # Simulate 30 students, 4 exams
    raw = np.random.normal(loc=68, scale=15, size=(30, 4))
    raw = np.clip(raw, 0, 100).round(1)

    # Final score = weighted average (later exams count more)
    weights = np.array([0.2, 0.2, 0.25, 0.35])
    finals = raw @ weights  # matrix multiply: (30, 4) @ (4,) -> (30,)

    stats = analyse_grades(finals)
    dist = grade_distribution(finals)

    print("=== Grade Analysis ===")
    print(f"Students: {stats['count']}")
    print(f"Mean: {stats['mean']:.1f} Median: {stats['median']:.1f}")
    print(f"Std dev: {stats['std']:.1f}")
    print(f"Min: {stats['min']:.1f} Max: {stats['max']:.1f}")
    print(f"Passing: {stats['passing']} Failing: {stats['failing']}")

    print("\nGrade Distribution:")
    for grade, count in dist.items():
        bar = "█" * count
        print(f" {grade}: {bar} ({count})")

    # Apply a curve if the mean falls below 70
    if stats["mean"] < 70:
        curved = curve(finals, target_mean=72)
        print(f"\nCurve applied: mean {stats['mean']:.1f} -> {curved.mean():.1f}")
Output:
=== Grade Analysis ===
Students: 30
Mean: 67.8 Median: 68.4
Std dev: 13.2
Min: 38.1 Max: 94.7
Passing: 22 Failing: 8
Grade Distribution:
A: ██ (2)
B: ████ (4)
C: ████████ (8)
D: ████████ (8)
F: ████████ (8)
Curve applied: mean 67.8 -> 72.0
Common Pitfalls
Views vs copies
a = np.array([1, 2, 3, 4, 5])
# Slices return VIEWS — modifying them modifies the original
b = a[1:4]
b[0] = 99
print(a) # [ 1 99 3 4 5] <- a is changed!
# Make a copy if you don't want this
c = a[1:4].copy()
c[0] = 0
print(a) # [ 1 99 3 4 5] <- writing to the copy left a alone
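When you're unsure whether two arrays share memory, np.shares_memory gives a definitive answer:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
view = a[1:4]         # slicing returns a view
dup = a[1:4].copy()   # explicit copy owns its own data

print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, dup))   # False
```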
Integer division
a = np.array([1, 2, 3])
print(a / 2) # [0.5 1. 1.5] — true division, float result
print(a // 2) # [0 1 1] — floor division, int result
Shape mismatches
a = np.array([1, 2, 3]) # shape (3,)
b = np.array([[1], [2], [3]]) # shape (3, 1)
# a + b -> shape (3, 3) due to broadcasting — often unexpected!
When in doubt, use .reshape() to make shapes explicit.
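For instance, reshaping b from (3, 1) down to (3,) makes the intended element-wise addition explicit instead of triggering the (3, 3) broadcast:

```python
import numpy as np

a = np.array([1, 2, 3])        # shape (3,)
b = np.array([[1], [2], [3]])  # shape (3, 1)

print((a + b).shape)              # (3, 3) — broadcast, probably not intended
print((a + b.reshape(-1)).shape)  # (3,)   — element-wise, as intended
print(a + b.reshape(-1))          # [2 4 6]
```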
What You Learned in This Chapter
- NumPy arrays (ndarray) store homogeneous data in contiguous memory, making math 10 to 100 times faster than Python lists. .shape, .dtype, .size, and .ndim describe an array.
- np.zeros, np.ones, np.arange, np.linspace, and np.random.* create arrays.
- Indexing with [row, col], slicing with [start:stop:step], and boolean indexing a[a > 5] all work without loops.
- All arithmetic operators (+, -, *, /, **) and math functions (np.sqrt, np.log, np.sin) work element-wise.
- Broadcasting applies operations between arrays of compatible shapes without copying data.
- np.sum, np.mean, np.std, np.min, np.max, np.argmin, and np.argmax summarise arrays. Use axis= for row/column operations.
- reshape, flatten, vstack, hstack, and concatenate restructure arrays.
- @ or np.dot multiplies matrices. np.linalg handles determinants, inverses, and solvers.
- Slices are views — modifying them modifies the original. Use .copy() to avoid this.
What's Next?
Chapter 43 covers pandas — the library built on top of NumPy that adds labels, DataFrames, and powerful tools for reading, cleaning, and transforming real-world tabular data.