Chapter 30: Performance and Profiling
"My code is too slow." That's one of the most common complaints in software development. The wrong response is to start optimizing randomly — rewriting loops, switching data structures, adding caching — based on gut feeling.
The right response is to measure first. Find exactly where the time goes. Then fix that specific thing.
This chapter teaches you to measure performance precisely, understand algorithmic complexity, and apply targeted optimizations that actually matter.
The Golden Rule: Measure Before You Optimize
Donald Knuth said it in 1974: "Premature optimization is the root of all evil."
What he meant: optimizing code before you know what's slow wastes time, makes code harder to read, and often doesn't even help — because you optimized the wrong thing.
The process is always:
- Measure — find the actual bottleneck
- Understand why — is it algorithmic? I/O? Memory?
- Fix the bottleneck — change the right thing
- Measure again — verify the improvement
timeit — Measuring Small Code Snippets
timeit runs a snippet many times and tells you the average execution time. Use it to compare two approaches to the same problem.
```python
import timeit

# Compare string concatenation methods
n = 10_000

# Method 1: += in a loop
def concat_plus():
    s = ""
    for i in range(n):
        s += str(i)
    return s

# Method 2: join on a generator
def concat_join():
    return "".join(str(i) for i in range(n))

# Method 3: build a list, then join
def concat_list():
    parts = [str(i) for i in range(n)]
    return "".join(parts)

time_plus = timeit.timeit(concat_plus, number=1000)
time_join = timeit.timeit(concat_join, number=1000)
time_list = timeit.timeit(concat_list, number=1000)

print(f"concat_plus: {time_plus:.3f}s")
print(f"concat_join: {time_join:.3f}s")
print(f"concat_list: {time_list:.3f}s")
```
Output (approximate):

```
concat_plus: 2.841s
concat_join: 1.203s
concat_list: 0.891s
```
The join-based approaches are roughly two to three times faster than += for large strings. Good to know, but only worth acting on if string building is actually your bottleneck.
timeit from the command line
```shell
python -m timeit "'-'.join(str(n) for n in range(100))"
# 100000 loops, best of 5: 12.2 usec per loop

python -m timeit "'-'.join([str(n) for n in range(100)])"
# 100000 loops, best of 5: 9.94 usec per loop
```
%%timeit in Jupyter notebooks
```python
%%timeit
result = [x ** 2 for x in range(10_000)]
```
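A single `timeit` run can be skewed by whatever else the machine is doing. `timeit.repeat` runs the whole measurement several times so you can take the minimum, which is the estimate least distorted by background noise:

```python
import timeit

# repeat=5 repeats the entire measurement five times;
# number=10_000 is how many executions each measurement contains
times = timeit.repeat(
    "''.join(str(n) for n in range(100))",
    repeat=5,
    number=10_000,
)
print(f"best of 5: {min(times):.4f}s for 10,000 runs each")
```

This is exactly what `python -m timeit` does under the hood, hence its "best of 5" phrasing.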
cProfile — Profiling Entire Programs
timeit measures snippets. cProfile profiles your whole program — it tells you how many times each function was called and how long each one took.
```python
import cProfile
import pstats

def slow_function():
    total = 0
    for i in range(100_000):
        total += i ** 2
    return total

def another_function():
    data = [slow_function() for _ in range(3)]
    return sum(data)

def main():
    result = another_function()
    print(f"Result: {result}")

# Profile main()
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Print sorted results
stats = pstats.Stats(profiler)
stats.strip_dirs()
stats.sort_stats("cumulative")  # sort by cumulative time
stats.print_stats(20)           # show top 20 functions
```
Output (abridged):

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.234    1.234 script.py:14(main)
        1    0.000    0.000    1.234    1.234 script.py:10(another_function)
        3    1.234    0.411    1.234    0.411 script.py:5(slow_function)
```
Columns explained:
- `ncalls` — number of times the function was called
- `tottime` — total time spent in this function (excluding sub-calls)
- `percall` — `tottime` divided by `ncalls`
- `cumtime` — cumulative time including all sub-calls
- `percall` (second) — `cumtime` divided by `ncalls`
slow_function accounts for 1.234s — the clear bottleneck.
From the command line
```shell
python -m cProfile -s cumulative my_script.py
python -m cProfile -o profile.out my_script.py   # save to file
```
Visualizing with snakeviz
```shell
pip install snakeviz
python -m cProfile -o profile.out my_script.py
snakeviz profile.out   # opens an interactive flame graph in your browser
```
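A profile saved with `-o` can also be reloaded later in plain Python, using the same `pstats` API as in-process profiling. A small sketch (the filename `profile.out` is whatever you passed to `-o`; here a sample profile is generated inline so the snippet is self-contained):

```python
import cProfile
import pstats

# Create a sample profile file (stand-in for `python -m cProfile -o profile.out ...`)
cProfile.run("sum(i * i for i in range(100_000))", "profile.out")

# Reload and inspect it later; Stats methods chain because each returns self
stats = pstats.Stats("profile.out")
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
```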
line_profiler — Line-by-Line Timing
cProfile tells you which function is slow. line_profiler tells you which line inside that function is slow.
```shell
pip install line_profiler
```
```python
# profile_example.py
from line_profiler import LineProfiler

def process_data(data):
    cleaned = [x for x in data if x > 0]            # filter
    squared = [x ** 2 for x in cleaned]             # transform
    filtered = [x for x in squared if x % 2 == 0]   # filter again
    return sorted(filtered)                         # sort

data = list(range(-1000, 1000))

lp = LineProfiler()
lp.add_function(process_data)
lp.enable()
result = process_data(data)
lp.disable()
lp.print_stats()
```
Output (abridged):

```
Line #   Hits    Time   Per Hit  % Time  Line Contents
     5      1   142.0     142.0     8.5  cleaned = [x for x in data if x > 0]
     6      1   310.0     310.0    18.6  squared = [x ** 2 for x in cleaned]
     7      1   280.0     280.0    16.8  filtered = [x for x in squared if x % 2 == 0]
     8      1   934.0     934.0    56.1  return sorted(filtered)
```
The sorted() call takes 56% of the time. That's where to focus.
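When line_profiler isn't available, a quick manual version with the standard library's `time.perf_counter` gives a coarser but dependency-free approximation of the same breakdown (`process_data_timed` is just an illustrative name; the pipeline mirrors the example above):

```python
import time

def process_data_timed(data):
    """Same pipeline as above, with a manual per-step stopwatch."""
    timings = {}

    start = time.perf_counter()
    cleaned = [x for x in data if x > 0]
    timings["filter"] = time.perf_counter() - start

    start = time.perf_counter()
    squared = [x ** 2 for x in cleaned]
    timings["transform"] = time.perf_counter() - start

    start = time.perf_counter()
    result = sorted(x for x in squared if x % 2 == 0)
    timings["filter + sort"] = time.perf_counter() - start

    return result, timings

result, timings = process_data_timed(list(range(-1000, 1000)))
for step, seconds in timings.items():
    print(f"{step:>13}: {seconds * 1e6:.0f} microseconds")
```

This approach clutters the code, which is exactly why line_profiler exists, but it is handy for a one-off check.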
memory_profiler — Measuring Memory Usage
```shell
pip install memory_profiler
```
```python
from memory_profiler import profile

@profile
def load_large_data():
    data = list(range(1_000_000))              # list: ~8MB
    total = sum(data)                          # compute sum
    del data                                   # free memory
    result = [x ** 2 for x in range(10_000)]   # smaller list
    return total, result

load_large_data()
```
Output (abridged):

```
Line #  Mem usage  Increment  Line Contents
     4   45.3 MiB   45.3 MiB  def load_large_data():
     5   53.1 MiB    7.8 MiB      data = list(range(1_000_000))
     6   53.1 MiB    0.0 MiB      total = sum(data)
     7   45.3 MiB   -7.8 MiB      del data
     8   45.6 MiB    0.3 MiB      result = [x ** 2 for x in range(10_000)]
```
Each line shows current memory usage and the change. data takes 7.8MB; del data releases it.
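The standard library offers a no-install alternative: `tracemalloc` traces allocations so you can read off current and peak usage and see which source lines allocated the most. A minimal sketch:

```python
import tracemalloc

tracemalloc.start()

data = list(range(1_000_000))  # the big allocation
total = sum(data)

# (current, peak) traced memory in bytes since start()
current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 2**20:.1f} MiB, peak: {peak / 2**20:.1f} MiB")

# Top allocation sites, grouped by source line
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)

tracemalloc.stop()
```

Unlike memory_profiler, tracemalloc reports only memory allocated by Python objects, not the process's total footprint, so the numbers differ from the table above.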
Algorithmic Complexity — The Real Performance Win
Tools are for measuring. But the biggest performance wins come from choosing better algorithms. A 10x faster computer running an O(n^2) algorithm is still slower than a slow computer running O(n log n) for large n.
Big O notation describes how an algorithm's runtime grows with input size n.
| Complexity | Name | Example |
|---|---|---|
| O(1) | Constant | Dictionary lookup, list append |
| O(log n) | Logarithmic | Binary search, balanced tree operations |
| O(n) | Linear | Loop through a list once |
| O(n log n) | Linearithmic | sorted(), good sorting algorithms |
| O(n^2) | Quadratic | Nested loops over n items |
| O(2ⁿ) | Exponential | Naive recursive Fibonacci |
| O(n!) | Factorial | Brute-force permutations |
```
n = 1,000,000

O(1):        1 operation
O(log n):    20 operations
O(n):        1,000,000 operations
O(n log n):  20,000,000 operations
O(n^2):      1,000,000,000,000 operations  <- 1 trillion
```
The jump from O(n) to O(n^2) is the difference between milliseconds and hours.
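To make O(log n) concrete: the standard library's `bisect` module does binary search on a sorted list, halving the remaining search space at every step, so 1,000,000 elements take at most ~20 comparisons. A small sketch (`contains_sorted` is an illustrative helper, not a stdlib function):

```python
import bisect

sorted_data = list(range(0, 1_000_000, 2))  # even numbers, already sorted

def contains_sorted(items, target):
    """O(log n) membership test on a sorted list via binary search."""
    i = bisect.bisect_left(items, target)
    return i < len(items) and items[i] == target

print(contains_sorted(sorted_data, 123_456))  # True  (even, in range)
print(contains_sorted(sorted_data, 123_457))  # False (odd numbers are absent)
```

The catch: binary search requires the list to stay sorted, so it pays off when you search far more often than you insert.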
Seeing it in practice
```python
import timeit

def has_duplicate_quadratic(items):
    """O(n^2) — nested loop"""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicate_linear(items):
    """O(n) — set lookup"""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False

def has_duplicate_oneliner(items):
    """O(n) — even cleaner"""
    return len(items) != len(set(items))

sizes = [100, 1_000, 10_000, 100_000]
for n in sizes:
    data = list(range(n))  # no duplicates — worst case
    t1 = timeit.timeit(lambda: has_duplicate_quadratic(data), number=1)
    t2 = timeit.timeit(lambda: has_duplicate_linear(data), number=1)
    print(f"n={n:>7,}: O(n^2)={t1:.4f}s  O(n)={t2:.6f}s  ratio={t1/t2:.0f}x")
```
Output (approximate):

```
n=    100: O(n^2)=0.0001s  O(n)=0.000010s  ratio=12x
n=  1,000: O(n^2)=0.0051s  O(n)=0.000089s  ratio=57x
n= 10,000: O(n^2)=0.4821s  O(n)=0.000891s  ratio=541x
n=100,000: O(n^2)=47.8320s  O(n)=0.009103s  ratio=5254x
```
The same logic, 5000x faster at n=100,000 — just by choosing a set instead of a nested loop.
Choosing the Right Data Structure
The single most impactful optimization is usually choosing the right data structure.
```python
import timeit

# Membership testing
large_list = list(range(100_000))
large_set = set(range(100_000))
large_dict = {i: True for i in range(100_000)}

target = 99_999  # worst case — near the end of the list

t_list = timeit.timeit(lambda: target in large_list, number=10_000)
t_set = timeit.timeit(lambda: target in large_set, number=10_000)
t_dict = timeit.timeit(lambda: target in large_dict, number=10_000)

print(f"list: {t_list:.4f}s   O(n)")
print(f"set:  {t_set:.6f}s  O(1)")
print(f"dict: {t_dict:.6f}s  O(1)")
```
Output (approximate):

```
list: 3.8412s   O(n)
set:  0.0003s  O(1)
dict: 0.0003s  O(1)
```
Set and dict lookups are 10,000x faster for large collections because they use hash tables.
Data structure complexity cheat sheet
| Operation | list | set | dict | deque |
|---|---|---|---|---|
| Append (end) | O(1) | — | — | O(1) |
| Insert (middle) | O(n) | — | — | O(n) |
| Delete (middle) | O(n) | — | — | O(n) |
| Pop (end) | O(1) | — | — | O(1) |
| Pop (front) | O(n) | — | — | O(1) |
| Lookup by index | O(1) | — | — | O(n) |
| Membership test | O(n) | O(1) | O(1) | O(n) |
| Insert/delete | — | O(1) | O(1) | — |
| Sorted order | Yes (if sorted) | No | No (insertion order since 3.7) | Yes (if sorted) |
Use collections.deque when you need fast prepend (front insertion). Use set or dict for fast membership testing.
```python
from collections import deque

# Queue — FIFO (First In, First Out)
queue = deque()
queue.append("first")
queue.append("second")
queue.append("third")
print(queue.popleft())  # "first" — O(1)

# Stack — LIFO (Last In, First Out)
stack = []
stack.append("first")
stack.append("second")
print(stack.pop())  # "second" — O(1)

# deque as a sliding window (fixed-size buffer)
window = deque(maxlen=3)
for x in range(10):
    window.append(x)
print(list(window))  # [7, 8, 9] — never longer than 3
```
Practical Optimization Techniques
1. Avoid repeated work — cache results
```python
# Slow: recompute the same thing every iteration
def process(data):
    for item in data:
        if len(data) > 1000:  # len(data) called every iteration
            do_something(item)

# Fast: compute once
def process(data):
    n = len(data)
    for item in data:
        if n > 1000:
            do_something(item)
```
2. Move work out of inner loops
```python
import re

# Slow: compile the regex on every call
def count_emails(texts):
    return sum(1 for t in texts if re.match(r"[^@]+@[^@]+\.[^@]+", t))

# Fast: compile once outside the loop (or at module level)
EMAIL_RE = re.compile(r"[^@]+@[^@]+\.[^@]+")

def count_emails(texts):
    return sum(1 for t in texts if EMAIL_RE.match(t))
```
3. Use built-ins — they're implemented in C
```python
import timeit

# Slow: Python loop
def sum_squares_loop(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

# Fast: built-in sum with generator
def sum_squares_builtin(n):
    return sum(i * i for i in range(n))

# Even faster for math-heavy work: numpy
import numpy as np

def sum_squares_numpy(n):
    return int(np.sum(np.arange(n) ** 2))

n = 1_000_000
t1 = timeit.timeit(lambda: sum_squares_loop(n), number=5)
t2 = timeit.timeit(lambda: sum_squares_builtin(n), number=5)
print(f"Loop:     {t1:.3f}s")
print(f"Built-in: {t2:.3f}s")
```
4. List comprehensions over explicit loops
```python
# Slower
result = []
for x in range(10_000):
    if x % 2 == 0:
        result.append(x ** 2)

# Faster (comprehensions are optimized in CPython)
result = [x ** 2 for x in range(10_000) if x % 2 == 0]
```
5. Use generators for large data
```python
# Loads everything into memory
total = sum([x ** 2 for x in range(10_000_000)])

# Lazy — one item at a time, much lower memory
total = sum(x ** 2 for x in range(10_000_000))
```
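The difference is easy to see with `sys.getsizeof`: the list materializes every element up front, while a generator object is a small fixed-size bookkeeping structure. (Note that `getsizeof` measures only the container itself, not the integers it references, so the list's true footprint is even larger.)

```python
import sys

squares_list = [x ** 2 for x in range(1_000_000)]
squares_gen = (x ** 2 for x in range(1_000_000))

print(f"list:      {sys.getsizeof(squares_list) / 2**20:.1f} MiB")
print(f"generator: {sys.getsizeof(squares_gen)} bytes")  # a few hundred bytes at most
```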
6. Local variable lookup is faster than global
```python
import math

# Slow: looks up math.sqrt through the module on every iteration
def compute_slow(data):
    return [math.sqrt(x) for x in data]

# Fast: bind to a local variable first
def compute_fast(data):
    sqrt = math.sqrt  # local lookup is faster
    return [sqrt(x) for x in data]
```
7. String formatting — f-strings are fastest
```python
name, value = "Alice", 42

# Slowest
s = "Hello " + name + ", value is " + str(value)

# Fastest
s = f"Hello {name}, value is {value}"

# Acceptable
s = "Hello {}, value is {}".format(name, value)
```
8. functools.lru_cache for expensive pure functions
```python
import time
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_lookup(key):
    time.sleep(0.1)  # simulate an expensive computation or DB query
    return key * 2

expensive_lookup(42)  # first call: ~0.1s
expensive_lookup(42)  # subsequent calls with the same key: instant (cached)
```
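The classic demonstration is recursive Fibonacci, where caching collapses an exponential call tree into a linear one, because each `fib(k)` is computed once and then served from the cache:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(100))          # 354224848179261915075, computed instantly
print(fib.cache_info())  # hits and misses show how much recomputation was avoided
```

Without the decorator, `fib(100)` would take longer than the age of the universe; with it, the function body runs only 101 times.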
numpy for Numerical Work
If you're doing numerical computation — matrix operations, statistics, signal processing — pure Python is the wrong tool. numpy operates in C on contiguous arrays and is 10-1000x faster.
```python
import numpy as np
import timeit

n = 1_000_000

# Python list vs numpy array — element-wise squaring
py_list = list(range(n))
np_array = np.arange(n)

t_py = timeit.timeit(lambda: [x ** 2 for x in py_list], number=5)
t_np = timeit.timeit(lambda: np_array ** 2, number=5)

print(f"Python list: {t_py:.3f}s")
print(f"Numpy array: {t_np:.4f}s")
print(f"Speedup:     {t_py / t_np:.0f}x")
```
Output (approximate):

```
Python list: 2.341s
Numpy array: 0.024s
Speedup:     97x
```
```python
# numpy operations are vectorized — no Python loops needed
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

print(a + b)         # [11 22 33 44 55]
print(a * b)         # [ 10  40  90 160 250]
print(np.dot(a, b))  # 550 (dot product)
print(np.sqrt(a))    # [1.    1.414 1.732 2.    2.236] (rounded)

# 2D arrays (matrices)
matrix = np.arange(9).reshape(3, 3)
print(matrix)
# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]
print(matrix.T)         # transpose
print(matrix @ matrix)  # matrix multiplication
```
__slots__ — Reducing Memory for Many Objects
By default, Python objects store their attributes in a dictionary (__dict__). For classes where you create millions of instances, this wastes memory. __slots__ replaces the dict with a fixed set of attributes:
```python
import sys

class PointDict:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

class PointSlots:
    __slots__ = ("x", "y", "z")

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

pd = PointDict(1.0, 2.0, 3.0)
ps = PointSlots(1.0, 2.0, 3.0)

print(f"With __dict__:  {sys.getsizeof(pd) + sys.getsizeof(pd.__dict__)} bytes")
print(f"With __slots__: {sys.getsizeof(ps)} bytes")
# With __dict__:  232 bytes (approximate; varies by Python version)
# With __slots__: 56 bytes — about 4x smaller
```
For a million instances, that's roughly 232MB versus 56MB. Use __slots__ when you create many instances of a class with a fixed set of attributes.
Profiling in Practice: A Case Study
Let's profile a realistic slow function and fix it:
```python
import random
import re
import timeit
from collections import Counter

# ── Version 1: Naive ──────────────────────────────────────────────────────────

def word_frequency_v1(texts):
    """Count word frequencies across a list of strings."""
    frequencies = {}
    for text in texts:
        words = text.lower().split()
        for word in words:
            word = "".join(c for c in word if c.isalpha())  # strip punctuation
            if word:
                if word in frequencies:
                    frequencies[word] += 1
                else:
                    frequencies[word] = 1
    return dict(sorted(frequencies.items(), key=lambda x: x[1], reverse=True))

# ── Version 2: Optimized ──────────────────────────────────────────────────────

WORD_RE = re.compile(r"[a-z]+")  # compiled once at module level

def word_frequency_v2(texts):
    """Faster version using Counter and a compiled regex."""
    counter = Counter()
    for text in texts:
        counter.update(WORD_RE.findall(text.lower()))
    return dict(counter.most_common())

# ── Benchmark ─────────────────────────────────────────────────────────────────

# Generate test data
words = ["python", "programming", "fast", "slow", "optimize", "measure",
         "performance", "data", "code", "function", "class", "list", "dict"]
texts = [
    " ".join(random.choices(words, k=100)) + "."
    for _ in range(1_000)
]

t1 = timeit.timeit(lambda: word_frequency_v1(texts), number=10)
t2 = timeit.timeit(lambda: word_frequency_v2(texts), number=10)
print(f"v1 (naive):     {t1:.3f}s")
print(f"v2 (optimized): {t2:.3f}s")
print(f"Speedup:        {t1/t2:.1f}x")
```
Output (approximate):

```
v1 (naive):     4.821s
v2 (optimized): 0.312s
Speedup:        15.5x
```
The optimizations:

- A compiled regex (`WORD_RE`) instead of `"".join(c for c in word if c.isalpha())` in a loop
- `Counter` (implemented in C) instead of a dict with manual incrementing
- `WORD_RE.findall()` processes the entire string at once instead of looping character by character
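To confirm where v1 actually spends its time, cProfile can be pointed at it directly. This sketch repeats v1's counting loop on freshly generated test data (a condensed copy, minus the final sort, so the snippet stands alone); the per-character generator expression should dominate the tottime column:

```python
import cProfile
import pstats
import random

words = ["python", "fast", "slow", "optimize", "measure", "code"]
texts = [" ".join(random.choices(words, k=100)) + "." for _ in range(200)]

def word_frequency_v1(texts):
    """Same counting logic as the naive version above."""
    frequencies = {}
    for text in texts:
        for word in text.lower().split():
            word = "".join(c for c in word if c.isalpha())
            if word:
                if word in frequencies:
                    frequencies[word] += 1
                else:
                    frequencies[word] = 1
    return frequencies

profiler = cProfile.Profile()
profiler.enable()
counts = word_frequency_v1(texts)
profiler.disable()
pstats.Stats(profiler).strip_dirs().sort_stats("tottime").print_stats(5)
```

This closes the loop on the chapter's process: measure, fix the bottleneck the measurement names, then measure again.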
Performance Checklist
Before writing a single line of optimized code:

- Profile first. Use `cProfile` to find the actual bottleneck.
- Check the algorithm. Is there an O(n) solution to your O(n^2) problem?
- Check the data structure. Replace `list` with `set` or `dict` for membership testing.
- Use built-ins. `sum`, `max`, `sorted`, `Counter`, `defaultdict` — all implemented in C.
- Move work out of loops. Compile regexes, bind methods, precompute constants.
- Use generators for large data. Don't build a list if you only need to iterate once.
- Consider `numpy`. For numerical arrays, it's 10-100x faster than Python loops.
- Cache expensive results. `@lru_cache` for pure functions, or `@functools.cache` (Python 3.9+).
- Use `__slots__` for classes with many instances and fixed attributes.
- Measure the improvement. Confirm the speedup with `timeit` before and after.
What You Learned in This Chapter
- Always measure before optimizing. Guessing wastes time and often makes things worse.
- `timeit.timeit()` benchmarks small code snippets accurately.
- `cProfile` profiles entire programs — shows call count, total time, and cumulative time per function.
- `snakeviz` visualizes `cProfile` output as an interactive flame graph.
- `line_profiler` shows timing line by line inside a function.
- `memory_profiler` shows memory usage line by line.
- Big O notation describes how runtime scales with input size. O(n^2) vs O(n) can be the difference between seconds and hours.
- Choosing the right data structure is the highest-leverage optimization: `set` and `dict` membership is O(1), `list` is O(n).
- `collections.deque` for fast front/back operations. `collections.Counter` for counting. `collections.defaultdict` for grouped data.
- Practical wins: cache results, move work out of loops, compile regexes, use built-ins, use comprehensions, use generators for large data.
- `numpy` is 10-100x faster than Python loops for numerical operations on arrays.
- `__slots__` cuts per-object memory by roughly 4x for classes with fixed attributes.
What's Next?
Chapter 31 covers Working with Databases — SQLite, PostgreSQL with psycopg2, and SQLAlchemy ORM. Data that outlives your program lives in a database, and knowing how to read, write, query, and migrate data is a fundamental skill for any Python developer.