Python: Zero to Hero

Chapter 30: Performance and Profiling

"My code is too slow." That's one of the most common complaints in software development. The wrong response is to start optimizing randomly — rewriting loops, switching data structures, adding caching — based on gut feeling.

The right response is to measure first. Find exactly where the time goes. Then fix that specific thing.

This chapter teaches you to measure performance precisely, understand algorithmic complexity, and apply targeted optimizations that actually matter.

The Golden Rule: Measure Before You Optimize

Donald Knuth said it in 1974: "Premature optimization is the root of all evil."

What he meant: optimizing code before you know what's slow wastes time, makes code harder to read, and often doesn't even help — because you optimized the wrong thing.

The process is always:

  1. Measure — find the actual bottleneck
  2. Understand why — is it algorithmic? I/O? Memory?
  3. Fix the bottleneck — change the right thing
  4. Measure again — verify the improvement
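Steps 1 and 4 can start as simply as a stopwatch around the code under test. A minimal sketch using time.perf_counter(), the high-resolution clock intended for benchmarking:

```python
import time

start = time.perf_counter()
total = sum(i * i for i in range(1_000_000))   # the code under test
elapsed = time.perf_counter() - start

print(f"sum of squares: {total} in {elapsed:.3f}s")
```

Run this before and after a change and you already have a crude before/after comparison. The tools below just make the measurement more precise and more detailed.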

timeit — Measuring Small Code Snippets

timeit runs a snippet many times and reports the total elapsed time for all those runs (divide by number for the per-call average). Use it to compare two approaches to the same problem.

import timeit

# Compare string concatenation methods
n = 10_000

# Method 1: += in a loop
def concat_plus():
    s = ""
    for i in range(n):
        s += str(i)
    return s

# Method 2: join
def concat_join():
    return "".join(str(i) for i in range(n))

# Method 3: list comprehension + join
def concat_list():
    parts = [str(i) for i in range(n)]
    return "".join(parts)


time_plus = timeit.timeit(concat_plus, number=1000)
time_join = timeit.timeit(concat_join, number=1000)
time_list = timeit.timeit(concat_list, number=1000)

print(f"concat_plus: {time_plus:.3f}s")
print(f"concat_join: {time_join:.3f}s")
print(f"concat_list: {time_list:.3f}s")

Output (approximate):

concat_plus: 2.841s
concat_join: 1.203s
concat_list: 0.891s

Building the parts in a list and joining is roughly 3x faster than += for large strings. Good to know — but only worth doing if string building is actually your bottleneck.

timeit from the command line

python -m timeit "'-'.join(str(n) for n in range(100))"
# 100000 loops, best of 5: 12.2 usec per loop

python -m timeit "'-'.join([str(n) for n in range(100)])"
# 100000 loops, best of 5: 9.94 usec per loop

%%timeit in Jupyter notebooks

%%timeit
result = [x**2 for x in range(10_000)]
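Outside a notebook, timeit.repeat gives the same best-of-N behavior: it performs several independent timing runs so you can take the fastest (least noisy) one. A small sketch:

```python
import timeit

# 5 independent runs of 10,000 executions each; take the minimum
times = timeit.repeat("'-'.join(str(n) for n in range(100))",
                      repeat=5, number=10_000)
best = min(times) / 10_000

print(f"best of 5: {best * 1e6:.2f} usec per loop")
```

The minimum is the right statistic here: the fastest run is the one least disturbed by other processes on the machine.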

cProfile — Profiling Entire Programs

timeit measures snippets. cProfile profiles your whole program — it tells you how many times each function was called and how long each one took.

import cProfile
import pstats

def slow_function():
    total = 0
    for i in range(100_000):
        total += i ** 2
    return total

def another_function():
    data = [slow_function() for _ in range(3)]
    return sum(data)

def main():
    result = another_function()
    print(f"Result: {result}")


# Profile main()
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Print sorted results
stats = pstats.Stats(profiler)
stats.strip_dirs()
stats.sort_stats("cumulative")    # sort by cumulative time
stats.print_stats(20)             # show top 20 functions

Output:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.234    1.234 script.py:14(main)
        1    0.000    0.000    1.234    1.234 script.py:10(another_function)
        3    1.234    0.411    1.234    0.411 script.py:5(slow_function)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.print}

Columns explained:

  • ncalls — number of times the function was called
  • tottime — total time spent in this function (excluding sub-calls)
  • percall — tottime divided by ncalls
  • cumtime — cumulative time including all sub-calls
  • percall (second) — cumtime divided by ncalls

slow_function accounts for 1.234s — the clear bottleneck.

From the command line

python -m cProfile -s cumulative my_script.py
python -m cProfile -o profile.out my_script.py   # save to file
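A saved profile can be loaded back later with pstats. A minimal sketch (the expression being profiled and the filename profile.out are just examples):

```python
import cProfile
import pstats

# Profile an expression and write the raw stats to a file
cProfile.run("sum(i * i for i in range(100_000))", "profile.out")

# Load the file later and inspect it
stats = pstats.Stats("profile.out")
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
```

This is handy for profiling on a server and analyzing the dump on your own machine.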

Visualizing with snakeviz

pip install snakeviz
python -m cProfile -o profile.out my_script.py
snakeviz profile.out   # opens an interactive visualization in your browser

line_profiler — Line-by-Line Timing

cProfile tells you which function is slow. line_profiler tells you which line inside that function is slow.

pip install line_profiler
# profile_example.py
from line_profiler import LineProfiler

def process_data(data):
    cleaned  = [x for x in data if x > 0]          # filter
    squared  = [x ** 2 for x in cleaned]            # transform
    filtered = [x for x in squared if x % 2 == 0]  # filter again
    return sorted(filtered)                          # sort

data = list(range(-1000, 1000))

lp = LineProfiler()
lp.add_function(process_data)
lp.enable()
result = process_data(data)
lp.disable()
lp.print_stats()

Output:

Line #   Hits    Time  Per Hit  % Time  Line Contents
     5      1   142.0    142.0     8.5  cleaned  = [x for x in data if x > 0]
     6      1   310.0    310.0    18.6  squared  = [x ** 2 for x in cleaned]
     7      1   280.0    280.0    16.8  filtered = [x for x in squared if x % 2 == 0]
     8      1   934.0    934.0    56.1  return sorted(filtered)

The sorted() call takes 56% of the time. That's where to focus.

memory_profiler — Measuring Memory Usage

pip install memory_profiler
from memory_profiler import profile

@profile
def load_large_data():
    data  = list(range(1_000_000))          # list: ~8MB
    total = sum(data)                        # compute sum
    del data                                 # free memory
    result = [x ** 2 for x in range(10_000)]  # smaller list
    return total, result

load_large_data()

Output:

Line #    Mem usage    Increment  Line Contents
     4   45.3 MiB    45.3 MiB   def load_large_data():
     5   53.1 MiB     7.8 MiB   data  = list(range(1_000_000))
     6   53.1 MiB     0.0 MiB   total = sum(data)
     7   45.3 MiB    -7.8 MiB   del data
     8   45.6 MiB     0.3 MiB   result = [x**2 for x in range(10_000)]

Each line shows current memory usage and the change. data takes 7.8MB; del data releases it.
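If you want to avoid a third-party install, the standard library's tracemalloc tracks allocations too. Note it reports Python-level allocations rather than process memory, so its numbers differ from memory_profiler's:

```python
import tracemalloc

tracemalloc.start()
data = [x ** 2 for x in range(100_000)]        # allocate something sizable
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / 1e6:.1f}MB, peak: {peak / 1e6:.1f}MB")
```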

Algorithmic Complexity — The Real Performance Win

Tools are for measuring. But the biggest performance wins come from choosing better algorithms. A 10x faster computer running an O(n^2) algorithm is still slower than a slow computer running O(n log n) for large n.

Big O notation describes how an algorithm's runtime grows with input size n.

Complexity   Name           Example
O(1)         Constant       Dictionary lookup, list append
O(log n)     Logarithmic    Binary search, balanced tree operations
O(n)         Linear         Loop through a list once
O(n log n)   Linearithmic   sorted(), good sorting algorithms
O(n^2)       Quadratic      Nested loops over n items
O(2^n)       Exponential    Naive recursive Fibonacci
O(n!)        Factorial      Brute-force permutations

At n = 1,000,000:

O(1):       1 operation
O(log n):   ~20 operations
O(n):       1,000,000 operations
O(n log n): ~20,000,000 operations
O(n^2):     1,000,000,000,000 operations  <- 1 trillion

The jump from O(n) to O(n^2) is the difference between milliseconds and hours.
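The O(log n) row deserves a concrete example. The standard library's bisect module does binary search on an already-sorted list, touching about 20 elements out of a million:

```python
import bisect

sorted_data = list(range(1_000_000))

# Binary search: ~20 comparisons instead of up to 1,000,000
idx = bisect.bisect_left(sorted_data, 654_321)
print(idx)                           # 654321
print(sorted_data[idx] == 654_321)   # True
```

The catch: the list must stay sorted, so bisect pays off when you search far more often than you insert.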

Seeing it in practice

import timeit

def has_duplicate_quadratic(items):
    """O(n^2) — nested loop"""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicate_linear(items):
    """O(n) — set lookup"""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False

def has_duplicate_oneliner(items):
    """O(n) — even cleaner"""
    return len(items) != len(set(items))


sizes = [100, 1_000, 10_000, 100_000]
for n in sizes:
    data = list(range(n))   # no duplicates — worst case

    t1 = timeit.timeit(lambda: has_duplicate_quadratic(data), number=1)
    t2 = timeit.timeit(lambda: has_duplicate_linear(data),    number=1)

    print(f"n={n:>7,}: O(n^2)={t1:.4f}s  O(n)={t2:.6f}s  ratio={t1/t2:.0f}x")

Output:

n=    100: O(n^2)=0.0001s  O(n)=0.000010s  ratio=12x
n=  1,000: O(n^2)=0.0051s  O(n)=0.000089s  ratio=57x
n= 10,000: O(n^2)=0.4821s  O(n)=0.000891s  ratio=541x
n=100,000: O(n^2)=47.832s  O(n)=0.009103s  ratio=5254x

The same logic, 5000x faster at n=100,000 — just by choosing a set instead of a nested loop.

Choosing the Right Data Structure

The single most impactful optimization is usually choosing the right data structure.

import timeit

# Membership testing
large_list = list(range(100_000))
large_set  = set(range(100_000))
large_dict = {i: True for i in range(100_000)}

target = 99_999   # worst case — near the end

t_list = timeit.timeit(lambda: target in large_list, number=10_000)
t_set  = timeit.timeit(lambda: target in large_set,  number=10_000)
t_dict = timeit.timeit(lambda: target in large_dict, number=10_000)

print(f"list:  {t_list:.4f}s  O(n)")
print(f"set:   {t_set:.6f}s  O(1)")
print(f"dict:  {t_dict:.6f}s  O(1)")

Output:

list:  3.8412s  O(n)
set:   0.0003s  O(1)
dict:  0.0003s  O(1)

Set and dict lookups are 10,000x faster for large collections because they use hash tables.

Data structure complexity cheat sheet

Operation         list             set    dict                         deque
Append (end)      O(1)             -      -                            O(1)
Insert (middle)   O(n)             -      -                            O(n)
Delete (middle)   O(n)             -      -                            O(n)
Pop (end)         O(1)             -      -                            O(1)
Pop (front)       O(n)             -      -                            O(1)
Lookup by index   O(1)             -      -                            O(n)
Membership test   O(n)             O(1)   O(1)                         O(n)
Insert/delete     -                O(1)   O(1)                         -
Sorted order      Yes (if sorted)  No     No (insertion order, 3.7+)   Yes (if sorted)

Use collections.deque when you need fast prepend (front insertion). Use set or dict for fast membership testing.

from collections import deque

# Queue — FIFO (First In, First Out)
queue = deque()
queue.append("first")
queue.append("second")
queue.append("third")
print(queue.popleft())   # "first" — O(1)

# Stack — LIFO (Last In, First Out)
stack = []
stack.append("first")
stack.append("second")
print(stack.pop())   # "second" — O(1)

# deque as a sliding window (fixed-size buffer)
window = deque(maxlen=3)
for x in range(10):
    window.append(x)
    print(list(window))   # never longer than 3
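Two more collections workhorses belong next to deque: Counter for counting and defaultdict for grouping. A quick sketch:

```python
from collections import Counter, defaultdict

words = ["apple", "banana", "apple", "cherry", "banana", "apple"]

# Counter: O(n) counting without manual dict bookkeeping
counts = Counter(words)
print(counts.most_common(2))   # [('apple', 3), ('banana', 2)]

# defaultdict: group items without "if key in dict" checks
by_letter = defaultdict(list)
for w in words:
    by_letter[w[0]].append(w)
print(dict(by_letter))
```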

Practical Optimization Techniques

1. Avoid repeated work — cache results

# Slow: recompute the same thing every iteration
def process(data):
    for item in data:
        if len(data) > 1000:    # len(data) called every iteration
            do_something(item)

# Fast: compute once
def process(data):
    n = len(data)
    for item in data:
        if n > 1000:
            do_something(item)

2. Move work out of inner loops

import re

# Slow: compile the regex on every call
def count_emails(texts):
    return sum(1 for t in texts if re.match(r"[^@]+@[^@]+\.[^@]+", t))

# Fast: compile once outside the loop (or at module level)
EMAIL_RE = re.compile(r"[^@]+@[^@]+\.[^@]+")

def count_emails(texts):
    return sum(1 for t in texts if EMAIL_RE.match(t))

3. Use built-ins — they're implemented in C

# Slow: Python loop
def sum_squares_loop(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

# Fast: built-in sum with generator
def sum_squares_builtin(n):
    return sum(i * i for i in range(n))

# Even faster for math-heavy work: numpy
import numpy as np
def sum_squares_numpy(n):
    return int(np.sum(np.arange(n) ** 2))


import timeit

n = 1_000_000
t1 = timeit.timeit(lambda: sum_squares_loop(n),    number=5)
t2 = timeit.timeit(lambda: sum_squares_builtin(n), number=5)

print(f"Loop:     {t1:.3f}s")
print(f"Built-in: {t2:.3f}s")

4. List comprehensions over explicit loops

# Slower
result = []
for x in range(10_000):
    if x % 2 == 0:
        result.append(x ** 2)

# Faster (comprehensions are optimized in CPython)
result = [x ** 2 for x in range(10_000) if x % 2 == 0]

5. Use generators for large data

# Loads everything into memory
total = sum([x ** 2 for x in range(10_000_000)])

# Lazy — one item at a time, much lower memory
total = sum(x ** 2 for x in range(10_000_000))
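The memory difference is easy to see with sys.getsizeof: the list materializes every element up front, while the generator object stays tiny no matter how many items it will yield:

```python
import sys

big_list = [x ** 2 for x in range(1_000_000)]
lazy_gen = (x ** 2 for x in range(1_000_000))

print(f"list:      {sys.getsizeof(big_list):,} bytes")   # millions of bytes
print(f"generator: {sys.getsizeof(lazy_gen):,} bytes")   # a couple hundred bytes
```

(getsizeof reports only the container itself, not the int objects inside, so the real gap is even larger.)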

6. Local variable lookup is faster than global

import math

# Slow: looks up math.sqrt in global scope every call
def compute_slow(data):
    return [math.sqrt(x) for x in data]

# Fast: bind to a local variable
def compute_fast(data):
    sqrt = math.sqrt   # local lookup is faster
    return [sqrt(x) for x in data]

7. String formatting — f-strings are fastest

name, value = "Alice", 42

# Slowest
s = "Hello " + name + ", value is " + str(value)

# Fast
s = f"Hello {name}, value is {value}"

# Acceptable
s = "Hello {}, value is {}".format(name, value)

8. functools.lru_cache for expensive pure functions

import time
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_lookup(key):
    # Simulate an expensive computation or DB query
    time.sleep(0.1)
    return key * 2

# First call: 0.1s
# Subsequent calls: instant (cached)
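The classic demonstration is recursive Fibonacci. Without the cache the call tree is exponential; with it, each value is computed once:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(100))          # 354224848179261915075, computed instantly
print(fib.cache_info())  # hit/miss statistics for the cache
```

Caching only works for pure functions: same arguments, same result, no side effects.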

numpy for Numerical Work

If you're doing numerical computation — matrix operations, statistics, signal processing — pure Python is the wrong tool. numpy operates in C on contiguous arrays and is 10-1000x faster.

import numpy as np
import timeit

n = 1_000_000

# Python list vs numpy array — element-wise squaring
py_list  = list(range(n))
np_array = np.arange(n)

t_py = timeit.timeit(lambda: [x**2 for x in py_list], number=5)
t_np = timeit.timeit(lambda: np_array ** 2,            number=5)

print(f"Python list: {t_py:.3f}s")
print(f"Numpy array: {t_np:.4f}s")
print(f"Speedup: {t_py/t_np:.0f}x")

Output:

Python list: 2.341s
Numpy array: 0.024s
Speedup: 97x

# numpy operations are vectorized — no Python loops needed
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

print(a + b)           # [11 22 33 44 55]
print(a * b)           # [10 40 90 160 250]
print(np.dot(a, b))    # 550  (dot product)
print(np.sqrt(a))      # [1.  1.41 1.73 2.  2.24]

# 2D arrays (matrices)
matrix = np.arange(9).reshape(3, 3)
print(matrix)
# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]
print(matrix.T)        # transpose
print(matrix @ matrix) # matrix multiplication
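Vectorization extends to filtering: a boolean mask replaces the Python-level if test inside a loop:

```python
import numpy as np

a = np.arange(10)

evens = a[a % 2 == 0]        # boolean mask selects matching elements
print(evens)                 # [0 2 4 6 8]
print(int(a[a > 5].sum()))   # 30  (6 + 7 + 8 + 9)
```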

__slots__ — Reducing Memory for Many Objects

By default, Python objects store their attributes in a dictionary (__dict__). For classes where you create millions of instances, this wastes memory. __slots__ replaces the dict with a fixed set of attributes:

import sys

class PointDict:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

class PointSlots:
    __slots__ = ("x", "y", "z")

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z


pd = PointDict(1.0, 2.0, 3.0)
ps = PointSlots(1.0, 2.0, 3.0)

print(f"With __dict__:  {sys.getsizeof(pd) + sys.getsizeof(pd.__dict__)} bytes")
print(f"With __slots__: {sys.getsizeof(ps)} bytes")

# With __dict__:  232 bytes
# With __slots__:  56 bytes   — 4x smaller

For a million instances, that's roughly 232MB vs 56MB, a saving of about 176MB. Use __slots__ when you create many instances of a class with a fixed set of attributes.
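One side effect worth knowing: a __slots__ class rejects any attribute that isn't declared, which catches typos but also means you can't attach ad-hoc data. A small sketch with a hypothetical Point class:

```python
class Point:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)
try:
    p.z = 3   # not declared in __slots__
except AttributeError as exc:
    print(f"AttributeError: {exc}")
```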

Profiling in Practice: A Case Study

Let's profile a realistic slow function and fix it:

import timeit

# ── Version 1: Naive ──────────────────────────────────────────────────────────

def word_frequency_v1(texts):
    """Count word frequencies across a list of strings."""
    frequencies = {}
    for text in texts:
        words = text.lower().split()
        for word in words:
            word = "".join(c for c in word if c.isalpha())  # strip punctuation
            if word:
                if word in frequencies:
                    frequencies[word] += 1
                else:
                    frequencies[word] = 1
    return dict(sorted(frequencies.items(), key=lambda x: x[1], reverse=True))


# ── Version 2: Optimized ──────────────────────────────────────────────────────

import re
from collections import Counter

WORD_RE = re.compile(r"[a-z]+")   # compiled once at module level

def word_frequency_v2(texts):
    """Faster version using Counter and compiled regex."""
    counter = Counter()
    for text in texts:
        counter.update(WORD_RE.findall(text.lower()))
    return dict(counter.most_common())


# ── Benchmark ─────────────────────────────────────────────────────────────────

import random

# Generate test data
words = ["python", "programming", "fast", "slow", "optimize", "measure",
         "performance", "data", "code", "function", "class", "list", "dict"]
texts = [
    " ".join(random.choices(words, k=100)) + "."
    for _ in range(1_000)
]

t1 = timeit.timeit(lambda: word_frequency_v1(texts), number=10)
t2 = timeit.timeit(lambda: word_frequency_v2(texts), number=10)

print(f"v1 (naive):     {t1:.3f}s")
print(f"v2 (optimized): {t2:.3f}s")
print(f"Speedup: {t1/t2:.1f}x")

Output:

v1 (naive):     4.821s
v2 (optimized): 0.312s
Speedup: 15.5x

The optimizations:

  1. Compiled regex (WORD_RE) instead of "".join(c for c in word if c.isalpha()) in a loop
  2. Counter (implemented in C) instead of a dict with manual increment
  3. WORD_RE.findall() processes the entire string at once instead of looping character by character
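One caveat before trusting the speedup: the two versions aren't exactly equivalent. v1 strips punctuation from inside a word ("don't" becomes "dont"), while the regex in v2 splits on it ("don" and "t"). A condensed, self-contained check confirms they agree on plain text and shows where they diverge:

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z]+")

def word_frequency_v1(texts):
    frequencies = {}
    for text in texts:
        for word in text.lower().split():
            word = "".join(c for c in word if c.isalpha())
            if word:
                frequencies[word] = frequencies.get(word, 0) + 1
    return dict(sorted(frequencies.items(), key=lambda x: x[1], reverse=True))

def word_frequency_v2(texts):
    counter = Counter()
    for text in texts:
        counter.update(WORD_RE.findall(text.lower()))
    return dict(counter.most_common())

texts = ["Hello, world! Hello again.", "The world is fast."]
assert word_frequency_v1(texts) == word_frequency_v2(texts)   # same counts
print("don't ->", word_frequency_v1(["don't"]), "vs", word_frequency_v2(["don't"]))
```

Part of "measure again" is verifying the optimized version still produces the same answer on your actual data.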

Performance Checklist

Before writing a single line of optimized code:

  1. Profile first. Use cProfile to find the actual bottleneck.
  2. Check the algorithm. Is there an O(n) solution to your O(n^2) problem?
  3. Check the data structure. Replace list with set or dict for membership testing.
  4. Use built-ins. sum, max, sorted, Counter, defaultdict — all implemented in C.
  5. Move work out of loops. Compile regexes, bind methods, precompute constants.
  6. Use generators for large data. Don't build a list if you only need to iterate once.
  7. Consider numpy. For numerical arrays, it's 10-100x faster than Python loops.
  8. Cache expensive results. @lru_cache for pure functions, @functools.cache (Python 3.9+).
  9. __slots__ for classes with many instances and fixed attributes.
  10. Measure the improvement. Confirm the speedup with timeit before and after.

What You Learned in This Chapter

  • Always measure before optimizing. Guessing wastes time and often makes things worse.
  • timeit.timeit() benchmarks small code snippets accurately.
  • cProfile profiles entire programs — shows call count, total time, cumulative time per function.
  • snakeviz renders cProfile output as an interactive chart in the browser.
  • line_profiler shows timing line by line inside a function.
  • memory_profiler shows memory usage line by line.
  • Big O notation describes how runtime scales with input size. O(n^2) vs O(n) can be the difference between seconds and hours.
  • Choosing the right data structure is the highest-leverage optimization: set and dict membership is O(1), list is O(n).
  • collections.deque for fast front/back operations. collections.Counter for counting. collections.defaultdict for grouped data.
  • Practical wins: cache results, move work out of loops, compile regexes, use built-ins, use comprehensions, use generators for large data.
  • numpy is 10-100x faster than Python loops for numerical operations on arrays.
  • __slots__ cuts per-instance memory roughly 4x for classes with many instances and fixed attributes.

What's Next?

Chapter 31 covers Working with Databases — SQLite, PostgreSQL with psycopg2, and SQLAlchemy ORM. Data that outlives your program lives in a database, and knowing how to read, write, query, and migrate data is a fundamental skill for any Python developer.

© 2026 Abhilash Sahoo. Python: Zero to Hero.