Python: Zero to Hero

Chapter 56: Python Internals — How Python Really Works

You've written Python for 55 chapters. Now let's look under the hood.

Understanding how Python executes your code makes you a better developer. You'll write faster code, understand error messages more deeply, know why the GIL exists, and stop being surprised by Python's quirks. This chapter won't make you a CPython contributor — but it will make the language feel transparent instead of magical.

CPython — The Reference Implementation

Python the language is a specification. CPython is the most popular implementation of that specification — it's what you download from python.org. Other implementations exist:

Implementation   Language          Use case
CPython          C                 The standard; most compatible
PyPy             Python + RPython  Faster (JIT compiler); ~5x speedup for loops
Jython           Java              Runs on the JVM; integrates with Java libraries
MicroPython      C                 Runs on microcontrollers (Raspberry Pi Pico)
GraalPy          Java              Oracle's high-performance Python on GraalVM

Everything in this chapter refers to CPython unless stated otherwise.

How Python Runs Your Code

When you type python hello.py, four things happen:

1. Lexing       -> source code   -> tokens
2. Parsing      -> tokens        -> Abstract Syntax Tree (AST)
3. Compilation  -> AST           -> bytecode
4. Execution    -> bytecode      -> results (CPython VM)
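All four steps can be driven from Python itself: the built-in compile() takes source through lexing, parsing, and bytecode compilation, and exec() hands the resulting code object to the VM. A minimal sketch:

```python
source = "x = 1 + 2"

# Steps 1-3: lex, parse, and compile the source into a code object
code = compile(source, "<demo>", "exec")
print(type(code).__name__)   # code

# Step 4: the VM executes the bytecode
namespace = {}
exec(code, namespace)
print(namespace["x"])        # 3
```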

Step 1 — Lexing (Tokenisation)

The lexer breaks your source into tokens:

import tokenize, io

source = "x = 1 + 2"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
for tok in tokens:
    print(tok)

Output:

TokenInfo(type=1  (NAME),   string='x',   ...)
TokenInfo(type=54 (OP),     string='=',   ...)
TokenInfo(type=2  (NUMBER), string='1',   ...)
TokenInfo(type=54 (OP),     string='+',   ...)
TokenInfo(type=2  (NUMBER), string='2',   ...)

Step 2 — Parsing (AST)

The parser converts tokens into an Abstract Syntax Tree:

import ast

source = "x = 1 + 2"
tree   = ast.parse(source)
print(ast.dump(tree, indent=2))

Output:

Module(
  body=[
    Assign(
      targets=[Name(id='x')],
      value=BinOp(
        left=Constant(value=1),
        op=Add(),
        right=Constant(value=2)))],
  type_ignores=[])

The AST is a tree of node objects. Each node represents a syntactic construct: assignments, function calls, loops, conditionals.

Step 3 — Compilation (Bytecode)

The compiler converts the AST into bytecode — a sequence of simple instructions for the Python virtual machine. Use dis to see it:

import dis

def add(a, b):
    return a + b

dis.dis(add)

Output:

  2           0 RESUME          0

  3           2 LOAD_FAST       0 (a)
              4 LOAD_FAST       1 (b)
              6 BINARY_OP      0 (+)
             10 RETURN_VALUE

Each line is one bytecode instruction. The VM reads these instructions one by one and executes them.

Bytecode is cached in __pycache__/ as .pyc files so Python doesn't recompile unchanged modules:

mypackage/
└── __pycache__/
    ├── module.cpython-312.pyc
    └── utils.cpython-312.pyc
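You can trigger this caching by hand with the standard py_compile module. A sketch using a throwaway module in a temporary directory — the exact .pyc filename depends on your interpreter version:

```python
import importlib.util
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "module.py")
    with open(path, "w") as f:
        f.write("VALUE = 42\n")

    # Compile to bytecode; by default the .pyc lands in __pycache__/
    pyc = py_compile.compile(path)
    print(pyc)  # .../__pycache__/module.cpython-3XX.pyc (version-dependent)

    # py_compile uses the same cache path the import system expects
    print(pyc == importlib.util.cache_from_source(path))  # True
```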

Step 4 — Execution (CPython VM)

The CPython virtual machine is a stack-based interpreter. It maintains a call stack. Each function call pushes a new frame onto the stack. Each frame has:

  • The bytecode being executed
  • A local variables dictionary
  • A reference to the global namespace
  • A reference to the enclosing scope
  • A data stack for intermediate values

You can inspect the live frame with sys._getframe():

import sys

def outer():
    def inner():
        frame = sys._getframe()
        print(f"Function:  {frame.f_code.co_name}")
        print(f"File:      {frame.f_code.co_filename}")
        print(f"Line:      {frame.f_lineno}")
        print(f"Locals:    {frame.f_locals}")
        print(f"Caller:    {frame.f_back.f_code.co_name}")
    inner()

outer()

Code Objects — The Compiled Unit

Every function, class, and module has a code object (__code__):

def greet(name: str, times: int = 1) -> str:
    message = f"Hello, {name}!"
    return message * times

code = greet.__code__
print(code.co_name)         # greet
print(code.co_varnames)     # ('name', 'times', 'message') — local variable names
print(code.co_argcount)     # 2
print(code.co_consts)       # (None, 'Hello, ', '!') — constants
print(code.co_filename)     # path to the .py file
print(code.co_firstlineno)  # 1

Code objects are immutable. They're created at compile time and never change at runtime.
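Because code objects are immutable, "modifying" one really means building a new object. CodeType.replace() (Python 3.8+) does exactly that — a small sketch:

```python
def greet(name):
    return f"Hello, {name}!"

# replace() returns a NEW code object; the original is untouched
renamed = greet.__code__.replace(co_name="salute")

print(greet.__code__.co_name)     # greet — unchanged
print(renamed.co_name)            # salute
print(renamed is greet.__code__)  # False — a fresh object
```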

The GIL — Global Interpreter Lock

The GIL is a mutex (lock) inside CPython that allows only one thread to execute Python bytecode at a time.

This means:

  • Multithreaded Python cannot run Python code in parallel on multiple CPU cores
  • But it can do I/O in parallel (the GIL is released during I/O operations)

Why does the GIL exist?

CPython's memory management uses reference counting — every object has a counter that tracks how many references point to it:

import sys

a = [1, 2, 3]
print(sys.getrefcount(a))   # 2 (a + the argument to getrefcount)

b = a
print(sys.getrefcount(a))   # 3

del b
print(sys.getrefcount(a))   # 2

When the reference count drops to 0, the object is immediately freed. This is fast and simple — but reference counts are not thread-safe. Without the GIL, two threads could simultaneously decrement the same counter, corrupting memory.

The GIL is a single lock that protects all Python objects instead of thousands of individual locks.

What the GIL means for you

import threading

# I/O-bound: GIL is released, threads run in parallel [x]
# (database queries, HTTP requests, file reads)
def fetch_url(url):
    import urllib.request
    return urllib.request.urlopen(url).read()

urls = ["https://example.com/a", "https://example.com/b"]   # example URLs
threads = [threading.Thread(target=fetch_url, args=(url,)) for url in urls]
# These overlap in wall-clock time — the GIL is released while each
# thread waits on network I/O

# CPU-bound: GIL held, threads take turns [-]
# (number crunching, image processing, sorting)
def count_primes(n):
    return sum(1 for i in range(2, n) if all(i % j for j in range(2, i)))

# Use multiprocessing for CPU-bound work instead

Python 3.13+ — The No-GIL Build

Python 3.13 introduced an experimental "free-threaded" build that removes the GIL:

# Run the free-threaded interpreter (installed separately)
python3.13t   # the 't' suffix marks the free-threaded build

# Check if GIL is active
import sys
print(sys._is_gil_enabled())   # False in free-threaded build

The free-threaded build is experimental in 3.13 and will stabilise in future versions. For now, most production code should still use multiprocessing for CPU-bound parallelism.
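Since sys._is_gil_enabled() only exists on 3.13+, portable code should probe for it. A small sketch (the helper name gil_enabled is invented here):

```python
import sys

def gil_enabled() -> bool:
    """Return True if this interpreter runs with the GIL.

    sys._is_gil_enabled() exists only on Python 3.13+;
    on older versions the GIL is always present.
    """
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else check()

print(gil_enabled())   # True on a standard (GIL) build
```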

Memory Management

Reference Counting

Every Python object has a reference count. When it hits 0, the memory is freed immediately:

import sys

x = [1, 2, 3]
print(sys.getrefcount(x) - 1)   # 1 — subtract the temporary reference
                                # held by the getrefcount call itself

y = x
print(sys.getrefcount(x) - 1)   # 2

del y
print(sys.getrefcount(x) - 1)   # 1
# When x goes out of scope, the count hits 0 and the memory is freed

Cyclic Garbage Collector

Reference counting can't handle cycles:

import gc

class Node:
    def __init__(self, value):
        self.value = value
        self.next  = None

a = Node(1)
b = Node(2)
a.next = b
b.next = a   # cycle: a -> b -> a

del a
del b
# Reference counts are now 1 for each — never reach 0
# The cyclic GC detects and cleans this up

Python runs a cyclic garbage collector periodically. You can trigger it manually:

import gc

gc.collect()           # run all three generations
gc.collect(0)          # collect youngest generation only
print(gc.get_count())  # (gen0, gen1, gen2) object counts
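You can watch the cyclic collector reclaim a cycle by holding a weak reference to one of the nodes. A sketch — automatic collection is disabled so only our explicit gc.collect() can free the cycle:

```python
import gc
import weakref

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

gc.disable()                 # so only OUR gc.collect() frees the cycle

a = Node(1)
b = Node(2)
a.next, b.next = b, a        # cycle: a -> b -> a

watcher = weakref.ref(a)     # weak reference — does not keep a alive
del a, b

print(watcher() is None)     # False — refcounts in the cycle never hit 0
gc.collect()                 # the cyclic collector breaks the cycle
print(watcher() is None)     # True — both nodes were freed

gc.enable()
```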

Object Interning

Python reuses certain objects to save memory. Small integers (-5 to 256) and many short strings are interned — only one copy exists:

# Small integers: same object
a = 100
b = 100
print(a is b)   # True — same object

# Large integers: no guarantee of sharing
a = 1000
b = 1000
print(a is b)   # implementation-defined — typically False in the REPL,
                # though a script may share constants compiled together

# Short identifier-like strings: often interned
a = "hello"
b = "hello"
print(a is b)   # True (interned)

# Strings with spaces: not automatically interned
a = "hello world"
b = "hello world"
print(a is b)   # typically False in the REPL; a script may still reuse
                # the constant — don't rely on either result

This is why is should be reserved for genuine identity checks — in practice, x is None. Use == for all other equality comparisons.
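When you need guaranteed interning — say, for fast identity-based comparison of many repeated strings — sys.intern() forces it:

```python
import sys

a = sys.intern("hello world")
b = sys.intern("hello world")
print(a is b)   # True — both names point at the single interned copy
```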

Memory Profiling

import tracemalloc

tracemalloc.start()

# Your code here
data = [i for i in range(100_000)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")

print("Top 3 memory consumers:")
for stat in top_stats[:3]:
    print(stat)

The Import System

When you write import mymodule, Python:

  1. Checks sys.modules — if already imported, returns the cached module
  2. Finds the module file using sys.path
  3. Loads and compiles the .py file
  4. Executes the module code in a new namespace
  5. Caches the result in sys.modules

import sys

# See all currently imported modules
print(list(sys.modules.keys())[:10])

# See where Python looks for modules
print(sys.path)

# Reload a module (rarely needed)
import importlib
importlib.reload(mymodule)

__all__ — Controlling What's Exported

# mymodule.py
__all__ = ["PublicClass", "public_function"]

class PublicClass:
    pass

class _PrivateClass:   # won't be imported with "from mymodule import *"
    pass

def public_function():
    pass

def _private_function():
    pass

if __name__ == "__main__"

When Python imports a module, it sets __name__ to the module's name. When you run a file directly, __name__ is "__main__":

# mymodule.py
def useful_function():
    return 42

# This block only runs when the file is executed directly
# NOT when it's imported
if __name__ == "__main__":
    result = useful_function()
    print(f"Result: {result}")

This pattern lets a file be both a reusable module and a runnable script.

Bytecode Deep Dive

Let's trace through a slightly more complex function:

import dis

def fibonacci(n):
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return b

dis.dis(fibonacci)

Output (Python 3.12):

  2           RESUME          0

  3           LOAD_FAST       0 (n)
              LOAD_CONST      1 (1)
              COMPARE_OP      1 (<=)
              POP_JUMP_IF_FALSE ...

  4           LOAD_FAST       0 (n)
              RETURN_VALUE

  5           LOAD_CONST      2 (0)
              LOAD_CONST      1 (1)
              UNPACK_SEQUENCE 2
              STORE_FAST      1 (a)
              STORE_FAST      2 (b)
...

Key instructions:

  • LOAD_FAST — load a local variable onto the stack
  • LOAD_CONST — load a constant onto the stack
  • COMPARE_OP — compare top two stack items
  • POP_JUMP_IF_FALSE — jump if top of stack is falsy
  • STORE_FAST — pop stack and store in a local variable
  • CALL — call a function
  • RETURN_VALUE — return top of stack to caller

Understanding bytecode explains Python performance characteristics. LOAD_FAST (local variable) is faster than LOAD_GLOBAL (global variable) — that's why moving frequently accessed globals into local variables inside tight loops speeds things up.
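You can verify this with dis.get_instructions(): compare a function that reads globals against one that received the same names as default arguments (a common micro-optimisation pattern, shown here purely for illustration):

```python
import dis

data = list(range(10))

def use_globals():
    return len(data)              # len and data resolve via LOAD_GLOBAL

def use_locals(data=data, len=len):
    return len(data)              # both resolve via fast locals

ops_global = {i.opname for i in dis.get_instructions(use_globals)}
ops_local  = {i.opname for i in dis.get_instructions(use_locals)}

print("LOAD_GLOBAL" in ops_global)   # True
print("LOAD_GLOBAL" in ops_local)    # False — everything is a fast local
```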

__slots__ — Memory-Efficient Classes

By default, Python stores instance attributes in a __dict__ (a hash map):

class Regular:
    def __init__(self, x, y):
        self.x = x
        self.y = y

obj = Regular(1, 2)
print(obj.__dict__)   # {'x': 1, 'y': 2}

Each __dict__ uses ~230 bytes. For classes with millions of instances, this adds up.

__slots__ replaces the dict with fixed-size per-attribute descriptors:

class Slotted:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

obj = Slotted(1, 2)
# obj.__dict__  <- AttributeError: no __dict__
print(obj.x)    # 1

import sys

print(sys.getsizeof(Regular(1, 2)))   # ~48 bytes + ~230 for __dict__
print(sys.getsizeof(Slotted(1, 2)))   # ~56 bytes — no __dict__ overhead

Use __slots__ when you create millions of instances of a class and memory is a concern.
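__slots__ has a side effect worth knowing: because there is no __dict__, assigning an attribute that isn't listed raises AttributeError — which conveniently also catches typos:

```python
class Point:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)
try:
    p.z = 3                     # not declared in __slots__
except AttributeError as exc:
    print(f"AttributeError: {exc}")
```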

__getattr__ vs __getattribute__

These two look similar but behave very differently:

class Demo:
    def __init__(self):
        self.exists = "I exist"

    def __getattr__(self, name: str):
        # Called ONLY when normal attribute lookup fails
        print(f"__getattr__: {name} not found")
        return f"default_{name}"

    def __getattribute__(self, name: str):
        # Called for EVERY attribute access — be careful
        print(f"__getattribute__: accessing {name}")
        return super().__getattribute__(name)


d = Demo()
print(d.exists)        # __getattribute__ called -> "I exist"
print(d.missing)       # __getattribute__ called -> not found -> __getattr__ called

__getattr__ — safe to override. Use for proxy objects, lazy loading, dynamic attributes.

__getattribute__ — called for every attribute access. Override it only if you absolutely must — it's easy to create infinite recursion. Always call super().__getattribute__(name).
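A typical safe use of __getattr__ is lazy loading with caching: the attribute is computed on first access and then stored, so __getattr__ never fires for it again. A sketch — the LazyConfig class and its "loaded:" values are invented for illustration:

```python
class LazyConfig:
    def __getattr__(self, name):
        # Runs only when normal lookup fails — i.e. on the first access
        print(f"computing {name}")
        value = f"loaded:{name}"          # stand-in for real work
        setattr(self, name, value)        # cache it in __dict__
        return value

cfg = LazyConfig()
print(cfg.database_url)   # computing database_url, then loaded:database_url
print(cfg.database_url)   # served from __dict__ — no "computing" line
```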

AST Manipulation — Code That Reads Code

Python's AST can be inspected and transformed:

import ast


def find_function_names(source: str) -> list[str]:
    """Extract all function names defined in source code."""
    tree  = ast.parse(source)
    names = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            names.append(node.name)
    return names


source = """
def greet(name):
    return f"Hello, {name}"

def farewell(name):
    return f"Goodbye, {name}"

class MyClass:
    def method(self):
        pass
"""

print(find_function_names(source))
# ['greet', 'farewell', 'method']

def count_operations(source: str) -> dict[str, int]:
    """Count each type of binary operation in source."""
    tree  = ast.parse(source)
    ops   = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.BinOp):
            op_name = type(node.op).__name__
            ops[op_name] = ops.get(op_name, 0) + 1
    return ops


source = "result = (a + b) * (c - d) / (e + f)"
print(count_operations(source))   # {'Add': 2, 'Mult': 1, 'Sub': 1, 'Div': 1}

AST manipulation is used by linters (ruff, flake8), formatters (black), type checkers (mypy), and code analysis tools.
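Transformation works the same way: subclass ast.NodeTransformer, rewrite the nodes you care about, then compile the modified tree. A toy transformer (invented here) that turns every addition into a subtraction:

```python
import ast

class AddToSub(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)          # transform children first
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node

tree = ast.parse("result = 10 + 3")
tree = AddToSub().visit(tree)
ast.fix_missing_locations(tree)           # new/changed nodes need positions

namespace = {}
exec(compile(tree, "<ast-demo>", "exec"), namespace)
print(namespace["result"])                # 7 — the + became a -
```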

What You Learned in This Chapter

  • CPython is the reference implementation. PyPy is faster via JIT. MicroPython runs on microcontrollers.
  • Python executes your code in four steps: Lex (source -> tokens) -> Parse (tokens -> AST) -> Compile (AST -> bytecode) -> Execute (bytecode -> results via CPython VM).
  • Use tokenize to see tokens, ast.parse() + ast.dump() to see the AST, dis.dis() to see bytecode.
  • Bytecode is cached in __pycache__/*.pyc files. The VM is stack-based.
  • Each function call creates a frame with local variables, the code object, and a data stack. sys._getframe() inspects the current frame.
  • Every object has a reference count. When it hits 0, the object is freed immediately. The cyclic GC handles reference cycles.
  • The GIL allows only one thread to execute Python bytecode at a time. I/O releases the GIL, so threads work well for I/O-bound work. Use multiprocessing for CPU-bound parallelism. Python 3.13 introduced an experimental no-GIL build.
  • Small integers (-5 to 256) and short strings are interned — shared single objects. Use == for equality, is only for None/True/False.
  • __slots__ replaces __dict__ with fixed descriptors, saving 200+ bytes per instance.
  • __getattr__ fires only when normal lookup fails (safe to override). __getattribute__ fires on every access (override with extreme care).
  • The ast module lets you parse, inspect, and transform Python source code programmatically.

What's Next?

Chapter 57 covers The Python Ecosystem Map — a tour of the major libraries and frameworks in every domain: web, data science, machine learning, automation, DevOps, finance, games, and embedded systems. Think of it as a roadmap for wherever you want to go next.

© 2026 Abhilash Sahoo. Python: Zero to Hero.