Chapter 33: Data Science with Python
Python became the language of data science for one reason: the ecosystem. numpy for fast array math. pandas for working with tabular data. matplotlib and seaborn for visualization. scikit-learn for machine learning. These five libraries, used together, let you go from raw data to trained model to chart in a few hundred lines of code.
This chapter teaches you each one, then shows how they work together in a complete data analysis workflow.
The Data Science Workflow
Every data science project follows the same steps:
1. Load data — read CSV, database, API, JSON
2. Explore — understand shape, types, missing values, distributions
3. Clean — fix missing values, wrong types, duplicates, outliers
4. Analyze — compute statistics, find patterns, test hypotheses
5. Visualize — charts that reveal what numbers hide
6. Model — train a machine learning model (if needed)
7. Evaluate — measure how good the model is
8. Communicate — share findings clearly
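The first four steps can be sketched in a few lines of pandas. The tiny inline DataFrame here is a hypothetical stand-in for a real CSV load:

```python
import pandas as pd

# 1. Load: a hypothetical mini-dataset standing in for pd.read_csv("orders.csv")
df = pd.DataFrame({"amount": [10.0, None, 25.0, 25.0, -5.0]})

# 2. Explore: shape and missing values
print(df.shape, df.isnull().sum().to_dict())

# 3. Clean: drop nulls, duplicates, and impossible values
df = df.dropna().drop_duplicates()
df = df[df["amount"] > 0]

# 4. Analyze (steps 5-8 are covered later in the chapter)
print(f"mean amount: {df['amount'].mean():.2f}")
```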
numpy — Fast Array Mathematics
numpy is the foundation of the Python data science stack. It provides the ndarray — a typed, fixed-size, multidimensional array stored in contiguous memory. Operations on numpy arrays run in C — 10-100x faster than Python loops.
pip install numpy
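You can check the speedup claim yourself with a quick, unscientific benchmark; exact timings will vary by machine:

```python
import time
import numpy as np

n = 1_000_000
xs = list(range(n))
arr = np.arange(n)

t0 = time.perf_counter()
loop_sum = sum(x * x for x in xs)   # pure-Python loop
t1 = time.perf_counter()
vec_sum = int((arr * arr).sum())    # vectorized numpy, same computation
t2 = time.perf_counter()

print(f"loop: {t1 - t0:.4f}s  numpy: {t2 - t1:.4f}s")
```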
Creating arrays
import numpy as np
# From a list
a = np.array([1, 2, 3, 4, 5])
print(a) # [1 2 3 4 5]
print(a.dtype) # int64
print(a.shape) # (5,)
print(a.ndim) # 1
# Specify dtype
b = np.array([1.0, 2.0, 3.0], dtype=np.float32)
# Built-in constructors
zeros = np.zeros(5) # [0. 0. 0. 0. 0.]
ones = np.ones((3, 4)) # 3x4 matrix of 1s
empty = np.empty((2, 2)) # uninitialized (fast)
rng = np.arange(0, 10, 2) # [0 2 4 6 8]
linsp = np.linspace(0, 1, 5) # [0. 0.25 0.5 0.75 1. ]
eye = np.eye(3) # 3x3 identity matrix
rand = np.random.rand(3, 3) # 3x3 random floats [0, 1)
randn = np.random.randn(3, 3) # 3x3 standard normal
randint = np.random.randint(0, 10, (3, 3)) # 3x3 random ints
Vectorized operations
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])
# Element-wise — no loops needed
print(a + b) # [11 22 33 44 55]
print(a * b) # [10 40 90 160 250]
print(a ** 2) # [ 1 4 9 16 25]
print(np.sqrt(a)) # [1. 1.41 1.73 2. 2.24]
# Broadcasting — operate on arrays of different shapes
matrix = np.ones((3, 4))
row = np.array([1, 2, 3, 4])
print(matrix + row) # adds row to each row of matrix
col = np.array([[1], [2], [3]])
print(matrix + col) # adds col to each column
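The broadcasting rule behind these examples: shapes are compared right-to-left, and two dimensions are compatible if they are equal or one of them is 1. A small sketch, including a shape mismatch:

```python
import numpy as np

a = np.ones((3, 4))              # shape (3, 4)
b = np.arange(4)                 # shape (4,)   -> treated as (1, 4)
c = np.arange(3).reshape(3, 1)   # shape (3, 1) -> stretched to (3, 4)
print((a + b).shape)             # (3, 4)
print((a + c).shape)             # (3, 4)

# Incompatible trailing dimensions raise ValueError
try:
    a + np.arange(3)             # (3,) cannot align with trailing dim 4
except ValueError as e:
    print("broadcast error:", e)
```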
Indexing, slicing, and masking
a = np.array([10, 20, 30, 40, 50])
print(a[0]) # 10
print(a[-1]) # 50
print(a[1:4]) # [20 30 40]
print(a[::2]) # [10 30 50]
# 2D indexing
m = np.arange(12).reshape(3, 4)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
print(m[1, 2]) # 6 (row 1, col 2)
print(m[:, 1]) # [1 5 9] (all rows, col 1)
print(m[0, :]) # [0 1 2 3] (row 0, all cols)
print(m[1:, 2:]) # [[6 7] [10 11]] (submatrix)
# Boolean masking
a = np.array([1, -2, 3, -4, 5])
mask = a > 0
print(mask) # [ True False True False True]
print(a[mask]) # [1 3 5]
print(a[a > 0]) # same thing — inline mask
a[a < 0] = 0 # set negatives to zero in-place
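Closely related to boolean masking is np.where, which selects from two values based on a condition without mutating the original array:

```python
import numpy as np

a = np.array([1, -2, 3, -4, 5])
# Replace negatives with zero, leaving a unchanged
clipped = np.where(a < 0, 0, a)
print(clipped)           # [1 0 3 0 5]
# Indices where the condition holds
idx = np.where(a < 0)[0]
print(idx)               # [1 3]
```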
Aggregation and statistics
data = np.array([4, 7, 13, 2, 9, 1, 5, 8])
print(np.sum(data)) # 49
print(np.mean(data)) # 6.125
print(np.median(data)) # 6.0
print(np.std(data)) # 3.69
print(np.var(data)) # 13.61
print(np.min(data)) # 1
print(np.max(data)) # 13
print(np.argmin(data)) # 5 (index of minimum)
print(np.argmax(data)) # 2 (index of maximum)
print(np.percentile(data, [25, 50, 75])) # quartiles
# Axis-wise for 2D
m = np.array([[1, 2, 3], [4, 5, 6]])
print(np.sum(m, axis=0)) # [5 7 9] — sum each column
print(np.sum(m, axis=1)) # [ 6 15] — sum each row
Linear algebra
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A @ B) # matrix multiplication
print(A.T) # transpose
print(np.linalg.det(A)) # determinant: -2.0
print(np.linalg.inv(A)) # inverse
eigenvalues, eigenvectors = np.linalg.eig(A)
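np.linalg can also solve a linear system Ax = b directly, which is generally faster and more numerically stable than computing the inverse and multiplying. A minimal sketch using the same A as above:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])
x = np.linalg.solve(A, b)      # solves Ax = b
print(x)                       # [-4.   4.5]
print(np.allclose(A @ x, b))   # True
```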
pandas — Tabular Data
pandas adds two data structures on top of numpy: Series (1D) and DataFrame (2D table). A DataFrame is like a spreadsheet or database table — columns have names and types, rows have an index.
pip install pandas
Creating DataFrames
import pandas as pd
# From a dict
df = pd.DataFrame({
"name": ["Alice", "Bob", "Carlos", "Diana"],
"age": [30, 25, 35, 28],
"salary": [95000, 82000, 110000, 78000],
"dept": ["Eng", "HR", "Eng", "Sales"],
})
print(df)
# name age salary dept
# 0 Alice 30 95000 Eng
# 1 Bob 25 82000 HR
# 2 Carlos 35 110000 Eng
# 3 Diana 28 78000 Sales
print(df.shape) # (4, 4)
print(df.dtypes)
# name object
# age int64
# salary int64
# dept object
Loading data from files
# CSV — most common
df = pd.read_csv("data.csv")
df = pd.read_csv("data.csv", sep=";", encoding="utf-8", index_col="id")
# Excel
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
# JSON
df = pd.read_json("data.json")
# From a SQL query
import sqlite3
conn = sqlite3.connect("myapp.db")
df = pd.read_sql("SELECT * FROM users", conn)
# Save back to file
df.to_csv("output.csv", index=False)
df.to_excel("output.xlsx", index=False)
df.to_json("output.json", orient="records")
Exploring data
df = pd.read_csv("sales.csv")
print(df.head()) # first 5 rows
print(df.tail(3)) # last 3 rows
print(df.shape) # (rows, columns)
print(df.columns.tolist()) # column names
print(df.dtypes) # column types
print(df.info()) # shape + types + null counts
print(df.describe()) # count, mean, std, min, quartiles, max
# Missing values
print(df.isnull().sum()) # null count per column
print(df.isnull().mean()) # null fraction per column
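Two more exploration helpers worth knowing are value_counts and nunique, shown here on a small inline DataFrame rather than the CSV:

```python
import pandas as pd

df = pd.DataFrame({"dept": ["Eng", "HR", "Eng", "Sales", "Eng"]})
print(df["dept"].value_counts())   # frequency of each category
# Eng      3
# HR       1
# Sales    1
print(df["dept"].nunique())        # 3 distinct values
```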
Selecting data
# Select a column — returns a Series
names = df["name"]
# Select multiple columns — returns a DataFrame
subset = df[["name", "salary"]]
# Select rows by position (like list slicing)
first_three = df.iloc[0:3]
last_row = df.iloc[-1]
cell = df.iloc[0, 1] # row 0, column 1
# Select rows by label/condition (loc)
row = df.loc[0] # row with index 0
eng_staff = df.loc[df["dept"] == "Eng"]
# Boolean filtering
high_earners = df[df["salary"] > 90000]
eng_seniors = df[(df["dept"] == "Eng") & (df["age"] > 30)]
hr_or_sales = df[df["dept"].isin(["HR", "Sales"])]
Cleaning data
# Drop rows with any null values
df_clean = df.dropna()
# Drop rows where specific column is null
df_clean = df.dropna(subset=["email"])
# Fill null values
df["age"] = df["age"].fillna(df["age"].median())
df["notes"] = df["notes"].fillna("")
# Drop duplicate rows
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["email"]) # deduplicate by email
# Rename columns
df = df.rename(columns={"dept": "department", "salary": "annual_salary"})
# Change data types
df["age"] = df["age"].astype(int)
df["date"] = pd.to_datetime(df["date"])
df["salary"] = pd.to_numeric(df["salary"], errors="coerce") # NaN for bad values
# Strip whitespace from string columns
df["name"] = df["name"].str.strip()
df["name"] = df["name"].str.title() # Title Case
# Remove outliers (beyond 3 standard deviations)
col = df["salary"]
mask = (col - col.mean()).abs() <= 3 * col.std()
df = df[mask]
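The 3-standard-deviation rule assumes roughly normal data. An alternative that is robust to skewed distributions is the interquartile-range (IQR) fence, sketched here on a hypothetical salary column:

```python
import pandas as pd

df = pd.DataFrame({"salary": [70, 72, 75, 78, 80, 500]})
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
# Keep values within 1.5 * IQR of the quartiles (Tukey's fences)
fence = (df["salary"] >= q1 - 1.5 * iqr) & (df["salary"] <= q3 + 1.5 * iqr)
df = df[fence]
print(df["salary"].tolist())   # the 500 outlier is dropped
```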
Transforming data
# Add a computed column
df["monthly_salary"] = df["salary"] / 12
df["senior"] = df["age"] >= 35
# Apply a function to a column
df["name_upper"] = df["name"].str.upper()
df["tax"] = df["salary"].apply(lambda s: s * 0.3 if s > 100000 else s * 0.2)
# Apply a function to every cell
df_normalized = df[["age", "salary"]].apply(
lambda col: (col - col.min()) / (col.max() - col.min())
)
# Map values
dept_map = {"Eng": "Engineering", "HR": "Human Resources", "Sales": "Sales"}
df["dept_full"] = df["dept"].map(dept_map)
Grouping and aggregation
# Group by one column
dept_stats = df.groupby("dept").agg({
"salary": ["mean", "min", "max", "count"],
"age": "mean",
})
print(dept_stats)
# Rename the multi-level columns
dept_stats.columns = ["avg_salary", "min_salary", "max_salary", "headcount", "avg_age"]
# Group by multiple columns
df.groupby(["dept", "senior"])["salary"].mean()
# Pivot table
pivot = df.pivot_table(
    values="salary",
    index="dept",
    columns="senior",
    aggfunc="mean",
)
Sorting and ranking
df_sorted = df.sort_values("salary", ascending=False)
df_sorted = df.sort_values(["dept", "salary"], ascending=[True, False])
df["salary_rank"] = df["salary"].rank(ascending=False).astype(int)
Merging DataFrames
users = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carlos"]})
orders = pd.DataFrame({"user_id": [1, 1, 2], "amount": [50, 30, 120]})
# Inner join — only rows that match in both
merged = pd.merge(users, orders, left_on="id", right_on="user_id")
# Left join — all rows from left, NaN where no match in right
merged = pd.merge(users, orders, left_on="id", right_on="user_id", how="left")
# Stack DataFrames vertically
combined = pd.concat([df1, df2], ignore_index=True)
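When joins are a common source of bugs, two merge options help catch them: indicator adds a column showing where each row came from, and validate raises if the join cardinality is not what you expect:

```python
import pandas as pd

users = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carlos"]})
orders = pd.DataFrame({"user_id": [1, 1, 2], "amount": [50, 30, 120]})

merged = pd.merge(
    users, orders,
    left_on="id", right_on="user_id",
    how="left",
    indicator=True,          # adds _merge: "both", "left_only", or "right_only"
    validate="one_to_many",  # raises MergeError if "id" is not unique in users
)
print(merged["_merge"].tolist())   # Carlos has no orders -> "left_only"
```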
matplotlib — Data Visualization
matplotlib is Python's core plotting library. Every other visualization library builds on top of it.
pip install matplotlib seaborn
import matplotlib.pyplot as plt
import numpy as np
# Line plot
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)
plt.figure(figsize=(10, 4))
plt.plot(x, y, color="steelblue", linewidth=2, label="sin(x)")
plt.plot(x, np.cos(x), color="coral", linestyle="--", label="cos(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Sine and Cosine")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("trig.png", dpi=150)
plt.show()
Common chart types
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
# Bar chart
depts = ["Eng", "HR", "Sales"]
counts = [15, 8, 12]
axes[0, 0].bar(depts, counts, color="steelblue")
axes[0, 0].set_title("Headcount by Department")
# Horizontal bar
axes[0, 1].barh(depts, counts, color="coral")
axes[0, 1].set_title("Headcount (horizontal)")
# Histogram
data = np.random.normal(70000, 15000, 500)
axes[0, 2].hist(data, bins=30, color="mediumpurple", edgecolor="white")
axes[0, 2].set_title("Salary Distribution")
axes[0, 2].set_xlabel("Salary ($)")
# Scatter plot
x = np.random.rand(100)
y = x * 2 + np.random.randn(100) * 0.3
axes[1, 0].scatter(x, y, alpha=0.6, c="steelblue", edgecolors="white")
axes[1, 0].set_title("Scatter Plot")
# Pie chart
sizes = [45, 25, 30]
labels = ["Eng", "HR", "Sales"]
axes[1, 1].pie(sizes, labels=labels, autopct="%1.1f%%", startangle=90)
axes[1, 1].set_title("Department Share")
# Box plot
data_by_group = [np.random.normal(loc, 5, 50) for loc in [60, 70, 80]]
axes[1, 2].boxplot(data_by_group, labels=["Q1", "Q2", "Q3"])
axes[1, 2].set_title("Score by Quarter")
plt.tight_layout()
plt.savefig("charts.png", dpi=150)
plt.show()
seaborn — statistical visualization
seaborn builds on matplotlib with a better default style and higher-level statistical charts:
import seaborn as sns
import pandas as pd
import numpy as np
# Use the built-in tips dataset for demonstration
tips = sns.load_dataset("tips")
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Distribution plot
sns.histplot(tips["total_bill"], kde=True, ax=axes[0, 0])
axes[0, 0].set_title("Bill Distribution")
# Scatter with regression line
sns.regplot(x="total_bill", y="tip", data=tips, ax=axes[0, 1])
axes[0, 1].set_title("Tip vs Bill")
# Box plot by category
sns.boxplot(x="day", y="total_bill", data=tips, ax=axes[1, 0])
axes[1, 0].set_title("Bills by Day")
# Heatmap — correlation matrix
numeric_cols = tips.select_dtypes(include="number")
corr = numeric_cols.corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", ax=axes[1, 1])
axes[1, 1].set_title("Correlation Matrix")
plt.tight_layout()
plt.savefig("seaborn_charts.png", dpi=150)
plt.show()
scikit-learn — Machine Learning
scikit-learn provides a consistent API for dozens of machine learning algorithms. Every model follows the same pattern: fit(), predict(), score().
pip install scikit-learn
The scikit-learn API
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# 1. Prepare data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 14.2, 15.9])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# 2. Train model
model = LinearRegression()
model.fit(X_train, y_train)
# 3. Predict
y_pred = model.predict(X_test)
# 4. Evaluate
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}")
print(f"R^2: {r2:.3f}")
print(f"Coefficient: {model.coef_[0]:.3f}")
print(f"Intercept: {model.intercept_:.3f}")
A classification example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Load built-in dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test) # use same scaler — never fit on test data
# Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Cross-validation — more robust estimate of performance
cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
xticklabels=iris.target_names,
yticklabels=iris.target_names)
plt.title("Confusion Matrix")
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.tight_layout()
plt.savefig("confusion_matrix.png")
Pipelines — the right way to combine steps
A Pipeline chains preprocessing and modeling steps, preventing data leakage and making deployment clean:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
# Suppose our features have mixed types
numeric_features = ["age", "salary"]
categorical_features = ["dept", "location"]
# Preprocessing for each type
numeric_transformer = Pipeline([("scaler", StandardScaler())])
categorical_transformer = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])
preprocessor = ColumnTransformer([
("num", numeric_transformer, numeric_features),
("cat", categorical_transformer, categorical_features),
])
# Full pipeline: preprocess -> model
full_pipeline = Pipeline([
("preprocessor", preprocessor),
("classifier", GradientBoostingClassifier(random_state=42)),
])
# Hyperparameter search over the whole pipeline
param_grid = {
"classifier__n_estimators": [50, 100, 200],
"classifier__learning_rate": [0.05, 0.1, 0.2],
"classifier__max_depth": [3, 4, 5],
}
grid_search = GridSearchCV(
full_pipeline,
param_grid,
cv=5,
scoring="accuracy",
n_jobs=-1, # use all CPU cores
verbose=1,
)
# grid_search.fit(X_train, y_train)
# print(f"Best params: {grid_search.best_params_}")
# print(f"Best score: {grid_search.best_score_:.3f}")
Project: End-to-End Sales Analysis
Let's tie everything together in a complete analysis workflow:
"""
sales_analysis.py — End-to-end data analysis of sales data.
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from pathlib import Path
# ── 1. Generate synthetic data ────────────────────────────────────────────────
np.random.seed(42)
rng = pd.date_range("2024-01-01", "2024-12-31", freq="D")
n = 500
data = pd.DataFrame({
"date": np.random.choice(rng, n),
"product": np.random.choice(["Laptop", "Phone", "Tablet", "Watch"], n,
p=[0.2, 0.4, 0.3, 0.1]),
"region": np.random.choice(["North", "South", "East", "West"], n),
"units": np.random.randint(1, 20, n),
"unit_price": np.random.choice([999.99, 699.99, 499.99, 299.99], n,
p=[0.2, 0.4, 0.3, 0.1]),
"rep": np.random.choice([f"Rep{i:02d}" for i in range(1, 11)], n),
})
# Inject some missing values and noise
data.loc[np.random.choice(n, 20), "unit_price"] = np.nan
data.loc[np.random.choice(n, 5), "units"] = -1 # bad data
# ── 2. Clean ──────────────────────────────────────────────────────────────────
# Remove impossible values
data = data[data["units"] > 0]
# Fill missing prices with median per product
data["unit_price"] = data.groupby("product")["unit_price"].transform(
lambda x: x.fillna(x.median())
)
# Add computed columns
data["revenue"] = data["units"] * data["unit_price"]
data["month"] = pd.to_datetime(data["date"]).dt.month
data["quarter"] = pd.to_datetime(data["date"]).dt.quarter
print(f"Loaded {len(data):,} records after cleaning")
print(f"Total revenue: ${data['revenue'].sum():,.0f}\n")
# ── 3. Analyze ────────────────────────────────────────────────────────────────
# Revenue by product
by_product = (
data.groupby("product")["revenue"]
.agg(["sum", "mean", "count"])
.rename(columns={"sum": "total", "mean": "avg_order", "count": "orders"})
.sort_values("total", ascending=False)
)
print("Revenue by product:")
for product, row in by_product.iterrows():
print(f" {product:<8} ${row['total']:>10,.0f} ({row['orders']} orders)")
# Revenue by region
by_region = data.groupby("region")["revenue"].sum().sort_values(ascending=False)
# Monthly trend
monthly = data.groupby("month")["revenue"].sum()
# Top sales reps
top_reps = data.groupby("rep")["revenue"].sum().nlargest(5)
# ── 4. Visualize ──────────────────────────────────────────────────────────────
sns.set_theme(style="whitegrid", palette="muted")
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle("2024 Sales Dashboard", fontsize=16, fontweight="bold")
# Revenue by product (bar)
by_product["total"].plot(kind="bar", ax=axes[0, 0], color="steelblue", edgecolor="white")
axes[0, 0].set_title("Total Revenue by Product")
axes[0, 0].set_xlabel("")
axes[0, 0].yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f"${x/1e3:.0f}K"))
axes[0, 0].tick_params(axis="x", rotation=0)
# Revenue by region (horizontal bar)
by_region.plot(kind="barh", ax=axes[0, 1], color="coral", edgecolor="white")
axes[0, 1].set_title("Revenue by Region")
axes[0, 1].xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f"${x/1e3:.0f}K"))
# Monthly revenue trend (line)
monthly.plot(kind="line", ax=axes[1, 0], marker="o", color="mediumpurple", linewidth=2)
axes[1, 0].set_title("Monthly Revenue")
axes[1, 0].set_xlabel("Month")
axes[1, 0].yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f"${x/1e3:.0f}K"))
# Top reps (bar)
top_reps.plot(kind="bar", ax=axes[1, 1], color="mediumseagreen", edgecolor="white")
axes[1, 1].set_title("Top 5 Sales Reps")
axes[1, 1].set_xlabel("")
axes[1, 1].yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f"${x/1e3:.0f}K"))
axes[1, 1].tick_params(axis="x", rotation=30)
plt.tight_layout()
plt.savefig("sales_dashboard.png", dpi=150, bbox_inches="tight")
plt.show()
print("Dashboard saved to sales_dashboard.png")
# ── 5. Simple forecast with scikit-learn ──────────────────────────────────────
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
# Use month as a feature to predict revenue
X = monthly.index.values.reshape(-1, 1)
y = monthly.values
model = LinearRegression()
model.fit(X, y)
# Predict next 3 months (months 13, 14, 15)
next_months = np.array([[13], [14], [15]])
forecast = model.predict(next_months)
print("\nForecast for next 3 months:")
for month, pred in zip([13, 14, 15], forecast):
print(f" Month {month}: ${pred:,.0f}")
# MAPE on training data
mape = mean_absolute_percentage_error(y, model.predict(X))
print(f"\nTraining MAPE: {mape:.1%}")
The Data Science Toolkit Summary
| Library | Purpose | Key objects |
|---|---|---|
| numpy | Fast array math | ndarray |
| pandas | Tabular data | DataFrame, Series |
| matplotlib | Low-level plotting | Figure, Axes |
| seaborn | Statistical visualization | High-level wrappers |
| scikit-learn | Machine learning | Pipeline, fit/predict/score |
| scipy | Scientific computing | Statistics, optimization, signal |
| statsmodels | Statistical modeling | Regression, time series |
| plotly | Interactive charts | Web-ready visualizations |
| jupyter | Interactive notebooks | .ipynb files |
What You Learned in This Chapter
- The data science workflow: load -> explore -> clean -> analyze -> visualize -> model -> evaluate.
- numpy provides the `ndarray` — vectorized operations run in C, 10-100x faster than Python loops. Key tools: broadcasting, boolean masking, `reshape`, `argmin`/`argmax`, `np.linalg`.
- pandas `DataFrame` is the workhorse for tabular data: `read_csv`/`to_csv`, `head`/`info`/`describe`, `iloc`/`loc`, boolean filtering, `fillna`/`dropna`/`drop_duplicates`, `groupby`/`agg`, `merge`/`concat`, `apply`/`map`.
- Never `fit` a scaler or encoder on test data — only on training data.
- matplotlib — `plt.figure`, `plt.plot`/`bar`/`hist`/`scatter`/`pie`, `axes.set_title`/`xlabel`/`ylabel`, `plt.savefig`.
- seaborn — `histplot`, `regplot`, `boxplot`, `heatmap` — higher-level, better defaults.
- scikit-learn — `train_test_split`, `StandardScaler`, `fit`/`predict`/`score`, `classification_report`, `cross_val_score`, `Pipeline`, `GridSearchCV`.
- A `Pipeline` chains preprocessing and modeling — prevents data leakage and makes deployment clean.
What's Next?
Chapter 34 covers Packaging and Distribution — pyproject.toml, building installable packages, publishing to PyPI, and setting up a professional Python project that others can install with pip install yourpackage.