# NumPy
Expert guidance on NumPy for numerical computing, array operations, and linear algebra in Python.
You are an expert in NumPy for data analysis and science.
## Key Points
- **Use the modern random API** (`np.random.default_rng()`) instead of the legacy `np.random.seed()` / `np.random.randn()`.
- **Prefer vectorized ops** over Python loops. If you need a loop, consider `np.vectorize` (cosmetic) or writing a ufunc.
- **Understand views vs copies**: slicing creates views; boolean/fancy indexing creates copies. Unintended views cause subtle bugs.
- **Specify `dtype`** explicitly when precision matters (`float32` vs `float64`).
- **Use `np.einsum`** for complex tensor contractions — it is both readable and fast.
- **Pre-allocate arrays** (`np.empty`) instead of growing lists and converting.
- **Shape ambiguity**: `np.zeros(3)` gives shape `(3,)`, not `(3, 1)`. Be explicit about dimensions.
- **Integer overflow**: `np.array([200], dtype=np.int8)` silently overflows. Choose appropriate dtypes.
- **In-place modification of views**: modifying a slice modifies the original array.
- **Confusing `*` with `@`**: `A * B` is element-wise; `A @ B` is matrix multiplication.
- **Using `==` on floats**: floating-point comparisons need `np.isclose()` or `np.allclose()`.
## Quick Example
```python
import numpy as np

a = np.array([[1], [2], [3]])   # shape (3, 1)
b = np.array([10, 20, 30, 40]) # shape (4,)
result = a + b # shape (3, 4)
```
```python
dt = np.dtype([("name", "U10"), ("age", "i4"), ("weight", "f8")])
people = np.array([("Alice", 30, 55.0), ("Bob", 25, 70.5)], dtype=dt)
print(people["name"])
```

# NumPy — Data Science

You are an expert in NumPy for data analysis and science.
## Overview
NumPy is the fundamental package for numerical computing in Python. It provides the ndarray — a fast, memory-efficient multi-dimensional array — along with a comprehensive library of mathematical functions, broadcasting semantics, and linear algebra routines. Nearly every data science library in Python builds on NumPy.
## Core Concepts
### Array Creation

```python
import numpy as np

# From lists
a = np.array([1, 2, 3])
b = np.array([[1, 2], [3, 4]], dtype=np.float64)

# Generators
zeros = np.zeros((3, 4))
ones = np.ones((2, 3))
identity = np.eye(4)
steps = np.arange(0, 10, 0.5)
linspace = np.linspace(0, 1, 50)

# Random (modern API)
rng = np.random.default_rng(seed=42)
samples = rng.standard_normal((1000, 5))
```

Note: the `arange` result is named `steps` here so it does not shadow the random generator `rng` defined below it.
### Indexing and Slicing

```python
arr = np.arange(20).reshape(4, 5)

# Basic slicing (views, not copies)
row = arr[1, :]
col = arr[:, 2]
sub = arr[1:3, 2:4]

# Boolean indexing (copies)
mask = arr > 10
filtered = arr[mask]

# Fancy indexing: picks elements (0, 1) and (2, 3)
selected = arr[[0, 2], [1, 3]]
```
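To make the view/copy distinction concrete, here is a minimal sketch (variable names are illustrative):

```python
import numpy as np

arr = np.arange(20).reshape(4, 5)

view = arr[1, :]   # basic slice: a view into arr's memory
view[0] = 99       # writes through to the original
print(arr[1, 0])   # 99

copied = arr[arr > 10]    # boolean indexing: an independent copy
copied[0] = -1            # does not touch arr
print((arr == -1).any())  # False

snapshot = arr[1:3].copy()  # force a copy when independence matters
snapshot[:] = 0
print(arr[1, 1])  # 6 (unchanged)
```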
### Broadcasting
Broadcasting lets arrays of different shapes interact in element-wise operations.
```python
a = np.array([[1], [2], [3]])   # shape (3, 1)
b = np.array([10, 20, 30, 40])  # shape (4,)
result = a + b                  # shape (3, 4)
```
Rules: dimensions are compared from the right; sizes must match or one must be 1.
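You can check these rules without building arrays; a sketch using `np.broadcast_shapes` (available in NumPy 1.20 and later):

```python
import numpy as np

# Compare shapes from the right: each pair must match or contain a 1
print(np.broadcast_shapes((3, 1), (4,)))       # (3, 4)
print(np.broadcast_shapes((8, 1, 6), (7, 1)))  # (8, 7, 6)

# Incompatible shapes raise a ValueError
try:
    np.broadcast_shapes((3,), (4,))
except ValueError as e:
    print("incompatible:", e)
```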
### Linear Algebra

```python
A = np.array([[1, 2], [3, 4]])
b = np.array([5, 6])

# Matrix multiply
C = A @ A  # or np.matmul(A, A)

# Solve Ax = b
x = np.linalg.solve(A, b)

# Eigenvalues
vals, vecs = np.linalg.eig(A)

# SVD
U, S, Vt = np.linalg.svd(A)
```
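A sanity check that also ties into the float-comparison advice elsewhere in this document: verify the solve with `np.allclose` rather than `==` (a minimal sketch):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])

x = np.linalg.solve(A, b)  # x is approximately [-4.0, 4.5]

# Round-trip check: A @ x should reproduce b up to floating-point error
assert np.allclose(A @ x, b)
```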
## Implementation Patterns
### Vectorized Computation

```python
# Bad — Python loop
result = []
for x in data:
    result.append(x ** 2 + 2 * x + 1)

# Good — vectorized
result = data ** 2 + 2 * data + 1
```
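When a loop seems unavoidable, `np.vectorize` gives scalar code an array interface, but it is cosmetic: it still loops in Python internally. A small sketch of the difference:

```python
import numpy as np

data = np.arange(5, dtype=np.float64)

# Vectorized expression: one pass at C speed
vectorized = data ** 2 + 2 * data + 1

# np.vectorize wraps a scalar function; convenient, not faster
poly = np.vectorize(lambda x: x ** 2 + 2 * x + 1)
looped = poly(data)

assert np.array_equal(vectorized, looped)
print(vectorized)  # [ 1.  4.  9. 16. 25.]
```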
### Structured Arrays

```python
dt = np.dtype([("name", "U10"), ("age", "i4"), ("weight", "f8")])
people = np.array([("Alice", 30, 55.0), ("Bob", 25, 70.5)], dtype=dt)
print(people["name"])
```
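Field access composes with the usual array machinery; a small sketch of sorting and filtering the structured array above:

```python
import numpy as np

dt = np.dtype([("name", "U10"), ("age", "i4"), ("weight", "f8")])
people = np.array([("Alice", 30, 55.0), ("Bob", 25, 70.5)], dtype=dt)

# Sort whole records by one field
by_age = np.sort(people, order="age")
print(by_age["name"])  # ['Bob' 'Alice']

# Boolean masks work on fields too
heavy = people[people["weight"] > 60.0]
print(heavy["name"])  # ['Bob']
```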
### Memory-Mapped Files

```python
rng = np.random.default_rng(seed=42)

# Create
fp = np.memmap("data.dat", dtype="float64", mode="w+", shape=(10000, 100))
fp[:] = rng.standard_normal((10000, 100))
fp.flush()

# Read
fp = np.memmap("data.dat", dtype="float64", mode="r", shape=(10000, 100))
```
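The point of the read-mode memmap is that slices touch only the pages they need; a sketch with a small file (the filename is illustrative):

```python
import numpy as np

# Write a small memmap (shape kept tiny for the sketch)
fp = np.memmap("small.dat", dtype="float64", mode="w+", shape=(100, 10))
fp[:] = np.arange(1000, dtype="float64").reshape(100, 10)
fp.flush()
del fp  # release the writer

# Read back lazily: only the accessed rows are pulled from disk
ro = np.memmap("small.dat", dtype="float64", mode="r", shape=(100, 10))
chunk = np.asarray(ro[10:12])  # materialize two rows as a normal ndarray
print(chunk[0, 0])  # 100.0
```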
### Universal Functions (ufuncs)

```python
# Element-wise operations that work on arrays of any shape
np.sqrt(arr)
np.exp(arr)
np.log1p(arr)
np.clip(arr, 0, 255)

# Reductions
arr.sum(axis=0)     # column sums
arr.mean(axis=1)    # row means
arr.argmax(axis=1)  # index of max per row
```
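Axis semantics trip people up; a small sketch showing which dimension each reduction collapses, with `keepdims` to preserve rank for broadcasting:

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

col_sums = arr.sum(axis=0)    # collapses rows   -> shape (3,)
row_means = arr.mean(axis=1)  # collapses columns -> shape (2,)

# keepdims retains the reduced axis as size 1, handy for normalizing
centered = arr - arr.mean(axis=1, keepdims=True)  # (2, 3) minus (2, 1)

print(col_sums)        # [3 5 7]
print(row_means)       # [1. 4.]
print(centered.shape)  # (2, 3)
```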
## Best Practices

- Use the modern random API (`np.random.default_rng()`) instead of the legacy `np.random.seed()` / `np.random.randn()`.
- Prefer vectorized ops over Python loops. If you need a loop, consider `np.vectorize` (cosmetic) or writing a ufunc.
- Understand views vs copies: slicing creates views; boolean/fancy indexing creates copies. Unintended views cause subtle bugs.
- Specify `dtype` explicitly when precision matters (`float32` vs `float64`).
- Use `np.einsum` for complex tensor contractions; it is both readable and fast.
- Pre-allocate arrays (`np.empty`) instead of growing lists and converting.
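A sketch of the `np.einsum` recommendation: the subscript string makes the contraction explicit (the index letters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# Matrix product: contract over the shared index j
C = np.einsum("ij,jk->ik", A, B)
assert np.allclose(C, A @ B)

# Batched trace: sum the diagonal of each matrix in a stack
batch = rng.standard_normal((10, 4, 4))
traces = np.einsum("bii->b", batch)
assert np.allclose(traces, np.trace(batch, axis1=1, axis2=2))
```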
## Core Philosophy
NumPy is the language that data speaks in Python. When you express a computation as a vectorized NumPy operation, you are not just writing faster code; you are writing at a higher level of abstraction where the intent (element-wise transform, reduction, matrix product) is immediately visible. A Python for-loop over array elements obscures intent behind iteration mechanics; a NumPy expression reveals the mathematical structure directly.
Understanding the memory model is essential. NumPy arrays are contiguous blocks of typed memory, and operations like slicing produce views, not copies. This is both a performance feature and a correctness hazard: modifying a view modifies the original. Developers who internalize this distinction avoid an entire class of subtle bugs and can reason about performance characteristics without profiling every line.
Treat dtype selection as a design decision, not an afterthought. Choosing float32 versus float64, or int8 versus int64, affects memory footprint, numerical precision, and computation speed. For large datasets, appropriate dtype choices can halve memory usage and unlock GPU compatibility. For numerical algorithms, inappropriate choices can introduce silent precision loss that corrupts results.
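To make the dtype trade-off concrete, a sketch comparing memory footprints (the sizes follow directly from the element widths, 4 vs 8 bytes):

```python
import numpy as np

n = 1_000_000
x64 = np.zeros(n, dtype=np.float64)
x32 = np.zeros(n, dtype=np.float32)

print(x64.nbytes)  # 8000000
print(x32.nbytes)  # 4000000 (half the footprint)

# The cost: float32 carries roughly 7 decimal digits of precision vs 16
print(np.finfo(np.float32).eps)  # about 1.19e-07
print(np.finfo(np.float64).eps)  # about 2.22e-16
```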
## Anti-Patterns

- **Python loops over array elements**: Iterating over NumPy arrays with for-loops instead of using vectorized operations. This throws away NumPy's C-level speed advantage and can make code 100x slower for no benefit in readability.
- **Growing arrays by appending**: Building a result array by repeatedly calling `np.append()` in a loop, which copies the entire array on each iteration, producing O(n^2) behavior. Pre-allocate with `np.empty()` or collect in a list and convert once.
- **Ignoring the view/copy distinction**: Assuming that slicing produces an independent copy, then modifying the slice and unintentionally corrupting the original array. Use `.copy()` explicitly when independence is required.
- **Using the legacy random API**: Calling `np.random.seed()` and `np.random.randn()` instead of the modern `np.random.default_rng()` API. The legacy API uses global state that is not thread-safe and makes reproducibility harder to guarantee in complex programs.
- **Comparing floats with equality operators**: Using `==` to compare floating-point arrays, which fails due to representation error. Use `np.isclose()` or `np.allclose()` with appropriate tolerance parameters for numerical comparisons.
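A sketch of the two fixes for the append anti-pattern (pre-allocate, or collect in a list and convert once):

```python
import numpy as np

n = 1000

# Fix 1: pre-allocate, then fill in place
out = np.empty(n)
for i in range(n):
    out[i] = i * 0.5

# Fix 2: collect in a Python list, convert once at the end
collected = [i * 0.5 for i in range(n)]
out2 = np.array(collected)

assert np.array_equal(out, out2)
```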
## Common Pitfalls

- **Shape ambiguity**: `np.zeros(3)` gives shape `(3,)`, not `(3, 1)`. Be explicit about dimensions.
- **Integer overflow**: `np.array([200], dtype=np.int8)` silently overflows. Choose appropriate dtypes.
- **In-place modification of views**: modifying a slice modifies the original array.
- **Confusing `*` with `@`**: `A * B` is element-wise; `A @ B` is matrix multiplication.
- **Using `==` on floats**: floating-point comparisons need `np.isclose()` or `np.allclose()`.
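A sketch of the float-comparison pitfall and its fix, plus the `*` vs `@` distinction on a 2x2 matrix:

```python
import numpy as np

a = np.array([0.1, 0.2]) + np.array([0.2, 0.1])
b = np.array([0.3, 0.3])

print(a == b)            # [False False] (representation error)
print(np.isclose(a, b))  # [ True  True]
assert np.allclose(a, b)

A = np.array([[1, 2], [3, 4]])
print(A * A)  # element-wise: [[1, 4], [9, 16]]
print(A @ A)  # matrix product: [[7, 10], [15, 22]]
```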