
NumPy

Expert guidance on NumPy for numerical computing, array operations, and linear algebra in Python.

Quick Summary31 lines
You are an expert in NumPy for data analysis and science.

## Key Points

- **Use the modern random API** (`np.random.default_rng()`) instead of the legacy `np.random.seed()` / `np.random.randn()`.
- **Prefer vectorized ops** over Python loops. If you need a loop, consider `np.vectorize` (a convenience wrapper, not a speedup) or writing a ufunc.
- **Understand views vs copies**: slicing creates views; boolean/fancy indexing creates copies. Unintended views cause subtle bugs.
- **Specify `dtype`** explicitly when precision matters (`float32` vs `float64`).
- **Use `np.einsum`** for complex tensor contractions — it is both readable and fast.
- **Pre-allocate arrays** (`np.empty`) instead of growing lists and converting.
- **Shape ambiguity**: `np.zeros(3)` gives shape `(3,)`, not `(3, 1)`. Be explicit about dimensions.
- **Integer overflow**: fixed-width integers wrap silently in arithmetic (100 + 100 as `int8` gives -56). Choose appropriate dtypes.
- **In-place modification of views**: modifying a slice modifies the original array.
- **Confusing `*` with `@`**: `A * B` is element-wise; `A @ B` is matrix multiplication.
- **Using `==` on floats**: floating-point comparisons need `np.isclose()` or `np.allclose()`.

## Quick Example

```python
import numpy as np

a = np.array([[1], [2], [3]])       # shape (3, 1)
b = np.array([10, 20, 30, 40])      # shape (4,)
result = a + b                      # shape (3, 4)
```

```python
import numpy as np

dt = np.dtype([("name", "U10"), ("age", "i4"), ("weight", "f8")])
people = np.array([("Alice", 30, 55.0), ("Bob", 25, 70.5)], dtype=dt)
print(people["name"])
```

NumPy — Data Science

You are an expert in NumPy for data analysis and science.

Overview

NumPy is the fundamental package for numerical computing in Python. It provides the ndarray — a fast, memory-efficient multi-dimensional array — along with a comprehensive library of mathematical functions, broadcasting semantics, and linear algebra routines. Nearly every data science library in Python builds on NumPy.

Core Concepts

Array Creation

```python
import numpy as np

# From lists
a = np.array([1, 2, 3])
b = np.array([[1, 2], [3, 4]], dtype=np.float64)

# Generators
zeros = np.zeros((3, 4))
ones = np.ones((2, 3))
identity = np.eye(4)
steps = np.arange(0, 10, 0.5)
linspace = np.linspace(0, 1, 50)

# Random (modern API)
rng = np.random.default_rng(seed=42)
samples = rng.standard_normal((1000, 5))
```

Indexing and Slicing

```python
arr = np.arange(20).reshape(4, 5)

# Basic slicing (views, not copies)
row = arr[1, :]
col = arr[:, 2]
sub = arr[1:3, 2:4]

# Boolean indexing (copies)
mask = arr > 10
filtered = arr[mask]

# Fancy indexing: paired index lists pick elements (0, 1) and (2, 3)
selected = arr[[0, 2], [1, 3]]
```
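The view/copy distinction is easy to get wrong in practice; a minimal sketch of how a view bites and how `.copy()` restores independence:

```python
import numpy as np

arr = np.arange(20).reshape(4, 5)

# Slices are views: writing through them mutates the original
row = arr[1, :]
row[0] = -1
print(arr[1, 0])            # -1: the original changed

# Boolean indexing returns a copy: the original is untouched
filtered = arr[arr > 10]
filtered[:] = 0
print(arr.max())            # still 19

# Use .copy() when you need an independent slice
safe = arr[1, :].copy()
safe[0] = 99
print(arr[1, 0])            # still -1
```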

Broadcasting

Broadcasting lets arrays of different shapes interact in element-wise operations.

```python
a = np.array([[1], [2], [3]])       # shape (3, 1)
b = np.array([10, 20, 30, 40])      # shape (4,)
result = a + b                      # shape (3, 4)
```

Rules: shapes are compared from the trailing dimension; each pair of sizes must be equal, or one of them must be 1 (missing leading dimensions count as 1).
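As a sketch of applying these rules, two 1-D arrays whose trailing sizes clash can be made compatible by inserting an axis; `np.broadcast_shapes` (available in modern NumPy) checks compatibility without allocating anything:

```python
import numpy as np

x = np.arange(3)                    # shape (3,)
y = np.arange(4)                    # shape (4,)

# x + y would raise: trailing sizes 3 and 4 differ and neither is 1.
# Adding an axis turns x into shape (3, 1), which broadcasts against (4,).
table = x[:, np.newaxis] + y
print(table.shape)                  # (3, 4)

# Check compatibility up front without building any arrays
print(np.broadcast_shapes((3, 1), (4,)))    # (3, 4)
```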

Linear Algebra

```python
A = np.array([[1, 2], [3, 4]])
b = np.array([5, 6])

# Matrix multiply
C = A @ A           # or np.matmul(A, A)

# Solve Ax = b
x = np.linalg.solve(A, b)

# Eigenvalues
vals, vecs = np.linalg.eig(A)

# SVD
U, S, Vt = np.linalg.svd(A)
```
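Because these routines return floating-point results, their outputs are best verified with `np.allclose` rather than `==`; a quick sanity check on the calls above:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])

# The solution of Ax = b should reproduce b when multiplied back
x = np.linalg.solve(A, b)
print(np.allclose(A @ x, b))                # True

# SVD factors reconstruct the original matrix: U diag(S) Vt
U, S, Vt = np.linalg.svd(A)
print(np.allclose(U @ np.diag(S) @ Vt, A))  # True
```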

Implementation Patterns

Vectorized Computation

```python
# Bad — Python loop
result = []
for x in data:
    result.append(x ** 2 + 2 * x + 1)

# Good — vectorized
result = data ** 2 + 2 * data + 1
```
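Branching inside the loop body is often what keeps code un-vectorized; `np.where` handles the common two-branch case. A sketch with a made-up piecewise transform:

```python
import numpy as np

data = np.array([-2.0, -1.0, 0.5, 1.0, 2.0])

# Loop version of a piecewise transform: square positives, negate negatives
slow = np.array([x ** 2 if x > 0 else -x for x in data])

# Vectorized equivalent: both branches are evaluated on the whole array,
# then np.where selects per element
fast = np.where(data > 0, data ** 2, -data)
print(np.array_equal(slow, fast))       # True
```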

Structured Arrays

```python
dt = np.dtype([("name", "U10"), ("age", "i4"), ("weight", "f8")])
people = np.array([("Alice", 30, 55.0), ("Bob", 25, 70.5)], dtype=dt)
print(people["name"])
```
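Each field behaves like an ordinary array, so reductions and field-based sorting compose naturally; a short sketch:

```python
import numpy as np

dt = np.dtype([("name", "U10"), ("age", "i4"), ("weight", "f8")])
people = np.array([("Alice", 30, 55.0), ("Bob", 25, 70.5)], dtype=dt)

# Numeric fields support the usual reductions
print(people["age"].mean())         # 27.5

# np.sort with order= sorts whole records by a field
by_age = np.sort(people, order="age")
print(by_age["name"])               # ['Bob' 'Alice']
```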

Memory-Mapped Files

```python
# Create
fp = np.memmap("data.dat", dtype="float64", mode="w+", shape=(10000, 100))
fp[:] = rng.standard_normal((10000, 100))
fp.flush()

# Read
fp = np.memmap("data.dat", dtype="float64", mode="r", shape=(10000, 100))
```

Universal Functions (ufuncs)

```python
# Element-wise operations that work on arrays of any shape
np.sqrt(arr)
np.exp(arr)
np.log1p(arr)
np.clip(arr, 0, 255)

# Reductions
arr.sum(axis=0)      # column sums
arr.mean(axis=1)     # row means
arr.argmax(axis=1)   # index of max per row
```
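Reductions pair with broadcasting; passing `keepdims=True` preserves the reduced axis so the result divides back cleanly. A sketch of row-normalizing a matrix:

```python
import numpy as np

arr = np.array([[1.0, 3.0], [2.0, 6.0]])

# sum(axis=1) alone has shape (2,), which would broadcast against the
# columns; keepdims=True keeps shape (2, 1) so it lines up with the rows.
row_sums = arr.sum(axis=1, keepdims=True)
normalized = arr / row_sums
print(normalized.sum(axis=1))       # [1. 1.]
```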

Best Practices

  • Use the modern random API (np.random.default_rng()) instead of the legacy np.random.seed() / np.random.randn().
  • Prefer vectorized ops over Python loops. If you need a loop, consider np.vectorize (a convenience wrapper, not a speedup) or writing a ufunc.
  • Understand views vs copies: slicing creates views; boolean/fancy indexing creates copies. Unintended views cause subtle bugs.
  • Specify dtype explicitly when precision matters (float32 vs float64).
  • Use np.einsum for complex tensor contractions — it is both readable and fast.
  • Pre-allocate arrays (np.empty) instead of growing lists and converting.
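The np.einsum recommendation above can be illustrated with a few common contractions, each checked against its conventional spelling:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# Matrix product: sum over the shared index j
C = np.einsum("ij,jk->ik", A, B)
print(np.allclose(C, A @ B))                            # True

# Batched matrix multiply: the batch index b is carried through
X = rng.standard_normal((10, 3, 4))
Y = rng.standard_normal((10, 4, 5))
Z = np.einsum("bij,bjk->bik", X, Y)
print(np.allclose(Z, X @ Y))                            # True

# Trace: a repeated index with no output index sums the diagonal
M = rng.standard_normal((4, 4))
print(np.isclose(np.einsum("ii->", M), np.trace(M)))    # True
```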

Core Philosophy

NumPy is the language that data speaks in Python. When you express a computation as a vectorized NumPy operation, you are not just writing faster code; you are writing at a higher level of abstraction where the intent (element-wise transform, reduction, matrix product) is immediately visible. A Python for-loop over array elements obscures intent behind iteration mechanics; a NumPy expression reveals the mathematical structure directly.

Understanding the memory model is essential. NumPy arrays are contiguous blocks of typed memory, and operations like slicing produce views, not copies. This is both a performance feature and a correctness hazard: modifying a view modifies the original. Developers who internalize this distinction avoid an entire class of subtle bugs and can reason about performance characteristics without profiling every line.

Treat dtype selection as a design decision, not an afterthought. Choosing float32 versus float64, or int8 versus int64, affects memory footprint, numerical precision, and computation speed. For large datasets, appropriate dtype choices can halve memory usage and unlock GPU compatibility. For numerical algorithms, inappropriate choices can introduce silent precision loss that corrupts results.
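Both effects are easy to demonstrate; a small sketch of precision loss in float32 and the memory difference:

```python
import numpy as np

# Precision: float32 carries a 24-bit significand, so near 1e8 the gap
# between representable values is 8, and adding 1 is lost entirely.
a = np.float32(1e8)
print(a + np.float32(1.0) == a)         # True: the increment vanished
print(np.float64(1e8) + 1.0 == 1e8)     # False: float64 still resolves it

# Footprint: same element count, half the memory
big32 = np.zeros(1_000_000, dtype=np.float32)
big64 = np.zeros(1_000_000, dtype=np.float64)
print(big32.nbytes, big64.nbytes)       # 4000000 8000000
```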

Anti-Patterns

  • Python loops over array elements: Iterating over NumPy arrays with for-loops instead of using vectorized operations. This throws away NumPy's C-level speed advantage and can make code 100x slower for no benefit in readability.

  • Growing arrays by appending: Building a result array by repeatedly calling np.append() in a loop, which copies the entire array on each iteration, producing O(n^2) behavior. Pre-allocate with np.empty() or collect in a list and convert once.

  • Ignoring the view/copy distinction: Assuming that slicing produces an independent copy, then modifying the slice and unintentionally corrupting the original array. Use .copy() explicitly when independence is required.

  • Using the legacy random API: Calling np.random.seed() and np.random.randn() instead of the modern np.random.default_rng() API. The legacy API uses global state that is not thread-safe and makes reproducibility harder to guarantee in complex programs.

  • Comparing floats with equality operators: Using == to compare floating-point arrays, which fails due to representation error. Use np.isclose() or np.allclose() with appropriate tolerance parameters for numerical comparisons.

Common Pitfalls

  • Shape ambiguity: np.zeros(3) gives shape (3,), not (3, 1). Be explicit about dimensions.
  • Integer overflow: fixed-width integers wrap silently in arithmetic (100 + 100 as int8 gives -56). Choose appropriate dtypes.
  • In-place modification of views: modifying a slice modifies the original array.
  • Confusing * with @: A * B is element-wise; A @ B is matrix multiplication.
  • Using == on floats: floating-point comparisons need np.isclose() or np.allclose().
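Two of these pitfalls in runnable form (the wrap-around shown assumes array arithmetic, where NumPy does not warn):

```python
import numpy as np

# Integer overflow: int8 holds -128..127, so 100 + 100 wraps to -56
x = np.array([100], dtype=np.int8)
print(x + x)                        # [-56]

# Float equality: 0.1 + 0.2 is not exactly 0.3 in binary floating point
total = 0.1 + 0.2
print(total == 0.3)                 # False
print(np.isclose(total, 0.3))       # True
```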
