
Polars

Expert guidance on Polars for high-performance dataframe operations with a lazy query engine in Python.

Quick Summary
You are an expert in Polars for data analysis and science.

## Key Points

- **Use lazy mode** (`scan_*` or `.lazy()`) for any non-trivial query. The optimizer can eliminate unnecessary work.
- **Prefer expressions** over `apply` / `map_elements`. Expressions run in Rust; mapped Python functions run in Python.
- **Use `scan_parquet`** instead of `read_csv` for large data. Parquet enables predicate pushdown and column pruning.
- **Chain operations** in a single expression tree rather than storing intermediate DataFrames.
- **Use `with_columns`** to add or transform columns instead of assigning to a column name.
- **Avoid Python UDFs** whenever possible. If needed, use `map_batches` over `map_elements` for better performance.

## Common Pitfalls

- **Expecting Pandas semantics**: Polars has no index, operations return new DataFrames (immutable by design), and column order may differ.
- **Using `apply` / `map_elements` for vectorizable ops**: this drops into Python and destroys performance.
- **Forgetting `.collect()`** on lazy frames — you get a query plan object, not results.
- **String dtype confusion**: Polars uses `Utf8` (now `String`); ensure consistent naming when reading from external sources.
- **Not leveraging `sink_parquet`** for out-of-core writes when results are too large for memory.

## Quick Example

```python
df.filter(
    (pl.col("age") > 25) & (pl.col("salary") > 65000)
)
```

```python
customers = pl.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})
orders = pl.DataFrame({"customer_id": [1, 1, 2], "amount": [50, 30, 80]})

orders.join(customers, left_on="customer_id", right_on="id", how="left")
```
Polars — Data Science

You are an expert in Polars for data analysis and science.

Overview

Polars is a blazing-fast DataFrame library written in Rust with first-class Python bindings. It features a lazy evaluation engine that optimizes query plans before execution, native multi-threading, the Apache Arrow memory format, and an expressive API. Polars consistently outperforms Pandas in benchmarks, and its streaming engine can process larger-than-memory datasets that Pandas cannot handle at all.

Core Concepts

DataFrame and LazyFrame

```python
import polars as pl

# Eager DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 25, 35],
    "salary": [70000, 60000, 90000],
})

# Lazy — builds a query plan, executes on .collect()
lf = df.lazy()
result = (
    lf
    .filter(pl.col("age") > 25)
    .with_columns((pl.col("salary") * 0.3).alias("tax"))
    .collect()
)
```

Expressions

Expressions are the core building block. They describe computations on columns.

```python
df.select(
    pl.col("name"),
    pl.col("salary").mean().alias("avg_salary"),
    (pl.col("age") * 12).alias("age_months"),
    pl.lit(2026).alias("year"),
)
```

Filtering

```python
df.filter(
    (pl.col("age") > 25) & (pl.col("salary") > 65000)
)
```

GroupBy and Aggregation

```python
sales = pl.DataFrame({
    "region": ["East", "West", "East", "West"],
    "product": ["A", "A", "B", "B"],
    "revenue": [100, 150, 200, 250],
})

sales.group_by("region").agg(
    pl.col("revenue").sum().alias("total"),
    pl.col("revenue").mean().alias("avg"),
    pl.col("product").n_unique().alias("n_products"),
)
```

Joins

```python
customers = pl.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})
orders = pl.DataFrame({"customer_id": [1, 1, 2], "amount": [50, 30, 80]})

orders.join(customers, left_on="customer_id", right_on="id", how="left")
```

Implementation Patterns

Lazy Execution for Large Data

```python
result = (
    pl.scan_parquet("data/*.parquet")
    .filter(pl.col("date") >= pl.date(2025, 1, 1))
    .group_by("category")
    .agg(pl.col("amount").sum())
    .sort("amount", descending=True)
    .head(10)
    .collect()
)
```

scan_parquet reads lazily; Polars pushes predicates and projections down so it only reads the columns and rows it needs.

Window Functions

```python
df.with_columns(
    pl.col("salary").mean().over("department").alias("dept_avg"),
    pl.col("salary").rank().over("department").alias("dept_rank"),
    pl.col("revenue").cum_sum().over("region").alias("cumulative"),
)
```

String and Temporal Operations

```python
df.with_columns(
    pl.col("name").str.to_uppercase().alias("upper_name"),
    pl.col("email").str.contains("@company.com").alias("is_internal"),
    pl.col("date").dt.year().alias("year"),
    pl.col("date").dt.month().alias("month"),
)
```

Converting To and From Pandas

```python
# Polars -> Pandas
pandas_df = polars_df.to_pandas()

# Pandas -> Polars
polars_df = pl.from_pandas(pandas_df)
```

Best Practices

- Use lazy mode (`scan_*` or `.lazy()`) for any non-trivial query. The optimizer can eliminate unnecessary work.
- Prefer expressions over `apply` / `map_elements`. Expressions run in Rust; mapped Python functions run in Python.
- Use `scan_parquet` instead of `read_csv` for large data. Parquet enables predicate pushdown and column pruning.
- Chain operations in a single expression tree rather than storing intermediate DataFrames.
- Use `with_columns` to add or transform columns instead of assigning to a column name.
- Avoid Python UDFs whenever possible. If needed, use `map_batches` over `map_elements` for better performance.

Core Philosophy

Polars is built on the principle that data transformation should be declarative: you describe what you want, not how to compute it, and the query engine figures out the most efficient execution plan. The lazy API embodies this -- you build an expression tree, and Polars optimizes it (predicate pushdown, projection pruning, parallel execution) before touching any data. Embracing lazy evaluation is not an optimization trick; it is the intended way to use Polars.

Immutability is a feature, not a limitation. Polars DataFrames do not have an index, do not support in-place mutation, and return new objects from every operation. This eliminates entire categories of bugs (stale references, index misalignment, copy-vs-view ambiguity) that plague mutable DataFrame libraries. If you find yourself fighting the immutability, you are likely trying to port a Pandas pattern that does not translate. Step back and express the intent as a chain of expressions instead.

Expressions are the heart of Polars. Every transformation -- filtering, aggregation, window functions, string manipulation, datetime extraction -- is expressed through the pl.col() / pl.lit() / pl.when() expression API. These expressions run in Rust at native speed. The moment you drop into a Python UDF via map_elements, you lose that speed advantage. Treat Python UDFs as a last resort, not a convenience, and invest the time to learn the expression API deeply.

Anti-Patterns

- **Eagerly reading large files with `read_csv`**: Using `pl.read_csv()` instead of `pl.scan_csv()` for large datasets, which loads everything into memory before any filtering or column selection. The lazy `scan_*` functions enable predicate pushdown and projection pruning that can reduce I/O by orders of magnitude.

- **Using `map_elements` for vectorizable operations**: Dropping into Python via `map_elements` for logic that could be expressed with built-in expressions. This serializes execution into a single Python thread and eliminates Polars' parallelism and Rust-speed advantage.

- **Porting Pandas idioms directly**: Trying to use index-based alignment, in-place mutation, or iterrows-style iteration. Polars has no index by design, and its API favors expression-based transformations. Attempting to force Pandas patterns leads to verbose, slow code.

- **Storing intermediate DataFrames unnecessarily**: Assigning every intermediate step to a variable instead of chaining operations in a single lazy query. This prevents the optimizer from seeing the full plan and may force unnecessary materialization.

- **Forgetting to call `.collect()` on LazyFrames**: Building a lazy query and then passing the LazyFrame to code that expects a DataFrame. The LazyFrame is a plan, not data -- it must be collected to produce results.

Common Pitfalls

- **Expecting Pandas semantics**: Polars has no index, operations return new DataFrames (immutable by design), and column order may differ.
- **Using `apply` / `map_elements` for vectorizable ops**: this drops into Python and destroys performance.
- **Forgetting `.collect()` on lazy frames** — you get a query plan object, not results.
- **String dtype confusion**: Polars uses `Utf8` (now `String`); ensure consistent naming when reading from external sources.
- **Not leveraging `sink_parquet`** for out-of-core writes when results are too large for memory.
