# Polars

Expert guidance on Polars for high-performance dataframe operations with a lazy query engine in Python.
You are an expert in Polars for data analysis and science.
## Overview
Polars is a blazing-fast DataFrame library written in Rust with first-class Python bindings. It features a lazy evaluation engine that optimizes query plans before execution, native multi-threading, Apache Arrow memory format, and an expressive API. Polars consistently outperforms Pandas on benchmarks, especially on larger-than-memory datasets.
## Core Concepts

### DataFrame and LazyFrame
```python
import polars as pl

# Eager DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 25, 35],
    "salary": [70000, 60000, 90000],
})

# Lazy — builds a query plan, executes on .collect()
lf = df.lazy()
result = (
    lf
    .filter(pl.col("age") > 25)
    .with_columns((pl.col("salary") * 0.3).alias("tax"))
    .collect()
)
```
### Expressions
Expressions are the core building block. They describe computations on columns.
```python
df.select(
    pl.col("name"),
    pl.col("salary").mean().alias("avg_salary"),
    (pl.col("age") * 12).alias("age_months"),
    pl.lit(2026).alias("year"),
)
```
### Filtering
```python
df.filter(
    (pl.col("age") > 25) & (pl.col("salary") > 65000)
)
```
### GroupBy and Aggregation
```python
sales = pl.DataFrame({
    "region": ["East", "West", "East", "West"],
    "product": ["A", "A", "B", "B"],
    "revenue": [100, 150, 200, 250],
})

sales.group_by("region").agg(
    pl.col("revenue").sum().alias("total"),
    pl.col("revenue").mean().alias("avg"),
    pl.col("product").n_unique().alias("n_products"),
)
```
### Joins
```python
customers = pl.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})
orders = pl.DataFrame({"customer_id": [1, 1, 2], "amount": [50, 30, 80]})
orders.join(customers, left_on="customer_id", right_on="id", how="left")
```
## Implementation Patterns

### Lazy Execution for Large Data
```python
result = (
    pl.scan_parquet("data/*.parquet")
    .filter(pl.col("date") >= "2025-01-01")
    .group_by("category")
    .agg(pl.col("amount").sum())
    .sort("amount", descending=True)
    .head(10)
    .collect()
)
```
`scan_parquet` reads lazily; Polars pushes predicates and projections down so it reads only the columns and rows it needs.
### Window Functions
```python
df.with_columns(
    pl.col("salary").mean().over("department").alias("dept_avg"),
    pl.col("salary").rank().over("department").alias("dept_rank"),
    pl.col("revenue").cum_sum().over("region").alias("cumulative"),
)
```
### String and Temporal Operations
```python
df.with_columns(
    pl.col("name").str.to_uppercase().alias("upper_name"),
    # literal=True treats the pattern as a plain substring, not a regex
    pl.col("email").str.contains("@company.com", literal=True).alias("is_internal"),
    pl.col("date").dt.year().alias("year"),
    pl.col("date").dt.month().alias("month"),
)
```
### Converting To and From Pandas
```python
# Polars -> Pandas
pandas_df = polars_df.to_pandas()

# Pandas -> Polars
polars_df = pl.from_pandas(pandas_df)
```
## Best Practices
- **Use lazy mode** (`scan_*` or `.lazy()`) for any non-trivial query. The optimizer can eliminate unnecessary work.
- **Prefer expressions** over `apply` / `map_elements`. Expressions run in Rust; mapped Python functions run in Python.
- **Use `scan_parquet`** instead of `read_csv` for large data. Parquet enables predicate pushdown and column pruning.
- **Chain operations** in a single expression tree rather than storing intermediate DataFrames.
- **Use `with_columns`** to add or transform columns instead of assigning to a column name.
- **Avoid Python UDFs** whenever possible. If needed, use `map_batches` over `map_elements` for better performance.
## Core Philosophy
Polars is built on the principle that data transformation should be declarative: you describe what you want, not how to compute it, and the query engine figures out the most efficient execution plan. The lazy API embodies this -- you build an expression tree, and Polars optimizes it (predicate pushdown, projection pruning, parallel execution) before touching any data. Embracing lazy evaluation is not an optimization trick; it is the intended way to use Polars.
Immutability is a feature, not a limitation. Polars DataFrames do not have an index, do not support in-place mutation, and return new objects from every operation. This eliminates entire categories of bugs (stale references, index misalignment, copy-vs-view ambiguity) that plague mutable DataFrame libraries. If you find yourself fighting the immutability, you are likely trying to port a Pandas pattern that does not translate. Step back and express the intent as a chain of expressions instead.
Expressions are the heart of Polars. Every transformation -- filtering, aggregation, window functions, string manipulation, datetime extraction -- is expressed through the `pl.col()` / `pl.lit()` / `pl.when()` expression API. These expressions run in Rust at native speed. The moment you drop into a Python UDF via `map_elements`, you lose that speed advantage. Treat Python UDFs as a last resort, not a convenience, and invest the time to learn the expression API deeply.
## Anti-Patterns
- **Eagerly reading large files with `read_csv`**: Using `pl.read_csv()` instead of `pl.scan_csv()` for large datasets, which loads everything into memory before any filtering or column selection. The lazy `scan_*` functions enable predicate pushdown and projection pruning that can reduce I/O by orders of magnitude.
- **Using `map_elements` for vectorizable operations**: Dropping into Python via `map_elements` for logic that could be expressed with built-in expressions. This serializes execution into a single Python thread and eliminates Polars' parallelism and Rust-speed advantage.
- **Porting Pandas idioms directly**: Trying to use index-based alignment, in-place mutation, or `iterrows`-style iteration. Polars has no index by design, and its API favors expression-based transformations. Attempting to force Pandas patterns leads to verbose, slow code.
- **Storing intermediate DataFrames unnecessarily**: Assigning every intermediate step to a variable instead of chaining operations in a single lazy query. This prevents the optimizer from seeing the full plan and may force unnecessary materialization.
- **Forgetting to call `.collect()` on LazyFrames**: Building a lazy query and then passing the LazyFrame to code that expects a DataFrame. The LazyFrame is a plan, not data -- it must be collected to produce results.
## Common Pitfalls
- **Expecting Pandas semantics**: Polars has no index, operations return new DataFrames (immutable by design), and column order may differ.
- **Using `apply` / `map_elements` for vectorizable ops**: this drops into Python and destroys performance.
- **Forgetting `.collect()`** on lazy frames — you get a query plan object, not results.
- **String dtype confusion**: Polars uses `Utf8` (now `String`); ensure consistent naming when reading from external sources.
- **Not leveraging `sink_parquet`** for out-of-core writes when results are too large for memory.