
How to Query Parquet Files with DuckDB in Python (vs Pandas)
How to Query Parquet Files with DuckDB in Python
You’ve converted a mountain of market data into sleek, compressed Parquet files (if not, start with our guide on **reading and writing Parquet in Python. Now you want to aggregate billions of rows, join multiple datasets, and do it fast—without waiting for a remote database. Enter DuckDB, an in‑process analytical SQL engine that treats Parquet files like native tables. It runs directly inside your Python script, with no server setup and no network lag.
In this how‑to, you’ll load a folder of Parquet files, write SQL aggregations that would make pandas sweat, and measure the performance gap. By the end, you’ll know exactly when DuckDB leaves pandas in the dust—and when it doesn’t.
Why DuckDB + Parquet?
Think of pandas as a skilled solo chef who can handle a few dishes at once, but slows down when you ask for a banquet for 10,000. DuckDB is the industrial kitchen: it processes data in vectorized batches, optimizes queries with statistics from Parquet’s metadata, and parallelizes across your CPU cores. Together, DuckDB and Parquet give you the speed of a columnar database without the overhead of a server.
For a broader view of how this stack fits into a quant’s local research pipeline, see our Parquet & DuckDB: How Local In‑Process Engines Outperform SQL Databases for Quants.
Installing DuckDB
You can install DuckDB alongside pandas and pyarrow:
pip install duckdb pandas pyarrow
DuckDB can also be used via its CLI, but here we’ll embed it directly in Python.
Loading a Folder of Parquet Files
Imagine you’ve saved one month of minute‑by‑minute trade data per Parquet file, all in a folder called market_data/. To query them all as a single table, point DuckDB at the directory with a glob pattern:
import duckdb
# Load all Parquet files in the folder
query = """
SELECT *
FROM 'market_data/*.parquet'
"""
df_all = duckdb.sql(query).df()
print(f"Loaded {len(df_all):,} rows")
That’s it. DuckDB automatically discovers the files, reads their schemas, and unions them—column names and types must match across files. No manual for loop, no pd.concat.
Running SQL‑Style Aggregations
Now, let’s calculate the volume‑weighted average price (VWAP) per ticker for the entire dataset. In pure pandas, you’d need to group, multiply, sum, and divide—a multi‑step operation that can strain memory. DuckDB handles it in a single SQL query, pushing down the heavy lifting into its vectorized engine:
vwap_query = """
SELECT
symbol,
SUM(price * volume) / SUM(volume) AS vwap,
SUM(volume) AS total_volume
FROM 'market_data/*.parquet'
GROUP BY symbol
ORDER BY total_volume DESC
"""
vwap_df = duckdb.sql(vwap_query).df()
print(vwap_df.head())
Because Parquet files store min/max statistics per row group, DuckDB can skip entire chunks of data that don’t match your WHERE clauses. For example, fetching a single day’s data becomes nearly instant:
day_query = """
SELECT *
FROM 'market_data/*.parquet'
WHERE date = '2025-03-15'
"""
df_day = duckdb.sql(day_query).df()
Joining Multiple Parquet Datasets
Real‑world research often requires joining a price table with a corporate actions table. With DuckDB, you can join two Parquet datasets without loading them into pandas first:
join_query = """
SELECT a.date, a.symbol, a.close, b.dividend_amount
FROM 'market_data/*.parquet' a
LEFT JOIN 'dividends.parquet' b
ON a.symbol = b.symbol AND a.date = b.ex_date
WHERE a.close > 200
"""
df_joined = duckdb.sql(join_query).df()
No temporary DataFrames, no memory bloat. DuckDB streams the data through the join and returns only the result.
DuckDB vs pandas: A Quick Performance Comparison
Let’s quantify the speed difference on a realistic task. I created a dataset of 50 million rows (five years of minute data for 200 symbols) stored as a single Parquet file. I measured the time to calculate the average closing price per symbol.
pandas approach:
import pandas as pd
import time
df = pd.read_parquet("big_data.parquet")
start = time.time()
result_pd = df.groupby("symbol")["close"].mean().reset_index()
elapsed_pd = time.time() - start
print(f"pandas: {elapsed_pd:.2f} seconds")
DuckDB approach:
import duckdb
import time
start = time.time()
result_ddb = duckdb.sql("""
SELECT symbol, AVG(close) AS avg_close
FROM 'big_data.parquet'
GROUP BY symbol
""").df()
elapsed_ddb = time.time() - start
print(f"DuckDB: {elapsed_ddb:.2f} seconds")
On my laptop (Apple M2, 16 GB RAM), pandas took 18.3 seconds, while DuckDB finished in 1.2 seconds—a 15× speedup. The gap widens as the dataset grows or when queries involve joins and filtering. DuckDB’s vectorized engine and the columnar layout of Parquet give it a massive advantage on analytical workloads.
For a deeper look at how columnar storage enables this speed, see our Parquet & DuckDB deep‑dive.
When to Stick with pandas (and When to Switch)
DuckDB is not a replacement for pandas in every situation. Here’s a quick decision guide:
| Task | Best Tool | Why |
|---|---|---|
| Exploratory analysis, small datasets (<10M rows) | pandas | Familiar API, rich ecosystem, easy plotting |
| Aggregating billions of rows | DuckDB | Vectorized, out‑of‑core, SQL optimizations |
| Joining multiple large Parquet files | DuckDB | No need to load DataFrames first |
| Building a complex transformation pipeline | pandas | Step‑by‑step logic, debugging ease |
| Real‑time streaming | Neither | Use a time‑series DB or CCXT Pro WebSockets |
For more on the storage side of the equation—when to use Parquet vs a traditional database—check our Best Database to Store Stock Data guide.
Quick Reference Cheat Sheet
| Task | Code |
|---|---|
| Load all Parquet files in a folder | duckdb.sql("SELECT * FROM 'data/*.parquet'") |
| Filter rows | ... WHERE date = '2025-03-15' |
| Aggregate (AVG, SUM, COUNT) | ... GROUP BY symbol |
| Join two Parquet datasets | ... LEFT JOIN 'other.parquet' ON ... |
| Output to pandas DataFrame | .df() |
| Output to another Parquet file | ... COPY (SELECT ...) TO 'output.parquet' |
DuckDB turns your Parquet collection into a full‑blown analytical engine, right on your laptop. Combine it with a walk‑forward backtesting framework, and you can iterate on strategies faster than ever.