
How to Read and Write Parquet Files in Python: A Practical Guide for Quants
You’ve just pulled ten years of minute‑by‑minute crypto data into a CSV file, and it’s a monster: 20 GB and growing. Your laptop whirs like a jet engine every time you try to open it. There’s a better way, and it is one pip install away: Apache Parquet. Unlike a CSV, which stores every row as a line in a flat text file, Parquet is a columnar format that shrinks your data, speeds up queries, and plays nicely with in‑process engines like DuckDB.
In this quick how‑to, you’ll learn exactly how to read and write Parquet files with Python, convert existing CSVs, and pick the right compression for your quant research.
What Is Parquet (and Why Should You Care)?
Think of a CSV as a giant scroll of paper where every line contains a full record: date, ticker, open, high, low, close, volume. To calculate the average closing price, you have to scan every single line, reading all that other data along the way. That’s row‑based storage, and it’s slow.
Parquet flips that scroll sideways. It stores data by columns, so all closing prices sit together in one compact block, separate from the timestamps and symbols. When you only need the close, Parquet reads that block and ignores the rest. This is the core benefit of columnar storage. Because similar values sit next to each other, they also compress well, which is why Parquet files can be up to 80% smaller than the equivalent CSV, depending on the data and codec, and why queries that touch only a few columns can run far faster.
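To make the row-versus-column distinction concrete, here is a toy sketch in plain Python. It is not how Parquet is implemented internally; the records and values are made up purely for illustration.

```python
# Toy data: three OHLCV-style records (values are invented for illustration).
rows = [
    {"date": "2024-01-02", "ticker": "AAA", "close": 10.0, "volume": 100},
    {"date": "2024-01-03", "ticker": "AAA", "close": 11.0, "volume": 150},
    {"date": "2024-01-04", "ticker": "AAA", "close": 12.0, "volume": 120},
]

# Row-oriented access: every full record is visited to pull out one field.
avg_close_rows = sum(r["close"] for r in rows) / len(rows)

# Column-oriented layout: each field lives in its own contiguous list.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# Only the "close" column is touched; the other columns are never read.
avg_close_cols = sum(columns["close"]) / len(columns["close"])

print(avg_close_rows, avg_close_cols)
```

Both approaches give the same answer; the difference is how much unrelated data has to be read along the way.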
For a deeper dive into how Parquet fits into a high‑performance local stack, see our article on Parquet & DuckDB: How Local In‑Process Engines Outperform SQL Databases for Quants.
Installing Parquet Support
You’ve got three main options in Python: pandas, pyarrow, and fastparquet. All can read and write Parquet, but they have slightly different strengths. For most quant workflows, pandas with pyarrow as the backend is the sweet spot.
# Install the essentials
pip install pandas pyarrow
If you prefer fastparquet as the engine (its performance relative to pyarrow varies by workload), just add it:
pip install fastparquet
Reading a Parquet File
Once installed, reading a Parquet file is as simple as reading a CSV—only faster.
import pandas as pd
# Read the entire file
df = pd.read_parquet("market_data.parquet")
# Inspect the schema (columns and types)
print(df.dtypes)
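If your file is huge, you can read only a few rows to peek at the structure. Calling .head() on the result of pd.read_parquet() still loads the entire file first, so read a single small batch through pyarrow instead:
import pyarrow.parquet as pq
# Read roughly the first 1000 rows without materializing the whole file
pf = pq.ParquetFile("market_data.parquet")
df_sample = next(pf.iter_batches(batch_size=1000)).to_pandas()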
But the real power is predicate pushdown. Parquet stores min/max statistics for each row group in its metadata, so a reader can skip row groups that cannot match a filter instead of scanning the whole file. For example, to load only trades for a single ticker:
# Read only rows where ticker == 'AAPL'
df_aapl = pd.read_parquet(
    "market_data.parquet",
    filters=[("ticker", "==", "AAPL")],
)
And to read only specific columns (say, timestamp and close):
df_close = pd.read_parquet(
    "market_data.parquet",
    columns=["timestamp", "close"],
)
This is a huge time‑saver when you’re backtesting a strategy on a single asset. You don’t need to drag the entire dataset into memory.
Writing a DataFrame to Parquet
After cleaning or aggregating your data, you can save it as Parquet with one line:
# Write the DataFrame to a Parquet file
df.to_parquet("cleaned_data.parquet")
By default, pandas uses pyarrow and applies Snappy compression—a good balance between speed and file size. You can change the compression codec to get even smaller files:
# Use Zstandard for high compression
df.to_parquet("cleaned_data.parquet", compression="zstd")
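Both "gzip" and "zstd" usually produce smaller files than "snappy" at the cost of more CPU time, and "gzip" is the older and more broadly supported of the two. For speed, stick with "snappy" or "lz4". The best choice depends on your data, so it is worth measuring on a sample. Check our Best Database to Store Stock Data guide for more on choosing the right format for different workloads.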
Converting a CSV to Parquet
If you’re sitting on a mountain of legacy CSVs, you can convert them to Parquet with a few lines. This example reads a large CSV in chunks (to avoid blowing up memory) and writes each chunk to a Parquet dataset:
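import os
import pandas as pd
csv_file = "gigantic_ohlcv.csv"
parquet_dir = "parquet_data/"
chunksize = 1_000_000  # one million rows at a time
# to_parquet does not create missing directories, so make sure the target exists
os.makedirs(parquet_dir, exist_ok=True)
for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=chunksize)):
    # Each chunk becomes its own file inside the dataset directory
    chunk.to_parquet(
        f"{parquet_dir}chunk_{i:04d}.parquet",
        compression="snappy",
        index=False,
    )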
Later, you can read all those Parquet files together as a single DataFrame:
df_all = pd.read_parquet(parquet_dir)
This approach is great for quants who download raw data from APIs like CCXT or Polygon.io. For finding reliable sources of financial data, see our Essential Data Sources for Quantitative Analysis roundup.
Parquet + DuckDB: The Speed Combo
Once your data is in Parquet, you can pair it with an in‑process engine like DuckDB to run SQL‑style queries at blazing speed. This is exactly the stack we cover in the Parquet & DuckDB article. As a teaser, here’s how you’d join two Parquet files in a single Python script:
import duckdb
# Query the Parquet files in place and collect the result as a pandas DataFrame
df_joined = duckdb.sql("""
    SELECT a.date, a.symbol, a.close
    FROM 'market_data.parquet' a
    JOIN 'dividends.parquet' b
    ON a.symbol = b.symbol
    WHERE a.close > 200
""").to_df()
No database server, no network lag—just a local file and a lightning‑fast engine.
When to Stick with CSV
Parquet is fantastic for analytical workloads—backtesting, screening, factor research. But if you need to append new rows to a dataset every few seconds (like real‑time tick ingestion), a CSV or a dedicated time‑series database might be more practical. Parquet files are immutable in practice; you typically write them once and query them many times. For ideas on storing streaming tick data, see our Rate Limits and Order Queues article, which discusses infrastructure patterns.
Quick Reference Cheat Sheet
| Task | Code |
|---|---|
| Read Parquet | pd.read_parquet("file.parquet") |
| Read with filters | pd.read_parquet("file.parquet", filters=[("col","==",val)]) |
| Read specific columns | pd.read_parquet("file.parquet", columns=["col1","col2"]) |
| Write Parquet | df.to_parquet("file.parquet") |
| Write with compression | df.to_parquet("file.parquet", compression="zstd") |
| Convert CSV chunked | pd.read_csv(csv, chunksize=N) then .to_parquet() |
| Query with DuckDB | duckdb.sql("SELECT ... FROM 'file.parquet'").to_df() |
Now you’ve got the Parquet pipeline. The next step is to put that lightweight data to work in a walk‑forward backtest or a LightGBM model—both of which we cover in detail across the site.