db2pq is a Python library for moving data from PostgreSQL into Apache Parquet files.
It is designed for both general PostgreSQL sources and the WRDS PostgreSQL service.
If you are contributing to the project, see CONTRIBUTING.md for the
recommended local development workflow.
- Export a single PostgreSQL table to Parquet.
- Export an Ibis PostgreSQL table expression to Parquet.
- Export all tables in a PostgreSQL schema to Parquet.
- Export WRDS tables to Parquet.
- Update Parquet files only when the WRDS source table is newer.
- Mirror WRDS tables into a local PostgreSQL database.
- Read `last_modified` metadata embedded in Parquet files.
Install from PyPI:

```shell
pip install --upgrade db2pq
```

This installs the bundled psycopg PostgreSQL client dependency, so most users
do not need a separate system libpq installation.
Install optional SAS support if you use WRDS SAS metadata checks, especially
when you want WRDS refresh logic based on the SAS-side source timestamps used
by `wrds_update_pq(..., use_sas=True)`:

```shell
pip install --upgrade "db2pq[sas]"
```

Install optional pandas support if you want metadata queries returned as pandas
DataFrames, for example from `pq_last_modified(...)` in notebooks or analysis
scripts:

```shell
pip install --upgrade "db2pq[pandas]"
```

Install optional Ibis support if your workflow builds PostgreSQL queries with
Ibis expressions and you want to export those query results directly to
Parquet with `ibis_to_pq(...)`:

```shell
pip install --upgrade "db2pq[ibis]"
```

Install optional ADBC support if you want the Arrow-based export path for
PostgreSQL-to-Parquet workflows, including helpers that accept
`engine="adbc"`:

```shell
pip install --upgrade "db2pq[adbc]"
```

Install both optional SAS and pandas support:

```shell
pip install --upgrade "db2pq[sas,pandas]"
```

db2pq supports explicit function arguments and environment-based defaults.
It also loads a local `.env` file automatically (via python-dotenv) when resolving defaults.
Connection defaults:

- `PGUSER`: PostgreSQL user (falls back to the local OS user)
- `PGHOST`: PostgreSQL host (default: `localhost`)
- `PGDATABASE`: PostgreSQL database (default: `PGUSER`)
- `PGPORT`: PostgreSQL port (default: `5432`)
WRDS + output defaults:

- `WRDS_ID`: WRDS username (required for WRDS helpers unless passed directly)
- `WRDS_USER`: accepted as a synonym for `WRDS_ID` for compatibility with Tidy Finance-style setups
- `WRDS_PASSWORD`: if present and no WRDS `.pgpass` entry exists, db2pq can offer to save it to `.pgpass`
- `DATA_DIR`: base directory where Parquet files are written
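A minimal sketch of the fallback order described above (illustrative only; `resolve_pg_defaults` is a hypothetical name, not part of db2pq's API):

```python
import getpass
import os


def resolve_pg_defaults():
    """Illustrate the documented fallback order for connection defaults."""
    user = os.environ.get("PGUSER") or getpass.getuser()
    return {
        "user": user,
        "host": os.environ.get("PGHOST", "localhost"),
        "database": os.environ.get("PGDATABASE", user),  # defaults to PGUSER
        "port": int(os.environ.get("PGPORT", "5432")),
    }
```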
Example shell setup:

```shell
export WRDS_ID="your_wrds_id"
export DATA_DIR="$HOME/pq_data"
```

If `WRDS_ID` is not set, WRDS helpers such as `wrds_update_pq()` and
`wrds_pg_to_pq()` will prompt for it on first use and suggest adding it to a
local `.env` file in the calling project. If your WRDS PostgreSQL password is
not yet stored in `~/.pgpass` (or `PGPASSFILE`), db2pq will prompt for it
securely and save it for future connections. For compatibility with the Tidy
Finance Python setup, db2pq also recognizes `WRDS_USER` and can offer to
copy `WRDS_PASSWORD` into `.pgpass`.
`wrds_update_pq(..., use_sas=True)` uses SSH to execute SAS remotely. Configure
an SSH key for your WRDS account first:

```shell
ssh-keygen -t ed25519 -C "your_wrds_id@wrds"
cat ~/.ssh/id_ed25519.pub | \
  ssh your_wrds_id@wrds-cloud-sshkey.wharton.upenn.edu \
  "mkdir -p ~/.ssh && chmod 700 ~/.ssh && \
   cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"
```

Use this path when WRDS is the source of truth and the Parquet repository is the destination.
Most users should start with `wrds_update_pq()`, which updates only when the
WRDS source appears newer than the existing Parquet file:

```python
from db2pq import wrds_update_pq

wrds_update_pq("dsi", "crsp")
```

Use `wrds_pg_to_pq()` when you want a direct export without the update check:
```python
from db2pq import wrds_pg_to_pq

wrds_pg_to_pq(
    table_name="dsi",
    schema="crsp",
    wrds_id="your_wrds_id",  # or set WRDS_ID in the environment
)
```

Use `wrds_sql_to_pq()` when you want to run a custom SQL query against WRDS
and save the result into the standard Parquet repository layout:
```python
from db2pq import wrds_sql_to_pq

wrds_sql_to_pq(
    sql="""
        SELECT gvkey, datadate, fyear, at
        FROM comp.funda
        WHERE indfmt = 'INDL'
          AND datafmt = 'STD'
          AND consol = 'C'
          AND popsrc = 'D'
          AND fyear BETWEEN 1991 AND 2008
    """,
    table_name="funda",
    schema="comp",
    alt_table_name="funda_filtered",
)
```

Common options on this path:

- `where` to filter rows before export
- `keep`/`drop` to select columns
- `alt_table_name` to rename the Parquet file
- `archive=True` to archive replaced files
- `use_sas=True` when you want SAS-based metadata for the update check
If `WRDS_ID` is missing, db2pq will ask for it and suggest adding
`WRDS_ID=...` to your project's `.env` file. If no matching WRDS password is
found in `.pgpass`, db2pq will prompt for your WRDS PostgreSQL password and
store it for next time.
Use this path when you want to materialize WRDS tables into a local PostgreSQL database first.
`wrds_update_pg()` is the main helper here:

```python
from db2pq import wrds_update_pg

wrds_update_pg(
    table_name="dsi",
    schema="crsp",
    dbname="research",
)
```

This path is useful when you want to:

- keep a local PostgreSQL mirror of selected WRDS tables
- build indexes or comments locally with `process_sql()` and `set_table_comment()`
- decouple downstream Parquet generation from live WRDS access

Related helpers:

- `wrds_get_tables(schema, wrds_id=None)` to list WRDS tables
- `process_sql(sql, ...)` to run SQL against the destination PostgreSQL database
- `set_table_comment(...)` to manage destination table comments
Use this path when the source table is already in your local PostgreSQL
database, including tables created by `wrds_update_pg()`.
`pg_update_pq()` mirrors the behavior of `wrds_update_pq()` but reads from the
local PostgreSQL source:

```python
from db2pq import pg_update_pq

pg_update_pq(
    table_name="dsi",
    schema="crsp",
    database="research",
)
```

This path relies on the local PostgreSQL table comment for change detection.
If the table has no parseable last-modified comment, `pg_update_pq()` will not
export by default and will tell you to use `force=True`.
Use `db_to_pq()` when you want a direct local PostgreSQL-to-Parquet export
without the update check:

```python
from db2pq import db_to_pq

pq_file = db_to_pq(
    table_name="my_table",
    schema="public",
    host="localhost",
    database="mydb",
)
print(pq_file)
```

These examples use the default DuckDB-backed path. `engine="adbc"` is available,
but ADBC-based paths should currently be treated as experimental.
Use `db_schema_to_pq()` when you want to export an entire local PostgreSQL
schema:

```python
from db2pq import db_schema_to_pq

files = db_schema_to_pq(schema="public")
print(files)
```

`ibis_to_pq()` remains available when your starting point is an Ibis table
expression rather than a table name:

```python
from db2pq import ibis_to_pq

expr = con.table("my_table").filter(lambda t: t.id > 100)
ibis_to_pq(expr, "my_table.parquet", compression="zstd")
```

Files are organized as:
```
<DATA_DIR>/<schema>/<table>.parquet
```

For example:

```
/data/crsp/dsi.parquet
```

When `archive=True`, replaced files are moved under:

```
<DATA_DIR>/<schema>/<archive_dir>/<table>_<timestamp>.parquet
```
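The layout above can be expressed as a small path helper (an illustrative sketch; `pq_path` is a hypothetical name, not part of db2pq's API):

```python
from pathlib import PurePosixPath


def pq_path(data_dir, schema, table):
    """Build the documented <DATA_DIR>/<schema>/<table>.parquet location."""
    return PurePosixPath(data_dir) / schema / f"{table}.parquet"


print(pq_path("/data", "crsp", "dsi"))  # /data/crsp/dsi.parquet
```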
At a high level, the core PostgreSQL-to-Parquet flow has three stages:
1. Query planning: db2pq inspects PostgreSQL metadata, applies `keep`/`drop`,
   normalizes user-supplied `col_types`, handles timestamp conversion rules,
   and builds a SQL `SELECT`.
2. Query execution: the planned query is executed through either:
   - `engine="duckdb"`: DuckDB reads PostgreSQL and produces Arrow output
   - `engine="adbc"`: the PostgreSQL ADBC driver streams Arrow record batches directly
3. Parquet writing: PyArrow writes the resulting Arrow batches/tables to
   Parquet, normalizing timestamps, repairing eligible decimal columns on the
   ADBC path, and buffering row groups with both row-count and byte-size limits.
This means the main export helpers now share the same SQL-planning logic even when they use different execution engines.
`"duckdb"` remains the default engine. You can override it per call:

```python
db_to_pq("dsi", "crsp", engine="duckdb")
```

or, if you specifically want the ADBC path for a session, set a process-wide default:

```python
from db2pq import set_default_engine

set_default_engine("adbc")
```

You can inspect the current setting with `get_default_engine()`, and the
environment variable `DB2PQ_ENGINE` provides the same kind of session-level
default when you prefer configuration outside Python.
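One plausible way to think about the resulting precedence (explicit argument, then the process-wide default, then `DB2PQ_ENGINE`, then `"duckdb"`) is sketched below; this is illustrative, and `resolve_engine` is a hypothetical name rather than db2pq's implementation:

```python
import os

_default_engine = None  # what set_default_engine() would record


def resolve_engine(engine=None):
    """Pick the first engine set: argument, process default, env var."""
    for candidate in (engine, _default_engine, os.environ.get("DB2PQ_ENGINE")):
        if candidate:
            if candidate not in ("duckdb", "adbc"):
                raise ValueError(f"unknown engine: {candidate}")
            return candidate
    return "duckdb"  # stable default
```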
In general, DuckDB-backed paths are the stable default. ADBC-backed paths are
still experimental and may be faster or slower depending on the workload and
driver behavior. If you need the most reliable path, prefer `engine="duckdb"`.
From db2pq:

- `db_to_pq(table_name, schema, ...)`
- `pg_update_pq(table_name, schema, ...)`
- `ibis_to_pq(table, out_file, engine=None, **writer_kwargs)`
- `wrds_pg_to_pq(table_name, schema, ...)`
- `db_schema_to_pq(schema, ...)`
- `wrds_update_pq(table_name, schema, ...)`
- `wrds_update_schema(schema, ...)`
- `pq_list_files(schema, data_dir=None, archive=False, archive_dir=None)`
- `pq_last_modified(table_name=None, schema=None, data_dir=None, file_name=None, archive=False, archive_dir="archive")`
- `pq_archive(table_name=None, schema=None, data_dir=None, file_name=None, archive_dir=None)`
- `pq_restore(file_basename, schema, data_dir=None, archive=True, archive_dir=None)`
- `pq_remove(table_name=None, schema=None, data_dir=None, file_name=None, archive=False, archive_dir="archive")`
- `db_schema_tables(schema, ...)`
- `wrds_get_tables(schema, wrds_id=None)`
- `wrds_update_pg(table_name, schema, ...)`
- `pq_to_pg(table_name, schema, ...)`
- `pq_update_pg(table_name, schema, ...)`
- `process_sql(sql, ...)`
- `set_table_comment(schema=..., table_name=..., comment=..., ...)`
- `get_wrds_username(wrds_id=None)`
- `get_wrds_conninfo(username=None)`
- `find_pgpass_entry(conninfo, **kwargs)`
- `has_pgpass_password(conninfo, **kwargs)`
- `save_password(conninfo, password=None, **kwargs)`
- `ensure_wrds_credentials(wrds_id=None, interactive=True)`
- `set_default_engine(engine)`
- `get_default_engine()`
- `close_adbc_cached()`
The main workflow entry points are:
- WRDS to Parquet: `wrds_update_pq()`, `wrds_pg_to_pq()`
- WRDS to PostgreSQL: `wrds_update_pg()`
- Local PostgreSQL to Parquet: `pg_update_pq()`, `db_to_pq()`, `db_schema_to_pq()`
- Parquet to PostgreSQL: `pq_update_pg()`, `pq_to_pg()`
`wrds_update_pq()` supports SQL-style filtering via `where`, for example:

```python
wrds_update_pq("funda", "comp", where="indfmt = 'INDL' AND datafmt = 'STD'")
```
`pg_update_pq()` provides the same update-to-Parquet pattern for a local
PostgreSQL source table, using the local table comment for change detection.
`pq_update_pg()` provides the reverse update path from a Parquet file in the
local repository into PostgreSQL, using Parquet `last_modified` metadata for
change detection.
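Both update paths reduce to the same timestamp comparison; a minimal sketch of that rule (illustrative only; `needs_update` is a hypothetical name, not db2pq's implementation):

```python
from datetime import datetime, timezone


def needs_update(source_modified, pq_modified):
    """Export when no destination copy exists yet, or the source is newer."""
    if pq_modified is None:
        return True
    return source_modified > pq_modified


jan = datetime(2024, 1, 1, tzinfo=timezone.utc)
jun = datetime(2024, 6, 1, tzinfo=timezone.utc)
```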
- WRDS PostgreSQL access uses host `wrds-pgdata.wharton.upenn.edu` and port `9737`.
- `batched=True` (default) lowers memory usage for large tables.
- `engine="adbc"` streams Arrow record batches directly from PostgreSQL into Parquet and may reduce RAM use versus the default DuckDB path.
- `numeric_mode=None` (the default) keeps engine-specific "least interference" behavior: DuckDB preserves PostgreSQL `NUMERIC` columns as Arrow decimals, while ADBC defaults to text-backed numerics.
- `numeric_mode="text"` casts PostgreSQL `NUMERIC` columns to `TEXT` on both engines.
- `numeric_mode="float64"` casts PostgreSQL `NUMERIC` columns to `DOUBLE PRECISION` on both engines.
- `numeric_mode="decimal"` preserves native decimal output on DuckDB. On the ADBC path it transports numerics as `TEXT` and converts eligible columns back to Arrow decimals using PostgreSQL precision/scale metadata. Columns without usable metadata remain strings.
- `col_types` still takes precedence over the mode.
- `col_types` can be used to cast selected columns before writing Parquet.
- `keep`/`drop` accept regex pattern(s) in both `wrds_update_pq()` and `wrds_update_pg()`. If both are supplied, `drop` is applied before `keep`.
- `tz` defaults to `"UTC"` in both update paths: `wrds_update_pq()` uses it to interpret source naive timestamps before writing timezone-aware UTC Parquet timestamps; `wrds_update_pg()` converts source `timestamp without time zone` columns using `AT TIME ZONE '<tz>'`.
- `process_sql()` executes SQL against the PostgreSQL destination resolved from the same `PGUSER`/`PGHOST`/`PGDATABASE`/`PGPORT` defaults used by `wrds_update_pg()`.
- `set_table_comment()` can be used the same way for destination PostgreSQL table comments without passing an engine argument.
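The drop-before-keep ordering noted above can be sketched as a simple regex filter over column names (illustrative; `select_columns` is a hypothetical name, and the real `keep`/`drop` options may also accept lists of patterns):

```python
import re


def select_columns(columns, keep=None, drop=None):
    """Apply drop patterns first, then restrict to keep matches."""
    selected = list(columns)
    if drop:
        selected = [c for c in selected if not re.search(drop, c)]
    if keep:
        selected = [c for c in selected if re.search(keep, c)]
    return selected


cols = ["gvkey", "datadate", "fyear", "at", "lt"]
print(select_columns(cols, keep=r"^(gvkey|fyear)$"))  # ['gvkey', 'fyear']
```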
Run an editable install in this repository:

```shell
pip install -e .
```

With the optional SAS dependency:

```shell
pip install -e ".[sas]"
```

- Docs index: `docs/README.md`
- Contributor guide: `CONTRIBUTING.md`
- Release process: `RELEASING.md`
- Changelog: `CHANGELOG.md`
MIT License. See LICENSE.