Add functions for reading parquet/csv/json directly into polars df #12
natemcintosh wants to merge 1 commit into main
had to add polars as a top level dependency. Hope that's ok
Pull Request Overview
This PR adds three new functions to read parquet, CSV, and JSON files from a blob container into a polars DataFrame, while adding polars as a top level dependency.
- Added polars dependency to pyproject.toml
- Introduced read_blob_parquet, read_blob_csv, and read_blob_json in azuretools/blob.py
- Updated import statements for blob-related classes in azuretools/blob.py
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pyproject.toml | Added polars dependency to support direct DataFrame reads |
| azuretools/blob.py | Implemented three functions for reading different file types into polars |
Thank you for your contribution @natemcintosh 🚀! Your github-pages is ready for download 👉 here 👈!
Apologies for the delay in reviewing this @natemcintosh. Question: what is the motivation for taking the approach here versus the one outlined in the "Example with Azure" in the polars docs? https://docs.pola.rs/user-guide/io/cloud-storage/#using-a-custom-credential_provider-function
Good question.
I would say that if you don't often have an instantiated
I've tried both methods polars suggests (verbatim) and both give the error

```python
import polars as pl
from azure.identity import DefaultAzureCredential


# Method 1
def credential_provider():
    credential = DefaultAzureCredential(exclude_managed_identity_credential=True)
    token = credential.get_token("https://storage.azure.com/.default")
    return {"bearer_token": token.token}, token.expires_on


# Make the call
pl.scan_parquet(
    "abfs://nssp-etl/gold/2025-05-19.parquet",
    credential_provider=credential_provider,
).collect_schema()
# Gives the error listed above
# Also tried with `"az://.."` instead of `"abfs:"`. Same error

# Method 2
pl.scan_parquet(
    "abfs://nssp-etl/gold/2025-05-19.parquet",
    credential_provider=pl.CredentialProviderAzure(
        credentials=DefaultAzureCredential(exclude_managed_identity_credential=True)
    ),
).collect_schema()
# Gives the error listed above
```

I tried seeing if I could find a place to specify the Account, but didn't see it as an argument for the
Had to add polars as a top level dependency.
These three functions take in a container client and the name of the blob to read. As you can see, they are basically identical except for the polars function. I thought about allowing the user to pass in the desired polars read function as an argument, but decided this might be cleaner, as it allows the user to specify kwargs specific to each reader.