Add functions for reading parquet/csv/json directly into polars df#12

Draft
natemcintosh wants to merge 1 commit into main from nam-add-blob-readers

@natemcintosh

Had to add polars as a top level dependency.

These three functions each take a container client and the name of the blob to read. As you can see, they are basically identical except for the polars function they call. I thought about letting the user pass in the desired polars read function as an argument, but decided that separate functions are cleaner, since each one can accept the kwargs specific to its reader.

Had to add polars as a top-level dependency. Hope that's ok.
@natemcintosh natemcintosh requested a review from Copilot May 1, 2025 20:41

Copilot AI left a comment

Pull Request Overview

This PR adds three new functions to read parquet, CSV, and JSON files from a blob container into a polars DataFrame, while adding polars as a top level dependency.

  • Added polars dependency to pyproject.toml
  • Introduced read_blob_parquet, read_blob_csv, and read_blob_json in azuretools/blob.py
  • Updated import statements for blob-related classes in azuretools/blob.py

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pyproject.toml | Added polars dependency to support direct DataFrame reads |
| azuretools/blob.py | Implemented three functions for reading different file types into polars |

@github-actions

github-actions bot commented May 1, 2025

Thank you for your contribution @natemcintosh 🚀! Your github-pages is ready for download 👉 here 👈!
(The artifact expires on 2025-05-08T20:41:28Z. You can re-generate it by re-running the workflow here.)

@natemcintosh natemcintosh requested a review from dylanhmorris May 1, 2025 20:42
@dylanhmorris
Collaborator

dylanhmorris commented May 16, 2025

Apologies for the delay in reviewing this @natemcintosh. Question: what is the motivation for taking the approach here versus the one outlined in the "Example with Azure" in the polars docs? https://docs.pola.rs/user-guide/io/cloud-storage/#using-a-custom-credential_provider-function

@natemcintosh
Author

Good question.

  1. Mostly just that I have not tried it out 😆. I should.
  2. I often work with blob ContainerClients for other purposes (reading/writing non table data), and so I often "have one knocking about", making this use case fairly simple.

I would say that if you don't often have an instantiated ContainerClient already in your code, it would be worth trying out the suggested credential provider method. I guess for that, we would just need a credential_provider() function that we know works.

@natemcintosh
Author

I've tried both methods the polars docs suggest (verbatim), and both give the same error: `polars.exceptions.ComputeError: Generic MicrosoftAzure error: Account must be specified`

import polars as pl
from azure.identity import DefaultAzureCredential

# Method 1
def credential_provider():
    credential = DefaultAzureCredential(exclude_managed_identity_credential=True)
    token = credential.get_token("https://storage.azure.com/.default")

    return {"bearer_token": token.token}, token.expires_on

# Make the call
pl.scan_parquet(
    "abfs://nssp-etl/gold/2025-05-19.parquet",
    credential_provider=credential_provider
).collect_schema()
# Gives the error listed above
# Also tried with `"az://.."` instead of `"abfs:"`. Same error

# Method 2
pl.scan_parquet(
    "abfs://nssp-etl/gold/2025-05-19.parquet",
    credential_provider=pl.CredentialProviderAzure(
        credentials=DefaultAzureCredential(exclude_managed_identity_credential=True)
    ),
).collect_schema()
# Gives the error listed above

I looked for a place to specify the account, but didn't see a matching argument on either DefaultAzureCredential() or pl.CredentialProviderAzure().
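One thing that may be worth trying, sketched below: the URIs above name the container (`nssp-etl`) but not the storage account, and the underlying Azure object store also accepts the `abfss://<container>@<account>.dfs.core.windows.net/<path>` form, which carries the account in the URI itself. This is untested against this storage account, and nothing in this thread confirms it resolves the error. The helper names and the `my-account` value are hypothetical.

```python
from typing import Any, Callable


def abfss_uri(account: str, container: str, path: str) -> str:
    """Hypothetical helper: build an abfss:// URI that names the storage account."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def scan_parquet_schema(
    account: str,
    container: str,
    path: str,
    credential_provider: Callable[[], Any],
) -> Any:
    """Collect the schema of a parquet blob, with the account spelled out in the URI.

    Assumption (untested here): naming the account in the URI is enough to get
    past "Account must be specified". An alternative to try is keeping the
    short URI and passing storage_options={"account_name": account}.
    """
    # Imported here so abfss_uri can be used without polars installed.
    import polars as pl

    return pl.scan_parquet(
        abfss_uri(account, container, path),
        credential_provider=credential_provider,
    ).collect_schema()
```

For the example above, that would be `scan_parquet_schema("my-account", "nssp-etl", "gold/2025-05-19.parquet", credential_provider)`, with `credential_provider` being the Method 1 function and `my-account` replaced by the real storage account name.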
