Skip to content

Comments

GH-41365: [Python] Fix S3 URI with whitespace silently falling back to LocalFileSystem#49372

Open
ernestprovo23 wants to merge 1 commit intoapache:mainfrom
ernestprovo23:fix-s3-uri-whitespace-resolution
Open

GH-41365: [Python] Fix S3 URI with whitespace silently falling back to LocalFileSystem#49372
ernestprovo23 wants to merge 1 commit intoapache:mainfrom
ernestprovo23:fix-s3-uri-whitespace-resolution

Conversation

@ernestprovo23
Copy link

@ernestprovo23 ernestprovo23 commented Feb 22, 2026

Rationale

Closes #41365

When _resolve_filesystem_and_path() receives an S3 URI containing spaces (e.g. s3://bucket/path with space/file), the C++ URI parser raises ValueError("Cannot parse URI"). The Python code previously caught all "Cannot parse URI" errors and silently fell back to LocalFileSystem, producing a confusing downstream error instead of the real parsing failure.

What changed

  1. Added _is_likely_uri() helper — a Python port of the C++ IsLikelyUri() heuristic from cpp/src/arrow/filesystem/path_util.cc. It checks whether a string has a valid URI scheme prefix (2-36 chars, RFC 3986 compliant).

  2. Modified error handling in _resolve_filesystem_and_path() — the except ValueError block now distinguishes:

    • "empty scheme" → still falls back to LocalFileSystem (no scheme at all)
    • "Cannot parse URI" + _is_likely_uri() returns False → still falls back (not a URI, just a local path)
    • "Cannot parse URI" + _is_likely_uri() returns Truere-raises the error (it IS a URI, just malformed)
  3. Changed raise e to raise — preserves the original traceback (minor style improvement).

Tests added

  • test_is_likely_uri() — unit tests for the heuristic covering valid schemes (s3://, gs://, hdfs://, abfss://, grpc+https://), local paths, Windows drives, and invalid prefixes.
  • test_resolve_filesystem_and_path_uri_with_spaces() — verifies S3/GCS/ABFSS URIs with spaces now raise ValueError.
  • test_resolve_filesystem_and_path_local_with_spaces() — verifies local paths with spaces still return LocalFileSystem.

Test plan

  • All new tests pass (14/14)
  • Manual verification: _resolve_filesystem_and_path("s3://bucket/path with space") raises ValueError
  • Manual verification: _resolve_filesystem_and_path("/local/path with space") returns LocalFileSystem
  • Existing test_fs.py tests pass (requires full C++ build environment)

…g back to LocalFileSystem

Port the C++ `IsLikelyUri()` heuristic to Python and use it in
`_resolve_filesystem_and_path()` to distinguish between local paths
and malformed URIs before suppressing `ValueError`.

Previously, URIs like `s3://bucket/path with space/file` would silently
fall back to `LocalFileSystem` because the "Cannot parse URI" error was
unconditionally swallowed. Now, paths that look like URIs (valid scheme
prefix) propagate the parsing error, while local paths continue to fall
back to `LocalFileSystem` as expected.
@github-actions
Copy link

⚠️ GitHub issue #41365 has been automatically assigned in GitHub to PR creator.

@rok
Copy link
Member

rok commented Feb 22, 2026

Hey @ernestprovo23, thanks for working on this and opening the PR. I'll kick off the CI run and hopefully I or someone else wil find time early next week to do a review.

)


def _is_likely_uri(path):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mention on the issue:

Working on a fix that ports the C++ IsLikelyUri() heuristic to Python and uses it in _resolve_filesystem_and_path()

Could you check if it's possible to expose IsLikelyUri through Python so ultimately the logic is only implemented once and is always doing the same?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Feb 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Python] Pyarrow fs incorrectly resolves S3 URIs with white space as a local path

2 participants