GH-41365: [Python] Fix S3 URI with whitespace silently falling back to LocalFileSystem#49372
Open
ernestprovo23 wants to merge 1 commit intoapache:mainfrom
Open
GH-41365: [Python] Fix S3 URI with whitespace silently falling back to LocalFileSystem#49372ernestprovo23 wants to merge 1 commit intoapache:mainfrom
ernestprovo23 wants to merge 1 commit intoapache:mainfrom
Conversation
…g back to LocalFileSystem Port the C++ `IsLikelyUri()` heuristic to Python and use it in `_resolve_filesystem_and_path()` to distinguish between local paths and malformed URIs before suppressing `ValueError`. Previously, URIs like `s3://bucket/path with space/file` would silently fall back to `LocalFileSystem` because the "Cannot parse URI" error was unconditionally swallowed. Now, paths that look like URIs (valid scheme prefix) propagate the parsing error, while local paths continue to fall back to `LocalFileSystem` as expected.
|
|
Member
|
Hey @ernestprovo23, thanks for working on this and opening the PR. I'll kick off the CI run and hopefully I or someone else wil find time early next week to do a review. |
rok
reviewed
Feb 22, 2026
| ) | ||
|
|
||
|
|
||
| def _is_likely_uri(path): |
Member
There was a problem hiding this comment.
You mention on the issue:
Working on a fix that ports the C++ IsLikelyUri() heuristic to Python and uses it in _resolve_filesystem_and_path()
Could you check if it's possible to expose IsLikelyUri through Python so ultimately the logic is only implemented once and is always doing the same?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale
Closes #41365
When
_resolve_filesystem_and_path()receives an S3 URI containing spaces (e.g.s3://bucket/path with space/file), the C++ URI parser raisesValueError("Cannot parse URI"). The Python code previously caught all"Cannot parse URI"errors and silently fell back toLocalFileSystem, producing a confusing downstream error instead of the real parsing failure.What changed
Added
_is_likely_uri()helper — a Python port of the C++IsLikelyUri()heuristic fromcpp/src/arrow/filesystem/path_util.cc. It checks whether a string has a valid URI scheme prefix (2-36 chars, RFC 3986 compliant).Modified error handling in
_resolve_filesystem_and_path()— theexcept ValueErrorblock now distinguishes:"empty scheme"→ still falls back toLocalFileSystem(no scheme at all)"Cannot parse URI"+_is_likely_uri()returnsFalse→ still falls back (not a URI, just a local path)"Cannot parse URI"+_is_likely_uri()returnsTrue→ re-raises the error (it IS a URI, just malformed)Changed
raise etoraise— preserves the original traceback (minor style improvement).Tests added
test_is_likely_uri()— unit tests for the heuristic covering valid schemes (s3://,gs://,hdfs://,abfss://,grpc+https://), local paths, Windows drives, and invalid prefixes.test_resolve_filesystem_and_path_uri_with_spaces()— verifies S3/GCS/ABFSS URIs with spaces now raiseValueError.test_resolve_filesystem_and_path_local_with_spaces()— verifies local paths with spaces still returnLocalFileSystem.Test plan
_resolve_filesystem_and_path("s3://bucket/path with space")raisesValueError_resolve_filesystem_and_path("/local/path with space")returnsLocalFileSystemtest_fs.pytests pass (requires full C++ build environment)