Expose XetSession APIs to Python#792

Open
seanses wants to merge 12 commits into main from di/use-hf-xet-in-hf-xet

seanses commented Apr 9, 2026

Summary

Replaces the old upload_files / download_files / hash_files Python functions with a new object-oriented API that exposes XetSession and its child objects directly as PyO3 classes. This gives Python callers full control over session lifecycle, connection pooling, and progress reporting.

The previous module-level functions are kept under hf_xet/src/legacy/ and remain importable from the top-level module (for example, from hf_xet import upload_files). The deprecated ones now emit DeprecationWarning; hash_files is retained without deprecation.
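
As a rough illustration of the deprecation behaviour described above, the sketch below shows how a legacy entry point can keep working while warning its caller. The legacy functions in this PR live in Rust under hf_xet/src/legacy/; the `deprecated` decorator and the stand-in `upload_files` body here are hypothetical and only demonstrate what `stacklevel=2` does.

```python
import functools
import warnings


def deprecated(replacement):
    """Hypothetical decorator: warn, then call the wrapped legacy function."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller's line,
            # not to this wrapper.
            warnings.warn(
                f"{func.__name__} is deprecated; use {replacement} instead",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorate


@deprecated("XetSession.new_upload_commit()")
def upload_files(paths):
    # Stand-in body for illustration only; it does not upload anything.
    return list(paths)
```

Callers who want to find remaining uses of the legacy functions can turn the warning into an error, for example with `warnings.simplefilter("error", DeprecationWarning)` in a test run.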

New Python API

import hf_xet

session = hf_xet.XetSession()

# Upload — multiple files, bytes, and streaming within one commit
with session.new_upload_commit().with_token_refresh_url(url, headers).build() as commit:
    commit.upload_file("/path/to/model.bin")
    commit.upload_file("/path/to/tokenizer.json")
    commit.upload_bytes(b"...", name="config.json")

    stream = commit.upload_stream(name="big.bin")
    for chunk in produce_chunks():
        stream.write(chunk)
    stream.finish()  # must be called before the with-block exits

# Progress callback (fires every interval_ms, default 100 ms)
def on_progress(group, items):          # items is dict[UniqueID, ItemProgressReport]
    bar.n = group.total_bytes_completed
    bar.refresh()

with (session.new_upload_commit()
      .with_token_refresh_url(url, headers)
      .with_progress_callback(on_progress, interval_ms=100)
      .build()) as commit:
    commit.upload_file("/path/to/model.bin")
    commit.upload_file("/path/to/tokenizer.json")

# File download — multiple files within one group (downloads run concurrently)
file_info_a = hf_xet.XetFileInfo(hash_a, size_a)
file_info_b = hf_xet.XetFileInfo(hash_b, size_b)
with session.new_file_download_group().with_token_refresh_url(url, headers).build() as group:
    group.download_file(file_info_a, dest_path_a)
    group.download_file(file_info_b, dest_path_b)

# Streaming download — start/end are both optional (open-ended ranges supported)
group = session.new_download_stream_group().with_token_refresh_url(url, headers).build()
for chunk in group.download_stream(file_info):                    # whole file
    f.write(chunk)
for chunk in group.download_stream(file_info, start=1024):        # 1024 .. EOF
    f.write(chunk)
for chunk in group.download_stream(file_info, start=0, end=4096): # 0 .. 4096
    f.write(chunk)
for offset, chunk in group.download_unordered_stream(file_info):
    buf[offset:offset+len(chunk)] = chunk
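
The last loop above places each chunk by its offset, which is what makes arrival order irrelevant. A minimal sketch of that reassembly follows. It assumes the unordered stream yields `(offset, chunk)` pairs as in the example and that offsets are relative to the start of the buffer; `fake_unordered_stream` is a hypothetical stand-in for `group.download_unordered_stream(file_info)`, not part of hf_xet.

```python
import random


def fake_unordered_stream(data, chunk_size):
    """Hypothetical stand-in: yields (offset, chunk) pairs in shuffled order."""
    pieces = [(i, data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]
    random.shuffle(pieces)
    yield from pieces


def reassemble(stream, total_size):
    """Place each chunk at its offset in a preallocated buffer."""
    buf = bytearray(total_size)
    received = 0
    for offset, chunk in stream:
        # Same-length slice assignment overwrites bytes in place.
        buf[offset:offset + len(chunk)] = chunk
        received += len(chunk)
    if received != total_size:
        raise ValueError(f"expected {total_size} bytes, got {received}")
    return bytes(buf)
```

Preallocating the buffer requires knowing the size up front; in the example above that would be the size passed to `XetFileInfo`.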

Files Changed

New files

  • hf_xet/src/py_xet_session.rs — XetSession PyO3 class
  • hf_xet/src/py_upload_commit.rs — XetUploadCommitBuilder, XetUploadCommit, Sha256Policy, report types
  • hf_xet/src/py_file_upload_handle.rs — XetFileUpload
  • hf_xet/src/py_stream_upload_handle.rs — XetStreamUpload
  • hf_xet/src/py_file_download_group.rs — XetFileDownloadGroupBuilder, XetFileDownloadGroup
  • hf_xet/src/py_file_download_handle.rs — XetFileDownload
  • hf_xet/src/py_download_stream_group.rs — XetDownloadStreamGroupBuilder, XetDownloadStreamGroup
  • hf_xet/src/py_download_stream_handle.rs — XetDownloadStream, XetUnorderedDownloadStream
  • hf_xet/src/headers.rs — build_headers_with_user_agent helper
  • hf_xet/src/legacy/mod.rs — re-exports all legacy symbols
  • hf_xet/src/legacy/types.rs — PyXetDownloadInfo, PyXetUploadInfo, PyPointerFile
  • hf_xet/src/legacy/functions.rs — deprecated upload_bytes, upload_files, download_files, force_sigint_shutdown; hash_files retained without deprecation
  • hf_xet/src/legacy/progress_update.rs — PyItemProgressUpdate, PyTotalProgressUpdate, WrappedProgressUpdater
  • hf_xet/src/legacy/runtime.rs — async runtime + SIGINT handler (used by legacy functions)
  • hf_xet/src/legacy/token_refresh.rs — Python callback token refresher (used by legacy functions)
  • hf_xet/tests/conftest.py — shared fixtures and upload helpers
  • hf_xet/tests/test_upload_commit.py — upload tests (file, bytes, stream, Sha256Policy, progress, abort)
  • hf_xet/tests/test_file_download.py — file download tests (handles, round-trips, progress, cancel)
  • hf_xet/tests/test_stream_download.py — ordered and unordered streaming download tests with range variants
  • hf_xet/tests/test_progress.py — progress callback argument types and field verification
  • hf_xet/tests/test_session.py — XetSession lifecycle and builder creation tests

Modified files

  • hf_xet/src/lib.rs — module declarations; blocking_call_with_signal_check utility; legacy module registered at top level for backward compatibility
  • hf_xet/src/logging.rs — calls xet_pkg::init_logging() instead of xet_runtime directly
  • hf_xet/Cargo.toml — added xet-runtime, xet-client deps (for legacy module); feature flags route through xet-pkg
  • xet_pkg/src/xet_session/file_download_group.rs — exposes XetDownloadGroupReport as a Python class (pyclass(get_all), __repr__)
  • xet_data/src/processing/xet_file.rs — #[new] Python constructor for XetFileInfo
  • xet_pkg/Cargo.toml — added no-default-cache, tokio-console, elevated_information_level features
  • xet_pkg/src/lib.rs — added init_logging() wrapper
  • xet_runtime/src/core/runtime.rs — fork-safe Drop: detect child process via stored PID, discard runtime instead of blocking shutdown
  • .github/workflows/ci.yml — added Python integration test step (maturin + pytest) to Linux, Windows, macOS jobs
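
The fork-safe Drop change listed for xet_runtime/src/core/runtime.rs rests on a simple check: compare the current PID with the PID recorded at creation. The Python sketch below mirrors only that check. The real change is in Rust; the `Runtime` class, its `close()` method, and the returned strings are hypothetical names chosen for illustration.

```python
import os


class Runtime:
    """Hypothetical Python analogue of a runtime that owns worker threads."""

    def __init__(self):
        # Remember which process created the runtime.
        self._owner_pid = os.getpid()
        self.shutdown_called = False

    def _in_forked_child(self):
        # After fork(), the child sees a different PID than the one stored.
        return os.getpid() != self._owner_pid

    def close(self):
        if self._in_forked_child():
            # Worker threads are not copied by fork(), so a graceful
            # shutdown would wait on threads that do not exist here.
            # Skip it and let the OS reclaim resources at process exit.
            return "discarded"
        self.shutdown_called = True  # stands in for a blocking shutdown
        return "shutdown"
```

In the parent process `close()` takes the normal shutdown path; only an object inherited across a fork takes the discard path.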

Test Plan

Design Notes

  • Token refresh: the old API required Python to pass a token-refresh callable that Rust invoked across the GIL boundary. The new API uses .with_token_refresh_url(url, headers) — Rust refreshes autonomously via HTTP, removing GIL re-entry on the hot path. WrappedTokenRefresher is kept only in legacy/.

  • Progress callbacks: with_progress_callback(fn, interval_ms=100) spawns a background thread that delivers (GroupProgressReport, dict[UniqueID, ItemProgressReport]) to the Python callable. The same signature covers both upload and download groups, so a single XetProgressReporter class handles both.

  • GIL release and Ctrl-C: queue operations (upload_file, upload_bytes, download_file) use py.detach() and return quickly. Long-wait operations (commit(), finish()) run the blocking call on a background thread while the calling thread releases the GIL in 100 ms windows and polls py.check_signals() between them, so Ctrl-C raises KeyboardInterrupt within one interval without starving other Python threads. XetError::KeyboardInterrupt maps to PyKeyboardInterrupt. The recommended caller pattern is except KeyboardInterrupt: session.sigint_abort(); raise. This is idiomatic Python: sigint_abort() flags the runtime so the background thread exits cleanly at its next checkpoint, and the cleanup is visible in Python code rather than hidden inside a native extension.

  • Context managers and concurrency: XetUploadCommit and XetFileDownloadGroup implement __enter__/__exit__; __exit__ delegates to commit() / finish() on success and abort() on exception. Multiple upload_file / download_file calls within a with block run concurrently; the block exit waits for all to complete.

  • Streaming: commit.upload_stream() returns a XetStreamUpload handle for incremental writes (.write(bytes), then .finish() before the with-block exits). download_stream and download_unordered_stream accept optional start / end byte offsets; either may be omitted independently.

  • Fork-safe runtime drop: XetRuntime records its creating PID; if drop fires in a child process after fork, the parent's Tokio threads don't exist so shutdown_timeout() would block. The runtime is discarded via mem::forget instead, letting the OS reclaim memory on exit.

  • Backward compatibility: all pre-1.x functions (upload_bytes, upload_files, hash_files, download_files, force_sigint_shutdown) and types (PyXetDownloadInfo, PyXetUploadInfo, PyPointerFile, PyItemProgressUpdate, PyTotalProgressUpdate) remain importable from the top-level hf_xet module. Deprecated functions emit DeprecationWarning at stacklevel=2. hash_files is not deprecated.
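
The "GIL release and Ctrl-C" note above describes running a blocking call on a background thread while the caller wakes at a fixed interval. A pure-Python analogue of that shape is sketched below. It is not the Rust utility from hf_xet/src/lib.rs; `wait_with_polling` and its parameters are hypothetical, and the 0.1 s default only echoes the 100 ms window mentioned in the note.

```python
import threading


def wait_with_polling(func, poll_interval=0.1):
    """Run func on a worker thread; wake the caller every poll_interval
    seconds so pending signals (such as Ctrl-C) can be handled."""
    outcome = {}
    done = threading.Event()

    def worker():
        try:
            outcome["value"] = func()
        except BaseException as exc:  # re-raised in the caller below
            outcome["error"] = exc
        finally:
            done.set()

    threading.Thread(target=worker, daemon=True).start()
    # Event.wait with a timeout returns control to the interpreter at
    # regular intervals; in the main thread, a pending KeyboardInterrupt
    # can be raised between waits.
    while not done.wait(poll_interval):
        pass
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

A caller would wrap such a wait in the try/except pattern recommended above, so that the abort step stays visible in Python code.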


Note

Medium Risk
Medium risk due to a large surface-area change in Python bindings (new PyO3 classes, signal/interrupt handling, progress threads) plus a behavioral change to XetRuntime drop semantics for forked processes.

Overview
Introduces a new PyO3 object model for Python consumers built around XetSession, exposing builders/handles for uploads (XetUploadCommit, XetFileUpload, XetStreamUpload), grouped file downloads (XetFileDownloadGroup, XetFileDownload), and ordered/unordered streaming downloads with optional byte ranges.

Moves the previous top-level Python functions/types into a legacy module, keeps them importable for backward compatibility, and adds DeprecationWarning emission while centralizing HTTP header construction (including automatic User-Agent merging).

Extends the Rust crates to better support Python (pyclass report/ID types, XetFileInfo constructor), updates logging initialization via xet_pkg, adds fork-safe XetRuntime drop behavior, bumps hf_xet to 1.5.0, and adds cross-platform CI steps to build the wheel with maturin and run new pytest integration tests.

Reviewed by Cursor Bugbot for commit 5d3aa3c.

seanses force-pushed the di/use-hf-xet-in-hf-xet branch from c65cf6e to b3bb567 on April 10, 2026 15:25
seanses force-pushed the di/use-hf-xet-in-hf-xet branch from 85873ea to 1d81a43 on April 15, 2026 14:36
seanses force-pushed the di/use-hf-xet-in-hf-xet branch from 5f9b2ef to ed125ba on April 16, 2026 03:42
seanses marked this pull request as ready for review on April 16, 2026 06:29
Comment thread hf_xet/src/lib.rs
Comment thread hf_xet/src/py_upload_commit.rs
Comment thread .github/workflows/ci.yml

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit aa3e037.

Comment thread hf_xet/src/lib.rs Outdated