## Summary

When `hf_hub_download` takes the xet path and the in-Python `token_refresher` callback fails terminally, the awaiting Python future is never woken. The per-blob `FileLock` is never released, sibling threads pile up behind it, and the process sits in `futex_wait` indefinitely with no error to stderr and no Python-level timeout to escape. We've observed pods stuck for 13+ hours.

In our case the trigger is a `RecursionError` raised inside the callback (Bug A), but the deadlock (Bug B) is the dangerous part: it converts any fatal token-refresh failure into a permanent stall.
## Versions

`huggingface_hub` 0.34.4 · `hf-xet` 1.3.1 · `datasets` 3.0.1 · Python 3.12.13 · linux/amd64
## Bug A — `RecursionError` inside `token_refresher`

`hf-xet` invokes the Python callback on a Rust tokio worker thread that isn't registered in `threading._active`. Anything in the callback that reaches `threading.current_thread()` synthesizes a `_DummyThread`; on Python 3.12 the interaction with the logging machinery recurses to the limit:
```
TokenRefreshFailure: PyErr { type: <class 'RecursionError'>, value: ('maximum recursion depth exceeded'),
  File ".../huggingface_hub/file_download.py", line 597, in token_refresher
  File ".../huggingface_hub/utils/_xet.py", line 116, in refresh_xet_connection_info
  File ".../huggingface_hub/utils/_xet.py", line 186, in _fetch_xet_connection_info_with_url
  File ".../requests/sessions.py", line 602, in get
  File ".../urllib3/connectionpool.py", line 1049, in _new_conn
  File "/usr/local/lib/python3.12/logging/__init__.py", line 347, in __init__
    self.threadName = threading.current_thread().name
  File "/usr/local/lib/python3.12/threading.py", line 1495, in current_thread
    return _Du... [recurses to limit]
```
PyO3 catches it and surfaces `TokenRefreshFailure`. The Rust retry wrapper retries 5×, deterministically hitting the same recursion every time.
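The unregistered-thread condition is easy to reproduce in pure Python by popping the current thread out of `threading._active` (an internal attribute, touched here for illustration only). The same sketch also shows a possible mitigation we're considering: `logging.logThreads` is a documented module-level flag that stops `LogRecord.__init__` from calling `threading.current_thread()` at all, which sidesteps the recursion. This is a rough sketch of the mechanism, not hf-xet's actual code path:

```python
import logging
import threading

# Simulate the unregistered-thread condition by removing the current thread
# from threading._active (internal API -- illustration only).
ident = threading.get_ident()
registered = threading._active.pop(ident)
try:
    print(type(threading.current_thread()).__name__)  # _DummyThread is synthesized
finally:
    # current_thread() registered the dummy under our ident; restore the real entry.
    threading._active[ident] = registered

# Mitigation sketch: with logThreads off, LogRecord.__init__ skips the
# threading.current_thread() call entirely, so the recursion never starts.
logging.logThreads = False
rec = logging.LogRecord("demo", logging.INFO, __file__, 1, "msg", None, None)
print(rec.threadName)  # None -- current_thread() was never reached
logging.logThreads = True  # restore the default
```

This only suppresses the trigger we happened to hit; any other `current_thread()` call inside the callback would still synthesize a `_DummyThread` on the tokio worker.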
## Bug B — terminal failure leaves the Python future unsignaled

After `hf-xet` exhausts its retries:
```console
$ for t in /proc/$PID/task/*; do echo $(cat $t/wchan) - $(cat $t/comm); done | sort | uniq -c
     24 futex_wait_queue - <worker>   # blocked on dataset prep
      8 futex_wait_queue - hf-xet-*
$ cat /proc/$PID/net/tcp | awk '{print $4}' | sort | uniq -c
     14 08                            # CLOSE_WAIT to cas-server.xethub.hf.co — peer closed long ago
$ find ~/.cache/huggingface -name '*.lock'
hub/.locks/datasets--<repo>/<hash>.lock × N
datasets/<repo>/.../*_builder.lock
datasets/<repo>/.../*.incomplete_info.lock
```
13+ hours after the failure: zero CPU activity, no error written, no resolution. The `hf_hub_download` thread holding the `FileLock` is wedged on a future whose wakeup never comes.
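For contrast, here is the contract Bug B violates, sketched in pure Python (the names are illustrative, not hf-xet's internals): whatever goes wrong in the worker, terminal failures included, the future must be completed so waiters are woken with an exception instead of parking forever.

```python
import concurrent.futures
import threading

fut = concurrent.futures.Future()

def refresh_worker() -> None:
    # Stand-in for the native-side task driving the download.
    try:
        raise RecursionError("terminal token-refresh failure")
    except BaseException as exc:
        # The step that appears to be skipped on the terminal path: always
        # complete the future, even for "unrecoverable" errors.
        fut.set_exception(exc)

threading.Thread(target=refresh_worker).start()

try:
    fut.result(timeout=5)
except RecursionError as exc:
    print(f"waiter woken with error: {exc}")
```

With the `set_exception` call in place the waiter raises immediately and any enclosing `finally` can release the `FileLock`; without it, `fut.result()` blocks until the timeout, which is the Python analogue of what we observe.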
## Reproduction (Bug B in isolation, deterministic)
```python
import os, concurrent.futures

os.environ.pop("HF_HUB_DISABLE_XET", None)  # ensure the xet path is taken

from huggingface_hub import hf_hub_download
import huggingface_hub.file_download as fd

# Simulate any terminal token-refresh failure
fd.token_refresher = lambda *a, **kw: (_ for _ in ()).throw(RecursionError("simulated"))

REPO = "<any xet-backed repo>"
FILES = ["f1", "f2", ..., "f8"]

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as ex:
    list(ex.map(lambda p: hf_hub_download(REPO, p, repo_type="dataset"), FILES))

# Expected: an exception per worker, locks released.
# Observed: workers block in futex_wait forever; .lock files remain on disk.
```
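Until this is fixed, the interim workaround we're running with is to opt out of the xet path entirely via the `HF_HUB_DISABLE_XET` environment variable, falling back to the plain HTTP download path (assuming that fallback is acceptable for your workload):

```python
import os

# Must run before huggingface_hub is imported anywhere in the process,
# so the client never selects the xet transfer path.
os.environ["HF_HUB_DISABLE_XET"] = "1"

# from huggingface_hub import hf_hub_download  # import only after the opt-out
```

Setting the variable in the pod spec rather than in code is equivalent and avoids import-order pitfalls.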
Please let me know if you need more information.