
Terminal token-refresh failure deadlocks hf_hub_download forever (xet) #795

@VascoSch92

Summary

When hf_hub_download takes the xet path and the in-Python token_refresher callback fails terminally, the awaiting Python future is never woken. The per-blob FileLock is never released, sibling threads pile up behind it, and the process sits in futex_wait indefinitely with no error to stderr and no Python-level timeout to escape. We've observed pods stuck for 13+ hours.

In our case the trigger is a RecursionError raised inside the callback (Bug A), but the deadlock (Bug B) is the dangerous part — it converts any fatal token-refresh failure into a permanent stall.

Versions

huggingface_hub 0.34.4 · hf-xet 1.3.1 · datasets 3.0.1 · Python 3.12.13 · linux/amd64

Bug A — RecursionError inside token_refresher

hf-xet invokes the Python callback on a Rust tokio worker thread that isn't registered in threading._active. Anything in the callback that reaches threading.current_thread() synthesizes a _DummyThread; on Python 3.12 the interaction with the logging machinery recurses to the limit:

TokenRefreshFailure: PyErr { type: <class 'RecursionError'>, value: ('maximum recursion depth exceeded'),
  File ".../huggingface_hub/file_download.py", line 597, in token_refresher
  File ".../huggingface_hub/utils/_xet.py", line 116, in refresh_xet_connection_info
  File ".../huggingface_hub/utils/_xet.py", line 186, in _fetch_xet_connection_info_with_url
  File ".../requests/sessions.py", line 602, in get
  File ".../urllib3/connectionpool.py", line 1049, in _new_conn
  File "/usr/local/lib/python3.12/logging/__init__.py", line 347, in __init__
    self.threadName = threading.current_thread().name
  File "/usr/local/lib/python3.12/threading.py", line 1495, in current_thread
    return _Du... [recurses to limit]

PyO3 catches it and surfaces TokenRefreshFailure. The Rust retry wrapper retries 5×, deterministically hitting the same recursion every time.
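
The first half of Bug A (a thread that Python's threading module did not create gets a synthesized _DummyThread) can be observed without hf-xet. The sketch below is only an illustration: it uses the low-level _thread module as a stand-in for the Rust tokio worker, and the helper name is made up for this example. It does not reproduce the recursion itself, which depends on the logging interaction described above.

```python
import _thread
import threading


def thread_type_seen_from_foreign_thread(timeout=5.0):
    """Return the class name threading.current_thread() reports on a thread
    that was started outside the threading module (hypothetical helper)."""
    result = {}
    done = threading.Event()

    def probe():
        # This thread is not registered in threading._active, so
        # current_thread() has to synthesize a placeholder object for it.
        result["type"] = type(threading.current_thread()).__name__
        done.set()

    # _thread.start_new_thread bypasses threading.Thread, which is the
    # closest pure-Python analogue to a thread spawned by native code.
    _thread.start_new_thread(probe, ())
    if not done.wait(timeout):
        raise TimeoutError("probe thread did not report back")
    return result["type"]


foreign_type = thread_type_seen_from_foreign_thread()
print(foreign_type)
```

On the CPython versions I'm aware of this prints _DummyThread, which matches the frame where the traceback above starts to recurse.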

Bug B — terminal failure leaves the Python future unsignaled

After hf-xet exhausts retries:

$ for t in /proc/$PID/task/*; do echo $(cat $t/wchan) - $(cat $t/comm); done | sort | uniq -c
     24 futex_wait_queue - <worker>     # blocked on dataset prep
      8 futex_wait_queue - hf-xet-*

$ cat /proc/$PID/net/tcp | awk '{print $4}' | sort | uniq -c
     14 08    # CLOSE_WAIT to cas-server.xethub.hf.co — peer closed long ago

$ find ~/.cache/huggingface -name '*.lock'
hub/.locks/datasets--<repo>/<hash>.lock     × N
datasets/<repo>/.../*_builder.lock
datasets/<repo>/.../*.incomplete_info.lock

13+ hours after the failure: zero CPU activity, no error written, no resolution. The hf_hub_download thread holding the FileLock is wedged on a future whose wakeup never comes.
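
To make the expected behaviour concrete, here is a small pure-Python model of the contract I'd expect from the bridge: once retries are exhausted, the future is settled with the error so the waiter wakes up. This is not hf-xet's code (the real bridge is in Rust). The function name, the retry count parameter, and the use of concurrent.futures.Future are all assumptions made for illustration.

```python
import concurrent.futures
import threading


def call_refresher_in_background(refresher, max_attempts=5):
    """Hypothetical bridge: run `refresher` off-thread and ALWAYS settle the future."""
    future = concurrent.futures.Future()

    def worker():
        last_error = None
        for _ in range(max_attempts):
            try:
                future.set_result(refresher())
                return
            except Exception as exc:  # RecursionError is an Exception subclass
                last_error = exc
        # Terminal failure: wake the waiter with the error instead of going silent.
        future.set_exception(last_error)

    threading.Thread(target=worker, daemon=True).start()
    return future


def failing_refresher():
    # Stand-in for a token refresh that fails the same way on every attempt.
    raise RecursionError("simulated")


future = call_refresher_in_background(failing_refresher)
try:
    future.result(timeout=5)
    outcome = "no error"
except RecursionError as exc:
    outcome = f"raised: {exc}"
print(outcome)  # raised: simulated
```

With this contract the caller sees the exception, leaves the FileLock context normally, and sibling threads can proceed or fail on their own.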

Reproduction (Bug B in isolation, deterministic)

import concurrent.futures
import os

# Make sure the xet path is not disabled before huggingface_hub is imported.
os.environ.pop("HF_HUB_DISABLE_XET", None)

from huggingface_hub import hf_hub_download
import huggingface_hub.file_download as fd


def failing_refresher(*args, **kwargs):
    # Simulate any terminal token-refresh failure
    raise RecursionError("simulated")


fd.token_refresher = failing_refresher

REPO = "<any xet-backed repo>"
FILES = [f"f{i}" for i in range(1, 9)]  # placeholders f1..f8: use 8 real file paths from REPO

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as ex:
    list(ex.map(lambda p: hf_hub_download(REPO, p, repo_type="dataset"), FILES))
# Expected: an exception per worker, locks released.
# Observed: workers block in futex_wait forever; .lock files remain on disk.
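
Until the future is signaled on failure, one possible stopgap for anyone hitting this is a process-level watchdog, since there is no Python-level timeout to escape the wait. The sketch below uses the standard library's faulthandler.dump_traceback_later. The hang_watchdog helper and its parameters are hypothetical names for this example, and the sleep at the end only stands in for a hung download.

```python
import contextlib
import faulthandler
import sys
import tempfile
import time


@contextlib.contextmanager
def hang_watchdog(seconds, exit_on_hang=True, file=None):
    """Dump all thread tracebacks if the body runs longer than `seconds`.

    With exit_on_hang=True the process is terminated after the dump, so a
    supervisor (for example the pod's restart policy) can take over.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(seconds, exit=exit_on_hang, file=target)
    try:
        yield
    finally:
        # Body finished in time (or raised): disarm the watchdog.
        faulthandler.cancel_dump_traceback_later()


# Demo: a 0.2 s deadline around a 1 s sleep stands in for a stuck call.
# exit_on_hang=False here so the demo only records the dump.
with tempfile.TemporaryFile("w+") as log:
    with hang_watchdog(0.2, exit_on_hang=False, file=log):
        time.sleep(1.0)
    log.seek(0)
    report = log.read()

print("watchdog fired:", "Timeout" in report)
```

In real use the body would be the hf_hub_download call with a generous deadline, e.g. `with hang_watchdog(600): hf_hub_download(...)`. This does not fix anything: it only turns a silent stall into a visible crash with thread tracebacks. The .lock files stay on disk, but as far as I know the OS-level lock on them is dropped when the process exits.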

Please let me know if you need more information.
