Terminal token-refresh failure deadlocks hf_hub_download forever (xet) #795

## Summary                                                      

When `hf_hub_download` takes the xet path and the in-Python `token_refresher` callback fails terminally, the awaiting Python future is **never woken**. The per-blob `FileLock` is never released, sibling threads pile up behind it, and the process sits in `futex_wait` indefinitely with no error to stderr and no Python-level timeout to escape. We've observed pods stuck for 13+ hours.
 
In our case the trigger is a `RecursionError` raised inside the callback (Bug A), but the deadlock (Bug B) is the dangerous part — it converts *any* fatal token-refresh failure into a permanent stall.           
   
## Versions                                                                                                                                                                                                        
                                                                  
  `huggingface_hub` 0.34.4 · `hf-xet` 1.3.1 · `datasets` 3.0.1 · Python 3.12.13 · linux/amd64                                                                                                                        
   
## Bug A — RecursionError inside `token_refresher`                                                                                                                                                                 
                                                                  
  `hf-xet` invokes the Python callback on a Rust tokio worker thread that isn't registered in `threading._active`. Anything in the callback that reaches `threading.current_thread()` synthesizes a `_DummyThread`;  
  on Python 3.12 the interaction with the logging machinery recurses to the limit:
                                                                                                                                                                                                                     
  ```text
  TokenRefreshFailure: PyErr { type: <class 'RecursionError'>, value: ('maximum recursion depth exceeded'),
    File ".../huggingface_hub/file_download.py", line 597, in token_refresher                                                                                                                                        
    File ".../huggingface_hub/utils/_xet.py", line 116, in refresh_xet_connection_info                                                                                                                               
    File ".../huggingface_hub/utils/_xet.py", line 186, in _fetch_xet_connection_info_with_url                                                                                                                       
    File ".../requests/sessions.py", line 602, in get                                                                                                                                                                
    File ".../urllib3/connectionpool.py", line 1049, in _new_conn                                                                                                                                                    
    File "/usr/local/lib/python3.12/logging/__init__.py", line 347, in __init__                                                                                                                                      
      self.threadName = threading.current_thread().name                                                                                                                                                              
    File "/usr/local/lib/python3.12/threading.py", line 1495, in current_thread                                                                                                                                      
      return _Du... [recurses to limit]                                                                                                                                                                              
  ```                                                                                                                                                                                                                
                                                                                                                                                                                                                     
  PyO3 catches it and surfaces `TokenRefreshFailure`. The Rust retry wrapper retries 5×, deterministically hitting the same recursion every time.                                                                    
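
The `_DummyThread` synthesis can be observed without `hf-xet`. This is a minimal sketch, assuming a thread started through the low-level `_thread` module is a fair stand-in for a Rust tokio worker, since it is likewise absent from `threading._active`. `observe_foreign_thread` is a hypothetical helper name, and the sketch shows only the dummy-thread behaviour, not the recursion itself.

```python
import _thread
import threading


def observe_foreign_thread(timeout: float = 5.0) -> str:
    """Return the class name that threading.current_thread() reports
    inside a thread the threading module did not create."""
    seen = {}
    done = threading.Event()

    def callback() -> None:
        try:
            # Stand-in for the token_refresher callback body (assumption).
            seen["type"] = type(threading.current_thread()).__name__
        finally:
            done.set()

    # _thread.start_new_thread bypasses threading.Thread, so the new
    # thread is never registered in threading._active.
    _thread.start_new_thread(callback, ())
    if not done.wait(timeout):
        raise TimeoutError("foreign thread did not report back")
    return seen["type"]


print(observe_foreign_thread())
```

On the CPython versions I have checked, this prints `_DummyThread`.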
                                                                  
## Bug B — terminal failure leaves the Python future unsignaled
                                                                  
  After `hf-xet` exhausts its retries, the process is left in this state:
                                                                  
  ```text
  $ for t in /proc/$PID/task/*; do echo $(cat $t/wchan) - $(cat $t/comm); done | sort | uniq -c
       24 futex_wait_queue - <worker>     # blocked on dataset prep                                                                                                                                                  
        8 futex_wait_queue - hf-xet-*                                                                                                                                                                                
                                                                                                                                                                                                                     
  $ cat /proc/$PID/net/tcp | awk '{print $4}' | sort | uniq -c                                                                                                                                                       
       14 08    # CLOSE_WAIT to cas-server.xethub.hf.co — peer closed long ago                                                                                                                                       
                                                                                                                                                                                                                     
  $ find ~/.cache/huggingface -name '*.lock'                                                                                                                                                                         
  hub/.locks/datasets--<repo>/<hash>.lock     × N                                                                                                                                                                    
  datasets/<repo>/.../*_builder.lock                                                                                                                                                                                 
  datasets/<repo>/.../*.incomplete_info.lock                                                                                                                                                                         
  ```                                                                                                                                                                                                                
                                                                                                                                                                                                                     
  13+ hours after the failure: zero CPU activity, no error written, no resolution. The `hf_hub_download` thread holding the `FileLock` is wedged on a future whose wakeup never comes.                               
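
To make the expected contract concrete, here is a toy model built on `concurrent.futures.Future`. It is not `hf-xet`'s actual code: `complete_buggy`, `complete_fixed`, and `failing_refresh` are hypothetical names, and the real future lives on the Rust side. The sketch only contrasts an error path that drops the failure with one that always completes the future.

```python
from concurrent.futures import Future


def complete_buggy(future: Future, work) -> None:
    """Model of the observed behaviour: a terminal error never reaches the waiter."""
    try:
        future.set_result(work())
    except Exception:
        # The error is dropped, so the future stays pending and a
        # result() call without a timeout would block forever.
        pass


def complete_fixed(future: Future, work) -> None:
    """Model of the expected behaviour: every exit path completes the future."""
    try:
        future.set_result(work())
    except BaseException as exc:
        future.set_exception(exc)


def failing_refresh():
    # Stand-in for a token refresh that fails terminally.
    raise RecursionError("simulated")


buggy, fixed = Future(), Future()
complete_buggy(buggy, failing_refresh)
complete_fixed(fixed, failing_refresh)

print(buggy.done())  # False: a waiter would hang
print(fixed.done())  # True: a waiter gets the RecursionError
```

With the second shape, the thread holding the `FileLock` would receive an exception and could release the lock on its way out.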
                                                                  
## Reproduction (Bug B in isolation, deterministic)
                                                                  
  ```python                                                                                                                                                                                                          
  import os, concurrent.futures                                   
  os.environ.pop("HF_HUB_DISABLE_XET", None)                                                                                                                                                                         
  from huggingface_hub import hf_hub_download                                                                                                                                                                        
  import huggingface_hub.file_download as fd                                                                                                                                                                         
                                                                                                                                                                                                                     
  # Simulate any terminal token-refresh failure                                                                                                                                                                      
  fd.token_refresher = lambda *a, **kw: (_ for _ in ()).throw(RecursionError("simulated"))
                                                                                                                                                                                                                     
  REPO = "<any xet-backed repo>"                                                                                                                                                                                     
  FILES = ["f1", "f2", ..., "f8"]                                                                                                                                                                                    
                                                                                                                                                                                                                     
  with concurrent.futures.ThreadPoolExecutor(max_workers=8) as ex:
      list(ex.map(lambda p: hf_hub_download(REPO, p, repo_type="dataset"), FILES))                                                                                                                                   
  # Expected: an exception per worker, locks released.                                                                                                                                                               
  # Observed: workers block in futex_wait forever; .lock files remain on disk.                                                                                                                                       
  ```
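
Because the hang has no Python-level timeout, a possible stopgap is to run the download in a child interpreter under a hard deadline. This is a sketch of such a watchdog, not a fix for the underlying bug. `run_with_deadline` is a hypothetical helper, and the deadline value is up to the caller.

```python
import subprocess
import sys


def run_with_deadline(code: str, timeout_s: float) -> bool:
    """Run `code` in a child interpreter.

    Returns True if the child exits successfully before the deadline,
    False if the deadline expires. A non-zero exit raises CalledProcessError.
    """
    try:
        subprocess.run([sys.executable, "-c", code], check=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising,
        # so no wedged process is left behind.
        return False
    return True
```

The reproduction above could be saved as a string or script and passed through this wrapper, turning the indefinite stall into a `False` return that the caller can log and retry.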

Please let me know if you need more information.
