🚧 Development Branch - This is the main development repository for LUFFY (Learning to Reason Under Off-Policy Guidance)
LUFFY is a reinforcement learning framework that bridges the gap between zero-RL and imitation learning by incorporating off-policy reasoning traces into the training process. This repository contains the core implementation and development work.
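The core idea, mixing fresh on-policy rollouts with off-policy reasoning traces in the same training batch, can be sketched as follows. This is a minimal illustration only: the names are hypothetical, and LUFFY's actual batching (in `luffy/verl`) additionally handles importance weighting and advantage shaping.

```python
import random

def build_mixed_batch(on_policy_rollouts, off_policy_traces, off_policy_ratio=0.5):
    """Mix freshly sampled rollouts with off-policy reasoning traces.

    Illustrative sketch, not LUFFY's real API: the actual trainer applies
    importance corrections to the off-policy samples rather than treating
    the two sources identically.
    """
    # Number of off-policy traces to mix in, relative to on-policy batch size.
    n_off = int(len(on_policy_rollouts) * off_policy_ratio)
    chosen = random.sample(off_policy_traces, k=min(n_off, len(off_policy_traces)))
    mixed = list(on_policy_rollouts) + chosen
    random.shuffle(mixed)  # avoid a fixed on-policy/off-policy ordering
    return mixed
```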
This repository is under active development. Many features are currently being implemented or need refactoring.
```shell
# Clone the repository
git clone <repository-url>
cd LUFFY

# Install dependencies
pip install -r luffy/requirements.txt

# Note: Some functionality is incomplete - check TODO list below for details
```

```
LUFFY/
├── luffy/              # Core framework
│   ├── deepscaler/     # Scaling utilities (⚠️ API integration needed)
│   ├── verl/           # RL training components (⚠️ Some features incomplete)
│   └── ...
├── data/               # Training data and scripts
├── eval_scripts/       # Evaluation utilities
├── exp_scripts/        # Experiment scripts
└── README.md           # This file
```
- This is a development version with incomplete implementations
- Many functions contain TODO markers indicating pending work
- API integrations (OpenAI, Gemini) are currently placeholder implementations
- FSDP and distributed training features need completion
- API Integration: OpenAI and Gemini API implementations need completion
- Reward System: Parallel processing and validation for reward computation
- FSDP Training: Model loading and distributed training setup
- Data Processing: Batch dimension operations and tensor reshaping
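For the reward-system item above, the intended shape is per-sample scoring fanned out across workers. A minimal sketch under stated assumptions: `score_one` is a stand-in for LUFFY's real reward functions (which parse and verify model answers), and the concurrency pattern, not the scorer, is the point.

```python
from concurrent.futures import ThreadPoolExecutor

def score_one(solution, ground_truth):
    """Toy reward: 1.0 on exact match, else 0.0.

    Placeholder for a real per-sample scorer (e.g. math-answer verification).
    """
    return 1.0 if solution.strip() == ground_truth.strip() else 0.0

def compute_rewards_parallel(solutions, ground_truths, max_workers=8):
    """Score samples concurrently; Executor.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_one, solutions, ground_truths))
```

Order preservation matters here: rewards must line up index-for-index with the rollouts they score.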
- luffy/deepscaler/utils.py:45 - Add logging for API calls and errors
- luffy/deepscaler/utils.py:46 - Support batch processing for multiple prompts
- luffy/deepscaler/utils.py:47 - Add timeout configuration for API calls
- luffy/deepscaler/utils.py:107 - Implement Vertex AI initialization and authentication
- luffy/deepscaler/utils.py:108 - Configure safety settings for content generation
- luffy/deepscaler/utils.py:109 - Set up GenerativeModel with proper system instructions
- luffy/deepscaler/utils.py:110 - Implement retry logic with exponential backoff
- luffy/deepscaler/utils.py:111 - Add comprehensive error handling for API access issues
- luffy/deepscaler/utils.py:112 - Handle rate limiting and quota management
- luffy/deepscaler/utils.py:113 - Implement response validation and text extraction
- luffy/deepscaler/utils.py:114 - Add support for different generation configurations
- luffy/test.py:1590 - add smaller page sizes when Dao-AILab/flash-attention#824 is merged
- luffy/verl/examples/split_placement/split_monkey_patch.py:141 - make a canonical logger that supports various backend
- luffy/verl/tests/e2e/check_results.py:21 - this function needs error handling
- luffy/verl/tests/model/test_transformer.py:22 - (sgm): add more models for test
- luffy/verl/tests/model/test_transformer.py:50 - (sgm): we can construct the position_ids_rmpad here
- luffy/verl/tests/model/test_transformer.py:111 - (sgm): we can construct the position_ids_rmpad here
- luffy/verl/tests/model/test_transformers_ulysses.py:34 - (sgm): add more models for test
- luffy/verl/tests/model/test_transformers_ulysses.py:81 - (sgm): we can construct the position_ids_rmpad here
- luffy/verl/tests/model/test_transformers_ulysses.py:159 - (sgm): we can construct the position_ids_rmpad here
- luffy/verl/tests/ray/test_high_level_scheduling_api.py:25 - pass *args and **kwargs is bug prone and not very convincing
- luffy/verl/tests/ray/test_worker_group_basics.py:43 - pass *args and **kwargs is bug prone and not very convincing
- luffy/verl/verl/mix_src/mix_fsdp_worker.py:54 - (sgm): support FSDP hybrid shard for larger model
- luffy/verl/verl/mix_src/mix_fsdp_worker.py:83 - it seems that manual offload is slower than FSDP offload
- luffy/verl/verl/mix_src/mix_fsdp_worker.py:123 - (zhangchi.usc1992): 1. support create from random initialized model. 2. Support init with FSDP directly
- luffy/verl/verl/mix_src/mix_fsdp_worker.py:199 - (zhangchi.usc1992, shengguangming) fix me. Current, auto_wrap_policy causes HFRollout to hang in Gemma
- luffy/verl/verl/mix_src/mix_fsdp_worker.py:207 - add transformer policy
- luffy/verl/verl/mix_src/mix_fsdp_worker.py:226 - add more optimizer args into config
- luffy/verl/verl/mix_src/mix_fsdp_worker.py:252 - (sgm): support FSDP hybrid shard for larger model
- luffy/verl/verl/mix_src/mix_fsdp_worker.py:263 - a sharding manager that does nothing?
- luffy/verl/verl/mix_src/mix_fsdp_worker.py:391 - here, we should return all metrics
- luffy/verl/verl/mix_src/mix_fsdp_worker.py:517 - support DCP and save sharded checkpoints
- luffy/verl/verl/mix_src/mix_trainer.py:90 - add other ways to estimate advantages
- luffy/verl/verl/mix_src/mix_trainer.py:168 - support each role have individual ray_worker_group_cls,
- luffy/verl/verl/mix_src/mix_trainer.py:293 - we have to make sure the batch size is divisible by the dp size
- luffy/verl/verl/mix_src/mix_trainer.py:599 - make a canonical logger that supports various backend
- luffy/verl/verl/mix_src/mix_trainer.py:637 - add response length
- luffy/verl/verl/mix_src/mix_trainer_acc_rebatch.py:63 - we have to make sure the batch size is divisible by the dp size
- luffy/verl/verl/mix_src/mix_trainer_acc_rebatch.py:437 - make a canonical logger that supports various backend
- luffy/verl/verl/mix_src/mix_trainer_acc_rebatch.py:592 - check path
- luffy/verl/verl/mix_src/mix_trainer_acc_rebatch.py:628 - from remote not implemented yet
- luffy/verl/verl/mix_src/mix_vllm_rollout.py:43
- luffy/verl/verl/models/llama/megatron/layers/parallel_attention.py:380 - llama does not have dropout in the config??
- luffy/verl/verl/models/llama/megatron/layers/parallel_decoder.py:78 - add sequence parallel operator reduce_scatter here
- luffy/verl/verl/models/llama/megatron/layers/parallel_decoder.py:86 - add sequence parallel operator all_gather here
- luffy/verl/verl/models/llama/megatron/layers/parallel_decoder.py:90 - add sequence parallel operator reduce_scatter here
- luffy/verl/verl/third_party/vllm/vllm_v_0_4_2/parallel_state.py:236 - this will hang
- luffy/verl/verl/third_party/vllm/vllm_v_0_4_2/parallel_state.py:245 - will hang when used with device mesh
- luffy/verl/verl/third_party/vllm/vllm_v_0_4_2/parallel_state.py:247 - init using device mesh
- luffy/verl/verl/third_party/vllm/vllm_v_0_4_2/spmd_gpu_executor.py:62 - (sgm): verl not support speculative decode now
- luffy/verl/verl/third_party/vllm/vllm_v_0_4_2/spmd_gpu_executor.py:208 - (sgm): not implemented async executor yet
- luffy/verl/verl/third_party/vllm/vllm_v_0_4_2/tokenizer.py:61 - (sgm): the lora tokenizer is also passed, but may be different
- luffy/verl/verl/third_party/vllm/vllm_v_0_5_4/arg_utils.py:53 - (sgm): check this
- luffy/verl/verl/third_party/vllm/vllm_v_0_5_4/arg_utils.py:54 - (sgm): check this
- luffy/verl/verl/third_party/vllm/vllm_v_0_5_4/arg_utils.py:143 - (shengguangming): delete the unused args
- luffy/verl/verl/third_party/vllm/vllm_v_0_5_4/arg_utils.py:226 - (woosuk): Support fine-grained seeds (e.g., seed per request).
- luffy/verl/verl/third_party/vllm/vllm_v_0_5_4/arg_utils.py:366 - spec config
- luffy/verl/verl/third_party/vllm/vllm_v_0_5_4/hf_weight_loader.py:32
- luffy/verl/verl/third_party/vllm/vllm_v_0_5_4/llm.py:148 - check usagecontext
- luffy/verl/verl/third_party/vllm/vllm_v_0_5_4/llm.py:205 - (sgm): we can optimize it by making the dataloader yield List[int] without padding.
- luffy/verl/verl/third_party/vllm/vllm_v_0_5_4/llm.py:221 - (shengguangming): can be optimized by rewriting the Sampler._get_logprobs() logits
- luffy/verl/verl/third_party/vllm/vllm_v_0_5_4/llm_engine_sp.py:143 - (woosuk): Print more configs in debug mode.
- luffy/verl/verl/third_party/vllm/vllm_v_0_5_4/llm_engine_sp.py:160 - (shengguangming): maybe we can choose init here or from arguments
- luffy/verl/verl/third_party/vllm/vllm_v_0_5_4/llm_engine_sp.py:262 - (sgm): add for verl but we may not tokenizer in Rollout
- luffy/verl/verl/third_party/vllm/vllm_v_0_5_4/llm_engine_sp.py:271 - check whether we should rebuild the CUDAGraph every iter when offload/load KVCache
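Several of the `luffy/deepscaler/utils.py` items above (timeouts, retry with exponential backoff, rate limiting) share one pattern. A hedged sketch, not the repository's actual code: `call_with_backoff` is a hypothetical helper, and a real implementation would also inspect provider-specific rate-limit errors from the OpenAI or Vertex AI SDKs.

```python
import random
import time

def call_with_backoff(fn, *args, max_retries=5, base_delay=1.0, max_delay=30.0,
                      retryable=(TimeoutError, ConnectionError), **kwargs):
    """Retry fn on transient errors with exponential backoff plus jitter.

    Hypothetical helper for the utils.py API TODOs; retryable exception
    types would be replaced with the SDK's rate-limit/timeout errors.
    """
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, 0.1 * delay))  # jitter
```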
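The two `mix_trainer` items about batch size being divisible by the data-parallel size amount to enforcing one invariant before sharding. A minimal illustration with an invented helper name; the real trainers may instead drop samples or raise, rather than pad.

```python
def pad_to_divisible(batch, dp_size):
    """Pad a batch (list of samples) so len(batch) % dp_size == 0.

    Illustrative sketch: pads by repeating the last sample so every
    data-parallel rank receives an equal shard.
    """
    remainder = len(batch) % dp_size
    if remainder:
        batch = batch + [batch[-1]] * (dp_size - remainder)
    return batch
```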
- Pick a TODO item from the list above
- Implement the functionality
- Test your implementation
- Update this README when TODOs are completed