About scalability to larger base models (e.g., Qwen3-8B) #12
Description
Hi, thanks for the great work.
I have a question about scalability across model sizes, especially for larger base models such as Qwen3-8B. In my domain-specific experiments, I observed consistent improvements on models like Qwen3-0.6B and Qwen3-1.7B after adding the module, including gains on benchmarks such as MMLU. However, when I moved to Qwen3-8B, I no longer observed similar improvements.
So I wanted to ask:
- Have you tried Qwen3-8B or other similarly strong base models?
- Did you observe that the method becomes less effective as the base model gets stronger/larger?
- Do you have any suggestions on how to make the method work better on larger models, for example:
  - different insertion layers,
  - different learning rates,
  - partial unfreezing of the backbone,
  - a larger memory/table size,
  - or different training recipes?
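For the partial-unfreezing option, here is a minimal sketch of what I have in mind, in case it helps clarify the question. It freezes the whole backbone and re-enables gradients only for the top K transformer blocks; the `layers.{i}.` naming pattern is an assumption based on common Hugging Face conventions (Qwen/LLaMA-style models expose blocks as `model.layers.0 ... model.layers.N-1`), and `ToyBackbone` is a hypothetical stand-in, so the pattern should be checked against the real model's `named_parameters()`:

```python
import torch.nn as nn


def unfreeze_top_k_blocks(model: nn.Module, num_blocks: int, top_k: int) -> None:
    """Freeze all parameters, then re-enable grads for the last top_k blocks.

    Assumes block parameters are named with a "layers.{i}." prefix, as in
    common HF transformer implementations; verify for your backbone.
    """
    for p in model.parameters():
        p.requires_grad = False
    trainable_prefixes = [f"layers.{i}." for i in range(num_blocks - top_k, num_blocks)]
    for name, p in model.named_parameters():
        if any(prefix in name for prefix in trainable_prefixes):
            p.requires_grad = True


# Hypothetical toy stand-in for a transformer backbone, for illustration only.
class ToyBackbone(nn.Module):
    def __init__(self, num_blocks: int = 4, dim: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_blocks))


model = ToyBackbone(num_blocks=4)
unfreeze_top_k_blocks(model, num_blocks=4, top_k=2)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only parameters of layers.2 and layers.3 remain trainable
```

In my larger-model runs I would pass only the trainable parameters to the optimizer (e.g., `filter(lambda p: p.requires_grad, model.parameters())`), possibly with a lower learning rate for the unfrozen backbone blocks than for the added module.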