About scalability to larger base models (e.g., Qwen3-8B) #12
Description
Hi, thanks for the great work.
I have a question about scalability across model sizes, especially for larger base models such as Qwen3-8B. In my domain-specific experiments, I observed consistent improvements on models like Qwen3-0.6B and Qwen3-1.7B after adding the module, including gains on benchmarks such as MMLU. However, when I moved to Qwen3-8B, I no longer observed similar improvements.
So I wanted to ask:
- Have you tried Qwen3-8B or other similarly strong base models?
- Did you observe that the method becomes less effective as the base model gets stronger/larger?
- Do you have any suggestions on how to make the method work better on larger models, for example:
  - different insertion layers,
  - different learning rates,
  - partial unfreezing of the backbone,
  - a larger memory/table size,
  - or different training recipes?
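For the partial-unfreezing option, here is a minimal sketch of what I have in mind, in case it helps clarify the question. It freezes the whole backbone and re-enables gradients only for the top K transformer blocks; the `layers.{i}.` naming pattern is an assumption based on common Hugging Face conventions (Qwen/LLaMA-style models expose blocks as `model.layers.0 ... model.layers.N-1`), and `ToyBackbone` is a hypothetical stand-in, so the pattern should be checked against the real model's `named_parameters()`:

```python
import torch.nn as nn


def unfreeze_top_k_blocks(model: nn.Module, num_blocks: int, top_k: int) -> None:
    """Freeze all parameters, then re-enable grads for the last top_k blocks.

    Assumes block parameters are named with a "layers.{i}." prefix, as in
    common HF transformer implementations; verify for your backbone.
    """
    for p in model.parameters():
        p.requires_grad = False
    trainable_prefixes = [f"layers.{i}." for i in range(num_blocks - top_k, num_blocks)]
    for name, p in model.named_parameters():
        if any(prefix in name for prefix in trainable_prefixes):
            p.requires_grad = True


# Hypothetical toy stand-in for a transformer backbone, for illustration only.
class ToyBackbone(nn.Module):
    def __init__(self, num_blocks: int = 4, dim: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_blocks))


model = ToyBackbone(num_blocks=4)
unfreeze_top_k_blocks(model, num_blocks=4, top_k=2)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only parameters of layers.2 and layers.3 remain trainable
```

In my larger-model runs I would pass only the trainable parameters to the optimizer (e.g., `filter(lambda p: p.requires_grad, model.parameters())`), possibly with a lower learning rate for the unfrozen backbone blocks than for the added module.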