
About scalability to larger base models (e.g., Qwen3-8B) #12

@bxren

Hi, thanks for the great work.

I have a question about scalability across model sizes, especially for larger base models such as Qwen3-8B. In my domain-specific experiments, I observed consistent improvements on models like Qwen3-0.6B and Qwen3-1.7B after adding the module, including gains on benchmarks such as MMLU. However, when I moved to Qwen3-8B, I no longer observed similar improvements.

So I wanted to ask:

1. Have you tried Qwen3-8B or other similarly strong base models?
2. Did you observe that the method becomes less effective as the base model gets stronger or larger?
3. Do you have any suggestions for making the method work better on larger models, e.g.:
   - different insertion layers,
   - different learning rates,
   - partial unfreezing of the backbone,
   - a larger memory/table size,
   - or different training recipes?
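To make the third point concrete, here is a minimal sketch of what I mean by "partial unfreezing of the backbone" combined with separate learning rates. Everything here is a hypothetical stand-in (`TinyBackbone`, the `memory` placeholder module, the layer choice, and the learning rates), not code from this repository:

```python
# Hedged sketch: freeze a pretrained backbone, unfreeze only the added
# memory module plus the last backbone layer, and give each group its
# own learning rate. All names and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    # Stand-in for a large pretrained LM backbone with an inserted module.
    def __init__(self, n_layers=4, d=16):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(d, d) for _ in range(n_layers))
        self.memory = nn.Linear(d, d)  # placeholder for the added memory module

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x + self.memory(x)

model = TinyBackbone()

# Freeze everything, then selectively unfreeze.
for p in model.parameters():
    p.requires_grad = False
for p in model.memory.parameters():
    p.requires_grad = True
for p in model.layers[-1].parameters():
    p.requires_grad = True

# Per-group learning rates: higher for the new module,
# much lower for the unfrozen backbone layer.
optimizer = torch.optim.AdamW([
    {"params": model.memory.parameters(), "lr": 1e-4},
    {"params": model.layers[-1].parameters(), "lr": 1e-5},
])

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable}/{total}")
```

On an 8B model the analogous change would unfreeze only the top few transformer blocks; my intuition is that the stronger backbone may need a gentler recipe like this rather than keeping it fully frozen, but I have not verified that.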
