[213] Skywork-R1V3 Technical Report #234

paper, model, code

TL;DR

  • I read this because.. : it's another Skywork-series paper, and it discusses entropy.
  • task : multimodal reasoning model
  • problem : for MLLMs, the gap between open and closed models is even larger
  • idea : training only the projector carries over from the previous series ([211] Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought #232); this paper adds several recipes and analyses.
  • input/output : {image/text, prompt} -> {reasoning, answer}
  • architecture : InternVL-38B
  • objective : CE loss (SFT), GRPO loss (RL), entropy-guided checkpoint selection
  • baseline : InternVL3-78B, Qwen2.5-VL-72B, GPT-4o, Claude 3.7, QVQ-72B
  • data : cold-start STEM QA (12K), math RL data (15K), multi-domain connector tuning (10K)
  • evaluation : 20+ benchmarks (MMMU, MathVista, LogicVista, PhyX, etc.) using VLMEvalKit
  • result : SOTA among open-source models (MMMU 76.0%); demonstrates reasoning transfer and generalization
  • contribution : proposes a critical-token entropy metric, highlights the connector's role, provides RL analysis and ablations
  • etc. : slow-thinking > fast-thinking, uncovers a reasoning-hallucination issue, connector-only tuning is effective

details

  • data preparation

    • LongCoT: 20K Chinese high-school-difficulty problems, rejection-sampled with Skywork-R1V2 (checking the final answer) --> 12K
    • GRPO : 15K high-quality K-12-level math problems, all multiple-choice or fill-in-the-blank
    • Connector-only data : 10K samples spanning 20 domains
  • Post-Training Recipes

    • reward: format reward and accuracy reward
    • cold start sft
      • thousands of cold-start samples from an early internal version of Skywork-R1V2
      • employed the Skywork-VL-Reward (Wang et al., 2025d) alongside GPT-4o to filter rambling and overly
        lengthy samples, resulting in a refined cold-start dataset
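A rule-based format + accuracy reward like the one above can be sketched as follows; the `<think>/<answer>` template and the weights are hypothetical assumptions, since the report does not spell out the exact format:

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows a <think>...</think><answer>...</answer>
    template (a common reasoning-RL convention, assumed here); else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the extracted final answer matches the gold answer exactly."""
    m = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold_answer.strip() else 0.0

def total_reward(completion: str, gold_answer: str,
                 w_fmt: float = 0.1, w_acc: float = 1.0) -> float:
    # Weighted sum of the two rule-based rewards (weights are illustrative).
    return w_fmt * format_reward(completion) + w_acc * accuracy_reward(completion, gold_answer)
```

With GRPO, this scalar reward is all that is needed per sampled completion; no learned reward model is required for the verifiable math data.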
  • vision-language benchmark performance

    • They use VLMEvalKit with slight per-task adjustments, which they say will be open-sourced soon
  • Empirical Analysis on Reinforcement Learning

    • Critical Token Entropy Indicates Reasoning Ability
    • With cold-start CoT SFT alone, the model only appears to reason; generalizable reasoning ability is not actually activated ("repeating existing patterns rather than truly activating generalizable reasoning capabilities")
    • To measure this, they compute the entropy of critical tokens ("wait", "alternatively", etc.) and use it for checkpoint selection (it correlates strongly with MMMU performance)
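The critical-token entropy signal can be sketched as below; the token list, the `(token, distribution)` trace interface, and the checkpoint-selection rule are illustrative assumptions, not the paper's actual implementation:

```python
import math

# Illustrative set of deliberation / "critical" tokens (not the paper's exact list).
CRITICAL_TOKENS = {"wait", "alternatively", "however"}

def token_entropy(probs):
    """Shannon entropy of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def critical_token_entropy(trace):
    """Mean entropy at positions where the model emitted a critical token.

    `trace` is a list of (token_str, next_token_probs) pairs: the sampled token
    and the distribution it was drawn from (hypothetical interface).
    """
    ents = [token_entropy(probs) for tok, probs in trace
            if tok.strip().lower() in CRITICAL_TOKENS]
    return sum(ents) / len(ents) if ents else 0.0

def select_checkpoint(ckpt_traces):
    """Pick the checkpoint whose validation traces have the highest mean
    critical-token entropy, used here as a proxy for genuine reasoning
    (the note reports this correlates with MMMU performance)."""
    return max(ckpt_traces,
               key=lambda name: sum(critical_token_entropy(t) for t in ckpt_traces[name])
                                / len(ckpt_traces[name]))
```

The intuition: a model that only parrots SFT patterns emits "wait"/"alternatively" near-deterministically, while a model that genuinely deliberates keeps these branch points uncertain.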
  • The Connector Module Activation is Vital in RL

  • The Distribution Shift in Curriculum Learning Hinders Generalization

    • After shifting the curriculum from K-12 to competition-level difficulty, performance on hard problems improves, but normal-difficulty performance drops, while physics and logic hold steady.
    • The complex skills, special patterns, and high-level strategies needed for hard problems seem to conflict with what normal-level problems require.
  • Component-freeze ablation for the multi-domain training stage after RL

  • Discussion

  • In-domain (MathVista) vs. out-of-domain (MMMU) performance when doing math-only SFT and RL

  • SFT does not generalize while RL does ([209] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training #230)

  • thinking budget

  • Hallucination in Skywork-R1V3’s Chain-of-Thought Impairs Reasoning Performance

  • Analysis on Entropy Token in Visual Reasoning Task

    • As training progresses, overall token entropy decreases (tokens become more deterministic), but the probability of high-entropy tokens increases.
    • That is, training pushes the model toward emitting more deliberation tokens such as "wait", ...
    • The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models https://arxiv.org/pdf/2505.22617
    • Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning https://arxiv.org/abs/2506.01939
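A minimal sketch of the analysis above: split per-position entropies into a high-entropy minority ("forking" tokens, in the spirit of the 80/20 paper linked above) and the rest. The 20% cutoff and the list-of-distributions interface are assumptions, not the paper's code:

```python
import math

def entropy(probs):
    """Shannon entropy of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_profile(step_probs, top_frac=0.2):
    """Mean entropy overall, for the top-`top_frac` highest-entropy positions
    ("forking" tokens), and for the remainder.

    `step_probs` is a list of next-token distributions, one per generated
    position (hypothetical interface). Tracking these three numbers across
    training steps would reproduce the trend the note describes: mean entropy
    falls overall while the high-entropy minority stays informative.
    """
    ents = sorted((entropy(p) for p in step_probs), reverse=True)
    k = max(1, int(len(ents) * top_frac))
    high, low = ents[:k], ents[k:]
    return {
        "mean_all": sum(ents) / len(ents),
        "mean_high": sum(high) / len(high),
        "mean_low": sum(low) / len(low) if low else 0.0,
    }
```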
