GRPO: 15K high-quality K-12-level math problems, all multiple-choice or fill-in-the-blank
Connector-only tuning data: 10K samples across 20 domains
Post-Training Recipes
reward: format reward and accuracy reward
cold-start SFT
thousands of cold-start samples generated by an early internal version of Skywork-R1V2
Skywork-VL-Reward (Wang et al., 2025d) and GPT-4o were used to filter out rambling and overly lengthy samples, resulting in a refined cold-start dataset
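As a rough illustration of how the format and accuracy rewards listed above could feed GRPO, here is a minimal sketch. The `<think>...</think>` plus `\boxed{...}` output template, the 0.5 format weight, and all function names are assumptions for illustration and are not taken from the paper. The group-relative advantage is the standard GRPO normalization (reward minus group mean, divided by group standard deviation).

```python
import re

# Assumed output template (not from the paper): reasoning inside <think>...</think>,
# followed by a final answer wrapped in \boxed{...}.
FORMAT_RE = re.compile(r"\A\s*<think>(.+?)</think>(.*)\Z", re.DOTALL)
ANSWER_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if FORMAT_RE.match(response) else 0.0


def extract_answer(response: str):
    """Return the last boxed answer after the reasoning block, or None."""
    match = FORMAT_RE.match(response)
    if match is None:
        return None
    answers = ANSWER_RE.findall(match.group(2))
    return answers[-1].strip() if answers else None


def accuracy_reward(response: str, gold: str) -> float:
    """Exact-match check, which suits multiple-choice and fill-in-the-blank items."""
    pred = extract_answer(response)
    return 1.0 if pred is not None and pred.lower() == gold.strip().lower() else 0.0


def total_reward(response: str, gold: str, format_weight: float = 0.5) -> float:
    # format_weight is a hypothetical value chosen for this sketch.
    return accuracy_reward(response, gold) + format_weight * format_reward(response)


def group_relative_advantages(rewards, eps: float = 1e-6):
    """GRPO-style advantages: normalize rewards within one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Exact-match scoring is only workable because the RL data is restricted to verifiable answer types; free-form answers would need a different checker.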
vision language benchmark performance
They used VLMEvalKit with slight per-task refinements and say they will open-source the evaluation code soon.
With cold-start CoT SFT alone, the model only appears to reason; generalizable reasoning ability does not actually emerge (repeating existing patterns rather than truly activating generalizable reasoning capabilities).
To measure this, they compute the entropy of critical tokens ("wait", "alternatively", etc.) and use it to select checkpoints (it correlates strongly with MMMU performance).
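The critical-token entropy described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes access to per-step logits and takes the entropy of the predictive distribution at each step that emitted a critical token. The function names and the token set beyond "wait" and "alternatively" are hypothetical.

```python
import math

# "wait" and "alternatively" come from the notes; the full set used by the paper is not given here.
CRITICAL_TOKENS = {"wait", "alternatively"}


def softmax_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


def critical_token_entropy(tokens, logits_per_step):
    """Mean entropy over the steps whose generated token is a critical token.

    tokens[i] is the token emitted at step i and logits_per_step[i] holds the
    logits it was sampled from. Returns None if no critical token occurs.
    """
    entropies = [
        softmax_entropy(logits)
        for token, logits in zip(tokens, logits_per_step)
        if token.strip().lower() in CRITICAL_TOKENS
    ]
    return sum(entropies) / len(entropies) if entropies else None
```

Averaging this value over a fixed set of prompts gives one number per checkpoint, which can then be compared against benchmark scores such as MMMU.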
The Connector Module Activation is Vital in RL
The Distribution Shift in Curriculum Learning Hinder Generalization
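They shifted the training data once from K12 to competition difficulty: performance on high-difficulty problems rose, performance on normal-difficulty problems dropped, and physics and logic tended to hold steady.
The complex skills, special patterns, and high-level strategies needed for hard problems appear to conflict with what normal-level problems require.
Component-freeze ablation for the multi-domain training stage that follows the RL stage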
Discussion
In-domain (MathVista) vs. out-of-domain (MMMU) performance difference when SFT and RL are each run on math-only data
paper, model, code
TL;DR
details
data preparation
Empirical Analysis on Reinforcement Learning
Discussion
In-domain (MathVista) vs. out-of-domain (MMMU) performance difference when SFT and RL are each run on math-only data
SFT does not generalize, whereas RL does ([209] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training #230)
thinking budget
Hallucination in Skywork-R1V3’s Chain-of-Thought Impairs Reasoning Performance
Analysis on Entropy Token in Visual Reasoning Task