<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>BabyLM Challenge</title>
<link rel="stylesheet" href="stylesheet.css">
<meta name="google-site-verification" content="uGXw8B5MkU92VZio-qMGqDxvDPk9t5WuvJPCuMmwuA8"/>
<link rel="icon" href="./images/pacifier.png">
</head>
<body>
<div style="display:inline">
<img style="float: left; padding-right: 20px;" src="./images/pacifier.png" height="80">
<div class="master-title"> <b>Baby</b>LM Challenge </div>
<div class="subheader"> Sample-efficient pretraining on a developmentally plausible corpus </div>
</div>
<div id="navbar">
<h4> <a href="index.html"> Overview </a> • <a href="Workshop_times.html"> Workshop Schedule </a> • <a href="posters.html"> Posters </a> • <a href="guidelines.html"> Guidelines </a> • <a href="timeline.html"> Timeline</a> • <a href="faqs.html"> FAQs </a>• <a href="papers.html"> Previous papers </a> <hr> </h4>
</div>
<div class="greybox">
<div class="paragraph"> <b> Summary: </b> BabyLM returns for its <b>4th year</b> as both a shared task and a workshop at <b>EMNLP 2026</b>. This round keeps the core goal: sample-efficient pretraining under human-scale data budgets, while updating the track structure and datasets. </div>
<div class="bullet"> • All data is available <a href="https://huggingface.co/BabyLM-community"><u>at this huggingface community!</u></a> Data includes:</div>
<div class="bullet"> → A <b>detoxified</b> 100M-word <b>Strict</b> dataset and a detoxified 10M-word <b>Strict-Small</b> dataset.</a></div>
<div class="bullet"> → A new <b>MultiLingual</b> track based on <a href="https://babylm.github.io/babybabellm/"><b>BabyBabelLM</b></a>, with evaluation focusing on <b>English, Dutch, and Chinese</b>. </div>
<!-- <div class="bullet"> • The evaluation pipeline is out <a href="https://github.com/babylm/evaluation-pipeline-2024">here</a>! </div> -->
<div class="paragraph">
<b>Track update:</b> This year introduces a dedicated <b>MultiLingual</b> track and removes <b>Multimodal</b> and <b>Interaction</b>
as standalone competition tracks; both are now subsumed into <b>Strict</b> / <b>Strict-Small</b> (paired image-text data and teacher-model feedback are allowed).
</div>
<div class="paragraph">
<b>Evaluation pipeline:</b> We will distribute an open-source pipeline building on the 2025 repository; the MultiLingual track will be evaluated with a mix of zero-shot and finetuning-based tasks across English, Dutch, and Chinese (a generic sketch of the zero-shot setup appears below). The pipeline and baselines are planned for <b>early April</b>.
</div>
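<div class="paragraph"> For readers unfamiliar with the setup, zero-shot evaluations of this kind typically score minimal pairs: a model passes an item if it assigns a higher log-probability to the grammatical sentence than to its ungrammatical counterpart. The sketch below is a generic illustration of that idea, <b>not</b> the official pipeline; the model name and example pair are placeholders. </div>
<pre>
# Generic sketch of a zero-shot minimal-pair evaluation (BLiMP-style).
# NOT the official BabyLM pipeline; "gpt2" and the sentences are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sentence_logprob(text):
    """Total log-probability the model assigns to a sentence."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # predicts tokens 1..n-1
    return logprobs.gather(-1, ids[:, 1:, None]).sum().item()

good, bad = "The cats sleep.", "The cats sleeps."
print(sentence_logprob(good) > sentence_logprob(bad))  # True if the model prefers the grammatical form
</pre>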
<div class="paragraph">
See the <a href="guidelines.html">guidelines</a> for an overview of submission tracks and pretraining data. See the updated call for papers for the full task setup, track rules, and dataset details.
</div>
<div class="paragraph"> Consider <a href="https://join.slack.com/t/babylmchallenge/shared_invite/zt-3gqtr0fat-Mi4M4eFcszrcakz2Vyj41g">joining the BabyLM Slack</a> if you have any questions for the organizers or want to connect with other participants!</div>
</div>
<!-- <div class="title"> Submission guide </div> <br>
Submit <a href="https://openreview.net/group?id=EMNLP/2024/Workshop/CoNLL_Shared_Task/BabyLM_Challenge">Here</a>
<p>To fill out the submission, please prepare these two things:</p>
<ol>
<li>A HuggingFace link to your models.</li>
<li>A download link to your results, assembled via the collect_results.py script in babylm/evaluation-pipeline-2024.</li>
</ol>
Paper submission follows CoNLL template with 4-8 pages, as well as a hyperparameter <a href="https://forms.gle/nRjdt5w5rCoFFqnJ6"> form </a>. -->
<div class="title"> Submission Links and timeline</div> <br>
<div class="paragraph">
Submissions will be accepted <b>via ACL Rolling Review (ARR)</b> or <b>directly through OpenReview</b>. Official OpenReview portal links will be posted on this site once they are live.
<br>
Tentative timeline:
<div class="bullet"> <b> February 25 2026: </b> Call for papers and Training data released </div>
<div class="bullet"> <b> April 2026: </b> Evaluation pipeline and baselines released </div>
<div class="bullet"> <b> May 25 2026: </b> ARR submission deadline </div>
<div class="bullet"> <b> Mid July 2026: </b> Direct submission deadline </div>
<div class="bullet"> <b> Early August 2026: </b> Direct submission reviews due, ARR commitment deadline </div>
<div class="bullet"> <b> Mid August 2026: </b> Paper decisions released </div>
<div class="bullet"> <b> Early September 2026: </b> Camera ready due </div>
<div class="bullet"> <b> Oct 24-29 2026: </b> Workshop at EMNLP in Budapest </div>
</div>
<!-- <div class="bullet"> • If you're submitting a workshop paper (not to the shared task), please submit at <a href="https://openreview.net/group?id=EMNLP/2025/Workshop/BabyLM">this OpenReview link</a>. <b>Deadline: August 15</b> (midnight AoE) </div>
<div class="bullet"> • If you're submitting a shared task report for the BabyLM Challenge, please submit at <a href="https://openreview.net/group?id=EMNLP/2025/Workshop/BabyLM_Challenge">this OpenReview link</a>. <b>Deadline: August 17</b> (midnight AoE) </div>
<div class="bullet"> • If you're committing a paper with reviews from ARR, please submit at <a href="https://openreview.net/group?id=EMNLP/2025/Workshop/BabyLM_ARR_Commitment">this OpenReview link</a>. <b>Deadline: September 5</b> (midnight AoE) </div> -->
<div class="title"> Updated Rules for BabyLM Round 4 </div> <br>
<div class="bullet"> • <b>New track: MultiLingual.</b> Participants train on a MultiLingual mixture from <a href="https://babylm.github.io/babybabellm/"><b>BabyBabelLM</b></a>. The challenge track focuses on <b>English, Dutch, and Chinese</b>, and allows a custom mixture totaling <b>100M tokens</b> (with word counts adjusted by each language’s Byte Premium in baseline construction).</div>
<div class="bullet"> • <b>Track restructuring:</b> Dedicated <b>Multimodal</b> and <b>Interaction</b> competition tracks have been removed. Instead, participants may use paired image-text data and/or leverage feedback from a teacher model during training or <b>Strict</b> or <b>Strict-Small</b>.
</div>
<div class="bullet"> • <b>Updated (detoxified) datasets:</b> We release a modified detoxified training dataset, including: <b>100M word Strict</b>, <b>10M word Strict-Small</b>, and <b>100M word + image Multimodal</b>.
</div>
<div class="bullet"> • <b>Compute/epochs limit remains:</b> Competition entries may not conduct more than <b>10 epochs</b> over their training data (this restriction applies to competition entries; workshop-only papers are not required to follow it).
</div>
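<div class="paragraph"> As a rough illustration of the Byte Premium adjustment mentioned in the MultiLingual rule above, a per-language budget computation might look like the sketch below. This is <b>not</b> official challenge code: the premium values are placeholders, and the exact formula used for the official baselines may differ. </div>
<pre>
# Hypothetical sketch: adjust per-language word budgets by Byte Premium.
# Premium values are illustrative placeholders, NOT the official numbers,
# and the official baseline construction may apply the adjustment differently.
BYTE_PREMIUM = {"english": 1.00, "dutch": 1.05, "chinese": 0.85}

def adjusted_budgets(shares, total_words=100_000_000):
    """Scale each language's share of the 100M-word budget by its
    Byte Premium so that budgets are comparable across languages."""
    return {
        lang: int(total_words * share / BYTE_PREMIUM[lang])
        for lang, share in shares.items()
    }

# Example: an even three-way split of the 100M-word budget.
print(adjusted_budgets({"english": 1 / 3, "dutch": 1 / 3, "chinese": 1 / 3}))
</pre>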
<div class="title"> Overview </div> <br>
<img style="float: right; padding-left: 20px; padding-bottom: 15px;" src="./images/model_sizes.png" height="160">
<div class="paragraph"> Huge effort has been put into optimizing LM pretraining at massive scales in the last several years. While growing parameter counts often get the most attention, datasets have also grown by orders of magnitude. For example, <a href="https://arxiv.org/abs/2203.15556v1"> Chinchilla </a> sees 1.4 <b>trillion</b> words during training--well over 10000 words for every one word a 13 year old child has heard in their entire life.</div>
<div class="paragraph"> The goal of this workshop is to incentivize researchers with an interest in pretraining or cognitive modeling to focus their efforts on optimizing pretraining given data limitations inspired by human development. Additionally, we hope to democratize research on pretraining, which is typically thought to be practical only for large industry groups, by drawing attention to open problems that can be addressed on a university budget. </div>
<div class="title" > Why <100 Million Words? </div>
<div class="paragraph"> Focusing on scaled-down pretraining has several potential benefits: <br> First, small-scale pretraining can be a sandbox for developing novel techniques for improving data efficiency. These techniques have the potential to then scale up to larger scales commonly seen in applied NLP or used to enhance current approaches to modeling low-resource languages. Second, improving our ability to train LMs on the same kinds and quantities of data that humans learn from, hopefully, will give us greater access to plausible cognitive models of humans and help us understand what allows humans to acquire language so efficiently. </div>
<div class="title"> Organization Team </div>
<div class = "people">
<div class="bullet">• Leshem Choshen (IBM Research, MIT) </div>
<div class="bullet">• Ryan Cotterell (ETH Zurich) </div>
<div class="bullet">• Mustafa Omer Gul (Cornell University) </div>
<div class="bullet">• Jaap Jumelet (University of Groningen) </div>
<div class="bullet">• Tal Linzen (NYU) </div>
<div class="bullet">• Aaron Mueller (Boston University) </div>
<div class="bullet">• Suchir Salhan (University of Cambridge) </div>
<div class="bullet">• Raj Sanjay Shah (Georgia Institute of Technology) </div>
<div class="bullet">• Alex Warstadt (UCSD) </div>
<div class="bullet">• Ethan Wilcox (Georgetown) </div>
</div>
<br>
The BabyLM Challenge was previously held as a shared task (2023-2025) and a workshop (2025). At the following link, you can find <a href="https://babylm.github.io/papers.html">last year's call for papers</a>.
<div class="footer">
<div style="float:right;"> Images provided by Smashicons </div>
</div>
</body>
</html>