Owl Code Pretraining

Minimal Owl-style training toolkit for building code-specialized pretrained models. It covers code-oriented tokenizer creation, dataset preparation, masked-language-model pretraining utilities, continuation pretraining, and simple code-search fine-tuning.

Shuu12121/Owl-ph2-base-len2048 and related Owl-family checkpoints were trained with an earlier version of this project. This repository extracts and organizes the core functions and scripts needed to reproduce and extend that workflow.

The preferred interface is to call the Python functions with a YAML config. CLI entry points are included for convenience, but the library is designed so that experiments can call the core pieces directly.

Overview

The toolkit builds code-specialized encoder models in two pretraining phases followed by SentenceTransformer fine-tuning and MTEB evaluation.

Raw code datasets (8 languages: function/method bodies, docstrings, comments)
         │
         ▼
   Custom BPE tokenizer  (vocab ≈ 50k + reserved slots)
         │
         ▼
   Tokenized dataset preparation  (split / chunk long examples)
         │
         ▼
 ┌───────────────────────────────────────────────────────┐
 │  Phase 1 — Random-token MLM  (pre_collator: "mlm")    │
 │  Standard BERT-style masking, token-level,            │
 │  mlm_probability tokens replaced with [MASK]          │
 └──────────────────────┬────────────────────────────────┘
                        │  ph1 checkpoint
                        ▼
 ┌───────────────────────────────────────────────────────┐
 │  Phase 2 — Line-level MLM  (continue_collator:        │
 │            "line_no_space")                           │
 │  Masking decision made per source-code line;          │
 │  entire lines are masked rather than random tokens    │
 └──────────────────────┬────────────────────────────────┘
                        │  ph2 checkpoint  (Owl-ph2-*)
                        ▼
   SentenceTransformer wrapping  (owl-wrap-st)
                        │
                        ▼
   Fine-tuning on CodeSearchNet pairs  (owl-finetune-st)
                        │
                        ▼
   MTEB / CodeSearchNetRetrieval evaluation  (owl-eval-mteb)

Why two phases? Phase 1 learns general token-level representations of code with standard random-token MLM. Phase 2 switches to line-level masking, which forces the model to reconstruct complete lines of code from the surrounding context. A line of source code is usually a semantic unit (a statement, expression, or declaration), so masking at line granularity aligns the pretraining objective more closely with downstream code-search tasks, where the model must understand whole semantic units rather than individual tokens.

Architecture. The base encoder is ModernBERT-style (alternating local/global attention, RoPE embeddings, hidden_size=768, num_hidden_layers=22, num_attention_heads=12, intermediate_size=1152) with a custom BPE tokenizer (vocab size 50,368). RoBERTa-style models are also supported via architectures: ["roberta"].

Datasets

The checkpoint families in this repository were pre-trained on different data mixes. All datasets are public and hosted on the Hugging Face Hub. The dataset selection for each run is driven entirely by the YAML config under configs/.

Owl family

Pre-trained on tree-sitter-parsed function/method bodies in 8 languages, sourced from the Shuu12121/*-treesitter-* datasets (docstring + code pairs). Config: configs/owl_config.yaml.

Language	Dataset
Python	`Shuu12121/python-treesitter-filtered-datasetsV2`
JavaScript	`Shuu12121/javascript-treesitter-filtered-datasetsV2`
TypeScript	`Shuu12121/typescript-treesitter-filtered-datasetsV2`
Java	`Shuu12121/java-treesitter-dedupe_doc-filtered-dataset`
Go	`Shuu12121/go-treesitter-dedupe_doc-filtered-dataset`
Ruby	`Shuu12121/ruby-treesitter-filtered-datasetsV2`
PHP	`Shuu12121/php-treesitter-filtered-datasetsV2`
Rust	`Shuu12121/rust-treesitter-filtered-datasetsV2`

Crow family

Pre-trained on whole-file source code in the same 8 languages, one Hugging Face repo per language (Shuu12121/github-file-programs-dataset-*; the text field is content).

python, javascript, typescript, java, go, rust, ruby, php

NightOwl family

Pre-trained from scratch on a diverse multi-source mix, in two phases. Phase 1 uses all sources below; Phase 2 continues on the code-related subsets only.

1. bigcode/starcoder2data-extras — 12 subsets. max_samples caps the rows sampled per subset; max_chars truncates very long documents.

Subset	`max_samples`	Priority	Notes	Phase 2
`kaggle`	2,000,000	high	Notebook-style code	✅
`stackoverflow`	2,000,000	high	Q&A code threads	✅
`issues`	1,000,000	medium	GitHub issue text	✅
`owm`	1,000,000	medium	Open web math	—
`lhq`	3,000,000	high	High-quality text	—
`wikipedia`	1,000,000	medium	Encyclopedic NL	—
`arxiv`	600,000	low	Long LaTeX docs (`max_chars=10,000`)	—
`documentation`	2,000,000	high	Technical docs	✅
`ir_cpp`	100,000	low	C++ IR (`max_chars=5,000`)	—
`ir_low_resource`	100,000	low	Low-resource IR (`max_chars=5,000`)	—
`ir_python`	100,000	low	Python IR (`max_chars=5,000`)	—
`ir_rust`	100,000	low	Rust IR (`max_chars=5,000`)	—

2. Shuu12121/github-file-programs-dataset — 8 languages. Whole-file source code, one repo per language. Used in both phases (Phase 1: up to 1,000,000 files/language; Phase 2: up to 2,000,000 files/language).

python, javascript, typescript, java, go, rust, ruby, php

Quick Start

Install the project, run the smallest end-to-end pipeline, then evaluate the resulting SentenceTransformer checkpoint:

uv sync
uv run owl-train-tokenizer --config configs/config.yaml
uv run owl-pretrain --config configs/config.yaml
uv run owl-wrap-st --base-model ./created_models/code-continue-modernbert-len512_Last --output-dir ./created_models/code-st-base
uv run owl-finetune-st --config configs/finetune_csn_simple.yaml
uv run owl-eval-mteb --config configs/eval_mteb_template.yaml

For a faster smoke test, lower the sample limits in the YAML files first, for example max_samples_per_dataset, train_sample_limit, and the MTEB tasks list.

Tested Environment

Python 3.10 in the provided Dev Container
Linux x86_64
torch==2.8.0
CUDA 12.8 / FlashAttention wheel when the platform marker matches

Install

With uv from a local checkout:

uv sync

The default environment includes the pretraining stack and the SentenceTransformer fine-tuning tools. The FlashAttention wheel for the CUDA 12.8 Dev Container is installed only when the platform markers match Linux x86_64 + Python 3.10.

Development tools:

uv sync --extra dev

The base environment pins torch==2.8.0. The bundled FlashAttention dependency is guarded by platform markers, so it is skipped outside the matching Linux x86_64 + Python 3.10 environment.

Secrets

Training scripts load local secrets from environment variables, with .env as a fallback via python-dotenv.

HF_TOKEN: Hugging Face Hub token, used for private or gated models/datasets.
WANDB_TOKEN: Weights & Biases API key.

For Dev Containers, the recommended approach is to set these on the host and let .devcontainer/devcontainer.json pass them into the container. On a Windows host, for example:

setx HF_TOKEN "hf_..."
setx WANDB_TOKEN "..."

Restart VS Code after setx, then rebuild or reopen the Dev Container.

For a local-only checkout, you can also create an untracked .env file in the repository root:

HF_TOKEN=hf_...
WANDB_TOKEN=...

Do not commit tokens. .env is already ignored by git.
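
The lookup order described above (real environment variables first, .env only as a fallback) can be sketched with the standard library alone. The project itself relies on python-dotenv; `load_env_fallback` below is a hypothetical stand-in that handles only plain KEY=VALUE lines and is not part of this repository.

```python
import os
from pathlib import Path


def load_env_fallback(env_path=".env", environ=None):
    """Fill in variables from a .env file without overriding existing ones.

    Hypothetical helper for illustration; it ignores quoting rules and
    multi-line values that python-dotenv supports.
    """
    environ = os.environ if environ is None else environ
    path = Path(env_path)
    if not path.is_file():
        return environ
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # setdefault keeps a value that is already set in the environment.
        environ.setdefault(key.strip(), value.strip().strip('"'))
    return environ
```

A token exported on the host therefore always wins over the same key in .env, which is why the host-side setup is the recommended path for Dev Containers.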

Python API

Train and convert a tokenizer from YAML:

from owl_code_pretraining import train_tokenizer_from_config

artifacts = train_tokenizer_from_config("configs/config.yaml")
print(artifacts.tokenizer_dir)

Prepare tokenized datasets from the same YAML:

from owl_code_pretraining import prepare_datasets_from_config

artifacts = prepare_datasets_from_config("configs/config.yaml")
print(artifacts.train_dataset_path)

Run the optional simple SentenceTransformer fine-tuning example:

from owl_code_pretraining import finetune_sentence_transformer_from_config

output_dir = finetune_sentence_transformer_from_config(
    "configs/finetune_csn_simple.yaml"
)

Use lower-level components directly:

from train_codemodernbert.trainer import func_create_modernbert_model
from train_codemodernbert.collators import CustomMLMReplaceRatioCollator
from train_codemodernbert.special_tokens import all_special_tokens_from_config

Config

configs/config.yaml controls:

datasets and language subsets
tokenizer output path, vocabulary size, and special tokens
tokenized dataset output paths
model dimensions
pretraining and continuation sequence lengths
collator choices
sample limits and long-example splitting

Special tokens are explicit and configurable:

tokenizer:
  additional_special_tokens: []
  unused_token_count: 363

Pretraining YAML

Minimal example:

datasets:
  - name: "code-search-net/code_search_net"
    languages: ["python"]
    doc_field: "func_documentation_string"
    code_field: "func_code_string"

tokenizer:
  output_json: "bpe_tokenizer_code.json"
  output_dir: "bpe_tokenizer_code"
  vocab_size: 50000
  max_per_lang: 100000
  additional_special_tokens: []
  unused_token_count: 363
  hf_name: null

training:
  pre_output_dir: "./created_models/code-pretrain"
  continue_output_dir: "./created_models/code-continue"
  train_dataset_path: "./tokenized_train_dataset"
  valid_dataset_path: "./tokenized_validation_dataset"
  model_hidden_size: 768
  model_num_layers: 22
  model_num_heads: 12
  model_intermediate_size: 1152
  architectures: ["modernbert"]
  max_samples_per_dataset: 100000
  split_long_examples: true
  pre_collator: "mlm"
  continue_collator: "line_no_space"
  mlm_probability: 0.3
  num_train_epochs: 3
  per_device_train_batch_size: 16
  gradient_accumulation_steps: 16
  learning_rate: 5e-5
  lr_scheduler_type: null
  fp16: true
  push_to_hub: false
  hub_model_id: null
  hub_strategy: "end"
  hub_private_repo: false
  logging_steps: 100
  eval_steps: 1000
  save_steps: 10000
  save_total_limit: 5
  dataloader_num_workers: 16
  dataloader_pin_memory: true
  eval_sample_size: 1600
  skip_pretraining: false
  pretrained_model: null
  pre_max_lengths: [512]
  continue_max_lengths: [512]
  pre_max_length_tokenized: 512

datasets entries map Hugging Face dataset columns into text used for tokenizer training and MLM pretraining.

name: dataset ID passed to datasets.load_dataset
languages: subset/config names for CodeSearchNet-style datasets, or language filters for datasets with a language column. A single entry such as ["javascript"] is valid.
doc_field: natural-language field. Use "" for code-only pretraining
code_field: source-code field

tokenizer controls tokenizer training and conversion.

output_json: raw BPE tokenizer JSON
output_dir: Hugging Face tokenizer directory produced after conversion
vocab_size: BPE vocabulary size
max_per_lang: maximum tokenizer-training samples per language
additional_special_tokens: project-specific tokens to reserve
unused_token_count: number of [UNUSED*] placeholders
hf_name: existing tokenizer/model ID to reuse instead of creating a new tokenizer

training controls tokenized dataset preparation and MLM runs.

pre_output_dir: output prefix for scratch/pretraining phase
continue_output_dir: output prefix for continuation phase
train_dataset_path, valid_dataset_path: base paths for tokenized datasets
model_hidden_size, model_num_layers, model_num_heads, model_intermediate_size: model shape
architectures: ["modernbert"], ["roberta"], or both
max_samples_per_dataset: training sample cap per dataset/language
split_long_examples: split overflowing examples into chunks instead of only truncating. If this is false, each raw sample stays one tokenized example after truncation.
pre_collator: collator for pretraining, usually mlm
continue_collator: collator for continuation, e.g. line_no_space
mlm_probability: masking probability passed to the selected MLM collator
pretrained_model_for_pretraining: optional model path or HF ID used to initialize the pretraining phase instead of starting from a new randomly initialized model
pretrained_model_for_continue: optional model path or HF ID used when skip_pretraining: true or when you want continuation to start from a specific model
per_device_train_batch_size, gradient_accumulation_steps: effective batch size is approximately per_device_train_batch_size * gradient_accumulation_steps * GPU count
learning_rate: optimizer learning rate
lr_scheduler_type: optional Transformers scheduler. Leave null to use the installed Transformers default. Supported values in the current dependency set include linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup, inverse_sqrt, reduce_lr_on_plateau, cosine_with_min_lr, cosine_warmup_with_min_lr, warmup_stable_decay, and greedy.
push_to_hub, hub_model_id, hub_strategy, hub_private_repo: Hugging Face Trainer Hub upload settings
logging_steps, eval_steps, save_steps, save_total_limit: Trainer logging/evaluation/checkpoint cadence
eval_sample_size: validation examples sampled for each evaluation run
pre_max_lengths, continue_max_lengths: sequence lengths to run
pre_max_length_tokenized: max length used when preparing tokenized datasets
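
Two of the keys above are easier to see with numbers. The following is a minimal sketch, not the repository's implementation: `chunk_token_ids` and `effective_batch_size` are hypothetical names, and the chunking ignores special tokens and any overlap between chunks.

```python
def chunk_token_ids(token_ids, max_length, split_long_examples=True):
    """Return one or more examples of at most max_length tokens."""
    if not split_long_examples:
        # Truncation only: one raw sample stays one tokenized example.
        return [token_ids[:max_length]]
    # Splitting: the overflow becomes additional examples instead of being dropped.
    chunks = [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]
    return chunks or [[]]


def effective_batch_size(per_device, grad_accum, gpu_count):
    """Approximate number of examples contributing to one optimizer step."""
    return per_device * grad_accum * gpu_count


# With the values from the minimal example (16 x 16) on a single GPU:
print(effective_batch_size(16, 16, 1))
```

With split_long_examples: true a 1,300-token function at max length 512 yields three examples (512, 512, 276 tokens) in this sketch; with false it yields one 512-token example and the rest is discarded.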

TrainingArguments can be configured at three levels:

training:
  training_args:
    per_device_train_batch_size: 16
    gradient_accumulation_steps: 16
    learning_rate: 5e-5
  pre_training_args:
    num_train_epochs: 3
    logging_dir: "./logs/pre"
  continue_training_args:
    num_train_epochs: 1
    logging_dir: "./logs/continue"

training_args applies to both phases. pre_training_args and continue_training_args override it for their respective phases. Commonly useful keys include num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps, learning_rate, lr_scheduler_type, warmup_steps, warmup_ratio, weight_decay, fp16, bf16, tf32, gradient_checkpointing, logging_steps, eval_steps, save_steps, save_strategy, save_total_limit, load_best_model_at_end, metric_for_best_model, push_to_hub, and hub_model_id.
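
The override order can be pictured as a plain dictionary merge. This is a sketch under the assumption that the merge is shallow and key-by-key; `resolve_training_args` is a hypothetical helper, and flat keys placed directly under training are not modeled here.

```python
def resolve_training_args(training_cfg, phase):
    """Merge shared and phase-specific TrainingArguments overrides.

    phase is "pre" or "continue". Hypothetical helper for illustration.
    """
    if phase not in ("pre", "continue"):
        raise ValueError(f"unknown phase: {phase}")
    merged = dict(training_cfg.get("training_args") or {})
    # Phase-specific keys win over the shared block.
    merged.update(training_cfg.get(f"{phase}_training_args") or {})
    return merged


training_cfg = {
    "training_args": {"per_device_train_batch_size": 16, "learning_rate": 5e-5},
    "pre_training_args": {"num_train_epochs": 3, "logging_dir": "./logs/pre"},
    "continue_training_args": {"num_train_epochs": 1, "logging_dir": "./logs/continue"},
}
print(resolve_training_args(training_cfg, "continue"))
```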

Model initialization keys interact as follows:

Key	Used for	Notes
`pretrained_model`	Legacy/default model source	Still accepted as a general fallback. Prefer the phase-specific keys below when the two phases should start from different checkpoints.
`pretrained_model_for_pretraining`	Scratch/pretraining phase initialization	Use a HF ID or local path to start pretraining from existing weights instead of a new randomly initialized model.
`pretrained_model_for_continue`	Continuation phase initialization	Used when `skip_pretraining: true` or when continuation should start from a specific checkpoint.
`skip_pretraining`	Phase control	When `true`, skip the scratch/pretraining phase and run only continuation from `pretrained_model_for_continue` or the fallback model.

Tokenized datasets are saved both as combined outputs and as grouped per-language outputs. For example, a base path of ./tokenized_train_dataset-len512 also produces paths such as ./tokenized_train_dataset-len512-by-language/python and ./tokenized_train_dataset-len512-by-language/javascript when those languages are configured.

If a dataset has a validation split, it is used directly. If not, the train split is partitioned by repository when a repo column is available; otherwise the whole train split is used for training and no extra validation/test split is held out.
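
Splitting by repository keeps near-duplicate functions from the same project out of both sides of the split. A minimal sketch of the idea follows; `split_by_repo`, `valid_fraction`, and `seed` are assumptions for illustration, and the actual ratio used by the toolkit is not stated here.

```python
import random
from collections import defaultdict


def split_by_repo(rows, valid_fraction=0.05, seed=0, repo_key="repo"):
    """Split rows so that no repository appears in both train and validation."""
    by_repo = defaultdict(list)
    for row in rows:
        by_repo[row[repo_key]].append(row)
    repos = sorted(by_repo)
    random.Random(seed).shuffle(repos)
    # Hold out at least one repository when there is more than one.
    n_valid = max(1, int(len(repos) * valid_fraction)) if len(repos) > 1 else 0
    valid = [row for repo in repos[:n_valid] for row in by_repo[repo]]
    train = [row for repo in repos[n_valid:] for row in by_repo[repo]]
    return train, valid
```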

A fuller reusable template is available at configs/pretrain_template.yaml.

Available collators:

mlm: Hugging Face DataCollatorForLanguageModeling — standard random-token MLM.
line_no_space: line-based MLM, excluding whitespace-only tokens from masking.
line_include_space: line-based MLM, including leading indent tokens in masked positions.
mlm_replace: experimental collator combining line-level masking, random-line replacement, and keep-as-is in an 80/10/10 ratio.

For an Owl ph2-style continuation run, continue_collator: "line_no_space" is the closest match.

How line-level masking works (`line_no_space` / `line_include_space`)

Standard MLM samples a fixed fraction of individual tokens at random. Line-level MLM shifts the masking decision to the line level:

Segment by newline tokens. The tokenized sequence is split into segments at every token whose string representation contains the newline character Ċ (the GPT-2/RoBERTa BPE encoding of \n). Each segment is one source-code line.
Per-line Bernoulli draw. Each segment is independently masked with probability mlm_probability (default 0.30). If the draw fires, every token in that segment is replaced with [MASK] and recorded as a prediction target.
Whitespace exclusion (line_no_space). Before masking, tokens that consist solely of space or tab characters (BPE strings matching Ġ+ or ĉ+, or tokens whose decoded form contains only ' ' and '\t') are skipped. Leading indentation is left unmasked while the semantic content of the line is hidden.
Special-token protection. The first and last token positions ([CLS] / [SEP] or equivalent) are never masked regardless of the line decision.
Labels. Unmasked positions receive a label of -100 (ignored by cross-entropy). Masked positions receive the original token id as the label.

Concretely, given a small Python snippet:

[CLS] def foo ( x ) : Ċ Ġ Ġ Ġ Ġ return x * 2 Ċ [SEP]

The collator identifies two lines — def foo(x): and return x * 2 — and draws independently for each. If the second line is selected:

input:  [CLS]  def    foo    (      x      )      :      Ċ      Ġ      Ġ      Ġ      Ġ      [MASK] [MASK] [MASK] [MASK] Ċ      [SEP]
labels: -100   -100   -100   -100   -100   -100   -100   -100   -100   -100   -100   -100   return x      *      2      -100   -100

The indentation tokens Ġ Ġ Ġ Ġ remain unmasked (they carry structural but not semantic information), while the four content tokens are hidden and must be predicted from the surrounding lines.
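
The steps above can be sketched for a single sequence as follows. This is an illustration of the technique, not the repository's collator: `mask_lines` is a hypothetical function that works on token strings, treats any token containing Ċ as a line delimiter, and leaves that delimiter unmasked (as in the worked example above).

```python
import random

NEWLINE_MARK = "Ċ"              # byte-level BPE symbol for "\n"
SPACE_TAB_MARKS = {"Ġ", "ĉ"}    # byte-level BPE symbols for " " and "\t"
IGNORE_INDEX = -100             # label ignored by cross-entropy


def is_whitespace_only(token):
    """True for tokens made up solely of space/tab symbols."""
    return bool(token) and all(ch in SPACE_TAB_MARKS for ch in token)


def mask_lines(tokens, token_ids, mask_id, mlm_probability=0.3,
               skip_whitespace=True, rng=None):
    """Mask whole lines of one sequence; return (masked_ids, labels)."""
    rng = random.Random() if rng is None else rng
    masked_ids = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    last = len(tokens) - 1
    # One Bernoulli draw per line, starting with the first line.
    mask_this_line = rng.random() < mlm_probability
    for i, token in enumerate(tokens):
        if i == 0 or i == last:
            continue  # special-token protection ([CLS] / [SEP])
        if NEWLINE_MARK in token:
            # The delimiter closes the current line; draw again for the next.
            mask_this_line = rng.random() < mlm_probability
            continue
        if skip_whitespace and is_whitespace_only(token):
            continue  # line_no_space: indentation stays visible
        if mask_this_line:
            labels[i] = token_ids[i]
            masked_ids[i] = mask_id
    return masked_ids, labels
```

Passing skip_whitespace=False corresponds roughly to line_include_space, where the indentation tokens of a selected line are masked as well.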

`mlm_replace` collator

CustomMLMReplaceRatioCollator applies the same line-level segmentation but uses an 80/10/10 split on each selected line:

Draw	Action
80 %	Replace all maskable tokens in the line with `[MASK]` (standard masking)
10 %	Substitute the entire line with a randomly chosen line from the same batch (random-line replacement)
10 %	Keep the original tokens but still record them as prediction targets

The random-line replacement variant is borrowed from the original BERT token-level strategy, applied at line granularity, to prevent the model from learning to predict masked lines purely by position.
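
A sketch of the 80/10/10 decision for one already-selected line is shown below. `corrupt_line` is a hypothetical helper, and two details are assumptions rather than documented behavior: the donor line is cut or padded with [MASK] so the sequence length does not change, and the original tokens are recorded as labels in all three cases.

```python
import random


def corrupt_line(line_ids, donor_ids, mask_id, rng):
    """Apply the 80/10/10 rule to one selected line.

    Returns (new_ids, labels); labels always hold the original ids.
    """
    draw = rng.random()
    if draw < 0.8:
        # 80 %: standard masking of the whole line.
        new_ids = [mask_id] * len(line_ids)
    elif draw < 0.9:
        # 10 %: random-line replacement, fitted to the original line length
        # (assumption: cut the donor, or pad it with [MASK]).
        padded = list(donor_ids) + [mask_id] * len(line_ids)
        new_ids = padded[:len(line_ids)]
    else:
        # 10 %: keep the tokens but still predict them.
        new_ids = list(line_ids)
    return new_ids, list(line_ids)


rng = random.Random(0)
print(corrupt_line([11, 12, 13], donor_ids=[21, 22], mask_id=4, rng=rng))
```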

Fine-Tuning YAML

configs/finetune_csn_simple.yaml configures the optional simple SentenceTransformer run:

dataset:
  name: "code-search-net/code_search_net"
  format: "pairs"
  split: "train"
  languages: ["python", "javascript", "ruby", "go", "java", "php"]
  query_field: "func_documentation_string"
  code_field: "func_code_string"
  remove_python_docstrings: true
  train_sample_limit: 100000

model:
  pretrained_model: "./created_models/code-continue-modernbert-len512_Last"
  load_strategy: "auto"
  output_dir: "./created_models/code-st"
  pooling: "cls"
  max_seq_length: 512

training:
  epochs: 1
  batch_size: 32
  learning_rate: 0.00002
  scheduler: "WarmupLinear"
  warmup_ratio: 0.1
  loss: "mnrl"
  mnrl_scale: 20.0
  cached_mnrl_mini_batch_size: 32
  weight_decay: 0.01
  fp16: false
  bf16: true

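remove_python_docstrings: true removes Python module/class/function docstrings from code texts before fine-tuning. With the config above the query is the function's docstring, so leaving it in the code would let the model match a pair simply by finding the query repeated verbatim.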
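
One way to strip docstrings is through the standard ast module. The sketch below is an assumption about how this could be done, not the repository's implementation; `strip_docstrings` is a hypothetical name. Note that ast.unparse (Python 3.9+) also normalizes formatting and drops comments, and ast.parse raises SyntaxError on snippets that are not valid Python 3, so real data needs a fallback.

```python
import ast


def strip_docstrings(source):
    """Return source with module/class/function docstrings removed."""
    tree = ast.parse(source)
    doc_owners = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if not isinstance(node, doc_owners) or not node.body:
            continue
        first = node.body[0]
        if (isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)):
            # Keep the body non-empty so the result is still valid Python.
            node.body = node.body[1:] or [ast.Pass()]
    return ast.unparse(tree)


code = 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n'
print(strip_docstrings(code))
```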

Fine-tuning supports multiple dataset shapes:

format: "pairs": rows contain one query field and one positive/code field, used with mnrl.
format: "hard_negatives": rows contain query, positive(s), and negative(s), e.g. query, pos/positive/positives, and neg/negative/negatives/hard_negatives.
format: "beir": expects a BEIR/ranking dataset that has already been materialized into query/positive/negative text columns. Raw BEIR corpus/queries/qrels data should be converted first.
format: "kd_scores": split-config KD/ranking datasets with configs like queries_python, documents_python, and scores_python. scores_* rows should contain query_id plus ranked document_ids; rank 0 is treated as the positive by default.
format: "auto": detects hard-negative columns when present; otherwise falls back to pairs.
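
Two of these shapes can be sketched in a few lines. Both helpers are hypothetical: `detect_format` mirrors the "auto" rule using the negative-column names listed above, and `kd_rows_to_triplets` assumes the queries_* and documents_* configs have already been loaded into id-to-text dictionaries, since their column names are not specified here.

```python
NEGATIVE_COLUMNS = ("neg", "negative", "negatives", "hard_negatives")


def detect_format(column_names):
    """Return "hard_negatives" if any negative column exists, else "pairs"."""
    columns = set(column_names)
    if any(name in columns for name in NEGATIVE_COLUMNS):
        return "hard_negatives"
    return "pairs"


def kd_rows_to_triplets(queries, documents, score_rows,
                        positive_rank=0, max_negatives=None):
    """Turn ranked scores_* rows into query/positive/negatives records."""
    triplets = []
    for row in score_rows:
        ranked = row["document_ids"]
        if positive_rank >= len(ranked):
            continue  # no document at the positive rank
        # Every other ranked document is treated as a (hard) negative.
        negatives = [d for i, d in enumerate(ranked) if i != positive_rank]
        if max_negatives is not None:
            negatives = negatives[:max_negatives]
        triplets.append({
            "query": queries[row["query_id"]],
            "positive": documents[ranked[positive_rank]],
            "negatives": [documents[d] for d in negatives],
        })
    return triplets
```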

model.pretrained_model selects the base encoder or an existing SentenceTransformer checkpoint. With load_strategy: "auto", a local SentenceTransformer directory with modules.json is loaded directly; otherwise the model is wrapped with a Transformer + Pooling module.

pooling can be cls, mean, max, mean_sqrt_len, weightedmean, lasttoken, or a combination such as mean+max. You can also append a Dense projection with projection_dim and unit-length normalization with normalize_embeddings: true.

query_prompt and code_prompt are saved as SentenceTransformer prompts; code_prompt is also saved as the document prompt so that encode_document() works for code embeddings.
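
To make the pooling options concrete, here is a dependency-free sketch of cls, mean, max, lasttoken, and a combined mode on plain lists. It illustrates the arithmetic only; the function names are hypothetical, and the real modules operate on batched tensors.

```python
import math


def pool(token_vectors, attention_mask, mode="mean"):
    """Pool per-token vectors into one sentence vector."""
    # Padding positions (mask == 0) are excluded from every statistic.
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if mode == "cls":
        return list(token_vectors[0])
    if mode == "lasttoken":
        return list(kept[-1])
    dims = range(len(kept[0]))
    if mode == "mean":
        return [sum(v[d] for v in kept) / len(kept) for d in dims]
    if mode == "max":
        return [max(v[d] for v in kept) for d in dims]
    raise ValueError(f"unsupported pooling mode: {mode}")


def pool_combined(token_vectors, attention_mask, spec="mean+max"):
    """Concatenate several pooling modes, e.g. "mean+max"."""
    out = []
    for mode in spec.split("+"):
        out.extend(pool(token_vectors, attention_mask, mode))
    return out


def l2_normalize(vec):
    """Scale a vector to unit length (no-op for the zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)
```

A combined mode concatenates its parts, so mean+max on a 768-dimensional encoder yields 1,536 dimensions; a Dense projection with projection_dim: 768 brings the embedding back to the encoder width.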

To create a reusable SentenceTransformer directory from a local/HF encoder before fine-tuning:

uv run owl-wrap-st \
  --base-model ./created_models/code-continue-modernbert-len512_Last \
  --output-dir ./created_models/code-continue-modernbert-len512_ST \
  --pooling mean+max \
  --max-seq-length 512 \
  --projection-dim 768 \
  --projection-activation tanh \
  --normalize-embeddings \
  --query-prompt "query: " \
  --code-prompt "code: "

Fine-tuning learning-rate settings:

learning_rate: optimizer LR.
scheduler: SentenceTransformers scheduler, such as WarmupLinear, WarmupCosine, WarmupConstant, or constantlr.
warmup_ratio or warmup_steps: warmup can be ratio-based or explicit steps.
loss: mnrl, cached_mnrl, triplet, or cosine. The longer aliases multiple_negatives_ranking and cached_multiple_negatives_ranking are also accepted.
mnrl_scale or scale: similarity scale for mnrl and cached_mnrl; SentenceTransformers defaults to 20.0.
cached_mnrl_mini_batch_size, gather_across_devices: extra options for cached_mnrl. cached_mnrl_mini_batch_size controls the cached loss mini-batch size separately from the dataloader batch_size.
fp16, bf16: precision flags passed to SentenceTransformerTrainingArguments. Fine-tuning defaults to bf16 when no precision setting is provided. You can also use precision: "fp16", precision: "bf16", or precision: "fp32". Do not enable both fp16 and bf16.
batch_group_by_language: when true, each batch is sampled from one language only. Batch order is still shuffled across languages.
eval_csn: when true, builds a CodeSearchNet validation evaluator by language and reports csn-avg_cosine_ndcg@10 as the language-averaged primary metric. Use eval_languages and eval_sample_limit to keep it small.
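
The role of mnrl_scale is easiest to see in the loss itself. The sketch below computes a multiple-negatives ranking loss with in-batch negatives on plain Python lists: each query should score its own code snippet higher than every other snippet in the batch. It is an illustration of the technique with hypothetical function names, not the SentenceTransformers implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mnrl_loss(query_vecs, code_vecs, scale=20.0):
    """Mean cross-entropy where code i is the positive for query i."""
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine(query, code) for code in code_vecs]
        # Numerically stable log-sum-exp over the whole batch.
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
codes = [[1.0, 0.0], [0.0, 1.0]]
print(mnrl_loss(queries, codes))
```

A larger scale sharpens the softmax over the batch, so small similarity gaps between the positive and the in-batch negatives translate into larger loss differences. Because every other item in the batch acts as a negative, batch composition matters, which is what batch_group_by_language controls.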

Staged fine-tuning is supported with a top-level stages list. Each stage inherits the top-level dataset, model, and training sections, then overrides only the keys you provide. The output of each stage becomes the input model for the next stage.

stages:
  - name: "pairs-warmup"
    training:
      loss: "mnrl"
      batch_group_by_language: true
  - name: "kd-hard-negative"
    dataset:
      name: "Shuu12121/owl_code_search_hard_negative_datasets_V2_kd"
      format: "kd_scores"
      languages: ["python", "javascript"]
    training:
      loss: "triplet"
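
The inheritance rule can be sketched as a recursive merge of each stage over the top-level sections, with the previous stage's output directory fed in as the next stage's model. `deep_merge` and `expand_stages` are hypothetical helpers; in particular, chaining through model.output_dir is an assumption about how "the output of each stage" is located.

```python
import copy


def deep_merge(base, override):
    """Return base with override applied recursively; inputs are not mutated."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def expand_stages(config):
    """Resolve every stage into a full dataset/model/training config."""
    base = {k: copy.deepcopy(config.get(k) or {})
            for k in ("dataset", "model", "training")}
    resolved = []
    previous_output = None
    for stage in config.get("stages", []):
        overrides = {k: v for k, v in stage.items() if k != "name"}
        # Each stage starts from the top-level sections, not the previous stage.
        stage_cfg = deep_merge(base, overrides)
        if previous_output is not None:
            stage_cfg["model"]["pretrained_model"] = previous_output
        previous_output = stage_cfg["model"].get("output_dir")
        resolved.append((stage.get("name"), stage_cfg))
    return resolved
```

In the example above, kd-hard-negative therefore does not inherit batch_group_by_language: true from pairs-warmup, because that key was set only inside the first stage.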

Templates for these dataset shapes and for staged runs are available at configs/finetune_hard_negatives_template.yaml, configs/finetune_beir_template.yaml, configs/finetune_kd_scores_template.yaml, and configs/finetune_staged_template.yaml. Datasets produced by hard-negative ranking dataset makers are supported when they expose one of the following: query/positive/negative text columns; equivalent field names configured through query_field, positive_field, and negative_field; or the split-config queries_*/documents_*/scores_* KD layout used by Shuu12121/owl_code_search_hard_negative_datasets_V2_kd.

MTEB Evaluation

configs/eval_mteb_template.yaml provides a small MTEB evaluation scaffold for SentenceTransformer checkpoints:

uv run owl-eval-mteb --config configs/eval_mteb_template.yaml

To evaluate every model produced by configs/batch_experiment.yaml and report the best learning rate per model:

uv run python scripts/batch_eval_mteb.py --config configs/batch_experiment.yaml

The batch evaluator writes per-run MTEB summaries under mteb_results/{model}{lr-suffix}/, then saves the best-across-LR comparison to mteb_results/batch_summary.tsv and mteb_results/batch_best_scores.json.

Minimal evaluation YAML:

evaluation:
  model: "./created_models/code-st-kd"
  model_loader: "sentence_transformer"
  tasks: ["CodeSearchNetRetrieval", "COIRCodeSearchNetRetrieval", "CosQA"]
  languages: []
  eval_splits: []
  output_dir: "./mteb_results/code-st-kd"
  summary_path: "./mteb_results/code-st-kd/summary.json"
  batch_size: 32
  normalize_embeddings: false
  prompt_name: "query"
  overwrite_strategy: "only-missing"

For code-search models, start with retrieval tasks such as CodeSearchNetRetrieval, COIRCodeSearchNetRetrieval, CosQA, StackOverflowQA, AppsRetrieval, and CodeEditSearchRetrieval, then expand the list for full COIR/MTEB coverage.

model_loader can be sentence_transformer for local/HF SentenceTransformer checkpoints or mteb for models loaded through MTEB's registry. overwrite_strategy is passed to MTEB's result cache; only-missing is the safe default for iterative experiments. prompt_name is forwarded to SentenceTransformer encode() through MTEB encode_kwargs; use a saved prompt such as query when you want a global prompt during evaluation, or leave it null to use the model's default behavior.

For a local raw encoder checkpoint, first wrap it as a SentenceTransformer directory with owl-wrap-st, then point evaluation.model at that directory. Results are cached under evaluation.output_dir, and a compact JSON summary is written to evaluation.summary_path.

Optional CLI

The package also exposes convenience commands:

uv run owl-train-tokenizer --config configs/config.yaml
uv run owl-pretrain --config configs/config.yaml
uv run owl-wrap-st --base-model ./created_models/code-continue-modernbert-len512_Last --output-dir ./created_models/code-st-base
uv run owl-finetune-st --config configs/finetune_csn_simple.yaml
uv run owl-eval-mteb --config configs/eval_mteb_template.yaml

Publishing

Publishing is mechanically simple, but treat it as a release:

uv build
uv run twine check dist/*
uv run twine upload dist/*

Before uploading, create a clean GitHub repository, verify that no generated artifacts or private configs are included, and tag the version.

For PyPI, prefer a project-scoped API token or trusted publishing rather than a personal password.

Tests

uv run --extra dev pytest -q tests

Citation

If you use this toolkit, the pre-trained models, or our methodology in your research, please cite this repository:

@misc{owl_code_pretraining,
  author       = {Shun0212},
  title        = {Owl Code Pretraining: A minimal toolkit for building code-specialized pretrained encoders},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/Shun0212/codeowl-training-core}}
}

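Japanese Version

Owl Code Pretraining is a toolkit for building code-specialized pretrained models. BPE tokenizer creation, dataset preprocessing, MLM pretraining with ModernBERT / RoBERTa, continuation pretraining, and fine-tuning for code search can all be done in this repository.

Shuu12121/Owl-ph2-base-len2048 and the other Owl-family checkpoints were trained with the code that preceded this project. This repository extracts and organizes the core functions and scripts so that the training flow is easy to reproduce and extend.

The basic usage is to write the settings in YAML and call the functions from Python. A CLI is also provided, but the emphasis is on being easy to embed in research and experiment code.

Overview

Pretraining is split into two phases (ph1: random-token MLM → ph2: line-level MLM). The subsequent SentenceTransformer fine-tuning and MTEB evaluation can be run from the same repository.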

Code datasets (8 languages, function-level docstring + code pairs)
         │
         ▼
   Custom BPE tokenizer  (vocab ≈ 50k + reserved)
         │
         ▼
   Tokenized dataset creation  (split / chunk long samples)
         │
         ▼
 ┌───────────────────────────────────────────────────────┐
 │  Phase 1: Random-token MLM  (pre_collator: "mlm")     │
 │  Standard BERT-style token-level masking              │
 │  mlm_probability fraction of tokens set to [MASK]     │
 └──────────────────────┬────────────────────────────────┘
                        │  ph1 checkpoint
                        ▼
 ┌───────────────────────────────────────────────────────┐
 │  Phase 2: Line-level MLM  (continue_collator:         │
 │            "line_no_space")                           │
 │  Masking unit changes from token to line;             │
 │  an entire source line is masked at once              │
 └──────────────────────┬────────────────────────────────┘
                        │  ph2 checkpoint  (Owl-ph2-*)
                        ▼
   SentenceTransformer wrapping  (owl-wrap-st)
                        │
                        ▼
   Fine-tuning on CodeSearchNet pair data  (owl-finetune-st)
                        │
                        ▼
   MTEB / CodeSearchNetRetrieval evaluation  (owl-eval-mteb)

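Why two phases? Phase 1 learns basic representations of the code vocabulary with random-token MLM. Phase 2 switches to line-level masking and asks the model to restore an entire line from the surrounding context. A line of source code tends to be a unit of meaning such as a statement, expression, or declaration, so hiding whole lines more directly trains the ability to understand units of meaning from context, which code search requires.

Architecture. The base encoder uses a ModernBERT-style configuration (alternating local/global attention, RoPE embeddings, hidden_size=768, num_hidden_layers=22, num_attention_heads=12, intermediate_size=1152) combined with a custom BPE tokenizer with a vocabulary of 50,368. Specifying architectures: ["roberta"] also trains RoBERTa-style models.

Training Datasets

The checkpoint families in this repository were each pretrained on a different combination of data. All datasets are public and hosted on the Hugging Face Hub. Which datasets are used is controlled entirely by the YAML files under configs/.

Owl family

Trained on code cut out at function/method granularity with tree-sitter, for 8 languages, from the Shuu12121/*-treesitter-* datasets (docstring + code pairs). Config file: configs/owl_config.yaml.

Language	Dataset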
Python	`Shuu12121/python-treesitter-filtered-datasetsV2`
JavaScript	`Shuu12121/javascript-treesitter-filtered-datasetsV2`
TypeScript	`Shuu12121/typescript-treesitter-filtered-datasetsV2`
Java	`Shuu12121/java-treesitter-dedupe_doc-filtered-dataset`
Go	`Shuu12121/go-treesitter-dedupe_doc-filtered-dataset`
Ruby	`Shuu12121/ruby-treesitter-filtered-datasetsV2`
PHP	`Shuu12121/php-treesitter-filtered-datasetsV2`
Rust	`Shuu12121/rust-treesitter-filtered-datasetsV2`

Crow family

Trained on file-level source code in the same 8 languages. There is one Hugging Face repository per language (Shuu12121/github-file-programs-dataset-*; the text field is content).

python, javascript, typescript, java, go, rust, ruby, php

NightOwl family

Pretrained from scratch in two phases on a diverse multi-source mix. Phase 1 uses all of the sources below, and Phase 2 continues training on the code-related subsets only.

1. bigcode/starcoder2data-extras: 12 subsets. max_samples is the upper limit on rows sampled per subset, and max_chars is the truncation length in characters for very long documents.

Subset	`max_samples`	Priority	Notes	Phase 2
`kaggle`	2,000,000	high	Notebook-style code	✅
`stackoverflow`	2,000,000	high	Q&A code threads	✅
`issues`	1,000,000	medium	GitHub issue text	✅
`owm`	1,000,000	medium	Open web math	—
`lhq`	3,000,000	high	High-quality text	—
`wikipedia`	1,000,000	medium	Encyclopedic natural language	—
`arxiv`	600,000	low	Long LaTeX documents (`max_chars=10,000`)	—
`documentation`	2,000,000	high	Technical documentation	✅
`ir_cpp`	100,000	low	C++ IR (`max_chars=5,000`)	—
`ir_low_resource`	100,000	low	Low-resource-language IR (`max_chars=5,000`)	—
`ir_python`	100,000	low	Python IR (`max_chars=5,000`)	—
`ir_rust`	100,000	low	Rust IR (`max_chars=5,000`)	—

2. Shuu12121/github-file-programs-dataset: 8 languages. File-level source code, one repository per language. Used in both phases (Phase 1: up to 1,000,000 files per language; Phase 2: up to 2,000,000 files).

python, javascript, typescript, java, go, rust, ruby, php

Quick Start

The following commands run everything from tokenizer creation through pretraining, SentenceTransformer wrapping, fine-tuning, and MTEB evaluation.

uv sync
uv run owl-train-tokenizer --config configs/config.yaml
uv run owl-pretrain --config configs/config.yaml
uv run owl-wrap-st --base-model ./created_models/code-continue-modernbert-len512_Last --output-dir ./created_models/code-st-base
uv run owl-finetune-st --config configs/finetune_csn_simple.yaml
uv run owl-eval-mteb --config configs/eval_mteb_template.yaml

To only check that everything works first, reduce max_samples_per_dataset, train_sample_limit, and the MTEB tasks in the YAML before running.

Tested Environment

Python 3.10 in the bundled Dev Container
Linux x86_64
torch==2.8.0
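CUDA 12.8 / FlashAttention wheel in environments where the platform marker matches

For now, the project mainly targets the environment it was verified on locally. Supported versions will be broadened over time.

Install

When using a locally checked-out repository: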

uv sync

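The default environment includes the packages needed for pretraining and SentenceTransformer fine-tuning. The FlashAttention wheel is installed only in a CUDA 12.8 environment on Linux x86_64 + Python 3.10.

To also install the development dependencies: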

uv sync --extra dev

FlashAttention is guarded by platform markers, so it is skipped automatically on unsupported setups such as CPU-only environments, Windows, and Python 3.11+.

Python API

Train a tokenizer from the YAML config and convert it to Hugging Face format.

from owl_code_pretraining import train_tokenizer_from_config

artifacts = train_tokenizer_from_config("configs/config.yaml")
print(artifacts.tokenizer_dir)

The same YAML can also be used to build the tokenized datasets for pretraining.

from owl_code_pretraining import prepare_datasets_from_config

artifacts = prepare_datasets_from_config("configs/config.yaml")
print(artifacts.train_dataset_path)

A simple SentenceTransformer fine-tuning run is also available as a function call.

from owl_code_pretraining import finetune_sentence_transformer_from_config

output_dir = finetune_sentence_transformer_from_config(
    "configs/finetune_csn_simple.yaml"
)

For finer control, import the low-level building blocks directly.

from train_codemodernbert.trainer import func_create_modernbert_model
from train_codemodernbert.collators import CustomMLMReplaceRatioCollator
from train_codemodernbert.special_tokens import all_special_tokens_from_config

Pretraining YAML

A minimal example:

datasets:
  - name: "code-search-net/code_search_net"
    languages: ["python"]
    doc_field: "func_documentation_string"
    code_field: "func_code_string"

tokenizer:
  output_json: "bpe_tokenizer_code.json"
  output_dir: "bpe_tokenizer_code"
  vocab_size: 50000
  max_per_lang: 100000
  additional_special_tokens: []
  unused_token_count: 363
  hf_name: null

training:
  pre_output_dir: "./created_models/code-pretrain"
  continue_output_dir: "./created_models/code-continue"
  train_dataset_path: "./tokenized_train_dataset"
  valid_dataset_path: "./tokenized_validation_dataset"
  model_hidden_size: 768
  model_num_layers: 22
  model_num_heads: 12
  model_intermediate_size: 1152
  architectures: ["modernbert"]
  max_samples_per_dataset: 100000
  split_long_examples: true
  pre_collator: "mlm"
  continue_collator: "line_no_space"
  mlm_probability: 0.3
  num_train_epochs: 3
  per_device_train_batch_size: 16
  gradient_accumulation_steps: 16
  learning_rate: 5e-5
  lr_scheduler_type: null
  fp16: true
  push_to_hub: false
  hub_model_id: null
  hub_strategy: "end"
  hub_private_repo: false
  logging_steps: 100
  eval_steps: 1000
  save_steps: 10000
  save_total_limit: 5
  dataloader_num_workers: 16
  dataloader_pin_memory: true
  eval_sample_size: 1600
  skip_pretraining: false
  pretrained_model: null
  pre_max_lengths: [512]
  continue_max_lengths: [512]
  pre_max_length_tokenized: 512
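
To get a feel for the batch settings in this config, the sketch below multiplies the per-device batch size by the gradient accumulation steps and the number of GPUs, which is the usual approximation of the effective batch size with the Hugging Face Trainer. The function name and the GPU counts are illustrative and not part of the toolkit.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Approximate number of samples that contribute to one optimizer step."""
    return per_device * grad_accum * num_gpus


# Values from the minimal config: per_device_train_batch_size=16,
# gradient_accumulation_steps=16. The GPU counts are examples only.
for gpus in (1, 4, 8):
    print(f"{gpus} GPU(s): {effective_batch_size(16, 16, gpus)} samples per step")
```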

In datasets, specify which columns of each Hugging Face dataset are treated as natural language and as code.

name: dataset ID passed to datasets.load_dataset
languages: subset/config names for CodeSearchNet-style datasets, or filter values for datasets that have a language column. A single entry such as ["javascript"] is fine
doc_field: natural-language field such as the docstring. Set it to "" to pretrain on code only
code_field: field that contains the code body

tokenizer holds the settings for tokenizer training and conversion.

output_json: where the freshly trained BPE tokenizer JSON is saved
output_dir: directory for the tokenizer converted to Hugging Face format
vocab_size: BPE vocabulary size
max_per_lang: maximum number of samples per language used for tokenizer training
additional_special_tokens: extra special tokens to reserve
unused_token_count: number of [UNUSED*] tokens to reserve
hf_name: Hugging Face ID of an existing tokenizer/model to reuse. When null, a new tokenizer is created

training configures tokenized dataset creation and MLM training.

pre_output_dir: output path prefix for the scratch/pretraining phase
continue_output_dir: output path prefix for the continuation phase
train_dataset_path, valid_dataset_path: where the tokenized datasets are saved
model_hidden_size, model_num_layers, model_num_heads, model_intermediate_size: settings that determine the model size
architectures: modernbert, roberta, or both
max_samples_per_dataset: cap on training samples per dataset/language
split_long_examples: whether long samples are split into multiple chunks instead of being truncated. When false, each raw sample produces exactly one tokenized sample
pre_collator: collator for the first pretraining phase. Usually mlm
continue_collator: collator for the continuation phase. The Owl ph2 models use line_no_space
mlm_probability: masking probability passed to the MLM collator
pretrained_model_for_pretraining: path or HF ID to use when the pretraining phase should start from an existing model instead of random initialization
pretrained_model_for_continue: path or HF ID to use when skip_pretraining: true is set, or when the continuation phase should start from a specific model
per_device_train_batch_size, gradient_accumulation_steps: the effective batch size is roughly per_device_train_batch_size * gradient_accumulation_steps * number of GPUs
learning_rate: optimizer learning rate
lr_scheduler_type: any Transformers scheduler. When null, the default of the installed Transformers version applies. With the current dependencies, the available values are linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup, inverse_sqrt, reduce_lr_on_plateau, cosine_with_min_lr, cosine_warmup_with_min_lr, warmup_stable_decay, and greedy
push_to_hub, hub_model_id, hub_strategy, hub_private_repo: standard Hugging Face Trainer settings for Hub upload
logging_steps, eval_steps, save_steps, save_total_limit: intervals for logging, evaluation, and checkpoint saving, plus the number of checkpoints to keep
eval_sample_size: number of validation examples sampled at evaluation time
pre_max_lengths, continue_max_lengths: sequence lengths to run
pre_max_length_tokenized: maximum sequence length used when creating the tokenized dataset
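
The effect of split_long_examples can be pictured with the following sketch. It is an illustration only: the helper name is hypothetical, chunks are assumed to be non-overlapping windows of pre_max_length_tokenized tokens, and special tokens are ignored, so the toolkit's actual chunking may differ.

```python
from typing import List


def chunk_token_ids(
    token_ids: List[int], max_length: int, split_long_examples: bool
) -> List[List[int]]:
    """Turn one tokenized raw sample into one or more training samples."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not token_ids:
        return []
    if not split_long_examples:
        # One tokenized sample per raw sample: anything past max_length is dropped.
        return [token_ids[:max_length]]
    # Assumption: consecutive, non-overlapping windows.
    return [token_ids[i : i + max_length] for i in range(0, len(token_ids), max_length)]


sample = list(range(10))
print(chunk_token_ids(sample, max_length=4, split_long_examples=True))
print(chunk_token_ids(sample, max_length=4, split_long_examples=False))
```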

TrainingArguments can be specified at three levels.

training:
  training_args:
    per_device_train_batch_size: 16
    gradient_accumulation_steps: 16
    learning_rate: 5e-5
  pre_training_args:
    num_train_epochs: 3
    logging_dir: "./logs/pre"
  continue_training_args:
    num_train_epochs: 1
    logging_dir: "./logs/continue"

training_args applies to both phases. pre_training_args and continue_training_args override it for their own phase only. Commonly used keys include num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps, learning_rate, lr_scheduler_type, warmup_steps, warmup_ratio, weight_decay, fp16, bf16, tf32, gradient_checkpointing, logging_steps, eval_steps, save_steps, save_strategy, save_total_limit, load_best_model_at_end, metric_for_best_model, push_to_hub, and hub_model_id.
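
The layering described above can be sketched as a plain dictionary merge. This is a minimal illustration, not the toolkit's code: resolve_phase_args is a hypothetical helper, and how the flat keys directly under training interact with these three blocks is not covered here.

```python
from typing import Any, Dict


def resolve_phase_args(training: Dict[str, Any], phase: str) -> Dict[str, Any]:
    """Merge shared training_args with the phase-specific overrides."""
    if phase not in ("pre", "continue"):
        raise ValueError("phase must be 'pre' or 'continue'")
    merged = dict(training.get("training_args") or {})
    # Phase-specific keys win over the shared ones.
    merged.update(training.get(f"{phase}_training_args") or {})
    return merged


training = {
    "training_args": {
        "per_device_train_batch_size": 16,
        "gradient_accumulation_steps": 16,
        "learning_rate": 5e-5,
    },
    "pre_training_args": {"num_train_epochs": 3, "logging_dir": "./logs/pre"},
    "continue_training_args": {"num_train_epochs": 1, "logging_dir": "./logs/continue"},
}

print(resolve_phase_args(training, "pre"))
print(resolve_phase_args(training, "continue"))
```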

Use the model-initialization keys as follows.

Key	Purpose	Notes
`pretrained_model`	Legacy/shared model setting	Still works as a fallback. To start each phase from a different checkpoint, prefer the dedicated keys below.
`pretrained_model_for_pretraining`	Initialization of the scratch/pretraining phase	Use it to start pretraining from existing weights instead of random initialization.
`pretrained_model_for_continue`	Initialization of the continuation phase	Use it with `skip_pretraining: true`, or to start continuation from a specific checkpoint.
`skip_pretraining`	Phase control	When `true`, the scratch/pretraining phase is skipped and only continuation runs.
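
One way to read the table is as a lookup with a fallback. The sketch below is an assumption about the precedence (dedicated key first, then pretrained_model), and the helper names are hypothetical. A result of None only means that no explicit checkpoint is configured; what the toolkit does in that case (for example random initialization, or reusing the Phase 1 output) is not modeled.

```python
from typing import Any, Dict, List, Optional


def resolve_init_checkpoint(training: Dict[str, Any], phase: str) -> Optional[str]:
    """Pick the starting checkpoint for 'pre' or 'continue'."""
    dedicated_key = {
        "pre": "pretrained_model_for_pretraining",
        "continue": "pretrained_model_for_continue",
    }[phase]
    # Assumed precedence: dedicated key first, shared key as fallback.
    return training.get(dedicated_key) or training.get("pretrained_model")


def phases_to_run(training: Dict[str, Any]) -> List[str]:
    """skip_pretraining: true leaves only the continuation phase."""
    return ["continue"] if training.get("skip_pretraining") else ["pre", "continue"]


cfg = {
    "skip_pretraining": True,
    "pretrained_model": None,
    "pretrained_model_for_continue": "Shuu12121/Owl-ph2-base-len2048",
}
print(phases_to_run(cfg), resolve_init_checkpoint(cfg, "continue"))
```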

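Tokenized datasets are saved both as a single dataset that combines all languages and as per-language datasets. For example, with ./tokenized_train_dataset-len512 as the base path, ./tokenized_train_dataset-len512-by-language/python and ./tokenized_train_dataset-len512-by-language/javascript are also created, depending on the configured languages.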
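
The path layout in this example follows a simple pattern, sketched below. The helper is hypothetical and only mirrors the two example paths given above.

```python
from typing import Dict, Iterable


def by_language_paths(base_path: str, languages: Iterable[str]) -> Dict[str, str]:
    """Map each language to its per-language tokenized dataset directory."""
    return {lang: f"{base_path}-by-language/{lang}" for lang in languages}


paths = by_language_paths("./tokenized_train_dataset-len512", ["python", "javascript"])
for lang, path in paths.items():
    print(lang, "->", path)
```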

If a dataset has a validation split, it is used as is. Otherwise, if a repo column exists, the data is split into train/validation/test by repository; if there is no repo column either, only the train split is used.
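
The decision order above, together with one possible repository-level split, can be sketched as follows. The hash-based assignment and the 5 % / 5 % validation and test ratios are assumptions made for illustration; the actual split method and ratios are not specified here. The point of splitting by repository is that all samples from one repository land in the same split.

```python
import hashlib
from typing import Sequence


def choose_split_strategy(split_names: Sequence[str], column_names: Sequence[str]) -> str:
    """Mirror the three cases: existing validation, repo-level split, train only."""
    if "validation" in split_names:
        return "use_existing_validation"
    if "repo" in column_names:
        return "split_by_repo"
    return "train_only"


def assign_repo_split(repo: str, valid_pct: int = 5, test_pct: int = 5) -> str:
    """Deterministically map a repository name to a split (hypothetical ratios)."""
    bucket = int(hashlib.sha256(repo.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < valid_pct:
        return "validation"
    if bucket < valid_pct + test_pct:
        return "test"
    return "train"


print(choose_split_strategy(["train"], ["repo", "code"]))
print(assign_repo_split("example-org/example-repo"))
```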

A reusable template is available at configs/pretrain_template.yaml.

Four collators are available:

mlm: Hugging Face DataCollatorForLanguageModeling, the standard random-token MLM.
line_no_space: line-level MLM. Whitespace-only tokens are excluded from masking.
line_include_space: line-level MLM. Leading indentation tokens are also masked.
mlm_replace: experimental collator that applies line-level masking, random line replacement, and keep-as-is at an 80/10/10 ratio.

For continued pretraining close to Owl ph2, set continue_collator: "line_no_space".

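How line-level masking works (`line_no_space` / `line_include_space`)

Standard MLM selects individual tokens at random, whereas line-level MLM makes the line the unit of masking.

Split on newline tokens. The token sequence is split at every position whose string form contains the newline character Ċ (the representation of \n in GPT-2/RoBERTa BPE). Each segment corresponds to one line of source code.
Draw per line. Each segment is independently selected for masking with probability mlm_probability (default 0.30). In a selected line, every maskable token is replaced with [MASK].
Exclude whitespace tokens (line_no_space). Tokens made up only of spaces or tabs (those whose BPE string matches Ġ+ or ĉ+, or whose decoded text contains only spaces and tabs) are excluded from masking. Indentation stays visible, and only the semantic content of the line is hidden.
Protect special tokens. The first and last tokens ([CLS] / [SEP], etc.) are never masked, regardless of the per-line draw.
Labels. Unmasked positions get the label -100 (ignored by cross-entropy), and masked positions get the original token id.

As a concrete example, consider this short piece of Python code: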

[CLS] def foo ( x ) : Ċ Ġ Ġ Ġ Ġ return x * 2 Ċ [SEP]

The collator recognizes two lines, def foo(x): and return x * 2, and draws for each one independently. If the second line is selected, the result is:

input:  [CLS] def  foo  (    x    )    :    Ċ    Ġ    Ġ    Ġ    Ġ    [MASK] [MASK] [MASK] [MASK] Ċ    [SEP]
labels: -100  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 return x      *      2      -100 -100

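The indentation tokens Ġ Ġ Ġ Ġ are a structural cue rather than semantic content of the line, so they are not masked. In this example only the four content tokens are hidden, and the model has to recover return x * 2 from the surrounding lines.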
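
The steps above can be sketched in pure Python on token strings. This is a simplified illustration, not the toolkit's collator: it works on one sequence of string tokens instead of batched tensors, stores token strings instead of token ids as labels, checks whitespace only through the BPE string pattern, and assumes (following the worked example) that newline tokens themselves stay unmasked. The function name and the mask_whitespace flag are hypothetical.

```python
import random
import re
from typing import List, Optional, Tuple

MASK = "[MASK]"
IGNORE = -100  # label for positions that do not contribute to the loss
NEWLINE = "Ċ"  # byte-level BPE form of "\n"
WHITESPACE_ONLY = re.compile(r"^[Ġĉ]+$")  # tokens made only of spaces / tabs


def line_level_mask(
    tokens: List[str],
    mlm_probability: float = 0.3,
    mask_whitespace: bool = False,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], list]:
    """Return (inputs, labels) after line-level masking of one sequence."""
    if rng is None:
        rng = random.Random()
    inputs = list(tokens)
    labels: list = [IGNORE] * len(tokens)

    # Group positions into lines, skipping the first and last (special) tokens.
    lines, current = [], []
    for i in range(1, len(tokens) - 1):
        current.append(i)
        if NEWLINE in tokens[i]:
            lines.append(current)
            current = []
    if current:
        lines.append(current)

    for line in lines:
        if rng.random() >= mlm_probability:
            continue  # this line was not selected
        for i in line:
            tok = tokens[i]
            if NEWLINE in tok:
                continue  # assumption: newline tokens stay visible
            if not mask_whitespace and WHITESPACE_ONLY.match(tok):
                continue  # line_no_space: keep indentation visible
            inputs[i] = MASK
            labels[i] = tok  # a real collator would store the token id
    return inputs, labels


tokens = "[CLS] def foo ( x ) : Ċ Ġ Ġ Ġ Ġ return x * 2 Ċ [SEP]".split()
# mlm_probability=1.0 selects every line, which makes the result deterministic.
inputs, labels = line_level_mask(tokens, mlm_probability=1.0)
print(" ".join(inputs))
```

Passing mask_whitespace=True approximates line_include_space, where the indentation tokens are masked as well.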

`mlm_replace` collator

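CustomMLMReplaceRatioCollator uses the same line-level segmentation and applies an 80/10/10 branch to each selected line.

Probability	Action
80 %	Replace every maskable token in the line with `[MASK]` (standard masking)
10 %	Replace the whole line with another line chosen at random from the same batch (random line replacement)
10 %	Leave the original tokens in place and only record them as prediction targets (keep-as-is)

Random line replacement extends BERT's token-replacement strategy to the line level. It makes it harder for the model to take shortcuts such as filling in masks from positional information alone.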
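
The 80/10/10 branch can be sketched as a single random draw per selected line. This only illustrates the branching; it is not the code of CustomMLMReplaceRatioCollator, and the function and action names are hypothetical. How the real collator handles a replacement line of different length is not modeled.

```python
import random
from collections import Counter


def choose_line_action(rng: random.Random) -> str:
    """Pick what happens to a line that was already selected for prediction."""
    r = rng.random()
    if r < 0.8:
        return "mask"  # replace maskable tokens with [MASK]
    if r < 0.9:
        return "replace_with_random_line"  # swap in another line from the batch
    return "keep"  # leave tokens unchanged, still predict them


rng = random.Random(0)
counts = Counter(choose_line_action(rng) for _ in range(10_000))
for action, n in counts.most_common():
    print(f"{action}: {n / 10_000:.1%}")
```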

Fine-Tuning YAML

configs/finetune_csn_simple.yaml is a configuration for quickly trying out SentenceTransformer fine-tuning.

dataset:
  name: "code-search-net/code_search_net"
  format: "pairs"
  split: "train"
  languages: ["python", "javascript", "ruby", "go", "java", "php"]
  query_field: "func_documentation_string"
  code_field: "func_code_string"
  remove_python_docstrings: true
  train_sample_limit: 100000

model:
  pretrained_model: "./created_models/code-continue-modernbert-len512_Last"
  load_strategy: "auto"
  output_dir: "./created_models/code-st"
  pooling: "cls"
  max_seq_length: 512

training:
  epochs: 1
  batch_size: 32
  learning_rate: 0.00002
  scheduler: "WarmupLinear"
  warmup_ratio: 0.1
  loss: "mnrl"
  mnrl_scale: 20.0
  cached_mnrl_mini_batch_size: 32
  weight_decay: 0.01
  fp16: false
  bf16: true

With remove_python_docstrings: true, module, class, and function docstrings are removed from Python code with an AST-based pass before fine-tuning. This keeps the code side from containing exactly the same string as the query.
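
An AST-based docstring removal can be sketched with the standard library as below. This is one possible approach under stated assumptions, not the toolkit's implementation: strip_docstrings is a hypothetical helper, it needs Python 3.9+ for ast.unparse, and regenerating the source this way also drops comments and normalizes formatting.

```python
import ast


def strip_docstrings(source: str) -> str:
    """Remove module, class, and function docstrings from Python source."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(
            node, (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
        ):
            continue
        body = node.body
        is_docstring = (
            bool(body)
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        )
        if is_docstring:
            node.body = body[1:]
            # A class or function body must not be empty after removal.
            if not node.body and not isinstance(node, ast.Module):
                node.body = [ast.Pass()]
    return ast.unparse(tree)


example = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b
'''
print(strip_docstrings(example))
```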

Fine-tuning supports several dataset formats.

format: "pairs": the usual pair format with a query field and a positive/code field. Typically used with mnrl.
format: "hard_negatives": a format with a query, positive(s), and negative(s). Column names such as query, pos/positive/positives, and neg/negative/negatives/hard_negatives are recognized.
format: "beir": loads a BEIR/ranking dataset that has already been converted to query/positive/negative text columns. Convert raw BEIR corpus/queries/qrels first.
format: "kd_scores": loads a KD/ranking dataset whose configs are split into queries_python, documents_python, scores_python, and so on. scores_* is expected to hold a query_id and ranked document_ids. By default, rank 0 is treated as the positive.
format: "auto": treats the dataset as hard negatives if it has hard-negative-style columns, and as pairs otherwise.
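
The auto rule can be sketched as a check on column names. The column names come from the hard_negatives description above; the function itself is a hypothetical illustration and the toolkit's detection may look at more than this.

```python
from typing import Iterable

# Negative-column names listed for the hard_negatives format.
NEGATIVE_COLUMNS = ("neg", "negative", "negatives", "hard_negatives")


def detect_format(column_names: Iterable[str]) -> str:
    """Return 'hard_negatives' when a negative column exists, else 'pairs'."""
    columns = set(column_names)
    if any(name in columns for name in NEGATIVE_COLUMNS):
        return "hard_negatives"
    return "pairs"


print(detect_format(["query", "positive", "hard_negatives"]))
print(detect_format(["func_documentation_string", "func_code_string"]))
```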

model.pretrained_model accepts either a pretrained encoder or a SentenceTransformer checkpoint. With load_strategy: "auto", a local directory that contains modules.json is loaded directly as a SentenceTransformer; otherwise the model is wrapped as Transformer + Pooling. pooling can be cls, mean, max, mean_sqrt_len, weightedmean, lasttoken, or a combination such as mean+max. Setting projection_dim adds a Dense linear projection. normalize_embeddings: true L2-normalizes the final embedding. query_prompt and code_prompt are saved as SentenceTransformer prompts. code_prompt is also saved as the document prompt, so encode_document() can be used for code embeddings as well.

To convert a local or HF-hosted encoder to SentenceTransformer format before fine-tuning, use the following command.

uv run owl-wrap-st \
  --base-model ./created_models/code-continue-modernbert-len512_Last \
  --output-dir ./created_models/code-continue-modernbert-len512_ST \
  --pooling mean+max \
  --max-seq-length 512 \
  --projection-dim 768 \
  --projection-activation tanh \
  --normalize-embeddings \
  --query-prompt "query: " \
  --code-prompt "code: "

The learning-rate and related training settings are as follows.

learning_rate: optimizer learning rate
scheduler: SentenceTransformers scheduler (WarmupLinear, WarmupCosine, WarmupConstant, constantlr, etc.)
warmup_ratio / warmup_steps: specify warmup either as a ratio or directly as a number of steps
loss: one of mnrl, cached_mnrl, triplet, or cosine. The long aliases multiple_negatives_ranking and cached_multiple_negatives_ranking are also accepted
mnrl_scale or scale: similarity scale for mnrl / cached_mnrl. The SentenceTransformers default is 20.0
cached_mnrl_mini_batch_size, gather_across_devices: extra settings for cached_mnrl. cached_mnrl_mini_batch_size sets the mini-batch size inside the cached loss, separately from the dataloader batch_size
fp16, bf16: precision settings passed to SentenceTransformerTrainingArguments. If no precision is specified, fine-tuning defaults to bf16. precision: "fp16", precision: "bf16", and precision: "fp32" are also accepted. Do not enable fp16 and bf16 at the same time
batch_group_by_language: when true, each batch contains examples from a single language only. The order of the batches is still shuffled across languages
eval_csn: when true, a CodeSearchNet validation evaluator is built per language, and the language-averaged csn-avg_cosine_ndcg@10 is reported. For a light run, reduce eval_languages and eval_sample_limit
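
The behavior of batch_group_by_language can be sketched as follows: examples are grouped by language, cut into batches within each group, and the resulting batches are shuffled. This is an illustration with a hypothetical helper and a hypothetical language key on each example, not the toolkit's sampler. With an in-batch-negatives loss such as mnrl, batching this way means that the negatives seen in a batch come from the same language as the positive.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List


def language_grouped_batches(
    examples: List[Dict[str, Any]], batch_size: int, seed: int = 0
) -> List[List[Dict[str, Any]]]:
    """Build single-language batches, then shuffle the batch order."""
    rng = random.Random(seed)
    by_language: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for example in examples:
        by_language[example["language"]].append(example)

    batches = []
    for items in by_language.values():
        rng.shuffle(items)  # shuffle within one language
        for start in range(0, len(items), batch_size):
            batches.append(items[start : start + batch_size])
    rng.shuffle(batches)  # mix languages at the batch level
    return batches


examples = [{"language": "python", "id": i} for i in range(5)]
examples += [{"language": "go", "id": i} for i in range(3)]
for batch in language_grouped_batches(examples, batch_size=2):
    print([(ex["language"], ex["id"]) for ex in batch])
```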

The top-level stages key enables staged fine-tuning. Each stage inherits the top-level dataset, model, and training sections and overrides only the keys written inside the stage. The output model of one stage becomes the input model of the next.

stages:
  - name: "pairs-warmup"
    training:
      loss: "mnrl"
      batch_group_by_language: true
  - name: "kd-hard-negative"
    dataset:
      name: "Shuu12121/owl_code_search_hard_negative_datasets_V2_kd"
      format: "kd_scores"
      languages: ["python", "javascript"]
    training:
      loss: "triplet"

Templates for hard negatives and the related formats are in configs/finetune_hard_negatives_template.yaml, configs/finetune_beir_template.yaml, configs/finetune_kd_scores_template.yaml, and configs/finetune_staged_template.yaml. Datasets built with hard negatives ranking dataset maker style tools are also supported. You can use query/positive/negative text columns, equivalent columns named through query_field, positive_field, and negative_field, or a KD layout of queries_*/documents_*/scores_* such as Shuu12121/owl_code_search_hard_negative_datasets_V2_kd.

MTEB Evaluation

configs/eval_mteb_template.yaml contains a minimal configuration for evaluating a SentenceTransformer checkpoint with MTEB.

uv run owl-eval-mteb --config configs/eval_mteb_template.yaml

To evaluate every model produced by configs/batch_experiment.yaml in one go and aggregate the best learning rate per model, use the following command.

uv run python scripts/batch_eval_mteb.py --config configs/batch_experiment.yaml

In batch evaluation, the MTEB summary of each run is saved to mteb_results/{model}{lr-suffix}/. The best-score tables across learning rates are written to mteb_results/batch_summary.tsv and mteb_results/batch_best_scores.json.

Minimal YAML example:

evaluation:
  model: "./created_models/code-st-kd"
  model_loader: "sentence_transformer"
  tasks: ["CodeSearchNetRetrieval", "COIRCodeSearchNetRetrieval", "CosQA"]
  languages: []
  eval_splits: []
  output_dir: "./mteb_results/code-st-kd"
  summary_path: "./mteb_results/code-st-kd/summary.json"
  batch_size: 32
  normalize_embeddings: false
  prompt_name: "query"
  overwrite_strategy: "only-missing"

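For code search models, a good starting point is retrieval tasks such as CodeSearchNetRetrieval, COIRCodeSearchNetRetrieval, CosQA, StackOverflowQA, AppsRetrieval, and CodeEditSearchRetrieval. Extend the evaluation to the full COIR/MTEB suites as needed.

Set model_loader to sentence_transformer for a local or HF SentenceTransformer checkpoint, and to mteb for a model loaded through the MTEB registry. overwrite_strategy controls how results are written to the MTEB result cache. For repeated experiments, only-missing is the safe choice.

prompt_name is the prompt name passed to SentenceTransformer's encode() through MTEB's encode_kwargs. Set it when you want to use a saved prompt. To rely on the model's default behavior, leave it as null. To evaluate a local raw encoder, first convert it to SentenceTransformer format with owl-wrap-st and then point evaluation.model at the result. Results are cached in evaluation.output_dir, and the summary JSON is written to evaluation.summary_path.

Optional CLI

If you prefer running commands instead of calling the Python API, the following CLI entry points are available.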

uv run owl-train-tokenizer --config configs/config.yaml
uv run owl-pretrain --config configs/config.yaml
uv run owl-wrap-st --base-model ./created_models/code-continue-modernbert-len512_Last --output-dir ./created_models/code-st-base
uv run owl-finetune-st --config configs/finetune_csn_simple.yaml
uv run owl-eval-mteb --config configs/eval_mteb_template.yaml

Publishing

The publishing steps are simple, but treat them with the care that release work deserves.

uv build
uv run twine check dist/*
uv run twine upload dist/*

Before uploading, confirm that no unneeded build artifacts or private configs are included. Then cut a tag and decide when to publish. For PyPI, a project-scoped token or trusted publishing is recommended over a personal password.

Tests

uv run --extra dev pytest -q tests

Citation

If you use this toolkit, the pretrained models, or the method in your own research or projects, please cite this repository as follows.

@misc{owl_code_pretraining,
  author       = {Shun0212},
  title        = {Owl Code Pretraining: A minimal toolkit for building code-specialized pretrained encoders},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/Shun0212/codeowl-training-core}}
}
