Minimal change for latest transformer usage. #31
Open
CenjhihLi wants to merge 1 commit into Shark-NLP:main from
Conversation
CenjhihLi
commented
May 6, 2026
- Modify the tokenizer encode usage.
- Modify the pad_token assignment: only assign pad_token = eos_token when eos_token is not None.
- Modify the ce_loss computation in ppl_inferencer to support bfloat16.
- Add cone_retriever from https://github.com/Romainpkq/revisit_demon_selection_in_ICL
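The pad-token change in the list above can be sketched as follows. This is a hedged illustration, not the actual OpenICL code: `DummyTokenizer` and `ensure_pad_token` are stand-in names for a Hugging Face tokenizer and the guarded assignment the PR describes.

```python
# Illustrative stand-in for a Hugging Face tokenizer (not OpenICL code).
class DummyTokenizer:
    def __init__(self, pad_token=None, eos_token=None,
                 pad_token_id=None, eos_token_id=None):
        self.pad_token = pad_token
        self.eos_token = eos_token
        self.pad_token_id = pad_token_id
        self.eos_token_id = eos_token_id


def ensure_pad_token(tokenizer):
    # Only fall back to EOS when it actually exists, so a model without an
    # EOS token does not get its pad settings overwritten with None.
    if tokenizer.pad_token is None and tokenizer.eos_token is not None:
        tokenizer.pad_token = tokenizer.eos_token
    if tokenizer.pad_token_id is None and tokenizer.eos_token_id is not None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    return tokenizer
```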
Pull request overview
This PR updates OpenICL’s Hugging Face Transformers integration to be more compatible with newer tokenizer/model behaviors (padding/tokenization APIs and bfloat16 outputs), and adds a new ConE-based retriever implementation.
Changes:
- Adjust device transfer in the padding+CUDA collator to handle non-tensor values in the batch.
- Make pad_token/pad_token_id assignment conditional on eos_token(_id) being present, to avoid overwriting valid pad settings with None.
- Update PPL CE-loss computation to better handle bfloat16 logits, and add a new ConERetriever.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| openicl/utils/collators.py | Changes how collated batches are moved onto device (now via per-item .to(...)). |
| openicl/icl_retriever/icl_topk_retriever.py | Makes pad token assignment conditional on the presence of EOS token/id. |
| openicl/icl_retriever/icl_cone_retriever.py | Adds a new ConE retriever that reranks TopK candidates using CE loss. |
| openicl/icl_retriever/__init__.py | Exposes ConERetriever from the retriever package. |
| openicl/icl_inferencer/icl_ppl_inferencer.py | Updates CE-loss normalization and dtype handling to support bfloat16 outputs. |
| openicl/icl_inferencer/icl_base_inferencer.py | Makes pad token assignment conditional on the presence of EOS token/id. |
| openicl/icl_dataset_reader.py | Updates dataset tokenization to use tokenizer(...) with a fallback to encode_plus. |
Comment on lines +65 to +68

```python
batch = {
    k: v.to(self.device) if hasattr(v, "to") else v
    for k, v in batch.items()
}
```
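The collator hunk above moves only the values that support `.to(...)` onto the device and passes everything else through. A minimal self-contained sketch of that pattern, using a `FakeTensor` stand-in for `torch.Tensor` (the real code calls `tensor.to(self.device)`):

```python
# FakeTensor mimics the .to(device) interface of a torch.Tensor for the sketch.
class FakeTensor:
    def __init__(self, data, device="cpu"):
        self.data = data
        self.device = device

    def to(self, device):
        # torch.Tensor.to returns a tensor on the target device; mimic that.
        return FakeTensor(self.data, device)


def move_batch(batch, device):
    # Non-tensor values (e.g. lists of raw strings) lack .to() and are kept as-is.
    return {k: v.to(device) if hasattr(v, "to") else v for k, v in batch.items()}
```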
```diff
- ce_loss = loss.sum(-1).cpu().detach().numpy() / lens
+ lens -= torch.tensor(mask_length, device=lens.device, dtype=lens.dtype)
+ # Some new hf models are bfloat16
+ ce_loss = (loss.sum(-1) / lens.to(loss.dtype)).detach().to(torch.float32).cpu()
```
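The motivation for the hunk above is that NumPy has no bfloat16 dtype, so the old `.numpy()` path breaks for bfloat16 models; dividing on-device and converting to float32 before moving to CPU avoids that. The underlying arithmetic is just length-normalized cross-entropy, shown here in plain Python (names are illustrative, not OpenICL's):

```python
def normalized_ce(token_losses, total_len, mask_length):
    # Exclude the masked in-context prefix from the normalizer, then average
    # the per-token losses over the remaining (test) tokens.
    effective_len = total_len - mask_length
    return sum(token_losses) / effective_len
```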
```diff
- tokenized_data = self.tokenizer.encode_plus(data, truncation=True, return_tensors='pt', verbose=False)
+ try:
+     tokenized_data = self.tokenizer(data, truncation=True, return_tensors='pt', verbose=False)
+ except:
```
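A narrower variant of the fallback above (a bare `except:` also swallows `KeyboardInterrupt` and real bugs). This sketch assumes the failure mode is an uncallable legacy tokenizer, so it catches `TypeError` only; `OldTokenizer` is a hypothetical stand-in that exposes only `encode_plus`:

```python
# Hypothetical legacy tokenizer: not callable, only has encode_plus.
class OldTokenizer:
    def encode_plus(self, data, **kwargs):
        return {"input_ids": [101] + [len(w) for w in data.split()]}


def tokenize(tokenizer, data, **kwargs):
    try:
        # Modern Hugging Face tokenizers are callable via __call__.
        return tokenizer(data, **kwargs)
    except TypeError:
        # Legacy path: fall back to encode_plus without masking other errors.
        return tokenizer.encode_plus(data, **kwargs)
```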
Comment on lines +101 to +112

```python
embed = np.expand_dims(entry['embed'], axis=0)
near_ids = self.index.search(embed, min(self.candidate_num, len(self.index_ds)))[1][0].tolist()
candidates = []
mdl_scores = []

prompts = []
mask_lengths = []
test_lengths = []

for j in range(self.candidate_num):
    rand_idx_list = [near_ids[j]]
    candidates.append(rand_idx_list)
```
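In the hunk above, each nearest neighbour becomes a single-demonstration candidate that will later be reranked by CE loss. A self-contained sketch of that candidate construction (function name is illustrative; the capping mirrors the `min(...)` guard in the index search):

```python
def build_candidates(near_ids, candidate_num):
    # Each candidate is a one-element list: one demonstration per candidate,
    # capped at however many neighbours the index actually returned.
    candidates = []
    for j in range(min(candidate_num, len(near_ids))):
        candidates.append([near_ids[j]])
    return candidates
```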
Comment on lines
+140
to
+153
| for batch_id in range(self.candidate_num // self.ppl_batch_size): | ||
| with torch.no_grad(): | ||
| loss_list = self.cal_ce(prompts[batch_id * self.ppl_batch_size: (batch_id + 1) * self.ppl_batch_size], mask_lengths=mask_lengths[batch_id * self.ppl_batch_size: (batch_id + 1) * self.ppl_batch_size], test_lengths=test_lengths[batch_id * self.ppl_batch_size: (batch_id + 1) * self.ppl_batch_size]) | ||
| mdl_scores.extend(loss_list) | ||
|
|
||
| if self.candidate_num % self.ppl_batch_size != 0: | ||
| with torch.no_grad(): | ||
| end_pos = self.candidate_num // self.ppl_batch_size * self.ppl_batch_size | ||
| loss_list = self.cal_ce(prompts[end_pos:], mask_lengths=mask_lengths[end_pos:], test_lengths=test_lengths[end_pos:]) | ||
| mdl_scores.extend(loss_list) | ||
|
|
||
| ppl_scores = list(sorted(list(enumerate(mdl_scores)), key=lambda x: x[1])) | ||
| # get the most lower ppl demonstrations for each test input | ||
| rtr_idx_list[idx] = [int(candidates[ppl_scores[i][0]][0]) for i in range(self.ice_num)] |
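The hunk above scores candidates in fixed-size batches (with a tail batch for the remainder), then sorts candidate indices by score ascending. A condensed sketch of that control flow, with the CE call replaced by a pluggable `score_fn` (the real code calls `self.cal_ce`); stepping `range` by the batch size covers full batches and the remainder in one loop:

```python
def score_and_rank(prompts, batch_size, score_fn, top_k):
    scores = []
    # One slice per batch; the final slice is the remainder, if any.
    for start in range(0, len(prompts), batch_size):
        scores.extend(score_fn(prompts[start:start + batch_size]))
    # Sort candidate indices by score ascending (lower CE loss = lower PPL).
    ranked = sorted(enumerate(scores), key=lambda x: x[1])
    return [idx for idx, _ in ranked[:top_k]]
```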
Comment on lines +162 to +164

```python
logger.info(f'Load model {self.metric_model} for calculating MDL...')
self.metric_model = AutoModelForCausalLM.from_pretrained(self.ce_model_name)
self.metric_model.to(self.device)
```
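The hunk above follows a lazy-loading pattern: the heavyweight scoring model is constructed only when first needed and cached on the instance. A minimal sketch of that pattern, with `load_fn` standing in for `AutoModelForCausalLM.from_pretrained` (class and method names here are illustrative):

```python
class Scorer:
    def __init__(self, model_name, load_fn):
        self.model_name = model_name
        self._load_fn = load_fn      # stand-in for from_pretrained
        self.metric_model = None     # not loaded until first use

    def get_model(self):
        # Load once on first call, reuse the cached model afterwards.
        if self.metric_model is None:
            self.metric_model = self._load_fn(self.model_name)
        return self.metric_model
```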
Comment on lines +1 to +21

```python
"""MDL Retriever"""

from openicl import DatasetReader, PromptTemplate
from openicl.icl_retriever.icl_topk_retriever import TopkRetriever
from openicl.utils.calculate import entropy
from openicl.utils.logging import get_logger
from typing import List, Union, Optional, Tuple
from transformers import AutoTokenizer, AutoModelForCausalLM
import tqdm
import torch
import numpy as np
from accelerate import Accelerator

logger = get_logger(__name__)


class ConERetriever(TopkRetriever):
    """PPL In-context Learning Retriever Class
    Class of ConE retriever.
```