Describe the issue
Hi @iofu728, thank you for the impressive work on LongLLMLingua.
I am currently trying to reproduce LongLLMLingua on the LongBench dataset using LongChat-7b-v1.5-32k as the target model.
I noticed a significant performance gap when comparing my results (compressed prompt) against the official LongBench baseline scores (original full prompt).
My Setup:
Dataset: LongBench, downloaded locally (e.g., NarrativeQA, Qasper, etc.)
Target Model: lmsys/longchat-7b-v1.5-32k, downloaded from Hugging Face
Method:
```python
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # default small compressor model
compressed_prompt = compressor.compress_prompt(
    origin,             # loaded from the LongBench sample["context"]
    question=question,  # loaded from sample["input"]; if empty, I fall back to the task instruction from LongBench's eval.py
    target_token=3000,
    condition_in_question="after_condition",
    reorder_context="sort",
    dynamic_context_compression_ratio=0.3,
    condition_compare=True,
    context_budget="+100",
    rank_method="longllmlingua",
)
```
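For completeness, this is roughly how I slot the compressed context back into LongBench's prompt template before querying LongChat (a minimal sketch: the template string below only paraphrases LongBench's NarrativeQA entry in `config/dataset2prompt.json`, and `build_prompt` is a helper I wrote for illustration, not part of either library):

```python
# Approximate LongBench NarrativeQA template; see config/dataset2prompt.json
# in the LongBench repo for the exact wording.
NARRATIVEQA_TEMPLATE = (
    "You are given a story, which can be either a novel or a movie script, "
    "and a question. Answer the question as concisely as you can, using a "
    "single phrase if possible. Do not provide any explanation.\n\n"
    "Story: {context}\n\n"
    "Question: {input}\n\nAnswer:"
)

def build_prompt(result: dict, question: str, template: str = NARRATIVEQA_TEMPLATE) -> str:
    """Insert the compressed context returned by compress_prompt back into
    the LongBench template; result mimics the dict compress_prompt returns."""
    return template.format(context=result["compressed_prompt"], input=question)

# Illustrative stand-in for the dict returned by compress_prompt
result = {"compressed_prompt": "<compressed story text>", "compressed_tokens": 3000}
prompt = build_prompt(result, "Who is the protagonist?")
```

The resulting `prompt` is what I feed to LongChat-7b-v1.5-32k for generation.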
The Issue: After compressing the LongBench data and evaluating it with LongChat-7b-v1.5-32k, my scores are much lower than the model's reported performance on the original (uncompressed) data in the official LongBench leaderboard.
For example:
Official LongChat-7b-v1.5-32k (Full Context): 41.4
My Result with LongLLMLingua (Compressed): 27.39
My Question: Is a significant performance drop expected when compressing prompts for a long-context model like LongChat-32k, or is this gap more likely due to my parameter configuration?
Could you advise on the recommended hyperparameters for LongBench evaluation, so as to minimize the performance loss?
Thanks for your help!