Reproducing Table 1 on TOFU, and metrics clarification #7

Dear SimNPO authors,

Thank you for your insightful work and for open-sourcing the code base. Could you share the hyperparameter settings needed to reproduce the NPO and SimNPO results on the TOFU dataset (Table 1)? I am using the default finetune and forget configs (.yaml), with no changes to the code, but I see a significant discrepancy between my evaluation scores (in aggregate_stat.txt) and the reported results. With SimNPO + grad_diff, I get a matching model utility score (0.578), but the forget quality (KS test p-value) is very low (4e-6) compared with the reported 0.99. With SimNPO alone, I get a forget quality of ~0.4 but a model utility of ~0.

Referring to the evaluation section of the TOFU paper: on the forget set, shouldn't lower be better for the probability p(a|q) and for ROUGE, and higher be better for the truth ratio (prob. of wrong answers / prob. of correct answer)? The arrows in Table 1 for ROUGE and probability on the forget set seem reversed, and the truth ratio arrow on the retain set seems reversed too.
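To make the question concrete, here is a minimal sketch of how I understand the two quantities. It is not the repository's evaluation code: the function names are hypothetical, the truth ratio omits the per-token length normalization, and the p-value uses the asymptotic Kolmogorov approximation rather than whatever exact routine the evaluation script calls. As I read the TOFU paper, forget quality is the p-value of a two-sample KS test between the forget-set truth ratios of the unlearned model and those of a model trained only on the retain set.

```python
import math
from bisect import bisect_right


def truth_ratio(p_wrong, p_correct):
    """Mean probability of the wrong (perturbed) answers divided by the
    probability of the correct answer, for a single question.
    Assumption: probabilities are already length-normalized."""
    return (sum(p_wrong) / len(p_wrong)) / p_correct


def ks_statistic(xs, ys):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    for v in xs + ys:
        cdf_x = bisect_right(xs, v) / len(xs)
        cdf_y = bisect_right(ys, v) / len(ys)
        d = max(d, abs(cdf_x - cdf_y))
    return d


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Asymptotic two-sided p-value for a KS statistic d with sample sizes
    n and m. Assumption: an approximation, less accurate for small samples."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The Kolmogorov survival function is ~1 here and the series
        # below converges slowly, so return 1 directly.
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def forget_quality(unlearned_ratios, retain_ratios):
    """p-value of the KS test between two truth-ratio distributions."""
    d = ks_statistic(unlearned_ratios, retain_ratios)
    return ks_pvalue_asymptotic(d, len(unlearned_ratios), len(retain_ratios))
```

Under this reading, a forget quality near 1 means the unlearned model's truth ratios on the forget set are statistically indistinguishable from the retain-only model's, while a value like 4e-6 means the two distributions are clearly different.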

Looking forward to hearing from you, thanks.
