No code is provided for the DPO step that the paper describes after RM scoring. Could you explain how you built the DPO training data?