Use the model's existing MTP module to compute the training loss via next-token prediction.
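The note does not say how the MTP outputs enter the loss, so the following is only a framework-free sketch under stated assumptions. It assumes "MTP" means multi-token prediction, that the main head at position i predicts token i+1, and that the MTP module at depth d (0-based) predicts token i+2+d. The names `shifted_loss`, `training_loss`, and the `mtp_weight` factor are hypothetical and do not come from any particular model's code.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def shifted_loss(logits_per_pos, tokens, shift):
    """Mean cross-entropy where position i is scored against tokens[i + shift]."""
    n = len(tokens) - shift  # positions that still have a target
    if n <= 0:
        return 0.0
    total = sum(cross_entropy(logits_per_pos[i], tokens[i + shift]) for i in range(n))
    return total / n


def training_loss(main_logits, mtp_logits, tokens, mtp_weight=0.3):
    """Main next-token loss plus a weighted average of the MTP losses.

    main_logits[i]    : logits for tokens[i + 1]          (assumed layout)
    mtp_logits[d][i]  : logits for tokens[i + 2 + d]      (assumed layout)
    mtp_weight        : hypothetical scaling factor for the auxiliary loss
    """
    main = shifted_loss(main_logits, tokens, shift=1)
    if not mtp_logits:
        return main
    mtp = sum(
        shifted_loss(depth_logits, tokens, shift=d + 2)
        for d, depth_logits in enumerate(mtp_logits)
    ) / len(mtp_logits)
    return main + mtp_weight * mtp


# Usage: uniform logits over a 4-token vocabulary, one MTP depth.
tokens = [0, 1, 2, 3]
uniform = [[0.0] * 4 for _ in tokens]
loss = training_loss(uniform, [uniform], tokens)
```

In a real training loop the same target shifts would be applied to the tensors produced by the model's MTP module, with the weighting chosen to match whatever the model's training recipe specifies.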