Hi,
First of all, thanks for sharing this incredible project!
Regarding the implementation of the generative pretraining, I would like to ask for some clarifications. The paper is very detailed about the hyperparameters, but I believe the following two points are not covered.
- How is the prefix length sampled? Is it constant across a batch, or sampled independently for each sample in the batch? Is there a minimum or maximum value for it?
- In the paper, you mention that the MLP head (which predicts the normalized pixel values) consists of 12 MLP blocks. Does this mean that each block has the same structure as the MLP block inside a transformer attention layer? Does the head also include residual connections?
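To make the first question concrete, here is a minimal sketch of the two interpretations I have in mind. The sequence length, the uniform sampling range, and the mask construction are all my assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 256    # hypothetical number of patch tokens, not the paper's value
batch_size = 4

# Interpretation A: one prefix length shared by the whole batch
prefix_batch = rng.integers(1, seq_len, size=1).item()

# Interpretation B: an independent prefix length per sample
prefix_per_sample = rng.integers(1, seq_len, size=batch_size)

# Either way, a boolean mask would mark which tokens fall in the prefix
# (assuming interpretation B here)
mask = np.arange(seq_len)[None, :] < prefix_per_sample[:, None]
print(mask.shape)  # -> (4, 256)
```

Knowing which interpretation is correct (and the sampling bounds) would remove any ambiguity when reproducing the pretraining setup.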
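For the second question, this is how I currently read "12 MLP blocks" — a stack of transformer-style MLP blocks with residual connections. Whether the real head uses residuals (or normalization layers, which I omit here) is exactly what I am asking; the toy dimensions are mine:

```python
import numpy as np

def mlp_block(x, w1, b1, w2, b2):
    """One candidate head block: Linear -> GELU -> Linear with a
    residual add. The residual is my assumption, not confirmed."""
    h = x @ w1 + b1
    # tanh approximation of GELU
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return x + h @ w2 + b2

rng = np.random.default_rng(0)
d, hidden = 8, 32  # toy dimensions, not the paper's
x = rng.normal(size=(2, d))
params = [
    (rng.normal(size=(d, hidden)) * 0.02, np.zeros(hidden),
     rng.normal(size=(hidden, d)) * 0.02, np.zeros(d))
    for _ in range(12)  # "12 MLP blocks" as stated in the paper
]
for w1, b1, w2, b2 in params:
    x = mlp_block(x, w1, b1, w2, b2)
print(x.shape)  # -> (2, 8)
```

A confirmation (or correction) of this block structure would be enough to reproduce the head exactly.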
Thanks again for sharing! If you could clarify these things, it would be very helpful for other researchers who seek to build upon this nice work.