Hi,
First of all, thanks for sharing this incredible project!
Regarding the implementation of the generative pretraining, I would like to ask for some clarifications. The paper is very detailed about the hyperparameters, but I believe the following two points are not covered.
- How is the prefix length sampled? Is it constant across a batch, or sampled independently for each sample in the batch? Is there a minimum or maximum value for it?
- In the paper, you mention that the MLP head (which predicts the normalized pixel values) consists of 12 MLP blocks. Does this mean that each block has the same structure as the MLP block inside a transformer attention layer? Does the head also include residual connections?
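To make the first question concrete, here is a minimal sketch of the two interpretations I have in mind. The sequence length, the uniform sampling range, and the mask construction are all my assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 256    # hypothetical number of patch tokens, not the paper's value
batch_size = 4

# Interpretation A: one prefix length shared by the whole batch
prefix_batch = rng.integers(1, seq_len, size=1).item()

# Interpretation B: an independent prefix length per sample
prefix_per_sample = rng.integers(1, seq_len, size=batch_size)

# Either way, a boolean mask would mark which tokens fall in the prefix
# (assuming interpretation B here)
mask = np.arange(seq_len)[None, :] < prefix_per_sample[:, None]
print(mask.shape)  # -> (4, 256)
```

Knowing which interpretation is correct (and the sampling bounds) would remove any ambiguity when reproducing the pretraining setup.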
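For the second question, this is how I currently read "12 MLP blocks" — a stack of transformer-style MLP blocks with residual connections. Whether the real head uses residuals (or normalization layers, which I omit here) is exactly what I am asking; the toy dimensions are mine:

```python
import numpy as np

def mlp_block(x, w1, b1, w2, b2):
    """One candidate head block: Linear -> GELU -> Linear with a
    residual add. The residual is my assumption, not confirmed."""
    h = x @ w1 + b1
    # tanh approximation of GELU
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return x + h @ w2 + b2

rng = np.random.default_rng(0)
d, hidden = 8, 32  # toy dimensions, not the paper's
x = rng.normal(size=(2, d))
params = [
    (rng.normal(size=(d, hidden)) * 0.02, np.zeros(hidden),
     rng.normal(size=(hidden, d)) * 0.02, np.zeros(d))
    for _ in range(12)  # "12 MLP blocks" as stated in the paper
]
for w1, b1, w2, b2 in params:
    x = mlp_block(x, w1, b1, w2, b2)
print(x.shape)  # -> (2, 8)
```

A confirmation (or correction) of this block structure would be enough to reproduce the head exactly.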
Thanks again for sharing! If you could clarify these things, it would be very helpful for other researchers who seek to build upon this nice work.