About training data

Thanks for your great work and for releasing the code!

Regarding the training data, did you randomly crop 512×512 patches and then extract the textual descriptions from those 512×512 crops?

If that's the case, do all the training images need to be preprocessed and saved in advance? 
Approximately how many 512×512 training samples are there?