Hi, I have some questions about training.
When I train the AO model on 2 A100 80G GPUs with Batch_size=8 on the LRS3 and LRS2 datasets together, I get out-of-memory errors, and one epoch takes 5 hours.
With only the LRS2 dataset, Batch_size=16, and num_workers=4, I also get an out-of-memory error. In this case, one epoch takes 20 minutes.
With only the LRS2 dataset, Batch_size=8, and num_workers=4, one epoch takes 40 minutes, but loss=nan appears after several epochs.
These problems prevent me from training normally. Could you please share how long your training of this model took and the settings you used? Thank you very much!