First, thanks for this great work and implementation. I want to use it in my own work.
I have a basic question about the implementation:
Assume I have fixed embeddings (size 512) with many samples (about 2 million).
In the examples, I saw that the MI estimate changes throughout the optimization, and that the estimates have high variance but extremely good MSE.
As I understand it, I would use all 2 million samples to train the CLUB estimator. When is the best time to evaluate the MI? Is it best to monitor the loss until it stops changing, or is there a better measure? What would you suggest? And what portion of the 2 million examples would you use to evaluate the true MI: all of them? Would you then compute the MSE over all the examples?
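To make the "monitor the loss" idea concrete, this is roughly the kind of stopping check I have in mind. It is only a sketch: `has_plateaued`, `window`, and `rel_tol` are names and defaults I made up for illustration, not anything from this repository, and `losses` would be the estimator's per-epoch loss values.

```python
def has_plateaued(losses, window=5, rel_tol=1e-3):
    """Hypothetical helper: True once the loss has stopped improving.

    Compares the mean of the last `window` losses with the mean of the
    `window` losses before them. The loss counts as plateaued when the
    relative improvement is at most `rel_tol` (this includes the case
    where the loss got worse).
    """
    if len(losses) < 2 * window:
        # Not enough history yet to compare two windows.
        return False
    previous = sum(losses[-2 * window:-window]) / window
    latest = sum(losses[-window:]) / window
    improvement = previous - latest
    return improvement <= rel_tol * max(abs(previous), 1e-12)
```

I would call this after each epoch and only read off the MI estimate once it returns True, but I am not sure this is the right criterion.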
Second question, regarding the architecture of the hidden layers and the network: do you have any suggestions for the case where I have two variables with 512 dimensions each?
The last question is about the robustness of the optimizer. Let's assume I change the two vectors over time, optimizing them for a different task, and then want to measure the MI again. Would you re-initialize the optimizer to measure the MI for the modified vectors, or reuse the last optimizer that was trained?
Thanks!