Conversation
| from torch.distributions.normal import Normal
| from mlp import MLP
| import numpy as np
| class SparseDispatcher(object):
A helper class that, given an input mini-batch, dispatches the inputs to the individual experts, and then combines each expert's outputs back into a single tensor.
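For orientation, the intended call pattern looks roughly like this. The `dispatch`/`combine` names come from the class itself; the constructor argument order and the `experts`/`inputs` objects are assumptions based on the docstring:

```python
# Hedged usage sketch: gates is [batch_size, num_experts], inputs is
# [batch_size, input_size], experts is a list of num_experts sub-networks.
dispatcher = SparseDispatcher(num_experts, gates)
expert_inputs = dispatcher.dispatch(inputs)      # one input batch per expert
expert_outputs = [experts[e](expert_inputs[e]) for e in range(num_experts)]
output = dispatcher.combine(expert_outputs)      # weighted sum, [batch_size, output_size]
```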
| combine - take output Tensors from each expert and form a combined output
| Tensor. Outputs from different experts for the same batch element are
| summed together, weighted by the provided "gates".
combine takes a weighted sum of the experts' outputs, using each expert's gate value as the weight.
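In arithmetic terms, for one batch element routed to two experts (all values hypothetical):

```python
import torch

gate_values = torch.tensor([0.7, 0.3])                  # nonzero gates for this element
expert_outs = torch.stack([torch.tensor([1.0, 2.0]),    # output of the first expert
                           torch.tensor([3.0, 4.0])])   # output of the second expert
combined = (gate_values.unsqueeze(1) * expert_outs).sum(0)
# tensor([1.6000, 2.6000])  == 0.7*[1, 2] + 0.3*[3, 4]
```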
| The class is initialized with a "gates" Tensor, which specifies which
| batch elements go to which experts, and the weights to use when combining
| the outputs. Batch element b is sent to expert e iff gates[b, e] != 0.
gates works like a weighted multi-hot indicator: gates[b, e] is nonzero exactly when batch element b is routed to expert e, and 0 otherwise (with top-1 routing each row is one-hot).
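A tiny hypothetical example:

```python
import torch

# batch_size=3, num_experts=2; rows need not be strictly one-hot,
# since with top-k > 1 routing several entries per row can be nonzero
gates = torch.tensor([[0.6, 0.4],
                      [0.0, 1.0],
                      [0.9, 0.0]])
# batch element 1 is sent only to expert 1, because gates[1, 0] == 0
```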
| self._gates = gates
| self._num_experts = num_experts
| # sort experts
| sorted_experts, index_sorted_experts = torch.nonzero(gates).sort(0)
nonzero -> pull out the indices of the nonzero entries -> sort them in ascending order (note that sort(0) sorts each column independently).
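Continuing with the hypothetical gates tensor from above, the two steps look like this:

```python
import torch

gates = torch.tensor([[0.6, 0.4],
                      [0.0, 1.0],
                      [0.9, 0.0]])
nz = torch.nonzero(gates)
# tensor([[0, 0],    <- (batch, expert) pairs of the nonzero gates
#         [0, 1],
#         [1, 1],
#         [2, 0]])
sorted_experts, index_sorted_experts = nz.sort(0)
# each column is sorted on its own, so column 1 of sorted_experts now lists
# the expert ids in ascending order:
# tensor([[0, 0],
#         [0, 0],
#         [1, 1],
#         [2, 1]])
```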
| # drop indices
| _, self._expert_index = sorted_experts.split(1, dim=1)
split(1, dim=1) slices the tensor along dim=1 into size-1 chunks (here: a batch-index column and an expert-index column); the first chunk (the batch index) is discarded and the remainder is stored as self._expert_index.
https://pytorch.org/docs/stable/generated/torch.split.html
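A minimal torch.split illustration (hypothetical values):

```python
import torch

pairs = torch.tensor([[0, 0],
                      [0, 1],
                      [1, 2]])
batch_col, expert_col = pairs.split(1, dim=1)   # two [3, 1] column tensors
# batch_col:  tensor([[0], [0], [1]])
# expert_col: tensor([[0], [1], [2]])
```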
| threshold_positions_if_in = torch.arange(batch).to(self.device) * m + self.k
| threshold_if_in = torch.unsqueeze(torch.gather(top_values_flat, 0, threshold_positions_if_in), 1)
| is_in = torch.gt(noisy_values, threshold_if_in)
greater than (https://pytorch.org/docs/stable/generated/torch.gt.html).
Element-wise: True wherever noisy_values > threshold_if_in. The gather above picks, for each batch row, position k of that row's top-m values (the (k+1)-th largest), which is presumably the threshold a value already in the top k had to beat.
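For example:

```python
import torch

noisy_values = torch.tensor([[0.9], [0.1]])
threshold_if_in = torch.tensor([[0.5], [0.5]])
is_in = torch.gt(noisy_values, threshold_if_in)
# tensor([[ True],
#         [False]])
```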
| noisy_top_values: a `Tensor` of shape [batch, m].
| "values" Output of tf.top_k(noisy_top_values, m). m >= k+1
| and shapes `[expert_batch_size_i]`
| """
| # split nonzero gates for each expert
| return torch.split(self._nonzero_gates, self._part_sizes, dim=0)
Splits _nonzero_gates into per-expert chunks, sized by how many batch elements go to each expert.
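torch.split also accepts a list of chunk sizes, which is what happens here (hypothetical values):

```python
import torch

nonzero_gates = torch.tensor([0.6, 0.9, 0.4, 1.0])  # gates in expert-sorted order
part_sizes = [3, 1]                                  # 3 elements for expert 0, 1 for expert 1
torch.split(nonzero_gates, part_sizes, dim=0)
# (tensor([0.6000, 0.9000, 0.4000]), tensor([1.]))
```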
| """The squared coefficient of variation of a sample. | ||
| Useful as a loss to encourage a positive distribution to be more uniform. |
Computes the squared coefficient of variation of a tensor.
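A minimal sketch of CV² = Var(x) / Mean(x)²; the eps guard is an assumption (the repo may handle edge cases such as single-element inputs differently):

```python
import torch

def cv_squared_sketch(x, eps=1e-10):
    # squared coefficient of variation: variance over squared mean;
    # eps guards against division by zero (e.g., an all-zero load)
    x = x.float()
    return x.var() / (x.mean() ** 2 + eps)

cv_squared_sketch(torch.tensor([1., 1., 1., 1.]))  # ~0.0  (perfectly uniform)
cv_squared_sketch(torch.tensor([4., 0., 0., 0.]))  # ~4.0  (highly non-uniform)
```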
| def _gates_to_load(self, gates):
| """Compute the true load per expert, given the gates.
| The load is the number of examples for which the corresponding gate is >0.
Given the gates, computes the true load per expert ("true" presumably meaning the exact, non-noisy count).
Load is defined as the number of examples with gate > 0.
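With the same kind of gates tensor as above, the load is just a column-wise count of nonzero gates:

```python
import torch

gates = torch.tensor([[0.6, 0.4],
                      [0.0, 1.0],
                      [0.9, 0.0]])
load = (gates > 0).sum(0)   # tensor([2, 2]): two examples routed to each expert
```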
| loss = self.cv_squared(importance) + self.cv_squared(load)
| loss *= loss_coef
Adds the loss on importance and the loss on load; both CV² terms push the experts toward uniform utilization.
| def noisy_top_k_gating(self, x, train, noise_epsilon=1e-2):
| """Noisy top-k gating.
| See paper: https://arxiv.org/abs/1701.06538.
| Args:
| x: input Tensor with shape [batch_size, input_size]
| train: a boolean - we only add noise at training time.
| noise_epsilon: a float
| Returns:
| gates: a Tensor with shape [batch_size, num_experts]
| load: a Tensor with shape [num_experts]
| """
The part that makes the top-k gating noisy (a self-contained sketch follows after the quoted code below).
| top_k_gates = self.softmax(top_k_logits)
| zeros = torch.zeros_like(logits, requires_grad=True).to(self.device)
| gates = zeros.scatter(1, top_k_indices, top_k_gates).to(self.device)
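Putting the quoted pieces together, here is a hedged, self-contained sketch of noisy top-k gating following the paper's Eq. 4-5; w_gate/w_noise follow the paper's notation, and the load computation (which needs the k+1-th value, per the "m >= k+1" line above) is omitted:

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gating_sketch(x, w_gate, w_noise, k, train, noise_epsilon=1e-2):
    clean_logits = x @ w_gate                      # [batch, num_experts]
    if train:
        # learned, input-dependent noise scale; softplus keeps it positive
        noise_stddev = F.softplus(x @ w_noise) + noise_epsilon
        logits = clean_logits + torch.randn_like(clean_logits) * noise_stddev
    else:
        logits = clean_logits
    # keep only the k largest logits per row and renormalize them
    top_k_logits, top_k_indices = logits.topk(k, dim=1)
    top_k_gates = F.softmax(top_k_logits, dim=1)
    # scatter the k gate values back into a dense [batch, num_experts] tensor;
    # every other position stays exactly zero, which is what makes gating sparse
    zeros = torch.zeros_like(logits)
    return zeros.scatter(1, top_k_indices, top_k_gates)

gates = noisy_top_k_gating_sketch(
    torch.randn(4, 16), torch.randn(16, 8), torch.randn(16, 8), k=2, train=True)
```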
Paper
https://arxiv.org/abs/1701.06538
Paper notes: Notion
Implementation
https://github.com/davidmrau/mixture-of-experts