where $$ [.;.] $$ is a concatenation operation. $$ \mathbf{W}^q_i, \mathbf{W}^k_i \in \mathbb{R}^{d \times d_k/h} $$ and $$ \mathbf{W}^v_i \in \mathbb{R}^{d \times d_v/h} $$ are weight matrices that map input embeddings of size $$ L \times d $$ into query, key, and value matrices, and $$ \mathbf{W}^o \in \mathbb{R}^{d_v \times d} $$ is the output linear transformation. All the weights are learned during training.
_Fig. 1. Illustration of the multi-head scaled dot-product attention mechanism. (Image source: Figure 2 in [Vaswani, et al., 2017](https://arxiv.org/abs/1706.03762))_
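As a sketch of the equations above, here is a minimal NumPy version of multi-head scaled dot-product attention (shapes and names are illustrative, not taken from the paper's reference code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """X: (L, d); Wq/Wk/Wv: lists of h per-head projections; Wo: (d_v, d)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i   # per-head query/key/value
        scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot product
        heads.append(softmax(scores) @ V)        # (L, d_v/h)
    return np.concatenate(heads, axis=-1) @ Wo   # [.;.] then output projection

# toy shapes: L=5 tokens, d=16, h=4 heads, d_k = d_v = 16
rng = np.random.default_rng(0)
L, d, h, dk, dv = 5, 16, 4, 16, 16
Wq = [rng.normal(size=(d, dk // h)) for _ in range(h)]
Wk = [rng.normal(size=(d, dk // h)) for _ in range(h)]
Wv = [rng.normal(size=(d, dv // h)) for _ in range(h)]
Wo = rng.normal(size=(dv, d))
out = multi_head_attention(rng.normal(size=(L, d)), Wq, Wk, Wv, Wo)
assert out.shape == (L, d)
```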
## Transformer
The **encoder** generates an attention-based representation with the capability to locate a specific piece of information from a large context.
The function of the Transformer **decoder** is to retrieve information from the encoded representation. Its architecture is quite similar to the encoder's, except that each identical repeating module of the decoder contains two multi-head attention submodules instead of one. The first multi-head attention submodule is _masked_ to prevent positions from attending to the future.
_Fig. 2. The architecture of the vanilla Transformer model. (Image source: Figure 17)_
**Positional Encoding**
In this way, each dimension of the positional encoding corresponds to a sinusoid, with wavelengths forming a geometric progression from $$ 2\pi $$ to $$ 10000 \cdot 2\pi $$ across dimensions.
_Fig. 3. Sinusoidal positional encoding with $$ L=32 $$ and $$ d=128 $$ . The value is between -1 (black) and 1 (white) and the value 0 is in gray._
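The encoding can be sketched directly (a minimal NumPy version, assuming an even embedding dimension $$ d $$):

```python
import numpy as np

def sinusoidal_positional_encoding(L, d, base=10000.0):
    """PE[pos, 2i] = sin(pos / base^(2i/d)); PE[pos, 2i+1] = cos(pos / base^(2i/d))."""
    position = np.arange(L)[:, None]           # (L, 1)
    div = base ** (np.arange(0, d, 2) / d)     # wavelengths from 2π to base·2π
    pe = np.zeros((L, d))
    pe[:, 0::2] = np.sin(position / div)       # even dimensions
    pe[:, 1::2] = np.cos(position / div)       # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(L=32, d=128)
assert pe.shape == (32, 128)
assert np.all(np.abs(pe) <= 1.0)   # values lie in [-1, 1], as visualized in Fig. 3
```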
(2) _Learned positional encoding_, as its name suggests, assigns each element a learned column vector which encodes its _absolute_ position ([Gehring, et al. 2017](https://arxiv.org/abs/1705.03122)).
Following the vanilla Transformer, [Al-Rfou et al. (2018)](https://arxiv.org/abs/1808.04444) added a set of auxiliary losses to enable training a deep Transformer model on character-level language modeling:
- Each intermediate Transformer layer is used for making predictions as well. Lower layers are weighted to contribute less and less to the total loss as training progresses.
- Each position in the sequence can predict multiple targets, i.e. two or more predictions of the future tokens.
_Fig. 4. Auxiliary prediction tasks used in deep Transformer for character-level language modeling. (Image source: [Al-Rfou et al. (2018)](https://arxiv.org/abs/1808.04444))_
## Adaptive Computation Time (ACT)
where $$ M $$ is an upper limit for the number of intermediate steps allowed.
The final state and output are mean-field updates, weighted by the halting distribution $$ p_t^n $$ :

$$
\mathbf{s}_t = \sum_{n=1}^{N(t)} p_t^n \mathbf{s}_t^n
\quad\text{and}\quad
\mathbf{y}_t = \sum_{n=1}^{N(t)} p_t^n \mathbf{y}_t^n
$$
_Fig. 5. The computation graph of an RNN with ACT mechanism. (Image source: [Graves, 2016](https://arxiv.org/abs/1603.08983))_
To avoid unnecessary pondering over each input, ACT adds a _ponder cost_ $$ \mathcal{P}(x) = \sum_{t=1}^L \big(N(t) + R(t)\big) $$ to the loss function to encourage a smaller number of intermediate computational steps.
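Putting the halting pieces together, here is a toy sketch of one ACT step (in the real model the halting probabilities come from a learned sigmoid unit; the names below are illustrative, not Graves' code):

```python
import numpy as np

def act_ponder(halt_probs, states, eps=0.01, M=100):
    """Mean-field ACT update for a single input step.
    halt_probs: halting probabilities h^1..h^n from a sigmoid halting unit;
    states: the corresponding intermediate states s^1..s^n."""
    cum, p = 0.0, []
    for n, h in enumerate(halt_probs, start=1):
        if cum + h >= 1 - eps or n == min(M, len(halt_probs)):
            R = 1.0 - cum                # remainder R(t)
            p.append(R)                  # last step gets the remainder
            break
        p.append(h)                      # p_t^n = h^n before halting
        cum += h
    s = sum(pi * si for pi, si in zip(p, states))   # mean-field state s_t
    ponder_cost = n + R                  # this step contributes N(t) + R(t)
    return s, ponder_cost

s, cost = act_ponder([0.3, 0.5, 0.4], [np.ones(2), 2 * np.ones(2), 3 * np.ones(2)])
assert abs(cost - 3.2) < 1e-9            # N(t)=3, R(t)=0.2
assert np.allclose(s, 1.9)               # 0.3*1 + 0.5*2 + 0.2*3
```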
This _context segmentation_ causes several issues: the model cannot capture dependencies longer than the fixed segment length, and the first few tokens in each segment have little context to condition on.
The recurrent connection between segments is introduced into the model by continuously using the hidden states from the previous segments.
_Fig. 6. A comparison between the training phase of vanilla Transformer & Transformer-XL with a segment length of 4. (Image source: left part of Figure 2 in [Dai et al., 2019](https://arxiv.org/abs/1901.02860))._
Let's label the hidden state of the $$ n $$ -th layer for the $$ (\tau + 1) $$ -th segment in the model as $$ \mathbf{h}_{\tau+1}^{(n)} \in \mathbb{R}^{L \times d} $$ . In addition to the hidden state one layer below for the same segment, $$ \mathbf{h}_{\tau+1}^{(n-1)} $$ , it also depends on the hidden state of the same layer for the previous segment, $$ \mathbf{h}_{\tau}^{(n)} $$ . By incorporating information from the previous hidden states, the model extends the attention span much further into the past, over multiple segments.
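The segment-level recurrence can be sketched as follows (an illustrative NumPy helper; the real model applies a stop-gradient to the memory, which plain NumPy cannot express, so it is only noted in comments):

```python
import numpy as np

def xl_extended_context(h_prev_seg, h_below):
    """Key/value input for layer n of segment τ+1 in Transformer-XL:
    the hidden state of the same layer for segment τ (treated as frozen
    memory, i.e. SG(h_τ^{(n)}) in the paper), concatenated with the hidden
    state one layer below for the current segment, h_{τ+1}^{(n-1)}."""
    memory = h_prev_seg                               # SG(·): no gradient flows back
    return np.concatenate([memory, h_below], axis=0)  # (2L, d) keys/values

# Queries are still computed from h_below alone (length L),
# so each token can attend up to 2L positions back at this layer.
ctx = xl_extended_context(np.ones((4, 8)), np.zeros((4, 8)))
assert ctx.shape == (8, 8)
```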
One key advantage of Transformer is the capability of capturing long-term dependencies, but a longer attention span also comes with higher computation and memory cost.
This is the motivation for **Adaptive Attention Span**. [Sukhbaatar, et al., (2019)](https://arxiv.org/abs/1905.07799) proposed a self-attention mechanism that seeks an optimal attention span. They hypothesized that different attention heads might assign scores differently within the same context window (See Fig. 7) and thus the optimal span would be trained separately per head.
_Fig. 7. Two attention heads in the same model, A & B, assign attention differently within the same context window. Head A attends more to the recent tokens, while head B looks further back into the past uniformly. (Image source: [Sukhbaatar, et al. 2019](https://arxiv.org/abs/1905.07799))_
Given the $$ i $$ -th token, we need to compute the attention weights between this token and other keys at positions $$ j \in S_i $$ , where $$ S_i $$ defines the $$ i $$ -th token's context window.
A _soft mask function_ $$ m_z $$ is added to control for an effective adjustable attention span.
_Fig. 8. The soft masking function used in the adaptive attention span. (Image source: [Sukhbaatar, et al. 2019](https://arxiv.org/abs/1905.07799).)_
The soft mask function is applied to the softmax elements in the attention weights:
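This masking can be sketched as follows, using the paper's mask $$ m_z(x) = \text{clip}\big(\tfrac{1}{R}(R + z - x), 0, 1\big) $$ (the default $$ R $$ below is an illustrative choice):

```python
import numpy as np

def soft_mask(x, z, R=32):
    """m_z(x) = clip((R + z - x) / R, 0, 1): fully attend within distance z,
    linearly decay over the next R positions, then mask out entirely."""
    return np.clip((R + z - x) / R, 0.0, 1.0)

def masked_softmax(scores, distances, z, R=32):
    """Apply the soft mask to the exponentiated scores before normalizing."""
    w = soft_mask(distances, z, R) * np.exp(scores)
    return w / w.sum()

assert soft_mask(0, z=10) == 1.0                       # well within the span
assert soft_mask(42, z=10, R=32) == 0.0                # beyond z + R: masked out
assert abs(soft_mask(26, z=10, R=32) - 0.5) < 1e-9     # halfway down the ramp
```

Because `z` enters the attention weights differentiably, it can be trained jointly with the rest of the model, which is what lets each head learn its own span.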
Let's label the representation of the current pixel to be generated as the query $$ \mathbf{q} $$ . The other positions whose representations are used for computing it form the memory $$ \mathbf{M} $$ , providing the keys and values.
Image Transformer introduced two types of localized $$ \mathbf{M} $$ , as illustrated below.
_Fig. 9. Illustration of 1D and 2D attention span for visual inputs in Image Transformer. The black line marks a query block and the cyan outlines the actual attention span for pixel q. (Image source: Figure 2 in [Parmar et al., 2018](https://arxiv.org/abs/1802.05751))_
(1) _1D Local Attention_: The input image is flattened in the [raster scanning](https://en.wikipedia.org/wiki/Raster_scan#Scanning_pattern) order, that is, from left to right and top to bottom. The linearized image is then partitioned into non-overlapping query blocks. The context window consists of pixels in the same query block as $$ \mathbf{q} $$ and a fixed number of additional pixels generated before this query block.
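The resulting context window can be sketched as an index computation (the names below are illustrative, not from the Image Transformer code):

```python
def local_1d_span(q_index, block_len, memory_len):
    """Flattened pixel positions a query at q_index may attend to under
    1D local attention: earlier pixels in its own query block, plus a fixed
    number (memory_len) of pixels generated just before the block."""
    block_start = (q_index // block_len) * block_len
    start = max(0, block_start - memory_len)
    return list(range(start, q_index + 1))

span = local_1d_span(q_index=10, block_len=4, memory_len=2)
assert span == [6, 7, 8, 9, 10]   # own block starts at 8, plus 2 pixels of memory
```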
Precisely, the set $$ S_i $$ is divided into $$ p $$ _non-overlapping_ subsets, where the $$ m $$ -th subset is denoted as $$ A^{(m)}_i \subset S_i $$ .
Sparse Transformer proposed two types of factorized attention. The concepts are easier to understand with 2D image inputs as examples, as illustrated in Fig. 10.
_Fig. 10. The top row illustrates the attention connectivity patterns in (a) Transformer, (b) Sparse Transformer with strided attention, and (c) Sparse Transformer with fixed attention. The bottom row contains corresponding self-attention connectivity matrices. Note that the top and bottom rows are not in the same scale. (Image source: [Child et al., 2019](https://arxiv.org/abs/1904.10509) + a few extra annotations.)_
(1) _Strided_ attention with stride $$ \ell \sim \sqrt{n} $$ . This works well with image data as the structure is aligned with strides. In the image case, each pixel would attend to all the previous $$ \ell $$ pixels in the raster scanning order (naturally covering the entire width of the image), and then those pixels attend to others in the same column (defined by another attention connectivity subset).
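The strided pattern can be sketched as a causal connectivity mask (an illustrative helper, not the paper's implementation):

```python
import numpy as np

def strided_pattern(n, stride):
    """A[i, j] = True if query i may attend to key j under strided attention:
    the previous `stride` positions, plus any earlier position whose distance
    from i is a multiple of `stride` (the 'same column' in the image case)."""
    A = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):                           # causal: j <= i
            if i - j < stride or (i - j) % stride == 0:
                A[i, j] = True
    return A

A = strided_pattern(9, 3)
assert A[8, 6] and A[8, 7] and A[8, 8]   # previous stride positions
assert A[8, 2] and A[8, 5]               # same column: distance divisible by 3
assert not A[8, 1] and not A[8, 3]       # everything else is skipped
```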
In the $$ \mathbf{Q} \mathbf{K}^\top $$ part of the attention formula, we are only interested in the largest elements, since only the large elements contribute significantly after the softmax.
A hashing scheme $$ x \mapsto h(x) $$ is _locality-sensitive_ if it preserves the distance information between data points, such that close vectors obtain similar hashes while distant vectors have very different ones. The Reformer adopts such a hashing scheme: given a fixed random matrix $$ \mathbf{R} \in \mathbb{R}^{d \times b/2} $$ (where $$ b $$ is a hyperparameter), the hash function is $$ h(x) = \arg\max([x\mathbf{R}; -x\mathbf{R}]) $$ .
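This hash is simple to implement; a NumPy sketch with toy dimensions:

```python
import numpy as np

def lsh_hash(x, R):
    """h(x) = argmax([xR; -xR]): project onto b/2 random directions and
    pick the index of the most-aligned direction (or its negation),
    yielding one of b buckets."""
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

rng = np.random.default_rng(0)
R = rng.normal(size=(8, 4))              # d=8, b=8 buckets
x = rng.normal(size=(8,))
assert 0 <= lsh_hash(x, R) < 8
# scaling by a positive constant preserves the argmax, so vectors pointing
# in the same direction always land in the same bucket:
assert lsh_hash(x, R) == lsh_hash(1.5 * x, R)
```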
_Fig. 11. Illustration of Locality-Sensitive Hashing (LSH) attention. (Image source: right part of Figure 1 in [Kitaev, et al. 2020](https://arxiv.org/abs/2001.04451))._
In LSH attention, a query can only attend to positions in the same hashing bucket, $$ S_i = \{j: h(\mathbf{q}_i) = h(\mathbf{k}_j)\} $$ . It is carried out in the following process, as illustrated in Fig. 11:
- (c) Set $$ \mathbf{Q} = \mathbf{K} $$ (precisely $$ \mathbf{k}_j = \mathbf{q}_j / \|\mathbf{q}_j\| $$ ), so that there are equal numbers of keys and queries in one bucket, easier for batching. Interestingly, this "shared-QK" config does not affect the performance of the Transformer.
- (d) Apply batching where chunks of $$ m $$ consecutive queries are grouped together.
_Fig. 12. The LSH attention consists of 4 steps: bucketing, sorting, chunking, and attention computation. (Image source: left part of Figure 1 in [Kitaev, et al. 2020](https://arxiv.org/abs/2001.04451))._
**Reversible Residual Network**
Rather than going through a fixed number of layers, Universal Transformer dynamically adjusts the number of recurrent refinement steps using [ACT](#adaptive-computation-time-act).
On a high level, the Universal Transformer can be viewed as a recurrent function for learning the hidden state representation per token. The recurrent function evolves in parallel across token positions, and information between positions is shared through self-attention.
_Fig. 13. How the Universal Transformer refines a set of hidden state representations repeatedly for every position in parallel. (Image source: Figure 1 in [Dehghani, et al. 2019](https://arxiv.org/abs/1807.03819))._
Given an input sequence of length $$ L $$ , Universal Transformer iteratively updates the representation $$ \mathbf{H}^t \in \mathbb{R}^{L \times d} $$ at step $$ t $$ for an adjustable number of steps. At step 0, $$ \mathbf{H}^0 $$ is initialized to be same as the input embedding matrix. All the positions are processed in parallel in the multi-head self-attention mechanism and then go through a recurrent transition function.
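A high-level sketch of this recurrence (the sublayers here are stand-ins; the real model also adds residual connections, layer normalization, and a per-step coordinate embedding):

```python
import numpy as np

def universal_transformer(H0, self_attention, transition, T):
    """Iteratively refine all L positions in parallel:
    H^t = Transition(SelfAttention(H^{t-1})) for t = 1..T."""
    H = H0
    for _ in range(T):
        H = transition(self_attention(H))
    return H

# toy run with identity-like stand-ins for the sublayers
H0 = np.zeros((4, 8))                                        # L=4, d=8
out = universal_transformer(H0, lambda H: H, lambda H: H + 1.0, T=3)
assert out.shape == (4, 8)
assert np.allclose(out, 3.0)   # three refinement steps applied in sequence
```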
_Fig. 14. A simplified illustration of Universal Transformer. The encoder and decoder share the same basic recurrent structure. But the decoder also attends to final encoder representation $$ \mathbf{H}^T $$ . (Image source: Figure 2 in [Dehghani, et al. 2019](https://arxiv.org/abs/1807.03819))_
In the adaptive version of Universal Transformer, the number of recurrent steps $$ T $$ is dynamically determined by [ACT](#adaptive-computation-time-act). Each position is equipped with a dynamic ACT halting mechanism. Once a per-token recurrent block halts, it stops taking more recurrent updates but simply copies the current value to the next step until all the blocks halt or until the model reaches a maximum step limit.