Commit 79d38d7

committed ("language")
1 parent bbc118e commit 79d38d7

File tree: 3 files changed, +42 -42 lines changed


app/(navbar)/attention/page.md

Lines changed: 14 additions & 14 deletions
@@ -107,7 +107,7 @@ $$
 
 where $$ [.;.] $$ is a concatenation operation. $$ \mathbf{W}^q_i, \mathbf{W}^k_i \in \mathbb{R}^{d \times d_k/h}, \mathbf{W}^v_i \in \mathbb{R}^{d \times d_v/h} $$ are weight matrices to map input embeddings of size $$ L \times d $$ into query, key and value matrices. And $$ \mathbf{W}^o \in \mathbb{R}^{d_v \times d} $$ is the output linear transformation. All the weights should be learned during training.
 
-![Multi-head scaled dot-product attention](/img/transformer/multi-head-attention.png)
+![Multi-head scaled dot-product attention](/posts/transformer-family-2/multi-head-attention.png)
 _Fig. 1. Illustration of the multi-head scaled dot-product attention mechanism. (Image source: Figure 2 in [Vaswani, et al., 2017](https://arxiv.org/abs/1706.03762))_
 
 ## Transformer
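The computation in the hunk above (project the input into queries, keys and values per head, attend with scaled dot products, concatenate the heads, then apply $$ \mathbf{W}^o $$) can be sketched in NumPy. This is a minimal illustration, not code from the post; all names are hypothetical and, for simplicity, $$ d_k = d_v = d $$ is split evenly across heads:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    # X: (L, d); Wq, Wk, Wv: (d, d); Wo: (d, d); h: number of heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    # split the projections into h heads along the feature axis
    for q, k, v in zip(np.split(Q, h, axis=-1),
                       np.split(K, h, axis=-1),
                       np.split(V, h, axis=-1)):
        scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot product
        heads.append(softmax(scores) @ v)        # attention-weighted values
    return np.concatenate(heads, axis=-1) @ Wo   # [head_1; ...; head_h] W^o

rng = np.random.default_rng(0)
L, d, h = 4, 8, 2
X = rng.normal(size=(L, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)  # shape (L, d)
```

In a trained model the four weight matrices would be learned parameters; here they are random, so only the shapes and the mechanics are meaningful.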
@@ -120,7 +120,7 @@ The **encoder** generates an attention-based representation with capability to l
 
 The function of the Transformer **decoder** is to retrieve information from the encoded representation. The architecture is quite similar to the encoder, except that the decoder contains two multi-head attention submodules instead of one in each identical repeating module. The first multi-head attention submodule is _masked_ to prevent positions from attending to the future.
 
-![Transformer](/img/transformer/transformer.png)
+![Transformer](/posts/transformer-family-2/transformer.png)
 _Fig. 2. The architecture of the vanilla Transformer model. (Image source: [Figure 17])_
 
 **Positional Encoding**
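The masking mentioned above is usually implemented by adding $$ -\infty $$ to the attention scores above the diagonal before the softmax, so each position can only attend to itself and earlier positions. A minimal, illustrative sketch (not the post's code):

```python
import numpy as np

def causal_mask(L):
    # -inf strictly above the diagonal: position i may only see j <= i
    return np.triu(np.full((L, L), -np.inf), k=1)

scores = np.zeros((3, 3)) + causal_mask(3)       # pretend raw attention scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
# row i spreads its weight uniformly over positions 0..i; future positions get 0
```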
@@ -139,7 +139,7 @@ $$
 
 In this way each dimension of the positional encoding corresponds to a sinusoid of a different wavelength, from $$ 2\pi $$ to $$ 10000 \cdot 2\pi $$ .
 
-![Transformer](/img/transformer/sinoidual-positional-encoding.png)
+![Transformer](/posts/transformer-family-2/sinoidual-positional-encoding.png)
 _Fig. 3. Sinusoidal positional encoding with $$ L=32 $$ and $$ d=128 $$ . The value is between -1 (black) and 1 (white) and the value 0 is in gray._
 
 (2) _Learned positional encoding_, as its name suggests, assigns each element a learned column vector which encodes its _absolute_ position ([Gehring, et al. 2017](https://arxiv.org/abs/1705.03122)).
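The encoding plotted in Fig. 3 follows the standard formula from Vaswani et al., $$ \text{PE}(pos, 2i) = \sin(pos / 10000^{2i/d}) $$ and $$ \text{PE}(pos, 2i+1) = \cos(pos / 10000^{2i/d}) $$. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(L, d):
    pos = np.arange(L)[:, None]            # (L, 1) positions
    i = np.arange(d // 2)[None, :]         # (1, d/2) dimension pairs
    angle = pos / (10000 ** (2 * i / d))   # wavelengths from 2*pi to 10000*2*pi
    pe = np.empty((L, d))
    pe[:, 0::2] = np.sin(angle)            # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)            # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(32, 128)  # the L=32, d=128 setting of Fig. 3
```

Every entry lies in [-1, 1], matching the black-to-white value range described in the caption.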
@@ -152,7 +152,7 @@ Following the vanilla Transformer, [Al-Rfou et al. (2018)](https://arxiv.org/abs
 - Each intermediate Transformer layer is used for making predictions as well. Lower layers are weighted to contribute less and less to the total loss as training progresses.
 - Each position in the sequence can predict multiple targets, i.e. two or more predictions of the future tokens.
 
-![Transformer](/img/transformer/transformer-aux-losses.png)
+![Transformer](/posts/transformer-family-2/transformer-aux-losses.png)
 _Fig. 4. Auxiliary prediction tasks used in deep Transformer for character-level language modeling. (Image source: [Al-Rfou et al. (2018)](https://arxiv.org/abs/1808.04444))_
 
 ## Adaptive Computation Time (ACT)
## Adaptive Computation Time (ACT)
@@ -191,7 +191,7 @@ where $$ M $$ is an upper limit for the number of intermediate steps allowed.
 
 The final state and output are mean-field updates:
 $$ s_t = \sum_{n=1}^{N(t)} p_t^n s_t^n,\quad y_t = \sum_{n=1}^{N(t)} p_t^n y_t^n $$
-![ACT computation graph](/img/transformer/ACT-computation-graph.png)
+![ACT computation graph](/posts/transformer-family-2/ACT-computation-graph.png)
 _Fig. 5. The computation graph of a RNN with ACT mechanism. (Image source: [Graves, 2016](https://arxiv.org/abs/1603.08983))_
 
 To avoid unnecessary pondering over each input, ACT adds a _ponder cost_ $$ \mathcal{P}(x) = \sum_{t=1}^L N(t) + R(t) $$ in the loss function to encourage a smaller number of intermediate computational steps.
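The halting rule behind $$ N(t) $$ and the mean-field update can be sketched with scalar states: halting probabilities $$ p_t^n $$ accumulate until they would exceed $$ 1 - \epsilon $$, the remainder $$ R(t) $$ is assigned to the last pondering step, and the final state is the probability-weighted sum. A toy, illustrative sketch (names and values are not from the paper):

```python
def act_halt(halting_probs, states, eps=0.01):
    # halting_probs: per-pondering-step p^n; states: matching intermediate states
    total, weights = 0.0, []
    for p in halting_probs:
        if total + p >= 1 - eps:        # halting condition reached
            weights.append(1.0 - total)  # remainder R(t) goes to the last step
            break
        weights.append(p)
        total += p
    n = len(weights)                     # N(t): number of pondering steps taken
    s = sum(w * s_n for w, s_n in zip(weights, states))  # mean-field state
    return n, s

n, s = act_halt([0.3, 0.4, 0.5], [1.0, 2.0, 3.0])
# halts at the third step with weights (0.3, 0.4, 0.3); s = 0.3*1 + 0.4*2 + 0.3*3
```

The ponder cost then simply penalizes large `n` (plus the remainder), pushing the model toward fewer pondering steps.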
@@ -219,7 +219,7 @@ This _context segmentation_ causes several issues:
 
 The recurrent connection between segments is introduced into the model by continuously using the hidden states from the previous segments.
 
-![Training phase of Transformer-XL](/img/transformer/transformer-XL-training.png)
+![Training phase of Transformer-XL](/posts/transformer-family-2/transformer-XL-training.png)
 _Fig. 6. A comparison between the training phase of the vanilla Transformer & Transformer-XL with a segment length of 4. (Image source: left part of Figure 2 in [Dai et al., 2019](https://arxiv.org/abs/1901.02860))._
 
 Let's label the hidden state of the $$ n $$ -th layer for the $$ (\tau + 1) $$ -th segment in the model as $$ \mathbf{h}_{\tau+1}^{(n)} \in \mathbb{R}^{L \times d} $$ . In addition to the hidden state of the previous layer for the same segment $$ \mathbf{h}_{\tau+1}^{(n-1)} $$ , it also depends on the hidden state of the same layer for the previous segment $$ \mathbf{h}_{\tau}^{(n)} $$ . By incorporating information from the previous hidden states, the model extends the attention span much further into the past, over multiple segments.
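The segment recurrence amounts to concatenating a cached (gradient-stopped) copy of $$ \mathbf{h}_{\tau}^{(n)} $$ in front of the current segment's input when forming keys and values. A hedged NumPy sketch, with arrays standing in for hidden states and a plain copy standing in for the stop-gradient:

```python
import numpy as np

def xl_key_value_context(h_prev, h_curr):
    # h_prev: (L, d) same-layer hidden state of the previous segment (cached)
    # h_curr: (L, d) hidden state from the layer below for the current segment
    memory = h_prev.copy()  # stands in for stop-gradient: no grads flow back
    return np.concatenate([memory, h_curr], axis=0)  # (2L, d) attended context

L, d = 4, 8
ctx = xl_key_value_context(np.zeros((L, d)), np.ones((L, d)))
# keys/values now span two segments, so attention reaches into the past
```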
@@ -272,7 +272,7 @@ One key advantage of Transformer is the capability of capturing long-term depend
 
 This is the motivation for **Adaptive Attention Span**. [Sukhbaatar, et al., (2019)](https://arxiv.org/abs/1905.07799) proposed a self-attention mechanism that seeks an optimal attention span. They hypothesized that different attention heads might assign scores differently within the same context window (see Fig. 7) and thus the optimal span would be trained separately per head.
 
-![Attention per head](/img/transformer/attention-per-head.png)
+![Attention per head](/posts/transformer-family-2/attention-per-head.png)
 _Fig. 7. Two attention heads in the same model, A & B, assign attention differently within the same context window. Head A attends more to the recent tokens, while head B looks further back into the past uniformly. (Image source: [Sukhbaatar, et al. 2019](https://arxiv.org/abs/1905.07799))_
 
 Given the $$ i $$ -th token, we need to compute the attention weights between this token and other keys at positions $$ j \in S_i $$ , where $$ S_i $$ defines the $$ i $$ -th token's context window.
@@ -289,7 +289,7 @@ A _soft mask function_ $$ m_z $$ is added to control for an effective adjustable
 $$ m_z(x) = \text{clamp}(\frac{1}{R}(R+z-x), 0, 1) $$
 where $$ R $$ is a hyper-parameter which defines the softness of $$ m_z $$ .
 
-![Soft masking function](/img/transformer/soft-masking-function.png)
+![Soft masking function](/posts/transformer-family-2/soft-masking-function.png)
 _Fig. 8. The soft masking function used in the adaptive attention span. (Image source: [Sukhbaatar, et al. 2019](https://arxiv.org/abs/1905.07799).)_
 
 The soft mask function is applied to the softmax elements in the attention weights:
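The mask above is a direct clamp; transcribing it makes the behavior easy to check. Here $$ x $$ is the distance between query and key positions, $$ z $$ the learnable span, and $$ R $$ the width of the soft ramp (a minimal sketch, not library code):

```python
def soft_mask(x, z, R):
    # m_z(x) = clamp((R + z - x) / R, 0, 1)
    return min(max((R + z - x) / R, 0.0), 1.0)

# Inside the span the mask is fully on; beyond z + R it is fully off,
# with a linear ramp of width R in between.
```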
@@ -315,7 +315,7 @@ Let's label the representation of the current pixel to be generated as the query
 
 Image Transformer introduced two types of localized $$ \mathbf{M} $$ , as illustrated below.
 
-![Attention patterns in Image Transformer](/img/transformer/image-transformer-attention.png)
+![Attention patterns in Image Transformer](/posts/transformer-family-2/image-transformer-attention.png)
 _Fig. 9. Illustration of 1D and 2D attention span for visual inputs in Image Transformer. The black line marks a query block and the cyan outlines the actual attention span for pixel q. (Image source: Figure 2 in [Parmar et al., 2018](https://arxiv.org/abs/1802.05751))_
 
 (1) _1D Local Attention_: The input image is flattened in the [raster scanning](https://en.wikipedia.org/wiki/Raster_scan#Scanning_pattern) order, that is, from left to right and top to bottom. The linearized image is then partitioned into non-overlapping query blocks. The context window consists of pixels in the same query block as $$ \mathbf{q} $$ and a fixed number of additional pixels generated before this query block.
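The 1D local context window just described can be written out explicitly: cut the flattened image into query blocks of length `lq` and let each query see its own block prefix plus `m` pixels of memory before the block. An illustrative sketch (names and the exact window convention are assumptions, not the paper's code):

```python
def local_1d_window(i, lq, m):
    # i: index of the query pixel in raster order
    # lq: query block length; m: extra memory pixels before the block
    block_start = (i // lq) * lq                        # start of i's query block
    return list(range(max(0, block_start - m), i + 1))  # memory + block prefix

w = local_1d_window(i=10, lq=4, m=3)
# pixel 10 sits in block [8, 12); it sees pixels 5..10
```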
@@ -353,7 +353,7 @@ Precisely, the set $$ S_i $$ is divided into $$ p $$ _non-overlapping_ subsets,
 
 Sparse Transformer proposed two types of factorized attention. It is easier to understand the concepts as illustrated in Fig. 10 with 2D image inputs as examples.
 
-![Sparse attention](/img/transformer/sparse-attention.png)
+![Sparse attention](/posts/transformer-family-2/sparse-attention.png)
 _Fig. 10. The top row illustrates the attention connectivity patterns in (a) Transformer, (b) Sparse Transformer with strided attention, and (c) Sparse Transformer with fixed attention. The bottom row contains corresponding self-attention connectivity matrices. Note that the top and bottom rows are not in the same scale. (Image source: [Child et al., 2019](https://arxiv.org/abs/1904.10509) + a few extra annotations.)_
 
 (1) _Strided_ attention with stride $$ \ell \sim \sqrt{n} $$ . This works well with image data as the structure is aligned with strides. In the image case, each pixel would attend to all the previous $$ \ell $$ pixels in the raster scanning order (naturally covering the entire width of the image) and then those pixels attend to others in the same column (defined by another attention connectivity subset).
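The two strided connectivity subsets can be enumerated directly: one subset is the previous $$ \ell $$ positions (one image row in raster order, including the query itself here), the other is every $$ \ell $$-th earlier position (the same image column). A pure-Python illustration (the inclusive-of-`i` convention is a simplifying assumption):

```python
def strided_pattern(i, l):
    # subset 1: the previous l positions in raster order (plus i itself)
    row = set(range(max(0, i - l), i + 1))
    # subset 2: every l-th earlier position, i.e. the same image column
    col = {t for t in range(0, i + 1) if (i - t) % l == 0}
    return row, col

row, col = strided_pattern(i=10, l=4)
# for a 4-pixel-wide image, row covers one image row and col one column
```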
@@ -408,7 +408,7 @@ In $$ \mathbf{Q} \mathbf{K}^\top $$ part of the [attention formula](#attention-a
 
 A hashing scheme $$ x \mapsto h(x) $$ is _locality-sensitive_ if it preserves the distance information between data points, such that close vectors obtain similar hashes while distant vectors have very different ones. The Reformer adopts such a hashing scheme: given a fixed random matrix $$ \mathbf{R} \in \mathbb{R}^{d \times b/2} $$ (where $$ b $$ is a hyperparameter), the hash function is $$ h(x) = \arg\max([xR; -xR]) $$ .
 
-![LSH attention matrix](/img/transformer/LSH-attention-matrix.png)
+![LSH attention matrix](/posts/transformer-family-2/LSH-attention-matrix.png)
 _Fig. 11. Illustration of Locality-Sensitive Hashing (LSH) attention. (Image source: right part of Figure 1 in [Kitaev, et al. 2020](https://arxiv.org/abs/2001.04451))._
 
 In LSH attention, a query can only attend to positions in the same hashing bucket, $$ S_i = \{j: h(\mathbf{q}_i) = h(\mathbf{k}_j)\} $$ . It is carried out in the following process, as illustrated in Fig. 11:
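The hash $$ h(x) = \arg\max([xR; -xR]) $$ is short enough to transcribe directly: project onto $$ b/2 $$ random directions and pick the most-aligned of the $$ b $$ signed directions as the bucket. A NumPy sketch (shapes and names are illustrative):

```python
import numpy as np

def lsh_hash(x, R):
    # x: (d,) vector; R: (d, b/2) fixed random matrix -> bucket id in [0, b)
    proj = x @ R
    return int(np.argmax(np.concatenate([proj, -proj])))

rng = np.random.default_rng(0)
d, b = 8, 16
R = rng.normal(size=(d, b // 2))
x = rng.normal(size=d)
# the hash depends only on direction: positively scaling x keeps its bucket
same = lsh_hash(x, R) == lsh_hash(1.5 * x, R)
```

This angular behavior is why step (c) of the procedure can normalize keys to the unit sphere without changing bucket assignments.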
@@ -418,7 +418,7 @@ In LSH attention, a query can only attend to positions in the same hashing bucke
 - (c) Set $$ \mathbf{Q} = \mathbf{K} $$ (precisely $$ \mathbf{k}_j = \mathbf{q}_j / \|\mathbf{q}_j\| $$ ), so that there are equal numbers of keys and queries in one bucket, which is easier for batching. Interestingly, this "shared-QK" config does not affect the performance of the Transformer.
 - (d) Apply batching where chunks of $$ m $$ consecutive queries are grouped together.
 
-![LSH attention](/img/transformer/LSH-attention.png)
+![LSH attention](/posts/transformer-family-2/LSH-attention.png)
 _Fig. 12. The LSH attention consists of 4 steps: bucketing, sorting, chunking, and attention computation. (Image source: left part of Figure 1 in [Kitaev, et al. 2020](https://arxiv.org/abs/2001.04451))._
 
 **Reversible Residual Network**
@@ -459,7 +459,7 @@ Rather than going through a fixed number of layers, Universal Transformer dynami
 
 On a high level, the Universal Transformer can be viewed as a recurrent function for learning the hidden state representation per token. The recurrent function evolves in parallel across token positions and the information between positions is shared through self-attention.
 
-![Universal Transformer Recurrent Step](/img/transformer/universal-transformer-loop.png)
+![Universal Transformer Recurrent Step](/posts/transformer-family-2/universal-transformer-loop.png)
 _Fig. 13. How the Universal Transformer refines a set of hidden state representations repeatedly for every position in parallel. (Image source: Figure 1 in [Dehghani, et al. 2019](https://arxiv.org/abs/1807.03819))._
 
 Given an input sequence of length $$ L $$ , the Universal Transformer iteratively updates the representation $$ \mathbf{H}^t \in \mathbb{R}^{L \times d} $$ at step $$ t $$ for an adjustable number of steps. At step 0, $$ \mathbf{H}^0 $$ is initialized to be the same as the input embedding matrix. All the positions are processed in parallel in the multi-head self-attention mechanism and then go through a recurrent transition function.
@@ -483,7 +483,7 @@ $$
 \end{cases}
 $$
 
-![Universal Transformer](/img/transformer/universal-transformer.png)
+![Universal Transformer](/posts/transformer-family-2/universal-transformer.png)
 _Fig. 14. A simplified illustration of the Universal Transformer. The encoder and decoder share the same basic recurrent structure. But the decoder also attends to the final encoder representation $$ \mathbf{H}^T $$ . (Image source: Figure 2 in [Dehghani, et al. 2019](https://arxiv.org/abs/1807.03819))_
 
 In the adaptive version of the Universal Transformer, the number of recurrent steps $$ T $$ is dynamically determined by [ACT](#adaptive-computation-time-act). Each position is equipped with a dynamic ACT halting mechanism. Once a per-token recurrent block halts, it stops taking more recurrent updates and simply copies the current value to the next step until all the blocks halt or the model reaches a maximum step limit.
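The per-position halt-and-copy behavior can be mimicked with a toy loop: halted positions copy their value forward, the rest keep updating, and the loop ends when everyone has halted or a step limit is hit. Everything here is illustrative (integer "states", precomputed halt steps) rather than the paper's mechanism, which derives halting from learned ACT probabilities:

```python
def adaptive_updates(states, halt_step, T_max):
    # states: one toy value per position; halt_step[i]: step at which i halts
    for t in range(1, T_max + 1):
        if all(t > s for s in halt_step):
            break                                   # every position has halted
        states = [x if t > halt_step[i] else x + 1  # halted: copy; else: update
                  for i, x in enumerate(states)]
    return states

out = adaptive_updates([0, 0, 0], halt_step=[1, 3, 2], T_max=10)
# each position ends up updated exactly halt_step[i] times
```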
