where $$ [.;.] $$ is a concatenation operation. $$ \mathbf{W}^q_i, \mathbf{W}^k_i \in \mathbb{R}^{d \times d_k/h} $$ and $$ \mathbf{W}^v_i \in \mathbb{R}^{d \times d_v/h} $$ are weight matrices that map input embeddings of size $$ L \times d $$ into query, key, and value matrices, and $$ \mathbf{W}^o \in \mathbb{R}^{d_v \times d} $$ is the output linear transformation. All the weights are learned during training.
_Fig. 1. Illustration of the multi-head scaled dot-product attention mechanism. (Image source: Figure 2 in [Vaswani, et al., 2017](https://arxiv.org/abs/1706.03762))_
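As a sketch of the equations above, here is a minimal NumPy version of multi-head scaled dot-product attention (shapes and names are illustrative, not taken from the paper's reference code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """X: (L, d); Wq/Wk/Wv: lists of h per-head projections; Wo: (d_v, d)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i   # per-head query/key/value
        scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot product
        heads.append(softmax(scores) @ V)        # (L, d_v/h)
    return np.concatenate(heads, axis=-1) @ Wo   # [.;.] then output projection

# toy shapes: L=5 tokens, d=16, h=4 heads, d_k = d_v = 16
rng = np.random.default_rng(0)
L, d, h, dk, dv = 5, 16, 4, 16, 16
Wq = [rng.normal(size=(d, dk // h)) for _ in range(h)]
Wk = [rng.normal(size=(d, dk // h)) for _ in range(h)]
Wv = [rng.normal(size=(d, dv // h)) for _ in range(h)]
Wo = rng.normal(size=(dv, d))
out = multi_head_attention(rng.normal(size=(L, d)), Wq, Wk, Wv, Wo)
assert out.shape == (L, d)
```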
## Transformer
The **encoder** generates an attention-based representation with the capability to locate a specific piece of information from a large context.
The function of the Transformer **decoder** is to retrieve information from the encoded representation. Its architecture is quite similar to the encoder's, except that each identical repeating module of the decoder contains two multi-head attention submodules instead of one. The first multi-head attention submodule is _masked_ to prevent positions from attending to the future.
_Fig. 2. The architecture of the vanilla Transformer model. (Image source: Figure 17)_
**Positional Encoding**
In this way, each dimension of the positional encoding corresponds to a sinusoid, with wavelengths forming a geometric progression from $$ 2\pi $$ to $$ 10000 \cdot 2\pi $$ across dimensions.
_Fig. 3. Sinusoidal positional encoding with $$ L=32 $$ and $$ d=128 $$ . The value is between -1 (black) and 1 (white) and the value 0 is in gray._
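The encoding can be sketched directly (a minimal NumPy version, assuming an even embedding dimension $$ d $$):

```python
import numpy as np

def sinusoidal_positional_encoding(L, d, base=10000.0):
    """PE[pos, 2i] = sin(pos / base^(2i/d)); PE[pos, 2i+1] = cos(pos / base^(2i/d))."""
    position = np.arange(L)[:, None]           # (L, 1)
    div = base ** (np.arange(0, d, 2) / d)     # wavelengths from 2π to base·2π
    pe = np.zeros((L, d))
    pe[:, 0::2] = np.sin(position / div)       # even dimensions
    pe[:, 1::2] = np.cos(position / div)       # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(L=32, d=128)
assert pe.shape == (32, 128)
assert np.all(np.abs(pe) <= 1.0)   # values lie in [-1, 1], as visualized in Fig. 3
```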
(2) _Learned positional encoding_, as its name suggests, assigns each element a learned column vector which encodes its _absolute_ position ([Gehring, et al. 2017](https://arxiv.org/abs/1705.03122)).
Following the vanilla Transformer, [Al-Rfou et al. (2018)](https://arxiv.org/abs/1808.04444) added a set of auxiliary losses to enable training a deep Transformer model on character-level language modeling:
- Each intermediate Transformer layer is used for making predictions as well. Lower layers are weighted to contribute less and less to the total loss as training progresses.
- Each position in the sequence can predict multiple targets, i.e. two or more predictions of the future tokens.
_Fig. 4. Auxiliary prediction tasks used in deep Transformer for character-level language modeling. (Image source: [Al-Rfou et al. (2018)](https://arxiv.org/abs/1808.04444))_
## Adaptive Computation Time (ACT)
where $$ M $$ is an upper limit for the number of intermediate steps allowed.
The final state and output are mean-field updates, weighted by the halting distribution $$ p_t^n $$ :

$$
\mathbf{s}_t = \sum_{n=1}^{N(t)} p_t^n \mathbf{s}_t^n
\quad\text{and}\quad
\mathbf{y}_t = \sum_{n=1}^{N(t)} p_t^n \mathbf{y}_t^n
$$
_Fig. 5. The computation graph of an RNN with ACT mechanism. (Image source: [Graves, 2016](https://arxiv.org/abs/1603.08983))_
To avoid unnecessary pondering over each input, ACT adds a _ponder cost_ $$ \mathcal{P}(x) = \sum_{t=1}^L \big(N(t) + R(t)\big) $$ to the loss function to encourage a smaller number of intermediate computational steps.
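Putting the halting pieces together, here is a toy sketch of one ACT step (in the real model the halting probabilities come from a learned sigmoid unit; the names below are illustrative, not Graves' code):

```python
import numpy as np

def act_ponder(halt_probs, states, eps=0.01, M=100):
    """Mean-field ACT update for a single input step.
    halt_probs: halting probabilities h^1..h^n from a sigmoid halting unit;
    states: the corresponding intermediate states s^1..s^n."""
    cum, p = 0.0, []
    for n, h in enumerate(halt_probs, start=1):
        if cum + h >= 1 - eps or n == min(M, len(halt_probs)):
            R = 1.0 - cum                # remainder R(t)
            p.append(R)                  # last step gets the remainder
            break
        p.append(h)                      # p_t^n = h^n before halting
        cum += h
    s = sum(pi * si for pi, si in zip(p, states))   # mean-field state s_t
    ponder_cost = n + R                  # this step contributes N(t) + R(t)
    return s, ponder_cost

s, cost = act_ponder([0.3, 0.5, 0.4], [np.ones(2), 2 * np.ones(2), 3 * np.ones(2)])
assert abs(cost - 3.2) < 1e-9            # N(t)=3, R(t)=0.2
assert np.allclose(s, 1.9)               # 0.3*1 + 0.5*2 + 0.2*3
```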
This _context segmentation_ causes several issues: the model cannot capture dependencies longer than the fixed segment length, and the first few tokens in each segment have little context to condition on.
The recurrent connection between segments is introduced into the model by continuously using the hidden states from the previous segments.
_Fig. 6. A comparison between the training phase of vanilla Transformer & Transformer-XL with a segment length of 4. (Image source: left part of Figure 2 in [Dai et al., 2019](https://arxiv.org/abs/1901.02860))._
Let's label the hidden state of the $$ n $$ -th layer for the $$ (\tau + 1) $$ -th segment in the model as $$ \mathbf{h}_{\tau+1}^{(n)} \in \mathbb{R}^{L \times d} $$ . In addition to the hidden state one layer below for the same segment, $$ \mathbf{h}_{\tau+1}^{(n-1)} $$ , it also depends on the hidden state of the same layer for the previous segment, $$ \mathbf{h}_{\tau}^{(n)} $$ . By incorporating information from the previous hidden states, the model extends the attention span much further into the past, over multiple segments.
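The segment-level recurrence can be sketched as follows (an illustrative NumPy helper; the real model applies a stop-gradient to the memory, which plain NumPy cannot express, so it is only noted in comments):

```python
import numpy as np

def xl_extended_context(h_prev_seg, h_below):
    """Key/value input for layer n of segment τ+1 in Transformer-XL:
    the hidden state of the same layer for segment τ (treated as frozen
    memory, i.e. SG(h_τ^{(n)}) in the paper), concatenated with the hidden
    state one layer below for the current segment, h_{τ+1}^{(n-1)}."""
    memory = h_prev_seg                               # SG(·): no gradient flows back
    return np.concatenate([memory, h_below], axis=0)  # (2L, d) keys/values

# Queries are still computed from h_below alone (length L),
# so each token can attend up to 2L positions back at this layer.
ctx = xl_extended_context(np.ones((4, 8)), np.zeros((4, 8)))
assert ctx.shape == (8, 8)
```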
One key advantage of Transformer is the capability of capturing long-term dependencies, but a longer attention span also comes with higher computation and memory cost.
This is the motivation for **Adaptive Attention Span**. [Sukhbaatar, et al., (2019)](https://arxiv.org/abs/1905.07799) proposed a self-attention mechanism that seeks an optimal attention span. They hypothesized that different attention heads might assign scores differently within the same context window (See Fig. 7) and thus the optimal span would be trained separately per head.
_Fig. 7. Two attention heads in the same model, A & B, assign attention differently within the same context window. Head A attends more to the recent tokens, while head B looks further back into the past uniformly. (Image source: [Sukhbaatar, et al. 2019](https://arxiv.org/abs/1905.07799))_
Given the $$ i $$ -th token, we need to compute the attention weights between this token and other keys at positions $$ j \in S_i $$ , where $$ S_i $$ defines the $$ i $$ -th token's context window.
A _soft mask function_ $$ m_z $$ is added to control for an effective adjustable attention span.
_Fig. 8. The soft masking function used in the adaptive attention span. (Image source: [Sukhbaatar, et al. 2019](https://arxiv.org/abs/1905.07799).)_
The soft mask function is applied to the softmax elements in the attention weights:
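This masking can be sketched as follows, using the paper's mask $$ m_z(x) = \text{clip}\big(\tfrac{1}{R}(R + z - x), 0, 1\big) $$ (the default $$ R $$ below is an illustrative choice):

```python
import numpy as np

def soft_mask(x, z, R=32):
    """m_z(x) = clip((R + z - x) / R, 0, 1): fully attend within distance z,
    linearly decay over the next R positions, then mask out entirely."""
    return np.clip((R + z - x) / R, 0.0, 1.0)

def masked_softmax(scores, distances, z, R=32):
    """Apply the soft mask to the exponentiated scores before normalizing."""
    w = soft_mask(distances, z, R) * np.exp(scores)
    return w / w.sum()

assert soft_mask(0, z=10) == 1.0                       # well within the span
assert soft_mask(42, z=10, R=32) == 0.0                # beyond z + R: masked out
assert abs(soft_mask(26, z=10, R=32) - 0.5) < 1e-9     # halfway down the ramp
```

Because `z` enters the attention weights differentiably, it can be trained jointly with the rest of the model, which is what lets each head learn its own span.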
Let's label the representation of the current pixel to be generated as the query $$ \mathbf{q} $$ . The other positions whose representations are used for computing it form the memory $$ \mathbf{M} $$ , providing the keys and values.
Image Transformer introduced two types of localized $$ \mathbf{M} $$ , as illustrated below.
_Fig. 9. Illustration of 1D and 2D attention span for visual inputs in Image Transformer. The black line marks a query block and the cyan outlines the actual attention span for pixel q. (Image source: Figure 2 in [Parmar et al., 2018](https://arxiv.org/abs/1802.05751))_
(1) _1D Local Attention_: The input image is flattened in the [raster scanning](https://en.wikipedia.org/wiki/Raster_scan#Scanning_pattern) order, that is, from left to right and top to bottom. The linearized image is then partitioned into non-overlapping query blocks. The context window consists of pixels in the same query block as $$ \mathbf{q} $$ and a fixed number of additional pixels generated before this query block.
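The resulting context window can be sketched as an index computation (the names below are illustrative, not from the Image Transformer code):

```python
def local_1d_span(q_index, block_len, memory_len):
    """Flattened pixel positions a query at q_index may attend to under
    1D local attention: earlier pixels in its own query block, plus a fixed
    number (memory_len) of pixels generated just before the block."""
    block_start = (q_index // block_len) * block_len
    start = max(0, block_start - memory_len)
    return list(range(start, q_index + 1))

span = local_1d_span(q_index=10, block_len=4, memory_len=2)
assert span == [6, 7, 8, 9, 10]   # own block starts at 8, plus 2 pixels of memory
```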
Precisely, the set $$ S_i $$ is divided into $$ p $$ _non-overlapping_ subsets, where the $$ m $$ -th subset is denoted as $$ A^{(m)}_i \subset S_i $$ .
Sparse Transformer proposed two types of factorized attention. The concepts are easier to understand with 2D image inputs as examples, as illustrated in Fig. 10.
_Fig. 10. The top row illustrates the attention connectivity patterns in (a) Transformer, (b) Sparse Transformer with strided attention, and (c) Sparse Transformer with fixed attention. The bottom row contains corresponding self-attention connectivity matrices. Note that the top and bottom rows are not in the same scale. (Image source: [Child et al., 2019](https://arxiv.org/abs/1904.10509) + a few extra annotations.)_
(1) _Strided_ attention with stride $$ \ell \sim \sqrt{n} $$ . This works well with image data as the structure is aligned with strides. In the image case, each pixel would attend to all the previous $$ \ell $$ pixels in the raster scanning order (naturally covering the entire width of the image), and then those pixels attend to others in the same column (defined by another attention connectivity subset).
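The strided pattern can be sketched as a causal connectivity mask (an illustrative helper, not the paper's implementation):

```python
import numpy as np

def strided_pattern(n, stride):
    """A[i, j] = True if query i may attend to key j under strided attention:
    the previous `stride` positions, plus any earlier position whose distance
    from i is a multiple of `stride` (the 'same column' in the image case)."""
    A = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):                           # causal: j <= i
            if i - j < stride or (i - j) % stride == 0:
                A[i, j] = True
    return A

A = strided_pattern(9, 3)
assert A[8, 6] and A[8, 7] and A[8, 8]   # previous stride positions
assert A[8, 2] and A[8, 5]               # same column: distance divisible by 3
assert not A[8, 1] and not A[8, 3]       # everything else is skipped
```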
In the $$ \mathbf{Q} \mathbf{K}^\top $$ part of the attention formula, we are only interested in the largest elements, since only the large elements contribute significantly after the softmax.
A hashing scheme $$ x \mapsto h(x) $$ is _locality-sensitive_ if it preserves the distance information between data points, such that close vectors obtain similar hashes while distant vectors have very different ones. The Reformer adopts such a hashing scheme: given a fixed random matrix $$ \mathbf{R} \in \mathbb{R}^{d \times b/2} $$ (where $$ b $$ is a hyperparameter), the hash function is $$ h(x) = \arg\max([x\mathbf{R}; -x\mathbf{R}]) $$ .
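This hash is simple to implement; a NumPy sketch with toy dimensions:

```python
import numpy as np

def lsh_hash(x, R):
    """h(x) = argmax([xR; -xR]): project onto b/2 random directions and
    pick the index of the most-aligned direction (or its negation),
    yielding one of b buckets."""
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

rng = np.random.default_rng(0)
R = rng.normal(size=(8, 4))              # d=8, b=8 buckets
x = rng.normal(size=(8,))
assert 0 <= lsh_hash(x, R) < 8
# scaling by a positive constant preserves the argmax, so vectors pointing
# in the same direction always land in the same bucket:
assert lsh_hash(x, R) == lsh_hash(1.5 * x, R)
```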
_Fig. 11. Illustration of Locality-Sensitive Hashing (LSH) attention. (Image source: right part of Figure 1 in [Kitaev, et al. 2020](https://arxiv.org/abs/2001.04451))._
In LSH attention, a query can only attend to positions in the same hashing bucket, $$ S_i = \{j: h(\mathbf{q}_i) = h(\mathbf{k}_j)\} $$ . It is carried out in the following process, as illustrated in Fig. 11:
- (c) Set $$ \mathbf{Q} = \mathbf{K} $$ (precisely $$ \mathbf{k}_j = \mathbf{q}_j / \|\mathbf{q}_j\| $$ ), so that there are equal numbers of keys and queries in one bucket, easier for batching. Interestingly, this "shared-QK" config does not affect the performance of the Transformer.
- (d) Apply batching where chunks of $$ m $$ consecutive queries are grouped together.
_Fig. 12. The LSH attention consists of 4 steps: bucketing, sorting, chunking, and attention computation. (Image source: left part of Figure 1 in [Kitaev, et al. 2020](https://arxiv.org/abs/2001.04451))._
**Reversible Residual Network**
Rather than going through a fixed number of layers, Universal Transformer dynamically adjusts the number of recurrent refinement steps using [ACT](#adaptive-computation-time-act).
On a high level, the Universal Transformer can be viewed as a recurrent function for learning the hidden state representation per token. The recurrent function evolves in parallel across token positions, and information between positions is shared through self-attention.
_Fig. 13. How the Universal Transformer refines a set of hidden state representations repeatedly for every position in parallel. (Image source: Figure 1 in [Dehghani, et al. 2019](https://arxiv.org/abs/1807.03819))._
Given an input sequence of length $$ L $$ , Universal Transformer iteratively updates the representation $$ \mathbf{H}^t \in \mathbb{R}^{L \times d} $$ at step $$ t $$ for an adjustable number of steps. At step 0, $$ \mathbf{H}^0 $$ is initialized to be same as the input embedding matrix. All the positions are processed in parallel in the multi-head self-attention mechanism and then go through a recurrent transition function.
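A high-level sketch of this recurrence (the sublayers here are stand-ins; the real model also adds residual connections, layer normalization, and a per-step coordinate embedding):

```python
import numpy as np

def universal_transformer(H0, self_attention, transition, T):
    """Iteratively refine all L positions in parallel:
    H^t = Transition(SelfAttention(H^{t-1})) for t = 1..T."""
    H = H0
    for _ in range(T):
        H = transition(self_attention(H))
    return H

# toy run with identity-like stand-ins for the sublayers
H0 = np.zeros((4, 8))                                        # L=4, d=8
out = universal_transformer(H0, lambda H: H, lambda H: H + 1.0, T=3)
assert out.shape == (4, 8)
assert np.allclose(out, 3.0)   # three refinement steps applied in sequence
```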
_Fig. 14. A simplified illustration of Universal Transformer. The encoder and decoder share the same basic recurrent structure. But the decoder also attends to final encoder representation $$ \mathbf{H}^T $$ . (Image source: Figure 2 in [Dehghani, et al. 2019](https://arxiv.org/abs/1807.03819))_
In the adaptive version of Universal Transformer, the number of recurrent steps $$ T $$ is dynamically determined by [ACT](#adaptive-computation-time-act). Each position is equipped with a dynamic ACT halting mechanism. Once a per-token recurrent block halts, it stops taking more recurrent updates but simply copies the current value to the next step until all the blocks halt or until the model reaches a maximum step limit.