
Commit f9537e5

add custom language
1 parent e57b560 commit f9537e5

23 files changed

Lines changed: 1592 additions & 732 deletions

File tree

app/(private)/attention/page.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ import CiteGraph from 'components/citegraph/citegraph'

<CiteGraph className="w-full"/>

-The name CiteGraph is derived from `stale-while-revalidate`, an HTTP cache invalidation strategy popularized by [HTTP RFC 5861](https://tools.ietf.org/html/rfc5861).
+The name "CiteGraph" is derived from `stale-while-revalidate`, an HTTP cache invalidation strategy popularized by [HTTP RFC 5861](https://tools.ietf.org/html/rfc5861).

CiteGraph is a strategy to first return the data from cache (stale), then send the fetch request (revalidate), and finally arrive with the up-to-date data.

<Callout emoji="">

app/(private)/campuchia/page.md

Lines changed: 36 additions & 36 deletions
@@ -29,7 +29,7 @@

# Transformer Basics

-The **Transformer** model (referred to as the vanilla Transformer to distinguish it from other improved versions; [Vaswani, et al., 2017](https://arxiv.org/abs/1706.03762)) has an encoder-decoder architecture that is commonly used in many [NMT](https://lilianweng.github.io/posts/2018-06-24-attention/#born-for-translation) models. Later, simplified Transformers were shown to achieve excellent results on language modeling tasks, as in the encoder-only [BERT](https://lilianweng.github.io/posts/2019-01-31-lm/#bert) or the decoder-only [GPT](https://lilianweng.github.io/posts/2019-01-31-lm/#openai-gpt).
+The **Transformer** model (referred to as the "vanilla Transformer" to distinguish it from other improved versions; [Vaswani, et al., 2017](https://arxiv.org/abs/1706.03762)) has an encoder-decoder architecture that is commonly used in many [NMT](https://lilianweng.github.io/posts/2018-06-24-attention/#born-for-translation) models. Later, simplified Transformers were shown to achieve excellent results on language modeling tasks, as in the encoder-only [BERT](https://lilianweng.github.io/posts/2019-01-31-lm/#bert) or the decoder-only [GPT](https://lilianweng.github.io/posts/2019-01-31-lm/#openai-gpt).

## Attention and Self-Attention
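As a concrete reference point for the attention sections of this file, here is a minimal NumPy sketch of the scaled dot-product attention of Vaswani et al., 2017; shapes and variable names are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k) similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

# Self-attention: queries, keys, and values all come from the same sequence.
x = np.random.randn(5, 64)   # 5 tokens, d_model = 64
out = scaled_dot_product_attention(x, x, x)
```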

@@ -197,7 +197,7 @@ The vanilla Transformer has a fixed and limited attention span
- It is hard to predict the first few tokens in each segment, given no or only thin context.
- Evaluation is expensive. Whenever the segment is shifted right by one, the new segment is re-processed from scratch, even though many tokens overlap.

-<a id="transformer-xl"></a>**Transformer-XL** ([Dai et al., 2019](https://arxiv.org/abs/1901.02860); “XL” means extra long) modifies the architecture to reuse hidden states between segments with additional memory. The recurrent connection between segments is introduced into the model by continually using the hidden states from previous segments.
+<a id="transformer-xl"></a>**Transformer-XL** ([Dai et al., 2019](https://arxiv.org/abs/1901.02860); "XL" means "extra long") modifies the architecture to reuse hidden states between segments with additional memory. The recurrent connection between segments is introduced into the model by continually using the hidden states from previous segments.

![A comparison of the training phases of the vanilla Transformer & Transformer-XL with a segment length of 4.](/posts/transformer-family-2/transformer-XL-training.png)
_A comparison of the training phases of the vanilla Transformer & Transformer-XL with a segment length of 4. (Image source: left part of Figure 2 in [Dai et al., 2019](https://arxiv.org/abs/1901.02860).)_
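The hidden-state reuse in the changed paragraph can be sketched as follows: at each layer, the previous segment's hidden states are cached (treated as constants, with no gradient flowing into them) and prepended to the current segment's keys and values. This is a hedged, single-layer sketch with illustrative names; the paper's version also needs relative positional encodings and causal masking:

```python
import numpy as np

def attend_with_memory(h, memory, W_q, W_k, W_v):
    """One attention layer with Transformer-XL style memory reuse.
    h:      (seg_len, d) hidden states of the current segment
    memory: (mem_len, d) cached hidden states of the previous segment
            (kept fixed, i.e. no gradient flows into them)."""
    context = np.concatenate([memory, h], axis=0)  # extended context
    Q = h @ W_q              # queries come from the current segment only
    K = context @ W_k        # keys/values also cover the cached memory
    V = context @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# After a segment is processed, its hidden states become the next segment's memory.
```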
@@ -638,7 +638,7 @@ _An illustration of Locality-Sensitive Hashing (LSH) attent

- (a) The attention matrix for full attention is often sparse.
- (b) Using LSH, we can sort keys and queries so that they are aligned according to their hash buckets.
-- (c) Set $\mathbf{Q} = \mathbf{K}$ (precisely, $\mathbf{k}_j = \mathbf{q}_j / \|\mathbf{q}_j\|$) so that a bucket contains equal numbers of keys and queries, which makes batching easier. Interestingly, this shared-QK configuration does not affect the performance of the Transformer.
+- (c) Set $\mathbf{Q} = \mathbf{K}$ (precisely, $\mathbf{k}_j = \mathbf{q}_j / \|\mathbf{q}_j\|$) so that a bucket contains equal numbers of keys and queries, which makes batching easier. Interestingly, this "shared-QK" configuration does not affect the performance of the Transformer.
- (d) Apply batching where chunks of $m$ consecutive queries are grouped together.

![LSH attention consists of 4 steps: bucketing, sorting, chunking, and attention computation.](/posts/transformer-family-2/LSH-attention.png)
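The bucketing step in (b) relies on random-projection LSH for angular distance: vectors with high cosine similarity are likely to land in the same hash bucket. A minimal sketch of that idea together with the shared-QK normalization from (c) — illustrative only, not Reformer's implementation:

```python
import numpy as np

def lsh_buckets(x, n_hashes=4, rng=np.random.default_rng(0)):
    """Assign each vector a hash bucket via random projections,
    as in angular LSH: bucket = argmax over [xR; -xR] directions."""
    n, d = x.shape
    R = rng.normal(size=(d, n_hashes))   # random projection directions
    rotated = x @ R                      # (n, n_hashes)
    # Concatenating [h, -h] gives 2 * n_hashes candidate buckets.
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

# Shared-QK as in (c): keys are unit-normalized queries, k_j = q_j / ||q_j||,
# so queries and keys sort into the same buckets before chunked attention.
q = np.random.randn(8, 16)
k = q / np.linalg.norm(q, axis=-1, keepdims=True)
buckets = lsh_buckets(k)
```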
@@ -825,36 +825,36 @@ $$

# References

-1. Ashish Vaswani, et al. [Attention is all you need.](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf) NIPS 2017.
-2. Rami Al-Rfou, et al. [Character-level language modeling with deeper self-attention.](https://arxiv.org/abs/1808.04444) AAAI 2019.
-3. Olah & Carter, [Attention and Augmented Recurrent Neural Networks](http://doi.org/10.23915/disti), Distill, 2016.
-4. Sainbayar Sukhbaatar, et al. [Adaptive Attention Span in Transformers](https://arxiv.org/abs/1905.07799). ACL 2019.
-5. Rewon Child, et al. [Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509) arXiv:1904.10509 (2019).
-6. Nikita Kitaev, et al. [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) ICLR 2020.
-7. Alex Graves. [Adaptive Computation Time for Recurrent Neural Networks](https://arxiv.org/abs/1603.08983) arXiv:1603.08983 (2016).
-8. Niki Parmar, et al. [Image Transformer](https://arxiv.org/abs/1802.05751) ICML 2018.
-9. Zihang Dai, et al. [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context.](https://arxiv.org/abs/1901.02860) ACL 2019.
-10. Aidan N. Gomez, et al. [The Reversible Residual Network: Backpropagation Without Storing Activations](https://arxiv.org/abs/1707.04585) NIPS 2017.
-11. Mostafa Dehghani, et al. [Universal Transformers](https://arxiv.org/abs/1807.03819) ICLR 2019.
-12. Emilio Parisotto, et al. [Stabilizing Transformers for Reinforcement Learning](https://arxiv.org/abs/1910.06764) arXiv:1910.06764 (2019).
-13. Rae et al. [Compressive Transformers for Long-Range Sequence Modelling.](https://arxiv.org/abs/1911.05507) 2019.
-14. Press et al. [Train Short, Test Long: Attention With Linear Biases Enables Input Length Extrapolation.](https://arxiv.org/abs/2108.12409) ICLR 2022.
-15. Wu et al. [DA-Transformer: Distance Aware Transformer](https://aclanthology.org/2021.naacl-main.166) 2021.
-16. Elabyad et al. [Depth-Adaptive Transformer.](https://arxiv.org/abs/1910.10073) ICLR 2020.
-17. Schuster et al. [Confident Adaptive Language Modeling](https://arxiv.org/abs/2207.07061) 2022.
-18. Qiu et al. [Blockwise self-attention for long document understanding](https://arxiv.org/abs/1911.02972) 2019.
-19. Roy et al. [Efficient Content-Based Sparse Attention with Routing Transformers.](https://arxiv.org/abs/2003.05997) 2021.
-20. Ainslie et al. [ETC: Encoding Long and Structured Inputs in Transformers.](https://aclanthology.org/2020.emnlp-main.19/) EMNLP 2020.
-21. Beltagy et al. [Longformer: The long-document transformer.](https://arxiv.org/abs/2004.05150) 2020.
-22. Zaheer et al. [Big Bird: Transformers for Longer Sequences.](https://arxiv.org/abs/2007.14062) 2020.
-23. Wang et al. [Linformer: Self-Attention with Linear Complexity.](https://arxiv.org/abs/2006.04768) arXiv preprint arXiv:2006.04768 (2020).
-24. Tay et al. [Sparse Sinkhorn Attention.](https://arxiv.org/abs/2002.11296) ICML 2020.
-25. Peng et al. [Random Feature Attention.](https://arxiv.org/abs/2103.02143) ICLR 2021.
-26. Choromanski et al. [Rethinking Attention with Performers.](https://arxiv.org/abs/2009.14794) ICLR 2021.
-27. Khandelwal et al. [Generalization through memorization: Nearest neighbor language models.](https://arxiv.org/abs/1911.00172) ICLR 2020.
-28. Yogatama et al. [Adaptive semiparametric language models.](https://arxiv.org/abs/2102.02557) ACL 2021.
-29. Wu et al. [Memorizing Transformers.](https://arxiv.org/abs/2203.08913) ICLR 2022.
-30. Su et al. [Roformer: Enhanced transformer with rotary position embedding.](https://arxiv.org/abs/2104.09864) arXiv preprint arXiv:2104.09864 (2021).
-31. Shaw et al. [Self-attention with relative position representations.](https://arxiv.org/abs/1803.02155) arXiv preprint arXiv:1803.02155 (2018).
-32. Tay et al. [Efficient Transformers: A Survey.](https://arxiv.org/abs/2009.06732) ACM Computing Surveys 55.6 (2022): 1-28.
-33. Chen et al. [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) arXiv preprint arXiv:2106.01345 (2021).
+1. Ashish Vaswani, et al. ["Attention is all you need."](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf) NIPS 2017.
+2. Rami Al-Rfou, et al. ["Character-level language modeling with deeper self-attention."](https://arxiv.org/abs/1808.04444) AAAI 2019.
+3. Olah & Carter, ["Attention and Augmented Recurrent Neural Networks"](http://doi.org/10.23915/disti), Distill, 2016.
+4. Sainbayar Sukhbaatar, et al. ["Adaptive Attention Span in Transformers"](https://arxiv.org/abs/1905.07799). ACL 2019.
+5. Rewon Child, et al. ["Generating Long Sequences with Sparse Transformers"](https://arxiv.org/abs/1904.10509) arXiv:1904.10509 (2019).
+6. Nikita Kitaev, et al. ["Reformer: The Efficient Transformer"](https://arxiv.org/abs/2001.04451) ICLR 2020.
+7. Alex Graves. ["Adaptive Computation Time for Recurrent Neural Networks"](https://arxiv.org/abs/1603.08983) arXiv:1603.08983 (2016).
+8. Niki Parmar, et al. ["Image Transformer"](https://arxiv.org/abs/1802.05751) ICML 2018.
+9. Zihang Dai, et al. ["Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context."](https://arxiv.org/abs/1901.02860) ACL 2019.
+10. Aidan N. Gomez, et al. ["The Reversible Residual Network: Backpropagation Without Storing Activations"](https://arxiv.org/abs/1707.04585) NIPS 2017.
+11. Mostafa Dehghani, et al. ["Universal Transformers"](https://arxiv.org/abs/1807.03819) ICLR 2019.
+12. Emilio Parisotto, et al. ["Stabilizing Transformers for Reinforcement Learning"](https://arxiv.org/abs/1910.06764) arXiv:1910.06764 (2019).
+13. Rae et al. ["Compressive Transformers for Long-Range Sequence Modelling."](https://arxiv.org/abs/1911.05507) 2019.
+14. Press et al. ["Train Short, Test Long: Attention With Linear Biases Enables Input Length Extrapolation."](https://arxiv.org/abs/2108.12409) ICLR 2022.
+15. Wu et al. ["DA-Transformer: Distance Aware Transformer"](https://aclanthology.org/2021.naacl-main.166) 2021.
+16. Elabyad et al. ["Depth-Adaptive Transformer."](https://arxiv.org/abs/1910.10073) ICLR 2020.
+17. Schuster et al. ["Confident Adaptive Language Modeling"](https://arxiv.org/abs/2207.07061) 2022.
+18. Qiu et al. ["Blockwise self-attention for long document understanding"](https://arxiv.org/abs/1911.02972) 2019.
+19. Roy et al. ["Efficient Content-Based Sparse Attention with Routing Transformers."](https://arxiv.org/abs/2003.05997) 2021.
+20. Ainslie et al. ["ETC: Encoding Long and Structured Inputs in Transformers."](https://aclanthology.org/2020.emnlp-main.19/) EMNLP 2020.
+21. Beltagy et al. ["Longformer: The long-document transformer."](https://arxiv.org/abs/2004.05150) 2020.
+22. Zaheer et al. ["Big Bird: Transformers for Longer Sequences."](https://arxiv.org/abs/2007.14062) 2020.
+23. Wang et al. ["Linformer: Self-Attention with Linear Complexity."](https://arxiv.org/abs/2006.04768) arXiv preprint arXiv:2006.04768 (2020).
+24. Tay et al. ["Sparse Sinkhorn Attention."](https://arxiv.org/abs/2002.11296) ICML 2020.
+25. Peng et al. ["Random Feature Attention."](https://arxiv.org/abs/2103.02143) ICLR 2021.
+26. Choromanski et al. ["Rethinking Attention with Performers."](https://arxiv.org/abs/2009.14794) ICLR 2021.
+27. Khandelwal et al. ["Generalization through memorization: Nearest neighbor language models."](https://arxiv.org/abs/1911.00172) ICLR 2020.
+28. Yogatama et al. ["Adaptive semiparametric language models."](https://arxiv.org/abs/2102.02557) ACL 2021.
+29. Wu et al. ["Memorizing Transformers."](https://arxiv.org/abs/2203.08913) ICLR 2022.
+30. Su et al. ["Roformer: Enhanced transformer with rotary position embedding."](https://arxiv.org/abs/2104.09864) arXiv preprint arXiv:2104.09864 (2021).
+31. Shaw et al. ["Self-attention with relative position representations."](https://arxiv.org/abs/1803.02155) arXiv preprint arXiv:1803.02155 (2018).
+32. Tay et al. ["Efficient Transformers: A Survey."](https://arxiv.org/abs/2009.06732) ACM Computing Surveys 55.6 (2022): 1-28.
+33. Chen et al. ["Decision Transformer: Reinforcement Learning via Sequence Modeling"](https://arxiv.org/abs/2106.01345) arXiv preprint arXiv:2106.01345 (2021).

app/(private)/china/page.md

Lines changed: 3 additions & 3 deletions
@@ -58,7 +58,7 @@ _Figure 1. An illustration of multi-head scaled dot-product attention. (Image source: [Vaswa

## Transformer

-The **Transformer** model (referred to as the vanilla Transformer to distinguish it from other enhanced versions; [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)) uses an encoder-decoder architecture, as commonly used in many [NMT] models. Later, the decoder-only Transformer was shown to perform well on language modeling tasks, as in [GPT and BERT].
+The **Transformer** model (referred to as the "vanilla Transformer" to distinguish it from other enhanced versions; [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)) uses an encoder-decoder architecture, as commonly used in many [NMT] models. Later, the decoder-only Transformer was shown to perform well on language modeling tasks, as in [GPT and BERT].

**Encoder-Decoder Architecture**
@@ -156,7 +156,7 @@ $$ s_t = \sum_{n=1}^{N(t)} p_t^n s_t^n, \quad y_t = \sum_{n=1}^{N(t)} p_t^n y_t^
- Given no or only thin context, it is hard to predict the first few tokens of each segment.
- Evaluation is expensive. Whenever the segment shifts right by one position, the new segment has to be re-processed from scratch, even though many tokens overlap.

-**Transformer-XL** ([Dai et al., 2019](https://arxiv.org/abs/1901.02860); “XL” means “extra long”) solves the context fragmentation problem with two main modifications:
+**Transformer-XL** ([Dai et al., 2019](https://arxiv.org/abs/1901.02860); "XL" means "extra long") solves the context fragmentation problem with two main modifications:

1. Reuse hidden states across segments.
2. Adopt a new positional encoding suited to the reused states.
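Modification 2 above exists because absolute position indices become ambiguous once hidden states from a previous segment are reused; what stays meaningful is the relative distance $i - j$ between query $i$ and key $j$. The following is a hedged, simplified sketch in the spirit of relative position representations (Shaw et al., 2018, reference 31); Transformer-XL's actual formulation decomposes the attention score into more terms, and `rel_emb` is an illustrative name:

```python
import numpy as np

def relative_attention_scores(Q, K, rel_emb):
    """Content-based score plus a relative-position bias.
    rel_emb[d] embeds the clipped relative distance d = i - j."""
    n_q, n_k = Q.shape[0], K.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    max_dist = rel_emb.shape[0] - 1
    for i in range(n_q):
        for j in range(n_k):
            d = min(max(i - j, 0), max_dist)    # clip distance to the table size
            scores[i, j] += Q[i] @ rel_emb[d]   # position-dependent bias term
    return scores
```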
@@ -361,7 +361,7 @@ _Figure 11. An illustration of Locality-Sensitive Hashing (LSH) attention. (Image source: [Kit

- (a) The attention matrix for full attention is often sparse.
- (b) Using LSH, we can sort and align keys and queries according to their hash buckets.
-- (c) Set $$ \mathbf{Q} = \mathbf{K} $$ (precisely, $$ \mathbf{k}\_j = \mathbf{q}\_j / \|\mathbf{q}\_j\| $$) so that a bucket contains equal numbers of keys and queries, which makes batching easier. Interestingly, this shared-QK configuration does not affect the Transformer's performance.
+- (c) Set $$ \mathbf{Q} = \mathbf{K} $$ (precisely, $$ \mathbf{k}\_j = \mathbf{q}\_j / \|\mathbf{q}\_j\| $$) so that a bucket contains equal numbers of keys and queries, which makes batching easier. Interestingly, this "shared-QK" configuration does not affect the Transformer's performance.
- (d) Apply batching where $$ m $$ consecutive queries are grouped together.

![LSH attention](/img/transformer/LSH-attention.png)
