# Transformer Basics
The **Transformer** model (referred to as the "vanilla Transformer" to distinguish it from other improved versions; [Vaswani, et al., 2017](https://arxiv.org/abs/1706.03762)) has an encoder-decoder architecture, as commonly used in many [NMT](https://lilianweng.github.io/posts/2018-06-24-attention/#born-for-translation) models. Later, simplified Transformers were shown to achieve great performance in language modeling tasks, such as the encoder-only [BERT](https://lilianweng.github.io/posts/2019-01-31-lm/#bert) or the decoder-only [GPT](https://lilianweng.github.io/posts/2019-01-31-lm/#openai-gpt).

## Attention and Self-Attention

Vanilla Transformer has a fixed and limited attention span, which causes several issues:
- It is hard to predict the first few tokens in each segment, given no or little context.
- Evaluation is expensive. Whenever the segment is shifted to the right by one, the new segment has to be re-processed from scratch, even though there are many overlapping tokens.

<a id="transformer-xl"></a>**Transformer-XL** ([Dai et al., 2019](https://arxiv.org/abs/1901.02860); "XL" means "extra long") modifies the architecture to reuse hidden states between segments with an additional memory. The recurrent connection between segments is introduced into the model by continuously using the hidden states from previous segments.


_A comparison of the training phase of vanilla Transformer & Transformer-XL with a segment length of 4. (Image source: left part of Figure 2 in [Dai et al., 2019](https://arxiv.org/abs/1901.02860).)_
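The segment-level recurrence above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the causal mask and Transformer-XL's relative positional encoding are omitted, and all dimensions and weights are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
seg_len, mem_len, d_model = 4, 4, 8  # assumed toy sizes

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_memory(h_seg, memory, Wq, Wk, Wv):
    """Queries come from the current segment only; keys/values span
    [memory; current segment], so attention reaches into the previous
    segment without re-processing it."""
    ctx = np.concatenate([memory, h_seg], axis=0)   # (mem+seg, d)
    q, k, v = h_seg @ Wq, ctx @ Wk, ctx @ Wv
    scores = q @ k.T / np.sqrt(d_model)             # (seg, mem+seg)
    return softmax(scores) @ v                      # (seg, d)

Wq, Wk, Wv = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3)]
memory = np.zeros((mem_len, d_model))               # empty cache at the start

outputs = []
for _ in range(3):                                  # three consecutive segments
    h_seg = rng.standard_normal((seg_len, d_model))
    outputs.append(attend_with_memory(h_seg, memory, Wq, Wk, Wv))
    # Cache this segment's hidden states for the next one; in the real
    # model the cached states receive no gradient (stop-gradient).
    memory = h_seg[-mem_len:].copy()
```

Because the memory is cached rather than recomputed, evaluation no longer has to re-process a whole segment each time the window slides.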
_Illustration of Locality-Sensitive Hashing (LSH) attention:_

- (a) The attention matrix for full attention is often sparse.
- (b) Using LSH, keys and queries can be sorted to be aligned according to their hash buckets.
- (c) Set $\mathbf{Q} = \mathbf{K}$ (precisely, $\mathbf{k}_j = \mathbf{q}_j / |\mathbf{q}_j|$), so that there are equal numbers of keys and queries in each bucket, which makes batching easier. Interestingly, this "shared-QK" configuration does not hurt the performance of the Transformer.
- (d) Apply batching where chunks of $m$ consecutive queries are grouped together.

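The bucketing in steps (b)–(c) can be sketched with random-projection (angular) LSH. This is a minimal sketch under assumed sizes, not the Reformer codebase: only the shared-QK normalization and the bucket sorting are shown, and the per-chunk attention of step (d) is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, n_buckets = 16, 8, 4          # assumed toy sizes

q = rng.standard_normal((n, d))
# Shared-QK: k_j = q_j / |q_j|, so keys and queries share hash buckets.
k = q / np.linalg.norm(q, axis=1, keepdims=True)

# Angular LSH via random projections: hash a vector to the index of its
# largest projection among the columns of [R, -R]; nearby vectors collide.
R = rng.standard_normal((d, n_buckets // 2))

def lsh_bucket(x):
    proj = x @ R                                            # (n, n_buckets/2)
    return np.argmax(np.concatenate([proj, -proj], axis=1), axis=1)

buckets = lsh_bucket(k)
# Sorting by bucket makes same-bucket tokens contiguous, so attention can
# then run over fixed-size chunks instead of the full n x n matrix.
order = np.argsort(buckets, kind="stable")
sorted_buckets = buckets[order]
```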
644 | 644 |  |
|

# References

1. Ashish Vaswani, et al. ["Attention is all you need."](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf) NIPS 2017.
2. Rami Al-Rfou, et al. ["Character-level language modeling with deeper self-attention."](https://arxiv.org/abs/1808.04444) AAAI 2019.
3. Olah & Carter, ["Attention and Augmented Recurrent Neural Networks"](http://doi.org/10.23915/disti), Distill, 2016.
4. Sainbayar Sukhbaatar, et al. ["Adaptive Attention Span in Transformers"](https://arxiv.org/abs/1905.07799) ACL 2019.
5. Rewon Child, et al. ["Generating Long Sequences with Sparse Transformers"](https://arxiv.org/abs/1904.10509) arXiv:1904.10509 (2019).
6. Nikita Kitaev, et al. ["Reformer: The Efficient Transformer"](https://arxiv.org/abs/2001.04451) ICLR 2020.
7. Alex Graves. ["Adaptive Computation Time for Recurrent Neural Networks"](https://arxiv.org/abs/1603.08983) arXiv:1603.08983 (2016).
8. Niki Parmar, et al. ["Image Transformer"](https://arxiv.org/abs/1802.05751) ICML 2018.
9. Zihang Dai, et al. ["Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context."](https://arxiv.org/abs/1901.02860) ACL 2019.
10. Aidan N. Gomez, et al. ["The Reversible Residual Network: Backpropagation Without Storing Activations"](https://arxiv.org/abs/1707.04585) NIPS 2017.
11. Mostafa Dehghani, et al. ["Universal Transformers"](https://arxiv.org/abs/1807.03819) ICLR 2019.
12. Emilio Parisotto, et al. ["Stabilizing Transformers for Reinforcement Learning"](https://arxiv.org/abs/1910.06764) arXiv:1910.06764 (2019).
13. Rae et al. ["Compressive Transformers for Long-Range Sequence Modelling."](https://arxiv.org/abs/1911.05507) 2019.
14. Press et al. ["Train Short, Test Long: Attention With Linear Biases Enables Input Length Extrapolation."](https://arxiv.org/abs/2108.12409) ICLR 2022.
15. Wu, et al. ["DA-Transformer: Distance Aware Transformer"](https://aclanthology.org/2021.naacl-main.166) NAACL 2021.
16. Elabyad et al. ["Depth-Adaptive Transformer."](https://arxiv.org/abs/1910.10073) ICLR 2020.
17. Schuster et al. ["Confident Adaptive Language Modeling"](https://arxiv.org/abs/2207.07061) 2022.
18. Qiu et al. ["Blockwise self-attention for long document understanding"](https://arxiv.org/abs/1911.02972) 2019.
19. Roy et al. ["Efficient Content-Based Sparse Attention with Routing Transformers."](https://arxiv.org/abs/2003.05997) 2021.
20. Ainslie et al. ["ETC: Encoding Long and Structured Inputs in Transformers."](https://aclanthology.org/2020.emnlp-main.19/) EMNLP 2020.
21. Beltagy et al. ["Longformer: The long-document transformer."](https://arxiv.org/abs/2004.05150) 2020.
22. Zaheer et al. ["Big Bird: Transformers for Longer Sequences."](https://arxiv.org/abs/2007.14062) 2020.
23. Wang et al. ["Linformer: Self-Attention with Linear Complexity."](https://arxiv.org/abs/2006.04768) arXiv:2006.04768 (2020).
24. Tay et al. ["Sparse Sinkhorn Attention."](https://arxiv.org/abs/2002.11296) ICML 2020.
25. Peng et al. ["Random Feature Attention."](https://arxiv.org/abs/2103.02143) ICLR 2021.
26. Choromanski et al. ["Rethinking Attention with Performers."](https://arxiv.org/abs/2009.14794) ICLR 2021.
27. Khandelwal et al. ["Generalization through memorization: Nearest neighbor language models."](https://arxiv.org/abs/1911.00172) ICLR 2020.
28. Yogatama et al. ["Adaptive semiparametric language models."](https://arxiv.org/abs/2102.02557) ACL 2021.
29. Wu et al. ["Memorizing Transformers."](https://arxiv.org/abs/2203.08913) ICLR 2022.
30. Su et al. ["Roformer: Enhanced transformer with rotary position embedding."](https://arxiv.org/abs/2104.09864) arXiv:2104.09864 (2021).
31. Shaw et al. ["Self-attention with relative position representations."](https://arxiv.org/abs/1803.02155) arXiv:1803.02155 (2018).
32. Tay et al. ["Efficient Transformers: A Survey."](https://arxiv.org/abs/2009.06732) ACM Computing Surveys 55.6 (2022): 1-28.
33. Chen et al. ["Decision Transformer: Reinforcement Learning via Sequence Modeling"](https://arxiv.org/abs/2106.01345) arXiv:2106.01345 (2021).