azain47/minimal-gpt

Minimal GPT

UPDATE-1, Aug 6

  • Added support for Rotary Positional Embeddings (RoPE), which improved performance on poorly tokenized languages.
  • Replaced LayerNorm with RMSNorm; the difference in both compute and performance seems minimal.
  • Added a dataset preprocessor/generator that tokenizes a dataset, shuffles it, and writes the train and val splits to pickle binaries.
  • A better dataloader that shuffles the dataset after each epoch.*
  • Also added an inference script, which currently supports Top-P and Top-K generation strategies.
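
To make the Top-K and Top-P strategies concrete, here is a minimal pure-Python sketch of how a next-token distribution can be filtered before sampling. It is not the repo's inference script: the function names (`softmax`, `top_k_top_p_filter`, `sample_next_token`) and the choice to apply Top-K first and then Top-P are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Return a renormalized {token_id: prob} dict after Top-K, then Top-P.

    top_k=0 disables Top-K; top_p=1.0 disables Top-P.
    """
    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-P (nucleus): keep the smallest prefix whose cumulative mass
    # reaches top_p. Mass is measured on the original probabilities
    # (an assumption of this sketch; other variants renormalize first).
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_next_token(logits, top_k=0, top_p=1.0, rng=random):
    """Sample one token id from the filtered distribution."""
    filtered = top_k_top_p_filter(softmax(logits), top_k=top_k, top_p=top_p)
    ids = list(filtered)
    weights = [filtered[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]
```

With `top_k=1` this reduces to greedy decoding; raising `top_k` or `top_p` widens the set of candidate tokens and makes generation more varied.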

epoch*: Here an epoch means one full pass through the dataset. If your dataset has 131072 tokens and each training step processes 32768 tokens, then 131072 / 32768 = 4, so 4 training steps equal 1 epoch.
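
The per-epoch reshuffling and the step arithmetic in the note above can be sketched as follows. This is an illustrative stand-in and not the repo's dataloader: `epoch_batches` and its parameters are hypothetical names, and dropping the last partial batch is an assumption.

```python
import random


def epoch_batches(sequences, batch_size, num_epochs, seed=0):
    """Yield (epoch, batch) pairs, reshuffling the sequence order every epoch."""
    rng = random.Random(seed)
    order = list(range(len(sequences)))
    for epoch in range(num_epochs):
        rng.shuffle(order)  # new order at the start of each epoch
        # Drop the last partial batch so every step sees batch_size sequences.
        for start in range(0, len(order) - batch_size + 1, batch_size):
            idx = order[start:start + batch_size]
            yield epoch, [sequences[i] for i in idx]


# Steps per epoch, using the numbers from the note above.
dataset_tokens = 131072
tokens_per_step = 32768
steps_per_epoch = dataset_tokens // tokens_per_step  # 4 steps = 1 epoch
```

Every sequence is visited exactly once per epoch, but in a different order each time, so consecutive epochs do not replay identical batches.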

Features

  • Keeps most of the original hyperparameters from the paper.

  • Does not share weights between the LM head and the token embeddings (hence the increased model size of 163M params).

  • Uses the toy Shakespeare dataset.

  • Gradient accumulation targets approximately 32k tokens per optimizer step instead of the 0.5M stated in the paper.

  • Optimizations done:

    • Flash Attention instead of the usual attention calculation (10x speedup)
    • autocasting and reduced matmul precision; my GPU (RTX 3060) supports bfloat16
    • changing params such as vocab size and batch size to a multiple of 2
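
As a rough illustration of the gradient accumulation item above, the sketch below computes how many micro-batches are needed to reach about 32k tokens, then checks on a toy one-parameter model that accumulating scaled micro-batch gradients matches the full-batch gradient. The micro-batch size (4) and sequence length (1024) are hypothetical values, not taken from the repo, and the toy MSE model only stands in for the real network.

```python
def accumulation_steps(target_tokens, micro_batch_size, seq_len):
    """Number of micro-batches to accumulate to reach roughly target_tokens."""
    tokens_per_micro_batch = micro_batch_size * seq_len
    return max(1, target_tokens // tokens_per_micro_batch)


def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, steps):
    """Accumulate gradients over `steps` equal micro-batches.

    Each micro-batch gradient is scaled by 1/steps so the sum equals the
    full-batch mean gradient. Assumes len(xs) is divisible by steps.
    """
    size = len(xs) // steps
    total = 0.0
    for s in range(steps):
        lo, hi = s * size, (s + 1) * size
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


# Hypothetical setup: 4 sequences x 1024 tokens per micro-batch.
# 32768 // (4 * 1024) = 8 accumulation steps per optimizer step.
steps = accumulation_steps(32768, micro_batch_size=4, seq_len=1024)
```

The 1/steps scaling is the detail that is easy to miss: without it, the accumulated gradient is the sum of the micro-batch means and the effective learning rate grows with the number of accumulation steps.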

Also includes code for a Bigram Language Model, as well as an attention variant of it (gpt).
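
For readers unfamiliar with the idea, a bigram language model predicts the next token from the previous token alone. The sketch below is a count-based, character-level version written for illustration; it is an assumption that it differs from the repo's implementation, which may well be a learned (neural) model. The names `train_bigram` and `generate` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, length, seed=0):
    """Sample up to `length` characters, each conditioned only on the previous one."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:  # dead end: this character was never seen with a successor
            break
        chars = list(followers)
        weights = [followers[c] for c in chars]
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)
```

Because the model sees only one character of context, its output is locally plausible but quickly loses coherence, which is the limitation that attention over a longer context addresses.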
