This repository contains a from-scratch implementation of a GPT-style language model built using Python and PyTorch. The goal of this project is to deeply understand how GPT models work internally by implementing every core component manually — from tokenization to training and text generation — without relying on high-level transformer libraries.
GPT (Generative Pre-trained Transformer) is an autoregressive, decoder-only transformer model trained to predict the next token given a sequence of previous tokens. This project implements the complete GPT pipeline:
- Raw text → tokens
- Tokens → embeddings
- Embeddings → transformer blocks
- Transformer outputs → logits
- Logits → next-token prediction
```
Input Text
   ↓
Tokenization (Text → Token IDs)
   ↓
Token Embedding + Positional Embedding
   ↓
N × Transformer Decoder Blocks
│  ├─ LayerNorm
│  ├─ Masked Multi-Head Self-Attention
│  ├─ Residual Connection
│  ├─ Feed-Forward Network
│  └─ Residual Connection
   ↓
Final LayerNorm
   ↓
Linear Projection (Vocabulary Size)
   ↓
Softmax → Next Token Probabilities
```
Tokenization is the process of converting raw text into a sequence of integer token IDs that the model can process. The vocabulary is built as follows:
- The entire training corpus is scanned
- Unique characters or subwords are collected
- Each unique unit is assigned a unique integer ID
Example (character-level vocabulary with letters indexed alphabetically):

```
Vocabulary = {
    "a": 0,
    "b": 1,
    "c": 2,
    ...
    "z": 25,
    "<EOS>": 26
}
```

Encoding the input text "hello":

```
h → 7
e → 4
l → 11
l → 11
o → 14

Result: [7, 4, 11, 11, 14]
```

The inverse mapping (decoding) converts model outputs back to readable text:

```
[7, 4, 11, 11, 14] → "hello"
```
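A minimal character-level tokenizer sketch (the names `stoi`, `itos`, `encode`, and `decode` are illustrative, not necessarily the repository's exact API):

```python
# Build the vocabulary from the unique characters in a tiny corpus.
text = "hello world"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> token ID
itos = {i: ch for ch, i in stoi.items()}       # token ID -> character

def encode(s):
    """Convert raw text into a list of integer token IDs."""
    return [stoi[c] for c in s]

def decode(ids):
    """Convert a list of token IDs back into readable text."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)          # [3, 2, 4, 4, 5] for this tiny corpus
print(decode(ids))  # "hello"
```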
To turn the token stream into training examples:
- A sliding window is used over token sequences
- The input tokens in each window predict the next token at each position

Example:

```
Input : [h, e, l, l]
Target: [e, l, l, o]
```
This creates the autoregressive learning objective.
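A sketch of building such input-target pairs with PyTorch, assuming `token_ids` holds an encoded text and `block_size` is the context length (both names are illustrative):

```python
import torch

token_ids = torch.tensor([7, 4, 11, 11, 14, 3, 22, 14, 17, 11, 3])  # encoded text
block_size = 4                                                        # context length

inputs, targets = [], []
for i in range(len(token_ids) - block_size):
    inputs.append(token_ids[i : i + block_size])            # window, e.g. [h, e, l, l]
    targets.append(token_ids[i + 1 : i + block_size + 1])   # same window shifted by one, e.g. [e, l, l, o]

x = torch.stack(inputs)    # shape: [num_windows, block_size]
y = torch.stack(targets)   # shape: [num_windows, block_size]
```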
Neural networks operate on vectors, not integers. Embeddings convert token IDs into dense vectors.
- A learnable lookup table of shape `[vocab_size, embedding_dim]`
- Each token ID maps to a dense vector
Example:
Token ID: 14 → Vector: [0.12, -0.34, 0.88, ...]
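In PyTorch this lookup table corresponds to `nn.Embedding`; a minimal sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 50, 64                  # illustrative sizes
token_embedding = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([[7, 4, 11, 11, 14]])      # [batch_size, sequence_length]
token_vectors = token_embedding(token_ids)          # [batch_size, sequence_length, embedding_dim]
print(token_vectors.shape)                          # torch.Size([1, 5, 64])
```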
Transformers do not inherently understand order. Positional embeddings inject sequence information.
- Learnable position vectors
- Added element-wise to token embeddings
Final Embedding = Token Embedding + Positional Embedding
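A minimal sketch of adding learnable positional embeddings to the token embeddings (sizes are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, block_size = 50, 64, 16   # illustrative sizes
token_embedding = nn.Embedding(vocab_size, embedding_dim)
pos_embedding = nn.Embedding(block_size, embedding_dim)

token_ids = torch.tensor([[7, 4, 11, 11, 14]])        # [batch_size, sequence_length]
positions = torch.arange(token_ids.shape[1])           # [0, 1, 2, 3, 4]
x = token_embedding(token_ids) + pos_embedding(positions)  # broadcast over the batch dimension
print(x.shape)                                          # torch.Size([1, 5, 64])
```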
Attention allows the model to dynamically focus on relevant parts of the input sequence.
Each token computes three vectors:
- Query (Q)
- Key (K)
- Value (V)
They are obtained via linear projections:
Q = X · Wq
K = X · Wk
V = X · Wv
Attention(Q, K, V) = softmax((Q · Kᵀ) / √d_k) · V
This computes how much each token should attend to every other token.
GPT uses masked self-attention:
- Tokens can only attend to previous tokens
- Future tokens are masked using an upper triangular mask
This enforces autoregressive behavior.
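A single-head sketch that combines the attention formula with the causal mask (tensor sizes are illustrative, not necessarily the repository's exact code):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

embedding_dim, head_dim, seq_len = 64, 64, 5          # illustrative sizes
Wq = nn.Linear(embedding_dim, head_dim, bias=False)   # query projection
Wk = nn.Linear(embedding_dim, head_dim, bias=False)   # key projection
Wv = nn.Linear(embedding_dim, head_dim, bias=False)   # value projection

x = torch.randn(1, seq_len, embedding_dim)            # [batch, seq_len, embedding_dim]
q, k, v = Wq(x), Wk(x), Wv(x)

# Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V, with future positions masked out.
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)             # [batch, seq_len, seq_len]
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))                   # upper triangle = future tokens
weights = F.softmax(scores, dim=-1)                                # each row sums to 1
out = weights @ v                                                  # [batch, seq_len, head_dim]
```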
Instead of a single attention operation, GPT uses multiple heads.
- Input is split into multiple subspaces
- Each head performs self-attention independently
- Outputs are concatenated and projected
MultiHead(Q, K, V) = Concat(head₁, head₂, ..., headₙ) · Wo
This allows the model to capture different types of relationships simultaneously.
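A sketch of masked multi-head self-attention as a single module; the fused QKV projection is a common implementation choice and may differ from the repository's code:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embedding_dim, num_heads):
        super().__init__()
        assert embedding_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embedding_dim // num_heads
        self.qkv = nn.Linear(embedding_dim, 3 * embedding_dim)  # fused Q, K, V projections
        self.proj = nn.Linear(embedding_dim, embedding_dim)     # output projection Wo

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split the embedding into num_heads subspaces: [B, num_heads, T, head_dim].
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        # Each head attends independently, with a causal mask over future positions.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        # Concatenate the heads and project back to embedding_dim.
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
```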
Each transformer decoder block contains:
- Layer normalization: normalizes activations and improves training stability
- Masked self-attention: computes contextual representations using causal masking
- Residual connections: X = X + Attention(X) and X = X + FeedForward(X); residuals help with gradient flow in deep networks
- Feed-forward network: a two-layer MLP applied independently to each token, FFN(x) = GELU(x · W1) · W2
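A pre-norm decoder block sketch matching the structure above, reusing the `MultiHeadSelfAttention` class from the previous sketch; the 4× hidden expansion in the feed-forward network is the conventional choice and assumed here:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embedding_dim, num_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(embedding_dim)
        self.attn = MultiHeadSelfAttention(embedding_dim, num_heads)
        self.ln2 = nn.LayerNorm(embedding_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embedding_dim, 4 * embedding_dim),   # expand
            nn.GELU(),
            nn.Linear(4 * embedding_dim, embedding_dim),   # project back
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual connection around attention
        x = x + self.ffn(self.ln2(x))    # residual connection around the feed-forward network
        return x
```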
GPT uses a decoder-only transformer architecture:
- No encoder
- No cross-attention
- Only masked self-attention
Key characteristics:
- Autoregressive generation
- Left-to-right token prediction
- Same transformer block repeated N times
This design is optimized for language modeling and text generation.
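Putting the pieces together, a decoder-only model sketch reusing the `TransformerBlock` class above (hyperparameter values are illustrative):

```python
import torch
import torch.nn as nn

class GPT(nn.Module):
    def __init__(self, vocab_size, embedding_dim=64, num_heads=4, num_layers=4, block_size=128):
        super().__init__()
        self.block_size = block_size
        self.token_emb = nn.Embedding(vocab_size, embedding_dim)
        self.pos_emb = nn.Embedding(block_size, embedding_dim)
        self.blocks = nn.Sequential(
            *[TransformerBlock(embedding_dim, num_heads) for _ in range(num_layers)]
        )
        self.ln_f = nn.LayerNorm(embedding_dim)              # final LayerNorm
        self.lm_head = nn.Linear(embedding_dim, vocab_size)  # projection to vocabulary logits

    def forward(self, token_ids):
        B, T = token_ids.shape
        positions = torch.arange(T, device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        x = self.blocks(x)            # N stacked decoder blocks
        x = self.ln_f(x)
        return self.lm_head(x)        # [B, T, vocab_size] logits
```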
The forward pass and training procedure:
- Input token batch → embeddings
- Pass through the transformer blocks
- Output logits of shape `[batch_size, sequence_length, vocab_size]`
- Cross-entropy loss between predicted logits and target tokens
- Targets are shifted by one token
- Compute gradients using autograd
- Update parameters using Adam optimizer
- Loop over dataset for multiple epochs
- Periodically log loss
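A minimal training-loop sketch following these steps, reusing the `GPT` class above; `get_batch` is a hypothetical helper and the random corpus below only stands in for real tokenized data:

```python
import torch
import torch.nn.functional as F

data = torch.randint(0, 50, (10_000,))       # stand-in for the tokenized training corpus

def get_batch(batch_size=8, block_size=32):
    # Hypothetical helper: sample random sliding windows of (input, shifted target) pairs.
    starts = torch.randint(0, len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[s : s + block_size] for s in starts.tolist()])
    y = torch.stack([data[s + 1 : s + block_size + 1] for s in starts.tolist()])
    return x, y

model = GPT(vocab_size=50)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

for step in range(1000):
    x, y = get_batch()                        # targets are inputs shifted by one token
    logits = model(x)                         # [batch_size, block_size, vocab_size]
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad()
    loss.backward()                           # compute gradients with autograd
    optimizer.step()                          # update parameters with Adam
    if step % 100 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```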
During inference:
- Provide an initial prompt
- Predict next token probabilities
- Sample or take argmax
- Append token to input
- Repeat until max length or end token
This produces coherent, autoregressive text output.
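A sketch of that generation loop, assuming a model like the `GPT` class above; multinomial sampling is shown, with argmax as the greedy alternative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=50, eos_id=None):
    """Autoregressively predict, sample, and append one token at a time."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        context = ids[:, -model.block_size:]                  # keep only the context window
        logits = model(context)[:, -1, :]                     # logits for the last position
        probs = F.softmax(logits, dim=-1)                     # next-token probabilities
        next_id = torch.multinomial(probs, num_samples=1)     # sample (or torch.argmax for greedy)
        ids = torch.cat([ids, next_id], dim=1)                # append and repeat
        if eos_id is not None and next_id.item() == eos_id:
            break
    return ids

prompt = torch.tensor([[7, 4, 11, 11, 14]])    # an encoded prompt, e.g. "hello"
output_ids = generate(model, prompt, max_new_tokens=20)
```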
This repository demonstrates a complete GPT-style language model built from first principles, covering:
- Tokenization and data preparation
- Embedding and positional encoding
- Masked multi-head self-attention
- Transformer decoder blocks
- Decoder-only GPT architecture
- Training and text generation