Reinforcement Learning from Human Feedback (RLHF): End-to-End RLHF Pipeline — from Pretraining to PPO
- Model stack: minGPT, GPT-2, custom transformer architectures
- Datasets: TinyStories, OpenAI Summarize TL;DR, CarperAI/openai_summarize_comparisons
- Training pipeline: pretraining, supervised fine-tuning (SFT), RL fine-tuning
- RL methods: vanilla policy gradient, PPO, KL-divergence penalty, GAE (gamma=1, lambda=0.95)
- Reward modeling: learned reward model, scalar reward head, Bradley-Terry pairwise ranking loss
- Deployment: Flask web interface, React visualization front-end
- Concepts: reward modeling, preference learning, KL-regularized RL, text summarization, sentiment steering
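The GAE settings listed above (gamma=1, lambda=0.95) can be illustrated with a minimal sketch. This is not the repository's code; the function name and array layout are assumptions:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one finished trajectory.

    rewards[t]: reward received after token/action t
    values:     value estimates per state, length len(rewards) + 1;
                values[-1] is the bootstrap value (0 at a terminal state)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD error at time t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With gamma=1 and lambda=0.95, each token's advantage is a 0.95-decayed sum of the remaining TD errors along the sequence.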
This repository contains an RLHF project by meghanaNanuvala. It includes two main parts:
- `RLHF-Part1`: a full three-stage RLHF pipeline implemented from scratch (pretraining, supervised fine-tuning, and RL fine-tuning), with core model training and inference workflows.
- `RLHF-Visualizer`: a React-based visualization app that explains and inspects the RLHF process, showing how the components connect and how the training pipeline behaves.
The goal is to maintain this as a personal RLHF project by meghanaNanuvala, with clean branding and clear documentation.
RLHF-Part1 is the core model project. It includes:
- `mingpt/`: minimal GPT model implementation used for training and inference.
- `happy_gpt/`: a Flask web app that loads both a pre-trained model and an RL fine-tuned model.
- `chargpt/`: additional RL or model-related scripts.
- `summarize_rlhf/`: tools for summarization and reward model evaluation.
Key behavior:
- Implements a complete 3-stage RLHF pipeline on TinyStories: pretraining → supervised fine-tuning → RL fine-tuning.
- Uses vanilla policy gradient with a KL-divergence penalty against a frozen reference model to steer generation toward positive sentiment based on VADER compound score.
- Builds a PPO optimization loop for text summarization on the OpenAI Summarize TL;DR dataset, with a GPT-2 architecture (12-layer, 768-dim), clipped surrogate objective, GAE advantage estimation (gamma=1, lambda=0.95), and a separate value function.
- Trains a learned reward model from human preference comparisons (`CarperAI/openai_summarize_comparisons`) by fine-tuning a GPT-2 transformer with a scalar reward head and a Bradley-Terry pairwise ranking loss.
- Uses the learned reward model to drive PPO optimization with KL-regularized rewards.
- Serves a Flask web app from `happy_gpt` where users can compare output from the pre-trained and RL fine-tuned story models.
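Two of the components above, the Bradley-Terry reward-model loss and the KL-regularized per-token reward, can be sketched as follows. This is an illustrative sketch, not the repository's actual code; the function names, the `beta` coefficient, and placing the task reward on the final token are assumptions:

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise ranking loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    diff = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # log(1 + exp(-x)) computed stably via logaddexp
    return np.logaddexp(0.0, -diff).mean()

def kl_penalized_rewards(logp_policy, logp_ref, task_reward, beta=0.1):
    """Per-token rewards for KL-regularized RL fine-tuning.

    logp_policy[t]: log-prob of sampled token t under the current policy
    logp_ref[t]:    log-prob of the same token under the frozen reference model
    task_reward:    scalar sequence-level reward (e.g. VADER compound score in
                    the sentiment stage, reward-model output in summarization)
    """
    kl_term = np.asarray(logp_policy, dtype=float) - np.asarray(logp_ref, dtype=float)
    rewards = -beta * kl_term    # penalize drifting from the reference model
    rewards[-1] += task_reward   # sequence-level reward lands on the final token
    return rewards
```

With this shaping, the policy is pushed toward higher task reward while `beta` controls how far its generations may drift from the frozen reference model.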
RLHF-Visualizer is a React front-end app that visualizes the RLHF pipeline.
It is built with:
- React and `react-router-dom`
- Tailwind CSS
- `react-scripts`
This app is intended to show the RLHF process in a clearer way, making the training and reward pipeline easier to understand.
- Navigate to `RLHF-Part1`.
- Install the Python dependencies.
- Run the Flask app in `happy_gpt`.
Example:
```
cd /Users/mnanuva/Documents/RLHF/RLHF-Part1
pip install -r requirements.txt
cd happy_gpt
python app.py
```

If model files are missing, the app warns about the missing files and will not load the corresponding model.
- Navigate to `RLHF/RLHF-Visualizer`.
- Install the Node dependencies.
- Start the React app.
```
cd /Users/mnanuva/Documents/RLHF/RLHF-Visualizer
npm install
npm start
```

Repository layout:

```
RLHF/
  README.md            # This file
  RLHF-Part1/          # Core RLHF model and runtime
    README.txt         # Existing quick instructions
    happy_gpt/         # Flask web app for story generation
    mingpt/            # Minimal GPT implementation
    summarize_rlhf/    # RLHF summarization utilities
  RLHF-Visualizer/     # React app for pipeline visualization
    public/
    src/
    package.json
```
This repo is now a personal project for meghanaNanuvala. All visible documentation and branding should reflect that intent.