Reinforcement Learning from Human Feedback (RLHF): End-to-End RLHF Pipeline — from Pretraining to PPO
- Model stack: minGPT, GPT-2, custom transformer architectures
- Datasets: TinyStories, OpenAI Summarize TL;DR, CarperAI/openai_summarize_comparisons
- Training pipeline: pretraining, supervised fine-tuning (SFT), RL fine-tuning
- RL methods: vanilla policy gradient, PPO, KL-divergence penalty, GAE (gamma=1, lambda=0.95)
- Reward modeling: learned reward model, scalar reward head, Bradley-Terry pairwise ranking loss
- Deployment: Flask web interface, React visualization front-end
- Concepts: reward modeling, preference learning, KL-regularized RL, text summarization, sentiment steering
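The GAE settings listed above (gamma=1, lambda=0.95) can be illustrated with a minimal sketch. This is not the repository's code; the function name and array layout are assumptions:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one finished trajectory.

    rewards[t]: reward received after token/action t
    values:     value estimates per state, length len(rewards) + 1;
                values[-1] is the bootstrap value (0 at a terminal state)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD error at time t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With gamma=1 and lambda=0.95, each token's advantage is a 0.95-decayed sum of the remaining TD errors along the sequence.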
This repository contains an RLHF project by meghanaNanuvala. It includes two main parts:
- `RLHF-Part1`: a full three-stage RLHF pipeline implemented from scratch (pretraining, supervised fine-tuning, and RL fine-tuning), with core model training and inference workflows.
- `RLHF-Visualizer`: a React-based visualization app that explains and inspects the RLHF process, showing how the components connect and how the training pipeline behaves.
The goal is to maintain this as a personal RLHF project by meghanaNanuvala, with clean branding and clear documentation.
RLHF-Part1 is the core model project. It includes:
- `mingpt/`: minimal GPT model implementation used for training and inference.
- `happy_gpt/`: a Flask web app that loads both a pre-trained model and an RL fine-tuned model.
- `chargpt/`: additional RL or model-related scripts.
- `summarize_rlhf/`: tools for summarization and reward model evaluation.
Key behavior:
- Implements a complete 3-stage RLHF pipeline on TinyStories: pretraining → supervised fine-tuning → RL fine-tuning.
- Uses vanilla policy gradient with a KL-divergence penalty against a frozen reference model to steer generation toward positive sentiment based on VADER compound score.
- Builds a PPO optimization loop for text summarization on the OpenAI Summarize TL;DR dataset, with a GPT-2 architecture (12-layer, 768-dim), clipped surrogate objective, GAE advantage estimation (gamma=1, lambda=0.95), and a separate value function.
- Trains a learned reward model from human preference comparisons (`CarperAI/openai_summarize_comparisons`) by fine-tuning a GPT-2 transformer with a scalar reward head and a Bradley-Terry pairwise ranking loss.
- Uses the learned reward model to drive PPO optimization with KL-regularized rewards.
- Serves a Flask web app from `happy_gpt` where users can compare output from the pre-trained and RL fine-tuned story models.
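Two of the components above, the Bradley-Terry reward-model loss and the KL-regularized per-token reward, can be sketched as follows. This is an illustrative sketch, not the repository's actual code; the function names, the `beta` coefficient, and placing the task reward on the final token are assumptions:

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise ranking loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    diff = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # log(1 + exp(-x)) computed stably via logaddexp
    return np.logaddexp(0.0, -diff).mean()

def kl_penalized_rewards(logp_policy, logp_ref, task_reward, beta=0.1):
    """Per-token rewards for KL-regularized RL fine-tuning.

    logp_policy[t]: log-prob of sampled token t under the current policy
    logp_ref[t]:    log-prob of the same token under the frozen reference model
    task_reward:    scalar sequence-level reward (e.g. VADER compound score in
                    the sentiment stage, reward-model output in summarization)
    """
    kl_term = np.asarray(logp_policy, dtype=float) - np.asarray(logp_ref, dtype=float)
    rewards = -beta * kl_term    # penalize drifting from the reference model
    rewards[-1] += task_reward   # sequence-level reward lands on the final token
    return rewards
```

With this shaping, the policy is pushed toward higher task reward while `beta` controls how far its generations may drift from the frozen reference model.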
RLHF-Visualizer is a React front-end app that visualizes the RLHF pipeline.
It is built with:
- React and `react-router-dom`
- Tailwind CSS
- `react-scripts`
This app is intended to show the RLHF process in a clearer way, making the training and reward pipeline easier to understand.
- Navigate to `RLHF-Part1`.
- Install the Python dependencies.
- Run the Flask app in `happy_gpt`.
Example:
```
cd /Users/mnanuva/Documents/RLHF/RLHF-Part1
pip install -r requirements.txt
cd happy_gpt
python app.py
```

If model files are missing, the app warns about the missing files and will not load the corresponding model.
- Navigate to `RLHF/RLHF-Visualizer`.
- Install the Node dependencies.
- Start the React app.
```
cd /Users/mnanuva/Documents/RLHF/RLHF-Visualizer
npm install
npm start
```

Repository layout:

```
RLHF/
  README.md            # This file
  RLHF-Part1/          # Core RLHF model and runtime
    README.txt         # Existing quick instructions
    happy_gpt/         # Flask web app for story generation
    mingpt/            # Minimal GPT implementation
    summarize_rlhf/    # RLHF summarization utilities
  RLHF-Visualizer/     # React app for pipeline visualization
    public/
    src/
    package.json
```
This repo is now a personal project for meghanaNanuvala. All visible documentation and branding should reflect that intent.