# RAG Document Q&A

A Retrieval-Augmented Generation (RAG) system for question answering over PDF documents, built with LangChain, OpenAI embeddings, and the Qdrant vector database.
## Features

- Load and process PDF documents
- Split documents into manageable chunks with overlap
- Generate embeddings using OpenAI's text-embedding-3-small model
- Store and retrieve vectors in Qdrant database
- Ready for Q&A pipeline integration
## Prerequisites

- Python 3.8+
- OpenAI API key
- Qdrant server running locally (default: http://localhost:6333)
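If Qdrant is not already running locally, the simplest way to start it (assuming Docker is installed) is the official image, which exposes the default port used above:

```shell
# Start a local Qdrant instance on the default port 6333
docker run -p 6333:6333 qdrant/qdrant
```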
## Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd rag-doc-qa
   ```

2. Create a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate   # On macOS/Linux
   venv\Scripts\activate      # On Windows
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Set up environment variables:

   Copy `.env.example` to `.env` (if available) or create `.env`, then add your OpenAI API key:

   ```
   OPENAI_API_KEY=your-api-key-here
   ```
## Usage

1. Place your PDF file (e.g., `sample_data.pdf`) in the project root.

2. Run the indexing script:

   ```bash
   python index.py
   ```
This will:
- Load the PDF
- Split it into chunks
- Generate embeddings
- Store vectors in Qdrant
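The splitting step above works on fixed-size character windows that share an overlap, so context is not lost at chunk boundaries. In the actual script this is typically handled by LangChain's text splitters; the pure-Python function below is only an illustration of the logic, not the project's implementation:

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into chunks of `chunk_size` characters, where each
    chunk shares its first `overlap` characters with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

With the defaults above, a 2,500-character document yields four chunks, and each chunk repeats the last 200 characters of the one before it.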
3. The system is now ready for Q&A queries (extend with retrieval and generation components).
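Conceptually, the retrieval half of that extension embeds the question and ranks the stored chunk vectors by cosine similarity. Qdrant performs this search server-side at scale; the sketch below (all names illustrative) only shows the ranking idea in pure Python:

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: List[float], chunk_vecs: List[List[float]], k: int = 2) -> List[int]:
    """Return the indices of the k stored vectors most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

The chunks at the returned indices would then be passed to the LLM as context for answer generation.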
## Project Structure

```
rag-doc-qa/
├── index.py           # Main indexing script
├── requirements.txt   # Python dependencies
├── .env               # Environment variables (ignored by git)
├── .gitignore         # Git ignore rules
├── sample_data.pdf    # Sample PDF document
└── README.md          # This file
```
## Dependencies

- langchain: Framework for LLM applications
- langchain-openai: OpenAI integrations
- langchain-qdrant: Qdrant vector store
- langchain-community: Community loaders and tools
- qdrant-client: Qdrant database client
- python-dotenv: Environment variable management
- pypdf: PDF processing
## Configuration

- Chunk size: 1000 characters with 200-character overlap
- Embedding model: text-embedding-3-small
- Qdrant URL: http://localhost:6333
- Collection name: learning-rag
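In code, these defaults might be collected as module-level constants so they can be changed in one place (the constant names below are illustrative, not taken from the project):

```python
# Indexing defaults matching the configuration listed above
CHUNK_SIZE = 1000       # characters per chunk
CHUNK_OVERLAP = 200     # characters shared between adjacent chunks
EMBEDDING_MODEL = "text-embedding-3-small"
QDRANT_URL = "http://localhost:6333"
COLLECTION_NAME = "learning-rag"
```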