A Streamlit-based Retrieval-Augmented Generation (RAG) application that enables users to upload PDF or image files, extract text using Tesseract OCR for images and embedded images in PDFs, and query the content using the Kimi-K2-Instruct model from Moonshot AI via HuggingFace's InferenceClient. The app leverages FAISS for vector-based similarity search to provide context-aware responses.

## Features

- Upload and process multiple PDFs or images (PNG, JPG, JPEG).
- Extract text from PDFs using PyMuPDF (fitz) and perform OCR on images using Tesseract.
- Index extracted text into a FAISS vector store for efficient retrieval.
- Query the indexed content with natural language questions, answered by the Kimi-K2-Instruct model.
- User-friendly Streamlit interface with chat-based interaction.
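Before text reaches the FAISS index, it is typically split into overlapping chunks for embedding. The app's actual splitter and settings aren't shown in this README, so the sketch below is illustrative (`chunk_size=500` and `overlap=50` are assumed values, not the app's configuration):

```python
from typing import List

def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split extracted text into overlapping chunks for embedding and indexing."""
    chunks = []
    start = 0
    step = chunk_size - overlap  # advance less than chunk_size so chunks overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

# 1200 characters -> three chunks of at most 500 characters each
chunks = split_text("x" * 1200)
```

The overlap preserves sentence context that would otherwise be cut at chunk boundaries, which helps similarity search return coherent passages.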
## Requirements

- Python 3.8+
- Tesseract-OCR installed on your system (update `pytesseract.pytesseract.tesseract_cmd` in the code to match your Tesseract installation path).
- A HuggingFace API token with access to the Kimi-K2-Instruct model (set as `HF_API_KEY` in the code).
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/your-username/your-repo.git
   cd your-repo
   ```

2. Create a virtual environment and activate it:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
3. Install the required Python packages:

   ```bash
   pip install -r requirements.txt
   ```
4. Install Tesseract-OCR:
   - Windows: Download and install from the official Tesseract-OCR installers. Update the `tesseract_cmd` path in the code if needed.
   - Linux/macOS: Install via your package manager (e.g., `sudo apt install tesseract-ocr` on Ubuntu or `brew install tesseract` on macOS).
5. Set your HuggingFace API token:
   - Replace `HF_API_KEY` in the code with your HuggingFace API token, or set it as an environment variable:

     ```bash
     export HF_TOKEN="your-huggingface-api-token"
     ```
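If you take the environment-variable route, the app can read the token at startup instead of hard-coding it. A minimal sketch (the `HF_TOKEN` name matches the export command above; `load_hf_token` is an illustrative helper, not the app's actual code):

```python
import os

def load_hf_token(default: str = "") -> str:
    """Read the HuggingFace API token from the HF_TOKEN environment variable."""
    return os.environ.get("HF_TOKEN", default)

# Resolve the token once at startup; avoids committing secrets to the repo.
HF_API_KEY = load_hf_token()
```

Keeping the token out of source control is the main benefit; the hard-coded fallback should only ever hold a placeholder.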
6. Run the Streamlit app:

   ```bash
   streamlit run app.py
   ```
## Usage

1. Open the provided local URL (e.g., `http://localhost:8501`) in your browser.
2. Upload PDF or image files via the interface.
3. Click "Process files" to extract text, perform OCR, and index the content.
4. Ask questions in the chat input to query the processed documents. The app retrieves relevant content and generates responses using the Kimi-K2-Instruct model.
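Behind that last step, the retrieved chunks and the user's question are combined into a chat prompt before being sent to the model. A simplified sketch of that assembly (the function name and prompt wording are assumptions, not the app's exact code):

```python
from typing import Dict, List

def build_messages(question: str, retrieved_chunks: List[str]) -> List[Dict[str, str]]:
    """Combine retrieved document chunks and the user question into a chat prompt."""
    context = "\n\n".join(retrieved_chunks)
    return [
        {"role": "system",
         "content": "Answer using only the provided document context.\n\n" + context},
        {"role": "user", "content": question},
    ]

messages = build_messages("What is the invoice total?", ["Chunk A", "Chunk B"])
```

A list like `messages` is what would then be passed to huggingface_hub's `InferenceClient` chat-completion call with the Kimi-K2-Instruct model.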
## Dependencies

Listed in `requirements.txt`:
- streamlit
- PyMuPDF (fitz)
- pytesseract
- Pillow
- langchain-community
- faiss-cpu
- huggingface_hub
Install them using:

```bash
pip install streamlit PyMuPDF pytesseract Pillow langchain-community faiss-cpu huggingface_hub
```

## Notes

- Ensure Tesseract-OCR is correctly installed and its path is set in the script.
- The app uses the `all-MiniLM-L6-v2` model for embeddings, which is lightweight and effective for text similarity tasks.
- The Kimi-K2-Instruct model requires a valid HuggingFace API token with access to the Moonshot AI provider.
- For large PDFs or images, processing time may vary depending on system resources and file complexity.
- OCR accuracy depends on image quality and Tesseract’s performance.
- The app assumes English text for OCR (`lang="eng"`). Modify the `ocr_bytes` function for other languages.
- The Kimi-K2-Instruct model may occasionally be unavailable due to API constraints.
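For non-English documents, the OCR step needs a different Tesseract language pack. This README doesn't show the real `ocr_bytes` signature, so the version below is an assumed sketch of what a language-configurable variant could look like (Pillow and pytesseract are imported lazily so the code-mapping helper works even where they aren't installed):

```python
import io

def tesseract_lang(code: str) -> str:
    """Map a short language code to a Tesseract traineddata name (illustrative subset)."""
    return {"en": "eng", "de": "deu", "fr": "fra", "es": "spa"}.get(code, "eng")

def ocr_bytes(image_bytes: bytes, lang: str = "eng") -> str:
    """Run Tesseract OCR on raw image bytes; `lang` selects the traineddata pack."""
    # Lazy imports: keeps this module importable even if OCR deps are missing.
    from PIL import Image
    import pytesseract
    image = Image.open(io.BytesIO(image_bytes))
    return pytesseract.image_to_string(image, lang=lang)
```

Note that each `lang` value also requires the matching traineddata file to be installed alongside the Tesseract binary.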
## Contributing

Contributions are welcome! Please open an issue or submit a pull request with improvements or bug fixes.
## Acknowledgments

- Powered by Streamlit, Tesseract-OCR, and HuggingFace.
- Uses the Kimi-K2-Instruct model by Moonshot AI for natural language processing.