A Streamlit-based Retrieval-Augmented Generation (RAG) application that enables users to upload PDF or image files, extract text using Tesseract OCR for images and embedded images in PDFs, and query the content using the Kimi-K2-Instruct model from Moonshot AI via HuggingFace's InferenceClient. The app leverages FAISS for vector-based similarity search to provide context-aware responses.

## Features

- Upload and process multiple PDFs or images (PNG, JPG, JPEG).
- Extract text from PDFs using PyMuPDF (fitz) and perform OCR on images using Tesseract.
- Index extracted text into a FAISS vector store for efficient retrieval.
- Query the indexed content with natural language questions, answered by the Kimi-K2-Instruct model.
- User-friendly Streamlit interface with chat-based interaction.
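Before text reaches the FAISS index, it is typically split into overlapping chunks for embedding. The app's actual splitter and settings aren't shown in this README, so the sketch below is illustrative (`chunk_size=500` and `overlap=50` are assumed values, not the app's configuration):

```python
from typing import List

def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split extracted text into overlapping chunks for embedding and indexing."""
    chunks = []
    start = 0
    step = chunk_size - overlap  # advance less than chunk_size so chunks overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

# 1200 characters -> three chunks of at most 500 characters each
chunks = split_text("x" * 1200)
```

The overlap preserves sentence context that would otherwise be cut at chunk boundaries, which helps similarity search return coherent passages.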
## Requirements

- Python 3.8+
- Tesseract-OCR installed on your system (update `pytesseract.pytesseract.tesseract_cmd` in the code to match your Tesseract installation path).
- A HuggingFace API token with access to the Kimi-K2-Instruct model (set as `HF_API_KEY` in the code).
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/your-username/your-repo.git
   cd your-repo
   ```

2. Create a virtual environment and activate it:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
3. Install the required Python packages:

   ```bash
   pip install -r requirements.txt
   ```
4. Install Tesseract-OCR:
   - Windows: Download and install from the official Tesseract-OCR installers. Update the `tesseract_cmd` path in the code if needed.
   - Linux/macOS: Install via your package manager (e.g., `sudo apt install tesseract-ocr` on Ubuntu or `brew install tesseract` on macOS).
5. Set your HuggingFace API token:
   - Replace `HF_API_KEY` in the code with your HuggingFace API token, or set it as an environment variable:

     ```bash
     export HF_TOKEN="your-huggingface-api-token"
     ```
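If you take the environment-variable route, the app can read the token at startup instead of hard-coding it. A minimal sketch (the `HF_TOKEN` name matches the export command above; `load_hf_token` is an illustrative helper, not the app's actual code):

```python
import os

def load_hf_token(default: str = "") -> str:
    """Read the HuggingFace API token from the HF_TOKEN environment variable."""
    return os.environ.get("HF_TOKEN", default)

# Resolve the token once at startup; avoids committing secrets to the repo.
HF_API_KEY = load_hf_token()
```

Keeping the token out of source control is the main benefit; the hard-coded fallback should only ever hold a placeholder.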
6. Run the Streamlit app:

   ```bash
   streamlit run app.py
   ```
## Usage

1. Open the provided local URL (e.g., `http://localhost:8501`) in your browser.
2. Upload PDF or image files via the interface.
3. Click "Process files" to extract text, perform OCR, and index the content.
4. Ask questions in the chat input to query the processed documents. The app retrieves relevant content and generates responses using the Kimi-K2-Instruct model.
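Behind that last step, the retrieved chunks and the user's question are combined into a chat prompt before being sent to the model. A simplified sketch of that assembly (the function name and prompt wording are assumptions, not the app's exact code):

```python
from typing import Dict, List

def build_messages(question: str, retrieved_chunks: List[str]) -> List[Dict[str, str]]:
    """Combine retrieved document chunks and the user question into a chat prompt."""
    context = "\n\n".join(retrieved_chunks)
    return [
        {"role": "system",
         "content": "Answer using only the provided document context.\n\n" + context},
        {"role": "user", "content": question},
    ]

messages = build_messages("What is the invoice total?", ["Chunk A", "Chunk B"])
```

A list like `messages` is what would then be passed to huggingface_hub's `InferenceClient` chat-completion call with the Kimi-K2-Instruct model.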
## Dependencies

Listed in `requirements.txt`:
- streamlit
- PyMuPDF (fitz)
- pytesseract
- Pillow
- langchain-community
- faiss-cpu
- huggingface_hub
Install them using:

```bash
pip install streamlit PyMuPDF pytesseract Pillow langchain-community faiss-cpu huggingface_hub
```

## Notes

- Ensure Tesseract-OCR is correctly installed and its path is set in the script.
- The app uses the `all-MiniLM-L6-v2` model for embeddings, which is lightweight and effective for text similarity tasks.
- The Kimi-K2-Instruct model requires a valid HuggingFace API token with access to the Moonshot AI provider.
- For large PDFs or images, processing time may vary depending on system resources and file complexity.
- OCR accuracy depends on image quality and Tesseract’s performance.
- The app assumes English text for OCR (`lang="eng"`). Modify the `ocr_bytes` function for other languages.
- The Kimi-K2-Instruct model may occasionally be unavailable due to API constraints.
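For non-English documents, the OCR step needs a different Tesseract language pack. This README doesn't show the real `ocr_bytes` signature, so the version below is an assumed sketch of what a language-configurable variant could look like (Pillow and pytesseract are imported lazily so the code-mapping helper works even where they aren't installed):

```python
import io

def tesseract_lang(code: str) -> str:
    """Map a short language code to a Tesseract traineddata name (illustrative subset)."""
    return {"en": "eng", "de": "deu", "fr": "fra", "es": "spa"}.get(code, "eng")

def ocr_bytes(image_bytes: bytes, lang: str = "eng") -> str:
    """Run Tesseract OCR on raw image bytes; `lang` selects the traineddata pack."""
    # Lazy imports: keeps this module importable even if OCR deps are missing.
    from PIL import Image
    import pytesseract
    image = Image.open(io.BytesIO(image_bytes))
    return pytesseract.image_to_string(image, lang=lang)
```

Note that each `lang` value also requires the matching traineddata file to be installed alongside the Tesseract binary.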
## Contributing

Contributions are welcome! Please open an issue or submit a pull request with improvements or bug fixes.
## Acknowledgments

- Powered by Streamlit, Tesseract-OCR, and HuggingFace.
- Uses the Kimi-K2-Instruct model by Moonshot AI for natural language processing.