This project is an offline, CPU-only semantic extraction system built for the Adobe India Hackathon. It identifies the most relevant sections from a collection of PDF documents based on a given `persona` and `job_to_be_done`.
- Zero-Internet Dependency: Fully offline inference using `sentence-transformers`.
- Lightweight & Powerful: Uses the `multi-qa-mpnet-base-dot-v1` model (<500 MB) for state-of-the-art semantic understanding.
- Context-Aware Ranking: Creates a rich contextual prompt from the `persona` and `job_to_be_done` to find the most relevant document sections.
- Robust PDF Parsing: A model-free heading extractor identifies section titles using structural and stylistic heuristics.
- Advanced Scoring: Combines document-level and section-level relevance to ensure highly accurate results.
- Deduplication: Avoids duplicate or near-duplicate results in the final output.
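The deduplication step described above could, for example, compare normalized section titles against everything already kept and drop near-matches. The sketch below is illustrative only, not the project's actual implementation: the `dedupe_sections` helper and the 0.85 similarity threshold are assumptions, using only Python's standard-library `difflib`.

```python
from difflib import SequenceMatcher

def dedupe_sections(sections, threshold=0.85):
    """Keep only sections whose titles are not near-duplicates of an
    already-kept title (hypothetical helper; threshold is an assumption)."""
    kept = []
    for sec in sections:
        title = sec["title"].strip().lower()
        # A section is a near-duplicate if its normalized title is highly
        # similar to any title we have already accepted.
        is_dup = any(
            SequenceMatcher(None, title, k["title"].strip().lower()).ratio() >= threshold
            for k in kept
        )
        if not is_dup:
            kept.append(sec)
    return kept

sections = [
    {"title": "Introduction to Travel Planning"},
    {"title": "Introduction To Travel Planning "},  # near-duplicate, dropped
    {"title": "Packing Checklist"},
]
print([s["title"] for s in dedupe_sections(sections)])
```

An embedding-based variant (cosine similarity between section vectors) would catch paraphrased duplicates as well, at the cost of extra model calls.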
Before running the main script, you must install the dependencies and download the required AI model.
- Install Dependencies:

  ```
  pip install -r requirements.txt
  ```

- Download the Model:

  Run the following script from the root directory. It will download and save the model files into the `./models` folder.

  ```
  python download_model.py
  ```
- Place all required PDF documents inside the `input/` folder.
- Place your input JSON file (e.g., `challenge1b_input.json`) inside the `data/` folder.
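The exact schema of the input JSON is not shown here; a plausible minimal structure, based on the `persona` and `job_to_be_done` fields mentioned above, might look like the sketch below. The `documents` field, the nested key names, and the sample values are assumptions about the challenge format, not a confirmed specification.

```python
import json

# Hypothetical minimal input structure; everything beyond the top-level
# "persona" and "job_to_be_done" keys is an assumption.
sample_input = {
    "persona": {"role": "Travel Planner"},
    "job_to_be_done": {"task": "Plan a 4-day trip for a group of college friends"},
    "documents": [
        {"filename": "South of France - Cities.pdf"},
        {"filename": "South of France - Cuisine.pdf"},
    ],
}

# Serialize to see the shape a file in data/ might take.
text = json.dumps(sample_input, indent=2)
print(text)
```

Check your actual `challenge1b_input.json` against whatever schema the challenge statement defines before relying on these key names.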
Execute the main script from your terminal, pointing it to your specific input JSON file.
```
python main.py data/your_input_file.json
```

The final output will be saved to `output/challenge1b_output.json`.
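The "advanced scoring" feature combines document-level and section-level relevance; one common way to do this is a weighted sum of the two similarity scores. The sketch below is a minimal illustration, assuming a 0.3/0.7 split and a hypothetical `combined_score` helper — the project's real weighting may differ.

```python
def combined_score(doc_score, section_score, doc_weight=0.3):
    """Blend document-level and section-level relevance into a single
    ranking score (hypothetical helper; the 0.3/0.7 split is an assumption)."""
    return doc_weight * doc_score + (1.0 - doc_weight) * section_score

# Toy scores: a weak document can still win if its section matches strongly.
candidates = [
    ("Packing Checklist", combined_score(0.62, 0.91)),
    ("City Nightlife", combined_score(0.80, 0.55)),
]
ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
print(ranked)
```

Weighting the section score more heavily reflects that the final output ranks sections, with the document score acting as a prior on which files are worth mining.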