Shreerang4/persona-driven-document-analyzer

🧠 Document Intelligence Engine

This project is an offline, CPU-only semantic extraction system built for the Adobe India Hackathon. It identifies the most relevant sections from a collection of PDF documents based on a given persona and job_to_be_done.


✅ Features

  • Zero-Internet Dependency: Fully offline inference using sentence-transformers.
  • Lightweight & Powerful: Uses the multi-qa-mpnet-base-dot-v1 sentence-transformers model (<500MB), which is trained for semantic search and scored with dot products.
  • Context-Aware Ranking: Creates a rich contextual prompt from the persona and job to find the most relevant document sections.
  • Robust PDF Parsing: A model-free heading extractor identifies section titles using structural and stylistic heuristics.
  • Advanced Scoring: Combines document-level and section-level relevance scores, so a section is ranked both on its own content and on how well its parent document fits the query.
  • Deduplication: Avoids duplicate or near-duplicate results in the final output.
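
The ranking ideas in the list above can be illustrated with a small, dependency-free sketch. It is not the project's actual code: the query wording, the `doc_weight` blend, the Jaccard-based duplicate check, and every function name here are assumptions chosen for illustration, and the vectors stand in for embeddings the real pipeline would get from the model.

```python
import re


def build_query(persona: str, job: str) -> str:
    """Fold persona and job into one query string (wording is an assumption)."""
    return f"Persona: {persona}. Task: {job}"


def dot(a, b):
    """Dot product, the similarity the *-dot-v1 models are trained for."""
    return sum(x * y for x, y in zip(a, b))


def combined_score(query_vec, doc_vec, section_vec, doc_weight=0.3):
    """Blend document-level and section-level relevance (weight is hypothetical)."""
    doc_score = dot(query_vec, doc_vec)
    section_score = dot(query_vec, section_vec)
    return doc_weight * doc_score + (1 - doc_weight) * section_score


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two titles, case-insensitive."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rank_sections(query_vec, sections, top_k=5, dup_threshold=0.8):
    """Sort by combined score, then skip near-duplicate titles."""
    scored = sorted(
        sections,
        key=lambda s: combined_score(query_vec, s["doc_vec"], s["vec"]),
        reverse=True,
    )
    kept = []
    for section in scored:
        if all(jaccard(section["title"], k["title"]) < dup_threshold for k in kept):
            kept.append(section)
        if len(kept) == top_k:
            break
    return kept
```

In a real run, `query_vec` would be the embedding of `build_query(persona, job)` and each section would carry its own embedding plus that of its parent document. Blending the two scores lets a strong section from a weakly related document still rank, while breaking ties in favor of documents that match the query overall.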

🚀 How to Run

1. Setup (IMPORTANT)

Before running the main script, you must install the dependencies and download the required AI model.

  • Install Dependencies:
    pip install -r requirements.txt
  • Download the Model: Run the following script from the root directory. This will download and save the model files into the ./models folder.
    python download_model.py
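
For orientation, a download-then-load-offline flow with sentence-transformers might look roughly like the sketch below. The contents of download_model.py are not shown in this README, so the function names and the `./models/<model name>` layout are assumptions; only `SentenceTransformer(...)`, its `save` method, its `device` argument, and the `HF_HUB_OFFLINE` environment variable are existing library features.

```python
import os
from pathlib import Path

MODEL_NAME = "multi-qa-mpnet-base-dot-v1"


def local_model_dir(root: str = "models", name: str = MODEL_NAME) -> Path:
    """Where the model is kept on disk (layout is an assumption)."""
    return Path(root) / name


def download_model(root: str = "models") -> Path:
    """One-time step that needs internet: fetch the model and save it locally."""
    from sentence_transformers import SentenceTransformer  # imported lazily

    target = local_model_dir(root)
    target.parent.mkdir(parents=True, exist_ok=True)
    SentenceTransformer(MODEL_NAME).save(str(target))
    return target


def load_offline(root: str = "models"):
    """Load the saved copy on CPU without contacting the Hugging Face Hub."""
    # Must be set before huggingface_hub is first imported to take effect.
    os.environ.setdefault("HF_HUB_OFFLINE", "1")
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(str(local_model_dir(root)), device="cpu")
```

Passing a local directory path instead of a model name is what keeps later runs offline: the library reads the saved files directly rather than resolving the name against the Hub.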

2. Place Input Files

  • Place all required PDF documents inside the input/ folder.
  • Place your input JSON file (e.g., challenge1b_input.json) inside the data/ folder.
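
Before starting a run, it can help to confirm that the inputs are where the pipeline expects them. The check below is a sketch under assumptions: the JSON schema is not documented here, so it only assumes top-level `persona` and `job_to_be_done` keys (the two names this README mentions), and `load_request` is a hypothetical helper, not part of the project.

```python
import json
from pathlib import Path


def load_request(json_path, pdf_dir="input"):
    """Read persona and job from the input JSON and list the PDFs to analyze.

    Assumes top-level "persona" and "job_to_be_done" keys; the real schema
    may nest or name these differently.
    """
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    missing = [k for k in ("persona", "job_to_be_done") if k not in data]
    if missing:
        raise KeyError(f"input JSON is missing keys: {missing}")

    pdfs = sorted(Path(pdf_dir).glob("*.pdf"))
    if not pdfs:
        raise FileNotFoundError(f"no PDF files found in {pdf_dir}/")
    return data["persona"], data["job_to_be_done"], pdfs
```

Calling `load_request("data/your_input_file.json")` from the repository root would then fail early with a clear message if a key or the PDF folder is missing, rather than partway through the pipeline.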

3. Run the Pipeline

Execute the main script from your terminal, pointing it to your specific input JSON file.

python main.py data/your_input_file.json

The final output will be saved to output/challenge1b_output.json.
