This project is an offline, CPU-only semantic extraction system built for the Adobe India Hackathon. It identifies the most relevant sections from a collection of PDF documents based on a given `persona` and `job_to_be_done`.
- Zero-Internet Dependency: Fully offline inference using `sentence-transformers`.
- Lightweight & Powerful: Uses the `multi-qa-mpnet-base-dot-v1` model (<500 MB) for state-of-the-art semantic understanding.
- Context-Aware Ranking: Creates a rich contextual prompt from the `persona` and `job_to_be_done` to find the most relevant document sections.
- Robust PDF Parsing: A model-free heading extractor identifies section titles using structural and stylistic heuristics.
- Advanced Scoring: Combines document-level and section-level relevance to ensure highly accurate results.
- Deduplication: Avoids duplicate or near-duplicate results in the final output.
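The deduplication step described above could, for example, compare normalized section titles against everything already kept and drop near-matches. The sketch below is illustrative only, not the project's actual implementation: the `dedupe_sections` helper and the 0.85 similarity threshold are assumptions, using only Python's standard-library `difflib`.

```python
from difflib import SequenceMatcher

def dedupe_sections(sections, threshold=0.85):
    """Keep only sections whose titles are not near-duplicates of an
    already-kept title (hypothetical helper; threshold is an assumption)."""
    kept = []
    for sec in sections:
        title = sec["title"].strip().lower()
        # A section is a near-duplicate if its normalized title is highly
        # similar to any title we have already accepted.
        is_dup = any(
            SequenceMatcher(None, title, k["title"].strip().lower()).ratio() >= threshold
            for k in kept
        )
        if not is_dup:
            kept.append(sec)
    return kept

sections = [
    {"title": "Introduction to Travel Planning"},
    {"title": "Introduction To Travel Planning "},  # near-duplicate, dropped
    {"title": "Packing Checklist"},
]
print([s["title"] for s in dedupe_sections(sections)])
```

An embedding-based variant (cosine similarity between section vectors) would catch paraphrased duplicates as well, at the cost of extra model calls.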
Before running the main script, you must install the dependencies and download the required AI model.
- Install Dependencies:

  ```
  pip install -r requirements.txt
  ```

- Download the Model:

  Run the following script from the root directory. It will download and save the model files into the `./models` folder.

  ```
  python download_model.py
  ```
- Place all required PDF documents inside the `input/` folder.
- Place your input JSON file (e.g., `challenge1b_input.json`) inside the `data/` folder.
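The exact schema of the input JSON is not shown here; a plausible minimal structure, based on the `persona` and `job_to_be_done` fields mentioned above, might look like the sketch below. The `documents` field, the nested key names, and the sample values are assumptions about the challenge format, not a confirmed specification.

```python
import json

# Hypothetical minimal input structure; everything beyond the top-level
# "persona" and "job_to_be_done" keys is an assumption.
sample_input = {
    "persona": {"role": "Travel Planner"},
    "job_to_be_done": {"task": "Plan a 4-day trip for a group of college friends"},
    "documents": [
        {"filename": "South of France - Cities.pdf"},
        {"filename": "South of France - Cuisine.pdf"},
    ],
}

# Serialize to see the shape a file in data/ might take.
text = json.dumps(sample_input, indent=2)
print(text)
```

Check your actual `challenge1b_input.json` against whatever schema the challenge statement defines before relying on these key names.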
Execute the main script from your terminal, pointing it to your specific input JSON file.
```
python main.py data/your_input_file.json
```

The final output will be saved to `output/challenge1b_output.json`.
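The "advanced scoring" feature combines document-level and section-level relevance; one common way to do this is a weighted sum of the two similarity scores. The sketch below is a minimal illustration, assuming a 0.3/0.7 split and a hypothetical `combined_score` helper — the project's real weighting may differ.

```python
def combined_score(doc_score, section_score, doc_weight=0.3):
    """Blend document-level and section-level relevance into a single
    ranking score (hypothetical helper; the 0.3/0.7 split is an assumption)."""
    return doc_weight * doc_score + (1.0 - doc_weight) * section_score

# Toy scores: a weak document can still win if its section matches strongly.
candidates = [
    ("Packing Checklist", combined_score(0.62, 0.91)),
    ("City Nightlife", combined_score(0.80, 0.55)),
]
ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
print(ranked)
```

Weighting the section score more heavily reflects that the final output ranks sections, with the document score acting as a prior on which files are worth mining.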