Skip to content

TheWerbinator/ai-industry-docs-extraction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Industrial AI Application

Screenshot 2026-02-02 151549

🛠️ The Mission

In industrial automation, data is often trapped in legacy PDF catalogs or technical datasheets. This project is a Deterministic Extraction Pipeline designed to ingest unstructured industrial PDFs and transform them into validated, machine-readable JSON schemas.

Key Technical Highlight: Unlike a standard chatbot, this system uses Pydantic-based Output Parsing to ensure that the AI follows a strict "Technical Contract." If the AI cannot find a value or tries to hallucinate a format, the system flags it during validation.


🚀 Getting Started

1. System Dependencies (CRITICAL)

This engine uses unstructured.io to partition complex PDF schemas. You must install the following system-level tools for the PDF partitioning to work:

  • macOS (via Homebrew):
    brew install poppler tesseract
  • Windows:
    1. Download and install Poppler for Windows.
    2. Download and install Tesseract OCR.
    3. Add the /bin folders of both to your System Environment PATH.

2. Python Environment Setup

# Create a virtual environment
python -m venv venv

# Activate the environment
# (Windows)
.\venv\Scripts\activate
# (Mac/Linux)
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Running the Program

After all the above is complete, run the project with:

streamlit run app.py

You will be prompted to input your OpenAI product key and the associated catalog PDF.

Once those are in and processed, give it a prompt!

The output is rigid - the code is telling it only to output JSON for a motor's Model Number, Voltage, Horsepower, RPM, Size, and Enclosure.

As a proof of concept, this whole project is essentially showing that we can extract unstructured data and transform it into highly structured, readable JSON.

About

Extract structured data from unstructured PDFs (Datasheets) and uses Pydantic to force deterministic output

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages