PDF Deconstructor

Decompose a PDF file based on its headers for RAG ingestion.

What it does

It takes a PDF file, converts it into a hierarchy of sections and subsections based on the headers inside the file, and extracts all the links.
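
The library's internals are not described here, so the following is only a minimal sketch of the general idea of nesting headers by level. The `Node` class and `build_tree` function are hypothetical names made up for illustration; they are not part of pdf_deconstructor.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical section node; not the library's actual class."""
    title: str
    level: int
    children: list["Node"] = field(default_factory=list)


def build_tree(headers: list[tuple[int, str]]) -> list[Node]:
    """Nest (level, title) pairs so each header sits under the nearest shallower one."""
    roots: list[Node] = []
    stack: list[Node] = []  # path from a root down to the most recent header
    for level, title in headers:
        node = Node(title, level)
        # Drop headers at the same or a deeper level; they cannot be the parent.
        while stack and stack[-1].level >= level:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


tree = build_tree([(1, "Intro"), (2, "Scope"), (2, "Terms"), (1, "Method")])
```

Here "Scope" and "Terms" end up as children of "Intro", while "Method" becomes a second top-level section.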

Main use

It is intended to be one stage in a larger pipeline that ingests unstructured data for RAG.

How to use

from pdf_deconstructor import Deconstructor as PDFDeconstructor

output = PDFDeconstructor.parse("file.pdf", start_page=1)

# Print the extracted tree of each top-level header
for header in output.content:
    print(header.tree())

Each header contains its raw text, the same text in Markdown syntax, the extracted links, and its subheaders.
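
For RAG ingestion, a nested structure like this usually gets flattened into chunks that keep their heading path. The sketch below shows one way to do that. It uses a hypothetical `Section` dataclass as a stand-in, because the real attribute names of the header objects are not documented here; `title`, `text`, `links`, `children`, and `to_chunks` are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical stand-in for a parsed header; field names are assumptions."""
    title: str
    text: str
    links: list[str] = field(default_factory=list)
    children: list["Section"] = field(default_factory=list)


def to_chunks(section, path=()):
    """Yield one dict per section, depth-first, tagged with its heading path."""
    here = path + (section.title,)
    yield {"path": " > ".join(here), "text": section.text, "links": section.links}
    for child in section.children:
        yield from to_chunks(child, here)


doc = Section("Guide", "Overview.", children=[
    Section("Install", "Run the installer.", links=["https://example.com"]),
])
chunks = list(to_chunks(doc))
```

Keeping the heading path with each chunk lets a retriever show where a passage came from, even after the tree has been flattened.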

