PdfTableExtract


This tool extracts tables from PDFs. It supports cells spanning multiple rows or columns. For sample results, take a look at the PDF and the HTML file in this repository: the HTML table was extracted from that PDF.

I wrote this because I needed to extract the tables from a lot of PDFs, but the good tools were expensive and the others did not work well.

This is not a very user-friendly tool, but if you want me to make it easier to use, tell me!

You need the following installed: Ghostscript, pdftotext, and OpenCV.

Compile main.cpp and link it against OpenCV. The program overwrites tmp.txt and tmp.jpg in your working directory, so make sure you don't have anything important stored under those names.
