Add example for "How to Create a Searchable PDF document via Python" #5
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -1,4 +1,5 @@ |
| | | aspose-pdf |
| | | lxml |
| | | pydicom |
| | | pandas |
| | | pandas |
| | | pytesseract |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -1,7 +1,10 @@ |
| | | import aspose.pdf as ap |
| | | import io |
| | | import pytesseract |
| | | import sys |
| | | from os import path |
| | | from pathlib import Path |
| | | |
| | | import aspose.pdf as ap |
| | | |
| | | sys.path.append(path.join(path.dirname(__file__), "..")) |
| | | |
| | | @@ -16,20 +19,53 @@ def create_new_document(input_pdf, output_pdf): |
| | | document.save(output_pdf) |
| | | |
| | | |
| | | def create_searchable_document(infile, outfile, image_file_path, page_number=1): |
| | | """ |
| | | An example of using optical character recognition (OCR) technology to create a searchable PDF document. |
| | | |
| | | Args: |
| | | infile (str): The name of the input PDF file |
| | | outfile (str): The base name for output files (index will be appended) |
| | | image_file_path (str): The name of the image file |
| | | page_number (int): The page number |
| | | |
| | | Returns: |
| | | None |
| | | """ |
| - An example of using optical character recognition (OCR) technology to create a searchable PDF document. | |
| - Args: | |
| - infile (str): The name of the input PDF file | |
| - outfile (str): The base name for output files (index will be appended) | |
| - image_file_path (str): The name of the image file | |
| - page_number (int): The page number | |
| - Returns: | |
| - None | |
| - """ | |
| + Use optical character recognition (OCR) to create a searchable PDF document. | |
| + Args: | |
| + infile (str): The path to the input PDF file. | |
| + outfile (str): The path to the output searchable PDF file. | |
| + image_file_path (str): The path to the intermediate image file. | |
| + page_number (int): The page number to process. | |
| + Returns: | |
| + None | |
| + """ | |
Copilot
AI
Apr 15, 2026
image_stream is opened with mode 'x' before the try block. If the file already exists (e.g., a previous run crashed before cleanup), this raises before cleanup runs. Also, the stream remains open when pytesseract reads image_file_path, which can fail on Windows due to file locking and/or unflushed writes. Open the file inside the try (or use a context manager), write/flush/close it before calling pytesseract, and consider using a tempfile-managed path to avoid collisions.
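One way to follow this advice is sketched below. It is a minimal illustration, not the PR's code: `ocr_via_temp_image` and its `run_ocr` callback are hypothetical names, and the OCR step is passed in as a callable so the file handling can be shown without assuming how the example calls pytesseract.

```python
import os
import tempfile


def ocr_via_temp_image(image_bytes, run_ocr, suffix=".png"):
    """Write image bytes to a unique temp file, run OCR on it, then clean up.

    `run_ocr` is a hypothetical callback that receives the image path.
    """
    # mkstemp creates a new, uniquely named file, so a leftover file from a
    # crashed earlier run cannot cause a collision.
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    try:
        # The context manager flushes and closes the stream before OCR starts,
        # which avoids file-locking and unflushed-write problems on Windows.
        with os.fdopen(fd, "wb") as image_stream:
            image_stream.write(image_bytes)
        return run_ocr(tmp_path)
    finally:
        # Cleanup runs whether or not the write or the OCR step raised.
        try:
            os.remove(tmp_path)
        except FileNotFoundError:
            pass
```

In the example, `run_ocr` could wrap a call such as `pytesseract.image_to_pdf_or_hocr(path, extension="pdf")`; that wiring is an assumption and is not taken from the PR.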
Copilot
AI
Apr 15, 2026
Path.unlink(missing_ok=True) requires Python 3.8+, but the repo README states Python 3.7+ support. Replace this with a try/except FileNotFoundError (or check existence) to keep compatibility.
| - image_file.unlink(missing_ok=True) | |
| + try: | |
| + image_file.unlink() | |
| + except FileNotFoundError: | |
| + pass | |
Adding pytesseract introduces a runtime dependency on the native Tesseract binary (not installed via pip). Without documenting installation steps (or handling TesseractNotFoundError with a clear message), users will hit confusing failures at runtime. Consider adding a short note in the example (and/or README) describing how to install Tesseract and how to configure pytesseract.pytesseract.tesseract_cmd on Windows.
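A pre-flight check is one possible way to surface the missing binary early. The sketch below swaps in a different technique from the one named above: instead of catching TesseractNotFoundError, it looks for the executable with the standard library before any OCR call. `require_tesseract` is a hypothetical helper name, and the error wording is only a suggestion.

```python
import shutil


def require_tesseract(command="tesseract"):
    """Return the resolved path of the Tesseract executable.

    Raises RuntimeError with a readable message when it cannot be found.
    `command` may be a bare name looked up on PATH or a full path.
    """
    resolved = shutil.which(command)
    if resolved is None:
        raise RuntimeError(
            f"Tesseract executable '{command}' was not found. "
            "pip installs only the pytesseract wrapper; the Tesseract OCR "
            "binary must be installed separately and be on PATH, or its "
            "full path must be supplied."
        )
    return resolved
```

The returned path could then be assigned to pytesseract.pytesseract.tesseract_cmd before the first OCR call, which covers the Windows case where the installer does not add Tesseract to PATH.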