LDEV-2968 v3 openpdftohtml by zspitzer · Pull Request #62 · lucee/extension-pdf

zspitzer · 2026-02-05T08:22:40Z

https://luceeserver.atlassian.net/browse/LDEV-2968

- optimize: remove bookmarks, metadata, JS, attachments, thumbnails, comments, forms, links
- sanitize: security-focused removal of dangerous elements (JS, attachments, metadata, link actions)
- addStamp: delegates to watermark for image-based stamps

- srcfile/src without tag body now triggers rendering via doEndTag
- getBaseUrl() handles Lucee Resources, not just java.io.File
- Empty body no longer overrides srcfile content
- Encryption constants are now distinct; AES-128 uses setPreferAES
- Page ranges like "3-" resolve actual page count instead of -1
- "printing" permission maps to ALLOW_PRINTING, remove dead duplicate
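The open-ended page range fix can be illustrated with a small sketch. This is not the extension's code: the class `PageRangeSketch` and the method `resolve` are hypothetical names, and the accepted syntax (comma-separated pages and ranges, with an open end meaning "through the last page") is an assumption based on the `"3-"` example above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not taken from the extension. */
final class PageRangeSketch {

    /** Expands a spec such as "1,3-" into 1-based page numbers, using the real page count for open ends. */
    static List<Integer> resolve(String spec, int pageCount) {
        List<Integer> pages = new ArrayList<>();
        for (String part : spec.split(",")) {
            String range = part.trim();
            if (range.isEmpty()) continue;
            int dash = range.indexOf('-');
            int first;
            int last;
            if (dash < 0) {
                first = last = Integer.parseInt(range);
            } else {
                String lo = range.substring(0, dash).trim();
                String hi = range.substring(dash + 1).trim();
                first = lo.isEmpty() ? 1 : Integer.parseInt(lo);
                // An open end ("3-") resolves to the actual page count rather than a -1 sentinel.
                last = hi.isEmpty() ? pageCount : Integer.parseInt(hi);
            }
            if (first < 1 || last > pageCount || first > last) {
                throw new IllegalArgumentException(
                        "invalid page range [" + range + "] for a document with " + pageCount + " pages");
            }
            for (int p = first; p <= last; p++) pages.add(p);
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(resolve("3-", 5));    // [3, 4, 5]
        System.out.println(resolve("1,4-5", 5)); // [1, 4, 5]
    }
}
```

Resolving the open end up front means later code only ever sees concrete page numbers.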

- Cache getInfo() result to avoid re-parsing PDF on every struct access
- Close source PDDocuments in concat(), use try-with-resources in toImage()
- Fix InputStream leak in PDFForm.loadPDDocument() on error
- Remove pd4ml.jar, ss_css2.jar, .flattened-pom.xml
- Consolidate handlePageNumbers() into processPageVariables()
- Remove dead getMultipleHF() and multi-render loop
- Remove empty writeImages() stub
- Require action attribute on cfpdf (was silent no-op)
- Implement setFilter() for directory merge glob patterns
- Fix FONT_EMBED_SELECCTIVE typo
- Fix setMimetype() discarding normalised value
- Update fontdirectory TLD description

- Prevent XXE in PDFForm XML parsing (disable DTDs and external entities)
- Sanitise saveAsName in Content-Disposition header (strip CR/LF/quotes)
- Fix LuceeLogHandler accumulation (only attach once per JVM)
- Escape extracted text in XML output (extractText type=xml)
- Fix setFilter glob-to-regex using Pattern.quote for literal chars
- Map thumbnail scale (1-100) to DPI (3-300) instead of using raw value
- Remove dead PDF2Image.java
- Enable checkFileLocation on cfdocument filename attribute
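For readers unfamiliar with the XXE item, the following is a minimal sketch of the usual JAXP hardening, not the code in PDFForm. The class name `XxeSafeParserSketch` is hypothetical. The feature URIs are the standard Xerces/SAX ones that the JDK's built-in parser recognises; a different parser on the classpath may reject them with a `ParserConfigurationException`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

/** Hypothetical sketch of a DOM parser with DTDs and external entities disabled. */
final class XxeSafeParserSketch {

    static DocumentBuilder newSafeBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        // Reject any DOCTYPE outright; this alone blocks classic XXE payloads.
        f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Defence in depth in case DOCTYPEs are ever re-enabled.
        f.setFeature("http://xml.org/sax/features/external-general-entities", false);
        f.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        f.setXIncludeAware(false);
        f.setExpandEntityReferences(false);
        return f.newDocumentBuilder();
    }

    /** Returns true when the XML parses under the hardened settings, false when the parser rejects it. */
    static boolean parses(String xml) {
        try {
            newSafeBuilder().parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // includes the "DOCTYPE is disallowed" fatal error
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<form><field name=\"a\">1</field></form>"));
        System.out.println(parses(
                "<!DOCTYPE x [<!ENTITY e SYSTEM \"file:///etc/passwd\">]><x>&e;</x>"));
    }
}
```

With the JDK's default parser the first call succeeds and the second is rejected before any entity is resolved.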

- DocumentSection.setMimetype() now passes normalised value (was discarded)
- PDFPageMark.getHtmlTemplate() delegates to getHtml() for bounds safety
- ApplicationSettings.init is now volatile
- Remove FontsJarExtractor.main() debug method
- Replace e.printStackTrace() calls with comments

Use OpenHTMLToPDF's native <bookmarks> element for PDF outline generation, replacing the post-render hack that pointed all bookmarks to section start pages. Bookmarks now resolve to exact rendered page positions for explicit bookmarks, HTML headings, and section names. cfpdf merge now preserves and remaps bookmarks from all source PDFs with correct page offsets, and filters out bookmarks for excluded pages.
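The merge-side remapping described above comes down to page-offset bookkeeping. The sketch below shows only that arithmetic on plain Java values, under stated assumptions. `BookmarkRemapSketch`, `Bookmark` and `remap` are hypothetical names, and real outline handling through PDFBox is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of bookmark page remapping during a merge. */
final class BookmarkRemapSketch {

    /** A bookmark pointing at a 1-based page, either in a source PDF or in the merged PDF. */
    static final class Bookmark {
        final String title;
        final int page;

        Bookmark(String title, int page) {
            this.title = title;
            this.page = page;
        }
    }

    /**
     * keptPages.get(i) lists the 1-based pages of source i that are copied into the merged PDF, in order.
     * bookmarks.get(i) holds the bookmarks of source i. Bookmarks whose page was excluded are dropped.
     */
    static List<Bookmark> remap(List<List<Integer>> keptPages, List<List<Bookmark>> bookmarks) {
        List<Bookmark> merged = new ArrayList<>();
        int offset = 0; // number of pages already contributed by earlier sources
        for (int i = 0; i < keptPages.size(); i++) {
            List<Integer> kept = keptPages.get(i);
            for (Bookmark b : bookmarks.get(i)) {
                int pos = kept.indexOf(b.page); // position within this source's kept pages
                if (pos >= 0) {
                    merged.add(new Bookmark(b.title, offset + pos + 1));
                }
            }
            offset += kept.size();
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Bookmark> out = remap(
                List.of(List.of(1, 2, 3), List.of(2, 3)),
                List.of(
                        List.of(new Bookmark("Intro", 1), new Bookmark("End", 3)),
                        List.of(new Bookmark("Cover", 1), new Bookmark("Body", 2))));
        for (Bookmark b : out) System.out.println(b.title + " -> page " + b.page);
    }
}
```

In the example, the second source drops its first page, so its "Cover" bookmark is filtered out and "Body" lands on merged page 4.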

Tests:
- PDFWatermark: verifies watermark image is actually embedded via extractImage. removeWatermark stub is documented + skipped.
- PDFThumbnail: verifies scale produces different-sized images.
- PDFExtractImages: checks imagePrefix is honoured + extracted file is a valid image, multi-image case.
- PDFRemoveAttachments: verifies attachments are gone via extractAttachments round-trip.

Java:
- Extract path-traversal sanitizer to PDFUtil.sanitizeFilename, add null-byte rejection. Used by extractAttachments.
- Fix InputStream leak in PDFForm.loadFromResource: switch to try-with-resources; the buffer copies bytes but never owns the stream.
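A filename sanitizer of the kind described might look like the sketch below. It is an illustration only and is not the body of `PDFUtil.sanitizeFilename`. The class name `FilenameSanitizerSketch` is hypothetical, and the exact rules (which separators are stripped, how `.` and `..` are treated) are assumptions.

```java
/** Hypothetical sketch of an attachment-filename sanitizer. */
final class FilenameSanitizerSketch {

    /** Returns the bare file name, or throws if nothing safe remains. */
    static String sanitize(String name) {
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("filename is empty");
        }
        // A NUL byte can truncate the path at the OS level, so reject it outright.
        if (name.indexOf('\0') >= 0) {
            throw new IllegalArgumentException("filename contains a null byte");
        }
        // Drop everything up to the last separator, treating both / and \ as separators
        // regardless of the host platform.
        int cut = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
        String base = name.substring(cut + 1);
        if (base.isEmpty() || base.equals(".") || base.equals("..")) {
            throw new IllegalArgumentException("filename has no usable base name: " + name);
        }
        return base;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("../../etc/passwd"));      // passwd
        System.out.println(sanitize("C:\\temp\\report.pdf"));  // report.pdf
    }
}
```

Because only the base name survives, an attachment called `../../etc/passwd` is written as `passwd` inside the chosen output directory.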

zspitzer added 30 commits February 5, 2026 09:22

LDEV-2968 v3 openpdftohtml

6580828

https://luceeserver.atlassian.net/browse/LDEV-2968

Create CHANGELOG.md

1a6a15c

Add .gitignore to test artifact directories, untrack generated PDFs

89f9111

refactor

52cf21c

Update AGENTS.md

c3a786e

Move test artifacts to generated/ subdirs, leave in place for inspection

13e7468

Rewrite CI workflow to match crypto extension pattern

38d6813

Fix cfdocumentitem type=bookmark, add bookmark test coverage

d6be29e

Enable LuceeLogHandler for OpenHTMLToPDF logging

0c332c3

Routes render logging through Lucee's pdf log when defined. No-op if the log isn't configured in admin.

Wire up proxy settings to JSoup URL fetching

186c140

cfdocument src with proxyserver/proxyport now works.

Add proxy test coverage

7adcdf0

Verifies invalid proxy errors on remote fetch, and is ignored for local content.

Fix path traversal in extractAttachments, add fetch timeout, remove fonts.jar

8aca6ab

- Strip path components from PDF attachment filenames to prevent directory traversal
- Set 15s timeout on JSoup URL fetching
- Remove dead fonts.jar from res/ (2.4MB, loaded from classloader not res/)

Implement cfdocument scale attribute

22eae44

Renders content onto a larger page proportional to the scale factor, then uses PDFBox to scale pages back to target dimensions.
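The arithmetic behind this two-step approach can be sketched as follows. This assumes scale is a percentage, so that a scale of 50 renders onto a page twice as wide and tall and then shrinks it by 0.5. `ScaleSketch` and `plan` are hypothetical names, and the PDFBox page-scaling step itself is not shown.

```java
/** Hypothetical sketch of the render-large-then-shrink arithmetic. */
final class ScaleSketch {

    /**
     * Returns {renderWidth, renderHeight, shrinkFactor} for a target page size in points
     * and a scale given as a percentage.
     */
    static double[] plan(double targetWidth, double targetHeight, int scalePercent) {
        if (scalePercent <= 0) {
            throw new IllegalArgumentException("scale must be greater than 0");
        }
        double factor = scalePercent / 100.0;
        // Render on a page enlarged by 1/factor; shrinking that page by factor afterwards
        // brings it back to the target size with the content scaled down.
        return new double[] { targetWidth / factor, targetHeight / factor, factor };
    }

    public static void main(String[] args) {
        double[] p = plan(595, 842, 50); // roughly A4 in points, half-size content
        System.out.println(p[0] + " x " + p[1] + ", then scale pages by " + p[2]);
    }
}
```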

IsPDFArchive checks XMP metadata for PDF/A conformance

d5c0170

IsPDFArchive now validates pdfaid:part in XMP metadata instead of just checking if the file is a valid PDF. getInfo() includes PDFAVersion key. Register IsPDFArchive as a standalone function in function.fld.
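As an illustration of the check, the sketch below reads `pdfaid:part` from an XMP packet that has already been extracted as a string. It is not the extension's implementation. `PdfAPartSketch` and `pdfaPart` are hypothetical names, and pulling the XMP stream out of the PDF is out of scope here. The namespace URI is the standard PDF/A identification schema namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch: find pdfaid:part in an XMP packet. */
final class PdfAPartSketch {

    static final String PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns the pdfaid:part value (for example "1") or null when the XMP declares none. */
    static String pdfaPart(String xmp) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(xmp)));

            // Element form: <pdfaid:part>1</pdfaid:part>
            NodeList parts = doc.getElementsByTagNameNS(PDFAID_NS, "part");
            if (parts.getLength() > 0) {
                return parts.item(0).getTextContent().trim();
            }
            // Attribute form: <rdf:Description pdfaid:part="1" .../>
            NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
            for (int i = 0; i < descriptions.getLength(); i++) {
                Element d = (Element) descriptions.item(i);
                if (d.hasAttributeNS(PDFAID_NS, "part")) {
                    return d.getAttributeNS(PDFAID_NS, "part").trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("XMP packet could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String xmp = "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
                + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
                + "<rdf:Description rdf:about=\"\" xmlns:pdfaid=\"" + PDFAID_NS + "\">"
                + "<pdfaid:part>1</pdfaid:part><pdfaid:conformance>B</pdfaid:conformance>"
                + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        System.out.println(pdfaPart(xmp)); // 1
    }
}
```

A file with no `pdfaid:part` yields null, which is the case that a bare "is this a valid PDF" check would wrongly accept as an archive.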

Update changelog, agents docs

c34ea67

Bump openhtmltopdf 1.1.37, pdfbox 3.0.7, jsoup 1.22.1

85bba83

Allow self-closing unknown HTML tags for compatibility with real-world HTML

Add resourceHandler attribute to cfdocument for custom resource fetching

0c85c8c

Accepts a Component (with onResourceFetch method) or UDF to intercept image/CSS/resource fetching. Returns content or null to fall through to default. Wired into both src fetching and OpenHTMLToPDF rendering.

Reorganize tests into tags/document, tags/pdf, tickets subdirs

b5fabc9

Bumps test.bat to jdk-11.0.30 / jdk-21.0.10 and 7.1/snapshot/light.

Enable table pagination by default for cfdocument

4fc16f2

Adds -fs-table-paginate: paginate to the default OpenHTMLToPDF stylesheet so tables break across pages instead of dropping rows.

Pass parsedUrl struct to cfdocument resourceHandler

ba735dc

The handler now receives (url, parsedUrl) where parsedUrl is a struct with protocol, host, port, path, query, fragment. CFC handlers can still declare onResourceFetch(url) and ignore the second arg.
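To make the shape of `parsedUrl` concrete, here is a rough sketch of how such a struct could be assembled on the Java side with `java.net.URI`. The key names follow the list above. The class `ParsedUrlSketch`, the use of a `Map`, and the choice of -1 for a missing port are assumptions, not the extension's code.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: break a URL into the parts a resource handler would receive. */
final class ParsedUrlSketch {

    static Map<String, Object> parse(String url) {
        URI u = URI.create(url);
        Map<String, Object> parsed = new LinkedHashMap<>();
        parsed.put("protocol", u.getScheme());
        parsed.put("host", u.getHost());
        parsed.put("port", u.getPort());         // -1 when the URL has no explicit port
        parsed.put("path", u.getPath());
        parsed.put("query", u.getQuery());       // null when absent
        parsed.put("fragment", u.getFragment()); // null when absent
        return parsed;
    }

    public static void main(String[] args) {
        System.out.println(parse("https://example.com:8443/img/logo.png?v=2#top"));
    }
}
```

A handler that only cares about, say, the host can then branch on one field instead of re-parsing the raw URL string.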

Reject cfdocument scale=0 and document PDFBOX-5963 mergeDocuments NPE

6e14647

setScale now throws for values <= 0 (was < 0), matching the documented 1-100 range. Adds TODO note where mergeDocuments() can NPE on form fields with null font names — PDFBOX-5963, fixed in 3.0.8.

Add cfdocument core rendering test coverage

88af90c

DocumentRendering.cfc — 24 specs covering basic rendering, HTML entities, unicode, CSS, page breaks, page-size dimensions, images, HTML-to-AcroField conversion, and error handling. One skip for the checkbox case waiting on PDFBOX-5963.

Enable cfpdfform populate/read tests via HTML-generated fixture

3bf0266

Both files were previously fully skipped because they needed a PDF with AcroFields. The fixture is now generated from a cfdocument with <input type=text>, so all 9 populate specs and 6 read specs run.

zspitzer added 14 commits May 16, 2026 12:13

DocumentTableBreaks: add rowspan+thead test, move artefacts to generated/

9845a94

TLD: add attribute groups for v3 cfpdf/cfpdfform actions, fix scale + resourceHandler docs

f7659e6

cfpdfform: default overwriteData to false to match Adobe CF

5005421

Add htmlbookmark edge case test coverage

d857908

Implement real cfpdf removeWatermark via marked content

61197af

cfpdfform/cfpdf: wire flatten attribute, workaround PDFBOX-5962 in form read

6ee98fc

cfdocument: route OHTPDF logging via native XRLogger impl

9716082

cfdocument: remove dead fonts.jar bootstrap (PD4ML legacy)

dfd9ce4

tests: add DocumentFonts.cfc + consolidate font fixtures under tests/artifacts/fonts

6cef511

cfdocument: WARN on font metadata read failure instead of silent fallback

1d0d71a

cfdocument: remove dead PD4ML type-selection machinery

dc0ca3d

cfdocument: fix footer running-element only rendering on last page

6c29165

cfdocument: add debughtml attribute to dump pre-OHTPDF HTML

2d386e8

tests: add DocumentLinks + DocumentErrorConditions

e5dfec1

zspitzer wants to merge 44 commits into master from v3-openhtmltopdf
