
Improve squeeze efficiency for large files #440

Open
luantak wants to merge 2 commits into johnwhitington:master from luantak:squeeze_optimization

Conversation


@luantak luantak commented Mar 17, 2026

This makes the -squeeze and -squeeze-no-pagedata operations much faster for huge PDF files. There is a moderate speedup for smaller files.

May also improve file size a little on some files.

I can provide concrete numbers for my improvement claims if necessary.

I'm not sure if all of this code belongs in cpdf or if some of it should go in camlpdf.

luantak added 2 commits March 17, 2026 18:38
…rgeted cleanup

Use new hash tables to bucket normalized objects and cache stream-content hashes, reducing repeated comparisons and avoiding unnecessary stream materialization during duplicate detection.

Track rewritten page streams and only run unreferenced-object cleanup and follow-up deduplication when page rewrites or recompression actually changed the PDF.
@johnwhitington
Owner

Thanks. I'll take a look soon.

But can you give me a couple of paragraphs of description to help me navigate the patch, please?

@luantak
Author

luantak commented Mar 18, 2026

Previously, cpdf squeeze found duplicate objects in a fairly expensive way. It would hash objects, sort and group them, and then do direct comparisons on the candidates. For stream objects, those comparisons often meant pulling in the full stream data just to decide whether two objects were actually the same.
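The previous strategy can be sketched as follows. This is an illustrative Python sketch only (cpdf itself is written in OCaml, and `cheap_hash` and `full_equal` are hypothetical stand-ins): hash every object, sort and group by hash, then compare every candidate pair directly.

```python
from itertools import groupby

def find_duplicates_old(objects, cheap_hash, full_equal):
    """Return index pairs (i, j) whose objects compare fully equal."""
    hashed = sorted((cheap_hash(o), i) for i, o in enumerate(objects))
    dups = []
    for _, grp in groupby(hashed, key=lambda t: t[0]):
        idxs = [i for _, i in grp]
        # Every candidate pair is compared directly; for stream objects
        # this can mean materializing the full stream data each time.
        for a in range(len(idxs)):
            for b in range(a + 1, len(idxs)):
                if full_equal(objects[idxs[a]], objects[idxs[b]]):
                    dups.append((idxs[a], idxs[b]))
    return dups
```

The quadratic pairwise loop inside each group is where the expense concentrates when groups are broad.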

The new version makes that duplicate-checking process much more selective. Instead of going quickly from “maybe similar” to “compare the whole object,” it filters things in stages.

First, it puts objects into hash-table buckets using a cheap normalized representation, so only objects that already look alike are grouped together. Anything that ends up alone is discarded immediately.
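A minimal sketch of that first stage, assuming a `normalize` function that produces a cheap comparable key (the name is hypothetical; cpdf's actual normalization lives in the OCaml patch): bucket objects by key, then discard singleton buckets on the spot.

```python
from collections import defaultdict

def bucket_candidates(objects, normalize):
    """Group object indices by a cheap normalized key."""
    buckets = defaultdict(list)
    for i, obj in enumerate(objects):
        buckets[normalize(obj)].append(i)
    # A bucket of one can contain no duplicate: drop it immediately.
    return [idxs for idxs in buckets.values() if len(idxs) > 1]
```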

For streams, it then refines those groups again using stronger hashes based on both the stream metadata and the stream contents, with those content hashes cached.
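The caching of content hashes might look like the sketch below, where `read_stream` is a hypothetical callback that materializes a stream's bytes (the expensive step the cache exists to avoid repeating):

```python
import hashlib

class StreamHashCache:
    """Hash each stream's contents at most once, keyed by object id."""

    def __init__(self, read_stream):
        self._read = read_stream   # expensive: materializes stream bytes
        self._cache = {}

    def content_hash(self, obj_id):
        h = self._cache.get(obj_id)
        if h is None:
            h = hashlib.sha256(self._read(obj_id)).hexdigest()
            self._cache[obj_id] = h
        return h
```

With this in place, refining a bucket by content hash costs one read per stream rather than one read per comparison.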

After that, actual equality checks only happen inside these much smaller groups. And even there, stream objects are checked cheaply first by comparing normalized dictionaries and lengths before the code touches the full byte data.
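The cheap-before-expensive ordering described above can be sketched like this; the field names and the `load_bytes` callback are hypothetical, and only the final test touches actual stream data:

```python
def streams_equal(a, b, load_bytes):
    """Compare two stream objects, cheapest checks first."""
    if a["dict_norm"] != b["dict_norm"]:   # cheap: normalized dictionaries
        return False
    if a["length"] != b["length"]:         # cheap: declared lengths
        return False
    return load_bytes(a) == load_bytes(b)  # expensive: full byte compare
```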

In summary, duplicate detection no longer does expensive work on broad sets of objects. It narrows the field aggressively, rejects most non-matches early, and pays the full comparison cost only for the few objects that have already passed several cheap tests.

@johnwhitington
Owner

johnwhitington commented Mar 27, 2026

Manual testing shows big speed improvements on big files. Great! And all test outputs appear to open in PDF viewers. Manual inspection will be needed to see that they are fully valid.

But, with this patch, on Cpdf's (normal sized) test files:

$ time ./cpdftest -squeeze all >foo 2>&1

real	0m52.999s
user	0m51.539s
sys	0m0.898s

$ du -h PDFResults/
165M	PDFResults/squeeze
165M	PDFResults/

And with vanilla v2.9:

$ time ./cpdftest -squeeze all >foo 2>&1

real	0m25.396s
user	0m23.997s
sys	0m0.858s

$ du -h PDFResults/
184M	PDFResults/squeeze
184M	PDFResults/

So the time is doubled, but sizes improve by about 10%. These are two separate issues. I need to understand why the output file sizes are smaller, because that probably exposes a bug in the existing code. Then we can look at which of your new methods can be applied to improve large-file speeds without degrading speed on ordinary-sized files.

