
Improve squeeze efficiency for large files #440

Open
luantak wants to merge 2 commits into johnwhitington:master from luantak:squeeze_optimization

Conversation


@luantak luantak commented Mar 17, 2026

This makes the -squeeze and -squeeze-no-pagedata operations much faster for huge PDF files. There is a moderate speedup for smaller files.

May also improve file size a little on some files.

I can provide concrete numbers for my improvement claims if necessary.

I'm not sure if all of this code belongs in cpdf or if some of it should go in camlpdf.

luantak added 2 commits March 17, 2026 18:38
…rgeted cleanup

Use new hash tables to bucket normalized objects and cache stream-content hashes, reducing repeated comparisons and avoiding unnecessary stream materialization during duplicate detection.

Track rewritten page streams and only run unreferenced-object cleanup and follow-up deduplication when page rewrites or recompression actually changed the PDF.
@johnwhitington
Owner

Thanks. I'll take a look soon.

But can you give me a couple of paragraphs of description to help me navigate the patch, please?

@luantak
Author

luantak commented Mar 18, 2026

Previously, cpdf squeeze found duplicate objects in a fairly expensive way. It would hash objects, sort and group them, and then do direct comparisons on the candidates. For stream objects, those comparisons often meant pulling in the full stream data just to decide whether two objects were actually the same.
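The previous strategy can be sketched as follows. This is an illustrative Python sketch only (cpdf itself is written in OCaml, and `cheap_hash` and `full_equal` are hypothetical stand-ins): hash every object, sort and group by hash, then compare every candidate pair directly.

```python
from itertools import groupby

def find_duplicates_old(objects, cheap_hash, full_equal):
    """Return index pairs (i, j) whose objects compare fully equal."""
    hashed = sorted((cheap_hash(o), i) for i, o in enumerate(objects))
    dups = []
    for _, grp in groupby(hashed, key=lambda t: t[0]):
        idxs = [i for _, i in grp]
        # Every candidate pair is compared directly; for stream objects
        # this can mean materializing the full stream data each time.
        for a in range(len(idxs)):
            for b in range(a + 1, len(idxs)):
                if full_equal(objects[idxs[a]], objects[idxs[b]]):
                    dups.append((idxs[a], idxs[b]))
    return dups
```

The quadratic pairwise loop inside each group is where the expense concentrates when groups are broad.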

The new version makes that duplicate-checking process much more selective. Instead of going quickly from “maybe similar” to “compare the whole object,” it filters things in stages.

First, it puts objects into hash-table buckets using a cheap normalized representation, so only objects that already look alike are grouped together. Anything that ends up alone is discarded immediately.
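A minimal sketch of that first stage, assuming a `normalize` function that produces a cheap comparable key (the name is hypothetical; cpdf's actual normalization lives in the OCaml patch): bucket objects by key, then discard singleton buckets on the spot.

```python
from collections import defaultdict

def bucket_candidates(objects, normalize):
    """Group object indices by a cheap normalized key."""
    buckets = defaultdict(list)
    for i, obj in enumerate(objects):
        buckets[normalize(obj)].append(i)
    # A bucket of one can contain no duplicate: drop it immediately.
    return [idxs for idxs in buckets.values() if len(idxs) > 1]
```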

For streams, it then refines those groups again using stronger hashes based on both the stream metadata and the stream contents, with those content hashes cached.
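The caching of content hashes might look like the sketch below, where `read_stream` is a hypothetical callback that materializes a stream's bytes (the expensive step the cache exists to avoid repeating):

```python
import hashlib

class StreamHashCache:
    """Hash each stream's contents at most once, keyed by object id."""

    def __init__(self, read_stream):
        self._read = read_stream   # expensive: materializes stream bytes
        self._cache = {}

    def content_hash(self, obj_id):
        h = self._cache.get(obj_id)
        if h is None:
            h = hashlib.sha256(self._read(obj_id)).hexdigest()
            self._cache[obj_id] = h
        return h
```

With this in place, refining a bucket by content hash costs one read per stream rather than one read per comparison.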

After that, actual equality checks only happen inside these much smaller groups. And even there, stream objects are checked cheaply first by comparing normalized dictionaries and lengths before the code touches the full byte data.
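The cheap-before-expensive ordering described above can be sketched like this; the field names and the `load_bytes` callback are hypothetical, and only the final test touches actual stream data:

```python
def streams_equal(a, b, load_bytes):
    """Compare two stream objects, cheapest checks first."""
    if a["dict_norm"] != b["dict_norm"]:   # cheap: normalized dictionaries
        return False
    if a["length"] != b["length"]:         # cheap: declared lengths
        return False
    return load_bytes(a) == load_bytes(b)  # expensive: full byte compare
```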

In summary, duplicate detection no longer does expensive work on broad sets of objects. It narrows the field aggressively, rejects most non-matches early, and pays the full comparison cost only for the few objects that have already passed several cheap tests.

@johnwhitington
Owner

johnwhitington commented Mar 27, 2026

Manual testing shows big speed improvements on big files. Great! And all test outputs appear to open in PDF viewers. Manual inspection will be needed to see that they are fully valid.

But, with this patch, on Cpdf's (normal sized) test files:

$ time ./cpdftest -squeeze all >foo 2>&1

real	0m52.999s
user	0m51.539s
sys	0m0.898s

$ du -h PDFResults/
165M	PDFResults/squeeze
165M	PDFResults/

And with vanilla v2.9:

$ time ./cpdftest -squeeze all >foo 2>&1

real	0m25.396s
user	0m23.997s
sys	0m0.858s

$ du -h PDFResults/
184M	PDFResults/squeeze
184M	PDFResults/

So the time is doubled, but sizes improve by about 10%. These are two separate issues. I need to understand why the output file sizes are smaller, because that probably exposes a bug in the existing code. Then we can look at which of your new methods can be applied to improve large-file speeds without degrading speed on ordinary-sized files.

