Improve squeeze efficiency for large files #440
luantak wants to merge 2 commits into johnwhitington:master
…rgeted cleanup

Use new hash tables to bucket normalized objects and cache stream-content hashes, reducing repeated comparisons and avoiding unnecessary stream materialization during duplicate detection. Track rewritten page streams, and only run unreferenced-object cleanup and follow-up deduplication when page rewrites or recompression have actually changed the PDF.
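The stream-content hash cache described above can be sketched as follows. This is a minimal illustration, not the patch's actual code: the names `hash_cache`, `cached_stream_hash` and `bytes_of_stream` are hypothetical, and the real patch presumably keys the cache differently.

```ocaml
(* Sketch: cache a digest of each stream's bytes, keyed by object number,
   so the expensive stream materialization happens at most once per object.
   [Digest] is OCaml's stdlib MD5 module; all names here are illustrative. *)
let hash_cache : (int, Digest.t) Hashtbl.t = Hashtbl.create 256

let cached_stream_hash (objnum : int) (bytes_of_stream : int -> string) : Digest.t =
  match Hashtbl.find_opt hash_cache objnum with
  | Some h -> h                       (* already computed: no stream access *)
  | None ->
      let h = Digest.string (bytes_of_stream objnum) in
      Hashtbl.replace hash_cache objnum h;
      h
```

With such a cache, repeated candidate comparisons against the same stream object cost one table lookup instead of a fresh decode of the stream data.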
Thanks. I'll take a look soon. But can you give me a couple of paragraphs of description to help me navigate the patch, please?
Previously, cpdf squeeze found duplicate objects in a fairly expensive way. It would hash objects, sort and group them, and then do direct comparisons on the candidates. For stream objects, those comparisons often meant pulling in the full stream data just to decide whether two objects were actually the same.

The new version makes that duplicate-checking process much more selective. Instead of going quickly from "maybe similar" to "compare the whole object", it filters things in stages. First, it puts objects into hash-table buckets using a cheap normalized representation, so only objects that already look alike are grouped together. Anything that ends up alone in a bucket is discarded immediately. For streams, it then refines those groups again using stronger hashes based on both the stream metadata and the stream contents, with those content hashes cached. After that, actual equality checks only happen inside these much smaller groups. Even there, stream objects are checked cheaply first, by comparing normalized dictionaries and lengths, before the code touches the full byte data.

In summary, duplicate detection no longer does expensive work on broad sets of objects. It narrows the field aggressively, rejects most non-matches early, and only pays the full comparison cost for the few objects that have already passed several cheap tests.
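The staged filtering described above could be sketched like this. This is a hedged illustration, not the patch itself: `normalized_key`, `content_hash` and `bucket_by` are invented names standing in for whatever the real code uses, and the object type is left polymorphic.

```ocaml
(* Sketch of staged duplicate-candidate filtering. All names are
   illustrative, not the actual cpdf identifiers. *)

(* Bucket objects under a key function; drop singleton buckets, since an
   object alone in its bucket cannot have a duplicate. *)
let bucket_by size key_of objs =
  let tbl = Hashtbl.create size in
  List.iter
    (fun o ->
       let k = key_of o in
       let prev = Option.value ~default:[] (Hashtbl.find_opt tbl k) in
       Hashtbl.replace tbl k (o :: prev))
    objs;
  Hashtbl.fold
    (fun _ group acc -> match group with [_] -> acc | _ -> group :: acc)
    tbl []

(* Stage 1: cheap normalized keys. Stage 2: stronger (cached) content
   hashes, applied only within the groups that survive stage 1. Stage 3
   (not shown): full equality checks inside the small refined groups. *)
let candidate_groups normalized_key content_hash objs =
  bucket_by 256 normalized_key objs
  |> List.concat_map (bucket_by 64 content_hash)
```

The point of the pipeline shape is that each stage only ever sees objects that passed the previous, cheaper stage, so the expensive byte-level comparison in stage 3 runs on very few candidates.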
Manual testing shows big speed improvements on big files. Great! And all test outputs appear to open in PDF viewers; manual inspection will be needed to confirm that they are fully valid. But, with this patch, on cpdf's (normal-sized) test files: And with vanilla v2.9: So the time is doubled, but sizes improve by about 10%. These are two separate issues. I need to understand why the output file sizes are smaller, because that probably exposes a bug in the existing code. Then we can look at which of your new methods can be applied to improve large-file speeds without degrading speeds on ordinary-sized files.
This makes the short path of squeeze (-squeeze -squeeze-no-pagedata) much faster for huge PDF files. There is a moderate speedup for smaller files. It may also improve file size a little on some files.
I can provide concrete numbers for my improvement claims if necessary.
I'm not sure if all of this code belongs in cpdf or if some of it should go in camlpdf.