feat: add gRPC protobuf definitions and conversion utility #546
krickert wants to merge 7 commits into docling-project:main
✅ DCO Check Passed. Thanks @krickert, all your commits are properly signed off. 🎉
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates
Waiting for:
This rule is failing. When test data is updated, we require two reviewers
🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: e233312 Signed-off-by: Kristian Rickert <krickert@gmail.com>
Docling Team, I have updated this implementation and moved the core definitions and mapping into this repository. I debated keeping the mapping in docling-serve, but I feel this is the best place for it since it's so model-specific. Specifically, I’ve moved the Protobuf definitions and the Pydantic-to-Proto conversion logic into
I am currently running this implementation through a stress test of 80k+ PDFs over the coming week to verify stability at scale. I would love to start a discussion on how we can expand this into native streaming functionality, and in the meantime I’d be happy to contribute language-specific tutorials. Looking forward to your feedback... you built an incredible product and I'm happy to contribute.
This PR is tied to docling-project/docling-serve#504: this one holds the model definition and mapping, while the other adds the gRPC server option.
Bump.. is this enough to start a review? Just an initial pass.. I can convert to draft if we need a few rounds of discussion. On my side, I'm getting ready to test this against a large corpus from Common Crawl. @dolfim-ibm
@dolfim-ibm I updated the latest commit with main to keep it up to date. The latest protobufs were synced (the gRPC server was working without them, but the new model changes have been added and properly mapped).
Pushed the changes to docling-project/docling-serve/pull/504 as these two are tightly coupled. |
Apply design review of DoclingDocument proto against the Pydantic source of truth and lock the parity contract in PARITY.md.

Proto changes:
- PictureItem: make self_ref required and parent optional, matching the Pydantic shape and other DocItem subclasses.
- CodeItem: inline TextItemBase fields directly instead of nesting a base wrapper, since CodeItem overrides the meta field with FloatingMeta and the wrapper would force two coexisting meta fields.
- Formatting: add script_raw fallback so unrecognized Script enum values round-trip as strings, matching the policy used for label, picture_class, code_language, and modality.
- TrackSource: drop the redundant kind field; the Pydantic Literal["track"] discriminator is already represented by the SourceType oneof tag in proto.

Conversion utility updated to populate the inlined CodeItem fields, emit script_raw on unrecognized scripts, and stop writing TrackSource.kind.

PARITY.md documents intentional Pydantic vs proto differences, including computed fields surfaced for JSON parity (TableData.grid) and Pydantic-only discriminators absorbed by oneof tags.

Pre-release wire stability is explicitly out of scope until the gRPC PR ships, so renames are still permitted; this is documented in the sync procedure on the docling-serve side.
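The script_raw fallback named in the commit message above could look roughly like the sketch below. This is not the PR's code: `Script`, `FormattingMsg`, `encode_script`, and `decode_script` are hypothetical stand-ins built on the standard library, and the string values ("baseline", "sub", "super") are assumed. The idea is only that a known value maps to an enum member, while an unknown value is preserved verbatim in a side field so it survives a round trip.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Script(Enum):
    """Hypothetical stand-in for the proto Script enum."""
    SCRIPT_UNSPECIFIED = 0
    SCRIPT_BASELINE = 1
    SCRIPT_SUB = 2
    SCRIPT_SUPER = 3


# Assumed mapping from Pydantic-side string values to enum members.
_SCRIPT_BY_VALUE = {
    "baseline": Script.SCRIPT_BASELINE,
    "sub": Script.SCRIPT_SUB,
    "super": Script.SCRIPT_SUPER,
}
_VALUE_BY_SCRIPT = {member: value for value, member in _SCRIPT_BY_VALUE.items()}


@dataclass
class FormattingMsg:
    """Hypothetical stand-in for the proto Formatting message."""
    script: Script = Script.SCRIPT_UNSPECIFIED
    script_raw: Optional[str] = None  # set only when the value is unrecognized


def encode_script(value: str) -> FormattingMsg:
    """Map a string to the enum, keeping unknown values in script_raw."""
    member = _SCRIPT_BY_VALUE.get(value)
    if member is None:
        return FormattingMsg(script=Script.SCRIPT_UNSPECIFIED, script_raw=value)
    return FormattingMsg(script=member)


def decode_script(msg: FormattingMsg) -> str:
    """Recover the original string; the raw fallback takes precedence."""
    if msg.script_raw is not None:
        return msg.script_raw
    # Assumption: an unspecified enum with no raw value decodes to "baseline".
    return _VALUE_BY_SCRIPT.get(msg.script, "baseline")
```

With this shape, `decode_script(encode_script(v))` returns `v` for recognized and unrecognized values alike, which is the round-trip property the commit message describes for label, picture_class, code_language, and modality as well.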
Update: proto parity tightening and sync with
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: f2c8145 Signed-off-by: Kristian Rickert <krickert@gmail.com>
I've made a repository with examples demonstrating how to run Here: ai-pipestream/docling-grpc-examples
feat: add gRPC protobuf definitions and Pydantic conversion utility

Description

This PR introduces official Protocol Buffer definitions for the DoclingDocument model and a high-performance conversion utility to map between Docling's Pydantic models and Protobuf representations.

By moving the Protobuf source of truth into docling-core, we enable:

Key Changes

1. Protobuf Definitions (/proto)
- ai/docling/core/v1/docling_document.proto.
- DoclingDocument Pydantic model, including:
  - field_regions, field_items, field_heading, and field_value.

2. Conversion Utility (docling_core/utils/conversion.py)
- docling_document_to_proto: A surgical, field-by-field mapper.
- google.protobuf.Struct for custom metadata.
- DocItemLabel, GroupLabel, and CoordOrigin.

3. Tooling & Dependencies
- protobuf as a core dependency.
- grpcio-tools to the dev dependency group for local development.
- scripts/gen_proto.py to automate code generation using uv.
- buf linting and formatting standards.

Validation Performed

Unit Tests
- test/test_proto_conversion.py to verify:
  - _root_).

Integration Testing (via docling-serve)
- docling-serve gRPC suite.
- docling-serve startup schema validator, ensuring 100% parity between Pydantic and Proto schemas.

Related Issues/PRs
- docling-serve PR fix(Doclang): fix image URI serialization #504 (feat: Grpc native converter).

Protocol buffer integration:
- docling_core/proto/__init__.py to centralize imports for DoclingDocument protocol buffer definitions and conversion utilities.
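The description mentions a startup schema validator that checks parity between the Pydantic and Proto schemas. Its implementation is not shown here, so the following is only a minimal sketch of the general idea under stated assumptions: the two dataclasses are hypothetical stand-ins for a Pydantic model and a proto message, their field names (apart from kind) are invented for illustration, and `parity_gaps` is a made-up helper. It compares field names and accepts an allowlist for intentional differences, such as a discriminator that a oneof tag already covers.

```python
from dataclasses import dataclass, fields
from typing import FrozenSet, Set, Tuple


@dataclass
class PydanticLikeTrackSource:
    """Hypothetical stand-in for the Pydantic-side model."""
    kind: str = "track"  # discriminator that exists only on the Pydantic side
    start_time: float = 0.0
    end_time: float = 0.0


@dataclass
class ProtoLikeTrackSource:
    """Hypothetical stand-in for the proto-side message."""
    start_time: float = 0.0
    end_time: float = 0.0


def parity_gaps(
    model: type,
    message: type,
    allowed_model_only: FrozenSet[str] = frozenset(),
) -> Tuple[Set[str], Set[str]]:
    """Return (fields missing from the message, fields missing from the model).

    Names listed in allowed_model_only are treated as intentional
    differences and are not reported.
    """
    model_fields = {f.name for f in fields(model)}
    message_fields = {f.name for f in fields(message)}
    missing_in_message = model_fields - message_fields - set(allowed_model_only)
    missing_in_model = message_fields - model_fields
    return missing_in_message, missing_in_model
```

A startup check along these lines would fail fast when either set is non-empty, and the allowlist would play the role that documented, intentional differences play in PARITY.md.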