Fix duplicate Table/Chart sections in to_markdown_by_page #1741

Open
randerzander wants to merge 3 commits into NVIDIA:main from randerzander:markdown_fix

randerzander (Collaborator) commented Mar 27, 2026

When running the core pipeline on multimodal_test.pdf and calling to_markdown_by_page, tables and charts are duplicated in the Markdown output.

Claude's description:
When multiple chunks per page carry table/chart column data, _collect_page_record appended a section per chunk, producing 3× Table/Chart headers with identical content. _dedupe_blocks couldn't catch this because auto-incremented headers made identical blocks appear distinct.

Fix: deduplicate sections by content-only key (stripping the numeric header) before combining with text blocks, and filter out text blocks whose content is already covered by a labeled section.
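
For reviewers, the idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names (`content_key`, `dedupe_sections`, `filter_covered_text`) and the `### Table N` / `### Chart N` header pattern are assumptions made for the example, and the real `_collect_page_record` / `_dedupe_blocks` logic may differ.

```python
import re

# Assumed header shape: "### Table 1", "### Chart 2", ... followed by a newline.
_HEADER_RE = re.compile(r"^###\s+(Table|Chart)\s+\d+\s*\n", re.IGNORECASE)


def content_key(block: str) -> str:
    """Return a key that ignores the numbered header and whitespace differences."""
    body = _HEADER_RE.sub("", block.strip(), count=1)
    return " ".join(body.split())


def dedupe_sections(sections: list[str]) -> list[str]:
    """Keep the first section for each distinct content key (hypothetical helper)."""
    seen: set[str] = set()
    unique: list[str] = []
    for section in sections:
        key = content_key(section)
        if key and key not in seen:
            seen.add(key)
            unique.append(section)
    return unique


def filter_covered_text(text_blocks: list[str], sections: list[str]) -> list[str]:
    """Drop text blocks whose content already appears in a labeled section."""
    covered = {content_key(section) for section in sections}
    return [block for block in text_blocks if content_key(block) not in covered]


sections = [
    "### Table 1\n| a | b |",
    "### Table 2\n| a | b |",  # same content, different auto-incremented header
    "### Chart 3\n| x |",
]
unique_sections = dedupe_sections(sections)
remaining_text = filter_covered_text(["| a | b |", "intro paragraph"], unique_sections)
```

Note that this sketch keeps the original header of the first occurrence, so surviving sections are not renumbered; whether headers should be reassigned after deduplication is a separate design choice.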

When multiple chunks per page all carry table/chart column data,
_collect_page_record was appending a section for each chunk, producing
3x Table/Chart headers with identical content. _dedupe_blocks could not
catch this because auto-incremented headers (### Table 1, ### Table 2)
made otherwise-identical blocks appear distinct.

Fix: deduplicate sections by content-only key (stripping the numeric
header) before combining with text blocks, and filter out text blocks
whose content is already represented by a labeled section.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@randerzander randerzander requested review from a team as code owners March 27, 2026 17:54
# markdown formatted table from the first page
>>> chunks[1]["text"]
'| Table | 1 |\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |'
'| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |'
randerzander (Collaborator, Author) commented:

Note that the text surrounding the table (captured in the crop because it increases relevance for recall) is also included in the Markdown-formatted output.

TODO: exclude the surrounding text from the Markdown formatting.

@randerzander randerzander requested a review from jioffe502 March 27, 2026 19:50