fix(table): handle HTML table captions correctly#640
fix(table): handle HTML table captions correctly#640omda777 wants to merge 1 commit intoaccordproject:mainfrom
Conversation
Signed-off-by: Abdelrahman Emad <abdoemad0120@gmail.com>
|
hi @mttrbrts can review PR |
|
Thanks @omda777 for working on this caption handling fix! The approach looks good, but I noticed some issues to address: Tests Required
Minor: Code Style Once tests are added, this should be good to merge. Thanks for contributing! 🙏 This comment was generated by AI on behalf of @mttrbrts. |
There was a problem hiding this comment.
Pull request overview
Fixes HTML → CommonMark deserialization for <table> elements that contain a <caption>, aiming to emit the caption as Markdown content preceding the table and to normalize table structure/whitespace to avoid invalid Markdown tables.
Changes:
- Added
<caption>handling and emitted it as a bold paragraph before the table. - Normalized table child handling (filtering to head/body, filtering rows/cells) and added heuristics to derive a header from the first body row when needed.
- Added table-cell whitespace normalization via
cleanTableNodes(softbreak removal + text merging).
| nodes: [ | ||
| { | ||
| $class: `${CommonMarkModel.NAMESPACE}.Strong`, | ||
| nodes: cleanTableNodes(captionNode.nodes) | ||
| }, |
There was a problem hiding this comment.
Wrapping the entire caption content in a Strong node assumes the caption deserializes to inline content only, but HTML <caption> allows flow content (e.g., <p>, lists). If block nodes are produced here, they’ll become children of Strong, which is an invalid CommonMark structure and can break rendering/serialization. Consider normalizing caption content to inline-only text for the bolded title, or emitting the caption as one/more block nodes before the table without forcing Strong around arbitrary children.
| const TABLE_RULE = { | ||
| deserialize(el, next, ignoreSpace) { | ||
| if (el.tagName && el.tagName.toLowerCase() === 'table') { | ||
| return { | ||
| const children = next(el.childNodes, ignoreSpace); | ||
| const captionNode = children.find(c => c.type === 'caption'); | ||
| let tableNodes = children.filter(node => | ||
| node.$class === `${CommonMarkModel.NAMESPACE}.TableHead` || | ||
| node.$class === `${CommonMarkModel.NAMESPACE}.TableBody` | ||
| ); | ||
|
|
||
| let head = tableNodes.find(n => n.$class === `${CommonMarkModel.NAMESPACE}.TableHead`); | ||
| const body = tableNodes.find(n => n.$class === `${CommonMarkModel.NAMESPACE}.TableBody`); | ||
|
|
||
| if (!head && body && body.nodes && body.nodes.length > 0) { | ||
| const firstRow = body.nodes[0]; | ||
| const hasHeaderCells = firstRow.nodes && firstRow.nodes.some(n => n.$class === `${CommonMarkModel.NAMESPACE}.HeaderCell`); | ||
|
|
||
| if (hasHeaderCells) { | ||
| head = { | ||
| $class: `${CommonMarkModel.NAMESPACE}.TableHead`, | ||
| nodes: [firstRow] | ||
| }; | ||
| const newBody = { | ||
| $class: `${CommonMarkModel.NAMESPACE}.TableBody`, | ||
| nodes: body.nodes.slice(1) | ||
| }; | ||
| tableNodes = [head, newBody]; | ||
| } | ||
| } | ||
|
|
||
| const table = { | ||
| $class: `${CommonMarkModel.NAMESPACE}.Table`, | ||
| nodes: next(el.childNodes), | ||
| nodes: tableNodes, | ||
| }; | ||
|
|
||
| if (captionNode) { | ||
| const captionParagraph = { | ||
| $class: `${CommonMarkModel.NAMESPACE}.Paragraph`, | ||
| nodes: [ | ||
| { | ||
| $class: `${CommonMarkModel.NAMESPACE}.Strong`, | ||
| nodes: cleanTableNodes(captionNode.nodes) | ||
| }, | ||
| { | ||
| $class: `${CommonMarkModel.NAMESPACE}.Text`, | ||
| text: '\n\n' | ||
| } | ||
| ] | ||
| }; | ||
| return [captionParagraph, table]; | ||
| } | ||
|
|
||
| return table; |
There was a problem hiding this comment.
There are snapshot roundtrip tests for tables, but no coverage for HTML input containing <caption> (or the new header/body normalization paths). Please add a test that calls HtmlTransformer.toCiceroMark() on an HTML table with a caption and asserts the resulting CiceroMark/Markdown output (and ideally a roundtrip) so regressions in caption parsing/whitespace cleanup are caught.
| const CAPTION_RULE = { | ||
| deserialize(el, next, ignoreSpace) { | ||
| if (el.tagName && el.tagName.toLowerCase() === 'caption') { | ||
| return { | ||
| type: 'caption', | ||
| nodes: next(el.childNodes, ignoreSpace) | ||
| }; | ||
| } | ||
| } | ||
| }; |
There was a problem hiding this comment.
CAPTION_RULE returns a plain object with type: 'caption' and no $class. If a <caption> appears outside of a <table> (invalid HTML, but possible input), this node will leak into the output DOM and likely break downstream visitors/serializers that expect only CommonMark/CiceroMark $class nodes. Consider handling <caption> directly inside the table branch (query the DOM children for <caption> and deserialize its contents there) or returning a valid CommonMark node and filtering it out safely.
| function cleanTableNodes(nodes) { | ||
| const NS = CommonMarkModel.NAMESPACE; | ||
| const TEXT = `${NS}.Text`; | ||
| const SOFT = `${NS}.Softbreak`; | ||
|
|
||
| if (!nodes) return []; | ||
| nodes = Array.isArray(nodes) ? nodes : [nodes]; | ||
|
|
||
| const merged = nodes.reduce((acc, node) => { | ||
| if (!node) return acc; | ||
|
|
||
| let newNode = { ...node }; | ||
| if (newNode.nodes) newNode = { ...newNode, nodes: cleanTableNodes(newNode.nodes) }; | ||
|
|
||
| if (newNode.$class === SOFT) { | ||
| newNode = { $class: TEXT, text: ' ' }; | ||
| } | ||
|
|
||
| const last = acc[acc.length - 1]; | ||
| if (last && last.$class === TEXT && newNode.$class === TEXT) | ||
| { | ||
| last.text += newNode.text; | ||
| } | ||
| else | ||
| { | ||
| acc.push(newNode); | ||
| } | ||
|
|
||
| return acc; | ||
| }, []); | ||
|
|
||
| // Normalize whitespace inside Text nodes | ||
| merged.forEach(n => { | ||
| if (n.$class === TEXT) n.text = n.text.replace(/\s+/g, ' '); | ||
| }); | ||
|
|
||
|
|
||
| if (merged.length > 0 && merged[0].$class === TEXT) { | ||
| merged[0].text = merged[0].text.replace(/^\s+/, ''); | ||
| } | ||
| if (merged.length > 0 && merged[merged.length - 1].$class === TEXT) { | ||
| merged[merged.length - 1].text = merged[merged.length - 1].text.replace(/\s+$/, ''); | ||
| } | ||
|
|
||
| return merged.filter(n => n.$class !== TEXT || n.text.length > 0); | ||
| } |
There was a problem hiding this comment.
cleanTableNodes uses 2-space indentation and nonstandard brace placement in a file/repo configured for 4-space indentation (packages/markdown-html/.eslintrc.yml sets indent: [error, 4]). This will cause ESLint failures. Please reformat this function to match the established 4-space indentation and brace style used elsewhere in rules.js.
| }, | ||
| { | ||
| $class: `${CommonMarkModel.NAMESPACE}.Text`, | ||
| text: '\n\n' |
There was a problem hiding this comment.
The caption emission builds a Paragraph that contains a Text node with literal \n\n. Newlines inside a paragraph are not the right way to separate block nodes and can lead to unexpected Markdown/HTML output (e.g., softbreaks inside the paragraph). Prefer returning the caption paragraph as its own block and let the block serializer handle spacing between the paragraph and the following table (or insert separation at the document/block level, not as inline text).
| }, | |
| { | |
| $class: `${CommonMarkModel.NAMESPACE}.Text`, | |
| text: '\n\n' |
Closes #635
This PR fixes incorrect handling of HTML elements in tables during HTML → CommonMark deserialization.
It ensures captions are properly parsed, normalized, and emitted as Markdown content preceding the table, while also fixing edge cases related to table headers and whitespace normalization.
Screenshots or Video
Input (HTML with element):
the output:

Related Issues
Author Checklist
--signoffoption of git commit.mainfromfork:branchname