Skip to content

fix(table): handle HTML table captions correctly#640

Open
omda777 wants to merge 1 commit intoaccordproject:mainfrom
omda777:fix/html-table-capiton
Open

fix(table): handle HTML table captions correctly#640
omda777 wants to merge 1 commit intoaccordproject:mainfrom
omda777:fix/html-table-capiton

Conversation

@omda777
Copy link
Copy Markdown

@omda777 omda777 commented Jan 13, 2026

Closes #635

This PR fixes incorrect handling of HTML elements in tables during HTML → CommonMark deserialization.
It ensures captions are properly parsed, normalized, and emitted as Markdown content preceding the table, while also fixing edge cases related to table headers and whitespace normalization.

Screenshots or Video

Input (HTML with element):

<table>
  <caption>Employee Details</caption>
  <thead>
    <tr>
      <th>Name</th>
      <th>Role</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Alice</td>
      <td>Engineer</td>
    </tr>
    <tr>
      <td>Bob</td>
      <td>Designer</td>
    </tr>
  </tbody>
</table>

the output:
image

**Employee Details**

| Name | Role |
|---------|---------|
| Alice | Engineer |
| Bob | Designer |

Related Issues

Author Checklist

  • Ensure you provide a DCO sign-off for your commits using the --signoff option of git commit.
  • Vital features and changes captured in unit and/or integration tests
  • Commits messages follow AP format
  • Extend the documentation, if necessary
  • Merging to main from fork:branchname

Signed-off-by: Abdelrahman Emad <abdoemad0120@gmail.com>
@omda777
Copy link
Copy Markdown
Author

omda777 commented Feb 23, 2026

hi @mttrbrts can review PR

@mttrbrts
Copy link
Copy Markdown
Member

Thanks @omda777 for working on this caption handling fix! The approach looks good, but I noticed some issues to address:

Tests Required
The PR modifies significant deserialization logic but doesn't include any test coverage. Per repo guidelines, bug fixes must include regression tests. Please add tests for:

  • <caption> element deserialization
  • cleanTableNodes() whitespace normalization
  • Edge case: <tbody> with <th> cells and no <thead>
  • Tables without captions (ensure no regression)

Minor: Code Style
The cleanTableNodes function has inconsistent brace placement. Please use the repo's standard style (braces on same line as if/else).

Once tests are added, this should be good to merge. Thanks for contributing! 🙏


This comment was generated by AI on behalf of @mttrbrts.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes HTML → CommonMark deserialization for <table> elements that contain a <caption>, aiming to emit the caption as Markdown content preceding the table and to normalize table structure/whitespace to avoid invalid Markdown tables.

Changes:

  • Added <caption> handling and emitted it as a bold paragraph before the table.
  • Normalized table child handling (filtering to head/body, filtering rows/cells) and added heuristics to derive a header from the first body row when needed.
  • Added table-cell whitespace normalization via cleanTableNodes (softbreak removal + text merging).

Comment on lines +603 to +607
nodes: [
{
$class: `${CommonMarkModel.NAMESPACE}.Strong`,
nodes: cleanTableNodes(captionNode.nodes)
},
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrapping the entire caption content in a Strong node assumes the caption deserializes to inline content only, but HTML <caption> allows flow content (e.g., <p>, lists). If block nodes are produced here, they’ll become children of Strong, which is an invalid CommonMark structure and can break rendering/serialization. Consider normalizing caption content to inline-only text for the bolded title, or emitting the caption as one/more block nodes before the table without forcing Strong around arbitrary children.

Copilot uses AI. Check for mistakes.
Comment on lines 565 to +617
const TABLE_RULE = {
deserialize(el, next, ignoreSpace) {
if (el.tagName && el.tagName.toLowerCase() === 'table') {
return {
const children = next(el.childNodes, ignoreSpace);
const captionNode = children.find(c => c.type === 'caption');
let tableNodes = children.filter(node =>
node.$class === `${CommonMarkModel.NAMESPACE}.TableHead` ||
node.$class === `${CommonMarkModel.NAMESPACE}.TableBody`
);

let head = tableNodes.find(n => n.$class === `${CommonMarkModel.NAMESPACE}.TableHead`);
const body = tableNodes.find(n => n.$class === `${CommonMarkModel.NAMESPACE}.TableBody`);

if (!head && body && body.nodes && body.nodes.length > 0) {
const firstRow = body.nodes[0];
const hasHeaderCells = firstRow.nodes && firstRow.nodes.some(n => n.$class === `${CommonMarkModel.NAMESPACE}.HeaderCell`);

if (hasHeaderCells) {
head = {
$class: `${CommonMarkModel.NAMESPACE}.TableHead`,
nodes: [firstRow]
};
const newBody = {
$class: `${CommonMarkModel.NAMESPACE}.TableBody`,
nodes: body.nodes.slice(1)
};
tableNodes = [head, newBody];
}
}

const table = {
$class: `${CommonMarkModel.NAMESPACE}.Table`,
nodes: next(el.childNodes),
nodes: tableNodes,
};

if (captionNode) {
const captionParagraph = {
$class: `${CommonMarkModel.NAMESPACE}.Paragraph`,
nodes: [
{
$class: `${CommonMarkModel.NAMESPACE}.Strong`,
nodes: cleanTableNodes(captionNode.nodes)
},
{
$class: `${CommonMarkModel.NAMESPACE}.Text`,
text: '\n\n'
}
]
};
return [captionParagraph, table];
}

return table;
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are snapshot roundtrip tests for tables, but no coverage for HTML input containing <caption> (or the new header/body normalization paths). Please add a test that calls HtmlTransformer.toCiceroMark() on an HTML table with a caption and asserts the resulting CiceroMark/Markdown output (and ideally a roundtrip) so regressions in caption parsing/whitespace cleanup are caught.

Copilot uses AI. Check for mistakes.
Comment on lines +501 to +510
const CAPTION_RULE = {
deserialize(el, next, ignoreSpace) {
if (el.tagName && el.tagName.toLowerCase() === 'caption') {
return {
type: 'caption',
nodes: next(el.childNodes, ignoreSpace)
};
}
}
};
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CAPTION_RULE returns a plain object with type: 'caption' and no $class. If a <caption> appears outside of a <table> (invalid HTML, but possible input), this node will leak into the output DOM and likely break downstream visitors/serializers that expect only CommonMark/CiceroMark $class nodes. Consider handling <caption> directly inside the table branch (query the DOM children for <caption> and deserialize its contents there) or returning a valid CommonMark node and filtering it out safely.

Copilot uses AI. Check for mistakes.
Comment on lines +517 to +562
function cleanTableNodes(nodes) {
const NS = CommonMarkModel.NAMESPACE;
const TEXT = `${NS}.Text`;
const SOFT = `${NS}.Softbreak`;

if (!nodes) return [];
nodes = Array.isArray(nodes) ? nodes : [nodes];

const merged = nodes.reduce((acc, node) => {
if (!node) return acc;

let newNode = { ...node };
if (newNode.nodes) newNode = { ...newNode, nodes: cleanTableNodes(newNode.nodes) };

if (newNode.$class === SOFT) {
newNode = { $class: TEXT, text: ' ' };
}

const last = acc[acc.length - 1];
if (last && last.$class === TEXT && newNode.$class === TEXT)
{
last.text += newNode.text;
}
else
{
acc.push(newNode);
}

return acc;
}, []);

// Normalize whitespace inside Text nodes
merged.forEach(n => {
if (n.$class === TEXT) n.text = n.text.replace(/\s+/g, ' ');
});


if (merged.length > 0 && merged[0].$class === TEXT) {
merged[0].text = merged[0].text.replace(/^\s+/, '');
}
if (merged.length > 0 && merged[merged.length - 1].$class === TEXT) {
merged[merged.length - 1].text = merged[merged.length - 1].text.replace(/\s+$/, '');
}

return merged.filter(n => n.$class !== TEXT || n.text.length > 0);
}
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanTableNodes uses 2-space indentation and nonstandard brace placement in a file/repo configured for 4-space indentation (packages/markdown-html/.eslintrc.yml sets indent: [error, 4]). This will cause ESLint failures. Please reformat this function to match the established 4-space indentation and brace style used elsewhere in rules.js.

Copilot uses AI. Check for mistakes.
Comment on lines +607 to +610
},
{
$class: `${CommonMarkModel.NAMESPACE}.Text`,
text: '\n\n'
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caption emission builds a Paragraph that contains a Text node with literal \n\n. Newlines inside a paragraph are not the right way to separate block nodes and can lead to unexpected Markdown/HTML output (e.g., softbreaks inside the paragraph). Prefer returning the caption paragraph as its own block and let the block serializer handle spacing between the paragraph and the following table (or insert separation at the document/block level, not as inline text).

Suggested change
},
{
$class: `${CommonMarkModel.NAMESPACE}.Text`,
text: '\n\n'

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HTML Transform: Erronous HTML Table Parsing

3 participants