fix(table): handle HTML table captions correctly by omda777 · Pull Request #640 · accordproject/markdown-transform

omda777 · 2026-01-13T11:01:56Z

Closes #635

This PR fixes incorrect handling of HTML elements in tables during HTML → CommonMark deserialization.
It ensures captions are properly parsed, normalized, and emitted as Markdown content preceding the table, while also fixing edge cases related to table headers and whitespace normalization.

Screenshots or Video

Input (HTML with element):

<table>
  <caption>Employee Details</caption>
  <thead>
    <tr>
      <th>Name</th>
      <th>Role</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Alice</td>
      <td>Engineer</td>
    </tr>
    <tr>
      <td>Bob</td>
      <td>Designer</td>
    </tr>
  </tbody>
</table>

the output:

**Employee Details**

| Name | Role |
|---------|---------|
| Alice | Engineer |
| Bob | Designer |

Related Issues

Issue HTML Transform: Erronous HTML Table Parsing #635

Author Checklist

Ensure you provide a DCO sign-off for your commits using the --signoff option of git commit.
Vital features and changes captured in unit and/or integration tests
Commits messages follow AP format
Extend the documentation, if necessary
Merging to main from fork:branchname

Signed-off-by: Abdelrahman Emad <abdoemad0120@gmail.com>

omda777 · 2026-02-23T02:00:07Z

hi @mttrbrts can review PR

mttrbrts · 2026-04-26T19:05:48Z

Thanks @omda777 for working on this caption handling fix! The approach looks good, but I noticed some issues to address:

Tests Required
The PR modifies significant deserialization logic but doesn't include any test coverage. Per repo guidelines, bug fixes must include regression tests. Please add tests for:

<caption> element deserialization
cleanTableNodes() whitespace normalization
Edge case: <tbody> with <th> cells and no <thead>
Tables without captions (ensure no regression)

Minor: Code Style
The cleanTableNodes function has inconsistent brace placement. Please use the repo's standard style (braces on same line as if/else).

Once tests are added, this should be good to merge. Thanks for contributing! 🙏

This comment was generated by AI on behalf of @mttrbrts.

Copilot

Pull request overview

Fixes HTML → CommonMark deserialization for <table> elements that contain a <caption>, aiming to emit the caption as Markdown content preceding the table and to normalize table structure/whitespace to avoid invalid Markdown tables.

Changes:

Added <caption> handling and emitted it as a bold paragraph before the table.
Normalized table child handling (filtering to head/body, filtering rows/cells) and added heuristics to derive a header from the first body row when needed.
Added table-cell whitespace normalization via cleanTableNodes (softbreak removal + text merging).

Copilot · 2026-04-26T21:44:15Z

+                    nodes: [
+                        {
+                            $class: `${CommonMarkModel.NAMESPACE}.Strong`,
+                            nodes: cleanTableNodes(captionNode.nodes)
+                        },


Wrapping the entire caption content in a Strong node assumes the caption deserializes to inline content only, but HTML <caption> allows flow content (e.g., <p>, lists). If block nodes are produced here, they’ll become children of Strong, which is an invalid CommonMark structure and can break rendering/serialization. Consider normalizing caption content to inline-only text for the bolded title, or emitting the caption as one/more block nodes before the table without forcing Strong around arbitrary children.

Copilot · 2026-04-26T21:44:15Z

 const TABLE_RULE = {
    deserialize(el, next, ignoreSpace) {
        if (el.tagName && el.tagName.toLowerCase() === 'table') {
-            return {
+            const children = next(el.childNodes, ignoreSpace);
+            const captionNode = children.find(c => c.type === 'caption');
+            let tableNodes = children.filter(node =>
+                node.$class === `${CommonMarkModel.NAMESPACE}.TableHead` ||
+                node.$class === `${CommonMarkModel.NAMESPACE}.TableBody`
+            );
+
+            let head = tableNodes.find(n => n.$class === `${CommonMarkModel.NAMESPACE}.TableHead`);
+            const body = tableNodes.find(n => n.$class === `${CommonMarkModel.NAMESPACE}.TableBody`);
+
+            if (!head && body && body.nodes && body.nodes.length > 0) {
+                const firstRow = body.nodes[0];
+                const hasHeaderCells = firstRow.nodes && firstRow.nodes.some(n => n.$class === `${CommonMarkModel.NAMESPACE}.HeaderCell`);
+
+                if (hasHeaderCells) {
+                    head = {
+                        $class: `${CommonMarkModel.NAMESPACE}.TableHead`,
+                        nodes: [firstRow]
+                    };
+                    const newBody = {
+                        $class: `${CommonMarkModel.NAMESPACE}.TableBody`,
+                        nodes: body.nodes.slice(1)
+                    };
+                    tableNodes = [head, newBody];
+                }
+            }
+
+            const table = {
                $class: `${CommonMarkModel.NAMESPACE}.Table`,
-                nodes: next(el.childNodes),
+                nodes: tableNodes,
            };
+
+            if (captionNode) {
+                const captionParagraph = {
+                    $class: `${CommonMarkModel.NAMESPACE}.Paragraph`,
+                    nodes: [
+                        {
+                            $class: `${CommonMarkModel.NAMESPACE}.Strong`,
+                            nodes: cleanTableNodes(captionNode.nodes)
+                        },
+                        {
+                            $class: `${CommonMarkModel.NAMESPACE}.Text`,
+                            text: '\n\n'
+                        }
+                    ]
+                };
+                return [captionParagraph, table];
+            }
+
+            return table;


There are snapshot roundtrip tests for tables, but no coverage for HTML input containing <caption> (or the new header/body normalization paths). Please add a test that calls HtmlTransformer.toCiceroMark() on an HTML table with a caption and asserts the resulting CiceroMark/Markdown output (and ideally a roundtrip) so regressions in caption parsing/whitespace cleanup are caught.

Copilot · 2026-04-26T21:44:15Z

+const CAPTION_RULE = {
+    deserialize(el, next, ignoreSpace) {
+        if (el.tagName && el.tagName.toLowerCase() === 'caption') {
+            return {
+                type: 'caption',
+                nodes: next(el.childNodes, ignoreSpace)
+            };
+        }
+    }
+};


CAPTION_RULE returns a plain object with type: 'caption' and no $class. If a <caption> appears outside of a <table> (invalid HTML, but possible input), this node will leak into the output DOM and likely break downstream visitors/serializers that expect only CommonMark/CiceroMark $class nodes. Consider handling <caption> directly inside the table branch (query the DOM children for <caption> and deserialize its contents there) or returning a valid CommonMark node and filtering it out safely.

Copilot · 2026-04-26T21:44:15Z

+function cleanTableNodes(nodes) {
+  const NS = CommonMarkModel.NAMESPACE;
+  const TEXT = `${NS}.Text`;
+  const SOFT = `${NS}.Softbreak`;
+
+  if (!nodes) return [];
+  nodes = Array.isArray(nodes) ? nodes : [nodes];
+
+  const merged = nodes.reduce((acc, node) => {
+    if (!node) return acc;
+
+    let newNode = { ...node };
+    if (newNode.nodes) newNode = { ...newNode, nodes: cleanTableNodes(newNode.nodes) };
+
+    if (newNode.$class === SOFT) {
+      newNode = { $class: TEXT, text: ' ' };
+    }
+
+    const last = acc[acc.length - 1];
+    if (last && last.$class === TEXT && newNode.$class === TEXT) 
+    {
+      last.text += newNode.text;
+    } 
+    else
+    {
+      acc.push(newNode);
+    }
+
+    return acc;
+  }, []);
+
+  // Normalize whitespace inside Text nodes
+  merged.forEach(n => {
+    if (n.$class === TEXT) n.text = n.text.replace(/\s+/g, ' ');
+  });
+
+
+  if (merged.length > 0 && merged[0].$class === TEXT) {
+    merged[0].text = merged[0].text.replace(/^\s+/, '');
+  }
+  if (merged.length > 0 && merged[merged.length - 1].$class === TEXT) {
+    merged[merged.length - 1].text = merged[merged.length - 1].text.replace(/\s+$/, '');
+  }
+
+  return merged.filter(n => n.$class !== TEXT || n.text.length > 0);
+}


cleanTableNodes uses 2-space indentation and nonstandard brace placement in a file/repo configured for 4-space indentation (packages/markdown-html/.eslintrc.yml sets indent: [error, 4]). This will cause ESLint failures. Please reformat this function to match the established 4-space indentation and brace style used elsewhere in rules.js.

Copilot · 2026-04-26T21:44:15Z

+                        },
+                        {
+                            $class: `${CommonMarkModel.NAMESPACE}.Text`,
+                            text: '\n\n'


The caption emission builds a Paragraph that contains a Text node with literal \n\n. Newlines inside a paragraph are not the right way to separate block nodes and can lead to unexpected Markdown/HTML output (e.g., softbreaks inside the paragraph). Prefer returning the caption paragraph as its own block and let the block serializer handle spacing between the paragraph and the following table (or insert separation at the document/block level, not as inline text).

Suggested change

},

{

$class: `${CommonMarkModel.NAMESPACE}.Text`,

text: '\n\n'

fix(table): handle HTML table captions correctly

1a73ef9

Signed-off-by: Abdelrahman Emad <abdoemad0120@gmail.com>

mttrbrts requested a review from Copilot April 26, 2026 21:39

Copilot started reviewing on behalf of mttrbrts April 26, 2026 21:39 View session

Copilot AI reviewed Apr 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(table): handle HTML table captions correctly#640

fix(table): handle HTML table captions correctly#640
omda777 wants to merge 1 commit intoaccordproject:mainfrom
omda777:fix/html-table-capiton

omda777 commented Jan 13, 2026

Uh oh!

omda777 commented Feb 23, 2026

Uh oh!

mttrbrts commented Apr 26, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 26, 2026

Uh oh!

Copilot AI Apr 26, 2026

Uh oh!

Copilot AI Apr 26, 2026

Uh oh!

Copilot AI Apr 26, 2026

Uh oh!

Copilot AI Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

omda777 commented Jan 13, 2026

Closes #635

Screenshots or Video

Related Issues

Author Checklist

Uh oh!

omda777 commented Feb 23, 2026

Uh oh!

mttrbrts commented Apr 26, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants