doc processor: Markdown skip header-only chunks by Akrog · Pull Request #48 · lightspeed-core/rag-content

Akrog · 2025-11-21T09:48:03Z

Description

When processing documents using the markdown chunker we may end up with chunks that don't really have any useful information, as in having only headers.

For example, the following text:

# Configuration

## cinder.conf

Blah, blah

Will result in 2 chunks:

# Configuration

and

## cinder.conf

Blah, blah

With the first chunk being useless, moreover considering the second chunk has this in its metadata (that is used for the embedding):

header_path: /Configuration

In this patch we change how we process markdown doc types and remove chunks that have no real content because they only have headers or only have space related characters.

Type of change

New feature: LCORE-1491

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Summary by CodeRabbit

New Features
- Improved content detection to recognize meaningful text (excluding markdown that contains only headers or structural elements).
Refactor
- Validation and filtering logic moved to instance-level checks for clearer behavior; log message clarified when skipping invalid content.
Tests
- Expanded unit tests covering plain text, markdown, header-only cases, and node-filtering behavior for more reliable validation.

coderabbitai · 2025-11-21T09:48:19Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Replaces classmethods with instance methods in the document processor, adds a static _got_content helper and an instance _valid_text_node for markdown-aware content detection, updates filtering and log messages, and expands unit tests to cover the new validation logic.

Changes

Cohort / File(s)	Summary
Document processor `src/lightspeed_rag_content/document_processor.py`	Converted `_filter_out_invalid_nodes` and `_split_and_filter` from `@classmethod` to instance methods; added `@staticmethod _got_content(text: str)` and instance method `_valid_text_node(self, text: str)` implementing markdown header/content checks; updated filtering to use `self._valid_text_node` and changed debug log to "Skipping invalid node".
Unit tests `tests/test_document_processor_llama_index.py`	Updated imports to include `Node` and `TextNode`; added tests for `_valid_text_node` (plain vs. markdown, header-only cases), parameterized `_got_content` cases, and reworked `_filter_out_invalid_nodes` tests to use mocks and patched `_valid_text_node` behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'doc processor: Markdown skip header-only chunks' accurately summarizes the main change: filtering out header-only chunks during markdown document processing.
Docstring Coverage	✅ Passed	Docstring coverage is 92.31% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

src/lightspeed_rag_content/document_processor.py (1)

102-111: LGTM - Validation correctly updated to use new logic.

The conversion to an instance method and delegation to _valid_text_node is correct. The generic log message is appropriate given that validation now includes multiple conditions (whitespace and header checks).

Optional: Consider logging more details about why a node is invalid (e.g., "no whitespace", "only headers") to aid debugging, but this is a minor enhancement.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b854493 and dfd3085.

📒 Files selected for processing (2)

src/lightspeed_rag_content/document_processor.py (1 hunks)
tests/test_document_processor_llama_index.py (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

tests/test_document_processor_llama_index.py (1)

src/lightspeed_rag_content/document_processor.py (3)

_valid_text_node (96-100)

_got_non_headers (88-94)

_filter_out_invalid_nodes (102-111)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: Pylinter
GitHub Check: build-and-push-dev
GitHub Check: Konflux kflux-prd-rh02 / rag-content-on-pull-request

🔇 Additional comments (7)

src/lightspeed_rag_content/document_processor.py (3)

87-94: LGTM - Clean header detection logic.

The implementation correctly identifies non-header content in markdown by checking if any non-empty line doesn't start with "#". The early return optimization is good.

Note: Code blocks containing lines starting with "#" will be treated as headers, but this is acceptable given the goal of filtering header-only chunks.

96-100: LGTM - Correct validation logic for markdown chunks.

The validation correctly enforces that markdown chunks must contain non-header content in addition to whitespace. The early return when markdown has only headers is efficient.

This implementation aligns with the PR objective to filter header-only and whitespace-only chunks specifically for markdown documents.

113-116: LGTM - Method correctly converted to instance method.

The conversion from classmethod to instance method is necessary and correctly implemented. All call sites (lines 155 and 357) are properly updated to invoke this as an instance method.

tests/test_document_processor_llama_index.py (4)

22-22: LGTM - Import correctly updated for test requirements.

The addition of Node import is necessary to properly spec the non-TextNode mock in test__filter_out_invalid_nodes, allowing the test to verify that only TextNode instances are validated.

75-120: LGTM - Comprehensive test coverage for validation logic.

These tests thoroughly cover the _valid_text_node method behavior across different document types and content scenarios:

Plain doc type correctly checks only whitespace

Markdown doc type correctly checks both non-headers and whitespace

Early return for header-only markdown is properly tested

The mock assertions ensure the correct methods are called in the right sequence.

122-152: LGTM - Excellent parametrized test coverage with edge cases.

These parametrized tests provide comprehensive coverage of the _got_non_headers method:

Content detection tests verify various patterns of headers mixed with content

Header-only tests cover important edge cases: leading/trailing whitespace, whitespace-only text, multiple headers, and headers without spaces

The edge case coverage (lines 143-147) is particularly thorough, including whitespace-only text and headers with various spacing patterns.

154-175: LGTM - Test correctly verifies updated filtering behavior.

The test properly validates that:

Only TextNode instances are checked for validity via _valid_text_node

Non-TextNode instances are filtered out without validation

The validation results correctly determine which nodes are kept

The use of side_effect to return different validation results for different nodes is an effective testing approach.

syedriko · 2026-03-17T14:39:18Z

src/lightspeed_rag_content/document_processor.py

+
+    def _valid_text_node(self, text: str) -> bool:
+        """Check if text node is valid: has whitespace and has content."""
+        if self.config.doc_type == "markdown" and not self._got_non_headers(text):


@Akrog I'm wondering if we could come up with a non-negative name for _got_non_headers, may be _got_content(), and avoid the double negative in this line?

That's a lot better, will change it.

syedriko · 2026-03-17T17:11:01Z

LGTM

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/test_document_processor_llama_index.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Fix Black formatting issues.

The CI pipeline reports that this file requires reformatting by Black. Run uv tool run black --write . to fix the formatting.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_document_processor_llama_index.py` at line 1, The file is failing
Black formatting; run the project's formatter to reformat
tests/test_document_processor_llama_index.py (and the repo) by executing the
formatter command (uv tool run black --write .), review the resulting diff, and
commit the reformatted file so the CI Black check passes.

🧹 Nitpick comments (1)

tests/test_document_processor_llama_index.py (1)
131-152: Consider renaming test methods to match the method under test.

The test methods are named test__got_non_headers_* but they test the _got_content method. Consider renaming for clarity:

test__got_non_headers_with_content → test__got_content_with_content

test__got_non_headers_only_headers → test__got_content_only_headers
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_document_processor_llama_index.py` around lines 131 - 152, Rename
the two test functions so their names reflect the method under test
(_got_content): change test__got_non_headers_with_content to
test__got_content_with_content and change test__got_non_headers_only_headers to
test__got_content_only_headers; update any references (e.g., in test discovery
or decorators) to the old names and keep the test bodies and parametrization
intact so they continue asserting db._got_content(text) behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@tests/test_document_processor_llama_index.py`:
- Line 1: The file is failing Black formatting; run the project's formatter to
reformat tests/test_document_processor_llama_index.py (and the repo) by
executing the formatter command (uv tool run black --write .), review the
resulting diff, and commit the reformatted file so the CI Black check passes.

---

Nitpick comments:
In `@tests/test_document_processor_llama_index.py`:
- Around line 131-152: Rename the two test functions so their names reflect the
method under test (_got_content): change test__got_non_headers_with_content to
test__got_content_with_content and change test__got_non_headers_only_headers to
test__got_content_only_headers; update any references (e.g., in test discovery
or decorators) to the old names and keep the test bodies and parametrization
intact so they continue asserting db._got_content(text) behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2b3fb08e-5103-4a18-8e62-7365700dc898

📥 Commits

Reviewing files that changed from the base of the PR and between dfd3085 and f13c5d4.

📒 Files selected for processing (2)

src/lightspeed_rag_content/document_processor.py
tests/test_document_processor_llama_index.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_rag_content/document_processor.py`:
- Around line 107-111: The _valid_text_node method currently skips header-only
checks for only "markdown" docs; update it to treat HTML the same way by
including "html" in the conditional (i.e., if self.config.doc_type in
("markdown", "html") and not self._got_content(text): return False) so that
header-only chunks are filtered for both Markdown and HTML; keep the final
return as return self._got_whitespace(text). This change targets the
_valid_text_node method and uses existing helpers _got_content and
_got_whitespace to maintain parity with MarkdownNodeParser behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4b0d8455-e6fa-4254-91f6-89bacb3a68fb

📥 Commits

Reviewing files that changed from the base of the PR and between f13c5d4 and af3a158.

📒 Files selected for processing (2)

src/lightspeed_rag_content/document_processor.py
tests/test_document_processor_llama_index.py

🚧 Files skipped from review as they are similar to previous changes (1)

tests/test_document_processor_llama_index.py

src/lightspeed_rag_content/document_processor.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_rag_content/document_processor.py`:
- Around line 99-105: _got_content currently only skips ATX headers (lines
starting with "#") so Setext-style headers (a title line followed by a line of
=== or ---) are treated as content; update _got_content to iterate with an index
(or pairwise) so when encountering a non-empty line you also inspect the
following line: if the following line matches a Setext underline pattern (e.g.,
r'^[=-]{2,}\s*$') then treat the current line+underline as a header and continue
scanning, otherwise treat the current line as content and return True; keep
referencing the existing _got_content function and ensure both the title and
underline are ignored when determining header-only chunks.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 58f2d148-f09b-4540-8c39-cabe455f3507

📥 Commits

Reviewing files that changed from the base of the PR and between af3a158 and d179ec6.

📒 Files selected for processing (2)

src/lightspeed_rag_content/document_processor.py
tests/test_document_processor_llama_index.py

src/lightspeed_rag_content/document_processor.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_rag_content/document_processor.py`:
- Around line 148-152: The Setext-header detection incorrectly treats a blank
next line as a header because all(...) over an empty string returns True; update
the condition in the loop that inspects split_text using the index i to first
compute next_line = split_text[i+1].strip() and require it to be non-empty
(e.g., if next_line and all(char in ("=", "-") for char in next_line):) before
skipping (i += 2); this ensures only lines consisting solely of '=' or '-' are
treated as Setext underline markers and prevents dropping valid chunks.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6a8a43c5-1861-4684-8a7e-b00742d32edd

📥 Commits

Reviewing files that changed from the base of the PR and between d179ec6 and d8376ab.

📒 Files selected for processing (2)

src/lightspeed_rag_content/document_processor.py
tests/test_document_processor_llama_index.py

src/lightspeed_rag_content/document_processor.py

When processing documents using the markdown chunker we may end up with chunks that don't really have any useful information, as in having only headers. For example, the following text: # Configuration ## cinder.conf Blah, blah Will result in 2 chunks: # Configuration and ## cinder.conf Blah, blah With the first chunk being useless, moreover considering the second chunk has this in its metadata (that is used for the embedding): header_path: /Configuration In this patch we change how we process markdown and html doc types and remove chunks that have no real content. We consider chunks as empty when they only contain: - Empty lines - Empty code blocks - Horizontal rules - ATX header lines - Setext header lines Closes: LCORE-1491

coderabbitai bot reviewed Nov 21, 2025

View reviewed changes

syedriko reviewed Mar 17, 2026

View reviewed changes

Akrog changed the title ~~doc processor: Markdown skip only header chunks~~ doc processor: Markdown skip header-only chunks Mar 17, 2026

Akrog force-pushed the markdown-drop-only-headers-chunks branch from dfd3085 to f13c5d4 Compare March 17, 2026 17:08

coderabbitai bot reviewed Mar 17, 2026

View reviewed changes

Akrog force-pushed the markdown-drop-only-headers-chunks branch from f13c5d4 to af3a158 Compare March 17, 2026 17:13

coderabbitai bot reviewed Mar 17, 2026

View reviewed changes

src/lightspeed_rag_content/document_processor.py Show resolved Hide resolved

Akrog force-pushed the markdown-drop-only-headers-chunks branch from af3a158 to d179ec6 Compare March 18, 2026 09:44

coderabbitai bot reviewed Mar 18, 2026

View reviewed changes

src/lightspeed_rag_content/document_processor.py Show resolved Hide resolved

Akrog force-pushed the markdown-drop-only-headers-chunks branch from d179ec6 to d8376ab Compare March 18, 2026 13:50

coderabbitai bot reviewed Mar 18, 2026

View reviewed changes

src/lightspeed_rag_content/document_processor.py Outdated Show resolved Hide resolved

Akrog force-pushed the markdown-drop-only-headers-chunks branch from d8376ab to 94f6797 Compare March 18, 2026 13:57

Akrog force-pushed the markdown-drop-only-headers-chunks branch from 94f6797 to 3a261ef Compare March 18, 2026 14:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

doc processor: Markdown skip header-only chunks#48

doc processor: Markdown skip header-only chunks#48
Akrog wants to merge 1 commit intolightspeed-core:mainfrom
Akrog:markdown-drop-only-headers-chunks

Akrog commented Nov 21, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 21, 2025 •

edited

Loading

Reviews paused

Uh oh!

coderabbitai bot left a comment

Uh oh!

syedriko Mar 17, 2026

Uh oh!

Akrog Mar 17, 2026

Uh oh!

syedriko commented Mar 17, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Akrog commented Nov 21, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Checklist before requesting a review

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

syedriko Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

Akrog Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

syedriko commented Mar 17, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Akrog commented Nov 21, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 21, 2025 •

edited

Loading