Skip to content

doc processor: Markdown skip header-only chunks#48

Open
Akrog wants to merge 1 commit intolightspeed-core:mainfrom
Akrog:markdown-drop-only-headers-chunks
Open

doc processor: Markdown skip header-only chunks#48
Akrog wants to merge 1 commit intolightspeed-core:mainfrom
Akrog:markdown-drop-only-headers-chunks

Conversation

@Akrog
Copy link
Contributor

@Akrog Akrog commented Nov 21, 2025

Description

When processing documents using the markdown chunker we may end up with chunks that don't really have any useful information, as in having only headers.

For example, the following text:

# Configuration

## cinder.conf

Blah, blah

Will result in 2 chunks:

# Configuration

and

## cinder.conf

Blah, blah

With the first chunk being useless, moreover considering the second chunk has this in its metadata (that is used for the embedding):

header_path: /Configuration

In this patch we change how we process markdown doc types and remove chunks that have no real content because they only have headers or only have space related characters.

Type of change

  • New feature: LCORE-1491

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Summary by CodeRabbit

  • New Features

    • Improved content detection to recognize meaningful text (excluding markdown that contains only headers or structural elements).
  • Refactor

    • Validation and filtering logic moved to instance-level checks for clearer behavior; log message clarified when skipping invalid content.
  • Tests

    • Expanded unit tests covering plain text, markdown, header-only cases, and node-filtering behavior for more reliable validation.

@coderabbitai
Copy link

coderabbitai bot commented Nov 21, 2025

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Replaces classmethods with instance methods in the document processor, adds a static _got_content helper and an instance _valid_text_node for markdown-aware content detection, updates filtering and log messages, and expands unit tests to cover the new validation logic.

Changes

Cohort / File(s) Summary
Document processor
src/lightspeed_rag_content/document_processor.py
Converted _filter_out_invalid_nodes and _split_and_filter from @classmethod to instance methods; added @staticmethod _got_content(text: str) and instance method _valid_text_node(self, text: str) implementing markdown header/content checks; updated filtering to use self._valid_text_node and changed debug log to "Skipping invalid node".
Unit tests
tests/test_document_processor_llama_index.py
Updated imports to include Node and TextNode; added tests for _valid_text_node (plain vs. markdown, header-only cases), parameterized _got_content cases, and reworked _filter_out_invalid_nodes tests to use mocks and patched _valid_text_node behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'doc processor: Markdown skip header-only chunks' accurately summarizes the main change: filtering out header-only chunks during markdown document processing.
Docstring Coverage ✅ Passed Docstring coverage is 92.31% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/lightspeed_rag_content/document_processor.py (1)

102-111: LGTM - Validation correctly updated to use new logic.

The conversion to an instance method and delegation to _valid_text_node is correct. The generic log message is appropriate given that validation now includes multiple conditions (whitespace and header checks).

Optional: Consider logging more details about why a node is invalid (e.g., "no whitespace", "only headers") to aid debugging, but this is a minor enhancement.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b854493 and dfd3085.

📒 Files selected for processing (2)
  • src/lightspeed_rag_content/document_processor.py (1 hunks)
  • tests/test_document_processor_llama_index.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_document_processor_llama_index.py (1)
src/lightspeed_rag_content/document_processor.py (3)
  • _valid_text_node (96-100)
  • _got_non_headers (88-94)
  • _filter_out_invalid_nodes (102-111)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Pylinter
  • GitHub Check: build-and-push-dev
  • GitHub Check: Konflux kflux-prd-rh02 / rag-content-on-pull-request
🔇 Additional comments (7)
src/lightspeed_rag_content/document_processor.py (3)

87-94: LGTM - Clean header detection logic.

The implementation correctly identifies non-header content in markdown by checking if any non-empty line doesn't start with "#". The early return optimization is good.

Note: Code blocks containing lines starting with "#" will be treated as headers, but this is acceptable given the goal of filtering header-only chunks.


96-100: LGTM - Correct validation logic for markdown chunks.

The validation correctly enforces that markdown chunks must contain non-header content in addition to whitespace. The early return when markdown has only headers is efficient.

This implementation aligns with the PR objective to filter header-only and whitespace-only chunks specifically for markdown documents.


113-116: LGTM - Method correctly converted to instance method.

The conversion from classmethod to instance method is necessary and correctly implemented. All call sites (lines 155 and 357) are properly updated to invoke this as an instance method.

tests/test_document_processor_llama_index.py (4)

22-22: LGTM - Import correctly updated for test requirements.

The addition of Node import is necessary to properly spec the non-TextNode mock in test__filter_out_invalid_nodes, allowing the test to verify that only TextNode instances are validated.


75-120: LGTM - Comprehensive test coverage for validation logic.

These tests thoroughly cover the _valid_text_node method behavior across different document types and content scenarios:

  • Plain doc type correctly checks only whitespace
  • Markdown doc type correctly checks both non-headers and whitespace
  • Early return for header-only markdown is properly tested

The mock assertions ensure the correct methods are called in the right sequence.


122-152: LGTM - Excellent parametrized test coverage with edge cases.

These parametrized tests provide comprehensive coverage of the _got_non_headers method:

  • Content detection tests verify various patterns of headers mixed with content
  • Header-only tests cover important edge cases: leading/trailing whitespace, whitespace-only text, multiple headers, and headers without spaces

The edge case coverage (lines 143-147) is particularly thorough, including whitespace-only text and headers with various spacing patterns.


154-175: LGTM - Test correctly verifies updated filtering behavior.

The test properly validates that:

  • Only TextNode instances are checked for validity via _valid_text_node
  • Non-TextNode instances are filtered out without validation
  • The validation results correctly determine which nodes are kept

The use of side_effect to return different validation results for different nodes is an effective testing approach.


def _valid_text_node(self, text: str) -> bool:
"""Check if text node is valid: has whitespace and has content."""
if self.config.doc_type == "markdown" and not self._got_non_headers(text):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Akrog I'm wondering if we could come up with a non-negative name for _got_non_headers, may be _got_content(), and avoid the double negative in this line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lot better, will change it.

@Akrog Akrog changed the title doc processor: Markdown skip only header chunks doc processor: Markdown skip header-only chunks Mar 17, 2026
@Akrog Akrog force-pushed the markdown-drop-only-headers-chunks branch from dfd3085 to f13c5d4 Compare March 17, 2026 17:08
@syedriko
Copy link
Contributor

LGTM

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_document_processor_llama_index.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Fix Black formatting issues.

The CI pipeline reports that this file requires reformatting by Black. Run uv tool run black --write . to fix the formatting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_document_processor_llama_index.py` at line 1, The file is failing
Black formatting; run the project's formatter to reformat
tests/test_document_processor_llama_index.py (and the repo) by executing the
formatter command (uv tool run black --write .), review the resulting diff, and
commit the reformatted file so the CI Black check passes.
🧹 Nitpick comments (1)
tests/test_document_processor_llama_index.py (1)

131-152: Consider renaming test methods to match the method under test.

The test methods are named test__got_non_headers_* but they test the _got_content method. Consider renaming for clarity:

  • test__got_non_headers_with_contenttest__got_content_with_content
  • test__got_non_headers_only_headerstest__got_content_only_headers
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_document_processor_llama_index.py` around lines 131 - 152, Rename
the two test functions so their names reflect the method under test
(_got_content): change test__got_non_headers_with_content to
test__got_content_with_content and change test__got_non_headers_only_headers to
test__got_content_only_headers; update any references (e.g., in test discovery
or decorators) to the old names and keep the test bodies and parametrization
intact so they continue asserting db._got_content(text) behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@tests/test_document_processor_llama_index.py`:
- Line 1: The file is failing Black formatting; run the project's formatter to
reformat tests/test_document_processor_llama_index.py (and the repo) by
executing the formatter command (uv tool run black --write .), review the
resulting diff, and commit the reformatted file so the CI Black check passes.

---

Nitpick comments:
In `@tests/test_document_processor_llama_index.py`:
- Around line 131-152: Rename the two test functions so their names reflect the
method under test (_got_content): change test__got_non_headers_with_content to
test__got_content_with_content and change test__got_non_headers_only_headers to
test__got_content_only_headers; update any references (e.g., in test discovery
or decorators) to the old names and keep the test bodies and parametrization
intact so they continue asserting db._got_content(text) behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2b3fb08e-5103-4a18-8e62-7365700dc898

📥 Commits

Reviewing files that changed from the base of the PR and between dfd3085 and f13c5d4.

📒 Files selected for processing (2)
  • src/lightspeed_rag_content/document_processor.py
  • tests/test_document_processor_llama_index.py

@Akrog Akrog force-pushed the markdown-drop-only-headers-chunks branch from f13c5d4 to af3a158 Compare March 17, 2026 17:13
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_rag_content/document_processor.py`:
- Around line 107-111: The _valid_text_node method currently skips header-only
checks for only "markdown" docs; update it to treat HTML the same way by
including "html" in the conditional (i.e., if self.config.doc_type in
("markdown", "html") and not self._got_content(text): return False) so that
header-only chunks are filtered for both Markdown and HTML; keep the final
return as return self._got_whitespace(text). This change targets the
_valid_text_node method and uses existing helpers _got_content and
_got_whitespace to maintain parity with MarkdownNodeParser behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4b0d8455-e6fa-4254-91f6-89bacb3a68fb

📥 Commits

Reviewing files that changed from the base of the PR and between f13c5d4 and af3a158.

📒 Files selected for processing (2)
  • src/lightspeed_rag_content/document_processor.py
  • tests/test_document_processor_llama_index.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_document_processor_llama_index.py

@Akrog Akrog force-pushed the markdown-drop-only-headers-chunks branch from af3a158 to d179ec6 Compare March 18, 2026 09:44
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_rag_content/document_processor.py`:
- Around line 99-105: _got_content currently only skips ATX headers (lines
starting with "#") so Setext-style headers (a title line followed by a line of
=== or ---) are treated as content; update _got_content to iterate with an index
(or pairwise) so when encountering a non-empty line you also inspect the
following line: if the following line matches a Setext underline pattern (e.g.,
r'^[=-]{2,}\s*$') then treat the current line+underline as a header and continue
scanning, otherwise treat the current line as content and return True; keep
referencing the existing _got_content function and ensure both the title and
underline are ignored when determining header-only chunks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 58f2d148-f09b-4540-8c39-cabe455f3507

📥 Commits

Reviewing files that changed from the base of the PR and between af3a158 and d179ec6.

📒 Files selected for processing (2)
  • src/lightspeed_rag_content/document_processor.py
  • tests/test_document_processor_llama_index.py

@Akrog Akrog force-pushed the markdown-drop-only-headers-chunks branch from d179ec6 to d8376ab Compare March 18, 2026 13:50
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_rag_content/document_processor.py`:
- Around line 148-152: The Setext-header detection incorrectly treats a blank
next line as a header because all(...) over an empty string returns True; update
the condition in the loop that inspects split_text using the index i to first
compute next_line = split_text[i+1].strip() and require it to be non-empty
(e.g., if next_line and all(char in ("=", "-") for char in next_line):) before
skipping (i += 2); this ensures only lines consisting solely of '=' or '-' are
treated as Setext underline markers and prevents dropping valid chunks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6a8a43c5-1861-4684-8a7e-b00742d32edd

📥 Commits

Reviewing files that changed from the base of the PR and between d179ec6 and d8376ab.

📒 Files selected for processing (2)
  • src/lightspeed_rag_content/document_processor.py
  • tests/test_document_processor_llama_index.py

@Akrog Akrog force-pushed the markdown-drop-only-headers-chunks branch from d8376ab to 94f6797 Compare March 18, 2026 13:57
When processing documents using the markdown chunker we may end up with
chunks that don't really have any useful information, as in having only
headers.

For example, the following text:

    # Configuration

    ## cinder.conf

    Blah, blah

Will result in 2 chunks:

    # Configuration

and

    ## cinder.conf

    Blah, blah

With the first chunk being useless, moreover considering the second
chunk has this in its metadata (that is used for the embedding):

    header_path: /Configuration

In this patch we change how we process markdown and html doc types and
remove chunks that have no real content. We consider chunks as empty
when they only contain:

- Empty lines
- Empty code blocks
- Horizontal rules
- ATX header lines
- Setext header lines

Closes: LCORE-1491
@Akrog Akrog force-pushed the markdown-drop-only-headers-chunks branch from 94f6797 to 3a261ef Compare March 18, 2026 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants