doc processor: Markdown skip header-only chunks#48
doc processor: Markdown skip header-only chunks#48Akrog wants to merge 1 commit intolightspeed-core:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughReplaces classmethods with instance methods in the document processor, adds a static Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/lightspeed_rag_content/document_processor.py (1)
102-111: LGTM - Validation correctly updated to use new logic.The conversion to an instance method and delegation to
_valid_text_nodeis correct. The generic log message is appropriate given that validation now includes multiple conditions (whitespace and header checks).Optional: Consider logging more details about why a node is invalid (e.g., "no whitespace", "only headers") to aid debugging, but this is a minor enhancement.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/lightspeed_rag_content/document_processor.py(1 hunks)tests/test_document_processor_llama_index.py(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_document_processor_llama_index.py (1)
src/lightspeed_rag_content/document_processor.py (3)
_valid_text_node(96-100)_got_non_headers(88-94)_filter_out_invalid_nodes(102-111)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Pylinter
- GitHub Check: build-and-push-dev
- GitHub Check: Konflux kflux-prd-rh02 / rag-content-on-pull-request
🔇 Additional comments (7)
src/lightspeed_rag_content/document_processor.py (3)
87-94: LGTM - Clean header detection logic.The implementation correctly identifies non-header content in markdown by checking if any non-empty line doesn't start with "#". The early return optimization is good.
Note: Code blocks containing lines starting with "#" will be treated as headers, but this is acceptable given the goal of filtering header-only chunks.
96-100: LGTM - Correct validation logic for markdown chunks.The validation correctly enforces that markdown chunks must contain non-header content in addition to whitespace. The early return when markdown has only headers is efficient.
This implementation aligns with the PR objective to filter header-only and whitespace-only chunks specifically for markdown documents.
113-116: LGTM - Method correctly converted to instance method.The conversion from classmethod to instance method is necessary and correctly implemented. All call sites (lines 155 and 357) are properly updated to invoke this as an instance method.
tests/test_document_processor_llama_index.py (4)
22-22: LGTM - Import correctly updated for test requirements.The addition of
Nodeimport is necessary to properly spec the non-TextNode mock intest__filter_out_invalid_nodes, allowing the test to verify that only TextNode instances are validated.
75-120: LGTM - Comprehensive test coverage for validation logic.These tests thoroughly cover the
_valid_text_nodemethod behavior across different document types and content scenarios:
- Plain doc type correctly checks only whitespace
- Markdown doc type correctly checks both non-headers and whitespace
- Early return for header-only markdown is properly tested
The mock assertions ensure the correct methods are called in the right sequence.
122-152: LGTM - Excellent parametrized test coverage with edge cases.These parametrized tests provide comprehensive coverage of the
_got_non_headersmethod:
- Content detection tests verify various patterns of headers mixed with content
- Header-only tests cover important edge cases: leading/trailing whitespace, whitespace-only text, multiple headers, and headers without spaces
The edge case coverage (lines 143-147) is particularly thorough, including whitespace-only text and headers with various spacing patterns.
154-175: LGTM - Test correctly verifies updated filtering behavior.The test properly validates that:
- Only
TextNodeinstances are checked for validity via_valid_text_node- Non-TextNode instances are filtered out without validation
- The validation results correctly determine which nodes are kept
The use of
side_effectto return different validation results for different nodes is an effective testing approach.
|
|
||
| def _valid_text_node(self, text: str) -> bool: | ||
| """Check if text node is valid: has whitespace and has content.""" | ||
| if self.config.doc_type == "markdown" and not self._got_non_headers(text): |
There was a problem hiding this comment.
@Akrog I'm wondering if we could come up with a non-negative name for _got_non_headers, may be _got_content(), and avoid the double negative in this line?
There was a problem hiding this comment.
That's a lot better, will change it.
dfd3085 to
f13c5d4
Compare
|
LGTM |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_document_processor_llama_index.py (1)
1-1:⚠️ Potential issue | 🟡 MinorFix Black formatting issues.
The CI pipeline reports that this file requires reformatting by Black. Run
uv tool run black --write .to fix the formatting.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_document_processor_llama_index.py` at line 1, The file is failing Black formatting; run the project's formatter to reformat tests/test_document_processor_llama_index.py (and the repo) by executing the formatter command (uv tool run black --write .), review the resulting diff, and commit the reformatted file so the CI Black check passes.
🧹 Nitpick comments (1)
tests/test_document_processor_llama_index.py (1)
131-152: Consider renaming test methods to match the method under test.The test methods are named
test__got_non_headers_*but they test the_got_contentmethod. Consider renaming for clarity:
test__got_non_headers_with_content→test__got_content_with_contenttest__got_non_headers_only_headers→test__got_content_only_headers🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_document_processor_llama_index.py` around lines 131 - 152, Rename the two test functions so their names reflect the method under test (_got_content): change test__got_non_headers_with_content to test__got_content_with_content and change test__got_non_headers_only_headers to test__got_content_only_headers; update any references (e.g., in test discovery or decorators) to the old names and keep the test bodies and parametrization intact so they continue asserting db._got_content(text) behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tests/test_document_processor_llama_index.py`:
- Line 1: The file is failing Black formatting; run the project's formatter to
reformat tests/test_document_processor_llama_index.py (and the repo) by
executing the formatter command (uv tool run black --write .), review the
resulting diff, and commit the reformatted file so the CI Black check passes.
---
Nitpick comments:
In `@tests/test_document_processor_llama_index.py`:
- Around line 131-152: Rename the two test functions so their names reflect the
method under test (_got_content): change test__got_non_headers_with_content to
test__got_content_with_content and change test__got_non_headers_only_headers to
test__got_content_only_headers; update any references (e.g., in test discovery
or decorators) to the old names and keep the test bodies and parametrization
intact so they continue asserting db._got_content(text) behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2b3fb08e-5103-4a18-8e62-7365700dc898
📒 Files selected for processing (2)
src/lightspeed_rag_content/document_processor.pytests/test_document_processor_llama_index.py
f13c5d4 to
af3a158
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_rag_content/document_processor.py`:
- Around line 107-111: The _valid_text_node method currently skips header-only
checks for only "markdown" docs; update it to treat HTML the same way by
including "html" in the conditional (i.e., if self.config.doc_type in
("markdown", "html") and not self._got_content(text): return False) so that
header-only chunks are filtered for both Markdown and HTML; keep the final
return as return self._got_whitespace(text). This change targets the
_valid_text_node method and uses existing helpers _got_content and
_got_whitespace to maintain parity with MarkdownNodeParser behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4b0d8455-e6fa-4254-91f6-89bacb3a68fb
📒 Files selected for processing (2)
src/lightspeed_rag_content/document_processor.pytests/test_document_processor_llama_index.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_document_processor_llama_index.py
af3a158 to
d179ec6
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_rag_content/document_processor.py`:
- Around line 99-105: _got_content currently only skips ATX headers (lines
starting with "#") so Setext-style headers (a title line followed by a line of
=== or ---) are treated as content; update _got_content to iterate with an index
(or pairwise) so when encountering a non-empty line you also inspect the
following line: if the following line matches a Setext underline pattern (e.g.,
r'^[=-]{2,}\s*$') then treat the current line+underline as a header and continue
scanning, otherwise treat the current line as content and return True; keep
referencing the existing _got_content function and ensure both the title and
underline are ignored when determining header-only chunks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 58f2d148-f09b-4540-8c39-cabe455f3507
📒 Files selected for processing (2)
src/lightspeed_rag_content/document_processor.pytests/test_document_processor_llama_index.py
d179ec6 to
d8376ab
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_rag_content/document_processor.py`:
- Around line 148-152: The Setext-header detection incorrectly treats a blank
next line as a header because all(...) over an empty string returns True; update
the condition in the loop that inspects split_text using the index i to first
compute next_line = split_text[i+1].strip() and require it to be non-empty
(e.g., if next_line and all(char in ("=", "-") for char in next_line):) before
skipping (i += 2); this ensures only lines consisting solely of '=' or '-' are
treated as Setext underline markers and prevents dropping valid chunks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6a8a43c5-1861-4684-8a7e-b00742d32edd
📒 Files selected for processing (2)
src/lightspeed_rag_content/document_processor.pytests/test_document_processor_llama_index.py
d8376ab to
94f6797
Compare
When processing documents using the markdown chunker we may end up with
chunks that don't really have any useful information, as in having only
headers.
For example, the following text:
# Configuration
## cinder.conf
Blah, blah
Will result in 2 chunks:
# Configuration
and
## cinder.conf
Blah, blah
With the first chunk being useless, moreover considering the second
chunk has this in its metadata (that is used for the embedding):
header_path: /Configuration
In this patch we change how we process markdown and html doc types and
remove chunks that have no real content. We consider chunks as empty
when they only contain:
- Empty lines
- Empty code blocks
- Horizontal rules
- ATX header lines
- Setext header lines
Closes: LCORE-1491
94f6797 to
3a261ef
Compare
Description
When processing documents using the markdown chunker we may end up with chunks that don't really have any useful information, as in having only headers.
For example, the following text:
Will result in 2 chunks:
and
With the first chunk being useless, moreover considering the second chunk has this in its metadata (that is used for the embedding):
In this patch we change how we process markdown doc types and remove chunks that have no real content because they only have headers or only have space related characters.
Type of change
Checklist before requesting a review
Summary by CodeRabbit
New Features
Refactor
Tests