Fix/20706 chonkie init doc#20713
Conversation
…nments Some chunkers depend on torch/sentence-transformers and fail to import in CI/minimal installs. Skip them when dependencies are unavailable so the integration test suite remains deterministic.
|
I reproduced the issue locally using the linked repro notebook. Root cause: Pydantic replaces This attaches the intended docstring to the generated constructor and adds a regression test to prevent future regressions. Happy to adjust if preferred approach differs! |
chonk-lain
left a comment
There was a problem hiding this comment.
this doesn't fix the current issue at hand, try applying the relevant change then try again.
| CHUNKER_INIT_DOC = f"""Initialize with a Chonkie chunker instance or create one if not provided. | ||
|
|
||
| Args: | ||
| chunker (Union[str, BaseChunker]): The chunker to use. Must be one of {CHUNKERS} or a chonkie chunker instance. | ||
| callback_manager (Optional[CallbackManager]): Callback manager for handling callbacks. | ||
| include_metadata (bool): Whether to include metadata in the nodes. | ||
| include_prev_next_rel (bool): Whether to include previous/next relationships. | ||
| id_func (Optional[Callable]): Function to generate node IDs. | ||
| **kwargs: Additional keyword arguments for the underlying Chonkie chunker. | ||
| """.strip() | ||
|
|
||
| # Pydantic replaces __init__ at runtime; attach docs to the generated constructor so help() is meaningful. | ||
| try: | ||
| Chunker.__init__.__doc__ = CHUNKER_INIT_DOC | ||
| except AttributeError: | ||
| pass No newline at end of file |
There was a problem hiding this comment.
move the docstring definition to line 52~53
as what you're testing against is my patch which is a working implementation i made on the fly.
AstraBert
left a comment
There was a problem hiding this comment.
This doesn't appear to be fixing the underlying cause of the docstring overwriting problem, so I would encourage to fix that rather than have a chonkie-specific fix
…reTextSplitter Pydantic replaces __init__ at runtime, causing subclass constructor docs to be overwritten by BaseModel docs. Restore the subclass docstring via __init_subclass__ so help() and introspection behave correctly. Fixes run-llama#20706
|
Thanks for the feedback. The minimal repro from the issue now works without any integration-specific patch. |
AstraBert
left a comment
There was a problem hiding this comment.
Looks like it's working! I just pushed a linting fix and once the CI is done I will merge 🙌
Restore subclass __init__ docstring using __pydantic_init__ without touching pydantic wrappers.
Thanks! I’ve pushed the updates addressing the core MetadataAwareTextSplitter behavior and added regression coverage. |
chonk-lain
left a comment
There was a problem hiding this comment.
@ObinnaIheanachor
the current implementation needs some revision, the logic is off the target.
i haven't dug too deep into the current issue at hand, but my gut feeling is saying that the issue lies with DispatcherSpanMixin and not with the MetadataAwareTextSplitter, this is a hunch based on some of my earlier findings which you can find in #20622 (comment) (i un-cloppased the section for easier read) also this is not confirmed hunch so take it as just a guiding lead that might or might not be spot on
this is a too low level issue with how python functions between what help and inspect are retrieving the docs and finding out where things started to go wrong is essential as this touches really close to the fundamentals of the python programming language and the current code base.
...ndex-integrations/node_parser/llama-index-node-parser-chonkie/tests/test_chunker_init_doc.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
the second approach does indeed work ✅
kudos to @ObinnaIheanachor for the fix 🙌
also now for the tests you are using pydoc and not inspect which is a valid choice for the tests here (under the hood help calls in pydoc)
also cc @AstraBert for extra viz
|
Mmhh tests are failing @ObinnaIheanachor: Looks like a minor capitalization error |
|
@AstraBert I've normalized expected strings to lowercase to match pydoc.render_doc() output across Python versions and avoid formatting/casing inconsistencies. |
Description
MetadataAwareTextSplitter subclasses lose their constructor docstring because Pydantic replaces init at runtime.
As a result, help(Subclass.init) shows the BaseModel docs:
"Create a new model by parsing and validating input data from keyword arguments."
This affects all integrations, not just Chonkie.
Fixes #20706
New Package?
Did I fill in the
tool.llamahubsection in thepyproject.tomland provide a detailed README.md for my new integration or package?Version Bump?
Did I bump the version in the
pyproject.tomlfile of the package I am updating? (Except for thellama-index-corepackage)Type of Change
Please delete options that are not relevant.
How Has This Been Tested?
Added core regression test: test_metadataaware_init_doc.py
Converted Chonkie test into a smoke test
Verified repro from [Bug]: documentation override for classes inheriting from MetadataAwareTextSplitter #20706 now passes
I added new unit tests to cover this change
I believe this change is already covered by existing unit tests
Suggested Checklist: