Skip to content

Comments

Fix/20706 chonkie init doc#20713

Open
ObinnaIheanachor wants to merge 16 commits intorun-llama:mainfrom
ObinnaIheanachor:fix/20706-chonkie-init-doc
Open

Fix/20706 chonkie init doc#20713
ObinnaIheanachor wants to merge 16 commits intorun-llama:mainfrom
ObinnaIheanachor:fix/20706-chonkie-init-doc

Conversation

@ObinnaIheanachor
Copy link

@ObinnaIheanachor ObinnaIheanachor commented Feb 17, 2026

Description

MetadataAwareTextSplitter subclasses lose their constructor docstring because Pydantic replaces init at runtime.
As a result, help(Subclass.init) shows the BaseModel docs:

"Create a new model by parsing and validating input data from keyword arguments."

This affects all integrations, not just Chonkie.

Fixes #20706

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

…nments

Some chunkers depend on torch/sentence-transformers and fail to import
in CI/minimal installs. Skip them when dependencies are unavailable so
the integration test suite remains deterministic.
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Feb 17, 2026
@ObinnaIheanachor
Copy link
Author

I reproduced the issue locally using the linked repro notebook.

Root cause: Pydantic replaces __init__ at runtime, which overwrites the integration constructor docstring.

This attaches the intended docstring to the generated constructor and adds a regression test to prevent future regressions.

Happy to adjust if preferred approach differs!

Copy link
Contributor

@chonk-lain chonk-lain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't fix the current issue at hand, try applying the relevant change then try again.

Comment on lines 145 to 160
CHUNKER_INIT_DOC = f"""Initialize with a Chonkie chunker instance or create one if not provided.

Args:
chunker (Union[str, BaseChunker]): The chunker to use. Must be one of {CHUNKERS} or a chonkie chunker instance.
callback_manager (Optional[CallbackManager]): Callback manager for handling callbacks.
include_metadata (bool): Whether to include metadata in the nodes.
include_prev_next_rel (bool): Whether to include previous/next relationships.
id_func (Optional[Callable]): Function to generate node IDs.
**kwargs: Additional keyword arguments for the underlying Chonkie chunker.
""".strip()

# Pydantic replaces __init__ at runtime; attach docs to the generated constructor so help() is meaningful.
try:
Chunker.__init__.__doc__ = CHUNKER_INIT_DOC
except AttributeError:
pass No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move the docstring definition to line 52~53
as what you're testing against is my patch which is a working implementation i made on the fly.

Copy link
Member

@AstraBert AstraBert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't appear to be fixing the underlying cause of the docstring overwriting problem, so I would encourage to fix that rather than have a chonkie-specific fix

…reTextSplitter

Pydantic replaces __init__ at runtime, causing subclass constructor docs
to be overwritten by BaseModel docs. Restore the subclass docstring via
__init_subclass__ so help() and introspection behave correctly.

Fixes run-llama#20706
@ObinnaIheanachor
Copy link
Author

Thanks for the feedback.
I’ve moved the fix into MetadataAwareTextSplitter so it restores subclass constructor docstrings generically for all integrations.
Added a core regression test and simplified the Chonkie test into a smoke test.

The minimal repro from the issue now works without any integration-specific patch.

Copy link
Member

@AstraBert AstraBert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it's working! I just pushed a linting fix and once the CI is done I will merge 🙌

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 18, 2026
@ObinnaIheanachor
Copy link
Author

Looks like it's working! I just pushed a linting fix and once the CI is done I will merge 🙌

Thanks! I’ve pushed the updates addressing the core MetadataAwareTextSplitter behavior and added regression coverage.
Could you approve workflows so CI can run?

Copy link
Contributor

@chonk-lain chonk-lain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ObinnaIheanachor
the current implementation needs some revision, the logic is off the target.
i haven't dug too deep into the current issue at hand, but my gut feeling is saying that the issue lies with DispatcherSpanMixin and not with the MetadataAwareTextSplitter, this is a hunch based on some of my earlier findings which you can find in #20622 (comment) (i un-cloppased the section for easier read) also this is not confirmed hunch so take it as just a guiding lead that might or might not be spot on

this is a too low level issue with how python functions between what help and inspect are retrieving the docs and finding out where things started to go wrong is essential as this touches really close to the fundamentals of the python programming language and the current code base.

Copy link
Contributor

@chonk-lain chonk-lain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the second approach does indeed work ✅

Image

kudos to @ObinnaIheanachor for the fix 🙌
also now for the tests you are using pydoc and not inspect which is a valid choice for the tests here (under the hood help calls in pydoc)

also cc @AstraBert for extra viz

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Feb 20, 2026
@AstraBert
Copy link
Member

AstraBert commented Feb 20, 2026

Mmhh tests are failing @ObinnaIheanachor:

assert "my custom doc" in rendered
E       AssertionError: assert 'my custom doc' in 'Python Library Documentation: function __init__ in module test_metadataaware_init_doc\n\n_\x08__\x08_i\x08in\x08ni\x08it\x08t_\x08__\x08_(self)\n    My custom doc\n'

Looks like a minor capitalization error

@ObinnaIheanachor
Copy link
Author

@AstraBert I've normalized expected strings to lowercase to match pydoc.render_doc() output across Python versions and avoid formatting/casing inconsistencies.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: documentation override for classes inheriting from MetadataAwareTextSplitter

3 participants