Skip to content

Create tests for internal entities & revise existing noResolve tests#334

Draft
boswelja wants to merge 6 commits intopdvrieze:masterfrom
boswelja:expandentity-testing
Draft

Create tests for internal entities & revise existing noResolve tests#334
boswelja wants to merge 6 commits intopdvrieze:masterfrom
boswelja:expandentity-testing

Conversation

@boswelja
Copy link
Copy Markdown
Contributor

@boswelja boswelja commented Feb 5, 2026

The purpose of this PR is to propose changes (or rather, the addition of) internal entity handling. Following TDD< I've created a couple of tests & updated some existing ones to get your thoughts on my intended solution first.

Essentially, I'd like to

  • Change expandEntities behavior, such that false will never touch any entities
  • Implement internal entity management & lookup

@pdvrieze
Copy link
Copy Markdown
Owner

pdvrieze commented Feb 5, 2026

For the parser (not the deserializer) what happens is that if expandEntities is set to false, the parser will not automatically substitute the entities with their text content (when in element content), but instead represent them as an entity element for which the localname is the name of the entity and text will do resolution (or throw an exception if unknown), there is also a method that allows checking whether the entity is known. It appears that KtXmlReader does not quite implement attributes correctly in case it is set to false (it should resolve them for attributes).

If we want to change that behaviour we need to name the property differently (as the behaviour would be different). However, if your focus is on the serializer, it is possible to handle this there, possibly even using a special custom serializer.
It is not generally correct to represent entities by their original text in the result (it would not serialize the same as read, but be escaped), but it would be valid to allow for a function in the policy to specify the behaviour on unresolved entities.

To make this work correctly it would be needed to actually properly implement entity resolution for the parser (at least), that could then also hook in this "unresolved" behaviour for attributes.

Separately, the tests should be disabled until implemented (so the code will continue to pass the test suite). It would probably also make a lot of sense to split the tests out into tests against the parser(s) and for the deserializer (so both behaviours can be tested - for all available parsers/XmlReader implementations).

@boswelja
Copy link
Copy Markdown
Contributor Author

boswelja commented Feb 5, 2026

For the parser (not the deserializer) what happens is that if expandEntities is set to false, the parser will not automatically substitute the entities with their text content (when in element content), but instead represent them as an entity element for which the localname is the name of the entity and text will do resolution (or throw an exception if unknown), there is also a method that allows checking whether the entity is known. It appears that KtXmlReader does not quite implement attributes correctly in case it is set to false (it should resolve them for attributes).

Gotcha, this is good info to have! I will update the documentation for expandEntities, since it specifies "If true ... throwing errors if not found", which suggests that fale does not behave that way. I'll also revert the tests I touched there & update the expectations of the new tests.

if your focus is on the serializer, it is possible to handle this there, possibly even using a special custom serializer.

That makes sense, I can try to keep entity resolution at the serializer level. I haven't dug around the codebase too much yet, but I'm imagining the parser will provide some sort of tokenized entities for the serializer to reference

It is not generally correct to represent entities by their original text in the result (it would not serialize the same as read, but be escaped)

I see the problem, but that also means users would never be able to serialize an object that should reference entities right? Do we have a way of configuring HTML escaping policies today?

To make this work correctly it would be needed to actually properly implement entity resolution for the parser (at least), that could then also hook in this "unresolved" behaviour for attributes.

Yeah, it seems like a big undertaking 😅 I'm hoping I can split the work into parts & at least handle my original problem in #311 first.

Separately, the tests should be disabled until implemented (so the code will continue to pass the test suite). It would probably also make a lot of sense to split the tests out into tests against the parser(s) and for the deserializer (so both behaviours can be tested - for all available parsers/XmlReader implementations).

I'll get the tests set up for serrializer & parser separately this weekend. My intention here is just to have an easy to read way of sharing my thoughts to get us on the same page. The actual work will be done in separate PRs, and will include the respective tests.

@pdvrieze
Copy link
Copy Markdown
Owner

pdvrieze commented Feb 5, 2026

For the parser (not the deserializer) what happens is that if expandEntities is set to false, the parser will not automatically substitute the entities with their text content (when in element content), but instead represent them as an entity element for which the localname is the name of the entity and text will do resolution (or throw an exception if unknown), there is also a method that allows checking whether the entity is known. It appears that KtXmlReader does not quite implement attributes correctly in case it is set to false (it should resolve them for attributes).

Gotcha, this is good info to have! I will update the documentation for expandEntities, since it specifies "If true ... throwing errors if not found", which suggests that fale does not behave that way. I'll also revert the tests I touched there & update the expectations of the new tests.

Note that KtXmlReader is currently broken for attributes with this.

if your focus is on the serializer, it is possible to handle this there, possibly even using a special custom serializer.

That makes sense, I can try to keep entity resolution at the serializer level. I haven't dug around the codebase too much yet, but I'm imagining the parser will provide some sort of tokenized entities for the serializer to reference

Ideally the deserialization should work the same in either mode for the the reader. But the parser in expandEntities=false mode exposes it as an entity event (instead of text) - except for attribute values where this is not possible.

It is not generally correct to represent entities by their original text in the result (it would not serialize the same as read, but be escaped)

I see the problem, but that also means users would never be able to serialize an object that should reference entities right? Do we have a way of configuring HTML escaping policies today?

The default XML entities (per standard) are supported (they don't require a DTD), and character references (&#0030;) are supported. I'm not sure that attempting to match predefined entities to parts of text is worthwhile (quite some processing is needed).

To make this work correctly it would be needed to actually properly implement entity resolution for the parser (at least), that could then also hook in this "unresolved" behaviour for attributes.

Yeah, it seems like a big undertaking 😅 I'm hoping I can split the work into parts & at least handle my original problem in #311 first.

Actually entity resolution is almost there for the KtXmlReader. There is an interface for a resolver, and it is called. There is just no way to specify one - there is only the one that resolves default entities. This parser also doesn't support parsing DTD's (even internal) and as such does not record entity declarations to resolve.

The StaxReader has underlying support for entities (it is just the JDK default), and could use the non-resolving mode to support the same resolution API at wrapper level. As this API is effectively internal you are free to adjust it (and if possible use the same interface at the serialization policy level)

Separately, the tests should be disabled until implemented (so the code will continue to pass the test suite). It would probably also make a lot of sense to split the tests out into tests against the parser(s) and for the deserializer (so both behaviours can be tested - for all available parsers/XmlReader implementations).

I'll get the tests set up for serrializer & parser separately this weekend. My intention here is just to have an easy to read way of sharing my thoughts to get us on the same page. The actual work will be done in separate PRs, and will include the respective tests.

That is fine. Btw. I'll fix the handling of entities inside attributes in KtXmlReader when I have a bit of time.

Copy link
Copy Markdown
Owner

@pdvrieze pdvrieze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good updates to documentation/test. I've just suggested remarking that (as updated now) entities in attribute values are always expanded (as there would otherwise not be a way to detect this).

* @param expandEntities Whether entity references should be emitted as distinct events, or
* treated as simple text. In both cases, entities are resolved and their value available via
* [XmlReader.text]. Accessing `text` for an unknown entity will throw an exception. If
* [expandEntities] is false, the name of the entity is available via [XmlReader.localName].
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe mention that "entities in attribute values are always expanded. If the reference in an attribute is unrecognized this will always result in an exception" (new behaviour, old one was an empty string).

For XML wellformedness it is required that entity references reference declared entities. So throwing an exception would be valid (even though the parser is not quite meeting the requirements of even a non-validating parser.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed an update, what do you think?

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good. Maybe the test could have a parameter to createReader to determine whether entities are expanded. Alternatively some of the per format subclasses could be duplicated with different factory functions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good - just to clarify, the parameter on createReader should behave the same as the one here right?

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically yes.

]>
<Test>&helloWorld;</Test>
""".trimIndent()
val reader = createReader(text)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worthwhile to change this to have a parameter on whether entities are (auto)expanded.

@boswelja
Copy link
Copy Markdown
Contributor Author

Sorry for the delay, my primary dev machine is a Windows on ARM laptop, so I can't sync with native targets enabled. Something changed a while ago that prevented me from disabling them in this project too :(

Copy link
Copy Markdown
Owner

@pdvrieze pdvrieze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. But can you rebase it on the dev branch. I did redo parts of the generic parser due to a significant performance/memory usage issue on JS (stringbuilders on the JS target just store an underlying string and then do string concatenation in really high-allocation ways). As part of that I have actually made the parser support parsing entity declarations in the doctype. It even supports parameter entities and the use of entities that provide tags in the entity replacement that needs to then be parsed.

As such you may find that once you have rebased the tests pass.

* @param expandEntities Whether entity references should be emitted as distinct events, or
* treated as simple text. In both cases, entities are resolved and their value available via
* [XmlReader.text]. Accessing `text` for an unknown entity will throw an exception. If
* [expandEntities] is false, the name of the entity is available via [XmlReader.localName].
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants