Create tests for internal entities & revise existing noResolve tests by boswelja · Pull Request #334 · pdvrieze/xmlutil

boswelja · 2026-02-05T08:57:40Z

The purpose of this PR is to propose changes (or rather, the addition of) internal entity handling. Following TDD< I've created a couple of tests & updated some existing ones to get your thoughts on my intended solution first.

Essentially, I'd like to

Change expandEntities behavior, such that false will never touch any entities
Implement internal entity management & lookup

pdvrieze · 2026-02-05T09:43:24Z

For the parser (not the deserializer) what happens is that if expandEntities is set to false, the parser will not automatically substitute the entities with their text content (when in element content), but instead represent them as an entity element for which the localname is the name of the entity and text will do resolution (or throw an exception if unknown), there is also a method that allows checking whether the entity is known. It appears that KtXmlReader does not quite implement attributes correctly in case it is set to false (it should resolve them for attributes).

If we want to change that behaviour we need to name the property differently (as the behaviour would be different). However, if your focus is on the serializer, it is possible to handle this there, possibly even using a special custom serializer.
It is not generally correct to represent entities by their original text in the result (it would not serialize the same as read, but be escaped), but it would be valid to allow for a function in the policy to specify the behaviour on unresolved entities.

To make this work correctly it would be needed to actually properly implement entity resolution for the parser (at least), that could then also hook in this "unresolved" behaviour for attributes.

Separately, the tests should be disabled until implemented (so the code will continue to pass the test suite). It would probably also make a lot of sense to split the tests out into tests against the parser(s) and for the deserializer (so both behaviours can be tested - for all available parsers/XmlReader implementations).

boswelja · 2026-02-05T11:41:15Z

For the parser (not the deserializer) what happens is that if expandEntities is set to false, the parser will not automatically substitute the entities with their text content (when in element content), but instead represent them as an entity element for which the localname is the name of the entity and text will do resolution (or throw an exception if unknown), there is also a method that allows checking whether the entity is known. It appears that KtXmlReader does not quite implement attributes correctly in case it is set to false (it should resolve them for attributes).

Gotcha, this is good info to have! I will update the documentation for expandEntities, since it specifies "If true ... throwing errors if not found", which suggests that fale does not behave that way. I'll also revert the tests I touched there & update the expectations of the new tests.

if your focus is on the serializer, it is possible to handle this there, possibly even using a special custom serializer.

That makes sense, I can try to keep entity resolution at the serializer level. I haven't dug around the codebase too much yet, but I'm imagining the parser will provide some sort of tokenized entities for the serializer to reference

It is not generally correct to represent entities by their original text in the result (it would not serialize the same as read, but be escaped)

I see the problem, but that also means users would never be able to serialize an object that should reference entities right? Do we have a way of configuring HTML escaping policies today?

To make this work correctly it would be needed to actually properly implement entity resolution for the parser (at least), that could then also hook in this "unresolved" behaviour for attributes.

Yeah, it seems like a big undertaking 😅 I'm hoping I can split the work into parts & at least handle my original problem in #311 first.

Separately, the tests should be disabled until implemented (so the code will continue to pass the test suite). It would probably also make a lot of sense to split the tests out into tests against the parser(s) and for the deserializer (so both behaviours can be tested - for all available parsers/XmlReader implementations).

I'll get the tests set up for serrializer & parser separately this weekend. My intention here is just to have an easy to read way of sharing my thoughts to get us on the same page. The actual work will be done in separate PRs, and will include the respective tests.

pdvrieze · 2026-02-05T14:17:24Z

For the parser (not the deserializer) what happens is that if expandEntities is set to false, the parser will not automatically substitute the entities with their text content (when in element content), but instead represent them as an entity element for which the localname is the name of the entity and text will do resolution (or throw an exception if unknown), there is also a method that allows checking whether the entity is known. It appears that KtXmlReader does not quite implement attributes correctly in case it is set to false (it should resolve them for attributes).

Gotcha, this is good info to have! I will update the documentation for expandEntities, since it specifies "If true ... throwing errors if not found", which suggests that fale does not behave that way. I'll also revert the tests I touched there & update the expectations of the new tests.

Note that KtXmlReader is currently broken for attributes with this.

if your focus is on the serializer, it is possible to handle this there, possibly even using a special custom serializer.

That makes sense, I can try to keep entity resolution at the serializer level. I haven't dug around the codebase too much yet, but I'm imagining the parser will provide some sort of tokenized entities for the serializer to reference

Ideally the deserialization should work the same in either mode for the the reader. But the parser in expandEntities=false mode exposes it as an entity event (instead of text) - except for attribute values where this is not possible.

It is not generally correct to represent entities by their original text in the result (it would not serialize the same as read, but be escaped)

I see the problem, but that also means users would never be able to serialize an object that should reference entities right? Do we have a way of configuring HTML escaping policies today?

The default XML entities (per standard) are supported (they don't require a DTD), and character references () are supported. I'm not sure that attempting to match predefined entities to parts of text is worthwhile (quite some processing is needed).

To make this work correctly it would be needed to actually properly implement entity resolution for the parser (at least), that could then also hook in this "unresolved" behaviour for attributes.

Yeah, it seems like a big undertaking 😅 I'm hoping I can split the work into parts & at least handle my original problem in #311 first.

Actually entity resolution is almost there for the KtXmlReader. There is an interface for a resolver, and it is called. There is just no way to specify one - there is only the one that resolves default entities. This parser also doesn't support parsing DTD's (even internal) and as such does not record entity declarations to resolve.

The StaxReader has underlying support for entities (it is just the JDK default), and could use the non-resolving mode to support the same resolution API at wrapper level. As this API is effectively internal you are free to adjust it (and if possible use the same interface at the serialization policy level)

Separately, the tests should be disabled until implemented (so the code will continue to pass the test suite). It would probably also make a lot of sense to split the tests out into tests against the parser(s) and for the deserializer (so both behaviours can be tested - for all available parsers/XmlReader implementations).

I'll get the tests set up for serrializer & parser separately this weekend. My intention here is just to have an easy to read way of sharing my thoughts to get us on the same page. The actual work will be done in separate PRs, and will include the respective tests.

That is fine. Btw. I'll fix the handling of entities inside attributes in KtXmlReader when I have a bit of time.

pdvrieze

Good updates to documentation/test. I've just suggested remarking that (as updated now) entities in attribute values are always expanded (as there would otherwise not be a way to detect this).

pdvrieze · 2026-02-07T15:45:00Z

+     * @param expandEntities Whether entity references should be emitted as distinct events, or
+     * treated as simple text. In both cases, entities are resolved and their value available via
+     * [XmlReader.text]. Accessing `text` for an unknown entity will throw an exception. If
+     * [expandEntities] is false, the name of the entity is available via [XmlReader.localName].


Maybe mention that "entities in attribute values are always expanded. If the reference in an attribute is unrecognized this will always result in an exception" (new behaviour, old one was an empty string).

For XML wellformedness it is required that entity references reference declared entities. So throwing an exception would be valid (even though the parser is not quite meeting the requirements of even a non-validating parser.

I pushed an update, what do you think?

It looks good. Maybe the test could have a parameter to createReader to determine whether entities are expanded. Alternatively some of the per format subclasses could be duplicated with different factory functions.

Sounds good - just to clarify, the parameter on createReader should behave the same as the one here right?

Basically yes.

pdvrieze · 2026-03-01T22:20:41Z

+            ]>
+            <Test>&helloWorld;</Test>
+        """.trimIndent()
+        val reader = createReader(text)


It may be worthwhile to change this to have a parameter on whether entities are (auto)expanded.

boswelja · 2026-03-20T07:15:53Z

Sorry for the delay, my primary dev machine is a Windows on ARM laptop, so I can't sync with native targets enabled. Something changed a while ago that prevented me from disabling them in this project too :(

pdvrieze

Looks good. But can you rebase it on the dev branch. I did redo parts of the generic parser due to a significant performance/memory usage issue on JS (stringbuilders on the JS target just store an underlying string and then do string concatenation in really high-allocation ways). As part of that I have actually made the parser support parsing entity declarations in the doctype. It even supports parameter entities and the use of entities that provide tags in the entity replacement that needs to then be parsed.

As such you may find that once you have rebased the tests pass.

pdvrieze · 2026-03-20T14:52:57Z

+     * @param expandEntities Whether entity references should be emitted as distinct events, or
+     * treated as simple text. In both cases, entities are resolved and their value available via
+     * [XmlReader.text]. Accessing `text` for an unknown entity will throw an exception. If
+     * [expandEntities] is false, the name of the entity is available via [XmlReader.localName].


Basically yes.

Create tests for internal entities & revise existing noResolve tests

37fb450

boswelja mentioned this pull request Feb 5, 2026

Decoding custom entities that don't exist results in a blank string #311

Open

Update test cases

608c334

boswelja added 3 commits February 7, 2026 23:28

Create parser tests

b49a829

Create parser tests

c82c228

Update docs

b127858

pdvrieze reviewed Feb 7, 2026

View reviewed changes

Update IXmlStreaming.kt

5633172

pdvrieze reviewed Mar 1, 2026

View reviewed changes

pdvrieze requested changes Mar 20, 2026

View reviewed changes

Conversation

boswelja commented Feb 5, 2026

Uh oh!

pdvrieze commented Feb 5, 2026

Uh oh!

boswelja commented Feb 5, 2026

Uh oh!

pdvrieze commented Feb 5, 2026

Uh oh!

pdvrieze left a comment

Choose a reason for hiding this comment

Uh oh!

pdvrieze Feb 7, 2026

Choose a reason for hiding this comment

Uh oh!

boswelja Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

pdvrieze Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

boswelja Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

pdvrieze Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pdvrieze Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

boswelja commented Mar 20, 2026

Uh oh!

pdvrieze left a comment

Choose a reason for hiding this comment

Uh oh!

pdvrieze Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants