This document describes the complete processing pipeline for rustixml, an iXML (Invisible XML) parser implementation in Rust.
- High-Level Overview
- Grammar Parsing Pipeline
- Runtime Parsing Pipeline
- XML Generation Pipeline
- Semantic Actions System
- Mark Processing
- Test Infrastructure
flowchart TB
subgraph Input
G[iXML Grammar Text]
I[Input Text]
end
subgraph GrammarParsing[Grammar Parsing Phase]
L[Lexer]
P[Recursive Descent Parser]
AST[IxmlGrammar AST]
end
subgraph RuntimeConversion[Runtime Conversion Phase]
C[AST to Earley Converter]
GB[GrammarBuilder]
EG[Earley Grammar]
end
subgraph InputParsing[Input Parsing Phase]
EP[EarleyParser]
PT[Parse Trees/State]
end
subgraph XMLGeneration[XML Generation Phase]
EF[EarleyForest with Actions]
XN[XmlNode Tree]
XML[XML String Output]
end
G --> L
L --> P
P --> AST
AST --> C
C --> GB
GB --> EG
EG --> EP
I --> EP
EP --> PT
AST --> EF
PT --> EF
EF --> XN
XN --> XML
style Input fill:#e1f5fe,color:#000
style GrammarParsing fill:#f3e5f5,color:#000
style RuntimeConversion fill:#fff3e0,color:#000
style InputParsing fill:#e8f5e9,color:#000
style XMLGeneration fill:#fce4ec,color:#000
Test References:
- All 49 tests in
/home/bigale/repos/ixml/tests/correct/exercise this complete pipeline arith,hash,test- Simple grammars that test basic pipeline flow
The lexer (src/lexer.rs) tokenizes iXML grammar text into a stream of tokens.
flowchart LR
subgraph Tokens
ID[Identifier]
LIT[Literal]
MARK[Mark: @ - ^]
REP[Repetition: + * ?]
SEP[Separator: ++ **]
HEX[Hex: #XX]
CC[CharClass]
PUNCT[Punctuation]
end
GT[Grammar Text] --> LEX[Lexer]
LEX --> ID
LEX --> LIT
LEX --> MARK
LEX --> REP
LEX --> SEP
LEX --> HEX
LEX --> CC
LEX --> PUNCT
style Tokens fill:#fff9c4,color:#000
Key Features:
- Handles brace comments
{...}including nested comments - Supports hex characters
#a,#20, etc. - Recognizes double operators
++and**for separated repetitions
Test References:
nested-comment- Tests nested brace comment handlinghex,hex1,hex3- Tests hex character lexinglf,tab- Tests special character handling
The handwritten recursive descent parser (src/grammar_parser.rs) builds the AST.
flowchart TB
subgraph ParserMethods[Parser Methods]
PG[parse_grammar]
PR[parse_rule]
PA[parse_alternatives]
PS[parse_sequence]
PF[parse_factor]
PB[parse_base_factor]
end
subgraph ASTNodes[AST Node Types]
IG[IxmlGrammar]
R[Rule]
A[Alternatives]
S[Sequence]
F[Factor]
BF[BaseFactor]
end
PG --> PR
PR --> PA
PA --> PS
PS --> PF
PF --> PB
PG --> IG
PR --> R
PA --> A
PS --> S
PF --> F
PB --> BF
style ParserMethods fill:#c8e6c9,color:#000
style ASTNodes fill:#bbdefb,color:#000
BaseFactor Variants:
flowchart TB
BF[BaseFactor]
BF --> NT[Nonterminal]
BF --> LIT[Literal]
BF --> CC[CharClass]
BF --> GRP[Group]
NT --> NTF[name: String<br/>mark: Mark]
LIT --> LITF[value: String<br/>insertion: bool<br/>mark: Mark]
CC --> CCF[content: String<br/>negated: bool<br/>mark: Mark]
GRP --> GRPF[alternatives: Alternatives]
style BF fill:#ffecb3,color:#000
Test References:
marked- Tests mark parsing (@,-,^)string- Tests literal parsingrange,ranges- Tests character class parsingempty-group- Tests group parsing
The converter (src/runtime_parser.rs:ast_to_earlgrey) transforms the iXML AST into an Earley grammar.
flowchart TB
subgraph FirstPass[First Pass: Terminal Collection]
CC1[Collect Character Classes]
CT[Collect Character Terminals]
ML[Collect Marked Literals]
end
subgraph SecondPass[Second Pass: Grammar Building]
NTD[Declare Nonterminals]
PRD[Add Productions]
REP[Create Repetition Rules]
GRP[Create Group Rules]
end
AST[IxmlGrammar AST] --> CC1
CC1 --> CT
CT --> ML
ML --> NTD
NTD --> PRD
PRD --> REP
REP --> GRP
GRP --> GB[GrammarBuilder]
style FirstPass fill:#e3f2fd,color:#000
style SecondPass fill:#f1f8e9,color:#000
flowchart TB
subgraph RepTypes[Repetition Types]
NONE[None]
OPT[Optional: ?]
STAR[ZeroOrMore: *]
PLUS[OneOrMore: +]
SEPSTAR[SeparatedZeroOrMore: **]
SEPPLUS[SeparatedOneOrMore: ++]
end
subgraph GeneratedRules[Generated Helper Rules]
OPT --> OPTR[name_opt -> name<br/>name_opt -> epsilon]
STAR --> STARR[name_star -> name name_star<br/>name_star -> epsilon]
PLUS --> PLUSR[name_plus -> name name_plus<br/>name_plus -> name]
SEPSTAR --> SEPSTARR[name_sepstar -> name sep name_sepstar<br/>name_sepstar -> epsilon]
SEPPLUS --> SEPPLUSR[name_sepplus -> name sep name_sepplus<br/>name_sepplus -> name]
end
style RepTypes fill:#fff3e0,color:#000
style GeneratedRules fill:#fce4ec,color:#000
Test References:
hash- Tests separated repetitions (++)expr1throughexpr6- Tests complex repetition patternspara-test- Tests+repetition
The forest (src/runtime_parser.rs:build_xml_forest) registers semantic actions for each production.
flowchart TB
subgraph ActionRegistration[Action Registration]
RRA[Register Rule Actions]
RLA[Register Literal Sequence Actions]
RGA[Register Group Actions]
RMA[Register Marked Literal Actions]
end
subgraph ActionTypes[Action Types]
ERA[Element Creation]
TRA[Text Collection]
ARA[Attribute Extraction]
URA[Unwrapping Hidden/Promoted]
end
AST[IxmlGrammar] --> RRA
RRA --> RLA
RLA --> RGA
RGA --> RMA
RRA --> ERA
RRA --> TRA
RRA --> ARA
RRA --> URA
style ActionRegistration fill:#e8eaf6,color:#000
style ActionTypes fill:#fbe9e7,color:#000
flowchart TB
XN[XmlNode]
XN --> EL[Element]
XN --> TX[Text]
XN --> AT[Attribute]
EL --> ELF[name: String<br/>attributes: Vec<br/>children: Vec]
TX --> TXF[String]
AT --> ATF[name: String<br/>value: String]
style XN fill:#f3e5f5,color:#000
The serializer (src/runtime_parser.rs:to_xml) produces canonical iXML format.
flowchart TB
subgraph SerializationModes[Serialization Modes]
COMP[Compact Mode]
CAN[Canonical Mode]
end
subgraph CanonicalFormat[Canonical Format Features]
OT[Opening tag without final >]
GT[> on next line with indent]
CT[Closing tag without final >]
RTG[Root element gets final >]
end
XN[XmlNode Tree] --> COMP
XN --> CAN
CAN --> OT
CAN --> GT
CAN --> CT
CAN --> RTG
COMP --> XML1[Compact XML]
CAN --> XML2[Canonical XML]
style SerializationModes fill:#e0f2f1,color:#000
style CanonicalFormat fill:#fafafa,color:#000
Test References:
arith- Tests canonical format with nested elementsattribute-value- Tests attribute escapingelement-content- Tests text content escaping
flowchart TB
subgraph NodeProcessing[Node Processing in Actions]
RN[Receive Nodes]
UP[Unwrap Containers]
PM[Process Marks]
EA[Extract Attributes]
BC[Build Children]
CE[Create Element]
end
RN --> UP
UP --> PM
PM --> EA
PM --> BC
EA --> CE
BC --> CE
style NodeProcessing fill:#fff8e1,color:#000
flowchart LR
subgraph Containers[Container Types to Unwrap]
RC[_repeat_container]
HD[_hidden]
GR[group]
end
subgraph Result[Unwrapping Result]
CH[Extract children]
AT[Promote attributes]
end
RC --> CH
HD --> CH
HD --> AT
GR --> CH
style Containers fill:#ffccbc,color:#000
style Result fill:#c8e6c9,color:#000
flowchart TB
subgraph MarkTypes[Mark Types]
MN[None - default visible]
MH[Hidden: -]
MA[Attribute: @]
MP[Promoted: ^]
end
subgraph RuleMark[Rule Definition Mark]
RMN[Creates visible element]
RMH[Creates _hidden wrapper]
RMA[Creates Attribute node]
RMP[Creates _promoted wrapper]
end
subgraph FactorMark[Factor Reference Mark]
FMN[Keep element as child]
FMH[Extract children inline]
FMA[Convert to attribute]
FMP[Wrap in _promoted inline]
end
MN --> RMN
MH --> RMH
MA --> RMA
MP --> RMP
MN --> FMN
MH --> FMH
MA --> FMA
MP --> FMP
style MarkTypes fill:#e1bee7,color:#000
style RuleMark fill:#b2dfdb,color:#000
style FactorMark fill:#ffecb3,color:#000
Test References:
aaa- Tests hidden marked literals (-"text")marked- Tests attribute marks (@name)lf- Tests hidden hex chars (-#a)
When a factor mark differs from the rule definition mark:
flowchart LR
subgraph Override[Mark Override Examples]
E1[Rule: -expr<br/>Factor: ^expr]
E2[Rule: name<br/>Factor: @name]
end
subgraph Result[Override Result]
R1[Factor mark wins<br/>Creates visible expr]
R2[Factor mark wins<br/>Creates attribute]
end
E1 --> R1
E2 --> R2
style Override fill:#fff3e0,color:#000
style Result fill:#e8f5e9,color:#000
Test References:
expr2- Tests^exprpromoting hidden-exprexpr1- Tests@plusopcreating attributes from factor
flowchart TB
subgraph TestInput[Test Input Files]
GF[name.ixml - Grammar]
IF[name.inp - Input]
OF[name.output.xml - Expected]
end
subgraph TestRunner[Test Runner]
RT[read_simple_test]
RN[run_test]
CMP[Compare Output]
end
subgraph Outcomes[Test Outcomes]
PASS[Pass]
FAIL[Fail with diff]
GPE[GrammarParseError]
IPE[InputParseError]
SKIP[Skip]
end
GF --> RT
IF --> RT
OF --> RT
RT --> RN
RN --> CMP
CMP --> PASS
CMP --> FAIL
CMP --> GPE
CMP --> IPE
CMP --> SKIP
style TestInput fill:#e3f2fd,color:#000
style TestRunner fill:#f3e5f5,color:#000
style Outcomes fill:#fff8e1,color:#000
flowchart TB
subgraph DockerEnv[Docker Environment]
DF[Dockerfile.test]
IMG[rustixml-test image]
CNT[Container]
end
subgraph SafeRunner[Safe Conformance Runner]
PC[Panic Catching]
TO[2s Timeout]
IW[Incremental Write]
SK[Skip Known Crashers]
end
subgraph Output[Test Output]
SO[stdout progress]
RF[/tmp/safe_results.txt]
SUM[Summary stats]
end
DF --> IMG
IMG --> CNT
CNT --> PC
CNT --> TO
CNT --> IW
CNT --> SK
PC --> SO
IW --> RF
RF --> SUM
style DockerEnv fill:#ffecb3,color:#000
style SafeRunner fill:#c8e6c9,color:#000
style Output fill:#e1bee7,color:#000
| Category | Count | Status | Examples |
|---|---|---|---|
| Passing | 30 | Complete | arith, hash, test, hex, marked |
| Failing | 8 | Output mismatch | json, vcard, xml, expr1, expr2, expr4 |
| Error | 6 | Parse issues | email, unicode-classes, xpath |
| Timeout | 0 | Fixed | Previously expr1-6, diary1-3 |
flowchart TB
subgraph Comparison[Comparison Strategy]
ES[Exact String Match]
SC[Semantic Comparison]
end
subgraph SemanticCheck[Semantic Checks]
TN[Tag Names]
AT[Attributes unordered]
TX[Text trimmed]
CH[Children ordered]
end
EXP[Expected XML] --> ES
ACT[Actual XML] --> ES
ES -->|Match| PASS[Pass]
ES -->|No Match| SC
SC --> TN
SC --> AT
SC --> TX
SC --> CH
TN --> PASS2[Pass]
AT --> PASS2
TX --> PASS2
CH --> PASS2
style Comparison fill:#e8eaf6,color:#000
style SemanticCheck fill:#fbe9e7,color:#000
| File | Purpose | Key Functions |
|---|---|---|
src/lexer.rs |
Tokenization | Lexer::next_token() |
src/ast.rs |
AST definitions | IxmlGrammar, Rule, Factor |
src/grammar_parser.rs |
Recursive descent parser | parse_ixml_grammar() |
src/grammar_ast.rs |
Parser entry point | Re-exports parser |
src/runtime_parser.rs |
Earley conversion + XML | ast_to_earlgrey(), build_xml_forest() |
src/testsuite_utils.rs |
Test infrastructure | run_test(), xml_deep_equal() |
src/bin/safe_conformance_runner.rs |
Docker test runner | Panic-safe test execution |
-
GROUP_COUNTER Synchronization - Fixed Missing Action errors
- Problem: AST traversed twice (conversion + action registration), counter incremented differently
- Solution: Thread-local
GROUP_ID_MAPstoresgroup_pointer → group_idmapping - Result: Eliminated all "Missing Action" errors, 1 test moved from INPUT_ERROR to FAIL
-
Character Class Parsing - Fixed unquoted character sequences
- Problem:
[xyz]treated "xyz" as Unicode category, not individual chars - Solution: Added else clause in
parse_char_class()to split unquoted sequences - Result: Character classes now work in simple cases
- File:
src/runtime_parser.rs:1322
- Problem:
-
Test Classification - Identified intentionally invalid tests
- Problem:
elem1has empty character class[]causing parse failure - Solution: Added skip logic for invalid grammar tests
- Result: Reduced noise in test results
- Problem:
-
Large Grammar Parsing - Some tests with 40+ alternatives fail
- Tests:
unicode-classes,ixml-spaces,ixml3 - Status: Parse succeeds for simple versions, fails for full grammar
- Hypothesis: Earley parser limitations with large choice sets
- Tests:
-
Promoted mark handling -
^on hidden elements needs proper element name restoration- Affects:
expr2,expr4tests - Status: Output structure incorrect
- Affects:
-
Separated repetition attributes -
@plusopinterm++plusopnot collected to parent- Affects:
expr1test - Status: Attributes not propagated correctly
- Affects:
Translation Layer Complexity: The current implementation translates iXML to Earley grammar, which creates several challenges:
- Impedance Mismatch - Earley expects token-level parsing, we do character-level
- Helper Nonterminals - Repetitions require generated rules (
name_plus,name_star) - Synchronization Issues - Multiple AST traversals need coordinated state (GROUP_COUNTER)
- Predicate Generation - Character classes converted to Rust predicates
See ABSTRACTION_ANALYSIS.md for detailed discussion of specification-driven vs translation-based approaches.
- Handwritten parser: O(n) linear time for grammar parsing
- Earley parser: O(n³) worst case, O(n) for unambiguous grammars
- Character-level tokenization: Every character is a separate token
- Left-recursive grammars may cause performance issues
- Large grammars (40+ rules) may hit Earley limitations
Current Results:
- 53 PASS (39.8%) - Core functionality working
- 12 FAIL (9.0%) - Output mismatch, need XML structure fixes
- 3 INPUT_ERROR (2.3%) - Parse failures on large grammars
- 65 SKIP (48.9%) - No expected output or test configuration
Priority Issues:
- Fix large grammar parsing (unicode-classes, ixml-spaces, ixml3)
- Fix promoted mark handling (expr2, expr4)
- Fix attribute collection in separated repetitions (expr1)
Target: 90%+ pass rate
- Secondary: Resolve input parse errors