Skip to content

Latest commit

 

History

History
677 lines (536 loc) · 16.8 KB

File metadata and controls

677 lines (536 loc) · 16.8 KB

rustixml Architecture Documentation

This document describes the complete processing pipeline for rustixml, an iXML (Invisible XML) parser implementation in Rust.

Table of Contents

  1. High-Level Overview
  2. Grammar Parsing Pipeline
  3. Runtime Parsing Pipeline
  4. XML Generation Pipeline
  5. Semantic Actions System
  6. Mark Processing
  7. Test Infrastructure

High-Level Overview

flowchart TB
    subgraph Input
        G[iXML Grammar Text]
        I[Input Text]
    end

    subgraph GrammarParsing[Grammar Parsing Phase]
        L[Lexer]
        P[Recursive Descent Parser]
        AST[IxmlGrammar AST]
    end

    subgraph RuntimeConversion[Runtime Conversion Phase]
        C[AST to Earley Converter]
        GB[GrammarBuilder]
        EG[Earley Grammar]
    end

    subgraph InputParsing[Input Parsing Phase]
        EP[EarleyParser]
        PT[Parse Trees/State]
    end

    subgraph XMLGeneration[XML Generation Phase]
        EF[EarleyForest with Actions]
        XN[XmlNode Tree]
        XML[XML String Output]
    end

    G --> L
    L --> P
    P --> AST
    AST --> C
    C --> GB
    GB --> EG
    EG --> EP
    I --> EP
    EP --> PT
    AST --> EF
    PT --> EF
    EF --> XN
    XN --> XML

    style Input fill:#e1f5fe,color:#000
    style GrammarParsing fill:#f3e5f5,color:#000
    style RuntimeConversion fill:#fff3e0,color:#000
    style InputParsing fill:#e8f5e9,color:#000
    style XMLGeneration fill:#fce4ec,color:#000
Loading

Test References:

  • All 49 tests in /home/bigale/repos/ixml/tests/correct/ exercise this complete pipeline
  • arith, hash, test - Simple grammars that test basic pipeline flow

Grammar Parsing Pipeline

Lexer Stage

The lexer (src/lexer.rs) tokenizes iXML grammar text into a stream of tokens.

flowchart LR
    subgraph Tokens
        ID[Identifier]
        LIT[Literal]
        MARK[Mark: @ - ^]
        REP[Repetition: + * ?]
        SEP[Separator: ++ **]
        HEX[Hex: #XX]
        CC[CharClass]
        PUNCT[Punctuation]
    end

    GT[Grammar Text] --> LEX[Lexer]
    LEX --> ID
    LEX --> LIT
    LEX --> MARK
    LEX --> REP
    LEX --> SEP
    LEX --> HEX
    LEX --> CC
    LEX --> PUNCT

    style Tokens fill:#fff9c4,color:#000
Loading

Key Features:

  • Handles brace comments {...} including nested comments
  • Supports hex characters #a, #20, etc.
  • Recognizes double operators ++ and ** for separated repetitions

Test References:

  • nested-comment - Tests nested brace comment handling
  • hex, hex1, hex3 - Tests hex character lexing
  • lf, tab - Tests special character handling

Parser Stage

The handwritten recursive descent parser (src/grammar_parser.rs) builds the AST.

flowchart TB
    subgraph ParserMethods[Parser Methods]
        PG[parse_grammar]
        PR[parse_rule]
        PA[parse_alternatives]
        PS[parse_sequence]
        PF[parse_factor]
        PB[parse_base_factor]
    end

    subgraph ASTNodes[AST Node Types]
        IG[IxmlGrammar]
        R[Rule]
        A[Alternatives]
        S[Sequence]
        F[Factor]
        BF[BaseFactor]
    end

    PG --> PR
    PR --> PA
    PA --> PS
    PS --> PF
    PF --> PB

    PG --> IG
    PR --> R
    PA --> A
    PS --> S
    PF --> F
    PB --> BF

    style ParserMethods fill:#c8e6c9,color:#000
    style ASTNodes fill:#bbdefb,color:#000
Loading

BaseFactor Variants:

flowchart TB
    BF[BaseFactor]

    BF --> NT[Nonterminal]
    BF --> LIT[Literal]
    BF --> CC[CharClass]
    BF --> GRP[Group]

    NT --> NTF[name: String<br/>mark: Mark]
    LIT --> LITF[value: String<br/>insertion: bool<br/>mark: Mark]
    CC --> CCF[content: String<br/>negated: bool<br/>mark: Mark]
    GRP --> GRPF[alternatives: Alternatives]

    style BF fill:#ffecb3,color:#000
Loading

Test References:

  • marked - Tests mark parsing (@, -, ^)
  • string - Tests literal parsing
  • range, ranges - Tests character class parsing
  • empty-group - Tests group parsing

Runtime Parsing Pipeline

AST to Earley Conversion

The converter (src/runtime_parser.rs:ast_to_earlgrey) transforms the iXML AST into an Earley grammar.

flowchart TB
    subgraph FirstPass[First Pass: Terminal Collection]
        CC1[Collect Character Classes]
        CT[Collect Character Terminals]
        ML[Collect Marked Literals]
    end

    subgraph SecondPass[Second Pass: Grammar Building]
        NTD[Declare Nonterminals]
        PRD[Add Productions]
        REP[Create Repetition Rules]
        GRP[Create Group Rules]
    end

    AST[IxmlGrammar AST] --> CC1
    CC1 --> CT
    CT --> ML
    ML --> NTD
    NTD --> PRD
    PRD --> REP
    REP --> GRP
    GRP --> GB[GrammarBuilder]

    style FirstPass fill:#e3f2fd,color:#000
    style SecondPass fill:#f1f8e9,color:#000
Loading

Repetition Handling

flowchart TB
    subgraph RepTypes[Repetition Types]
        NONE[None]
        OPT[Optional: ?]
        STAR[ZeroOrMore: *]
        PLUS[OneOrMore: +]
        SEPSTAR[SeparatedZeroOrMore: **]
        SEPPLUS[SeparatedOneOrMore: ++]
    end

    subgraph GeneratedRules[Generated Helper Rules]
        OPT --> OPTR[name_opt -> name<br/>name_opt -> epsilon]
        STAR --> STARR[name_star -> name name_star<br/>name_star -> epsilon]
        PLUS --> PLUSR[name_plus -> name name_plus<br/>name_plus -> name]
        SEPSTAR --> SEPSTARR[name_sepstar -> name sep name_sepstar<br/>name_sepstar -> epsilon]
        SEPPLUS --> SEPPLUSR[name_sepplus -> name sep name_sepplus<br/>name_sepplus -> name]
    end

    style RepTypes fill:#fff3e0,color:#000
    style GeneratedRules fill:#fce4ec,color:#000
Loading

Test References:

  • hash - Tests separated repetitions (++)
  • expr1 through expr6 - Tests complex repetition patterns
  • para-test - Tests + repetition

XML Generation Pipeline

EarleyForest Actions

The forest (src/runtime_parser.rs:build_xml_forest) registers semantic actions for each production.

flowchart TB
    subgraph ActionRegistration[Action Registration]
        RRA[Register Rule Actions]
        RLA[Register Literal Sequence Actions]
        RGA[Register Group Actions]
        RMA[Register Marked Literal Actions]
    end

    subgraph ActionTypes[Action Types]
        ERA[Element Creation]
        TRA[Text Collection]
        ARA[Attribute Extraction]
        URA[Unwrapping Hidden/Promoted]
    end

    AST[IxmlGrammar] --> RRA
    RRA --> RLA
    RLA --> RGA
    RGA --> RMA

    RRA --> ERA
    RRA --> TRA
    RRA --> ARA
    RRA --> URA

    style ActionRegistration fill:#e8eaf6,color:#000
    style ActionTypes fill:#fbe9e7,color:#000
Loading

XmlNode Structure

flowchart TB
    XN[XmlNode]

    XN --> EL[Element]
    XN --> TX[Text]
    XN --> AT[Attribute]

    EL --> ELF[name: String<br/>attributes: Vec<br/>children: Vec]
    TX --> TXF[String]
    AT --> ATF[name: String<br/>value: String]

    style XN fill:#f3e5f5,color:#000
Loading

XML Serialization

The serializer (src/runtime_parser.rs:to_xml) produces canonical iXML format.

flowchart TB
    subgraph SerializationModes[Serialization Modes]
        COMP[Compact Mode]
        CAN[Canonical Mode]
    end

    subgraph CanonicalFormat[Canonical Format Features]
        OT[Opening tag without final >]
        GT[> on next line with indent]
        CT[Closing tag without final >]
        RTG[Root element gets final >]
    end

    XN[XmlNode Tree] --> COMP
    XN --> CAN
    CAN --> OT
    CAN --> GT
    CAN --> CT
    CAN --> RTG
    COMP --> XML1[Compact XML]
    CAN --> XML2[Canonical XML]

    style SerializationModes fill:#e0f2f1,color:#000
    style CanonicalFormat fill:#fafafa,color:#000
Loading

Test References:

  • arith - Tests canonical format with nested elements
  • attribute-value - Tests attribute escaping
  • element-content - Tests text content escaping

Semantic Actions System

Node Processing Flow

flowchart TB
    subgraph NodeProcessing[Node Processing in Actions]
        RN[Receive Nodes]
        UP[Unwrap Containers]
        PM[Process Marks]
        EA[Extract Attributes]
        BC[Build Children]
        CE[Create Element]
    end

    RN --> UP
    UP --> PM
    PM --> EA
    PM --> BC
    EA --> CE
    BC --> CE

    style NodeProcessing fill:#fff8e1,color:#000
Loading

Container Unwrapping

flowchart LR
    subgraph Containers[Container Types to Unwrap]
        RC[_repeat_container]
        HD[_hidden]
        GR[group]
    end

    subgraph Result[Unwrapping Result]
        CH[Extract children]
        AT[Promote attributes]
    end

    RC --> CH
    HD --> CH
    HD --> AT
    GR --> CH

    style Containers fill:#ffccbc,color:#000
    style Result fill:#c8e6c9,color:#000
Loading

Mark Processing

Mark Types and Effects

flowchart TB
    subgraph MarkTypes[Mark Types]
        MN[None - default visible]
        MH[Hidden: -]
        MA[Attribute: @]
        MP[Promoted: ^]
    end

    subgraph RuleMark[Rule Definition Mark]
        RMN[Creates visible element]
        RMH[Creates _hidden wrapper]
        RMA[Creates Attribute node]
        RMP[Creates _promoted wrapper]
    end

    subgraph FactorMark[Factor Reference Mark]
        FMN[Keep element as child]
        FMH[Extract children inline]
        FMA[Convert to attribute]
        FMP[Wrap in _promoted inline]
    end

    MN --> RMN
    MH --> RMH
    MA --> RMA
    MP --> RMP

    MN --> FMN
    MH --> FMH
    MA --> FMA
    MP --> FMP

    style MarkTypes fill:#e1bee7,color:#000
    style RuleMark fill:#b2dfdb,color:#000
    style FactorMark fill:#ffecb3,color:#000
Loading

Test References:

  • aaa - Tests hidden marked literals (-"text")
  • marked - Tests attribute marks (@name)
  • lf - Tests hidden hex chars (-#a)

Mark Override Behavior

When a factor mark differs from the rule definition mark:

flowchart LR
    subgraph Override[Mark Override Examples]
        E1[Rule: -expr<br/>Factor: ^expr]
        E2[Rule: name<br/>Factor: @name]
    end

    subgraph Result[Override Result]
        R1[Factor mark wins<br/>Creates visible expr]
        R2[Factor mark wins<br/>Creates attribute]
    end

    E1 --> R1
    E2 --> R2

    style Override fill:#fff3e0,color:#000
    style Result fill:#e8f5e9,color:#000
Loading

Test References:

  • expr2 - Tests ^expr promoting hidden -expr
  • expr1 - Tests @plusop creating attributes from factor

Test Infrastructure

Test Runner Architecture

flowchart TB
    subgraph TestInput[Test Input Files]
        GF[name.ixml - Grammar]
        IF[name.inp - Input]
        OF[name.output.xml - Expected]
    end

    subgraph TestRunner[Test Runner]
        RT[read_simple_test]
        RN[run_test]
        CMP[Compare Output]
    end

    subgraph Outcomes[Test Outcomes]
        PASS[Pass]
        FAIL[Fail with diff]
        GPE[GrammarParseError]
        IPE[InputParseError]
        SKIP[Skip]
    end

    GF --> RT
    IF --> RT
    OF --> RT
    RT --> RN
    RN --> CMP
    CMP --> PASS
    CMP --> FAIL
    CMP --> GPE
    CMP --> IPE
    CMP --> SKIP

    style TestInput fill:#e3f2fd,color:#000
    style TestRunner fill:#f3e5f5,color:#000
    style Outcomes fill:#fff8e1,color:#000
Loading

Docker Test Environment

flowchart TB
    subgraph DockerEnv[Docker Environment]
        DF[Dockerfile.test]
        IMG[rustixml-test image]
        CNT[Container]
    end

    subgraph SafeRunner[Safe Conformance Runner]
        PC[Panic Catching]
        TO[2s Timeout]
        IW[Incremental Write]
        SK[Skip Known Crashers]
    end

    subgraph Output[Test Output]
        SO[stdout progress]
        RF[/tmp/safe_results.txt]
        SUM[Summary stats]
    end

    DF --> IMG
    IMG --> CNT
    CNT --> PC
    CNT --> TO
    CNT --> IW
    CNT --> SK
    PC --> SO
    IW --> RF
    RF --> SUM

    style DockerEnv fill:#ffecb3,color:#000
    style SafeRunner fill:#c8e6c9,color:#000
    style Output fill:#e1bee7,color:#000
Loading

Test Categories and Status

Category Count Status Examples
Passing 30 Complete arith, hash, test, hex, marked
Failing 8 Output mismatch json, vcard, xml, expr1, expr2, expr4
Error 6 Parse issues email, unicode-classes, xpath
Timeout 0 Fixed Previously expr1-6, diary1-3

Semantic XML Comparison

flowchart TB
    subgraph Comparison[Comparison Strategy]
        ES[Exact String Match]
        SC[Semantic Comparison]
    end

    subgraph SemanticCheck[Semantic Checks]
        TN[Tag Names]
        AT[Attributes unordered]
        TX[Text trimmed]
        CH[Children ordered]
    end

    EXP[Expected XML] --> ES
    ACT[Actual XML] --> ES
    ES -->|Match| PASS[Pass]
    ES -->|No Match| SC
    SC --> TN
    SC --> AT
    SC --> TX
    SC --> CH
    TN --> PASS2[Pass]
    AT --> PASS2
    TX --> PASS2
    CH --> PASS2

    style Comparison fill:#e8eaf6,color:#000
    style SemanticCheck fill:#fbe9e7,color:#000
Loading

File Reference

File Purpose Key Functions
src/lexer.rs Tokenization Lexer::next_token()
src/ast.rs AST definitions IxmlGrammar, Rule, Factor
src/grammar_parser.rs Recursive descent parser parse_ixml_grammar()
src/grammar_ast.rs Parser entry point Re-exports parser
src/runtime_parser.rs Earley conversion + XML ast_to_earlgrey(), build_xml_forest()
src/testsuite_utils.rs Test infrastructure run_test(), xml_deep_equal()
src/bin/safe_conformance_runner.rs Docker test runner Panic-safe test execution

Known Issues and Recent Fixes

Recent Fixes (November 2025)

  1. GROUP_COUNTER Synchronization - Fixed Missing Action errors

    • Problem: AST traversed twice (conversion + action registration), counter incremented differently
    • Solution: Thread-local GROUP_ID_MAP stores group_pointer → group_id mapping
    • Result: Eliminated all "Missing Action" errors, 1 test moved from INPUT_ERROR to FAIL
  2. Character Class Parsing - Fixed unquoted character sequences

    • Problem: [xyz] treated "xyz" as Unicode category, not individual chars
    • Solution: Added else clause in parse_char_class() to split unquoted sequences
    • Result: Character classes now work in simple cases
    • File: src/runtime_parser.rs:1322
  3. Test Classification - Identified intentionally invalid tests

    • Problem: elem1 has empty character class [] causing parse failure
    • Solution: Added skip logic for invalid grammar tests
    • Result: Reduced noise in test results

Current Limitations

  1. Large Grammar Parsing - Some tests with 40+ alternatives fail

    • Tests: unicode-classes, ixml-spaces, ixml3
    • Status: Parse succeeds for simple versions, fails for full grammar
    • Hypothesis: Earley parser limitations with large choice sets
  2. Promoted mark handling - ^ on hidden elements needs proper element name restoration

    • Affects: expr2, expr4 tests
    • Status: Output structure incorrect
  3. Separated repetition attributes - @plusop in term++plusop not collected to parent

    • Affects: expr1 test
    • Status: Attributes not propagated correctly

Architectural Considerations

Translation Layer Complexity: The current implementation translates iXML to Earley grammar, which creates several challenges:

  1. Impedance Mismatch - Earley expects token-level parsing, we do character-level
  2. Helper Nonterminals - Repetitions require generated rules (name_plus, name_star)
  3. Synchronization Issues - Multiple AST traversals need coordinated state (GROUP_COUNTER)
  4. Predicate Generation - Character classes converted to Rust predicates

See ABSTRACTION_ANALYSIS.md for detailed discussion of specification-driven vs translation-based approaches.

Performance Considerations

  • Handwritten parser: O(n) linear time for grammar parsing
  • Earley parser: O(n³) worst case, O(n) for unambiguous grammars
  • Character-level tokenization: Every character is a separate token
  • Left-recursive grammars may cause performance issues
  • Large grammars (40+ rules) may hit Earley limitations

Test Coverage Status (November 2025)

Current Results:

  • 53 PASS (39.8%) - Core functionality working
  • 12 FAIL (9.0%) - Output mismatch, need XML structure fixes
  • 3 INPUT_ERROR (2.3%) - Parse failures on large grammars
  • 65 SKIP (48.9%) - No expected output or test configuration

Priority Issues:

  1. Fix large grammar parsing (unicode-classes, ixml-spaces, ixml3)
  2. Fix promoted mark handling (expr2, expr4)
  3. Fix attribute collection in separated repetitions (expr1)

Target: 90%+ pass rate

  • Secondary: Resolve input parse errors