A recursive descent parser generator that produces high-performance callback-based
parsers from .desc specification files.
The Good:
- Clean AST class hierarchy (GM, GMFunction, GMState, GMCase, GMCommand)
- Uses Liquid templates
- Some separation of concerns
The Spaghetti:
- `to_str()` methods on every AST class: each class knows how to render C code
- `GMCommand` does parsing AND C codegen
- `outv()`, `specials_map()`, `parse_side()` all know C syntax
- Template is vestigial: just `{% for f in functions %}{{f}}{% endfor %}`; the real work happens in Ruby
- Type mapping hardcoded: `type_map()` returns C types directly
Same pattern, worse coupling:
- `to_rust()` methods everywhere: every class renders itself to Rust
- `emit_rust()` is 100 lines of Rust-specific event code (lines 999-1101)
- Template is 838 lines but mostly boilerplate: a `{{ functions }}` placeholder for pre-rendered code
- State machine architecture baked in: the generator assumes output is always a state machine with an enum
Both generators conflate three distinct concerns:
┌─────────────────────────────────────────────────────────────────┐
│ 1. DSL Parsing → What did the user write? │
│ 2. Semantic Analysis → What does it mean? (types, scopes) │
│ 3. Code Generation → How to express it in target language? │
└─────────────────────────────────────────────────────────────────┘
The existing generators do all three in a single pass through AST classes. Each class knows how to parse itself AND how to render itself to the target language.
┌──────────────┐
udon.desc ───▶ │ Lexer │ ───▶ Tokens
└──────────────┘
│
▼
┌──────────────┐
│ Parser │ ───▶ AST (pure Data structs)
└──────────────┘
│
▼
┌──────────────┐
│ IR Builder │ ───▶ IR (semantic model)
│ │ - Types resolved
│ │ - SCAN chars inferred
│ │ - EOF handling inferred
└──────────────┘
│
▼
┌──────────────┐ ┌─────────────────────┐
│ Generator │◀──────│ rust/parser.liquid │
│ (Liquid) │ │ c/parser.liquid │
└──────────────┘ │ (target-specific) │
│ └─────────────────────┘
▼
parser.rs
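The four stages above compose as plain data transformations. A toy Ruby sketch of that shape (stage bodies are placeholders, not descent's real code):

```ruby
# Each stage is a pure function of the previous stage's output; no stage
# knows anything about the stages around it. Bodies here are stand-ins.
stages = {
  lexer:      ->(src)    { src.split(/\s+/) },               # text   -> tokens
  parser:     ->(tokens) { { functions: tokens } },          # tokens -> AST
  ir_builder: ->(ast)    { ast.merge(analyzed: true) },      # AST    -> IR
  generator:  ->(ir)     { "// #{ir[:functions].size} fns" } # IR     -> code
}

output = stages.values.reduce("udon desc") { |acc, stage| stage.call(acc) }
# output => "// 2 fns"
```

The point of the shape is that swapping the generator stage (or its template) changes the target without touching anything upstream.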
| Principle | What It Means |
|---|---|
| AST is pure data | No to_rust() methods. Just Data structs with fields. |
| IR is the semantic model | All analysis done here: type inference, SCAN inference, validation |
| Templates own ALL target knowledge | Type mapping, syntax, idioms - all in Liquid |
| Generator core ~400 lines | Lexer + Parser + IR Builder + Generator |
- Tokenize pipe-delimited `.desc` files
- Track line numbers for error reporting
- Output: array of `Token` Data structs
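A sketch of what this could look like, assuming a minimal `Token` shape (text plus line number) built on Ruby 3.2's `Data`:

```ruby
# Minimal pipe-delimited tokenizer sketch; the Token fields (text, line)
# are an assumed shape, not necessarily descent's exact struct.
Token = Data.define(:text, :line)

def tokenize(source)
  source.each_line.with_index(1).flat_map do |line, lineno|
    line.chomp.split("|")               # pipe-delimited cells
        .map(&:strip).reject(&:empty?)  # drop padding and empty cells
        .map { |cell| Token.new(text: cell, line: lineno) }
  end
end
```

Carrying `line` on every token is what lets the parser and validator report errors against the original `.desc` source.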
- Build AST from token stream
- Pure recursive descent
- Output: `AST::Machine` containing functions, states, cases, commands
- Pure Data structs, no behavior
- Represents direct parse result before semantic analysis
- Types: `Machine`, `Function`, `State`, `Case`, `Command`, `Conditional`
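With Ruby 3.2's `Data`, the node definitions can be as small as this (field lists are illustrative guesses, not the final schema):

```ruby
# Pure-data AST nodes: structure only, no to_rust/to_str behavior.
module AST
  Machine     = Data.define(:functions)
  Function    = Data.define(:name, :type, :params, :states)
  State       = Data.define(:name, :cases)
  Case        = Data.define(:match, :commands)
  Command     = Data.define(:op, :args)
  Conditional = Data.define(:condition, :commands)
end

# Data instances are immutable, so a later pass can never mutate the
# parse result in place; it must build a new (IR) model instead.
node = AST::Command.new(op: :emit, args: ["Text"])
```

The immutability is a feature here: it makes the "AST is pure data" principle enforceable rather than a convention.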
- Transform AST → IR with semantic analysis
- Infer SCAN optimization characters from state structure
- Infer EOF handling requirements
- Collect local variable declarations with types
- Validate consistency
- Semantic model Data structs
- Enriched with inferred information
- Types: `Parser`, `Function`, `State`, `Case`, `Command`
- States know their `scan_chars`, `is_self_looping`
- Functions know their `locals`, `emits_events`
- Render IR to target code using Liquid templates
- NO target-specific logic - just template loading and context building
- All type mapping, syntax, idioms in templates
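The spec calls for Liquid; for a dependency-free illustration this sketch uses stdlib ERB instead, but the shape is the same: the generator only builds a context and feeds it to a target-owned template.

```ruby
require "erb"

# All Rust knowledge lives in the template string; the render function
# knows nothing about the target language. (Template content is a toy.)
RUST_TEMPLATE = "<% functions.each do |f| %>fn <%= f %>() {}\n<% end %>"

def render(template, context)
  ERB.new(template).result_with_hash(context)
end

code = render(RUST_TEMPLATE, functions: %w[parse_prose parse_name])
```

In the real generator the equivalent call would be `Liquid::Template.parse(File.read(path)).render(context)`, with `path` chosen from `templates/{target}/`.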
Templates live in lib/descent/templates/{target}/:
templates/
├── rust/
│ ├── parser.liquid # Main template
│ └── _helpers.liquid # Shared macros (optional)
└── c/
├── parser.liquid
└── header.liquid
Type mapping:
{%- case type.kind -%}
{%- when "string" -%}ChunkSlice
{%- when "bracket" -%}{%- comment -%}void - emits events{%- endcomment -%}
{%- endcase -%}

State machine generation:
{% for state in func.states %}
State::{{ state.name | pascalcase }} => {
{% if state.scannable %}
match self.scan_to{{ state.scan_chars | size }}({{ state.scan_chars | map: "rust_char" | join: ", " }}) {
{% else %}
match self.peek() {
{% endif %}
...
}
{% endfor %}

Emit patterns:
{% case cmd.args.value %}
{% when "ElementStart" %}
{ let name = Some(self.term()); self.emit(StreamingEvent::ElementStart { name, span: self.span_from_mark() }); }
{% when "Text" %}
{ let content = self.term(); self.emit(StreamingEvent::Text { content, span: self.span_from_mark() }); }
{% endcase %}

The IR builder automatically infers SCAN optimization from state structure:
Rule: If a state has a default case that self-loops (|default | -> |>>),
the explicit character cases become SCAN targets.
Example DSL:
|state[:prose]
|c[\n] |.newline | emit(Text) |>> :line
|c[<P>] |.pipe | emit(Text) |>> :check_element
|default |.collect | -> |>>
Inferred: scan_chars = ["\n", "|"]
Generated code uses memchr2 instead of per-character loop.
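The inference rule itself is small. A Ruby sketch (the case shape is an assumption about the IR, not its final form):

```ruby
# SCAN inference: if the default case self-loops, every explicit
# character case becomes a scan target for memchr-style skipping.
ScanCase = Data.define(:match, :transition) # match: a char or :default

def scan_chars(cases)
  default = cases.find { |c| c.match == :default }
  return [] unless default && default.transition == :self_loop
  cases.reject { |c| c.match == :default }.map(&:match)
end

# The :prose state from the DSL example above:
prose = [
  ScanCase.new(match: "\n",     transition: :line),
  ScanCase.new(match: "|",      transition: :check_element),
  ScanCase.new(match: :default, transition: :self_loop)
]
```

Calling `scan_chars(prose)` yields `["\n", "|"]`, matching the inference shown above; a state whose default exits yields `[]` and falls back to per-character matching.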
Three sources of EOF behavior:
- Explicit `|eof` handler: the DSL specifies exactly what to do
- `EXPECTS(x)` annotation: implies an error if EOF arrives before `x`
- Return type: CONTENT types emit on EOF, BRACKET types may error
The IR builder validates that every reachable state has EOF handling (explicit or inferred) to prevent infinite loops.
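That validation can be a single pass over the IR. A sketch, assuming a simplified state shape with the two inference sources flattened into flags:

```ruby
# Every reachable state must resolve EOF behavior, explicitly or by
# inference; otherwise the generated parser could spin forever at EOF.
# The EofState shape here is an illustrative simplification.
EofState = Data.define(:name, :explicit_eof, :inferred_eof)

def validate_eof!(states)
  missing = states.reject { |s| s.explicit_eof || s.inferred_eof }
  return if missing.empty?
  raise "states without EOF handling (would loop forever at EOF): " \
        "#{missing.map(&:name).join(', ')}"
end
```

Running this in the IR builder (before code generation) turns a subtle runtime hang into a build-time error naming the offending states.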
The .desc format is valid UDON. This enables:
- Now: descent parses `.desc` using a hand-coded Ruby lexer/parser
- Later: when libudon matures, descent can use the UDON parser
- Full circle: descent generates the parser that parses its own input
To enable this:
- Keep `.desc` syntax strictly UDON-compatible
- Add no extensions that would break UDON parsing
- Document any UDON features we rely on
# Basic usage
descent udon.desc # Output to stdout, Rust target
# Specify target
descent --target rust udon.desc
descent --target c udon.desc
# Output to file
descent -o parser.rs udon.desc
# Enable trace output in generated parser
descent --trace udon.desc
# Help
descent --help
descent --version

- `test/lexer_test.rb`: tokenization edge cases
- `test/parser_test.rb`: AST construction
- `test/ir_builder_test.rb`: inference and validation
- `test/fixtures/minimal.desc`: smallest valid parser
- `test/fixtures/udon.desc`: full UDON parser (symlink to libudon)
- Round-trip: parse → IR → generate → compile → run
- Any valid `.desc` should produce compilable output
- The generated parser should accept what the spec says it accepts
# Install dependencies
bundle install
# Run tests
dx test
# Lint
dx lint
dx lint --fix
# Build and install gem locally
dx gem install
# Generate parser from .desc
descent --target rust udon.desc > parser.rs

descent/
├── descent.gemspec
├── Gemfile
├── mise.toml
├── .rubocop.yml
├── implementation-spec.md # This file
├── exe/
│ └── descent # CLI entry point
├── lib/
│ ├── descent.rb # Main module
│ └── descent/
│ ├── version.rb
│ ├── ast.rb # Pure Data AST nodes
│ ├── ir.rb # Semantic IR nodes
│ ├── lexer.rb # Tokenizer
│ ├── parser.rb # AST builder
│ ├── ir_builder.rb # AST → IR transformation
│ ├── generator.rb # Template rendering
│ └── templates/
│ ├── rust/
│ │ └── parser.liquid
│ └── c/
│ ├── parser.liquid
│ └── header.liquid
└── test/
├── test_helper.rb
├── descent_test.rb
├── lexer_test.rb
├── parser_test.rb
└── fixtures/
└── minimal.desc
Status vs the full spec in libudon/generator/parser-gen.md:
| Feature | Notes |
|---|---|
| Type system | BRACKET/CONTENT/INTERNAL with correct semantics |
| Basic functions | `\|function[name:Type] :params` |
| States | `\|state[:name]` |
| Character matching | `\|c[chars]`, `\|default` |
| Character escapes | `\n`, `\t`, `<P>`, `<L>`, `<R>`, `<LB>`, `<RB>` |
| Transitions | `\|>>`, `\|>> :state`, `\|return` |
| Function calls | `\| /function`, `\| /function(args)` |
| Variables | Assignment, `+=`, `-=` |
| Single-line conditionals | `\|if[condition]` |
| SCAN inference | Auto-detected from self-looping default |
| Line numbers | Errors/warnings include line numbers |
| Validator | Duplicate types, undefined refs, empty states |
| Feature | Status |
|---|---|
| Character classes | `LETTER`, `LABEL_CONT` parsed as `special_class`; template needs handling |
| Advance-to | `->[\n]` parsed; template needs to generate memchr |
| MARK/TERM | Parsed, but auto-MARK for CONTENT not implemented |
| Feature | Notes |
|---|---|
| EXPECTS annotation | `\| EXPECTS(])` for unclosed delimiter errors |
| Inline literals | `TypeName(literal)`, `TypeName(USE_MARK)` |
| Combined char classes | `\|c[LETTER'[.?!*+]` |
| TERM adjustments | `TERM(-1)` |
| PREPEND | `PREPEND(\|)` |
| Return with value | `\|return value` for INTERNAL types |
| Built-in /error | Should be recognized without function definition |
| Rust template | Placeholder only - needs full implementation |
| C template | Not started |
| Feature | Notes |
|---|---|
| Bootstrap mode | Use UDON parser to parse .desc |
| Async callbacks | Non-blocking backpressure |
The EXPECTS(x) annotation can likely be inferred from state structure rather than
requiring explicit declaration:
State-level inference:
- If a state has a self-looping default (or no default)
- And exactly ONE character case that exits (return or transition out)
- And that case matches a single delimiter character
- Then infer `EXPECTS(that_char)`
Example:
|state[:main]
|c["] | TERM | -> |return ; ← only exit
|c[\\] | -> | -> |>> ; loops
|default | -> |>> ; loops
Inferred: EXPECTS(")
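The state-level rule is mechanical enough to sketch directly (the case shape here is an assumption about the IR):

```ruby
# State-level EXPECTS inference: a self-looping (or absent) default plus
# exactly one single-character exit case => that character is expected.
ExCase = Data.define(:match, :exits) # exits: case returns or leaves the state

def infer_expects(cases)
  default = cases.find { |c| c.match == :default }
  return nil if default&.exits                 # default escapes: no inference
  exits = cases.select { |c| c.match != :default && c.exits }
  return nil unless exits.size == 1            # must be the ONLY way out
  ch = exits.first.match
  ch.is_a?(String) && ch.size == 1 ? ch : nil  # single delimiter char only
end

# The string-scanning state from the example above:
string_main = [
  ExCase.new(match: '"',      exits: true),  # only exit
  ExCase.new(match: "\\",     exits: false), # escape handling, loops
  ExCase.new(match: :default, exits: false)  # loops
]
```

On `string_main` this infers `"` as the expected delimiter; a state with two distinct exit characters infers nothing, which matches the "exactly ONE exit" restriction above.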
Function-level inference:
- If all terminal paths require the same delimiter to exit → function EXPECTS that delimiter
- Bonus: if the function is called after seeing the opener (e.g., `[`), it can generate a meaningful error like "unclosed bracket" rather than just "expected ]"
Nested delimiters:
|function[brace_comment] | depth = 1
|state[:main]
|c[{] | depth += 1 |>>
|c[}] | depth -= 1
|if[depth == 0] |return ; exit requires } when balanced
|default | -> |>>
Inferred: EXPECTS(}) - only way out is via } when depth reaches 0.
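The nested case can be checked by mentally executing the machine. A toy Ruby walk of the `brace_comment` logic (input handling simplified) showing that `}` is indeed the only exit:

```ruby
# Depth rises on '{', falls on '}', and the function returns only when
# a '}' brings depth back to zero: the closing brace is the sole exit.
def brace_comment_exit_char(input)
  depth = 1
  input.each_char do |ch|
    depth += 1 if ch == "{"
    if ch == "}"
      depth -= 1
      return ch if depth.zero?  # the only way out: inferred EXPECTS(})
    end
  end
  nil # EOF before balance; the inferred EXPECTS turns this into an error
end
```

`brace_comment_exit_char("a{b}c}")` returns `"}"`, while unbalanced input returns `nil`, which is exactly the EOF case the EXPECTS annotation is meant to diagnose.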
The IR provides enough structure for useful static analysis:
Infinite loop detection:
- States where all cases (including default) self-loop with no exit path
- States that advance but never consume (missing `->` in loops)
- Cycles with no reachable return path
Unreachable state detection:
- States with no incoming transitions
- Cases that can never match (e.g., checking for char already consumed)
- Dead code after unconditional returns
Type consistency:
- CONTENT functions that never MARK
- BRACKET functions that return without proper nesting
- Emitting wrong type (e.g., emitting Text from Name function)
Character class conflicts:
- Overlapping cases (e.g., `|c[a]` and `|LETTER` both match 'a')
- Redundant cases shadowed by earlier matches
These analyses would run in the IR builder or validator, producing warnings/errors before code generation.
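For instance, unreachable-state detection reduces to a breadth-first walk over the IR's transition graph (the graph shape here is assumed, not descent's actual IR):

```ruby
# Flag states never visited when walking transitions from the entry state.
def unreachable_states(transitions, entry)
  seen  = [entry]
  queue = [entry]
  until queue.empty?
    (transitions[queue.shift] || []).each do |dest|
      next if seen.include?(dest)
      seen << dest
      queue << dest
    end
  end
  transitions.keys - seen
end

graph = {
  main:   [:prose, :name],
  prose:  [:main],
  name:   [],
  orphan: [:main]  # no incoming transitions -> should be flagged
}
```

Here `unreachable_states(graph, :main)` reports `[:orphan]`; the other analyses (infinite loops, shadowed cases) are similar traversals over the same IR.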
- Separation of concerns: No target-specific code outside templates
- Template-driven: Adding a new target = new template directory
- Correctness: Generated parser passes existing libudon test suite
- Performance: Generated Rust parser achieves ≥2 GiB/s
- Maintainability: Core Ruby code under 500 lines total
- Bootstrappable: the `.desc` format remains valid UDON