feat(lsp): validate conditional requirements from RFC 2119 descriptions #257

Draft
bennypowers wants to merge 1 commit into main from lsp/validations-from-docs-rfc2119

Conversation

@bennypowers
Owner

Experimental — this branch explores extracting and enforcing conditional requirements from natural language descriptions in custom elements manifests.

Summary

  • Extracts RFC 2119 conditional requirements from attribute/slot descriptions (e.g. "If you set `variant` to 'icon', you MUST also set `accessible-label`")
  • Produces LSP error diagnostics when HTML violates those requirements
  • Only MUST/REQUIRED/SHALL keywords trigger diagnostics (SHOULD/MAY ignored to reduce noise)

Approach: Signal-based extraction

Rather than matching whole-sentence regex templates (fragile, limited to exact phrasings), this uses a signal-based pipeline:

  1. Extract signals from each sentence: backtick-quoted attribute names, RFC 2119 keywords, conditional markers (if/when), quoted values, negation words
  2. Infer relationships from relative positions of those signals — the attr nearest the conditional marker is the condition; the attr nearest the MUST keyword is the requirement
  3. Evaluate rules against the actual HTML element's attributes

This handles arbitrary verbs, passive voice, negated conditions, OR values, and any word order — because it never looks at verbs or sentence structure.
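
A minimal sketch of the position-based inference described above, assuming one condition and one requirement per sentence. The regexes and the `Rule`, `attrHit`, `nearest`, and `ExtractRule` names are hypothetical illustrations, not the PR's actual code:

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical signal patterns; the PR's actual regexes may differ.
var (
	attrRe    = regexp.MustCompile("`([a-z][a-z0-9-]*)`")
	keywordRe = regexp.MustCompile(`\b(MUST|REQUIRED|SHALL)\b`)
	condRe    = regexp.MustCompile(`(?i)\b(if|when)\b`)
	valueRe   = regexp.MustCompile(`'([^']*)'`)
)

// Rule reads: when Condition is set to Value, Required must be present.
type Rule struct {
	Condition, Value, Required string
}

// attrHit is a backtick-quoted attribute name and its byte offset.
type attrHit struct {
	name string
	pos  int
}

func abs(n int) int {
	if n < 0 {
		return -n
	}
	return n
}

// nearest returns the index of the hit closest to pos, ignoring index skip.
func nearest(hits []attrHit, pos, skip int) int {
	best := -1
	for i, h := range hits {
		if i == skip {
			continue
		}
		if best == -1 || abs(h.pos-pos) < abs(hits[best].pos-pos) {
			best = i
		}
	}
	return best
}

// ExtractRule infers a rule purely from the relative positions of signals.
func ExtractRule(sentence string) (Rule, bool) {
	kw := keywordRe.FindStringIndex(sentence)
	cond := condRe.FindStringIndex(sentence)
	if kw == nil || cond == nil {
		return Rule{}, false
	}
	var hits []attrHit
	for _, m := range attrRe.FindAllStringSubmatchIndex(sentence, -1) {
		hits = append(hits, attrHit{sentence[m[2]:m[3]], m[0]})
	}
	if len(hits) < 2 {
		return Rule{}, false
	}
	c := nearest(hits, cond[0], -1) // attr nearest the conditional marker
	r := nearest(hits, kw[0], c)    // remaining attr nearest the keyword
	rule := Rule{Condition: hits[c].name, Required: hits[r].name}
	if v := valueRe.FindStringSubmatch(sentence); v != nil {
		rule.Value = v[1]
	}
	return rule, true
}

func main() {
	fmt.Println(ExtractRule("If you set `variant` to 'icon', you MUST also set `accessible-label`."))
	fmt.Println(ExtractRule("`accessible-label` is REQUIRED when `variant` is 'icon'."))
}
```

Both the active and the passive phrasing should yield the same rule here, since only signal offsets are compared. Negation words and OR values are left out of this sketch.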

Alternatives considered

| Approach | Description | Trade-off |
| --- | --- | --- |
| Regex templates | Match exact sentence patterns like `if/when ATTR is VALUE, MUST VERB ATTR` | Brittle: fails on passive voice, unusual verbs, different word order |
| Clause parser | Split sentences into clauses, classify each as conditional/declarative | More structured but higher complexity for similar coverage |
| POS tagging (prose) | Use part-of-speech tags to match grammatical patterns like MODAL+VERB+NOUN | Adds dependency, POS errors on domain-specific terms |
| Structured manifest field | Add a `constraints` field to the CEM schema | Reliable but requires schema changes and author buy-in |
| Build-time LLM extraction | Run LLM at `cem generate` to extract rules, store as structured data | Powerful but adds LLM dependency to build step |

Test plan

  • 41 unit tests for signal extraction and rule evaluation
  • 5 LSP integration tests for end-to-end diagnostic generation
  • Full test suite passes (2214 tests)
  • Manual testing with real-world manifests (RHDS, PFE)

🤖 Generated with Claude Code

Extract and enforce RFC 2119 conditional requirements from element
attribute descriptions. When a description says e.g. "If you set
`variant` to 'icon', you MUST also set `accessible-label`", the LSP
now produces an error diagnostic if the HTML violates that rule.

Uses a signal-based NLP approach: rather than matching whole-sentence
regex templates, decomposes sentences into signals (backtick-quoted
attr names, RFC 2119 keywords, conditional markers, quoted values,
negation words) and infers relationships from relative positions.

Assisted-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Mar 6, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3fcd8e8e-7323-4e9f-a320-5fde72ae37d8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@github-actions

github-actions Bot commented Mar 6, 2026

LSP Benchmark Results

| Benchmark | PR Mean (ms) | Base Mean (ms) | Delta | Success Rate | Status |
| --- | --- | --- | --- | --- | --- |
| Startup | 2.789231 | 2.848442 | -0.06 (-2.1%) ✅ | 100% | |
| Hover | 0.47224 | 0.419426 | 0.05 (12.6%) 🐢 | 100% | |
| Completion | 1.755401 | 1.832363 | -0.08 (-4.2%) ✅ | 100% | |
| Diagnostics | 2028.22574 | 2030.647938 | -2.42 (-0.1%) ✅ | 100% | |
| Attribute Hover | 0 | 0 | 0.00 (0.0%) ➖ | 100% | |
| References | 33.7515 | 31.6046 | 2.15 (6.8%) ⚠️ | 100% | |

View this benchmark run in GitHub Actions

💡 Tip: Raw JSON results are available in workflow artifacts if needed.


Generate Benchmarks

| | Branch | Total Time (s) | # Runs | Avg Time/run (s) | Output Size (kb) | Perf/kb (s/kb) |
| --- | --- | --- | --- | --- | --- | --- |
| Base | main | 4.37466 | 6 | 0.72911 | 156 | 0.0280427 |
| PR | lsp/validations-from-docs-rfc2119 | 4.37375 | 6 | 0.728958 | 156 | 0.0280369 |
| Δ | | -0.0009 | 0 | -0.0002 | 0 | -0.0000 👍 |

Perf/kb delta ratio: 1.00x 👍

View this benchmark run in GitHub Actions

💡 Tip: Raw JSON outputs are available in workflow artifacts if needed.

@bennypowers
Owner Author

@paceaux I need a linguist's eye, WDYT about this? The goal is to extract meaning from user-written documentation, to validate element usage.

E.g. a user documents a button element with "When variant attribute is set to icon, you MUST include an accessible-label attribute". This PR attempts to extract rules from such text, which the LSP can then use to flag invalid usage in HTML documents or templates. I'd like to avoid shipping an LLM engine for this, for performance reasons.
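
To make the goal concrete, here is a rough sketch of the evaluation side, assuming a rule has already been extracted from the description. The `Rule` type and `Violations` function are hypothetical names for illustration, and the element's attributes are modelled as a plain map:

```go
package main

import "fmt"

// Rule is a hypothetical extracted requirement, not the PR's actual type.
type Rule struct {
	Condition string // attribute that triggers the rule
	Value     string // triggering value; empty means "any value"
	Required  string // attribute that must then be present
}

// Violations returns one message per rule that the element's attributes break.
func Violations(attrs map[string]string, rules []Rule) []string {
	var out []string
	for _, r := range rules {
		got, present := attrs[r.Condition]
		if !present || (r.Value != "" && got != r.Value) {
			continue // condition not met, so the rule does not apply
		}
		if _, ok := attrs[r.Required]; !ok {
			out = append(out, fmt.Sprintf("%q is required when %q is %q", r.Required, r.Condition, got))
		}
	}
	return out
}

func main() {
	rules := []Rule{{Condition: "variant", Value: "icon", Required: "accessible-label"}}
	bad := map[string]string{"variant": "icon"}
	good := map[string]string{"variant": "icon", "accessible-label": "Close"}
	fmt.Println(Violations(bad, rules))  // one violation expected
	fmt.Println(Violations(good, rules)) // no violations expected
}
```

Each returned message would map onto one LSP error diagnostic on the offending element.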

@paceaux

paceaux commented Mar 6, 2026

@bennypowers This is a complex task, and it's best accomplished with a variety of approaches (and, FWIW, none of the alternative options you listed are mutually exclusive).

The absolute best approach would be to use a library, but for reasons you've established (and I agree with) that'd be overkill for the task at hand.

I'm not familiar with Go at all, so that's a bit of a limiting factor here.

Based on what I think I understand about your stated goals (extract natural language statements from an element manifest, use RFC 2119 as a kind of "mapping" from those statements, evaluate an element against the extracted rules, and report on the result), I think your described approach is naive, but it still may work well because of the narrow range you're working in.

You need to test this with non-happy path statements: misspellings, capitalization issues, mismatched quotes, rearranged words, etc.

What we're talking about are first conditional sentences: sentences where the conditional signal has a single known and expected implication. "if" and "when" are the most common, for sure. But you've also got

  • unless
  • until
  • in case
  • as long as
  • provided

I'd like to see some of those included in your conditionalRegex.
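
A rough sketch of what the wider marker set could look like in Go. The pattern, the `Marker` type, and `FindMarkers` are hypothetical (I haven't copied the PR's conditionalRegex), and treating only "unless" as negating is an assumption; "until" arguably inverts the condition too:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// conditionalRe is a hypothetical extended marker pattern, matched case-insensitively.
var conditionalRe = regexp.MustCompile(`(?i)\b(if|when|unless|until|in case|as long as|provided)\b`)

// negating lists markers that invert the condition: "unless X" reads as "if not X".
// Limiting this to "unless" is an assumption made for the sketch.
var negating = map[string]bool{"unless": true}

// Marker is one conditional signal found in a sentence.
type Marker struct {
	Word    string // lowercased marker text
	Pos     int    // byte offset in the sentence
	Negated bool
}

// FindMarkers returns every conditional marker in the sentence, in order.
func FindMarkers(sentence string) []Marker {
	var out []Marker
	for _, m := range conditionalRe.FindAllStringIndex(sentence, -1) {
		w := strings.ToLower(sentence[m[0]:m[1]])
		out = append(out, Marker{Word: w, Pos: m[0], Negated: negating[w]})
	}
	return out
}

func main() {
	fmt.Println(FindMarkers("Unless `href` is set, you MUST set `type`."))
	fmt.Println(FindMarkers("As long as `open` is set, `label` is REQUIRED."))
}
```

The word boundaries keep "if" from matching inside words like "notification" or "modified".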

You may also want to account for at least a few modal verbs because that could influence how you map to RFC2119:

  • may
  • might
  • could
  • should
  • would
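
One way to sketch that mapping, assuming three requirement levels. The `Level` values, the tables, and `Classify` are hypothetical, and grouping "might", "could", and "would" under `Optional` is a judgment call, since they are not RFC 2119 keywords. The uppercase-only check follows RFC 8174, which gives the keywords their special meaning only when written in all capitals:

```go
package main

import (
	"fmt"
	"strings"
)

// Level is a hypothetical requirement strength, weakest to strongest.
type Level int

const (
	None Level = iota
	Optional
	Recommended
	Required
)

// levels maps lowercased modal words to a strength.
var levels = map[string]Level{
	"must": Required, "shall": Required, "required": Required,
	"should": Recommended, "recommended": Recommended,
	"may": Optional, "optional": Optional,
	// Not RFC 2119 keywords; grouping them here is an assumption.
	"might": Optional, "could": Optional, "would": Optional,
}

// rfc2119 lists the single-word keywords in their normative uppercase form.
var rfc2119 = map[string]bool{
	"MUST": true, "SHALL": true, "REQUIRED": true,
	"SHOULD": true, "RECOMMENDED": true,
	"MAY": true, "OPTIONAL": true,
}

// Classify returns the strength of a word and whether it is a normative
// RFC 2119 keyword (uppercase spelling only).
func Classify(word string) (Level, bool) {
	lvl, ok := levels[strings.ToLower(word)]
	if !ok {
		return None, false
	}
	return lvl, rfc2119[word]
}

func main() {
	for _, w := range []string{"MUST", "should", "might", "banana"} {
		lvl, normative := Classify(w)
		fmt.Println(w, lvl, normative)
	}
}
```

With a split like this, the PR's current behaviour would correspond to reporting only words that are both `Required` and normative.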

Also, don't forget the French quotes (« ») and the annoying apostrophe-as-quote for quotedValueRegex.
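
A sketch of a quote pattern along those lines. The regex and `QuotedValues` are hypothetical, not the PR's quotedValueRegex. Go's regexp package has neither backreferences nor lookbehind, so each quote pair gets its own alternative, and a straight single quote only opens a value at the start of the text or after a non-letter, which keeps the apostrophe in "don't" from starting a match:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// quotedValueRe is a hypothetical pattern with one alternative per quote style.
var quotedValueRe = regexp.MustCompile(
	`"([^"]+)"|(?:^|[^\pL\pN])'([^']+)'|“([^”]+)”|‘([^’]+)’|«([^»]+)»`)

// QuotedValues returns the contents of every quoted span, with surrounding
// whitespace trimmed (French typography puts spaces inside guillemets).
func QuotedValues(s string) []string {
	var out []string
	for _, m := range quotedValueRe.FindAllStringSubmatch(s, -1) {
		for _, g := range m[1:] {
			if v := strings.TrimSpace(g); v != "" {
				out = append(out, v)
				break // only one alternative captures per match
			}
		}
	}
	return out
}

func main() {
	fmt.Println(QuotedValues(`Set variant to "icon" or 'link'.`))
	fmt.Println(QuotedValues("If you don't set 'icon', use « link » or “text”."))
}
```

A typographic apostrophe (’) never opens a value in this sketch, because that alternative requires a left single quote (‘) first.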

What I don't see (and maybe I missed it?) is text normalization, where you set all your text to lowercase and remove special characters. That may be useful.

In cases like these, usually it goes:

  1. normalize
  2. tokenize (split on spaces or inter-sentence punctuation)
  3. do all the other things

But all the same, I think this is a good start and I'd like to see a healthy number of examples somewhere of what the text looks like.
