Establish structured OOXML/ECMA-376 spec traceability and compliance coverage #223

@stevenobiajulu

Problem

SafeDocX increasingly makes implementation and test claims that depend on OOXML / ECMA-376 requirements, but those claims are not currently tied to a structured, queryable, authoritative source of truth.

Today the repo has comments, issue bodies, docs, fixtures, runtime checks, and tests that say things like "ECMA-376 requires X" or "this mirrors canonical WordprocessingML behavior." Those claims may be correct, but they are mostly free-text. A reviewer, contributor, or AI agent cannot reliably answer:

  • Which exact ECMA-376 edition, part, section, and topic is being relied on?
  • Where is the authoritative text or canonical example for that claim?
  • Which production code paths and tests cover that fragment of the standard?
  • Which fragments are intentionally out of scope for SafeDocX?
  • Are we claiming full OOXML compliance, scoped tracked-change compliance, or only conformance for a specific supported editing surface?

As a result, internal GitHub issues and PR review threads become the de facto authority. That is useful engineering context, but it should not be the primary public-facing authority for normative OOXML behavior.

Concrete grounding example

Issue #217 is a good example of the problem, not because it is bad, but because it shows the current limitation clearly:

  • Issue #217 ("inplace atomizer: emit fragmented fields per ECMA-376 Part 4 (split <w:ins>/<w:del> at field-character boundaries)") asserts requirements from ECMA-376 Part 4 around w:fldChar, w:delInstrText, and fragmented field markup.
  • packages/docx-core/src/baselines/atomizer/inPlaceModifier.ts contains field-handling logic such as getAtomRuns(...) treating collapsed field atoms as a single logical unit, plus pre-split logic that skips collapsed field atoms and field-character elements.
  • The intended implementation behavior is highly specific: field-character runs may need to stay at sibling level while only payload runs are wrapped in w:ins / w:del.

That is exactly the sort of claim that should be traceable to a stable spec reference, a canonical fixture, and coverage status, rather than only to an issue narrative.

Why this matters

  1. Trust: Users and contributors should be able to see what SafeDocX means when it says it emits valid or conformant OOXML.
  2. Review quality: A reviewer should not have to reverse-engineer a spec claim from an issue thread or web search.
  3. Agent usability: AI agents working in the repo should be able to resolve spec citations locally and structurally, without relying on ad-hoc internet lookup.
  4. Scope control: We should be explicit about what we do not attempt to implement. This issue is not about implementing all of ECMA-376.
  5. Regression safety: Tests should be able to declare the normative spec fragment they exercise, making coverage and drift visible over time.
  6. Professionalism: Public-facing code comments should prefer authoritative standards references as primary support. Internal GitHub issues can remain useful secondary context.

Important scope distinction

This should not become a vague claim of "SafeDocX supports all OOXML."

The goal is to create a structured way to say, for example:

  • This source/test/fixture is intended to satisfy ECMA-376 5th edition, Part 4, section/topic X.
  • This spec fragment is covered by tests A and B and runtime validator C.
  • This adjacent spec fragment is intentionally out of scope because SafeDocX does not support that editing surface.
  • This behavior is implementation-informed rather than directly normative, and therefore should be marked as such.
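
As a rough illustration of the statements above, the sketch below models each one as a small record and renders it as text. The type names, field names, and spec IDs are all hypothetical placeholders, not existing SafeDocX types or real manifest IDs.

```typescript
// Hypothetical record shape; every name here is illustrative only.
type ClaimKind = "covered" | "out-of-scope" | "implementation-informed";

interface ScopeClaim {
  specId: string;      // placeholder; a real ID would come from a spec manifest
  kind: ClaimKind;
  evidence: string[];  // tests or runtime validators that exercise the fragment
  rationale?: string;  // expected for out-of-scope entries
}

// Render one claim as a single human-readable line.
function renderClaim(claim: ScopeClaim): string {
  switch (claim.kind) {
    case "covered":
      return `${claim.specId}: covered by ${claim.evidence.join(", ")}`;
    case "out-of-scope":
      return `${claim.specId}: out of scope (${claim.rationale ?? "no rationale recorded"})`;
    case "implementation-informed":
      return `${claim.specId}: implementation-informed, not directly normative`;
  }
}

const exampleClaims: ScopeClaim[] = [
  { specId: "example.topic-x", kind: "covered", evidence: ["test A", "test B", "runtime validator C"] },
  {
    specId: "example.adjacent-topic",
    kind: "out-of-scope",
    evidence: [],
    rationale: "SafeDocX does not support that editing surface",
  },
  { specId: "example.observed-behavior", kind: "implementation-informed", evidence: [] },
];

const renderedClaims: string[] = exampleClaims.map(renderClaim);
```

Keeping the claim kind as a closed union means a new kind cannot be added without also deciding how it is reported.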

Non-goals for this issue

  • Do not implement the whole ECMA-376 standard.
  • Do not decide the final design in this issue.
  • Do not replace Word / LibreOffice / docx4j / pandoc interoperability testing.
  • Do not treat GitHub issues as normative sources; they should remain context and project history.
  • Do not silently copy or modify standards text without preserving required notices and verifying the applicable Ecma terms.

Categories of solution to evaluate later

This issue should first capture the problem. Follow-up design can choose between these categories or combine them:

  1. Pinned standards source / corpus

    • Vendored ECMA-376 artifacts, or a script that fetches official artifacts by pinned URL and checksum.
    • If vendored, preserve copyright notices and keep the material unchanged/up-to-date as required by Ecma's text copyright policy.
    • Consider whether to store original ZIP/PDF files, extracted section text, structured indexes, or only a manifest plus fetch script.
  2. Normative reference IDs

    • Create stable internal IDs such as ooxml.ecma376.5ed.part4.17.16.5.fldChar.
    • Each ID should record edition, part, section/topic, title, normative/informative status, source artifact, checksum/page/anchor, and any known errata or related implementation notes.
  3. Structured annotations in code and tests

    • Add JSDoc or test metadata such as @ooxmlSpec <id> / @ooxmlCoverage <id>.
    • Allow source comments, XML constants, fixtures, runtime validators, and tests to reference the same spec IDs.
  4. Coverage matrix

    • Track per-spec-fragment status: covered, partial, out-of-scope, not-yet-covered, implementation-note, etc.
    • Include rationale for out-of-scope decisions.
    • Generate a report so maintainers can see coverage by part/section and avoid accidental overclaims.
  5. Spec-backed fixtures

    • Keep canonical OOXML examples or minimized fixtures associated with spec IDs.
    • Where examples are copied from the standard, preserve precise source attribution and required notices.
    • Where examples are derived/minimized, mark them as derived and explain the transformation.
  6. Lint / CI enforcement

    • Fail CI when a code/test annotation references a missing spec ID.
    • Optionally warn when a public-facing comment cites only an internal issue for an OOXML rule that has a known spec ID.
    • Optionally verify checksums of vendored or fetched standards artifacts.
  7. Public conformance/support documentation

    • Extend existing support/conformance docs to state the supported OOXML surface precisely.
    • Separate normative ECMA-376 conformance claims from product-interoperability findings and project-specific design choices.
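
To make categories 3 and 6 more concrete, here is a minimal sketch of a check that scans source text for @ooxmlSpec / @ooxmlCoverage tags and reports any ID missing from a manifest. The tag names and the example ID come from this issue; the function name, the ID character set, and the in-memory manifest are assumptions made for illustration, and no design has been chosen yet.

```typescript
// Assumed ID syntax: dot-separated segments of letters, digits, and underscores.
const TAG_PATTERN = /@ooxml(?:Spec|Coverage)\s+([A-Za-z0-9_]+(?:\.[A-Za-z0-9_]+)*)/;

interface AnnotationProblem {
  file: string;
  line: number; // 1-based line number of the offending tag
  id: string;   // the referenced ID that is not in the manifest
}

// Hypothetical helper: returns one problem per tag whose ID is unknown.
function findUnknownSpecIds(
  file: string,
  source: string,
  knownIds: ReadonlySet<string>,
): AnnotationProblem[] {
  const problems: AnnotationProblem[] = [];
  source.split("\n").forEach((text, index) => {
    // A fresh global regex per line keeps lastIndex state from leaking between lines.
    const pattern = new RegExp(TAG_PATTERN.source, "g");
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(text)) !== null) {
      const id = match[1];
      if (id !== undefined && !knownIds.has(id)) {
        problems.push({ file, line: index + 1, id });
      }
    }
  });
  return problems;
}
```

A CI step could run this over every source and test file and fail when the returned list is non-empty.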

Acceptance criteria for a first pass

  • A short ADR/proposal exists that records the problem, candidate approaches, and licensing/copyright constraints for storing or extracting ECMA text.
  • The repo has an initial structured spec-reference manifest with at least the ECMA-376 Part 4 field-fragmentation references needed by issue #217 ("inplace atomizer: emit fragmented fields per ECMA-376 Part 4 (split <w:ins>/<w:del> at field-character boundaries)").
  • At least one production source comment or test references the manifest ID instead of only free-text section prose.
  • At least one out-of-scope or partial entry exists, to make clear this is scoped coverage rather than a full-standard compliance claim.
  • A simple report or script can summarize referenced spec IDs and coverage status.
  • Contributor guidance states that internal issues may be cited as project history, but normative OOXML claims should cite the spec reference ID when one exists.
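
The "simple report or script" criterion could start as small as the sketch below. It counts manifest entries per coverage status and lists entries marked covered or partial that no annotation references. The status names are the ones floated under "Coverage matrix"; the interfaces, function name, and placeholder IDs are hypothetical.

```typescript
type Status = "covered" | "partial" | "out-of-scope" | "not-yet-covered" | "implementation-note";

interface ManifestEntry {
  id: string;
  status: Status;
}

interface CoverageReport {
  byStatus: Record<Status, number>;
  unreferenced: string[]; // covered/partial entries with no annotation pointing at them
}

// Hypothetical summary over an in-memory manifest and the IDs found in annotations.
function summarizeCoverage(
  entries: readonly ManifestEntry[],
  referencedIds: readonly string[],
): CoverageReport {
  const byStatus: Record<Status, number> = {
    "covered": 0,
    "partial": 0,
    "out-of-scope": 0,
    "not-yet-covered": 0,
    "implementation-note": 0,
  };
  const referenced = new Set<string>(referencedIds);
  const unreferenced: string[] = [];

  entries.forEach((entry) => {
    byStatus[entry.status] += 1;
    const claimsCoverage = entry.status === "covered" || entry.status === "partial";
    if (claimsCoverage && !referenced.has(entry.id)) {
      unreferenced.push(entry.id);
    }
  });

  return { byStatus, unreferenced };
}
```

Flagging unreferenced covered entries is one way to surface drift: a manifest that claims coverage no test or source comment declares.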

External references to consider

Related repo context
