Vendor ECMA-376 source artifacts and generate OOXML vocabulary/spec-reference manifest #224

@stevenobiajulu

Context

Child issue of #223.

We want SafeDocX's OOXML constants, semantic tag groups, implementation comments, and tests to be traceable to the official ECMA-376 standard artifacts rather than to hand-written strings or internal issue history alone.

Decision from project discussion: vendor the official ECMA-376 ZIP files unchanged rather than relying only on a fetch script. The reason is durability: the ECMA website URL structure may change, while the repository should continue to preserve the exact standard edition used for review, agent lookup, and reproducible code generation.

Source artifacts

Use the ECMA-376 official publication page:

https://ecma-international.org/publications-and-standards/standards/ecma-376/

Vendor the official ZIP downloads for the four parts listed there:

  • Part 1: Fundamentals And Markup Language Reference, 5th edition, December 2016
  • Part 2: Open Packaging Conventions, 5th edition, December 2021
  • Part 3: Markup Compatibility and Extensibility, 5th edition, December 2015
  • Part 4: Transitional Migration Features, 5th edition, December 2016

Required repository treatment

  • Store the downloaded ZIPs unchanged.
  • Add a SHA256SUMS file for the exact vendored ZIPs.
  • Add a README.md next to the artifacts explaining source URL, download date, edition/part metadata, and why the artifacts are vendored.
  • Add the required Ecma copyright notice/license/disclaimer text or a pointer file sufficient to satisfy the Ecma text copyright policy.
  • Do not edit the standard artifacts in place.
  • Any extracted/generated artifacts must clearly identify themselves as derived from the unchanged vendored source and must record the input ZIP checksum.
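The checksum requirement could be enforced by a small verification step. The following is a minimal sketch, assuming SHA256SUMS uses the conventional `sha256sum` line format (hex digest, a space, a space or `*`, then the file name). The names `parseSha256Sums`, `sha256Hex`, and `findChecksumFailures` are hypothetical, and the files are passed in as an in-memory map so the logic stays independent of the eventual vendored directory layout.

```typescript
import { createHash } from "node:crypto";

/** One line of a SHA256SUMS file: expected digest plus file name. */
interface ChecksumEntry {
  sha256: string;
  fileName: string;
}

/** Parse sha256sum-style lines; blank lines are skipped, malformed lines throw. */
function parseSha256Sums(text: string): ChecksumEntry[] {
  const entries: ChecksumEntry[] = [];
  for (const line of text.split(/\r?\n/)) {
    if (line.trim() === "") continue;
    const match = /^([0-9a-fA-F]{64}) [ *](.+)$/.exec(line);
    if (!match) throw new Error(`Malformed SHA256SUMS line: ${line}`);
    entries.push({ sha256: match[1].toLowerCase(), fileName: match[2] });
  }
  return entries;
}

/** Hex-encoded SHA-256 digest of a byte buffer. */
function sha256Hex(data: Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

/** Return the names of files that are missing or whose digest differs. */
function findChecksumFailures(
  entries: ChecksumEntry[],
  files: Map<string, Uint8Array>,
): string[] {
  return entries
    .filter((entry) => {
      const data = files.get(entry.fileName);
      return data === undefined || sha256Hex(data) !== entry.sha256;
    })
    .map((entry) => entry.fileName);
}
```

A real script would presumably fill the map by reading each vendored ZIP from disk and exit non-zero when the returned list is not empty.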

Implementation direction

Create a small spec-ingestion layer that can read the vendored ECMA artifacts and produce structured metadata usable by TypeScript and docs.

At minimum, produce:

  1. Artifact manifest

    • Edition
    • Part
    • Title
    • Publication date
    • Vendored path
    • SHA-256
    • Source URL
    • Notes/copyright status
  2. Spec-reference manifest

    • Stable internal ID, e.g. ooxml.ecma376.5ed.part4.<topic-or-section>
    • Edition / part / section or topic
    • Normative vs informative where determinable
    • Source artifact and locator
    • Coverage status: covered, partial, out-of-scope, not-yet-covered, implementation-note
    • Related tests/source files
  3. Generated OOXML vocabulary registry

    • Namespace URI
    • Preferred prefix
    • Local name
    • QName form, e.g. w:fldChar
    • Clark-notation form
    • Element vs attribute where determinable
    • Source schema/artifact locator
  4. Generated TypeScript constants

    • Generated constants should represent raw vocabulary entries.
    • Existing handwritten constants such as W_FLDCHAR, W_INSTRTEXT, W_DEL, W_INS, etc. should gradually migrate to generated constants or be validated against the generated registry.
  5. Hand-authored semantic groups over generated vocabulary

    • Groups such as FIELD_CHAR_TAG_NAMES / FIELD_CODE_BOUNDARY_TAGS should not be treated as purely schema-generated.
    • They should be hand-authored semantic subsets that import generated vocabulary constants and cite spec-reference IDs.
    • Example target shape:
```typescript
/**
 * Field-code marker/payload elements that require field-context-aware splitting.
 *
 * @ooxmlSpec ooxml.ecma376.5ed.part4.fields.fragmented-track-changes
 */
export const FIELD_CODE_BOUNDARY_TAGS = new Set([
  W.FLD_CHAR.qname,
  W.INSTR_TEXT.qname,
  W.DEL_INSTR_TEXT.qname,
]);
```
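The registry fields in item 3 can all be derived from a namespace URI, a preferred prefix, and a local name. Here is one possible shape for the `W` object used in the target example. This is a sketch only: `VocabEntry`, `vocabEntry`, and `W_NS` are hypothetical names, the entries are written by hand here rather than generated, and the source schema/artifact locator is omitted because its format is not yet decided. The namespace URI is the Transitional WordprocessingML main namespace.

```typescript
type VocabKind = "element" | "attribute";

/** One raw vocabulary entry (hypothetical shape). */
interface VocabEntry {
  namespaceUri: string;
  prefix: string;
  localName: string;
  /** Prefixed form, e.g. "w:fldChar". */
  qname: string;
  /** Clark notation: "{namespace-uri}localName". */
  clark: string;
  kind: VocabKind;
}

// Transitional WordprocessingML main namespace.
const W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

/** Build an entry, deriving the QName and Clark forms from the parts. */
function vocabEntry(
  namespaceUri: string,
  prefix: string,
  localName: string,
  kind: VocabKind,
): VocabEntry {
  return {
    namespaceUri,
    prefix,
    localName,
    qname: `${prefix}:${localName}`,
    clark: `{${namespaceUri}}${localName}`,
    kind,
  };
}

// Hand-written stand-in for what the generator might emit.
const W = {
  FLD_CHAR: vocabEntry(W_NS, "w", "fldChar", "element"),
  INSTR_TEXT: vocabEntry(W_NS, "w", "instrText", "element"),
  DEL_INSTR_TEXT: vocabEntry(W_NS, "w", "delInstrText", "element"),
  INS: vocabEntry(W_NS, "w", "ins", "element"),
  DEL: vocabEntry(W_NS, "w", "del", "element"),
  FLD_CHAR_TYPE: vocabEntry(W_NS, "w", "fldCharType", "attribute"),
};
```

Keeping both forms on the entry lets semantic groups use whichever key the consuming XML library exposes, without re-deriving strings at each call site.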

Initial migration target

Start with the field-fragmentation area from #217:

  • w:fldChar
  • w:instrText
  • w:delInstrText
  • w:ins
  • w:del
  • any attributes needed for these elements, such as w:fldChar/@w:fldCharType, w:id, w:author, and w:date

Then add a report showing whether packages/docx-core/src/baselines/atomizer/inPlaceModifier.ts references generated/validated OOXML names for those elements.
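One way to approach that report is to scan the source text for quoted QName literals and check each against the registry. A rough sketch, assuming a hypothetical `reportQNameUsage` function and a plain set of known QNames. It only detects handwritten string literals, so references made through generated constants would need a separate identifier-based check.

```typescript
interface QNameUsageReport {
  /** Literals found in the source that exist in the registry. */
  validated: string[];
  /** Literals found in the source that the registry does not know. */
  unknown: string[];
}

/** Find quoted "<prefix>:<localName>" literals and classify them. */
function reportQNameUsage(
  source: string,
  prefix: string,
  registry: ReadonlySet<string>,
): QNameUsageReport {
  // Escape regex metacharacters in the prefix before embedding it.
  const escaped = prefix.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(
    "[\"'`](" + escaped + ":[A-Za-z_][\\w.-]*)[\"'`]",
    "g",
  );
  const seen = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    seen.add(match[1]);
  }
  const all = Array.from(seen).sort();
  return {
    validated: all.filter((qname) => registry.has(qname)),
    unknown: all.filter((qname) => !registry.has(qname)),
  };
}
```

Run over the contents of inPlaceModifier.ts with prefix `w`, a non-empty `unknown` list would flag names that still need a registry entry or a correction.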

Acceptance criteria

  • Official ECMA-376 ZIP files for Parts 1-4 are vendored unchanged.
  • SHA-256 checksums are committed and verified by a script.
  • Copyright/license/disclaimer handling is documented beside the artifacts.
  • A machine-readable artifact manifest exists.
  • A first spec-reference manifest exists, covering at least the field-fragmentation references from #217 (inplace atomizer: emit fragmented fields per ECMA-376 Part 4, splitting <w:ins>/<w:del> at field-character boundaries).
  • A generated OOXML vocabulary registry exists for the initial WordprocessingML names listed above.
  • At least one TypeScript source file or test references a generated/validated OOXML vocabulary entry and an @ooxmlSpec ID.
  • A simple report summarizes referenced spec IDs, coverage statuses, and vocabulary constants used by source/tests.
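The summary report could start as a join between `@ooxmlSpec` tags found in source and the coverage statuses in the spec-reference manifest. A minimal sketch: the coverage values are the ones listed above, while `SpecReference`, `extractSpecIds`, `buildSpecReport`, and the `unknown-id` marker are hypothetical.

```typescript
type CoverageStatus =
  | "covered"
  | "partial"
  | "out-of-scope"
  | "not-yet-covered"
  | "implementation-note";

/** Minimal slice of a spec-reference manifest entry. */
interface SpecReference {
  id: string;
  coverage: CoverageStatus;
}

interface SpecReportRow {
  id: string;
  /** "unknown-id" marks a tag with no matching manifest entry. */
  coverage: CoverageStatus | "unknown-id";
}

/** Collect the distinct IDs that follow an @ooxmlSpec tag, sorted. */
function extractSpecIds(source: string): string[] {
  const pattern = /@ooxmlSpec\s+([A-Za-z0-9._-]+)/g;
  const seen = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    seen.add(match[1]);
  }
  return Array.from(seen).sort();
}

/** Join the referenced IDs against the manifest. */
function buildSpecReport(
  source: string,
  manifest: SpecReference[],
): SpecReportRow[] {
  const byId = new Map<string, CoverageStatus>();
  for (const ref of manifest) byId.set(ref.id, ref.coverage);
  return extractSpecIds(source).map((id) => ({
    id,
    coverage: byId.get(id) ?? "unknown-id",
  }));
}
```

Surfacing unknown IDs explicitly, rather than dropping them, makes a mistyped `@ooxmlSpec` tag visible in the report.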

Non-goals

  • No claim of full OOXML implementation coverage.
  • No modification of the ECMA artifacts themselves.
  • No requirement to migrate every existing OOXML string constant in the first PR.
  • No replacement for interoperability tests against Microsoft Word, LibreOffice, docx4j, or other consumers/producers.
