Skip to content

feat: CLI validation, CDN drawer integration, and citation data quality#387

Merged
bensonwong merged 13 commits intomainfrom
feat/cli-cdn-validation-improvements
Mar 30, 2026
Merged

feat: CLI validation, CDN drawer integration, and citation data quality#387
bensonwong merged 13 commits intomainfrom
feat/cli-cdn-validation-improvements

Conversation

@bensonwong
Copy link
Copy Markdown
Collaborator

@bensonwong bensonwong commented Mar 30, 2026

Summary

  • Citation data validation (validateCitationData.ts): detect extraction artifacts (garbled text, encoding errors, truncation markers) and flag unreliable citation data before it reaches the UI
  • Garbled snippet normalization (normalizeSnippetText in react/utils.ts): clean up mojibake, collapsed whitespace, and encoding artifacts in snippet text; show "imprecise location" note when snippet quality is degraded
  • CLI improvements: strip existing injection before re-inject to prevent duplication, add --indicator flag for indicator variant selection, improved report utilities
  • CDN drawer integration: wire CitationDrawer into the vanilla/CDN runtime with [data-dc-drawer-trigger] binding, programmatic showDrawer()/hideDrawer() API, and proper lifecycle cleanup
  • CDN popover enrichment: pass downloadUrl and urlAccessStatus through to CDN popover content
  • Regression tests: citation key stability tests, column detection regression tests, snippet normalization tests, citation data validation tests (290+ new test cases)
  • Housekeeping: gitignore .deepcitation/ CLI output directory, address prior review findings

Test plan

  • bun run test — all 1654 tests pass
  • bunx biome ci ./src — lint clean (pre-existing errors in columnDetection.test.ts only)
  • Verify CDN drawer: add <div data-dc-drawer-trigger></div> to static HTML example, confirm drawer renders and opens
  • Verify showDrawer()/hideDrawer() programmatic API: call twice to confirm idempotency fix
  • Verify CLI inject idempotency: run deepcitation inject twice on same file, confirm no duplication
  • Verify garbled snippet handling: citation with mojibake text shows cleaned snippet + imprecise note

Benson and others added 10 commits March 30, 2026 07:57
Adds downloadUrl, urlAccessStatus, and urlVerificationError to the CDN
verification data types and mapper so the popover can show download
links and URL access diagnostics for injected HTML reports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- stripExistingInjection removes dc-data, dc-key-map, init script, and
  CDN bundle before injecting fresh copies, preventing stale duplicates
- --indicator <icon|dot|none> flag for both inject and verify-html
- Validation warnings from validateCitationData logged in verify-html

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds validateCitationData and detectExtractionArtifacts utilities that
catch common PDF/HTML text extraction issues before they reach the API:
collapsed spaces, broken hyphens, fi-ligature loss, table fragments,
missing punctuation spaces, and broken words. Based on Round 3 QA
analysis of 527 citations / 71 partials.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- normalizeSnippetText fixes collapsed spaces, missing punctuation
  spacing, and quote boundaries in verification API snippets. Uses
  reference-guided spacing from fullPhrase when available, falls back
  to regex heuristics.
- EvidenceTray shows "Exact location not specified" note when a
  verified citation lacks page number or line IDs.
- i18n: evidence.impreciseLocation added for en, es, fr, vi.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- citationKeyStability: verifies getCitationKey determinism, sensitivity
  to field changes, URL citation hashing, and regression fixtures
- columnDetection: validates detectColumns for single/multi-column PDFs,
  narrow gaps, and real-world CDC immunization schedule patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Review fixes:
- Extract stripExistingInjection to reportUtils.ts (single canonical
  location, no test duplication, no CLI side-effect import)
- Tighten CDN bundle regex to require assignment (window.DeepCitation
  Popover=) instead of mere mention, avoiding false strips
- Use initParts pattern in verifyHtml for consistency with inject
- Add validateRegexInput guard in detectExtractionArtifacts (ReDoS)
- Add "v." exclusion to missing-space-after-punctuation detector
- Document legitimateHyphens multi-segment matching behavior
- Populate citationKeyStability fixtures with frozen hash values
- Add JSDoc clarifying applyReferenceSpacing preserves garbled casing
- Document @filelasso/shared workspace-only import (circular dep)
- Auto-format via biome

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add drawer support to the CDN popover API: bind [data-dc-drawer-trigger]
elements, expose showDrawer/hideDrawer on window.DeepCitationPopover, and
refresh drawer content on update(). Include React-vs-CDN comparison
showcase for visual testing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix showDrawer() idempotency: unmount+remove container on hideDrawer()
  so next showDrawer() re-mounts with initialOpen=true (useState ignores
  initialProp on reconciliation)
- Fix update() not refreshing the programmatic drawer portal
- Replace render(null as unknown as ...) with unmountComponentAtNode
- Remove nested ternary / no-op reactVariant mapping
- Move cdn-comparison-showcase.html to src/vanilla/testing/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Map<HTMLElement, true> → Set<HTMLElement> (value was never read)
- Extract renderDrawer() helper to deduplicate 4 identical render sites
- Compute buildDrawerGroups() once in update() instead of 3 times
- Accept pre-built groups in bind/refresh to avoid redundant computation
- Remove redundant JSDoc comments restating function names

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vercel
Copy link
Copy Markdown

vercel bot commented Mar 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

4 Skipped Deployments
Project Deployment Actions Updated (UTC)
agui-chat-deepcitation Ignored Ignored Preview Mar 30, 2026 2:23am
deepcitation-langchain-rag-chat Ignored Ignored Preview Mar 30, 2026 2:23am
mastra-rag-deepcitation Ignored Ignored Preview Mar 30, 2026 2:23am
nextjs-ai-sdk-deepcitation Ignored Ignored Preview Mar 30, 2026 2:23am

// Insert space after sentence-ending punctuation + uppercase: "overruled.We" → "overruled. We"
result = result.replace(/([.;:!?])([A-Z])/g, "$1 $2");
// Insert space between letter+quote+letter: 'equal"has' → 'equal" has'
result = result.replace(/([a-zA-Z])(["'"'"])([a-zA-Z])/g, "$1$2 $3");
// Insert space after sentence-ending punctuation + uppercase: "overruled.We" → "overruled. We"
result = result.replace(/([.;:!?])([A-Z])/g, "$1 $2");
// Insert space between letter+quote+letter: 'equal"has' → 'equal" has'
result = result.replace(/([a-zA-Z])(["'"'"])([a-zA-Z])/g, "$1$2 $3");
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 30, 2026

✅ Playwright Test Report

Status: Tests passed

📊 Download Report & Snapshots (see Artifacts section)

What's in the Visual Snapshots

The gallery includes visual snapshots for:

  • 🖥️ Desktop showcase (all variants × all states)
  • 📱 Mobile showcase (iPhone SE viewport)
  • 📟 Tablet showcase (iPad viewport)
  • 🔍 Popover states (verified, partial, not found)
  • 🔗 URL citation variants

Run ID: 23725157556

@claude
Copy link
Copy Markdown

claude bot commented Mar 30, 2026

PR Review — feat: CLI validation, CDN drawer integration, and citation data quality

Solid PR overall. The test coverage (290+ new cases, regression fixtures for hash stability) is excellent and the feature scope is well-motivated by QA data. A few issues worth addressing before merge:


Bugs

1. normalizeSnippetText can't fix cross-boundary artifacts (medium)

In DefaultPopoverContent.tsx and SearchAnalysisSummary.tsx, normalization is applied to each segment individually after splitting on matchStart/matchEnd:

const before = normalizeSnippetText(snippet.contextText.slice(0, snippet.matchStart));
const match  = normalizeSnippetText(snippet.contextText.slice(snippet.matchStart, snippet.matchEnd));
const after  = normalizeSnippetText(snippet.contextText.slice(snippet.matchEnd));

Artifacts that straddle a boundary won't be fixed. For example, if before ends with "overruled." and match starts with "We", the regex ([.;:!?])([A-Z]) never fires because . and W are in separate strings. The normalization should ideally be applied to the full contextText first, then re-sliced:

const normalized = normalizeSnippetText(snippet.contextText);
const before = normalized.slice(0, snippet.matchStart);
// ...

That said, this requires match offsets to remain valid after normalization — which is only guaranteed if normalization is purely additive (it is: it only inserts spaces, never removes characters). Worth verifying this assumption holds and adding a test for cross-boundary cases.


2. destroy() doesn't clean up drawer state (medium)

cdn.ts destroy() unmounts the popover but leaves drawerContainerEl and drawerTriggerEls populated. On a page that calls destroy() + init() again (e.g., a SPA route change), the drawer portal element would be orphaned in the DOM and drawerTriggerEls would hold stale references, leading to duplicate renders on the next update() call.

function destroy(): void {
  // ...existing cleanup...
  hideDrawer();            // add this
  drawerTriggerEls.clear(); // add this
}

Code quality / consistency

3. verifyHtml initParts wasn't refactored to match inject (minor)

inject was cleaned up to use a readable initParts array, but verifyHtml still has the old inline form:

// verifyHtml — hard to read:
`{${[`theme:${JSON.stringify(theme)}`, ...(indicator !== "icon" ? [...] : [])].join(",")}}`

// inject — clean new form:
const initParts = [`theme:${JSON.stringify(theme)}`];
if (indicator !== "icon") initParts.push(`indicatorVariant:${JSON.stringify(indicator)}`);

Finish the refactor so both paths use the same pattern.


4. Validation errors don't affect exit code (minor)

In verifyHtml, validateCitationData errors (not just warnings) are logged to stderr but execution continues:

const validation = validateCitationData(parsed.citations);
for (const err of validation.errors) {
  console.error(`Error: citation [${err.citationId}] ${err.field}${err.message}`);
}
// no exit / no non-zero code

The page_id error message even says "API rejects the entire attachment batch when pageNumber is absent". If the API is going to fail anyway, a non-zero exit here would save users a round-trip. At minimum, consider process.exitCode = 1 (non-throwing) so CI pipelines can catch data quality issues.


5. detectSnippetDisplayArtifacts is an unexported alias with no callers (minor)

// validateCitationData.ts
export function detectSnippetDisplayArtifacts(snippet: string): ExtractionArtifact[] {
  return detectExtractionArtifacts(snippet);
}

This function doesn't appear to be used anywhere in the diff. If it's intended for future use, it's premature; if it's for documentation purposes the JSDoc on detectExtractionArtifacts covers that. Remove it or wire it up.


Maintainability note

6. Static legitimateHyphens allowlist will need constant upkeep

The hardcoded Set in detectExtractionArtifacts will drift as citation domains grow (medical, scientific, financial terms all have valid short-prefix hyphenates: co-lead, bi-modal, ex-post, pre-tax). The current filter condition left.length <= 2 || right.length <= 2 is already pretty specific, so the allowlist may not be pulling much weight. Consider whether the condition alone is sufficient, or document the list as a known maintenance surface.


What's well done

  • Hash regression fixtures (citationKeyStability.test.ts) are the right approach — they'll catch any accidental algorithm change that would break existing injected HTML.
  • stripExistingInjection correctly resets lastIndex after .test() before calling .replace() — that's an easy bug to miss with stateful global regexes.
  • CDN drawer showDrawer()/hideDrawer() idempotency is handled correctly: repeated showDrawer() calls re-render into the same container; hideDrawer() fully unmounts and nulls the ref.
  • applyReferenceSpacing preserving garbled casing while transferring reference spacing is the right tradeoff — fixing spacing without guessing at capitalization is safer.
  • i18n coverage for evidence.impreciseLocation across es/fr/vi is thorough.

Summary: Issues 1 and 2 are worth fixing before merge. Issues 3–5 are quick fixes. Issue 6 is informational. Happy to approve once 1 and 2 are addressed.

🤖 Generated with Claude Code

Benson and others added 3 commits March 30, 2026 09:10
- Remove columnDetection test that depends on @filelasso/shared (moved to shared package)
- Replace unsafe `render(null as unknown as ...)` with `unmountComponentAtNode`
- Remove unused `detectSnippetDisplayArtifacts` alias
- Fix biome formatting in test files and EvidenceTray.tsx
- Fix pre-existing lint error in pdfWordSpacing.test.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test imports from @filelasso/shared which isn't available in CI
where deepcitation runs tests in isolation. This test belongs in the
shared package.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bensonwong bensonwong merged commit e50f1eb into main Mar 30, 2026
6 checks passed
@bensonwong bensonwong deleted the feat/cli-cdn-validation-improvements branch March 30, 2026 05:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant