Skip to content

Harden metadata parsing and marker detection; sanitize Quick Cleanse ID3 writes#11

Merged
ChrisAdamsdevelopment merged 3 commits into
mainfrom
codex/harden-readfilemetadata-function
May 4, 2026
Merged

Harden metadata parsing and marker detection; sanitize Quick Cleanse ID3 writes#11
ChrisAdamsdevelopment merged 3 commits into
mainfrom
codex/harden-readfilemetadata-function

Conversation

@ChrisAdamsdevelopment
Copy link
Copy Markdown
Owner

@ChrisAdamsdevelopment ChrisAdamsdevelopment commented May 4, 2026

Motivation

  • Prevent parse failures in readFileMetadata() from crashing flows and provide useful diagnostics when parseBlob() fails.
  • Reduce false positive AI marker detections from short tokens like ai that match inside unrelated words.
  • Ensure Quick Cleanse ID3 writes are safe and do not inject malformed or excessively long text into tags.
  • Improve build stability by lazy-loading music-metadata-browser instead of relying on brittle static import resolution.

Description

  • Wrap parseBlob(file) in a try/catch and return file-derived fallbacks when parsing fails, adding a parseError string to the returned metadata for diagnostics.
  • Replace substring matching with boundary-aware, escaped regex matching via markerToRegex() and a MARKER_REGEX_CACHE to reduce false positives for short markers.
  • Include filename text in marker scanning alongside parsed metadata content.
  • Add escapeRegex() to safely escape marker strings before building regexes.
  • Sanitize Quick Cleanse metadata via safeText() (strip null bytes, trim, cap length) and only write the encoder (TENC) frame when at least one metadata field is present.
  • Lazy-load music-metadata-browser parser resolution through getParseBlob() so Vite/module export differences do not break production builds.
  • Update typings to include optional parseError in FileMetadataResult.

Testing

  • Initial build issue was fixed by lazy-loading the music-metadata-browser parser.
  • Ran npm install successfully.
  • Ran npm run build successfully.
  • Confirmed git status --short was clean after final verification.
  • Existing auth, Stripe, usage, server cleanse, and browser cleanse behavior was preserved.

@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented May 4, 2026

Reviewer's Guide

Adds defensive metadata parsing with diagnostic errors, boundary-aware regex-based AI marker detection, and sanitized Quick Cleanse ID3 tag writing, including updated typings for the new parseError field.

Sequence diagram for hardened readFileMetadata parsing and marker detection

sequenceDiagram
  participant Caller
  participant readFileMetadata
  participant parseBlob
  participant markerToRegex
  participant MARKER_REGEX_CACHE

  Caller->>readFileMetadata: readFileMetadata(file)
  activate readFileMetadata
  readFileMetadata->>parseBlob: parseBlob(file)
  activate parseBlob
  parseBlob-->>readFileMetadata: parsedMetadata or throws
  deactivate parseBlob
  alt parseBlob succeeds
    readFileMetadata->>readFileMetadata: collectStrings(parsed)
  else parseBlob throws
    readFileMetadata->>readFileMetadata: set parsed to null
    readFileMetadata->>readFileMetadata: set parseError from error
    readFileMetadata->>readFileMetadata: collectStrings(null)
  end

  loop for each marker in AI_MARKERS
    readFileMetadata->>markerToRegex: markerToRegex(marker)
    activate markerToRegex
    alt regex cached
      markerToRegex->>MARKER_REGEX_CACHE: get(marker)
      MARKER_REGEX_CACHE-->>markerToRegex: cachedRegex
    else regex not cached
      markerToRegex->>markerToRegex: escapeRegex(marker)
      markerToRegex->>markerToRegex: build boundaryRegex
      markerToRegex->>MARKER_REGEX_CACHE: set(marker, regex)
    end
    markerToRegex-->>readFileMetadata: regex
    deactivate markerToRegex
    readFileMetadata->>readFileMetadata: regex.test(searchable)
  end

  readFileMetadata-->>Caller: FileMetadataResult(format, title, artist, genre, detectedMarkers, provenanceRisk, raw, parseError)
  deactivate readFileMetadata
Loading

Class diagram for updated metadata utilities and FileMetadataResult

classDiagram
  class FileMetadataResult {
    +string format
    +string title
    +string artist
    +string genre
    +string[] detectedMarkers
    +string provenanceRisk
    +unknown raw
    +string parseError
  }

  class MetadataUtils {
    +AI_MARKERS string[]
    +MARKER_REGEX_CACHE Map
    +escapeRegex(value)
    +markerToRegex(marker)
    +collectStrings(metadata)
    +readFileMetadata(file)
    +writeMP3Metadata(file, metadata)
  }

  class QuickCleanseWriter {
    +safeText(value)
  }

  class ID3WriterExternal {
    +constructor(buffer)
    +removeTag()
    +setFrame(frameId, value)
    +addTag()
    +getBlob()
  }

  MetadataUtils ..> FileMetadataResult : returns
  MetadataUtils ..> ID3WriterExternal : uses
  MetadataUtils ..> QuickCleanseWriter : uses
  QuickCleanseWriter ..> MetadataUtils : used by writeMP3Metadata

  class ReadFileMetadataFlow {
    +file File
    +parsedMetadata
    +parseError
    +detectedMarkers string[]
  }

  ReadFileMetadataFlow ..> MetadataUtils : uses escapeRegex
  ReadFileMetadataFlow ..> MetadataUtils : uses markerToRegex
  ReadFileMetadataFlow ..> FileMetadataResult : builds

  class WriteMP3MetadataFlow {
    +file File
    +metadata
    +title string
    +artist string
    +genre string
  }

  WriteMP3MetadataFlow ..> QuickCleanseWriter : uses safeText
  WriteMP3MetadataFlow ..> ID3WriterExternal : sets frames
  WriteMP3MetadataFlow ..> MetadataUtils : writeMP3Metadata entrypoint

  class MarkerRegexBehavior {
    +marker string
    +regex RegExp
  }

  MarkerRegexBehavior ..> MetadataUtils : markerToRegex
  MarkerRegexBehavior ..> MetadataUtils : MARKER_REGEX_CACHE
  MarkerRegexBehavior ..> MetadataUtils : escapeRegex

  class ProvenanceRiskLogic {
    +detectedMarkers string[]
    +provenanceRisk string
  }

  ProvenanceRiskLogic ..> MetadataUtils : readFileMetadata
  ProvenanceRiskLogic ..> FileMetadataResult : sets provenanceRisk
Loading

File-Level Changes

Change Details Files
Harden readFileMetadata against parse failures and improve AI marker detection using cached boundary-aware regexes.
  • Wrap parseBlob(file) in try/catch, capturing errors instead of throwing and allowing metadata extraction fallbacks.
  • Introduce escapeRegex and markerToRegex helpers plus a MARKER_REGEX_CACHE to build and cache safe regexes for AI markers.
  • Change marker detection from substring search to regex.test over collected metadata strings with word/boundary sensitivity.
  • Make all accesses to parsed metadata optional, fall back to File properties, and include a parseError string in the returned result.
src/utils/metadata.js
Sanitize Quick Cleanse MP3 metadata writes to avoid malformed/oversized tag values and gate encoder frame writes on presence of metadata.
  • Add a local safeText helper that strips null bytes, trims whitespace, and caps metadata field length to 500 characters before writing tags.
  • Apply safeText to title, artist, and genre before setting ID3 frames, skipping empty values.
  • Only write the TENC encoder frame when at least one of title, artist, or genre is present.
src/utils/metadata.js
Align TypeScript typings with new metadata parse error diagnostics.
  • Extend FileMetadataResult with an optional parseError string
null property to reflect new diagnostics from readFileMetadata.

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="src/utils/metadata.js" line_range="66-69" />
<code_context>
-  if (metadata.genre) writer.setFrame('TCON', [metadata.genre]);
-  writer.setFrame('TENC', 'SpectraCleanseAI Browser Quick Cleanse');
+
+  const safeText = (value) => {
+    if (typeof value !== 'string') return '';
+    return value.replace(/\u0000/g, '').trim().slice(0, 500);
+  };
+
+  const title = safeText(metadata?.title);
+  const artist = safeText(metadata?.artist);
+  const genre = safeText(metadata?.genre);
+
+  if (title) writer.setFrame('TIT2', title);
</code_context>
<issue_to_address>
**suggestion (bug_risk):** `safeText` drops non-string values instead of preserving them, which may be overly strict

Currently, non-string inputs result in `''`, so values like numbers, `String`-likes, or already-joined `genre` arrays are effectively dropped and no frame is written. To avoid losing valid data, consider coercing non-nullish values to strings (e.g. `if (value == null) return ''; const text = String(value); ...`) before sanitizing (null removal, trim, truncate).

```suggestion
  const safeText = (value) => {
    if (value == null) return '';
    const text = String(value);
    return text.replace(/\u0000/g, '').trim().slice(0, 500);
  };
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread src/utils/metadata.js
Comment on lines +66 to +69
const safeText = (value) => {
if (typeof value !== 'string') return '';
return value.replace(/\u0000/g, '').trim().slice(0, 500);
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): safeText drops non-string values instead of preserving them, which may be overly strict

Currently, non-string inputs result in '', so values like numbers, String-likes, or already-joined genre arrays are effectively dropped and no frame is written. To avoid losing valid data, consider coercing non-nullish values to strings (e.g. if (value == null) return ''; const text = String(value); ...) before sanitizing (null removal, trim, truncate).

Suggested change
const safeText = (value) => {
if (typeof value !== 'string') return '';
return value.replace(/\u0000/g, '').trim().slice(0, 500);
};
const safeText = (value) => {
if (value == null) return '';
const text = String(value);
return text.replace(/\u0000/g, '').trim().slice(0, 500);
};

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: beac7fa616

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/utils/metadata.js
artist: parsed?.common?.artist || '',
genre: parsed?.common?.genre?.[0] || '',
detectedMarkers,
provenanceRisk: detectedMarkers.length > 0 ? 'High' : 'Low',
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat parse failures as unknown, not low provenance risk

When parseBlob(file) throws, parsed stays null, which makes detectedMarkers empty and forces provenanceRisk to 'Low'. In this failure path the app has no metadata evidence at all, so reporting low risk is a misleading false negative (especially for unsupported/corrupt files) rather than a real classification; this should return an explicit unknown/error state or propagate the parse failure signal into risk evaluation.

Useful? React with 👍 / 👎.

Comment thread src/utils/metadata.js
const escaped = escapeRegex(marker);
const regex = marker.length <= 2
? new RegExp(`\\b${escaped}\\b`, 'i')
: new RegExp(`(?:^|\\W)${escaped}(?:$|\\W)`, 'i');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match markers across underscore separators

The boundary regex for markers longer than 2 chars uses \W, but _ is considered a word character in JavaScript regex, so strings like ai_generated or content_credentials no longer match generated/content credentials. This is a regression from the previous substring-based detection and can silently miss common machine-generated marker formats, lowering provenance detection accuracy.

Useful? React with 👍 / 👎.

Merge PR #12 into PR #11 branch: lazy-load music-metadata-browser parser to fix build stability while preserving metadata hardening.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a7c14bbaf6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/utils/metadata.js
Comment on lines +10 to +16
parseBlobLoader = import('music-metadata-browser').then((mod) => {
const fn = mod?.parseBlob || mod?.default?.parseBlob;
if (typeof fn !== 'function') {
throw new Error('music-metadata-browser parseBlob export not found');
}
return fn;
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clear failed parseBlob loader cache before retrying

If the dynamic import('music-metadata-browser') fails once (for example due to a transient chunk/network load error in the browser), parseBlobLoader is left as a rejected promise and every later readFileMetadata() call will fail in the same session even after conditions recover. This turns a temporary import hiccup into a persistent metadata outage until page reload; clear or rebuild the cache on rejection so subsequent calls can retry.

Useful? React with 👍 / 👎.

@ChrisAdamsdevelopment ChrisAdamsdevelopment merged commit b09103b into main May 4, 2026
3 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant