Add modular forensic processor and batch downloads#22
Conversation
Reviewer's GuideRefactors the forensic metadata cleanse/inject pipeline into modular server-side components, adds a richer three-class forensic report model, and introduces paid-only batch processing with one-time secure downloads and durable cleanup of generated files, while keeping existing single-file behavior and headers compatible. Sequence diagram for batch processing and one-time download flowsequenceDiagram
actor User
participant Server
participant Processor
participant DownloadTokens
participant Cleanup
User->>Server: POST /api/process-batch
Server->>Server: validatePlanAndSize
alt plan is free or size too large
Server-->>User: 4xx error
else paid and within limit
loop each file
Server->>Processor: processMediaFile(outputPath, platform, metadata)
Processor-->>Server: report
Server->>DownloadTokens: createToken(userId, filePath, downloadName)
DownloadTokens-->>Server: token
Server->>Cleanup: registerForCleanup([filePath])
end
Server-->>User: JSON { results[ { report, downloadToken } ] }
end
User->>Server: GET /api/download/:token
Server->>DownloadTokens: consumeToken(token, userId)
alt token valid
DownloadTokens-->>Server: { filePath, downloadName }
Server-->>User: res.download(filePath, downloadName)
Server->>Cleanup: deleteImmediately(filePath)
else invalid/expired
DownloadTokens-->>Server: { error, code }
Server-->>User: code error
end
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- The new
removedCountinprocessMediaFileis now based on marker hits (beforeMarkers.length) rather than the number of tags removed (removedTags.length), which changes the meaning of theX-Forensic-Removedheader compared to the old implementation; if consumers expect the old semantics, consider either keeping the previous behavior or exposing both counts explicitly. - In
/api/process-batchthe JSON response always returnsusage.limit: nulleven though theX-Usage-Limitheader still carries the per-plan limit; if clients read the JSON usage object, consider populatinglimitconsistently with the header value. - Several new code paths (e.g., the
/api/process,/api/process-batchhandlers andprocessMediaFile) compress complex logic into single-linetry/catchblocks and long inline expressions, which makes the control flow difficult to follow; consider expanding these into multi-line statements or extracting helpers to improve readability and future maintainability.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `removedCount` in `processMediaFile` is now based on marker hits (`beforeMarkers.length`) rather than the number of tags removed (`removedTags.length`), which changes the meaning of the `X-Forensic-Removed` header compared to the old implementation; if consumers expect the old semantics, consider either keeping the previous behavior or exposing both counts explicitly.
- In `/api/process-batch` the JSON response always returns `usage.limit: null` even though the `X-Usage-Limit` header still carries the per-plan limit; if clients read the JSON usage object, consider populating `limit` consistently with the header value.
- Several new code paths (e.g., the `/api/process`, `/api/process-batch` handlers and `processMediaFile`) compress complex logic into single-line `try/catch` blocks and long inline expressions, which makes the control flow difficult to follow; consider expanding these into multi-line statements or extracting helpers to improve readability and future maintainability.
## Individual Comments
### Comment 1
<location path="server.js" line_range="523" />
<code_context>
+ const usedNow = getMonthlyJobCount(userId);
+ res.setHeader('X-Usage-This-Month', usedNow);
+ res.setHeader('X-Usage-Limit', userPlan === 'free' ? FREE_MONTHLY_LIMIT : 'unlimited');
+ return res.json({ results, usage: { thisMonth: usedNow, limit: null } });
+});
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Usage `limit` is always `null`, even for limited plans.
Here `usage.limit` is always `null` even though the headers differentiate between free and unlimited plans. This makes the batch response inconsistent with `/api/process` and can break clients that rely on the JSON body to display or enforce limits. Please set `usage.limit` in line with the headers, e.g. `userPlan === 'free' ? FREE_MONTHLY_LIMIT : null`.
</issue_to_address>
### Comment 2
<location path="server/processor.js" line_range="53-54" />
<code_context>
+ const finalTags = await exiftool.read(outputPath);
+ const finalMarkers = detectMarkers(finalTags);
+ const verification = verifyFinalState(finalTags);
+ const removedTags = beforeKeys.filter((k) => !(k in finalTags));
+ const removedCount = beforeMarkers.length;
+ const status = (!wipeVerificationPassed || finalMarkers.length > 0) ? 'review_required' : (verification.unexpectedDescriptive.length > 0 ? 'clean_with_notes' : 'clean');
+ const seo = '';
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Mismatch between `removedCount` and `removedTags` may confuse downstream consumers.
`removedTags` tracks tags that disappeared between pre- and post-state, but `removedCount` is based on `beforeMarkers.length` (forensic markers), not removed tags. Since callers may interpret `removedCount` as “tags removed”, this mismatch can be misleading, especially when multiple markers share a tag or when non-marker tags are removed. Please either rename the field (e.g. `removedMarkersCount`) or make it derive from `removedTags.length` to keep the semantics consistent.
```suggestion
const removedTags = beforeKeys.filter((k) => !(k in finalTags));
const removedCount = removedTags.length;
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| const usedNow = getMonthlyJobCount(userId); | ||
| res.setHeader('X-Usage-This-Month', usedNow); | ||
| res.setHeader('X-Usage-Limit', userPlan === 'free' ? FREE_MONTHLY_LIMIT : 'unlimited'); | ||
| return res.json({ results, usage: { thisMonth: usedNow, limit: null } }); |
There was a problem hiding this comment.
issue (bug_risk): Usage limit is always null, even for limited plans.
Here usage.limit is always null even though the headers differentiate between free and unlimited plans. This makes the batch response inconsistent with /api/process and can break clients that rely on the JSON body to display or enforce limits. Please set usage.limit in line with the headers, e.g. userPlan === 'free' ? FREE_MONTHLY_LIMIT : null.
| const removedTags = beforeKeys.filter((k) => !(k in finalTags)); | ||
| const removedCount = beforeMarkers.length; |
There was a problem hiding this comment.
suggestion (bug_risk): Mismatch between removedCount and removedTags may confuse downstream consumers.
removedTags tracks tags that disappeared between pre- and post-state, but removedCount is based on beforeMarkers.length (forensic markers), not removed tags. Since callers may interpret removedCount as “tags removed”, this mismatch can be misleading, especially when multiple markers share a tag or when non-marker tags are removed. Please either rename the field (e.g. removedMarkersCount) or make it derive from removedTags.length to keep the semantics consistent.
| const removedTags = beforeKeys.filter((k) => !(k in finalTags)); | |
| const removedCount = beforeMarkers.length; | |
| const removedTags = beforeKeys.filter((k) => !(k in finalTags)); | |
| const removedCount = removedTags.length; |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7cc92395f3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const verification = verifyFinalState(finalTags); | ||
| const removedTags = beforeKeys.filter((k) => !(k in finalTags)); | ||
| const removedCount = beforeMarkers.length; | ||
| const status = (!wipeVerificationPassed || finalMarkers.length > 0) ? 'review_required' : (verification.unexpectedDescriptive.length > 0 ? 'clean_with_notes' : 'clean'); |
There was a problem hiding this comment.
Base review status on filtered residual markers
verifyFinalState explicitly filters benign and allowed injected tags, but the status computation ignores that and uses finalMarkers.length > 0 from the unfiltered tag set. This means normal user-provided metadata (for example, an allowed Title containing a brand term like "OpenAI") can force review_required even when finalVerificationPassed is true, creating false-positive forensic failures.
Useful? React with 👍 / 👎.
| const finalMarkers = detectMarkers(finalTags); | ||
| const verification = verifyFinalState(finalTags); | ||
| const removedTags = beforeKeys.filter((k) => !(k in finalTags)); | ||
| const removedCount = beforeMarkers.length; |
There was a problem hiding this comment.
Keep removedCount aligned with removedTags/header semantics
removedCount is now set to beforeMarkers.length, which counts marker-rule hits, not removed metadata keys. Because one key/value can trigger multiple rules (or none), this can diverge from removedTags and from prior X-Forensic-Removed behavior, causing inconsistent audit output and misleading usage/UI metrics that assume the count corresponds to removed tags.
Useful? React with 👍 / 👎.
Motivation
uploads/and support one-time batch downloads.Description
server/metadataRules.js, exportingMARKER_RULES,ALLOWED_INJECTED_TAGS,isBenign, andisAllowedInjectedfor suspicious/benign/allowed tag classification.server/processor.js, encapsulatingprocessMediaFile(...)plus helper logic for marker detection, final-state verification, platform-aware metadata writing, and unsupported-cleanse 422 errors.server/cleanup.js, persisting a SQLite-backedcleanup_queuewith upload-dir safety guards, immediate deletion, periodic sweeping, and row pruning.server/downloadTokens.js, persisting one-timedownload_tokenswith TTL, account validation, consumption semantics, and expiry cleanup.server.jswhile preserving auth (req.user.sub), free-tier usage checks, Stripe/Gemini behavior, MP3 server-cleanse rejection, existing/api/processbehavior, and SPA/API route ordering.POST /api/process-batchandGET /api/download/:tokenbefore the API catch-all.app.tsxwith optional richer audit fields while preserving legacy fields (removedCount,removedTags,timestamp).README.mdnotes for batch processing, one-time download tokens, MP3 server-side exclusion, and deployment/body-size guard expectations.Explicit exclusions preserved
AlbumConsentModal.albumOverrideConfirmed.Album = safeTitle,Description,Year, and optionalLyrics-eng.@google/generative-ai; Gemini remains REST/fetch-based.Testing
Validated under supported Node runtime
20.20.2:npm installpassed.npm run buildpassed.node --check server.jspassed.node --check server/processor.jspassed.node --check server/metadataRules.jspassed.node --check server/cleanup.jspassed.node --check server/downloadTokens.jspassed.Runtime smoke checks performed under Node
20.20.2:GET /api/healthreturned JSON200.GET /api/nonexistentreturned JSON404.GET /nonexistent-frontend-routeservedindex.htmlwhendist/existed.GET /api/download/badtokenwithout auth returned JSON401.GET /api/download/badtokenreturned JSON404(Download link not found.).Code/contract checks confirmed:
req.user.subis still used for authenticated user identity.users.monthly_usageis not referenced.jobs.statusis not referenced.@google/generative-aiis not referenced.AlbumConsentModalis not referenced.albumOverrideConfirmedis not referenced./api/process-batchand/api/download/:tokenare before the/apicatch-all./api/generate-seostill uses REST fetch./api/processroute remains present and MP3 server-side processing remains rejected.