Default-pfp URLs (.../sticky/default_profile_images/...) misclassified as asset, fetched fresh per bookmark #149

@RealADemin

The substring check entry.sourceUrl.includes('/profile_images/') appears at three sites in src/bookmark-media.ts and classifies each URL as either a "profile image" (deduplicated by URL alone) or an "asset" (keyed by tweetId + sourceUrl). Twitter's default-pfp URL, served for any account with no profile photo set, has the path segment default_profile_images (an underscore before profile, not a slash), so the check returns false. Default-pfp URLs are therefore misclassified as assets and bypass the URL deduplication that #79 added for real pfps.

Effect: every new bookmark whose author uses the default avatar triggers a fresh fetch of the same 2KB PNG, attributed to that bookmark's tweetId in the manifest.

Concrete evidence

Sample manifest from a ~5200-bookmark archive — 7 manifest entries for the identical URL, different bookmarkId each time:

2026-04-08T15:14:15  bookmarkId=2037126739286253911  url=https://abs.twimg.com/sticky/default_profile_images/default_profile_400x400.png
2026-04-08T16:10:19  bookmarkId=2029867961624793147  (same url)
2026-04-08T18:05:51  bookmarkId=2009833980183376247  (same url)
2026-04-08T18:22:08  bookmarkId=2008208287939076351  (same url)
2026-04-08T18:31:54  bookmarkId=2005656385732964618  (same url)
2026-04-24T07:37:39  bookmarkId=2040426976050086218  (same url)
2026-05-10T19:41:43  bookmarkId=2047271408246395145  (same url)

Seven downloads of the same 2KB PNG across a month, one per distinct bookmark whose author has the default avatar.

Why

Two URL shapes for profile images:

Real user pfp:   https://pbs.twimg.com/profile_images/<id>/<filename>_400x400.jpg
Default pfp:     https://abs.twimg.com/sticky/default_profile_images/default_profile_400x400.png

The substring '/profile_images/' (with surrounding slashes) matches the first but not the second — default_profile_images has _profile_images/ with an underscore before profile, not a slash.
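The mismatch can be reproduced in isolation. This is a minimal sketch, not code from the repository: the constant names are made up for illustration, and the `<id>` and `<filename>` parts of the real-pfp URL are filled with placeholder values.

```typescript
// Placeholder id and filename; only the path shape matters here.
const REAL_PFP_URL = "https://pbs.twimg.com/profile_images/1234567890/avatar_400x400.jpg";
// The default-pfp URL exactly as it appears in the manifest.
const DEFAULT_PFP_URL =
  "https://abs.twimg.com/sticky/default_profile_images/default_profile_400x400.png";

// The substring used by the current check, slashes included.
const NEEDLE = "/profile_images/";

const realMatches = REAL_PFP_URL.includes(NEEDLE);       // true
const defaultMatches = DEFAULT_PFP_URL.includes(NEEDLE); // false

// The character immediately before "profile_images" in the default URL
// is an underscore, which is why the leading slash in NEEDLE never lines up.
const needleStart = DEFAULT_PFP_URL.indexOf("profile_images");
const precedingChar = DEFAULT_PFP_URL.charAt(needleStart - 1); // "_"
```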

Cascading effect

Three call sites in src/bookmark-media.ts use the same .includes('/profile_images/') check:

// L80 — used to build the cache key, with the third arg being `isProfileImage`
function mediaEntryKeyFromEntry(entry: MediaFetchEntry): string {
  return mediaEntryKey(entry.tweetId, entry.sourceUrl, entry.sourceUrl.includes('/profile_images/'));
}

// L214 — building coveredAssetKeys (non-pfp entries, keyed by tweetId::sourceUrl)
.filter((entry) => !entry.sourceUrl.includes('/profile_images/'))
.filter((entry) => isCoveredEntry(entry, maxBytes))
.map((entry) => `${entry.tweetId}::${entry.sourceUrl}`),

// L223 — building coveredProfileImageUrls (pfp entries, keyed by URL alone)
.filter((entry) => entry.sourceUrl.includes('/profile_images/'))
.filter((entry) => isCoveredEntry(entry, maxBytes))
.map((entry) => entry.sourceUrl),

For a default-pfp URL:

  1. Line 80 passes isProfileImage = false to mediaEntryKey, so the cache key includes tweetId
  2. Line 214's filter passes it through (since the check is negated), adding it to coveredAssetKeys
  3. Line 223's filter excludes it (since includes is false), so the URL is never added to coveredProfileImageUrls

When a new bookmark with a default-pfp author arrives, the resolver checks coveredAssetKeys for a key like <newTweetId>::<defaultUrl>. That key is absent (different tweetId), so the URL is classified as pending and fetched again, and the manifest gets a new entry attributed to the new bookmark.

--skip-profile-images can't help because the URL isn't classified as a profile image.
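The lookup failure can be modelled with a small stand-alone sketch. Everything here is a simplification under stated assumptions: `SketchEntry`, `looksLikePfp`, and `needsFetch` are hypothetical names, the size check done by isCoveredEntry is omitted, and the real resolver in src/bookmark-media.ts may be structured differently. The two tweet IDs in the usage note are taken from the manifest sample above.

```typescript
// Hypothetical, reduced entry shape; the real MediaFetchEntry has more fields.
interface SketchEntry {
  tweetId: string;
  sourceUrl: string;
}

const DEFAULT_AVATAR =
  "https://abs.twimg.com/sticky/default_profile_images/default_profile_400x400.png";

// Same substring test as the three call sites in the issue.
const looksLikePfp = (url: string): boolean => url.includes("/profile_images/");

// Returns true when `entry` would be treated as pending, given entries already fetched.
function needsFetch(entry: SketchEntry, fetched: SketchEntry[]): boolean {
  // Non-pfp entries are keyed by tweetId::sourceUrl.
  const coveredAssetKeys = new Set(
    fetched
      .filter((e) => !looksLikePfp(e.sourceUrl))
      .map((e) => `${e.tweetId}::${e.sourceUrl}`),
  );
  // Pfp entries are keyed by URL alone.
  const coveredProfileImageUrls = new Set(
    fetched.filter((e) => looksLikePfp(e.sourceUrl)).map((e) => e.sourceUrl),
  );
  return looksLikePfp(entry.sourceUrl)
    ? !coveredProfileImageUrls.has(entry.sourceUrl)
    : !coveredAssetKeys.has(`${entry.tweetId}::${entry.sourceUrl}`);
}

// One earlier bookmark already downloaded the default avatar.
const alreadyFetched: SketchEntry[] = [
  { tweetId: "2037126739286253911", sourceUrl: DEFAULT_AVATAR },
];

// A different bookmark with the same avatar URL is still treated as pending.
const refetched = needsFetch(
  { tweetId: "2029867961624793147", sourceUrl: DEFAULT_AVATAR },
  alreadyFetched,
);
```

In this model `refetched` is true because the default avatar lands in coveredAssetKeys under the first tweetId only, so the second bookmark's key never matches.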

Relationship to #79

#79 ("profile images are re-downloaded and duplicated on disk for every bookmark from the same author", closed) was the broader pfp-dedup fix that introduced the URL-keyed coveredProfileImageUrls set. That fix works correctly for real-user pfps (the URL pbs.twimg.com/profile_images/... matches the substring). Default-pfp URLs were missed because the substring is overly anchored — they regress to the pre-#79 per-bookmark fetch pattern.

Fix

The three identical substring checks need to be broadened from '/profile_images/' to 'profile_images' (drop the surrounding slashes):

// L80
entry.sourceUrl.includes('profile_images')

// L214
.filter((entry) => !entry.sourceUrl.includes('profile_images'))

// L223
.filter((entry) => entry.sourceUrl.includes('profile_images'))

Both URL shapes now match:

  • …/profile_images/<id>/… contains profile_images
  • …/default_profile_images/default_profile_… contains profile_images

Alternative if matching boundaries matters: a regex like /\/(default_)?profile_images\//.test(url), which matches only the two exact path segments. The simpler substring drop is sufficient: no other twimg.com URL path contains profile_images as a substring outside of these two contexts.
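Since the same check is repeated at three call sites, one option is to route all of them through a single predicate so they cannot drift apart. The sketch below is only an illustration of the boundary-aware variant: `isProfileImageUrl` and `PROFILE_IMAGE_SEGMENTS` are hypothetical names that do not exist in src/bookmark-media.ts, and comparing whole path segments is a design choice made here, not something the issue prescribes.

```typescript
// Hypothetical helper, not part of the current codebase.
// The two path segments known to carry profile images.
const PROFILE_IMAGE_SEGMENTS = new Set(["profile_images", "default_profile_images"]);

function isProfileImageUrl(sourceUrl: string): boolean {
  // Strip any query string or fragment, then compare whole path segments
  // so that an unrelated segment merely containing the word does not match.
  const withoutQuery = sourceUrl.replace(/[?#].*$/, "");
  return withoutQuery
    .split("/")
    .some((segment) => PROFILE_IMAGE_SEGMENTS.has(segment));
}
```

With such a helper, each call site would reduce to isProfileImageUrl(entry.sourceUrl) or its negation. The plain 'profile_images' substring remains the smaller change if segment boundaries are not a concern.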

After the fix, the existing coveredProfileImageUrls URL-dedup logic catches default-pfp URLs too. Future fetches short-circuit. No schema change, no migration. Existing duplicated default-pfp files on disk persist harmlessly (or can be cleaned up by a fetch-media --prune-style command if/when that lands).
