
custom domains: auth #2908

Open
Soxasora wants to merge 227 commits into stackernews:master from Soxasora:feat/custom-domains-auth

Conversation

@Soxasora
Member

@Soxasora Soxasora commented Apr 8, 2026

Description

Part of #1942, revives and adapts #2180
Focuses on synchronizing authentication between SN and a custom domain.

Login overview

Authentication is centralized on stacker.news.
When a user visits /login on a custom domain, they are redirected to stacker.news/login with a domain param. This param customizes the login experience for that specific domain/territory.

domain-login-overview

Flow

From the login page, users can authenticate with the current nym or choose a different account to sync.
A callbackUrl=/api/auth/sync/... is always included to support full login/signup flows.

After authentication:

  • user is logged into stacker.news
  • they're redirected to the auth sync endpoint
  • the session is synchronized with the custom domain

Next-auth redirects

To properly support logout flows on custom domains (without router.push hacks), next-auth now supports a custom redirect behavior.

If callbackUrl is:

  • absolute URL -> allowed only for verified custom domains -> redirects back to that domain
  • relative path -> defaults to stacker.news
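A minimal sketch of the rule above, not the PR's actual NextAuth callback. `resolveRedirect`, `MAIN_ORIGIN`, and the `verifiedDomains` set are hypothetical names used only for illustration; the real lookup of verified domains is not shown in this description.

```javascript
// Hypothetical constant: the main domain that relative paths resolve against.
const MAIN_ORIGIN = 'https://stacker.news'

// Returns the URL to redirect to, falling back to the main domain
// whenever callbackUrl is not explicitly allowed.
function resolveRedirect (callbackUrl, verifiedDomains) {
  let url
  try {
    // Relative paths resolve against the main domain; absolute URLs ignore the base.
    url = new URL(callbackUrl, MAIN_ORIGIN)
  } catch {
    return MAIN_ORIGIN
  }
  // Relative path (or absolute URL on the main domain): stay on stacker.news.
  if (url.origin === MAIN_ORIGIN) return url.toString()
  // Absolute URL: allowed only for verified custom domains over https.
  if (url.protocol === 'https:' && verifiedDomains.has(url.host)) return url.toString()
  return MAIN_ORIGIN
}
```

Comparing the resolved origin, rather than checking whether the string starts with `/`, means inputs such as `//evil.example/x` are treated as the absolute URLs they resolve to.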

DNS monitoring

A recurring checkActiveDomainsDNS pgboss job runs every 5 minutes to verify that custom domains still point correctly to us.

When a DNS drift is detected, the domain is moved to HOLD, triggering:

  • ACM certificate revocation
  • tokenVersion increment
  • domain re-verification required

This check is essential to custom domain safety and JWT revocability.
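To make the drift handling concrete, here is a rough sketch of the decision logic only (no DNS queries, no pgboss, no ACM). `classifyCnameCheck` and `nextDomainState` are hypothetical helpers, and the choice of which resolver error codes count as conclusive is an assumption made for illustration. In the PR the tokenVersion bump is done by a DB trigger; it is modeled inline here just to show the effect.

```javascript
// Assumption: these Node resolver error codes are treated as conclusive
// ("the record is gone"); anything else is considered transient and retried.
const CONCLUSIVE_CODES = new Set(['ENOTFOUND', 'ENODATA'])

// Classify the outcome of one CNAME lookup as 'OK', 'DRIFT' or 'RETRY'.
function classifyCnameCheck ({ records, errorCode }, expectedTarget) {
  if (errorCode) {
    return CONCLUSIVE_CODES.has(errorCode) ? 'DRIFT' : 'RETRY'
  }
  // Compare case-insensitively and ignore a trailing root dot.
  const normalize = name => name.toLowerCase().replace(/\.$/, '')
  const matches = records.some(r => normalize(r) === normalize(expectedTarget))
  return matches ? 'OK' : 'DRIFT'
}

// Only ACTIVE domains with a conclusive drift move to HOLD.
function nextDomainState (domain, verdict) {
  if (domain.status !== 'ACTIVE' || verdict !== 'DRIFT') return domain
  return { ...domain, status: 'HOLD', tokenVersion: domain.tokenVersion + 1 }
}
```

Keeping transient failures separate avoids putting a healthy domain on HOLD because a resolver timed out once.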

JWT revocability

JWTs issued for custom domains now include:

  • domainId
  • tokenVersion

Validation:

  • tokenVersion mismatch -> token is invalid
  • domainId mismatch -> token is invalid

This ensures that existing tokens are invalidated when DNS changes or when the domain is re-verified or re-created.
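As an illustration of the two checks, a small sketch assuming the token and the domain record are plain objects. `isCustomDomainTokenValid` is a hypothetical name, not the function used in the jwt callback.

```javascript
// token: decoded JWT claims; domain: current record for the request's host
// (or null/undefined when the host is not a known domain).
function isCustomDomainTokenValid (token, domain) {
  // A domain that is missing or no longer ACTIVE cannot carry a valid session.
  if (!domain || domain.status !== 'ACTIVE') return false
  // Both claims must match the current record.
  return token.domainId === domain.id && token.tokenVersion === domain.tokenVersion
}
```

A re-created domain gets a new id and a bumped tokenVersion invalidates older tokens, so either mismatch is enough to reject.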


Attack scenarios and infos are available in /docs/dev/custom-domains.md

Media

custom domain login -> pick -> sync -> logout

auth-sync-flow.mov

custom domain signup -> main domain -> sync

cd-signup.mp4

Additional Context

Next.js middleware proxy has a bug that prevents absolute redirects to localhost, collapsing them into relative redirects. To circumvent this bug, a new /api/auth/redirect endpoint is used to redirect to main-domain/login with the correct domain and callbackUrl params.

We could make a generic api/redirect endpoint that only allows main domain and verified custom domain, and use it for both /login redirect and external territory badges that, at the moment, are based on app code rather than being controlled by the middleware.

Checklist

Are your changes backwards compatible? Please answer below:
Yes, existing behavior is preserved and every change only applies to custom domains.

On a scale of 1-10 how well and how have you QA'd this change and any features it might affect? Please answer below:
6

  • login with nym: OK
  • login/signup from scratch: OK
  • logout: OK
  • redirect to /login: OK
  • JWT revoked after DNS drift: OK
  • JWT revoked after domain re-insert: OK

For frontend changes: Tested on mobile, light and dark mode? Please answer below:
Yes

Did you introduce any new environment variables? If so, call them out explicitly here:
n/a

Did you use AI for this? If so, how much did it assist you?

  • reviews
  • auth hardening
  • nits

Note

High Risk
High risk because it adds new authentication flows/endpoints for custom domains, modifies NextAuth JWT/redirect behavior, and introduces DB-trigger-driven token revocation plus a recurring worker job that can force domains into HOLD. Misconfiguration could cause login/redirect breakage or unintended session invalidation.

Overview
Adds custom-domain auth sync so logins/signups initiated on a custom domain are redirected through the main domain and then establish a session back on the custom domain via new GET/POST /api/auth/sync and /api/auth/redirect, with CSRF protection using a proxy-minted sync_proof and HMAC proof headers.

Introduces revocable custom-domain JWTs by adding Domain.tokenVersion (plus domainId pinning) and enforcing these claims in NextAuth’s jwt callback; also tightens redirect handling with a NextAuth redirect callback and new safe-url helpers.

Adds DNS drift monitoring: verifyDNSRecord now distinguishes transient vs conclusive resolver failures, a new checkActiveDomainsDNS worker runs every 5 minutes (pgboss schedule) to flip drifting ACTIVE domains to HOLD, and a DB trigger bumps tokenVersion on any transition to/from ACTIVE to invalidate existing tokens. UI/UX is updated to show territory-aware login headers/links and a “log in with @nym” selector, and dropdown styles are factored into a shared CSS module.

Reviewed by Cursor Bugbot for commit 23403fe.

Soxasora and others added 30 commits May 1, 2025 18:38
- ACM support
- custom domains crud, resolvers, fragments
- custom domains form, guidelines
- custom domains context
- domain verification every 5 minutes via pgboss
- domain validation schema
- basic custom domains middleware, to be completed
- TODOs tracings
- CustomDomain -> Domain
- DomainVerification table
- CNAME, TXT, SSL verification types
- WIP DomainVerification upsert
…ange status of a Record from its Attempt, multi-purpose dns verification
- use DomainVerificationStatus enum for domains and records
- adapt Territory Form UI to new schema
- return 'records' as an object with its types
- wip: prepare for attempts and certificate usage for prisma
fix:
- fix setDomain mutation transaction
- fix schema typedefs

enhance:
- DNS records guidelines with flex-wrap for longer records

cleanup:
- add comments to worker
- remove console.log on validation values
… HOLD

handle territory changes via triggers
- on territory stop, HOLD the domain
- on territory takeover from another user, delete the domain and its associated records

handle ACM certificates via trigger
- on domain/domainCertificate deletion, ask ACM to delete the certificate via a pgboss job; removes the need to ask ACM in multiple places

clear domains that have been on HOLD for more than 30 days, check every midnight via pgboss schedule

use 'domains' profile for worker jobs
Comment thread lib/safe-url.js
Comment thread components/nav/common.js
@huumn
Member

huumn commented May 3, 2026

This is all bot review (what I find myself doing in preparation of my review) but I wanted to share before in case any of it is useful. Screenshots of the canvas I had it generate:

Screenshot 2026-05-03 at 1 58 24 AM Screenshot 2026-05-03 at 1 58 39 AM

The fix plan this led to (again, just bot stuff, I have not reviewed much). Phase 1 seems reasonable at a glance. I'm unsure about phase 2 being a meaningful improvement (but again, I have not absorbed all the context yet).

THE FIX PLAN

Phase 1: Harden Existing Flow

  1. Add a server-only redemption proof to the sync POST path.

    • Change proxy.js syncAccount so its server-side fetch to pages/api/auth/sync.js includes a proof header derived from verificationToken, domainName, and NEXTAUTH_SECRET.
    • Change pages/api/auth/sync.js POST handling to validate that proof before consuming a token or minting a JWT.
    • Use a small shared helper only if needed to avoid duplicating crypto incorrectly; otherwise keep the implementation inline and minimal.
    • Result: a sync_token leaked to a DNS hijacker is no longer enough to mint a JWT, because only SN middleware can produce the server-only proof.
  2. Make security-sensitive domain checks fresh, not cached.

    • Add a fresh DB-backed domain lookup in lib/domains.js, e.g. getFreshDomainMapping(domain), selecting id, domainName, subName, and tokenVersion for status: ACTIVE.
    • Use the fresh lookup in pages/api/auth/[...nextauth].js for custom-domain JWT validation.
    • Use it in auth redirect allowlisting if practical, so recently revoked domains are not accepted by NextAuth redirect policy.
    • Keep domainsMappingsCache for proxy routing only, where bounded staleness is less security-critical.
  3. Keep the existing mint-side DB checks, but tighten the consume order.

    • In pages/api/auth/sync.js, validate method, domain, and server proof before consumeVerificationToken deletes anything.
    • Continue re-reading the domain in createSessionToken before minting the JWT.
    • Preserve one-time token semantics and the existing domainId binding.
  4. Fix the failed sync route.

    • Change proxy.js error redirects from /error to an existing route such as /auth/error?error=Callback.
    • This addresses expired, invalid, already-consumed, and failed server-proof sync attempts.
  5. Reduce accidental leakage from the URL handoff while Phase 2 is pending.

    • After successful syncAccount, keep redirecting to the sanitized redirectUri, so the final URL no longer contains sync_token.
    • Consider setting Referrer-Policy: no-referrer on the intermediate sync redirect response in pages/api/auth/sync.js if it can be done without weakening the broader site policy.
    • Do not rely on this as the security boundary; the proof header is the boundary.
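One way the server-only proof from step 1 could look, as a sketch under assumptions: HMAC-SHA256 over the domain name and verification token, hex-encoded. `createSyncProof` and `verifySyncProof` are hypothetical names, and the exact message layout is an illustration rather than what the plan prescribes.

```javascript
const crypto = require('node:crypto')

// Derive the proof from the secret, the domain name and the verification token.
// Assumption: the message is "<lowercased domain>:<token>".
function createSyncProof (secret, verificationToken, domainName) {
  return crypto
    .createHmac('sha256', secret)
    .update(`${domainName.toLowerCase()}:${verificationToken}`)
    .digest('hex')
}

// Validate the proof before consuming the token or minting a JWT.
function verifySyncProof (secret, verificationToken, domainName, proof) {
  const expected = Buffer.from(createSyncProof(secret, verificationToken, domainName), 'hex')
  const received = Buffer.from(String(proof), 'hex')
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && crypto.timingSafeEqual(received, expected)
}
```

Because the secret never leaves the server, a leaked sync_token alone is not enough to produce a matching proof.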

Phase 2: Remove URL-Borne Bearer Material

  1. Design a replacement handoff that does not put redeemable auth material in the custom-domain URL.
    • Candidate: main domain creates a non-bearer flow id, and only SN middleware can redeem it using server-side proof.
    • Candidate: main domain renders an auto-submitted POST to the custom domain with non-bearer state, still requiring server proof on redemption.
    • Candidate: split the current sync token into browser-visible state plus server-only verifier, similar to PKCE, so browser-visible data cannot mint a JWT alone.

@huumn
Member

huumn commented May 3, 2026

Rereading my findings from last night, it's unclear how high risk the high risk finding is. The finding is valid and the sync token is a special case, but imho the 5 minute domain switch check + jwt invalidation is sufficient for our private beta, whether after sync_token or normal requests afaict.

On the other hand, it is probably worthwhile to have the ability to do synchronous DNS checks and add extra protections depending on the request scenario, ie places where an attack is easier to pull off. Is this such a scenario, and is it much more severe than average? Marginally, maybe.

Anyway, I'm still a bit naive to these changes. Regardless of whether special case hardening is worthwhile, it can wait until we leave the private beta if that's what's best. There will always be inches to gain in security given we are limited to reactive defenses.

@sir-opti
Contributor

sir-opti commented May 3, 2026

unclear how high risk the high risk finding is.

I had something similar reported and it combines with other issues. I'm still in repro phase; this is high complexity.

I do think that "Design a replacement handoff that does not put redeemable auth material in the custom-domain URL" is valid though - having tokens in query string puts it in browser history. Can probably be protected with a challenge secret like done in PKCE, to prevent capture by XSS / browser extensions. The "protocol flow" for PKCE paints a clear solution:

                                                 +-------------------+
                                                 |   Authz Server    |
       +--------+                                | +---------------+ |
       |        |--(A)- Authorization Request ---->|               | |
       |        |       + t(code_verifier), t_m  | | Authorization | |
       |        |                                | |    Endpoint   | |
       |        |<-(B)---- Authorization Code -----|               | |
       |        |                                | +---------------+ |
       | Client |                                |                   |
       |        |                                | +---------------+ |
       |        |--(C)-- Access Token Request ---->|               | |
       |        |          + code_verifier       | |    Token      | |
       |        |                                | |   Endpoint    | |
       |        |<-(D)------ Access Token ---------|               | |
       +--------+                                | +---------------+ |
                                                 +-------------------+

where, if t_m == 'sha512', t(code_verifier) == sha512(code_verifier). Both t(code_verifier) and t_m are passed in the auth request, and then code_verifier in cleartext is added to the redemption request. The server checks that sha512(code_verifier) matches the initial digest given. Could use a short-lived cookie as a side-channel to keep it out of query strings and post data?
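The challenge/verifier check described above can be sketched roughly like this. It follows the sha512 example from this comment (RFC 7636 itself defines the `S256` method), and `createChallenge` / `verifyChallenge` are hypothetical helper names.

```javascript
const crypto = require('node:crypto')

// Client side: keep codeVerifier private and send only its digest
// (codeChallenge) with the initial request.
function createChallenge () {
  const codeVerifier = crypto.randomBytes(32).toString('base64url')
  const codeChallenge = crypto.createHash('sha512').update(codeVerifier).digest('base64url')
  return { codeVerifier, codeChallenge }
}

// Server side: at redemption, recompute the digest from the cleartext
// verifier and compare it with the challenge stored from the first request.
function verifyChallenge (codeVerifier, storedChallenge) {
  const digest = crypto.createHash('sha512').update(String(codeVerifier)).digest()
  const stored = Buffer.from(String(storedChallenge), 'base64url')
  return stored.length === digest.length && crypto.timingSafeEqual(stored, digest)
}
```

Only the digest travels with the first request, so seeing that request (or the code in a URL) does not give an observer the verifier needed at redemption.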

@Soxasora
Member Author

Soxasora commented May 4, 2026

Thanks for all the research guys, I think we just want to add the proof header just to ship the private beta of custom domains. It's a small change that won't require re-designing the pipeline.

I'm also okay with a DNS query on POST, since it's not a path that needs to be in any way ultra-fast

  1. Make security-sensitive domain checks fresh, not cached.
    Add a fresh DB-backed domain lookup in lib/domains.js, e.g. getFreshDomainMapping(domain), selecting id, domainName, subName, and tokenVersion for status: ACTIVE.
    Use the fresh lookup in pages/api/auth/[...nextauth].js for custom-domain JWT validation.

Instead, this is something I don't think we'll ever want to do, it would mean a db lookup on every SSR and graphql request.

This being said, an attacker can anyway exchange the verification token for a JWT via our middleware when they re-establish a correct DNS configuration towards us, and if they're lucky enough to be in the correct time window. This is, sadly, valid for any safety measures we can come up with because the attacker can control the domain and use a reverse proxy to access cookies, sniff and do all kinds of things (still they gotta be lucky with timing).

But! We must remember that even if the territory owner/hijacker gets the verification token and/or the full JWT token, it will get revoked the moment our cache catches up. And said JWT cannot be shared with stacker.news or other custom domains.


edit: I'm all for implementing PKCE after private beta. Actually last year I implemented a (bad) prototype of a PKCE-ish process: #2180 (comment), but I remember removing it because auth sync was starting to be too complex for an MVP + the dread of knowing that whatever we do can be exploited.

@sir-opti
Contributor

sir-opti commented May 4, 2026

I'm all for implementing PKCE after private beta. Actually last year I implemented a (bad) prototype of a PKCE-ish process: #2180 (comment), but I remember removing it because auth sync was starting to be too complex for an MVP + the dread of knowing that whatever we do can be exploited.

My only counter to that is that the moment with the least friction to do it is now, unless private beta stage is throwaway code and all these massive PRs will be reverted and rewritten from scratch - which I don't expect, but maybe I'm naive. Doesn't have to be in this PR though!

@huumn
Member

huumn commented May 4, 2026

having tokens in query string puts it in browser history

It's a short-lived token, so while not great, it's about as bad as a magic link in this respect (I think).

I think the risk with the sync_token is like the risk with the jwt: the custom domain owner can gain access to either if they switch their DNS records at the right time then impersonate clients until we notice the DNS switch.

Can probably be protected with a challenge secret like done in PKCE

Unless I'm missing something, with a DNS record switch, even if some secrets are only known to the client, those secrets can be MiTM'd at C/D in the diagram. It does make an attack harder though.

We've been unable to find a flow that can't be MiTM'd in this way. It's not possible afaik, but it's like we need client context/storage to bind to DNS records/certs at a point in time, such that when records/certs change, the client context invalidates.

@huumn
Copy link
Copy Markdown
Member

huumn commented May 4, 2026

unless private beta stage is throwaway code and all these massive PRs will be reverted and rewritten from scratch

If we can materially improve upon the security of these PRs by reverting and rewriting them from scratch before making the beta public, we will.

@huumn
Member

huumn commented May 4, 2026

It'd perhaps help to enumerate attacker personas/scenarios and be clear about what we are protecting against:

  1. territory founders (capable of performing the DNS switch)
  2. other kinds of attackers

PKCE may address (2) substantially without being able to address (1). Specifically, we've mentioned PKCE could help with (2) when:

  1. someone has access to browser history (and token fails to be consumed and hasn't expired yet)
  2. XSS
  3. browser extensions to the extent that they can't operate on client storage

@Soxasora
Member Author

Soxasora commented May 4, 2026

The problem with the (2) scenarios is that we've already lost at that point, the user is compromised and the attacker can just get the JWT. The same can be done for stacker.news and any other website.

I agree with @huumn that PKCE can't help with (1) and it can only make the attack a little harder to execute.


Maybe I'm about to say something naive or in the spirit of whataboutism, but I actually think that auth is okay-ish with the header, and that the real problem is wallets. Impersonating someone for a brief period of time is a 1000000x better outcome than stealing wallet creds.
I wouldn't want to enter the same trap we entered last year and halt custom domains, I propose to return to this subject + wallets when we'll have fully-working custom domains.

  • on the wallets subject, the attacker can get the wallet creds only if the user gives their passphrase ... which they can do if it's a convincing replica of stacker.news
    • oh but there's more we could do here, like iframes or payments scoped to stacker.news. we'll need to talk in-depth about this

@sir-opti
Contributor

sir-opti commented May 4, 2026

I propose to return to this subject + wallets when we'll have fully-working custom domains.

I'll do some work in parallel to you, without disrupting your flow (after I finalize aws sdk stuff)

Comment thread pages/login.js Outdated
Comment thread pages/api/auth/sync.js Outdated
Comment thread proxy.js Outdated
Member

@huumn huumn left a comment

After an afternoon of bot review I have two medium findings.

  • valid afaict: GET /api/auth/sync csrf
    • Attacker who controls any ACTIVE custom domain can lure logged-in Alice (single click, image embed, etc.) to https://stacker.news/api/auth/sync?domain=attacker-territory.com&redirectUri=/. Her main-site session authenticates her, the GET path mints a verification token bound to {aliceId, attackerDomainId}, and 302s her to the attacker's domain where she now has a custom-domain session cookie linking her stacker.news identity.
  • unrealistic afaict: race in consumeVerificationToken and createSessionToken due to multi-statement SQL not wrapped in a transaction
    • the bug would require ACTIVE -> HOLD -> ACTIVE happening very fast (like milliseconds), which is not possible with the way the domain mapping works ... but it might be nice to not have to think about this relationship in the future

This was flagged as "policy":

  • redirect callback can never enforce "redirect only to a domain you started from"

    • a bit related to the CSRF finding above

I'm going to do some more targeted bot review and QA another time (which I expect to go fine).

After that I'll do a human review, and if the bots are right, I shouldn't find anything new.

Comment thread components/login-button.js
@Soxasora
Member Author

Soxasora commented May 5, 2026

I've implemented CSRF protection for the login flow using a JWE. /api/auth/sync will reject any request that doesn't come with a valid proof.

  1. User goes to pizza.com/login
  2. Middleware creates a login flow proof that expires in 10 minutes, and redirects to /api/auth/redirect with the proof
  3. /api/auth/redirect redirects to /api/auth/sync with the proof
    • must remember that this extra redirect endpoint is a workaround for Next.js' inability to redirect to absolute localhost URLs
  4. auth sync will verify the proof before creating a verification sync token

This means that only /login or /signup on a custom domain will be able to establish an auth sync login flow, everything else is rejected.
The proof is replayable, but the proof alone doesn't grant a session.


unrealistic afaict: race in consumeVerificationToken and createSessionToken due to multi-statement SQL not wrapped in a transaction

even if it's an unrealistic scenario, it's a chance to improve code quality! consumeVerificationToken now starts a transaction, locks the Domain row and finally deletes the verification token; it will then return what's necessary to create the final session token without requesting domain info again.


every other nit has been addressed, thank you so much k00b!

edit:

redirect callback can never enforce "redirect only to a domain you started from"

I missed this, but do we really need to enforce this? Now that the flow is completely controlled by us, I don't see why this should be enforced.

Comment thread lib/domains/auth-sync.js Outdated
Comment on lines +30 to +34
export function createLoginFlowProof ({ domainName, expiration, secret }) {
  if (!secret) throw new Error('login flow proof: missing secret')
  if (!domainName || !expiration) throw new Error('login flow proof: missing inputs')
  return sign(secret, `${domainName.toLowerCase()}:${expiration}`)
}
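For context, a sketch of what the verifying side of the excerpt above might look like. The PR's `sign` helper is not shown in this thread, so the HMAC-SHA256 stand-in below is an assumption, and `verifyLoginFlowProof` is a hypothetical name. It also assumes `expiration` is a millisecond timestamp.

```javascript
const crypto = require('node:crypto')

// Assumption: stand-in for the PR's sign(), whose implementation is not shown.
function sign (secret, message) {
  return crypto.createHmac('sha256', secret).update(message).digest('hex')
}

// Hypothetical counterpart to createLoginFlowProof: checks expiry, then the MAC.
function verifyLoginFlowProof ({ domainName, expiration, proof, secret, now = Date.now() }) {
  if (!secret || !domainName || !expiration || !proof) return false
  const expiresAt = Number(expiration)
  // Reject malformed or elapsed expirations before doing any crypto.
  if (!Number.isFinite(expiresAt) || expiresAt <= now) return false
  const expected = Buffer.from(sign(secret, `${domainName.toLowerCase()}:${expiration}`), 'hex')
  const received = Buffer.from(String(proof), 'hex')
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && crypto.timingSafeEqual(received, expected)
}
```

The expiration is part of the signed message, so changing the `expiration` param without re-signing makes the proof fail.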
Member Author

wait... this can be a JWE, then the login url will look cleaner with just a JWE instead of proof + expiration params.
I'll do this right now.

Member Author

Well the login URL is not cleaner ... it's worse ... but this is better as we don't have to deal with manual expiration

@huumn
Member

huumn commented May 5, 2026

I'll do another round of bot review on the changes.

I was getting hydration and other errors in my browser console during QA but I might have something weird going on with my env. Just thought I should flag that while I'm here.


@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread lib/safe-url.js
try {
  // arbitrarily resolve against the main domain. if the origin changes, it's unsafe
  const base = process.env.NEXT_PUBLIC_URL
  return new URL(uri, base).origin === base

Origin comparison fragile if env var has trailing slash

Low Severity

isSafeRedirectPath compares new URL(uri, base).origin directly against the raw process.env.NEXT_PUBLIC_URL string. The .origin property never includes a trailing slash, so if NEXT_PUBLIC_URL is ever set with a trailing slash (e.g. https://stacker.news/), the comparison would always fail, breaking all relative-path redirects including the custom-domain auth sync callbackUrl. Currently the env files don't have a trailing slash, but the function provides no normalization guard.
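A possible normalization guard, sketched under the assumption that the base URL is passed in as a parameter (the original reads `process.env.NEXT_PUBLIC_URL` directly); this is an illustration, not the fix applied in the PR.

```javascript
// Compare origins on both sides so a trailing slash (or path) in the
// configured base URL does not break the check.
function isSafeRedirectPath (uri, base) {
  try {
    const baseOrigin = new URL(base).origin
    // Resolve against the main domain; if the origin changes, it's unsafe.
    return new URL(uri, baseOrigin).origin === baseOrigin
  } catch {
    // Unparseable base or uri: treat as unsafe.
    return false
  }
}
```

With this shape, `https://stacker.news` and `https://stacker.news/` behave the same as the base value.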

Labels

auth domains feature new product features that weren't there before


3 participants