custom domains: auth#2908
Conversation
- ACM support - custom domains crud, resolvers, fragments - custom domains form, guidelines - custom domains context - domain verification every 5 minutes via pgboss - domain validation schema - basic custom domains middleware, to be completed - TODOs tracings
- CustomDomain -> Domain - DomainVerification table - CNAME, TXT, SSL verification types - WIP DomainVerification upsert
…ange status of a Record from its Attempt, multi-purpose dns verification
…tch as they can fail
- use DomainVerificationStatus enum for domains and records - adapt Territory Form UI to new schema - return 'records' as an object with its types - wip: prepare for attempts and certificate usage for prisma
fix: - fix setDomain mutation transaction - fix schema typedefs enhance: - DNS records guidelines with flex-wrap for longer records cleanup: - add comments to worker - remove console.log on validation values
… HOLD handle territory changes via triggers - on territory stop, HOLD the domain - on territory takeover from another user, delete the domain and its associated records handle ACM certificates via trigger - on domain/domainCertificate deletion, ask ACM to delete the certificate via a pgboss job; removes the need to ask ACM in multiple places clear domains that have been on HOLD for more than 30 days, check every midnight via pgboss schedule use 'domains' profile for worker jobs
…not valid or we're in production
|
This is all bot review (what I find myself doing in preparation of my review) but I wanted to share before in case any of it is useful. Screenshots of the canvas I had it generate:
The fix plan this led to (again, just bot stuff, I have not reviewed much). Phase 1 seems reasonable at a glance. I'm unsure about phase 2 being a meaningful improvement (but again, I have not absorbed all the context yet). THE FIX PLANPhase 1: Harden Existing Flow
Phase 2: Remove URL-Borne Bearer Material
|
|
Rereading my findings from last night, it's unclear how high risk the high risk finding is. The finding is valid and the sync token is a special case, but imho the 5 minute domain switch check + jwt invalidation is sufficient for our private beta, whether after On the other hand, it is probably worthwhile to have the ability to do synchronous DNS checks and add extra protections depending on the request scenario, ie places where an attack is easier to pull off. Is this such a scenario, and is it much more severe than average? Marginally, maybe. Anyway, I'm still a bit naive to these changes. Regardless of whether special case hardening is worthwhile, it can wait until we leave the private beta if that's what's best. There always be inches to gain in security given we are limited to reactive defenses. |
I had something similar reported and it combines with other issues. I'm still in repro phase; this is high compexity. I do think that "Design a replacement handoff that does not put redeemable auth material in the custom-domain URL" is valid though - having tokens in query string puts it in browser history. Can probably be protected with a challenge secret like done in PKCE, to prevent capture by XSS / browser extensions. The "protocol flow" for PKCE paints a clear solution: where, if |
|
Thanks for all the research guys, I think we just want to add the proof header just to ship the private beta of custom domains. It's a small change that won't require re-designing the pipeline. I'm also okay with a DNS query on
Instead, this is something I don't think we'll ever want to do, it would mean a db lookup on every SSR and graphql request. This being said, an attacker can anyway exchange the verification token for a JWT via our middleware when they re-establish a correct DNS configuration towards us, and if they're lucky enough to be in the correct time window. This is, sadly, valid for any safety measures we can come up with because the attacker can control the domain and use a reverse proxy to access cookies, sniff and do all kinds of things (still they gotta be lucky with timing). But! We must remember that even if the territory owner/hijacker gets the verification token and/or the full JWT token, it will get revoked the moment our cache catches up. And said JWT cannot be shared with edit: I'm all for implementing PKCE after private beta. Actually last year I implemented a (bad) prototype of a PKCE-ish process: #2180 (comment), but I remember removing it because auth sync was starting to be too complex for an MVP + the dread of knowing that whatever we do can be exploited. |
My only counter to that is that the moment with the least friction to do it is now, unless private beta stage is throwaway code and all these massive PRs will be reverted and rewritten from scratch - which I don't expect, but maybe I'm naive. Doesn't have to be in this PR though! |
It's a short-lived token, so while not great, it's about as bad as a magic link in this respect (I think). I think the risk with the
Unless I'm missing something, with a DNS record switch, even if some secrets are only known to the client, those secrets can be MiTM'd at C/D in the diagram. It does make an attack harder though. We've been unable to find a flow that can't be MiTM'd in this way. It's not possible afaik, but it's like we need client context/storage to bind to DNS records/certs at a point in time, such that when records/certs change, the client context invalidates. |
If we can materially improve upon the security of these PRs by reverting and rewriting them from scratch before making the beta public, we will. |
|
It'd perhaps help to enumerate attacker personas/scenarios and be clear about what we are protecting against:
PKCE may address (2) substantially without being able to address (1). Specifically, we've mentioned (2) could help when:
|
|
The problem with the (2) scenarios is that we've already lost at that point, the user is compromised and the attacker can just get the JWT. The same can be done for I agree with @huumn that PKCE can't help with (1) and it can only make the attack a little harder to execute. Maybe I'm about to say something naive or in the spirit of whataboutism, but I actually think that auth is okay-ish with the header, and that the real problem is wallets. Impersonating someone for a brief period of time is a 1000000x better outcome than stealing wallet creds.
|
I'll do some work in parallel to you, without disrupting your flow (after I finalize aws sdk stuff) |
…via proof header based on NEXTAUTH_SECRET
There was a problem hiding this comment.
After an afternoon of bot review I have two medium findings.
- valid afaict:
GET /api/auth/synccsrf- Attacker who controls any
ACTIVEcustom domain can lure logged-in Alice (single click, image embed, etc.) tohttps://stacker.news/api/auth/sync?domain=attacker-territory.com&redirectUri=/. Her main-site session authenticates her, theGETpath mints a verification token bound to{aliceId, attackerDomainId}, and302s her to the attacker's domain where she now has a custom-domain session cookie linking her stacker.news identity.
- Attacker who controls any
- unrealistic afaict: race in
consumeVerificationTokenandcreateSessionTokendue to multi-statement SQL not wrapped in a transaction- the bug would require
ACTIVE->HOLD->ACTIVEhappening very fast (like milliseconds), which is not possible with the way the domain mapping works ... but it might be nice to not have to think about this relationship in the future
- the bug would require
This was flagged as "policy":
-
redirectcallback can never enforce "redirect only to a domain you started from"- a bit related to the CSRF finding above
I'm going to done some more targeted bot review and QA another time (which I expect to go fine).
After that I'll do a human review, and if the bots are right, I shouldn't find anything new.
…tion; use retrieved data to build the final session token
…of gen and verification to prevent unauthorized sync endpoint access
…(e.g. missing NEXTAUTH_SECRET)
|
I've implemented CSRF protection for the login flow using a JWE.
This means that only
even if it's an unrealistic scenario, it's a chance to improve code quality! every other nit has been addressed, thank you so much k00b! edit:
I missed this, but do we really need to enforce this? Now that the flow is completely controlled by us, I don't see why this should be enforced. |
| export function createLoginFlowProof ({ domainName, expiration, secret }) { | ||
| if (!secret) throw new Error('login flow proof: missing secret') | ||
| if (!domainName || !expiration) throw new Error('login flow proof: missing inputs') | ||
| return sign(secret, `${domainName.toLowerCase()}:${expiration}`) | ||
| } |
There was a problem hiding this comment.
wait... this can be a JWE, then the login url will look cleaner with just a JWE instead of proof + expiration params.
I'll do this right now.
There was a problem hiding this comment.
Well the login URL is not cleaner ... it's worse ... but this is better as we don't have to deal with manual expiration
|
I'll do another round of bot review on the changes. I was getting hydration and other errors in my browser console during QA but I might have something weird going on with my env. Just thought I should flag that while I'm here. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 23403fe. Configure here.
| try { | ||
| // arbitrarily resolve against the main domain. if the origin changes, it's unsafe | ||
| const base = process.env.NEXT_PUBLIC_URL | ||
| return new URL(uri, base).origin === base |
There was a problem hiding this comment.
Origin comparison fragile if env var has trailing slash
Low Severity
isSafeRedirectPath compares new URL(uri, base).origin directly against the raw process.env.NEXT_PUBLIC_URL string. The .origin property never includes a trailing slash, so if NEXT_PUBLIC_URL is ever set with a trailing slash (e.g. https://stacker.news/), the comparison would always fail, breaking all relative-path redirects including the custom-domain auth sync callbackUrl. Currently the env files don't have a trailing slash, but the function provides no normalization guard.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 23403fe. Configure here.




Description
Part of #1942, revives and adapts #2180
Focuses on synchronizing authentication between SN and a custom domain.
Login overview
Authentication is centralized on
stacker.news.When a user visits
/loginon a custom domain, they are redirected tostacker.news/loginwith adomainparam. This param customizes the login experience for that specific domain/territory.Flow
From the login page, users can authenticate with the current nym or choose a different account to sync.
A
callbackUrl=/api/auth/sync/...is always included to support full login/signup flowsAfter authentication:
stacker.newsNext-auth redirects
To properly support logout flows on custom domains (without
router.pushhacks), next-auth now supports a custom redirect behavior.If
callbackUrlis:stacker.newsDNS monitoring
A recurring
checkActiveDomainsDNSpgboss job runs every 5 minutes to verify that custom domains still point correctly to us.When a DNS drift is detected, the domain is moved to
HOLD, triggering:tokenVersionincrementThis check is essential to custom domain safety and JWT revocability.
JWT revocability
JWTs issued for custom domains now include:
domainIdtokenVersionValidation:
tokenVersionmismatch -> token is invaliddomainIdmismatch -> token is invalidThis ensures that the existing tokens are invalidated when DNS changes and/or when domain is re-verified or re-created.
Attack scenarios and infos are available in
/docs/dev/custom-domains.mdMedia
custom domain login -> pick -> sync -> logout
auth-sync-flow.mov
custom domain signup -> main domain -> sync
cd-signup.mp4
Additional Context
Next.js middleware proxy has a bug that prevents absolute redirects to
localhost, collapsing them into relative redirects. To circumvent this bug, a new/api/auth/redirectendpoint is used to redirect tomain-domain/loginwith the correctdomainandcallbackUrlparams.We could make a generic
api/redirectendpoint that only allows main domain and verified custom domain, and use it for both/loginredirect and external territory badges that, at the moment, are based on app code rather than being controlled by the middleware.Checklist
Are your changes backwards compatible? Please answer below:
Yes, existing behavior is preserved and every change only applies to custom domains.
On a scale of 1-10 how well and how have you QA'd this change and any features it might affect? Please answer below:
6
/login: OKFor frontend changes: Tested on mobile, light and dark mode? Please answer below:
Yes
Did you introduce any new environment variables? If so, call them out explicitly here:
n/a
Did you use AI for this? If so, how much did it assist you?
Note
High Risk
High risk because it adds new authentication flows/endpoints for custom domains, modifies NextAuth JWT/redirect behavior, and introduces DB-trigger-driven token revocation plus a recurring worker job that can force domains into
HOLD. Misconfiguration could cause login/redirect breakage or unintended session invalidation.Overview
Adds custom-domain auth sync so logins/signups initiated on a custom domain are redirected through the main domain and then establish a session back on the custom domain via new
GET/POST /api/auth/syncand/api/auth/redirect, with CSRF protection using a proxy-mintedsync_proofand HMAC proof headers.Introduces revocable custom-domain JWTs by adding
Domain.tokenVersion(plusdomainIdpinning) and enforcing these claims in NextAuth’sjwtcallback; also tightens redirect handling with a NextAuthredirectcallback and newsafe-urlhelpers.Adds DNS drift monitoring:
verifyDNSRecordnow distinguishes transient vs conclusive resolver failures, a newcheckActiveDomainsDNSworker runs every 5 minutes (pgboss schedule) to flip driftingACTIVEdomains toHOLD, and a DB trigger bumpstokenVersionon any transition to/fromACTIVEto invalidate existing tokens. UI/UX is updated to show territory-aware login headers/links and a “log in with @nym” selector, and dropdown styles are factored into a shared CSS module.Reviewed by Cursor Bugbot for commit 23403fe. Bugbot is set up for automated code reviews on this repo. Configure here.