Skip to content

Commit b9fc87b

Browse files
committed
fix(webapp): close back-fill-vs-promotion race on env-keyspace cache writes
Stamp the env hash with an `__owner_worker_id` field at promotion time; the back-fill Lua script reads it and CAS-skips the env write if it doesn't match the worker the back-filler resolved to. Scenario the guard closes: 1. Resolver reads env cache, misses 2. Resolver reads PG: findCurrentWorker -> A, BWT for A -> A's data 3. [promotion to B fires, atomically replaces env hash + by-worker:B] 4. Resolver fires back-fill with workerId=A -> env hash now stamped owner=B; A != B -> back-fill skips env write -> by-worker:A still written (harmless; key contains workerId, dead worker) The by-worker keyspace doesn't need a CAS check — the key itself contains the workerId, so stale writes land in a dead worker's keyspace and are never read. Devin Review flagged this as a sub-millisecond race bounded by the 24h env TTL. Cost to close it is one extra HGET inside the Lua call (atomic with the HSET, no extra round trip) and one extra HSET on the promotion write.
1 parent 2c0fae2 commit b9fc87b

1 file changed

Lines changed: 50 additions & 14 deletions

File tree

apps/webapp/app/services/taskMetadataCache.server.ts

Lines changed: 50 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -116,28 +116,43 @@ end
116116
return 1
117117
`;
118118

119+
/**
120+
* Reserved field name on env hashes that records the worker currently
121+
* "owning" the env keyspace. The back-fill Lua script reads this and skips
122+
* its env-side write if the owner has flipped — closing the race where a
123+
* concurrent promotion atomically replaces the env hash between a resolver's
124+
* PG read and its back-fill write. Customer task slugs are kebab/camelCase
125+
* and never start with `__`, so collisions are not a concern; an accidental
126+
* `getCurrent(envId, "__owner_worker_id")` would JSON.parse-fail and fall
127+
* back to PG, not corrupt anything.
128+
*/
129+
const OWNER_FIELD = "__owner_worker_id";
130+
119131
/**
120132
* Atomically replace BOTH keyspaces in one Redis transaction. Used at deploy
121133
* promotion — the worker just became current for the env, so the env keyspace
122-
* and the worker keyspace get the same field set.
134+
* and the worker keyspace get the same field set, and the env hash is
135+
* stamped with the new owner workerId.
123136
*
124137
* KEYS[1] = env hash key
125138
* KEYS[2] = by-worker hash key
126139
* ARGV[1] = env ttl seconds (0 = no TTL)
127140
* ARGV[2] = by-worker ttl seconds (0 = no TTL)
128-
* ARGV[3..N] = alternating field, value pairs (same for both hashes)
141+
* ARGV[3] = workerId (env-hash owner marker)
142+
* ARGV[4..N] = alternating field, value pairs (same for both hashes)
129143
*/
130144
const REPLACE_TWO_HASHES_LUA = `
131145
redis.call("DEL", KEYS[1])
132146
redis.call("DEL", KEYS[2])
133-
if #ARGV > 2 then
147+
if #ARGV > 3 then
134148
local fv = {}
135-
for i = 3, #ARGV do
149+
for i = 4, #ARGV do
136150
fv[#fv + 1] = ARGV[i]
137151
end
138152
redis.call("HSET", KEYS[1], unpack(fv))
139153
redis.call("HSET", KEYS[2], unpack(fv))
140154
end
155+
redis.call("HSET", KEYS[1], "${OWNER_FIELD}", ARGV[3])
141156
local envTtl = tonumber(ARGV[1])
142157
if envTtl and envTtl > 0 then
143158
redis.call("EXPIRE", KEYS[1], envTtl)
@@ -169,27 +184,44 @@ return 1
169184

170185
/**
171186
* Atomically upsert one field in BOTH keyspaces. Used by the non-locked
172-
* back-fill path. The env-keyspace TTL is only set if no TTL is present
173-
* (preserves the promotion boundary); the by-worker TTL is refreshed.
187+
* back-fill path.
188+
*
189+
* The by-worker hash always gets written (the key contains the workerId, so
190+
* stale data lands in a dead worker's keyspace and is never read by anyone
191+
* not pinned to that version).
192+
*
193+
* The env hash is CAS-guarded by `${OWNER_FIELD}`: if a concurrent promotion
194+
* has replaced the hash between this resolver's PG read and this write, the
195+
* stored owner won't match the workerId the back-filler resolved to, so the
196+
* env write is skipped — preventing the back-fill from overwriting a freshly
197+
* promoted slug with stale data from the previous worker.
174198
*
175199
* KEYS[1] = env hash key
176200
* KEYS[2] = by-worker hash key
177201
* ARGV[1] = env ttl seconds (0 = no TTL)
178202
* ARGV[2] = by-worker ttl seconds (0 = no TTL)
179-
* ARGV[3] = field
180-
* ARGV[4] = value
203+
* ARGV[3] = writer's expected env-hash owner workerId
204+
* ARGV[4] = field
205+
* ARGV[5] = value
181206
*/
182207
const SET_TWO_FIELDS_LUA = `
183-
redis.call("HSET", KEYS[1], ARGV[3], ARGV[4])
184-
local envTtl = tonumber(ARGV[1])
185-
if envTtl and envTtl > 0 and redis.call("TTL", KEYS[1]) == -1 then
186-
redis.call("EXPIRE", KEYS[1], envTtl)
187-
end
188-
redis.call("HSET", KEYS[2], ARGV[3], ARGV[4])
208+
redis.call("HSET", KEYS[2], ARGV[4], ARGV[5])
189209
local workerTtl = tonumber(ARGV[2])
190210
if workerTtl and workerTtl > 0 then
191211
redis.call("EXPIRE", KEYS[2], workerTtl)
192212
end
213+
214+
local owner = redis.call("HGET", KEYS[1], "${OWNER_FIELD}")
215+
if owner == false or owner == ARGV[3] then
216+
redis.call("HSET", KEYS[1], ARGV[4], ARGV[5])
217+
if owner == false then
218+
redis.call("HSET", KEYS[1], "${OWNER_FIELD}", ARGV[3])
219+
end
220+
local envTtl = tonumber(ARGV[1])
221+
if envTtl and envTtl > 0 and redis.call("TTL", KEYS[1]) == -1 then
222+
redis.call("EXPIRE", KEYS[1], envTtl)
223+
end
224+
end
193225
return 1
194226
`;
195227

@@ -205,6 +237,7 @@ declare module "ioredis" {
205237
workerKey: string,
206238
envTtlSeconds: string,
207239
workerTtlSeconds: string,
240+
workerId: string,
208241
...fieldValues: string[]
209242
): Result<number, Context>;
210243
taskMetaSetFieldRefreshTtl(
@@ -219,6 +252,7 @@ declare module "ioredis" {
219252
workerKey: string,
220253
envTtlSeconds: string,
221254
workerTtlSeconds: string,
255+
workerId: string,
222256
field: string,
223257
value: string,
224258
callback?: Callback<number>
@@ -280,6 +314,7 @@ export class RedisTaskMetadataCache implements TaskMetadataCache {
280314
byWorkerKey(workerId),
281315
String(this.currentEnvTtlSeconds),
282316
String(this.byWorkerTtlSeconds),
317+
workerId,
283318
...fieldValues
284319
);
285320
} catch (error) {
@@ -322,6 +357,7 @@ export class RedisTaskMetadataCache implements TaskMetadataCache {
322357
byWorkerKey(workerId),
323358
String(this.currentEnvTtlSeconds),
324359
String(this.byWorkerTtlSeconds),
360+
workerId,
325361
entry.slug,
326362
encode(entry)
327363
);

0 commit comments

Comments
 (0)