rocm: enable wmma indexer support by alantsev · Pull Request #180 · antirez/ds4

alantsev · 2026-05-17T07:06:25Z

Most of the changes come from the upstream main branch. The only files directly changed by this commit are:

M Makefile
M ds4_cuda.cu
M ds4_rocm.h
M ds4_server.c

The ROCm-related changes enable the WMMA indexer for the hipcc build.

the current tests and eval results:

$ ./ds4_test
long-context:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: long-context prefill 0/30474
ds4-test: long-context prefill 8192/30474
ds4-test: long-context prefill 16384/30474
ds4-test: long-context prefill 24576/30474
ds4-test: long-context prefill 30474/30474
long-context: OK
tool-call-quality:
ds4-test: tool-call quality fast path
ds4-test: tool-call quality exact path
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
tool-call-quality: OK
logprob-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: vector short_italian_fact
ds4-test: vector short_code_completion
ds4-test: vector short_reasoning_plain
ds4-test: vector long_memory_archive skipped (API/official graph mismatch)
ds4-test: vector long_code_audit
logprob-vectors: OK
metal-kernels:
ds4: CUDA registered 0.00 GiB model mapping for device access
metal-kernels: OK
server:
server: OK
ds4 tests: ok

The evaluation run

$ ./ds4-eval -m ds4flash.gguf --plain --questions 12 --tokens 2048 --temp 0 --seed 1
...

PASSED got 16 expected 16 (159.8s, 1437 tokens)
ds4-eval: 10/12 passed, 2 failed, runtime 00h:27m
#   state      prompt      gen    total given    correct  test
  1 PASSED        201     1661     1862 B        B        GPQA Diamond/recNu3MXkvWUzHZr9
  2 PASSED        149      370      519 C        C        SuperGPQA/001b51d76b4d422988f2c11f104a2c6c
  3 PASSED         81      623      704 70       70       AIME2025/aime2025-01
  4 FAILED        313     2048     2361 A        C        GPQA Diamond/recoiTJPGUmzAkief
  5 PASSED        272     2048     2320 J        J        SuperGPQA/b7e20eac98764fb0bf30e8366d951daa
  6 PASSED        146     1325     1471 468      468      AIME2025/aime2025-16
  7 PASSED        156     1303     1459 B        B        GPQA Diamond/rec4UqStf9WUVif1f
  8 PASSED        127      280      407 E        E        SuperGPQA/4a1d1780a93f4093b6fb7d3c314cbea8
  9 FAILED        633     2048     2681 26       588      AIME2025/aime2025-02
 10 PASSED        182     1080     1262 B        B        GPQA Diamond/recgI6tUQ7RLJRWGx
 11 PASSED        137      232      369 A        A        SuperGPQA/6082513c8dba4ec68aa68f1bf5854d09
 12 PASSED        165     1437     1602 16       16       AIME2025/aime2025-03

Implements the Responses API endpoint that Codex CLI (and other modern OpenAI tooling) speaks instead of /v1/chat/completions. The wire format is documented in OpenAI's Responses API; this implementation has been iterated against the Codex CLI binary's SSE parser shape until no remaining schema gaps were found.

Request parsing (parse_responses_request, parse_responses_input):
- Accepts the typed input array (message, function_call, function_call_output, reasoning, custom_tool_call(_output), local_shell_call(_output), web_search_call(_output), tool_search_call(_output), image_generation_call(_output), compaction, context_compaction).
- Maps hosted-tool history to function_call/function_call_output so prior actions survive across turns; rejects unknown item types and non-completed status with 400 to avoid silent context loss.
- Strict content-array parsing: only string|null|array of recognized text blocks (input_text/output_text/text/summary_text/reasoning_text); rejects non-text modalities (input_image/file/audio) instead of accepting an empty prompt.
- Merges adjacent function_call items into the preceding assistant message so text + tool-call turns render as a single assistant block.
- Honors reasoning.effort (incl. "minimal"/"none") and gates reasoning summary surface on reasoning.summary opt-in.
- Rejects previous_response_id, conversation, and forced tool_choice explicitly (constrained decoding / persisted state not supported).

Output (responses_sse_*, responses_final_response):
- Emits the full streaming lifecycle: response.created, output_item.added/.done, reasoning_summary_part.added/.done, reasoning_summary_text.delta/.done, content_part.added/.done, output_text.delta/.done, function_call_arguments.delta/.done, response.completed.
- Branches the terminal event by finish reason: response.failed for errors and response.incomplete with reason "max_tokens" for length.
- Every event carries sequence_number; every output_text part carries annotations:[]; function_call output_item.added ships with an empty arguments string (full args arrive via function_call_arguments.done and output_item.done), and item ids are stable across added/done.
- Tracks whether </think> was actually observed so a truncated stream marks the reasoning item incomplete instead of "completed".
- Recovers gracefully when the DSML tool parse fails after the model was suppressed at the tool marker: the suppressed tail is flushed as additional output_text deltas so the streamed message matches output_item.done.

Tested by 25 rounds of /codex:adversarial-review against the same client this is meant to feed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
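
The streaming lifecycle described in this commit message can be illustrated with a minimal sketch. It is not the code in ds4_server.c: the function names, the `finish_reason` labels ("error", "length"), the `msg_0` item id, and the simplified payload fields are assumptions made for illustration. Only the event ordering, the per-event `sequence_number`, the stable item id, and the terminal-event branching follow the text above; the `response.` prefix on every event name follows OpenAI's naming.

```python
import json


def sse_events(text_deltas, finish_reason):
    """Yield (event_name, payload) pairs for one text-only response.

    Hypothetical sketch: payloads are reduced to the fields discussed
    in the commit message and are not the full Responses API schema.
    """
    seq = 0

    def event(name, **fields):
        # Every event carries a monotonically increasing sequence_number.
        nonlocal seq
        payload = {"type": name, "sequence_number": seq, **fields}
        seq += 1
        return name, payload

    item_id = "msg_0"  # stays the same across added/done events
    yield event("response.created")
    yield event("response.output_item.added", item_id=item_id)
    yield event("response.content_part.added", item_id=item_id)
    for delta in text_deltas:
        yield event("response.output_text.delta", item_id=item_id, delta=delta)
    yield event("response.output_text.done", item_id=item_id,
                text="".join(text_deltas))
    yield event("response.content_part.done", item_id=item_id)
    yield event("response.output_item.done", item_id=item_id)

    # The terminal event depends on why generation stopped.
    if finish_reason == "error":
        yield event("response.failed")
    elif finish_reason == "length":
        yield event("response.incomplete", reason="max_tokens")
    else:
        yield event("response.completed")


def to_wire(name, payload):
    """Serialize one event in Server-Sent Events framing."""
    return f"event: {name}\ndata: {json.dumps(payload)}\n\n"


if __name__ == "__main__":
    for name, payload in sse_events(["Hel", "lo"], "stop"):
        print(to_wire(name, payload), end="")
```

Keeping the counter inside the generator means a client can detect dropped or reordered events from `sequence_number` alone.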

Broaden the DS4 imatrix prompt dataset with provider-neutral agent/tool traffic, multi-language programming prompts, algorithm recall, Bash scripting, and multilingual translation tasks. Remove duplicate rendered prompts and avoid provider-specific client references in the generated calibration corpus. This improves calibration coverage without claiming to fix a distributed GGUF bug.

Fold the successful CUDA selector/top-k/indexed-attention changes into one clean commit. This excludes rejected experiment commits and the local prefill-slope work log.

Measured on GB10 with speed-bench/promessi_sposi.txt, 2048-token append chunks: 32K prefill improved from 255.61 tok/s on origin/main to 346.49 tok/s. Full-curve average improved from 316.39 tok/s to 369.76 tok/s. 32K full prompt + 128-token generation prefill improved from 312.87 tok/s to 368.43 tok/s, while generation stayed neutral at 12.49 -> 12.48 tok/s.

Correctness: make cuda-regression; ./ds4_test --logprob-vectors --tool-call-quality; ./ds4_test --server --metal-kernels.
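
The throughput figures quoted in this commit message are easier to compare as relative changes. The short sketch below only does that arithmetic on the numbers given above; the helper name `speedup_percent` is made up for illustration. It works out to roughly 36%, 17%, and 18% faster prefill for the three prefill measurements, with generation changing by less than 0.1%.

```python
def speedup_percent(before, after):
    """Relative throughput change in percent (positive means faster)."""
    return (after / before - 1.0) * 100.0


# (before, after) in tok/s, copied from the commit message.
measurements = {
    "32K prefill": (255.61, 346.49),
    "full-curve average": (316.39, 369.76),
    "32K prompt + 128-token generation, prefill": (312.87, 368.43),
    "generation": (12.49, 12.48),
}

if __name__ == "__main__":
    for name, (before, after) in measurements.items():
        print(f"{name}: {speedup_percent(before, after):+.1f}%")
```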

Build score_official against the CUDA runtime on Linux and select the CUDA backend there, while keeping the existing Metal path on macOS.

Correctness: make -C gguf-tools quality-score; gguf-tools/quality-testing/score_official ds4flash.gguf /tmp/ds4_quality_smoke/manifest.tsv /tmp/ds4_quality_smoke/scores.tsv 16384.

Replace the default long-context continuation check with a deterministic prose-story retrieval test. The fixture embeds spelled-out person-number assignments in a long rendered prompt, and ds4_test now validates the generated Name=number list instead of brittle sampled prose.
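
Validating a structured `Name=number` list rather than free prose could look roughly like the sketch below. The actual fixture and checker in ds4_test are not shown in this PR, so the regular expression, the function names, and the sample names and numbers here are all hypothetical.

```python
import re


def parse_assignments(text):
    """Extract Name=number pairs from generated text into a dict."""
    pairs = re.findall(r"([A-Za-z]+)\s*=\s*(\d+)", text)
    return {name: int(number) for name, number in pairs}


def check_retrieval(generated, expected):
    """True only if every expected assignment was recovered exactly."""
    return parse_assignments(generated) == expected


if __name__ == "__main__":
    expected = {"Anna": 12, "Marco": 7}  # made-up fixture values
    print(check_retrieval("Anna=12\nMarco=7", expected))
    print(check_retrieval("Anna=12\nMarco=8", expected))
```

Comparing parsed pairs makes the check insensitive to whitespace and separators, which is what makes it less brittle than matching sampled prose.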

Preserve Responses namespace metadata and tool_search calls while rendering DSML-safe internal tool names. Replay function_call, hosted tool, and tool_search_output items into the shared chat/tool path so Codex and Pi can round-trip tool calls without losing KV-cache prefix reuse. Document the /v1/responses endpoint and add server unit coverage for namespace, tool_search, and replay output shapes.

This reverts commit 2a7a5f3. There was no ack from the user. Don't want to take a fix that is astronautically produced from an unclear error trace.

Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth. Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.

Keep normal CUDA context buffers on device allocations, but route very large KV-cache tensors through managed memory so million-token contexts do not starve unified-memory systems during graph/session allocation. The fallback is scoped to the long-lived KV/cache tensors and logs when it is used because it may reduce performance.

Tested on 0.180 with:
- make cpu
- make -B cuda-spark
- make cuda-regression
- ./ds4_test --server --metal-kernels
- ./ds4_test --logprob-vectors --tool-call-quality
- ds4-bench ctx-alloc 32768, 250000, and 1000000
- ds4-server --ctx 1000000 startup smoke

(cherry picked from commit 0b248a65c07d21f2fc8ff4815bd8b75af26719f9)

Parse Anthropic tool_use blocks by their own type field instead of relying on the enclosing message role being parsed first. Some clients serialize messages as content-before-role, which made full-history tool_result replays look like unknown live-only continuations. Fixes antirez#127.

Return a 400 error with error type "context_exceeded" when prompt tokens exceed context size. The response includes both n_prompt_tokens and n_ctx fields so clients can determine exactly why the request failed and how far over the limit they went. Error response format: { "error": { "message": "Prompt tokens (N) exceeds context size (M)", "type": "context_exceeded", "n_prompt_tokens": N, "n_ctx": M } }
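
As a rough illustration of the error shape described above, the sketch below builds the response body and shows how a client might use the two numeric fields. The function names are hypothetical and this is not the server's C implementation; only the JSON field names, the message wording, and the 400 status come from the commit message.

```python
import json


def context_exceeded_error(n_prompt_tokens, n_ctx):
    """Return (status, body) if the prompt does not fit, else None."""
    if n_prompt_tokens <= n_ctx:
        return None
    body = {
        "error": {
            "message": (
                f"Prompt tokens ({n_prompt_tokens}) exceeds "
                f"context size ({n_ctx})"
            ),
            "type": "context_exceeded",
            "n_prompt_tokens": n_prompt_tokens,
            "n_ctx": n_ctx,
        }
    }
    return 400, json.dumps(body)


def tokens_over_limit(response_body):
    """Client side: how many tokens must be trimmed from the prompt."""
    err = json.loads(response_body)["error"]
    return err["n_prompt_tokens"] - err["n_ctx"]


if __name__ == "__main__":
    status, body = context_exceeded_error(40000, 32768)
    print(status, tokens_over_limit(body))
```

Because both counts are machine-readable, a client can decide how much history to drop without parsing the human-readable message.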

dwarfstar is typoed to drawfstar

fix typo in readme

the answer was outside of the claimed energy precision. the evaluation after the fix (with smooth distribution over the tokens)

```
$ ./ds4-eval --temp 3.0 --min-p 0.25 --nothink
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-eval: context auto-sized to 16777 tokens (largest prompt=777 tokens, case=70, generation budget=16000)
ds4-eval: context buffers 479.38 MiB (ctx=16777, backend=cuda, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=4196)
ds4-eval: 17/92 passed, 1 failed, runtime 00h:34m
#   state      prompt      gen    total given    correct  test
  1 PASSED        201      733      934 B        B        GPQA Diamond/recNu3MXkvWUzHZr9
  2 PASSED        149       87      236 C        C        SuperGPQA/001b51d76b4d422988f2c11f104a2c6c
  3 PASSED         81      574      655 70       70       AIME2025/aime2025-01
  4 PASSED        313      239      552 C        C        GPQA Diamond/recoiTJPGUmzAkief
  5 PASSED        272      177      449 J        J        SuperGPQA/b7e20eac98764fb0bf30e8366d951daa
  6 PASSED        146     1140     1286 468      468      AIME2025/aime2025-16
  7 PASSED        156      646      802 B        B        GPQA Diamond/rec4UqStf9WUVif1f
  8 PASSED        127       52      179 E        E        SuperGPQA/4a1d1780a93f4093b6fb7d3c314cbea8
  9 PASSED        633     4780     5413 588      588      AIME2025/aime2025-02
 10 PASSED        182      322      504 B        B        GPQA Diamond/recgI6tUQ7RLJRWGx
 11 PASSED        137       68      205 A        A        SuperGPQA/6082513c8dba4ec68aa68f1bf5854d09
 12 PASSED        165      747      912 16       16       AIME2025/aime2025-03
 13 PASSED        149      672      821 A        A        GPQA Diamond (modified)/recDytVnNYZe2HuUU
 14 PASSED        167       68      235 J        J        SuperGPQA/bebf1ed45ae14ad7b4f205f3909cb58a
 15 FAILED        305     4837     5142 86       82       AIME2025/aime2025-18
 16 PASSED        131      671      802 D        D        GPQA Diamond/recNFJjE5PPTqVJGv
 17 PASSED        175       67      242 I        I        SuperGPQA/7ca71b86327744b78e93185a45bc5cef
 18 PASSED        102     1199     1301 117      117      AIME2025/aime2025-04
 19 STOPPED       187       80      267 -        B        GPQA Diamond/rec2UlKqC6RFHdcro
 20 PENDING         0        0        0 -        E        SuperGPQA/d44b94f7749345a39a65f6312bda8764
 21 PENDING         0        0        0 -        106      AIME2025/aime2025-19
 22 PENDING         0        0        0 -        B        GPQA Diamond/recv7GsQg3f0fvB1f
 23 PENDING         0        0        0 -        B        SuperGPQA/febe406f44d74a40b50bb5b7c69d5dc1
```

alantsev · 2026-05-23T16:40:50Z

merged latest changes from the upstream main branch.
enabled the agent.
simplified the Makefile.

$ make test
./ds4-eval --self-test-extractors
ds4-eval: answer extractor self-tests passed
./ds4_test
long-context:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: long-context prefill 0/30474
ds4-test: long-context prefill 8192/30474
ds4-test: long-context prefill 16384/30474
ds4-test: long-context prefill 24576/30474
ds4-test: long-context prefill 30474/30474
long-context: OK
tool-call-quality:
ds4-test: tool-call quality fast path
ds4-test: tool-call quality exact path
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
tool-call-quality: OK
logprob-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: vector short_italian_fact
ds4-test: vector short_code_completion
ds4-test: vector short_reasoning_plain
ds4-test: vector long_memory_archive skipped (API/official graph mismatch)
ds4-test: vector long_code_audit
logprob-vectors: OK
metal-short-prefill:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
metal-short-prefill: OK
metal-kernels:
ds4: CUDA registered 0.00 GiB model mapping for device access
ds4: CUDA registered 0.00 GiB model mapping for device access
ds4: CUDA registered 0.00 GiB model mapping for device access
metal-kernels: OK
metal-tensor-equivalence:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics

ds4-test: Tensor equivalence candidate route=auto
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: Tensor equivalence short_italian_fact top1 ref=108149 cand=108149 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence short_italian_fact largest deltas: id=0 ref=-8.78234 cand=-8.78234 abs=0 id=1 ref=18.7873 cand=18.7873 abs=0 id=2 ref=-53.75 cand=-53.75 abs=0 id=3 ref=15.7433 cand=15.7433 abs=0 id=4 ref=24.2153 cand=24.2153 abs=0
ds4-test: Tensor equivalence short_code_completion top1 ref=9854 cand=9854 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence short_code_completion largest deltas: id=0 ref=6.73034 cand=6.73034 abs=0 id=1 ref=25.0882 cand=25.0882 abs=0 id=2 ref=-44.3569 cand=-44.3569 abs=0 id=3 ref=11.6954 cand=11.6954 abs=0 id=4 ref=23.5527 cand=23.5527 abs=0
ds4-test: Tensor equivalence short_reasoning_plain top1 ref=926 cand=926 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence short_reasoning_plain largest deltas: id=0 ref=-0.454318 cand=-0.454318 abs=0 id=1 ref=23.2188 cand=23.2188 abs=0 id=2 ref=-43.7816 cand=-43.7816 abs=0 id=3 ref=15.0173 cand=15.0173 abs=0 id=4 ref=14.2727 cand=14.2727 abs=0
ds4-test: Tensor equivalence long_memory_archive top1 ref=10220 cand=10220 top5_overlap=5/5 overlap=17/20 max_rank_delta=3 rms=0.910726 max_abs=5.01606 top20_max_abs=3.59195
ds4-test: Tensor equivalence long_memory_archive largest deltas: id=67130 ref=5.90324 cand=10.9193 abs=5.01606 id=95394 ref=-0.0971611 cand=4.77867 abs=4.87583 id=109465 ref=-0.533072 cand=4.3211 abs=4.85417 id=40799 ref=1.97854 cand=6.77793 abs=4.79939 id=36320 ref=7.94481 cand=12.6538 abs=4.70901
ds4-test: Tensor equivalence long_code_audit top1 ref=671 cand=671 top5_overlap=5/5 overlap=17/20 max_rank_delta=9 rms=0.950555 max_abs=5.22848 top20_max_abs=2.40018
ds4-test: Tensor equivalence long_code_audit largest deltas: id=59928 ref=-18.5866 cand=-13.3582 abs=5.22848 id=30781 ref=8.43611 cand=13.4286 abs=4.99249 id=37176 ref=8.16547 cand=12.863 abs=4.69756 id=67153 ref=6.01649 cand=10.6338 abs=4.61734 id=85360 ref=7.75199 cand=12.2393 abs=4.4873
ds4-test: Tensor summary route=auto cases=5 capture_fail=0 logits_fail=0 greedy_fail=0 top1_mismatch=0 min_top5_overlap=5/5 min_overlap=17/20 worst_rank_delta=9 worst_rms=0.950555 worst_max_abs=5.22848 worst_top20_max_abs=3.59195
metal-tensor-equivalence: OK
server:
server: OK
ds4 tests: ok
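
For readers unfamiliar with the metrics in the "Tensor equivalence" lines above, the sketch below shows one plausible way to compute top-1 agreement, top-k overlap, RMS difference, and maximum absolute difference between a reference and a candidate logit vector. The exact definitions used by ds4_test are not stated in this PR, so these formulas and names are assumptions.

```python
import math


def top_ids(logits, k):
    """Indices of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def compare_logits(ref, cand, k=5):
    """Summarize how far a candidate logit vector is from a reference."""
    diffs = [abs(r - c) for r, c in zip(ref, cand)]
    return {
        "top1_match": top_ids(ref, 1) == top_ids(cand, 1),
        "topk_overlap": len(set(top_ids(ref, k)) & set(top_ids(cand, k))),
        "rms": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
        "max_abs": max(diffs),
    }


if __name__ == "__main__":
    ref = [1.0, 5.0, 3.0, 2.0]    # toy values, not model logits
    cand = [1.0, 5.0, 3.0, 6.0]
    print(compare_logits(ref, cand, k=2))
```

Under these definitions, a run such as long_code_audit above (top1 equal, overlap=17/20, rms below 1) would mean the greedy token is unchanged even though individual logits differ by several units.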

Full routed-MoE TensorOps enabled the gate, up, and down projections. The regression was isolated to the gate projection: enabling TensorOps for gate is sufficient to send a sensitive AIME continuation into a repeated wrong answer, while TensorOps for up+down remains stable. The kernel-side cause is small but real arithmetic drift in mpp::tensor_ops::matmul2d relative to the legacy simdgroup MMA contraction. A same-input routed-MoE probe showed no address/layout corruption: TensorOps gate was close to legacy, but not bit-identical. An isolated same-tile primitive probe confirmed the source outside DS4 routing and quantization: legacy simdgroup_multiply_accumulate matched a CPU FP32 serial dot-product reference on the tested tile, while TensorOps produced close nonzero FP32 differences. MTLMathModeSafe and the tested TensorOps descriptor variant did not remove the drift. That normally tiny drift matters here because MoE routing has discontinuous top-k expert selection. In the failing path the first observed safe-vs-full routing change was layer 3, token row 11: the selected sixth expert changed from 96 to 50 across a margin of only about 8e-4. Once an expert changes, the transformer state is no longer a smooth local perturbation, and autoregressive decoding can fall into a bad repetition basin. Attempts that preserved the full gate TensorOps speed did not produce a zero-drift or stable fix: forcing the routed intermediate to F32, using the older generic TensorOps routed matmul instead of the expert-major fast layout, changing the TensorOps descriptor mode, and compiling with strict Metal math all left the gate drift or the bad continuation in place. Retaining TensorOps for up and down keeps most of the MoE speedup, but gate stays on the legacy path because it feeds the nonlinear silu(gate) * up branch and is the projection that can flip later router decisions.
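
The point about discontinuous top-k selection can be shown with a toy example. The scores below are invented, not taken from the model; only the size of the margin (about 8e-4) mirrors the case described above. A drift of 1e-3 on one score is enough to swap which expert is selected.

```python
def top_k_experts(scores, k):
    """Indices of the k highest-scoring experts, best first."""
    order = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return order[:k]


# Toy router scores: experts 2 and 3 are separated by only 8e-4.
scores = [0.90, 0.80, 0.5000, 0.4992, 0.10]

# Simulate a small arithmetic drift on expert 3.
drifted = list(scores)
drifted[3] += 1e-3

if __name__ == "__main__":
    print(top_k_experts(scores, 3))
    print(top_k_experts(drifted, 3))
```

The first call selects experts 0, 1, and 2; the second selects 0, 1, and 3. The perturbation is tiny, but the set of experts that run differs, so the downstream activations change by more than a small local perturbation.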

Remove the routed-MoE TensorOps/NAX path completely instead of leaving it as a gated-off mode. Semantic evals showed that gate, up, and down TensorOps routed-MoE variants can each move the model into bad continuations, while the full-tile-only expert-major experiment was correctness-interesting but slower than the legacy simdgroup path. Keeping the dead kernels around risks accidental re-enablement without a trustworthy correctness story. The routed MoE grouped matmul now always uses the legacy 32-token expert-major simdgroup kernel. Other Metal4 TensorOps paths, such as the attention-output projection, remain enabled independently.

ds4-eval (fix): q13 provides wrong answer

agent: add --chdir working-directory option

Preserve generated output while linenoise prompt/history entries grow by scrolling the output region and keeping the output cursor column. Polish streaming colors by highlighting [upto], restoring active text attributes after prompt redraws, and showing throttled power in the status bar without duplicate messages. Keep unknown slash commands editable by beeping and restoring the input instead of printing an error or sending it to the model.

Persist agent session titles in the KV file trailer and derive the saved session ID from title plus created_at so resaves keep a stable identity. Preserve and display titles in /list, keep stripped sessions readable, and migrate legacy rendered-text sessions on their next successful save. Also fix stripped-session reloads to accept the retokenized rendered text count, and adjust the /list footer wording to use session IDs.
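
Deriving an ID from the title and creation time, so that re-saving the same session yields the same ID, might be sketched as follows. The hash function, the separator, the ID length, and the function name are assumptions; the commit message does not say how the agent actually computes the ID.

```python
import hashlib


def session_id(title, created_at):
    """Stable ID: identical title and created_at always give the same ID."""
    data = f"{title}\n{created_at}".encode("utf-8")
    # Truncating the digest keeps the ID short; 12 hex chars is arbitrary.
    return hashlib.sha256(data).hexdigest()[:12]


if __name__ == "__main__":
    first_save = session_id("rocm wmma notes", 1779550000)
    resave = session_id("rocm wmma notes", 1779550000)
    print(first_save == resave)
```

Because nothing that changes between saves (such as the token count or the save time) feeds the hash, the identity survives resaves.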

mitsuhiko and others added 30 commits May 11, 2026 12:30

feat(server): report KV cache usage

0ca2e28

feat(server): report Anthropic cache usage

38800bf

README: separate motivations.

c5ef7ac

Merge branch 'pr-91-responses' into responses-api

2174611

Tighten Responses tool_search replay

6396966

Fix Responses tool checkpoint cache reuse

a01bf1d

Fix Responses API live continuation

acb40bf

metal: cover q4 expert tensors in model views

2a7a5f3

Skip tool checkpoint canonicalization for exact DSML replay

b4c5f7c

Merge responses-api

e88a71e

Use visible live checkpoints for toolless thinking

5453ad0

Clarify server progress logs

646798f

Add Anthropic live tool continuation

43535e1

Revert "metal: cover q4 expert tensors in model views"

67e6146

This reverts commit 2a7a5f3. There was no ack from the user. Don't want to take a fix that is astronautically produced from an unclear error trace.

Tag Responses API server logs

0083475

Recover Responses replays without hidden reasoning

0610591

Stream Anthropic tool calls live

94c1f38

Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth. Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.

fix typo in readme

741d0cc

dwarfstar is typoed to drawfstar

Merge pull request antirez#155 from kernelzeroday/main

98593ec

fix typo in readme

Fix typos in README.md

f6fa52b

Merge branch 'pr-150-context-error' into merge-pr-150-standard-context

157873b

antirez and others added 4 commits May 23, 2026 15:12

Fix agent prefill progress bar sizing

9ae1eeb

agent: add working-directory option

23e264d

merge from main@upstream

89c60e1

alantsev force-pushed the rocm branch from a1a1b4c to c2b64dd Compare May 23, 2026 15:44

Highlight agent code output

e792c55

antirez and others added 22 commits May 23, 2026 19:09

Fix F16 routed MoE graph dumps

a05e69e

Add GPU power throttling

444afce

Simplify agent edit tools

df22245

Add runtime power commands

f398aa3

Guard attention output TensorOps full tiles

57ae485

Apply agent power changes while busy

4e867d6

Improve agent edit tooling

d447bdb

Fix anchored edit tail matching

a1f30a6

Merge pull request antirez#233 from alantsev/eval-fix

e3efafe

ds4-eval (fix): q13 provides wrong answer

Merge pull request antirez#232 from gmontana/fix/agent-chdir

b230e8e

agent: add --chdir working-directory option

Refine agent tool prompt reminders

925eb4c

Improve agent session management

4a855d8

merge from main@upstream

16e1bae

enable agent for rocm

454f030

simplify Makefile

993267b

ds4-agent experiment - create rocm flavour of a nix build configuration

327e80e

ds4-agent experiment - add multi-platform nix flake

fb1c642

alantsev force-pushed the rocm branch from 3f56004 to 993267b Compare May 24, 2026 12:14

alantsev wants to merge 140 commits into antirez:rocm from alantsev:rocm

13 participants
