Replace the old workspace (compute, executor, p2p, utils) with a single-crate binary featuring embedded GGUF inference via llama-cpp-2, QUIC networking via quinn, HuggingFace model management, and a backpressure-aware worker pool.
…allenge-response

- Add dkn-protocol path dependency; re-export wire types via thin modules
- Remove local copies of protocol types, proof, and chat template (now in dkn-protocol)
- Worker holds per-model engines/templates; TPS is a HashMap&lt;String, f64&gt;
- Implement challenge signing in the main event loop
- Add try_reconnect() with exponential backoff on stream close/error
- Introduce NodeContext for shared state across event handlers
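A minimal std-only sketch of an exponential backoff schedule like the one try_reconnect() uses. The base delay, doubling factor, and 30s cap here are illustrative assumptions; the actual constants are not shown in this log.

```rust
use std::time::Duration;

// Hypothetical schedule: 500ms base, doubling per attempt, capped at 30s.
// The shift amount is clamped so large attempt counts cannot overflow.
fn backoff_delay(attempt: u32) -> Duration {
    let base_ms: u64 = 500;
    let ms = base_ms.saturating_mul(1u64 << attempt.min(16));
    Duration::from_millis(ms.min(30_000))
}

fn main() {
    for attempt in 0..8 {
        println!("attempt {attempt}: wait {:?}", backoff_delay(attempt));
    }
}
```

In the real loop this delay would be slept between reconnect attempts, resetting the attempt counter once a stream is re-established.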
- Replace old models (gemma, llama, mistral) with lfm2.5, qwen3.5, nanbeige, locooperator
- Add a ModelType field to ModelSpec and propagate it through worker/main
- Worker rejects tasks with image/audio content when the model lacks that modality
- Re-export ModelType and MessageContent from dkn-protocol
Allows users to specify a quantization level (e.g. Q8_0, Q5_K_M) instead of always downloading the registry default (Q4_K_M). This avoids redundant downloads when a different quantization is already cached locally via the HuggingFace hub.

Also fix dkn-protocol type mismatches: add the ModelType enum, the MessageContent enum, and a model_type field to ModelRegistryEntry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rguments

Router requires ALPN "dkn" on QUIC connections. Without it, the handshake fails with "peer doesn't support any known protocol". Set ALPN on the client config and on all mock server configs in tests.
Refactor RouterConnection to use mpsc write channel for concurrent sends. Extend Worker with streaming path: run_inference_streaming sends StreamToken messages per token via sync_channel bridge, then StreamEnd/StreamError on completion. Update main.rs to pass stream flag and connection sender to worker.
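The sync_channel bridge described above can be sketched with std only. The StreamMsg enum and function names here are illustrative stand-ins, not the actual dkn wire types; the point is that a bounded channel gives backpressure between the blocking generation thread and the async sender.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-ins for the streaming wire messages (illustrative, not the real types).
#[derive(Debug, PartialEq)]
enum StreamMsg {
    Token(String),
    End,
}

// Generation runs on a blocking thread; a bounded sync_channel applies
// backpressure: the producer blocks once `cap` tokens sit unconsumed.
fn run_inference_streaming(tokens: Vec<&'static str>, cap: usize) -> Vec<StreamMsg> {
    let (tx, rx) = mpsc::sync_channel(cap);
    let producer = thread::spawn(move || {
        for t in tokens {
            // send() blocks while the channel is full.
            tx.send(StreamMsg::Token(t.to_string())).unwrap();
        }
        tx.send(StreamMsg::End).unwrap();
        // tx dropped here; the receiver's iterator then terminates.
    });
    let collected: Vec<StreamMsg> = rx.iter().collect();
    producer.join().unwrap();
    collected
}

fn main() {
    let msgs = run_inference_streaming(vec!["Hello", ",", " world"], 2);
    println!("{msgs:?}");
}
```

In the real worker the consumer side forwards each token to the connection's mpsc write channel instead of collecting into a Vec.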
Add InferenceEngine::apply_template(), which extracts the Jinja2 chat template from GGUF metadata and applies it via llama.cpp's built-in engine. Remove all hardcoded template plumbing: the chat_template field from ModelSpec, the template parameter from Worker/add_engine/run_inference, and the models::template re-export module.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable the llama-cpp-2 mtmd feature, add MtmdContext to InferenceEngine with a generate_multimodal() path, mmproj download/cache, hf_mmproj_file on ModelSpec (populated for lfm2.5-vl and lfm2.5-audio), and multimodal branching in the worker sync/streaming paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add an ignored test that downloads lfm2.5-vl:1.6b plus its mmproj and runs generate_multimodal() with a synthetic BMP, or with a user-provided image via the TEST_IMAGE_PATH env var.
- Add `dria-node setup` interactive command: detects RAM, shows models that fit, downloads the selection, runs a test inference
- Update release workflow: auto-update the homebrew-dkn formula with SHA256s
- Update CI workflow for the v2 branch structure
- Add install scripts for macOS/Linux (curl) and Windows (PowerShell)
- Update README with quick start, model table, and CLI reference
- Bump version to 1.0.0-alpha.1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Bump version from 1.0.0-alpha.2 to 0.7.1
- Replace number input with arrow-key selection (dialoguer)
- Add a RAM-aware 4-bit/8-bit quantization picker
- Retry loop on download/load failure instead of crashing
- Fix qwen3.5:35b-a3b GGUF filename (was a 404)
Checks GitHub releases on startup: patch bumps warn, minor/major bumps download and replace the binary. Includes a --skip-update flag and makeLatest: true in the release workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
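The patch-warns / minor-major-replaces policy above can be sketched as a pure version comparison. This is a simplified std-only model: it parses bare `major.minor.patch` strings (pre-release tags like `-alpha.2` would need extra handling) and the names are illustrative.

```rust
#[derive(Debug, PartialEq)]
enum UpdateAction {
    None,
    Warn,       // patch bump: log a warning, keep running
    SelfUpdate, // minor/major bump: download and replace the binary
}

// Parse "v1.2.3" or "1.2.3" into a comparable tuple; None on anything else.
fn parse(v: &str) -> Option<(u64, u64, u64)> {
    let mut it = v.trim_start_matches('v').splitn(3, '.');
    Some((
        it.next()?.parse().ok()?,
        it.next()?.parse().ok()?,
        it.next()?.parse().ok()?,
    ))
}

fn update_action(current: &str, latest: &str) -> UpdateAction {
    let (Some(c), Some(l)) = (parse(current), parse(latest)) else {
        return UpdateAction::None; // unparseable: do nothing
    };
    if l <= c {
        UpdateAction::None
    } else if l.0 > c.0 || l.1 > c.1 {
        UpdateAction::SelfUpdate
    } else {
        UpdateAction::Warn
    }
}

fn main() {
    println!("{:?}", update_action("0.7.0", "0.7.1"));
}
```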
Pin to rev ce026e3 so CI and other machines can build without a local checkout of dkn-protocol.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… sampler bug

Plumb the OpenAI-compatible response_format parameter through the stack:

- ResponseFormat enum (JsonObject, JsonSchema) from dkn-protocol
- Worker converts response_format to a GBNF grammar via json_schema_to_grammar
- Grammar sampler is inserted first in the chain (masks invalid tokens before sampling)
- Both generate() and generate_multimodal() support grammar constraints

Fix a critical double-accept bug: sample() internally calls accept(), so the explicit sampler.accept() after sample() was advancing the grammar stacks twice, causing a GGML_ASSERT(!stacks.empty()) crash on any grammar-constrained generation.

Switch the dkn-protocol dependency from rev to branch = "main".
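The double-accept bug can be modeled with a toy sampler. The type below is illustrative, not the llama-cpp-2 API; it only shows why an explicit accept() after a sample() that accepts internally advances the grammar state twice per token.

```rust
// Toy grammar sampler: `position` tracks how many tokens the grammar
// believes have been emitted. Drifting ahead of the real token count is
// what eventually empties the grammar stacks in the real crash.
struct GrammarSampler {
    position: usize,
}

impl GrammarSampler {
    fn accept(&mut self) {
        self.position += 1;
    }
    // sample() picks a token AND internally calls accept(), mirroring the
    // behavior described in the fix above.
    fn sample(&mut self) -> u32 {
        let token = self.position as u32;
        self.accept();
        token
    }
}

fn main() {
    let mut buggy = GrammarSampler { position: 0 };
    for _ in 0..3 {
        buggy.sample();
        buggy.accept(); // the redundant explicit accept from the old code
    }
    let mut fixed = GrammarSampler { position: 0 };
    for _ in 0..3 {
        fixed.sample(); // accept() already happens inside sample()
    }
    // After 3 tokens the buggy chain has advanced the grammar 6 steps.
    println!("buggy: {}, fixed: {}", buggy.position, fixed.position);
}
```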
…ampling

generate_multimodal() was setting logits_idx to the sequence position (current_pos) after a single-token decode, but get_logits_ith expects a batch output index. A single-token decode has only one output slot (index 0), so passing e.g. 55 caused a panic. Always use -1 (the C API sentinel for "last logits"), matching the pattern in generate().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace hardcoded n_ctx=2048/4096 with the model-native context window (auto-detected from GGUF metadata via n_ctx_train)
- Add --context-size / DRIA_CONTEXT_SIZE to optionally cap context for limited VRAM (uses the min of the model-native size and the cap)
- Add a pre-flight check in the worker: reject tasks where prompt_tokens + max_tokens exceeds the context before consuming a capacity slot
- Propagate inference errors back to the router via StreamError so it can retry on another node instead of waiting for a timeout

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allocate prompt_tokens + max_tokens per request instead of the full ctx_limit (e.g. 32k). This avoids OOM on machines where the model fits in RAM but a full-size KV cache does not. The ctx_limit ceiling still gates pre-flight rejection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
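The pre-flight gate and per-request sizing combine into one small check. The function below is a sketch with illustrative names, not the actual dkn worker code: it rejects over-budget requests before a capacity slot is taken, and otherwise returns the right-sized KV allocation.

```rust
// Returns the per-request context size (prompt + generation budget), or an
// error if the request cannot fit under the ctx_limit ceiling.
fn preflight(prompt_tokens: u32, max_tokens: u32, ctx_limit: u32) -> Result<u32, String> {
    let needed = prompt_tokens
        .checked_add(max_tokens)
        .ok_or_else(|| "token count overflow".to_string())?;
    if needed > ctx_limit {
        // Rejected before consuming a capacity slot; the router can retry
        // on another node instead of waiting for a timeout.
        return Err(format!(
            "prompt+max_tokens ({needed}) exceeds context limit ({ctx_limit})"
        ));
    }
    // Allocate only what the request needs, not a full ctx_limit KV cache.
    Ok(needed)
}

fn main() {
    println!("{:?}", preflight(1000, 512, 32_768));
}
```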
The KV cache now defaults to Q8_0 quantization instead of F16, roughly halving KV memory with negligible quality loss. Operators can override with --kv-quant (f16, q8_0, q4_0, etc.) or the DRIA_KV_QUANT env var.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LlamaBackend::init() is a global singleton that errors on a second call. Use OnceLock to ensure it is initialized exactly once and shared across all InferenceEngine instances.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
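The OnceLock pattern looks like this (with a stand-in Backend type, since the real LlamaBackend lives in llama-cpp-2): get_or_init runs the fallible init exactly once, and every later caller gets the same 'static reference.

```rust
use std::sync::OnceLock;

// Stand-in for LlamaBackend: init() must not be called twice.
struct Backend;
impl Backend {
    fn init() -> Result<Backend, &'static str> {
        Ok(Backend)
    }
}

static BACKEND: OnceLock<Backend> = OnceLock::new();

// Every engine goes through this accessor; init runs exactly once, even
// when multiple InferenceEngine instances are constructed concurrently.
fn backend() -> &'static Backend {
    BACKEND.get_or_init(|| Backend::init().expect("backend init failed"))
}

fn main() {
    let a = backend();
    let b = backend();
    println!("same instance: {}", std::ptr::eq(a, b));
}
```

One trade-off of this sketch: a failed init panics inside get_or_init rather than propagating an error, which matches "initialize once or abort" semantics.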
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use PowerShell Get-CimInstance instead of the deprecated wmic for RAM detection
- Verify the SHA-256 of cached model files before loading
- Auto-delete and re-download corrupt files instead of failing permanently
with_quant() used rfind('-'), which broke for repos using dot separators (e.g. LocoOperator-4B.Q4_K_M.gguf → LocoOperator-Q8_0.gguf instead of LocoOperator-4B.Q8_0.gguf). Now it finds the quant portion by matching known quant prefixes, preserving the original separator.
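A std-only sketch of the fixed approach: search the filename for a known quant token and splice in the replacement, leaving whatever separator precedes it untouched. The prefix list here is a partial, illustrative set (ordered longest-first so e.g. "F16" cannot match inside "BF16"), and ASCII filenames are assumed so byte offsets from the uppercased copy are valid.

```rust
// Illustrative subset of quant tags, longest/most-specific first.
const QUANT_PREFIXES: &[&str] = &["Q4_K_M", "Q5_K_M", "BF16", "Q8_0", "Q4_0", "Q6_K", "F16"];

fn with_quant(filename: &str, quant: &str) -> String {
    let upper = filename.to_ascii_uppercase();
    for q in QUANT_PREFIXES {
        // Case-insensitive match; the original separator ('-' or '.')
        // sits outside the matched span, so it is preserved as-is.
        if let Some(pos) = upper.find(*q) {
            let mut out = String::with_capacity(filename.len());
            out.push_str(&filename[..pos]);
            out.push_str(quant);
            out.push_str(&filename[pos + q.len()..]);
            return out;
        }
    }
    filename.to_string() // no known quant tag found: leave unchanged
}

fn main() {
    println!("{}", with_quant("LocoOperator-4B.Q4_K_M.gguf", "Q8_0"));
}
```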
- Prefix cached mmproj filenames with the model name to avoid collisions between models sharing the same mmproj filename (e.g. multiple Qwen models all using mmproj-BF16.gguf from different repos)
- Add LLVM/Clang to the Windows prerequisites (needed by bindgen)
- Add a libclang troubleshooting entry
Chunk prompt evaluation into batches of 2048 tokens instead of decoding everything at once, which triggers GGML_ASSERT(n_tokens_all <= n_batch). Applied to both generate() and validate_prefill().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
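The chunking itself is a simple loop over fixed-size slices. In this sketch `decode` is a stand-in for the llama.cpp decode call; the real code builds a batch per chunk, but the slicing logic is the same.

```rust
// Feed the prompt in slices of at most n_batch tokens so no single decode
// call exceeds the batch limit (avoiding GGML's n_tokens_all <= n_batch assert).
fn eval_prompt_chunked(prompt: &[i32], n_batch: usize, decode: &mut impl FnMut(&[i32])) {
    for chunk in prompt.chunks(n_batch) {
        decode(chunk); // each call sees at most n_batch tokens
    }
}

fn main() {
    let prompt: Vec<i32> = (0..5000).collect();
    let mut sizes = Vec::new();
    eval_prompt_chunked(&prompt, 2048, &mut |c| sizes.push(c.len()));
    println!("{sizes:?}"); // [2048, 2048, 904]
}
```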