xiaguan/README.md

JinYan Su

LLM Serving Infrastructure / Rust + CUDA / KV Cache Systems

I build the systems path between model weights and production tokens.

Inference runtime, decode fast paths, KV cache movement, GPU offload, SSD tiering, RDMA transport, NUMA-aware memory, and vLLM/SGLang/Mooncake integration.

What I am building

I work on the serving substrate for large language models: the layer where CUDA kernels, Rust runtimes, KV cache systems, RDMA transport, and production schedulers meet.

My public work is concentrated in one direction: making LLM serving faster, more observable, and more predictable when the bottleneck is no longer just the model but also memory movement, cache layout, GPU/CPU coordination, and distributed serving behavior.

Public signal

| System | What I push on |
| --- | --- |
| pegainfer | Pure Rust + CUDA inference runtime, Kimi/DeepSeek/Qwen decode paths, PPLX EP, CuTeDSL/cuBLAS prefill kernels, benchmark gates, nsys profiling |
| PegaFlow | KV cache storage for vLLM/SGLang, GPU offloading, SSD caching, RDMA QPs, pinned memory, NUMA placement, cache metrics, vLLM E2E gates |
| Mooncake | Store/transfer engine work, client metrics, RDMA device setup, NUMA binding, SGLang HiCache documentation and integration paths |
| LMCache | Mooncake connector performance, zero-copy get/put, NUMA-aware operations, vLLM scheduler/cache behavior |
| SGLang | HiCache/Mooncake integration, NUMA detection, cache prefetch fixes, serving-path reliability |
| vLLM ecosystem | Scheduler/cache issues, router fixes, connector behavior, large-scale serving ergonomics |

Where I go deep

  • Rust inference runtimes and CUDA-backed model execution
  • Decode hot paths for Kimi, DeepSeek, and Qwen-style serving workloads
  • KV cache transport across GPU memory, CPU pinned memory, SSD, and RDMA
  • NUMA-aware allocation, pinned pool startup, CUDA IPC, and long-tail latency control
  • vLLM/SGLang connector behavior under real cache pressure
  • Benchmarking, profiling, CI gates, and release paths for serving infrastructure

Current stack

Rust / CUDA / C++ / Python / RDMA / vLLM / SGLang / Mooncake / LMCache / PegaFlow

Contact

Pinned

  1. kvcache-ai/Mooncake (Public)

    Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

    C++, 5.4k stars, 791 forks

  2. sgl-project/sglang (Public)

    SGLang is a high-performance serving framework for large language models and multimodal models.

    Python, 28.4k stars, 6.2k forks

  3. LMCache/LMCache (Public)

    LMCache: Supercharge Your LLM with the Fastest KV Cache Layer

    Python, 8.4k stars, 1.2k forks

  4. foyer-rs/foyer (Public)

    Hybrid in-memory and disk cache in Rust

    Rust, 1.7k stars, 85 forks

  5. novitalabs/pegaflow (Public)

    High-performance KV cache storage for LLM inference — GPU offloading, SSD caching, and cross-node sharing via RDMA. Works with vLLM and SGLang.

    Rust, 116 stars, 19 forks

  6. pegainfer (Public)

    Pure Rust + CUDA LLM inference engine

    Rust, 341 stars, 37 forks