diff --git a/docs/planning/gpu-compositing-attack-plan.md b/docs/planning/gpu-compositing-attack-plan.md
new file mode 100644
index 00000000..f8a5c537
--- /dev/null
+++ b/docs/planning/gpu-compositing-attack-plan.md
@@ -0,0 +1,719 @@
+# GPU Compositing Attack Plan
+
+## Post-Mortem and Implementation Guide
+
+This document catalogs every attempt made to implement per-window GPU compositing,
+identifies the specific root causes of failure, and provides a step-by-step plan
+to implement it correctly.
+
+---
+
+## 1. What Was Tried: Complete Catalog
+
+### Attempt 1: Pre-Allocated Texture Pool (8 display-sized textures at init)
+
+**What was done:**
+- Created 8 TEXTURE_2D resources (IDs 10-17) at display size (1280x960) during
+  `virgl_init()`, before any SUBMIT_3D
+- Each texture: `RESOURCE_CREATE_3D(TEXTURE_2D, B8G8R8X8_UNORM, BIND_SAMPLER_VIEW|BIND_SCANOUT)`
+- Heap-allocated backing (4.9MB each, 39MB total), paged scatter-gather ATTACH_BACKING
+- Primed each with TRANSFER_TO_HOST_3D
+- `virgl_composite_gpu_batch()` built single SUBMIT_3D:
+  - Pipeline setup: create_sub_ctx(1), set_sub_ctx(1), tweaks, surface(10), blend(11),
+    DSA(12), rasterizer(13), VS(14), FS(15), VE(16), sampler_state(18)
+  - Background quad: sampler_view(17) on COMPOSITE_TEX(res 5), draw_vbo
+  - Per-window quads: sampler_view(40+i) on window tex(res 10+slot), draw_vbo each
+
+**Result:** BLACK screen on first boot. `prlctl capture` still showed black after
+90 seconds. This was later shown to be a capture timing artifact: after 17+ minutes
+the display was working. It was misdiagnosed as "ATTACH_BACKING poisons pipeline."
+
+**What was actually proven:**
+- The display DID work after sufficient time (message 149, 154)
+- 4 window textures with ATTACH_BACKING did NOT poison the pipeline (message 286 agent)
+- Content was visible: bounce spheres, bcheck 23/23, btop, terminal (message 142)
+
+**What went wrong:**
+- Premature conclusion that "ATTACH_BACKING on secondary textures poisons VirGL"
+- `prlctl capture` timing issue misinterpreted as rendering failure
+- 39MB heap allocation caused OOM on some boots (heap exhaustion, message 143)
+- Z-order issue: window content quads drawn at same depth covered window frames
+
+### Attempt 2: Lazy Per-Window Texture Creation
+
+**What was done:**
+- Textures created lazily via `init_window_texture()` when windows register
+- Various bind flag combinations tested:
+  - `BIND_SAMPLER_VIEW` only
+  - `BIND_SAMPLER_VIEW | BIND_SCANOUT`
+  - `BIND_RENDER_TARGET | BIND_SAMPLER_VIEW | BIND_SCANOUT | BIND_SHARED` (0x14000A)
+
+**Result:** Window content BLACK. Background from COMPOSITE_TEX rendered correctly.
+Per-window textured quads rendered as invisible/black.
+
+**What was proven:**
+- When per-window quads sampled from COMPOSITE_TEX (instead of their own texture),
+  they rendered correctly (message 129) -- proving multi-quad, NDC coords, shaders,
+  and UV mapping all work
+- `copy_window_pages_to_backing()` confirmed working -- first_pixel values showed
+  real application content (0x00648CDC, 0x000A0A19, etc.) not zeros (message 489-492)
+- TRANSFER_TO_HOST_3D returned success for per-window textures
+- Test colors (red/green/blue/yellow) injected at `init_window_texture()` time were
+  overwritten by actual app content before Phase A2 ran -- proving data flow works
+
+**Hypothesis that emerged:** "TRANSFER_TO_HOST_3D only works for resources created
+before the first SUBMIT_3D" (message 135). This was tested: a 64x64 test texture
+created during init showed RED. Pool textures created during init all worked. But
+this hypothesis was DISPROVEN by the Linux probe VM tests (message 2707).
+
+### Attempt 3: Interleaved Z-Order Rendering
+
+**What was done:**
+- For each window back-to-front: (a) frame quad from COMPOSITE_TEX at full window
+  bounds, (b) content quad from per-window texture at content area
+- Cursor overlay quad at end (sampling cursor area from COMPOSITE_TEX)
+
+**Result:** Z-order fixed for the pre-allocated pool path where textures worked,
+but the lazily-created textures still showed BLACK content.
+
+### Attempt 4: Background-Only in gpu_batch (Isolation Test)
+
+**What was done:**
+- Disabled all per-window quads in `virgl_composite_gpu_batch()` -- drew ONLY the
+  background quad from COMPOSITE_TEX (message 307)
+- This should produce identical output to `virgl_composite_single_quad()`
+
+**Result:** BLACK even with background only! (message 929)
+
+**Critical finding:** `virgl_composite_single_quad()` worked perfectly, but
+`virgl_composite_gpu_batch()` with IDENTICAL background-only code produced BLACK.
+
+**What was tested to explain this:**
+- Delegation: `gpu_batch()` calling `single_quad()` internally -- result unknown
+  due to build caching (message 2947-2959)
+- Stack overflow: ruled out, 2MB stack vs ~12KB usage (message 2955)
+- Code inlining: attempted to inline single_quad code into gpu_batch body (message 2959)
+- Build caching: cargo builds completing in 0.06s when real recompilation takes
+  5-6s, suggesting stale binaries were deployed (message 3011)
+
+**Root cause: ALMOST CERTAINLY BUILD CACHING**
+
+The 0.06-0.07s build times prove that cargo was not recompiling gpu_pci.rs.
+Multiple test iterations deployed the SAME stale binary with the original
+gpu_batch code, regardless of source changes. This explains why:
+- "Identical" code in gpu_batch still produced BLACK (old binary still running)
+- Delegation to single_quad appeared not to work (old binary still running)
+- Changes to bind flags, handle IDs, etc. had no effect (old binary still running)
+
+### Attempt 5: Multi-Draw Test (Positive Control)
+
+**What was done:**
+- Modified `virgl_composite_single_quad()` to add ONE extra draw_vbo after the
+  fullscreen background quad (message 290)
+- Second quad: top-right corner with flipped UV mapping
+
+**Result:** WORKED (message 293). Both quads visible -- normal background and
+upside-down test rectangle. This proved multiple draw_vbo calls in one SUBMIT_3D
+batch work on Parallels.
+
+### Attempt 6: Revert
+
+All per-window texture code was removed. Reverted to CPU blit via
+`virgl_composite_single_quad()`.
+
+---
+
+## 2. Linux Probe VM Evidence
+
+Eight C test programs were run on the Linux probe VM (Parallels ARM64, Ubuntu 24.04.4,
+kernel 6.8.0, virtio_gpu DRM card1, 3D accel highest).
+
+### Definitive Test: `poison_fixed.c`
+
+1. Create display TEXTURE_2D (B8G8R8X8_UNORM, RT|SV|SCANOUT, 1024x768)
+2. Map + fill with BLUE via CPU + TRANSFER_TO_HOST
+3. Establish DRM scanout (AddFB + SetCrtc)
+4. Create N extra TEXTURE_2D (SV|SCANOUT, 128x128), map, fill, TRANSFER_TO_HOST
+5. VirGL SUBMIT_3D: create_sub_ctx + CLEAR to RED
+6. Re-display (AddFB + SetCrtc)
+
+**Results:**
+- 0 extra textures: RED (pass)
+- 1 extra texture: RED (pass)
+- 8 extra textures: RED (pass)
+- 32 extra textures: RED (pass)
+
+### EGL Multi-Texture Test: `gl_multi_texture_test.c`
+
+Used Mesa's EGL surfaceless + GBM + GLES2 pipeline on `/dev/dri/renderD128`:
+- Created 2 FBO textures, rendered different colors into each
+- Composited both as textured quads onto a third surface
+
+**Result:** Multi-texture VirGL rendering CONFIRMED WORKING on Parallels.
+
+### Key Insight from Linux
+
+Linux's DRM driver uses the exact same protocol: `DRM_IOCTL_VIRTGPU_RESOURCE_CREATE` +
+`DRM_IOCTL_VIRTGPU_MAP` + `DRM_IOCTL_VIRTGPU_TRANSFER_TO_HOST`. For each resource:
+1. RESOURCE_CREATE_3D: target=TEXTURE_2D(2), format=B8G8R8X8_UNORM(2),
+   bind=SV|SCANOUT(0x40008)
+2. ATTACH_BACKING: automatic via GEM BO, per-page scatter-gather
+3. TRANSFER_TO_HOST_3D: box={0,0,0,w,h,1}, level=0
+
+Resources can be created at ANY time -- there is no requirement that they be
+created before the first SUBMIT_3D. The "must create before first SUBMIT_3D"
+hypothesis was a misinterpretation of a build caching artifact.
+
+---
+
+## 3. Identified Root Causes
+
+### Root Cause 1: Build Caching (CONFIRMED -- High Confidence)
+
+**Evidence:** Cargo build times of 0.06-0.07s vs 5-6s for real recompilation.
+Multiple iterations of "change code, rebuild, deploy, test" were deploying the
+exact same stale binary. This made it appear that:
+- Changes to gpu_batch had no effect (stale binary)
+- gpu_batch with "identical" code to single_quad still failed (stale binary)
+- Bind flag changes didn't help (stale binary)
+- Handle changes didn't help (stale binary)
+
+**Why (unconfirmed):** Either cargo's change detection is not picking up edits to
+`gpu_pci.rs` (4262 lines) through the dependency chain, or the Parallels VM
+deployment is reusing a cached disk image.
+
+**Fix:** Always `touch kernel/src/drivers/virtio/gpu_pci.rs` before building.
+Always verify build time is >3 seconds. Always check the `.elf` timestamp.
+Always use `run.sh --parallels` which handles the full pipeline including
+userspace rebuild and fresh VM creation.
+
+### Root Cause 2: Per-Window Texture Sampling (UNRESOLVED -- Needs Investigation)
+
+**What we know:**
+- Per-window quads sampling from COMPOSITE_TEX: WORKS (same batch, same shader)
+- Per-window quads sampling from their own texture: BLACK
+- TRANSFER_TO_HOST_3D returns success for per-window textures
+- Backing data is confirmed non-zero (real app content in first_pixel)
+
+**What we DON'T know (because of build caching):**
+- Whether any of the "fixes" tried would have actually worked if properly deployed
+- Whether the 64x64 test texture created at init worked because of timing or
+  because of size
+- Whether the handle allocation scheme actually collided
+
+**Possible sub-causes (all plausible, none definitively confirmed or eliminated):**
+
+**2a. Missing CTX_ATTACH_RESOURCE for per-window textures**
+
+If `virgl_ctx_attach_resource_cmd()` was not called for per-window texture
+resources, the VirGL context would not have access to them. The sampler_view
+would reference a resource the context cannot see, producing BLACK.
+
+The `init_composite_texture()` function (WORKS) calls:
+```
+virgl_ctx_attach_resource_cmd(state, VIRGL_CTX_ID, RESOURCE_COMPOSITE_TEX_ID)
+```
+
+The per-window `init_window_texture()` code DOES call this. So this is likely
+not the issue, but MUST be verified in the new implementation.
+
+**2b. create_sampler_view format encoding error for per-window textures**
+
+Memory note: "create_sampler_view format MUST include texture target -- bits
+[24:31] must contain PIPE_TEXTURE_2D << 24. Without it, host creates
+BUFFER-targeted sampler view -> black."
+
+The `create_sampler_view` function in virgl.rs correctly encodes:
+```rust
+let fmt_target = (format & 0x00FF_FFFF) | ((target & 0xFF) << 24);
+```
+
+And the call in the batch used `pipe::TEXTURE_2D` for target. This encoding
+is correct. However, if the wrong format constant was passed (e.g., a raw
+number instead of the pipe constant), it would fail silently.
+
+The background sampler_view(17) uses `vfmt::B8G8R8X8_UNORM, pipe::TEXTURE_2D`.
+The per-window sampler_view(40+i) should use the same. If they used a different
+format (e.g., B8G8R8A8_UNORM for the texture resource but B8G8R8X8_UNORM for the
+sampler_view, or vice versa), the host could reject it silently.
+
+**2c. Texture not primed (TRANSFER_TO_HOST_3D before first sample)**
+
+COMPOSITE_TEX is primed during init. Per-window textures ARE primed during
+`init_window_texture()`. But if the priming TRANSFER_TO_HOST_3D was called
+AFTER the first SUBMIT_3D that tried to sample from the texture (due to
+timing), the host might have cached the texture as empty.
+
+On the other hand, per-frame TRANSFER_TO_HOST_3D uploads should override
+this. So this is unlikely but should be verified.
+
+**2d. TRANSFER_TO_HOST_3D stride mismatch**
+
+If the stride (row pitch) passed to TRANSFER_TO_HOST_3D differs from the row
+pitch of the data in the backing buffer, the host reads garbage. For
+COMPOSITE_TEX, stride = `tex_w * 4` (correct). For per-window textures, the
+stride must match how rows were copied into the backing: `win_w * 4` if rows
+are packed at window width (even when the pool texture is display-sized), or
+`pool_w * 4` if rows are laid out at pool width. If the two disagree, the
+upload silently corrupts.
+
+**2e. VirGL batch rejection by a bad command**
+
+If ANY command in a SUBMIT_3D batch is malformed, virglrenderer may reject
+the ENTIRE batch silently. A bad `create_sampler_view` for a per-window
+texture could poison the whole batch, causing even the background quad to
+go BLACK.
+
+This would explain the "even background-only gpu_batch is BLACK" finding --
+IF the gpu_batch code had additional commands (even unreachable ones) that
+were malformed. However, the "background-only" test was supposed to remove
+all per-window commands.
+
+**Since build caching was occurring, we cannot know whether the
+background-only test actually ran the modified code.**
+
+### Root Cause 3: prlctl capture Timing (CONFIRMED)
+
+`prlctl capture` returns BLACK for at least 60-90 seconds after boot with VirGL GPU
+compositing (in Attempt 1 it was still black at 90 seconds). This caused multiple
+false "BLACK screen" diagnoses: the display WAS rendering, but the capture API did not see it.
+
+**Fix:** Always wait at least 90 seconds before capturing. Take multiple captures
+5 seconds apart. Use VNC or direct visual inspection when possible.
+
+---
+
+## 4. Handle Allocation Analysis
+
+### VirGL Object Handles (within SUBMIT_3D, single hash table per sub-context)
+
+These are VirGL object handles -- NOT resource IDs. They share one namespace:
+
+| Handle | Object Type | Used By |
+|--------|------------|---------|
+| 10 | surface | Render target surface on RESOURCE_3D_ID(2) |
+| 11 | blend | Simple blend (dither, RGBA colormask) |
+| 12 | DSA | Default depth-stencil-alpha |
+| 13 | rasterizer | Default rasterizer |
+| 14 | shader (VS) | Texture vertex shader |
+| 15 | shader (FS) | Texture fragment shader |
+| 16 | vertex_elements | 2-attribute VE (pos + texcoord) |
+| 17 | sampler_view | Background sampler view on COMPOSITE_TEX(res 5) |
+| 18 | sampler_state | Nearest filter, clamp-to-edge |
+| 40+i | sampler_view | Per-window sampler view on window tex(res 10+i) |
+
+### GPU Resource IDs (global, outside SUBMIT_3D)
+
+| Resource ID | Type | Purpose |
+|-------------|------|---------|
+| 2 (RESOURCE_3D_ID) | TEXTURE_2D | VirGL render target (scanout) |
+| 3 (RESOURCE_VB_ID) | BUFFER | Vertex buffer (INLINE_WRITE) |
+| 5 (RESOURCE_COMPOSITE_TEX_ID) | TEXTURE_2D | Compositor background texture |
+| 10-17 | TEXTURE_2D | Per-window texture slots |
+
+### Collision Analysis
+
+VirGL object handles (surface, blend, DSA, etc.) live in a SEPARATE namespace
+from GPU resource IDs. Handle 10 (surface) does NOT collide with Resource ID 10
+(window texture). These go through different code paths:
+- Object handles: `create_surface(handle=10, ...)` inside SUBMIT_3D
+- Resource IDs: `virgl_resource_create_3d_cmd(res_id=10, ...)` outside SUBMIT_3D
+
+Within the VirGL object namespace, there are NO collisions in the scheme above:
+- Handles 10-18 for pipeline objects
+- Handles 40+ for per-window sampler_views
+
+**Caveat:** Creating an object with a handle that is already in use makes
+virglrenderer replace the existing object, so a per-window sampler_view handle must
+never match a pipeline object handle. Reusing a handle for the same object type is
+fine: handle 17 (bg sampler_view) is recreated per-frame and again for each
+frame quad in the z-order loop, and each creation replaces the previous view.
+
+### Recommended Handle Scheme for New Implementation
+
+Use explicit, well-separated ranges:
+
+```
+Pipeline objects (created once per batch):
+  100: surface (render target)
+  101: blend
+  102: DSA
+  103: rasterizer
+  104: VS shader
+  105: FS shader
+  106: vertex_elements
+  107: sampler_state
+
+Sampler views (re-created per draw):
+  200: background sampler_view (COMPOSITE_TEX)
+  201+i: per-window sampler_view (window texture i)
+```
+
+Resource IDs (unchanged):
+```
+  2: render target
+  3: vertex buffer
+  5: compositor texture
+  10+i: per-window texture i
+```
+
+---
+
+## 5. Exact VirGL Command Sequence for Multi-Texture Compositing
+
+### Per-Window Texture Resource Setup (outside SUBMIT_3D, during init or window register)
+
+For each window texture slot i (resource ID = 10+i):
+
+```
+1. RESOURCE_CREATE_3D:
+     resource_id = 10 + i
+     target = TEXTURE_2D (2)
+     format = B8G8R8X8_UNORM (2)
+     bind = BIND_SAMPLER_VIEW (0x8) | BIND_SCANOUT (0x40000)
+     width = window_width (or display_width for pool)
+     height = window_height (or display_height for pool)
+     depth = 1, array_size = 1, last_level = 0, nr_samples = 0
+
+2. ATTACH_BACKING (paged scatter-gather):
+     resource_id = 10 + i
+     nr_entries = num_pages
+     entries = [{page_phys_addr, 4096}, ...] for each page
+
+3. CTX_ATTACH_RESOURCE:
+     ctx_id = VIRGL_CTX_ID (1)
+     resource_id = 10 + i
+
+4. TRANSFER_TO_HOST_3D (priming):
+     resource_id = 10 + i
+     box = {0, 0, 0, width, height, 1}
+     level = 0
+     stride = width * 4
+```
+
+### Per-Frame Upload (outside SUBMIT_3D, for each dirty window)
+
+```
+1. Cache clean backing memory (ARM64: DC CIVAC on dirty range)
+2. TRANSFER_TO_HOST_3D:
+     resource_id = 10 + slot
+     box = {0, 0, 0, win_width, win_height, 1}
+     stride = win_width * 4  (or pool_width * 4 if pool-sized backing)
+```
+
+### Per-Frame SUBMIT_3D Batch
+
+```
+// Pipeline setup (same as working virgl_composite_single_quad)
+create_sub_ctx(1)
+set_sub_ctx(1)
+set_tweaks(1, 1)
+set_tweaks(2, display_width)
+create_surface(100, RESOURCE_3D_ID, B8G8R8X8_UNORM, 0, 0)
+set_framebuffer_state(zsurf=0, cbufs=[100])
+create_blend_simple(101)
+bind_object(101, BLEND)
+create_dsa_default(102)
+bind_object(102, DSA)
+create_rasterizer_default(103)
+bind_object(103, RASTERIZER)
+
+// Shaders (num_tokens=300 required by Parallels)
+create_shader(104, VERTEX, 300, TEX_VS_TGSI)
+bind_shader(104, VERTEX)
+create_shader(105, FRAGMENT, 300, TEX_FS_TGSI)
+bind_shader(105, FRAGMENT)
+
+// Vertex elements: 2 attributes (position vec4 + texcoord vec4)
+create_vertex_elements(106, [(0,0,0,R32G32B32A32_FLOAT), (16,0,0,R32G32B32A32_FLOAT)])
+bind_object(106, VERTEX_ELEMENTS)
+
+// Sampler state (shared by all draws)
+create_sampler_state(107, CLAMP_TO_EDGE, CLAMP_TO_EDGE, CLAMP_TO_EDGE,
+                     NEAREST, MIPFILTER_NONE, NEAREST)
+bind_sampler_states(FRAGMENT, 0, [107])
+set_min_samples(1)
+set_viewport(display_width, display_height)
+
+// CLEAR (optional -- background quad will cover entire screen)
+clear_color(0.0, 0.0, 0.0, 1.0)
+
+// === Draw 0: Background quad (fullscreen, from COMPOSITE_TEX) ===
+create_sampler_view(200, RESOURCE_COMPOSITE_TEX_ID, B8G8R8X8_UNORM,
+                    TEXTURE_2D, 0, 0, 0, 0, IDENTITY_SWIZZLE)
+set_sampler_views(FRAGMENT, 0, [200])
+resource_inline_write(VB_RES_ID, 0, 128, bg_quad_verts)  // fullscreen NDC
+set_vertex_buffers([(32, 0, VB_RES_ID)])
+draw_vbo(0, 4, TRIANGLE_FAN, 3)
+
+// === Draw 1..N: Per-window quads (back-to-front z-order) ===
+for each window (back to front):
+    // Frame quad: from COMPOSITE_TEX at window bounds (includes frame/decorations)
+    create_sampler_view(200, RESOURCE_COMPOSITE_TEX_ID, B8G8R8X8_UNORM,
+                        TEXTURE_2D, 0, 0, 0, 0, IDENTITY_SWIZZLE)
+    set_sampler_views(FRAGMENT, 0, [200])
+    resource_inline_write(VB_RES_ID, offset, 128, frame_quad_verts)
+    set_vertex_buffers([(32, 0, VB_RES_ID)])
+    draw_vbo(start, 4, TRIANGLE_FAN, start + 3)  // start = offset / 32
+
+    // Content quad: from per-window texture at content area
+    create_sampler_view(201 + i, window_res_id, B8G8R8X8_UNORM,
+                        TEXTURE_2D, 0, 0, 0, 0, IDENTITY_SWIZZLE)
+    set_sampler_views(FRAGMENT, 0, [201 + i])
+    resource_inline_write(VB_RES_ID, offset, 128, content_quad_verts)
+    set_vertex_buffers([(32, 0, VB_RES_ID)])
+    draw_vbo(start, 4, TRIANGLE_FAN, start + 3)
+
+// === Final: Cursor overlay (from COMPOSITE_TEX cursor area) ===
+create_sampler_view(200, RESOURCE_COMPOSITE_TEX_ID, B8G8R8X8_UNORM,
+                    TEXTURE_2D, 0, 0, 0, 0, IDENTITY_SWIZZLE)
+set_sampler_views(FRAGMENT, 0, [200])
+resource_inline_write(VB_RES_ID, offset, 128, cursor_quad_verts)
+set_vertex_buffers([(32, 0, VB_RES_ID)])
+draw_vbo(start, 4, TRIANGLE_FAN, start + 3)
+```
+
+After SUBMIT_3D:
+```
+SET_SCANOUT(scanout=0, resource=RESOURCE_3D_ID)
+RESOURCE_FLUSH(resource=RESOURCE_3D_ID, rect=0,0,display_w,display_h)
+```
+
+### TGSI Shaders (Proven Working)
+
+Vertex shader:
+```
+VERT
+DCL IN[0]
+DCL IN[1]
+DCL OUT[0], POSITION
+DCL OUT[1], GENERIC[0]
+  0: MOV OUT[0], IN[0]
+  1: MOV OUT[1], IN[1]
+  2: END
+```
+
+Fragment shader:
+```
+FRAG
+PROPERTY FS_COLOR0_WRITES_ALL_CBUFS 1
+DCL IN[0], GENERIC[0], LINEAR
+DCL OUT[0], COLOR
+DCL SAMP[0]
+DCL SVIEW[0], 2D, FLOAT
+  0: TEX OUT[0], IN[0], SAMP[0], 2D
+  1: END
+```
+
+### NDC Coordinate Conversion
+
+Screen pixel (px, py) to NDC:
+```
+ndc_x = (px / display_w) * 2.0 - 1.0
+ndc_y = 1.0 - (py / display_h) * 2.0   // Y flipped
+```
+
+Texture UV for per-window content:
+```
+u_max = win_width / tex_width    // <1.0 if tex is pool-sized
+v_max = win_height / tex_height
+```
+
+Quad vertices (4 verts, TRIANGLE_FAN, each 8 floats = 32 bytes):
+```
+top-left:     (ndc_x0, ndc_y0, 0, 1, u0, v0, 0, 0)
+bottom-left:  (ndc_x0, ndc_y1, 0, 1, u0, v1, 0, 0)
+bottom-right: (ndc_x1, ndc_y1, 0, 1, u1, v1, 0, 0)
+top-right:    (ndc_x1, ndc_y0, 0, 1, u1, v0, 0, 0)
+```
+
+---
+
+## 6. Step-by-Step Implementation Plan
+
+### Phase 0: Eliminate Build Caching (MANDATORY FIRST STEP)
+
+Before any code changes, establish a reliable build+deploy+verify loop:
+
+1. **Add a build canary:** At the top of `virgl_composite_single_quad()`, add:
+   ```rust
+   // Hand-bumped build canary: change BUILD_VERSION with EVERY rebuild.
+   // (env!() only compiles if build.rs actually defines the variable.)
+   const BUILD_VERSION: u32 = 42;
+   static CANARY_PRINTED: AtomicBool = AtomicBool::new(false);
+   if !CANARY_PRINTED.swap(true, Ordering::Relaxed) { // log once, on first call
+       crate::serial_println!("[BUILD] gpu_pci.rs version={}", BUILD_VERSION);
+   }
+   ```
+   Change the version number with EVERY rebuild. If the serial log shows the
+   wrong version number, the build cache is stale.
+
+2. **Always touch before building:**
+   ```bash
+   touch kernel/src/drivers/virtio/gpu_pci.rs
+   ```
+
+3. **Verify build time:** Real recompilation of gpu_pci.rs takes >3 seconds.
+   If cargo finishes in <1 second, the build is cached/stale.
+
+4. **Always use `run.sh --parallels`:** This handles the full pipeline: userspace
+   rebuild, ext2 disk creation, fresh VM, serial log truncation.
+
+5. **Wait 90+ seconds** before `prlctl capture`. Take 3 captures 5s apart.
+
+### Phase 1: Prove Multi-Texture Sampling Works (Minimal Test)
+
+**Goal:** Two textures, two quads, one SUBMIT_3D batch. All in virgl_init().
+
+**Steps:**
+
+1. Create a second test texture (resource ID 20, small: 64x64) during virgl_init,
+   AFTER COMPOSITE_TEX but BEFORE the Step 9 SUBMIT_3D:
+   ```
+   RESOURCE_CREATE_3D(res=20, TEXTURE_2D, B8G8R8X8_UNORM, SV|SCANOUT, 64, 64)
+   ATTACH_BACKING(res=20, paged scatter-gather)
+   CTX_ATTACH_RESOURCE(ctx=1, res=20)
+   Fill backing with solid GREEN (0x0000FF00 in BGRX)
+   cache clean
+   TRANSFER_TO_HOST_3D(res=20, box=0,0,0,64,64,1, stride=256)
+   ```
+
+2. In Step 9 SUBMIT_3D batch, add a second draw_vbo AFTER the existing red quad:
+   ```
+   // Draw 1: existing fullscreen colored quad (constant buffer FS)
+   // Draw 2: small textured quad at top-right from test texture res 20
+   create_sampler_view(200, res=20, B8G8R8X8_UNORM, TEXTURE_2D, ...)
+   set_sampler_views(FRAGMENT, 0, [200])
+   // Switch to texture FS (create_shader + bind_shader)
+   resource_inline_write(VB_RES_ID, 128, 128, small_quad_verts)
+   set_vertex_buffers([(32, 0, VB_RES_ID)])
+   draw_vbo(4, 4, TRIANGLE_FAN, 7)
+   ```
+
+3. **Verification:** Display should show the fullscreen red quad (covering the
+   dark blue CLEAR) AND a small green rectangle at top-right (from the test
+   texture). If the rectangle is BLACK, texture sampling is broken; if it is
+   missing (all red), the draw itself failed. If it's GREEN, proceed.
+
+4. **If BLACK:** Compare the create_sampler_view encoding byte-for-byte with
+   the working COMPOSITE_TEX sampler_view. Check:
+   - Is the format DWORD identical? (bits [0:23] = format, [24:31] = target)
+   - Is CTX_ATTACH_RESOURCE called for resource 20?
+   - Is the TRANSFER_TO_HOST_3D box correct?
+   - Print the raw hex of the SUBMIT_3D buffer and diff the two sampler_view
+     commands
+
+### Phase 2: Per-Window Texture in Production Pipeline
+
+Only proceed after Phase 1 passes.
+
+1. **Add a `virgl_create_window_texture()` function** (not a pool -- one per window):
+   ```rust
+   fn virgl_create_window_texture(
+       slot: usize, width: u32, height: u32
+   ) -> Result<u32, &'static str> {
+       let res_id = 10 + slot as u32;
+       // Same pattern as init_composite_texture:
+       // 1. Heap allocate backing (page-aligned)
+       // 2. RESOURCE_CREATE_3D(res_id, TEXTURE_2D, B8G8R8X8_UNORM, SV|SCANOUT, w, h)
+       // 3. virgl_attach_backing_paged(res_id, ptr, size)
+       // 4. virgl_ctx_attach_resource_cmd(VIRGL_CTX_ID, res_id)
+       // 5. cache clean + TRANSFER_TO_HOST_3D (prime)
+       Ok(res_id)
+   }
+   ```
+
+2. **Call from graphics.rs** when a window registers (lazy init).
+
+3. **Per-frame upload:** In `virgl_composite_windows()`, for each dirty window:
+   - Copy MAP_SHARED pages to contiguous backing (`copy_window_pages_to_backing`)
+   - Cache clean the backing
+   - TRANSFER_TO_HOST_3D
+
+4. **Modify `virgl_composite_single_quad()` to accept window quads:**
+
+   Rather than creating a new gpu_batch function, EXTEND the existing working
+   function. This eliminates the "two functions, one works, one doesn't" problem.
+
+   Add a parameter for optional per-window quads:
+   ```rust
+   fn virgl_composite_single_quad_with_windows(
+       windows: &[WindowQuadInfo],
+   ) -> Result<(), &'static str>
+   ```
+
+   Start with ZERO windows (identical to current single_quad). Add windows
+   one at a time, testing after each addition.
+
+5. **Verification at each step:**
+   - 0 windows: identical to current display (regression test)
+   - 1 window: background + one window content quad (should show window content)
+   - N windows: background + all windows in z-order
+
+### Phase 3: Z-Order Interleaving
+
+Once per-window texture sampling is confirmed working:
+
+1. For each window (back to front):
+   - Draw frame quad from COMPOSITE_TEX (window bounds including title bar + border)
+   - Draw content quad from per-window texture (content area only)
+2. Final cursor overlay quad from COMPOSITE_TEX
+
+### Phase 4: Remove CPU Blit from BWM
+
+1. BWM stops calling `blit_client_pixels()` for windows with GPU textures
+2. BWM only composites background, decorations, and cursor into COMPOSITE_TEX
+3. Per-window content goes directly from MAP_SHARED pages to GPU texture backing
+
+---
+
+## 7. Diagnostic Checklist
+
+When per-window texture sampling produces BLACK, check these in order:
+
+1. **Build canary:** Does the serial log show the expected build version number?
+   If not, STOP -- you are running stale code.
+
+2. **CTX_ATTACH_RESOURCE:** Was `virgl_ctx_attach_resource_cmd(1, res_id)` called
+   for this texture resource? Check serial log for the attach message.
+
+3. **create_sampler_view format DWORD:** Print the raw u32 value of the fmt_target
+   parameter. It MUST be `(B8G8R8X8_UNORM & 0x00FFFFFF) | (TEXTURE_2D << 24)` =
+   `0x02000002`. If it's `0x00000002`, the texture target is missing -> BLACK.
+
+4. **TRANSFER_TO_HOST_3D box:** Print the box parameters. Width and height must
+   match the resource dimensions, not zero. Stride must be `width * 4`.
+
+5. **Backing data:** Print first_pixel of the backing buffer after copy. If it's
+   zero, the copy failed. If it's non-zero, the data is present.
+
+6. **Batch rejection:** If EVEN the background quad is BLACK, a command earlier
+   in the batch is poisoning it. Binary search: comment out the last half of
+   commands, test, narrow down.
+
+7. **prlctl capture timing:** Wait 90 seconds. Take 3 captures. If all 3 are
+   black, it's a real rendering issue. If the 3rd shows content, it's timing.
+
+---
+
+## 8. What NOT to Do
+
+1. **Do NOT create a separate gpu_batch function.** Extend the working
+   `virgl_composite_single_quad()` incrementally. The original failure was
+   partly caused by having two "identical" functions where one worked and
+   one didn't -- a situation made impossible to debug by build caching.
+
+2. **Do NOT pre-allocate a pool of 8 display-sized textures.** Each texture
+   is 4.9MB of heap. 8 textures = 39MB. This caused OOM on some boots.
+   Create textures at actual window dimensions, lazily.
+
+3. **Do NOT conclude "ATTACH_BACKING poisons the pipeline"** without first
+   verifying the build canary. This was a false conclusion caused by stale
+   binaries.
+
+4. **Do NOT change multiple variables at once.** Change one thing, verify the
+   build canary, wait 90 seconds, capture. One variable per test cycle.
+
+5. **Do NOT use `prlctl capture` as the sole verification method.** Also check
+   serial output for SUBMIT_3D success/failure, check for error responses,
+   and use visual inspection via the VM window when possible.
diff --git a/kernel/src/drivers/virtio/gpu_pci.rs b/kernel/src/drivers/virtio/gpu_pci.rs
index b0695b25..4e764435 100644
--- a/kernel/src/drivers/virtio/gpu_pci.rs
+++ b/kernel/src/drivers/virtio/gpu_pci.rs
@@ -444,6 +444,11 @@ const TEST_TEX_BYTES: usize = (TEST_TEX_DIM * TEST_TEX_DIM * 4) as usize;
 
 /// Resource ID for the compositor texture (BWM uploads pixel buffers here)
 const RESOURCE_COMPOSITE_TEX_ID: u32 = 5;
+/// Resource ID for the GPU cursor texture (12x18 arrow, uploaded once at init)
+const RESOURCE_CURSOR_TEX_ID: u32 = 6;
+/// Cursor bitmap dimensions
+const CURSOR_TEX_W: u32 = 12;
+const CURSOR_TEX_H: u32 = 18;
 
 // VirtIO standard feature bits
 const VIRTIO_F_VERSION_1: u64 = 1 << 32;
@@ -557,6 +562,133 @@ static COMPOSITE_TEX_H: AtomicU32 = AtomicU32::new(0);
 /// Whether the compositor texture resource has been initialized.
 static COMPOSITE_TEX_READY: AtomicBool = AtomicBool::new(false);
 
+/// Whether the cursor GPU texture has been initialized.
+static CURSOR_TEX_READY: AtomicBool = AtomicBool::new(false);
+
+// =============================================================================
+// Per-Window GPU Textures
+// =============================================================================
+
+/// Base resource ID for per-window textures. Window slot N gets resource (10 + N).
+const RESOURCE_WIN_TEX_BASE: u32 = 10;
+/// Maximum number of per-window texture slots.
+const MAX_WIN_TEX_SLOTS: usize = 8;
+
+/// Per-slot backing buffer pointer and length.
+static mut WIN_TEX_BACKING: [(*mut u8, usize); MAX_WIN_TEX_SLOTS] = [(core::ptr::null_mut(), 0); MAX_WIN_TEX_SLOTS];
+/// Width/height of each slot's texture.
+static mut WIN_TEX_DIMS: [(u32, u32); MAX_WIN_TEX_SLOTS] = [(0, 0); MAX_WIN_TEX_SLOTS];
+/// Whether each slot has been initialized.
+static mut WIN_TEX_INITIALIZED: [bool; MAX_WIN_TEX_SLOTS] = [false; MAX_WIN_TEX_SLOTS];
+
+/// Create a per-window VirGL texture for GPU compositing.
+///
+/// Same resource creation pattern as COMPOSITE_TEX (proven working):
+/// RESOURCE_CREATE_3D -> ATTACH_BACKING -> CTX_ATTACH -> TRANSFER_TO_HOST_3D
+pub fn create_window_texture(
+    slot: usize,
+    width: u32,
+    height: u32,
+) -> Result<u32, &'static str> {
+    use super::virgl::{format as vfmt, pipe};
+
+    if slot >= MAX_WIN_TEX_SLOTS || width == 0 || height == 0 { // zero-size alloc is UB
+        return Err("window texture slot out of range or zero-sized");
+    }
+
+    let res_id = RESOURCE_WIN_TEX_BASE + slot as u32;
+
+    // Already initialized: return the existing resource (width/height are not re-checked)
+    if unsafe { WIN_TEX_INITIALIZED[slot] } {
+        return Ok(res_id);
+    }
+
+    let tex_size = (width as usize) * (height as usize) * 4;
+    let layout = alloc::alloc::Layout::from_size_align(tex_size, 4096)
+        .map_err(|_| "invalid window texture layout")?;
+    let ptr = unsafe { alloc::alloc::alloc_zeroed(layout) };
+    if ptr.is_null() {
+        return Err("failed to allocate window texture backing");
+    }
+
+    // RESOURCE_CREATE_3D — same bind flags as COMPOSITE_TEX
+    with_device_state(|state| {
+        virgl_resource_create_3d_cmd(
+            state, res_id, pipe::TEXTURE_2D, vfmt::B8G8R8X8_UNORM,
+            pipe::BIND_SAMPLER_VIEW | pipe::BIND_SCANOUT,
+            width, height, 1, 1,
+        )
+    })?;
+
+    // ATTACH_BACKING (paged scatter-gather)
+    with_device_state(|state| {
+        virgl_attach_backing_paged(state, res_id, ptr, tex_size)
+    })?;
+
+    // CTX_ATTACH_RESOURCE
+    with_device_state(|state| {
+        virgl_ctx_attach_resource_cmd(state, VIRGL_CTX_ID, res_id)
+    })?;
+
+    // Prime with TRANSFER_TO_HOST_3D
+    dma_cache_clean(ptr, tex_size);
+    with_device_state(|state| {
+        transfer_to_host_3d(state, res_id, 0, 0, width, height, width * 4)
+    })?;
+
+    unsafe {
+        WIN_TEX_BACKING[slot] = (ptr, tex_size);
+        WIN_TEX_DIMS[slot] = (width, height);
+        WIN_TEX_INITIALIZED[slot] = true;
+    }
+
+    crate::serial_println!(
+        "[virgl-win-tex] Created: slot={} res_id={} {}x{} ({}B)",
+        slot, res_id, width, height, tex_size
+    );
+    Ok(res_id)
+}
+
+/// Upload dirty window pixels to GPU texture via TRANSFER_TO_HOST_3D.
+/// Copies scattered MAP_SHARED pages to contiguous backing, then uploads.
+fn upload_window_texture(
+    slot: usize,
+    width: u32,
+    height: u32,
+    page_phys_addrs: &[u64],
+    total_size: usize,
+) -> Result<(), &'static str> {
+    if slot >= MAX_WIN_TEX_SLOTS { return Err("slot out of range"); }
+    let (backing_ptr, backing_len) = unsafe { WIN_TEX_BACKING[slot] };
+    if backing_ptr.is_null() { return Err("backing not allocated"); }
+
+    let win_bytes = (width as usize) * (height as usize) * 4;
+    let copy_len = win_bytes.min(total_size).min(backing_len);
+
+    // Copy scattered pages to contiguous backing.
+    // page_phys_addrs contains PHYSICAL addresses — convert to kernel virtual.
+    let mut copied = 0usize;
+    for &page_phys in page_phys_addrs {
+        if copied >= copy_len { break; }
+        let chunk = 4096usize.min(copy_len - copied);
+        let virt = phys_to_kern_virt(page_phys);
+        unsafe {
+            core::ptr::copy_nonoverlapping(
+                virt as *const u8,
+                backing_ptr.add(copied),
+                chunk,
+            );
+        }
+        copied += chunk;
+    }
+
+    let res_id = RESOURCE_WIN_TEX_BASE + slot as u32;
+    dma_cache_clean(backing_ptr, copy_len);
+    with_device_state(|state| {
+        transfer_to_host_3d(state, res_id, 0, 0, width, height, width * 4)
+    })
+}
+
 /// Allocate and initialize the compositor texture resource for GPU compositing.
 /// Creates a TEXTURE_2D resource with SAMPLER_VIEW bind, attaches heap-allocated
 /// backing, and primes it with TRANSFER_TO_HOST_3D.
@@ -615,48 +747,149 @@ fn init_composite_texture(width: u32, height: u32) -> Result<(), &'static str> {
     COMPOSITE_TEX_READY.store(true, Ordering::Release);
     crate::serial_println!("[virgl-composite] Texture resource initialized (id={})", RESOURCE_COMPOSITE_TEX_ID);
 
-    // ── Pre-allocate per-window texture pool ──
-    // Parallels requires resources to be created BEFORE the first SUBMIT_3D.
-    // Resources created after SUBMIT_3D has been called don't get their
-    // TRANSFER_TO_HOST_3D data. Pre-allocate all slots now with display-sized
-    // backing so they're ready when windows appear.
-    let pool_w = width;
-    let pool_h = height;
-    let pool_size = (pool_w as usize) * (pool_h as usize) * 4;
-    let mut pool_count = 0usize;
+    // Pre-allocate per-window texture pool at init time.
+    // On Parallels, TRANSFER_TO_HOST_3D data only lands for resources created before the first SUBMIT_3D.
     for slot in 0..MAX_WIN_TEX_SLOTS {
+        let max_w: u32 = 1024;
+        let max_h: u32 = 768;
+        let tex_size = (max_w as usize) * (max_h as usize) * 4;
         let res_id = RESOURCE_WIN_TEX_BASE + slot as u32;
-        let layout = alloc::alloc::Layout::from_size_align(pool_size, 4096)
-            .map_err(|_| "win texture pool: layout error")?;
+
+        let layout = alloc::alloc::Layout::from_size_align(tex_size, 4096)
+            .map_err(|_| "invalid pre-alloc texture layout")?;
         let ptr = unsafe { alloc::alloc::alloc_zeroed(layout) };
         if ptr.is_null() {
-            crate::serial_println!("[virgl-pool] slot {} alloc failed, pool stopped at {}", slot, slot);
-            break;
+            return Err("failed to allocate pre-alloc texture backing");
+        }
+
+        // Fill slot 0 with red for testing
+        if slot == 0 {
+            unsafe {
+                let px = ptr as *mut u32;
+                let count = (max_w as usize) * (max_h as usize);
+                for i in 0..count { *px.add(i) = 0x00FF0000; } // B8G8R8X8 as little-endian u32 is 0xXXRRGGBB: red
+            }
         }
 
         with_device_state(|state| {
-            virgl_resource_create_3d_cmd(
-                state, res_id, pipe::TEXTURE_2D, vfmt::B8G8R8X8_UNORM,
-                pipe::BIND_SAMPLER_VIEW | pipe::BIND_SCANOUT,
-                pool_w, pool_h, 1, 1,
-            )
+            virgl_resource_create_3d_cmd(state, res_id, pipe::TEXTURE_2D, vfmt::B8G8R8X8_UNORM,
+                pipe::BIND_SAMPLER_VIEW | pipe::BIND_SCANOUT, max_w, max_h, 1, 1)
         })?;
         with_device_state(|state| {
-            virgl_attach_backing_paged(state, res_id, ptr, pool_size)
+            virgl_attach_backing_paged(state, res_id, ptr, tex_size)
         })?;
         with_device_state(|state| {
             virgl_ctx_attach_resource_cmd(state, VIRGL_CTX_ID, res_id)
         })?;
-        dma_cache_clean(ptr, pool_size);
+        dma_cache_clean(ptr, tex_size);
         with_device_state(|state| {
-            transfer_to_host_3d(state, res_id, 0, 0, pool_w, pool_h, pool_w * 4)
+            transfer_to_host_3d(state, res_id, 0, 0, max_w, max_h, max_w * 4)
         })?;
 
-        unsafe { WIN_TEX_BACKING[slot] = (ptr, pool_size); }
-        pool_count += 1;
+        unsafe {
+            WIN_TEX_BACKING[slot] = (ptr, tex_size);
+            WIN_TEX_DIMS[slot] = (max_w, max_h);
+            WIN_TEX_INITIALIZED[slot] = true;
+        }
+        crate::serial_println!("[virgl-pool] Pre-allocated slot={} res_id={} {}x{}", slot, res_id, max_w, max_h);
     }
-    crate::serial_println!("[virgl-pool] Pre-allocated {}/{} window texture slots ({}x{}, {}KB each)",
-        pool_count, MAX_WIN_TEX_SLOTS, pool_w, pool_h, pool_size / 1024);
+
+    // Initialize cursor GPU texture (12x18 arrow bitmap, uploaded once)
+    init_cursor_texture()?;
+
+    Ok(())
+}
+
+/// Initialize a small GPU texture containing the cursor arrow bitmap.
+///
+/// The cursor is rendered as a GPU quad in `virgl_composite_single_quad()`,
+/// sampling from this texture. This avoids stamping the cursor into COMPOSITE_TEX
+/// (which caused ghost trails when the saved background was stale).
+fn init_cursor_texture() -> Result<(), &'static str> {
+    use super::virgl::{format as vfmt, pipe};
+
+    // Arrow cursor bitmap: 1=white, 2=black outline, 0=transparent (12x18)
+    const CURSOR_BITMAP: [[u8; 12]; 18] = [
+        [2,0,0,0,0,0,0,0,0,0,0,0],
+        [2,2,0,0,0,0,0,0,0,0,0,0],
+        [2,1,2,0,0,0,0,0,0,0,0,0],
+        [2,1,1,2,0,0,0,0,0,0,0,0],
+        [2,1,1,1,2,0,0,0,0,0,0,0],
+        [2,1,1,1,1,2,0,0,0,0,0,0],
+        [2,1,1,1,1,1,2,0,0,0,0,0],
+        [2,1,1,1,1,1,1,2,0,0,0,0],
+        [2,1,1,1,1,1,1,1,2,0,0,0],
+        [2,1,1,1,1,1,1,1,1,2,0,0],
+        [2,1,1,1,1,1,1,1,1,1,2,0],
+        [2,1,1,1,1,1,2,2,2,2,2,0],
+        [2,1,1,1,1,2,0,0,0,0,0,0],
+        [2,1,1,2,1,1,2,0,0,0,0,0],
+        [2,1,2,0,2,1,1,2,0,0,0,0],
+        [2,2,0,0,2,1,1,2,0,0,0,0],
+        [2,0,0,0,0,2,1,2,0,0,0,0],
+        [0,0,0,0,0,2,2,0,0,0,0,0],
+    ];
+
+    let w = CURSOR_TEX_W;
+    let h = CURSOR_TEX_H;
+    let size = (w as usize) * (h as usize) * 4;
+
+    // Allocate page-aligned backing (single 4KB page is sufficient for 12*18*4=864 bytes)
+    let layout = alloc::alloc::Layout::from_size_align(4096, 4096)
+        .map_err(|_| "invalid cursor texture layout")?;
+    let ptr = unsafe { alloc::alloc::alloc_zeroed(layout) };
+    if ptr.is_null() {
+        return Err("failed to allocate cursor texture backing");
+    }
+
+    // Rasterize cursor bitmap into BGRA pixels.
+    // Transparent pixels (0) are fully transparent black (alpha=0).
+    // White fill (1) and black outline (2) are fully opaque (alpha=0xFF).
+    unsafe {
+        let pixels = ptr as *mut u32;
+        for row in 0..h as usize {
+            for col in 0..w as usize {
+                let idx = row * w as usize + col;
+                *pixels.add(idx) = match CURSOR_BITMAP[row][col] {
+                    1 => 0xFF_FF_FF_FF, // white (B8G8R8A8: BGRA = FF,FF,FF,FF)
+                    2 => 0xFF_00_00_00, // black with alpha=FF
+                    _ => 0x00_00_00_00, // transparent (alpha=0)
+                };
+            }
+        }
+    }
+
+    // Create texture resource (SAMPLER_VIEW only, never used as render target)
+    with_device_state(|state| {
+        virgl_resource_create_3d_cmd(
+            state,
+            RESOURCE_CURSOR_TEX_ID,
+            pipe::TEXTURE_2D,
+            vfmt::B8G8R8A8_UNORM,
+            pipe::BIND_SAMPLER_VIEW,
+            w, h, 1, 1,
+        )
+    })?;
+
+    // Attach backing memory
+    with_device_state(|state| {
+        virgl_attach_backing_paged(state, RESOURCE_CURSOR_TEX_ID, ptr, 4096)
+    })?;
+
+    // Attach to VirGL context
+    with_device_state(|state| {
+        virgl_ctx_attach_resource_cmd(state, VIRGL_CTX_ID, RESOURCE_CURSOR_TEX_ID)
+    })?;
+
+    // Prime with TRANSFER_TO_HOST_3D
+    dma_cache_clean(ptr, size);
+    with_device_state(|state| {
+        transfer_to_host_3d(state, RESOURCE_CURSOR_TEX_ID, 0, 0, w, h, w * 4)
+    })?;
+
+    CURSOR_TEX_READY.store(true, Ordering::Release);
+    crate::serial_println!("[virgl-cursor] Cursor texture initialized (id={}, {}x{})",
+        RESOURCE_CURSOR_TEX_ID, w, h);
 
     Ok(())
 }
@@ -711,6 +944,14 @@ fn virt_to_phys(addr: u64) -> u64 {
     }
 }
 
+/// Inverse of virt_to_phys: convert a physical address to a kernel-accessible
+/// virtual address by adding physical_memory_offset (the HHDM base).
+#[inline(always)]
+fn phys_to_kern_virt(phys: u64) -> u64 {
+    let offset = crate::memory::physical_memory_offset().as_u64();
+    offset + phys
+}
+
 /// Clean (flush) a range of memory from CPU caches to physical RAM.
 ///
 /// On ARM64, CPU writes to WB-cacheable BSS memory stay in L1/L2 cache.
@@ -2269,96 +2510,6 @@ fn virgl_attach_backing_from_pages(
     Ok(())
 }
 
-/// Base resource ID for per-window VirGL textures. Window slot N → resource (10 + N).
-const RESOURCE_WIN_TEX_BASE: u32 = 10;
-const MAX_WIN_TEX_SLOTS: usize = 8;
-
-/// Per-window contiguous backing buffers for VirGL textures.
-/// Parallels requires contiguous physical backing for TRANSFER_TO_HOST_3D to work.
-/// We allocate a contiguous heap buffer per window and copy MAP_SHARED pixels into it
-/// before uploading.
-static mut WIN_TEX_BACKING: [(*mut u8, usize); MAX_WIN_TEX_SLOTS] =
-    [(core::ptr::null_mut(), 0); MAX_WIN_TEX_SLOTS];
-
-/// Initialize a VirGL TEXTURE_2D resource for a window buffer.
-///
-/// Creates the 3D resource, attaches CONTIGUOUS heap-allocated backing,
-/// attaches to the VirGL context, and primes with TRANSFER_TO_HOST_3D.
-/// Returns the resource ID on success.
-pub fn init_window_texture(
-    slot_index: usize,
-    width: u32,
-    height: u32,
-    _page_phys_addrs: &[u64],
-    _total_len: usize,
-) -> Result<u32, &'static str> {
-
-    if slot_index >= MAX_WIN_TEX_SLOTS {
-        return Err("init_window_texture: slot_index out of range");
-    }
-
-    let resource_id = RESOURCE_WIN_TEX_BASE + slot_index as u32;
-
-    // Pool was pre-allocated at init time (before first SUBMIT_3D).
-    // Just verify the slot exists and return the resource ID.
-    let (existing_ptr, existing_len) = unsafe { WIN_TEX_BACKING[slot_index] };
-    if existing_ptr.is_null() || existing_len == 0 {
-        return Err("init_window_texture: slot not pre-allocated");
-    }
-
-    crate::serial_println!(
-        "[virgl-win] init_window_texture: slot={} using pre-allocated res={} ({}x{}, backing={:#x})",
-        slot_index, resource_id, width, height, existing_ptr as u64
-    );
-    Ok(resource_id)
-}
-
-/// Blit window content from MAP_SHARED pages directly into COMPOSITE_TEX at (x, y).
-/// This composites window pixels into the single compositor texture, giving correct
-/// z-order when called bottom-to-top. The cursor is drawn AFTER this, so it appears on top.
-fn blit_window_to_compositor(
-    win_x: u32, win_y: u32,
-    win_w: u32, win_h: u32,
-    page_phys_addrs: &[u64],
-    tex_w: u32, tex_h: u32,
-) {
-    let phys_offset = crate::memory::physical_memory_offset().as_u64();
-    let row_bytes = (win_w as usize) * 4;
-    let tex_stride = (tex_w as usize) * 4;
-    let tex_ptr = unsafe { COMPOSITE_TEX_PTR };
-
-    for row in 0..win_h as usize {
-        let dst_y = (win_y as usize) + row;
-        if dst_y >= tex_h as usize { break; }
-        let dst_x = win_x as usize;
-        let copy_w = (win_w as usize).min((tex_w as usize).saturating_sub(dst_x));
-        if copy_w == 0 { continue; }
-        let copy_bytes = copy_w * 4;
-
-        let src_offset = row * row_bytes;
-        let dst_offset = dst_y * tex_stride + dst_x * 4;
-
-        // Copy from scattered pages, handling page boundaries
-        let mut copied = 0usize;
-        while copied < copy_bytes {
-            let linear_pos = src_offset + copied;
-            let page_idx = linear_pos / 4096;
-            let page_off = linear_pos % 4096;
-            if page_idx >= page_phys_addrs.len() { break; }
-            let chunk = (4096 - page_off).min(copy_bytes - copied);
-            let src_ptr = (phys_offset + page_phys_addrs[page_idx] + page_off as u64) as *const u8;
-            unsafe {
-                core::ptr::copy_nonoverlapping(
-                    src_ptr,
-                    tex_ptr.add(dst_offset + copied),
-                    chunk,
-                );
-            }
-            copied += chunk;
-        }
-    }
-}
-
 /// Flush a specific resource to the display (SET_SCANOUT must point at it first).
 fn resource_flush_3d(state: &mut GpuPciDeviceState, resource_id: u32) -> Result<(), &'static str> {
     unsafe {
@@ -3411,16 +3562,35 @@ pub fn virgl_composite_frame_textured(
     Ok(())
 }
 
-/// Build and submit a single fullscreen textured quad from COMPOSITE_TEX.
+/// Render the composited frame: fullscreen background from COMPOSITE_TEX,
+/// then per-window content quads from individual GPU textures.
 ///
-/// COMPOSITE_TEX already contains the fully-composited frame: background, window
-/// frames/decorations, window content (blitted in z-order), and cursor.
-fn virgl_composite_single_quad() -> Result<(), &'static str> {
+/// COMPOSITE_TEX contains background, window frames/decorations.
+/// Per-window textures contain the actual window content (pixels from clients).
+/// The cursor is rendered as a GPU quad from a dedicated cursor texture (last draw).
+/// When a window has no GPU texture, its content was already blitted into
+/// COMPOSITE_TEX by BWM, so the background quad covers it.
+fn virgl_composite_single_quad(
+    windows: &[crate::syscall::graphics::WindowCompositeInfo],
+) -> Result<(), &'static str> {
     use super::virgl::{CommandBuffer, format as vfmt, pipe, swizzle};
 
+    // BWM frame decoration constants (must match bwm.rs)
+    const TITLE_BAR_HEIGHT: i32 = 32;
+    const BORDER_WIDTH: i32 = 2;
+
+    // ── Build canary — detect stale binary deployment ──
+    static FRAME_COUNT: AtomicU32 = AtomicU32::new(0);
+    let frame = FRAME_COUNT.fetch_add(1, Ordering::Relaxed);
+    if frame == 0 {
+        crate::serial_println!("[BUILD-CANARY] gpu_pci.rs version=8 gpu-cursor-quad");
+    }
+
     let tex_w = COMPOSITE_TEX_W.load(Ordering::Relaxed);
     let tex_h = COMPOSITE_TEX_H.load(Ordering::Relaxed);
     let (display_w, display_h) = dimensions().ok_or("GPU not initialized")?;
+    let dw = display_w as f32;
+    let dh = display_h as f32;
 
     let mut cmdbuf = CommandBuffer::new();
     cmdbuf.create_sub_ctx(1);
@@ -3457,25 +3627,182 @@ fn virgl_composite_single_quad() -> Result<(), &'static str> {
     cmdbuf.set_min_samples(1);
     cmdbuf.set_viewport(display_w as f32, display_h as f32);
 
+    // ── Create sampler views for all textures upfront ──
+    // Handle 17: COMPOSITE_TEX (background + frames + decorations)
     cmdbuf.create_sampler_view(17, RESOURCE_COMPOSITE_TEX_ID, vfmt::B8G8R8X8_UNORM,
         pipe::TEXTURE_2D, 0, 0, 0, 0, swizzle::IDENTITY);
-    cmdbuf.set_sampler_views(pipe::SHADER_FRAGMENT, 0, &[17]);
+    // Handles 40+i: per-window textures
+    for (i, win) in windows.iter().enumerate() {
+        if !win.virgl_initialized || win.virgl_resource_id == 0 { continue; }
+        let sv_handle = 40 + i as u32;
+        cmdbuf.create_sampler_view(sv_handle, win.virgl_resource_id, vfmt::B8G8R8X8_UNORM,
+            pipe::TEXTURE_2D, 0, 0, 0, 0, swizzle::IDENTITY);
+    }
 
     let u_max = (tex_w.min(display_w) as f32) / (tex_w as f32);
     let v_max = (tex_h.min(display_h) as f32) / (tex_h as f32);
-    let bg_verts: [u32; 32] = [
-        (-1.0f32).to_bits(), (1.0f32).to_bits(), 0f32.to_bits(), 1.0f32.to_bits(),
-        0f32.to_bits(), 0f32.to_bits(), 0f32.to_bits(), 0f32.to_bits(),
-        (-1.0f32).to_bits(), (-1.0f32).to_bits(), 0f32.to_bits(), 1.0f32.to_bits(),
-        0f32.to_bits(), v_max.to_bits(), 0f32.to_bits(), 0f32.to_bits(),
-        1.0f32.to_bits(), (-1.0f32).to_bits(), 0f32.to_bits(), 1.0f32.to_bits(),
-        u_max.to_bits(), v_max.to_bits(), 0f32.to_bits(), 0f32.to_bits(),
-        1.0f32.to_bits(), (1.0f32).to_bits(), 0f32.to_bits(), 1.0f32.to_bits(),
-        u_max.to_bits(), 0f32.to_bits(), 0f32.to_bits(), 0f32.to_bits(),
-    ];
-    cmdbuf.resource_inline_write(RESOURCE_VB_ID, 0, 128, &bg_verts);
-    cmdbuf.set_vertex_buffers(&[(32, 0, RESOURCE_VB_ID)]);
-    cmdbuf.draw_vbo(0, 4, pipe::PRIM_TRIANGLE_FAN, 3);
+
+    // Helper: build a textured quad's 4 vertices (TRIANGLE_FAN) from pixel coords + UV
+    let make_quad = |px0: f32, py0: f32, px1: f32, py1: f32,
+                     u0: f32, v0: f32, u1: f32, v1: f32| -> [u32; 32] {
+        let nx0 = px0 / dw * 2.0 - 1.0;
+        let ny0 = 1.0 - py0 / dh * 2.0;
+        let nx1 = px1 / dw * 2.0 - 1.0;
+        let ny1 = 1.0 - py1 / dh * 2.0;
+        [
+            nx0.to_bits(), ny0.to_bits(), 0f32.to_bits(), 1.0f32.to_bits(),
+            u0.to_bits(),  v0.to_bits(),  0f32.to_bits(), 0f32.to_bits(),
+            nx0.to_bits(), ny1.to_bits(), 0f32.to_bits(), 1.0f32.to_bits(),
+            u0.to_bits(),  v1.to_bits(),  0f32.to_bits(), 0f32.to_bits(),
+            nx1.to_bits(), ny1.to_bits(), 0f32.to_bits(), 1.0f32.to_bits(),
+            u1.to_bits(),  v1.to_bits(),  0f32.to_bits(), 0f32.to_bits(),
+            nx1.to_bits(), ny0.to_bits(), 0f32.to_bits(), 1.0f32.to_bits(),
+            u1.to_bits(),  v0.to_bits(),  0f32.to_bits(), 0f32.to_bits(),
+        ]
+    };
+
+    // Helper: emit a textured quad draw (inline write + draw_vbo)
+    let mut vb_offset: u32 = 0;
+    let mut draw_idx: u32 = 0;
+    let emit_quad = |cmdbuf: &mut CommandBuffer, verts: &[u32; 32],
+                          vb_off: &mut u32, di: &mut u32| {
+        cmdbuf.resource_inline_write(RESOURCE_VB_ID, *vb_off, 128, verts);
+        cmdbuf.set_vertex_buffers(&[(32, 0, RESOURCE_VB_ID)]);
+        cmdbuf.draw_vbo(*di, 4, pipe::PRIM_TRIANGLE_FAN, 3);
+        *vb_off += 128;
+        *di += 4;
+    };
+
+    // ── Draw 0: Fullscreen background quad from COMPOSITE_TEX ──
+    // Contains background, window frames/decorations, taskbar, appbar.
+    cmdbuf.set_sampler_views(pipe::SHADER_FRAGMENT, 0, &[17]);
+    let bg_verts = make_quad(0.0, 0.0, dw, dh, 0.0, 0.0, u_max, v_max);
+    emit_quad(&mut cmdbuf, &bg_verts, &mut vb_offset, &mut draw_idx);
+
+    // ── Per-window interleaved draws (back to front for correct z-order) ──
+    // For each window:
+    //   1. Content quad from per-window texture (covers content area)
+    //   2. Title bar strip from COMPOSITE_TEX (covers title bar area on top of content)
+    //   3. Left/right/bottom border strips from COMPOSITE_TEX
+    // Front windows draw last, naturally covering back windows.
+    let tw = tex_w as f32;
+    let th = tex_h as f32;
+
+    for (i, win) in windows.iter().enumerate() {
+        if !win.virgl_initialized || win.virgl_resource_id == 0 { continue; }
+
+        // Window content position (set by BWM via set_window_position)
+        let cx = win.x as f32;
+        let cy = win.y as f32;
+        let cw = win.width as f32;
+        let ch = win.height as f32;
+
+        // Frame bounds (content is inset by BORDER_WIDTH left/right/bottom and TITLE_BAR_HEIGHT top)
+        let fx0 = cx - BORDER_WIDTH as f32;
+        let fy0 = cy - TITLE_BAR_HEIGHT as f32;
+        let fx1 = cx + cw + BORDER_WIDTH as f32;
+        let fy1 = cy + ch + BORDER_WIDTH as f32;
+
+        // 1. Content quad from per-window texture
+        let slot = (win.virgl_resource_id as usize).saturating_sub(RESOURCE_WIN_TEX_BASE as usize);
+        let (tex_alloc_w, tex_alloc_h) = if slot < MAX_WIN_TEX_SLOTS {
+            unsafe { WIN_TEX_DIMS[slot] }
+        } else {
+            (win.width, win.height)
+        };
+        let wu = win.width as f32 / tex_alloc_w as f32;
+        let wv = win.height as f32 / tex_alloc_h as f32;
+
+        let sv_handle = 40 + i as u32;
+        cmdbuf.set_sampler_views(pipe::SHADER_FRAGMENT, 0, &[sv_handle]);
+        let content_verts = make_quad(cx, cy, cx + cw, cy + ch, 0.0, 0.0, wu, wv);
+        emit_quad(&mut cmdbuf, &content_verts, &mut vb_offset, &mut draw_idx);
+
+        // 2. Frame strips from COMPOSITE_TEX (drawn ON TOP of content for z-order)
+        cmdbuf.set_sampler_views(pipe::SHADER_FRAGMENT, 0, &[17]);
+
+        // Title bar: full width of frame, from frame top to content top
+        if fy0 < cy {
+            let tu0 = fx0.max(0.0) / tw;
+            let tv0 = fy0.max(0.0) / th;
+            let tu1 = fx1.min(dw) / tw;
+            let tv1 = cy.min(dh) / th;
+            let title_verts = make_quad(fx0.max(0.0), fy0.max(0.0), fx1.min(dw), cy.min(dh),
+                                        tu0, tv0, tu1, tv1);
+            emit_quad(&mut cmdbuf, &title_verts, &mut vb_offset, &mut draw_idx);
+        }
+        // Left border: from content top to frame bottom
+        if fx0 < cx {
+            let lu0 = fx0.max(0.0) / tw;
+            let lv0 = cy.max(0.0) / th;
+            let lu1 = cx / tw;
+            let lv1 = fy1.min(dh) / th;
+            let left_verts = make_quad(fx0.max(0.0), cy.max(0.0), cx, fy1.min(dh),
+                                       lu0, lv0, lu1, lv1);
+            emit_quad(&mut cmdbuf, &left_verts, &mut vb_offset, &mut draw_idx);
+        }
+        // Right border: from content top to frame bottom
+        if fx1 > cx + cw {
+            let ru0 = (cx + cw) / tw;
+            let rv0 = cy.max(0.0) / th;
+            let ru1 = fx1.min(dw) / tw;
+            let rv1 = fy1.min(dh) / th;
+            let right_verts = make_quad(cx + cw, cy.max(0.0), fx1.min(dw), fy1.min(dh),
+                                        ru0, rv0, ru1, rv1);
+            emit_quad(&mut cmdbuf, &right_verts, &mut vb_offset, &mut draw_idx);
+        }
+        // Bottom border: between left and right borders
+        if fy1 > cy + ch {
+            let bu0 = cx / tw;
+            let bv0 = (cy + ch) / th;
+            let bu1 = (cx + cw) / tw;
+            let bv1 = fy1.min(dh) / th;
+            let bot_verts = make_quad(cx, cy + ch, cx + cw, fy1.min(dh),
+                                      bu0, bv0, bu1, bv1);
+            emit_quad(&mut cmdbuf, &bot_verts, &mut vb_offset, &mut draw_idx);
+        }
+
+        if frame < 3 {
+            crate::serial_println!(
+                "[GPU-WIN] frame={} win[{}] res={} content=({},{})-({}x{}) frame=({:.0},{:.0})-({:.0},{:.0})",
+                frame, i, win.virgl_resource_id, win.x, win.y, win.width, win.height,
+                fx0, fy0, fx1, fy1
+            );
+        }
+    }
+
+    // ── Draw cursor as GPU quad (rendered LAST, on top of everything) ──
+    // The cursor lives in a dedicated GPU texture (RESOURCE_CURSOR_TEX_ID, 12x18,
+    // B8G8R8A8_UNORM with alpha for transparency). Alpha blending ensures transparent
+    // pixels don't overwrite the content underneath.
+    if CURSOR_TEX_READY.load(Ordering::Acquire) {
+        let (mouse_x, mouse_y) = if crate::drivers::virtio::input_mmio::is_tablet_initialized() {
+            crate::drivers::virtio::input_mmio::mouse_position()
+        } else {
+            crate::drivers::usb::hid::mouse_position()
+        };
+        let mx = mouse_x as f32;
+        let my = mouse_y as f32;
+        let cw = CURSOR_TEX_W as f32;
+        let ch = CURSOR_TEX_H as f32;
+
+        // Only draw if cursor is within display bounds
+        if mx >= 0.0 && my >= 0.0 && mx < dw && my < dh {
+            // Switch to alpha-blending blend state for cursor transparency
+            cmdbuf.create_blend_alpha(19);
+            cmdbuf.bind_object(19, super::virgl::OBJ_BLEND);
+
+            // Create sampler view for cursor texture (B8G8R8A8_UNORM with alpha)
+            cmdbuf.create_sampler_view(20, RESOURCE_CURSOR_TEX_ID, vfmt::B8G8R8A8_UNORM,
+                pipe::TEXTURE_2D, 0, 0, 0, 0, swizzle::IDENTITY);
+            cmdbuf.set_sampler_views(pipe::SHADER_FRAGMENT, 0, &[20]);
+
+            // Cursor quad at (mx, my); UVs scale with the edge clamp so the bitmap is clipped, not squashed
+            let (cx1, cy1) = ((mx + cw).min(dw), (my + ch).min(dh));
+            let cursor_verts = make_quad(mx, my, cx1, cy1, 0.0, 0.0, (cx1 - mx) / cw, (cy1 - my) / ch);
+            emit_quad(&mut cmdbuf, &cursor_verts, &mut vb_offset, &mut draw_idx);
+        }
+    }
 
     virgl_submit_sync(cmdbuf.as_slice())?;
     with_device_state(|state| set_scanout_resource(state, RESOURCE_3D_ID))?;
@@ -3584,58 +3911,17 @@ pub fn virgl_composite_windows(
         }
     }
 
-    // Step 2: Blit window content from MAP_SHARED pages into COMPOSITE_TEX.
-    // Windows are composited in z-order (bottom first in the array, top last)
-    // so higher-z windows correctly overwrite lower-z windows where they overlap.
-    // This must happen BEFORE cursor drawing so the cursor appears on top.
-    if bg_dirty || any_window_dirty {
-        for win in windows.iter() {
-            if win.page_phys_addrs.is_empty() || win.width == 0 || win.height == 0 {
-                continue;
-            }
-            blit_window_to_compositor(
-                win.x as u32, win.y as u32,
-                win.width, win.height,
-                &win.page_phys_addrs,
-                tex_w, tex_h,
-            );
-        }
-    }
+    // Step 2: BWM blits window content in z-order (with its occlusion optimization),
+    // writing directly into COMPOSITE_TEX via MAP_SHARED. No kernel-level blit is needed.
 
-    // ── Step 3: Cursor rendering ────────────────────────────────────────────
-    // Draw the mouse cursor directly into COMPOSITE_TEX so it appears in the
-    // composited output without requiring a full 4.9MB upload from userspace.
-    // Track cursor state to erase the old position and detect cursor-only moves.
+    // ── Step 3: Cursor position tracking ────────────────────────────────────
+    // Read mouse position and detect movement for early-out optimization.
+    // The cursor is rendered as a GPU quad in virgl_composite_single_quad(),
+    // NOT stamped into COMPOSITE_TEX (which caused ghost trails with per-window textures).
     use core::sync::atomic::AtomicI32;
     static CURSOR_PREV_X: AtomicI32 = AtomicI32::new(-1);
     static CURSOR_PREV_Y: AtomicI32 = AtomicI32::new(-1);
 
-    // Arrow cursor bitmap: 1=white, 2=black outline, 0=transparent (12x18)
-    const CURSOR_W: u32 = 12;
-    const CURSOR_H: u32 = 18;
-    const CURSOR_BITMAP: [[u8; 12]; 18] = [
-        [2,0,0,0,0,0,0,0,0,0,0,0],
-        [2,2,0,0,0,0,0,0,0,0,0,0],
-        [2,1,2,0,0,0,0,0,0,0,0,0],
-        [2,1,1,2,0,0,0,0,0,0,0,0],
-        [2,1,1,1,2,0,0,0,0,0,0,0],
-        [2,1,1,1,1,2,0,0,0,0,0,0],
-        [2,1,1,1,1,1,2,0,0,0,0,0],
-        [2,1,1,1,1,1,1,2,0,0,0,0],
-        [2,1,1,1,1,1,1,1,2,0,0,0],
-        [2,1,1,1,1,1,1,1,1,2,0,0],
-        [2,1,1,1,1,1,1,1,1,1,2,0],
-        [2,1,1,1,1,1,2,2,2,2,2,0],
-        [2,1,1,1,1,2,0,0,0,0,0,0],
-        [2,1,1,2,1,1,2,0,0,0,0,0],
-        [2,1,2,0,2,1,1,2,0,0,0,0],
-        [2,2,0,0,2,1,1,2,0,0,0,0],
-        [2,0,0,0,0,2,1,2,0,0,0,0],
-        [0,0,0,0,0,2,2,0,0,0,0,0],
-    ];
-    // Saved background pixels under the cursor (BGRA packed u32)
-    static mut CURSOR_SAVED_BG: [u32; 12 * 18] = [0; 12 * 18];
-
     let (mouse_x, mouse_y) = if crate::drivers::virtio::input_mmio::is_tablet_initialized() {
         crate::drivers::virtio::input_mmio::mouse_position()
     } else {
@@ -3646,101 +3932,7 @@ pub fn virgl_composite_windows(
     let prev_cx = CURSOR_PREV_X.load(Ordering::Relaxed);
     let prev_cy = CURSOR_PREV_Y.load(Ordering::Relaxed);
     let cursor_moved = cur_x != prev_cx || cur_y != prev_cy;
-
-    // Erase old cursor by restoring background pixels.
-    //
-    // With MAP_SHARED (bg_pixels=None), BWM writes directly to COMPOSITE_TEX.
-    // On full_redraw (dirty_rect=None), BWM fills entire background + blits all windows,
-    // overwriting the old cursor area — skip erase.
-    // Otherwise, use saved_bg to restore the old cursor area.
-    // With non-MAP_SHARED (bg_pixels=Some), use bg_pixels for partial mode.
-    let full_bg_copy = bg_dirty && dirty_rect.is_none() && bg_pixels.is_some();
-    let map_shared_full_redraw = bg_dirty && dirty_rect.is_none() && bg_pixels.is_none();
-    if prev_cx >= 0 && prev_cy >= 0 && !full_bg_copy && !map_shared_full_redraw
-        && (cursor_moved || any_window_dirty)
-    {
-        let tex_ptr = unsafe { COMPOSITE_TEX_PTR as *mut u32 };
-        let tw = tex_w as usize;
-
-        if bg_dirty && bg_pixels.is_some() {
-            // Partial mode with bg_pixels: read correct background from BWM's buffer
-            if let Some(pixels) = bg_pixels {
-                let src_w = bg_width.min(tex_w) as usize;
-                for row in 0..CURSOR_H as usize {
-                    let py = prev_cy as usize + row;
-                    if py >= tex_h as usize || py >= (bg_height as usize) { break; }
-                    for col in 0..CURSOR_W as usize {
-                        let px = prev_cx as usize + col;
-                        if px >= tw || px >= src_w { break; }
-                        if CURSOR_BITMAP[row][col] != 0 {
-                            unsafe {
-                                *tex_ptr.add(py * tw + px) = *pixels.as_ptr().add(py * src_w + px);
-                            }
-                        }
-                    }
-                }
-            }
-        } else {
-            // MAP_SHARED partial or cursor-only: restore from saved_bg.
-            // BWM already wrote correct content to COMPOSITE_TEX for dirty regions;
-            // saved_bg captures the pre-cursor state for the cursor area.
-            for row in 0..CURSOR_H as usize {
-                let py = prev_cy as usize + row;
-                if py >= tex_h as usize { break; }
-                for col in 0..CURSOR_W as usize {
-                    let px = prev_cx as usize + col;
-                    if px >= tw { break; }
-                    if CURSOR_BITMAP[row][col] != 0 {
-                        unsafe {
-                            let saved = CURSOR_SAVED_BG[row * CURSOR_W as usize + col];
-                            *tex_ptr.add(py * tw + px) = saved;
-                        }
-                    }
-                }
-            }
-        }
-    }
-
-    // After bg/window blits may have changed pixels under the old cursor,
-    // re-read if content was re-blitted over the old cursor area
-    // (bg_dirty or window blit already wrote fresh pixels, so saved_bg is stale — that's fine,
-    //  we just restored stale pixels that got immediately overwritten by the blit above)
-
-    // Save background under new cursor position, then draw cursor
-    let draw_cursor = cur_x >= 0 && cur_y >= 0
-        && (cur_x as u32) < tex_w && (cur_y as u32) < tex_h;
-    if draw_cursor {
-        let tex_ptr = unsafe { COMPOSITE_TEX_PTR as *mut u32 };
-        let tw = tex_w as usize;
-        // Save
-        for row in 0..CURSOR_H as usize {
-            let py = cur_y as usize + row;
-            if py >= tex_h as usize { break; }
-            for col in 0..CURSOR_W as usize {
-                let px = cur_x as usize + col;
-                if px >= tw { break; }
-                if CURSOR_BITMAP[row][col] != 0 {
-                    unsafe {
-                        CURSOR_SAVED_BG[row * CURSOR_W as usize + col] =
-                            *tex_ptr.add(py * tw + px);
-                    }
-                }
-            }
-        }
-        // Draw
-        for row in 0..CURSOR_H as usize {
-            let py = cur_y as usize + row;
-            if py >= tex_h as usize { break; }
-            for col in 0..CURSOR_W as usize {
-                let px = cur_x as usize + col;
-                if px >= tw { break; }
-                match CURSOR_BITMAP[row][col] {
-                    1 => unsafe { *tex_ptr.add(py * tw + px) = 0x00FFFFFF; }, // white (BGRX)
-                    2 => unsafe { *tex_ptr.add(py * tw + px) = 0x00000000; }, // black
-                    _ => {}
-                }
-            }
-        }
+    if cursor_moved {
         CURSOR_PREV_X.store(cur_x, Ordering::Relaxed);
         CURSOR_PREV_Y.store(cur_y, Ordering::Relaxed);
     }
@@ -3782,15 +3974,6 @@ pub fn virgl_composite_windows(
             crate::tracing::providers::counters::GPU_PARTIAL_UPLOADS.increment();
             crate::tracing::providers::counters::GPU_BYTES_UPLOADED.add((uw as u64) * (uh as u64) * 4);
         }
-        // Upload cursor areas: old position (erased) and new position (drawn)
-        if cursor_moved || any_window_dirty {
-            if prev_cx >= 0 && prev_cy >= 0 {
-                upload_rect(prev_cx as u32, prev_cy as u32, CURSOR_W, CURSOR_H)?;
-            }
-            if draw_cursor {
-                upload_rect(cur_x as u32, cur_y as u32, CURSOR_W, CURSOR_H)?;
-            }
-        }
     } else if bg_dirty {
         // Full background upload
         dma_cache_clean(unsafe { COMPOSITE_TEX_PTR }, tex_bytes_total);
@@ -3799,25 +3982,28 @@ pub fn virgl_composite_windows(
         })?;
         crate::tracing::providers::counters::GPU_FULL_UPLOADS.increment();
         crate::tracing::providers::counters::GPU_BYTES_UPLOADED.add(tex_bytes_total as u64);
-    } else {
-        // Cursor moved and/or windows dirty — upload cursor bounding boxes + dirty windows
-        if prev_cx >= 0 && prev_cy >= 0 {
-            upload_rect(prev_cx as u32, prev_cy as u32, CURSOR_W, CURSOR_H)?;
-        }
-        // Upload new cursor area
-        if draw_cursor {
-            upload_rect(cur_x as u32, cur_y as u32, CURSOR_W, CURSOR_H)?;
-        }
-        // Note: kernel does NOT blit client windows from MAP_SHARED pages.
-        // BWM composites all windows at correct z-order and sends bg_dirty=2
-        // with a dirty rect when client content changes.
+    }
+    // Note: cursor-only moves (no bg_dirty, no window_dirty) still trigger SUBMIT_3D
+    // below to redraw the GPU cursor quad at the new position. No COMPOSITE_TEX upload needed.
+
+    // =========================================================================
+    // Phase A2: Upload per-window GPU textures
+    // Per-window textures pre-allocated at init, TRANSFER_TO_HOST_3D proven working.
+    // Uploads dirty window content from MAP_SHARED pages to GPU textures.
+    // =========================================================================
+    for win in windows.iter() {
+        if !win.virgl_initialized || !win.dirty { continue; }
+        if win.page_phys_addrs.is_empty() { continue; }
+        let slot = (win.virgl_resource_id as usize).saturating_sub(RESOURCE_WIN_TEX_BASE as usize);
+        if slot >= MAX_WIN_TEX_SLOTS { continue; }
+        let _ = upload_window_texture(slot, win.width, win.height, &win.page_phys_addrs, win.size);
     }
 
     // =========================================================================
-    // Phase B+C: Single fullscreen SUBMIT_3D quad + display
+    // Phase B+C: GPU compositing + display
     // =========================================================================
-    // Window content was already blitted into COMPOSITE_TEX in z-order (step 2),
-    // so a single textured quad correctly displays everything including cursor.
+    // Background + decorations from COMPOSITE_TEX, per-window content from
+    // individual GPU textures, all in one SUBMIT_3D batch.
 
     // Perf: timestamp before display phase
     #[cfg(target_arch = "aarch64")]
@@ -3827,7 +4013,7 @@ pub fn virgl_composite_windows(
         v
     };
 
-    virgl_composite_single_quad()?;
+    virgl_composite_single_quad(windows)?;
 
     // Perf: end of frame
     #[cfg(target_arch = "aarch64")]
diff --git a/kernel/src/drivers/virtio/virgl.rs b/kernel/src/drivers/virtio/virgl.rs
index 73435e2f..c7293b19 100644
--- a/kernel/src/drivers/virtio/virgl.rs
+++ b/kernel/src/drivers/virtio/virgl.rs
@@ -266,6 +266,30 @@ impl CommandBuffer {
         }
     }
 
+    /// Create a blend state with SRC_ALPHA / INV_SRC_ALPHA alpha blending.
+    ///
+    /// Used for rendering quads with per-pixel transparency (e.g., cursor texture
+    /// where alpha=0 means transparent and alpha=0xFF means opaque).
+    pub fn create_blend_alpha(&mut self, handle: u32) {
+        // S2[0] encoding (per virgl_protocol.h, VIRGL_OBJ_BLEND_S2_RT_*):
+        //   bit 0: blend_enable = 1
+        //   bits 1-3: rgb_func = PIPE_BLEND_ADD (0)
+        //   bits 4-8: rgb_src_factor = PIPE_BLENDFACTOR_SRC_ALPHA (0x03)
+        //   bits 9-13: rgb_dst_factor = PIPE_BLENDFACTOR_INV_SRC_ALPHA (0x13)
+        //   bits 14-16: alpha_func = PIPE_BLEND_ADD (0)
+        //   bits 17-21: alpha_src_factor = PIPE_BLENDFACTOR_ONE (0x01)
+        //   bits 22-26: alpha_dst_factor = PIPE_BLENDFACTOR_ZERO (0x11)
+        //   bits 27-30: colormask = 0xF (write RGBA)
+        self.push(Self::cmd0(ccmd::CREATE_OBJECT, obj::BLEND, 11));
+        self.push(handle);
+        self.push(0x00000004); // S0: dither enabled
+        self.push(0);          // S1: logicop_func = 0
+        self.push(0x7C42_2631); // S2[0]: alpha blend enabled
+        for _ in 0..7 {
+            self.push(0);
+        }
+    }
+
     /// Create a depth-stencil-alpha state matching Mesa exactly.
     /// Mesa sends DSA with S0=0x00000000, length=5.
     pub fn create_dsa_default(&mut self, handle: u32) {
diff --git a/kernel/src/syscall/graphics.rs b/kernel/src/syscall/graphics.rs
index af99cf6b..54bda828 100644
--- a/kernel/src/syscall/graphics.rs
+++ b/kernel/src/syscall/graphics.rs
@@ -663,54 +663,33 @@ fn handle_virgl_op(cmd: &FbDrawCmd) -> SyscallResult {
             } else {
                 &[]
             };
-            // Extract window info under lock, then drop lock before VirGL init
-            let win_info = {
+            let registered = {
                 let mut reg = WINDOW_REGISTRY.lock();
-                // Find slot index first (immutable borrow)
-                let slot_idx = reg.buffers.iter().position(|s| {
-                    s.as_ref().map_or(false, |b| b.id == buffer_id)
-                });
-                match (slot_idx, reg.find_mut(buffer_id)) {
-                    (Some(idx), Some(buf)) => {
+                match reg.find_mut(buffer_id) {
+                    Some(buf) => {
                         buf.registered = true;
                         buf.title_len = title.len().min(MAX_TITLE_LEN);
                         buf.title[..buf.title_len].copy_from_slice(&title[..buf.title_len]);
-                        Some((idx, buf.width, buf.height, buf.page_phys_addrs.clone(), buf.size))
+                        true
                     }
-                    _ => None,
+                    None => false,
                 }
             };
-            match win_info {
-                Some((slot_idx, w, h, pages, size)) => {
-                    // Initialize VirGL texture for this window (outside registry lock)
-                    if crate::drivers::virtio::gpu_pci::is_virgl_enabled() {
-                        match crate::drivers::virtio::gpu_pci::init_window_texture(slot_idx, w, h, &pages, size) {
-                            Ok(res_id) => {
-                                let mut reg = WINDOW_REGISTRY.lock();
-                                if let Some(buf) = reg.find_mut(buffer_id) {
-                                    buf.virgl_resource_id = res_id;
-                                    buf.virgl_initialized = true;
-                                }
-                            }
-                            Err(e) => {
-                                crate::serial_println!("[window] VirGL texture init failed for buffer {}: {}", buffer_id, e);
-                            }
-                        }
-                    }
-                    // Bump registry generation + wake compositor so it discovers the new window
-                    #[cfg(target_arch = "aarch64")]
-                    {
-                        REGISTRY_GENERATION.fetch_add(1, core::sync::atomic::Ordering::Relaxed);
-                        let compositor_tid = COMPOSITOR_WAITING_THREAD.load(core::sync::atomic::Ordering::Acquire);
-                        if compositor_tid != 0 {
-                            crate::task::scheduler::with_scheduler(|sched| {
-                                sched.unblock(compositor_tid);
-                            });
-                        }
+            if registered {
+                // Bump registry generation + wake compositor so it discovers the new window
+                #[cfg(target_arch = "aarch64")]
+                {
+                    REGISTRY_GENERATION.fetch_add(1, core::sync::atomic::Ordering::Relaxed);
+                    let compositor_tid = COMPOSITOR_WAITING_THREAD.load(core::sync::atomic::Ordering::Acquire);
+                    if compositor_tid != 0 {
+                        crate::task::scheduler::with_scheduler(|sched| {
+                            sched.unblock(compositor_tid);
+                        });
                     }
-                    SyscallResult::Ok(0)
                 }
-                None => SyscallResult::Err(super::ErrorCode::InvalidArgument as u64),
+                SyscallResult::Ok(0)
+            } else {
+                SyscallResult::Err(super::ErrorCode::InvalidArgument as u64)
             }
         }
         13 => {
@@ -1330,30 +1309,30 @@ fn handle_composite_windows(desc_ptr: u64) -> SyscallResult {
                 if !buf.registered { continue; }
                 if buf.width == 0 || buf.height == 0 { continue; }
 
-                // Lazy VirGL texture init: create per-window GPU texture on first composite
+                let dirty = buf.generation > buf.last_uploaded_gen;
+
+                // Lazy-init per-window GPU texture on first composite
                 if !buf.virgl_initialized && !buf.page_phys_addrs.is_empty()
                     && matches!(crate::graphics::compositor_backend(),
                                 crate::graphics::CompositorBackend::VirGL)
                 {
-                    let slot_idx = (buf.id as usize).saturating_sub(1) % 16;
-                    match crate::drivers::virtio::gpu_pci::init_window_texture(
-                        slot_idx, buf.width, buf.height, &buf.page_phys_addrs, buf.size
+                    let slot_idx = (buf.id as usize).saturating_sub(1) % 8;
+                    match crate::drivers::virtio::gpu_pci::create_window_texture(
+                        slot_idx, buf.width, buf.height,
                     ) {
                         Ok(res_id) => {
                             buf.virgl_resource_id = res_id;
                             buf.virgl_initialized = true;
-                            crate::serial_println!("[composite] Window {} got VirGL texture (res={})",
-                                buf.id, res_id);
                         }
                         Err(e) => {
-                            crate::serial_println!("[composite] Window {} texture init failed: {}",
-                                buf.id, e);
+                            crate::serial_println!(
+                                "[composite] GPU texture init failed for buf {}: {}",
+                                buf.id, e
+                            );
                         }
                     }
                 }
 
-                let dirty = buf.generation > buf.last_uploaded_gen;
-
                 result.push(WindowCompositeInfo {
                     virgl_resource_id: buf.virgl_resource_id,
                     virgl_initialized: buf.virgl_initialized,
diff --git a/userspace/programs/src/bwm.rs b/userspace/programs/src/bwm.rs
index f3b6628c..37d793a9 100644
--- a/userspace/programs/src/bwm.rs
+++ b/userspace/programs/src/bwm.rs
@@ -187,16 +187,6 @@ struct Window {
     minimized: bool,
     /// Stable ordering for appbar (assigned at discovery time, never changes)
     creation_order: u32,
-    /// Direct-mapped pointer to client window's pixel buffer (read-only, MAP_SHARED)
-    /// Stored for future per-window direct blit (currently compositor uses bulk composite).
-    #[allow(dead_code)]
-    mapped_ptr: *const u32,
-    /// Client window buffer width (from map_window_buffer)
-    #[allow(dead_code)]
-    mapped_w: u32,
-    /// Client window buffer height (from map_window_buffer)
-    #[allow(dead_code)]
-    mapped_h: u32,
 }
 
 impl Window {
@@ -614,15 +604,6 @@ fn discover_windows(windows: &mut Vec<Window>, screen_w: usize, screen_h: usize,
             core::str::from_utf8(&title[..title_len]).unwrap_or("?"),
             info.buffer_id, info.width, info.height, cascade_x, cascade_y);
 
-        // Map client window buffer into our address space for zero-copy reads
-        let (map_ptr, map_w, map_h) = match graphics::map_window_buffer(info.buffer_id) {
-            Ok(result) => result,
-            Err(_) => {
-                print!("[bwm] WARNING: failed to map window {} buffer\n", info.buffer_id);
-                (core::ptr::null(), 0, 0)
-            }
-        };
-
         // Tell kernel where the client content goes on screen (for GPU compositing)
         let content_x = cascade_x + BORDER_WIDTH as i32;
         let content_y = cascade_y + TITLE_BAR_HEIGHT as i32 + BORDER_WIDTH as i32;
@@ -636,7 +617,6 @@ fn discover_windows(windows: &mut Vec<Window>, screen_w: usize, screen_h: usize,
             owner_pid: info.owner_pid,
             minimized: false,
             creation_order: order,
-            mapped_ptr: map_ptr, mapped_w: map_w, mapped_h: map_h,
         });
         added = true;
     }
@@ -652,7 +632,7 @@ fn redraw_all_windows(fb: &mut FrameBuf, windows: &[Window], focused_win: usize,
     for i in 0..windows.len() {
         if windows[i].minimized { continue; }
         draw_window_frame(fb, &windows[i], i == focused_win);
-        // GPU compositing handles client content — don't blit here
+        // Window content rendered by GPU from per-window textures — no CPU blit needed
     }
     draw_appbar(fb, windows, focused_win);
 }
@@ -709,8 +689,7 @@ fn compose_partial_redraw(
             let end = row * screen_w + dx1;
             sbuf[start..end].copy_from_slice(&bg[start..end]);
         }
-        // 2. Redraw UI elements that intersect dirty region
-        // GPU compositing handles client content — only draw frames/decorations
+        // 2. Redraw UI elements (frames only — content rendered by GPU)
         if dy0 < TASKBAR_HEIGHT {
             draw_taskbar(sfb, clock);
         }
@@ -733,8 +712,7 @@ fn compose_partial_redraw(
             vram[start..end].copy_from_slice(&sbuf[start..end]);
         }
     } else {
-        // Non-shadow path: restore bg region, redraw affected windows
-        // GPU compositing handles client content — only draw frames/decorations
+        // Non-shadow path: restore bg region, redraw affected windows (frames only)
         for row in dy0..dy1 {
             let start = row * screen_w + dx0;
             let end = row * screen_w + dx1;
@@ -861,6 +839,7 @@ fn main() {
     let mut dragging: Option<(usize, i32, i32)> = None;
     let mut full_redraw = true;
     let mut content_dirty = false;
+    let mut windows_dirty = false;
 
     // Clock state
     let mut last_clock_sec: i64 = -1;
@@ -1134,22 +1113,15 @@ fn main() {
             }
         }
 
-        // ── 5. GPU compositing handles window content — just check which are dirty ──
-        // Skip entirely if compositor_wait didn't report dirty content
+        // ── 5. Window content handled by GPU ──
+        // Per-window textures are uploaded by the kernel directly from MAP_SHARED
+        // pages and composited via VirGL SUBMIT_3D with z-order interleaved
+        // frame strips. No CPU blit needed. Mark windows_dirty so the composite
+        // syscall triggers per-window GPU upload + render WITHOUT re-uploading
+        // the full COMPOSITE_TEX (which contains only frames/decorations).
         if ready & graphics::COMPOSITOR_READY_DIRTY != 0 {
-            for i in 0..windows.len().min(16) {
-                if windows[i].window_id != 0 && !windows[i].minimized {
-                    if graphics::check_window_dirty(windows[i].window_id).unwrap_or(false) {
-                        content_dirty = true;
-                        let (bx0, by0, bx1, by1) = windows[i].bounds();
-                        dirty_x0 = dirty_x0.min(bx0);
-                        dirty_y0 = dirty_y0.min(by0);
-                        dirty_x1 = dirty_x1.max(bx1);
-                        dirty_y1 = dirty_y1.max(by1);
-                    }
-                }
-            }
-        } // end if DIRTY
+            windows_dirty = true;
+        }
 
         // ── 5b. Update clock (once per second) ──
         if let Ok(ts) = libbreenix::time::now_realtime() {
@@ -1179,6 +1151,7 @@ fn main() {
             );
             full_redraw = false;
             content_dirty = false;
+            windows_dirty = false;
         } else if content_dirty {
             let sw = screen_w as i32;
             let sh = screen_h as i32;
@@ -1191,12 +1164,16 @@ fn main() {
                 2, dx, dy, dw, dh,
             );
             content_dirty = false;
-        } else if mouse_moved_this_frame {
-            // Mouse-only update: no content changed, but kernel draws cursor
+            windows_dirty = false;
+        } else if windows_dirty || mouse_moved_this_frame {
+            // Window content and/or mouse-only update: no COMPOSITE_TEX change,
+            // but kernel uploads per-window textures and draws cursor via SUBMIT_3D.
+            // dirty_mode=0 tells kernel bg_dirty=false → skip COMPOSITE_TEX upload.
             let _ = graphics::virgl_composite_windows_rect(
                 cbuf, cw, ch,
                 0, 0, 0, 0, 0,
             );
+            windows_dirty = false;
         }
         // No sleep — compositor_wait handles blocking