Add prealloc in benchmark and do it with Float32 #55

Merged: blegat merged 2 commits into main from bl/prealloc on May 6, 2026

blegat commented May 6, 2026

Now we get

Precompiling ArrayDiff finished.
  1 dependency successfully precompiled in 3 seconds. 47 already precompiled.
CUDA.jl device : NVIDIA RTX PRO 1000 Blackwell Generation Laptop GPU  (math_mode=FAST_MATH)

========================================================================
h = 16, d = 13, n = 178
========================================================================
ArrayDiff CPU build (h=16) ... 0.99 s
ArrayDiff GPU build (h=16) ... 0.32 s
ArrayDiff CPU   vs hand-CUDA: max|Δ| = 8.527e-14 (rel 2.45e-16)  match=true
ArrayDiff GPU   vs hand-CUDA: max|Δ| = 0.000e+00 (rel 0.00e+00)  match=true

--- benchmark (median of N samples, post-sync) ---
Hand-CUDA Float64 : median       202.09 µs
ArrayDiff CPU     : median         0.39 µs
ArrayDiff GPU     : median        57.64 µs

========================================================================
h = 256, d = 13, n = 178
========================================================================
ArrayDiff CPU build (h=256) ... 0.00 s
ArrayDiff GPU build (h=256) ... 0.00 s
ArrayDiff CPU   vs hand-CUDA: max|Δ| = 1.819e-12 (rel 5.95e-16)  match=true
ArrayDiff GPU   vs hand-CUDA: max|Δ| = 0.000e+00 (rel 0.00e+00)  match=true

--- benchmark (median of N samples, post-sync) ---
Hand-CUDA Float64 : median       179.48 µs
ArrayDiff CPU     : median         3.38 µs
ArrayDiff GPU     : median        61.22 µs

========================================================================
h = 4096, d = 13, n = 178
========================================================================
ArrayDiff CPU build (h=4096) ... 0.06 s
ArrayDiff GPU build (h=4096) ... 0.03 s
ArrayDiff CPU   vs hand-CUDA: max|Δ| = 1.637e-11 (rel 1.17e-15)  match=true
ArrayDiff GPU   vs hand-CUDA: max|Δ| = 0.000e+00 (rel 0.00e+00)  match=true

--- benchmark (median of N samples, post-sync) ---
Hand-CUDA Float64 : median      2119.60 µs
ArrayDiff CPU     : median        78.86 µs
ArrayDiff GPU     : median        48.80 µs

CUDA.jl device : NVIDIA RTX PRO 1000 Blackwell Generation Laptop GPU  (math_mode=FAST_MATH)

========================================================================
h = 16, d = 13, n = 178
========================================================================
ArrayDiff CPU build (h=16) ... 0.00 s
ArrayDiff GPU build (h=16) ... 0.00 s
ArrayDiff CPU   vs hand-CUDA: max|Δ| = 8.527e-14 (rel 2.45e-16)  match=true
ArrayDiff GPU   vs hand-CUDA: max|Δ| = 0.000e+00 (rel 0.00e+00)  match=true

--- benchmark (median of N samples, post-sync) ---
Hand-CUDA Float64 : median       146.55 µs
ArrayDiff CPU     : median         0.19 µs
ArrayDiff GPU     : median        55.98 µs

========================================================================
h = 256, d = 13, n = 178
========================================================================
ArrayDiff CPU build (h=256) ... 0.00 s
ArrayDiff GPU build (h=256) ... 0.00 s
ArrayDiff CPU   vs hand-CUDA: max|Δ| = 1.819e-12 (rel 5.95e-16)  match=true
ArrayDiff GPU   vs hand-CUDA: max|Δ| = 0.000e+00 (rel 0.00e+00)  match=true

--- benchmark (median of N samples, post-sync) ---
Hand-CUDA Float64 : median       181.35 µs
ArrayDiff CPU     : median         3.40 µs
ArrayDiff GPU     : median        62.93 µs

========================================================================
h = 4096, d = 13, n = 178
========================================================================
ArrayDiff CPU build (h=4096) ... 0.03 s
ArrayDiff GPU build (h=4096) ... 0.02 s
ArrayDiff CPU   vs hand-CUDA: max|Δ| = 1.637e-11 (rel 1.17e-15)  match=true
ArrayDiff GPU   vs hand-CUDA: max|Δ| = 0.000e+00 (rel 0.00e+00)  match=true

--- benchmark (median of N samples, post-sync) ---
Hand-CUDA Float64 : median      2095.01 µs
ArrayDiff CPU     : median        80.81 µs
ArrayDiff GPU     : median        62.88 µs







forward_pass (generic function with 1 method)

reverse_diff (generic function with 1 method)



reverse_diff_prealloc! (generic function with 1 method)

build_arraydiff (generic function with 2 methods)

arraydiff_grad_cpu! (generic function with 1 method)

arraydiff_grad_gpu! (generic function with 1 method)

run_one (generic function with 1 method)

main (generic function with 1 method)

CUDA.jl device : NVIDIA RTX PRO 1000 Blackwell Generation Laptop GPU  (math_mode=FAST_MATH)

========================================================================
h = 16, d = 13, n = 178  (Float32)
========================================================================
ArrayDiff CPU build (h=16) ... 0.17 s
ArrayDiff GPU build (h=16) ... 0.04 s
Hand-CUDA prealloc   vs hand-CUDA alloc: max|Δ| = 0.000e+00 (rel 0.00e+00)  match=true
ArrayDiff CPU        vs hand-CUDA alloc: max|Δ| = 2.606e-01 (rel 2.57e-04)  match=true
ArrayDiff GPU        vs hand-CUDA alloc: max|Δ| = 0.000e+00 (rel 0.00e+00)  match=true

--- benchmark (median of N samples, post-sync) ---
Hand-CUDA alloc    : median       170.84 µs
Hand-CUDA prealloc : median       163.93 µs
ArrayDiff CPU      : median         0.35 µs
ArrayDiff GPU      : median        48.76 µs

========================================================================
h = 256, d = 13, n = 178  (Float32)
========================================================================
ArrayDiff CPU build (h=256) ... 0.00 s
ArrayDiff GPU build (h=256) ... 0.00 s
Hand-CUDA prealloc   vs hand-CUDA alloc: max|Δ| = 0.000e+00 (rel 0.00e+00)  match=true
ArrayDiff CPU        vs hand-CUDA alloc: max|Δ| = 1.291e+00 (rel 3.61e-04)  match=true
ArrayDiff GPU        vs hand-CUDA alloc: max|Δ| = 0.000e+00 (rel 0.00e+00)  match=true

--- benchmark (median of N samples, post-sync) ---
Hand-CUDA alloc    : median       111.47 µs
Hand-CUDA prealloc : median       105.87 µs
ArrayDiff CPU      : median         1.44 µs
ArrayDiff GPU      : median        44.57 µs

========================================================================
h = 4096, d = 13, n = 178  (Float32)
========================================================================
ArrayDiff CPU build (h=4096) ... 0.03 s
ArrayDiff GPU build (h=4096) ... 0.01 s
Hand-CUDA prealloc   vs hand-CUDA alloc: max|Δ| = 0.000e+00 (rel 0.00e+00)  match=true
ArrayDiff CPU        vs hand-CUDA alloc: max|Δ| = 1.009e+01 (rel 7.42e-04)  match=true
ArrayDiff GPU        vs hand-CUDA alloc: max|Δ| = 0.000e+00 (rel 0.00e+00)  match=true

--- benchmark (median of N samples, post-sync) ---
Hand-CUDA alloc    : median       153.24 µs
Hand-CUDA prealloc : median       170.41 µs
ArrayDiff CPU      : median        47.21 µs
ArrayDiff GPU      : median        57.36 µs
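
For readers who want to see what the alloc vs. prealloc comparison amounts to, here is a minimal CPU-only sketch in plain Julia. It is not the benchmark script from this PR: the toy loss `sum(tanh.(W * X))`, the names `grad_alloc`, `grad_prealloc!` and `median_us`, and the definition of the relative error (max absolute difference divided by the largest reference entry) are all assumptions made for illustration. Only the sizes `h`, `d`, `n` and the `Float32` element type are taken from the logs above.

```julia
using LinearAlgebra, Statistics

# Allocating version: every call creates fresh temporaries.
# Gradient of sum(tanh.(W * X)) with respect to W (toy example, assumed).
function grad_alloc(W, X)
    Y = tanh.(W * X)              # h×n temporary
    return (1 .- Y .^ 2) * X'     # h×d result, newly allocated
end

# Preallocated version: the buffers G (h×d) and Y (h×n) are created once
# by the caller and overwritten on every call.
function grad_prealloc!(G, Y, W, X)
    mul!(Y, W, X)                 # Y = W * X, in place
    Y .= 1 .- tanh.(Y) .^ 2       # fused in-place broadcast
    mul!(G, Y, X')                # G = Y * X', in place
    return G
end

# Median wall-clock time in microseconds over `samples` calls.
function median_us(f; samples = 200)
    f()                           # warm-up call so compilation is not timed
    times = [(@elapsed f()) * 1e6 for _ in 1:samples]
    return median(times)
end

h, d, n = 256, 13, 178
W = randn(Float32, h, d)
X = randn(Float32, d, n)
G = similar(W)                    # preallocated output buffer
Y = Matrix{Float32}(undef, h, n)  # preallocated work buffer

G_ref = grad_alloc(W, X)
grad_prealloc!(G, Y, W, X)
Δ = maximum(abs, G .- G_ref)
println("prealloc vs alloc: max|Δ| = ", Δ, " (rel ", Δ / maximum(abs, G_ref), ")")
println("alloc    : median ", median_us(() -> grad_alloc(W, X)), " µs")
println("prealloc : median ", median_us(() -> grad_prealloc!(G, Y, W, X)), " µs")
```

On a GPU the timed call would additionally need to wait for the device to finish (for example by wrapping it in `CUDA.@sync`), which is presumably what "post-sync" in the log headers refers to.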

codecov Bot commented May 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.20%. Comparing base (6c05cb6) to head (8e8eca4).

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #55   +/-   ##
=======================================
  Coverage   90.20%   90.20%           
=======================================
  Files          23       23           
  Lines        2848     2848           
=======================================
  Hits         2569     2569           
  Misses        279      279           

blegat merged commit 5b4d9ab into main on May 6, 2026
5 checks passed