I have a full CUDA backend implementation on my fork:
Includes CUDA backend support, demos, benchmark/test tooling, and the ggml CUDA fix needed for PCS/text-prompt mode.
Perf notes:
https://github.com/peters/sam3.cpp/blob/bf831442965c4918b566086b9e3aaa8ad1ab40ff/docs/cuda_performance.md#cuda-performance-optimization-log
RTX 4080:
- SAM3 f16 ~1100 ms/frame
- SAM3 q4_0 1030 ms/frame
- SAM2.1 tiny f16 118 ms/frame
- SAM2.1 tiny q8_0 111 ms/frame
- SAM2.1 tiny q4_0 125 ms/frame