[CPU/CUDA EP] Add DeformConv op support #27393
ShirasawaSama wants to merge 58 commits into microsoft:main
Conversation
@microsoft-github-policy-service agree

@fs-eire @tianleiwu Hello, could you please help me trigger a GitHub Copilot code review and unit test/build test pipeline? Thank you very much.
Pull request overview
This pull request adds comprehensive support for the DeformConv2D (Deformable Convolution 2D) operator to ONNX Runtime for opsets 19-22. The implementation includes CPU and CUDA kernel implementations with extensive test coverage.
Changes:
- Implements DeformConv operator for CPU (float) and CUDA (float, double, MLFloat16, BFloat16)
- Adds deformable im2col kernels with bilinear interpolation for spatially-adaptive sampling
- Includes comprehensive unit tests covering edge cases, data types, and parameter combinations
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/providers/cpu/nn/deform_conv_op_test.cc | Comprehensive test suite with 24 test cases covering various configurations, edge cases, and data types |
| onnxruntime/test/providers/cpu/nn/deform_conv_expected_gen.py | Python script to generate expected outputs using PyTorch's torchvision.ops.deform_conv2d as reference |
| onnxruntime/core/providers/cuda/nn/deform_conv_impl.h | CUDA kernel interface declarations for im2col, bias addition, and GEMM output copy |
| onnxruntime/core/providers/cuda/nn/deform_conv_impl.cu | CUDA kernel implementations with optimizations for FP16/BF16 and memory access patterns |
| onnxruntime/core/providers/cuda/nn/deform_conv.h | CUDA operator class declaration and attributes structure |
| onnxruntime/core/providers/cuda/nn/deform_conv.cc | CUDA operator compute logic with batched processing and memory management |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | CUDA kernel registrations for opsets 19-22 across all supported types |
| onnxruntime/core/providers/cpu/nn/deform_conv.h | CPU operator class declaration |
| onnxruntime/core/providers/cpu/nn/deform_conv.cc | CPU operator implementation with float support only |
| onnxruntime/core/providers/cpu/cpu_execution_provider.cc | CPU kernel registrations for opsets 19-22 |
Hello @hariharans29, Sorry to ping here — I just wanted to kindly ask whether the current PR structure looks reasonable. I realize this PR is relatively large (~2000 lines including tests) since it includes both CPU and CUDA implementations of DeformConv. Would it make the review process easier if I split it into two separate PRs (e.g., CPU first, then CUDA)? I'm happy to restructure it if that aligns better with ORT's contribution workflow.

I'm also planning to add a DirectML implementation in a follow-up PR. Please let me know if there are any specific design considerations I should keep in mind to ensure consistency across execution providers.

Additionally, would you recommend including performance benchmarks or accuracy comparisons against PyTorch's native implementation? I can provide benchmark results and numerical validation reports if that would be helpful for review.

As I'm still relatively new to contributing here, I may not be reaching out to the most appropriate person. If this isn't within your area or availability, I completely understand — I would really appreciate any suggestions on the appropriate next steps. I really appreciate your time and guidance. Thank you!
tianleiwu left a comment
1. CPU kernel registration uses InputMemoryType(OrtMemTypeCPUInput, ...) incorrectly
File: onnxruntime/core/providers/cpu/nn/deform_conv.cc, lines ~370–384
    .InputMemoryType(OrtMemTypeCPUInput, 2)   /* offset */
    .InputMemoryType(OrtMemTypeCPUInput, 4),  /* optional mask */

InputMemoryType annotations are meaningful for non-CPU execution providers (e.g., CUDA EP) to specify that certain inputs should stay on CPU. For a CPU kernel, all inputs are already on CPU — these annotations are unnecessary and potentially confusing. They should be removed from the CPU registration macros.
4. GemmEx<double> addition in math_cpu.cc may affect other operators
File: onnxruntime/core/util/math_cpu.cc, +65 lines
A new GemmEx<double, ThreadPool> specialization is added with Eigen fallback when MLAS_SUPPORTS_GEMM_DOUBLE is not defined. This is a shared utility — if this template was previously uninstantiated/missing, it could break or affect other operators that might attempt to use it. This change should be:
- Called out explicitly in the PR description
- Tested independently or confirmed that no existing code path is affected
- Considered for a separate PR since it's infrastructure-level
5. CUDA: No cudaGetLastError() / error checking after kernel launches
File: onnxruntime/core/providers/cuda/nn/deform_conv_impl.cu
After DeformableIm2ColKernel, DeformConvAddBiasKernel, and CopyGemmOutputRowMajorToNCHWKernel launches, there is no error checking (e.g., CUDA_RETURN_IF_ERROR(cudaGetLastError())). ORT's convention for CUDA kernels typically includes post-launch error checking to catch launch configuration errors and async kernel failures. Add error checking after each kernel launch.
Moderate Issues
6. CUDA: Potential integer overflow in DeformableIm2ColKernel address computation
Inside the kernel, address calculations like:

    out_b * (channels * height * width) + in_c * (height * width)

are intended to produce int64_t results, but out_b is derived from IndexT (which may be int32_t). When IndexT = int32_t, the intermediate products could overflow if channels * height * width is large. The operands should be explicitly cast to int64_t before multiplication.
7. CPU: DeformableIm2col iterates over output spatially then channels — poor cache locality
File: onnxruntime/core/providers/cpu/nn/deform_conv.cc
The loop order is:
for c_col in [0, out_h*out_w): // spatial
for c_im in [0, C): // channel
for i, j in kernel: // kernel
This means for each spatial position, all channels are processed. A better loop order for CPU cache efficiency would be to iterate channels in the outer loop (stride-1 access in data_im). This would significantly improve performance, especially for large input tensors. Consider:
for c_im:
for c_col:
for i, j:
8. CPU: No multi-threading in DeformableIm2col
The CPU im2col implementation is entirely single-threaded. The thread_pool obtained from context is only passed to GemmEx. For large inputs, the im2col phase could be a bottleneck. Consider parallelizing the outer spatial or channel loop using concurrency::ThreadPool::TryParallelFor.
9. CUDA: CopyGemmOutputRowMajorToNCHW — unnecessary extra copy kernel
The CopyGemmOutputRowMajorToNCHW kernel copies GEMM results to NCHW layout after each group's GEMM. This is an extra memory copy pass. Could the GEMM output be written directly to the correct NCHW offsets by adjusting the cuBLAS call's output pointer and leading dimension? This would eliminate one kernel launch per group per batch block.
Alternatively, if the current layout mismatch is unavoidable due to the batched im2col approach, this should be documented clearly.
10. CUDA: Memory heuristic kMaxTempMemSize = 256MB is arbitrary
File: onnxruntime/core/providers/cuda/nn/deform_conv.cc, line ~210
    constexpr size_t kMaxTempMemSize = 256 * 1024 * 1024;

This hardcoded limit doesn't account for actual GPU memory availability. On smaller GPUs (e.g., 4GB), 256MB is a significant fraction. Consider querying available memory or making this configurable, or at least documenting the rationale.
11. Test: Verify the asymmetric padding computation against the ONNX pads ordering
File: onnxruntime/test/providers/cpu/nn/deform_conv_op_test.cc
In RunDeformConvTest, the output size is computed as:

    const int64_t out_h = (params.in_h + params.pad[0] + params.pad[2] - ...) / params.stride[0] + 1;
    const int64_t out_w = (params.in_w + params.pad[1] + params.pad[3] - ...) / params.stride[1] + 1;

The ONNX spec defines pads as [x1_begin, x2_begin, x1_end, x2_end] (for 2D: [pad_top, pad_left, pad_bottom, pad_right]). The computation here is consistent with that layout, but it should be verified against the kernel code, which uses pad_h = pads[0], pad_w = pads[1] (begin pads only) and pad_h_end = pads[2], pad_w_end = pads[3].
Minor Issues / Suggestions
12. Duplicated DeformConvAttributes struct between CPU and CUDA
The DeformConvAttributes struct is defined identically in both cpu/nn/deform_conv.h and cuda/nn/deform_conv.h. Consider extracting it to a shared header (e.g., core/providers/common/deform_conv_attributes.h) to avoid duplication and keep them in sync.
13. Duplicated validation logic between CPU and CUDA Compute functions
Both DeformConv<T>::Compute (CPU) and DeformConv<T>::ComputeInternal (CUDA) contain ~30 lines of identical validation code. This could be extracted to a shared validation helper.
14. CUDA kernel uses #pragma unroll on dynamic loop bounds
    #pragma unroll
    for (int64_t i = 0; i < weight_h; ++i) {

weight_h and weight_w are runtime values (kernel parameters), so #pragma unroll has no effect here with most compilers. It's harmless but misleading — either remove it or add a comment noting it's a hint for common small kernel sizes.
15. Test MinimalBilinear: offset semantics need clearer documentation
The test comment says:
// offset (1, 2, 2, 2): ch0=offset_h, ch1=offset_w per output position. (0,0):(0.5,0)->2.5, (0,1):(0.5,-1)->1
But the offset tensor has 8 values with shape [1, 2, 2, 2]. The mapping from offset values to output positions is not immediately clear from the comment. Consider adding a more explicit breakdown.
16. No ONNX model test nodes (.onnx test data)
The test suite only uses OpTester C++ tests. Consider also adding standard ONNX model test data files that can be used in model zoo validation or cross-platform testing.
18. GetGreatestDivisorBelowBound could be slow for large N with small bound
For N values that are prime (or have no small factors), this linearly scans from bound to 1. For bound = 32 this is negligible, but worth noting.
19. Test Python script (deform_conv_expected_gen.py) hard-codes PyTorch padding convention
The script uses padding=(pad_h, pad_w) (symmetric) for PyTorch but generates ONNX pads = [pad_h, pad_w, pad_h, pad_w]. This works for symmetric padding but wouldn't generate correct asymmetric test cases. This is fine for the current tests but limits the script's utility.
20. Consider adding a DeformConv shape inference function
The PR adds compute kernels but doesn't seem to add shape inference support. If ONNX Runtime's graph optimizer needs shape inference for DeformConv nodes (e.g., for memory planning), this may need to be added.
@tianleiwu Thanks for the code review. Here is the updated CR response:

CR Response
1. CPU InputMemoryType for offset/mask
4. GemmEx&lt;double&gt; in math_cpu.cc
5. CUDA: No cudaGetLastError() after kernel launches
6. CUDA: Potential integer overflow with IndexT
7. CPU: DeformableIm2col loop order for cache locality
8. CPU: No multi-threading in DeformableIm2col
9. CUDA: CopyGemmOutputRowMajorToNCHW extra copy
10. CUDA: kMaxTempMemSize = 256MB arbitrary
11. Test: Asymmetric padding pad[1] semantics
12. Duplicated DeformConvAttributes
13. Duplicated validation logic
14. #pragma unroll on dynamic loop bounds
15. MinimalBilinear test offset documentation
16. No ONNX model test nodes
18. GetGreatestDivisorBelowBound performance
19. deform_conv_expected_gen.py padding convention
20. DeformConv shape inference
@ShirasawaSama, Here are some issues found by AI, please take a look. They might not be correct.
1. Unused
@tianleiwu Thank you for your reply! I will carefully review the code again. Additionally, would you recommend adding benchmarks to this PR or to subsequent ones? Personally, I'd prefer to include them in the next PR. In my local tests comparing different convolution kernels, I observed performance gains ranging from 50% to 140% compared to torchvision's CUDA implementation (depending on kernel size and input dimensions). I can share my benchmark Jupyter notebook in my personal repository if needed.
All code review suggestions mentioned above have been addressed.

perf(DeformConv CPU): improve im2col, bilinear interpolation, and bias handling
For CUDA:
9d7d29d to
cbe1eca
Compare
Random offset and masks:
1. Absolute time (ms)
2. Relative time (ORT / TV), ratio < 1 ⇒ ORT faster
3. Numerical accuracy (same inputs)
Rebased to latest main branch
I've reviewed my code again and feel that the following two areas could be optimized, though they don't affect the performance comparison report above:

If you have any suggestions, I'm happy to make changes at any time. Of course, I could also address these two details in a follow-up pull request, given that running the pipeline takes a long time.
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Pull request overview
Copilot reviewed 13 out of 15 changed files in this pull request and generated 3 comments.
Change list:
After careful consideration and reviewing the implementation of the conv op, I believe that allowing

The last commit was made because the most recent pipeline failed; the new assert statement added last time (b96de81) caused an error when

I apologize for running the pipeline again, but this should be the last time.
I think the code in the current commit is good enough. However, there is still room for improvement. In the next PR, I will remove

Of course, I don't expect the improvement to be significant. The current implementation already outperforms TorchVision and MMCV (I'll add the performance comparison results for MMCV when I have time).
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
@ShirasawaSama, please update two files under docs directory with artifacts here: https://aiinfra.visualstudio.com/PublicPackages/_build/results?buildId=1126907&view=artifacts&pathAsName=false&type=publishedArtifacts It is required to pass "Windows GPU Doc Gen CI Pipeline"

ok
Head branch was pushed to by a user without write access
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).



Description
This change adds support for the Deformable Convolution 2D operator (DeformConv2D) to ONNX Runtime. The branch implements the operator schema and registration, provides kernel implementations (CPU and GPU/CUDA where available), implements shape inference, and adds unit and integration tests to validate correctness and numerical parity with reference implementations. The changes include performance-oriented optimizations and necessary changes to build/test scripts.
Motivation and Context
Deformable convolutions are widely used in vision models that require spatial sampling flexibility (e.g., Deformable ConvNets, some detection/segmentation models). Native support in ONNX Runtime enables these models to run efficiently without custom operators or external runtimes, broadening the set of compatible models and improving performance and portability.
See also
https://onnx.ai/onnx/operators/onnx__DeformConv.html
https://docs.pytorch.org/vision/main/generated/torchvision.ops.deform_conv2d.html
https://arxiv.org/abs/1811.11168
https://arxiv.org/abs/1703.06211
https://github.com/pytorch/vision/blob/0f6d91d9fe514e6de2f5519114cbeb389d498b2d/torchvision/csrc/ops/cuda/deform_conv2d_kernel.cu
https://github.com/open-mmlab/mmdetection/blob/master/mmdet/ops/dcn/src/deform_conv_cuda.cpp
https://github.com/pytorch/vision/blob/0f6d91d9fe514e6de2f5519114cbeb389d498b2d/torchvision/csrc/ops/cpu/deform_conv2d_kernel.cpp
DeformConv in ONNX Runtime v1.19.0 #22060
[Feature Request] Support DeformConv opset 19 #15572
[ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for DeformConv(19) node with name 'p2o.DeformConv.0' #20810
onnxruntime DeformConv #16903
onnxruntime does not implement DeformConv onnx/onnx#5451
Export dynamic batch size ONNX using ONNX's DeformConv ZhengPeng7/BiRefNet#167
Add support for deform_conv2d export to ONNX pytorch/pytorch#68910
does deformable conv support export to onnx? pytorch/vision#2066