Add NVLink P2P support for mixed NVLink/PCIe GPU topologies #18
valdemardi wants to merge 1 commit into aikitoria:595.45.04-p2p from
Cool! Sadly I don't have any 3090s anymore to test this change. Including my cudaHostRegister change in your repo is pretty brave. It solves a particular edge case in my other project, where I wanted to register an enormous amount of memory for async copies that lives in 1G reserved pages. It is otherwise not tested much, although I haven't heard of it causing crashes for anyone else.
I have 2 5090s and 2 3090s NVLinked on Sapphire Rapids, let me try.
You said you did what?? Are you saying that P2P and NVLink are working together? Can you please publish your I've got a lot of RTX 3090s and the two-slot NVLink bridges. I have to do the water-cooling loop first, so I had been delaying it, because there is no benefit to NVLink without P2P. So you're saying you have solved the issue and everything just works? [EDIT]: Aha, the data is in your repo. Well, I have to test it then. :)
Yup, as far as I can see, everything is working perfectly on my system. I've also run several stress tests with multiple instances of nccl-tests and p2pBandwidthLatencyTest running simultaneously, and I haven't seen any problems. The changes mainly revert changes made in the tinygrad and aikitoria versions to bring back the NVLink features, rather than adding much new.
I also have a minimized version in the mini-p2p branch, where the diff to the NVIDIA version is very small. This stripped-down version also runs perfectly on my system and utilizes both NVLink and PCIe P2P. I currently use it as my daily driver. The only small drawback of this version is that it requires setting some additional kernel options (NVIDIA NVreg dwords), which the tinygrad/aikitoria versions set behind the scenes. On the other hand, it will be easier to keep up to date with the NVIDIA version. The required dword options for the mini-p2p version are:
One thing I would also like feedback on is whether this version still works properly as PCIe-only P2P with, for example, RTX 4090 or RTX 5090 cards. I would assume it will, and if it does, I think the small diff to upstream and the added NVLink capability make it quite an attractive fork to maintain overall. Please let me know how things go with your 3090 system.
👍 Very interested to hear how this works out.
Hi @aikitoria
I created an NVLink-enabled version based on your 595.45.04 updated tinygrad driver. In my repository, I forked the NVIDIA upstream repository from the 595.45.04 tag, applied most of the changes from your repository (excluding the README and install.sh), then made the NVLink-enabling changes and updated the README with some test results, which confirm that the driver works as expected.
Today I also created a commit against your repository with the changes, in case you or others might find this useful, given your repository's visibility. The version in this PR should work as a drop-in replacement for your version. If the system running this version has NVLink(s), the driver will prefer them where possible, and otherwise it will fall back to the BAR1 PCIe P2P approach.
I have tested this PR version only on a quad RTX 3090 system with two NVLinks (two NVLinked GPU pairs), and on that system it works as expected. I'd expect it to work the same as your version on systems with no NVLinks, but I have not tested that.
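For anyone checking a mixed topology like the one described above, here is a minimal Python sketch that classifies each GPU pair as NVLink or PCIe from text shaped like the matrix printed by `nvidia-smi topo -m`. The sample matrix is a hypothetical illustration of a quad-GPU system with two NVLinked pairs, not output captured from my machine, and `classify_links` is a helper name made up for this example. It only reports the link type between GPUs; it says nothing about whether P2P is actually enabled by the driver.

```python
import re

# Hypothetical sample shaped like `nvidia-smi topo -m` output for four GPUs
# with two NVLinked pairs. Illustrative only, not a captured result.
SAMPLE_TOPO = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      NV4     PHB     PHB     0-31
GPU1    NV4      X      PHB     PHB     0-31
GPU2    PHB     PHB      X      NV4     0-31
GPU3    PHB     PHB     NV4      X      0-31
"""


def classify_links(topo_text):
    """Map each GPU pair (i, j) with i < j to 'nvlink' or 'pcie'."""
    lines = [ln for ln in topo_text.splitlines() if ln.strip()]
    # Header tokens that name GPU columns, e.g. GPU0, GPU1, ...
    gpu_cols = [tok for tok in lines[0].split() if re.fullmatch(r"GPU\d+", tok)]
    links = {}
    for line in lines[1:]:
        cells = line.split()
        if not re.fullmatch(r"GPU\d+", cells[0]):
            continue  # skip legend text and non-GPU rows such as NICs
        src = int(cells[0][3:])
        # GPU columns come first, so zip stops before the affinity columns.
        for col, cell in zip(gpu_cols, cells[1:]):
            dst = int(col[3:])
            if src < dst:
                # NV<n> means the pair is connected by n bonded NVLinks;
                # anything else (PIX, PXB, PHB, NODE, SYS) is a PCIe path.
                is_nvlink = re.fullmatch(r"NV\d+", cell) is not None
                links[(src, dst)] = "nvlink" if is_nvlink else "pcie"
    return links


if __name__ == "__main__":
    for pair, kind in sorted(classify_links(SAMPLE_TOPO).items()):
        print(pair, kind)
```

On a real system the text could come from running `nvidia-smi topo -m` and passing its standard output to `classify_links`; the pairs reported as `nvlink` are the ones where this PR's driver would be expected to prefer NVLink, with the rest going through BAR1 PCIe P2P.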
Cheers