Add support for vSphere device groups#440
Conversation
|
as a bit of early feedback, I'd expect to see new unit tests for this new feature. AFAICT, the changeset updates the existing tests so they don't fail, but doesn't add any new ones. Also, please make a corresponding PR to https://bosh.io/docs/vsphere-cpi/#resource-pools to document the new cloud config. |
|
For clarification, please submit a companion PR to edit https://github.com/cloudfoundry/docs-bosh/blob/master/content/vsphere-cpi.md with documentation for your new property. |
Device groups require vSphere 8.0+ and allow multiple physical devices (e.g., NVLink-connected NVIDIA GPUs) to be presented as a single logical unit to a VM. - Add device_groups cloud property support in VmType and VmConfig - Implement create_device_group_vgpus method to query and create vGPU devices from device group information - Add device group attachment logic in VmCreator after hardware upgrade - Fix SDK bug: properly instantiate SoapError in stub_adapter.rb - Fix version mismatch: use vim.version.v8_0_0_1 instead of v8_0_0_0 for v8.0 - Add fallback logic to search all hosts in cluster if device group not found on VM's initial host
* Add unit tests for create_device_group_vgpus method in pci_passthrough_spec.rb * Add unit tests for device groups in vm_creator_spec.rb * Update default SDK version to 8.0 in spec_helper.rb to support device groups testing
* Refactor host selection to use prioritized list pattern * Remove nested exception handling in favor of unified search * Simplify logging to consistently show which host was tried * Update tests to match new log message format
97a6c1f to
b31e217
Compare
|
Hey @Alphasite and @julian-hj, thank you for the review and comments. I have done a first pass at addressing all the feedback and there are just a couple of discussions to resolve. I am planning to test these new changes on my development env to make sure it is still working as expected. I will write back here to confirm once that's been done. I have also opened a draft PR into docs-bosh repo here: cloudfoundry/docs-bosh#894. I plan to undraft once this PR has been merged a new release cut so that I can include the version in the docs. |
|
Just to confirm that I have tested the new commits on my dev env and it is all working as expected. |
|
Thanks all - looks like all the feedback has now been resolved, are we ok to merge? (cc @Alphasite) |
Alphasite
left a comment
There was a problem hiding this comment.
I think it LGTM. I'd like a second approval as well since im just not very familiar with vgpus or passthrough devices.
|
Thanks @Alphasite! Who would be best to provide that second approval? Maybe @julian-hj or @ystros ? I would also be happy to approve it (if authors are allowed to do that) as I understand that my team will essentially be responsible for looking after this specific bit of the codebase. |
Summary
Adds support for vSphere device groups in the vSphere CPI. Device groups allow multiple physical GPUs connected via NVLink to be presented as a single logical unit to a VM, enabling workloads that need multiple GPUs working together.
Please note that I am not an expert in this area and I have relied on Cursor to help me to make the changes. I have tested the changes out in a development environment and can confirm that the new functionality works as expected.
I'd like to ask that someone who is more familiar with the vSphere CPI take a look at these changes and to advise on whether or not they seem reasonable. I am happy to make any changes that would be needed to get this merged, I'd just need some pointers and feedback to set me in the right direction.
Further Context
note: Content generated by cursor and reviewed for correctness by @teddyking.
The vSphere CPI already supports:
vgpuscloud property)pci_passthroughscloud property)Device groups extend this by grouping multiple physical devices so they appear as a coordinated unit to the VM. This requires vSphere 8.0+.
Key Changes
VmTypeandVmConfigpci_passthrough.rb)create_device_group_vgpusmethod:AssignableHardwareManagerto retrieve device group informationnvidiaVgpudevices)deviceGroupInfometadata that links them together via:group_instance_key: Identifies which device group instancesequence_id: Identifies the device's position within the groupvm_creator.rb)VirtualDeviceGroupsdeclaration that defines the device group at the VM leveldeviceGroupInfolinking them togetherstub_adapter.rb): Fixed incorrect SoapError instantiation (was callingSoapError(fault)instead ofSoapError.new('Method not found', fault))soap_stub.rb): Changed vSphere 8.0 version constant fromvim.version.v8_0_0_0tovim.version.v8_0_0_1to match the actual SDK version$vc_version = '8.0'in./src/vsphere_cpi/spec/spec_helper.rbRelated PR and Issues
N/A.
Impacted Areas in Application
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
I added some unit tests and ran
bin/test-unitand confirmed that they all tests passed. I also tested the changes out for real in a development environment. To do this, I created a git patch of the changes and applied them to a running BOSH director. I then called the CPI manually, as follows:I'm not sure how to handle automated integration testing going forward as this functionality depends on fairly expensive and complex hardware ... I am happy to arrange a call with reviewers to walk through the testing I did on the dev env I have access to if that would be useful.
Checklist: