Add support for vSphere device groups #440

Merged
aramprice merged 7 commits into cloudfoundry:master from teddyking:device-groups-attempt-2
Jan 29, 2026

Conversation

@teddyking
Contributor

@teddyking teddyking commented Jan 14, 2026

Summary

Adds support for vSphere device groups in the vSphere CPI. Device groups allow multiple physical GPUs connected via NVLink to be presented as a single logical unit to a VM, enabling workloads that need multiple GPUs working together.

Please note that I am not an expert in this area and I have relied on Cursor to help me to make the changes. I have tested the changes out in a development environment and can confirm that the new functionality works as expected.

I'd like to ask that someone who is more familiar with the vSphere CPI take a look at these changes and advise on whether or not they seem reasonable. I am happy to make any changes needed to get this merged; I'd just need some pointers and feedback to set me in the right direction.

Further Context

Note: Content generated by Cursor and reviewed for correctness by @teddyking.

The vSphere CPI already supports:

  • vGPUs: Individual virtual GPUs (via vgpus cloud property)
  • PCI Passthrough: Direct access to physical PCI devices (via pci_passthroughs cloud property)

Device groups extend this by grouping multiple physical devices so they appear as a coordinated unit to the VM. This requires vSphere 8.0+.
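
As a rough illustration of where the new property would sit in a cloud config, here is a minimal Ruby sketch that builds a vm_types entry and prints it as YAML. The vm_type name gpu-nvlink is hypothetical, and the cpu, ram and disk values are simply borrowed from the manual test request in this PR; only the device_groups key and its example value come from the change itself.

```ruby
require 'yaml'

# Hypothetical vm_type entry: the name 'gpu-nvlink' is illustrative only.
# cpu/ram/disk values are borrowed from the manual create_vm test in this PR.
vm_type = {
  'name' => 'gpu-nvlink',
  'cloud_properties' => {
    'cpu' => 16,
    'ram' => 131_072,
    'disk' => 65_536,
    # New property from this PR: a list of vSphere device group names.
    'device_groups' => ['Nvidia:2@nvidia_h100l-94c%NVLink']
  }
}

cloud_config_fragment = { 'vm_types' => [vm_type] }

# Print the fragment as it might appear in a cloud config file.
puts cloud_config_fragment.to_yaml
```

The authoritative description of the property belongs in the vSphere CPI docs; this only shows the expected nesting under cloud_properties.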

Key Changes

  1. New Cloud Property Support
    • Added device_groups property to VmType and VmConfig
    • Users can specify device group names in their cloud config (e.g., "Nvidia:2@nvidia_h100l-94c%NVLink")
    • Device groups trigger hardware version upgrades (similar to vGPUs and PCI passthrough)
  2. Device Group Query and Creation Logic (pci_passthrough.rb)
    • New create_device_group_vgpus method:
      • Queries the vSphere host's AssignableHardwareManager to retrieve device group information
      • Finds the matching device group by name
      • Extracts component devices (specifically nvidiaVgpu devices)
      • Creates vGPU devices with deviceGroupInfo metadata that links them together via:
        • group_instance_key: Identifies which device group instance
        • sequence_id: Identifies the device's position within the group
    • Includes error handling for vSphere 8.0+ requirements and missing device groups
  3. VM Creation Integration (vm_creator.rb)
    • Device group attachment happens after hardware version upgrade (device groups require upgraded hardware)
    • Creates both:
      • VM-level configuration: VirtualDeviceGroups declaration that defines the device group at the VM level
      • Device-level configuration: Individual vGPU devices with deviceGroupInfo linking them together
    • Host discovery logic: If the device group isn't found on the VM's initial host, searches all healthy hosts in the cluster (useful when the VM hasn't been placed yet or the device group exists on a different host)
  4. Bug Fixes
    • SDK bug fix (stub_adapter.rb): Fixed incorrect SoapError instantiation (was calling SoapError(fault) instead of SoapError.new('Method not found', fault))
    • Version mismatch fix (soap_stub.rb): Changed vSphere 8.0 version constant from vim.version.v8_0_0_0 to vim.version.v8_0_0_1 to match the actual SDK version
  5. Unit Tests
    • New unit tests written to cover new functionality
      • NB Required me to bump $vc_version = '8.0' in ./src/vsphere_cpi/spec/spec_helper.rb
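
The lookup-and-link steps listed for create_device_group_vgpus can be sketched in plain Ruby as follows. This is not the code from the PR: the Structs stand in for SDK objects and are hypothetical, as are the method name build_grouped_vgpus, the error class, the profile strings, and the choice to number sequence_id from 0. It only illustrates finding a group by name, keeping its nvidiaVgpu components, and tagging each resulting device with a shared group_instance_key and its own sequence_id.

```ruby
# Plain Ruby stand-ins for SDK objects. None of these are real vSphere SDK types.
DeviceGroup     = Struct.new(:name, :component_devices)
ComponentDevice = Struct.new(:type, :profile)
GroupedVgpu     = Struct.new(:profile, :group_instance_key, :sequence_id)

# Hypothetical error raised when the requested group is not offered by the host.
class DeviceGroupNotFound < StandardError; end

# Finds the named group and returns one vGPU per nvidiaVgpu component.
# All returned devices share group_instance_key; sequence_id records each
# device's position within the group (starting at 0, an assumption).
def build_grouped_vgpus(available_groups, group_name, group_instance_key)
  group = available_groups.find { |g| g.name == group_name }
  raise DeviceGroupNotFound, "Device group '#{group_name}' not found" if group.nil?

  group.component_devices
       .select { |device| device.type == 'nvidiaVgpu' }
       .each_with_index
       .map { |device, index| GroupedVgpu.new(device.profile, group_instance_key, index) }
end

# Usage with made-up host data.
groups = [
  DeviceGroup.new('Nvidia:2@nvidia_h100l-94c%NVLink', [
    ComponentDevice.new('nvidiaVgpu', 'nvidia_h100l-94c'),
    ComponentDevice.new('nvidiaVgpu', 'nvidia_h100l-94c')
  ])
]

vgpus = build_grouped_vgpus(groups, 'Nvidia:2@nvidia_h100l-94c%NVLink', 1)
vgpus.each { |vgpu| puts vgpu.to_a.inspect }
```

In the real CPI the group list would come from the host's AssignableHardwareManager and the devices would be SDK device specs; the sketch keeps only the matching and linking logic.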

Related PR and Issues

N/A.

Impacted Areas in Application

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

I added some unit tests, ran bin/test-unit and confirmed that all tests passed. I also tested the changes out for real in a development environment. To do this, I created a git patch of the changes and applied them to a running BOSH director. I then called the CPI manually, as follows:

echo '{
  "method": "create_vm",
  "arguments": [
    "agent-test-manual-device-groups",
    "sc-7321eead-7bbb-4a64-854f-e496d23b9842",
    {
      "cpu": 16,
      "cpu_reserve_full_mhz": false,
      "disk": 65536,
      "memory_reservation_locked_to_max": false,
      "ram": 131072,
      "root_disk_size_gb": 25,
      "vmx_options": {
        "pciPassthru.64bitMMIOSizeGB": 512,
        "pciPassthru.use64bitMMIO": "TRUE"
      },
      "device_groups": [
        "Nvidia:2@nvidia_h100l-94c%NVLink"
      ]
    },
    {
      "default": {
        "ip": "<ip>",
        "netmask": "<netmask>",
        "gateway": "<gateway>",
        "dns": ["<dns1>", "<dns2>"],
        "default": ["dns", "gateway"],
        "cloud_properties": {
          "name": "<network-name>"
        }
      }
    },
    [],
    {}
  ],
  "context": {
    "director_uuid": "<director-uuid>"
  }
}' | /var/vcap/jobs/vsphere_cpi/bin/cpi 2>&1
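
For repeated manual runs, the same request body could be generated rather than hand-edited. The following is a minimal Ruby sketch, assuming a hypothetical helper named create_vm_request; the keyword argument names are my own labels for the positional arguments in the request above, and the placeholder values are left exactly as placeholders.

```ruby
require 'json'

# Hypothetical helper that assembles a create_vm request in the same
# positional order as the manual request above. The last two arguments
# are passed as an empty list and an empty hash, as in that request.
def create_vm_request(agent_id:, stemcell_cid:, cloud_properties:, networks:, director_uuid:)
  {
    'method' => 'create_vm',
    'arguments' => [agent_id, stemcell_cid, cloud_properties, networks, [], {}],
    'context' => { 'director_uuid' => director_uuid }
  }
end

request = create_vm_request(
  agent_id: 'agent-test-manual-device-groups',
  stemcell_cid: 'sc-7321eead-7bbb-4a64-854f-e496d23b9842',
  cloud_properties: {
    'cpu' => 16,
    'ram' => 131_072,
    'disk' => 65_536,
    'device_groups' => ['Nvidia:2@nvidia_h100l-94c%NVLink']
  },
  networks: {
    'default' => {
      'ip' => '<ip>',
      'netmask' => '<netmask>',
      'gateway' => '<gateway>',
      'cloud_properties' => { 'name' => '<network-name>' }
    }
  },
  director_uuid: '<director-uuid>'
)

# The printed JSON can be piped to the CPI binary in the same way as the echo above.
puts JSON.pretty_generate(request)
```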

I'm not sure how to handle automated integration testing going forward as this functionality depends on fairly expensive and complex hardware ... I am happy to arrange a call with reviewers to walk through the testing I did on the dev env I have access to if that would be useful.

Checklist:

  • My code follows the standard ruby style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@julian-hj
Member

as a bit of early feedback, I'd expect to see new unit tests for this new feature. AFAICT, the changeset updates the existing tests so they don't fail, but doesn't add any new ones.

Also, please make a corresponding PR to https://bosh.io/docs/vsphere-cpi/#resource-pools to document the new cloud config.

Comment thread src/vsphere_cpi/lib/cloud/vsphere/vm_creator.rb
@rkoster rkoster moved this from Inbox to Pending Review | Discussion in Foundational Infrastructure Working Group Jan 22, 2026
@rkoster rkoster moved this from Pending Review | Discussion to Waiting for Changes | Open for Contribution in Foundational Infrastructure Working Group Jan 22, 2026
Comment thread src/vsphere_cpi/lib/ruby_vim_sdk/soap/stub_adapter.rb Outdated
Comment thread src/vsphere_cpi/lib/cloud/vsphere/resources/pci_passthrough.rb
Comment thread src/vsphere_cpi/lib/cloud/vsphere/resources/pci_passthrough.rb Outdated
Comment thread src/vsphere_cpi/lib/cloud/vsphere/resources/pci_passthrough.rb
Comment thread src/vsphere_cpi/lib/cloud/vsphere/soap_stub.rb
Comment thread src/vsphere_cpi/lib/cloud/vsphere/vm_config.rb
Comment thread src/vsphere_cpi/lib/cloud/vsphere/vm_creator.rb Outdated
Comment thread src/vsphere_cpi/lib/cloud/vsphere/vm_creator.rb Outdated
@julian-hj
Member

For clarification, please submit a companion PR to edit https://github.com/cloudfoundry/docs-bosh/blob/master/content/vsphere-cpi.md with documentation for your new property.

Device groups require vSphere 8.0+ and allow multiple physical devices
(e.g., NVLink-connected NVIDIA GPUs) to be presented as a single logical
unit to a VM.

- Add device_groups cloud property support in VmType and VmConfig
- Implement create_device_group_vgpus method to query and create vGPU devices
  from device group information
- Add device group attachment logic in VmCreator after hardware upgrade
- Fix SDK bug: properly instantiate SoapError in stub_adapter.rb
- Fix version mismatch: use vim.version.v8_0_0_1 instead of v8_0_0_0 for v8.0
- Add fallback logic to search all hosts in cluster if device group not found
  on VM's initial host
* Add unit tests for create_device_group_vgpus method in pci_passthrough_spec.rb
* Add unit tests for device groups in vm_creator_spec.rb
* Update default SDK version to 8.0 in spec_helper.rb to support device groups testing
* Refactor host selection to use prioritized list pattern
* Remove nested exception handling in favor of unified search
* Simplify logging to consistently show which host was tried
* Update tests to match new log message format
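
The prioritized host list mentioned in these commit notes might look roughly like the sketch below. It is an illustration under assumptions, not the merged code: the Host Struct, the method name find_host_with_device_group, the log message wording, and the decision to try the initial host regardless of its health flag are all hypothetical.

```ruby
# Hypothetical stand-in for a cluster host. Not a real SDK type.
Host = Struct.new(:name, :healthy, :device_groups)

# Builds one ordered candidate list (the VM's initial host first, then the
# healthy hosts in the cluster), logs every host tried, and returns the first
# host offering the group, or nil if none does.
def find_host_with_device_group(initial_host, cluster_hosts, group_name, log: [])
  candidates = ([initial_host] + cluster_hosts.select(&:healthy)).compact.uniq

  candidates.find do |host|
    log << "Looking for device group '#{group_name}' on host '#{host.name}'"
    host.device_groups.include?(group_name)
  end
end

# Usage with made-up hosts.
group_name = 'Nvidia:2@nvidia_h100l-94c%NVLink'
host_a = Host.new('host-a', true, [])
host_b = Host.new('host-b', false, [group_name])
host_c = Host.new('host-c', true, [group_name])

messages = []
found = find_host_with_device_group(host_a, [host_a, host_b, host_c], group_name, log: messages)
puts found ? found.name : 'no host found'
puts messages
```

A single ordered list keeps the search and the logging in one place, which is the simplification the commit notes describe in place of nested exception handling.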
@teddyking
Contributor Author

Hey @Alphasite and @julian-hj, thank you for the review and comments. I have done a first pass at addressing all the feedback and there are just a couple of discussions to resolve.

I am planning to test these new changes on my development env to make sure it is still working as expected. I will write back here to confirm once that's been done.

I have also opened a draft PR into the docs-bosh repo here: cloudfoundry/docs-bosh#894. I plan to undraft it once this PR has been merged and a new release cut, so that I can include the version in the docs.

@teddyking
Contributor Author

Just to confirm that I have tested the new commits on my dev env and it is all working as expected.

@teddyking teddyking requested a review from Alphasite January 27, 2026 15:25
@teddyking
Contributor Author

Thanks all - looks like all the feedback has now been resolved, are we ok to merge? (cc @Alphasite)

Contributor

@Alphasite Alphasite left a comment

I think it LGTM. I'd like a second approval as well since I'm just not very familiar with vGPUs or passthrough devices.

@github-project-automation github-project-automation Bot moved this from Waiting for Changes | Open for Contribution to Pending Merge | Prioritized in Foundational Infrastructure Working Group Jan 28, 2026
@teddyking
Contributor Author

Thanks @Alphasite! Who would be best to provide that second approval? Maybe @julian-hj or @ystros ? I would also be happy to approve it (if authors are allowed to do that) as I understand that my team will essentially be responsible for looking after this specific bit of the codebase.

@aramprice aramprice merged commit c968913 into cloudfoundry:master Jan 29, 2026
1 check passed
@github-project-automation github-project-automation Bot moved this from Pending Merge | Prioritized to Done in Foundational Infrastructure Working Group Jan 29, 2026
5 participants