Usage Modes

RPC-only: Several VMs run on a physical machine with no GPU. CUDA processes send RPC commands to a cluster of GPUs. The GPU cluster has a single public endpoint (the cluster-manager IP), which receives messages from the VMs. [WORKING]

Shared-mem: Several VMs run on a physical machine with a GPU. CUDA processes running on a VM send commands via an RPC channel to an endpoint on the MPS-enabled host machine. Data-intensive CUDA commands (cudaMemcpy) transfer data via a shared-memory channel. [IN-PROGRESS]

Build-And-Test Prerequisites:

Flyt root directory: flyt/. All filepaths are relative to the root directory.

Three kinds of nodes: client node (a VM running a CUDA app), cluster node (a machine with an MPS-enabled GPU), and cluster-manager node (the public endpoint, which communicates with every cluster node).
The cluster-manager node may be one of the cluster nodes or a regular machine.
Each cluster node must support NVIDIA MPS, with the environment variable CUDA_MPS_ENABLE_PER_CTX_DEVICE_MULTIPROCESSOR_PARTITIONING=1 set.

Two possible setups:

A. Independently build Flyt from source code on each client node, each cluster node and the cluster-manager node.

B. Build Flyt from source via make on the cluster-manager node and share the bin/ and configs/ directories with the other nodes. Modify the LD_LIBRARY_PATH variable on client and cluster nodes so that the shared bin/ directory is searched for the custom dynamically loaded libraries.

Build-and-Test Setup A:

Independently build Flyt from source code on each client node, each cluster node and the cluster-manager node.

(I) Do Cluster-Manager Node

Single network endpoint for the GPU cluster node(s). On local machine

Build Flyt control managers: make install-cmgrs

bin/flytctl and bin/flyt-cluster-manager eventually run on the cluster-manager node.

Initialise MongoDB database with network endpoint details of the client VM.

Install MongoDB:
npm link mongodb
sudo apt install mongodb
sudo systemctl start mongod

Create the database:
mongosh
use flyt
db.createUser({ user: "adminUser", pwd: "flyt", roles: [{ role: "readWrite", db: "flyt" }] })
exit

sudo rm -rf /tmp/mongodb-27017.sock (if ECONNREFUSED)
node setup/node_mongo_insert.js

Ensure /configs/cluster-mgr-config.toml is set to the appropriate values.

[vm-resource-db]: MongoDB login credentials and network endpoints, used by the cluster manager to connect to the database and access VM info.
[ports]: One listen port for client nodes and one for cluster nodes.
[virt-server-auto-deallocate]: Remove a cluster node from the available pool after a timeout.
[ipc]: Message queue for MongoDB <-> cluster manager comms and a socket for flytctl -> cluster manager comms.
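
Before starting the manager, it can help to confirm that the config file actually contains the sections listed above. The following is a minimal sketch, not part of Flyt: check_toml_sections is a hypothetical helper, and it assumes each section header sits alone on its own line with no trailing comment or whitespace.

```shell
# Hypothetical helper (not shipped with Flyt): report every [section] header
# that is missing from a TOML file. Returns 0 if all are present, 1 otherwise.
check_toml_sections() {
    file=$1; shift
    missing=0
    for section in "$@"; do
        # -F fixed string, -x whole-line match, -q no output
        if ! grep -Fxq "[$section]" "$file"; then
            echo "missing section [$section] in $file" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call, using the section names listed in this guide:
#   check_toml_sections configs/cluster-mgr-config.toml \
#       vm-resource-db ports virt-server-auto-deallocate ipc
```

The same helper could be pointed at the cluster-node and client-node config files with their own section names. It only checks that the headers exist; it does not validate the values inside them.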

Start the flyt-cluster-manager: ./bin/flyt-cluster-manager

Logs into the MongoDB database server with the configured credentials.
Obtains VM data from the MongoDB server via a message queue, /tmp/flyt-rmgr-queue.
Listens for client node(s) and cluster node(s) on the configured ports.

(II) Do Cluster Node

Runs the flyt backend framework. SHM Mode: ub-12-3 host, RPC Mode: ub-11 host

Build the multithreaded RPC server: make install-cpu-server

Once running, the server decodes incoming RPCs from client(s), calls the regular CUDA runtime API functions, and transmits the results back to the client.

Build the flyt control managers: make install-cmgrs

flyt-node-manager daemon eventually runs on each cluster (GPU) node.

Ensure configs/servnode-config.toml is set to appropriate values

[resource-manager]: address of cluster manager + its cluster node listen port.
[virt-server]: cricket-rpc-server path, thread mode.
[ipc]: message queue backend for flyt-node-manager <-> cricket-rpc-server comms

Start the MPS daemon: bash mps/start_mps.sh

Ensure environment variable CUDA_MPS_ENABLE_PER_CTX_DEVICE_MULTIPROCESSOR_PARTITIONING=1 is set.
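
A quick guard for this step can be scripted. The sketch below is illustrative only: check_mps_partitioning is a hypothetical helper, and it inspects the environment of the current shell. Which process ultimately needs to see the variable (the MPS daemon, the node manager, or both) is not stated here, so treat the placement as an assumption.

```shell
# Hypothetical helper (not shipped with Flyt): succeed only when the
# per-context SM partitioning variable is set to 1 in the current environment.
check_mps_partitioning() {
    if [ "${CUDA_MPS_ENABLE_PER_CTX_DEVICE_MULTIPROCESSOR_PARTITIONING:-}" = "1" ]; then
        echo "MPS per-context partitioning: enabled"
        return 0
    fi
    echo "MPS per-context partitioning: NOT enabled" >&2
    return 1
}

# Example use before launching the node manager:
#   export CUDA_MPS_ENABLE_PER_CTX_DEVICE_MULTIPROCESSOR_PARTITIONING=1
#   check_mps_partitioning && ./bin/flyt-node-manager
```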

Start the flyt-node-manager: ./bin/flyt-node-manager

Starts the rpc-server, listens for new clients.

(III) Do Client Node

Runs CUDA applications linked to virtualised cuda runtime library.

ub-12-3 VM.

Build the virtualised CUDA runtime library and replace the standard CUDA runtime API library on the VM: make install-client-lib.

This copies the virtualised library to the default CUDA library location on the VM and sets up symbolic links so that CUDA apps dynamically load the virtualised library when compiled with nvcc ... --cudart=shared.
The default CUDA library can be restored via make restore-client-lib.
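
One way to confirm that an application picks up the replaced library is to look at what its dynamic libcudart dependency resolves to. This is a rough sketch under assumptions: cudart_path_from_ldd is a hypothetical helper, the application name is a placeholder, and the exact install location and symlink target depend on the VM's CUDA setup.

```shell
# Hypothetical helper (not shipped with Flyt): read `ldd` output on stdin and
# print the path libcudart resolves to. Prints nothing if the binary does not
# link libcudart dynamically or the library cannot be found.
cudart_path_from_ldd() {
    awk '$1 ~ /^libcudart\.so/ && $2 == "=>" && $3 != "not" { print $3 }'
}

# Example use (my_cuda_app is a placeholder name):
#   readlink -f "$(ldd ./my_cuda_app | cudart_path_from_ldd)"
```

If the symbolic links are in place, the final readlink target would be expected to be the virtualised library rather than the stock CUDA runtime. An empty result usually means the app was linked against the static runtime.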

Build the flyt control managers: make install-cmgrs

flyt-client-manager (single instance per VM) runs on the client node.

Ensure configs/client-mgr.toml is set to appropriate values.

[resource-manager]: address of cluster-manager node and its client listen port.
[vcuda-client]:
[ipc]:

Start the flyt-client-manager: ./bin/flyt-client-manager
Build the benchmark test and run it: cd synthetic_benchmark; make; bash run_benchE.sh

This runs a CUDA app on the VM whose runtime API calls are routed to the remote GPU cluster via the virtualised CUDA runtime library cricket-client.so.

Build-And-Test Setup B

Build Flyt from source via make on a machine with all Flyt dependencies installed, and share the bin/ and configs/ directories with the other nodes. Modify the LD_LIBRARY_PATH variable on client and cluster nodes so that the shared bin/ directory is searched for the custom dynamically loaded libraries.
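
The LD_LIBRARY_PATH change can be made idempotent so that re-sourcing a profile does not keep growing the variable. A minimal sketch, with two assumptions flagged: prepend_ld_path is a hypothetical helper, and FLYT_SHARED (default /opt/flyt) is an invented placeholder for wherever the shared directories are mounted on each node.

```shell
# Hypothetical helper (not shipped with Flyt): prepend a directory to
# LD_LIBRARY_PATH unless it is already listed.
prepend_ld_path() {
    dir=$1
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;  # already present, leave the variable unchanged
        *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# FLYT_SHARED is an assumed mount point for the shared bin/ and configs/.
FLYT_SHARED=${FLYT_SHARED:-/opt/flyt}
prepend_ld_path "$FLYT_SHARED/bin"
```

Placing these lines in a shell profile on each client and cluster node would apply the setting to every new login shell.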
