Distributed inference is easy with the help of the RPC backend. Start one or more RPC servers, and they can then be used just like backend devices. Each RPC server corresponds to a single backend device. For example, if there are two GPUs in a box, then two RPC servers need to be started, one for each GPU.
Use --serve_rpc SPEC to start an RPC server. Examples of SPEC:
- 8080: start a server on 0.0.0.0:8080 with device #0.
- 127.0.0.1:9000: start a server on 127.0.0.1:9000 with device #0.
- 8080@1: start a server on 0.0.0.0:8080 with device #1.
- 127.0.0.1:9000@1: start a server on 127.0.0.1:9000 with device #1.
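
The four examples above appear to follow the pattern [HOST:]PORT[@DEVICE]. The sketch below spells out that reading in plain POSIX shell. `parse_spec` is a hypothetical helper written only for illustration; it is not part of the tool and says nothing about how the tool parses SPEC internally.

```shell
#!/bin/sh
# Hypothetical helper: split a --serve_rpc SPEC into address, port and device,
# based only on the four documented examples.
parse_spec() {
    spec=$1
    device=0                      # no "@N" suffix: device #0
    case $spec in
        *@*) device=${spec##*@}; spec=${spec%@*} ;;
    esac
    host=0.0.0.0                  # no "HOST:" prefix: listen on 0.0.0.0
    port=$spec
    case $spec in
        *:*) host=${spec%:*}; port=${spec##*:} ;;
    esac
    echo "$host:$port device #$device"
}

parse_spec 8080
parse_spec 127.0.0.1:9000@1
```

The two calls at the end print the address and device that the corresponding list entries describe.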
Don't forget to use --show_devices to check device IDs, and --log_level 2 to view more logs. Example:
main --serve_rpc 80 --log_level 2
I trying to start RPC server at 0.0.0.0:80, using
I Vulkan - Vulkan0 (NVIDIA GeForce ...)
I type: GPU
I memory total: .. B
I memory free : .. B

After RPC servers are started, they can be used. Use --rpc_endpoints EPS to register RPC servers (each is called an endpoint).
Each endpoint is specified by HOST:PORT, and when HOST is omitted, 127.0.0.1 is assumed. Multiple endpoints are joined by ;.
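
When several endpoints are passed from a POSIX shell, the joined value needs quoting, because an unquoted ; ends the command. The sketch below builds such a value. `expand_endpoint` and `join_endpoints` are hypothetical helpers, and 192.168.1.20:8080 is a made-up address used only for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for building the EPS argument; not part of the tool.

# Expand one endpoint to HOST:PORT, defaulting HOST to 127.0.0.1.
expand_endpoint() {
    case $1 in
        *:*) echo "$1" ;;
        *)   echo "127.0.0.1:$1" ;;
    esac
}

# Join any number of endpoints with ";".
join_endpoints() {
    eps=""
    for ep in "$@"; do
        eps="${eps:+$eps;}$(expand_endpoint "$ep")"
    done
    echo "$eps"
}

EPS=$(join_endpoints 80 192.168.1.20:8080)
# Print the option with the value quoted, ready to paste after the program name.
echo "--rpc_endpoints \"$EPS\""
```

Expanding the bare port is optional, since the tool already assumes 127.0.0.1 when HOST is omitted; it is done here only to make the default visible.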
Use --show_devices to check that everything is okay:
main --rpc_endpoints 80 --show_devices
0: .....
1: RPC - RPC[127.0.0.1:80] (NVIDIA GeForce ...)
type: GPU
memory total: .. B
memory free : .. B
1: CPU - CPU
...

Now, let's run distributed inference:
main --rpc_endpoints 80 -ngl 1:all -m ...