Replies: 3 comments
This is a textbook distributed connection/healthcheck failure, exactly the sort of cross-process bug that keeps popping up in vLLM + Ray setups. (In our issue map it's classified under "distributed infra: stale connection pool / cluster node health desync".) A Ray cluster can pass the built-in healthcheck and still fail when vLLM tries to schedule or allocate resources, due to socket state, firewall rules, or subtle config drift between nodes. Quick things to check: socket state between nodes, firewall rules on the Ray ports, and config drift between the head and workers.

If you want the step-by-step diagnosis checklist or a full breakdown of these connection issues, let me know and I'll share the reference.
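To make the "healthcheck passes but scheduling fails" point concrete, here is a minimal sketch. The helper name and logic are hypothetical (not vLLM or Ray API); it just compares the resource dict a Ray cluster reports (e.g. from `ray.cluster_resources()`) against what vLLM will request for its placement group:

```python
# Hypothetical helper: a healthcheck only probes liveness, not capacity,
# so a "healthy" cluster can still reject vLLM's placement group.
def diagnose_resources(cluster_resources: dict, tensor_parallel_size: int) -> str:
    """Return 'ok' or a short verdict given Ray's resource dict and vLLM's TP size."""
    gpus = cluster_resources.get("GPU", 0)
    if gpus >= tensor_parallel_size:
        return "ok"
    return (f"healthcheck can pass, but only {gpus} GPUs are visible; "
            f"vLLM needs {tensor_parallel_size} for its placement group")

# Example: a cluster that passes healthchecks but lacks GPUs
print(diagnose_resources({"CPU": 16.0, "GPU": 2.0}, 4))
```

Feeding it the dict from `ray.cluster_resources()` on your head node would show whether the failure is capacity rather than connectivity.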
Ray cluster connectivity issues with vLLM are tricky; here's a debugging checklist:

1. Check the Ray cluster is actually running

   ```bash
   ray status
   # Should show your nodes
   ```

2. Verify the head node address

   ```bash
   export RAY_ADDRESS="ray://<head-node-ip>:10001"
   ```

   or

   ```python
   ray.init(address="ray://<head-node-ip>:10001")
   ```

3. Port accessibility

   ```bash
   # From a worker, test the connection to the head
   nc -zv <head-ip> 6379   # Redis/GCS
   nc -zv <head-ip> 10001  # Client port
   nc -zv <head-ip> 8265   # Dashboard
   ```

4. vLLM-specific Ray init

   ```python
   from vllm import LLM

   # Let vLLM handle Ray
   llm = LLM(
       model="...",
       tensor_parallel_size=4,
       # Don't init Ray yourself; vLLM does it
   )
   ```

5. Common gotchas

   Fix: explicit connection

   ```python
   import ray
   ray.init(address="auto", ignore_reinit_error=True)
   # Then start vLLM
   ```

We've deployed vLLM on Ray clusters at RevolutionAI. What's your cluster setup: same machine or distributed?
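The port checks in step 3 can also be scripted from inside a worker pod where `nc` may not be installed. A minimal sketch using only the standard library (`<head-ip>` is a placeholder to substitute):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or hostname did not resolve
        return False

# Same ports as the nc checks above: Redis/GCS, Ray client, dashboard
for port in (6379, 10001, 8265):
    print(port, "open" if port_open("<head-ip>", port) else "closed")
```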
This is a common issue with vLLM + external Ray clusters on Kubernetes.

Root cause: the vLLM pod never registers itself as a worker node in the external Ray cluster, so Ray cannot place vLLM's workers there.

Fixes:

1. Run a Ray worker in the vLLM pod

   ```yaml
   containers:
   - name: vllm
     command:
     - /bin/bash
     - -c
     - |
       # Note: ray start expects the GCS address (<head>:6379),
       # not the ray:// client URL
       ray start --address=$RAY_ADDRESS --block &
       sleep 10  # Wait for node registration
       python -m vllm.entrypoints.openai.api_server ...
   ```

2. Use Ray job submission instead

   ```bash
   # Submit vLLM as a Ray job
   ray job submit --address $RAY_ADDRESS -- python -m vllm ...
   ```

3. Shared /tmp/ray volume

   ```yaml
   volumes:
   - name: ray-tmp
     emptyDir: {}
   volumeMounts:
   - name: ray-tmp
     mountPath: /tmp/ray
   ```

4. Set the node IP explicitly

   ```yaml
   env:
   - name: RAY_ADDRESS
     value: "ray://ray-cluster-head:10001"
   - name: VLLM_HOST_IP
     valueFrom:
       fieldRef:
         fieldPath: status.podIP
   ```

5. Use a KubeRay RayCluster with a vLLM worker group

   ```yaml
   workerGroupSpecs:
   - groupName: vllm-workers
     template:
       spec:
         containers:
         - name: vllm
           image: vllm/vllm-openai:latest
   ```

We deploy vLLM on Kubernetes at Revolution AI; starting the Ray worker inside the vLLM pod is the key fix.
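The fixed `sleep 10` in fix 1 is a guess at registration time; an explicit wait is more robust. A sketch, assuming the node-dict shape that `ray.nodes()` returns (`NodeManagerAddress`, `Alive` keys; verify against your Ray version):

```python
import time

def node_registered(nodes: list, pod_ip: str) -> bool:
    """True if pod_ip appears among alive Ray nodes (shape of ray.nodes())."""
    return any(n.get("NodeManagerAddress") == pod_ip and n.get("Alive")
               for n in nodes)

def wait_for_registration(get_nodes, pod_ip: str,
                          timeout: float = 30.0, poll: float = 1.0) -> bool:
    """Poll get_nodes() (e.g. ray.nodes) until pod_ip registers or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if node_registered(get_nodes(), pod_ip):
            return True
        time.sleep(poll)
    return False
```

In the pod you would call `wait_for_registration(ray.nodes, os.environ["VLLM_HOST_IP"])` before launching the API server, instead of sleeping.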
I've been attempting to connect a vLLM engine (as part of KubeAI) to a Ray cluster (deployed by KubeRay) and have not had much success. For some reason it is unable to generate the file node_ip_address.json.

I can confirm that if I run `ray status` in the vLLM engine pod, I see exactly the same output as in the Ray cluster head pod, so vLLM is able to communicate with Ray. These are the logs from vLLM. Executing a health check from the vLLM engine pod returns an exit code of 0, which means the Ray cluster health is allegedly OK.

Has anyone seen the same behaviour before but successfully connected vLLM to an external Ray cluster?
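For reference, since node_ip_address.json is written under Ray's temp dir (/tmp/ray by default), one thing worth probing directly from the engine pod is whether that directory is actually writable. A minimal sketch (the helper name is made up):

```python
import os
import tempfile

def tmp_ray_writable(path: str = "/tmp/ray") -> bool:
    """Return True if we can create and write a file under `path`.

    A read-only or stale mount here would explain Ray failing to
    generate node_ip_address.json even though ray status works.
    """
    try:
        os.makedirs(path, exist_ok=True)
        with tempfile.NamedTemporaryFile(dir=path):
            return True
    except OSError:
        return False

print(tmp_ray_writable())
```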
Engine Config:
Versions:
Platform:
Stack Trace: