Running tasks in parallel on a single machine causes tasks to be killed #348

@luyouqi233

I am running two programs on a single machine, each using a different pair of GPUs. The parameters in the two sh files are set as follows, respectively:

export MULTI_TENANT=1
export MASTER_PORT=6379
export DASHBOARD_PORT=8265

export MULTI_TENANT=1
export MASTER_PORT=6380
export DASHBOARD_PORT=8266

After task 1 finished, task 2 failed with the following error:

Traceback (most recent call last):
  File "/fs/fast/ROLL/examples/start_rlvr_vl_custom_pipeline.py", line 34, in <module>
    main()
  File "/fs/fast/ROLL/examples/start_rlvr_vl_custom_pipeline.py", line 30, in main
    pipeline.run()
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/fs/fast/ROLL/roll/pipeline/rlvr/rlvr_custom_vlm_pipeline.py", line 524, in run
    self.do_checkpoint(global_step=global_step)
  File "/fs/fast/ROLL/roll/pipeline/base_pipeline.py", line 84, in do_checkpoint
    ckpt_metrics = DataProto.materialize_concat(data_refs=ckpt_metrics_refs)
  File "/fs/fast/ROLL/roll/distributed/scheduler/protocol.py", line 854, in materialize_concat
    data: List["DataProto"] = ray.get(data_refs, timeout=timeout)
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/ray/_private/worker.py", line 932, in get_objects
    raise value
ray.exceptions.ActorUnavailableError: The actor 1483e6d031abd82a654eb09902000000 is unavailable: The actor is temporarily unavailable: RpcError: RPC Error message: Socket closed; RPC Error details:  rpc_code: 14. The task may or maynot have been executed on the actor.
[2026-02-08 19:39:17,054 E 16388 17032] gcs_rpc_client.h:196: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure. The program will terminate.

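I had previously run three tasks as well, and after task 1 finished, tasks 2 and 3 both failed with the same error. I therefore suspect that running tasks in parallel on a single machine causes the other tasks to be killed. Is there any way to solve this?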
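
A minimal diagnostic sketch for narrowing this down. It assumes (not confirmed from the ROLL source) that each task's Ray GCS listens on the TCP port given by its MASTER_PORT, so that ports 6379 and 6380 from the settings above can be probed independently. The helper names `port_is_open` and `check_ports` are hypothetical and exist only for this illustration. Running it just before and just after task 1 exits would show whether the listener on task 2's port disappears at that moment, which would be consistent with the "GCS may have been killed" message in the log.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_ports(ports, host: str = "127.0.0.1") -> dict:
    """Probe each port once and map port -> whether something is listening."""
    return {port: port_is_open(host, port) for port in ports}


if __name__ == "__main__":
    # 6379 and 6380 are the MASTER_PORT values from the two sh files above.
    # Treating them as the GCS ports is an assumption.
    for port, alive in check_ports([6379, 6380]).items():
        state = "listening" if alive else "not reachable"
        print(f"port {port}: {state}")
```

This only reports whether something is accepting connections on each port; it does not identify which process owns the port or why it went away.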
