[pull] master from ray-project:master#3975

Merged
pull[bot] merged 6 commits into miqdigital:master from ray-project:master
Mar 16, 2026
Conversation

@pull pull bot commented Mar 16, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)


elliot-barn and others added 6 commits March 16, 2026 09:46
updating lock file for ci py3.10 deps

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…61708)

The compile_pip_requirements rule used the autodetecting Python toolchain, which resolved to the system python.

Fix by inlining the compile_pip_requirements logic as a py_binary + py_test pair with exec_compatible_with = ["//bazel:py310"], which forces Bazel to select the hermetic Python 3.10 toolchain already registered in WORKSPACE.

Topic: fix-requirements-update
Signed-off-by: andrew <andrew@anyscale.com>

Signed-off-by: andrew <andrew@anyscale.com>
## Summary
- Add client IP:port to Ray Serve HTTP access logs
- Thread client address from the proxy through the request context and
metadata to the replica
- Handle both proxy-routed and direct ingress HTTP paths

For services behind a load balancer, uvicorn's `ProxyHeadersMiddleware`
(enabled by default) resolves `X-Forwarded-For` into `scope["client"]`
automatically, so the logged IP reflects the original client when
`FORWARDED_ALLOW_IPS` is configured.

## How It Works

The client IP is available at the entry point (proxy or direct ingress
replica) but needs to reach the replica's access log, which runs in a
separate process. The data flows through existing infrastructure:

```
External Client (10.0.91.46:54321)
       |
   [ Proxy ]
       |  1. Reads scope["client"] via proxy_request.client
       |  2. format_client_address() formats the raw tuple into "host:port"
       |  3. Logs it in the proxy access log
       |  4. Passes it into _RequestContext._client
       |
   [ DeploymentHandle ]
       |  5. default_impl.py copies _RequestContext._client → RequestMetadata._client
       |
   [ Replica ]
       |  6. Reads request_metadata._client and logs it in the replica access log
```

For **direct ingress HTTP** (replica serves HTTP directly, no proxy),
the replica reads `scope["client"]` itself and formats it with the same
`format_client_address()`.
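Both paths rely on the same formatting helper. A minimal sketch of what `format_client_address()` might do, assuming it receives the ASGI `scope["client"]` tuple (the function name is from the PR; the body is an assumption):

```python
from typing import Optional, Tuple


def format_client_address(client: Optional[Tuple[str, int]]) -> str:
    """Format an ASGI client tuple like ("10.0.91.46", 54321) as "host:port".

    The ASGI server sets scope["client"] to a (host, port) tuple, or None
    when the transport has no peer address.
    """
    if client is None:
        return "unknown"
    host, port = client
    return f"{host}:{port}"
```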

---

## Update: Feature flag gating

Per review feedback, the client IP logging is now gated behind a feature
flag that is **off by default**:

```
RAY_SERVE_LOG_CLIENT_ADDRESS=1
```

The gate is centralized in `access_log_msg()` in `logging_utils.py` —
when the flag is off, the `client` parameter is ignored and the log
format is unchanged from before this PR. The client address data still
flows through the request context, but is simply not rendered in logs
unless the flag is enabled.
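The gating described above can be sketched as follows. The function name `access_log_msg` and the flag name come from the PR; the log format and signature here are invented for illustration:

```python
import os

# Feature flag named in the PR; off by default.
_FLAG = "RAY_SERVE_LOG_CLIENT_ADDRESS"


def access_log_msg(method: str, route: str, status: str,
                   latency_ms: float, client: str = None) -> str:
    """Build an access-log line; render `client` only when the flag is on."""
    if client and os.environ.get(_FLAG, "0") == "1":
        return f"{client} {method} {route} {status} {latency_ms:.1f}ms"
    # Flag off: the format is unchanged from before the change,
    # even though the client address still flowed through the context.
    return f"{method} {route} {status} {latency_ms:.1f}ms"
```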

**Tests:** Added a parametrized integration test
(`test_http_access_log_client_address`) that verifies both flag states —
client IP present when on, absent when off.

---------

Signed-off-by: harshit <harshit@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Bazel 7 removed the exec_tools attribute from genrule; patch protobuf's
BUILD files to use tools instead.

Signed-off-by: andrew <andrew@anyscale.com>
…gs for Bazel 7 (#61695)

gRPC's grpc_deps() pulls in rules_apple 1.1.3 which uses
apple_common.multi_arch_split, removed in Bazel 7. Override to 3.2.1
(compatible with Bazel 6/7/8) before grpc_deps() runs so the maybe()
call is a no-op.

rules_apple 3.2.1 requires apple_support >= 1.11.1. Patch
is_xcode_at_least_version to return False instead of fail() on
CLT-only CI machines where xcode_config.xcode_version() is None.

Set BAZEL_NO_APPLE_CPP_TOOLCHAIN=1 to skip apple_cc_toolchain on
CLT-only machines where it would fail with "Xcode version must be
specified". Override -mmacosx-version-min to 12.0 to satisfy
std::filesystem and std::variant requirements (generic toolchain
defaults to 10.11).

Signed-off-by: andrew <andrew@anyscale.com>
…Killer (#60330)

## Description
In recent investigations on memory usage issues, we found that:
* After a worker becomes IDLE when done with a task, it can still occupy a comparatively large amount of memory (~1GB)
* The current Ray OOM killer only kills worker processes that have a task scheduled on them

Ideally we should investigate the root cause of why IDLE workers still take up so much memory, then fix the memory usage issue and/or update the OOM killer's logic based on the findings.

While that investigation is ongoing, a short-term mitigation is for the OOM killer to prioritize killing IDLE workers that occupy large amounts of memory.

This PR implements the short term mitigation by:
1. Add a ray config `idle_worker_killing_memory_threshold_bytes` as the threshold above which the OOM killer should consider killing an IDLE worker. The default is 1GB. A threshold is needed to avoid killing freshly created IDLE workers from worker pre-start, since killing those workers won't help much with memory usage.
2. Update the OOM killer logic to check for and pick an IDLE worker to kill, if possible, before applying the current memory-killing logic.
3. Update the `ray_memory_manager_worker_eviction_total` metric to include a `MemoryManager.IdleWorkerEviction.Total` type to track the number of idle worker terminations
4. Add the corresponding test cases
5. Some code cleanup along the way
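The prioritization in steps 1–2 can be sketched in Python (the actual implementation lives in the raylet's C++ code; the config name is from the PR, everything else is illustrative):

```python
# Default threshold from the PR: IDLE workers below this size are not
# worth killing (e.g. freshly pre-started workers).
IDLE_WORKER_KILLING_MEMORY_THRESHOLD_BYTES = 1 * 1024**3  # 1GB


def pick_worker_to_kill(workers):
    """workers: list of dicts with keys 'idle' (bool) and 'memory_bytes' (int).

    Prefer the largest IDLE worker above the threshold; otherwise fall
    back to the pre-existing policy, represented here simply as picking
    the largest worker.
    """
    idle_big = [
        w for w in workers
        if w["idle"] and w["memory_bytes"] >= IDLE_WORKER_KILLING_MEMORY_THRESHOLD_BYTES
    ]
    if idle_big:
        return max(idle_big, key=lambda w: w["memory_bytes"])
    if workers:
        return max(workers, key=lambda w: w["memory_bytes"])
    return None
```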

## Related issues
N/A

## Additional information
Log line changes: when killing an idle worker, we output the following
log line:

```
[2026-02-27 23:33:10,289 I 3779325 3779325] (raylet) node_manager.cc:3078: Killing 1 worker(s), kill details: Memory on the node (IP: 172.31.14.189, ID: 60e4ea8d3b8fc0f4e99ed19c87bf5f9282797707af6e5babca343c7d) was 31.01GB / 62.01GB (0.500018), which exceeds the memory usage threshold of 0.500000; Object store memory usage: [- objects spillable: 0; - bytes spillable: 0; - objects unsealed: 0; - bytes unsealed: 0; - objects in use: 0; - bytes in use: 0; - objects evictable: 0; - bytes evictable: 0; ; - objects created by worker: 0; - bytes created by worker: 0; - objects restored: 0; - bytes restored: 0; - objects received: 0; - bytes received: 0; - objects errored: 0; - bytes errored: 0; ; Eviction Stats:; (global lru) capacity: 104857600; (global lru) used: 0%; (global lru) num objects: 0; (global lru) num evictions: 0; (global lru) bytes evicted: 0]; Ray killed 1 worker(s) based on the killing policy:[Worker with no lease granted: job ID=01000000, pid=3779512, required resources={CPU: 1}, actual memory used=1.18GB, worker ID=23139757f947661f6be1db7c25ee7b7ce449c21e927bdc2134d9b08e)]; To see more information about memory usage on this node, use `ray logs raylet.out -ip 172.31.14.189`; Top 10 memory users: PID     MEM(GB) COMMAND, 3779511        18.92   ray::allocate_memory, 3779512   1.18    ray::IDLE, 3779514    1.18    ray::IDLE, 3779513      1.17    ray::IDLE, 3779519      1.17    ray::IDLE, 3752505   0.95     bazel(core-1792) --add-opens=java.base/java.lang=ALL-UNNAMED -Xverify:none -Djava.util.logging.confi..., 3753337      0.86    /home/ubuntu/.cursor-server/cli/servers/Stable-7b98dcb824ea96c9c62362a5e80dbf0d1aae4770/server/node ..., 3754326      0.82    /home/ubuntu/.cursor-server/cli/servers/Stable-7b98dcb824ea96c9c62362a5e80dbf0d1aae4770/server/node ..., 3760346      0.76    /home/ubuntu/.cursor-server/extensions/ms-vscode.cpptools-1.23.6-linux-x64/bin/cpptools-srv 3753753 ..., 3753753      0.65    
/home/ubuntu/.cursor-server/extensions/ms-vscode.cpptools-1.23.6-linux-x64/bin/cpptools, suggestions: Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
```
Followup action items:
* Investigate and fix why workers still take up large amounts of memory
after becoming IDLE
* Based on that investigation, improve the memory killer with better
heuristics for choosing which worker processes to kill

---------

Signed-off-by: myan <myan@anyscale.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com>
Co-authored-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
@pull pull bot locked and limited conversation to collaborators Mar 16, 2026
@pull pull bot added the ⤵️ pull label Mar 16, 2026
@pull pull bot merged commit 4902739 into miqdigital:master Mar 16, 2026
