
[pull] master from ray-project:master #3976

Merged

pull[bot] merged 12 commits into miqdigital:master from ray-project:master

Mar 17, 2026
Conversation


@pull pull bot commented Mar 17, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

jeffreywang-anyscale and others added 12 commits March 16, 2026 12:40
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…verage for actor-failure corner cases (#61758)

In chained `DeploymentResponse` flows, a downstream replica can surface
an upstream actor death while remaining healthy. Previously, the router
treated these failures as local replica deaths and incorrectly removed
healthy downstream replicas from routing. This change prevents that
misattribution and preserves correct replica health behavior.

- Ran the repro provided in #61594; it passes.
- Added new unit and integration tests.
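The misattribution fix described above can be illustrated with a toy model (all names here are hypothetical, not Ray Serve's actual router internals): before removing a replica from routing, check whether the dead actor is actually the one backing that replica, or an upstream actor whose death was merely surfaced through a chained response.

```python
# Toy model of the misattribution fix. Hypothetical names; a sketch of
# the idea, not Ray Serve's real router API.

class ActorDiedError(Exception):
    def __init__(self, actor_id):
        self.actor_id = actor_id

class Router:
    def __init__(self, replicas):
        # replica name -> actor id backing that replica
        self.replicas = dict(replicas)

    def on_request_error(self, replica_name, error):
        """Remove the replica only if *its own* backing actor died."""
        if isinstance(error, ActorDiedError):
            if error.actor_id == self.replicas.get(replica_name):
                # Local replica death: stop routing to it.
                del self.replicas[replica_name]
            # Otherwise the death belongs to an upstream actor surfaced
            # through a chained response; keep the healthy replica.

router = Router({"replica_a": "actor-1", "replica_b": "actor-2"})

# Upstream actor "actor-9" died; replica_a merely surfaced the error.
router.on_request_error("replica_a", ActorDiedError("actor-9"))
assert "replica_a" in router.replicas  # healthy replica preserved

# replica_b's own actor died: it is removed from routing.
router.on_request_error("replica_b", ActorDiedError("actor-2"))
assert "replica_b" not in router.replicas
```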

---------

Signed-off-by: abrar <abrar@anyscale.com>
## Description
Stabilizing flaky tests:
<img width="1597" height="82" alt="Screenshot 2026-03-13 at 12 21 04 PM"
src="https://github.com/user-attachments/assets/2ed304c3-6aea-46f5-b17c-774da27ce008"
/>


## Approach
- The dashboard may be unavailable because a previous test's dashboard
process is still holding port 8265 when the next test starts a new Ray
cluster. The new cluster's dashboard fails to bind to that port, so
`list_actors()` (which requires the dashboard) fails. Using
`use_controller=True` avoids this by querying replica states through
`serve.status()`, which goes through the Serve controller via GCS.
- Remove file-based synchronizations and prefer signal actors.
- Relax timeouts.
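The stabilized tests rely on polling cluster state rather than one-shot reads. A simplified model of the `wait_for_condition` pattern they use (the real Ray test helper has a richer signature; the `replicas_running` checker below is a stand-in for querying `serve.status()` through the controller):

```python
import time

def wait_for_condition(condition, timeout=10.0, retry_interval=0.1):
    """Poll `condition` until it returns True or `timeout` elapses.

    Simplified model of the helper used throughout Ray's tests.
    """
    deadline = time.monotonic() + timeout
    last_exc = None
    while time.monotonic() < deadline:
        try:
            if condition():
                return
        except Exception as exc:  # condition may raise while state settles
            last_exc = exc
        time.sleep(retry_interval)
    raise TimeoutError(f"Condition not met within {timeout}s") from last_exc

# Usage sketch: poll replica state through the controller instead of the
# dashboard, so a stale dashboard process holding port 8265 cannot break
# the test.
calls = {"n": 0}

def replicas_running():
    calls["n"] += 1
    return calls["n"] >= 3  # stand-in for checking serve.status()

wait_for_condition(replicas_running, timeout=5.0)
assert calls["n"] == 3
```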

## Related issues
> Link related issues: "Fixes #1234", "Closes #1234", or "Related to
#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…de (#61386)

## Summary
- Add HAProxy load balancing section to the Serve performance tuning
guide
- Add interdeployment gRPC transport section

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com>
…otFound Error in Chained Left Joins (#61507)

## Description
Per #60013, chained left joins fail with `ColumnNotFoundError` when the
first join produces empty intermediate blocks. #60520 attempted the fix,
but a refined reproduction script shows it does not resolve the
underlying issue. This PR proposes a targeted fix and a deterministic
regression test.

### Root cause
Using the example in #60013, when the streaming executor feeds the
second join's input, the first block delivered can have zero rows. The
bug is then triggered through the following sequence:
1. `_do_add_input_inner` sees that this is the first block for input
sequence 0 (or 1), so it submits a `_shuffle_block` task with
`send_empty_blocks=True` and immediately sets
`self._has_schemas_broadcasted[input_index] = True`;
2. The remote `_shuffle_block` worker task **triggers an early return of
`(empty_metadata, {})`**. No `aggregator.submit()` calls are ever made,
and the schema never reaches any aggregator;
3. All subsequent blocks are submitted with `send_empty_blocks=False`.
Aggregators with no non-empty data are never contacted at all, leaving
their bucket queues empty;
4. At finalization, `drain_queue()` returns `[]` for those partitions,
so `_combine([])` builds an `ArrowBlockBuilder` with no `add_block()`
calls and produces an empty table with no columns;
5. When `JoiningAggregation.finalize()` calls `pa.Table.join()` on this
columnless table, it raises the observed `ColumnNotFoundError`.

### Why #60520 does not fix this issue

#60520 modifies `ArrowBlockBuilder.build()` to use a stored
`self._schema` when `len(tables) == 0`. However, `self._schema` is only
populated inside `add_block()` calls. When `partition_shards` is `[]` in
`_combine(...)`, `self._schema` remains `None`.

### This fix

In `_shuffle_block`, when `block.num_rows == 0` and
`send_empty_blocks=True`, explicitly broadcast schema-carrying empty
tables to every aggregator before returning. This mirrors the broadcast
logic for non-empty blocks, which ensures every aggregator holds at
least one schema-carrying block and thus finalizes correctly.
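The broadcast fix can be modeled in plain Python (dicts and a small class stand in for Ray Data's blocks and aggregator actors; all names are illustrative, not the real internals):

```python
# Toy model of the schema-broadcast fix. Illustrative names only.

class Aggregator:
    def __init__(self):
        self.blocks = []  # bucket queue

    def submit(self, block):
        self.blocks.append(block)

    def finalize(self, join_key):
        # Without any schema-carrying block, the combined "table" has no
        # columns and joining on `join_key` fails -- modeling the
        # reported ColumnNotFoundError. With the fix, at least one empty
        # schema-carrying block is always present.
        if not self.blocks:
            raise KeyError(join_key)
        columns = self.blocks[0]["schema"]
        assert join_key in columns
        return columns

def shuffle_block(block, aggregators, send_empty_blocks):
    if block["num_rows"] == 0:
        if send_empty_blocks:
            # The fix: broadcast a schema-carrying empty block to every
            # aggregator before the early return.
            for agg in aggregators:
                agg.submit({"schema": block["schema"], "num_rows": 0})
        return  # early return preserved for the common case
    for agg in aggregators:
        agg.submit(block)

aggs = [Aggregator(), Aggregator()]
# First block is empty: previously, nothing ever reached the aggregators.
shuffle_block({"schema": ["id", "value"], "num_rows": 0}, aggs,
              send_empty_blocks=True)
assert all(a.finalize("id") == ["id", "value"] for a in aggs)
```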

### Alternative fix

Deleting the entire early-return branch in `_shuffle_block` would also
eliminate the issue. However, since the bug only affects the edge case
where the first incoming block is empty, removing the early return
entirely would give up its performance benefit in the common case.

## Related issues
Fixes #60013 and follows up on #60520.

## Additional information
The original reproduction script in #60013 occasionally misses the error
due to the uncertain order of blocks fed to the second join. To force
the bug, add the following lines to the reproduction script:



```python
    ...
    shapes = [b.shape for b in blocks]
    print(f"Columns flattened via map_batches: {flatten_columns}")
    print("Block shapes after first join:", shapes)

    # ----- Add the following lines -----
    # Force the bug
    # The streaming executor delivers blocks in completion order, so non-empty
    # partitions finish faster and arrive first, letting schema broadcast succeed
    # silently.  Reconstructing the dataset with empty blocks at the front
    # guarantees that _shuffle_block() sees a zero-row block as the very first
    # block for the left input sequence, triggering the premature
    # _has_schemas_broadcasted flag and the resulting (0,0) empty-table bug.
    import pyarrow as pa
    empty = [b for b in blocks if b.num_rows == 0]
    nonempty = [b for b in blocks if b.num_rows > 0]
    assert empty, "No empty blocks found; cannot reproduce the bug with this dataset."
    print(f"Reordering: {len(empty)} empty blocks first, then {len(nonempty)} non-empty.")
    ds_joined = ray.data.from_arrow(empty + nonempty)
    print("Block shapes after reordering:", [b.shape for b in (ray.get(ref) for ref in ds_joined.get_internal_block_refs())])
    # ----------------------------------

    # Create mapping table
    # Use some of the location_ids for the mapping
    shared_location_ids = location_ids[: max(1, len(location_ids) // 3)]
    ...
```
The augmented script forces the order of blocks so that the first block
going into the second join is always empty.

The new test case in `test_join.py` places the empty block in a list fed
to `from_arrow`, preserving the block order and ensuring that the second
join will always see the empty block first. The bug fires reliably on
every run before the fix.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
`get_metric_dictionaries` internally calls `wait_for_condition` to block
until a metric appears. When tests wrap this call inside their own
`wait_for_condition` (the common pattern for checking metric values or
counts), the two waits nest: the inner one times out and raises,
preventing the outer loop from ever retrying. This caused intermittent
failures in `test_metrics`, `test_metrics_2`, `test_metrics_3`, and
`test_metrics_haproxy` depending on how quickly Prometheus scraped.

**What**

Added a `wait: bool = True` parameter to `get_metric_dictionaries`:

- `wait=True` (default): preserves existing behavior — blocks until the
metric appears.
- `wait=False`: performs a single fetch and returns immediately
(possibly empty), letting the caller's `wait_for_condition` drive the
retry loop.

All call sites inside test `check_*` / `metrics_available` lambdas that
are already wrapped in `wait_for_condition` are switched to
`wait=False`.
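A simplified model of the `wait` parameter pattern (the real helper scrapes Prometheus metrics; `fetch` here is a stand-in for that):

```python
import time

def get_metric_dictionaries(fetch, wait=True, timeout=5.0):
    """Return metric dicts from `fetch()`.

    wait=True blocks until at least one metric appears; wait=False does a
    single fetch so an outer wait_for_condition can own the retry loop,
    avoiding nested waits.
    """
    if not wait:
        return fetch()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        metrics = fetch()
        if metrics:
            return metrics
        time.sleep(0.05)
    raise TimeoutError("metric never appeared")

# With wait=False, the caller-side retry loop is the only wait:
samples = iter([[], [], [{"name": "serve_requests", "value": 3}]])
fetch = lambda: next(samples)
result = []
while not result:  # stands in for the outer wait_for_condition
    result = get_metric_dictionaries(fetch, wait=False)
assert result[0]["value"] == 3
```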

Signed-off-by: abrar <abrar@anyscale.com>
…les used by RLlib (#60877)

## Description
Avoids triggering v2 module loading when RLlib imports `BackendExecutor`
with `RAY_TRAIN_V2_ENABLED=1`.


## Additional information  

As a follow-up we can extend this safe import logic across the other
files as well.

Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
…61709)

Hand-written REST client using `requests`. This commit adds the core
layer: `GitHubException`, a bare `GitHubRepo` handle, and `GitHubClient`
with all HTTP methods (`_get`, `_get_paginated`, `_post`, `_patch`).
Tests use the `responses` library to intercept HTTP at the transport
layer.
Signed-off-by: andrew <andrew@anyscale.com>
…ist (#61774)

## Description
as titled, this #61059 caused
ref bundles to be the same using the same object_ref, so used a unique
one. Also, made the input_file to tuples over lists

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…only provider (#61732)

## Description

On a static Ray cluster, the autoscaler runs with a read-only cloud
provider: it keeps reconciling so it can still warn about infeasible
requests, but performs no actual scaling. However, it still emits other
logs that are misleading for a static cluster.

For example, we will see these annoying logs on a static Ray cluster:
<img width="876" height="366" alt="image"
src="https://github.com/user-attachments/assets/42dd8027-1a1b-4cd0-834d-ba928b157ad8"
/>

This PR makes the event logger only emit warnings for infeasible
requests, but suppresses autoscaler action logs if the cloud provider is
read-only.
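A minimal sketch of that gating (illustrative names, not the actual autoscaler event-logger API):

```python
# Sketch: suppress scaling-action logs for a read-only provider while
# keeping infeasible-request warnings. Names are illustrative.

class EventLogger:
    def __init__(self, provider_is_readonly):
        self.readonly = provider_is_readonly
        self.emitted = []

    def log_infeasible(self, msg):
        # Infeasible-request warnings are useful even on a static cluster.
        self.emitted.append(("WARN", msg))

    def log_scaling_action(self, msg):
        # Scaling-action logs are misleading when the provider cannot
        # actually scale, so suppress them for read-only providers.
        if not self.readonly:
            self.emitted.append(("INFO", msg))

logger = EventLogger(provider_is_readonly=True)
logger.log_infeasible("request requires 8 GPUs; none available")
logger.log_scaling_action("adding 2 worker nodes")  # suppressed
assert logger.emitted == [("WARN", "request requires 8 GPUs; none available")]
```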


Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
…gcs client (#61666)

## Description

This PR adds the `resize_raylet_resource_instances` function to the GCS
Cython client. It will be used by the autoscaler in a follow-up PR for
IPPR.

This PR also addresses review comments on the previous PR:
#61654 (comment) and
#61654 (comment).

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@pull pull bot locked and limited conversation to collaborators Mar 17, 2026
@pull pull bot added the ⤵️ pull label Mar 17, 2026
@pull pull bot merged commit 0670feb into miqdigital:master Mar 17, 2026

8 participants