
[pull] master from ray-project:master #3976

Merged

pull[bot] merged 12 commits into miqdigital:master from ray-project:master

Mar 17, 2026
Conversation


@pull pull bot commented Mar 17, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

jeffreywang-anyscale and others added 12 commits March 16, 2026 12:40
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…verage for actor-failure corner cases (#61758)

In chained `DeploymentResponse` flows, a downstream replica can surface
an upstream actor death while remaining healthy. Previously, the router
treated these failures as local replica deaths and incorrectly removed
healthy downstream replicas from routing. This change prevents that
misattribution and preserves correct replica health behavior.

- Ran the repro provided in #61594; it passes.
- Added new unit and integration tests.
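The misattribution fix described above can be illustrated with a toy model (all names here are hypothetical, not Ray Serve's actual router internals): before removing a replica from routing, check whether the dead actor is actually the one backing that replica, or an upstream actor whose death was merely surfaced through a chained response.

```python
# Toy model of the misattribution fix. Hypothetical names; a sketch of
# the idea, not Ray Serve's real router API.

class ActorDiedError(Exception):
    def __init__(self, actor_id):
        self.actor_id = actor_id

class Router:
    def __init__(self, replicas):
        # replica name -> actor id backing that replica
        self.replicas = dict(replicas)

    def on_request_error(self, replica_name, error):
        """Remove the replica only if *its own* backing actor died."""
        if isinstance(error, ActorDiedError):
            if error.actor_id == self.replicas.get(replica_name):
                # Local replica death: stop routing to it.
                del self.replicas[replica_name]
            # Otherwise the death belongs to an upstream actor surfaced
            # through a chained response; keep the healthy replica.

router = Router({"replica_a": "actor-1", "replica_b": "actor-2"})

# Upstream actor "actor-9" died; replica_a merely surfaced the error.
router.on_request_error("replica_a", ActorDiedError("actor-9"))
assert "replica_a" in router.replicas  # healthy replica preserved

# replica_b's own actor died: it is removed from routing.
router.on_request_error("replica_b", ActorDiedError("actor-2"))
assert "replica_b" not in router.replicas
```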

---------

Signed-off-by: abrar <abrar@anyscale.com>
## Description
Stabilizing flaky tests:
<img width="1597" height="82" alt="Screenshot 2026-03-13 at 12 21 04 PM"
src="https://github.com/user-attachments/assets/2ed304c3-6aea-46f5-b17c-774da27ce008"
/>


## Approach
- The dashboard may be unavailable because a previous test's dashboard
process is still holding port 8265 when the next test starts a new Ray
cluster. The new cluster's dashboard fails to bind to that port, so
`list_actors()` (which requires the dashboard) fails. Using
`use_controller=True` avoids this by querying replica states through
`serve.status()`, which goes through the Serve controller via GCS.
- Remove file-based synchronizations and prefer signal actors.
- Relax timeouts.
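The stabilized tests rely on polling cluster state rather than one-shot reads. A simplified model of the `wait_for_condition` pattern they use (the real Ray test helper has a richer signature; the `replicas_running` checker below is a stand-in for querying `serve.status()` through the controller):

```python
import time

def wait_for_condition(condition, timeout=10.0, retry_interval=0.1):
    """Poll `condition` until it returns True or `timeout` elapses.

    Simplified model of the helper used throughout Ray's tests.
    """
    deadline = time.monotonic() + timeout
    last_exc = None
    while time.monotonic() < deadline:
        try:
            if condition():
                return
        except Exception as exc:  # condition may raise while state settles
            last_exc = exc
        time.sleep(retry_interval)
    raise TimeoutError(f"Condition not met within {timeout}s") from last_exc

# Usage sketch: poll replica state through the controller instead of the
# dashboard, so a stale dashboard process holding port 8265 cannot break
# the test.
calls = {"n": 0}

def replicas_running():
    calls["n"] += 1
    return calls["n"] >= 3  # stand-in for checking serve.status()

wait_for_condition(replicas_running, timeout=5.0)
assert calls["n"] == 3
```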

## Related issues
> Link related issues: "Fixes #1234", "Closes #1234", or "Related to
#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…de (#61386)

## Summary
- Add HAProxy load balancing section to the Serve performance tuning
guide
- Add interdeployment gRPC transport section

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com>
…otFound Error in Chained Left Joins (#61507)

## Description
Per #60013, chained left joins fail with `ColumnNotFoundError` when the
first join produces empty intermediate blocks. #60520 attempted the fix,
but a refined reproduction script shows it does not resolve the
underlying issue. This PR proposes a targeted fix and a deterministic
regression test.

### Root cause
Using the example in #60013, when the streaming executor feeds the
second join's input, the first block delivered can have zero rows. The
bug is then triggered through the following sequence:
1. `_do_add_input_inner` sees that this is the first block for input
sequence 0 (or 1), so it submits a `_shuffle_block` task with
`send_empty_blocks=True` and immediately sets
`self._has_schemas_broadcasted[input_index] = True`;
2. The remote `_shuffle_block` worker task **triggers an early return of
`(empty_metadata, {})`**. No `aggregator.submit()` calls are ever made,
and the schema never reaches any aggregator;
3. All subsequent blocks are submitted with `send_empty_blocks=False`.
Aggregators with no non-empty data are never contacted at all, leaving
their bucket queues empty;
4. At finalization, `drain_queue()` returns `[]` for those partitions,
so `_combine([])` builds an `ArrowBlockBuilder` with no `add_block()`
calls and produces an empty table with no columns;
5. When `JoiningAggregation.finalize()` calls `pa.Table.join()` on this
columnless table, it raises the observed `ColumnNotFoundError`.

### Why #60520 does not fix this issue

#60520 modifies `ArrowBlockBuilder.build()` to use a stored
`self._schema` when `len(tables) == 0`. However, `self._schema` is only
populated inside `add_block()` calls. When `partition_shards` is `[]` in
`_combine(...)`, `self._schema` remains `None`.

### This fix

In `_shuffle_block`, when `block.num_rows == 0` and
`send_empty_blocks=True`, explicitly broadcast schema-carrying empty
tables to every aggregator before returning. This mirrors the broadcast
logic for non-empty blocks, which ensures every aggregator holds at
least one schema-carrying block and thus finalizes correctly.
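The broadcast fix can be modeled in plain Python (dicts and a small class stand in for Ray Data's blocks and aggregator actors; all names are illustrative, not the real internals):

```python
# Toy model of the schema-broadcast fix. Illustrative names only.

class Aggregator:
    def __init__(self):
        self.blocks = []  # bucket queue

    def submit(self, block):
        self.blocks.append(block)

    def finalize(self, join_key):
        # Without any schema-carrying block, the combined "table" has no
        # columns and joining on `join_key` fails -- modeling the
        # reported ColumnNotFoundError. With the fix, at least one empty
        # schema-carrying block is always present.
        if not self.blocks:
            raise KeyError(join_key)
        columns = self.blocks[0]["schema"]
        assert join_key in columns
        return columns

def shuffle_block(block, aggregators, send_empty_blocks):
    if block["num_rows"] == 0:
        if send_empty_blocks:
            # The fix: broadcast a schema-carrying empty block to every
            # aggregator before the early return.
            for agg in aggregators:
                agg.submit({"schema": block["schema"], "num_rows": 0})
        return  # early return preserved for the common case
    for agg in aggregators:
        agg.submit(block)

aggs = [Aggregator(), Aggregator()]
# First block is empty: previously, nothing ever reached the aggregators.
shuffle_block({"schema": ["id", "value"], "num_rows": 0}, aggs,
              send_empty_blocks=True)
assert all(a.finalize("id") == ["id", "value"] for a in aggs)
```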

### Alternative fix

Deleting the entire early-return branch in `_shuffle_block` would also
eliminate the issue. However, since the bug only affects the edge case
where the first incoming block is empty, removing the early return
entirely would give up its performance benefit in the common case.

## Related issues
Fixes #60013 and follows up on #60520.

## Additional information
The original reproduction script in #60013 occasionally misses the error
due to the uncertain order of blocks fed to the second join. To force
the bug, add the following lines to the reproduction script:



```python
    ...
    shapes = [b.shape for b in blocks]
    print(f"Columns flattened via map_batches: {flatten_columns}")
    print("Block shapes after first join:", shapes)

    # ----- Add the following lines -----
    # Force the bug
    # The streaming executor delivers blocks in completion order, so non-empty
    # partitions finish faster and arrive first, letting schema broadcast succeed
    # silently.  Reconstructing the dataset with empty blocks at the front
    # guarantees that _shuffle_block() sees a zero-row block as the very first
    # block for the left input sequence, triggering the premature
    # _has_schemas_broadcasted flag and the resulting (0,0) empty-table bug.
    import pyarrow as pa
    empty = [b for b in blocks if b.num_rows == 0]
    nonempty = [b for b in blocks if b.num_rows > 0]
    assert empty, "No empty blocks found; cannot reproduce the bug with this dataset."
    print(f"Reordering: {len(empty)} empty blocks first, then {len(nonempty)} non-empty.")
    ds_joined = ray.data.from_arrow(empty + nonempty)
    print("Block shapes after reordering:", [b.shape for b in (ray.get(ref) for ref in ds_joined.get_internal_block_refs())])
    # ----------------------------------

    # Create mapping table
    # Use some of the location_ids for the mapping
    shared_location_ids = location_ids[: max(1, len(location_ids) // 3)]
    ...
```
The augmented script forces the order of blocks so that the first block
going into the second join is always empty.

The new test case in `test_join.py` places the empty block in a list fed
to `from_arrow`, preserving the block order and ensuring that the second
join will always see the empty block first. The bug fires reliably on
every run before the fix.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
`get_metric_dictionaries` internally calls `wait_for_condition` to block
until a metric appears. When tests wrap this call inside their own
`wait_for_condition` (the common pattern for checking metric values or
counts), the two waits nest: the inner one times out and raises,
preventing the outer loop from ever retrying. This caused intermittent
failures in `test_metrics`, `test_metrics_2`, `test_metrics_3`, and
`test_metrics_haproxy` depending on how quickly Prometheus scraped.

**What**

Added a `wait: bool = True` parameter to `get_metric_dictionaries`:

- `wait=True` (default): preserves existing behavior — blocks until the
metric appears.
- `wait=False`: performs a single fetch and returns immediately
(possibly empty), letting the caller's `wait_for_condition` drive the
retry loop.

All call sites inside test `check_*` / `metrics_available` lambdas that
are already wrapped in `wait_for_condition` are switched to
`wait=False`.
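A simplified model of the `wait` parameter pattern (the real helper scrapes Prometheus metrics; `fetch` here is a stand-in for that):

```python
import time

def get_metric_dictionaries(fetch, wait=True, timeout=5.0):
    """Return metric dicts from `fetch()`.

    wait=True blocks until at least one metric appears; wait=False does a
    single fetch so an outer wait_for_condition can own the retry loop,
    avoiding nested waits.
    """
    if not wait:
        return fetch()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        metrics = fetch()
        if metrics:
            return metrics
        time.sleep(0.05)
    raise TimeoutError("metric never appeared")

# With wait=False, the caller-side retry loop is the only wait:
samples = iter([[], [], [{"name": "serve_requests", "value": 3}]])
fetch = lambda: next(samples)
result = []
while not result:  # stands in for the outer wait_for_condition
    result = get_metric_dictionaries(fetch, wait=False)
assert result[0]["value"] == 3
```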

Signed-off-by: abrar <abrar@anyscale.com>
…les used by RLlib (#60877)

## Description
Avoids triggering v2 module loading when RLlib imports `BackendExecutor`
with `RAY_TRAIN_V2_ENABLED=1`.


## Additional information  

As a follow-up we can extend this safe import logic across the other
files as well.

Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
…61709)

Hand-written REST client using `requests`. This commit adds the core
layer: `GitHubException`, a bare `GitHubRepo` handle, and `GitHubClient`
with all HTTP methods (`_get`, `_get_paginated`, `_post`, `_patch`).
Tests use the `responses` library to intercept HTTP at the transport
layer.
Signed-off-by: andrew <andrew@anyscale.com>
…ist (#61774)

## Description
as titled, this #61059 caused
ref bundles to be the same using the same object_ref, so used a unique
one. Also, made the input_file to tuples over lists

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…only provider (#61732)

## Description

On a static Ray cluster, the autoscaler runs with a read-only cloud
provider: it keeps reconciling so it can still warn about infeasible
requests, but performs no actual scaling. However, it still emits other
logs that are misleading for a static cluster.

For example, we will see these annoying logs on a static Ray cluster:
<img width="876" height="366" alt="image"
src="https://github.com/user-attachments/assets/42dd8027-1a1b-4cd0-834d-ba928b157ad8"
/>

This PR makes the event logger only emit warnings for infeasible
requests, but suppresses autoscaler action logs if the cloud provider is
read-only.
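A minimal sketch of that gating (illustrative names, not the actual autoscaler event-logger API):

```python
# Sketch: suppress scaling-action logs for a read-only provider while
# keeping infeasible-request warnings. Names are illustrative.

class EventLogger:
    def __init__(self, provider_is_readonly):
        self.readonly = provider_is_readonly
        self.emitted = []

    def log_infeasible(self, msg):
        # Infeasible-request warnings are useful even on a static cluster.
        self.emitted.append(("WARN", msg))

    def log_scaling_action(self, msg):
        # Scaling-action logs are misleading when the provider cannot
        # actually scale, so suppress them for read-only providers.
        if not self.readonly:
            self.emitted.append(("INFO", msg))

logger = EventLogger(provider_is_readonly=True)
logger.log_infeasible("request requires 8 GPUs; none available")
logger.log_scaling_action("adding 2 worker nodes")  # suppressed
assert logger.emitted == [("WARN", "request requires 8 GPUs; none available")]
```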


Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
…gcs client (#61666)

## Description

This PR adds the `resize_raylet_resource_instances` function to the GCS
Cython client. It will be used by the autoscaler in a follow-up PR for
IPPR.

This PR also addresses review comments on the previous PR:
#61654 (comment) and
#61654 (comment).

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@pull pull bot locked and limited conversation to collaborators Mar 17, 2026
@pull pull bot added the ⤵️ pull label Mar 17, 2026
@pull pull bot merged commit 0670feb into miqdigital:master Mar 17, 2026

8 participants