CUCo is a training-free, agent-driven framework that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks optimization opportunities unavailable to existing approaches, reducing end-to-end latency by up to 1.57x over host-driven baselines.
CUCo consists of three intertwined components:

- **Design Space Specification** — A structured, declarative set of communication primitives (backend, placement, sync scope, issuer granularity, chunk size) that grounds agent reasoning in valid collective semantics.
- **Fast-Path Agent** — A correctness-first pipeline that converts host-driven NCCL code into device-initiated (GIN/LSA) equivalents through a three-step process: CUDA code analysis, host-to-device transformation via an LLM-judge loop, and evolve-block annotation.
- **Slow-Path Agent** — An LLM-driven evolutionary search that optimizes the fast-path baseline through island-based populations, phase-dependent explore/exploit mutation, cascaded evaluation, and a shared candidate database with meta-summarization.
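As a concrete illustration, the design space above can be thought of as one structured configuration per candidate. The sketch below is hypothetical — the `CommSpec` type, enum names, and the power-of-two chunk constraint are ours, not CUCo's API — assuming one enum per axis:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical encodings of the five design-space axes described above.
class Backend(Enum):
    GIN = "gin"          # GPU-initiated networking
    LSA = "lsa"          # load/store-accessible path

class Placement(Enum):
    INLINE = "inline"        # communication issued inside the compute kernel
    SIDE_KERNEL = "side"     # dedicated communication kernel

class SyncScope(Enum):
    THREAD = "thread"
    WARP = "warp"
    BLOCK = "block"

class Issuer(Enum):
    THREAD = "thread"
    WARP = "warp"
    BLOCK = "block"

@dataclass(frozen=True)
class CommSpec:
    backend: Backend
    placement: Placement
    sync_scope: SyncScope
    issuer: Issuer
    chunk_bytes: int

    def __post_init__(self):
        # Illustrative constraint: chunk sizes are positive powers of two.
        if self.chunk_bytes <= 0 or self.chunk_bytes & (self.chunk_bytes - 1):
            raise ValueError("chunk_bytes must be a positive power of two")

spec = CommSpec(Backend.GIN, Placement.INLINE, SyncScope.WARP, Issuer.WARP, 1 << 16)
```

A declarative spec like this is what lets the agent mutate one axis at a time while rejecting combinations that violate collective semantics.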
Given a host-driven CUDA+NCCL kernel, CUCo's fast-path agent first analyzes the communication pattern, converts host-side collectives to device-initiated GIN/LSA primitives, and annotates mutable regions with EVOLVE-BLOCK markers. The slow-path agent then treats the annotated kernel as generation 0 and runs an evolutionary search: each generation, an LLM mutates the code within the evolve blocks, the candidate is compiled, run, and scored, and the result feeds back into the next iteration. Over 10-20 generations, this loop discovers optimizations like compute-communication overlap, kernel fusion, and pipelined transfers that are difficult to find manually.
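The generational loop just described can be sketched as follows; `mutate` and `score` are stand-ins for CUCo's LLM mutation and compile-run-measure steps (the function names and the greedy selection are our simplification of the island-based search):

```python
import random

def mutate(code: str, rng: random.Random) -> str:
    """Stand-in for the LLM mutation step: edit only text inside EVOLVE-BLOCK markers."""
    begin = code.index("// EVOLVE-BLOCK-BEGIN")
    end = code.index("// EVOLVE-BLOCK-END")
    body = code[begin:end]
    # A real mutation is an LLM-proposed code edit; here we just append a tag.
    return code[:begin] + body + f"// variant-{rng.randrange(1000)}\n" + code[end:]

def score(code: str) -> float:
    """Stand-in for compile + run + fitness measurement; higher is better."""
    return float(len(code))  # placeholder fitness

def evolve(seed: str, generations: int, rng: random.Random) -> tuple[str, float]:
    best, best_score = seed, score(seed)
    for _ in range(generations):
        candidate = mutate(best, rng)   # LLM edits only the evolve blocks
        s = score(candidate)            # compile, run, score
        if s > best_score:              # winner seeds the next generation
            best, best_score = candidate, s
    return best, best_score
```

The key invariant is that mutations never touch code outside the EVOLVE-BLOCK markers, so the frozen regions (correctness-critical setup and the collective's semantics) survive every generation.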
CUCo was evaluated on four representative workloads spanning different compute-communication patterns. In each case, CUCo's evolved kernels significantly outperform the host-driven NCCL baselines.
| DeepSeek-V3 MoE<br>Dispatch-Compute-Combine | KV Cache Transfer<br>Prefill-Decode Pipeline |
|---|---|
| ![]() | ![]() |

| Flash Attention<br>Attention with AllGather | GEMM + AllGather<br>Matmul with Collective |
|---|---|
| ![]() | ![]() |
| Guide | Description |
|---|---|
| Getting Started | Installation, first run, end-to-end walkthrough |
| Architecture | System design, module map, data flow |
| Adding a New Workload | Step-by-step guide to onboard your own kernel |
| Fast-Path Agent | Host-to-device transformation pipeline |
| Slow-Path Agent | Evolutionary search deep dive |
| Configuration Reference | All config parameters (EvolutionConfig, TransformConfig, etc.) |
| LLM Backends | Provider setup (Anthropic, Bedrock, OpenAI, Gemini, DeepSeek) |
| Writing Evaluations | Custom evaluate.py for your workload |
| Visualization | Web UI, plotting tools, database queries |
While the included example uses CUDA and NCCL device APIs, CUCo's core framework is workload-agnostic. The evaluation script (evaluate.py), prompt customization (run_evo.py), and API documentation file are all user-defined — you can adapt CUCo for any kernel, library, or optimization target where an LLM can generate code and a script can score it. See Adding a New Workload for details.
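As a rough illustration of a user-defined evaluation script, a minimal evaluate.py might look like the sketch below. The exact interface CUCo expects is documented in Adding a New Workload, so treat the function name and the return-0.0-on-failure convention here as assumptions:

```python
import pathlib
import subprocess
import time

def evaluate(candidate_path: str) -> float:
    """Build and run one candidate kernel; return a fitness score (higher is better).

    Hypothetical convention: 0.0 for candidates that fail to build or run,
    otherwise the reciprocal of measured wall-clock latency.
    """
    binary = pathlib.Path(candidate_path).with_suffix("")
    try:
        build = subprocess.run(
            ["nvcc", "-O3", candidate_path, "-o", str(binary)],
            capture_output=True, text=True,
        )
    except FileNotFoundError:
        return 0.0  # toolchain not available on this machine
    if build.returncode != 0:
        return 0.0  # compile failure: worst possible fitness
    start = time.perf_counter()
    run = subprocess.run([str(binary)], capture_output=True, timeout=120)
    elapsed = time.perf_counter() - start
    if run.returncode != 0:
        return 0.0  # runtime failure (crash, or wrong answer signaled via exit code)
    return 1.0 / elapsed
```

Anything scorable this way — a different compiler, a library call, even a non-CUDA target — slots into the same loop.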
```
cuco/                     Core framework
├── core/                 Evolution runner, sampler, novelty judge, summarizer
├── database/             Candidate database, complexity analysis, island management
├── edit/                 Diff/full-rewrite application, async editing
├── llm/                  LLM client, model backends (Anthropic, OpenAI, Gemini, DeepSeek)
├── prompts/              Mutation prompt templates (base, diff, full, cross, novelty, meta)
├── transform/            Fast-path agent: CUDA analyzer, host-to-device transformer
├── plots/                Visualization utilities (lineage trees, pareto fronts, improvement plots)
├── webui/                Interactive evolution visualization UI
├── launch/               Local and Slurm launch backends
├── cuco_launch           Entry point for launching evolution runs
└── cuco_visualize        Entry point for the visualization UI
examples/
└── ds_v3_moe/            DeepSeek-V3 MoE dispatch-compute-combine workload
    ├── ds_v3_moe.cu          Seed CUDA kernel (host-driven baseline)
    ├── evaluate.py           Build, run, and fitness evaluation logic
    ├── run_evo.py            Launch slow-path evolutionary search
    ├── run_transform.py      Launch fast-path host-to-device transformation
    ├── nccl_api_docs.py      NCCL device API documentation for agent context
    └── results_ds_v3_moe/    Evolution results (generations, scores, logs)
pyproject.toml            Package configuration and dependencies
uv.lock                   Locked dependency versions
```
- Python >= 3.10
- CUDA 13.1+ with NCCL 2.28.9+ (for device-initiated communication)
- NVIDIA GPUs with NVLink (intra-node) or RoCE (inter-node)
- LLM API credentials (Anthropic Bedrock, OpenAI, etc.)
```bash
# Clone the repository
git clone https://github.com/UT-InfraAI/cuco.git
cd cuco

# Create virtual environment and install
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

Or with uv (recommended):
```bash
uv venv
source .venv/bin/activate
uv sync
```

Create a .env file in the repository root with your LLM API credentials:
```
AWS_DEFAULT_REGION=us-east-1
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
```

The fast-path agent converts a host-driven NCCL program into a device-initiated equivalent:
```bash
cd examples/ds_v3_moe
python run_transform.py
```

This runs the three-step pipeline (CUDA analysis, host-to-device transformation, evolve-block annotation) and outputs the transformed kernel to _transform_host_output/.
The slow-path agent optimizes the transformed kernel through LLM-driven evolution:
```bash
cd examples/ds_v3_moe
python run_evo.py --num_generations=18
```

Evolution results (candidate programs, scores, logs) are saved to results_ds_v3_moe/.
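Candidates and scores can also be inspected programmatically from the run's SQLite database. The sketch below assumes a hypothetical `candidates(generation, score, code)` table — check the actual schema of evolution_db.sqlite before relying on it:

```python
import sqlite3

def top_candidates(db_path: str, n: int = 5) -> list[tuple[int, float]]:
    """Return (generation, score) for the n best-scoring candidates.

    Assumes a hypothetical table: candidates(generation INTEGER, score REAL, code TEXT).
    """
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT generation, score FROM candidates ORDER BY score DESC LIMIT ?",
            (n,),
        ).fetchall()
    return rows
```

This is handy for quick scripting; the web UI below is the richer way to explore the same data.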
Launch the interactive web UI to explore the evolution tree:
```bash
cuco_visualize --db examples/ds_v3_moe/results_ds_v3_moe/evolution_db.sqlite
```

```bibtex
@article{hu2026cuco,
  title={CUCo: An Agentic Framework for Compute and Communication Co-design},
  author={Hu, Bodun and Varadharajan, Yoga Sri Varshan and Agarwal, Saurabh and Akella, Aditya},
  note={Equal contribution: Bodun Hu and Yoga Sri Varshan V},
  journal={arXiv preprint arXiv:2603.02376},
  year={2026}
}
```

Apache 2.0