Wave First-Shot Quality: Lessons from the Run Detail Redesign #708
The Aspiration
Wave's north star is first-shot results — pipelines that produce production-quality output on the first run, without requiring the human operator to spot problems, suggest fixes, and re-run. Every iteration loop that a human has to close is a failure of the system. Not a catastrophic failure, not a moral failure, but a signal that Wave's pipeline design left a gap the agent couldn't fill on its own.
This aspiration is not about perfection. It's about completeness of process. A carpenter doesn't produce a perfect table on the first try because they're a genius — they produce it because their process includes measuring before cutting, dry-fitting before gluing, and sanding before finishing. Each of those steps exists because someone, at some point, skipped it and got a bad result. First-shot quality is the accumulation of lessons about what to check and when.
This issue documents one such lesson.
The Failure
We redesigned Wave's run detail page — the primary view users see when monitoring a pipeline execution. The process took 6 iterations to converge on a good design. Here's what happened:
Iteration 1: Fantasy mockups. We generated 6 ASCII mockup variations for the new layout. They explored different arrangements — vertical timelines, horizontal Gantt bars, split panels. They looked plausible on paper. They were also completely ungrounded. No one had looked at the actual current UI. No one had measured how wide step names actually are, how many steps a real pipeline has, what the actual data density looks like. The mockups were architectural drawings for a house where nobody visited the lot.
Iteration 2: Reality check. We captured actual screenshots of the live UI. Only then did the real problems become visible: the existing layout wasted enormous horizontal space, step status was buried in a table that required scanning, timing information was disconnected from visual position. The screenshots invalidated most of the assumptions baked into the mockups.
Iteration 3: Evaluation by committee. We ran 4 parallel evaluation agents to score the mockup variations. They converged on a V4+V6 hybrid — a reasonable synthesis of the options presented. But they optimized within the menu. Not one of the four agents said: "Wait — all six of these share the same fundamental problem. They treat Gantt bars as thin lines with labels beside them. What if the bars were fat and contained the data directly?" They scored. They ranked. They did not question the frame.
Iteration 4: Human reframing. The operator suggested "fat Gantt shapes" — bars wide enough to contain step name, duration, status icon, and artifact chips directly inside the shape. This was a fundamentally better idea than anything the agents produced. It wasn't a refinement of V4 or V6. It was a different category of solution that emerged from visual intuition about information density.
Iteration 5: Prototype with placeholder disease. We built the fat-Gantt prototype. First render revealed massive problems: every step's input section showed "in: from pipeline input" — meaningless placeholder text that no one caught during the design phase because it was never rendered with real data. Artifacts were displayed as plain text strings instead of interactive chips. The OUT card (final pipeline output) was visually indistinguishable from regular steps. Standard CSS padding created vast whitespace deserts in what should have been an information-dense dashboard. There was no timeline ruler to give temporal context to the Gantt bars.
Iteration 6: Rework with eyes open. After the operator pointed out the problems with an annotated screenshot, we reworked everything — compact spacing, artifact chips with icons, Gantt-bars-as-background with proper z-indexing, visual connectors between steps, and a distinct treatment for the terminal OUT card.
The result was good. But it took 6 passes through the loop to get there. Wave's job is to close these loops automatically.
Root Causes
1. Research Without Grounding
The most expensive mistake was starting with ASCII mockups instead of starting with screenshots. We designed in a vacuum, then had to throw away most of that work when reality intruded.
This is a general failure mode: agents default to generating before observing. Given a task like "redesign the run detail page," the natural agent behavior is to immediately start producing designs. The discipline of saying "first, let me look at what exists and understand its specific problems" has to be explicitly built into the process.
The analog in software engineering is writing code before reading the existing implementation. Wave already handles this reasonably well for code tasks (research steps, fetch-assess patterns). But for design tasks, the "observe first" discipline was missing.
2. Evaluation Agents Lacked Taste (Frame Blindness)
Four evaluation agents scored six variations. They produced matrices. They identified strengths and weaknesses. They synthesized. And they completely missed the most important insight: that all six options shared a structural limitation.
This is not a failure of intelligence — it's a failure of mandate. The agents were asked to evaluate the options presented. They were not asked to question whether the option space was well-constructed. They were optimizers when they needed to be critics.
The deeper issue: evaluation without meta-evaluation is just scoring. A good design critic doesn't just rank the options — they ask "what option is missing?" and "what assumption do all of these share that might be wrong?" This meta-cognitive step has to be explicitly prompted. Agents will not spontaneously question the frame they're given.
3. Placeholder Syndrome
"from pipeline input" appeared as the input description for every step in the prototype. This is the design equivalent of lorem ipsum making it to production — a placeholder that was never replaced with real data because nobody rendered the design against actual pipeline content.
The root cause is designing with abstract data instead of concrete data. When you design a card layout using "Step Name" and "Input: description here," you can't see that your layout breaks when the step name is "validate-terraform-plan-against-staging-environment" or that "from pipeline input" is meaninglessly generic for 90% of steps.
Real-data rendering is a contract, not a courtesy. If a design step produces UI, and that UI contains text, something needs to verify that the text is meaningful — not a placeholder, not a default, not a truncated ellipsis hiding critical information.
4. Whitespace Blindness
The prototype used standard CSS spacing — `gap: 12px`, `padding: 16px`, the kind of values you find in every component library's defaults. For a marketing page, these are fine. For an information-dense operational dashboard, they're wasteful.
Nobody questioned this because the values were "reasonable defaults." This is the spacing equivalent of cargo culting: applying patterns from one context (general web design) to a different context (dense operational UI) without asking whether the pattern fits.
The fix isn't "always use tight spacing." It's "explicitly decide spacing based on information density requirements, and have someone verify the decision." A dashboard showing 15 pipeline steps in a scrollable view has different spatial requirements than a form with 5 fields.
5. Missing Competitive Context
We designed without first looking at how GitHub Actions, Airflow, Tekton, Argo Workflows, or any other pipeline visualization tool handles run detail views. We reinvented from scratch when there were dozens of battle-tested references available.
This isn't about copying competitors. It's about starting with the accumulated design knowledge of the field. Every one of those tools has been through their own multi-iteration refinement process. Their current designs encode lessons we could have absorbed in 10 minutes of research instead of rediscovering through 6 iterations.
The general principle: for any design problem that other products have solved, research their solutions before designing your own. This is so obvious it's embarrassing to state, and yet we skipped it because the agent's default behavior is to generate, not to research.
Action Plan
A. Mandatory Ground-Truth Capture Before Design
Problem: Agents design from imagination instead of observation.
Change: For any pipeline that includes a design or redesign step, introduce a ground-truth capture step that runs before any design work begins. This step:
- Captures screenshots of the current state (for redesign tasks)
- Captures screenshots of 3-5 competitor/reference implementations (for any UI task)
- Captures real data samples (actual step names, actual durations, actual artifact counts)
- Produces a constraints document: viewport dimensions, maximum item counts observed in real usage, longest strings, most common patterns
The design step's input is not "redesign the run detail page." It's "redesign the run detail page, given these screenshots of the current state, these reference implementations, and these data constraints."
This maps to a reusable pipeline pattern: `capture-context → design → build → validate`. The capture-context step should be extractable as a composition primitive that any design pipeline can include.
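A minimal sketch of what the constraints document from a capture-context step might look like. All names here (`DesignConstraints`, `capture_constraints`, the sample step names) are hypothetical illustrations, not Wave's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class DesignConstraints:
    """Ground-truth constraints captured before any design work begins."""
    viewport_px: tuple[int, int]           # width, height of the target viewport
    max_items_observed: int                # most steps seen in any real pipeline run
    longest_string: str                    # longest real step name, for truncation tests
    data_samples: list[str] = field(default_factory=list)

def capture_constraints(step_names: list[str], viewport=(1440, 900)) -> DesignConstraints:
    """Distill real data samples into the constraints a design step must respect."""
    return DesignConstraints(
        viewport_px=viewport,
        max_items_observed=len(step_names),
        longest_string=max(step_names, key=len),
        data_samples=step_names,
    )

# Real step names, not "Step 1" placeholders:
names = ["fetch-sources", "validate-terraform-plan-against-staging-environment", "deploy"]
c = capture_constraints(names)
```

The point of the sketch: the design step receives concrete, measured constraints (the 51-character step name, the real item count) rather than an open-ended prompt.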
B. Frame-Questioning in Evaluation Steps
Problem: Evaluation agents optimize within the given options instead of questioning the option space.
Change: Every evaluation/triage persona should include an explicit frame-check prompt in its system instructions:
Before scoring the options presented, answer these questions:
- What assumption do ALL of these options share? Is that assumption valid?
- What category of solution is NOT represented? Why might it be better?
- If you could only give one piece of feedback that isn't a score, what would it be?
This is cheap — three extra questions in the evaluation prompt. But it structurally prevents the "score and rank without thinking" failure mode. The frame-check output should be a required section in the evaluation artifact, and the downstream synthesis step should be contractually required to address it.
For Wave's composition primitives, this suggests a pattern: `branch(evaluate × N) → aggregate(synthesize + frame-check)`, where the aggregation step explicitly looks for frame-level concerns across all evaluations before recommending a direction.
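One way the aggregation step could enforce the frame-check contractually. This is a sketch under assumed data shapes (`agent`, `score`, `frame_check`, `shared_assumptions` are invented field names, not Wave's real artifact schema):

```python
def aggregate_with_frame_check(evaluations: list[dict]) -> dict:
    """Synthesize N parallel evaluations, refusing any that skipped the frame-check."""
    missing = [e["agent"] for e in evaluations if not e.get("frame_check")]
    if missing:
        # Contract violation: frame-check is a required section, not an optional extra.
        raise ValueError(f"evaluations missing frame-check section: {missing}")
    # Surface assumptions that every evaluator independently flagged as shared
    # across all options — the "all six treat Gantt bars as thin lines" signal.
    shared = set.intersection(
        *(set(e["frame_check"]["shared_assumptions"]) for e in evaluations)
    )
    return {
        "ranking": sorted(evaluations, key=lambda e: -e["score"]),
        "frame_concerns": sorted(shared),
    }
```

The design choice here is that aggregation fails loudly when the frame-check section is absent, rather than silently synthesizing scores — which is what makes the downstream synthesis contractually required to address it.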
C. Real-Data Rendering Contracts
Problem: Placeholder text and dummy data survive into prototypes because nothing checks for them.
Change: Add a contract type (or contract check pattern) for UI work that validates against placeholder syndrome:
- No generic placeholders: Contract scans rendered output for common placeholder patterns (`lorem ipsum`, `placeholder`, `description here`, `from pipeline input`, `Step N`, `example.com`)
- Data variety check: If the UI renders a list of N items, the contract verifies that the items are not all identical or trivially sequential
- Truncation check: Verify that the longest realistic data string renders without truncation or layout breakage
- Empty state check: Verify the UI handles zero items, one item, and many items
This can be implemented as a contract helper that design/build steps include. The persona's contract block would include something like:
```yaml
contracts:
  - type: ui_rendering
    checks: [no_placeholders, data_variety, truncation, empty_states]
```

Until that's built as a first-class contract type, the immediate fix is adding these checks to the natural-language contract description in design pipeline steps.
D. Information Density Audit
Problem: Default spacing values create wasteful layouts for dense operational UIs.
Change: For UI tasks that produce operational dashboards, monitoring views, or other information-dense interfaces, include a density audit step between build and final validation:
- Compare pixels-per-data-point against reference implementations
- Flag any element where padding exceeds content height
- Verify that the primary use case (e.g., "see status of all 12 steps without scrolling") is achievable at standard viewport sizes
- Check that interactive elements meet minimum touch/click target sizes (density shouldn't sacrifice usability)
This is a specialized validation step — not every UI needs it, but every dashboard UI does. The pipeline selector (or the meta-pipeline) should route dashboard tasks through this step automatically.
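A density audit could start as simply as comparing measured element geometry against a few thresholds. This sketch assumes elements arrive as `(name, content_px, padding_px)` measurements; the thresholds are illustrative, not Wave's real values:

```python
def density_audit(elements, viewport_height=900, min_target_px=32):
    """Flag density problems in a list of (name, content_px, padding_px) measurements."""
    flags = []
    for name, content_px, padding_px in elements:
        if padding_px > content_px:
            # Whitespace blindness: more padding than actual content.
            flags.append(f"{name}: padding ({padding_px}px) exceeds content ({content_px}px)")
        if content_px + padding_px < min_target_px:
            # Density shouldn't sacrifice usability: keep click targets reachable.
            flags.append(f"{name}: total height below {min_target_px}px click target")
    total = sum(c + p for _, c, p in elements)
    if total > viewport_height:
        # Primary use case: all steps visible without scrolling.
        flags.append(f"primary view needs {total}px but viewport is {viewport_height}px")
    return flags
```

A step row with 20px of content wrapped in 32px of padding gets flagged immediately — the "vast whitespace deserts" from Iteration 5 become a failing check instead of a human observation.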
E. Mandatory Research Before Invention
Problem: Agents generate solutions from scratch when prior art exists.
Change: For any pipeline step that produces a design, architecture, or significant technical approach, add a research dependency:
- The research step surveys existing solutions (competitor products, open-source implementations, design pattern libraries)
- It produces a prior art summary: what solutions exist, what tradeoffs they make, what patterns are most common
- The design step receives this summary as input and must explicitly state which prior art it's building on and where it's deliberately diverging
This is the design equivalent of Wave's existing research-implement pipeline pattern. The insight is that it should be the default for design tasks, not an opt-in upgrade. Any pipeline that includes a "design" or "architect" step without a preceding research step should trigger a linter warning.
F. Composition Pattern: Iterate-Until-Grounded
The 6-iteration loop happened because each iteration discovered new problems that should have been caught earlier. Wave's composition primitives (iterate, branch, aggregate) can encode this:
```
capture-ground-truth
  → research-prior-art
  → branch(design × 3)
  → aggregate(evaluate + frame-check)
  → build-prototype
  → validate(real-data + density + placeholder-check)
  → iterate(if validation fails, loop with failure context)
```
The key insight is that the validation step at the end must be genuinely capable of catching the problems we caught manually. If the validation is weak, the iterate loop just produces more of the same. Strong validation contracts are the engine that makes iterate-until-grounded actually work.
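The loop's control flow can be sketched in a few lines. Everything here is a hypothetical skeleton (Wave's actual iterate primitive is not shown in this issue); the essential property is that validator output becomes the next iteration's input:

```python
def iterate_until_grounded(design_step, build_step, validators, max_iterations=6):
    """Loop design → build → validate, feeding failure context back into the next pass.

    Each validator takes the built artifact and returns a list of problems;
    an empty combined list means the contracts are satisfied.
    """
    failure_context: list[str] = []
    for attempt in range(1, max_iterations + 1):
        design = design_step(failure_context)
        artifact = build_step(design)
        problems = [p for v in validators for p in v(artifact)]
        if not problems:
            return artifact, attempt       # first-shot quality when attempt == 1
        failure_context = problems         # strong contracts make this converge
    raise RuntimeError(f"did not converge after {max_iterations} passes: {failure_context}")
```

The skeleton makes the document's point concrete: if the `validators` list is weak, `problems` comes back empty on a bad artifact and the loop "succeeds" at producing the same mistakes.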
This is the product investment: not just adding more steps, but making each step's contracts strong enough that the pipeline genuinely self-corrects. Every one of the 6 iterations we went through represents a contract that was missing. Add those contracts, and the pipeline closes the loop itself.
Summary
The run detail redesign took 6 iterations because Wave's design pipeline lacked five things: grounding in reality, frame-questioning in evaluation, real-data validation, density-aware auditing, and competitive research. None of these are novel ideas. They're the design equivalents of "measure twice, cut once" — process steps that exist precisely because skipping them produces bad first-shot results.
The path to first-shot quality is not better agents. It's better process — more complete pipelines with stronger contracts at each stage. Every iteration loop a human has to close is a missing step or a weak contract. Find it, encode it, and the next run gets it right the first time.