feat: first-class Windows support across CLI, runtime, and judges

## Problem / motivation

`skill-up` is currently a second-class citizen on Windows — the build passes, but core evaluation paths either fail or are explicitly skipped:

- **The script judge is effectively unusable on Windows.** Every script-related test in `internal/judge/script_test.go` and `internal/judge/e2e_test.go` is gated by `if runtime.GOOS == "windows" { t.Skip(...) }` (see `script_test.go:29,113,151,186` and `e2e_test.go:619`). That means the `script` judge has neither test coverage nor a working execution path on Windows: scripts are hard-coded with `#!/bin/sh` / `#!/bin/bash` shebangs, and scenarios like missing POSIX interpreters or CRLF line endings have never been exercised.
- **Examples and setup tooling assume POSIX shell + GNU coreutils.** `examples/judge-debug-eval.sh`, `install.sh`, and the `make hooks` / `make lint-tools` / `make verify` targets in the `Makefile` all assume `/bin/sh`. A Windows user who installs Go 1.25 cannot follow the Setup commands documented in `AGENTS.md` without WSL.
- **Agent Engine adapters are written against a POSIX process model.** `internal/agent/{qodercli,claude_code,codex}.go` shell out to external CLIs, and `internal/shellquote` only implements POSIX quoting. Windows-specific concerns — `cmd.exe` / PowerShell quoting, the `.exe` suffix, and `PATH` / `PATHEXT` resolution — are not handled.
- **CI does not cover Windows.** `.github/workflows/ci.yml` only runs on Linux, so regressions are invisible and the gap keeps widening.
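
To make the quoting gap in the adapters bullet concrete, here is a minimal sketch contrasting POSIX single-quote quoting with the backslash/double-quote rules that `CommandLineToArgvW` expects. The names `quotePOSIX` and `quoteWindowsArg` are hypothetical and are not the current `internal/shellquote` API. The Windows half only covers argv parsing; `cmd.exe` metacharacters such as `&` or `^` would need separate handling.

```go
package main

import (
	"fmt"
	"strings"
)

// quotePOSIX wraps s in single quotes, escaping embedded single quotes.
// Hypothetical helper, shown only for comparison.
func quotePOSIX(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// quoteWindowsArg quotes one argument so that CommandLineToArgvW-style
// parsing recovers it unchanged. Hypothetical helper.
func quoteWindowsArg(arg string) string {
	// Non-empty arguments without whitespace or quotes need no quoting.
	if arg != "" && !strings.ContainsAny(arg, " \t\n\v\"") {
		return arg
	}
	var b strings.Builder
	b.WriteByte('"')
	backslashes := 0
	for i := 0; i < len(arg); i++ {
		switch c := arg[i]; c {
		case '\\':
			backslashes++ // defer: meaning depends on what follows
		case '"':
			// Backslashes before a quote are doubled, then the quote is escaped.
			b.WriteString(strings.Repeat(`\`, backslashes*2+1))
			b.WriteByte('"')
			backslashes = 0
		default:
			// Backslashes before an ordinary character are literal.
			b.WriteString(strings.Repeat(`\`, backslashes))
			b.WriteByte(c)
			backslashes = 0
		}
	}
	// Trailing backslashes are doubled so they do not escape the closing quote.
	b.WriteString(strings.Repeat(`\`, backslashes*2))
	b.WriteByte('"')
	return b.String()
}

func main() {
	for _, arg := range []string{`C:\Program Files\tool.exe`, `say "hi"`, `it's`} {
		fmt.Printf("posix: %s\twindows: %s\n", quotePOSIX(arg), quoteWindowsArg(arg))
	}
}
```

The same input needs different output per platform, which is why a single POSIX-only quoting path cannot be reused when an adapter launches a native Windows executable.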

**Who is affected**: contributors and users developing Agent Skills on Windows (native, not just WSL). As the Skill ecosystem expands, lack of Windows support is a hard blocker for a meaningful fraction of potential adopters.

## Proposed solution

Promote Windows to a first-class supported platform in phased steps:

1. **Add a Windows CI job first.** Extend `.github/workflows/ci.yml` with a `windows-latest` runner executing `go build` and `go test -race ./...` (initially with `continue-on-error: true`, so the current gap is surfaced without blocking merges). This gives subsequent work a regression baseline.
2. **Make the script judge cross-platform** (`internal/judge/script.go`):
   - Dispatch to an interpreter based on shebang and file extension (`.ps1` / `.cmd` / `.sh`).
   - On Windows, fall back to a user-configured `bash` (Git Bash / WSL) for `.sh` scripts, returning a clear error when none is available instead of failing silently.
   - Remove every `t.Skip("skipping on windows")` in `script_test.go` / `e2e_test.go` and replace them with platform-aware table-driven cases.
3. **Audit Agent Engine adapters** (`internal/agent/`):
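   - Centralize executable discovery through a single `exec.LookPath` wrapper so that `.exe` suffix and `PATHEXT` handling live in one place (on Windows, `exec.LookPath` already consults `PATHEXT`, so the wrapper mainly needs to ensure every adapter goes through it).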
   - Route all shell composition through `internal/shellquote`, and add a Windows quoting implementation (see `golang.org/x/sys/windows` / `CommandLineToArgvW` semantics).
4. **Provide Windows-equivalent tooling scripts.** Add PowerShell counterparts under `scripts/windows/` for `make hooks` / `lint-tools` / `verify`, and document them in the Setup section of `AGENTS.md` and `CONTRIBUTING.md`.
5. **Path and newline hygiene.** Sweep `internal/runner`, `internal/report`, and `internal/skill` to ensure all path construction uses `filepath.Join` (mostly already the case) and that generated scripts / transcripts are written with explicit LF endings to avoid Git `autocrlf` surprises.
6. **Documentation.** Add a "Windows support" page under `docs/` covering supported features, known limitations, and recommended workflows (native vs. WSL2).
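
The interpreter dispatch in step 2 could look roughly like the sketch below. Everything here is illustrative: `interpreterFor`, `shebangInterpreter`, and `errNoBash` are hypothetical names, the `bashPath` parameter stands in for whatever "user-configured `bash`" setting is eventually chosen, and using `pwsh` for `.ps1` on non-Windows hosts is an assumption. The target OS is passed in as a parameter so the logic can be tested on any platform.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"path/filepath"
	"strings"
)

// errNoBash is returned when a shell script needs bash on Windows but none is configured.
var errNoBash = errors.New("script judge: shell script requires bash, but no bash path is configured")

// shebangInterpreter returns the interpreter name from a "#!" line ("" if absent).
// A trailing CR from CRLF files is trimmed; "/usr/bin/env X" resolves to X.
func shebangInterpreter(firstLine string) string {
	line := strings.TrimRight(firstLine, "\r\n")
	if !strings.HasPrefix(line, "#!") {
		return ""
	}
	fields := strings.Fields(line[2:])
	if len(fields) == 0 {
		return ""
	}
	name := path.Base(fields[0])
	if name == "env" && len(fields) > 1 {
		name = path.Base(fields[1])
	}
	return name
}

// interpreterFor returns the argv used to run script on the given OS.
// script is expected to be an absolute path.
func interpreterFor(goos, script, firstLine, bashPath string) ([]string, error) {
	ext := strings.ToLower(filepath.Ext(script))
	switch ext {
	case ".ps1":
		shell := "pwsh" // assumption: PowerShell 7+ on non-Windows hosts
		if goos == "windows" {
			shell = "powershell.exe"
		}
		return []string{shell, "-NoProfile", "-File", script}, nil
	case ".cmd":
		if goos != "windows" {
			return nil, fmt.Errorf("script judge: %s: .cmd scripts require Windows", script)
		}
		return []string{"cmd.exe", "/C", script}, nil
	}

	interp := shebangInterpreter(firstLine)
	if goos != "windows" {
		if interp != "" {
			return []string{script}, nil // the kernel honors the shebang
		}
		if ext == ".sh" {
			return []string{"/bin/sh", script}, nil
		}
		return nil, fmt.Errorf("script judge: %s: no shebang and unknown extension", script)
	}

	// Windows: only sh/bash scripts are routed to the configured bash.
	if ext != ".sh" && interp != "sh" && interp != "bash" {
		return nil, fmt.Errorf("script judge: %s: unsupported script type on Windows", script)
	}
	if bashPath == "" {
		return nil, errNoBash
	}
	return []string{bashPath, script}, nil
}

func main() {
	argv, err := interpreterFor("windows", `C:\cases\check.sh`, "#!/bin/sh\r\n", "")
	fmt.Println(argv, err)
	argv, err = interpreterFor("windows", `C:\cases\check.ps1`, "", "")
	fmt.Println(argv, err)
}
```

Returning a sentinel error for the missing-bash case keeps the "clear error instead of failing silently" requirement easy to assert in the platform-aware tests.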

## Alternatives considered

- **Recommend WSL2 only and skip native Windows.** Cheapest to implement, but it contradicts the project's positioning as a CLI evaluation framework for Agent Skill developers. The supported engines (Qoder CLI / Claude Code / Codex) already ship native Windows builds, so forcing WSL splits the user's engine and the evaluator across two environments and creates path / credential synchronization friction.
- **Restrict the script judge to explicitly typed scripts (`.ps1` on Windows, `.sh` on POSIX).** Sidesteps shebang parsing but breaks compatibility with existing case configs and forces Skill authors to maintain parallel scripts per platform — a poor user experience.
- **Embed a Go-native shell interpreter (e.g. `mvdan/sh`) to run `.sh` scripts.** Removes the dependency on external bash, but subtle behavioral differences vs. real `bash` + coreutils would surprise Skill authors. Better positioned as an optional fallback than the default.

## Additional context

- Concrete Windows-skip locations that can serve as a remediation checklist:
  - `internal/judge/script_test.go:29,113,151,186`
  - `internal/judge/e2e_test.go:619`
- Hard-coded POSIX shebangs in fixtures and tests:
  - `internal/judge/script_test.go`, `internal/evaluator/evaluator_test.go:1484`, `e2e/contract_test.go:703`
- Related files that need to stay in sync with any change: [`AGENTS.md`](AGENTS.md) (Setup commands / Testing), [`.github/workflows/ci.yml`](.github/workflows/ci.yml), [`Makefile`](Makefile).
- Toolchain note: Cobra, `golangci-lint`, and `goreleaser` all ship official Windows binaries, so there is no upstream blocker.
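
As a rough illustration of how the skip guards and hard-coded shebang fixtures listed above might be replaced, the sketch below selects script fixtures per platform instead of skipping. The `scriptFixture` type and `scriptFixtures` function are hypothetical and do not exist in the repository; the fixture bodies are deliberately trivial.

```go
package main

import (
	"fmt"
	"runtime"
)

// scriptFixture describes one script-judge test case (hypothetical type).
type scriptFixture struct {
	Name     string // subtest name
	Filename string // file written into the test's temp dir
	Body     string // script contents, always with LF line endings
	WantExit int    // expected process exit code
}

// scriptFixtures returns equivalent fixtures for the given OS, so the same
// table-driven test body can run everywhere instead of calling t.Skip.
func scriptFixtures(goos string) []scriptFixture {
	if goos == "windows" {
		return []scriptFixture{
			{Name: "pass", Filename: "pass.ps1", Body: "exit 0\n", WantExit: 0},
			{Name: "fail", Filename: "fail.ps1", Body: "exit 1\n", WantExit: 1},
		}
	}
	return []scriptFixture{
		{Name: "pass", Filename: "pass.sh", Body: "#!/bin/sh\nexit 0\n", WantExit: 0},
		{Name: "fail", Filename: "fail.sh", Body: "#!/bin/sh\nexit 1\n", WantExit: 1},
	}
}

func main() {
	for _, f := range scriptFixtures(runtime.GOOS) {
		fmt.Printf("%s: %s (want exit %d)\n", f.Name, f.Filename, f.WantExit)
	}
}
```

In a real `_test.go` file, the test would range over `scriptFixtures(runtime.GOOS)` and run each entry with `t.Run(f.Name, ...)`, which keeps the assertions identical across platforms.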

