chore(deps): bump actions/checkout from 4 to 6 #2
Closed
dependabot[bot] wants to merge 38 commits into
added 30 commits
June 9, 2026 16:05
MIT-licensed Python 3.10+ project laid out with hatchling, ruff, mypy, pytest+coverage, and a versioned single-source __version__ module. First README is a stub; it will be rewritten as the toolkit grows.
These five modules are the wire format between every later subsystem (speech, grounding, policy, env, metrics). They're deliberately small and dependency-free so importing parley.core stays cheap.
- types: Audio/Transcript/Grounding/Frame/Action/Trace dataclasses.
- errors: ParleyError tree (ConfigError / RegistryError / ValidationError).
- rng: RngManager with BLAKE2b-derived named sub-streams (process-stable).
- registry: typed per-kind registries (speech / perturb / policy / metric / ...).
- config: pydantic v2 strict models + YAML loader with source-path errors.
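A minimal sketch of how BLAKE2b-derived named sub-streams can work. The signatures below are assumptions for illustration, not the actual parley.core.rng API, and the stdlib `random.Random` stands in for whatever generator type the library hands out.

```python
import hashlib
import random


def derive_seed(root_seed: int, name: str) -> int:
    """Map (root_seed, stream name) to a 64-bit seed via BLAKE2b.

    Unlike the built-in hash(), the result does not depend on
    PYTHONHASHSEED, so it is the same in every process.
    """
    payload = f"{root_seed}:{name}".encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "big")


class RngManager:
    """Hands out one generator per named stream (illustrative only)."""

    def __init__(self, root_seed: int) -> None:
        self.root_seed = root_seed
        self._streams: dict[str, random.Random] = {}

    def stream(self, name: str) -> random.Random:
        # Same name -> same generator object; distinct names -> distinct seeds.
        if name not in self._streams:
            seed = derive_seed(self.root_seed, name)
            self._streams[name] = random.Random(seed)
        return self._streams[name]
```

Because each stream is seeded from its name, drawing more numbers from one stream (say, noise) does not shift the values another stream (say, env spawn) produces.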
Including determinism checks for derive_seed, stream-identity invariants, extra-field rejection, and YAML/config error paths.
The runtime-checkable Protocol nails down the audio->Transcript contract. The mock frontend (perfect-ASR) lets users isolate downstream errors; the Whisper adapter is import-lazy and gated behind the 'whisper' extra so the default install stays light.
Encodes text as a sequence of windowed sine tones on a log-spaced frequency grid; decodes by FFT peak-picking over vocab bins. Clean audio round-trips exactly, while noise / mu-law / clipping perturbations collapse bin SNR and yield realistic substitutions. This is the trick that lets Parley measure WER end-to-end in CI without shipping a real acoustic model — and it's deliberately separate from the frontend wrapper so other tools can call encode/decode directly.
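A rough sketch of the tone-codec idea, assuming a 16 kHz sample rate, one 1024-sample frame per token, and a 300-3000 Hz grid; none of these constants or function signatures come from the commit. Frequencies are snapped to exact FFT bins so clean audio decodes without leakage between vocab entries.

```python
import numpy as np

SR = 16_000   # sample rate in Hz (assumed)
FRAME = 1024  # samples per token (assumed)


def vocab_bins(vocab: list[str], f_lo: float = 300.0, f_hi: float = 3000.0) -> np.ndarray:
    """Assign each vocab entry an FFT bin on a log-spaced frequency grid."""
    freqs = np.geomspace(f_lo, f_hi, num=len(vocab))
    bins = np.round(freqs * FRAME / SR).astype(int)
    if len(set(bins.tolist())) != len(vocab):
        raise ValueError("vocab too large for this grid: bins collide")
    return bins


def encode(text: str, vocab: list[str]) -> np.ndarray:
    """One Hann-windowed sine tone per word; raises ValueError on unknown words."""
    bins = vocab_bins(vocab)
    n = np.arange(FRAME)
    window = np.hanning(FRAME)
    tones = [
        window * np.sin(2 * np.pi * bins[vocab.index(word)] * n / FRAME)
        for word in text.split()
    ]
    return np.concatenate(tones) if tones else np.zeros(0)


def decode(audio: np.ndarray, vocab: list[str]) -> str:
    """Per frame, pick the vocab entry whose bin has the largest magnitude."""
    bins = vocab_bins(vocab)
    words = []
    for start in range(0, len(audio) - FRAME + 1, FRAME):
        spectrum = np.abs(np.fft.rfft(audio[start:start + FRAME]))
        words.append(vocab[int(np.argmax(spectrum[bins]))])
    return " ".join(words)
```

Perturbing the array between `encode` and `decode` is what turns waveform damage into word substitutions in this scheme.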
Audio: Gain, Clip, AdditiveNoise(SNR-calibrated), MuLawCodec, Reverb,
TimeStretch, PitchShift. All implemented in pure numpy so the toolkit
keeps zero heavy deps; quality is benchmark-grade, not production audio.
Linguistic: Disfluency (word stutter), FillerInsertion ("uhm"/"uh"),
AccentSubstitution (configurable lexical remap). Each respects the rate
parameter and uses the supplied numpy Generator only — no hidden state.
Compose threads a single RNG through a list and is identity on empty,
so the "clean" baseline row is just Compose([]).
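A small sketch of the Compose contract described above. To stay dependency-free it uses plain lists and the stdlib `random.Random` as a stand-in for the numpy Generator the commit mentions; `gain` and `additive_noise` are simplified illustrations, not the library's classes.

```python
import random
from typing import Callable, Sequence

Samples = list[float]
Perturbation = Callable[[Samples, random.Random], Samples]


def gain(factor: float) -> Perturbation:
    """Scale every sample; ignores the RNG."""
    return lambda audio, rng: [factor * x for x in audio]


def additive_noise(sigma: float) -> Perturbation:
    """Add Gaussian noise drawn only from the supplied RNG (no hidden state)."""
    return lambda audio, rng: [x + rng.gauss(0.0, sigma) for x in audio]


class Compose:
    def __init__(self, steps: Sequence[Perturbation]) -> None:
        self.steps = list(steps)

    def __call__(self, audio: Samples, rng: random.Random) -> Samples:
        # One RNG is threaded through every step, in order.
        for step in self.steps:
            audio = step(audio, rng)
        return audio
```

With an empty step list the loop never runs, so `Compose([])` returns its input untouched, which is what makes it usable as the clean baseline.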
Covers VERB <COLOR> <SHAPE> [to the DIRECTION] with verb canonicalization and a graceful <unknown> sentinel for failed parses. Six tests covering full / minimal / canonical / unknown / partial inputs. Acts as the reference baseline an LLM grounder can be benched against.
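The grammar can be sketched with a single regular expression. The synonym table, color/shape/direction sets, and the `Grounding` fields here are hypothetical placeholders, and this sketch collapses every failed parse to the sentinel rather than modelling partial parses.

```python
import re
from dataclasses import dataclass
from typing import Optional

UNKNOWN = "<unknown>"
# Hypothetical vocabularies for illustration only.
VERB_CANON = {"pick": "pick", "grab": "pick", "place": "place",
              "put": "place", "push": "push"}
COLORS = {"red", "green", "blue"}
SHAPES = {"cube", "sphere", "cylinder"}
DIRECTIONS = {"left", "right", "top", "bottom"}

PATTERN = re.compile(
    r"^(?P<verb>\w+)(?:\s+the)?\s+(?P<color>\w+)\s+(?P<shape>\w+)"
    r"(?:\s+to\s+the\s+(?P<direction>\w+))?$"
)


@dataclass(frozen=True)
class Grounding:
    verb: str
    target: str
    dest: Optional[str]


def ground(text: str) -> Grounding:
    failed = Grounding(UNKNOWN, UNKNOWN, None)
    match = PATTERN.match(text.strip().lower())
    if match is None:
        return failed
    verb = VERB_CANON.get(match["verb"])
    color, shape, direction = match["color"], match["shape"], match["direction"]
    if verb is None or color not in COLORS or shape not in SHAPES:
        return failed
    if direction is not None and direction not in DIRECTIONS:
        return failed
    return Grounding(verb, f"{color} {shape}", direction)
```

Returning a sentinel instead of raising keeps the downstream policy and metrics running on a bad transcript, so the failure shows up as a grounding error rather than a crash.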
The bare 'env/' rule was eating the parley.env package. Anchor it.
A unit-square workspace populated with colored shape objects. Continuous xy_delta moves; explicit pick/place actions interact with whichever object the effector is on. Success predicates compare the final state against a GoalSpec (pick: holding the right object, place/push: target inside the named direction zone). This is *not* a robotics simulator — it's a controllable testbed that keeps the whole speech->ASR->grounding->policy->action->success chain runnable in milliseconds. Real sims (LIBERO/ManiSkill/RLBench) plug in via the same Environment Protocol.
Three reference policies implementing the VLAPolicy Protocol:
- ScriptedPolicy: small approach->pick->transit->place state machine parameterised by Grounding+Scene. Acts as the success-rate ceiling on synthetic data given a particular grounder.
- RandomPolicy: uniform xy + 5% pick/place; the floor baseline.
- NoisyPolicy: Gaussian-perturbs another policy's actions. Useful for probing metric sensitivity and as an 'imperfect VLA' surrogate.
Real models (OpenVLA/Octo/pi-0) plug in here via the same protocol.
…ormat
generate_dataset() emits Episodes whose audio is the codec-encoded form of their instruction, closing the loop with the codec ASR: clean audio decodes to the original text, so any non-zero WER is attributable purely to perturbations.
Templates: pick/place/push the <COLOR> <SHAPE> [to the <DIRECTION>]. The closed vocab returned by vocab_for() also covers filler/disfluency/accent tokens so perturbed text round-trips cleanly.
On-disk format: a jsonl index + a sibling .audio.npz blob. The split lets index-only operations (count, list, validate) skip the audio entirely and keeps random access cheap.
…ness, bootstrap
Per stage:
- ASR: WER (with sub/ins/del breakdown), CER, KeywordRecall.
- Grounding: GroundingExactMatch and GroundingSlotF1 over verb/target/dest.
- Action: SuccessRate, ActionMSE/MAE (gated on a reference action sequence in trace.metadata), DTW with path-length normalization.
- Efficiency: LatencyPercentiles (p50/p95/p99 + total + RTF).
- Robustness: RobustnessDelta, i.e. clean-vs-perturbed deltas + mean/max degradation (post-hoc aggregator over per-perturbation means).
- Aggregate: summarize() returns mean/SEM/percentile-bootstrap CI; paired_bootstrap_pvalue() for marking significant pipeline pairs.
WER/CER use a textbook DP edit-distance with backpointer reconstruction so we can report sub/ins/del counts. Pure numpy; zero new deps. 27 tests covering each error type, partial-credit, and bootstrap edge cases.
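For reference, a plain-Python sketch of the textbook DP with backpointers; the function name and return shape are assumptions, and the commit's version is numpy-based. Dividing by `max(len(ref), 1)` for an empty reference is a convention chosen here, not taken from the source.

```python
def wer_breakdown(ref: list[str], hyp: list[str]) -> dict[str, float]:
    """Word error rate plus substitution/insertion/deletion counts."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = i, "del"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = j, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = ref[i - 1] == hyp[j - 1]
            options = [
                (cost[i - 1][j - 1] + (0 if same else 1), "ok" if same else "sub"),
                (cost[i - 1][j] + 1, "del"),
                (cost[i][j - 1] + 1, "ins"),
            ]
            cost[i][j], back[i][j] = min(options, key=lambda o: o[0])

    # Walk the backpointers from the bottom-right corner to count each op.
    counts = {"sub": 0, "ins": 0, "del": 0}
    i, j = n, m
    while i > 0 or j > 0:
        op = back[i][j]
        if op in ("ok", "sub"):
            i, j = i - 1, j - 1
        elif op == "del":
            i -= 1
        else:
            j -= 1
        if op != "ok":
            counts[op] += 1

    errors = counts["sub"] + counts["ins"] + counts["del"]
    return {**counts, "wer": errors / max(n, 1)}
```

When several alignments share the minimum cost, the split between substitutions and insert/delete pairs depends on tie-breaking (here: diagonal first), while the total error count does not.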
The orchestrator that turns a config into per-episode results:
- Pipeline composes a SpeechFrontend + Grounder + VLAPolicy by name and records per-stage wall-clock timings on the Trace.
- run_episode walks an Episode through Perturbation -> Speech -> Grounding -> Env-rollout, populating Trace fields and timings_ms used by the latency metric. Per-stage RNG streams are derived from the global seed with distinct names so e.g. additive noise can't leak into env spawn.
- ContentCache: file-backed (atomic-write) JSON cache keyed by a (pipeline, perturbation, episode_id, seed) hash; cache_dir=None means disabled.
- expand_suite: cartesian product of pipelines x (clean + perturb groups) x episodes. Clean is always row 0 so robustness deltas have a baseline.
- BenchmarkEngine builds pipelines lazily (codec frontend gets the dataset vocab injected automatically), persists per-run trace JSON to output_dir/traces/, and supports an optional ThreadPoolExecutor for workers > 1.
Smoke run on synth(n=4) with 2 pipelines x 3 perturb groups = 24 results; scripted policy is the success-rate ceiling and random is the floor as expected. 13 tests cover cache, suite expansion, end-to-end runs, the threadpool path, and the unknown-pipeline error path.
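The suite expansion can be sketched with `itertools.product`. `RunUnit` and the argument types are hypothetical; the real expand_suite presumably works on config objects rather than bare strings.

```python
from itertools import product
from typing import NamedTuple, Sequence


class RunUnit(NamedTuple):
    pipeline: str
    perturbation: str
    episode_id: str


def expand_suite(
    pipelines: Sequence[str],
    perturb_groups: Sequence[str],
    episode_ids: Sequence[str],
) -> list[RunUnit]:
    # "clean" is forced to the front so every pipeline has a baseline row
    # ahead of its perturbed rows.
    groups = ["clean", *[g for g in perturb_groups if g != "clean"]]
    return [RunUnit(p, g, e) for p, g, e in product(pipelines, groups, episode_ids)]
```

For example, two pipelines, clean plus two perturbation groups, and four episodes expand to 2 x 3 x 4 = 24 run units.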
…d JSON
- aggregate_results() groups EpisodeResults by (pipeline, perturbation) and runs each metric through summarize() for mean/SEM/bootstrap CI. Skipped metrics (e.g. action_mse without a reference) are tracked per episode so the n in each cell is honest.
- render_markdown() emits a GitHub-friendly table with mean [low, high] cells; render_csv() emits one row per cell with explicit ci_low/high columns for spreadsheet use.
- build_leaderboard() ranks pipelines by clean success-rate, ties broken by lower mean degradation across perturbation groups.
- dump_report/load_report use a versioned JSON schema (schema_version=1) carrying parley_version + suite_name + rows + leaderboard. Refusing unknown schema_versions on load means future format changes don't silently misread old reports.
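The schema-version guard could look roughly like this. The string-in/string-out signatures are a simplification for illustration; the actual dump_report/load_report likely operate on files and richer objects.

```python
import json

SCHEMA_VERSION = 1


def dump_report(report: dict) -> str:
    """Serialize a report dict, stamping it with the current schema version."""
    return json.dumps({**report, "schema_version": SCHEMA_VERSION}, sort_keys=True)


def load_report(text: str) -> dict:
    """Parse a report, refusing any schema version this code does not know."""
    payload = json.loads(text)
    version = payload.get("schema_version")
    if version != SCHEMA_VERSION:
        raise ValueError(f"unsupported report schema_version: {version!r}")
    return payload
```

Failing loudly on an unknown or missing version is the point: a newer report layout is rejected instead of being half-parsed into misleading numbers.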
Thin glue over the library. Highlights:
- run: loads YAML config, optionally overrides dataset path / seed, runs the engine, dumps report.json + a config.resolved.yaml snapshot next to it, and prints a Rich-rendered table inline.
- report: re-renders previously-written report.json as markdown / csv / json. The CSV path reconstitutes Summary objects from the JSON dump so it's available without re-running the suite.
- list: enumerates registered plugins by kind for discoverability.
- validate: parses a YAML and prints a one-line summary.
`parley` is exposed as a console script via [project.scripts].
… smoke Includes a synth -> run -> report --format json end-to-end round-trip that exercises the full toolkit through the CLI.
Adds whitespace and line-break normalization. No semantic changes — all 130 tests stay green and mypy strict is still clean. Future commits can assume `ruff format` is the canonical layout, which is what CI enforces.
Dependabot is grouped so dev-tooling bumps (ruff, mypy, pytest) arrive as one PR per week instead of a stream of singletons. CODEOWNERS keeps review routing trivial for now (single maintainer).
- ci.yml: ruff check, ruff format --check, mypy on a Python 3.12 lint job; pytest+coverage matrix across Python 3.10..3.13 on Ubuntu plus a macOS 3.12 sanity job; a separate smoke-cli job runs the full `parley synth + run + validate` round trip against the quickstart example so CI breaks loudly if any end-to-end glue regresses.
- codeql.yml: weekly security scan + on every push/PR.
- release.yml: tag-triggered build + PyPI publish via trusted publishing (id-token: write, no API tokens needed).
Mirrors what CI enforces so committers catch lint/type breakage before push instead of waiting on the runners.
…cimate) + sweep helpers
Three channel-flavored perturbations cribbed from the VoIP / telephony literature:
- PacketLoss: drop contiguous packets at a configurable rate. Zeros rather than concealment; we're not doing PLC.
- BandLimit: brick-wall FFT band-pass; default 300-3400 Hz matches the ITU-T G.712 narrowband telephony passband.
- SpectralDecimate: zero out the top fraction of FFT bins, a poor-man's perceptual-codec proxy that's deterministic and dep-free.
suites.py adds three programmatic sweep builders:
- snr_sweep over CHiME/MUSAN-style SNR ladders.
- codec_sweep: mu_law / telephone / spectral_decimate / packet_loss.
- linguistic_sweep: disfluency / filler / accent_subst.
These are for ad-hoc / notebook use; YAML configs can still list each PerturbationGroup explicitly when that's clearer. 9 new tests.
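A brick-wall FFT band-pass of the kind BandLimit describes can be sketched in a few lines of numpy. The function name and signature are assumptions; only the 300-3400 Hz default comes from the commit.

```python
import numpy as np


def band_limit(
    audio: np.ndarray,
    sample_rate: int,
    low_hz: float = 300.0,
    high_hz: float = 3400.0,
) -> np.ndarray:
    """Zero every FFT bin outside [low_hz, high_hz] and transform back."""
    audio = np.asarray(audio, dtype=float)
    if audio.size == 0:
        return audio.copy()  # nothing to filter
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(audio.size, d=1.0 / sample_rate)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    # Passing n keeps the output length equal to the input for odd lengths too.
    return np.fft.irfft(spectrum, n=audio.size)
```

A brick-wall mask is deterministic and cheap, at the cost of ringing that a designed filter would avoid; for a benchmark perturbation that trade seems consistent with the "benchmark-grade, not production audio" stance taken earlier in the series.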
Two robustness-science staples cribbed from the speech/fairness eval literature (see docs/design-notes.md):
- sensitivity_index() computes ΔTask / ΔInput per (pipeline, perturbation) using each pipeline's own "clean" row as baseline. The slope is interpretable: a 1-point increase in the upstream metric (WER, ...) costs N points of downstream task success. A degenerate ΔInput=0 with non-zero ΔTask becomes math.inf, which surfaces 'perturbation didn't move the input metric but the task collapsed anyway' (a pipeline brittleness signal that survives WER staying flat).
- worst_group_report(): per pipeline, returns the minimum value of a target metric across grouped rows. Currently grouped by perturbation but the surface accommodates future axes (accent stratum, speaker id) without redesign.
7 tests cover fragility comparison, zero-delta handling, missing-baseline skip, and the secondary-metric path.
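A scalar sketch of the two computations. The sign convention (task success lost divided by input-metric increase), the 0.0 result when neither metric moves, and the flat argument lists are assumptions for illustration; the real functions operate on report rows.

```python
import math


def sensitivity_index(
    clean_task: float,
    perturbed_task: float,
    clean_input: float,
    perturbed_input: float,
) -> float:
    """Task success lost per unit increase in the upstream metric (e.g. WER)."""
    d_task = clean_task - perturbed_task     # success lost vs. the clean row
    d_input = perturbed_input - clean_input  # upstream metric increase
    if math.isclose(d_input, 0.0, abs_tol=1e-12):
        # Input metric did not move: report inf if the task moved anyway.
        return 0.0 if math.isclose(d_task, 0.0, abs_tol=1e-12) else math.inf
    return d_task / d_input


def worst_group(values_by_group: dict[str, float]) -> tuple[str, float]:
    """Return the (group, value) pair with the minimum metric value."""
    return min(values_by_group.items(), key=lambda kv: kv[1])
```

For instance, success dropping from 1.0 to 0.6 while WER rises from 0.0 to 0.2 gives a slope of about 2: each WER point costs roughly two points of task success.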
The synth generator was rolling a 10% chance of omitting the direction for place/push verbs, but the env's success predicate requires a destination zone for those verbs, making those episodes unsatisfiable and pushing the scripted policy's clean success-rate to ~80% instead of 100%. Found it via the programmatic example's worst-group report — a nice example of the toolkit catching its own bugs. Now: pick verbs have no destination; place/push always carry one. include_directions=False degrades place/push to picks rather than emitting an unsatisfiable goal.
- quickstart.yaml: smallest interesting run (1 pipeline, 16 episodes, one mild perturbation).
- robustness_panel.yaml: scripted vs random across 11 perturbations covering acoustic / channel / linguistic axes.
- snr_sweep.yaml: a five-rung SNR ladder, the canonical degradation curve.
- programmatic/custom_suite.py: builds the same kind of config in code using snr_sweep() / codec_sweep(), runs, prints headline table + sensitivity index + worst-group report.
All three configs pass `parley validate`. examples/README.md indexes them.
ASCII diagram of the call graph from CLI through runner into the four pluggable subsystems, with a wire-format table mapping each parley.core.types dataclass to its origin -> sink, and sections on determinism, caching, parallelism, and the eval-toolkit lineage the design borrows from (HELM, lm-eval, Inspect).
- usage.md: install, the five subcommands, what `parley run` writes, programmatic usage, plugging in a real frontend, reproducibility.
- metrics.md: every metric (what it measures, what scale, when to use), grouped by stage (ASR / grounding / action / efficiency / robustness / bookkeeping).
- api-reference.md: curated public surface across core / data / runner / report / metrics / perturb. Anything not listed there is internal.
Grounds Parley's design in current (2023-2026) systems with verifiable URLs: VLA policies (RT-1/2, OpenVLA, Octo, pi_0, GR00T N1, RDT-1B...), speech LLMs (Whisper, Qwen2-Audio, SALMONN, Moshi, SeamlessM4T...), robot benchmarks (LIBERO, CALVIN, RLBench, ManiSkill, SimplerEnv, VLABench...), and eval-harness architecture patterns (HELM, lm-evaluation-harness, MTEB, Inspect). Also documents what the design is deliberately *not* good at — synth env isn't a physics sim, the codec ASR is robust to surprisingly low SNRs, action chunking and real RIRs are out of scope for v0. Better to be honest than oversell.
- README: badges, why-this-toolkit-exists, 60-second tour (CLI + Python), feature summary, doc index.
- CHANGELOG: Keep-a-Changelog format with the 0.1.0 inventory under Added.
- CONTRIBUTING: the four CI gates, plugin-registration recipe, commit style guide.
- CODE_OF_CONDUCT: Contributor Covenant 2.1.
- SECURITY: GitHub private-vuln-reporting flow, plus an honest note on the actual attack surface (npz + YAML).
…key, per-thread pipelines
Three correctness issues found in review:
1. Linguistic perturbations were silent. They rewrite instruction.text, but run_episode kept feeding the *original* codec audio to the ASR, so disfluency/filler/accent_subst all measured WER=0. Now: when the frontend is the codec, re-encode the perturbed text before transcribe. (Real-audio adapters like Whisper are unaffected: they don't touch instruction.text in their perturbation path.)
2. The cache key was (pipeline, perturbation, episode, seed) by NAME only. Two suites reusing a group name 'noise' with different snr_db collided. cache_key now takes a config_fingerprint folding in the resolved pipeline + perturbation params + env + metrics + max_steps.
3. The engine shared one lazily-built Pipeline instance across the ThreadPoolExecutor, but policies hold per-episode state (reset/act). Under workers>1 we now build a private pipeline per run unit.
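The cache-key fix in item 2 can be illustrated as follows. SHA-256, the canonical-JSON encoding, and both signatures are assumptions made for this sketch; the commit only states that a config fingerprint is folded into the key.

```python
import hashlib
import json


def config_fingerprint(resolved_config: dict) -> str:
    """Hash the resolved parameters, independent of dict key order."""
    canonical = json.dumps(resolved_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def cache_key(
    pipeline: str,
    perturbation: str,
    episode_id: str,
    seed: int,
    fingerprint: str,
) -> str:
    """Names alone are ambiguous; the fingerprint disambiguates parameters."""
    raw = "|".join([pipeline, perturbation, episode_id, str(seed), fingerprint])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Two suites that both call a group `noise` but use different `snr_db` values now get different fingerprints, and therefore different cache entries.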
n_in=0 produced a negative linspace and out-of-bounds indexing in TimeStretch/PitchShift. Guard and return the empty array unchanged.
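A sketch of the guard, assuming a linear-interpolation resampler; `resample_linear` is a hypothetical helper, and the actual TimeStretch/PitchShift code may resample differently.

```python
import numpy as np


def resample_linear(audio: np.ndarray, n_out: int) -> np.ndarray:
    """Linearly resample audio to n_out samples."""
    n_in = len(audio)
    if n_in == 0:
        # Without this guard, linspace(0, n_in - 1, ...) would run from 0 to -1
        # and there would be no samples to interpolate between.
        return audio
    positions = np.linspace(0, n_in - 1, num=n_out)
    return np.interp(positions, np.arange(n_in), audio)
```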
- linguistic perturbation moves WER
- cache key param fingerprint + no cross-param collision
- empty-input resample
- threaded run matches serial (policy-state race guard)
Plus CHANGELOG entries under [Unreleased].
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v4...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting If you change your mind, just re-open this PR and I'll resolve any conflicts on it.
Bumps actions/checkout from 4 to 6.
Commits
- df4cb1c Update changelog for v6.0.3 (#2446)
- 1cce339 Fix checkout init for SHA-256 repositories (#2439)
- 900f221 fix: expand merge commit SHA regex and add SHA-256 test cases (#2414)
- 0c366fd Update changelog (#2357)
- de0fac2 Fix tag handling: preserve annotations and explicit fetch-tags (#2356)
- 064fe7f Add orchestration_id to git user-agent when ACTIONS_ORCHESTRATION_ID is set (...
- 8e8c483 Clarify v6 README (#2328)
- 033fa0d Add worktree support for persist-credentials includeIf (#2327)
- c2d88d3 Update all references from v5 and v4 to v6 (#2314)
- 1af3b93 update readme/changelog for v6 (#2311)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)