Merged
64 changes: 29 additions & 35 deletions docs/V9_PUBLIC_BULK_ALPHA_CARD_DATAPACKAGE_BOUNDARY_UPDATE.md
@@ -1,47 +1,41 @@
# V9-BULK-ALPHA-003 Dataset Card And Data Package Boundary Update
# SpaceBio-Bench v9 Dataset Card And Data Package Note

Status: `metadata_alpha_boundary_applied`
Status: `public_ready`

Claim boundary: `metadata_only_public_bulk_alpha_no_payload_release`
## Purpose

## Decision Applied
This note records how the v9 public bulk metadata catalog is represented in the
reader-facing dataset card and the Data Package descriptor.

The V9-BULK-ALPHA-002 metadata-only alpha decision is now reflected in both
public-facing metadata surfaces:
## Reader-Facing Card

- `docs/v9_hf_dataset_card.md`
- `v9/datapackage.draft.json`

Both artifacts say the public bulk lane is a metadata-only alpha snapshot, not
a frozen payload release or locally hash-verified payload bundle.

## Data Package Update
The v9 card at `docs/v9_hf_dataset_card.md` presents the public bulk surface as
a metadata catalog. It summarizes:

The draft Data Package now records:
- 8 public bulk leave-one-mission-out task manifests;
- 6 tissue contexts;
- 22 public NASA OSDR source rows;
- 33 fold definitions;
- 24 reference baseline runs across 3 baseline families;
- 21 catalog, audit, and output resources in `v9/datapackage.draft.json`.

- `spacebio_bench:release_status = metadata_alpha_not_frozen`
- `spacebio_bench:alpha_snapshot_status = metadata_only_alpha_snapshot`
- `spacebio_bench:claim_boundary = metadata_only_public_bulk_alpha_no_payload_release`
- `spacebio_bench:payload_release_allowed = false`
- `spacebio_bench:payload_verification_status = checksum_manifests_parsed_payloads_not_hashed`
The card is written for reviewers and method developers who want to inspect the
task catalog, source coverage, fold index, and reference outputs.

It contains 21 resources. Ten resources are explicit
`alpha_boundary_metadata` tables from the public bulk gap matrix and snapshot
decision packages.
## Data Package Descriptor

## Dataset Card Update
The descriptor at `v9/datapackage.draft.json` lists the metadata and output
resources that make up the public bulk catalog. It includes task indexes,
source inventory files, checksum-audit summaries, baseline summaries, and
supporting report tables.

The dataset card now uses `SpaceBio-Bench v9 Public Bulk Metadata Alpha` as the
public label and states:
The descriptor is useful for machine reading and resource inventory checks. The
human-facing explanation remains `docs/v9_hf_dataset_card.md`.

- Release status: metadata-only alpha snapshot, not frozen.
- The card is not frozen release language.
- The package does not include a local payload mirror.
- Payload-level hash verification remains pending.
- Organoid and multispecies draft tracks are not public bulk alpha core tasks.
## Related Files

## Next

The public bulk alpha claim/payload boundary is now explicit enough to return
to `V9-SC-001: RRRM asset inventory`, unless the user explicitly chooses to
start the deferred payload-mirror freeze lane first.
- `docs/v9_hf_dataset_card.md`
- `v9/datapackage.draft.json`
- `v9/README.md`
- `v9/reports/README.md`
- `release/release_manifest.json`
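
The note above says the descriptor is useful for machine reading and resource inventory checks. A minimal sketch of such a check follows, assuming only that the descriptor follows the Data Package convention of a top-level `resources` array whose entries carry a `name`. The helper `summarize_descriptor` and the example descriptor are hypothetical and are not part of this PR.

```python
import json
from collections import Counter


def summarize_descriptor(descriptor: dict) -> dict:
    """Return the resource count and any resource names that appear more than once."""
    resources = descriptor.get("resources", [])
    names = [resource.get("name") for resource in resources]
    duplicates = sorted(
        name
        for name, count in Counter(names).items()
        if name is not None and count > 1
    )
    return {"resource_count": len(resources), "duplicate_names": duplicates}


# Hypothetical descriptor used only for illustration; the real file lists more resources.
example = json.loads(
    """
    {
      "name": "example-package",
      "resources": [
        {"name": "task_manifest_index", "path": "task_manifest_index.csv"},
        {"name": "source_inventory", "path": "source_inventory.csv"}
      ]
    }
    """
)

print(summarize_descriptor(example))
```

Run against `v9/datapackage.draft.json`, the reported `resource_count` could be compared with the 21 resources that the card summary cites.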
93 changes: 39 additions & 54 deletions docs/V9_PUBLIC_BULK_ALPHA_METADATA_SNAPSHOT_DECISION.md
@@ -1,66 +1,51 @@
# V9-BULK-ALPHA-002 Metadata-Only Alpha Snapshot Decision
# SpaceBio-Bench v9 Public Bulk Metadata Catalog Note

Status: `metadata_only_alpha_snapshot_allowed_with_payload_blockers`
Status: `public_ready`

Selected path: `metadata_only_alpha_snapshot`
## Purpose

Deferred path: `payload_mirror_first`
This note records the public v9 bulk metadata catalog scope. The catalog helps
readers inspect task manifests, source rows, fold indexes, checksum-audit
summaries, reference baselines, and package metadata from the GitHub
repository.

Claim boundary: `metadata_only_public_bulk_alpha_no_payload_release`
## Catalog Scope

## Decision
| Area | Current public record |
|---|---|
| Task manifests | 8 public bulk LOMO task manifests |
| Fold definitions | 33 fold rows |
| Source records | 22 public NASA OSDR source rows |
| Checksum audit | 22 source rows with parsed OSDR API/checksum-manifest records |
| Baseline outputs | 24 reference baseline rows across 3 baseline families |
| Data package descriptor | `v9/datapackage.draft.json` |
| Reader-facing card | `docs/v9_hf_dataset_card.md` |

Proceed with a metadata-only alpha snapshot for the public bulk lane, with
explicit payload-hash blockers. This is not a frozen payload release. It is a
bounded review surface for task manifests, source inventory, OSDR API and
checksum-manifest evidence, baseline summaries, and provenance reports.
## Public Description

The payload-release path remains blocked because `0/22`
public bulk sources are locally payload-hash verified, while
`22/22` sources have parsed checksum-manifest
evidence.
Use this description for the v9 public bulk surface:

## Option Comparison
> SpaceBio-Bench v9 public bulk is a metadata catalog for public mouse bulk
> RNA-seq mission-held-out tasks. It records task definitions, OSDR source
> coverage, fold indexes, checksum-audit summaries, and reference baseline
> outputs.

| Path | Decision | Status |
| --- | --- | --- |
| metadata-only alpha snapshot | selected | allowed with explicit blockers |
| payload mirror first | deferred | valid for future payload release, not required before metadata alpha |
| no alpha until payload frozen | rejected | too conservative for metadata scaffold |
## Source And Dataset Notes

## Allowed Language
- NASA OSDR remains the upstream source for biological data.
- The GitHub catalog records metadata, audit summaries, and baseline outputs.
- The Hugging Face dataset card remains the entry point for processed public
fold downloads in the v7.1 public package.
- Larger v9 payload bundles can be handled as separate release work when the
package metadata and verification records are ready.

- `SpaceBio-Bench v9 public bulk metadata alpha`
- The snapshot documents public mouse bulk LOMO task/source/provenance metadata.
- OSDR file-list and checksum-manifest evidence has been parsed for all 22
public bulk source rows.
- Payload mirroring and local payload-hash verification remain pending.
## Related Files

## Prohibited Language

- Frozen public benchmark release.
- Frozen payload mirror.
- Locally hash-verified data bundle.
- DOI/archive release, complete release Data Package, or leaderboard claim.
- Organoid or multispecies draft tracks as public bulk alpha core tasks.

## External Guidance Anchors

- Hugging Face dataset cards are README/metadata surfaces meant to help users
understand dataset contents, context, and responsible use:
https://huggingface.co/docs/hub/datasets-cards
- Frictionless Data Package descriptors separate package metadata from resource
entries and can describe metadata resources without implying a local payload
mirror:
https://specs.frictionlessdata.io/data-package/
- OSDR API file-list and metadata endpoints support source/file traceability,
while local benchmark payload hashing remains a separate project claim:
https://visualization.osdr.nasa.gov/biodata/api/
- NASA OSDR should remain the credited upstream source for space biology data:
https://science.nasa.gov/reference/osdr-faq/

## Next Block

Run `V9-BULK-ALPHA-003: dataset card and Data Package alpha boundary update`. That block should update
`docs/v9_hf_dataset_card.md` and `v9/datapackage.draft.json` using the claim
boundary in this decision package.
- `docs/v9_hf_dataset_card.md`
- `v9/README.md`
- `v9/task_manifest_index.csv`
- `v9/task_data_index.csv`
- `v9/source_inventory.csv`
- `v9/source_checksum_audit.csv`
- `v9/reports/bulk_lomo_baseline_summary.csv`
- `v9/datapackage.draft.json`
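
The counts in the Catalog Scope table could be cross-checked against the CSV indexes listed under Related Files. The sketch below shows one way to do that, under stated assumptions: each CSV has a single header row and one record per line, and the mapping from a table name to its expected row count is supplied by the caller. The helpers `count_rows` and `check_counts`, the column name, and the sample data are hypothetical.

```python
import csv
import io


def count_rows(csv_text: str) -> int:
    """Count data rows in CSV text, excluding the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(1 for _ in reader)


def check_counts(tables: dict[str, str], expected: dict[str, int]) -> list[str]:
    """Return one message per table whose row count differs from the expected count."""
    errors = []
    for name, want in expected.items():
        got = count_rows(tables[name])
        if got != want:
            errors.append(f"{name}: expected {want} rows, found {got}")
    return errors


# Hypothetical in-memory table; a real check would read the CSV files from disk.
tables = {"source_inventory": "source_id\nsrc-1\nsrc-2\n"}
print(check_counts(tables, {"source_inventory": 2}))
print(check_counts(tables, {"source_inventory": 3}))
```

For the real catalog, the expected mapping would presumably pair `v9/source_inventory.csv` with 22 rows and `v9/task_manifest_index.csv` with 8, but which file carries which count is an assumption drawn from the file names, not something this diff confirms.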
36 changes: 36 additions & 0 deletions scripts/validate_public_docs_consistency.py
@@ -24,6 +24,8 @@
ARCHIVE_STATUS = REPO_ROOT / "docs" / "RELEASE_ARCHIVE_CARD.md"
ARCHIVE_MANIFEST = REPO_ROOT / "docs" / "RELEASE_ARCHIVE_MANIFEST.md"
ARCHIVE_CHECKLIST = REPO_ROOT / "docs" / "RELEASE_ARCHIVE_CHECKLIST.md"
V9_CATALOG_NOTE = REPO_ROOT / "docs" / "V9_PUBLIC_BULK_ALPHA_METADATA_SNAPSHOT_DECISION.md"
V9_CARD_PACKAGE_NOTE = REPO_ROOT / "docs" / "V9_PUBLIC_BULK_ALPHA_CARD_DATAPACKAGE_BOUNDARY_UPDATE.md"
V9_README = REPO_ROOT / "v9" / "README.md"
V9_REPORTS_README = REPO_ROOT / "v9" / "reports" / "README.md"
CITATION = REPO_ROOT / "CITATION.cff"
@@ -83,6 +85,8 @@ def validate_public_docs() -> list[str]:
    archive_status = ARCHIVE_STATUS.read_text()
    archive_manifest = ARCHIVE_MANIFEST.read_text()
    archive_checklist = ARCHIVE_CHECKLIST.read_text()
    v9_catalog_note = V9_CATALOG_NOTE.read_text()
    v9_card_package_note = V9_CARD_PACKAGE_NOTE.read_text()
    v9_readme = V9_README.read_text()
    v9_reports_readme = V9_REPORTS_README.read_text()
    citation = CITATION.read_text()
@@ -281,6 +285,38 @@ def validate_public_docs() -> list[str]:
    ):
        require_absent(errors, "release archive docs", archive_text, forbidden)

    v9_note_text = "\n".join([v9_catalog_note, v9_card_package_note])
    for label, text, expected in (
        (
            "docs/V9_PUBLIC_BULK_ALPHA_METADATA_SNAPSHOT_DECISION.md",
            v9_catalog_note,
            "# SpaceBio-Bench v9 Public Bulk Metadata Catalog Note",
        ),
        (
            "docs/V9_PUBLIC_BULK_ALPHA_CARD_DATAPACKAGE_BOUNDARY_UPDATE.md",
            v9_card_package_note,
            "# SpaceBio-Bench v9 Dataset Card And Data Package Note",
        ),
    ):
        require_contains(errors, label, text, expected)
    for forbidden in (
        "Metadata-Only Alpha Snapshot Decision",
        "metadata-only alpha",
        "metadata alpha",
        "Public Bulk Metadata Alpha",
        "Claim boundary",
        "claim boundary",
        "Allowed Language",
        "Prohibited Language",
        "payload blockers",
        "blocked",
        "deferred",
        "not frozen",
        "not a frozen",
        "decision package",
    ):
        require_absent(errors, "v9 implementation notes", v9_note_text, forbidden)

    zenodo_text = json.dumps(zenodo, sort_keys=True)
    require_contains(errors, ".zenodo.json", zenodo.get("title", ""), "SpaceBio-Bench")
    require_contains(errors, ".zenodo.json", zenodo.get("version", ""), "7.1.2")
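
The hunk above calls `require_contains` and `require_absent`, but their definitions fall outside the diff. The following is a minimal sketch of how such helpers might behave, assuming plain case-sensitive substring matching and an `errors` list that collects messages. The bodies and message wording are assumptions inferred from the call sites, not the script's actual implementation.

```python
def require_contains(errors: list[str], label: str, text: str, expected: str) -> None:
    """Record an error when the expected substring is missing from the text."""
    if expected not in text:
        errors.append(f"{label}: missing expected text {expected!r}")


def require_absent(errors: list[str], label: str, text: str, forbidden: str) -> None:
    """Record an error when a forbidden substring appears in the text."""
    if forbidden in text:
        errors.append(f"{label}: contains forbidden text {forbidden!r}")


# Hypothetical note text used only for illustration.
note = "# SpaceBio-Bench v9 Public Bulk Metadata Catalog Note\n\nStatus: `public_ready`\n"

errors: list[str] = []
require_contains(errors, "catalog note", note, "# SpaceBio-Bench v9 Public Bulk Metadata Catalog Note")
require_absent(errors, "catalog note", note, "Claim boundary")
require_absent(errors, "catalog note", note, "ready")  # substring of `public_ready`, so this one fails
print(errors)
```

If the real helpers also match substrings, short entries such as `"blocked"` or `"deferred"` would flag any longer word that contains them (for example "unblocked"), which may be worth keeping in mind when the forbidden list grows.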