Merged
19 changes: 9 additions & 10 deletions docs/SPACEBIOBENCH_EVALUATION_CARD.md
@@ -34,16 +34,15 @@ excluded from the current public-review path.

## Evaluation Flow

-```mermaid
-flowchart LR
-A["Source inventory"] --> B["Task manifest"]
-B --> C["Held-out mission fold"]
-C --> D["Baseline or submitted predictions"]
-D --> E["Metrics with task/fold ids"]
-E --> F["Per-task interpretation"]
-F --> G["Pooled summary with caveats"]
-G --> H["Claim register language"]
-```
+| Stage | Evidence to inspect | Interpretation control |
+|---|---|---|
+| 1. Source inventory | OSDR accessions, tissue labels, mission labels, access status, and checksum-manifest evidence | Confirms the public data source before interpreting any score |
+| 2. Task manifest | Task id, tissue, feature namespace, source ids, label map, and metric ids | Defines what the evaluation is actually testing |
+| 3. Held-out mission fold | Train/test mission split, row counts, and selected-gene counts | Keeps mission-held-out validation separate from random-split performance |
+| 4. Prediction and metric files | Baseline or submitted predictions, task/fold ids, AUROC, macro-F1, balanced accuracy, calibration | Ties every metric to a concrete task and fold surface |
+| 5. Per-task interpretation | Tissue-specific and fold-specific behavior | Prevents pooled means from hiding failures or confounding |
+| 6. Pooled summary | Aggregate result only after task/fold checks | Allows navigation-level summaries with mission, tissue, baseline, and payload caveats |
+| 7. Claim register language | Allowed, blocked, and future-only wording | Converts evaluation evidence into release-safe public claims |

The evaluation flow is intentionally claim-aware. A score is first interpreted
at the task and fold level, then summarized only with caveats about mission,
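The ordering described in the evaluation card (score each task and fold first, pool only afterwards) can be illustrated with a minimal Python sketch. This is not the repository's evaluation code: the row keys `task_id`, `fold_id`, `y_true`, and `y_pred` and the function names are assumptions made for illustration, and balanced accuracy is used only because it is one of the metrics the card lists.

```python
from collections import defaultdict
from statistics import mean


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; classes are taken from y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return mean(hits[c] / totals[c] for c in totals)


def summarize(rows):
    """Score every (task, fold) pair first, then build a pooled summary.

    rows: dicts with hypothetical keys task_id, fold_id, y_true, y_pred.
    """
    per_fold = {
        (r["task_id"], r["fold_id"]): balanced_accuracy(r["y_true"], r["y_pred"])
        for r in rows
    }
    # Keep the worst fold next to the pooled mean so the average cannot hide it.
    worst_key = min(per_fold, key=per_fold.get)
    pooled = {
        "mean_balanced_accuracy": mean(per_fold.values()),
        "worst_fold": worst_key,
        "worst_balanced_accuracy": per_fold[worst_key],
        "n_folds": len(per_fold),
    }
    return per_fold, pooled
```

Reporting the worst fold alongside the mean is one simple way to keep a pooled number from masking a failing held-out mission; the card's own caveat fields (mission, tissue, baseline, payload) would sit next to these values.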
17 changes: 7 additions & 10 deletions docs/SPACEBIOBENCH_SYSTEM_CARD.md
@@ -43,16 +43,13 @@ The project currently has multiple surfaces with different maturity levels:

## System Boundary Map

-```mermaid
-flowchart LR
-A["Public NASA OSDR sources"] --> B["Task manifests and fold definitions"]
-B --> C["Baseline runs and result summaries"]
-C --> D["System, evaluation, release, and claim cards"]
-D --> E["Allowed benchmark claims"]
-D --> F["Blocked clinical, crew-health, countermeasure, and Mars-regime claims"]
-B --> G["v9 metadata-alpha scaffold"]
-G --> H["Payload hashing pending"]
-```
+| Boundary layer | Evidence entering the layer | What the current card allows | What remains blocked |
+|---|---|---|---|
+| Source layer | Public NASA OSDR sources, source inventory rows, OSDR API evidence, checksum-manifest evidence | Public source/provenance claims with accession-level traceability | Claims about private, controlled, or non-public human sequence data |
+| Task layer | Task manifests, fold definitions, held-out mission labels, feature namespaces | Mission-held-out benchmark task claims when task and fold ids are named | Treating mission labels as pure biology or operational readiness evidence |
+| Result layer | Baseline runs, metric files, prediction rows, v7.1 canonical result summaries | Benchmark and workflow-evidence claims tied to the correct release surface | Mixed-surface leaderboard, model-superiority, or biological mechanism claims |
+| Transparency layer | System card, evaluation card, release readiness card, and claim register | Allowed benchmark claims with explicit scope and caveats | Blocked clinical, crew-health, countermeasure, and Mars-regime claims |
+| v9 metadata-alpha layer | Public bulk task/source/provenance scaffold and baseline anchors | Metadata-alpha and scaffold-baseline language | Frozen payload release claims until payload hashing and release gates pass |

This map shows the boundary the cards enforce: benchmark evidence can support
task, fold, metric, provenance, and release-readiness claims, but it cannot
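As a rough illustration of how the blocked-claim boundary might be screened mechanically, the sketch below flags draft wording that mentions any of the four blocked claim families named in the system card. It is a naive keyword check written for this note, not the project's claim register logic; the constant `BLOCKED_TERMS` and the function `blocked_terms_in` are hypothetical names.

```python
# Hypothetical keyword list, taken from the blocked-claim wording in the system card.
BLOCKED_TERMS = ("clinical", "crew-health", "countermeasure", "mars-regime")


def blocked_terms_in(claim: str) -> list[str]:
    """Return the blocked terms that appear in a draft claim (case-insensitive)."""
    lowered = claim.lower()
    return [term for term in BLOCKED_TERMS if term in lowered]
```

A claim such as "Mission-held-out AUROC on a named task and fold" passes this screen, while wording that mentions crew-health or countermeasure design is flagged for manual review. A substring match like this over-triggers and under-triggers easily, so it could only ever be a first filter in front of a human read of the claim register.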
18 changes: 8 additions & 10 deletions docs/SPACEBIOBENCH_TRANSPARENCY_CARD_PACK.md
@@ -39,16 +39,14 @@ extension lanes remain outside this public-review path.

## Three-Minute Review Map

-```mermaid
-flowchart LR
-A["Portfolio brief"] --> B["System card"]
-B --> C["Evaluation card"]
-C --> D["Release readiness card"]
-D --> E["Claim register"]
-B --> F["Data and provenance boundary"]
-C --> G["Task, fold, metric, and baseline interpretation"]
-E --> H["Allowed, blocked, and future-only language"]
-```
+| Review step | Open this | What to verify |
+|---|---|---|
+| 1 | [Portfolio brief](SPACEBIOBENCH_PORTFOLIO_BRIEF.md) | Project contribution, role-relevant signal, and concise application summary |
+| 2 | [System card](SPACEBIOBENCH_SYSTEM_CARD.md) | Benchmark scope, data surfaces, provenance boundary, and out-of-scope claims |
+| 3 | [Evaluation card](SPACEBIOBENCH_EVALUATION_CARD.md) | Task, fold, metric, baseline, and pooled-summary interpretation |
+| 4 | [Release readiness card](SPACEBIOBENCH_RELEASE_READINESS_CARD.md) | Release tier, evidence gates, and blockers for stronger public wording |
+| 5 | [Claim register](SPACEBIOBENCH_CLAIM_REGISTER.md) | Allowed wording, blocked wording, support level, and future-only claims |
+| Cross-check | [Canonical v7.1 results](CANONICAL_RESULTS_V7_1.md) and [v9 dataset card draft](v9_hf_dataset_card.md) | Whether a statement belongs to the canonical result surface, metadata-alpha scaffold, or a future release lane |

The intended reading order is portfolio brief first, then the system card,
evaluation card, release readiness card, and claim register. This keeps the
2 changes: 1 addition & 1 deletion docs/hf_dataset_card.md
@@ -16,7 +16,7 @@ tags:
- single-cell
- spatial-transcriptomics
size_categories:
- - 1GB<n<10GB
+ - 100M<n<1GB
language:
- en
pretty_name: "GeneLab Spaceflight Transcriptomics Benchmark"
7 changes: 6 additions & 1 deletion tests/test_review_fixes.py
@@ -318,6 +318,7 @@ def test_public_release_metadata_uses_v7_consistently(self):
         self.assertIn("note = {v7.1.2 documentation, public-card, metadata, and evidence-visibility patch over canonical v7.1 results; data freeze 2026-03-01}", readme)
         self.assertIn("Version: v7.1.2 public-card/metadata/evidence-visibility patch | Canonical results: v7.1 | Dataset freeze: 2026-03-01", hf_card)
         self.assertIn("note = {v7.1.2 documentation, public-card, metadata, and evidence-visibility patch over canonical v7.1 results; data freeze 2026-03-01}", hf_card)
+        self.assertIn(" - 100M<n<1GB", hf_card)
         self.assertIn('version: "7.1.2"', citation)
         self.assertIn('date-released: "2026-06-05"', citation)
         self.assertIn('notes: "Manuscript in preparation; v7.1.2 documentation, public-card, metadata, and evidence-visibility patch."', citation)
@@ -328,6 +329,7 @@ def test_public_release_metadata_uses_v7_consistently(self):
         self.assertNotIn("Kang", citation)
         self.assertNotIn("Jaeyoung", citation)
         self.assertNotIn("Jihoon", readme + hf_card + citation)
+        self.assertNotIn("1GB<n<10GB", hf_card)
         self.assertNotIn("blob/v3/docs/SPACEBIOBENCH", hf_card)
         self.assertNotIn('version: "5.0.0"', citation)
         self.assertNotIn("Target journal:", citation)
@@ -338,12 +340,15 @@ def test_public_card_pack_includes_visual_review_path(self):
         evaluation_card = self.read_repo_text("docs/SPACEBIOBENCH_EVALUATION_CARD.md")

         self.assertIn("## Three-Minute Review Map", card_pack)
-        self.assertIn("```mermaid", card_pack)
+        self.assertIn("| Review step | Open this | What to verify |", card_pack)
         self.assertIn("[docs/SPACEBIOBENCH_SYSTEM_CARD.md](SPACEBIOBENCH_SYSTEM_CARD.md)", card_pack)
         self.assertIn("## System Boundary Map", system_card)
+        self.assertIn("| Boundary layer | Evidence entering the layer | What the current card allows | What remains blocked |", system_card)
         self.assertIn("Blocked clinical, crew-health, countermeasure, and Mars-regime claims", system_card)
         self.assertIn("## Evaluation Flow", evaluation_card)
+        self.assertIn("| Stage | Evidence to inspect | Interpretation control |", evaluation_card)
         self.assertIn("Claim register language", evaluation_card)
+        self.assertNotIn("```mermaid", card_pack + system_card + evaluation_card)

     def test_public_v9_metadata_alpha_subset_is_inspectable(self):
         readme = self.read_repo_text("README.md")
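The updated assertions check for exact header strings and for the absence of a Mermaid fence. A slightly more tolerant variant could parse the Markdown instead of matching substrings. The helpers below are a sketch under that assumption; `fence_languages` and `has_table_header` are hypothetical names and are not part of the repository's test suite.

```python
import re

# Matches an opening fence that names a language, e.g. a line reading ```mermaid
FENCE_OPEN = re.compile(r"^```(\w+)\s*$", re.MULTILINE)


def fence_languages(markdown: str) -> set[str]:
    """Return the set of languages named on opening code fences."""
    return set(FENCE_OPEN.findall(markdown))


def has_table_header(markdown: str, columns: list[str]) -> bool:
    """True if some pipe-table row has exactly these cells, ignoring cell padding."""
    expected = [c.strip() for c in columns]
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|"):
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            if cells == expected:
                return True
    return False
```

With helpers like these, a test could assert `"mermaid" not in fence_languages(card)` and `has_table_header(card, ["Stage", "Evidence to inspect", "Interpretation control"])`, which would survive whitespace-only edits to the table header that an exact `assertIn` string would reject.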