diff --git a/docs/source/_static/capsule_template_screenshot.png b/docs/source/_static/capsule_template_screenshot.png new file mode 100644 index 0000000..3f87453 Binary files /dev/null and b/docs/source/_static/capsule_template_screenshot.png differ diff --git a/docs/source/_static/clone_via_git_screenshot.png b/docs/source/_static/clone_via_git_screenshot.png new file mode 100644 index 0000000..2f16c7b Binary files /dev/null and b/docs/source/_static/clone_via_git_screenshot.png differ diff --git a/docs/source/_static/import_to_github_screenshot.png b/docs/source/_static/import_to_github_screenshot.png new file mode 100644 index 0000000..f07adcb Binary files /dev/null and b/docs/source/_static/import_to_github_screenshot.png differ diff --git a/docs/source/_static/new_repo_screenshot.png b/docs/source/_static/new_repo_screenshot.png new file mode 100644 index 0000000..c9cc850 Binary files /dev/null and b/docs/source/_static/new_repo_screenshot.png differ diff --git a/docs/source/acquire_upload/acquire_data.md b/docs/source/acquire_upload/acquire_data.md index 4fc18ee..9e63d0d 100644 --- a/docs/source/acquire_upload/acquire_data.md +++ b/docs/source/acquire_upload/acquire_data.md @@ -2,7 +2,8 @@ During data acquisition you are responsible for running version-controlled acquisition software and ensuring your data files for each modality are organized according to standardized conventions. -Metadata generated during acquisition captures **what data** should appear in the final NWB files after processing, as well as **what manipulations** were performed (both behavioral stimuli and any procedures). +Metadata generated during acquisition captures how data was acquired. This includes what data streams are being recorded, what stimuli or behaviors (if any) are used, and any manipulations (procedures) that occur during the session. 
+ ## Data @@ -47,4 +48,3 @@ Some tasks are being run on a standardized platform using Bonsai and Harp for da |------|-----------| | VrForaging | https://github.com/AllenNeuralDynamics/Aind.Behavior.VrForaging | | IsoForce | https://github.com/AllenNeuralDynamics/Aind.Behavior.IsoForce | - diff --git a/docs/source/acquire_upload/calibration.md b/docs/source/acquire_upload/calibration.md index c18e359..62b8f2f 100644 --- a/docs/source/acquire_upload/calibration.md +++ b/docs/source/acquire_upload/calibration.md @@ -8,6 +8,66 @@ Common calibrations include measuring power output (e.g. for lasers) with [Power Calibrations have an option to include fit parameters, if your calibration fit is not available in the [FitType](https://aind-data-schema.readthedocs.io/en/latest/components/measurements.html#fittype) options please request that we add it by opening an [issue](https://github.com/AllenNeuralDynamics/aind-data-schema/issues). -## Instrument testing +## Testing -When collecting a *test data asset* on an instrument using a "calibration object" instead of a subject or specimen, you should set `subject_id = "calibration"` in all metadata files. Please also use the [CalibrationObject](https://aind-data-schema.readthedocs.io/en/latest/components/subjects.html#calibrationobject) in the `Subject.subject_details` to track information about the physical object used during calibration. +Use `subject_id="calibration"` to mark data assets as test assets. When a "phantom" or other calibration object is used during testing please provide details about that object in a `CalibrationObject`. Before uploading calibration assets you need to ensure that your asset and metadata are compatible with the processing pipelines that will be run downstream. + +Note that test assets that are not intended to be kept long-term should be immediately (or as soon as feasible) marked as archived on Code Ocean. Archived assets are deleted when unused for 30 days. 
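The `subject_id` convention above can be checked before upload with a short script. This is a minimal sketch, not part of any AIND package: the helper name `find_mismatched_subject_ids`, the list of file names, and the assumption that each file is JSON with a top-level `subject_id` key are all illustrative.

```python
import json
from pathlib import Path

# Assumption: these metadata files carry a top-level "subject_id" key.
METADATA_FILES = ("subject.json", "procedures.json", "data_description.json")


def find_mismatched_subject_ids(asset_dir, expected="calibration"):
    """Return {file name: found subject_id} for files that do not match `expected`."""
    mismatches = {}
    for name in METADATA_FILES:
        path = Path(asset_dir) / name
        if not path.exists():
            continue  # Missing files are skipped here, not reported.
        with path.open() as f:
            found = json.load(f).get("subject_id")
        if found != expected:
            mismatches[name] = found
    return mismatches
```

An empty result means every metadata file that was found uses the expected `subject_id`; it says nothing about whether the asset is otherwise compatible with downstream pipelines.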
+ +### Manual calibration metadata + +If the processing pipeline that will run on your test data asset **requires certain fields in the subject or procedures metadata to be set** you need to create an actual `subject.json` and `procedures.json` and upload these alongside your data asset. + +```{code-block} python +from aind_data_schema.core.subject import Subject +from aind_data_schema.core.procedures import Procedures +from aind_data_schema.components.subjects import CalibrationObject +from aind_data_schema.components.devices import Device +from aind_data_schema_models.organizations import Organization + +subject_id = "calibration" + +subject = Subject( + subject_id=subject_id, + subject_details=CalibrationObject( + empty=True, + description="FIP calibration", + ), +) + +procedures = Procedures(subject_id=subject_id) + +subject.write_standard_file() +procedures.write_standard_file() +``` + +### Automated calibration metadata + +If the processing pipeline that will run on your data asset **does not read the subject and/or procedures metadata** you can follow the instructions below to create empty subject and procedures files. + +Note that this automation is only available if your job_type runs `aind-metadata-mapper>=1.3.0`. Please include as much detail as possible about your [CalibrationObject](https://aind-data-schema.readthedocs.io/en/latest/components/subjects.html#calibrationobject) in the upload settings. 
+ +```{code-block} python +from aind_data_schema.components.subjects import CalibrationObject +from aind_data_schema_models.modalities import Modality +from aind_metadata_mapper.gather_metadata import GatherMetadataJob +from aind_metadata_mapper.models import JobSettings, DataDescriptionSettings, SubjectSettings + +job_settings = JobSettings( + output_dir="/path/to/output", + subject_id="calibration", + data_description_settings=DataDescriptionSettings( + project_name="", + modalities=[Modality.ECEPHYS], + ), + subject_settings=SubjectSettings( + calibration_object=CalibrationObject( + description="Neuropixels dummy probe", + empty=False, + ) + ), +) + +job = GatherMetadataJob(job_settings=job_settings) +job.run_job() +``` diff --git a/docs/source/acquire_upload/prepare_before_acquisition.md b/docs/source/acquire_upload/prepare_before_acquisition.md index 73fd2d6..e825a3e 100644 --- a/docs/source/acquire_upload/prepare_before_acquisition.md +++ b/docs/source/acquire_upload/prepare_before_acquisition.md @@ -10,7 +10,8 @@ You are ready to generate data when: ## Project name -Your *project* and *subproject* (if applicable) needs to be accurate. The full project name ` - ` is tied directly with the funding and investigator metadata. The list of project names can be viewed at the [metadata-service project_names/ endpoint](https://aind-metadata-service/api/v2/project_names). Projects that do not have metadata in the metadata-service must upload their own `data_description.json` -- reach out to Scientific Computing for help. +Your *project* and *subproject* (if applicable) need to be accurate. The full project name ` - ` is tied directly to the funding and investigator metadata. The list of project names can be viewed at the [metadata-service project_names/ endpoint](https://aind-metadata-service/api/v2/project_names). 
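The project-name lookup described above can also be scripted. The following is a rough sketch only: the URL is the one given in the text and resolves only on the internal network, the helper names are hypothetical, and the response shape (either a bare list or a `{"data": [...]}` wrapper, mirroring the `response.data || response` handling in this page's JavaScript) is an assumption rather than a documented contract.

```python
import json
from urllib.request import urlopen

# URL taken from the text above; only reachable on the internal network.
PROJECT_NAMES_URL = "https://aind-metadata-service/api/v2/project_names"


def extract_project_names(payload):
    """Accept either a bare list or a {"data": [...]} wrapper (assumed shapes)."""
    if isinstance(payload, dict):
        payload = payload.get("data", [])
    return list(payload)


def is_known_project(name, project_names):
    """Exact, case-sensitive match against the list returned by the service."""
    return name in project_names


def fetch_project_names(url=PROJECT_NAMES_URL, timeout=10):
    """Download and parse the project-name list (not called at import time)."""
    with urlopen(url, timeout=timeout) as response:
        return extract_project_names(json.load(response))
```

Separating the parsing from the network call keeps the matching logic usable offline, for example against a saved copy of the response.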
+Projects that are not listed in the metadata-service must provide their own `data_description.json` at upload, including funding and investigator fields. Reach out to Scientific Computing for help. If you need a new project name, please request that it be added with the [project name and funding intake form](https://app.smartsheet.com/b/form/9f366857582b4db98d1fe41ef724a613). @@ -20,7 +21,7 @@ The funding endpoint will be used during data upload to populate your data descr ```{raw} html
- + @@ -83,7 +84,7 @@ The funding endpoint will be used during data upload to populate your data descr fetch('https://aind-metadata-service/api/v2/funding/' + encodeURIComponent(projectName)) .then(response => { if (!response.ok) { - throw new Error('HTTP error! status: ' + response.status); + return response.text().then(text => { throw new Error(text || 'HTTP error! status: ' + response.status); }); } return response.json(); }) @@ -174,7 +175,7 @@ The investigators endpoint will be used during data upload to populate your data fetch('https://aind-metadata-service/api/v2/investigators/' + encodeURIComponent(projectName)) .then(response => { if (!response.ok) { - throw new Error('HTTP error! status: ' + response.status); + return response.text().then(text => { throw new Error(text || 'HTTP error! status: ' + response.status); }); } return response.json(); }) @@ -265,17 +266,19 @@ Subject metadata is populated by lab animal services (LAS) without your involvem fetch('https://aind-metadata-service/api/v2/subject/' + encodeURIComponent(subjectId)) .then(response => { - if (!response.ok) { - throw new Error('HTTP error! status: ' + response.status); + if (!response.ok && response.status !== 400) { + return response.text().then(text => { throw new Error(text || 'HTTP error! status: ' + response.status); }); } - return response.json(); + return response.json().then(data => ({ data, status: response.status })); }) - .then(response => { - const data = response.data || response; - resultDiv.style.backgroundColor = '#d4edda'; - resultDiv.style.border = '1px solid #28a745'; - resultDiv.innerHTML = 'Subject Information:
' + 
-                              JSON.stringify(data, null, 2) + '
'; + .then(({ data, status }) => { + const subject = data.data || data; + const isInvalid = status === 400; + resultDiv.style.backgroundColor = isInvalid ? '#fff3cd' : '#d4edda'; + resultDiv.style.border = isInvalid ? '1px solid #ffc107' : '1px solid #28a745'; + resultDiv.innerHTML = (isInvalid ? 'Warning: subject data failed schema validation:' : 'Subject Information:') + + '
' +
+                              JSON.stringify(subject, null, 2) + '
'; }) .catch(error => { resultDiv.style.backgroundColor = '#f8d7da'; @@ -394,7 +397,7 @@ Currently, only NSB procedures are automatically attached to data assets during ### Custom procedures -Custom [Procedures](https://aind-data-schema.readthedocs.io/en/latest/procedures.html) require you to generate a `procedures.json` file manually. Note that the `data-transfer-service` will **NOT** merge your procedures with any stored in NSB, you must pull the NSB procedures and manually merge them ahead of time, please reach out to Scientific Computing for help with this process. +Custom [Procedures](https://aind-data-schema.readthedocs.io/en/latest/procedures.html) require you to generate a `procedures.json` file manually. Please only provide metadata for procedures that are not stored by NSB. ### NSB procedures @@ -456,17 +459,19 @@ Standardized procedures that are performed by NSB (link?) are uploaded and acces fetch('https://aind-metadata-service/api/v2/procedures/' + encodeURIComponent(subjectId)) .then(response => { - if (!response.ok) { - throw new Error('HTTP error! status: ' + response.status); + if (!response.ok && response.status !== 400) { + return response.text().then(text => { throw new Error(text || 'HTTP error! status: ' + response.status); }); } - return response.json(); + return response.json().then(data => ({ data, status: response.status })); }) - .then(response => { - const data = response.data || response; - resultDiv.style.backgroundColor = '#d4edda'; - resultDiv.style.border = '1px solid #28a745'; - resultDiv.innerHTML = 'Procedures Information:
' + 
-                              JSON.stringify(data, null, 2) + '
'; + .then(({ data, status }) => { + const procedures = data.data || data; + const isInvalid = status === 400; + resultDiv.style.backgroundColor = isInvalid ? '#fff3cd' : '#d4edda'; + resultDiv.style.border = isInvalid ? '1px solid #ffc107' : '1px solid #28a745'; + resultDiv.innerHTML = (isInvalid ? 'Warning: procedures data failed schema validation:' : 'Procedures Information:') + + '
' +
+                              JSON.stringify(procedures, null, 2) + '
'; }) .catch(error => { resultDiv.style.backgroundColor = '#f8d7da'; diff --git a/docs/source/acquire_upload/process_data.md b/docs/source/acquire_upload/process_data.md index d4fb2a3..d90375c 100644 --- a/docs/source/acquire_upload/process_data.md +++ b/docs/source/acquire_upload/process_data.md @@ -2,7 +2,8 @@ Scientific computing is currently re-organizing pipelines to be per-modality, rather than per-project. -Pipeline development requirements are documented in [Pipeline development](../policies_practices/pipeline_development.md). +Pipeline development requirements are documented in [Pipeline development](../policies_practices/platform_support.md#pipeline-development) +and the pipeline versioning policy is documented in [Versioning pipelines](../policies_practices/version_pipelines.md). ## Per-modality physiology pipelines diff --git a/docs/source/aind/core_services.md b/docs/source/aind/core_services.md index 45a1d00..a5e222f 100644 --- a/docs/source/aind/core_services.md +++ b/docs/source/aind/core_services.md @@ -1,5 +1,7 @@ # Core Services +The interactions between many of these services are illustrated in the [AIND software diagrams](./diagrams.md). + **aind-data-transfer-service** FastAPI service to run data compression and transfer jobs on the HPC diff --git a/docs/source/aind/diagrams.md b/docs/source/aind/diagrams.md new file mode 100644 index 0000000..a3f1b84 --- /dev/null +++ b/docs/source/aind/diagrams.md @@ -0,0 +1,11 @@ +# AIND Software and Systems Diagrams + +This page contains diagrams illustrating the interactions between AIND software and systems, including core services, data storage, and compute resources. These diagrams are intended to provide a high-level overview of how different components fit together, and will be updated periodically as our software and systems evolve. 
+ +**New diagrams coming soon for future plans and low-level architecture.** + +## High-level architecture + +![High-level data flow](../diagrams/high_level/general_data_flow.drawio.svg) + +![AIND Software Overview](../_static/aind-software-overview.png) \ No newline at end of file diff --git a/docs/source/conf.py b/docs/source/conf.py index c004ad9..391a2e1 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -33,6 +33,7 @@ "sphinx.ext.autodoc", "sphinx.ext.napoleon", "sphinx_tippy", + "sphinx_copybutton", "myst_parser", ] templates_path = ["_templates"] diff --git a/diagrams/dynamic_foraging/dynamic_foraging_architecture.drawio b/docs/source/diagrams/dynamic_foraging/dynamic_foraging_architecture.drawio similarity index 100% rename from diagrams/dynamic_foraging/dynamic_foraging_architecture.drawio rename to docs/source/diagrams/dynamic_foraging/dynamic_foraging_architecture.drawio diff --git a/diagrams/dynamic_foraging/dynamic_foraging_architecture.svg b/docs/source/diagrams/dynamic_foraging/dynamic_foraging_architecture.svg similarity index 100% rename from diagrams/dynamic_foraging/dynamic_foraging_architecture.svg rename to docs/source/diagrams/dynamic_foraging/dynamic_foraging_architecture.svg diff --git a/diagrams/dynamic-foraging/low_level/dynamic-foraging-low-level-processing.drawio b/docs/source/diagrams/dynamic_foraging/low_level/dynamic-foraging-low-level-processing.drawio similarity index 100% rename from diagrams/dynamic-foraging/low_level/dynamic-foraging-low-level-processing.drawio rename to docs/source/diagrams/dynamic_foraging/low_level/dynamic-foraging-low-level-processing.drawio diff --git a/diagrams/dynamic-foraging/low_level/dynamic-foraging-low-level-processing.svg b/docs/source/diagrams/dynamic_foraging/low_level/dynamic-foraging-low-level-processing.svg similarity index 100% rename from diagrams/dynamic-foraging/low_level/dynamic-foraging-low-level-processing.svg rename to 
docs/source/diagrams/dynamic_foraging/low_level/dynamic-foraging-low-level-processing.svg diff --git a/diagrams/high_level/general_data_flow.drawio b/docs/source/diagrams/high_level/general_data_flow.drawio similarity index 100% rename from diagrams/high_level/general_data_flow.drawio rename to docs/source/diagrams/high_level/general_data_flow.drawio diff --git a/diagrams/high_level/general_data_flow.drawio.svg b/docs/source/diagrams/high_level/general_data_flow.drawio.svg similarity index 100% rename from diagrams/high_level/general_data_flow.drawio.svg rename to docs/source/diagrams/high_level/general_data_flow.drawio.svg diff --git a/diagrams/low_level/aind-data-transfer-service/aind-data-transfer-service-k8s.drawio b/docs/source/diagrams/low_level/aind-data-transfer-service/aind-data-transfer-service-k8s.drawio similarity index 100% rename from diagrams/low_level/aind-data-transfer-service/aind-data-transfer-service-k8s.drawio rename to docs/source/diagrams/low_level/aind-data-transfer-service/aind-data-transfer-service-k8s.drawio diff --git a/diagrams/low_level/aind-metadata-service/aind-metadata-service-k8s.drawio b/docs/source/diagrams/low_level/aind-metadata-service/aind-metadata-service-k8s.drawio similarity index 100% rename from diagrams/low_level/aind-metadata-service/aind-metadata-service-k8s.drawio rename to docs/source/diagrams/low_level/aind-metadata-service/aind-metadata-service-k8s.drawio diff --git a/diagrams/low_level/aind-metadata-service/aind-metadata-service-k8s.drawio.svg b/docs/source/diagrams/low_level/aind-metadata-service/aind-metadata-service-k8s.drawio.svg similarity index 100% rename from diagrams/low_level/aind-metadata-service/aind-metadata-service-k8s.drawio.svg rename to docs/source/diagrams/low_level/aind-metadata-service/aind-metadata-service-k8s.drawio.svg diff --git a/diagrams/low_level/aind-metadata-service/metadata_service_overview_diagram.drawio 
b/docs/source/diagrams/low_level/aind-metadata-service/metadata_service_overview_diagram.drawio similarity index 100% rename from diagrams/low_level/aind-metadata-service/metadata_service_overview_diagram.drawio rename to docs/source/diagrams/low_level/aind-metadata-service/metadata_service_overview_diagram.drawio diff --git a/diagrams/low_level/aind-metadata-service/metadata_service_overview_diagram.drawio.svg b/docs/source/diagrams/low_level/aind-metadata-service/metadata_service_overview_diagram.drawio.svg similarity index 100% rename from diagrams/low_level/aind-metadata-service/metadata_service_overview_diagram.drawio.svg rename to docs/source/diagrams/low_level/aind-metadata-service/metadata_service_overview_diagram.drawio.svg diff --git a/diagrams/low_level/aind-metadata-service/metadata_service_procedures_extractor_diagram.drawio b/docs/source/diagrams/low_level/aind-metadata-service/metadata_service_procedures_extractor_diagram.drawio similarity index 100% rename from diagrams/low_level/aind-metadata-service/metadata_service_procedures_extractor_diagram.drawio rename to docs/source/diagrams/low_level/aind-metadata-service/metadata_service_procedures_extractor_diagram.drawio diff --git a/diagrams/low_level/aind-metadata-service/metadata_service_procedures_extractor_diagram.drawio.svg b/docs/source/diagrams/low_level/aind-metadata-service/metadata_service_procedures_extractor_diagram.drawio.svg similarity index 100% rename from diagrams/low_level/aind-metadata-service/metadata_service_procedures_extractor_diagram.drawio.svg rename to docs/source/diagrams/low_level/aind-metadata-service/metadata_service_procedures_extractor_diagram.drawio.svg diff --git a/diagrams/low_level/asset_registration_api.drawio b/docs/source/diagrams/low_level/asset_registration_api.drawio similarity index 100% rename from diagrams/low_level/asset_registration_api.drawio rename to docs/source/diagrams/low_level/asset_registration_api.drawio diff --git 
a/diagrams/low_level/asset_registration_api.svg b/docs/source/diagrams/low_level/asset_registration_api.svg similarity index 100% rename from diagrams/low_level/asset_registration_api.svg rename to docs/source/diagrams/low_level/asset_registration_api.svg diff --git a/diagrams/low_level/data_asset_indexer.drawio b/docs/source/diagrams/low_level/data_asset_indexer.drawio similarity index 100% rename from diagrams/low_level/data_asset_indexer.drawio rename to docs/source/diagrams/low_level/data_asset_indexer.drawio diff --git a/diagrams/low_level/data_asset_indexer.svg b/docs/source/diagrams/low_level/data_asset_indexer.svg similarity index 100% rename from diagrams/low_level/data_asset_indexer.svg rename to docs/source/diagrams/low_level/data_asset_indexer.svg diff --git a/diagrams/low_level/data_schema_to_docdb.drawio b/docs/source/diagrams/low_level/data_schema_to_docdb.drawio similarity index 100% rename from diagrams/low_level/data_schema_to_docdb.drawio rename to docs/source/diagrams/low_level/data_schema_to_docdb.drawio diff --git a/diagrams/low_level/data_schema_to_docdb.svg b/docs/source/diagrams/low_level/data_schema_to_docdb.svg similarity index 100% rename from diagrams/low_level/data_schema_to_docdb.svg rename to docs/source/diagrams/low_level/data_schema_to_docdb.svg diff --git a/diagrams/low_level/docdb_api.drawio b/docs/source/diagrams/low_level/docdb_api.drawio similarity index 100% rename from diagrams/low_level/docdb_api.drawio rename to docs/source/diagrams/low_level/docdb_api.drawio diff --git a/diagrams/low_level/docdb_api.svg b/docs/source/diagrams/low_level/docdb_api.svg similarity index 100% rename from diagrams/low_level/docdb_api.svg rename to docs/source/diagrams/low_level/docdb_api.svg diff --git a/diagrams/low_level/redshift_client.drawio b/docs/source/diagrams/low_level/redshift_client.drawio similarity index 100% rename from diagrams/low_level/redshift_client.drawio rename to docs/source/diagrams/low_level/redshift_client.drawio diff 
--git a/diagrams/low_level/redshift_client.svg b/docs/source/diagrams/low_level/redshift_client.svg similarity index 100% rename from diagrams/low_level/redshift_client.svg rename to docs/source/diagrams/low_level/redshift_client.svg diff --git a/docs/source/diagrams/mid_level/QC.drawio b/docs/source/diagrams/mid_level/QC.drawio new file mode 100644 index 0000000..23c37df --- /dev/null +++ b/docs/source/diagrams/mid_level/QC.drawio @@ -0,0 +1,115 @@ [115 added lines of draw.io XML not shown] diff --git a/docs/source/diagrams/mid_level/QC.drawio.svg b/docs/source/diagrams/mid_level/QC.drawio.svg new file mode 100644 index 0000000..0330332 --- /dev/null +++ b/docs/source/diagrams/mid_level/QC.drawio.svg @@ -0,0 +1,4 @@ [4 added lines of SVG markup not shown]
[Text labels recovered from QC.drawio.svg: Static Data Stores: Metadata (DocDB), Data Assets (S3); Static QC Portal (data.allenneuraldynamics.org/qc); Editable QC Portal (qc.allenneuraldynamics.org/view); Microsoft Entra (OAuth); Data Consumers; Processes that transform data. Flow labels: QC metadata; Reference media; User QCMetric updates; User review submission.]
\ No newline at end of file diff --git a/diagrams/mid_level/codeocean_pipeline_diagram.drawio b/docs/source/diagrams/mid_level/codeocean_pipeline_diagram.drawio similarity index 100% rename from diagrams/mid_level/codeocean_pipeline_diagram.drawio rename to docs/source/diagrams/mid_level/codeocean_pipeline_diagram.drawio diff --git a/diagrams/mid_level/codeocean_pipeline_diagram.svg b/docs/source/diagrams/mid_level/codeocean_pipeline_diagram.svg similarity index 100% rename from diagrams/mid_level/codeocean_pipeline_diagram.svg rename to docs/source/diagrams/mid_level/codeocean_pipeline_diagram.svg diff --git a/diagrams/mid_level/local_data_center_flow.drawio b/docs/source/diagrams/mid_level/local_data_center_flow.drawio similarity index 100% rename from diagrams/mid_level/local_data_center_flow.drawio rename to docs/source/diagrams/mid_level/local_data_center_flow.drawio diff --git a/diagrams/mid_level/local_data_center_flow.drawio.svg b/docs/source/diagrams/mid_level/local_data_center_flow.drawio.svg similarity index 100% rename from diagrams/mid_level/local_data_center_flow.drawio.svg rename to docs/source/diagrams/mid_level/local_data_center_flow.drawio.svg diff --git a/docs/source/diagrams/mid_level/modular-experiments.drawio b/docs/source/diagrams/mid_level/modular-experiments.drawio new file mode 100644 index 0000000..73ff97e --- /dev/null +++ b/docs/source/diagrams/mid_level/modular-experiments.drawio @@ -0,0 +1,445 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/docs/source/diagrams/mid_level/modular-experiments.md b/docs/source/diagrams/mid_level/modular-experiments.md new file mode 100644 index 0000000..0bfa2e9 --- /dev/null +++ b/docs/source/diagrams/mid_level/modular-experiments.md @@ -0,0 +1,29 @@ +# Modular Platforms and Composable Experiments + +![Modular experiments architecture](modular-experiments.svg) + +The diagram describes how independent acquisition **platforms** are composed into a single **experiment** and carried through the AIND data pipeline. It follows the architecture proposed for the `Aind.Behavior.Services` framework (see *"Towards modular platforms and composable experiments"*). The example shown is a typical session combining **one behavior platform** (Behavior + Behavior Videos) with **two physiology platforms** (Fiber Photometry and Electrophysiology) acquired simultaneously. + +The core principle is a strict separation between **immutable** raw data, generated once at acquisition and never changed, and **derived** assets, which can always be regenerated by re-running a pipeline. Each platform is self-contained: it encapsulates and exposes, through well-documented interfaces, everything needed to run, describe (metadata), quality-control, and package its own data. Pipelines are designed to run on AIND cloud resources (VAST, AWS, CodeOcean) but must also be runnable locally and in isolation. + +## At the rig and on VAST — Session Data (Immutable) + +Each platform writes its raw data to the session folder on VAST, **platform-agnostically**, alongside its own `aind-data-schema`-compliant metadata (`aind-data-schema per platform`). 
Before the data is uploaded to AWS, an **aggregator/merger** runs on VAST to combine the per-platform metadata into a single `merged metadata per experiment` record. All of this acquired session data is treated as **immutable**. `aind-data-transfer` then stages the data from VAST to the cloud. + +## In the cloud (AWS) — Platform-scoped pipelines (Derived) + +Once data lands in AWS, processing runs in a **platform-scoped** fashion — one pipeline per platform (Behavior Pipeline, Fiber Photometry Pipeline, Electrophysiology Pipeline), each typically implemented as CodeOcean capsules. Every pipeline takes the raw data of a *single* platform and runs two families of routines, **quality control** and **data packaging**, producing `NWB per platform` outputs that are all **derived**. + +These CodeOcean capsules should be **thin wrappers** around well-maintained, versioned packages rather than the place where processing logic lives. CodeOcean (and AWS more broadly) is treated as one execution environment, not a hard dependency: the same packages must be runnable **in isolation outside the CodeOcean platform** — for example on a local machine during development, debugging, or dissemination. Keeping the logic in standalone packages and the capsule as a thin invocation layer is what makes the pipelines portable, testable, and maintainable across platforms. + +### Inside a single platform pipeline + +The inset zooms into one platform's pipeline and shows how raw and processed data flow into a per-platform NWB container: + +- **Persist Raw Data → NWB Container:** the immutable raw data is packaged into an NWB container with minimal conversion ("package raw data"). Keeping conversion minimal limits maintenance cost. +- **Process → Append processed data:** processing steps (e.g. filtering, trial parsing, parsing hardware events into behavior-relevant events) operate *on top of* the raw NWB container, and the resulting processed data is appended back into the same container. 
These intermediate products are treated as **ephemeral** — only raw and final processed data are persisted; the code that generates intermediates is documented and versioned instead. +- **Quality control appends to `qc.json`:** QC routines run at the appropriate scope and append their results to a shared `qc.json` (via the `aind-data-schema` quality-control model), optionally emitting other **artifacts**. + +## Experiment aggregation (Derived) + +After every platform has produced its own self-contained, derived NWB, a final **Experiment Aggregation Pipeline** merges the per-platform NWBs into a single `merged NWB per experiment`. This enables cross-platform, cross-modality analysis (for example, aligning physiology to behavior events). Like every other cloud output, this merged experiment NWB is **derived** and can be rehydrated at any time by re-running the upstream pipelines. diff --git a/docs/source/diagrams/mid_level/modular-experiments.svg b/docs/source/diagrams/mid_level/modular-experiments.svg new file mode 100644 index 0000000..51160c6 --- /dev/null +++ b/docs/source/diagrams/mid_level/modular-experiments.svg @@ -0,0 +1,3 @@ + + +
[Text labels recovered from modular-experiments.svg. Session Data (VAST), Immutable: Behavior, Behavior Videos, Fiber Photometry, Electrophysiology; aind-data-schema per platform; merged metadata per experiment. Transfer: aind-data-transfer. Cloud (AWS), Derived: Behavior Pipeline, Fiber Photometry Pipeline, Electrophysiology Pipeline; NWB per platform; Experiment Aggregation Pipeline; merged NWB per experiment. Inset for a single platform pipeline: Platform Data (Raw, Immutable); Persist Raw Data; Package Raw data; NWB Container; Process; Processed data (e.g. filtering, trial parsing, etc.); Append processed data; Data packaging; Primary data quality control (Device-level QC, Data contract QC); Processed data quality control (Task metrics QC); Appends to qc.json; Artifacts; Input to Experiment Aggregation Pipeline.]
\ No newline at end of file diff --git a/docs/source/explore_analyze/analyze_data.md b/docs/source/explore_analyze/analyze_data.md index 3548d20..4f90a53 100644 --- a/docs/source/explore_analyze/analyze_data.md +++ b/docs/source/explore_analyze/analyze_data.md @@ -1,9 +1,10 @@ (analyze-data)= # Analyze data -If you are new to Code Ocean you may find the 101 series in the [training resources](https://alleninstitute.sharepoint.com/sites/AWS/Shared%20Documents/Forms/AllItems.aspx?FolderCTID=0x012000A09B1ADCA192D64C99E1504DAB6FBD2F&id=%2Fsites%2FAWS%2FShared%20Documents%2FGeneral%2FTraining) helpful as an initial starting point. +## Code Ocean capsules -## Capsules +Most analysis happens in individual capsules on the [Code Ocean](https://codeocean.allenneuraldynamics.org/) platform. +We have documented [best practices and tips](co_best_practices.md) for working in that context. (analysis-framework)= ## Analysis framework diff --git a/docs/source/explore_analyze/co_best_practices.md b/docs/source/explore_analyze/co_best_practices.md new file mode 100644 index 0000000..29e053e --- /dev/null +++ b/docs/source/explore_analyze/co_best_practices.md @@ -0,0 +1,380 @@ + +# Best practices for working in Code Ocean + +If you are new to Code Ocean you may find the 101 series in the [training resources](https://alleninstitute.sharepoint.com/sites/AWS/Shared%20Documents/Forms/AllItems.aspx?FolderCTID=0x012000A09B1ADCA192D64C99E1504DAB6FBD2F&id=%2Fsites%2FAWS%2FShared%20Documents%2FGeneral%2FTraining) helpful as an initial starting point. + +## General guidance + +### Capsules vs Libraries + +- Capsules should have a narrow, well-defined scope and leverage + installable libraries as much as possible. Put another way, capsules + should configure libraries to answer a question with specific data. + +- Capsules can and should be re-usable, but ultimately the code in them + should be relatively simple and specific to the question at hand. 
+ Generally, useful functions should find their way into libraries that + can be installed into other capsules' environments. + +- The code within a capsule should call library functions in a specific + order or be dataset specific. + +- By placing code in GitHub libraries, it's easier for code to be + adopted by internal and external users for new purposes. + + +### Collaboration and GitHub + +- Capsules are git repositories. They can be easily synchronized with + GitHub if created with the "Clone from Git" option. Recommended + workflow: + + - On GitHub, use the + [aind-capsule-template](https://github.com/AllenNeuralDynamics/aind-capsule-template) + repository to create your own repo. + + - In Code Ocean, create a new capsule via the "Clone from Git" option. + Note that this will require github credentials to be added in your + CO account settings. + + - After running "commit changes" in the capsule, you will see a "sync" + button to sync your changes to (and others' changes from) github. + +- When developing code, collaboration is best done on GitHub, not within + Code Ocean. + + - Individual contributors can work from their own capsules cloned from + a shared github repository, and collaborate by syncing their own branches to github. + - Alternatively, contributors can use completely different capsules that both + depend on a single shared library stored on github, actively sharing changes there. + + +### Library Development + +You can use Code Ocean to work with a library that is actively being +developed. + +- To install a package from github in the environment builder, specify + it in the format + `git+https://github.com//.git#egg=` + (note that the package name appears twice). This will install and pin + the latest version (by its commit hash), which you can force it to + update on a future build by deleting the pinned version to move to the + latest again (as with other environment builder packages). 
+ +- To edit the library from within Code Ocean: the above method produces an "editable" + installation with source code in `/src`. You can access (and edit) this + by "Add folder to workspace" in code-server/VS Code (not possible in JupyterLab). + Be **very careful** to sync your changes back to github when you edit -- + *they will be erased when the capsule is rebuilt*! + +- Alternative approaches to developing a library: + - If you primarily develop the code in a single capsule but want to make it + available to other users and capsules: + + As long as the capsule is synced to github, you can make it installable as a library by simply + creating a pyproject.toml in the root of the capsule. Follow the typical capsule layout + with library modules in a subfolder of `code`, and point to that location from the pyproject.toml + (along with setting appropriate dependencies etc). + + - If library code doesn't need to be actively tested on cloud data: + + Develop the library locally, sync changes to github, then update any capsules + that rely on it: either reset the version and rebuild, or reinstall within a running + workstation (`pip install -e git+https://github.com//.git#egg=`) + and don't forget to also update the environment on the next rebuild. + + - If you just want to briefly test and edit an existing library with cloud data: + + You can create a new capsule cloning the library's github repo directly. You will need + to configure an appropriate environment after creating the capsule (if + you own the repo you may want to save this by committing the + dockerfile). After launching a workstation, install the library in + place via `pip install -e .` + +## Tips for Cloud Workstations + +### JupyterLab + +- In a JupyterLab workstation, you can pop out notebook figures by right + clicking and selecting "Create New View for Output." They won't be + in floating windows, but you can keep them in view without scrolling + and organize them within different tabs. 
+ +- Your matplotlib code runs, but the figures aren't displayed under your + notebook cells? You likely need to set the backend of matplotlib to + "inline" at the top of your notebook, using either: + + - `%matplotlib inline` + + - `get_ipython().run_line_magic('matplotlib', 'inline')` + +### VS Code (code-server) + +- Processes running in code-server will be terminated when the connection is closed, + unlike in JupyterLab. Keep your tab open and avoid resetting your network connection + (this can happen from connecting to a dock with wired ethernet, or switching routers). + +- For best results, use a more recent version of code-server than the CO default: + either by using a "code-server extensions pack" base image or editing your postInstall + following [this example](https://github.com/tmchartrand/code-server-base-image/blob/master/environment/install_vscode) + +- The code-server extensions pack environment has recommended settings preconfigured + (as [machine settings](https://github.com/tmchartrand/code-server-base-image/blob/master/environment/files/vscode_machine_settings.json)). If not using this environment you may want to copy these + manually, or at least *be sure to change the following essential settings*: + + - "Git: Use Integrated Ask Pass": False + + If your GitHub credentials are attached to your Code Ocean account, + this will let code-server use those credentials rather than + prompting you to log in to github for every git operation! + + - "Python: Language server": "Jedi" + + On certain versions of code-server, language server features like + autocompletion and hints will not work in Jupyter notebooks with the + default "Auto" setting. Alternatively, install the "basedpyright" extension, and set the + language server to "None" + +### Customization and Personalization + +User-configured settings (e.g. themes, font color, etc) will generally +be saved when the workstation is on hold, but not when it is shut down and rebuilt. 
+If you want to customize your workstation more permanently, you can use the postInstall script to pull +configuration files from a github repo. + +```bash +git clone +# move files to relevant locations, typically within /root +``` + +- *This won't work for VSCode/code-server settings*, as those are stored + inside the capsule filesystem (not available during the Docker build) + -- on the plus side, setting changes here *are* persistent across + rebuilds. + +- Customizations of this nature are specific to code development, and + shouldn't be included when sharing capsules with others. Before + sharing the capsule, remove the customizations from the postInstall + file. + +## Tips for building capsule environments + +- If your environment build fails, find the issue by opening the build log + (from the error message or the capsule timeline) and searching for errors, + typically towards the end of the log. + +- When debugging a tricky build, you may consider making a few duplicates of the capsule + so that you can test different variations simultaneously. + +- To improve very slow builds, consider: + - conda packages: make sure to use an environment with the mamba package manager instead of + the older conda (both the "conda" and "mamba" entries will install via mamba). + - pip packages: check the build log for extensive "backtracking" where pip tries many versions + of a package sequentially in an attempt to resolve dependency conflicts. Pin the versions of + these dependencies to eliminate this. + +## Tips for Pipelines + +### Resource labels + +Any pipeline with active usage beyond initial testing needs a unique label +so we can monitor cost and execution. Create a nextflow.config file in the `pipeline` folder +(same folder as the main.nf script) and put the following in it: + +`process.resourceLabels = ['allen-batch-pipeline': 'YOUR-TAG-GOES-HERE']` + +With the tag replaced by a short, unique, descriptive name for your pipeline. 
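As a rough illustration of the resource-label step above, the sketch below writes that one-line config from Python. The helper name `write_resource_label_config` and the tag check are assumptions made for this example, not part of any AIND or Nextflow tooling; the helper refuses to overwrite an existing file so that other settings are not lost.

```python
import re
from pathlib import Path


def write_resource_label_config(pipeline_dir, tag):
    """Create pipeline_dir/nextflow.config containing the resource label line."""
    # Assumed sanity check (not a documented rule): short, letters/digits/hyphens.
    if not re.fullmatch(r"[A-Za-z0-9-]{1,64}", tag):
        raise ValueError(f"tag should be short and hyphen-separated, got {tag!r}")
    config_path = Path(pipeline_dir) / "nextflow.config"
    if config_path.exists():
        # Do not clobber existing settings; add the line by hand instead.
        raise FileExistsError(f"{config_path} already exists")
    config_path.write_text(
        f"process.resourceLabels = ['allen-batch-pipeline': '{tag}']\n"
    )
    return config_path
```

For example, `write_resource_label_config("pipeline", "my-lab-ephys-sorting")` would create `pipeline/nextflow.config` next to `main.nf`; the tag shown is only a placeholder.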
+ +### Template repository + +Production-level pipelines should be based on the +[aind-pipeline-template](https://github.com/allenNeuralDynamics/aind-pipeline-template) +template repository. This includes a license, recommended nextflow config, +and automated versioning and release. + + +## How do I...? + +### Install the GitHub Copilot extension in VSCode? + +The VSCode in Code Ocean is actually code-server, which does not support +every extension in the VSCode extension marketplace. Instead, you can +use the following script to download and install extensions from the +official marketplace: + + +(Note that Pylance is one extension that cannot be installed at all, +even using this workaround) + +### Avoid reinstalling VSCode extensions every time I rebuild? + +By default, extensions are not saved across rebuilds. You can, however, [configure +the postInstall script](https://gist.github.com/tmchartrand/5dfa687698cae6b349f86628de36f559) +to either install a list of extensions *or* move the extensions +directory inside the capsule filesystem so manually installed extensions will persist +(the two options cannot be combined). + + +### Make my Streamlit app running in Code Ocean externally accessible? + +Currently this is not possible. Consider trying the free hosting services provided by Streamlit or Hugging Face, +or requesting SciComp support for deploying your app on AWS. + +### Keep variables in memory while shutting down cloud workstations? + +Instance RAM state is not preserved when instances are paused. By +default, instances should remain live for 180 min before automatically +pausing. As a workaround, use a disk cache (on `/scratch`) to save results +for any slower-running functions -- in Python this can be as simple as adding a +caching decorator from the joblib library (`joblib.Memory`). +() + +### Download a data asset to a local machine? + +Generally, we should minimize how much data we are downloading from Code +Ocean.
This is particularly true of larger data (GBs) that require long, +easily interruptible download times. That said: + +- You can hover over most data assets in the My Data viewer and there is + an option to download. + +- If the data-asset is "external" then CO does not support direct + downloads. + +- You can run a cloud workstation, save the data-asset as a file in the + "results" folder, and then download the dataset. + +- s3fs is a python package for making AWS S3 bucket data look like files + and folders. + +### Generate interactive figures outside of a jupyter notebook? + +Interactive figures that open in new windows (e.g. the "agg" or +"qt" backends for matplotlib) are not useable within the browser. +For this reason, in-browser widgets like JupyterWidgets are the +preferred way to open interactive figures. + +Python users: Code Ocean also is able to open streamlit applications +defined within a capsule. R users: the same is true for Shiny apps. + +Other web apps (in addition to streamlit/shiny) can be +viewed by running in a vscode/code-server workstation and using the +built-in port-forwarding, which generates a link to access the web app +process in a new tab. + +### Transition an existing capsule without a github link to a new github-backed capsule? + +[Detailed walkthrough here](../how_to/github_backed_capsules.md) + +In brief: follow the steps for cloning a capsule to generate a Code Ocean git URL +([https://docs.codeocean.com/user-guide/compute-capsule-basics/version-control/clone-via-git](https://docs.codeocean.com/user-guide/compute-capsule-basics/version-control/clone-via-git...) +), then use this URL to import a new repository into github +( using your CO account email and API +token as credentials). Finally, create a new capsule on Code Ocean as a +clone of this new github repo. + +### Add my github credentials to code ocean? + + + +As shown in the docs, make sure to use a github **classic token** +(), not a fine-grained token. 
+If you are accessing internal repositories, you will also need +to select "Configure SSO" for the token and authorize it to access the relevant +organization. + +### Start a workstation through Code Ocean and then SSH to it? + +This is technically possible but not currently supported. Compute instances +are in a private network that is not open to public SSH access. This is +a security best practice. That said, if there is sufficient demand for +this workflow we can look into supporting this ([comment +here](https://github.com/AllenNeuralDynamics/aind-code-ocean-info/issues/60)). + +### Request a new base image? + +Base images can come from any public docker image registry, but must be created by an admin. +Make a request by posting on the Code Ocean Teams channel or opening a github issue +on the [SciComp requests board](https://github.com/AllenNeuralDynamics/aind-scientific-computing/issues). + +### Reduce the screen real-estate used by Code Ocean (full-screen mode)? + +On Mac, press ^ + cmd + F, or see below. On Windows, press F11. + +![Full screen command screenshot](image.png) + +### Create a nextflow.config file to configure process execution? + +Add a nextflow.config file to the pipeline directory. The process scope +controls process execution in the Nextflow workflow. Add the parameters +below to your nextflow.config file: + +```nextflow +process { + executor = 'awsbatch' + queueSize = 100 + errorStrategy = 'retry' + maxRetries = 20 + maxErrors = 100 +} +``` + +`executor`: The compute backend that executes the pipeline's processes. + +`queueSize`: Number of parallel instances that can be executed at any +given time. + +`errorStrategy`: How the pipeline should handle failures; in this case, +it will retry. + +`maxRetries`: How many retries are performed. + +`maxErrors`: Threshold for error accumulation in a given process. + +See the Nextflow configuration documentation +[here](https://www.nextflow.io/docs/latest/config.html#configuration-file) +for the other settings that are available.
+ +## Bugs and other gotchas + +This list will be updated periodically, with recent issues added at the top of the list and resolved issues removed. + +### Jupyter notebook workstation fails to launch +- If you're trying to run Jupyter **notebook** and it fails to launch, you may + be running an environment that has a recent version of JupyterLab (>4.0) without the + notebook executable. You can fix this by: + - If JupyterLab has been added to your conda package list, remove it + - Add the "notebook" package to your conda package list. + +### Pipeline API "permission denied" errors despite being able to run capsules + +This problem can arise if two or more users collaboratively build a +pipeline together, and at least one capsule does not have sufficient AWS +credential secrets attached. This will show as a bypassable warning when +running the pipeline manually with "Reproducible Run", but will not run +via API. Attach AWS secrets to all capsules and the issue is resolved. + +### Environment build fails with `"/git-askpass": not found` + +You may see an error like this when running a reproducible run or rebuilding your capsule environment: + +``` +ERROR: failed to calculate checksum of ref ...: "/git-askpass": not found +``` + +**Why this happens:** Code Ocean switched its build system to Docker BuildKit, which builds images differently from classic Docker. As part of that migration, the credential helper file that Code Ocean injects into the Docker build context was renamed from `git-askpass` to `git-ask-pass`. The rename was intentional: BuildKit's layer caching would have continued reusing broken cached layers containing the old filename, so the rename forced those layers to be invalidated and rebuilt. Capsules whose Dockerfiles still reference the old name will fail until updated. 
+ +**Fix:** Open your capsule's `environment/Dockerfile` and change: + +```dockerfile +COPY git-askpass / +``` + +to: + +```dockerfile +COPY git-ask-pass / +``` + +If your capsule has no dependencies on internal GitHub repositories, you can instead simply remove the `COPY git-askpass /` line (and the `ARG GIT_ASKPASS` / `ARG GIT_ACCESS_TOKEN` lines above it) entirely. diff --git a/docs/source/explore_analyze/create_processing_metadata.md b/docs/source/explore_analyze/create_processing_metadata.md new file mode 100644 index 0000000..05f6542 --- /dev/null +++ b/docs/source/explore_analyze/create_processing_metadata.md @@ -0,0 +1,236 @@ +# Adding metadata for scientist-derived data + +In the process of running analysis on Code Ocean, users often end up saving derived data to track additional analysis inputs beyond the main experimental data assets - we call this {term}`scientist-derived data`. +This data is typically one of two types: outputs of a prior analysis step or {term}`non-AIND data` that have been imported for comparison or integration. + + +## Storage locations +For scientist-derived data that is relatively stable (won't be replaced often), +store the data as an {term}`internal data asset`. +This lets it be shared easily across capsules and users, +and the immutability helps ensure reproducible results. + +If the data comes from early-stage analysis and requires significant iteration, +it makes sense to start by using the capsule filesystem instead[^1]. +Be sure to organize your data in subfolders that will eventually become data assets. + +[^1]: The `scratch` folder is generally preferred, but will not be available to a Reproducible Run; +`data` or other directories use limited capsule storage space and must be explicitly added to the .gitignore file, but will be available to a Reproducible Run.
+ +If you need to iterate often and also share across capsules, +external data assets using `aind-scratch-data` may be a solution - discuss with a SciComp team member.[^2] +[^2]: Mutable data assets will be supported by Code Ocean as a better solution in the future. + +## Adding metadata + +If scientist-derived data contributes to published results, +it must be transferred to the open data bucket with complete metadata prior to publication +([publication standards](../policies_practices/publication_standards.md)). + +This consists of the following steps: + +- 1. copy or create the data into a capsule folder +- 2. add metadata files (more detail below) +- 3. create a data asset from the Code Ocean UI or API +- 4. file an issue to request transfer of the asset to aind-open-data. + +Steps 1-3 can be scripted end-to-end within a Code Ocean capsule copied from the [metadata template capsule](https://codeocean.com/capsule/1234567/tree), +or scripts to add metadata can be added to an existing analysis capsule based on the snippets below. + + +## Metadata for intermediate results + +For {term}`derived data` originating from AIND processing or analysis, +both data description and processing metadata are required. +This metadata is saved automatically by established processing pipelines, +but must be added explicitly to scientist-derived data. + +The best practice is to add the metadata whenever a data asset is saved, +and we are working on tools to automate this for Code Ocean {term}`result data asset`s. +Until these tools are available, it is generally simplest to add this metadata after most iteration on the asset is complete: when sharing the asset with others, or prior to publication at the latest. 
+ +### Processing + +Create Code objects for each component process: +```python +import aind_data_schema.core.processing as ps +code_details = ps.Code( + name="Capsule or Pipeline name", + url="https://github.com/abcd", + version="1.0", + # commit_hash="89abcdef0123456789abcdef0123456789abcdef01", + parameters={"size": 7}, + input_data=[ + ps.DataAsset(name="data-asset-name"), + ps.DataAsset(url="data-asset-url"), + ] +) +``` +- name is optional except for pipelines (pipeline components refer to it) +- a github url is preferred, but a release capsule url will also work +- specify either commit_hash or version (github or CO release) +- include parameters if they are passed to the code at runtime +(no need to if they are hard-coded) +- list all input data assets by name (preferred) or url + +Then create a Processing object containing one or more DataProcess records: +```python +my_processing = ps.Processing( + # pipelines=[pipeline_code_details], + data_processes=[ + ps.DataProcess( + stage=ps.ProcessStage.ANALYSIS, + process_type=ps.ProcessName.ANALYSIS, + name="my_custom_analysis", + experimenters=["Analysis Owner"], + start_date_time="2022-11-22T08:43", + end_date_time="2022-11-22T08:53", + output_path="path/to/outputs", + code=code_details, + # pipeline_name="Pipeline Name", + notes="Explain any manual steps here, and any additional notes" + ), + ], +) +``` +- `stage` is used to indicate processing or analysis +- `process_type` points to a defined list of well-known operations; +use one of these if appropriate, otherwise ANALYSIS or OTHER +- `name` is duplicated from `process_type` if left blank, +but must be explicitly specified if process type is ANALYSIS or OTHER +- `experimenters` are those responsible for running the processing/analysis +- record exact start/end times if possible, otherwise a single approximate run date is fine +(ISO format string or datetime objects work; +if timezone isn't specified, the timezone of the computer running the script is used) +- specify
an output path (relative to /results)^[*] if multiple DataProcesses are writing to a single asset +(no need to specify "/results" or "./") + +### Data Description + +The data description for derived data records the origin and organizational context +*of the data processing or analysis*, not the original experiment. + +```python +from datetime import datetime +import aind_data_schema.core.data_description as ds + +creation_time = datetime.now() +base_name = "primary-data-asset-name" +# base_name = "multi-input-analysis-name" +name = ds.build_data_name(base_name, creation_time) +my_dd = ds.DataDescription( + name=name, + source_data=["data-asset-name-1","data-asset-name-2"], + creation_time=creation_time, + institution=ds.Organization.AIND, + data_level=ds.DataLevel.DERIVED, + investigators=[ds.Person(name="Analysis Owner")], + project_name="Analysis Project Name", + modalities=[ds.Modality.MRI, ds.Modality.SPIM], + license=ds.License.CC_BY_40, + funding_source=[ + ds.Funding(funder=ds.Organization.NIMH, grant_number="RF1..."), + ], + data_summary="Analysis of data from... for ..." +) +``` +- When there is a single primary data asset input, the name should be derived from that asset's name. +- For analysis that aggregates multiple inputs, the name should be descriptive of the combined result. + +### Putting it all together + +#### Single-input results +If the intermediate result is derived from a single AIND primary data asset, +the rest of the experimental metadata should be "inherited" from that asset. +This base metadata can be loaded from the metadata database (preferred) or from json files within the asset. 
+ +```python +import aind_data_schema.core.data_description as ds +from aind_data_schema.core.metadata import Metadata +from aind_data_access_api.document_db import MetadataDbClient +docdb_api_client = MetadataDbClient( + host="api.allenneuraldynamics.org", + version="v2", +) +base_records = docdb_api_client.retrieve_docdb_records( + filter_query=dict(name="full_asset_name"), +) +# retrieve_docdb_records returns a list of records; use the single match +base_md = Metadata.model_validate(base_records[0]) +new_md = base_md.model_copy(update=dict( + data_description=my_dd, + processing=base_md.processing + my_processing +)) +output_path = "/results" +new_md.data_description.write_standard_file(output_path) +new_md.processing.write_standard_file(output_path) +new_md.procedures.write_standard_file(output_path) +new_md.instrument.write_standard_file(output_path) +new_md.acquisition.write_standard_file(output_path) +``` + +If the processing is part of the same project as the input data, +the base metadata can also be inherited for the data description using the `DataDescription.from_data_description()` function, +which updates the data level, name, and creation time. + +```python +my_dd = ds.DataDescription.from_data_description( + data_description=base_md.data_description, + process_name=my_processing.name, +) +``` + +#### Aggregated results +For results that aggregate many inputs from the same subject, only the subject and procedures metadata should be inherited. + +For aggregation across subjects, no metadata should be inherited. +The new metadata will include only the new data description and processing.
+ +```python +new_md = Metadata( + data_description=my_dd, + processing=my_processing, + # subject=base_md.subject, + # procedures=base_md.procedures, +) +output_path = "/results" +new_md.data_description.write_standard_file(output_path) +new_md.processing.write_standard_file(output_path) +# new_md.subject.write_standard_file(output_path) +# new_md.procedures.write_standard_file(output_path) +``` + +## Metadata for non-AIND data + +When loading data from external sources for analysis, we typically want to make a stable copy as a data asset.[^3] +[^3]: Exceptions may be stable cloud-native repositories like DANDI where data can be queried and processed directly in the cloud. +To make this a publication-ready data asset, +we need to add data description metadata documenting its source and other key details. + +```python +from datetime import datetime +import aind_data_schema.core.data_description as ds + +creation_time = datetime(2024, 4, 21) +name = ds.build_data_name("KimLab-DevCCF-v1", creation_time) +dd = ds.DataDescription( + name=name, + creation_time=creation_time, + institution=ds.Organization.UPENN, + data_level=ds.DataLevel.DERIVED, + investigators=[ds.Person(name="Yongsoo Kim")], + project_name="external data", + modalities=[ds.Modality.MRI, ds.Modality.SPIM], + license=ds.License.CC_BY_40, + funding_source=[ + ds.Funding(funder=ds.Organization.NIMH, grant_number="RF1MH12460501"), + ds.Funding(funder=ds.Organization.NINDS, grant_number="R01NS108407"), + ds.Funding(funder=ds.Organization.NIMH, grant_number="R01MH116176"), + ], + data_summary="Downloaded from https://pennstateoffice365-my.sharepoint.com/:f:/g/personal/yuk17_psu_edu/EmCllFDonwtLvDD0xgWd7QYBuzVVvnSv4oKpUy7F9bx75Q?e=RxAmJa on 2025-03-01" + ) +``` +- Typically all published external data will be *derived* not *raw* data.
+- For `project_name`, use "external data" unless data collection is linked to a specific AIND project (for instance a shared grant) +- For `creation_time`, use the date the data was posted, or a related publication date if that is not available. +- Funding information should be included for sources documented in the manuscript or data repository. +- Document the specific data source in the `data_summary` (URL or API call) and the date accessed. \ No newline at end of file diff --git a/docs/source/explore_analyze/find_data.md b/docs/source/explore_analyze/find_data.md index 73da26a..323bb8d 100644 --- a/docs/source/explore_analyze/find_data.md +++ b/docs/source/explore_analyze/find_data.md @@ -1,5 +1,106 @@ # Find data -The [data portal](https://data.allenneuraldynamics.org/assets) is a tool for finding and exploring data assets. Currently, you can search all assets that have V2 metadata and easily click links to go to the Code Ocean data asset, metadata, and QC report. +Analysis scripts should find assets using queries on our metadata database using your project name and other fields unique to your experiment. **All analyses at AIND should begin with a query that returns a group of data assets, filtered by passing quality control**. -We'll be expanding functionality here in the near future! +Some fields that are commonly used to filter assets: + +- `data_description.project_name` +- `data_description.data_level` +- `data_description.modalities.abbreviation` +- `subject.subject_id` +- `acquisition.acquisition_start_time` +- `acquisition.acquisition_type`: this is the primary string that should differentiate acquisitions within the same project +- `quality_control.status` + +You may also find it useful to tag your data with custom strings at the time of upload. These tags will make it easy to cluster your data into different subsets. 
+ +- `data_description.tags`: this is a list of strings you can use to cluster assets by things that aren't well represented in the metadata. + +## Querying the metadata database + +Our metadata is stored in a MongoDB database with one record for each data asset. MongoDB queries are dictionaries (key-value pairs) that return a set of records. These can be complicated to construct, which is why we've developed (1) AI tools and (2) cache tables to make it easier to quickly find data assets. We explain below in more detail in (3) how to fully leverage the MongoDB database, if you need it. + +Analysis workflows are required to use a query as the first step in gathering data for analysis **and** to filter assets according to passing quality control criteria. We recommend using the MCP server to gain familiarity with the patterns used for creating queries. + +### Option 1: MCP Server (AI) + +The [`aind-metadata-mcp`](https://github.com/allenNeuralDynamics/aind-metadata-mcp) (for V1 metadata) and [`aind-data-mcp`](https://github.com/allenNeuralDynamics/aind-data-mcp) (for V2 metadata) make it easy to generate queries for your data assets without knowing the exact structure of the metadata. + +Install the MCP server by following the instructions for that package or by using the pre-built environment in Code Ocean (`code-server python extensions pack`). You can then ask an AI agent that has access to the MCP server tools to run test queries and write a Python script for your query. The resulting script will use a mixture of the cache query system or full access, depending on what fields you need access to. + +### Option 2: Metadata cache + +Some queries to the metadata database can be very slow. The [`zombie-squirrel`](https://github.com/AllenNeuralDynamics/zombie-squirrel/) package exposes a cache (updated nightly) of some fields in the V2 metadata making them instantly available. 
Please see the [zombie-squirrel README](https://github.com/AllenNeuralDynamics/zombie-squirrel/#scurry-fetch-data) for a complete list of tables that are available. We add new tables regularly, and we can support custom tables for individual projects. Please [reach out to scientific computing](https://github.com/AllenNeuralDynamics/aind-scientific-computing/issues) with requests. + +For example, here is a query that evaluates in a few hundred milliseconds to find the latest derived assets with behavior NWB files from the VR foraging project: + +```python +from zombie_squirrel import asset_basics, raw_to_derived, qc + +asset_metadata = asset_basics() + +# Query #1: VR foraging latest derived behavior assets +raw_assets = asset_metadata[ + (asset_metadata["project_name"] == "Cognitive flexibility in patch foraging") & + (asset_metadata["data_level"] == "raw") +] +# Modalities is a comma-separated list, filter for "behavior" +raw_assets = raw_assets[raw_assets["modalities"].str.contains("behavior")] +# Get latest derived versions of these behavior raw assets +derived_asset_names = raw_to_derived(raw_assets["name"].tolist(), modality="behavior", latest=True) +``` + +And then a second level to filter by passing QC for a few metrics: + +```python +# Query #2: Same restrictions but passing QC +qc_metric_names = ["Running Velocity", "General Performance"] +derived_assets = asset_metadata[asset_metadata["name"].isin(derived_asset_names)] +passing_qc_asset_names = [] +for subject_id, subject_assets in derived_assets.groupby("subject_id"): + qc_df = qc(subject_id=subject_id) + if qc_df.empty or "status" not in qc_df.columns: + continue + for _, row in subject_assets.iterrows(): + passing = qc_df[ + (qc_df["asset_name"] == row["name"]) & + (qc_df["name"].isin(qc_metric_names)) & + (qc_df["modality"] == "behavior") & + (qc_df["status"] == "Pass") + ] + if passing["name"].nunique() == len(qc_metric_names): + passing_qc_asset_names.append(row["name"]) +``` + +### Full access to 
all metadata fields through the database + +The `aind-data-access-api` package is used to read metadata records from DocDB. There are two kinds of DocDB queries: filter queries are flat dictionaries that look for records matching certain field:value pairs, while aggregation pipelines can perform multiple steps. Use the `version="v1"` or `version="v2"` parameter to control whether you are accessing the V1 or V2 metadata; reach out to scientific computing if you aren't sure which metadata you should be using. + +A simple example to get all derived assets with behavior NWB files from the VR foraging project: + +```python +from aind_data_access_api.document_db import MetadataDbClient + +client = MetadataDbClient( + host="api.allenneuraldynamics.org", + version="v1", +) + +query = { + "data_description.project_name": "Cognitive flexibility in patch foraging", + "data_description.data_level": "derived", + "data_description.modalities.abbreviation": "behavior", +} +records = client.retrieve_docdb_records( + filter_query=query +) +``` + +More details about DocDB queries can be found in the [aind-data-access-api#querying-metadata documentation](https://aind-data-access-api.readthedocs.io/en/latest/ExamplesDocDBRestApi.html#querying-metadata). + +## Dashboards + +We are expanding the number of platform and project dashboards based on V2 metadata over time. We currently host: + +- [data portal](https://data.allenneuraldynamics.org/assets) is a tool for finding and exploring data assets. Currently, you can search all assets that have V2 metadata and easily click links to go to the Code Ocean data asset, metadata, and QC report.
+- [smartspim dashboard](https://data.allenneuraldynamics-test.org/smartspim) diff --git a/docs/source/explore_analyze/image.png b/docs/source/explore_analyze/image.png new file mode 100644 index 0000000..99bc0aa Binary files /dev/null and b/docs/source/explore_analyze/image.png differ diff --git a/docs/source/explore_analyze/index.md b/docs/source/explore_analyze/index.md index 95125df..ab68aea 100644 --- a/docs/source/explore_analyze/index.md +++ b/docs/source/explore_analyze/index.md @@ -4,17 +4,34 @@ orphan: true # Explore, QC & analyze +Raw assets uploaded from platforms at AIND are run through automated pipelines that produce derived assets. You can explore these assets through the [Data Portal](https://data.allenneuraldynamics.org/assets). + +The Data Portal exposes a range of views tailored to different slices of the data: + +- The [**Assets**](https://data.allenneuraldynamics.org/assets) view is your entry point into all data assets acquired in Neural Dynamics. +- [**Subject**](https://data.allenneuraldynamics.org/subject) views let you explore the history of an experimental subject from birth through surgical procedures, data acquisitions, perfusion, etc. Clicking into individual events pulls up the detailed metadata about that event as well as interactive viewers. +- [**Project**](https://data.allenneuraldynamics.org/project) views show data acquisitions for each subject within a project, and can be used to identify the modality or behavior curriculum stage for each acquisition. 
+ +There are also Platform dashboards for each of the major platforms in Neural Dynamics: + +| Platform | Dashboard | Internal Site | +| -- | -- | -- | +| SmartSPIM | [Dashboard](https://data.allenneuraldynamics.org/smartspim) | [Internal Site](https://alleninstitute.sharepoint.com/sites/NeuralDynamics/SitePages/SmartSPIM-Platform.aspx) | +| Fiber Photometry | [Dashboard](https://data.allenneuraldynamics.org/fiber_photometry) | [Internal site](https://alleninstitute.sharepoint.com/sites/NeuralDynamics/SitePages/Fiber-Photometry-Platform.aspx) | +| Dynamic Foraging | [Dashboard](https://data.allenneuraldynamics.org/dynamic_foraging)| | +| VR Foraging | [Dashboard](https://data.allenneuraldynamics.org/vr_foraging) | | + ## I want to... [Quality control my processed data assets](quality_control.md) before starting analysis. -[Find data](find_data.md). - -[Analyze my data](analyze_data.md#analyze-data) in the cloud. +[Find and query data](find_data.md) based on its stored metadata. -[Automate my analysis](analyze_data.md#analysis-framework) using the Analysis Framework. +[Learn about different approaches to analyze data](analyze_data.md) in the cloud. -[Explore custom tools](analyze_data.md#custom-tools) for annotation and data exploration. +- [Learn about Code Ocean](co_best_practices.md) best practices. +- [Automate my analysis](analyze_data.md#analysis-framework) using the Analysis Framework. +- [Explore custom tools and apps](analyze_data.md#custom-tools) for annotation and data exploration. [Find outreach events](outreach.md). 
@@ -24,5 +41,6 @@ orphan: true quality_control find_data analyze_data +co_best_practices outreach ``` diff --git a/docs/source/explore_analyze/outreach.md b/docs/source/explore_analyze/outreach.md index 04bf52a..243e7fe 100644 --- a/docs/source/explore_analyze/outreach.md +++ b/docs/source/explore_analyze/outreach.md @@ -31,11 +31,11 @@ This section highlights materials that may be useful beyond the original event t [View capsule](https://codeocean.allenneuraldynamics.org/capsule/0692322/tree/v3) --- -# Workshops +## Workshops Workshops provide structured training experiences focused on datasets, tools, and computational approaches used across the Allen Institute for Neural Dynamics. -## Summer Workshop on the Dynamic Brain 2025 +### Summer Workshop on the Dynamic Brain 2025 **Date:** August 24 – September 7, 2025 **Audience:** graduate students, researchers @@ -44,13 +44,13 @@ Workshops provide structured training experiences focused on datasets, tools, an A two-week summer course focused on computational analysis of large-scale neuroscience datasets. -### Resources +#### Resources - [Code Ocean collection](https://codeocean.allenneuraldynamics.org/collections/815cebfe-1829-4287-8e99-f1346b5d6ccb) - [SWDB Data Book](https://allenswdb.github.io/intro.html) --- -## Workshop at Western Washington University +### Workshop at Western Washington University **Date:** October 30, 2025 @@ -60,13 +60,13 @@ A two-week summer course focused on computational analysis of large-scale neuros This workshop included short talks on team science, an overview of Allen Institute for Neural Dynamics, and hands-on code tutorials using neuron reconstructions from the Brain-wide Anatomy platform and neurophysiology and behavior data from the BCI project. 
-### Resources +#### Resources - [Exa-SPIM tutorial capsule](https://codeocean.allenneuraldynamics.org/capsule/0685965/tree/v1) - [BCI / Credit Assignment tutorial capsule](https://codeocean.allenneuraldynamics.org/capsule/0692322/tree/v1) --- -## Workshop at Okinawa Institute of Science and Technology +### Workshop at Okinawa Institute of Science and Technology **Date:** January 30, 2025 **Audience:** researchers @@ -78,11 +78,11 @@ This workshop focused on introducing and sharing resources from the Brain-wide A --- -# Hackathons +## Hackathons Hackathons are collaborative coding events where participants work directly with Allen Institute datasets and develop exploratory analyses, tools, or prototype research workflows. -## University of Washington Neurohackathon (2026) +### University of Washington Neurohackathon (2026) **Date:** March 6–8, 2026 **Audience:** undergraduate students, graduate students, researchers @@ -91,12 +91,12 @@ Hackathons are collaborative coding events where participants work directly with Our second year participating in the UW Neurohackathon, sharing updated data and tutorials from the *Credit Assignment During Learning* project. -### Resources +#### Resources - [BCI / Credit Assignment tutorial capsule](https://codeocean.allenneuraldynamics.org/capsule/0692322/tree/v3) --- -## University of Washington Neurohackathon (2025) +### University of Washington Neurohackathon (2025) **Date:** May 16–18, 2025 **Audience:** undergraduate students, graduate students, researchers @@ -105,12 +105,12 @@ Our second year participating in the UW Neurohackathon, sharing updated data and A hackathon organized by the Conect and Synaptech student clubs at UW, where participants used neurotechnology devices and/or Allen Institute neural datasets to develop a project or prototype. 
-### Resources +#### Resources - [BCI / Credit Assignment tutorial capsule](https://codeocean.allenneuraldynamics.org/capsule/6784496/tree/v1) --- -## UW CNC–AIND Hackacollabathon: Credit Assignment During Learning +### UW CNC–AIND Hackacollabathon: Credit Assignment During Learning **Date:** May 14, 2025 **Audience:** graduate students, researchers @@ -119,12 +119,12 @@ A hackathon organized by the Conect and Synaptech student clubs at UW, where par A collaborative hackathon to share new data from the *Credit Assignment During Learning* project with researchers at UW and the Allen Institute. -### Resources +#### Resources - [BCI / Credit Assignment tutorial capsule](https://codeocean.allenneuraldynamics.org/capsule/6784496/tree/v1) --- -## UW CNC–AIND Hackacollabathon: Mesoscale Connectivity +### UW CNC–AIND Hackacollabathon: Mesoscale Connectivity **Date:** December 4, 2024 **Audience:** undergraduate students, graduate students, researchers @@ -133,16 +133,16 @@ A collaborative hackathon to share new data from the *Credit Assignment During L A collaborative hackathon to share new data from the *Thalamus in the Middle* project with researchers at UW and the Allen Institute. -### Resources +#### Resources - [Mesoscale Connectivity tutorial capsule](https://codeocean.allenneuraldynamics.org/capsule/6784496/tree/v1) --- -# Conference Sessions & Talks +## Conference Sessions & Talks Conference sessions and invited talks help introduce Allen Institute datasets, tools, and scientific resources to broader technical and research communities. 
-## Talk at Cosyne Tutorial Session +### Talk at Cosyne Tutorial Session **Date:** March 12, 2026 **Audience:** researchers @@ -153,7 +153,7 @@ A talk delivered during the Cosyne tutorial session highlighting datasets, tools --- -## Talk at the NeurIPS Data on the Brain & Mind Workshop +### Talk at the NeurIPS Data on the Brain & Mind Workshop **Date:** December 7, 2025 **Audience:** researchers (machine learning, AI, neuroscience) @@ -162,17 +162,17 @@ A talk delivered during the Cosyne tutorial session highlighting datasets, tools A short talk during a workshop on AI applications for neuroscience and cognitive science data, highlighting a preprint and code tutorial from the Visual Behavior Neuropixels project. -### Resources +#### Resources - [Workshop website](https://data-brain-mind.github.io/) - [Blogpost on the tutorial](https://data-brain-mind.github.io/tutorials/an-overview-of-the-neuropixels-visual-behavior-dataset-from-the-allen-institute/) --- -# Lectures & Classroom Outreach +## Lectures & Classroom Outreach Lectures and classroom activities support student learning by introducing Allen Institute datasets and computational neuroscience workflows in educational settings. -## Lecture at Undergraduate Course at University of Puget Sound +### Lecture at Undergraduate Course at University of Puget Sound **Date:** September 24, 2025 **Audience:** undergraduate students (neuroscience, computer science) @@ -181,12 +181,12 @@ Lectures and classroom activities support student learning by introducing Allen A lecture for an undergraduate class in which students developed quarter-long research projects using Allen Institute datasets. 
-### Resources +#### Resources - [Code tutorial repository](https://github.com/AllenNeuralDynamics/ups_nrsc490_tutorial/tree/main) --- -## Code Tutorial for High School Field Trip +### Code Tutorial for High School Field Trip **Date:** February 13, 2026 **Audience:** high school students @@ -195,11 +195,11 @@ A lecture for an undergraduate class in which students developed quarter-long re A code tutorial for high school group visiting the Allen Institute, focused on introducing neural anatomy data and mapping connections between brain areas. -### Resources +#### Resources - [Code tutorial repository](https://github.com/leesuyee/mesoscale-connectivity-tutorial) --- -## Have questions? Interested in using our materials? +### Have questions? Interested in using our materials? Contact the Data & Outreach team. \ No newline at end of file diff --git a/docs/source/explore_analyze/quality_control.md b/docs/source/explore_analyze/quality_control.md index 06d6e94..87489c3 100644 --- a/docs/source/explore_analyze/quality_control.md +++ b/docs/source/explore_analyze/quality_control.md @@ -8,4 +8,8 @@ Please see the documentation on [QualityControl](https://aind-data-schema.readth ## QC Portal -Please see the [QC Portal](https://github.com/AllenNeuralDynamics/aind-qc-portal?tab=readme-ov-file) documentation for more information. \ No newline at end of file +![QC diagram](../diagrams/mid_level/QC.drawio.svg) + +The QC Portal is a web app that allows users to explore the quality control metadata for data assets and, in edit mode, modify the value and state of metrics to annotate assets as passing or failing QC. + +Please see the [QC Portal](https://github.com/AllenNeuralDynamics/aind-qc-portal?tab=readme-ov-file) documentation for more information. 
diff --git a/docs/source/glossary.md b/docs/source/glossary.md index 7a252b0..5fefb3c 100644 --- a/docs/source/glossary.md +++ b/docs/source/glossary.md @@ -1,5 +1,56 @@ # Glossary {.glossary} -intermediate result -: derived data that consists of an analysis result saved as input for downstream analysis steps. \ No newline at end of file +scientist-derived data +: derived data created by a scientist directly, +rather than as output of an established processing pipeline. +A typical example is an intermediate analysis result saved as input for downstream analysis. + +{.glossary} +primary data +: the least processed permanent data asset from a given data acquisition - +in most cases this is the raw data asset, but in some cases the raw data is deleted +and a minimally-processed (compression or format conversion only) derived data is preserved as primary data. +(In many cases "raw data" is used as a synonym for "primary data", +even when the data is not strictly raw.) + +{.glossary} +derived data +: data that is a result of processing or analysis applied to one or many data inputs + +{.glossary} +non-registry data +: data that has been published by a source outside of AIND, +which we are hosting a copy of in our registry/bucket for ease of access and reproducibility. + +{.glossary} +released data +: data in a public s3 bucket (with complete metadata). +This is generally represented in Code Ocean as an external data asset in a public collection, +and is also accessible through the AWS open data registry. + +{.glossary} +internal data asset +: a Code Ocean data asset stored internally in the Code Ocean cloud storage +(in a private s3 bucket optimized for fast access from computations, +not intended for direct access). +[CO docs](https://docs.codeocean.com/user-guide/data-assets-guide/types-of-data-assets#internal-data) + +{.glossary} +external data asset +: a Code Ocean data asset linking to data on AWS s3 (public or private buckets) or other cloud storage. 
+[CO docs](https://docs.codeocean.com/user-guide/data-assets-guide/types-of-data-assets#external-data-a-remote-link) + +{.glossary} +result data asset +: a Code Ocean data asset saved as the result of a Code Ocean computation. +Provenance information is automatically added, +which can be used to generate more complete processing metadata. +[CO docs](https://docs.codeocean.com/user-guide/data-assets-guide/types-of-data-assets#results) + +{.glossary} +combined data asset +: a Code Ocean data asset that links together multiple data assets. +Currently these are restricted to external data assets only for technical reasons. +[CO docs](https://docs.codeocean.com/user-guide/data-assets-guide/types-of-data-assets#combined-data) + diff --git a/docs/source/how_to/github_backed_capsules.md b/docs/source/how_to/github_backed_capsules.md new file mode 100644 index 0000000..ac99086 --- /dev/null +++ b/docs/source/how_to/github_backed_capsules.md @@ -0,0 +1,92 @@ +# GitHub-backed Code Ocean capsules + +## Why GitHub backing matters + +Linking a Code Ocean capsule to a GitHub repository provides an additional venue for sharing your code, environment configuration, and links to your data beyond Code Ocean itself. It also integrates your capsule into standard software development workflows — version history, code review, and discoverability via GitHub search — and makes it easier for others to find, cite, and build on your work. + +## Creating a new capsule backed by a GitHub repo + +The easiest way to ensure your capsule is GitHub-backed is to set this up at creation time. Code Ocean does not allow you to add GitHub backing to an existing capsule after it has been created (see [this issue](https://github.com/AllenNeuralDynamics/aind-code-ocean-info/issues/16)), so getting this right from the start avoids the more involved migration process described below. 
+ +Rather than creating a GitHub repo from scratch, you should start from the [aind-capsule-template](https://github.com/AllenNeuralDynamics/aind-capsule-template). This template provides the `code/` and `environment/` directory structure that Code Ocean requires. Creating a repo from scratch risks producing a layout that is incompatible with Code Ocean. + +**Steps:** + +1. Go to [https://github.com/AllenNeuralDynamics/aind-capsule-template](https://github.com/AllenNeuralDynamics/aind-capsule-template) and click the green **Use this template** button, then select **Create a new repository**. + + + +2. Under **Owner**, select **AllenNeuralDynamics**. Give the repo a name and set visibility. +3. Click **Create repository**. +4. Copy the HTTPS clone URL from the green **Code** button (e.g., `https://github.com/AllenNeuralDynamics/REPO_NAME.git`). +5. Go to [https://codeocean.allenneuraldynamics.org/dashboard](https://codeocean.allenneuraldynamics.org/dashboard), click **New Capsule**, and select **Clone From Git**. +6. Paste the URL from step 4 and click **Clone**. +7. Your capsule is now linked to your GitHub repo. Any changes committed and synced in Code Ocean will be reflected in the repo. + +## Migrating an existing capsule to GitHub backing + +If your capsule was created without GitHub backing, you cannot add it retroactively. Instead, you must import your capsule's code into a new GitHub repository, create a new Code Ocean capsule cloned from that repo, and deprecate the original capsule. + +Broadly, the steps are: + +**A)** Create a new GitHub repo from your existing capsule. +**B)** Create a new capsule backed by this new repo. +**C)** Create a new reproducible run and release from the new capsule. +**D)** Deprecate the old capsule. + +From that point forward, all changes in the new capsule will automatically sync with the GitHub repo. + +### A) Create a GitHub repo from your existing capsule + + +1. 
Click **Capsule > Clone via Git...** from the menu near the top of the Code Ocean interface. + + + +2. Copy the URL under **Clone using this URL:** (it will look something like `https://codeocean.allenneuraldynamics.org/capsule-XXXXXXX.git`). +3. In a new tab, go to [https://github.com/AllenNeuralDynamics](https://github.com/AllenNeuralDynamics). +4. Click the green **New** button. +5. Click **Import a repository** near the top. + + + + + +6. Paste the URL from step 2 into the **The URL for your source repository** field. +7. Enter your full Allen email address in the **Your username for your source repository** field. +8. Switch back to your Code Ocean tab. If there is a **Generate a user token** option, create a token and copy it. +9. Paste the token into the **Your access token or password for your source repository** field on GitHub. If Code Ocean did not offer a token, leave this field blank. +10. Under **Choose an owner**, select **AllenNeuralDynamics**. +11. Under **Repository name**, enter a name matching your capsule name. +12. Set visibility to either **Internal** or **Public**. You can change this later. But note that this must ultimately be set to **Public** to make the capsule repo visible to the outside world. +13. Click **Begin Import**. This can take several minutes to complete. (If this step fails, see [Troubleshooting: import credentials](#import-fails-due-to-incorrect-credentials).) +14. Once complete, click the link to go to your new repo. + +### B) Create a new GitHub-backed capsule + +15. Click the green **Code** button on your new GitHub repo and copy the HTTPS URL (e.g., `https://github.com/AllenNeuralDynamics/REPO_NAME.git`). +16. Go to [https://codeocean.allenneuraldynamics.org/dashboard](https://codeocean.allenneuraldynamics.org/dashboard), click **New Capsule**, and select **Clone From Git**. +17. Paste the URL from step 15 and click **Clone**. This creates your new GitHub-backed capsule. +18. 
Edit the README of the new capsule to include a link to the GitHub repo. +19. Commit your changes and click **Sync with GitHub**. +20. Go to your GitHub repo and verify that the README changes are visible there. + +### C) Create a reproducible run and release + +21. Create a new reproducible run and release from the new capsule. + +### D) Deprecate the old capsule + +22. Add a note to the original capsule's README stating that it is deprecated, and include a link to the new capsule. This is important so that you don't inadvertently continue working on the original capsule. +23. Consider renaming the original capsule to make its status clear (e.g., "My Capsule Title (deprecated)"). + +## Troubleshooting + +### Import fails due to incorrect credentials + +If GitHub reports an error after clicking **Begin Import** in step 13, the most likely cause is incorrect credentials. Check the following: + +- **Username:** Use your full Allen Institute email address (e.g., `firstname.lastname@alleninstitute.org`), not your GitHub username. +- **Token:** Even if you did not see a **Generate a user token** option in step 8, you may need to copy and paste your CO token into the password field (not your GitHub or Allen Institute password). You may need to generate a new token or copy an existing one manually, in the CO user settings. Note that some capsules may import successfully without a token (typically public ones), while private capsules will fail without it. + + diff --git a/docs/source/index.md b/docs/source/index.md index 362776a..feb3ae3 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -32,10 +32,10 @@ Follow these links to request access to: ## I want to learn about... -[Data organization](policies_practices/data_organization.md), [data governance](policies_practices/data_governance.md), and [software practices](policies_practices/software_practices.md) at AIND. 
+[Data organization](policies_practices/data_organization.md), [data governance](policies_practices/data_governance.md), [software practices](policies_practices/software_practices.md), or [how our software and systems interact](aind/diagrams.md). ```{toctree} -:maxdepth: 1 +:maxdepth: 2 :hidden: :caption: Acquire, upload & process @@ -48,34 +48,35 @@ acquire_upload/process_data ``` ```{toctree} -:maxdepth: 1 +:maxdepth: 2 :hidden: :caption: Explore, QC & analyze explore_analyze/quality_control explore_analyze/find_data explore_analyze/analyze_data -explore_analyze/workshops +explore_analyze/outreach + ``` ```{toctree} -:maxdepth: 1 +:maxdepth: 2 :hidden: :caption: Policies & practices policies_practices/data_organization policies_practices/data_governance policies_practices/publication_standards -policies_practices/software_practices policies_practices/platform_support +policies_practices/version_pipelines +policies_practices/software_practices policies_practices/docs -policies_practices/developer_templates ``` ```{toctree} -:maxdepth: 1 +:maxdepth: 2 :hidden: :caption: AIND Resources diff --git a/docs/source/policies_practices/data_organization.md b/docs/source/policies_practices/data_organization.md index e155fe4..ad9d559 100644 --- a/docs/source/policies_practices/data_organization.md +++ b/docs/source/policies_practices/data_organization.md @@ -58,7 +58,7 @@ All primary data assets have the following naming convention: A few points: -- Format `: yyyymmdd_HH-MM-SS` +- Format `: YYYY-MM-DD_hh-mm-ss` - This should be the start of acquisition, in the local time zone. - The local time-zone is documented in metadata files - All tokens (e.g. ``) must not contain underscores or illegal filename characters. Subject ID is not strictly necessary – only the timestamp is essential. However, it is part of the current naming convention because it helps people visually browse for data. 
diff --git a/docs/source/policies_practices/platform_support.md b/docs/source/policies_practices/platform_support.md index 81816d3..7a3eccd 100644 --- a/docs/source/policies_practices/platform_support.md +++ b/docs/source/policies_practices/platform_support.md @@ -15,6 +15,9 @@ All platforms and pipelines must follow the [Data organization conventions](data ## Platform requirements +- Platform data assets should include valid aind-data-schema core files (minimum: data_description, subject, procedures, instrument, and acquisition). +- The `data_description.tags` field must include a unique string that *will not ever change* for each platform. We recommend using `platform:` to make it easy to find this string. + ### Logging Platforms should log all events to the [Loki server](https://github.com/AllenNeuralDynamics/aind-log-utils) maintained by SIPE. Events should be discrete information, warnings, and errors that need to be made visible to users in a dashboard. Logging of continuous metrics should be done in a log service that is specific to each tool and made visible in a dashboard attached to the tool. diff --git a/docs/source/policies_practices/standards_checklist.md b/docs/source/policies_practices/standards_checklist.md index 4969157..f197c4e 100644 --- a/docs/source/policies_practices/standards_checklist.md +++ b/docs/source/policies_practices/standards_checklist.md @@ -1,7 +1,7 @@ ### Capsules and repositories - [ ] Capsules (or pipelines) for all processing steps, from raw data to figures [^3] [^3]: Tools that already have, or are progressing towards, a separate public release may be left out. 
-- [ ] Working copy of capsule shared internally and linked to a public github repository within AIND or AIBS github organization +- [ ] Working copy of capsule shared internally and [linked to a public github repository](../how_to/github_backed_capsules.md) within AIND or AIBS github organization - [ ] Released version of capsule added to manuscript collection (requires author and description in capsule metadata, sync to github, and reproducible run). - [ ] Reproducible run script generates all outputs[^4] (if manual steps are unavoidable, include step-by-step instructions and automate as much as possible). [^4]: This can trigger execution of notebooks (e.g. using nbconvert), as long as they run top to bottom with no interaction required. @@ -12,7 +12,7 @@ ### Data - [ ] All AIND data stored as external data assets (aind-open-data), with complete metadata -- [ ] All {term}`intermediate result`s stored as external data assets (aind-open-data), with processing metadata added. +- [ ] All {term}`intermediate result`s stored as external data assets (aind-open-data), with processing metadata added ([Tutorial](../explore_analyze/create_processing_metadata.md)). - [ ] All data from external sources documented and downloadable with clear instructions from a stable data repository, or mirrored in aind-open-data. - [ ] If many individual assets are used, create combined data assets to organize them by data modality or type - [ ] All data assets (combined if needed) added to public collection -- *intermediate results* should be included on a case-by-case basis. 
diff --git a/docs/source/policies_practices/version_pipelines.md b/docs/source/policies_practices/version_pipelines.md new file mode 100644 index 0000000..3447001 --- /dev/null +++ b/docs/source/policies_practices/version_pipelines.md @@ -0,0 +1,92 @@ +# Versioning pipelines + +Users need to understand how to interact with computed results produced by data processing pipelines, and +if there are changes in pipeline results, it must be easy for users to detect, understand, and adapt to these changes. +This policy is intended to facilitate this by ensuring that pipelines are versioned in a consistent and informative way, +and that version information is easily accessible to users and developers downstream. + +## Policies + +Core data processing pipelines MUST adopt semantic versioning, with version numbers `MAJOR.MINOR.PATCH` updated according to the following guidelines: +- Major version changes indicate significant breaking changes to outputs, where the structure or interpretation of the data has changed. +Code relying on the outputs will require significant refactoring to accommodate the changes. Processed outputs may need to be preserved across multiple major versions for compatibility. +- Minor version changes indicate new features or minor breaking changes (including bugfixes) in the content of output data. +Code relying on the outputs may require minor refactoring, and previously processed data may need to be reprocessed. +- Patch version changes indicate non-breaking bug fixes or other code changes. +Data should not require reprocessing and downstream code should not need to be updated. + +The pipeline's name, semantic version, and url MUST be stored in aind-data-schema [Processing](https://github.com/AllenNeuralDynamics/aind-data-schema/blob/dev/src/aind_data_schema/core/processing.py#L970) metadata at the top level of the results - specifically the fields `Processing.pipelines.name`, `Processing.pipelines.code.version`, and `Processing.pipelines.code.url`. 
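
The requirements above (a `MAJOR.MINOR.PATCH` version plus a name and url for every pipeline) can be checked in code before any metadata is written. The following is a minimal sketch and not the aind-data-schema API: `PipelineInfo`, `parse_semver`, `build_pipeline_info`, and `change_level` are hypothetical names introduced here for illustration, and the named tuple only stands in for the real `Processing` entries.

```python
import re
from typing import NamedTuple, Tuple

# Plain MAJOR.MINOR.PATCH only; pre-release and build suffixes are rejected.
SEMVER_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


class PipelineInfo(NamedTuple):
    """Hypothetical stand-in for the three required pipeline fields."""

    name: str
    version: str
    url: str


def parse_semver(version: str) -> Tuple[int, int, int]:
    """Return (major, minor, patch) or raise if the string is malformed."""
    match = SEMVER_PATTERN.match(version)
    if match is None:
        raise ValueError(f"Not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def build_pipeline_info(name: str, version: str, url: str) -> PipelineInfo:
    """Fail loudly if a field is empty or the version is not semantic."""
    fields = (("name", name), ("version", version), ("url", url))
    missing = [label for label, value in fields if not value]
    if missing:
        raise ValueError(f"Missing pipeline fields: {', '.join(missing)}")
    parse_semver(version)
    return PipelineInfo(name=name, version=version, url=url)


def change_level(old: str, new: str) -> str:
    """Name the first version component that differs between two versions."""
    labels = ("major", "minor", "patch")
    for label, before, after in zip(labels, parse_semver(old), parse_semver(new)):
        if before != after:
            return label
    return "none"
```

Downstream code could call `change_level` on the versions recorded in two assets to decide whether results are directly comparable (`"patch"`), may need reprocessing (`"minor"`), or need refactoring (`"major"`). Raising an error instead of substituting a default keeps incomplete version information out of the metadata.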
+ +The pipeline's name and semantic version MUST also be stored in the pipeline repository and easily accessible to pipeline code. +We recommend environment variables `PIPELINE_VERSION`, `PIPELINE_NAME`, and `PIPELINE_URL`, +populated from the pipeline's `nextflow.config` - see below. +These environment variables can then be used to populate the appropriate fields in the `Processing` object. + +To deploy a new release of a pipeline: + +- Pipelines and component capsules MUST update their semantic version appropriately. +- Pipelines and component capsules MUST be synchronized with a linked public GitHub repository. +- Pipelines and component capsules MUST have a Code Ocean "internal release." +- Pipelines MUST update their `CHANGELOG` indicating what has changed in the release. + +This process ensures production pipelines are not subject to accidental changes and versioning is always communicated consistently to users downstream. + +## Implementation + +Developers can create a pipeline from this template: [`aind-pipeline-template`](https://github.com/AllenNeuralDynamics/aind-pipeline-template). +Once created, the pipeline uses a [workflow](https://github.com/AllenNeuralDynamics/.github/blob/main/.github/docs/Release%20Tag%20and%20Publish%20Pipeline.md) that will, on every pull request into main, bump the version using [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) with the modifications below and generate a `CHANGELOG` based on the commit history. +Environment variables for `PIPELINE_VERSION`, `PIPELINE_NAME` and `PIPELINE_URL` are added to the `nextflow.config` file and available to each component capsule. + +Methods from the [`aind-metadata-manager`](https://github.com/AllenNeuralDynamics/aind-metadata-manager) package can be used to enforce that all three pipeline fields are provided and create the appropriate entries in the `Processing` object. 
+If a value is missing, the pipeline will fail with a clear error message rather than falling back to a placeholder default. + +The developer is still responsible for ensuring that the `PIPELINE_VERSION`, `PIPELINE_NAME`, and `PIPELINE_URL` values, as well as the `CHANGELOG` are correct and up-to-date in the repository. + + +## Commit types and version increments + +Standard semantic versioning alone does not fully capture the needs of data processing pipelines. +Conventional commit types such as `fix` and `feat` can each be either breaking or non-breaking from the perspective of output data, +and breaking changes can differ in kind: some change output content (a downstream process may produce wrong results), +while others change output structure or processing fundamentals (a downstream process fails entirely). + +The table below maps conventional commit types to the appropriate version increment. +Other commit types (docs/chore/ci) will not increment the version. +This mapping is implemented in the release workflow of the `aind-pipeline-template` and can be customized if needed for specific pipelines. +The generated changelog contains commit messages organized by category, +and both minor breaking changes (!) and major breaking changes (BREAKING) called out in separate sections. + +| Commit type | Description | Version increment | +|---|---|---| +| `refactor` | A code change that neither fixes a bug nor adds a feature | patch | +| `perf` | A code change that improves performance with output unchanged | patch | +| `fix` | A bug fix that resolves failures only | patch | +| `build` | Non-breaking changes to external dependencies | patch | +| `feat` | A new feature added to output without changing existing output | minor | +| `fix!` | A bug fix that changes outputs to correct previous errors | minor | +| `build!` | Breaking changes to external dependencies (e.g. 
an algorithm slightly changes its output) | minor | +| `feat!` | A new feature that also changes outputs (e.g. renaming an existing output) | minor | +| `BREAKING` (in footer) | Fundamental change to the processing approach or output structure such that results before and after are not directly comparable | major | + + +## Code Ocean release versioning + +When a capsule or pipeline is internally released in Code Ocean, Code Ocean creates an immutable copy of the pipeline and issues it a release version, +recorded as `MAJOR.MINOR` with the minor version fixed at 0 and the major version incremented with each release. +This version is unrelated to the semantic version of the pipeline, but it is a necessary parameter for those triggering pipelines via the API (e.g. the AIND data transfer service). +We intend to develop helper functions to map between these two versioning systems. + +The example below illustrates a typical progression of the two versioning systems. + +| Code Ocean Version | GitHub Version | Git Commit | +|--------------------|----------------|------------| +| 18.0 | - | - | +| 19.0 | 0.1.0 | feat: add release.yml file for semantic versioning | +| 20.0 | 0.1.1 | fix: correct mislabeled metadata in processing | +| 21.0 | 0.2.0 | feat: add two new QC plots | + +Assets processed before semantic versioning was adopted will only have a Code Ocean version in their metadata (e.g., `18.0`). Assets processed after adoption will have a semantic version (e.g., `0.1.0`). +Users and developers may need to account for both version formats in code or queries that deal with processed data from both before and after semantic versioning was adopted. 
+For example, to find all assets processed with this pipeline before version `0.2.0`, the query would need to match: +- Semantic versions `< 0.2.0` (i.e., `0.1.0`, `0.1.1`) +- Code Ocean versions `<= 18.0` (distinguished by two vs three version elements) \ No newline at end of file diff --git a/docs/source/process_data/version_pipelines.md b/docs/source/process_data/version_pipelines.md deleted file mode 100644 index 5b207d9..0000000 --- a/docs/source/process_data/version_pipelines.md +++ /dev/null @@ -1,55 +0,0 @@ -# Versioning pipelines - -Users need to understand how to interact with computed results produced by data processing pipelines. If there are changes in the structure or interpretation of results because of a change to a processing pipeline, it must be easy for users to understand the nature of these changes and detect these changes reliably in code. - -## Policies - -Core data processing pipelines MUST adopt [semantic versioning](https://semver.org/). -- Major version changes indicate that the structure or interpretation of the data has changed. -- Minor version changes indicate new, backwards compatible features were added to the pipeline. -- Patch version changes indicate bug fixes. - -The pipeline's name and semantic version MUST be stored in aind-data-schema [Processing](https://github.com/AllenNeuralDynamics/aind-data-schema/blob/dev/src/aind_data_schema/core/processing.py#L970) metadata at the top level of the results. - -The pipeline's name and semantic version MUST be stored in the pipeline repository and easily accessible to pipeline code. We recommend a `.env` file containing `PIPELINE_VERSION`, `PIPELINE_NAME`, and `PIPELINE_URL` variables. These environment variables can be pulled using standard tools such as `os` and added to the `aind-data-schema` `Processing` core object for proper documentation. 
Specifically, the following fields of the `Processing` object should be populated with these enironment variables: - -`Processing.pipeline_version=os.getenv("PIPELINE_VERSION", "No version reported.")` -`Processing.pipeline_url=os.getenv("PIPELINE_URL", "No pipeline URL reported.")` - -The pipeline repository and the repositories of all individual capsules MUST be public on GitHub. - -To deploy a new release of a pipeline: - -- Pipelines and component capsules MUST update their semantic version appropriately. -- Pipelines and component capsules MUST be synchronized with GitHub. -- Pipelines and component capsules used in production MUST have a Code Ocean "internal release." -- Pipelines MUST update their `CHANGELOG` indicating what has changed in the release. - -This process ensures production pipelines are not subject to accidental changes and versioning is always communicated consistently to users downstream. - -## Code Ocean versioning - -When a capsule or pipeline is internally released in Code Ocean, Code Ocean creates an immutable copy of the pipeline and issues it a release version. This version, which is published as an `int` value, is unrelated to the semantic version of the pipeline, but it is a necessary parameter for those triggering pipelines via the API (e.g. the AIND data transfer service). - -## Implementation - -Developers can create a pipeline from this template: [`aind-pipeline-template`](https://github.com/AllenNeuralDynamics/aind-pipeline-template). Once created, the pipeline uses a [workflow](https://github.com/AllenNeuralDynamics/.github/blob/main/.github/docs/Release%20Tag%20and%20Publish%20Pipeline.md) that will, on every pull request into main, bump the version using [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/). 
The version and GitHub repository of the pipeline created with this template are added to the pipeline's environment variables as `PIPELINE_VERSION`, `PIPELINE_NAME` and `PIPELINE_URL` in the repostory's `nextflow.config` file. - -The developer is still responsible for ensuring that the `PIPELINE_VERSION`, `PIPELINE_NAME`, and `PIPELINE_URL` values, as well as the `CHANGELOG` are correct and up-to-date in the repository. - -To address Git versions being out-of-sync with the Code Ocean version, a table is provided below that explains the relationship. Version numbers are only illustrative and meant to demonstrate that Code Ocean pipeline version always increases as an integer while semantic versions increase according to update level. - -| Code Ocean Version | GitHub Version | Git Commit | -|--------------------|----------------|------------| -| 18.0 | - | - | -| 19.0 | 0.1.0 | feat: add release.yml file for semantic versioning | -| 20.0 | 0.1.1 | fix: correct mislabeled metadata in processing | -| 21.0 | 0.2.0 | feat: add two new QC plots | - -Because some pipelines already have mature Code Ocean releases, there will be a mismatch between Code Ocean versions and the semantic versions reported in the `Processing` object. Assets processed before semantic versioning was adopted will only have a Code Ocean version in their metadata (e.g., `18.0`). Assets processed after adoption will have a semantic version (e.g., `0.1.0`). - -When querying the metadata database for `Processing.pipeline_version`, users and developers must account for both version formats. For example, to find all assets processed with this pipeline before version `0.2.0`, the query would need to match: -- Semantic versions `< 0.2.0` (i.e., `0.1.0`, `0.1.1`) -- Code Ocean versions from before semantic versioning was adopted (i.e., `18.0`) - -For pipelines that have adopted semantic versioning, users and developers will always be able to find a pipelines semantic version in the `nextflow.config`. 
\ No newline at end of file diff --git a/pyproject.toml b/pyproject.toml index b769239..7a07501 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -25,6 +25,7 @@ dev = [ 'furo', 'myst-parser', 'sphinx-tippy', + 'sphinx-copybutton', ] [tool.setuptools.packages.find]
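Reviewer note on the versioning docs in this diff: the rule for telling the two version formats apart (two vs. three dot-separated elements) could be handled in client code roughly as sketched below. This is only an illustration under stated assumptions. The helper names `parse_version` and `processed_before` are hypothetical and are not part of `aind-metadata-manager` or any other AIND package, and the cutoff values are taken from the illustrative table in the docs.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[str, Tuple[int, ...]]:
    """Classify a version string by its number of dot-separated elements.

    Hypothetical helper: three elements are treated as a semantic version,
    two elements as a Code Ocean release version (MAJOR.MINOR).
    """
    parts = tuple(int(p) for p in version.strip().split("."))
    if len(parts) == 3:
        return "semantic", parts
    if len(parts) == 2:
        return "code_ocean", parts
    raise ValueError(f"Unrecognized version format: {version!r}")


def processed_before(
    version: str, semver_cutoff: str, last_legacy_code_ocean: str
) -> bool:
    """Return True if `version` predates `semver_cutoff`.

    Semantic versions are compared strictly against the cutoff. Code Ocean
    versions count as earlier only if they are at or below the last release
    made before semantic versioning was adopted (an assumed input).
    """
    kind, parts = parse_version(version)
    if kind == "semantic":
        return parts < parse_version(semver_cutoff)[1]
    return parts <= parse_version(last_legacy_code_ocean)[1]


# Values from the illustrative table: 18.0 is the last pre-adoption release.
versions = ["18.0", "0.1.0", "0.1.1", "0.2.0"]
matches = [v for v in versions if processed_before(v, "0.2.0", "18.0")]
print(matches)
```

Comparing integer tuples rather than raw strings avoids lexicographic surprises such as `"0.10.0" < "0.2.0"`. A database query would need an equivalent two-branch condition, one branch per version format.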