Introduce ContentType as Blob vs Zarr and rename blobDateModified to contentDateModified#220
Introduce ContentType as Blob vs Zarr and rename blobDateModified to contentDateModified#220yarikoptic wants to merge 1 commit intomasterfrom
Conversation
…contentDateModified This is just an initial attempt open for discussion. I ran into "blobDateModified" in a zarr metadata and it raised my eyebrow since that is not really appropriate and confusing. Hence I decided to look into generalization. I also thought that it would be valuable to make "type" of the content Asset points to explicit, although that could lead to inconsistencies since information is somewhat redundant with encodingFormat and potentially could also be deduced from contenUrl since we have different end points on S3, etc. Nevertheless I think it might be better to make it explicit. Or at least we have to rename blobDateModified. - ContenType name is quite suboptimal since there is a standard HTTP header https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type and thus we could potential confusion. But we should keep it a "Type" (not e.g. a Class) to be consistent with other type definitions among models. So the other part we could try to vary is "Content". Possible alternatives are "Object", "Data", "Resource" - ATM we call all Zarrs just Zarr but it is a "ZarrFolder" really. I wonder if it would be time to start to introduce differentiation here by making it "ZarrFolder", as later we might get "ZarrHDF5" or alike
6d3b5ff to
ddd77da
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #220 +/- ##
==========================================
- Coverage 97.66% 97.61% -0.05%
==========================================
Files 18 18
Lines 1798 1806 +8
==========================================
+ Hits 1756 1763 +7
- Misses 42 43 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
|
i'm fine with something like this being in the database, but not sure about changing the metadata model. perhaps we can brainstorm when we meet. |
satra
left a comment
There was a problem hiding this comment.
We have both encodingFormat and dataType - do we need yet another field or a clearer encodingFormat.
(I see no hits among jsonld manifests ATM)
|
| if "schemaKey" not in obj: | ||
| obj["schemaKey"] = "Dandiset" | ||
| obj["schemaVersion"] = to_version | ||
| if version2tuple(schema_version) < version2tuple("0.7.0"): |
There was a problem hiding this comment.
here will need to become some future version now that 0.7.0 is out... we might want to introduce next branch alike to next in linux development, or more specifically next-0.8.0 and position this PR against it so we could absorb meanwhile non-breaking changes
| if version2tuple(schema_version) < version2tuple("0.7.0"): | |
| if version2tuple(schema_version) < version2tuple("0.8.0"): |
| or obj.get("path", "").endswith(".zarr") | ||
| ) | ||
| else models.ContentType.Blob | ||
| ) |
There was a problem hiding this comment.
I am not yet 100% certain since we have only 2 cases -- it is either zarr or not (blob), but this field and code would be the "centralization" of the logic for DANDI-specific behavior of zarr-vs-blob. Otherwise, if we add some other "contentType" later on (e.g. remoteLink 🤷) such comparisons would need to be duplicated in every piece of code... we would need likely to add validation for that logic also at pydantic or linkml level.
This is just an initial attempt open for discussion. @satra please chime in
I ran into "blobDateModified" in a zarr metadata and it raised my eyebrow since that is not really appropriate and confusing. Hence I decided to look into generalization. I also thought that it would be valuable to make "type" of the content Asset points to explicit, although that could lead to inconsistencies since information is somewhat redundant with encodingFormat and potentially could also be deduced from contenUrl since we have different end points on S3, etc.
Nevertheless I think it might be better to make it explicit. Or at least we have to rename blobDateModified.
ContenType name is quite suboptimal since there is a standard HTTP header https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type and thus we could potential confusion.
But we should keep it a "Type" (not e.g. a Class) to be consistent with other type definitions among models.
So the other part we could try to vary is "Content". Possible alternatives are "Object", "Data", "Resource"
ATM we call all Zarrs just Zarr but it is a "ZarrFolder" really. I wonder if it would be time to start to introduce differentiation here by making it "ZarrFolder", as later we might get "ZarrHDF5" or alike