Handle path-style object storage HTTPS URLs correctly in ObjectStoreRegistry.resolve #79

Description

@hrodmn

ObjectStoreRegistry.resolve() returns the wrong trailing path for path-style object storage HTTPS URLs when the registered store is already mounted to a container or bucket.

In these setups, the resolved path still includes the container or bucket name, but the underlying store already treats that container or bucket as the storage root. Passing the resolved path into store.head() or store.get_range() therefore targets a non-existent object.

Azure Blob Storage HTTPS URLs are a clear real-world example of this bug. The same class of bug can also affect path-style S3 HTTPS URLs.

Current behavior

Given an Azure Blob asset URL like:

https://ukmoeuwest.blob.core.windows.net/deterministic/uk/near-surface/...nc

and a store created from the same container:

store = AzureStore(
    credential_provider=PlanetaryComputerCredentialProvider(
        "https://ukmoeuwest.blob.core.windows.net/deterministic/"
    )
)
registry = ObjectStoreRegistry(
    {"https://ukmoeuwest.blob.core.windows.net/deterministic/": store}
)

registry.resolve(asset_url) returns:

"deterministic/uk/near-surface/...nc"

But AzureStore expects a path within the deterministic container:

"uk/near-surface/...nc"

The extra deterministic/ causes store.head() to 404.

Expected behavior

For path-style object storage HTTPS URLs, ObjectStoreRegistry.resolve() should return a path that is compatible with the mounted store, i.e. a path relative to the configured container or bucket.

For the Azure example above, the resolved path should be:

"uk/near-surface/...nc"

Steps to reproduce

Minimal repro with current behavior:

from obstore.auth.planetary_computer import PlanetaryComputerCredentialProvider
from obstore.store import AzureStore
from obspec_utils.registry import ObjectStoreRegistry

prefix = "https://ukmoeuwest.blob.core.windows.net/deterministic/"
asset_url = (
    "https://ukmoeuwest.blob.core.windows.net/"
    "deterministic/uk/near-surface/20260501T0000Z/"
    "20260501T0000Z-PT0000H00M-temperature_at_surface.nc"
)

store = AzureStore(
    credential_provider=PlanetaryComputerCredentialProvider(prefix)
)
registry = ObjectStoreRegistry({prefix: store})

_, path = registry.resolve(asset_url)
print(path)
# current output (a single line, wrapped here for readability):
# deterministic/uk/near-surface/20260501T0000Z/
# 20260501T0000Z-PT0000H00M-temperature_at_surface.nc

print(store.head("uk/near-surface/20260501T0000Z/20260501T0000Z-PT0000H00M-temperature_at_surface.nc")["size"])
# works

print(store.head(path)["size"])
# FileNotFoundError / 404

A smaller non-network repro for the path resolution itself:

from obstore.auth.planetary_computer import PlanetaryComputerCredentialProvider
from obstore.store import AzureStore
from obspec_utils.registry import ObjectStoreRegistry

prefix = "https://ukmoeuwest.blob.core.windows.net/deterministic/"
asset_url = (
    "https://ukmoeuwest.blob.core.windows.net/"
    "deterministic/uk/near-surface/20260501T0000Z/"
    "20260501T0000Z-PT0000H00M-temperature_at_surface.nc"
)

store = AzureStore(
    credential_provider=PlanetaryComputerCredentialProvider(prefix)
)
registry = ObjectStoreRegistry({prefix: store})

_, path = registry.resolve(asset_url)
assert path == "uk/near-surface/20260501T0000Z/20260501T0000Z-PT0000H00M-temperature_at_surface.nc"
# currently fails because path includes the container name

Impact

This breaks downstream code that relies on ObjectStoreRegistry.resolve() to feed resolved paths into mounted object-store methods.

One concrete case is VirtualiZarr, where the resolved path is passed into a buffered reader backed by AzureStore.head() and AzureStore.get_range(). The dataset open fails even though the original HTTPS asset href is valid and the store is correctly authenticated.

This does not appear to affect every HTTPS object URL shape. For example, virtual-hosted S3 URLs like https://my-bucket.s3.us-west-2.amazonaws.com/... appear to resolve correctly because the bucket name lives in the hostname rather than in the path. The problem shows up when the storage identity is encoded in the path, as with Azure Blob HTTPS URLs and path-style S3 URLs like https://s3.us-west-2.amazonaws.com/my-bucket/....
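The difference between the two URL shapes can be seen with the standard library alone. This is a minimal sketch, not code from obspec_utils: split_bucket_and_key is a hypothetical helper, the key data/file.nc is made up for illustration, and the caller states the URL style instead of the helper detecting it.

```python
from urllib.parse import urlparse


def split_bucket_and_key(url: str, path_style: bool) -> tuple[str, str]:
    """Split an HTTPS object URL into (bucket, key).

    Hypothetical helper: the caller says whether the URL is path-style;
    no detection is attempted here.
    """
    parsed = urlparse(url)
    path = parsed.path.lstrip("/")
    if path_style:
        # Path-style: the first path segment is the bucket.
        bucket, _, key = path.partition("/")
        return bucket, key
    # Virtual-hosted: the first hostname label is the bucket.
    bucket = parsed.netloc.split(".", 1)[0]
    return bucket, path


virtual_hosted_url = "https://my-bucket.s3.us-west-2.amazonaws.com/data/file.nc"
path_style_url = "https://s3.us-west-2.amazonaws.com/my-bucket/data/file.nc"

# The raw URL path already equals the key for the virtual-hosted form...
print(urlparse(virtual_hosted_url).path.lstrip("/"))  # data/file.nc
# ...but still carries the bucket name for the path-style form.
print(urlparse(path_style_url).path.lstrip("/"))  # my-bucket/data/file.nc

print(split_bucket_and_key(virtual_hosted_url, path_style=False))
print(split_bucket_and_key(path_style_url, path_style=True))
```

Both calls to split_bucket_and_key give the same (bucket, key) pair, which is the behavior a bucket-mounted store would need from the registry.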

Notes / possible cause

The issue appears to come from this branch in ObjectStoreRegistry.resolve():

elif hasattr(store, "url"):
    prefix = urlparse(store.url).path.lstrip("/")
    path_after_prefix = path.lstrip("/").removeprefix(prefix).lstrip("/")
else:
    path_after_prefix = path.lstrip("/")

AzureStore does not expose .url, so resolve() takes the else branch and returns the full object path from the HTTPS URL without stripping anything.

For Azure Blob HTTPS URLs, that full path includes the container name. But AzureStore already knows the container and expects paths relative to it.

The same mismatch can arise for other path-style HTTPS object URLs where the bucket or container lives in the URL path but the mounted store already knows that storage root.

This suggests the bug is not Azure-specific. It is a mismatch between path-style HTTPS URL resolution and stores that are already mounted to a bucket or container.
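The two quoted branches can be exercised without any storage library. The following is a minimal sketch under stated assumptions: MountedStoreWithoutUrl and StoreWithUrl are hypothetical stand-ins (not obstore types), current_branch mirrors only the branches quoted above (whatever precedes the elif is omitted), and strip_registered_prefix is one possible direction for a fix, not the library's actual behavior.

```python
from urllib.parse import urlparse


class MountedStoreWithoutUrl:
    """Hypothetical stand-in for a store with no `.url` attribute."""


class StoreWithUrl:
    """Hypothetical stand-in for a store that exposes `.url`."""

    def __init__(self, url: str) -> None:
        self.url = url


def current_branch(store: object, url: str) -> str:
    """Mirror the two quoted branches of resolve() (earlier branch omitted)."""
    path = urlparse(url).path
    if hasattr(store, "url"):
        prefix = urlparse(store.url).path.lstrip("/")
        return path.lstrip("/").removeprefix(prefix).lstrip("/")
    return path.lstrip("/")


def strip_registered_prefix(registered_prefix: str, url: str) -> str:
    """Hypothetical fix: strip the path of the registry key, not of store.url."""
    prefix = urlparse(registered_prefix).path.strip("/")
    path = urlparse(url).path.lstrip("/")
    # Only strip on a whole-segment match, so "deterministic2/..." is untouched.
    if prefix and (path == prefix or path.startswith(prefix + "/")):
        path = path[len(prefix):]
    return path.lstrip("/")


prefix = "https://ukmoeuwest.blob.core.windows.net/deterministic/"
asset_url = (
    "https://ukmoeuwest.blob.core.windows.net/"
    "deterministic/uk/near-surface/20260501T0000Z/"
    "20260501T0000Z-PT0000H00M-temperature_at_surface.nc"
)

# No `.url` on the store: the container name leaks into the resolved path.
print(current_branch(MountedStoreWithoutUrl(), asset_url))
# With `.url`, or with the registry key as the prefix, the path is container-relative.
print(current_branch(StoreWithUrl(prefix), asset_url))
print(strip_registered_prefix(prefix, asset_url))
```

This sketch does not check whether stripping the registry key's path would change results for plain HTTP stores, which the acceptance criteria require to stay the same.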

Acceptance criteria

  • ObjectStoreRegistry.resolve() returns container-relative or bucket-relative paths for path-style object storage HTTPS URLs when resolving against mounted stores
  • A regression test covers an AzureStore registered under a https://<account>.blob.core.windows.net/<container>/ prefix
  • A regression test covers the equivalent path-style S3 case, such as https://s3.<region>.amazonaws.com/<bucket>/...
  • Existing non-object-storage HTTP/HTTPS resolution behavior remains unchanged
  • Existing virtual-hosted object-storage HTTPS behavior remains unchanged

Open questions

  • Should this be handled by teaching resolve() about mounted object stores with path-style HTTPS URLs, or by exposing enough metadata on store objects to make the existing prefix-stripping logic work generically?
  • Should https://<account>.blob.core.windows.net/<container>/... and abfs://<container>/... be normalized to equivalent resolved paths in the registry?
  • Should the same normalization apply to path-style S3 HTTPS URLs so they behave like s3://bucket/... and virtual-hosted S3 HTTPS URLs?
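To make the normalization question concrete, here is a rough sketch of mapping both Azure URL forms to the same (container, key) pair. azure_container_and_key is a hypothetical helper, the key uk/near-surface/file.nc is made up, and the sketch assumes the abfs://<container>/... shape written above, with the container in the authority position; other abfs variants are not handled.

```python
from urllib.parse import urlparse


def azure_container_and_key(url: str) -> tuple[str, str]:
    """Hypothetical helper: reduce two Azure URL forms to (container, key)."""
    parsed = urlparse(url)
    if parsed.scheme == "abfs":
        # abfs://<container>/<key>: the container sits where a hostname would.
        return parsed.netloc, parsed.path.lstrip("/")
    if parsed.scheme == "https" and parsed.netloc.endswith(".blob.core.windows.net"):
        # https://<account>.blob.core.windows.net/<container>/<key>
        container, _, key = parsed.path.lstrip("/").partition("/")
        return container, key
    raise ValueError(f"unsupported URL form: {url}")


https_url = "https://ukmoeuwest.blob.core.windows.net/deterministic/uk/near-surface/file.nc"
abfs_url = "abfs://deterministic/uk/near-surface/file.nc"

# Both forms reduce to the same container-relative key.
print(azure_container_and_key(https_url))
print(azure_container_and_key(abfs_url))
```

If the registry normalized along these lines, a store mounted on the deterministic container would receive the same key from either URL form.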
