From ecd25a3578ca4104901b50b19da2bd5bb0e46663 Mon Sep 17 00:00:00 2001 From: Kyle Barron Date: Wed, 21 May 2025 18:27:23 -0400 Subject: [PATCH 1/8] Set up intro obspec blog post --- docs/blog/.authors.yml | 5 +++++ docs/blog/index.md | 2 ++ docs/blog/posts/introducing-obspec.md | 22 ++++++++++++++++++++++ mkdocs.yml | 4 ++-- 4 files changed, 31 insertions(+), 2 deletions(-) create mode 100644 docs/blog/.authors.yml create mode 100644 docs/blog/index.md create mode 100644 docs/blog/posts/introducing-obspec.md diff --git a/docs/blog/.authors.yml b/docs/blog/.authors.yml new file mode 100644 index 0000000..c23a2bb --- /dev/null +++ b/docs/blog/.authors.yml @@ -0,0 +1,5 @@ +authors: + kylebarron: + name: Kyle Barron + description: Creator + avatar: https://github.com/kylebarron.png diff --git a/docs/blog/index.md b/docs/blog/index.md new file mode 100644 index 0000000..c58f16c --- /dev/null +++ b/docs/blog/index.md @@ -0,0 +1,2 @@ +# Blog + diff --git a/docs/blog/posts/introducing-obspec.md b/docs/blog/posts/introducing-obspec.md new file mode 100644 index 0000000..cf74f0b --- /dev/null +++ b/docs/blog/posts/introducing-obspec.md @@ -0,0 +1,22 @@ +--- +draft: false +date: 2025-05-25 +categories: + - Release +authors: + - kylebarron +--- + +# Introducing Obspec + + + +Obstore is the simplest, highest-throughput Python interface to Amazon S3, Google Cloud Storage, and Azure Storage, powered by Rust. + +This post gives an overview of what's new in obstore version 0.4. 
+ + + + +## Compare and contrast to fsspec + diff --git a/mkdocs.yml b/mkdocs.yml index 076265a..5a7fee4 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -29,8 +29,8 @@ nav: # - integrations.md # - performance.md # - fsspec.md - # - Blog: - # - blog/index.md + - Blog: + - blog/index.md - API Reference: - api/copy.md - api/delete.md From 811f47a18f2d4d24da984ebf3d3fb4b6cf90432a Mon Sep 17 00:00:00 2001 From: Kyle Barron Date: Wed, 28 May 2025 18:36:05 -0400 Subject: [PATCH 2/8] brain dump --- docs/blog/posts/introducing-obspec.md | 242 +++++++++++++++++++++++++- 1 file changed, 240 insertions(+), 2 deletions(-) diff --git a/docs/blog/posts/introducing-obspec.md b/docs/blog/posts/introducing-obspec.md index cf74f0b..ca955b0 100644 --- a/docs/blog/posts/introducing-obspec.md +++ b/docs/blog/posts/introducing-obspec.md @@ -1,13 +1,13 @@ --- draft: false -date: 2025-05-25 +date: 2025-05-29 categories: - Release authors: - kylebarron --- -# Introducing Obspec +# Introducing Obspec: A Python protocol for interfacing with object storage @@ -17,6 +17,244 @@ This post gives an overview of what's new in obstore version 0.4. +## Why? + +Consistent interface to object storage. + + + +- Obspec grew out of obstore. + +Comparison to obstore: Obstore is a concrete implementation; obspec is an abstract interface using Python protocols. + +Builds on a series of known protocols. Uses the buffer protocol for representing binary data. ## Compare and contrast to fsspec +1. api surface area of obspec vs fsspec. moving away from trying to make a file system layer which is a poor semantic mismatch and causes confusion and overhead. + +2. We don't have any implementation logic inside of obstore. A lot of baked-in fsspec logic is going to go away. If you want to have implementation-specific logic, it can be on top of obspec instead of having to go into obspec and understand what's going on. 
+ +### Abstraction target + +Fsspec: +Access remote data via stateful file objects + + + +```py +from fsspec import AbstractFileSystem + + +def download_file(fs: AbstractFileSystem) -> str: + with fs.open("my-file.txt", "rb") as f: + return f.read().decode() +``` + + +Obstore: HTTP requests + +Access remote data via HTTP-like requests +All operations are atomic (readers cannot observe partial/failed writes) +Allows for functionality not native to filesystems +Operation preconditions (fetch if unmodified) +Atomic multipart uploads + + + +```py +from obspec import Get + + +def download_file(client: Get) -> str: + response = client.get("my-file.txt") + # buffer is only known to implement the Buffer Protocol + buffer = response.bytes() + return bytes(buffer).decode() +``` + +!!! note + + Core point: mismatched abstraction, object stores are not filesystems. + +### Stateful vs Stateless + +Core point: stateful APIs add user uncertainty. +Is the list request cached? +How many requests are made? +What happens if the remote data changes? +Will the second list automatically reflect new data? + + +We want a clear contract between provider (backend) and consumer (user/library) +Is the list request cached? (yes) +How many requests does this make? (1) +What happens if the remote data changes? +Will the second list automatically reflect new data? (no, not by default, but could be implementation dependent) + + +```py +from time import sleep + +from fsspec import AbstractFileSystem + + +def list_files_twice(fs: AbstractFileSystem): + fs.ls("s3://mybucket") + sleep(5) + fs.ls("s3://mybucket") +``` + +```py +from time import sleep + +from obspec import List + + +def list_files_twice(client: List): + list_iter = client.list("prefix") + list_items = list(list_iter) + sleep(5) + list_iter = client.list("prefix") + list_items = list(list_iter) +``` + +### API Surface + +Core point: obstore has a smaller API surface, easier to understand, compose. 
+ +Fsspec: + +AbstractFileSystem: 10 public attributes, 56 public methods, more public async methods +AbstractBufferedFile: 20 public methods +Common to hit NotImplementedError since not all backends support all filesystem concepts (e.g. async) + +Obstore: + +Just 11 methods total: Core operations that object stores support natively +Full clarity of underlying HTTP calls +E.g. opening an fsspec file and then iterating over the responses… unclear how many raw HTTP requests that translates into. +Predictable performance. +No automatic caching (to be provided on top) +Very rare NotImplementedError: Azure suffix requests + + +copy/copy_async: Copy an object +delete/delete_async: Delete an object +get: Download a file +get_range/get_range_async: Get a byte range +get_ranges/get_ranges_async: Get multiple byte ranges +head/head_async: Access file metadata +list: List objects +list_with_delimiter/list_with_delimiter_async: List objects within a specific “directory”, avoiding recursing into further directories. +put/put_async: Upload to file +rename/rename_async: Move an object from one path to another +sign/sign_async: Create a signed URL + +### Streaming + +Core point: obstore has full streaming support. + +Fsspec: + +Streaming download: No support. +Can be emulated with file object, but no way to make one request and have it return as a stream. +Streaming upload: supported synchronously by passing file-like object. +Streaming list: No support: ls will always return all objects within prefix. + + +Both sync and async streaming support. +Streaming download: start working with byte response before entire file has downloaded. +Streaming upload: upload data from any byte source without materializing everything in memory. 
+Streaming list: automatic pagination behind the scenes + +Streaming download: + +```py +from obspec import Get + + +def streaming_download(client: Get): + response = client.get("file.txt") + for buffer_chunk in response: + # The iteration object is again a Buffer Protocol object + print(len(memoryview(buffer_chunk))) +``` + +Async streaming download. In just a few lines of code we can switch to supporting async. + + +```py +from obspec import GetAsync + + +async def streaming_download(client: GetAsync): + response = await client.get_async("file.txt") + async for buffer_chunk in response: + # The iteration object is again a Buffer Protocol object + print(len(memoryview(buffer_chunk))) +``` + + + + +### Intersecting features + +Not all backends will support all features. + +This is why obspec is defined as a set of independent protocols. Users can intersect the ones they need. + +### Full async API + +### Type hinting + +Fully type hinted + +### Manner of subtyping + +As [described in the Mypy documentation](https://mypy.readthedocs.io/en/stable/protocols.html), the Python type system supports two different manners of subtyping. + +> _Nominal_ subtyping is strictly based on the class hierarchy. If class `Dog` +> inherits class `Animal`, it's a subtype of `Animal`. Instances of `Dog` +> can be used when `Animal` instances are expected. This form of subtyping +> is what Python's type system predominantly uses: it's easy to +> understand and produces clear and concise error messages, and matches how the +> native :py:func:`isinstance ` check works -- based on class +> hierarchy. +> +> _Structural_ subtyping is based on the operations that can be performed with +> an object. Class `Dog` is a structural subtype of class `Animal` if the former +> has all attributes and methods of the latter, and with compatible types. +> +> Structural subtyping can be seen as a static equivalent of duck typing, which +> is well known to Python programmers. 
+ +Fsspec uses nominal subtyping. + +Obspec uses structural subtyping. + + +### Predictability + +We don't have any implementation logic inside of obstore. A lot of baked-in fsspec logic is going to go away. If you want to have implementation-specific logic, it can be on top of obspec instead of having to go into obspec and understand what's going on. + +#### Caching + +Fsspec has caching built in. This can cause unpredictable results. + +With obspec, the idea is to provide the low-level primitives and caching can be implemented _as wrappers_ on top of + +### Dependencies? + +Talk about protocols? + +## Implemented by obstore + +Protocols are worthless if a concrete implementation doesn't exist. + +Obstore is a zero-dependency implementation. + + +## Future work + +### Common exceptions From cc84e51707d965f9e348bc2959438c511e15d9bd Mon Sep 17 00:00:00 2001 From: Kyle Barron Date: Fri, 20 Jun 2025 13:31:32 -0400 Subject: [PATCH 3/8] update why section --- docs/blog/posts/introducing-obspec.md | 80 +++++++++++++++++++++++++-- 1 file changed, 76 insertions(+), 4 deletions(-) diff --git a/docs/blog/posts/introducing-obspec.md b/docs/blog/posts/introducing-obspec.md index ca955b0..5311209 100644 --- a/docs/blog/posts/introducing-obspec.md +++ b/docs/blog/posts/introducing-obspec.md @@ -9,15 +9,87 @@ authors: # Introducing Obspec: A Python protocol for interfacing with object storage +Obspec defines a minimal, transparent Python interface for object storage. +It's designed to abstract away the complexities of different object storage APIs while acknowledging that object storage is _not a filesystem_ and presents more similarities to HTTP requests than Python file objects. -Obstore is the simplest, highest-throughput Python interface to Amazon S3, Google Cloud Storage, and Azure Storage, powered by Rust. + -This post gives an overview of what's new in obstore version 0.4. +## Why a new interface? 
- +The primary existing Python specification used for object storage is [fsspec](https://filesystem-spec.readthedocs.io/en/latest/), which defines a filesystem-like interface based around Python file-like objects. + +However this presents an impedance mismatch: object storage is not a filesystem and does not have the same semantics as filesystems. This leads to surprising behavior, poor performance, and integration complexity + +### Fsspec's stateful APIs add user uncertainty. + +Fsspec has significant layers of caching to try to make object storage behave _like_ a filesystem, but this also causes unpredictable results. + +#### Opaque list requests + +Take the following example. Is the list request cached? How many requests are made, one or two? What happens if the remote data changes? Will the second list automatically reflect new data? + +```py +from time import sleep +from fsspec import AbstractFileSystem + +def list_files_twice(fs: AbstractFileSystem): + fs.ls("s3://mybucket") + sleep(5) + fs.ls("s3://mybucket") +``` + +The API documentation for `ls` [doesn't say](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.ls) what the default is (only that you _may_ explicitly pass `refresh=True|False` to force a behavior). You have to read implementation-specific source code to find out that, in the case of `s3fs`, the [default is `refresh=False`](https://github.com/fsspec/s3fs/blob/ec57f88c057dfd29fa1db80db423832fbfa4832a/s3fs/core.py#L1021). So the list call is cached, only one HTTP request is made, and the second call to `ls` will not reflect new data without an explicit call to `refresh=True`. + +In contrast, since obspec is stateless and abstracts HTTPs requests, not files, the comparable obspec code is easier to understand and reason about. 
+ +```py +from time import sleep +from obspec import List + +def list_files_twice(client: List): + list_items = list(client.list("prefix")) + sleep(5) + list_items = list(client.list("prefix")) +``` + +There's no internal caching, two requests are made, and every `list` method call will reflect the latest state of the bucket. + +#### Opaque file downloads + +Consider the options fsspec provides for downloading data. Fsspec doesn't have a method to stream a file download into memory, so your options are: + +1. Materialize the entire file in memory, which is not practical for large files. +2. Make targeted range requests, which requires you to know the byte ranges you want to download and requires multiple HTTP calls. +3. Use a file-like object, which is not clear how many HTTP requests it will make, and how caching works. +4. Download to a local file, which incurs overhead of writing to disk and then reading back into memory. + +Suppose we choose option 3, using a file-like object. It's fully opaque how many requests are being made: + +```py +from fsspec import AbstractFileSystem + +def iterate_over_file_object(fs: AbstractFileSystem, path: str): + with fs.open(path) as f: + for line in f: + print(line.strip()) +``` + +In contrast, obspec makes it fully transparent what HTTP requests are happening under the hood. Obspec also allows for streaming a file via a Python iterator: + +```py +from obspec import Get + +def download_file(client: Get): + response = client.get("my-file.txt") + for buffer in response: + # Process each buffer chunk as needed + print(f"Received buffer of size: {len(memoryview(buffer))} bytes") +``` + +Only one HTTP request is made, and you can start processing the data as it arrives without needing to materialize the entire file in memory. -## Why? +---------- Consistent interface to object storage. 
From e4a7f58e15c26fcc3ca090fdfb88615d031389fc Mon Sep 17 00:00:00 2001 From: Kyle Barron Date: Fri, 20 Jun 2025 13:47:39 -0400 Subject: [PATCH 4/8] progress --- docs/blog/posts/introducing-obspec.md | 156 +++++++------------------- 1 file changed, 41 insertions(+), 115 deletions(-) diff --git a/docs/blog/posts/introducing-obspec.md b/docs/blog/posts/introducing-obspec.md index 5311209..ae37206 100644 --- a/docs/blog/posts/introducing-obspec.md +++ b/docs/blog/posts/introducing-obspec.md @@ -19,9 +19,9 @@ It's designed to abstract away the complexities of different object storage APIs The primary existing Python specification used for object storage is [fsspec](https://filesystem-spec.readthedocs.io/en/latest/), which defines a filesystem-like interface based around Python file-like objects. -However this presents an impedance mismatch: object storage is not a filesystem and does not have the same semantics as filesystems. This leads to surprising behavior, poor performance, and integration complexity +However this presents an impedance mismatch: **object storage is not a filesystem** and does not have the same semantics as filesystems. This leads to surprising behavior, poor performance, and integration complexity -### Fsspec's stateful APIs add user uncertainty. +### File-like, stateful APIs add user ambiguity Fsspec has significant layers of caching to try to make object storage behave _like_ a filesystem, but this also causes unpredictable results. @@ -101,127 +101,34 @@ Comparison to obstore: Obstore is a concrete implementation; obspec is an abstra Builds on a series of known protocols. Uses the buffer protocol for representing binary data. -## Compare and contrast to fsspec - -1. api surface area of obspec vs fsspec. moving away from trying to make a file system layer which is a poor semantic mismatch and causes confusion and overhead. - -2. We don't have any implementation logic inside of obstore. A lot of baked-in fsspec logic is going to go away. 
If you want to have implementation-specific logic, it can be on top of obspec instead of having to go into obspec and understand what's going on. +## Features of Obspec ### Abstraction target -Fsspec: -Access remote data via stateful file objects - - - -```py -from fsspec import AbstractFileSystem - - -def download_file(fs: AbstractFileSystem) -> str: - with fs.open("my-file.txt", "rb") as f: - return f.read().decode() -``` - - -Obstore: HTTP requests - -Access remote data via HTTP-like requests -All operations are atomic (readers cannot observe partial/failed writes) -Allows for functionality not native to filesystems -Operation preconditions (fetch if unmodified) -Atomic multipart uploads - - - -```py -from obspec import Get - - -def download_file(client: Get) -> str: - response = client.get("my-file.txt") - # buffer is only known to implement the Buffer Protocol - buffer = response.bytes() - return bytes(buffer).decode() -``` - -!!! note - - Core point: mismatched abstraction, object stores are not filesystems. - -### Stateful vs Stateless - -Core point: stateful APIs add user uncertainty. -Is the list request cached? -How many requests are made? -What happens if the remote data changes? -Will the second list automatically reflect new data? - - -We want a clear contract between provider (backend) and consumer (user/library) -Is the list request cached? (yes) -How many requests does this make? (1) -What happens if the remote data changes? -Will the second list automatically reflect new data? (no, not by default, but could be implementation dependent) - - -```py -from time import sleep - -from fsspec import AbstractFileSystem - +As mentioned above, obspec intends to abstract stateless, HTTP-like requests, not a file system. 
While this improves predictability and performance, it also means: -def list_files_twice(fs: AbstractFileSystem): - fs.ls("s3://mybucket") - sleep(5) - fs.ls("s3://mybucket") -``` - -```py -from time import sleep - -from obspec import List +- All operations are atomic (readers cannot observe partial/failed writes) +- Allows for functionality not native to filesystems, such as preconditions (fetch if unmodified) and atomic multipart uploads -def list_files_twice(client: List): - list_iter = client.list("prefix") - list_items = list(list_iter) - sleep(5) - list_iter = client.list("prefix") - list_items = list(list_iter) -``` - ### API Surface -Core point: obstore has a smaller API surface, easier to understand, compose. - -Fsspec: - -AbstractFileSystem: 10 public attributes, 56 public methods, more public async methods -AbstractBufferedFile: 20 public methods -Common to hit NotImplementedError since not all backends support all filesystem concepts (e.g. async) - -Obstore: +Obspec has a much smaller API surface than fsspec, which makes it easier to understand, implement, and compose. It also means that it's much rarer for a backend to not implement the full API. -Just 11 methods total: Core operations that object stores support natively -Full clarity of underlying HTTP calls -E.g. opening an fsspec file and then iterating over the responses… unclear how many raw HTTP requests that translates into. -Predictable performance. -No automatic caching (to be provided on top) -Very rare NotImplementedError: Azure suffix requests +Obspec has just 10 core methods: +- [`copy`][obspec.Copy]/[`copy_async`][obspec.CopyAsync]: Copy an object within the same store. +- [`delete`][obspec.Delete]/[`delete_async`][obspec.DeleteAsync]: Delete an object. +- [`get`][obspec.Get]/[`get_async`][obspec.GetAsync]: Download a file, returning an iterator or async iterator of buffers. +- [`get_range`][obspec.GetRange]/[`get_range_async`][obspec.GetRangeAsync]: Get a single byte range. 
+- [`get_ranges`][obspec.GetRanges]/[`get_ranges_async`][obspec.GetRangesAsync]: Get multiple byte ranges. +- [`head`][obspec.Head]/[`head_async`][obspec.HeadAsync]: Access file metadata. +- [`list`][obspec.List]/[`list_async`][obspec.ListAsync]: List objects, returning an iterator or async iterator of metadata. +- [`list_with_delimiter`][obspec.ListWithDelimiter]/[`list_with_delimiter_async`][obspec.ListWithDelimiterAsync]: List objects within a specific directory. +- [`put`][obspec.Put]/[`put_async`][obspec.PutAsync]: Upload a file, buffer, or iterable of buffers. +- [`rename`][obspec.Rename]/[`rename_async`][obspec.RenameAsync]: Move an object from one path to another within the same store. -copy/copy_async: Copy an object -delete/delete_async: Delete an object -get: Download a file -get_range/get_range_async: Get a byte range -get_ranges/get_ranges_async: Get multiple byte ranges -head/head_async: Access file metadata -list: List objects -list_with_delimiter/list_with_delimiter_async: List objects within a specific “directory”, avoiding recursing into further directories. -put/put_async: Upload to file -rename/rename_async: Move an object from one path to another -sign/sign_async: Create a signed URL +All methods have both synchronous and asynchronous variants, allowing for flexibility in how you use them. ### Streaming @@ -270,6 +177,28 @@ async def streaming_download(client: GetAsync): + + +```py +from obspec import Get + + +def download_file(client: Get) -> str: + response = client.get("my-file.txt") + # buffer is only known to implement the Buffer Protocol + buffer = response.bytes() + return bytes(buffer).decode() +``` + + + +1. api surface area of obspec vs fsspec. moving away from trying to make a file system layer which is a poor semantic mismatch and causes confusion and overhead. + +2. We don't have any implementation logic inside of obstore. A lot of baked-in fsspec logic is going to go away. 
If you want to have implementation-specific logic, it can be on top of obspec instead of having to go into obspec and understand what's going on. + + +## Usage + ### Intersecting features Not all backends will support all features. @@ -278,9 +207,6 @@ This is why obspec is defined as a set of independent protocols. Users can inter ### Full async API -### Type hinting - -Fully type hinted ### Manner of subtyping From 66eac756da49af27a60117bf1ca9f9f20024ae35 Mon Sep 17 00:00:00 2001 From: Kyle Barron Date: Wed, 25 Jun 2025 00:58:03 -0400 Subject: [PATCH 5/8] Flesh out intro blog post --- docs/blog/posts/introducing-obspec.md | 258 +++++++++++++++----------- 1 file changed, 153 insertions(+), 105 deletions(-) diff --git a/docs/blog/posts/introducing-obspec.md b/docs/blog/posts/introducing-obspec.md index ae37206..808e537 100644 --- a/docs/blog/posts/introducing-obspec.md +++ b/docs/blog/posts/introducing-obspec.md @@ -9,23 +9,23 @@ authors: # Introducing Obspec: A Python protocol for interfacing with object storage -Obspec defines a minimal, transparent Python interface for object storage. +Obspec defines a minimal, transparent Python interface to read, write, and modify data on object storage. -It's designed to abstract away the complexities of different object storage APIs while acknowledging that object storage is _not a filesystem_ and presents more similarities to HTTP requests than Python file objects. +It's designed to abstract away the complexities of different object storage providers while acknowledging that object storage is _not a filesystem_. The Python protocols present more similarities to HTTP requests than Python file objects. -## Why a new interface? + The primary existing Python specification used for object storage is [fsspec](https://filesystem-spec.readthedocs.io/en/latest/), which defines a filesystem-like interface based around Python file-like objects. 
-However this presents an impedance mismatch: **object storage is not a filesystem** and does not have the same semantics as filesystems. This leads to surprising behavior, poor performance, and integration complexity +However this presents an impedance mismatch: **object storage is not a filesystem** and does not have the same semantics as filesystems. This leads to surprising behavior, poor performance, and integration complexity. -### File-like, stateful APIs add user ambiguity +## File-like, stateful APIs add ambiguity Fsspec has significant layers of caching to try to make object storage behave _like_ a filesystem, but this also causes unpredictable results. -#### Opaque list requests +### Fsspec: Opaque list caching Take the following example. Is the list request cached? How many requests are made, one or two? What happens if the remote data changes? Will the second list automatically reflect new data? @@ -39,9 +39,15 @@ def list_files_twice(fs: AbstractFileSystem): fs.ls("s3://mybucket") ``` -The API documentation for `ls` [doesn't say](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.ls) what the default is (only that you _may_ explicitly pass `refresh=True|False` to force a behavior). You have to read implementation-specific source code to find out that, in the case of `s3fs`, the [default is `refresh=False`](https://github.com/fsspec/s3fs/blob/ec57f88c057dfd29fa1db80db423832fbfa4832a/s3fs/core.py#L1021). So the list call is cached, only one HTTP request is made, and the second call to `ls` will not reflect new data without an explicit call to `refresh=True`. +Because [`AbstractFileSystem.ls`][fsspec.spec.AbstractFileSystem.ls] returns a _fully-materialized_ `list` and there can be thousands of items in a bucket, fsspec implementations tend to use some sort of internal caching. Furthermore, the specification explicitly allows for caching by defining a keyword argument named `refresh`. 
But the API documentation for `ls` [doesn't say][fsspec.spec.AbstractFileSystem.ls] what the default for `refresh` is (only that you _may_ explicitly pass `refresh=True|False` to force a behavior). -In contrast, since obspec is stateless and abstracts HTTPs requests, not files, the comparable obspec code is easier to understand and reason about. +You have to read implementation-specific source code to find out that, in the case of [`s3fs`](https://github.com/fsspec/s3fs), the fsspec implementation for S3, the [default is `refresh=False`](https://github.com/fsspec/s3fs/blob/ec57f88c057dfd29fa1db80db423832fbfa4832a/s3fs/core.py#L1021). So in the case of `s3fs`, the list call _is cached_, only one HTTP request is made, and the second call to `ls` will not reflect new data without an explicit call to `refresh=True`. + +But the design of the abstraction means that it's very difficult for generic code operating on the abstract base class to infer from the function signature how many HTTP requests will be made by most implementations. + +### Obspec: Streaming list + +In contrast, obspec relies on iterators wherever possible. The [`obspec.List`][] protocol returns an iterator of metadata about files, which enables stateless implementations that map much more closely to the underlying HTTP requests. ```py from time import sleep @@ -53,9 +59,9 @@ def list_files_twice(client: List): list_items = list(client.list("prefix")) ``` -There's no internal caching, two requests are made, and every `list` method call will reflect the latest state of the bucket. +There's no internal caching, a set of possibly-multiple requests are made for each call to `list`, and each call to `list` will reflect the latest state of the bucket. -#### Opaque file downloads +### Fsspec: Opaque file downloads Consider the options fsspec provides for downloading data.
Fsspec doesn't have a method to stream a file download into memory, so your options are: @@ -75,7 +81,9 @@ def iterate_over_file_object(fs: AbstractFileSystem, path: str): print(line.strip()) ``` -In contrast, obspec makes it fully transparent what HTTP requests are happening under the hood. Obspec also allows for streaming a file via a Python iterator: +### Obspec: Streaming download + +By mapping more closely to the underlying HTTP requests, obspec makes it clearer what HTTP requests are happening under the hood. [obspec.Get] allows for streaming a file download via a Python iterator: ```py from obspec import Get @@ -87,35 +95,27 @@ def download_file(client: Get): print(f"Received buffer of size: {len(memoryview(buffer))} bytes") ``` -Only one HTTP request is made, and you can start processing the data as it arrives without needing to materialize the entire file in memory. - ----------- - -Consistent interface to object storage. - - - -- Obspec grew out of obstore. +In this case, only one HTTP request is made, and you can start processing the data as it arrives without needing to materialize the entire file in memory. -Comparison to obstore: Obstore is a concrete implementation; obspec is an abstract interface using Python protocols. +### Support for functionality not native to filesystems -Builds on a series of known protocols. Uses the buffer protocol for representing binary data. +Obspec allows for functionality not native to filesystems, such as preconditions (fetch if unmodified) and atomic multipart uploads. -## Features of Obspec +## Native Async support -### Abstraction target +Fsspec was originally designed for synchronous I/O. Async support was bolted on via async versions of methods, but the core architecture is still sync-first and the async support is relatively sparsely documented. -As mentioned above, obspec intends to abstract stateless, HTTP-like requests, not a file system. 
While this improves predictability and performance, it also means: +The async support in fsspec is intentionally hidden away: all async operations are named with a leading underscore and in effect "private" and not designed to be visible by most users. Additionally some "async" calls in fsspec just use `loop.run_in_executor(...)` to perform the work in a thread in the background. -- All operations are atomic (readers cannot observe partial/failed writes) -- Allows for functionality not native to filesystems, such as preconditions (fetch if unmodified) and atomic multipart uploads +In 2025, the Python async ecosystem has progressed to the point where an interface should provide **first-class support for async code**. All obspec functionality is defined in matching sync and async protocols with clear separation between the two. +## API Surface -### API Surface +The fsspec API surface is _quite large_. [`AbstractFileSystem`][fsspec.spec.AbstractFileSystem] defines around 10 public attributes and 56 public methods. [`AbstractBufferedFile`][fsspec.spec.AbstractBufferedFile] defines around 20 public methods. And that's not including the async implementation in [`AsyncFileSystem`][fsspec.asyn.AsyncFileSystem]. -Obspec has a much smaller API surface than fsspec, which makes it easier to understand, implement, and compose. It also means that it's much rarer for a backend to not implement the full API. +Aside from being difficult for backends to implement the full surface area, it's also common to hit `NotImplementedError` at runtime when a backend doesn't support the method you're using. -Obspec has just 10 core methods: +Obspec has a **much smaller API surface** than fsspec, which makes it easier to understand, implement, and compose. Obspec has just 10 core methods with synchronous and asynchronous variants: - [`copy`][obspec.Copy]/[`copy_async`][obspec.CopyAsync]: Copy an object within the same store. 
- [`delete`][obspec.Delete]/[`delete_async`][obspec.DeleteAsync]: Delete an object. @@ -128,131 +128,179 @@ Obspec has just 10 core methods: - [`put`][obspec.Put]/[`put_async`][obspec.PutAsync]: Upload a file, buffer, or iterable of buffers. - [`rename`][obspec.Rename]/[`rename_async`][obspec.RenameAsync]: Move an object from one path to another within the same store. -All methods have both synchronous and asynchronous variants, allowing for flexibility in how you use them. +This smaller API surface also means that it's much rarer to get a runtime `NotImplementedError`. -### Streaming +## Static typing support -Core point: obstore has full streaming support. +Fsspec hardly has any support for static typing, which makes it hard for a user to know they're using the interface correctly. -Fsspec: +Obspec is **fully statically typed**. This provides excellent in-editor documentation and autocompletion, as well as static warnings when the interface is used incorrectly. -Streaming download: No support. -Can be emulated with file object, but no way to make one request and have it return as a stream. -Streaming upload: supported synchronously by passing file-like object. -Streaming list: No support: ls will always return all objects within prefix. + -Streaming download: +## Protocols & duck typing, not subclassing -```py -from obspec import Get +Python defines two types of subtyping: [nominal and structural subtyping](https://docs.python.org/3/library/typing.html#nominal-vs-structural-subtyping). +In essence, _nominal_ subtyping means _subclassing_. Class `A` is a nominal subtype of class `B` if `A` subclasses from `B`. _Structural_ subtyping means _duck typing_. Class `A` is a structural subtype of class `B` if `A` "looks like" `B`, that is, it _conforms to the same shape_ as `B`. 
-def streaming_download(client: Get): - response = client.get("file.txt") - for buffer_chunk in response: - # The iteration object is again a Buffer Protocol object - print(len(memoryview(buffer_chunk))) -``` +Using structural subtyping means that an ecosystem of libraries don't need to have any knowledge or dependency on each other, as long as they strictly and accurately implement the same duck-typed interface. -Async streaming download. In just a few lines of code we can switch to supporting async. +For example, an `Iterable` is a protocol. You don't need to subclass from a base `Iterable` class in order to make your type iterable. Instead, if you define an `__iter__` dunder method on your class, it _automatically becomes iterable_ because Python has a convention that if you see an `__iter__` method, you can call it to iterate over a sequence. +As another example, the [Buffer Protocol](https://docs.python.org/3/c-api/buffer.html) is a protocol to enable zero-copy exchange of binary data between Python libraries. Unlike `Iterable`, this is a protocol that is inaccessible in user Python code and only accessible at the C level, but it's still a protocol. Numpy can create arrays that view a buffer via the buffer protocol, even when Numpy has no prior knowledge of the library that produces the buffer. -```py -from obspec import GetAsync +Obspec relies on structural subtyping to provide flexibility to implementors while not requiring them to take an explicit dependency on obspec, which would be required to subclass from obspec using nominal subtyping. -async def streaming_download(client: GetAsync): - response = await client.get_async("file.txt") - async for buffer_chunk in response: - # The iteration object is again a Buffer Protocol object - print(len(memoryview(buffer_chunk))) -``` +## Existing implementations +[Obstore](https://developmentseed.org/obstore/latest/) is the primary existing implementation of obspec. 
Indeed, obspec's API is essentially a simplified formalization of obstore's existing API. +We'd like to see additional future first-party and third-party implementations of the obspec protocol. +## Example: Caching wrapper +Obspec does not have any built-in caching logic. This is a deliberate design choice to keep the interface simple and predictable. Caching can be implemented as a wrapper around obspec, allowing users to choose their caching strategy without complicating the core interface. +Here we have a very simple example of this approach. `SimpleCache` is a wrapper class around something implementing the `GetRange` protocol. The `SimpleCache` manages caching logic itself _outside the underlying `GetRange` backend_. But since `SimpleCache` also implements `GetRange`, it can be used wherever `GetRange` is expected. ```py -from obspec import Get - - -def download_file(client: Get) -> str: - response = client.get("my-file.txt") - # buffer is only known to implement the Buffer Protocol - buffer = response.bytes() - return bytes(buffer).decode() +from __future__ import annotations +from typing_extensions import Buffer +from obspec import GetRange + +class SimpleCache(GetRange): + """A simple cache for synchronous range requests that never evicts data.""" + + def __init__(self, client: GetRange): + self.client = client + self.cache: dict[tuple[str, int, int | None, int | None], Buffer] = {} + + def get_range( + self, + path: str, + *, + start: int, + end: int | None = None, + length: int | None = None, + ) -> Buffer: + cache_key = (path, start, end, length) + if cache_key in self.cache: + return self.cache[cache_key] + + response = self.client.get_range( + path, + start=start, + end=end, + length=length, + ) + self.cache[cache_key] = response + return response ``` +Of course, a real implementation would be smarter than just caching the exact byte range, and might use something like block caching. 
+Now if `GetRange` is expected to be used like so: + +```py +def my_function(client: GetRange, path: str, *, start: int, end: int): + buffer = client.get_range(path, start=start, end=end) + # Do something with the buffer + print(len(memoryview(buffer))) +``` -1. api surface area of obspec vs fsspec. moving away from trying to make a file system layer which is a poor semantic mismatch and causes confusion and overhead. +Then a user can seamlessly insert the `SimpleCache` in the middle. The second request is served from the cache and never reaches the `S3Store`: -2. We don't have any implementation logic inside of obstore. A lot of baked-in fsspec logic is going to go away. If you want to have implementation-specific logic, it can be on top of obspec instead of having to go into obspec and understand what's going on. +```py +from obstore.store import S3Store +store = S3Store("bucket") +caching_wrapper = SimpleCache(store) +my_function(caching_wrapper, "path.txt", start=0, end=10) +my_function(caching_wrapper, "path.txt", start=0, end=10) +``` -## Usage +## Usage for downstream libraries -### Intersecting features +Not all backends will necessarily support all features. Obspec is defined as a set of _independent_ protocols to allow libraries depending on obspec to verify that obspec implementations provide all required functionality. -Not all backends will support all features. +In particular, Python allows you to [intersect protocols](https://typing.python.org/en/latest/spec/protocol.html#unions-and-intersections-of-protocols). Thus, you should use the most minimal methods required for your use case, **creating your own subclassed protocol** with just what you need. -This is why obspec is defined as a set of independent protocols. Users can intersect the ones they need.
+```py +from typing import Protocol +from obspec import Delete, Get, List, Put -### Full async API +class MyCustomObspecProtocol(Delete, Get, List, Put, Protocol): + """ + My custom protocol with functionality required in a downstream library. + """ +``` -### Manner of subtyping +Then use that protocol generically: -As [described in the Mypy documentation](https://mypy.readthedocs.io/en/stable/protocols.html), the Python type system supports two different manners of subtyping. +```py +def do_something(backend: MyCustomObspecProtocol): + backend.put("path.txt", b"hello world!") -> _Nominal_ subtyping is strictly based on the class hierarchy. If class `Dog` -> inherits class `Animal`, it's a subtype of `Animal`. Instances of `Dog` -> can be used when `Animal` instances are expected. This form of subtyping -> is what Python's type system predominantly uses: it's easy to -> understand and produces clear and concise error messages, and matches how the -> native :py:func:`isinstance ` check works -- based on class -> hierarchy. -> -> _Structural_ subtyping is based on the operations that can be performed with -> an object. Class `Dog` is a structural subtype of class `Animal` if the former -> has all attributes and methods of the latter, and with compatible types. -> -> Structural subtyping can be seen as a static equivalent of duck typing, which -> is well known to Python programmers. + files = list(backend.list()) + assert any(file["path"] == "path.txt" for file in files) -Fsspec uses nominal subtyping. + assert memoryview(backend.get("path.txt").buffer()) == b"hello world!" -Obspec uses structural subtyping. + backend.delete("path.txt") + files = list(backend.list()) + assert not any(file["path"] == "path.txt" for file in files) +``` -### Predictability +By defining the most minimal interface you require, it widens the set of possible backends that can implement your interface. 
For example, making a range request is possible with any HTTP client, but a list call may have semantics not defined in the HTTP specification. So by only requiring, say, `Get` and `GetRange` you allow more implementations to be used with your program. -We don't have any implementation logic inside of obstore. A lot of baked-in fsspec logic is going to go away. If you want to have implementation-specific logic, it can be on top of obspec instead of having to go into obspec and understand what's going on. +Alternatively, if you only require a single method, there's no need to create your own custom protocol, and you can use the obspec protocol directly. -#### Caching +### Example: Cloud-Optimized GeoTIFF reader -Fsspec has caching built in. This can cause unpredictable results. +A [Cloud-Optimized GeoTIFF (COG)](https://cogeo.org/) reader might only require range requests: -With obspec, the idea is to provide the low-level primitives and caching can be implemented _as wrappers_ on top of +```py +from typing import Protocol +from obspec import GetRange, GetRanges -### Dependencies? +class CloudOptimizedGeoTiffReader(GetRange, GetRanges, Protocol): + """Protocol with necessary methods to read a Cloud-Optimized GeoTIFF file.""" -Talk about protocols? +def read_cog_header(backend: CloudOptimizedGeoTiffReader, path: str): + # Make request for first 32KB of file + header_bytes = backend.get_range(path, start=0, end=32 * 1024) + # TODO: parse information from header + raise NotImplementedError -## Implemented by obstore +def read_cog_image(backend: CloudOptimizedGeoTiffReader, path: str): + header = read_cog_header(backend, path) + # TODO: read image data from file. +``` -Protocols are worthless if a concrete implementation doesn't exist. +An _async_ Cloud-Optimized GeoTIFF reader might instead subclass from obspec's async methods: -Obstore is a zero-dependency implementation.
+```py +from typing import Protocol +from obspec import GetRangeAsync, GetRangesAsync +class AsyncCloudOptimizedGeoTiffReader(GetRangeAsync, GetRangesAsync, Protocol): + """Necessary methods to asynchronously read a Cloud-Optimized GeoTIFF file.""" -## Future work +async def read_cog_header(backend: AsyncCloudOptimizedGeoTiffReader, path: str): + # Make request for first 32KB of file + header_bytes = await backend.get_range_async(path, start=0, end=32 * 1024) + # TODO: parse information from header + raise NotImplementedError -### Common exceptions +async def read_cog_image(backend: AsyncCloudOptimizedGeoTiffReader, path: str): + header = await read_cog_header(backend, path) + # TODO: read image data from file. +``` From f234fb3e742e3ce82fbcee123e665a2a0a3aecaf Mon Sep 17 00:00:00 2001 From: Kyle Barron Date: Wed, 25 Jun 2025 00:58:54 -0400 Subject: [PATCH 6/8] update date --- docs/blog/posts/introducing-obspec.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/blog/posts/introducing-obspec.md b/docs/blog/posts/introducing-obspec.md index 808e537..ad8e20a 100644 --- a/docs/blog/posts/introducing-obspec.md +++ b/docs/blog/posts/introducing-obspec.md @@ -1,6 +1,6 @@ --- draft: false -date: 2025-05-29 +date: 2025-06-26 categories: - Release authors: From 49320541816b024db7a123dd42464f7c8a686345 Mon Sep 17 00:00:00 2001 From: Kyle Barron Date: Wed, 25 Jun 2025 01:00:05 -0400 Subject: [PATCH 7/8] flesh out --- README.md | 104 +------------------------- docs/blog/posts/introducing-obspec.md | 4 - 2 files changed, 3 insertions(+), 105 deletions(-) diff --git a/README.md b/README.md index 9f0fe1d..0411c91 100644 --- a/README.md +++ b/README.md @@ -1,108 +1,10 @@ # obspec -Object storage protocol definitions for Python. +A Python protocol for interfacing with object storage. 
-## Background +[Read the release post.](https://developmentseed.org/obspec/latest/blog/2025/06/26/introducing-obspec-a-python-protocol-for-interfacing-with-object-storage/) -Python defines two types of subtyping: [nominal and structural subtyping](https://docs.python.org/3/library/typing.html#nominal-vs-structural-subtyping). In essence, _nominal_ subtyping is subclassing. Class `A` is a nominal subtype of class `B` if `A` subclasses from `B`. _Structural_ subtyping is duck typing. Class `A` is a structural subtype of class `B` if `A` "looks like" `B`, that is, it _conforms to the same shape_ as `B`. - -Using structural subtyping means that an ecosystem of libraries don't need to have any knowledge or dependency on each other, as long as they strictly and accurately implement the same duck-typed interface. - -For example, an `Iterable` is a protocol. You don't need to subclass from a base `Iterable` class in order to make your type iterable. Instead, if you define an `__iter__` dunder method on your class, it _automatically becomes iterable_ because Python has a convention that if you see an `__iter__` method, you can call it to iterate over a sequence. - -As another example, the [Buffer Protocol](https://docs.python.org/3/c-api/buffer.html) is a protocol to enable zero-copy exchange of binary data between Python libraries. Unlike `Iterable`, this is a protocol that is inaccessible in user Python code and only accessible at the C level, but it's still a protocol. Numpy can create arrays that view a buffer via the buffer protocol, even when Numpy has no prior knowledge of the library that produces the buffer. - -Obspec defines core protocols to interface with data stored on file systems, remote object stores, etc. - -## Usage - -You should use the minimal methods required for your use case, **creating your own protocol** with just what you need. 
- -In particular, Python allows you to [intersect protocols](https://typing.python.org/en/latest/spec/protocol.html#unions-and-intersections-of-protocols): - -```py -from typing import Protocol - -from obspec import Delete, Get, List, Put - - -class MyCustomObspecProtocol(Delete, Get, List, Put, Protocol): - """My custom protocol.""" -``` - -Then use that protocol generically: - -```py -def do_something(backend: MyCustomObspecProtocol): - backend.put("path.txt", b"hello world!") - - files = backend.list().collect() - assert any(file["path"] == "path.txt" for file in files) - - assert backend.get("path.txt").bytes() == b"hello world!" - - backend.delete("path.txt") - - files = backend.list().collect() - assert not any(file["path"] == "path.txt" for file in files) -``` - -In particular, by defining the most minimal interface you require, it widens the set of possible backends that can implement your interface. For example, making a range request is possible by any HTTP client, but a list call may have semantics not defined in the HTTP specification. So by only requiring, say, `Get` and `GetRange` you allow more implementations to be used with your program. - -### Example: Cloud-Optimized GeoTIFF reader - -A [Cloud-Optimized GeoTIFF (COG)](https://cogeo.org/) reader might only require range requests - -```py -from typing import Protocol - -from obspec import GetRange, GetRanges - - -class CloudOptimizedGeoTiffReader(GetRange, GetRanges, Protocol): - """Protocol with necessary methods to read a Cloud-Optimized GeoTIFF file.""" - - -def read_cog_header(backend: CloudOptimizedGeoTiffReader, path: str): - # Make request for first 32KB of file - header_bytes = backend.get_range(path, start=0, end=32 * 1024) - - # TODO: parse information from header - raise NotImplementedError - - -def read_cog_image(backend: CloudOptimizedGeoTiffReader, path: str): - header = read_cog_header(backend, path) - - # TODO: read image data from file. 
-``` - -An _async_ Cloud-Optimized GeoTIFF reader might instead subclass from obspec's async methods: - -```py -from typing import Protocol - -from obspec import GetRangeAsync, GetRangesAsync - - -class AsyncCloudOptimizedGeoTiffReader(GetRangeAsync, GetRangesAsync, Protocol): - """Necessary methods to asynchronously read a Cloud-Optimized GeoTIFF file.""" - - -async def read_cog_header(backend: AsyncCloudOptimizedGeoTiffReader, path: str): - # Make request for first 32KB of file - header_bytes = await backend.get_range_async(path, start=0, end=32 * 1024) - - # TODO: parse information from header - - raise NotImplementedError - - -async def read_cog_image(backend: AsyncCloudOptimizedGeoTiffReader, path: str): - header = await read_cog_header(backend, path) - - # TODO: read image data from file. -``` +It's designed to abstract away the complexities of different object storage providers while acknowledging that object storage is _not a filesystem_. The Python protocols present more similarities to HTTP requests than Python file objects. ## Implementations diff --git a/docs/blog/posts/introducing-obspec.md b/docs/blog/posts/introducing-obspec.md index ad8e20a..e1f2219 100644 --- a/docs/blog/posts/introducing-obspec.md +++ b/docs/blog/posts/introducing-obspec.md @@ -15,8 +15,6 @@ It's designed to abstract away the complexities of different object storage prov - - The primary existing Python specification used for object storage is [fsspec](https://filesystem-spec.readthedocs.io/en/latest/), which defines a filesystem-like interface based around Python file-like objects. However this presents an impedance mismatch: **object storage is not a filesystem** and does not have the same semantics as filesystems. This leads to surprising behavior, poor performance, and integration complexity. @@ -136,7 +134,6 @@ Fsspec hardly has any support for static typing, which makes it hard for a user Obspec is **fully statically typed**. 
This provides excellent in-editor documentation and autocompletion, as well as static warnings when the interface is used incorrectly. -