[ADH-8124] Add S3 Gateway ozone objects cache #10

Open

tigrulya-exe wants to merge 1 commit into 2.2.0-develop from feature/ADH-8124-2.2.0

Conversation

@tigrulya-exe
Member

Added a caching OM client for the S3 Gateway component. It currently caches metadata for the S3 volume, buckets, and keys (both regular and extended versions) in local per-user Caffeine caches. Cache entries are evicted on the corresponding write operations (delete, rename) or by TTL.
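For context, a minimal sketch of one such per-user cache, assuming plain Caffeine (the loader and all names below are illustrative, not the PR's actual code):

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

public final class KeyCacheSketch {
  public static void main(String[] args) {
    // Bounded size + TTL eviction + stats, mirroring the options listed below.
    LoadingCache<String, String> keyCache = Caffeine.newBuilder()
        .maximumSize(1000)                          // ozone.s3g.object.cache.key.size
        .expireAfterWrite(Duration.ofMillis(3000))  // ozone.s3g.object.cache.key.ttl.ms
        .recordStats()                              // required for the metrics source
        .build(keyName -> loadFromOm(keyName));     // hits OM only on a cache miss

    System.out.println(keyCache.get("vol/bucket/key1")); // miss: loads via the loader
    keyCache.invalidate("vol/bucket/key1");              // explicit eviction on a write op
  }

  private static String loadFromOm(String keyName) {
    return "metadata-for-" + keyName; // stand-in for the real OM lookup
  }
}
```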

Related options:

| Key | Description | Default Value |
| --- | --- | --- |
| ozone.s3g.object.cache.enabled | Enable or disable the S3Gateway object cache. | false |
| ozone.s3g.object.cache.metrics.enabled | Enable or disable metrics for the S3Gateway object cache. | true |
| ozone.s3g.object.cache.volume.size | Maximum number of entries in the S3Gateway volume cache. | 10 |
| ozone.s3g.object.cache.volume.ttl.ms | TTL in milliseconds for entries in the S3Gateway volume cache. | 3000 |
| ozone.s3g.object.cache.bucket.size | Maximum number of entries in the S3Gateway bucket cache. | 100 |
| ozone.s3g.object.cache.bucket.ttl.ms | TTL in milliseconds for entries in the S3Gateway bucket cache. | 3000 |
| ozone.s3g.object.cache.key.size | Maximum number of entries in the S3Gateway key cache. | 1000 |
| ozone.s3g.object.cache.key.ttl.ms | TTL in milliseconds for entries in the S3Gateway key cache. | 3000 |
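For illustration, overriding a few of these options programmatically (a sketch; in a real deployment they would normally be set in ozone-site.xml):

```java
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

public final class EnableObjectCache {
  public static void main(String[] args) {
    OzoneConfiguration conf = new OzoneConfiguration();
    conf.setBoolean("ozone.s3g.object.cache.enabled", true); // off by default
    conf.setInt("ozone.s3g.object.cache.key.size", 5000);    // grow the key cache
    conf.setLong("ozone.s3g.object.cache.key.ttl.ms", 5000); // keep entries a bit longer
    System.out.println(conf.get("ozone.s3g.object.cache.enabled"));
  }
}
```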

A Hadoop Metrics2 source is also added to collect and publish the following statistics for all S3 object caches (s3VolumeCache, s3BucketCache, s3HeadKeyCache, s3KeyInfoCache):

| Name | Description |
| --- | --- |
| hitCount | Cache hit count |
| hitRate | Cache hit rate |
| missCount | Cache miss count |
| loadSuccessCount | Successful cache loads |
| loadFailureCount | Failed cache loads |
| loadFailureRate | Failed cache load rate |
| evictionCount | Cache eviction count |
| averageLoadPenalty | Average load penalty (ns) |
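These names map one-to-one onto Caffeine's CacheStats accessors; a sketch of how a metrics source could read a snapshot (the CacheStats calls are real Caffeine API, the surrounding class is illustrative):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.stats.CacheStats;

final class CacheStatsReader {
  // Reads a point-in-time snapshot; recordStats() must be enabled on the cache.
  static void report(String cacheName, Cache<?, ?> cache) {
    CacheStats stats = cache.stats();
    System.out.printf(
        "%s hitCount=%d hitRate=%.2f missCount=%d loadSuccessCount=%d "
            + "loadFailureCount=%d loadFailureRate=%.2f evictionCount=%d "
            + "averageLoadPenalty=%.0fns%n",
        cacheName, stats.hitCount(), stats.hitRate(), stats.missCount(),
        stats.loadSuccessCount(), stats.loadFailureCount(), stats.loadFailureRate(),
        stats.evictionCount(), stats.averageLoadPenalty());
  }
}
```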

Example of metrics when the Prometheus MetricsSink is used:

```
# TYPE s3_object_cache_metrics_miss_count gauge
s3_object_cache_metrics_miss_count{cachename="s3BucketCache",hostname="0dfbab4b336f"} 27
s3_object_cache_metrics_miss_count{cachename="s3HeadKeyCache",hostname="0dfbab4b336f"} 10
s3_object_cache_metrics_miss_count{cachename="s3KeyInfoCache",hostname="0dfbab4b336f"} 5
s3_object_cache_metrics_miss_count{cachename="s3VolumeCache",hostname="0dfbab4b336f"} 30
```


@iamlapa left a comment


Also, BucketEndpoint::handleGetRequest(S3RequestContext context, String bucketName):

```java
ozoneKeyIterator = bucket.listKeys(prefix, prevKey, shallow);
```

This path still ignores the S3 max-keys value when choosing how much to fetch from OM; the iterator falls back to ozone.client.list.cache, which is 1000 by default, so max-keys=1 can still over-fetch badly. We should add an S3-specific list path that passes a backend fetch size like min(maxKeys + 1, configuredCap).
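A sketch of that computation (configuredCap would come from a new, hypothetical S3G setting):

```java
final class ListFetchSize {
  /** How many entries to request from OM per batch for an S3 list call. */
  static int backendFetchSize(int maxKeys, int configuredCap) {
    // maxKeys is the S3 request parameter; +1 lets the endpoint detect truncation
    // (IsTruncated / NextContinuationToken) without a second round trip to OM.
    return Math.min(maxKeys + 1, configuredCap);
  }
}
```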

```java
}
S3ObjectCacheKey that = (S3ObjectCacheKey) o;
// deliberately use only the immutable part of the S3Auth
return Objects.equals(s3Auth.getAccessID(), that.s3Auth.getAccessID());
```


Before we enable this, we need a real auth gate in front of metadata cache hits. Right now the cache key only uses the access ID, so a warmed entry can be reused by a later request with the same access ID but a bad stringToSign/signature, and OM never gets a chance to validate SigV4. Either validate SigV4 locally in S3G with a secure short-lived S3 secret cache, or do an auth-only OM check before serving cached metadata; cached metadata cannot be returned just because the access ID matches.
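A sketch of the second option, an auth-only check gating every cache read (the validator interface and all names are hypothetical):

```java
import java.io.IOException;
import java.util.Map;

final class AuthGatedCache {
  /** Hypothetical auth-only OM round trip: verifies the SigV4 material, loads no metadata. */
  interface OmAuthValidator {
    void validate(String accessId, String stringToSign, String signature) throws IOException;
  }

  private final Map<String, String> keyInfoCache; // stand-in for the Caffeine cache
  private final OmAuthValidator validator;

  AuthGatedCache(Map<String, String> keyInfoCache, OmAuthValidator validator) {
    this.keyInfoCache = keyInfoCache;
    this.validator = validator;
  }

  String getCached(String accessId, String stringToSign, String signature, String key)
      throws IOException {
    validator.validate(accessId, stringToSign, signature); // throws before any cache hit
    return keyInfoCache.get(key);                          // only validated requests get here
  }
}
```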

Member Author

@tigrulya-exe commented May 4, 2026


Tbh, I think the second suggested option would conflict with the main goal of this cache: reducing the number of calls to OM. Regarding the first one: the cache itself is a trade-off between latency and consistency, because it won't be strictly consistent in any case (changes can be made via the Ozone CLI or other clients, and even if we add a subscription mechanism, it will receive events only eventually, not immediately). In some cases this cache can even return non-existent buckets/keys, so in my opinion a small chance of returning an entry for an invalid stringToSign/signature during the short key TTL isn't worth adding extra layers of complexity, like a local S3 secret cache or an additional OM call.

```java
}

@Override
public void deleteKey(String volumeName, String bucketName, String keyName, boolean recursive) throws IOException {
```


This only covers part of the write surface. Before the cache can be enabled, key metadata needs to be evicted for createKey*, rewriteKey*, completeMultipartUpload, putObjectTagging, deleteKey(s), and renameKey(s) for both source and destination. For stream writes, I'd also evict before opening the write and again after successful close/commit, so we never keep serving an old ETag or metadata while a replacement object is being written.
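Roughly the shape I have in mind for the stream-write case (a sketch only; the writer interface and names are hypothetical):

```java
import com.github.benmanes.caffeine.cache.Cache;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Evict-around-write: drop the cached entry before the write opens and after it commits. */
final class EvictOnWrite {
  /** Hypothetical hook into the real client's key-write path. */
  interface KeyWriter {
    OutputStream open(String volume, String bucket, String key) throws IOException;
  }

  private final Cache<String, String> keyCache;
  private final KeyWriter delegate;

  EvictOnWrite(Cache<String, String> keyCache, KeyWriter delegate) {
    this.keyCache = keyCache;
    this.delegate = delegate;
  }

  OutputStream createKey(String volume, String bucket, String key) throws IOException {
    String cacheKey = volume + "/" + bucket + "/" + key;
    keyCache.invalidate(cacheKey); // evict before the write starts
    OutputStream out = delegate.open(volume, bucket, key);
    return new FilterOutputStream(out) {
      @Override public void close() throws IOException {
        super.close();                 // commit happens on close
        keyCache.invalidate(cacheKey); // evict again so no stale ETag survives the write
      }
    };
  }
}
```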

Member Author

@tigrulya-exe commented May 4, 2026


Makes sense, but currently these methods aren't used in S3G, so I decided not to wrap them. However, future releases may use them, or the internal logic may change, so I added implementations for these methods.

```java
}

@Override
public void deleteBucket(String volumeName, String bucketName) throws IOException {
```


Bucket details are cached, but this only evicts on deleteBucket. We also need invalidation after createBucket, setBucketOwner, setBucketVersioning, setBucketStorageType, setBucketQuota, setReplicationConfig, setEncryptionKey, and setAcl; otherwise owner checks, ACL decisions, default replication, encryption, and quota-related behavior can be stale until TTL.
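For example, each of those setters could run through a small invalidate-after-mutation helper (a sketch; the mutation hook and cache-key format are hypothetical):

```java
import com.github.benmanes.caffeine.cache.Cache;
import java.io.IOException;

final class BucketCacheInvalidator {
  /** Hypothetical wrapper around any bucket-mutating delegate call. */
  interface BucketMutation {
    void run() throws IOException;
  }

  static void mutate(Cache<String, ?> bucketCache, String volume, String bucket,
      BucketMutation mutation) throws IOException {
    mutation.run();                                // e.g. delegate.setBucketOwner(...)
    bucketCache.invalidate(volume + "/" + bucket); // drop stale bucket details right away
  }
}
```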

```java
 * limitations under the License.
 */

package org.apache.hadoop.ozone.client;
```


This puts S3 Gateway implementation classes into the org.apache.hadoop.ozone.client package so they can call package-private OzoneClientFactory hooks from the client module. That makes an internal client-module helper an implicit cross-module API. I'd rather expose a small intentional extension point from the client module and keep the S3-specific builder/cache under org.apache.hadoop.ozone.s3.*; otherwise future client refactors or upstream merges can break S3G in a pretty non-obvious way.
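Something like this would be enough (a sketch, not the actual API; the interface name is hypothetical):

```java
// The one public hook the client module would publish; S3G keeps its caching
// wrapper in its own package and just supplies an implementation.
package org.apache.hadoop.ozone.client;

import org.apache.hadoop.ozone.client.protocol.ClientProtocol;

public interface ClientProtocolDecorator {
  /** S3G's implementation would wrap the delegate with the object cache. */
  ClientProtocol decorate(ClientProtocol delegate);
}
```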

Member Author

@tigrulya-exe commented May 4, 2026


Exposed the method in OzoneClientFactory and made it a bit more readable

@tigrulya-exe force-pushed the feature/ADH-8124-2.2.0 branch from f844ae5 to e96d57d on May 4, 2026 20:19
Member Author
@tigrulya-exe commented May 4, 2026

> Also, BucketEndpoint::handleGetRequest(S3RequestContext context, String bucketName):
>
> `ozoneKeyIterator = bucket.listKeys(prefix, prevKey, shallow);`
>
> This path still ignores the S3 max-keys value when choosing how much to fetch from OM; the iterator falls back to ozone.client.list.cache, which is 1000 by default, so max-keys=1 can still over-fetch badly. We should add an S3-specific list path that passes a backend fetch size like min(maxKeys + 1, configuredCap).

Yeah, it was already discussed with Alexander, and we decided to skip this requirement for now and check whether further work is needed in a separate task. Unfortunately, we can't simply pass the max-keys value down to OM due to some internal logic; see HDDS-12882.
