
Add unified array cache#2

Draft
NikitaEvs wants to merge 10 commits into master from add-unified-array-cache


NikitaEvs (Owner) commented May 10, 2023

Summary

Added a unified cache memory storage for data caches, with integration of the general-purpose memory allocator.

  • Designed and implemented the universal cache storage
  • Integrated MarksCache and UncompressedCache into this storage
  • Integrated a general-purpose Allocator that can directly use the cache storage

Features

  • Instead of configuring the data caches separately, you configure the global cache storage.
  • Each data cache is assigned to a BlockCache -- a pool of similar data caches. Each pool has separate memory storage and its own eviction mechanism.
  • The memory of the cache pools is managed by a RebalanceStrategy, a specific algorithm that extends or reduces the sizes of the cache pools. The RebalanceStrategy is the configuration point for cache memory management.
  • Some baseline rebalancing strategies are implemented:
    • Dummy rebalance strategy. Cache pools can grow independently without limit, but the total size of the caches in the unified cache storage is bounded. Memory for a BlockCache is allocated through mmap.
    • Buddy static strategy. Memory for a BlockCache is allocated through a special BuddyAllocator, a custom allocator based on the buddy schema. The size of the allocator's memory arena is fixed. This rebalance strategy can take memory blocks from one cache pool and add them to another.
    • Buddy dynamic strategy. The same as the buddy static strategy, but the size of the allocator's memory arena is dynamic and can grow up to an absolute limit.
  • The memory tracker can purge the cache storage and free some additional memory for further allocations.
  • The buddy dynamic strategy can work with the general-purpose allocator integrated into the cache storage: the allocator can directly take memory blocks from evicted cache items.
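To make the shape of a rebalance strategy concrete, here is a minimal sketch of the interface implied by the description (the initialize/finalize/shouldRebalance methods discussed in the design section), plus a toy version of the dummy strategy that lets every pool grow while bounding the total budget. The names `Chunk`, `IRebalanceStrategy`, and `DummyRebalanceStrategy` are illustrative assumptions, not the actual classes from this PR:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical chunk descriptor; real chunks carry a pointer into mapped memory.
struct Chunk
{
    void * data = nullptr;
    size_t size = 0;
};

// Sketch of the strategy interface: a cache pool registers itself on creation,
// returns its chunks on destruction, and asks for more memory when it runs short.
class IRebalanceStrategy
{
public:
    virtual ~IRebalanceStrategy() = default;

    /// Called by a cache pool during its initialization; returns starting chunks.
    virtual std::vector<Chunk> initialize(const std::string & pool_name) = 0;

    /// Called on pool destruction; the pool hands its chunks back.
    virtual void finalize(const std::string & pool_name, std::vector<Chunk> chunks) = 0;

    /// Called on a memory shortage; returns true if more memory was granted,
    /// false if the pool should evict items instead.
    virtual bool shouldRebalance(const std::string & pool_name, size_t bytes_needed) = 0;
};

// Toy version of the dummy strategy: every pool may grow independently, but the
// total budget across all pools is bounded (real chunks would come from mmap).
class DummyRebalanceStrategy : public IRebalanceStrategy
{
public:
    explicit DummyRebalanceStrategy(size_t total_limit_) : total_limit(total_limit_) {}

    std::vector<Chunk> initialize(const std::string &) override { return {}; }

    void finalize(const std::string &, std::vector<Chunk> chunks) override
    {
        for (const Chunk & chunk : chunks)
            total_allocated -= chunk.size;
    }

    bool shouldRebalance(const std::string &, size_t bytes_needed) override
    {
        if (total_allocated + bytes_needed > total_limit)
            return false;  /// budget exhausted: the pool must evict
        total_allocated += bytes_needed;
        return true;
    }

    size_t allocated() const { return total_allocated; }

private:
    size_t total_limit;
    size_t total_allocated = 0;
};
```

The per-pool name parameter is what lets a single strategy instance distinguish callers between invocations.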

Performance

The ClickBench results are shown in the figures below.

Figure: cold run without populated caches
Figure: hot run with populated caches

Our unified cache storage performs similarly to the baseline in both cold and hot runs. Matching performance in the cold run (caches unpopulated) means there is no significant overhead for writing to the cache storage; matching performance in the hot run (caches populated) means there is no significant overhead for reading from it. The performance difference across all queries is no greater than 3%.

Design

Overview

  • The key component is the BlockCache, a cache of static memory blocks. This class is based on the existing ArrayCache.
  • The unified cache storage consists of cache pools, which are separate BlockCaches.
  • The cache pools are rebalanced using a RebalanceStrategy.

BlockCache design

Figure: BlockCache-basic

  • The BlockCache stores its underlying memory in the form of chunks.
  • A Chunk is an abstraction of a contiguous memory block that is given to the BlockCache from an external source. The BlockCache slices a chunk's memory into parts that are assigned to regions.
  • A Region is a part of a chunk used as memory storage for cache items. Region metadata is stored in an intrusive list that implements the LRU cache eviction policy. LRU is chosen as the baseline because it is easy to implement and works well enough in most scenarios.
  • Allocations of regions from chunks are served by an allocator that uses a red-black tree to track the free blocks of chunks.
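As a rough illustration of that last point, here is a toy free-block allocator over a single chunk. Free blocks live in a `std::multimap` (a red-black tree) keyed by size, so an allocation is a best-fit lookup, and the unused tail of a block goes back into the tree. Offsets stand in for real pointers into the chunk's memory; the class name and exact policy are assumptions, not the PR's implementation:

```cpp
#include <cstddef>
#include <map>

class ChunkAllocator
{
public:
    explicit ChunkAllocator(size_t chunk_size) { free_blocks.emplace(chunk_size, 0); }

    static constexpr size_t npos = static_cast<size_t>(-1);

    /// Returns the offset of an allocated region, or npos on failure.
    size_t allocate(size_t size)
    {
        auto it = free_blocks.lower_bound(size);  /// smallest free block that fits
        if (it == free_blocks.end())
            return npos;
        auto [block_size, offset] = *it;
        free_blocks.erase(it);
        if (block_size > size)  /// return the unused tail to the tree
            free_blocks.emplace(block_size - size, offset + size);
        return offset;
    }

    void free(size_t offset, size_t size)
    {
        /// A real allocator would coalesce adjacent free blocks here.
        free_blocks.emplace(size, offset);
    }

private:
    std::multimap<size_t /* size */, size_t /* offset */> free_blocks;
};
```

Keying the tree by size makes best-fit lookup logarithmic, at the cost of extra work for coalescing, which is omitted here.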

Figure: BlockCache-handler

  • The BlockCache provides the memory of regions for cache items, but it must not evict regions that are currently in use as the underlying memory of cache items.
  • To track the usage of region memory, a reference counting mechanism is implemented in Holder, an access container for regions.
  • Holder provides access to raw memory, but the current data caches do not work with raw memory directly. UnifiedCacheAdapter connects the current data caches with the raw memory storage provided by Holder.
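The pinning scheme can be sketched as an RAII wrapper: a region may only be evicted while no Holder points at it. The field names and the plain `int` counter are simplifications (a real implementation would need an atomic counter and integration with the eviction list):

```cpp
#include <cstddef>

// Toy region metadata; the real class lives in the intrusive LRU list.
struct Region
{
    void * memory = nullptr;
    size_t size = 0;
    int ref_count = 0;  /// a real implementation would use std::atomic<int>
};

// RAII access container: pins the region for as long as the Holder lives.
class Holder
{
public:
    explicit Holder(Region & region_) : region(region_) { ++region.ref_count; }
    ~Holder() { --region.ref_count; }

    Holder(const Holder &) = delete;
    Holder & operator=(const Holder &) = delete;

    void * data() const { return region.memory; }
    size_t size() const { return region.size; }

private:
    Region & region;
};

// Eviction must skip regions that are currently pinned by a Holder.
inline bool isEvictable(const Region & region) { return region.ref_count == 0; }
```

Handing out Holders rather than raw pointers makes the "in use" window explicit: the pin is released exactly when the caller's Holder goes out of scope.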

Figure: BlockCache-dynamic

  • The BlockCache must support dynamically changing its underlying memory storage. Chunks represent that storage, so the BlockCache must support both adding chunks to it and taking chunks from it.
  • Two lists of chunks implement this support. The first is the Free list, which stores chunks that have no regions placed on them. The second is the Acquired list, which stores chunks with at least one region.
  • To add a new chunk to the BlockCache, we add it to the Free list; chunks are then taken from the Free list when there is not enough free space in the chunks on the Acquired list.
  • To take a chunk from the BlockCache, we take it directly from the Free list. This guarantees that we never take a chunk that is currently backing cache items.
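The two-list invariant above can be sketched as follows: new chunks enter the Free list, a chunk migrates to the Acquired list once the first region is placed on it, and only Free-list chunks may leave the cache. The class and method names are illustrative, not the PR's actual API:

```cpp
#include <cstddef>
#include <list>
#include <optional>

// Hypothetical chunk descriptor; real chunks carry a memory pointer.
struct Chunk
{
    size_t size = 0;
    size_t region_count = 0;
};

class ChunkLists
{
public:
    /// Adding memory to the cache: the new chunk starts on the Free list.
    void addChunk(Chunk chunk) { free_chunks.push_back(chunk); }

    /// Taking memory away: only region-free chunks may leave the cache.
    std::optional<Chunk> takeChunk()
    {
        if (free_chunks.empty())
            return std::nullopt;
        Chunk chunk = free_chunks.front();
        free_chunks.pop_front();
        return chunk;
    }

    /// Placing the first region on a chunk moves it to the Acquired list.
    bool acquireChunkForRegion()
    {
        if (free_chunks.empty())
            return false;
        Chunk chunk = free_chunks.front();
        free_chunks.pop_front();
        chunk.region_count = 1;
        acquired_chunks.push_back(chunk);
        return true;
    }

    size_t freeCount() const { return free_chunks.size(); }
    size_t acquiredCount() const { return acquired_chunks.size(); }

private:
    std::list<Chunk> free_chunks;      /// chunks with no regions on them
    std::list<Chunk> acquired_chunks;  /// chunks with at least one region
};
```

Because `takeChunk` only reads the Free list, reclaiming memory from a pool can never invalidate a live cache item.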

Dynamic BuddyAllocator design

We want to allow the general-purpose allocator to take memory from the cache storage, which requires modifying the general-purpose allocator. As we do not want to modify the external library jemalloc, we implement our own allocator schema that can serve as a general-purpose allocator.

We chose the buddy schema among other allocation algorithms because:

  • Buddy system has good performance in terms of allocation speed.
  • The original Buddy system algorithm is easy to implement as a baseline.
  • Internal memory fragmentation is not a big concern for a prototypical implementation of the cache storage and the allocator. This problem will be mitigated using the BlockCache design.
  • We want to use BuddyAllocator only for cache items, so we add memory shrinkage and growth support to BuddyAllocator; this lets us reuse memory from the cache storage for general-purpose allocations and dynamically change the size of the storage.
  • The proposed solution is the following:
    1. At the start of the program, define an absolute maximum of memory that can be used by BuddyAllocator.
    2. BuddyAllocator allocates the absolute maximum of virtual memory without populating the mapping to physical memory.
    3. BuddyAllocator initializes metadata for the whole memory arena.
    4. Shrinking and growth are done with the help of an additional list of freed blocks.
  • The algorithm for memory shrinking is as follows:
    1. Remove the target number of blocks from the free lists and mark them as allocated.
    2. Advise the operating system that we no longer need the mapping to physical memory for these blocks.
    3. Add the blocks to the additional list of freed blocks and change their status.
  • The algorithm for memory growth is similar.
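The virtual-reservation and shrink steps can be sketched with POSIX calls (Linux-specific; a minimal sketch, not the PR's BuddyAllocator): reserve the absolute maximum as virtual memory up front, so no physical pages are populated until first touch, and release the physical pages behind shrunk blocks with `madvise(MADV_DONTNEED)` while keeping the virtual range mapped, so a later re-touch simply faults in fresh zero pages:

```cpp
#include <cstddef>
#include <sys/mman.h>

struct Arena
{
    char * base = nullptr;
    size_t capacity = 0;
};

// Step 2 of the scheme: reserve the absolute maximum of virtual memory.
// MAP_NORESERVE asks the kernel not to account the range against swap;
// physical pages are only populated when the memory is first touched.
inline Arena reserveArena(size_t capacity)
{
    void * ptr = mmap(nullptr, capacity, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (ptr == MAP_FAILED)
        return {};
    return {static_cast<char *>(ptr), capacity};
}

/// Shrink path: drop the physical pages behind [offset, offset + size).
/// The virtual addresses stay valid; the next touch faults in zero pages,
/// which is exactly what the growth path relies on. The offset must be
/// page-aligned for madvise to succeed.
inline bool releaseBlocks(const Arena & arena, size_t offset, size_t size)
{
    return madvise(arena.base + offset, size, MADV_DONTNEED) == 0;
}
```

This is why the metadata for the whole arena can be initialized once at startup: shrinking never unmaps anything, it only changes which parts of the arena are physically backed.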

Cache sharding

  • The cache pollution problem occurs because cache items are non-uniform in size: a single new big cache item can cause the unnecessary eviction of many small cache items.
  • To deal with this problem, we can add the cache storage sharding using cache pools.
  • A cache pool is a separate instance of BlockCache that serves as cache storage for a group of related data caches.
  • As each cache pool has separate memory storage and its own eviction policy, items from different cache pools do not interfere with each other.
  • With the cache sharding approach, there must be a coordinator that manages the memory chunks of the cache pools.
  • The coordinator in the current design is the rebalance strategy, an abstraction that exchanges chunks between cache pools and some global cache memory storage.
  • The initialize method of the rebalance strategy is called by a cache pool during its initialization. Each cache pool has a unique name, so the rebalance strategy can distinguish cache pools across calls to its methods. This method lets the rebalance strategy assign starting memory chunks to the cache pool.
  • The finalize method has similar semantics; it lets the cache pool return its memory chunks on destruction.
  • The main purpose of shouldRebalance is to let a cache pool notify the rebalance strategy about a shortage of memory, so the rebalance strategy can add more chunks or allow evicting some cache items.
  • The shouldRebalance method is therefore integrated into the pipeline of a new cache entry allocation.
    Figure: integration of the rebalance strategy into the allocation of a new cache entry
  • The rebalance strategy can use various statistics from cache pools, such as the size of their local storage, hit/miss ratio, and mean lifetime of cache items. Developing a good rebalance strategy is a space for further research.
  • One of the developed rebalance strategies, the dynamic buddy rebalance strategy, uses BuddyAllocator as the source of chunks for cache pools.
  • The rebalance strategy can shrink cache pools to return memory to the BuddyAllocator; the BuddyAllocator can then shrink itself to return memory to the operating system, or serve allocations directly as a general-purpose allocator.
  • Dynamic buddy rebalancing strategy implements the following shouldRebalance algorithm:
    1. Take all free memory chunks from other cache pools.
    2. Return them to the BuddyAllocator, so contiguous blocks will be merged.
    3. Take the target amount of memory from BuddyAllocator.
    4. Make a chunk using this memory.
    5. Add this chunk to the caller cache pool.
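The five steps above can be sketched with toy stand-ins for the real classes (all names here are illustrative assumptions, and the `ToyBuddyAllocator` only tracks a byte budget where the real allocator would merge contiguous buddy blocks):

```cpp
#include <cstddef>
#include <vector>

struct Chunk { size_t size = 0; };

// Toy cache pool: only the chunk lists matter for this sketch.
struct Pool
{
    std::vector<Chunk> free_chunks;
    std::vector<Chunk> used_chunks;
};

// Toy allocator: real buddy merging of contiguous blocks is elided.
struct ToyBuddyAllocator
{
    size_t available = 0;
    void giveBack(Chunk chunk) { available += chunk.size; }  /// buddies would merge here
    bool take(size_t size)
    {
        if (available < size)
            return false;
        available -= size;
        return true;
    }
};

// The dynamic buddy shouldRebalance, step by step.
inline bool shouldRebalance(Pool & caller, std::vector<Pool *> & others,
                            ToyBuddyAllocator & allocator, size_t bytes_needed)
{
    for (Pool * other : others)                        /// 1. take free chunks from other pools
    {
        for (const Chunk & chunk : other->free_chunks)
            allocator.giveBack(chunk);                 /// 2. return them; contiguous blocks merge
        other->free_chunks.clear();
    }
    if (!allocator.take(bytes_needed))                 /// 3. take the target amount of memory
        return false;                                  /// not enough: caller should evict instead
    caller.free_chunks.push_back(Chunk{bytes_needed}); /// 4-5. make a chunk, add it to the caller
    return true;
}
```

Note that only Free-list chunks are drained from the other pools, so rebalancing can never steal memory that is backing live cache items.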

Final design

Future work

  • Development of new rebalance strategies. An example of a new rebalance strategy is the strategy that tries to optimize the overall hit ratio rather than ensuring fairness between cache pools. Using cache pool statistics, we can also optimize the cache usage under different work conditions.
  • Integration of modern cache eviction policies, such as ARC, in the BlockCache.
  • Optimization of the allocator used for the rebalance strategy. While the buddy schema is a good baseline, further research can develop a more scalable system.
  • Advanced cache and memory allocator benchmarking.

Additional information

This work was part of my bachelor's thesis.
You can also check the slides from the project presentation.

