# Data on Kubernetes - Data Analytics and AI/ML Workloads

**Authors:** Alexa Griffith, Chunxu Tang, Joe Huang, Lu Qiu, Raghu Shankar, Robert Hodges, Shawn Sun, Victor Lu, Wannes Rosiers, Xing Yang

## Overview

The Cloud Native AI whitepaper¹ delves into the interaction of Cloud Native (CN) and Artificial Intelligence (AI) technologies, discussing the current state, the challenges, the opportunities, and the potential solutions in this area.

In this paper, we aim to highlight the characteristics of data analytics and AI/ML workloads, and the patterns and trends in data storage that address these new challenges.

Different stages of AI/ML have different utilization patterns. Batch training and fine-tuning workloads are long-running jobs with sustained resource utilization. Inference workloads, on the other hand, have spiky resource utilization and need immediate access².

Different stages of AI/ML workloads have different requirements for data storage.

Data analytics and AI/ML workloads usually involve large amounts of structured and unstructured data. These workloads need to be highly scalable and highly performant, with low latency, and they typically need read-only access to data. These characteristics have implications for data storage requirements.

Data warehouses and data lakes are typically used to store large volumes of data for analytics workloads. Data warehouses are optimized for Online Analytical Processing (OLAP), whereas traditional relational databases are used for Online Transaction Processing (OLTP). Column-oriented databases can be used in data warehouses for efficient query processing³. Data warehouses are normally used to store structured data; data lakes, on the other hand, can handle both structured and unstructured data.
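
To make the OLAP versus OLTP distinction concrete, the following toy Python sketch (illustrative values only) holds the same small table in a row-oriented and a column-oriented layout; an analytical aggregate over one column only has to touch that single column in the columnar layout, which is why column-oriented formats suit data warehouse queries.

```python
# Same toy table in a row-oriented and a column-oriented layout (illustrative data).
rows = [
    {"order_id": 1, "region": "eu", "amount": 120.0},
    {"order_id": 2, "region": "us", "amount": 80.0},
    {"order_id": 3, "region": "eu", "amount": 200.0},
]
columns = {
    "order_id": [1, 2, 3],
    "region": ["eu", "us", "eu"],
    "amount": [120.0, 80.0, 200.0],
}

# An OLAP-style aggregate (total amount) walks every field of every row in the
# row layout, but reads only the single "amount" column in the columnar layout.
row_total = sum(r["amount"] for r in rows)
col_total = sum(columns["amount"])
assert row_total == col_total == 400.0
```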

Data caching plays an important role in data analytics and AI/ML workloads. It can help improve performance dramatically by placing data close to where it needs to be accessed and processed. It can also help to avoid redundant computation.

In recent years, vector databases have gained popularity due to their ability to perform similarity search rather than the exact-match lookups used in traditional database searches. This is very important for AI/ML workloads, especially generative AI workloads.
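
As a minimal illustration of similarity search versus exact-match lookup, the sketch below ranks a few hand-made embedding vectors by cosine similarity to a query. The vectors and the brute-force scan are illustrative stand-ins for what a vector database would index and accelerate with approximate nearest-neighbor structures.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_similar(query, corpus, k=3):
    """Brute-force similarity search: score every stored embedding against the
    query and return the k closest items. A vector database replaces this
    linear scan with approximate nearest-neighbor indexes."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Illustrative 4-dimensional "embeddings"; real embeddings come from a model.
corpus = {
    "doc-a": np.array([0.9, 0.1, 0.0, 0.2]),
    "doc-b": np.array([0.1, 0.8, 0.3, 0.0]),
    "doc-c": np.array([0.85, 0.2, 0.1, 0.1]),
}
query = np.array([1.0, 0.0, 0.0, 0.1])
print(top_k_similar(query, corpus, k=2))  # nearest neighbors, not exact matches
```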

Block, file, and object storage can all be used as the underlying storage systems for data analytics and AI/ML workloads. Object storage's ability to share data with multiple workloads simultaneously, its optimized throughput for parallelised workloads, and its highly scalable capacity make it a popular choice for AI/ML workloads as well as data analytics.

Data warehouses and data lakes are centralized repositories. There is a trend toward decentralizing them using the data mesh architecture for building AI/ML models. Another data paradigm is the data fabric, an architectural approach and technology framework that addresses data lake challenges by providing a unified and integrated view of data across various sources⁴.

Modern Data Architecture Principles emphasize data quality, data governance, consistency, data as a shareable asset, and data security and privacy.

## Patterns and Trends in Data Storage

- [Data Warehouses, Data Lakes, and Data Lake Houses](topics/data-lake-houses.md)
- [Data Cache and Data Locality](topics/data-cache-locality.md)
- [Databases](topics/databases.md)
- [Block, File, and Object Storage](topics/block-file-object.md)
- [Network Management](topics/network-management.md)
- [Data Pipelines](topics/data-pipelines.md)
- [Data Mesh and Data Fabric](topics/data-mesh-and-fabric.md)
- [Hardware/Software Co-design](topics/hardware-software-co-design.md)

## Storage in the AI Lifecycle

Different phases of AI workloads have distinct storage requirements and usage patterns. Training workloads are throughput-oriented with sustained data movement, inference workloads are latency-sensitive with spiky traffic patterns, and AI agents require complex state management across iterative reasoning loops.

- [Training and its Storage Usage Patterns](topics/training.md)
- [Inference and its Storage Usage Patterns](topics/inference.md)
- [AI Agent and its Storage Usage Patterns](topics/ai-agent.md)

---

## References

1. https://www.cncf.io/wp-content/uploads/2024/03/cloud_native_ai24_031424a-2.pdf
2. https://sched.co/1YhIO
3. Designing Data Intensive Applications, Martin Kleppmann, O'Reilly Media, 2 May 2017
4. From data warehouse to data fabric: the evolution of data architecture, CNCF Blog, 21 July 2023
## AI Agent and its Storage Usage Patterns

The typical operational process of an AI agent is a state-driven, closed-loop iterative architecture: it uses the model as the brain to dynamically integrate multi-dimensional context within every execution cycle, and in doing so produces a variety of storage usage patterns. The typical processing workflow is as follows:

1. **Session Initialization**: Upon receiving the input (text, multi-modal data, or streaming media), the AI agent first performs intent recognition and then goal decomposition by synthesizing the input, the initial session state, and long-term memory (such as historical preferences, relevant information from past interactions, and external knowledge sources). Multi-modal input may include one or several media types such as text, images, documents, audio clips, or video clips. Streaming media is continuous live voice or video, usually transmitted in full duplex rather than the traditional request-response mode, so providing real-time interaction is critical for this kind of AI agent. The input should be persisted and associated with the session.

2. **Iterative Reasoning Loop**: During each iteration, the model makes decisions through a deeply integrated context window that aggregates the following elements in real-time:
   - **Instructions and Constraints**: The predefined or dynamically synthesized AI agent prompt and the specific requirements for the current step. Instructions and constraints may be divided into one or several text parts, ranging from a few bytes to kilobytes.
   - **Real-time State in Short-term Memory**: State holds temporary data relevant only to the current active session, including current variable values, the to-do list (or DAG), execution status, and so on. State is mutable, and its contents are expected to change as the session evolves at every step. To ensure the AI agent can complete long-running steps despite potential failures, including node failures, the state can be persisted into files or database fields, depending on the AI agent developer's decision.
   - **Events History in Short-term Memory**: The complete "Thought-Action-Observation" trajectory occurring within the current session, in chronological sequence, including the input, interaction messages, generated responses, tool calls and results, and AI agent calls and results. Events history is immutable. Persisting and reloading the entire events history at each step is highly time-consuming, and feeding the events history into the model should leverage inference acceleration techniques such as KV caching; therefore, the events history storage must support append-only writes. Since the model's context window is limited, repetitive accumulated content not only hurts inference accuracy, latency, and cost as iteration steps increase but also adds computational overhead, so the events history requires good deduplication. Events history must also tolerate node-level failures, necessitating cross-node accessibility. (A minimal sketch of such an append-only, deduplicated event log appears after this list.)
   - **Artifacts as Intermediate Results**: Physical objects generated or processed in previous steps, such as code files, images, video clips, audio segments, or documents. Artifacts may serve as inputs to subsequent steps, enabling cross-step information sharing, or as final deliverables. For example, if the session's goal is to generate source code for a web site, some files may be updated in each step, some will be reused as inputs for later steps, and others may be generated once and remain unchanged. Session execution can generate a large volume of artifacts, which require deduplication and caching to improve persistence and loading speeds, while also supporting cross-AI-agent and cross-node access. For live streaming media, the stream must be segmented into media blobs and persisted; this mitigates node failures and enables context sharing across AI agents, reducing storage overhead and avoiding redundant requests for user input. Additionally, events and transcribed text from streaming media must be synchronized with their corresponding media blobs on a timeline.
   - **Long-term Memory**: Searched to recall relevant cross-session historical data and external domain knowledge. Completed sessions are ingested and consolidated into persisted long-term memory.

3. **Tool/AI Agent Orchestration and Execution**: Based on the integrated context, the model determines the next action. This may involve invoking tools (e.g., executing Python code in a sandbox, calling MCP servers, or using function calling provided by the AI agent), reading or writing artifacts (e.g., loading the content of a browsed page, analyzing an image, generating a video, or uploading a file to remote S3 storage), or dispatching requests to other AI agents. In multi-agent collaboration scenarios, the data used in the iterative reasoning loop may be shared among the AI agents.

4. **Observation and State Update**: The results returned from each action (such as code execution failures, API payloads, newly generated images, or feedback messages from collaborating AI agents) are captured as new observations or artifacts. These are appended to the events history in short-term memory or saved to the artifact repository, and the variable values in the short-term memory state are updated. The to-do list (or DAG) execution status is updated by the model in the next cycle according to the action result. Some AI agent implementations may prefer to save the state to a file, which itself becomes another action.

5. **Delivery and Memory Consolidation**: The AI agent continuously evaluates whether the feedback satisfies the termination criteria. Upon completion, the critical decision paths, conclusions, and produced artifacts are not only used to finalize the goal as deliverables, but may also be indexed into long-term memory, allowing them to be reactivated and referenced in future independent sessions.
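
A minimal sketch of the append-only, deduplicated events history mentioned in step 2 is shown below. It assumes a local JSON-lines file as the backing store and a content hash for deduplication; these are illustrative choices only, and a production agent would typically persist to a database or distributed store to get the cross-node accessibility described above.

```python
import hashlib
import json
from pathlib import Path

class EventHistory:
    """Illustrative append-only event log for a single agent session.

    Assumptions: events are JSON-serializable dicts, the backing store is a
    local JSON-lines file, and duplicates are detected with a content hash."""

    def __init__(self, path: str):
        self._path = Path(path)
        self._seen = set()
        if self._path.exists():  # reload after a restart or failover
            for line in self._path.read_text().splitlines():
                self._seen.add(self._digest(json.loads(line)))

    @staticmethod
    def _digest(event: dict) -> str:
        return hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()

    def append(self, event: dict) -> bool:
        """Append an event unless an identical one was already recorded."""
        digest = self._digest(event)
        if digest in self._seen:          # simple content-hash deduplication
            return False
        with self._path.open("a") as f:   # append-only write
            f.write(json.dumps(event, sort_keys=True) + "\n")
        self._seen.add(digest)
        return True

history = EventHistory("session-123-events.jsonl")
history.append({"type": "thought", "step": 1, "content": "decompose goal"})
history.append({"type": "tool_call", "step": 1, "tool": "search", "args": {"q": "docs"}})
history.append({"type": "thought", "step": 1, "content": "decompose goal"})  # deduplicated
```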

The persisted content described above varies drastically in several aspects across the AI agent workflow and must serve the purposes of both AI agent providers and end users. For example:

- An AI Agent processing streaming media might temporarily store interactive video/audio segments for replay purposes.
- Intermediate generated files may reside in the sandboxed execution environment with limited retention for context reuse.
- Final conclusions and delivered artifacts may be published on a web site.
### Block, File, and Object Storage

As described in the CNCF Storage Landscape Whitepaper, block storage is best suited for workloads that require availability, low latency, and good throughput for individual workloads, but it is less suited for workloads that require capacity scaling or sharing data with multiple workloads simultaneously. Block storage can be used accordingly for AI/ML workloads depending on the requirements.

File-system-based storage is best suited for use cases that need to share data with multiple workloads simultaneously and need optimized throughput for aggregated workloads, which also makes it a good choice for AI/ML workloads.

The Container Storage Interface (CSI) is a standard for exposing arbitrary block and file storage systems to containerized workloads on Container Orchestration Systems (COs) like Kubernetes¹⁶.

Object storage is best suited for workloads that require availability, large capacities (PB scale), durability, sharing data with multiple workloads simultaneously, and optimized throughput for parallelised workloads¹⁷. This makes object storage an excellent choice for AI/ML workloads. It is also increasingly favored for analytic databases, where it enables separation of compute and storage necessary to support diverse workloads on very large datasets.

Object storage is least suited for workloads that require low latency performance. A cache layer can be used together with object storage to alleviate the latency constraint as well as reduce costs due to excessive API calls. Object storage I/O occurs through API calls, which means that object storage reads do not benefit from the OS page cache. For this reason, workloads like data warehouses that perform large numbers of repeated reads from object storage almost always include local and/or distributed caches.
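
The sketch below illustrates that read-through caching pattern, with an in-memory dictionary standing in for the remote object store (names and paths are purely illustrative). Only the first read of a key pays the API call and its latency; repeated reads are served from the local cache.

```python
from pathlib import Path

# Illustrative in-memory stand-in for a remote object store; a real workload
# would issue a GET against an S3-compatible API here, paying per request.
REMOTE_OBJECTS = {"datasets/train.parquet": b"...parquet bytes..."}
api_calls = 0

CACHE_DIR = Path("/tmp/object-cache")  # illustrative local cache location
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def cached_read(key: str) -> bytes:
    """Read-through cache: repeated reads are served from local disk, so only
    the first access pays the object-store API call and its latency."""
    global api_calls
    local_path = CACHE_DIR / key.replace("/", "_")
    if local_path.exists():        # cache hit: no API call
        return local_path.read_bytes()
    api_calls += 1                 # cache miss: one GET against the object store
    data = REMOTE_OBJECTS[key]
    local_path.write_bytes(data)
    return data

cached_read("datasets/train.parquet")
cached_read("datasets/train.parquet")
print(api_calls)  # 1 on a cold cache: the second read was a local hit
```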

Container Object Storage Interface (COSI) is a Kubernetes storage project that introduces standard APIs for provisioning and consuming object storage in Kubernetes¹⁸.

### FUSE CSI Driver

FUSE (Filesystem in Userspace) is an interface that allows non-privileged users to implement their own filesystems on Unix and Unix-like operating systems. In other words, by implementing the interface, a storage system can be exposed as a virtual file system, and users can access it as if it were a local filesystem. The FUSE process can be a long-running process inside its own pod, with the filesystem mounted onto the host machine or VM. Whenever an application pod gets scheduled on the host, it uses a hostPath volume to access the storage system.

However, a long-running FUSE process may not always be favored, because varying workloads may cause underutilization of resources such as CPU and memory. This is where CSI drivers come in: only when application pods get scheduled does the CSI driver trigger the FUSE process to start, either in a separate pod or in sidecar mode, that is, by injecting another container into the application pod using a webhook. Whether a CSI driver supports the separate-pod or sidecar mode depends on the individual implementation.

Major cloud providers support FUSE CSI drivers for their object storage services. For example, the Azure Blob CSI driver was released at the end of 2022, the GCS FUSE CSI driver in 2023, and the AWS Mountpoint for S3 CSI driver also in 2023.

---

## References

16. https://github.com/container-storage-interface/spec/blob/master/spec.md
17. CNCF Storage Landscape Whitepaper
18. https://kubernetes.io/blog/2022/09/02/cosi-kubernetes-object-storage-management/
### Data Cache

With the increasing popularity of data lakes and compute-storage disaggregation, the compute tier is now decoupled from the storage tier in both data analytics and AI/ML applications. Data caching plays a crucial role in accelerating data loading, minimizing data transfer between compute and storage layers, and reducing API calls, an often overlooked cost in object storage access. Data access patterns differ slightly between data analytics and AI/ML workloads.

Modern query optimization strategies, such as dynamic filtering on columnar data files, typically result in small, disparate reads rather than sequentially reading large chunks of data. Conversely, AI/ML workloads lean towards hybrid data access patterns, where random and sequential reads coexist based on the specific AI/ML scenario.

To fit the needs of varying workloads, the data cache needs to be scalable and elastic. A hierarchical caching design, with a local cache on each compute node and a distributed cache between compute and storage layers, can be effective for handling high volumes of data in data and AI applications. Integrating such a unified data cache into different components of the machine learning life cycle, including data preprocessing, feature engineering, model training, and model serving, can potentially improve the overall system performance⁷.
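
A minimal sketch of the hierarchical lookup order (node-local cache, then distributed cache, then remote storage) follows. All three tiers are plain in-memory dictionaries here purely for illustration; real deployments would back them with local memory or NVMe, a shared distributed cache cluster, and object storage respectively.

```python
class HierarchicalCache:
    """Illustrative hierarchical cache: a small node-local cache in front of a
    shared distributed cache, with remote storage as the source of truth."""

    def __init__(self, remote_storage: dict):
        self.local_cache = {}        # per compute node: fastest, smallest
        self.distributed_cache = {}  # shared between compute and storage layers
        self.remote_storage = remote_storage

    def get(self, key: str) -> bytes:
        if key in self.local_cache:              # tier 1: node-local hit
            return self.local_cache[key]
        if key in self.distributed_cache:        # tier 2: distributed-cache hit
            value = self.distributed_cache[key]
        else:                                    # tier 3: fall back to remote storage
            value = self.remote_storage[key]
            self.distributed_cache[key] = value  # populate shared tier for other nodes
        self.local_cache[key] = value            # keep a local copy on this node
        return value

cache = HierarchicalCache(remote_storage={"features/part-0": b"..."})
cache.get("features/part-0")   # misses both caches, reads remote storage
cache.get("features/part-0")   # now served from the node-local cache
```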

### Data Locality

In the realm of data processing and machine learning, data locality is a crucial factor that significantly impacts performance and cost. The cost considerations include not only data transfer expenses but also the hidden costs associated with low CPU/GPU utilization rates.

Several strategies can be used to achieve data locality, each offering its own advantages and challenges:

#### Reading Data Directly from Remote Storage
- **Benefits**: This approach requires minimal setup effort.
- **Challenges**: Every training epoch requires re-reading all data from remote storage. Since multiple epochs are often necessary for better accuracy, this method can lead to significant time spent on data loading rather than training.

#### Copying Data to Local Storage Before Training
- **Benefits**: This method ensures that all data is local, thus gaining all the benefits of data locality.
- **Challenges**: Management can be difficult. Users must manually delete training data after use, and cache space is limited. For very large datasets, the benefits can be limited by local storage capacity.

#### Local Cache Layer for Data Reuse
- **Examples**: Tools like S3FS with built-in local cache and Alluxio Fuse SDK.
- **Benefits**: Reused data remains local, and the cache layer handles data management, eliminating the need for manual supervision.
- **Challenges**: Cache space remains limited, thus for large datasets, the benefits might still be constrained.

#### Distributed Cache Layer
- **Benefits**: A distributed cache system ensures that data is either local or adjacent, offers centralized data management, and provides scalable cache space.
- **Challenges**: Building and maintaining a distributed caching system can be complex and resource-intensive.

Data locality offers significant performance gains and cost savings, particularly for data-intensive applications and machine learning workloads. However, as discussed above, each strategy comes with its own set of trade-offs. Users must carefully consider their specific needs and constraints when selecting an approach to maximize the benefits of data locality.
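
As a back-of-envelope comparison of the first strategy (re-reading from remote storage every epoch) with the cache-based strategies, the sketch below uses purely illustrative dataset sizes and throughput numbers. The point is only that remote re-reads multiply network transfer by the number of epochs, while a cache that holds the working set pays that cost roughly once.

```python
dataset_gb = 500        # illustrative dataset size
epochs = 10
remote_gb_per_s = 1.0   # illustrative effective read throughput from remote storage
cached_gb_per_s = 8.0   # illustrative effective read throughput from a local/distributed cache

# Strategy: read directly from remote storage on every epoch.
remote_only_s = dataset_gb * epochs / remote_gb_per_s

# Strategy: the first epoch reads remote and warms the cache, later epochs hit the cache.
cached_s = dataset_gb / remote_gb_per_s + dataset_gb * (epochs - 1) / cached_gb_per_s

print(f"remote every epoch: {remote_only_s / 3600:.2f} h spent loading data")
print(f"with a cache layer: {cached_s / 3600:.2f} h spent loading data")
```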

A notable solution for addressing data locality issues is Fluid⁸, a CNCF sandbox project. Fluid enables connecting to remote storage and supports local and/or distributed caching using Kubernetes-native approaches, significantly simplifying data locality management and accelerating AI workloads, which yields notable performance gains and cost savings for machine learning workloads. Many of these benefits are enabled by Fluid runtimes, such as Alluxio, an open-source data orchestration and distributed caching system.

---

## References

7. https://www.alluxio.io/blog/data-caching-strategies-for-data-analytics-and-ai-dataai-summit-2023-session-recap/
8. https://fluid-cloudnative.github.io/