diff --git a/README.md b/README.md index 508c6f8d..b0c0ff31 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,9 @@ -# BharatMLStack -
- BharatMLStack Logo + + + + BharatMLStack +
@@ -20,121 +22,95 @@ ## What is BharatMLStack? -BharatMLStack is a comprehensive, production-ready machine learning infrastructure platform designed to democratize ML capabilities across India and beyond. Our mission is to provide a robust, scalable, and accessible ML stack that empowers organizations to build, deploy, and manage machine learning solutions at massive scale. +BharatMLStack is a production-ready, cloud-agnostic ML infrastructure platform that powers real-time feature serving, model inference, and embedding search at massive scale. Built and battle-tested at [Meesho](https://meesho.com), it is designed to help organizations ship ML to production faster, cheaper, and more reliably. ## Our Vision -- 🎯 **Democratize Machine Learning**: Make advanced ML infrastructure accessible to organizations of all sizes -- 🚀 **Scale Without Limits**: Built to handle millions of requests per second with enterprise-grade reliability -- 🇮🇳 **Bharat-First Approach**: Optimized for Indian market needs while maintaining global standards -- ⚡ **Real-Time Intelligence**: Enable instant decision-making with sub-millisecond feature serving -- 🔧 **Developer-Friendly**: Intuitive APIs and interfaces that accelerate ML development cycles +BharatMLStack is built around **four core tenets**: -## Star History +### Workflow Integration & Productivity +> Ship ML to production faster than ever. -[![Star History Chart](https://api.star-history.com/svg?repos=Meesho/BharatMLStack&type=Date)](https://www.star-history.com/#Meesho/BharatMLStack&Date) +- **3x faster** experiment-to-deployment cycles +- **95% reduction** in model onboarding time -## Running at Million Scale +### Cloud-Agnostic & Lock-In Free +> Run anywhere. Own your stack. 
-BharatMLStack is battle-tested in production environments, powering: -- **1M+ feature vector retrievals per second** across distributed deployments -- **Sub-10ms latency** for real-time feature retrieval -- **99.99% uptime** with auto-scaling and fault tolerance -- **Petabyte-scale** feature storage and processing -- **Multi-region deployments** with global load balancing +- Runs across **public cloud, on-prem, and edge** +- Kubernetes-native with zero vendor lock-in -## Document -- [Doc](https://meesho.github.io/BharatMLStack/) -- [Blogs](https://meesho.github.io/BharatMLStack/blog) -## Core Components +### Economic Efficiency +> Do more with less. -### 📋 Current Releases - -| Component | Version | Description | -|-----------|---------|-------------| -| 🚀 **Horizon** | `v1.0.0` | Control Plane & Backend | -| 🎨 **Trufflebox UI** | `v1.0.0` | ML Management Console | -| 🗄️ **Online Feature Store** | `v1.0.0` | Real-Time Features | -| 🐹 **Go SDK** | `v1.0.0` | Go Client Library | -| 🐍 **Python SDK** | `v1.0.1` | Python Client Library | -| 🚀 **Numerix** | `v1.0.0` | Mathematical Compute Engine | - -### 🚀 Horizon - Control Plane & Backend -The central control plane for BharatMLStack components, serving as the backend for Trufflebox UI. -- **Component orchestration**: Manages and coordinates all BharatMLStack services -- **API gateway**: Unified interface for all MLOps and workflows - -### 🎨 Trufflebox UI - ML Management Console -Modern web interface for managing ML models, features, and experiments. 
Currently it supports: -- **Feature Registry**: Centralized repository for feature definitions and metadata -- **Feature Cataloging**: Discovery and search capabilities for available features -- **Online Feature Store Control System**: Management interface for feature store operations -- **Approval Flows**: Workflow management for feature deployment and changes - -### 🗄️ Online Feature Store - Real-Time Features -High-performance feature store for real-time ML inference and training. -- **Real-time serving**: Sub-10ms feature retrieval at scale -- **Streaming ingestion**: Process millions of feature updates per second -- **Feature Backward Compatible Versioning**: Track and manage feature evolution -- **Multi-source integration**: Push from stream, batch and real-time sources - -### 🗄️ Numerix - Mathematical Compute Engine -High-performance feature store for real-time ML inference and training. -- **Matrix Operations**: High-performance matrix computations and transformations -- **gRPC API**: Fast binary protocol for efficient data transfer -- **Multi-format Support**: String and byte-based matrix formats -- **Optimized Performance**: Built with Rust for maximum efficiency -- **Scalable Architecture**: Designed for distributed processing - -## Key Differentiators - -- ✨ **Production-Ready**: Battle-tested components used in high-traffic production systems -- 🌐 **Cloud Agnostic**: Kubernetes-native, so deploy on the cloud you love -- 📊 **Observability**: Built-in monitoring, logging +- **60–70% lower** infrastructure costs vs hyperscaler managed services +- Optimized resource utilization across CPU and GPU workloads -## Quick Start +### Availability & Scalability +> Enterprise-grade reliability at internet scale. 
-🚀 **Get started with BharatMLStack in minutes!** +- **99.99% uptime** across clusters +- **1M+ QPS** with low latency -For comprehensive setup instructions, examples, and deployment guides, see our detailed Quick Start documentation: +## Designed Truly for Bharat Scale -📖 **[Quick Start Guide →](./quick-start/README.md)** +Built for the demands of one of the world's largest e-commerce platforms: -### What You'll Find: +| Metric | Performance | +|--------|-------------| +| **Feature Store** | 2.4M QPS (batch of 100 id lookups) | +| **Model Inference** | 1M+ QPS | +| **Embedding Search** | 500K QPS | +| **Feature Retrieval Latency** | Sub-10ms | -- **🐳 Docker Setup**: Complete stack deployment with Docker Compose -- **📊 Sample Data**: Pre-configured examples to get you started -- **🔍 Health Checks**: Verify your deployment is working -- **📝 Step-by-Step Tutorials**: From installation to first feature operations +## Core Components -### TL;DR - One Command Setup: +| Component | Description | Version | Docs | +|-----------|-------------|---------|------| +| **[TruffleBox UI](./trufflebox-ui/)** | Web console for feature registry, cataloging, and approval workflows | `v1.3.0` | [Docs](https://meesho.github.io/BharatMLStack/trufflebox-ui/v1.0.0/userguide) | +| **[Online Feature Store](./online-feature-store/)** | Sub-10ms feature retrieval at millions of QPS with streaming ingestion | `v1.2.0` | [Docs](https://meesho.github.io/BharatMLStack/category/online-feature-store) | +| **[Inferflow](./inferflow/)** | DAG-based real-time inference orchestration for composable ML pipelines | `v1.0.0` | [Docs](https://meesho.github.io/BharatMLStack/category/inferflow) | +| **[Numerix](./numerix/)** | Rust-powered math compute engine for high-performance matrix ops | `v1.0.0` | [Docs](https://meesho.github.io/BharatMLStack/category/numerix) | +| **[Skye](./skye/)** | Vector similarity search with pluggable backends | `v1.0.0` | 
[Docs](https://meesho.github.io/BharatMLStack/category/skye) | +| **[Go SDK](./go-sdk/)** | Go client for Feature Store, Interaction Store, and logging | `v1.3.0` | [Docs](https://meesho.github.io/BharatMLStack/category/go-sdk) | +| **[Python SDK](./py-sdk/)** | Python client libraries for Feature Store and inference logging | `v1.0.1` | [Docs](https://meesho.github.io/BharatMLStack/category/python-sdk) | +| **[Interaction Store](./interaction-store/)** | ScyllaDB-backed store for user interaction signals at sub-10ms | — | — | +| **[Horizon](./horizon/)** | Control plane that orchestrates all services and powers TruffleBox UI | `v1.3.0` | — | + +> Full documentation at [meesho.github.io/BharatMLStack](https://meesho.github.io/BharatMLStack/) | [Blogs](https://meesho.github.io/BharatMLStack/blog) + +## Quick Start ```bash -# Clone and start the complete stack git clone https://github.com/Meesho/BharatMLStack.git cd BharatMLStack/quick-start -ONFS_VERSION= HORIZON_VERSION= TRUFFLEBOX_VERSION= NUMERIX_VERSION= ./start.sh +# Set versions (exported so they are visible to start.sh) +export ONFS_VERSION=v1.2.0 HORIZON_VERSION=v1.3.0 TRUFFLEBOX_VERSION=v1.3.0 NUMERIX_VERSION=v1.0.0 + +./start.sh ``` -Then follow the [Quick Start Guide](./quick-start/README.md) for detailed setup and usage instructions. +For step-by-step setup, Docker Compose details, sample data, and health checks, see the full **[Quick Start Guide →](./quick-start/README.md)**. ## Architecture -BharatMLStack follows a microservices architecture designed for scalability and maintainability. Several components are to be open-sourced -
- BharatMLStack Logo + BharatMLStack Architecture
-### 🚀 Quick Navigation +## Use-Cases + +BharatMLStack powers a wide range of ML-driven applications: -| Component | Documentation | Quick Start | -|-----------|--------------|-------------| -| **Online Feature Store** | [Docs](https://meesho.github.io/BharatMLStack/category/online-feature-store) | [Setup](./quick-start/README.md) | -| **Go SDK** | [Docs](./go-sdk/README.md) | [Examples](./go-sdk/README.md) | -| **Python SDK** | [Docs](./py-sdk/README.md) | [Quickstart](./py-sdk/README.md) | -| **User Guide** | [Docs](https://meesho.github.io/BharatMLStack/trufflebox-ui/v1.0.0/userguide) | [Setup](./quick-start/README.md) | -| **Numerix** | [Docs](https://meesho.github.io/BharatMLStack/category/numerix) | [Setup](./quick-start/README.md) | +| Use-Case | What BharatMLStack Enables | +|----------|---------------------------| +| **Personalized Candidate Generation** | Retrieve and rank millions of candidates in real time using feature vectors and embedding similarity | +| **Personalized Ranking** | Serve user, item, and context features at ultra-low latency to power real-time ranking models | +| **Fraud & Risk Detection** | Stream interaction signals and features to detect anomalies and fraudulent patterns in milliseconds | +| **Image Search** | Run embedding search at 500K QPS to match visual queries against massive product catalogs | +| **LLM Recommender Systems** | Orchestrate LLM inference pipelines with feature enrichment for next-gen recommendation engines | +| **DL & LLM Deployments at Scale** | Deploy and scale deep learning and large language models across GPU clusters with Inferflow orchestration | ## Contributing @@ -142,9 +118,9 @@ We welcome contributions from the community! 
Please see our [Contributing Guide] ## Community & Support -- 💬 **Discord**: Join our [community chat](https://discord.gg/XkT7XsV2AU) -- 🐛 **Issues**: Report bugs and request features on [GitHub Issues](https://github.com/Meesho/BharatMLStack/issues) -- 📧 **Email**: Contact us at [ml-oss@meesho.com](mailto:ml-oss@meesho.com ) +- **Discord**: Join our [community chat](https://discord.gg/XkT7XsV2AU) +- **Issues**: Report bugs and request features on [GitHub Issues](https://github.com/Meesho/BharatMLStack/issues) +- **Email**: Contact us at [ml-oss@meesho.com](mailto:ml-oss@meesho.com) ## License diff --git a/assets/bharatmlstack-architecture.png b/assets/bharatmlstack-architecture.png new file mode 100644 index 00000000..afa5b787 Binary files /dev/null and b/assets/bharatmlstack-architecture.png differ diff --git a/assets/bharatmlstack-logo.png b/assets/bharatmlstack-logo.png new file mode 100644 index 00000000..9756a1ec Binary files /dev/null and b/assets/bharatmlstack-logo.png differ diff --git a/docs-src/docs/inferflow/v1.0.0/_category_.json b/docs-src/docs/inferflow/v1.0.0/_category_.json index 0641455e..3c72a212 100644 --- a/docs-src/docs/inferflow/v1.0.0/_category_.json +++ b/docs-src/docs/inferflow/v1.0.0/_category_.json @@ -1,9 +1,4 @@ { - "label": "v1.0.0", - "position": 1, - "link": { - "type": "generated-index", - "description": "Inferflow v1.0.0", - "slug": "/inferflow/v1.0.0" - } + "label": "v1.0.0", + "position": 1 } diff --git a/docs-src/docs/inferflow/v1.0.0/index.md b/docs-src/docs/inferflow/v1.0.0/index.md new file mode 100644 index 00000000..abb59d61 --- /dev/null +++ b/docs-src/docs/inferflow/v1.0.0/index.md @@ -0,0 +1,14 @@ +--- +title: v1.0.0 +description: Inferflow v1.0.0 +sidebar_position: 0 +slug: /inferflow/v1.0.0 +--- + +import DocCardList from '@theme/DocCardList'; + +# Inferflow v1.0.0 + +Inferflow is a graph-driven feature retrieval and model inference orchestration engine. 
It dynamically resolves entity relationships via configurable DAGs, retrieves features from the Online Feature Store, and orchestrates model scoring. + + diff --git a/docs-src/docs/intro.md b/docs-src/docs/intro.md new file mode 100644 index 00000000..4a56ea71 --- /dev/null +++ b/docs-src/docs/intro.md @@ -0,0 +1,57 @@ +--- +sidebar_position: 0 +title: BharatMLStack Documentation +slug: intro +--- + +# BharatMLStack Documentation + +Welcome to the BharatMLStack documentation. BharatMLStack is an open-source, end-to-end ML infrastructure stack built for scale, speed, and simplicity. Explore the components below to get started. + +--- + +## Quick Start + +Get up and running with BharatMLStack in minutes. Step-by-step instructions, sample data, and Docker Compose setup for local development and testing. + +**[Go to Quick Start →](/category/quick-start)** + +--- + +## Online Feature Store + +Sub-10ms, high-throughput access to machine learning features for real-time inference. Supports batch and streaming ingestion, schema validation, and compact versioned feature groups. + +**[Go to Online Feature Store →](/category/online-feature-store)** + +--- + +## Inferflow + +Graph-driven feature retrieval and model inference orchestration engine. Dynamically resolves entity relationships, retrieves features, and orchestrates model scoring — all without custom code. + +**[Go to Inferflow →](/category/inferflow)** + +--- + +## Trufflebox UI + +Modern, feature-rich UI framework for MLOps management. Supports feature catalog, user management, and admin operations with approval flows. + +**[Go to Trufflebox UI →](/category/trufflebox-ui)** + +--- + +## SDKs + +Client libraries for Go and Python to interact with the Online Feature Store and other platform components. Includes gRPC clients, REST APIs, and Apache Spark integration. + +**[Go to SDKs →](/category/sdks)** + +--- + +## Numerix + +High-performance compute engine for ultra-fast element-wise matrix operations. 
Built in Rust with SIMD acceleration for sub-5ms p99 latency. + +**[Go to Numerix →](/category/numerix)** diff --git a/docs-src/docs/numerix/_category_.json b/docs-src/docs/numerix/_category_.json index 7c2d4af0..2340ae40 100644 --- a/docs-src/docs/numerix/_category_.json +++ b/docs-src/docs/numerix/_category_.json @@ -1,6 +1,6 @@ { "label": "Numerix", - "position": 6, + "position": 7, "link": { "type": "generated-index", "description": "Numerix is a mathematical compute engine for BharatML Stack. It is used to perform mathematical operations on matrices and vectors." diff --git a/docs-src/docs/numerix/v1.0.0/_category_.json b/docs-src/docs/numerix/v1.0.0/_category_.json index 4748f653..66455a9e 100644 --- a/docs-src/docs/numerix/v1.0.0/_category_.json +++ b/docs-src/docs/numerix/v1.0.0/_category_.json @@ -1,10 +1,5 @@ { "label": "v1.0.0", - "position": 1, - "link": { - "type": "generated-index", - "description": "Numerix v1.0.0", - "slug": "/numerix/v1.0.0" - } + "position": 1 } diff --git a/docs-src/docs/numerix/v1.0.0/index.md b/docs-src/docs/numerix/v1.0.0/index.md new file mode 100644 index 00000000..1307fef7 --- /dev/null +++ b/docs-src/docs/numerix/v1.0.0/index.md @@ -0,0 +1,14 @@ +--- +title: v1.0.0 +description: Numerix v1.0.0 +sidebar_position: 0 +slug: /numerix/v1.0.0 +--- + +import DocCardList from '@theme/DocCardList'; + +# Numerix v1.0.0 + +Numerix is a mathematical compute engine for BharatML Stack. It is used to perform mathematical operations on matrices and vectors. 
+ + diff --git a/docs-src/docs/online-feature-store/v1.0.0/_category_.json b/docs-src/docs/online-feature-store/v1.0.0/_category_.json index 4fec8dcc..4e1f685c 100644 --- a/docs-src/docs/online-feature-store/v1.0.0/_category_.json +++ b/docs-src/docs/online-feature-store/v1.0.0/_category_.json @@ -1,9 +1,4 @@ { - "label": "v1.0.0", - "position": 1, - "link": { - "type": "generated-index", - "description": "Online Feature Store v1.0.0", - "slug": "/online-feature-store/v1.0.0" - } + "label": "v1.0.0", + "position": 1 } \ No newline at end of file diff --git a/docs-src/docs/online-feature-store/v1.0.0/index.md b/docs-src/docs/online-feature-store/v1.0.0/index.md new file mode 100644 index 00000000..b790c081 --- /dev/null +++ b/docs-src/docs/online-feature-store/v1.0.0/index.md @@ -0,0 +1,14 @@ +--- +title: v1.0.0 +description: Online Feature Store v1.0.0 +sidebar_position: 0 +slug: /online-feature-store/v1.0.0 +--- + +import DocCardList from '@theme/DocCardList'; + +# Online Feature Store v1.0.0 + +A high-performance, scalable, and production-grade feature store built for modern machine learning systems. It supports both real-time and batch workflows, with low-latency feature retrieval. + + diff --git a/docs-src/docs/predator/_category_.json b/docs-src/docs/predator/_category_.json new file mode 100644 index 00000000..576eb122 --- /dev/null +++ b/docs-src/docs/predator/_category_.json @@ -0,0 +1,8 @@ +{ + "label": "Predator", + "position": 7, + "link": { + "type": "generated-index", + "description": "Predator is a scalable, high-performance model inference service built as a wrapper around NVIDIA Triton Inference Server, designed to serve ML models with low latency in Kubernetes, with OnFS and Inferflow integration." 
+ } +} diff --git a/docs-src/docs/predator/v1.0.0/_category_.json b/docs-src/docs/predator/v1.0.0/_category_.json new file mode 100644 index 00000000..3c72a212 --- /dev/null +++ b/docs-src/docs/predator/v1.0.0/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "v1.0.0", + "position": 1 +} diff --git a/docs-src/docs/predator/v1.0.0/architecture.md b/docs-src/docs/predator/v1.0.0/architecture.md new file mode 100644 index 00000000..e337ae4f --- /dev/null +++ b/docs-src/docs/predator/v1.0.0/architecture.md @@ -0,0 +1,201 @@ +--- +title: Architecture +sidebar_position: 1 +--- + +# BharatMLStack - Predator + +Predator is a scalable, high-performance model inference service built as a wrapper around the **NVIDIA Triton Inference Server**. It is designed to serve a variety of machine learning models (Deep Learning, Tree-based, etc.) with low latency in a **Kubernetes (K8s)** environment. + +The system integrates seamlessly with the **Online Feature Store (OnFS)** for real-time feature retrieval and uses **Horizon** as the deployment orchestration layer. Deployments follow a **GitOps** pipeline — Horizon generates Helm configurations, commits them to GitHub, and **Argo Sync** reconciles the desired state onto Kubernetes. + +--- + +## High-Level Design + +![Predator HLD - End-to-end deployment and inference architecture](../../../static/img/v1.0.0-predator-hld.png) + +### End-to-End Flow + +1. **Model Deployment Trigger**: An actor initiates deployment through **Trufflebox UI**, specifying the GCS path (`gcs://`) of the trained model. Separately, post-training pipelines write model artifacts to **GCS Artifactory**. + +2. **Orchestration via Horizon**: Trufflebox UI communicates with **Horizon**, the deployment orchestration layer. Horizon generates the appropriate **Helm** chart configuration for the inference service. + +3. **GitOps Pipeline**: Horizon commits the Helm values to a **GitHub** repository. 
**Argo Sync** watches the repo and reconciles the desired state onto the Kubernetes cluster, creating or updating deployable units. + +4. **Deployable Units (Deployable 1 … N)**: Each deployable is an independent Kubernetes deployment that: + - Downloads model artifacts from **GCS** at startup via an `init.sh` script. + - Launches a **Triton Inference Server** instance loaded with the model. + - Runs one or more pods, each containing the inference runtime and configured backends. + +5. **Triton Backends**: Each Triton instance supports pluggable backends based on the model type: + - **FIL** — GPU-accelerated tree-based models (XGBoost, LightGBM, Random Forest). + - **PyTorch** — Native PyTorch models via LibTorch. + - **Python** — Custom preprocessing/postprocessing or unsupported model formats. + - **TRT (TensorRT)** — GPU-optimized serialized TensorRT engines. + - **ONNX** — Framework-agnostic execution via ONNX Runtime. + - **DALI** — GPU-accelerated data preprocessing (image, audio, video). + +6. **Autoscaling with KEDA**: The cluster uses **KEDA** (Kubernetes Event-Driven Autoscaling) to scale deployable pods based on custom metrics (CPU utilization, GPU utilization via DCGM, queue depth, etc.). The underlying **Kubernetes** scheduler places pods across GPU/CPU node pools. + +### Key Design Principles + +- **GitOps-driven**: All deployment state is version-controlled in Git; Argo Sync ensures cluster state matches the declared configuration. +- **Isolation per deployable**: Each model or model group gets its own deployable unit, preventing noisy-neighbor interference. +- **Init-based model loading**: Models are materialized to local disk before Triton starts, ensuring deterministic startup and no runtime dependency on remote storage. +- **Pluggable backends**: The same infrastructure serves deep learning, tree-based, and custom models through Triton's backend abstraction. 
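The GitOps step above can be sketched as a sync manifest. This assumes an Argo CD-style `Application` resource as the sync engine; the repository URL, path, names, and namespaces below are illustrative placeholders, not taken from the actual setup:

```yaml
# Hypothetical Argo CD Application for one Predator deployable.
# Horizon commits Helm values to the referenced repo/path; the sync
# controller then reconciles the cluster to the declared state.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: predator-ranking-model      # illustrative deployable name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/ml-deployments.git   # placeholder repo
    targetRevision: main
    path: deployables/ranking-model   # Helm chart + values written by Horizon
    helm:
      valueFiles:
        - values.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: inference
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to the declared state
```

Automated sync with `selfHeal` is what makes Git the single source of truth: any out-of-band change to a deployable is reverted on the next reconcile.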
+ +--- + +## Inference Engine: Triton Inference Server + +NVIDIA Triton Inference Server is a high-performance model serving system designed to deploy ML and deep learning models at scale across CPUs and GPUs. It provides a unified inference runtime that supports multiple frameworks, optimized execution, and production-grade scheduling. + +Triton operates as a standalone server that loads models from a model repository and exposes standardized HTTP/gRPC APIs. Predator uses **gRPC** for efficient request and response handling via the **Helix client**. + +### Core Components + +- **Model Repository**: Central directory where models are stored. Predator typically materializes the model repository onto local disk via an init container, enabling fast model loading and eliminating runtime dependency on remote storage during inference. + +### Backends + +A backend is the runtime responsible for executing a model. Each model specifies which backend runs it via configuration. + +| Backend | Description | +|---------|-------------| +| **TensorRT** | GPU-optimized; executes serialized TensorRT engines (kernel fusion, FP16/INT8). | +| **PyTorch** | Serves native PyTorch models via LibTorch. | +| **ONNX Runtime** | Framework-agnostic ONNX execution with TensorRT and other accelerators. | +| **TensorFlow** | Runs TensorFlow SavedModel format. | +| **Python backend** | Custom Python code for preprocessing, postprocessing, or unsupported models. | +| **Custom backends** | C++/Python backends for specialized or proprietary runtimes. | +| **DALI** | GPU-accelerated data preprocessing (image, audio, video). | +| **FIL (Forest Inference Library)** | GPU-accelerated tree-based models (XGBoost, LightGBM, Random Forest). | + +### Key Features + +- **Dynamic batching**: Combines multiple requests into a single batch at runtime — higher GPU utilization, improved throughput, reduced latency variance. 
+- **Concurrent model execution**: Run multiple models or multiple instances of the same model; distribute load across GPUs. +- **Model versioning**: Support multiple versions per model. +- **Ensemble models**: Pipeline of models as an ensemble; eliminates intermediate network hops, reduces latency. +- **Model instance scaling**: Multiple copies of a model for parallel inference and load isolation. +- **Observability**: Prometheus metrics, granular latency, throughput, GPU utilization. +- **Warmup requests**: Preload kernels and avoid cold-start latency. + +--- + +## Model Repository Structure + +``` +model_repository/ +├── model_A/ +│ ├── config.pbtxt +│ ├── 1/ +│ │ └── model.plan +│ ├── 2/ +│ │ └── model.plan +├── model_B/ +│ ├── config.pbtxt +│ ├── 1/ +│ └── model.py +``` + +The `config.pbtxt` file defines how Triton loads and executes a model: input/output tensors, batch settings, hardware execution, backend runtime, and optimization parameters. At minimum it defines: `backend/platform`, `max_batch_size`, `inputs`, `outputs`. + +### Sample config.pbtxt + +```text +name: "product_ranking_model" +platform: "tensorrt_plan" +max_batch_size: 64 +input [ { name: "input_embeddings" data_type: TYPE_FP16 dims: [ 128 ] }, { name: "context_features" data_type: TYPE_FP32 dims: [ 32 ] } ] +output [ { name: "scores" data_type: TYPE_FP32 dims: [ 1 ] } ] +instance_group [ { kind: KIND_GPU count: 2 gpus: [0] } ] +dynamic_batching { preferred_batch_size: [8,16,32,64] max_queue_delay_microseconds: 2000 } +``` + +--- + +## Kubernetes Deployment Architecture + +Predator inference services are deployed on Kubernetes using **Helm-based** deployments for standardized, scalable, GPU-optimized model serving. Each deployment consists of Triton Inference Server wrapped within a Predator runtime, with autoscaling driven by CPU and GPU utilization. 
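As a rough sketch of what Horizon might render for a single deployable, a Helm values file could carry the model location, image, and resource bounds. All keys below are hypothetical (the real chart schema is not shown in these docs); the resource limits mirror the sample configuration later in this page:

```yaml
# Hypothetical values.yaml for one Predator deployable.
modelRepository:
  source: gs://example-bucket/models/product_ranking   # placeholder GCS path
  mountPath: /models
image:
  repository: artifact-registry.example.com/predator/triton-custom   # placeholder image
  tag: "24.01-trt-python"   # custom build with only the required backends
resources:
  limits:
    cpu: 7000m
    memory: 28Gi
    nvidia.com/gpu: 1
autoscaling:
  minReplicas: 2
  maxReplicas: 20
```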
+ +### Pod Architecture + +``` +Predator Pod +├── Init Container (Model Sync) +├── Triton Inference Server Container +``` + +Model artifacts and runtime are initialized before inference traffic is accepted. + +#### Init Container + +- Download model artifacts from cloud storage (GCS). +- Populate the Triton model repository directory. +- Example: `gcloud storage cp -r gs://.../model-path/* /models` + +Benefits: deterministic startup (Triton starts only after models are available), separation of concerns (image = runtime, repository = data). + +#### Triton Inference Server Container + +- Load model artifacts from local repository. +- Manage inference scheduling, request/response handling, and expose inference endpoints. + +### Triton Server Image Strategy + +The Helm chart uses the Triton container image from the internal **artifact registry**. Production uses **custom-built** images (only required backends, e.g. TensorRT, Python) to reduce size and startup time. Unnecessary components are excluded; images are built internally and pushed to the registry. + +**Response Caching**: Custom cache plugins can be added at image build time for optional inference response caching — reducing redundant execution and GPU use for repeated inputs. + +### Image Distribution Optimization + +- **Secondary boot disk image caching**: Images are pre-cached on GPU node pool secondary boot disks to avoid repeated pulls during scale-up and reduce pod startup time and cold-start latency. +- **Image streaming**: Can be used to progressively pull layers for faster time-to-readiness during scaling. + +### Health Probes + +Readiness and liveness use `/v2/health/ready`. Triton receives traffic only after model loading; failed instances are restarted automatically. 
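Wired into the pod spec, the probes described above might look like the following sketch. Port 8000 is Triton's default HTTP port and `/v2/health/ready` is its standard readiness endpoint; the delays and thresholds are illustrative:

```yaml
# Illustrative probe configuration for the Triton container.
readinessProbe:
  httpGet:
    path: /v2/health/ready   # ready only once models are loaded
    port: 8000
  initialDelaySeconds: 15
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /v2/health/ready   # failing instances are restarted automatically
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
```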
+ +### Resource Configuration + +Sample GPU resource config: + +```yaml +limits: + cpu: 7000m + memory: 28Gi + gpu: 1 +``` + +### Autoscaling Architecture + +Predator uses **KEDA** (Kubernetes Event-Driven Autoscaling) for scaling deployable pods. KEDA supports custom metric sources including: + +- **CPU / Memory utilization** for CPU-based deployments. +- **GPU utilization** via **DCGM** (Data Center GPU Manager) for GPU pods — covering utilization, memory, power, etc. +- **Custom Prometheus queries** for application-level scaling signals (e.g., inference queue depth, request latency). + +KEDA ScaledObjects are configured per deployable, enabling fine-grained, independent scaling for each model or model group. + +--- + +## Contributing + +We welcome contributions! See the [Contributing Guide](https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md). + +## Community & Support + +- **Discord**: [community chat](https://discord.gg/XkT7XsV2AU) +- **Issues**: [GitHub Issues](https://github.com/Meesho/BharatMLStack/issues) +- **Email**: [ml-oss@meesho.com](mailto:ml-oss@meesho.com) + +## License + +BharatMLStack is open-source under the [BharatMLStack Business Source License 1.1](https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md). + +--- + +
Built with ❤️ for the ML community from Meesho
+
If you find this useful, ⭐️ the repo — your support means the world to us!
diff --git a/docs-src/docs/predator/v1.0.0/functionalities.md b/docs-src/docs/predator/v1.0.0/functionalities.md new file mode 100644 index 00000000..e5f90b7c --- /dev/null +++ b/docs-src/docs/predator/v1.0.0/functionalities.md @@ -0,0 +1,119 @@ +--- +title: Key Functionalities +sidebar_position: 2 +--- + +# Predator - Key Functionalities + +## Overview + +Predator is a scalable, high-performance model inference service built as a wrapper around **NVIDIA Triton Inference Server**. It serves Deep Learning and tree-based models with low latency in **Kubernetes**, integrates with the **Online Feature Store (OnFS)**, and uses **Inferflow** for orchestration between clients, feature store, and inference engine. Clients send inference requests via the **Helix client** over gRPC. + +--- + +## Core Capabilities + +### Multi-Backend Inference + +Predator leverages Triton's pluggable backends so you can serve a variety of model types from a single deployment: + +| Backend | Use Case | +|---------|----------| +| **TensorRT** | GPU-optimized DL; serialized engines (FP16/INT8) | +| **PyTorch** | Native PyTorch via LibTorch | +| **ONNX Runtime** | Framework-agnostic ONNX with TensorRT/GPU | +| **TensorFlow** | SavedModel format | +| **Python** | Custom preprocessing, postprocessing, or unsupported models | +| **FIL** | Tree-based models (XGBoost, LightGBM, Random Forest) on GPU | +| **DALI** | GPU-accelerated data preprocessing (image, audio, video) | +| **Custom** | C++/Python backends for proprietary or specialized runtimes | + +### Dynamic Batching + +Triton combines multiple incoming requests into a single batch at runtime. 
+ +- Higher GPU utilization and improved throughput +- Reduced latency variance +- Configurable `preferred_batch_size` and `max_queue_delay_microseconds` in `config.pbtxt` + +### Concurrent Model Execution + +- Run multiple models simultaneously +- Run multiple instances of the same model +- Distribute load across GPUs via `instance_group` in model config + +### Model Versioning & Ensembles + +- **Versioning**: Multiple versions per model (e.g. `1/`, `2/` in the model repository) +- **Ensembles**: Define a pipeline of models as an ensemble; eliminates intermediate network hops and reduces latency + +### Model Instance Scaling + +- Deploy multiple copies of a model for parallel inference and load isolation +- Configured via `instance_group` + +--- + +## Inference & API + +### gRPC via Helix Client + +Predator uses **gRPC** for efficient request/response handling. Client applications (e.g. Realestate, IOP) send inference requests through the **Helix client**, which talks to the Triton Inference Server inside the Predator pod. + +### Model Repository + +Models are stored in a local model repository. Predator materializes this via an **Init Container** that downloads artifacts from cloud storage (e.g. GCS) so Triton has no runtime dependency on remote storage during inference. 
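The init-container pattern described above can be sketched as a pod-spec fragment. The SDK image and bucket path are placeholders (the docs elide the real path), and only the fields relevant to model sync are shown:

```yaml
# Illustrative init container that materializes the Triton model
# repository on local disk before the server container starts.
initContainers:
  - name: model-sync
    image: gcr.io/google.com/cloudsdktool/google-cloud-cli:slim   # placeholder SDK image
    command: ["sh", "-c"]
    args:
      - gcloud storage cp -r "gs://example-bucket/model-path/*" /models   # placeholder bucket
    volumeMounts:
      - name: model-repository
        mountPath: /models
containers:
  - name: triton
    # Starts only after model-sync succeeds, so inference never
    # depends on remote storage at runtime.
    volumeMounts:
      - name: model-repository
        mountPath: /models
```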
+ +--- + +## Deployment & Operational Features + +### Custom Triton Images + +- Production uses **custom-built** Triton images (only required backends) for smaller size and faster startup +- Images built on GCP VM, pushed to **Artifact Registry**, and referenced in Helm deployments +- Optional **response caching** via custom cache plugins added at image build time + +### Image Distribution + +- **Secondary boot disk caching**: Triton image pre-cached on GPU node pool to reduce pod startup and scale-up latency +- **Image streaming**: Optionally used for faster time-to-readiness during scaling + +### Health Probes + +- Readiness and liveness use `/v2/health/ready` +- Triton receives traffic only after models are loaded; failed instances are restarted automatically + +### Autoscaling + +- CPU-based scaling for generic load +- GPU-based scaling using **DCGM** metrics (utilization, memory, power); custom queries drive scale-up/scale-down + +--- + +## Observability + +- **Prometheus metrics**: Latency, throughput, GPU utilization, and more +- Metrics emitted from the Triton Inference Container and visualized in **Grafana** +- **Warmup requests**: Configurable to preload kernels and avoid cold-start latency + +--- + +## Contributing + +We welcome contributions! See the [Contributing Guide](https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md). + +## Community & Support + +- **Discord**: [community chat](https://discord.gg/XkT7XsV2AU) +- **Issues**: [GitHub Issues](https://github.com/Meesho/BharatMLStack/issues) +- **Email**: [ml-oss@meesho.com](mailto:ml-oss@meesho.com) + +## License + +BharatMLStack is open-source under the [BharatMLStack Business Source License 1.1](https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md). + +--- + +
Built with ❤️ for the ML community from Meesho
+
If you find this useful, ⭐️ the repo — your support means the world to us!
diff --git a/docs-src/docs/predator/v1.0.0/index.md b/docs-src/docs/predator/v1.0.0/index.md new file mode 100644 index 00000000..9de78cd9 --- /dev/null +++ b/docs-src/docs/predator/v1.0.0/index.md @@ -0,0 +1,14 @@ +--- +title: v1.0.0 +description: Predator v1.0.0 +sidebar_position: 0 +slug: /predator/v1.0.0 +--- + +import DocCardList from '@theme/DocCardList'; + +# Predator v1.0.0 + +Predator is a scalable, high-performance model inference service built as a wrapper around NVIDIA Triton Inference Server, designed to serve ML models with low latency in Kubernetes. + + diff --git a/docs-src/docs/predator/v1.0.0/release-notes.md b/docs-src/docs/predator/v1.0.0/release-notes.md new file mode 100644 index 00000000..a53f4ff3 --- /dev/null +++ b/docs-src/docs/predator/v1.0.0/release-notes.md @@ -0,0 +1,21 @@ +--- +title: Release Notes +sidebar_position: 3 +--- + +# Predator - Release Notes + +## Version 1.0.0 + +**Release Date**: June 2025 +**Status**: General Availability (GA) + +First stable release of **Predator** — scalable model inference service built around **NVIDIA Triton Inference Server**, part of BharatMLStack. Serves Deep Learning and tree-based models with low latency in **Kubernetes**; integrates with **OnFS** and **Interflow**; clients use the **Helix client** over gRPC. + +### What's New + +- **Triton inference engine**: Unified runtime for DL and tree-based models on CPU/GPU; model repository via Init Container from GCS; gRPC API via Helix client. +- **Multi-backend support**: TensorRT, PyTorch, ONNX Runtime, TensorFlow, Python, FIL, DALI, Custom. +- **Dynamic batching & concurrency**: Configurable via `config.pbtxt`; model versioning and ensembles. +- **Kubernetes deployment**: Helm-based; Init Container + Triton container; custom Triton images from Artifact Registry; health probes; CPU/GPU autoscaling. +- **Observability**: Prometheus metrics, Grafana; warmup requests for cold-start avoidance. 
diff --git a/docs-src/docs/quick-start/_category_.json b/docs-src/docs/quick-start/_category_.json index 2e50c7ae..ad53c8fa 100644 --- a/docs-src/docs/quick-start/_category_.json +++ b/docs-src/docs/quick-start/_category_.json @@ -1,6 +1,6 @@ { "label": "Quick Start", - "position": 2, + "position": 3, "link": { "type": "generated-index", "description": "Quick Start guide for BharatML Stack. Get up and running quickly with step-by-step instructions, sample data, and Docker Compose setup for local development and testing." diff --git a/docs-src/docs/quick-start/v1.0.0/index.md b/docs-src/docs/quick-start/v1.0.0/index.md new file mode 100644 index 00000000..adc1bb16 --- /dev/null +++ b/docs-src/docs/quick-start/v1.0.0/index.md @@ -0,0 +1,14 @@ +--- +title: v1.0.0 +description: Quick Start v1.0.0 +sidebar_position: 0 +slug: /quick-start/v1.0.0 +--- + +import DocCardList from '@theme/DocCardList'; + +# Quick Start v1.0.0 + +Get up and running quickly with step-by-step instructions, sample data, and Docker Compose setup for local development and testing. + + diff --git a/docs-src/docs/sdks/_category_.json b/docs-src/docs/sdks/_category_.json index 674a3f7e..ee44b06e 100644 --- a/docs-src/docs/sdks/_category_.json +++ b/docs-src/docs/sdks/_category_.json @@ -1,6 +1,6 @@ { "label": "SDKs", - "position": 3, + "position": 5, "link": { "type": "generated-index", "description": "Software Development Kits (SDKs) for BharatML Stack. Includes client libraries for Go and Python to interact with the online feature store and other platform components." 
diff --git a/docs-src/docs/sdks/go/v1.0.0/index.md b/docs-src/docs/sdks/go/v1.0.0/index.md new file mode 100644 index 00000000..72dbc2da --- /dev/null +++ b/docs-src/docs/sdks/go/v1.0.0/index.md @@ -0,0 +1,14 @@ +--- +title: v1.0.0 +description: Go SDK v1.0.0 +sidebar_position: 0 +slug: /sdks/go/v1.0.0 +--- + +import DocCardList from '@theme/DocCardList'; + +# Go SDK v1.0.0 + +Go client libraries and packages for interacting with the BharatML Stack online feature store, including gRPC clients and protocol buffer definitions. + + diff --git a/docs-src/docs/sdks/python/v1.0.0/_category_.json b/docs-src/docs/sdks/python/v1.0.0/_category_.json index 58516700..4e1f685c 100644 --- a/docs-src/docs/sdks/python/v1.0.0/_category_.json +++ b/docs-src/docs/sdks/python/v1.0.0/_category_.json @@ -1,8 +1,4 @@ { - "label": "v1.0.0", - "position": 1, - "link": { - "type": "generated-index", - "description": "Python SDK v1.0.0 documentation for BharatML Stack. Contains API reference, usage guides, and examples for the Python client libraries including gRPC feature client, Spark feature push client, and common utilities." - } + "label": "v1.0.0", + "position": 1 } \ No newline at end of file diff --git a/docs-src/docs/sdks/python/v1.0.0/index.md b/docs-src/docs/sdks/python/v1.0.0/index.md new file mode 100644 index 00000000..3d6f0e23 --- /dev/null +++ b/docs-src/docs/sdks/python/v1.0.0/index.md @@ -0,0 +1,14 @@ +--- +title: v1.0.0 +description: Python SDK v1.0.0 +sidebar_position: 0 +slug: /sdks/python/v1.0.0 +--- + +import DocCardList from '@theme/DocCardList'; + +# Python SDK v1.0.0 + +Python client libraries and utilities for interacting with the BharatML Stack online feature store, including gRPC clients, Spark integration, and common utilities. 
+ + diff --git a/docs-src/docs/skye/_category_.json b/docs-src/docs/skye/_category_.json new file mode 100644 index 00000000..431fda78 --- /dev/null +++ b/docs-src/docs/skye/_category_.json @@ -0,0 +1,8 @@ +{ + "label": "Skye", + "position": 6, + "link": { + "type": "generated-index", + "description": "Skye is a high-performance vector similarity search platform that enables fast semantic retrieval by representing data as vectors and querying nearest matches in high-dimensional space. It supports pluggable vector databases, tenant-level index isolation, intelligent caching, and centralized cluster management." + } +} diff --git a/docs-src/docs/skye/v1.0.0/_category_.json b/docs-src/docs/skye/v1.0.0/_category_.json new file mode 100644 index 00000000..3c72a212 --- /dev/null +++ b/docs-src/docs/skye/v1.0.0/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "v1.0.0", + "position": 1 +} diff --git a/docs-src/docs/skye/v1.0.0/architecture.md b/docs-src/docs/skye/v1.0.0/architecture.md new file mode 100644 index 00000000..f08926a7 --- /dev/null +++ b/docs-src/docs/skye/v1.0.0/architecture.md @@ -0,0 +1,373 @@ +--- +title: Architecture +sidebar_position: 1 +--- + +# Skye - Vector Similarity Search Platform + +Skye is BharatMLStack's vector similarity search platform that enables fast semantic retrieval by representing data as vectors and querying nearest matches in high-dimensional space. It is composed of three runnable components: **skye-admin**, **skye-consumers**, and **skye-serving**. + +--- + +## System Overview + +![Skye System Architecture](../../../static/img/skye-system-overview.png) + +Skye provides a critical platform for managing data aggregation, model onboarding, and embedding support at production scale. The architecture is designed around three core pillars: + +- **Pluggable Vector Databases**: Support for multiple vector database backends (Qdrant and extensible to others) via a generic abstraction layer. 
+- **Tenant-Level Index Isolation with Shared Embeddings**: Models are stored once but can serve multiple tenants (variants), reducing data redundancy. +- **Event-Driven Administration**: Model lifecycle management is handled through Kafka-based event flows for resilience and fault tolerance. + +### Component Architecture + +| Component | Role | +|---|---| +| **skye-serving** | Handles real-time similarity search queries with in-memory caching and vector DB lookups | +| **skye-consumers** | Processes embedding ingestion (reset/delta jobs) and real-time aggregation events from Kafka | +| **skye-admin** | Manages model lifecycle, onboarding, variant registration, and coordinates Databricks jobs | + +--- + +## Data Model + +### Model and Variant Hierarchy + +Skye uses a **model-first** hierarchy rather than a tenant-first approach. Models sit at the base level with variants (formerly tenants) nested within each model. This eliminates embedding duplication across tenants. + +``` +model (e.g., intent_model) + ├── model_config (distance_function, vector_dimension, etc.) + ├── embedding_store (shared embeddings for all variants) + ├── variant_1 (e.g., organic) + │ ├── vss_filter (criteria for index inclusion) + │ ├── vectordb_type (QDRANT, etc.) + │ ├── vectordb_config (host, port, replication, sharding) + │ ├── read_version / write_version + │ └── job_frequency (FREQ_1D, FREQ_3H, etc.) + └── variant_2 (e.g., ad) + ├── vss_filter + ├── vectordb_type + └── ... +``` + +**Key benefit**: If a model consumes 30M embeddings and is used by two variants, the embeddings are stored once (30M) instead of duplicated (60M). 
+ +### Entity-Based Data Split + +Data is split at the entity level (catalog, product, user) into separate tables for both embeddings and aggregator data: + +**Embedding Tables** (per entity): + +```sql +CREATE TABLE catalog_embeddings ( + model_name text, + version int, + id text, + embedding frozen<list<float>>, + search_embedding frozen<list<float>>, + to_be_indexed_variant_1 boolean, + to_be_indexed_variant_2 boolean, + PRIMARY KEY ((model_name, version), id) +); +``` + +**Aggregator Tables** (per entity): + +```sql +CREATE TABLE catalog_aggregator ( + id text, + is_live_ad text, + out_of_stock text, + PRIMARY KEY (id) +); +``` + +Each entity is mapped via a store configuration: + +```json +{ + "db_conf_id": "1", + "embeddings_table": "catalog_embeddings", + "aggregator_table": "catalog_aggregator" +} +``` + +--- + +## Serving Flow + +The serving path is optimized for low latency with multiple caching layers: + +1. **Request arrives** at skye-serving via gRPC +2. **ConfigRepo** resolves the model configuration, variant filters, and vector DB connection +3. **In-memory cache** is checked first to reduce load on the distributed cache +4. **Distributed cache (Redis)** is checked next for cached similarity results +5. **Vector DB query** executes if cache misses, using the `search_indexed_only` flag for optimal searches within indexed space +6. **Aggregator data** is fetched from ScyllaDB to apply variant-level filters +7.
**Response** returns ranked similar candidates with scores + +### Configuration Bootstrap + +On startup, ConfigRepo creates: +- A map of each model with its configurations (embedding table, vector DB channel) +- A map of each entity to its aggregator table + +```json +{ + "intent_model": { + "db_conf_id": "1", + "index_embedding_table": "catalog_embeddings", + "vector_db_grpc_channel": "" + } +} +``` + +--- + +## Admin Flows + +Skye uses an **event-driven approach** for model lifecycle management: + +- All admin operations are processed through Kafka consumers asynchronously +- A SQL database behind the admin stores all model states +- Pod termination does not affect in-progress operations (events are re-consumed on failure) +- Databricks jobs are triggered and monitored via the admin API + +### API Contracts + +#### Register Model + +``` +POST /register-model +``` + +```json +{ + "entity": "catalog", + "ingestion_column_mapping": "{\"id_column\":\"id\",\"embedding_column\":\"features\",\"to_be_indexed_column\":\"to_be_indexed\"}", + "embedding_store_enabled": true, + "embedding_store_ttl": 604800, + "mq_id": 804, + "model_config": "{\"distance_function\":\"DOT\",\"vector_dimension\":32}", + "store_id": 1, + "training_data_path": "gcs_path" +} +``` + +#### Register Variant + +``` +POST /register-variant +``` + +```json +{ + "entity": "catalog", + "model_name": "intent_model", + "vss_filter": "{...filter criteria...}", + "vectordb_type": "QDRANT", + "vectordb_config": "{...connection config...}", + "job_frequency": "FREQ_1D" +} +``` + +#### Reset Model + +``` +POST /reset-model +``` + +```json +{ + "entity": "catalog", + "model_name": "intent_model", + "frequency": "FREQ_1D" +} +``` + +Response includes variant version mappings, MQ ID, and training data path for the Databricks job. 
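+A response shape consistent with the description above (the field names are illustrative, not the actual contract):
+
+```json
+{
+  "variant_version_map": {"organic": 2, "ad": 3},
+  "mq_id": 804,
+  "training_data_path": "gcs_path"
+}
+```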
+ +#### Trigger Model Machine + +``` +POST /trigger-model-machine +``` + +```json +{ + "entity": "catalog", + "model_name": "intent_model", + "variant": "organic" +} +``` + +#### Promote Model / Variant to Scale-Up Cluster + +``` +POST /promote-model +POST /promote-variant +``` + +Used to transition successful experiments from experiment clusters to production clusters. + +--- + +## Consumer Flows + +![Skye Real-Time Consumer Flow](../../../static/img/skye-rt-consumer-flow.png) + +### Reset/Delta Ingestion + +Embedding ingestion occurs once per model and executes in parallel for each variant. The Kafka event contract supports: + +- **Multiple variants per event**: A single embedding event specifies which variants should index the data +- **Separate search and index embeddings**: Models can have different embeddings for search space vs index space +- **EOF handling**: EOF is sent to all partitions to ensure all data is consumed before completion + +```json +{ + "entity": "catalog", + "model_name": "intent_model", + "candidate_id": "48869419", + "version": "1", + "index_space": { + "variants_version_map": "{'organic':1,'ad':2}", + "embedding": [0.036, -0.048, ...], + "variants_index_map": "{'organic':true,'ad':false}", + "operation": "A", + "payload": "{'sscat_id':700}" + }, + "search_space": { + "embedding": [0.036, -0.048, ...] 
+ } +} +``` + +### Real-Time Consumers + +A generic Kafka schema is used for all real-time consumers, simplifying new integrations: + +```json +{ + "timestamp": 1719308350, + "entity_label": "catalog", + "data": [ + { + "id": "125138466", + "label": "is_live_ad", + "value": "true" + } + ] +} +``` + +### Retry Topic + +Failed ingestion events are published to a retry topic for reprocessing, ensuring no data loss: + +```json +{ + "timestamp": 1719308350, + "entity_label": "catalog", + "model_name": "intent_model", + "variant": "organic", + "data": [ + { + "id": "125138466", + "label": "is_live_ad", + "value": "true" + } + ] +} +``` + +--- + +## Key Design Decisions + +### Pluggable Vector Database Support + +Skye introduces a generic `vector_db_type` configuration and converts vendor-specific configs to a generic `vector_config`, enabling support for multiple vector database backends beyond Qdrant. + +### Variant-Based Model Sharing + +By eliminating the tenant-based construct and introducing variants, Skye allows: +- Models to be shared across tenants without duplication +- Each variant to have its own filter criteria, vector DB config, and job frequency +- Independent read/write version tracking per variant + +### ScyllaDB for Real-Time Aggregation + +Replaced Delta Lake with self-hosted ScyllaDB for cost efficiency. The aggregator is entity-generic (not model/version-specific) since all real-time data is consistent across models. + +### Event-Driven State Management + +Model state transitions are handled via Kafka events with a SQL database backing store. 
This eliminates: +- Single points of failure in admin/ingestion flows +- Models getting stuck during pod restarts +- Manual intervention for consumer pause/resume + +--- + +## Resiliency + +| Mechanism | Description | +|---|---| +| **Retry Topics** | Failed ingestion messages are captured in a failure topic for reprocessing | +| **Circuit Breakers** | Applied to similarity search API calls to throttle RPS during failures | +| **Snapshot Backups** | Periodic collection snapshots enable quick restore during downtime | +| **Automated Cluster Setup** | Scripted provisioning eliminates configuration inconsistencies | +| **Databricks Job Retries** | Lambda functions with retry mechanisms for failed ingestion jobs | + +--- + +## Scalability + +- **Vector DB Scaling**: Generic scripts for adding nodes to existing clusters, enabling horizontal scaling based on load and RPS +- **Service Scaling**: Hosted on EKS with CPU-based autoscaling +- **Experiment Isolation**: Experiments run on separate EKS and vector DB clusters, reducing production cluster complexity +- **Indexed-Only Search**: The `search_indexed_only` flag ensures queries only search indexed space, avoiding latency from brute-force searches on partially built indexes + +--- + +## Observability + +### Metrics (per model + variant) + +| Metric | Description | +|---|---| +| `avg_similar_candidates` | Average number of similarity candidates returned | +| `avg_recall` | Score of the first similar catalog returned | +| Service Latency | P99.9 / P99 / P95 / P50 | +| Service 5xx Count | Error rate monitoring | +| Vector DB Latency | P99.9 / P99 / P95 / P50 | +| Vector DB QPS | Throughput monitoring | +| ScyllaDB Latency | P99.9 / P99 / P95 / P90 | +| Redis Latency | P99.9 / P99 / P95 / P90 | +| Redis Hit % | Cache effectiveness | + +### Alerts + +| Alert | Threshold | +|---|---| +| Indexed Vector Count | < 95% | +| Events to Failure Topic | Rate > 0 | +| Service 5xx | < 10 | +| Service Latency | Model-dependent SLA | + 
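+As a sketch, the thresholds above could be encoded as Prometheus alerting rules. The metric names used here (`skye_indexed_vector_ratio`, `skye_failure_topic_events_total`) are hypothetical placeholders, not Skye's actual metric names:
+
+```yaml
+groups:
+  - name: skye-alerts
+    rules:
+      - alert: IndexedVectorCountLow
+        # Fire when fewer than 95% of expected vectors are indexed
+        expr: skye_indexed_vector_ratio < 0.95
+        for: 5m
+        labels: { severity: critical }
+      - alert: EventsToFailureTopic
+        # Any flow into the retry/failure topic warrants a look
+        expr: rate(skye_failure_topic_events_total[5m]) > 0
+        labels: { severity: warning }
+```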
+--- + +## Technology Stack + +| Component | Technology | +|---|---| +| Language | Go | +| Vector Database | Qdrant (pluggable) | +| Embedding Storage | ScyllaDB | +| Real-Time Aggregation | ScyllaDB | +| Caching | Redis + In-Memory | +| Message Queue | Kafka | +| Configuration | ZooKeeper / etcd | +| Container Orchestration | Kubernetes (EKS) | +| Job Orchestration | Databricks | diff --git a/docs-src/docs/skye/v1.0.0/functionalities.md b/docs-src/docs/skye/v1.0.0/functionalities.md new file mode 100644 index 00000000..1ba92d30 --- /dev/null +++ b/docs-src/docs/skye/v1.0.0/functionalities.md @@ -0,0 +1,106 @@ +--- +title: Functionalities +sidebar_position: 2 +--- + +# Skye - Functionalities + +## Core Capabilities + +### 1. Vector Similarity Search + +Skye provides real-time nearest-neighbor search across high-dimensional vector spaces. It supports: + +- **Configurable distance functions**: DOT product, Cosine similarity, Euclidean distance +- **Configurable vector dimensions**: Per-model vector dimension settings +- **Indexed-only search**: Queries only search within fully indexed space, avoiding brute-force fallback on partially built indexes +- **Pagination support**: Service-level pagination for clients, even when the underlying vector DB does not natively support it + +### 2. Pluggable Vector Database Support + +The platform is designed to be vector DB agnostic: + +- **Generic vector config**: A `vector_db_type` field and generic `vectordb_config` replace vendor-specific configurations +- **Current support**: Qdrant with official Go client +- **Extensibility**: New vector databases can be integrated by implementing the vector DB interface + +### 3. 
Model and Variant Management + +#### Model Registration +- Models are registered via API with entity type, embedding configuration, distance function, vector dimension, and training data path +- Each model is associated with a store ID mapping to specific embedding and aggregator tables + +#### Variant Registration +- Variants represent different views/filters of the same model (e.g., organic, ad, commerce) +- Each variant has its own filter criteria, vector DB cluster, job frequency, and version tracking +- Variants share the same embeddings, eliminating data redundancy + +#### Model Promotion +- Successful experiments can be promoted from experiment clusters to production clusters via API + +### 4. Embedding Ingestion + +#### Batch Ingestion (Reset/Delta Jobs) +- Triggered via Databricks jobs that read from GCS paths +- Supports separate index-space and search-space embeddings +- Per-variant `to_be_indexed` flags control which embeddings are indexed for each variant +- EOF markers sent to all Kafka partitions ensure complete data consumption + +#### Real-Time Ingestion +- Generic Kafka schema for all real-time consumers +- Entity-based aggregation data (e.g., is_live_ad, out_of_stock) updates in real time +- During model resets, real-time consumers continue pushing data to the latest collection (no pausing) + +### 5. Real-Time Data Aggregation + +- Entity-wise (catalog, product, user) real-time aggregation via ScyllaDB +- Generic approach: aggregator tables are entity-level, not model/version-specific +- All real-time data is consistent across models sharing the same entity + +### 6. Intelligent Caching + +- **In-memory cache**: First layer, reduces load on distributed cache +- **Distributed cache (Redis)**: Second layer for cached similarity results +- Hit rate monitoring and cache effectiveness metrics per model + +### 7. 
Embedding Storage + +- Optional embedding storage with configurable TTL +- Enables embedding lookup APIs for downstream consumers +- Stored in ScyllaDB with efficient binary serialization + +### 8. Retry and Fault Tolerance + +- **Retry topic**: Failed ingestion events are published to a dedicated retry topic +- **Event-driven state management**: Model states persist in SQL DB, surviving pod restarts +- **Kafka-based admin**: Asynchronous processing with automatic re-consumption on failure + +### 9. Experiment Isolation + +- Dedicated EKS cluster (`skye-service-experiments`) for experiments +- Dedicated vector DB cluster for experiment workloads +- Clean separation from production: experiments do not impact production performance +- Promotion path from experiment to production after load analysis + +### 10. Centralized Cluster Management + +- Automated cluster provisioning via scripts (collaboration with DevOps) +- Consistent configurations across all clusters (eliminates consensus issues) +- Horizontal scaling support: generic scripts for adding nodes to existing clusters + +--- + +## Onboarding Flow + +### Step-by-step Process + +1. **Data Scientist** provides a base GCS path where model embeddings will be pushed +2. **Register Model** via `POST /register-model` with entity type, column mappings, model config +3. **Register Variant(s)** via `POST /register-variant` with filter criteria, vector DB config, job frequency +4. **Schedule Databricks Job** to read data from GCS path and ingest into the Skye platform +5. **Reset Model** via `POST /reset-model` to trigger the first full ingestion +6. **Trigger Model Machine** via `POST /trigger-model-machine` to start the indexing pipeline + +### Extending to New Tenants + +With the variant system, extending a model to a new tenant only requires registering a new variant with appropriate filters; no re-ingestion of embeddings is needed.
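+For instance, extending `intent_model` to a hypothetical `commerce` tenant is a single `POST /register-variant` call; the `variant` field and the filter value below are illustrative, and the embeddings already ingested for the existing variants are reused as-is:
+
+```json
+{
+  "entity": "catalog",
+  "model_name": "intent_model",
+  "variant": "commerce",
+  "vss_filter": "{\"out_of_stock\":\"false\"}",
+  "vectordb_type": "QDRANT",
+  "vectordb_config": "{...connection config...}",
+  "job_frequency": "FREQ_3H"
+}
+```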
diff --git a/docs-src/docs/skye/v1.0.0/index.md b/docs-src/docs/skye/v1.0.0/index.md new file mode 100644 index 00000000..0a5ee391 --- /dev/null +++ b/docs-src/docs/skye/v1.0.0/index.md @@ -0,0 +1,14 @@ +--- +title: v1.0.0 +description: Skye v1.0.0 +sidebar_position: 0 +slug: /skye/v1.0.0 +--- + +import DocCardList from '@theme/DocCardList'; + +# Skye v1.0.0 + +Skye is a high-performance vector similarity search platform that enables fast semantic retrieval by representing data as vectors and querying nearest matches in high-dimensional space. + + diff --git a/docs-src/docs/skye/v1.0.0/release-notes.md b/docs-src/docs/skye/v1.0.0/release-notes.md new file mode 100644 index 00000000..ac0b1f77 --- /dev/null +++ b/docs-src/docs/skye/v1.0.0/release-notes.md @@ -0,0 +1,67 @@ +--- +title: Release Notes +sidebar_position: 3 +--- + +# Skye - Release Notes + +## v1.0.0 + +### Overview + +Initial open-source release of Skye, BharatMLStack's vector similarity search platform. This release represents a complete re-architecture of the internal VSS (Vector Similarity Search) service, addressing scalability, resilience, and operational efficiency challenges from the previous generation. 
+ +### What's New + +#### Architecture +- **Model-first hierarchy**: Models at the base level with variants nested within, eliminating embedding duplication across tenants +- **Entity-based data split**: Separate embedding and aggregator tables per entity type (catalog, product, user) +- **Event-driven admin flows**: Kafka-based model lifecycle management with SQL-backed state persistence +- **Pluggable vector DB support**: Generic vector database abstraction replacing vendor-specific tight coupling + +#### Serving +- **Multi-layer caching**: In-memory cache + Redis distributed cache for low-latency similarity search +- **Indexed-only search**: `search_indexed_only` flag prevents brute-force fallback on partially indexed collections +- **Pagination support**: Service-level pagination for clients +- **Separate search/index embeddings**: Models can use different embedding spaces for search and indexing + +#### Ingestion +- **Shared embeddings across variants**: Single ingestion per model with parallel variant processing +- **Generic RT consumer schema**: Simplified onboarding for new real-time data sources +- **Retry topic**: Automatic capture and reprocessing of failed ingestion events +- **EOF to all partitions**: Ensures complete data consumption before processing completion + +#### Operations +- **API-based model onboarding**: Register models and variants via REST API (replaces manual Databricks-only flow) +- **Automated cluster provisioning**: Scripted setup for consistent vector DB cluster configurations +- **Experiment isolation**: Dedicated EKS and vector DB clusters for experiments +- **Comprehensive observability**: Per-model + per-variant metrics for latency, throughput, error rates, and cache effectiveness + +### Improvements Over Previous Architecture + +| Area | Before | After | +|---|---|---| +| Embedding storage | Duplicated per tenant | Shared per model | +| Vector DB coupling | Tightly coupled to Qdrant | Pluggable via generic interface | +| State 
management | In-pod synchronous thread | Event-driven with SQL backing | +| Consumer handling | Paused during ingestion | No pausing; concurrent writes | +| Cluster setup | Manual, error-prone | Automated, consistent | +| Experiment infra | Shared with production | Isolated clusters | +| Failure recovery | Manual intervention | Retry topics + snapshots | +| Observability | Generic alerts | Model + variant level metrics | + +### Known Limitations + +- Snapshot restore is currently supported for smaller indexes only +- Pagination is handled at the service level (not natively by the vector DB) +- Horizontal scaling of vector DB clusters requires running provisioning scripts + +### Technology Stack + +- **Language**: Go +- **Vector Database**: Qdrant (pluggable) +- **Storage**: ScyllaDB +- **Cache**: Redis + In-Memory +- **Message Queue**: Kafka +- **Configuration**: ZooKeeper / etcd +- **Orchestration**: Kubernetes (EKS) diff --git a/docs-src/docs/trufflebox-ui/_category_.json b/docs-src/docs/trufflebox-ui/_category_.json index b06298f4..d44ae254 100644 --- a/docs-src/docs/trufflebox-ui/_category_.json +++ b/docs-src/docs/trufflebox-ui/_category_.json @@ -1,6 +1,6 @@ { "label": "Trufflebox UI", - "position": 2, + "position": 4, "link": { "type": "generated-index", "description": "Trufflebox UI is a modern, feature rich UI framework for supporting MLOps. It supports Feature catalog, management, user managemnet and other adminops" diff --git a/docs-src/docs/trufflebox-ui/v1.0.0/index.md b/docs-src/docs/trufflebox-ui/v1.0.0/index.md new file mode 100644 index 00000000..ee6a7212 --- /dev/null +++ b/docs-src/docs/trufflebox-ui/v1.0.0/index.md @@ -0,0 +1,14 @@ +--- +title: v1.0.0 +description: Trufflebox UI v1.0.0 +sidebar_position: 0 +slug: /trufflebox-ui/v1.0.0 +--- + +import DocCardList from '@theme/DocCardList'; + +# Trufflebox UI v1.0.0 + +Trufflebox UI is a modern, feature-rich UI framework for supporting MLOps. 
It supports feature catalog, management, user management, and other admin operations. + + diff --git a/docs-src/docusaurus.config.js b/docs-src/docusaurus.config.js index 229e0cb2..59f6ea48 100644 --- a/docs-src/docusaurus.config.js +++ b/docs-src/docusaurus.config.js @@ -78,6 +78,10 @@ const config = { ({ // Replace with your project's social card image: 'img/docusaurus-social-card.jpg', + colorMode: { + defaultMode: 'dark', + respectPrefersColorScheme: true, + }, navbar: { title: 'BharatMLStack', items: [ diff --git a/docs-src/package.json b/docs-src/package.json index 3b2c4d32..b470544c 100644 --- a/docs-src/package.json +++ b/docs-src/package.json @@ -24,7 +24,8 @@ }, "devDependencies": { "@docusaurus/module-type-aliases": "3.8.1", - "@docusaurus/types": "3.8.1" + "@docusaurus/types": "3.8.1", + "yarn": "1.22.22" }, "browserslist": { "production": [ diff --git a/docs-src/src/css/custom.css b/docs-src/src/css/custom.css index ff94defd..b66bc7db 100644 --- a/docs-src/src/css/custom.css +++ b/docs-src/src/css/custom.css @@ -1,143 +1,636 @@ /** - * Any CSS included here will be global. The classic template - * bundles Infima by default. Infima is a CSS framework designed to - * work well for content-centric websites. + * Global theme for BharatMLStack docs site. + * Overrides Infima variables to match the homepage's indigo/purple dark theme. + * Supports both dark (primary) and light modes. */ -/* You can override the default Infima variables here. */ +/* ======================================== + 1. 
Infima Variable Overrides + ======================================== */ + :root { - /* BharatMLStack brand colors - purple/burgundy theme */ - --ifm-color-primary: #450839; - --ifm-color-primary-dark: #3d0732; - --ifm-color-primary-darker: #39062f; - --ifm-color-primary-darkest: #2f0527; - --ifm-color-primary-light: #4d0940; - --ifm-color-primary-lighter: #510a43; - --ifm-color-primary-lightest: #5d0c4d; + /* Primary palette – gold/amber */ + --ifm-color-primary: #f59e0b; + --ifm-color-primary-dark: #d97706; + --ifm-color-primary-darker: #b45309; + --ifm-color-primary-darkest: #92400e; + --ifm-color-primary-light: #fbbf24; + --ifm-color-primary-lighter: #fcd34d; + --ifm-color-primary-lightest: #fde68a; + + /* Light mode backgrounds and text */ + --ifm-background-color: #f8fafc; + --ifm-background-surface-color: #ffffff; + --ifm-font-color-base: #1e293b; + --ifm-font-color-secondary: #64748b; + --ifm-heading-color: #0f172a; + --ifm-link-color: #f59e0b; + --ifm-link-hover-color: #d97706; + + /* Code */ --ifm-code-font-size: 95%; - --docusaurus-highlighted-code-line-bg: rgba(0, 0, 0, 0.1); - - /* Custom BharatMLStack variables with better contrast */ - --bharatml-primary: #450839; - --bharatml-primary-hover: #6a0c59; - --bharatml-secondary: #f9f9f9; - --bharatml-text: #1c1e21; /* Much darker for better contrast */ - --bharatml-text-light: #606770; /* Darker gray for better readability */ + --ifm-code-background: #f1f5f9; + --ifm-code-border-radius: 6px; + --ifm-code-padding-horizontal: 0.4rem; + --ifm-code-padding-vertical: 0.15rem; + --docusaurus-highlighted-code-line-bg: rgba(245, 158, 11, 0.08); + + /* Cards, borders, shadows */ + --ifm-card-background-color: #ffffff; + --ifm-global-shadow-lw: 0 2px 8px rgba(0, 0, 0, 0.06); + --ifm-global-shadow-md: 0 4px 16px rgba(0, 0, 0, 0.08); + --ifm-global-shadow-tl: 0 8px 32px rgba(0, 0, 0, 0.1); + --ifm-global-radius: 8px; + + /* Table of contents */ + --ifm-toc-border-color: rgba(0, 0, 0, 0.08); + + /* Navbar height for 
padding */ + --ifm-navbar-height: 3.75rem; } -/* For readability concerns, you should choose a lighter palette in dark mode. */ +/* Dark mode */ [data-theme='dark'] { - --ifm-color-primary: #8b4582; - --ifm-color-primary-dark: #7d3f75; - --ifm-color-primary-darker: #763c6e; - --ifm-color-primary-darkest: #62315a; - --ifm-color-primary-light: #994b8f; - --ifm-color-primary-lighter: #a04e96; - --ifm-color-primary-lightest: #b657a9; - --docusaurus-highlighted-code-line-bg: rgba(0, 0, 0, 0.3); - - /* Dark mode BharatMLStack colors */ - --bharatml-primary: #8b4582; - --bharatml-primary-hover: #a04e96; - --bharatml-secondary: #1e1e1e; - --bharatml-text: #e3e3e3; /* Light text for dark mode */ - --bharatml-text-light: #b4b4b4; /* Lighter gray for dark mode */ + --ifm-color-primary: #fbbf24; + --ifm-color-primary-dark: #f59e0b; + --ifm-color-primary-darker: #d97706; + --ifm-color-primary-darkest: #b45309; + --ifm-color-primary-light: #fcd34d; + --ifm-color-primary-lighter: #fde68a; + --ifm-color-primary-lightest: #fef3c7; + + --ifm-background-color: #27001D; + --ifm-background-surface-color: #3d0029; + --ifm-font-color-base: #e2e8f0; + --ifm-font-color-secondary: #94a3b8; + --ifm-heading-color: #f1f5f9; + --ifm-link-color: #fbbf24; + --ifm-link-hover-color: #fcd34d; + + --ifm-code-background: rgba(255, 255, 255, 0.06); + --docusaurus-highlighted-code-line-bg: rgba(251, 191, 36, 0.15); + + --ifm-card-background-color: rgba(255, 255, 255, 0.03); + --ifm-global-shadow-lw: 0 2px 8px rgba(0, 0, 0, 0.3); + --ifm-global-shadow-md: 0 4px 16px rgba(0, 0, 0, 0.4); + --ifm-global-shadow-tl: 0 8px 32px rgba(0, 0, 0, 0.5); + + --ifm-toc-border-color: rgba(255, 255, 255, 0.06); } -/* Custom BharatMLStack styles */ -.bharatml-hero { - background: linear-gradient(135deg, var(--bharatml-primary) 0%, var(--bharatml-primary-hover) 100%); - color: white; + +/* ======================================== + 2. 
Global Gradient Orb Background + ======================================== */ + +.gradient-bg-global { + position: fixed; + top: 0; + left: 0; + width: 100%; + height: 100%; + z-index: 0; + pointer-events: none; } -/* Hero button styling - both buttons should have white borders and proper text colors */ -.bharatml-hero .bharatml-button { - background-color: var(--bharatml-primary); - border: 2px solid white !important; - color: white !important; - transition: all 0.3s ease; +.gradient-orb-global { + position: absolute; + border-radius: 50%; + filter: blur(100px); + opacity: 0.25; + animation: globalOrbFloat 25s ease-in-out infinite; } -.bharatml-hero .bharatml-button:hover { - background-color: white !important; - border-color: white !important; - color: var(--bharatml-primary) !important; +[data-theme='light'] .gradient-orb-global { + opacity: 0.10; } -.bharatml-hero .button--outline { - background-color: transparent !important; - border: 2px solid white !important; - color: white !important; - transition: all 0.3s ease; +.orb-global-1 { + width: 600px; + height: 600px; + background: radial-gradient(circle, #fbbf24, transparent); + top: -10%; + left: -10%; } -.bharatml-hero .button--outline:hover { - background-color: white !important; - border-color: white !important; - color: var(--bharatml-primary) !important; +.orb-global-2 { + width: 500px; + height: 500px; + background: radial-gradient(circle, #f59e0b, transparent); + top: 50%; + right: -10%; + animation-delay: 8s; } -/* Dark mode hero buttons */ -[data-theme='dark'] .bharatml-hero .bharatml-button { - background-color: var(--bharatml-primary); - border: 2px solid white !important; - color: white !important; +.orb-global-3 { + width: 700px; + height: 700px; + background: radial-gradient(circle, #06b6d4, transparent); + bottom: -20%; + left: 30%; + animation-delay: 15s; } -[data-theme='dark'] .bharatml-hero .bharatml-button:hover { - background-color: white !important; - border-color: white !important; - 
color: var(--bharatml-primary) !important; +@keyframes globalOrbFloat { + 0%, 100% { + transform: translate(0, 0) scale(1); + } + 33% { + transform: translate(60px, -60px) scale(1.1); + } + 66% { + transform: translate(-40px, 40px) scale(0.9); + } } -[data-theme='dark'] .bharatml-hero .button--outline { - background-color: transparent !important; - border: 2px solid white !important; - color: white !important; + +/* ======================================== + 3. Navbar – Glass Morphism + ======================================== */ + +.navbar { + background: rgba(39, 0, 29, 0.8) !important; + backdrop-filter: blur(20px); + -webkit-backdrop-filter: blur(20px); + border-bottom: 1px solid rgba(255, 255, 255, 0.05); + box-shadow: none; + position: sticky; + z-index: 100; +} + +[data-theme='light'] .navbar { + background: rgba(255, 255, 255, 0.85) !important; + border-bottom: 1px solid rgba(0, 0, 0, 0.08); +} + +.navbar__title { + font-weight: 800; + background: linear-gradient(135deg, #fbbf24, #f59e0b, #06b6d4); + background-size: 200% 200%; + -webkit-background-clip: text; + -webkit-text-fill-color: transparent; + background-clip: text; + animation: navGradientShift 3s ease infinite; +} + +@keyframes navGradientShift { + 0%, 100% { background-position: 0% 50%; } + 50% { background-position: 100% 50%; } +} + +.navbar__link { + font-weight: 500; } -[data-theme='dark'] .bharatml-hero .button--outline:hover { - background-color: white !important; - border-color: white !important; - color: var(--bharatml-primary) !important; +[data-theme='dark'] .navbar__link { + color: #e2e8f0; } -/* General button styling for other parts of the site */ -.bharatml-button { - background-color: var(--bharatml-primary); - border-color: var(--bharatml-primary); - transition: all 0.3s ease; +[data-theme='dark'] .navbar__link:hover, +[data-theme='dark'] .navbar__link--active { + color: #fbbf24; +} + +.navbar__toggle { + color: var(--ifm-font-color-base); +} + +/* Navbar sidebar (mobile) */ 
+.navbar-sidebar { + background: var(--ifm-background-color); +} + + +/* ======================================== + 4. Footer – Dark Theme + ======================================== */ + +.footer { + background: #3d0029 !important; + border-top: 1px solid rgba(255, 255, 255, 0.05); +} + +[data-theme='light'] .footer { + background: #f1f5f9 !important; + border-top: 1px solid rgba(0, 0, 0, 0.08); +} + +.footer__title { + color: #e2e8f0; + font-weight: 700; +} + +[data-theme='light'] .footer__title { + color: #1e293b; +} + +.footer__link-item { + color: #94a3b8; + transition: color 0.3s; +} + +.footer__link-item:hover { + color: #fbbf24; + text-decoration: none; +} + +[data-theme='light'] .footer__link-item { + color: #64748b; +} + +[data-theme='light'] .footer__link-item:hover { + color: #f59e0b; +} + +.footer__copyright { + color: #64748b; +} + + +/* ======================================== + 5. Sidebar – Glass Effect + ======================================== */ + +[data-theme='dark'] .theme-doc-sidebar-container { + border-right: 1px solid rgba(255, 255, 255, 0.05) !important; } -.bharatml-button:hover { - background-color: var(--bharatml-primary-hover); - border-color: var(--bharatml-primary-hover); - color: white; +[data-theme='dark'] .menu { + background: transparent; } -.bharatml-card { - border: 1px solid rgba(69, 8, 57, 0.1); +[data-theme='dark'] .menu__link { + color: #cbd5e1; border-radius: 8px; - padding: 2rem; - transition: all 0.3s ease; - background: white; + transition: all 0.2s; +} + +[data-theme='dark'] .menu__link:hover { + background: rgba(251, 191, 36, 0.1); + color: #e2e8f0; +} + +[data-theme='dark'] .menu__link--active:not(.menu__link--sublist) { + background: rgba(251, 191, 36, 0.15); + color: #fbbf24; + font-weight: 600; +} + +[data-theme='dark'] .menu__list-item-collapsible:hover { + background: rgba(251, 191, 36, 0.08); +} + +[data-theme='dark'] .theme-doc-sidebar-item-category > .menu__list-item-collapsible > .menu__link { + color: 
#e2e8f0; + font-weight: 600; +} + + +/* ======================================== + 6. Doc / Blog Content + ======================================== */ + +/* Ensure proper z-index for content above gradient orbs */ +[class*='docMainContainer'], +[class*='mainWrapper'], +.main-wrapper { + position: relative; + z-index: 1; +} + +/* Markdown content */ +.markdown h1, +.markdown h2, +.markdown h3, +.markdown h4, +.markdown h5, +.markdown h6 { + color: var(--ifm-heading-color); +} + +/* Tables */ +[data-theme='dark'] table { + border-color: rgba(255, 255, 255, 0.08); +} + +[data-theme='dark'] table thead tr { + background: rgba(255, 255, 255, 0.04); + border-bottom: 1px solid rgba(255, 255, 255, 0.08); +} + +[data-theme='dark'] table tbody tr { + border-bottom: 1px solid rgba(255, 255, 255, 0.04); +} + +[data-theme='dark'] table tbody tr:nth-child(2n) { + background: rgba(255, 255, 255, 0.02); +} + +[data-theme='dark'] th, +[data-theme='dark'] td { + border-color: rgba(255, 255, 255, 0.06); +} + +/* Blockquotes */ +[data-theme='dark'] blockquote { + border-left-color: #fbbf24; + background: rgba(251, 191, 36, 0.05); + color: #cbd5e1; +} + +/* Horizontal rules */ +[data-theme='dark'] hr { + border-color: rgba(255, 255, 255, 0.06); +} + + +/* ======================================== + 7. 
Code Blocks + ======================================== */ + +[data-theme='dark'] .prism-code { + background: rgba(255, 255, 255, 0.04) !important; + border: 1px solid rgba(255, 255, 255, 0.06); +} + +[data-theme='dark'] code { + background: rgba(255, 255, 255, 0.06); + border: 1px solid rgba(255, 255, 255, 0.08); + color: #e2e8f0; +} + +[data-theme='dark'] a code { + color: var(--ifm-link-color); +} + +/* Code block title bar */ +[data-theme='dark'] .codeBlockTitle_node_modules-\@docusaurus-theme-classic-lib-theme-CodeBlock-Content-styles-module { + background: rgba(255, 255, 255, 0.06) !important; + border-bottom: 1px solid rgba(255, 255, 255, 0.06); +} + + +/* ======================================== + 8. Admonitions + ======================================== */ + +[data-theme='dark'] .alert { + background: rgba(255, 255, 255, 0.03); + border: 1px solid rgba(255, 255, 255, 0.06); + color: #e2e8f0; +} + +[data-theme='dark'] .alert--info { + border-left: 4px solid #06b6d4; + background: rgba(6, 182, 212, 0.06); +} + +[data-theme='dark'] .alert--warning { + border-left: 4px solid #f59e0b; + background: rgba(245, 158, 11, 0.06); +} + +[data-theme='dark'] .alert--danger { + border-left: 4px solid #ef4444; + background: rgba(239, 68, 68, 0.06); } -.bharatml-card:hover { - border-color: var(--bharatml-primary); - box-shadow: 0 4px 20px rgba(69, 8, 57, 0.1); - transform: translateY(-2px); +[data-theme='dark'] .alert--success { + border-left: 4px solid #10b981; + background: rgba(16, 185, 129, 0.06); } -.bharatml-icon { - width: 64px; - height: 64px; - background: linear-gradient(135deg, var(--bharatml-primary), var(--bharatml-primary-hover)); +[data-theme='dark'] .alert--secondary { + border-left: 4px solid #fbbf24; + background: rgba(251, 191, 36, 0.06); +} + +[data-theme='dark'] .admonitionHeading_node_modules-\@docusaurus-theme-classic-lib-theme-Admonition-Layout-styles-module { + color: inherit; +} + + +/* ======================================== + 9. 
Table of Contents (right sidebar) + ======================================== */ + +[data-theme='dark'] .table-of-contents__link { + color: #94a3b8; +} + +[data-theme='dark'] .table-of-contents__link:hover, +[data-theme='dark'] .table-of-contents__link--active { + color: #fbbf24; +} + +[data-theme='dark'] .table-of-contents { + border-left: 1px solid rgba(255, 255, 255, 0.06); +} + + +/* ======================================== + 10. Pagination / Doc navigation + ======================================== */ + +[data-theme='dark'] .pagination-nav__link { + background: rgba(255, 255, 255, 0.03); + border: 1px solid rgba(255, 255, 255, 0.08); border-radius: 12px; - display: flex; - align-items: center; - justify-content: center; - margin: 0 auto 1rem; - font-size: 1.5rem; - color: white; + transition: all 0.3s; +} + +[data-theme='dark'] .pagination-nav__link:hover { + border-color: rgba(251, 191, 36, 0.3); + background: rgba(251, 191, 36, 0.06); +} + +[data-theme='dark'] .pagination-nav__sublabel { + color: #94a3b8; +} + +[data-theme='dark'] .pagination-nav__label { + color: #e2e8f0; +} + + +/* ======================================== + 11. Blog-specific + ======================================== */ + +[data-theme='dark'] .blog-post-page article header h1 { + color: #f1f5f9; +} + +[data-theme='dark'] article .avatar__name a { + color: #fbbf24; +} + +[data-theme='dark'] .blog-tags a { + background: rgba(251, 191, 36, 0.1); + border: 1px solid rgba(251, 191, 36, 0.2); + color: #fbbf24; +} + +[data-theme='dark'] .blog-tags a:hover { + background: rgba(251, 191, 36, 0.2); + border-color: rgba(251, 191, 36, 0.4); + text-decoration: none; +} + + +/* ======================================== + 12. 
Search and misc inputs + ======================================== */ + +[data-theme='dark'] .navbar__search-input { + background: rgba(255, 255, 255, 0.05); + border: 1px solid rgba(255, 255, 255, 0.1); + color: #e2e8f0; +} + +[data-theme='dark'] .navbar__search-input::placeholder { + color: #64748b; +} + + +/* ======================================== + 13. Breadcrumbs + ======================================== */ + +[data-theme='dark'] .breadcrumbs__link { + background: rgba(255, 255, 255, 0.04); + color: #94a3b8; + border-radius: 6px; +} + +[data-theme='dark'] .breadcrumbs__link:hover { + background: rgba(251, 191, 36, 0.1); + color: #e2e8f0; +} + +[data-theme='dark'] .breadcrumbs__item--active .breadcrumbs__link { + background: rgba(251, 191, 36, 0.12); + color: #fbbf24; +} + + +/* ======================================== + 14. Tabs + ======================================== */ + +[data-theme='dark'] .tabs__item { + color: #94a3b8; + border-bottom-color: transparent; +} + +[data-theme='dark'] .tabs__item:hover { + color: #e2e8f0; +} + +[data-theme='dark'] .tabs__item--active { + color: #fbbf24; + border-bottom-color: #fbbf24; +} + + +/* ======================================== + 15. Scrollbar (dark mode) + ======================================== */ + +[data-theme='dark'] ::-webkit-scrollbar { + width: 8px; + height: 8px; +} + +[data-theme='dark'] ::-webkit-scrollbar-track { + background: transparent; +} + +[data-theme='dark'] ::-webkit-scrollbar-thumb { + background: rgba(255, 255, 255, 0.12); + border-radius: 4px; +} + +[data-theme='dark'] ::-webkit-scrollbar-thumb:hover { + background: rgba(255, 255, 255, 0.2); +} + + +/* ======================================== + 16. 
Version / Dropdown badges + ======================================== */ + +[data-theme='dark'] .dropdown__menu { + background: #3d0029; + border: 1px solid rgba(255, 255, 255, 0.08); +} + +[data-theme='dark'] .dropdown__link { + color: #cbd5e1; +} + +[data-theme='dark'] .dropdown__link:hover { + background: rgba(251, 191, 36, 0.1); + color: #e2e8f0; +} + +[data-theme='dark'] .dropdown__link--active { + color: #fbbf24; + background: rgba(251, 191, 36, 0.12); +} + + +/* ======================================== + 17. Homepage Isolation + (hide Docusaurus navbar/footer on homepage) + ======================================== */ + +html.homepage-active .navbar { + display: none !important; +} + +html.homepage-active .footer { + display: none !important; +} + +html.homepage-active main { + margin-top: 0; +} + +html.homepage-active [class*='docMainContainer'], +html.homepage-active [class*='mainWrapper'] { + padding-top: 0; +} + + +/* ======================================== + 18. Light mode refinements + ======================================== */ + +[data-theme='light'] .theme-doc-sidebar-container { + border-right: 1px solid rgba(0, 0, 0, 0.06); +} + +[data-theme='light'] .menu__link--active:not(.menu__link--sublist) { + background: rgba(245, 158, 11, 0.08); + color: #f59e0b; + font-weight: 600; +} + +[data-theme='light'] .menu__link:hover { + background: rgba(245, 158, 11, 0.05); +} + +[data-theme='light'] .pagination-nav__link { + border-radius: 12px; + transition: all 0.3s; +} + +[data-theme='light'] .pagination-nav__link:hover { + border-color: rgba(245, 158, 11, 0.3); + box-shadow: 0 4px 16px rgba(245, 158, 11, 0.08); +} + +[data-theme='light'] blockquote { + border-left-color: #f59e0b; } diff --git a/docs-src/src/pages/index.js b/docs-src/src/pages/index.js index c58e5f55..325cf414 100644 --- a/docs-src/src/pages/index.js +++ b/docs-src/src/pages/index.js @@ -1,202 +1,584 @@ -import clsx from 'clsx'; -import Link from '@docusaurus/Link'; +import React, { 
useEffect, useLayoutEffect, useRef, useState, useCallback } from 'react'; +import Layout from '@theme/Layout'; import useDocusaurusContext from '@docusaurus/useDocusaurusContext'; import useBaseUrl from '@docusaurus/useBaseUrl'; -import Layout from '@theme/Layout'; -import { OnlineFeatureStoreFeatures, TruffleboxUIFeatures, SDKsFeatures } from '@site/src/components/HomepageFeatures'; - -import Heading from '@theme/Heading'; import styles from './index.module.css'; -function HomepageHeader() { - const {siteConfig} = useDocusaurusContext(); +// ─── Data ────────────────────────────────────────────── + +const BARRIERS = [ + { + icon: '\u{1F9E0}', + title: 'Focus on building intelligence, not infrastructure', + questions: [ + 'Does every model deployment require a full-stack integration effort?', + 'Do engineers have to rebuild feature retrieval, endpoint integrations, and logging for each new model?', + 'Does changing a simple expression like 0.2\u00D7s\u2081 + 0.8\u00D7s\u2082 to 0.3\u00D7s\u2081 + 0.7\u00D7s\u2082 really need code reviews and redeployments?', + 'Why does deploying intelligence require the devops team to provision infra?', + ], + answer: + 'Machine learning teams should be iterating on models, not systems. 
Yet today, infrastructure complexity turns simple improvements into weeks of engineering effort, slowing experimentation and innovation.', + }, + { + icon: '\u{1F4B0}', + title: 'Built for scale without exponential cost growth', + questions: [ + 'Do your infrastructure costs scale faster than your ML impact?', + 'Are you recomputing the same features, reloading the same data, and moving the same bytes across systems repeatedly?', + 'Are expensive GPUs and compute sitting underutilized while workloads wait on data or inefficient pipelines?', + 'Why does scaling ML often mean scaling cost linearly\u2014or worse?', + ], + answer: + 'A modern ML platform should eliminate redundant computation, reuse features intelligently, and optimize data access across memory, NVMe, and object storage. Compute should be pooled, scheduled efficiently, and fully utilized\u2014ensuring that scale drives impact, not runaway infrastructure costs.', + }, + { + icon: '\u{1F30D}', + title: 'Freedom to deploy anywhere, without lock-in', + questions: [ + 'Are your models tied to a single cloud, making migration costly and complex?', + 'Does adopting managed services today limit your ability to optimize cost or move infrastructure tomorrow?', + 'Can you deploy the same ML stack across public cloud, private cloud, or sovereign environments without redesigning everything?', + 'Why should infrastructure choices dictate the future of your ML systems?', + ], + answer: + 'A modern ML platform should be built on open standards and cloud-neutral abstractions, allowing you to deploy anywhere\u2014public cloud, private infrastructure, or sovereign environments. 
This ensures complete control over your data, freedom from vendor lock-in, and the ability to optimize for cost, performance, and compliance without architectural constraints.', + }, +]; + +const COMPONENTS = [ + { + icon: '\u{26A1}', + title: 'Online Feature Store', + description: + 'BharatMLStack Online Feature Store delivers sub-10ms, high-throughput access to machine learning features for real-time inference. It seamlessly ingests batch and streaming data, validates schemas, and persists compact, versioned feature groups optimized for low latency and efficiency. With scalable storage backends, gRPC APIs, and binary-optimized formats, it ensures consistent, reliable feature serving across ML pipelines.', + cta: '/online-feature-store/v1.0.0', + }, + { + icon: '\u{1F500}', + title: 'Inferflow', + description: + "Inferflow is BharatMLStack's intelligent inference gateway that dynamically retrieves and assembles features required by ML models using a graph-based configuration called Inferpipes. It automatically resolves entity relationships, fetches features from the Online Feature Store, and constructs feature vectors without custom code.", + cta: '/inferflow/v1.0.0', + }, + { + icon: '\u{1F50D}', + title: 'Skye', + description: + 'Skye enables fast similarity retrieval by representing data as vectors and querying nearest matches in high-dimensional space. It supports pluggable vector databases, ensuring flexibility across infrastructure. The system provides tenant-level index isolation while allowing single embedding ingestion even when shared across tenants, reducing redundancy.', + cta: '/skye/v1.0.0', + }, + { + icon: '\u{1F9EE}', + title: 'Numerix', + description: + 'Numerix is a high-performance compute engine designed for ultra-fast element-wise matrix operations. Built in Rust and accelerated using SIMD, it delivers exceptional efficiency and predictable performance. 
Optimized for real-time inference workloads, it achieves strict sub-5ms p99 latency on matrices up to 1000\u00D710.', + cta: '/numerix/v1.0.0', + }, + { + icon: '\u{1F680}', + title: 'Predator', + description: + 'Predator streamlines infrastructure and model lifecycle management. It enables the creation of deployables with specific Triton Server versions and supports seamless model rollouts. Leveraging Helm charts and Argo CD, Predator automates Kubernetes-based deployments while integrating with KEDA for auto-scaling and performance tuning.', + cta: '/predator/v1.0.0', + }, +]; + +const STATS = [ + { target: 4.5, suffix: 'M+', decimals: 1, label: 'Daily Orders', description: 'Daily orders processed via ML pipelines' }, + { target: 2.4, suffix: 'M', decimals: 1, label: 'QPS on FS', description: 'QPS on Feature Store with batch size of 100 id lookups' }, + { target: 1, suffix: 'M+', decimals: 0, label: 'QPS Inference', description: 'QPS on Model Inference' }, + { target: 500, suffix: 'K', decimals: 0, label: 'QPS Embedding', description: 'QPS Embedding Search' }, +]; + +const DEMO_VIDEOS = [ + { + title: 'Feature Store', + description: 'Learn how to onboard and manage features using the self-serve UI for the Online Feature Store.', + url: 'https://videos.meesho.com/reels/feature_store.mp4', + }, + { + title: 'Embedding Platform', + description: 'Walkthrough of onboarding and managing embedding models via the Skye self-serve UI.', + url: 'https://videos.meesho.com/reels/embedding_platform.mp4', + }, + { + title: 'Numerix', + description: 'Step-by-step guide to configuring and running matrix operations through the Numerix self-serve UI.', + url: 'https://videos.meesho.com/reels/numerix.mp4', + }, + { + title: 'Predator', + description: 'How to deploy and manage ML models on Kubernetes using the Predator self-serve UI.', + url: 'https://videos.meesho.com/reels/predator.mp4', + }, + { + title: 'Inferflow', + description: 'Setting up inferpipes and feature retrieval 
graphs through the Inferflow self-serve UI.', + url: 'https://videos.meesho.com/reels/inferflow.mp4', + }, +]; + +const BLOG_POSTS = [ + { + title: "Building Meesho's ML Platform: From Chaos to Cutting-Edge (Part 1)", + category: 'ML Platform', + icon: '\u{1F680}', + link: '/blog/post-one', + }, + { + title: "Building Meesho's ML Platform: Lessons from the First-Gen System (Part 2)", + category: 'ML Platform', + icon: '\u{1F9E9}', + link: '/blog/post-two', + }, + { + title: 'Cracking the Code: Scaling Model Inference & Real-Time Embedding Search', + category: 'Inference', + icon: '\u{26A1}', + link: '/blog/post-three', + }, + { + title: 'Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving', + category: 'LLM', + icon: '\u{1F9E0}', + link: '/blog/post-four', + }, + { + title: 'LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale', + category: 'Optimization', + icon: '\u{1F52C}', + link: '/blog/post-five', + }, +]; + +// ─── Components ──────────────────────────────────────── + +function CustomNav() { + const docsUrl = useBaseUrl('/'); + const blogUrl = useBaseUrl('/blog'); return ( -
-
-
- BharatMLStack Logo -
- - Welcome to {siteConfig.title} - -

+

+ ); +} + +function NetworkBackground() { + const canvasRef = useRef(null); + + useEffect(() => { + const canvas = canvasRef.current; + if (!canvas) return; + const ctx = canvas.getContext('2d'); + let animationId; + let nodes = []; + const NODE_COUNT = 50; + const CONNECTION_DIST = 150; + + function resize() { + const parent = canvas.parentElement; + canvas.width = parent.offsetWidth; + canvas.height = parent.offsetHeight; + } + + function initNodes() { + nodes = []; + for (let i = 0; i < NODE_COUNT; i++) { + nodes.push({ + x: Math.random() * canvas.width, + y: Math.random() * canvas.height, + vx: (Math.random() - 0.5) * 0.4, + vy: (Math.random() - 0.5) * 0.4, + radius: Math.random() * 2 + 1, + }); + } + } + + function draw() { + ctx.clearRect(0, 0, canvas.width, canvas.height); + + // Draw connections + for (let i = 0; i < nodes.length; i++) { + for (let j = i + 1; j < nodes.length; j++) { + const dx = nodes[i].x - nodes[j].x; + const dy = nodes[i].y - nodes[j].y; + const dist = Math.sqrt(dx * dx + dy * dy); + if (dist < CONNECTION_DIST) { + const opacity = (1 - dist / CONNECTION_DIST) * 0.25; + ctx.beginPath(); + ctx.moveTo(nodes[i].x, nodes[i].y); + ctx.lineTo(nodes[j].x, nodes[j].y); + ctx.strokeStyle = `rgba(99, 102, 241, ${opacity})`; + ctx.lineWidth = 0.8; + ctx.stroke(); + } + } + } + + // Draw nodes + for (const node of nodes) { + ctx.beginPath(); + ctx.arc(node.x, node.y, node.radius, 0, Math.PI * 2); + ctx.fillStyle = 'rgba(139, 92, 246, 0.5)'; + ctx.fill(); + } + } + + function update() { + for (const node of nodes) { + node.x += node.vx; + node.y += node.vy; + // Bounce off edges + if (node.x < 0 || node.x > canvas.width) node.vx *= -1; + if (node.y < 0 || node.y > canvas.height) node.vy *= -1; + // Keep in bounds + node.x = Math.max(0, Math.min(canvas.width, node.x)); + node.y = Math.max(0, Math.min(canvas.height, node.y)); + } + } + + function animate() { + update(); + draw(); + animationId = requestAnimationFrame(animate); + } + + resize(); + 
initNodes(); + animate(); + + const resizeObserver = new ResizeObserver(() => { + resize(); + }); + resizeObserver.observe(canvas.parentElement); + + return () => { + cancelAnimationFrame(animationId); + resizeObserver.disconnect(); + }; + }, []); + + return ( +
+ ); } -function OnlineFeatureStoreAbout() { +function BarriersSection() { return ( -
-
-
-
- Built for India's Scale -

- BharatMLStack is a comprehensive, production-ready machine learning infrastructure - platform designed to democratize ML capabilities across India and beyond. Our mission - is to provide a robust, scalable, and accessible ML stack that empowers organizations - to build, deploy, and manage machine learning solutions at massive scale. -

- - Explore Online Feature Store → - -
-
-
-

🏆 Key Achievements

-
    -
-  • ✅ Sub-10ms P99 latency for real-time inference
-  • ✅ 1M+ RPS tested with 100 IDs per request
-  • ✅ PSDB format outperforms Proto3 & Arrow
-  • ✅ Multi-database: Scylla, Dragonfly, Redis
-  • ✅ Production-ready with comprehensive monitoring
+
    +
    +
    +

    Why BharatMLStack

    +

    The Real Barriers to Scaling Machine Learning

    +

    + ML teams spend more time fighting infrastructure than building intelligence. + BharatMLStack removes those barriers. +

    +
    +
    + {BARRIERS.map((barrier, idx) => ( +
    +
    {barrier.icon}
    +

    {barrier.title}

    +
      + {barrier.questions.map((q, i) => ( +
    • {q}
    • + ))}
    +

    {barrier.answer}

    -
    + ))}
); } -function TruffleboxAbout() { +function ComponentsSection() { + const cardsRef = useRef([]); + const baseUrl = useBaseUrl('/'); + + useEffect(() => { + const observer = new IntersectionObserver( + (entries) => { + entries.forEach((entry) => { + if (entry.isIntersecting) { + entry.target.classList.add(styles.componentCardVisible); + } + }); + }, + { threshold: 0.1, rootMargin: '0px 0px -80px 0px' } + ); + + cardsRef.current.forEach((card) => { + if (card) observer.observe(card); + }); + + return () => observer.disconnect(); + }, []); + return ( -
-
-
-
- Modern MLOps Management -

- Trufflebox UI provides a comprehensive, modern web interface for managing your entire - ML infrastructure. Built with cutting-edge web technologies, it delivers an intuitive - experience for feature management, user administration, and operational oversight. - Streamline your MLOps workflows with enterprise-grade UI components. -

- - Explore Trufflebox UI → - -
-
-
-

🎨 UI Features

-
    -
-  • ✅ Comprehensive feature catalog & discovery
-  • ✅ Role-based access control & user management
-  • ✅ Job, Store, Admin Ops management
-  • ✅ Approval flow for everything
-  • ✅ Responsive design for desktop & mobile
+
+
+
+

Platform Components

+

BharatMLStack Components

+

+ Purpose-built components for every stage of the ML lifecycle, from feature + serving to model deployment. +

+
+
+ {COMPONENTS.map((comp, idx) => ( +
(cardsRef.current[idx] = el)} + > +
{comp.icon}
+
+

{comp.title}

+

{comp.description}

+ + Learn more → + +
-
+ ))}
); } -function SDKsAbout() { +function AnimatedCounter({ target, suffix, decimals, duration = 1500 }) { + const [count, setCount] = useState(0); + const [hasStarted, setHasStarted] = useState(false); + const ref = useRef(null); + + const startAnimation = useCallback(() => { + if (hasStarted) return; + setHasStarted(true); + + const startTime = performance.now(); + const step = (now) => { + const elapsed = now - startTime; + const progress = Math.min(elapsed / duration, 1); + // Ease-out cubic for a fast start that decelerates + const eased = 1 - Math.pow(1 - progress, 3); + setCount(eased * target); + if (progress < 1) { + requestAnimationFrame(step); + } else { + setCount(target); + } + }; + requestAnimationFrame(step); + }, [target, duration, hasStarted]); + + useEffect(() => { + const el = ref.current; + if (!el) return; + const observer = new IntersectionObserver( + ([entry]) => { + if (entry.isIntersecting) { + startAnimation(); + } + }, + { threshold: 0.3 } + ); + observer.observe(el); + return () => observer.disconnect(); + }, [startAnimation]); + + const display = decimals > 0 + ? count.toFixed(decimals) + : Math.round(count).toLocaleString(); + return ( -
-
-
-
- Developer-First Integration -

- Our SDKs are designed with developers in mind, providing idiomatic APIs for Go and Python - that feel natural in your existing codebase. Whether you're building microservices, - data pipelines, or ML applications, our SDKs provide the tools you need for seamless - integration with BharatMLStack's powerful infrastructure. -

- - Explore SDKs → - -
-
-
-

🛠️ Developer Tools

-
    -
-  • ✅ Native Go & Python SDKs with type safety
-  • ✅ High-performance gRPC
-  • ✅ Apache Spark integration for publishing features
+
+ {display}{suffix} +
+ ); +} + +function StatsSection() { + return ( +
+
+
+

Proven at scale

+

Scaling Numbers

+
+
+ {STATS.map((stat, idx) => ( +
+

{stat.label}

+ +

{stat.description}

-
+ ))}
); } -function NumerixAbout() { +function DemoVideosSection() { return ( -
-
-
-
- Numerix -

- Numerix is a mathematical compute engine for BharatML Stack. It is used to perform mathematical operations on matrices and vectors. -

- - Explore Numerix → - -
-
-
-

🛠️ Numerix Features

-
    -
-  • ✅ Postfix expression evaluation
-  • ✅ Vectorized math operations
-  • ✅ Typed evaluation
-  • ✅ Compiler-assisted SIMD
-  • ✅ ARM & AMD support
-  • ✅ Multi-arch builds
-  • ✅ Deterministic runtime
+
+
+
+

See it in action

+

Demo Videos

+

+ Watch short demos of each BharatMLStack component in action. +

+
+
+ {DEMO_VIDEOS.map((video, idx) => ( +
+
+ +
+
+

{video.title}

+

{video.description}

+
+ ))} +
+
+
+ ); +} + +function BlogSection() { + const baseUrl = useBaseUrl('/'); + return ( +
+
+
+

From our blog

+

Read Our Blog

+

+ Technical articles, architecture deep-dives, and the story behind BharatMLStack. +

+
+ +
+
+ ); +} + +function CTASection() { + const getStartedUrl = useBaseUrl('/intro'); + return ( +
+
+
+

Deploy ML models with confidence

+

+ A comprehensive stack for business-ready ML that integrates seamlessly with enterprise + systems, with robust security and regulatory compliance built in.

+
@@ -204,22 +586,96 @@ function NumerixAbout() { ); } +function CustomFooter() { + const docsUrl = useBaseUrl('/'); + const blogUrl = useBaseUrl('/blog'); + return ( + + ); +} + +// ─── Page ────────────────────────────────────────────── + export default function Home() { - const {siteConfig} = useDocusaurusContext(); + const { siteConfig } = useDocusaurusContext(); + + // Hide Docusaurus navbar/footer on homepage (client-side, before paint) + useLayoutEffect(() => { + document.documentElement.classList.add('homepage-active'); + return () => { + document.documentElement.classList.remove('homepage-active'); + }; + }, []); + return ( - -
- - - - - - - -
+ description="Open source, end-to-end ML infrastructure stack built for scale, speed, and simplicity." + > + {/* Inline style ensures Docusaurus navbar/footer are hidden during SSR and before JS hydration */} + +
+ + + + + + + + + +
);
}
diff --git a/docs-src/src/pages/index.module.css b/docs-src/src/pages/index.module.css
index 30770d52..3dc200fc 100644
--- a/docs-src/src/pages/index.module.css
+++ b/docs-src/src/pages/index.module.css
@@ -1,144 +1,1005 @@
 /**
- * CSS files with the .module.css suffix will be treated as CSS modules
- * and scoped locally.
+ * Homepage CSS Module
+ * Dark-themed design (primary), with light mode variant.
+ * Based on reference HTML design for BharatMLStack.
  */

-.heroBanner {
-  padding: 4rem 0;
-  text-align: center;
-  position: relative;
-  overflow: hidden;
+/* ========================================
+   CSS Variables (scoped via data-theme)
+   ======================================== */
+
+:root {
+  --hp-primary: #fbbf24;
+  --hp-primary-dark: #f59e0b;
+  --hp-secondary: #8b5cf6;
+  --hp-accent: #06b6d4;
+  --hp-success: #10b981;
+  --hp-dark: #27001D;
+  --hp-dark-light: #3d0029;
+  --hp-text: #e2e8f0;
+  --hp-text-muted: #94a3b8;
+  --hp-bg-card: rgba(255, 255, 255, 0.03);
+  --hp-bg-page: #27001D;
 }

-@media screen and (max-width: 996px) {
-  .heroBanner {
-    padding: 2rem;
-  }
+[data-theme='light'] {
+  --hp-dark: #f8fafc;
+  --hp-dark-light: #f1f5f9;
+  --hp-text: #1e293b;
+  --hp-text-muted: #64748b;
+  --hp-bg-card: rgba(0, 0, 0, 0.02);
+  --hp-bg-page: #f8fafc;
 }

-.logoContainer {
-  margin-bottom: 2rem;
+/* ========================================
+   Page wrapper
+   ======================================== */
+
+.homepageWrapper {
+  font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', sans-serif;
+  background: var(--hp-bg-page);
+  color: var(--hp-text);
+  line-height: 1.6;
+  overflow-x: hidden;
+}
+
+/* ========================================
+   Custom Navigation
+   ======================================== */
+
+.customNav {
+  position: fixed;
+  top: 0;
+  width: 100%;
+  background: rgba(39, 0, 29, 0.85);
+  backdrop-filter: blur(20px);
+  border-bottom: 1px solid rgba(255, 255, 255, 0.15);
+  z-index: 1000;
+  padding: 1.2rem 0;
+  transition: transform 0.3s ease;
+}
+
+[data-theme='light'] .customNav {
+  background: rgba(255, 255, 255, 0.85);
+  border-bottom: 1px solid rgba(0, 0, 0, 0.08);
+}
+
+.navContainer {
+  max-width: 1400px;
+  margin: 0 auto;
+  padding: 0 2rem;
   display: flex;
-  justify-content: center;
+  justify-content: space-between;
   align-items: center;
 }

-.heroLogo {
-  width: 180px;
-  height: 180px;
-  filter: drop-shadow(0 4px 8px rgba(0, 0, 0, 0.1));
-  transition: transform 0.3s ease;
+.logo {
+  font-size: 1.6rem;
+  font-weight: 800;
+  background: linear-gradient(135deg, #fbbf24, #f59e0b, #06b6d4);
+  background-size: 200% 200%;
+  -webkit-background-clip: text;
+  -webkit-text-fill-color: transparent;
+  background-clip: text;
+  animation: hpGradientShift 3s ease infinite;
+  text-decoration: none;
+}
+
+@keyframes hpGradientShift {
+  0%, 100% { background-position: 0% 50%; }
+  50% { background-position: 100% 50%; }
+}
+
+.navLinks {
+  display: flex;
+  gap: 2.5rem;
+  align-items: center;
+}
+
+.navLink {
+  color: var(--hp-text);
+  text-decoration: none;
+  transition: color 0.3s;
+  font-weight: 500;
+}
+
+.navLink:hover {
+  color: var(--hp-primary);
+  text-decoration: none;
+}
+
+/* ========================================
+   Buttons
+   ======================================== */
+
+.btn {
+  padding: 0.75rem 2rem;
+  border-radius: 50px;
+  text-decoration: none;
+  font-weight: 600;
+  transition: all 0.4s cubic-bezier(0.175, 0.885, 0.32, 1.275);
+  display: inline-block;
+  cursor: pointer;
+  border: none;
+  font-size: 1rem;
+}
+
+.btn:hover {
+  text-decoration: none;
+}
+
+.btnPrimary {
+  background: linear-gradient(135deg, #fbbf24, #f59e0b);
+  color: white;
+  box-shadow: 0 10px 30px rgba(251, 191, 36, 0.3);
+}
+
+.btnPrimary:hover {
+  transform: translateY(-3px);
+  box-shadow: 0 15px 40px rgba(251, 191, 36, 0.5);
+  color: white;
+}
+
+.btnSecondary {
+  background: rgba(255, 255, 255, 0.05);
+  color: var(--hp-text);
+  border: 2px solid rgba(251, 191, 36, 0.5);
+}
+
+[data-theme='light'] .btnSecondary {
+  background: rgba(251, 191, 36, 0.05);
+  border-color: rgba(251, 191, 36, 0.4);
+}
+
+.btnSecondary:hover {
+  background: rgba(251, 191, 36, 0.2);
+  border-color: var(--hp-primary);
+  transform: translateY(-3px);
+  color: var(--hp-text);
+}
+
+.btnWhite {
+  background: white;
+  color: var(--hp-primary);
+}
+
+.btnWhite:hover {
+  background: #f8fafc;
+  transform: translateY(-3px) scale(1.05);
+  color: var(--hp-primary);
+}
+
+.btnOutlineWhite {
+  background: transparent;
+  border: 2px solid white;
+  color: white;
+}
+
+.btnOutlineWhite:hover {
+  background: rgba(255, 255, 255, 0.15);
+  color: white;
+  transform: translateY(-3px);
 }

-.heroLogo:hover {
-  transform: scale(1.05);
+/* ========================================
+   Hero Section
+   ======================================== */
+
+.hero {
+  min-height: 100vh;
+  display: grid;
+  grid-template-columns: 1fr 1fr;
+  gap: 4rem;
+  align-items: center;
+  padding: 10rem 2rem 5rem;
+  max-width: 1400px;
+  margin: 0 auto;
+  position: relative;
+  z-index: 1;
+  overflow: hidden;
 }

-@media screen and (max-width: 768px) {
-  .heroLogo {
-    width: 120px;
-    height: 120px;
+.networkCanvas {
+  position: absolute;
+  top: 0;
+  left: -2rem;
+  width: calc(100% + 4rem);
+  height: 100%;
+  z-index: 0;
+  pointer-events: none;
+}
+
+.heroContent {
+  animation: hpFadeInUp 1s ease-out;
+  position: relative;
+  z-index: 1;
+}
+
+@keyframes hpFadeInUp {
+  from {
+    opacity: 0;
+    transform: translateY(40px);
   }
-
-  .logoContainer {
-    margin-bottom: 1.5rem;
+  to {
+    opacity: 1;
+    transform: translateY(0);
   }
 }

-.buttons {
-  display: flex;
-  align-items: center;
-  justify-content: center;
-  gap: 1rem;
+.heroBadge {
+  display: inline-block;
+  padding: 0.5rem 1.5rem;
+  background: rgba(251, 191, 36, 0.1);
+  border: 1px solid rgba(251, 191, 36, 0.3);
+  border-radius: 50px;
+  color: var(--hp-primary);
+  font-size: 0.9rem;
+  font-weight: 600;
   margin-bottom: 2rem;
+  backdrop-filter: blur(10px);
 }

-@media screen and (max-width: 768px) {
-  .buttons {
-    flex-direction: column;
-    gap: 0.5rem;
-  }
+.heroTitle {
+  font-size: 4.5rem;
+  font-weight: 900;
+  margin-bottom: 1.5rem;
+  line-height: 1.1;
+  background: linear-gradient(135deg, #fff 0%, #a5b4fc 100%);
+  -webkit-background-clip: text;
+  -webkit-text-fill-color: transparent;
+  background-clip: text;
 }

-.statsContainer {
+[data-theme='light'] .heroTitle {
+  background: linear-gradient(135deg, #1e293b 0%, #fbbf24 100%);
+  -webkit-background-clip: text;
+  -webkit-text-fill-color: transparent;
+  background-clip: text;
+}
+
+.heroSubtitle {
+  font-size: 1.25rem;
+  color: var(--hp-text-muted);
+  margin-bottom: 2.5rem;
+  line-height: 1.8;
+}
+
+.heroButtons {
   display: flex;
-  justify-content: center;
+  gap: 1.5rem;
+  flex-wrap: wrap;
+}
+
+.heroImage {
+  position: relative;
+  z-index: 1;
+  animation: hpFadeInUp 1s ease-out 0.3s both;
+}
+
+.heroImage img {
+  width: 100%;
+  border-radius: 20px;
+  box-shadow: 0 40px 80px rgba(0, 0, 0, 0.5);
+}
+
+[data-theme='light'] .heroImage img {
+  box-shadow: 0 40px 80px rgba(0, 0, 0, 0.15);
+}
+
+.adoptionBadge {
+  text-align: center;
+  margin-top: 3rem;
+  animation: hpFadeInUp 1s ease-out 0.6s both;
+}
+
+.adoptionBadge p {
+  color: var(--hp-text-muted);
+  font-size: 0.95rem;
+}
+
+/* ========================================
+   Section (generic)
+   ======================================== */
+
+.section {
+  padding: 8rem 2rem;
+  position: relative;
+  z-index: 1;
+}
+
+.container {
+  max-width: 1400px;
+  margin: 0 auto;
+}
+
+.sectionHeader {
+  text-align: center;
+  margin-bottom: 5rem;
+}
+
+.sectionSubtitle {
+  font-size: 0.95rem;
+  color: var(--hp-primary);
+  font-weight: 700;
+  text-transform: uppercase;
+  letter-spacing: 2px;
+  margin-bottom: 1rem;
+}
+
+.sectionTitle {
+  font-size: 3.5rem;
+  font-weight: 900;
+  margin-bottom: 1.5rem;
+  background: linear-gradient(135deg, #fff, #a5b4fc);
+  -webkit-background-clip: text;
+  -webkit-text-fill-color: transparent;
+  background-clip: text;
+}
+
+[data-theme='light'] .sectionTitle {
+  background: linear-gradient(135deg, #1e293b, #fbbf24);
+  -webkit-background-clip: text;
+  -webkit-text-fill-color: transparent;
+  background-clip: text;
+}
+
+.sectionDescription {
+  font-size: 1.2rem;
+  color: var(--hp-text-muted);
+  max-width: 800px;
+  margin: 0 auto;
+}
+
+/* ========================================
+   Barriers Section (3-panel)
+   ======================================== */
+
+.barriersGrid {
+  display: grid;
+  grid-template-columns: repeat(3, 1fr);
+  gap: 2.5rem;
+  margin-top: 4rem;
+}
+
+.barrierCard {
+  background: var(--hp-bg-card);
+  backdrop-filter: blur(20px);
+  border: 1px solid rgba(255, 255, 255, 0.08);
+  border-radius: 24px;
+  padding: 2.5rem;
+  transition: all 0.4s;
+}
+
+[data-theme='light'] .barrierCard {
+  background: white;
+  border-color: rgba(0, 0, 0, 0.08);
+  box-shadow: 0 4px 20px rgba(0, 0, 0, 0.05);
+}
+
+.barrierCard:hover {
+  transform: translateY(-8px);
+  border-color: rgba(251, 191, 36, 0.3);
+  box-shadow: 0 20px 50px rgba(0, 0, 0, 0.4);
+}
+
+[data-theme='light'] .barrierCard:hover {
+  box-shadow: 0 20px 50px rgba(251, 191, 36, 0.12);
+  border-color: rgba(251, 191, 36, 0.3);
+}
+
+.barrierIcon {
+  font-size: 2.5rem;
+  margin-bottom: 1.5rem;
+}
+
+.barrierCard h3 {
+  font-size: 1.4rem;
+  font-weight: 700;
+  margin-bottom: 1rem;
+  color: var(--hp-text);
+}
+
+.barrierCard p {
+  color: var(--hp-text-muted);
+  line-height: 1.8;
+  font-size: 0.95rem;
+}
+
+.barrierQuestions {
+  list-style: none;
+  padding: 0;
+  margin: 1rem 0;
+}
+
+.barrierQuestions li {
+  color: var(--hp-text-muted);
+  padding: 0.4rem 0;
+  font-size: 0.92rem;
+  line-height: 1.6;
+  position: relative;
+  padding-left: 1.2rem;
+}
+
+.barrierQuestions li::before {
+  content: '?';
+  position: absolute;
+  left: 0;
+  color: var(--hp-primary);
+  font-weight: 700;
+}
+
+.barrierAnswer {
+  margin-top: 1rem;
+  color: var(--hp-text-muted);
+  font-size: 0.92rem;
+  line-height: 1.8;
+  border-top: 1px solid rgba(255, 255, 255, 0.06);
+  padding-top: 1rem;
+}
+
+[data-theme='light'] .barrierAnswer {
+  border-top-color: rgba(0, 0, 0, 0.06);
+}
+
+/* ========================================
+   Component Cards (5 cards)
+   ======================================== */
+
+.componentsGrid {
+  display: grid;
+  grid-template-columns: repeat(3, 1fr);
   gap: 3rem;
-  margin-top: 2rem;
-  opacity: 0.9;
+  margin-top: 4rem;
 }

-@media screen and (max-width: 768px) {
-  .statsContainer {
-    flex-direction: column;
-    gap: 1rem;
-    align-items: center;
-  }
+.componentCard {
+  background: var(--hp-bg-card);
+  backdrop-filter: blur(20px);
+  border: 1px solid rgba(255, 255, 255, 0.08);
+  border-radius: 24px;
+  overflow: hidden;
+  transition: all 0.5s cubic-bezier(0.175, 0.885, 0.32, 1.275);
+  opacity: 0;
+  transform: translateY(50px);
+}
+
+.componentCardVisible {
+  opacity: 1;
+  transform: translateY(0);
+}
+
+[data-theme='light'] .componentCard {
+  background: white;
+  border-color: rgba(0, 0, 0, 0.08);
+  box-shadow: 0 4px 20px rgba(0, 0, 0, 0.05);
+}
+
+.componentCard:hover {
+  transform: translateY(-10px);
+  border-color: rgba(251, 191, 36, 0.3);
+  box-shadow: 0 30px 60px rgba(0, 0, 0, 0.5);
+}
+
+[data-theme='light'] .componentCard:hover {
+  box-shadow: 0 30px 60px rgba(251, 191, 36, 0.1);
+}
+
+.componentCardVisible:hover {
+  transform: translateY(-10px);
+}
+
+.componentContent {
+  padding: 2.5rem;
+}
+
+.componentContent h3 {
+  font-size: 1.6rem;
+  margin-bottom: 1rem;
+  font-weight: 700;
+  color: var(--hp-text);
+}
+
+.componentContent p {
+  color: var(--hp-text-muted);
+  margin-bottom: 1.5rem;
+  line-height: 1.7;
+}
+
+.componentLink {
+  color: var(--hp-primary);
+  text-decoration: none;
+  font-weight: 600;
+  display: inline-flex;
+  align-items: center;
+  gap: 0.5rem;
+  transition: gap 0.3s;
+}
+
+.componentLink:hover {
+  gap: 1rem;
+  text-decoration: none;
+  color: var(--hp-primary);
 }

-.statItem {
+.componentIcon {
+  width: 100%;
+  height: 180px;
   display: flex;
-  flex-direction: column;
   align-items: center;
+  justify-content: center;
+  font-size: 4rem;
+  background: linear-gradient(135deg, rgba(251, 191, 36, 0.1), rgba(245, 158, 11, 0.1));
+}
+
+[data-theme='light'] .componentIcon {
+  background: linear-gradient(135deg, rgba(251, 191, 36, 0.06), rgba(245, 158, 11, 0.06));
+}
+
+/* ========================================
+   Stats Grid
+   ======================================== */
+
+.statsSection {
+  background: rgba(0, 0, 0, 0.2);
+}
+
+[data-theme='light'] .statsSection {
+  background: rgba(251, 191, 36, 0.03);
+}
+
+.statsGrid {
+  display: grid;
+  grid-template-columns: repeat(4, 1fr);
+  gap: 2.5rem;
+  margin-top: 4rem;
+}
+
+.statCard {
+  background: var(--hp-bg-card);
+  backdrop-filter: blur(20px);
+  border: 1px solid rgba(255, 255, 255, 0.08);
+  border-radius: 20px;
+  padding: 2.5rem;
   text-align: center;
-  color: white;
+  transition: all 0.4s;
+}
+
+[data-theme='light'] .statCard {
+  background: white;
+  border-color: rgba(0, 0, 0, 0.08);
+  box-shadow: 0 4px 20px rgba(0, 0, 0, 0.05);
+}
+
+.statCard:hover {
+  transform: translateY(-5px);
+  border-color: rgba(251, 191, 36, 0.3);
+}
+
+.statLabel {
+  font-size: 0.9rem;
+  color: var(--hp-text-muted);
+  text-transform: uppercase;
+  letter-spacing: 1.5px;
+  margin-bottom: 0.5rem;
+}
+
+.statValue {
+  font-size: 2.5rem;
+  font-weight: 900;
+  background: linear-gradient(135deg, #fbbf24, #f59e0b);
+  -webkit-background-clip: text;
+  -webkit-text-fill-color: transparent;
+  background-clip: text;
+}
+
+.statDescription {
+  color: var(--hp-text-muted);
+  font-size: 0.95rem;
+  margin-top: 0.5rem;
+}
+
+/* ========================================
+   Demo Videos Grid
+   ======================================== */
+
+.videosGrid {
+  display: grid;
+  grid-template-columns: repeat(3, 1fr);
+  gap: 2.5rem;
+  margin-top: 4rem;
+}
+
+.videoCard {
+  background: var(--hp-bg-card);
+  backdrop-filter: blur(20px);
+  border: 1px solid rgba(255, 255, 255, 0.08);
+  border-radius: 24px;
+  overflow: hidden;
+  transition: all 0.4s;
+}
+
+[data-theme='light'] .videoCard {
+  background: white;
+  border-color: rgba(0, 0, 0, 0.08);
+  box-shadow: 0 4px 20px rgba(0, 0, 0, 0.05);
+}
+
+.videoCard:hover {
+  transform: translateY(-8px);
+  border-color: rgba(251, 191, 36, 0.3);
+  box-shadow: 0 20px 50px rgba(0, 0, 0, 0.4);
+}
+
+[data-theme='light'] .videoCard:hover {
+  box-shadow: 0 20px 50px rgba(251, 191, 36, 0.12);
+}
+
+.videoWrapper {
+  position: relative;
+  width: 100%;
+  aspect-ratio: 16 / 9;
+  background: #000;
+  overflow: hidden;
+}
+
+.videoPlayer {
+  width: 100%;
+  height: 100%;
+  object-fit: cover;
+  display: block;
+}
+
+.videoContent {
+  padding: 1.5rem 2rem 2rem;
 }

-.statItem strong {
-  font-size: 1.5rem;
+.videoContent h3 {
+  font-size: 1.3rem;
   font-weight: 700;
-  margin-bottom: 0.25rem;
+  margin-bottom: 0.5rem;
+  color: var(--hp-text);
+}
+
+.videoContent p {
+  color: var(--hp-text-muted);
+  font-size: 0.92rem;
+  line-height: 1.6;
+  margin: 0;
+}
+
+/* ========================================
+   Blog Grid
+   ======================================== */
+
+.blogGrid {
+  display: grid;
+  grid-template-columns: repeat(auto-fill, minmax(350px, 1fr));
+  gap: 2.5rem;
+  margin-top: 4rem;
+}
+
+.blogCard {
+  background: var(--hp-bg-card);
+  backdrop-filter: blur(20px);
+  border: 1px solid rgba(255, 255, 255, 0.08);
+  border-radius: 20px;
+  overflow: hidden;
+  transition: all 0.4s;
+  text-decoration: none;
+  color: inherit;
   display: block;
 }

-.statItem span {
-  font-size: 0.875rem;
-  opacity: 0.8;
-  text-transform: uppercase;
-  letter-spacing: 0.5px;
+[data-theme='light'] .blogCard {
+  background: white;
+  border-color: rgba(0, 0, 0, 0.08);
+  box-shadow: 0 4px 20px rgba(0, 0, 0, 0.05);
 }

-.aboutSection {
-  padding: 4rem 0;
-  background-color: var(--ifm-background-surface-color);
+.blogCard:hover {
+  transform: translateY(-5px);
+  border-color: rgba(251, 191, 36, 0.3);
+  text-decoration: none;
+  color: inherit;
 }

-.highlightBox {
-  background: linear-gradient(135deg, #f8f9ff 0%, #e8f0ff 100%);
-  border: 1px solid rgba(69, 8, 57, 0.1);
-  border-radius: 12px;
+.blogCardIcon {
+  width: 100%;
+  height: 160px;
+  display: flex;
+  align-items: center;
+  justify-content: center;
+  font-size: 3rem;
+  background: linear-gradient(135deg, rgba(251, 191, 36, 0.15), rgba(6, 182, 212, 0.15));
+}
+
+[data-theme='light'] .blogCardIcon {
+  background: linear-gradient(135deg, rgba(251, 191, 36, 0.08), rgba(6, 182, 212, 0.08));
+}
+
+.blogContent {
   padding: 2rem;
-  height: 100%;
 }

-.highlightBox h3 {
-  color: var(--bharatml-primary);
+.blogCategory {
+  display: inline-block;
+  padding: 0.25rem 0.75rem;
+  background: rgba(251, 191, 36, 0.2);
+  border-radius: 12px;
+  font-size: 0.75rem;
+  color: var(--hp-primary);
+  font-weight: 700;
+  text-transform: uppercase;
   margin-bottom: 1rem;
-  font-size: 1.25rem;
 }

-.highlightBox ul {
+.blogCard h3 {
+  font-size: 1.3rem;
+  margin-bottom: 0.75rem;
+  font-weight: 700;
+  color: var(--hp-text);
+}
+
+.blogMeta {
+  display: flex;
+  align-items: center;
+  gap: 0.5rem;
+  color: var(--hp-text-muted);
+  font-size: 0.85rem;
+}
+
+/* ========================================
+   CTA Section
+   ======================================== */
+
+.ctaSection {
+  background: linear-gradient(135deg, rgba(139, 0, 77, 0.3), rgba(99, 0, 54, 0.4));
+  border: 2px solid rgba(139, 0, 77, 0.5);
+  border-radius: 40px;
+  padding: 6rem 4rem;
+  text-align: center;
+  margin: 2rem 0;
+  position: relative;
+  overflow: hidden;
+}
+
+.ctaSection::before {
+  content: '';
+  position: absolute;
+  top: -50%;
+  left: -50%;
+  width: 200%;
+  height: 200%;
+  background: radial-gradient(circle, rgba(255, 255, 255, 0.1) 0%, transparent 70%);
+  animation: hpRotate 20s linear infinite;
+}
+
+@keyframes hpRotate {
+  from { transform: rotate(0deg); }
+  to { transform: rotate(360deg); }
+}
+
+.ctaTitle {
+  font-size: 3.5rem;
+  font-weight: 900;
+  margin-bottom: 1.5rem;
+  position: relative;
+  z-index: 1;
+  color: white;
+  background: none;
+  -webkit-text-fill-color: white;
+}
+
+.ctaDescription {
+  font-size: 1.3rem;
+  margin-bottom: 3rem;
+  position: relative;
+  z-index: 1;
+  color: rgba(255, 255, 255, 0.9);
+}
+
+.ctaButtons {
+  display: flex;
+  gap: 1.5rem;
+  justify-content: center;
+  flex-wrap: wrap;
+  position: relative;
+  z-index: 1;
+}
+
+/* ========================================
+   Custom Footer
+   ======================================== */
+
+.customFooter {
+  background: var(--hp-dark-light);
+  border-top: 1px solid rgba(255, 255, 255, 0.05);
+  padding: 5rem 2rem 2rem;
+  position: relative;
+  z-index: 1;
+}
+
+[data-theme='light'] .customFooter {
+  background: #f1f5f9;
+  border-top-color: rgba(0, 0, 0, 0.08);
+}
+
+.footerContent {
+  max-width: 1400px;
+  margin: 0 auto;
+  display: grid;
+  grid-template-columns: 2fr 1fr 1fr 1fr;
+  gap: 4rem;
+  margin-bottom: 3rem;
+}
+
+.footerSection h4 {
+  font-size: 1.2rem;
+  margin-bottom: 1.5rem;
+  font-weight: 700;
+  color: var(--hp-text);
+}
+
+.footerSection p {
+  color: var(--hp-text-muted);
+  line-height: 1.8;
+}
+
+.footerList {
   list-style: none;
   padding: 0;
   margin: 0;
 }

-.highlightBox li {
-  padding: 0.5rem 0;
-  font-size: 0.95rem;
-  color: var(--bharatml-text);
+.footerList li {
+  margin-bottom: 0.75rem;
+}
+
+.footerList a {
+  color: var(--hp-text-muted);
+  text-decoration: none;
+  transition: all 0.3s;
+}
+
+.footerList a:hover {
+  color: var(--hp-primary);
+  text-decoration: none;
+}
+
+.footerBottom {
+  max-width: 1400px;
+  margin: 0 auto;
+  padding-top: 2rem;
+  border-top: 1px solid rgba(255, 255, 255, 0.05);
+  display: flex;
+  justify-content: space-between;
+  align-items: center;
+  color: var(--hp-text-muted);
+  flex-wrap: wrap;
+  gap: 1rem;
+}
+
+[data-theme='light'] .footerBottom {
+  border-top-color: rgba(0, 0, 0, 0.08);
+}
+
+.footerLinks {
+  display: flex;
+  gap: 2rem;
+}
+
+.footerLinks a {
+  color: var(--hp-text-muted);
+  text-decoration: none;
+  transition: color 0.3s;
+}
+
+.footerLinks a:hover {
+  color: var(--hp-primary);
+  text-decoration: none;
 }

-.highlightBox li:not(:last-child) {
-  border-bottom: 1px solid rgba(69, 8, 57, 0.05);
+/* ========================================
+   Responsive
+   ======================================== */
+
+@media (max-width: 1024px) {
+  .hero {
+    grid-template-columns: 1fr;
+    text-align: center;
+    padding-top: 8rem;
+  }
+
+  .heroImage {
+    order: -1;
+    margin: 2rem 0 0 0;
+  }
+
+  .heroContent {
+    order: 1;
+  }
+
+  .heroButtons {
+    justify-content: center;
+  }
+
+  .componentsGrid {
+    grid-template-columns: 1fr 1fr;
+  }
+
+  .barriersGrid {
+    grid-template-columns: 1fr;
+  }
+
+  .videosGrid {
+    grid-template-columns: 1fr 1fr;
+  }
+
+  .statsGrid {
+    grid-template-columns: 1fr 1fr;
+  }
+
+  .footerContent {
+    grid-template-columns: 1fr 1fr;
+  }
 }

-/* Dark mode adjustments */
-[data-theme='dark'] .highlightBox {
-  background: linear-gradient(135deg, #1a1a2e 0%, #16213e 100%);
-  border-color: rgba(139, 69, 130, 0.2);
+@media (max-width: 768px) {
+  .heroTitle {
+    font-size: 3rem;
+  }
+
+  .sectionTitle {
+    font-size: 2.5rem;
+  }
+
+  .navLinks a:not(.btn):not(.btnPrimary) {
+    display: none;
+  }
+
+  .componentsGrid,
+  .blogGrid,
+  .videosGrid {
+    grid-template-columns: 1fr;
+  }
+
+  .statsGrid {
+    grid-template-columns: 1fr;
+  }
+
+  .footerContent {
+    grid-template-columns: 1fr;
+  }
+
+  .ctaTitle {
+    font-size: 2.5rem;
+  }
+
+  .ctaSection {
+    padding: 4rem 2rem;
+    border-radius: 20px;
+  }
+
+  .section {
+    padding: 4rem 1.5rem;
+  }
+
+  .hero {
+    padding: 7rem 1.5rem 3rem;
+  }
 }

-[data-theme='dark'] .highlightBox li {
-  color: var(--bharatml-text);
+@media (max-width: 480px) {
+  .heroTitle {
+    font-size: 2.2rem;
+  }
+
+  .sectionTitle {
+    font-size: 2rem;
+  }
+
+  .heroButtons {
+    flex-direction: column;
+    align-items: center;
+  }
 }
diff --git a/docs-src/src/theme/Root.js b/docs-src/src/theme/Root.js
new file mode 100644
index 00000000..3ceb9c99
--- /dev/null
+++ b/docs-src/src/theme/Root.js
@@ -0,0 +1,14 @@
+import React from 'react';
+
+export default function Root({ children }) {
+  return (
+    <>
+
+
+
+
+
+      {children}
+
+  );
+}
diff --git a/docs-src/static/img/bharatml-stack-logo.jpg b/docs-src/static/img/bharatml-stack-logo.jpg
new file mode 100644
index 00000000..46ecdfa5
Binary files /dev/null and b/docs-src/static/img/bharatml-stack-logo.jpg differ
diff --git a/docs-src/static/img/skye-rt-consumer-flow.png b/docs-src/static/img/skye-rt-consumer-flow.png
new file mode 100644
index 00000000..11e40769
Binary files /dev/null and b/docs-src/static/img/skye-rt-consumer-flow.png differ
diff --git a/docs-src/static/img/skye-system-overview.png b/docs-src/static/img/skye-system-overview.png
new file mode 100644
index 00000000..2f992dbf
Binary files /dev/null and b/docs-src/static/img/skye-system-overview.png differ
diff --git a/docs-src/static/img/v1.0.0-predator-hld.png b/docs-src/static/img/v1.0.0-predator-hld.png
new file mode 100644
index 00000000..3e8a21ad
Binary files /dev/null and b/docs-src/static/img/v1.0.0-predator-hld.png differ
diff --git a/docs/.DS_Store b/docs/.DS_Store
new file mode 100644
index 00000000..5008ddfc
Binary files /dev/null and b/docs/.DS_Store differ
diff --git a/docs/.nojekyll b/docs/.nojekyll
new file mode 100644
index 00000000..e69de29b
diff --git a/docs/404.html b/docs/404.html
index a131da18..eb02ff6f 100644
--- a/docs/404.html
+++ b/docs/404.html
@@ -4,14 +4,14 @@
 BharatMLStack
-
-
-
+
+
+
-

Page Not Found

We could not find what you were looking for.

Please contact the owner of the site that linked you to the original URL and let them know their link is broken.

+

Page Not Found

We could not find what you were looking for.

Please contact the owner of the site that linked you to the original URL and let them know their link is broken.

\ No newline at end of file
diff --git a/docs/assets/css/styles.14b2d0af.css b/docs/assets/css/styles.aaf16941.css
similarity index 55%
rename from docs/assets/css/styles.14b2d0af.css
rename to docs/assets/css/styles.aaf16941.css
index 8bc1333c..b852d07c 100644
--- a/docs/assets/css/styles.14b2d0af.css
+++ b/docs/assets/css/styles.aaf16941.css
@@ -1 +1 @@
-@layer docusaurus.infima,docusaurus.theme-common,docusaurus.theme-classic,docusaurus.core,docusaurus.plugin-debug,docusaurus.theme-mermaid,docusaurus.theme-live-codeblock,docusaurus.theme-search-algolia.docsearch,docusaurus.theme-search-algolia;@layer docusaurus.infima{.col,.container{padding:0 var(--ifm-spacing-horizontal);width:100%}.markdown>h2,.markdown>h3,.markdown>h4,.markdown>h5,.markdown>h6{margin-bottom:calc(var(--ifm-heading-vertical-rhythm-bottom)*var(--ifm-leading))}.markdown li,body{word-wrap:break-word}body,ol ol,ol ul,ul ol,ul ul{margin:0}pre,table{overflow:auto}blockquote,pre{margin:0 0 var(--ifm-spacing-vertical)}.breadcrumbs__link,.button{transition-timing-function:var(--ifm-transition-timing-default)}.button,code{vertical-align:middle}.button--outline.button--active,.button--outline:active,.button--outline:hover,:root{--ifm-button-color:var(--ifm-font-color-base-inverse)}.menu__link:hover,a{transition:color var(--ifm-transition-fast)
var(--ifm-transition-timing-default)}.navbar--dark,:root{--ifm-navbar-link-hover-color:var(--ifm-color-primary)}.menu,.navbar-sidebar{overflow-x:hidden}:root,html[data-theme=dark]{--ifm-color-emphasis-500:var(--ifm-color-gray-500)}:root{--ifm-color-scheme:light;--ifm-dark-value:10%;--ifm-darker-value:15%;--ifm-darkest-value:30%;--ifm-light-value:15%;--ifm-lighter-value:30%;--ifm-lightest-value:50%;--ifm-contrast-background-value:90%;--ifm-contrast-foreground-value:70%;--ifm-contrast-background-dark-value:70%;--ifm-contrast-foreground-dark-value:90%;--ifm-color-primary:#3578e5;--ifm-color-secondary:#ebedf0;--ifm-color-success:#00a400;--ifm-color-info:#54c7ec;--ifm-color-warning:#ffba00;--ifm-color-danger:#fa383e;--ifm-color-primary-dark:#306cce;--ifm-color-primary-darker:#2d66c3;--ifm-color-primary-darkest:#2554a0;--ifm-color-primary-light:#538ce9;--ifm-color-primary-lighter:#72a1ed;--ifm-color-primary-lightest:#9abcf2;--ifm-color-primary-contrast-background:#ebf2fc;--ifm-color-primary-contrast-foreground:#102445;--ifm-color-secondary-dark:#d4d5d8;--ifm-color-secondary-darker:#c8c9cc;--ifm-color-secondary-darkest:#a4a6a8;--ifm-color-secondary-light:#eef0f2;--ifm-color-secondary-lighter:#f1f2f5;--ifm-color-secondary-lightest:#f5f6f8;--ifm-color-secondary-contrast-background:#fdfdfe;--ifm-color-secondary-contrast-foreground:#474748;--ifm-color-success-dark:#009400;--ifm-color-success-darker:#008b00;--ifm-color-success-darkest:#007300;--ifm-color-success-light:#26b226;--ifm-color-success-lighter:#4dbf4d;--ifm-color-success-lightest:#80d280;--ifm-color-success-contrast-background:#e6f6e6;--ifm-color-success-contrast-foreground:#003100;--ifm-color-info-dark:#4cb3d4;--ifm-color-info-darker:#47a9c9;--ifm-color-info-darkest:#3b8ba5;--ifm-color-info-light:#6ecfef;--ifm-color-info-lighter:#87d8f2;--ifm-color-info-lightest:#aae3f6;--ifm-color-info-contrast-background:#eef9fd;--ifm-color-info-contrast-foreground:#193c47;--ifm-color-warning-dark:#e6a700;--ifm-color-warning-darker
:#d99e00;--ifm-color-warning-darkest:#b38200;--ifm-color-warning-light:#ffc426;--ifm-color-warning-lighter:#ffcf4d;--ifm-color-warning-lightest:#ffdd80;--ifm-color-warning-contrast-background:#fff8e6;--ifm-color-warning-contrast-foreground:#4d3800;--ifm-color-danger-dark:#e13238;--ifm-color-danger-darker:#d53035;--ifm-color-danger-darkest:#af272b;--ifm-color-danger-light:#fb565b;--ifm-color-danger-lighter:#fb7478;--ifm-color-danger-lightest:#fd9c9f;--ifm-color-danger-contrast-background:#ffebec;--ifm-color-danger-contrast-foreground:#4b1113;--ifm-color-white:#fff;--ifm-color-black:#000;--ifm-color-gray-0:var(--ifm-color-white);--ifm-color-gray-100:#f5f6f7;--ifm-color-gray-200:#ebedf0;--ifm-color-gray-300:#dadde1;--ifm-color-gray-400:#ccd0d5;--ifm-color-gray-500:#bec3c9;--ifm-color-gray-600:#8d949e;--ifm-color-gray-700:#606770;--ifm-color-gray-800:#444950;--ifm-color-gray-900:#1c1e21;--ifm-color-gray-1000:var(--ifm-color-black);--ifm-color-emphasis-0:var(--ifm-color-gray-0);--ifm-color-emphasis-100:var(--ifm-color-gray-100);--ifm-color-emphasis-200:var(--ifm-color-gray-200);--ifm-color-emphasis-300:var(--ifm-color-gray-300);--ifm-color-emphasis-400:var(--ifm-color-gray-400);--ifm-color-emphasis-600:var(--ifm-color-gray-600);--ifm-color-emphasis-700:var(--ifm-color-gray-700);--ifm-color-emphasis-800:var(--ifm-color-gray-800);--ifm-color-emphasis-900:var(--ifm-color-gray-900);--ifm-color-emphasis-1000:var(--ifm-color-gray-1000);--ifm-color-content:var(--ifm-color-emphasis-900);--ifm-color-content-inverse:var(--ifm-color-emphasis-0);--ifm-color-content-secondary:#525860;--ifm-background-color:#0000;--ifm-background-surface-color:var(--ifm-color-content-inverse);--ifm-global-border-width:1px;--ifm-global-radius:0.4rem;--ifm-hover-overlay:#0000000d;--ifm-font-color-base:var(--ifm-color-content);--ifm-font-color-base-inverse:var(--ifm-color-content-inverse);--ifm-font-color-secondary:var(--ifm-color-content-secondary);--ifm-font-family-base:system-ui,-apple-system,Segoe 
UI,Roboto,Ubuntu,Cantarell,Noto Sans,sans-serif,BlinkMacSystemFont,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol";--ifm-font-family-monospace:SFMono-Regular,Menlo,Monaco,Consolas,"Liberation Mono","Courier New",monospace;--ifm-font-size-base:100%;--ifm-font-weight-light:300;--ifm-font-weight-normal:400;--ifm-font-weight-semibold:500;--ifm-font-weight-bold:700;--ifm-font-weight-base:var(--ifm-font-weight-normal);--ifm-line-height-base:1.65;--ifm-global-spacing:1rem;--ifm-spacing-vertical:var(--ifm-global-spacing);--ifm-spacing-horizontal:var(--ifm-global-spacing);--ifm-transition-fast:200ms;--ifm-transition-slow:400ms;--ifm-transition-timing-default:cubic-bezier(0.08,0.52,0.52,1);--ifm-global-shadow-lw:0 1px 2px 0 #0000001a;--ifm-global-shadow-md:0 5px 40px #0003;--ifm-global-shadow-tl:0 12px 28px 0 #0003,0 2px 4px 0 #0000001a;--ifm-z-index-dropdown:100;--ifm-z-index-fixed:200;--ifm-z-index-overlay:400;--ifm-container-width:1140px;--ifm-container-width-xl:1320px;--ifm-code-background:#f6f7f8;--ifm-code-border-radius:var(--ifm-global-radius);--ifm-code-font-size:90%;--ifm-code-padding-horizontal:0.1rem;--ifm-code-padding-vertical:0.1rem;--ifm-pre-background:var(--ifm-code-background);--ifm-pre-border-radius:var(--ifm-code-border-radius);--ifm-pre-color:inherit;--ifm-pre-line-height:1.45;--ifm-pre-padding:1rem;--ifm-heading-color:inherit;--ifm-heading-margin-top:0;--ifm-heading-margin-bottom:var(--ifm-spacing-vertical);--ifm-heading-font-family:var(--ifm-font-family-base);--ifm-heading-font-weight:var(--ifm-font-weight-bold);--ifm-heading-line-height:1.25;--ifm-h1-font-size:2rem;--ifm-h2-font-size:1.5rem;--ifm-h3-font-size:1.25rem;--ifm-h4-font-size:1rem;--ifm-h5-font-size:0.875rem;--ifm-h6-font-size:0.85rem;--ifm-image-alignment-padding:1.25rem;--ifm-leading-desktop:1.25;--ifm-leading:calc(var(--ifm-leading-desktop)*1rem);--ifm-list-left-padding:2rem;--ifm-list-margin:1rem;--ifm-list-item-margin:0.25rem;--ifm-list-paragrap
h-margin:1rem;--ifm-table-cell-padding:0.75rem;--ifm-table-background:#0000;--ifm-table-stripe-background:#00000008;--ifm-table-border-width:1px;--ifm-table-border-color:var(--ifm-color-emphasis-300);--ifm-table-head-background:inherit;--ifm-table-head-color:inherit;--ifm-table-head-font-weight:var(--ifm-font-weight-bold);--ifm-table-cell-color:inherit;--ifm-link-color:var(--ifm-color-primary);--ifm-link-decoration:none;--ifm-link-hover-color:var(--ifm-link-color);--ifm-link-hover-decoration:underline;--ifm-paragraph-margin-bottom:var(--ifm-leading);--ifm-blockquote-font-size:var(--ifm-font-size-base);--ifm-blockquote-border-left-width:2px;--ifm-blockquote-padding-horizontal:var(--ifm-spacing-horizontal);--ifm-blockquote-padding-vertical:0;--ifm-blockquote-shadow:none;--ifm-blockquote-color:var(--ifm-color-emphasis-800);--ifm-blockquote-border-color:var(--ifm-color-emphasis-300);--ifm-hr-background-color:var(--ifm-color-emphasis-500);--ifm-hr-height:1px;--ifm-hr-margin-vertical:1.5rem;--ifm-scrollbar-size:7px;--ifm-scrollbar-track-background-color:#f1f1f1;--ifm-scrollbar-thumb-background-color:silver;--ifm-scrollbar-thumb-hover-background-color:#a7a7a7;--ifm-alert-background-color:inherit;--ifm-alert-border-color:inherit;--ifm-alert-border-radius:var(--ifm-global-radius);--ifm-alert-border-width:0px;--ifm-alert-border-left-width:5px;--ifm-alert-color:var(--ifm-font-color-base);--ifm-alert-padding-horizontal:var(--ifm-spacing-horizontal);--ifm-alert-padding-vertical:var(--ifm-spacing-vertical);--ifm-alert-shadow:var(--ifm-global-shadow-lw);--ifm-avatar-intro-margin:1rem;--ifm-avatar-intro-alignment:inherit;--ifm-avatar-photo-size:3rem;--ifm-badge-background-color:inherit;--ifm-badge-border-color:inherit;--ifm-badge-border-radius:var(--ifm-global-radius);--ifm-badge-border-width:var(--ifm-global-border-width);--ifm-badge-color:var(--ifm-color-white);--ifm-badge-padding-horizontal:calc(var(--ifm-spacing-horizontal)*0.5);--ifm-badge-padding-vertical:calc(var(--ifm-spaci
ng-vertical)*0.25);--ifm-breadcrumb-border-radius:1.5rem;--ifm-breadcrumb-spacing:0.5rem;--ifm-breadcrumb-color-active:var(--ifm-color-primary);--ifm-breadcrumb-item-background-active:var(--ifm-hover-overlay);--ifm-breadcrumb-padding-horizontal:0.8rem;--ifm-breadcrumb-padding-vertical:0.4rem;--ifm-breadcrumb-size-multiplier:1;--ifm-breadcrumb-separator:url('data:image/svg+xml;utf8,');--ifm-breadcrumb-separator-filter:none;--ifm-breadcrumb-separator-size:0.5rem;--ifm-breadcrumb-separator-size-multiplier:1.25;--ifm-button-background-color:inherit;--ifm-button-border-color:var(--ifm-button-background-color);--ifm-button-border-width:var(--ifm-global-border-width);--ifm-button-font-weight:var(--ifm-font-weight-bold);--ifm-button-padding-horizontal:1.5rem;--ifm-button-padding-vertical:0.375rem;--ifm-button-size-multiplier:1;--ifm-button-transition-duration:var(--ifm-transition-fast);--ifm-button-border-radius:calc(var(--ifm-global-radius)*var(--ifm-button-size-multiplier));--ifm-button-group-spacing:2px;--ifm-card-background-color:var(--ifm-background-surface-color);--ifm-card-border-radius:calc(var(--ifm-global-radius)*2);--ifm-card-horizontal-spacing:var(--ifm-global-spacing);--ifm-card-vertical-spacing:var(--ifm-global-spacing);--ifm-toc-border-color:var(--ifm-color-emphasis-300);--ifm-toc-link-color:var(--ifm-color-content-secondary);--ifm-toc-padding-vertical:0.5rem;--ifm-toc-padding-horizontal:0.5rem;--ifm-dropdown-background-color:var(--ifm-background-surface-color);--ifm-dropdown-font-weight:var(--ifm-font-weight-semibold);--ifm-dropdown-link-color:var(--ifm-font-color-base);--ifm-dropdown-hover-background-color:var(--ifm-hover-overlay);--ifm-footer-background-color:var(--ifm-color-emphasis-100);--ifm-footer-color:inherit;--ifm-footer-link-color:var(--ifm-color-emphasis-700);--ifm-footer-link-hover-color:var(--ifm-color-primary);--ifm-footer-link-horizontal-spacing:0.5rem;--ifm-footer-padding-horizontal:calc(var(--ifm-spacing-horizontal)*2);--ifm-footer-padding-v
ertical:calc(var(--ifm-spacing-vertical)*2);--ifm-footer-title-color:inherit;--ifm-footer-logo-max-width:min(30rem,90vw);--ifm-hero-background-color:var(--ifm-background-surface-color);--ifm-hero-text-color:var(--ifm-color-emphasis-800);--ifm-menu-color:var(--ifm-color-emphasis-700);--ifm-menu-color-active:var(--ifm-color-primary);--ifm-menu-color-background-active:var(--ifm-hover-overlay);--ifm-menu-color-background-hover:var(--ifm-hover-overlay);--ifm-menu-link-padding-horizontal:0.75rem;--ifm-menu-link-padding-vertical:0.375rem;--ifm-menu-link-sublist-icon:url('data:image/svg+xml;utf8,');--ifm-menu-link-sublist-icon-filter:none;--ifm-navbar-background-color:var(--ifm-background-surface-color);--ifm-navbar-height:3.75rem;--ifm-navbar-item-padding-horizontal:0.75rem;--ifm-navbar-item-padding-vertical:0.25rem;--ifm-navbar-link-color:var(--ifm-font-color-base);--ifm-navbar-link-active-color:var(--ifm-link-color);--ifm-navbar-padding-horizontal:var(--ifm-spacing-horizontal);--ifm-navbar-padding-vertical:calc(var(--ifm-spacing-vertical)*0.5);--ifm-navbar-shadow:var(--ifm-global-shadow-lw);--ifm-navbar-search-input-background-color:var(--ifm-color-emphasis-200);--ifm-navbar-search-input-color:var(--ifm-color-emphasis-800);--ifm-navbar-search-input-placeholder-color:var(--ifm-color-emphasis-500);--ifm-navbar-search-input-icon:url('data:image/svg+xml;utf8,');--ifm-navbar-sidebar-width:83vw;--ifm-pagination-border-radius:var(--ifm-global-radius);--ifm-pagination-color-active:var(--ifm-color-primary);--ifm-pagination-font-size:1rem;--ifm-pagination-item-active-background:var(--ifm-hover-overlay);--ifm-pagination-page-spacing:0.2em;--ifm-pagination-padding-horizontal:calc(var(--ifm-spacing-horizontal)*1);--ifm-pagination-padding-vertical:calc(var(--ifm-spacing-vertical)*0.25);--ifm-pagination-nav-border-radius:var(--ifm-global-radius);--ifm-pagination-nav-color-hover:var(--ifm-color-primary);--ifm-pills-color-active:var(--ifm-color-primary);--ifm-pills-color-background-activ
e:var(--ifm-hover-overlay);--ifm-pills-spacing:0.125rem;--ifm-tabs-color:var(--ifm-font-color-secondary);--ifm-tabs-color-active:var(--ifm-color-primary);--ifm-tabs-color-active-border:var(--ifm-tabs-color-active);--ifm-tabs-padding-horizontal:1rem;--ifm-tabs-padding-vertical:1rem}.badge--danger,.badge--info,.badge--primary,.badge--secondary,.badge--success,.badge--warning{--ifm-badge-border-color:var(--ifm-badge-background-color)}.button--link,.button--outline{--ifm-button-background-color:#0000}*{box-sizing:border-box}html{background-color:var(--ifm-background-color);color:var(--ifm-font-color-base);color-scheme:var(--ifm-color-scheme);font:var(--ifm-font-size-base)/var(--ifm-line-height-base) var(--ifm-font-family-base);-webkit-font-smoothing:antialiased;-webkit-tap-highlight-color:transparent;text-rendering:optimizelegibility;-webkit-text-size-adjust:100%;text-size-adjust:100%}iframe{border:0;color-scheme:auto}.container{margin:0 auto;max-width:var(--ifm-container-width)}.container--fluid{max-width:inherit}.row{display:flex;flex-wrap:wrap;margin:0 calc(var(--ifm-spacing-horizontal)*-1)}.margin-bottom--none,.margin-vert--none,.markdown>:last-child{margin-bottom:0!important}.margin-top--none,.margin-vert--none{margin-top:0!important}.row--no-gutters{margin-left:0;margin-right:0}.margin-horiz--none,.margin-right--none{margin-right:0!important}.row--no-gutters>.col{padding-left:0;padding-right:0}.row--align-top{align-items:flex-start}.row--align-bottom{align-items:flex-end}.row--align-center{align-items:center}.row--align-stretch{align-items:stretch}.row--align-baseline{align-items:baseline}.col{--ifm-col-width:100%;flex:1 0;margin-left:0;max-width:var(--ifm-col-width)}.padding-bottom--none,.padding-vert--none{padding-bottom:0!important}.padding-top--none,.padding-vert--none{padding-top:0!important}.padding-horiz--none,.padding-left--none{padding-left:0!important}.padding-horiz--none,.padding-right--none{padding-right:0!important}.col[class*=col--]{flex:0 0 
var(--ifm-col-width)}.col--1{--ifm-col-width:8.33333%}.col--offset-1{margin-left:8.33333%}.col--2{--ifm-col-width:16.66667%}.col--offset-2{margin-left:16.66667%}.col--3{--ifm-col-width:25%}.col--offset-3{margin-left:25%}.col--4{--ifm-col-width:33.33333%}.col--offset-4{margin-left:33.33333%}.col--5{--ifm-col-width:41.66667%}.col--offset-5{margin-left:41.66667%}.col--6{--ifm-col-width:50%}.col--offset-6{margin-left:50%}.col--7{--ifm-col-width:58.33333%}.col--offset-7{margin-left:58.33333%}.col--8{--ifm-col-width:66.66667%}.col--offset-8{margin-left:66.66667%}.col--9{--ifm-col-width:75%}.col--offset-9{margin-left:75%}.col--10{--ifm-col-width:83.33333%}.col--offset-10{margin-left:83.33333%}.col--11{--ifm-col-width:91.66667%}.col--offset-11{margin-left:91.66667%}.col--12{--ifm-col-width:100%}.col--offset-12{margin-left:100%}.margin-horiz--none,.margin-left--none{margin-left:0!important}.margin--none{margin:0!important}.margin-bottom--xs,.margin-vert--xs{margin-bottom:.25rem!important}.margin-top--xs,.margin-vert--xs{margin-top:.25rem!important}.margin-horiz--xs,.margin-left--xs{margin-left:.25rem!important}.margin-horiz--xs,.margin-right--xs{margin-right:.25rem!important}.margin--xs{margin:.25rem!important}.margin-bottom--sm,.margin-vert--sm{margin-bottom:.5rem!important}.margin-top--sm,.margin-vert--sm{margin-top:.5rem!important}.margin-horiz--sm,.margin-left--sm{margin-left:.5rem!important}.margin-horiz--sm,.margin-right--sm{margin-right:.5rem!important}.margin--sm{margin:.5rem!important}.margin-bottom--md,.margin-vert--md{margin-bottom:1rem!important}.margin-top--md,.margin-vert--md{margin-top:1rem!important}.margin-horiz--md,.margin-left--md{margin-left:1rem!important}.margin-horiz--md,.margin-right--md{margin-right:1rem!important}.margin--md{margin:1rem!important}.margin-bottom--lg,.margin-vert--lg{margin-bottom:2rem!important}.margin-top--lg,.margin-vert--lg{margin-top:2rem!important}.margin-horiz--lg,.margin-left--lg{margin-left:2rem!important}.margin-horiz--lg,.m
argin-right--lg{margin-right:2rem!important}.margin--lg{margin:2rem!important}.margin-bottom--xl,.margin-vert--xl{margin-bottom:5rem!important}.margin-top--xl,.margin-vert--xl{margin-top:5rem!important}.margin-horiz--xl,.margin-left--xl{margin-left:5rem!important}.margin-horiz--xl,.margin-right--xl{margin-right:5rem!important}.margin--xl{margin:5rem!important}.padding--none{padding:0!important}.padding-bottom--xs,.padding-vert--xs{padding-bottom:.25rem!important}.padding-top--xs,.padding-vert--xs{padding-top:.25rem!important}.padding-horiz--xs,.padding-left--xs{padding-left:.25rem!important}.padding-horiz--xs,.padding-right--xs{padding-right:.25rem!important}.padding--xs{padding:.25rem!important}.padding-bottom--sm,.padding-vert--sm{padding-bottom:.5rem!important}.padding-top--sm,.padding-vert--sm{padding-top:.5rem!important}.padding-horiz--sm,.padding-left--sm{padding-left:.5rem!important}.padding-horiz--sm,.padding-right--sm{padding-right:.5rem!important}.padding--sm{padding:.5rem!important}.padding-bottom--md,.padding-vert--md{padding-bottom:1rem!important}.padding-top--md,.padding-vert--md{padding-top:1rem!important}.padding-horiz--md,.padding-left--md{padding-left:1rem!important}.padding-horiz--md,.padding-right--md{padding-right:1rem!important}.padding--md{padding:1rem!important}.padding-bottom--lg,.padding-vert--lg{padding-bottom:2rem!important}.padding-top--lg,.padding-vert--lg{padding-top:2rem!important}.padding-horiz--lg,.padding-left--lg{padding-left:2rem!important}.padding-horiz--lg,.padding-right--lg{padding-right:2rem!important}.padding--lg{padding:2rem!important}.padding-bottom--xl,.padding-vert--xl{padding-bottom:5rem!important}.padding-top--xl,.padding-vert--xl{padding-top:5rem!important}.padding-horiz--xl,.padding-left--xl{padding-left:5rem!important}.padding-horiz--xl,.padding-right--xl{padding-right:5rem!important}.padding--xl{padding:5rem!important}code{background-color:var(--ifm-code-background);border:.1rem solid 
#0000001a;border-radius:var(--ifm-code-border-radius);font-family:var(--ifm-font-family-monospace);font-size:var(--ifm-code-font-size);padding:var(--ifm-code-padding-vertical) var(--ifm-code-padding-horizontal)}a code{color:inherit}pre{background-color:var(--ifm-pre-background);border-radius:var(--ifm-pre-border-radius);color:var(--ifm-pre-color);font:var(--ifm-code-font-size)/var(--ifm-pre-line-height) var(--ifm-font-family-monospace);padding:var(--ifm-pre-padding)}pre code{background-color:initial;border:none;font-size:100%;line-height:inherit;padding:0}kbd{background-color:var(--ifm-color-emphasis-0);border:1px solid var(--ifm-color-emphasis-400);border-radius:.2rem;box-shadow:inset 0 -1px 0 var(--ifm-color-emphasis-400);color:var(--ifm-color-emphasis-800);font:80% var(--ifm-font-family-monospace);padding:.15rem .3rem}h1,h2,h3,h4,h5,h6{color:var(--ifm-heading-color);font-family:var(--ifm-heading-font-family);font-weight:var(--ifm-heading-font-weight);line-height:var(--ifm-heading-line-height);margin:var(--ifm-heading-margin-top) 0 var(--ifm-heading-margin-bottom) 0}h1{font-size:var(--ifm-h1-font-size)}h2{font-size:var(--ifm-h2-font-size)}h3{font-size:var(--ifm-h3-font-size)}h4{font-size:var(--ifm-h4-font-size)}h5{font-size:var(--ifm-h5-font-size)}h6{font-size:var(--ifm-h6-font-size)}img{max-width:100%}img[align=right]{padding-left:var(--image-alignment-padding)}img[align=left]{padding-right:var(--image-alignment-padding)}.markdown{--ifm-h1-vertical-rhythm-top:3;--ifm-h2-vertical-rhythm-top:2;--ifm-h3-vertical-rhythm-top:1.5;--ifm-heading-vertical-rhythm-top:1.25;--ifm-h1-vertical-rhythm-bottom:1.25;--ifm-heading-vertical-rhythm-bottom:1}.markdown:after,.markdown:before{content:"";display:table}.markdown:after{clear:both}.markdown 
h1:first-child{--ifm-h1-font-size:3rem;margin-bottom:calc(var(--ifm-h1-vertical-rhythm-bottom)*var(--ifm-leading))}.markdown>h2{--ifm-h2-font-size:2rem;margin-top:calc(var(--ifm-h2-vertical-rhythm-top)*var(--ifm-leading))}.markdown>h3{--ifm-h3-font-size:1.5rem;margin-top:calc(var(--ifm-h3-vertical-rhythm-top)*var(--ifm-leading))}.markdown>h4,.markdown>h5,.markdown>h6{margin-top:calc(var(--ifm-heading-vertical-rhythm-top)*var(--ifm-leading))}.markdown>p,.markdown>pre,.markdown>ul{margin-bottom:var(--ifm-leading)}.markdown li>p{margin-top:var(--ifm-list-paragraph-margin)}.markdown li+li{margin-top:var(--ifm-list-item-margin)}ol,ul{margin:0 0 var(--ifm-list-margin);padding-left:var(--ifm-list-left-padding)}ol ol,ul ol{list-style-type:lower-roman}ol ol ol,ol ul ol,ul ol ol,ul ul ol{list-style-type:lower-alpha}table{border-collapse:collapse;display:block;margin-bottom:var(--ifm-spacing-vertical)}table thead tr{border-bottom:2px solid var(--ifm-table-border-color)}table thead,table tr:nth-child(2n){background-color:var(--ifm-table-stripe-background)}table tr{background-color:var(--ifm-table-background);border-top:var(--ifm-table-border-width) solid var(--ifm-table-border-color)}table td,table th{border:var(--ifm-table-border-width) solid var(--ifm-table-border-color);padding:var(--ifm-table-cell-padding)}table th{background-color:var(--ifm-table-head-background);color:var(--ifm-table-head-color);font-weight:var(--ifm-table-head-font-weight)}table td{color:var(--ifm-table-cell-color)}strong{font-weight:var(--ifm-font-weight-bold)}a{color:var(--ifm-link-color);text-decoration:var(--ifm-link-decoration)}a:hover{color:var(--ifm-link-hover-color);text-decoration:var(--ifm-link-hover-decoration)}.button:hover,.text--no-decoration,.text--no-decoration:hover,a:not([href]){-webkit-text-decoration:none;text-decoration:none}p{margin:0 0 var(--ifm-paragraph-margin-bottom)}blockquote{border-left:var(--ifm-blockquote-border-left-width) solid 
var(--ifm-blockquote-border-color);box-shadow:var(--ifm-blockquote-shadow);color:var(--ifm-blockquote-color);font-size:var(--ifm-blockquote-font-size);padding:var(--ifm-blockquote-padding-vertical) var(--ifm-blockquote-padding-horizontal)}blockquote>:first-child{margin-top:0}blockquote>:last-child{margin-bottom:0}hr{background-color:var(--ifm-hr-background-color);border:0;height:var(--ifm-hr-height);margin:var(--ifm-hr-margin-vertical) 0}.shadow--lw{box-shadow:var(--ifm-global-shadow-lw)!important}.shadow--md{box-shadow:var(--ifm-global-shadow-md)!important}.shadow--tl{box-shadow:var(--ifm-global-shadow-tl)!important}.text--primary{color:var(--ifm-color-primary)}.text--secondary{color:var(--ifm-color-secondary)}.text--success{color:var(--ifm-color-success)}.text--info{color:var(--ifm-color-info)}.text--warning{color:var(--ifm-color-warning)}.text--danger{color:var(--ifm-color-danger)}.text--center{text-align:center}.text--left{text-align:left}.text--justify{text-align:justify}.text--right{text-align:right}.text--capitalize{text-transform:capitalize}.text--lowercase{text-transform:lowercase}.alert__heading,.text--uppercase{text-transform:uppercase}.text--light{font-weight:var(--ifm-font-weight-light)}.text--normal{font-weight:var(--ifm-font-weight-normal)}.text--semibold{font-weight:var(--ifm-font-weight-semibold)}.text--bold{font-weight:var(--ifm-font-weight-bold)}.text--italic{font-style:italic}.text--truncate{overflow:hidden;text-overflow:ellipsis;white-space:nowrap}.text--break{word-wrap:break-word!important;word-break:break-word!important}.clean-btn{background:none;border:none;color:inherit;cursor:pointer;font-family:inherit;padding:0}.alert,.alert 
.close{color:var(--ifm-alert-foreground-color)}.clean-list{list-style:none;padding-left:0}.alert--primary{--ifm-alert-background-color:var(--ifm-color-primary-contrast-background);--ifm-alert-background-color-highlight:#3578e526;--ifm-alert-foreground-color:var(--ifm-color-primary-contrast-foreground);--ifm-alert-border-color:var(--ifm-color-primary-dark)}.alert--secondary{--ifm-alert-background-color:var(--ifm-color-secondary-contrast-background);--ifm-alert-background-color-highlight:#ebedf026;--ifm-alert-foreground-color:var(--ifm-color-secondary-contrast-foreground);--ifm-alert-border-color:var(--ifm-color-secondary-dark)}.alert--success{--ifm-alert-background-color:var(--ifm-color-success-contrast-background);--ifm-alert-background-color-highlight:#00a40026;--ifm-alert-foreground-color:var(--ifm-color-success-contrast-foreground);--ifm-alert-border-color:var(--ifm-color-success-dark)}.alert--info{--ifm-alert-background-color:var(--ifm-color-info-contrast-background);--ifm-alert-background-color-highlight:#54c7ec26;--ifm-alert-foreground-color:var(--ifm-color-info-contrast-foreground);--ifm-alert-border-color:var(--ifm-color-info-dark)}.alert--warning{--ifm-alert-background-color:var(--ifm-color-warning-contrast-background);--ifm-alert-background-color-highlight:#ffba0026;--ifm-alert-foreground-color:var(--ifm-color-warning-contrast-foreground);--ifm-alert-border-color:var(--ifm-color-warning-dark)}.alert--danger{--ifm-alert-background-color:var(--ifm-color-danger-contrast-background);--ifm-alert-background-color-highlight:#fa383e26;--ifm-alert-foreground-color:var(--ifm-color-danger-contrast-foreground);--ifm-alert-border-color:var(--ifm-color-danger-dark)}.alert{--ifm-code-background:var(--ifm-alert-background-color-highlight);--ifm-link-color:var(--ifm-alert-foreground-color);--ifm-link-hover-color:var(--ifm-alert-foreground-color);--ifm-link-decoration:underline;--ifm-tabs-color:var(--ifm-alert-foreground-color);--ifm-tabs-color-active:var(--ifm-alert-foregr
ound-color);--ifm-tabs-color-active-border:var(--ifm-alert-border-color);background-color:var(--ifm-alert-background-color);border:var(--ifm-alert-border-width) solid var(--ifm-alert-border-color);border-left-width:var(--ifm-alert-border-left-width);border-radius:var(--ifm-alert-border-radius);box-shadow:var(--ifm-alert-shadow);padding:var(--ifm-alert-padding-vertical) var(--ifm-alert-padding-horizontal)}.alert__heading{align-items:center;display:flex;font:700 var(--ifm-h5-font-size)/var(--ifm-heading-line-height) var(--ifm-heading-font-family);margin-bottom:.5rem}.alert__icon{display:inline-flex;margin-right:.4em}.alert__icon svg{fill:var(--ifm-alert-foreground-color);stroke:var(--ifm-alert-foreground-color);stroke-width:0}.alert .close{margin:calc(var(--ifm-alert-padding-vertical)*-1) calc(var(--ifm-alert-padding-horizontal)*-1) 0 0;opacity:.75}.alert .close:focus,.alert .close:hover{opacity:1}.alert a{text-decoration-color:var(--ifm-alert-border-color)}.alert a:hover{text-decoration-thickness:2px}.avatar{column-gap:var(--ifm-avatar-intro-margin);display:flex}.avatar__photo{border-radius:50%;display:block;height:var(--ifm-avatar-photo-size);overflow:hidden;width:var(--ifm-avatar-photo-size)}.card--full-height,.navbar__logo img{height:100%}.avatar__photo--sm{--ifm-avatar-photo-size:2rem}.avatar__photo--lg{--ifm-avatar-photo-size:4rem}.avatar__photo--xl{--ifm-avatar-photo-size:6rem}.avatar__intro{display:flex;flex:1 1;flex-direction:column;justify-content:center;text-align:var(--ifm-avatar-intro-alignment)}.badge,.breadcrumbs__item,.breadcrumbs__link,.button,.dropdown>.navbar__link:after{display:inline-block}.avatar__name{font:700 var(--ifm-h4-font-size)/var(--ifm-heading-line-height) 
var(--ifm-font-family-base)}.avatar__subtitle{margin-top:.25rem}.avatar--vertical{--ifm-avatar-intro-alignment:center;--ifm-avatar-intro-margin:0.5rem;align-items:center;flex-direction:column}.badge{background-color:var(--ifm-badge-background-color);border:var(--ifm-badge-border-width) solid var(--ifm-badge-border-color);border-radius:var(--ifm-badge-border-radius);color:var(--ifm-badge-color);font-size:75%;font-weight:var(--ifm-font-weight-bold);line-height:1;padding:var(--ifm-badge-padding-vertical) var(--ifm-badge-padding-horizontal)}.badge--primary{--ifm-badge-background-color:var(--ifm-color-primary)}.badge--secondary{--ifm-badge-background-color:var(--ifm-color-secondary);color:var(--ifm-color-black)}.breadcrumbs__link,.button.button--secondary.button--outline:not(.button--active):not(:hover){color:var(--ifm-font-color-base)}.badge--success{--ifm-badge-background-color:var(--ifm-color-success)}.badge--info{--ifm-badge-background-color:var(--ifm-color-info)}.badge--warning{--ifm-badge-background-color:var(--ifm-color-warning)}.badge--danger{--ifm-badge-background-color:var(--ifm-color-danger)}.breadcrumbs{margin-bottom:0;padding-left:0}.breadcrumbs__item:not(:last-child):after{background:var(--ifm-breadcrumb-separator) center;content:" ";display:inline-block;filter:var(--ifm-breadcrumb-separator-filter);height:calc(var(--ifm-breadcrumb-separator-size)*var(--ifm-breadcrumb-size-multiplier)*var(--ifm-breadcrumb-separator-size-multiplier));margin:0 var(--ifm-breadcrumb-spacing);opacity:.5;width:calc(var(--ifm-breadcrumb-separator-size)*var(--ifm-breadcrumb-size-multiplier)*var(--ifm-breadcrumb-separator-size-multiplier))}.breadcrumbs__item--active 
.breadcrumbs__link{background:var(--ifm-breadcrumb-item-background-active);color:var(--ifm-breadcrumb-color-active)}.breadcrumbs__link{border-radius:var(--ifm-breadcrumb-border-radius);font-size:calc(1rem*var(--ifm-breadcrumb-size-multiplier));padding:calc(var(--ifm-breadcrumb-padding-vertical)*var(--ifm-breadcrumb-size-multiplier)) calc(var(--ifm-breadcrumb-padding-horizontal)*var(--ifm-breadcrumb-size-multiplier));transition-duration:var(--ifm-transition-fast);transition-property:background,color}.breadcrumbs__link:any-link:hover,.breadcrumbs__link:link:hover,.breadcrumbs__link:visited:hover,area[href].breadcrumbs__link:hover{background:var(--ifm-breadcrumb-item-background-active);-webkit-text-decoration:none;text-decoration:none}.breadcrumbs--sm{--ifm-breadcrumb-size-multiplier:0.8}.breadcrumbs--lg{--ifm-breadcrumb-size-multiplier:1.2}.button{background-color:var(--ifm-button-background-color);border:var(--ifm-button-border-width) solid var(--ifm-button-border-color);border-radius:var(--ifm-button-border-radius);cursor:pointer;font-size:calc(.875rem*var(--ifm-button-size-multiplier));font-weight:var(--ifm-button-font-weight);line-height:1.5;padding:calc(var(--ifm-button-padding-vertical)*var(--ifm-button-size-multiplier)) 
calc(var(--ifm-button-padding-horizontal)*var(--ifm-button-size-multiplier));text-align:center;transition-duration:var(--ifm-button-transition-duration);transition-property:color,background,border-color;-webkit-user-select:none;user-select:none;white-space:nowrap}.button,.button:hover{color:var(--ifm-button-color)}.button--outline{--ifm-button-color:var(--ifm-button-border-color)}.button--outline:hover{--ifm-button-background-color:var(--ifm-button-border-color)}.button--link{--ifm-button-border-color:#0000;color:var(--ifm-link-color);text-decoration:var(--ifm-link-decoration)}.button--link.button--active,.button--link:active,.button--link:hover{color:var(--ifm-link-hover-color);text-decoration:var(--ifm-link-hover-decoration)}.dropdown__link--active,.dropdown__link:hover,.menu__link:hover,.navbar__brand:hover,.navbar__link--active,.navbar__link:hover,.pagination-nav__link:hover,.pagination__link:hover{-webkit-text-decoration:none;text-decoration:none}.button.disabled,.button:disabled,.button[disabled]{opacity:.65;pointer-events:none}.button--sm{--ifm-button-size-multiplier:0.8}.button--lg{--ifm-button-size-multiplier:1.35}.button--block{display:block;width:100%}.button.button--secondary{color:var(--ifm-color-gray-900)}:where(.button--primary){--ifm-button-background-color:var(--ifm-color-primary);--ifm-button-border-color:var(--ifm-color-primary)}:where(.button--primary):not(.button--outline):hover{--ifm-button-background-color:var(--ifm-color-primary-dark);--ifm-button-border-color:var(--ifm-color-primary-dark)}.button--primary.button--active,.button--primary:active{--ifm-button-background-color:var(--ifm-color-primary-darker);--ifm-button-border-color:var(--ifm-color-primary-darker)}:where(.button--secondary){--ifm-button-background-color:var(--ifm-color-secondary);--ifm-button-border-color:var(--ifm-color-secondary)}:where(.button--secondary):not(.button--outline):hover{--ifm-button-background-color:var(--ifm-color-secondary-dark);--ifm-button-border-color:var(-
-ifm-color-secondary-dark)}.button--secondary.button--active,.button--secondary:active{--ifm-button-background-color:var(--ifm-color-secondary-darker);--ifm-button-border-color:var(--ifm-color-secondary-darker)}:where(.button--success){--ifm-button-background-color:var(--ifm-color-success);--ifm-button-border-color:var(--ifm-color-success)}:where(.button--success):not(.button--outline):hover{--ifm-button-background-color:var(--ifm-color-success-dark);--ifm-button-border-color:var(--ifm-color-success-dark)}.button--success.button--active,.button--success:active{--ifm-button-background-color:var(--ifm-color-success-darker);--ifm-button-border-color:var(--ifm-color-success-darker)}:where(.button--info){--ifm-button-background-color:var(--ifm-color-info);--ifm-button-border-color:var(--ifm-color-info)}:where(.button--info):not(.button--outline):hover{--ifm-button-background-color:var(--ifm-color-info-dark);--ifm-button-border-color:var(--ifm-color-info-dark)}.button--info.button--active,.button--info:active{--ifm-button-background-color:var(--ifm-color-info-darker);--ifm-button-border-color:var(--ifm-color-info-darker)}:where(.button--warning){--ifm-button-background-color:var(--ifm-color-warning);--ifm-button-border-color:var(--ifm-color-warning)}:where(.button--warning):not(.button--outline):hover{--ifm-button-background-color:var(--ifm-color-warning-dark);--ifm-button-border-color:var(--ifm-color-warning-dark)}.button--warning.button--active,.button--warning:active{--ifm-button-background-color:var(--ifm-color-warning-darker);--ifm-button-border-color:var(--ifm-color-warning-darker)}:where(.button--danger){--ifm-button-background-color:var(--ifm-color-danger);--ifm-button-border-color:var(--ifm-color-danger)}:where(.button--danger):not(.button--outline):hover{--ifm-button-background-color:var(--ifm-color-danger-dark);--ifm-button-border-color:var(--ifm-color-danger-dark)}.button--danger.button--active,.button--danger:active{--ifm-button-background-color:var(--ifm-col
or-danger-darker);--ifm-button-border-color:var(--ifm-color-danger-darker)}.button-group{display:inline-flex;gap:var(--ifm-button-group-spacing)}.button-group>.button:not(:first-child){border-bottom-left-radius:0;border-top-left-radius:0}.button-group>.button:not(:last-child){border-bottom-right-radius:0;border-top-right-radius:0}.button-group--block{display:flex;justify-content:stretch}.button-group--block>.button{flex-grow:1}.card{background-color:var(--ifm-card-background-color);border-radius:var(--ifm-card-border-radius);box-shadow:var(--ifm-global-shadow-lw);display:flex;flex-direction:column;overflow:hidden}.card__image{padding-top:var(--ifm-card-vertical-spacing)}.card__image:first-child{padding-top:0}.card__body,.card__footer,.card__header{padding:var(--ifm-card-vertical-spacing) var(--ifm-card-horizontal-spacing)}.card__body:not(:last-child),.card__footer:not(:last-child),.card__header:not(:last-child){padding-bottom:0}.card__body>:last-child,.card__footer>:last-child,.card__header>:last-child{margin-bottom:0}.card__footer{margin-top:auto}.table-of-contents{font-size:.8rem;margin-bottom:0;padding:var(--ifm-toc-padding-vertical) 0}.table-of-contents,.table-of-contents ul{list-style:none;padding-left:var(--ifm-toc-padding-horizontal)}.table-of-contents li{margin:var(--ifm-toc-padding-vertical) var(--ifm-toc-padding-horizontal)}.table-of-contents__left-border{border-left:1px solid var(--ifm-toc-border-color)}.table-of-contents__link{color:var(--ifm-toc-link-color);display:block}.table-of-contents__link--active,.table-of-contents__link--active code,.table-of-contents__link:hover,.table-of-contents__link:hover code{color:var(--ifm-color-primary);-webkit-text-decoration:none;text-decoration:none}.close{color:var(--ifm-color-black);float:right;font-size:1.5rem;font-weight:var(--ifm-font-weight-bold);line-height:1;opacity:.5;padding:1rem;transition:opacity var(--ifm-transition-fast) 
var(--ifm-transition-timing-default)}.close:hover{opacity:.7}.close:focus{opacity:.8}.dropdown{display:inline-flex;font-weight:var(--ifm-dropdown-font-weight);position:relative;vertical-align:top}.dropdown--hoverable:hover .dropdown__menu,.dropdown--show .dropdown__menu{opacity:1;pointer-events:all;transform:translateY(-1px);visibility:visible}.dropdown__menu,.navbar__item.dropdown .navbar__link:not([href]){pointer-events:none}.dropdown--right .dropdown__menu{left:inherit;right:0}.dropdown--nocaret .navbar__link:after{content:none!important}.dropdown__menu{background-color:var(--ifm-dropdown-background-color);border-radius:var(--ifm-global-radius);box-shadow:var(--ifm-global-shadow-md);left:0;list-style:none;max-height:80vh;min-width:10rem;opacity:0;overflow-y:auto;padding:.5rem;position:absolute;top:calc(100% - var(--ifm-navbar-item-padding-vertical) + .3rem);transform:translateY(-.625rem);transition-duration:var(--ifm-transition-fast);transition-property:opacity,transform,visibility;transition-timing-function:var(--ifm-transition-timing-default);visibility:hidden;z-index:var(--ifm-z-index-dropdown)}.menu__caret,.menu__link,.menu__list-item-collapsible{border-radius:.25rem;transition:background var(--ifm-transition-fast) var(--ifm-transition-timing-default)}.dropdown__link{border-radius:.25rem;color:var(--ifm-dropdown-link-color);display:block;font-size:.875rem;margin-top:.2rem;padding:.25rem .5rem;white-space:nowrap}.dropdown__link--active,.dropdown__link:hover{background-color:var(--ifm-dropdown-hover-background-color);color:var(--ifm-dropdown-link-color)}.dropdown__link--active,.dropdown__link--active:hover{--ifm-dropdown-link-color:var(--ifm-link-color)}.dropdown>.navbar__link:after{border-color:currentcolor #0000;border-style:solid;border-width:.4em .4em 
0;content:"";margin-left:.3em;position:relative;top:2px;transform:translateY(-50%)}.footer{background-color:var(--ifm-footer-background-color);color:var(--ifm-footer-color);padding:var(--ifm-footer-padding-vertical) var(--ifm-footer-padding-horizontal)}.footer--dark{--ifm-footer-background-color:#303846;--ifm-footer-color:var(--ifm-footer-link-color);--ifm-footer-link-color:var(--ifm-color-secondary);--ifm-footer-title-color:var(--ifm-color-white)}.footer__links{margin-bottom:1rem}.footer__link-item{color:var(--ifm-footer-link-color);line-height:2}.footer__link-item:hover{color:var(--ifm-footer-link-hover-color)}.footer__link-separator{margin:0 var(--ifm-footer-link-horizontal-spacing)}.footer__logo{margin-top:1rem;max-width:var(--ifm-footer-logo-max-width)}.footer__title{color:var(--ifm-footer-title-color);font:700 var(--ifm-h4-font-size)/var(--ifm-heading-line-height) var(--ifm-font-family-base);margin-bottom:var(--ifm-heading-margin-bottom)}.menu,.navbar__link{font-weight:var(--ifm-font-weight-semibold)}.footer__item{margin-top:0}.footer__items{margin-bottom:0}[type=checkbox]{padding:0}.hero{align-items:center;background-color:var(--ifm-hero-background-color);color:var(--ifm-hero-text-color);display:flex;padding:4rem 2rem}.hero--primary{--ifm-hero-background-color:var(--ifm-color-primary);--ifm-hero-text-color:var(--ifm-font-color-base-inverse)}.hero--dark{--ifm-hero-background-color:#303846;--ifm-hero-text-color:var(--ifm-color-white)}.hero__title{font-size:3rem}.hero__subtitle{font-size:1.5rem}.menu__list{list-style:none;margin:0;padding-left:0}.menu__caret,.menu__link{padding:var(--ifm-menu-link-padding-vertical) var(--ifm-menu-link-padding-horizontal)}.menu__list .menu__list{flex:0 0 100%;margin-top:.25rem;padding-left:var(--ifm-menu-link-padding-horizontal)}.menu__list-item:not(:first-child){margin-top:.25rem}.menu__list-item--collapsed .menu__list{height:0;overflow:hidden}.menu__list-item--collapsed .menu__caret:before,.menu__list-item--collapsed 
.menu__link--sublist:after{transform:rotate(90deg)}.menu__list-item-collapsible{display:flex;flex-wrap:wrap;position:relative}.menu__caret:hover,.menu__link:hover,.menu__list-item-collapsible--active,.menu__list-item-collapsible:hover{background:var(--ifm-menu-color-background-hover)}.menu__list-item-collapsible .menu__link--active,.menu__list-item-collapsible .menu__link:hover{background:none!important}.menu__caret,.menu__link{align-items:center;display:flex}.menu__link{color:var(--ifm-menu-color);flex:1;line-height:1.25}.menu__link:hover{color:var(--ifm-menu-color)}.menu__caret:before,.menu__link--sublist-caret:after{content:"";filter:var(--ifm-menu-link-sublist-icon-filter);height:1.25rem;transform:rotate(180deg);transition:transform var(--ifm-transition-fast) linear;width:1.25rem}.menu__link--sublist-caret:after{background:var(--ifm-menu-link-sublist-icon) 50%/2rem 2rem;margin-left:auto;min-width:1.25rem}.menu__link--active,.menu__link--active:hover{color:var(--ifm-menu-color-active)}.navbar__brand,.navbar__link{color:var(--ifm-navbar-link-color)}.menu__link--active:not(.menu__link--sublist){background-color:var(--ifm-menu-color-background-active)}.menu__caret:before{background:var(--ifm-menu-link-sublist-icon) 50%/2rem 2rem}.navbar--dark,html[data-theme=dark]{--ifm-menu-link-sublist-icon-filter:invert(100%) sepia(94%) saturate(17%) hue-rotate(223deg) brightness(104%) contrast(98%)}.navbar{background-color:var(--ifm-navbar-background-color);box-shadow:var(--ifm-navbar-shadow);height:var(--ifm-navbar-height);padding:var(--ifm-navbar-padding-vertical) 
var(--ifm-navbar-padding-horizontal)}.navbar,.navbar>.container,.navbar>.container-fluid{display:flex}.navbar--fixed-top{position:sticky;top:0;z-index:var(--ifm-z-index-fixed)}.navbar-sidebar,.navbar-sidebar__backdrop{bottom:0;left:0;opacity:0;position:fixed;top:0;transition-duration:var(--ifm-transition-fast);transition-timing-function:ease-in-out;visibility:hidden}.navbar__inner{display:flex;flex-wrap:wrap;justify-content:space-between;width:100%}.navbar__brand{align-items:center;display:flex;margin-right:1rem;min-width:0}.navbar__brand:hover{color:var(--ifm-navbar-link-hover-color)}.navbar__title{flex:1 1 auto}.navbar__toggle{display:none;margin-right:.5rem}.navbar__logo{flex:0 0 auto;height:2rem;margin-right:.5rem}.navbar__items{align-items:center;display:flex;flex:1;min-width:0}.navbar__items--center{flex:0 0 auto}.navbar__items--center .navbar__brand{margin:0}.navbar__items--center+.navbar__items--right{flex:1}.navbar__items--right{flex:0 0 auto;justify-content:flex-end}.navbar__items--right>:last-child{padding-right:0}.navbar__item{display:inline-block;padding:var(--ifm-navbar-item-padding-vertical) var(--ifm-navbar-item-padding-horizontal)}.navbar__link--active,.navbar__link:hover{color:var(--ifm-navbar-link-hover-color)}.navbar--dark,.navbar--primary{--ifm-menu-color:var(--ifm-color-gray-300);--ifm-navbar-link-color:var(--ifm-color-gray-100);--ifm-navbar-search-input-background-color:#ffffff1a;--ifm-navbar-search-input-placeholder-color:#ffffff80;color:var(--ifm-color-white)}.navbar--dark{--ifm-navbar-background-color:#242526;--ifm-menu-color-background-active:#ffffff0d;--ifm-navbar-search-input-color:var(--ifm-color-white)}.navbar--primary{--ifm-navbar-background-color:var(--ifm-color-primary);--ifm-navbar-link-hover-color:var(--ifm-color-white);--ifm-menu-color-active:var(--ifm-color-white);--ifm-navbar-search-input-color:var(--ifm-color-emphasis-500)}.navbar__search-input{appearance:none;background:var(--ifm-navbar-search-input-background-color) 
var(--ifm-navbar-search-input-icon) no-repeat .75rem center/1rem 1rem;border:none;border-radius:2rem;color:var(--ifm-navbar-search-input-color);cursor:text;display:inline-block;font-size:1rem;height:2rem;padding:0 .5rem 0 2.25rem;width:12.5rem}.navbar__search-input::placeholder{color:var(--ifm-navbar-search-input-placeholder-color)}.navbar-sidebar{background-color:var(--ifm-navbar-background-color);box-shadow:var(--ifm-global-shadow-md);transform:translate3d(-100%,0,0);transition-property:opacity,visibility,transform;width:var(--ifm-navbar-sidebar-width)}.navbar-sidebar--show .navbar-sidebar,.navbar-sidebar__items{transform:translateZ(0)}.navbar-sidebar--show .navbar-sidebar,.navbar-sidebar--show .navbar-sidebar__backdrop{opacity:1;visibility:visible}.navbar-sidebar__backdrop{background-color:#0009;right:0;transition-property:opacity,visibility}.navbar-sidebar__brand{align-items:center;box-shadow:var(--ifm-navbar-shadow);display:flex;flex:1;height:var(--ifm-navbar-height);padding:var(--ifm-navbar-padding-vertical) var(--ifm-navbar-padding-horizontal)}.navbar-sidebar__items{display:flex;height:calc(100% - var(--ifm-navbar-height));transition:transform var(--ifm-transition-fast) ease-in-out}.navbar-sidebar__items--show-secondary{transform:translate3d(calc((var(--ifm-navbar-sidebar-width))*-1),0,0)}.navbar-sidebar__item{flex-shrink:0;padding:.5rem;width:calc(var(--ifm-navbar-sidebar-width))}.navbar-sidebar__back{background:var(--ifm-menu-color-background-active);font-size:15px;font-weight:var(--ifm-button-font-weight);margin:0 0 .2rem -.5rem;padding:.6rem 1.5rem;position:relative;text-align:left;top:-.5rem;width:calc(100% + 
1rem)}.navbar-sidebar__close{display:flex;margin-left:auto}.pagination{column-gap:var(--ifm-pagination-page-spacing);display:flex;font-size:var(--ifm-pagination-font-size);padding-left:0}.pagination--sm{--ifm-pagination-font-size:0.8rem;--ifm-pagination-padding-horizontal:0.8rem;--ifm-pagination-padding-vertical:0.2rem}.pagination--lg{--ifm-pagination-font-size:1.2rem;--ifm-pagination-padding-horizontal:1.2rem;--ifm-pagination-padding-vertical:0.3rem}.pagination__item{display:inline-flex}.pagination__item>span{padding:var(--ifm-pagination-padding-vertical)}.pagination__item--active .pagination__link{color:var(--ifm-pagination-color-active)}.pagination__item--active .pagination__link,.pagination__item:not(.pagination__item--active):hover .pagination__link{background:var(--ifm-pagination-item-active-background)}.pagination__item--disabled,.pagination__item[disabled]{opacity:.25;pointer-events:none}.pagination__link{border-radius:var(--ifm-pagination-border-radius);color:var(--ifm-font-color-base);display:inline-block;padding:var(--ifm-pagination-padding-vertical) var(--ifm-pagination-padding-horizontal);transition:background var(--ifm-transition-fast) var(--ifm-transition-timing-default)}.pagination-nav{display:grid;grid-gap:var(--ifm-spacing-horizontal);gap:var(--ifm-spacing-horizontal);grid-template-columns:repeat(2,1fr)}.pagination-nav__link{border:1px solid var(--ifm-color-emphasis-300);border-radius:var(--ifm-pagination-nav-border-radius);display:block;height:100%;line-height:var(--ifm-heading-line-height);padding:var(--ifm-global-spacing);transition:border-color var(--ifm-transition-fast) var(--ifm-transition-timing-default)}.pagination-nav__link:hover{border-color:var(--ifm-pagination-nav-color-hover)}.pagination-nav__link--next{grid-column:2/3;text-align:right}.pagination-nav__label{font-size:var(--ifm-h4-font-size);font-weight:var(--ifm-heading-font-weight);word-break:break-word}.pagination-nav__link--prev .pagination-nav__label:before{content:"« 
"}.pagination-nav__link--next .pagination-nav__label:after{content:" »"}.pagination-nav__sublabel{color:var(--ifm-color-content-secondary);font-size:var(--ifm-h5-font-size);font-weight:var(--ifm-font-weight-semibold);margin-bottom:.25rem}.pills__item,.tabs{font-weight:var(--ifm-font-weight-bold)}.pills{display:flex;gap:var(--ifm-pills-spacing);padding-left:0}.pills__item{border-radius:.5rem;cursor:pointer;display:inline-block;padding:.25rem 1rem;transition:background var(--ifm-transition-fast) var(--ifm-transition-timing-default)}.pills__item--active{color:var(--ifm-pills-color-active)}.pills__item--active,.pills__item:not(.pills__item--active):hover{background:var(--ifm-pills-color-background-active)}.pills--block{justify-content:stretch}.pills--block .pills__item{flex-grow:1;text-align:center}.tabs{color:var(--ifm-tabs-color);display:flex;margin-bottom:0;overflow-x:auto;padding-left:0}.tabs__item{border-bottom:3px solid #0000;border-radius:var(--ifm-global-radius);cursor:pointer;display:inline-flex;padding:var(--ifm-tabs-padding-vertical) var(--ifm-tabs-padding-horizontal);transition:background-color var(--ifm-transition-fast) var(--ifm-transition-timing-default)}.tabs__item--active{border-bottom-color:var(--ifm-tabs-color-active-border);border-bottom-left-radius:0;border-bottom-right-radius:0;color:var(--ifm-tabs-color-active)}.tabs__item:hover{background-color:var(--ifm-hover-overlay)}.tabs--block{justify-content:stretch}.tabs--block 
.tabs__item{flex-grow:1;justify-content:center}html[data-theme=dark]{--ifm-color-scheme:dark;--ifm-color-emphasis-0:var(--ifm-color-gray-1000);--ifm-color-emphasis-100:var(--ifm-color-gray-900);--ifm-color-emphasis-200:var(--ifm-color-gray-800);--ifm-color-emphasis-300:var(--ifm-color-gray-700);--ifm-color-emphasis-400:var(--ifm-color-gray-600);--ifm-color-emphasis-600:var(--ifm-color-gray-400);--ifm-color-emphasis-700:var(--ifm-color-gray-300);--ifm-color-emphasis-800:var(--ifm-color-gray-200);--ifm-color-emphasis-900:var(--ifm-color-gray-100);--ifm-color-emphasis-1000:var(--ifm-color-gray-0);--ifm-background-color:#1b1b1d;--ifm-background-surface-color:#242526;--ifm-hover-overlay:#ffffff0d;--ifm-color-content:#e3e3e3;--ifm-color-content-secondary:#fff;--ifm-breadcrumb-separator-filter:invert(64%) sepia(11%) saturate(0%) hue-rotate(149deg) brightness(99%) contrast(95%);--ifm-code-background:#ffffff1a;--ifm-scrollbar-track-background-color:#444;--ifm-scrollbar-thumb-background-color:#686868;--ifm-scrollbar-thumb-hover-background-color:#7a7a7a;--ifm-table-stripe-background:#ffffff12;--ifm-toc-border-color:var(--ifm-color-emphasis-200);--ifm-color-primary-contrast-background:#102445;--ifm-color-primary-contrast-foreground:#ebf2fc;--ifm-color-secondary-contrast-background:#474748;--ifm-color-secondary-contrast-foreground:#fdfdfe;--ifm-color-success-contrast-background:#003100;--ifm-color-success-contrast-foreground:#e6f6e6;--ifm-color-info-contrast-background:#193c47;--ifm-color-info-contrast-foreground:#eef9fd;--ifm-color-warning-contrast-background:#4d3800;--ifm-color-warning-contrast-foreground:#fff8e6;--ifm-color-danger-contrast-background:#4b1113;--ifm-color-danger-contrast-foreground:#ffebec}}.bharatml-hero .bharatml-button:hover,.bharatml-hero .button--outline:hover,[data-theme=dark] .bharatml-hero .bharatml-button:hover,[data-theme=dark] .bharatml-hero 
.button--outline:hover{background-color:#fff!important;border-color:#fff!important;color:var(--bharatml-primary)!important}.bharatml-hero .bharatml-button,.bharatml-hero .button--outline{border:2px solid #fff!important;color:#fff!important;transition:.3s}:root{--ifm-color-primary:#450839;--ifm-color-primary-dark:#3d0732;--ifm-color-primary-darker:#39062f;--ifm-color-primary-darkest:#2f0527;--ifm-color-primary-light:#4d0940;--ifm-color-primary-lighter:#510a43;--ifm-color-primary-lightest:#5d0c4d;--ifm-code-font-size:95%;--docusaurus-highlighted-code-line-bg:#0000001a;--bharatml-primary:#450839;--bharatml-primary-hover:#6a0c59;--bharatml-secondary:#f9f9f9;--bharatml-text:#1c1e21;--bharatml-text-light:#606770}[data-theme=dark]{--ifm-color-primary:#8b4582;--ifm-color-primary-dark:#7d3f75;--ifm-color-primary-darker:#763c6e;--ifm-color-primary-darkest:#62315a;--ifm-color-primary-light:#994b8f;--ifm-color-primary-lighter:#a04e96;--ifm-color-primary-lightest:#b657a9;--docusaurus-highlighted-code-line-bg:#0000004d;--bharatml-primary:#8b4582;--bharatml-primary-hover:#a04e96;--bharatml-secondary:#1e1e1e;--bharatml-text:#e3e3e3;--bharatml-text-light:#b4b4b4}.bharatml-hero{background:linear-gradient(135deg,var(--bharatml-primary) 0,var(--bharatml-primary-hover) 100%);color:#fff}.bharatml-hero .bharatml-button{background-color:var(--bharatml-primary)}.bharatml-hero .button--outline{background-color:initial!important}[data-theme=dark] .bharatml-hero .bharatml-button{background-color:var(--bharatml-primary);border:2px solid #fff!important;color:#fff!important}[data-theme=dark] .bharatml-hero .button--outline{background-color:initial!important;border:2px solid #fff!important;color:#fff!important}.bharatml-button{background-color:var(--bharatml-primary);border-color:var(--bharatml-primary);transition:.3s}.bharatml-button:hover{background-color:var(--bharatml-primary-hover);border-color:var(--bharatml-primary-hover);color:#fff}.bharatml-card{background:#fff;border:1px solid 
#4508391a;border-radius:8px;padding:2rem;transition:.3s}.bharatml-card:hover{border-color:var(--bharatml-primary);box-shadow:0 4px 20px #4508391a;transform:translateY(-2px)}.bharatml-icon{align-items:center;background:linear-gradient(135deg,var(--bharatml-primary),var(--bharatml-primary-hover));border-radius:12px;color:#fff;display:flex;font-size:1.5rem;height:64px;justify-content:center;margin:0 auto 1rem;width:64px}@layer docusaurus.core{#__docusaurus-base-url-issue-banner-container{display:none}}.aboutSection_udvw,.features_t9lD{background-color:var(--ifm-background-surface-color)}.featuresHeader_qR2i,.features_t9lD h3{color:var(--bharatml-primary);margin-bottom:1rem}.features_t9lD{display:block;padding:4rem 0;text-align:center;width:100%}.featureSvg_GfXr{height:200px;width:200px}.featuresHeader_qR2i{font-size:2.5rem;font-weight:700;text-align:center}.featuresSubtitle_VdGe{color:var(--ifm-font-color-base);font-size:1.2rem;opacity:1;text-align:center}.features_t9lD .bharatml-card_xZ6l{height:100%;margin-top:1rem}.features_t9lD .bharatml-icon_XBoJ{margin-bottom:1.5rem}.features_t9lD h3{font-size:1.25rem;font-weight:600}.features_t9lD p{color:var(--ifm-font-color-base)!important;font-size:.95rem;font-weight:400;line-height:1.6;margin:0}.featureDescription_sP1D{color:#1c1e21!important;font-size:.95rem!important;font-weight:400!important;line-height:1.6!important;margin:0!important}[data-theme=dark] .bharatml-card_xZ6l{background:#2a2a2a!important;border-color:#8b45824d;color:#fff}[data-theme=dark] .bharatml-card_xZ6l:hover{background:#333!important;border-color:var(--bharatml-primary);box-shadow:0 4px 20px #8b45824d}[data-theme=dark] .featureDescription_sP1D,[data-theme=dark] .featuresHeader_qR2i,[data-theme=dark] .features_t9lD h3,[data-theme=dark] .features_t9lD p{color:#a04e96!important}[data-theme=dark] .featuresSubtitle_VdGe{color:#e0e0e0!important}.heroBanner_qdFl{overflow:hidden;padding:4rem 
0;position:relative;text-align:center}.logoContainer_xdaK{align-items:center;display:flex;justify-content:center;margin-bottom:2rem}.heroLogo_U6bI{filter:drop-shadow(0 4px 8px rgba(0,0,0,.1));height:180px;transition:transform .3s;width:180px}.heroLogo_U6bI:hover{transform:scale(1.05)}.buttons_AeoN{align-items:center;gap:1rem;margin-bottom:2rem}.buttons_AeoN,.statsContainer_KpvY{display:flex;justify-content:center}.statsContainer_KpvY{gap:3rem;margin-top:2rem;opacity:.9}.statItem_bwiZ{align-items:center;color:#fff;display:flex;flex-direction:column;text-align:center}.statItem_bwiZ strong{display:block;font-size:1.5rem;font-weight:700;margin-bottom:.25rem}.statItem_bwiZ span{font-size:.875rem;letter-spacing:.5px;opacity:.8;text-transform:uppercase}.aboutSection_udvw{padding:4rem 0}.highlightBox_Uhe8{background:linear-gradient(135deg,#f8f9ff,#e8f0ff);border:1px solid #4508391a;border-radius:12px;height:100%;padding:2rem}.highlightBox_Uhe8 h3{color:var(--bharatml-primary);font-size:1.25rem;margin-bottom:1rem}.highlightBox_Uhe8 li,[data-theme=dark] .highlightBox_Uhe8 li{color:var(--bharatml-text)}.highlightBox_Uhe8 ul{list-style:none;margin:0;padding:0}.highlightBox_Uhe8 li{font-size:.95rem;padding:.5rem 0}.highlightBox_Uhe8 li:not(:last-child){border-bottom:1px solid #4508390d}[data-theme=dark] .highlightBox_Uhe8{background:linear-gradient(135deg,#1a1a2e,#16213e);border-color:#8b458233}@layer docusaurus.theme-common{body:not(.navigation-with-keyboard) :not(input):focus{outline:0}.themedComponent_mlkZ{display:none}[data-theme=dark] .themedComponent--dark_xIcU,[data-theme=light] .themedComponent--light_NVdE,html:not([data-theme]) .themedComponent--light_NVdE{display:initial}.errorBoundaryError_a6uf{color:red;white-space:pre-wrap}.errorBoundaryFallback_VBag{color:red;padding:.55rem}.details_lb9f{--docusaurus-details-summary-arrow-size:0.38rem;--docusaurus-details-transition:transform 200ms 
ease;--docusaurus-details-decoration-color:grey}.details_lb9f>summary{cursor:pointer;list-style:none;padding-left:1rem;position:relative}.details_lb9f>summary::-webkit-details-marker{display:none}.details_lb9f>summary:before{border-color:#0000 #0000 #0000 var(--docusaurus-details-decoration-color);border-style:solid;border-width:var(--docusaurus-details-summary-arrow-size);content:"";left:0;position:absolute;top:.45rem;transform:rotate(0);transform-origin:calc(var(--docusaurus-details-summary-arrow-size)/2) 50%;transition:var(--docusaurus-details-transition)}.details_lb9f[data-collapsed=false].isBrowser_bmU9>summary:before,.details_lb9f[open]:not(.isBrowser_bmU9)>summary:before{transform:rotate(90deg)}.collapsibleContent_i85q{border-top:1px solid var(--docusaurus-details-decoration-color);margin-top:1rem;padding-top:1rem}.collapsibleContent_i85q p:last-child,.details_lb9f>summary>p:last-child{margin-bottom:0}}@layer docusaurus.theme-classic{:root{--docusaurus-progress-bar-color:var(--ifm-color-primary);--docusaurus-announcement-bar-height:auto;--docusaurus-collapse-button-bg:#0000;--docusaurus-collapse-button-bg-hover:#0000001a;--doc-sidebar-width:300px;--doc-sidebar-hidden-width:30px;--docusaurus-blog-social-icon-size:1rem;--docusaurus-tag-list-border:var(--ifm-color-emphasis-300)}#nprogress{pointer-events:none}#nprogress .bar{background:var(--docusaurus-progress-bar-color);height:2px;left:0;position:fixed;top:0;width:100%;z-index:1031}#nprogress .peg{box-shadow:0 0 10px var(--docusaurus-progress-bar-color),0 0 5px var(--docusaurus-progress-bar-color);height:100%;opacity:1;position:absolute;right:0;transform:rotate(3deg) translateY(-4px);width:100px}.skipToContent_fXgn{background-color:var(--ifm-background-surface-color);color:var(--ifm-color-emphasis-900);left:100%;padding:calc(var(--ifm-global-spacing)/2) var(--ifm-global-spacing);position:fixed;top:1rem;z-index:calc(var(--ifm-z-index-fixed) + 
1)}.skipToContent_fXgn:focus{box-shadow:var(--ifm-global-shadow-md);left:1rem}.closeButton_CVFx{line-height:0;padding:0}.content_knG7{font-size:85%;padding:5px 0;text-align:center}.content_knG7 a{color:inherit;-webkit-text-decoration:underline;text-decoration:underline}.announcementBar_mb4j{align-items:center;background-color:var(--ifm-color-white);border-bottom:1px solid var(--ifm-color-emphasis-100);color:var(--ifm-color-black);display:flex;height:var(--docusaurus-announcement-bar-height)}.docSidebarContainer_YfHR,.navbarSearchContainer_Bca1:empty,.sidebarLogo_isFc,.toggleIcon_g3eP,html[data-announcement-bar-initially-dismissed=true] .announcementBar_mb4j{display:none}.announcementBarPlaceholder_vyr4{flex:0 0 10px}.announcementBarClose_gvF7{align-self:stretch;flex:0 0 30px}.announcementBarContent_xLdY{flex:1 1 auto}.toggle_vylO{height:2rem;width:2rem}.toggleButton_gllP{-webkit-tap-highlight-color:transparent;align-items:center;border-radius:50%;display:flex;height:100%;justify-content:center;transition:background var(--ifm-transition-fast);width:100%}.toggleButton_gllP:hover{background:var(--ifm-color-emphasis-200)}[data-theme-choice=dark] .darkToggleIcon_wfgR,[data-theme-choice=light] .lightToggleIcon_pyhR,[data-theme-choice=system] .systemToggleIcon_QzmC{display:initial}.toggleButtonDisabled_aARS{cursor:not-allowed}.darkNavbarColorModeToggle_X3D1:hover{background:var(--ifm-color-gray-800)}.backToTopButton_sjWU{background-color:var(--ifm-color-emphasis-200);border-radius:50%;bottom:1.3rem;box-shadow:var(--ifm-global-shadow-lw);height:3rem;opacity:0;position:fixed;right:1.3rem;transform:scale(0);transition:all var(--ifm-transition-fast) var(--ifm-transition-timing-default);visibility:hidden;width:3rem;z-index:calc(var(--ifm-z-index-fixed) - 1)}.backToTopButton_sjWU:after{background-color:var(--ifm-color-emphasis-1000);content:" ";display:inline-block;height:100%;-webkit-mask:var(--ifm-menu-link-sublist-icon) 50%/2rem 2rem 
no-repeat;mask:var(--ifm-menu-link-sublist-icon) 50%/2rem 2rem no-repeat;width:100%}.backToTopButtonShow_xfvO{opacity:1;transform:scale(1);visibility:visible}[data-theme=dark]:root{--docusaurus-collapse-button-bg:#ffffff0d;--docusaurus-collapse-button-bg-hover:#ffffff1a}.collapseSidebarButton_PEFL{display:none;margin:0}.iconExternalLink_nPIU{margin-left:.3rem}.dropdownNavbarItemMobile_J0Sd{cursor:pointer}.iconLanguage_nlXk{margin-right:5px;vertical-align:text-bottom}.navbarHideable_m1mJ{transition:transform var(--ifm-transition-fast) ease}.navbarHidden_jGov{transform:translate3d(0,calc(-100% - 2px),0)}.navbar__items--right>:last-child{padding-right:0}.footerLogoLink_BH7S{opacity:.5;transition:opacity var(--ifm-transition-fast) var(--ifm-transition-timing-default)}.footerLogoLink_BH7S:hover,.hash-link:focus,:hover>.hash-link{opacity:1}.menuExternalLink_NmtK{align-items:center}.docMainContainer_TBSr,.docRoot_UBD9{display:flex;width:100%}.authorSocialIcon_XYv3,.authorSocialLink_owbf{width:var(--docusaurus-blog-social-icon-size)}.docsWrapper_hBAB{display:flex;flex:1 0 auto}.anchorWithStickyNavbar_LWe7{scroll-margin-top:calc(var(--ifm-navbar-height) + .5rem)}.anchorWithHideOnScrollNavbar_WYt5{scroll-margin-top:.5rem}.hash-link{opacity:0;padding-left:.5rem;transition:opacity var(--ifm-transition-fast);-webkit-user-select:none;user-select:none}.hash-link:before{content:"#"}.docCardListItem_W1sv>*,body,html{height:100%}.mainWrapper_z2l0{display:flex;flex:1 0 auto;flex-direction:column}.docusaurus-mt-lg{margin-top:3rem}#__docusaurus{display:flex;flex-direction:column;min-height:100%}.sidebar_re4s{max-height:calc(100vh - var(--ifm-navbar-height) - 2rem);overflow-y:auto;position:sticky;top:calc(var(--ifm-navbar-height) + 
2rem)}.authorSocials_rSDt,.authorTitle_nd0D{overflow:hidden;-webkit-box-orient:vertical}.sidebarItemTitle_pO2u{font-size:var(--ifm-h3-font-size);font-weight:var(--ifm-font-weight-bold)}.container_mt6G,.sidebarItemList_Yudw{font-size:.9rem}.sidebarItem__DBe{margin-top:.7rem}.sidebarItemLink_mo7H{color:var(--ifm-font-color-base);display:block}.sidebarItemLink_mo7H:hover{-webkit-text-decoration:none;text-decoration:none}.sidebarItemLinkActive_I1ZP{color:var(--ifm-color-primary)!important}.yearGroupHeading_rMGB{margin-bottom:.4rem;margin-top:1.6rem}.yearGroupHeading_QT03{margin:1rem .75rem .5rem}.cardContainer_fWXF{--ifm-link-color:var(--ifm-color-emphasis-800);--ifm-link-hover-color:var(--ifm-color-emphasis-700);--ifm-link-hover-decoration:none;border:1px solid var(--ifm-color-emphasis-200);box-shadow:0 1.5px 3px 0 #00000026;transition:all var(--ifm-transition-fast) ease;transition-property:border,box-shadow}.cardContainer_fWXF:hover{border-color:var(--ifm-color-primary);box-shadow:0 3px 6px 0 #0003}.admonitionContent_BuS1>:last-child,.cardContainer_fWXF :last-child{margin-bottom:0}.cardTitle_rnsV{font-size:1.2rem}.cardDescription_PWke{font-size:.8rem}.docCardListItem_W1sv{margin-bottom:2rem}.title_f1Hy{font-size:3rem}[data-theme=dark] .githubSvg_Uu4N,[data-theme=dark] .instagramSvg_YC40,[data-theme=dark] .threadsSvg_PTXY,[data-theme=dark] .xSvg_y3PF{fill:var(--light)}[data-theme=light] .githubSvg_Uu4N,[data-theme=light] .instagramSvg_YC40,[data-theme=light] .threadsSvg_PTXY,[data-theme=light] .xSvg_y3PF{fill:var(--dark)}.authorSocials_rSDt{align-items:center;display:flex;flex-wrap:wrap;line-clamp:1;-webkit-line-clamp:1}.authorSocialLink_owbf,.authorSocials_rSDt{height:var(--docusaurus-blog-social-icon-size);line-height:0}.authorSocialLink_owbf{margin-right:.4rem}.authorSocialIcon_XYv3{height:var(--docusaurus-blog-social-icon-size)}.authorImage_XqGP{--ifm-avatar-photo-size:3.6rem}.author-as-h1_n9oJ .authorImage_XqGP{--ifm-avatar-photo-size:7rem}.author-as-h2_gXvM 
.authorImage_XqGP{--ifm-avatar-photo-size:5.4rem}.authorDetails_lV9A{align-items:flex-start;display:flex;flex-direction:column;justify-content:space-around}.authorName_yefp{display:flex;flex-direction:row;font-size:1.1rem;line-height:1.1rem}.author-as-h1_n9oJ .authorName_yefp{display:inline;font-size:2.4rem;line-height:2.4rem}.author-as-h2_gXvM .authorName_yefp{display:inline;font-size:1.4rem;line-height:1.4rem}.authorTitle_nd0D{display:-webkit-box;font-size:.8rem;line-height:1rem;line-clamp:1;-webkit-line-clamp:1}.author-as-h1_n9oJ .authorTitle_nd0D{font-size:1.2rem;line-height:1.6rem}.author-as-h2_gXvM .authorTitle_nd0D{font-size:1rem;line-height:1.3rem}.authorBlogPostCount_iiJ5{background:var(--ifm-color-secondary);border-radius:var(--ifm-global-radius);color:var(--ifm-color-black);font-size:.8rem;line-height:1.2;margin-left:.3rem;padding:.1rem .4rem}.authorListItem_n3yI{list-style-type:none;margin-bottom:2rem}.authorCol_Hf19{max-width:inherit!important}.imageOnlyAuthorRow_pa_O{display:flex;flex-flow:row wrap}.imageOnlyAuthorCol_G86a{margin-left:.3rem;margin-right:.3rem}.codeBlockContainer_Ckt0{background:var(--prism-background-color);border-radius:var(--ifm-code-border-radius);box-shadow:var(--ifm-global-shadow-lw);color:var(--prism-color);margin-bottom:var(--ifm-leading)}.codeBlock_bY9V{--ifm-pre-background:var(--prism-background-color);margin:0;padding:0}.codeBlockStandalone_MEMb{padding:0}.codeBlockLines_e6Vv{float:left;font:inherit;min-width:100%;padding:var(--ifm-pre-padding)}.codeBlockLinesWithNumbering_o6Pm{display:table;padding:var(--ifm-pre-padding) 0}:where(:root){--docusaurus-highlighted-code-line-bg:#484d5b}:where([data-theme=dark]){--docusaurus-highlighted-code-line-bg:#646464}.theme-code-block-highlighted-line{background-color:var(--docusaurus-highlighted-code-line-bg);display:block;margin:0 calc(var(--ifm-pre-padding)*-1);padding:0 
var(--ifm-pre-padding)}.codeLine_lJS_{counter-increment:a;display:table-row}.codeLineNumber_Tfdd{background:var(--ifm-pre-background);display:table-cell;left:0;overflow-wrap:normal;padding:0 var(--ifm-pre-padding);position:sticky;text-align:right;width:1%}.codeLineNumber_Tfdd:before{content:counter(a);opacity:.4}.theme-code-block-highlighted-line .codeLineNumber_Tfdd:before{opacity:.8}.codeLineContent_feaV{padding-right:var(--ifm-pre-padding)}.theme-code-block:hover .copyButtonCopied_Vdqa{opacity:1!important}.copyButtonIcons_IEyt{height:1.125rem;position:relative;width:1.125rem}.copyButtonIcon_TrPX,.copyButtonSuccessIcon_cVMy{left:0;position:absolute;top:0;fill:currentColor;height:inherit;opacity:inherit;transition:all var(--ifm-transition-fast) ease;width:inherit}.copyButtonSuccessIcon_cVMy{color:#00d600;left:50%;opacity:0;top:50%;transform:translate(-50%,-50%) scale(.33)}.copyButtonCopied_Vdqa .copyButtonIcon_TrPX{opacity:0;transform:scale(.33)}.copyButtonCopied_Vdqa .copyButtonSuccessIcon_cVMy{opacity:1;transform:translate(-50%,-50%) scale(1);transition-delay:75ms}.wordWrapButtonIcon_b1P5{height:1.2rem;width:1.2rem}.wordWrapButtonEnabled_uzNF .wordWrapButtonIcon_b1P5{color:var(--ifm-color-primary)}.buttonGroup_M5ko{column-gap:.2rem;display:flex;position:absolute;right:calc(var(--ifm-pre-padding)/2);top:calc(var(--ifm-pre-padding)/2)}.buttonGroup_M5ko button{align-items:center;background:var(--prism-background-color);border:1px solid var(--ifm-color-emphasis-300);border-radius:var(--ifm-global-radius);color:var(--prism-color);display:flex;line-height:0;opacity:0;padding:.4rem;transition:opacity var(--ifm-transition-fast) ease-in-out}.buttonGroup_M5ko button:focus-visible,.buttonGroup_M5ko button:hover{opacity:1!important}.theme-code-block:hover .buttonGroup_M5ko button{opacity:.4}.tag_zVej{border:1px solid var(--docusaurus-tag-list-border);transition:border 
var(--ifm-transition-fast)}.tag_zVej:hover{--docusaurus-tag-list-border:var(--ifm-link-color);-webkit-text-decoration:none;text-decoration:none}.tagRegular_sFm0{border-radius:var(--ifm-global-radius);font-size:90%;padding:.2rem .5rem .3rem}.tagWithCount_h2kH{align-items:center;border-left:0;display:flex;padding:0 .5rem 0 1rem;position:relative}.tagWithCount_h2kH:after,.tagWithCount_h2kH:before{border:1px solid var(--docusaurus-tag-list-border);content:"";position:absolute;top:50%;transition:inherit}.tagWithCount_h2kH:before{border-bottom:0;border-right:0;height:1.18rem;right:100%;transform:translate(50%,-50%) rotate(-45deg);width:1.18rem}.tagWithCount_h2kH:after{border-radius:50%;height:.5rem;left:0;transform:translateY(-50%);width:.5rem}.tagWithCount_h2kH span{background:var(--ifm-color-secondary);border-radius:var(--ifm-global-radius);color:var(--ifm-color-black);font-size:.7rem;line-height:1.2;margin-left:.3rem;padding:.1rem .4rem}.tag_Nnez{display:inline-block;margin:.5rem .5rem 0 1rem}.codeBlockContent_QJqH{border-radius:inherit;direction:ltr;position:relative}.codeBlockTitle_OeMC{border-bottom:1px solid var(--ifm-color-emphasis-300);border-top-left-radius:inherit;border-top-right-radius:inherit;font-size:var(--ifm-code-font-size);font-weight:500;padding:.75rem var(--ifm-pre-padding)}.codeBlockTitle_OeMC+.codeBlockContent_QJqH .codeBlock_a8dz{border-top-left-radius:0;border-top-right-radius:0}.tags_jXut{display:inline}.tag_QGVx{display:inline-block;margin:0 .4rem .5rem 0}.iconEdit_Z9Sw{margin-right:.3em;vertical-align:sub}.lastUpdated_JAkA{font-size:smaller;font-style:italic;margin-top:.2rem}.tocCollapsibleButton_TO0P{align-items:center;display:flex;font-size:inherit;justify-content:space-between;padding:.4rem .8rem;width:100%}.tocCollapsibleButton_TO0P:after{background:var(--ifm-menu-link-sublist-icon) 50% 50%/2rem 2rem no-repeat;content:"";filter:var(--ifm-menu-link-sublist-icon-filter);height:1.25rem;transform:rotate(180deg);transition:transform 
var(--ifm-transition-fast);width:1.25rem}.tocCollapsibleButtonExpanded_MG3E:after,.tocCollapsibleExpanded_sAul{transform:none}.tocCollapsible_ETCw{background-color:var(--ifm-menu-color-background-active);border-radius:var(--ifm-global-radius);margin:1rem 0}.tocCollapsibleContent_vkbj>ul{border-left:none;border-top:1px solid var(--ifm-color-emphasis-300);font-size:15px;padding:.2rem 0}.tocCollapsibleContent_vkbj ul li{margin:.4rem .8rem}.tocCollapsibleContent_vkbj a{display:block}.details_b_Ee{--docusaurus-details-decoration-color:var(--ifm-alert-border-color);--docusaurus-details-transition:transform var(--ifm-transition-fast) ease;border:1px solid var(--ifm-alert-border-color);margin:0 0 var(--ifm-spacing-vertical)}.containsTaskList_mC6p{list-style:none}:not(.containsTaskList_mC6p>li)>.containsTaskList_mC6p{padding-left:0}.img_ev3q{height:auto}.tableOfContents_bqdL{max-height:calc(100vh - var(--ifm-navbar-height) - 2rem);overflow-y:auto;position:sticky;top:calc(var(--ifm-navbar-height) + 1rem)}.admonition_xJq3{margin-bottom:1em}.admonitionHeading_Gvgb{font:var(--ifm-heading-font-weight) var(--ifm-h5-font-size)/var(--ifm-heading-line-height) var(--ifm-heading-font-family);text-transform:uppercase}.admonitionHeading_Gvgb:not(:last-child){margin-bottom:.3rem}.admonitionHeading_Gvgb code{text-transform:none}.admonitionIcon_Rf37{display:inline-block;margin-right:.4em;vertical-align:middle}.admonitionIcon_Rf37 svg{display:inline-block;height:1.6em;width:1.6em;fill:var(--ifm-alert-foreground-color)}.breadcrumbHomeIcon_YNFT{height:1.1rem;position:relative;top:1px;vertical-align:top;width:1.1rem}.breadcrumbsContainer_Z_bl{--ifm-breadcrumb-size-multiplier:0.8;margin-bottom:.8rem}.title_kItE{--ifm-h1-font-size:3rem;margin-bottom:calc(var(--ifm-leading)*1.25)}.docItemContainer_Djhp article>:first-child,.docItemContainer_Djhp header+*{margin-top:0}.mdxPageWrapper_j9I6{justify-content:center}}@media 
(min-width:997px){.collapseSidebarButton_PEFL,.expandButton_TmdG{background-color:var(--docusaurus-collapse-button-bg)}:root{--docusaurus-announcement-bar-height:30px}.announcementBarClose_gvF7,.announcementBarPlaceholder_vyr4{flex-basis:50px}.collapseSidebarButton_PEFL{border:1px solid var(--ifm-toc-border-color);border-radius:0;bottom:0;display:block!important;height:40px;position:sticky}.collapseSidebarButtonIcon_kv0_{margin-top:4px;transform:rotate(180deg)}.expandButtonIcon_i1dp,[dir=rtl] .collapseSidebarButtonIcon_kv0_{transform:rotate(0)}.collapseSidebarButton_PEFL:focus,.collapseSidebarButton_PEFL:hover,.expandButton_TmdG:focus,.expandButton_TmdG:hover{background-color:var(--docusaurus-collapse-button-bg-hover)}.navbarSearchContainer_Bca1{padding:var(--ifm-navbar-item-padding-vertical) var(--ifm-navbar-item-padding-horizontal)}.menuHtmlItem_M9Kj{padding:var(--ifm-menu-link-padding-vertical) var(--ifm-menu-link-padding-horizontal)}.menu_SIkG{flex-grow:1;padding:.5rem}@supports (scrollbar-gutter:stable){.menu_SIkG{padding:.5rem 0 .5rem .5rem;scrollbar-gutter:stable}}.menuWithAnnouncementBar_GW3s{margin-bottom:var(--docusaurus-announcement-bar-height)}.sidebar_njMd{display:flex;flex-direction:column;height:100%;padding-top:var(--ifm-navbar-height);width:var(--doc-sidebar-width)}.sidebarWithHideableNavbar_wUlq{padding-top:0}.sidebarHidden_VK0M{opacity:0;visibility:hidden}.sidebarLogo_isFc{align-items:center;color:inherit!important;display:flex!important;margin:0 var(--ifm-navbar-padding-horizontal);max-height:var(--ifm-navbar-height);min-height:var(--ifm-navbar-height);-webkit-text-decoration:none!important;text-decoration:none!important}.sidebarLogo_isFc img{height:2rem;margin-right:.5rem}.expandButton_TmdG{align-items:center;display:flex;height:100%;justify-content:center;position:absolute;right:0;top:0;transition:background-color var(--ifm-transition-fast) ease;width:100%}[dir=rtl] 
.expandButtonIcon_i1dp{transform:rotate(180deg)}.docSidebarContainer_YfHR{border-right:1px solid var(--ifm-toc-border-color);clip-path:inset(0);display:block;margin-top:calc(var(--ifm-navbar-height)*-1);transition:width var(--ifm-transition-fast) ease;width:var(--doc-sidebar-width);will-change:width}.docSidebarContainerHidden_DPk8{cursor:pointer;width:var(--doc-sidebar-hidden-width)}.sidebarViewport_aRkj{height:100%;max-height:100vh;position:sticky;top:0}.docMainContainer_TBSr{flex-grow:1;max-width:calc(100% - var(--doc-sidebar-width))}.docMainContainerEnhanced_lQrH{max-width:calc(100% - var(--doc-sidebar-hidden-width))}.docItemWrapperEnhanced_JWYK{max-width:calc(var(--ifm-container-width) + var(--doc-sidebar-width))!important}.lastUpdated_JAkA{text-align:right}.tocMobile_ITEo{display:none}.docItemCol_VOVn,.generatedIndexPage_vN6x{max-width:75%!important}}@media (min-width:1440px){.container{max-width:var(--ifm-container-width-xl)}}@media (max-width:996px){.col{--ifm-col-width:100%;flex-basis:var(--ifm-col-width);margin-left:0}.footer{--ifm-footer-padding-horizontal:0}.colorModeToggle_DEke,.footer__link-separator,.navbar__item,.sidebar_re4s,.tableOfContents_bqdL{display:none}.footer__col{margin-bottom:calc(var(--ifm-spacing-vertical)*3)}.footer__link-item{display:block;width:max-content}.hero{padding-left:0;padding-right:0}.navbar>.container,.navbar>.container-fluid{padding:0}.navbar__toggle{display:inherit}.navbar__search-input{width:9rem}.pills--block,.tabs--block{flex-direction:column}.navbarSearchContainer_Bca1{position:absolute;right:var(--ifm-navbar-padding-horizontal)}.docItemContainer_F8PC{padding:0 .3rem}}@media screen and (max-width:996px){.features_t9lD .bharatml-card_xZ6l{margin-bottom:2rem}.featuresHeader_qR2i{font-size:2rem}.featuresSubtitle_VdGe{font-size:1rem}.heroBanner_qdFl{padding:2rem}}@media screen and 
(max-width:768px){.heroLogo_U6bI{height:120px;width:120px}.logoContainer_xdaK{margin-bottom:1.5rem}.buttons_AeoN{flex-direction:column;gap:.5rem}.statsContainer_KpvY{align-items:center;flex-direction:column;gap:1rem}}@media (max-width:576px){.markdown h1:first-child{--ifm-h1-font-size:2rem}.markdown>h2{--ifm-h2-font-size:1.5rem}.markdown>h3{--ifm-h3-font-size:1.25rem}.title_f1Hy{font-size:2rem}}@media (hover:hover){.backToTopButton_sjWU:hover{background-color:var(--ifm-color-emphasis-300)}}@media (pointer:fine){.thin-scrollbar{scrollbar-width:thin}.thin-scrollbar::-webkit-scrollbar{height:var(--ifm-scrollbar-size);width:var(--ifm-scrollbar-size)}.thin-scrollbar::-webkit-scrollbar-track{background:var(--ifm-scrollbar-track-background-color);border-radius:10px}.thin-scrollbar::-webkit-scrollbar-thumb{background:var(--ifm-scrollbar-thumb-background-color);border-radius:10px}.thin-scrollbar::-webkit-scrollbar-thumb:hover{background:var(--ifm-scrollbar-thumb-hover-background-color)}}@media (prefers-reduced-motion:reduce){:root{--ifm-transition-fast:0ms;--ifm-transition-slow:0ms}}@media print{.announcementBar_mb4j,.footer,.menu,.navbar,.pagination-nav,.table-of-contents,.tocMobile_ITEo{display:none}.tabs{page-break-inside:avoid}.codeBlockLines_e6Vv{white-space:pre-wrap}} \ No newline at end of file +@layer docusaurus.infima,docusaurus.theme-common,docusaurus.theme-classic,docusaurus.core,docusaurus.plugin-debug,docusaurus.theme-mermaid,docusaurus.theme-live-codeblock,docusaurus.theme-search-algolia.docsearch,docusaurus.theme-search-algolia;@layer docusaurus.infima{.col,.container{padding:0 var(--ifm-spacing-horizontal);width:100%}.markdown>h2,.markdown>h3,.markdown>h4,.markdown>h5,.markdown>h6{margin-bottom:calc(var(--ifm-heading-vertical-rhythm-bottom)*var(--ifm-leading))}.markdown li,body{word-wrap:break-word}body,ol ol,ol ul,ul ol,ul ul{margin:0}pre,table{overflow:auto}blockquote,pre{margin:0 0 
var(--ifm-spacing-vertical)}.breadcrumbs__link,.button{transition-timing-function:var(--ifm-transition-timing-default)}.button,code{vertical-align:middle}.button--outline.button--active,.button--outline:active,.button--outline:hover,:root{--ifm-button-color:var(--ifm-font-color-base-inverse)}.menu__link:hover,a{transition:color var(--ifm-transition-fast) var(--ifm-transition-timing-default)}.navbar--dark,:root{--ifm-navbar-link-hover-color:var(--ifm-color-primary)}.menu,.navbar-sidebar{overflow-x:hidden}:root,html[data-theme=dark]{--ifm-color-emphasis-500:var(--ifm-color-gray-500)}:root{--ifm-color-scheme:light;--ifm-dark-value:10%;--ifm-darker-value:15%;--ifm-darkest-value:30%;--ifm-light-value:15%;--ifm-lighter-value:30%;--ifm-lightest-value:50%;--ifm-contrast-background-value:90%;--ifm-contrast-foreground-value:70%;--ifm-contrast-background-dark-value:70%;--ifm-contrast-foreground-dark-value:90%;--ifm-color-primary:#3578e5;--ifm-color-secondary:#ebedf0;--ifm-color-success:#00a400;--ifm-color-info:#54c7ec;--ifm-color-warning:#ffba00;--ifm-color-danger:#fa383e;--ifm-color-primary-dark:#306cce;--ifm-color-primary-darker:#2d66c3;--ifm-color-primary-darkest:#2554a0;--ifm-color-primary-light:#538ce9;--ifm-color-primary-lighter:#72a1ed;--ifm-color-primary-lightest:#9abcf2;--ifm-color-primary-contrast-background:#ebf2fc;--ifm-color-primary-contrast-foreground:#102445;--ifm-color-secondary-dark:#d4d5d8;--ifm-color-secondary-darker:#c8c9cc;--ifm-color-secondary-darkest:#a4a6a8;--ifm-color-secondary-light:#eef0f2;--ifm-color-secondary-lighter:#f1f2f5;--ifm-color-secondary-lightest:#f5f6f8;--ifm-color-secondary-contrast-background:#fdfdfe;--ifm-color-secondary-contrast-foreground:#474748;--ifm-color-success-dark:#009400;--ifm-color-success-darker:#008b00;--ifm-color-success-darkest:#007300;--ifm-color-success-light:#26b226;--ifm-color-success-lighter:#4dbf4d;--ifm-color-success-lightest:#80d280;--ifm-color-success-contrast-background:#e6f6e6;--ifm-color-success-contrast-fore
ground:#003100;--ifm-color-info-dark:#4cb3d4;--ifm-color-info-darker:#47a9c9;--ifm-color-info-darkest:#3b8ba5;--ifm-color-info-light:#6ecfef;--ifm-color-info-lighter:#87d8f2;--ifm-color-info-lightest:#aae3f6;--ifm-color-info-contrast-background:#eef9fd;--ifm-color-info-contrast-foreground:#193c47;--ifm-color-warning-dark:#e6a700;--ifm-color-warning-darker:#d99e00;--ifm-color-warning-darkest:#b38200;--ifm-color-warning-light:#ffc426;--ifm-color-warning-lighter:#ffcf4d;--ifm-color-warning-lightest:#ffdd80;--ifm-color-warning-contrast-background:#fff8e6;--ifm-color-warning-contrast-foreground:#4d3800;--ifm-color-danger-dark:#e13238;--ifm-color-danger-darker:#d53035;--ifm-color-danger-darkest:#af272b;--ifm-color-danger-light:#fb565b;--ifm-color-danger-lighter:#fb7478;--ifm-color-danger-lightest:#fd9c9f;--ifm-color-danger-contrast-background:#ffebec;--ifm-color-danger-contrast-foreground:#4b1113;--ifm-color-white:#fff;--ifm-color-black:#000;--ifm-color-gray-0:var(--ifm-color-white);--ifm-color-gray-100:#f5f6f7;--ifm-color-gray-200:#ebedf0;--ifm-color-gray-300:#dadde1;--ifm-color-gray-400:#ccd0d5;--ifm-color-gray-500:#bec3c9;--ifm-color-gray-600:#8d949e;--ifm-color-gray-700:#606770;--ifm-color-gray-800:#444950;--ifm-color-gray-900:#1c1e21;--ifm-color-gray-1000:var(--ifm-color-black);--ifm-color-emphasis-0:var(--ifm-color-gray-0);--ifm-color-emphasis-100:var(--ifm-color-gray-100);--ifm-color-emphasis-200:var(--ifm-color-gray-200);--ifm-color-emphasis-300:var(--ifm-color-gray-300);--ifm-color-emphasis-400:var(--ifm-color-gray-400);--ifm-color-emphasis-600:var(--ifm-color-gray-600);--ifm-color-emphasis-700:var(--ifm-color-gray-700);--ifm-color-emphasis-800:var(--ifm-color-gray-800);--ifm-color-emphasis-900:var(--ifm-color-gray-900);--ifm-color-emphasis-1000:var(--ifm-color-gray-1000);--ifm-color-content:var(--ifm-color-emphasis-900);--ifm-color-content-inverse:var(--ifm-color-emphasis-0);--ifm-color-content-secondary:#525860;--ifm-background-color:#0000;--ifm-background-surf
ace-color:var(--ifm-color-content-inverse);--ifm-global-border-width:1px;--ifm-global-radius:0.4rem;--ifm-hover-overlay:#0000000d;--ifm-font-color-base:var(--ifm-color-content);--ifm-font-color-base-inverse:var(--ifm-color-content-inverse);--ifm-font-color-secondary:var(--ifm-color-content-secondary);--ifm-font-family-base:system-ui,-apple-system,Segoe UI,Roboto,Ubuntu,Cantarell,Noto Sans,sans-serif,BlinkMacSystemFont,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol";--ifm-font-family-monospace:SFMono-Regular,Menlo,Monaco,Consolas,"Liberation Mono","Courier New",monospace;--ifm-font-size-base:100%;--ifm-font-weight-light:300;--ifm-font-weight-normal:400;--ifm-font-weight-semibold:500;--ifm-font-weight-bold:700;--ifm-font-weight-base:var(--ifm-font-weight-normal);--ifm-line-height-base:1.65;--ifm-global-spacing:1rem;--ifm-spacing-vertical:var(--ifm-global-spacing);--ifm-spacing-horizontal:var(--ifm-global-spacing);--ifm-transition-fast:200ms;--ifm-transition-slow:400ms;--ifm-transition-timing-default:cubic-bezier(0.08,0.52,0.52,1);--ifm-global-shadow-lw:0 1px 2px 0 #0000001a;--ifm-global-shadow-md:0 5px 40px #0003;--ifm-global-shadow-tl:0 12px 28px 0 #0003,0 2px 4px 0 
#0000001a;--ifm-z-index-dropdown:100;--ifm-z-index-fixed:200;--ifm-z-index-overlay:400;--ifm-container-width:1140px;--ifm-container-width-xl:1320px;--ifm-code-background:#f6f7f8;--ifm-code-border-radius:var(--ifm-global-radius);--ifm-code-font-size:90%;--ifm-code-padding-horizontal:0.1rem;--ifm-code-padding-vertical:0.1rem;--ifm-pre-background:var(--ifm-code-background);--ifm-pre-border-radius:var(--ifm-code-border-radius);--ifm-pre-color:inherit;--ifm-pre-line-height:1.45;--ifm-pre-padding:1rem;--ifm-heading-color:inherit;--ifm-heading-margin-top:0;--ifm-heading-margin-bottom:var(--ifm-spacing-vertical);--ifm-heading-font-family:var(--ifm-font-family-base);--ifm-heading-font-weight:var(--ifm-font-weight-bold);--ifm-heading-line-height:1.25;--ifm-h1-font-size:2rem;--ifm-h2-font-size:1.5rem;--ifm-h3-font-size:1.25rem;--ifm-h4-font-size:1rem;--ifm-h5-font-size:0.875rem;--ifm-h6-font-size:0.85rem;--ifm-image-alignment-padding:1.25rem;--ifm-leading-desktop:1.25;--ifm-leading:calc(var(--ifm-leading-desktop)*1rem);--ifm-list-left-padding:2rem;--ifm-list-margin:1rem;--ifm-list-item-margin:0.25rem;--ifm-list-paragraph-margin:1rem;--ifm-table-cell-padding:0.75rem;--ifm-table-background:#0000;--ifm-table-stripe-background:#00000008;--ifm-table-border-width:1px;--ifm-table-border-color:var(--ifm-color-emphasis-300);--ifm-table-head-background:inherit;--ifm-table-head-color:inherit;--ifm-table-head-font-weight:var(--ifm-font-weight-bold);--ifm-table-cell-color:inherit;--ifm-link-color:var(--ifm-color-primary);--ifm-link-decoration:none;--ifm-link-hover-color:var(--ifm-link-color);--ifm-link-hover-decoration:underline;--ifm-paragraph-margin-bottom:var(--ifm-leading);--ifm-blockquote-font-size:var(--ifm-font-size-base);--ifm-blockquote-border-left-width:2px;--ifm-blockquote-padding-horizontal:var(--ifm-spacing-horizontal);--ifm-blockquote-padding-vertical:0;--ifm-blockquote-shadow:none;--ifm-blockquote-color:var(--ifm-color-emphasis-800);--ifm-blockquote-border-color:var(--ifm-co
lor-emphasis-300);--ifm-hr-background-color:var(--ifm-color-emphasis-500);--ifm-hr-height:1px;--ifm-hr-margin-vertical:1.5rem;--ifm-scrollbar-size:7px;--ifm-scrollbar-track-background-color:#f1f1f1;--ifm-scrollbar-thumb-background-color:silver;--ifm-scrollbar-thumb-hover-background-color:#a7a7a7;--ifm-alert-background-color:inherit;--ifm-alert-border-color:inherit;--ifm-alert-border-radius:var(--ifm-global-radius);--ifm-alert-border-width:0px;--ifm-alert-border-left-width:5px;--ifm-alert-color:var(--ifm-font-color-base);--ifm-alert-padding-horizontal:var(--ifm-spacing-horizontal);--ifm-alert-padding-vertical:var(--ifm-spacing-vertical);--ifm-alert-shadow:var(--ifm-global-shadow-lw);--ifm-avatar-intro-margin:1rem;--ifm-avatar-intro-alignment:inherit;--ifm-avatar-photo-size:3rem;--ifm-badge-background-color:inherit;--ifm-badge-border-color:inherit;--ifm-badge-border-radius:var(--ifm-global-radius);--ifm-badge-border-width:var(--ifm-global-border-width);--ifm-badge-color:var(--ifm-color-white);--ifm-badge-padding-horizontal:calc(var(--ifm-spacing-horizontal)*0.5);--ifm-badge-padding-vertical:calc(var(--ifm-spacing-vertical)*0.25);--ifm-breadcrumb-border-radius:1.5rem;--ifm-breadcrumb-spacing:0.5rem;--ifm-breadcrumb-color-active:var(--ifm-color-primary);--ifm-breadcrumb-item-background-active:var(--ifm-hover-overlay);--ifm-breadcrumb-padding-horizontal:0.8rem;--ifm-breadcrumb-padding-vertical:0.4rem;--ifm-breadcrumb-size-multiplier:1;--ifm-breadcrumb-separator:url('data:image/svg+xml;utf8,');--ifm-breadcrumb-separator-filter:none;--ifm-breadcrumb-separator-size:0.5rem;--ifm-breadcrumb-separator-size-multiplier:1.25;--ifm-button-background-color:inherit;--ifm-button-border-color:var(--ifm-button-background-color);--ifm-button-border-width:var(--ifm-global-border-width);--ifm-button-font-weight:var(--ifm-font-weight-bold);--ifm-button-padding-horizontal:1.5rem;--ifm-button-padding-vertical:0.375rem;--ifm-button-size-multiplier:1;--ifm-button-transition-duration:var(--ifm-
transition-fast);--ifm-button-border-radius:calc(var(--ifm-global-radius)*var(--ifm-button-size-multiplier));--ifm-button-group-spacing:2px;--ifm-card-background-color:var(--ifm-background-surface-color);--ifm-card-border-radius:calc(var(--ifm-global-radius)*2);--ifm-card-horizontal-spacing:var(--ifm-global-spacing);--ifm-card-vertical-spacing:var(--ifm-global-spacing);--ifm-toc-border-color:var(--ifm-color-emphasis-300);--ifm-toc-link-color:var(--ifm-color-content-secondary);--ifm-toc-padding-vertical:0.5rem;--ifm-toc-padding-horizontal:0.5rem;--ifm-dropdown-background-color:var(--ifm-background-surface-color);--ifm-dropdown-font-weight:var(--ifm-font-weight-semibold);--ifm-dropdown-link-color:var(--ifm-font-color-base);--ifm-dropdown-hover-background-color:var(--ifm-hover-overlay);--ifm-footer-background-color:var(--ifm-color-emphasis-100);--ifm-footer-color:inherit;--ifm-footer-link-color:var(--ifm-color-emphasis-700);--ifm-footer-link-hover-color:var(--ifm-color-primary);--ifm-footer-link-horizontal-spacing:0.5rem;--ifm-footer-padding-horizontal:calc(var(--ifm-spacing-horizontal)*2);--ifm-footer-padding-vertical:calc(var(--ifm-spacing-vertical)*2);--ifm-footer-title-color:inherit;--ifm-footer-logo-max-width:min(30rem,90vw);--ifm-hero-background-color:var(--ifm-background-surface-color);--ifm-hero-text-color:var(--ifm-color-emphasis-800);--ifm-menu-color:var(--ifm-color-emphasis-700);--ifm-menu-color-active:var(--ifm-color-primary);--ifm-menu-color-background-active:var(--ifm-hover-overlay);--ifm-menu-color-background-hover:var(--ifm-hover-overlay);--ifm-menu-link-padding-horizontal:0.75rem;--ifm-menu-link-padding-vertical:0.375rem;--ifm-menu-link-sublist-icon:url('data:image/svg+xml;utf8,');--ifm-menu-link-sublist-icon-filter:none;--ifm-navbar-background-color:var(--ifm-background-surface-color);--ifm-navbar-height:3.75rem;--ifm-navbar-item-padding-horizontal:0.75rem;--ifm-navbar-item-padding-vertical:0.25rem;--ifm-navbar-link-color:var(--ifm-font-color-base);--
ifm-navbar-link-active-color:var(--ifm-link-color);--ifm-navbar-padding-horizontal:var(--ifm-spacing-horizontal);--ifm-navbar-padding-vertical:calc(var(--ifm-spacing-vertical)*0.5);--ifm-navbar-shadow:var(--ifm-global-shadow-lw);--ifm-navbar-search-input-background-color:var(--ifm-color-emphasis-200);--ifm-navbar-search-input-color:var(--ifm-color-emphasis-800);--ifm-navbar-search-input-placeholder-color:var(--ifm-color-emphasis-500);--ifm-navbar-search-input-icon:url('data:image/svg+xml;utf8,');--ifm-navbar-sidebar-width:83vw;--ifm-pagination-border-radius:var(--ifm-global-radius);--ifm-pagination-color-active:var(--ifm-color-primary);--ifm-pagination-font-size:1rem;--ifm-pagination-item-active-background:var(--ifm-hover-overlay);--ifm-pagination-page-spacing:0.2em;--ifm-pagination-padding-horizontal:calc(var(--ifm-spacing-horizontal)*1);--ifm-pagination-padding-vertical:calc(var(--ifm-spacing-vertical)*0.25);--ifm-pagination-nav-border-radius:var(--ifm-global-radius);--ifm-pagination-nav-color-hover:var(--ifm-color-primary);--ifm-pills-color-active:var(--ifm-color-primary);--ifm-pills-color-background-active:var(--ifm-hover-overlay);--ifm-pills-spacing:0.125rem;--ifm-tabs-color:var(--ifm-font-color-secondary);--ifm-tabs-color-active:var(--ifm-color-primary);--ifm-tabs-color-active-border:var(--ifm-tabs-color-active);--ifm-tabs-padding-horizontal:1rem;--ifm-tabs-padding-vertical:1rem}.badge--danger,.badge--info,.badge--primary,.badge--secondary,.badge--success,.badge--warning{--ifm-badge-border-color:var(--ifm-badge-background-color)}.button--link,.button--outline{--ifm-button-background-color:#0000}*{box-sizing:border-box}html{background-color:var(--ifm-background-color);color:var(--ifm-font-color-base);color-scheme:var(--ifm-color-scheme);font:var(--ifm-font-size-base)/var(--ifm-line-height-base) 
var(--ifm-font-family-base);-webkit-font-smoothing:antialiased;-webkit-tap-highlight-color:transparent;text-rendering:optimizelegibility;-webkit-text-size-adjust:100%;text-size-adjust:100%}iframe{border:0;color-scheme:auto}.container{margin:0 auto;max-width:var(--ifm-container-width)}.container--fluid{max-width:inherit}.row{display:flex;flex-wrap:wrap;margin:0 calc(var(--ifm-spacing-horizontal)*-1)}.margin-bottom--none,.margin-vert--none,.markdown>:last-child{margin-bottom:0!important}.margin-top--none,.margin-vert--none{margin-top:0!important}.row--no-gutters{margin-left:0;margin-right:0}.margin-horiz--none,.margin-right--none{margin-right:0!important}.row--no-gutters>.col{padding-left:0;padding-right:0}.row--align-top{align-items:flex-start}.row--align-bottom{align-items:flex-end}.row--align-center{align-items:center}.row--align-stretch{align-items:stretch}.row--align-baseline{align-items:baseline}.col{--ifm-col-width:100%;flex:1 0;margin-left:0;max-width:var(--ifm-col-width)}.padding-bottom--none,.padding-vert--none{padding-bottom:0!important}.padding-top--none,.padding-vert--none{padding-top:0!important}.padding-horiz--none,.padding-left--none{padding-left:0!important}.padding-horiz--none,.padding-right--none{padding-right:0!important}.col[class*=col--]{flex:0 0 
var(--ifm-col-width)}.col--1{--ifm-col-width:8.33333%}.col--offset-1{margin-left:8.33333%}.col--2{--ifm-col-width:16.66667%}.col--offset-2{margin-left:16.66667%}.col--3{--ifm-col-width:25%}.col--offset-3{margin-left:25%}.col--4{--ifm-col-width:33.33333%}.col--offset-4{margin-left:33.33333%}.col--5{--ifm-col-width:41.66667%}.col--offset-5{margin-left:41.66667%}.col--6{--ifm-col-width:50%}.col--offset-6{margin-left:50%}.col--7{--ifm-col-width:58.33333%}.col--offset-7{margin-left:58.33333%}.col--8{--ifm-col-width:66.66667%}.col--offset-8{margin-left:66.66667%}.col--9{--ifm-col-width:75%}.col--offset-9{margin-left:75%}.col--10{--ifm-col-width:83.33333%}.col--offset-10{margin-left:83.33333%}.col--11{--ifm-col-width:91.66667%}.col--offset-11{margin-left:91.66667%}.col--12{--ifm-col-width:100%}.col--offset-12{margin-left:100%}.margin-horiz--none,.margin-left--none{margin-left:0!important}.margin--none{margin:0!important}.margin-bottom--xs,.margin-vert--xs{margin-bottom:.25rem!important}.margin-top--xs,.margin-vert--xs{margin-top:.25rem!important}.margin-horiz--xs,.margin-left--xs{margin-left:.25rem!important}.margin-horiz--xs,.margin-right--xs{margin-right:.25rem!important}.margin--xs{margin:.25rem!important}.margin-bottom--sm,.margin-vert--sm{margin-bottom:.5rem!important}.margin-top--sm,.margin-vert--sm{margin-top:.5rem!important}.margin-horiz--sm,.margin-left--sm{margin-left:.5rem!important}.margin-horiz--sm,.margin-right--sm{margin-right:.5rem!important}.margin--sm{margin:.5rem!important}.margin-bottom--md,.margin-vert--md{margin-bottom:1rem!important}.margin-top--md,.margin-vert--md{margin-top:1rem!important}.margin-horiz--md,.margin-left--md{margin-left:1rem!important}.margin-horiz--md,.margin-right--md{margin-right:1rem!important}.margin--md{margin:1rem!important}.margin-bottom--lg,.margin-vert--lg{margin-bottom:2rem!important}.margin-top--lg,.margin-vert--lg{margin-top:2rem!important}.margin-horiz--lg,.margin-left--lg{margin-left:2rem!important}.margin-horiz--lg,.m
argin-right--lg{margin-right:2rem!important}.margin--lg{margin:2rem!important}.margin-bottom--xl,.margin-vert--xl{margin-bottom:5rem!important}.margin-top--xl,.margin-vert--xl{margin-top:5rem!important}.margin-horiz--xl,.margin-left--xl{margin-left:5rem!important}.margin-horiz--xl,.margin-right--xl{margin-right:5rem!important}.margin--xl{margin:5rem!important}.padding--none{padding:0!important}.padding-bottom--xs,.padding-vert--xs{padding-bottom:.25rem!important}.padding-top--xs,.padding-vert--xs{padding-top:.25rem!important}.padding-horiz--xs,.padding-left--xs{padding-left:.25rem!important}.padding-horiz--xs,.padding-right--xs{padding-right:.25rem!important}.padding--xs{padding:.25rem!important}.padding-bottom--sm,.padding-vert--sm{padding-bottom:.5rem!important}.padding-top--sm,.padding-vert--sm{padding-top:.5rem!important}.padding-horiz--sm,.padding-left--sm{padding-left:.5rem!important}.padding-horiz--sm,.padding-right--sm{padding-right:.5rem!important}.padding--sm{padding:.5rem!important}.padding-bottom--md,.padding-vert--md{padding-bottom:1rem!important}.padding-top--md,.padding-vert--md{padding-top:1rem!important}.padding-horiz--md,.padding-left--md{padding-left:1rem!important}.padding-horiz--md,.padding-right--md{padding-right:1rem!important}.padding--md{padding:1rem!important}.padding-bottom--lg,.padding-vert--lg{padding-bottom:2rem!important}.padding-top--lg,.padding-vert--lg{padding-top:2rem!important}.padding-horiz--lg,.padding-left--lg{padding-left:2rem!important}.padding-horiz--lg,.padding-right--lg{padding-right:2rem!important}.padding--lg{padding:2rem!important}.padding-bottom--xl,.padding-vert--xl{padding-bottom:5rem!important}.padding-top--xl,.padding-vert--xl{padding-top:5rem!important}.padding-horiz--xl,.padding-left--xl{padding-left:5rem!important}.padding-horiz--xl,.padding-right--xl{padding-right:5rem!important}.padding--xl{padding:5rem!important}code{background-color:var(--ifm-code-background);border:.1rem solid 
#0000001a;border-radius:var(--ifm-code-border-radius);font-family:var(--ifm-font-family-monospace);font-size:var(--ifm-code-font-size);padding:var(--ifm-code-padding-vertical) var(--ifm-code-padding-horizontal)}a code{color:inherit}pre{background-color:var(--ifm-pre-background);border-radius:var(--ifm-pre-border-radius);color:var(--ifm-pre-color);font:var(--ifm-code-font-size)/var(--ifm-pre-line-height) var(--ifm-font-family-monospace);padding:var(--ifm-pre-padding)}pre code{background-color:initial;border:none;font-size:100%;line-height:inherit;padding:0}kbd{background-color:var(--ifm-color-emphasis-0);border:1px solid var(--ifm-color-emphasis-400);border-radius:.2rem;box-shadow:inset 0 -1px 0 var(--ifm-color-emphasis-400);color:var(--ifm-color-emphasis-800);font:80% var(--ifm-font-family-monospace);padding:.15rem .3rem}h1,h2,h3,h4,h5,h6{color:var(--ifm-heading-color);font-family:var(--ifm-heading-font-family);font-weight:var(--ifm-heading-font-weight);line-height:var(--ifm-heading-line-height);margin:var(--ifm-heading-margin-top) 0 var(--ifm-heading-margin-bottom) 0}h1{font-size:var(--ifm-h1-font-size)}h2{font-size:var(--ifm-h2-font-size)}h3{font-size:var(--ifm-h3-font-size)}h4{font-size:var(--ifm-h4-font-size)}h5{font-size:var(--ifm-h5-font-size)}h6{font-size:var(--ifm-h6-font-size)}img{max-width:100%}img[align=right]{padding-left:var(--image-alignment-padding)}img[align=left]{padding-right:var(--image-alignment-padding)}.markdown{--ifm-h1-vertical-rhythm-top:3;--ifm-h2-vertical-rhythm-top:2;--ifm-h3-vertical-rhythm-top:1.5;--ifm-heading-vertical-rhythm-top:1.25;--ifm-h1-vertical-rhythm-bottom:1.25;--ifm-heading-vertical-rhythm-bottom:1}.markdown:after,.markdown:before{content:"";display:table}.markdown:after{clear:both}.markdown 
h1:first-child{--ifm-h1-font-size:3rem;margin-bottom:calc(var(--ifm-h1-vertical-rhythm-bottom)*var(--ifm-leading))}.markdown>h2{--ifm-h2-font-size:2rem;margin-top:calc(var(--ifm-h2-vertical-rhythm-top)*var(--ifm-leading))}.markdown>h3{--ifm-h3-font-size:1.5rem;margin-top:calc(var(--ifm-h3-vertical-rhythm-top)*var(--ifm-leading))}.markdown>h4,.markdown>h5,.markdown>h6{margin-top:calc(var(--ifm-heading-vertical-rhythm-top)*var(--ifm-leading))}.markdown>p,.markdown>pre,.markdown>ul{margin-bottom:var(--ifm-leading)}.markdown li>p{margin-top:var(--ifm-list-paragraph-margin)}.markdown li+li{margin-top:var(--ifm-list-item-margin)}ol,ul{margin:0 0 var(--ifm-list-margin);padding-left:var(--ifm-list-left-padding)}ol ol,ul ol{list-style-type:lower-roman}ol ol ol,ol ul ol,ul ol ol,ul ul ol{list-style-type:lower-alpha}table{border-collapse:collapse;display:block;margin-bottom:var(--ifm-spacing-vertical)}table thead tr{border-bottom:2px solid var(--ifm-table-border-color)}table thead,table tr:nth-child(2n){background-color:var(--ifm-table-stripe-background)}table tr{background-color:var(--ifm-table-background);border-top:var(--ifm-table-border-width) solid var(--ifm-table-border-color)}table td,table th{border:var(--ifm-table-border-width) solid var(--ifm-table-border-color);padding:var(--ifm-table-cell-padding)}table th{background-color:var(--ifm-table-head-background);color:var(--ifm-table-head-color);font-weight:var(--ifm-table-head-font-weight)}table td{color:var(--ifm-table-cell-color)}strong{font-weight:var(--ifm-font-weight-bold)}a{color:var(--ifm-link-color);text-decoration:var(--ifm-link-decoration)}a:hover{color:var(--ifm-link-hover-color);text-decoration:var(--ifm-link-hover-decoration)}.button:hover,.text--no-decoration,.text--no-decoration:hover,a:not([href]){-webkit-text-decoration:none;text-decoration:none}p{margin:0 0 var(--ifm-paragraph-margin-bottom)}blockquote{border-left:var(--ifm-blockquote-border-left-width) solid 
var(--ifm-blockquote-border-color);box-shadow:var(--ifm-blockquote-shadow);color:var(--ifm-blockquote-color);font-size:var(--ifm-blockquote-font-size);padding:var(--ifm-blockquote-padding-vertical) var(--ifm-blockquote-padding-horizontal)}blockquote>:first-child{margin-top:0}blockquote>:last-child{margin-bottom:0}hr{background-color:var(--ifm-hr-background-color);border:0;height:var(--ifm-hr-height);margin:var(--ifm-hr-margin-vertical) 0}.shadow--lw{box-shadow:var(--ifm-global-shadow-lw)!important}.shadow--md{box-shadow:var(--ifm-global-shadow-md)!important}.shadow--tl{box-shadow:var(--ifm-global-shadow-tl)!important}.text--primary{color:var(--ifm-color-primary)}.text--secondary{color:var(--ifm-color-secondary)}.text--success{color:var(--ifm-color-success)}.text--info{color:var(--ifm-color-info)}.text--warning{color:var(--ifm-color-warning)}.text--danger{color:var(--ifm-color-danger)}.text--center{text-align:center}.text--left{text-align:left}.text--justify{text-align:justify}.text--right{text-align:right}.text--capitalize{text-transform:capitalize}.text--lowercase{text-transform:lowercase}.alert__heading,.text--uppercase{text-transform:uppercase}.text--light{font-weight:var(--ifm-font-weight-light)}.text--normal{font-weight:var(--ifm-font-weight-normal)}.text--semibold{font-weight:var(--ifm-font-weight-semibold)}.text--bold{font-weight:var(--ifm-font-weight-bold)}.text--italic{font-style:italic}.text--truncate{overflow:hidden;text-overflow:ellipsis;white-space:nowrap}.text--break{word-wrap:break-word!important;word-break:break-word!important}.clean-btn{background:none;border:none;color:inherit;cursor:pointer;font-family:inherit;padding:0}.alert,.alert 
.close{color:var(--ifm-alert-foreground-color)}.clean-list{list-style:none;padding-left:0}.alert--primary{--ifm-alert-background-color:var(--ifm-color-primary-contrast-background);--ifm-alert-background-color-highlight:#3578e526;--ifm-alert-foreground-color:var(--ifm-color-primary-contrast-foreground);--ifm-alert-border-color:var(--ifm-color-primary-dark)}.alert--secondary{--ifm-alert-background-color:var(--ifm-color-secondary-contrast-background);--ifm-alert-background-color-highlight:#ebedf026;--ifm-alert-foreground-color:var(--ifm-color-secondary-contrast-foreground);--ifm-alert-border-color:var(--ifm-color-secondary-dark)}.alert--success{--ifm-alert-background-color:var(--ifm-color-success-contrast-background);--ifm-alert-background-color-highlight:#00a40026;--ifm-alert-foreground-color:var(--ifm-color-success-contrast-foreground);--ifm-alert-border-color:var(--ifm-color-success-dark)}.alert--info{--ifm-alert-background-color:var(--ifm-color-info-contrast-background);--ifm-alert-background-color-highlight:#54c7ec26;--ifm-alert-foreground-color:var(--ifm-color-info-contrast-foreground);--ifm-alert-border-color:var(--ifm-color-info-dark)}.alert--warning{--ifm-alert-background-color:var(--ifm-color-warning-contrast-background);--ifm-alert-background-color-highlight:#ffba0026;--ifm-alert-foreground-color:var(--ifm-color-warning-contrast-foreground);--ifm-alert-border-color:var(--ifm-color-warning-dark)}.alert--danger{--ifm-alert-background-color:var(--ifm-color-danger-contrast-background);--ifm-alert-background-color-highlight:#fa383e26;--ifm-alert-foreground-color:var(--ifm-color-danger-contrast-foreground);--ifm-alert-border-color:var(--ifm-color-danger-dark)}.alert{--ifm-code-background:var(--ifm-alert-background-color-highlight);--ifm-link-color:var(--ifm-alert-foreground-color);--ifm-link-hover-color:var(--ifm-alert-foreground-color);--ifm-link-decoration:underline;--ifm-tabs-color:var(--ifm-alert-foreground-color);--ifm-tabs-color-active:var(--ifm-alert-foregr
ound-color);--ifm-tabs-color-active-border:var(--ifm-alert-border-color);background-color:var(--ifm-alert-background-color);border:var(--ifm-alert-border-width) solid var(--ifm-alert-border-color);border-left-width:var(--ifm-alert-border-left-width);border-radius:var(--ifm-alert-border-radius);box-shadow:var(--ifm-alert-shadow);padding:var(--ifm-alert-padding-vertical) var(--ifm-alert-padding-horizontal)}.alert__heading{align-items:center;display:flex;font:700 var(--ifm-h5-font-size)/var(--ifm-heading-line-height) var(--ifm-heading-font-family);margin-bottom:.5rem}.alert__icon{display:inline-flex;margin-right:.4em}.alert__icon svg{fill:var(--ifm-alert-foreground-color);stroke:var(--ifm-alert-foreground-color);stroke-width:0}.alert .close{margin:calc(var(--ifm-alert-padding-vertical)*-1) calc(var(--ifm-alert-padding-horizontal)*-1) 0 0;opacity:.75}.alert .close:focus,.alert .close:hover{opacity:1}.alert a{text-decoration-color:var(--ifm-alert-border-color)}.alert a:hover{text-decoration-thickness:2px}.avatar{column-gap:var(--ifm-avatar-intro-margin);display:flex}.avatar__photo{border-radius:50%;display:block;height:var(--ifm-avatar-photo-size);overflow:hidden;width:var(--ifm-avatar-photo-size)}.card--full-height,.navbar__logo img{height:100%}.avatar__photo--sm{--ifm-avatar-photo-size:2rem}.avatar__photo--lg{--ifm-avatar-photo-size:4rem}.avatar__photo--xl{--ifm-avatar-photo-size:6rem}.avatar__intro{display:flex;flex:1 1;flex-direction:column;justify-content:center;text-align:var(--ifm-avatar-intro-alignment)}.badge,.breadcrumbs__item,.breadcrumbs__link,.button,.dropdown>.navbar__link:after{display:inline-block}.avatar__name{font:700 var(--ifm-h4-font-size)/var(--ifm-heading-line-height) 
var(--ifm-font-family-base)}.avatar__subtitle{margin-top:.25rem}.avatar--vertical{--ifm-avatar-intro-alignment:center;--ifm-avatar-intro-margin:0.5rem;align-items:center;flex-direction:column}.badge{background-color:var(--ifm-badge-background-color);border:var(--ifm-badge-border-width) solid var(--ifm-badge-border-color);border-radius:var(--ifm-badge-border-radius);color:var(--ifm-badge-color);font-size:75%;font-weight:var(--ifm-font-weight-bold);line-height:1;padding:var(--ifm-badge-padding-vertical) var(--ifm-badge-padding-horizontal)}.badge--primary{--ifm-badge-background-color:var(--ifm-color-primary)}.badge--secondary{--ifm-badge-background-color:var(--ifm-color-secondary);color:var(--ifm-color-black)}.breadcrumbs__link,.button.button--secondary.button--outline:not(.button--active):not(:hover){color:var(--ifm-font-color-base)}.badge--success{--ifm-badge-background-color:var(--ifm-color-success)}.badge--info{--ifm-badge-background-color:var(--ifm-color-info)}.badge--warning{--ifm-badge-background-color:var(--ifm-color-warning)}.badge--danger{--ifm-badge-background-color:var(--ifm-color-danger)}.breadcrumbs{margin-bottom:0;padding-left:0}.breadcrumbs__item:not(:last-child):after{background:var(--ifm-breadcrumb-separator) center;content:" ";display:inline-block;filter:var(--ifm-breadcrumb-separator-filter);height:calc(var(--ifm-breadcrumb-separator-size)*var(--ifm-breadcrumb-size-multiplier)*var(--ifm-breadcrumb-separator-size-multiplier));margin:0 var(--ifm-breadcrumb-spacing);opacity:.5;width:calc(var(--ifm-breadcrumb-separator-size)*var(--ifm-breadcrumb-size-multiplier)*var(--ifm-breadcrumb-separator-size-multiplier))}.breadcrumbs__item--active 
.breadcrumbs__link{background:var(--ifm-breadcrumb-item-background-active);color:var(--ifm-breadcrumb-color-active)}.breadcrumbs__link{border-radius:var(--ifm-breadcrumb-border-radius);font-size:calc(1rem*var(--ifm-breadcrumb-size-multiplier));padding:calc(var(--ifm-breadcrumb-padding-vertical)*var(--ifm-breadcrumb-size-multiplier)) calc(var(--ifm-breadcrumb-padding-horizontal)*var(--ifm-breadcrumb-size-multiplier));transition-duration:var(--ifm-transition-fast);transition-property:background,color}.breadcrumbs__link:any-link:hover,.breadcrumbs__link:link:hover,.breadcrumbs__link:visited:hover,area[href].breadcrumbs__link:hover{background:var(--ifm-breadcrumb-item-background-active);-webkit-text-decoration:none;text-decoration:none}.breadcrumbs--sm{--ifm-breadcrumb-size-multiplier:0.8}.breadcrumbs--lg{--ifm-breadcrumb-size-multiplier:1.2}.button{background-color:var(--ifm-button-background-color);border:var(--ifm-button-border-width) solid var(--ifm-button-border-color);border-radius:var(--ifm-button-border-radius);cursor:pointer;font-size:calc(.875rem*var(--ifm-button-size-multiplier));font-weight:var(--ifm-button-font-weight);line-height:1.5;padding:calc(var(--ifm-button-padding-vertical)*var(--ifm-button-size-multiplier)) 
calc(var(--ifm-button-padding-horizontal)*var(--ifm-button-size-multiplier));text-align:center;transition-duration:var(--ifm-button-transition-duration);transition-property:color,background,border-color;-webkit-user-select:none;user-select:none;white-space:nowrap}.button,.button:hover{color:var(--ifm-button-color)}.button--outline{--ifm-button-color:var(--ifm-button-border-color)}.button--outline:hover{--ifm-button-background-color:var(--ifm-button-border-color)}.button--link{--ifm-button-border-color:#0000;color:var(--ifm-link-color);text-decoration:var(--ifm-link-decoration)}.button--link.button--active,.button--link:active,.button--link:hover{color:var(--ifm-link-hover-color);text-decoration:var(--ifm-link-hover-decoration)}.dropdown__link--active,.dropdown__link:hover,.menu__link:hover,.navbar__brand:hover,.navbar__link--active,.navbar__link:hover,.pagination-nav__link:hover,.pagination__link:hover{-webkit-text-decoration:none;text-decoration:none}.button.disabled,.button:disabled,.button[disabled]{opacity:.65;pointer-events:none}.button--sm{--ifm-button-size-multiplier:0.8}.button--lg{--ifm-button-size-multiplier:1.35}.button--block{display:block;width:100%}.button.button--secondary{color:var(--ifm-color-gray-900)}:where(.button--primary){--ifm-button-background-color:var(--ifm-color-primary);--ifm-button-border-color:var(--ifm-color-primary)}:where(.button--primary):not(.button--outline):hover{--ifm-button-background-color:var(--ifm-color-primary-dark);--ifm-button-border-color:var(--ifm-color-primary-dark)}.button--primary.button--active,.button--primary:active{--ifm-button-background-color:var(--ifm-color-primary-darker);--ifm-button-border-color:var(--ifm-color-primary-darker)}:where(.button--secondary){--ifm-button-background-color:var(--ifm-color-secondary);--ifm-button-border-color:var(--ifm-color-secondary)}:where(.button--secondary):not(.button--outline):hover{--ifm-button-background-color:var(--ifm-color-secondary-dark);--ifm-button-border-color:var(-
-ifm-color-secondary-dark)}.button--secondary.button--active,.button--secondary:active{--ifm-button-background-color:var(--ifm-color-secondary-darker);--ifm-button-border-color:var(--ifm-color-secondary-darker)}:where(.button--success){--ifm-button-background-color:var(--ifm-color-success);--ifm-button-border-color:var(--ifm-color-success)}:where(.button--success):not(.button--outline):hover{--ifm-button-background-color:var(--ifm-color-success-dark);--ifm-button-border-color:var(--ifm-color-success-dark)}.button--success.button--active,.button--success:active{--ifm-button-background-color:var(--ifm-color-success-darker);--ifm-button-border-color:var(--ifm-color-success-darker)}:where(.button--info){--ifm-button-background-color:var(--ifm-color-info);--ifm-button-border-color:var(--ifm-color-info)}:where(.button--info):not(.button--outline):hover{--ifm-button-background-color:var(--ifm-color-info-dark);--ifm-button-border-color:var(--ifm-color-info-dark)}.button--info.button--active,.button--info:active{--ifm-button-background-color:var(--ifm-color-info-darker);--ifm-button-border-color:var(--ifm-color-info-darker)}:where(.button--warning){--ifm-button-background-color:var(--ifm-color-warning);--ifm-button-border-color:var(--ifm-color-warning)}:where(.button--warning):not(.button--outline):hover{--ifm-button-background-color:var(--ifm-color-warning-dark);--ifm-button-border-color:var(--ifm-color-warning-dark)}.button--warning.button--active,.button--warning:active{--ifm-button-background-color:var(--ifm-color-warning-darker);--ifm-button-border-color:var(--ifm-color-warning-darker)}:where(.button--danger){--ifm-button-background-color:var(--ifm-color-danger);--ifm-button-border-color:var(--ifm-color-danger)}:where(.button--danger):not(.button--outline):hover{--ifm-button-background-color:var(--ifm-color-danger-dark);--ifm-button-border-color:var(--ifm-color-danger-dark)}.button--danger.button--active,.button--danger:active{--ifm-button-background-color:var(--ifm-col
or-danger-darker);--ifm-button-border-color:var(--ifm-color-danger-darker)}.button-group{display:inline-flex;gap:var(--ifm-button-group-spacing)}.button-group>.button:not(:first-child){border-bottom-left-radius:0;border-top-left-radius:0}.button-group>.button:not(:last-child){border-bottom-right-radius:0;border-top-right-radius:0}.button-group--block{display:flex;justify-content:stretch}.button-group--block>.button{flex-grow:1}.card{background-color:var(--ifm-card-background-color);border-radius:var(--ifm-card-border-radius);box-shadow:var(--ifm-global-shadow-lw);display:flex;flex-direction:column;overflow:hidden}.card__image{padding-top:var(--ifm-card-vertical-spacing)}.card__image:first-child{padding-top:0}.card__body,.card__footer,.card__header{padding:var(--ifm-card-vertical-spacing) var(--ifm-card-horizontal-spacing)}.card__body:not(:last-child),.card__footer:not(:last-child),.card__header:not(:last-child){padding-bottom:0}.card__body>:last-child,.card__footer>:last-child,.card__header>:last-child{margin-bottom:0}.card__footer{margin-top:auto}.table-of-contents{font-size:.8rem;margin-bottom:0;padding:var(--ifm-toc-padding-vertical) 0}.table-of-contents,.table-of-contents ul{list-style:none;padding-left:var(--ifm-toc-padding-horizontal)}.table-of-contents li{margin:var(--ifm-toc-padding-vertical) var(--ifm-toc-padding-horizontal)}.table-of-contents__left-border{border-left:1px solid var(--ifm-toc-border-color)}.table-of-contents__link{color:var(--ifm-toc-link-color);display:block}.table-of-contents__link--active,.table-of-contents__link--active code,.table-of-contents__link:hover,.table-of-contents__link:hover code{color:var(--ifm-color-primary);-webkit-text-decoration:none;text-decoration:none}.close{color:var(--ifm-color-black);float:right;font-size:1.5rem;font-weight:var(--ifm-font-weight-bold);line-height:1;opacity:.5;padding:1rem;transition:opacity var(--ifm-transition-fast) 
var(--ifm-transition-timing-default)}.close:hover{opacity:.7}.close:focus{opacity:.8}.dropdown{display:inline-flex;font-weight:var(--ifm-dropdown-font-weight);position:relative;vertical-align:top}.dropdown--hoverable:hover .dropdown__menu,.dropdown--show .dropdown__menu{opacity:1;pointer-events:all;transform:translateY(-1px);visibility:visible}.dropdown__menu,.navbar__item.dropdown .navbar__link:not([href]){pointer-events:none}.dropdown--right .dropdown__menu{left:inherit;right:0}.dropdown--nocaret .navbar__link:after{content:none!important}.dropdown__menu{background-color:var(--ifm-dropdown-background-color);border-radius:var(--ifm-global-radius);box-shadow:var(--ifm-global-shadow-md);left:0;list-style:none;max-height:80vh;min-width:10rem;opacity:0;overflow-y:auto;padding:.5rem;position:absolute;top:calc(100% - var(--ifm-navbar-item-padding-vertical) + .3rem);transform:translateY(-.625rem);transition-duration:var(--ifm-transition-fast);transition-property:opacity,transform,visibility;transition-timing-function:var(--ifm-transition-timing-default);visibility:hidden;z-index:var(--ifm-z-index-dropdown)}.menu__caret,.menu__link,.menu__list-item-collapsible{border-radius:.25rem;transition:background var(--ifm-transition-fast) var(--ifm-transition-timing-default)}.dropdown__link{border-radius:.25rem;color:var(--ifm-dropdown-link-color);display:block;font-size:.875rem;margin-top:.2rem;padding:.25rem .5rem;white-space:nowrap}.dropdown__link--active,.dropdown__link:hover{background-color:var(--ifm-dropdown-hover-background-color);color:var(--ifm-dropdown-link-color)}.dropdown__link--active,.dropdown__link--active:hover{--ifm-dropdown-link-color:var(--ifm-link-color)}.dropdown>.navbar__link:after{border-color:currentcolor #0000;border-style:solid;border-width:.4em .4em 
0;content:"";margin-left:.3em;position:relative;top:2px;transform:translateY(-50%)}.footer{background-color:var(--ifm-footer-background-color);color:var(--ifm-footer-color);padding:var(--ifm-footer-padding-vertical) var(--ifm-footer-padding-horizontal)}.footer--dark{--ifm-footer-background-color:#303846;--ifm-footer-color:var(--ifm-footer-link-color);--ifm-footer-link-color:var(--ifm-color-secondary);--ifm-footer-title-color:var(--ifm-color-white)}.footer__links{margin-bottom:1rem}.footer__link-item{color:var(--ifm-footer-link-color);line-height:2}.footer__link-item:hover{color:var(--ifm-footer-link-hover-color)}.footer__link-separator{margin:0 var(--ifm-footer-link-horizontal-spacing)}.footer__logo{margin-top:1rem;max-width:var(--ifm-footer-logo-max-width)}.footer__title{color:var(--ifm-footer-title-color);font:700 var(--ifm-h4-font-size)/var(--ifm-heading-line-height) var(--ifm-font-family-base);margin-bottom:var(--ifm-heading-margin-bottom)}.menu,.navbar__link{font-weight:var(--ifm-font-weight-semibold)}.footer__item{margin-top:0}.footer__items{margin-bottom:0}[type=checkbox]{padding:0}.hero{align-items:center;background-color:var(--ifm-hero-background-color);color:var(--ifm-hero-text-color);display:flex;padding:4rem 2rem}.hero--primary{--ifm-hero-background-color:var(--ifm-color-primary);--ifm-hero-text-color:var(--ifm-font-color-base-inverse)}.hero--dark{--ifm-hero-background-color:#303846;--ifm-hero-text-color:var(--ifm-color-white)}.hero__title{font-size:3rem}.hero__subtitle{font-size:1.5rem}.menu__list{list-style:none;margin:0;padding-left:0}.menu__caret,.menu__link{padding:var(--ifm-menu-link-padding-vertical) var(--ifm-menu-link-padding-horizontal)}.menu__list .menu__list{flex:0 0 100%;margin-top:.25rem;padding-left:var(--ifm-menu-link-padding-horizontal)}.menu__list-item:not(:first-child){margin-top:.25rem}.menu__list-item--collapsed .menu__list{height:0;overflow:hidden}.menu__list-item--collapsed .menu__caret:before,.menu__list-item--collapsed 
.menu__link--sublist:after{transform:rotate(90deg)}.menu__list-item-collapsible{display:flex;flex-wrap:wrap;position:relative}.menu__caret:hover,.menu__link:hover,.menu__list-item-collapsible--active,.menu__list-item-collapsible:hover{background:var(--ifm-menu-color-background-hover)}.menu__list-item-collapsible .menu__link--active,.menu__list-item-collapsible .menu__link:hover{background:none!important}.menu__caret,.menu__link{align-items:center;display:flex}.menu__link{color:var(--ifm-menu-color);flex:1;line-height:1.25}.menu__link:hover{color:var(--ifm-menu-color)}.menu__caret:before,.menu__link--sublist-caret:after{content:"";filter:var(--ifm-menu-link-sublist-icon-filter);height:1.25rem;transform:rotate(180deg);transition:transform var(--ifm-transition-fast) linear;width:1.25rem}.menu__link--sublist-caret:after{background:var(--ifm-menu-link-sublist-icon) 50%/2rem 2rem;margin-left:auto;min-width:1.25rem}.menu__link--active,.menu__link--active:hover{color:var(--ifm-menu-color-active)}.navbar__brand,.navbar__link{color:var(--ifm-navbar-link-color)}.menu__link--active:not(.menu__link--sublist){background-color:var(--ifm-menu-color-background-active)}.menu__caret:before{background:var(--ifm-menu-link-sublist-icon) 50%/2rem 2rem}.navbar--dark,html[data-theme=dark]{--ifm-menu-link-sublist-icon-filter:invert(100%) sepia(94%) saturate(17%) hue-rotate(223deg) brightness(104%) contrast(98%)}.navbar{background-color:var(--ifm-navbar-background-color);box-shadow:var(--ifm-navbar-shadow);height:var(--ifm-navbar-height);padding:var(--ifm-navbar-padding-vertical) 
var(--ifm-navbar-padding-horizontal)}.navbar,.navbar>.container,.navbar>.container-fluid{display:flex}.navbar--fixed-top{position:sticky;top:0;z-index:var(--ifm-z-index-fixed)}.navbar-sidebar,.navbar-sidebar__backdrop{bottom:0;left:0;opacity:0;position:fixed;top:0;transition-duration:var(--ifm-transition-fast);transition-timing-function:ease-in-out;visibility:hidden}.navbar__inner{display:flex;flex-wrap:wrap;justify-content:space-between;width:100%}.navbar__brand{align-items:center;display:flex;margin-right:1rem;min-width:0}.navbar__brand:hover{color:var(--ifm-navbar-link-hover-color)}.navbar__title{flex:1 1 auto}.navbar__toggle{display:none;margin-right:.5rem}.navbar__logo{flex:0 0 auto;height:2rem;margin-right:.5rem}.navbar__items{align-items:center;display:flex;flex:1;min-width:0}.navbar__items--center{flex:0 0 auto}.navbar__items--center .navbar__brand{margin:0}.navbar__items--center+.navbar__items--right{flex:1}.navbar__items--right{flex:0 0 auto;justify-content:flex-end}.navbar__items--right>:last-child{padding-right:0}.navbar__item{display:inline-block;padding:var(--ifm-navbar-item-padding-vertical) var(--ifm-navbar-item-padding-horizontal)}.navbar__link--active,.navbar__link:hover{color:var(--ifm-navbar-link-hover-color)}.navbar--dark,.navbar--primary{--ifm-menu-color:var(--ifm-color-gray-300);--ifm-navbar-link-color:var(--ifm-color-gray-100);--ifm-navbar-search-input-background-color:#ffffff1a;--ifm-navbar-search-input-placeholder-color:#ffffff80;color:var(--ifm-color-white)}.navbar--dark{--ifm-navbar-background-color:#242526;--ifm-menu-color-background-active:#ffffff0d;--ifm-navbar-search-input-color:var(--ifm-color-white)}.navbar--primary{--ifm-navbar-background-color:var(--ifm-color-primary);--ifm-navbar-link-hover-color:var(--ifm-color-white);--ifm-menu-color-active:var(--ifm-color-white);--ifm-navbar-search-input-color:var(--ifm-color-emphasis-500)}.navbar__search-input{appearance:none;background:var(--ifm-navbar-search-input-background-color) 
var(--ifm-navbar-search-input-icon) no-repeat .75rem center/1rem 1rem;border:none;border-radius:2rem;color:var(--ifm-navbar-search-input-color);cursor:text;display:inline-block;font-size:1rem;height:2rem;padding:0 .5rem 0 2.25rem;width:12.5rem}.navbar__search-input::placeholder{color:var(--ifm-navbar-search-input-placeholder-color)}.navbar-sidebar{background-color:var(--ifm-navbar-background-color);box-shadow:var(--ifm-global-shadow-md);transform:translate3d(-100%,0,0);transition-property:opacity,visibility,transform;width:var(--ifm-navbar-sidebar-width)}.navbar-sidebar--show .navbar-sidebar,.navbar-sidebar__items{transform:translateZ(0)}.navbar-sidebar--show .navbar-sidebar,.navbar-sidebar--show .navbar-sidebar__backdrop{opacity:1;visibility:visible}.navbar-sidebar__backdrop{background-color:#0009;right:0;transition-property:opacity,visibility}.navbar-sidebar__brand{align-items:center;box-shadow:var(--ifm-navbar-shadow);display:flex;flex:1;height:var(--ifm-navbar-height);padding:var(--ifm-navbar-padding-vertical) var(--ifm-navbar-padding-horizontal)}.navbar-sidebar__items{display:flex;height:calc(100% - var(--ifm-navbar-height));transition:transform var(--ifm-transition-fast) ease-in-out}.navbar-sidebar__items--show-secondary{transform:translate3d(calc((var(--ifm-navbar-sidebar-width))*-1),0,0)}.navbar-sidebar__item{flex-shrink:0;padding:.5rem;width:calc(var(--ifm-navbar-sidebar-width))}.navbar-sidebar__back{background:var(--ifm-menu-color-background-active);font-size:15px;font-weight:var(--ifm-button-font-weight);margin:0 0 .2rem -.5rem;padding:.6rem 1.5rem;position:relative;text-align:left;top:-.5rem;width:calc(100% + 
1rem)}.navbar-sidebar__close{display:flex;margin-left:auto}.pagination{column-gap:var(--ifm-pagination-page-spacing);display:flex;font-size:var(--ifm-pagination-font-size);padding-left:0}.pagination--sm{--ifm-pagination-font-size:0.8rem;--ifm-pagination-padding-horizontal:0.8rem;--ifm-pagination-padding-vertical:0.2rem}.pagination--lg{--ifm-pagination-font-size:1.2rem;--ifm-pagination-padding-horizontal:1.2rem;--ifm-pagination-padding-vertical:0.3rem}.pagination__item{display:inline-flex}.pagination__item>span{padding:var(--ifm-pagination-padding-vertical)}.pagination__item--active .pagination__link{color:var(--ifm-pagination-color-active)}.pagination__item--active .pagination__link,.pagination__item:not(.pagination__item--active):hover .pagination__link{background:var(--ifm-pagination-item-active-background)}.pagination__item--disabled,.pagination__item[disabled]{opacity:.25;pointer-events:none}.pagination__link{border-radius:var(--ifm-pagination-border-radius);color:var(--ifm-font-color-base);display:inline-block;padding:var(--ifm-pagination-padding-vertical) var(--ifm-pagination-padding-horizontal);transition:background var(--ifm-transition-fast) var(--ifm-transition-timing-default)}.pagination-nav{display:grid;grid-gap:var(--ifm-spacing-horizontal);gap:var(--ifm-spacing-horizontal);grid-template-columns:repeat(2,1fr)}.pagination-nav__link{border:1px solid var(--ifm-color-emphasis-300);border-radius:var(--ifm-pagination-nav-border-radius);display:block;height:100%;line-height:var(--ifm-heading-line-height);padding:var(--ifm-global-spacing);transition:border-color var(--ifm-transition-fast) var(--ifm-transition-timing-default)}.pagination-nav__link:hover{border-color:var(--ifm-pagination-nav-color-hover)}.pagination-nav__link--next{grid-column:2/3;text-align:right}.pagination-nav__label{font-size:var(--ifm-h4-font-size);font-weight:var(--ifm-heading-font-weight);word-break:break-word}.pagination-nav__link--prev .pagination-nav__label:before{content:"« 
"}.pagination-nav__link--next .pagination-nav__label:after{content:" »"}.pagination-nav__sublabel{color:var(--ifm-color-content-secondary);font-size:var(--ifm-h5-font-size);font-weight:var(--ifm-font-weight-semibold);margin-bottom:.25rem}.pills__item,.tabs{font-weight:var(--ifm-font-weight-bold)}.pills{display:flex;gap:var(--ifm-pills-spacing);padding-left:0}.pills__item{border-radius:.5rem;cursor:pointer;display:inline-block;padding:.25rem 1rem;transition:background var(--ifm-transition-fast) var(--ifm-transition-timing-default)}.pills__item--active{color:var(--ifm-pills-color-active)}.pills__item--active,.pills__item:not(.pills__item--active):hover{background:var(--ifm-pills-color-background-active)}.pills--block{justify-content:stretch}.pills--block .pills__item{flex-grow:1;text-align:center}.tabs{color:var(--ifm-tabs-color);display:flex;margin-bottom:0;overflow-x:auto;padding-left:0}.tabs__item{border-bottom:3px solid #0000;border-radius:var(--ifm-global-radius);cursor:pointer;display:inline-flex;padding:var(--ifm-tabs-padding-vertical) var(--ifm-tabs-padding-horizontal);transition:background-color var(--ifm-transition-fast) var(--ifm-transition-timing-default)}.tabs__item--active{border-bottom-color:var(--ifm-tabs-color-active-border);border-bottom-left-radius:0;border-bottom-right-radius:0;color:var(--ifm-tabs-color-active)}.tabs__item:hover{background-color:var(--ifm-hover-overlay)}.tabs--block{justify-content:stretch}.tabs--block 
.tabs__item{flex-grow:1;justify-content:center}html[data-theme=dark]{--ifm-color-scheme:dark;--ifm-color-emphasis-0:var(--ifm-color-gray-1000);--ifm-color-emphasis-100:var(--ifm-color-gray-900);--ifm-color-emphasis-200:var(--ifm-color-gray-800);--ifm-color-emphasis-300:var(--ifm-color-gray-700);--ifm-color-emphasis-400:var(--ifm-color-gray-600);--ifm-color-emphasis-600:var(--ifm-color-gray-400);--ifm-color-emphasis-700:var(--ifm-color-gray-300);--ifm-color-emphasis-800:var(--ifm-color-gray-200);--ifm-color-emphasis-900:var(--ifm-color-gray-100);--ifm-color-emphasis-1000:var(--ifm-color-gray-0);--ifm-background-color:#1b1b1d;--ifm-background-surface-color:#242526;--ifm-hover-overlay:#ffffff0d;--ifm-color-content:#e3e3e3;--ifm-color-content-secondary:#fff;--ifm-breadcrumb-separator-filter:invert(64%) sepia(11%) saturate(0%) hue-rotate(149deg) brightness(99%) contrast(95%);--ifm-code-background:#ffffff1a;--ifm-scrollbar-track-background-color:#444;--ifm-scrollbar-thumb-background-color:#686868;--ifm-scrollbar-thumb-hover-background-color:#7a7a7a;--ifm-table-stripe-background:#ffffff12;--ifm-toc-border-color:var(--ifm-color-emphasis-200);--ifm-color-primary-contrast-background:#102445;--ifm-color-primary-contrast-foreground:#ebf2fc;--ifm-color-secondary-contrast-background:#474748;--ifm-color-secondary-contrast-foreground:#fdfdfe;--ifm-color-success-contrast-background:#003100;--ifm-color-success-contrast-foreground:#e6f6e6;--ifm-color-info-contrast-background:#193c47;--ifm-color-info-contrast-foreground:#eef9fd;--ifm-color-warning-contrast-background:#4d3800;--ifm-color-warning-contrast-foreground:#fff8e6;--ifm-color-danger-contrast-background:#4b1113;--ifm-color-danger-contrast-foreground:#ffebec}}:root{--ifm-color-primary:#f59e0b;--ifm-color-primary-dark:#d97706;--ifm-color-primary-darker:#b45309;--ifm-color-primary-darkest:#92400e;--ifm-color-primary-light:#fbbf24;--ifm-color-primary-lighter:#fcd34d;--ifm-color-primary-lightest:#fde68a;--ifm-background-color:#f8fafc
;--ifm-background-surface-color:#fff;--ifm-font-color-base:#1e293b;--ifm-font-color-secondary:#64748b;--ifm-heading-color:#0f172a;--ifm-link-color:#f59e0b;--ifm-link-hover-color:#d97706;--ifm-code-font-size:95%;--ifm-code-background:#f1f5f9;--ifm-code-border-radius:6px;--ifm-code-padding-horizontal:0.4rem;--ifm-code-padding-vertical:0.15rem;--docusaurus-highlighted-code-line-bg:#f59e0b14;--ifm-card-background-color:#fff;--ifm-global-shadow-lw:0 2px 8px #0000000f;--ifm-global-shadow-md:0 4px 16px #00000014;--ifm-global-shadow-tl:0 8px 32px #0000001a;--ifm-global-radius:8px;--ifm-toc-border-color:#00000014;--ifm-navbar-height:3.75rem;--hp-primary:#fbbf24;--hp-primary-dark:#f59e0b;--hp-secondary:#8b5cf6;--hp-accent:#06b6d4;--hp-success:#10b981;--hp-dark:#27001d;--hp-dark-light:#3d0029;--hp-text:#e2e8f0;--hp-text-muted:#94a3b8;--hp-bg-card:#ffffff08;--hp-bg-page:#27001d}[data-theme=dark]{--ifm-color-primary:#fbbf24;--ifm-color-primary-dark:#f59e0b;--ifm-color-primary-darker:#d97706;--ifm-color-primary-darkest:#b45309;--ifm-color-primary-light:#fcd34d;--ifm-color-primary-lighter:#fde68a;--ifm-color-primary-lightest:#fef3c7;--ifm-background-color:#27001d;--ifm-background-surface-color:#3d0029;--ifm-font-color-base:#e2e8f0;--ifm-font-color-secondary:#94a3b8;--ifm-heading-color:#f1f5f9;--ifm-link-color:#fbbf24;--ifm-link-hover-color:#fcd34d;--ifm-code-background:#ffffff0f;--docusaurus-highlighted-code-line-bg:#fbbf2426;--ifm-card-background-color:#ffffff08;--ifm-global-shadow-lw:0 2px 8px #0000004d;--ifm-global-shadow-md:0 4px 16px #0006;--ifm-global-shadow-tl:0 8px 32px #00000080;--ifm-toc-border-color:#ffffff0f}.gradient-bg-global{height:100%;left:0;pointer-events:none;position:fixed;top:0;width:100%;z-index:0}.gradient-orb-global{animation:25s ease-in-out infinite a;border-radius:50%;filter:blur(100px);opacity:.25;position:absolute}[data-theme=light] 
.gradient-orb-global{opacity:.1}.orb-global-1{background:radial-gradient(circle,#fbbf24,#0000);height:600px;left:-10%;top:-10%;width:600px}.orb-global-2{animation-delay:8s;background:radial-gradient(circle,#f59e0b,#0000);height:500px;right:-10%;top:50%;width:500px}.orb-global-3{animation-delay:15s;background:radial-gradient(circle,#06b6d4,#0000);bottom:-20%;height:700px;left:30%;width:700px}.logo_Ukns,.navbar__title{animation:3s infinite b;-webkit-text-fill-color:#0000}@keyframes a{0%,to{transform:translate(0) scale(1)}33%{transform:translate(60px,-60px) scale(1.1)}66%{transform:translate(-40px,40px) scale(.9)}}.navbar{backdrop-filter:blur(20px);-webkit-backdrop-filter:blur(20px);background:#27001dcc!important;border-bottom:1px solid #ffffff0d;box-shadow:none;position:sticky;z-index:100}[data-theme=light] .navbar{background:#ffffffd9!important;border-bottom:1px solid #00000014}.navbar__title{background:linear-gradient(135deg,#fbbf24,#f59e0b,#06b6d4);-webkit-background-clip:text;background-size:200% 200%;font-weight:800;background-clip:text}.navbar__link{font-weight:500}[data-theme=dark] .navbar__link,[data-theme=dark] .pagination-nav__label{color:#e2e8f0}[data-theme=dark] .navbar__link--active,[data-theme=dark] .navbar__link:hover{color:#fbbf24}.navbar__toggle{color:var(--ifm-font-color-base)}.navbar-sidebar{background:var(--ifm-background-color)}.footer{background:#3d0029!important;border-top:1px solid #ffffff0d}[data-theme=light] .footer{background:#f1f5f9!important;border-top:1px solid #00000014}.footer__title{color:#e2e8f0;font-weight:700}[data-theme=light] .footer__title{color:#1e293b}.footer__link-item{color:#94a3b8;transition:color .3s}.footer__link-item:hover{color:#fbbf24;-webkit-text-decoration:none;text-decoration:none}.footer__copyright,[data-theme=light] .footer__link-item{color:#64748b}[data-theme=light] .footer__link-item:hover{color:#f59e0b}[data-theme=dark] .theme-doc-sidebar-container{border-right:1px solid #ffffff0d!important}[data-theme=dark] 
.menu{background:#0000}[data-theme=dark] .menu__link{border-radius:8px;color:#cbd5e1;transition:.2s}[data-theme=dark] .menu__link:hover{background:#fbbf241a;color:#e2e8f0}[data-theme=dark] .menu__link--active:not(.menu__link--sublist){background:#fbbf2426;color:#fbbf24;font-weight:600}[data-theme=dark] .menu__list-item-collapsible:hover{background:#fbbf2414}[data-theme=dark] .theme-doc-sidebar-item-category>.menu__list-item-collapsible>.menu__link{color:#e2e8f0;font-weight:600}.main-wrapper,[class*=docMainContainer],[class*=mainWrapper]{position:relative;z-index:1}.markdown h1,.markdown h2,.markdown h3,.markdown h4,.markdown h5,.markdown h6{color:var(--ifm-heading-color)}[data-theme=dark] table{border-color:#ffffff14}[data-theme=dark] table thead tr{background:#ffffff0a;border-bottom:1px solid #ffffff14}[data-theme=dark] table tbody tr{border-bottom:1px solid #ffffff0a}[data-theme=dark] table tbody tr:nth-child(2n){background:#ffffff05}[data-theme=dark] hr,[data-theme=dark] td,[data-theme=dark] th{border-color:#ffffff0f}[data-theme=dark] blockquote{background:#fbbf240d;border-left-color:#fbbf24;color:#cbd5e1}[data-theme=dark] .prism-code{background:#ffffff0a!important;border:1px solid #ffffff0f}[data-theme=dark] code{background:#ffffff0f;border:1px solid #ffffff14;color:#e2e8f0}[data-theme=dark] a code{color:var(--ifm-link-color)}[data-theme=dark] .codeBlockTitle_node_modules-\@docusaurus-theme-classic-lib-theme-CodeBlock-Content-styles-module{background:#ffffff0f!important;border-bottom:1px solid #ffffff0f}[data-theme=dark] .alert{background:#ffffff08;border:1px solid #ffffff0f;color:#e2e8f0}[data-theme=dark] .alert--info{background:#06b6d40f;border-left:4px solid #06b6d4}[data-theme=dark] .alert--warning{background:#f59e0b0f;border-left:4px solid #f59e0b}[data-theme=dark] .alert--danger{background:#ef44440f;border-left:4px solid #ef4444}[data-theme=dark] .alert--success{background:#10b9810f;border-left:4px solid #10b981}[data-theme=dark] 
.alert--secondary{background:#fbbf240f;border-left:4px solid #fbbf24}[data-theme=dark] .admonitionHeading_node_modules-\@docusaurus-theme-classic-lib-theme-Admonition-Layout-styles-module{color:inherit}[data-theme=dark] .pagination-nav__sublabel,[data-theme=dark] .table-of-contents__link{color:#94a3b8}[data-theme=dark] .table-of-contents__link--active,[data-theme=dark] .table-of-contents__link:hover,[data-theme=dark] article .avatar__name a{color:#fbbf24}[data-theme=dark] .table-of-contents{border-left:1px solid #ffffff0f}[data-theme=dark] .pagination-nav__link{background:#ffffff08;border:1px solid #ffffff14;border-radius:12px;transition:.3s}[data-theme=dark] .pagination-nav__link:hover{background:#fbbf240f;border-color:#fbbf244d}[data-theme=dark] .blog-post-page article header h1{color:#f1f5f9}[data-theme=dark] .blog-tags a{background:#fbbf241a;border:1px solid #fbbf2433;color:#fbbf24}[data-theme=dark] .blog-tags a:hover{background:#fbbf2433;border-color:#fbbf2466;-webkit-text-decoration:none;text-decoration:none}[data-theme=dark] .navbar__search-input{background:#ffffff0d;border:1px solid #ffffff1a;color:#e2e8f0}[data-theme=dark] .navbar__search-input::placeholder{color:#64748b}[data-theme=dark] .breadcrumbs__link{background:#ffffff0a;border-radius:6px;color:#94a3b8}[data-theme=dark] .breadcrumbs__link:hover{background:#fbbf241a;color:#e2e8f0}[data-theme=dark] .breadcrumbs__item--active .breadcrumbs__link{background:#fbbf241f;color:#fbbf24}[data-theme=dark] .tabs__item{border-bottom-color:#0000;color:#94a3b8}[data-theme=dark] .tabs__item:hover{color:#e2e8f0}[data-theme=dark] .tabs__item--active{border-bottom-color:#fbbf24;color:#fbbf24}[data-theme=dark] ::-webkit-scrollbar{height:8px;width:8px}[data-theme=dark] ::-webkit-scrollbar-track{background:#0000}[data-theme=dark] ::-webkit-scrollbar-thumb{background:#ffffff1f;border-radius:4px}[data-theme=dark] ::-webkit-scrollbar-thumb:hover{background:#fff3}[data-theme=dark] .dropdown__menu{background:#3d0029;border:1px 
solid #ffffff14}[data-theme=dark] .dropdown__link{color:#cbd5e1}[data-theme=dark] .dropdown__link:hover{background:#fbbf241a;color:#e2e8f0}[data-theme=dark] .dropdown__link--active{background:#fbbf241f;color:#fbbf24}html.homepage-active .footer,html.homepage-active .navbar{display:none!important}html.homepage-active main{margin-top:0}html.homepage-active [class*=docMainContainer],html.homepage-active [class*=mainWrapper]{padding-top:0}[data-theme=light] .theme-doc-sidebar-container{border-right:1px solid #0000000f}[data-theme=light] .menu__link--active:not(.menu__link--sublist){background:#f59e0b14;color:#f59e0b;font-weight:600}[data-theme=light] .menu__link:hover{background:#f59e0b0d}[data-theme=light] .pagination-nav__link{border-radius:12px;transition:.3s}[data-theme=light] .pagination-nav__link:hover{border-color:#f59e0b4d;box-shadow:0 4px 16px #f59e0b14}[data-theme=light] blockquote{border-left-color:#f59e0b}@layer docusaurus.core{#__docusaurus-base-url-issue-banner-container{display:none}}.btn_bvfa,.btn_bvfa:hover,.componentLink_RzJT,.componentLink_RzJT:hover,.footerLinks_lH9U a,.footerLinks_lH9U a:hover,.footerList_2l2h a,.footerList_2l2h a:hover,.logo_Ukns,.navLink_aQaq,.navLink_aQaq:hover{-webkit-text-decoration:none;text-decoration:none}.hero_aEcG,.navContainer_E5Tz{margin:0 auto;max-width:1400px}[data-theme=light]{--hp-dark:#f8fafc;--hp-dark-light:#f1f5f9;--hp-text:#1e293b;--hp-text-muted:#64748b;--hp-bg-card:#00000005;--hp-bg-page:#f8fafc}.homepageWrapper_H_rv{background:var(--hp-bg-page);color:var(--hp-text);font-family:-apple-system,BlinkMacSystemFont,Segoe UI,Roboto,sans-serif;line-height:1.6;overflow-x:hidden}.customNav_xRNg{backdrop-filter:blur(20px);background:#27001dd9;border-bottom:1px solid #ffffff26;padding:1.2rem 0;position:fixed;top:0;transition:transform .3s;width:100%;z-index:1000}[data-theme=light] .customNav_xRNg{background:#ffffffd9;border-bottom:1px solid 
#00000014}.navContainer_E5Tz{align-items:center;display:flex;justify-content:space-between;padding:0 2rem}.logo_Ukns{background:linear-gradient(135deg,#fbbf24,#f59e0b,#06b6d4);-webkit-background-clip:text;background-size:200% 200%;font-size:1.6rem;font-weight:800;background-clip:text}@keyframes b{0%,to{background-position:0 50%}50%{background-position:100% 50%}}.navLinks_FO3Z{align-items:center;display:flex;gap:2.5rem}.navLink_aQaq{color:var(--hp-text);font-weight:500;transition:color .3s}.footerLinks_lH9U a:hover,.footerList_2l2h a:hover,.navLink_aQaq:hover{color:var(--hp-primary)}.btn_bvfa{border:none;border-radius:50px;cursor:pointer;display:inline-block;font-size:1rem;font-weight:600;padding:.75rem 2rem;transition:.4s cubic-bezier(.175,.885,.32,1.275)}.btnPrimary_hBjO{background:linear-gradient(135deg,#fbbf24,#f59e0b);box-shadow:0 10px 30px #fbbf244d;color:#fff}.btnPrimary_hBjO:hover{box-shadow:0 15px 40px #fbbf2480;color:#fff;transform:translateY(-3px)}.btnSecondary_mRVh{background:#ffffff0d;border:2px solid #fbbf2480;color:var(--hp-text)}[data-theme=light] .btnSecondary_mRVh{background:#fbbf240d;border-color:#fbbf2466}.btnSecondary_mRVh:hover{background:#fbbf2433;border-color:var(--hp-primary);color:var(--hp-text);transform:translateY(-3px)}.btnWhite_DoE5{background:#fff;color:var(--hp-primary)}.btnWhite_DoE5:hover{background:#f8fafc;color:var(--hp-primary);transform:translateY(-3px) scale(1.05)}.btnOutlineWhite_Kzbe{background:#0000;border:2px solid #fff;color:#fff}.btnOutlineWhite_Kzbe:hover{background:#ffffff26;color:#fff;transform:translateY(-3px)}.hero_aEcG{align-items:center;display:grid;gap:4rem;grid-template-columns:1fr 1fr;min-height:100vh;overflow:hidden;padding:10rem 2rem 5rem;position:relative;z-index:1}.networkCanvas_S8Th{height:100%;left:-2rem;pointer-events:none;position:absolute;top:0;width:calc(100% + 
4rem);z-index:0}.ctaButtons_vsp7,.ctaDescription_HswS,.ctaTitle_arch,.customFooter_Ymmc,.heroContent_mKPX,.heroImage_xZN7,.section_Q9Zo{position:relative;z-index:1}.heroContent_mKPX{animation:1s ease-out c}@keyframes c{0%{opacity:0;transform:translateY(40px)}to{opacity:1;transform:translateY(0)}}.heroBadge_Z6oq{backdrop-filter:blur(10px);background:#fbbf241a;border:1px solid #fbbf244d;border-radius:50px;color:var(--hp-primary);display:inline-block;font-size:.9rem;font-weight:600;margin-bottom:2rem;padding:.5rem 1.5rem}.heroTitle_qg2I{background:linear-gradient(135deg,#fff,#a5b4fc);-webkit-background-clip:text;font-size:4.5rem;font-weight:900;line-height:1.1;margin-bottom:1.5rem;-webkit-text-fill-color:#0000;background-clip:text}.barrierAnswer_ZtxW,.barrierCard_tMSq p{line-height:1.8;color:var(--hp-text-muted)}[data-theme=light] .heroTitle_qg2I,[data-theme=light] .sectionTitle_Ut5p{background:linear-gradient(135deg,#1e293b,#fbbf24);-webkit-background-clip:text;-webkit-text-fill-color:#0000;background-clip:text}.heroSubtitle_jFu1{color:var(--hp-text-muted);font-size:1.25rem;line-height:1.8;margin-bottom:2.5rem}.heroButtons_r52D{display:flex;flex-wrap:wrap;gap:1.5rem}.heroImage_xZN7{animation:1s ease-out .3s both c}.heroImage_xZN7 img{border-radius:20px;box-shadow:0 40px 80px #00000080;width:100%}[data-theme=light] .heroImage_xZN7 img{box-shadow:0 40px 80px #00000026}.adoptionBadge_hbYR{animation:1s ease-out .6s both c;margin-top:3rem;text-align:center}.adoptionBadge_hbYR p{color:var(--hp-text-muted);font-size:.95rem}.section_Q9Zo{padding:8rem 2rem}.container_bfhl{margin:0 auto;max-width:1400px}.sectionHeader_Gahl{margin-bottom:5rem;text-align:center}.barrierCard_tMSq h3,.componentContent_xz2v 
h3,.sectionSubtitle_AZuW{font-weight:700;margin-bottom:1rem}.sectionSubtitle_AZuW{color:var(--hp-primary);font-size:.95rem;letter-spacing:2px;text-transform:uppercase}.sectionTitle_Ut5p{background:linear-gradient(135deg,#fff,#a5b4fc);-webkit-background-clip:text;font-size:3.5rem;font-weight:900;margin-bottom:1.5rem;-webkit-text-fill-color:#0000;background-clip:text}.barrierCard_tMSq,.componentCard_LlUg{backdrop-filter:blur(20px);background:var(--hp-bg-card);border:1px solid #ffffff14}.sectionDescription_cpL1{color:var(--hp-text-muted);font-size:1.2rem;margin:0 auto;max-width:800px}.barriersGrid_u0Jf,.videosGrid_FXHY{display:grid;gap:2.5rem;grid-template-columns:repeat(3,1fr);margin-top:4rem}.barrierCard_tMSq{border-radius:24px;padding:2.5rem;transition:.4s}[data-theme=light] .barrierCard_tMSq,[data-theme=light] .blogCard_hyds,[data-theme=light] .componentCard_LlUg,[data-theme=light] .statCard_w2S8,[data-theme=light] .videoCard_jGks{background:#fff;border-color:#00000014;box-shadow:0 4px 20px #0000000d}.barrierCard_tMSq:hover,.videoCard_jGks:hover{border-color:#fbbf244d;box-shadow:0 20px 50px #0006;transform:translateY(-8px)}.componentCardVisible_hAJc:hover,.componentCard_LlUg:hover{transform:translateY(-10px)}[data-theme=light] .barrierCard_tMSq:hover{border-color:#fbbf244d;box-shadow:0 20px 50px #fbbf241f}.barrierIcon_HTIA{font-size:2.5rem;margin-bottom:1.5rem}.barrierCard_tMSq h3{color:var(--hp-text);font-size:1.4rem}.barrierCard_tMSq p{font-size:.95rem}.barrierQuestions_jlWA{list-style:none;margin:1rem 0;padding:0}.barrierQuestions_jlWA li{color:var(--hp-text-muted);font-size:.92rem;line-height:1.6;padding:.4rem 0 .4rem 1.2rem;position:relative}.barrierQuestions_jlWA li:before{color:var(--hp-primary);content:"?";font-weight:700;left:0;position:absolute}.barrierAnswer_ZtxW{border-top:1px solid #ffffff0f;font-size:.92rem;margin-top:1rem;padding-top:1rem}.componentContent_xz2v,.statCard_w2S8{padding:2.5rem}[data-theme=light] 
.barrierAnswer_ZtxW{border-top-color:#0000000f}.componentsGrid_KtT5{display:grid;gap:3rem;grid-template-columns:repeat(3,1fr);margin-top:4rem}.componentCard_LlUg{border-radius:24px;opacity:0;overflow:hidden;transform:translateY(50px);transition:.5s cubic-bezier(.175,.885,.32,1.275)}.componentCardVisible_hAJc{opacity:1;transform:translateY(0)}.componentCard_LlUg:hover{border-color:#fbbf244d;box-shadow:0 30px 60px #00000080}[data-theme=light] .componentCard_LlUg:hover{box-shadow:0 30px 60px #fbbf241a}.blogCard_hyds:hover,.statCard_w2S8:hover{border-color:#fbbf244d;transform:translateY(-5px)}.componentContent_xz2v h3{color:var(--hp-text);font-size:1.6rem}.componentContent_xz2v p{color:var(--hp-text-muted);line-height:1.7;margin-bottom:1.5rem}.componentLink_RzJT{align-items:center;display:inline-flex;font-weight:600;gap:.5rem;transition:gap .3s}.blogCard_hyds,.statCard_w2S8,.videoCard_jGks{backdrop-filter:blur(20px);transition:.4s}.componentLink_RzJT,.componentLink_RzJT:hover{color:var(--hp-primary)}.componentLink_RzJT:hover{gap:1rem}.componentIcon_JDYs{align-items:center;background:linear-gradient(135deg,#fbbf241a,#f59e0b1a);display:flex;font-size:4rem;height:180px;justify-content:center;width:100%}[data-theme=light] .componentIcon_JDYs{background:linear-gradient(135deg,#fbbf240f,#f59e0b0f)}.statsSection_GUBq{background:#0003}[data-theme=light] .statsSection_GUBq{background:#fbbf2408}.statsGrid_wBRk{display:grid;gap:2.5rem;grid-template-columns:repeat(4,1fr);margin-top:4rem}.statCard_w2S8{background:var(--hp-bg-card);border:1px solid 
#ffffff14;border-radius:20px;text-align:center}.statLabel_I99V{color:var(--hp-text-muted);font-size:.9rem;letter-spacing:1.5px;margin-bottom:.5rem;text-transform:uppercase}.statValue_tB6D{background:linear-gradient(135deg,#fbbf24,#f59e0b);-webkit-background-clip:text;font-size:2.5rem;font-weight:900;-webkit-text-fill-color:#0000;background-clip:text}.statDescription_WIU_{color:var(--hp-text-muted);font-size:.95rem;margin-top:.5rem}.blogCard_hyds,.blogCard_hyds:hover{color:inherit;-webkit-text-decoration:none;text-decoration:none}.videoCard_jGks{background:var(--hp-bg-card);border:1px solid #ffffff14;border-radius:24px;overflow:hidden}[data-theme=light] .videoCard_jGks:hover{box-shadow:0 20px 50px #fbbf241f}.videoWrapper_XWWU{aspect-ratio:16/9;background:#000;overflow:hidden;position:relative;width:100%}.videoPlayer_Nt7m{display:block;height:100%;object-fit:cover;width:100%}.videoContent_pd0B{padding:1.5rem 2rem 2rem}.videoContent_pd0B h3{color:var(--hp-text);font-size:1.3rem;font-weight:700;margin-bottom:.5rem}.videoContent_pd0B p{color:var(--hp-text-muted);font-size:.92rem;line-height:1.6;margin:0}.blogGrid_Qec3{display:grid;gap:2.5rem;grid-template-columns:repeat(auto-fill,minmax(350px,1fr));margin-top:4rem}.blogCard_hyds{background:var(--hp-bg-card);border:1px solid #ffffff14;border-radius:20px;display:block;overflow:hidden}.blogCardIcon_JPeR{align-items:center;background:linear-gradient(135deg,#fbbf2426,#06b6d426);display:flex;font-size:3rem;height:160px;justify-content:center;width:100%}[data-theme=light] .blogCardIcon_JPeR{background:linear-gradient(135deg,#fbbf2414,#06b6d414)}.blogContent_dJxs{padding:2rem}.blogCategory_UY54{background:#fbbf2433;border-radius:12px;color:var(--hp-primary);display:inline-block;font-size:.75rem;font-weight:700;margin-bottom:1rem;padding:.25rem .75rem;text-transform:uppercase}.blogCard_hyds 
h3{color:var(--hp-text);font-size:1.3rem;font-weight:700;margin-bottom:.75rem}.blogMeta_skDH{align-items:center;color:var(--hp-text-muted);display:flex;font-size:.85rem;gap:.5rem}.ctaSection_bmsv{background:linear-gradient(135deg,#8b004d4d,#63003666);border:2px solid #8b004d80;border-radius:40px;margin:2rem 0;overflow:hidden;padding:6rem 4rem;position:relative;text-align:center}.ctaSection_bmsv:before{animation:20s linear infinite d;background:radial-gradient(circle,#ffffff1a 0,#0000 70%);content:"";height:200%;left:-50%;position:absolute;top:-50%;width:200%}@keyframes d{0%{transform:rotate(0)}to{transform:rotate(1turn)}}.ctaTitle_arch{background:none;color:#fff;font-size:3.5rem;font-weight:900;margin-bottom:1.5rem;-webkit-text-fill-color:#fff}.ctaDescription_HswS{color:#ffffffe6;font-size:1.3rem;margin-bottom:3rem}.ctaButtons_vsp7{display:flex;flex-wrap:wrap;gap:1.5rem;justify-content:center}.customFooter_Ymmc{background:var(--hp-dark-light);border-top:1px solid #ffffff0d;padding:5rem 2rem 2rem}[data-theme=light] .customFooter_Ymmc{background:#f1f5f9;border-top-color:#00000014}.footerContent_obNo{display:grid;gap:4rem;grid-template-columns:2fr 1fr 1fr 1fr;margin:0 auto 3rem;max-width:1400px}.footerSection__c07 h4{color:var(--hp-text);font-size:1.2rem;font-weight:700;margin-bottom:1.5rem}.footerBottom_nS2f,.footerLinks_lH9U a,.footerList_2l2h a,.footerSection__c07 p{color:var(--hp-text-muted)}.footerSection__c07 p{line-height:1.8}.footerList_2l2h{list-style:none;margin:0;padding:0}.footerList_2l2h li{margin-bottom:.75rem}.footerList_2l2h a{transition:.3s}.footerBottom_nS2f{align-items:center;border-top:1px solid #ffffff0d;display:flex;flex-wrap:wrap;gap:1rem;justify-content:space-between;margin:0 auto;max-width:1400px;padding-top:2rem}[data-theme=light] .footerBottom_nS2f{border-top-color:#00000014}.footerLinks_lH9U{display:flex;gap:2rem}.footerLinks_lH9U a{transition:color .3s}@layer docusaurus.theme-common{body:not(.navigation-with-keyboard) 
:not(input):focus{outline:0}.themedComponent_mlkZ{display:none}[data-theme=dark] .themedComponent--dark_xIcU,[data-theme=light] .themedComponent--light_NVdE,html:not([data-theme]) .themedComponent--light_NVdE{display:initial}.errorBoundaryError_a6uf{color:red;white-space:pre-wrap}.errorBoundaryFallback_VBag{color:red;padding:.55rem}.details_lb9f{--docusaurus-details-summary-arrow-size:0.38rem;--docusaurus-details-transition:transform 200ms ease;--docusaurus-details-decoration-color:grey}.details_lb9f>summary{cursor:pointer;list-style:none;padding-left:1rem;position:relative}.details_lb9f>summary::-webkit-details-marker{display:none}.details_lb9f>summary:before{border-color:#0000 #0000 #0000 var(--docusaurus-details-decoration-color);border-style:solid;border-width:var(--docusaurus-details-summary-arrow-size);content:"";left:0;position:absolute;top:.45rem;transform:rotate(0);transform-origin:calc(var(--docusaurus-details-summary-arrow-size)/2) 50%;transition:var(--docusaurus-details-transition)}.details_lb9f[data-collapsed=false].isBrowser_bmU9>summary:before,.details_lb9f[open]:not(.isBrowser_bmU9)>summary:before{transform:rotate(90deg)}.collapsibleContent_i85q{border-top:1px solid var(--docusaurus-details-decoration-color);margin-top:1rem;padding-top:1rem}.collapsibleContent_i85q p:last-child,.details_lb9f>summary>p:last-child{margin-bottom:0}}@layer docusaurus.theme-classic{:root{--docusaurus-progress-bar-color:var(--ifm-color-primary);--docusaurus-announcement-bar-height:auto;--docusaurus-collapse-button-bg:#0000;--docusaurus-collapse-button-bg-hover:#0000001a;--doc-sidebar-width:300px;--doc-sidebar-hidden-width:30px;--docusaurus-blog-social-icon-size:1rem;--docusaurus-tag-list-border:var(--ifm-color-emphasis-300)}#nprogress{pointer-events:none}#nprogress .bar{background:var(--docusaurus-progress-bar-color);height:2px;left:0;position:fixed;top:0;width:100%;z-index:1031}#nprogress .peg{box-shadow:0 0 10px var(--docusaurus-progress-bar-color),0 0 5px 
var(--docusaurus-progress-bar-color);height:100%;opacity:1;position:absolute;right:0;transform:rotate(3deg) translateY(-4px);width:100px}.skipToContent_fXgn{background-color:var(--ifm-background-surface-color);color:var(--ifm-color-emphasis-900);left:100%;padding:calc(var(--ifm-global-spacing)/2) var(--ifm-global-spacing);position:fixed;top:1rem;z-index:calc(var(--ifm-z-index-fixed) + 1)}.skipToContent_fXgn:focus{box-shadow:var(--ifm-global-shadow-md);left:1rem}.closeButton_CVFx{line-height:0;padding:0}.content_knG7{font-size:85%;padding:5px 0;text-align:center}.content_knG7 a{color:inherit;-webkit-text-decoration:underline;text-decoration:underline}.announcementBar_mb4j{align-items:center;background-color:var(--ifm-color-white);border-bottom:1px solid var(--ifm-color-emphasis-100);color:var(--ifm-color-black);display:flex;height:var(--docusaurus-announcement-bar-height)}.docSidebarContainer_YfHR,.navbarSearchContainer_Bca1:empty,.sidebarLogo_isFc,.toggleIcon_g3eP,html[data-announcement-bar-initially-dismissed=true] .announcementBar_mb4j{display:none}.announcementBarPlaceholder_vyr4{flex:0 0 10px}.announcementBarClose_gvF7{align-self:stretch;flex:0 0 30px}.announcementBarContent_xLdY{flex:1 1 auto}.toggle_vylO{height:2rem;width:2rem}.toggleButton_gllP{-webkit-tap-highlight-color:transparent;align-items:center;border-radius:50%;display:flex;height:100%;justify-content:center;transition:background var(--ifm-transition-fast);width:100%}.toggleButton_gllP:hover{background:var(--ifm-color-emphasis-200)}[data-theme-choice=dark] .darkToggleIcon_wfgR,[data-theme-choice=light] .lightToggleIcon_pyhR,[data-theme-choice=system] 
.systemToggleIcon_QzmC{display:initial}.toggleButtonDisabled_aARS{cursor:not-allowed}.darkNavbarColorModeToggle_X3D1:hover{background:var(--ifm-color-gray-800)}.backToTopButton_sjWU{background-color:var(--ifm-color-emphasis-200);border-radius:50%;bottom:1.3rem;box-shadow:var(--ifm-global-shadow-lw);height:3rem;opacity:0;position:fixed;right:1.3rem;transform:scale(0);transition:all var(--ifm-transition-fast) var(--ifm-transition-timing-default);visibility:hidden;width:3rem;z-index:calc(var(--ifm-z-index-fixed) - 1)}.backToTopButton_sjWU:after{background-color:var(--ifm-color-emphasis-1000);content:" ";display:inline-block;height:100%;-webkit-mask:var(--ifm-menu-link-sublist-icon) 50%/2rem 2rem no-repeat;mask:var(--ifm-menu-link-sublist-icon) 50%/2rem 2rem no-repeat;width:100%}.backToTopButtonShow_xfvO{opacity:1;transform:scale(1);visibility:visible}[data-theme=dark]:root{--docusaurus-collapse-button-bg:#ffffff0d;--docusaurus-collapse-button-bg-hover:#ffffff1a}.collapseSidebarButton_PEFL{display:none;margin:0}.iconExternalLink_nPIU{margin-left:.3rem}.dropdownNavbarItemMobile_J0Sd{cursor:pointer}.iconLanguage_nlXk{margin-right:5px;vertical-align:text-bottom}.navbarHideable_m1mJ{transition:transform var(--ifm-transition-fast) ease}.navbarHidden_jGov{transform:translate3d(0,calc(-100% - 2px),0)}.navbar__items--right>:last-child{padding-right:0}.footerLogoLink_BH7S{opacity:.5;transition:opacity var(--ifm-transition-fast) var(--ifm-transition-timing-default)}.footerLogoLink_BH7S:hover,.hash-link:focus,:hover>.hash-link{opacity:1}.menuExternalLink_NmtK{align-items:center}.docMainContainer_TBSr,.docRoot_UBD9{display:flex;width:100%}.authorSocialIcon_XYv3,.authorSocialLink_owbf{width:var(--docusaurus-blog-social-icon-size)}.docsWrapper_hBAB{display:flex;flex:1 0 auto}.anchorWithStickyNavbar_LWe7{scroll-margin-top:calc(var(--ifm-navbar-height) + .5rem)}.anchorWithHideOnScrollNavbar_WYt5{scroll-margin-top:.5rem}.hash-link{opacity:0;padding-left:.5rem;transition:opacity 
var(--ifm-transition-fast);-webkit-user-select:none;user-select:none}.hash-link:before{content:"#"}.docCardListItem_W1sv>*,body,html{height:100%}.mainWrapper_z2l0{display:flex;flex:1 0 auto;flex-direction:column}.docusaurus-mt-lg{margin-top:3rem}#__docusaurus{display:flex;flex-direction:column;min-height:100%}.sidebar_re4s{max-height:calc(100vh - var(--ifm-navbar-height) - 2rem);overflow-y:auto;position:sticky;top:calc(var(--ifm-navbar-height) + 2rem)}.authorSocials_rSDt,.authorTitle_nd0D{overflow:hidden;-webkit-box-orient:vertical}.sidebarItemTitle_pO2u{font-size:var(--ifm-h3-font-size);font-weight:var(--ifm-font-weight-bold)}.container_mt6G,.sidebarItemList_Yudw{font-size:.9rem}.sidebarItem__DBe{margin-top:.7rem}.sidebarItemLink_mo7H{color:var(--ifm-font-color-base);display:block}.sidebarItemLink_mo7H:hover{-webkit-text-decoration:none;text-decoration:none}.sidebarItemLinkActive_I1ZP{color:var(--ifm-color-primary)!important}.yearGroupHeading_rMGB{margin-bottom:.4rem;margin-top:1.6rem}.yearGroupHeading_QT03{margin:1rem .75rem .5rem}.cardContainer_fWXF{--ifm-link-color:var(--ifm-color-emphasis-800);--ifm-link-hover-color:var(--ifm-color-emphasis-700);--ifm-link-hover-decoration:none;border:1px solid var(--ifm-color-emphasis-200);box-shadow:0 1.5px 3px 0 #00000026;transition:all var(--ifm-transition-fast) ease;transition-property:border,box-shadow}.cardContainer_fWXF:hover{border-color:var(--ifm-color-primary);box-shadow:0 3px 6px 0 #0003}.admonitionContent_BuS1>:last-child,.cardContainer_fWXF :last-child{margin-bottom:0}.cardTitle_rnsV{font-size:1.2rem}.cardDescription_PWke{font-size:.8rem}.docCardListItem_W1sv{margin-bottom:2rem}.title_f1Hy{font-size:3rem}[data-theme=dark] .githubSvg_Uu4N,[data-theme=dark] .instagramSvg_YC40,[data-theme=dark] .threadsSvg_PTXY,[data-theme=dark] .xSvg_y3PF{fill:var(--light)}[data-theme=light] .githubSvg_Uu4N,[data-theme=light] .instagramSvg_YC40,[data-theme=light] .threadsSvg_PTXY,[data-theme=light] 
.xSvg_y3PF{fill:var(--dark)}.authorSocials_rSDt{align-items:center;display:flex;flex-wrap:wrap;line-clamp:1;-webkit-line-clamp:1}.authorSocialLink_owbf,.authorSocials_rSDt{height:var(--docusaurus-blog-social-icon-size);line-height:0}.authorSocialLink_owbf{margin-right:.4rem}.authorSocialIcon_XYv3{height:var(--docusaurus-blog-social-icon-size)}.authorImage_XqGP{--ifm-avatar-photo-size:3.6rem}.author-as-h1_n9oJ .authorImage_XqGP{--ifm-avatar-photo-size:7rem}.author-as-h2_gXvM .authorImage_XqGP{--ifm-avatar-photo-size:5.4rem}.authorDetails_lV9A{align-items:flex-start;display:flex;flex-direction:column;justify-content:space-around}.authorName_yefp{display:flex;flex-direction:row;font-size:1.1rem;line-height:1.1rem}.author-as-h1_n9oJ .authorName_yefp{display:inline;font-size:2.4rem;line-height:2.4rem}.author-as-h2_gXvM .authorName_yefp{display:inline;font-size:1.4rem;line-height:1.4rem}.authorTitle_nd0D{display:-webkit-box;font-size:.8rem;line-height:1rem;line-clamp:1;-webkit-line-clamp:1}.author-as-h1_n9oJ .authorTitle_nd0D{font-size:1.2rem;line-height:1.6rem}.author-as-h2_gXvM .authorTitle_nd0D{font-size:1rem;line-height:1.3rem}.authorBlogPostCount_iiJ5{background:var(--ifm-color-secondary);border-radius:var(--ifm-global-radius);color:var(--ifm-color-black);font-size:.8rem;line-height:1.2;margin-left:.3rem;padding:.1rem .4rem}.authorListItem_n3yI{list-style-type:none;margin-bottom:2rem}.authorCol_Hf19{max-width:inherit!important}.imageOnlyAuthorRow_pa_O{display:flex;flex-flow:row 
wrap}.imageOnlyAuthorCol_G86a{margin-left:.3rem;margin-right:.3rem}.codeBlockContainer_Ckt0{background:var(--prism-background-color);border-radius:var(--ifm-code-border-radius);box-shadow:var(--ifm-global-shadow-lw);color:var(--prism-color);margin-bottom:var(--ifm-leading)}.codeBlock_bY9V{--ifm-pre-background:var(--prism-background-color);margin:0;padding:0}.codeBlockStandalone_MEMb{padding:0}.codeBlockLines_e6Vv{float:left;font:inherit;min-width:100%;padding:var(--ifm-pre-padding)}.codeBlockLinesWithNumbering_o6Pm{display:table;padding:var(--ifm-pre-padding) 0}:where(:root){--docusaurus-highlighted-code-line-bg:#484d5b}:where([data-theme=dark]){--docusaurus-highlighted-code-line-bg:#646464}.theme-code-block-highlighted-line{background-color:var(--docusaurus-highlighted-code-line-bg);display:block;margin:0 calc(var(--ifm-pre-padding)*-1);padding:0 var(--ifm-pre-padding)}.codeLine_lJS_{counter-increment:a;display:table-row}.codeLineNumber_Tfdd{background:var(--ifm-pre-background);display:table-cell;left:0;overflow-wrap:normal;padding:0 var(--ifm-pre-padding);position:sticky;text-align:right;width:1%}.codeLineNumber_Tfdd:before{content:counter(a);opacity:.4}.theme-code-block-highlighted-line .codeLineNumber_Tfdd:before{opacity:.8}.codeLineContent_feaV{padding-right:var(--ifm-pre-padding)}.theme-code-block:hover .copyButtonCopied_Vdqa{opacity:1!important}.copyButtonIcons_IEyt{height:1.125rem;position:relative;width:1.125rem}.copyButtonIcon_TrPX,.copyButtonSuccessIcon_cVMy{left:0;position:absolute;top:0;fill:currentColor;height:inherit;opacity:inherit;transition:all var(--ifm-transition-fast) ease;width:inherit}.copyButtonSuccessIcon_cVMy{color:#00d600;left:50%;opacity:0;top:50%;transform:translate(-50%,-50%) scale(.33)}.copyButtonCopied_Vdqa .copyButtonIcon_TrPX{opacity:0;transform:scale(.33)}.copyButtonCopied_Vdqa .copyButtonSuccessIcon_cVMy{opacity:1;transform:translate(-50%,-50%) 
scale(1);transition-delay:75ms}.wordWrapButtonIcon_b1P5{height:1.2rem;width:1.2rem}.wordWrapButtonEnabled_uzNF .wordWrapButtonIcon_b1P5{color:var(--ifm-color-primary)}.buttonGroup_M5ko{column-gap:.2rem;display:flex;position:absolute;right:calc(var(--ifm-pre-padding)/2);top:calc(var(--ifm-pre-padding)/2)}.buttonGroup_M5ko button{align-items:center;background:var(--prism-background-color);border:1px solid var(--ifm-color-emphasis-300);border-radius:var(--ifm-global-radius);color:var(--prism-color);display:flex;line-height:0;opacity:0;padding:.4rem;transition:opacity var(--ifm-transition-fast) ease-in-out}.buttonGroup_M5ko button:focus-visible,.buttonGroup_M5ko button:hover{opacity:1!important}.theme-code-block:hover .buttonGroup_M5ko button{opacity:.4}.tag_zVej{border:1px solid var(--docusaurus-tag-list-border);transition:border var(--ifm-transition-fast)}.tag_zVej:hover{--docusaurus-tag-list-border:var(--ifm-link-color);-webkit-text-decoration:none;text-decoration:none}.tagRegular_sFm0{border-radius:var(--ifm-global-radius);font-size:90%;padding:.2rem .5rem .3rem}.tagWithCount_h2kH{align-items:center;border-left:0;display:flex;padding:0 .5rem 0 1rem;position:relative}.tagWithCount_h2kH:after,.tagWithCount_h2kH:before{border:1px solid var(--docusaurus-tag-list-border);content:"";position:absolute;top:50%;transition:inherit}.tagWithCount_h2kH:before{border-bottom:0;border-right:0;height:1.18rem;right:100%;transform:translate(50%,-50%) rotate(-45deg);width:1.18rem}.tagWithCount_h2kH:after{border-radius:50%;height:.5rem;left:0;transform:translateY(-50%);width:.5rem}.tagWithCount_h2kH span{background:var(--ifm-color-secondary);border-radius:var(--ifm-global-radius);color:var(--ifm-color-black);font-size:.7rem;line-height:1.2;margin-left:.3rem;padding:.1rem .4rem}.tag_Nnez{display:inline-block;margin:.5rem .5rem 0 1rem}.codeBlockContent_QJqH{border-radius:inherit;direction:ltr;position:relative}.codeBlockTitle_OeMC{border-bottom:1px solid 
var(--ifm-color-emphasis-300);border-top-left-radius:inherit;border-top-right-radius:inherit;font-size:var(--ifm-code-font-size);font-weight:500;padding:.75rem var(--ifm-pre-padding)}.codeBlockTitle_OeMC+.codeBlockContent_QJqH .codeBlock_a8dz{border-top-left-radius:0;border-top-right-radius:0}.tags_jXut{display:inline}.tag_QGVx{display:inline-block;margin:0 .4rem .5rem 0}.iconEdit_Z9Sw{margin-right:.3em;vertical-align:sub}.lastUpdated_JAkA{font-size:smaller;font-style:italic;margin-top:.2rem}.tocCollapsibleButton_TO0P{align-items:center;display:flex;font-size:inherit;justify-content:space-between;padding:.4rem .8rem;width:100%}.tocCollapsibleButton_TO0P:after{background:var(--ifm-menu-link-sublist-icon) 50% 50%/2rem 2rem no-repeat;content:"";filter:var(--ifm-menu-link-sublist-icon-filter);height:1.25rem;transform:rotate(180deg);transition:transform var(--ifm-transition-fast);width:1.25rem}.tocCollapsibleButtonExpanded_MG3E:after,.tocCollapsibleExpanded_sAul{transform:none}.tocCollapsible_ETCw{background-color:var(--ifm-menu-color-background-active);border-radius:var(--ifm-global-radius);margin:1rem 0}.tocCollapsibleContent_vkbj>ul{border-left:none;border-top:1px solid var(--ifm-color-emphasis-300);font-size:15px;padding:.2rem 0}.tocCollapsibleContent_vkbj ul li{margin:.4rem .8rem}.tocCollapsibleContent_vkbj a{display:block}.details_b_Ee{--docusaurus-details-decoration-color:var(--ifm-alert-border-color);--docusaurus-details-transition:transform var(--ifm-transition-fast) ease;border:1px solid var(--ifm-alert-border-color);margin:0 0 var(--ifm-spacing-vertical)}.containsTaskList_mC6p{list-style:none}:not(.containsTaskList_mC6p>li)>.containsTaskList_mC6p{padding-left:0}.img_ev3q{height:auto}.tableOfContents_bqdL{max-height:calc(100vh - var(--ifm-navbar-height) - 2rem);overflow-y:auto;position:sticky;top:calc(var(--ifm-navbar-height) + 1rem)}.admonition_xJq3{margin-bottom:1em}.admonitionHeading_Gvgb{font:var(--ifm-heading-font-weight) 
var(--ifm-h5-font-size)/var(--ifm-heading-line-height) var(--ifm-heading-font-family);text-transform:uppercase}.admonitionHeading_Gvgb:not(:last-child){margin-bottom:.3rem}.admonitionHeading_Gvgb code{text-transform:none}.admonitionIcon_Rf37{display:inline-block;margin-right:.4em;vertical-align:middle}.admonitionIcon_Rf37 svg{display:inline-block;height:1.6em;width:1.6em;fill:var(--ifm-alert-foreground-color)}.breadcrumbHomeIcon_YNFT{height:1.1rem;position:relative;top:1px;vertical-align:top;width:1.1rem}.breadcrumbsContainer_Z_bl{--ifm-breadcrumb-size-multiplier:0.8;margin-bottom:.8rem}.title_kItE{--ifm-h1-font-size:3rem;margin-bottom:calc(var(--ifm-leading)*1.25)}.docItemContainer_Djhp article>:first-child,.docItemContainer_Djhp header+*{margin-top:0}.mdxPageWrapper_j9I6{justify-content:center}}@media (min-width:997px){.collapseSidebarButton_PEFL,.expandButton_TmdG{background-color:var(--docusaurus-collapse-button-bg)}:root{--docusaurus-announcement-bar-height:30px}.announcementBarClose_gvF7,.announcementBarPlaceholder_vyr4{flex-basis:50px}.collapseSidebarButton_PEFL{border:1px solid var(--ifm-toc-border-color);border-radius:0;bottom:0;display:block!important;height:40px;position:sticky}.collapseSidebarButtonIcon_kv0_{margin-top:4px;transform:rotate(180deg)}.expandButtonIcon_i1dp,[dir=rtl] .collapseSidebarButtonIcon_kv0_{transform:rotate(0)}.collapseSidebarButton_PEFL:focus,.collapseSidebarButton_PEFL:hover,.expandButton_TmdG:focus,.expandButton_TmdG:hover{background-color:var(--docusaurus-collapse-button-bg-hover)}.navbarSearchContainer_Bca1{padding:var(--ifm-navbar-item-padding-vertical) var(--ifm-navbar-item-padding-horizontal)}.menuHtmlItem_M9Kj{padding:var(--ifm-menu-link-padding-vertical) var(--ifm-menu-link-padding-horizontal)}.menu_SIkG{flex-grow:1;padding:.5rem}@supports (scrollbar-gutter:stable){.menu_SIkG{padding:.5rem 0 .5rem 
.5rem;scrollbar-gutter:stable}}.menuWithAnnouncementBar_GW3s{margin-bottom:var(--docusaurus-announcement-bar-height)}.sidebar_njMd{display:flex;flex-direction:column;height:100%;padding-top:var(--ifm-navbar-height);width:var(--doc-sidebar-width)}.sidebarWithHideableNavbar_wUlq{padding-top:0}.sidebarHidden_VK0M{opacity:0;visibility:hidden}.sidebarLogo_isFc{align-items:center;color:inherit!important;display:flex!important;margin:0 var(--ifm-navbar-padding-horizontal);max-height:var(--ifm-navbar-height);min-height:var(--ifm-navbar-height);-webkit-text-decoration:none!important;text-decoration:none!important}.sidebarLogo_isFc img{height:2rem;margin-right:.5rem}.expandButton_TmdG{align-items:center;display:flex;height:100%;justify-content:center;position:absolute;right:0;top:0;transition:background-color var(--ifm-transition-fast) ease;width:100%}[dir=rtl] .expandButtonIcon_i1dp{transform:rotate(180deg)}.docSidebarContainer_YfHR{border-right:1px solid var(--ifm-toc-border-color);clip-path:inset(0);display:block;margin-top:calc(var(--ifm-navbar-height)*-1);transition:width var(--ifm-transition-fast) ease;width:var(--doc-sidebar-width);will-change:width}.docSidebarContainerHidden_DPk8{cursor:pointer;width:var(--doc-sidebar-hidden-width)}.sidebarViewport_aRkj{height:100%;max-height:100vh;position:sticky;top:0}.docMainContainer_TBSr{flex-grow:1;max-width:calc(100% - var(--doc-sidebar-width))}.docMainContainerEnhanced_lQrH{max-width:calc(100% - var(--doc-sidebar-hidden-width))}.docItemWrapperEnhanced_JWYK{max-width:calc(var(--ifm-container-width) + var(--doc-sidebar-width))!important}.lastUpdated_JAkA{text-align:right}.tocMobile_ITEo{display:none}.docItemCol_VOVn,.generatedIndexPage_vN6x{max-width:75%!important}}@media (min-width:1440px){.container{max-width:var(--ifm-container-width-xl)}}@media (max-width:1024px){.hero_aEcG{grid-template-columns:1fr;padding-top:8rem;text-align:center}.heroImage_xZN7{margin:2rem 0 
0;order:-1}.heroContent_mKPX{order:1}.heroButtons_r52D{justify-content:center}.componentsGrid_KtT5,.footerContent_obNo,.statsGrid_wBRk,.videosGrid_FXHY{grid-template-columns:1fr 1fr}.barriersGrid_u0Jf{grid-template-columns:1fr}}@media (max-width:996px){.col{--ifm-col-width:100%;flex-basis:var(--ifm-col-width);margin-left:0}.footer{--ifm-footer-padding-horizontal:0}.colorModeToggle_DEke,.footer__link-separator,.navbar__item,.sidebar_re4s,.tableOfContents_bqdL{display:none}.footer__col{margin-bottom:calc(var(--ifm-spacing-vertical)*3)}.footer__link-item{display:block;width:max-content}.hero{padding-left:0;padding-right:0}.navbar>.container,.navbar>.container-fluid{padding:0}.navbar__toggle{display:inherit}.navbar__search-input{width:9rem}.pills--block,.tabs--block{flex-direction:column}.navbarSearchContainer_Bca1{position:absolute;right:var(--ifm-navbar-padding-horizontal)}.docItemContainer_F8PC{padding:0 .3rem}}@media (max-width:768px){.heroTitle_qg2I{font-size:3rem}.ctaTitle_arch,.sectionTitle_Ut5p{font-size:2.5rem}.navLinks_FO3Z a:not(.btn_bvfa):not(.btnPrimary_hBjO){display:none}.blogGrid_Qec3,.componentsGrid_KtT5,.footerContent_obNo,.statsGrid_wBRk,.videosGrid_FXHY{grid-template-columns:1fr}.ctaSection_bmsv{border-radius:20px;padding:4rem 2rem}.section_Q9Zo{padding:4rem 1.5rem}.hero_aEcG{padding:7rem 1.5rem 3rem}}@media (max-width:576px){.markdown h1:first-child{--ifm-h1-font-size:2rem}.markdown>h2{--ifm-h2-font-size:1.5rem}.markdown>h3{--ifm-h3-font-size:1.25rem}.title_f1Hy{font-size:2rem}}@media (max-width:480px){.heroTitle_qg2I{font-size:2.2rem}.sectionTitle_Ut5p{font-size:2rem}.heroButtons_r52D{align-items:center;flex-direction:column}}@media (hover:hover){.backToTopButton_sjWU:hover{background-color:var(--ifm-color-emphasis-300)}}@media 
(pointer:fine){.thin-scrollbar{scrollbar-width:thin}.thin-scrollbar::-webkit-scrollbar{height:var(--ifm-scrollbar-size);width:var(--ifm-scrollbar-size)}.thin-scrollbar::-webkit-scrollbar-track{background:var(--ifm-scrollbar-track-background-color);border-radius:10px}.thin-scrollbar::-webkit-scrollbar-thumb{background:var(--ifm-scrollbar-thumb-background-color);border-radius:10px}.thin-scrollbar::-webkit-scrollbar-thumb:hover{background:var(--ifm-scrollbar-thumb-hover-background-color)}}@media (prefers-reduced-motion:reduce){:root{--ifm-transition-fast:0ms;--ifm-transition-slow:0ms}}@media print{.announcementBar_mb4j,.footer,.menu,.navbar,.pagination-nav,.table-of-contents,.tocMobile_ITEo{display:none}.tabs{page-break-inside:avoid}.codeBlockLines_e6Vv{white-space:pre-wrap}} \ No newline at end of file diff --git a/docs/assets/images/skye-rt-consumer-flow-7f064a31c41151ff4516900b3170dbc8.png b/docs/assets/images/skye-rt-consumer-flow-7f064a31c41151ff4516900b3170dbc8.png new file mode 100644 index 00000000..11e40769 Binary files /dev/null and b/docs/assets/images/skye-rt-consumer-flow-7f064a31c41151ff4516900b3170dbc8.png differ diff --git a/docs/assets/images/skye-system-overview-24940f4c319f41fb3b7583a525b0a534.png b/docs/assets/images/skye-system-overview-24940f4c319f41fb3b7583a525b0a534.png new file mode 100644 index 00000000..2f992dbf Binary files /dev/null and b/docs/assets/images/skye-system-overview-24940f4c319f41fb3b7583a525b0a534.png differ diff --git a/docs/assets/images/v1.0.0-predator-hld-949215d6604ae103e724c3978e803443.png b/docs/assets/images/v1.0.0-predator-hld-949215d6604ae103e724c3978e803443.png new file mode 100644 index 00000000..3e8a21ad Binary files /dev/null and b/docs/assets/images/v1.0.0-predator-hld-949215d6604ae103e724c3978e803443.png differ diff --git a/docs/assets/js/00b12b9c.31789eb1.js b/docs/assets/js/00b12b9c.31789eb1.js new file mode 100644 index 00000000..a8d46566 --- /dev/null +++ b/docs/assets/js/00b12b9c.31789eb1.js @@ -0,0 +1 @@ 
+"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[7048],{411:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/mp-dag-976ff51caf25f09d977ccc10e70918f3.png"},721:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"},1106:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-two","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-two/index.md","source":"@site/blog/bharatmlstack-history/post-two/index.md","title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","description":"BharatMLStack","date":"2023-04-10T00:00:00.000Z","tags":[{"inline":true,"label":"inferflow","permalink":"/BharatMLStack/blog/tags/inferflow"},{"inline":true,"label":"interaction-store","permalink":"/BharatMLStack/blog/tags/interaction-store"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":6.31,"hasTruncateMarker":false,"authors":[{"name":"Bhawani Singh","title":"Architect @ Meesho","url":"https://github.com/singh-bhawani","imageURL":"https://github.com/singh-bhawani.png","key":"bhawani","page":null},{"name":"Jigar Dave","title":"Lead Software Engineer @ Meesho","url":"https://github.com/jigarpatel26","imageURL":"https://github.com/jigarpatel26.png","key":"jigar","page":null},{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null}],"frontMatter":{"slug":"post-two","title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 
2)","authors":["bhawani","jigar","adarsha"],"date":"2023-4-10","tags":["inferflow","interaction-store","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","permalink":"/BharatMLStack/blog/post-three"},"nextItem":{"title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)","permalink":"/BharatMLStack/blog/post-one"}}')},7704:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/mp-matrix-43994f433f78905ccbd10cfe284f3c9f.png"},8453:(e,n,t)=>{t.d(n,{R:()=>a,x:()=>o});var i=t(6540);const r={},s=i.createContext(r);function a(e){const n=i.useContext(s);return i.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(r):e.components||r:a(e.components),i.createElement(s.Provider,{value:n},e.children)}},8517:(e,n,t)=>{t.r(n),t.d(n,{assets:()=>l,contentTitle:()=>o,default:()=>h,frontMatter:()=>a,metadata:()=>i,toc:()=>c});var i=t(1106),r=t(4848),s=t(8453);const a={slug:"post-two",title:"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)",authors:["bhawani","jigar","adarsha"],date:"2023-4-10",tags:["inferflow","interaction-store","mlplatform","meesho","bharatmlstack"]},o=void 0,l={authorsImageUrls:[void 0,void 0,void 0]},c=[{value:"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)",id:"building-meeshos-ml-platform-lessons-from-the-first-gen-system-part-2",level:2},{value:"The Cost of Success",id:"the-cost-of-success",level:3},{value:"Scaling Pains (and Cassandra\u2019s Limits)",id:"scaling-pains-and-cassandras-limits",level:3},{value:"Interaction Store Woes",id:"interaction-store-woes",level:3},{value:"Silver Linings",id:"silver-linings",level:3},{value:"Round Two: Solving the Top 2 Bottlenecks",id:"round-two-solving-the-top-2-bottlenecks",level:3},{value:"Problem 1: No-Code Feature 
Retrieval for Model Inference",id:"problem-1-no-code-feature-retrieval-for-model-inference",level:4},{value:"Problem 2: Scaling Without Breaking the Bank",id:"problem-2-scaling-without-breaking-the-bank",level:4},{value:"Optimizing the Online Feature Store",id:"optimizing-the-online-feature-store",level:4},{value:"Optimizing the Interaction Store",id:"optimizing-the-interaction-store",level:4},{value:"Results",id:"results",level:4},{value:"The Catch: Our ML Hosting Hit a Hard Limit",id:"the-catch-our-ml-hosting-hit-a-hard-limit",level:4},{value:"Conclusion: From Firefighting to Future-Proofing",id:"conclusion-from-firefighting-to-future-proofing",level:3}];function d(e){const n={h2:"h2",h3:"h3",h4:"h4",img:"img",li:"li",ol:"ol",p:"p",ul:"ul",...(0,s.R)(),...e.components};return(0,r.jsxs)(r.Fragment,{children:[(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"BharatMLStack",src:t(721).A+"",width:"1396",height:"460"})}),"\n",(0,r.jsx)(n.h2,{id:"building-meeshos-ml-platform-lessons-from-the-first-gen-system-part-2",children:"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)"}),"\n",(0,r.jsx)(n.p,{children:"By late 2022, we had built something we were truly proud of\u2014a real-time ML serving system with a DAG-based executor, a feature store, and an interaction store powering key ranking and personalization models. It was a major milestone, the culmination of months of effort from data scientists, ML engineers, and backend teams. Our system was live, and we were ready to push the boundaries of experimentation.\nAnd it worked. Mostly.\nBut soon, cracks appeared. Every new model needed custom feature retrieval logic, DAGs became dense and unmanageable, and scaling turned into a constant firefight. Costs surged, and infra bottlenecks slowed experimentation. 
Our system worked, but it wasn\u2019t built for scale.\nThis is the story of how we tackled these challenges\u2014building Inferflow for seamless feature retrieval, optimizing real-time infra, and cutting costs while scaling to millions of QPS."}),"\n",(0,r.jsx)(n.h3,{id:"the-cost-of-success",children:"The Cost of Success"}),"\n",(0,r.jsx)(n.p,{children:"Every new Ranker model required its own feature set, often pulling from different entities. Each addition meant:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Adding new DAG nodes in IOP"}),"\n",(0,r.jsx)(n.li,{children:"Writing custom logic to fetch features from multiple sources (e.g., user, product, user \xd7 category)"}),"\n",(0,r.jsx)(n.li,{children:"Inferring intermediate features (e.g., extracting category from a product to fetch user \xd7 category data)"}),"\n",(0,r.jsx)(n.li,{children:"Optimizing I/O and dealing with the inevitable bugs"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"What began as clean DAGs soon turned into a tangled web of cross-dependent graphs. Every experimentation cycle meant new nodes, new dependencies, and slower iterations."}),"\n",(0,r.jsx)(n.h3,{id:"scaling-pains-and-cassandras-limits",children:"Scaling Pains (and Cassandra\u2019s Limits)"}),"\n",(0,r.jsx)(n.p,{children:"At some point, we were hitting:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"250\u2013300K reads/sec"}),"\n",(0,r.jsx)(n.li,{children:"1M writes/sec (during lean hours)"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"All of this ran on Cassandra. While its distributed architecture had been proven in production, operating large-scale clusters came with considerable infrastructure overhead. Our proof-of-concept (POC) demonstrated throughput of around 100K ops/sec, but as we scaled further, the challenges grew. Ensuring node health, optimizing compaction, and maintaining storage balance became increasingly demanding. 
We also observed latency spikes under heavy load, alongside a sharp increase in total cost of ownership."}),"\n",(0,r.jsx)(n.h3,{id:"interaction-store-woes",children:"Interaction Store Woes"}),"\n",(0,r.jsx)(n.p,{children:"Our interaction store was another ticking time bomb:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udea8 Clusters kept growing in size and cost"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udea8 Latency spikes became increasingly frequent"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udea8 The DMC proxy occasionally lost locality of nodes against shards, causing cross-node communication and degraded performance"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"Each time this happened, we had to manually rebalance shards just to restore stable latency, making operations unsustainable at scale."}),"\n",(0,r.jsx)(n.h3,{id:"silver-linings",children:"Silver Linings"}),"\n",(0,r.jsx)(n.p,{children:"Despite the chaos, the system was live and delivering value:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Real-time infrastructure was in production"}),"\n",(0,r.jsx)(n.li,{children:"Costs dropped by 60\u201370% compared to offline personalization"}),"\n",(0,r.jsx)(n.li,{children:"New experiments rolled out faster and more successfully"}),"\n",(0,r.jsx)(n.li,{children:"User engagement metrics improved"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"It wasn\u2019t perfect. It was far from easy. But it worked\u2014and that counted for a lot."}),"\n",(0,r.jsx)(n.h3,{id:"round-two-solving-the-top-2-bottlenecks",children:"Round Two: Solving the Top 2 Bottlenecks"}),"\n",(0,r.jsx)(n.p,{children:"With the first-gen system stretched to its limits, we stepped back. 
Conversations with data scientists and backend engineers revealed three recurring pain points:"}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsx)(n.li,{children:"Coding feature retrieval logic for every new model was becoming unsustainable"}),"\n",(0,r.jsx)(n.li,{children:"ML scale was exploding\u2014bringing rising infra costs with it"}),"\n",(0,r.jsx)(n.li,{children:"Real-time embedding search was the next big unlock"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"We tackled them one by one\u2014starting with the biggest pain point."}),"\n",(0,r.jsx)(n.h4,{id:"problem-1-no-code-feature-retrieval-for-model-inference",children:"Problem 1: No-Code Feature Retrieval for Model Inference"}),"\n",(0,r.jsx)(n.p,{children:"We noticed a pattern: for personalized ranking, models needed features from:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\u2705 Product"}),"\n",(0,r.jsx)(n.li,{children:"\u2705 User"}),"\n",(0,r.jsx)(n.li,{children:"\u2705 User \xd7 Category"}),"\n",(0,r.jsx)(n.li,{children:"\u2705 Region, cohort, sub-category, etc."}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"A key insight emerged: Entities that contribute features for a model always map back to the context entities."}),"\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"MP Dag",src:t(411).A+"",width:"1272",height:"512"})}),"\n",(0,r.jsx)(n.p,{children:"With this, we designed Inferflow, a graph-driven feature retrieval and model orchestration system:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"1\ufe0f\u20e3 Inferflow takes a modelId and context IDs (e.g., userId, productIds)"}),"\n",(0,r.jsx)(n.li,{children:"2\ufe0f\u20e3 Loads a pre-defined feature retrieval graph from ZooKeeper"}),"\n",(0,r.jsx)(n.li,{children:"3\ufe0f\u20e3 Executes the graph to resolve entity relationships dynamically"}),"\n",(0,r.jsx)(n.li,{children:"4\ufe0f\u20e3 Outputs a 2D matrix of feature vectors"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"\ud83d\udca1 The 
impact?"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\ude80 No more custom feature retrieval code\u2014just graph updates in config"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\ude80 Feature consistency across experiments"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\ude80 Faster iteration cycles for ranking, fraud detection, and beyond"}),"\n"]}),"\n",(0,r.jsxs)(n.p,{children:["Here\u2019s a visual example that shows how this graph plays out during execution. We further extended the graph to call multiple models as needed:\n",(0,r.jsx)(n.img,{alt:"MP matrix",src:t(7704).A+"",width:"1262",height:"768"}),"\nWe built Inferflow in GoLang, using gRPC and Proto3 serialization for efficiency."]}),"\n",(0,r.jsx)(n.h4,{id:"problem-2-scaling-without-breaking-the-bank",children:"Problem 2: Scaling Without Breaking the Bank"}),"\n",(0,r.jsx)(n.p,{children:"With more ML use cases coming online, we needed to cut costs without compromising performance. We focused on:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udd39 Online Feature Store"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udd39 Interaction Store"}),"\n"]}),"\n",(0,r.jsx)(n.h4,{id:"optimizing-the-online-feature-store",children:"Optimizing the Online Feature Store"}),"\n",(0,r.jsx)(n.p,{children:"Our costs were concentrated in:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Database (Cassandra)"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Cache (Redis)"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Running Pods (Java services)"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"1\ufe0f\u20e3 Replacing Cassandra with ScyllaDB\nAs we hit the operational limits of large Cassandra clusters, we transitioned to ScyllaDB, which offered a seamless drop-in replacement without major code changes. 
The switch brought significant benefits:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Throughput: Matched or exceeded Cassandra's performance under identical workloads, even under high concurrency."}),"\n",(0,r.jsx)(n.li,{children:"Latency: Achieved consistently lower P99 latencies due to ScyllaDB's shard-per-core architecture and better I/O utilization."}),"\n",(0,r.jsx)(n.li,{children:"Cost Efficiency: Reduced infra footprint by ~70% through better CPU and memory efficiency, eliminating the need for over-provisioned nodes."}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"2\ufe0f\u20e3 Finding the Right Cache\nTo reduce backend load and improve response times, we benchmarked multiple caching solutions\u2014Memcached, KeyDB, and Dragonfly\u2014under real production traffic patterns. Dragonfly stood out due to its robust architecture and operational simplicity:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Data Skew Handling: Efficiently managed extreme key hotness and uneven access patterns without performance degradation."}),"\n",(0,r.jsx)(n.li,{children:"Throughput: Delivered consistently high throughput, even with large object sizes and concurrent access."}),"\n",(0,r.jsx)(n.li,{children:"Ease of Adoption: Acted as a drop-in Redis replacement with full protocol compatibility\u2014no changes needed in application code or client libraries."}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"3\ufe0f\u20e3 Moving to GoLang for Cost-Efficient Serving\nJava services were memory-heavy\u2014so we rewrote core services in GoLang. The results?"}),"\n",(0,r.jsx)(n.p,{children:"\u2705 Memory usage dropped by ~80%\n\u2705 CPU utilization was significantly lower\n\u2705 Faster, more efficient deployments"}),"\n",(0,r.jsx)(n.h4,{id:"optimizing-the-interaction-store",children:"Optimizing the Interaction Store"}),"\n",(0,r.jsx)(n.p,{children:"We realized that we only need a user\u2019s interaction data in Redis when they open the app. 
So, we implemented a tiered storage approach:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Cold Tier (ScyllaDB)\u2014Stores click, order, wishlist events"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Hot Tier (Redis)\u2014Loads a user\u2019s past interactions only when they open the app"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"Smart Offloading: We introduced an inactivity tracker to detect when a user session ends. At that point, Redis data was flushed back to Scylla, reducing unnecessary writes."}),"\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"InteractionStore",src:t(9497).A+"",width:"1242",height:"572"})}),"\n",(0,r.jsx)(n.h4,{id:"results",children:"Results"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Online Feature Store hit 1M QPS for the first time during the 2023 Mega Blockbuster Sale\u2014without breaking a sweat"}),"\n",(0,r.jsx)(n.li,{children:"Infra costs for Online Feature Store and Interaction Store dropped by ~60%"}),"\n"]}),"\n",(0,r.jsx)(n.h4,{id:"the-catch-our-ml-hosting-hit-a-hard-limit",children:"The Catch: Our ML Hosting Hit a Hard Limit"}),"\n",(0,r.jsx)(n.p,{children:"While planning for 2023 MBS, we ran into a critical scalability bottleneck:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\u274c Insufficient compute availability in our region for ML instances"}),"\n",(0,r.jsx)(n.li,{children:"\u274c Couldn\u2019t provision enough nodes to handle real-time inference at scale"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"This forced us to rethink where and how we hosted our models. 
The existing setup was great for prototyping\u2014but it wasn\u2019t built to handle the bursty, high-QPS demands of real-world production workloads."}),"\n",(0,r.jsx)(n.h3,{id:"conclusion-from-firefighting-to-future-proofing",children:"Conclusion: From Firefighting to Future-Proofing"}),"\n",(0,r.jsx)(n.p,{children:"What started as an ambitious experiment turned into a real-time ML infrastructure that powered millions of requests per second. We battled scaling pains, rethought feature retrieval with Inferflow, and rebuilt our infra stack for efficiency\u2014driving down costs while improving experimentation velocity.\nBut new challenges emerged. Our infrastructure could now handle scale, but our ML model hosting setup hit a hard limit. With compute availability bottlenecks threatening real-time inference, we faced a critical decision: how do we make model serving as scalable and cost-efficient as the rest of our stack? That\u2019s the next piece of the puzzle\u2014and the story of Part 3."})]})}function h(e={}){const{wrapper:n}={...(0,s.R)(),...e.components};return n?(0,r.jsx)(n,{...e,children:(0,r.jsx)(d,{...e})}):d(e)}},9497:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/interaction-str-d9e7aefea121aefb4e94c6c9f060d016.png"}}]); \ No newline at end of file diff --git a/docs/assets/js/00b12b9c.ea8fba0b.js b/docs/assets/js/00b12b9c.ea8fba0b.js deleted file mode 100644 index 179d11b7..00000000 --- a/docs/assets/js/00b12b9c.ea8fba0b.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[7048],{1106:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-two","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-two/index.md","source":"@site/blog/bharatmlstack-history/post-two/index.md","title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 
2)","description":"BharatMLStack","date":"2023-04-10T00:00:00.000Z","tags":[{"inline":true,"label":"inferflow","permalink":"/BharatMLStack/blog/tags/inferflow"},{"inline":true,"label":"interaction-store","permalink":"/BharatMLStack/blog/tags/interaction-store"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":6.31,"hasTruncateMarker":false,"authors":[{"name":"Bhawani Singh","title":"Architect @ Meesho","url":"https://github.com/singh-bhawani","imageURL":"https://github.com/singh-bhawani.png","key":"bhawani","page":null},{"name":"Jigar Dave","title":"Lead Software Engineer @ Meesho","url":"https://github.com/jigarpatel26","imageURL":"https://github.com/jigarpatel26.png","key":"jigar","page":null},{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null}],"frontMatter":{"slug":"post-two","title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","authors":["bhawani","jigar","adarsha"],"date":"2023-4-10","tags":["inferflow","interaction-store","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","permalink":"/BharatMLStack/blog/post-three"},"nextItem":{"title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)","permalink":"/BharatMLStack/blog/post-one"}}')},3086:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"},4114:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/mp-dag-976ff51caf25f09d977ccc10e70918f3.png"},8111:(e,n,t)=>{t.d(n,{A:()=>i});const 
i=t.p+"assets/images/mp-matrix-43994f433f78905ccbd10cfe284f3c9f.png"},8453:(e,n,t)=>{t.d(n,{R:()=>a,x:()=>o});var i=t(6540);const r={},s=i.createContext(r);function a(e){const n=i.useContext(s);return i.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(r):e.components||r:a(e.components),i.createElement(s.Provider,{value:n},e.children)}},8517:(e,n,t)=>{t.r(n),t.d(n,{assets:()=>l,contentTitle:()=>o,default:()=>h,frontMatter:()=>a,metadata:()=>i,toc:()=>c});var i=t(1106),r=t(4848),s=t(8453);const a={slug:"post-two",title:"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)",authors:["bhawani","jigar","adarsha"],date:"2023-4-10",tags:["inferflow","interaction-store","mlplatform","meesho","bharatmlstack"]},o=void 0,l={authorsImageUrls:[void 0,void 0,void 0]},c=[{value:"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)",id:"building-meeshos-ml-platform-lessons-from-the-first-gen-system-part-2",level:2},{value:"The Cost of Success",id:"the-cost-of-success",level:3},{value:"Scaling Pains (and Cassandra\u2019s Limits)",id:"scaling-pains-and-cassandras-limits",level:3},{value:"Interaction Store Woes",id:"interaction-store-woes",level:3},{value:"Silver Linings",id:"silver-linings",level:3},{value:"Round Two: Solving the Top 2 Bottlenecks",id:"round-two-solving-the-top-2-bottlenecks",level:3},{value:"Problem 1: No-Code Feature Retrieval for Model Inference",id:"problem-1-no-code-feature-retrieval-for-model-inference",level:4},{value:"Problem 2: Scaling Without Breaking the Bank",id:"problem-2-scaling-without-breaking-the-bank",level:4},{value:"Optimizing the Online Feature Store",id:"optimizing-the-online-feature-store",level:4},{value:"Optimizing the Interaction Store",id:"optimizing-the-interaction-store",level:4},{value:"Results",id:"results",level:4},{value:"The Catch: Our ML Hosting Hit a Hard 
Limit",id:"the-catch-our-ml-hosting-hit-a-hard-limit",level:4},{value:"Conclusion: From Firefighting to Future-Proofing",id:"conclusion-from-firefighting-to-future-proofing",level:3}];function d(e){const n={h2:"h2",h3:"h3",h4:"h4",img:"img",li:"li",ol:"ol",p:"p",ul:"ul",...(0,s.R)(),...e.components};return(0,r.jsxs)(r.Fragment,{children:[(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"BharatMLStack",src:t(3086).A+"",width:"1396",height:"460"})}),"\n",(0,r.jsx)(n.h2,{id:"building-meeshos-ml-platform-lessons-from-the-first-gen-system-part-2",children:"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)"}),"\n",(0,r.jsx)(n.p,{children:"By late 2022, we had built something we were truly proud of\u2014a real-time ML serving system with a DAG-based executor, a feature store, and an interaction store powering key ranking and personalization models. It was a major milestone, the culmination of months of effort from data scientists, ML engineers, and backend teams. Our system was live, and we were ready to push the boundaries of experimentation.\nAnd it worked. Mostly.\nBut soon, cracks appeared. Every new model needed custom feature retrieval logic, DAGs became dense and unmanageable, and scaling turned into a constant firefight. Costs surged, and infra bottlenecks slowed experimentation. Our system worked, but it wasn\u2019t built for scale.\nThis is the story of how we tackled these challenges\u2014building Inferflow for seamless feature retrieval, optimizing real-time infra, and cutting costs while scaling to millions of QPS."}),"\n",(0,r.jsx)(n.h3,{id:"the-cost-of-success",children:"The Cost of Success"}),"\n",(0,r.jsx)(n.p,{children:"Every new Ranker model required its own feature set, often pulling from different entities. 
Each addition meant:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Adding new DAG nodes in IOP"}),"\n",(0,r.jsx)(n.li,{children:"Writing custom logic to fetch features from multiple sources (e.g., user, product, user \xd7 category)"}),"\n",(0,r.jsx)(n.li,{children:"Inferring intermediate features (e.g., extracting category from a product to fetch user \xd7 category data)"}),"\n",(0,r.jsx)(n.li,{children:"Optimizing I/O and dealing with the inevitable bugs"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"What began as clean DAGs soon turned into a tangled web of cross-dependent graphs. Every experimentation cycle meant new nodes, new dependencies, and slower iterations."}),"\n",(0,r.jsx)(n.h3,{id:"scaling-pains-and-cassandras-limits",children:"Scaling Pains (and Cassandra\u2019s Limits)"}),"\n",(0,r.jsx)(n.p,{children:"At some point, we were hitting:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"250\u2013300K reads/sec"}),"\n",(0,r.jsx)(n.li,{children:"1M writes/sec (during lean hours)"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"All of this ran on Cassandra. While its distributed architecture had been proven in production, operating large-scale clusters came with considerable infrastructure overhead. Our proof-of-concept (POC) demonstrated throughput of around 100K ops/sec, but as we scaled further, the challenges grew. Ensuring node health, optimizing compaction, and maintaining storage balance became increasingly demanding. 
We also observed latency spikes under heavy load, alongside a sharp increase in total cost of ownership."}),"\n",(0,r.jsx)(n.h3,{id:"interaction-store-woes",children:"Interaction Store Woes"}),"\n",(0,r.jsx)(n.p,{children:"Our interaction store was another ticking time bomb:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udea8 Clusters kept growing in size and cost"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udea8 Latency spikes became increasingly frequent"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udea8 The DMC proxy occasionally lost locality of nodes against shards, causing cross-node communication and degraded performance"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"Each time this happened, we had to manually rebalance shards just to restore stable latency, making operations unsustainable at scale."}),"\n",(0,r.jsx)(n.h3,{id:"silver-linings",children:"Silver Linings"}),"\n",(0,r.jsx)(n.p,{children:"Despite the chaos, the system was live and delivering value:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Real-time infrastructure was in production"}),"\n",(0,r.jsx)(n.li,{children:"Costs dropped by 60\u201370% compared to offline personalization"}),"\n",(0,r.jsx)(n.li,{children:"New experiments rolled out faster and more successfully"}),"\n",(0,r.jsx)(n.li,{children:"User engagement metrics improved"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"It wasn\u2019t perfect. It was far from easy. But it worked\u2014and that counted for a lot."}),"\n",(0,r.jsx)(n.h3,{id:"round-two-solving-the-top-2-bottlenecks",children:"Round Two: Solving the Top 2 Bottlenecks"}),"\n",(0,r.jsx)(n.p,{children:"With the first-gen system stretched to its limits, we stepped back. 
Conversations with data scientists and backend engineers revealed three recurring pain points:"}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsx)(n.li,{children:"Coding feature retrieval logic for every new model was becoming unsustainable"}),"\n",(0,r.jsx)(n.li,{children:"ML scale was exploding\u2014bringing rising infra costs with it"}),"\n",(0,r.jsx)(n.li,{children:"Real-time embedding search was the next big unlock"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"We tackled them one by one\u2014starting with the biggest pain point."}),"\n",(0,r.jsx)(n.h4,{id:"problem-1-no-code-feature-retrieval-for-model-inference",children:"Problem 1: No-Code Feature Retrieval for Model Inference"}),"\n",(0,r.jsx)(n.p,{children:"We noticed a pattern: for personalized ranking, models needed features from:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\u2705 Product"}),"\n",(0,r.jsx)(n.li,{children:"\u2705 User"}),"\n",(0,r.jsx)(n.li,{children:"\u2705 User \xd7 Category"}),"\n",(0,r.jsx)(n.li,{children:"\u2705 Region, cohort, sub-category, etc."}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"A key insight emerged: Entities that contribute features for a model always map back to the context entities."}),"\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"MP Dag",src:t(4114).A+"",width:"1272",height:"512"})}),"\n",(0,r.jsx)(n.p,{children:"With this, we designed Inferflow, a graph-driven feature retrieval and model orchestration system:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"1\ufe0f\u20e3 Inferflow takes a modelId and context IDs (e.g., userId, productIds)"}),"\n",(0,r.jsx)(n.li,{children:"2\ufe0f\u20e3 Loads a pre-defined feature retrieval graph from ZooKeeper"}),"\n",(0,r.jsx)(n.li,{children:"3\ufe0f\u20e3 Executes the graph to resolve entity relationships dynamically"}),"\n",(0,r.jsx)(n.li,{children:"4\ufe0f\u20e3 Outputs a 2D matrix of feature vectors"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"\ud83d\udca1 The 
impact?"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\ude80 No more custom feature retrieval code\u2014just graph updates in config"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\ude80 Feature consistency across experiments"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\ude80 Faster iteration cycles for ranking, fraud detection, and beyond"}),"\n"]}),"\n",(0,r.jsxs)(n.p,{children:["Here\u2019s a visual example that shows how this graph plays out during execution. We further extended the graph to call multiple models as needed:\n",(0,r.jsx)(n.img,{alt:"MP matrix",src:t(8111).A+"",width:"1262",height:"768"}),"\nWe built Inferflow in GoLang, using gRPC and Proto3 serialization for efficiency."]}),"\n",(0,r.jsx)(n.h4,{id:"problem-2-scaling-without-breaking-the-bank",children:"Problem 2: Scaling Without Breaking the Bank"}),"\n",(0,r.jsx)(n.p,{children:"With more ML use cases coming online, we needed to cut costs without compromising performance. We focused on:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udd39 Online Feature Store"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udd39 Interaction Store"}),"\n"]}),"\n",(0,r.jsx)(n.h4,{id:"optimizing-the-online-feature-store",children:"Optimizing the Online Feature Store"}),"\n",(0,r.jsx)(n.p,{children:"Our costs were concentrated in:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Database (Cassandra)"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Cache (Redis)"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Running Pods (Java services)"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"1\ufe0f\u20e3 Replacing Cassandra with ScyllaDB\nAs we hit the operational limits of large Cassandra clusters, we transitioned to ScyllaDB, which offered a seamless drop-in replacement without major code changes. 
The switch brought significant benefits:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Throughput: Matched or exceeded Cassandra's performance under identical workloads, even under high concurrency."}),"\n",(0,r.jsx)(n.li,{children:"Latency: Achieved consistently lower P99 latencies due to ScyllaDB's shard-per-core architecture and better I/O utilization."}),"\n",(0,r.jsx)(n.li,{children:"Cost Efficiency: Reduced infra footprint by ~70% through better CPU and memory efficiency, eliminating the need for over-provisioned nodes."}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"2\ufe0f\u20e3 Finding the Right Cache\nTo reduce backend load and improve response times, we benchmarked multiple caching solutions\u2014Memcached, KeyDB, and Dragonfly\u2014under real production traffic patterns. Dragonfly stood out due to its robust architecture and operational simplicity:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Data Skew Handling: Efficiently managed extreme key hotness and uneven access patterns without performance degradation."}),"\n",(0,r.jsx)(n.li,{children:"Throughput: Delivered consistently high throughput, even with large object sizes and concurrent access."}),"\n",(0,r.jsx)(n.li,{children:"Ease of Adoption: Acted as a drop-in Redis replacement with full protocol compatibility\u2014no changes needed in application code or client libraries."}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"3\ufe0f\u20e3 Moving to GoLang for Cost-Efficient Serving\nJava services were memory-heavy\u2014so we rewrote core services in GoLang. The results?"}),"\n",(0,r.jsx)(n.p,{children:"\u2705 Memory usage dropped by ~80%\n\u2705 CPU utilization was significantly lower\n\u2705 Faster, more efficient deployments"}),"\n",(0,r.jsx)(n.h4,{id:"optimizing-the-interaction-store",children:"Optimizing the Interaction Store"}),"\n",(0,r.jsx)(n.p,{children:"We realized that we only need a user\u2019s interaction data in Redis when they open the app. 
So, we implemented a tiered storage approach:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Cold Tier (ScyllaDB)\u2014Stores click, order, wishlist events"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Hot Tier (Redis)\u2014Loads a user\u2019s past interactions only when they open the app"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"Smart Offloading: We introduced an inactivity tracker to detect when a user session ends. At that point, Redis data was flushed back to Scylla, reducing unnecessary writes."}),"\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"InteractionStore",src:t(9758).A+"",width:"1242",height:"572"})}),"\n",(0,r.jsx)(n.h4,{id:"results",children:"Results"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Online Feature Store hit 1M QPS for the first time during the 2023 Mega Blockbuster Sale\u2014without breaking a sweat"}),"\n",(0,r.jsx)(n.li,{children:"Infra costs for Online Feature Store and Interaction Store dropped by ~60%"}),"\n"]}),"\n",(0,r.jsx)(n.h4,{id:"the-catch-our-ml-hosting-hit-a-hard-limit",children:"The Catch: Our ML Hosting Hit a Hard Limit"}),"\n",(0,r.jsx)(n.p,{children:"While planning for 2023 MBS, we ran into a critical scalability bottleneck:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\u274c Insufficient compute availability in our region for ML instances"}),"\n",(0,r.jsx)(n.li,{children:"\u274c Couldn\u2019t provision enough nodes to handle real-time inference at scale"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"This forced us to rethink where and how we hosted our models. 
The existing setup was great for prototyping\u2014but it wasn\u2019t built to handle the bursty, high-QPS demands of real-world production workloads."}),"\n",(0,r.jsx)(n.h3,{id:"conclusion-from-firefighting-to-future-proofing",children:"Conclusion: From Firefighting to Future-Proofing"}),"\n",(0,r.jsx)(n.p,{children:"What started as an ambitious experiment turned into a real-time ML infrastructure that powered millions of requests per second. We battled scaling pains, rethought feature retrieval with Inferflow, and rebuilt our infra stack for efficiency\u2014driving down costs while improving experimentation velocity.\nBut new challenges emerged. Our infrastructure could now handle scale, but our ML model hosting setup hit a hard limit. With compute availability bottlenecks threatening real-time inference, we faced a critical decision: how do we make model serving as scalable and cost-efficient as the rest of our stack? That\u2019s the next piece of the puzzle\u2014and the story of Part 3."})]})}function h(e={}){const{wrapper:n}={...(0,s.R)(),...e.components};return n?(0,r.jsx)(n,{...e,children:(0,r.jsx)(d,{...e})}):d(e)}},9758:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/interaction-str-d9e7aefea121aefb4e94c6c9f060d016.png"}}]); \ No newline at end of file diff --git a/docs/assets/js/0413d9af.aecac3d5.js b/docs/assets/js/0413d9af.aecac3d5.js deleted file mode 100644 index 88711431..00000000 --- a/docs/assets/js/0413d9af.aecac3d5.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[9919],{7114:(e,n,t)=>{t.r(n),t.d(n,{assets:()=>c,contentTitle:()=>l,default:()=>h,frontMatter:()=>a,metadata:()=>s,toc:()=>o});const s=JSON.parse('{"id":"sdks/python/v1.0.0/grpc_feature_client","title":"GRPC Feature client","description":"PyPI 
version","source":"@site/docs/sdks/python/v1.0.0/grpc_feature_client.md","sourceDirName":"sdks/python/v1.0.0","slug":"/sdks/python/v1.0.0/grpc_feature_client","permalink":"/BharatMLStack/sdks/python/v1.0.0/grpc_feature_client","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/sdks/python/v1.0.0/grpc_feature_client.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"GRPC Feature client","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"v1.0.0","permalink":"/BharatMLStack/category/v100"},"next":{"title":"Spark client","permalink":"/BharatMLStack/sdks/python/v1.0.0/spark_feature_push_client"}}');var i=t(4848),r=t(8453);const a={title:"GRPC Feature client",sidebar_position:1},l="GRPC Feature Client",c={},o=[{value:"Installation",id:"installation",level:2},{value:"Dependencies",id:"dependencies",level:2},{value:"Features",id:"features",level:2},{value:"Quick Start",id:"quick-start",level:2},{value:"API Reference",id:"api-reference",level:2},{value:"GRPCFeatureClient",id:"grpcfeatureclient",level:3},{value:"GRPCClientConfig",id:"grpcclientconfig",level:3},{value:"Usage Examples",id:"usage-examples",level:2},{value:"Persisting Features",id:"persisting-features",level:3},{value:"Retrieving Features",id:"retrieving-features",level:3},{value:"With Context Management",id:"with-context-management",level:3},{value:"When to Use",id:"when-to-use",level:2},{value:"Related Packages",id:"related-packages",level:2},{value:"Contributing",id:"contributing",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function d(e){const n={a:"a",code:"code",h1:"h1",h2:"h2",h3:"h3",header:"header",hr:"hr",img:"img",li:"li",p:"p",pre:"pre",strong:"strong",ul:"ul",...(0,r.R)(),...e.components};return(0,i.jsxs)(i.Fragment,{children:[(0,i.jsx)(n.header,{children:(0,i.jsx)(n.h1,{id:"grpc-feature-client",children:"GRPC Feature 
Client"})}),"\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.a,{href:"https://badge.fury.io/py/grpc_feature_client",children:(0,i.jsx)(n.img,{src:"https://img.shields.io/pypi/v/grpc_feature_client?label=pypi-package&color=light%20green",alt:"PyPI version"})}),"\n",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/actions/workflows/py-sdk.yml",children:(0,i.jsx)(n.img,{src:"https://github.com/Meesho/BharatMLStack/actions/workflows/py-sdk.yml/badge.svg",alt:"Build Status"})}),"\n",(0,i.jsx)(n.a,{href:"https://www.python.org/downloads/",children:(0,i.jsx)(n.img,{src:"https://img.shields.io/badge/python-3.7+-blue.svg",alt:"Python 3.7+"})}),"\n",(0,i.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:(0,i.jsx)(n.img,{src:"https://img.shields.io/badge/Discord-Join%20Chat-7289da?style=flat&logo=discord&logoColor=white",alt:"Discord"})}),"\n",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:(0,i.jsx)(n.img,{src:"https://img.shields.io/badge/License-BharatMLStack%20BSL%201.1-blue.svg",alt:"License"})})]}),"\n",(0,i.jsx)(n.p,{children:"High-performance gRPC client for BharatML Stack real-time feature operations with direct API access."}),"\n",(0,i.jsx)(n.h2,{id:"installation",children:"Installation"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"pip install grpc_feature_client\n"})}),"\n",(0,i.jsx)(n.h2,{id:"dependencies",children:"Dependencies"}),"\n",(0,i.jsx)(n.p,{children:"This package depends on:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:(0,i.jsx)(n.a,{href:"https://pypi.org/project/bharatml_commons/",children:"bharatml_commons"})}),": Common utilities and protobuf definitions"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"grpcio>=1.50.0"}),": gRPC framework"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"grpcio-tools>=1.50.0"}),": gRPC tools for 
protobuf"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"features",children:"Features"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Direct gRPC API"}),": persist, retrieve, retrieveDecoded operations"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Go SDK Compatible"}),": Same authentication and API semantics"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Batch Processing"}),": Automatic batching with parallel execution"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Real-time Focus"}),": Low-latency feature persistence and retrieval"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Context Management"}),": Timeout and metadata handling"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Connection Pooling"}),": Efficient connection management"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"quick-start",children:"Quick Start"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-python",children:'from grpc_feature_client import GRPCFeatureClient, GRPCClientConfig\n\n# Configure for real-time operations\nconfig = GRPCClientConfig(\n server_address="localhost:50051",\n job_id="realtime-service",\n job_token="api-token"\n)\n\nclient = GRPCFeatureClient(config)\n\n# Direct API operations\nresult = client.persist_features(entity_label, keys_schema, feature_groups, data)\nfeatures = client.retrieve_decoded_features(entity_label, feature_groups, keys, entity_keys)\n'})}),"\n",(0,i.jsx)(n.h2,{id:"api-reference",children:"API Reference"}),"\n",(0,i.jsx)(n.h3,{id:"grpcfeatureclient",children:"GRPCFeatureClient"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-python",children:"class GRPCFeatureClient:\n def __init__(self, config: GRPCClientConfig)\n \n def persist_features(\n self,\n entity_label: str,\n keys_schema: List[str],\n feature_group_schemas: List[Dict[str, Any]],\n data_rows: List[Dict[str, Any]],\n timeout: 
Optional[float] = None\n ) -> Dict[str, Any]\n \n def retrieve_features(\n self,\n entity_label: str,\n feature_groups: List[Dict[str, Any]],\n keys_schema: List[str],\n entity_keys: List[List[str]],\n timeout: Optional[float] = None\n ) -> Dict[str, Any]\n \n def retrieve_decoded_features(\n self,\n entity_label: str,\n feature_groups: List[Dict[str, Any]],\n keys_schema: List[str],\n entity_keys: List[List[str]],\n timeout: Optional[float] = None\n ) -> Dict[str, Any]\n"})}),"\n",(0,i.jsx)(n.h3,{id:"grpcclientconfig",children:"GRPCClientConfig"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-python",children:"class GRPCClientConfig:\n def __init__(\n self,\n server_address: str,\n job_id: str,\n job_token: str,\n use_tls: bool = False,\n timeout_seconds: float = 30.0,\n metadata: Dict[str, str] = None,\n max_receive_message_length: int = 4 * 1024 * 1024,\n max_send_message_length: int = 4 * 1024 * 1024\n )\n"})}),"\n",(0,i.jsx)(n.h2,{id:"usage-examples",children:"Usage Examples"}),"\n",(0,i.jsx)(n.h3,{id:"persisting-features",children:"Persisting Features"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-python",children:'from grpc_feature_client import GRPCFeatureClient, GRPCClientConfig\n\nconfig = GRPCClientConfig(\n server_address="feature-store.example.com:50051",\n job_id="predator-service",\n job_token="api-token"\n)\n\nclient = GRPCFeatureClient(config)\n\n# Persist real-time features\nresult = client.persist_features(\n entity_label="user_interaction",\n keys_schema=["user_id", "session_id"],\n feature_group_schemas=[{\n "label": "realtime_features",\n "feature_labels": ["click_count", "page_views"]\n }],\n data_rows=[{\n "user_id": "u123",\n "session_id": "s456",\n "click_count": 5,\n "page_views": 3\n }]\n)\n\nprint(f"Persist result: {result}")\n'})}),"\n",(0,i.jsx)(n.h3,{id:"retrieving-features",children:"Retrieving 
Features"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-python",children:'# Retrieve features for ML model inference\nfeatures = client.retrieve_decoded_features(\n entity_label="user_interaction",\n feature_groups=[{\n "label": "user_features",\n "feature_labels": ["age", "location"]\n }],\n keys_schema=["user_id"],\n entity_keys=[["u123"], ["u456"]]\n)\n\nprint(f"Retrieved features: {features}")\n'})}),"\n",(0,i.jsx)(n.h3,{id:"with-context-management",children:"With Context Management"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-python",children:"# Use client with automatic cleanup\nwith GRPCFeatureClient(config) as client:\n result = client.persist_features(...)\n features = client.retrieve_decoded_features(...)\n# Connection automatically closed\n"})}),"\n",(0,i.jsx)(n.h2,{id:"when-to-use",children:"When to Use"}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Use grpc_feature_client for:"})}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["\ud83d\ude80 ",(0,i.jsx)(n.strong,{children:"Real-time Operations"}),": Direct persist/retrieve operations"]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udd0d ",(0,i.jsx)(n.strong,{children:"Interactive Queries"}),": Low-latency feature lookups"]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83c\udfaf ",(0,i.jsx)(n.strong,{children:"API Integration"}),": Service-to-service communication"]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udca8 ",(0,i.jsx)(n.strong,{children:"Single Records"}),": Persisting individual feature records"]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udd04 ",(0,i.jsx)(n.strong,{children:"Model Serving"}),": Feature retrieval for online inference"]}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Use spark_feature_push_client for:"})}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["\ud83d\udd04 ",(0,i.jsx)(n.strong,{children:"Batch ETL Pipelines"}),": Scheduled feature computation and 
publishing"]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udcca ",(0,i.jsx)(n.strong,{children:"Historical Data Backfill"}),": Loading historical features into online store"]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83c\udfd7\ufe0f ",(0,i.jsx)(n.strong,{children:"Data Engineering"}),": Spark-based feature transformations"]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udcc8 ",(0,i.jsx)(n.strong,{children:"Large Scale Processing"}),": Processing millions of records efficiently"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"related-packages",children:"Related Packages"}),"\n",(0,i.jsx)(n.p,{children:"This package is part of the BharatML Stack ecosystem:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:(0,i.jsx)(n.a,{href:"https://pypi.org/project/bharatml_commons/",children:"bharatml_commons"})}),": Common utilities and protobuf definitions (required dependency)"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:(0,i.jsx)(n.a,{href:"https://pypi.org/project/spark_feature_push_client/",children:"spark_feature_push_client"})}),": Spark-based data pipeline client"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,i.jsxs)(n.p,{children:["We welcome contributions from the community! 
Please see our ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,i.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["\ud83d\udcac ",(0,i.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,i.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udc1b ",(0,i.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udce7 ",(0,i.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,i.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,i.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 1.1"}),"."]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function h(e={}){const{wrapper:n}={...(0,r.R)(),...e.components};return n?(0,i.jsx)(n,{...e,children:(0,i.jsx)(d,{...e})}):d(e)}},8453:(e,n,t)=>{t.d(n,{R:()=>a,x:()=>l});var s=t(6540);const i={},r=s.createContext(i);function a(e){const n=s.useContext(r);return s.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function l(e){let n;return n=e.disableParentContext?"function"==typeof 
e.components?e.components(i):e.components||i:a(e.components),s.createElement(r.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/0413d9af.fc3050c7.js b/docs/assets/js/0413d9af.fc3050c7.js new file mode 100644 index 00000000..ca32179a --- /dev/null +++ b/docs/assets/js/0413d9af.fc3050c7.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[9919],{7114:(e,n,t)=>{t.r(n),t.d(n,{assets:()=>c,contentTitle:()=>l,default:()=>h,frontMatter:()=>a,metadata:()=>s,toc:()=>o});const s=JSON.parse('{"id":"sdks/python/v1.0.0/grpc_feature_client","title":"GRPC Feature client","description":"PyPI version","source":"@site/docs/sdks/python/v1.0.0/grpc_feature_client.md","sourceDirName":"sdks/python/v1.0.0","slug":"/sdks/python/v1.0.0/grpc_feature_client","permalink":"/BharatMLStack/sdks/python/v1.0.0/grpc_feature_client","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/sdks/python/v1.0.0/grpc_feature_client.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"GRPC Feature client","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"v1.0.0","permalink":"/BharatMLStack/sdks/python/v1.0.0"},"next":{"title":"Spark client","permalink":"/BharatMLStack/sdks/python/v1.0.0/spark_feature_push_client"}}');var i=t(4848),r=t(8453);const a={title:"GRPC Feature client",sidebar_position:1},l="GRPC Feature Client",c={},o=[{value:"Installation",id:"installation",level:2},{value:"Dependencies",id:"dependencies",level:2},{value:"Features",id:"features",level:2},{value:"Quick Start",id:"quick-start",level:2},{value:"API Reference",id:"api-reference",level:2},{value:"GRPCFeatureClient",id:"grpcfeatureclient",level:3},{value:"GRPCClientConfig",id:"grpcclientconfig",level:3},{value:"Usage Examples",id:"usage-examples",level:2},{value:"Persisting Features",id:"persisting-features",level:3},{value:"Retrieving 
Features",id:"retrieving-features",level:3},{value:"With Context Management",id:"with-context-management",level:3},{value:"When to Use",id:"when-to-use",level:2},{value:"Related Packages",id:"related-packages",level:2},{value:"Contributing",id:"contributing",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function d(e){const n={a:"a",code:"code",h1:"h1",h2:"h2",h3:"h3",header:"header",hr:"hr",img:"img",li:"li",p:"p",pre:"pre",strong:"strong",ul:"ul",...(0,r.R)(),...e.components};return(0,i.jsxs)(i.Fragment,{children:[(0,i.jsx)(n.header,{children:(0,i.jsx)(n.h1,{id:"grpc-feature-client",children:"GRPC Feature Client"})}),"\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.a,{href:"https://badge.fury.io/py/grpc_feature_client",children:(0,i.jsx)(n.img,{src:"https://img.shields.io/pypi/v/grpc_feature_client?label=pypi-package&color=light%20green",alt:"PyPI version"})}),"\n",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/actions/workflows/py-sdk.yml",children:(0,i.jsx)(n.img,{src:"https://github.com/Meesho/BharatMLStack/actions/workflows/py-sdk.yml/badge.svg",alt:"Build Status"})}),"\n",(0,i.jsx)(n.a,{href:"https://www.python.org/downloads/",children:(0,i.jsx)(n.img,{src:"https://img.shields.io/badge/python-3.7+-blue.svg",alt:"Python 3.7+"})}),"\n",(0,i.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:(0,i.jsx)(n.img,{src:"https://img.shields.io/badge/Discord-Join%20Chat-7289da?style=flat&logo=discord&logoColor=white",alt:"Discord"})}),"\n",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:(0,i.jsx)(n.img,{src:"https://img.shields.io/badge/License-BharatMLStack%20BSL%201.1-blue.svg",alt:"License"})})]}),"\n",(0,i.jsx)(n.p,{children:"High-performance gRPC client for BharatML Stack real-time feature operations with direct API 
access."}),"\n",(0,i.jsx)(n.h2,{id:"installation",children:"Installation"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"pip install grpc_feature_client\n"})}),"\n",(0,i.jsx)(n.h2,{id:"dependencies",children:"Dependencies"}),"\n",(0,i.jsx)(n.p,{children:"This package depends on:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:(0,i.jsx)(n.a,{href:"https://pypi.org/project/bharatml_commons/",children:"bharatml_commons"})}),": Common utilities and protobuf definitions"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"grpcio>=1.50.0"}),": gRPC framework"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"grpcio-tools>=1.50.0"}),": gRPC tools for protobuf"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"features",children:"Features"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Direct gRPC API"}),": persist, retrieve, retrieveDecoded operations"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Go SDK Compatible"}),": Same authentication and API semantics"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Batch Processing"}),": Automatic batching with parallel execution"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Real-time Focus"}),": Low-latency feature persistence and retrieval"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Context Management"}),": Timeout and metadata handling"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Connection Pooling"}),": Efficient connection management"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"quick-start",children:"Quick Start"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-python",children:'from grpc_feature_client import GRPCFeatureClient, GRPCClientConfig\n\n# Configure for real-time operations\nconfig = GRPCClientConfig(\n server_address="localhost:50051",\n 
job_id="realtime-service",\n job_token="api-token"\n)\n\nclient = GRPCFeatureClient(config)\n\n# Direct API operations\nresult = client.persist_features(entity_label, keys_schema, feature_groups, data)\nfeatures = client.retrieve_decoded_features(entity_label, feature_groups, keys, entity_keys)\n'})}),"\n",(0,i.jsx)(n.h2,{id:"api-reference",children:"API Reference"}),"\n",(0,i.jsx)(n.h3,{id:"grpcfeatureclient",children:"GRPCFeatureClient"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-python",children:"class GRPCFeatureClient:\n def __init__(self, config: GRPCClientConfig)\n \n def persist_features(\n self,\n entity_label: str,\n keys_schema: List[str],\n feature_group_schemas: List[Dict[str, Any]],\n data_rows: List[Dict[str, Any]],\n timeout: Optional[float] = None\n ) -> Dict[str, Any]\n \n def retrieve_features(\n self,\n entity_label: str,\n feature_groups: List[Dict[str, Any]],\n keys_schema: List[str],\n entity_keys: List[List[str]],\n timeout: Optional[float] = None\n ) -> Dict[str, Any]\n \n def retrieve_decoded_features(\n self,\n entity_label: str,\n feature_groups: List[Dict[str, Any]],\n keys_schema: List[str],\n entity_keys: List[List[str]],\n timeout: Optional[float] = None\n ) -> Dict[str, Any]\n"})}),"\n",(0,i.jsx)(n.h3,{id:"grpcclientconfig",children:"GRPCClientConfig"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-python",children:"class GRPCClientConfig:\n def __init__(\n self,\n server_address: str,\n job_id: str,\n job_token: str,\n use_tls: bool = False,\n timeout_seconds: float = 30.0,\n metadata: Dict[str, str] = None,\n max_receive_message_length: int = 4 * 1024 * 1024,\n max_send_message_length: int = 4 * 1024 * 1024\n )\n"})}),"\n",(0,i.jsx)(n.h2,{id:"usage-examples",children:"Usage Examples"}),"\n",(0,i.jsx)(n.h3,{id:"persisting-features",children:"Persisting Features"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-python",children:'from grpc_feature_client import 
GRPCFeatureClient, GRPCClientConfig\n\nconfig = GRPCClientConfig(\n server_address="feature-store.example.com:50051",\n job_id="predator-service",\n job_token="api-token"\n)\n\nclient = GRPCFeatureClient(config)\n\n# Persist real-time features\nresult = client.persist_features(\n entity_label="user_interaction",\n keys_schema=["user_id", "session_id"],\n feature_group_schemas=[{\n "label": "realtime_features",\n "feature_labels": ["click_count", "page_views"]\n }],\n data_rows=[{\n "user_id": "u123",\n "session_id": "s456",\n "click_count": 5,\n "page_views": 3\n }]\n)\n\nprint(f"Persist result: {result}")\n'})}),"\n",(0,i.jsx)(n.h3,{id:"retrieving-features",children:"Retrieving Features"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-python",children:'# Retrieve features for ML model inference\nfeatures = client.retrieve_decoded_features(\n entity_label="user_interaction",\n feature_groups=[{\n "label": "user_features",\n "feature_labels": ["age", "location"]\n }],\n keys_schema=["user_id"],\n entity_keys=[["u123"], ["u456"]]\n)\n\nprint(f"Retrieved features: {features}")\n'})}),"\n",(0,i.jsx)(n.h3,{id:"with-context-management",children:"With Context Management"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-python",children:"# Use client with automatic cleanup\nwith GRPCFeatureClient(config) as client:\n result = client.persist_features(...)\n features = client.retrieve_decoded_features(...)\n# Connection automatically closed\n"})}),"\n",(0,i.jsx)(n.h2,{id:"when-to-use",children:"When to Use"}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Use grpc_feature_client for:"})}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["\ud83d\ude80 ",(0,i.jsx)(n.strong,{children:"Real-time Operations"}),": Direct persist/retrieve operations"]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udd0d ",(0,i.jsx)(n.strong,{children:"Interactive Queries"}),": Low-latency feature 
lookups"]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83c\udfaf ",(0,i.jsx)(n.strong,{children:"API Integration"}),": Service-to-service communication"]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udca8 ",(0,i.jsx)(n.strong,{children:"Single Records"}),": Persisting individual feature records"]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udd04 ",(0,i.jsx)(n.strong,{children:"Model Serving"}),": Feature retrieval for online inference"]}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Use spark_feature_push_client for:"})}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["\ud83d\udd04 ",(0,i.jsx)(n.strong,{children:"Batch ETL Pipelines"}),": Scheduled feature computation and publishing"]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udcca ",(0,i.jsx)(n.strong,{children:"Historical Data Backfill"}),": Loading historical features into online store"]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83c\udfd7\ufe0f ",(0,i.jsx)(n.strong,{children:"Data Engineering"}),": Spark-based feature transformations"]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udcc8 ",(0,i.jsx)(n.strong,{children:"Large Scale Processing"}),": Processing millions of records efficiently"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"related-packages",children:"Related Packages"}),"\n",(0,i.jsx)(n.p,{children:"This package is part of the BharatML Stack ecosystem:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:(0,i.jsx)(n.a,{href:"https://pypi.org/project/bharatml_commons/",children:"bharatml_commons"})}),": Common utilities and protobuf definitions (required dependency)"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:(0,i.jsx)(n.a,{href:"https://pypi.org/project/spark_feature_push_client/",children:"spark_feature_push_client"})}),": Spark-based data pipeline client"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,i.jsxs)(n.p,{children:["We welcome contributions from the community! 
Please see our ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,i.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["\ud83d\udcac ",(0,i.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,i.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udc1b ",(0,i.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udce7 ",(0,i.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,i.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,i.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 1.1"}),"."]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function h(e={}){const{wrapper:n}={...(0,r.R)(),...e.components};return n?(0,i.jsx)(n,{...e,children:(0,i.jsx)(d,{...e})}):d(e)}},8453:(e,n,t)=>{t.d(n,{R:()=>a,x:()=>l});var s=t(6540);const i={},r=s.createContext(i);function a(e){const n=s.useContext(r);return s.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function l(e){let n;return n=e.disableParentContext?"function"==typeof 
e.components?e.components(i):e.components||i:a(e.components),s.createElement(r.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/09dd5be9.87777cdb.js b/docs/assets/js/09dd5be9.87777cdb.js new file mode 100644 index 00000000..77c1f9cc --- /dev/null +++ b/docs/assets/js/09dd5be9.87777cdb.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[6273],{1544:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/schema-d699efc400ed0f83bba421c1f55ab211.png"},1547:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"},1585:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/first-gen-arch-7c0b286810aecb7eff42b48f51caee1f.png"},3983:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-one","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-one/index.md","source":"@site/blog/bharatmlstack-history/post-one/index.md","title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)","description":"BharatMLStack","date":"2022-11-15T00:00:00.000Z","tags":[{"inline":true,"label":"online-feature-store","permalink":"/BharatMLStack/blog/tags/online-feature-store"},{"inline":true,"label":"interaction-store","permalink":"/BharatMLStack/blog/tags/interaction-store"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"}],"readingTime":10.25,"hasTruncateMarker":false,"authors":[{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null},{"name":"Aditya Kumar","title":"Lead Software Engineer @ Meesho","url":"https://github.com/Adit2607","imageURL":"https://github.com/Adit2607.png","key":"aditya","page":null},{"name":"Bhawani Singh","title":"Architect @ 
Meesho","url":"https://github.com/singh-bhawani","imageURL":"https://github.com/singh-bhawani.png","key":"bhawani","page":null},{"name":"Jigar Dave","title":"Lead Software Engineer @ Meesho","url":"https://github.com/jigarpatel26","imageURL":"https://github.com/jigarpatel26.png","key":"jigar","page":null}],"frontMatter":{"slug":"post-one","title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)","authors":["adarsha","aditya","bhawani","jigar"],"date":"2022-11-15T00:00:00.000Z","tags":["online-feature-store","interaction-store","mlplatform","meesho"]},"unlisted":false,"prevItem":{"title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","permalink":"/BharatMLStack/blog/post-two"}}')},4204:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/old-batch-arch-bc2cedbc1fed0fc6f08479ba8fe52996.png"},5714:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/interaction-store-v0-68167b64c6e462ef2f177f0f86d55bda.png"},7490:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/online-feature-store-v0-86ec0010947ae24621f39ebd0d1729ca.png"},8453:(e,n,i)=>{i.d(n,{R:()=>a,x:()=>o});var t=i(6540);const s={},r=t.createContext(s);function a(e){const n=t.useContext(r);return t.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:a(e.components),t.createElement(r.Provider,{value:n},e.children)}},8831:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>l,contentTitle:()=>o,default:()=>c,frontMatter:()=>a,metadata:()=>t,toc:()=>d});var t=i(3983),s=i(4848),r=i(8453);const a={slug:"post-one",title:"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)",authors:["adarsha","aditya","bhawani","jigar"],date:new Date("2022-11-15T00:00:00.000Z"),tags:["online-feature-store","interaction-store","mlplatform","meesho"]},o=void 0,l={authorsImageUrls:[void 0,void 0,void 0,void 0]},d=[{value:"The Genesis: How a 
Friday Night Roast Sparked Meesho\u2019s ML Platform",id:"the-genesis-how-a-friday-night-roast-sparked-meeshos-ml-platform",level:2},{value:"The Turning Point: From Batch to Real-Time",id:"the-turning-point-from-batch-to-real-time",level:2},{value:"First Generation Design",id:"first-generation-design",level:2},{value:"1. IOP Framework: A Real-Time DAG Executor",id:"1-iop-framework-a-real-time-dag-executor",level:3},{value:"2. Online Feature Store - 0th Version",id:"2-online-feature-store---0th-version",level:3},{value:"3. Interaction Store - 0th Version",id:"3-interaction-store---0th-version",level:3},{value:"Building the Online Feature Store - 0th Version",id:"building-the-online-feature-store---0th-version",level:2},{value:"Choosing the Right Tech Stack",id:"choosing-the-right-tech-stack",level:3},{value:"Streamlining the Data Flow",id:"streamlining-the-data-flow",level:3},{value:"The Challenges: Data Format and Storage",id:"the-challenges-data-format-and-storage",level:2},{value:"Feature Consistency",id:"feature-consistency",level:3},{value:"TTL Granularity",id:"ttl-granularity",level:3},{value:"Extensibility Across Databases",id:"extensibility-across-databases",level:3},{value:"Overcoming Technical Constraints",id:"overcoming-technical-constraints",level:2},{value:"The Solution: Schema Separation",id:"the-solution-schema-separation",level:2},{value:"Tracking Changes in Feature Groups",id:"tracking-changes-in-feature-groups",level:2},{value:"Common Real-World Scenarios:",id:"common-real-world-scenarios",level:3},{value:"The Solution: Schema Versioning",id:"the-solution-schema-versioning",level:2},{value:"Backward Compatibility",id:"backward-compatibility",level:3},{value:"Partial Availability Handling",id:"partial-availability-handling",level:3},{value:"Safe Writes Without Pipeline Pauses",id:"safe-writes-without-pipeline-pauses",level:3},{value:"Interaction Store - 0th Version",id:"interaction-store---0th-version",level:2},{value:"Event 
Ingestion",id:"event-ingestion",level:2},{value:"Storage Design",id:"storage-design",level:2},{value:"Why Redis?",id:"why-redis",level:3},{value:"Storage Structure",id:"storage-structure",level:3},{value:"Built-in Guardrails",id:"built-in-guardrails",level:3},{value:"Conclusion: Laying the Foundation for Real-Time ML",id:"conclusion-laying-the-foundation-for-real-time-ml",level:2}];function h(e){const n={a:"a",br:"br",code:"code",em:"em",h1:"h1",h2:"h2",h3:"h3",hr:"hr",img:"img",li:"li",p:"p",pre:"pre",strong:"strong",ul:"ul",...(0,r.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"BharatMLStack",src:i(1547).A+"",width:"1396",height:"460"})}),"\n",(0,s.jsx)(n.h2,{id:"the-genesis-how-a-friday-night-roast-sparked-meeshos-ml-platform",children:"The Genesis: How a Friday Night Roast Sparked Meesho\u2019s ML Platform"}),"\n",(0,s.jsx)(n.p,{children:"It all started in early 2022, over a casual Friday evening catch-up. Like many great origin stories, this one began with friendly banter between a group of backend engineers and data scientists. As the conversations unfolded, so did the roasting\u2014until one remark hit a little too close to home:"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.em,{children:'"Why are we still crunching data for Monthly Active Users (MAU) when the next day it\u2019s all about Daily Active Users (DAU)?"'})}),"\n",(0,s.jsx)(n.p,{children:"The laughter died down, and the question lingered. When we regrouped on Monday\u2014clear-headed and slightly reflective\u2014we decided to dig into the numbers. 
What we discovered was quite revealing: a large portion of compute resources wasn\u2019t being put to good use.\nMuch of the system\u2019s effort was spent supporting users who weren\u2019t actively engaging, and even for new users, the experience wasn\u2019t optimized to make a meaningful impact."}),"\n",(0,s.jsxs)(n.p,{children:["At the same time, Meesho had just launched a company-wide initiative to reduce costs\u2014and every team had to contribute. This realization sparked the journey that would eventually lead to the ",(0,s.jsx)(n.strong,{children:"Meesho ML Platform"}),", known today as ",(0,s.jsx)(n.strong,{children:"BharatMLStack"}),"."]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Batch architecture before the ML Platform",src:i(4204).A+"",width:"1600",height:"1078"})}),"\n",(0,s.jsx)(n.p,{children:"Before the ML Platform, our recommendation and ranking pipelines followed a batch processing approach:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Data Ingestion"}),": The Data Platform team executed ETL jobs to ingest raw user data\u2014including user profiles, interaction logs, and product impressions\u2014into designated S3 buckets."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Layer 1: Embedding Generation"}),": On the Data Science side, Spark jobs pulled data from multiple S3 sources, cleaned and preprocessed it, and applied matrix factorization to generate user and item embeddings. The processed data and embeddings were then stored back in S3 in a structured format."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Layer 2: Candidate Generation (CG)"}),": In this stage, Spark jobs leveraged embeddings and historical interaction data to generate candidate recommendations for users. 
These candidate lists were subsequently written to S3."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Layer 3: Ranking and Merging"}),": A final round of processing ranked the generated candidates using ML models, combined different candidate lists, and stored the final ranked recommendations in a caching system."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Serving"}),': A microservice retrieved ranked recommendations from an in-memory data store via exposed APIs, delivering personalized listings across key surfaces such as "For You" and Category Landing Pages (CLP).']}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This approach held up well\u2014until Meesho started seeing a significant surge in traffic."}),"\n",(0,s.jsx)(n.h2,{id:"the-turning-point-from-batch-to-real-time",children:"The Turning Point: From Batch to Real-Time"}),"\n",(0,s.jsxs)(n.p,{children:["At this time, the team was iterating on new ",(0,s.jsx)(n.strong,{children:"Ranker models"}),", and real-time inference seemed like the next logical step. But Rankers needed ",(0,s.jsx)(n.strong,{children:"real-time feature retrieval"}),", which meant an ",(0,s.jsx)(n.strong,{children:"online feature store"})," had to be built first."]}),"\n",(0,s.jsxs)(n.p,{children:["Exploring open-source options led to ",(0,s.jsx)(n.strong,{children:"cost vs. performance trade-offs"}),", but Meesho\u2019s surging traffic meant that ",(0,s.jsx)(n.strong,{children:"latency and stability were non-negotiable"}),". After multiple debates and stakeholder discussions, a bold decision was made:"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.em,{children:"We would build our own feature store."})}),"\n",(0,s.jsxs)(n.p,{children:["Meanwhile, efforts began to bring ",(0,s.jsx)(n.strong,{children:"Candidate Generators (CGs)"})," to real-time. The challenge? 
",(0,s.jsx)(n.strong,{children:"Storing and retrieving user interactions quickly enough"})," to power real-time recommendations."]}),"\n",(0,s.jsxs)(n.p,{children:["As the team dove deeper, a new roadblock emerged:",(0,s.jsx)(n.br,{}),"\n","Our ML jobs were orchestrated using ",(0,s.jsx)(n.strong,{children:"Airflow DAGs"}),", giving data scientists flexibility in experimentation. But transitioning to real-time execution threatened this agility. Every change would now require backend engineering support, ",(0,s.jsx)(n.strong,{children:"slowing down iteration cycles"}),"."]}),"\n",(0,s.jsxs)(n.p,{children:["That\u2019s when the idea struck:",(0,s.jsx)(n.br,{}),"\n","We needed a ",(0,s.jsx)(n.strong,{children:"framework for real-time DAG execution"}),"\u2014one that preserved the same flexibility as Airflow but worked for ",(0,s.jsx)(n.strong,{children:"streaming data"}),"."]}),"\n",(0,s.jsxs)(n.p,{children:["This moment shaped the ",(0,s.jsx)(n.strong,{children:"next phase of our journey"}),"."]}),"\n",(0,s.jsx)(n.h2,{id:"first-generation-design",children:"First Generation Design"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt Text",src:i(1585).A+"",width:"1600",height:"1006"})}),"\n",(0,s.jsx)(n.h1,{id:"laying-the-groundwork-the-first-gen-ml-platform",children:"Laying the Groundwork: The First-Gen ML Platform"}),"\n",(0,s.jsx)(n.p,{children:"To solve these challenges, the team built three foundational components:"}),"\n",(0,s.jsx)(n.h3,{id:"1-iop-framework-a-real-time-dag-executor",children:"1. IOP Framework: A Real-Time DAG Executor"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Reusable Nodes"}),": Each DAG node (e.g., an invocation to a CG service, a ranker, or a filter) had to be implemented only once. 
After that, it could be reused across any workflow by referencing it in config."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Config-driven Dynamic Graphs"}),": Execution graphs were defined as adjacency lists stored in ",(0,s.jsx)(n.strong,{children:"ZooKeeper"}),", allowing teams to modify the sequence or structure of operations without touching application code."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Plug-and-play CGs"}),": The Candidate Generator interface was preserved, so a single CG node could call any CG service by passing ",(0,s.jsx)(n.code,{children:"cg_name"})," in the request. This drastically reduced the code surface area and improved maintainability."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Production-Grade DAGs"}),": DAGs were designed to execute in ",(0,s.jsx)(n.strong,{children:"low-latency real-time environments"}),", with support for ",(0,s.jsx)(n.strong,{children:"parallel execution, retries, and branching"}),"."]}),"\n"]}),"\n",(0,s.jsx)("u",{children:(0,s.jsx)(n.a,{href:"https://www.meesho.io/blog/rebuilding-meeshos-ranking-platform",children:"More about IOP DAG"})}),"\n",(0,s.jsx)(n.h3,{id:"2-online-feature-store---0th-version",children:"2. Online Feature Store - 0th Version"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["Used ",(0,s.jsx)(n.strong,{children:"Cassandra"})," and ",(0,s.jsx)(n.strong,{children:"Redis"})," for low-latency feature serving."]}),"\n",(0,s.jsxs)(n.li,{children:["Maintained feature consistency using ",(0,s.jsx)(n.strong,{children:"Feature Groups"})," with TTL-based expiry."]}),"\n",(0,s.jsxs)(n.li,{children:["A hybrid schema was used: feature keys stored in ",(0,s.jsx)(n.strong,{children:"ZooKeeper"}),", data stored in ",(0,s.jsx)(n.strong,{children:"compact arrays"}),"."]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"3-interaction-store---0th-version",children:"3. 
Interaction Store - 0th Version"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Captured real-time user interactions like clicks, orders, and add-to-cart events."}),"\n",(0,s.jsxs)(n.li,{children:["Stored event data in ",(0,s.jsx)(n.strong,{children:"Redis ZSETs (sorted sets)"})," to enable fast lookups for recommendation engines."]}),"\n",(0,s.jsxs)(n.li,{children:["Provided an API to fetch a user's ",(0,s.jsxs)(n.strong,{children:["last ",(0,s.jsx)(n.em,{children:"k"})," interactions"]})," or ",(0,s.jsx)(n.strong,{children:"interactions within a time window"}),"."]}),"\n"]}),"\n",(0,s.jsxs)(n.p,{children:["With these components in place, ",(0,s.jsx)(n.strong,{children:"real-time ML at Meesho became a reality"}),"."]}),"\n",(0,s.jsx)(n.p,{children:"This was just the beginning."}),"\n",(0,s.jsx)(n.h2,{id:"building-the-online-feature-store---0th-version",children:"Building the Online Feature Store - 0th Version"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt text",src:i(7490).A+"",width:"1574",height:"562"})}),"\n",(0,s.jsx)(n.h3,{id:"choosing-the-right-tech-stack",children:"Choosing the Right Tech Stack"}),"\n",(0,s.jsxs)(n.p,{children:["We spent considerable time evaluating various databases, caches, and communication protocols for our ",(0,s.jsx)(n.strong,{children:"online feature store"}),". 
After carefully weighing ",(0,s.jsx)(n.strong,{children:"cost, latency, throughput"}),", and ",(0,s.jsx)(n.strong,{children:"operational stability"}),", we settled on a combination of:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Cassandra"})," and ",(0,s.jsx)(n.strong,{children:"Redis"})," for storage"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"gRPC + Proto3"})," as our communication layer"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"streamlining-the-data-flow",children:"Streamlining the Data Flow"}),"\n",(0,s.jsx)(n.p,{children:"To keep things simple in the initial version:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Feature engineering jobs"})," wrote raw outputs to an ",(0,s.jsx)(n.strong,{children:"S3 bucket"})]}),"\n",(0,s.jsxs)(n.li,{children:["A ",(0,s.jsx)(n.strong,{children:"daily feature push job"}),":","\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Read from S3"}),"\n",(0,s.jsxs)(n.li,{children:["Grouped related features into ",(0,s.jsx)(n.strong,{children:"Feature Groups"})," (ensuring consistency)"]}),"\n",(0,s.jsxs)(n.li,{children:["Pushed them to ",(0,s.jsx)(n.strong,{children:"Kafka"})]}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"For features requiring frequent updates:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Ad-hoc jobs"})," computed features at a higher frequency"]}),"\n",(0,s.jsxs)(n.li,{children:["These jobs pushed to both ",(0,s.jsx)(n.strong,{children:"Kafka"})," and ",(0,s.jsx)(n.strong,{children:"S3"})," (S3 preserved historical data for future model training)"]}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"the-challenges-data-format-and-storage",children:"The Challenges: Data Format and Storage"}),"\n",(0,s.jsxs)(n.p,{children:["One of the most critical design challenges was how to store feature data ",(0,s.jsx)(n.strong,{children:"efficiently and 
consistently"}),", especially in databases like ",(0,s.jsx)(n.strong,{children:"Cassandra"})," and ",(0,s.jsx)(n.strong,{children:"Redis"}),", which come with unique storage constraints."]}),"\n",(0,s.jsx)(n.p,{children:"We had to solve for three key requirements:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"feature-consistency",children:"Feature Consistency"}),"\n",(0,s.jsxs)(n.p,{children:["When a feature group contains features like ",(0,s.jsx)(n.code,{children:"order_count_1h"})," and ",(0,s.jsx)(n.code,{children:"click_count_1h"}),", both must reflect the ",(0,s.jsx)(n.strong,{children:"same time window"}),". Inconsistent updates would lead to ",(0,s.jsx)(n.strong,{children:"unreliable model predictions"}),"."]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"ttl-granularity",children:"TTL Granularity"}),"\n",(0,s.jsxs)(n.p,{children:["Each feature group required an ",(0,s.jsx)(n.strong,{children:"expiry timestamp"}),", so that ",(0,s.jsx)(n.strong,{children:"all features within it expired together"}),"\u2014preserving consistency during reads."]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"extensibility-across-databases",children:"Extensibility Across Databases"}),"\n",(0,s.jsxs)(n.p,{children:["We anticipated that infra needs would evolve. 
To future-proof our system, the data format was designed to be ",(0,s.jsx)(n.strong,{children:"decoupled from DB-specific layouts"}),", enabling portability to systems like ",(0,s.jsx)(n.strong,{children:"ScyllaDB"}),", ",(0,s.jsx)(n.strong,{children:"DynamoDB"}),", ",(0,s.jsx)(n.strong,{children:"HBase"}),", or ",(0,s.jsx)(n.strong,{children:"BigTable"}),"."]}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"overcoming-technical-constraints",children:"Overcoming Technical Constraints"}),"\n",(0,s.jsx)(n.p,{children:'At the time, we were using Cassandra, which not only imposed a soft limit of 75 columns per row, but also exhibited significant performance degradation as the number of columns increased further, particularly in memory constrained machines. Wide rows caused high memory usage during reads, unpredictable latencies due to heavy deserialization overhead, and inefficiencies during compactions and repairs. This ruled out the naive "one column per feature" approach. 
We needed a format that was compact, minimized the number of columns, and remained efficient and portable across different storage systems.'}),"\n",(0,s.jsx)(n.h2,{id:"the-solution-schema-separation",children:"The Solution: Schema Separation"}),"\n",(0,s.jsx)(n.p,{children:"We introduced the concept of Feature Groups\u2014logical groupings of features that must remain consistent with one another.\nTo represent these groups efficiently, we adopted a layered storage approach:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Feature Labels (Keys)"})," were stored in ZooKeeper, serving as the schema."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Feature Values"})," were stored as a comma-separated string array in Cassandra or Redis."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Expiry Timestamp and Schema Version"})," were appended at the end of the string, each separated by a semicolon."]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Example:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"feature_1_value,feature_2_value,feature_3_value;expiry_ts;schema_version\n"})}),"\n",(0,s.jsx)(n.p,{children:"This format allowed:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Consistent writes and reads at the group level"}),"\n",(0,s.jsx)(n.li,{children:"Easy parsing of feature values using the schema lookup from ZooKeeper"}),"\n",(0,s.jsx)(n.li,{children:"Efficient storage with minimal DB column usage"}),"\n",(0,s.jsx)(n.li,{children:"Support for per-group TTLs and schema evolution"}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"tracking-changes-in-feature-groups",children:"Tracking Changes in Feature Groups"}),"\n",(0,s.jsx)(n.p,{children:"Feature groups don\u2019t stay static. As models evolve, features get added, renamed, or removed. 
But schema changes often go live before the data is ready\u2014and stopping ingestion just to wait for everything to align isn't feasible."}),"\n",(0,s.jsx)(n.h3,{id:"common-real-world-scenarios",children:"Common Real-World Scenarios:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"A new feature is added to the schema, but ingestion jobs still use the older schema version."}),"\n",(0,s.jsx)(n.li,{children:"Ongoing writes don\u2019t include the newly added feature, and stopping ingestion would break freshness for existing features."}),"\n",(0,s.jsx)(n.li,{children:"During serving, models request a mix of old and new features, depending on rollout stages."}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"the-solution-schema-versioning",children:"The Solution: Schema Versioning"}),"\n",(0,s.jsx)(n.p,{children:"We solved this with versioned feature group schemas, which unlocked several capabilities:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"backward-compatibility",children:"Backward Compatibility"}),"\n","Older ingestion jobs can continue writing using older schema versions. During reads, the system uses the schema version embedded in the value to interpret the data correctly."]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"partial-availability-handling",children:"Partial Availability Handling"}),"\n","During inference, if some features in the request aren\u2019t available (due to rollout delays or missing data), the system serves default values, ensuring the inference call doesn\u2019t fail."]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"safe-writes-without-pipeline-pauses",children:"Safe Writes Without Pipeline Pauses"}),"\n","With schema versioning, we no longer had to stop ingestion pipelines for schema updates. 
Writes using previous versions can continue safely, and downstream consumers evolve independently.\nThis design gave us the flexibility to move fast without breaking things\u2014preserving data quality, enabling experimentation, and ensuring reliability at scale."]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt Text",src:i(1544).A+"",width:"1600",height:"599"})}),"\n",(0,s.jsx)(n.h2,{id:"interaction-store---0th-version",children:"Interaction Store - 0th Version"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt Text",src:i(5714).A+"",width:"1600",height:"518"})}),"\n",(0,s.jsxs)(n.p,{children:["To power real-time Candidate Generators (CGs), we needed fast access to user behavior signals\u2014like what a user recently clicked, ordered, or added to their cart. These interactions form the basis for many real-time recommendations, such as ",(0,s.jsx)(n.strong,{children:"Similar Products"}),", ",(0,s.jsx)(n.strong,{children:"People Also Viewed"}),", or ",(0,s.jsx)(n.strong,{children:"Recently Ordered Again"}),".\nFor the ",(0,s.jsx)(n.strong,{children:"0th version"})," of the Interaction Store, we focused on a design that was ",(0,s.jsx)(n.strong,{children:"simple, fast, and reliable"})," \u2014 optimized for high-throughput ingestion and low-latency lookups."]}),"\n",(0,s.jsx)(n.h2,{id:"event-ingestion",children:"Event Ingestion"}),"\n",(0,s.jsx)(n.p,{children:"We instrumented our backend services to emit key user interaction events to Kafka in real time. 
These included:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Click"}),"\n",(0,s.jsx)(n.li,{children:"Order"}),"\n",(0,s.jsx)(n.li,{children:"Add to Cart"}),"\n",(0,s.jsx)(n.li,{children:"Wishlist"}),"\n",(0,s.jsx)(n.li,{children:"Share"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Each event carried essential metadata:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"userId \u2014 uniquely identifies the user"}),"\n",(0,s.jsx)(n.li,{children:"productId \u2014 the item being interacted with"}),"\n",(0,s.jsx)(n.li,{children:"timestamp \u2014 the moment the interaction occurred"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This decoupled the interaction logging from storage, allowing ingestion and consumption to scale independently."}),"\n",(0,s.jsx)(n.h2,{id:"storage-design",children:"Storage Design"}),"\n",(0,s.jsx)(n.p,{children:"To store these events, we built Kafka consumers that processed the incoming streams and wrote the data into Redis, using sorted sets (ZSETs) as the primary data structure."}),"\n",(0,s.jsx)(n.h3,{id:"why-redis",children:"Why Redis?"}),"\n",(0,s.jsx)(n.p,{children:"Redis gave us:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Low-latency"})," reads and writes"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Time-ordered data"})," using ZSETs (via score = timestamp)"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Native TTL support"}),", if needed in later versions"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"In-memory performance"})," \u2014ideal for real-time CGs"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"storage-structure",children:"Storage Structure"}),"\n",(0,s.jsx)(n.p,{children:"Each user\u2019s interactions were stored using a composite key format, uniquely identifying the user and interaction type. 
This structure allowed efficient organization and quick retrieval of recent activity for recommendation generation:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"userId_eventType \u2192 ZSET[...(pid, ts)...]\n"})}),"\n",(0,s.jsx)(n.p,{children:"Within each ZSET:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["The ",(0,s.jsx)(n.strong,{children:"timestamp"})," served as the score, maintaining temporal order"]}),"\n",(0,s.jsxs)(n.li,{children:["The ",(0,s.jsx)(n.strong,{children:"productId"})," (optionally with metadata) was the ",(0,s.jsx)(n.strong,{children:"value"})]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This allowed us to efficiently retrieve interactions through an HTTP-based API server, with two query modes:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["Fetch the ",(0,s.jsx)(n.strong,{children:"last k interactions"})," of a specific type for a given user with ",(0,s.jsx)(n.code,{children:"ZREVRANGE(userId_eventType, count)"})]}),"\n",(0,s.jsxs)(n.li,{children:["Retrieve ",(0,s.jsx)(n.strong,{children:"all interactions within a time range"})," (e.g., last 24 hours) with ",(0,s.jsx)(n.code,{children:"ZREVRANGEBYSCORE(userId_eventType, timeRange)"})]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"built-in-guardrails",children:"Built-in Guardrails"}),"\n",(0,s.jsx)(n.p,{children:"Since Redis was the sole store, we implemented High Availability (HA) to prevent data loss. 
To optimize memory usage, we also enforced size limits per event type\u2014only storing the last k interactions per user, with older entries getting truncated."}),"\n",(0,s.jsx)(n.h2,{id:"conclusion-laying-the-foundation-for-real-time-ml",children:"Conclusion: Laying the Foundation for Real-Time ML"}),"\n",(0,s.jsxs)(n.p,{children:["In this first phase, we tackled the ",(0,s.jsx)(n.strong,{children:"fundamentals"}),"\u2014shifting from batch-based recommendations to a ",(0,s.jsx)(n.strong,{children:"real-time, ML-powered recommendation platform"})," that could keep up with Meesho\u2019s growth."]}),"\n",(0,s.jsxs)(n.p,{children:["With the ",(0,s.jsx)(n.strong,{children:"IOP Framework"}),", ",(0,s.jsx)(n.strong,{children:"Online Feature Store"}),", and ",(0,s.jsx)(n.strong,{children:"Interaction Store"}),", we built the core infrastructure to support real-time personalization at scale. These wins have already unlocked:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"\u2705 Faster, more dynamic recommendations for millions of users."}),"\n",(0,s.jsx)(n.li,{children:"\u2705 Better infrastructure efficiency, reducing wasted compute power."}),"\n",(0,s.jsx)(n.li,{children:"\u2705 A flexible, modular system that allows for further experimentation."}),"\n"]}),"\n",(0,s.jsxs)(n.p,{children:["But this is just the beginning. 
While we've solved key challenges, ",(0,s.jsx)(n.strong,{children:"certain roadblocks remain"})," \u2014from optimizing ",(0,s.jsx)(n.strong,{children:"cost-performance trade-offs"})," to ",(0,s.jsx)(n.strong,{children:"seamlessly evolving schemas"}),"."]}),"\n",(0,s.jsxs)(n.p,{children:["This foundational work laid the path for a reliable and scalable ",(0,s.jsx)(n.strong,{children:"real-time feature serving layer"}),"."]})]})}function c(e={}){const{wrapper:n}={...(0,r.R)(),...e.components};return n?(0,s.jsx)(n,{...e,children:(0,s.jsx)(h,{...e})}):h(e)}}}]); \ No newline at end of file diff --git a/docs/assets/js/09dd5be9.be7fd2aa.js b/docs/assets/js/09dd5be9.be7fd2aa.js deleted file mode 100644 index c4d40aef..00000000 --- a/docs/assets/js/09dd5be9.be7fd2aa.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[6273],{395:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/interaction-store-v0-68167b64c6e462ef2f177f0f86d55bda.png"},1164:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"},1757:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/old-batch-arch-bc2cedbc1fed0fc6f08479ba8fe52996.png"},3983:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-one","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-one/index.md","source":"@site/blog/bharatmlstack-history/post-one/index.md","title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 
1)","description":"BharatMLStack","date":"2022-11-15T00:00:00.000Z","tags":[{"inline":true,"label":"online-feature-store","permalink":"/BharatMLStack/blog/tags/online-feature-store"},{"inline":true,"label":"interaction-store","permalink":"/BharatMLStack/blog/tags/interaction-store"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"}],"readingTime":10.25,"hasTruncateMarker":false,"authors":[{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null},{"name":"Aditya Kumar","title":"Lead Software Engineer @ Meesho","url":"https://github.com/Adit2607","imageURL":"https://github.com/Adit2607.png","key":"aditya","page":null},{"name":"Bhawani Singh","title":"Architect @ Meesho","url":"https://github.com/singh-bhawani","imageURL":"https://github.com/singh-bhawani.png","key":"bhawani","page":null},{"name":"Jigar Dave","title":"Lead Software Engineer @ Meesho","url":"https://github.com/jigarpatel26","imageURL":"https://github.com/jigarpatel26.png","key":"jigar","page":null}],"frontMatter":{"slug":"post-one","title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)","authors":["adarsha","aditya","bhawani","jigar"],"date":"2022-11-15T00:00:00.000Z","tags":["online-feature-store","interaction-store","mlplatform","meesho"]},"unlisted":false,"prevItem":{"title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","permalink":"/BharatMLStack/blog/post-two"}}')},5017:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/online-feature-store-v0-86ec0010947ae24621f39ebd0d1729ca.png"},7848:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/first-gen-arch-7c0b286810aecb7eff42b48f51caee1f.png"},8453:(e,n,i)=>{i.d(n,{R:()=>a,x:()=>o});var t=i(6540);const s={},r=t.createContext(s);function a(e){const 
n=t.useContext(r);return t.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:a(e.components),t.createElement(r.Provider,{value:n},e.children)}},8733:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/schema-d699efc400ed0f83bba421c1f55ab211.png"},8831:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>l,contentTitle:()=>o,default:()=>c,frontMatter:()=>a,metadata:()=>t,toc:()=>d});var t=i(3983),s=i(4848),r=i(8453);const a={slug:"post-one",title:"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)",authors:["adarsha","aditya","bhawani","jigar"],date:new Date("2022-11-15T00:00:00.000Z"),tags:["online-feature-store","interaction-store","mlplatform","meesho"]},o=void 0,l={authorsImageUrls:[void 0,void 0,void 0,void 0]},d=[{value:"The Genesis: How a Friday Night Roast Sparked Meesho\u2019s ML Platform",id:"the-genesis-how-a-friday-night-roast-sparked-meeshos-ml-platform",level:2},{value:"The Turning Point: From Batch to Real-Time",id:"the-turning-point-from-batch-to-real-time",level:2},{value:"First Generation Design",id:"first-generation-design",level:2},{value:"1. IOP Framework: A Real-Time DAG Executor",id:"1-iop-framework-a-real-time-dag-executor",level:3},{value:"2. Online Feature Store - 0th Version",id:"2-online-feature-store---0th-version",level:3},{value:"3. 
Interaction Store - 0th Version",id:"3-interaction-store---0th-version",level:3},{value:"Building the Online Feature Store - 0th Version",id:"building-the-online-feature-store---0th-version",level:2},{value:"Choosing the Right Tech Stack",id:"choosing-the-right-tech-stack",level:3},{value:"Streamlining the Data Flow",id:"streamlining-the-data-flow",level:3},{value:"The Challenges: Data Format and Storage",id:"the-challenges-data-format-and-storage",level:2},{value:"Feature Consistency",id:"feature-consistency",level:3},{value:"TTL Granularity",id:"ttl-granularity",level:3},{value:"Extensibility Across Databases",id:"extensibility-across-databases",level:3},{value:"Overcoming Technical Constraints",id:"overcoming-technical-constraints",level:2},{value:"The Solution: Schema Separation",id:"the-solution-schema-separation",level:2},{value:"Tracking Changes in Feature Groups",id:"tracking-changes-in-feature-groups",level:2},{value:"Common Real-World Scenarios:",id:"common-real-world-scenarios",level:3},{value:"The Solution: Schema Versioning",id:"the-solution-schema-versioning",level:2},{value:"Backward Compatibility",id:"backward-compatibility",level:3},{value:"Partial Availability Handling",id:"partial-availability-handling",level:3},{value:"Safe Writes Without Pipeline Pauses",id:"safe-writes-without-pipeline-pauses",level:3},{value:"Interaction Store - 0th Version",id:"interaction-store---0th-version",level:2},{value:"Event Ingestion",id:"event-ingestion",level:2},{value:"Storage Design",id:"storage-design",level:2},{value:"Why Redis?",id:"why-redis",level:3},{value:"Storage Structure",id:"storage-structure",level:3},{value:"Built-in Guardrails",id:"built-in-guardrails",level:3},{value:"Conclusion: Laying the Foundation for Real-Time ML",id:"conclusion-laying-the-foundation-for-real-time-ml",level:2}];function h(e){const 
n={a:"a",br:"br",code:"code",em:"em",h1:"h1",h2:"h2",h3:"h3",hr:"hr",img:"img",li:"li",p:"p",pre:"pre",strong:"strong",ul:"ul",...(0,r.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"BharatMLStack",src:i(1164).A+"",width:"1396",height:"460"})}),"\n",(0,s.jsx)(n.h2,{id:"the-genesis-how-a-friday-night-roast-sparked-meeshos-ml-platform",children:"The Genesis: How a Friday Night Roast Sparked Meesho\u2019s ML Platform"}),"\n",(0,s.jsx)(n.p,{children:"It all started in early 2022, over a casual Friday evening catch-up. Like many great origin stories, this one began with friendly banter between a group of backend engineers and data scientists. As the conversations unfolded, so did the roasting\u2014until one remark hit a little too close to home:"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.em,{children:'"Why are we still crunching data for Monthly Active Users (MAU) when the next day it\u2019s all about Daily Active Users (DAU)?"'})}),"\n",(0,s.jsx)(n.p,{children:"The laughter died down, and the question lingered. When we regrouped on Monday\u2014clear-headed and slightly reflective\u2014we decided to dig into the numbers. What they discovered was quite revealing: a large portion of compute resources wasn\u2019t being put to good use.\nMuch of the system\u2019s effort was spent supporting users who weren\u2019t actively engaging, and even for new users, the experience wasn\u2019t optimized to make a meaningful impact."}),"\n",(0,s.jsxs)(n.p,{children:["At the same time, Meesho had just launched a company-wide initiative to reduce costs\u2014and every team had to contribute. 
This realization sparked the journey that would eventually lead to the ",(0,s.jsx)(n.strong,{children:"Meesho ML Platform"}),", known today as ",(0,s.jsx)(n.strong,{children:"BharatMLStack"}),"."]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt Text",src:i(1757).A+"",width:"1600",height:"1078"})}),"\n",(0,s.jsx)(n.p,{children:"Before the ML Platform, our recommendation and ranking pipelines followed a batch processing approach:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Data Ingestion"}),": The Data Platform team executed ETL jobs to ingest raw user data\u2014including user profiles, interaction logs, and product impressions\u2014into designated S3 buckets."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Layer 1"}),": Embedding Generation: On the Data Science side, Spark jobs pulled data from multiple S3 sources, cleaned and preprocessed it, and applied matrix factorization to generate user and item embeddings. The processed data and embeddings were then stored back in S3 in a structured format."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Layer 2"}),": Candidate Generation (CG): In this stage, Spark jobs leveraged embeddings and historical interaction data to generate candidate recommendations for users. 
These candidate lists were subsequently written to S3."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Layer 3"}),": Ranking and Merging \u2013 A final round of processing ranked the generated candidates using ML models, combined different candidate lists, and stored the final ranked recommendations in a caching system."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Serving"}),': A microservice retrieved ranked recommendations from an in-memory data store via exposed APIs, delivering personalized listings across key surfaces such as "For You" and Category Landing Pages (CLP).']}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This approach held up well\u2014until Meesho started seeing a significant surge in traffic."}),"\n",(0,s.jsx)(n.h2,{id:"the-turning-point-from-batch-to-real-time",children:"The Turning Point: From Batch to Real-Time"}),"\n",(0,s.jsxs)(n.p,{children:["At this time, the team was iterating on new ",(0,s.jsx)(n.strong,{children:"Ranker models"}),", and real-time inference seemed like the next logical step. But Rankers needed ",(0,s.jsx)(n.strong,{children:"real-time feature retrieval"}),", which meant an ",(0,s.jsx)(n.strong,{children:"online feature store"})," had to be built first."]}),"\n",(0,s.jsxs)(n.p,{children:["Exploring open-source options led to ",(0,s.jsx)(n.strong,{children:"cost vs. performance trade-offs"}),", but Meesho\u2019s surging traffic meant that ",(0,s.jsx)(n.strong,{children:"latency and stability were non-negotiable"}),". After multiple debates and stakeholder discussions, a bold decision was made:"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.em,{children:"We would build our own feature store."})}),"\n",(0,s.jsxs)(n.p,{children:["Meanwhile, efforts began to bring ",(0,s.jsx)(n.strong,{children:"Candidate Generators (CGs)"})," to real-time. The challenge? 
",(0,s.jsx)(n.strong,{children:"Storing and retrieving user interactions quickly enough"})," to power real-time recommendations."]}),"\n",(0,s.jsxs)(n.p,{children:["As the team dove deeper, a new roadblock emerged:",(0,s.jsx)(n.br,{}),"\n","Our ML jobs were orchestrated using ",(0,s.jsx)(n.strong,{children:"Airflow DAGs"}),", giving data scientists flexibility in experimentation. But transitioning to real-time execution threatened this agility. Every change would now require backend engineering support, ",(0,s.jsx)(n.strong,{children:"slowing down iteration cycles"}),"."]}),"\n",(0,s.jsxs)(n.p,{children:["That\u2019s when the idea struck:",(0,s.jsx)(n.br,{}),"\n","We needed a ",(0,s.jsx)(n.strong,{children:"framework for real-time DAG execution"}),"\u2014one that preserved the same flexibility as Airflow but worked for ",(0,s.jsx)(n.strong,{children:"streaming data"}),"."]}),"\n",(0,s.jsxs)(n.p,{children:["This moment shaped the ",(0,s.jsx)(n.strong,{children:"next phase of our journey"}),"."]}),"\n",(0,s.jsx)(n.h2,{id:"first-generation-design",children:"First Generation Design"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt Text",src:i(7848).A+"",width:"1600",height:"1006"})}),"\n",(0,s.jsx)(n.h1,{id:"laying-the-groundwork-the-first-gen-ml-platform",children:"Laying the Groundwork: The First-Gen ML Platform"}),"\n",(0,s.jsx)(n.p,{children:"To solve these challenges, the team built three foundational components:"}),"\n",(0,s.jsx)(n.h3,{id:"1-iop-framework-a-real-time-dag-executor",children:"1. IOP Framework: A Real-Time DAG Executor"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Reusable Nodes"}),": Each DAG node (e.g., an invocation to a CG service, a ranker, or a filter) had to be implemented only once. 
After that, it could be reused across any workflow by referencing it in config."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Config-driven Dynamic Graphs"}),": Execution graphs were defined as adjacency lists stored in ",(0,s.jsx)(n.strong,{children:"ZooKeeper"}),", allowing teams to modify the sequence or structure of operations without touching application code."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Plug-and-play CGs"}),": The Candidate Generator interface was preserved, so a single CG node could call any CG service by passing ",(0,s.jsx)(n.code,{children:"cg_name"})," in the request. This drastically reduced the code surface area and improved maintainability."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Production-Grade DAGs"}),": DAGs were designed to execute in ",(0,s.jsx)(n.strong,{children:"low-latency real-time environments"}),", with support for ",(0,s.jsx)(n.strong,{children:"parallel execution, retries, and branching"}),"."]}),"\n"]}),"\n",(0,s.jsx)("u",{children:(0,s.jsx)(n.a,{href:"https://www.meesho.io/blog/rebuilding-meeshos-ranking-platform",children:"More about IOP DAG"})}),"\n",(0,s.jsx)(n.h3,{id:"2-online-feature-store---0th-version",children:"2. Online Feature Store - 0th Version"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["Used ",(0,s.jsx)(n.strong,{children:"Cassandra"})," and ",(0,s.jsx)(n.strong,{children:"Redis"})," for low-latency feature serving."]}),"\n",(0,s.jsxs)(n.li,{children:["Maintained feature consistency using ",(0,s.jsx)(n.strong,{children:"Feature Groups"})," with TTL-based expiry."]}),"\n",(0,s.jsxs)(n.li,{children:["A hybrid schema was used: feature keys stored in ",(0,s.jsx)(n.strong,{children:"ZooKeeper"}),", data stored in ",(0,s.jsx)(n.strong,{children:"compact arrays"}),"."]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"3-interaction-store---0th-version",children:"3. 
Interaction Store - 0th Version"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Captured real-time user interactions like clicks, orders, and add-to-cart events."}),"\n",(0,s.jsxs)(n.li,{children:["Stored event data in ",(0,s.jsx)(n.strong,{children:"Redis ZSETs (sorted sets)"})," to enable fast lookups for recommendation engines."]}),"\n",(0,s.jsxs)(n.li,{children:["Provided an API to fetch a user's ",(0,s.jsxs)(n.strong,{children:["last ",(0,s.jsx)(n.em,{children:"k"})," interactions"]})," or ",(0,s.jsx)(n.strong,{children:"interactions within a time window"}),"."]}),"\n"]}),"\n",(0,s.jsxs)(n.p,{children:["With these components in place, ",(0,s.jsx)(n.strong,{children:"real-time ML at Meesho became a reality"}),"."]}),"\n",(0,s.jsx)(n.p,{children:"This was just the beginning."}),"\n",(0,s.jsx)(n.h2,{id:"building-the-online-feature-store---0th-version",children:"Building the Online Feature Store - 0th Version"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt text",src:i(5017).A+"",width:"1574",height:"562"})}),"\n",(0,s.jsx)(n.h3,{id:"choosing-the-right-tech-stack",children:"Choosing the Right Tech Stack"}),"\n",(0,s.jsxs)(n.p,{children:["We spent considerable time evaluating various databases, caches, and communication protocols for our ",(0,s.jsx)(n.strong,{children:"online feature store"}),". 
After carefully weighing ",(0,s.jsx)(n.strong,{children:"cost, latency, throughput"}),", and ",(0,s.jsx)(n.strong,{children:"operational stability"}),", we settled on a combination of:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Cassandra"})," and ",(0,s.jsx)(n.strong,{children:"Redis"})," for storage"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"gRPC + Proto3"})," as our communication layer"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"streamlining-the-data-flow",children:"Streamlining the Data Flow"}),"\n",(0,s.jsx)(n.p,{children:"To keep things simple in the initial version:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Feature engineering jobs"})," wrote raw outputs to an ",(0,s.jsx)(n.strong,{children:"S3 bucket"})]}),"\n",(0,s.jsxs)(n.li,{children:["A ",(0,s.jsx)(n.strong,{children:"daily feature push job"}),":","\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Read from S3"}),"\n",(0,s.jsxs)(n.li,{children:["Grouped related features into ",(0,s.jsx)(n.strong,{children:"Feature Groups"})," (ensuring consistency)"]}),"\n",(0,s.jsxs)(n.li,{children:["Pushed them to ",(0,s.jsx)(n.strong,{children:"Kafka"})]}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"For features requiring frequent updates:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Ad-hoc jobs"})," computed features at a higher frequency"]}),"\n",(0,s.jsxs)(n.li,{children:["These jobs pushed to both ",(0,s.jsx)(n.strong,{children:"Kafka"})," and ",(0,s.jsx)(n.strong,{children:"S3"})," (S3 preserved historical data for future model training)"]}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"the-challenges-data-format-and-storage",children:"The Challenges: Data Format and Storage"}),"\n",(0,s.jsxs)(n.p,{children:["One of the most critical design challenges was how to store feature data ",(0,s.jsx)(n.strong,{children:"efficiently and 
consistently"}),", especially in databases like ",(0,s.jsx)(n.strong,{children:"Cassandra"})," and ",(0,s.jsx)(n.strong,{children:"Redis"}),", which come with unique storage constraints."]}),"\n",(0,s.jsx)(n.p,{children:"We had to solve for three key requirements:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"feature-consistency",children:"Feature Consistency"}),"\n",(0,s.jsxs)(n.p,{children:["When a feature group contains features like ",(0,s.jsx)(n.code,{children:"order_count_1h"})," and ",(0,s.jsx)(n.code,{children:"click_count_1h"}),", both must reflect the ",(0,s.jsx)(n.strong,{children:"same time window"}),". Inconsistent updates would lead to ",(0,s.jsx)(n.strong,{children:"unreliable model predictions"}),"."]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"ttl-granularity",children:"TTL Granularity"}),"\n",(0,s.jsxs)(n.p,{children:["Each feature group required an ",(0,s.jsx)(n.strong,{children:"expiry timestamp"}),", so that ",(0,s.jsx)(n.strong,{children:"all features within it expired together"}),"\u2014preserving consistency during reads."]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"extensibility-across-databases",children:"Extensibility Across Databases"}),"\n",(0,s.jsxs)(n.p,{children:["We anticipated that infra needs would evolve. 
To future-proof our system, the data format was designed to be ",(0,s.jsx)(n.strong,{children:"decoupled from DB-specific layouts"}),", enabling portability to systems like ",(0,s.jsx)(n.strong,{children:"ScyllaDB"}),", ",(0,s.jsx)(n.strong,{children:"DynamoDB"}),", ",(0,s.jsx)(n.strong,{children:"HBase"}),", or ",(0,s.jsx)(n.strong,{children:"BigTable"}),"."]}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"overcoming-technical-constraints",children:"Overcoming Technical Constraints"}),"\n",(0,s.jsx)(n.p,{children:'At the time, we were using Cassandra, which not only imposed a soft limit of 75 columns per row, but also exhibited significant performance degradation as the number of columns increased further, particularly on memory-constrained machines. Wide rows caused high memory usage during reads, unpredictable latencies due to heavy deserialization overhead, and inefficiencies during compactions and repairs. This ruled out the naive "one column per feature" approach. 
We needed a format that was compact, minimized the number of columns, and remained efficient and portable across different storage systems.'}),"\n",(0,s.jsx)(n.h2,{id:"the-solution-schema-separation",children:"The Solution: Schema Separation"}),"\n",(0,s.jsx)(n.p,{children:"We introduced the concept of Feature Groups\u2014logical groupings of features that must remain consistent with one another.\nTo represent these groups efficiently, we adopted a layered storage approach:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Feature Labels (Keys)"})," were stored in ZooKeeper, serving as the schema."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Feature Values"})," were stored as a comma-separated string array in Cassandra or Redis."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Expiry Timestamp and Schema Version"})," were appended using a semi-colon delimiter at the end of the string."]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Example:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"feature_1_value,feature_2_value,feature_3_value;expiry_ts\n"})}),"\n",(0,s.jsx)(n.p,{children:"This format allowed:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Consistent writes and reads at the group level"}),"\n",(0,s.jsx)(n.li,{children:"Easy parsing of feature values using the schema lookup from ZooKeeper"}),"\n",(0,s.jsx)(n.li,{children:"Efficient storage with minimal DB column usage"}),"\n",(0,s.jsx)(n.li,{children:"Support for per-group TTLs and schema evolution"}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"tracking-changes-in-feature-groups",children:"Tracking Changes in Feature Groups"}),"\n",(0,s.jsx)(n.p,{children:"Feature groups don\u2019t stay static. As models evolve, features get added, renamed, or removed. 
But schema changes often go live before the data is ready\u2014and stopping ingestion just to wait for everything to align isn't feasible."}),"\n",(0,s.jsx)(n.h3,{id:"common-real-world-scenarios",children:"Common Real-World Scenarios:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"A new feature is added to the schema, but ingestion jobs still use the older schema version."}),"\n",(0,s.jsx)(n.li,{children:"Ongoing writes don\u2019t include the newly added feature, and stopping ingestion would break freshness for existing features."}),"\n",(0,s.jsx)(n.li,{children:"During serving, models request a mix of old and new features, depending on rollout stages."}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"the-solution-schema-versioning",children:"The Solution: Schema Versioning"}),"\n",(0,s.jsx)(n.p,{children:"We solved this with versioned feature group schemas, which unlocked several capabilities:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"backward-compatibility",children:"Backward Compatibility"}),"\n","Older ingestion jobs can continue writing using older schema versions. During reads, the system uses the schema version embedded in the value to interpret the data correctly."]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"partial-availability-handling",children:"Partial Availability Handling"}),"\n","During inference, if some features in the request aren\u2019t available (due to rollout delays or missing data), the system serves default values, ensuring the inference call doesn\u2019t fail."]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"safe-writes-without-pipeline-pauses",children:"Safe Writes Without Pipeline Pauses"}),"\n","With schema versioning, we no longer had to stop ingestion pipelines for schema updates. 
Writes using previous versions can continue safely, and downstream consumers evolve independently.\nThis design gave us the flexibility to move fast without breaking things\u2014preserving data quality, enabling experimentation, and ensuring reliability at scale."]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt Text",src:i(8733).A+"",width:"1600",height:"599"})}),"\n",(0,s.jsx)(n.h2,{id:"interaction-store---0th-version",children:"Interaction Store - 0th Version"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt Text",src:i(395).A+"",width:"1600",height:"518"})}),"\n",(0,s.jsxs)(n.p,{children:["To power real-time Candidate Generators (CGs), we needed fast access to user behavior signals\u2014like what a user recently clicked, ordered, or added to their cart. These interactions form the basis for many real-time recommendations, such as ",(0,s.jsx)(n.strong,{children:"Similar Products"}),", ",(0,s.jsx)(n.strong,{children:"People Also Viewed"}),", or ",(0,s.jsx)(n.strong,{children:"Recently Ordered Again"}),".\nFor the ",(0,s.jsx)(n.strong,{children:"0th version"})," of the Interaction Store, we focused on a design that was ",(0,s.jsx)(n.strong,{children:"simple, fast, and reliable"})," \u2014 optimized for high-throughput ingestion and low-latency lookups."]}),"\n",(0,s.jsx)(n.h2,{id:"event-ingestion",children:"Event Ingestion"}),"\n",(0,s.jsx)(n.p,{children:"We instrumented our backend services to emit key user interaction events to Kafka in real time. 
These included:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Click"}),"\n",(0,s.jsx)(n.li,{children:"Order"}),"\n",(0,s.jsx)(n.li,{children:"Add to Cart"}),"\n",(0,s.jsx)(n.li,{children:"Wishlist"}),"\n",(0,s.jsx)(n.li,{children:"Share"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Each event carried essential metadata:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"userId \u2014 uniquely identifies the user"}),"\n",(0,s.jsx)(n.li,{children:"productId \u2014 the item being interacted with"}),"\n",(0,s.jsx)(n.li,{children:"timestamp \u2014 the moment the interaction occurred"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This decoupled the interaction logging from storage, allowing ingestion and consumption to scale independently."}),"\n",(0,s.jsx)(n.h2,{id:"storage-design",children:"Storage Design"}),"\n",(0,s.jsx)(n.p,{children:"To store these events, we built Kafka consumers that processed the incoming streams and wrote the data into Redis, using sorted sets (ZSETs) as the primary data structure."}),"\n",(0,s.jsx)(n.h3,{id:"why-redis",children:"Why Redis?"}),"\n",(0,s.jsx)(n.p,{children:"Redis gave us:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Low-latency"})," reads and writes"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Time-ordered data"})," using ZSETs (via score = timestamp)"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Native TTL support"}),", if needed in later versions"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"In-memory performance"})," \u2014ideal for real-time CGs"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"storage-structure",children:"Storage Structure"}),"\n",(0,s.jsx)(n.p,{children:"Each user\u2019s interactions were stored using a composite key format, uniquely identifying the user and interaction type. 
This structure allowed efficient organization and quick retrieval of recent activity for recommendation generation:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"userId_eventType \u2192 ZSET[...(pid, ts)...]\n"})}),"\n",(0,s.jsx)(n.p,{children:"Within each ZSET:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["The ",(0,s.jsx)(n.strong,{children:"timestamp"})," served as the score, maintaining temporal order"]}),"\n",(0,s.jsxs)(n.li,{children:["The ",(0,s.jsx)(n.strong,{children:"productId"})," (optionally with metadata) was the ",(0,s.jsx)(n.strong,{children:"value"})]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This allowed us to efficiently retrieve interactions via an HTTP-based API server, supporting two query modes:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["Fetch the ",(0,s.jsx)(n.strong,{children:"last k interactions"})," of a specific type for a given user with ",(0,s.jsx)(n.code,{children:"ZREVRANGE(userId_eventType, count)"})]}),"\n",(0,s.jsxs)(n.li,{children:["Retrieve ",(0,s.jsx)(n.strong,{children:"all interactions within a time range"})," (e.g., last 24 hours) with ",(0,s.jsx)(n.code,{children:"ZREVRANGEBYSCORE(userId_eventType, timeRange)"})]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"built-in-guardrails",children:"Built-in Guardrails"}),"\n",(0,s.jsx)(n.p,{children:"Since Redis was the sole store, we implemented High Availability (HA) to prevent data loss. 
To optimize memory usage, we also enforced size limits per event type\u2014only storing the last k interactions per user, with older entries getting truncated."}),"\n",(0,s.jsx)(n.h2,{id:"conclusion-laying-the-foundation-for-real-time-ml",children:"Conclusion: Laying the Foundation for Real-Time ML"}),"\n",(0,s.jsxs)(n.p,{children:["In this first phase, we tackled the ",(0,s.jsx)(n.strong,{children:"fundamentals"}),"\u2014shifting from batch-based recommendations to a ",(0,s.jsx)(n.strong,{children:"real-time, ML-powered recommendation platform"})," that could keep up with Meesho\u2019s growth."]}),"\n",(0,s.jsxs)(n.p,{children:["With the ",(0,s.jsx)(n.strong,{children:"IOP Framework"}),", ",(0,s.jsx)(n.strong,{children:"Online Feature Store"}),", and ",(0,s.jsx)(n.strong,{children:"Interaction Store"}),", we built the core infrastructure to support real-time personalization at scale. These wins have already unlocked:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"\u2705 Faster, more dynamic recommendations for millions of users."}),"\n",(0,s.jsx)(n.li,{children:"\u2705 Better infrastructure efficiency, reducing wasted compute power."}),"\n",(0,s.jsx)(n.li,{children:"\u2705 A flexible, modular system that allows for further experimentation."}),"\n"]}),"\n",(0,s.jsxs)(n.p,{children:["But this is just the beginning. 
While we've solved key challenges, ",(0,s.jsx)(n.strong,{children:"certain roadblocks remain"})," \u2014from optimizing ",(0,s.jsx)(n.strong,{children:"cost-performance trade-offs"})," to ",(0,s.jsx)(n.strong,{children:"seamlessly evolving schemas"}),"."]}),"\n",(0,s.jsxs)(n.p,{children:["This foundational work laid the path for a reliable and scalable ",(0,s.jsx)(n.strong,{children:"real-time feature serving layer"}),"."]})]})}function c(e={}){const{wrapper:n}={...(0,r.R)(),...e.components};return n?(0,s.jsx)(n,{...e,children:(0,s.jsx)(h,{...e})}):h(e)}}}]); \ No newline at end of file diff --git a/docs/assets/js/0e384e19.cb894d32.js b/docs/assets/js/0e384e19.cb894d32.js new file mode 100644 index 00000000..ae6953f5 --- /dev/null +++ b/docs/assets/js/0e384e19.cb894d32.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[3976],{2053:(e,t,n)=>{n.r(t),n.d(t,{assets:()=>c,contentTitle:()=>s,default:()=>h,frontMatter:()=>i,metadata:()=>r,toc:()=>l});const r=JSON.parse('{"id":"intro","title":"BharatMLStack Documentation","description":"Welcome to the BharatMLStack documentation. BharatMLStack is an open-source, end-to-end ML infrastructure stack built for scale, speed, and simplicity. 
Explore the components below to get started.","source":"@site/docs/intro.md","sourceDirName":".","slug":"/intro","permalink":"/BharatMLStack/intro","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/intro.md","tags":[],"version":"current","sidebarPosition":0,"frontMatter":{"sidebar_position":0,"title":"BharatMLStack Documentation","slug":"intro"},"sidebar":"tutorialSidebar","next":{"title":"Online Feature Store","permalink":"/BharatMLStack/category/online-feature-store"}}');var o=n(4848),a=n(8453);const i={sidebar_position:0,title:"BharatMLStack Documentation",slug:"intro"},s="BharatMLStack Documentation",c={},l=[{value:"Quick Start",id:"quick-start",level:2},{value:"Online Feature Store",id:"online-feature-store",level:2},{value:"Inferflow",id:"inferflow",level:2},{value:"Trufflebox UI",id:"trufflebox-ui",level:2},{value:"SDKs",id:"sdks",level:2},{value:"Numerix",id:"numerix",level:2}];function d(e){const t={a:"a",h1:"h1",h2:"h2",header:"header",hr:"hr",p:"p",strong:"strong",...(0,a.R)(),...e.components};return(0,o.jsxs)(o.Fragment,{children:[(0,o.jsx)(t.header,{children:(0,o.jsx)(t.h1,{id:"bharatmlstack-documentation",children:"BharatMLStack Documentation"})}),"\n",(0,o.jsx)(t.p,{children:"Welcome to the BharatMLStack documentation. BharatMLStack is an open-source, end-to-end ML infrastructure stack built for scale, speed, and simplicity. Explore the components below to get started."}),"\n",(0,o.jsx)(t.hr,{}),"\n",(0,o.jsx)(t.h2,{id:"quick-start",children:"Quick Start"}),"\n",(0,o.jsx)(t.p,{children:"Get up and running with BharatMLStack in minutes. 
Step-by-step instructions, sample data, and Docker Compose setup for local development and testing."}),"\n",(0,o.jsx)(t.p,{children:(0,o.jsx)(t.strong,{children:(0,o.jsx)(t.a,{href:"/category/quick-start",children:"Go to Quick Start \u2192"})})}),"\n",(0,o.jsx)(t.hr,{}),"\n",(0,o.jsx)(t.h2,{id:"online-feature-store",children:"Online Feature Store"}),"\n",(0,o.jsx)(t.p,{children:"Sub-10ms, high-throughput access to machine learning features for real-time inference. Supports batch and streaming ingestion, schema validation, and compact versioned feature groups."}),"\n",(0,o.jsx)(t.p,{children:(0,o.jsx)(t.strong,{children:(0,o.jsx)(t.a,{href:"/category/online-feature-store",children:"Go to Online Feature Store \u2192"})})}),"\n",(0,o.jsx)(t.hr,{}),"\n",(0,o.jsx)(t.h2,{id:"inferflow",children:"Inferflow"}),"\n",(0,o.jsx)(t.p,{children:"Graph-driven feature retrieval and model inference orchestration engine. Dynamically resolves entity relationships, retrieves features, and orchestrates model scoring \u2014 all without custom code."}),"\n",(0,o.jsx)(t.p,{children:(0,o.jsx)(t.strong,{children:(0,o.jsx)(t.a,{href:"/category/inferflow",children:"Go to Inferflow \u2192"})})}),"\n",(0,o.jsx)(t.hr,{}),"\n",(0,o.jsx)(t.h2,{id:"trufflebox-ui",children:"Trufflebox UI"}),"\n",(0,o.jsx)(t.p,{children:"Modern, feature-rich UI framework for MLOps management. Supports feature catalog, user management, and admin operations with approval flows."}),"\n",(0,o.jsx)(t.p,{children:(0,o.jsx)(t.strong,{children:(0,o.jsx)(t.a,{href:"/category/trufflebox-ui",children:"Go to Trufflebox UI \u2192"})})}),"\n",(0,o.jsx)(t.hr,{}),"\n",(0,o.jsx)(t.h2,{id:"sdks",children:"SDKs"}),"\n",(0,o.jsx)(t.p,{children:"Client libraries for Go and Python to interact with the Online Feature Store and other platform components. 
Includes gRPC clients, REST APIs, and Apache Spark integration."}),"\n",(0,o.jsx)(t.p,{children:(0,o.jsx)(t.strong,{children:(0,o.jsx)(t.a,{href:"/category/sdks",children:"Go to SDKs \u2192"})})}),"\n",(0,o.jsx)(t.hr,{}),"\n",(0,o.jsx)(t.h2,{id:"numerix",children:"Numerix"}),"\n",(0,o.jsx)(t.p,{children:"High-performance compute engine for ultra-fast element-wise matrix operations. Built in Rust with SIMD acceleration for sub-5ms p99 latency."}),"\n",(0,o.jsx)(t.p,{children:(0,o.jsx)(t.strong,{children:(0,o.jsx)(t.a,{href:"/category/numerix",children:"Go to Numerix \u2192"})})})]})}function h(e={}){const{wrapper:t}={...(0,a.R)(),...e.components};return t?(0,o.jsx)(t,{...e,children:(0,o.jsx)(d,{...e})}):d(e)}},8453:(e,t,n)=>{n.d(t,{R:()=>i,x:()=>s});var r=n(6540);const o={},a=r.createContext(o);function i(e){const t=r.useContext(a);return r.useMemo(function(){return"function"==typeof e?e(t):{...t,...e}},[t,e])}function s(e){let t;return t=e.disableParentContext?"function"==typeof e.components?e.components(o):e.components||o:i(e.components),r.createElement(a.Provider,{value:t},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/0fff8dc8.70193857.js b/docs/assets/js/0fff8dc8.70193857.js new file mode 100644 index 00000000..39ec1e2e --- /dev/null +++ b/docs/assets/js/0fff8dc8.70193857.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[9596],{5958:(e,n,s)=>{s.r(n),s.d(n,{assets:()=>a,contentTitle:()=>c,default:()=>h,frontMatter:()=>t,metadata:()=>r,toc:()=>o});const r=JSON.parse('{"id":"quick-start/v1.0.0/quick-start","title":"Quick 
Start","description":"Discord","source":"@site/docs/quick-start/v1.0.0/quick-start.md","sourceDirName":"quick-start/v1.0.0","slug":"/quick-start/v1.0.0/quick-start","permalink":"/BharatMLStack/quick-start/v1.0.0/quick-start","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/quick-start/v1.0.0/quick-start.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"Quick Start","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"v1.0.0","permalink":"/BharatMLStack/quick-start/v1.0.0"},"next":{"title":"Trufflebox UI","permalink":"/BharatMLStack/category/trufflebox-ui"}}');var i=s(4848),l=s(8453);const t={title:"Quick Start",sidebar_position:1},c="BharatML Stack Quick Start Guide",a={},o=[{value:"Prerequisites",id:"prerequisites",level:2},{value:"System Components",id:"system-components",level:2},{value:"Quick Start",id:"quick-start",level:2},{value:"Starting the System",id:"starting-the-system",level:3},{value:"Testing Different Versions",id:"testing-different-versions",level:3},{value:"Stopping the System",id:"stopping-the-system",level:3},{value:"Accessing Services",id:"accessing-services",level:2},{value:"Frontend UI",id:"frontend-ui",level:3},{value:"API Endpoints",id:"api-endpoints",level:3},{value:"Database Access",id:"database-access",level:3},{value:"Feature Store API Examples",id:"feature-store-api-examples",level:2},{value:"gRPC API Commands",id:"grpc-api-commands",level:3},{value:"Sample Request Bodies",id:"sample-request-bodies",level:3},{value:"Key Points",id:"key-points",level:3},{value:"Response Format Differences",id:"response-format-differences",level:3},{value:"Managing Services",id:"managing-services",level:2},{value:"Viewing Logs",id:"viewing-logs",level:3},{value:"Service Management",id:"service-management",level:3},{value:"Troubleshooting",id:"troubleshooting",level:2},{value:"Common Issues",id:"common-issues",level:3},{value:"Service 
Dependencies",id:"service-dependencies",level:3},{value:"Development",id:"development",level:2},{value:"Contributing",id:"contributing",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function d(e){const n={a:"a",code:"code",h1:"h1",h2:"h2",h3:"h3",header:"header",hr:"hr",img:"img",li:"li",ol:"ol",p:"p",pre:"pre",strong:"strong",ul:"ul",...(0,l.R)(),...e.components};return(0,i.jsxs)(i.Fragment,{children:[(0,i.jsx)(n.header,{children:(0,i.jsx)(n.h1,{id:"bharatml-stack-quick-start-guide",children:"BharatML Stack Quick Start Guide"})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:(0,i.jsx)(n.img,{src:"https://img.shields.io/badge/Discord-Join%20Chat-7289da?style=flat&logo=discord&logoColor=white",alt:"Discord"})})}),"\n",(0,i.jsx)(n.p,{children:"A quick way to get the BharatML Stack Online Feature Store platform up and running locally for development and testing."}),"\n",(0,i.jsx)(n.h2,{id:"prerequisites",children:"Prerequisites"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Docker and Docker Compose"}),"\n",(0,i.jsx)(n.li,{children:"Go 1.22 or later"}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"nc"})," (netcat) command for connectivity checks"]}),"\n",(0,i.jsx)(n.li,{children:"Bash shell"}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"grpcurl"})," for testing gRPC API endpoints (install from ",(0,i.jsx)(n.a,{href:"https://github.com/fullstorydev/grpcurl",children:"https://github.com/fullstorydev/grpcurl"}),")"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"system-components",children:"System Components"}),"\n",(0,i.jsx)(n.p,{children:"BharatMLStack's Online Feature Store consists of several interconnected services:"}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Infrastructure Services:"})}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"ScyllaDB"}),": NoSQL 
database for high-performance feature storage"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"MySQL"}),": Relational database for metadata and configuration"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Redis"}),": In-memory data store for caching"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"etcd"}),": Distributed key-value store for service coordination"]}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Application Services:"})}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Horizon"}),": Backend API service (runs on port 8082)"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Trufflebox UI"}),": Frontend web interface (runs on port 3000)"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Online Feature Store gRPC API Server"}),": High-performance gRPC interface (runs on port 8089)"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"etcd Workbench"}),": etcd management interface (runs on port 8081)"]}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:"All services are orchestrated using Docker Compose with pre-built images from GitHub Container Registry (GHCR)."}),"\n",(0,i.jsx)(n.h2,{id:"quick-start",children:"Quick Start"}),"\n",(0,i.jsx)(n.h3,{id:"starting-the-system",children:"Starting the System"}),"\n",(0,i.jsx)(n.p,{children:"Run the start script to set up your workspace and launch all services:"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"./start.sh\n"})}),"\n",(0,i.jsx)(n.h3,{id:"testing-different-versions",children:"Testing Different Versions"}),"\n",(0,i.jsx)(n.p,{children:"You can easily test different versions of the application services by setting environment variables:"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"# Test specific versions [Replace with actual versions]\nONFS_VERSION=v1.2.3 
HORIZON_VERSION=v2.1.0 TRUFFLEBOX_VERSION=v1.0.5 ./start.sh\n\n# Or set them in your workspace and run docker-compose directly\ncd workspace\nONFS_VERSION=main docker-compose up -d onfs-api-server\n"})}),"\n",(0,i.jsx)(n.p,{children:"Available version formats:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"latest"})," (default) - Latest stable release"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"main"})," - Latest development build"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"v1.2.3"})," - Specific version tag"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"sha-abcd1234"})," - Specific commit SHA"]}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:"This will:"}),"\n",(0,i.jsxs)(n.ol,{children:["\n",(0,i.jsx)(n.li,{children:"Check for Go installation (1.22+ required)"}),"\n",(0,i.jsx)(n.li,{children:"Create a workspace directory with configuration files"}),"\n",(0,i.jsxs)(n.li,{children:["Pull and start all services using ",(0,i.jsx)(n.code,{children:"docker-compose up -d"})]}),"\n",(0,i.jsx)(n.li,{children:"Wait for services to become healthy"}),"\n",(0,i.jsx)(n.li,{children:"Initialize databases with required schemas"}),"\n",(0,i.jsx)(n.li,{children:"Display access information and helpful commands"}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:"Once complete, you can access:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Trufflebox UI"}),": ",(0,i.jsx)(n.a,{href:"http://localhost:3000",children:"http://localhost:3000"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Horizon API"}),": ",(0,i.jsx)(n.a,{href:"http://localhost:8082",children:"http://localhost:8082"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Online Feature Store gRPC API"}),": ",(0,i.jsx)(n.a,{href:"http://localhost:8089",children:"http://localhost:8089"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"etcd 
Workbench"}),": ",(0,i.jsx)(n.a,{href:"http://localhost:8081",children:"http://localhost:8081"})]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"stopping-the-system",children:"Stopping the System"}),"\n",(0,i.jsx)(n.p,{children:"To stop all services:"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"./stop.sh\n"})}),"\n",(0,i.jsx)(n.p,{children:"To stop and completely purge all containers, volumes, and workspace:"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"./stop.sh --purge\n"})}),"\n",(0,i.jsx)(n.h2,{id:"accessing-services",children:"Accessing Services"}),"\n",(0,i.jsx)(n.h3,{id:"frontend-ui",children:"Frontend UI"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"URL"}),": ",(0,i.jsx)(n.a,{href:"http://localhost:3000",children:"http://localhost:3000"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Default admin credentials"}),":","\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Email: ",(0,i.jsx)(n.code,{children:"admin@admin.com"})]}),"\n",(0,i.jsxs)(n.li,{children:["Password: ",(0,i.jsx)(n.code,{children:"admin"})]}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"api-endpoints",children:"API Endpoints"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Horizon API"}),": ",(0,i.jsx)(n.a,{href:"http://localhost:8082",children:"http://localhost:8082"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Health check: ",(0,i.jsx)(n.a,{href:"http://localhost:8082/health",children:"http://localhost:8082/health"})]}),"\n"]}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"ONFS gRPC API"}),": ",(0,i.jsx)(n.a,{href:"http://localhost:8089",children:"http://localhost:8089"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Health check: 
",(0,i.jsx)(n.a,{href:"http://localhost:8089/health/self",children:"http://localhost:8089/health/self"})]}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"database-access",children:"Database Access"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"MySQL"}),":"]}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Host: localhost"}),"\n",(0,i.jsx)(n.li,{children:"Port: 3306"}),"\n",(0,i.jsx)(n.li,{children:"Username: root"}),"\n",(0,i.jsx)(n.li,{children:"Password: root"}),"\n",(0,i.jsx)(n.li,{children:"Database: testdb"}),"\n"]}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"ScyllaDB"}),":"]}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Host: localhost"}),"\n",(0,i.jsx)(n.li,{children:"Port: 9042"}),"\n",(0,i.jsx)(n.li,{children:"Keyspace: onfs"}),"\n"]}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"Redis"}),":"]}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Host: localhost"}),"\n",(0,i.jsx)(n.li,{children:"Port: 6379"}),"\n"]}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"etcd"}),":"]}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Endpoint: ",(0,i.jsx)(n.a,{href:"http://localhost:2379",children:"http://localhost:2379"})]}),"\n",(0,i.jsxs)(n.li,{children:["Workbench: ",(0,i.jsx)(n.a,{href:"http://localhost:8081",children:"http://localhost:8081"})]}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"feature-store-api-examples",children:"Feature Store API Examples"}),"\n",(0,i.jsx)(n.h3,{id:"grpc-api-commands",children:"gRPC API Commands"}),"\n",(0,i.jsxs)(n.p,{children:["Use the following ",(0,i.jsx)(n.code,{children:"grpcurl"})," commands to interact with the Online Feature Store gRPC 
API:"]}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Persist Features:"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:'grpcurl -plaintext -H "online-feature-store-caller-id: " -H "online-feature-store-auth-token: " -d \'\' localhost:8089 persist.FeatureService/PersistFeatures\n'})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Retrieve Features (Decoded):"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:'grpcurl -plaintext -H "online-feature-store-caller-id: " -H "online-feature-store-auth-token: " -d \'\' localhost:8089 retrieve.FeatureService/RetrieveDecodedResult\n'})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Retrieve Features (Binary):"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:'grpcurl -plaintext -H "online-feature-store-caller-id: " -H "online-feature-store-auth-token: " -d \'\' localhost:8089 retrieve.FeatureService/RetrieveFeatures\n'})}),"\n",(0,i.jsx)(n.h3,{id:"sample-request-bodies",children:"Sample Request Bodies"}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Single Feature Group Persist:"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-json",children:'{\n "data": [{\n "key_values": ["10"],\n "feature_values": [{\n "values": {"fp32_values": [123.45]}\n }]\n }],\n "entity_label": "catalog",\n "feature_group_schema": [{\n "label": "int_fg",\n "feature_labels": ["id"]\n }],\n "keys_schema": ["catalog_id"]\n}\n'})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Single Feature Group Retrieve:"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-json",children:'{\n "entity_label": "catalog",\n "feature_groups": [{\n "label": "int_fg",\n "feature_labels": ["id"]\n }],\n "keys_schema": ["catalog_id"],\n "keys": [{"cols": ["10"]}]\n}\n'})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Multiple Feature 
Groups Persist:"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-json",children:'{\n "data": [\n {\n "key_values": ["1"],\n "feature_values": [\n {"values": {"fp32_values": [28.5]}},\n {"values": {"string_values": ["Bharat"]}}\n ]\n },\n {\n "key_values": ["2"],\n "feature_values": [\n {"values": {"fp32_values": [32.0]}},\n {"values": {"string_values": ["India"]}}\n ]\n }\n ],\n "entity_label": "catalog",\n "feature_group_schema": [\n {"label": "int_fg", "feature_labels": ["id"]},\n {"label": "string_fg", "feature_labels": ["name"]}\n ],\n "keys_schema": ["catalog_id"]\n}\n'})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Multiple Feature Groups Retrieve:"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-json",children:'{\n "entity_label": "catalog",\n "feature_groups": [\n {"label": "int_fg", "feature_labels": ["id"]},\n {"label": "string_fg", "feature_labels": ["name"]}\n ],\n "keys_schema": ["catalog_id"],\n "keys": [\n {"cols": ["1"]},\n {"cols": ["2"]}\n ]\n}\n'})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Vector Feature Group Persist:"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-json",children:'{\n "data": [{\n "key_values": ["123"],\n "feature_values": [{\n "values": {\n "vector": [{\n "values": {"fp32_values": [1.0, 2.0, 3.0, 4.0]}\n }]\n }\n }]\n }],\n "entity_label": "catalog",\n "feature_group_schema": [{\n "label": "vector_fg",\n "feature_labels": ["embedding"]\n }],\n "keys_schema": ["catalog_id"]\n}\n'})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Vector Feature Group Retrieve:"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-json",children:'{\n "entity_label": "catalog",\n "feature_groups": [{\n "label": "vector_fg",\n "feature_labels": ["embedding"]\n }],\n "keys_schema": ["catalog_id"],\n "keys": [{"cols": ["123"]}]\n}\n'})}),"\n",(0,i.jsx)(n.h3,{id:"key-points",children:"Key 
Points"}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Only one type per feature value block:"})}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"feature_values"})," is a list, and each item in the list has only one value type populated"]}),"\n",(0,i.jsxs)(n.li,{children:["For example: one item has only ",(0,i.jsx)(n.code,{children:"fp32_values"}),", another has only ",(0,i.jsx)(n.code,{children:"int64_values"})]}),"\n"]}),"\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"Field Types:"}),"\nThe following value types are supported:"]}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"fp32_values"}),": ",(0,i.jsx)(n.code,{children:"float32[]"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"fp64_values"}),": ",(0,i.jsx)(n.code,{children:"float64[]"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"int32_values"}),": ",(0,i.jsx)(n.code,{children:"int32[]"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"int64_values"}),": ",(0,i.jsx)(n.code,{children:"string[]"})," (because JSON doesn't support 64-bit ints directly)"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"uint32_values"}),": ",(0,i.jsx)(n.code,{children:"uint32[]"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"uint64_values"}),": ",(0,i.jsx)(n.code,{children:"string[]"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"string_values"}),": ",(0,i.jsx)(n.code,{children:"string[]"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"bool_values"}),": ",(0,i.jsx)(n.code,{children:"bool[]"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"vector"}),": list of objects with nested values (used for embedded features)"]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"response-format-differences",children:"Response Format 
Differences"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Retrieve Features (Binary)"}),": Returns data in binary format for optimal performance and reduced network overhead"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Retrieve Features (Decoded)"}),": Returns data in human-readable string format for easier debugging and development purposes"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"managing-services",children:"Managing Services"}),"\n",(0,i.jsx)(n.h3,{id:"viewing-logs",children:"Viewing Logs"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"# View logs for all services\ncd workspace && docker-compose logs -f\n\n# View logs for specific services\ncd workspace && docker-compose logs -f horizon\ncd workspace && docker-compose logs -f trufflebox-ui\ncd workspace && docker-compose logs -f onfs-api-server\n"})}),"\n",(0,i.jsx)(n.h3,{id:"service-management",children:"Service Management"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"# Restart a specific service\ncd workspace && docker-compose restart horizon\n\n# Stop all services\ncd workspace && docker-compose down\n\n# Start services again\ncd workspace && docker-compose up -d\n\n# Check service status\ncd workspace && docker-compose ps\n"})}),"\n",(0,i.jsx)(n.h2,{id:"troubleshooting",children:"Troubleshooting"}),"\n",(0,i.jsx)(n.h3,{id:"common-issues",children:"Common Issues"}),"\n",(0,i.jsxs)(n.ol,{children:["\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"Port conflicts"}),": Ensure ports 3000, 8081, 8082, 8089, 9042, 3306, 6379, and 2379 are not in use by other applications."]}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"Docker network issues"}),": If containers can't communicate, try 
recreating:"]}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"docker network rm onfs-network\ndocker network create onfs-network\n"})}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"Service health checks failing"}),": Check if all infrastructure services (databases) are running:"]}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"cd workspace && docker-compose ps\n"})}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"Image pull issues"}),": Ensure you have access to GitHub Container Registry:"]}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"docker login ghcr.io\n"})}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.a,{href:"https://github.com/tzfun/etcd-workbench/blob/master/README.md",children:"How to use etcd Workbench?"})}),"\n"]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"service-dependencies",children:"Service Dependencies"}),"\n",(0,i.jsx)(n.p,{children:"Services start in the following order:"}),"\n",(0,i.jsxs)(n.ol,{children:["\n",(0,i.jsx)(n.li,{children:"Infrastructure services (ScyllaDB, MySQL, Redis, etcd)"}),"\n",(0,i.jsx)(n.li,{children:"Online Feature Store gRPC API Server"}),"\n",(0,i.jsx)(n.li,{children:"Horizon (depends on databases + ONFS API)"}),"\n",(0,i.jsx)(n.li,{children:"Trufflebox UI (depends on Horizon)"}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:"If a service fails to start, check that its dependencies are healthy first."}),"\n",(0,i.jsx)(n.h2,{id:"development",children:"Development"}),"\n",(0,i.jsx)(n.p,{children:"The workspace directory contains all runtime configuration:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"workspace/docker-compose.yml"})," - Complete service 
orchestration"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"workspace/check_db_and_init.sh"})," - Database initialization script"]}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:"You can modify environment variables in the docker-compose.yml file and restart services."}),"\n",(0,i.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,i.jsxs)(n.p,{children:["We welcome contributions from the community! Please see our ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,i.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["\ud83d\udcac ",(0,i.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,i.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udc1b ",(0,i.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udce7 ",(0,i.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,i.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,i.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 1.1"}),"."]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function 
h(e={}){const{wrapper:n}={...(0,l.R)(),...e.components};return n?(0,i.jsx)(n,{...e,children:(0,i.jsx)(d,{...e})}):d(e)}},8453:(e,n,s)=>{s.d(n,{R:()=>t,x:()=>c});var r=s(6540);const i={},l=r.createContext(i);function t(e){const n=r.useContext(l);return r.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function c(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(i):e.components||i:t(e.components),r.createElement(l.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/0fff8dc8.fcba975a.js b/docs/assets/js/0fff8dc8.fcba975a.js deleted file mode 100644 index 2fb035eb..00000000 --- a/docs/assets/js/0fff8dc8.fcba975a.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[9596],{5958:(e,n,s)=>{s.r(n),s.d(n,{assets:()=>a,contentTitle:()=>c,default:()=>h,frontMatter:()=>t,metadata:()=>r,toc:()=>o});const r=JSON.parse('{"id":"quick-start/v1.0.0/quick-start","title":"Quick Start","description":"Discord","source":"@site/docs/quick-start/v1.0.0/quick-start.md","sourceDirName":"quick-start/v1.0.0","slug":"/quick-start/v1.0.0/quick-start","permalink":"/BharatMLStack/quick-start/v1.0.0/quick-start","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/quick-start/v1.0.0/quick-start.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"Quick Start","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"Quick Start","permalink":"/BharatMLStack/category/quick-start"},"next":{"title":"Trufflebox UI","permalink":"/BharatMLStack/category/trufflebox-ui"}}');var i=s(4848),l=s(8453);const t={title:"Quick Start",sidebar_position:1},c="BharatML Stack Quick Start Guide",a={},o=[{value:"Prerequisites",id:"prerequisites",level:2},{value:"System Components",id:"system-components",level:2},{value:"Quick Start",id:"quick-start",level:2},{value:"Starting the 
System",id:"starting-the-system",level:3},{value:"Testing Different Versions",id:"testing-different-versions",level:3},{value:"Stopping the System",id:"stopping-the-system",level:3},{value:"Accessing Services",id:"accessing-services",level:2},{value:"Frontend UI",id:"frontend-ui",level:3},{value:"API Endpoints",id:"api-endpoints",level:3},{value:"Database Access",id:"database-access",level:3},{value:"Feature Store API Examples",id:"feature-store-api-examples",level:2},{value:"gRPC API Commands",id:"grpc-api-commands",level:3},{value:"Sample Request Bodies",id:"sample-request-bodies",level:3},{value:"Key Points",id:"key-points",level:3},{value:"Response Format Differences",id:"response-format-differences",level:3},{value:"Managing Services",id:"managing-services",level:2},{value:"Viewing Logs",id:"viewing-logs",level:3},{value:"Service Management",id:"service-management",level:3},{value:"Troubleshooting",id:"troubleshooting",level:2},{value:"Common Issues",id:"common-issues",level:3},{value:"Service Dependencies",id:"service-dependencies",level:3},{value:"Development",id:"development",level:2},{value:"Contributing",id:"contributing",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function d(e){const n={a:"a",code:"code",h1:"h1",h2:"h2",h3:"h3",header:"header",hr:"hr",img:"img",li:"li",ol:"ol",p:"p",pre:"pre",strong:"strong",ul:"ul",...(0,l.R)(),...e.components};return(0,i.jsxs)(i.Fragment,{children:[(0,i.jsx)(n.header,{children:(0,i.jsx)(n.h1,{id:"bharatml-stack-quick-start-guide",children:"BharatML Stack Quick Start Guide"})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:(0,i.jsx)(n.img,{src:"https://img.shields.io/badge/Discord-Join%20Chat-7289da?style=flat&logo=discord&logoColor=white",alt:"Discord"})})}),"\n",(0,i.jsx)(n.p,{children:"A quick way to get the BharatML Stack Online Feature Store platform up and running locally for development and 
testing."}),"\n",(0,i.jsx)(n.h2,{id:"prerequisites",children:"Prerequisites"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Docker and Docker Compose"}),"\n",(0,i.jsx)(n.li,{children:"Go 1.22 or later"}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"nc"})," (netcat) command for connectivity checks"]}),"\n",(0,i.jsx)(n.li,{children:"Bash shell"}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"grpcurl"})," for testing gRPC API endpoints (install from ",(0,i.jsx)(n.a,{href:"https://github.com/fullstorydev/grpcurl",children:"https://github.com/fullstorydev/grpcurl"}),")"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"system-components",children:"System Components"}),"\n",(0,i.jsx)(n.p,{children:"BharatMLStack's Online Feature Store consists of several interconnected services:"}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Infrastructure Services:"})}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"ScyllaDB"}),": NoSQL database for high-performance feature storage"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"MySQL"}),": Relational database for metadata and configuration"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Redis"}),": In-memory data store for caching"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"etcd"}),": Distributed key-value store for service coordination"]}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Application Services:"})}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Horizon"}),": Backend API service (runs on port 8082)"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Trufflebox UI"}),": Frontend web interface (runs on port 3000)"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Online Feature Store gRPC API Server"}),": High-performance gRPC interface (runs on port 
8089)"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"etcd Workbench"}),": etcd management interface (runs on port 8081)"]}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:"All services are orchestrated using Docker Compose with pre-built images from GitHub Container Registry (GHCR)."}),"\n",(0,i.jsx)(n.h2,{id:"quick-start",children:"Quick Start"}),"\n",(0,i.jsx)(n.h3,{id:"starting-the-system",children:"Starting the System"}),"\n",(0,i.jsx)(n.p,{children:"Run the start script to set up your workspace and launch all services:"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"./start.sh\n"})}),"\n",(0,i.jsx)(n.h3,{id:"testing-different-versions",children:"Testing Different Versions"}),"\n",(0,i.jsx)(n.p,{children:"You can easily test different versions of the application services by setting environment variables:"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"# Test specific versions [Replace with actual versions]\nONFS_VERSION=v1.2.3 HORIZON_VERSION=v2.1.0 TRUFFLEBOX_VERSION=v1.0.5 ./start.sh\n\n# Or set them in your workspace and run docker-compose directly\ncd workspace\nONFS_VERSION=main docker-compose up -d onfs-api-server\n"})}),"\n",(0,i.jsx)(n.p,{children:"Available version formats:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"latest"})," (default) - Latest stable release"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"main"})," - Latest development build"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"v1.2.3"})," - Specific version tag"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"sha-abcd1234"})," - Specific commit SHA"]}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:"This will:"}),"\n",(0,i.jsxs)(n.ol,{children:["\n",(0,i.jsx)(n.li,{children:"Check for Go installation (1.22+ required)"}),"\n",(0,i.jsx)(n.li,{children:"Create a workspace directory with configuration 
files"}),"\n",(0,i.jsxs)(n.li,{children:["Pull and start all services using ",(0,i.jsx)(n.code,{children:"docker-compose up -d"})]}),"\n",(0,i.jsx)(n.li,{children:"Wait for services to become healthy"}),"\n",(0,i.jsx)(n.li,{children:"Initialize databases with required schemas"}),"\n",(0,i.jsx)(n.li,{children:"Display access information and helpful commands"}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:"Once complete, you can access:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Trufflebox UI"}),": ",(0,i.jsx)(n.a,{href:"http://localhost:3000",children:"http://localhost:3000"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Horizon API"}),": ",(0,i.jsx)(n.a,{href:"http://localhost:8082",children:"http://localhost:8082"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Online Feature Store gRPC API"}),": ",(0,i.jsx)(n.a,{href:"http://localhost:8089",children:"http://localhost:8089"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"etcd Workbench"}),": ",(0,i.jsx)(n.a,{href:"http://localhost:8081",children:"http://localhost:8081"})]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"stopping-the-system",children:"Stopping the System"}),"\n",(0,i.jsx)(n.p,{children:"To stop all services:"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"./stop.sh\n"})}),"\n",(0,i.jsx)(n.p,{children:"To stop and completely purge all containers, volumes, and workspace:"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"./stop.sh --purge\n"})}),"\n",(0,i.jsx)(n.h2,{id:"accessing-services",children:"Accessing Services"}),"\n",(0,i.jsx)(n.h3,{id:"frontend-ui",children:"Frontend UI"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"URL"}),": ",(0,i.jsx)(n.a,{href:"http://localhost:3000",children:"http://localhost:3000"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Default 
admin credentials"}),":","\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Email: ",(0,i.jsx)(n.code,{children:"admin@admin.com"})]}),"\n",(0,i.jsxs)(n.li,{children:["Password: ",(0,i.jsx)(n.code,{children:"admin"})]}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"api-endpoints",children:"API Endpoints"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Horizon API"}),": ",(0,i.jsx)(n.a,{href:"http://localhost:8082",children:"http://localhost:8082"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Health check: ",(0,i.jsx)(n.a,{href:"http://localhost:8082/health",children:"http://localhost:8082/health"})]}),"\n"]}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"ONFS gRPC API"}),": ",(0,i.jsx)(n.a,{href:"http://localhost:8089",children:"http://localhost:8089"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Health check: ",(0,i.jsx)(n.a,{href:"http://localhost:8089/health/self",children:"http://localhost:8089/health/self"})]}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"database-access",children:"Database Access"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"MySQL"}),":"]}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Host: localhost"}),"\n",(0,i.jsx)(n.li,{children:"Port: 3306"}),"\n",(0,i.jsx)(n.li,{children:"Username: root"}),"\n",(0,i.jsx)(n.li,{children:"Password: root"}),"\n",(0,i.jsx)(n.li,{children:"Database: testdb"}),"\n"]}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"ScyllaDB"}),":"]}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Host: localhost"}),"\n",(0,i.jsx)(n.li,{children:"Port: 9042"}),"\n",(0,i.jsx)(n.li,{children:"Keyspace: 
onfs"}),"\n"]}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"Redis"}),":"]}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Host: localhost"}),"\n",(0,i.jsx)(n.li,{children:"Port: 6379"}),"\n"]}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"etcd"}),":"]}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Endpoint: ",(0,i.jsx)(n.a,{href:"http://localhost:2379",children:"http://localhost:2379"})]}),"\n",(0,i.jsxs)(n.li,{children:["Workbench: ",(0,i.jsx)(n.a,{href:"http://localhost:8081",children:"http://localhost:8081"})]}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"feature-store-api-examples",children:"Feature Store API Examples"}),"\n",(0,i.jsx)(n.h3,{id:"grpc-api-commands",children:"gRPC API Commands"}),"\n",(0,i.jsxs)(n.p,{children:["Use the following ",(0,i.jsx)(n.code,{children:"grpcurl"})," commands to interact with the Online Feature Store gRPC API:"]}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Persist Features:"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:'grpcurl -plaintext -H "online-feature-store-caller-id: " -H "online-feature-store-auth-token: " -d \'\' localhost:8089 persist.FeatureService/PersistFeatures\n'})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Retrieve Features (Decoded):"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:'grpcurl -plaintext -H "online-feature-store-caller-id: " -H "online-feature-store-auth-token: " -d \'\' localhost:8089 retrieve.FeatureService/RetrieveDecodedResult\n'})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Retrieve Features (Binary):"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:'grpcurl -plaintext -H "online-feature-store-caller-id: " -H "online-feature-store-auth-token: " -d \'\' localhost:8089 
retrieve.FeatureService/RetrieveFeatures\n'})}),"\n",(0,i.jsx)(n.h3,{id:"sample-request-bodies",children:"Sample Request Bodies"}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Single Feature Group Persist:"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-json",children:'{\n "data": [{\n "key_values": ["10"],\n "feature_values": [{\n "values": {"fp32_values": [123.45]}\n }]\n }],\n "entity_label": "catalog",\n "feature_group_schema": [{\n "label": "int_fg",\n "feature_labels": ["id"]\n }],\n "keys_schema": ["catalog_id"]\n}\n'})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Single Feature Group Retrieve:"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-json",children:'{\n "entity_label": "catalog",\n "feature_groups": [{\n "label": "int_fg",\n "feature_labels": ["id"]\n }],\n "keys_schema": ["catalog_id"],\n "keys": [{"cols": ["10"]}]\n}\n'})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Multiple Feature Groups Persist:"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-json",children:'{\n "data": [\n {\n "key_values": ["1"],\n "feature_values": [\n {"values": {"fp32_values": [28.5]}},\n {"values": {"string_values": ["Bharat"]}}\n ]\n },\n {\n "key_values": ["2"],\n "feature_values": [\n {"values": {"fp32_values": [32.0]}},\n {"values": {"string_values": ["India"]}}\n ]\n }\n ],\n "entity_label": "catalog",\n "feature_group_schema": [\n {"label": "int_fg", "feature_labels": ["id"]},\n {"label": "string_fg", "feature_labels": ["name"]}\n ],\n "keys_schema": ["catalog_id"]\n}\n'})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Multiple Feature Groups Retrieve:"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-json",children:'{\n "entity_label": "catalog",\n "feature_groups": [\n {"label": "int_fg", "feature_labels": ["id"]},\n {"label": "string_fg", "feature_labels": ["name"]}\n ],\n "keys_schema": ["catalog_id"],\n 
"keys": [\n {"cols": ["1"]},\n {"cols": ["2"]}\n ]\n}\n'})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Vector Feature Group Persist:"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-json",children:'{\n "data": [{\n "key_values": ["123"],\n "feature_values": [{\n "values": {\n "vector": [{\n "values": {"fp32_values": [1.0, 2.0, 3.0, 4.0]}\n }]\n }\n }]\n }],\n "entity_label": "catalog",\n "feature_group_schema": [{\n "label": "vector_fg",\n "feature_labels": ["embedding"]\n }],\n "keys_schema": ["catalog_id"]\n}\n'})}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Vector Feature Group Retrieve:"})}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-json",children:'{\n "entity_label": "catalog",\n "feature_groups": [{\n "label": "vector_fg",\n "feature_labels": ["embedding"]\n }],\n "keys_schema": ["catalog_id"],\n "keys": [{"cols": ["123"]}]\n}\n'})}),"\n",(0,i.jsx)(n.h3,{id:"key-points",children:"Key Points"}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Only one type per feature value block:"})}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"feature_values"})," is a list, and each item in the list has only one value type populated"]}),"\n",(0,i.jsxs)(n.li,{children:["For example: one item has only ",(0,i.jsx)(n.code,{children:"fp32_values"}),", another has only ",(0,i.jsx)(n.code,{children:"int64_values"})]}),"\n"]}),"\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"Field Types:"}),"\nThe following value types are supported:"]}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"fp32_values"}),": ",(0,i.jsx)(n.code,{children:"float32[]"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"fp64_values"}),": ",(0,i.jsx)(n.code,{children:"float64[]"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"int32_values"}),": 
",(0,i.jsx)(n.code,{children:"int32[]"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"int64_values"}),": ",(0,i.jsx)(n.code,{children:"string[]"})," (because JSON doesn't support 64-bit ints directly)"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"uint32_values"}),": ",(0,i.jsx)(n.code,{children:"uint32[]"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"uint64_values"}),": ",(0,i.jsx)(n.code,{children:"string[]"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"string_values"}),": ",(0,i.jsx)(n.code,{children:"string[]"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"bool_values"}),": ",(0,i.jsx)(n.code,{children:"bool[]"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"vector"}),": list of objects with nested values (used for embedded features)"]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"response-format-differences",children:"Response Format Differences"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Retrieve Features (Binary)"}),": Returns data in binary format for optimal performance and reduced network overhead"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Retrieve Features (Decoded)"}),": Returns data in human-readable string format for easier debugging and development purposes"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"managing-services",children:"Managing Services"}),"\n",(0,i.jsx)(n.h3,{id:"viewing-logs",children:"Viewing Logs"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"# View logs for all services\ncd workspace && docker-compose logs -f\n\n# View logs for specific services\ncd workspace && docker-compose logs -f horizon\ncd workspace && docker-compose logs -f trufflebox-ui\ncd workspace && docker-compose logs -f onfs-api-server\n"})}),"\n",(0,i.jsx)(n.h3,{id:"service-management",children:"Service 
Management"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"# Restart a specific service\ncd workspace && docker-compose restart horizon\n\n# Stop all services\ncd workspace && docker-compose down\n\n# Start services again\ncd workspace && docker-compose up -d\n\n# Check service status\ncd workspace && docker-compose ps\n"})}),"\n",(0,i.jsx)(n.h2,{id:"troubleshooting",children:"Troubleshooting"}),"\n",(0,i.jsx)(n.h3,{id:"common-issues",children:"Common Issues"}),"\n",(0,i.jsxs)(n.ol,{children:["\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"Port conflicts"}),": Ensure ports 3000, 8081, 8082, 8089, 9042, 3306, 6379, and 2379 are not in use by other applications."]}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"Docker network issues"}),": If containers can't communicate, try recreating:"]}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"docker network rm onfs-network\ndocker network create onfs-network\n"})}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"Service health checks failing"}),": Check if all infrastructure services (databases) are running:"]}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"cd workspace && docker-compose ps\n"})}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.strong,{children:"Image pull issues"}),": Ensure you have access to GitHub Container Registry:"]}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"docker login ghcr.io\n"})}),"\n"]}),"\n",(0,i.jsxs)(n.li,{children:["\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.a,{href:"https://github.com/tzfun/etcd-workbench/blob/master/README.md",children:"How to use Etcd Workbench ?"})}),"\n"]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"service-dependencies",children:"Service 
Dependencies"}),"\n",(0,i.jsx)(n.p,{children:"Services start in the following order:"}),"\n",(0,i.jsxs)(n.ol,{children:["\n",(0,i.jsx)(n.li,{children:"Infrastructure services (ScyllaDB, MySQL, Redis, etcd)"}),"\n",(0,i.jsx)(n.li,{children:"Online Feature Store gRPC API Server"}),"\n",(0,i.jsx)(n.li,{children:"Horizon (depends on databases + ONFS API)"}),"\n",(0,i.jsx)(n.li,{children:"Trufflebox UI (depends on Horizon)"}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:"If a service fails to start, check its dependencies are healthy first."}),"\n",(0,i.jsx)(n.h2,{id:"development",children:"Development"}),"\n",(0,i.jsx)(n.p,{children:"The workspace directory contains all runtime configuration:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"workspace/docker-compose.yml"})," - Complete service orchestration"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.code,{children:"workspace/check_db_and_init.sh"})," - Database initialization script"]}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:"You can modify environment variables in the docker-compose.yml file and restart services."}),"\n",(0,i.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,i.jsxs)(n.p,{children:["We welcome contributions from the community! 
Please see our ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,i.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["\ud83d\udcac ",(0,i.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,i.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udc1b ",(0,i.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udce7 ",(0,i.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,i.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,i.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 1.1"}),"."]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function h(e={}){const{wrapper:n}={...(0,l.R)(),...e.components};return n?(0,i.jsx)(n,{...e,children:(0,i.jsx)(d,{...e})}):d(e)}},8453:(e,n,s)=>{s.d(n,{R:()=>t,x:()=>c});var r=s(6540);const i={},l=r.createContext(i);function t(e){const n=r.useContext(l);return r.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function c(e){let n;return n=e.disableParentContext?"function"==typeof 
e.components?e.components(i):e.components||i:t(e.components),r.createElement(l.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/14064408.74c23df4.js b/docs/assets/js/14064408.9ca9709f.js similarity index 86% rename from docs/assets/js/14064408.74c23df4.js rename to docs/assets/js/14064408.9ca9709f.js index d9f055a3..fa99219e 100644 --- a/docs/assets/js/14064408.74c23df4.js +++ b/docs/assets/js/14064408.9ca9709f.js @@ -1 +1 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[4582],{9416:t=>{t.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"Quick Start","description":"Quick Start guide for BharatML Stack. Get up and running quickly with step-by-step instructions, sample data, and Docker Compose setup for local development and testing.","slug":"/category/quick-start","permalink":"/BharatMLStack/category/quick-start","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"Release Notes","permalink":"/BharatMLStack/inferflow/v1.0.0/release-notes"},"next":{"title":"Quick Start","permalink":"/BharatMLStack/quick-start/v1.0.0/quick-start"}}}}')}}]); \ No newline at end of file +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[4582],{9416:t=>{t.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"Quick Start","description":"Quick Start guide for BharatML Stack. 
Get up and running quickly with step-by-step instructions, sample data, and Docker Compose setup for local development and testing.","slug":"/category/quick-start","permalink":"/BharatMLStack/category/quick-start","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"Release Notes","permalink":"/BharatMLStack/inferflow/v1.0.0/release-notes"},"next":{"title":"v1.0.0","permalink":"/BharatMLStack/quick-start/v1.0.0"}}}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/14eb3368.398ac934.js b/docs/assets/js/14eb3368.398ac934.js deleted file mode 100644 index 786e75d7..00000000 --- a/docs/assets/js/14eb3368.398ac934.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[6969],{477:(e,s,n)=>{n.r(s),n.d(s,{default:()=>w});n(6540);var t=n(5500),r=n(6972),a=n(6025),i=n(4164),c=n(8774),l=n(5846),o=n(6654),d=n(1312),u=n(1107);const m={cardContainer:"cardContainer_fWXF",cardTitle:"cardTitle_rnsV",cardDescription:"cardDescription_PWke"};var h=n(4848);function b({className:e,href:s,children:n}){return(0,h.jsx)(c.A,{href:s,className:(0,i.A)("card padding--lg",m.cardContainer,e),children:n})}function x({className:e,href:s,icon:n,title:t,description:r}){return(0,h.jsxs)(b,{href:s,className:e,children:[(0,h.jsxs)(u.A,{as:"h2",className:(0,i.A)("text--truncate",m.cardTitle),title:t,children:[n," ",t]}),r&&(0,h.jsx)("p",{className:(0,i.A)("text--truncate",m.cardDescription),title:r,children:r})]})}function p({item:e}){const s=(0,r.Nr)(e),n=function(){const{selectMessage:e}=(0,l.W)();return s=>e(s,(0,d.T)({message:"1 item|{count} items",id:"theme.docs.DocCard.categoryDescription.plurals",description:"The default description for a category card in the generated index about how many items this category includes"},{count:s}))}();return s?(0,h.jsx)(x,{className:e.className,href:s,icon:"\ud83d\uddc3\ufe0f",title:e.label,description:e.description??n(e.items.length)}):null}function v({item:e}){const 
s=(0,o.A)(e.href)?"\ud83d\udcc4\ufe0f":"\ud83d\udd17",n=(0,r.cC)(e.docId??void 0);return(0,h.jsx)(x,{className:e.className,href:e.href,icon:s,title:e.label,description:e.description??n?.description})}function g({item:e}){switch(e.type){case"link":return(0,h.jsx)(v,{item:e});case"category":return(0,h.jsx)(p,{item:e});default:throw new Error(`unknown item type ${JSON.stringify(e)}`)}}const f={docCardListItem:"docCardListItem_W1sv"};function j({className:e}){const s=(0,r.a4)();return(0,h.jsx)(N,{items:s,className:e})}function A({item:e}){return(0,h.jsx)("article",{className:(0,i.A)(f.docCardListItem,"col col--6"),children:(0,h.jsx)(g,{item:e})})}function N(e){const{items:s,className:n}=e;if(!s)return(0,h.jsx)(j,{...e});const t=(0,r.d1)(s);return(0,h.jsx)("section",{className:(0,i.A)("row",n),children:t.map((e,s)=>(0,h.jsx)(A,{item:e},s))})}var L=n(7719),_=n(1878),T=n(4267),k=n(594);const y={generatedIndexPage:"generatedIndexPage_vN6x",title:"title_kItE"};function I({categoryGeneratedIndex:e}){return(0,h.jsx)(t.be,{title:e.title,description:e.description,keywords:e.keywords,image:(0,a.Ay)(e.image)})}function C({categoryGeneratedIndex:e}){const s=(0,r.$S)();return(0,h.jsxs)("div",{className:y.generatedIndexPage,children:[(0,h.jsx)(_.A,{}),(0,h.jsx)(k.A,{}),(0,h.jsx)(T.A,{}),(0,h.jsxs)("header",{children:[(0,h.jsx)(u.A,{as:"h1",className:y.title,children:e.title}),e.description&&(0,h.jsx)("p",{children:e.description})]}),(0,h.jsx)("article",{className:"margin-top--lg",children:(0,h.jsx)(N,{items:s.items,className:y.list})}),(0,h.jsx)("footer",{className:"margin-top--md",children:(0,h.jsx)(L.A,{previous:e.navigation.previous,next:e.navigation.next})})]})}function w(e){return(0,h.jsxs)(h.Fragment,{children:[(0,h.jsx)(I,{...e}),(0,h.jsx)(C,{...e})]})}},594:(e,s,n)=>{n.d(s,{A:()=>j});n(6540);var t=n(4164),r=n(7559),a=n(6972),i=n(9169),c=n(8774),l=n(1312),o=n(6025),d=n(4848);function u(e){return(0,d.jsx)("svg",{viewBox:"0 0 24 24",...e,children:(0,d.jsx)("path",{d:"M10 
19v-5h4v5c0 .55.45 1 1 1h3c.55 0 1-.45 1-1v-7h1.7c.46 0 .68-.57.33-.87L12.67 3.6c-.38-.34-.96-.34-1.34 0l-8.36 7.53c-.34.3-.13.87.33.87H5v7c0 .55.45 1 1 1h3c.55 0 1-.45 1-1z",fill:"currentColor"})})}const m={breadcrumbHomeIcon:"breadcrumbHomeIcon_YNFT"};function h(){const e=(0,o.Ay)("/");return(0,d.jsx)("li",{className:"breadcrumbs__item",children:(0,d.jsx)(c.A,{"aria-label":(0,l.T)({id:"theme.docs.breadcrumbs.home",message:"Home page",description:"The ARIA label for the home page in the breadcrumbs"}),className:"breadcrumbs__link",href:e,children:(0,d.jsx)(u,{className:m.breadcrumbHomeIcon})})})}var b=n(5260),x=n(4586);function p(e){const s=function({breadcrumbs:e}){const{siteConfig:s}=(0,x.A)();return{"@context":"https://schema.org","@type":"BreadcrumbList",itemListElement:e.filter(e=>e.href).map((e,n)=>({"@type":"ListItem",position:n+1,name:e.label,item:`${s.url}${e.href}`}))}}({breadcrumbs:e.breadcrumbs});return(0,d.jsx)(b.A,{children:(0,d.jsx)("script",{type:"application/ld+json",children:JSON.stringify(s)})})}const v={breadcrumbsContainer:"breadcrumbsContainer_Z_bl"};function g({children:e,href:s,isLast:n}){const t="breadcrumbs__link";return n?(0,d.jsx)("span",{className:t,children:e}):s?(0,d.jsx)(c.A,{className:t,href:s,children:(0,d.jsx)("span",{children:e})}):(0,d.jsx)("span",{className:t,children:e})}function f({children:e,active:s}){return(0,d.jsx)("li",{className:(0,t.A)("breadcrumbs__item",{"breadcrumbs__item--active":s}),children:e})}function j(){const e=(0,a.OF)(),s=(0,i.Dt)();return e?(0,d.jsxs)(d.Fragment,{children:[(0,d.jsx)(p,{breadcrumbs:e}),(0,d.jsx)("nav",{className:(0,t.A)(r.G.docs.docBreadcrumbs,v.breadcrumbsContainer),"aria-label":(0,l.T)({id:"theme.docs.breadcrumbs.navAriaLabel",message:"Breadcrumbs",description:"The ARIA label for the breadcrumbs"}),children:(0,d.jsxs)("ul",{className:"breadcrumbs",children:[s&&(0,d.jsx)(h,{}),e.map((s,n)=>{const t=n===e.length-1,r="category"===s.type&&s.linkUnlisted?void 
0:s.href;return(0,d.jsx)(f,{active:t,children:(0,d.jsx)(g,{href:r,isLast:t,children:s.label})},n)})]})})]}):null}},1878:(e,s,n)=>{n.d(s,{A:()=>p});n(6540);var t=n(4164),r=n(4586),a=n(8774),i=n(1312),c=n(4070),l=n(7559),o=n(3886),d=n(3025),u=n(4848);const m={unreleased:function({siteTitle:e,versionMetadata:s}){return(0,u.jsx)(i.A,{id:"theme.docs.versions.unreleasedVersionLabel",description:"The label used to tell the user that he's browsing an unreleased doc version",values:{siteTitle:e,versionLabel:(0,u.jsx)("b",{children:s.label})},children:"This is unreleased documentation for {siteTitle} {versionLabel} version."})},unmaintained:function({siteTitle:e,versionMetadata:s}){return(0,u.jsx)(i.A,{id:"theme.docs.versions.unmaintainedVersionLabel",description:"The label used to tell the user that he's browsing an unmaintained doc version",values:{siteTitle:e,versionLabel:(0,u.jsx)("b",{children:s.label})},children:"This is documentation for {siteTitle} {versionLabel}, which is no longer actively maintained."})}};function h(e){const s=m[e.versionMetadata.banner];return(0,u.jsx)(s,{...e})}function b({versionLabel:e,to:s,onClick:n}){return(0,u.jsx)(i.A,{id:"theme.docs.versions.latestVersionSuggestionLabel",description:"The label used to tell the user to check the latest version",values:{versionLabel:e,latestVersionLink:(0,u.jsx)("b",{children:(0,u.jsx)(a.A,{to:s,onClick:n,children:(0,u.jsx)(i.A,{id:"theme.docs.versions.latestVersionLinkLabel",description:"The label used for the latest version suggestion link label",children:"latest version"})})})},children:"For up-to-date documentation, see the {latestVersionLink} ({versionLabel})."})}function x({className:e,versionMetadata:s}){const{siteConfig:{title:n}}=(0,r.A)(),{pluginId:a}=(0,c.vT)({failfast:!0}),{savePreferredVersionName:i}=(0,o.g1)(a),{latestDocSuggestion:d,latestVersionSuggestion:m}=(0,c.HW)(a),x=d??(p=m).docs.find(e=>e.id===p.mainDocId);var 
p;return(0,u.jsxs)("div",{className:(0,t.A)(e,l.G.docs.docVersionBanner,"alert alert--warning margin-bottom--md"),role:"alert",children:[(0,u.jsx)("div",{children:(0,u.jsx)(h,{siteTitle:n,versionMetadata:s})}),(0,u.jsx)("div",{className:"margin-top--md",children:(0,u.jsx)(b,{versionLabel:m.label,to:x.path,onClick:()=>i(m.name)})})]})}function p({className:e}){const s=(0,d.r)();return s.banner?(0,u.jsx)(x,{className:e,versionMetadata:s}):null}},4267:(e,s,n)=>{n.d(s,{A:()=>l});n(6540);var t=n(4164),r=n(1312),a=n(7559),i=n(3025),c=n(4848);function l({className:e}){const s=(0,i.r)();return s.badge?(0,c.jsx)("span",{className:(0,t.A)(e,a.G.docs.docVersionBadge,"badge badge--secondary"),children:(0,c.jsx)(r.A,{id:"theme.docs.versionBadge.label",values:{versionLabel:s.label},children:"Version: {versionLabel}"})}):null}},5846:(e,s,n)=>{n.d(s,{W:()=>o});var t=n(6540),r=n(4586);const a=["zero","one","two","few","many","other"];function i(e){return a.filter(s=>e.includes(s))}const c={locale:"en",pluralForms:i(["one","other"]),select:e=>1===e?"one":"other"};function l(){const{i18n:{currentLocale:e}}=(0,r.A)();return(0,t.useMemo)(()=>{try{return function(e){const s=new Intl.PluralRules(e);return{locale:e,pluralForms:i(s.resolvedOptions().pluralCategories),select:e=>s.select(e)}}(e)}catch(s){return console.error(`Failed to use Intl.PluralRules for locale "${e}".\nDocusaurus will fallback to the default (English) implementation.\nError: ${s.message}\n`),c}},[e])}function o(){const e=l();return{selectMessage:(s,n)=>function(e,s,n){const t=e.split("|");if(1===t.length)return t[0];t.length>n.pluralForms.length&&console.error(`For locale=${n.locale}, a maximum of ${n.pluralForms.length} plural forms are expected (${n.pluralForms.join(",")}), but the message contains ${t.length}: ${e}`);const r=n.select(s),a=n.pluralForms.indexOf(r);return t[Math.min(a,t.length-1)]}(n,s,e)}}},7719:(e,s,n)=>{n.d(s,{A:()=>c});n(6540);var t=n(4164),r=n(1312),a=n(9022),i=n(4848);function 
c(e){const{className:s,previous:n,next:c}=e;return(0,i.jsxs)("nav",{className:(0,t.A)(s,"pagination-nav"),"aria-label":(0,r.T)({id:"theme.docs.paginator.navAriaLabel",message:"Docs pages",description:"The ARIA label for the docs pagination"}),children:[n&&(0,i.jsx)(a.A,{...n,subLabel:(0,i.jsx)(r.A,{id:"theme.docs.paginator.previous",description:"The label used to navigate to the previous doc",children:"Previous"})}),c&&(0,i.jsx)(a.A,{...c,subLabel:(0,i.jsx)(r.A,{id:"theme.docs.paginator.next",description:"The label used to navigate to the next doc",children:"Next"}),isNext:!0})]})}},9022:(e,s,n)=>{n.d(s,{A:()=>i});n(6540);var t=n(4164),r=n(8774),a=n(4848);function i(e){const{permalink:s,title:n,subLabel:i,isNext:c}=e;return(0,a.jsxs)(r.A,{className:(0,t.A)("pagination-nav__link",c?"pagination-nav__link--next":"pagination-nav__link--prev"),to:s,children:[i&&(0,a.jsx)("div",{className:"pagination-nav__sublabel",children:i}),(0,a.jsx)("div",{className:"pagination-nav__label",children:n})]})}}}]); \ No newline at end of file diff --git a/docs/assets/js/14eb3368.60af715e.js b/docs/assets/js/14eb3368.60af715e.js new file mode 100644 index 00000000..8d277bdd --- /dev/null +++ b/docs/assets/js/14eb3368.60af715e.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[6969],{594:(e,s,n)=>{n.d(s,{A:()=>j});n(6540);var t=n(4164),r=n(7559),a=n(6972),i=n(9169),c=n(8774),l=n(1312),o=n(6025),d=n(4848);function u(e){return(0,d.jsx)("svg",{viewBox:"0 0 24 24",...e,children:(0,d.jsx)("path",{d:"M10 19v-5h4v5c0 .55.45 1 1 1h3c.55 0 1-.45 1-1v-7h1.7c.46 0 .68-.57.33-.87L12.67 3.6c-.38-.34-.96-.34-1.34 0l-8.36 7.53c-.34.3-.13.87.33.87H5v7c0 .55.45 1 1 1h3c.55 0 1-.45 1-1z",fill:"currentColor"})})}const m={breadcrumbHomeIcon:"breadcrumbHomeIcon_YNFT"};function h(){const e=(0,o.Ay)("/");return(0,d.jsx)("li",{className:"breadcrumbs__item",children:(0,d.jsx)(c.A,{"aria-label":(0,l.T)({id:"theme.docs.breadcrumbs.home",message:"Home page",description:"The ARIA 
label for the home page in the breadcrumbs"}),className:"breadcrumbs__link",href:e,children:(0,d.jsx)(u,{className:m.breadcrumbHomeIcon})})})}var b=n(5260),x=n(4586);function p(e){const s=function({breadcrumbs:e}){const{siteConfig:s}=(0,x.A)();return{"@context":"https://schema.org","@type":"BreadcrumbList",itemListElement:e.filter(e=>e.href).map((e,n)=>({"@type":"ListItem",position:n+1,name:e.label,item:`${s.url}${e.href}`}))}}({breadcrumbs:e.breadcrumbs});return(0,d.jsx)(b.A,{children:(0,d.jsx)("script",{type:"application/ld+json",children:JSON.stringify(s)})})}const v={breadcrumbsContainer:"breadcrumbsContainer_Z_bl"};function g({children:e,href:s,isLast:n}){const t="breadcrumbs__link";return n?(0,d.jsx)("span",{className:t,children:e}):s?(0,d.jsx)(c.A,{className:t,href:s,children:(0,d.jsx)("span",{children:e})}):(0,d.jsx)("span",{className:t,children:e})}function f({children:e,active:s}){return(0,d.jsx)("li",{className:(0,t.A)("breadcrumbs__item",{"breadcrumbs__item--active":s}),children:e})}function j(){const e=(0,a.OF)(),s=(0,i.Dt)();return e?(0,d.jsxs)(d.Fragment,{children:[(0,d.jsx)(p,{breadcrumbs:e}),(0,d.jsx)("nav",{className:(0,t.A)(r.G.docs.docBreadcrumbs,v.breadcrumbsContainer),"aria-label":(0,l.T)({id:"theme.docs.breadcrumbs.navAriaLabel",message:"Breadcrumbs",description:"The ARIA label for the breadcrumbs"}),children:(0,d.jsxs)("ul",{className:"breadcrumbs",children:[s&&(0,d.jsx)(h,{}),e.map((s,n)=>{const t=n===e.length-1,r="category"===s.type&&s.linkUnlisted?void 0:s.href;return(0,d.jsx)(f,{active:t,children:(0,d.jsx)(g,{href:r,isLast:t,children:s.label})},n)})]})})]}):null}},1878:(e,s,n)=>{n.d(s,{A:()=>p});n(6540);var t=n(4164),r=n(4586),a=n(8774),i=n(1312),c=n(4070),l=n(7559),o=n(3886),d=n(3025),u=n(4848);const m={unreleased:function({siteTitle:e,versionMetadata:s}){return(0,u.jsx)(i.A,{id:"theme.docs.versions.unreleasedVersionLabel",description:"The label used to tell the user that he's browsing an unreleased doc 
version",values:{siteTitle:e,versionLabel:(0,u.jsx)("b",{children:s.label})},children:"This is unreleased documentation for {siteTitle} {versionLabel} version."})},unmaintained:function({siteTitle:e,versionMetadata:s}){return(0,u.jsx)(i.A,{id:"theme.docs.versions.unmaintainedVersionLabel",description:"The label used to tell the user that he's browsing an unmaintained doc version",values:{siteTitle:e,versionLabel:(0,u.jsx)("b",{children:s.label})},children:"This is documentation for {siteTitle} {versionLabel}, which is no longer actively maintained."})}};function h(e){const s=m[e.versionMetadata.banner];return(0,u.jsx)(s,{...e})}function b({versionLabel:e,to:s,onClick:n}){return(0,u.jsx)(i.A,{id:"theme.docs.versions.latestVersionSuggestionLabel",description:"The label used to tell the user to check the latest version",values:{versionLabel:e,latestVersionLink:(0,u.jsx)("b",{children:(0,u.jsx)(a.A,{to:s,onClick:n,children:(0,u.jsx)(i.A,{id:"theme.docs.versions.latestVersionLinkLabel",description:"The label used for the latest version suggestion link label",children:"latest version"})})})},children:"For up-to-date documentation, see the {latestVersionLink} ({versionLabel})."})}function x({className:e,versionMetadata:s}){const{siteConfig:{title:n}}=(0,r.A)(),{pluginId:a}=(0,c.vT)({failfast:!0}),{savePreferredVersionName:i}=(0,o.g1)(a),{latestDocSuggestion:d,latestVersionSuggestion:m}=(0,c.HW)(a),x=d??(p=m).docs.find(e=>e.id===p.mainDocId);var p;return(0,u.jsxs)("div",{className:(0,t.A)(e,l.G.docs.docVersionBanner,"alert alert--warning margin-bottom--md"),role:"alert",children:[(0,u.jsx)("div",{children:(0,u.jsx)(h,{siteTitle:n,versionMetadata:s})}),(0,u.jsx)("div",{className:"margin-top--md",children:(0,u.jsx)(b,{versionLabel:m.label,to:x.path,onClick:()=>i(m.name)})})]})}function p({className:e}){const s=(0,d.r)();return s.banner?(0,u.jsx)(x,{className:e,versionMetadata:s}):null}},4267:(e,s,n)=>{n.d(s,{A:()=>l});n(6540);var 
t=n(4164),r=n(1312),a=n(7559),i=n(3025),c=n(4848);function l({className:e}){const s=(0,i.r)();return s.badge?(0,c.jsx)("span",{className:(0,t.A)(e,a.G.docs.docVersionBadge,"badge badge--secondary"),children:(0,c.jsx)(r.A,{id:"theme.docs.versionBadge.label",values:{versionLabel:s.label},children:"Version: {versionLabel}"})}):null}},4795:(e,s,n)=>{n.d(s,{A:()=>j});n(6540);var t=n(4164),r=n(6972),a=n(8774),i=n(5846),c=n(6654),l=n(1312),o=n(1107);const d={cardContainer:"cardContainer_fWXF",cardTitle:"cardTitle_rnsV",cardDescription:"cardDescription_PWke"};var u=n(4848);function m({className:e,href:s,children:n}){return(0,u.jsx)(a.A,{href:s,className:(0,t.A)("card padding--lg",d.cardContainer,e),children:n})}function h({className:e,href:s,icon:n,title:r,description:a}){return(0,u.jsxs)(m,{href:s,className:e,children:[(0,u.jsxs)(o.A,{as:"h2",className:(0,t.A)("text--truncate",d.cardTitle),title:r,children:[n," ",r]}),a&&(0,u.jsx)("p",{className:(0,t.A)("text--truncate",d.cardDescription),title:a,children:a})]})}function b({item:e}){const s=(0,r.Nr)(e),n=function(){const{selectMessage:e}=(0,i.W)();return s=>e(s,(0,l.T)({message:"1 item|{count} items",id:"theme.docs.DocCard.categoryDescription.plurals",description:"The default description for a category card in the generated index about how many items this category includes"},{count:s}))}();return s?(0,u.jsx)(h,{className:e.className,href:s,icon:"\ud83d\uddc3\ufe0f",title:e.label,description:e.description??n(e.items.length)}):null}function x({item:e}){const s=(0,c.A)(e.href)?"\ud83d\udcc4\ufe0f":"\ud83d\udd17",n=(0,r.cC)(e.docId??void 0);return(0,u.jsx)(h,{className:e.className,href:e.href,icon:s,title:e.label,description:e.description??n?.description})}function p({item:e}){switch(e.type){case"link":return(0,u.jsx)(x,{item:e});case"category":return(0,u.jsx)(b,{item:e});default:throw new Error(`unknown item type ${JSON.stringify(e)}`)}}const v={docCardListItem:"docCardListItem_W1sv"};function g({className:e}){const 
s=(0,r.a4)();return(0,u.jsx)(j,{items:s,className:e})}function f({item:e}){return(0,u.jsx)("article",{className:(0,t.A)(v.docCardListItem,"col col--6"),children:(0,u.jsx)(p,{item:e})})}function j(e){const{items:s,className:n}=e;if(!s)return(0,u.jsx)(g,{...e});const a=(0,r.d1)(s);return(0,u.jsx)("section",{className:(0,t.A)("row",n),children:a.map((e,s)=>(0,u.jsx)(f,{item:e},s))})}},5846:(e,s,n)=>{n.d(s,{W:()=>o});var t=n(6540),r=n(4586);const a=["zero","one","two","few","many","other"];function i(e){return a.filter(s=>e.includes(s))}const c={locale:"en",pluralForms:i(["one","other"]),select:e=>1===e?"one":"other"};function l(){const{i18n:{currentLocale:e}}=(0,r.A)();return(0,t.useMemo)(()=>{try{return function(e){const s=new Intl.PluralRules(e);return{locale:e,pluralForms:i(s.resolvedOptions().pluralCategories),select:e=>s.select(e)}}(e)}catch(s){return console.error(`Failed to use Intl.PluralRules for locale "${e}".\nDocusaurus will fallback to the default (English) implementation.\nError: ${s.message}\n`),c}},[e])}function o(){const e=l();return{selectMessage:(s,n)=>function(e,s,n){const t=e.split("|");if(1===t.length)return t[0];t.length>n.pluralForms.length&&console.error(`For locale=${n.locale}, a maximum of ${n.pluralForms.length} plural forms are expected (${n.pluralForms.join(",")}), but the message contains ${t.length}: ${e}`);const r=n.select(s),a=n.pluralForms.indexOf(r);return t[Math.min(a,t.length-1)]}(n,s,e)}}},5847:(e,s,n)=>{n.r(s),n.d(s,{default:()=>p});n(6540);var t=n(5500),r=n(6972),a=n(6025),i=n(4795),c=n(7719),l=n(1878),o=n(4267),d=n(594),u=n(1107);const m={generatedIndexPage:"generatedIndexPage_vN6x",title:"title_kItE"};var h=n(4848);function b({categoryGeneratedIndex:e}){return(0,h.jsx)(t.be,{title:e.title,description:e.description,keywords:e.keywords,image:(0,a.Ay)(e.image)})}function x({categoryGeneratedIndex:e}){const 
s=(0,r.$S)();return(0,h.jsxs)("div",{className:m.generatedIndexPage,children:[(0,h.jsx)(l.A,{}),(0,h.jsx)(d.A,{}),(0,h.jsx)(o.A,{}),(0,h.jsxs)("header",{children:[(0,h.jsx)(u.A,{as:"h1",className:m.title,children:e.title}),e.description&&(0,h.jsx)("p",{children:e.description})]}),(0,h.jsx)("article",{className:"margin-top--lg",children:(0,h.jsx)(i.A,{items:s.items,className:m.list})}),(0,h.jsx)("footer",{className:"margin-top--md",children:(0,h.jsx)(c.A,{previous:e.navigation.previous,next:e.navigation.next})})]})}function p(e){return(0,h.jsxs)(h.Fragment,{children:[(0,h.jsx)(b,{...e}),(0,h.jsx)(x,{...e})]})}},7719:(e,s,n)=>{n.d(s,{A:()=>c});n(6540);var t=n(4164),r=n(1312),a=n(9022),i=n(4848);function c(e){const{className:s,previous:n,next:c}=e;return(0,i.jsxs)("nav",{className:(0,t.A)(s,"pagination-nav"),"aria-label":(0,r.T)({id:"theme.docs.paginator.navAriaLabel",message:"Docs pages",description:"The ARIA label for the docs pagination"}),children:[n&&(0,i.jsx)(a.A,{...n,subLabel:(0,i.jsx)(r.A,{id:"theme.docs.paginator.previous",description:"The label used to navigate to the previous doc",children:"Previous"})}),c&&(0,i.jsx)(a.A,{...c,subLabel:(0,i.jsx)(r.A,{id:"theme.docs.paginator.next",description:"The label used to navigate to the next doc",children:"Next"}),isNext:!0})]})}},9022:(e,s,n)=>{n.d(s,{A:()=>i});n(6540);var t=n(4164),r=n(8774),a=n(4848);function i(e){const{permalink:s,title:n,subLabel:i,isNext:c}=e;return(0,a.jsxs)(r.A,{className:(0,t.A)("pagination-nav__link",c?"pagination-nav__link--next":"pagination-nav__link--prev"),to:s,children:[i&&(0,a.jsx)("div",{className:"pagination-nav__sublabel",children:i}),(0,a.jsx)("div",{className:"pagination-nav__label",children:n})]})}}}]); \ No newline at end of file diff --git a/docs/assets/js/176d210f.21c450d1.js b/docs/assets/js/176d210f.21c450d1.js deleted file mode 100644 index 8f88edf3..00000000 --- a/docs/assets/js/176d210f.21c450d1.js +++ /dev/null @@ -1 +0,0 @@ -"use 
strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[6100],{73:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-add-features-details-278a519cdfe25bead880d7a18e0b858e.png"},351:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-register-entity-details-016ab5c5b2fef9f58bde75e6a07c9823.png"},753:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>l,contentTitle:()=>o,default:()=>h,frontMatter:()=>a,metadata:()=>r,toc:()=>c});const r=JSON.parse('{"id":"trufflebox-ui/v1.0.0/userguide","title":"User Manual","description":"This guide covers the complete setup and usage of the Online Feature Store system, including the core services (Online Feature Store and Horizon) and the TruffleBox UI for feature management.","source":"@site/docs/trufflebox-ui/v1.0.0/userguide.md","sourceDirName":"trufflebox-ui/v1.0.0","slug":"/trufflebox-ui/v1.0.0/userguide","permalink":"/BharatMLStack/trufflebox-ui/v1.0.0/userguide","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/trufflebox-ui/v1.0.0/userguide.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"User Manual","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"Trufflebox UI","permalink":"/BharatMLStack/category/trufflebox-ui"},"next":{"title":"SDKs","permalink":"/BharatMLStack/category/sdks"}}');var s=i(4848),t=i(8453);const a={title:"User Manual",sidebar_position:1},o="Usage Guide",l={},c=[{value:"Table of Contents",id:"table-of-contents",level:2},{value:"System Overview",id:"system-overview",level:2},{value:"Environment Setup",id:"environment-setup",level:2},{value:"Online Feature Store Configuration",id:"online-feature-store-configuration",level:3},{value:"Core Application Settings",id:"core-application-settings",level:4},{value:"Storage Configuration",id:"storage-configuration",level:4},{value:"Caching Configuration",id:"caching-configuration",level:4},{value:"Service Discovery and 
Configuration",id:"service-discovery-and-configuration",level:4},{value:"Horizon Configuration",id:"horizon-configuration",level:3},{value:"Core Application Settings",id:"core-application-settings-1",level:4},{value:"Database Configuration",id:"database-configuration",level:4},{value:"ScyllaDB Configuration",id:"scylladb-configuration",level:4},{value:"Service Integration",id:"service-integration",level:4},{value:"Key Constructs",id:"key-constructs",level:2},{value:"Store ID",id:"store-id",level:3},{value:"Entity",id:"entity",level:3},{value:"Feature Group",id:"feature-group",level:3},{value:"Feature",id:"feature",level:3},{value:"Job",id:"job",level:3},{value:"Configuration Hierarchy",id:"configuration-hierarchy",level:3},{value:"Table of Contents",id:"table-of-contents-1",level:2},{value:"User Flow",id:"user-flow",level:2},{value:"Getting Started with TruffleBox",id:"getting-started-with-trufflebox",level:3},{value:"Authentication",id:"authentication",level:4},{value:"User Management",id:"user-management",level:4},{value:"Navigation",id:"navigation",level:4},{value:"Feature Discovery",id:"feature-discovery",level:3},{value:"Entity Management",id:"entity-management",level:4},{value:"Feature Group Management",id:"feature-group-management",level:4},{value:"Feature Management",id:"feature-management",level:4},{value:"Store Discovery",id:"store-discovery",level:4},{value:"Job Discovery",id:"job-discovery",level:4},{value:"Feature Registry",id:"feature-registry",level:3},{value:"Request Status Tracking",id:"request-status-tracking",level:4},{value:"Step-by-Step Registration Guide",id:"step-by-step-registration-guide",level:4},{value:"Store Registry",id:"store-registry",level:4},{value:"Job Registry",id:"job-registry",level:4},{value:"Entity Registry",id:"entity-registry",level:4},{value:"Feature Group Registry",id:"feature-group-registry",level:4},{value:"Feature Addition",id:"feature-addition",level:4},{value:"Need Help?",id:"need-help",level:4},{value:"Admin Approval 
Flow",id:"admin-approval-flow",level:2},{value:"Request Management",id:"request-management",level:3},{value:"Viewing All Requests",id:"viewing-all-requests",level:4},{value:"Request Approval Process",id:"request-approval-process",level:4},{value:"Admin Support",id:"admin-support",level:4},{value:"Contributing",id:"contributing",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function d(e){const n={a:"a",br:"br",code:"code",h1:"h1",h2:"h2",h3:"h3",h4:"h4",header:"header",hr:"hr",img:"img",li:"li",ol:"ol",p:"p",pre:"pre",strong:"strong",ul:"ul",...(0,t.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(n.header,{children:(0,s.jsx)(n.h1,{id:"usage-guide",children:"Usage Guide"})}),"\n",(0,s.jsx)(n.p,{children:"This guide covers the complete setup and usage of the Online Feature Store system, including the core services (Online Feature Store and Horizon) and the TruffleBox UI for feature management."}),"\n",(0,s.jsx)(n.h2,{id:"table-of-contents",children:"Table of Contents"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#system-overview",children:"System Overview"})}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.a,{href:"#environment-setup",children:"Environment Setup"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#online-feature-store-configuration",children:"Online Feature Store Configuration"})}),"\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#horizon-configuration",children:"Horizon Configuration"})}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#key-constructs",children:"Key Constructs"})}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.a,{href:"#trufflebox-ui-guide",children:"TruffleBox UI Guide"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#user-flow",children:"User 
Flow"})}),"\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#admin-approval-flow",children:"Admin Approval Flow"})}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"system-overview",children:"System Overview"}),"\n",(0,s.jsx)(n.p,{children:"The Online Feature Store is a comprehensive feature management system consisting of two main components:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Online Feature Store"}),": The core feature serving service that provides real-time feature retrieval with multiple storage backends and caching layers"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Horizon"}),": The configuration and metadata management service that handles feature definitions, stores, and job configurations"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"These services work together to provide a scalable, high-performance feature store for machine learning applications."}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"environment-setup",children:"Environment Setup"}),"\n",(0,s.jsx)(n.h3,{id:"online-feature-store-configuration",children:"Online Feature Store Configuration"}),"\n",(0,s.jsx)(n.p,{children:"The Online Feature Store requires several environment variables to configure storage backends, caching, and service settings."}),"\n",(0,s.jsx)(n.h4,{id:"core-application-settings",children:"Core Application Settings"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"APP_ENV=prod\nAPP_LOG_LEVEL=DEBUG\nAPP_METRIC_SAMPLING_RATE=1\nAPP_NAME=online-feature-store\nAPP_PORT=8005\nAUTH_TOKEN=ofs-token\n"})}),"\n",(0,s.jsx)(n.h4,{id:"storage-configuration",children:"Storage Configuration"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.strong,{children:"ScyllaDB Storage (Primary Storage)"})}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"# Primary ScyllaDB 
cluster\nSTORAGE_SCYLLA_1_CONTACT_POINTS=localhost\nSTORAGE_SCYLLA_1_KEYSPACE=ofs\nSTORAGE_SCYLLA_1_NUM_CONNS=1\nSTORAGE_SCYLLA_1_PORT=9042\nSTORAGE_SCYLLA_1_TIMEOUT_IN_MS=300000\nSTORAGE_SCYLLA_1_PASSWORD=\nSTORAGE_SCYLLA_1_USERNAME=ofs\n\n# Secondary ScyllaDB cluster\nSTORAGE_SCYLLA_5_CONTACT_POINTS=localhost\nSTORAGE_SCYLLA_5_KEYSPACE=onfs\nSTORAGE_SCYLLA_5_NUM_CONNS=1\nSTORAGE_SCYLLA_5_PASSWORD=\nSTORAGE_SCYLLA_5_PORT=9042\nSTORAGE_SCYLLA_5_TIMEOUT_IN_MS=300000\nSTORAGE_SCYLLA_5_USERNAME=\n\n# Active ScyllaDB configurations\nSTORAGE_SCYLLA_ACTIVE_CONFIG_IDS=1,5\n"})}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.strong,{children:"Redis Storage Configuration"})}),"\n",(0,s.jsx)(n.p,{children:"Redis serves dual purposes in the Online Feature Store:"}),"\n",(0,s.jsxs)(n.ol,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Primary Storage Backend"}),": For fast feature retrieval and storage"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Distributed Cache Layer"}),": For improved performance and reduced latency"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Redis configurations can be referenced by their IDs in Store configurations, similar to ScyllaDB. 
Each Redis configuration can be independently used as either a storage backend or cache layer."}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"# Redis Failover Configuration 1 (ID: 2)\nSTORAGE_REDIS_FAILOVER_2_SENTINEL_ADDRESSES=localhost:26379\nSTORAGE_REDIS_FAILOVER_2_DB=0\nSTORAGE_REDIS_FAILOVER_2_DISABLE_IDENTITY=true\nSTORAGE_REDIS_FAILOVER_2_MASTER_NAME=mymaster\nSTORAGE_REDIS_FAILOVER_2_MAX_IDLE_CONN=32\nSTORAGE_REDIS_FAILOVER_2_MIN_IDLE_CONN=20\nSTORAGE_REDIS_FAILOVER_2_MAX_ACTIVE_CONN=32\nSTORAGE_REDIS_FAILOVER_2_MAX_RETRY=-1\nSTORAGE_REDIS_FAILOVER_2_POOL_FIFO=false\nSTORAGE_REDIS_FAILOVER_2_READ_TIMEOUT_IN_MS=3000\nSTORAGE_REDIS_FAILOVER_2_WRITE_TIMEOUT_IN_MS=3000\nSTORAGE_REDIS_FAILOVER_2_POOL_TIMEOUT_IN_MS=3000\nSTORAGE_REDIS_FAILOVER_2_POOL_SIZE=32\nSTORAGE_REDIS_FAILOVER_2_CONN_MAX_IDLE_TIMEOUT_IN_MINUTES=15\nSTORAGE_REDIS_FAILOVER_2_CONN_MAX_AGE_IN_MINUTES=30\n\n# Redis Failover Configuration 2 (ID: 4)\nSTORAGE_REDIS_FAILOVER_4_SENTINEL_ADDRESSES=localhost:26379\nSTORAGE_REDIS_FAILOVER_4_DB=0\nSTORAGE_REDIS_FAILOVER_4_DISABLE_IDENTITY=true\nSTORAGE_REDIS_FAILOVER_4_MASTER_NAME=mymaster\nSTORAGE_REDIS_FAILOVER_4_MAX_IDLE_CONN=32\nSTORAGE_REDIS_FAILOVER_4_MIN_IDLE_CONN=20\nSTORAGE_REDIS_FAILOVER_4_MAX_ACTIVE_CONN=32\nSTORAGE_REDIS_FAILOVER_4_MAX_RETRY=-1\nSTORAGE_REDIS_FAILOVER_4_POOL_FIFO=false\nSTORAGE_REDIS_FAILOVER_4_READ_TIMEOUT_IN_MS=3000\nSTORAGE_REDIS_FAILOVER_4_WRITE_TIMEOUT_IN_MS=3000\nSTORAGE_REDIS_FAILOVER_4_POOL_TIMEOUT_IN_MS=3000\nSTORAGE_REDIS_FAILOVER_4_POOL_SIZE=32\nSTORAGE_REDIS_FAILOVER_4_CONN_MAX_IDLE_TIMEOUT_IN_MINUTES=15\nSTORAGE_REDIS_FAILOVER_4_CONN_MAX_AGE_IN_MINUTES=30\n\n# High-Performance Redis Configuration (ID: 
6)\nSTORAGE_REDIS_FAILOVER_6_CONN_MAX_AGE_IN_MINUTES=-1\nSTORAGE_REDIS_FAILOVER_6_CONN_MAX_IDLE_TIMEOUT_IN_MINUTES=30\nSTORAGE_REDIS_FAILOVER_6_DB=0\nSTORAGE_REDIS_FAILOVER_6_DISABLE_IDENTITY=true\nSTORAGE_REDIS_FAILOVER_6_MASTER_NAME=mymaster\nSTORAGE_REDIS_FAILOVER_6_MAX_ACTIVE_CONN=202\nSTORAGE_REDIS_FAILOVER_6_MAX_IDLE_CONN=157\nSTORAGE_REDIS_FAILOVER_6_MAX_RETRY=-1\nSTORAGE_REDIS_FAILOVER_6_MIN_IDLE_CONN=52\nSTORAGE_REDIS_FAILOVER_6_PASSWORD=\nSTORAGE_REDIS_FAILOVER_6_POOL_FIFO=false\nSTORAGE_REDIS_FAILOVER_6_POOL_SIZE=202\nSTORAGE_REDIS_FAILOVER_6_POOL_TIMEOUT_IN_MS=2\nSTORAGE_REDIS_FAILOVER_6_READ_TIMEOUT_IN_MS=75\nSTORAGE_REDIS_FAILOVER_6_ROUTE_RANDOM=true\nSTORAGE_REDIS_FAILOVER_6_SENTINEL_ADDRESSES=localhost:26379\nSTORAGE_REDIS_FAILOVER_6_WRITE_TIMEOUT_IN_MS=300\n\n# Active Redis configurations\nSTORAGE_REDIS_FAILOVER_ACTIVE_CONFIG_IDS=2,4,6\n"})}),"\n",(0,s.jsx)(n.h4,{id:"caching-configuration",children:"Caching Configuration"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"# In-Memory Cache\nIN_MEM_CACHE_3_ENABLED=true\nIN_MEM_CACHE_3_NAME=onfs\nIN_MEM_CACHE_3_SIZE_IN_BYTES=10000000\nIN_MEM_CACHE_ACTIVE_CONFIG_IDS=3\n\n# Distributed Cache (uses Redis configurations)\n# Redis configurations (IDs: 2,4,6) can be used for distributed caching\nDISTRIBUTED_CACHE_CONF_IDS=2\n"})}),"\n",(0,s.jsx)(n.h4,{id:"service-discovery-and-configuration",children:"Service Discovery and Configuration"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"# ETCD Configuration for service discovery\nETCD_SERVER=0.0.0.0:2379\nETCD_WATCHER_ENABLED=true\n"})}),"\n",(0,s.jsx)(n.h3,{id:"horizon-configuration",children:"Horizon Configuration"}),"\n",(0,s.jsx)(n.p,{children:"Horizon manages the metadata and configuration for the Online Feature Store system."}),"\n",(0,s.jsx)(n.h4,{id:"core-application-settings-1",children:"Core Application 
Settings"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"APP_NAME=horizon\nAPP_ENVIRONMENT=PROD\nAPP_ENV=production\nAPP_PORT=8082\nAPP_LOG_LEVEL=DEBUG\nAPP_METRIC_SAMPLING_RATE=1\nAPP_GC_PERCENTAGE=1\n"})}),"\n",(0,s.jsx)(n.h4,{id:"database-configuration",children:"Database Configuration"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"# MySQL Master Configuration\nMYSQL_MASTER_MAX_POOL_SIZE=5\nMYSQL_MASTER_MIN_POOL_SIZE=2\nMYSQL_MASTER_PASSWORD=\nMYSQL_MASTER_HOST=127.0.0.1\nMYSQL_MASTER_PORT=3306\nMYSQL_DB_NAME=ml_config\nMYSQL_MASTER_USERNAME=root\n\n# MySQL Slave Configuration\nMYSQL_SLAVE_MAX_POOL_SIZE=5\nMYSQL_SLAVE_MIN_POOL_SIZE=2\nMYSQL_SLAVE_PASSWORD=\nMYSQL_SLAVE_HOST=127.0.0.1\nMYSQL_SLAVE_USERNAME=root\nMYSQL_SLAVE_PORT=3306\n"})}),"\n",(0,s.jsx)(n.h4,{id:"scylladb-configuration",children:"ScyllaDB Configuration"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"# ScyllaDB for Horizon\nSCYLLA_1_CONTACT_POINTS=localhost\nSCYLLA_1_KEYSPACE=onfs\nSCYLLA_1_NUM_CONNS=1\nSCYLLA_1_PORT=9042\nSCYLLA_1_TIMEOUT_IN_MS=300000\nSCYLLA_1_PASSWORD=\nSCYLLA_1_USERNAME=\nSCYLLA_ACTIVE_CONFIG_IDS=1\n"})}),"\n",(0,s.jsx)(n.h4,{id:"service-integration",children:"Service Integration"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"# ETCD Configuration\nETCD_WATCHER_ENABLED=true\nETCD_SERVER=localhost:2379\n\n# Integration with Online Feature Store\nONLINE_FEATURE_STORE_APP_NAME=online-feature-store\n"})}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"key-constructs",children:"Key Constructs"}),"\n",(0,s.jsx)(n.p,{children:"Understanding these key constructs is essential for effectively using the Online Feature Store:"}),"\n",(0,s.jsx)(n.h3,{id:"store-id",children:"Store ID"}),"\n",(0,s.jsxs)(n.p,{children:["A ",(0,s.jsx)(n.strong,{children:"Store ID"})," is a unique identifier that represents a data storage 
configuration within the system. It defines:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Storage Backend"}),": Which underlying storage system (ScyllaDB, Redis, etc.) to use"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Configuration Parameters"}),": Connection settings, timeouts, pool sizes"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Access Patterns"}),": How data is read from and written to the store"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Store IDs are referenced throughout the system to:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Route feature requests to the appropriate storage backend"}),"\n",(0,s.jsx)(n.li,{children:"Apply specific caching strategies"}),"\n",(0,s.jsx)(n.li,{children:"Manage data lifecycle and retention policies"}),"\n",(0,s.jsx)(n.li,{children:"Configure stores in TruffleBox UI for feature groups and entities"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.strong,{children:"Storage Backend Configuration:"})}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"ScyllaDB Store IDs"}),": ",(0,s.jsx)(n.code,{children:"STORAGE_SCYLLA_ACTIVE_CONFIG_IDS=1,5"})," indicates ScyllaDB configurations with IDs 1 and 5 are active"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Redis Store IDs"}),": ",(0,s.jsx)(n.code,{children:"STORAGE_REDIS_FAILOVER_ACTIVE_CONFIG_IDS=2,4,6"})," indicates Redis configurations with IDs 2, 4, and 6 are active"]}),"\n"]}),"\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Dual Usage of Redis:"}),"\nRedis configurations can serve dual purposes:"]}),"\n",(0,s.jsxs)(n.ol,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"As Storage Backend"}),": Redis IDs (2,4,6) can be configured as primary storage in Store configurations"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"As Distributed Cache"}),": 
Same Redis IDs can be used for caching via ",(0,s.jsx)(n.code,{children:"DISTRIBUTED_CACHE_CONF_IDS=2"})]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"When creating stores in TruffleBox, you can reference these storage configuration IDs to determine which backend (ScyllaDB ID 1/5 or Redis ID 2/4/6) will be used for your feature data."}),"\n",(0,s.jsx)(n.h3,{id:"entity",children:"Entity"}),"\n",(0,s.jsxs)(n.p,{children:["An ",(0,s.jsx)(n.strong,{children:"Entity"})," represents a logical grouping of related features, typically corresponding to a business object (e.g., User, Product, Transaction). Entities provide:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Namespace"}),": Logical separation of feature groups"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Identity"}),": Primary key definition for feature lookup"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Configuration"}),": Cache settings and storage preferences"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"feature-group",children:"Feature Group"}),"\n",(0,s.jsxs)(n.p,{children:["A ",(0,s.jsx)(n.strong,{children:"Feature Group"})," is a collection of related features that share:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Common Entity"}),": All features belong to the same entity"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Storage Configuration"}),": Same underlying storage and caching strategy"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Data Lifecycle"}),": Shared TTL and retention policies"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Access Patterns"}),": Similar read/write characteristics"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"feature",children:"Feature"}),"\n",(0,s.jsxs)(n.p,{children:["A ",(0,s.jsx)(n.strong,{children:"Feature"})," is an individual data point that can be retrieved for machine learning models. 
Each feature has:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Name"}),": Unique identifier within its feature group"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Data Type"}),": The type of data stored (string, integer, float, etc.)"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Default Value"}),": Value returned when feature data is not available"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Source Mapping"}),": How the feature maps to underlying storage columns"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"job",children:"Job"}),"\n",(0,s.jsxs)(n.p,{children:["A ",(0,s.jsx)(n.strong,{children:"Job"})," represents a data processing pipeline that:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Ingests Data"}),": Processes raw data from various sources"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Transforms Features"}),": Applies business logic and computations"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Updates Storage"}),": Writes processed features to the feature store"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Scheduling"}),": Defines when and how often the job runs"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"configuration-hierarchy",children:"Configuration Hierarchy"}),"\n",(0,s.jsx)(n.p,{children:"The system uses a hierarchical configuration approach:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{children:"Store \u2192 Entity \u2192 Feature Group \u2192 Feature\n \u2193 \u2193 \u2193 \u2193\nConfig Identity Collection Individual\nLevel Level Level Level\n"})}),"\n",(0,s.jsx)(n.p,{children:"This hierarchy allows for:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Inheritance"}),": Lower levels inherit settings from higher 
levels"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Override"}),": Specific configurations can be overridden at each level"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Flexibility"}),": Different storage strategies for different use cases"]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h1,{id:"trufflebox-ui-guide",children:"TruffleBox UI Guide"}),"\n",(0,s.jsx)(n.p,{children:"TruffleBox is a comprehensive and intuitive UI to help users onboard new features, models and related entities easily. We will build iteratively and add support overtime for entire feature lifecycle management."}),"\n",(0,s.jsx)(n.h2,{id:"table-of-contents-1",children:"Table of Contents"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.a,{href:"#user-flow",children:"User Flow"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#getting-started-with-trufflebox",children:"Getting Started with TruffleBox"})}),"\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#feature-discovery",children:"Feature Discovery"})}),"\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#feature-registry",children:"Feature Registry"})}),"\n"]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.a,{href:"#admin-approval-flow",children:"Admin Approval Flow"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#request-management",children:"Request Management"})}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"user-flow",children:"User Flow"}),"\n",(0,s.jsx)(n.h3,{id:"getting-started-with-trufflebox",children:"Getting Started with TruffleBox"}),"\n",(0,s.jsx)(n.h4,{id:"authentication",children:"Authentication"}),"\n",(0,s.jsx)(n.p,{children:"Users can access TruffleBox through registration or login:"}),"\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Registration"}),":"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"New users 
should fill in all details and click Register."}),"\n",(0,s.jsx)(n.li,{children:"Once Registered, Please wait for an admin to activate your User"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Registration Screen",src:i(7184).A+"",width:"3438",height:"1690"})}),"\n",(0,s.jsx)(n.h4,{id:"user-management",children:"User Management"}),"\n",(0,s.jsx)(n.p,{children:"Admin users can manage other users through the User Management interface:"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"User Management",src:i(8224).A+"",width:"3398",height:"1676"})}),"\n",(0,s.jsx)(n.p,{children:"In the User Management page, admins can:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"View all registered users"}),"\n",(0,s.jsx)(n.li,{children:"Activate/deactivate user accounts"}),"\n",(0,s.jsx)(n.li,{children:"Modify user roles"}),"\n",(0,s.jsx)(n.li,{children:"Manage user permissions"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This is a crucial step in the user onboarding process as new users must be activated by an admin before they can log in to the system."}),"\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Login"}),": Existing users can login with their registered email and password."]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Login Screen",src:i(3620).A+"",width:"3438",height:"1690"})}),"\n",(0,s.jsx)(n.h4,{id:"navigation",children:"Navigation"}),"\n",(0,s.jsx)(n.p,{children:"After logging in, you'll be redirected to the feature-discovery page. 
Access the Control Center by clicking the hamburger icon in the top left corner."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Control Center Navigation",src:i(4525).A+"",width:"3450",height:"1700"})}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h3,{id:"feature-discovery",children:"Feature Discovery"}),"\n",(0,s.jsx)(n.p,{children:"The Feature Discovery page displays approved entities, feature groups, and features."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Feature Discovery Landing Page",src:i(2904).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"You can:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"View details by clicking the info icon"}),"\n",(0,s.jsx)(n.li,{children:"Edit entities, feature groups, and features as needed"}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"entity-management",children:"Entity Management"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Entity Details",src:i(2399).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:'View entity details and edit them (limited to In Memory Cache and Distributed Cache details excluding config ID). 
Submit changes via "Save Changes" to raise an edit request.'}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Edit Entity",src:i(911).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.h4,{id:"feature-group-management",children:"Feature Group Management"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Feature Group Details",src:i(7035).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"Edit feature groups (TTL, In-Memory Cache Enabled, Distributed Cache Enabled, Layout Version) and submit changes to raise an edit request."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Edit Feature Group",src:i(3963).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.h4,{id:"feature-management",children:"Feature Management"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Feature Details",src:i(6092).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"Edit features (Default Value, Source Base Path, Source Data Column, Storage Provider) and submit changes to raise an edit request."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Edit Features",src:i(7943).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.h4,{id:"store-discovery",children:"Store Discovery"}),"\n",(0,s.jsx)(n.p,{children:"Access Store Discovery from the Control Center to view all stores in the database."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Store Discovery",src:i(8689).A+"",width:"3456",height:"1680"})}),"\n",(0,s.jsx)(n.p,{children:"You can search for specific stores but have view-only access."}),"\n",(0,s.jsx)(n.h4,{id:"job-discovery",children:"Job Discovery"}),"\n",(0,s.jsx)(n.p,{children:"Access Job Discovery from the Control Center to view all jobs in the database."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Job Discovery",src:i(9095).A+"",width:"3456",height:"1680"})}),"\n",(0,s.jsx)(n.p,{children:"You can search for specific jobs but have view-only 
access."}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h3,{id:"feature-registry",children:"Feature Registry"}),"\n",(0,s.jsx)(n.p,{children:"In the Control Center, find the 'Feature Registry' accordion to access various registry options for component registration."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Feature Registry Accordion",src:i(4525).A+"",width:"3450",height:"1700"})}),"\n",(0,s.jsx)(n.h4,{id:"request-status-tracking",children:"Request Status Tracking"}),"\n",(0,s.jsx)(n.p,{children:"After raising a request, track its status in the respective registry page. For rejected requests, view the rejection reason by clicking the info icon in the Actions column."}),"\n",(0,s.jsx)(n.h4,{id:"step-by-step-registration-guide",children:"Step-by-Step Registration Guide"}),"\n",(0,s.jsx)(n.p,{children:"For proper feature lifecycle management, register components in this order:"}),"\n",(0,s.jsxs)(n.ol,{children:["\n",(0,s.jsx)(n.li,{children:"Store"}),"\n",(0,s.jsx)(n.li,{children:"Job"}),"\n",(0,s.jsx)(n.li,{children:"Entity"}),"\n",(0,s.jsx)(n.li,{children:"Feature Group"}),"\n",(0,s.jsx)(n.li,{children:"Features (if not added during Feature Group registration)"}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"store-registry",children:"Store Registry"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Register Store",src:i(1644).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"Access Store Registry from the Control Center to view raised requests and register new stores. 
Fill required data and submit to raise a request."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Store Details",src:i(1481).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.strong,{children:"Important Considerations:"})}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Always add primary keys for proper data identification"}),"\n",(0,s.jsx)(n.li,{children:"Accurate store configuration is crucial as changes later can be complex"}),"\n",(0,s.jsx)(n.li,{children:"Admin approval creates a database table with your configuration"}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"job-registry",children:"Job Registry"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Create Job",src:i(3366).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"Access Job Registry from the Control Center to view raised requests and create new jobs. Fill required data and submit your request."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Job Details",src:i(2791).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"Ensure job details are accurate before proceeding to Entity Registry."}),"\n",(0,s.jsx)(n.h4,{id:"entity-registry",children:"Entity Registry"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Create Entity",src:i(9214).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"Access Entity Registry from the Control Center to view raised requests and create new entities. 
Fill required data and submit your request."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Entity Detail View",src:i(351).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.strong,{children:"Important Considerations:"})}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Ensure entity details align with your data model"}),"\n",(0,s.jsx)(n.li,{children:"The entity serves as a logical container for feature groups"}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"feature-group-registry",children:"Feature Group Registry"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Create Feature Group",src:i(1954).A+"",width:"3456",height:"1680"})}),"\n",(0,s.jsx)(n.p,{children:"Access Feature Group Registry from the Control Center to view raised requests and create new feature groups. Fill required data and submit your request."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Feature Group Detail View",src:i(2955).A+"",width:"3456",height:"1680"})}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.strong,{children:"Important Considerations:"})}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Primary keys must match the store primary keys"}),"\n",(0,s.jsx)(n.li,{children:"TTL settings determine how long feature data is stored"}),"\n",(0,s.jsx)(n.li,{children:"Configure cache settings based on access patterns"}),"\n",(0,s.jsx)(n.li,{children:"Approved feature groups automatically add necessary columns to the database table"}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"feature-addition",children:"Feature Addition"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Add Features",src:i(3532).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"Access Feature Addition from the Control Center to view raised requests and add new features. 
Fill required data and submit your request."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Feature Detail View",src:i(73).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.strong,{children:"Important Considerations:"})}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Ensure feature data types are compatible with source data"}),"\n",(0,s.jsx)(n.li,{children:"Set appropriate default values and correct source data column mapping"}),"\n",(0,s.jsx)(n.li,{children:"Approved features automatically add columns to the database table"}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"need-help",children:"Need Help?"}),"\n",(0,s.jsx)(n.p,{children:"Please reach out to the BharatMLStack core team for any questions about using TruffleBox."}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"admin-approval-flow",children:"Admin Approval Flow"}),"\n",(0,s.jsx)(n.p,{children:"As an admin, you're responsible for reviewing and managing user requests."}),"\n",(0,s.jsx)(n.h3,{id:"request-management",children:"Request Management"}),"\n",(0,s.jsx)(n.h4,{id:"viewing-all-requests",children:"Viewing All Requests"}),"\n",(0,s.jsx)(n.p,{children:"After logging in as an admin, you can see all pending requests across different components (Stores, Jobs, Entities, Feature Groups, Features)."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Admin Dashboard",src:i(6500).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.h4,{id:"request-approval-process",children:"Request Approval Process"}),"\n",(0,s.jsxs)(n.ol,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Review Details"}),": Click the info icon to view complete request details"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Request Details",src:i(6500).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsxs)(n.ol,{start:"2",children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Approval Option"}),": After review, use the approve/reject 
buttons"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Approval Buttons",src:i(6500).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsxs)(n.ol,{start:"3",children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Approval Process"}),":",(0,s.jsx)(n.br,{}),"\n",'Click "Approve" to process the request. The system will create database tables or add columns as needed. A success message confirms completion.']}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Approval Success",src:i(6500).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsxs)(n.ol,{start:"4",children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Rejection Process"}),":",(0,s.jsx)(n.br,{}),"\n",'Click "Reject" to deny a request. Provide a rejection reason to help users understand why their request wasn\'t approved.']}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Rejection Reason",src:i(6863).A+"",width:"3456",height:"1680"})}),"\n",(0,s.jsx)(n.p,{children:"Users can view the rejection reason in their respective registry page."}),"\n",(0,s.jsx)(n.h4,{id:"admin-support",children:"Admin Support"}),"\n",(0,s.jsx)(n.p,{children:"If you need assistance with admin functions, please contact the BharatMLStack core team."}),"\n",(0,s.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,s.jsxs)(n.p,{children:["We welcome contributions from the community! 
Please see our ",(0,s.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,s.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["\ud83d\udcac ",(0,s.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,s.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,s.jsxs)(n.li,{children:["\ud83d\udc1b ",(0,s.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,s.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,s.jsxs)(n.li,{children:["\ud83d\udce7 ",(0,s.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,s.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,s.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,s.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 1.1"}),"."]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)("div",{align:"center",children:(0,s.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,s.jsx)("div",{align:"center",children:(0,s.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function h(e={}){const{wrapper:n}={...(0,t.R)(),...e.components};return n?(0,s.jsx)(n,{...e,children:(0,s.jsx)(d,{...e})}):d(e)}},911:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-edit-entity-0c3bb1263b53ed678ae2f9310441f3d7.png"},1481:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-register-store-details-a36537beae9ac91576186b193e858112.png"},1644:(e,n,i)=>{i.d(n,{A:()=>r});const 
r=i.p+"assets/images/v1.0.0-trufflebox-register-store-d6f80ceb9a6570b225bba4653ac22dd8.png"},1954:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-register-fg-9c3b22e62b389f2c1baf968a6e201964.png"},2399:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-feature-discovery-entity-details-839bb44b2cd99129eeb0ee785d19152c.png"},2791:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-register-job-details-075436efba1df107ac7e42164ff6494a.png"},2904:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-feature-discovery-c3a8456bb04479842666120a0ec082e6.png"},2955:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-register-fg-details-1b1100bbb5d23fac31414b15f2a59366.png"},3366:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-register-job-e45c350f42a09adaeea50ef00d53df55.png"},3532:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-add-features-6cb39960d91af3ee1c896492188cfcb5.png"},3620:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-login-de1cbf15b2daa5c532875a94a4ad1a47.png"},3963:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-edit-fg-edc1a8999700e5c1e9ff023fe9f6413f.png"},4525:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-navigation-0e472fd13ccdae9448011eb9aebb990e.png"},6092:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-feature-discovery-feature-details-b780eb1ede246eb257862a46f0fdb53e.png"},6500:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-approve-store-1057c0853f92becfa9b1f87d165a72f9.png"},6863:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-reject-popup-9941183f1128e19034f41970d218d72f.png"},7035:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-feature-discovery-fg-details-a2dda4f72568878138e3b2d50fa20e8f.png"},7184:(e,n,i)=>{i.d(n,{A:()=>r});const 
r=i.p+"assets/images/v1.0.0-trufflebox-registration-aed7738afc652b6418bdc00966850ec0.png"},7943:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-edit-features-41cb78c09d70203c166fce91976d2ba0.png"},8224:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-user-management-2c50fa8488f21ff07b9925c48a10f7cd.png"},8453:(e,n,i)=>{i.d(n,{R:()=>a,x:()=>o});var r=i(6540);const s={},t=r.createContext(s);function a(e){const n=r.useContext(t);return r.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:a(e.components),r.createElement(t.Provider,{value:n},e.children)}},8689:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-store-discovery-8c9042352255fff36b35b4aa193583f7.png"},9095:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-job-discovery-3fac78c4b09b6c76a7bc1dd0738cc93d.png"},9214:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-register-entity-fe6449f47304e0377107d8e5b3ce1d30.png"}}]); \ No newline at end of file diff --git a/docs/assets/js/176d210f.efd0443c.js b/docs/assets/js/176d210f.efd0443c.js new file mode 100644 index 00000000..9666c425 --- /dev/null +++ b/docs/assets/js/176d210f.efd0443c.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[6100],{715:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-register-fg-9c3b22e62b389f2c1baf968a6e201964.png"},753:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>l,contentTitle:()=>o,default:()=>h,frontMatter:()=>a,metadata:()=>r,toc:()=>c});const r=JSON.parse('{"id":"trufflebox-ui/v1.0.0/userguide","title":"User Manual","description":"This guide covers the complete setup and usage of the Online Feature Store system, including the core services (Online Feature Store and Horizon) and the TruffleBox UI for feature 
management.","source":"@site/docs/trufflebox-ui/v1.0.0/userguide.md","sourceDirName":"trufflebox-ui/v1.0.0","slug":"/trufflebox-ui/v1.0.0/userguide","permalink":"/BharatMLStack/trufflebox-ui/v1.0.0/userguide","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/trufflebox-ui/v1.0.0/userguide.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"User Manual","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"v1.0.0","permalink":"/BharatMLStack/trufflebox-ui/v1.0.0"},"next":{"title":"SDKs","permalink":"/BharatMLStack/category/sdks"}}');var s=i(4848),t=i(8453);const a={title:"User Manual",sidebar_position:1},o="Usage Guide",l={},c=[{value:"Table of Contents",id:"table-of-contents",level:2},{value:"System Overview",id:"system-overview",level:2},{value:"Environment Setup",id:"environment-setup",level:2},{value:"Online Feature Store Configuration",id:"online-feature-store-configuration",level:3},{value:"Core Application Settings",id:"core-application-settings",level:4},{value:"Storage Configuration",id:"storage-configuration",level:4},{value:"Caching Configuration",id:"caching-configuration",level:4},{value:"Service Discovery and Configuration",id:"service-discovery-and-configuration",level:4},{value:"Horizon Configuration",id:"horizon-configuration",level:3},{value:"Core Application Settings",id:"core-application-settings-1",level:4},{value:"Database Configuration",id:"database-configuration",level:4},{value:"ScyllaDB Configuration",id:"scylladb-configuration",level:4},{value:"Service Integration",id:"service-integration",level:4},{value:"Key Constructs",id:"key-constructs",level:2},{value:"Store ID",id:"store-id",level:3},{value:"Entity",id:"entity",level:3},{value:"Feature Group",id:"feature-group",level:3},{value:"Feature",id:"feature",level:3},{value:"Job",id:"job",level:3},{value:"Configuration Hierarchy",id:"configuration-hierarchy",level:3},{value:"Table of 
Contents",id:"table-of-contents-1",level:2},{value:"User Flow",id:"user-flow",level:2},{value:"Getting Started with TruffleBox",id:"getting-started-with-trufflebox",level:3},{value:"Authentication",id:"authentication",level:4},{value:"User Management",id:"user-management",level:4},{value:"Navigation",id:"navigation",level:4},{value:"Feature Discovery",id:"feature-discovery",level:3},{value:"Entity Management",id:"entity-management",level:4},{value:"Feature Group Management",id:"feature-group-management",level:4},{value:"Feature Management",id:"feature-management",level:4},{value:"Store Discovery",id:"store-discovery",level:4},{value:"Job Discovery",id:"job-discovery",level:4},{value:"Feature Registry",id:"feature-registry",level:3},{value:"Request Status Tracking",id:"request-status-tracking",level:4},{value:"Step-by-Step Registration Guide",id:"step-by-step-registration-guide",level:4},{value:"Store Registry",id:"store-registry",level:4},{value:"Job Registry",id:"job-registry",level:4},{value:"Entity Registry",id:"entity-registry",level:4},{value:"Feature Group Registry",id:"feature-group-registry",level:4},{value:"Feature Addition",id:"feature-addition",level:4},{value:"Need Help?",id:"need-help",level:4},{value:"Admin Approval Flow",id:"admin-approval-flow",level:2},{value:"Request Management",id:"request-management",level:3},{value:"Viewing All Requests",id:"viewing-all-requests",level:4},{value:"Request Approval Process",id:"request-approval-process",level:4},{value:"Admin Support",id:"admin-support",level:4},{value:"Contributing",id:"contributing",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function d(e){const 
n={a:"a",br:"br",code:"code",h1:"h1",h2:"h2",h3:"h3",h4:"h4",header:"header",hr:"hr",img:"img",li:"li",ol:"ol",p:"p",pre:"pre",strong:"strong",ul:"ul",...(0,t.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(n.header,{children:(0,s.jsx)(n.h1,{id:"usage-guide",children:"Usage Guide"})}),"\n",(0,s.jsx)(n.p,{children:"This guide covers the complete setup and usage of the Online Feature Store system, including the core services (Online Feature Store and Horizon) and the TruffleBox UI for feature management."}),"\n",(0,s.jsx)(n.h2,{id:"table-of-contents",children:"Table of Contents"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#system-overview",children:"System Overview"})}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.a,{href:"#environment-setup",children:"Environment Setup"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#online-feature-store-configuration",children:"Online Feature Store Configuration"})}),"\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#horizon-configuration",children:"Horizon Configuration"})}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#key-constructs",children:"Key Constructs"})}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.a,{href:"#trufflebox-ui-guide",children:"TruffleBox UI Guide"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#user-flow",children:"User Flow"})}),"\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#admin-approval-flow",children:"Admin Approval Flow"})}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"system-overview",children:"System Overview"}),"\n",(0,s.jsx)(n.p,{children:"The Online Feature Store is a comprehensive feature management system consisting of two main components:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Online Feature Store"}),": The core feature serving service that 
provides real-time feature retrieval with multiple storage backends and caching layers"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Horizon"}),": The configuration and metadata management service that handles feature definitions, stores, and job configurations"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"These services work together to provide a scalable, high-performance feature store for machine learning applications."}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"environment-setup",children:"Environment Setup"}),"\n",(0,s.jsx)(n.h3,{id:"online-feature-store-configuration",children:"Online Feature Store Configuration"}),"\n",(0,s.jsx)(n.p,{children:"The Online Feature Store requires several environment variables to configure storage backends, caching, and service settings."}),"\n",(0,s.jsx)(n.h4,{id:"core-application-settings",children:"Core Application Settings"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"APP_ENV=prod\nAPP_LOG_LEVEL=DEBUG\nAPP_METRIC_SAMPLING_RATE=1\nAPP_NAME=online-feature-store\nAPP_PORT=8005\nAUTH_TOKEN=ofs-token\n"})}),"\n",(0,s.jsx)(n.h4,{id:"storage-configuration",children:"Storage Configuration"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.strong,{children:"ScyllaDB Storage (Primary Storage)"})}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"# Primary ScyllaDB cluster\nSTORAGE_SCYLLA_1_CONTACT_POINTS=localhost\nSTORAGE_SCYLLA_1_KEYSPACE=ofs\nSTORAGE_SCYLLA_1_NUM_CONNS=1\nSTORAGE_SCYLLA_1_PORT=9042\nSTORAGE_SCYLLA_1_TIMEOUT_IN_MS=300000\nSTORAGE_SCYLLA_1_PASSWORD=\nSTORAGE_SCYLLA_1_USERNAME=ofs\n\n# Secondary ScyllaDB cluster\nSTORAGE_SCYLLA_5_CONTACT_POINTS=localhost\nSTORAGE_SCYLLA_5_KEYSPACE=onfs\nSTORAGE_SCYLLA_5_NUM_CONNS=1\nSTORAGE_SCYLLA_5_PASSWORD=\nSTORAGE_SCYLLA_5_PORT=9042\nSTORAGE_SCYLLA_5_TIMEOUT_IN_MS=300000\nSTORAGE_SCYLLA_5_USERNAME=\n\n# Active ScyllaDB 
configurations\nSTORAGE_SCYLLA_ACTIVE_CONFIG_IDS=1,5\n"})}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.strong,{children:"Redis Storage Configuration"})}),"\n",(0,s.jsx)(n.p,{children:"Redis serves dual purposes in the Online Feature Store:"}),"\n",(0,s.jsxs)(n.ol,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Primary Storage Backend"}),": For fast feature retrieval and storage"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Distributed Cache Layer"}),": For improved performance and reduced latency"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Redis configurations can be referenced by their IDs in Store configurations, similar to ScyllaDB. Each Redis configuration can be independently used as either a storage backend or cache layer."}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"# Redis Failover Configuration 1 (ID: 2)\nSTORAGE_REDIS_FAILOVER_2_SENTINEL_ADDRESSES=localhost:26379\nSTORAGE_REDIS_FAILOVER_2_DB=0\nSTORAGE_REDIS_FAILOVER_2_DISABLE_IDENTITY=true\nSTORAGE_REDIS_FAILOVER_2_MASTER_NAME=mymaster\nSTORAGE_REDIS_FAILOVER_2_MAX_IDLE_CONN=32\nSTORAGE_REDIS_FAILOVER_2_MIN_IDLE_CONN=20\nSTORAGE_REDIS_FAILOVER_2_MAX_ACTIVE_CONN=32\nSTORAGE_REDIS_FAILOVER_2_MAX_RETRY=-1\nSTORAGE_REDIS_FAILOVER_2_POOL_FIFO=false\nSTORAGE_REDIS_FAILOVER_2_READ_TIMEOUT_IN_MS=3000\nSTORAGE_REDIS_FAILOVER_2_WRITE_TIMEOUT_IN_MS=3000\nSTORAGE_REDIS_FAILOVER_2_POOL_TIMEOUT_IN_MS=3000\nSTORAGE_REDIS_FAILOVER_2_POOL_SIZE=32\nSTORAGE_REDIS_FAILOVER_2_CONN_MAX_IDLE_TIMEOUT_IN_MINUTES=15\nSTORAGE_REDIS_FAILOVER_2_CONN_MAX_AGE_IN_MINUTES=30\n\n# Redis Failover Configuration 2 (ID: 
4)\nSTORAGE_REDIS_FAILOVER_4_SENTINEL_ADDRESSES=localhost:26379\nSTORAGE_REDIS_FAILOVER_4_DB=0\nSTORAGE_REDIS_FAILOVER_4_DISABLE_IDENTITY=true\nSTORAGE_REDIS_FAILOVER_4_MASTER_NAME=mymaster\nSTORAGE_REDIS_FAILOVER_4_MAX_IDLE_CONN=32\nSTORAGE_REDIS_FAILOVER_4_MIN_IDLE_CONN=20\nSTORAGE_REDIS_FAILOVER_4_MAX_ACTIVE_CONN=32\nSTORAGE_REDIS_FAILOVER_4_MAX_RETRY=-1\nSTORAGE_REDIS_FAILOVER_4_POOL_FIFO=false\nSTORAGE_REDIS_FAILOVER_4_READ_TIMEOUT_IN_MS=3000\nSTORAGE_REDIS_FAILOVER_4_WRITE_TIMEOUT_IN_MS=3000\nSTORAGE_REDIS_FAILOVER_4_POOL_TIMEOUT_IN_MS=3000\nSTORAGE_REDIS_FAILOVER_4_POOL_SIZE=32\nSTORAGE_REDIS_FAILOVER_4_CONN_MAX_IDLE_TIMEOUT_IN_MINUTES=15\nSTORAGE_REDIS_FAILOVER_4_CONN_MAX_AGE_IN_MINUTES=30\n\n# High-Performance Redis Configuration (ID: 6)\nSTORAGE_REDIS_FAILOVER_6_CONN_MAX_AGE_IN_MINUTES=-1\nSTORAGE_REDIS_FAILOVER_6_CONN_MAX_IDLE_TIMEOUT_IN_MINUTES=30\nSTORAGE_REDIS_FAILOVER_6_DB=0\nSTORAGE_REDIS_FAILOVER_6_DISABLE_IDENTITY=true\nSTORAGE_REDIS_FAILOVER_6_MASTER_NAME=mymaster\nSTORAGE_REDIS_FAILOVER_6_MAX_ACTIVE_CONN=202\nSTORAGE_REDIS_FAILOVER_6_MAX_IDLE_CONN=157\nSTORAGE_REDIS_FAILOVER_6_MAX_RETRY=-1\nSTORAGE_REDIS_FAILOVER_6_MIN_IDLE_CONN=52\nSTORAGE_REDIS_FAILOVER_6_PASSWORD=\nSTORAGE_REDIS_FAILOVER_6_POOL_FIFO=false\nSTORAGE_REDIS_FAILOVER_6_POOL_SIZE=202\nSTORAGE_REDIS_FAILOVER_6_POOL_TIMEOUT_IN_MS=2\nSTORAGE_REDIS_FAILOVER_6_READ_TIMEOUT_IN_MS=75\nSTORAGE_REDIS_FAILOVER_6_ROUTE_RANDOM=true\nSTORAGE_REDIS_FAILOVER_6_SENTINEL_ADDRESSES=localhost:26379\nSTORAGE_REDIS_FAILOVER_6_WRITE_TIMEOUT_IN_MS=300\n\n# Active Redis configurations\nSTORAGE_REDIS_FAILOVER_ACTIVE_CONFIG_IDS=2,4,6\n"})}),"\n",(0,s.jsx)(n.h4,{id:"caching-configuration",children:"Caching Configuration"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"# In-Memory Cache\nIN_MEM_CACHE_3_ENABLED=true\nIN_MEM_CACHE_3_NAME=onfs\nIN_MEM_CACHE_3_SIZE_IN_BYTES=10000000\nIN_MEM_CACHE_ACTIVE_CONFIG_IDS=3\n\n# Distributed Cache (uses Redis configurations)\n# Redis 
configurations (IDs: 2,4,6) can be used for distributed caching\nDISTRIBUTED_CACHE_CONF_IDS=2\n"})}),"\n",(0,s.jsx)(n.h4,{id:"service-discovery-and-configuration",children:"Service Discovery and Configuration"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"# ETCD Configuration for service discovery\nETCD_SERVER=0.0.0.0:2379\nETCD_WATCHER_ENABLED=true\n"})}),"\n",(0,s.jsx)(n.h3,{id:"horizon-configuration",children:"Horizon Configuration"}),"\n",(0,s.jsx)(n.p,{children:"Horizon manages the metadata and configuration for the Online Feature Store system."}),"\n",(0,s.jsx)(n.h4,{id:"core-application-settings-1",children:"Core Application Settings"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"APP_NAME=horizon\nAPP_ENVIRONMENT=PROD\nAPP_ENV=production\nAPP_PORT=8082\nAPP_LOG_LEVEL=DEBUG\nAPP_METRIC_SAMPLING_RATE=1\nAPP_GC_PERCENTAGE=1\n"})}),"\n",(0,s.jsx)(n.h4,{id:"database-configuration",children:"Database Configuration"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"# MySQL Master Configuration\nMYSQL_MASTER_MAX_POOL_SIZE=5\nMYSQL_MASTER_MIN_POOL_SIZE=2\nMYSQL_MASTER_PASSWORD=\nMYSQL_MASTER_HOST=127.0.0.1\nMYSQL_MASTER_PORT=3306\nMYSQL_DB_NAME=ml_config\nMYSQL_MASTER_USERNAME=root\n\n# MySQL Slave Configuration\nMYSQL_SLAVE_MAX_POOL_SIZE=5\nMYSQL_SLAVE_MIN_POOL_SIZE=2\nMYSQL_SLAVE_PASSWORD=\nMYSQL_SLAVE_HOST=127.0.0.1\nMYSQL_SLAVE_USERNAME=root\nMYSQL_SLAVE_PORT=3306\n"})}),"\n",(0,s.jsx)(n.h4,{id:"scylladb-configuration",children:"ScyllaDB Configuration"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"# ScyllaDB for Horizon\nSCYLLA_1_CONTACT_POINTS=localhost\nSCYLLA_1_KEYSPACE=onfs\nSCYLLA_1_NUM_CONNS=1\nSCYLLA_1_PORT=9042\nSCYLLA_1_TIMEOUT_IN_MS=300000\nSCYLLA_1_PASSWORD=\nSCYLLA_1_USERNAME=\nSCYLLA_ACTIVE_CONFIG_IDS=1\n"})}),"\n",(0,s.jsx)(n.h4,{id:"service-integration",children:"Service 
Integration"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"# ETCD Configuration\nETCD_WATCHER_ENABLED=true\nETCD_SERVER=localhost:2379\n\n# Integration with Online Feature Store\nONLINE_FEATURE_STORE_APP_NAME=online-feature-store\n"})}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"key-constructs",children:"Key Constructs"}),"\n",(0,s.jsx)(n.p,{children:"Understanding these key constructs is essential for effectively using the Online Feature Store:"}),"\n",(0,s.jsx)(n.h3,{id:"store-id",children:"Store ID"}),"\n",(0,s.jsxs)(n.p,{children:["A ",(0,s.jsx)(n.strong,{children:"Store ID"})," is a unique identifier that represents a data storage configuration within the system. It defines:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Storage Backend"}),": Which underlying storage system (ScyllaDB, Redis, etc.) to use"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Configuration Parameters"}),": Connection settings, timeouts, pool sizes"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Access Patterns"}),": How data is read from and written to the store"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Store IDs are referenced throughout the system to:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Route feature requests to the appropriate storage backend"}),"\n",(0,s.jsx)(n.li,{children:"Apply specific caching strategies"}),"\n",(0,s.jsx)(n.li,{children:"Manage data lifecycle and retention policies"}),"\n",(0,s.jsx)(n.li,{children:"Configure stores in TruffleBox UI for feature groups and entities"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.strong,{children:"Storage Backend Configuration:"})}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"ScyllaDB Store IDs"}),": ",(0,s.jsx)(n.code,{children:"STORAGE_SCYLLA_ACTIVE_CONFIG_IDS=1,5"})," indicates ScyllaDB configurations with IDs 1 
and 5 are active"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Redis Store IDs"}),": ",(0,s.jsx)(n.code,{children:"STORAGE_REDIS_FAILOVER_ACTIVE_CONFIG_IDS=2,4,6"})," indicates Redis configurations with IDs 2, 4, and 6 are active"]}),"\n"]}),"\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Dual Usage of Redis:"}),"\nRedis configurations can serve dual purposes:"]}),"\n",(0,s.jsxs)(n.ol,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"As Storage Backend"}),": Redis IDs (2,4,6) can be configured as primary storage in Store configurations"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"As Distributed Cache"}),": Same Redis IDs can be used for caching via ",(0,s.jsx)(n.code,{children:"DISTRIBUTED_CACHE_CONF_IDS=2"})]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"When creating stores in TruffleBox, you can reference these storage configuration IDs to determine which backend (ScyllaDB ID 1/5 or Redis ID 2/4/6) will be used for your feature data."}),"\n",(0,s.jsx)(n.h3,{id:"entity",children:"Entity"}),"\n",(0,s.jsxs)(n.p,{children:["An ",(0,s.jsx)(n.strong,{children:"Entity"})," represents a logical grouping of related features, typically corresponding to a business object (e.g., User, Product, Transaction). 
Entities provide:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Namespace"}),": Logical separation of feature groups"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Identity"}),": Primary key definition for feature lookup"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Configuration"}),": Cache settings and storage preferences"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"feature-group",children:"Feature Group"}),"\n",(0,s.jsxs)(n.p,{children:["A ",(0,s.jsx)(n.strong,{children:"Feature Group"})," is a collection of related features that share:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Common Entity"}),": All features belong to the same entity"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Storage Configuration"}),": Same underlying storage and caching strategy"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Data Lifecycle"}),": Shared TTL and retention policies"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Access Patterns"}),": Similar read/write characteristics"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"feature",children:"Feature"}),"\n",(0,s.jsxs)(n.p,{children:["A ",(0,s.jsx)(n.strong,{children:"Feature"})," is an individual data point that can be retrieved for machine learning models. 
Each feature has:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Name"}),": Unique identifier within its feature group"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Data Type"}),": The type of data stored (string, integer, float, etc.)"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Default Value"}),": Value returned when feature data is not available"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Source Mapping"}),": How the feature maps to underlying storage columns"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"job",children:"Job"}),"\n",(0,s.jsxs)(n.p,{children:["A ",(0,s.jsx)(n.strong,{children:"Job"})," represents a data processing pipeline that:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Ingests Data"}),": Processes raw data from various sources"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Transforms Features"}),": Applies business logic and computations"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Updates Storage"}),": Writes processed features to the feature store"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Scheduling"}),": Defines when and how often the job runs"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"configuration-hierarchy",children:"Configuration Hierarchy"}),"\n",(0,s.jsx)(n.p,{children:"The system uses a hierarchical configuration approach:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{children:"Store \u2192 Entity \u2192 Feature Group \u2192 Feature\n \u2193 \u2193 \u2193 \u2193\nConfig Identity Collection Individual\nLevel Level Level Level\n"})}),"\n",(0,s.jsx)(n.p,{children:"This hierarchy allows for:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Inheritance"}),": Lower levels inherit settings from higher 
levels"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Override"}),": Specific configurations can be overridden at each level"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Flexibility"}),": Different storage strategies for different use cases"]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h1,{id:"trufflebox-ui-guide",children:"TruffleBox UI Guide"}),"\n",(0,s.jsx)(n.p,{children:"TruffleBox is a comprehensive and intuitive UI that helps users onboard new features, models, and related entities easily. We will build iteratively and add support over time for the entire feature lifecycle."}),"\n",(0,s.jsx)(n.h2,{id:"table-of-contents-1",children:"Table of Contents"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.a,{href:"#user-flow",children:"User Flow"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#getting-started-with-trufflebox",children:"Getting Started with TruffleBox"})}),"\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#feature-discovery",children:"Feature Discovery"})}),"\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#feature-registry",children:"Feature Registry"})}),"\n"]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.a,{href:"#admin-approval-flow",children:"Admin Approval Flow"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:(0,s.jsx)(n.a,{href:"#request-management",children:"Request Management"})}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"user-flow",children:"User Flow"}),"\n",(0,s.jsx)(n.h3,{id:"getting-started-with-trufflebox",children:"Getting Started with TruffleBox"}),"\n",(0,s.jsx)(n.h4,{id:"authentication",children:"Authentication"}),"\n",(0,s.jsx)(n.p,{children:"Users can access TruffleBox through registration or login:"}),"\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Registration"}),":"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"New users 
should fill in all details and click Register."}),"\n",(0,s.jsx)(n.li,{children:"Once registered, please wait for an admin to activate your account"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Registration Screen",src:i(2583).A+"",width:"3438",height:"1690"})}),"\n",(0,s.jsx)(n.h4,{id:"user-management",children:"User Management"}),"\n",(0,s.jsx)(n.p,{children:"Admin users can manage other users through the User Management interface:"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"User Management",src:i(1877).A+"",width:"3398",height:"1676"})}),"\n",(0,s.jsx)(n.p,{children:"In the User Management page, admins can:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"View all registered users"}),"\n",(0,s.jsx)(n.li,{children:"Activate/deactivate user accounts"}),"\n",(0,s.jsx)(n.li,{children:"Modify user roles"}),"\n",(0,s.jsx)(n.li,{children:"Manage user permissions"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This is a crucial step in the user onboarding process as new users must be activated by an admin before they can log in to the system."}),"\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Login"}),": Existing users can log in with their registered email and password."]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Login Screen",src:i(6441).A+"",width:"3438",height:"1690"})}),"\n",(0,s.jsx)(n.h4,{id:"navigation",children:"Navigation"}),"\n",(0,s.jsx)(n.p,{children:"After logging in, you'll be redirected to the feature-discovery page. 
Access the Control Center by clicking the hamburger icon in the top left corner."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Control Center Navigation",src:i(4726).A+"",width:"3450",height:"1700"})}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h3,{id:"feature-discovery",children:"Feature Discovery"}),"\n",(0,s.jsx)(n.p,{children:"The Feature Discovery page displays approved entities, feature groups, and features."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Feature Discovery Landing Page",src:i(5689).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"You can:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"View details by clicking the info icon"}),"\n",(0,s.jsx)(n.li,{children:"Edit entities, feature groups, and features as needed"}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"entity-management",children:"Entity Management"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Entity Details",src:i(1800).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:'View entity details and edit them (limited to In Memory Cache and Distributed Cache details excluding config ID). 
Submit changes via "Save Changes" to raise an edit request.'}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Edit Entity",src:i(3714).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.h4,{id:"feature-group-management",children:"Feature Group Management"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Feature Group Details",src:i(5592).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"Edit feature groups (TTL, In-Memory Cache Enabled, Distributed Cache Enabled, Layout Version) and submit changes to raise an edit request."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Edit Feature Group",src:i(9430).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.h4,{id:"feature-management",children:"Feature Management"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Feature Details",src:i(2421).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"Edit features (Default Value, Source Base Path, Source Data Column, Storage Provider) and submit changes to raise an edit request."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Edit Features",src:i(4230).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.h4,{id:"store-discovery",children:"Store Discovery"}),"\n",(0,s.jsx)(n.p,{children:"Access Store Discovery from the Control Center to view all stores in the database."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Store Discovery",src:i(8668).A+"",width:"3456",height:"1680"})}),"\n",(0,s.jsx)(n.p,{children:"You can search for specific stores but have view-only access."}),"\n",(0,s.jsx)(n.h4,{id:"job-discovery",children:"Job Discovery"}),"\n",(0,s.jsx)(n.p,{children:"Access Job Discovery from the Control Center to view all jobs in the database."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Job Discovery",src:i(4734).A+"",width:"3456",height:"1680"})}),"\n",(0,s.jsx)(n.p,{children:"You can search for specific jobs but have view-only 
access."}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h3,{id:"feature-registry",children:"Feature Registry"}),"\n",(0,s.jsx)(n.p,{children:"In the Control Center, find the 'Feature Registry' accordion to access various registry options for component registration."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Feature Registry Accordion",src:i(4726).A+"",width:"3450",height:"1700"})}),"\n",(0,s.jsx)(n.h4,{id:"request-status-tracking",children:"Request Status Tracking"}),"\n",(0,s.jsx)(n.p,{children:"After raising a request, track its status in the respective registry page. For rejected requests, view the rejection reason by clicking the info icon in the Actions column."}),"\n",(0,s.jsx)(n.h4,{id:"step-by-step-registration-guide",children:"Step-by-Step Registration Guide"}),"\n",(0,s.jsx)(n.p,{children:"For proper feature lifecycle management, register components in this order:"}),"\n",(0,s.jsxs)(n.ol,{children:["\n",(0,s.jsx)(n.li,{children:"Store"}),"\n",(0,s.jsx)(n.li,{children:"Job"}),"\n",(0,s.jsx)(n.li,{children:"Entity"}),"\n",(0,s.jsx)(n.li,{children:"Feature Group"}),"\n",(0,s.jsx)(n.li,{children:"Features (if not added during Feature Group registration)"}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"store-registry",children:"Store Registry"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Register Store",src:i(4351).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"Access Store Registry from the Control Center to view raised requests and register new stores. 
Fill required data and submit to raise a request."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Store Details",src:i(6178).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.strong,{children:"Important Considerations:"})}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Always add primary keys for proper data identification"}),"\n",(0,s.jsx)(n.li,{children:"Accurate store configuration is crucial as changes later can be complex"}),"\n",(0,s.jsx)(n.li,{children:"Admin approval creates a database table with your configuration"}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"job-registry",children:"Job Registry"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Create Job",src:i(7909).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"Access Job Registry from the Control Center to view raised requests and create new jobs. Fill required data and submit your request."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Job Details",src:i(3596).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"Ensure job details are accurate before proceeding to Entity Registry."}),"\n",(0,s.jsx)(n.h4,{id:"entity-registry",children:"Entity Registry"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Create Entity",src:i(1023).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"Access Entity Registry from the Control Center to view raised requests and create new entities. 
Fill required data and submit your request."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Entity Detail View",src:i(8741).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.strong,{children:"Important Considerations:"})}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Ensure entity details align with your data model"}),"\n",(0,s.jsx)(n.li,{children:"The entity serves as a logical container for feature groups"}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"feature-group-registry",children:"Feature Group Registry"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Create Feature Group",src:i(715).A+"",width:"3456",height:"1680"})}),"\n",(0,s.jsx)(n.p,{children:"Access Feature Group Registry from the Control Center to view raised requests and create new feature groups. Fill required data and submit your request."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Feature Group Detail View",src:i(1014).A+"",width:"3456",height:"1680"})}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.strong,{children:"Important Considerations:"})}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Primary keys must match the store primary keys"}),"\n",(0,s.jsx)(n.li,{children:"TTL settings determine how long feature data is stored"}),"\n",(0,s.jsx)(n.li,{children:"Configure cache settings based on access patterns"}),"\n",(0,s.jsx)(n.li,{children:"Approved feature groups automatically add necessary columns to the database table"}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"feature-addition",children:"Feature Addition"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Add Features",src:i(6163).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:"Access Feature Addition from the Control Center to view raised requests and add new features. 
Fill required data and submit your request."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Feature Detail View",src:i(6478).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.strong,{children:"Important Considerations:"})}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Ensure feature data types are compatible with source data"}),"\n",(0,s.jsx)(n.li,{children:"Set appropriate default values and correct source data column mapping"}),"\n",(0,s.jsx)(n.li,{children:"Approved features automatically add columns to the database table"}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"need-help",children:"Need Help?"}),"\n",(0,s.jsx)(n.p,{children:"Please reach out to the BharatMLStack core team for any questions about using TruffleBox."}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"admin-approval-flow",children:"Admin Approval Flow"}),"\n",(0,s.jsx)(n.p,{children:"As an admin, you're responsible for reviewing and managing user requests."}),"\n",(0,s.jsx)(n.h3,{id:"request-management",children:"Request Management"}),"\n",(0,s.jsx)(n.h4,{id:"viewing-all-requests",children:"Viewing All Requests"}),"\n",(0,s.jsx)(n.p,{children:"After logging in as an admin, you can see all pending requests across different components (Stores, Jobs, Entities, Feature Groups, Features)."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Admin Dashboard",src:i(9093).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsx)(n.h4,{id:"request-approval-process",children:"Request Approval Process"}),"\n",(0,s.jsxs)(n.ol,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Review Details"}),": Click the info icon to view complete request details"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Request Details",src:i(9093).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsxs)(n.ol,{start:"2",children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Approval Option"}),": After review, use the approve/reject 
buttons"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Approval Buttons",src:i(9093).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsxs)(n.ol,{start:"3",children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Approval Process"}),":",(0,s.jsx)(n.br,{}),"\n",'Click "Approve" to process the request. The system will create database tables or add columns as needed. A success message confirms completion.']}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Approval Success",src:i(9093).A+"",width:"3450",height:"1690"})}),"\n",(0,s.jsxs)(n.ol,{start:"4",children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Rejection Process"}),":",(0,s.jsx)(n.br,{}),"\n",'Click "Reject" to deny a request. Provide a rejection reason to help users understand why their request wasn\'t approved.']}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Rejection Reason",src:i(4188).A+"",width:"3456",height:"1680"})}),"\n",(0,s.jsx)(n.p,{children:"Users can view the rejection reason in their respective registry page."}),"\n",(0,s.jsx)(n.h4,{id:"admin-support",children:"Admin Support"}),"\n",(0,s.jsx)(n.p,{children:"If you need assistance with admin functions, please contact the BharatMLStack core team."}),"\n",(0,s.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,s.jsxs)(n.p,{children:["We welcome contributions from the community! 
Please see our ",(0,s.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,s.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["\ud83d\udcac ",(0,s.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,s.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,s.jsxs)(n.li,{children:["\ud83d\udc1b ",(0,s.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,s.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,s.jsxs)(n.li,{children:["\ud83d\udce7 ",(0,s.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,s.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,s.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,s.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 1.1"}),"."]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)("div",{align:"center",children:(0,s.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,s.jsx)("div",{align:"center",children:(0,s.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function h(e={}){const{wrapper:n}={...(0,t.R)(),...e.components};return n?(0,s.jsx)(n,{...e,children:(0,s.jsx)(d,{...e})}):d(e)}},1014:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-register-fg-details-1b1100bbb5d23fac31414b15f2a59366.png"},1023:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-register-entity-fe6449f47304e0377107d8e5b3ce1d30.png"},1800:(e,n,i)=>{i.d(n,{A:()=>r});const 
r=i.p+"assets/images/v1.0.0-trufflebox-feature-discovery-entity-details-839bb44b2cd99129eeb0ee785d19152c.png"},1877:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-user-management-2c50fa8488f21ff07b9925c48a10f7cd.png"},2421:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-feature-discovery-feature-details-b780eb1ede246eb257862a46f0fdb53e.png"},2583:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-registration-aed7738afc652b6418bdc00966850ec0.png"},3596:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-register-job-details-075436efba1df107ac7e42164ff6494a.png"},3714:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-edit-entity-0c3bb1263b53ed678ae2f9310441f3d7.png"},4188:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-reject-popup-9941183f1128e19034f41970d218d72f.png"},4230:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-edit-features-41cb78c09d70203c166fce91976d2ba0.png"},4351:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-register-store-d6f80ceb9a6570b225bba4653ac22dd8.png"},4726:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-navigation-0e472fd13ccdae9448011eb9aebb990e.png"},4734:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-job-discovery-3fac78c4b09b6c76a7bc1dd0738cc93d.png"},5592:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-feature-discovery-fg-details-a2dda4f72568878138e3b2d50fa20e8f.png"},5689:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-feature-discovery-c3a8456bb04479842666120a0ec082e6.png"},6163:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-add-features-6cb39960d91af3ee1c896492188cfcb5.png"},6178:(e,n,i)=>{i.d(n,{A:()=>r});const 
r=i.p+"assets/images/v1.0.0-trufflebox-register-store-details-a36537beae9ac91576186b193e858112.png"},6441:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-login-de1cbf15b2daa5c532875a94a4ad1a47.png"},6478:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-add-features-details-278a519cdfe25bead880d7a18e0b858e.png"},7909:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-register-job-e45c350f42a09adaeea50ef00d53df55.png"},8453:(e,n,i)=>{i.d(n,{R:()=>a,x:()=>o});var r=i(6540);const s={},t=r.createContext(s);function a(e){const n=r.useContext(t);return r.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:a(e.components),r.createElement(t.Provider,{value:n},e.children)}},8668:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-store-discovery-8c9042352255fff36b35b4aa193583f7.png"},8741:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-register-entity-details-016ab5c5b2fef9f58bde75e6a07c9823.png"},9093:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-approve-store-1057c0853f92becfa9b1f87d165a72f9.png"},9430:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/v1.0.0-trufflebox-edit-fg-edc1a8999700e5c1e9ff023fe9f6413f.png"}}]); \ No newline at end of file diff --git a/docs/assets/js/17896441.4ff7d852.js b/docs/assets/js/17896441.72377930.js similarity index 85% rename from docs/assets/js/17896441.4ff7d852.js rename to docs/assets/js/17896441.72377930.js index 915c5a76..df71920b 100644 --- a/docs/assets/js/17896441.4ff7d852.js +++ b/docs/assets/js/17896441.72377930.js @@ -1 +1 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[8401],{594:(e,n,t)=>{t.d(n,{A:()=>j});t(6540);var s=t(4164),a=t(7559),i=t(6972),l=t(9169),o=t(8774),r=t(1312),c=t(6025),d=t(4848);function 
u(e){return(0,d.jsx)("svg",{viewBox:"0 0 24 24",...e,children:(0,d.jsx)("path",{d:"M10 19v-5h4v5c0 .55.45 1 1 1h3c.55 0 1-.45 1-1v-7h1.7c.46 0 .68-.57.33-.87L12.67 3.6c-.38-.34-.96-.34-1.34 0l-8.36 7.53c-.34.3-.13.87.33.87H5v7c0 .55.45 1 1 1h3c.55 0 1-.45 1-1z",fill:"currentColor"})})}const m={breadcrumbHomeIcon:"breadcrumbHomeIcon_YNFT"};function h(){const e=(0,c.Ay)("/");return(0,d.jsx)("li",{className:"breadcrumbs__item",children:(0,d.jsx)(o.A,{"aria-label":(0,r.T)({id:"theme.docs.breadcrumbs.home",message:"Home page",description:"The ARIA label for the home page in the breadcrumbs"}),className:"breadcrumbs__link",href:e,children:(0,d.jsx)(u,{className:m.breadcrumbHomeIcon})})})}var b=t(5260),v=t(4586);function x(e){const n=function({breadcrumbs:e}){const{siteConfig:n}=(0,v.A)();return{"@context":"https://schema.org","@type":"BreadcrumbList",itemListElement:e.filter(e=>e.href).map((e,t)=>({"@type":"ListItem",position:t+1,name:e.label,item:`${n.url}${e.href}`}))}}({breadcrumbs:e.breadcrumbs});return(0,d.jsx)(b.A,{children:(0,d.jsx)("script",{type:"application/ld+json",children:JSON.stringify(n)})})}const g={breadcrumbsContainer:"breadcrumbsContainer_Z_bl"};function f({children:e,href:n,isLast:t}){const s="breadcrumbs__link";return t?(0,d.jsx)("span",{className:s,children:e}):n?(0,d.jsx)(o.A,{className:s,href:n,children:(0,d.jsx)("span",{children:e})}):(0,d.jsx)("span",{className:s,children:e})}function p({children:e,active:n}){return(0,d.jsx)("li",{className:(0,s.A)("breadcrumbs__item",{"breadcrumbs__item--active":n}),children:e})}function j(){const e=(0,i.OF)(),n=(0,l.Dt)();return e?(0,d.jsxs)(d.Fragment,{children:[(0,d.jsx)(x,{breadcrumbs:e}),(0,d.jsx)("nav",{className:(0,s.A)(a.G.docs.docBreadcrumbs,g.breadcrumbsContainer),"aria-label":(0,r.T)({id:"theme.docs.breadcrumbs.navAriaLabel",message:"Breadcrumbs",description:"The ARIA label for the breadcrumbs"}),children:(0,d.jsxs)("ul",{className:"breadcrumbs",children:[n&&(0,d.jsx)(h,{}),e.map((n,t)=>{const 
s=t===e.length-1,a="category"===n.type&&n.linkUnlisted?void 0:n.href;return(0,d.jsx)(p,{active:s,children:(0,d.jsx)(f,{href:a,isLast:s,children:n.label})},t)})]})})]}):null}},833:(e,n,t)=>{t.r(n),t.d(n,{default:()=>F});var s=t(6540),a=t(5500),i=t(9532),l=t(4848);const o=s.createContext(null);function r({children:e,content:n}){const t=function(e){return(0,s.useMemo)(()=>({metadata:e.metadata,frontMatter:e.frontMatter,assets:e.assets,contentTitle:e.contentTitle,toc:e.toc}),[e])}(n);return(0,l.jsx)(o.Provider,{value:t,children:e})}function c(){const e=(0,s.useContext)(o);if(null===e)throw new i.dV("DocProvider");return e}function d(){const{metadata:e,frontMatter:n,assets:t}=c();return(0,l.jsx)(a.be,{title:e.title,description:e.description,keywords:n.keywords,image:t.image??n.image})}var u=t(4164),m=t(4581),h=t(7719);function b(){const{metadata:e}=c();return(0,l.jsx)(h.A,{className:"docusaurus-mt-lg",previous:e.previous,next:e.next})}var v=t(1878),x=t(4267),g=t(7559),f=t(2053),p=t(4336);function j(){const{metadata:e}=c(),{editUrl:n,lastUpdatedAt:t,lastUpdatedBy:s,tags:a}=e,i=a.length>0,o=!!(n||t||s);return i||o?(0,l.jsxs)("footer",{className:(0,u.A)(g.G.docs.docFooter,"docusaurus-mt-lg"),children:[i&&(0,l.jsx)("div",{className:(0,u.A)("row margin-top--sm",g.G.docs.docFooterTagsRow),children:(0,l.jsx)("div",{className:"col",children:(0,l.jsx)(f.A,{tags:a})})}),o&&(0,l.jsx)(p.A,{className:(0,u.A)("margin-top--sm",g.G.docs.docFooterEditMetaRow),editUrl:n,lastUpdatedAt:t,lastUpdatedBy:s})]}):null}var A=t(1422),N=t(5195),C=t(1312);const L={tocCollapsibleButton:"tocCollapsibleButton_TO0P",tocCollapsibleButtonExpanded:"tocCollapsibleButtonExpanded_MG3E"};function _({collapsed:e,...n}){return(0,l.jsx)("button",{type:"button",...n,className:(0,u.A)("clean-btn",L.tocCollapsibleButton,!e&&L.tocCollapsibleButtonExpanded,n.className),children:(0,l.jsx)(C.A,{id:"theme.TOCCollapsible.toggleButtonLabel",description:"The label used by the button on the collapsible TOC 
component",children:"On this page"})})}const T={tocCollapsible:"tocCollapsible_ETCw",tocCollapsibleContent:"tocCollapsibleContent_vkbj",tocCollapsibleExpanded:"tocCollapsibleExpanded_sAul"};function k({toc:e,className:n,minHeadingLevel:t,maxHeadingLevel:s}){const{collapsed:a,toggleCollapsed:i}=(0,A.u)({initialState:!0});return(0,l.jsxs)("div",{className:(0,u.A)(T.tocCollapsible,!a&&T.tocCollapsibleExpanded,n),children:[(0,l.jsx)(_,{collapsed:a,onClick:i}),(0,l.jsx)(A.N,{lazy:!0,className:T.tocCollapsibleContent,collapsed:a,children:(0,l.jsx)(N.A,{toc:e,minHeadingLevel:t,maxHeadingLevel:s})})]})}const H={tocMobile:"tocMobile_ITEo"};function y(){const{toc:e,frontMatter:n}=c();return(0,l.jsx)(k,{toc:e,minHeadingLevel:n.toc_min_heading_level,maxHeadingLevel:n.toc_max_heading_level,className:(0,u.A)(g.G.docs.docTocMobile,H.tocMobile)})}var M=t(7763);function B(){const{toc:e,frontMatter:n}=c();return(0,l.jsx)(M.A,{toc:e,minHeadingLevel:n.toc_min_heading_level,maxHeadingLevel:n.toc_max_heading_level,className:g.G.docs.docTocDesktop})}var I=t(1107),w=t(3253);function E({children:e}){const n=function(){const{metadata:e,frontMatter:n,contentTitle:t}=c();return n.hide_title||void 0!==t?null:e.title}();return(0,l.jsxs)("div",{className:(0,u.A)(g.G.docs.docMarkdown,"markdown"),children:[n&&(0,l.jsx)("header",{children:(0,l.jsx)(I.A,{as:"h1",children:n})}),(0,l.jsx)(w.A,{children:e})]})}var V=t(594),O=t(1689);const R={docItemContainer:"docItemContainer_Djhp",docItemCol:"docItemCol_VOVn"};function G({children:e}){const n=function(){const{frontMatter:e,toc:n}=c(),t=(0,m.l)(),s=e.hide_table_of_contents,a=!s&&n.length>0;return{hidden:s,mobile:a?(0,l.jsx)(y,{}):void 0,desktop:!a||"desktop"!==t&&"ssr"!==t?void 
0:(0,l.jsx)(B,{})}}(),{metadata:t}=c();return(0,l.jsxs)("div",{className:"row",children:[(0,l.jsxs)("div",{className:(0,u.A)("col",!n.hidden&&R.docItemCol),children:[(0,l.jsx)(O.A,{metadata:t}),(0,l.jsx)(v.A,{}),(0,l.jsxs)("div",{className:R.docItemContainer,children:[(0,l.jsxs)("article",{children:[(0,l.jsx)(V.A,{}),(0,l.jsx)(x.A,{}),n.mobile,(0,l.jsx)(E,{children:e}),(0,l.jsx)(j,{})]}),(0,l.jsx)(b,{})]})]}),n.desktop&&(0,l.jsx)("div",{className:"col col--3",children:n.desktop})]})}function F(e){const n=`docs-doc-id-${e.content.metadata.id}`,t=e.content;return(0,l.jsx)(r,{content:e.content,children:(0,l.jsxs)(a.e3,{className:n,children:[(0,l.jsx)(d,{}),(0,l.jsx)(G,{children:(0,l.jsx)(t,{})})]})})}},1689:(e,n,t)=>{t.d(n,{A:()=>d});t(6540);var s=t(4164),a=t(4084),i=t(7559),l=t(7293),o=t(4848);function r({className:e}){return(0,o.jsx)(l.A,{type:"caution",title:(0,o.jsx)(a.Yh,{}),className:(0,s.A)(e,i.G.common.draftBanner),children:(0,o.jsx)(a.TT,{})})}var c=t(2234);function d({metadata:e}){const{unlisted:n,frontMatter:t}=e;return(0,o.jsxs)(o.Fragment,{children:[(n||t.unlisted)&&(0,o.jsx)(c.A,{}),t.draft&&(0,o.jsx)(r,{})]})}},1878:(e,n,t)=>{t.d(n,{A:()=>x});t(6540);var s=t(4164),a=t(4586),i=t(8774),l=t(1312),o=t(4070),r=t(7559),c=t(3886),d=t(3025),u=t(4848);const m={unreleased:function({siteTitle:e,versionMetadata:n}){return(0,u.jsx)(l.A,{id:"theme.docs.versions.unreleasedVersionLabel",description:"The label used to tell the user that he's browsing an unreleased doc version",values:{siteTitle:e,versionLabel:(0,u.jsx)("b",{children:n.label})},children:"This is unreleased documentation for {siteTitle} {versionLabel} version."})},unmaintained:function({siteTitle:e,versionMetadata:n}){return(0,u.jsx)(l.A,{id:"theme.docs.versions.unmaintainedVersionLabel",description:"The label used to tell the user that he's browsing an unmaintained doc version",values:{siteTitle:e,versionLabel:(0,u.jsx)("b",{children:n.label})},children:"This is documentation for {siteTitle} 
{versionLabel}, which is no longer actively maintained."})}};function h(e){const n=m[e.versionMetadata.banner];return(0,u.jsx)(n,{...e})}function b({versionLabel:e,to:n,onClick:t}){return(0,u.jsx)(l.A,{id:"theme.docs.versions.latestVersionSuggestionLabel",description:"The label used to tell the user to check the latest version",values:{versionLabel:e,latestVersionLink:(0,u.jsx)("b",{children:(0,u.jsx)(i.A,{to:n,onClick:t,children:(0,u.jsx)(l.A,{id:"theme.docs.versions.latestVersionLinkLabel",description:"The label used for the latest version suggestion link label",children:"latest version"})})})},children:"For up-to-date documentation, see the {latestVersionLink} ({versionLabel})."})}function v({className:e,versionMetadata:n}){const{siteConfig:{title:t}}=(0,a.A)(),{pluginId:i}=(0,o.vT)({failfast:!0}),{savePreferredVersionName:l}=(0,c.g1)(i),{latestDocSuggestion:d,latestVersionSuggestion:m}=(0,o.HW)(i),v=d??(x=m).docs.find(e=>e.id===x.mainDocId);var x;return(0,u.jsxs)("div",{className:(0,s.A)(e,r.G.docs.docVersionBanner,"alert alert--warning margin-bottom--md"),role:"alert",children:[(0,u.jsx)("div",{children:(0,u.jsx)(h,{siteTitle:t,versionMetadata:n})}),(0,u.jsx)("div",{className:"margin-top--md",children:(0,u.jsx)(b,{versionLabel:m.label,to:v.path,onClick:()=>l(m.name)})})]})}function x({className:e}){const n=(0,d.r)();return n.banner?(0,u.jsx)(v,{className:e,versionMetadata:n}):null}},2053:(e,n,t)=>{t.d(n,{A:()=>r});t(6540);var s=t(4164),a=t(1312),i=t(6133);const l={tags:"tags_jXut",tag:"tag_QGVx"};var o=t(4848);function r({tags:e}){return(0,o.jsxs)(o.Fragment,{children:[(0,o.jsx)("b",{children:(0,o.jsx)(a.A,{id:"theme.tags.tagsListLabel",description:"The label alongside a tag list",children:"Tags:"})}),(0,o.jsx)("ul",{className:(0,s.A)(l.tags,"padding--none","margin-left--sm"),children:e.map(e=>(0,o.jsx)("li",{className:l.tag,children:(0,o.jsx)(i.A,{...e})},e.permalink))})]})}},2234:(e,n,t)=>{t.d(n,{A:()=>c});t(6540);var 
s=t(4164),a=t(7559),i=t(4084),l=t(7293),o=t(4848);function r({className:e}){return(0,o.jsx)(l.A,{type:"caution",title:(0,o.jsx)(i.Rc,{}),className:(0,s.A)(e,a.G.common.unlistedBanner),children:(0,o.jsx)(i.Uh,{})})}function c(e){return(0,o.jsxs)(o.Fragment,{children:[(0,o.jsx)(i.AE,{}),(0,o.jsx)(r,{...e})]})}},4084:(e,n,t)=>{t.d(n,{AE:()=>r,Rc:()=>l,TT:()=>d,Uh:()=>o,Yh:()=>c});t(6540);var s=t(1312),a=t(5260),i=t(4848);function l(){return(0,i.jsx)(s.A,{id:"theme.contentVisibility.unlistedBanner.title",description:"The unlisted content banner title",children:"Unlisted page"})}function o(){return(0,i.jsx)(s.A,{id:"theme.contentVisibility.unlistedBanner.message",description:"The unlisted content banner message",children:"This page is unlisted. Search engines will not index it, and only users having a direct link can access it."})}function r(){return(0,i.jsx)(a.A,{children:(0,i.jsx)("meta",{name:"robots",content:"noindex, nofollow"})})}function c(){return(0,i.jsx)(s.A,{id:"theme.contentVisibility.draftBanner.title",description:"The draft content banner title",children:"Draft page"})}function d(){return(0,i.jsx)(s.A,{id:"theme.contentVisibility.draftBanner.message",description:"The draft content banner message",children:"This page is a draft. 
It will only be visible in dev and be excluded from the production build."})}},4267:(e,n,t)=>{t.d(n,{A:()=>r});t(6540);var s=t(4164),a=t(1312),i=t(7559),l=t(3025),o=t(4848);function r({className:e}){const n=(0,l.r)();return n.badge?(0,o.jsx)("span",{className:(0,s.A)(e,i.G.docs.docVersionBadge,"badge badge--secondary"),children:(0,o.jsx)(a.A,{id:"theme.docs.versionBadge.label",values:{versionLabel:n.label},children:"Version: {versionLabel}"})}):null}},5195:(e,n,t)=>{t.d(n,{A:()=>v});var s=t(6540),a=t(6342);function i(e){const n=e.map(e=>({...e,parentIndex:-1,children:[]})),t=Array(7).fill(-1);n.forEach((e,n)=>{const s=t.slice(2,e.level);e.parentIndex=Math.max(...s),t[e.level]=n});const s=[];return n.forEach(e=>{const{parentIndex:t,...a}=e;t>=0?n[t].children.push(a):s.push(a)}),s}function l({toc:e,minHeadingLevel:n,maxHeadingLevel:t}){return e.flatMap(e=>{const s=l({toc:e.children,minHeadingLevel:n,maxHeadingLevel:t});return function(e){return e.level>=n&&e.level<=t}(e)?[{...e,children:s}]:s})}function o(e){const n=e.getBoundingClientRect();return n.top===n.bottom?o(e.parentNode):n}function r(e,{anchorTopOffset:n}){const t=e.find(e=>o(e).top>=n);if(t){return function(e){return e.top>0&&e.bottom{e.current=n?0:document.querySelector(".navbar").clientHeight},[n]),e}function d(e){const n=(0,s.useRef)(void 0),t=c();(0,s.useEffect)(()=>{if(!e)return()=>{};const{linkClassName:s,linkActiveClassName:a,minHeadingLevel:i,maxHeadingLevel:l}=e;function o(){const e=function(e){return Array.from(document.getElementsByClassName(e))}(s),o=function({minHeadingLevel:e,maxHeadingLevel:n}){const t=[];for(let s=e;s<=n;s+=1)t.push(`h${s}.anchor`);return Array.from(document.querySelectorAll(t.join()))}({minHeadingLevel:i,maxHeadingLevel:l}),c=r(o,{anchorTopOffset:t.current}),d=e.find(e=>c&&c.id===function(e){return 
decodeURIComponent(e.href.substring(e.href.indexOf("#")+1))}(e));e.forEach(e=>{!function(e,t){t?(n.current&&n.current!==e&&n.current.classList.remove(a),e.classList.add(a),n.current=e):e.classList.remove(a)}(e,e===d)})}return document.addEventListener("scroll",o),document.addEventListener("resize",o),o(),()=>{document.removeEventListener("scroll",o),document.removeEventListener("resize",o)}},[e,t])}var u=t(8774),m=t(4848);function h({toc:e,className:n,linkClassName:t,isChild:s}){return e.length?(0,m.jsx)("ul",{className:s?void 0:n,children:e.map(e=>(0,m.jsxs)("li",{children:[(0,m.jsx)(u.A,{to:`#${e.id}`,className:t??void 0,dangerouslySetInnerHTML:{__html:e.value}}),(0,m.jsx)(h,{isChild:!0,toc:e.children,className:n,linkClassName:t})]},e.id))}):null}const b=s.memo(h);function v({toc:e,className:n="table-of-contents table-of-contents__left-border",linkClassName:t="table-of-contents__link",linkActiveClassName:o,minHeadingLevel:r,maxHeadingLevel:c,...u}){const h=(0,a.p)(),v=r??h.tableOfContents.minHeadingLevel,x=c??h.tableOfContents.maxHeadingLevel,g=function({toc:e,minHeadingLevel:n,maxHeadingLevel:t}){return(0,s.useMemo)(()=>l({toc:i(e),minHeadingLevel:n,maxHeadingLevel:t}),[e,n,t])}({toc:e,minHeadingLevel:v,maxHeadingLevel:x});return d((0,s.useMemo)(()=>{if(t&&o)return{linkClassName:t,linkActiveClassName:o,minHeadingLevel:v,maxHeadingLevel:x}},[t,o,v,x])),(0,m.jsx)(b,{toc:g,className:n,linkClassName:t,...u})}},6133:(e,n,t)=>{t.d(n,{A:()=>o});t(6540);var s=t(4164),a=t(8774);const i={tag:"tag_zVej",tagRegular:"tagRegular_sFm0",tagWithCount:"tagWithCount_h2kH"};var l=t(4848);function o({permalink:e,label:n,count:t,description:o}){return(0,l.jsxs)(a.A,{rel:"tag",href:e,title:o,className:(0,s.A)(i.tag,t?i.tagWithCount:i.tagRegular),children:[n,t&&(0,l.jsx)("span",{children:t})]})}},7719:(e,n,t)=>{t.d(n,{A:()=>o});t(6540);var s=t(4164),a=t(1312),i=t(9022),l=t(4848);function 
o(e){const{className:n,previous:t,next:o}=e;return(0,l.jsxs)("nav",{className:(0,s.A)(n,"pagination-nav"),"aria-label":(0,a.T)({id:"theme.docs.paginator.navAriaLabel",message:"Docs pages",description:"The ARIA label for the docs pagination"}),children:[t&&(0,l.jsx)(i.A,{...t,subLabel:(0,l.jsx)(a.A,{id:"theme.docs.paginator.previous",description:"The label used to navigate to the previous doc",children:"Previous"})}),o&&(0,l.jsx)(i.A,{...o,subLabel:(0,l.jsx)(a.A,{id:"theme.docs.paginator.next",description:"The label used to navigate to the next doc",children:"Next"}),isNext:!0})]})}},7763:(e,n,t)=>{t.d(n,{A:()=>c});t(6540);var s=t(4164),a=t(5195);const i={tableOfContents:"tableOfContents_bqdL",docItemContainer:"docItemContainer_F8PC"};var l=t(4848);const o="table-of-contents__link toc-highlight",r="table-of-contents__link--active";function c({className:e,...n}){return(0,l.jsx)("div",{className:(0,s.A)(i.tableOfContents,"thin-scrollbar",e),children:(0,l.jsx)(a.A,{...n,linkClassName:o,linkActiveClassName:r})})}},9022:(e,n,t)=>{t.d(n,{A:()=>l});t(6540);var s=t(4164),a=t(8774),i=t(4848);function l(e){const{permalink:n,title:t,subLabel:l,isNext:o}=e;return(0,i.jsxs)(a.A,{className:(0,s.A)("pagination-nav__link",o?"pagination-nav__link--next":"pagination-nav__link--prev"),to:n,children:[l&&(0,i.jsx)("div",{className:"pagination-nav__sublabel",children:l}),(0,i.jsx)("div",{className:"pagination-nav__label",children:t})]})}}}]); \ No newline at end of file +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[8401],{594:(e,n,t)=>{t.d(n,{A:()=>j});t(6540);var s=t(4164),a=t(7559),i=t(6972),l=t(9169),o=t(8774),r=t(1312),c=t(6025),d=t(4848);function u(e){return(0,d.jsx)("svg",{viewBox:"0 0 24 24",...e,children:(0,d.jsx)("path",{d:"M10 19v-5h4v5c0 .55.45 1 1 1h3c.55 0 1-.45 1-1v-7h1.7c.46 0 .68-.57.33-.87L12.67 3.6c-.38-.34-.96-.34-1.34 0l-8.36 7.53c-.34.3-.13.87.33.87H5v7c0 .55.45 1 1 1h3c.55 0 1-.45 1-1z",fill:"currentColor"})})}const 
m={breadcrumbHomeIcon:"breadcrumbHomeIcon_YNFT"};function h(){const e=(0,c.Ay)("/");return(0,d.jsx)("li",{className:"breadcrumbs__item",children:(0,d.jsx)(o.A,{"aria-label":(0,r.T)({id:"theme.docs.breadcrumbs.home",message:"Home page",description:"The ARIA label for the home page in the breadcrumbs"}),className:"breadcrumbs__link",href:e,children:(0,d.jsx)(u,{className:m.breadcrumbHomeIcon})})})}var b=t(5260),v=t(4586);function x(e){const n=function({breadcrumbs:e}){const{siteConfig:n}=(0,v.A)();return{"@context":"https://schema.org","@type":"BreadcrumbList",itemListElement:e.filter(e=>e.href).map((e,t)=>({"@type":"ListItem",position:t+1,name:e.label,item:`${n.url}${e.href}`}))}}({breadcrumbs:e.breadcrumbs});return(0,d.jsx)(b.A,{children:(0,d.jsx)("script",{type:"application/ld+json",children:JSON.stringify(n)})})}const g={breadcrumbsContainer:"breadcrumbsContainer_Z_bl"};function f({children:e,href:n,isLast:t}){const s="breadcrumbs__link";return t?(0,d.jsx)("span",{className:s,children:e}):n?(0,d.jsx)(o.A,{className:s,href:n,children:(0,d.jsx)("span",{children:e})}):(0,d.jsx)("span",{className:s,children:e})}function p({children:e,active:n}){return(0,d.jsx)("li",{className:(0,s.A)("breadcrumbs__item",{"breadcrumbs__item--active":n}),children:e})}function j(){const e=(0,i.OF)(),n=(0,l.Dt)();return e?(0,d.jsxs)(d.Fragment,{children:[(0,d.jsx)(x,{breadcrumbs:e}),(0,d.jsx)("nav",{className:(0,s.A)(a.G.docs.docBreadcrumbs,g.breadcrumbsContainer),"aria-label":(0,r.T)({id:"theme.docs.breadcrumbs.navAriaLabel",message:"Breadcrumbs",description:"The ARIA label for the breadcrumbs"}),children:(0,d.jsxs)("ul",{className:"breadcrumbs",children:[n&&(0,d.jsx)(h,{}),e.map((n,t)=>{const s=t===e.length-1,a="category"===n.type&&n.linkUnlisted?void 0:n.href;return(0,d.jsx)(p,{active:s,children:(0,d.jsx)(f,{href:a,isLast:s,children:n.label})},t)})]})})]}):null}},833:(e,n,t)=>{t.r(n),t.d(n,{default:()=>F});var s=t(6540),a=t(5500),i=t(9532),l=t(4848);const 
o=s.createContext(null);function r({children:e,content:n}){const t=function(e){return(0,s.useMemo)(()=>({metadata:e.metadata,frontMatter:e.frontMatter,assets:e.assets,contentTitle:e.contentTitle,toc:e.toc}),[e])}(n);return(0,l.jsx)(o.Provider,{value:t,children:e})}function c(){const e=(0,s.useContext)(o);if(null===e)throw new i.dV("DocProvider");return e}function d(){const{metadata:e,frontMatter:n,assets:t}=c();return(0,l.jsx)(a.be,{title:e.title,description:e.description,keywords:n.keywords,image:t.image??n.image})}var u=t(4164),m=t(4581),h=t(7719);function b(){const{metadata:e}=c();return(0,l.jsx)(h.A,{className:"docusaurus-mt-lg",previous:e.previous,next:e.next})}var v=t(1878),x=t(4267),g=t(7559),f=t(4434),p=t(4336);function j(){const{metadata:e}=c(),{editUrl:n,lastUpdatedAt:t,lastUpdatedBy:s,tags:a}=e,i=a.length>0,o=!!(n||t||s);return i||o?(0,l.jsxs)("footer",{className:(0,u.A)(g.G.docs.docFooter,"docusaurus-mt-lg"),children:[i&&(0,l.jsx)("div",{className:(0,u.A)("row margin-top--sm",g.G.docs.docFooterTagsRow),children:(0,l.jsx)("div",{className:"col",children:(0,l.jsx)(f.A,{tags:a})})}),o&&(0,l.jsx)(p.A,{className:(0,u.A)("margin-top--sm",g.G.docs.docFooterEditMetaRow),editUrl:n,lastUpdatedAt:t,lastUpdatedBy:s})]}):null}var A=t(1422),N=t(5195),C=t(1312);const L={tocCollapsibleButton:"tocCollapsibleButton_TO0P",tocCollapsibleButtonExpanded:"tocCollapsibleButtonExpanded_MG3E"};function _({collapsed:e,...n}){return(0,l.jsx)("button",{type:"button",...n,className:(0,u.A)("clean-btn",L.tocCollapsibleButton,!e&&L.tocCollapsibleButtonExpanded,n.className),children:(0,l.jsx)(C.A,{id:"theme.TOCCollapsible.toggleButtonLabel",description:"The label used by the button on the collapsible TOC component",children:"On this page"})})}const T={tocCollapsible:"tocCollapsible_ETCw",tocCollapsibleContent:"tocCollapsibleContent_vkbj",tocCollapsibleExpanded:"tocCollapsibleExpanded_sAul"};function 
k({toc:e,className:n,minHeadingLevel:t,maxHeadingLevel:s}){const{collapsed:a,toggleCollapsed:i}=(0,A.u)({initialState:!0});return(0,l.jsxs)("div",{className:(0,u.A)(T.tocCollapsible,!a&&T.tocCollapsibleExpanded,n),children:[(0,l.jsx)(_,{collapsed:a,onClick:i}),(0,l.jsx)(A.N,{lazy:!0,className:T.tocCollapsibleContent,collapsed:a,children:(0,l.jsx)(N.A,{toc:e,minHeadingLevel:t,maxHeadingLevel:s})})]})}const H={tocMobile:"tocMobile_ITEo"};function y(){const{toc:e,frontMatter:n}=c();return(0,l.jsx)(k,{toc:e,minHeadingLevel:n.toc_min_heading_level,maxHeadingLevel:n.toc_max_heading_level,className:(0,u.A)(g.G.docs.docTocMobile,H.tocMobile)})}var M=t(7763);function B(){const{toc:e,frontMatter:n}=c();return(0,l.jsx)(M.A,{toc:e,minHeadingLevel:n.toc_min_heading_level,maxHeadingLevel:n.toc_max_heading_level,className:g.G.docs.docTocDesktop})}var I=t(1107),w=t(3253);function E({children:e}){const n=function(){const{metadata:e,frontMatter:n,contentTitle:t}=c();return n.hide_title||void 0!==t?null:e.title}();return(0,l.jsxs)("div",{className:(0,u.A)(g.G.docs.docMarkdown,"markdown"),children:[n&&(0,l.jsx)("header",{children:(0,l.jsx)(I.A,{as:"h1",children:n})}),(0,l.jsx)(w.A,{children:e})]})}var V=t(594),O=t(1689);const R={docItemContainer:"docItemContainer_Djhp",docItemCol:"docItemCol_VOVn"};function G({children:e}){const n=function(){const{frontMatter:e,toc:n}=c(),t=(0,m.l)(),s=e.hide_table_of_contents,a=!s&&n.length>0;return{hidden:s,mobile:a?(0,l.jsx)(y,{}):void 0,desktop:!a||"desktop"!==t&&"ssr"!==t?void 0:(0,l.jsx)(B,{})}}(),{metadata:t}=c();return(0,l.jsxs)("div",{className:"row",children:[(0,l.jsxs)("div",{className:(0,u.A)("col",!n.hidden&&R.docItemCol),children:[(0,l.jsx)(O.A,{metadata:t}),(0,l.jsx)(v.A,{}),(0,l.jsxs)("div",{className:R.docItemContainer,children:[(0,l.jsxs)("article",{children:[(0,l.jsx)(V.A,{}),(0,l.jsx)(x.A,{}),n.mobile,(0,l.jsx)(E,{children:e}),(0,l.jsx)(j,{})]}),(0,l.jsx)(b,{})]})]}),n.desktop&&(0,l.jsx)("div",{className:"col 
col--3",children:n.desktop})]})}function F(e){const n=`docs-doc-id-${e.content.metadata.id}`,t=e.content;return(0,l.jsx)(r,{content:e.content,children:(0,l.jsxs)(a.e3,{className:n,children:[(0,l.jsx)(d,{}),(0,l.jsx)(G,{children:(0,l.jsx)(t,{})})]})})}},1689:(e,n,t)=>{t.d(n,{A:()=>d});t(6540);var s=t(4164),a=t(4084),i=t(7559),l=t(7293),o=t(4848);function r({className:e}){return(0,o.jsx)(l.A,{type:"caution",title:(0,o.jsx)(a.Yh,{}),className:(0,s.A)(e,i.G.common.draftBanner),children:(0,o.jsx)(a.TT,{})})}var c=t(2234);function d({metadata:e}){const{unlisted:n,frontMatter:t}=e;return(0,o.jsxs)(o.Fragment,{children:[(n||t.unlisted)&&(0,o.jsx)(c.A,{}),t.draft&&(0,o.jsx)(r,{})]})}},1878:(e,n,t)=>{t.d(n,{A:()=>x});t(6540);var s=t(4164),a=t(4586),i=t(8774),l=t(1312),o=t(4070),r=t(7559),c=t(3886),d=t(3025),u=t(4848);const m={unreleased:function({siteTitle:e,versionMetadata:n}){return(0,u.jsx)(l.A,{id:"theme.docs.versions.unreleasedVersionLabel",description:"The label used to tell the user that he's browsing an unreleased doc version",values:{siteTitle:e,versionLabel:(0,u.jsx)("b",{children:n.label})},children:"This is unreleased documentation for {siteTitle} {versionLabel} version."})},unmaintained:function({siteTitle:e,versionMetadata:n}){return(0,u.jsx)(l.A,{id:"theme.docs.versions.unmaintainedVersionLabel",description:"The label used to tell the user that he's browsing an unmaintained doc version",values:{siteTitle:e,versionLabel:(0,u.jsx)("b",{children:n.label})},children:"This is documentation for {siteTitle} {versionLabel}, which is no longer actively maintained."})}};function h(e){const n=m[e.versionMetadata.banner];return(0,u.jsx)(n,{...e})}function b({versionLabel:e,to:n,onClick:t}){return(0,u.jsx)(l.A,{id:"theme.docs.versions.latestVersionSuggestionLabel",description:"The label used to tell the user to check the latest 
version",values:{versionLabel:e,latestVersionLink:(0,u.jsx)("b",{children:(0,u.jsx)(i.A,{to:n,onClick:t,children:(0,u.jsx)(l.A,{id:"theme.docs.versions.latestVersionLinkLabel",description:"The label used for the latest version suggestion link label",children:"latest version"})})})},children:"For up-to-date documentation, see the {latestVersionLink} ({versionLabel})."})}function v({className:e,versionMetadata:n}){const{siteConfig:{title:t}}=(0,a.A)(),{pluginId:i}=(0,o.vT)({failfast:!0}),{savePreferredVersionName:l}=(0,c.g1)(i),{latestDocSuggestion:d,latestVersionSuggestion:m}=(0,o.HW)(i),v=d??(x=m).docs.find(e=>e.id===x.mainDocId);var x;return(0,u.jsxs)("div",{className:(0,s.A)(e,r.G.docs.docVersionBanner,"alert alert--warning margin-bottom--md"),role:"alert",children:[(0,u.jsx)("div",{children:(0,u.jsx)(h,{siteTitle:t,versionMetadata:n})}),(0,u.jsx)("div",{className:"margin-top--md",children:(0,u.jsx)(b,{versionLabel:m.label,to:v.path,onClick:()=>l(m.name)})})]})}function x({className:e}){const n=(0,d.r)();return n.banner?(0,u.jsx)(v,{className:e,versionMetadata:n}):null}},2234:(e,n,t)=>{t.d(n,{A:()=>c});t(6540);var s=t(4164),a=t(7559),i=t(4084),l=t(7293),o=t(4848);function r({className:e}){return(0,o.jsx)(l.A,{type:"caution",title:(0,o.jsx)(i.Rc,{}),className:(0,s.A)(e,a.G.common.unlistedBanner),children:(0,o.jsx)(i.Uh,{})})}function c(e){return(0,o.jsxs)(o.Fragment,{children:[(0,o.jsx)(i.AE,{}),(0,o.jsx)(r,{...e})]})}},4084:(e,n,t)=>{t.d(n,{AE:()=>r,Rc:()=>l,TT:()=>d,Uh:()=>o,Yh:()=>c});t(6540);var s=t(1312),a=t(5260),i=t(4848);function l(){return(0,i.jsx)(s.A,{id:"theme.contentVisibility.unlistedBanner.title",description:"The unlisted content banner title",children:"Unlisted page"})}function o(){return(0,i.jsx)(s.A,{id:"theme.contentVisibility.unlistedBanner.message",description:"The unlisted content banner message",children:"This page is unlisted. 
Search engines will not index it, and only users having a direct link can access it."})}function r(){return(0,i.jsx)(a.A,{children:(0,i.jsx)("meta",{name:"robots",content:"noindex, nofollow"})})}function c(){return(0,i.jsx)(s.A,{id:"theme.contentVisibility.draftBanner.title",description:"The draft content banner title",children:"Draft page"})}function d(){return(0,i.jsx)(s.A,{id:"theme.contentVisibility.draftBanner.message",description:"The draft content banner message",children:"This page is a draft. It will only be visible in dev and be excluded from the production build."})}},4267:(e,n,t)=>{t.d(n,{A:()=>r});t(6540);var s=t(4164),a=t(1312),i=t(7559),l=t(3025),o=t(4848);function r({className:e}){const n=(0,l.r)();return n.badge?(0,o.jsx)("span",{className:(0,s.A)(e,i.G.docs.docVersionBadge,"badge badge--secondary"),children:(0,o.jsx)(a.A,{id:"theme.docs.versionBadge.label",values:{versionLabel:n.label},children:"Version: {versionLabel}"})}):null}},4434:(e,n,t)=>{t.d(n,{A:()=>r});t(6540);var s=t(4164),a=t(1312),i=t(6133);const l={tags:"tags_jXut",tag:"tag_QGVx"};var o=t(4848);function r({tags:e}){return(0,o.jsxs)(o.Fragment,{children:[(0,o.jsx)("b",{children:(0,o.jsx)(a.A,{id:"theme.tags.tagsListLabel",description:"The label alongside a tag list",children:"Tags:"})}),(0,o.jsx)("ul",{className:(0,s.A)(l.tags,"padding--none","margin-left--sm"),children:e.map(e=>(0,o.jsx)("li",{className:l.tag,children:(0,o.jsx)(i.A,{...e})},e.permalink))})]})}},5195:(e,n,t)=>{t.d(n,{A:()=>v});var s=t(6540),a=t(6342);function i(e){const n=e.map(e=>({...e,parentIndex:-1,children:[]})),t=Array(7).fill(-1);n.forEach((e,n)=>{const s=t.slice(2,e.level);e.parentIndex=Math.max(...s),t[e.level]=n});const s=[];return n.forEach(e=>{const{parentIndex:t,...a}=e;t>=0?n[t].children.push(a):s.push(a)}),s}function l({toc:e,minHeadingLevel:n,maxHeadingLevel:t}){return e.flatMap(e=>{const s=l({toc:e.children,minHeadingLevel:n,maxHeadingLevel:t});return function(e){return 
e.level>=n&&e.level<=t}(e)?[{...e,children:s}]:s})}function o(e){const n=e.getBoundingClientRect();return n.top===n.bottom?o(e.parentNode):n}function r(e,{anchorTopOffset:n}){const t=e.find(e=>o(e).top>=n);if(t){return function(e){return e.top>0&&e.bottom{e.current=n?0:document.querySelector(".navbar").clientHeight},[n]),e}function d(e){const n=(0,s.useRef)(void 0),t=c();(0,s.useEffect)(()=>{if(!e)return()=>{};const{linkClassName:s,linkActiveClassName:a,minHeadingLevel:i,maxHeadingLevel:l}=e;function o(){const e=function(e){return Array.from(document.getElementsByClassName(e))}(s),o=function({minHeadingLevel:e,maxHeadingLevel:n}){const t=[];for(let s=e;s<=n;s+=1)t.push(`h${s}.anchor`);return Array.from(document.querySelectorAll(t.join()))}({minHeadingLevel:i,maxHeadingLevel:l}),c=r(o,{anchorTopOffset:t.current}),d=e.find(e=>c&&c.id===function(e){return decodeURIComponent(e.href.substring(e.href.indexOf("#")+1))}(e));e.forEach(e=>{!function(e,t){t?(n.current&&n.current!==e&&n.current.classList.remove(a),e.classList.add(a),n.current=e):e.classList.remove(a)}(e,e===d)})}return document.addEventListener("scroll",o),document.addEventListener("resize",o),o(),()=>{document.removeEventListener("scroll",o),document.removeEventListener("resize",o)}},[e,t])}var u=t(8774),m=t(4848);function h({toc:e,className:n,linkClassName:t,isChild:s}){return e.length?(0,m.jsx)("ul",{className:s?void 0:n,children:e.map(e=>(0,m.jsxs)("li",{children:[(0,m.jsx)(u.A,{to:`#${e.id}`,className:t??void 0,dangerouslySetInnerHTML:{__html:e.value}}),(0,m.jsx)(h,{isChild:!0,toc:e.children,className:n,linkClassName:t})]},e.id))}):null}const b=s.memo(h);function v({toc:e,className:n="table-of-contents table-of-contents__left-border",linkClassName:t="table-of-contents__link",linkActiveClassName:o,minHeadingLevel:r,maxHeadingLevel:c,...u}){const 
h=(0,a.p)(),v=r??h.tableOfContents.minHeadingLevel,x=c??h.tableOfContents.maxHeadingLevel,g=function({toc:e,minHeadingLevel:n,maxHeadingLevel:t}){return(0,s.useMemo)(()=>l({toc:i(e),minHeadingLevel:n,maxHeadingLevel:t}),[e,n,t])}({toc:e,minHeadingLevel:v,maxHeadingLevel:x});return d((0,s.useMemo)(()=>{if(t&&o)return{linkClassName:t,linkActiveClassName:o,minHeadingLevel:v,maxHeadingLevel:x}},[t,o,v,x])),(0,m.jsx)(b,{toc:g,className:n,linkClassName:t,...u})}},6133:(e,n,t)=>{t.d(n,{A:()=>o});t(6540);var s=t(4164),a=t(8774);const i={tag:"tag_zVej",tagRegular:"tagRegular_sFm0",tagWithCount:"tagWithCount_h2kH"};var l=t(4848);function o({permalink:e,label:n,count:t,description:o}){return(0,l.jsxs)(a.A,{rel:"tag",href:e,title:o,className:(0,s.A)(i.tag,t?i.tagWithCount:i.tagRegular),children:[n,t&&(0,l.jsx)("span",{children:t})]})}},7719:(e,n,t)=>{t.d(n,{A:()=>o});t(6540);var s=t(4164),a=t(1312),i=t(9022),l=t(4848);function o(e){const{className:n,previous:t,next:o}=e;return(0,l.jsxs)("nav",{className:(0,s.A)(n,"pagination-nav"),"aria-label":(0,a.T)({id:"theme.docs.paginator.navAriaLabel",message:"Docs pages",description:"The ARIA label for the docs pagination"}),children:[t&&(0,l.jsx)(i.A,{...t,subLabel:(0,l.jsx)(a.A,{id:"theme.docs.paginator.previous",description:"The label used to navigate to the previous doc",children:"Previous"})}),o&&(0,l.jsx)(i.A,{...o,subLabel:(0,l.jsx)(a.A,{id:"theme.docs.paginator.next",description:"The label used to navigate to the next doc",children:"Next"}),isNext:!0})]})}},7763:(e,n,t)=>{t.d(n,{A:()=>c});t(6540);var s=t(4164),a=t(5195);const i={tableOfContents:"tableOfContents_bqdL",docItemContainer:"docItemContainer_F8PC"};var l=t(4848);const o="table-of-contents__link toc-highlight",r="table-of-contents__link--active";function c({className:e,...n}){return(0,l.jsx)("div",{className:(0,s.A)(i.tableOfContents,"thin-scrollbar",e),children:(0,l.jsx)(a.A,{...n,linkClassName:o,linkActiveClassName:r})})}},9022:(e,n,t)=>{t.d(n,{A:()=>l});t(6540);var 
s=t(4164),a=t(8774),i=t(4848);function l(e){const{permalink:n,title:t,subLabel:l,isNext:o}=e;return(0,i.jsxs)(a.A,{className:(0,s.A)("pagination-nav__link",o?"pagination-nav__link--next":"pagination-nav__link--prev"),to:n,children:[l&&(0,i.jsx)("div",{className:"pagination-nav__sublabel",children:l}),(0,i.jsx)("div",{className:"pagination-nav__label",children:t})]})}}}]); \ No newline at end of file diff --git a/docs/assets/js/23a1b8fc.9499aa3f.js b/docs/assets/js/23a1b8fc.9499aa3f.js new file mode 100644 index 00000000..5612c5a9 --- /dev/null +++ b/docs/assets/js/23a1b8fc.9499aa3f.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[8439],{211:(e,n,r)=>{r.r(n),r.d(n,{assets:()=>d,contentTitle:()=>l,default:()=>h,frontMatter:()=>o,metadata:()=>i,toc:()=>c});const i=JSON.parse('{"id":"predator/v1.0.0/architecture","title":"Architecture","description":"Predator is a scalable, high-performance model inference service built as a wrapper around the NVIDIA Triton Inference Server. It is designed to serve a variety of machine learning models (Deep Learning, Tree-based, etc.) 
with low latency in a Kubernetes (K8s) environment.","source":"@site/docs/predator/v1.0.0/architecture.md","sourceDirName":"predator/v1.0.0","slug":"/predator/v1.0.0/architecture","permalink":"/BharatMLStack/predator/v1.0.0/architecture","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/predator/v1.0.0/architecture.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"Architecture","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"v1.0.0","permalink":"/BharatMLStack/predator/v1.0.0"},"next":{"title":"Key Functionalities","permalink":"/BharatMLStack/predator/v1.0.0/functionalities"}}');var s=r(4848),t=r(8453);const o={title:"Architecture",sidebar_position:1},l="BharatMLStack - Predator",d={},c=[{value:"High-Level Design",id:"high-level-design",level:2},{value:"End-to-End Flow",id:"end-to-end-flow",level:3},{value:"Key Design Principles",id:"key-design-principles",level:3},{value:"Inference Engine: Triton Inference Server",id:"inference-engine-triton-inference-server",level:2},{value:"Core Components",id:"core-components",level:3},{value:"Backends",id:"backends",level:3},{value:"Key Features",id:"key-features",level:3},{value:"Model Repository Structure",id:"model-repository-structure",level:2},{value:"Sample config.pbtxt",id:"sample-configpbtxt",level:3},{value:"Kubernetes Deployment Architecture",id:"kubernetes-deployment-architecture",level:2},{value:"Pod Architecture",id:"pod-architecture",level:3},{value:"Init Container",id:"init-container",level:4},{value:"Triton Inference Server Container",id:"triton-inference-server-container",level:4},{value:"Triton Server Image Strategy",id:"triton-server-image-strategy",level:3},{value:"Image Distribution Optimization",id:"image-distribution-optimization",level:3},{value:"Health Probes",id:"health-probes",level:3},{value:"Resource Configuration",id:"resource-configuration",level:3},{value:"Autoscaling 
Architecture",id:"autoscaling-architecture",level:3},{value:"Contributing",id:"contributing",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function a(e){const n={a:"a",code:"code",h1:"h1",h2:"h2",h3:"h3",h4:"h4",header:"header",hr:"hr",img:"img",li:"li",ol:"ol",p:"p",pre:"pre",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,t.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(n.header,{children:(0,s.jsx)(n.h1,{id:"bharatmlstack---predator",children:"BharatMLStack - Predator"})}),"\n",(0,s.jsxs)(n.p,{children:["Predator is a scalable, high-performance model inference service built as a wrapper around the ",(0,s.jsx)(n.strong,{children:"NVIDIA Triton Inference Server"}),". It is designed to serve a variety of machine learning models (Deep Learning, Tree-based, etc.) with low latency in a ",(0,s.jsx)(n.strong,{children:"Kubernetes (K8s)"})," environment."]}),"\n",(0,s.jsxs)(n.p,{children:["The system integrates seamlessly with the ",(0,s.jsx)(n.strong,{children:"Online Feature Store (OnFS)"})," for real-time feature retrieval and uses ",(0,s.jsx)(n.strong,{children:"Horizon"})," as the deployment orchestration layer. 
Deployments follow a ",(0,s.jsx)(n.strong,{children:"GitOps"})," pipeline \u2014 Horizon generates Helm configurations, commits them to GitHub, and ",(0,s.jsx)(n.strong,{children:"Argo Sync"})," reconciles the desired state onto Kubernetes."]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"high-level-design",children:"High-Level Design"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Predator HLD - End-to-end deployment and inference architecture",src:r(4097).A+"",width:"1824",height:"1124"})}),"\n",(0,s.jsx)(n.h3,{id:"end-to-end-flow",children:"End-to-End Flow"}),"\n",(0,s.jsxs)(n.ol,{children:["\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Model Deployment Trigger"}),": An actor initiates deployment through ",(0,s.jsx)(n.strong,{children:"Trufflebox UI"}),", specifying the GCS path (",(0,s.jsx)(n.code,{children:"gcs://"}),") of the trained model. Separately, post-training pipelines write model artifacts to ",(0,s.jsx)(n.strong,{children:"GCS Artifactory"}),"."]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Orchestration via Horizon"}),": Trufflebox UI communicates with ",(0,s.jsx)(n.strong,{children:"Horizon"}),", the deployment orchestration layer. Horizon generates the appropriate ",(0,s.jsx)(n.strong,{children:"Helm"})," chart configuration for the inference service."]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"GitOps Pipeline"}),": Horizon commits the Helm values to a ",(0,s.jsx)(n.strong,{children:"GitHub"})," repository. 
",(0,s.jsx)(n.strong,{children:"Argo Sync"})," watches the repo and reconciles the desired state onto the Kubernetes cluster, creating or updating deployable units."]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Deployable Units (Deployable 1 \u2026 N)"}),": Each deployable is an independent Kubernetes deployment that:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["Downloads model artifacts from ",(0,s.jsx)(n.strong,{children:"GCS"})," at startup via an ",(0,s.jsx)(n.code,{children:"init.sh"})," script."]}),"\n",(0,s.jsxs)(n.li,{children:["Launches a ",(0,s.jsx)(n.strong,{children:"Triton Inference Server"})," instance loaded with the model."]}),"\n",(0,s.jsx)(n.li,{children:"Runs one or more pods, each containing the inference runtime and configured backends."}),"\n"]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Triton Backends"}),": Each Triton instance supports pluggable backends based on the model type:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"FIL"})," \u2014 GPU-accelerated tree-based models (XGBoost, LightGBM, Random Forest)."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"PyTorch"})," \u2014 Native PyTorch models via LibTorch."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Python"})," \u2014 Custom preprocessing/postprocessing or unsupported model formats."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"TRT (TensorRT)"})," \u2014 GPU-optimized serialized TensorRT engines."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"ONNX"})," \u2014 Framework-agnostic execution via ONNX Runtime."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"DALI"})," \u2014 GPU-accelerated data preprocessing (image, audio, 
video)."]}),"\n"]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Autoscaling with KEDA"}),": The cluster uses ",(0,s.jsx)(n.strong,{children:"KEDA"})," (Kubernetes Event-Driven Autoscaling) to scale deployable pods based on custom metrics (CPU utilization, GPU utilization via DCGM, queue depth, etc.). The underlying ",(0,s.jsx)(n.strong,{children:"Kubernetes"})," scheduler places pods across GPU/CPU node pools."]}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"key-design-principles",children:"Key Design Principles"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"GitOps-driven"}),": All deployment state is version-controlled in Git; Argo Sync ensures cluster state matches the declared configuration."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Isolation per deployable"}),": Each model or model group gets its own deployable unit, preventing noisy-neighbor interference."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Init-based model loading"}),": Models are materialized to local disk before Triton starts, ensuring deterministic startup and no runtime dependency on remote storage."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Pluggable backends"}),": The same infrastructure serves deep learning, tree-based, and custom models through Triton's backend abstraction."]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"inference-engine-triton-inference-server",children:"Inference Engine: Triton Inference Server"}),"\n",(0,s.jsx)(n.p,{children:"NVIDIA Triton Inference Server is a high-performance model serving system designed to deploy ML and deep learning models at scale across CPUs and GPUs. 
It provides a unified inference runtime that supports multiple frameworks, optimized execution, and production-grade scheduling."}),"\n",(0,s.jsxs)(n.p,{children:["Triton operates as a standalone server that loads models from a model repository and exposes standardized HTTP/gRPC APIs. Predator uses ",(0,s.jsx)(n.strong,{children:"gRPC"})," for efficient request and response handling via the ",(0,s.jsx)(n.strong,{children:"helix client"}),"."]}),"\n",(0,s.jsx)(n.h3,{id:"core-components",children:"Core Components"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Model Repository"}),": Central directory where models are stored. Predator typically materializes the model repository onto local disk via an init container, enabling fast model loading and eliminating runtime dependency on remote storage during inference."]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"backends",children:"Backends"}),"\n",(0,s.jsx)(n.p,{children:"A backend is the runtime responsible for executing a model. 
Each model specifies which backend runs it via configuration."}),"\n",(0,s.jsxs)(n.table,{children:[(0,s.jsx)(n.thead,{children:(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.th,{children:"Backend"}),(0,s.jsx)(n.th,{children:"Description"})]})}),(0,s.jsxs)(n.tbody,{children:[(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"TensorRT"})}),(0,s.jsx)(n.td,{children:"GPU-optimized; executes serialized TensorRT engines (kernel fusion, FP16/INT8)."})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"PyTorch"})}),(0,s.jsx)(n.td,{children:"Serves native PyTorch models via LibTorch."})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"ONNX Runtime"})}),(0,s.jsx)(n.td,{children:"Framework-agnostic ONNX execution with TensorRT and other accelerators."})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"TensorFlow"})}),(0,s.jsx)(n.td,{children:"Runs TensorFlow SavedModel format."})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"Python backend"})}),(0,s.jsx)(n.td,{children:"Custom Python code for preprocessing, postprocessing, or unsupported models."})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"Custom backends"})}),(0,s.jsx)(n.td,{children:"C++/Python backends for specialized or proprietary runtimes."})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"DALI"})}),(0,s.jsx)(n.td,{children:"GPU-accelerated data preprocessing (image, audio, video)."})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"FIL (Forest Inference Library)"})}),(0,s.jsx)(n.td,{children:"GPU-accelerated tree-based models (XGBoost, LightGBM, Random Forest)."})]})]})]}),"\n",(0,s.jsx)(n.h3,{id:"key-features",children:"Key Features"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Dynamic 
batching"}),": Combines multiple requests into a single batch at runtime \u2014 higher GPU utilization, improved throughput, reduced latency variance."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Concurrent model execution"}),": Run multiple models or multiple instances of the same model; distribute load across GPUs."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Model versioning"}),": Support multiple versions per model."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Ensemble models"}),": Pipeline of models as an ensemble; eliminates intermediate network hops, reduces latency."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Model instance scaling"}),": Multiple copies of a model for parallel inference and load isolation."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Observability"}),": Prometheus metrics, granular latency, throughput, GPU utilization."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Warmup requests"}),": Preload kernels and avoid cold-start latency."]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"model-repository-structure",children:"Model Repository Structure"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{children:"model_repository/\n\u251c\u2500\u2500 model_A/\n\u2502 \u251c\u2500\u2500 config.pbtxt\n\u2502 \u251c\u2500\u2500 1/\n\u2502 \u2502 \u2514\u2500\u2500 model.plan\n\u2502 \u251c\u2500\u2500 2/\n\u2502 \u2502 \u2514\u2500\u2500 model.plan\n\u251c\u2500\u2500 model_B/\n\u2502 \u251c\u2500\u2500 config.pbtxt\n\u2502 \u251c\u2500\u2500 1/\n\u2502 \u2514\u2500\u2500 model.py\n"})}),"\n",(0,s.jsxs)(n.p,{children:["The ",(0,s.jsx)(n.code,{children:"config.pbtxt"})," file defines how Triton loads and executes a model: input/output tensors, batch settings, hardware execution, backend runtime, and optimization parameters. 
At minimum it defines: ",(0,s.jsx)(n.code,{children:"backend/platform"}),", ",(0,s.jsx)(n.code,{children:"max_batch_size"}),", ",(0,s.jsx)(n.code,{children:"inputs"}),", ",(0,s.jsx)(n.code,{children:"outputs"}),"."]}),"\n",(0,s.jsx)(n.h3,{id:"sample-configpbtxt",children:"Sample config.pbtxt"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-text",children:'name: "product_ranking_model"\nplatform: "tensorrt_plan"\nmax_batch_size: 64\ninput [ { name: "input_embeddings" data_type: TYPE_FP16 dims: [ 128 ] }, { name: "context_features" data_type: TYPE_FP32 dims: [ 32 ] } ]\noutput [ { name: "scores" data_type: TYPE_FP32 dims: [ 1 ] } ]\ninstance_group [ { kind: KIND_GPU count: 2 gpus: [0] } ]\ndynamic_batching { preferred_batch_size: [8,16,32,64] max_queue_delay_microseconds: 2000 }\n'})}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"kubernetes-deployment-architecture",children:"Kubernetes Deployment Architecture"}),"\n",(0,s.jsxs)(n.p,{children:["Predator inference services are deployed on Kubernetes using ",(0,s.jsx)(n.strong,{children:"Helm-based"})," deployments for standardized, scalable, GPU-optimized model serving. 
Each deployment consists of Triton Inference Server wrapped within a Predator runtime, with autoscaling driven by CPU and GPU utilization."]}),"\n",(0,s.jsx)(n.h3,{id:"pod-architecture",children:"Pod Architecture"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{children:"Predator Pod\n\u251c\u2500\u2500 Init Container (Model Sync)\n\u251c\u2500\u2500 Triton Inference Server Container\n"})}),"\n",(0,s.jsx)(n.p,{children:"Model artifacts and runtime are initialized before inference traffic is accepted."}),"\n",(0,s.jsx)(n.h4,{id:"init-container",children:"Init Container"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Download model artifacts from cloud storage (GCS)."}),"\n",(0,s.jsx)(n.li,{children:"Populate the Triton model repository directory."}),"\n",(0,s.jsxs)(n.li,{children:["Example: ",(0,s.jsx)(n.code,{children:"gcloud storage cp -r gs://.../model-path/* /models"})]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Benefits: deterministic startup (Triton starts only after models are available), separation of concerns (image = runtime, repository = data)."}),"\n",(0,s.jsx)(n.h4,{id:"triton-inference-server-container",children:"Triton Inference Server Container"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Load model artifacts from local repository."}),"\n",(0,s.jsx)(n.li,{children:"Manage inference scheduling, request/response handling, and expose inference endpoints."}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"triton-server-image-strategy",children:"Triton Server Image Strategy"}),"\n",(0,s.jsxs)(n.p,{children:["The Helm chart uses the Triton container image from the internal ",(0,s.jsx)(n.strong,{children:"artifact registry"}),". Production uses ",(0,s.jsx)(n.strong,{children:"custom-built"})," images (only required backends, e.g. TensorRT, Python) to reduce size and startup time. 
Unnecessary components are excluded; images are built internally and pushed to the registry."]}),"\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Response Caching"}),": Custom cache plugins can be added at image build time for optional inference response caching \u2014 reducing redundant execution and GPU use for repeated inputs."]}),"\n",(0,s.jsx)(n.h3,{id:"image-distribution-optimization",children:"Image Distribution Optimization"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Secondary boot disk image caching"}),": Images are pre-cached on GPU node pool secondary boot disks to avoid repeated pulls during scale-up and reduce pod startup time and cold-start latency."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Image streaming"}),": Can be used to progressively pull layers for faster time-to-readiness during scaling."]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"health-probes",children:"Health Probes"}),"\n",(0,s.jsxs)(n.p,{children:["Readiness and liveness use ",(0,s.jsx)(n.code,{children:"/v2/health/ready"}),". Triton receives traffic only after model loading; failed instances are restarted automatically."]}),"\n",(0,s.jsx)(n.h3,{id:"resource-configuration",children:"Resource Configuration"}),"\n",(0,s.jsx)(n.p,{children:"Sample GPU resource config:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-yaml",children:"limits:\n cpu: 7000m\n memory: 28Gi\n gpu: 1\n"})}),"\n",(0,s.jsx)(n.h3,{id:"autoscaling-architecture",children:"Autoscaling Architecture"}),"\n",(0,s.jsxs)(n.p,{children:["Predator uses ",(0,s.jsx)(n.strong,{children:"KEDA"})," (Kubernetes Event-Driven Autoscaling) for scaling deployable pods. 
KEDA supports custom metric sources including:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"CPU / Memory utilization"})," for CPU-based deployments."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"GPU utilization"})," via ",(0,s.jsx)(n.strong,{children:"DCGM"})," (Data Center GPU Manager) for GPU pods \u2014 covering utilization, memory, power, etc."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Custom Prometheus queries"})," for application-level scaling signals (e.g., inference queue depth, request latency)."]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"KEDA ScaledObjects are configured per deployable, enabling fine-grained, independent scaling for each model or model group."}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,s.jsxs)(n.p,{children:["We welcome contributions! See the ",(0,s.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"}),"."]}),"\n",(0,s.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Discord"}),": ",(0,s.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Issues"}),": ",(0,s.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Email"}),": ",(0,s.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,s.jsxs)(n.p,{children:["BharatMLStack is open-source under the ",(0,s.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 
1.1"}),"."]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)("div",{align:"center",children:(0,s.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,s.jsx)("div",{align:"center",children:(0,s.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function h(e={}){const{wrapper:n}={...(0,t.R)(),...e.components};return n?(0,s.jsx)(n,{...e,children:(0,s.jsx)(a,{...e})}):a(e)}},4097:(e,n,r)=>{r.d(n,{A:()=>i});const i=r.p+"assets/images/v1.0.0-predator-hld-949215d6604ae103e724c3978e803443.png"},8453:(e,n,r)=>{r.d(n,{R:()=>o,x:()=>l});var i=r(6540);const s={},t=i.createContext(s);function o(e){const n=i.useContext(t);return i.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function l(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:o(e.components),i.createElement(t.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/252a9097.3acfe41e.js b/docs/assets/js/252a9097.486df03c.js similarity index 58% rename from docs/assets/js/252a9097.3acfe41e.js rename to docs/assets/js/252a9097.486df03c.js index 291cf393..927b87c9 100644 --- a/docs/assets/js/252a9097.3acfe41e.js +++ b/docs/assets/js/252a9097.486df03c.js @@ -1 +1 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[4424],{248:(e,n,r)=>{r.r(n),r.d(n,{assets:()=>c,contentTitle:()=>l,default:()=>h,frontMatter:()=>o,metadata:()=>t,toc:()=>d});const t=JSON.parse('{"id":"inferflow/v1.0.0/architecture","title":"Architecture","description":"Inferflow is part of BharatMLStack, a graph-driven feature retrieval and model inference orchestration engine built in Go. 
It eliminates the need for custom feature retrieval code by using configurable DAG topologies to dynamically resolve entity relationships, fetch features from the Online Feature Store, and orchestrate model scoring \u2014 all driven by configuration stored in etcd.","source":"@site/docs/inferflow/v1.0.0/architecture.md","sourceDirName":"inferflow/v1.0.0","slug":"/inferflow/v1.0.0/architecture","permalink":"/BharatMLStack/inferflow/v1.0.0/architecture","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/inferflow/v1.0.0/architecture.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"Architecture","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"v1.0.0","permalink":"/BharatMLStack/inferflow/v1.0.0"},"next":{"title":"Key Functionalities","permalink":"/BharatMLStack/inferflow/v1.0.0/functionalities"}}');var i=r(4848),s=r(8453);const o={title:"Architecture",sidebar_position:1},l="BharatMLStack - Inferflow",c={},d=[{value:"Overview",id:"overview",level:2},{value:"High-Level Architecture",id:"high-level-architecture",level:2},{value:"Core Components",id:"core-components",level:2},{value:"1. gRPC Server",id:"1-grpc-server",level:3},{value:"2. DAG Topology Executor",id:"2-dag-topology-executor",level:3},{value:"3. Component Types",id:"3-component-types",level:3},{value:"4. ComponentMatrix \u2014 The 2D Result Matrix",id:"4-componentmatrix--the-2d-result-matrix",level:3},{value:"How the matrix evolves through the DAG",id:"how-the-matrix-evolves-through-the-dag",level:4},{value:"Matrix structure",id:"matrix-structure",level:4},{value:"5. Configuration Management (etcd)",id:"5-configuration-management-etcd",level:3},{value:"6. 
External Integrations",id:"6-external-integrations",level:3},{value:"Online Feature Store (OnFS)",id:"online-feature-store-onfs",level:4},{value:"Predator (Model Serving)",id:"predator-model-serving",level:4},{value:"Numerix (Compute Engine)",id:"numerix-compute-engine",level:4},{value:"Kafka (Inference Logging)",id:"kafka-inference-logging",level:4},{value:"Request Flow",id:"request-flow",level:2},{value:"Observability",id:"observability",level:2},{value:"Metrics (StatsD / Telegraf)",id:"metrics-statsd--telegraf",level:3},{value:"Logging",id:"logging",level:3},{value:"Deployment",id:"deployment",level:2},{value:"Docker",id:"docker",level:3},{value:"Supported Environments",id:"supported-environments",level:3},{value:"Configuration",id:"configuration",level:3},{value:"Target Users",id:"target-users",level:2},{value:"Benefits",id:"benefits",level:2},{value:"Contributing",id:"contributing",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function a(e){const n={a:"a",code:"code",h1:"h1",h2:"h2",h3:"h3",h4:"h4",header:"header",hr:"hr",img:"img",li:"li",ol:"ol",p:"p",pre:"pre",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,s.R)(),...e.components};return(0,i.jsxs)(i.Fragment,{children:[(0,i.jsx)(n.header,{children:(0,i.jsx)(n.h1,{id:"bharatmlstack---inferflow",children:"BharatMLStack - Inferflow"})}),"\n",(0,i.jsxs)(n.p,{children:["Inferflow is part of ",(0,i.jsx)(n.strong,{children:"BharatMLStack"}),", a graph-driven feature retrieval and model inference orchestration engine built in ",(0,i.jsx)(n.strong,{children:"Go"}),". 
It eliminates the need for custom feature retrieval code by using configurable DAG topologies to dynamically resolve entity relationships, fetch features from the Online Feature Store, and orchestrate model scoring \u2014 all driven by configuration stored in ",(0,i.jsx)(n.strong,{children:"etcd"}),"."]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"overview",children:"Overview"}),"\n",(0,i.jsx)(n.p,{children:"In a typical ML serving pipeline, every new model requires bespoke code to:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Fetch features from multiple entities (user, product, user x category, etc.)"}),"\n",(0,i.jsx)(n.li,{children:"Infer intermediate entity relationships (e.g., extract category from product to fetch user x category data)"}),"\n",(0,i.jsx)(n.li,{children:"Orchestrate one or more model inference calls"}),"\n",(0,i.jsx)(n.li,{children:"Handle I/O, batching, and error propagation"}),"\n"]}),"\n",(0,i.jsxs)(n.p,{children:["Inferflow abstracts all of this behind a ",(0,i.jsx)(n.strong,{children:"config-driven DAG executor"}),". 
Given a ",(0,i.jsx)(n.code,{children:"model_config_id"})," and context entities (e.g., ",(0,i.jsx)(n.code,{children:"userId"}),", ",(0,i.jsx)(n.code,{children:"productIds"}),"), it:"]}),"\n",(0,i.jsxs)(n.ol,{children:["\n",(0,i.jsx)(n.li,{children:"Loads a pre-defined feature retrieval and inference graph from etcd"}),"\n",(0,i.jsx)(n.li,{children:"Executes the graph to resolve entity relationships dynamically"}),"\n",(0,i.jsx)(n.li,{children:"Retrieves features from the Online Feature Store (OnFS) in parallel"}),"\n",(0,i.jsx)(n.li,{children:"Calls model serving endpoints (Predator) and compute services (Numerix)"}),"\n",(0,i.jsx)(n.li,{children:"Returns scored results as a structured response"}),"\n"]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"high-level-architecture",children:"High-Level Architecture"}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.img,{alt:"Inferflow Architecture - DAG Topology Executor",src:r(7748).A+"",width:"2036",height:"1212"})}),"\n",(0,i.jsxs)(n.p,{children:["The diagram shows the internal DAG structure of Inferflow's topology executor. gRPC APIs (Pair, Point, Slate) feed into the DAG, where ",(0,i.jsx)(n.strong,{children:"Feature Init"})," bootstraps the ComponentMatrix. Feature components (FS User, FS Product, FS Region, FS User Cat, FS Region Scat) fetch features from ",(0,i.jsx)(n.strong,{children:"OnFS"})," in parallel and populate columns in the shared ",(0,i.jsx)(n.strong,{children:"2D Result Matrix"}),". Model components (Model A, Model B) call ",(0,i.jsx)(n.strong,{children:"Predator"})," for inference, and compute components call ",(0,i.jsx)(n.strong,{children:"Numerix"})," for operations like reranking. The entire DAG topology is driven by config loaded from ",(0,i.jsx)(n.strong,{children:"etcd"}),"."]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"core-components",children:"Core Components"}),"\n",(0,i.jsx)(n.h3,{id:"1-grpc-server",children:"1. 
gRPC Server"}),"\n",(0,i.jsxs)(n.p,{children:["Inferflow exposes its APIs via a gRPC server, with HTTP health endpoints multiplexed on the same port using ",(0,i.jsx)(n.strong,{children:"cmux"}),". The server provides:"]}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Inferflow API"})," \u2014 ",(0,i.jsx)(n.code,{children:"RetrieveModelScore"}),": entity-based feature retrieval and scoring"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Predict API"})," \u2014 ",(0,i.jsx)(n.code,{children:"InferPointWise"}),", ",(0,i.jsx)(n.code,{children:"InferPairWise"}),", ",(0,i.jsx)(n.code,{children:"InferSlateWise"}),": structured inference with targets, pairs, and slates"]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"2-dag-topology-executor",children:"2. DAG Topology Executor"}),"\n",(0,i.jsxs)(n.p,{children:["The heart of Inferflow. Each model configuration defines a ",(0,i.jsx)(n.code,{children:"component_dependency"})," map that describes a Directed Acyclic Graph (DAG) of components."]}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Execution model:"})}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Uses ",(0,i.jsx)(n.strong,{children:"Kahn's algorithm"})," for topological ordering"]}),"\n",(0,i.jsxs)(n.li,{children:["Components at the same level run ",(0,i.jsx)(n.strong,{children:"concurrently"})," in goroutines"]}),"\n",(0,i.jsxs)(n.li,{children:["All components share a mutable ",(0,i.jsx)(n.code,{children:"ComponentMatrix"})," (rows = entity IDs, columns = features/scores)"]}),"\n",(0,i.jsxs)(n.li,{children:["DAG topologies are ",(0,i.jsx)(n.strong,{children:"cached"})," using Murmur3 hashing with Ristretto cache"]}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Validation:"})}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Cycle detection via in-degree analysis"}),"\n",(0,i.jsxs)(n.li,{children:["Component existence verification against the 
",(0,i.jsx)(n.code,{children:"ComponentProvider"})]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"3-component-types",children:"3. Component Types"}),"\n",(0,i.jsx)(n.p,{children:"Inferflow defines four types of DAG components:"}),"\n",(0,i.jsxs)(n.table,{children:[(0,i.jsx)(n.thead,{children:(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.th,{children:"Component"}),(0,i.jsx)(n.th,{children:"Role"}),(0,i.jsx)(n.th,{children:"External Dependency"})]})}),(0,i.jsxs)(n.tbody,{children:[(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"FeatureInitComponent"})}),(0,i.jsxs)(n.td,{children:["Root node \u2014 initializes the ",(0,i.jsx)(n.code,{children:"ComponentMatrix"})," with entity IDs and schema"]}),(0,i.jsx)(n.td,{children:"None"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"FeatureComponent"})}),(0,i.jsx)(n.td,{children:"Fetches features from the Online Feature Store for a specific entity type"}),(0,i.jsx)(n.td,{children:"OnFS (gRPC)"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"PredatorComponent"})}),(0,i.jsx)(n.td,{children:"Calls model serving endpoints for inference scoring"}),(0,i.jsx)(n.td,{children:"Predator / Helix (gRPC)"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"NumerixComponent"})}),(0,i.jsx)(n.td,{children:"Calls compute engine for operations like reranking"}),(0,i.jsx)(n.td,{children:"Numerix (gRPC)"})]})]})]}),"\n",(0,i.jsx)(n.h3,{id:"4-componentmatrix--the-2d-result-matrix",children:"4. ComponentMatrix \u2014 The 2D Result Matrix"}),"\n",(0,i.jsx)(n.p,{children:"The ComponentMatrix is a shared, mutable 2D data structure that flows through the entire DAG. 
Every component reads from and writes to this matrix, progressively building a complete feature + score row for each entity."}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.img,{alt:"DAG Execution & 2D Matrix Flow",src:r(3066).A+"",width:"672",height:"778"})}),"\n",(0,i.jsx)(n.h4,{id:"how-the-matrix-evolves-through-the-dag",children:"How the matrix evolves through the DAG"}),"\n",(0,i.jsx)(n.p,{children:"The diagram above illustrates the three execution phases and how the 2D matrix grows at each stage:"}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Phase 1 \u2014 Feature Retrieval"})}),"\n",(0,i.jsxs)(n.p,{children:["The ",(0,i.jsx)(n.strong,{children:"init"})," node creates an empty matrix with one row per target entity ID. Feature components then execute \u2014 first the top-level entities (entity A, entity B) fetch their features from OnFS and populate their columns (shown as colored blocks). Derived entities (entity C, D, E) resolve their keys from the already-populated columns and add more feature columns. At this point the matrix contains all feature data, with each color representing features from a different entity."]}),"\n",(0,i.jsxs)(n.p,{children:["The right side of the diagram shows the matrix being ",(0,i.jsx)(n.strong,{children:"decomposed"})," \u2014 feature columns from different entities are separated into per-model input groups, selecting only the features each model needs."]}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Phase 2 \u2014 Model Invocation"})}),"\n",(0,i.jsxs)(n.p,{children:["Model X and Model Y each receive their decomposed feature slices, call ",(0,i.jsx)(n.strong,{children:"Predator"})," for inference, and write score columns back into the matrix (shown as new colored columns appended to the right). 
Multiple models can run in parallel if they don't depend on each other's outputs."]}),"\n",(0,i.jsx)(n.p,{children:"The scores are then decomposed again to prepare inputs for the compute stage."}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Phase 3 \u2014 Numerix Compute"})}),"\n",(0,i.jsxs)(n.p,{children:["The ",(0,i.jsx)(n.strong,{children:"Score Comb"})," node takes score columns from both models, calls ",(0,i.jsx)(n.strong,{children:"Numerix"})," for a final compute operation (e.g., score combination, reranking), and writes the final score column (shown in dark red) into the matrix. The result is a complete row per entity with all features and all scores."]}),"\n",(0,i.jsx)(n.h4,{id:"matrix-structure",children:"Matrix structure"}),"\n",(0,i.jsxs)(n.table,{children:[(0,i.jsx)(n.thead,{children:(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.th,{children:"Property"}),(0,i.jsx)(n.th,{children:"Description"})]})}),(0,i.jsxs)(n.tbody,{children:[(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"Rows"})}),(0,i.jsx)(n.td,{children:"One per target entity ID (e.g., each product being scored)"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"String columns"})}),(0,i.jsx)(n.td,{children:"Human-readable values used in responses"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"Byte columns"})}),(0,i.jsx)(n.td,{children:"Binary-encoded feature values used for model inputs"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"Column naming"})}),(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"entity_label:feature_group:feature_name"})})]})]})]}),"\n",(0,i.jsx)(n.p,{children:"Each component only reads the columns it needs and writes to its own columns, enabling safe concurrent execution across independent branches of the DAG."}),"\n",(0,i.jsxs)(n.p,{children:["For slate-based APIs, a companion 
",(0,i.jsx)(n.code,{children:"SlateData"})," structure holds per-slate matrices and scores, with ",(0,i.jsx)(n.code,{children:"slate_target_indices"})," mapping slates to rows in the main matrix."]}),"\n",(0,i.jsx)(n.h3,{id:"5-configuration-management-etcd",children:"5. Configuration Management (etcd)"}),"\n",(0,i.jsx)(n.p,{children:"Model configurations are stored in etcd and hot-reloaded via watchers:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Config paths"}),": ",(0,i.jsx)(n.code,{children:"/config/inferflow/services/"}),", ",(0,i.jsx)(n.code,{children:"/model-config"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Watch mechanism"}),": etcd watchers trigger ",(0,i.jsx)(n.code,{children:"ReloadModelConfigMapAndRegisterComponents"})," on any change"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"On reload"}),": Updates ",(0,i.jsx)(n.code,{children:"ConfigMap"}),", re-initializes feature schemas, and re-registers DAG components"]}),"\n"]}),"\n",(0,i.jsxs)(n.p,{children:["This means new models or configuration changes go live ",(0,i.jsx)(n.strong,{children:"without redeployment"}),"."]}),"\n",(0,i.jsx)(n.h3,{id:"6-external-integrations",children:"6. 
External Integrations"}),"\n",(0,i.jsx)(n.h4,{id:"online-feature-store-onfs",children:"Online Feature Store (OnFS)"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["gRPC client calling ",(0,i.jsx)(n.code,{children:"FeatureService.RetrieveFeatures"})]}),"\n",(0,i.jsx)(n.li,{children:"Batched retrieval with configurable batch size and deadline"}),"\n",(0,i.jsxs)(n.li,{children:["Auth via ",(0,i.jsx)(n.code,{children:"CALLER_ID"})," and ",(0,i.jsx)(n.code,{children:"CALLER_TOKEN"})," metadata"]}),"\n"]}),"\n",(0,i.jsx)(n.h4,{id:"predator-model-serving",children:"Predator (Model Serving)"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Uses ",(0,i.jsx)(n.code,{children:"helix-client"})," for model inference"]}),"\n",(0,i.jsxs)(n.li,{children:["Supports ",(0,i.jsx)(n.strong,{children:"percentage-based traffic routing"})," across multiple model endpoints"]}),"\n",(0,i.jsx)(n.li,{children:"Configurable calibration and batch sizing"}),"\n"]}),"\n",(0,i.jsx)(n.h4,{id:"numerix-compute-engine",children:"Numerix (Compute Engine)"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Uses ",(0,i.jsx)(n.code,{children:"helix-client"})," Numerix client"]}),"\n",(0,i.jsxs)(n.li,{children:["RPC: ",(0,i.jsx)(n.code,{children:"NumerixService.Compute"})," with entity score data"]}),"\n",(0,i.jsx)(n.li,{children:"Used for compute operations like reranking"}),"\n"]}),"\n",(0,i.jsx)(n.h4,{id:"kafka-inference-logging",children:"Kafka (Inference Logging)"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Async inference log publishing using ",(0,i.jsx)(n.code,{children:"segmentio/kafka-go"})]}),"\n",(0,i.jsxs)(n.li,{children:["Supports ",(0,i.jsx)(n.strong,{children:"Proto"}),", ",(0,i.jsx)(n.strong,{children:"Arrow"}),", and ",(0,i.jsx)(n.strong,{children:"Parquet"})," serialization formats"]}),"\n",(0,i.jsxs)(n.li,{children:["Configurable sampling via ",(0,i.jsx)(n.code,{children:"LoggingPerc"})," and user-based daily 
sampling"]}),"\n"]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"request-flow",children:"Request Flow"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{children:"1. Client sends gRPC request with model_config_id + entity IDs\n \u2502\n2. Load ModelConfig from etcd-backed ConfigMap\n \u2502\n3. Adapt proto request \u2192 ComponentRequest\n (build ComponentMatrix with entity schema)\n \u2502\n4. Resolve DAG topology from component_dependency config\n \u2502\n5. Execute DAG (Kahn's algorithm, concurrent):\n \u2502\n \u251c\u2500 FeatureInitComponent: populate matrix with entity IDs + schema\n \u2502\n \u251c\u2500 FeatureComponents (parallel): fetch features from OnFS \u2192 fill matrix columns\n \u2502\n \u251c\u2500 PredatorComponent: build feature payloads from matrix \u2192 call model \u2192 write scores\n \u2502\n \u2514\u2500 NumerixComponent: read scores from matrix \u2192 call compute \u2192 write final scores\n \u2502\n6. Build response from matrix columns per ResponseConfig\n \u2502\n7. (Optional) Async Kafka logging of inference features and scores\n \u2502\n8. 
Return gRPC response to client\n"})}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"observability",children:"Observability"}),"\n",(0,i.jsx)(n.h3,{id:"metrics-statsd--telegraf",children:"Metrics (StatsD / Telegraf)"}),"\n",(0,i.jsxs)(n.table,{children:[(0,i.jsx)(n.thead,{children:(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.th,{children:"Metric"}),(0,i.jsx)(n.th,{children:"Description"})]})}),(0,i.jsxs)(n.tbody,{children:[(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.retrievemodelscore.request.total"})}),(0,i.jsx)(n.td,{children:"Total RetrieveModelScore requests"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.retrievemodelscore.latency"})}),(0,i.jsx)(n.td,{children:"End-to-end latency"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.retrievemodelscore.batch.size"})}),(0,i.jsx)(n.td,{children:"Batch size per request"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"predict.infer.request.total"})}),(0,i.jsx)(n.td,{children:"Total Predict API requests"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"predict.infer.latency"})}),(0,i.jsx)(n.td,{children:"Predict API latency"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.component.execution.total"})}),(0,i.jsx)(n.td,{children:"Per-component execution count"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.component.execution.latency"})}),(0,i.jsx)(n.td,{children:"Per-component latency"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.component.execution.error"})}),(0,i.jsx)(n.td,{children:"Component-level errors"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.component.feature.count"})}),(0,i.jsx)(n.td,{children:"Feature count per 
component"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.external.api.request.total"})}),(0,i.jsx)(n.td,{children:"External API call count"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.external.api.latency"})}),(0,i.jsx)(n.td,{children:"External API latency"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.component.inmemorycache.request.total"})}),(0,i.jsx)(n.td,{children:"Cache hit/miss total"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.component.inmemorycache.miss"})}),(0,i.jsx)(n.td,{children:"Cache misses"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.logging.kafka_sent"})}),(0,i.jsx)(n.td,{children:"Kafka log messages sent"})]})]})]}),"\n",(0,i.jsx)(n.h3,{id:"logging",children:"Logging"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Structured JSON logging via ",(0,i.jsx)(n.strong,{children:"zerolog"})]}),"\n",(0,i.jsx)(n.li,{children:"Configurable log levels"}),"\n"]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"deployment",children:"Deployment"}),"\n",(0,i.jsx)(n.h3,{id:"docker",children:"Docker"}),"\n",(0,i.jsx)(n.p,{children:"Inferflow ships as a multi-stage Docker image:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Builder"}),": Go 1.19 Alpine with optional Kafka support (librdkafka)"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Runtime"}),": Debian 10 slim"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Build command"}),": ",(0,i.jsx)(n.code,{children:'go build -tags musl -ldflags "-extldflags -static" -o server cmd/${module}/main.go'})]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"supported-environments",children:"Supported 
Environments"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Kubernetes (K8s)"}),"\n",(0,i.jsx)(n.li,{children:"Google Kubernetes Engine (GKE)"}),"\n",(0,i.jsx)(n.li,{children:"Amazon EKS"}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"configuration",children:"Configuration"}),"\n",(0,i.jsx)(n.p,{children:"All configuration is driven via environment variables (loaded by Viper) and etcd. No config files are required at deployment time."}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"target-users",children:"Target Users"}),"\n",(0,i.jsxs)(n.table,{children:[(0,i.jsx)(n.thead,{children:(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.th,{children:"User"}),(0,i.jsx)(n.th,{children:"Role"})]})}),(0,i.jsxs)(n.tbody,{children:[(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"Data Scientists"}),(0,i.jsx)(n.td,{children:"Define model configs and feature retrieval graphs via config \u2014 no code needed"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"ML Engineers"}),(0,i.jsx)(n.td,{children:"Onboard new models by updating etcd config; manage DAG topologies"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"Backend Developers"}),(0,i.jsx)(n.td,{children:"Integrate via gRPC SDKs for real-time scoring in application services"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"Platform Engineers"}),(0,i.jsx)(n.td,{children:"Deploy, scale, and monitor Inferflow clusters"})]})]})]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"benefits",children:"Benefits"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"No-code feature retrieval"})," \u2014 new models need only a config change, not custom code"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Feature consistency"})," \u2014 same graph-driven retrieval ensures identical features across experiments"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Faster iteration"})," \u2014 experiment with new models in minutes, not 
days"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Concurrent execution"})," \u2014 DAG components run in parallel for minimal latency"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Hot reloading"})," \u2014 model config changes via etcd go live without redeployment"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Multi-API support"})," \u2014 PointWise, PairWise, and SlateWise inference patterns out of the box"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Production-grade"})," \u2014 built in Go with gRPC, designed for millions of QPS"]}),"\n"]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,i.jsxs)(n.p,{children:["We welcome contributions from the community! Please see our ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,i.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,i.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,i.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,i.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 
1.1"}),"."]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function h(e={}){const{wrapper:n}={...(0,s.R)(),...e.components};return n?(0,i.jsx)(n,{...e,children:(0,i.jsx)(a,{...e})}):a(e)}},3066:(e,n,r)=>{r.d(n,{A:()=>t});const t=r.p+"assets/images/v1.0.0-inferflow-dag-matrix-0f13b51422587e6099cf4ee783844db1.png"},7748:(e,n,r)=>{r.d(n,{A:()=>t});const t=r.p+"assets/images/v1.0.0-inferflow-arch-bce54b3b4f7d3be68fa22dc204529f53.png"},8453:(e,n,r)=>{r.d(n,{R:()=>o,x:()=>l});var t=r(6540);const i={},s=t.createContext(i);function o(e){const n=t.useContext(s);return t.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function l(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(i):e.components||i:o(e.components),t.createElement(s.Provider,{value:n},e.children)}}}]); \ No newline at end of file +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[4424],{248:(e,n,r)=>{r.r(n),r.d(n,{assets:()=>c,contentTitle:()=>l,default:()=>h,frontMatter:()=>o,metadata:()=>t,toc:()=>d});const t=JSON.parse('{"id":"inferflow/v1.0.0/architecture","title":"Architecture","description":"Inferflow is part of BharatMLStack, a graph-driven feature retrieval and model inference orchestration engine built in Go. 
It eliminates the need for custom feature retrieval code by using configurable DAG topologies to dynamically resolve entity relationships, fetch features from the Online Feature Store, and orchestrate model scoring \u2014 all driven by configuration stored in etcd.","source":"@site/docs/inferflow/v1.0.0/architecture.md","sourceDirName":"inferflow/v1.0.0","slug":"/inferflow/v1.0.0/architecture","permalink":"/BharatMLStack/inferflow/v1.0.0/architecture","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/inferflow/v1.0.0/architecture.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"Architecture","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"v1.0.0","permalink":"/BharatMLStack/inferflow/v1.0.0"},"next":{"title":"Key Functionalities","permalink":"/BharatMLStack/inferflow/v1.0.0/functionalities"}}');var i=r(4848),s=r(8453);const o={title:"Architecture",sidebar_position:1},l="BharatMLStack - Inferflow",c={},d=[{value:"Overview",id:"overview",level:2},{value:"High-Level Architecture",id:"high-level-architecture",level:2},{value:"Core Components",id:"core-components",level:2},{value:"1. gRPC Server",id:"1-grpc-server",level:3},{value:"2. DAG Topology Executor",id:"2-dag-topology-executor",level:3},{value:"3. Component Types",id:"3-component-types",level:3},{value:"4. ComponentMatrix \u2014 The 2D Result Matrix",id:"4-componentmatrix--the-2d-result-matrix",level:3},{value:"How the matrix evolves through the DAG",id:"how-the-matrix-evolves-through-the-dag",level:4},{value:"Matrix structure",id:"matrix-structure",level:4},{value:"5. Configuration Management (etcd)",id:"5-configuration-management-etcd",level:3},{value:"6. 
External Integrations",id:"6-external-integrations",level:3},{value:"Online Feature Store (OnFS)",id:"online-feature-store-onfs",level:4},{value:"Predator (Model Serving)",id:"predator-model-serving",level:4},{value:"Numerix (Compute Engine)",id:"numerix-compute-engine",level:4},{value:"Kafka (Inference Logging)",id:"kafka-inference-logging",level:4},{value:"Request Flow",id:"request-flow",level:2},{value:"Observability",id:"observability",level:2},{value:"Metrics (StatsD / Telegraf)",id:"metrics-statsd--telegraf",level:3},{value:"Logging",id:"logging",level:3},{value:"Deployment",id:"deployment",level:2},{value:"Docker",id:"docker",level:3},{value:"Supported Environments",id:"supported-environments",level:3},{value:"Configuration",id:"configuration",level:3},{value:"Target Users",id:"target-users",level:2},{value:"Benefits",id:"benefits",level:2},{value:"Contributing",id:"contributing",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function a(e){const n={a:"a",code:"code",h1:"h1",h2:"h2",h3:"h3",h4:"h4",header:"header",hr:"hr",img:"img",li:"li",ol:"ol",p:"p",pre:"pre",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,s.R)(),...e.components};return(0,i.jsxs)(i.Fragment,{children:[(0,i.jsx)(n.header,{children:(0,i.jsx)(n.h1,{id:"bharatmlstack---inferflow",children:"BharatMLStack - Inferflow"})}),"\n",(0,i.jsxs)(n.p,{children:["Inferflow is part of ",(0,i.jsx)(n.strong,{children:"BharatMLStack"}),", a graph-driven feature retrieval and model inference orchestration engine built in ",(0,i.jsx)(n.strong,{children:"Go"}),". 
It eliminates the need for custom feature retrieval code by using configurable DAG topologies to dynamically resolve entity relationships, fetch features from the Online Feature Store, and orchestrate model scoring \u2014 all driven by configuration stored in ",(0,i.jsx)(n.strong,{children:"etcd"}),"."]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"overview",children:"Overview"}),"\n",(0,i.jsx)(n.p,{children:"In a typical ML serving pipeline, every new model requires bespoke code to:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Fetch features from multiple entities (user, product, user x category, etc.)"}),"\n",(0,i.jsx)(n.li,{children:"Infer intermediate entity relationships (e.g., extract category from product to fetch user x category data)"}),"\n",(0,i.jsx)(n.li,{children:"Orchestrate one or more model inference calls"}),"\n",(0,i.jsx)(n.li,{children:"Handle I/O, batching, and error propagation"}),"\n"]}),"\n",(0,i.jsxs)(n.p,{children:["Inferflow abstracts all of this behind a ",(0,i.jsx)(n.strong,{children:"config-driven DAG executor"}),". 
Given a ",(0,i.jsx)(n.code,{children:"model_config_id"})," and context entities (e.g., ",(0,i.jsx)(n.code,{children:"userId"}),", ",(0,i.jsx)(n.code,{children:"productIds"}),"), it:"]}),"\n",(0,i.jsxs)(n.ol,{children:["\n",(0,i.jsx)(n.li,{children:"Loads a pre-defined feature retrieval and inference graph from etcd"}),"\n",(0,i.jsx)(n.li,{children:"Executes the graph to resolve entity relationships dynamically"}),"\n",(0,i.jsx)(n.li,{children:"Retrieves features from the Online Feature Store (OnFS) in parallel"}),"\n",(0,i.jsx)(n.li,{children:"Calls model serving endpoints (Predator) and compute services (Numerix)"}),"\n",(0,i.jsx)(n.li,{children:"Returns scored results as a structured response"}),"\n"]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"high-level-architecture",children:"High-Level Architecture"}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.img,{alt:"Inferflow Architecture - DAG Topology Executor",src:r(7773).A+"",width:"2036",height:"1212"})}),"\n",(0,i.jsxs)(n.p,{children:["The diagram shows the internal DAG structure of Inferflow's topology executor. gRPC APIs (Pair, Point, Slate) feed into the DAG, where ",(0,i.jsx)(n.strong,{children:"Feature Init"})," bootstraps the ComponentMatrix. Feature components (FS User, FS Product, FS Region, FS User Cat, FS Region Scat) fetch features from ",(0,i.jsx)(n.strong,{children:"OnFS"})," in parallel and populate columns in the shared ",(0,i.jsx)(n.strong,{children:"2D Result Matrix"}),". Model components (Model A, Model B) call ",(0,i.jsx)(n.strong,{children:"Predator"})," for inference, and compute components call ",(0,i.jsx)(n.strong,{children:"Numerix"})," for operations like reranking. The entire DAG topology is driven by config loaded from ",(0,i.jsx)(n.strong,{children:"etcd"}),"."]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"core-components",children:"Core Components"}),"\n",(0,i.jsx)(n.h3,{id:"1-grpc-server",children:"1. 
gRPC Server"}),"\n",(0,i.jsxs)(n.p,{children:["Inferflow exposes its APIs via a gRPC server, with HTTP health endpoints multiplexed on the same port using ",(0,i.jsx)(n.strong,{children:"cmux"}),". The server provides:"]}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Inferflow API"})," \u2014 ",(0,i.jsx)(n.code,{children:"RetrieveModelScore"}),": entity-based feature retrieval and scoring"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Predict API"})," \u2014 ",(0,i.jsx)(n.code,{children:"InferPointWise"}),", ",(0,i.jsx)(n.code,{children:"InferPairWise"}),", ",(0,i.jsx)(n.code,{children:"InferSlateWise"}),": structured inference with targets, pairs, and slates"]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"2-dag-topology-executor",children:"2. DAG Topology Executor"}),"\n",(0,i.jsxs)(n.p,{children:["The heart of Inferflow. Each model configuration defines a ",(0,i.jsx)(n.code,{children:"component_dependency"})," map that describes a Directed Acyclic Graph (DAG) of components."]}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Execution model:"})}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Uses ",(0,i.jsx)(n.strong,{children:"Kahn's algorithm"})," for topological ordering"]}),"\n",(0,i.jsxs)(n.li,{children:["Components at the same level run ",(0,i.jsx)(n.strong,{children:"concurrently"})," in goroutines"]}),"\n",(0,i.jsxs)(n.li,{children:["All components share a mutable ",(0,i.jsx)(n.code,{children:"ComponentMatrix"})," (rows = entity IDs, columns = features/scores)"]}),"\n",(0,i.jsxs)(n.li,{children:["DAG topologies are ",(0,i.jsx)(n.strong,{children:"cached"})," using Murmur3 hashing with Ristretto cache"]}),"\n"]}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Validation:"})}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Cycle detection via in-degree analysis"}),"\n",(0,i.jsxs)(n.li,{children:["Component existence verification against the 
",(0,i.jsx)(n.code,{children:"ComponentProvider"})]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"3-component-types",children:"3. Component Types"}),"\n",(0,i.jsx)(n.p,{children:"Inferflow defines four types of DAG components:"}),"\n",(0,i.jsxs)(n.table,{children:[(0,i.jsx)(n.thead,{children:(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.th,{children:"Component"}),(0,i.jsx)(n.th,{children:"Role"}),(0,i.jsx)(n.th,{children:"External Dependency"})]})}),(0,i.jsxs)(n.tbody,{children:[(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"FeatureInitComponent"})}),(0,i.jsxs)(n.td,{children:["Root node \u2014 initializes the ",(0,i.jsx)(n.code,{children:"ComponentMatrix"})," with entity IDs and schema"]}),(0,i.jsx)(n.td,{children:"None"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"FeatureComponent"})}),(0,i.jsx)(n.td,{children:"Fetches features from the Online Feature Store for a specific entity type"}),(0,i.jsx)(n.td,{children:"OnFS (gRPC)"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"PredatorComponent"})}),(0,i.jsx)(n.td,{children:"Calls model serving endpoints for inference scoring"}),(0,i.jsx)(n.td,{children:"Predator / Helix (gRPC)"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"NumerixComponent"})}),(0,i.jsx)(n.td,{children:"Calls compute engine for operations like reranking"}),(0,i.jsx)(n.td,{children:"Numerix (gRPC)"})]})]})]}),"\n",(0,i.jsx)(n.h3,{id:"4-componentmatrix--the-2d-result-matrix",children:"4. ComponentMatrix \u2014 The 2D Result Matrix"}),"\n",(0,i.jsx)(n.p,{children:"The ComponentMatrix is a shared, mutable 2D data structure that flows through the entire DAG. 
Every component reads from and writes to this matrix, progressively building a complete feature + score row for each entity."}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.img,{alt:"DAG Execution & 2D Matrix Flow",src:r(7071).A+"",width:"672",height:"778"})}),"\n",(0,i.jsx)(n.h4,{id:"how-the-matrix-evolves-through-the-dag",children:"How the matrix evolves through the DAG"}),"\n",(0,i.jsx)(n.p,{children:"The diagram above illustrates the three execution phases and how the 2D matrix grows at each stage:"}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Phase 1 \u2014 Feature Retrieval"})}),"\n",(0,i.jsxs)(n.p,{children:["The ",(0,i.jsx)(n.strong,{children:"init"})," node creates an empty matrix with one row per target entity ID. Feature components then execute \u2014 first the top-level entities (entity A, entity B) fetch their features from OnFS and populate their columns (shown as colored blocks). Derived entities (entity C, D, E) resolve their keys from the already-populated columns and add more feature columns. At this point the matrix contains all feature data, with each color representing features from a different entity."]}),"\n",(0,i.jsxs)(n.p,{children:["The right side of the diagram shows the matrix being ",(0,i.jsx)(n.strong,{children:"decomposed"})," \u2014 feature columns from different entities are separated into per-model input groups, selecting only the features each model needs."]}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Phase 2 \u2014 Model Invocation"})}),"\n",(0,i.jsxs)(n.p,{children:["Model X and Model Y each receive their decomposed feature slices, call ",(0,i.jsx)(n.strong,{children:"Predator"})," for inference, and write score columns back into the matrix (shown as new colored columns appended to the right). 
Multiple models can run in parallel if they don't depend on each other's outputs."]}),"\n",(0,i.jsx)(n.p,{children:"The scores are then decomposed again to prepare inputs for the compute stage."}),"\n",(0,i.jsx)(n.p,{children:(0,i.jsx)(n.strong,{children:"Phase 3 \u2014 Numerix Compute"})}),"\n",(0,i.jsxs)(n.p,{children:["The ",(0,i.jsx)(n.strong,{children:"Score Comb"})," node takes score columns from both models, calls ",(0,i.jsx)(n.strong,{children:"Numerix"})," for a final compute operation (e.g., score combination, reranking), and writes the final score column (shown in dark red) into the matrix. The result is a complete row per entity with all features and all scores."]}),"\n",(0,i.jsx)(n.h4,{id:"matrix-structure",children:"Matrix structure"}),"\n",(0,i.jsxs)(n.table,{children:[(0,i.jsx)(n.thead,{children:(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.th,{children:"Property"}),(0,i.jsx)(n.th,{children:"Description"})]})}),(0,i.jsxs)(n.tbody,{children:[(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"Rows"})}),(0,i.jsx)(n.td,{children:"One per target entity ID (e.g., each product being scored)"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"String columns"})}),(0,i.jsx)(n.td,{children:"Human-readable values used in responses"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"Byte columns"})}),(0,i.jsx)(n.td,{children:"Binary-encoded feature values used for model inputs"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.strong,{children:"Column naming"})}),(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"entity_label:feature_group:feature_name"})})]})]})]}),"\n",(0,i.jsx)(n.p,{children:"Each component only reads the columns it needs and writes to its own columns, enabling safe concurrent execution across independent branches of the DAG."}),"\n",(0,i.jsxs)(n.p,{children:["For slate-based APIs, a companion 
",(0,i.jsx)(n.code,{children:"SlateData"})," structure holds per-slate matrices and scores, with ",(0,i.jsx)(n.code,{children:"slate_target_indices"})," mapping slates to rows in the main matrix."]}),"\n",(0,i.jsx)(n.h3,{id:"5-configuration-management-etcd",children:"5. Configuration Management (etcd)"}),"\n",(0,i.jsx)(n.p,{children:"Model configurations are stored in etcd and hot-reloaded via watchers:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Config paths"}),": ",(0,i.jsx)(n.code,{children:"/config/inferflow/services/"}),", ",(0,i.jsx)(n.code,{children:"/model-config"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Watch mechanism"}),": etcd watchers trigger ",(0,i.jsx)(n.code,{children:"ReloadModelConfigMapAndRegisterComponents"})," on any change"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"On reload"}),": Updates ",(0,i.jsx)(n.code,{children:"ConfigMap"}),", re-initializes feature schemas, and re-registers DAG components"]}),"\n"]}),"\n",(0,i.jsxs)(n.p,{children:["This means new models or configuration changes go live ",(0,i.jsx)(n.strong,{children:"without redeployment"}),"."]}),"\n",(0,i.jsx)(n.h3,{id:"6-external-integrations",children:"6. 
External Integrations"}),"\n",(0,i.jsx)(n.h4,{id:"online-feature-store-onfs",children:"Online Feature Store (OnFS)"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["gRPC client calling ",(0,i.jsx)(n.code,{children:"FeatureService.RetrieveFeatures"})]}),"\n",(0,i.jsx)(n.li,{children:"Batched retrieval with configurable batch size and deadline"}),"\n",(0,i.jsxs)(n.li,{children:["Auth via ",(0,i.jsx)(n.code,{children:"CALLER_ID"})," and ",(0,i.jsx)(n.code,{children:"CALLER_TOKEN"})," metadata"]}),"\n"]}),"\n",(0,i.jsx)(n.h4,{id:"predator-model-serving",children:"Predator (Model Serving)"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Uses ",(0,i.jsx)(n.code,{children:"go-sdk"})," for model inference"]}),"\n",(0,i.jsxs)(n.li,{children:["Supports ",(0,i.jsx)(n.strong,{children:"percentage-based traffic routing"})," across multiple model endpoints"]}),"\n",(0,i.jsx)(n.li,{children:"Configurable calibration and batch sizing"}),"\n"]}),"\n",(0,i.jsx)(n.h4,{id:"numerix-compute-engine",children:"Numerix (Compute Engine)"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Uses ",(0,i.jsx)(n.code,{children:"go-sdk"})," Numerix client"]}),"\n",(0,i.jsxs)(n.li,{children:["RPC: ",(0,i.jsx)(n.code,{children:"NumerixService.Compute"})," with entity score data"]}),"\n",(0,i.jsx)(n.li,{children:"Used for compute operations like reranking"}),"\n"]}),"\n",(0,i.jsx)(n.h4,{id:"kafka-inference-logging",children:"Kafka (Inference Logging)"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Async inference log publishing using ",(0,i.jsx)(n.code,{children:"segmentio/kafka-go"})]}),"\n",(0,i.jsxs)(n.li,{children:["Supports ",(0,i.jsx)(n.strong,{children:"Proto"}),", ",(0,i.jsx)(n.strong,{children:"Arrow"}),", and ",(0,i.jsx)(n.strong,{children:"Parquet"})," serialization formats"]}),"\n",(0,i.jsxs)(n.li,{children:["Configurable sampling via ",(0,i.jsx)(n.code,{children:"LoggingPerc"})," and user-based daily 
sampling"]}),"\n"]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"request-flow",children:"Request Flow"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{children:"1. Client sends gRPC request with model_config_id + entity IDs\n \u2502\n2. Load ModelConfig from etcd-backed ConfigMap\n \u2502\n3. Adapt proto request \u2192 ComponentRequest\n (build ComponentMatrix with entity schema)\n \u2502\n4. Resolve DAG topology from component_dependency config\n \u2502\n5. Execute DAG (Kahn's algorithm, concurrent):\n \u2502\n \u251c\u2500 FeatureInitComponent: populate matrix with entity IDs + schema\n \u2502\n \u251c\u2500 FeatureComponents (parallel): fetch features from OnFS \u2192 fill matrix columns\n \u2502\n \u251c\u2500 PredatorComponent: build feature payloads from matrix \u2192 call model \u2192 write scores\n \u2502\n \u2514\u2500 NumerixComponent: read scores from matrix \u2192 call compute \u2192 write final scores\n \u2502\n6. Build response from matrix columns per ResponseConfig\n \u2502\n7. (Optional) Async Kafka logging of inference features and scores\n \u2502\n8. 
Return gRPC response to client\n"})}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"observability",children:"Observability"}),"\n",(0,i.jsx)(n.h3,{id:"metrics-statsd--telegraf",children:"Metrics (StatsD / Telegraf)"}),"\n",(0,i.jsxs)(n.table,{children:[(0,i.jsx)(n.thead,{children:(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.th,{children:"Metric"}),(0,i.jsx)(n.th,{children:"Description"})]})}),(0,i.jsxs)(n.tbody,{children:[(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.retrievemodelscore.request.total"})}),(0,i.jsx)(n.td,{children:"Total RetrieveModelScore requests"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.retrievemodelscore.latency"})}),(0,i.jsx)(n.td,{children:"End-to-end latency"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.retrievemodelscore.batch.size"})}),(0,i.jsx)(n.td,{children:"Batch size per request"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"predict.infer.request.total"})}),(0,i.jsx)(n.td,{children:"Total Predict API requests"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"predict.infer.latency"})}),(0,i.jsx)(n.td,{children:"Predict API latency"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.component.execution.total"})}),(0,i.jsx)(n.td,{children:"Per-component execution count"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.component.execution.latency"})}),(0,i.jsx)(n.td,{children:"Per-component latency"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.component.execution.error"})}),(0,i.jsx)(n.td,{children:"Component-level errors"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.component.feature.count"})}),(0,i.jsx)(n.td,{children:"Feature count per 
component"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.external.api.request.total"})}),(0,i.jsx)(n.td,{children:"External API call count"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.external.api.latency"})}),(0,i.jsx)(n.td,{children:"External API latency"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.component.inmemorycache.request.total"})}),(0,i.jsx)(n.td,{children:"Cache hit/miss total"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.component.inmemorycache.miss"})}),(0,i.jsx)(n.td,{children:"Cache misses"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"inferflow.logging.kafka_sent"})}),(0,i.jsx)(n.td,{children:"Kafka log messages sent"})]})]})]}),"\n",(0,i.jsx)(n.h3,{id:"logging",children:"Logging"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["Structured JSON logging via ",(0,i.jsx)(n.strong,{children:"zerolog"})]}),"\n",(0,i.jsx)(n.li,{children:"Configurable log levels"}),"\n"]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"deployment",children:"Deployment"}),"\n",(0,i.jsx)(n.h3,{id:"docker",children:"Docker"}),"\n",(0,i.jsx)(n.p,{children:"Inferflow ships as a multi-stage Docker image:"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Builder"}),": Go 1.19 Alpine with optional Kafka support (librdkafka)"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Runtime"}),": Debian 10 slim"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Build command"}),": ",(0,i.jsx)(n.code,{children:'go build -tags musl -ldflags "-extldflags -static" -o server cmd/${module}/main.go'})]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"supported-environments",children:"Supported 
Environments"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Kubernetes (K8s)"}),"\n",(0,i.jsx)(n.li,{children:"Google Kubernetes Engine (GKE)"}),"\n",(0,i.jsx)(n.li,{children:"Amazon EKS"}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"configuration",children:"Configuration"}),"\n",(0,i.jsx)(n.p,{children:"All configuration is driven via environment variables (loaded by Viper) and etcd. No config files are required at deployment time."}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"target-users",children:"Target Users"}),"\n",(0,i.jsxs)(n.table,{children:[(0,i.jsx)(n.thead,{children:(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.th,{children:"User"}),(0,i.jsx)(n.th,{children:"Role"})]})}),(0,i.jsxs)(n.tbody,{children:[(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"Data Scientists"}),(0,i.jsx)(n.td,{children:"Define model configs and feature retrieval graphs via config \u2014 no code needed"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"ML Engineers"}),(0,i.jsx)(n.td,{children:"Onboard new models by updating etcd config; manage DAG topologies"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"Backend Developers"}),(0,i.jsx)(n.td,{children:"Integrate via gRPC SDKs for real-time scoring in application services"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"Platform Engineers"}),(0,i.jsx)(n.td,{children:"Deploy, scale, and monitor Inferflow clusters"})]})]})]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"benefits",children:"Benefits"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"No-code feature retrieval"})," \u2014 new models need only a config change, not custom code"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Feature consistency"})," \u2014 same graph-driven retrieval ensures identical features across experiments"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Faster iteration"})," \u2014 experiment with new models in minutes, not 
days"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Concurrent execution"})," \u2014 DAG components run in parallel for minimal latency"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Hot reloading"})," \u2014 model config changes via etcd go live without redeployment"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Multi-API support"})," \u2014 PointWise, PairWise, and SlateWise inference patterns out of the box"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Production-grade"})," \u2014 built in Go with gRPC, designed for millions of QPS"]}),"\n"]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,i.jsxs)(n.p,{children:["We welcome contributions from the community! Please see our ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,i.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,i.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,i.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,i.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 
1.1"}),"."]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function h(e={}){const{wrapper:n}={...(0,s.R)(),...e.components};return n?(0,i.jsx)(n,{...e,children:(0,i.jsx)(a,{...e})}):a(e)}},7071:(e,n,r)=>{r.d(n,{A:()=>t});const t=r.p+"assets/images/v1.0.0-inferflow-dag-matrix-0f13b51422587e6099cf4ee783844db1.png"},7773:(e,n,r)=>{r.d(n,{A:()=>t});const t=r.p+"assets/images/v1.0.0-inferflow-arch-bce54b3b4f7d3be68fa22dc204529f53.png"},8453:(e,n,r)=>{r.d(n,{R:()=>o,x:()=>l});var t=r(6540);const i={},s=t.createContext(i);function o(e){const n=t.useContext(s);return t.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function l(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(i):e.components||i:o(e.components),t.createElement(s.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/3216e812.877aa30c.js b/docs/assets/js/3216e812.877aa30c.js deleted file mode 100644 index 0c4e114e..00000000 --- a/docs/assets/js/3216e812.877aa30c.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[4771],{1494:e=>{e.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"v1.0.0","description":"Numerix v1.0.0","slug":"/numerix/v1.0.0","permalink":"/BharatMLStack/numerix/v1.0.0","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"Numerix","permalink":"/BharatMLStack/category/numerix"},"next":{"title":"Architecture","permalink":"/BharatMLStack/numerix/v1.0.0/architecture"}}}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/340c7c5f.a496fe54.js b/docs/assets/js/340c7c5f.a496fe54.js new file mode 100644 index 00000000..eb9e9e7a --- /dev/null 
+++ b/docs/assets/js/340c7c5f.a496fe54.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[74],{4795:(e,t,r)=>{r.d(t,{A:()=>N});r(6540);var n=r(4164),s=r(6972),o=r(8774),i=r(5846),a=r(6654),c=r(1312),l=r(1107);const u={cardContainer:"cardContainer_fWXF",cardTitle:"cardTitle_rnsV",cardDescription:"cardDescription_PWke"};var d=r(4848);function m({className:e,href:t,children:r}){return(0,d.jsx)(o.A,{href:t,className:(0,n.A)("card padding--lg",u.cardContainer,e),children:r})}function f({className:e,href:t,icon:r,title:s,description:o}){return(0,d.jsxs)(m,{href:t,className:e,children:[(0,d.jsxs)(l.A,{as:"h2",className:(0,n.A)("text--truncate",u.cardTitle),title:s,children:[r," ",s]}),o&&(0,d.jsx)("p",{className:(0,n.A)("text--truncate",u.cardDescription),title:o,children:o})]})}function h({item:e}){const t=(0,s.Nr)(e),r=function(){const{selectMessage:e}=(0,i.W)();return t=>e(t,(0,c.T)({message:"1 item|{count} items",id:"theme.docs.DocCard.categoryDescription.plurals",description:"The default description for a category card in the generated index about how many items this category includes"},{count:t}))}();return t?(0,d.jsx)(f,{className:e.className,href:t,icon:"\ud83d\uddc3\ufe0f",title:e.label,description:e.description??r(e.items.length)}):null}function p({item:e}){const t=(0,a.A)(e.href)?"\ud83d\udcc4\ufe0f":"\ud83d\udd17",r=(0,s.cC)(e.docId??void 0);return(0,d.jsx)(f,{className:e.className,href:e.href,icon:t,title:e.label,description:e.description??r?.description})}function x({item:e}){switch(e.type){case"link":return(0,d.jsx)(p,{item:e});case"category":return(0,d.jsx)(h,{item:e});default:throw new Error(`unknown item type ${JSON.stringify(e)}`)}}const g={docCardListItem:"docCardListItem_W1sv"};function v({className:e}){const t=(0,s.a4)();return(0,d.jsx)(N,{items:t,className:e})}function j({item:e}){return(0,d.jsx)("article",{className:(0,n.A)(g.docCardListItem,"col col--6"),children:(0,d.jsx)(x,{item:e})})}function 
N(e){const{items:t,className:r}=e;if(!t)return(0,d.jsx)(v,{...e});const o=(0,s.d1)(t);return(0,d.jsx)("section",{className:(0,n.A)("row",r),children:o.map((e,t)=>(0,d.jsx)(j,{item:e},t))})}},5846:(e,t,r)=>{r.d(t,{W:()=>l});var n=r(6540),s=r(4586);const o=["zero","one","two","few","many","other"];function i(e){return o.filter(t=>e.includes(t))}const a={locale:"en",pluralForms:i(["one","other"]),select:e=>1===e?"one":"other"};function c(){const{i18n:{currentLocale:e}}=(0,s.A)();return(0,n.useMemo)(()=>{try{return function(e){const t=new Intl.PluralRules(e);return{locale:e,pluralForms:i(t.resolvedOptions().pluralCategories),select:e=>t.select(e)}}(e)}catch(t){return console.error(`Failed to use Intl.PluralRules for locale "${e}".\nDocusaurus will fallback to the default (English) implementation.\nError: ${t.message}\n`),a}},[e])}function l(){const e=c();return{selectMessage:(t,r)=>function(e,t,r){const n=e.split("|");if(1===n.length)return n[0];n.length>r.pluralForms.length&&console.error(`For locale=${r.locale}, a maximum of ${r.pluralForms.length} plural forms are expected (${r.pluralForms.join(",")}), but the message contains ${n.length}: ${e}`);const s=r.select(t),o=r.pluralForms.indexOf(s);return n[Math.min(o,n.length-1)]}(r,t,e)}}},7642:(e,t,r)=>{r.r(t),r.d(t,{assets:()=>l,contentTitle:()=>c,default:()=>m,frontMatter:()=>a,metadata:()=>n,toc:()=>u});const n=JSON.parse('{"id":"online-feature-store/v1.0.0/index","title":"v1.0.0","description":"Online Feature Store v1.0.0","source":"@site/docs/online-feature-store/v1.0.0/index.md","sourceDirName":"online-feature-store/v1.0.0","slug":"/online-feature-store/v1.0.0","permalink":"/BharatMLStack/online-feature-store/v1.0.0","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/online-feature-store/v1.0.0/index.md","tags":[],"version":"current","sidebarPosition":0,"frontMatter":{"title":"v1.0.0","description":"Online Feature Store 
v1.0.0","sidebar_position":0,"slug":"/online-feature-store/v1.0.0"},"sidebar":"tutorialSidebar","previous":{"title":"Online Feature Store","permalink":"/BharatMLStack/category/online-feature-store"},"next":{"title":"Architecture","permalink":"/BharatMLStack/online-feature-store/v1.0.0/architecture"}}');var s=r(4848),o=r(8453),i=r(4795);const a={title:"v1.0.0",description:"Online Feature Store v1.0.0",sidebar_position:0,slug:"/online-feature-store/v1.0.0"},c="Online Feature Store v1.0.0",l={},u=[];function d(e){const t={h1:"h1",header:"header",p:"p",...(0,o.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(t.header,{children:(0,s.jsx)(t.h1,{id:"online-feature-store-v100",children:"Online Feature Store v1.0.0"})}),"\n",(0,s.jsx)(t.p,{children:"A high-performance, scalable, and production-grade feature store built for modern machine learning systems. It supports both real-time and batch workflows, with low-latency feature retrieval."}),"\n",(0,s.jsx)(i.A,{})]})}function m(e={}){const{wrapper:t}={...(0,o.R)(),...e.components};return t?(0,s.jsx)(t,{...e,children:(0,s.jsx)(d,{...e})}):d(e)}},8453:(e,t,r)=>{r.d(t,{R:()=>i,x:()=>a});var n=r(6540);const s={},o=n.createContext(s);function i(e){const t=n.useContext(o);return n.useMemo(function(){return"function"==typeof e?e(t):{...t,...e}},[t,e])}function a(e){let t;return t=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:i(e.components),n.createElement(o.Provider,{value:t},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/3650a837.fd1a89f8.js b/docs/assets/js/3650a837.fd1a89f8.js new file mode 100644 index 00000000..d3f5b92d --- /dev/null +++ b/docs/assets/js/3650a837.fd1a89f8.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[1782],{4795:(e,t,n)=>{n.d(t,{A:()=>v});n(6540);var r=n(4164),s=n(6972),c=n(8774),i=n(5846),a=n(6654),o=n(1312),l=n(1107);const 
u={cardContainer:"cardContainer_fWXF",cardTitle:"cardTitle_rnsV",cardDescription:"cardDescription_PWke"};var d=n(4848);function m({className:e,href:t,children:n}){return(0,d.jsx)(c.A,{href:t,className:(0,r.A)("card padding--lg",u.cardContainer,e),children:n})}function h({className:e,href:t,icon:n,title:s,description:c}){return(0,d.jsxs)(m,{href:t,className:e,children:[(0,d.jsxs)(l.A,{as:"h2",className:(0,r.A)("text--truncate",u.cardTitle),title:s,children:[n," ",s]}),c&&(0,d.jsx)("p",{className:(0,r.A)("text--truncate",u.cardDescription),title:c,children:c})]})}function p({item:e}){const t=(0,s.Nr)(e),n=function(){const{selectMessage:e}=(0,i.W)();return t=>e(t,(0,o.T)({message:"1 item|{count} items",id:"theme.docs.DocCard.categoryDescription.plurals",description:"The default description for a category card in the generated index about how many items this category includes"},{count:t}))}();return t?(0,d.jsx)(h,{className:e.className,href:t,icon:"\ud83d\uddc3\ufe0f",title:e.label,description:e.description??n(e.items.length)}):null}function f({item:e}){const t=(0,a.A)(e.href)?"\ud83d\udcc4\ufe0f":"\ud83d\udd17",n=(0,s.cC)(e.docId??void 0);return(0,d.jsx)(h,{className:e.className,href:e.href,icon:t,title:e.label,description:e.description??n?.description})}function x({item:e}){switch(e.type){case"link":return(0,d.jsx)(f,{item:e});case"category":return(0,d.jsx)(p,{item:e});default:throw new Error(`unknown item type ${JSON.stringify(e)}`)}}const y={docCardListItem:"docCardListItem_W1sv"};function g({className:e}){const t=(0,s.a4)();return(0,d.jsx)(v,{items:t,className:e})}function k({item:e}){return(0,d.jsx)("article",{className:(0,r.A)(y.docCardListItem,"col col--6"),children:(0,d.jsx)(x,{item:e})})}function v(e){const{items:t,className:n}=e;if(!t)return(0,d.jsx)(g,{...e});const 
c=(0,s.d1)(t);return(0,d.jsx)("section",{className:(0,r.A)("row",n),children:c.map((e,t)=>(0,d.jsx)(k,{item:e},t))})}},4927:(e,t,n)=>{n.r(t),n.d(t,{assets:()=>l,contentTitle:()=>o,default:()=>m,frontMatter:()=>a,metadata:()=>r,toc:()=>u});const r=JSON.parse('{"id":"skye/v1.0.0/index","title":"v1.0.0","description":"Skye v1.0.0","source":"@site/docs/skye/v1.0.0/index.md","sourceDirName":"skye/v1.0.0","slug":"/skye/v1.0.0","permalink":"/BharatMLStack/skye/v1.0.0","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/skye/v1.0.0/index.md","tags":[],"version":"current","sidebarPosition":0,"frontMatter":{"title":"v1.0.0","description":"Skye v1.0.0","sidebar_position":0,"slug":"/skye/v1.0.0"},"sidebar":"tutorialSidebar","previous":{"title":"Skye","permalink":"/BharatMLStack/category/skye"},"next":{"title":"Architecture","permalink":"/BharatMLStack/skye/v1.0.0/architecture"}}');var s=n(4848),c=n(8453),i=n(4795);const a={title:"v1.0.0",description:"Skye v1.0.0",sidebar_position:0,slug:"/skye/v1.0.0"},o="Skye v1.0.0",l={},u=[];function d(e){const t={h1:"h1",header:"header",p:"p",...(0,c.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(t.header,{children:(0,s.jsx)(t.h1,{id:"skye-v100",children:"Skye v1.0.0"})}),"\n",(0,s.jsx)(t.p,{children:"Skye is a high-performance vector similarity search platform that enables fast semantic retrieval by representing data as vectors and querying nearest matches in high-dimensional space."}),"\n",(0,s.jsx)(i.A,{})]})}function m(e={}){const{wrapper:t}={...(0,c.R)(),...e.components};return t?(0,s.jsx)(t,{...e,children:(0,s.jsx)(d,{...e})}):d(e)}},5846:(e,t,n)=>{n.d(t,{W:()=>l});var r=n(6540),s=n(4586);const c=["zero","one","two","few","many","other"];function i(e){return c.filter(t=>e.includes(t))}const a={locale:"en",pluralForms:i(["one","other"]),select:e=>1===e?"one":"other"};function o(){const{i18n:{currentLocale:e}}=(0,s.A)();return(0,r.useMemo)(()=>{try{return 
function(e){const t=new Intl.PluralRules(e);return{locale:e,pluralForms:i(t.resolvedOptions().pluralCategories),select:e=>t.select(e)}}(e)}catch(t){return console.error(`Failed to use Intl.PluralRules for locale "${e}".\nDocusaurus will fallback to the default (English) implementation.\nError: ${t.message}\n`),a}},[e])}function l(){const e=o();return{selectMessage:(t,n)=>function(e,t,n){const r=e.split("|");if(1===r.length)return r[0];r.length>n.pluralForms.length&&console.error(`For locale=${n.locale}, a maximum of ${n.pluralForms.length} plural forms are expected (${n.pluralForms.join(",")}), but the message contains ${r.length}: ${e}`);const s=n.select(t),c=n.pluralForms.indexOf(s);return r[Math.min(c,r.length-1)]}(n,t,e)}}},8453:(e,t,n)=>{n.d(t,{R:()=>i,x:()=>a});var r=n(6540);const s={},c=r.createContext(s);function i(e){const t=r.useContext(c);return r.useMemo(function(){return"function"==typeof e?e(t):{...t,...e}},[t,e])}function a(e){let t;return t=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:i(e.components),r.createElement(c.Provider,{value:t},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/3aeb33c7.131ecece.js b/docs/assets/js/3aeb33c7.131ecece.js new file mode 100644 index 00000000..25d91999 --- /dev/null +++ b/docs/assets/js/3aeb33c7.131ecece.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[974],{5969:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-five","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-five/index.md","source":"@site/blog/bharatmlstack-history/post-five/index.md","title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at 
Scale","description":"BharatMLStack","date":"2025-06-02T00:00:00.000Z","tags":[{"inline":true,"label":"llm","permalink":"/BharatMLStack/blog/tags/llm"},{"inline":true,"label":"vllm","permalink":"/BharatMLStack/blog/tags/vllm"},{"inline":true,"label":"tensorrt-llm","permalink":"/BharatMLStack/blog/tags/tensorrt-llm"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":4.93,"hasTruncateMarker":false,"authors":[{"name":"Jaya Kumar","title":"Lead ML Engineer @ Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null}],"frontMatter":{"slug":"post-five","title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","authors":["jaya"],"date":"2025-6-2","tags":["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"nextItem":{"title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","permalink":"/BharatMLStack/blog/post-four"}}')},7309:(e,t,i)=>{i.r(t),i.d(t,{assets:()=>h,contentTitle:()=>d,default:()=>o,frontMatter:()=>r,metadata:()=>n,toc:()=>c});var n=i(5969),s=i(4848),l=i(8453);const r={slug:"post-five",title:"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale",authors:["jaya"],date:"2025-6-2",tags:["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},d=void 0,h={authorsImageUrls:[void 0]},c=[{value:"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale",id:"llm-inference-optimization-techniques-engineering-sub-second-latency-at-scale",level:2},{value:"1. 
Advanced Memory Management: Paged & Prefix KV Caching",id:"1-advanced-memory-management-paged--prefix-kv-caching",level:2},{value:"Paged KV caching",id:"paged-kv-caching",level:3},{value:"KV cache quantization",id:"kv-cache-quantization",level:3},{value:'Prefix caching (the "voice bot" optimizer)',id:"prefix-caching-the-voice-bot-optimizer",level:3},{value:"2. Aggressive Quantization (INT4 AWQ & FP8)",id:"2-aggressive-quantization-int4-awq--fp8",level:2},{value:"INT4 AWQ (Activation-aware Weight Quantization)",id:"int4-awq-activation-aware-weight-quantization",level:3},{value:"FP8 precision",id:"fp8-precision",level:3},{value:"3. Kernel Fusion & Custom Plugins",id:"3-kernel-fusion--custom-plugins",level:2},{value:"4. Inflight (Continuous) Batching",id:"4-inflight-continuous-batching",level:2},{value:"5. Parallelism Strategies: Scaling Beyond One GPU",id:"5-parallelism-strategies-scaling-beyond-one-gpu",level:2},{value:"6. Speculative Decoding",id:"6-speculative-decoding",level:2},{value:"Few Benchmarks",id:"few-benchmarks",level:2},{value:"Search query rewriting",id:"search-query-rewriting",level:3},{value:"Voice bot query",id:"voice-bot-query",level:3},{value:"Conclusion",id:"conclusion",level:2}];function a(e){const t={h2:"h2",h3:"h3",img:"img",li:"li",p:"p",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,l.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(t.p,{children:(0,s.jsx)(t.img,{alt:"BharatMLStack",src:i(8849).A+"",width:"1396",height:"460"})}),"\n",(0,s.jsx)(t.h2,{id:"llm-inference-optimization-techniques-engineering-sub-second-latency-at-scale",children:"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale"}),"\n",(0,s.jsx)(t.p,{children:"Raw execution of Large Language Models is inherently expensive and memory-intensive. 
To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack\u2014from memory management to kernel execution."}),"\n",(0,s.jsx)(t.h2,{id:"1-advanced-memory-management-paged--prefix-kv-caching",children:"1. Advanced Memory Management: Paged & Prefix KV Caching"}),"\n",(0,s.jsx)(t.p,{children:"The most significant bottleneck in LLM inference is not always compute, but memory bandwidth\u2014specifically managing the Key-Value (KV) cache."}),"\n",(0,s.jsx)(t.h3,{id:"paged-kv-caching",children:"Paged KV caching"}),"\n",(0,s.jsxs)(t.p,{children:["Standard caching suffers from fragmentation. We use ",(0,s.jsx)(t.strong,{children:"Paged KV caching"}),", which operates similarly to an operating system's virtual memory: the KV cache is divided into non-contiguous blocks. This lets us serve larger batch sizes without running out of memory."]}),"\n",(0,s.jsx)(t.h3,{id:"kv-cache-quantization",children:"KV cache quantization"}),"\n",(0,s.jsxs)(t.p,{children:["To further maximize available memory, we implement ",(0,s.jsx)(t.strong,{children:"KV cache quantization"})," (e.g., FP8). By compressing stored attention keys and values from 16-bit to 8-bit, we nearly double the effective context window capacity of the GPU, allowing longer conversations or larger batches without materially degrading quality."]}),"\n",(0,s.jsx)(t.h3,{id:"prefix-caching-the-voice-bot-optimizer",children:'Prefix caching (the "voice bot" optimizer)'}),"\n",(0,s.jsxs)(t.p,{children:['For use cases like GenAI voice bots where the system prompt (e.g., "You are a helpful assistant...") is static across thousands of requests, we enable ',(0,s.jsx)(t.strong,{children:"prefix caching"}),"."]}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Impact"}),": By reusing pre-computed KV states for common prefixes, we achieve a cache hit rate of ~90%. 
This reduces ",(0,s.jsx)(t.strong,{children:"Time To First Token (TTFT)"})," by skipping redundant computation of the system prompt."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"2-aggressive-quantization-int4-awq--fp8",children:"2. Aggressive Quantization (INT4 AWQ & FP8)"}),"\n",(0,s.jsx)(t.p,{children:"Running models in their native 16-bit precision (BF16) restricts maximum batch size and throughput. We use quantization to shrink model weights without sacrificing accuracy."}),"\n",(0,s.jsx)(t.h3,{id:"int4-awq-activation-aware-weight-quantization",children:"INT4 AWQ (Activation-aware Weight Quantization)"}),"\n",(0,s.jsxs)(t.p,{children:["For the Llama 3 family, we use ",(0,s.jsx)(t.strong,{children:"AWQ"})," to compress weights to 4 bits. This reduces model size by ~75%, allowing larger models to fit into L4 GPU memory and significantly improving token generation speed."]}),"\n",(0,s.jsx)(t.h3,{id:"fp8-precision",children:"FP8 precision"}),"\n",(0,s.jsxs)(t.p,{children:["For NVIDIA Hopper (H100) architectures, we are exploring ",(0,s.jsx)(t.strong,{children:"FP8 quantization"}),", leveraging native FP8 tensor cores to accelerate matrix multiplications while maintaining a higher dynamic range than integer quantization."]}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Verification"}),": We validate quantized models by comparing dot-product similarity of embeddings against the FP16 baseline, consistently achieving ",(0,s.jsx)(t.strong,{children:">99% similarity"}),"."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"3-kernel-fusion--custom-plugins",children:"3. 
Kernel Fusion & Custom Plugins"}),"\n",(0,s.jsx)(t.p,{children:"To minimize overhead from launching thousands of small GPU operations, we fuse them into monolithic kernels using NVIDIA TensorRT plugins."}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Flash attention & FMHA"}),": We enable ",(0,s.jsx)(t.strong,{children:"Fused Multi-Head Attention (FMHA)"})," combined with flash attention to reduce memory reads/writes."]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"GEMM plugins"}),": We use specialized ",(0,s.jsx)(t.strong,{children:"GEMM"})," plugins to accelerate transformer linear layers."]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Removing input padding"}),": Instead of padding short sequences to match the longest, we remove input padding so the GPU processes only valid tokens."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"4-inflight-continuous-batching",children:"4. Inflight (Continuous) Batching"}),"\n",(0,s.jsx)(t.p,{children:"Traditional static batching waits for all requests in a batch to finish before returning results\u2014so one long response delays everyone else."}),"\n",(0,s.jsxs)(t.p,{children:["We implement ",(0,s.jsx)(t.strong,{children:"inflight batching"}),": as soon as one request completes, its slot is freed and filled by a new request from the queue. This keeps GPUs saturated and decouples latency of short queries from long ones."]}),"\n",(0,s.jsx)(t.h2,{id:"5-parallelism-strategies-scaling-beyond-one-gpu",children:"5. Parallelism Strategies: Scaling Beyond One GPU"}),"\n",(0,s.jsx)(t.p,{children:"For large models (e.g., 70B+ parameters) that cannot fit into the VRAM of a single GPU, we use parallelism strategies."}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Tensor parallelism (TP)"}),": Split weight matrices across multiple GPUs (e.g., 4\xd7 L4 or 8\xd7 A100). 
Each GPU computes a shard and outputs are reduced at every layer."]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Pipeline parallelism (PP)"}),": Split model layers across GPUs to pipeline compute (e.g., while one GPU computes later layers for Request A, another starts early layers for Request B)."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"6-speculative-decoding",children:"6. Speculative Decoding"}),"\n",(0,s.jsxs)(t.p,{children:["To reduce inter-token latency (ITL), we explore ",(0,s.jsx)(t.strong,{children:"speculative decoding"}),"."]}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Mechanism"}),': A smaller, faster "draft" model speculatively generates a short token sequence (e.g., 5 tokens).']}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Verification"}),": The larger target model verifies those tokens in one parallel forward pass. If correct, we effectively generate multiple tokens per large-model step; if not, we discard and regenerate. 
This is effective for predictable text, improving perceived generation speed."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"few-benchmarks",children:"Few Benchmarks"}),"\n",(0,s.jsx)(t.p,{children:"Below are a couple of representative use cases and performance numbers."}),"\n",(0,s.jsx)(t.h3,{id:"search-query-rewriting",children:"Search query rewriting"}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"LLM"}),": Fine-tuned llama-3.2-1B"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Input & output token length"}),": ~10\u201320"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Response type"}),": Non-streaming"]}),"\n"]}),"\n",(0,s.jsxs)(t.table,{children:[(0,s.jsx)(t.thead,{children:(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.th,{children:"Inference runtime"}),(0,s.jsx)(t.th,{children:"Hardware"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Max requests/sec"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Max p99 latency"})]})}),(0,s.jsxs)(t.tbody,{children:[(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{children:"4 \xd7 L4 GPUs (multi-GPU)"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1000"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"95 ms"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{children:"1 \xd7 A100 40 GB GPU"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1000"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"69 ms"})]})]})]}),"\n",(0,s.jsx)(t.h3,{id:"voice-bot-query",children:"Voice bot query"}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"LLM"}),": Llama-3.1-8B"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Input token length"}),": ~1900\u20132000"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Output token length"}),": 
~200"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Response type"}),": Streaming"]}),"\n"]}),"\n",(0,s.jsxs)(t.table,{children:[(0,s.jsx)(t.thead,{children:(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.th,{children:"Inference runtime"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Concurrency"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"p99 TTFT (ms)"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"p99 ITL (ms)"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Token throughput (tokens/sec)"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Request throughput (req/sec)"}),(0,s.jsx)(t.th,{children:"Hardware"})]})}),(0,s.jsxs)(t.tbody,{children:[(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"36.27"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"22.78"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"45.66"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.23"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"49.81"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"23.21"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"89.37"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.45"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"55.33"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"36.62"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"153.39"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.78"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:
{textAlign:"right"},children:"8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"66.5"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"39.11"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"279.88"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1.47"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"16"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"131.8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"30.39"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"547.8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2.77"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"32"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"277.22"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"48.02"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"925.7"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4.78"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"64"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"498.52"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"71.62"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,164.40"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"6.2"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"128"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"677.31"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"120.37"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,445.18"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"7.69"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"Tens
orRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"256"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,926.31"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"216.88"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,600.81"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"8.52"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"21.17"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"9.24"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"130.05"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.68"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"25.78"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"9.21"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"264.5"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1.35"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"28.52"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"10.99"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"437.69"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2.27"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"34.4"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"12.61"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"760.49"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"3.96"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{chi
ldren:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"16"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"68.03"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"14.32"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,343.80"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"7.01"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"32"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"185.96"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"16.82"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2,287.30"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"11.92"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"64"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"136.87"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"21.17"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"3,625.22"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"18.89"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"128"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"463.78"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"34.15"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4,456.51"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"23.24"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"256"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"890.12"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"59.18"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"5,188.24"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"27.05"
}),(0,s.jsx)(t.td,{children:"A100"})]})]})]}),"\n",(0,s.jsx)(t.h2,{id:"conclusion",children:"Conclusion"}),"\n",(0,s.jsx)(t.p,{children:"High-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure."}),"\n",(0,s.jsx)(t.p,{children:"These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications."})]})}function o(e={}){const{wrapper:t}={...(0,l.R)(),...e.components};return t?(0,s.jsx)(t,{...e,children:(0,s.jsx)(a,{...e})}):a(e)}},8453:(e,t,i)=>{i.d(t,{R:()=>r,x:()=>d});var n=i(6540);const s={},l=n.createContext(s);function r(e){const t=n.useContext(l);return n.useMemo(function(){return"function"==typeof e?e(t):{...t,...e}},[t,e])}function d(e){let t;return t=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:r(e.components),n.createElement(l.Provider,{value:t},e.children)}},8849:(e,t,i)=>{i.d(t,{A:()=>n});const n=i.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"}}]); \ No newline at end of file diff --git a/docs/assets/js/4137b431.c6fedbd3.js b/docs/assets/js/4137b431.c6fedbd3.js deleted file mode 100644 index 548cc0ed..00000000 --- a/docs/assets/js/4137b431.c6fedbd3.js +++ /dev/null @@ -1 +0,0 @@ -"use 
strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[6054],{4019:e=>{e.exports=JSON.parse('{"version":{"pluginId":"default","version":"current","label":"Next","banner":null,"badge":false,"noIndex":false,"className":"docs-version-current","isLast":true,"docsSidebars":{"tutorialSidebar":[{"type":"category","label":"Online Feature Store","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"Architecture","href":"/BharatMLStack/online-feature-store/v1.0.0/architecture","docId":"online-feature-store/v1.0.0/architecture","unlisted":false},{"type":"link","label":"Data Formats","href":"/BharatMLStack/online-feature-store/v1.0.0/data-formats","docId":"online-feature-store/v1.0.0/data-formats","unlisted":false},{"type":"link","label":"Benchmarks","href":"/BharatMLStack/online-feature-store/v1.0.0/benchmarks","docId":"online-feature-store/v1.0.0/benchmarks","unlisted":false},{"type":"link","label":"Key Functionalities","href":"/BharatMLStack/online-feature-store/v1.0.0/functionalities","docId":"online-feature-store/v1.0.0/functionalities","unlisted":false},{"type":"link","label":"Release Notes","href":"/BharatMLStack/online-feature-store/v1.0.0/release-notes","docId":"online-feature-store/v1.0.0/release-notes","unlisted":false}],"href":"/BharatMLStack/online-feature-store/v1.0.0"}],"href":"/BharatMLStack/category/online-feature-store"},{"type":"category","label":"Inferflow","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"Architecture","href":"/BharatMLStack/inferflow/v1.0.0/architecture","docId":"inferflow/v1.0.0/architecture","unlisted":false},{"type":"link","label":"Key Functionalities","href":"/BharatMLStack/inferflow/v1.0.0/functionalities","docId":"inferflow/v1.0.0/functionalities","unlisted":false},{"type":"link","label":"Configuration 
Guide","href":"/BharatMLStack/inferflow/v1.0.0/configuration","docId":"inferflow/v1.0.0/configuration","unlisted":false},{"type":"link","label":"Release Notes","href":"/BharatMLStack/inferflow/v1.0.0/release-notes","docId":"inferflow/v1.0.0/release-notes","unlisted":false}],"href":"/BharatMLStack/inferflow/v1.0.0"}],"href":"/BharatMLStack/category/inferflow"},{"type":"category","label":"Quick Start","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"Quick Start","href":"/BharatMLStack/quick-start/v1.0.0/quick-start","docId":"quick-start/v1.0.0/quick-start","unlisted":false}]}],"href":"/BharatMLStack/category/quick-start"},{"type":"category","label":"Trufflebox UI","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"User Manual","href":"/BharatMLStack/trufflebox-ui/v1.0.0/userguide","docId":"trufflebox-ui/v1.0.0/userguide","unlisted":false}]}],"href":"/BharatMLStack/category/trufflebox-ui"},{"type":"category","label":"SDKs","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"Go SDK","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"GRPC Feature client","href":"/BharatMLStack/sdks/go/v1.0.0/feature_client","docId":"sdks/go/v1.0.0/feature_client","unlisted":false}]}],"href":"/BharatMLStack/category/go-sdk"},{"type":"category","label":"Python SDK","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"GRPC Feature client","href":"/BharatMLStack/sdks/python/v1.0.0/grpc_feature_client","docId":"sdks/python/v1.0.0/grpc_feature_client","unlisted":false},{"type":"link","label":"Spark 
client","href":"/BharatMLStack/sdks/python/v1.0.0/spark_feature_push_client","docId":"sdks/python/v1.0.0/spark_feature_push_client","unlisted":false}],"href":"/BharatMLStack/category/v100"}],"href":"/BharatMLStack/category/python-sdk"}],"href":"/BharatMLStack/category/sdks"},{"type":"category","label":"Numerix","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"Architecture","href":"/BharatMLStack/numerix/v1.0.0/architecture","docId":"numerix/v1.0.0/architecture","unlisted":false},{"type":"link","label":"Benchmarks","href":"/BharatMLStack/numerix/v1.0.0/benchmarks","docId":"numerix/v1.0.0/benchmarks","unlisted":false},{"type":"link","label":"Key Functionalities","href":"/BharatMLStack/numerix/v1.0.0/functionalities","docId":"numerix/v1.0.0/functionalities","unlisted":false},{"type":"link","label":"Release Notes","href":"/BharatMLStack/numerix/v1.0.0/release-notes","docId":"numerix/v1.0.0/release-notes","unlisted":false}],"href":"/BharatMLStack/numerix/v1.0.0"}],"href":"/BharatMLStack/category/numerix"}]},"docs":{"inferflow/v1.0.0/architecture":{"id":"inferflow/v1.0.0/architecture","title":"Architecture","description":"Inferflow is part of BharatMLStack, a graph-driven feature retrieval and model inference orchestration engine built in Go. It eliminates the need for custom feature retrieval code by using configurable DAG topologies to dynamically resolve entity relationships, fetch features from the Online Feature Store, and orchestrate model scoring \u2014 all driven by configuration stored in etcd.","sidebar":"tutorialSidebar"},"inferflow/v1.0.0/configuration":{"id":"inferflow/v1.0.0/configuration","title":"Configuration Guide","description":"Inferflow is fully config-driven. 
All model onboarding, feature retrieval logic, DAG topology, and inference behavior are controlled through configuration stored in etcd \u2014 with zero code changes required.","sidebar":"tutorialSidebar"},"inferflow/v1.0.0/functionalities":{"id":"inferflow/v1.0.0/functionalities","title":"Key Functionalities","description":"Overview","sidebar":"tutorialSidebar"},"inferflow/v1.0.0/release-notes":{"id":"inferflow/v1.0.0/release-notes","title":"Release Notes","description":"Version 1.0.0","sidebar":"tutorialSidebar"},"numerix/v1.0.0/architecture":{"id":"numerix/v1.0.0/architecture","title":"Architecture","description":"---","sidebar":"tutorialSidebar"},"numerix/v1.0.0/benchmarks":{"id":"numerix/v1.0.0/benchmarks","title":"Benchmarks","description":"This PoC measures the performance of vector addition in Rust with and without compiler SIMD optimizations. Requests consist of repeated fixed-size vector addition operations processed in parallel by the CPU. These results provide perspective on how much faster SIMD makes vectorized computations, and similar improvements are expected for other vectorized operations in Numerix.","sidebar":"tutorialSidebar"},"numerix/v1.0.0/functionalities":{"id":"numerix/v1.0.0/functionalities","title":"Key Functionalities","description":"Overview","sidebar":"tutorialSidebar"},"numerix/v1.0.0/release-notes":{"id":"numerix/v1.0.0/release-notes","title":"Release Notes","description":"Version 1.0.0 \ud83d\ude80","sidebar":"tutorialSidebar"},"online-feature-store/v1.0.0/architecture":{"id":"online-feature-store/v1.0.0/architecture","title":"Architecture","description":"The Online Feature Store (OnFS) is part of BharatMLStack, designed to support real-time ML workloads through low-latency feature retrieval and flexible feature ingestion pipelines. 
It ensures that features generated offline or online are immediately accessible for inference.","sidebar":"tutorialSidebar"},"online-feature-store/v1.0.0/benchmarks":{"id":"online-feature-store/v1.0.0/benchmarks","title":"Benchmarks","description":"Summary","sidebar":"tutorialSidebar"},"online-feature-store/v1.0.0/data-formats":{"id":"online-feature-store/v1.0.0/data-formats","title":"Data Formats","description":"In this section we will go through the data-formats which is at the hear of online-feature-store, it\'s inspired form other storage efficient formats like parquet & arrow, but custom made to deliver in constraint environment. The two key data-formats are:","sidebar":"tutorialSidebar"},"online-feature-store/v1.0.0/functionalities":{"id":"online-feature-store/v1.0.0/functionalities","title":"Key Functionalities","description":"Overview","sidebar":"tutorialSidebar"},"online-feature-store/v1.0.0/release-notes":{"id":"online-feature-store/v1.0.0/release-notes","title":"Release Notes","description":"Version 1.0.0 \ud83d\ude80","sidebar":"tutorialSidebar"},"quick-start/v1.0.0/quick-start":{"id":"quick-start/v1.0.0/quick-start","title":"Quick Start","description":"Discord","sidebar":"tutorialSidebar"},"sdks/go/v1.0.0/feature_client":{"id":"sdks/go/v1.0.0/feature_client","title":"GRPC Feature client","description":"Build Status","sidebar":"tutorialSidebar"},"sdks/python/v1.0.0/grpc_feature_client":{"id":"sdks/python/v1.0.0/grpc_feature_client","title":"GRPC Feature client","description":"PyPI version","sidebar":"tutorialSidebar"},"sdks/python/v1.0.0/spark_feature_push_client":{"id":"sdks/python/v1.0.0/spark_feature_push_client","title":"Spark client","description":"PyPI version","sidebar":"tutorialSidebar"},"trufflebox-ui/v1.0.0/userguide":{"id":"trufflebox-ui/v1.0.0/userguide","title":"User Manual","description":"This guide covers the complete setup and usage of the Online Feature Store system, including the core services (Online Feature Store and Horizon) and the 
TruffleBox UI for feature management.","sidebar":"tutorialSidebar"}}}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/4137b431.eda97697.js b/docs/assets/js/4137b431.eda97697.js new file mode 100644 index 00000000..f28d4a83 --- /dev/null +++ b/docs/assets/js/4137b431.eda97697.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[6054],{4019:e=>{e.exports=JSON.parse('{"version":{"pluginId":"default","version":"current","label":"Next","banner":null,"badge":false,"noIndex":false,"className":"docs-version-current","isLast":true,"docsSidebars":{"tutorialSidebar":[{"type":"link","label":"BharatMLStack Documentation","href":"/BharatMLStack/intro","docId":"intro","unlisted":false},{"type":"category","label":"Online Feature Store","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"Architecture","href":"/BharatMLStack/online-feature-store/v1.0.0/architecture","docId":"online-feature-store/v1.0.0/architecture","unlisted":false},{"type":"link","label":"Data Formats","href":"/BharatMLStack/online-feature-store/v1.0.0/data-formats","docId":"online-feature-store/v1.0.0/data-formats","unlisted":false},{"type":"link","label":"Benchmarks","href":"/BharatMLStack/online-feature-store/v1.0.0/benchmarks","docId":"online-feature-store/v1.0.0/benchmarks","unlisted":false},{"type":"link","label":"Key Functionalities","href":"/BharatMLStack/online-feature-store/v1.0.0/functionalities","docId":"online-feature-store/v1.0.0/functionalities","unlisted":false},{"type":"link","label":"Release 
Notes","href":"/BharatMLStack/online-feature-store/v1.0.0/release-notes","docId":"online-feature-store/v1.0.0/release-notes","unlisted":false}],"href":"/BharatMLStack/online-feature-store/v1.0.0"}],"href":"/BharatMLStack/category/online-feature-store"},{"type":"category","label":"Inferflow","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"Architecture","href":"/BharatMLStack/inferflow/v1.0.0/architecture","docId":"inferflow/v1.0.0/architecture","unlisted":false},{"type":"link","label":"Key Functionalities","href":"/BharatMLStack/inferflow/v1.0.0/functionalities","docId":"inferflow/v1.0.0/functionalities","unlisted":false},{"type":"link","label":"Configuration Guide","href":"/BharatMLStack/inferflow/v1.0.0/configuration","docId":"inferflow/v1.0.0/configuration","unlisted":false},{"type":"link","label":"Release Notes","href":"/BharatMLStack/inferflow/v1.0.0/release-notes","docId":"inferflow/v1.0.0/release-notes","unlisted":false}],"href":"/BharatMLStack/inferflow/v1.0.0"}],"href":"/BharatMLStack/category/inferflow"},{"type":"category","label":"Quick Start","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"Quick Start","href":"/BharatMLStack/quick-start/v1.0.0/quick-start","docId":"quick-start/v1.0.0/quick-start","unlisted":false}],"href":"/BharatMLStack/quick-start/v1.0.0"}],"href":"/BharatMLStack/category/quick-start"},{"type":"category","label":"Trufflebox UI","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"User 
Manual","href":"/BharatMLStack/trufflebox-ui/v1.0.0/userguide","docId":"trufflebox-ui/v1.0.0/userguide","unlisted":false}],"href":"/BharatMLStack/trufflebox-ui/v1.0.0"}],"href":"/BharatMLStack/category/trufflebox-ui"},{"type":"category","label":"SDKs","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"Go SDK","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"GRPC Feature client","href":"/BharatMLStack/sdks/go/v1.0.0/feature_client","docId":"sdks/go/v1.0.0/feature_client","unlisted":false}],"href":"/BharatMLStack/sdks/go/v1.0.0"}],"href":"/BharatMLStack/category/go-sdk"},{"type":"category","label":"Python SDK","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"GRPC Feature client","href":"/BharatMLStack/sdks/python/v1.0.0/grpc_feature_client","docId":"sdks/python/v1.0.0/grpc_feature_client","unlisted":false},{"type":"link","label":"Spark client","href":"/BharatMLStack/sdks/python/v1.0.0/spark_feature_push_client","docId":"sdks/python/v1.0.0/spark_feature_push_client","unlisted":false}],"href":"/BharatMLStack/sdks/python/v1.0.0"}],"href":"/BharatMLStack/category/python-sdk"}],"href":"/BharatMLStack/category/sdks"},{"type":"category","label":"Skye","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"Architecture","href":"/BharatMLStack/skye/v1.0.0/architecture","docId":"skye/v1.0.0/architecture","unlisted":false},{"type":"link","label":"Functionalities","href":"/BharatMLStack/skye/v1.0.0/functionalities","docId":"skye/v1.0.0/functionalities","unlisted":false},{"type":"link","label":"Release 
Notes","href":"/BharatMLStack/skye/v1.0.0/release-notes","docId":"skye/v1.0.0/release-notes","unlisted":false}],"href":"/BharatMLStack/skye/v1.0.0"}],"href":"/BharatMLStack/category/skye"},{"type":"category","label":"Numerix","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"Architecture","href":"/BharatMLStack/numerix/v1.0.0/architecture","docId":"numerix/v1.0.0/architecture","unlisted":false},{"type":"link","label":"Benchmarks","href":"/BharatMLStack/numerix/v1.0.0/benchmarks","docId":"numerix/v1.0.0/benchmarks","unlisted":false},{"type":"link","label":"Key Functionalities","href":"/BharatMLStack/numerix/v1.0.0/functionalities","docId":"numerix/v1.0.0/functionalities","unlisted":false},{"type":"link","label":"Release Notes","href":"/BharatMLStack/numerix/v1.0.0/release-notes","docId":"numerix/v1.0.0/release-notes","unlisted":false}],"href":"/BharatMLStack/numerix/v1.0.0"}],"href":"/BharatMLStack/category/numerix"},{"type":"category","label":"Predator","collapsible":true,"collapsed":true,"items":[{"type":"category","label":"v1.0.0","collapsible":true,"collapsed":true,"items":[{"type":"link","label":"Architecture","href":"/BharatMLStack/predator/v1.0.0/architecture","docId":"predator/v1.0.0/architecture","unlisted":false},{"type":"link","label":"Key Functionalities","href":"/BharatMLStack/predator/v1.0.0/functionalities","docId":"predator/v1.0.0/functionalities","unlisted":false},{"type":"link","label":"Release Notes","href":"/BharatMLStack/predator/v1.0.0/release-notes","docId":"predator/v1.0.0/release-notes","unlisted":false}],"href":"/BharatMLStack/predator/v1.0.0"}],"href":"/BharatMLStack/category/predator"}]},"docs":{"inferflow/v1.0.0/architecture":{"id":"inferflow/v1.0.0/architecture","title":"Architecture","description":"Inferflow is part of BharatMLStack, a graph-driven feature retrieval and model inference orchestration engine built in Go. 
It eliminates the need for custom feature retrieval code by using configurable DAG topologies to dynamically resolve entity relationships, fetch features from the Online Feature Store, and orchestrate model scoring \u2014 all driven by configuration stored in etcd.","sidebar":"tutorialSidebar"},"inferflow/v1.0.0/configuration":{"id":"inferflow/v1.0.0/configuration","title":"Configuration Guide","description":"Inferflow is fully config-driven. All model onboarding, feature retrieval logic, DAG topology, and inference behavior are controlled through configuration stored in etcd \u2014 with zero code changes required.","sidebar":"tutorialSidebar"},"inferflow/v1.0.0/functionalities":{"id":"inferflow/v1.0.0/functionalities","title":"Key Functionalities","description":"Overview","sidebar":"tutorialSidebar"},"inferflow/v1.0.0/index":{"id":"inferflow/v1.0.0/index","title":"v1.0.0","description":"Inferflow v1.0.0","sidebar":"tutorialSidebar"},"inferflow/v1.0.0/release-notes":{"id":"inferflow/v1.0.0/release-notes","title":"Release Notes","description":"Version 1.0.0","sidebar":"tutorialSidebar"},"intro":{"id":"intro","title":"BharatMLStack Documentation","description":"Welcome to the BharatMLStack documentation. BharatMLStack is an open-source, end-to-end ML infrastructure stack built for scale, speed, and simplicity. Explore the components below to get started.","sidebar":"tutorialSidebar"},"numerix/v1.0.0/architecture":{"id":"numerix/v1.0.0/architecture","title":"Architecture","description":"---","sidebar":"tutorialSidebar"},"numerix/v1.0.0/benchmarks":{"id":"numerix/v1.0.0/benchmarks","title":"Benchmarks","description":"This PoC measures the performance of vector addition in Rust with and without compiler SIMD optimizations. Requests consist of repeated fixed-size vector addition operations processed in parallel by the CPU. 
These results provide perspective on how much faster SIMD makes vectorized computations, and similar improvements are expected for other vectorized operations in Numerix.","sidebar":"tutorialSidebar"},"numerix/v1.0.0/functionalities":{"id":"numerix/v1.0.0/functionalities","title":"Key Functionalities","description":"Overview","sidebar":"tutorialSidebar"},"numerix/v1.0.0/index":{"id":"numerix/v1.0.0/index","title":"v1.0.0","description":"Numerix v1.0.0","sidebar":"tutorialSidebar"},"numerix/v1.0.0/release-notes":{"id":"numerix/v1.0.0/release-notes","title":"Release Notes","description":"Version 1.0.0 \ud83d\ude80","sidebar":"tutorialSidebar"},"online-feature-store/v1.0.0/architecture":{"id":"online-feature-store/v1.0.0/architecture","title":"Architecture","description":"The Online Feature Store (OnFS) is part of BharatMLStack, designed to support real-time ML workloads through low-latency feature retrieval and flexible feature ingestion pipelines. It ensures that features generated offline or online are immediately accessible for inference.","sidebar":"tutorialSidebar"},"online-feature-store/v1.0.0/benchmarks":{"id":"online-feature-store/v1.0.0/benchmarks","title":"Benchmarks","description":"Summary","sidebar":"tutorialSidebar"},"online-feature-store/v1.0.0/data-formats":{"id":"online-feature-store/v1.0.0/data-formats","title":"Data Formats","description":"In this section we will go through the data-formats at the heart of the online-feature-store. They are inspired by other storage-efficient formats like Parquet & Arrow, but custom-built to deliver in constrained environments.
The two key data-formats are:","sidebar":"tutorialSidebar"},"online-feature-store/v1.0.0/functionalities":{"id":"online-feature-store/v1.0.0/functionalities","title":"Key Functionalities","description":"Overview","sidebar":"tutorialSidebar"},"online-feature-store/v1.0.0/index":{"id":"online-feature-store/v1.0.0/index","title":"v1.0.0","description":"Online Feature Store v1.0.0","sidebar":"tutorialSidebar"},"online-feature-store/v1.0.0/release-notes":{"id":"online-feature-store/v1.0.0/release-notes","title":"Release Notes","description":"Version 1.0.0 \ud83d\ude80","sidebar":"tutorialSidebar"},"predator/v1.0.0/architecture":{"id":"predator/v1.0.0/architecture","title":"Architecture","description":"Predator is a scalable, high-performance model inference service built as a wrapper around the NVIDIA Triton Inference Server. It is designed to serve a variety of machine learning models (Deep Learning, Tree-based, etc.) with low latency in a Kubernetes (K8s) environment.","sidebar":"tutorialSidebar"},"predator/v1.0.0/functionalities":{"id":"predator/v1.0.0/functionalities","title":"Key Functionalities","description":"Overview","sidebar":"tutorialSidebar"},"predator/v1.0.0/index":{"id":"predator/v1.0.0/index","title":"v1.0.0","description":"Predator v1.0.0","sidebar":"tutorialSidebar"},"predator/v1.0.0/release-notes":{"id":"predator/v1.0.0/release-notes","title":"Release Notes","description":"Version 1.0.0","sidebar":"tutorialSidebar"},"quick-start/v1.0.0/index":{"id":"quick-start/v1.0.0/index","title":"v1.0.0","description":"Quick Start v1.0.0","sidebar":"tutorialSidebar"},"quick-start/v1.0.0/quick-start":{"id":"quick-start/v1.0.0/quick-start","title":"Quick Start","description":"Discord","sidebar":"tutorialSidebar"},"sdks/go/v1.0.0/feature_client":{"id":"sdks/go/v1.0.0/feature_client","title":"GRPC Feature client","description":"Build Status","sidebar":"tutorialSidebar"},"sdks/go/v1.0.0/index":{"id":"sdks/go/v1.0.0/index","title":"v1.0.0","description":"Go SDK 
v1.0.0","sidebar":"tutorialSidebar"},"sdks/python/v1.0.0/grpc_feature_client":{"id":"sdks/python/v1.0.0/grpc_feature_client","title":"GRPC Feature client","description":"PyPI version","sidebar":"tutorialSidebar"},"sdks/python/v1.0.0/index":{"id":"sdks/python/v1.0.0/index","title":"v1.0.0","description":"Python SDK v1.0.0","sidebar":"tutorialSidebar"},"sdks/python/v1.0.0/spark_feature_push_client":{"id":"sdks/python/v1.0.0/spark_feature_push_client","title":"Spark client","description":"PyPI version","sidebar":"tutorialSidebar"},"skye/v1.0.0/architecture":{"id":"skye/v1.0.0/architecture","title":"Architecture","description":"Skye is BharatMLStack\'s vector similarity search platform that enables fast semantic retrieval by representing data as vectors and querying nearest matches in high-dimensional space. It is composed of three runnable components: skye-admin, skye-consumers, and skye-serving.","sidebar":"tutorialSidebar"},"skye/v1.0.0/functionalities":{"id":"skye/v1.0.0/functionalities","title":"Functionalities","description":"Core Capabilities","sidebar":"tutorialSidebar"},"skye/v1.0.0/index":{"id":"skye/v1.0.0/index","title":"v1.0.0","description":"Skye v1.0.0","sidebar":"tutorialSidebar"},"skye/v1.0.0/release-notes":{"id":"skye/v1.0.0/release-notes","title":"Release Notes","description":"v1.0.0","sidebar":"tutorialSidebar"},"trufflebox-ui/v1.0.0/index":{"id":"trufflebox-ui/v1.0.0/index","title":"v1.0.0","description":"Trufflebox UI v1.0.0","sidebar":"tutorialSidebar"},"trufflebox-ui/v1.0.0/userguide":{"id":"trufflebox-ui/v1.0.0/userguide","title":"User Manual","description":"This guide covers the complete setup and usage of the Online Feature Store system, including the core services (Online Feature Store and Horizon) and the TruffleBox UI for feature management.","sidebar":"tutorialSidebar"}}}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/428aafcc.2c1db158.js b/docs/assets/js/428aafcc.2c1db158.js deleted file mode 100644 index 
813dbcdf..00000000 --- a/docs/assets/js/428aafcc.2c1db158.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[5503],{702:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/vss-c482f6eac4c68b3219e4c562a6b717ec.png"},788:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-three","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-three/index.md","source":"@site/blog/bharatmlstack-history/post-three/index.md","title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","description":"BharatMLStack","date":"2024-05-21T00:00:00.000Z","tags":[{"inline":true,"label":"model-inference","permalink":"/BharatMLStack/blog/tags/model-inference"},{"inline":true,"label":"embedding-search","permalink":"/BharatMLStack/blog/tags/embedding-search"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":3.6,"hasTruncateMarker":false,"authors":[{"name":"Aditya Kumar","title":"Lead Software Engineer @ Meesho","url":"https://github.com/Adit2607","imageURL":"https://github.com/Adit2607.png","key":"aditya","page":null},{"name":"Jaya Kumar","title":"Lead ML Engineer @ Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null},{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null}],"frontMatter":{"slug":"post-three","title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding 
Search","authors":["aditya","jaya","adarsha"],"date":"2024-05-21T00:00:00.000Z","tags":["model-inference","embedding-search","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","permalink":"/BharatMLStack/blog/post-three"},"nextItem":{"title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","permalink":"/BharatMLStack/blog/post-two"}}')},6e3:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"},7999:(e,n,t)=>{t.r(n),t.d(n,{assets:()=>o,contentTitle:()=>l,default:()=>h,frontMatter:()=>s,metadata:()=>i,toc:()=>d});var i=t(788),a=t(4848),r=t(8453);const s={slug:"post-three",title:"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search",authors:["aditya","jaya","adarsha"],date:new Date("2024-05-21T00:00:00.000Z"),tags:["model-inference","embedding-search","mlplatform","meesho","bharatmlstack"]},l=void 0,o={authorsImageUrls:[void 0,void 0,void 0]},d=[{value:"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search",id:"cracking-the-code-scaling-model-inference--real-time-embedding-search",level:2},{value:"Breaking Free from the Scalability Ceiling",id:"breaking-free-from-the-scalability-ceiling",level:2},{value:"The Model Serving Bottleneck\u2014A Wake-Up Call",id:"the-model-serving-bottlenecka-wake-up-call",level:3},{value:"Scaling Triton on GKE",id:"scaling-triton-on-gke",level:3},{value:"Fixing the Cold Start Problem",id:"fixing-the-cold-start-problem",level:3},{value:"Embedding Search: The Last Piece of the Puzzle",id:"embedding-search-the-last-piece-of-the-puzzle",level:2},{value:"Choosing the Right Vector Database",id:"choosing-the-right-vector-database",level:3},{value:"Embedding Freshness & Real-Time Updates",id:"embedding-freshness--real-time-updates",level:3},{value:"Final Takeaways: Scaling Smartly for Real-Time 
ML",id:"final-takeaways-scaling-smartly-for-real-time-ml",level:2}];function c(e){const n={h2:"h2",h3:"h3",img:"img",li:"li",p:"p",ul:"ul",...(0,r.R)(),...e.components};return(0,a.jsxs)(a.Fragment,{children:[(0,a.jsx)(n.p,{children:(0,a.jsx)(n.img,{alt:"BharatMLStack",src:t(6e3).A+"",width:"1396",height:"460"})}),"\n",(0,a.jsx)(n.h2,{id:"cracking-the-code-scaling-model-inference--real-time-embedding-search",children:"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search"}),"\n",(0,a.jsx)(n.p,{children:"By mid-2023, we had transformed our ML stack\u2014building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\udd39 Scaling model inference without hitting infrastructure roadblocks"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\udd39 Moving embedding search from batch to real-time for candidate generation"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"Here\u2019s how we tackled these last-mile challenges, broke free from infrastructure constraints, and built a cost-efficient, high-performance system."}),"\n",(0,a.jsx)(n.h2,{id:"breaking-free-from-the-scalability-ceiling",children:"Breaking Free from the Scalability Ceiling"}),"\n",(0,a.jsx)(n.h3,{id:"the-model-serving-bottlenecka-wake-up-call",children:"The Model Serving Bottleneck\u2014A Wake-Up Call"}),"\n",(0,a.jsx)(n.p,{children:"July 2023. With just months left for the Mega Blockbuster Sale (MBS), we noticed a serious issue\u2014scaling our model-serving infrastructure was taking 10\u201315 minutes. 
In real-time ML, that\u2019s an eternity.\nIn one of our war rooms, we ran a quick experiment:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 We deployed an XGBoost model on a self-hosted Triton Inference Server running on a 16-core machine."}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Fired requests and compared the outputs with our existing cloud-hosted setup."}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 The results matched\u2014perfectly."}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:'That moment changed everything. We prepped a backup Triton setup on EKS, just in case our cloud provider couldn\'t allocate enough compute resources in time. Luckily, they did\u2014but the seed was planted.\nThen in October, just two weeks before MBS, we got an alarming response from our infrastructure team:\n"Node availability may be an issue."\nWith no time to waste, we moved 30% of real-time ML traffic to our self-hosted Triton cluster. The results?'}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\u2705 p99 latency dropped from 90\u2013100ms to 30\u201340ms"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Triton handled significantly higher throughput on fewer resources"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 No model changes were needed"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"MBS ran without a hitch, proving that self-hosted inference was the way forward."}),"\n",(0,a.jsx)(n.h3,{id:"scaling-triton-on-gke",children:"Scaling Triton on GKE"}),"\n",(0,a.jsx)(n.p,{children:"This left us with two choices:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"1\ufe0f\u20e3 Port models to a managed cloud inference service, investing time in learning a new deployment stack"}),"\n",(0,a.jsx)(n.li,{children:"2\ufe0f\u20e3 Scale our existing Triton setup on GKE, optimizing for cost and performance"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"We went with Option 2\u2014and it slashed inference costs to 35% of what we previously paid, while 
giving us full control over scaling and optimizations."}),"\n",(0,a.jsx)(n.h3,{id:"fixing-the-cold-start-problem",children:"Fixing the Cold Start Problem"}),"\n",(0,a.jsx)(n.p,{children:"As we onboarded more deep learning (DL) models, we hit a new bottleneck, new inference pods took 7\u20139 minutes to spin up."}),"\n",(0,a.jsx)(n.p,{children:"After profiling, we found the culprits:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"Triton\u2019s base image\u2014a massive 5GB"}),"\n",(0,a.jsx)(n.li,{children:"Model binaries\u2014often 1GB+"}),"\n",(0,a.jsx)(n.li,{children:"Startup delay\u2014mostly due to downloading and initializing these assets"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"To fix this, we built a lightweight Triton image, stripping unused components and shrinking the size to 900MB. This cut cold start times drastically, making auto-scaling faster and smoother."}),"\n",(0,a.jsx)(n.h2,{id:"embedding-search-the-last-piece-of-the-puzzle",children:"Embedding Search: The Last Piece of the Puzzle"}),"\n",(0,a.jsx)(n.p,{children:"By mid-2023, most of our ML stack had gone real-time\u2014except for Candidate Generation (CG), which still ran in batch mode. 
To truly power real-time recommendations, we needed an online embedding search system."}),"\n",(0,a.jsx)(n.h3,{id:"choosing-the-right-vector-database",children:"Choosing the Right Vector Database"}),"\n",(0,a.jsx)(n.p,{children:"We benchmarked three production-ready vector DBs across key parameters:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"Milvus"}),"\n",(0,a.jsx)(n.li,{children:"Qdrant"}),"\n",(0,a.jsx)(n.li,{children:"Weaviate"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"After extensive POCs, Qdrant stood out for its:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\u2705 Blazing-fast search latency on high-dimensional vectors"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Efficient memory usage, crucial for in-memory workloads"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Support for upserts and soft deletes, vital for Ads use cases"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 gRPC + REST APIs, making integration seamless"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Powerful filtering, allowing fine-tuned retrieval (e.g., filtering Ads by category, active status, etc.)"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"At its core, Qdrant uses HNSW indexing, delivering both high recall and low-latency nearest-neighbor search\u2014a perfect fit for our needs."}),"\n",(0,a.jsx)(n.h3,{id:"embedding-freshness--real-time-updates",children:"Embedding Freshness & Real-Time Updates"}),"\n",(0,a.jsx)(n.p,{children:"To ensure embeddings stayed up to date, we built a dual ingestion pipeline:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\udccc Daily Refresh: A bulk pipeline updated embeddings overnight"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\udccc Real-Time Updates: Ads events triggered immediate upserts/deletes"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:'This setup powered real-time "Similar Products" recommendations on the product page and became the foundation for Ads Candidate Generation, ensuring the right ads surfaced in 
milliseconds.'}),"\n",(0,a.jsx)(n.p,{children:(0,a.jsx)(n.img,{alt:"Skye",src:t(702).A+"",width:"1260",height:"644"})}),"\n",(0,a.jsx)(n.h2,{id:"final-takeaways-scaling-smartly-for-real-time-ml",children:"Final Takeaways: Scaling Smartly for Real-Time ML"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Self-hosted inference on Triton gave us lower cost, faster scaling, and better performance than managed services"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Building a custom Triton image reduced cold starts, improving responsiveness"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Qdrant-based embedding search enabled real-time personalization at scale"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Real-time updates for embeddings unlocked dynamic, up-to-date recommendations"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"By early 2024, Meesho\u2019s ML stack had evolved into a fully real-time, scalable, and cost-efficient system, setting the foundation for even bigger leaps ahead."})]})}function h(e={}){const{wrapper:n}={...(0,r.R)(),...e.components};return n?(0,a.jsx)(n,{...e,children:(0,a.jsx)(c,{...e})}):c(e)}},8453:(e,n,t)=>{t.d(n,{R:()=>s,x:()=>l});var i=t(6540);const a={},r=i.createContext(a);function s(e){const n=i.useContext(r);return i.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function l(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(a):e.components||a:s(e.components),i.createElement(r.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/428aafcc.e29d0b89.js b/docs/assets/js/428aafcc.e29d0b89.js new file mode 100644 index 00000000..78e9c035 --- /dev/null +++ b/docs/assets/js/428aafcc.e29d0b89.js @@ -0,0 +1 @@ +"use 
strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[5503],{788:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-three","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-three/index.md","source":"@site/blog/bharatmlstack-history/post-three/index.md","title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","description":"BharatMLStack","date":"2024-05-21T00:00:00.000Z","tags":[{"inline":true,"label":"model-inference","permalink":"/BharatMLStack/blog/tags/model-inference"},{"inline":true,"label":"embedding-search","permalink":"/BharatMLStack/blog/tags/embedding-search"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":3.6,"hasTruncateMarker":false,"authors":[{"name":"Aditya Kumar","title":"Lead Software Engineer @ Meesho","url":"https://github.com/Adit2607","imageURL":"https://github.com/Adit2607.png","key":"aditya","page":null},{"name":"Jaya Kumar","title":"Lead ML Engineer @ Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null},{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null}],"frontMatter":{"slug":"post-three","title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","authors":["aditya","jaya","adarsha"],"date":"2024-05-21T00:00:00.000Z","tags":["model-inference","embedding-search","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","permalink":"/BharatMLStack/blog/post-four"},"nextItem":{"title":"Building Meesho\u2019s 
ML Platform: Lessons from the First-Gen System (Part 2)","permalink":"/BharatMLStack/blog/post-two"}}')},3217:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/vss-c482f6eac4c68b3219e4c562a6b717ec.png"},4411:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"},7999:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>o,contentTitle:()=>l,default:()=>h,frontMatter:()=>s,metadata:()=>t,toc:()=>d});var t=i(788),a=i(4848),r=i(8453);const s={slug:"post-three",title:"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search",authors:["aditya","jaya","adarsha"],date:new Date("2024-05-21T00:00:00.000Z"),tags:["model-inference","embedding-search","mlplatform","meesho","bharatmlstack"]},l=void 0,o={authorsImageUrls:[void 0,void 0,void 0]},d=[{value:"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search",id:"cracking-the-code-scaling-model-inference--real-time-embedding-search",level:2},{value:"Breaking Free from the Scalability Ceiling",id:"breaking-free-from-the-scalability-ceiling",level:2},{value:"The Model Serving Bottleneck\u2014A Wake-Up Call",id:"the-model-serving-bottlenecka-wake-up-call",level:3},{value:"Scaling Triton on GKE",id:"scaling-triton-on-gke",level:3},{value:"Fixing the Cold Start Problem",id:"fixing-the-cold-start-problem",level:3},{value:"Embedding Search: The Last Piece of the Puzzle",id:"embedding-search-the-last-piece-of-the-puzzle",level:2},{value:"Choosing the Right Vector Database",id:"choosing-the-right-vector-database",level:3},{value:"Embedding Freshness & Real-Time Updates",id:"embedding-freshness--real-time-updates",level:3},{value:"Final Takeaways: Scaling Smartly for Real-Time ML",id:"final-takeaways-scaling-smartly-for-real-time-ml",level:2}];function c(e){const 
n={h2:"h2",h3:"h3",img:"img",li:"li",p:"p",ul:"ul",...(0,r.R)(),...e.components};return(0,a.jsxs)(a.Fragment,{children:[(0,a.jsx)(n.p,{children:(0,a.jsx)(n.img,{alt:"BharatMLStack",src:i(4411).A+"",width:"1396",height:"460"})}),"\n",(0,a.jsx)(n.h2,{id:"cracking-the-code-scaling-model-inference--real-time-embedding-search",children:"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search"}),"\n",(0,a.jsx)(n.p,{children:"By mid-2023, we had transformed our ML stack\u2014building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\udd39 Scaling model inference without hitting infrastructure roadblocks"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\udd39 Moving embedding search from batch to real-time for candidate generation"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"Here\u2019s how we tackled these last-mile challenges, broke free from infrastructure constraints, and built a cost-efficient, high-performance system."}),"\n",(0,a.jsx)(n.h2,{id:"breaking-free-from-the-scalability-ceiling",children:"Breaking Free from the Scalability Ceiling"}),"\n",(0,a.jsx)(n.h3,{id:"the-model-serving-bottlenecka-wake-up-call",children:"The Model Serving Bottleneck\u2014A Wake-Up Call"}),"\n",(0,a.jsx)(n.p,{children:"July 2023. With just months left for the Mega Blockbuster Sale (MBS), we noticed a serious issue\u2014scaling our model-serving infrastructure was taking 10\u201315 minutes. 
In real-time ML, that\u2019s an eternity.\nIn one of our war rooms, we ran a quick experiment:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 We deployed an XGBoost model on a self-hosted Triton Inference Server running on a 16-core machine."}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Fired requests and compared the outputs with our existing cloud-hosted setup."}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 The results matched\u2014perfectly."}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:'That moment changed everything. We prepped a backup Triton setup on EKS, just in case our cloud provider couldn\'t allocate enough compute resources in time. Luckily, they did\u2014but the seed was planted.\nThen in October, just two weeks before MBS, we got an alarming response from our infrastructure team:\n"Node availability may be an issue."\nWith no time to waste, we moved 30% of real-time ML traffic to our self-hosted Triton cluster. The results?'}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\u2705 p99 latency dropped from 90\u2013100ms to 30\u201340ms"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Triton handled significantly higher throughput on fewer resources"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 No model changes were needed"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"MBS ran without a hitch, proving that self-hosted inference was the way forward."}),"\n",(0,a.jsx)(n.h3,{id:"scaling-triton-on-gke",children:"Scaling Triton on GKE"}),"\n",(0,a.jsx)(n.p,{children:"This left us with two choices:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"1\ufe0f\u20e3 Port models to a managed cloud inference service, investing time in learning a new deployment stack"}),"\n",(0,a.jsx)(n.li,{children:"2\ufe0f\u20e3 Scale our existing Triton setup on GKE, optimizing for cost and performance"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"We went with Option 2\u2014and it slashed inference costs to 35% of what we previously paid, while 
giving us full control over scaling and optimizations."}),"\n",(0,a.jsx)(n.h3,{id:"fixing-the-cold-start-problem",children:"Fixing the Cold Start Problem"}),"\n",(0,a.jsx)(n.p,{children:"As we onboarded more deep learning (DL) models, we hit a new bottleneck: new inference pods took 7\u20139 minutes to spin up."}),"\n",(0,a.jsx)(n.p,{children:"After profiling, we found the culprits:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"Triton\u2019s base image\u2014a massive 5GB"}),"\n",(0,a.jsx)(n.li,{children:"Model binaries\u2014often 1GB+"}),"\n",(0,a.jsx)(n.li,{children:"Startup delay\u2014mostly due to downloading and initializing these assets"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"To fix this, we built a lightweight Triton image, stripping unused components and shrinking the size to 900MB. This cut cold start times drastically, making auto-scaling faster and smoother."}),"\n",(0,a.jsx)(n.h2,{id:"embedding-search-the-last-piece-of-the-puzzle",children:"Embedding Search: The Last Piece of the Puzzle"}),"\n",(0,a.jsx)(n.p,{children:"By mid-2023, most of our ML stack had gone real-time\u2014except for Candidate Generation (CG), which still ran in batch mode.
To truly power real-time recommendations, we needed an online embedding search system."}),"\n",(0,a.jsx)(n.h3,{id:"choosing-the-right-vector-database",children:"Choosing the Right Vector Database"}),"\n",(0,a.jsx)(n.p,{children:"We benchmarked three production-ready vector DBs across key parameters:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"Milvus"}),"\n",(0,a.jsx)(n.li,{children:"Qdrant"}),"\n",(0,a.jsx)(n.li,{children:"Weaviate"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"After extensive POCs, Qdrant stood out for its:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\u2705 Blazing-fast search latency on high-dimensional vectors"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Efficient memory usage, crucial for in-memory workloads"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Support for upserts and soft deletes, vital for Ads use cases"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 gRPC + REST APIs, making integration seamless"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Powerful filtering, allowing fine-tuned retrieval (e.g., filtering Ads by category, active status, etc.)"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"At its core, Qdrant uses HNSW indexing, delivering both high recall and low-latency nearest-neighbor search\u2014a perfect fit for our needs."}),"\n",(0,a.jsx)(n.h3,{id:"embedding-freshness--real-time-updates",children:"Embedding Freshness & Real-Time Updates"}),"\n",(0,a.jsx)(n.p,{children:"To ensure embeddings stayed up to date, we built a dual ingestion pipeline:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\udccc Daily Refresh: A bulk pipeline updated embeddings overnight"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\udccc Real-Time Updates: Ads events triggered immediate upserts/deletes"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:'This setup powered real-time "Similar Products" recommendations on the product page and became the foundation for Ads Candidate Generation, ensuring the right ads surfaced in 
milliseconds.'}),"\n",(0,a.jsx)(n.p,{children:(0,a.jsx)(n.img,{alt:"Skye",src:i(3217).A+"",width:"1260",height:"644"})}),"\n",(0,a.jsx)(n.h2,{id:"final-takeaways-scaling-smartly-for-real-time-ml",children:"Final Takeaways: Scaling Smartly for Real-Time ML"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Self-hosted inference on Triton gave us lower cost, faster scaling, and better performance than managed services"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Building a custom Triton image reduced cold starts, improving responsiveness"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Qdrant-based embedding search enabled real-time personalization at scale"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Real-time updates for embeddings unlocked dynamic, up-to-date recommendations"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"By early 2024, Meesho\u2019s ML stack had evolved into a fully real-time, scalable, and cost-efficient system, setting the foundation for even bigger leaps ahead."})]})}function h(e={}){const{wrapper:n}={...(0,r.R)(),...e.components};return n?(0,a.jsx)(n,{...e,children:(0,a.jsx)(c,{...e})}):c(e)}},8453:(e,n,i)=>{i.d(n,{R:()=>s,x:()=>l});var t=i(6540);const a={},r=t.createContext(a);function s(e){const n=t.useContext(r);return t.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function l(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(a):e.components||a:s(e.components),t.createElement(r.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/44d1c015.4db6c425.js b/docs/assets/js/44d1c015.880095c2.js similarity index 86% rename from docs/assets/js/44d1c015.4db6c425.js rename to docs/assets/js/44d1c015.880095c2.js index 00df5e29..8d256db2 100644 --- a/docs/assets/js/44d1c015.4db6c425.js +++ b/docs/assets/js/44d1c015.880095c2.js @@ -1 +1 @@ -"use 
strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[1065],{6725:t=>{t.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"Python SDK","description":"Python SDK for BharatML Stack. Provides Python client libraries and utilities for interacting with the online feature store, including gRPC clients, Spark integration, and common utilities.","slug":"/category/python-sdk","permalink":"/BharatMLStack/category/python-sdk","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"GRPC Feature client","permalink":"/BharatMLStack/sdks/go/v1.0.0/feature_client"},"next":{"title":"v1.0.0","permalink":"/BharatMLStack/category/v100"}}}}')}}]); \ No newline at end of file +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[1065],{6725:t=>{t.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"Python SDK","description":"Python SDK for BharatML Stack. Provides Python client libraries and utilities for interacting with the online feature store, including gRPC clients, Spark integration, and common utilities.","slug":"/category/python-sdk","permalink":"/BharatMLStack/category/python-sdk","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"GRPC Feature client","permalink":"/BharatMLStack/sdks/go/v1.0.0/feature_client"},"next":{"title":"v1.0.0","permalink":"/BharatMLStack/sdks/python/v1.0.0"}}}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/45a03d07.6212d749.js b/docs/assets/js/45a03d07.6212d749.js deleted file mode 100644 index 9dbe9ea8..00000000 --- a/docs/assets/js/45a03d07.6212d749.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[9955],{8539:e=>{e.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"v1.0.0","description":"Numerix 
v1.0.0","slug":"/inferflow/v1.0.0","permalink":"/BharatMLStack/inferflow/v1.0.0","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"Inferflow","permalink":"/BharatMLStack/category/inferflow"},"next":{"title":"Architecture","permalink":"/BharatMLStack/inferflow/v1.0.0/architecture"}}}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/4af50aac.59b38fde.js b/docs/assets/js/4af50aac.59b38fde.js new file mode 100644 index 00000000..f26cb200 --- /dev/null +++ b/docs/assets/js/4af50aac.59b38fde.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[1964],{6220:(e,n,t)=>{t.r(n),t.d(n,{assets:()=>a,contentTitle:()=>o,default:()=>h,frontMatter:()=>l,metadata:()=>s,toc:()=>c});const s=JSON.parse('{"id":"sdks/go/v1.0.0/feature_client","title":"GRPC Feature client","description":"Build Status","source":"@site/docs/sdks/go/v1.0.0/feature_client.md","sourceDirName":"sdks/go/v1.0.0","slug":"/sdks/go/v1.0.0/feature_client","permalink":"/BharatMLStack/sdks/go/v1.0.0/feature_client","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/sdks/go/v1.0.0/feature_client.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"GRPC Feature client","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"v1.0.0","permalink":"/BharatMLStack/sdks/go/v1.0.0"},"next":{"title":"Python SDK","permalink":"/BharatMLStack/category/python-sdk"}}');var i=t(4848),r=t(8453);const l={title:"GRPC Feature client",sidebar_position:1},o="BharatMLStack Go SDK",a={},c=[{value:"Features",id:"features",level:2},{value:"Installation",id:"installation",level:2},{value:"Configuration",id:"configuration",level:2},{value:"Usage",id:"usage",level:2},{value:"Basic Usage",id:"basic-usage",level:3},{value:"Complete 
Example",id:"complete-example",level:3},{value:"Development",id:"development",level:2},{value:"Prerequisites",id:"prerequisites",level:3},{value:"Building",id:"building",level:3},{value:"Testing",id:"testing",level:3},{value:"Contributing",id:"contributing",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function d(e){const n={a:"a",code:"code",h1:"h1",h2:"h2",h3:"h3",header:"header",hr:"hr",img:"img",li:"li",p:"p",pre:"pre",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,r.R)(),...e.components};return(0,i.jsxs)(i.Fragment,{children:[(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.img,{src:"https://github.com/Meesho/BharatMLStack/actions/workflows/go-sdk.yml/badge.svg",alt:"Build Status"}),"\n",(0,i.jsx)(n.img,{src:"https://img.shields.io/badge/release-v1.0.0-blue?style=flat",alt:"Static Badge"}),"\n",(0,i.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:(0,i.jsx)(n.img,{src:"https://img.shields.io/badge/Discord-Join%20Chat-7289da?style=flat&logo=discord&logoColor=white",alt:"Discord"})})]}),"\n",(0,i.jsx)(n.header,{children:(0,i.jsx)(n.h1,{id:"bharatmlstack-go-sdk",children:"BharatMLStack Go SDK"})}),"\n",(0,i.jsx)(n.p,{children:"A Go SDK for interacting with BharatMLStack components, providing easy-to-use client libraries for the Online Feature Store and other services."}),"\n",(0,i.jsx)(n.h2,{id:"features",children:"Features"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Online Feature Store Client"}),": Complete gRPC client for feature retrieval and persistence"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Multiple API Methods"}),": Support for ",(0,i.jsx)(n.code,{children:"RetrieveFeatures"}),", ",(0,i.jsx)(n.code,{children:"RetrieveDecodedFeatures"}),", and ",(0,i.jsx)(n.code,{children:"PersistFeatures"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Protocol Buffer 
Support"}),": Generated clients from proto definitions with full type safety"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Batch Processing"}),": Configurable batch sizes for efficient bulk operations"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Authentication"}),": Built-in support for caller ID and token-based authentication"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Connection Management"}),": Configurable timeouts, TLS, and connection pooling"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Metrics Integration"}),": Built-in timing and count metrics for monitoring"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Type-Safe API"}),": Strongly typed Go interfaces and data structures"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Test Coverage"}),": Comprehensive test suite with mocking support"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"installation",children:"Installation"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"go get github.com/Meesho/BharatMLStack/go-sdk\n"})}),"\n",(0,i.jsx)(n.h2,{id:"configuration",children:"Configuration"}),"\n",(0,i.jsx)(n.p,{children:"The SDK requires a configuration object with the following fields:"}),"\n",(0,i.jsxs)(n.table,{children:[(0,i.jsx)(n.thead,{children:(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.th,{children:"Field"}),(0,i.jsx)(n.th,{children:"Type"}),(0,i.jsx)(n.th,{children:"Required"}),(0,i.jsx)(n.th,{children:"Description"})]})}),(0,i.jsxs)(n.tbody,{children:[(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"Host"})}),(0,i.jsx)(n.td,{children:"string"}),(0,i.jsx)(n.td,{children:"Yes"}),(0,i.jsx)(n.td,{children:'Server hostname (e.g., "localhost", 
"feature-store.example.com")'})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"Port"})}),(0,i.jsx)(n.td,{children:"string"}),(0,i.jsx)(n.td,{children:"Yes"}),(0,i.jsx)(n.td,{children:'Server port (e.g., "8080", "443")'})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"CallerId"})}),(0,i.jsx)(n.td,{children:"string"}),(0,i.jsx)(n.td,{children:"Yes"}),(0,i.jsx)(n.td,{children:"Unique identifier for your service/application"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"CallerToken"})}),(0,i.jsx)(n.td,{children:"string"}),(0,i.jsx)(n.td,{children:"Yes"}),(0,i.jsx)(n.td,{children:"Authentication token for API access"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"DeadLine"})}),(0,i.jsx)(n.td,{children:"int"}),(0,i.jsx)(n.td,{children:"No"}),(0,i.jsx)(n.td,{children:"Request timeout in milliseconds (default: 5000)"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"PlainText"})}),(0,i.jsx)(n.td,{children:"bool"}),(0,i.jsx)(n.td,{children:"No"}),(0,i.jsx)(n.td,{children:"Use plaintext connection instead of TLS (default: false)"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"BatchSize"})}),(0,i.jsx)(n.td,{children:"int"}),(0,i.jsx)(n.td,{children:"No"}),(0,i.jsx)(n.td,{children:"Maximum batch size for bulk operations (default: 50)"})]})]})]}),"\n",(0,i.jsx)(n.h2,{id:"usage",children:"Usage"}),"\n",(0,i.jsx)(n.h3,{id:"basic-usage",children:"Basic Usage"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-go",children:'package main\n\nimport (\n "context"\n "log"\n \n "github.com/Meesho/BharatMLStack/go-sdk/pkg/onfs"\n)\n\nfunc main() {\n config := &onfs.Config{\n Host: "localhost",\n Port: "8080",\n PlainText: true, // For local development\n CallerId: "my-service",\n CallerToken: "my-token",\n }\n\n // Initialize client (timing and count 
can be nil)\n client := onfs.NewClientV1(config, nil, nil)\n \n // Your feature operations here...\n}\n'})}),"\n",(0,i.jsx)(n.h3,{id:"complete-example",children:"Complete Example"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-go",children:'package main\n\nimport (\n "context"\n "log"\n "time"\n \n "github.com/Meesho/BharatMLStack/go-sdk/pkg/onfs"\n)\n\nfunc main() {\n // Create configuration\n config := &onfs.Config{\n Host: "localhost",\n Port: "8080",\n DeadLine: 5000, // 5 seconds timeout in milliseconds\n PlainText: true, // Use plaintext connection for local development\n BatchSize: 50, // Optional: batch size for requests\n CallerId: "your-service-id",\n CallerToken: "your-auth-token",\n }\n\n // Timing and count functions (can be nil for basic usage)\n timing := func(name string, value time.Duration, tags []string) {\n log.Printf("Timing: %s took %v with tags %v", name, value, tags)\n }\n count := func(name string, value int64, tags []string) {\n log.Printf("Count: %s = %d with tags %v", name, value, tags)\n }\n\n // Initialize the client\n client := onfs.InitClient(onfs.Version1, config, timing, count)\n // Or alternatively use: client := onfs.NewClientV1(config, timing, count)\n\n ctx := context.Background()\n\n // Example: Retrieve features\n query := &onfs.Query{\n EntityLabel: "user",\n FeatureGroups: []onfs.FeatureGroup{\n {\n Label: "user_features",\n FeatureLabels: []string{"age", "location", "preferences"},\n },\n },\n KeysSchema: []string{"user_id"},\n Keys: []onfs.Keys{\n {Cols: []string{"12345"}},\n {Cols: []string{"67890"}},\n },\n }\n\n result, err := client.RetrieveFeatures(ctx, query)\n if err != nil {\n log.Fatalf("Failed to retrieve features: %v", err)\n }\n\n log.Printf("Retrieved %d rows for entity %s", len(result.Rows), result.EntityLabel)\n\n // Example: Retrieve decoded features (string values)\n decodedResult, err := client.RetrieveDecodedFeatures(ctx, query)\n if err != nil {\n log.Fatalf("Failed to retrieve 
decoded features: %v", err)\n }\n\n log.Printf("Retrieved %d decoded rows", len(decodedResult.Rows))\n\n // Example: Persist features\n persistRequest := &onfs.PersistFeaturesRequest{\n EntityLabel: "user",\n KeysSchema: []string{"user_id"},\n FeatureGroups: []onfs.FeatureGroupSchema{\n {\n Label: "user_features",\n FeatureLabels: []string{"age", "location"},\n },\n },\n Data: []onfs.Data{\n {\n KeyValues: []string{"12345"},\n FeatureValues: []onfs.FeatureValues{\n {\n Values: onfs.Values{\n Int32Values: []int32{25},\n StringValues: []string{"New York"},\n },\n },\n },\n },\n },\n }\n\n persistResponse, err := client.PersistFeatures(ctx, persistRequest)\n if err != nil {\n log.Fatalf("Failed to persist features: %v", err)\n }\n\n log.Printf("Persist result: %s", persistResponse.Message)\n}\n'})}),"\n",(0,i.jsx)(n.h2,{id:"development",children:"Development"}),"\n",(0,i.jsx)(n.h3,{id:"prerequisites",children:"Prerequisites"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Go 1.22 or later (as specified in go.mod)"}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"building",children:"Building"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"# Build all packages\ngo build ./...\n\n# Run tests\ngo test ./...\n\n# Run tests with coverage\ngo test -v -coverprofile=coverage.out ./...\ngo tool cover -html=coverage.out\n"})}),"\n",(0,i.jsx)(n.h3,{id:"testing",children:"Testing"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"# Run all tests\ngo test -v ./...\n\n# Run specific package tests\ngo test -v ./pkg/onfs\n\n# Run with race detection\ngo test -race ./...\n"})}),"\n",(0,i.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,i.jsxs)(n.p,{children:["We welcome contributions from the community! 
Please see our ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,i.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["\ud83d\udcac ",(0,i.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,i.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udc1b ",(0,i.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udce7 ",(0,i.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,i.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,i.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 1.1"}),"."]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function h(e={}){const{wrapper:n}={...(0,r.R)(),...e.components};return n?(0,i.jsx)(n,{...e,children:(0,i.jsx)(d,{...e})}):d(e)}},8453:(e,n,t)=>{t.d(n,{R:()=>l,x:()=>o});var s=t(6540);const i={},r=s.createContext(i);function l(e){const n=s.useContext(r);return s.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof 
e.components?e.components(i):e.components||i:l(e.components),s.createElement(r.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/4af50aac.f9b29cbf.js b/docs/assets/js/4af50aac.f9b29cbf.js deleted file mode 100644 index 03afe77a..00000000 --- a/docs/assets/js/4af50aac.f9b29cbf.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[1964],{6220:(e,n,t)=>{t.r(n),t.d(n,{assets:()=>a,contentTitle:()=>o,default:()=>h,frontMatter:()=>l,metadata:()=>s,toc:()=>c});const s=JSON.parse('{"id":"sdks/go/v1.0.0/feature_client","title":"GRPC Feature client","description":"Build Status","source":"@site/docs/sdks/go/v1.0.0/feature_client.md","sourceDirName":"sdks/go/v1.0.0","slug":"/sdks/go/v1.0.0/feature_client","permalink":"/BharatMLStack/sdks/go/v1.0.0/feature_client","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/sdks/go/v1.0.0/feature_client.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"GRPC Feature client","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"Go SDK","permalink":"/BharatMLStack/category/go-sdk"},"next":{"title":"Python SDK","permalink":"/BharatMLStack/category/python-sdk"}}');var i=t(4848),r=t(8453);const l={title:"GRPC Feature client",sidebar_position:1},o="BharatMLStack Go SDK",a={},c=[{value:"Features",id:"features",level:2},{value:"Installation",id:"installation",level:2},{value:"Configuration",id:"configuration",level:2},{value:"Usage",id:"usage",level:2},{value:"Basic Usage",id:"basic-usage",level:3},{value:"Complete Example",id:"complete-example",level:3},{value:"Development",id:"development",level:2},{value:"Prerequisites",id:"prerequisites",level:3},{value:"Building",id:"building",level:3},{value:"Testing",id:"testing",level:3},{value:"Contributing",id:"contributing",level:2},{value:"Community & 
Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function d(e){const n={a:"a",code:"code",h1:"h1",h2:"h2",h3:"h3",header:"header",hr:"hr",img:"img",li:"li",p:"p",pre:"pre",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,r.R)(),...e.components};return(0,i.jsxs)(i.Fragment,{children:[(0,i.jsxs)(n.p,{children:[(0,i.jsx)(n.img,{src:"https://github.com/Meesho/BharatMLStack/actions/workflows/go-sdk.yml/badge.svg",alt:"Build Status"}),"\n",(0,i.jsx)(n.img,{src:"https://img.shields.io/badge/release-v1.0.0-blue?style=flat",alt:"Static Badge"}),"\n",(0,i.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:(0,i.jsx)(n.img,{src:"https://img.shields.io/badge/Discord-Join%20Chat-7289da?style=flat&logo=discord&logoColor=white",alt:"Discord"})})]}),"\n",(0,i.jsx)(n.header,{children:(0,i.jsx)(n.h1,{id:"bharatmlstack-go-sdk",children:"BharatMLStack Go SDK"})}),"\n",(0,i.jsx)(n.p,{children:"A Go SDK for interacting with BharatMLStack components, providing easy-to-use client libraries for the Online Feature Store and other services."}),"\n",(0,i.jsx)(n.h2,{id:"features",children:"Features"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Online Feature Store Client"}),": Complete gRPC client for feature retrieval and persistence"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Multiple API Methods"}),": Support for ",(0,i.jsx)(n.code,{children:"RetrieveFeatures"}),", ",(0,i.jsx)(n.code,{children:"RetrieveDecodedFeatures"}),", and ",(0,i.jsx)(n.code,{children:"PersistFeatures"})]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Protocol Buffer Support"}),": Generated clients from proto definitions with full type safety"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Batch Processing"}),": Configurable batch sizes for efficient bulk 
operations"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Authentication"}),": Built-in support for caller ID and token-based authentication"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Connection Management"}),": Configurable timeouts, TLS, and connection pooling"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Metrics Integration"}),": Built-in timing and count metrics for monitoring"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Type-Safe API"}),": Strongly typed Go interfaces and data structures"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Test Coverage"}),": Comprehensive test suite with mocking support"]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"installation",children:"Installation"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"go get github.com/Meesho/BharatMLStack/go-sdk\n"})}),"\n",(0,i.jsx)(n.h2,{id:"configuration",children:"Configuration"}),"\n",(0,i.jsx)(n.p,{children:"The SDK requires a configuration object with the following fields:"}),"\n",(0,i.jsxs)(n.table,{children:[(0,i.jsx)(n.thead,{children:(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.th,{children:"Field"}),(0,i.jsx)(n.th,{children:"Type"}),(0,i.jsx)(n.th,{children:"Required"}),(0,i.jsx)(n.th,{children:"Description"})]})}),(0,i.jsxs)(n.tbody,{children:[(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"Host"})}),(0,i.jsx)(n.td,{children:"string"}),(0,i.jsx)(n.td,{children:"Yes"}),(0,i.jsx)(n.td,{children:'Server hostname (e.g., "localhost", "feature-store.example.com")'})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"Port"})}),(0,i.jsx)(n.td,{children:"string"}),(0,i.jsx)(n.td,{children:"Yes"}),(0,i.jsx)(n.td,{children:'Server port (e.g., "8080", 
"443")'})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"CallerId"})}),(0,i.jsx)(n.td,{children:"string"}),(0,i.jsx)(n.td,{children:"Yes"}),(0,i.jsx)(n.td,{children:"Unique identifier for your service/application"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"CallerToken"})}),(0,i.jsx)(n.td,{children:"string"}),(0,i.jsx)(n.td,{children:"Yes"}),(0,i.jsx)(n.td,{children:"Authentication token for API access"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"DeadLine"})}),(0,i.jsx)(n.td,{children:"int"}),(0,i.jsx)(n.td,{children:"No"}),(0,i.jsx)(n.td,{children:"Request timeout in milliseconds (default: 5000)"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"PlainText"})}),(0,i.jsx)(n.td,{children:"bool"}),(0,i.jsx)(n.td,{children:"No"}),(0,i.jsx)(n.td,{children:"Use plaintext connection instead of TLS (default: false)"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:(0,i.jsx)(n.code,{children:"BatchSize"})}),(0,i.jsx)(n.td,{children:"int"}),(0,i.jsx)(n.td,{children:"No"}),(0,i.jsx)(n.td,{children:"Maximum batch size for bulk operations (default: 50)"})]})]})]}),"\n",(0,i.jsx)(n.h2,{id:"usage",children:"Usage"}),"\n",(0,i.jsx)(n.h3,{id:"basic-usage",children:"Basic Usage"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-go",children:'package main\n\nimport (\n "context"\n "log"\n \n "github.com/Meesho/BharatMLStack/go-sdk/pkg/onfs"\n)\n\nfunc main() {\n config := &onfs.Config{\n Host: "localhost",\n Port: "8080",\n PlainText: true, // For local development\n CallerId: "my-service",\n CallerToken: "my-token",\n }\n\n // Initialize client (timing and count can be nil)\n client := onfs.NewClientV1(config, nil, nil)\n \n // Your feature operations here...\n}\n'})}),"\n",(0,i.jsx)(n.h3,{id:"complete-example",children:"Complete 
Example"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-go",children:'package main\n\nimport (\n "context"\n "log"\n "time"\n \n "github.com/Meesho/BharatMLStack/go-sdk/pkg/onfs"\n)\n\nfunc main() {\n // Create configuration\n config := &onfs.Config{\n Host: "localhost",\n Port: "8080",\n DeadLine: 5000, // 5 seconds timeout in milliseconds\n PlainText: true, // Use plaintext connection for local development\n BatchSize: 50, // Optional: batch size for requests\n CallerId: "your-service-id",\n CallerToken: "your-auth-token",\n }\n\n // Timing and count functions (can be nil for basic usage)\n timing := func(name string, value time.Duration, tags []string) {\n log.Printf("Timing: %s took %v with tags %v", name, value, tags)\n }\n count := func(name string, value int64, tags []string) {\n log.Printf("Count: %s = %d with tags %v", name, value, tags)\n }\n\n // Initialize the client\n client := onfs.InitClient(onfs.Version1, config, timing, count)\n // Or alternatively use: client := onfs.NewClientV1(config, timing, count)\n\n ctx := context.Background()\n\n // Example: Retrieve features\n query := &onfs.Query{\n EntityLabel: "user",\n FeatureGroups: []onfs.FeatureGroup{\n {\n Label: "user_features",\n FeatureLabels: []string{"age", "location", "preferences"},\n },\n },\n KeysSchema: []string{"user_id"},\n Keys: []onfs.Keys{\n {Cols: []string{"12345"}},\n {Cols: []string{"67890"}},\n },\n }\n\n result, err := client.RetrieveFeatures(ctx, query)\n if err != nil {\n log.Fatalf("Failed to retrieve features: %v", err)\n }\n\n log.Printf("Retrieved %d rows for entity %s", len(result.Rows), result.EntityLabel)\n\n // Example: Retrieve decoded features (string values)\n decodedResult, err := client.RetrieveDecodedFeatures(ctx, query)\n if err != nil {\n log.Fatalf("Failed to retrieve decoded features: %v", err)\n }\n\n log.Printf("Retrieved %d decoded rows", len(decodedResult.Rows))\n\n // Example: Persist features\n persistRequest := 
&onfs.PersistFeaturesRequest{\n EntityLabel: "user",\n KeysSchema: []string{"user_id"},\n FeatureGroups: []onfs.FeatureGroupSchema{\n {\n Label: "user_features",\n FeatureLabels: []string{"age", "location"},\n },\n },\n Data: []onfs.Data{\n {\n KeyValues: []string{"12345"},\n FeatureValues: []onfs.FeatureValues{\n {\n Values: onfs.Values{\n Int32Values: []int32{25},\n StringValues: []string{"New York"},\n },\n },\n },\n },\n },\n }\n\n persistResponse, err := client.PersistFeatures(ctx, persistRequest)\n if err != nil {\n log.Fatalf("Failed to persist features: %v", err)\n }\n\n log.Printf("Persist result: %s", persistResponse.Message)\n}\n'})}),"\n",(0,i.jsx)(n.h2,{id:"development",children:"Development"}),"\n",(0,i.jsx)(n.h3,{id:"prerequisites",children:"Prerequisites"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Go 1.22 or later (as specified in go.mod)"}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"building",children:"Building"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"# Build all packages\ngo build ./...\n\n# Run tests\ngo test ./...\n\n# Run tests with coverage\ngo test -v -coverprofile=coverage.out ./...\ngo tool cover -html=coverage.out\n"})}),"\n",(0,i.jsx)(n.h3,{id:"testing",children:"Testing"}),"\n",(0,i.jsx)(n.pre,{children:(0,i.jsx)(n.code,{className:"language-bash",children:"# Run all tests\ngo test -v ./...\n\n# Run specific package tests\ngo test -v ./pkg/onfs\n\n# Run with race detection\ngo test -race ./...\n"})}),"\n",(0,i.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,i.jsxs)(n.p,{children:["We welcome contributions from the community! 
Please see our ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,i.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:["\ud83d\udcac ",(0,i.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,i.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udc1b ",(0,i.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,i.jsxs)(n.li,{children:["\ud83d\udce7 ",(0,i.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,i.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,i.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,i.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,i.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 1.1"}),"."]}),"\n",(0,i.jsx)(n.hr,{}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,i.jsx)("div",{align:"center",children:(0,i.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function h(e={}){const{wrapper:n}={...(0,r.R)(),...e.components};return n?(0,i.jsx)(n,{...e,children:(0,i.jsx)(d,{...e})}):d(e)}},8453:(e,n,t)=>{t.d(n,{R:()=>l,x:()=>o});var s=t(6540);const i={},r=s.createContext(i);function l(e){const n=s.useContext(r);return s.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof 
e.components?e.components(i):e.components||i:l(e.components),s.createElement(r.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/4caa95bf.6b01e416.js b/docs/assets/js/4caa95bf.6b01e416.js new file mode 100644 index 00000000..d38ebef1 --- /dev/null +++ b/docs/assets/js/4caa95bf.6b01e416.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[2344],{280:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/v1.0.0-psdb-bool-encoding-4b154fdf5e6d79a67c91b6fb21c7209e.png"},477:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/v1.0.0-psdb-anatomy-c1735559f93dce6d0bb3894d16047059.png"},1477:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/v1.0.0-psdb-string-encoding-b1d69e9452269124d1b545020fa27d63.png"},8453:(e,n,i)=>{i.d(n,{R:()=>d,x:()=>l});var t=i(6540);const s={},r=t.createContext(s);function d(e){const n=t.useContext(r);return t.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function l(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:d(e.components),t.createElement(r.Provider,{value:n},e.children)}},8457:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/v1.0.0-csdb-skip-read-e3926080f7341aa7d3c6ec6d8274ea14.png"},9133:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/v1.0.0-psdb-fixed-length-encodding-dd252110b084e01cf38f21de16b3a1a5.png"},9584:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>c,contentTitle:()=>l,default:()=>h,frontMatter:()=>d,metadata:()=>t,toc:()=>a});const t=JSON.parse('{"id":"online-feature-store/v1.0.0/data-formats","title":"Data Formats","description":"In this section we will go through the data formats at the heart of the online-feature-store. They are inspired by other storage-efficient formats like Parquet and Arrow, but custom-built to deliver in constrained environments. 
The two key data-formats are:","source":"@site/docs/online-feature-store/v1.0.0/data-formats.md","sourceDirName":"online-feature-store/v1.0.0","slug":"/online-feature-store/v1.0.0/data-formats","permalink":"/BharatMLStack/online-feature-store/v1.0.0/data-formats","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/online-feature-store/v1.0.0/data-formats.md","tags":[],"version":"current","sidebarPosition":2,"frontMatter":{"title":"Data Formats","sidebar_position":2},"sidebar":"tutorialSidebar","previous":{"title":"Architecture","permalink":"/BharatMLStack/online-feature-store/v1.0.0/architecture"},"next":{"title":"Benchmarks","permalink":"/BharatMLStack/online-feature-store/v1.0.0/benchmarks"}}');var s=i(4848),r=i(8453);const d={title:"Data Formats",sidebar_position:2},l="Data Format for Permanent & Cache Storage",c={},a=[{value:"PSDB (Permanent Storage Data Block) Format",id:"psdb-permanent-storage-data-block-format",level:2},{value:"\ud83e\uddf1 Structure Overview",id:"-structure-overview",level:3},{value:"Supported Data Types",id:"supported-data-types",level:3},{value:"Scalar Types",id:"scalar-types",level:4},{value:"Vector Types",id:"vector-types",level:4},{value:"\ud83d\udce6 Encoding for Scalar Feature Type",id:"-encoding-for-scalar-feature-type",level:3},{value:"1. \ud83d\udd21 String Feature Group (Variable Length Encoding using Pascal)",id:"1--string-feature-group-variable-length-encoding-using-pascal",level:4},{value:"2. \ud83d\udfe9 Boolean Feature Group (Bit-Packed)",id:"2--boolean-feature-group-bit-packed",level:4},{value:"3. \ud83d\udccf Fixed-Length Feature Group",id:"3--fixed-length-feature-group",level:4},{value:"4. 
Compression",id:"4-compression",level:4},{value:"\ud83e\uddec Encoding for Vector Types",id:"-encoding-for-vector-types",level:3},{value:"Conceptual Overview",id:"conceptual-overview",level:4},{value:"Vector Length Metadata",id:"vector-length-metadata",level:4},{value:"Encoding Process",id:"encoding-process",level:4},{value:"Input Structure",id:"input-structure",level:5},{value:"Length Validation",id:"length-validation",level:5},{value:"Flattening Strategy",id:"flattening-strategy",level:5},{value:"Contiguous Layout",id:"contiguous-layout",level:5},{value:"\ud83d\udd04 Deserialization/Decoding Flow",id:"-deserializationdecoding-flow",level:3},{value:"Memory Efficiency Benefits",id:"memory-efficiency-benefits",level:3},{value:"Cache Storage Data Block (CSDB) Design",id:"cache-storage-data-block-csdb-design",level:2},{value:"Overview",id:"overview",level:3},{value:"Structure and Purpose",id:"structure-and-purpose",level:3},{value:"Core Fields and Memory Layout",id:"core-fields-and-memory-layout",level:4},{value:"Cache Types",id:"cache-types",level:4},{value:"Format & Encoding",id:"format--encoding",level:3},{value:"Differences Between In-Memory and Distributed Caching",id:"differences-between-in-memory-and-distributed-caching",level:3},{value:"Optimizations & Features",id:"optimizations--features",level:3}];function o(e){const n={code:"code",h1:"h1",h2:"h2",h3:"h3",h4:"h4",h5:"h5",header:"header",hr:"hr",img:"img",li:"li",ol:"ol",p:"p",pre:"pre",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,r.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(n.header,{children:(0,s.jsx)(n.h1,{id:"data-format-for-permanent--cache-storage",children:"Data Format for Permanent & Cache Storage"})}),"\n",(0,s.jsx)(n.p,{children:"In this section we will go through the data formats at the heart of the online-feature-store. They are inspired by other storage-efficient formats like Parquet and Arrow, but custom-built to 
deliver in constrained environments. The two key data formats are:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"PSDB"})," - Permanent Storage Data Block, used while storing data in ScyllaDB"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"CSDB"})," - Cache Storage Data Block, used while storing data in DragonflyDB or Redis, optimal for KV stores"]}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"psdb-permanent-storage-data-block-format",children:"PSDB (Permanent Storage Data Block) Format"}),"\n",(0,s.jsxs)(n.p,{children:["The ",(0,s.jsx)(n.strong,{children:"PSDB"})," format is a compact, versioned, and schema-aware binary layout used to store feature groups efficiently for ML inference. It supports multiple datatypes (strings, booleans, fixed-size vectors), versioning, TTL, and metadata encoding in a compact header."]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h3,{id:"-structure-overview",children:"\ud83e\uddf1 Structure Overview"}),"\n",(0,s.jsx)(n.p,{children:"Each PSDB block is composed of multiple byte sections:"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Permanent Storage Data Block Anatomy",src:i(477).A+"",width:"1854",height:"1102"})}),"\n",(0,s.jsxs)(n.table,{children:[(0,s.jsx)(n.thead,{children:(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.th,{children:"Byte"}),(0,s.jsx)(n.th,{children:"Bits"}),(0,s.jsx)(n.th,{children:"Field"}),(0,s.jsx)(n.th,{children:"Description"})]})}),(0,s.jsxs)(n.tbody,{children:[(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"0-1"}),(0,s.jsx)(n.td,{children:"0-15"}),(0,s.jsx)(n.td,{children:"Feature Schema Version"}),(0,s.jsx)(n.td,{children:"Version for tracking schema changes (additions/deletions) in feature group"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"2-6"}),(0,s.jsx)(n.td,{children:"16-55"}),(0,s.jsx)(n.td,{children:"Expiry Timestamp"}),(0,s.jsx)(n.td,{children:"Encoded as a compact representation, ~513 days 
max"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"7"}),(0,s.jsx)(n.td,{children:"56-59"}),(0,s.jsx)(n.td,{children:"Layout Version"}),(0,s.jsx)(n.td,{children:"Used to ensure backward compatibility with layout format changes"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"7"}),(0,s.jsx)(n.td,{children:"60-62"}),(0,s.jsx)(n.td,{children:"Compression Type"}),(0,s.jsx)(n.td,{children:"3-bit field specifying compression algorithm"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"7-8"}),(0,s.jsx)(n.td,{children:"63-67"}),(0,s.jsx)(n.td,{children:"Data Type"}),(0,s.jsx)(n.td,{children:"5-bit field split across bytes 7 and 8"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"8"}),(0,s.jsx)(n.td,{children:"68-71"}),(0,s.jsx)(n.td,{children:"Bool Last Valid Bit"}),(0,s.jsx)(n.td,{children:"4-bit field for last valid boolean bit"})]})]})]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h3,{id:"supported-data-types",children:"Supported Data Types"}),"\n",(0,s.jsx)(n.h4,{id:"scalar-types",children:"Scalar Types"}),"\n",(0,s.jsxs)(n.table,{children:[(0,s.jsx)(n.thead,{children:(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.th,{children:"Type"}),(0,s.jsx)(n.th,{children:"Container"}),(0,s.jsx)(n.th,{children:"Size"}),(0,s.jsx)(n.th,{children:"Description"})]})}),(0,s.jsxs)(n.tbody,{children:[(0,s.jsxs)(n.tr,{children:[(0,s.jsxs)(n.td,{children:[(0,s.jsx)(n.code,{children:"FP32"}),", ",(0,s.jsx)(n.code,{children:"FP16"}),", ",(0,s.jsx)(n.code,{children:"FP8E4M3"}),", ",(0,s.jsx)(n.code,{children:"FP8E5M2"})]}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]float32"})}),(0,s.jsx)(n.td,{children:"4/2/1/1 bytes"}),(0,s.jsx)(n.td,{children:"Floating point numbers"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsxs)(n.td,{children:[(0,s.jsx)(n.code,{children:"Int32"}),", ",(0,s.jsx)(n.code,{children:"Int16"}),", ",(0,s.jsx)(n.code,{children:"Int8"})]}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]int32"})}),(0,s.jsx)(n.td,{children:"4/2/1 
bytes"}),(0,s.jsx)(n.td,{children:"Signed integers"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsxs)(n.td,{children:[(0,s.jsx)(n.code,{children:"Uint32"}),", ",(0,s.jsx)(n.code,{children:"Uint16"}),", ",(0,s.jsx)(n.code,{children:"Uint8"})]}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]uint32"})}),(0,s.jsx)(n.td,{children:"4/2/1 bytes"}),(0,s.jsx)(n.td,{children:"Unsigned integers"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"FP64"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]float64"})}),(0,s.jsx)(n.td,{children:"8 bytes"}),(0,s.jsx)(n.td,{children:"Double precision float"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"Int64"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]int64"})}),(0,s.jsx)(n.td,{children:"8 bytes"}),(0,s.jsx)(n.td,{children:"64-bit signed integer"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"Uint64"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]uint64"})}),(0,s.jsx)(n.td,{children:"8 bytes"}),(0,s.jsx)(n.td,{children:"64-bit unsigned integer"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"String"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]string"})}),(0,s.jsx)(n.td,{children:"Variable"}),(0,s.jsx)(n.td,{children:"Pascal-style strings"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"Bool"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]uint8"})}),(0,s.jsx)(n.td,{children:"Bit-packed"}),(0,s.jsx)(n.td,{children:"Boolean values"})]})]})]}),"\n",(0,s.jsx)(n.h4,{id:"vector-types",children:"Vector 
Types"}),"\n",(0,s.jsxs)(n.table,{children:[(0,s.jsx)(n.thead,{children:(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.th,{children:"Type"}),(0,s.jsx)(n.th,{children:"Container"}),(0,s.jsx)(n.th,{children:"Description"})]})}),(0,s.jsxs)(n.tbody,{children:[(0,s.jsxs)(n.tr,{children:[(0,s.jsxs)(n.td,{children:[(0,s.jsx)(n.code,{children:"FP32Vector"}),", ",(0,s.jsx)(n.code,{children:"FP16Vector"}),", etc."]}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]float32"})}),(0,s.jsx)(n.td,{children:"2D slices of floating point"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsxs)(n.td,{children:[(0,s.jsx)(n.code,{children:"Int32Vector"}),", ",(0,s.jsx)(n.code,{children:"Int16Vector"}),", etc."]}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]int32"})}),(0,s.jsx)(n.td,{children:"2D slices of signed integers"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsxs)(n.td,{children:[(0,s.jsx)(n.code,{children:"Uint32Vector"}),", ",(0,s.jsx)(n.code,{children:"Uint16Vector"}),", etc."]}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]uint32"})}),(0,s.jsx)(n.td,{children:"2D slices of unsigned integers"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"FP64Vector"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]float64"})}),(0,s.jsx)(n.td,{children:"2D slices of doubles"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"Int64Vector"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]int64"})}),(0,s.jsx)(n.td,{children:"2D slices of 64-bit signed"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"Uint64Vector"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]uint64"})}),(0,s.jsx)(n.td,{children:"2D slices of 64-bit unsigned"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"StringVector"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]string"})}),(0,s.jsx)(n.td,{children:"2D slices of 
strings"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"BoolVector"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]uint8"})}),(0,s.jsx)(n.td,{children:"2D slices of bit-packed bools"})]})]})]}),"\n",(0,s.jsx)(n.h3,{id:"-encoding-for-scalar-feature-type",children:"\ud83d\udce6 Encoding for Scalar Feature Type"}),"\n",(0,s.jsx)(n.h4,{id:"1--string-feature-group-variable-length-encoding-using-pascal",children:"1. \ud83d\udd21 String Feature Group (Variable Length Encoding using Pascal)"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["Max string length: ",(0,s.jsx)(n.strong,{children:"65536"})]}),"\n",(0,s.jsxs)(n.li,{children:["Format:\n",(0,s.jsx)(n.img,{alt:"PSDB String encoding",src:i(1477).A+"",width:"1488",height:"204"})]}),"\n",(0,s.jsxs)(n.li,{children:["Deserialization:","\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Read length prefixes"}),"\n",(0,s.jsxs)(n.li,{children:["Extract string bytes using ",(0,s.jsx)(n.code,{children:"StrLenX"})]}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"2--boolean-feature-group-bit-packed",children:"2. \ud83d\udfe9 Boolean Feature Group (Bit-Packed)"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Saves space using bit-level packing."}),"\n",(0,s.jsxs)(n.li,{children:["Encoding:","\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Raw: 1 byte per feature"}),"\n",(0,s.jsx)(n.li,{children:"Bit-packed: 1 bit per boolean"}),"\n",(0,s.jsxs)(n.li,{children:["Additional index (",(0,s.jsx)(n.code,{children:"bool last idx"}),") stores where the last bit resides"]}),"\n"]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:["Format:\n",(0,s.jsx)(n.img,{alt:"PSDB Bool encoding",src:i(280).A+"",width:"1120",height:"712"})]}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"3--fixed-length-feature-group",children:"3. 
\ud83d\udccf Fixed-Length Feature Group"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["For fixed-size vectors (",(0,s.jsx)(n.code,{children:"n"})," bytes each)"]}),"\n",(0,s.jsxs)(n.li,{children:["Format:\n",(0,s.jsx)(n.img,{alt:"PSDB Fixed Length Datatype encoding",src:i(9133).A+"",width:"1122",height:"202"})]}),"\n",(0,s.jsx)(n.li,{children:"Efficient for dense numeric features like float32, int64, etc."}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"4-compression",children:"4. Compression"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.code,{children:"TypeNone (0)"}),": Raw storage"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.code,{children:"TypeZSTD (1)"}),": Compressed using Zstandard"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Compression is opportunistic: during serialization, if the compressed size is not smaller than the original, PSDB falls back to the uncompressed format. This lets the high-throughput read path spend fewer CPU cycles. Additionally, only the data section of the PSDB is compressed, so decompression is performed only if the block has a valid TTL."}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h3,{id:"-encoding-for-vector-types",children:"\ud83e\uddec Encoding for Vector Types"}),"\n",(0,s.jsx)(n.h4,{id:"conceptual-overview",children:"Conceptual Overview"}),"\n",(0,s.jsx)(n.p,{children:"PSDB encodes vector data by flattening multi-dimensional arrays into a single contiguous byte buffer while preserving the ability to reconstruct the original vector boundaries."}),"\n",(0,s.jsx)(n.h4,{id:"vector-length-metadata",children:"Vector Length Metadata"}),"\n",(0,s.jsx)(n.p,{children:"Each feature group maintains metadata about vector dimensions in the Feature Registry. 
For example, if a feature group has:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-yaml",children:"fg1:\n version-2:\n features:\n f1: { vector_len: 6, default: [bytes] }\n f2: { vector_len: 3, default: [bytes] }\n version-1:\n features:\n f1: { vector_len: 6, default: [bytes] }\n"})}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Feature f1 with vector_len: 6"}),"\n",(0,s.jsx)(n.li,{children:"Feature f2 with vector_len: 3"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This means:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.code,{children:"f1"})," contains vectors of exactly 6 elements each"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.code,{children:"f2"})," contains vectors of exactly 3 elements each"]}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"encoding-process",children:"Encoding Process"}),"\n",(0,s.jsx)(n.h5,{id:"input-structure",children:(0,s.jsx)(n.strong,{children:"Input Structure"})}),"\n",(0,s.jsx)(n.p,{children:"The serializer receives vector data as 2D slices where:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Outer dimension represents different feature instances/entities"}),"\n",(0,s.jsx)(n.li,{children:"Inner dimension represents the vector elements for each instance"}),"\n"]}),"\n",(0,s.jsx)(n.h5,{id:"length-validation",children:(0,s.jsx)(n.strong,{children:"Length Validation"})}),"\n",(0,s.jsx)(n.p,{children:"Before encoding, PSDB validates that each vector's actual length matches the declared vector_len from the feature metadata. 
This ensures data integrity and enables efficient decoding."}),"\n",(0,s.jsx)(n.h5,{id:"flattening-strategy",children:(0,s.jsx)(n.strong,{children:"Flattening Strategy"})}),"\n",(0,s.jsx)(n.p,{children:"Vectors are serialized in row-major order (also called C-style order):"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"All elements of the first vector are written consecutively"}),"\n",(0,s.jsx)(n.li,{children:"Followed by all elements of the second vector"}),"\n",(0,s.jsx)(n.li,{children:"And so on..."}),"\n"]}),"\n",(0,s.jsx)(n.h5,{id:"contiguous-layout",children:(0,s.jsx)(n.strong,{children:"Contiguous Layout"})}),"\n",(0,s.jsx)(n.p,{children:"The resulting byte buffer contains all vector elements placed end-to-end without gaps or separators. The decoder can reconstruct vector boundaries because it knows:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"The data type size (e.g., 4 bytes for float32), from feature registry"}),"\n",(0,s.jsx)(n.li,{children:"The vector length for each position, from feature registry"}),"\n",(0,s.jsx)(n.li,{children:"The total number of vectors, from feature registry"}),"\n",(0,s.jsxs)(n.li,{children:["In case of ",(0,s.jsx)(n.code,{children:"variable length"})," length is encoded into the data, like for ",(0,s.jsx)(n.code,{children:"String"})," data-type"]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h3,{id:"-deserializationdecoding-flow",children:"\ud83d\udd04 Deserialization/Decoding Flow"}),"\n",(0,s.jsxs)(n.ol,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Extract version"})," from first 2 bytes."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Look up schema"})," from etcd using the version."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Determine feature shapes"})," (e.g., vector lengths)."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Slice and decode"})," data from byte buffer 
accordingly."]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h3,{id:"memory-efficiency-benefits",children:"Memory Efficiency Benefits"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"No Padding"}),": Elements are packed tightly without alignment padding"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"No Delimiters"}),": Vector boundaries are implicit, not stored explicitly"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Cache Friendly"}),": Sequential memory access patterns during encoding/decoding"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Minimal Metadata"}),": Only vector lengths are stored separately, not per-element"]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"cache-storage-data-block-csdb-design",children:"Cache Storage Data Block (CSDB) Design"}),"\n",(0,s.jsx)(n.h3,{id:"overview",children:"Overview"}),"\n",(0,s.jsx)(n.p,{children:"The Cache Storage Data Block (CSDB) is a compact binary data format that encapsulates serialized data blocks for multiple feature groups. It is designed to support both in-memory and distributed caching of deserialized PSDB (Permanent Storage Data Block) content, optimizing for speed, deduplication, and minimal memory overhead."}),"\n",(0,s.jsx)(n.h3,{id:"structure-and-purpose",children:"Structure and Purpose"}),"\n",(0,s.jsx)(n.p,{children:"Each CSDB contains a mapping of feature group IDs (FG IDs) to deserialized PSDBs. For distributed systems, this structure is flattened into a serialized byte slice. 
The CSDB supports layout versioning for backward compatibility and negative caching for feature groups with no associated data."}),"\n",(0,s.jsx)(n.h4,{id:"core-fields-and-memory-layout",children:"Core Fields and Memory Layout"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-go",children:"type CacheStorageDataBlock struct {\n // 8-byte aligned map pointer\n FGIdToDDB map[int]*DeserializedPSDB // offset: 0\n\n // 24-byte slice (ptr, len, cap)\n serializedCSDB []byte // offset: 8\n\n // 4-byte fields\n TTL uint32 // offset: 32\n\n // 1-byte fields\n layoutVersion uint8 // offset: 36\n cacheType CacheType // offset: 37\n // 2 bytes padding to maintain 4-byte alignment\n}\n"})}),"\n",(0,s.jsx)(n.p,{children:"The structure is memory-aligned for optimal performance:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Pointers and slices are 8-byte aligned"}),"\n",(0,s.jsxs)(n.li,{children:["Smaller fields (like ",(0,s.jsx)(n.code,{children:"uint8"}),") are grouped and padded to avoid false sharing"]}),"\n",(0,s.jsx)(n.li,{children:"This layout ensures efficient use of CPU caches during access"}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"cache-types",children:"Cache Types"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"In-Memory Cache"}),": Uses the ",(0,s.jsx)(n.code,{children:"FGIdToDDB"})," map directly and avoids serialization unless explicitly requested."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Distributed Cache"}),": Stores a serialized binary format in ",(0,s.jsx)(n.code,{children:"serializedCSDB"}),", which is deserialized lazily when required."]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"format--encoding",children:"Format & Encoding"}),"\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"CSDB Binary Layout"}),": Serialized CSDBs follow this compact format:"]}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{children:"[LayoutVersion (1 byte)][FGID (2 
bytes)][DataLen (2 bytes)][Data ...] \u2192 repeated per feature group\n"})}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["FGID and DataLen are encoded as ",(0,s.jsx)(n.code,{children:"uint16"})]}),"\n",(0,s.jsxs)(n.li,{children:["If ",(0,s.jsx)(n.code,{children:"DataLen == 0"}),", it denotes a negative cache (no data available for that FG)"]}),"\n",(0,s.jsx)(n.li,{children:"The data section contains the PSDB header and either compressed or uncompressed data"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This layout allows fast scanning and partial deserialization for selected FG IDs, making it optimal for large-scale caching systems."}),"\n",(0,s.jsx)(n.h3,{id:"differences-between-in-memory-and-distributed-caching",children:"Differences Between In-Memory and Distributed Caching"}),"\n",(0,s.jsxs)(n.table,{children:[(0,s.jsx)(n.thead,{children:(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.th,{children:"Aspect"}),(0,s.jsx)(n.th,{children:"In-Memory CSDB"}),(0,s.jsx)(n.th,{children:"Distributed CSDB"})]})}),(0,s.jsxs)(n.tbody,{children:[(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Storage Format"}),(0,s.jsx)(n.td,{children:"Live Go objects (map[int]*DeserializedPSDB)"}),(0,s.jsxs)(n.td,{children:["Serialized byte buffer (",(0,s.jsx)(n.code,{children:"[]byte"}),")"]})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Deserialization"}),(0,s.jsx)(n.td,{children:"Performed on-demand using offset map"}),(0,s.jsx)(n.td,{children:"Performed on-demand using offset map"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Compression"}),(0,s.jsx)(n.td,{children:"Optional during serialization"}),(0,s.jsx)(n.td,{children:"Typically enabled to reduce payload size"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Usage Pattern"}),(0,s.jsx)(n.td,{children:"Fast lookup in active process memory"}),(0,s.jsx)(n.td,{children:"Cross-node cache sharing and persistence"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Memory 
Overhead"}),(0,s.jsx)(n.td,{children:"Higher (due to live objects)"}),(0,s.jsx)(n.td,{children:"Lower (compact representation)"})]})]})]}),"\n",(0,s.jsx)(n.h3,{id:"optimizations--features",children:"Optimizations & Features"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Partial FG ID Fetch"}),": When only a subset of FG IDs is needed, CSDB avoids unnecessary deserialization of other IDs."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Negative Caching"}),": FG IDs with no data are encoded with ",(0,s.jsx)(n.code,{children:"DataLen=0"}),", saving space and avoiding repeated lookups."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Offset-Length Map"}),": During deserialization, FGID to offset+length pairs are cached internally for efficient random access."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Versioning Support"}),": Layout version is stored as the first byte to enable format upgrades while maintaining backward compatibility."]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Diagram below explains how compute cycles are saved by partial de-compression."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"CSDB Partial Decompression",src:i(8457).A+"",width:"2292",height:"828"})})]})}function h(e={}){const{wrapper:n}={...(0,r.R)(),...e.components};return n?(0,s.jsx)(n,{...e,children:(0,s.jsx)(o,{...e})}):o(e)}}}]); \ No newline at end of file diff --git a/docs/assets/js/4caa95bf.ca3bb1d0.js b/docs/assets/js/4caa95bf.ca3bb1d0.js deleted file mode 100644 index 109ee1e0..00000000 --- a/docs/assets/js/4caa95bf.ca3bb1d0.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[2344],{3560:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/v1.0.0-psdb-anatomy-c1735559f93dce6d0bb3894d16047059.png"},6230:(e,n,i)=>{i.d(n,{A:()=>t});const 
t=i.p+"assets/images/v1.0.0-psdb-fixed-length-encodding-dd252110b084e01cf38f21de16b3a1a5.png"},7676:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/v1.0.0-psdb-string-encoding-b1d69e9452269124d1b545020fa27d63.png"},7780:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/v1.0.0-csdb-skip-read-e3926080f7341aa7d3c6ec6d8274ea14.png"},8453:(e,n,i)=>{i.d(n,{R:()=>d,x:()=>l});var t=i(6540);const s={},r=t.createContext(s);function d(e){const n=t.useContext(r);return t.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function l(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:d(e.components),t.createElement(r.Provider,{value:n},e.children)}},8645:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/v1.0.0-psdb-bool-encoding-4b154fdf5e6d79a67c91b6fb21c7209e.png"},9584:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>c,contentTitle:()=>l,default:()=>h,frontMatter:()=>d,metadata:()=>t,toc:()=>a});const t=JSON.parse('{"id":"online-feature-store/v1.0.0/data-formats","title":"Data Formats","description":"In this section we will go through the data-formats which are at the heart of the online-feature-store. They are inspired from other storage-efficient formats like parquet & arrow, but custom made to deliver in constrained environments. 
The two key data-formats are:","source":"@site/docs/online-feature-store/v1.0.0/data-formats.md","sourceDirName":"online-feature-store/v1.0.0","slug":"/online-feature-store/v1.0.0/data-formats","permalink":"/BharatMLStack/online-feature-store/v1.0.0/data-formats","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/online-feature-store/v1.0.0/data-formats.md","tags":[],"version":"current","sidebarPosition":2,"frontMatter":{"title":"Data Formats","sidebar_position":2},"sidebar":"tutorialSidebar","previous":{"title":"Architecture","permalink":"/BharatMLStack/online-feature-store/v1.0.0/architecture"},"next":{"title":"Benchmarks","permalink":"/BharatMLStack/online-feature-store/v1.0.0/benchmarks"}}');var s=i(4848),r=i(8453);const d={title:"Data Formats",sidebar_position:2},l="Data Format for Permanent & Cache Storage",c={},a=[{value:"PSDB (Permanent Storage Data Block) Format",id:"psdb-permanent-storage-data-block-format",level:2},{value:"\ud83e\uddf1 Structure Overview",id:"-structure-overview",level:3},{value:"Supported Data Types",id:"supported-data-types",level:3},{value:"Scalar Types",id:"scalar-types",level:4},{value:"Vector Types",id:"vector-types",level:4},{value:"\ud83d\udce6 Encoding for Scalar Feature Type",id:"-encoding-for-scalar-feature-type",level:3},{value:"1. \ud83d\udd21 String Feature Group (Variable Length Encoding using Pascal)",id:"1--string-feature-group-variable-length-encoding-using-pascal",level:4},{value:"2. \ud83d\udfe9 Boolean Feature Group (Bit-Packed)",id:"2--boolean-feature-group-bit-packed",level:4},{value:"3. \ud83d\udccf Fixed-Length Feature Group",id:"3--fixed-length-feature-group",level:4},{value:"4. 
Compression",id:"4-compression",level:4},{value:"\ud83e\uddec Encoding for Vector Types",id:"-encoding-for-vector-types",level:3},{value:"Conceptual Overview",id:"conceptual-overview",level:4},{value:"Vector Length Metadata",id:"vector-length-metadata",level:4},{value:"Encoding Process",id:"encoding-process",level:4},{value:"Input Structure",id:"input-structure",level:5},{value:"Length Validation",id:"length-validation",level:5},{value:"Flattening Strategy",id:"flattening-strategy",level:5},{value:"Contiguous Layout",id:"contiguous-layout",level:5},{value:"\ud83d\udd04 Deserialization/Decoding Flow",id:"-deserializationdecoding-flow",level:3},{value:"Memory Efficiency Benefits",id:"memory-efficiency-benefits",level:3},{value:"Cache Storage Data Block (CSDB) Design",id:"cache-storage-data-block-csdb-design",level:2},{value:"Overview",id:"overview",level:3},{value:"Structure and Purpose",id:"structure-and-purpose",level:3},{value:"Core Fields and Memory Layout",id:"core-fields-and-memory-layout",level:4},{value:"Cache Types",id:"cache-types",level:4},{value:"Format & Encoding",id:"format--encoding",level:3},{value:"Differences Between In-Memory and Distributed Caching",id:"differences-between-in-memory-and-distributed-caching",level:3},{value:"Optimizations & Features",id:"optimizations--features",level:3}];function o(e){const n={code:"code",h1:"h1",h2:"h2",h3:"h3",h4:"h4",h5:"h5",header:"header",hr:"hr",img:"img",li:"li",ol:"ol",p:"p",pre:"pre",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,r.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(n.header,{children:(0,s.jsx)(n.h1,{id:"data-format-for-permanent--cache-storage",children:"Data Format for Permanent & Cache Storage"})}),"\n",(0,s.jsx)(n.p,{children:"In this section we will go through the data-formats which are at the heart of the online-feature-store. They are inspired from other storage-efficient formats like parquet & arrow, but custom made to 
deliver in constrained environments. The two key data-formats are:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"PSDB"})," - Permanent Storage Data Block used while storing data in ScyllaDB"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"CSDB"})," - Cache Storage Data Block used while storing data in DragonflyDB or Redis, optimal for KV"]}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"psdb-permanent-storage-data-block-format",children:"PSDB (Permanent Storage Data Block) Format"}),"\n",(0,s.jsxs)(n.p,{children:["The ",(0,s.jsx)(n.strong,{children:"PSDB"})," format is a compact, versioned, and schema-aware binary layout used to store feature groups efficiently for ML inference. It supports multiple datatypes (strings, booleans, fixed-size vectors), versioning, TTL, and metadata encoding in a compact header."]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h3,{id:"-structure-overview",children:"\ud83e\uddf1 Structure Overview"}),"\n",(0,s.jsx)(n.p,{children:"Each PSDB block is composed of multiple byte sections:"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Permanent Storage Data Block Anatomy",src:i(3560).A+"",width:"1854",height:"1102"})}),"\n",(0,s.jsxs)(n.table,{children:[(0,s.jsx)(n.thead,{children:(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.th,{children:"Byte"}),(0,s.jsx)(n.th,{children:"Bits"}),(0,s.jsx)(n.th,{children:"Field"}),(0,s.jsx)(n.th,{children:"Description"})]})}),(0,s.jsxs)(n.tbody,{children:[(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"0-1"}),(0,s.jsx)(n.td,{children:"0-15"}),(0,s.jsx)(n.td,{children:"Feature Schema Version"}),(0,s.jsx)(n.td,{children:"Version for tracking schema changes (additions/deletions) in feature group"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"2-6"}),(0,s.jsx)(n.td,{children:"16-55"}),(0,s.jsx)(n.td,{children:"Expiry Timestamp"}),(0,s.jsx)(n.td,{children:"Encoded as a compact representation, ~513 days 
max"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"7"}),(0,s.jsx)(n.td,{children:"56-59"}),(0,s.jsx)(n.td,{children:"Layout Version"}),(0,s.jsx)(n.td,{children:"Used to ensure backward compatibility with layout format changes"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"7"}),(0,s.jsx)(n.td,{children:"60-62"}),(0,s.jsx)(n.td,{children:"Compression Type"}),(0,s.jsx)(n.td,{children:"3-bit field specifying compression algorithm"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"7-8"}),(0,s.jsx)(n.td,{children:"63-67"}),(0,s.jsx)(n.td,{children:"Data Type"}),(0,s.jsx)(n.td,{children:"5-bit field split across bytes 7 and 8"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"8"}),(0,s.jsx)(n.td,{children:"68-71"}),(0,s.jsx)(n.td,{children:"Bool Last Valid Bit"}),(0,s.jsx)(n.td,{children:"4-bit field for last valid boolean bit"})]})]})]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h3,{id:"supported-data-types",children:"Supported Data Types"}),"\n",(0,s.jsx)(n.h4,{id:"scalar-types",children:"Scalar Types"}),"\n",(0,s.jsxs)(n.table,{children:[(0,s.jsx)(n.thead,{children:(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.th,{children:"Type"}),(0,s.jsx)(n.th,{children:"Container"}),(0,s.jsx)(n.th,{children:"Size"}),(0,s.jsx)(n.th,{children:"Description"})]})}),(0,s.jsxs)(n.tbody,{children:[(0,s.jsxs)(n.tr,{children:[(0,s.jsxs)(n.td,{children:[(0,s.jsx)(n.code,{children:"FP32"}),", ",(0,s.jsx)(n.code,{children:"FP16"}),", ",(0,s.jsx)(n.code,{children:"FP8E4M3"}),", ",(0,s.jsx)(n.code,{children:"FP8E5M2"})]}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]float32"})}),(0,s.jsx)(n.td,{children:"4/2/1/1 bytes"}),(0,s.jsx)(n.td,{children:"Floating point numbers"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsxs)(n.td,{children:[(0,s.jsx)(n.code,{children:"Int32"}),", ",(0,s.jsx)(n.code,{children:"Int16"}),", ",(0,s.jsx)(n.code,{children:"Int8"})]}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]int32"})}),(0,s.jsx)(n.td,{children:"4/2/1 
bytes"}),(0,s.jsx)(n.td,{children:"Signed integers"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsxs)(n.td,{children:[(0,s.jsx)(n.code,{children:"Uint32"}),", ",(0,s.jsx)(n.code,{children:"Uint16"}),", ",(0,s.jsx)(n.code,{children:"Uint8"})]}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]uint32"})}),(0,s.jsx)(n.td,{children:"4/2/1 bytes"}),(0,s.jsx)(n.td,{children:"Unsigned integers"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"FP64"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]float64"})}),(0,s.jsx)(n.td,{children:"8 bytes"}),(0,s.jsx)(n.td,{children:"Double precision float"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"Int64"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]int64"})}),(0,s.jsx)(n.td,{children:"8 bytes"}),(0,s.jsx)(n.td,{children:"64-bit signed integer"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"Uint64"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]uint64"})}),(0,s.jsx)(n.td,{children:"8 bytes"}),(0,s.jsx)(n.td,{children:"64-bit unsigned integer"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"String"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]string"})}),(0,s.jsx)(n.td,{children:"Variable"}),(0,s.jsx)(n.td,{children:"Pascal-style strings"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"Bool"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[]uint8"})}),(0,s.jsx)(n.td,{children:"Bit-packed"}),(0,s.jsx)(n.td,{children:"Boolean values"})]})]})]}),"\n",(0,s.jsx)(n.h4,{id:"vector-types",children:"Vector 
Types"}),"\n",(0,s.jsxs)(n.table,{children:[(0,s.jsx)(n.thead,{children:(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.th,{children:"Type"}),(0,s.jsx)(n.th,{children:"Container"}),(0,s.jsx)(n.th,{children:"Description"})]})}),(0,s.jsxs)(n.tbody,{children:[(0,s.jsxs)(n.tr,{children:[(0,s.jsxs)(n.td,{children:[(0,s.jsx)(n.code,{children:"FP32Vector"}),", ",(0,s.jsx)(n.code,{children:"FP16Vector"}),", etc."]}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]float32"})}),(0,s.jsx)(n.td,{children:"2D slices of floating point"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsxs)(n.td,{children:[(0,s.jsx)(n.code,{children:"Int32Vector"}),", ",(0,s.jsx)(n.code,{children:"Int16Vector"}),", etc."]}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]int32"})}),(0,s.jsx)(n.td,{children:"2D slices of signed integers"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsxs)(n.td,{children:[(0,s.jsx)(n.code,{children:"Uint32Vector"}),", ",(0,s.jsx)(n.code,{children:"Uint16Vector"}),", etc."]}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]uint32"})}),(0,s.jsx)(n.td,{children:"2D slices of unsigned integers"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"FP64Vector"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]float64"})}),(0,s.jsx)(n.td,{children:"2D slices of doubles"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"Int64Vector"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]int64"})}),(0,s.jsx)(n.td,{children:"2D slices of 64-bit signed"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"Uint64Vector"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]uint64"})}),(0,s.jsx)(n.td,{children:"2D slices of 64-bit unsigned"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"StringVector"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]string"})}),(0,s.jsx)(n.td,{children:"2D slices of 
strings"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"BoolVector"})}),(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"[][]uint8"})}),(0,s.jsx)(n.td,{children:"2D slices of bit-packed bools"})]})]})]}),"\n",(0,s.jsx)(n.h3,{id:"-encoding-for-scalar-feature-type",children:"\ud83d\udce6 Encoding for Scalar Feature Type"}),"\n",(0,s.jsx)(n.h4,{id:"1--string-feature-group-variable-length-encoding-using-pascal",children:"1. \ud83d\udd21 String Feature Group (Variable Length Encoding using Pascal)"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["Max string length: ",(0,s.jsx)(n.strong,{children:"65536"})]}),"\n",(0,s.jsxs)(n.li,{children:["Format:\n",(0,s.jsx)(n.img,{alt:"PSDB String encoding",src:i(7676).A+"",width:"1488",height:"204"})]}),"\n",(0,s.jsxs)(n.li,{children:["Deserialization:","\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Read length prefixes"}),"\n",(0,s.jsxs)(n.li,{children:["Extract string bytes using ",(0,s.jsx)(n.code,{children:"StrLenX"})]}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"2--boolean-feature-group-bit-packed",children:"2. \ud83d\udfe9 Boolean Feature Group (Bit-Packed)"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Saves space using bit-level packing."}),"\n",(0,s.jsxs)(n.li,{children:["Encoding:","\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Raw: 1 byte per feature"}),"\n",(0,s.jsx)(n.li,{children:"Bit-packed: 1 bit per boolean"}),"\n",(0,s.jsxs)(n.li,{children:["Additional index (",(0,s.jsx)(n.code,{children:"bool last idx"}),") stores where the last bit resides"]}),"\n"]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:["Format:\n",(0,s.jsx)(n.img,{alt:"PSDB Bool encoding",src:i(8645).A+"",width:"1120",height:"712"})]}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"3--fixed-length-feature-group",children:"3. 
\ud83d\udccf Fixed-Length Feature Group"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["For fixed-size vectors (",(0,s.jsx)(n.code,{children:"n"})," bytes each)"]}),"\n",(0,s.jsxs)(n.li,{children:["Format:\n",(0,s.jsx)(n.img,{alt:"PSDB Fixed Length Datatype encoding",src:i(6230).A+"",width:"1122",height:"202"})]}),"\n",(0,s.jsx)(n.li,{children:"Efficient for dense numeric features like float32, int64, etc."}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"4-compression",children:"4. Compression"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.code,{children:"TypeNone (0)"}),": Raw storage"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.code,{children:"TypeZSTD (1)"}),": Compressed using Zstandard"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Compression is opportunistic. During serialization, if the compressed size is not smaller, PSDB falls back to the uncompressed format. This keeps the high-throughput read path from spending extra CPU cycles. Also, only the data portion of the PSDB is compressed, so decompression is performed only if the block has a valid TTL."}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h3,{id:"-encoding-for-vector-types",children:"\ud83e\uddec Encoding for Vector Types"}),"\n",(0,s.jsx)(n.h4,{id:"conceptual-overview",children:"Conceptual Overview"}),"\n",(0,s.jsx)(n.p,{children:"PSDB encodes vector data by flattening multi-dimensional arrays into a single contiguous byte buffer while preserving the ability to reconstruct the original vector boundaries."}),"\n",(0,s.jsx)(n.h4,{id:"vector-length-metadata",children:"Vector Length Metadata"}),"\n",(0,s.jsx)(n.p,{children:"Each feature group maintains metadata about vector dimensions in the Feature Registry. 
For example, if a feature group has:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-yaml",children:"fg1:\n version-2:\n features:\n f1: { vector_len: 6, default: [bytes] }\n f2: { vector_len: 3, default: [bytes] }\n version-1:\n features:\n f1: { vector_len: 6, default: [bytes] }\n"})}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Feature f1 with vector_len: 6"}),"\n",(0,s.jsx)(n.li,{children:"Feature f2 with vector_len: 3"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This means:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.code,{children:"f1"})," contains vectors of exactly 6 elements each"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.code,{children:"f2"})," contains vectors of exactly 3 elements each"]}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"encoding-process",children:"Encoding Process"}),"\n",(0,s.jsx)(n.h5,{id:"input-structure",children:(0,s.jsx)(n.strong,{children:"Input Structure"})}),"\n",(0,s.jsx)(n.p,{children:"The serializer receives vector data as 2D slices where:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Outer dimension represents different feature instances/entities"}),"\n",(0,s.jsx)(n.li,{children:"Inner dimension represents the vector elements for each instance"}),"\n"]}),"\n",(0,s.jsx)(n.h5,{id:"length-validation",children:(0,s.jsx)(n.strong,{children:"Length Validation"})}),"\n",(0,s.jsx)(n.p,{children:"Before encoding, PSDB validates that each vector's actual length matches the declared vector_len from the feature metadata. 
This ensures data integrity and enables efficient decoding."}),"\n",(0,s.jsx)(n.h5,{id:"flattening-strategy",children:(0,s.jsx)(n.strong,{children:"Flattening Strategy"})}),"\n",(0,s.jsx)(n.p,{children:"Vectors are serialized in row-major order (also called C-style order):"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"All elements of the first vector are written consecutively"}),"\n",(0,s.jsx)(n.li,{children:"Followed by all elements of the second vector"}),"\n",(0,s.jsx)(n.li,{children:"And so on..."}),"\n"]}),"\n",(0,s.jsx)(n.h5,{id:"contiguous-layout",children:(0,s.jsx)(n.strong,{children:"Contiguous Layout"})}),"\n",(0,s.jsx)(n.p,{children:"The resulting byte buffer contains all vector elements placed end-to-end without gaps or separators. The decoder can reconstruct vector boundaries because it knows:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"The data type size (e.g., 4 bytes for float32), from feature registry"}),"\n",(0,s.jsx)(n.li,{children:"The vector length for each position, from feature registry"}),"\n",(0,s.jsx)(n.li,{children:"The total number of vectors, from feature registry"}),"\n",(0,s.jsxs)(n.li,{children:["In case of ",(0,s.jsx)(n.code,{children:"variable length"})," length is encoded into the data, like for ",(0,s.jsx)(n.code,{children:"String"})," data-type"]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h3,{id:"-deserializationdecoding-flow",children:"\ud83d\udd04 Deserialization/Decoding Flow"}),"\n",(0,s.jsxs)(n.ol,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Extract version"})," from first 2 bytes."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Look up schema"})," from etcd using the version."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Determine feature shapes"})," (e.g., vector lengths)."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Slice and decode"})," data from byte buffer 
accordingly."]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h3,{id:"memory-efficiency-benefits",children:"Memory Efficiency Benefits"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"No Padding"}),": Elements are packed tightly without alignment padding"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"No Delimiters"}),": Vector boundaries are implicit, not stored explicitly"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Cache Friendly"}),": Sequential memory access patterns during encoding/decoding"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Minimal Metadata"}),": Only vector lengths are stored separately, not per-element"]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"cache-storage-data-block-csdb-design",children:"Cache Storage Data Block (CSDB) Design"}),"\n",(0,s.jsx)(n.h3,{id:"overview",children:"Overview"}),"\n",(0,s.jsx)(n.p,{children:"The Cache Storage Data Block (CSDB) is a compact binary data format that encapsulates serialized data blocks for multiple feature groups. It is designed to support both in-memory and distributed caching of deserialized PSDB (Permanent Storage Data Block) content, optimizing for speed, deduplication, and minimal memory overhead."}),"\n",(0,s.jsx)(n.h3,{id:"structure-and-purpose",children:"Structure and Purpose"}),"\n",(0,s.jsx)(n.p,{children:"Each CSDB contains a mapping of feature group IDs (FG IDs) to deserialized PSDBs. For distributed systems, this structure is flattened into a serialized byte slice. 
The CSDB supports layout versioning for backward compatibility and negative caching for feature groups with no associated data."}),"\n",(0,s.jsx)(n.h4,{id:"core-fields-and-memory-layout",children:"Core Fields and Memory Layout"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-go",children:"type CacheStorageDataBlock struct {\n // 8-byte aligned map pointer\n FGIdToDDB map[int]*DeserializedPSDB // offset: 0\n\n // 24-byte slice (ptr, len, cap)\n serializedCSDB []byte // offset: 8\n\n // 4-byte fields\n TTL uint32 // offset: 32\n\n // 1-byte fields\n layoutVersion uint8 // offset: 36\n cacheType CacheType // offset: 37\n // 2 bytes padding to maintain 4-byte alignment\n}\n"})}),"\n",(0,s.jsx)(n.p,{children:"The structure is memory-aligned for optimal performance:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Pointers and slices are 8-byte aligned"}),"\n",(0,s.jsxs)(n.li,{children:["Smaller fields (like ",(0,s.jsx)(n.code,{children:"uint8"}),") are grouped and padded to avoid false sharing"]}),"\n",(0,s.jsx)(n.li,{children:"This layout ensures efficient use of CPU caches during access"}),"\n"]}),"\n",(0,s.jsx)(n.h4,{id:"cache-types",children:"Cache Types"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"In-Memory Cache"}),": Uses the ",(0,s.jsx)(n.code,{children:"FGIdToDDB"})," map directly and avoids serialization unless explicitly requested."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Distributed Cache"}),": Stores a serialized binary format in ",(0,s.jsx)(n.code,{children:"serializedCSDB"}),", which is deserialized lazily when required."]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"format--encoding",children:"Format & Encoding"}),"\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"CSDB Binary Layout"}),": Serialized CSDBs follow this compact format:"]}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{children:"[LayoutVersion (1 byte)][FGID (2 
bytes)][DataLen (2 bytes)][Data ...] \u2192 repeated per feature group\n"})}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["FGID and DataLen are encoded as ",(0,s.jsx)(n.code,{children:"uint16"})]}),"\n",(0,s.jsxs)(n.li,{children:["If ",(0,s.jsx)(n.code,{children:"DataLen == 0"}),", it denotes a negative cache (no data available for that FG)"]}),"\n",(0,s.jsx)(n.li,{children:"The data section contains the PSDB header and either compressed or uncompressed data"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This layout allows fast scanning and partial deserialization for selected FG IDs, making it optimal for large-scale caching systems."}),"\n",(0,s.jsx)(n.h3,{id:"differences-between-in-memory-and-distributed-caching",children:"Differences Between In-Memory and Distributed Caching"}),"\n",(0,s.jsxs)(n.table,{children:[(0,s.jsx)(n.thead,{children:(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.th,{children:"Aspect"}),(0,s.jsx)(n.th,{children:"In-Memory CSDB"}),(0,s.jsx)(n.th,{children:"Distributed CSDB"})]})}),(0,s.jsxs)(n.tbody,{children:[(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Storage Format"}),(0,s.jsx)(n.td,{children:"Live Go objects (map[int]*DeserializedPSDB)"}),(0,s.jsxs)(n.td,{children:["Serialized byte buffer (",(0,s.jsx)(n.code,{children:"[]byte"}),")"]})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Deserialization"}),(0,s.jsx)(n.td,{children:"Performed on-demand using offset map"}),(0,s.jsx)(n.td,{children:"Performed on-demand using offset map"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Compression"}),(0,s.jsx)(n.td,{children:"Optional during serialization"}),(0,s.jsx)(n.td,{children:"Typically enabled to reduce payload size"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Usage Pattern"}),(0,s.jsx)(n.td,{children:"Fast lookup in active process memory"}),(0,s.jsx)(n.td,{children:"Cross-node cache sharing and persistence"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Memory 
Overhead"}),(0,s.jsx)(n.td,{children:"Higher (due to live objects)"}),(0,s.jsx)(n.td,{children:"Lower (compact representation)"})]})]})]}),"\n",(0,s.jsx)(n.h3,{id:"optimizations--features",children:"Optimizations & Features"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Partial FG ID Fetch"}),": When only a subset of FG IDs is needed, CSDB avoids unnecessary deserialization of other IDs."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Negative Caching"}),": FG IDs with no data are encoded with ",(0,s.jsx)(n.code,{children:"DataLen=0"}),", saving space and avoiding repeated lookups."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Offset-Length Map"}),": During deserialization, FGID to offset+length pairs are cached internally for efficient random access."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Versioning Support"}),": Layout version is stored as the first byte to enable format upgrades while maintaining backward compatibility."]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Diagram below explains how compute cycles are saved by partial de-compression."}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"CSDB Partial Decompression",src:i(7780).A+"",width:"2292",height:"828"})})]})}function h(e={}){const{wrapper:n}={...(0,r.R)(),...e.components};return n?(0,s.jsx)(n,{...e,children:(0,s.jsx)(o,{...e})}):o(e)}}}]); \ No newline at end of file diff --git a/docs/assets/js/4d1a2db0.1e2d0e16.js b/docs/assets/js/4d1a2db0.1e2d0e16.js new file mode 100644 index 00000000..d423bed6 --- /dev/null +++ b/docs/assets/js/4d1a2db0.1e2d0e16.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[8241],{2062:e=>{e.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"Skye","description":"Skye is a high-performance vector similarity search platform that enables fast semantic retrieval by representing data as vectors and querying nearest matches in 
high-dimensional space. It supports pluggable vector databases, tenant-level index isolation, intelligent caching, and centralized cluster management.","slug":"/category/skye","permalink":"/BharatMLStack/category/skye","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"Spark client","permalink":"/BharatMLStack/sdks/python/v1.0.0/spark_feature_push_client"},"next":{"title":"v1.0.0","permalink":"/BharatMLStack/skye/v1.0.0"}}}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/50899a24.c0cfae29.js b/docs/assets/js/50899a24.53577092.js similarity index 63% rename from docs/assets/js/50899a24.c0cfae29.js rename to docs/assets/js/50899a24.53577092.js index 70349c40..7f1787bc 100644 --- a/docs/assets/js/50899a24.c0cfae29.js +++ b/docs/assets/js/50899a24.53577092.js @@ -1 +1 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[1009],{1008:e=>{e.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"Numerix","description":"Numerix is a mathematical compute engine for BharatML Stack. It is used to perform mathematical operations on matrices and vectors.","slug":"/category/numerix","permalink":"/BharatMLStack/category/numerix","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"Spark client","permalink":"/BharatMLStack/sdks/python/v1.0.0/spark_feature_push_client"},"next":{"title":"v1.0.0","permalink":"/BharatMLStack/numerix/v1.0.0"}}}}')}}]); \ No newline at end of file +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[1009],{1008:e=>{e.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"Numerix","description":"Numerix is a mathematical compute engine for BharatML Stack. 
It is used to perform mathematical operations on matrices and vectors.","slug":"/category/numerix","permalink":"/BharatMLStack/category/numerix","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"Release Notes","permalink":"/BharatMLStack/skye/v1.0.0/release-notes"},"next":{"title":"v1.0.0","permalink":"/BharatMLStack/numerix/v1.0.0"}}}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/56eef1be.df9c72dd.js b/docs/assets/js/56eef1be.df9c72dd.js new file mode 100644 index 00000000..1cd78a39 --- /dev/null +++ b/docs/assets/js/56eef1be.df9c72dd.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[1508],{6613:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/skye-rt-consumer-flow-7f064a31c41151ff4516900b3170dbc8.png"},8453:(e,n,i)=>{i.d(n,{R:()=>a,x:()=>d});var r=i(6540);const s={},t=r.createContext(s);function a(e){const n=r.useContext(t);return r.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function d(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:a(e.components),r.createElement(t.Provider,{value:n},e.children)}},8468:(e,n,i)=>{i.d(n,{A:()=>r});const r=i.p+"assets/images/skye-system-overview-24940f4c319f41fb3b7583a525b0a534.png"},9390:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>l,contentTitle:()=>d,default:()=>h,frontMatter:()=>a,metadata:()=>r,toc:()=>c});const r=JSON.parse('{"id":"skye/v1.0.0/architecture","title":"Architecture","description":"Skye is BharatMLStack\'s vector similarity search platform that enables fast semantic retrieval by representing data as vectors and querying nearest matches in high-dimensional space. 
It is composed of three runnable components: skye-admin, skye-consumers, and skye-serving.","source":"@site/docs/skye/v1.0.0/architecture.md","sourceDirName":"skye/v1.0.0","slug":"/skye/v1.0.0/architecture","permalink":"/BharatMLStack/skye/v1.0.0/architecture","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/skye/v1.0.0/architecture.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"Architecture","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"v1.0.0","permalink":"/BharatMLStack/skye/v1.0.0"},"next":{"title":"Functionalities","permalink":"/BharatMLStack/skye/v1.0.0/functionalities"}}');var s=i(4848),t=i(8453);const a={title:"Architecture",sidebar_position:1},d="Skye - Vector Similarity Search Platform",l={},c=[{value:"System Overview",id:"system-overview",level:2},{value:"Component Architecture",id:"component-architecture",level:3},{value:"Data Model",id:"data-model",level:2},{value:"Model and Variant Hierarchy",id:"model-and-variant-hierarchy",level:3},{value:"Entity-Based Data Split",id:"entity-based-data-split",level:3},{value:"Serving Flow",id:"serving-flow",level:2},{value:"Configuration Bootstrap",id:"configuration-bootstrap",level:3},{value:"Admin Flows",id:"admin-flows",level:2},{value:"API Contracts",id:"api-contracts",level:3},{value:"Register Model",id:"register-model",level:4},{value:"Register Variant",id:"register-variant",level:4},{value:"Reset Model",id:"reset-model",level:4},{value:"Trigger Model Machine",id:"trigger-model-machine",level:4},{value:"Promote Model / Variant to Scale-Up Cluster",id:"promote-model--variant-to-scale-up-cluster",level:4},{value:"Consumer Flows",id:"consumer-flows",level:2},{value:"Reset/Delta Ingestion",id:"resetdelta-ingestion",level:3},{value:"Real-Time Consumers",id:"real-time-consumers",level:3},{value:"Retry Topic",id:"retry-topic",level:3},{value:"Key Design 
Decisions",id:"key-design-decisions",level:2},{value:"Pluggable Vector Database Support",id:"pluggable-vector-database-support",level:3},{value:"Variant-Based Model Sharing",id:"variant-based-model-sharing",level:3},{value:"ScyllaDB for Real-Time Aggregation",id:"scylladb-for-real-time-aggregation",level:3},{value:"Event-Driven State Management",id:"event-driven-state-management",level:3},{value:"Resiliency",id:"resiliency",level:2},{value:"Scalability",id:"scalability",level:2},{value:"Observability",id:"observability",level:2},{value:"Metrics (per model + variant)",id:"metrics-per-model--variant",level:3},{value:"Alerts",id:"alerts",level:3},{value:"Technology Stack",id:"technology-stack",level:2}];function o(e){const n={code:"code",h1:"h1",h2:"h2",h3:"h3",h4:"h4",header:"header",hr:"hr",img:"img",li:"li",ol:"ol",p:"p",pre:"pre",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,t.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(n.header,{children:(0,s.jsx)(n.h1,{id:"skye---vector-similarity-search-platform",children:"Skye - Vector Similarity Search Platform"})}),"\n",(0,s.jsxs)(n.p,{children:["Skye is BharatMLStack's vector similarity search platform that enables fast semantic retrieval by representing data as vectors and querying nearest matches in high-dimensional space. It is composed of three runnable components: ",(0,s.jsx)(n.strong,{children:"skye-admin"}),", ",(0,s.jsx)(n.strong,{children:"skye-consumers"}),", and ",(0,s.jsx)(n.strong,{children:"skye-serving"}),"."]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"system-overview",children:"System Overview"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Skye System Architecture",src:i(8468).A+"",width:"1276",height:"870"})}),"\n",(0,s.jsx)(n.p,{children:"Skye provides a critical platform for managing data aggregation, model onboarding, and embedding support at production scale. 
The architecture is designed around three core pillars:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Pluggable Vector Databases"}),": Support for multiple vector database backends (Qdrant and extensible to others) via a generic abstraction layer."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Tenant-Level Index Isolation with Shared Embeddings"}),": Models are stored once but can serve multiple tenants (variants), reducing data redundancy."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Event-Driven Administration"}),": Model lifecycle management is handled through Kafka-based event flows for resilience and fault tolerance."]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"component-architecture",children:"Component Architecture"}),"\n",(0,s.jsxs)(n.table,{children:[(0,s.jsx)(n.thead,{children:(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.th,{children:"Component"}),(0,s.jsx)(n.th,{children:"Role"})]})}),(0,s.jsxs)(n.tbody,{children:[(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"skye-serving"})}),(0,s.jsx)(n.td,{children:"Handles real-time similarity search queries with in-memory caching and vector DB lookups"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"skye-consumers"})}),(0,s.jsx)(n.td,{children:"Processes embedding ingestion (reset/delta jobs) and real-time aggregation events from Kafka"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"skye-admin"})}),(0,s.jsx)(n.td,{children:"Manages model lifecycle, onboarding, variant registration, and coordinates Databricks jobs"})]})]})]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"data-model",children:"Data Model"}),"\n",(0,s.jsx)(n.h3,{id:"model-and-variant-hierarchy",children:"Model and Variant Hierarchy"}),"\n",(0,s.jsxs)(n.p,{children:["Skye uses a ",(0,s.jsx)(n.strong,{children:"model-first"})," hierarchy rather than a 
tenant-first approach. Models sit at the base level with variants (formerly tenants) nested within each model. This eliminates embedding duplication across tenants."]}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{children:"model (e.g., intent_model)\n \u251c\u2500\u2500 model_config (distance_function, vector_dimension, etc.)\n \u251c\u2500\u2500 embedding_store (shared embeddings for all variants)\n \u251c\u2500\u2500 variant_1 (e.g., organic)\n \u2502 \u251c\u2500\u2500 vss_filter (criteria for index inclusion)\n \u2502 \u251c\u2500\u2500 vectordb_type (QDRANT, etc.)\n \u2502 \u251c\u2500\u2500 vectordb_config (host, port, replication, sharding)\n \u2502 \u251c\u2500\u2500 read_version / write_version\n \u2502 \u2514\u2500\u2500 job_frequency (FREQ_1D, FREQ_3H, etc.)\n \u2514\u2500\u2500 variant_2 (e.g., ad)\n \u251c\u2500\u2500 vss_filter\n \u251c\u2500\u2500 vectordb_type\n \u2514\u2500\u2500 ...\n"})}),"\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Key benefit"}),": If a model consumes 30M embeddings and is used by two variants, the embeddings are stored once (30M) instead of duplicated (60M)."]}),"\n",(0,s.jsx)(n.h3,{id:"entity-based-data-split",children:"Entity-Based Data Split"}),"\n",(0,s.jsx)(n.p,{children:"Data is split at the entity level (catalog, product, user) into separate tables for both embeddings and aggregator data:"}),"\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Embedding Tables"})," (per entity):"]}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-sql",children:"CREATE TABLE catalog_embeddings (\n model_name text,\n version int,\n id text,\n embedding frozen>,\n search_embedding frozen>,\n to_be_indexed_variant_1 boolean,\n to_be_indexed_variant_2 boolean,\n PRIMARY KEY ((model_name, version), id)\n);\n"})}),"\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Aggregator Tables"})," (per 
entity):"]}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-sql",children:"CREATE TABLE catalog_aggregator (\n id text,\n is_live_ad text,\n out_of_stock text,\n PRIMARY KEY (id)\n);\n"})}),"\n",(0,s.jsx)(n.p,{children:"Each entity is mapped via a store configuration:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-json",children:'{\n "db_conf_id": "1",\n "embeddings_table": "catalog_embeddings",\n "aggregator_table": "catalog_aggregator"\n}\n'})}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"serving-flow",children:"Serving Flow"}),"\n",(0,s.jsx)(n.p,{children:"The serving path is optimized for low latency with multiple caching layers:"}),"\n",(0,s.jsxs)(n.ol,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Request arrives"})," at skye-serving via gRPC"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"ConfigRepo"})," resolves the model configuration, variant filters, and vector DB connection"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"In-memory cache"})," is checked first to reduce load on distributed cache"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Distributed cache (Redis)"})," is checked next for cached similarity results"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Vector DB query"})," executes if cache misses, using ",(0,s.jsx)(n.code,{children:"search_indexed_only"})," flag for optimal searches within indexed space"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Aggregator data"})," is fetched from ScyllaDB to apply variant-level filters"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Response"})," returns ranked similar candidates with scores"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"configuration-bootstrap",children:"Configuration Bootstrap"}),"\n",(0,s.jsx)(n.p,{children:"On startup, ConfigRepo creates:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"A 
map of each model with its configurations (embedding table, vector DB channel)"}),"\n",(0,s.jsx)(n.li,{children:"A map of each entity to its aggregator table"}),"\n"]}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-json",children:'{\n "intent_model": {\n "db_conf_id": "1",\n "index_embedding_table": "catalog_embeddings",\n "vector_db_grpc_channel": ""\n }\n}\n'})}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"admin-flows",children:"Admin Flows"}),"\n",(0,s.jsxs)(n.p,{children:["Skye uses an ",(0,s.jsx)(n.strong,{children:"event-driven approach"})," for model lifecycle management:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"All admin operations are processed through Kafka consumers asynchronously"}),"\n",(0,s.jsx)(n.li,{children:"A SQL database behind the admin stores all model states"}),"\n",(0,s.jsx)(n.li,{children:"Pod termination does not affect in-progress operations (events are re-consumed on failure)"}),"\n",(0,s.jsx)(n.li,{children:"Databricks jobs are triggered and monitored via the admin API"}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"api-contracts",children:"API Contracts"}),"\n",(0,s.jsx)(n.h4,{id:"register-model",children:"Register Model"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{children:"POST /register-model\n"})}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-json",children:'{\n "entity": "catalog",\n "ingestion_column_mapping": "{\\"id_column\\":\\"id\\",\\"embedding_column\\":\\"features\\",\\"to_be_indexed_column\\":\\"to_be_indexed\\"}",\n "embedding_store_enabled": true,\n "embedding_store_ttl": 604800,\n "mq_id": 804,\n "model_config": "{\\"distance_function\\":\\"DOT\\",\\"vector_dimension\\":32}",\n "store_id": 1,\n "training_data_path": "gcs_path"\n}\n'})}),"\n",(0,s.jsx)(n.h4,{id:"register-variant",children:"Register Variant"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{children:"POST 
/register-variant\n"})}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-json",children:'{\n "entity": "catalog",\n "model_name": "intent_model",\n "vss_filter": "{...filter criteria...}",\n "vectordb_type": "QDRANT",\n "vectordb_config": "{...connection config...}",\n "job_frequency": "FREQ_1D"\n}\n'})}),"\n",(0,s.jsx)(n.h4,{id:"reset-model",children:"Reset Model"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{children:"POST /reset-model\n"})}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-json",children:'{\n "entity": "catalog",\n "model_name": "intent_model",\n "frequency": "FREQ_1D"\n}\n'})}),"\n",(0,s.jsx)(n.p,{children:"Response includes variant version mappings, MQ ID, and training data path for the Databricks job."}),"\n",(0,s.jsx)(n.h4,{id:"trigger-model-machine",children:"Trigger Model Machine"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{children:"POST /trigger-model-machine\n"})}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-json",children:'{\n "entity": "catalog",\n "model_name": "intent_model",\n "variant": "organic"\n}\n'})}),"\n",(0,s.jsx)(n.h4,{id:"promote-model--variant-to-scale-up-cluster",children:"Promote Model / Variant to Scale-Up Cluster"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{children:"POST /promote-model\nPOST /promote-variant\n"})}),"\n",(0,s.jsx)(n.p,{children:"Used to transition successful experiments from experiment clusters to production clusters."}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"consumer-flows",children:"Consumer Flows"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Skye Real-Time Consumer Flow",src:i(6613).A+"",width:"1192",height:"960"})}),"\n",(0,s.jsx)(n.h3,{id:"resetdelta-ingestion",children:"Reset/Delta Ingestion"}),"\n",(0,s.jsx)(n.p,{children:"Embedding ingestion occurs once per model and executes in parallel for each variant. 
The Kafka event contract supports:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Multiple variants per event"}),": A single embedding event specifies which variants should index the data"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Separate search and index embeddings"}),": Models can have different embeddings for search space vs index space"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"EOF handling"}),": EOF is sent to all partitions to ensure all data is consumed before completion"]}),"\n"]}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-json",children:'{\n "entity": "catalog",\n "model_name": "intent_model",\n "candidate_id": "48869419",\n "version": "1",\n "index_space": {\n "variants_version_map": "{\'organic\':1,\'ad\':2}",\n "embedding": [0.036, -0.048, ...],\n "variants_index_map": "{\'organic\':true,\'ad\':false}",\n "operation": "A",\n "payload": "{\'sscat_id\':700}"\n },\n "search_space": {\n "embedding": [0.036, -0.048, ...]\n }\n}\n'})}),"\n",(0,s.jsx)(n.h3,{id:"real-time-consumers",children:"Real-Time Consumers"}),"\n",(0,s.jsx)(n.p,{children:"A generic Kafka schema is used for all real-time consumers, simplifying new integrations:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-json",children:'{\n "timestamp": 1719308350,\n "entity_label": "catalog",\n "data": [\n {\n "id": "125138466",\n "label": "is_live_ad",\n "value": "true"\n }\n ]\n}\n'})}),"\n",(0,s.jsx)(n.h3,{id:"retry-topic",children:"Retry Topic"}),"\n",(0,s.jsx)(n.p,{children:"Failed ingestion events are published to a retry topic for reprocessing, ensuring no data loss:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-json",children:'{\n "timestamp": 1719308350,\n "entity_label": "catalog",\n "model_name": "intent_model",\n "variant": "organic",\n "data": [\n {\n "id": "125138466",\n "label": "is_live_ad",\n "value": "true"\n }\n 
]\n}\n'})}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"key-design-decisions",children:"Key Design Decisions"}),"\n",(0,s.jsx)(n.h3,{id:"pluggable-vector-database-support",children:"Pluggable Vector Database Support"}),"\n",(0,s.jsxs)(n.p,{children:["Skye introduces a generic ",(0,s.jsx)(n.code,{children:"vector_db_type"})," configuration and converts vendor-specific configs to a generic ",(0,s.jsx)(n.code,{children:"vector_config"}),", enabling support for multiple vector database backends beyond Qdrant."]}),"\n",(0,s.jsx)(n.h3,{id:"variant-based-model-sharing",children:"Variant-Based Model Sharing"}),"\n",(0,s.jsx)(n.p,{children:"By eliminating the tenant-based construct and introducing variants, Skye allows:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Models to be shared across tenants without duplication"}),"\n",(0,s.jsx)(n.li,{children:"Each variant to have its own filter criteria, vector DB config, and job frequency"}),"\n",(0,s.jsx)(n.li,{children:"Independent read/write version tracking per variant"}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"scylladb-for-real-time-aggregation",children:"ScyllaDB for Real-Time Aggregation"}),"\n",(0,s.jsx)(n.p,{children:"Replaced Delta Lake with self-hosted ScyllaDB for cost efficiency. The aggregator is entity-generic (not model/version-specific) since all real-time data is consistent across models."}),"\n",(0,s.jsx)(n.h3,{id:"event-driven-state-management",children:"Event-Driven State Management"}),"\n",(0,s.jsx)(n.p,{children:"Model state transitions are handled via Kafka events with a SQL database backing store. 
This eliminates:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Single points of failure in admin/ingestion flows"}),"\n",(0,s.jsx)(n.li,{children:"Models getting stuck during pod restarts"}),"\n",(0,s.jsx)(n.li,{children:"Manual intervention for consumer pause/resume"}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"resiliency",children:"Resiliency"}),"\n",(0,s.jsxs)(n.table,{children:[(0,s.jsx)(n.thead,{children:(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.th,{children:"Mechanism"}),(0,s.jsx)(n.th,{children:"Description"})]})}),(0,s.jsxs)(n.tbody,{children:[(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"Retry Topics"})}),(0,s.jsx)(n.td,{children:"Failed ingestion messages are captured in a failure topic for reprocessing"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"Circuit Breakers"})}),(0,s.jsx)(n.td,{children:"Applied to similarity search API calls to throttle RPS during failures"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"Snapshot Backups"})}),(0,s.jsx)(n.td,{children:"Periodic collection snapshots enable quick restore during downtime"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"Automated Cluster Setup"})}),(0,s.jsx)(n.td,{children:"Scripted provisioning eliminates configuration inconsistencies"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.strong,{children:"Databricks Job Retries"})}),(0,s.jsx)(n.td,{children:"Lambda functions with retry mechanisms for failed ingestion jobs"})]})]})]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"scalability",children:"Scalability"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Vector DB Scaling"}),": Generic scripts for adding nodes to existing clusters, enabling horizontal scaling based on load and 
RPS"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Service Scaling"}),": Hosted on EKS with CPU-based autoscaling"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Experiment Isolation"}),": Experiments run on separate EKS and vector DB clusters, reducing production cluster complexity"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Indexed-Only Search"}),": The ",(0,s.jsx)(n.code,{children:"search_indexed_only"})," flag ensures queries only search indexed space, avoiding latency from brute-force searches on partially built indexes"]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"observability",children:"Observability"}),"\n",(0,s.jsx)(n.h3,{id:"metrics-per-model--variant",children:"Metrics (per model + variant)"}),"\n",(0,s.jsxs)(n.table,{children:[(0,s.jsx)(n.thead,{children:(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.th,{children:"Metric"}),(0,s.jsx)(n.th,{children:"Description"})]})}),(0,s.jsxs)(n.tbody,{children:[(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"avg_similar_candidates"})}),(0,s.jsx)(n.td,{children:"Average number of similarity candidates returned"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:(0,s.jsx)(n.code,{children:"avg_recall"})}),(0,s.jsx)(n.td,{children:"Score of the first similar catalog returned"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Service Latency"}),(0,s.jsx)(n.td,{children:"P99.9 / P99 / P95 / P50"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Service 5xx Count"}),(0,s.jsx)(n.td,{children:"Error rate monitoring"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Vector DB Latency"}),(0,s.jsx)(n.td,{children:"P99.9 / P99 / P95 / P50"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Vector DB QPS"}),(0,s.jsx)(n.td,{children:"Throughput monitoring"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"ScyllaDB Latency"}),(0,s.jsx)(n.td,{children:"P99.9 / P99 / P95 / 
P90"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Redis Latency"}),(0,s.jsx)(n.td,{children:"P99.9 / P99 / P95 / P90"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Redis Hit %"}),(0,s.jsx)(n.td,{children:"Cache effectiveness"})]})]})]}),"\n",(0,s.jsx)(n.h3,{id:"alerts",children:"Alerts"}),"\n",(0,s.jsxs)(n.table,{children:[(0,s.jsx)(n.thead,{children:(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.th,{children:"Alert"}),(0,s.jsx)(n.th,{children:"Threshold"})]})}),(0,s.jsxs)(n.tbody,{children:[(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Indexed Vector Count"}),(0,s.jsx)(n.td,{children:"< 95%"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Events to Failure Topic"}),(0,s.jsx)(n.td,{children:"Rate > 0"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Service 5xx"}),(0,s.jsx)(n.td,{children:"< 10"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Service Latency"}),(0,s.jsx)(n.td,{children:"Model-dependent SLA"})]})]})]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"technology-stack",children:"Technology Stack"}),"\n",(0,s.jsxs)(n.table,{children:[(0,s.jsx)(n.thead,{children:(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.th,{children:"Component"}),(0,s.jsx)(n.th,{children:"Technology"})]})}),(0,s.jsxs)(n.tbody,{children:[(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Language"}),(0,s.jsx)(n.td,{children:"Go"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Vector Database"}),(0,s.jsx)(n.td,{children:"Qdrant (pluggable)"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Embedding Storage"}),(0,s.jsx)(n.td,{children:"ScyllaDB"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Real-Time Aggregation"}),(0,s.jsx)(n.td,{children:"ScyllaDB"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Caching"}),(0,s.jsx)(n.td,{children:"Redis + In-Memory"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Message 
Queue"}),(0,s.jsx)(n.td,{children:"Kafka"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Configuration"}),(0,s.jsx)(n.td,{children:"ZooKeeper / etcd"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Container Orchestration"}),(0,s.jsx)(n.td,{children:"Kubernetes (EKS)"})]}),(0,s.jsxs)(n.tr,{children:[(0,s.jsx)(n.td,{children:"Job Orchestration"}),(0,s.jsx)(n.td,{children:"Databricks"})]})]})]})]})}function h(e={}){const{wrapper:n}={...(0,t.R)(),...e.components};return n?(0,s.jsx)(n,{...e,children:(0,s.jsx)(o,{...e})}):o(e)}}}]); \ No newline at end of file diff --git a/docs/assets/js/6479fb86.3f75012c.js b/docs/assets/js/6479fb86.3f75012c.js new file mode 100644 index 00000000..3d01bfe6 --- /dev/null +++ b/docs/assets/js/6479fb86.3f75012c.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[5579],{3751:e=>{e.exports=JSON.parse('{"archive":{"blogPosts":[{"id":"post-five","metadata":{"permalink":"/BharatMLStack/blog/post-five","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-five/index.md","source":"@site/blog/bharatmlstack-history/post-five/index.md","title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","description":"BharatMLStack","date":"2025-06-02T00:00:00.000Z","tags":[{"inline":true,"label":"llm","permalink":"/BharatMLStack/blog/tags/llm"},{"inline":true,"label":"vllm","permalink":"/BharatMLStack/blog/tags/vllm"},{"inline":true,"label":"tensorrt-llm","permalink":"/BharatMLStack/blog/tags/tensorrt-llm"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":4.93,"hasTruncateMarker":false,"authors":[{"name":"Jaya Kumar","title":"Lead ML Engineer @ 
Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null}],"frontMatter":{"slug":"post-five","title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","authors":["jaya"],"date":"2025-6-2","tags":["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"nextItem":{"title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","permalink":"/BharatMLStack/blog/post-four"}},"content":"![BharatMLStack](./bms.png)\\n## LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale\\n\\nRaw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack\u2014from memory management to kernel execution.\\n\\n## 1. Advanced Memory Management: Paged & Prefix KV Caching\\n\\nThe most significant bottleneck in LLM inference is not always compute, but memory bandwidth\u2014specifically managing the Key-Value (KV) cache.\\n\\n### Paged KV caching\\n\\nStandard caching suffers from fragmentation. We use **Paged KV caching**, which operates similarly to an operating system\'s virtual memory: the KV cache is divided into non-contiguous blocks. This lets us serve larger batch sizes without running out of memory.\\n\\n### KV cache quantization\\n\\nTo further maximize available memory, we implement **KV cache quantization** (e.g., FP8). 
By compressing stored attention keys and values from 16-bit to 8-bit, we nearly double the effective context window capacity of the GPU, allowing longer conversations or larger batches without materially degrading quality.\\n\\n### Prefix caching (the \\"voice bot\\" optimizer)\\n\\nFor use cases like GenAI voice bots where the system prompt (e.g., \\"You are a helpful assistant...\\") is static across thousands of requests, we enable **prefix caching**.\\n\\n- **Impact**: By reusing pre-computed KV states for common prefixes, we achieve a cache hit rate of ~90%. This reduces **Time To First Token (TTFT)** by skipping redundant computation of the system prompt.\\n\\n## 2. Aggressive Quantization (INT4 AWQ & FP8)\\n\\nRunning models in their native 16-bit precision (BF16) restricts maximum batch size and throughput. We use quantization to shrink model weights without sacrificing accuracy.\\n\\n### INT4 AWQ (Activation-aware Weight Quantization)\\n\\nFor the Llama 3 family, we use **AWQ** to compress weights to 4 bits. This reduces model size by ~75%, allowing larger models to fit into L4 GPU memory and significantly improving token generation speed.\\n\\n### FP8 precision\\n\\nFor NVIDIA Hopper (H100) architectures, we are exploring **FP8 quantization**, leveraging native FP8 tensor cores to accelerate matrix multiplications while maintaining a higher dynamic range than integer quantization.\\n\\n- **Verification**: We validate quantized models by comparing dot-product similarity of embeddings against the FP16 baseline, consistently achieving **>99% similarity**.\\n\\n## 3. 
Kernel Fusion & Custom Plugins\\n\\nTo minimize overhead from launching thousands of small GPU operations, we fuse them into monolithic kernels using NVIDIA TensorRT plugins.\\n\\n- **Flash attention & FMHA**: We enable **Fused Multi-Head Attention (FMHA)** combined with flash attention to reduce memory reads/writes.\\n- **GEMM plugins**: We use specialized **GEMM** plugins to accelerate transformer linear layers.\\n- **Removing input padding**: Instead of padding short sequences to match the longest, we remove input padding so the GPU processes only valid tokens.\\n\\n## 4. Inflight (Continuous) Batching\\n\\nTraditional static batching waits for all requests in a batch to finish before returning results\u2014so one long response delays everyone else.\\n\\nWe implement **inflight batching**: as soon as one request completes, its slot is freed and filled by a new request from the queue. This keeps GPUs saturated and decouples latency of short queries from long ones.\\n\\n## 5. Parallelism Strategies: Scaling Beyond One GPU\\n\\nFor large models (e.g., 70B+ parameters) that cannot fit into the VRAM of a single GPU, we use parallelism strategies.\\n\\n- **Tensor parallelism (TP)**: Split weight matrices across multiple GPUs (e.g., 4\xd7 L4 or 8\xd7 A100). Each GPU computes a shard and outputs are reduced at every layer.\\n- **Pipeline parallelism (PP)**: Split model layers across GPUs to pipeline compute (e.g., while one GPU computes later layers for Request A, another starts early layers for Request B).\\n\\n## 6. Speculative Decoding\\n\\nTo reduce inter-token latency (ITL), we explore **speculative decoding**.\\n\\n- **Mechanism**: A smaller, faster \\"draft\\" model speculatively generates a short token sequence (e.g., 5 tokens).\\n- **Verification**: The larger target model verifies those tokens in one parallel forward pass. If correct, we effectively generate multiple tokens per large-model step; if not, we discard and regenerate. 
This is effective for predictable text, improving perceived generation speed.\\n\\n## Few Benchmarks\\n\\nBelow are a couple of representative use cases and performance numbers.\\n\\n### Search query rewriting\\n\\n- **LLM**: Fine-tuned llama-3.2-1B\\n- **Input & output token length**: ~10\u201320\\n- **Response type**: Non-streaming\\n\\n| Inference runtime | Hardware | Max requests/sec | Max p99 latency |\\n| --- | --- | ---: | ---: |\\n| TensorRT-LLM | 4 \xd7 L4 GPUs (multi-GPU) | 1000 | 95 ms |\\n| TensorRT-LLM | 1 \xd7 A100 40 GB GPU | 1000 | 69 ms |\\n\\n### Voice bot query\\n\\n- **LLM**: Llama-3.1-8B\\n- **Input token length**: ~1900\u20132000\\n- **Output token length**: ~200\\n- **Response type**: Streaming\\n\\n| Inference runtime | Concurrency | p99 TTFT (ms) | p99 ITL (ms) | Token throughput (tokens/sec) | Request throughput (req/sec) | Hardware |\\n| --- | ---: | ---: | ---: | ---: | ---: | --- |\\n| TensorRT-LLM | 1 | 36.27 | 22.78 | 45.66 | 0.23 | L4 |\\n| TensorRT-LLM | 2 | 49.81 | 23.21 | 89.37 | 0.45 | L4 |\\n| TensorRT-LLM | 4 | 55.33 | 36.62 | 153.39 | 0.78 | L4 |\\n| TensorRT-LLM | 8 | 66.5 | 39.11 | 279.88 | 1.47 | L4 |\\n| TensorRT-LLM | 16 | 131.8 | 30.39 | 547.8 | 2.77 | L4 |\\n| TensorRT-LLM | 32 | 277.22 | 48.02 | 925.7 | 4.78 | L4 |\\n| TensorRT-LLM | 64 | 498.52 | 71.62 | 1,164.40 | 6.2 | L4 |\\n| TensorRT-LLM | 128 | 677.31 | 120.37 | 1,445.18 | 7.69 | L4 |\\n| TensorRT-LLM | 256 | 1,926.31 | 216.88 | 1,600.81 | 8.52 | L4 |\\n| TensorRT-LLM | 1 | 21.17 | 9.24 | 130.05 | 0.68 | A100 |\\n| TensorRT-LLM | 2 | 25.78 | 9.21 | 264.5 | 1.35 | A100 |\\n| TensorRT-LLM | 4 | 28.52 | 10.99 | 437.69 | 2.27 | A100 |\\n| TensorRT-LLM | 8 | 34.4 | 12.61 | 760.49 | 3.96 | A100 |\\n| TensorRT-LLM | 16 | 68.03 | 14.32 | 1,343.80 | 7.01 | A100 |\\n| TensorRT-LLM | 32 | 185.96 | 16.82 | 2,287.30 | 11.92 | A100 |\\n| TensorRT-LLM | 64 | 136.87 | 21.17 | 3,625.22 | 18.89 | A100 |\\n| TensorRT-LLM | 128 | 463.78 | 34.15 | 4,456.51 | 23.24 | A100 |\\n| 
TensorRT-LLM | 256 | 890.12 | 59.18 | 5,188.24 | 27.05 | A100 |\\n\\n## Conclusion\\n\\nHigh-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure.\\n\\nThese optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications."},{"id":"post-four","metadata":{"permalink":"/BharatMLStack/blog/post-four","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-four/index.md","source":"@site/blog/bharatmlstack-history/post-four/index.md","title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","description":"BharatMLStack","date":"2025-03-29T00:00:00.000Z","tags":[{"inline":true,"label":"llm","permalink":"/BharatMLStack/blog/tags/llm"},{"inline":true,"label":"vllm","permalink":"/BharatMLStack/blog/tags/vllm"},{"inline":true,"label":"tensorrt-llm","permalink":"/BharatMLStack/blog/tags/tensorrt-llm"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":13.38,"hasTruncateMarker":false,"authors":[{"name":"Jaya Kumar","title":"Lead ML Engineer @ Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null}],"frontMatter":{"slug":"post-four","title":"Designing a 
Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","authors":["jaya"],"date":"2025-3-29","tags":["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","permalink":"/BharatMLStack/blog/post-five"},"nextItem":{"title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","permalink":"/BharatMLStack/blog/post-three"}},"content":"![BharatMLStack](./bms.png)\\n## Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving\\n\\n\\n\\nServing large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale.\\n\\nThe platform implements a complete LLMOps lifecycle \u2014 from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required.\\n\\nIn addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques \u2014 such as quantization strategies, batching configurations, and runtime-specific performance enhancements \u2014 enabling teams to balance latency, throughput, and cost based on their use case. 
The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference.\\n\\n## Why LLM Inference Is Not Just Bigger ML Model Serving\\n\\nLarge language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled.\\n\\n### Autoregressive Generation and Sequential Computation:\\n\\nUnlike traditional models such as classifiers or recommenders \u2014 where inference cost is relatively constant \u2014 LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation.\\nBecause tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution.\\n\\n### Prefill and Decode Phases:\\n\\nLLM inference typically consists of two distinct stages:\\n\\n- Prefill phase \u2014 the model processes the input prompt and builds internal representations. 
This stage is compute-heavy and highly parallelizable.\\n- Decode phase \u2014 the model generates tokens sequentially, predicting one token at a time using previously generated context.\\n\\nThe decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads.\\n\\n### Context Management and KV Caching:\\n\\nAnother fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens.\\nKV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:\\n\\n- Memory consumption grows with sequence length and batch size\\n- GPU memory becomes a critical bottleneck\\n- Efficient memory management becomes essential for scaling concurrent requests\\n\\nThis tradeoff between compute efficiency and memory usage is unique to LLM inference workloads.\\n\\n### Dynamic and Irregular Workloads:\\n\\nTraditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. As a result:\\n\\n- Batch sizes must be dynamic rather than static\\n- Requests may enter and leave batches asynchronously\\n- Scheduling systems must continuously rebalance workloads to maximize GPU utilization\\n\\nThese characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines.\\n\\n### Streaming and User Experience Constraints:\\n\\nAnother distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated. 
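\\n\\nThe prefill-then-decode loop with token streaming described above can be sketched in a few lines of illustrative Python. This is a toy stand-in only \u2014 `toy_decode_stream` and its fake scoring rule are invented for this sketch and are not part of any real inference runtime:

```python
from typing import Iterator, List

def toy_decode_stream(prompt: List[int], max_new_tokens: int) -> Iterator[int]:
    """Toy autoregressive decoder: prefill once, then stream tokens."""
    # Prefill: process the whole prompt in one parallel pass and cache
    # per-token state (this list stands in for the real K/V tensors).
    kv_cache: List[int] = list(prompt)
    # Decode: generate one token at a time, reusing the cache instead of
    # re-encoding the full sequence on every step.
    for _ in range(max_new_tokens):
        next_token = (sum(kv_cache) + len(kv_cache)) % 50_000  # fake "model"
        kv_cache.append(next_token)  # the cache grows with sequence length
        yield next_token             # stream the token to the caller immediately

# Tokens arrive one at a time, so a server can flush each one to the client
# as soon as it is produced instead of waiting for the full response.
for token in toy_decode_stream([1, 2, 3], max_new_tokens=4):
    print(token)
```

Even this toy loop exhibits the properties discussed above: the cache grows with every generated token (memory pressure), and each step depends on the previous one (sequential decode).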
\\nBecause of these differences \u2014 sequential generation, growing memory requirements, dynamic workloads, and streaming constraints \u2014 LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads.\\n\\n## LLMOps: High-Level Architecture \\n\\n![LLM Architecture](./llm-plat.png)\\n\\nThe LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention.\\n\\nOur LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability.\\n\\n1. Onboarding & Registration (The Source of Truth)\\n\\n The lifecycle begins with the Data Scientist or engineer.\\n\\n - Model Ingestion: Users onboard models\u2014whether open-source (Hugging Face, NeMo) or internally fine-tuned\u2014via the Truffle Box SDK/UI.\\n - LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. This allows Data Scientists to register and version-control prompts (e.g., \\"customer_support_v2\\") independently of the application code.\\n\\n2. 
The \\"Black Box\\" Build Engine\\n\\n Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources.\\n\\n - Transformation: The raw model is converted into a TRT-LLM Checkpoint.\\n - Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint.\\n - Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware.\\n\\n3. Intelligent Profiling & Validation\\n\\n Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler.\\n\\n - Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. vLLM).\\n - Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost.\\n\\n4. Smart Artifact Generation & Distribution\\n\\n To solve the Kubernetes \\"Cold Start\\" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:\\n\\n - Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup.\\n - Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times.\\n\\n5. Image Streaming & Deployment\\n\\n Simultaneously, the inference runtime container images are pulled from the Artifact Registry.\\n\\n - Image Streaming: We utilize container image streaming to allow pods to start initializing while the massive Triton/Dynamo container layers are still downloading, further shaving seconds off the startup time.\\n\\n6. 
The Inference Runtime (Kubernetes)\\n\\n The workload lands on Kubernetes with Autoscaling.\\n\\n - Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference.\\n - Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk (\\"Pull from Disk\\").\\n\\n7. Client Interaction & Observability\\n\\n Finally, the LLM Inference Client executes the request.\\n\\n - Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used.\\n - Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time.\\n\\n8. Observability: Monitoring the Pulse of GenAI\\n\\n In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. A user doesn\'t care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows.\\n\\n To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:\\n\\n 1. Time to First Token (TTFT)\\n - Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user.\\n - Why it matters: This represents the \\"Prefill Phase\\" latency\u2014the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or \\"hung.\\"\\n - Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hit rates), which drastically lowers this metric by skipping redundant prompt processing.\\n\\n 2. 
Inter-Token Latency (ITL)\\n - Definition: ITL measures the average time interval between the generation of consecutive tokens during the \\"Decode Phase\\".\\n - Why it matters: This defines the \\"perceived speed\\" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look \\"jerky\\" or slow to the user.\\n - Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speeds to maintain a natural conversational flow.\\n\\n 3. Token Throughput vs. Request Throughput\\n - We distinguish between two types of throughput to balance system efficiency with user load:\\n - Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching.\\n - Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before the queue depth impacts ITL.\\n\\n 4. The Monitoring Stack\\n - Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot \\"slow generation\\" incidents that generic \\"500 error\\" alerts would miss.\\n - Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific \\"slow\\" request back to its prompt to understand if a complex input caused the latency spike.\\n\\n## Supported Inference Backends (TensorRT-LLM, Dynamo & vLLM)\\n\\nTailored for the Use Case: We do not believe in a \\"one-size-fits-all\\" approach to inference. Different use cases\u2014whether a real-time voice bot requiring ultra-low sub-second latency or a massive reasoning task requiring huge context windows\u2014demand different runtime characteristics. 
Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:\\n\\n1. TensorRT-LLM: The High-Performance Standard\\n\\n Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots).\\n\\n TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization.\\n\\n Key optimizations we tailor for these high-load cases include:\\n\\n - Optimized execution via TensorRT engine compilation\\n - Quantization-aware execution for reduced memory usage and improved throughput\\n - Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization.\\n - Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms.\\n\\n2. Dynamo: Distributed Inference for Reasoning Models\\n\\n Suitable for: Very large \\"reasoning\\" models (70B+) or scenarios requiring massive context windows where a single GPU\'s memory is insufficient.\\n\\n For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework. Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:\\n\\n - KV Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation.\\n - Prefill vs. 
Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy \\"reading\\" phase independently from the memory-heavy \\"writing\\" phase.\\n - Distributed execution across multiple GPU resources\\n\\n3. vLLM: The Flexible Baseline\\n\\n Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput.\\n\\n While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline.\\n\\n - High throughput through dynamic batching and efficient memory utilization\\n - Paged KV cache management for handling long contexts and concurrent requests\\n - Strong support for open-source model ecosystems\\n - Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build.\\n - Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. We use it strategically for initial testing before committing to a full TensorRT optimization pipeline.\\n\\n## Conclusion\\n\\nLarge language model inference introduces a fundamentally new class of infrastructure challenges\u2014where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads.\\n\\nThe LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle\u2014from model onboarding and compilation to deployment, optimization, and observability. 
By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity.\\n\\nEqually important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows.\\n\\nUltimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment\u2014allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences.\\n\\n## Future Explorations\\n\\nWhile we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:\\n\\n- TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs to bake them into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. 
This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.\\n- Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a \\"serverless\\" experience where specific fine-tunes are hot-swapped instantly per request.\\n- Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user\'s streaming experience.\\n- Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., \\"How do I reset my password?\\" vs. \\"Password reset steps\\"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.\\n- Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before the memory fills up, preventing eviction-based slowdowns during traffic spikes.\\n- Online Evaluation & Guardrails: We are integrating a lightweight \\"Trust Layer\\" into the proxy. 
This will allow for low-latency input/output filtering (Guardrails) and asynchronous \\"LLM-as-a-Judge\\" evaluation pipelines to monitor response quality in production, not just system health."},{"id":"post-three","metadata":{"permalink":"/BharatMLStack/blog/post-three","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-three/index.md","source":"@site/blog/bharatmlstack-history/post-three/index.md","title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","description":"BharatMLStack","date":"2024-05-21T00:00:00.000Z","tags":[{"inline":true,"label":"model-inference","permalink":"/BharatMLStack/blog/tags/model-inference"},{"inline":true,"label":"embedding-search","permalink":"/BharatMLStack/blog/tags/embedding-search"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":3.6,"hasTruncateMarker":false,"authors":[{"name":"Aditya Kumar","title":"Lead Software Engineer @ Meesho","url":"https://github.com/Adit2607","imageURL":"https://github.com/Adit2607.png","key":"aditya","page":null},{"name":"Jaya Kumar","title":"Lead ML Engineer @ Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null},{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null}],"frontMatter":{"slug":"post-three","title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","authors":["aditya","jaya","adarsha"],"date":"2024-05-21T00:00:00.000Z","tags":["model-inference","embedding-search","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"Designing a Production-Grade LLM Inference Platform: From 
Model Weights to Scalable GPU Serving","permalink":"/BharatMLStack/blog/post-four"},"nextItem":{"title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","permalink":"/BharatMLStack/blog/post-two"}},"content":"![BharatMLStack](./bms.png)\\n\\n## Cracking the Code: Scaling Model Inference & Real-Time Embedding Search\\n\\nBy mid-2023, we had transformed our ML stack\u2014building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:\\n\\n- \ud83d\udd39 Scaling model inference without hitting infrastructure roadblocks\\n- \ud83d\udd39 Moving embedding search from batch to real-time for candidate generation\\n\\nHere\u2019s how we tackled these last-mile challenges, broke free from infrastructure constraints, and built a cost-efficient, high-performance system.\\n\\n## Breaking Free from the Scalability Ceiling\\n\\n### The Model Serving Bottleneck\u2014A Wake-Up Call\\n\\nJuly 2023. With just months left for the Mega Blockbuster Sale (MBS), we noticed a serious issue\u2014scaling our model-serving infrastructure was taking 10\u201315 minutes. In real-time ML, that\u2019s an eternity.\\nIn one of our war rooms, we ran a quick experiment:\\n\\n- \ud83d\ude80 We deployed an XGBoost model on a self-hosted Triton Inference Server running on a 16-core machine.\\n- \ud83d\ude80 Fired requests and compared the outputs with our existing cloud-hosted setup.\\n- \ud83d\ude80 The results matched\u2014perfectly.\\n\\nThat moment changed everything. We prepped a backup Triton setup on EKS, just in case our cloud provider couldn\'t allocate enough compute resources in time. Luckily, they did\u2014but the seed was planted.\\nThen in October, just two weeks before MBS, we got an alarming response from our infrastructure team:\\n \\"Node availability may be an issue.\\"\\nWith no time to waste, we moved 30% of real-time ML traffic to our self-hosted Triton cluster. 
The results?\\n\\n- \u2705 p99 latency dropped from 90\u2013100ms to 30\u201340ms\\n- \u2705 Triton handled significantly higher throughput on fewer resources\\n- \u2705 No model changes were needed\\n\\nMBS ran without a hitch, proving that self-hosted inference was the way forward.\\n\\n### Scaling Triton on GKE\\n\\nThis left us with two choices:\\n\\n- 1\ufe0f\u20e3 Port models to a managed cloud inference service, investing time in learning a new deployment stack\\n- 2\ufe0f\u20e3 Scale our existing Triton setup on GKE, optimizing for cost and performance\\n\\nWe went with Option 2\u2014and it slashed inference costs to 35% of what we previously paid, while giving us full control over scaling and optimizations.\\n\\n### Fixing the Cold Start Problem\\n\\nAs we onboarded more deep learning (DL) models, we hit a new bottleneck: new inference pods took 7\u20139 minutes to spin up.\\n\\nAfter profiling, we found the culprits:\\n\\n- Triton\u2019s base image\u2014a massive 5GB\\n- Model binaries\u2014often 1GB+\\n- Startup delay\u2014mostly due to downloading and initializing these assets\\n\\nTo fix this, we built a lightweight Triton image, stripping unused components and shrinking the size to 900MB. This cut cold start times drastically, making auto-scaling faster and smoother.\\n\\n## Embedding Search: The Last Piece of the Puzzle\\n\\nBy mid-2023, most of our ML stack had gone real-time\u2014except for Candidate Generation (CG), which still ran in batch mode. 
To truly power real-time recommendations, we needed an online embedding search system.\\n\\n### Choosing the Right Vector Database\\n\\nWe benchmarked three production-ready vector DBs across key parameters:\\n\\n- Milvus\\n- Qdrant\\n- Weaviate\\n\\nAfter extensive POCs, Qdrant stood out for its:\\n\\n- \u2705 Blazing-fast search latency on high-dimensional vectors\\n- \u2705 Efficient memory usage, crucial for in-memory workloads\\n- \u2705 Support for upserts and soft deletes, vital for Ads use cases\\n- \u2705 gRPC + REST APIs, making integration seamless\\n- \u2705 Powerful filtering, allowing fine-tuned retrieval (e.g., filtering Ads by category, active status, etc.)\\n\\nAt its core, Qdrant uses HNSW indexing, delivering both high recall and low-latency nearest-neighbor search\u2014a perfect fit for our needs.\\n\\n### Embedding Freshness & Real-Time Updates\\n\\nTo ensure embeddings stayed up to date, we built a dual ingestion pipeline:\\n\\n- \ud83d\udccc Daily Refresh: A bulk pipeline updated embeddings overnight\\n- \ud83d\udccc Real-Time Updates: Ads events triggered immediate upserts/deletes\\n\\nThis setup powered real-time \\"Similar Products\\" recommendations on the product page and became the foundation for Ads Candidate Generation, ensuring the right ads surfaced in milliseconds.\\n\\n![Skye](./vss.png)\\n\\n## Final Takeaways: Scaling Smartly for Real-Time ML\\n\\n- \ud83d\ude80 Self-hosted inference on Triton gave us lower cost, faster scaling, and better performance than managed services\\n- \ud83d\ude80 Building a custom Triton image reduced cold starts, improving responsiveness\\n- \ud83d\ude80 Qdrant-based embedding search enabled real-time personalization at scale\\n- \ud83d\ude80 Real-time updates for embeddings unlocked dynamic, up-to-date recommendations\\n\\nBy early 2024, Meesho\u2019s ML stack had evolved into a fully real-time, scalable, and cost-efficient system, setting the foundation for even bigger leaps 
ahead."},{"id":"post-two","metadata":{"permalink":"/BharatMLStack/blog/post-two","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-two/index.md","source":"@site/blog/bharatmlstack-history/post-two/index.md","title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","description":"BharatMLStack","date":"2023-04-10T00:00:00.000Z","tags":[{"inline":true,"label":"inferflow","permalink":"/BharatMLStack/blog/tags/inferflow"},{"inline":true,"label":"interaction-store","permalink":"/BharatMLStack/blog/tags/interaction-store"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":6.31,"hasTruncateMarker":false,"authors":[{"name":"Bhawani Singh","title":"Architect @ Meesho","url":"https://github.com/singh-bhawani","imageURL":"https://github.com/singh-bhawani.png","key":"bhawani","page":null},{"name":"Jigar Dave","title":"Lead Software Engineer @ Meesho","url":"https://github.com/jigarpatel26","imageURL":"https://github.com/jigarpatel26.png","key":"jigar","page":null},{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null}],"frontMatter":{"slug":"post-two","title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","authors":["bhawani","jigar","adarsha"],"date":"2023-4-10","tags":["inferflow","interaction-store","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","permalink":"/BharatMLStack/blog/post-three"},"nextItem":{"title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 
1)","permalink":"/BharatMLStack/blog/post-one"}},"content":"![BharatMLStack](./bms.png)\\n## Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)\\n\\nBy late 2022, we had built something we were truly proud of\u2014a real-time ML serving system with a DAG-based executor, a feature store, and an interaction store powering key ranking and personalization models. It was a major milestone, the culmination of months of effort from data scientists, ML engineers, and backend teams. Our system was live, and we were ready to push the boundaries of experimentation.\\nAnd it worked. Mostly.\\nBut soon, cracks appeared. Every new model needed custom feature retrieval logic, DAGs became dense and unmanageable, and scaling turned into a constant firefight. Costs surged, and infra bottlenecks slowed experimentation. Our system worked, but it wasn\u2019t built for scale.\\nThis is the story of how we tackled these challenges\u2014building Inferflow for seamless feature retrieval, optimizing real-time infra, and cutting costs while scaling to millions of QPS.\\n\\n### The Cost of Success\\nEvery new Ranker model required its own feature set, often pulling from different entities. Each addition meant:\\n\\n- Adding new DAG nodes in IOP\\n- Writing custom logic to fetch features from multiple sources (e.g., user, product, user \xd7 category)\\n- Inferring intermediate features (e.g., extracting category from a product to fetch user \xd7 category data)\\n- Optimizing I/O and dealing with the inevitable bugs\\n\\nWhat began as clean DAGs soon turned into a tangled web of cross-dependent graphs. Every experimentation cycle meant new nodes, new dependencies, and slower iterations.\\n\\n### Scaling Pains (and Cassandra\u2019s Limits)\\nAt some point, we were hitting:\\n\\n- 250\u2013300K reads/sec\\n- 1M writes/sec (during lean hours)\\n\\nAll of this ran on Cassandra. 
While its distributed architecture had been proven in production, operating large-scale clusters came with considerable infrastructure overhead. Our proof-of-concept (POC) demonstrated throughput of around 100K ops/sec, but as we scaled further, the challenges grew. Ensuring node health, optimizing compaction, and maintaining storage balance became increasingly demanding. We also observed latency spikes under heavy load, alongside a sharp increase in total cost of ownership.\\n\\n### Interaction Store Woes\\nOur interaction store was another ticking time bomb:\\n\\n- \ud83d\udea8 Clusters kept growing in size and cost\\n- \ud83d\udea8 Latency spikes became increasingly frequent\\n- \ud83d\udea8 The DMC proxy occasionally lost locality of nodes against shards, causing cross-node communication and degraded performance\\n\\nEach time this happened, we had to manually rebalance shards just to restore stable latency, making operations unsustainable at scale.\\n\\n### Silver Linings\\nDespite the chaos, the system was live and delivering value:\\n\\n- Real-time infrastructure was in production\\n- Costs dropped by 60\u201370% compared to offline personalization\\n- New experiments rolled out faster and more successfully\\n- User engagement metrics improved\\n\\nIt wasn\u2019t perfect. It was far from easy. But it worked\u2014and that counted for a lot.\\n\\n### Round Two: Solving the Top 2 Bottlenecks\\nWith the first-gen system stretched to its limits, we stepped back. Conversations with data scientists and backend engineers revealed three recurring pain points:\\n\\n1. Coding feature retrieval logic for every new model was becoming unsustainable\\n2. ML scale was exploding\u2014bringing rising infra costs with it\\n3. 
Real-time embedding search was the next big unlock\\n\\nWe tackled them one by one\u2014starting with the biggest pain point.\\n\\n#### Problem 1: No-Code Feature Retrieval for Model Inference\\nWe noticed a pattern: for personalized ranking, models needed features from:\\n\\n- \u2705 Product\\n- \u2705 User\\n- \u2705 User \xd7 Category\\n- \u2705 Region, cohort, sub-category, etc.\\n\\nA key insight emerged: Entities that contribute features for a model always map back to the context entities.\\n\\n![MP Dag](./mp-dag.png)\\n\\nWith this, we designed Inferflow, a graph-driven feature retrieval and model orchestration system:\\n\\n- 1\ufe0f\u20e3 Inferflow takes a modelId and context IDs (e.g., userId, productIds)\\n- 2\ufe0f\u20e3 Loads a pre-defined feature retrieval graph from ZooKeeper\\n- 3\ufe0f\u20e3 Executes the graph to resolve entity relationships dynamically\\n- 4\ufe0f\u20e3 Outputs a 2D matrix of feature vectors\\n\\n\ud83d\udca1 The impact?\\n\\n- \ud83d\ude80 No more custom feature retrieval code\u2014just graph updates in config\\n- \ud83d\ude80 Feature consistency across experiments\\n- \ud83d\ude80 Faster iteration cycles for ranking, fraud detection, and beyond\\n\\nHere\u2019s a visual example that shows how this graph plays out during execution. We further extended the graph to call multiple models as needed:\\n![MP matrix](./mp-matrix.png)\\nWe built Inferflow in GoLang, using gRPC and Proto3 serialization for efficiency.\\n\\n#### Problem 2: Scaling Without Breaking the Bank\\nWith more ML use cases coming online, we needed to cut costs without compromising performance. 
We focused on:\\n\\n- \ud83d\udd39 Online Feature Store\\n- \ud83d\udd39 Interaction Store\\n\\n#### Optimizing the Online Feature Store\\nOur costs were concentrated in:\\n\\n- \ud83d\udccc Database (Cassandra)\\n- \ud83d\udccc Cache (Redis)\\n- \ud83d\udccc Running Pods (Java services)\\n\\n1\ufe0f\u20e3 Replacing Cassandra with ScyllaDB\\nAs we hit the operational limits of large Cassandra clusters, we transitioned to ScyllaDB, which offered a seamless drop-in replacement without major code changes. The switch brought significant benefits:\\n\\n- Throughput: Matched or exceeded Cassandra\'s performance under identical workloads, even under high concurrency.\\n- Latency: Achieved consistently lower P99 latencies due to ScyllaDB\'s shard-per-core architecture and better I/O utilization.\\n- Cost Efficiency: Reduced infra footprint by ~70% through better CPU and memory efficiency, eliminating the need for over-provisioned nodes.\\n\\n2\ufe0f\u20e3 Finding the Right Cache\\nTo reduce backend load and improve response times, we benchmarked multiple caching solutions\u2014Memcached, KeyDB, and Dragonfly\u2014under real production traffic patterns. Dragonfly stood out due to its robust architecture and operational simplicity:\\n\\n- Data Skew Handling: Efficiently managed extreme key hotness and uneven access patterns without performance degradation.\\n- Throughput: Delivered consistently high throughput, even with large object sizes and concurrent access.\\n- Ease of Adoption: Acted as a drop-in Redis replacement with full protocol compatibility\u2014no changes needed in application code or client libraries.\\n\\n3\ufe0f\u20e3 Moving to GoLang for Cost-Efficient Serving\\nJava services were memory-heavy\u2014so we rewrote core services in GoLang. 
The results?\\n\\n\u2705 Memory usage dropped by ~80%\\n\u2705 CPU utilization was significantly lower\\n\u2705 Faster, more efficient deployments\\n\\n#### Optimizing the Interaction Store\\nWe realized that we only need a user\u2019s interaction data in Redis when they open the app. So, we implemented a tiered storage approach:\\n\\n- \ud83d\udccc Cold Tier (ScyllaDB)\u2014Stores click, order, wishlist events\\n- \ud83d\udccc Hot Tier (Redis)\u2014Loads a user\u2019s past interactions only when they open the app\\n\\nSmart Offloading: We introduced an inactivity tracker to detect when a user session ends. At that point, Redis data was flushed back to Scylla, reducing unnecessary writes.\\n\\n![InteractionStore](./interaction-str.png)\\n#### Results\\n\\n- Online Feature Store hit 1M QPS for the first time during the 2023 Mega Blockbuster Sale\u2014without breaking a sweat\\n- Infra costs for Online Feature Store and Interaction Store dropped by ~60%\\n\\n#### The Catch: Our ML Hosting Hit a Hard Limit\\nWhile planning for 2023 MBS, we ran into a critical scalability bottleneck:\\n\\n- \u274c Insufficient compute availability in our region for ML instances\\n- \u274c Couldn\u2019t provision enough nodes to handle real-time inference at scale\\n\\nThis forced us to rethink where and how we hosted our models. The existing setup was great for prototyping\u2014but it wasn\u2019t built to handle the bursty, high-QPS demands of real-world production workloads.\\n\\n### Conclusion: From Firefighting to Future-Proofing\\nWhat started as an ambitious experiment turned into a real-time ML infrastructure that powered millions of requests per second. We battled scaling pains, rethought feature retrieval with Inferflow, and rebuilt our infra stack for efficiency\u2014driving down costs while improving experimentation velocity.\\nBut new challenges emerged. Our infrastructure could now handle scale, but our ML model hosting setup hit a hard limit. 
With compute availability bottlenecks threatening real-time inference, we faced a critical decision: how do we make model serving as scalable and cost-efficient as the rest of our stack? That\u2019s the next piece of the puzzle\u2014and the story of Part 3."},{"id":"post-one","metadata":{"permalink":"/BharatMLStack/blog/post-one","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-one/index.md","source":"@site/blog/bharatmlstack-history/post-one/index.md","title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)","description":"BharatMLStack","date":"2022-11-15T00:00:00.000Z","tags":[{"inline":true,"label":"online-feature-store","permalink":"/BharatMLStack/blog/tags/online-feature-store"},{"inline":true,"label":"interaction-store","permalink":"/BharatMLStack/blog/tags/interaction-store"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"}],"readingTime":10.25,"hasTruncateMarker":false,"authors":[{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null},{"name":"Aditya Kumar","title":"Lead Software Engineer @ Meesho","url":"https://github.com/Adit2607","imageURL":"https://github.com/Adit2607.png","key":"aditya","page":null},{"name":"Bhawani Singh","title":"Architect @ Meesho","url":"https://github.com/singh-bhawani","imageURL":"https://github.com/singh-bhawani.png","key":"bhawani","page":null},{"name":"Jigar Dave","title":"Lead Software Engineer @ Meesho","url":"https://github.com/jigarpatel26","imageURL":"https://github.com/jigarpatel26.png","key":"jigar","page":null}],"frontMatter":{"slug":"post-one","title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 
1)","authors":["adarsha","aditya","bhawani","jigar"],"date":"2022-11-15T00:00:00.000Z","tags":["online-feature-store","interaction-store","mlplatform","meesho"]},"unlisted":false,"prevItem":{"title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","permalink":"/BharatMLStack/blog/post-two"}},"content":"![BharatMLStack](./bms.png)\\n## The Genesis: How a Friday Night Roast Sparked Meesho\u2019s ML Platform\\n\\nIt all started in early 2022, over a casual Friday evening catch-up. Like many great origin stories, this one began with friendly banter between a group of backend engineers and data scientists. As the conversations unfolded, so did the roasting\u2014until one remark hit a little too close to home:\\n\\n*\\"Why are we still crunching data for Monthly Active Users (MAU) when the next day it\u2019s all about Daily Active Users (DAU)?\\"*\\n\\nThe laughter died down, and the question lingered. When we regrouped on Monday\u2014clear-headed and slightly reflective\u2014we decided to dig into the numbers. What we discovered was quite revealing: a large portion of compute resources wasn\u2019t being put to good use.\\nMuch of the system\u2019s effort was spent supporting users who weren\u2019t actively engaging, and even for new users, the experience wasn\u2019t optimized to make a meaningful impact.\\n\\nAt the same time, Meesho had just launched a company-wide initiative to reduce costs\u2014and every team had to contribute. 
This realization sparked the journey that would eventually lead to the **Meesho ML Platform**, known today as **BharatMLStack**.\\n\\n![Alt Text](./old-batch-arch.png)\\n\\nBefore the ML Platform, our recommendation and ranking pipelines followed a batch processing approach:\\n- **Data Ingestion**: The Data Platform team executed ETL jobs to ingest raw user data\u2014including user profiles, interaction logs, and product impressions\u2014into designated S3 buckets.\\n- **Layer 1**: Embedding Generation: On the Data Science side, Spark jobs pulled data from multiple S3 sources, cleaned and preprocessed it, and applied matrix factorization to generate user and item embeddings. The processed data and embeddings were then stored back in S3 in a structured format.\\n- **Layer 2**: Candidate Generation (CG): In this stage, Spark jobs leveraged embeddings and historical interaction data to generate candidate recommendations for users. These candidate lists were subsequently written to S3.\\n- **Layer 3**: Ranking and Merging \u2013 A final round of processing ranked the generated candidates using ML models, combined different candidate lists, and stored the final ranked recommendations in a caching system.\\n- **Serving**: A microservice retrieved ranked recommendations from an in-memory data store via exposed APIs, delivering personalized listings across key surfaces such as \\"For You\\" and Category Landing Pages (CLP).\\n\\nThis approach held up well\u2014until Meesho started seeing a significant surge in traffic.\\n\\n## The Turning Point: From Batch to Real-Time\\n\\nAt this time, the team was iterating on new **Ranker models**, and real-time inference seemed like the next logical step. But Rankers needed **real-time feature retrieval**, which meant an **online feature store** had to be built first.\\n\\nExploring open-source options led to **cost vs. 
performance trade-offs**, but Meesho\u2019s surging traffic meant that **latency and stability were non-negotiable**. After multiple debates and stakeholder discussions, a bold decision was made:\\n\\n*We would build our own feature store.*\\n\\nMeanwhile, efforts began to bring **Candidate Generators (CGs)** to real-time. The challenge? **Storing and retrieving user interactions quickly enough** to power real-time recommendations.\\n\\nAs the team dove deeper, a new roadblock emerged: \\nOur ML jobs were orchestrated using **Airflow DAGs**, giving data scientists flexibility in experimentation. But transitioning to real-time execution threatened this agility. Every change would now require backend engineering support, **slowing down iteration cycles**.\\n\\nThat\u2019s when the idea struck: \\nWe needed a **framework for real-time DAG execution**\u2014one that preserved the same flexibility as Airflow but worked for **streaming data**.\\n\\nThis moment shaped the **next phase of our journey**.\\n\\n## First Generation Design\\n\\n![Alt Text](./first-gen-arch.png)\\n\\n# Laying the Groundwork: The First-Gen ML Platform\\n\\nTo solve these challenges, the team built three foundational components:\\n\\n\\n### 1. IOP Framework: A Real-Time DAG Executor\\n\\n- **Reusable Nodes**: Each DAG node (e.g., an invocation to a CG service, a ranker, or a filter) had to be implemented only once. After that, it could be reused across any workflow by referencing it in config.\\n- **Config-driven Dynamic Graphs**: Execution graphs were defined as adjacency lists stored in **ZooKeeper**, allowing teams to modify the sequence or structure of operations without touching application code.\\n- **Plug-and-play CGs**: The Candidate Generator interface was preserved, so a single CG node could call any CG service by passing `cg_name` in the request. 
This drastically reduced the code surface area and improved maintainability.\\n- **Production-Grade DAGs**: DAGs were designed to execute in **low-latency real-time environments**, with support for **parallel execution, retries, and branching**.\\n\\n[More about IOP DAG](https://www.meesho.io/blog/rebuilding-meeshos-ranking-platform)\\n\\n\\n### 2. Online Feature Store - 0th Version\\n\\n- Used **Cassandra** and **Redis** for low-latency feature serving.\\n- Maintained feature consistency using **Feature Groups** with TTL-based expiry.\\n- A hybrid schema was used: feature keys stored in **ZooKeeper**, data stored in **compact arrays**.\\n\\n\\n### 3. Interaction Store - 0th Version\\n\\n- Captured real-time user interactions like clicks, orders, and add-to-cart events.\\n- Stored event data in **Redis ZSETs (sorted sets)** to enable fast lookups for recommendation engines.\\n- Provided an API to fetch a user\'s **last _k_ interactions** or **interactions within a time window**.\\n\\n\\nWith these components in place, **real-time ML at Meesho became a reality**.\\n\\nThis was just the beginning.\\n\\n## Building the Online Feature Store - 0th Version\\n\\n![Alt text](./online-feature-store-v0.png)\\n\\n### Choosing the Right Tech Stack\\n\\nWe spent considerable time evaluating various databases, caches, and communication protocols for our **online feature store**. 
After carefully weighing **cost, latency, throughput**, and **operational stability**, we settled on a combination of:\\n\\n- **Cassandra** and **Redis** for storage\\n- **gRPC + Proto3** as our communication layer\\n\\n\\n### Streamlining the Data Flow\\n\\nTo keep things simple in the initial version:\\n\\n- **Feature engineering jobs** wrote raw outputs to an **S3 bucket**\\n- A **daily feature push job**:\\n - Read from S3\\n - Grouped related features into **Feature Groups** (ensuring consistency)\\n - Pushed them to **Kafka**\\n\\nFor features requiring frequent updates:\\n\\n- **Ad-hoc jobs** computed features in higher frequency\\n- These jobs pushed to both **Kafka** and **S3** (S3 preserved historical data for future model training)\\n\\n\\n## The Challenges: Data Format and Storage\\n\\nOne of the most critical design challenges was how to store feature data **efficiently and consistently**, especially in databases like **Cassandra** and **Redis**, which come with unique storage constraints.\\n\\nWe had to solve for three key requirements:\\n\\n- ### Feature Consistency\\n When a feature group contains features like `order_count_1h` and `click_count_1h`, both must reflect the **same time window**. Inconsistent updates would lead to **unreliable model predictions**.\\n\\n- ### TTL Granularity\\n Each feature group required an **expiry timestamp**, so that **all features within it expired together**\u2014preserving consistency during reads.\\n\\n- ### Extensibility Across Databases\\n We anticipated that infra needs would evolve. 
To future-proof our system, the data format was designed to be **decoupled from DB-specific layouts**, enabling portability to systems like **ScyllaDB**, **DynamoDB**, **HBase**, or **BigTable**.\\n\\n\\n---\\n\\n## Overcoming Technical Constraints\\nAt the time, we were using Cassandra, which not only imposed a soft limit of 75 columns per row, but also exhibited significant performance degradation as the number of columns increased further, particularly in memory constrained machines. Wide rows caused high memory usage during reads, unpredictable latencies due to heavy deserialization overhead, and inefficiencies during compactions and repairs. This ruled out the naive \\"one column per feature\\" approach. We needed a format that was compact, minimized the number of columns, and remained efficient and portable across different storage systems.\\n\\n## The Solution: Schema Separation\\n\\nWe introduced the concept of Feature Groups\u2014logical groupings of features that must remain consistent with one another.\\nTo represent these groups efficiently, we adopted a layered storage approach:\\n\\n- **Feature Labels (Keys)** were stored in ZooKeeper, serving as the schema.\\n- **Feature Values** were stored as a comma-separated string array in Cassandra or Redis.\\n- **Expiry Timestamp and Schema Version** were appended using a semi-colon delimiter at the end of the string.\\n\\nExample:\\n\\n```bash\\nfeature_1_value,feature_2_value,feature_3_value;expiry_ts\\n```\\n\\nThis format allowed:\\n- Consistent writes and reads at the group level\\n- Easy parsing of feature values using the schema lookup from ZooKeeper\\n- Efficient storage with minimal DB column usage\\n- Support for per-group TTLs and schema evolution\\n\\n## Tracking Changes in Feature Groups\\nFeature groups don\u2019t stay static. As models evolve, features get added, renamed, or removed. 
But schema changes often go live before the data is ready\u2014and stopping ingestion just to wait for everything to align isn\'t feasible.\\n\\n### Common Real-World Scenarios:\\n- A new feature is added to the schema, but ingestion jobs still use the older schema version.\\n- Ongoing writes don\u2019t include the newly added feature, and stopping ingestion would break freshness for existing features.\\n- During serving, models request a mix of old and new features, depending on rollout stages.\\n\\n## The Solution: Schema Versioning\\nWe solved this with versioned feature group schemas, which unlocked several capabilities:\\n- ### Backward Compatibility\\n Older ingestion jobs can continue writing using older schema versions. During reads, the system uses the schema version embedded in the value to interpret the data correctly.\\n- ### Partial Availability Handling \\n During inference, if some features in the request aren\u2019t available (due to rollout delays or missing data), the system serves default values, ensuring the inference call doesn\u2019t fail.\\n- ### Safe Writes Without Pipeline Pauses\\n With schema versioning, we no longer had to stop ingestion pipelines for schema updates. Writes using previous versions can continue safely, and downstream consumers evolve independently.\\nThis design gave us the flexibility to move fast without breaking things\u2014preserving data quality, enabling experimentation, and ensuring reliability at scale.\\n\\n![Alt Text](./schema.png)\\n\\n## Interaction Store - 0th Version\\n\\n![Alt Text](./interaction-store-v0.png)\\n\\nTo power real-time Candidate Generators (CGs), we needed fast access to user behavior signals\u2014like what a user recently clicked, ordered, or added to their cart. 
These interactions form the basis for many real-time recommendations, such as **Similar Products**, **People Also Viewed**, or **Recently Ordered Again**.\\nFor the **0th version** of the Interaction Store, we focused on a design that was **simple, fast, and reliable** \u2014 optimized for high-throughput ingestion and low-latency lookups.\\n\\n## Event Ingestion\\nWe instrumented our backend services to emit key user interaction events to Kafka in real time. These included:\\n- Click\\n- Order\\n- Add to Cart\\n- Wishlist\\n- Share\\n\\nEach event carried essential metadata:\\n- userId \u2014 uniquely identifies the user\\n- productId \u2014 the item being interacted with\\n- timestamp \u2014 the moment the interaction occurred\\n\\nThis decoupled the interaction logging from storage, allowing ingestion and consumption to scale independently.\\n\\n## Storage Design\\nTo store these events, we built Kafka consumers that processed the incoming streams and wrote the data into Redis, using sorted sets (ZSETs) as the primary data structure.\\n\\n### Why Redis?\\nRedis gave us:\\n- **Low-latency** reads and writes\\n- **Time-ordered data** using ZSETs (via score = timestamp)\\n- **Native TTL support**, if needed in later versions\\n- **In-memory performance** \u2014ideal for real-time CGs\\n\\n### Storage Structure\\nEach user\u2019s interactions were stored using a composite key format, uniquely identifying the user and interaction type. 
This structure allowed efficient organization and quick retrieval of recent activity for recommendation generation:\\n\\n```bash\\nuserId_eventType \u2192 ZSET[...(pid, ts)...]\\n```\\n\\nWithin each ZSET:\\n\\n- The **timestamp** served as the score, maintaining temporal order\\n- The **productId** (optionally with metadata) was the **value**\\n\\nThis allowed us to efficiently retrieve interactions through an HTTP-based API server, with two query modes:\\n- Fetch the **last k interactions** of a specific type for a given user with `ZREVRANGE(userId_eventType, count)`\\n- Retrieve **all interactions within a time range** (e.g., last 24 hours) with `ZREVRANGEBYSCORE(userId_eventType, timeRange)`\\n\\n### Built-in Guardrails\\nSince Redis was the sole store, we implemented High Availability (HA) to prevent data loss. To optimize memory usage, we also enforced size limits per event type\u2014only storing the last k interactions per user, with older entries getting truncated.\\n\\n## Conclusion: Laying the Foundation for Real-Time ML\\n\\nIn this first phase, we tackled the **fundamentals**\u2014shifting from batch-based recommendations to a **real-time, ML-powered recommendation platform** that could keep up with Meesho\u2019s growth.\\n\\nWith the **IOP Framework**, **Online Feature Store**, and **Interaction Store**, we built the core infrastructure to support real-time personalization at scale. These wins have already unlocked: \\n- \u2705 Faster, more dynamic recommendations for millions of users. \\n- \u2705 Better infrastructure efficiency, reducing wasted compute power. \\n- \u2705 A flexible, modular system that allows for further experimentation.\\n\\nBut this is just the beginning. 
While we\'ve solved key challenges, **certain roadblocks remain** \u2014from optimizing **cost-performance trade-offs** to **seamlessly evolving schemas**.\\n\\n\\nThis foundational work laid the path for a reliable and scalable **real-time feature serving layer**."}]}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/6479fb86.96631f8d.js b/docs/assets/js/6479fb86.96631f8d.js deleted file mode 100644 index 6f77cfc2..00000000 --- a/docs/assets/js/6479fb86.96631f8d.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[5579],{3751:e=>{e.exports=JSON.parse('{"archive":{"blogPosts":[{"id":"post-five","metadata":{"permalink":"/BharatMLStack/blog/post-five","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-five/index.md","source":"@site/blog/bharatmlstack-history/post-five/index.md","title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","description":"BharatMLStack","date":"2025-06-02T00:00:00.000Z","tags":[{"inline":true,"label":"llm","permalink":"/BharatMLStack/blog/tags/llm"},{"inline":true,"label":"vllm","permalink":"/BharatMLStack/blog/tags/vllm"},{"inline":true,"label":"tensorrt-llm","permalink":"/BharatMLStack/blog/tags/tensorrt-llm"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":4.93,"hasTruncateMarker":false,"authors":[{"name":"Jaya Kumar","title":"Lead ML Engineer @ Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null}],"frontMatter":{"slug":"post-five","title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at 
Scale","authors":["jaya"],"date":"2025-6-2","tags":["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"nextItem":{"title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","permalink":"/BharatMLStack/blog/post-three"}},"content":"![BharatMLStack](./bms.png)\\n## LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale\\n\\nRaw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack\u2014from memory management to kernel execution.\\n\\n## 1. Advanced Memory Management: Paged & Prefix KV Caching\\n\\nThe most significant bottleneck in LLM inference is not always compute, but memory bandwidth\u2014specifically managing the Key-Value (KV) cache.\\n\\n### Paged KV caching\\n\\nStandard caching suffers from fragmentation. We use **Paged KV caching**, which operates similarly to an operating system\'s virtual memory: the KV cache is divided into non-contiguous blocks. This lets us serve larger batch sizes without running out of memory.\\n\\n### KV cache quantization\\n\\nTo further maximize available memory, we implement **KV cache quantization** (e.g., FP8). By compressing stored attention keys and values from 16-bit to 8-bit, we nearly double the effective context window capacity of the GPU, allowing longer conversations or larger batches without materially degrading quality.\\n\\n### Prefix caching (the \\"voice bot\\" optimizer)\\n\\nFor use cases like GenAI voice bots where the system prompt (e.g., \\"You are a helpful assistant...\\") is static across thousands of requests, we enable **prefix caching**.\\n\\n- **Impact**: By reusing pre-computed KV states for common prefixes, we achieve a cache hit rate of ~90%. 
This reduces **Time To First Token (TTFT)** by skipping redundant computation of the system prompt.\\n\\n## 2. Aggressive Quantization (INT4 AWQ & FP8)\\n\\nRunning models in their native 16-bit precision (BF16) restricts maximum batch size and throughput. We use quantization to shrink model weights without sacrificing accuracy.\\n\\n### INT4 AWQ (Activation-aware Weight Quantization)\\n\\nFor the Llama 3 family, we use **AWQ** to compress weights to 4 bits. This reduces model size by ~75%, allowing larger models to fit into L4 GPU memory and significantly improving token generation speed.\\n\\n### FP8 precision\\n\\nFor NVIDIA Hopper (H100) architectures, we are exploring **FP8 quantization**, leveraging native FP8 tensor cores to accelerate matrix multiplications while maintaining a higher dynamic range than integer quantization.\\n\\n- **Verification**: We validate quantized models by comparing dot-product similarity of embeddings against the FP16 baseline, consistently achieving **>99% similarity**.\\n\\n## 3. Kernel Fusion & Custom Plugins\\n\\nTo minimize overhead from launching thousands of small GPU operations, we fuse them into monolithic kernels using NVIDIA TensorRT plugins.\\n\\n- **Flash attention & FMHA**: We enable **Fused Multi-Head Attention (FMHA)** combined with flash attention to reduce memory reads/writes.\\n- **GEMM plugins**: We use specialized **GEMM** plugins to accelerate transformer linear layers.\\n- **Removing input padding**: Instead of padding short sequences to match the longest, we remove input padding so the GPU processes only valid tokens.\\n\\n## 4. Inflight (Continuous) Batching\\n\\nTraditional static batching waits for all requests in a batch to finish before returning results\u2014so one long response delays everyone else.\\n\\nWe implement **inflight batching**: as soon as one request completes, its slot is freed and filled by a new request from the queue. 
This keeps GPUs saturated and decouples latency of short queries from long ones.\\n\\n## 5. Parallelism Strategies: Scaling Beyond One GPU\\n\\nFor large models (e.g., 70B+ parameters) that cannot fit into the VRAM of a single GPU, we use parallelism strategies.\\n\\n- **Tensor parallelism (TP)**: Split weight matrices across multiple GPUs (e.g., 4\xd7 L4 or 8\xd7 A100). Each GPU computes a shard and outputs are reduced at every layer.\\n- **Pipeline parallelism (PP)**: Split model layers across GPUs to pipeline compute (e.g., while one GPU computes later layers for Request A, another starts early layers for Request B).\\n\\n## 6. Speculative Decoding\\n\\nTo reduce inter-token latency (ITL), we explore **speculative decoding**.\\n\\n- **Mechanism**: A smaller, faster \\"draft\\" model speculatively generates a short token sequence (e.g., 5 tokens).\\n- **Verification**: The larger target model verifies those tokens in one parallel forward pass. If correct, we effectively generate multiple tokens per large-model step; if not, we discard and regenerate. 
This is effective for predictable text, improving perceived generation speed.\\n\\n## Few Benchmarks\\n\\nBelow are a couple of representative use cases and performance numbers.\\n\\n### Search query rewriting\\n\\n- **LLM**: Fine-tuned llama-3.2-1B\\n- **Input & output token length**: ~10\u201320\\n- **Response type**: Non-streaming\\n\\n| Inference runtime | Hardware | Max requests/sec | Max p99 latency |\\n| --- | --- | ---: | ---: |\\n| TensorRT-LLM | 4 \xd7 L4 GPUs (multi-GPU) | 1000 | 95 ms |\\n| TensorRT-LLM | 1 \xd7 A100 40 GB GPU | 1000 | 69 ms |\\n\\n### Voice bot query\\n\\n- **LLM**: Llama-3.1-8B\\n- **Input token length**: ~1900\u20132000\\n- **Output token length**: ~200\\n- **Response type**: Streaming\\n\\n| Inference runtime | Concurrency | p99 TTFT (ms) | p99 ITL (ms) | Token throughput (tokens/sec) | Request throughput (req/sec) | Hardware |\\n| --- | ---: | ---: | ---: | ---: | ---: | --- |\\n| TensorRT-LLM | 1 | 36.27 | 22.78 | 45.66 | 0.23 | L4 |\\n| TensorRT-LLM | 2 | 49.81 | 23.21 | 89.37 | 0.45 | L4 |\\n| TensorRT-LLM | 4 | 55.33 | 36.62 | 153.39 | 0.78 | L4 |\\n| TensorRT-LLM | 8 | 66.5 | 39.11 | 279.88 | 1.47 | L4 |\\n| TensorRT-LLM | 16 | 131.8 | 30.39 | 547.8 | 2.77 | L4 |\\n| TensorRT-LLM | 32 | 277.22 | 48.02 | 925.7 | 4.78 | L4 |\\n| TensorRT-LLM | 64 | 498.52 | 71.62 | 1,164.40 | 6.2 | L4 |\\n| TensorRT-LLM | 128 | 677.31 | 120.37 | 1,445.18 | 7.69 | L4 |\\n| TensorRT-LLM | 256 | 1,926.31 | 216.88 | 1,600.81 | 8.52 | L4 |\\n| TensorRT-LLM | 1 | 21.17 | 9.24 | 130.05 | 0.68 | A100 |\\n| TensorRT-LLM | 2 | 25.78 | 9.21 | 264.5 | 1.35 | A100 |\\n| TensorRT-LLM | 4 | 28.52 | 10.99 | 437.69 | 2.27 | A100 |\\n| TensorRT-LLM | 8 | 34.4 | 12.61 | 760.49 | 3.96 | A100 |\\n| TensorRT-LLM | 16 | 68.03 | 14.32 | 1,343.80 | 7.01 | A100 |\\n| TensorRT-LLM | 32 | 185.96 | 16.82 | 2,287.30 | 11.92 | A100 |\\n| TensorRT-LLM | 64 | 136.87 | 21.17 | 3,625.22 | 18.89 | A100 |\\n| TensorRT-LLM | 128 | 463.78 | 34.15 | 4,456.51 | 23.24 | A100 |\\n| 
TensorRT-LLM | 256 | 890.12 | 59.18 | 5,188.24 | 27.05 | A100 |\\n\\n## Conclusion\\n\\nHigh-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure.\\n\\nThese optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications."},{"id":"post-three","metadata":{"permalink":"/BharatMLStack/blog/post-three","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-four/index.md","source":"@site/blog/bharatmlstack-history/post-four/index.md","title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","description":"BharatMLStack","date":"2025-03-29T00:00:00.000Z","tags":[{"inline":true,"label":"llm","permalink":"/BharatMLStack/blog/tags/llm"},{"inline":true,"label":"vllm","permalink":"/BharatMLStack/blog/tags/vllm"},{"inline":true,"label":"tensorrt-llm","permalink":"/BharatMLStack/blog/tags/tensorrt-llm"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":13.38,"hasTruncateMarker":false,"authors":[{"name":"Jaya Kumar","title":"Lead ML Engineer @ Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null}],"frontMatter":{"slug":"post-three","title":"Designing a 
Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","authors":["jaya"],"date":"2025-3-29","tags":["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","permalink":"/BharatMLStack/blog/post-five"},"nextItem":{"title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","permalink":"/BharatMLStack/blog/post-three"}},"content":"![BharatMLStack](./bms.png)\\n## Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving\\n\\n\\n\\nServing large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale.\\n\\nThe platform implements a complete LLMOps lifecycle \u2014 from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required.\\n\\nIn addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques \u2014 such as quantization strategies, batching configurations, and runtime-specific performance enhancements \u2014 enabling teams to balance latency, throughput, and cost based on their use case. 
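The optimization knobs described above (quantization strategy, batching configuration, runtime-specific enhancements) can be pictured as a single declarative deployment spec. A minimal illustrative sketch in Python; every field name here is an assumption for illustration, not the platform's actual schema:

```python
# Hypothetical deployment spec illustrating the kinds of knobs a user can tune;
# none of these field names come from the actual platform API.
deployment = {
    "model_source": "huggingface://meta-llama/Llama-3.1-8B",
    "runtime": "tensorrt-llm",        # or "vllm" / "dynamo"
    "quantization": "int4_awq",       # or "fp8" / None
    "max_batch_size": 64,
    "inflight_batching": True,
    "latency_slo_ms": {"ttft": 100, "itl": 40},
}

def validate(spec):
    """Basic sanity checks a control plane might run before deploying."""
    assert spec["runtime"] in {"tensorrt-llm", "vllm", "dynamo"}
    assert spec["quantization"] in {"int4_awq", "fp8", None}
    assert spec["max_batch_size"] > 0
    return True

print(validate(deployment))  # True
```

A control plane that accepts a spec like this can keep latency/throughput/cost trade-offs declarative rather than buried in per-team deployment scripts.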
The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference.\\n\\n## Why LLM Inference Is Not Just Bigger ML Model Serving\\n\\nLarge language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled.\\n\\n### Autoregressive Generation and Sequential Computation:\\n\\nUnlike traditional models such as classifiers or recommenders \u2014 where inference cost is relatively constant \u2014 LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation.\\nBecause tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution.\\n\\n### Prefill and Decode Phases:\\n\\nLLM inference typically consists of two distinct stages:\\n\\n- Prefill phase \u2014 the model processes the input prompt and builds internal representations. 
This stage is compute-heavy and highly parallelizable.\\n- Decode phase \u2014 the model generates tokens sequentially, predicting one token at a time using previously generated context.\\n\\nThe decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads.\\n\\n### Context Management and KV Caching:\\n\\nAnother fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens.\\nKV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:\\n\\n- Memory consumption grows with sequence length and batch size\\n- GPU memory becomes a critical bottleneck\\n- Efficient memory management becomes essential for scaling concurrent requests\\n\\nThis tradeoff between compute efficiency and memory usage is unique to LLM inference workloads.\\n\\n### Dynamic and Irregular Workloads:\\n\\nTraditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. As a result:\\n\\n- Batch sizes must be dynamic rather than static\\n- Requests may enter and leave batches asynchronously\\n- Scheduling systems must continuously rebalance workloads to maximize GPU utilization\\n\\nThese characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines.\\n\\n### Streaming and User Experience Constraints:\\n\\nAnother distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated. 
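The KV-cache memory growth described above can be estimated with a back-of-the-envelope formula: two tensors (K and V) per layer, each sized by heads, head dimension, sequence length, and batch. A minimal sketch; the model dimensions below are illustrative Llama-8B-like values, not figures from this post:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Approximate KV-cache size: 2 tensors (K and V) per layer,
    each of shape [batch, kv_heads, seq_len, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Illustrative dims (32 layers, 8 KV heads via grouped-query attention,
# head_dim 128), FP16 elements:
per_request = kv_cache_bytes(32, 8, 128, seq_len=2048, batch_size=1)
print(f"{per_request / 2**30:.2f} GiB per 2K-token request")  # 0.25 GiB
```

Because this footprint scales linearly with both sequence length and concurrent batch size, GPU memory (not compute) is frequently what caps concurrency during decode.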
\\nBecause of these differences \u2014 sequential generation, growing memory requirements, dynamic workloads, and streaming constraints \u2014 LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads.\\n\\n## LLMOps: High-Level Architecture \\n\\n![LLM Architecture](./llm-plat.png)\\n\\nThe LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention.\\n\\nOur LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability.\\n\\n1. Onboarding & Registration (The Source of Truth)\\n\\n The lifecycle begins with the Data Scientist or engineer.\\n\\n - Model Ingestion: Users onboard models\u2014whether open-source (Hugging Face, NeMo) or internally fine-tuned\u2014via the Truffle Box SDK/UI.\\n - LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. This allows Data Scientists to register and version-control prompts (e.g., \\"customer_support_v2\\") independently of the application code.\\n\\n2. 
The \\"Black Box\\" Build Engine\\n\\n Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources.\\n\\n - Transformation: The raw model is converted into a TRT-LLM Checkpoint.\\n - Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint.\\n - Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware.\\n\\n3. Intelligent Profiling & Validation\\n\\n Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler.\\n\\n - Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. vLLM).\\n - Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost.\\n\\n4. Smart Artifact Generation & Distribution\\n\\n To solve the Kubernetes \\"Cold Start\\" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:\\n\\n - Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup.\\n - Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times.\\n\\n5. Image Streaming & Deployment\\n\\n Simultaneously, the inference runtime container images are pulled from the Artifact Registry.\\n\\n - Image Streaming: We utilize container image streaming to allow pods to start initializing while the massive Triton/Dynamo container layers are still downloading, further shaving seconds off the startup time.\\n\\n6. 
The Inference Runtime (Kubernetes)\\n\\n The workload lands on Kubernetes with Autoscaling.\\n\\n - Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference.\\n - Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk (\\"Pull from Disk\\").\\n\\n7. Client Interaction & Observability\\n\\n Finally, the LLM Inference Client executes the request.\\n\\n - Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used.\\n - Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time.\\n\\n8. Observability: Monitoring the Pulse of GenAI\\n\\n In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. A user doesn\'t care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows.\\n\\n To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:\\n\\n 1. Time to First Token (TTFT)\\n - Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user.\\n - Why it matters: This represents the \\"Prefill Phase\\" latency\u2014the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or \\"hung.\\"\\n - Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hit rates), which drastically lowers this metric by skipping redundant prompt processing.\\n\\n 2. 
Inter-Token Latency (ITL)\\n - Definition: ITL measures the average time interval between the generation of consecutive tokens during the \\"Decode Phase\\".\\n - Why it matters: This defines the \\"perceived speed\\" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look \\"jerky\\" or slow to the user.\\n - Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speeds to maintain a natural conversational flow.\\n\\n 3. Token Throughput vs. Request Throughput\\n - We distinguish between two types of throughput to balance system efficiency with user load:\\n - Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching.\\n - Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before the queue depth impacts ITL.\\n\\n 4. The Monitoring Stack\\n - Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot \\"slow generation\\" incidents that generic \\"500 error\\" alerts would miss.\\n - Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific \\"slow\\" request back to its prompt to understand if a complex input caused the latency spike.\\n\\n## Supported Inference Backends (TensorRT-LLM, Dynamo & vLLM)\\n\\nTailored for the Use Case: We do not believe in a \\"one-size-fits-all\\" approach to inference. Different use cases\u2014whether a real-time voice bot requiring ultra-low, sub-second latency or a massive reasoning task requiring huge context windows\u2014demand different runtime characteristics. 
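The TTFT and ITL definitions above can be computed directly from per-token arrival timestamps; a minimal sketch (the function and its inputs are illustrative, not the platform's actual instrumentation API):

```python
def ttft_and_itl(request_ts, token_ts):
    """Given the request arrival time and the arrival times of the streamed
    tokens (all in seconds), return (TTFT, mean inter-token latency)."""
    if not token_ts:
        raise ValueError("no tokens generated")
    ttft = token_ts[0] - request_ts
    # Gaps between consecutive token arrivals define the decode-phase ITL.
    gaps = [b - a for a, b in zip(token_ts, token_ts[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Example: request at t=0, first token at 50 ms, then one token every 20 ms.
ttft, itl = ttft_and_itl(0.0, [0.05, 0.07, 0.09, 0.11])
```

In production these values would be aggregated into p99 histograms per model and per hardware profile, which is what the benchmark tables in this post report.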
Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:\\n\\n1. TensorRT-LLM: The High-Performance Standard\\n\\n Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots).\\n\\n TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization.\\n\\n Key optimizations we tailor for these high-load cases include:\\n\\n - Optimized execution via TensorRT engine compilation\\n - Quantization-aware execution for reduced memory usage and improved throughput\\n - Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization.\\n - Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms.\\n\\n2. Dynamo: Distributed Inference for Reasoning Models\\n\\n Suitable for: Very large \\"reasoning\\" models (70B+) or scenarios requiring massive context windows where a single GPU\'s memory is insufficient.\\n\\n For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework. Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:\\n\\n - KV Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation.\\n - Prefill vs. 
Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy \\"reading\\" phase independently from the memory-heavy \\"writing\\" phase.\\n - Distributed execution across multiple GPU resources\\n\\n3. vLLM: The Flexible Baseline\\n\\n Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput.\\n\\n While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline.\\n\\n - High throughput through dynamic batching and efficient memory utilization\\n - Paged KV cache management for handling long contexts and concurrent requests\\n - Strong support for open-source model ecosystems\\n - Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build.\\n - Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. We use it strategically for initial testing before committing to a full TensorRT optimization pipeline.\\n\\n## Conclusion\\n\\nLarge language model inference introduces a fundamentally new class of infrastructure challenges\u2014where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads.\\n\\nThe LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle\u2014from model onboarding and compilation to deployment, optimization, and observability. 
By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity.\\n\\nEqually important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows.\\n\\nUltimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment\u2014allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences.\\n\\n## Future Explorations\\n\\nWhile we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:\\n\\n- TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs for integration into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. 
This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.\\n- Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a \\"serverless\\" experience where specific fine-tunes are hot-swapped instantly per request.\\n- Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user\'s streaming experience.\\n- Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., \\"How do I reset my password?\\" vs. \\"Password reset steps\\"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.\\n- Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before the memory fills up, preventing eviction-based slowdowns during traffic spikes.\\n- Online Evaluation & Guardrails: We are integrating a lightweight \\"Trust Layer\\" into the proxy. 
This will allow for low-latency input/output filtering (Guardrails) and asynchronous \\"LLM-as-a-Judge\\" evaluation pipelines to monitor response quality in production, not just system health."},{"id":"post-three","metadata":{"permalink":"/BharatMLStack/blog/post-three","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-three/index.md","source":"@site/blog/bharatmlstack-history/post-three/index.md","title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","description":"BharatMLStack","date":"2024-05-21T00:00:00.000Z","tags":[{"inline":true,"label":"model-inference","permalink":"/BharatMLStack/blog/tags/model-inference"},{"inline":true,"label":"embedding-search","permalink":"/BharatMLStack/blog/tags/embedding-search"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":3.6,"hasTruncateMarker":false,"authors":[{"name":"Aditya Kumar","title":"Lead Software Engineer @ Meesho","url":"https://github.com/Adit2607","imageURL":"https://github.com/Adit2607.png","key":"aditya","page":null},{"name":"Jaya Kumar","title":"Lead ML Engineer @ Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null},{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null}],"frontMatter":{"slug":"post-three","title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","authors":["aditya","jaya","adarsha"],"date":"2024-05-21T00:00:00.000Z","tags":["model-inference","embedding-search","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"Designing a Production-Grade LLM Inference Platform: From 
Model Weights to Scalable GPU Serving","permalink":"/BharatMLStack/blog/post-three"},"nextItem":{"title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","permalink":"/BharatMLStack/blog/post-two"}},"content":"![BharatMLStack](./bms.png)\\n\\n## Cracking the Code: Scaling Model Inference & Real-Time Embedding Search\\n\\nBy mid-2023, we had transformed our ML stack\u2014building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:\\n\\n- \ud83d\udd39 Scaling model inference without hitting infrastructure roadblocks\\n- \ud83d\udd39 Moving embedding search from batch to real-time for candidate generation\\n\\nHere\u2019s how we tackled these last-mile challenges, broke free from infrastructure constraints, and built a cost-efficient, high-performance system.\\n\\n## Breaking Free from the Scalability Ceiling\\n\\n### The Model Serving Bottleneck\u2014A Wake-Up Call\\n\\nJuly 2023. With just months left for the Mega Blockbuster Sale (MBS), we noticed a serious issue\u2014scaling our model-serving infrastructure was taking 10\u201315 minutes. In real-time ML, that\u2019s an eternity.\\nIn one of our war rooms, we ran a quick experiment:\\n\\n- \ud83d\ude80 We deployed an XGBoost model on a self-hosted Triton Inference Server running on a 16-core machine.\\n- \ud83d\ude80 Fired requests and compared the outputs with our existing cloud-hosted setup.\\n- \ud83d\ude80 The results matched\u2014perfectly.\\n\\nThat moment changed everything. We prepped a backup Triton setup on EKS, just in case our cloud provider couldn\'t allocate enough compute resources in time. Luckily, they did\u2014but the seed was planted.\\nThen in October, just two weeks before MBS, we got an alarming response from our infrastructure team:\\n \\"Node availability may be an issue.\\"\\nWith no time to waste, we moved 30% of real-time ML traffic to our self-hosted Triton cluster. 
The results?\\n\\n- \u2705 p99 latency dropped from 90\u2013100ms to 30\u201340ms\\n- \u2705 Triton handled significantly higher throughput on fewer resources\\n- \u2705 No model changes were needed\\n\\nMBS ran without a hitch, proving that self-hosted inference was the way forward.\\n\\n### Scaling Triton on GKE\\n\\nThis left us with two choices:\\n\\n- 1\ufe0f\u20e3 Port models to a managed cloud inference service, investing time in learning a new deployment stack\\n- 2\ufe0f\u20e3 Scale our existing Triton setup on GKE, optimizing for cost and performance\\n\\nWe went with Option 2\u2014and it slashed inference costs to 35% of what we previously paid, while giving us full control over scaling and optimizations.\\n\\n### Fixing the Cold Start Problem\\n\\nAs we onboarded more deep learning (DL) models, we hit a new bottleneck: new inference pods took 7\u20139 minutes to spin up.\\n\\nAfter profiling, we found the culprits:\\n\\n- Triton\u2019s base image\u2014a massive 5GB\\n- Model binaries\u2014often 1GB+\\n- Startup delay\u2014mostly due to downloading and initializing these assets\\n\\nTo fix this, we built a lightweight Triton image, stripping unused components and shrinking the size to 900MB. This cut cold start times drastically, making auto-scaling faster and smoother.\\n\\n## Embedding Search: The Last Piece of the Puzzle\\n\\nBy mid-2023, most of our ML stack had gone real-time\u2014except for Candidate Generation (CG), which still ran in batch mode. 
To truly power real-time recommendations, we needed an online embedding search system.\\n\\n### Choosing the Right Vector Database\\n\\nWe benchmarked three production-ready vector DBs across key parameters:\\n\\n- Milvus\\n- Qdrant\\n- Weaviate\\n\\nAfter extensive POCs, Qdrant stood out for its:\\n\\n- \u2705 Blazing-fast search latency on high-dimensional vectors\\n- \u2705 Efficient memory usage, crucial for in-memory workloads\\n- \u2705 Support for upserts and soft deletes, vital for Ads use cases\\n- \u2705 gRPC + REST APIs, making integration seamless\\n- \u2705 Powerful filtering, allowing fine-tuned retrieval (e.g., filtering Ads by category, active status, etc.)\\n\\nAt its core, Qdrant uses HNSW indexing, delivering both high recall and low-latency nearest-neighbor search\u2014a perfect fit for our needs.\\n\\n### Embedding Freshness & Real-Time Updates\\n\\nTo ensure embeddings stayed up to date, we built a dual ingestion pipeline:\\n\\n- \ud83d\udccc Daily Refresh: A bulk pipeline updated embeddings overnight\\n- \ud83d\udccc Real-Time Updates: Ads events triggered immediate upserts/deletes\\n\\nThis setup powered real-time \\"Similar Products\\" recommendations on the product page and became the foundation for Ads Candidate Generation, ensuring the right ads surfaced in milliseconds.\\n\\n![Skye](./vss.png)\\n\\n## Final Takeaways: Scaling Smartly for Real-Time ML\\n\\n- \ud83d\ude80 Self-hosted inference on Triton gave us lower cost, faster scaling, and better performance than managed services\\n- \ud83d\ude80 Building a custom Triton image reduced cold starts, improving responsiveness\\n- \ud83d\ude80 Qdrant-based embedding search enabled real-time personalization at scale\\n- \ud83d\ude80 Real-time updates for embeddings unlocked dynamic, up-to-date recommendations\\n\\nBy early 2024, Meesho\u2019s ML stack had evolved into a fully real-time, scalable, and cost-efficient system, setting the foundation for even bigger leaps 
ahead."},{"id":"post-two","metadata":{"permalink":"/BharatMLStack/blog/post-two","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-two/index.md","source":"@site/blog/bharatmlstack-history/post-two/index.md","title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","description":"BharatMLStack","date":"2023-04-10T00:00:00.000Z","tags":[{"inline":true,"label":"inferflow","permalink":"/BharatMLStack/blog/tags/inferflow"},{"inline":true,"label":"interaction-store","permalink":"/BharatMLStack/blog/tags/interaction-store"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":6.31,"hasTruncateMarker":false,"authors":[{"name":"Bhawani Singh","title":"Architect @ Meesho","url":"https://github.com/singh-bhawani","imageURL":"https://github.com/singh-bhawani.png","key":"bhawani","page":null},{"name":"Jigar Dave","title":"Lead Software Engineer @ Meesho","url":"https://github.com/jigarpatel26","imageURL":"https://github.com/jigarpatel26.png","key":"jigar","page":null},{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null}],"frontMatter":{"slug":"post-two","title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","authors":["bhawani","jigar","adarsha"],"date":"2023-4-10","tags":["inferflow","interaction-store","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","permalink":"/BharatMLStack/blog/post-three"},"nextItem":{"title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 
1)","permalink":"/BharatMLStack/blog/post-one"}},"content":"![BharatMLStack](./bms.png)\\n## Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)\\n\\nBy late 2022, we had built something we were truly proud of\u2014a real-time ML serving system with a DAG-based executor, a feature store, and an interaction store powering key ranking and personalization models. It was a major milestone, the culmination of months of effort from data scientists, ML engineers, and backend teams. Our system was live, and we were ready to push the boundaries of experimentation.\\nAnd it worked. Mostly.\\nBut soon, cracks appeared. Every new model needed custom feature retrieval logic, DAGs became dense and unmanageable, and scaling turned into a constant firefight. Costs surged, and infra bottlenecks slowed experimentation. Our system worked, but it wasn\u2019t built for scale.\\nThis is the story of how we tackled these challenges\u2014building Inferflow for seamless feature retrieval, optimizing real-time infra, and cutting costs while scaling to millions of QPS.\\n\\n### The Cost of Success\\nEvery new Ranker model required its own feature set, often pulling from different entities. Each addition meant:\\n\\n- Adding new DAG nodes in IOP\\n- Writing custom logic to fetch features from multiple sources (e.g., user, product, user \xd7 category)\\n- Inferring intermediate features (e.g., extracting category from a product to fetch user \xd7 category data)\\n- Optimizing I/O and dealing with the inevitable bugs\\n\\nWhat began as clean DAGs soon turned into a tangled web of cross-dependent graphs. Every experimentation cycle meant new nodes, new dependencies, and slower iterations.\\n\\n### Scaling Pains (and Cassandra\u2019s Limits)\\nAt some point, we were hitting:\\n\\n- 250\u2013300K reads/sec\\n- 1M writes/sec (during lean hours)\\n\\nAll of this ran on Cassandra. 
While its distributed architecture had been proven in production, operating large-scale clusters came with considerable infrastructure overhead. Our proof-of-concept (POC) demonstrated throughput of around 100K ops/sec, but as we scaled further, the challenges grew. Ensuring node health, optimizing compaction, and maintaining storage balance became increasingly demanding. We also observed latency spikes under heavy load, alongside a sharp increase in total cost of ownership.\\n\\n### Interaction Store Woes\\nOur interaction store was another ticking time bomb:\\n\\n- \ud83d\udea8 Clusters kept growing in size and cost\\n- \ud83d\udea8 Latency spikes became increasingly frequent\\n- \ud83d\udea8 The DMC proxy occasionally lost locality of nodes against shards, causing cross-node communication and degraded performance\\n\\nEach time this happened, we had to manually rebalance shards just to restore stable latency, making operations unsustainable at scale.\\n\\n### Silver Linings\\nDespite the chaos, the system was live and delivering value:\\n\\n- Real-time infrastructure was in production\\n- Costs dropped by 60\u201370% compared to offline personalization\\n- New experiments rolled out faster and more successfully\\n- User engagement metrics improved\\n\\nIt wasn\u2019t perfect. It was far from easy. But it worked\u2014and that counted for a lot.\\n\\n### Round Two: Solving the Top 2 Bottlenecks\\nWith the first-gen system stretched to its limits, we stepped back. Conversations with data scientists and backend engineers revealed three recurring pain points:\\n\\n1. Coding feature retrieval logic for every new model was becoming unsustainable\\n2. ML scale was exploding\u2014bringing rising infra costs with it\\n3. 
Real-time embedding search was the next big unlock\\n\\nWe tackled them one by one\u2014starting with the biggest pain point.\\n\\n#### Problem 1: No-Code Feature Retrieval for Model Inference\\nWe noticed a pattern: for personalized ranking, models needed features from:\\n\\n- \u2705 Product\\n- \u2705 User\\n- \u2705 User \xd7 Category\\n- \u2705 Region, cohort, sub-category, etc.\\n\\nA key insight emerged: Entities that contribute features for a model always map back to the context entities.\\n\\n![MP Dag](./mp-dag.png)\\n\\nWith this, we designed Inferflow, a graph-driven feature retrieval and model orchestration system:\\n\\n- 1\ufe0f\u20e3 Inferflow takes a modelId and context IDs (e.g., userId, productIds)\\n- 2\ufe0f\u20e3 Loads a pre-defined feature retrieval graph from ZooKeeper\\n- 3\ufe0f\u20e3 Executes the graph to resolve entity relationships dynamically\\n- 4\ufe0f\u20e3 Outputs a 2D matrix of feature vectors\\n\\n\ud83d\udca1 The impact?\\n\\n- \ud83d\ude80 No more custom feature retrieval code\u2014just graph updates in config\\n- \ud83d\ude80 Feature consistency across experiments\\n- \ud83d\ude80 Faster iteration cycles for ranking, fraud detection, and beyond\\n\\nHere\u2019s a visual example that shows how this graph plays out during execution. We further extended the graph to call multiple models as needed:\\n![MP matrix](./mp-matrix.png)\\nWe built Inferflow in GoLang, using gRPC and Proto3 serialization for efficiency.\\n\\n#### Problem 2: Scaling Without Breaking the Bank\\nWith more ML use cases coming online, we needed to cut costs without compromising performance. 
We focused on:\\n\\n- \ud83d\udd39 Online Feature Store\\n- \ud83d\udd39 Interaction Store\\n\\n#### Optimizing the Online Feature Store\\nOur costs were concentrated in:\\n\\n- \ud83d\udccc Database (Cassandra)\\n- \ud83d\udccc Cache (Redis)\\n- \ud83d\udccc Running Pods (Java services)\\n\\n1\ufe0f\u20e3 Replacing Cassandra with ScyllaDB\\nAs we hit the operational limits of large Cassandra clusters, we transitioned to ScyllaDB, which offered a seamless drop-in replacement without major code changes. The switch brought significant benefits:\\n\\n- Throughput: Matched or exceeded Cassandra\'s performance under identical workloads, even under high concurrency.\\n- Latency: Achieved consistently lower P99 latencies due to ScyllaDB\'s shard-per-core architecture and better I/O utilization.\\n- Cost Efficiency: Reduced infra footprint by ~70% through better CPU and memory efficiency, eliminating the need for over-provisioned nodes.\\n\\n2\ufe0f\u20e3 Finding the Right Cache\\nTo reduce backend load and improve response times, we benchmarked multiple caching solutions\u2014Memcached, KeyDB, and Dragonfly\u2014under real production traffic patterns. Dragonfly stood out due to its robust architecture and operational simplicity:\\n\\n- Data Skew Handling: Efficiently managed extreme key hotness and uneven access patterns without performance degradation.\\n- Throughput: Delivered consistently high throughput, even with large object sizes and concurrent access.\\n- Ease of Adoption: Acted as a drop-in Redis replacement with full protocol compatibility\u2014no changes needed in application code or client libraries.\\n\\n3\ufe0f\u20e3 Moving to GoLang for Cost-Efficient Serving\\nJava services were memory-heavy\u2014so we rewrote core services in GoLang. 
The results?\\n\\n\u2705 Memory usage dropped by ~80%\\n\u2705 CPU utilization was significantly lower\\n\u2705 Faster, more efficient deployments\\n\\n#### Optimizing the Interaction Store\\nWe realized that we only need a user\u2019s interaction data in Redis when they open the app. So, we implemented a tiered storage approach:\\n\\n- \ud83d\udccc Cold Tier (ScyllaDB)\u2014Stores click, order, wishlist events\\n- \ud83d\udccc Hot Tier (Redis)\u2014Loads a user\u2019s past interactions only when they open the app\\n\\nSmart Offloading: We introduced an inactivity tracker to detect when a user session ends. At that point, Redis data was flushed back to Scylla, reducing unnecessary writes.\\n\\n![InteractionStore](./interaction-str.png)\\n#### Results\\n\\n- Online Feature Store hit 1M QPS for the first time during the 2023 Mega Blockbuster Sale\u2014without breaking a sweat\\n- Infra costs for Online Feature Store and Interaction Store dropped by ~60%\\n\\n#### The Catch: Our ML Hosting Hit a Hard Limit\\nWhile planning for 2023 MBS, we ran into a critical scalability bottleneck:\\n\\n- \u274c Insufficient compute availability in our region for ML instances\\n- \u274c Couldn\u2019t provision enough nodes to handle real-time inference at scale\\n\\nThis forced us to rethink where and how we hosted our models. The existing setup was great for prototyping\u2014but it wasn\u2019t built to handle the bursty, high-QPS demands of real-world production workloads.\\n\\n### Conclusion: From Firefighting to Future-Proofing\\nWhat started as an ambitious experiment turned into a real-time ML infrastructure that powered millions of requests per second. We battled scaling pains, rethought feature retrieval with Inferflow, and rebuilt our infra stack for efficiency\u2014driving down costs while improving experimentation velocity.\\nBut new challenges emerged. Our infrastructure could now handle scale, but our ML model hosting setup hit a hard limit. 
With compute availability bottlenecks threatening real-time inference, we faced a critical decision: how do we make model serving as scalable and cost-efficient as the rest of our stack? That\u2019s the next piece of the puzzle\u2014and the story of Part 3."},{"id":"post-one","metadata":{"permalink":"/BharatMLStack/blog/post-one","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-one/index.md","source":"@site/blog/bharatmlstack-history/post-one/index.md","title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)","description":"BharatMLStack","date":"2022-11-15T00:00:00.000Z","tags":[{"inline":true,"label":"online-feature-store","permalink":"/BharatMLStack/blog/tags/online-feature-store"},{"inline":true,"label":"interaction-store","permalink":"/BharatMLStack/blog/tags/interaction-store"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"}],"readingTime":10.25,"hasTruncateMarker":false,"authors":[{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null},{"name":"Aditya Kumar","title":"Lead Software Engineer @ Meesho","url":"https://github.com/Adit2607","imageURL":"https://github.com/Adit2607.png","key":"aditya","page":null},{"name":"Bhawani Singh","title":"Architect @ Meesho","url":"https://github.com/singh-bhawani","imageURL":"https://github.com/singh-bhawani.png","key":"bhawani","page":null},{"name":"Jigar Dave","title":"Lead Software Engineer @ Meesho","url":"https://github.com/jigarpatel26","imageURL":"https://github.com/jigarpatel26.png","key":"jigar","page":null}],"frontMatter":{"slug":"post-one","title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 
1)","authors":["adarsha","aditya","bhawani","jigar"],"date":"2022-11-15T00:00:00.000Z","tags":["online-feature-store","interaction-store","mlplatform","meesho"]},"unlisted":false,"prevItem":{"title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","permalink":"/BharatMLStack/blog/post-two"}},"content":"![BharatMLStack](./bms.png)\\n## The Genesis: How a Friday Night Roast Sparked Meesho\u2019s ML Platform\\n\\nIt all started in early 2022, over a casual Friday evening catch-up. Like many great origin stories, this one began with friendly banter between a group of backend engineers and data scientists. As the conversations unfolded, so did the roasting\u2014until one remark hit a little too close to home:\\n\\n*\\"Why are we still crunching data for Monthly Active Users (MAU) when the next day it\u2019s all about Daily Active Users (DAU)?\\"*\\n\\nThe laughter died down, and the question lingered. When we regrouped on Monday\u2014clear-headed and slightly reflective\u2014we decided to dig into the numbers. What they discovered was quite revealing: a large portion of compute resources wasn\u2019t being put to good use.\\nMuch of the system\u2019s effort was spent supporting users who weren\u2019t actively engaging, and even for new users, the experience wasn\u2019t optimized to make a meaningful impact.\\n\\nAt the same time, Meesho had just launched a company-wide initiative to reduce costs\u2014and every team had to contribute. 
This realization sparked the journey that would eventually lead to the **Meesho ML Platform**, known today as **BharatMLStack**.\\n\\n![Alt Text](./old-batch-arch.png)\\n\\nBefore the ML Platform, our recommendation and ranking pipelines followed a batch processing approach:\\n- **Data Ingestion**: The Data Platform team executed ETL jobs to ingest raw user data\u2014including user profiles, interaction logs, and product impressions\u2014into designated S3 buckets.\\n- **Layer 1**: Embedding Generation: On the Data Science side, Spark jobs pulled data from multiple S3 sources, cleaned and preprocessed it, and applied matrix factorization to generate user and item embeddings. The processed data and embeddings were then stored back in S3 in a structured format.\\n- **Layer 2**: Candidate Generation (CG): In this stage, Spark jobs leveraged embeddings and historical interaction data to generate candidate recommendations for users. These candidate lists were subsequently written to S3.\\n- **Layer 3**: Ranking and Merging \u2013 A final round of processing ranked the generated candidates using ML models, combined different candidate lists, and stored the final ranked recommendations in a caching system.\\n- **Serving**: A microservice retrieved ranked recommendations from an in-memory data store via exposed APIs, delivering personalized listings across key surfaces such as \\"For You\\" and Category Landing Pages (CLP).\\n\\nThis approach held up well\u2014until Meesho started seeing a significant surge in traffic.\\n\\n## The Turning Point: From Batch to Real-Time\\n\\nAt this time, the team was iterating on new **Ranker models**, and real-time inference seemed like the next logical step. But Rankers needed **real-time feature retrieval**, which meant an **online feature store** had to be built first.\\n\\nExploring open-source options led to **cost vs. 
performance trade-offs**, but Meesho\u2019s surging traffic meant that **latency and stability were non-negotiable**. After multiple debates and stakeholder discussions, a bold decision was made:\\n\\n*We would build our own feature store.*\\n\\nMeanwhile, efforts began to bring **Candidate Generators (CGs)** to real-time. The challenge? **Storing and retrieving user interactions quickly enough** to power real-time recommendations.\\n\\nAs the team dove deeper, a new roadblock emerged: \\nOur ML jobs were orchestrated using **Airflow DAGs**, giving data scientists flexibility in experimentation. But transitioning to real-time execution threatened this agility. Every change would now require backend engineering support, **slowing down iteration cycles**.\\n\\nThat\u2019s when the idea struck: \\nWe needed a **framework for real-time DAG execution**\u2014one that preserved the same flexibility as Airflow but worked for **streaming data**.\\n\\nThis moment shaped the **next phase of our journey**.\\n\\n## First Generation Design\\n\\n![Alt Text](./first-gen-arch.png)\\n\\n# Laying the Groundwork: The First-Gen ML Platform\\n\\nTo solve these challenges, the team built three foundational components:\\n\\n\\n### 1. IOP Framework: A Real-Time DAG Executor\\n\\n- **Reusable Nodes**: Each DAG node (e.g., an invocation to a CG service, a ranker, or a filter) had to be implemented only once. After that, it could be reused across any workflow by referencing it in config.\\n- **Config-driven Dynamic Graphs**: Execution graphs were defined as adjacency lists stored in **ZooKeeper**, allowing teams to modify the sequence or structure of operations without touching application code.\\n- **Plug-and-play CGs**: The Candidate Generator interface was preserved, so a single CG node could call any CG service by passing `cg_name` in the request. 
This drastically reduced the code surface area and improved maintainability.\\n- **Production-Grade DAGs**: DAGs were designed to execute in **low-latency real-time environments**, with support for **parallel execution, retries, and branching**.\\n\\n[More about IOP DAG](https://www.meesho.io/blog/rebuilding-meeshos-ranking-platform)\\n\\n\\n### 2. Online Feature Store - 0th Version\\n\\n- Used **Cassandra** and **Redis** for low-latency feature serving.\\n- Maintained feature consistency using **Feature Groups** with TTL-based expiry.\\n- A hybrid schema was used: feature keys stored in **ZooKeeper**, data stored in **compact arrays**.\\n\\n\\n### 3. Interaction Store - 0th Version\\n\\n- Captured real-time user interactions like clicks, orders, and add-to-cart events.\\n- Stored event data in **Redis ZSETs (sorted sets)** to enable fast lookups for recommendation engines.\\n- Provided an API to fetch a user\'s **last _k_ interactions** or **interactions within a time window**.\\n\\n\\nWith these components in place, **real-time ML at Meesho became a reality**.\\n\\nThis was just the beginning.\\n\\n## Building the Online Feature Store - 0th Version\\n\\n![Alt text](./online-feature-store-v0.png)\\n\\n### Choosing the Right Tech Stack\\n\\nWe spent considerable time evaluating various databases, caches, and communication protocols for our **online feature store**. 
After carefully weighing **cost, latency, throughput**, and **operational stability**, we settled on a combination of:\\n\\n- **Cassandra** and **Redis** for storage\\n- **gRPC + Proto3** as our communication layer\\n\\n\\n### Streamlining the Data Flow\\n\\nTo keep things simple in the initial version:\\n\\n- **Feature engineering jobs** wrote raw outputs to an **S3 bucket**\\n- A **daily feature push job**:\\n - Read from S3\\n - Grouped related features into **Feature Groups** (ensuring consistency)\\n - Pushed them to **Kafka**\\n\\nFor features requiring frequent updates:\\n\\n- **Ad-hoc jobs** computed features in higher frequency\\n- These jobs pushed to both **Kafka** and **S3** (S3 preserved historical data for future model training)\\n\\n\\n## The Challenges: Data Format and Storage\\n\\nOne of the most critical design challenges was how to store feature data **efficiently and consistently**, especially in databases like **Cassandra** and **Redis**, which come with unique storage constraints.\\n\\nWe had to solve for three key requirements:\\n\\n- ### Feature Consistency\\n When a feature group contains features like `order_count_1h` and `click_count_1h`, both must reflect the **same time window**. Inconsistent updates would lead to **unreliable model predictions**.\\n\\n- ### TTL Granularity\\n Each feature group required an **expiry timestamp**, so that **all features within it expired together**\u2014preserving consistency during reads.\\n\\n- ### Extensibility Across Databases\\n We anticipated that infra needs would evolve. 
To future-proof our system, the data format was designed to be **decoupled from DB-specific layouts**, enabling portability to systems like **ScyllaDB**, **DynamoDB**, **HBase**, or **BigTable**.\\n\\n\\n---\\n\\n## Overcoming Technical Constraints\\nAt the time, we were using Cassandra, which not only imposed a soft limit of 75 columns per row, but also exhibited significant performance degradation as the number of columns increased further, particularly in memory constrained machines. Wide rows caused high memory usage during reads, unpredictable latencies due to heavy deserialization overhead, and inefficiencies during compactions and repairs. This ruled out the naive \\"one column per feature\\" approach. We needed a format that was compact, minimized the number of columns, and remained efficient and portable across different storage systems.\\n\\n## The Solution: Schema Separation\\n\\nWe introduced the concept of Feature Groups\u2014logical groupings of features that must remain consistent with one another.\\nTo represent these groups efficiently, we adopted a layered storage approach:\\n\\n- **Feature Labels (Keys)** were stored in ZooKeeper, serving as the schema.\\n- **Feature Values** were stored as a comma-separated string array in Cassandra or Redis.\\n- **Expiry Timestamp and Schema Version** were appended using a semi-colon delimiter at the end of the string.\\n\\nExample:\\n\\n```bash\\nfeature_1_value,feature_2_value,feature_3_value;expiry_ts\\n```\\n\\nThis format allowed:\\n- Consistent writes and reads at the group level\\n- Easy parsing of feature values using the schema lookup from ZooKeeper\\n- Efficient storage with minimal DB column usage\\n- Support for per-group TTLs and schema evolution\\n\\n## Tracking Changes in Feature Groups\\nFeature groups don\u2019t stay static. As models evolve, features get added, renamed, or removed. 
But schema changes often go live before the data is ready\u2014and stopping ingestion just to wait for everything to align isn\'t feasible.\\n\\n### Common Real-World Scenarios:\\n- A new feature is added to the schema, but ingestion jobs still use the older schema version.\\n- Ongoing writes don\u2019t include the newly added feature, and stopping ingestion would break freshness for existing features.\\n- During serving, models request a mix of old and new features, depending on rollout stages.\\n\\n## The Solution: Schema Versioning\\nWe solved this with versioned feature group schemas, which unlocked several capabilities:\\n- ### Backward Compatibility\\n Older ingestion jobs can continue writing using older schema versions. During reads, the system uses the schema version embedded in the value to interpret the data correctly.\\n- ### Partial Availability Handling \\n During inference, if some features in the request aren\u2019t available (due to rollout delays or missing data), the system serves default values, ensuring the inference call doesn\u2019t fail.\\n- ### Safe Writes Without Pipeline Pauses\\n With schema versioning, we no longer had to stop ingestion pipelines for schema updates. Writes using previous versions can continue safely, and downstream consumers evolve independently.\\nThis design gave us the flexibility to move fast without breaking things\u2014preserving data quality, enabling experimentation, and ensuring reliability at scale.\\n\\n![Alt Text](./schema.png)\\n\\n## Interaction Store - 0th Version\\n\\n![Alt Text](./interaction-store-v0.png)\\n\\nTo power real-time Candidate Generators (CGs), we needed fast access to user behavior signals\u2014like what a user recently clicked, ordered, or added to their cart. 
These interactions form the basis for many real-time recommendations, such as **Similar Products**, **People Also Viewed**, or **Recently Ordered Again**.\\nFor the **0th version** of the Interaction Store, we focused on a design that was **simple, fast, and reliable** \u2014 optimized for high-throughput ingestion and low-latency lookups.\\n\\n## Event Ingestion\\nWe instrumented our backend services to emit key user interaction events to Kafka in real time. These included:\\n- Click\\n- Order\\n- Add to Cart\\n- Wishlist\\n- Share\\n\\nEach event carried essential metadata:\\n- userId \u2014 uniquely identifies the user\\n- productId \u2014 the item being interacted with\\n- timestamp \u2014 the moment the interaction occurred\\n\\nThis decoupled the interaction logging from storage, allowing ingestion and consumption to scale independently.\\n\\n## Storage Design\\nTo store these events, we built Kafka consumers that processed the incoming streams and wrote the data into Redis, using sorted sets (ZSETs) as the primary data structure.\\n\\n### Why Redis?\\nRedis gave us:\\n- **Low-latency** reads and writes\\n- **Time-ordered data** using ZSETs (via score = timestamp)\\n- **Native TTL support**, if needed in later versions\\n- **In-memory performance** \u2014ideal for real-time CGs\\n\\n### Storage Structure\\nEach user\u2019s interactions were stored using a composite key format, uniquely identifying the user and interaction type. 
This structure allowed efficient organization and quick retrieval of recent activity for recommendation generation:\\n\\n```bash\\nuserId_eventType \u2192 ZSET[...(pid, ts)...]\\n```\\n\\nWithin each ZSET:\\n\\n- The **timestamp** served as the score, maintaining temporal order\\n- The **productId** (optionally with metadata) was the **value**\\n\\nThis allowed us to efficiently retrieve the interactions with HTTP-based API server with two query modes:\\n- Fetch the **last k interactions** of a specific type for a given user with `ZREVRANGE(userId_eventType, count)`\\n- Retrieve **all interactions within a time range** (e.g., last 24 hours) with `ZREVRANGEBYSCORE(userId_eventType, timeRange)`\\n\\n### Built-in Guardrails\\nSince Redis was the sole store, we implemented High Availability (HA) to prevent data loss. To optimize memory usage, we also enforced size limits per event type\u2014only storing the last k interactions per user, with older entries getting truncated.\\n\\n## Conclusion: Laying the Foundation for Real-Time ML\\n\\nIn this first phase, we tackled the **fundamentals**\u2014shifting from batch-based recommendations to a **real-time Recommendation** using ML platform that could keep up with Meesho\u2019s growth.\\n\\nWith the **IOP Framework**, **Online Feature Store**, and **Interaction Store**, we built the core infrastructure to support real-time personalization at scale. These wins have already unlocked: \\n- \u2705 Faster, more dynamic recommendations for millions of users. \\n- \u2705 Better infrastructure efficiency, reducing wasted compute power. \\n- \u2705 A flexible, modular system that allows for further experimentation.\\n\\nBut this is just the beginning. 
While we\'ve solved key challenges, **certain roadblocks remain** \u2014from optimizing **cost-performance trade-offs** to **seamlessly evolving schemas**.\\n\\n\\nThis foundational work laid the path for a reliable and scalable **real-time feature serving layer**."}]}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/6875c492.1403aab6.js b/docs/assets/js/6875c492.1403aab6.js new file mode 100644 index 00000000..0f0f07e9 --- /dev/null +++ b/docs/assets/js/6875c492.1403aab6.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[4813],{2234:(e,t,a)=>{a.d(t,{A:()=>c});a(6540);var n=a(4164),s=a(7559),i=a(4084),r=a(7293),l=a(4848);function o({className:e}){return(0,l.jsx)(r.A,{type:"caution",title:(0,l.jsx)(i.Rc,{}),className:(0,n.A)(e,s.G.common.unlistedBanner),children:(0,l.jsx)(i.Uh,{})})}function c(e){return(0,l.jsxs)(l.Fragment,{children:[(0,l.jsx)(i.AE,{}),(0,l.jsx)(o,{...e})]})}},2907:(e,t,a)=>{a.d(t,{A:()=>B});a(6540);var n=a(4164),s=a(4096),i=a(4848);function r({children:e,className:t}){return(0,i.jsx)("article",{className:t,children:e})}var l=a(8774);const o={title:"title_f1Hy"};function c({className:e}){const{metadata:t,isBlogPostPage:a}=(0,s.e7)(),{permalink:r,title:c}=t,d=a?"h1":"h2";return(0,i.jsx)(d,{className:(0,n.A)(o.title,e),children:a?c:(0,i.jsx)(l.A,{to:r,children:c})})}var d=a(1312),g=a(5846),u=a(6266);const m={container:"container_mt6G"};function h({readingTime:e}){const t=function(){const{selectMessage:e}=(0,g.W)();return t=>{const a=Math.ceil(t);return e(a,(0,d.T)({id:"theme.blog.post.readingTime.plurals",description:'Pluralized label for "{readingTime} min read". 
Use as much plural forms (separated by "|") as your language support (see https://www.unicode.org/cldr/cldr-aux/charts/34/supplemental/language_plural_rules.html)',message:"One min read|{readingTime} min read"},{readingTime:a}))}}();return(0,i.jsx)(i.Fragment,{children:t(e)})}function p({date:e,formattedDate:t}){return(0,i.jsx)("time",{dateTime:e,children:t})}function x(){return(0,i.jsx)(i.Fragment,{children:" \xb7 "})}function j({className:e}){const{metadata:t}=(0,s.e7)(),{date:a,readingTime:r}=t,l=(0,u.i)({day:"numeric",month:"long",year:"numeric",timeZone:"UTC"});return(0,i.jsxs)("div",{className:(0,n.A)(m.container,"margin-vert--md",e),children:[(0,i.jsx)(p,{date:a,formattedDate:(o=a,l.format(new Date(o)))}),void 0!==r&&(0,i.jsxs)(i.Fragment,{children:[(0,i.jsx)(x,{}),(0,i.jsx)(h,{readingTime:r})]})]});var o}var b=a(6382);const A={authorCol:"authorCol_Hf19",imageOnlyAuthorRow:"imageOnlyAuthorRow_pa_O",imageOnlyAuthorCol:"imageOnlyAuthorCol_G86a"};function f({className:e}){const{metadata:{authors:t},assets:a}=(0,s.e7)();if(0===t.length)return null;const r=t.every(({name:e})=>!e),l=1===t.length;return(0,i.jsx)("div",{className:(0,n.A)("margin-top--md margin-bottom--sm",r?A.imageOnlyAuthorRow:"row",e),children:t.map((e,t)=>(0,i.jsx)("div",{className:(0,n.A)(!r&&(l?"col col--12":"col col--6"),r?A.imageOnlyAuthorCol:A.authorCol),children:(0,i.jsx)(b.A,{author:{...e,imageURL:a.authorsImageUrls[t]??e.imageURL}})},t))})}function v(){return(0,i.jsxs)("header",{children:[(0,i.jsx)(c,{}),(0,i.jsx)(j,{}),(0,i.jsx)(f,{})]})}var T=a(440),N=a(3253);function w({children:e,className:t}){const{isBlogPostPage:a}=(0,s.e7)();return(0,i.jsx)("div",{id:a?T.LU:void 0,className:(0,n.A)("markdown",t),children:(0,i.jsx)(N.A,{children:e})})}var _=a(7559),k=a(4336),y=a(4434);function P(){return(0,i.jsx)("b",{children:(0,i.jsx)(d.A,{id:"theme.blog.post.readMore",description:"The label used in blog post item excerpts to link to full blog posts",children:"Read more"})})}function 
R(e){const{blogPostTitle:t,...a}=e;return(0,i.jsx)(l.A,{"aria-label":(0,d.T)({message:"Read more about {title}",id:"theme.blog.post.readMoreLabel",description:"The ARIA label for the link to full blog posts from excerpts"},{title:t}),...a,children:(0,i.jsx)(P,{})})}function U(){const{metadata:e,isBlogPostPage:t}=(0,s.e7)(),{tags:a,title:r,editUrl:l,hasTruncateMarker:o,lastUpdatedBy:c,lastUpdatedAt:d}=e,g=!t&&o,u=a.length>0;if(!(u||g||l))return null;if(t){const e=!!(l||d||c);return(0,i.jsxs)("footer",{className:"docusaurus-mt-lg",children:[u&&(0,i.jsx)("div",{className:(0,n.A)("row","margin-top--sm",_.G.blog.blogFooterEditMetaRow),children:(0,i.jsx)("div",{className:"col",children:(0,i.jsx)(y.A,{tags:a})})}),e&&(0,i.jsx)(k.A,{className:(0,n.A)("margin-top--sm",_.G.blog.blogFooterEditMetaRow),editUrl:l,lastUpdatedAt:d,lastUpdatedBy:c})]})}return(0,i.jsxs)("footer",{className:"row docusaurus-mt-lg",children:[u&&(0,i.jsx)("div",{className:(0,n.A)("col",{"col--9":g}),children:(0,i.jsx)(y.A,{tags:a})}),g&&(0,i.jsx)("div",{className:(0,n.A)("col text--right",{"col--3":u}),children:(0,i.jsx)(R,{blogPostTitle:r,to:e.permalink})})]})}function B({children:e,className:t}){const a=function(){const{isBlogPostPage:e}=(0,s.e7)();return e?void 0:"margin-bottom--xl"}();return(0,i.jsxs)(r,{className:(0,n.A)(a,t),children:[(0,i.jsx)(v,{}),(0,i.jsx)(w,{children:e}),(0,i.jsx)(U,{})]})}},3069:(e,t,a)=>{a.r(t),a.d(t,{default:()=>b});a(6540);var n=a(4164),s=a(1312),i=a(7559),r=a(5500),l=a(6461),o=a(8774),c=a(8027),d=a(7713),g=a(1463),u=a(3892),m=a(2234),h=a(1107),p=a(4848);function x({tag:e}){const t=(0,l.ZD)(e);return(0,p.jsxs)(p.Fragment,{children:[(0,p.jsx)(r.be,{title:t,description:e.description}),(0,p.jsx)(g.A,{tag:"blog_tags_posts"})]})}function j({tag:e,items:t,sidebar:a,listMetadata:n}){const 
i=(0,l.ZD)(e);return(0,p.jsxs)(c.A,{sidebar:a,children:[e.unlisted&&(0,p.jsx)(m.A,{}),(0,p.jsxs)("header",{className:"margin-bottom--xl",children:[(0,p.jsx)(h.A,{as:"h1",children:i}),e.description&&(0,p.jsx)("p",{children:e.description}),(0,p.jsx)(o.A,{href:e.allTagsPath,children:(0,p.jsx)(s.A,{id:"theme.tags.tagsPageLink",description:"The label of the link targeting the tag list page",children:"View All Tags"})})]}),(0,p.jsx)(u.A,{items:t}),(0,p.jsx)(d.A,{metadata:n})]})}function b(e){return(0,p.jsxs)(r.e3,{className:(0,n.A)(i.G.wrapper.blogPages,i.G.page.blogTagPostListPage),children:[(0,p.jsx)(x,{...e}),(0,p.jsx)(j,{...e})]})}},3892:(e,t,a)=>{a.d(t,{A:()=>r});a(6540);var n=a(4096),s=a(2907),i=a(4848);function r({items:e,component:t=s.A}){return(0,i.jsx)(i.Fragment,{children:e.map(({content:e})=>(0,i.jsx)(n.in,{content:e,children:(0,i.jsx)(t,{children:(0,i.jsx)(e,{})})},e.metadata.permalink))})}},4084:(e,t,a)=>{a.d(t,{AE:()=>o,Rc:()=>r,TT:()=>d,Uh:()=>l,Yh:()=>c});a(6540);var n=a(1312),s=a(5260),i=a(4848);function r(){return(0,i.jsx)(n.A,{id:"theme.contentVisibility.unlistedBanner.title",description:"The unlisted content banner title",children:"Unlisted page"})}function l(){return(0,i.jsx)(n.A,{id:"theme.contentVisibility.unlistedBanner.message",description:"The unlisted content banner message",children:"This page is unlisted. Search engines will not index it, and only users having a direct link can access it."})}function o(){return(0,i.jsx)(s.A,{children:(0,i.jsx)("meta",{name:"robots",content:"noindex, nofollow"})})}function c(){return(0,i.jsx)(n.A,{id:"theme.contentVisibility.draftBanner.title",description:"The draft content banner title",children:"Draft page"})}function d(){return(0,i.jsx)(n.A,{id:"theme.contentVisibility.draftBanner.message",description:"The draft content banner message",children:"This page is a draft. 
It will only be visible in dev and be excluded from the production build."})}},4434:(e,t,a)=>{a.d(t,{A:()=>o});a(6540);var n=a(4164),s=a(1312),i=a(6133);const r={tags:"tags_jXut",tag:"tag_QGVx"};var l=a(4848);function o({tags:e}){return(0,l.jsxs)(l.Fragment,{children:[(0,l.jsx)("b",{children:(0,l.jsx)(s.A,{id:"theme.tags.tagsListLabel",description:"The label alongside a tag list",children:"Tags:"})}),(0,l.jsx)("ul",{className:(0,n.A)(r.tags,"padding--none","margin-left--sm"),children:e.map(e=>(0,l.jsx)("li",{className:r.tag,children:(0,l.jsx)(i.A,{...e})},e.permalink))})]})}},6133:(e,t,a)=>{a.d(t,{A:()=>l});a(6540);var n=a(4164),s=a(8774);const i={tag:"tag_zVej",tagRegular:"tagRegular_sFm0",tagWithCount:"tagWithCount_h2kH"};var r=a(4848);function l({permalink:e,label:t,count:a,description:l}){return(0,r.jsxs)(s.A,{rel:"tag",href:e,title:l,className:(0,n.A)(i.tag,a?i.tagWithCount:i.tagRegular),children:[t,a&&(0,r.jsx)("span",{children:a})]})}},6461:(e,t,a)=>{a.d(t,{ZD:()=>r,uz:()=>l});a(6540);var n=a(1312),s=a(5846);a(4848);function i(){const{selectMessage:e}=(0,s.W)();return t=>e(t,(0,n.T)({id:"theme.blog.post.plurals",description:'Pluralized label for "{count} posts". 
Use as much plural forms (separated by "|") as your language support (see https://www.unicode.org/cldr/cldr-aux/charts/34/supplemental/language_plural_rules.html)',message:"One post|{count} posts"},{count:t}))}function r(e){const t=i();return(0,n.T)({id:"theme.blog.tagTitle",description:"The title of the page for a blog tag",message:'{nPosts} tagged with "{tagName}"'},{nPosts:t(e.count),tagName:e.label})}const l=()=>(0,n.T)({id:"theme.blog.authorsList.pageTitle",message:"Authors",description:"The title of the authors page"})},7713:(e,t,a)=>{a.d(t,{A:()=>r});a(6540);var n=a(1312),s=a(9022),i=a(4848);function r(e){const{metadata:t}=e,{previousPage:a,nextPage:r}=t;return(0,i.jsxs)("nav",{className:"pagination-nav","aria-label":(0,n.T)({id:"theme.blog.paginator.navAriaLabel",message:"Blog list page navigation",description:"The ARIA label for the blog pagination"}),children:[a&&(0,i.jsx)(s.A,{permalink:a,title:(0,i.jsx)(n.A,{id:"theme.blog.paginator.newerEntries",description:"The label used to navigate to the newer blog posts page (previous page)",children:"Newer entries"})}),r&&(0,i.jsx)(s.A,{permalink:r,title:(0,i.jsx)(n.A,{id:"theme.blog.paginator.olderEntries",description:"The label used to navigate to the older blog posts page (next page)",children:"Older entries"}),isNext:!0})]})}},9022:(e,t,a)=>{a.d(t,{A:()=>r});a(6540);var n=a(4164),s=a(8774),i=a(4848);function r(e){const{permalink:t,title:a,subLabel:r,isNext:l}=e;return(0,i.jsxs)(s.A,{className:(0,n.A)("pagination-nav__link",l?"pagination-nav__link--next":"pagination-nav__link--prev"),to:t,children:[r&&(0,i.jsx)("div",{className:"pagination-nav__sublabel",children:r}),(0,i.jsx)("div",{className:"pagination-nav__label",children:a})]})}}}]); \ No newline at end of file diff --git a/docs/assets/js/6875c492.7e263e94.js b/docs/assets/js/6875c492.7e263e94.js deleted file mode 100644 index c512db9e..00000000 --- a/docs/assets/js/6875c492.7e263e94.js +++ /dev/null @@ -1 +0,0 @@ -"use 
strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[4813],{2053:(e,t,a)=>{a.d(t,{A:()=>o});a(6540);var n=a(4164),s=a(1312),i=a(6133);const r={tags:"tags_jXut",tag:"tag_QGVx"};var l=a(4848);function o({tags:e}){return(0,l.jsxs)(l.Fragment,{children:[(0,l.jsx)("b",{children:(0,l.jsx)(s.A,{id:"theme.tags.tagsListLabel",description:"The label alongside a tag list",children:"Tags:"})}),(0,l.jsx)("ul",{className:(0,n.A)(r.tags,"padding--none","margin-left--sm"),children:e.map(e=>(0,l.jsx)("li",{className:r.tag,children:(0,l.jsx)(i.A,{...e})},e.permalink))})]})}},2234:(e,t,a)=>{a.d(t,{A:()=>c});a(6540);var n=a(4164),s=a(7559),i=a(4084),r=a(7293),l=a(4848);function o({className:e}){return(0,l.jsx)(r.A,{type:"caution",title:(0,l.jsx)(i.Rc,{}),className:(0,n.A)(e,s.G.common.unlistedBanner),children:(0,l.jsx)(i.Uh,{})})}function c(e){return(0,l.jsxs)(l.Fragment,{children:[(0,l.jsx)(i.AE,{}),(0,l.jsx)(o,{...e})]})}},2907:(e,t,a)=>{a.d(t,{A:()=>B});a(6540);var n=a(4164),s=a(4096),i=a(4848);function r({children:e,className:t}){return(0,i.jsx)("article",{className:t,children:e})}var l=a(8774);const o={title:"title_f1Hy"};function c({className:e}){const{metadata:t,isBlogPostPage:a}=(0,s.e7)(),{permalink:r,title:c}=t,d=a?"h1":"h2";return(0,i.jsx)(d,{className:(0,n.A)(o.title,e),children:a?c:(0,i.jsx)(l.A,{to:r,children:c})})}var d=a(1312),g=a(5846),u=a(6266);const m={container:"container_mt6G"};function h({readingTime:e}){const t=function(){const{selectMessage:e}=(0,g.W)();return t=>{const a=Math.ceil(t);return e(a,(0,d.T)({id:"theme.blog.post.readingTime.plurals",description:'Pluralized label for "{readingTime} min read". 
Use as much plural forms (separated by "|") as your language support (see https://www.unicode.org/cldr/cldr-aux/charts/34/supplemental/language_plural_rules.html)',message:"One min read|{readingTime} min read"},{readingTime:a}))}}();return(0,i.jsx)(i.Fragment,{children:t(e)})}function p({date:e,formattedDate:t}){return(0,i.jsx)("time",{dateTime:e,children:t})}function x(){return(0,i.jsx)(i.Fragment,{children:" \xb7 "})}function j({className:e}){const{metadata:t}=(0,s.e7)(),{date:a,readingTime:r}=t,l=(0,u.i)({day:"numeric",month:"long",year:"numeric",timeZone:"UTC"});return(0,i.jsxs)("div",{className:(0,n.A)(m.container,"margin-vert--md",e),children:[(0,i.jsx)(p,{date:a,formattedDate:(o=a,l.format(new Date(o)))}),void 0!==r&&(0,i.jsxs)(i.Fragment,{children:[(0,i.jsx)(x,{}),(0,i.jsx)(h,{readingTime:r})]})]});var o}var b=a(6382);const A={authorCol:"authorCol_Hf19",imageOnlyAuthorRow:"imageOnlyAuthorRow_pa_O",imageOnlyAuthorCol:"imageOnlyAuthorCol_G86a"};function f({className:e}){const{metadata:{authors:t},assets:a}=(0,s.e7)();if(0===t.length)return null;const r=t.every(({name:e})=>!e),l=1===t.length;return(0,i.jsx)("div",{className:(0,n.A)("margin-top--md margin-bottom--sm",r?A.imageOnlyAuthorRow:"row",e),children:t.map((e,t)=>(0,i.jsx)("div",{className:(0,n.A)(!r&&(l?"col col--12":"col col--6"),r?A.imageOnlyAuthorCol:A.authorCol),children:(0,i.jsx)(b.A,{author:{...e,imageURL:a.authorsImageUrls[t]??e.imageURL}})},t))})}function v(){return(0,i.jsxs)("header",{children:[(0,i.jsx)(c,{}),(0,i.jsx)(j,{}),(0,i.jsx)(f,{})]})}var T=a(440),N=a(3253);function w({children:e,className:t}){const{isBlogPostPage:a}=(0,s.e7)();return(0,i.jsx)("div",{id:a?T.LU:void 0,className:(0,n.A)("markdown",t),children:(0,i.jsx)(N.A,{children:e})})}var _=a(7559),k=a(4336),y=a(2053);function P(){return(0,i.jsx)("b",{children:(0,i.jsx)(d.A,{id:"theme.blog.post.readMore",description:"The label used in blog post item excerpts to link to full blog posts",children:"Read more"})})}function 
R(e){const{blogPostTitle:t,...a}=e;return(0,i.jsx)(l.A,{"aria-label":(0,d.T)({message:"Read more about {title}",id:"theme.blog.post.readMoreLabel",description:"The ARIA label for the link to full blog posts from excerpts"},{title:t}),...a,children:(0,i.jsx)(P,{})})}function U(){const{metadata:e,isBlogPostPage:t}=(0,s.e7)(),{tags:a,title:r,editUrl:l,hasTruncateMarker:o,lastUpdatedBy:c,lastUpdatedAt:d}=e,g=!t&&o,u=a.length>0;if(!(u||g||l))return null;if(t){const e=!!(l||d||c);return(0,i.jsxs)("footer",{className:"docusaurus-mt-lg",children:[u&&(0,i.jsx)("div",{className:(0,n.A)("row","margin-top--sm",_.G.blog.blogFooterEditMetaRow),children:(0,i.jsx)("div",{className:"col",children:(0,i.jsx)(y.A,{tags:a})})}),e&&(0,i.jsx)(k.A,{className:(0,n.A)("margin-top--sm",_.G.blog.blogFooterEditMetaRow),editUrl:l,lastUpdatedAt:d,lastUpdatedBy:c})]})}return(0,i.jsxs)("footer",{className:"row docusaurus-mt-lg",children:[u&&(0,i.jsx)("div",{className:(0,n.A)("col",{"col--9":g}),children:(0,i.jsx)(y.A,{tags:a})}),g&&(0,i.jsx)("div",{className:(0,n.A)("col text--right",{"col--3":u}),children:(0,i.jsx)(R,{blogPostTitle:r,to:e.permalink})})]})}function B({children:e,className:t}){const a=function(){const{isBlogPostPage:e}=(0,s.e7)();return e?void 0:"margin-bottom--xl"}();return(0,i.jsxs)(r,{className:(0,n.A)(a,t),children:[(0,i.jsx)(v,{}),(0,i.jsx)(w,{children:e}),(0,i.jsx)(U,{})]})}},3069:(e,t,a)=>{a.r(t),a.d(t,{default:()=>b});a(6540);var n=a(4164),s=a(1312),i=a(7559),r=a(5500),l=a(6461),o=a(8774),c=a(8027),d=a(7713),g=a(1463),u=a(3892),m=a(2234),h=a(1107),p=a(4848);function x({tag:e}){const t=(0,l.ZD)(e);return(0,p.jsxs)(p.Fragment,{children:[(0,p.jsx)(r.be,{title:t,description:e.description}),(0,p.jsx)(g.A,{tag:"blog_tags_posts"})]})}function j({tag:e,items:t,sidebar:a,listMetadata:n}){const 
i=(0,l.ZD)(e);return(0,p.jsxs)(c.A,{sidebar:a,children:[e.unlisted&&(0,p.jsx)(m.A,{}),(0,p.jsxs)("header",{className:"margin-bottom--xl",children:[(0,p.jsx)(h.A,{as:"h1",children:i}),e.description&&(0,p.jsx)("p",{children:e.description}),(0,p.jsx)(o.A,{href:e.allTagsPath,children:(0,p.jsx)(s.A,{id:"theme.tags.tagsPageLink",description:"The label of the link targeting the tag list page",children:"View All Tags"})})]}),(0,p.jsx)(u.A,{items:t}),(0,p.jsx)(d.A,{metadata:n})]})}function b(e){return(0,p.jsxs)(r.e3,{className:(0,n.A)(i.G.wrapper.blogPages,i.G.page.blogTagPostListPage),children:[(0,p.jsx)(x,{...e}),(0,p.jsx)(j,{...e})]})}},3892:(e,t,a)=>{a.d(t,{A:()=>r});a(6540);var n=a(4096),s=a(2907),i=a(4848);function r({items:e,component:t=s.A}){return(0,i.jsx)(i.Fragment,{children:e.map(({content:e})=>(0,i.jsx)(n.in,{content:e,children:(0,i.jsx)(t,{children:(0,i.jsx)(e,{})})},e.metadata.permalink))})}},4084:(e,t,a)=>{a.d(t,{AE:()=>o,Rc:()=>r,TT:()=>d,Uh:()=>l,Yh:()=>c});a(6540);var n=a(1312),s=a(5260),i=a(4848);function r(){return(0,i.jsx)(n.A,{id:"theme.contentVisibility.unlistedBanner.title",description:"The unlisted content banner title",children:"Unlisted page"})}function l(){return(0,i.jsx)(n.A,{id:"theme.contentVisibility.unlistedBanner.message",description:"The unlisted content banner message",children:"This page is unlisted. Search engines will not index it, and only users having a direct link can access it."})}function o(){return(0,i.jsx)(s.A,{children:(0,i.jsx)("meta",{name:"robots",content:"noindex, nofollow"})})}function c(){return(0,i.jsx)(n.A,{id:"theme.contentVisibility.draftBanner.title",description:"The draft content banner title",children:"Draft page"})}function d(){return(0,i.jsx)(n.A,{id:"theme.contentVisibility.draftBanner.message",description:"The draft content banner message",children:"This page is a draft. 
It will only be visible in dev and be excluded from the production build."})}},6133:(e,t,a)=>{a.d(t,{A:()=>l});a(6540);var n=a(4164),s=a(8774);const i={tag:"tag_zVej",tagRegular:"tagRegular_sFm0",tagWithCount:"tagWithCount_h2kH"};var r=a(4848);function l({permalink:e,label:t,count:a,description:l}){return(0,r.jsxs)(s.A,{rel:"tag",href:e,title:l,className:(0,n.A)(i.tag,a?i.tagWithCount:i.tagRegular),children:[t,a&&(0,r.jsx)("span",{children:a})]})}},6461:(e,t,a)=>{a.d(t,{ZD:()=>r,uz:()=>l});a(6540);var n=a(1312),s=a(5846);a(4848);function i(){const{selectMessage:e}=(0,s.W)();return t=>e(t,(0,n.T)({id:"theme.blog.post.plurals",description:'Pluralized label for "{count} posts". Use as much plural forms (separated by "|") as your language support (see https://www.unicode.org/cldr/cldr-aux/charts/34/supplemental/language_plural_rules.html)',message:"One post|{count} posts"},{count:t}))}function r(e){const t=i();return(0,n.T)({id:"theme.blog.tagTitle",description:"The title of the page for a blog tag",message:'{nPosts} tagged with "{tagName}"'},{nPosts:t(e.count),tagName:e.label})}const l=()=>(0,n.T)({id:"theme.blog.authorsList.pageTitle",message:"Authors",description:"The title of the authors page"})},7713:(e,t,a)=>{a.d(t,{A:()=>r});a(6540);var n=a(1312),s=a(9022),i=a(4848);function r(e){const{metadata:t}=e,{previousPage:a,nextPage:r}=t;return(0,i.jsxs)("nav",{className:"pagination-nav","aria-label":(0,n.T)({id:"theme.blog.paginator.navAriaLabel",message:"Blog list page navigation",description:"The ARIA label for the blog pagination"}),children:[a&&(0,i.jsx)(s.A,{permalink:a,title:(0,i.jsx)(n.A,{id:"theme.blog.paginator.newerEntries",description:"The label used to navigate to the newer blog posts page (previous page)",children:"Newer entries"})}),r&&(0,i.jsx)(s.A,{permalink:r,title:(0,i.jsx)(n.A,{id:"theme.blog.paginator.olderEntries",description:"The label used to navigate to the older blog posts page (next page)",children:"Older 
entries"}),isNext:!0})]})}},9022:(e,t,a)=>{a.d(t,{A:()=>r});a(6540);var n=a(4164),s=a(8774),i=a(4848);function r(e){const{permalink:t,title:a,subLabel:r,isNext:l}=e;return(0,i.jsxs)(s.A,{className:(0,n.A)("pagination-nav__link",l?"pagination-nav__link--next":"pagination-nav__link--prev"),to:t,children:[r&&(0,i.jsx)("div",{className:"pagination-nav__sublabel",children:r}),(0,i.jsx)("div",{className:"pagination-nav__label",children:a})]})}}}]); \ No newline at end of file diff --git a/docs/assets/js/72dc5b25.20421ae4.js b/docs/assets/js/72dc5b25.20421ae4.js deleted file mode 100644 index f6c3affa..00000000 --- a/docs/assets/js/72dc5b25.20421ae4.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[8261],{3613:e=>{e.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"v1.0.0","description":"Numerix v1.0.0","slug":"/online-feature-store/v1.0.0","permalink":"/BharatMLStack/online-feature-store/v1.0.0","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"Online Feature Store","permalink":"/BharatMLStack/category/online-feature-store"},"next":{"title":"Architecture","permalink":"/BharatMLStack/online-feature-store/v1.0.0/architecture"}}}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/74783256.0fa34723.js b/docs/assets/js/74783256.0fa34723.js new file mode 100644 index 00000000..089fd491 --- /dev/null +++ b/docs/assets/js/74783256.0fa34723.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[7813],{1566:e=>{e.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"Predator","description":"Predator is a scalable, high-performance model inference service built as a wrapper around NVIDIA Triton Inference Server, designed to serve ML models with low latency in Kubernetes, with OnFS and Interflow integration.","slug":"/category/predator","permalink":"/BharatMLStack/category/predator","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"Release 
Notes","permalink":"/BharatMLStack/numerix/v1.0.0/release-notes"},"next":{"title":"v1.0.0","permalink":"/BharatMLStack/predator/v1.0.0"}}}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/79ae4ea7.1416ba4f.js b/docs/assets/js/79ae4ea7.1416ba4f.js deleted file mode 100644 index 8a4eb857..00000000 --- a/docs/assets/js/79ae4ea7.1416ba4f.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[4340],{2173:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/llm-plat-9ac69c0ffd8c387d177e582611b8c775.png"},4311:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>l,contentTitle:()=>o,default:()=>h,frontMatter:()=>a,metadata:()=>t,toc:()=>c});const t=JSON.parse('{"permalink":"/BharatMLStack/blog/post-three","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-four/index.md","source":"@site/blog/bharatmlstack-history/post-four/index.md","title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","description":"BharatMLStack","date":"2025-03-29T00:00:00.000Z","tags":[{"inline":true,"label":"llm","permalink":"/BharatMLStack/blog/tags/llm"},{"inline":true,"label":"vllm","permalink":"/BharatMLStack/blog/tags/vllm"},{"inline":true,"label":"tensorrt-llm","permalink":"/BharatMLStack/blog/tags/tensorrt-llm"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":13.38,"hasTruncateMarker":false,"authors":[{"name":"Jaya Kumar","title":"Lead ML Engineer @ Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null}],"frontMatter":{"slug":"post-three","title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU 
Serving","authors":["jaya"],"date":"2025-3-29","tags":["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","permalink":"/BharatMLStack/blog/post-five"},"nextItem":{"title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","permalink":"/BharatMLStack/blog/post-three"}}');var r=i(4848),s=i(8453);const a={slug:"post-three",title:"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving",authors:["jaya"],date:"2025-3-29",tags:["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},o=void 0,l={authorsImageUrls:[void 0]},c=[{value:"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving",id:"designing-a-production-grade-llm-inference-platform-from-model-weights-to-scalable-gpu-serving",level:2},{value:"Why LLM Inference Is not just bigger ML model serving",id:"why-llm-inference-is-not-just-bigger-ml-model-serving",level:2},{value:"Autoregressive Generation and Sequential Computation:",id:"autoregressive-generation-and-sequential-computation",level:3},{value:"Prefill and Decode Phases:",id:"prefill-and-decode-phases",level:3},{value:"Context Management and KV Caching:",id:"context-management-and-kv-caching",level:3},{value:"Dynamic and Irregular Workloads:",id:"dynamic-and-irregular-workloads",level:3},{value:"Streaming and User Experience Constraints:",id:"streaming-and-user-experience-constraints",level:3},{value:"LLMOps: High-Level Architecture",id:"llmops-high-level-architecture",level:2},{value:"Supported Inference backends (TensorRT LLM, Dynamo & vLLM)",id:"supported-inference-backends-tensorrt-llm--dynamo--vllm",level:2},{value:"Conclusion",id:"conclusion",level:2},{value:"Future Explorations",id:"future-explorations",level:2}];function d(e){const 
n={h2:"h2",h3:"h3",img:"img",li:"li",ol:"ol",p:"p",ul:"ul",...(0,s.R)(),...e.components};return(0,r.jsxs)(r.Fragment,{children:[(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"BharatMLStack",src:i(7996).A+"",width:"1396",height:"460"})}),"\n",(0,r.jsx)(n.h2,{id:"designing-a-production-grade-llm-inference-platform-from-model-weights-to-scalable-gpu-serving",children:"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving"}),"\n",(0,r.jsx)(n.p,{children:"Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale."}),"\n",(0,r.jsx)(n.p,{children:"The platform implements a complete LLMOps lifecycle \u2014 from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required."}),"\n",(0,r.jsx)(n.p,{children:"In addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques \u2014 such as quantization strategies, batching configurations, and runtime-specific performance enhancements \u2014 enabling teams to balance latency, throughput, and cost based on their use case. 
The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference."}),"\n",(0,r.jsx)(n.h2,{id:"why-llm-inference-is-not-just-bigger-ml-model-serving",children:"Why LLM Inference Is not just bigger ML model serving"}),"\n",(0,r.jsx)(n.p,{children:"Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled."}),"\n",(0,r.jsx)(n.h3,{id:"autoregressive-generation-and-sequential-computation",children:"Autoregressive Generation and Sequential Computation:"}),"\n",(0,r.jsx)(n.p,{children:"Unlike traditional models such as classifiers or recommenders \u2014 where inference cost is relatively constant \u2014 LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation.\nBecause tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution."}),"\n",(0,r.jsx)(n.h3,{id:"prefill-and-decode-phases",children:"Prefill and Decode Phases:"}),"\n",(0,r.jsx)(n.p,{children:"LLM inference typically consists of two distinct stages:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Prefill phase \u2014 the model processes the input prompt and builds internal representations. 
This stage is compute-heavy and highly parallelizable."}),"\n",(0,r.jsx)(n.li,{children:"Decode phase \u2014 the model generates tokens sequentially, predicting one token at a time using previously generated context."}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads."}),"\n",(0,r.jsx)(n.h3,{id:"context-management-and-kv-caching",children:"Context Management and KV Caching:"}),"\n",(0,r.jsx)(n.p,{children:"Another fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens.\nKV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Memory consumption grows with sequence length and batch size"}),"\n",(0,r.jsx)(n.li,{children:"GPU memory becomes a critical bottleneck"}),"\n",(0,r.jsx)(n.li,{children:"Efficient memory management becomes essential for scaling concurrent requests"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads."}),"\n",(0,r.jsx)(n.h3,{id:"dynamic-and-irregular-workloads",children:"Dynamic and Irregular Workloads:"}),"\n",(0,r.jsx)(n.p,{children:"Traditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. 
As a result:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Batch sizes must be dynamic rather than static"}),"\n",(0,r.jsx)(n.li,{children:"Requests may enter and leave batches asynchronously"}),"\n",(0,r.jsx)(n.li,{children:"Scheduling systems must continuously rebalance workloads to maximize GPU utilization"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"These characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines."}),"\n",(0,r.jsx)(n.h3,{id:"streaming-and-user-experience-constraints",children:"Streaming and User Experience Constraints:"}),"\n",(0,r.jsx)(n.p,{children:"Another distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated.\nBecause of these differences \u2014 sequential generation, growing memory requirements, dynamic workloads, and streaming constraints \u2014 LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads."}),"\n",(0,r.jsx)(n.h2,{id:"llmops-high-level-architecture",children:"LLMOps: High-Level Architecture"}),"\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"LLM Architecture",src:i(2173).A+"",width:"1302",height:"830"})}),"\n",(0,r.jsx)(n.p,{children:"The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. 
The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention."}),"\n",(0,r.jsx)(n.p,{children:"Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability."}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Onboarding & Registration (The Source of Truth)"}),"\n",(0,r.jsx)(n.p,{children:"The lifecycle begins with the Data Scientist or engineer."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Model Ingestion: Users onboard models\u2014whether open-source (Hugging Face, NeMo) or internally fine-tuned\u2014via the Truffle Box SDK/UI."}),"\n",(0,r.jsx)(n.li,{children:'LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. 
This allows Data Scientists to register and version-control prompts (e.g., "customer_support_v2") independently of the application code.'}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:'The "Black Box" Build Engine'}),"\n",(0,r.jsx)(n.p,{children:"Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Transformation: The raw model is converted into a TRT-LLM Checkpoint."}),"\n",(0,r.jsx)(n.li,{children:"Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint."}),"\n",(0,r.jsx)(n.li,{children:"Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Intelligent Profiling & Validation"}),"\n",(0,r.jsx)(n.p,{children:"Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. 
vLLM)."}),"\n",(0,r.jsx)(n.li,{children:"Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Smart Artifact Generation & Distribution"}),"\n",(0,r.jsx)(n.p,{children:'To solve the Kubernetes "Cold Start" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:'}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup."}),"\n",(0,r.jsx)(n.li,{children:"Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Image Streaming & Deployment"}),"\n",(0,r.jsx)(n.p,{children:"Simultaneously, the inference runtime container images are pulled from the Artifact Registry."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Image Streaming: We utilize container image streaming to allow pods to start initializing while the massive Triton/Dynamo container layers are still downloading, further shaving seconds off the startup time. 
link"}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"The Inference Runtime (Kubernetes)"}),"\n",(0,r.jsx)(n.p,{children:"The workload lands on Kubernetes with Autoscaling."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference."}),"\n",(0,r.jsx)(n.li,{children:'Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk ("Pull from Disk").'}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Client Interaction & Observability"}),"\n",(0,r.jsx)(n.p,{children:"Finally, the LLM Inference Client executes the request."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used."}),"\n",(0,r.jsx)(n.li,{children:"Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Observability: Monitoring the Pulse of GenAI"}),"\n",(0,r.jsx)(n.p,{children:"In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. 
A user doesn't care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows."}),"\n",(0,r.jsx)(n.p,{children:"To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:"}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Time to First Token (TTFT)"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user."}),"\n",(0,r.jsx)(n.li,{children:'Why it matters: This represents the "Prefill Phase" latency\u2014the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or "hung."'}),"\n",(0,r.jsx)(n.li,{children:"Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hitrates), which drastically lowers this metric by skipping redundant prompt processing."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Inter-Token Latency (ITL)"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:'Definition: ITL measures the average time interval between the generation of consecutive tokens during the "Decode Phase".'}),"\n",(0,r.jsx)(n.li,{children:'Why it matters: This defines the "perceived speed" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look "jerky" or slow to the user.'}),"\n",(0,r.jsx)(n.li,{children:"Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speeds to maintain a natural conversational flow."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Token Throughput vs. 
Request Throughput"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"We distinguish between two types of throughput to balance system efficiency with user load:"}),"\n",(0,r.jsx)(n.li,{children:"Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching."}),"\n",(0,r.jsx)(n.li,{children:"Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before the queue depth impacts ITL."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"The Monitoring Stack"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:'Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot "slow generation" incidents that generic "500 error" alerts would miss.'}),"\n",(0,r.jsx)(n.li,{children:'Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific "slow" request back to its prompt to understand if a complex input caused the latency spike.'}),"\n"]}),"\n"]}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"supported-inference-backends-tensorrt-llm--dynamo--vllm",children:"Supported Inference backends (TensorRT LLM, Dynamo & vLLM)"}),"\n",(0,r.jsx)(n.p,{children:'Tailored for the Use Case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases\u2014whether a real-time voice bot requiring ultra-lowsub-second latency or a massive reasoning task requiring huge context windows\u2014demand different runtime characteristics. 
Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:'}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"TensorRT-LLM: The High-Performance Standard"}),"\n",(0,r.jsx)(n.p,{children:"Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots)."}),"\n",(0,r.jsx)(n.p,{children:"TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization ."}),"\n",(0,r.jsx)(n.p,{children:"Key optimizations we tailor for these high-load cases include:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Optimized execution via TensorRT engine compilation"}),"\n",(0,r.jsx)(n.li,{children:"Quantization-aware execution for reduced memory usage and improved throughput"}),"\n",(0,r.jsx)(n.li,{children:"Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization ."}),"\n",(0,r.jsx)(n.li,{children:"Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms ."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Dynamo: Distributed Inference for Reasoning Models"}),"\n",(0,r.jsx)(n.p,{children:'Suitable for: Very large "reasoning" models (70B+) or scenarios requiring massive context windows where a single GPU\'s memory is insufficient.'}),"\n",(0,r.jsx)(n.p,{children:"For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework . 
Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"KV Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation."}),"\n",(0,r.jsx)(n.li,{children:'Prefill vs. Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy "reading" phase independently from the memory-heavy "writing" phase.'}),"\n",(0,r.jsx)(n.li,{children:"Distributed execution across multiple GPU resources"}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"vLLM: The Flexible Baseline"}),"\n",(0,r.jsx)(n.p,{children:"Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput."}),"\n",(0,r.jsx)(n.p,{children:"While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"High throughput through dynamic batching and efficient memory utilization"}),"\n",(0,r.jsx)(n.li,{children:"Paged KV cache management for handling long contexts and concurrent requests"}),"\n",(0,r.jsx)(n.li,{children:"Strong support for open-source model ecosystems"}),"\n",(0,r.jsx)(n.li,{children:"Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build."}),"\n",(0,r.jsx)(n.li,{children:"Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. 
We use it strategically for initial testing before committing to a full TensorRT optimization pipeline."}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"conclusion",children:"Conclusion"}),"\n",(0,r.jsx)(n.p,{children:"Large language model inference introduces a fundamentally new class of infrastructure challenges\u2014where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads."}),"\n",(0,r.jsx)(n.p,{children:"The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle\u2014from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity."}),"\n",(0,r.jsx)(n.p,{children:"Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows."}),"\n",(0,r.jsx)(n.p,{children:"Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment\u2014allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. 
By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences."}),"\n",(0,r.jsx)(n.h2,{id:"future-explorations",children:"Future Explorations"}),"\n",(0,r.jsx)(n.p,{children:"While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs to bake them into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics."}),"\n",(0,r.jsx)(n.li,{children:'Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a "serverless" experience where specific fine-tunes are hot-swapped instantly per request.'}),"\n",(0,r.jsx)(n.li,{children:"Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. 
By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user's streaming experience."}),"\n",(0,r.jsx)(n.li,{children:'Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., "How do I reset my password?" vs. "Password reset steps"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.'}),"\n",(0,r.jsx)(n.li,{children:"Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before the memory fills up, preventing eviction-based slowdowns during traffic spikes."}),"\n",(0,r.jsx)(n.li,{children:'Online Evaluation & Guardrails: We are integrating a lightweight "Trust Layer" into the proxy. 
This will allow for low-latency input/output filtering (Guardrails) and asynchronous "LLM-as-a-Judge" evaluation pipelines to monitor response quality in production, not just system health.'}),"\n"]})]})}function h(e={}){const{wrapper:n}={...(0,s.R)(),...e.components};return n?(0,r.jsx)(n,{...e,children:(0,r.jsx)(d,{...e})}):d(e)}},7996:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"},8453:(e,n,i)=>{i.d(n,{R:()=>a,x:()=>o});var t=i(6540);const r={},s=t.createContext(r);function a(e){const n=t.useContext(s);return t.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(r):e.components||r:a(e.components),t.createElement(s.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/79ae4ea7.4c2a5580.js b/docs/assets/js/79ae4ea7.4c2a5580.js new file mode 100644 index 00000000..2587f905 --- /dev/null +++ b/docs/assets/js/79ae4ea7.4c2a5580.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[4340],{2233:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-four","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-four/index.md","source":"@site/blog/bharatmlstack-history/post-four/index.md","title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU 
Serving","description":"BharatMLStack","date":"2025-03-29T00:00:00.000Z","tags":[{"inline":true,"label":"llm","permalink":"/BharatMLStack/blog/tags/llm"},{"inline":true,"label":"vllm","permalink":"/BharatMLStack/blog/tags/vllm"},{"inline":true,"label":"tensorrt-llm","permalink":"/BharatMLStack/blog/tags/tensorrt-llm"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":13.38,"hasTruncateMarker":false,"authors":[{"name":"Jaya Kumar","title":"Lead ML Engineer @ Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null}],"frontMatter":{"slug":"post-four","title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","authors":["jaya"],"date":"2025-3-29","tags":["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","permalink":"/BharatMLStack/blog/post-five"},"nextItem":{"title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","permalink":"/BharatMLStack/blog/post-three"}}')},2305:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>l,contentTitle:()=>o,default:()=>h,frontMatter:()=>a,metadata:()=>t,toc:()=>c});var t=i(2233),r=i(4848),s=i(8453);const a={slug:"post-four",title:"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving",authors:["jaya"],date:"2025-3-29",tags:["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},o=void 0,l={authorsImageUrls:[void 0]},c=[{value:"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU 
Serving",id:"designing-a-production-grade-llm-inference-platform-from-model-weights-to-scalable-gpu-serving",level:2},{value:"Why LLM Inference Is not just bigger ML model serving",id:"why-llm-inference-is-not-just-bigger-ml-model-serving",level:2},{value:"Autoregressive Generation and Sequential Computation:",id:"autoregressive-generation-and-sequential-computation",level:3},{value:"Prefill and Decode Phases:",id:"prefill-and-decode-phases",level:3},{value:"Context Management and KV Caching:",id:"context-management-and-kv-caching",level:3},{value:"Dynamic and Irregular Workloads:",id:"dynamic-and-irregular-workloads",level:3},{value:"Streaming and User Experience Constraints:",id:"streaming-and-user-experience-constraints",level:3},{value:"LLMOps: High-Level Architecture",id:"llmops-high-level-architecture",level:2},{value:"Supported Inference backends (TensorRT LLM, Dynamo & vLLM)",id:"supported-inference-backends-tensorrt-llm--dynamo--vllm",level:2},{value:"Conclusion",id:"conclusion",level:2},{value:"Future Explorations",id:"future-explorations",level:2}];function d(e){const n={h2:"h2",h3:"h3",img:"img",li:"li",ol:"ol",p:"p",ul:"ul",...(0,s.R)(),...e.components};return(0,r.jsxs)(r.Fragment,{children:[(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"BharatMLStack",src:i(7613).A+"",width:"1396",height:"460"})}),"\n",(0,r.jsx)(n.h2,{id:"designing-a-production-grade-llm-inference-platform-from-model-weights-to-scalable-gpu-serving",children:"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving"}),"\n",(0,r.jsx)(n.p,{children:"Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. 
The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale."}),"\n",(0,r.jsx)(n.p,{children:"The platform implements a complete LLMOps lifecycle \u2014 from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required."}),"\n",(0,r.jsx)(n.p,{children:"In addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques \u2014 such as quantization strategies, batching configurations, and runtime-specific performance enhancements \u2014 enabling teams to balance latency, throughput, and cost based on their use case. The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference."}),"\n",(0,r.jsx)(n.h2,{id:"why-llm-inference-is-not-just-bigger-ml-model-serving",children:"Why LLM Inference Is not just bigger ML model serving"}),"\n",(0,r.jsx)(n.p,{children:"Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. 
This difference dramatically changes how inference systems must be designed, optimized, and scaled."}),"\n",(0,r.jsx)(n.h3,{id:"autoregressive-generation-and-sequential-computation",children:"Autoregressive Generation and Sequential Computation:"}),"\n",(0,r.jsx)(n.p,{children:"Unlike traditional models such as classifiers or recommenders \u2014 where inference cost is relatively constant \u2014 LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation.\nBecause tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution."}),"\n",(0,r.jsx)(n.h3,{id:"prefill-and-decode-phases",children:"Prefill and Decode Phases:"}),"\n",(0,r.jsx)(n.p,{children:"LLM inference typically consists of two distinct stages:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Prefill phase \u2014 the model processes the input prompt and builds internal representations. This stage is compute-heavy and highly parallelizable."}),"\n",(0,r.jsx)(n.li,{children:"Decode phase \u2014 the model generates tokens sequentially, predicting one token at a time using previously generated context."}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads."}),"\n",(0,r.jsx)(n.h3,{id:"context-management-and-kv-caching",children:"Context Management and KV Caching:"}),"\n",(0,r.jsx)(n.p,{children:"Another fundamental difference lies in how LLMs maintain context. 
Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens.\nKV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Memory consumption grows with sequence length and batch size"}),"\n",(0,r.jsx)(n.li,{children:"GPU memory becomes a critical bottleneck"}),"\n",(0,r.jsx)(n.li,{children:"Efficient memory management becomes essential for scaling concurrent requests"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads."}),"\n",(0,r.jsx)(n.h3,{id:"dynamic-and-irregular-workloads",children:"Dynamic and Irregular Workloads:"}),"\n",(0,r.jsx)(n.p,{children:"Traditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. As a result:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Batch sizes must be dynamic rather than static"}),"\n",(0,r.jsx)(n.li,{children:"Requests may enter and leave batches asynchronously"}),"\n",(0,r.jsx)(n.li,{children:"Scheduling systems must continuously rebalance workloads to maximize GPU utilization"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"These characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines."}),"\n",(0,r.jsx)(n.h3,{id:"streaming-and-user-experience-constraints",children:"Streaming and User Experience Constraints:"}),"\n",(0,r.jsx)(n.p,{children:"Another distinguishing factor is the expectation of real-time streaming responses. 
Instead of returning a single output, LLM systems often stream tokens to users as they are generated.\nBecause of these differences \u2014 sequential generation, growing memory requirements, dynamic workloads, and streaming constraints \u2014 LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads."}),"\n",(0,r.jsx)(n.h2,{id:"llmops-high-level-architecture",children:"LLMOps: High-Level Architecture"}),"\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"LLM Architecture",src:i(3874).A+"",width:"1302",height:"830"})}),"\n",(0,r.jsx)(n.p,{children:"The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention."}),"\n",(0,r.jsx)(n.p,{children:"Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. 
As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability."}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Onboarding & Registration (The Source of Truth)"}),"\n",(0,r.jsx)(n.p,{children:"The lifecycle begins with the Data Scientist or engineer."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Model Ingestion: Users onboard models\u2014whether open-source (Hugging Face, NeMo) or internally fine-tuned\u2014via the Truffle Box SDK/UI."}),"\n",(0,r.jsx)(n.li,{children:'LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. This allows Data Scientists to register and version-control prompts (e.g., "customer_support_v2") independently of the application code.'}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:'The "Black Box" Build Engine'}),"\n",(0,r.jsx)(n.p,{children:"Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Transformation: The raw model is converted into a TRT-LLM Checkpoint."}),"\n",(0,r.jsx)(n.li,{children:"Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint."}),"\n",(0,r.jsx)(n.li,{children:"Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Intelligent Profiling & Validation"}),"\n",(0,r.jsx)(n.p,{children:"Before deployment, the new engine passes through the Hardware & Inference Runtime 
Profiler."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. vLLM)."}),"\n",(0,r.jsx)(n.li,{children:"Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Smart Artifact Generation & Distribution"}),"\n",(0,r.jsx)(n.p,{children:'To solve the Kubernetes "Cold Start" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:'}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup."}),"\n",(0,r.jsx)(n.li,{children:"Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Image Streaming & Deployment"}),"\n",(0,r.jsx)(n.p,{children:"Simultaneously, the inference runtime container images are pulled from the Artifact Registry."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Image Streaming: We utilize container image streaming to allow pods to start initializing while the massive Triton/Dynamo container layers are still downloading, further shaving seconds off the startup time. 
link"}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"The Inference Runtime (Kubernetes)"}),"\n",(0,r.jsx)(n.p,{children:"The workload lands on Kubernetes with Autoscaling."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference."}),"\n",(0,r.jsx)(n.li,{children:'Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk ("Pull from Disk").'}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Client Interaction & Observability"}),"\n",(0,r.jsx)(n.p,{children:"Finally, the LLM Inference Client executes the request."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used."}),"\n",(0,r.jsx)(n.li,{children:"Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Observability: Monitoring the Pulse of GenAI"}),"\n",(0,r.jsx)(n.p,{children:"In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. 
A user doesn't care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows."}),"\n",(0,r.jsx)(n.p,{children:"To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:"}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Time to First Token (TTFT)"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user."}),"\n",(0,r.jsx)(n.li,{children:'Why it matters: This represents the "Prefill Phase" latency\u2014the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or "hung."'}),"\n",(0,r.jsx)(n.li,{children:"Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hit rates), which drastically lowers this metric by skipping redundant prompt processing."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Inter-Token Latency (ITL)"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:'Definition: ITL measures the average time interval between the generation of consecutive tokens during the "Decode Phase".'}),"\n",(0,r.jsx)(n.li,{children:'Why it matters: This defines the "perceived speed" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look "jerky" or slow to the user.'}),"\n",(0,r.jsx)(n.li,{children:"Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speeds to maintain a natural conversational flow."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Token Throughput vs. 
Request Throughput"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"We distinguish between two types of throughput to balance system efficiency with user load:"}),"\n",(0,r.jsx)(n.li,{children:"Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching."}),"\n",(0,r.jsx)(n.li,{children:"Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before the queue depth impacts ITL."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"The Monitoring Stack"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:'Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot "slow generation" incidents that generic "500 error" alerts would miss.'}),"\n",(0,r.jsx)(n.li,{children:'Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific "slow" request back to its prompt to understand if a complex input caused the latency spike.'}),"\n"]}),"\n"]}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"supported-inference-backends-tensorrt-llm--dynamo--vllm",children:"Supported Inference backends (TensorRT LLM, Dynamo & vLLM)"}),"\n",(0,r.jsx)(n.p,{children:'Tailored for the Use Case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases\u2014whether a real-time voice bot requiring ultra-low sub-second latency or a massive reasoning task requiring huge context windows\u2014demand different runtime characteristics. 
Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:'}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"TensorRT-LLM: The High-Performance Standard"}),"\n",(0,r.jsx)(n.p,{children:"Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots)."}),"\n",(0,r.jsx)(n.p,{children:"TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization."}),"\n",(0,r.jsx)(n.p,{children:"Key optimizations we tailor for these high-load cases include:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Optimized execution via TensorRT engine compilation"}),"\n",(0,r.jsx)(n.li,{children:"Quantization-aware execution for reduced memory usage and improved throughput"}),"\n",(0,r.jsx)(n.li,{children:"Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization."}),"\n",(0,r.jsx)(n.li,{children:"Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Dynamo: Distributed Inference for Reasoning Models"}),"\n",(0,r.jsx)(n.p,{children:'Suitable for: Very large "reasoning" models (70B+) or scenarios requiring massive context windows where a single GPU\'s memory is insufficient.'}),"\n",(0,r.jsx)(n.p,{children:"For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework. 
Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"KV Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation."}),"\n",(0,r.jsx)(n.li,{children:'Prefill vs. Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy "reading" phase independently from the memory-heavy "writing" phase.'}),"\n",(0,r.jsx)(n.li,{children:"Distributed execution across multiple GPU resources"}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"vLLM: The Flexible Baseline"}),"\n",(0,r.jsx)(n.p,{children:"Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput."}),"\n",(0,r.jsx)(n.p,{children:"While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"High throughput through dynamic batching and efficient memory utilization"}),"\n",(0,r.jsx)(n.li,{children:"Paged KV cache management for handling long contexts and concurrent requests"}),"\n",(0,r.jsx)(n.li,{children:"Strong support for open-source model ecosystems"}),"\n",(0,r.jsx)(n.li,{children:"Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build."}),"\n",(0,r.jsx)(n.li,{children:"Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. 
We use it strategically for initial testing before committing to a full TensorRT optimization pipeline."}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"conclusion",children:"Conclusion"}),"\n",(0,r.jsx)(n.p,{children:"Large language model inference introduces a fundamentally new class of infrastructure challenges\u2014where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads."}),"\n",(0,r.jsx)(n.p,{children:"The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle\u2014from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity."}),"\n",(0,r.jsx)(n.p,{children:"Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows."}),"\n",(0,r.jsx)(n.p,{children:"Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment\u2014allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. 
By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences."}),"\n",(0,r.jsx)(n.h2,{id:"future-explorations",children:"Future Explorations"}),"\n",(0,r.jsx)(n.p,{children:"While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs for integration into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics."}),"\n",(0,r.jsx)(n.li,{children:'Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a "serverless" experience where specific fine-tunes are hot-swapped instantly per request.'}),"\n",(0,r.jsx)(n.li,{children:"Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. 
By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user's streaming experience."}),"\n",(0,r.jsx)(n.li,{children:'Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., "How do I reset my password?" vs. "Password reset steps"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.'}),"\n",(0,r.jsx)(n.li,{children:"Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before the memory fills up, preventing eviction-based slowdowns during traffic spikes."}),"\n",(0,r.jsx)(n.li,{children:'Online Evaluation & Guardrails: We are integrating a lightweight "Trust Layer" into the proxy. 
This will allow for low-latency input/output filtering (Guardrails) and asynchronous "LLM-as-a-Judge" evaluation pipelines to monitor response quality in production, not just system health.'}),"\n"]})]})}function h(e={}){const{wrapper:n}={...(0,s.R)(),...e.components};return n?(0,r.jsx)(n,{...e,children:(0,r.jsx)(d,{...e})}):d(e)}},3874:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/llm-plat-9ac69c0ffd8c387d177e582611b8c775.png"},7613:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"},8453:(e,n,i)=>{i.d(n,{R:()=>a,x:()=>o});var t=i(6540);const r={},s=t.createContext(r);function a(e){const n=t.useContext(s);return t.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(r):e.components||r:a(e.components),t.createElement(s.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/814f3328.189ef834.js b/docs/assets/js/814f3328.189ef834.js new file mode 100644 index 00000000..f366747d --- /dev/null +++ b/docs/assets/js/814f3328.189ef834.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[7472],{5513:e=>{e.exports=JSON.parse('{"title":"Recent posts","items":[{"title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","permalink":"/BharatMLStack/blog/post-five","unlisted":false,"date":"2025-06-02T00:00:00.000Z"},{"title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","permalink":"/BharatMLStack/blog/post-four","unlisted":false,"date":"2025-03-29T00:00:00.000Z"},{"title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","permalink":"/BharatMLStack/blog/post-three","unlisted":false,"date":"2024-05-21T00:00:00.000Z"},{"title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 
2)","permalink":"/BharatMLStack/blog/post-two","unlisted":false,"date":"2023-04-10T00:00:00.000Z"},{"title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)","permalink":"/BharatMLStack/blog/post-one","unlisted":false,"date":"2022-11-15T00:00:00.000Z"}]}')}}]); \ No newline at end of file diff --git a/docs/assets/js/814f3328.bfb123e8.js b/docs/assets/js/814f3328.bfb123e8.js deleted file mode 100644 index f1e59d9a..00000000 --- a/docs/assets/js/814f3328.bfb123e8.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[7472],{5513:e=>{e.exports=JSON.parse('{"title":"Recent posts","items":[{"title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","permalink":"/BharatMLStack/blog/post-five","unlisted":false,"date":"2025-06-02T00:00:00.000Z"},{"title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","permalink":"/BharatMLStack/blog/post-three","unlisted":false,"date":"2025-03-29T00:00:00.000Z"},{"title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","permalink":"/BharatMLStack/blog/post-three","unlisted":false,"date":"2024-05-21T00:00:00.000Z"},{"title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","permalink":"/BharatMLStack/blog/post-two","unlisted":false,"date":"2023-04-10T00:00:00.000Z"},{"title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)","permalink":"/BharatMLStack/blog/post-one","unlisted":false,"date":"2022-11-15T00:00:00.000Z"}]}')}}]); \ No newline at end of file diff --git a/docs/assets/js/8ac6191a.6f3973a2.js b/docs/assets/js/8ac6191a.8ac511aa.js similarity index 73% rename from docs/assets/js/8ac6191a.6f3973a2.js rename to docs/assets/js/8ac6191a.8ac511aa.js index 8f8e88c7..ec5af0a3 100644 --- a/docs/assets/js/8ac6191a.6f3973a2.js +++ b/docs/assets/js/8ac6191a.8ac511aa.js @@ -1 +1 @@ -"use 
strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[8465],{4540:e=>{e.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"Online Feature Store","description":"Online-feature-store is a high-performance, scalable, and production-grade feature store built for modern machine learning systems. It supports both real-time and batch workflows, with a strong emphasis on developer experience, system observability, and low-latency feature retrieval.","slug":"/category/online-feature-store","permalink":"/BharatMLStack/category/online-feature-store","sidebar":"tutorialSidebar","navigation":{"next":{"title":"v1.0.0","permalink":"/BharatMLStack/online-feature-store/v1.0.0"}}}}')}}]); \ No newline at end of file +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[8465],{4540:e=>{e.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"Online Feature Store","description":"Online-feature-store is a high-performance, scalable, and production-grade feature store built for modern machine learning systems. 
It supports both real-time and batch workflows, with a strong emphasis on developer experience, system observability, and low-latency feature retrieval.","slug":"/category/online-feature-store","permalink":"/BharatMLStack/category/online-feature-store","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"BharatMLStack Documentation","permalink":"/BharatMLStack/intro"},"next":{"title":"v1.0.0","permalink":"/BharatMLStack/online-feature-store/v1.0.0"}}}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/8cdb4121.da7b560e.js b/docs/assets/js/8cdb4121.da7b560e.js new file mode 100644 index 00000000..dff9d8a1 --- /dev/null +++ b/docs/assets/js/8cdb4121.da7b560e.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[252],{2233:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-four","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-four/index.md","source":"@site/blog/bharatmlstack-history/post-four/index.md","title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","description":"BharatMLStack","date":"2025-03-29T00:00:00.000Z","tags":[{"inline":true,"label":"llm","permalink":"/BharatMLStack/blog/tags/llm"},{"inline":true,"label":"vllm","permalink":"/BharatMLStack/blog/tags/vllm"},{"inline":true,"label":"tensorrt-llm","permalink":"/BharatMLStack/blog/tags/tensorrt-llm"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":13.38,"hasTruncateMarker":false,"authors":[{"name":"Jaya Kumar","title":"Lead ML Engineer @ Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null}],"frontMatter":{"slug":"post-four","title":"Designing a 
Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","authors":["jaya"],"date":"2025-3-29","tags":["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","permalink":"/BharatMLStack/blog/post-five"},"nextItem":{"title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","permalink":"/BharatMLStack/blog/post-three"}}')},2531:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>l,contentTitle:()=>o,default:()=>h,frontMatter:()=>a,metadata:()=>t,toc:()=>c});var t=i(2233),r=i(4848),s=i(8453);const a={slug:"post-four",title:"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving",authors:["jaya"],date:"2025-3-29",tags:["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},o=void 0,l={authorsImageUrls:[void 0]},c=[{value:"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving",id:"designing-a-production-grade-llm-inference-platform-from-model-weights-to-scalable-gpu-serving",level:2},{value:"Why LLM Inference Is not just bigger ML model serving",id:"why-llm-inference-is-not-just-bigger-ml-model-serving",level:2},{value:"Autoregressive Generation and Sequential Computation:",id:"autoregressive-generation-and-sequential-computation",level:3},{value:"Prefill and Decode Phases:",id:"prefill-and-decode-phases",level:3},{value:"Context Management and KV Caching:",id:"context-management-and-kv-caching",level:3},{value:"Dynamic and Irregular Workloads:",id:"dynamic-and-irregular-workloads",level:3},{value:"Streaming and User Experience Constraints:",id:"streaming-and-user-experience-constraints",level:3},{value:"LLMOps: High-Level Architecture",id:"llmops-high-level-architecture",level:2},{value:"Supported Inference backends (TensorRT LLM, Dynamo & 
vLLM)",id:"supported-inference-backends-tensorrt-llm--dynamo--vllm",level:2},{value:"Conclusion",id:"conclusion",level:2},{value:"Future Explorations",id:"future-explorations",level:2}];function d(e){const n={h2:"h2",h3:"h3",img:"img",li:"li",ol:"ol",p:"p",ul:"ul",...(0,s.R)(),...e.components};return(0,r.jsxs)(r.Fragment,{children:[(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"BharatMLStack",src:i(7613).A+"",width:"1396",height:"460"})}),"\n",(0,r.jsx)(n.h2,{id:"designing-a-production-grade-llm-inference-platform-from-model-weights-to-scalable-gpu-serving",children:"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving"}),"\n",(0,r.jsx)(n.p,{children:"Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale."}),"\n",(0,r.jsx)(n.p,{children:"The platform implements a complete LLMOps lifecycle \u2014 from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required."}),"\n",(0,r.jsx)(n.p,{children:"In addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques \u2014 such as quantization strategies, batching configurations, and runtime-specific performance enhancements \u2014 enabling teams to balance latency, throughput, and cost based on their use case. 
The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference."}),"\n",(0,r.jsx)(n.h2,{id:"why-llm-inference-is-not-just-bigger-ml-model-serving",children:"Why LLM Inference Is not just bigger ML model serving"}),"\n",(0,r.jsx)(n.p,{children:"Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled."}),"\n",(0,r.jsx)(n.h3,{id:"autoregressive-generation-and-sequential-computation",children:"Autoregressive Generation and Sequential Computation:"}),"\n",(0,r.jsx)(n.p,{children:"Unlike traditional models such as classifiers or recommenders \u2014 where inference cost is relatively constant \u2014 LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation.\nBecause tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution."}),"\n",(0,r.jsx)(n.h3,{id:"prefill-and-decode-phases",children:"Prefill and Decode Phases:"}),"\n",(0,r.jsx)(n.p,{children:"LLM inference typically consists of two distinct stages:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Prefill phase \u2014 the model processes the input prompt and builds internal representations. 
This stage is compute-heavy and highly parallelizable."}),"\n",(0,r.jsx)(n.li,{children:"Decode phase \u2014 the model generates tokens sequentially, predicting one token at a time using previously generated context."}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads."}),"\n",(0,r.jsx)(n.h3,{id:"context-management-and-kv-caching",children:"Context Management and KV Caching:"}),"\n",(0,r.jsx)(n.p,{children:"Another fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens.\nKV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Memory consumption grows with sequence length and batch size"}),"\n",(0,r.jsx)(n.li,{children:"GPU memory becomes a critical bottleneck"}),"\n",(0,r.jsx)(n.li,{children:"Efficient memory management becomes essential for scaling concurrent requests"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads."}),"\n",(0,r.jsx)(n.h3,{id:"dynamic-and-irregular-workloads",children:"Dynamic and Irregular Workloads:"}),"\n",(0,r.jsx)(n.p,{children:"Traditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. 
As a result:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Batch sizes must be dynamic rather than static"}),"\n",(0,r.jsx)(n.li,{children:"Requests may enter and leave batches asynchronously"}),"\n",(0,r.jsx)(n.li,{children:"Scheduling systems must continuously rebalance workloads to maximize GPU utilization"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"These characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines."}),"\n",(0,r.jsx)(n.h3,{id:"streaming-and-user-experience-constraints",children:"Streaming and User Experience Constraints:"}),"\n",(0,r.jsx)(n.p,{children:"Another distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated.\nBecause of these differences \u2014 sequential generation, growing memory requirements, dynamic workloads, and streaming constraints \u2014 LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads."}),"\n",(0,r.jsx)(n.h2,{id:"llmops-high-level-architecture",children:"LLMOps: High-Level Architecture"}),"\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"LLM Architecture",src:i(3874).A+"",width:"1302",height:"830"})}),"\n",(0,r.jsx)(n.p,{children:"The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. 
The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention."}),"\n",(0,r.jsx)(n.p,{children:"Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability."}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Onboarding & Registration (The Source of Truth)"}),"\n",(0,r.jsx)(n.p,{children:"The lifecycle begins with the Data Scientist or engineer."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Model Ingestion: Users onboard models\u2014whether open-source (Hugging Face, NeMo) or internally fine-tuned\u2014via the Truffle Box SDK/UI."}),"\n",(0,r.jsx)(n.li,{children:'LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. 
This allows Data Scientists to register and version-control prompts (e.g., "customer_support_v2") independently of the application code.'}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:'The "Black Box" Build Engine'}),"\n",(0,r.jsx)(n.p,{children:"Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Transformation: The raw model is converted into a TRT-LLM Checkpoint."}),"\n",(0,r.jsx)(n.li,{children:"Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint."}),"\n",(0,r.jsx)(n.li,{children:"Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Intelligent Profiling & Validation"}),"\n",(0,r.jsx)(n.p,{children:"Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. 
vLLM)."}),"\n",(0,r.jsx)(n.li,{children:"Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Smart Artifact Generation & Distribution"}),"\n",(0,r.jsx)(n.p,{children:'To solve the Kubernetes "Cold Start" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:'}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup."}),"\n",(0,r.jsx)(n.li,{children:"Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Image Streaming & Deployment"}),"\n",(0,r.jsx)(n.p,{children:"Simultaneously, the inference runtime container images are pulled from the Artifact Registry."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Image Streaming: We utilize container image streaming to allow pods to start initializing while the massive Triton/Dynamo container layers are still downloading, further shaving seconds off the startup time. 
"}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"The Inference Runtime (Kubernetes)"}),"\n",(0,r.jsx)(n.p,{children:"The workload lands on Kubernetes with Autoscaling."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference."}),"\n",(0,r.jsx)(n.li,{children:'Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk ("Pull from Disk").'}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Client Interaction & Observability"}),"\n",(0,r.jsx)(n.p,{children:"Finally, the LLM Inference Client executes the request."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used."}),"\n",(0,r.jsx)(n.li,{children:"Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Observability: Monitoring the Pulse of GenAI"}),"\n",(0,r.jsx)(n.p,{children:"In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. 
A user doesn't care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows."}),"\n",(0,r.jsx)(n.p,{children:"To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:"}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Time to First Token (TTFT)"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user."}),"\n",(0,r.jsx)(n.li,{children:'Why it matters: This represents the "Prefill Phase" latency\u2014the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or "hung."'}),"\n",(0,r.jsx)(n.li,{children:"Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hit rates), which drastically lowers this metric by skipping redundant prompt processing."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Inter-Token Latency (ITL)"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:'Definition: ITL measures the average time interval between the generation of consecutive tokens during the "Decode Phase".'}),"\n",(0,r.jsx)(n.li,{children:'Why it matters: This defines the "perceived speed" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look "jerky" or slow to the user.'}),"\n",(0,r.jsx)(n.li,{children:"Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speeds to maintain a natural conversational flow."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Token Throughput vs. 
Request Throughput"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"We distinguish between two types of throughput to balance system efficiency with user load:"}),"\n",(0,r.jsx)(n.li,{children:"Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching."}),"\n",(0,r.jsx)(n.li,{children:"Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before the queue depth impacts ITL."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"The Monitoring Stack"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:'Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot "slow generation" incidents that generic "500 error" alerts would miss.'}),"\n",(0,r.jsx)(n.li,{children:'Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific "slow" request back to its prompt to understand if a complex input caused the latency spike.'}),"\n"]}),"\n"]}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"supported-inference-backends-tensorrt-llm--dynamo--vllm",children:"Supported Inference backends (TensorRT LLM, Dynamo & vLLM)"}),"\n",(0,r.jsx)(n.p,{children:'Tailored for the Use Case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases\u2014whether a real-time voice bot requiring ultra-low sub-second latency or a massive reasoning task requiring huge context windows\u2014demand different runtime characteristics. 
Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:'}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"TensorRT-LLM: The High-Performance Standard"}),"\n",(0,r.jsx)(n.p,{children:"Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots)."}),"\n",(0,r.jsx)(n.p,{children:"TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization."}),"\n",(0,r.jsx)(n.p,{children:"Key optimizations we tailor for these high-load cases include:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Optimized execution via TensorRT engine compilation"}),"\n",(0,r.jsx)(n.li,{children:"Quantization-aware execution for reduced memory usage and improved throughput"}),"\n",(0,r.jsx)(n.li,{children:"Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization."}),"\n",(0,r.jsx)(n.li,{children:"Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms."}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"Dynamo: Distributed Inference for Reasoning Models"}),"\n",(0,r.jsx)(n.p,{children:'Suitable for: Very large "reasoning" models (70B+) or scenarios requiring massive context windows where a single GPU\'s memory is insufficient.'}),"\n",(0,r.jsx)(n.p,{children:"For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework. 
Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"KV Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation."}),"\n",(0,r.jsx)(n.li,{children:'Prefill vs. Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy "reading" phase independently from the memory-heavy "writing" phase.'}),"\n",(0,r.jsx)(n.li,{children:"Distributed execution across multiple GPU resources"}),"\n"]}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:"vLLM: The Flexible Baseline"}),"\n",(0,r.jsx)(n.p,{children:"Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput."}),"\n",(0,r.jsx)(n.p,{children:"While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"High throughput through dynamic batching and efficient memory utilization"}),"\n",(0,r.jsx)(n.li,{children:"Paged KV cache management for handling long contexts and concurrent requests"}),"\n",(0,r.jsx)(n.li,{children:"Strong support for open-source model ecosystems"}),"\n",(0,r.jsx)(n.li,{children:"Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build."}),"\n",(0,r.jsx)(n.li,{children:"Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. 
We use it strategically for initial testing before committing to a full TensorRT optimization pipeline."}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"conclusion",children:"Conclusion"}),"\n",(0,r.jsx)(n.p,{children:"Large language model inference introduces a fundamentally new class of infrastructure challenges\u2014where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads."}),"\n",(0,r.jsx)(n.p,{children:"The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle\u2014from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity."}),"\n",(0,r.jsx)(n.p,{children:"Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows."}),"\n",(0,r.jsx)(n.p,{children:"Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment\u2014allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. 
By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences."}),"\n",(0,r.jsx)(n.h2,{id:"future-explorations",children:"Future Explorations"}),"\n",(0,r.jsx)(n.p,{children:"While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs and plan to bake them into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics."}),"\n",(0,r.jsx)(n.li,{children:'Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a "serverless" experience where specific fine-tunes are hot-swapped instantly per request.'}),"\n",(0,r.jsx)(n.li,{children:"Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances.
By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user's streaming experience."}),"\n",(0,r.jsx)(n.li,{children:'Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., "How do I reset my password?" vs. "Password reset steps"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.'}),"\n",(0,r.jsx)(n.li,{children:"Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before the memory fills up, preventing eviction-based slowdowns during traffic spikes."}),"\n",(0,r.jsx)(n.li,{children:'Online Evaluation & Guardrails: We are integrating a lightweight "Trust Layer" into the proxy. 
This will allow for low-latency input/output filtering (Guardrails) and asynchronous "LLM-as-a-Judge" evaluation pipelines to monitor response quality in production, not just system health.'}),"\n"]})]})}function h(e={}){const{wrapper:n}={...(0,s.R)(),...e.components};return n?(0,r.jsx)(n,{...e,children:(0,r.jsx)(d,{...e})}):d(e)}},3874:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/llm-plat-9ac69c0ffd8c387d177e582611b8c775.png"},7613:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"},8453:(e,n,i)=>{i.d(n,{R:()=>a,x:()=>o});var t=i(6540);const r={},s=t.createContext(r);function a(e){const n=t.useContext(s);return t.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(r):e.components||r:a(e.components),t.createElement(s.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/8ea48c46.119645ab.js b/docs/assets/js/8ea48c46.119645ab.js new file mode 100644 index 00000000..93031643 --- /dev/null +++ b/docs/assets/js/8ea48c46.119645ab.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[9824],{7956:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>a,contentTitle:()=>l,default:()=>h,frontMatter:()=>o,metadata:()=>s,toc:()=>c});const s=JSON.parse('{"id":"numerix/v1.0.0/release-notes","title":"Release Notes","description":"Version 1.0.0 \ud83d\ude80","source":"@site/docs/numerix/v1.0.0/release-notes.md","sourceDirName":"numerix/v1.0.0","slug":"/numerix/v1.0.0/release-notes","permalink":"/BharatMLStack/numerix/v1.0.0/release-notes","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/numerix/v1.0.0/release-notes.md","tags":[],"version":"current","sidebarPosition":5,"frontMatter":{"title":"Release Notes","sidebar_position":5},"sidebar":"tutorialSidebar","previous":{"title":"Key 
Functionalities","permalink":"/BharatMLStack/numerix/v1.0.0/functionalities"},"next":{"title":"Predator","permalink":"/BharatMLStack/category/predator"}}');var r=i(4848),t=i(8453);const o={title:"Release Notes",sidebar_position:5},l="Numerix - Release Notes",a={},c=[{value:"Version 1.0.0 \ud83d\ude80",id:"version-100-",level:2},{value:"\ud83c\udfaf What's New",id:"-whats-new",level:2},{value:"Core Engine",id:"core-engine",level:3},{value:"API Surface",id:"api-surface",level:3},{value:"Observability",id:"observability",level:3},{value:"\ud83d\ude80 Performance & Optimization",id:"-performance--optimization",level:2},{value:"\ud83d\udee0\ufe0f APIs",id:"\ufe0f-apis",level:2},{value:"gRPC",id:"grpc",level:3},{value:"\ud83c\udfd7\ufe0f Deployment & Configuration",id:"\ufe0f-deployment--configuration",level:2},{value:"Environment",id:"environment",level:3},{value:"Containers",id:"containers",level:3},{value:"\ud83d\udd04 Compatibility",id:"-compatibility",level:2},{value:"\ud83d\udc1b Known Issues",id:"-known-issues",level:2},{value:"Contributing",id:"contributing",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function d(e){const n={a:"a",br:"br",code:"code",h1:"h1",h2:"h2",h3:"h3",header:"header",hr:"hr",li:"li",ol:"ol",p:"p",pre:"pre",strong:"strong",ul:"ul",...(0,t.R)(),...e.components};return(0,r.jsxs)(r.Fragment,{children:[(0,r.jsx)(n.header,{children:(0,r.jsx)(n.h1,{id:"numerix---release-notes",children:"Numerix - Release Notes"})}),"\n",(0,r.jsx)(n.h2,{id:"version-100-",children:"Version 1.0.0 \ud83d\ude80"}),"\n",(0,r.jsxs)(n.p,{children:[(0,r.jsx)(n.strong,{children:"Release Date"}),": September 2025",(0,r.jsx)(n.br,{}),"\n",(0,r.jsx)(n.strong,{children:"Status"}),": General Availability (GA)"]}),"\n",(0,r.jsxs)(n.p,{children:["The first stable release of ",(0,r.jsx)(n.strong,{children:"Numerix"})," \u2014 a Rust-based compute service for evaluating mathematical expressions over feature matrices 
with very low latency. Numerix executes postfix expressions from an etcd-backed registry using a stack-based evaluator and compiler-assisted SIMD."]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"-whats-new",children:"\ud83c\udfaf What's New"}),"\n",(0,r.jsx)(n.h3,{id:"core-engine",children:"Core Engine"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Postfix Expression Execution"}),": ",(0,r.jsx)(n.code,{children:"compute_id \u2192 postfix"})," mapping in etcd; parser-free request path."]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Stack-Based Evaluator"}),": Linear-time execution over aligned vectors for predictable latency."]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Compiler-Assisted SIMD"}),": Relies on LLVM autovectorization (NEON/SVE on ARM; SSE/AVX on x86); no handwritten intrinsics."]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Typed Evaluation"}),": Internal conversion to ",(0,r.jsx)(n.code,{children:"fp32"}),"/",(0,r.jsx)(n.code,{children:"fp64"})," for consistent performance/precision."]}),"\n"]}),"\n",(0,r.jsx)(n.h3,{id:"api-surface",children:"API Surface"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"gRPC"}),": Single RPC \u2014 ",(0,r.jsx)(n.code,{children:"numerix.Numerix/Compute"}),"."]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Input Formats"}),": Strings for ease, bytes for performance; both map to vectorized math internally."]}),"\n"]}),"\n",(0,r.jsx)(n.h3,{id:"observability",children:"Observability"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Datadog/DogStatsD"})," metrics: Latency (P50/P95/P99), RPS, error rate."]}),"\n",(0,r.jsxs)(n.li,{children:["Minimal HTTP diagnostics: ",(0,r.jsx)(n.code,{children:"/health"})," (and optional 
",(0,r.jsx)(n.code,{children:"/metrics"}),")."]}),"\n"]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"-performance--optimization",children:"\ud83d\ude80 Performance & Optimization"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Autovectorized Loops"}),": Tight loops over contiguous memory enable the compiler to emit SIMD instructions automatically."]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"ARM Focus Option"}),": Excellent results with AArch64; builds can enable NEON/SVE/SVE2:"]}),"\n"]}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-bash",children:'RUSTFLAGS="-C target-feature=+neon,+sve,+sve2" \\\ncargo build --release --target aarch64-unknown-linux-gnu\n'})}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Deterministic Runtime"}),": No dynamic parsing in hot path; O(n) across tokens with vectorized inner ops."]}),"\n"]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"\ufe0f-apis",children:"\ud83d\udee0\ufe0f APIs"}),"\n",(0,r.jsx)(n.h3,{id:"grpc",children:"gRPC"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-protobuf",children:"service Numerix {\n rpc Compute(NumerixRequestProto) returns (NumerixResponseProto);\n}\n"})}),"\n",(0,r.jsx)(n.p,{children:"Example call:"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-bash",children:'grpcurl -plaintext \\\n -import-path ./numerix/src/protos/proto \\\n -proto numerix.proto \\\n -d \'{\n "entityScoreData": {\n "schema": ["feature1", "feature2"],\n "entityScores": [ { "stringData": { "values": ["1.0", "2.0"] } } ],\n "computeId": "1001",\n "dataType": "fp32"\n }\n }\' \\\n localhost:8080 numerix.Numerix/Compute\n'})}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"\ufe0f-deployment--configuration",children:"\ud83c\udfd7\ufe0f Deployment & 
Configuration"}),"\n",(0,r.jsx)(n.h3,{id:"environment",children:"Environment"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-bash",children:"APPLICATION_PORT=8083\nAPP_ENV=prd\nAPP_LOG_LEVEL=ERROR\nAPP_NAME=numerix\n\n# Performance\nCHANNEL_BUFFER_SIZE=10000\n\n# etcd\nETCD_SERVERS=127.0.0.1:2379\n\n# Metrics\nMETRIC_SAMPLING_RATE=1\nTELEGRAF_UDP_HOST=127.0.0.1\nTELEGRAF_UDP_PORT=8125\n"})}),"\n",(0,r.jsx)(n.h3,{id:"containers",children:"Containers"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:["Multi-arch images: ",(0,r.jsx)(n.code,{children:"linux/amd64"}),", ",(0,r.jsx)(n.code,{children:"linux/arm64"}),"."]}),"\n",(0,r.jsxs)(n.li,{children:["Build targets example: ",(0,r.jsx)(n.code,{children:"x86_64-unknown-linux-gnu"}),", ",(0,r.jsx)(n.code,{children:"aarch64-unknown-linux-gnu"}),"."]}),"\n"]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"-compatibility",children:"\ud83d\udd04 Compatibility"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Clients"}),": Any language with gRPC + generated stubs."]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Architectures"}),": amd64 and arm64; ARM builds can enable NEON/SVE/SVE2."]}),"\n"]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"-known-issues",children:"\ud83d\udc1b Known Issues"}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsx)(n.li,{children:"Introduce a configurable log sampling rate to reduce pod memory usage during computation errors."}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,r.jsxs)(n.p,{children:["We welcome contributions from the community! 
Please see our ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,r.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:["\ud83d\udcac ",(0,r.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,r.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udc1b ",(0,r.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udce7 ",(0,r.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,r.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,r.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 1.1"}),"."]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)("div",{align:"center",children:(0,r.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,r.jsx)("div",{align:"center",children:(0,r.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function h(e={}){const{wrapper:n}={...(0,t.R)(),...e.components};return n?(0,r.jsx)(n,{...e,children:(0,r.jsx)(d,{...e})}):d(e)}},8453:(e,n,i)=>{i.d(n,{R:()=>o,x:()=>l});var s=i(6540);const r={},t=s.createContext(r);function o(e){const n=s.useContext(t);return s.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function l(e){let n;return n=e.disableParentContext?"function"==typeof 
e.components?e.components(r):e.components||r:o(e.components),s.createElement(t.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/8ea48c46.e52cd527.js b/docs/assets/js/8ea48c46.e52cd527.js deleted file mode 100644 index 00e31ec3..00000000 --- a/docs/assets/js/8ea48c46.e52cd527.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[9824],{7956:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>a,contentTitle:()=>l,default:()=>h,frontMatter:()=>o,metadata:()=>s,toc:()=>c});const s=JSON.parse('{"id":"numerix/v1.0.0/release-notes","title":"Release Notes","description":"Version 1.0.0 \ud83d\ude80","source":"@site/docs/numerix/v1.0.0/release-notes.md","sourceDirName":"numerix/v1.0.0","slug":"/numerix/v1.0.0/release-notes","permalink":"/BharatMLStack/numerix/v1.0.0/release-notes","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/numerix/v1.0.0/release-notes.md","tags":[],"version":"current","sidebarPosition":5,"frontMatter":{"title":"Release Notes","sidebar_position":5},"sidebar":"tutorialSidebar","previous":{"title":"Key Functionalities","permalink":"/BharatMLStack/numerix/v1.0.0/functionalities"}}');var r=i(4848),t=i(8453);const o={title:"Release Notes",sidebar_position:5},l="Numerix - Release Notes",a={},c=[{value:"Version 1.0.0 \ud83d\ude80",id:"version-100-",level:2},{value:"\ud83c\udfaf What's New",id:"-whats-new",level:2},{value:"Core Engine",id:"core-engine",level:3},{value:"API Surface",id:"api-surface",level:3},{value:"Observability",id:"observability",level:3},{value:"\ud83d\ude80 Performance & Optimization",id:"-performance--optimization",level:2},{value:"\ud83d\udee0\ufe0f APIs",id:"\ufe0f-apis",level:2},{value:"gRPC",id:"grpc",level:3},{value:"\ud83c\udfd7\ufe0f Deployment & 
Configuration",id:"\ufe0f-deployment--configuration",level:2},{value:"Environment",id:"environment",level:3},{value:"Containers",id:"containers",level:3},{value:"\ud83d\udd04 Compatibility",id:"-compatibility",level:2},{value:"\ud83d\udc1b Known Issues",id:"-known-issues",level:2},{value:"Contributing",id:"contributing",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function d(e){const n={a:"a",br:"br",code:"code",h1:"h1",h2:"h2",h3:"h3",header:"header",hr:"hr",li:"li",ol:"ol",p:"p",pre:"pre",strong:"strong",ul:"ul",...(0,t.R)(),...e.components};return(0,r.jsxs)(r.Fragment,{children:[(0,r.jsx)(n.header,{children:(0,r.jsx)(n.h1,{id:"numerix---release-notes",children:"Numerix - Release Notes"})}),"\n",(0,r.jsx)(n.h2,{id:"version-100-",children:"Version 1.0.0 \ud83d\ude80"}),"\n",(0,r.jsxs)(n.p,{children:[(0,r.jsx)(n.strong,{children:"Release Date"}),": September 2025",(0,r.jsx)(n.br,{}),"\n",(0,r.jsx)(n.strong,{children:"Status"}),": General Availability (GA)"]}),"\n",(0,r.jsxs)(n.p,{children:["The first stable release of ",(0,r.jsx)(n.strong,{children:"Numerix"})," \u2014 a Rust-based compute service for evaluating mathematical expressions over feature matrices with very low latency. 
Numerix executes postfix expressions from an etcd-backed registry using a stack-based evaluator and compiler-assisted SIMD."]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"-whats-new",children:"\ud83c\udfaf What's New"}),"\n",(0,r.jsx)(n.h3,{id:"core-engine",children:"Core Engine"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Postfix Expression Execution"}),": ",(0,r.jsx)(n.code,{children:"compute_id \u2192 postfix"})," mapping in etcd; parser-free request path."]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Stack-Based Evaluator"}),": Linear-time execution over aligned vectors for predictable latency."]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Compiler-Assisted SIMD"}),": Relies on LLVM autovectorization (NEON/SVE on ARM; SSE/AVX on x86); no handwritten intrinsics."]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Typed Evaluation"}),": Internal conversion to ",(0,r.jsx)(n.code,{children:"fp32"}),"/",(0,r.jsx)(n.code,{children:"fp64"})," for consistent performance/precision."]}),"\n"]}),"\n",(0,r.jsx)(n.h3,{id:"api-surface",children:"API Surface"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"gRPC"}),": Single RPC \u2014 ",(0,r.jsx)(n.code,{children:"numerix.Numerix/Compute"}),"."]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Input Formats"}),": Strings for ease, bytes for performance; both map to vectorized math internally."]}),"\n"]}),"\n",(0,r.jsx)(n.h3,{id:"observability",children:"Observability"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Datadog/DogStatsD"})," metrics: Latency (P50/P95/P99), RPS, error rate."]}),"\n",(0,r.jsxs)(n.li,{children:["Minimal HTTP diagnostics: ",(0,r.jsx)(n.code,{children:"/health"})," (and optional 
",(0,r.jsx)(n.code,{children:"/metrics"}),")."]}),"\n"]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"-performance--optimization",children:"\ud83d\ude80 Performance & Optimization"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Autovectorized Loops"}),": Tight loops over contiguous memory enable the compiler to emit SIMD instructions automatically."]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"ARM Focus Option"}),": Excellent results with AArch64; builds can enable NEON/SVE/SVE2:"]}),"\n"]}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-bash",children:'RUSTFLAGS="-C target-feature=+neon,+sve,+sve2" \\\ncargo build --release --target aarch64-unknown-linux-gnu\n'})}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Deterministic Runtime"}),": No dynamic parsing in hot path; O(n) across tokens with vectorized inner ops."]}),"\n"]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"\ufe0f-apis",children:"\ud83d\udee0\ufe0f APIs"}),"\n",(0,r.jsx)(n.h3,{id:"grpc",children:"gRPC"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-protobuf",children:"service Numerix {\n rpc Compute(NumerixRequestProto) returns (NumerixResponseProto);\n}\n"})}),"\n",(0,r.jsx)(n.p,{children:"Example call:"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-bash",children:'grpcurl -plaintext \\\n -import-path ./numerix/src/protos/proto \\\n -proto numerix.proto \\\n -d \'{\n "entityScoreData": {\n "schema": ["feature1", "feature2"],\n "entityScores": [ { "stringData": { "values": ["1.0", "2.0"] } } ],\n "computeId": "1001",\n "dataType": "fp32"\n }\n }\' \\\n localhost:8080 numerix.Numerix/Compute\n'})}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"\ufe0f-deployment--configuration",children:"\ud83c\udfd7\ufe0f Deployment & 
Configuration"}),"\n",(0,r.jsx)(n.h3,{id:"environment",children:"Environment"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-bash",children:"APPLICATION_PORT=8083\nAPP_ENV=prd\nAPP_LOG_LEVEL=ERROR\nAPP_NAME=numerix\n\n# Performance\nCHANNEL_BUFFER_SIZE=10000\n\n# etcd\nETCD_SERVERS=127.0.0.1:2379\n\n# Metrics\nMETRIC_SAMPLING_RATE=1\nTELEGRAF_UDP_HOST=127.0.0.1\nTELEGRAF_UDP_PORT=8125\n"})}),"\n",(0,r.jsx)(n.h3,{id:"containers",children:"Containers"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:["Multi-arch images: ",(0,r.jsx)(n.code,{children:"linux/amd64"}),", ",(0,r.jsx)(n.code,{children:"linux/arm64"}),"."]}),"\n",(0,r.jsxs)(n.li,{children:["Build targets example: ",(0,r.jsx)(n.code,{children:"x86_64-unknown-linux-gnu"}),", ",(0,r.jsx)(n.code,{children:"aarch64-unknown-linux-gnu"}),"."]}),"\n"]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"-compatibility",children:"\ud83d\udd04 Compatibility"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Clients"}),": Any language with gRPC + generated stubs."]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Architectures"}),": amd64 and arm64; ARM builds can enable NEON/SVE/SVE2."]}),"\n"]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"-known-issues",children:"\ud83d\udc1b Known Issues"}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsx)(n.li,{children:"Introduce a configurable log sampling rate to reduce pod memory usage during computation errors."}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,r.jsxs)(n.p,{children:["We welcome contributions from the community! 
Please see our ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,r.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:["\ud83d\udcac ",(0,r.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,r.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udc1b ",(0,r.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udce7 ",(0,r.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,r.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,r.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 1.1"}),"."]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)("div",{align:"center",children:(0,r.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,r.jsx)("div",{align:"center",children:(0,r.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function h(e={}){const{wrapper:n}={...(0,t.R)(),...e.components};return n?(0,r.jsx)(n,{...e,children:(0,r.jsx)(d,{...e})}):d(e)}},8453:(e,n,i)=>{i.d(n,{R:()=>o,x:()=>l});var s=i(6540);const r={},t=s.createContext(r);function o(e){const n=s.useContext(t);return s.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function l(e){let n;return n=e.disableParentContext?"function"==typeof 
e.components?e.components(r):e.components||r:o(e.components),s.createElement(t.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/9796f4b8.34772e40.js b/docs/assets/js/9796f4b8.34772e40.js new file mode 100644 index 00000000..641510bc --- /dev/null +++ b/docs/assets/js/9796f4b8.34772e40.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[5430],{8453:(e,i,n)=>{n.d(i,{R:()=>a,x:()=>l});var t=n(6540);const s={},r=t.createContext(s);function a(e){const i=t.useContext(r);return t.useMemo(function(){return"function"==typeof e?e(i):{...i,...e}},[i,e])}function l(e){let i;return i=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:a(e.components),t.createElement(r.Provider,{value:i},e.children)}},9352:(e,i,n)=>{n.r(i),n.d(i,{assets:()=>o,contentTitle:()=>l,default:()=>h,frontMatter:()=>a,metadata:()=>t,toc:()=>d});const t=JSON.parse('{"id":"skye/v1.0.0/functionalities","title":"Functionalities","description":"Core Capabilities","source":"@site/docs/skye/v1.0.0/functionalities.md","sourceDirName":"skye/v1.0.0","slug":"/skye/v1.0.0/functionalities","permalink":"/BharatMLStack/skye/v1.0.0/functionalities","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/skye/v1.0.0/functionalities.md","tags":[],"version":"current","sidebarPosition":2,"frontMatter":{"title":"Functionalities","sidebar_position":2},"sidebar":"tutorialSidebar","previous":{"title":"Architecture","permalink":"/BharatMLStack/skye/v1.0.0/architecture"},"next":{"title":"Release Notes","permalink":"/BharatMLStack/skye/v1.0.0/release-notes"}}');var s=n(4848),r=n(8453);const a={title:"Functionalities",sidebar_position:2},l="Skye - Functionalities",o={},d=[{value:"Core Capabilities",id:"core-capabilities",level:2},{value:"1. Vector Similarity Search",id:"1-vector-similarity-search",level:3},{value:"2. 
Pluggable Vector Database Support",id:"2-pluggable-vector-database-support",level:3},{value:"3. Model and Variant Management",id:"3-model-and-variant-management",level:3},{value:"Model Registration",id:"model-registration",level:4},{value:"Variant Registration",id:"variant-registration",level:4},{value:"Model Promotion",id:"model-promotion",level:4},{value:"4. Embedding Ingestion",id:"4-embedding-ingestion",level:3},{value:"Batch Ingestion (Reset/Delta Jobs)",id:"batch-ingestion-resetdelta-jobs",level:4},{value:"Real-Time Ingestion",id:"real-time-ingestion",level:4},{value:"5. Real-Time Data Aggregation",id:"5-real-time-data-aggregation",level:3},{value:"6. Intelligent Caching",id:"6-intelligent-caching",level:3},{value:"7. Embedded Storage",id:"7-embedded-storage",level:3},{value:"8. Retry and Fault Tolerance",id:"8-retry-and-fault-tolerance",level:3},{value:"9. Experiment Isolation",id:"9-experiment-isolation",level:3},{value:"10. Centralized Cluster Management",id:"10-centralized-cluster-management",level:3},{value:"Onboarding Flow",id:"onboarding-flow",level:2},{value:"Step-by-step Process",id:"step-by-step-process",level:3},{value:"Extending to New Tenants",id:"extending-to-new-tenants",level:3}];function c(e){const i={code:"code",h1:"h1",h2:"h2",h3:"h3",h4:"h4",header:"header",hr:"hr",li:"li",ol:"ol",p:"p",strong:"strong",ul:"ul",...(0,r.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(i.header,{children:(0,s.jsx)(i.h1,{id:"skye---functionalities",children:"Skye - Functionalities"})}),"\n",(0,s.jsx)(i.h2,{id:"core-capabilities",children:"Core Capabilities"}),"\n",(0,s.jsx)(i.h3,{id:"1-vector-similarity-search",children:"1. Vector Similarity Search"}),"\n",(0,s.jsx)(i.p,{children:"Skye provides real-time nearest-neighbor search across high-dimensional vector spaces. 
It supports:"}),"\n",(0,s.jsxs)(i.ul,{children:["\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Configurable distance functions"}),": DOT product, Cosine similarity, Euclidean distance"]}),"\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Configurable vector dimensions"}),": Per-model vector dimension settings"]}),"\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Indexed-only search"}),": Queries only search within fully indexed space, avoiding brute-force fallback on partially built indexes"]}),"\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Pagination support"}),": Service-level pagination for clients, even when the underlying vector DB does not natively support it"]}),"\n"]}),"\n",(0,s.jsx)(i.h3,{id:"2-pluggable-vector-database-support",children:"2. Pluggable Vector Database Support"}),"\n",(0,s.jsx)(i.p,{children:"The platform is designed to be vector DB agnostic:"}),"\n",(0,s.jsxs)(i.ul,{children:["\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Generic vector config"}),": A ",(0,s.jsx)(i.code,{children:"vector_db_type"})," field and generic ",(0,s.jsx)(i.code,{children:"vectordb_config"})," replace vendor-specific configurations"]}),"\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Current support"}),": Qdrant with official Go client"]}),"\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Extensibility"}),": New vector databases can be integrated by implementing the vector DB interface"]}),"\n"]}),"\n",(0,s.jsx)(i.h3,{id:"3-model-and-variant-management",children:"3. 
Model and Variant Management"}),"\n",(0,s.jsx)(i.h4,{id:"model-registration",children:"Model Registration"}),"\n",(0,s.jsxs)(i.ul,{children:["\n",(0,s.jsx)(i.li,{children:"Models are registered via API with entity type, embedding configuration, distance function, vector dimension, and training data path"}),"\n",(0,s.jsx)(i.li,{children:"Each model is associated with a store ID mapping to specific embedding and aggregator tables"}),"\n"]}),"\n",(0,s.jsx)(i.h4,{id:"variant-registration",children:"Variant Registration"}),"\n",(0,s.jsxs)(i.ul,{children:["\n",(0,s.jsx)(i.li,{children:"Variants represent different views/filters of the same model (e.g., organic, ad, commerce)"}),"\n",(0,s.jsx)(i.li,{children:"Each variant has its own filter criteria, vector DB cluster, job frequency, and version tracking"}),"\n",(0,s.jsx)(i.li,{children:"Variants share the same embeddings, eliminating data redundancy"}),"\n"]}),"\n",(0,s.jsx)(i.h4,{id:"model-promotion",children:"Model Promotion"}),"\n",(0,s.jsxs)(i.ul,{children:["\n",(0,s.jsx)(i.li,{children:"Successful experiments can be promoted from experiment clusters to production clusters via API"}),"\n"]}),"\n",(0,s.jsx)(i.h3,{id:"4-embedding-ingestion",children:"4. 
Embedding Ingestion"}),"\n",(0,s.jsx)(i.h4,{id:"batch-ingestion-resetdelta-jobs",children:"Batch Ingestion (Reset/Delta Jobs)"}),"\n",(0,s.jsxs)(i.ul,{children:["\n",(0,s.jsx)(i.li,{children:"Triggered via Databricks jobs that read from GCS paths"}),"\n",(0,s.jsx)(i.li,{children:"Supports separate index-space and search-space embeddings"}),"\n",(0,s.jsxs)(i.li,{children:["Per-variant ",(0,s.jsx)(i.code,{children:"to_be_indexed"})," flags control which embeddings are indexed for each variant"]}),"\n",(0,s.jsx)(i.li,{children:"EOF markers sent to all Kafka partitions ensure complete data consumption"}),"\n"]}),"\n",(0,s.jsx)(i.h4,{id:"real-time-ingestion",children:"Real-Time Ingestion"}),"\n",(0,s.jsxs)(i.ul,{children:["\n",(0,s.jsx)(i.li,{children:"Generic Kafka schema for all real-time consumers"}),"\n",(0,s.jsx)(i.li,{children:"Entity-based aggregation data (e.g., is_live_ad, out_of_stock) updates in real time"}),"\n",(0,s.jsx)(i.li,{children:"During model resets, real-time consumers continue pushing data to the latest collection (no pausing)"}),"\n"]}),"\n",(0,s.jsx)(i.h3,{id:"5-real-time-data-aggregation",children:"5. Real-Time Data Aggregation"}),"\n",(0,s.jsxs)(i.ul,{children:["\n",(0,s.jsx)(i.li,{children:"Entity-wise (catalog, product, user) real-time aggregation via ScyllaDB"}),"\n",(0,s.jsx)(i.li,{children:"Generic approach: aggregator tables are entity-level, not model/version-specific"}),"\n",(0,s.jsx)(i.li,{children:"All real-time data is consistent across models sharing the same entity"}),"\n"]}),"\n",(0,s.jsx)(i.h3,{id:"6-intelligent-caching",children:"6. 
Intelligent Caching"}),"\n",(0,s.jsxs)(i.ul,{children:["\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"In-memory cache"}),": First layer, reduces load on distributed cache"]}),"\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Distributed cache (Redis)"}),": Second layer for cached similarity results"]}),"\n",(0,s.jsx)(i.li,{children:"Hit rate monitoring and cache effectiveness metrics per model"}),"\n"]}),"\n",(0,s.jsx)(i.h3,{id:"7-embedded-storage",children:"7. Embedded Storage"}),"\n",(0,s.jsxs)(i.ul,{children:["\n",(0,s.jsx)(i.li,{children:"Optional embedding storage with configurable TTL"}),"\n",(0,s.jsx)(i.li,{children:"Enables embedding lookup APIs for downstream consumers"}),"\n",(0,s.jsx)(i.li,{children:"Stored in ScyllaDB with efficient binary serialization"}),"\n"]}),"\n",(0,s.jsx)(i.h3,{id:"8-retry-and-fault-tolerance",children:"8. Retry and Fault Tolerance"}),"\n",(0,s.jsxs)(i.ul,{children:["\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Retry topic"}),": Failed ingestion events are published to a dedicated retry topic"]}),"\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Event-driven state management"}),": Model states persist in SQL DB, surviving pod restarts"]}),"\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Kafka-based admin"}),": Asynchronous processing with automatic re-consumption on failure"]}),"\n"]}),"\n",(0,s.jsx)(i.h3,{id:"9-experiment-isolation",children:"9. 
Experiment Isolation"}),"\n",(0,s.jsxs)(i.ul,{children:["\n",(0,s.jsxs)(i.li,{children:["Dedicated EKS cluster (",(0,s.jsx)(i.code,{children:"skye-service-experiments"}),") for experiments"]}),"\n",(0,s.jsx)(i.li,{children:"Dedicated vector DB cluster for experiment workloads"}),"\n",(0,s.jsx)(i.li,{children:"Clean separation from production: experiments do not impact production performance"}),"\n",(0,s.jsx)(i.li,{children:"Promotion path from experiment to production after load analysis"}),"\n"]}),"\n",(0,s.jsx)(i.h3,{id:"10-centralized-cluster-management",children:"10. Centralized Cluster Management"}),"\n",(0,s.jsxs)(i.ul,{children:["\n",(0,s.jsx)(i.li,{children:"Automated cluster provisioning via scripts (collaboration with DevOps)"}),"\n",(0,s.jsx)(i.li,{children:"Consistent configurations across all clusters (eliminates consensus issues)"}),"\n",(0,s.jsx)(i.li,{children:"Horizontal scaling support: generic scripts for adding nodes to existing clusters"}),"\n"]}),"\n",(0,s.jsx)(i.hr,{}),"\n",(0,s.jsx)(i.h2,{id:"onboarding-flow",children:"Onboarding Flow"}),"\n",(0,s.jsx)(i.h3,{id:"step-by-step-process",children:"Step-by-step Process"}),"\n",(0,s.jsxs)(i.ol,{children:["\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Data Scientist"})," provides a base GCS path where model embeddings will be pushed"]}),"\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Register Model"})," via ",(0,s.jsx)(i.code,{children:"POST /register-model"})," with entity type, column mappings, model config"]}),"\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Register Variant(s)"})," via ",(0,s.jsx)(i.code,{children:"POST /register-variant"})," with filter criteria, vector DB config, job frequency"]}),"\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Schedule Databricks Job"})," to read data from GCS path and ingest into Skye platform"]}),"\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Reset Model"})," via 
",(0,s.jsx)(i.code,{children:"POST /reset-model"})," to trigger the first full ingestion"]}),"\n",(0,s.jsxs)(i.li,{children:[(0,s.jsx)(i.strong,{children:"Trigger Model Machine"})," via ",(0,s.jsx)(i.code,{children:"POST /trigger-model-machine"})," to start the indexing pipeline"]}),"\n"]}),"\n",(0,s.jsx)(i.h3,{id:"extending-to-new-tenants",children:"Extending to New Tenants"}),"\n",(0,s.jsx)(i.p,{children:"With the variant system, extending a model to a new tenant only requires registering a new variant with appropriate filters -- no re-ingestion of embeddings is needed."})]})}function h(e={}){const{wrapper:i}={...(0,r.R)(),...e.components};return i?(0,s.jsx)(i,{...e,children:(0,s.jsx)(c,{...e})}):c(e)}}}]); \ No newline at end of file diff --git a/docs/assets/js/a6aa9e1f.7671586a.js b/docs/assets/js/a6aa9e1f.7671586a.js new file mode 100644 index 00000000..9f3adf31 --- /dev/null +++ b/docs/assets/js/a6aa9e1f.7671586a.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[7643],{2907:(e,t,a)=>{a.d(t,{A:()=>U});a(6540);var n=a(4164),s=a(4096),r=a(4848);function i({children:e,className:t}){return(0,r.jsx)("article",{className:t,children:e})}var l=a(8774);const o={title:"title_f1Hy"};function c({className:e}){const{metadata:t,isBlogPostPage:a}=(0,s.e7)(),{permalink:i,title:c}=t,d=a?"h1":"h2";return(0,r.jsx)(d,{className:(0,n.A)(o.title,e),children:a?c:(0,r.jsx)(l.A,{to:i,children:c})})}var d=a(1312),g=a(5846),m=a(6266);const u={container:"container_mt6G"};function h({readingTime:e}){const t=function(){const{selectMessage:e}=(0,g.W)();return t=>{const a=Math.ceil(t);return e(a,(0,d.T)({id:"theme.blog.post.readingTime.plurals",description:'Pluralized label for "{readingTime} min read". 
Use as much plural forms (separated by "|") as your language support (see https://www.unicode.org/cldr/cldr-aux/charts/34/supplemental/language_plural_rules.html)',message:"One min read|{readingTime} min read"},{readingTime:a}))}}();return(0,r.jsx)(r.Fragment,{children:t(e)})}function p({date:e,formattedDate:t}){return(0,r.jsx)("time",{dateTime:e,children:t})}function x(){return(0,r.jsx)(r.Fragment,{children:" \xb7 "})}function j({className:e}){const{metadata:t}=(0,s.e7)(),{date:a,readingTime:i}=t,l=(0,m.i)({day:"numeric",month:"long",year:"numeric",timeZone:"UTC"});return(0,r.jsxs)("div",{className:(0,n.A)(u.container,"margin-vert--md",e),children:[(0,r.jsx)(p,{date:a,formattedDate:(o=a,l.format(new Date(o)))}),void 0!==i&&(0,r.jsxs)(r.Fragment,{children:[(0,r.jsx)(x,{}),(0,r.jsx)(h,{readingTime:i})]})]});var o}var A=a(6382);const b={authorCol:"authorCol_Hf19",imageOnlyAuthorRow:"imageOnlyAuthorRow_pa_O",imageOnlyAuthorCol:"imageOnlyAuthorCol_G86a"};function f({className:e}){const{metadata:{authors:t},assets:a}=(0,s.e7)();if(0===t.length)return null;const i=t.every(({name:e})=>!e),l=1===t.length;return(0,r.jsx)("div",{className:(0,n.A)("margin-top--md margin-bottom--sm",i?b.imageOnlyAuthorRow:"row",e),children:t.map((e,t)=>(0,r.jsx)("div",{className:(0,n.A)(!i&&(l?"col col--12":"col col--6"),i?b.imageOnlyAuthorCol:b.authorCol),children:(0,r.jsx)(A.A,{author:{...e,imageURL:a.authorsImageUrls[t]??e.imageURL}})},t))})}function v(){return(0,r.jsxs)("header",{children:[(0,r.jsx)(c,{}),(0,r.jsx)(j,{}),(0,r.jsx)(f,{})]})}var N=a(440),_=a(3253);function T({children:e,className:t}){const{isBlogPostPage:a}=(0,s.e7)();return(0,r.jsx)("div",{id:a?N.LU:void 0,className:(0,n.A)("markdown",t),children:(0,r.jsx)(_.A,{children:e})})}var k=a(7559),w=a(4336),y=a(4434);function P(){return(0,r.jsx)("b",{children:(0,r.jsx)(d.A,{id:"theme.blog.post.readMore",description:"The label used in blog post item excerpts to link to full blog posts",children:"Read more"})})}function 
R(e){const{blogPostTitle:t,...a}=e;return(0,r.jsx)(l.A,{"aria-label":(0,d.T)({message:"Read more about {title}",id:"theme.blog.post.readMoreLabel",description:"The ARIA label for the link to full blog posts from excerpts"},{title:t}),...a,children:(0,r.jsx)(P,{})})}function C(){const{metadata:e,isBlogPostPage:t}=(0,s.e7)(),{tags:a,title:i,editUrl:l,hasTruncateMarker:o,lastUpdatedBy:c,lastUpdatedAt:d}=e,g=!t&&o,m=a.length>0;if(!(m||g||l))return null;if(t){const e=!!(l||d||c);return(0,r.jsxs)("footer",{className:"docusaurus-mt-lg",children:[m&&(0,r.jsx)("div",{className:(0,n.A)("row","margin-top--sm",k.G.blog.blogFooterEditMetaRow),children:(0,r.jsx)("div",{className:"col",children:(0,r.jsx)(y.A,{tags:a})})}),e&&(0,r.jsx)(w.A,{className:(0,n.A)("margin-top--sm",k.G.blog.blogFooterEditMetaRow),editUrl:l,lastUpdatedAt:d,lastUpdatedBy:c})]})}return(0,r.jsxs)("footer",{className:"row docusaurus-mt-lg",children:[m&&(0,r.jsx)("div",{className:(0,n.A)("col",{"col--9":g}),children:(0,r.jsx)(y.A,{tags:a})}),g&&(0,r.jsx)("div",{className:(0,n.A)("col text--right",{"col--3":m}),children:(0,r.jsx)(R,{blogPostTitle:i,to:e.permalink})})]})}function U({children:e,className:t}){const a=function(){const{isBlogPostPage:e}=(0,s.e7)();return e?void 0:"margin-bottom--xl"}();return(0,r.jsxs)(i,{className:(0,n.A)(a,t),children:[(0,r.jsx)(v,{}),(0,r.jsx)(T,{children:e}),(0,r.jsx)(C,{})]})}},3892:(e,t,a)=>{a.d(t,{A:()=>i});a(6540);var n=a(4096),s=a(2907),r=a(4848);function i({items:e,component:t=s.A}){return(0,r.jsx)(r.Fragment,{children:e.map(({content:e})=>(0,r.jsx)(n.in,{content:e,children:(0,r.jsx)(t,{children:(0,r.jsx)(e,{})})},e.metadata.permalink))})}},4434:(e,t,a)=>{a.d(t,{A:()=>o});a(6540);var n=a(4164),s=a(1312),r=a(6133);const i={tags:"tags_jXut",tag:"tag_QGVx"};var l=a(4848);function o({tags:e}){return(0,l.jsxs)(l.Fragment,{children:[(0,l.jsx)("b",{children:(0,l.jsx)(s.A,{id:"theme.tags.tagsListLabel",description:"The label alongside a tag 
list",children:"Tags:"})}),(0,l.jsx)("ul",{className:(0,n.A)(i.tags,"padding--none","margin-left--sm"),children:e.map(e=>(0,l.jsx)("li",{className:i.tag,children:(0,l.jsx)(r.A,{...e})},e.permalink))})]})}},5124:(e,t,a)=>{a.r(t),a.d(t,{default:()=>j});a(6540);var n=a(4164),s=a(4586),r=a(5500),i=a(7559),l=a(8027),o=a(7713),c=a(1463),d=a(3892),g=a(5260),m=a(4096),u=a(4848);function h(e){const t=(0,m.kJ)(e);return(0,u.jsx)(g.A,{children:(0,u.jsx)("script",{type:"application/ld+json",children:JSON.stringify(t)})})}function p(e){const{metadata:t}=e,{siteConfig:{title:a}}=(0,s.A)(),{blogDescription:n,blogTitle:i,permalink:l}=t,o="/"===l?a:i;return(0,u.jsxs)(u.Fragment,{children:[(0,u.jsx)(r.be,{title:o,description:n}),(0,u.jsx)(c.A,{tag:"blog_posts_list"})]})}function x(e){const{metadata:t,items:a,sidebar:n}=e;return(0,u.jsxs)(l.A,{sidebar:n,children:[(0,u.jsx)(d.A,{items:a}),(0,u.jsx)(o.A,{metadata:t})]})}function j(e){return(0,u.jsxs)(r.e3,{className:(0,n.A)(i.G.wrapper.blogPages,i.G.page.blogListPage),children:[(0,u.jsx)(p,{...e}),(0,u.jsx)(h,{...e}),(0,u.jsx)(x,{...e})]})}},6133:(e,t,a)=>{a.d(t,{A:()=>l});a(6540);var n=a(4164),s=a(8774);const r={tag:"tag_zVej",tagRegular:"tagRegular_sFm0",tagWithCount:"tagWithCount_h2kH"};var i=a(4848);function l({permalink:e,label:t,count:a,description:l}){return(0,i.jsxs)(s.A,{rel:"tag",href:e,title:l,className:(0,n.A)(r.tag,a?r.tagWithCount:r.tagRegular),children:[t,a&&(0,i.jsx)("span",{children:a})]})}},7713:(e,t,a)=>{a.d(t,{A:()=>i});a(6540);var n=a(1312),s=a(9022),r=a(4848);function i(e){const{metadata:t}=e,{previousPage:a,nextPage:i}=t;return(0,r.jsxs)("nav",{className:"pagination-nav","aria-label":(0,n.T)({id:"theme.blog.paginator.navAriaLabel",message:"Blog list page navigation",description:"The ARIA label for the blog pagination"}),children:[a&&(0,r.jsx)(s.A,{permalink:a,title:(0,r.jsx)(n.A,{id:"theme.blog.paginator.newerEntries",description:"The label used to navigate to the newer blog posts page (previous 
page)",children:"Newer entries"})}),i&&(0,r.jsx)(s.A,{permalink:i,title:(0,r.jsx)(n.A,{id:"theme.blog.paginator.olderEntries",description:"The label used to navigate to the older blog posts page (next page)",children:"Older entries"}),isNext:!0})]})}},9022:(e,t,a)=>{a.d(t,{A:()=>i});a(6540);var n=a(4164),s=a(8774),r=a(4848);function i(e){const{permalink:t,title:a,subLabel:i,isNext:l}=e;return(0,r.jsxs)(s.A,{className:(0,n.A)("pagination-nav__link",l?"pagination-nav__link--next":"pagination-nav__link--prev"),to:t,children:[i&&(0,r.jsx)("div",{className:"pagination-nav__sublabel",children:i}),(0,r.jsx)("div",{className:"pagination-nav__label",children:a})]})}}}]); \ No newline at end of file diff --git a/docs/assets/js/a6aa9e1f.e531d6c8.js b/docs/assets/js/a6aa9e1f.e531d6c8.js deleted file mode 100644 index 3cc9c0b0..00000000 --- a/docs/assets/js/a6aa9e1f.e531d6c8.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[7643],{2053:(e,t,a)=>{a.d(t,{A:()=>o});a(6540);var n=a(4164),s=a(1312),r=a(6133);const i={tags:"tags_jXut",tag:"tag_QGVx"};var l=a(4848);function o({tags:e}){return(0,l.jsxs)(l.Fragment,{children:[(0,l.jsx)("b",{children:(0,l.jsx)(s.A,{id:"theme.tags.tagsListLabel",description:"The label alongside a tag list",children:"Tags:"})}),(0,l.jsx)("ul",{className:(0,n.A)(i.tags,"padding--none","margin-left--sm"),children:e.map(e=>(0,l.jsx)("li",{className:i.tag,children:(0,l.jsx)(r.A,{...e})},e.permalink))})]})}},2907:(e,t,a)=>{a.d(t,{A:()=>U});a(6540);var n=a(4164),s=a(4096),r=a(4848);function i({children:e,className:t}){return(0,r.jsx)("article",{className:t,children:e})}var l=a(8774);const o={title:"title_f1Hy"};function c({className:e}){const{metadata:t,isBlogPostPage:a}=(0,s.e7)(),{permalink:i,title:c}=t,d=a?"h1":"h2";return(0,r.jsx)(d,{className:(0,n.A)(o.title,e),children:a?c:(0,r.jsx)(l.A,{to:i,children:c})})}var d=a(1312),g=a(5846),m=a(6266);const u={container:"container_mt6G"};function 
h({readingTime:e}){const t=function(){const{selectMessage:e}=(0,g.W)();return t=>{const a=Math.ceil(t);return e(a,(0,d.T)({id:"theme.blog.post.readingTime.plurals",description:'Pluralized label for "{readingTime} min read". Use as much plural forms (separated by "|") as your language support (see https://www.unicode.org/cldr/cldr-aux/charts/34/supplemental/language_plural_rules.html)',message:"One min read|{readingTime} min read"},{readingTime:a}))}}();return(0,r.jsx)(r.Fragment,{children:t(e)})}function p({date:e,formattedDate:t}){return(0,r.jsx)("time",{dateTime:e,children:t})}function x(){return(0,r.jsx)(r.Fragment,{children:" \xb7 "})}function j({className:e}){const{metadata:t}=(0,s.e7)(),{date:a,readingTime:i}=t,l=(0,m.i)({day:"numeric",month:"long",year:"numeric",timeZone:"UTC"});return(0,r.jsxs)("div",{className:(0,n.A)(u.container,"margin-vert--md",e),children:[(0,r.jsx)(p,{date:a,formattedDate:(o=a,l.format(new Date(o)))}),void 0!==i&&(0,r.jsxs)(r.Fragment,{children:[(0,r.jsx)(x,{}),(0,r.jsx)(h,{readingTime:i})]})]});var o}var A=a(6382);const b={authorCol:"authorCol_Hf19",imageOnlyAuthorRow:"imageOnlyAuthorRow_pa_O",imageOnlyAuthorCol:"imageOnlyAuthorCol_G86a"};function f({className:e}){const{metadata:{authors:t},assets:a}=(0,s.e7)();if(0===t.length)return null;const i=t.every(({name:e})=>!e),l=1===t.length;return(0,r.jsx)("div",{className:(0,n.A)("margin-top--md margin-bottom--sm",i?b.imageOnlyAuthorRow:"row",e),children:t.map((e,t)=>(0,r.jsx)("div",{className:(0,n.A)(!i&&(l?"col col--12":"col col--6"),i?b.imageOnlyAuthorCol:b.authorCol),children:(0,r.jsx)(A.A,{author:{...e,imageURL:a.authorsImageUrls[t]??e.imageURL}})},t))})}function v(){return(0,r.jsxs)("header",{children:[(0,r.jsx)(c,{}),(0,r.jsx)(j,{}),(0,r.jsx)(f,{})]})}var N=a(440),_=a(3253);function T({children:e,className:t}){const{isBlogPostPage:a}=(0,s.e7)();return(0,r.jsx)("div",{id:a?N.LU:void 0,className:(0,n.A)("markdown",t),children:(0,r.jsx)(_.A,{children:e})})}var 
k=a(7559),w=a(4336),y=a(2053);function P(){return(0,r.jsx)("b",{children:(0,r.jsx)(d.A,{id:"theme.blog.post.readMore",description:"The label used in blog post item excerpts to link to full blog posts",children:"Read more"})})}function R(e){const{blogPostTitle:t,...a}=e;return(0,r.jsx)(l.A,{"aria-label":(0,d.T)({message:"Read more about {title}",id:"theme.blog.post.readMoreLabel",description:"The ARIA label for the link to full blog posts from excerpts"},{title:t}),...a,children:(0,r.jsx)(P,{})})}function C(){const{metadata:e,isBlogPostPage:t}=(0,s.e7)(),{tags:a,title:i,editUrl:l,hasTruncateMarker:o,lastUpdatedBy:c,lastUpdatedAt:d}=e,g=!t&&o,m=a.length>0;if(!(m||g||l))return null;if(t){const e=!!(l||d||c);return(0,r.jsxs)("footer",{className:"docusaurus-mt-lg",children:[m&&(0,r.jsx)("div",{className:(0,n.A)("row","margin-top--sm",k.G.blog.blogFooterEditMetaRow),children:(0,r.jsx)("div",{className:"col",children:(0,r.jsx)(y.A,{tags:a})})}),e&&(0,r.jsx)(w.A,{className:(0,n.A)("margin-top--sm",k.G.blog.blogFooterEditMetaRow),editUrl:l,lastUpdatedAt:d,lastUpdatedBy:c})]})}return(0,r.jsxs)("footer",{className:"row docusaurus-mt-lg",children:[m&&(0,r.jsx)("div",{className:(0,n.A)("col",{"col--9":g}),children:(0,r.jsx)(y.A,{tags:a})}),g&&(0,r.jsx)("div",{className:(0,n.A)("col text--right",{"col--3":m}),children:(0,r.jsx)(R,{blogPostTitle:i,to:e.permalink})})]})}function U({children:e,className:t}){const a=function(){const{isBlogPostPage:e}=(0,s.e7)();return e?void 0:"margin-bottom--xl"}();return(0,r.jsxs)(i,{className:(0,n.A)(a,t),children:[(0,r.jsx)(v,{}),(0,r.jsx)(T,{children:e}),(0,r.jsx)(C,{})]})}},3892:(e,t,a)=>{a.d(t,{A:()=>i});a(6540);var n=a(4096),s=a(2907),r=a(4848);function i({items:e,component:t=s.A}){return(0,r.jsx)(r.Fragment,{children:e.map(({content:e})=>(0,r.jsx)(n.in,{content:e,children:(0,r.jsx)(t,{children:(0,r.jsx)(e,{})})},e.metadata.permalink))})}},5124:(e,t,a)=>{a.r(t),a.d(t,{default:()=>j});a(6540);var 
n=a(4164),s=a(4586),r=a(5500),i=a(7559),l=a(8027),o=a(7713),c=a(1463),d=a(3892),g=a(5260),m=a(4096),u=a(4848);function h(e){const t=(0,m.kJ)(e);return(0,u.jsx)(g.A,{children:(0,u.jsx)("script",{type:"application/ld+json",children:JSON.stringify(t)})})}function p(e){const{metadata:t}=e,{siteConfig:{title:a}}=(0,s.A)(),{blogDescription:n,blogTitle:i,permalink:l}=t,o="/"===l?a:i;return(0,u.jsxs)(u.Fragment,{children:[(0,u.jsx)(r.be,{title:o,description:n}),(0,u.jsx)(c.A,{tag:"blog_posts_list"})]})}function x(e){const{metadata:t,items:a,sidebar:n}=e;return(0,u.jsxs)(l.A,{sidebar:n,children:[(0,u.jsx)(d.A,{items:a}),(0,u.jsx)(o.A,{metadata:t})]})}function j(e){return(0,u.jsxs)(r.e3,{className:(0,n.A)(i.G.wrapper.blogPages,i.G.page.blogListPage),children:[(0,u.jsx)(p,{...e}),(0,u.jsx)(h,{...e}),(0,u.jsx)(x,{...e})]})}},6133:(e,t,a)=>{a.d(t,{A:()=>l});a(6540);var n=a(4164),s=a(8774);const r={tag:"tag_zVej",tagRegular:"tagRegular_sFm0",tagWithCount:"tagWithCount_h2kH"};var i=a(4848);function l({permalink:e,label:t,count:a,description:l}){return(0,i.jsxs)(s.A,{rel:"tag",href:e,title:l,className:(0,n.A)(r.tag,a?r.tagWithCount:r.tagRegular),children:[t,a&&(0,i.jsx)("span",{children:a})]})}},7713:(e,t,a)=>{a.d(t,{A:()=>i});a(6540);var n=a(1312),s=a(9022),r=a(4848);function i(e){const{metadata:t}=e,{previousPage:a,nextPage:i}=t;return(0,r.jsxs)("nav",{className:"pagination-nav","aria-label":(0,n.T)({id:"theme.blog.paginator.navAriaLabel",message:"Blog list page navigation",description:"The ARIA label for the blog pagination"}),children:[a&&(0,r.jsx)(s.A,{permalink:a,title:(0,r.jsx)(n.A,{id:"theme.blog.paginator.newerEntries",description:"The label used to navigate to the newer blog posts page (previous page)",children:"Newer entries"})}),i&&(0,r.jsx)(s.A,{permalink:i,title:(0,r.jsx)(n.A,{id:"theme.blog.paginator.olderEntries",description:"The label used to navigate to the older blog posts page (next page)",children:"Older 
entries"}),isNext:!0})]})}},9022:(e,t,a)=>{a.d(t,{A:()=>i});a(6540);var n=a(4164),s=a(8774),r=a(4848);function i(e){const{permalink:t,title:a,subLabel:i,isNext:l}=e;return(0,r.jsxs)(s.A,{className:(0,n.A)("pagination-nav__link",l?"pagination-nav__link--next":"pagination-nav__link--prev"),to:t,children:[i&&(0,r.jsx)("div",{className:"pagination-nav__sublabel",children:i}),(0,r.jsx)("div",{className:"pagination-nav__label",children:a})]})}}}]); \ No newline at end of file diff --git a/docs/assets/js/a97f18d9.493bcc36.js b/docs/assets/js/a97f18d9.493bcc36.js new file mode 100644 index 00000000..5b1e3a38 --- /dev/null +++ b/docs/assets/js/a97f18d9.493bcc36.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[6724],{411:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/mp-dag-976ff51caf25f09d977ccc10e70918f3.png"},721:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"},1106:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-two","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-two/index.md","source":"@site/blog/bharatmlstack-history/post-two/index.md","title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","description":"BharatMLStack","date":"2023-04-10T00:00:00.000Z","tags":[{"inline":true,"label":"inferflow","permalink":"/BharatMLStack/blog/tags/inferflow"},{"inline":true,"label":"interaction-store","permalink":"/BharatMLStack/blog/tags/interaction-store"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":6.31,"hasTruncateMarker":false,"authors":[{"name":"Bhawani Singh","title":"Architect @ 
Meesho","url":"https://github.com/singh-bhawani","imageURL":"https://github.com/singh-bhawani.png","key":"bhawani","page":null},{"name":"Jigar Dave","title":"Lead Software Engineer @ Meesho","url":"https://github.com/jigarpatel26","imageURL":"https://github.com/jigarpatel26.png","key":"jigar","page":null},{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null}],"frontMatter":{"slug":"post-two","title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","authors":["bhawani","jigar","adarsha"],"date":"2023-4-10","tags":["inferflow","interaction-store","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","permalink":"/BharatMLStack/blog/post-three"},"nextItem":{"title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)","permalink":"/BharatMLStack/blog/post-one"}}')},4215:(e,n,t)=>{t.r(n),t.d(n,{assets:()=>l,contentTitle:()=>o,default:()=>h,frontMatter:()=>a,metadata:()=>i,toc:()=>c});var i=t(1106),r=t(4848),s=t(8453);const a={slug:"post-two",title:"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)",authors:["bhawani","jigar","adarsha"],date:"2023-4-10",tags:["inferflow","interaction-store","mlplatform","meesho","bharatmlstack"]},o=void 0,l={authorsImageUrls:[void 0,void 0,void 0]},c=[{value:"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)",id:"building-meeshos-ml-platform-lessons-from-the-first-gen-system-part-2",level:2},{value:"The Cost of Success",id:"the-cost-of-success",level:3},{value:"Scaling Pains (and Cassandra\u2019s Limits)",id:"scaling-pains-and-cassandras-limits",level:3},{value:"Interaction Store Woes",id:"interaction-store-woes",level:3},{value:"Silver Linings",id:"silver-linings",level:3},{value:"Round Two: Solving the Top 2 
Bottlenecks",id:"round-two-solving-the-top-2-bottlenecks",level:3},{value:"Problem 1: No-Code Feature Retrieval for Model Inference",id:"problem-1-no-code-feature-retrieval-for-model-inference",level:4},{value:"Problem 2: Scaling Without Breaking the Bank",id:"problem-2-scaling-without-breaking-the-bank",level:4},{value:"Optimizing the Online Feature Store",id:"optimizing-the-online-feature-store",level:4},{value:"Optimizing the Interaction Store",id:"optimizing-the-interaction-store",level:4},{value:"Results",id:"results",level:4},{value:"The Catch: Our ML Hosting Hit a Hard Limit",id:"the-catch-our-ml-hosting-hit-a-hard-limit",level:4},{value:"Conclusion: From Firefighting to Future-Proofing",id:"conclusion-from-firefighting-to-future-proofing",level:3}];function d(e){const n={h2:"h2",h3:"h3",h4:"h4",img:"img",li:"li",ol:"ol",p:"p",ul:"ul",...(0,s.R)(),...e.components};return(0,r.jsxs)(r.Fragment,{children:[(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"BharatMLStack",src:t(721).A+"",width:"1396",height:"460"})}),"\n",(0,r.jsx)(n.h2,{id:"building-meeshos-ml-platform-lessons-from-the-first-gen-system-part-2",children:"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)"}),"\n",(0,r.jsx)(n.p,{children:"By late 2022, we had built something we were truly proud of\u2014a real-time ML serving system with a DAG-based executor, a feature store, and an interaction store powering key ranking and personalization models. It was a major milestone, the culmination of months of effort from data scientists, ML engineers, and backend teams. Our system was live, and we were ready to push the boundaries of experimentation.\nAnd it worked. Mostly.\nBut soon, cracks appeared. Every new model needed custom feature retrieval logic, DAGs became dense and unmanageable, and scaling turned into a constant firefight. Costs surged, and infra bottlenecks slowed experimentation. 
Our system worked, but it wasn\u2019t built for scale.\nThis is the story of how we tackled these challenges\u2014building Inferflow for seamless feature retrieval, optimizing real-time infra, and cutting costs while scaling to millions of QPS."}),"\n",(0,r.jsx)(n.h3,{id:"the-cost-of-success",children:"The Cost of Success"}),"\n",(0,r.jsx)(n.p,{children:"Every new Ranker model required its own feature set, often pulling from different entities. Each addition meant:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Adding new DAG nodes in IOP"}),"\n",(0,r.jsx)(n.li,{children:"Writing custom logic to fetch features from multiple sources (e.g., user, product, user \xd7 category)"}),"\n",(0,r.jsx)(n.li,{children:"Inferring intermediate features (e.g., extracting category from a product to fetch user \xd7 category data)"}),"\n",(0,r.jsx)(n.li,{children:"Optimizing I/O and dealing with the inevitable bugs"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"What began as clean DAGs soon turned into a tangled web of cross-dependent graphs. Every experimentation cycle meant new nodes, new dependencies, and slower iterations."}),"\n",(0,r.jsx)(n.h3,{id:"scaling-pains-and-cassandras-limits",children:"Scaling Pains (and Cassandra\u2019s Limits)"}),"\n",(0,r.jsx)(n.p,{children:"At some point, we were hitting:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"250\u2013300K reads/sec"}),"\n",(0,r.jsx)(n.li,{children:"1M writes/sec (during lean hours)"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"All of this ran on Cassandra. While its distributed architecture had been proven in production, operating large-scale clusters came with considerable infrastructure overhead. Our proof-of-concept (POC) demonstrated throughput of around 100K ops/sec, but as we scaled further, the challenges grew. Ensuring node health, optimizing compaction, and maintaining storage balance became increasingly demanding. 
We also observed latency spikes under heavy load, alongside a sharp increase in total cost of ownership."}),"\n",(0,r.jsx)(n.h3,{id:"interaction-store-woes",children:"Interaction Store Woes"}),"\n",(0,r.jsx)(n.p,{children:"Our interaction store was another ticking time bomb:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udea8 Clusters kept growing in size and cost"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udea8 Latency spikes became increasingly frequent"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udea8 The DMC proxy occasionally lost locality of nodes against shards, causing cross-node communication and degraded performance"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"Each time this happened, we had to manually rebalance shards just to restore stable latency, making operations unsustainable at scale."}),"\n",(0,r.jsx)(n.h3,{id:"silver-linings",children:"Silver Linings"}),"\n",(0,r.jsx)(n.p,{children:"Despite the chaos, the system was live and delivering value:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Real-time infrastructure was in production"}),"\n",(0,r.jsx)(n.li,{children:"Costs dropped by 60\u201370% compared to offline personalization"}),"\n",(0,r.jsx)(n.li,{children:"New experiments rolled out faster and more successfully"}),"\n",(0,r.jsx)(n.li,{children:"User engagement metrics improved"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"It wasn\u2019t perfect. It was far from easy. But it worked\u2014and that counted for a lot."}),"\n",(0,r.jsx)(n.h3,{id:"round-two-solving-the-top-2-bottlenecks",children:"Round Two: Solving the Top 2 Bottlenecks"}),"\n",(0,r.jsx)(n.p,{children:"With the first-gen system stretched to its limits, we stepped back. 
Conversations with data scientists and backend engineers revealed three recurring pain points:"}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsx)(n.li,{children:"Coding feature retrieval logic for every new model was becoming unsustainable"}),"\n",(0,r.jsx)(n.li,{children:"ML scale was exploding\u2014bringing rising infra costs with it"}),"\n",(0,r.jsx)(n.li,{children:"Real-time embedding search was the next big unlock"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"We tackled them one by one\u2014starting with the biggest pain point."}),"\n",(0,r.jsx)(n.h4,{id:"problem-1-no-code-feature-retrieval-for-model-inference",children:"Problem 1: No-Code Feature Retrieval for Model Inference"}),"\n",(0,r.jsx)(n.p,{children:"We noticed a pattern: for personalized ranking, models needed features from:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\u2705 Product"}),"\n",(0,r.jsx)(n.li,{children:"\u2705 User"}),"\n",(0,r.jsx)(n.li,{children:"\u2705 User \xd7 Category"}),"\n",(0,r.jsx)(n.li,{children:"\u2705 Region, cohort, sub-category, etc."}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"A key insight emerged: Entities that contribute features for a model always map back to the context entities."}),"\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"MP Dag",src:t(411).A+"",width:"1272",height:"512"})}),"\n",(0,r.jsx)(n.p,{children:"With this, we designed Inferflow, a graph-driven feature retrieval and model orchestration system:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"1\ufe0f\u20e3 Inferflow takes a modelId and context IDs (e.g., userId, productIds)"}),"\n",(0,r.jsx)(n.li,{children:"2\ufe0f\u20e3 Loads a pre-defined feature retrieval graph from ZooKeeper"}),"\n",(0,r.jsx)(n.li,{children:"3\ufe0f\u20e3 Executes the graph to resolve entity relationships dynamically"}),"\n",(0,r.jsx)(n.li,{children:"4\ufe0f\u20e3 Outputs a 2D matrix of feature vectors"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"\ud83d\udca1 The 
impact?"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\ude80 No more custom feature retrieval code\u2014just graph updates in config"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\ude80 Feature consistency across experiments"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\ude80 Faster iteration cycles for ranking, fraud detection, and beyond"}),"\n"]}),"\n",(0,r.jsxs)(n.p,{children:["Here\u2019s a visual example that shows how this graph plays out during execution. We further extended the graph to call multiple models as needed:\n",(0,r.jsx)(n.img,{alt:"MP matrix",src:t(7704).A+"",width:"1262",height:"768"}),"\nWe built Inferflow in GoLang, using gRPC and Proto3 serialization for efficiency."]}),"\n",(0,r.jsx)(n.h4,{id:"problem-2-scaling-without-breaking-the-bank",children:"Problem 2: Scaling Without Breaking the Bank"}),"\n",(0,r.jsx)(n.p,{children:"With more ML use cases coming online, we needed to cut costs without compromising performance. We focused on:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udd39 Online Feature Store"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udd39 Interaction Store"}),"\n"]}),"\n",(0,r.jsx)(n.h4,{id:"optimizing-the-online-feature-store",children:"Optimizing the Online Feature Store"}),"\n",(0,r.jsx)(n.p,{children:"Our costs were concentrated in:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Database (Cassandra)"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Cache (Redis)"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Running Pods (Java services)"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"1\ufe0f\u20e3 Replacing Cassandra with ScyllaDB\nAs we hit the operational limits of large Cassandra clusters, we transitioned to ScyllaDB, which offered a seamless drop-in replacement without major code changes. 
The switch brought significant benefits:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Throughput: Matched or exceeded Cassandra's performance under identical workloads, even under high concurrency."}),"\n",(0,r.jsx)(n.li,{children:"Latency: Achieved consistently lower P99 latencies due to ScyllaDB's shard-per-core architecture and better I/O utilization."}),"\n",(0,r.jsx)(n.li,{children:"Cost Efficiency: Reduced infra footprint by ~70% through better CPU and memory efficiency, eliminating the need for over-provisioned nodes."}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"2\ufe0f\u20e3 Finding the Right Cache\nTo reduce backend load and improve response times, we benchmarked multiple caching solutions\u2014Memcached, KeyDB, and Dragonfly\u2014under real production traffic patterns. Dragonfly stood out due to its robust architecture and operational simplicity:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Data Skew Handling: Efficiently managed extreme key hotness and uneven access patterns without performance degradation."}),"\n",(0,r.jsx)(n.li,{children:"Throughput: Delivered consistently high throughput, even with large object sizes and concurrent access."}),"\n",(0,r.jsx)(n.li,{children:"Ease of Adoption: Acted as a drop-in Redis replacement with full protocol compatibility\u2014no changes needed in application code or client libraries."}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"3\ufe0f\u20e3 Moving to GoLang for Cost-Efficient Serving\nJava services were memory-heavy\u2014so we rewrote core services in GoLang. The results?"}),"\n",(0,r.jsx)(n.p,{children:"\u2705 Memory usage dropped by ~80%\n\u2705 CPU utilization was significantly lower\n\u2705 Faster, more efficient deployments"}),"\n",(0,r.jsx)(n.h4,{id:"optimizing-the-interaction-store",children:"Optimizing the Interaction Store"}),"\n",(0,r.jsx)(n.p,{children:"We realized that we only need a user\u2019s interaction data in Redis when they open the app. 
So, we implemented a tiered storage approach:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Cold Tier (ScyllaDB)\u2014Stores click, order, wishlist events"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Hot Tier (Redis)\u2014Loads a user\u2019s past interactions only when they open the app"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"Smart Offloading: We introduced an inactivity tracker to detect when a user session ends. At that point, Redis data was flushed back to Scylla, reducing unnecessary writes."}),"\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"InteractionStore",src:t(9497).A+"",width:"1242",height:"572"})}),"\n",(0,r.jsx)(n.h4,{id:"results",children:"Results"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Online Feature Store hit 1M QPS for the first time during the 2023 Mega Blockbuster Sale\u2014without breaking a sweat"}),"\n",(0,r.jsx)(n.li,{children:"Infra costs for Online Feature Store and Interaction Store dropped by ~60%"}),"\n"]}),"\n",(0,r.jsx)(n.h4,{id:"the-catch-our-ml-hosting-hit-a-hard-limit",children:"The Catch: Our ML Hosting Hit a Hard Limit"}),"\n",(0,r.jsx)(n.p,{children:"While planning for 2023 MBS, we ran into a critical scalability bottleneck:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\u274c Insufficient compute availability in our region for ML instances"}),"\n",(0,r.jsx)(n.li,{children:"\u274c Couldn\u2019t provision enough nodes to handle real-time inference at scale"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"This forced us to rethink where and how we hosted our models. 
The existing setup was great for prototyping\u2014but it wasn\u2019t built to handle the bursty, high-QPS demands of real-world production workloads."}),"\n",(0,r.jsx)(n.h3,{id:"conclusion-from-firefighting-to-future-proofing",children:"Conclusion: From Firefighting to Future-Proofing"}),"\n",(0,r.jsx)(n.p,{children:"What started as an ambitious experiment turned into a real-time ML infrastructure that powered millions of requests per second. We battled scaling pains, rethought feature retrieval with Inferflow, and rebuilt our infra stack for efficiency\u2014driving down costs while improving experimentation velocity.\nBut new challenges emerged. Our infrastructure could now handle scale, but our ML model hosting setup hit a hard limit. With compute availability bottlenecks threatening real-time inference, we faced a critical decision: how do we make model serving as scalable and cost-efficient as the rest of our stack? That\u2019s the next piece of the puzzle\u2014and the story of Part 3."})]})}function h(e={}){const{wrapper:n}={...(0,s.R)(),...e.components};return n?(0,r.jsx)(n,{...e,children:(0,r.jsx)(d,{...e})}):d(e)}},7704:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/mp-matrix-43994f433f78905ccbd10cfe284f3c9f.png"},8453:(e,n,t)=>{t.d(n,{R:()=>a,x:()=>o});var i=t(6540);const r={},s=i.createContext(r);function a(e){const n=i.useContext(s);return i.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(r):e.components||r:a(e.components),i.createElement(s.Provider,{value:n},e.children)}},9497:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/interaction-str-d9e7aefea121aefb4e94c6c9f060d016.png"}}]); \ No newline at end of file diff --git a/docs/assets/js/a97f18d9.ce4ddba2.js b/docs/assets/js/a97f18d9.ce4ddba2.js deleted file mode 100644 index c6af68cd..00000000 --- a/docs/assets/js/a97f18d9.ce4ddba2.js +++ /dev/null @@ -1 +0,0 @@ -"use 
strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[6724],{1106:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-two","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-two/index.md","source":"@site/blog/bharatmlstack-history/post-two/index.md","title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","description":"BharatMLStack","date":"2023-04-10T00:00:00.000Z","tags":[{"inline":true,"label":"inferflow","permalink":"/BharatMLStack/blog/tags/inferflow"},{"inline":true,"label":"interaction-store","permalink":"/BharatMLStack/blog/tags/interaction-store"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":6.31,"hasTruncateMarker":false,"authors":[{"name":"Bhawani Singh","title":"Architect @ Meesho","url":"https://github.com/singh-bhawani","imageURL":"https://github.com/singh-bhawani.png","key":"bhawani","page":null},{"name":"Jigar Dave","title":"Lead Software Engineer @ Meesho","url":"https://github.com/jigarpatel26","imageURL":"https://github.com/jigarpatel26.png","key":"jigar","page":null},{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null}],"frontMatter":{"slug":"post-two","title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","authors":["bhawani","jigar","adarsha"],"date":"2023-4-10","tags":["inferflow","interaction-store","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","permalink":"/BharatMLStack/blog/post-three"},"nextItem":{"title":"Building Meesho\u2019s ML Platform: From Chaos to 
Cutting-Edge (Part 1)","permalink":"/BharatMLStack/blog/post-one"}}')},3086:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"},4114:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/mp-dag-976ff51caf25f09d977ccc10e70918f3.png"},4215:(e,n,t)=>{t.r(n),t.d(n,{assets:()=>l,contentTitle:()=>o,default:()=>h,frontMatter:()=>a,metadata:()=>i,toc:()=>c});var i=t(1106),r=t(4848),s=t(8453);const a={slug:"post-two",title:"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)",authors:["bhawani","jigar","adarsha"],date:"2023-4-10",tags:["inferflow","interaction-store","mlplatform","meesho","bharatmlstack"]},o=void 0,l={authorsImageUrls:[void 0,void 0,void 0]},c=[{value:"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)",id:"building-meeshos-ml-platform-lessons-from-the-first-gen-system-part-2",level:2},{value:"The Cost of Success",id:"the-cost-of-success",level:3},{value:"Scaling Pains (and Cassandra\u2019s Limits)",id:"scaling-pains-and-cassandras-limits",level:3},{value:"Interaction Store Woes",id:"interaction-store-woes",level:3},{value:"Silver Linings",id:"silver-linings",level:3},{value:"Round Two: Solving the Top 2 Bottlenecks",id:"round-two-solving-the-top-2-bottlenecks",level:3},{value:"Problem 1: No-Code Feature Retrieval for Model Inference",id:"problem-1-no-code-feature-retrieval-for-model-inference",level:4},{value:"Problem 2: Scaling Without Breaking the Bank",id:"problem-2-scaling-without-breaking-the-bank",level:4},{value:"Optimizing the Online Feature Store",id:"optimizing-the-online-feature-store",level:4},{value:"Optimizing the Interaction Store",id:"optimizing-the-interaction-store",level:4},{value:"Results",id:"results",level:4},{value:"The Catch: Our ML Hosting Hit a Hard Limit",id:"the-catch-our-ml-hosting-hit-a-hard-limit",level:4},{value:"Conclusion: From Firefighting to 
Future-Proofing",id:"conclusion-from-firefighting-to-future-proofing",level:3}];function d(e){const n={h2:"h2",h3:"h3",h4:"h4",img:"img",li:"li",ol:"ol",p:"p",ul:"ul",...(0,s.R)(),...e.components};return(0,r.jsxs)(r.Fragment,{children:[(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"BharatMLStack",src:t(3086).A+"",width:"1396",height:"460"})}),"\n",(0,r.jsx)(n.h2,{id:"building-meeshos-ml-platform-lessons-from-the-first-gen-system-part-2",children:"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)"}),"\n",(0,r.jsx)(n.p,{children:"By late 2022, we had built something we were truly proud of\u2014a real-time ML serving system with a DAG-based executor, a feature store, and an interaction store powering key ranking and personalization models. It was a major milestone, the culmination of months of effort from data scientists, ML engineers, and backend teams. Our system was live, and we were ready to push the boundaries of experimentation.\nAnd it worked. Mostly.\nBut soon, cracks appeared. Every new model needed custom feature retrieval logic, DAGs became dense and unmanageable, and scaling turned into a constant firefight. Costs surged, and infra bottlenecks slowed experimentation. Our system worked, but it wasn\u2019t built for scale.\nThis is the story of how we tackled these challenges\u2014building Inferflow for seamless feature retrieval, optimizing real-time infra, and cutting costs while scaling to millions of QPS."}),"\n",(0,r.jsx)(n.h3,{id:"the-cost-of-success",children:"The Cost of Success"}),"\n",(0,r.jsx)(n.p,{children:"Every new Ranker model required its own feature set, often pulling from different entities. 
Each addition meant:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Adding new DAG nodes in IOP"}),"\n",(0,r.jsx)(n.li,{children:"Writing custom logic to fetch features from multiple sources (e.g., user, product, user \xd7 category)"}),"\n",(0,r.jsx)(n.li,{children:"Inferring intermediate features (e.g., extracting category from a product to fetch user \xd7 category data)"}),"\n",(0,r.jsx)(n.li,{children:"Optimizing I/O and dealing with the inevitable bugs"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"What began as clean DAGs soon turned into a tangled web of cross-dependent graphs. Every experimentation cycle meant new nodes, new dependencies, and slower iterations."}),"\n",(0,r.jsx)(n.h3,{id:"scaling-pains-and-cassandras-limits",children:"Scaling Pains (and Cassandra\u2019s Limits)"}),"\n",(0,r.jsx)(n.p,{children:"At some point, we were hitting:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"250\u2013300K reads/sec"}),"\n",(0,r.jsx)(n.li,{children:"1M writes/sec (during lean hours)"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"All of this ran on Cassandra. While its distributed architecture had been proven in production, operating large-scale clusters came with considerable infrastructure overhead. Our proof-of-concept (POC) demonstrated throughput of around 100K ops/sec, but as we scaled further, the challenges grew. Ensuring node health, optimizing compaction, and maintaining storage balance became increasingly demanding. 
We also observed latency spikes under heavy load, alongside a sharp increase in total cost of ownership."}),"\n",(0,r.jsx)(n.h3,{id:"interaction-store-woes",children:"Interaction Store Woes"}),"\n",(0,r.jsx)(n.p,{children:"Our interaction store was another ticking time bomb:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udea8 Clusters kept growing in size and cost"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udea8 Latency spikes became increasingly frequent"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udea8 The DMC proxy occasionally lost locality of nodes against shards, causing cross-node communication and degraded performance"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"Each time this happened, we had to manually rebalance shards just to restore stable latency, making operations unsustainable at scale."}),"\n",(0,r.jsx)(n.h3,{id:"silver-linings",children:"Silver Linings"}),"\n",(0,r.jsx)(n.p,{children:"Despite the chaos, the system was live and delivering value:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Real-time infrastructure was in production"}),"\n",(0,r.jsx)(n.li,{children:"Costs dropped by 60\u201370% compared to offline personalization"}),"\n",(0,r.jsx)(n.li,{children:"New experiments rolled out faster and more successfully"}),"\n",(0,r.jsx)(n.li,{children:"User engagement metrics improved"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"It wasn\u2019t perfect. It was far from easy. But it worked\u2014and that counted for a lot."}),"\n",(0,r.jsx)(n.h3,{id:"round-two-solving-the-top-2-bottlenecks",children:"Round Two: Solving the Top 2 Bottlenecks"}),"\n",(0,r.jsx)(n.p,{children:"With the first-gen system stretched to its limits, we stepped back. 
Conversations with data scientists and backend engineers revealed three recurring pain points:"}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsx)(n.li,{children:"Coding feature retrieval logic for every new model was becoming unsustainable"}),"\n",(0,r.jsx)(n.li,{children:"ML scale was exploding\u2014bringing rising infra costs with it"}),"\n",(0,r.jsx)(n.li,{children:"Real-time embedding search was the next big unlock"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"We tackled them one by one\u2014starting with the biggest pain point."}),"\n",(0,r.jsx)(n.h4,{id:"problem-1-no-code-feature-retrieval-for-model-inference",children:"Problem 1: No-Code Feature Retrieval for Model Inference"}),"\n",(0,r.jsx)(n.p,{children:"We noticed a pattern: for personalized ranking, models needed features from:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\u2705 Product"}),"\n",(0,r.jsx)(n.li,{children:"\u2705 User"}),"\n",(0,r.jsx)(n.li,{children:"\u2705 User \xd7 Category"}),"\n",(0,r.jsx)(n.li,{children:"\u2705 Region, cohort, sub-category, etc."}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"A key insight emerged: Entities that contribute features for a model always map back to the context entities."}),"\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"MP Dag",src:t(4114).A+"",width:"1272",height:"512"})}),"\n",(0,r.jsx)(n.p,{children:"With this, we designed Inferflow, a graph-driven feature retrieval and model orchestration system:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"1\ufe0f\u20e3 Inferflow takes a modelId and context IDs (e.g., userId, productIds)"}),"\n",(0,r.jsx)(n.li,{children:"2\ufe0f\u20e3 Loads a pre-defined feature retrieval graph from ZooKeeper"}),"\n",(0,r.jsx)(n.li,{children:"3\ufe0f\u20e3 Executes the graph to resolve entity relationships dynamically"}),"\n",(0,r.jsx)(n.li,{children:"4\ufe0f\u20e3 Outputs a 2D matrix of feature vectors"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"\ud83d\udca1 The 
impact?"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\ude80 No more custom feature retrieval code\u2014just graph updates in config"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\ude80 Feature consistency across experiments"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\ude80 Faster iteration cycles for ranking, fraud detection, and beyond"}),"\n"]}),"\n",(0,r.jsxs)(n.p,{children:["Here\u2019s a visual example that shows how this graph plays out during execution. We further extended the graph to call multiple models as needed:\n",(0,r.jsx)(n.img,{alt:"MP matrix",src:t(8111).A+"",width:"1262",height:"768"}),"\nWe built Inferflow in GoLang, using gRPC and Proto3 serialization for efficiency."]}),"\n",(0,r.jsx)(n.h4,{id:"problem-2-scaling-without-breaking-the-bank",children:"Problem 2: Scaling Without Breaking the Bank"}),"\n",(0,r.jsx)(n.p,{children:"With more ML use cases coming online, we needed to cut costs without compromising performance. We focused on:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udd39 Online Feature Store"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udd39 Interaction Store"}),"\n"]}),"\n",(0,r.jsx)(n.h4,{id:"optimizing-the-online-feature-store",children:"Optimizing the Online Feature Store"}),"\n",(0,r.jsx)(n.p,{children:"Our costs were concentrated in:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Database (Cassandra)"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Cache (Redis)"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Running Pods (Java services)"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"1\ufe0f\u20e3 Replacing Cassandra with ScyllaDB\nAs we hit the operational limits of large Cassandra clusters, we transitioned to ScyllaDB, which offered a seamless drop-in replacement without major code changes. 
The switch brought significant benefits:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Throughput: Matched or exceeded Cassandra's performance under identical workloads, even under high concurrency."}),"\n",(0,r.jsx)(n.li,{children:"Latency: Achieved consistently lower P99 latencies due to ScyllaDB's shard-per-core architecture and better I/O utilization."}),"\n",(0,r.jsx)(n.li,{children:"Cost Efficiency: Reduced infra footprint by ~70% through better CPU and memory efficiency, eliminating the need for over-provisioned nodes."}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"2\ufe0f\u20e3 Finding the Right Cache\nTo reduce backend load and improve response times, we benchmarked multiple caching solutions\u2014Memcached, KeyDB, and Dragonfly\u2014under real production traffic patterns. Dragonfly stood out due to its robust architecture and operational simplicity:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Data Skew Handling: Efficiently managed extreme key hotness and uneven access patterns without performance degradation."}),"\n",(0,r.jsx)(n.li,{children:"Throughput: Delivered consistently high throughput, even with large object sizes and concurrent access."}),"\n",(0,r.jsx)(n.li,{children:"Ease of Adoption: Acted as a drop-in Redis replacement with full protocol compatibility\u2014no changes needed in application code or client libraries."}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"3\ufe0f\u20e3 Moving to GoLang for Cost-Efficient Serving\nJava services were memory-heavy\u2014so we rewrote core services in GoLang. The results?"}),"\n",(0,r.jsx)(n.p,{children:"\u2705 Memory usage dropped by ~80%\n\u2705 CPU utilization was significantly lower\n\u2705 Faster, more efficient deployments"}),"\n",(0,r.jsx)(n.h4,{id:"optimizing-the-interaction-store",children:"Optimizing the Interaction Store"}),"\n",(0,r.jsx)(n.p,{children:"We realized that we only need a user\u2019s interaction data in Redis when they open the app. 
So, we implemented a tiered storage approach:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Cold Tier (ScyllaDB)\u2014Stores click, order, wishlist events"}),"\n",(0,r.jsx)(n.li,{children:"\ud83d\udccc Hot Tier (Redis)\u2014Loads a user\u2019s past interactions only when they open the app"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"Smart Offloading: We introduced an inactivity tracker to detect when a user session ends. At that point, Redis data was flushed back to Scylla, reducing unnecessary writes."}),"\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.img,{alt:"InteractionStore",src:t(9758).A+"",width:"1242",height:"572"})}),"\n",(0,r.jsx)(n.h4,{id:"results",children:"Results"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Online Feature Store hit 1M QPS for the first time during the 2023 Mega Blockbuster Sale\u2014without breaking a sweat"}),"\n",(0,r.jsx)(n.li,{children:"Infra costs for Online Feature Store and Interaction Store dropped by ~60%"}),"\n"]}),"\n",(0,r.jsx)(n.h4,{id:"the-catch-our-ml-hosting-hit-a-hard-limit",children:"The Catch: Our ML Hosting Hit a Hard Limit"}),"\n",(0,r.jsx)(n.p,{children:"While planning for 2023 MBS, we ran into a critical scalability bottleneck:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"\u274c Insufficient compute availability in our region for ML instances"}),"\n",(0,r.jsx)(n.li,{children:"\u274c Couldn\u2019t provision enough nodes to handle real-time inference at scale"}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"This forced us to rethink where and how we hosted our models. 
The existing setup was great for prototyping\u2014but it wasn\u2019t built to handle the bursty, high-QPS demands of real-world production workloads."}),"\n",(0,r.jsx)(n.h3,{id:"conclusion-from-firefighting-to-future-proofing",children:"Conclusion: From Firefighting to Future-Proofing"}),"\n",(0,r.jsx)(n.p,{children:"What started as an ambitious experiment turned into a real-time ML infrastructure that powered millions of requests per second. We battled scaling pains, rethought feature retrieval with Inferflow, and rebuilt our infra stack for efficiency\u2014driving down costs while improving experimentation velocity.\nBut new challenges emerged. Our infrastructure could now handle scale, but our ML model hosting setup hit a hard limit. With compute availability bottlenecks threatening real-time inference, we faced a critical decision: how do we make model serving as scalable and cost-efficient as the rest of our stack? That\u2019s the next piece of the puzzle\u2014and the story of Part 3."})]})}function h(e={}){const{wrapper:n}={...(0,s.R)(),...e.components};return n?(0,r.jsx)(n,{...e,children:(0,r.jsx)(d,{...e})}):d(e)}},8111:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/mp-matrix-43994f433f78905ccbd10cfe284f3c9f.png"},8453:(e,n,t)=>{t.d(n,{R:()=>a,x:()=>o});var i=t(6540);const r={},s=i.createContext(r);function a(e){const n=i.useContext(s);return i.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(r):e.components||r:a(e.components),i.createElement(s.Provider,{value:n},e.children)}},9758:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/interaction-str-d9e7aefea121aefb4e94c6c9f060d016.png"}}]); \ No newline at end of file diff --git a/docs/assets/js/ac51638e.0b4da379.js b/docs/assets/js/ac51638e.0b4da379.js new file mode 100644 index 00000000..e2381bfd --- /dev/null +++ b/docs/assets/js/ac51638e.0b4da379.js @@ -0,0 +1 @@ +"use 
strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[9473],{6692:(e,n,a)=>{a.r(n),a.d(n,{assets:()=>l,contentTitle:()=>o,default:()=>p,frontMatter:()=>s,metadata:()=>t,toc:()=>c});const t=JSON.parse('{"id":"sdks/python/v1.0.0/spark_feature_push_client","title":"Spark client","description":"PyPI version","source":"@site/docs/sdks/python/v1.0.0/spark_feature_push_client.md","sourceDirName":"sdks/python/v1.0.0","slug":"/sdks/python/v1.0.0/spark_feature_push_client","permalink":"/BharatMLStack/sdks/python/v1.0.0/spark_feature_push_client","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/sdks/python/v1.0.0/spark_feature_push_client.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"Spark client","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"GRPC Feature client","permalink":"/BharatMLStack/sdks/python/v1.0.0/grpc_feature_client"},"next":{"title":"Skye","permalink":"/BharatMLStack/category/skye"}}');var r=a(4848),i=a(8453);const s={title:"Spark client",sidebar_position:1},o="Spark Feature Push Client",l={},c=[{value:"Installation",id:"installation",level:2},{value:"Dependencies",id:"dependencies",level:2},{value:"Architecture Role",id:"architecture-role",level:2},{value:"Features",id:"features",level:2},{value:"When to Use This Client",id:"when-to-use-this-client",level:2},{value:"Quick Start",id:"quick-start",level:2},{value:"Related Packages",id:"related-packages",level:2},{value:"License",id:"license",level:2},{value:"Contributing",id:"contributing",level:2},{value:"Prerequisites",id:"prerequisites",level:2},{value:"Supported Data Sources",id:"supported-data-sources",level:2},{value:"1. Database Tables",id:"1-database-tables",level:3},{value:"2. Cloud Storage - Parquet",id:"2-cloud-storage---parquet",level:3},{value:"3. 
Cloud Storage - Delta",id:"3-cloud-storage---delta",level:3},{value:"Configuration Examples",id:"configuration-examples",level:2},{value:"Basic Pipeline",id:"basic-pipeline",level:3},{value:"Reading from Multiple Sources",id:"reading-from-multiple-sources",level:3},{value:"Protobuf Serialization & Kafka Publishing",id:"protobuf-serialization--kafka-publishing",level:3},{value:"Data Type Handling",id:"data-type-handling",level:2},{value:"Scalar Types",id:"scalar-types",level:3},{value:"Vector Types",id:"vector-types",level:3},{value:"Production Pipeline Example",id:"production-pipeline-example",level:2},{value:"Configuration Options",id:"configuration-options",level:2},{value:"Client Configuration",id:"client-configuration",level:3},{value:"Protobuf Serialization Options",id:"protobuf-serialization-options",level:3},{value:"Kafka Publishing Options",id:"kafka-publishing-options",level:3},{value:"Performance Tuning",id:"performance-tuning",level:2},{value:"Spark Optimizations",id:"spark-optimizations",level:3},{value:"Memory Management",id:"memory-management",level:3},{value:"Kafka Throughput",id:"kafka-throughput",level:3},{value:"Monitoring & Debugging",id:"monitoring--debugging",level:2},{value:"DataFrame Inspection",id:"dataframe-inspection",level:3},{value:"Error Handling",id:"error-handling",level:3},{value:"Integration with Other SDKs",id:"integration-with-other-sdks",level:2},{value:"With gRPC Feature Client",id:"with-grpc-feature-client",level:3},{value:"With HTTP Feature Client (bharatml_common)",id:"with-http-feature-client-bharatml_common",level:3},{value:"Common Use Cases",id:"common-use-cases",level:2},{value:"1. Daily Batch ETL",id:"1-daily-batch-etl",level:3},{value:"2. Historical Backfill",id:"2-historical-backfill",level:3},{value:"3. 
Real-time Streaming (Advanced)",id:"3-real-time-streaming-advanced",level:3},{value:"Troubleshooting",id:"troubleshooting",level:2},{value:"Common Issues",id:"common-issues",level:3},{value:"Debug Mode",id:"debug-mode",level:3},{value:"Migration from Legacy Clients",id:"migration-from-legacy-clients",level:2},{value:"Best Practices",id:"best-practices",level:2},{value:"Contributing",id:"contributing-1",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license-1",level:2}];function d(e){const n={a:"a",code:"code",h1:"h1",h2:"h2",h3:"h3",header:"header",hr:"hr",img:"img",li:"li",ol:"ol",p:"p",pre:"pre",strong:"strong",ul:"ul",...(0,i.R)(),...e.components};return(0,r.jsxs)(r.Fragment,{children:[(0,r.jsx)(n.header,{children:(0,r.jsx)(n.h1,{id:"spark-feature-push-client",children:"Spark Feature Push Client"})}),"\n",(0,r.jsxs)(n.p,{children:[(0,r.jsx)(n.a,{href:"https://badge.fury.io/py/spark_feature_push_client",children:(0,r.jsx)(n.img,{src:"https://img.shields.io/pypi/v/spark_feature_push_client?label=pypi-package&color=light%20green",alt:"PyPI version"})}),"\n",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/actions/workflows/py-sdk.yml",children:(0,r.jsx)(n.img,{src:"https://github.com/Meesho/BharatMLStack/actions/workflows/py-sdk.yml/badge.svg",alt:"Build Status"})}),"\n",(0,r.jsx)(n.a,{href:"https://www.python.org/downloads/",children:(0,r.jsx)(n.img,{src:"https://img.shields.io/badge/python-3.7+-blue.svg",alt:"Python 3.7+"})}),"\n",(0,r.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:(0,r.jsx)(n.img,{src:"https://img.shields.io/badge/Discord-Join%20Chat-7289da?style=flat&logo=discord&logoColor=white",alt:"Discord"})}),"\n",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:(0,r.jsx)(n.img,{src:"https://img.shields.io/badge/License-BharatMLStack%20BSL%201.1-blue.svg",alt:"License"})})]}),"\n",(0,r.jsxs)(n.p,{children:["Apache Spark-based client for pushing ML 
features from offline batch sources to the BharatML Stack Online Feature Store via Kafka. This client is designed for ",(0,r.jsx)(n.strong,{children:"data pipeline operations"})," - reading from batch sources and publishing to Kafka for online consumption."]}),"\n",(0,r.jsx)(n.h2,{id:"installation",children:"Installation"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-bash",children:"pip install spark_feature_push_client\n"})}),"\n",(0,r.jsx)(n.h2,{id:"dependencies",children:"Dependencies"}),"\n",(0,r.jsx)(n.p,{children:"This package depends on:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:(0,r.jsx)(n.a,{href:"https://pypi.org/project/bharatml_commons/",children:"bharatml_commons"})}),": Common utilities and protobuf definitions"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"PySpark 3.0+"}),": For distributed data processing"]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"architecture-role",children:"Architecture Role"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{children:"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Batch Sources \u2502\u2500\u2500\u2500\u25b6\u2502 Spark Feature Push \u2502\u2500\u2500\u2500\u25b6\u2502 Kafka \u2502\u2500\u2500\u2500\u25b6\u2502 Online Feature \u2502\n\u2502 \u2022 Tables \u2502 \u2502 Client \u2502 \u2502 \u2502 \u2502 Store \u2502\n\u2502 \u2022 Parquet \u2502 \u2502 \u2022 Read & Transform \u2502 \u2502 \u2502 \u2502 \u2502\n\u2502 \u2022 Delta \u2502 \u2502 \u2022 Protobuf Serialize \u2502 \u2502 \u2502 \u2502 
\u2502\n\u2502 \u2022 S3/GCS/ADLS \u2502 \u2502 \u2022 Batch Processing \u2502 \u2502 \u2502 \u2502 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u25b2\n \u2502\n \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n \u2502 grpc_feature_ \u2502\n \u2502 client \u2502\n \u2502 (Real-time) \u2502\n \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n"})}),"\n",(0,r.jsx)(n.h2,{id:"features",children:"Features"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Batch Source Integration"}),": Read from Tables (Hive/Delta), Parquet, and Delta files on cloud storage"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Spark Processing"}),": Leverage Apache Spark for distributed data processing"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Protobuf Serialization"}),": Convert feature data to protobuf format using bharatml_commons schemas"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Kafka Publishing"}),": Push serialized features to Kafka topics for online consumption"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Metadata Integration"}),": Fetch feature schemas and configurations via REST API"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Data Type Support"}),": Handle scalar and vector types (strings, numbers, booleans, 
arrays)"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Batch Optimization"}),": Configurable batch sizes for optimal Kafka throughput"]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"when-to-use-this-client",children:"When to Use This Client"}),"\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.strong,{children:"Use spark_feature_push_client for:"})}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:["\ud83d\udd04 ",(0,r.jsx)(n.strong,{children:"Batch ETL Pipelines"}),": Scheduled feature computation and publishing"]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udcca ",(0,r.jsx)(n.strong,{children:"Historical Data Backfill"}),": Loading historical features into online store"]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83c\udfd7\ufe0f ",(0,r.jsx)(n.strong,{children:"Data Engineering"}),": Spark-based feature transformations"]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udcc8 ",(0,r.jsx)(n.strong,{children:"Large Scale Processing"}),": Processing millions of records efficiently"]}),"\n",(0,r.jsxs)(n.li,{children:["\u26a1 ",(0,r.jsx)(n.strong,{children:"Offline-to-Online"}),": Bridge between batch and real-time systems"]}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.strong,{children:"Use grpc_feature_client for:"})}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:["\ud83d\ude80 ",(0,r.jsx)(n.strong,{children:"Real-time Operations"}),": Direct persist/retrieve operations"]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udd0d ",(0,r.jsx)(n.strong,{children:"Interactive Queries"}),": Low-latency feature lookups"]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83c\udfaf ",(0,r.jsx)(n.strong,{children:"API Integration"}),": Service-to-service communication"]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udca8 ",(0,r.jsx)(n.strong,{children:"Single Records"}),": Persisting individual feature records"]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"quick-start",children:"Quick Start"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'from 
spark_feature_push_client import OnlineFeatureStorePyClient\n\n# Initialize client with metadata source\nclient = OnlineFeatureStorePyClient(\n features_metadata_source_url="https://api.example.com/metadata",\n job_id="feature-pipeline-job",\n job_token="your-auth-token"\n)\n\n# Get feature configuration \nfeature_details = client.get_features_details()\n\n# Process your Spark DataFrame\nproto_df = client.generate_df_with_protobuf_messages(your_spark_df)\n\n# Push to Kafka\nclient.write_protobuf_df_to_kafka(\n proto_df,\n kafka_bootstrap_servers="localhost:9092",\n kafka_topic="features.user_features"\n)\n'})}),"\n",(0,r.jsx)(n.h2,{id:"related-packages",children:"Related Packages"}),"\n",(0,r.jsx)(n.p,{children:"This package is part of the BharatML Stack ecosystem:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:(0,r.jsx)(n.a,{href:"https://pypi.org/project/bharatml_commons/",children:"bharatml_commons"})}),": Common utilities and protobuf definitions (required dependency)"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:(0,r.jsx)(n.a,{href:"https://pypi.org/project/grpc_feature_client/",children:"grpc_feature_client"})}),": High-performance gRPC client for real-time operations"]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,r.jsxs)(n.p,{children:["Licensed under the BharatMLStack Business Source License 1.1. See ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"LICENSE"})," for details."]}),"\n",(0,r.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,r.jsxs)(n.p,{children:["We welcome contributions! 
Please see our ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTION.md",children:"Contributing Guide"})," for details."]}),"\n",(0,r.jsx)(n.h2,{id:"prerequisites",children:"Prerequisites"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Apache Spark 3.0+"}),": For distributed processing"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Kafka Connector"}),": ",(0,r.jsx)(n.code,{children:"spark-sql-kafka"})," for Kafka integration"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Java 8/11"}),": Required by Spark"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"bharatml_commons"}),": For protobuf schemas"]}),"\n"]}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Example Spark session setup\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder \\\n .appName("FeaturePipeline") \\\n .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0") \\\n .getOrCreate()\n'})}),"\n",(0,r.jsx)(n.h2,{id:"supported-data-sources",children:"Supported Data Sources"}),"\n",(0,r.jsx)(n.h3,{id:"1-database-tables",children:"1. Database Tables"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Hive/Delta tables\ndf = spark.sql("SELECT * FROM feature_db.user_features")\n'})}),"\n",(0,r.jsx)(n.h3,{id:"2-cloud-storage---parquet",children:"2. Cloud Storage - Parquet"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# AWS S3\ndf = spark.read.parquet("s3a://bucket/path/to/features/")\n\n# Google Cloud Storage \ndf = spark.read.parquet("gs://bucket/path/to/features/")\n\n# Azure Data Lake\ndf = spark.read.parquet("abfss://container@account.dfs.core.windows.net/path/")\n'})}),"\n",(0,r.jsx)(n.h3,{id:"3-cloud-storage---delta",children:"3. 
Cloud Storage - Delta"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Delta format on cloud storage\ndf = spark.read.format("delta").load("s3a://bucket/delta-table/")\n'})}),"\n",(0,r.jsx)(n.h2,{id:"configuration-examples",children:"Configuration Examples"}),"\n",(0,r.jsx)(n.h3,{id:"basic-pipeline",children:"Basic Pipeline"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'from pyspark.sql import SparkSession\nfrom spark_feature_push_client import OnlineFeatureStorePyClient\n\n# Create Spark session\nspark = SparkSession.builder \\\n .appName("FeatureETL") \\\n .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0") \\\n .getOrCreate()\n\n# Initialize client\nclient = OnlineFeatureStorePyClient(\n features_metadata_source_url="https://metadata-service.example.com/api/v1/features",\n job_id="daily-feature-pipeline",\n job_token="pipeline-secret-token",\n fgs_to_consider=["user_demographics", "user_behavior"] # Optional: filter feature groups\n)\n\n# Get metadata and column mappings\n(\n offline_src_type_columns,\n offline_col_to_default_values_map, \n entity_column_names\n) = client.get_features_details()\n\nprint(f"Entity columns: {entity_column_names}")\nprint(f"Feature mappings: {offline_src_type_columns}")\n'})}),"\n",(0,r.jsx)(n.h3,{id:"reading-from-multiple-sources",children:"Reading from Multiple Sources"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'def get_features_from_all_sources(spark, entity_columns, feature_mapping, default_values):\n """\n Read and combine features from multiple offline sources\n """\n dataframes = []\n \n for source_info in feature_mapping:\n table_name, source_type, feature_list = source_info\n \n if source_type == "TABLE":\n # Read from Hive/Delta table\n df = spark.table(table_name)\n \n elif source_type.startswith("PARQUET_"):\n # Read from Parquet files\n df = 
spark.read.parquet(table_name)\n \n elif source_type.startswith("DELTA_"):\n # Read from Delta files\n df = spark.read.format("delta").load(table_name)\n \n # Select and rename columns\n select_cols = entity_columns.copy()\n for original_col, renamed_col in feature_list:\n if original_col in df.columns:\n df = df.withColumnRenamed(original_col, renamed_col)\n select_cols.append(renamed_col)\n \n df = df.select(select_cols)\n dataframes.append(df)\n \n # Union all dataframes\n if dataframes:\n combined_df = dataframes[0]\n for df in dataframes[1:]:\n combined_df = combined_df.unionByName(df, allowMissingColumns=True)\n \n # Fill missing values with defaults\n for col, default_val in default_values.items():\n if col in combined_df.columns:\n combined_df = combined_df.fillna({col: default_val})\n \n return combined_df\n \n return None\n\n# Use the function\ndf = get_features_from_all_sources(\n spark, \n entity_column_names, \n offline_src_type_columns, \n offline_col_to_default_values_map\n)\n'})}),"\n",(0,r.jsx)(n.h3,{id:"protobuf-serialization--kafka-publishing",children:"Protobuf Serialization & Kafka Publishing"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Convert DataFrame to protobuf messages\n# This creates binary protobuf messages suitable for Kafka\nproto_df = client.generate_df_with_protobuf_messages(\n df, \n intra_batch_size=20 # Batch size for serialization\n)\n\n# The proto_df has schema: [value: binary, intra_batch_id: long]\nproto_df.printSchema()\n# root\n# |-- value: binary (nullable = false) \n# |-- intra_batch_id: long (nullable = false)\n\n# Write to Kafka with batching for better throughput\nclient.write_protobuf_df_to_kafka(\n proto_df,\n kafka_bootstrap_servers="broker1:9092,broker2:9092,broker3:9092",\n kafka_topic="features.user_features",\n additional_options={\n "kafka.acks": "all",\n "kafka.retries": "3",\n "kafka.compression.type": "snappy"\n },\n kafka_num_batches=4 # Split into 4 parallel 
Kafka writes\n)\n'})}),"\n",(0,r.jsx)(n.h2,{id:"data-type-handling",children:"Data Type Handling"}),"\n",(0,r.jsx)(n.p,{children:"The client automatically handles the protobuf data type mappings:"}),"\n",(0,r.jsx)(n.h3,{id:"scalar-types",children:"Scalar Types"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Example DataFrame with different types\ndata = [\n ("user123", 25, 185.5, True, "premium"), # int, float, bool, string\n ("user456", 30, 170.0, False, "basic")\n]\ndf = spark.createDataFrame(data, ["user_id", "age", "height", "is_premium", "tier"])\n\n# Automatically mapped to protobuf:\n# age -> int32_values\n# height -> fp32_values \n# is_premium -> bool_values\n# tier -> string_values\n'})}),"\n",(0,r.jsx)(n.h3,{id:"vector-types",children:"Vector Types"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Example with vector/array features\nfrom pyspark.sql.functions import array, lit\n\ndf = spark.createDataFrame([\n ("user123", [0.1, 0.2, 0.3], ["tech", "sports"], [1, 2, 3])\n], ["user_id", "embeddings", "interests", "scores"])\n\n# Automatically mapped to protobuf vectors:\n# embeddings -> fp32_values in Vector\n# interests -> string_values in Vector\n# scores -> int32_values in Vector\n'})}),"\n",(0,r.jsx)(n.h2,{id:"production-pipeline-example",children:"Production Pipeline Example"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'def run_feature_pipeline():\n """\n Complete feature pipeline from batch sources to Kafka\n """\n \n # 1. Initialize Spark\n spark = SparkSession.builder \\\n .appName("DailyFeaturePipeline") \\\n .config("spark.sql.adaptive.enabled", "true") \\\n .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0") \\\n .getOrCreate()\n \n try:\n # 2. 
Initialize feature client\n client = OnlineFeatureStorePyClient(\n features_metadata_source_url=os.getenv("METADATA_URL"),\n job_id=os.getenv("JOB_ID"),\n job_token=os.getenv("JOB_TOKEN")\n )\n \n # 3. Get feature configuration\n feature_mapping, default_values, entity_columns = client.get_features_details()\n \n # 4. Read and process data\n df = get_features_from_all_sources(spark, entity_columns, feature_mapping, default_values)\n \n if df is None or df.count() == 0:\n raise ValueError("No data found in sources")\n \n # 5. Convert to protobuf\n proto_df = client.generate_df_with_protobuf_messages(df, intra_batch_size=50)\n \n # 6. Publish to Kafka\n client.write_protobuf_df_to_kafka(\n proto_df,\n kafka_bootstrap_servers=os.getenv("KAFKA_BROKERS"),\n kafka_topic=os.getenv("KAFKA_TOPIC"),\n additional_options={\n "kafka.acks": "all",\n "kafka.compression.type": "snappy",\n "kafka.max.request.size": "10485760" # 10MB\n },\n kafka_num_batches=int(os.getenv("KAFKA_BATCHES", "4"))\n )\n \n print(f"\u2705 Successfully processed {df.count()} records")\n \n finally:\n spark.stop()\n\nif __name__ == "__main__":\n run_feature_pipeline()\n'})}),"\n",(0,r.jsx)(n.h2,{id:"configuration-options",children:"Configuration Options"}),"\n",(0,r.jsx)(n.h3,{id:"client-configuration",children:"Client Configuration"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'client = OnlineFeatureStorePyClient(\n features_metadata_source_url="https://api.example.com/metadata", # Required\n job_id="pipeline-job-001", # Required \n job_token="secret-token-123", # Required\n fgs_to_consider=["user_features", "item_features"] # Optional: filter feature groups\n)\n'})}),"\n",(0,r.jsx)(n.h3,{id:"protobuf-serialization-options",children:"Protobuf Serialization Options"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:"proto_df = client.generate_df_with_protobuf_messages(\n df,\n intra_batch_size=20 # Records per protobuf 
message (default: 20)\n)\n"})}),"\n",(0,r.jsx)(n.h3,{id:"kafka-publishing-options",children:"Kafka Publishing Options"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'client.write_protobuf_df_to_kafka(\n proto_df,\n kafka_bootstrap_servers="localhost:9092",\n kafka_topic="features.topic",\n additional_options={\n "kafka.acks": "all", # Acknowledgment level\n "kafka.retries": "3", # Retry attempts\n "kafka.compression.type": "snappy", # Compression\n "kafka.batch.size": "16384", # Batch size\n "kafka.linger.ms": "100", # Batching delay\n "kafka.max.request.size": "10485760" # Max message size\n },\n kafka_num_batches=1 # Number of parallel Kafka writers (default: 1)\n)\n'})}),"\n",(0,r.jsx)(n.h2,{id:"performance-tuning",children:"Performance Tuning"}),"\n",(0,r.jsx)(n.h3,{id:"spark-optimizations",children:"Spark Optimizations"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'spark = SparkSession.builder \\\n .appName("FeaturePipeline") \\\n .config("spark.sql.adaptive.enabled", "true") \\\n .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \\\n .config("spark.sql.adaptive.skewJoin.enabled", "true") \\\n .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \\\n .getOrCreate()\n'})}),"\n",(0,r.jsx)(n.h3,{id:"memory-management",children:"Memory Management"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:"# For large datasets, consider:\ndf = df.repartition(200) # Optimal partition count\ndf.cache() # Cache if reused multiple times\n"})}),"\n",(0,r.jsx)(n.h3,{id:"kafka-throughput",children:"Kafka Throughput"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# For high-throughput scenarios:\nclient.write_protobuf_df_to_kafka(\n proto_df,\n kafka_bootstrap_servers="brokers",\n kafka_topic="topic", \n kafka_num_batches=8, # Increase parallel writers\n additional_options={\n 
"kafka.batch.size": "65536", # Larger batches\n "kafka.linger.ms": "100", # Allow batching delay\n "kafka.compression.type": "lz4" # Fast compression\n }\n)\n'})}),"\n",(0,r.jsx)(n.h2,{id:"monitoring--debugging",children:"Monitoring & Debugging"}),"\n",(0,r.jsx)(n.h3,{id:"dataframe-inspection",children:"DataFrame Inspection"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Check data before processing\nprint(f"Records: {df.count()}")\nprint(f"Columns: {df.columns}")\ndf.printSchema()\ndf.show(5)\n\n# Check protobuf output\nproto_df.show(5, truncate=False)\nprint(f"Protobuf messages: {proto_df.count()}")\n'})}),"\n",(0,r.jsx)(n.h3,{id:"error-handling",children:"Error Handling"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'try:\n proto_df = client.generate_df_with_protobuf_messages(df)\n client.write_protobuf_df_to_kafka(proto_df, brokers, topic)\n \nexcept Exception as e:\n print(f"Pipeline failed: {e}")\n # Log to monitoring system\n # Send alerts\n raise\n'})}),"\n",(0,r.jsx)(n.h2,{id:"integration-with-other-sdks",children:"Integration with Other SDKs"}),"\n",(0,r.jsx)(n.h3,{id:"with-grpc-feature-client",children:"With gRPC Feature Client"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:"# Spark client pushes features to Kafka\nspark_client = OnlineFeatureStorePyClient(...)\nspark_client.write_protobuf_df_to_kafka(proto_df, brokers, topic)\n\n# gRPC client retrieves features in real-time\nfrom grpc_feature_client import GRPCFeatureClient\ngrpc_client = GRPCFeatureClient(config)\nfeatures = grpc_client.retrieve_decoded_features(...)\n"})}),"\n",(0,r.jsx)(n.h3,{id:"with-http-feature-client-bharatml_common",children:"With HTTP Feature Client (bharatml_common)"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:"# Use HTTP client for metadata validation\nfrom bharatml_common import 
HTTPFeatureClient\nhttp_client = HTTPFeatureClient(base_url, job_id, token)\nmetadata = http_client.get_feature_metadata()\n\n# Validate feature names using shared utilities\nfrom bharatml_common import clean_column_name\nclean_features = [clean_column_name(name) for name in feature_names]\n\n# Process with Spark client\nspark_client.generate_df_with_protobuf_messages(df)\n"})}),"\n",(0,r.jsx)(n.h2,{id:"common-use-cases",children:"Common Use Cases"}),"\n",(0,r.jsx)(n.h3,{id:"1-daily-batch-etl",children:"1. Daily Batch ETL"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-bash",children:"# Cron job: 0 2 * * * (daily at 2 AM)\nspark-submit \\\n --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0 \\\n --conf spark.sql.adaptive.enabled=true \\\n daily_feature_pipeline.py\n"})}),"\n",(0,r.jsx)(n.h3,{id:"2-historical-backfill",children:"2. Historical Backfill"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Backfill last 30 days\nfrom datetime import datetime, timedelta\n\nfor i in range(30):\n date = datetime.now() - timedelta(days=i)\n df = spark.sql(f"""\n SELECT * FROM features \n WHERE date = \'{date.strftime(\'%Y-%m-%d\')}\'\n """)\n \n proto_df = client.generate_df_with_protobuf_messages(df)\n client.write_protobuf_df_to_kafka(proto_df, brokers, f"backfill.{date.strftime(\'%Y%m%d\')}")\n'})}),"\n",(0,r.jsx)(n.h3,{id:"3-real-time-streaming-advanced",children:"3. 
Real-time Streaming (Advanced)"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Read from streaming source, process, and publish\nstreaming_df = spark.readStream \\\n .format("kafka") \\\n .option("kafka.bootstrap.servers", input_brokers) \\\n .option("subscribe", input_topic) \\\n .load()\n\n# Process streaming DataFrame\nprocessed_df = streaming_df.select(...)\n\n# Write to output Kafka (requires structured streaming)\nquery = processed_df.writeStream \\\n .format("kafka") \\\n .option("kafka.bootstrap.servers", output_brokers) \\\n .option("topic", output_topic) \\\n .start()\n'})}),"\n",(0,r.jsx)(n.h2,{id:"troubleshooting",children:"Troubleshooting"}),"\n",(0,r.jsx)(n.h3,{id:"common-issues",children:"Common Issues"}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.strong,{children:"OutOfMemoryError"})}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Increase driver memory or reduce partition size\nspark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "50")\n'})}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.strong,{children:"Kafka Connection Timeout"})}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Check network connectivity and broker addresses\nadditional_options = {\n "kafka.request.timeout.ms": "60000",\n "kafka.session.timeout.ms": "30000"\n}\n'})}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.strong,{children:"Protobuf Serialization Errors"})}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Check data types and null values\ndf = df.fillna({"string_col": "", "numeric_col": 0})\n'})}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.strong,{children:"Metadata API 
Errors"})}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:"# Verify job_id, job_token, and URL\n# Check API server logs\n"})}),"\n"]}),"\n"]}),"\n",(0,r.jsx)(n.h3,{id:"debug-mode",children:"Debug Mode"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'import logging\nlogging.basicConfig(level=logging.DEBUG)\n\n# Enable Spark SQL logging\nspark.sparkContext.setLogLevel("INFO")\n'})}),"\n",(0,r.jsx)(n.h2,{id:"migration-from-legacy-clients",children:"Migration from Legacy Clients"}),"\n",(0,r.jsx)(n.p,{children:"If migrating from older versions:"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:"# Old import\n# from online_feature_store_py_client import OnlineFeatureStorePyClient\n\n# New import (same interface)\nfrom spark_feature_push_client import OnlineFeatureStorePyClient\n\n# API remains the same - no code changes needed!\n"})}),"\n",(0,r.jsx)(n.h2,{id:"best-practices",children:"Best Practices"}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Resource Management"}),": Always stop Spark sessions"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Error Handling"}),": Implement proper exception handling and retries"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Monitoring"}),": Add metrics and logging to your pipelines"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Testing"}),": Test with sample data before production runs"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Security"}),": Use secure Kafka configurations in production"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Performance"}),": Monitor Spark UI for optimization opportunities"]}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"The Spark Feature Push Client is your gateway from batch data sources to the real-time online feature store! 
\ud83d\ude80"}),"\n",(0,r.jsx)(n.h2,{id:"contributing-1",children:"Contributing"}),"\n",(0,r.jsxs)(n.p,{children:["We welcome contributions from the community! Please see our ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,r.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:["\ud83d\udcac ",(0,r.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,r.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udc1b ",(0,r.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udce7 ",(0,r.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,r.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"license-1",children:"License"}),"\n",(0,r.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 1.1"}),"."]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)("div",{align:"center",children:(0,r.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,r.jsx)("div",{align:"center",children:(0,r.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function p(e={}){const{wrapper:n}={...(0,i.R)(),...e.components};return n?(0,r.jsx)(n,{...e,children:(0,r.jsx)(d,{...e})}):d(e)}},8453:(e,n,a)=>{a.d(n,{R:()=>s,x:()=>o});var t=a(6540);const r={},i=t.createContext(r);function s(e){const n=t.useContext(i);return t.useMemo(function(){return"function"==typeof 
e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(r):e.components||r:s(e.components),t.createElement(i.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/ac51638e.ef64a004.js b/docs/assets/js/ac51638e.ef64a004.js deleted file mode 100644 index d229d3ec..00000000 --- a/docs/assets/js/ac51638e.ef64a004.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[9473],{6692:(e,n,a)=>{a.r(n),a.d(n,{assets:()=>l,contentTitle:()=>o,default:()=>u,frontMatter:()=>s,metadata:()=>t,toc:()=>c});const t=JSON.parse('{"id":"sdks/python/v1.0.0/spark_feature_push_client","title":"Spark client","description":"PyPI version","source":"@site/docs/sdks/python/v1.0.0/spark_feature_push_client.md","sourceDirName":"sdks/python/v1.0.0","slug":"/sdks/python/v1.0.0/spark_feature_push_client","permalink":"/BharatMLStack/sdks/python/v1.0.0/spark_feature_push_client","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/sdks/python/v1.0.0/spark_feature_push_client.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"Spark client","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"GRPC Feature client","permalink":"/BharatMLStack/sdks/python/v1.0.0/grpc_feature_client"},"next":{"title":"Numerix","permalink":"/BharatMLStack/category/numerix"}}');var r=a(4848),i=a(8453);const s={title:"Spark client",sidebar_position:1},o="Spark Feature Push Client",l={},c=[{value:"Installation",id:"installation",level:2},{value:"Dependencies",id:"dependencies",level:2},{value:"Architecture Role",id:"architecture-role",level:2},{value:"Features",id:"features",level:2},{value:"When to Use This Client",id:"when-to-use-this-client",level:2},{value:"Quick Start",id:"quick-start",level:2},{value:"Related 
Packages",id:"related-packages",level:2},{value:"License",id:"license",level:2},{value:"Contributing",id:"contributing",level:2},{value:"Prerequisites",id:"prerequisites",level:2},{value:"Supported Data Sources",id:"supported-data-sources",level:2},{value:"1. Database Tables",id:"1-database-tables",level:3},{value:"2. Cloud Storage - Parquet",id:"2-cloud-storage---parquet",level:3},{value:"3. Cloud Storage - Delta",id:"3-cloud-storage---delta",level:3},{value:"Configuration Examples",id:"configuration-examples",level:2},{value:"Basic Pipeline",id:"basic-pipeline",level:3},{value:"Reading from Multiple Sources",id:"reading-from-multiple-sources",level:3},{value:"Protobuf Serialization & Kafka Publishing",id:"protobuf-serialization--kafka-publishing",level:3},{value:"Data Type Handling",id:"data-type-handling",level:2},{value:"Scalar Types",id:"scalar-types",level:3},{value:"Vector Types",id:"vector-types",level:3},{value:"Production Pipeline Example",id:"production-pipeline-example",level:2},{value:"Configuration Options",id:"configuration-options",level:2},{value:"Client Configuration",id:"client-configuration",level:3},{value:"Protobuf Serialization Options",id:"protobuf-serialization-options",level:3},{value:"Kafka Publishing Options",id:"kafka-publishing-options",level:3},{value:"Performance Tuning",id:"performance-tuning",level:2},{value:"Spark Optimizations",id:"spark-optimizations",level:3},{value:"Memory Management",id:"memory-management",level:3},{value:"Kafka Throughput",id:"kafka-throughput",level:3},{value:"Monitoring & Debugging",id:"monitoring--debugging",level:2},{value:"DataFrame Inspection",id:"dataframe-inspection",level:3},{value:"Error Handling",id:"error-handling",level:3},{value:"Integration with Other SDKs",id:"integration-with-other-sdks",level:2},{value:"With gRPC Feature Client",id:"with-grpc-feature-client",level:3},{value:"With HTTP Feature Client (bharatml_common)",id:"with-http-feature-client-bharatml_common",level:3},{value:"Common Use 
Cases",id:"common-use-cases",level:2},{value:"1. Daily Batch ETL",id:"1-daily-batch-etl",level:3},{value:"2. Historical Backfill",id:"2-historical-backfill",level:3},{value:"3. Real-time Streaming (Advanced)",id:"3-real-time-streaming-advanced",level:3},{value:"Troubleshooting",id:"troubleshooting",level:2},{value:"Common Issues",id:"common-issues",level:3},{value:"Debug Mode",id:"debug-mode",level:3},{value:"Migration from Legacy Clients",id:"migration-from-legacy-clients",level:2},{value:"Best Practices",id:"best-practices",level:2},{value:"Contributing",id:"contributing-1",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license-1",level:2}];function d(e){const n={a:"a",code:"code",h1:"h1",h2:"h2",h3:"h3",header:"header",hr:"hr",img:"img",li:"li",ol:"ol",p:"p",pre:"pre",strong:"strong",ul:"ul",...(0,i.R)(),...e.components};return(0,r.jsxs)(r.Fragment,{children:[(0,r.jsx)(n.header,{children:(0,r.jsx)(n.h1,{id:"spark-feature-push-client",children:"Spark Feature Push Client"})}),"\n",(0,r.jsxs)(n.p,{children:[(0,r.jsx)(n.a,{href:"https://badge.fury.io/py/spark_feature_push_client",children:(0,r.jsx)(n.img,{src:"https://img.shields.io/pypi/v/spark_feature_push_client?label=pypi-package&color=light%20green",alt:"PyPI version"})}),"\n",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/actions/workflows/py-sdk.yml",children:(0,r.jsx)(n.img,{src:"https://github.com/Meesho/BharatMLStack/actions/workflows/py-sdk.yml/badge.svg",alt:"Build Status"})}),"\n",(0,r.jsx)(n.a,{href:"https://www.python.org/downloads/",children:(0,r.jsx)(n.img,{src:"https://img.shields.io/badge/python-3.7+-blue.svg",alt:"Python 
3.7+"})}),"\n",(0,r.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:(0,r.jsx)(n.img,{src:"https://img.shields.io/badge/Discord-Join%20Chat-7289da?style=flat&logo=discord&logoColor=white",alt:"Discord"})}),"\n",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:(0,r.jsx)(n.img,{src:"https://img.shields.io/badge/License-BharatMLStack%20BSL%201.1-blue.svg",alt:"License"})})]}),"\n",(0,r.jsxs)(n.p,{children:["Apache Spark-based client for pushing ML features from offline batch sources to the BharatML Stack Online Feature Store via Kafka. This client is designed for ",(0,r.jsx)(n.strong,{children:"data pipeline operations"})," - reading from batch sources and publishing to Kafka for online consumption."]}),"\n",(0,r.jsx)(n.h2,{id:"installation",children:"Installation"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-bash",children:"pip install spark_feature_push_client\n"})}),"\n",(0,r.jsx)(n.h2,{id:"dependencies",children:"Dependencies"}),"\n",(0,r.jsx)(n.p,{children:"This package depends on:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:(0,r.jsx)(n.a,{href:"https://pypi.org/project/bharatml_commons/",children:"bharatml_commons"})}),": Common utilities and protobuf definitions"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"PySpark 3.0+"}),": For distributed data processing"]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"architecture-role",children:"Architecture Role"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{children:"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Batch Sources \u2502\u2500\u2500\u2500\u25b6\u2502 Spark Feature Push \u2502\u2500\u2500\u2500\u25b6\u2502 Kafka \u2502\u2500\u2500\u2500\u25b6\u2502 Online Feature \u2502\n\u2502 \u2022 Tables \u2502 \u2502 Client \u2502 \u2502 \u2502 \u2502 Store \u2502\n\u2502 \u2022 Parquet \u2502 \u2502 \u2022 Read & Transform \u2502 \u2502 \u2502 \u2502 \u2502\n\u2502 \u2022 Delta \u2502 \u2502 \u2022 Protobuf Serialize \u2502 \u2502 \u2502 \u2502 \u2502\n\u2502 \u2022 S3/GCS/ADLS \u2502 \u2502 \u2022 Batch Processing \u2502 \u2502 \u2502 \u2502 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u25b2\n \u2502\n \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n \u2502 grpc_feature_ \u2502\n \u2502 client \u2502\n \u2502 (Real-time) \u2502\n \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n"})}),"\n",(0,r.jsx)(n.h2,{id:"features",children:"Features"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Batch Source Integration"}),": Read from Tables (Hive/Delta), Parquet, and Delta files on cloud storage"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Spark Processing"}),": Leverage Apache Spark for distributed data processing"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Protobuf Serialization"}),": Convert feature data to 
protobuf format using bharatml_commons schemas"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Kafka Publishing"}),": Push serialized features to Kafka topics for online consumption"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Metadata Integration"}),": Fetch feature schemas and configurations via REST API"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Data Type Support"}),": Handle scalar and vector types (strings, numbers, booleans, arrays)"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Batch Optimization"}),": Configurable batch sizes for optimal Kafka throughput"]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"when-to-use-this-client",children:"When to Use This Client"}),"\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.strong,{children:"Use spark_feature_push_client for:"})}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:["\ud83d\udd04 ",(0,r.jsx)(n.strong,{children:"Batch ETL Pipelines"}),": Scheduled feature computation and publishing"]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udcca ",(0,r.jsx)(n.strong,{children:"Historical Data Backfill"}),": Loading historical features into online store"]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83c\udfd7\ufe0f ",(0,r.jsx)(n.strong,{children:"Data Engineering"}),": Spark-based feature transformations"]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udcc8 ",(0,r.jsx)(n.strong,{children:"Large Scale Processing"}),": Processing millions of records efficiently"]}),"\n",(0,r.jsxs)(n.li,{children:["\u26a1 ",(0,r.jsx)(n.strong,{children:"Offline-to-Online"}),": Bridge between batch and real-time systems"]}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.strong,{children:"Use grpc_feature_client for:"})}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:["\ud83d\ude80 ",(0,r.jsx)(n.strong,{children:"Real-time Operations"}),": Direct persist/retrieve operations"]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udd0d 
",(0,r.jsx)(n.strong,{children:"Interactive Queries"}),": Low-latency feature lookups"]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83c\udfaf ",(0,r.jsx)(n.strong,{children:"API Integration"}),": Service-to-service communication"]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udca8 ",(0,r.jsx)(n.strong,{children:"Single Records"}),": Persisting individual feature records"]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"quick-start",children:"Quick Start"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'from spark_feature_push_client import OnlineFeatureStorePyClient\n\n# Initialize client with metadata source\nclient = OnlineFeatureStorePyClient(\n features_metadata_source_url="https://api.example.com/metadata",\n job_id="feature-pipeline-job",\n job_token="your-auth-token"\n)\n\n# Get feature configuration \nfeature_details = client.get_features_details()\n\n# Process your Spark DataFrame\nproto_df = client.generate_df_with_protobuf_messages(your_spark_df)\n\n# Push to Kafka\nclient.write_protobuf_df_to_kafka(\n proto_df,\n kafka_bootstrap_servers="localhost:9092",\n kafka_topic="features.user_features"\n)\n'})}),"\n",(0,r.jsx)(n.h2,{id:"related-packages",children:"Related Packages"}),"\n",(0,r.jsx)(n.p,{children:"This package is part of the BharatML Stack ecosystem:"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:(0,r.jsx)(n.a,{href:"https://pypi.org/project/bharatml_commons/",children:"bharatml_commons"})}),": Common utilities and protobuf definitions (required dependency)"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:(0,r.jsx)(n.a,{href:"https://pypi.org/project/grpc_feature_client/",children:"grpc_feature_client"})}),": High-performance gRPC client for real-time operations"]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,r.jsxs)(n.p,{children:["Licensed under the BharatMLStack Business Source License 1.1. 
See ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"LICENSE"})," for details."]}),"\n",(0,r.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,r.jsxs)(n.p,{children:["We welcome contributions! Please see our ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTION.md",children:"Contributing Guide"})," for details."]}),"\n",(0,r.jsx)(n.h2,{id:"prerequisites",children:"Prerequisites"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Apache Spark 3.0+"}),": For distributed processing"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Kafka Connector"}),": ",(0,r.jsx)(n.code,{children:"spark-sql-kafka"})," for Kafka integration"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Java 8/11"}),": Required by Spark"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"bharatml_common"}),": For protobuf schemas"]}),"\n"]}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Example Spark session setup\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder \\\n .appName("FeaturePipeline") \\\n .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0") \\\n .getOrCreate()\n'})}),"\n",(0,r.jsx)(n.h2,{id:"supported-data-sources",children:"Supported Data Sources"}),"\n",(0,r.jsx)(n.h3,{id:"1-database-tables",children:"1. Database Tables"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Hive/Delta tables\ndf = spark.sql("SELECT * FROM feature_db.user_features")\n'})}),"\n",(0,r.jsx)(n.h3,{id:"2-cloud-storage---parquet",children:"2. 
Cloud Storage - Parquet"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# AWS S3\ndf = spark.read.parquet("s3a://bucket/path/to/features/")\n\n# Google Cloud Storage \ndf = spark.read.parquet("gs://bucket/path/to/features/")\n\n# Azure Data Lake\ndf = spark.read.parquet("abfss://container@account.dfs.core.windows.net/path/")\n'})}),"\n",(0,r.jsx)(n.h3,{id:"3-cloud-storage---delta",children:"3. Cloud Storage - Delta"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Delta format on cloud storage\ndf = spark.read.format("delta").load("s3a://bucket/delta-table/")\n'})}),"\n",(0,r.jsx)(n.h2,{id:"configuration-examples",children:"Configuration Examples"}),"\n",(0,r.jsx)(n.h3,{id:"basic-pipeline",children:"Basic Pipeline"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'from pyspark.sql import SparkSession\nfrom spark_feature_push_client import OnlineFeatureStorePyClient\n\n# Create Spark session\nspark = SparkSession.builder \\\n .appName("FeatureETL") \\\n .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0") \\\n .getOrCreate()\n\n# Initialize client\nclient = OnlineFeatureStorePyClient(\n features_metadata_source_url="https://metadata-service.example.com/api/v1/features",\n job_id="daily-feature-pipeline",\n job_token="pipeline-secret-token",\n fgs_to_consider=["user_demographics", "user_behavior"] # Optional: filter feature groups\n)\n\n# Get metadata and column mappings\n(\n offline_src_type_columns,\n offline_col_to_default_values_map, \n entity_column_names\n) = client.get_features_details()\n\nprint(f"Entity columns: {entity_column_names}")\nprint(f"Feature mappings: {offline_src_type_columns}")\n'})}),"\n",(0,r.jsx)(n.h3,{id:"reading-from-multiple-sources",children:"Reading from Multiple Sources"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'def 
get_features_from_all_sources(spark, entity_columns, feature_mapping, default_values):\n """\n Read and combine features from multiple offline sources\n """\n dataframes = []\n \n for source_info in feature_mapping:\n table_name, source_type, feature_list = source_info\n \n if source_type == "TABLE":\n # Read from Hive/Delta table\n df = spark.table(table_name)\n \n elif source_type.startswith("PARQUET_"):\n # Read from Parquet files\n df = spark.read.parquet(table_name)\n \n elif source_type.startswith("DELTA_"):\n # Read from Delta files\n df = spark.read.format("delta").load(table_name)\n \n # Select and rename columns\n select_cols = entity_columns.copy()\n for original_col, renamed_col in feature_list:\n if original_col in df.columns:\n df = df.withColumnRenamed(original_col, renamed_col)\n select_cols.append(renamed_col)\n \n df = df.select(select_cols)\n dataframes.append(df)\n \n # Union all dataframes\n if dataframes:\n combined_df = dataframes[0]\n for df in dataframes[1:]:\n combined_df = combined_df.unionByName(df, allowMissingColumns=True)\n \n # Fill missing values with defaults\n for col, default_val in default_values.items():\n if col in combined_df.columns:\n combined_df = combined_df.fillna({col: default_val})\n \n return combined_df\n \n return None\n\n# Use the function\ndf = get_features_from_all_sources(\n spark, \n entity_column_names, \n offline_src_type_columns, \n offline_col_to_default_values_map\n)\n'})}),"\n",(0,r.jsx)(n.h3,{id:"protobuf-serialization--kafka-publishing",children:"Protobuf Serialization & Kafka Publishing"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Convert DataFrame to protobuf messages\n# This creates binary protobuf messages suitable for Kafka\nproto_df = client.generate_df_with_protobuf_messages(\n df, \n intra_batch_size=20 # Batch size for serialization\n)\n\n# The proto_df has schema: [value: binary, intra_batch_id: long]\nproto_df.printSchema()\n# root\n# |-- value: 
binary (nullable = false) \n# |-- intra_batch_id: long (nullable = false)\n\n# Write to Kafka with batching for better throughput\nclient.write_protobuf_df_to_kafka(\n proto_df,\n kafka_bootstrap_servers="broker1:9092,broker2:9092,broker3:9092",\n kafka_topic="features.user_features",\n additional_options={\n "kafka.acks": "all",\n "kafka.retries": "3",\n "kafka.compression.type": "snappy"\n },\n kafka_num_batches=4 # Split into 4 parallel Kafka writes\n)\n'})}),"\n",(0,r.jsx)(n.h2,{id:"data-type-handling",children:"Data Type Handling"}),"\n",(0,r.jsx)(n.p,{children:"The client automatically handles the protobuf data type mappings:"}),"\n",(0,r.jsx)(n.h3,{id:"scalar-types",children:"Scalar Types"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Example DataFrame with different types\ndata = [\n ("user123", 25, 185.5, True, "premium"), # int, float, bool, string\n ("user456", 30, 170.0, False, "basic")\n]\ndf = spark.createDataFrame(data, ["user_id", "age", "height", "is_premium", "tier"])\n\n# Automatically mapped to protobuf:\n# age -> int32_values\n# height -> fp32_values \n# is_premium -> bool_values\n# tier -> string_values\n'})}),"\n",(0,r.jsx)(n.h3,{id:"vector-types",children:"Vector Types"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Example with vector/array features\nfrom pyspark.sql.functions import array, lit\n\ndf = spark.createDataFrame([\n ("user123", [0.1, 0.2, 0.3], ["tech", "sports"], [1, 2, 3])\n], ["user_id", "embeddings", "interests", "scores"])\n\n# Automatically mapped to protobuf vectors:\n# embeddings -> fp32_values in Vector\n# interests -> string_values in Vector\n# scores -> int32_values in Vector\n'})}),"\n",(0,r.jsx)(n.h2,{id:"production-pipeline-example",children:"Production Pipeline Example"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'def run_feature_pipeline():\n """\n Complete feature pipeline from 
batch sources to Kafka\n """\n import os\n \n # 1. Initialize Spark\n spark = SparkSession.builder \\\n .appName("DailyFeaturePipeline") \\\n .config("spark.sql.adaptive.enabled", "true") \\\n .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0") \\\n .getOrCreate()\n \n try:\n # 2. Initialize feature client\n client = OnlineFeatureStorePyClient(\n features_metadata_source_url=os.getenv("METADATA_URL"),\n job_id=os.getenv("JOB_ID"),\n job_token=os.getenv("JOB_TOKEN")\n )\n \n # 3. Get feature configuration\n feature_mapping, default_values, entity_columns = client.get_features_details()\n \n # 4. Read and process data\n df = get_features_from_all_sources(spark, entity_columns, feature_mapping, default_values)\n \n if df is None or df.count() == 0:\n raise ValueError("No data found in sources")\n \n # 5. Convert to protobuf\n proto_df = client.generate_df_with_protobuf_messages(df, intra_batch_size=50)\n \n # 6. Publish to Kafka\n client.write_protobuf_df_to_kafka(\n proto_df,\n kafka_bootstrap_servers=os.getenv("KAFKA_BROKERS"),\n kafka_topic=os.getenv("KAFKA_TOPIC"),\n additional_options={\n "kafka.acks": "all",\n "kafka.compression.type": "snappy",\n "kafka.max.request.size": "10485760" # 10MB\n },\n kafka_num_batches=int(os.getenv("KAFKA_BATCHES", "4"))\n )\n \n print(f"\u2705 Successfully processed {df.count()} records")\n \n finally:\n spark.stop()\n\nif __name__ == "__main__":\n run_feature_pipeline()\n'})}),"\n",(0,r.jsx)(n.h2,{id:"configuration-options",children:"Configuration Options"}),"\n",(0,r.jsx)(n.h3,{id:"client-configuration",children:"Client Configuration"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'client = OnlineFeatureStorePyClient(\n features_metadata_source_url="https://api.example.com/metadata", # Required\n job_id="pipeline-job-001", # Required \n job_token="secret-token-123", # Required\n fgs_to_consider=["user_features", "item_features"] # Optional: filter feature 
groups\n)\n'})}),"\n",(0,r.jsx)(n.h3,{id:"protobuf-serialization-options",children:"Protobuf Serialization Options"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:"proto_df = client.generate_df_with_protobuf_messages(\n df,\n intra_batch_size=20 # Records per protobuf message (default: 20)\n)\n"})}),"\n",(0,r.jsx)(n.h3,{id:"kafka-publishing-options",children:"Kafka Publishing Options"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'client.write_protobuf_df_to_kafka(\n proto_df,\n kafka_bootstrap_servers="localhost:9092",\n kafka_topic="features.topic",\n additional_options={\n "kafka.acks": "all", # Acknowledgment level\n "kafka.retries": "3", # Retry attempts\n "kafka.compression.type": "snappy", # Compression\n "kafka.batch.size": "16384", # Batch size\n "kafka.linger.ms": "100", # Batching delay\n "kafka.max.request.size": "10485760" # Max message size\n },\n kafka_num_batches=1 # Number of parallel Kafka writers (default: 1)\n)\n'})}),"\n",(0,r.jsx)(n.h2,{id:"performance-tuning",children:"Performance Tuning"}),"\n",(0,r.jsx)(n.h3,{id:"spark-optimizations",children:"Spark Optimizations"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'spark = SparkSession.builder \\\n .appName("FeaturePipeline") \\\n .config("spark.sql.adaptive.enabled", "true") \\\n .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \\\n .config("spark.sql.adaptive.skewJoin.enabled", "true") \\\n .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \\\n .getOrCreate()\n'})}),"\n",(0,r.jsx)(n.h3,{id:"memory-management",children:"Memory Management"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:"# For large datasets, consider:\ndf = df.repartition(200) # Optimal partition count\ndf.cache() # Cache if reused multiple times\n"})}),"\n",(0,r.jsx)(n.h3,{id:"kafka-throughput",children:"Kafka 
Throughput"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# For high-throughput scenarios:\nclient.write_protobuf_df_to_kafka(\n proto_df,\n kafka_bootstrap_servers="brokers",\n kafka_topic="topic", \n kafka_num_batches=8, # Increase parallel writers\n additional_options={\n "kafka.batch.size": "65536", # Larger batches\n "kafka.linger.ms": "100", # Allow batching delay\n "kafka.compression.type": "lz4" # Fast compression\n }\n)\n'})}),"\n",(0,r.jsx)(n.h2,{id:"monitoring--debugging",children:"Monitoring & Debugging"}),"\n",(0,r.jsx)(n.h3,{id:"dataframe-inspection",children:"DataFrame Inspection"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Check data before processing\nprint(f"Records: {df.count()}")\nprint(f"Columns: {df.columns}")\ndf.printSchema()\ndf.show(5)\n\n# Check protobuf output\nproto_df.show(5, truncate=False)\nprint(f"Protobuf messages: {proto_df.count()}")\n'})}),"\n",(0,r.jsx)(n.h3,{id:"error-handling",children:"Error Handling"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'try:\n proto_df = client.generate_df_with_protobuf_messages(df)\n client.write_protobuf_df_to_kafka(proto_df, brokers, topic)\n \nexcept Exception as e:\n print(f"Pipeline failed: {e}")\n # Log to monitoring system\n # Send alerts\n raise\n'})}),"\n",(0,r.jsx)(n.h2,{id:"integration-with-other-sdks",children:"Integration with Other SDKs"}),"\n",(0,r.jsx)(n.h3,{id:"with-grpc-feature-client",children:"With gRPC Feature Client"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:"# Spark client pushes features to Kafka\nspark_client = OnlineFeatureStorePyClient(...)\nspark_client.write_protobuf_df_to_kafka(proto_df, brokers, topic)\n\n# gRPC client retrieves features in real-time\nfrom grpc_feature_client import GRPCFeatureClient\ngrpc_client = GRPCFeatureClient(config)\nfeatures = 
grpc_client.retrieve_decoded_features(...)\n"})}),"\n",(0,r.jsx)(n.h3,{id:"with-http-feature-client-bharatml_common",children:"With HTTP Feature Client (bharatml_common)"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:"# Use HTTP client for metadata validation\nfrom bharatml_common import HTTPFeatureClient\nhttp_client = HTTPFeatureClient(base_url, job_id, token)\nmetadata = http_client.get_feature_metadata()\n\n# Validate feature names using shared utilities\nfrom bharatml_common import clean_column_name\nclean_features = [clean_column_name(name) for name in feature_names]\n\n# Process with Spark client\nspark_client.generate_df_with_protobuf_messages(df)\n"})}),"\n",(0,r.jsx)(n.h2,{id:"common-use-cases",children:"Common Use Cases"}),"\n",(0,r.jsx)(n.h3,{id:"1-daily-batch-etl",children:"1. Daily Batch ETL"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-bash",children:"# Cron job: 0 2 * * * (daily at 2 AM)\nspark-submit \\\n --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0 \\\n --conf spark.sql.adaptive.enabled=true \\\n daily_feature_pipeline.py\n"})}),"\n",(0,r.jsx)(n.h3,{id:"2-historical-backfill",children:"2. Historical Backfill"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Backfill last 30 days\nfrom datetime import datetime, timedelta\n\nfor i in range(30):\n date = datetime.now() - timedelta(days=i)\n df = spark.sql(f"""\n SELECT * FROM features \n WHERE date = \'{date.strftime(\'%Y-%m-%d\')}\'\n """)\n \n proto_df = client.generate_df_with_protobuf_messages(df)\n client.write_protobuf_df_to_kafka(proto_df, brokers, f"backfill.{date.strftime(\'%Y%m%d\')}")\n'})}),"\n",(0,r.jsx)(n.h3,{id:"3-real-time-streaming-advanced",children:"3. 
Real-time Streaming (Advanced)"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Read from streaming source, process, and publish\nstreaming_df = spark.readStream \\\n .format("kafka") \\\n .option("kafka.bootstrap.servers", input_brokers) \\\n .option("subscribe", input_topic) \\\n .load()\n\n# Process streaming DataFrame\nprocessed_df = streaming_df.select(...)\n\n# Write to output Kafka (structured streaming; the Kafka sink requires a checkpoint location)\nquery = processed_df.writeStream \\\n .format("kafka") \\\n .option("kafka.bootstrap.servers", output_brokers) \\\n .option("topic", output_topic) \\\n .option("checkpointLocation", checkpoint_dir) \\\n .start()\n'})}),"\n",(0,r.jsx)(n.h2,{id:"troubleshooting",children:"Troubleshooting"}),"\n",(0,r.jsx)(n.h3,{id:"common-issues",children:"Common Issues"}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.strong,{children:"OutOfMemoryError"})}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Increase driver memory or reduce partition size\nspark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "50")\n'})}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.strong,{children:"Kafka Connection Timeout"})}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Check network connectivity and broker addresses\nadditional_options = {\n "kafka.request.timeout.ms": "60000",\n "kafka.session.timeout.ms": "30000"\n}\n'})}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.strong,{children:"Protobuf Serialization Errors"})}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'# Check data types and null values\ndf = df.fillna({"string_col": "", "numeric_col": 0})\n'})}),"\n"]}),"\n",(0,r.jsxs)(n.li,{children:["\n",(0,r.jsx)(n.p,{children:(0,r.jsx)(n.strong,{children:"Metadata API 
Errors"})}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:"# Verify job_id, job_token, and URL\n# Check API server logs\n"})}),"\n"]}),"\n"]}),"\n",(0,r.jsx)(n.h3,{id:"debug-mode",children:"Debug Mode"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:'import logging\nlogging.basicConfig(level=logging.DEBUG)\n\n# Enable Spark SQL logging\nspark.sparkContext.setLogLevel("INFO")\n'})}),"\n",(0,r.jsx)(n.h2,{id:"migration-from-legacy-clients",children:"Migration from Legacy Clients"}),"\n",(0,r.jsx)(n.p,{children:"If migrating from older versions:"}),"\n",(0,r.jsx)(n.pre,{children:(0,r.jsx)(n.code,{className:"language-python",children:"# Old import\n# from online_feature_store_py_client import OnlineFeatureStorePyClient\n\n# New import (same interface)\nfrom spark_feature_push_client import OnlineFeatureStorePyClient\n\n# API remains the same - no code changes needed!\n"})}),"\n",(0,r.jsx)(n.h2,{id:"best-practices",children:"Best Practices"}),"\n",(0,r.jsxs)(n.ol,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Resource Management"}),": Always stop Spark sessions"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Error Handling"}),": Implement proper exception handling and retries"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Monitoring"}),": Add metrics and logging to your pipelines"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Testing"}),": Test with sample data before production runs"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Security"}),": Use secure Kafka configurations in production"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Performance"}),": Monitor Spark UI for optimization opportunities"]}),"\n"]}),"\n",(0,r.jsx)(n.p,{children:"The Spark Feature Push Client is your gateway from batch data sources to the real-time online feature store! 
\ud83d\ude80"}),"\n",(0,r.jsx)(n.h2,{id:"contributing-1",children:"Contributing"}),"\n",(0,r.jsxs)(n.p,{children:["We welcome contributions from the community! Please see our ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,r.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:["\ud83d\udcac ",(0,r.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,r.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udc1b ",(0,r.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,r.jsxs)(n.li,{children:["\ud83d\udce7 ",(0,r.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,r.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"license-1",children:"License"}),"\n",(0,r.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 1.1"}),"."]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)("div",{align:"center",children:(0,r.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,r.jsx)("div",{align:"center",children:(0,r.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function u(e={}){const{wrapper:n}={...(0,i.R)(),...e.components};return n?(0,r.jsx)(n,{...e,children:(0,r.jsx)(d,{...e})}):d(e)}},8453:(e,n,a)=>{a.d(n,{R:()=>s,x:()=>o});var t=a(6540);const r={},i=t.createContext(r);function s(e){const n=t.useContext(i);return t.useMemo(function(){return"function"==typeof 
e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(r):e.components||r:s(e.components),t.createElement(i.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/ae7a6e8a.58ed54b2.js b/docs/assets/js/ae7a6e8a.58ed54b2.js new file mode 100644 index 00000000..8cc02b7b --- /dev/null +++ b/docs/assets/js/ae7a6e8a.58ed54b2.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[8591],{4795:(e,t,r)=>{r.d(t,{A:()=>j});r(6540);var n=r(4164),s=r(6972),o=r(8774),i=r(5846),c=r(6654),a=r(1312),l=r(1107);const u={cardContainer:"cardContainer_fWXF",cardTitle:"cardTitle_rnsV",cardDescription:"cardDescription_PWke"};var d=r(4848);function f({className:e,href:t,children:r}){return(0,d.jsx)(o.A,{href:t,className:(0,n.A)("card padding--lg",u.cardContainer,e),children:r})}function m({className:e,href:t,icon:r,title:s,description:o}){return(0,d.jsxs)(f,{href:t,className:e,children:[(0,d.jsxs)(l.A,{as:"h2",className:(0,n.A)("text--truncate",u.cardTitle),title:s,children:[r," ",s]}),o&&(0,d.jsx)("p",{className:(0,n.A)("text--truncate",u.cardDescription),title:o,children:o})]})}function h({item:e}){const t=(0,s.Nr)(e),r=function(){const{selectMessage:e}=(0,i.W)();return t=>e(t,(0,a.T)({message:"1 item|{count} items",id:"theme.docs.DocCard.categoryDescription.plurals",description:"The default description for a category card in the generated index about how many items this category includes"},{count:t}))}();return t?(0,d.jsx)(m,{className:e.className,href:t,icon:"\ud83d\uddc3\ufe0f",title:e.label,description:e.description??r(e.items.length)}):null}function p({item:e}){const t=(0,c.A)(e.href)?"\ud83d\udcc4\ufe0f":"\ud83d\udd17",r=(0,s.cC)(e.docId??void 0);return(0,d.jsx)(m,{className:e.className,href:e.href,icon:t,title:e.label,description:e.description??r?.description})}function 
v({item:e}){switch(e.type){case"link":return(0,d.jsx)(p,{item:e});case"category":return(0,d.jsx)(h,{item:e});default:throw new Error(`unknown item type ${JSON.stringify(e)}`)}}const x={docCardListItem:"docCardListItem_W1sv"};function g({className:e}){const t=(0,s.a4)();return(0,d.jsx)(j,{items:t,className:e})}function w({item:e}){return(0,d.jsx)("article",{className:(0,n.A)(x.docCardListItem,"col col--6"),children:(0,d.jsx)(v,{item:e})})}function j(e){const{items:t,className:r}=e;if(!t)return(0,d.jsx)(g,{...e});const o=(0,s.d1)(t);return(0,d.jsx)("section",{className:(0,n.A)("row",r),children:o.map((e,t)=>(0,d.jsx)(w,{item:e},t))})}},5846:(e,t,r)=>{r.d(t,{W:()=>l});var n=r(6540),s=r(4586);const o=["zero","one","two","few","many","other"];function i(e){return o.filter(t=>e.includes(t))}const c={locale:"en",pluralForms:i(["one","other"]),select:e=>1===e?"one":"other"};function a(){const{i18n:{currentLocale:e}}=(0,s.A)();return(0,n.useMemo)(()=>{try{return function(e){const t=new Intl.PluralRules(e);return{locale:e,pluralForms:i(t.resolvedOptions().pluralCategories),select:e=>t.select(e)}}(e)}catch(t){return console.error(`Failed to use Intl.PluralRules for locale "${e}".\nDocusaurus will fallback to the default (English) implementation.\nError: ${t.message}\n`),c}},[e])}function l(){const e=a();return{selectMessage:(t,r)=>function(e,t,r){const n=e.split("|");if(1===n.length)return n[0];n.length>r.pluralForms.length&&console.error(`For locale=${r.locale}, a maximum of ${r.pluralForms.length} plural forms are expected (${r.pluralForms.join(",")}), but the message contains ${n.length}: ${e}`);const s=r.select(t),o=r.pluralForms.indexOf(s);return n[Math.min(o,n.length-1)]}(r,t,e)}}},8453:(e,t,r)=>{r.d(t,{R:()=>i,x:()=>c});var n=r(6540);const s={},o=n.createContext(s);function i(e){const t=n.useContext(o);return n.useMemo(function(){return"function"==typeof e?e(t):{...t,...e}},[t,e])}function c(e){let t;return t=e.disableParentContext?"function"==typeof 
e.components?e.components(s):e.components||s:i(e.components),n.createElement(o.Provider,{value:t},e.children)}},8670:(e,t,r)=>{r.r(t),r.d(t,{assets:()=>l,contentTitle:()=>a,default:()=>f,frontMatter:()=>c,metadata:()=>n,toc:()=>u});const n=JSON.parse('{"id":"inferflow/v1.0.0/index","title":"v1.0.0","description":"Inferflow v1.0.0","source":"@site/docs/inferflow/v1.0.0/index.md","sourceDirName":"inferflow/v1.0.0","slug":"/inferflow/v1.0.0","permalink":"/BharatMLStack/inferflow/v1.0.0","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/inferflow/v1.0.0/index.md","tags":[],"version":"current","sidebarPosition":0,"frontMatter":{"title":"v1.0.0","description":"Inferflow v1.0.0","sidebar_position":0,"slug":"/inferflow/v1.0.0"},"sidebar":"tutorialSidebar","previous":{"title":"Inferflow","permalink":"/BharatMLStack/category/inferflow"},"next":{"title":"Architecture","permalink":"/BharatMLStack/inferflow/v1.0.0/architecture"}}');var s=r(4848),o=r(8453),i=r(4795);const c={title:"v1.0.0",description:"Inferflow v1.0.0",sidebar_position:0,slug:"/inferflow/v1.0.0"},a="Inferflow v1.0.0",l={},u=[];function d(e){const t={h1:"h1",header:"header",p:"p",...(0,o.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(t.header,{children:(0,s.jsx)(t.h1,{id:"inferflow-v100",children:"Inferflow v1.0.0"})}),"\n",(0,s.jsx)(t.p,{children:"Inferflow is a graph-driven feature retrieval and model inference orchestration engine. 
It dynamically resolves entity relationships via configurable DAGs, retrieves features from the Online Feature Store, and orchestrates model scoring."}),"\n",(0,s.jsx)(i.A,{})]})}function f(e={}){const{wrapper:t}={...(0,o.R)(),...e.components};return t?(0,s.jsx)(t,{...e,children:(0,s.jsx)(d,{...e})}):d(e)}}}]); \ No newline at end of file diff --git a/docs/assets/js/b0267ac9.2ed3e1de.js b/docs/assets/js/b0267ac9.2ed3e1de.js new file mode 100644 index 00000000..225e475f --- /dev/null +++ b/docs/assets/js/b0267ac9.2ed3e1de.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[1965],{2604:(e,t,r)=>{r.r(t),r.d(t,{assets:()=>l,contentTitle:()=>a,default:()=>d,frontMatter:()=>o,metadata:()=>n,toc:()=>u});const n=JSON.parse('{"id":"numerix/v1.0.0/index","title":"v1.0.0","description":"Numerix v1.0.0","source":"@site/docs/numerix/v1.0.0/index.md","sourceDirName":"numerix/v1.0.0","slug":"/numerix/v1.0.0","permalink":"/BharatMLStack/numerix/v1.0.0","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/numerix/v1.0.0/index.md","tags":[],"version":"current","sidebarPosition":0,"frontMatter":{"title":"v1.0.0","description":"Numerix v1.0.0","sidebar_position":0,"slug":"/numerix/v1.0.0"},"sidebar":"tutorialSidebar","previous":{"title":"Numerix","permalink":"/BharatMLStack/category/numerix"},"next":{"title":"Architecture","permalink":"/BharatMLStack/numerix/v1.0.0/architecture"}}');var s=r(4848),i=r(8453),c=r(4795);const o={title:"v1.0.0",description:"Numerix v1.0.0",sidebar_position:0,slug:"/numerix/v1.0.0"},a="Numerix v1.0.0",l={},u=[];function m(e){const t={h1:"h1",header:"header",p:"p",...(0,i.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(t.header,{children:(0,s.jsx)(t.h1,{id:"numerix-v100",children:"Numerix v1.0.0"})}),"\n",(0,s.jsx)(t.p,{children:"Numerix is a mathematical compute engine for BharatML Stack. 
It is used to perform mathematical operations on matrices and vectors."}),"\n",(0,s.jsx)(c.A,{})]})}function d(e={}){const{wrapper:t}={...(0,i.R)(),...e.components};return t?(0,s.jsx)(t,{...e,children:(0,s.jsx)(m,{...e})}):m(e)}},4795:(e,t,r)=>{r.d(t,{A:()=>j});r(6540);var n=r(4164),s=r(6972),i=r(8774),c=r(5846),o=r(6654),a=r(1312),l=r(1107);const u={cardContainer:"cardContainer_fWXF",cardTitle:"cardTitle_rnsV",cardDescription:"cardDescription_PWke"};var m=r(4848);function d({className:e,href:t,children:r}){return(0,m.jsx)(i.A,{href:t,className:(0,n.A)("card padding--lg",u.cardContainer,e),children:r})}function h({className:e,href:t,icon:r,title:s,description:i}){return(0,m.jsxs)(d,{href:t,className:e,children:[(0,m.jsxs)(l.A,{as:"h2",className:(0,n.A)("text--truncate",u.cardTitle),title:s,children:[r," ",s]}),i&&(0,m.jsx)("p",{className:(0,n.A)("text--truncate",u.cardDescription),title:i,children:i})]})}function p({item:e}){const t=(0,s.Nr)(e),r=function(){const{selectMessage:e}=(0,c.W)();return t=>e(t,(0,a.T)({message:"1 item|{count} items",id:"theme.docs.DocCard.categoryDescription.plurals",description:"The default description for a category card in the generated index about how many items this category includes"},{count:t}))}();return t?(0,m.jsx)(h,{className:e.className,href:t,icon:"\ud83d\uddc3\ufe0f",title:e.label,description:e.description??r(e.items.length)}):null}function f({item:e}){const t=(0,o.A)(e.href)?"\ud83d\udcc4\ufe0f":"\ud83d\udd17",r=(0,s.cC)(e.docId??void 0);return(0,m.jsx)(h,{className:e.className,href:e.href,icon:t,title:e.label,description:e.description??r?.description})}function x({item:e}){switch(e.type){case"link":return(0,m.jsx)(f,{item:e});case"category":return(0,m.jsx)(p,{item:e});default:throw new Error(`unknown item type ${JSON.stringify(e)}`)}}const v={docCardListItem:"docCardListItem_W1sv"};function g({className:e}){const t=(0,s.a4)();return(0,m.jsx)(j,{items:t,className:e})}function 
N({item:e}){return(0,m.jsx)("article",{className:(0,n.A)(v.docCardListItem,"col col--6"),children:(0,m.jsx)(x,{item:e})})}function j(e){const{items:t,className:r}=e;if(!t)return(0,m.jsx)(g,{...e});const i=(0,s.d1)(t);return(0,m.jsx)("section",{className:(0,n.A)("row",r),children:i.map((e,t)=>(0,m.jsx)(N,{item:e},t))})}},5846:(e,t,r)=>{r.d(t,{W:()=>l});var n=r(6540),s=r(4586);const i=["zero","one","two","few","many","other"];function c(e){return i.filter(t=>e.includes(t))}const o={locale:"en",pluralForms:c(["one","other"]),select:e=>1===e?"one":"other"};function a(){const{i18n:{currentLocale:e}}=(0,s.A)();return(0,n.useMemo)(()=>{try{return function(e){const t=new Intl.PluralRules(e);return{locale:e,pluralForms:c(t.resolvedOptions().pluralCategories),select:e=>t.select(e)}}(e)}catch(t){return console.error(`Failed to use Intl.PluralRules for locale "${e}".\nDocusaurus will fallback to the default (English) implementation.\nError: ${t.message}\n`),o}},[e])}function l(){const e=a();return{selectMessage:(t,r)=>function(e,t,r){const n=e.split("|");if(1===n.length)return n[0];n.length>r.pluralForms.length&&console.error(`For locale=${r.locale}, a maximum of ${r.pluralForms.length} plural forms are expected (${r.pluralForms.join(",")}), but the message contains ${n.length}: ${e}`);const s=r.select(t),i=r.pluralForms.indexOf(s);return n[Math.min(i,n.length-1)]}(r,t,e)}}},8453:(e,t,r)=>{r.d(t,{R:()=>c,x:()=>o});var n=r(6540);const s={},i=n.createContext(s);function c(e){const t=n.useContext(i);return n.useMemo(function(){return"function"==typeof e?e(t):{...t,...e}},[t,e])}function o(e){let t;return t=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:c(e.components),n.createElement(i.Provider,{value:t},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/bba9e323.c204228f.js b/docs/assets/js/bba9e323.c204228f.js new file mode 100644 index 00000000..2bacd6f0 --- /dev/null +++ b/docs/assets/js/bba9e323.c204228f.js @@ 
-0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[6812],{2486:(e,n,r)=>{r.r(n),r.d(n,{assets:()=>l,contentTitle:()=>a,default:()=>h,frontMatter:()=>i,metadata:()=>t,toc:()=>c});const t=JSON.parse('{"id":"predator/v1.0.0/release-notes","title":"Release Notes","description":"Version 1.0.0","source":"@site/docs/predator/v1.0.0/release-notes.md","sourceDirName":"predator/v1.0.0","slug":"/predator/v1.0.0/release-notes","permalink":"/BharatMLStack/predator/v1.0.0/release-notes","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/predator/v1.0.0/release-notes.md","tags":[],"version":"current","sidebarPosition":3,"frontMatter":{"title":"Release Notes","sidebar_position":3},"sidebar":"tutorialSidebar","previous":{"title":"Key Functionalities","permalink":"/BharatMLStack/predator/v1.0.0/functionalities"}}');var s=r(4848),o=r(8453);const i={title:"Release Notes",sidebar_position:3},a="Predator - Release Notes",l={},c=[{value:"Version 1.0.0",id:"version-100",level:2},{value:"What's New",id:"whats-new",level:3}];function d(e){const n={br:"br",code:"code",h1:"h1",h2:"h2",h3:"h3",header:"header",li:"li",p:"p",strong:"strong",ul:"ul",...(0,o.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(n.header,{children:(0,s.jsx)(n.h1,{id:"predator---release-notes",children:"Predator - Release Notes"})}),"\n",(0,s.jsx)(n.h2,{id:"version-100",children:"Version 1.0.0"}),"\n",(0,s.jsxs)(n.p,{children:[(0,s.jsx)(n.strong,{children:"Release Date"}),": June 2025",(0,s.jsx)(n.br,{}),"\n",(0,s.jsx)(n.strong,{children:"Status"}),": General Availability (GA)"]}),"\n",(0,s.jsxs)(n.p,{children:["First stable release of ",(0,s.jsx)(n.strong,{children:"Predator"})," \u2014 scalable model inference service built around ",(0,s.jsx)(n.strong,{children:"NVIDIA Triton Inference Server"}),", part of BharatMLStack. 
Serves Deep Learning and tree-based models with low latency in ",(0,s.jsx)(n.strong,{children:"Kubernetes"}),"; integrates with ",(0,s.jsx)(n.strong,{children:"OnFS"})," and ",(0,s.jsx)(n.strong,{children:"Inferflow"}),"; clients use the ",(0,s.jsx)(n.strong,{children:"Helix client"})," over gRPC."]}),"\n",(0,s.jsx)(n.h3,{id:"whats-new",children:"What's New"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Triton inference engine"}),": Unified runtime for DL and tree-based models on CPU/GPU; model repository via Init Container from GCS; gRPC API via Helix client."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Multi-backend support"}),": TensorRT, PyTorch, ONNX Runtime, TensorFlow, Python, FIL, DALI, Custom."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Dynamic batching & concurrency"}),": Configurable via ",(0,s.jsx)(n.code,{children:"config.pbtxt"}),"; model versioning and ensembles."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Kubernetes deployment"}),": Helm-based; Init Container + Triton container; custom Triton images from Artifact Registry; health probes; CPU/GPU autoscaling."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Observability"}),": Prometheus metrics, Grafana; warmup requests for cold-start avoidance."]}),"\n"]})]})}function h(e={}){const{wrapper:n}={...(0,o.R)(),...e.components};return n?(0,s.jsx)(n,{...e,children:(0,s.jsx)(d,{...e})}):d(e)}},8453:(e,n,r)=>{r.d(n,{R:()=>i,x:()=>a});var t=r(6540);const s={},o=t.createContext(s);function i(e){const n=t.useContext(o);return t.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function a(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:i(e.components),t.createElement(o.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/bcee635f.b2209c62.js 
b/docs/assets/js/bcee635f.b2209c62.js new file mode 100644 index 00000000..ea1f3e3c --- /dev/null +++ b/docs/assets/js/bcee635f.b2209c62.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[4164],{4459:(e,t,n)=>{n.r(t),n.d(t,{assets:()=>l,contentTitle:()=>a,default:()=>m,frontMatter:()=>i,metadata:()=>r,toc:()=>d});const r=JSON.parse('{"id":"sdks/go/v1.0.0/index","title":"v1.0.0","description":"Go SDK v1.0.0","source":"@site/docs/sdks/go/v1.0.0/index.md","sourceDirName":"sdks/go/v1.0.0","slug":"/sdks/go/v1.0.0","permalink":"/BharatMLStack/sdks/go/v1.0.0","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/sdks/go/v1.0.0/index.md","tags":[],"version":"current","sidebarPosition":0,"frontMatter":{"title":"v1.0.0","description":"Go SDK v1.0.0","sidebar_position":0,"slug":"/sdks/go/v1.0.0"},"sidebar":"tutorialSidebar","previous":{"title":"Go SDK","permalink":"/BharatMLStack/category/go-sdk"},"next":{"title":"GRPC Feature client","permalink":"/BharatMLStack/sdks/go/v1.0.0/feature_client"}}');var s=n(4848),o=n(8453),c=n(4795);const i={title:"v1.0.0",description:"Go SDK v1.0.0",sidebar_position:0,slug:"/sdks/go/v1.0.0"},a="Go SDK v1.0.0",l={},d=[];function u(e){const t={h1:"h1",header:"header",p:"p",...(0,o.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(t.header,{children:(0,s.jsx)(t.h1,{id:"go-sdk-v100",children:"Go SDK v1.0.0"})}),"\n",(0,s.jsx)(t.p,{children:"Go client libraries and packages for interacting with the BharatML Stack online feature store, including gRPC clients and protocol buffer definitions."}),"\n",(0,s.jsx)(c.A,{})]})}function m(e={}){const{wrapper:t}={...(0,o.R)(),...e.components};return t?(0,s.jsx)(t,{...e,children:(0,s.jsx)(u,{...e})}):u(e)}},4795:(e,t,n)=>{n.d(t,{A:()=>j});n(6540);var r=n(4164),s=n(6972),o=n(8774),c=n(5846),i=n(6654),a=n(1312),l=n(1107);const 
d={cardContainer:"cardContainer_fWXF",cardTitle:"cardTitle_rnsV",cardDescription:"cardDescription_PWke"};var u=n(4848);function m({className:e,href:t,children:n}){return(0,u.jsx)(o.A,{href:t,className:(0,r.A)("card padding--lg",d.cardContainer,e),children:n})}function h({className:e,href:t,icon:n,title:s,description:o}){return(0,u.jsxs)(m,{href:t,className:e,children:[(0,u.jsxs)(l.A,{as:"h2",className:(0,r.A)("text--truncate",d.cardTitle),title:s,children:[n," ",s]}),o&&(0,u.jsx)("p",{className:(0,r.A)("text--truncate",d.cardDescription),title:o,children:o})]})}function f({item:e}){const t=(0,s.Nr)(e),n=function(){const{selectMessage:e}=(0,c.W)();return t=>e(t,(0,a.T)({message:"1 item|{count} items",id:"theme.docs.DocCard.categoryDescription.plurals",description:"The default description for a category card in the generated index about how many items this category includes"},{count:t}))}();return t?(0,u.jsx)(h,{className:e.className,href:t,icon:"\ud83d\uddc3\ufe0f",title:e.label,description:e.description??n(e.items.length)}):null}function p({item:e}){const t=(0,i.A)(e.href)?"\ud83d\udcc4\ufe0f":"\ud83d\udd17",n=(0,s.cC)(e.docId??void 0);return(0,u.jsx)(h,{className:e.className,href:e.href,icon:t,title:e.label,description:e.description??n?.description})}function g({item:e}){switch(e.type){case"link":return(0,u.jsx)(p,{item:e});case"category":return(0,u.jsx)(f,{item:e});default:throw new Error(`unknown item type ${JSON.stringify(e)}`)}}const x={docCardListItem:"docCardListItem_W1sv"};function v({className:e}){const t=(0,s.a4)();return(0,u.jsx)(j,{items:t,className:e})}function k({item:e}){return(0,u.jsx)("article",{className:(0,r.A)(x.docCardListItem,"col col--6"),children:(0,u.jsx)(g,{item:e})})}function j(e){const{items:t,className:n}=e;if(!t)return(0,u.jsx)(v,{...e});const o=(0,s.d1)(t);return(0,u.jsx)("section",{className:(0,r.A)("row",n),children:o.map((e,t)=>(0,u.jsx)(k,{item:e},t))})}},5846:(e,t,n)=>{n.d(t,{W:()=>l});var r=n(6540),s=n(4586);const 
o=["zero","one","two","few","many","other"];function c(e){return o.filter(t=>e.includes(t))}const i={locale:"en",pluralForms:c(["one","other"]),select:e=>1===e?"one":"other"};function a(){const{i18n:{currentLocale:e}}=(0,s.A)();return(0,r.useMemo)(()=>{try{return function(e){const t=new Intl.PluralRules(e);return{locale:e,pluralForms:c(t.resolvedOptions().pluralCategories),select:e=>t.select(e)}}(e)}catch(t){return console.error(`Failed to use Intl.PluralRules for locale "${e}".\nDocusaurus will fallback to the default (English) implementation.\nError: ${t.message}\n`),i}},[e])}function l(){const e=a();return{selectMessage:(t,n)=>function(e,t,n){const r=e.split("|");if(1===r.length)return r[0];r.length>n.pluralForms.length&&console.error(`For locale=${n.locale}, a maximum of ${n.pluralForms.length} plural forms are expected (${n.pluralForms.join(",")}), but the message contains ${r.length}: ${e}`);const s=n.select(t),o=n.pluralForms.indexOf(s);return r[Math.min(o,r.length-1)]}(n,t,e)}}},8453:(e,t,n)=>{n.d(t,{R:()=>c,x:()=>i});var r=n(6540);const s={},o=r.createContext(s);function c(e){const t=r.useContext(o);return r.useMemo(function(){return"function"==typeof e?e(t):{...t,...e}},[t,e])}function i(e){let t;return t=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:c(e.components),r.createElement(o.Provider,{value:t},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/bd5b7851.1913daf9.js b/docs/assets/js/bd5b7851.1913daf9.js new file mode 100644 index 00000000..182284a9 --- /dev/null +++ b/docs/assets/js/bd5b7851.1913daf9.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[6063],{8453:(e,n,r)=>{r.d(n,{R:()=>l,x:()=>c});var s=r(6540);const i={},t=s.createContext(i);function l(e){const n=s.useContext(t);return s.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function c(e){let n;return n=e.disableParentContext?"function"==typeof 
e.components?e.components(i):e.components||i:l(e.components),s.createElement(t.Provider,{value:n},e.children)}},9042:(e,n,r)=>{r.r(n),r.d(n,{assets:()=>a,contentTitle:()=>c,default:()=>h,frontMatter:()=>l,metadata:()=>s,toc:()=>d});const s=JSON.parse('{"id":"skye/v1.0.0/release-notes","title":"Release Notes","description":"v1.0.0","source":"@site/docs/skye/v1.0.0/release-notes.md","sourceDirName":"skye/v1.0.0","slug":"/skye/v1.0.0/release-notes","permalink":"/BharatMLStack/skye/v1.0.0/release-notes","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/skye/v1.0.0/release-notes.md","tags":[],"version":"current","sidebarPosition":3,"frontMatter":{"title":"Release Notes","sidebar_position":3},"sidebar":"tutorialSidebar","previous":{"title":"Functionalities","permalink":"/BharatMLStack/skye/v1.0.0/functionalities"},"next":{"title":"Numerix","permalink":"/BharatMLStack/category/numerix"}}');var i=r(4848),t=r(8453);const l={title:"Release Notes",sidebar_position:3},c="Skye - Release Notes",a={},d=[{value:"v1.0.0",id:"v100",level:2},{value:"Overview",id:"overview",level:3},{value:"What's New",id:"whats-new",level:3},{value:"Architecture",id:"architecture",level:4},{value:"Serving",id:"serving",level:4},{value:"Ingestion",id:"ingestion",level:4},{value:"Operations",id:"operations",level:4},{value:"Improvements Over Previous Architecture",id:"improvements-over-previous-architecture",level:3},{value:"Known Limitations",id:"known-limitations",level:3},{value:"Technology Stack",id:"technology-stack",level:3}];function o(e){const n={code:"code",h1:"h1",h2:"h2",h3:"h3",h4:"h4",header:"header",li:"li",p:"p",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,t.R)(),...e.components};return(0,i.jsxs)(i.Fragment,{children:[(0,i.jsx)(n.header,{children:(0,i.jsx)(n.h1,{id:"skye---release-notes",children:"Skye - Release 
Notes"})}),"\n",(0,i.jsx)(n.h2,{id:"v100",children:"v1.0.0"}),"\n",(0,i.jsx)(n.h3,{id:"overview",children:"Overview"}),"\n",(0,i.jsx)(n.p,{children:"Initial open-source release of Skye, BharatMLStack's vector similarity search platform. This release represents a complete re-architecture of the internal VSS (Vector Similarity Search) service, addressing scalability, resilience, and operational efficiency challenges from the previous generation."}),"\n",(0,i.jsx)(n.h3,{id:"whats-new",children:"What's New"}),"\n",(0,i.jsx)(n.h4,{id:"architecture",children:"Architecture"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Model-first hierarchy"}),": Models at the base level with variants nested within, eliminating embedding duplication across tenants"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Entity-based data split"}),": Separate embedding and aggregator tables per entity type (catalog, product, user)"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Event-driven admin flows"}),": Kafka-based model lifecycle management with SQL-backed state persistence"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Pluggable vector DB support"}),": Generic vector database abstraction replacing vendor-specific tight coupling"]}),"\n"]}),"\n",(0,i.jsx)(n.h4,{id:"serving",children:"Serving"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Multi-layer caching"}),": In-memory cache + Redis distributed cache for low-latency similarity search"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Indexed-only search"}),": ",(0,i.jsx)(n.code,{children:"search_indexed_only"})," flag prevents brute-force fallback on partially indexed collections"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Pagination support"}),": Service-level pagination for 
clients"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Separate search/index embeddings"}),": Models can use different embedding spaces for search and indexing"]}),"\n"]}),"\n",(0,i.jsx)(n.h4,{id:"ingestion",children:"Ingestion"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Shared embeddings across variants"}),": Single ingestion per model with parallel variant processing"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Generic RT consumer schema"}),": Simplified onboarding for new real-time data sources"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Retry topic"}),": Automatic capture and reprocessing of failed ingestion events"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"EOF to all partitions"}),": Ensures complete data consumption before processing completion"]}),"\n"]}),"\n",(0,i.jsx)(n.h4,{id:"operations",children:"Operations"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"API-based model onboarding"}),": Register models and variants via REST API (replaces manual Databricks-only flow)"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Automated cluster provisioning"}),": Scripted setup for consistent vector DB cluster configurations"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Experiment isolation"}),": Dedicated EKS and vector DB clusters for experiments"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Comprehensive observability"}),": Per-model + per-variant metrics for latency, throughput, error rates, and cache effectiveness"]}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"improvements-over-previous-architecture",children:"Improvements Over Previous 
Architecture"}),"\n",(0,i.jsxs)(n.table,{children:[(0,i.jsx)(n.thead,{children:(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.th,{children:"Area"}),(0,i.jsx)(n.th,{children:"Before"}),(0,i.jsx)(n.th,{children:"After"})]})}),(0,i.jsxs)(n.tbody,{children:[(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"Embedding storage"}),(0,i.jsx)(n.td,{children:"Duplicated per tenant"}),(0,i.jsx)(n.td,{children:"Shared per model"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"Vector DB coupling"}),(0,i.jsx)(n.td,{children:"Tightly coupled to Qdrant"}),(0,i.jsx)(n.td,{children:"Pluggable via generic interface"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"State management"}),(0,i.jsx)(n.td,{children:"In-pod synchronous thread"}),(0,i.jsx)(n.td,{children:"Event-driven with SQL backing"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"Consumer handling"}),(0,i.jsx)(n.td,{children:"Paused during ingestion"}),(0,i.jsx)(n.td,{children:"No pausing; concurrent writes"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"Cluster setup"}),(0,i.jsx)(n.td,{children:"Manual, error-prone"}),(0,i.jsx)(n.td,{children:"Automated, consistent"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"Experiment infra"}),(0,i.jsx)(n.td,{children:"Shared with production"}),(0,i.jsx)(n.td,{children:"Isolated clusters"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"Failure recovery"}),(0,i.jsx)(n.td,{children:"Manual intervention"}),(0,i.jsx)(n.td,{children:"Retry topics + snapshots"})]}),(0,i.jsxs)(n.tr,{children:[(0,i.jsx)(n.td,{children:"Observability"}),(0,i.jsx)(n.td,{children:"Generic alerts"}),(0,i.jsx)(n.td,{children:"Model + variant level metrics"})]})]})]}),"\n",(0,i.jsx)(n.h3,{id:"known-limitations",children:"Known Limitations"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsx)(n.li,{children:"Snapshot restore is currently supported for smaller indexes only"}),"\n",(0,i.jsx)(n.li,{children:"Pagination is handled at the service level (not natively by 
the vector DB)"}),"\n",(0,i.jsx)(n.li,{children:"Horizontal scaling of vector DB clusters requires running provisioning scripts"}),"\n"]}),"\n",(0,i.jsx)(n.h3,{id:"technology-stack",children:"Technology Stack"}),"\n",(0,i.jsxs)(n.ul,{children:["\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Language"}),": Go"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Vector Database"}),": Qdrant (pluggable)"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Storage"}),": ScyllaDB"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Cache"}),": Redis + In-Memory"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Message Queue"}),": Kafka"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Configuration"}),": ZooKeeper / etcd"]}),"\n",(0,i.jsxs)(n.li,{children:[(0,i.jsx)(n.strong,{children:"Orchestration"}),": Kubernetes (EKS)"]}),"\n"]})]})}function h(e={}){const{wrapper:n}={...(0,t.R)(),...e.components};return n?(0,i.jsx)(n,{...e,children:(0,i.jsx)(o,{...e})}):o(e)}}}]); \ No newline at end of file diff --git a/docs/assets/js/bf2864cf.6fc085c5.js b/docs/assets/js/bf2864cf.6fc085c5.js new file mode 100644 index 00000000..04058e29 --- /dev/null +++ b/docs/assets/js/bf2864cf.6fc085c5.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[9688],{3969:(t,e,r)=>{r.r(e),r.d(e,{assets:()=>l,contentTitle:()=>o,default:()=>m,frontMatter:()=>a,metadata:()=>n,toc:()=>u});const n=JSON.parse('{"id":"quick-start/v1.0.0/index","title":"v1.0.0","description":"Quick Start v1.0.0","source":"@site/docs/quick-start/v1.0.0/index.md","sourceDirName":"quick-start/v1.0.0","slug":"/quick-start/v1.0.0","permalink":"/BharatMLStack/quick-start/v1.0.0","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/quick-start/v1.0.0/index.md","tags":[],"version":"current","sidebarPosition":0,"frontMatter":{"title":"v1.0.0","description":"Quick 
Start v1.0.0","sidebar_position":0,"slug":"/quick-start/v1.0.0"},"sidebar":"tutorialSidebar","previous":{"title":"Quick Start","permalink":"/BharatMLStack/category/quick-start"},"next":{"title":"Quick Start","permalink":"/BharatMLStack/quick-start/v1.0.0/quick-start"}}');var s=r(4848),c=r(8453),i=r(4795);const a={title:"v1.0.0",description:"Quick Start v1.0.0",sidebar_position:0,slug:"/quick-start/v1.0.0"},o="Quick Start v1.0.0",l={},u=[];function d(t){const e={h1:"h1",header:"header",p:"p",...(0,c.R)(),...t.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(e.header,{children:(0,s.jsx)(e.h1,{id:"quick-start-v100",children:"Quick Start v1.0.0"})}),"\n",(0,s.jsx)(e.p,{children:"Get up and running quickly with step-by-step instructions, sample data, and Docker Compose setup for local development and testing."}),"\n",(0,s.jsx)(i.A,{})]})}function m(t={}){const{wrapper:e}={...(0,c.R)(),...t.components};return e?(0,s.jsx)(e,{...t,children:(0,s.jsx)(d,{...t})}):d(t)}},4795:(t,e,r)=>{r.d(e,{A:()=>j});r(6540);var n=r(4164),s=r(6972),c=r(8774),i=r(5846),a=r(6654),o=r(1312),l=r(1107);const u={cardContainer:"cardContainer_fWXF",cardTitle:"cardTitle_rnsV",cardDescription:"cardDescription_PWke"};var d=r(4848);function m({className:t,href:e,children:r}){return(0,d.jsx)(c.A,{href:e,className:(0,n.A)("card padding--lg",u.cardContainer,t),children:r})}function p({className:t,href:e,icon:r,title:s,description:c}){return(0,d.jsxs)(m,{href:e,className:t,children:[(0,d.jsxs)(l.A,{as:"h2",className:(0,n.A)("text--truncate",u.cardTitle),title:s,children:[r," ",s]}),c&&(0,d.jsx)("p",{className:(0,n.A)("text--truncate",u.cardDescription),title:c,children:c})]})}function h({item:t}){const e=(0,s.Nr)(t),r=function(){const{selectMessage:t}=(0,i.W)();return e=>t(e,(0,o.T)({message:"1 item|{count} items",id:"theme.docs.DocCard.categoryDescription.plurals",description:"The default description for a category card in the generated index about how many items this category 
includes"},{count:e}))}();return e?(0,d.jsx)(p,{className:t.className,href:e,icon:"\ud83d\uddc3\ufe0f",title:t.label,description:t.description??r(t.items.length)}):null}function f({item:t}){const e=(0,a.A)(t.href)?"\ud83d\udcc4\ufe0f":"\ud83d\udd17",r=(0,s.cC)(t.docId??void 0);return(0,d.jsx)(p,{className:t.className,href:t.href,icon:e,title:t.label,description:t.description??r?.description})}function k({item:t}){switch(t.type){case"link":return(0,d.jsx)(f,{item:t});case"category":return(0,d.jsx)(h,{item:t});default:throw new Error(`unknown item type ${JSON.stringify(t)}`)}}const x={docCardListItem:"docCardListItem_W1sv"};function g({className:t}){const e=(0,s.a4)();return(0,d.jsx)(j,{items:e,className:t})}function v({item:t}){return(0,d.jsx)("article",{className:(0,n.A)(x.docCardListItem,"col col--6"),children:(0,d.jsx)(k,{item:t})})}function j(t){const{items:e,className:r}=t;if(!e)return(0,d.jsx)(g,{...t});const c=(0,s.d1)(e);return(0,d.jsx)("section",{className:(0,n.A)("row",r),children:c.map((t,e)=>(0,d.jsx)(v,{item:t},e))})}},5846:(t,e,r)=>{r.d(e,{W:()=>l});var n=r(6540),s=r(4586);const c=["zero","one","two","few","many","other"];function i(t){return c.filter(e=>t.includes(e))}const a={locale:"en",pluralForms:i(["one","other"]),select:t=>1===t?"one":"other"};function o(){const{i18n:{currentLocale:t}}=(0,s.A)();return(0,n.useMemo)(()=>{try{return function(t){const e=new Intl.PluralRules(t);return{locale:t,pluralForms:i(e.resolvedOptions().pluralCategories),select:t=>e.select(t)}}(t)}catch(e){return console.error(`Failed to use Intl.PluralRules for locale "${t}".\nDocusaurus will fallback to the default (English) implementation.\nError: ${e.message}\n`),a}},[t])}function l(){const t=o();return{selectMessage:(e,r)=>function(t,e,r){const n=t.split("|");if(1===n.length)return n[0];n.length>r.pluralForms.length&&console.error(`For locale=${r.locale}, a maximum of ${r.pluralForms.length} plural forms are expected (${r.pluralForms.join(",")}), but the message contains 
${n.length}: ${t}`);const s=r.select(e),c=r.pluralForms.indexOf(s);return n[Math.min(c,n.length-1)]}(r,e,t)}}},8453:(t,e,r)=>{r.d(e,{R:()=>i,x:()=>a});var n=r(6540);const s={},c=n.createContext(s);function i(t){const e=n.useContext(c);return n.useMemo(function(){return"function"==typeof t?t(e):{...e,...t}},[e,t])}function a(t){let e;return e=t.disableParentContext?"function"==typeof t.components?t.components(s):t.components||s:i(t.components),n.createElement(c.Provider,{value:e},t.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/c4f5d8e4.20f7242f.js b/docs/assets/js/c4f5d8e4.20f7242f.js new file mode 100644 index 00000000..227a4898 --- /dev/null +++ b/docs/assets/js/c4f5d8e4.20f7242f.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[2634],{1459:(e,t,s)=>{s.r(t),s.d(t,{default:()=>M});var n=s(6540),i=s(1656),r=s(4586),a=s(6025);const o={homepageWrapper:"homepageWrapper_H_rv",customNav:"customNav_xRNg",navContainer:"navContainer_E5Tz",logo:"logo_Ukns",hpGradientShift:"hpGradientShift_w9XB",navLinks:"navLinks_FO3Z",navLink:"navLink_aQaq",btn:"btn_bvfa",btnPrimary:"btnPrimary_hBjO",btnSecondary:"btnSecondary_mRVh",btnWhite:"btnWhite_DoE5",btnOutlineWhite:"btnOutlineWhite_Kzbe",hero:"hero_aEcG",networkCanvas:"networkCanvas_S8Th",heroContent:"heroContent_mKPX",hpFadeInUp:"hpFadeInUp_NspS",heroBadge:"heroBadge_Z6oq",heroTitle:"heroTitle_qg2I",heroSubtitle:"heroSubtitle_jFu1",heroButtons:"heroButtons_r52D",heroImage:"heroImage_xZN7",adoptionBadge:"adoptionBadge_hbYR",section:"section_Q9Zo",container:"container_bfhl",sectionHeader:"sectionHeader_Gahl",sectionSubtitle:"sectionSubtitle_AZuW",sectionTitle:"sectionTitle_Ut5p",sectionDescription:"sectionDescription_cpL1",barriersGrid:"barriersGrid_u0Jf",barrierCard:"barrierCard_tMSq",barrierIcon:"barrierIcon_HTIA",barrierQuestions:"barrierQuestions_jlWA",barrierAnswer:"barrierAnswer_ZtxW",componentsGrid:"componentsGrid_KtT5",componentCard:"componentCard_LlUg",componentC
ardVisible:"componentCardVisible_hAJc",componentContent:"componentContent_xz2v",componentLink:"componentLink_RzJT",componentIcon:"componentIcon_JDYs",statsSection:"statsSection_GUBq",statsGrid:"statsGrid_wBRk",statCard:"statCard_w2S8",statLabel:"statLabel_I99V",statValue:"statValue_tB6D",statDescription:"statDescription_WIU_",videosGrid:"videosGrid_FXHY",videoCard:"videoCard_jGks",videoWrapper:"videoWrapper_XWWU",videoPlayer:"videoPlayer_Nt7m",videoContent:"videoContent_pd0B",blogGrid:"blogGrid_Qec3",blogCard:"blogCard_hyds",blogCardIcon:"blogCardIcon_JPeR",blogContent:"blogContent_dJxs",blogCategory:"blogCategory_UY54",blogMeta:"blogMeta_skDH",ctaSection:"ctaSection_bmsv",hpRotate:"hpRotate_a55V",ctaTitle:"ctaTitle_arch",ctaDescription:"ctaDescription_HswS",ctaButtons:"ctaButtons_vsp7",customFooter:"customFooter_Ymmc",footerContent:"footerContent_obNo",footerSection:"footerSection__c07",footerList:"footerList_2l2h",footerBottom:"footerBottom_nS2f",footerLinks:"footerLinks_lH9U"};var c=s(4848);const l=[{icon:"\ud83e\udde0",title:"Focus on building intelligence, not infrastructure",questions:["Does every model deployment require a full-stack integration effort?","Do engineers have to rebuild feature retrieval, endpoint integrations, and logging for each new model?","Does changing a simple expression like 0.2\xd7s\u2081 + 0.8\xd7s\u2082 to 0.3\xd7s\u2081 + 0.7\xd7s\u2082 really need code reviews and redeployments?","Why does deploying intelligence require the devops team to provision infra?"],answer:"Machine learning teams should be iterating on models, not systems. 
Yet today, infrastructure complexity turns simple improvements into weeks of engineering effort, slowing experimentation and innovation."},{icon:"\ud83d\udcb0",title:"Built for scale without exponential cost growth",questions:["Do your infrastructure costs scale faster than your ML impact?","Are you recomputing the same features, reloading the same data, and moving the same bytes across systems repeatedly?","Are expensive GPUs and compute sitting underutilized while workloads wait on data or inefficient pipelines?","Why does scaling ML often mean scaling cost linearly\u2014or worse?"],answer:"A modern ML platform should eliminate redundant computation, reuse features intelligently, and optimize data access across memory, NVMe, and object storage. Compute should be pooled, scheduled efficiently, and fully utilized\u2014ensuring that scale drives impact, not runaway infrastructure costs."},{icon:"\ud83c\udf0d",title:"Freedom to deploy anywhere, without lock-in",questions:["Are your models tied to a single cloud, making migration costly and complex?","Does adopting managed services today limit your ability to optimize cost or move infrastructure tomorrow?","Can you deploy the same ML stack across public cloud, private cloud, or sovereign environments without redesigning everything?","Why should infrastructure choices dictate the future of your ML systems?"],answer:"A modern ML platform should be built on open standards and cloud-neutral abstractions, allowing you to deploy anywhere\u2014public cloud, private infrastructure, or sovereign environments. This ensures complete control over your data, freedom from vendor lock-in, and the ability to optimize for cost, performance, and compliance without architectural constraints."}],d=[{icon:"\u26a1",title:"Online Feature Store",description:"BharatMLStack Online Feature Store delivers sub-10ms, high-throughput access to machine learning features for real-time inference. 
It seamlessly ingests batch and streaming data, validates schemas, and persists compact, versioned feature groups optimized for low latency and efficiency. With scalable storage backends, gRPC APIs, and binary-optimized formats, it ensures consistent, reliable feature serving across ML pipelines.",cta:"/online-feature-store/v1.0.0"},{icon:"\ud83d\udd00",title:"Inferflow",description:"Inferflow is BharatMLStack's intelligent inference gateway that dynamically retrieves and assembles features required by ML models using a graph-based configuration called Inferpipes. It automatically resolves entity relationships, fetches features from the Online Feature Store, and constructs feature vectors without custom code.",cta:"/inferflow/v1.0.0"},{icon:"\ud83d\udd0d",title:"Skye",description:"Skye enables fast similarity retrieval by representing data as vectors and querying nearest matches in high-dimensional space. It supports pluggable vector databases, ensuring flexibility across infrastructure. The system provides tenant-level index isolation while allowing single embedding ingestion even when shared across tenants, reducing redundancy.",cta:"/skye/v1.0.0"},{icon:"\ud83e\uddee",title:"Numerix",description:"Numerix is a high-performance compute engine designed for ultra-fast element-wise matrix operations. Built in Rust and accelerated using SIMD, it delivers exceptional efficiency and predictable performance. Optimized for real-time inference workloads, it achieves strict sub-5ms p99 latency on matrices up to 1000\xd710.",cta:"/numerix/v1.0.0"},{icon:"\ud83d\ude80",title:"Predator",description:"Predator streamlines infrastructure and model lifecycle management. It enables the creation of deployables with specific Triton Server versions and supports seamless model rollouts. 
Leveraging Helm charts and Argo CD, Predator automates Kubernetes-based deployments while integrating with KEDA for auto-scaling and performance tuning.",cta:"/predator/v1.0.0"}],h=[{target:4.5,suffix:"M+",decimals:1,label:"Daily Orders",description:"Daily orders processed via ML pipelines"},{target:2.4,suffix:"M",decimals:1,label:"QPS on FS",description:"QPS on Feature Store with batch size of 100 id lookups"},{target:1,suffix:"M+",decimals:0,label:"QPS Inference",description:"QPS on Model Inference"},{target:500,suffix:"K",decimals:0,label:"QPS Embedding",description:"QPS Embedding Search"}],m=[{title:"Feature Store",description:"Learn how to onboard and manage features using the self-serve UI for the Online Feature Store.",url:"https://videos.meesho.com/reels/feature_store.mp4"},{title:"Embedding Platform",description:"Walkthrough of onboarding and managing embedding models via the Skye self-serve UI.",url:"https://videos.meesho.com/reels/embedding_platform.mp4"},{title:"Numerix",description:"Step-by-step guide to configuring and running matrix operations through the Numerix self-serve UI.",url:"https://videos.meesho.com/reels/numerix.mp4"},{title:"Predator",description:"How to deploy and manage ML models on Kubernetes using the Predator self-serve UI.",url:"https://videos.meesho.com/reels/predator.mp4"},{title:"Inferflow",description:"Setting up inferpipes and feature retrieval graphs through the Inferflow self-serve UI.",url:"https://videos.meesho.com/reels/inferflow.mp4"}],u=[{title:"Building Meesho's ML Platform: From Chaos to Cutting-Edge (Part 1)",category:"ML Platform",icon:"\ud83d\ude80",link:"/blog/post-one"},{title:"Building Meesho's ML Platform: Lessons from the First-Gen System (Part 2)",category:"ML Platform",icon:"\ud83e\udde9",link:"/blog/post-two"},{title:"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search",category:"Inference",icon:"\u26a1",link:"/blog/post-three"},{title:"Designing a Production-Grade LLM Inference Platform: 
From Model Weights to Scalable GPU Serving",category:"LLM",icon:"\ud83e\udde0",link:"/blog/post-four"},{title:"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale",category:"Optimization",icon:"\ud83d\udd2c",link:"/blog/post-five"}];function p(){const e=(0,a.Ay)("/"),t=(0,a.Ay)("/blog");return(0,c.jsx)("nav",{className:o.customNav,children:(0,c.jsxs)("div",{className:o.navContainer,children:[(0,c.jsx)("a",{href:e,className:o.logo,children:"BharatMLStack"}),(0,c.jsxs)("div",{className:o.navLinks,children:[(0,c.jsx)("a",{href:"#components",className:o.navLink,children:"Components"}),(0,c.jsx)("a",{href:"#stats",className:o.navLink,children:"Scale"}),(0,c.jsx)("a",{href:"#demos",className:o.navLink,children:"Demos"}),(0,c.jsx)("a",{href:t,className:o.navLink,children:"Blog"}),(0,c.jsx)("a",{href:"https://github.com/Meesho/BharatMLStack",className:`${o.btn} ${o.btnPrimary}`,target:"_blank",rel:"noopener noreferrer",children:"GitHub"})]})]})})}function f(){const e=(0,n.useRef)(null);return(0,n.useEffect)(()=>{const t=e.current;if(!t)return;const s=t.getContext("2d");let n,i=[];function r(){const e=t.parentElement;t.width=e.offsetWidth,t.height=e.offsetHeight}r(),function(){i=[];for(let e=0;e<50;e++)i.push({x:Math.random()*t.width,y:Math.random()*t.height,vx:.4*(Math.random()-.5),vy:.4*(Math.random()-.5),radius:2*Math.random()+1})}(),function e(){!function(){for(const e of i)e.x+=e.vx,e.y+=e.vy,(e.x<0||e.x>t.width)&&(e.vx*=-1),(e.y<0||e.y>t.height)&&(e.vy*=-1),e.x=Math.max(0,Math.min(t.width,e.x)),e.y=Math.max(0,Math.min(t.height,e.y))}(),function(){s.clearRect(0,0,t.width,t.height);for(let e=0;e{r()});return a.observe(t.parentElement),()=>{cancelAnimationFrame(n),a.disconnect()}},[]),(0,c.jsx)("canvas",{ref:e,className:o.networkCanvas,"aria-hidden":"true"})}function g(){const 
e=(0,a.Ay)("/intro");return(0,c.jsxs)("section",{className:o.hero,children:[(0,c.jsx)(f,{}),(0,c.jsxs)("div",{className:o.heroContent,children:[(0,c.jsx)("div",{className:o.heroBadge,children:"Open-source, scalable stack for enterprise ML"}),(0,c.jsx)("h1",{className:o.heroTitle,children:"Build production ML pipelines faster"}),(0,c.jsx)("p",{className:o.heroSubtitle,children:"Open source, end-to-end ML infrastructure stack built for scale, speed, and simplicity. Integrate, deploy, and manage robust ML workflows with full reliability and control."}),(0,c.jsxs)("div",{className:o.heroButtons,children:[(0,c.jsx)("a",{href:e,className:`${o.btn} ${o.btnPrimary}`,children:"Get Started"}),(0,c.jsx)("a",{href:"https://github.com/Meesho/BharatMLStack",className:`${o.btn} ${o.btnSecondary}`,target:"_blank",rel:"noopener noreferrer",children:"View on GitHub"})]}),(0,c.jsx)("div",{className:o.adoptionBadge,children:(0,c.jsx)("p",{children:"Adopted by data teams building at scale"})})]}),(0,c.jsx)("div",{className:o.heroImage,children:(0,c.jsx)("img",{src:(0,a.Ay)("/img/bharatml-stack-logo.jpg"),alt:"BharatML Stack Logo",loading:"eager"})})]})}function x(){return(0,c.jsx)("section",{className:o.section,children:(0,c.jsxs)("div",{className:o.container,children:[(0,c.jsxs)("div",{className:o.sectionHeader,children:[(0,c.jsx)("p",{className:o.sectionSubtitle,children:"Why BharatMLStack"}),(0,c.jsx)("h2",{className:o.sectionTitle,children:"The Real Barriers to Scaling Machine Learning"}),(0,c.jsx)("p",{className:o.sectionDescription,children:"ML teams spend more time fighting infrastructure than building intelligence. 
BharatMLStack removes those barriers."})]}),(0,c.jsx)("div",{className:o.barriersGrid,children:l.map((e,t)=>(0,c.jsxs)("div",{className:o.barrierCard,children:[(0,c.jsx)("div",{className:o.barrierIcon,children:e.icon}),(0,c.jsx)("h3",{children:e.title}),(0,c.jsx)("ul",{className:o.barrierQuestions,children:e.questions.map((e,t)=>(0,c.jsx)("li",{children:e},t))}),(0,c.jsx)("p",{className:o.barrierAnswer,children:e.answer})]},t))})]})})}function b(){const e=(0,n.useRef)([]),t=(0,a.Ay)("/");return(0,n.useEffect)(()=>{const t=new IntersectionObserver(e=>{e.forEach(e=>{e.isIntersecting&&e.target.classList.add(o.componentCardVisible)})},{threshold:.1,rootMargin:"0px 0px -80px 0px"});return e.current.forEach(e=>{e&&t.observe(e)}),()=>t.disconnect()},[]),(0,c.jsx)("section",{className:o.section,id:"components",children:(0,c.jsxs)("div",{className:o.container,children:[(0,c.jsxs)("div",{className:o.sectionHeader,children:[(0,c.jsx)("p",{className:o.sectionSubtitle,children:"Platform Components"}),(0,c.jsx)("h2",{className:o.sectionTitle,children:"BharatMLStack Components"}),(0,c.jsx)("p",{className:o.sectionDescription,children:"Purpose-built components for every stage of the ML lifecycle, from feature serving to model deployment."})]}),(0,c.jsx)("div",{className:o.componentsGrid,children:d.map((s,n)=>(0,c.jsxs)("div",{className:o.componentCard,ref:t=>e.current[n]=t,children:[(0,c.jsx)("div",{className:o.componentIcon,children:s.icon}),(0,c.jsxs)("div",{className:o.componentContent,children:[(0,c.jsx)("h3",{children:s.title}),(0,c.jsx)("p",{children:s.description}),(0,c.jsx)("a",{href:`${t}${s.cta.replace(/^\//,"")}`,className:o.componentLink,children:"Learn more \u2192"})]})]},n))})]})})}function v({target:e,suffix:t,decimals:s,duration:i=1500}){const[r,a]=(0,n.useState)(0),[l,d]=(0,n.useState)(!1),h=(0,n.useRef)(null),m=(0,n.useCallback)(()=>{if(l)return;d(!0);const t=performance.now(),s=n=>{const 
r=n-t,o=Math.min(r/i,1),c=1-Math.pow(1-o,3);a(c*e),o<1?requestAnimationFrame(s):a(e)};requestAnimationFrame(s)},[e,i,l]);(0,n.useEffect)(()=>{const e=h.current;if(!e)return;const t=new IntersectionObserver(([e])=>{e.isIntersecting&&m()},{threshold:.3});return t.observe(e),()=>t.disconnect()},[m]);const u=s>0?r.toFixed(s):Math.round(r).toLocaleString();return(0,c.jsxs)("div",{className:o.statValue,ref:h,children:[u,t]})}function j(){return(0,c.jsx)("section",{className:`${o.section} ${o.statsSection}`,id:"stats",children:(0,c.jsxs)("div",{className:o.container,children:[(0,c.jsxs)("div",{className:o.sectionHeader,children:[(0,c.jsx)("p",{className:o.sectionSubtitle,children:"Proven at scale"}),(0,c.jsx)("h2",{className:o.sectionTitle,children:"Scaling Numbers"})]}),(0,c.jsx)("div",{className:o.statsGrid,children:h.map((e,t)=>(0,c.jsxs)("div",{className:o.statCard,children:[(0,c.jsx)("p",{className:o.statLabel,children:e.label}),(0,c.jsx)(v,{target:e.target,suffix:e.suffix,decimals:e.decimals}),(0,c.jsx)("p",{className:o.statDescription,children:e.description})]},t))})]})})}function y(){return(0,c.jsx)("section",{className:o.section,id:"demos",children:(0,c.jsxs)("div",{className:o.container,children:[(0,c.jsxs)("div",{className:o.sectionHeader,children:[(0,c.jsx)("p",{className:o.sectionSubtitle,children:"See it in action"}),(0,c.jsx)("h2",{className:o.sectionTitle,children:"Demo Videos"}),(0,c.jsx)("p",{className:o.sectionDescription,children:"Watch short demos of each BharatMLStack component in action."})]}),(0,c.jsx)("div",{className:o.videosGrid,children:m.map((e,t)=>(0,c.jsxs)("div",{className:o.videoCard,children:[(0,c.jsx)("div",{className:o.videoWrapper,children:(0,c.jsxs)("video",{className:o.videoPlayer,controls:!0,preload:"metadata",playsInline:!0,children:[(0,c.jsx)("source",{src:e.url,type:"video/mp4"}),"Your browser does not support the video 
tag."]})}),(0,c.jsxs)("div",{className:o.videoContent,children:[(0,c.jsx)("h3",{children:e.title}),(0,c.jsx)("p",{children:e.description})]})]},t))})]})})}function N(){const e=(0,a.Ay)("/");return(0,c.jsx)("section",{className:o.section,id:"blog",children:(0,c.jsxs)("div",{className:o.container,children:[(0,c.jsxs)("div",{className:o.sectionHeader,children:[(0,c.jsx)("p",{className:o.sectionSubtitle,children:"From our blog"}),(0,c.jsx)("h2",{className:o.sectionTitle,children:"View Our Blogs"}),(0,c.jsx)("p",{className:o.sectionDescription,children:"Technical articles, architecture deep-dives, and the story behind BharatMLStack."})]}),(0,c.jsx)("div",{className:o.blogGrid,children:u.map((t,s)=>(0,c.jsxs)("a",{href:`${e}${t.link.replace(/^\//,"")}`,className:o.blogCard,children:[(0,c.jsx)("div",{className:o.blogCardIcon,children:t.icon}),(0,c.jsxs)("div",{className:o.blogContent,children:[(0,c.jsx)("span",{className:o.blogCategory,children:t.category}),(0,c.jsx)("h3",{children:t.title}),(0,c.jsx)("div",{className:o.blogMeta,children:(0,c.jsx)("span",{children:"BharatMLStack Team"})})]})]},s))})]})})}function S(){const e=(0,a.Ay)("/intro");return(0,c.jsx)("section",{className:o.section,children:(0,c.jsx)("div",{className:o.container,children:(0,c.jsxs)("div",{className:o.ctaSection,children:[(0,c.jsx)("h2",{className:o.ctaTitle,children:"Deploy ML models with confidence"}),(0,c.jsx)("p",{className:o.ctaDescription,children:"Comprehensive stack for business-ready ML. Integrates seamlessly with enterprise systems. 
Robust security and regulatory compliance."}),(0,c.jsxs)("div",{className:o.ctaButtons,children:[(0,c.jsx)("a",{href:e,className:`${o.btn} ${o.btnWhite}`,children:"Start Now"}),(0,c.jsx)("a",{href:"https://github.com/Meesho/BharatMLStack",className:`${o.btn} ${o.btnOutlineWhite}`,target:"_blank",rel:"noopener noreferrer",children:"View on GitHub"})]})]})})})}function L(){const e=(0,a.Ay)("/"),t=(0,a.Ay)("/blog");return(0,c.jsxs)("footer",{className:o.customFooter,children:[(0,c.jsxs)("div",{className:o.footerContent,children:[(0,c.jsxs)("div",{className:o.footerSection,children:[(0,c.jsx)("h4",{children:"BharatMLStack"}),(0,c.jsx)("p",{children:"Enterprise-ready open-source ML infrastructure built for scale, speed, and simplicity."})]}),(0,c.jsxs)("div",{className:o.footerSection,children:[(0,c.jsx)("h4",{children:"Platform"}),(0,c.jsxs)("ul",{className:o.footerList,children:[(0,c.jsx)("li",{children:(0,c.jsx)("a",{href:(0,a.Ay)("/online-feature-store/v1.0.0"),children:"Online Feature Store"})}),(0,c.jsx)("li",{children:(0,c.jsx)("a",{href:(0,a.Ay)("/inferflow/v1.0.0"),children:"Inferflow"})}),(0,c.jsx)("li",{children:(0,c.jsx)("a",{href:(0,a.Ay)("/skye/v1.0.0"),children:"Skye"})}),(0,c.jsx)("li",{children:(0,c.jsx)("a",{href:(0,a.Ay)("/numerix/v1.0.0"),children:"Numerix"})}),(0,c.jsx)("li",{children:(0,c.jsx)("a",{href:(0,a.Ay)("/predator/v1.0.0"),children:"Predator"})})]})]}),(0,c.jsxs)("div",{className:o.footerSection,children:[(0,c.jsx)("h4",{children:"Resources"}),(0,c.jsxs)("ul",{className:o.footerList,children:[(0,c.jsx)("li",{children:(0,c.jsx)("a",{href:t,children:"Blog"})}),(0,c.jsx)("li",{children:(0,c.jsx)("a",{href:e,children:"Documentation"})}),(0,c.jsx)("li",{children:(0,c.jsx)("a",{href:"https://github.com/Meesho/BharatMLStack/discussions",children:"Forum"})})]})]}),(0,c.jsxs)("div",{className:o.footerSection,children:[(0,c.jsx)("h4",{children:"Community"}),(0,c.jsxs)("ul",{className:o.footerList,children:[(0,c.jsx)("li",{children:(0,c.jsx)("a",{href
:"https://github.com/Meesho/BharatMLStack",children:"GitHub"})}),(0,c.jsx)("li",{children:(0,c.jsx)("a",{href:"https://discord.gg/XkT7XsV2AU",children:"Discord"})}),(0,c.jsx)("li",{children:(0,c.jsx)("a",{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing"})})]})]})]}),(0,c.jsxs)("div",{className:o.footerBottom,children:[(0,c.jsxs)("p",{children:["\xa9 ",(new Date).getFullYear()," Meesho Ltd. All rights reserved. Open Source under Apache 2.0 License."]}),(0,c.jsx)("div",{className:o.footerLinks,children:(0,c.jsx)("a",{href:"https://github.com/Meesho/BharatMLStack",children:"GitHub"})})]})]})}function M(){const{siteConfig:e}=(0,r.A)();return(0,n.useLayoutEffect)(()=>(document.documentElement.classList.add("homepage-active"),()=>{document.documentElement.classList.remove("homepage-active")}),[]),(0,c.jsxs)(i.A,{title:`${e.title} - Open Source ML Infrastructure`,description:"Open source, end-to-end ML infrastructure stack built for scale, speed, and simplicity.",children:[(0,c.jsx)("style",{children:"\n .navbar { display: none !important; }\n .footer { display: none !important; }\n [class*='docMainContainer'], [class*='mainWrapper'] { padding-top: 0 !important; }\n main { margin-top: 0 !important; }\n "}),(0,c.jsxs)("div",{className:o.homepageWrapper,children:[(0,c.jsx)(p,{}),(0,c.jsx)(g,{}),(0,c.jsx)(x,{}),(0,c.jsx)(b,{}),(0,c.jsx)(j,{}),(0,c.jsx)(y,{}),(0,c.jsx)(N,{}),(0,c.jsx)(S,{}),(0,c.jsx)(L,{})]})]})}}}]); \ No newline at end of file diff --git a/docs/assets/js/c4f5d8e4.41d5b3c8.js b/docs/assets/js/c4f5d8e4.41d5b3c8.js deleted file mode 100644 index 69b0d45b..00000000 --- a/docs/assets/js/c4f5d8e4.41d5b3c8.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[2634],{6467:(e,i,t)=>{t.r(i),t.d(i,{default:()=>M});var s=t(4164),r=t(8774),n=t(4586),a=t(6025),o=t(1656),c=t(1107);const 
l={features:"features_t9lD",featureSvg:"featureSvg_GfXr",featuresHeader:"featuresHeader_qR2i",featuresSubtitle:"featuresSubtitle_VdGe","bharatml-card":"bharatml-card_xZ6l","bharatml-icon":"bharatml-icon_XBoJ",featureDescription:"featureDescription_sP1D"};var d=t(4848);const h=[{title:"High-Performance Feature Store",icon:"\ud83d\ude80",description:(0,d.jsx)(d.Fragment,{children:"Sub-10ms P99 latency and 1M+ RPS capacity. Built for real-time ML inference with custom PSDB serialization format that outperforms Protocol Buffers and Apache Arrow."})},{title:"Production-Ready ML Infrastructure",icon:"\u26a1",description:(0,d.jsx)(d.Fragment,{children:"Multi-database backends (Scylla, Dragonfly, Redis), comprehensive monitoring, and enterprise-grade features. Deploy with confidence using battle-tested components."})},{title:"Developer-First Experience",icon:"\ud83d\udee0\ufe0f",description:(0,d.jsx)(d.Fragment,{children:"Multi-language SDKs (Go, Python), gRPC APIs, and extensive documentation. From data scientists, ML engineers to backend engineers, everyone gets tools they love."})}],u=[{title:"Feature Catalog & Management",icon:"\ud83d\udccb",description:(0,d.jsx)(d.Fragment,{children:"Comprehensive feature catalog with metadata management, versioning, and governance. Organize and discover features across your ML platform with ease."})},{title:"User Management & Admin Ops",icon:"\ud83d\udc65",description:(0,d.jsx)(d.Fragment,{children:"Role-based access control, user authentication, and administrative operations. Secure your ML platform with enterprise-grade user management capabilities."})},{title:"Modern UI Framework",icon:"\ud83c\udfa8",description:(0,d.jsx)(d.Fragment,{children:"Intuitive, responsive web interface built with modern web technologies. 
Streamline MLOps workflows with beautiful and functional user experiences."})}],m=[{title:"Multi-Language Support",icon:"\ud83c\udf10",description:(0,d.jsx)(d.Fragment,{children:"Native SDKs for Go and Python with idiomatic APIs. Choose the language that fits your team's expertise and existing infrastructure."})},{title:"gRPC & REST APIs",icon:"\ud83d\udd17",description:(0,d.jsx)(d.Fragment,{children:"High-performance gRPC clients and REST APIs for seamless integration. Built-in support for streaming, batching, and async operations."})},{title:"Spark Integration",icon:"\u26a1",description:(0,d.jsx)(d.Fragment,{children:"Native Apache Spark integration for batch feature processing and ingestion. Scale your feature engineering workflows with distributed computing power."})}];function x({icon:e,title:i,description:t}){return(0,d.jsxs)("div",{className:(0,s.A)("col col--4"),children:[(0,d.jsx)("div",{className:"text--center",children:(0,d.jsx)("div",{className:"bharatml-icon",children:e})}),(0,d.jsxs)("div",{className:"text--center padding-horiz--md bharatml-card",children:[(0,d.jsx)(c.A,{as:"h3",children:i}),(0,d.jsx)("p",{className:l.featureDescription,children:t})]})]})}function p({title:e,subtitle:i,features:t}){return(0,d.jsx)("section",{className:l.features,children:(0,d.jsxs)("div",{className:"container",children:[(0,d.jsxs)("div",{className:"text--center margin-bottom--xl",children:[(0,d.jsx)(c.A,{as:"h2",className:l.featuresHeader,children:e}),(0,d.jsx)("p",{className:l.featuresSubtitle,children:i})]}),(0,d.jsx)("div",{className:"row",children:t.map((e,i)=>(0,d.jsx)(x,{...e},i))})]})})}function g(){return(0,d.jsx)(p,{title:"Online Feature Store",subtitle:"High-performance, production-ready feature serving for real-time ML inference",features:h})}function f(){return(0,d.jsx)(p,{title:"Trufflebox UI",subtitle:"Modern, feature-rich UI framework for comprehensive MLOps management",features:u})}function 
j(){return(0,d.jsx)(p,{title:"SDKs",subtitle:"Developer-friendly client libraries and APIs for seamless platform integration",features:m})}const b={heroBanner:"heroBanner_qdFl",logoContainer:"logoContainer_xdaK",heroLogo:"heroLogo_U6bI",buttons:"buttons_AeoN",statsContainer:"statsContainer_KpvY",statItem:"statItem_bwiZ",aboutSection:"aboutSection_udvw",highlightBox:"highlightBox_Uhe8"};function v(){const{siteConfig:e}=(0,n.A)();return(0,d.jsx)("header",{className:(0,s.A)("hero bharatml-hero",b.heroBanner),children:(0,d.jsxs)("div",{className:"container",children:[(0,d.jsx)("div",{className:b.logoContainer,children:(0,d.jsx)("img",{src:(0,a.Ay)("/img/logo.svg"),alt:"BharatMLStack Logo",className:b.heroLogo})}),(0,d.jsxs)(c.A,{as:"h1",className:"hero__title",children:["Welcome to ",e.title]}),(0,d.jsx)("p",{className:"hero__subtitle",children:"Open source, end-to-end ML infrastructure stack built for scale, speed, and simplicity."}),(0,d.jsxs)("div",{className:b.buttons,children:[(0,d.jsx)(r.A,{className:"button button--secondary button--lg margin-right--md bharatml-button",to:"/category/online-feature-store",children:"\ud83d\udcda Get Started"}),(0,d.jsx)(r.A,{className:"button button--outline button--secondary button--lg",href:"https://github.com/Meesho/BharatMLStack",target:"_blank",children:"\u2b50 Star on GitHub"})]}),(0,d.jsxs)("div",{className:b.statsContainer,children:[(0,d.jsxs)("div",{className:b.statItem,children:[(0,d.jsx)("strong",{children:"Sub-10ms"}),(0,d.jsx)("span",{children:"P99 Latency"})]}),(0,d.jsxs)("div",{className:b.statItem,children:[(0,d.jsx)("strong",{children:"1M+ RPS"}),(0,d.jsx)("span",{children:"Tested Capacity"})]}),(0,d.jsxs)("div",{className:b.statItem,children:[(0,d.jsx)("strong",{children:"Multi-DB"}),(0,d.jsx)("span",{children:"Support"})]})]})]})})}function 
N(){return(0,d.jsx)("section",{className:b.aboutSection,children:(0,d.jsx)("div",{className:"container",children:(0,d.jsxs)("div",{className:"row",children:[(0,d.jsxs)("div",{className:"col col--6",children:[(0,d.jsx)(c.A,{as:"h2",children:"Built for India's Scale"}),(0,d.jsx)("p",{children:"BharatMLStack is a comprehensive, production-ready machine learning infrastructure platform designed to democratize ML capabilities across India and beyond. Our mission is to provide a robust, scalable, and accessible ML stack that empowers organizations to build, deploy, and manage machine learning solutions at massive scale."}),(0,d.jsx)(r.A,{className:"button button--primary",to:"/category/online-feature-store",children:"Explore Online Feature Store \u2192"})]}),(0,d.jsx)("div",{className:"col col--6",children:(0,d.jsxs)("div",{className:b.highlightBox,children:[(0,d.jsx)("h3",{children:"\ud83c\udfc6 Key Achievements"}),(0,d.jsxs)("ul",{children:[(0,d.jsx)("li",{children:"\u2705 Sub-10ms P99 latency for real-time inference"}),(0,d.jsx)("li",{children:"\u2705 1M+ RPS tested with 100 IDs per request"}),(0,d.jsx)("li",{children:"\u2705 PSDB format outperforms Proto3 & Arrow"}),(0,d.jsx)("li",{children:"\u2705 Multi-database: Scylla, Dragonfly, Redis"}),(0,d.jsx)("li",{children:"\u2705 Production-ready with comprehensive monitoring"})]})]})})]})})})}function y(){return(0,d.jsx)("section",{className:b.aboutSection,children:(0,d.jsx)("div",{className:"container",children:(0,d.jsxs)("div",{className:"row",children:[(0,d.jsxs)("div",{className:"col col--6",children:[(0,d.jsx)(c.A,{as:"h2",children:"Modern MLOps Management"}),(0,d.jsx)("p",{children:"Trufflebox UI provides a comprehensive, modern web interface for managing your entire ML infrastructure. Built with cutting-edge web technologies, it delivers an intuitive experience for feature management, user administration, and operational oversight. 
Streamline your MLOps workflows with enterprise-grade UI components."}),(0,d.jsx)(r.A,{className:"button button--primary",to:"/category/trufflebox-ui",children:"Explore Trufflebox UI \u2192"})]}),(0,d.jsx)("div",{className:"col col--6",children:(0,d.jsxs)("div",{className:b.highlightBox,children:[(0,d.jsx)("h3",{children:"\ud83c\udfa8 UI Features"}),(0,d.jsxs)("ul",{children:[(0,d.jsx)("li",{children:"\u2705 Comprehensive feature catalog & discovery"}),(0,d.jsx)("li",{children:"\u2705 Role-based access control & user management"}),(0,d.jsx)("li",{children:"\u2705 Job, Store, Admin Ops management"}),(0,d.jsx)("li",{children:"\u2705 Approval flow for everything"}),(0,d.jsx)("li",{children:"\u2705 Responsive design for desktop & mobile"})]})]})})]})})})}function S(){return(0,d.jsx)("section",{className:b.aboutSection,children:(0,d.jsx)("div",{className:"container",children:(0,d.jsxs)("div",{className:"row",children:[(0,d.jsxs)("div",{className:"col col--6",children:[(0,d.jsx)(c.A,{as:"h2",children:"Developer-First Integration"}),(0,d.jsx)("p",{children:"Our SDKs are designed with developers in mind, providing idiomatic APIs for Go and Python that feel natural in your existing codebase. 
Whether you're building microservices, data pipelines, or ML applications, our SDKs provide the tools you need for seamless integration with BharatMLStack's powerful infrastructure."}),(0,d.jsx)(r.A,{className:"button button--primary",to:"/category/sdks",children:"Explore SDKs \u2192"})]}),(0,d.jsx)("div",{className:"col col--6",children:(0,d.jsxs)("div",{className:b.highlightBox,children:[(0,d.jsx)("h3",{children:"\ud83d\udee0\ufe0f Developer Tools"}),(0,d.jsxs)("ul",{children:[(0,d.jsx)("li",{children:"\u2705 Native Go & Python SDKs with type safety"}),(0,d.jsx)("li",{children:"\u2705 High-performance gRPC"}),(0,d.jsx)("li",{children:"\u2705 Apache Spark integration for publishing features"})]})]})})]})})})}function w(){return(0,d.jsx)("section",{className:b.aboutSection,children:(0,d.jsx)("div",{className:"container",children:(0,d.jsxs)("div",{className:"row",children:[(0,d.jsxs)("div",{className:"col col--6",children:[(0,d.jsx)(c.A,{as:"h2",children:"Numerix"}),(0,d.jsx)("p",{children:"Numerix is a mathematical compute engine for BharatML Stack. 
It is used to perform mathematical operations on matrices and vectors."}),(0,d.jsx)(r.A,{className:"button button--primary",to:"/category/numerix",children:"Explore Numerix \u2192"})]}),(0,d.jsx)("div",{className:"col col--6",children:(0,d.jsxs)("div",{className:b.highlightBox,children:[(0,d.jsx)("h3",{children:"\ud83d\udee0\ufe0f Numerix Features"}),(0,d.jsxs)("ul",{children:[(0,d.jsx)("li",{children:"\u2705 Postfix expression evaluation"}),(0,d.jsx)("li",{children:"\u2705 Vectorized math operations"}),(0,d.jsx)("li",{children:"\u2705 Typed evaluation"}),(0,d.jsx)("li",{children:"\u2705 Compiler-assisted SIMD"}),(0,d.jsx)("li",{children:"\u2705 ARM & AMD support"}),(0,d.jsx)("li",{children:"\u2705 Multi-arch builds"}),(0,d.jsx)("li",{children:"\u2705 Deterministic runtime"})]})]})})]})})})}function M(){const{siteConfig:e}=(0,n.A)();return(0,d.jsxs)(o.A,{title:`${e.title} - Open Source ML Infrastructure`,description:"Open source, end-to-end ML infrastructure stack built for scale, speed, and simplicity. 
Features high-performance Online Feature Store with sub-10ms latency.",children:[(0,d.jsx)(v,{}),(0,d.jsxs)("main",{children:[(0,d.jsx)(g,{}),(0,d.jsx)(N,{}),(0,d.jsx)(f,{}),(0,d.jsx)(y,{}),(0,d.jsx)(j,{}),(0,d.jsx)(S,{}),(0,d.jsx)(w,{})]})]})}}}]); \ No newline at end of file diff --git a/docs/assets/js/c621f852.beef9a06.js b/docs/assets/js/c621f852.beef9a06.js new file mode 100644 index 00000000..ae7e43ad --- /dev/null +++ b/docs/assets/js/c621f852.beef9a06.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[8276],{3338:(e,t,n)=>{n.r(t),n.d(t,{assets:()=>l,contentTitle:()=>a,default:()=>h,frontMatter:()=>c,metadata:()=>r,toc:()=>d});const r=JSON.parse('{"id":"sdks/python/v1.0.0/index","title":"v1.0.0","description":"Python SDK v1.0.0","source":"@site/docs/sdks/python/v1.0.0/index.md","sourceDirName":"sdks/python/v1.0.0","slug":"/sdks/python/v1.0.0","permalink":"/BharatMLStack/sdks/python/v1.0.0","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/sdks/python/v1.0.0/index.md","tags":[],"version":"current","sidebarPosition":0,"frontMatter":{"title":"v1.0.0","description":"Python SDK v1.0.0","sidebar_position":0,"slug":"/sdks/python/v1.0.0"},"sidebar":"tutorialSidebar","previous":{"title":"Python SDK","permalink":"/BharatMLStack/category/python-sdk"},"next":{"title":"GRPC Feature client","permalink":"/BharatMLStack/sdks/python/v1.0.0/grpc_feature_client"}}');var s=n(4848),o=n(8453),i=n(4795);const c={title:"v1.0.0",description:"Python SDK v1.0.0",sidebar_position:0,slug:"/sdks/python/v1.0.0"},a="Python SDK v1.0.0",l={},d=[];function u(e){const t={h1:"h1",header:"header",p:"p",...(0,o.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(t.header,{children:(0,s.jsx)(t.h1,{id:"python-sdk-v100",children:"Python SDK v1.0.0"})}),"\n",(0,s.jsx)(t.p,{children:"Python client libraries and utilities for interacting with the BharatML Stack online feature store, including 
gRPC clients, Spark integration, and common utilities."}),"\n",(0,s.jsx)(i.A,{})]})}function h(e={}){const{wrapper:t}={...(0,o.R)(),...e.components};return t?(0,s.jsx)(t,{...e,children:(0,s.jsx)(u,{...e})}):u(e)}},4795:(e,t,n)=>{n.d(t,{A:()=>k});n(6540);var r=n(4164),s=n(6972),o=n(8774),i=n(5846),c=n(6654),a=n(1312),l=n(1107);const d={cardContainer:"cardContainer_fWXF",cardTitle:"cardTitle_rnsV",cardDescription:"cardDescription_PWke"};var u=n(4848);function h({className:e,href:t,children:n}){return(0,u.jsx)(o.A,{href:t,className:(0,r.A)("card padding--lg",d.cardContainer,e),children:n})}function m({className:e,href:t,icon:n,title:s,description:o}){return(0,u.jsxs)(h,{href:t,className:e,children:[(0,u.jsxs)(l.A,{as:"h2",className:(0,r.A)("text--truncate",d.cardTitle),title:s,children:[n," ",s]}),o&&(0,u.jsx)("p",{className:(0,r.A)("text--truncate",d.cardDescription),title:o,children:o})]})}function p({item:e}){const t=(0,s.Nr)(e),n=function(){const{selectMessage:e}=(0,i.W)();return t=>e(t,(0,a.T)({message:"1 item|{count} items",id:"theme.docs.DocCard.categoryDescription.plurals",description:"The default description for a category card in the generated index about how many items this category includes"},{count:t}))}();return t?(0,u.jsx)(m,{className:e.className,href:t,icon:"\ud83d\uddc3\ufe0f",title:e.label,description:e.description??n(e.items.length)}):null}function f({item:e}){const t=(0,c.A)(e.href)?"\ud83d\udcc4\ufe0f":"\ud83d\udd17",n=(0,s.cC)(e.docId??void 0);return(0,u.jsx)(m,{className:e.className,href:e.href,icon:t,title:e.label,description:e.description??n?.description})}function x({item:e}){switch(e.type){case"link":return(0,u.jsx)(f,{item:e});case"category":return(0,u.jsx)(p,{item:e});default:throw new Error(`unknown item type ${JSON.stringify(e)}`)}}const g={docCardListItem:"docCardListItem_W1sv"};function y({className:e}){const t=(0,s.a4)();return(0,u.jsx)(k,{items:t,className:e})}function 
v({item:e}){return(0,u.jsx)("article",{className:(0,r.A)(g.docCardListItem,"col col--6"),children:(0,u.jsx)(x,{item:e})})}function k(e){const{items:t,className:n}=e;if(!t)return(0,u.jsx)(y,{...e});const o=(0,s.d1)(t);return(0,u.jsx)("section",{className:(0,r.A)("row",n),children:o.map((e,t)=>(0,u.jsx)(v,{item:e},t))})}},5846:(e,t,n)=>{n.d(t,{W:()=>l});var r=n(6540),s=n(4586);const o=["zero","one","two","few","many","other"];function i(e){return o.filter(t=>e.includes(t))}const c={locale:"en",pluralForms:i(["one","other"]),select:e=>1===e?"one":"other"};function a(){const{i18n:{currentLocale:e}}=(0,s.A)();return(0,r.useMemo)(()=>{try{return function(e){const t=new Intl.PluralRules(e);return{locale:e,pluralForms:i(t.resolvedOptions().pluralCategories),select:e=>t.select(e)}}(e)}catch(t){return console.error(`Failed to use Intl.PluralRules for locale "${e}".\nDocusaurus will fallback to the default (English) implementation.\nError: ${t.message}\n`),c}},[e])}function l(){const e=a();return{selectMessage:(t,n)=>function(e,t,n){const r=e.split("|");if(1===r.length)return r[0];r.length>n.pluralForms.length&&console.error(`For locale=${n.locale}, a maximum of ${n.pluralForms.length} plural forms are expected (${n.pluralForms.join(",")}), but the message contains ${r.length}: ${e}`);const s=n.select(t),o=n.pluralForms.indexOf(s);return r[Math.min(o,r.length-1)]}(n,t,e)}}},8453:(e,t,n)=>{n.d(t,{R:()=>i,x:()=>c});var r=n(6540);const s={},o=r.createContext(s);function i(e){const t=r.useContext(o);return r.useMemo(function(){return"function"==typeof e?e(t):{...t,...e}},[t,e])}function c(e){let t;return t=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:i(e.components),r.createElement(o.Provider,{value:t},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/c7b64fcc.9ff95135.js b/docs/assets/js/c7b64fcc.70fc6828.js similarity index 81% rename from docs/assets/js/c7b64fcc.9ff95135.js rename to 
docs/assets/js/c7b64fcc.70fc6828.js index ad135ec8..17b14bc0 100644 --- a/docs/assets/js/c7b64fcc.9ff95135.js +++ b/docs/assets/js/c7b64fcc.70fc6828.js @@ -1 +1 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[8933],{9997:e=>{e.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"Go SDK","description":"Go SDK for BharatML Stack. Provides Go client libraries and packages for interacting with the online feature store, including gRPC clients and protocol buffer definitions.","slug":"/category/go-sdk","permalink":"/BharatMLStack/category/go-sdk","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"SDKs","permalink":"/BharatMLStack/category/sdks"},"next":{"title":"GRPC Feature client","permalink":"/BharatMLStack/sdks/go/v1.0.0/feature_client"}}}}')}}]); \ No newline at end of file +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[8933],{9997:e=>{e.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"Go SDK","description":"Go SDK for BharatML Stack. 
Provides Go client libraries and packages for interacting with the online feature store, including gRPC clients and protocol buffer definitions.","slug":"/category/go-sdk","permalink":"/BharatMLStack/category/go-sdk","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"SDKs","permalink":"/BharatMLStack/category/sdks"},"next":{"title":"v1.0.0","permalink":"/BharatMLStack/sdks/go/v1.0.0"}}}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/ccc49370.1c88001f.js b/docs/assets/js/ccc49370.1c88001f.js deleted file mode 100644 index 6012d4d3..00000000 --- a/docs/assets/js/ccc49370.1c88001f.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[3249],{1689:(e,t,n)=>{n.d(t,{A:()=>d});n(6540);var a=n(4164),s=n(4084),i=n(7559),r=n(7293),l=n(4848);function o({className:e}){return(0,l.jsx)(r.A,{type:"caution",title:(0,l.jsx)(s.Yh,{}),className:(0,a.A)(e,i.G.common.draftBanner),children:(0,l.jsx)(s.TT,{})})}var c=n(2234);function d({metadata:e}){const{unlisted:t,frontMatter:n}=e;return(0,l.jsxs)(l.Fragment,{children:[(t||n.unlisted)&&(0,l.jsx)(c.A,{}),n.draft&&(0,l.jsx)(o,{})]})}},2053:(e,t,n)=>{n.d(t,{A:()=>o});n(6540);var a=n(4164),s=n(1312),i=n(6133);const r={tags:"tags_jXut",tag:"tag_QGVx"};var l=n(4848);function o({tags:e}){return(0,l.jsxs)(l.Fragment,{children:[(0,l.jsx)("b",{children:(0,l.jsx)(s.A,{id:"theme.tags.tagsListLabel",description:"The label alongside a tag list",children:"Tags:"})}),(0,l.jsx)("ul",{className:(0,a.A)(r.tags,"padding--none","margin-left--sm"),children:e.map(e=>(0,l.jsx)("li",{className:r.tag,children:(0,l.jsx)(i.A,{...e})},e.permalink))})]})}},2234:(e,t,n)=>{n.d(t,{A:()=>c});n(6540);var a=n(4164),s=n(7559),i=n(4084),r=n(7293),l=n(4848);function o({className:e}){return(0,l.jsx)(r.A,{type:"caution",title:(0,l.jsx)(i.Rc,{}),className:(0,a.A)(e,s.G.common.unlistedBanner),children:(0,l.jsx)(i.Uh,{})})}function 
c(e){return(0,l.jsxs)(l.Fragment,{children:[(0,l.jsx)(i.AE,{}),(0,l.jsx)(o,{...e})]})}},2907:(e,t,n)=>{n.d(t,{A:()=>O});n(6540);var a=n(4164),s=n(4096),i=n(4848);function r({children:e,className:t}){return(0,i.jsx)("article",{className:t,children:e})}var l=n(8774);const o={title:"title_f1Hy"};function c({className:e}){const{metadata:t,isBlogPostPage:n}=(0,s.e7)(),{permalink:r,title:c}=t,d=n?"h1":"h2";return(0,i.jsx)(d,{className:(0,a.A)(o.title,e),children:n?c:(0,i.jsx)(l.A,{to:r,children:c})})}var d=n(1312),m=n(5846),u=n(6266);const g={container:"container_mt6G"};function h({readingTime:e}){const t=function(){const{selectMessage:e}=(0,m.W)();return t=>{const n=Math.ceil(t);return e(n,(0,d.T)({id:"theme.blog.post.readingTime.plurals",description:'Pluralized label for "{readingTime} min read". Use as much plural forms (separated by "|") as your language support (see https://www.unicode.org/cldr/cldr-aux/charts/34/supplemental/language_plural_rules.html)',message:"One min read|{readingTime} min read"},{readingTime:n}))}}();return(0,i.jsx)(i.Fragment,{children:t(e)})}function x({date:e,formattedDate:t}){return(0,i.jsx)("time",{dateTime:e,children:t})}function f(){return(0,i.jsx)(i.Fragment,{children:" \xb7 "})}function p({className:e}){const{metadata:t}=(0,s.e7)(),{date:n,readingTime:r}=t,l=(0,u.i)({day:"numeric",month:"long",year:"numeric",timeZone:"UTC"});return(0,i.jsxs)("div",{className:(0,a.A)(g.container,"margin-vert--md",e),children:[(0,i.jsx)(x,{date:n,formattedDate:(o=n,l.format(new Date(o)))}),void 0!==r&&(0,i.jsxs)(i.Fragment,{children:[(0,i.jsx)(f,{}),(0,i.jsx)(h,{readingTime:r})]})]});var o}var v=n(6382);const j={authorCol:"authorCol_Hf19",imageOnlyAuthorRow:"imageOnlyAuthorRow_pa_O",imageOnlyAuthorCol:"imageOnlyAuthorCol_G86a"};function b({className:e}){const{metadata:{authors:t},assets:n}=(0,s.e7)();if(0===t.length)return null;const r=t.every(({name:e})=>!e),l=1===t.length;return(0,i.jsx)("div",{className:(0,a.A)("margin-top--md 
margin-bottom--sm",r?j.imageOnlyAuthorRow:"row",e),children:t.map((e,t)=>(0,i.jsx)("div",{className:(0,a.A)(!r&&(l?"col col--12":"col col--6"),r?j.imageOnlyAuthorCol:j.authorCol),children:(0,i.jsx)(v.A,{author:{...e,imageURL:n.authorsImageUrls[t]??e.imageURL}})},t))})}function A(){return(0,i.jsxs)("header",{children:[(0,i.jsx)(c,{}),(0,i.jsx)(p,{}),(0,i.jsx)(b,{})]})}var N=n(440),_=n(3253);function L({children:e,className:t}){const{isBlogPostPage:n}=(0,s.e7)();return(0,i.jsx)("div",{id:n?N.LU:void 0,className:(0,a.A)("markdown",t),children:(0,i.jsx)(_.A,{children:e})})}var y=n(7559),C=n(4336),T=n(2053);function k(){return(0,i.jsx)("b",{children:(0,i.jsx)(d.A,{id:"theme.blog.post.readMore",description:"The label used in blog post item excerpts to link to full blog posts",children:"Read more"})})}function H(e){const{blogPostTitle:t,...n}=e;return(0,i.jsx)(l.A,{"aria-label":(0,d.T)({message:"Read more about {title}",id:"theme.blog.post.readMoreLabel",description:"The ARIA label for the link to full blog posts from excerpts"},{title:t}),...n,children:(0,i.jsx)(k,{})})}function w(){const{metadata:e,isBlogPostPage:t}=(0,s.e7)(),{tags:n,title:r,editUrl:l,hasTruncateMarker:o,lastUpdatedBy:c,lastUpdatedAt:d}=e,m=!t&&o,u=n.length>0;if(!(u||m||l))return null;if(t){const e=!!(l||d||c);return(0,i.jsxs)("footer",{className:"docusaurus-mt-lg",children:[u&&(0,i.jsx)("div",{className:(0,a.A)("row","margin-top--sm",y.G.blog.blogFooterEditMetaRow),children:(0,i.jsx)("div",{className:"col",children:(0,i.jsx)(T.A,{tags:n})})}),e&&(0,i.jsx)(C.A,{className:(0,a.A)("margin-top--sm",y.G.blog.blogFooterEditMetaRow),editUrl:l,lastUpdatedAt:d,lastUpdatedBy:c})]})}return(0,i.jsxs)("footer",{className:"row docusaurus-mt-lg",children:[u&&(0,i.jsx)("div",{className:(0,a.A)("col",{"col--9":m}),children:(0,i.jsx)(T.A,{tags:n})}),m&&(0,i.jsx)("div",{className:(0,a.A)("col text--right",{"col--3":u}),children:(0,i.jsx)(H,{blogPostTitle:r,to:e.permalink})})]})}function O({children:e,className:t}){const 
n=function(){const{isBlogPostPage:e}=(0,s.e7)();return e?void 0:"margin-bottom--xl"}();return(0,i.jsxs)(r,{className:(0,a.A)(n,t),children:[(0,i.jsx)(A,{}),(0,i.jsx)(L,{children:e}),(0,i.jsx)(w,{})]})}},3858:(e,t,n)=>{n.r(t),n.d(t,{default:()=>j});n(6540);var a=n(4164),s=n(5500),i=n(7559),r=n(4096),l=n(8027),o=n(2907),c=n(1312),d=n(9022),m=n(4848);function u(e){const{nextItem:t,prevItem:n}=e;return(0,m.jsxs)("nav",{className:"pagination-nav docusaurus-mt-lg","aria-label":(0,c.T)({id:"theme.blog.post.paginator.navAriaLabel",message:"Blog post page navigation",description:"The ARIA label for the blog posts pagination"}),children:[n&&(0,m.jsx)(d.A,{...n,subLabel:(0,m.jsx)(c.A,{id:"theme.blog.post.paginator.newerPost",description:"The blog post button label to navigate to the newer/previous post",children:"Newer post"})}),t&&(0,m.jsx)(d.A,{...t,subLabel:(0,m.jsx)(c.A,{id:"theme.blog.post.paginator.olderPost",description:"The blog post button label to navigate to the older/next post",children:"Older post"}),isNext:!0})]})}function g(){const{assets:e,metadata:t}=(0,r.e7)(),{title:n,description:a,date:i,tags:l,authors:o,frontMatter:c}=t,{keywords:d}=c,u=e.image??c.image;return(0,m.jsxs)(s.be,{title:c.title_meta??n,description:a,keywords:d,image:u,children:[(0,m.jsx)("meta",{property:"og:type",content:"article"}),(0,m.jsx)("meta",{property:"article:published_time",content:i}),o.some(e=>e.url)&&(0,m.jsx)("meta",{property:"article:author",content:o.map(e=>e.url).filter(Boolean).join(",")}),l.length>0&&(0,m.jsx)("meta",{property:"article:tag",content:l.map(e=>e.label).join(",")})]})}var h=n(5260);function x(){const e=(0,r.J_)();return(0,m.jsx)(h.A,{children:(0,m.jsx)("script",{type:"application/ld+json",children:JSON.stringify(e)})})}var f=n(7763),p=n(1689);function 
v({sidebar:e,children:t}){const{metadata:n,toc:a}=(0,r.e7)(),{nextItem:s,prevItem:i,frontMatter:c}=n,{hide_table_of_contents:d,toc_min_heading_level:g,toc_max_heading_level:h}=c;return(0,m.jsxs)(l.A,{sidebar:e,toc:!d&&a.length>0?(0,m.jsx)(f.A,{toc:a,minHeadingLevel:g,maxHeadingLevel:h}):void 0,children:[(0,m.jsx)(p.A,{metadata:n}),(0,m.jsx)(o.A,{children:t}),(s||i)&&(0,m.jsx)(u,{nextItem:s,prevItem:i})]})}function j(e){const t=e.content;return(0,m.jsx)(r.in,{content:e.content,isBlogPostPage:!0,children:(0,m.jsxs)(s.e3,{className:(0,a.A)(i.G.wrapper.blogPages,i.G.page.blogPostPage),children:[(0,m.jsx)(g,{}),(0,m.jsx)(x,{}),(0,m.jsx)(v,{sidebar:e.sidebar,children:(0,m.jsx)(t,{})})]})})}},4084:(e,t,n)=>{n.d(t,{AE:()=>o,Rc:()=>r,TT:()=>d,Uh:()=>l,Yh:()=>c});n(6540);var a=n(1312),s=n(5260),i=n(4848);function r(){return(0,i.jsx)(a.A,{id:"theme.contentVisibility.unlistedBanner.title",description:"The unlisted content banner title",children:"Unlisted page"})}function l(){return(0,i.jsx)(a.A,{id:"theme.contentVisibility.unlistedBanner.message",description:"The unlisted content banner message",children:"This page is unlisted. Search engines will not index it, and only users having a direct link can access it."})}function o(){return(0,i.jsx)(s.A,{children:(0,i.jsx)("meta",{name:"robots",content:"noindex, nofollow"})})}function c(){return(0,i.jsx)(a.A,{id:"theme.contentVisibility.draftBanner.title",description:"The draft content banner title",children:"Draft page"})}function d(){return(0,i.jsx)(a.A,{id:"theme.contentVisibility.draftBanner.message",description:"The draft content banner message",children:"This page is a draft. 
It will only be visible in dev and be excluded from the production build."})}},5195:(e,t,n)=>{n.d(t,{A:()=>x});var a=n(6540),s=n(6342);function i(e){const t=e.map(e=>({...e,parentIndex:-1,children:[]})),n=Array(7).fill(-1);t.forEach((e,t)=>{const a=n.slice(2,e.level);e.parentIndex=Math.max(...a),n[e.level]=t});const a=[];return t.forEach(e=>{const{parentIndex:n,...s}=e;n>=0?t[n].children.push(s):a.push(s)}),a}function r({toc:e,minHeadingLevel:t,maxHeadingLevel:n}){return e.flatMap(e=>{const a=r({toc:e.children,minHeadingLevel:t,maxHeadingLevel:n});return function(e){return e.level>=t&&e.level<=n}(e)?[{...e,children:a}]:a})}function l(e){const t=e.getBoundingClientRect();return t.top===t.bottom?l(e.parentNode):t}function o(e,{anchorTopOffset:t}){const n=e.find(e=>l(e).top>=t);if(n){return function(e){return e.top>0&&e.bottom{e.current=t?0:document.querySelector(".navbar").clientHeight},[t]),e}function d(e){const t=(0,a.useRef)(void 0),n=c();(0,a.useEffect)(()=>{if(!e)return()=>{};const{linkClassName:a,linkActiveClassName:s,minHeadingLevel:i,maxHeadingLevel:r}=e;function l(){const e=function(e){return Array.from(document.getElementsByClassName(e))}(a),l=function({minHeadingLevel:e,maxHeadingLevel:t}){const n=[];for(let a=e;a<=t;a+=1)n.push(`h${a}.anchor`);return Array.from(document.querySelectorAll(n.join()))}({minHeadingLevel:i,maxHeadingLevel:r}),c=o(l,{anchorTopOffset:n.current}),d=e.find(e=>c&&c.id===function(e){return decodeURIComponent(e.href.substring(e.href.indexOf("#")+1))}(e));e.forEach(e=>{!function(e,n){n?(t.current&&t.current!==e&&t.current.classList.remove(s),e.classList.add(s),t.current=e):e.classList.remove(s)}(e,e===d)})}return document.addEventListener("scroll",l),document.addEventListener("resize",l),l(),()=>{document.removeEventListener("scroll",l),document.removeEventListener("resize",l)}},[e,n])}var m=n(8774),u=n(4848);function g({toc:e,className:t,linkClassName:n,isChild:a}){return e.length?(0,u.jsx)("ul",{className:a?void 
0:t,children:e.map(e=>(0,u.jsxs)("li",{children:[(0,u.jsx)(m.A,{to:`#${e.id}`,className:n??void 0,dangerouslySetInnerHTML:{__html:e.value}}),(0,u.jsx)(g,{isChild:!0,toc:e.children,className:t,linkClassName:n})]},e.id))}):null}const h=a.memo(g);function x({toc:e,className:t="table-of-contents table-of-contents__left-border",linkClassName:n="table-of-contents__link",linkActiveClassName:l,minHeadingLevel:o,maxHeadingLevel:c,...m}){const g=(0,s.p)(),x=o??g.tableOfContents.minHeadingLevel,f=c??g.tableOfContents.maxHeadingLevel,p=function({toc:e,minHeadingLevel:t,maxHeadingLevel:n}){return(0,a.useMemo)(()=>r({toc:i(e),minHeadingLevel:t,maxHeadingLevel:n}),[e,t,n])}({toc:e,minHeadingLevel:x,maxHeadingLevel:f});return d((0,a.useMemo)(()=>{if(n&&l)return{linkClassName:n,linkActiveClassName:l,minHeadingLevel:x,maxHeadingLevel:f}},[n,l,x,f])),(0,u.jsx)(h,{toc:p,className:t,linkClassName:n,...m})}},6133:(e,t,n)=>{n.d(t,{A:()=>l});n(6540);var a=n(4164),s=n(8774);const i={tag:"tag_zVej",tagRegular:"tagRegular_sFm0",tagWithCount:"tagWithCount_h2kH"};var r=n(4848);function l({permalink:e,label:t,count:n,description:l}){return(0,r.jsxs)(s.A,{rel:"tag",href:e,title:l,className:(0,a.A)(i.tag,n?i.tagWithCount:i.tagRegular),children:[t,n&&(0,r.jsx)("span",{children:n})]})}},7763:(e,t,n)=>{n.d(t,{A:()=>c});n(6540);var a=n(4164),s=n(5195);const i={tableOfContents:"tableOfContents_bqdL",docItemContainer:"docItemContainer_F8PC"};var r=n(4848);const l="table-of-contents__link toc-highlight",o="table-of-contents__link--active";function c({className:e,...t}){return(0,r.jsx)("div",{className:(0,a.A)(i.tableOfContents,"thin-scrollbar",e),children:(0,r.jsx)(s.A,{...t,linkClassName:l,linkActiveClassName:o})})}},9022:(e,t,n)=>{n.d(t,{A:()=>r});n(6540);var a=n(4164),s=n(8774),i=n(4848);function 
r(e){const{permalink:t,title:n,subLabel:r,isNext:l}=e;return(0,i.jsxs)(s.A,{className:(0,a.A)("pagination-nav__link",l?"pagination-nav__link--next":"pagination-nav__link--prev"),to:t,children:[r&&(0,i.jsx)("div",{className:"pagination-nav__sublabel",children:r}),(0,i.jsx)("div",{className:"pagination-nav__label",children:n})]})}}}]); \ No newline at end of file diff --git a/docs/assets/js/ccc49370.471f68d9.js b/docs/assets/js/ccc49370.471f68d9.js new file mode 100644 index 00000000..e4fd97c5 --- /dev/null +++ b/docs/assets/js/ccc49370.471f68d9.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[3249],{1689:(e,t,n)=>{n.d(t,{A:()=>d});n(6540);var a=n(4164),s=n(4084),i=n(7559),r=n(7293),l=n(4848);function o({className:e}){return(0,l.jsx)(r.A,{type:"caution",title:(0,l.jsx)(s.Yh,{}),className:(0,a.A)(e,i.G.common.draftBanner),children:(0,l.jsx)(s.TT,{})})}var c=n(2234);function d({metadata:e}){const{unlisted:t,frontMatter:n}=e;return(0,l.jsxs)(l.Fragment,{children:[(t||n.unlisted)&&(0,l.jsx)(c.A,{}),n.draft&&(0,l.jsx)(o,{})]})}},2234:(e,t,n)=>{n.d(t,{A:()=>c});n(6540);var a=n(4164),s=n(7559),i=n(4084),r=n(7293),l=n(4848);function o({className:e}){return(0,l.jsx)(r.A,{type:"caution",title:(0,l.jsx)(i.Rc,{}),className:(0,a.A)(e,s.G.common.unlistedBanner),children:(0,l.jsx)(i.Uh,{})})}function c(e){return(0,l.jsxs)(l.Fragment,{children:[(0,l.jsx)(i.AE,{}),(0,l.jsx)(o,{...e})]})}},2907:(e,t,n)=>{n.d(t,{A:()=>O});n(6540);var a=n(4164),s=n(4096),i=n(4848);function r({children:e,className:t}){return(0,i.jsx)("article",{className:t,children:e})}var l=n(8774);const o={title:"title_f1Hy"};function c({className:e}){const{metadata:t,isBlogPostPage:n}=(0,s.e7)(),{permalink:r,title:c}=t,d=n?"h1":"h2";return(0,i.jsx)(d,{className:(0,a.A)(o.title,e),children:n?c:(0,i.jsx)(l.A,{to:r,children:c})})}var d=n(1312),m=n(5846),u=n(6266);const g={container:"container_mt6G"};function h({readingTime:e}){const 
t=function(){const{selectMessage:e}=(0,m.W)();return t=>{const n=Math.ceil(t);return e(n,(0,d.T)({id:"theme.blog.post.readingTime.plurals",description:'Pluralized label for "{readingTime} min read". Use as much plural forms (separated by "|") as your language support (see https://www.unicode.org/cldr/cldr-aux/charts/34/supplemental/language_plural_rules.html)',message:"One min read|{readingTime} min read"},{readingTime:n}))}}();return(0,i.jsx)(i.Fragment,{children:t(e)})}function x({date:e,formattedDate:t}){return(0,i.jsx)("time",{dateTime:e,children:t})}function f(){return(0,i.jsx)(i.Fragment,{children:" \xb7 "})}function p({className:e}){const{metadata:t}=(0,s.e7)(),{date:n,readingTime:r}=t,l=(0,u.i)({day:"numeric",month:"long",year:"numeric",timeZone:"UTC"});return(0,i.jsxs)("div",{className:(0,a.A)(g.container,"margin-vert--md",e),children:[(0,i.jsx)(x,{date:n,formattedDate:(o=n,l.format(new Date(o)))}),void 0!==r&&(0,i.jsxs)(i.Fragment,{children:[(0,i.jsx)(f,{}),(0,i.jsx)(h,{readingTime:r})]})]});var o}var v=n(6382);const j={authorCol:"authorCol_Hf19",imageOnlyAuthorRow:"imageOnlyAuthorRow_pa_O",imageOnlyAuthorCol:"imageOnlyAuthorCol_G86a"};function b({className:e}){const{metadata:{authors:t},assets:n}=(0,s.e7)();if(0===t.length)return null;const r=t.every(({name:e})=>!e),l=1===t.length;return(0,i.jsx)("div",{className:(0,a.A)("margin-top--md margin-bottom--sm",r?j.imageOnlyAuthorRow:"row",e),children:t.map((e,t)=>(0,i.jsx)("div",{className:(0,a.A)(!r&&(l?"col col--12":"col col--6"),r?j.imageOnlyAuthorCol:j.authorCol),children:(0,i.jsx)(v.A,{author:{...e,imageURL:n.authorsImageUrls[t]??e.imageURL}})},t))})}function A(){return(0,i.jsxs)("header",{children:[(0,i.jsx)(c,{}),(0,i.jsx)(p,{}),(0,i.jsx)(b,{})]})}var N=n(440),_=n(3253);function L({children:e,className:t}){const{isBlogPostPage:n}=(0,s.e7)();return(0,i.jsx)("div",{id:n?N.LU:void 0,className:(0,a.A)("markdown",t),children:(0,i.jsx)(_.A,{children:e})})}var y=n(7559),C=n(4336),T=n(4434);function 
k(){return(0,i.jsx)("b",{children:(0,i.jsx)(d.A,{id:"theme.blog.post.readMore",description:"The label used in blog post item excerpts to link to full blog posts",children:"Read more"})})}function H(e){const{blogPostTitle:t,...n}=e;return(0,i.jsx)(l.A,{"aria-label":(0,d.T)({message:"Read more about {title}",id:"theme.blog.post.readMoreLabel",description:"The ARIA label for the link to full blog posts from excerpts"},{title:t}),...n,children:(0,i.jsx)(k,{})})}function w(){const{metadata:e,isBlogPostPage:t}=(0,s.e7)(),{tags:n,title:r,editUrl:l,hasTruncateMarker:o,lastUpdatedBy:c,lastUpdatedAt:d}=e,m=!t&&o,u=n.length>0;if(!(u||m||l))return null;if(t){const e=!!(l||d||c);return(0,i.jsxs)("footer",{className:"docusaurus-mt-lg",children:[u&&(0,i.jsx)("div",{className:(0,a.A)("row","margin-top--sm",y.G.blog.blogFooterEditMetaRow),children:(0,i.jsx)("div",{className:"col",children:(0,i.jsx)(T.A,{tags:n})})}),e&&(0,i.jsx)(C.A,{className:(0,a.A)("margin-top--sm",y.G.blog.blogFooterEditMetaRow),editUrl:l,lastUpdatedAt:d,lastUpdatedBy:c})]})}return(0,i.jsxs)("footer",{className:"row docusaurus-mt-lg",children:[u&&(0,i.jsx)("div",{className:(0,a.A)("col",{"col--9":m}),children:(0,i.jsx)(T.A,{tags:n})}),m&&(0,i.jsx)("div",{className:(0,a.A)("col text--right",{"col--3":u}),children:(0,i.jsx)(H,{blogPostTitle:r,to:e.permalink})})]})}function O({children:e,className:t}){const n=function(){const{isBlogPostPage:e}=(0,s.e7)();return e?void 0:"margin-bottom--xl"}();return(0,i.jsxs)(r,{className:(0,a.A)(n,t),children:[(0,i.jsx)(A,{}),(0,i.jsx)(L,{children:e}),(0,i.jsx)(w,{})]})}},3858:(e,t,n)=>{n.r(t),n.d(t,{default:()=>j});n(6540);var a=n(4164),s=n(5500),i=n(7559),r=n(4096),l=n(8027),o=n(2907),c=n(1312),d=n(9022),m=n(4848);function u(e){const{nextItem:t,prevItem:n}=e;return(0,m.jsxs)("nav",{className:"pagination-nav docusaurus-mt-lg","aria-label":(0,c.T)({id:"theme.blog.post.paginator.navAriaLabel",message:"Blog post page navigation",description:"The ARIA label for the blog posts 
pagination"}),children:[n&&(0,m.jsx)(d.A,{...n,subLabel:(0,m.jsx)(c.A,{id:"theme.blog.post.paginator.newerPost",description:"The blog post button label to navigate to the newer/previous post",children:"Newer post"})}),t&&(0,m.jsx)(d.A,{...t,subLabel:(0,m.jsx)(c.A,{id:"theme.blog.post.paginator.olderPost",description:"The blog post button label to navigate to the older/next post",children:"Older post"}),isNext:!0})]})}function g(){const{assets:e,metadata:t}=(0,r.e7)(),{title:n,description:a,date:i,tags:l,authors:o,frontMatter:c}=t,{keywords:d}=c,u=e.image??c.image;return(0,m.jsxs)(s.be,{title:c.title_meta??n,description:a,keywords:d,image:u,children:[(0,m.jsx)("meta",{property:"og:type",content:"article"}),(0,m.jsx)("meta",{property:"article:published_time",content:i}),o.some(e=>e.url)&&(0,m.jsx)("meta",{property:"article:author",content:o.map(e=>e.url).filter(Boolean).join(",")}),l.length>0&&(0,m.jsx)("meta",{property:"article:tag",content:l.map(e=>e.label).join(",")})]})}var h=n(5260);function x(){const e=(0,r.J_)();return(0,m.jsx)(h.A,{children:(0,m.jsx)("script",{type:"application/ld+json",children:JSON.stringify(e)})})}var f=n(7763),p=n(1689);function v({sidebar:e,children:t}){const{metadata:n,toc:a}=(0,r.e7)(),{nextItem:s,prevItem:i,frontMatter:c}=n,{hide_table_of_contents:d,toc_min_heading_level:g,toc_max_heading_level:h}=c;return(0,m.jsxs)(l.A,{sidebar:e,toc:!d&&a.length>0?(0,m.jsx)(f.A,{toc:a,minHeadingLevel:g,maxHeadingLevel:h}):void 0,children:[(0,m.jsx)(p.A,{metadata:n}),(0,m.jsx)(o.A,{children:t}),(s||i)&&(0,m.jsx)(u,{nextItem:s,prevItem:i})]})}function j(e){const t=e.content;return(0,m.jsx)(r.in,{content:e.content,isBlogPostPage:!0,children:(0,m.jsxs)(s.e3,{className:(0,a.A)(i.G.wrapper.blogPages,i.G.page.blogPostPage),children:[(0,m.jsx)(g,{}),(0,m.jsx)(x,{}),(0,m.jsx)(v,{sidebar:e.sidebar,children:(0,m.jsx)(t,{})})]})})}},4084:(e,t,n)=>{n.d(t,{AE:()=>o,Rc:()=>r,TT:()=>d,Uh:()=>l,Yh:()=>c});n(6540);var a=n(1312),s=n(5260),i=n(4848);function 
r(){return(0,i.jsx)(a.A,{id:"theme.contentVisibility.unlistedBanner.title",description:"The unlisted content banner title",children:"Unlisted page"})}function l(){return(0,i.jsx)(a.A,{id:"theme.contentVisibility.unlistedBanner.message",description:"The unlisted content banner message",children:"This page is unlisted. Search engines will not index it, and only users having a direct link can access it."})}function o(){return(0,i.jsx)(s.A,{children:(0,i.jsx)("meta",{name:"robots",content:"noindex, nofollow"})})}function c(){return(0,i.jsx)(a.A,{id:"theme.contentVisibility.draftBanner.title",description:"The draft content banner title",children:"Draft page"})}function d(){return(0,i.jsx)(a.A,{id:"theme.contentVisibility.draftBanner.message",description:"The draft content banner message",children:"This page is a draft. It will only be visible in dev and be excluded from the production build."})}},4434:(e,t,n)=>{n.d(t,{A:()=>o});n(6540);var a=n(4164),s=n(1312),i=n(6133);const r={tags:"tags_jXut",tag:"tag_QGVx"};var l=n(4848);function o({tags:e}){return(0,l.jsxs)(l.Fragment,{children:[(0,l.jsx)("b",{children:(0,l.jsx)(s.A,{id:"theme.tags.tagsListLabel",description:"The label alongside a tag list",children:"Tags:"})}),(0,l.jsx)("ul",{className:(0,a.A)(r.tags,"padding--none","margin-left--sm"),children:e.map(e=>(0,l.jsx)("li",{className:r.tag,children:(0,l.jsx)(i.A,{...e})},e.permalink))})]})}},5195:(e,t,n)=>{n.d(t,{A:()=>x});var a=n(6540),s=n(6342);function i(e){const t=e.map(e=>({...e,parentIndex:-1,children:[]})),n=Array(7).fill(-1);t.forEach((e,t)=>{const a=n.slice(2,e.level);e.parentIndex=Math.max(...a),n[e.level]=t});const a=[];return t.forEach(e=>{const{parentIndex:n,...s}=e;n>=0?t[n].children.push(s):a.push(s)}),a}function r({toc:e,minHeadingLevel:t,maxHeadingLevel:n}){return e.flatMap(e=>{const a=r({toc:e.children,minHeadingLevel:t,maxHeadingLevel:n});return function(e){return e.level>=t&&e.level<=n}(e)?[{...e,children:a}]:a})}function l(e){const 
t=e.getBoundingClientRect();return t.top===t.bottom?l(e.parentNode):t}function o(e,{anchorTopOffset:t}){const n=e.find(e=>l(e).top>=t);if(n){return function(e){return e.top>0&&e.bottom{e.current=t?0:document.querySelector(".navbar").clientHeight},[t]),e}function d(e){const t=(0,a.useRef)(void 0),n=c();(0,a.useEffect)(()=>{if(!e)return()=>{};const{linkClassName:a,linkActiveClassName:s,minHeadingLevel:i,maxHeadingLevel:r}=e;function l(){const e=function(e){return Array.from(document.getElementsByClassName(e))}(a),l=function({minHeadingLevel:e,maxHeadingLevel:t}){const n=[];for(let a=e;a<=t;a+=1)n.push(`h${a}.anchor`);return Array.from(document.querySelectorAll(n.join()))}({minHeadingLevel:i,maxHeadingLevel:r}),c=o(l,{anchorTopOffset:n.current}),d=e.find(e=>c&&c.id===function(e){return decodeURIComponent(e.href.substring(e.href.indexOf("#")+1))}(e));e.forEach(e=>{!function(e,n){n?(t.current&&t.current!==e&&t.current.classList.remove(s),e.classList.add(s),t.current=e):e.classList.remove(s)}(e,e===d)})}return document.addEventListener("scroll",l),document.addEventListener("resize",l),l(),()=>{document.removeEventListener("scroll",l),document.removeEventListener("resize",l)}},[e,n])}var m=n(8774),u=n(4848);function g({toc:e,className:t,linkClassName:n,isChild:a}){return e.length?(0,u.jsx)("ul",{className:a?void 0:t,children:e.map(e=>(0,u.jsxs)("li",{children:[(0,u.jsx)(m.A,{to:`#${e.id}`,className:n??void 0,dangerouslySetInnerHTML:{__html:e.value}}),(0,u.jsx)(g,{isChild:!0,toc:e.children,className:t,linkClassName:n})]},e.id))}):null}const h=a.memo(g);function x({toc:e,className:t="table-of-contents table-of-contents__left-border",linkClassName:n="table-of-contents__link",linkActiveClassName:l,minHeadingLevel:o,maxHeadingLevel:c,...m}){const 
g=(0,s.p)(),x=o??g.tableOfContents.minHeadingLevel,f=c??g.tableOfContents.maxHeadingLevel,p=function({toc:e,minHeadingLevel:t,maxHeadingLevel:n}){return(0,a.useMemo)(()=>r({toc:i(e),minHeadingLevel:t,maxHeadingLevel:n}),[e,t,n])}({toc:e,minHeadingLevel:x,maxHeadingLevel:f});return d((0,a.useMemo)(()=>{if(n&&l)return{linkClassName:n,linkActiveClassName:l,minHeadingLevel:x,maxHeadingLevel:f}},[n,l,x,f])),(0,u.jsx)(h,{toc:p,className:t,linkClassName:n,...m})}},6133:(e,t,n)=>{n.d(t,{A:()=>l});n(6540);var a=n(4164),s=n(8774);const i={tag:"tag_zVej",tagRegular:"tagRegular_sFm0",tagWithCount:"tagWithCount_h2kH"};var r=n(4848);function l({permalink:e,label:t,count:n,description:l}){return(0,r.jsxs)(s.A,{rel:"tag",href:e,title:l,className:(0,a.A)(i.tag,n?i.tagWithCount:i.tagRegular),children:[t,n&&(0,r.jsx)("span",{children:n})]})}},7763:(e,t,n)=>{n.d(t,{A:()=>c});n(6540);var a=n(4164),s=n(5195);const i={tableOfContents:"tableOfContents_bqdL",docItemContainer:"docItemContainer_F8PC"};var r=n(4848);const l="table-of-contents__link toc-highlight",o="table-of-contents__link--active";function c({className:e,...t}){return(0,r.jsx)("div",{className:(0,a.A)(i.tableOfContents,"thin-scrollbar",e),children:(0,r.jsx)(s.A,{...t,linkClassName:l,linkActiveClassName:o})})}},9022:(e,t,n)=>{n.d(t,{A:()=>r});n(6540);var a=n(4164),s=n(8774),i=n(4848);function r(e){const{permalink:t,title:n,subLabel:r,isNext:l}=e;return(0,i.jsxs)(s.A,{className:(0,a.A)("pagination-nav__link",l?"pagination-nav__link--next":"pagination-nav__link--prev"),to:t,children:[r&&(0,i.jsx)("div",{className:"pagination-nav__sublabel",children:r}),(0,i.jsx)("div",{className:"pagination-nav__label",children:n})]})}}}]); \ No newline at end of file diff --git a/docs/assets/js/d01bc907.3a0113c2.js b/docs/assets/js/d01bc907.3a0113c2.js new file mode 100644 index 00000000..d8833b48 --- /dev/null +++ b/docs/assets/js/d01bc907.3a0113c2.js @@ -0,0 +1 @@ +"use 
strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[8593],{4795:(e,t,r)=>{r.d(t,{A:()=>N});r(6540);var n=r(4164),s=r(6972),o=r(8774),a=r(5846),c=r(6654),i=r(1312),l=r(1107);const d={cardContainer:"cardContainer_fWXF",cardTitle:"cardTitle_rnsV",cardDescription:"cardDescription_PWke"};var u=r(4848);function m({className:e,href:t,children:r}){return(0,u.jsx)(o.A,{href:t,className:(0,n.A)("card padding--lg",d.cardContainer,e),children:r})}function p({className:e,href:t,icon:r,title:s,description:o}){return(0,u.jsxs)(m,{href:t,className:e,children:[(0,u.jsxs)(l.A,{as:"h2",className:(0,n.A)("text--truncate",d.cardTitle),title:s,children:[r," ",s]}),o&&(0,u.jsx)("p",{className:(0,n.A)("text--truncate",d.cardDescription),title:o,children:o})]})}function h({item:e}){const t=(0,s.Nr)(e),r=function(){const{selectMessage:e}=(0,a.W)();return t=>e(t,(0,i.T)({message:"1 item|{count} items",id:"theme.docs.DocCard.categoryDescription.plurals",description:"The default description for a category card in the generated index about how many items this category includes"},{count:t}))}();return t?(0,u.jsx)(p,{className:e.className,href:t,icon:"\ud83d\uddc3\ufe0f",title:e.label,description:e.description??r(e.items.length)}):null}function f({item:e}){const t=(0,c.A)(e.href)?"\ud83d\udcc4\ufe0f":"\ud83d\udd17",r=(0,s.cC)(e.docId??void 0);return(0,u.jsx)(p,{className:e.className,href:e.href,icon:t,title:e.label,description:e.description??r?.description})}function x({item:e}){switch(e.type){case"link":return(0,u.jsx)(f,{item:e});case"category":return(0,u.jsx)(h,{item:e});default:throw new Error(`unknown item type ${JSON.stringify(e)}`)}}const v={docCardListItem:"docCardListItem_W1sv"};function g({className:e}){const t=(0,s.a4)();return(0,u.jsx)(N,{items:t,className:e})}function j({item:e}){return(0,u.jsx)("article",{className:(0,n.A)(v.docCardListItem,"col col--6"),children:(0,u.jsx)(x,{item:e})})}function N(e){const{items:t,className:r}=e;if(!t)return(0,u.jsx)(g,{...e});const 
o=(0,s.d1)(t);return(0,u.jsx)("section",{className:(0,n.A)("row",r),children:o.map((e,t)=>(0,u.jsx)(j,{item:e},t))})}},5846:(e,t,r)=>{r.d(t,{W:()=>l});var n=r(6540),s=r(4586);const o=["zero","one","two","few","many","other"];function a(e){return o.filter(t=>e.includes(t))}const c={locale:"en",pluralForms:a(["one","other"]),select:e=>1===e?"one":"other"};function i(){const{i18n:{currentLocale:e}}=(0,s.A)();return(0,n.useMemo)(()=>{try{return function(e){const t=new Intl.PluralRules(e);return{locale:e,pluralForms:a(t.resolvedOptions().pluralCategories),select:e=>t.select(e)}}(e)}catch(t){return console.error(`Failed to use Intl.PluralRules for locale "${e}".\nDocusaurus will fallback to the default (English) implementation.\nError: ${t.message}\n`),c}},[e])}function l(){const e=i();return{selectMessage:(t,r)=>function(e,t,r){const n=e.split("|");if(1===n.length)return n[0];n.length>r.pluralForms.length&&console.error(`For locale=${r.locale}, a maximum of ${r.pluralForms.length} plural forms are expected (${r.pluralForms.join(",")}), but the message contains ${n.length}: ${e}`);const s=r.select(t),o=r.pluralForms.indexOf(s);return n[Math.min(o,n.length-1)]}(r,t,e)}}},7383:(e,t,r)=>{r.r(t),r.d(t,{assets:()=>l,contentTitle:()=>i,default:()=>m,frontMatter:()=>c,metadata:()=>n,toc:()=>d});const n=JSON.parse('{"id":"predator/v1.0.0/index","title":"v1.0.0","description":"Predator v1.0.0","source":"@site/docs/predator/v1.0.0/index.md","sourceDirName":"predator/v1.0.0","slug":"/predator/v1.0.0","permalink":"/BharatMLStack/predator/v1.0.0","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/predator/v1.0.0/index.md","tags":[],"version":"current","sidebarPosition":0,"frontMatter":{"title":"v1.0.0","description":"Predator 
v1.0.0","sidebar_position":0,"slug":"/predator/v1.0.0"},"sidebar":"tutorialSidebar","previous":{"title":"Predator","permalink":"/BharatMLStack/category/predator"},"next":{"title":"Architecture","permalink":"/BharatMLStack/predator/v1.0.0/architecture"}}');var s=r(4848),o=r(8453),a=r(4795);const c={title:"v1.0.0",description:"Predator v1.0.0",sidebar_position:0,slug:"/predator/v1.0.0"},i="Predator v1.0.0",l={},d=[];function u(e){const t={h1:"h1",header:"header",p:"p",...(0,o.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(t.header,{children:(0,s.jsx)(t.h1,{id:"predator-v100",children:"Predator v1.0.0"})}),"\n",(0,s.jsx)(t.p,{children:"Predator is a scalable, high-performance model inference service built as a wrapper around NVIDIA Triton Inference Server, designed to serve ML models with low latency in Kubernetes."}),"\n",(0,s.jsx)(a.A,{})]})}function m(e={}){const{wrapper:t}={...(0,o.R)(),...e.components};return t?(0,s.jsx)(t,{...e,children:(0,s.jsx)(u,{...e})}):u(e)}},8453:(e,t,r)=>{r.d(t,{R:()=>a,x:()=>c});var n=r(6540);const s={},o=n.createContext(s);function a(e){const t=n.useContext(o);return n.useMemo(function(){return"function"==typeof e?e(t):{...t,...e}},[t,e])}function c(e){let t;return t=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:a(e.components),n.createElement(o.Provider,{value:t},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/df502808.a61c45a0.js b/docs/assets/js/df502808.a61c45a0.js new file mode 100644 index 00000000..784b7367 --- /dev/null +++ b/docs/assets/js/df502808.a61c45a0.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[6088],{4795:(e,t,r)=>{r.d(t,{A:()=>j});r(6540);var n=r(4164),s=r(6972),o=r(8774),i=r(5846),a=r(6654),c=r(1312),l=r(1107);const u={cardContainer:"cardContainer_fWXF",cardTitle:"cardTitle_rnsV",cardDescription:"cardDescription_PWke"};var d=r(4848);function 
f({className:e,href:t,children:r}){return(0,d.jsx)(o.A,{href:t,className:(0,n.A)("card padding--lg",u.cardContainer,e),children:r})}function m({className:e,href:t,icon:r,title:s,description:o}){return(0,d.jsxs)(f,{href:t,className:e,children:[(0,d.jsxs)(l.A,{as:"h2",className:(0,n.A)("text--truncate",u.cardTitle),title:s,children:[r," ",s]}),o&&(0,d.jsx)("p",{className:(0,n.A)("text--truncate",u.cardDescription),title:o,children:o})]})}function p({item:e}){const t=(0,s.Nr)(e),r=function(){const{selectMessage:e}=(0,i.W)();return t=>e(t,(0,c.T)({message:"1 item|{count} items",id:"theme.docs.DocCard.categoryDescription.plurals",description:"The default description for a category card in the generated index about how many items this category includes"},{count:t}))}();return t?(0,d.jsx)(m,{className:e.className,href:t,icon:"\ud83d\uddc3\ufe0f",title:e.label,description:e.description??r(e.items.length)}):null}function h({item:e}){const t=(0,a.A)(e.href)?"\ud83d\udcc4\ufe0f":"\ud83d\udd17",r=(0,s.cC)(e.docId??void 0);return(0,d.jsx)(m,{className:e.className,href:e.href,icon:t,title:e.label,description:e.description??r?.description})}function x({item:e}){switch(e.type){case"link":return(0,d.jsx)(h,{item:e});case"category":return(0,d.jsx)(p,{item:e});default:throw new Error(`unknown item type ${JSON.stringify(e)}`)}}const g={docCardListItem:"docCardListItem_W1sv"};function b({className:e}){const t=(0,s.a4)();return(0,d.jsx)(j,{items:t,className:e})}function v({item:e}){return(0,d.jsx)("article",{className:(0,n.A)(g.docCardListItem,"col col--6"),children:(0,d.jsx)(x,{item:e})})}function j(e){const{items:t,className:r}=e;if(!t)return(0,d.jsx)(b,{...e});const o=(0,s.d1)(t);return(0,d.jsx)("section",{className:(0,n.A)("row",r),children:o.map((e,t)=>(0,d.jsx)(v,{item:e},t))})}},5074:(e,t,r)=>{r.r(t),r.d(t,{assets:()=>l,contentTitle:()=>c,default:()=>f,frontMatter:()=>a,metadata:()=>n,toc:()=>u});const 
n=JSON.parse('{"id":"trufflebox-ui/v1.0.0/index","title":"v1.0.0","description":"Trufflebox UI v1.0.0","source":"@site/docs/trufflebox-ui/v1.0.0/index.md","sourceDirName":"trufflebox-ui/v1.0.0","slug":"/trufflebox-ui/v1.0.0","permalink":"/BharatMLStack/trufflebox-ui/v1.0.0","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/trufflebox-ui/v1.0.0/index.md","tags":[],"version":"current","sidebarPosition":0,"frontMatter":{"title":"v1.0.0","description":"Trufflebox UI v1.0.0","sidebar_position":0,"slug":"/trufflebox-ui/v1.0.0"},"sidebar":"tutorialSidebar","previous":{"title":"Trufflebox UI","permalink":"/BharatMLStack/category/trufflebox-ui"},"next":{"title":"User Manual","permalink":"/BharatMLStack/trufflebox-ui/v1.0.0/userguide"}}');var s=r(4848),o=r(8453),i=r(4795);const a={title:"v1.0.0",description:"Trufflebox UI v1.0.0",sidebar_position:0,slug:"/trufflebox-ui/v1.0.0"},c="Trufflebox UI v1.0.0",l={},u=[];function d(e){const t={h1:"h1",header:"header",p:"p",...(0,o.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(t.header,{children:(0,s.jsx)(t.h1,{id:"trufflebox-ui-v100",children:"Trufflebox UI v1.0.0"})}),"\n",(0,s.jsx)(t.p,{children:"Trufflebox UI is a modern, feature-rich UI framework for supporting MLOps. 
It supports feature catalog, management, user management, and other admin operations."}),"\n",(0,s.jsx)(i.A,{})]})}function f(e={}){const{wrapper:t}={...(0,o.R)(),...e.components};return t?(0,s.jsx)(t,{...e,children:(0,s.jsx)(d,{...e})}):d(e)}},5846:(e,t,r)=>{r.d(t,{W:()=>l});var n=r(6540),s=r(4586);const o=["zero","one","two","few","many","other"];function i(e){return o.filter(t=>e.includes(t))}const a={locale:"en",pluralForms:i(["one","other"]),select:e=>1===e?"one":"other"};function c(){const{i18n:{currentLocale:e}}=(0,s.A)();return(0,n.useMemo)(()=>{try{return function(e){const t=new Intl.PluralRules(e);return{locale:e,pluralForms:i(t.resolvedOptions().pluralCategories),select:e=>t.select(e)}}(e)}catch(t){return console.error(`Failed to use Intl.PluralRules for locale "${e}".\nDocusaurus will fallback to the default (English) implementation.\nError: ${t.message}\n`),a}},[e])}function l(){const e=c();return{selectMessage:(t,r)=>function(e,t,r){const n=e.split("|");if(1===n.length)return n[0];n.length>r.pluralForms.length&&console.error(`For locale=${r.locale}, a maximum of ${r.pluralForms.length} plural forms are expected (${r.pluralForms.join(",")}), but the message contains ${n.length}: ${e}`);const s=r.select(t),o=r.pluralForms.indexOf(s);return n[Math.min(o,n.length-1)]}(r,t,e)}}},8453:(e,t,r)=>{r.d(t,{R:()=>i,x:()=>a});var n=r(6540);const s={},o=n.createContext(s);function i(e){const t=n.useContext(o);return n.useMemo(function(){return"function"==typeof e?e(t):{...t,...e}},[t,e])}function a(e){let t;return t=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:i(e.components),n.createElement(o.Provider,{value:t},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/e66382f6.ad26fd04.js b/docs/assets/js/e66382f6.aae8118f.js similarity index 95% rename from docs/assets/js/e66382f6.ad26fd04.js rename to docs/assets/js/e66382f6.aae8118f.js index 23fcfec6..2e5ef514 100644 --- 
a/docs/assets/js/e66382f6.ad26fd04.js +++ b/docs/assets/js/e66382f6.aae8118f.js @@ -1 +1 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[1405],{8453:(e,n,s)=>{s.d(n,{R:()=>l,x:()=>o});var r=s(6540);const t={},i=r.createContext(t);function l(e){const n=r.useContext(i);return r.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(t):e.components||t:l(e.components),r.createElement(i.Provider,{value:n},e.children)}},9326:(e,n,s)=>{s.d(n,{A:()=>r});const r=s.p+"assets/images/v1.0.0-onfs-arch-7b3e91a84b2a24a378d13db769995c08.png"},9563:(e,n,s)=>{s.r(n),s.d(n,{assets:()=>a,contentTitle:()=>o,default:()=>h,frontMatter:()=>l,metadata:()=>r,toc:()=>c});const r=JSON.parse('{"id":"online-feature-store/v1.0.0/architecture","title":"Architecture","description":"The Online Feature Store (OnFS) is part of BharatMLStack, designed to support real-time ML workloads through low-latency feature retrieval and flexible feature ingestion pipelines. 
It ensures that features generated offline or online are immediately accessible for inference.","source":"@site/docs/online-feature-store/v1.0.0/architecture.md","sourceDirName":"online-feature-store/v1.0.0","slug":"/online-feature-store/v1.0.0/architecture","permalink":"/BharatMLStack/online-feature-store/v1.0.0/architecture","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/online-feature-store/v1.0.0/architecture.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"Architecture","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"v1.0.0","permalink":"/BharatMLStack/online-feature-store/v1.0.0"},"next":{"title":"Data Formats","permalink":"/BharatMLStack/online-feature-store/v1.0.0/data-formats"}}');var t=s(4848),i=s(8453);const l={title:"Architecture",sidebar_position:1},o="BharatMLStack - Online Feature Store (OnFS)",a={},c=[{value:"\ud83e\udde9 Key Components",id:"-key-components",level:2},{value:"1. Data Ingestion Paths",id:"1-data-ingestion-paths",level:3},{value:"a. Direct Push from Feature Engineering Jobs",id:"a-direct-push-from-feature-engineering-jobs",level:4},{value:"b. Push from Offline Feature Store",id:"b-push-from-offline-feature-store",level:4},{value:"c. Streaming Push via Apache Flink",id:"c-streaming-push-via-apache-flink",level:4},{value:"2. Message Queue: Kafka",id:"2-message-queue-kafka",level:3},{value:"3. Core Components",id:"3-core-components",level:3},{value:"\ud83e\udde0 Horizon Control Plane",id:"-horizon-control-plane",level:4},{value:"\ud83d\udd0d Trufflebox UI",id:"-trufflebox-ui",level:4},{value:"\u2699\ufe0f OnFS-Consumers",id:"\ufe0f-onfs-consumers",level:4},{value:"\ud83d\ude80 OnFS API Server",id:"-onfs-api-server",level:4},{value:"4. Online Databases",id:"4-online-databases",level:3},{value:"5. Clients for Serving",id:"5-clients-for-serving",level:3},{value:"6. 
Observability",id:"6-observability",level:3},{value:"\ud83d\udcbb Supported Environments",id:"-supported-environments",level:2},{value:"\ud83d\udc65 Target Users",id:"-target-users",level:2},{value:"\u2705 Benefits",id:"-benefits",level:2},{value:"Contributing",id:"contributing",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function d(e){const n={a:"a",code:"code",h1:"h1",h2:"h2",h3:"h3",h4:"h4",header:"header",hr:"hr",img:"img",li:"li",p:"p",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,i.R)(),...e.components};return(0,t.jsxs)(t.Fragment,{children:[(0,t.jsx)(n.header,{children:(0,t.jsx)(n.h1,{id:"bharatmlstack---online-feature-store-onfs",children:"BharatMLStack - Online Feature Store (OnFS)"})}),"\n",(0,t.jsxs)(n.p,{children:["The Online Feature Store (OnFS) is part of ",(0,t.jsx)(n.strong,{children:"BharatMLStack"}),", designed to support real-time ML workloads through low-latency feature retrieval and flexible feature ingestion pipelines. It ensures that features generated offline or online are immediately accessible for inference."]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsx)(n.p,{children:(0,t.jsx)(n.img,{alt:"BharatMLStack's Online-feature-store Architecture",src:s(9326).A+"",width:"2174",height:"1208"})}),"\n",(0,t.jsx)(n.h2,{id:"-key-components",children:"\ud83e\udde9 Key Components"}),"\n",(0,t.jsxs)(n.h3,{id:"1-data-ingestion-paths",children:["1. ",(0,t.jsx)(n.strong,{children:"Data Ingestion Paths"})]}),"\n",(0,t.jsxs)(n.h4,{id:"a-direct-push-from-feature-engineering-jobs",children:["a. 
",(0,t.jsx)(n.strong,{children:"Direct Push from Feature Engineering Jobs"})]}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Source:"})," Apache Spark"]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Client:"})," ",(0,t.jsx)(n.code,{children:"spark_feature_push_client"})]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Flow:"})," Features are pushed directly to Kafka."]}),"\n"]}),"\n",(0,t.jsxs)(n.h4,{id:"b-push-from-offline-feature-store",children:["b. ",(0,t.jsx)(n.strong,{children:"Push from Offline Feature Store"})]}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Source:"})," Delta Lake, GCS, or S3"]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Flow:"})," Scheduled notebooks (",(0,t.jsx)(n.code,{children:"push_features_to_online-feature-stores.ipynb"}),") push to Kafka using the same ",(0,t.jsx)(n.code,{children:"spark_feature_push_client"}),"."]}),"\n"]}),"\n",(0,t.jsxs)(n.h4,{id:"c-streaming-push-via-apache-flink",children:["c. ",(0,t.jsx)(n.strong,{children:"Streaming Push via Apache Flink"})]}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Source:"})," Flink streaming jobs"]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Client:"})," ",(0,t.jsx)(n.code,{children:"custom-producer"})]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Flow:"})," Real-time features sent to Kafka."]}),"\n"]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsxs)(n.h3,{id:"2-message-queue-kafka",children:["2. ",(0,t.jsx)(n.strong,{children:"Message Queue: Kafka"})]}),"\n",(0,t.jsx)(n.p,{children:"Kafka serves as a decoupled buffer between producers (push clients) and consumers (OnFS ingestion workers), ensuring durability and backpressure handling."}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsxs)(n.h3,{id:"3-core-components",children:["3. 
",(0,t.jsx)(n.strong,{children:"Core Components"})]}),"\n",(0,t.jsxs)(n.h4,{id:"-horizon-control-plane",children:["\ud83e\udde0 ",(0,t.jsx)(n.strong,{children:"Horizon Control Plane"})]}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsx)(n.li,{children:"Manages config distribution and metadata orchestration."}),"\n",(0,t.jsxs)(n.li,{children:["Stores schemas, feature group mappings, job configurations in ",(0,t.jsx)(n.code,{children:"etcd"}),"."]}),"\n"]}),"\n",(0,t.jsxs)(n.h4,{id:"-trufflebox-ui",children:["\ud83d\udd0d ",(0,t.jsx)(n.strong,{children:"Trufflebox UI"})]}),"\n",(0,t.jsx)(n.p,{children:"Frontend interface for managing the ML Feature Store ecosystem:"}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Feature Catalog"})," \u2013 Browse, search, and inspect registered features and groups."]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Store and Job Registry"})," \u2013 View and manage ingestion jobs, feature store states, and lineage."]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Admin Ops"})," \u2013 Approve or reject feature group pushes and schema edits."]}),"\n",(0,t.jsxs)(n.li,{children:["Designed for use by ",(0,t.jsx)(n.strong,{children:"Data Scientists, MLEs"}),", and ",(0,t.jsx)(n.strong,{children:"Platform Admins"}),"."]}),"\n"]}),"\n",(0,t.jsxs)(n.h4,{id:"\ufe0f-onfs-consumers",children:["\u2699\ufe0f ",(0,t.jsx)(n.strong,{children:"OnFS-Consumers"})]}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsx)(n.li,{children:"Kafka consumers that read and validate feature messages."}),"\n",(0,t.jsx)(n.li,{children:"Responsible for persisting features to online databases (Redis, ScyllaDB, DragonflyDB)."}),"\n"]}),"\n",(0,t.jsxs)(n.h4,{id:"-onfs-api-server",children:["\ud83d\ude80 ",(0,t.jsx)(n.strong,{children:"OnFS API Server"})]}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:["gRPC server exposing interfaces 
for:","\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsx)(n.li,{children:"Real-time feature persistence."}),"\n",(0,t.jsx)(n.li,{children:"Low-latency feature retrieval."}),"\n"]}),"\n"]}),"\n",(0,t.jsxs)(n.li,{children:["Access controlled and schema-validated via ",(0,t.jsx)(n.code,{children:"etcd"}),"."]}),"\n"]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsxs)(n.h3,{id:"4-online-databases",children:["4. ",(0,t.jsx)(n.strong,{children:"Online Databases"})]}),"\n",(0,t.jsx)(n.p,{children:"Stores real-time features for high-performance retrieval:"}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsx)(n.li,{children:(0,t.jsx)(n.strong,{children:"DragonflyDB"})}),"\n",(0,t.jsx)(n.li,{children:(0,t.jsx)(n.strong,{children:"Redis"})}),"\n",(0,t.jsx)(n.li,{children:(0,t.jsx)(n.strong,{children:"ScyllaDB"})}),"\n"]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsxs)(n.h3,{id:"5-clients-for-serving",children:["5. ",(0,t.jsx)(n.strong,{children:"Clients for Serving"})]}),"\n",(0,t.jsx)(n.p,{children:"Applications use client SDKs to fetch features:"}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Go SDK"}),": ",(0,t.jsx)(n.code,{children:"go-sdk"})]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Python SDK"}),": ",(0,t.jsx)(n.code,{children:"grpc-feature-client"})]}),"\n",(0,t.jsx)(n.li,{children:"Used in backend inference apps to request features using entity keys."}),"\n"]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsxs)(n.h3,{id:"6-observability",children:["6. 
",(0,t.jsx)(n.strong,{children:"Observability"})]}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Prometheus"})," \u2013 Metrics collection (e.g., ingest lag, QPS, latency)."]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Grafana"})," \u2013 Dashboard for platform health, feature access, ingestion success/failure."]}),"\n"]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsx)(n.h2,{id:"-supported-environments",children:"\ud83d\udcbb Supported Environments"}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsx)(n.li,{children:"Kubernetes (K8s)"}),"\n",(0,t.jsx)(n.li,{children:"Google Kubernetes Engine (GKE)"}),"\n",(0,t.jsx)(n.li,{children:"Amazon EKS"}),"\n"]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsx)(n.h2,{id:"-target-users",children:"\ud83d\udc65 Target Users"}),"\n",(0,t.jsxs)(n.table,{children:[(0,t.jsx)(n.thead,{children:(0,t.jsxs)(n.tr,{children:[(0,t.jsx)(n.th,{children:"User"}),(0,t.jsx)(n.th,{children:"Role"})]})}),(0,t.jsxs)(n.tbody,{children:[(0,t.jsxs)(n.tr,{children:[(0,t.jsx)(n.td,{children:"Data Scientists"}),(0,t.jsx)(n.td,{children:"Browse features, define jobs, approve/reject changes via Trufflebox UI"})]}),(0,t.jsxs)(n.tr,{children:[(0,t.jsx)(n.td,{children:"MLEs"}),(0,t.jsx)(n.td,{children:"Develop and push features using Spark/Flink/notebooks"})]}),(0,t.jsxs)(n.tr,{children:[(0,t.jsx)(n.td,{children:"Infra Admins"}),(0,t.jsx)(n.td,{children:"Manage store lifecycle, metadata, and approvals"})]}),(0,t.jsxs)(n.tr,{children:[(0,t.jsx)(n.td,{children:"Backend Devs"}),(0,t.jsx)(n.td,{children:"Use SDKs to retrieve features in Go/Python inference services"})]})]})]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsx)(n.h2,{id:"-benefits",children:"\u2705 Benefits"}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsx)(n.li,{children:"Unified real-time and offline ingestion."}),"\n",(0,t.jsx)(n.li,{children:"Low-latency inference-ready features."}),"\n",(0,t.jsx)(n.li,{children:"Config-driven 
orchestration."}),"\n",(0,t.jsx)(n.li,{children:"Built-in approval workflows via Trufflebox."}),"\n",(0,t.jsx)(n.li,{children:"Scalable across thousands of entities and feature groups."}),"\n"]}),"\n",(0,t.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,t.jsxs)(n.p,{children:["We welcome contributions from the community! Please see our ",(0,t.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,t.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:["\ud83d\udcac ",(0,t.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,t.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,t.jsxs)(n.li,{children:["\ud83d\udc1b ",(0,t.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,t.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,t.jsxs)(n.li,{children:["\ud83d\udce7 ",(0,t.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,t.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,t.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,t.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,t.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 1.1"}),"."]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsx)("div",{align:"center",children:(0,t.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,t.jsx)("div",{align:"center",children:(0,t.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function h(e={}){const{wrapper:n}={...(0,i.R)(),...e.components};return n?(0,t.jsx)(n,{...e,children:(0,t.jsx)(d,{...e})}):d(e)}}}]); \ 
No newline at end of file +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[1405],{6205:(e,n,s)=>{s.d(n,{A:()=>r});const r=s.p+"assets/images/v1.0.0-onfs-arch-7b3e91a84b2a24a378d13db769995c08.png"},8453:(e,n,s)=>{s.d(n,{R:()=>l,x:()=>o});var r=s(6540);const t={},i=r.createContext(t);function l(e){const n=r.useContext(i);return r.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(t):e.components||t:l(e.components),r.createElement(i.Provider,{value:n},e.children)}},9563:(e,n,s)=>{s.r(n),s.d(n,{assets:()=>a,contentTitle:()=>o,default:()=>h,frontMatter:()=>l,metadata:()=>r,toc:()=>c});const r=JSON.parse('{"id":"online-feature-store/v1.0.0/architecture","title":"Architecture","description":"The Online Feature Store (OnFS) is part of BharatMLStack, designed to support real-time ML workloads through low-latency feature retrieval and flexible feature ingestion pipelines. 
It ensures that features generated offline or online are immediately accessible for inference.","source":"@site/docs/online-feature-store/v1.0.0/architecture.md","sourceDirName":"online-feature-store/v1.0.0","slug":"/online-feature-store/v1.0.0/architecture","permalink":"/BharatMLStack/online-feature-store/v1.0.0/architecture","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/online-feature-store/v1.0.0/architecture.md","tags":[],"version":"current","sidebarPosition":1,"frontMatter":{"title":"Architecture","sidebar_position":1},"sidebar":"tutorialSidebar","previous":{"title":"v1.0.0","permalink":"/BharatMLStack/online-feature-store/v1.0.0"},"next":{"title":"Data Formats","permalink":"/BharatMLStack/online-feature-store/v1.0.0/data-formats"}}');var t=s(4848),i=s(8453);const l={title:"Architecture",sidebar_position:1},o="BharatMLStack - Online Feature Store (OnFS)",a={},c=[{value:"\ud83e\udde9 Key Components",id:"-key-components",level:2},{value:"1. Data Ingestion Paths",id:"1-data-ingestion-paths",level:3},{value:"a. Direct Push from Feature Engineering Jobs",id:"a-direct-push-from-feature-engineering-jobs",level:4},{value:"b. Push from Offline Feature Store",id:"b-push-from-offline-feature-store",level:4},{value:"c. Streaming Push via Apache Flink",id:"c-streaming-push-via-apache-flink",level:4},{value:"2. Message Queue: Kafka",id:"2-message-queue-kafka",level:3},{value:"3. Core Components",id:"3-core-components",level:3},{value:"\ud83e\udde0 Horizon Control Plane",id:"-horizon-control-plane",level:4},{value:"\ud83d\udd0d Trufflebox UI",id:"-trufflebox-ui",level:4},{value:"\u2699\ufe0f OnFS-Consumers",id:"\ufe0f-onfs-consumers",level:4},{value:"\ud83d\ude80 OnFS API Server",id:"-onfs-api-server",level:4},{value:"4. Online Databases",id:"4-online-databases",level:3},{value:"5. Clients for Serving",id:"5-clients-for-serving",level:3},{value:"6. 
Observability",id:"6-observability",level:3},{value:"\ud83d\udcbb Supported Environments",id:"-supported-environments",level:2},{value:"\ud83d\udc65 Target Users",id:"-target-users",level:2},{value:"\u2705 Benefits",id:"-benefits",level:2},{value:"Contributing",id:"contributing",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function d(e){const n={a:"a",code:"code",h1:"h1",h2:"h2",h3:"h3",h4:"h4",header:"header",hr:"hr",img:"img",li:"li",p:"p",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,i.R)(),...e.components};return(0,t.jsxs)(t.Fragment,{children:[(0,t.jsx)(n.header,{children:(0,t.jsx)(n.h1,{id:"bharatmlstack---online-feature-store-onfs",children:"BharatMLStack - Online Feature Store (OnFS)"})}),"\n",(0,t.jsxs)(n.p,{children:["The Online Feature Store (OnFS) is part of ",(0,t.jsx)(n.strong,{children:"BharatMLStack"}),", designed to support real-time ML workloads through low-latency feature retrieval and flexible feature ingestion pipelines. It ensures that features generated offline or online are immediately accessible for inference."]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsx)(n.p,{children:(0,t.jsx)(n.img,{alt:"BharatMLStack's Online-feature-store Architecture",src:s(6205).A+"",width:"2174",height:"1208"})}),"\n",(0,t.jsx)(n.h2,{id:"-key-components",children:"\ud83e\udde9 Key Components"}),"\n",(0,t.jsxs)(n.h3,{id:"1-data-ingestion-paths",children:["1. ",(0,t.jsx)(n.strong,{children:"Data Ingestion Paths"})]}),"\n",(0,t.jsxs)(n.h4,{id:"a-direct-push-from-feature-engineering-jobs",children:["a. 
",(0,t.jsx)(n.strong,{children:"Direct Push from Feature Engineering Jobs"})]}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Source:"})," Apache Spark"]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Client:"})," ",(0,t.jsx)(n.code,{children:"spark_feature_push_client"})]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Flow:"})," Features are pushed directly to Kafka."]}),"\n"]}),"\n",(0,t.jsxs)(n.h4,{id:"b-push-from-offline-feature-store",children:["b. ",(0,t.jsx)(n.strong,{children:"Push from Offline Feature Store"})]}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Source:"})," Delta Lake, GCS, or S3"]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Flow:"})," Scheduled notebooks (",(0,t.jsx)(n.code,{children:"push_features_to_online-feature-stores.ipynb"}),") push to Kafka using the same ",(0,t.jsx)(n.code,{children:"spark_feature_push_client"}),"."]}),"\n"]}),"\n",(0,t.jsxs)(n.h4,{id:"c-streaming-push-via-apache-flink",children:["c. ",(0,t.jsx)(n.strong,{children:"Streaming Push via Apache Flink"})]}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Source:"})," Flink streaming jobs"]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Client:"})," ",(0,t.jsx)(n.code,{children:"custom-producer"})]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Flow:"})," Real-time features sent to Kafka."]}),"\n"]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsxs)(n.h3,{id:"2-message-queue-kafka",children:["2. ",(0,t.jsx)(n.strong,{children:"Message Queue: Kafka"})]}),"\n",(0,t.jsx)(n.p,{children:"Kafka serves as a decoupled buffer between producers (push clients) and consumers (OnFS ingestion workers), ensuring durability and backpressure handling."}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsxs)(n.h3,{id:"3-core-components",children:["3. 
",(0,t.jsx)(n.strong,{children:"Core Components"})]}),"\n",(0,t.jsxs)(n.h4,{id:"-horizon-control-plane",children:["\ud83e\udde0 ",(0,t.jsx)(n.strong,{children:"Horizon Control Plane"})]}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsx)(n.li,{children:"Manages config distribution and metadata orchestration."}),"\n",(0,t.jsxs)(n.li,{children:["Stores schemas, feature group mappings, job configurations in ",(0,t.jsx)(n.code,{children:"etcd"}),"."]}),"\n"]}),"\n",(0,t.jsxs)(n.h4,{id:"-trufflebox-ui",children:["\ud83d\udd0d ",(0,t.jsx)(n.strong,{children:"Trufflebox UI"})]}),"\n",(0,t.jsx)(n.p,{children:"Frontend interface for managing the ML Feature Store ecosystem:"}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Feature Catalog"})," \u2013 Browse, search, and inspect registered features and groups."]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Store and Job Registry"})," \u2013 View and manage ingestion jobs, feature store states, and lineage."]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Admin Ops"})," \u2013 Approve or reject feature group pushes and schema edits."]}),"\n",(0,t.jsxs)(n.li,{children:["Designed for use by ",(0,t.jsx)(n.strong,{children:"Data Scientists, MLEs"}),", and ",(0,t.jsx)(n.strong,{children:"Platform Admins"}),"."]}),"\n"]}),"\n",(0,t.jsxs)(n.h4,{id:"\ufe0f-onfs-consumers",children:["\u2699\ufe0f ",(0,t.jsx)(n.strong,{children:"OnFS-Consumers"})]}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsx)(n.li,{children:"Kafka consumers that read and validate feature messages."}),"\n",(0,t.jsx)(n.li,{children:"Responsible for persisting features to online databases (Redis, ScyllaDB, DragonflyDB)."}),"\n"]}),"\n",(0,t.jsxs)(n.h4,{id:"-onfs-api-server",children:["\ud83d\ude80 ",(0,t.jsx)(n.strong,{children:"OnFS API Server"})]}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:["gRPC server exposing interfaces 
for:","\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsx)(n.li,{children:"Real-time feature persistence."}),"\n",(0,t.jsx)(n.li,{children:"Low-latency feature retrieval."}),"\n"]}),"\n"]}),"\n",(0,t.jsxs)(n.li,{children:["Access controlled and schema-validated via ",(0,t.jsx)(n.code,{children:"etcd"}),"."]}),"\n"]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsxs)(n.h3,{id:"4-online-databases",children:["4. ",(0,t.jsx)(n.strong,{children:"Online Databases"})]}),"\n",(0,t.jsx)(n.p,{children:"Stores real-time features for high-performance retrieval:"}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsx)(n.li,{children:(0,t.jsx)(n.strong,{children:"DragonflyDB"})}),"\n",(0,t.jsx)(n.li,{children:(0,t.jsx)(n.strong,{children:"Redis"})}),"\n",(0,t.jsx)(n.li,{children:(0,t.jsx)(n.strong,{children:"ScyllaDB"})}),"\n"]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsxs)(n.h3,{id:"5-clients-for-serving",children:["5. ",(0,t.jsx)(n.strong,{children:"Clients for Serving"})]}),"\n",(0,t.jsx)(n.p,{children:"Applications use client SDKs to fetch features:"}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Go SDK"}),": ",(0,t.jsx)(n.code,{children:"go-sdk"})]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Python SDK"}),": ",(0,t.jsx)(n.code,{children:"grpc-feature-client"})]}),"\n",(0,t.jsx)(n.li,{children:"Used in backend inference apps to request features using entity keys."}),"\n"]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsxs)(n.h3,{id:"6-observability",children:["6. 
",(0,t.jsx)(n.strong,{children:"Observability"})]}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Prometheus"})," \u2013 Metrics collection (e.g., ingest lag, QPS, latency)."]}),"\n",(0,t.jsxs)(n.li,{children:[(0,t.jsx)(n.strong,{children:"Grafana"})," \u2013 Dashboard for platform health, feature access, ingestion success/failure."]}),"\n"]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsx)(n.h2,{id:"-supported-environments",children:"\ud83d\udcbb Supported Environments"}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsx)(n.li,{children:"Kubernetes (K8s)"}),"\n",(0,t.jsx)(n.li,{children:"Google Kubernetes Engine (GKE)"}),"\n",(0,t.jsx)(n.li,{children:"Amazon EKS"}),"\n"]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsx)(n.h2,{id:"-target-users",children:"\ud83d\udc65 Target Users"}),"\n",(0,t.jsxs)(n.table,{children:[(0,t.jsx)(n.thead,{children:(0,t.jsxs)(n.tr,{children:[(0,t.jsx)(n.th,{children:"User"}),(0,t.jsx)(n.th,{children:"Role"})]})}),(0,t.jsxs)(n.tbody,{children:[(0,t.jsxs)(n.tr,{children:[(0,t.jsx)(n.td,{children:"Data Scientists"}),(0,t.jsx)(n.td,{children:"Browse features, define jobs, approve/reject changes via Trufflebox UI"})]}),(0,t.jsxs)(n.tr,{children:[(0,t.jsx)(n.td,{children:"MLEs"}),(0,t.jsx)(n.td,{children:"Develop and push features using Spark/Flink/notebooks"})]}),(0,t.jsxs)(n.tr,{children:[(0,t.jsx)(n.td,{children:"Infra Admins"}),(0,t.jsx)(n.td,{children:"Manage store lifecycle, metadata, and approvals"})]}),(0,t.jsxs)(n.tr,{children:[(0,t.jsx)(n.td,{children:"Backend Devs"}),(0,t.jsx)(n.td,{children:"Use SDKs to retrieve features in Go/Python inference services"})]})]})]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsx)(n.h2,{id:"-benefits",children:"\u2705 Benefits"}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsx)(n.li,{children:"Unified real-time and offline ingestion."}),"\n",(0,t.jsx)(n.li,{children:"Low-latency inference-ready features."}),"\n",(0,t.jsx)(n.li,{children:"Config-driven 
orchestration."}),"\n",(0,t.jsx)(n.li,{children:"Built-in approval workflows via Trufflebox."}),"\n",(0,t.jsx)(n.li,{children:"Scalable across thousands of entities and feature groups."}),"\n"]}),"\n",(0,t.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,t.jsxs)(n.p,{children:["We welcome contributions from the community! Please see our ",(0,t.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"})," for details on how to get started."]}),"\n",(0,t.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,t.jsxs)(n.ul,{children:["\n",(0,t.jsxs)(n.li,{children:["\ud83d\udcac ",(0,t.jsx)(n.strong,{children:"Discord"}),": Join our ",(0,t.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,t.jsxs)(n.li,{children:["\ud83d\udc1b ",(0,t.jsx)(n.strong,{children:"Issues"}),": Report bugs and request features on ",(0,t.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,t.jsxs)(n.li,{children:["\ud83d\udce7 ",(0,t.jsx)(n.strong,{children:"Email"}),": Contact us at ",(0,t.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,t.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,t.jsxs)(n.p,{children:["BharatMLStack is open-source software licensed under the ",(0,t.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 1.1"}),"."]}),"\n",(0,t.jsx)(n.hr,{}),"\n",(0,t.jsx)("div",{align:"center",children:(0,t.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from Meesho"})}),"\n",(0,t.jsx)("div",{align:"center",children:(0,t.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function h(e={}){const{wrapper:n}={...(0,i.R)(),...e.components};return n?(0,t.jsx)(n,{...e,children:(0,t.jsx)(d,{...e})}):d(e)}}}]); \ 
No newline at end of file diff --git a/docs/assets/js/e8321834.dbcc9814.js b/docs/assets/js/e8321834.dbcc9814.js new file mode 100644 index 00000000..2756e845 --- /dev/null +++ b/docs/assets/js/e8321834.dbcc9814.js @@ -0,0 +1 @@ +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[149],{5842:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>d,contentTitle:()=>o,default:()=>h,frontMatter:()=>l,metadata:()=>s,toc:()=>c});const s=JSON.parse('{"id":"predator/v1.0.0/functionalities","title":"Key Functionalities","description":"Overview","source":"@site/docs/predator/v1.0.0/functionalities.md","sourceDirName":"predator/v1.0.0","slug":"/predator/v1.0.0/functionalities","permalink":"/BharatMLStack/predator/v1.0.0/functionalities","draft":false,"unlisted":false,"editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/docs/predator/v1.0.0/functionalities.md","tags":[],"version":"current","sidebarPosition":2,"frontMatter":{"title":"Key Functionalities","sidebar_position":2},"sidebar":"tutorialSidebar","previous":{"title":"Architecture","permalink":"/BharatMLStack/predator/v1.0.0/architecture"},"next":{"title":"Release Notes","permalink":"/BharatMLStack/predator/v1.0.0/release-notes"}}');var r=i(4848),t=i(8453);const l={title:"Key Functionalities",sidebar_position:2},o="Predator - Key Functionalities",d={},c=[{value:"Overview",id:"overview",level:2},{value:"Core Capabilities",id:"core-capabilities",level:2},{value:"Multi-Backend Inference",id:"multi-backend-inference",level:3},{value:"Dynamic Batching",id:"dynamic-batching",level:3},{value:"Concurrent Model Execution",id:"concurrent-model-execution",level:3},{value:"Model Versioning & Ensembles",id:"model-versioning--ensembles",level:3},{value:"Model Instance Scaling",id:"model-instance-scaling",level:3},{value:"Inference & API",id:"inference--api",level:2},{value:"gRPC via Helix Client",id:"grpc-via-helix-client",level:3},{value:"Model Repository",id:"model-repository",level:3},{value:"Deployment & Operational 
Features",id:"deployment--operational-features",level:2},{value:"Custom Triton Images",id:"custom-triton-images",level:3},{value:"Image Distribution",id:"image-distribution",level:3},{value:"Health Probes",id:"health-probes",level:3},{value:"Autoscaling",id:"autoscaling",level:3},{value:"Observability",id:"observability",level:2},{value:"Contributing",id:"contributing",level:2},{value:"Community & Support",id:"community--support",level:2},{value:"License",id:"license",level:2}];function a(e){const n={a:"a",code:"code",h1:"h1",h2:"h2",h3:"h3",header:"header",hr:"hr",li:"li",p:"p",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,t.R)(),...e.components};return(0,r.jsxs)(r.Fragment,{children:[(0,r.jsx)(n.header,{children:(0,r.jsx)(n.h1,{id:"predator---key-functionalities",children:"Predator - Key Functionalities"})}),"\n",(0,r.jsx)(n.h2,{id:"overview",children:"Overview"}),"\n",(0,r.jsxs)(n.p,{children:["Predator is a scalable, high-performance model inference service built as a wrapper around ",(0,r.jsx)(n.strong,{children:"NVIDIA Triton Inference Server"}),". It serves Deep Learning and tree-based models with low latency in ",(0,r.jsx)(n.strong,{children:"Kubernetes"}),", integrates with the ",(0,r.jsx)(n.strong,{children:"Online Feature Store (OnFS)"})," and uses ",(0,r.jsx)(n.strong,{children:"Interflow"})," for orchestration between clients, feature store, and inference engine. 
Clients send inference requests via the ",(0,r.jsx)(n.strong,{children:"Helix client"})," over gRPC."]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"core-capabilities",children:"Core Capabilities"}),"\n",(0,r.jsx)(n.h3,{id:"multi-backend-inference",children:"Multi-Backend Inference"}),"\n",(0,r.jsx)(n.p,{children:"Predator leverages Triton's pluggable backends so you can serve a variety of model types from a single deployment:"}),"\n",(0,r.jsxs)(n.table,{children:[(0,r.jsx)(n.thead,{children:(0,r.jsxs)(n.tr,{children:[(0,r.jsx)(n.th,{children:"Backend"}),(0,r.jsx)(n.th,{children:"Use Case"})]})}),(0,r.jsxs)(n.tbody,{children:[(0,r.jsxs)(n.tr,{children:[(0,r.jsx)(n.td,{children:(0,r.jsx)(n.strong,{children:"TensorRT"})}),(0,r.jsx)(n.td,{children:"GPU-optimized DL; serialized engines (FP16/INT8)"})]}),(0,r.jsxs)(n.tr,{children:[(0,r.jsx)(n.td,{children:(0,r.jsx)(n.strong,{children:"PyTorch"})}),(0,r.jsx)(n.td,{children:"Native PyTorch via LibTorch"})]}),(0,r.jsxs)(n.tr,{children:[(0,r.jsx)(n.td,{children:(0,r.jsx)(n.strong,{children:"ONNX Runtime"})}),(0,r.jsx)(n.td,{children:"Framework-agnostic ONNX with TensorRT/GPU"})]}),(0,r.jsxs)(n.tr,{children:[(0,r.jsx)(n.td,{children:(0,r.jsx)(n.strong,{children:"TensorFlow"})}),(0,r.jsx)(n.td,{children:"SavedModel format"})]}),(0,r.jsxs)(n.tr,{children:[(0,r.jsx)(n.td,{children:(0,r.jsx)(n.strong,{children:"Python"})}),(0,r.jsx)(n.td,{children:"Custom preprocessing, postprocessing, or unsupported models"})]}),(0,r.jsxs)(n.tr,{children:[(0,r.jsx)(n.td,{children:(0,r.jsx)(n.strong,{children:"FIL"})}),(0,r.jsx)(n.td,{children:"Tree-based models (XGBoost, LightGBM, Random Forest) on GPU"})]}),(0,r.jsxs)(n.tr,{children:[(0,r.jsx)(n.td,{children:(0,r.jsx)(n.strong,{children:"DALI"})}),(0,r.jsx)(n.td,{children:"GPU-accelerated data preprocessing (image, audio, video)"})]}),(0,r.jsxs)(n.tr,{children:[(0,r.jsx)(n.td,{children:(0,r.jsx)(n.strong,{children:"Custom"})}),(0,r.jsx)(n.td,{children:"C++/Python backends for proprietary 
or specialized runtimes"})]})]})]}),"\n",(0,r.jsx)(n.h3,{id:"dynamic-batching",children:"Dynamic Batching"}),"\n",(0,r.jsx)(n.p,{children:"Triton combines multiple incoming requests into a single batch at runtime."}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Higher GPU utilization and improved throughput"}),"\n",(0,r.jsx)(n.li,{children:"Reduced latency variance"}),"\n",(0,r.jsxs)(n.li,{children:["Configurable ",(0,r.jsx)(n.code,{children:"preferred_batch_size"})," and ",(0,r.jsx)(n.code,{children:"max_queue_delay_microseconds"})," in ",(0,r.jsx)(n.code,{children:"config.pbtxt"})]}),"\n"]}),"\n",(0,r.jsx)(n.h3,{id:"concurrent-model-execution",children:"Concurrent Model Execution"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Run multiple models simultaneously"}),"\n",(0,r.jsx)(n.li,{children:"Run multiple instances of the same model"}),"\n",(0,r.jsxs)(n.li,{children:["Distribute load across GPUs via ",(0,r.jsx)(n.code,{children:"instance_group"})," in model config"]}),"\n"]}),"\n",(0,r.jsx)(n.h3,{id:"model-versioning--ensembles",children:"Model Versioning & Ensembles"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Versioning"}),": Multiple versions per model (e.g. 
",(0,r.jsx)(n.code,{children:"1/"}),", ",(0,r.jsx)(n.code,{children:"2/"})," in the model repository)"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Ensembles"}),": Define a pipeline of models as an ensemble; eliminates intermediate network hops and reduces latency"]}),"\n"]}),"\n",(0,r.jsx)(n.h3,{id:"model-instance-scaling",children:"Model Instance Scaling"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"Deploy multiple copies of a model for parallel inference and load isolation"}),"\n",(0,r.jsxs)(n.li,{children:["Configured via ",(0,r.jsx)(n.code,{children:"instance_group"})]}),"\n"]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"inference--api",children:"Inference & API"}),"\n",(0,r.jsx)(n.h3,{id:"grpc-via-helix-client",children:"gRPC via Helix Client"}),"\n",(0,r.jsxs)(n.p,{children:["Predator uses ",(0,r.jsx)(n.strong,{children:"gRPC"})," for efficient request/response handling. Client applications (e.g. Realestate, IOP) send inference requests through the ",(0,r.jsx)(n.strong,{children:"Helix client"}),", which talks to the Triton Inference Server inside the Predator pod."]}),"\n",(0,r.jsx)(n.h3,{id:"model-repository",children:"Model Repository"}),"\n",(0,r.jsxs)(n.p,{children:["Models are stored in a local model repository. Predator materializes this via an ",(0,r.jsx)(n.strong,{children:"Init Container"})," that downloads artifacts from cloud storage (e.g. 
GCS) so Triton has no runtime dependency on remote storage during inference."]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"deployment--operational-features",children:"Deployment & Operational Features"}),"\n",(0,r.jsx)(n.h3,{id:"custom-triton-images",children:"Custom Triton Images"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:["Production uses ",(0,r.jsx)(n.strong,{children:"custom-built"})," Triton images (only required backends) for smaller size and faster startup"]}),"\n",(0,r.jsxs)(n.li,{children:["Images built on GCP VM, pushed to ",(0,r.jsx)(n.strong,{children:"Artifact Registry"}),", and referenced in Helm deployments"]}),"\n",(0,r.jsxs)(n.li,{children:["Optional ",(0,r.jsx)(n.strong,{children:"response caching"})," via custom cache plugins added at image build time"]}),"\n"]}),"\n",(0,r.jsx)(n.h3,{id:"image-distribution",children:"Image Distribution"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Secondary boot disk caching"}),": Triton image pre-cached on GPU node pool to reduce pod startup and scale-up latency"]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Image streaming"}),": Optionally used for faster time-to-readiness during scaling"]}),"\n"]}),"\n",(0,r.jsx)(n.h3,{id:"health-probes",children:"Health Probes"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:["Readiness and liveness use ",(0,r.jsx)(n.code,{children:"/v2/health/ready"})]}),"\n",(0,r.jsx)(n.li,{children:"Triton receives traffic only after models are loaded; failed instances are restarted automatically"}),"\n"]}),"\n",(0,r.jsx)(n.h3,{id:"autoscaling",children:"Autoscaling"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsx)(n.li,{children:"CPU-based scaling for generic load"}),"\n",(0,r.jsxs)(n.li,{children:["GPU-based scaling using ",(0,r.jsx)(n.strong,{children:"DCGM"})," metrics (utilization, memory, power); custom queries drive 
scale-up/scale-down"]}),"\n"]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"observability",children:"Observability"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Prometheus metrics"}),": Latency, throughput, GPU utilization, and more"]}),"\n",(0,r.jsxs)(n.li,{children:["Metrics emitted from the Triton Inference Container and visualized in ",(0,r.jsx)(n.strong,{children:"Grafana"})]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Warmup requests"}),": Configurable to preload kernels and avoid cold-start latency"]}),"\n"]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)(n.h2,{id:"contributing",children:"Contributing"}),"\n",(0,r.jsxs)(n.p,{children:["We welcome contributions! See the ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/CONTRIBUTING.md",children:"Contributing Guide"}),"."]}),"\n",(0,r.jsx)(n.h2,{id:"community--support",children:"Community & Support"}),"\n",(0,r.jsxs)(n.ul,{children:["\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Discord"}),": ",(0,r.jsx)(n.a,{href:"https://discord.gg/XkT7XsV2AU",children:"community chat"})]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Issues"}),": ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/issues",children:"GitHub Issues"})]}),"\n",(0,r.jsxs)(n.li,{children:[(0,r.jsx)(n.strong,{children:"Email"}),": ",(0,r.jsx)(n.a,{href:"mailto:ml-oss@meesho.com",children:"ml-oss@meesho.com"})]}),"\n"]}),"\n",(0,r.jsx)(n.h2,{id:"license",children:"License"}),"\n",(0,r.jsxs)(n.p,{children:["BharatMLStack is open-source under the ",(0,r.jsx)(n.a,{href:"https://github.com/Meesho/BharatMLStack/blob/main/LICENSE.md",children:"BharatMLStack Business Source License 1.1"}),"."]}),"\n",(0,r.jsx)(n.hr,{}),"\n",(0,r.jsx)("div",{align:"center",children:(0,r.jsx)("strong",{children:"Built with \u2764\ufe0f for the ML community from 
Meesho"})}),"\n",(0,r.jsx)("div",{align:"center",children:(0,r.jsx)("strong",{children:"If you find this useful, \u2b50\ufe0f the repo \u2014 your support means the world to us!"})})]})}function h(e={}){const{wrapper:n}={...(0,t.R)(),...e.components};return n?(0,r.jsx)(n,{...e,children:(0,r.jsx)(a,{...e})}):a(e)}},8453:(e,n,i)=>{i.d(n,{R:()=>l,x:()=>o});var s=i(6540);const r={},t=s.createContext(r);function l(e){const n=s.useContext(t);return s.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(r):e.components||r:l(e.components),s.createElement(t.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/f2c141e4.7dc49a6b.js b/docs/assets/js/f2c141e4.11047a92.js similarity index 68% rename from docs/assets/js/f2c141e4.7dc49a6b.js rename to docs/assets/js/f2c141e4.11047a92.js index 6d7c497b..569f0849 100644 --- a/docs/assets/js/f2c141e4.7dc49a6b.js +++ b/docs/assets/js/f2c141e4.11047a92.js @@ -1 +1 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[1909],{161:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>l,contentTitle:()=>o,default:()=>c,frontMatter:()=>a,metadata:()=>t,toc:()=>d});var t=i(3983),s=i(4848),r=i(8453);const a={slug:"post-one",title:"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)",authors:["adarsha","aditya","bhawani","jigar"],date:new Date("2022-11-15T00:00:00.000Z"),tags:["online-feature-store","interaction-store","mlplatform","meesho"]},o=void 0,l={authorsImageUrls:[void 0,void 0,void 0,void 0]},d=[{value:"The Genesis: How a Friday Night Roast Sparked Meesho\u2019s ML Platform",id:"the-genesis-how-a-friday-night-roast-sparked-meeshos-ml-platform",level:2},{value:"The Turning Point: From Batch to Real-Time",id:"the-turning-point-from-batch-to-real-time",level:2},{value:"First Generation Design",id:"first-generation-design",level:2},{value:"1. 
IOP Framework: A Real-Time DAG Executor",id:"1-iop-framework-a-real-time-dag-executor",level:3},{value:"2. Online Feature Store - 0th Version",id:"2-online-feature-store---0th-version",level:3},{value:"3. Interaction Store - 0th Version",id:"3-interaction-store---0th-version",level:3},{value:"Building the Online Feature Store - 0th Version",id:"building-the-online-feature-store---0th-version",level:2},{value:"Choosing the Right Tech Stack",id:"choosing-the-right-tech-stack",level:3},{value:"Streamlining the Data Flow",id:"streamlining-the-data-flow",level:3},{value:"The Challenges: Data Format and Storage",id:"the-challenges-data-format-and-storage",level:2},{value:"Feature Consistency",id:"feature-consistency",level:3},{value:"TTL Granularity",id:"ttl-granularity",level:3},{value:"Extensibility Across Databases",id:"extensibility-across-databases",level:3},{value:"Overcoming Technical Constraints",id:"overcoming-technical-constraints",level:2},{value:"The Solution: Schema Separation",id:"the-solution-schema-separation",level:2},{value:"Tracking Changes in Feature Groups",id:"tracking-changes-in-feature-groups",level:2},{value:"Common Real-World Scenarios:",id:"common-real-world-scenarios",level:3},{value:"The Solution: Schema Versioning",id:"the-solution-schema-versioning",level:2},{value:"Backward Compatibility",id:"backward-compatibility",level:3},{value:"Partial Availability Handling",id:"partial-availability-handling",level:3},{value:"Safe Writes Without Pipeline Pauses",id:"safe-writes-without-pipeline-pauses",level:3},{value:"Interaction Store - 0th Version",id:"interaction-store---0th-version",level:2},{value:"Event Ingestion",id:"event-ingestion",level:2},{value:"Storage Design",id:"storage-design",level:2},{value:"Why Redis?",id:"why-redis",level:3},{value:"Storage Structure",id:"storage-structure",level:3},{value:"Built-in Guardrails",id:"built-in-guardrails",level:3},{value:"Conclusion: Laying the Foundation for Real-Time 
ML",id:"conclusion-laying-the-foundation-for-real-time-ml",level:2}];function h(e){const n={a:"a",br:"br",code:"code",em:"em",h1:"h1",h2:"h2",h3:"h3",hr:"hr",img:"img",li:"li",p:"p",pre:"pre",strong:"strong",ul:"ul",...(0,r.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"BharatMLStack",src:i(1164).A+"",width:"1396",height:"460"})}),"\n",(0,s.jsx)(n.h2,{id:"the-genesis-how-a-friday-night-roast-sparked-meeshos-ml-platform",children:"The Genesis: How a Friday Night Roast Sparked Meesho\u2019s ML Platform"}),"\n",(0,s.jsx)(n.p,{children:"It all started in early 2022, over a casual Friday evening catch-up. Like many great origin stories, this one began with friendly banter between a group of backend engineers and data scientists. As the conversations unfolded, so did the roasting\u2014until one remark hit a little too close to home:"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.em,{children:'"Why are we still crunching data for Monthly Active Users (MAU) when the next day it\u2019s all about Daily Active Users (DAU)?"'})}),"\n",(0,s.jsx)(n.p,{children:"The laughter died down, and the question lingered. When we regrouped on Monday\u2014clear-headed and slightly reflective\u2014we decided to dig into the numbers. What we discovered was quite revealing: a large portion of compute resources wasn\u2019t being put to good use.\nMuch of the system\u2019s effort was spent supporting users who weren\u2019t actively engaging, and even for new users, the experience wasn\u2019t optimized to make a meaningful impact."}),"\n",(0,s.jsxs)(n.p,{children:["At the same time, Meesho had just launched a company-wide initiative to reduce costs\u2014and every team had to contribute. 
This realization sparked the journey that would eventually lead to the ",(0,s.jsx)(n.strong,{children:"Meesho ML Platform"}),", known today as ",(0,s.jsx)(n.strong,{children:"BharatMLStack"}),"."]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt Text",src:i(1757).A+"",width:"1600",height:"1078"})}),"\n",(0,s.jsx)(n.p,{children:"Before the ML Platform, our recommendation and ranking pipelines followed a batch processing approach:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Data Ingestion"}),": The Data Platform team executed ETL jobs to ingest raw user data\u2014including user profiles, interaction logs, and product impressions\u2014into designated S3 buckets."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Layer 1"}),": Embedding Generation: On the Data Science side, Spark jobs pulled data from multiple S3 sources, cleaned and preprocessed it, and applied matrix factorization to generate user and item embeddings. The processed data and embeddings were then stored back in S3 in a structured format."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Layer 2"}),": Candidate Generation (CG): In this stage, Spark jobs leveraged embeddings and historical interaction data to generate candidate recommendations for users. 
These candidate lists were subsequently written to S3."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Layer 3"}),": Ranking and Merging \u2013 A final round of processing ranked the generated candidates using ML models, combined different candidate lists, and stored the final ranked recommendations in a caching system."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Serving"}),': A microservice retrieved ranked recommendations from an in-memory data store via exposed APIs, delivering personalized listings across key surfaces such as "For You" and Category Landing Pages (CLP).']}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This approach held up well\u2014until Meesho started seeing a significant surge in traffic."}),"\n",(0,s.jsx)(n.h2,{id:"the-turning-point-from-batch-to-real-time",children:"The Turning Point: From Batch to Real-Time"}),"\n",(0,s.jsxs)(n.p,{children:["At this time, the team was iterating on new ",(0,s.jsx)(n.strong,{children:"Ranker models"}),", and real-time inference seemed like the next logical step. But Rankers needed ",(0,s.jsx)(n.strong,{children:"real-time feature retrieval"}),", which meant an ",(0,s.jsx)(n.strong,{children:"online feature store"})," had to be built first."]}),"\n",(0,s.jsxs)(n.p,{children:["Exploring open-source options led to ",(0,s.jsx)(n.strong,{children:"cost vs. performance trade-offs"}),", but Meesho\u2019s surging traffic meant that ",(0,s.jsx)(n.strong,{children:"latency and stability were non-negotiable"}),". After multiple debates and stakeholder discussions, a bold decision was made:"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.em,{children:"We would build our own feature store."})}),"\n",(0,s.jsxs)(n.p,{children:["Meanwhile, efforts began to bring ",(0,s.jsx)(n.strong,{children:"Candidate Generators (CGs)"})," to real-time. The challenge? 
",(0,s.jsx)(n.strong,{children:"Storing and retrieving user interactions quickly enough"})," to power real-time recommendations."]}),"\n",(0,s.jsxs)(n.p,{children:["As the team dove deeper, a new roadblock emerged:",(0,s.jsx)(n.br,{}),"\n","Our ML jobs were orchestrated using ",(0,s.jsx)(n.strong,{children:"Airflow DAGs"}),", giving data scientists flexibility in experimentation. But transitioning to real-time execution threatened this agility. Every change would now require backend engineering support, ",(0,s.jsx)(n.strong,{children:"slowing down iteration cycles"}),"."]}),"\n",(0,s.jsxs)(n.p,{children:["That\u2019s when the idea struck:",(0,s.jsx)(n.br,{}),"\n","We needed a ",(0,s.jsx)(n.strong,{children:"framework for real-time DAG execution"}),"\u2014one that preserved the same flexibility as Airflow but worked for ",(0,s.jsx)(n.strong,{children:"streaming data"}),"."]}),"\n",(0,s.jsxs)(n.p,{children:["This moment shaped the ",(0,s.jsx)(n.strong,{children:"next phase of our journey"}),"."]}),"\n",(0,s.jsx)(n.h2,{id:"first-generation-design",children:"First Generation Design"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt Text",src:i(7848).A+"",width:"1600",height:"1006"})}),"\n",(0,s.jsx)(n.h1,{id:"laying-the-groundwork-the-first-gen-ml-platform",children:"Laying the Groundwork: The First-Gen ML Platform"}),"\n",(0,s.jsx)(n.p,{children:"To solve these challenges, the team built three foundational components:"}),"\n",(0,s.jsx)(n.h3,{id:"1-iop-framework-a-real-time-dag-executor",children:"1. IOP Framework: A Real-Time DAG Executor"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Reusable Nodes"}),": Each DAG node (e.g., an invocation to a CG service, a ranker, or a filter) had to be implemented only once. 
After that, it could be reused across any workflow by referencing it in config."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Config-driven Dynamic Graphs"}),": Execution graphs were defined as adjacency lists stored in ",(0,s.jsx)(n.strong,{children:"ZooKeeper"}),", allowing teams to modify the sequence or structure of operations without touching application code."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Plug-and-play CGs"}),": The Candidate Generator interface was preserved, so a single CG node could call any CG service by passing ",(0,s.jsx)(n.code,{children:"cg_name"})," in the request. This drastically reduced the code surface area and improved maintainability."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Production-Grade DAGs"}),": DAGs were designed to execute in ",(0,s.jsx)(n.strong,{children:"low-latency real-time environments"}),", with support for ",(0,s.jsx)(n.strong,{children:"parallel execution, retries, and branching"}),"."]}),"\n"]}),"\n",(0,s.jsx)("u",{children:(0,s.jsx)(n.a,{href:"https://www.meesho.io/blog/rebuilding-meeshos-ranking-platform",children:"More about IOP DAG"})}),"\n",(0,s.jsx)(n.h3,{id:"2-online-feature-store---0th-version",children:"2. Online Feature Store - 0th Version"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["Used ",(0,s.jsx)(n.strong,{children:"Cassandra"})," and ",(0,s.jsx)(n.strong,{children:"Redis"})," for low-latency feature serving."]}),"\n",(0,s.jsxs)(n.li,{children:["Maintained feature consistency using ",(0,s.jsx)(n.strong,{children:"Feature Groups"})," with TTL-based expiry."]}),"\n",(0,s.jsxs)(n.li,{children:["A hybrid schema was used: feature keys stored in ",(0,s.jsx)(n.strong,{children:"ZooKeeper"}),", data stored in ",(0,s.jsx)(n.strong,{children:"compact arrays"}),"."]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"3-interaction-store---0th-version",children:"3. 
Interaction Store - 0th Version"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Captured real-time user interactions like clicks, orders, and add-to-cart events."}),"\n",(0,s.jsxs)(n.li,{children:["Stored event data in ",(0,s.jsx)(n.strong,{children:"Redis ZSETs (sorted sets)"})," to enable fast lookups for recommendation engines."]}),"\n",(0,s.jsxs)(n.li,{children:["Provided an API to fetch a user's ",(0,s.jsxs)(n.strong,{children:["last ",(0,s.jsx)(n.em,{children:"k"})," interactions"]})," or ",(0,s.jsx)(n.strong,{children:"interactions within a time window"}),"."]}),"\n"]}),"\n",(0,s.jsxs)(n.p,{children:["With these components in place, ",(0,s.jsx)(n.strong,{children:"real-time ML at Meesho became a reality"}),"."]}),"\n",(0,s.jsx)(n.p,{children:"This was just the beginning."}),"\n",(0,s.jsx)(n.h2,{id:"building-the-online-feature-store---0th-version",children:"Building the Online Feature Store - 0th Version"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt text",src:i(5017).A+"",width:"1574",height:"562"})}),"\n",(0,s.jsx)(n.h3,{id:"choosing-the-right-tech-stack",children:"Choosing the Right Tech Stack"}),"\n",(0,s.jsxs)(n.p,{children:["We spent considerable time evaluating various databases, caches, and communication protocols for our ",(0,s.jsx)(n.strong,{children:"online feature store"}),". 
After carefully weighing ",(0,s.jsx)(n.strong,{children:"cost, latency, throughput"}),", and ",(0,s.jsx)(n.strong,{children:"operational stability"}),", we settled on a combination of:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Cassandra"})," and ",(0,s.jsx)(n.strong,{children:"Redis"})," for storage"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"gRPC + Proto3"})," as our communication layer"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"streamlining-the-data-flow",children:"Streamlining the Data Flow"}),"\n",(0,s.jsx)(n.p,{children:"To keep things simple in the initial version:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Feature engineering jobs"})," wrote raw outputs to an ",(0,s.jsx)(n.strong,{children:"S3 bucket"})]}),"\n",(0,s.jsxs)(n.li,{children:["A ",(0,s.jsx)(n.strong,{children:"daily feature push job"}),":","\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Read from S3"}),"\n",(0,s.jsxs)(n.li,{children:["Grouped related features into ",(0,s.jsx)(n.strong,{children:"Feature Groups"})," (ensuring consistency)"]}),"\n",(0,s.jsxs)(n.li,{children:["Pushed them to ",(0,s.jsx)(n.strong,{children:"Kafka"})]}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"For features requiring frequent updates:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Ad-hoc jobs"})," computed features at a higher frequency"]}),"\n",(0,s.jsxs)(n.li,{children:["These jobs pushed to both ",(0,s.jsx)(n.strong,{children:"Kafka"})," and ",(0,s.jsx)(n.strong,{children:"S3"})," (S3 preserved historical data for future model training)"]}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"the-challenges-data-format-and-storage",children:"The Challenges: Data Format and Storage"}),"\n",(0,s.jsxs)(n.p,{children:["One of the most critical design challenges was how to store feature data ",(0,s.jsx)(n.strong,{children:"efficiently and 
consistently"}),", especially in databases like ",(0,s.jsx)(n.strong,{children:"Cassandra"})," and ",(0,s.jsx)(n.strong,{children:"Redis"}),", which come with unique storage constraints."]}),"\n",(0,s.jsx)(n.p,{children:"We had to solve for three key requirements:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"feature-consistency",children:"Feature Consistency"}),"\n",(0,s.jsxs)(n.p,{children:["When a feature group contains features like ",(0,s.jsx)(n.code,{children:"order_count_1h"})," and ",(0,s.jsx)(n.code,{children:"click_count_1h"}),", both must reflect the ",(0,s.jsx)(n.strong,{children:"same time window"}),". Inconsistent updates would lead to ",(0,s.jsx)(n.strong,{children:"unreliable model predictions"}),"."]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"ttl-granularity",children:"TTL Granularity"}),"\n",(0,s.jsxs)(n.p,{children:["Each feature group required an ",(0,s.jsx)(n.strong,{children:"expiry timestamp"}),", so that ",(0,s.jsx)(n.strong,{children:"all features within it expired together"}),"\u2014preserving consistency during reads."]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"extensibility-across-databases",children:"Extensibility Across Databases"}),"\n",(0,s.jsxs)(n.p,{children:["We anticipated that infra needs would evolve. 
To future-proof our system, the data format was designed to be ",(0,s.jsx)(n.strong,{children:"decoupled from DB-specific layouts"}),", enabling portability to systems like ",(0,s.jsx)(n.strong,{children:"ScyllaDB"}),", ",(0,s.jsx)(n.strong,{children:"DynamoDB"}),", ",(0,s.jsx)(n.strong,{children:"HBase"}),", or ",(0,s.jsx)(n.strong,{children:"BigTable"}),"."]}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"overcoming-technical-constraints",children:"Overcoming Technical Constraints"}),"\n",(0,s.jsx)(n.p,{children:'At the time, we were using Cassandra, which not only imposed a soft limit of 75 columns per row, but also exhibited significant performance degradation as the number of columns increased further, particularly in memory-constrained machines. Wide rows caused high memory usage during reads, unpredictable latencies due to heavy deserialization overhead, and inefficiencies during compactions and repairs. This ruled out the naive "one column per feature" approach. 
We needed a format that was compact, minimized the number of columns, and remained efficient and portable across different storage systems.'}),"\n",(0,s.jsx)(n.h2,{id:"the-solution-schema-separation",children:"The Solution: Schema Separation"}),"\n",(0,s.jsx)(n.p,{children:"We introduced the concept of Feature Groups\u2014logical groupings of features that must remain consistent with one another.\nTo represent these groups efficiently, we adopted a layered storage approach:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Feature Labels (Keys)"})," were stored in ZooKeeper, serving as the schema."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Feature Values"})," were stored as a comma-separated string array in Cassandra or Redis."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Expiry Timestamp and Schema Version"})," were appended at the end of the string, delimited by semicolons."]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Example:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"feature_1_value,feature_2_value,feature_3_value;expiry_ts;schema_version\n"})}),"\n",(0,s.jsx)(n.p,{children:"This format allowed:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Consistent writes and reads at the group level"}),"\n",(0,s.jsx)(n.li,{children:"Easy parsing of feature values using the schema lookup from ZooKeeper"}),"\n",(0,s.jsx)(n.li,{children:"Efficient storage with minimal DB column usage"}),"\n",(0,s.jsx)(n.li,{children:"Support for per-group TTLs and schema evolution"}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"tracking-changes-in-feature-groups",children:"Tracking Changes in Feature Groups"}),"\n",(0,s.jsx)(n.p,{children:"Feature groups don\u2019t stay static. As models evolve, features get added, renamed, or removed. 
But schema changes often go live before the data is ready\u2014and stopping ingestion just to wait for everything to align isn't feasible."}),"\n",(0,s.jsx)(n.h3,{id:"common-real-world-scenarios",children:"Common Real-World Scenarios:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"A new feature is added to the schema, but ingestion jobs still use the older schema version."}),"\n",(0,s.jsx)(n.li,{children:"Ongoing writes don\u2019t include the newly added feature, and stopping ingestion would break freshness for existing features."}),"\n",(0,s.jsx)(n.li,{children:"During serving, models request a mix of old and new features, depending on rollout stages."}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"the-solution-schema-versioning",children:"The Solution: Schema Versioning"}),"\n",(0,s.jsx)(n.p,{children:"We solved this with versioned feature group schemas, which unlocked several capabilities:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"backward-compatibility",children:"Backward Compatibility"}),"\n","Older ingestion jobs can continue writing using older schema versions. During reads, the system uses the schema version embedded in the value to interpret the data correctly."]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"partial-availability-handling",children:"Partial Availability Handling"}),"\n","During inference, if some features in the request aren\u2019t available (due to rollout delays or missing data), the system serves default values, ensuring the inference call doesn\u2019t fail."]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"safe-writes-without-pipeline-pauses",children:"Safe Writes Without Pipeline Pauses"}),"\n","With schema versioning, we no longer had to stop ingestion pipelines for schema updates. 
Writes using previous versions can continue safely, and downstream consumers evolve independently.\nThis design gave us the flexibility to move fast without breaking things\u2014preserving data quality, enabling experimentation, and ensuring reliability at scale."]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt Text",src:i(8733).A+"",width:"1600",height:"599"})}),"\n",(0,s.jsx)(n.h2,{id:"interaction-store---0th-version",children:"Interaction Store - 0th Version"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt Text",src:i(395).A+"",width:"1600",height:"518"})}),"\n",(0,s.jsxs)(n.p,{children:["To power real-time Candidate Generators (CGs), we needed fast access to user behavior signals\u2014like what a user recently clicked, ordered, or added to their cart. These interactions form the basis for many real-time recommendations, such as ",(0,s.jsx)(n.strong,{children:"Similar Products"}),", ",(0,s.jsx)(n.strong,{children:"People Also Viewed"}),", or ",(0,s.jsx)(n.strong,{children:"Recently Ordered Again"}),".\nFor the ",(0,s.jsx)(n.strong,{children:"0th version"})," of the Interaction Store, we focused on a design that was ",(0,s.jsx)(n.strong,{children:"simple, fast, and reliable"})," \u2014 optimized for high-throughput ingestion and low-latency lookups."]}),"\n",(0,s.jsx)(n.h2,{id:"event-ingestion",children:"Event Ingestion"}),"\n",(0,s.jsx)(n.p,{children:"We instrumented our backend services to emit key user interaction events to Kafka in real time. 
These included:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Click"}),"\n",(0,s.jsx)(n.li,{children:"Order"}),"\n",(0,s.jsx)(n.li,{children:"Add to Cart"}),"\n",(0,s.jsx)(n.li,{children:"Wishlist"}),"\n",(0,s.jsx)(n.li,{children:"Share"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Each event carried essential metadata:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"userId \u2014 uniquely identifies the user"}),"\n",(0,s.jsx)(n.li,{children:"productId \u2014 the item being interacted with"}),"\n",(0,s.jsx)(n.li,{children:"timestamp \u2014 the moment the interaction occurred"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This decoupled the interaction logging from storage, allowing ingestion and consumption to scale independently."}),"\n",(0,s.jsx)(n.h2,{id:"storage-design",children:"Storage Design"}),"\n",(0,s.jsx)(n.p,{children:"To store these events, we built Kafka consumers that processed the incoming streams and wrote the data into Redis, using sorted sets (ZSETs) as the primary data structure."}),"\n",(0,s.jsx)(n.h3,{id:"why-redis",children:"Why Redis?"}),"\n",(0,s.jsx)(n.p,{children:"Redis gave us:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Low-latency"})," reads and writes"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Time-ordered data"})," using ZSETs (via score = timestamp)"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Native TTL support"}),", if needed in later versions"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"In-memory performance"})," \u2014ideal for real-time CGs"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"storage-structure",children:"Storage Structure"}),"\n",(0,s.jsx)(n.p,{children:"Each user\u2019s interactions were stored using a composite key format, uniquely identifying the user and interaction type. 
This structure allowed efficient organization and quick retrieval of recent activity for recommendation generation:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"userId_eventType \u2192 ZSET[...(pid, ts)...]\n"})}),"\n",(0,s.jsx)(n.p,{children:"Within each ZSET:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["The ",(0,s.jsx)(n.strong,{children:"timestamp"})," served as the score, maintaining temporal order"]}),"\n",(0,s.jsxs)(n.li,{children:["The ",(0,s.jsx)(n.strong,{children:"productId"})," (optionally with metadata) was the ",(0,s.jsx)(n.strong,{children:"value"})]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This allowed us to efficiently retrieve interactions through an HTTP-based API server, which exposed two query modes:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["Fetch the ",(0,s.jsx)(n.strong,{children:"last k interactions"})," of a specific type for a given user with ",(0,s.jsx)(n.code,{children:"ZREVRANGE userId_eventType 0 k-1"})]}),"\n",(0,s.jsxs)(n.li,{children:["Retrieve ",(0,s.jsx)(n.strong,{children:"all interactions within a time range"})," (e.g., last 24 hours) with ",(0,s.jsx)(n.code,{children:"ZREVRANGEBYSCORE userId_eventType maxTs minTs"})]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"built-in-guardrails",children:"Built-in Guardrails"}),"\n",(0,s.jsx)(n.p,{children:"Since Redis was the sole store, we implemented High Availability (HA) to prevent data loss. 
To optimize memory usage, we also enforced size limits per event type\u2014only storing the last k interactions per user, with older entries truncated."}),"\n",(0,s.jsx)(n.h2,{id:"conclusion-laying-the-foundation-for-real-time-ml",children:"Conclusion: Laying the Foundation for Real-Time ML"}),"\n",(0,s.jsxs)(n.p,{children:["In this first phase, we tackled the ",(0,s.jsx)(n.strong,{children:"fundamentals"}),"\u2014shifting from batch-based recommendations to a ",(0,s.jsx)(n.strong,{children:"real-time, ML-powered recommendation platform"})," that could keep up with Meesho\u2019s growth."]}),"\n",(0,s.jsxs)(n.p,{children:["With the ",(0,s.jsx)(n.strong,{children:"IOP Framework"}),", ",(0,s.jsx)(n.strong,{children:"Online Feature Store"}),", and ",(0,s.jsx)(n.strong,{children:"Interaction Store"}),", we built the core infrastructure to support real-time personalization at scale. These wins have already unlocked:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"\u2705 Faster, more dynamic recommendations for millions of users."}),"\n",(0,s.jsx)(n.li,{children:"\u2705 Better infrastructure efficiency, reducing wasted compute power."}),"\n",(0,s.jsx)(n.li,{children:"\u2705 A flexible, modular system that allows for further experimentation."}),"\n"]}),"\n",(0,s.jsxs)(n.p,{children:["But this is just the beginning. 
While we've solved key challenges, ",(0,s.jsx)(n.strong,{children:"certain roadblocks remain"})," \u2014from optimizing ",(0,s.jsx)(n.strong,{children:"cost-performance trade-offs"})," to ",(0,s.jsx)(n.strong,{children:"seamlessly evolving schemas"}),"."]}),"\n",(0,s.jsxs)(n.p,{children:["This foundational work laid the path for a reliable and scalable ",(0,s.jsx)(n.strong,{children:"real-time feature serving layer"}),"."]})]})}function c(e={}){const{wrapper:n}={...(0,r.R)(),...e.components};return n?(0,s.jsx)(n,{...e,children:(0,s.jsx)(h,{...e})}):h(e)}},395:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/interaction-store-v0-68167b64c6e462ef2f177f0f86d55bda.png"},1164:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"},1757:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/old-batch-arch-bc2cedbc1fed0fc6f08479ba8fe52996.png"},3983:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-one","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-one/index.md","source":"@site/blog/bharatmlstack-history/post-one/index.md","title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)","description":"BharatMLStack","date":"2022-11-15T00:00:00.000Z","tags":[{"inline":true,"label":"online-feature-store","permalink":"/BharatMLStack/blog/tags/online-feature-store"},{"inline":true,"label":"interaction-store","permalink":"/BharatMLStack/blog/tags/interaction-store"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"}],"readingTime":10.25,"hasTruncateMarker":false,"authors":[{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null},{"name":"Aditya Kumar","title":"Lead Software Engineer @ 
Meesho","url":"https://github.com/Adit2607","imageURL":"https://github.com/Adit2607.png","key":"aditya","page":null},{"name":"Bhawani Singh","title":"Architect @ Meesho","url":"https://github.com/singh-bhawani","imageURL":"https://github.com/singh-bhawani.png","key":"bhawani","page":null},{"name":"Jigar Dave","title":"Lead Software Engineer @ Meesho","url":"https://github.com/jigarpatel26","imageURL":"https://github.com/jigarpatel26.png","key":"jigar","page":null}],"frontMatter":{"slug":"post-one","title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)","authors":["adarsha","aditya","bhawani","jigar"],"date":"2022-11-15T00:00:00.000Z","tags":["online-feature-store","interaction-store","mlplatform","meesho"]},"unlisted":false,"prevItem":{"title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","permalink":"/BharatMLStack/blog/post-two"}}')},5017:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/online-feature-store-v0-86ec0010947ae24621f39ebd0d1729ca.png"},7848:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/first-gen-arch-7c0b286810aecb7eff42b48f51caee1f.png"},8453:(e,n,i)=>{i.d(n,{R:()=>a,x:()=>o});var t=i(6540);const s={},r=t.createContext(s);function a(e){const n=t.useContext(r);return t.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:a(e.components),t.createElement(r.Provider,{value:n},e.children)}},8733:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/schema-d699efc400ed0f83bba421c1f55ab211.png"}}]); \ No newline at end of file +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[1909],{161:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>l,contentTitle:()=>o,default:()=>c,frontMatter:()=>a,metadata:()=>t,toc:()=>d});var t=i(3983),s=i(4848),r=i(8453);const a={slug:"post-one",title:"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 
1)",authors:["adarsha","aditya","bhawani","jigar"],date:new Date("2022-11-15T00:00:00.000Z"),tags:["online-feature-store","interaction-store","mlplatform","meesho"]},o=void 0,l={authorsImageUrls:[void 0,void 0,void 0,void 0]},d=[{value:"The Genesis: How a Friday Night Roast Sparked Meesho\u2019s ML Platform",id:"the-genesis-how-a-friday-night-roast-sparked-meeshos-ml-platform",level:2},{value:"The Turning Point: From Batch to Real-Time",id:"the-turning-point-from-batch-to-real-time",level:2},{value:"First Generation Design",id:"first-generation-design",level:2},{value:"1. IOP Framework: A Real-Time DAG Executor",id:"1-iop-framework-a-real-time-dag-executor",level:3},{value:"2. Online Feature Store - 0th Version",id:"2-online-feature-store---0th-version",level:3},{value:"3. Interaction Store - 0th Version",id:"3-interaction-store---0th-version",level:3},{value:"Building the Online Feature Store - 0th Version",id:"building-the-online-feature-store---0th-version",level:2},{value:"Choosing the Right Tech Stack",id:"choosing-the-right-tech-stack",level:3},{value:"Streamlining the Data Flow",id:"streamlining-the-data-flow",level:3},{value:"The Challenges: Data Format and Storage",id:"the-challenges-data-format-and-storage",level:2},{value:"Feature Consistency",id:"feature-consistency",level:3},{value:"TTL Granularity",id:"ttl-granularity",level:3},{value:"Extensibility Across Databases",id:"extensibility-across-databases",level:3},{value:"Overcoming Technical Constraints",id:"overcoming-technical-constraints",level:2},{value:"The Solution: Schema Separation",id:"the-solution-schema-separation",level:2},{value:"Tracking Changes in Feature Groups",id:"tracking-changes-in-feature-groups",level:2},{value:"Common Real-World Scenarios:",id:"common-real-world-scenarios",level:3},{value:"The Solution: Schema Versioning",id:"the-solution-schema-versioning",level:2},{value:"Backward Compatibility",id:"backward-compatibility",level:3},{value:"Partial Availability 
Handling",id:"partial-availability-handling",level:3},{value:"Safe Writes Without Pipeline Pauses",id:"safe-writes-without-pipeline-pauses",level:3},{value:"Interaction Store - 0th Version",id:"interaction-store---0th-version",level:2},{value:"Event Ingestion",id:"event-ingestion",level:2},{value:"Storage Design",id:"storage-design",level:2},{value:"Why Redis?",id:"why-redis",level:3},{value:"Storage Structure",id:"storage-structure",level:3},{value:"Built-in Guardrails",id:"built-in-guardrails",level:3},{value:"Conclusion: Laying the Foundation for Real-Time ML",id:"conclusion-laying-the-foundation-for-real-time-ml",level:2}];function h(e){const n={a:"a",br:"br",code:"code",em:"em",h1:"h1",h2:"h2",h3:"h3",hr:"hr",img:"img",li:"li",p:"p",pre:"pre",strong:"strong",ul:"ul",...(0,r.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"BharatMLStack",src:i(1547).A+"",width:"1396",height:"460"})}),"\n",(0,s.jsx)(n.h2,{id:"the-genesis-how-a-friday-night-roast-sparked-meeshos-ml-platform",children:"The Genesis: How a Friday Night Roast Sparked Meesho\u2019s ML Platform"}),"\n",(0,s.jsx)(n.p,{children:"It all started in early 2022, over a casual Friday evening catch-up. Like many great origin stories, this one began with friendly banter between a group of backend engineers and data scientists. As the conversations unfolded, so did the roasting\u2014until one remark hit a little too close to home:"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.em,{children:'"Why are we still crunching data for Monthly Active Users (MAU) when the next day it\u2019s all about Daily Active Users (DAU)?"'})}),"\n",(0,s.jsx)(n.p,{children:"The laughter died down, and the question lingered. When we regrouped on Monday\u2014clear-headed and slightly reflective\u2014we decided to dig into the numbers. 
What we discovered was quite revealing: a large portion of compute resources wasn\u2019t being put to good use.\nMuch of the system\u2019s effort was spent supporting users who weren\u2019t actively engaging, and even for new users, the experience wasn\u2019t optimized to make a meaningful impact."}),"\n",(0,s.jsxs)(n.p,{children:["At the same time, Meesho had just launched a company-wide initiative to reduce costs\u2014and every team had to contribute. This realization sparked the journey that would eventually lead to the ",(0,s.jsx)(n.strong,{children:"Meesho ML Platform"}),", known today as ",(0,s.jsx)(n.strong,{children:"BharatMLStack"}),"."]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt Text",src:i(4204).A+"",width:"1600",height:"1078"})}),"\n",(0,s.jsx)(n.p,{children:"Before the ML Platform, our recommendation and ranking pipelines followed a batch processing approach:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Data Ingestion"}),": The Data Platform team executed ETL jobs to ingest raw user data\u2014including user profiles, interaction logs, and product impressions\u2014into designated S3 buckets."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Layer 1"}),": Embedding Generation \u2013 On the Data Science side, Spark jobs pulled data from multiple S3 sources, cleaned and preprocessed it, and applied matrix factorization to generate user and item embeddings. The processed data and embeddings were then stored back in S3 in a structured format."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Layer 2"}),": Candidate Generation (CG) \u2013 In this stage, Spark jobs leveraged embeddings and historical interaction data to generate candidate recommendations for users. 
These candidate lists were subsequently written to S3."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Layer 3"}),": Ranking and Merging \u2013 A final round of processing ranked the generated candidates using ML models, combined different candidate lists, and stored the final ranked recommendations in a caching system."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Serving"}),': A microservice retrieved ranked recommendations from an in-memory data store via exposed APIs, delivering personalized listings across key surfaces such as "For You" and Category Landing Pages (CLP).']}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This approach held up well\u2014until Meesho started seeing a significant surge in traffic."}),"\n",(0,s.jsx)(n.h2,{id:"the-turning-point-from-batch-to-real-time",children:"The Turning Point: From Batch to Real-Time"}),"\n",(0,s.jsxs)(n.p,{children:["At this time, the team was iterating on new ",(0,s.jsx)(n.strong,{children:"Ranker models"}),", and real-time inference seemed like the next logical step. But Rankers needed ",(0,s.jsx)(n.strong,{children:"real-time feature retrieval"}),", which meant an ",(0,s.jsx)(n.strong,{children:"online feature store"})," had to be built first."]}),"\n",(0,s.jsxs)(n.p,{children:["Exploring open-source options led to ",(0,s.jsx)(n.strong,{children:"cost vs. performance trade-offs"}),", but Meesho\u2019s surging traffic meant that ",(0,s.jsx)(n.strong,{children:"latency and stability were non-negotiable"}),". After multiple debates and stakeholder discussions, a bold decision was made:"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.em,{children:"We would build our own feature store."})}),"\n",(0,s.jsxs)(n.p,{children:["Meanwhile, efforts began to bring ",(0,s.jsx)(n.strong,{children:"Candidate Generators (CGs)"})," to real-time. The challenge? 
",(0,s.jsx)(n.strong,{children:"Storing and retrieving user interactions quickly enough"})," to power real-time recommendations."]}),"\n",(0,s.jsxs)(n.p,{children:["As the team dove deeper, a new roadblock emerged:",(0,s.jsx)(n.br,{}),"\n","Our ML jobs were orchestrated using ",(0,s.jsx)(n.strong,{children:"Airflow DAGs"}),", giving data scientists flexibility in experimentation. But transitioning to real-time execution threatened this agility. Every change would now require backend engineering support, ",(0,s.jsx)(n.strong,{children:"slowing down iteration cycles"}),"."]}),"\n",(0,s.jsxs)(n.p,{children:["That\u2019s when the idea struck:",(0,s.jsx)(n.br,{}),"\n","We needed a ",(0,s.jsx)(n.strong,{children:"framework for real-time DAG execution"}),"\u2014one that preserved the same flexibility as Airflow but worked for ",(0,s.jsx)(n.strong,{children:"streaming data"}),"."]}),"\n",(0,s.jsxs)(n.p,{children:["This moment shaped the ",(0,s.jsx)(n.strong,{children:"next phase of our journey"}),"."]}),"\n",(0,s.jsx)(n.h2,{id:"first-generation-design",children:"First Generation Design"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt Text",src:i(1585).A+"",width:"1600",height:"1006"})}),"\n",(0,s.jsx)(n.h1,{id:"laying-the-groundwork-the-first-gen-ml-platform",children:"Laying the Groundwork: The First-Gen ML Platform"}),"\n",(0,s.jsx)(n.p,{children:"To solve these challenges, the team built three foundational components:"}),"\n",(0,s.jsx)(n.h3,{id:"1-iop-framework-a-real-time-dag-executor",children:"1. IOP Framework: A Real-Time DAG Executor"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Reusable Nodes"}),": Each DAG node (e.g., an invocation to a CG service, a ranker, or a filter) had to be implemented only once. 
After that, it could be reused across any workflow by referencing it in config."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Config-driven Dynamic Graphs"}),": Execution graphs were defined as adjacency lists stored in ",(0,s.jsx)(n.strong,{children:"ZooKeeper"}),", allowing teams to modify the sequence or structure of operations without touching application code."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Plug-and-play CGs"}),": The Candidate Generator interface was preserved, so a single CG node could call any CG service by passing ",(0,s.jsx)(n.code,{children:"cg_name"})," in the request. This drastically reduced the code surface area and improved maintainability."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Production-Grade DAGs"}),": DAGs were designed to execute in ",(0,s.jsx)(n.strong,{children:"low-latency real-time environments"}),", with support for ",(0,s.jsx)(n.strong,{children:"parallel execution, retries, and branching"}),"."]}),"\n"]}),"\n",(0,s.jsx)("u",{children:(0,s.jsx)(n.a,{href:"https://www.meesho.io/blog/rebuilding-meeshos-ranking-platform",children:"More about IOP DAG"})}),"\n",(0,s.jsx)(n.h3,{id:"2-online-feature-store---0th-version",children:"2. Online Feature Store - 0th Version"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["Used ",(0,s.jsx)(n.strong,{children:"Cassandra"})," and ",(0,s.jsx)(n.strong,{children:"Redis"})," for low-latency feature serving."]}),"\n",(0,s.jsxs)(n.li,{children:["Maintained feature consistency using ",(0,s.jsx)(n.strong,{children:"Feature Groups"})," with TTL-based expiry."]}),"\n",(0,s.jsxs)(n.li,{children:["A hybrid schema was used: feature keys stored in ",(0,s.jsx)(n.strong,{children:"ZooKeeper"}),", data stored in ",(0,s.jsx)(n.strong,{children:"compact arrays"}),"."]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"3-interaction-store---0th-version",children:"3. 
Interaction Store - 0th Version"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Captured real-time user interactions like clicks, orders, and add-to-cart events."}),"\n",(0,s.jsxs)(n.li,{children:["Stored event data in ",(0,s.jsx)(n.strong,{children:"Redis ZSETs (sorted sets)"})," to enable fast lookups for recommendation engines."]}),"\n",(0,s.jsxs)(n.li,{children:["Provided an API to fetch a user's ",(0,s.jsxs)(n.strong,{children:["last ",(0,s.jsx)(n.em,{children:"k"})," interactions"]})," or ",(0,s.jsx)(n.strong,{children:"interactions within a time window"}),"."]}),"\n"]}),"\n",(0,s.jsxs)(n.p,{children:["With these components in place, ",(0,s.jsx)(n.strong,{children:"real-time ML at Meesho became a reality"}),"."]}),"\n",(0,s.jsx)(n.p,{children:"This was just the beginning."}),"\n",(0,s.jsx)(n.h2,{id:"building-the-online-feature-store---0th-version",children:"Building the Online Feature Store - 0th Version"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt text",src:i(7490).A+"",width:"1574",height:"562"})}),"\n",(0,s.jsx)(n.h3,{id:"choosing-the-right-tech-stack",children:"Choosing the Right Tech Stack"}),"\n",(0,s.jsxs)(n.p,{children:["We spent considerable time evaluating various databases, caches, and communication protocols for our ",(0,s.jsx)(n.strong,{children:"online feature store"}),". 
After carefully weighing ",(0,s.jsx)(n.strong,{children:"cost, latency, throughput"}),", and ",(0,s.jsx)(n.strong,{children:"operational stability"}),", we settled on a combination of:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Cassandra"})," and ",(0,s.jsx)(n.strong,{children:"Redis"})," for storage"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"gRPC + Proto3"})," as our communication layer"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"streamlining-the-data-flow",children:"Streamlining the Data Flow"}),"\n",(0,s.jsx)(n.p,{children:"To keep things simple in the initial version:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Feature engineering jobs"})," wrote raw outputs to an ",(0,s.jsx)(n.strong,{children:"S3 bucket"})]}),"\n",(0,s.jsxs)(n.li,{children:["A ",(0,s.jsx)(n.strong,{children:"daily feature push job"}),":","\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Read from S3"}),"\n",(0,s.jsxs)(n.li,{children:["Grouped related features into ",(0,s.jsx)(n.strong,{children:"Feature Groups"})," (ensuring consistency)"]}),"\n",(0,s.jsxs)(n.li,{children:["Pushed them to ",(0,s.jsx)(n.strong,{children:"Kafka"})]}),"\n"]}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"For features requiring frequent updates:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Ad-hoc jobs"})," computed features at a higher frequency"]}),"\n",(0,s.jsxs)(n.li,{children:["These jobs pushed to both ",(0,s.jsx)(n.strong,{children:"Kafka"})," and ",(0,s.jsx)(n.strong,{children:"S3"})," (S3 preserved historical data for future model training)"]}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"the-challenges-data-format-and-storage",children:"The Challenges: Data Format and Storage"}),"\n",(0,s.jsxs)(n.p,{children:["One of the most critical design challenges was how to store feature data ",(0,s.jsx)(n.strong,{children:"efficiently and 
consistently"}),", especially in databases like ",(0,s.jsx)(n.strong,{children:"Cassandra"})," and ",(0,s.jsx)(n.strong,{children:"Redis"}),", which come with unique storage constraints."]}),"\n",(0,s.jsx)(n.p,{children:"We had to solve for three key requirements:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"feature-consistency",children:"Feature Consistency"}),"\n",(0,s.jsxs)(n.p,{children:["When a feature group contains features like ",(0,s.jsx)(n.code,{children:"order_count_1h"})," and ",(0,s.jsx)(n.code,{children:"click_count_1h"}),", both must reflect the ",(0,s.jsx)(n.strong,{children:"same time window"}),". Inconsistent updates would lead to ",(0,s.jsx)(n.strong,{children:"unreliable model predictions"}),"."]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"ttl-granularity",children:"TTL Granularity"}),"\n",(0,s.jsxs)(n.p,{children:["Each feature group required an ",(0,s.jsx)(n.strong,{children:"expiry timestamp"}),", so that ",(0,s.jsx)(n.strong,{children:"all features within it expired together"}),"\u2014preserving consistency during reads."]}),"\n"]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"extensibility-across-databases",children:"Extensibility Across Databases"}),"\n",(0,s.jsxs)(n.p,{children:["We anticipated that infra needs would evolve. 
To future-proof our system, the data format was designed to be ",(0,s.jsx)(n.strong,{children:"decoupled from DB-specific layouts"}),", enabling portability to systems like ",(0,s.jsx)(n.strong,{children:"ScyllaDB"}),", ",(0,s.jsx)(n.strong,{children:"DynamoDB"}),", ",(0,s.jsx)(n.strong,{children:"HBase"}),", or ",(0,s.jsx)(n.strong,{children:"BigTable"}),"."]}),"\n"]}),"\n"]}),"\n",(0,s.jsx)(n.hr,{}),"\n",(0,s.jsx)(n.h2,{id:"overcoming-technical-constraints",children:"Overcoming Technical Constraints"}),"\n",(0,s.jsx)(n.p,{children:'At the time, we were using Cassandra, which not only imposed a soft limit of 75 columns per row, but also exhibited significant performance degradation as the number of columns increased further, particularly on memory-constrained machines. Wide rows caused high memory usage during reads, unpredictable latencies due to heavy deserialization overhead, and inefficiencies during compactions and repairs. This ruled out the naive "one column per feature" approach. 
We needed a format that was compact, minimized the number of columns, and remained efficient and portable across different storage systems.'}),"\n",(0,s.jsx)(n.h2,{id:"the-solution-schema-separation",children:"The Solution: Schema Separation"}),"\n",(0,s.jsx)(n.p,{children:"We introduced the concept of Feature Groups\u2014logical groupings of features that must remain consistent with one another.\nTo represent these groups efficiently, we adopted a layered storage approach:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Feature Labels (Keys)"})," were stored in ZooKeeper, serving as the schema."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Feature Values"})," were stored as a comma-separated string array in Cassandra or Redis."]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Expiry Timestamp and Schema Version"})," were appended at the end of the string, delimited by semicolons."]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Example:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"feature_1_value,feature_2_value,feature_3_value;expiry_ts;schema_version\n"})}),"\n",(0,s.jsx)(n.p,{children:"This format allowed:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Consistent writes and reads at the group level"}),"\n",(0,s.jsx)(n.li,{children:"Easy parsing of feature values using the schema lookup from ZooKeeper"}),"\n",(0,s.jsx)(n.li,{children:"Efficient storage with minimal DB column usage"}),"\n",(0,s.jsx)(n.li,{children:"Support for per-group TTLs and schema evolution"}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"tracking-changes-in-feature-groups",children:"Tracking Changes in Feature Groups"}),"\n",(0,s.jsx)(n.p,{children:"Feature groups don\u2019t stay static. As models evolve, features get added, renamed, or removed. 
But schema changes often go live before the data is ready\u2014and stopping ingestion just to wait for everything to align isn't feasible."}),"\n",(0,s.jsx)(n.h3,{id:"common-real-world-scenarios",children:"Common Real-World Scenarios:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"A new feature is added to the schema, but ingestion jobs still use the older schema version."}),"\n",(0,s.jsx)(n.li,{children:"Ongoing writes don\u2019t include the newly added feature, and stopping ingestion would break freshness for existing features."}),"\n",(0,s.jsx)(n.li,{children:"During serving, models request a mix of old and new features, depending on rollout stages."}),"\n"]}),"\n",(0,s.jsx)(n.h2,{id:"the-solution-schema-versioning",children:"The Solution: Schema Versioning"}),"\n",(0,s.jsx)(n.p,{children:"We solved this with versioned feature group schemas, which unlocked several capabilities:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"backward-compatibility",children:"Backward Compatibility"}),"\n","Older ingestion jobs can continue writing using older schema versions. During reads, the system uses the schema version embedded in the value to interpret the data correctly."]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"partial-availability-handling",children:"Partial Availability Handling"}),"\n","During inference, if some features in the request aren\u2019t available (due to rollout delays or missing data), the system serves default values, ensuring the inference call doesn\u2019t fail."]}),"\n",(0,s.jsxs)(n.li,{children:["\n",(0,s.jsx)(n.h3,{id:"safe-writes-without-pipeline-pauses",children:"Safe Writes Without Pipeline Pauses"}),"\n","With schema versioning, we no longer had to stop ingestion pipelines for schema updates. 
Writes using previous versions can continue safely, and downstream consumers evolve independently.\nThis design gave us the flexibility to move fast without breaking things\u2014preserving data quality, enabling experimentation, and ensuring reliability at scale."]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt Text",src:i(1544).A+"",width:"1600",height:"599"})}),"\n",(0,s.jsx)(n.h2,{id:"interaction-store---0th-version",children:"Interaction Store - 0th Version"}),"\n",(0,s.jsx)(n.p,{children:(0,s.jsx)(n.img,{alt:"Alt Text",src:i(5714).A+"",width:"1600",height:"518"})}),"\n",(0,s.jsxs)(n.p,{children:["To power real-time Candidate Generators (CGs), we needed fast access to user behavior signals\u2014like what a user recently clicked, ordered, or added to their cart. These interactions form the basis for many real-time recommendations, such as ",(0,s.jsx)(n.strong,{children:"Similar Products"}),", ",(0,s.jsx)(n.strong,{children:"People Also Viewed"}),", or ",(0,s.jsx)(n.strong,{children:"Recently Ordered Again"}),".\nFor the ",(0,s.jsx)(n.strong,{children:"0th version"})," of the Interaction Store, we focused on a design that was ",(0,s.jsx)(n.strong,{children:"simple, fast, and reliable"})," \u2014 optimized for high-throughput ingestion and low-latency lookups."]}),"\n",(0,s.jsx)(n.h2,{id:"event-ingestion",children:"Event Ingestion"}),"\n",(0,s.jsx)(n.p,{children:"We instrumented our backend services to emit key user interaction events to Kafka in real time. 
These included:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"Click"}),"\n",(0,s.jsx)(n.li,{children:"Order"}),"\n",(0,s.jsx)(n.li,{children:"Add to Cart"}),"\n",(0,s.jsx)(n.li,{children:"Wishlist"}),"\n",(0,s.jsx)(n.li,{children:"Share"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"Each event carried essential metadata:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"userId \u2014 uniquely identifies the user"}),"\n",(0,s.jsx)(n.li,{children:"productId \u2014 the item being interacted with"}),"\n",(0,s.jsx)(n.li,{children:"timestamp \u2014 the moment the interaction occurred"}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This decoupled the interaction logging from storage, allowing ingestion and consumption to scale independently."}),"\n",(0,s.jsx)(n.h2,{id:"storage-design",children:"Storage Design"}),"\n",(0,s.jsx)(n.p,{children:"To store these events, we built Kafka consumers that processed the incoming streams and wrote the data into Redis, using sorted sets (ZSETs) as the primary data structure."}),"\n",(0,s.jsx)(n.h3,{id:"why-redis",children:"Why Redis?"}),"\n",(0,s.jsx)(n.p,{children:"Redis gave us:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Low-latency"})," reads and writes"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Time-ordered data"})," using ZSETs (via score = timestamp)"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"Native TTL support"}),", if needed in later versions"]}),"\n",(0,s.jsxs)(n.li,{children:[(0,s.jsx)(n.strong,{children:"In-memory performance"})," \u2014ideal for real-time CGs"]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"storage-structure",children:"Storage Structure"}),"\n",(0,s.jsx)(n.p,{children:"Each user\u2019s interactions were stored using a composite key format, uniquely identifying the user and interaction type. 
This structure allowed efficient organization and quick retrieval of recent activity for recommendation generation:"}),"\n",(0,s.jsx)(n.pre,{children:(0,s.jsx)(n.code,{className:"language-bash",children:"userId_eventType \u2192 ZSET[...(pid, ts)...]\n"})}),"\n",(0,s.jsx)(n.p,{children:"Within each ZSET:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["The ",(0,s.jsx)(n.strong,{children:"timestamp"})," served as the score, maintaining temporal order"]}),"\n",(0,s.jsxs)(n.li,{children:["The ",(0,s.jsx)(n.strong,{children:"productId"})," (optionally with metadata) was the ",(0,s.jsx)(n.strong,{children:"value"})]}),"\n"]}),"\n",(0,s.jsx)(n.p,{children:"This allowed us to efficiently retrieve interactions through an HTTP-based API server, using two query modes:"}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsxs)(n.li,{children:["Fetch the ",(0,s.jsx)(n.strong,{children:"last k interactions"})," of a specific type for a given user with ",(0,s.jsx)(n.code,{children:"ZREVRANGE(userId_eventType, count)"})]}),"\n",(0,s.jsxs)(n.li,{children:["Retrieve ",(0,s.jsx)(n.strong,{children:"all interactions within a time range"})," (e.g., last 24 hours) with ",(0,s.jsx)(n.code,{children:"ZREVRANGEBYSCORE(userId_eventType, timeRange)"})]}),"\n"]}),"\n",(0,s.jsx)(n.h3,{id:"built-in-guardrails",children:"Built-in Guardrails"}),"\n",(0,s.jsx)(n.p,{children:"Since Redis was the sole store, we implemented High Availability (HA) to prevent data loss. 
To optimize memory usage, we also enforced size limits per event type\u2014only storing the last k interactions per user, with older entries getting truncated."}),"\n",(0,s.jsx)(n.h2,{id:"conclusion-laying-the-foundation-for-real-time-ml",children:"Conclusion: Laying the Foundation for Real-Time ML"}),"\n",(0,s.jsxs)(n.p,{children:["In this first phase, we tackled the ",(0,s.jsx)(n.strong,{children:"fundamentals"}),"\u2014shifting from batch-based recommendations to a ",(0,s.jsx)(n.strong,{children:"real-time, ML-powered recommendation platform"})," that could keep up with Meesho\u2019s growth."]}),"\n",(0,s.jsxs)(n.p,{children:["With the ",(0,s.jsx)(n.strong,{children:"IOP Framework"}),", ",(0,s.jsx)(n.strong,{children:"Online Feature Store"}),", and ",(0,s.jsx)(n.strong,{children:"Interaction Store"}),", we built the core infrastructure to support real-time personalization at scale. These wins have already unlocked:"]}),"\n",(0,s.jsxs)(n.ul,{children:["\n",(0,s.jsx)(n.li,{children:"\u2705 Faster, more dynamic recommendations for millions of users."}),"\n",(0,s.jsx)(n.li,{children:"\u2705 Better infrastructure efficiency, reducing wasted compute power."}),"\n",(0,s.jsx)(n.li,{children:"\u2705 A flexible, modular system that allows for further experimentation."}),"\n"]}),"\n",(0,s.jsxs)(n.p,{children:["But this is just the beginning. 
While we've solved key challenges, ",(0,s.jsx)(n.strong,{children:"certain roadblocks remain"})," \u2014from optimizing ",(0,s.jsx)(n.strong,{children:"cost-performance trade-offs"})," to ",(0,s.jsx)(n.strong,{children:"seamlessly evolving schemas"}),"."]}),"\n",(0,s.jsxs)(n.p,{children:["This foundational work laid the path for a reliable and scalable ",(0,s.jsx)(n.strong,{children:"real-time feature serving layer"}),"."]})]})}function c(e={}){const{wrapper:n}={...(0,r.R)(),...e.components};return n?(0,s.jsx)(n,{...e,children:(0,s.jsx)(h,{...e})}):h(e)}},1544:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/schema-d699efc400ed0f83bba421c1f55ab211.png"},1547:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"},1585:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/first-gen-arch-7c0b286810aecb7eff42b48f51caee1f.png"},3983:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-one","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-one/index.md","source":"@site/blog/bharatmlstack-history/post-one/index.md","title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)","description":"BharatMLStack","date":"2022-11-15T00:00:00.000Z","tags":[{"inline":true,"label":"online-feature-store","permalink":"/BharatMLStack/blog/tags/online-feature-store"},{"inline":true,"label":"interaction-store","permalink":"/BharatMLStack/blog/tags/interaction-store"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"}],"readingTime":10.25,"hasTruncateMarker":false,"authors":[{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null},{"name":"Aditya Kumar","title":"Lead Software Engineer @ 
Meesho","url":"https://github.com/Adit2607","imageURL":"https://github.com/Adit2607.png","key":"aditya","page":null},{"name":"Bhawani Singh","title":"Architect @ Meesho","url":"https://github.com/singh-bhawani","imageURL":"https://github.com/singh-bhawani.png","key":"bhawani","page":null},{"name":"Jigar Dave","title":"Lead Software Engineer @ Meesho","url":"https://github.com/jigarpatel26","imageURL":"https://github.com/jigarpatel26.png","key":"jigar","page":null}],"frontMatter":{"slug":"post-one","title":"Building Meesho\u2019s ML Platform: From Chaos to Cutting-Edge (Part 1)","authors":["adarsha","aditya","bhawani","jigar"],"date":"2022-11-15T00:00:00.000Z","tags":["online-feature-store","interaction-store","mlplatform","meesho"]},"unlisted":false,"prevItem":{"title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","permalink":"/BharatMLStack/blog/post-two"}}')},4204:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/old-batch-arch-bc2cedbc1fed0fc6f08479ba8fe52996.png"},5714:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/interaction-store-v0-68167b64c6e462ef2f177f0f86d55bda.png"},7490:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/online-feature-store-v0-86ec0010947ae24621f39ebd0d1729ca.png"},8453:(e,n,i)=>{i.d(n,{R:()=>a,x:()=>o});var t=i(6540);const s={},r=t.createContext(s);function a(e){const n=t.useContext(r);return t.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function o(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:a(e.components),t.createElement(r.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/f9755c6e.8811662b.js b/docs/assets/js/f9755c6e.8811662b.js deleted file mode 100644 index d17be1e7..00000000 --- a/docs/assets/js/f9755c6e.8811662b.js +++ /dev/null @@ -1 +0,0 @@ -"use 
strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[8315],{5969:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-five","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-five/index.md","source":"@site/blog/bharatmlstack-history/post-five/index.md","title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","description":"BharatMLStack","date":"2025-06-02T00:00:00.000Z","tags":[{"inline":true,"label":"llm","permalink":"/BharatMLStack/blog/tags/llm"},{"inline":true,"label":"vllm","permalink":"/BharatMLStack/blog/tags/vllm"},{"inline":true,"label":"tensorrt-llm","permalink":"/BharatMLStack/blog/tags/tensorrt-llm"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":4.93,"hasTruncateMarker":false,"authors":[{"name":"Jaya Kumar","title":"Lead ML Engineer @ Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null}],"frontMatter":{"slug":"post-five","title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","authors":["jaya"],"date":"2025-6-2","tags":["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"nextItem":{"title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","permalink":"/BharatMLStack/blog/post-three"}}')},8319:(e,t,i)=>{i.r(t),i.d(t,{assets:()=>h,contentTitle:()=>d,default:()=>o,frontMatter:()=>r,metadata:()=>n,toc:()=>c});var n=i(5969),s=i(4848),l=i(8453);const r={slug:"post-five",title:"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at 
Scale",authors:["jaya"],date:"2025-6-2",tags:["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},d=void 0,h={authorsImageUrls:[void 0]},c=[{value:"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale",id:"llm-inference-optimization-techniques-engineering-sub-second-latency-at-scale",level:2},{value:"1. Advanced Memory Management: Paged & Prefix KV Caching",id:"1-advanced-memory-management-paged--prefix-kv-caching",level:2},{value:"Paged KV caching",id:"paged-kv-caching",level:3},{value:"KV cache quantization",id:"kv-cache-quantization",level:3},{value:'Prefix caching (the "voice bot" optimizer)',id:"prefix-caching-the-voice-bot-optimizer",level:3},{value:"2. Aggressive Quantization (INT4 AWQ & FP8)",id:"2-aggressive-quantization-int4-awq--fp8",level:2},{value:"INT4 AWQ (Activation-aware Weight Quantization)",id:"int4-awq-activation-aware-weight-quantization",level:3},{value:"FP8 precision",id:"fp8-precision",level:3},{value:"3. Kernel Fusion & Custom Plugins",id:"3-kernel-fusion--custom-plugins",level:2},{value:"4. Inflight (Continuous) Batching",id:"4-inflight-continuous-batching",level:2},{value:"5. Parallelism Strategies: Scaling Beyond One GPU",id:"5-parallelism-strategies-scaling-beyond-one-gpu",level:2},{value:"6. 
Speculative Decoding",id:"6-speculative-decoding",level:2},{value:"Few Benchmarks",id:"few-benchmarks",level:2},{value:"Search query rewriting",id:"search-query-rewriting",level:3},{value:"Voice bot query",id:"voice-bot-query",level:3},{value:"Conclusion",id:"conclusion",level:2}];function a(e){const t={h2:"h2",h3:"h3",img:"img",li:"li",p:"p",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,l.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(t.p,{children:(0,s.jsx)(t.img,{alt:"BharatMLStack",src:i(9200).A+"",width:"1396",height:"460"})}),"\n",(0,s.jsx)(t.h2,{id:"llm-inference-optimization-techniques-engineering-sub-second-latency-at-scale",children:"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale"}),"\n",(0,s.jsx)(t.p,{children:"Raw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack\u2014from memory management to kernel execution."}),"\n",(0,s.jsx)(t.h2,{id:"1-advanced-memory-management-paged--prefix-kv-caching",children:"1. Advanced Memory Management: Paged & Prefix KV Caching"}),"\n",(0,s.jsx)(t.p,{children:"The most significant bottleneck in LLM inference is not always compute, but memory bandwidth\u2014specifically managing the Key-Value (KV) cache."}),"\n",(0,s.jsx)(t.h3,{id:"paged-kv-caching",children:"Paged KV caching"}),"\n",(0,s.jsxs)(t.p,{children:["Standard caching suffers from fragmentation. We use ",(0,s.jsx)(t.strong,{children:"Paged KV caching"}),", which operates similarly to an operating system's virtual memory: the KV cache is divided into non-contiguous blocks. 
This lets us serve larger batch sizes without running out of memory."]}),"\n",(0,s.jsx)(t.h3,{id:"kv-cache-quantization",children:"KV cache quantization"}),"\n",(0,s.jsxs)(t.p,{children:["To further maximize available memory, we implement ",(0,s.jsx)(t.strong,{children:"KV cache quantization"})," (e.g., FP8). By compressing stored attention keys and values from 16-bit to 8-bit, we nearly double the effective context window capacity of the GPU, allowing longer conversations or larger batches without materially degrading quality."]}),"\n",(0,s.jsx)(t.h3,{id:"prefix-caching-the-voice-bot-optimizer",children:'Prefix caching (the "voice bot" optimizer)'}),"\n",(0,s.jsxs)(t.p,{children:['For use cases like GenAI voice bots where the system prompt (e.g., "You are a helpful assistant...") is static across thousands of requests, we enable ',(0,s.jsx)(t.strong,{children:"prefix caching"}),"."]}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Impact"}),": By reusing pre-computed KV states for common prefixes, we achieve a cache hit rate of ~90%. This reduces ",(0,s.jsx)(t.strong,{children:"Time To First Token (TTFT)"})," by skipping redundant computation of the system prompt."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"2-aggressive-quantization-int4-awq--fp8",children:"2. Aggressive Quantization (INT4 AWQ & FP8)"}),"\n",(0,s.jsx)(t.p,{children:"Running models in their native 16-bit precision (BF16) restricts maximum batch size and throughput. We use quantization to shrink model weights without sacrificing accuracy."}),"\n",(0,s.jsx)(t.h3,{id:"int4-awq-activation-aware-weight-quantization",children:"INT4 AWQ (Activation-aware Weight Quantization)"}),"\n",(0,s.jsxs)(t.p,{children:["For the Llama 3 family, we use ",(0,s.jsx)(t.strong,{children:"AWQ"})," to compress weights to 4 bits. 
This reduces model size by ~75%, allowing larger models to fit into L4 GPU memory and significantly improving token generation speed."]}),"\n",(0,s.jsx)(t.h3,{id:"fp8-precision",children:"FP8 precision"}),"\n",(0,s.jsxs)(t.p,{children:["For NVIDIA Hopper (H100) architectures, we are exploring ",(0,s.jsx)(t.strong,{children:"FP8 quantization"}),", leveraging native FP8 tensor cores to accelerate matrix multiplications while maintaining a higher dynamic range than integer quantization."]}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Verification"}),": We validate quantized models by comparing dot-product similarity of embeddings against the FP16 baseline, consistently achieving ",(0,s.jsx)(t.strong,{children:">99% similarity"}),"."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"3-kernel-fusion--custom-plugins",children:"3. Kernel Fusion & Custom Plugins"}),"\n",(0,s.jsx)(t.p,{children:"To minimize overhead from launching thousands of small GPU operations, we fuse them into monolithic kernels using NVIDIA TensorRT plugins."}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Flash attention & FMHA"}),": We enable ",(0,s.jsx)(t.strong,{children:"Fused Multi-Head Attention (FMHA)"})," combined with flash attention to reduce memory reads/writes."]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"GEMM plugins"}),": We use specialized ",(0,s.jsx)(t.strong,{children:"GEMM"})," plugins to accelerate transformer linear layers."]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Removing input padding"}),": Instead of padding short sequences to match the longest, we remove input padding so the GPU processes only valid tokens."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"4-inflight-continuous-batching",children:"4. 
Inflight (Continuous) Batching"}),"\n",(0,s.jsx)(t.p,{children:"Traditional static batching waits for all requests in a batch to finish before returning results\u2014so one long response delays everyone else."}),"\n",(0,s.jsxs)(t.p,{children:["We implement ",(0,s.jsx)(t.strong,{children:"inflight batching"}),": as soon as one request completes, its slot is freed and filled by a new request from the queue. This keeps GPUs saturated and decouples latency of short queries from long ones."]}),"\n",(0,s.jsx)(t.h2,{id:"5-parallelism-strategies-scaling-beyond-one-gpu",children:"5. Parallelism Strategies: Scaling Beyond One GPU"}),"\n",(0,s.jsx)(t.p,{children:"For large models (e.g., 70B+ parameters) that cannot fit into the VRAM of a single GPU, we use parallelism strategies."}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Tensor parallelism (TP)"}),": Split weight matrices across multiple GPUs (e.g., 4\xd7 L4 or 8\xd7 A100). Each GPU computes a shard and outputs are reduced at every layer."]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Pipeline parallelism (PP)"}),": Split model layers across GPUs to pipeline compute (e.g., while one GPU computes later layers for Request A, another starts early layers for Request B)."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"6-speculative-decoding",children:"6. Speculative Decoding"}),"\n",(0,s.jsxs)(t.p,{children:["To reduce inter-token latency (ITL), we explore ",(0,s.jsx)(t.strong,{children:"speculative decoding"}),"."]}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Mechanism"}),': A smaller, faster "draft" model speculatively generates a short token sequence (e.g., 5 tokens).']}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Verification"}),": The larger target model verifies those tokens in one parallel forward pass. 
If correct, we effectively generate multiple tokens per large-model step; if not, we discard and regenerate. This is effective for predictable text, improving perceived generation speed."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"few-benchmarks",children:"Few Benchmarks"}),"\n",(0,s.jsx)(t.p,{children:"Below are a couple of representative use cases and performance numbers."}),"\n",(0,s.jsx)(t.h3,{id:"search-query-rewriting",children:"Search query rewriting"}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"LLM"}),": Fine-tuned llama-3.2-1B"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Input & output token length"}),": ~10\u201320"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Response type"}),": Non-streaming"]}),"\n"]}),"\n",(0,s.jsxs)(t.table,{children:[(0,s.jsx)(t.thead,{children:(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.th,{children:"Inference runtime"}),(0,s.jsx)(t.th,{children:"Hardware"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Max requests/sec"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Max p99 latency"})]})}),(0,s.jsxs)(t.tbody,{children:[(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{children:"4 \xd7 L4 GPUs (multi-GPU)"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1000"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"95 ms"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{children:"1 \xd7 A100 40 GB GPU"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1000"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"69 ms"})]})]})]}),"\n",(0,s.jsx)(t.h3,{id:"voice-bot-query",children:"Voice bot query"}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"LLM"}),": Llama-3.1-8B"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Input token length"}),": 
~1900\u20132000"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Output token length"}),": ~200"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Response type"}),": Streaming"]}),"\n"]}),"\n",(0,s.jsxs)(t.table,{children:[(0,s.jsx)(t.thead,{children:(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.th,{children:"Inference runtime"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Concurrency"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"p99 TTFT (ms)"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"p99 ITL (ms)"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Token throughput (tokens/sec)"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Request throughput (req/sec)"}),(0,s.jsx)(t.th,{children:"Hardware"})]})}),(0,s.jsxs)(t.tbody,{children:[(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"36.27"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"22.78"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"45.66"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.23"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"49.81"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"23.21"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"89.37"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.45"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"55.33"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"36.62"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"153.39"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.78"}),(0,s.jsx)(t.td,{ch
ildren:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"66.5"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"39.11"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"279.88"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1.47"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"16"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"131.8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"30.39"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"547.8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2.77"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"32"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"277.22"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"48.02"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"925.7"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4.78"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"64"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"498.52"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"71.62"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,164.40"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"6.2"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"128"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"677.31"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"120.37"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,445.18"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},ch
ildren:"7.69"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"256"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,926.31"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"216.88"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,600.81"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"8.52"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"21.17"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"9.24"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"130.05"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.68"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"25.78"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"9.21"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"264.5"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1.35"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"28.52"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"10.99"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"437.69"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2.27"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"34.4"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"12.61"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"760.49"}),(0,s.jsx)
(t.td,{style:{textAlign:"right"},children:"3.96"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"16"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"68.03"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"14.32"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,343.80"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"7.01"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"32"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"185.96"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"16.82"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2,287.30"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"11.92"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"64"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"136.87"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"21.17"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"3,625.22"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"18.89"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"128"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"463.78"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"34.15"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4,456.51"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"23.24"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"256"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"890.12"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"59.18"}),(0,s.jsx)(t.td
,{style:{textAlign:"right"},children:"5,188.24"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"27.05"}),(0,s.jsx)(t.td,{children:"A100"})]})]})]}),"\n",(0,s.jsx)(t.h2,{id:"conclusion",children:"Conclusion"}),"\n",(0,s.jsx)(t.p,{children:"High-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure."}),"\n",(0,s.jsx)(t.p,{children:"These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications."})]})}function o(e={}){const{wrapper:t}={...(0,l.R)(),...e.components};return t?(0,s.jsx)(t,{...e,children:(0,s.jsx)(a,{...e})}):a(e)}},8453:(e,t,i)=>{i.d(t,{R:()=>r,x:()=>d});var n=i(6540);const s={},l=n.createContext(s);function r(e){const t=n.useContext(l);return n.useMemo(function(){return"function"==typeof e?e(t):{...t,...e}},[t,e])}function d(e){let t;return t=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:r(e.components),n.createElement(l.Provider,{value:t},e.children)}},9200:(e,t,i)=>{i.d(t,{A:()=>n});const n=i.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"}}]); \ No newline at end of file diff --git a/docs/assets/js/3aeb33c7.b4a8c40f.js b/docs/assets/js/f9755c6e.f599da7e.js similarity index 92% rename from docs/assets/js/3aeb33c7.b4a8c40f.js rename to docs/assets/js/f9755c6e.f599da7e.js index 854f0ff5..8c7b57a0 100644 --- a/docs/assets/js/3aeb33c7.b4a8c40f.js +++ b/docs/assets/js/f9755c6e.f599da7e.js @@ -1 +1 @@ -"use 
strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[974],{5969:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-five","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-five/index.md","source":"@site/blog/bharatmlstack-history/post-five/index.md","title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","description":"BharatMLStack","date":"2025-06-02T00:00:00.000Z","tags":[{"inline":true,"label":"llm","permalink":"/BharatMLStack/blog/tags/llm"},{"inline":true,"label":"vllm","permalink":"/BharatMLStack/blog/tags/vllm"},{"inline":true,"label":"tensorrt-llm","permalink":"/BharatMLStack/blog/tags/tensorrt-llm"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":4.93,"hasTruncateMarker":false,"authors":[{"name":"Jaya Kumar","title":"Lead ML Engineer @ Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null}],"frontMatter":{"slug":"post-five","title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","authors":["jaya"],"date":"2025-6-2","tags":["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"nextItem":{"title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","permalink":"/BharatMLStack/blog/post-three"}}')},7309:(e,t,i)=>{i.r(t),i.d(t,{assets:()=>h,contentTitle:()=>d,default:()=>o,frontMatter:()=>r,metadata:()=>n,toc:()=>c});var n=i(5969),s=i(4848),l=i(8453);const r={slug:"post-five",title:"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at 
Scale",authors:["jaya"],date:"2025-6-2",tags:["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},d=void 0,h={authorsImageUrls:[void 0]},c=[{value:"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale",id:"llm-inference-optimization-techniques-engineering-sub-second-latency-at-scale",level:2},{value:"1. Advanced Memory Management: Paged & Prefix KV Caching",id:"1-advanced-memory-management-paged--prefix-kv-caching",level:2},{value:"Paged KV caching",id:"paged-kv-caching",level:3},{value:"KV cache quantization",id:"kv-cache-quantization",level:3},{value:"Prefix caching (the &quot;voice bot&quot; optimizer)",id:"prefix-caching-the-voice-bot-optimizer",level:3},{value:"2. Aggressive Quantization (INT4 AWQ & FP8)",id:"2-aggressive-quantization-int4-awq--fp8",level:2},{value:"INT4 AWQ (Activation-aware Weight Quantization)",id:"int4-awq-activation-aware-weight-quantization",level:3},{value:"FP8 precision",id:"fp8-precision",level:3},{value:"3. Kernel Fusion & Custom Plugins",id:"3-kernel-fusion--custom-plugins",level:2},{value:"4. Inflight (Continuous) Batching",id:"4-inflight-continuous-batching",level:2},{value:"5. Parallelism Strategies: Scaling Beyond One GPU",id:"5-parallelism-strategies-scaling-beyond-one-gpu",level:2},{value:"6. 
Speculative Decoding",id:"6-speculative-decoding",level:2},{value:"Few Benchmarks",id:"few-benchmarks",level:2},{value:"Search query rewriting",id:"search-query-rewriting",level:3},{value:"Voice bot query",id:"voice-bot-query",level:3},{value:"Conclusion",id:"conclusion",level:2}];function a(e){const t={h2:"h2",h3:"h3",img:"img",li:"li",p:"p",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,l.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(t.p,{children:(0,s.jsx)(t.img,{alt:"BharatMLStack",src:i(9200).A+"",width:"1396",height:"460"})}),"\n",(0,s.jsx)(t.h2,{id:"llm-inference-optimization-techniques-engineering-sub-second-latency-at-scale",children:"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale"}),"\n",(0,s.jsx)(t.p,{children:"Raw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack\u2014from memory management to kernel execution."}),"\n",(0,s.jsx)(t.h2,{id:"1-advanced-memory-management-paged--prefix-kv-caching",children:"1. Advanced Memory Management: Paged & Prefix KV Caching"}),"\n",(0,s.jsx)(t.p,{children:"The most significant bottleneck in LLM inference is not always compute, but memory bandwidth\u2014specifically managing the Key-Value (KV) cache."}),"\n",(0,s.jsx)(t.h3,{id:"paged-kv-caching",children:"Paged KV caching"}),"\n",(0,s.jsxs)(t.p,{children:["Standard caching suffers from fragmentation. We use ",(0,s.jsx)(t.strong,{children:"Paged KV caching"}),", which operates similarly to an operating system's virtual memory: the KV cache is divided into non-contiguous blocks. 
This lets us serve larger batch sizes without running out of memory."]}),"\n",(0,s.jsx)(t.h3,{id:"kv-cache-quantization",children:"KV cache quantization"}),"\n",(0,s.jsxs)(t.p,{children:["To further maximize available memory, we implement ",(0,s.jsx)(t.strong,{children:"KV cache quantization"})," (e.g., FP8). By compressing stored attention keys and values from 16-bit to 8-bit, we nearly double the effective context window capacity of the GPU, allowing longer conversations or larger batches without materially degrading quality."]}),"\n",(0,s.jsx)(t.h3,{id:"prefix-caching-the-voice-bot-optimizer",children:'Prefix caching (the "voice bot" optimizer)'}),"\n",(0,s.jsxs)(t.p,{children:['For use cases like GenAI voice bots where the system prompt (e.g., "You are a helpful assistant...") is static across thousands of requests, we enable ',(0,s.jsx)(t.strong,{children:"prefix caching"}),"."]}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Impact"}),": By reusing pre-computed KV states for common prefixes, we achieve a cache hit rate of ~90%. This reduces ",(0,s.jsx)(t.strong,{children:"Time To First Token (TTFT)"})," by skipping redundant computation of the system prompt."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"2-aggressive-quantization-int4-awq--fp8",children:"2. Aggressive Quantization (INT4 AWQ & FP8)"}),"\n",(0,s.jsx)(t.p,{children:"Running models in their native 16-bit precision (BF16) restricts maximum batch size and throughput. We use quantization to shrink model weights without sacrificing accuracy."}),"\n",(0,s.jsx)(t.h3,{id:"int4-awq-activation-aware-weight-quantization",children:"INT4 AWQ (Activation-aware Weight Quantization)"}),"\n",(0,s.jsxs)(t.p,{children:["For the Llama 3 family, we use ",(0,s.jsx)(t.strong,{children:"AWQ"})," to compress weights to 4 bits. 
This reduces model size by ~75%, allowing larger models to fit into L4 GPU memory and significantly improving token generation speed."]}),"\n",(0,s.jsx)(t.h3,{id:"fp8-precision",children:"FP8 precision"}),"\n",(0,s.jsxs)(t.p,{children:["For NVIDIA Hopper (H100) architectures, we are exploring ",(0,s.jsx)(t.strong,{children:"FP8 quantization"}),", leveraging native FP8 tensor cores to accelerate matrix multiplications while maintaining a higher dynamic range than integer quantization."]}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Verification"}),": We validate quantized models by comparing dot-product similarity of embeddings against the FP16 baseline, consistently achieving ",(0,s.jsx)(t.strong,{children:">99% similarity"}),"."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"3-kernel-fusion--custom-plugins",children:"3. Kernel Fusion & Custom Plugins"}),"\n",(0,s.jsx)(t.p,{children:"To minimize overhead from launching thousands of small GPU operations, we fuse them into monolithic kernels using NVIDIA TensorRT plugins."}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Flash attention & FMHA"}),": We enable ",(0,s.jsx)(t.strong,{children:"Fused Multi-Head Attention (FMHA)"})," combined with flash attention to reduce memory reads/writes."]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"GEMM plugins"}),": We use specialized ",(0,s.jsx)(t.strong,{children:"GEMM"})," plugins to accelerate transformer linear layers."]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Removing input padding"}),": Instead of padding short sequences to match the longest, we remove input padding so the GPU processes only valid tokens."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"4-inflight-continuous-batching",children:"4. 
Inflight (Continuous) Batching"}),"\n",(0,s.jsx)(t.p,{children:"Traditional static batching waits for all requests in a batch to finish before returning results\u2014so one long response delays everyone else."}),"\n",(0,s.jsxs)(t.p,{children:["We implement ",(0,s.jsx)(t.strong,{children:"inflight batching"}),": as soon as one request completes, its slot is freed and filled by a new request from the queue. This keeps GPUs saturated and decouples latency of short queries from long ones."]}),"\n",(0,s.jsx)(t.h2,{id:"5-parallelism-strategies-scaling-beyond-one-gpu",children:"5. Parallelism Strategies: Scaling Beyond One GPU"}),"\n",(0,s.jsx)(t.p,{children:"For large models (e.g., 70B+ parameters) that cannot fit into the VRAM of a single GPU, we use parallelism strategies."}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Tensor parallelism (TP)"}),": Split weight matrices across multiple GPUs (e.g., 4\xd7 L4 or 8\xd7 A100). Each GPU computes a shard and outputs are reduced at every layer."]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Pipeline parallelism (PP)"}),": Split model layers across GPUs to pipeline compute (e.g., while one GPU computes later layers for Request A, another starts early layers for Request B)."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"6-speculative-decoding",children:"6. Speculative Decoding"}),"\n",(0,s.jsxs)(t.p,{children:["To reduce inter-token latency (ITL), we explore ",(0,s.jsx)(t.strong,{children:"speculative decoding"}),"."]}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Mechanism"}),': A smaller, faster "draft" model speculatively generates a short token sequence (e.g., 5 tokens).']}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Verification"}),": The larger target model verifies those tokens in one parallel forward pass. 
If correct, we effectively generate multiple tokens per large-model step; if not, we discard and regenerate. This is effective for predictable text, improving perceived generation speed."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"few-benchmarks",children:"Few Benchmarks"}),"\n",(0,s.jsx)(t.p,{children:"Below are a couple of representative use cases and performance numbers."}),"\n",(0,s.jsx)(t.h3,{id:"search-query-rewriting",children:"Search query rewriting"}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"LLM"}),": Fine-tuned llama-3.2-1B"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Input & output token length"}),": ~10\u201320"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Response type"}),": Non-streaming"]}),"\n"]}),"\n",(0,s.jsxs)(t.table,{children:[(0,s.jsx)(t.thead,{children:(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.th,{children:"Inference runtime"}),(0,s.jsx)(t.th,{children:"Hardware"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Max requests/sec"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Max p99 latency"})]})}),(0,s.jsxs)(t.tbody,{children:[(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{children:"4 \xd7 L4 GPUs (multi-GPU)"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1000"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"95 ms"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{children:"1 \xd7 A100 40 GB GPU"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1000"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"69 ms"})]})]})]}),"\n",(0,s.jsx)(t.h3,{id:"voice-bot-query",children:"Voice bot query"}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"LLM"}),": Llama-3.1-8B"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Input token length"}),": 
~1900\u20132000"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Output token length"}),": ~200"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Response type"}),": Streaming"]}),"\n"]}),"\n",(0,s.jsxs)(t.table,{children:[(0,s.jsx)(t.thead,{children:(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.th,{children:"Inference runtime"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Concurrency"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"p99 TTFT (ms)"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"p99 ITL (ms)"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Token throughput (tokens/sec)"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Request throughput (req/sec)"}),(0,s.jsx)(t.th,{children:"Hardware"})]})}),(0,s.jsxs)(t.tbody,{children:[(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"36.27"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"22.78"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"45.66"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.23"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"49.81"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"23.21"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"89.37"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.45"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"55.33"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"36.62"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"153.39"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.78"}),(0,s.jsx)(t.td,{ch
ildren:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"66.5"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"39.11"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"279.88"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1.47"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"16"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"131.8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"30.39"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"547.8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2.77"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"32"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"277.22"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"48.02"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"925.7"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4.78"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"64"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"498.52"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"71.62"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,164.40"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"6.2"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"128"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"677.31"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"120.37"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,445.18"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},ch
ildren:"7.69"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"256"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,926.31"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"216.88"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,600.81"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"8.52"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"21.17"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"9.24"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"130.05"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.68"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"25.78"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"9.21"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"264.5"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1.35"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"28.52"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"10.99"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"437.69"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2.27"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"34.4"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"12.61"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"760.49"}),(0,s.jsx)
(t.td,{style:{textAlign:"right"},children:"3.96"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"16"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"68.03"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"14.32"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,343.80"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"7.01"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"32"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"185.96"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"16.82"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2,287.30"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"11.92"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"64"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"136.87"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"21.17"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"3,625.22"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"18.89"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"128"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"463.78"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"34.15"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4,456.51"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"23.24"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"256"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"890.12"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"59.18"}),(0,s.jsx)(t.td
,{style:{textAlign:"right"},children:"5,188.24"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"27.05"}),(0,s.jsx)(t.td,{children:"A100"})]})]})]}),"\n",(0,s.jsx)(t.h2,{id:"conclusion",children:"Conclusion"}),"\n",(0,s.jsx)(t.p,{children:"High-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure."}),"\n",(0,s.jsx)(t.p,{children:"These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications."})]})}function o(e={}){const{wrapper:t}={...(0,l.R)(),...e.components};return t?(0,s.jsx)(t,{...e,children:(0,s.jsx)(a,{...e})}):a(e)}},8453:(e,t,i)=>{i.d(t,{R:()=>r,x:()=>d});var n=i(6540);const s={},l=n.createContext(s);function r(e){const t=n.useContext(l);return n.useMemo(function(){return"function"==typeof e?e(t):{...t,...e}},[t,e])}function d(e){let t;return t=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:r(e.components),n.createElement(l.Provider,{value:t},e.children)}},9200:(e,t,i)=>{i.d(t,{A:()=>n});const n=i.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"}}]); \ No newline at end of file +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[8315],{5969:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-five","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-five/index.md","source":"@site/blog/bharatmlstack-history/post-five/index.md","title":"LLM Inference Optimization Techniques: 
Engineering Sub-Second Latency at Scale","description":"BharatMLStack","date":"2025-06-02T00:00:00.000Z","tags":[{"inline":true,"label":"llm","permalink":"/BharatMLStack/blog/tags/llm"},{"inline":true,"label":"vllm","permalink":"/BharatMLStack/blog/tags/vllm"},{"inline":true,"label":"tensorrt-llm","permalink":"/BharatMLStack/blog/tags/tensorrt-llm"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":4.93,"hasTruncateMarker":false,"authors":[{"name":"Jaya Kumar","title":"Lead ML Engineer @ Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null}],"frontMatter":{"slug":"post-five","title":"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale","authors":["jaya"],"date":"2025-6-2","tags":["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"nextItem":{"title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","permalink":"/BharatMLStack/blog/post-four"}}')},8319:(e,t,i)=>{i.r(t),i.d(t,{assets:()=>h,contentTitle:()=>d,default:()=>o,frontMatter:()=>r,metadata:()=>n,toc:()=>c});var n=i(5969),s=i(4848),l=i(8453);const r={slug:"post-five",title:"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale",authors:["jaya"],date:"2025-6-2",tags:["llm","vllm","tensorrt-llm","mlplatform","meesho","bharatmlstack"]},d=void 0,h={authorsImageUrls:[void 0]},c=[{value:"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale",id:"llm-inference-optimization-techniques-engineering-sub-second-latency-at-scale",level:2},{value:"1. 
Advanced Memory Management: Paged & Prefix KV Caching",id:"1-advanced-memory-management-paged--prefix-kv-caching",level:2},{value:"Paged KV caching",id:"paged-kv-caching",level:3},{value:"KV cache quantization",id:"kv-cache-quantization",level:3},{value:"Prefix caching (the &quot;voice bot&quot; optimizer)",id:"prefix-caching-the-voice-bot-optimizer",level:3},{value:"2. Aggressive Quantization (INT4 AWQ & FP8)",id:"2-aggressive-quantization-int4-awq--fp8",level:2},{value:"INT4 AWQ (Activation-aware Weight Quantization)",id:"int4-awq-activation-aware-weight-quantization",level:3},{value:"FP8 precision",id:"fp8-precision",level:3},{value:"3. Kernel Fusion & Custom Plugins",id:"3-kernel-fusion--custom-plugins",level:2},{value:"4. Inflight (Continuous) Batching",id:"4-inflight-continuous-batching",level:2},{value:"5. Parallelism Strategies: Scaling Beyond One GPU",id:"5-parallelism-strategies-scaling-beyond-one-gpu",level:2},{value:"6. Speculative Decoding",id:"6-speculative-decoding",level:2},{value:"Few Benchmarks",id:"few-benchmarks",level:2},{value:"Search query rewriting",id:"search-query-rewriting",level:3},{value:"Voice bot query",id:"voice-bot-query",level:3},{value:"Conclusion",id:"conclusion",level:2}];function a(e){const t={h2:"h2",h3:"h3",img:"img",li:"li",p:"p",strong:"strong",table:"table",tbody:"tbody",td:"td",th:"th",thead:"thead",tr:"tr",ul:"ul",...(0,l.R)(),...e.components};return(0,s.jsxs)(s.Fragment,{children:[(0,s.jsx)(t.p,{children:(0,s.jsx)(t.img,{alt:"BharatMLStack",src:i(8849).A+"",width:"1396",height:"460"})}),"\n",(0,s.jsx)(t.h2,{id:"llm-inference-optimization-techniques-engineering-sub-second-latency-at-scale",children:"LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale"}),"\n",(0,s.jsx)(t.p,{children:"Raw execution of Large Language Models is inherently expensive and memory-intensive. 
To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack\u2014from memory management to kernel execution."}),"\n",(0,s.jsx)(t.h2,{id:"1-advanced-memory-management-paged--prefix-kv-caching",children:"1. Advanced Memory Management: Paged & Prefix KV Caching"}),"\n",(0,s.jsx)(t.p,{children:"The most significant bottleneck in LLM inference is not always compute, but memory bandwidth\u2014specifically managing the Key-Value (KV) cache."}),"\n",(0,s.jsx)(t.h3,{id:"paged-kv-caching",children:"Paged KV caching"}),"\n",(0,s.jsxs)(t.p,{children:["Standard caching suffers from fragmentation. We use ",(0,s.jsx)(t.strong,{children:"Paged KV caching"}),", which operates similarly to an operating system's virtual memory: the KV cache is divided into non-contiguous blocks. This lets us serve larger batch sizes without running out of memory."]}),"\n",(0,s.jsx)(t.h3,{id:"kv-cache-quantization",children:"KV cache quantization"}),"\n",(0,s.jsxs)(t.p,{children:["To further maximize available memory, we implement ",(0,s.jsx)(t.strong,{children:"KV cache quantization"})," (e.g., FP8). By compressing stored attention keys and values from 16-bit to 8-bit, we nearly double the effective context window capacity of the GPU, allowing longer conversations or larger batches without materially degrading quality."]}),"\n",(0,s.jsx)(t.h3,{id:"prefix-caching-the-voice-bot-optimizer",children:'Prefix caching (the "voice bot" optimizer)'}),"\n",(0,s.jsxs)(t.p,{children:['For use cases like GenAI voice bots where the system prompt (e.g., "You are a helpful assistant...") is static across thousands of requests, we enable ',(0,s.jsx)(t.strong,{children:"prefix caching"}),"."]}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Impact"}),": By reusing pre-computed KV states for common prefixes, we achieve a cache hit rate of ~90%. 
This reduces ",(0,s.jsx)(t.strong,{children:"Time To First Token (TTFT)"})," by skipping redundant computation of the system prompt."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"2-aggressive-quantization-int4-awq--fp8",children:"2. Aggressive Quantization (INT4 AWQ & FP8)"}),"\n",(0,s.jsx)(t.p,{children:"Running models in their native 16-bit precision (BF16) restricts maximum batch size and throughput. We use quantization to shrink model weights without sacrificing accuracy."}),"\n",(0,s.jsx)(t.h3,{id:"int4-awq-activation-aware-weight-quantization",children:"INT4 AWQ (Activation-aware Weight Quantization)"}),"\n",(0,s.jsxs)(t.p,{children:["For the Llama 3 family, we use ",(0,s.jsx)(t.strong,{children:"AWQ"})," to compress weights to 4 bits. This reduces model size by ~75%, allowing larger models to fit into L4 GPU memory and significantly improving token generation speed."]}),"\n",(0,s.jsx)(t.h3,{id:"fp8-precision",children:"FP8 precision"}),"\n",(0,s.jsxs)(t.p,{children:["For NVIDIA Hopper (H100) architectures, we are exploring ",(0,s.jsx)(t.strong,{children:"FP8 quantization"}),", leveraging native FP8 tensor cores to accelerate matrix multiplications while maintaining a higher dynamic range than integer quantization."]}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Verification"}),": We validate quantized models by comparing dot-product similarity of embeddings against the FP16 baseline, consistently achieving ",(0,s.jsx)(t.strong,{children:">99% similarity"}),"."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"3-kernel-fusion--custom-plugins",children:"3. 
Kernel Fusion & Custom Plugins"}),"\n",(0,s.jsx)(t.p,{children:"To minimize overhead from launching thousands of small GPU operations, we fuse them into monolithic kernels using NVIDIA TensorRT plugins."}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Flash attention & FMHA"}),": We enable ",(0,s.jsx)(t.strong,{children:"Fused Multi-Head Attention (FMHA)"})," combined with flash attention to reduce memory reads/writes."]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"GEMM plugins"}),": We use specialized ",(0,s.jsx)(t.strong,{children:"GEMM"})," plugins to accelerate transformer linear layers."]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Removing input padding"}),": Instead of padding short sequences to match the longest, we remove input padding so the GPU processes only valid tokens."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"4-inflight-continuous-batching",children:"4. Inflight (Continuous) Batching"}),"\n",(0,s.jsx)(t.p,{children:"Traditional static batching waits for all requests in a batch to finish before returning results\u2014so one long response delays everyone else."}),"\n",(0,s.jsxs)(t.p,{children:["We implement ",(0,s.jsx)(t.strong,{children:"inflight batching"}),": as soon as one request completes, its slot is freed and filled by a new request from the queue. This keeps GPUs saturated and decouples latency of short queries from long ones."]}),"\n",(0,s.jsx)(t.h2,{id:"5-parallelism-strategies-scaling-beyond-one-gpu",children:"5. Parallelism Strategies: Scaling Beyond One GPU"}),"\n",(0,s.jsx)(t.p,{children:"For large models (e.g., 70B+ parameters) that cannot fit into the VRAM of a single GPU, we use parallelism strategies."}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Tensor parallelism (TP)"}),": Split weight matrices across multiple GPUs (e.g., 4\xd7 L4 or 8\xd7 A100). 
Each GPU computes a shard and outputs are reduced at every layer."]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Pipeline parallelism (PP)"}),": Split model layers across GPUs to pipeline compute (e.g., while one GPU computes later layers for Request A, another starts early layers for Request B)."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"6-speculative-decoding",children:"6. Speculative Decoding"}),"\n",(0,s.jsxs)(t.p,{children:["To reduce inter-token latency (ITL), we explore ",(0,s.jsx)(t.strong,{children:"speculative decoding"}),"."]}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Mechanism"}),': A smaller, faster "draft" model speculatively generates a short token sequence (e.g., 5 tokens).']}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Verification"}),": The larger target model verifies those tokens in one parallel forward pass. If correct, we effectively generate multiple tokens per large-model step; if not, we discard and regenerate. 
This is effective for predictable text, improving perceived generation speed."]}),"\n"]}),"\n",(0,s.jsx)(t.h2,{id:"few-benchmarks",children:"A Few Benchmarks"}),"\n",(0,s.jsx)(t.p,{children:"Below are a couple of representative use cases and performance numbers."}),"\n",(0,s.jsx)(t.h3,{id:"search-query-rewriting",children:"Search query rewriting"}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"LLM"}),": Fine-tuned llama-3.2-1B"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Input & output token length"}),": ~10\u201320"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Response type"}),": Non-streaming"]}),"\n"]}),"\n",(0,s.jsxs)(t.table,{children:[(0,s.jsx)(t.thead,{children:(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.th,{children:"Inference runtime"}),(0,s.jsx)(t.th,{children:"Hardware"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Max requests/sec"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Max p99 latency"})]})}),(0,s.jsxs)(t.tbody,{children:[(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{children:"4 \xd7 L4 GPUs (multi-GPU)"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1000"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"95 ms"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{children:"1 \xd7 A100 40 GB GPU"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1000"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"69 ms"})]})]})]}),"\n",(0,s.jsx)(t.h3,{id:"voice-bot-query",children:"Voice bot query"}),"\n",(0,s.jsxs)(t.ul,{children:["\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"LLM"}),": Llama-3.1-8B"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Input token length"}),": ~1900\u20132000"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Output token length"}),": 
~200"]}),"\n",(0,s.jsxs)(t.li,{children:[(0,s.jsx)(t.strong,{children:"Response type"}),": Streaming"]}),"\n"]}),"\n",(0,s.jsxs)(t.table,{children:[(0,s.jsx)(t.thead,{children:(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.th,{children:"Inference runtime"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Concurrency"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"p99 TTFT (ms)"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"p99 ITL (ms)"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Token throughput (tokens/sec)"}),(0,s.jsx)(t.th,{style:{textAlign:"right"},children:"Request throughput (req/sec)"}),(0,s.jsx)(t.th,{children:"Hardware"})]})}),(0,s.jsxs)(t.tbody,{children:[(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"36.27"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"22.78"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"45.66"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.23"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"49.81"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"23.21"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"89.37"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.45"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"55.33"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"36.62"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"153.39"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.78"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:
{textAlign:"right"},children:"8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"66.5"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"39.11"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"279.88"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1.47"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"16"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"131.8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"30.39"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"547.8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2.77"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"32"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"277.22"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"48.02"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"925.7"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4.78"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"64"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"498.52"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"71.62"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,164.40"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"6.2"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"128"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"677.31"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"120.37"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,445.18"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"7.69"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"Tens
orRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"256"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,926.31"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"216.88"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,600.81"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"8.52"}),(0,s.jsx)(t.td,{children:"L4"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"21.17"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"9.24"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"130.05"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"0.68"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"25.78"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"9.21"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"264.5"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1.35"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"28.52"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"10.99"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"437.69"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2.27"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"8"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"34.4"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"12.61"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"760.49"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"3.96"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{chi
ldren:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"16"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"68.03"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"14.32"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"1,343.80"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"7.01"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"32"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"185.96"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"16.82"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"2,287.30"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"11.92"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"64"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"136.87"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"21.17"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"3,625.22"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"18.89"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"128"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"463.78"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"34.15"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"4,456.51"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"23.24"}),(0,s.jsx)(t.td,{children:"A100"})]}),(0,s.jsxs)(t.tr,{children:[(0,s.jsx)(t.td,{children:"TensorRT-LLM"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"256"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"890.12"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"59.18"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"5,188.24"}),(0,s.jsx)(t.td,{style:{textAlign:"right"},children:"27.05"
}),(0,s.jsx)(t.td,{children:"A100"})]})]})]}),"\n",(0,s.jsx)(t.h2,{id:"conclusion",children:"Conclusion"}),"\n",(0,s.jsx)(t.p,{children:"High-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure."}),"\n",(0,s.jsx)(t.p,{children:"These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications."})]})}function o(e={}){const{wrapper:t}={...(0,l.R)(),...e.components};return t?(0,s.jsx)(t,{...e,children:(0,s.jsx)(a,{...e})}):a(e)}},8453:(e,t,i)=>{i.d(t,{R:()=>r,x:()=>d});var n=i(6540);const s={},l=n.createContext(s);function r(e){const t=n.useContext(l);return n.useMemo(function(){return"function"==typeof e?e(t):{...t,...e}},[t,e])}function d(e){let t;return t=e.disableParentContext?"function"==typeof e.components?e.components(s):e.components||s:r(e.components),n.createElement(l.Provider,{value:t},e.children)}},8849:(e,t,i)=>{i.d(t,{A:()=>n});const n=i.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"}}]); \ No newline at end of file diff --git a/docs/assets/js/fa31f022.968b3373.js b/docs/assets/js/fa31f022.968b3373.js deleted file mode 100644 index fa441b0d..00000000 --- a/docs/assets/js/fa31f022.968b3373.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[6062],{6096:e=>{e.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"v1.0.0","description":"Numerix 
v1.0.0","slug":"/category/v100","permalink":"/BharatMLStack/category/v100","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"Python SDK","permalink":"/BharatMLStack/category/python-sdk"},"next":{"title":"GRPC Feature client","permalink":"/BharatMLStack/sdks/python/v1.0.0/grpc_feature_client"}}}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/fccc4c42.4690f84a.js b/docs/assets/js/fccc4c42.4690f84a.js deleted file mode 100644 index 539fe26f..00000000 --- a/docs/assets/js/fccc4c42.4690f84a.js +++ /dev/null @@ -1 +0,0 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[2117],{702:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/vss-c482f6eac4c68b3219e4c562a6b717ec.png"},788:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-three","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-three/index.md","source":"@site/blog/bharatmlstack-history/post-three/index.md","title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","description":"BharatMLStack","date":"2024-05-21T00:00:00.000Z","tags":[{"inline":true,"label":"model-inference","permalink":"/BharatMLStack/blog/tags/model-inference"},{"inline":true,"label":"embedding-search","permalink":"/BharatMLStack/blog/tags/embedding-search"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":3.6,"hasTruncateMarker":false,"authors":[{"name":"Aditya Kumar","title":"Lead Software Engineer @ Meesho","url":"https://github.com/Adit2607","imageURL":"https://github.com/Adit2607.png","key":"aditya","page":null},{"name":"Jaya Kumar","title":"Lead ML Engineer @ 
Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null},{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null}],"frontMatter":{"slug":"post-three","title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","authors":["aditya","jaya","adarsha"],"date":"2024-05-21T00:00:00.000Z","tags":["model-inference","embedding-search","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","permalink":"/BharatMLStack/blog/post-three"},"nextItem":{"title":"Building Meesho\u2019s ML Platform: Lessons from the First-Gen System (Part 2)","permalink":"/BharatMLStack/blog/post-two"}}')},2561:(e,n,t)=>{t.r(n),t.d(n,{assets:()=>o,contentTitle:()=>l,default:()=>h,frontMatter:()=>s,metadata:()=>i,toc:()=>d});var i=t(788),a=t(4848),r=t(8453);const s={slug:"post-three",title:"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search",authors:["aditya","jaya","adarsha"],date:new Date("2024-05-21T00:00:00.000Z"),tags:["model-inference","embedding-search","mlplatform","meesho","bharatmlstack"]},l=void 0,o={authorsImageUrls:[void 0,void 0,void 0]},d=[{value:"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search",id:"cracking-the-code-scaling-model-inference--real-time-embedding-search",level:2},{value:"Breaking Free from the Scalability Ceiling",id:"breaking-free-from-the-scalability-ceiling",level:2},{value:"The Model Serving Bottleneck\u2014A Wake-Up Call",id:"the-model-serving-bottlenecka-wake-up-call",level:3},{value:"Scaling Triton on GKE",id:"scaling-triton-on-gke",level:3},{value:"Fixing the Cold Start Problem",id:"fixing-the-cold-start-problem",level:3},{value:"Embedding Search: The Last Piece of the 
Puzzle",id:"embedding-search-the-last-piece-of-the-puzzle",level:2},{value:"Choosing the Right Vector Database",id:"choosing-the-right-vector-database",level:3},{value:"Embedding Freshness & Real-Time Updates",id:"embedding-freshness--real-time-updates",level:3},{value:"Final Takeaways: Scaling Smartly for Real-Time ML",id:"final-takeaways-scaling-smartly-for-real-time-ml",level:2}];function c(e){const n={h2:"h2",h3:"h3",img:"img",li:"li",p:"p",ul:"ul",...(0,r.R)(),...e.components};return(0,a.jsxs)(a.Fragment,{children:[(0,a.jsx)(n.p,{children:(0,a.jsx)(n.img,{alt:"BharatMLStack",src:t(6e3).A+"",width:"1396",height:"460"})}),"\n",(0,a.jsx)(n.h2,{id:"cracking-the-code-scaling-model-inference--real-time-embedding-search",children:"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search"}),"\n",(0,a.jsx)(n.p,{children:"By mid-2023, we had transformed our ML stack\u2014building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\udd39 Scaling model inference without hitting infrastructure roadblocks"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\udd39 Moving embedding search from batch to real-time for candidate generation"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"Here\u2019s how we tackled these last-mile challenges, broke free from infrastructure constraints, and built a cost-efficient, high-performance system."}),"\n",(0,a.jsx)(n.h2,{id:"breaking-free-from-the-scalability-ceiling",children:"Breaking Free from the Scalability Ceiling"}),"\n",(0,a.jsx)(n.h3,{id:"the-model-serving-bottlenecka-wake-up-call",children:"The Model Serving Bottleneck\u2014A Wake-Up Call"}),"\n",(0,a.jsx)(n.p,{children:"July 2023. With just months left for the Mega Blockbuster Sale (MBS), we noticed a serious issue\u2014scaling our model-serving infrastructure was taking 10\u201315 minutes. 
In real-time ML, that\u2019s an eternity.\nIn one of our war rooms, we ran a quick experiment:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 We deployed an XGBoost model on a self-hosted Triton Inference Server running on a 16-core machine."}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Fired requests and compared the outputs with our existing cloud-hosted setup."}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 The results matched\u2014perfectly."}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:'That moment changed everything. We prepped a backup Triton setup on EKS, just in case our cloud provider couldn\'t allocate enough compute resources in time. Luckily, they did\u2014but the seed was planted.\nThen in October, just two weeks before MBS, we got an alarming response from our infrastructure team:\n"Node availability may be an issue."\nWith no time to waste, we moved 30% of real-time ML traffic to our self-hosted Triton cluster. The results?'}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\u2705 p99 latency dropped from 90\u2013100ms to 30\u201340ms"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Triton handled significantly higher throughput on fewer resources"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 No model changes were needed"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"MBS ran without a hitch, proving that self-hosted inference was the way forward."}),"\n",(0,a.jsx)(n.h3,{id:"scaling-triton-on-gke",children:"Scaling Triton on GKE"}),"\n",(0,a.jsx)(n.p,{children:"This left us with two choices:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"1\ufe0f\u20e3 Port models to a managed cloud inference service, investing time in learning a new deployment stack"}),"\n",(0,a.jsx)(n.li,{children:"2\ufe0f\u20e3 Scale our existing Triton setup on GKE, optimizing for cost and performance"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"We went with Option 2\u2014and it slashed inference costs to 35% of what we previously paid, while 
giving us full control over scaling and optimizations."}),"\n",(0,a.jsx)(n.h3,{id:"fixing-the-cold-start-problem",children:"Fixing the Cold Start Problem"}),"\n",(0,a.jsx)(n.p,{children:"As we onboarded more deep learning (DL) models, we hit a new bottleneck, new inference pods took 7\u20139 minutes to spin up."}),"\n",(0,a.jsx)(n.p,{children:"After profiling, we found the culprits:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"Triton\u2019s base image\u2014a massive 5GB"}),"\n",(0,a.jsx)(n.li,{children:"Model binaries\u2014often 1GB+"}),"\n",(0,a.jsx)(n.li,{children:"Startup delay\u2014mostly due to downloading and initializing these assets"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"To fix this, we built a lightweight Triton image, stripping unused components and shrinking the size to 900MB. This cut cold start times drastically, making auto-scaling faster and smoother."}),"\n",(0,a.jsx)(n.h2,{id:"embedding-search-the-last-piece-of-the-puzzle",children:"Embedding Search: The Last Piece of the Puzzle"}),"\n",(0,a.jsx)(n.p,{children:"By mid-2023, most of our ML stack had gone real-time\u2014except for Candidate Generation (CG), which still ran in batch mode. 
To truly power real-time recommendations, we needed an online embedding search system."}),"\n",(0,a.jsx)(n.h3,{id:"choosing-the-right-vector-database",children:"Choosing the Right Vector Database"}),"\n",(0,a.jsx)(n.p,{children:"We benchmarked three production-ready vector DBs across key parameters:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"Milvus"}),"\n",(0,a.jsx)(n.li,{children:"Qdrant"}),"\n",(0,a.jsx)(n.li,{children:"Weaviate"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"After extensive POCs, Qdrant stood out for its:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\u2705 Blazing-fast search latency on high-dimensional vectors"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Efficient memory usage, crucial for in-memory workloads"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Support for upserts and soft deletes, vital for Ads use cases"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 gRPC + REST APIs, making integration seamless"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Powerful filtering, allowing fine-tuned retrieval (e.g., filtering Ads by category, active status, etc.)"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"At its core, Qdrant uses HNSW indexing, delivering both high recall and low-latency nearest-neighbor search\u2014a perfect fit for our needs."}),"\n",(0,a.jsx)(n.h3,{id:"embedding-freshness--real-time-updates",children:"Embedding Freshness & Real-Time Updates"}),"\n",(0,a.jsx)(n.p,{children:"To ensure embeddings stayed up to date, we built a dual ingestion pipeline:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\udccc Daily Refresh: A bulk pipeline updated embeddings overnight"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\udccc Real-Time Updates: Ads events triggered immediate upserts/deletes"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:'This setup powered real-time "Similar Products" recommendations on the product page and became the foundation for Ads Candidate Generation, ensuring the right ads surfaced in 
milliseconds.'}),"\n",(0,a.jsx)(n.p,{children:(0,a.jsx)(n.img,{alt:"Skye",src:t(702).A+"",width:"1260",height:"644"})}),"\n",(0,a.jsx)(n.h2,{id:"final-takeaways-scaling-smartly-for-real-time-ml",children:"Final Takeaways: Scaling Smartly for Real-Time ML"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Self-hosted inference on Triton gave us lower cost, faster scaling, and better performance than managed services"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Building a custom Triton image reduced cold starts, improving responsiveness"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Qdrant-based embedding search enabled real-time personalization at scale"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Real-time updates for embeddings unlocked dynamic, up-to-date recommendations"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"By early 2024, Meesho\u2019s ML stack had evolved into a fully real-time, scalable, and cost-efficient system, setting the foundation for even bigger leaps ahead."})]})}function h(e={}){const{wrapper:n}={...(0,r.R)(),...e.components};return n?(0,a.jsx)(n,{...e,children:(0,a.jsx)(c,{...e})}):c(e)}},6e3:(e,n,t)=>{t.d(n,{A:()=>i});const i=t.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"},8453:(e,n,t)=>{t.d(n,{R:()=>s,x:()=>l});var i=t(6540);const a={},r=i.createContext(a);function s(e){const n=i.useContext(r);return i.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function l(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(a):e.components||a:s(e.components),i.createElement(r.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/fccc4c42.793ba51f.js b/docs/assets/js/fccc4c42.793ba51f.js new file mode 100644 index 00000000..b8703a1a --- /dev/null +++ b/docs/assets/js/fccc4c42.793ba51f.js @@ -0,0 +1 @@ +"use 
strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[2117],{788:e=>{e.exports=JSON.parse('{"permalink":"/BharatMLStack/blog/post-three","editUrl":"https://github.com/Meesho/BharatMLStack/tree/main/docs/blog/bharatmlstack-history/post-three/index.md","source":"@site/blog/bharatmlstack-history/post-three/index.md","title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","description":"BharatMLStack","date":"2024-05-21T00:00:00.000Z","tags":[{"inline":true,"label":"model-inference","permalink":"/BharatMLStack/blog/tags/model-inference"},{"inline":true,"label":"embedding-search","permalink":"/BharatMLStack/blog/tags/embedding-search"},{"inline":true,"label":"mlplatform","permalink":"/BharatMLStack/blog/tags/mlplatform"},{"inline":true,"label":"meesho","permalink":"/BharatMLStack/blog/tags/meesho"},{"inline":true,"label":"bharatmlstack","permalink":"/BharatMLStack/blog/tags/bharatmlstack"}],"readingTime":3.6,"hasTruncateMarker":false,"authors":[{"name":"Aditya Kumar","title":"Lead Software Engineer @ Meesho","url":"https://github.com/Adit2607","imageURL":"https://github.com/Adit2607.png","key":"aditya","page":null},{"name":"Jaya Kumar","title":"Lead ML Engineer @ Meesho","url":"https://github.com/jayakommuru","imageURL":"https://github.com/jayakommuru.png","key":"jaya","page":null},{"name":"Adarsha Das","title":"Senior Architect @ Meesho","url":"https://github.com/a0d00kc","imageURL":"https://github.com/a0d00kc.png","key":"adarsha","page":null}],"frontMatter":{"slug":"post-three","title":"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search","authors":["aditya","jaya","adarsha"],"date":"2024-05-21T00:00:00.000Z","tags":["model-inference","embedding-search","mlplatform","meesho","bharatmlstack"]},"unlisted":false,"prevItem":{"title":"Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving","permalink":"/BharatMLStack/blog/post-four"},"nextItem":{"title":"Building Meesho\u2019s 
ML Platform: Lessons from the First-Gen System (Part 2)","permalink":"/BharatMLStack/blog/post-two"}}')},2561:(e,n,i)=>{i.r(n),i.d(n,{assets:()=>o,contentTitle:()=>l,default:()=>h,frontMatter:()=>s,metadata:()=>t,toc:()=>d});var t=i(788),a=i(4848),r=i(8453);const s={slug:"post-three",title:"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search",authors:["aditya","jaya","adarsha"],date:new Date("2024-05-21T00:00:00.000Z"),tags:["model-inference","embedding-search","mlplatform","meesho","bharatmlstack"]},l=void 0,o={authorsImageUrls:[void 0,void 0,void 0]},d=[{value:"Cracking the Code: Scaling Model Inference & Real-Time Embedding Search",id:"cracking-the-code-scaling-model-inference--real-time-embedding-search",level:2},{value:"Breaking Free from the Scalability Ceiling",id:"breaking-free-from-the-scalability-ceiling",level:2},{value:"The Model Serving Bottleneck\u2014A Wake-Up Call",id:"the-model-serving-bottlenecka-wake-up-call",level:3},{value:"Scaling Triton on GKE",id:"scaling-triton-on-gke",level:3},{value:"Fixing the Cold Start Problem",id:"fixing-the-cold-start-problem",level:3},{value:"Embedding Search: The Last Piece of the Puzzle",id:"embedding-search-the-last-piece-of-the-puzzle",level:2},{value:"Choosing the Right Vector Database",id:"choosing-the-right-vector-database",level:3},{value:"Embedding Freshness & Real-Time Updates",id:"embedding-freshness--real-time-updates",level:3},{value:"Final Takeaways: Scaling Smartly for Real-Time ML",id:"final-takeaways-scaling-smartly-for-real-time-ml",level:2}];function c(e){const n={h2:"h2",h3:"h3",img:"img",li:"li",p:"p",ul:"ul",...(0,r.R)(),...e.components};return(0,a.jsxs)(a.Fragment,{children:[(0,a.jsx)(n.p,{children:(0,a.jsx)(n.img,{alt:"BharatMLStack",src:i(4411).A+"",width:"1396",height:"460"})}),"\n",(0,a.jsx)(n.h2,{id:"cracking-the-code-scaling-model-inference--real-time-embedding-search",children:"Cracking the Code: Scaling Model Inference & Real-Time Embedding 
Search"}),"\n",(0,a.jsx)(n.p,{children:"By mid-2023, we had transformed our ML stack\u2014building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\udd39 Scaling model inference without hitting infrastructure roadblocks"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\udd39 Moving embedding search from batch to real-time for candidate generation"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"Here\u2019s how we tackled these last-mile challenges, broke free from infrastructure constraints, and built a cost-efficient, high-performance system."}),"\n",(0,a.jsx)(n.h2,{id:"breaking-free-from-the-scalability-ceiling",children:"Breaking Free from the Scalability Ceiling"}),"\n",(0,a.jsx)(n.h3,{id:"the-model-serving-bottlenecka-wake-up-call",children:"The Model Serving Bottleneck\u2014A Wake-Up Call"}),"\n",(0,a.jsx)(n.p,{children:"July 2023. With just months left for the Mega Blockbuster Sale (MBS), we noticed a serious issue\u2014scaling our model-serving infrastructure was taking 10\u201315 minutes. In real-time ML, that\u2019s an eternity.\nIn one of our war rooms, we ran a quick experiment:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 We deployed an XGBoost model on a self-hosted Triton Inference Server running on a 16-core machine."}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Fired requests and compared the outputs with our existing cloud-hosted setup."}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 The results matched\u2014perfectly."}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:'That moment changed everything. We prepped a backup Triton setup on EKS, just in case our cloud provider couldn\'t allocate enough compute resources in time. 
Luckily, they did\u2014but the seed was planted.\nThen in October, just two weeks before MBS, we got an alarming response from our infrastructure team:\n"Node availability may be an issue."\nWith no time to waste, we moved 30% of real-time ML traffic to our self-hosted Triton cluster. The results?'}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\u2705 p99 latency dropped from 90\u2013100ms to 30\u201340ms"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Triton handled significantly higher throughput on fewer resources"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 No model changes were needed"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"MBS ran without a hitch, proving that self-hosted inference was the way forward."}),"\n",(0,a.jsx)(n.h3,{id:"scaling-triton-on-gke",children:"Scaling Triton on GKE"}),"\n",(0,a.jsx)(n.p,{children:"This left us with two choices:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"1\ufe0f\u20e3 Port models to a managed cloud inference service, investing time in learning a new deployment stack"}),"\n",(0,a.jsx)(n.li,{children:"2\ufe0f\u20e3 Scale our existing Triton setup on GKE, optimizing for cost and performance"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"We went with Option 2\u2014and it slashed inference costs to 35% of what we previously paid, while giving us full control over scaling and optimizations."}),"\n",(0,a.jsx)(n.h3,{id:"fixing-the-cold-start-problem",children:"Fixing the Cold Start Problem"}),"\n",(0,a.jsx)(n.p,{children:"As we onboarded more deep learning (DL) models, we hit a new bottleneck: new inference pods took 7\u20139 minutes to spin up."}),"\n",(0,a.jsx)(n.p,{children:"After profiling, we found the culprits:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"Triton\u2019s base image\u2014a massive 5GB"}),"\n",(0,a.jsx)(n.li,{children:"Model binaries\u2014often 1GB+"}),"\n",(0,a.jsx)(n.li,{children:"Startup delay\u2014mostly due to downloading and initializing these 
assets"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"To fix this, we built a lightweight Triton image, stripping unused components and shrinking the size to 900MB. This cut cold start times drastically, making auto-scaling faster and smoother."}),"\n",(0,a.jsx)(n.h2,{id:"embedding-search-the-last-piece-of-the-puzzle",children:"Embedding Search: The Last Piece of the Puzzle"}),"\n",(0,a.jsx)(n.p,{children:"By mid-2023, most of our ML stack had gone real-time\u2014except for Candidate Generation (CG), which still ran in batch mode. To truly power real-time recommendations, we needed an online embedding search system."}),"\n",(0,a.jsx)(n.h3,{id:"choosing-the-right-vector-database",children:"Choosing the Right Vector Database"}),"\n",(0,a.jsx)(n.p,{children:"We benchmarked three production-ready vector DBs across key parameters:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"Milvus"}),"\n",(0,a.jsx)(n.li,{children:"Qdrant"}),"\n",(0,a.jsx)(n.li,{children:"Weaviate"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"After extensive POCs, Qdrant stood out for its:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\u2705 Blazing-fast search latency on high-dimensional vectors"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Efficient memory usage, crucial for in-memory workloads"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Support for upserts and soft deletes, vital for Ads use cases"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 gRPC + REST APIs, making integration seamless"}),"\n",(0,a.jsx)(n.li,{children:"\u2705 Powerful filtering, allowing fine-tuned retrieval (e.g., filtering Ads by category, active status, etc.)"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"At its core, Qdrant uses HNSW indexing, delivering both high recall and low-latency nearest-neighbor search\u2014a perfect fit for our needs."}),"\n",(0,a.jsx)(n.h3,{id:"embedding-freshness--real-time-updates",children:"Embedding Freshness & Real-Time Updates"}),"\n",(0,a.jsx)(n.p,{children:"To ensure embeddings 
stayed up to date, we built a dual ingestion pipeline:"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\udccc Daily Refresh: A bulk pipeline updated embeddings overnight"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\udccc Real-Time Updates: Ads events triggered immediate upserts/deletes"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:'This setup powered real-time "Similar Products" recommendations on the product page and became the foundation for Ads Candidate Generation, ensuring the right ads surfaced in milliseconds.'}),"\n",(0,a.jsx)(n.p,{children:(0,a.jsx)(n.img,{alt:"Skye",src:i(3217).A+"",width:"1260",height:"644"})}),"\n",(0,a.jsx)(n.h2,{id:"final-takeaways-scaling-smartly-for-real-time-ml",children:"Final Takeaways: Scaling Smartly for Real-Time ML"}),"\n",(0,a.jsxs)(n.ul,{children:["\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Self-hosted inference on Triton gave us lower cost, faster scaling, and better performance than managed services"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Building a custom Triton image reduced cold starts, improving responsiveness"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Qdrant-based embedding search enabled real-time personalization at scale"}),"\n",(0,a.jsx)(n.li,{children:"\ud83d\ude80 Real-time updates for embeddings unlocked dynamic, up-to-date recommendations"}),"\n"]}),"\n",(0,a.jsx)(n.p,{children:"By early 2024, Meesho\u2019s ML stack had evolved into a fully real-time, scalable, and cost-efficient system, setting the foundation for even bigger leaps ahead."})]})}function h(e={}){const{wrapper:n}={...(0,r.R)(),...e.components};return n?(0,a.jsx)(n,{...e,children:(0,a.jsx)(c,{...e})}):c(e)}},3217:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/vss-c482f6eac4c68b3219e4c562a6b717ec.png"},4411:(e,n,i)=>{i.d(n,{A:()=>t});const t=i.p+"assets/images/bms-7399e8796d2cd24617c432518ce3f312.png"},8453:(e,n,i)=>{i.d(n,{R:()=>s,x:()=>l});var t=i(6540);const a={},r=t.createContext(a);function s(e){const 
n=t.useContext(r);return t.useMemo(function(){return"function"==typeof e?e(n):{...n,...e}},[n,e])}function l(e){let n;return n=e.disableParentContext?"function"==typeof e.components?e.components(a):e.components||a:s(e.components),t.createElement(r.Provider,{value:n},e.children)}}}]); \ No newline at end of file diff --git a/docs/assets/js/fcf4f6ca.d9bac5e5.js b/docs/assets/js/fcf4f6ca.8b12d88e.js similarity index 79% rename from docs/assets/js/fcf4f6ca.d9bac5e5.js rename to docs/assets/js/fcf4f6ca.8b12d88e.js index c8bf1a6b..98232945 100644 --- a/docs/assets/js/fcf4f6ca.d9bac5e5.js +++ b/docs/assets/js/fcf4f6ca.8b12d88e.js @@ -1 +1 @@ -"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[7720],{4041:e=>{e.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"Trufflebox UI","description":"Trufflebox UI is a modern, feature rich UI framework for supporting MLOps. It supports Feature catalog, management, user managemnet and other adminops","slug":"/category/trufflebox-ui","permalink":"/BharatMLStack/category/trufflebox-ui","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"Quick Start","permalink":"/BharatMLStack/quick-start/v1.0.0/quick-start"},"next":{"title":"User Manual","permalink":"/BharatMLStack/trufflebox-ui/v1.0.0/userguide"}}}}')}}]); \ No newline at end of file +"use strict";(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[7720],{4041:e=>{e.exports=JSON.parse('{"categoryGeneratedIndex":{"title":"Trufflebox UI","description":"Trufflebox UI is a modern, feature rich UI framework for supporting MLOps. 
It supports Feature catalog, management, user managemnet and other adminops","slug":"/category/trufflebox-ui","permalink":"/BharatMLStack/category/trufflebox-ui","sidebar":"tutorialSidebar","navigation":{"previous":{"title":"Quick Start","permalink":"/BharatMLStack/quick-start/v1.0.0/quick-start"},"next":{"title":"v1.0.0","permalink":"/BharatMLStack/trufflebox-ui/v1.0.0"}}}}')}}]); \ No newline at end of file diff --git a/docs/assets/js/main.3e15e71d.js b/docs/assets/js/main.391e8e6d.js similarity index 65% rename from docs/assets/js/main.3e15e71d.js rename to docs/assets/js/main.391e8e6d.js index 1a92bb08..ad6ce2ec 100644 --- a/docs/assets/js/main.3e15e71d.js +++ b/docs/assets/js/main.391e8e6d.js @@ -1,2 +1,2 @@ -/*! For license information please see main.3e15e71d.js.LICENSE.txt */ -(self.webpackChunkdocs=self.webpackChunkdocs||[]).push([[8792],{115:e=>{var t="undefined"!=typeof Element,n="function"==typeof Map,r="function"==typeof Set,a="function"==typeof ArrayBuffer&&!!ArrayBuffer.isView;function o(e,i){if(e===i)return!0;if(e&&i&&"object"==typeof e&&"object"==typeof i){if(e.constructor!==i.constructor)return!1;var l,s,c,u;if(Array.isArray(e)){if((l=e.length)!=i.length)return!1;for(s=l;0!==s--;)if(!o(e[s],i[s]))return!1;return!0}if(n&&e instanceof Map&&i instanceof Map){if(e.size!==i.size)return!1;for(u=e.entries();!(s=u.next()).done;)if(!i.has(s.value[0]))return!1;for(u=e.entries();!(s=u.next()).done;)if(!o(s.value[1],i.get(s.value[0])))return!1;return!0}if(r&&e instanceof Set&&i instanceof Set){if(e.size!==i.size)return!1;for(u=e.entries();!(s=u.next()).done;)if(!i.has(s.value[0]))return!1;return!0}if(a&&ArrayBuffer.isView(e)&&ArrayBuffer.isView(i)){if((l=e.length)!=i.length)return!1;for(s=l;0!==s--;)if(e[s]!==i[s])return!1;return!0}if(e.constructor===RegExp)return e.source===i.source&&e.flags===i.flags;if(e.valueOf!==Object.prototype.valueOf&&"function"==typeof e.valueOf&&"function"==typeof i.valueOf)return 
e.valueOf()===i.valueOf();if(e.toString!==Object.prototype.toString&&"function"==typeof e.toString&&"function"==typeof i.toString)return e.toString()===i.toString();if((l=(c=Object.keys(e)).length)!==Object.keys(i).length)return!1;for(s=l;0!==s--;)if(!Object.prototype.hasOwnProperty.call(i,c[s]))return!1;if(t&&e instanceof Element)return!1;for(s=l;0!==s--;)if(("_owner"!==c[s]&&"__v"!==c[s]&&"__o"!==c[s]||!e.$$typeof)&&!o(e[c[s]],i[c[s]]))return!1;return!0}return e!=e&&i!=i}e.exports=function(e,t){try{return o(e,t)}catch(n){if((n.message||"").match(/stack|recursion/i))return console.warn("react-fast-compare cannot handle circular refs"),!1;throw n}}},119:(e,t,n)=>{"use strict";n.r(t)},205:(e,t,n)=>{"use strict";n.d(t,{A:()=>a});var r=n(6540);const a=n(8193).A.canUseDOM?r.useLayoutEffect:r.useEffect},253:(e,t)=>{"use strict";Object.defineProperty(t,"__esModule",{value:!0}),t.getErrorCausalChain=function e(t){if(t.cause)return[t,...e(t.cause)];return[t]}},311:e=>{"use strict";e.exports=function(e,t,n,r,a,o,i,l){if(!e){var s;if(void 0===t)s=new Error("Minified exception occurred; use the non-minified dev environment for the full error message and additional helpful warnings.");else{var c=[n,r,a,o,i,l],u=0;(s=new Error(t.replace(/%s/g,function(){return c[u++]}))).name="Invariant Violation"}throw s.framesToPop=1,s}}},418:(e,t,n)=>{"use strict";n.d(t,{A:()=>r});const r=()=>null},440:(e,t,n)=>{"use strict";t.rA=t.Ks=t.LU=void 0;const r=n(1635);t.LU="__blog-post-container";var a=n(2983);Object.defineProperty(t,"Ks",{enumerable:!0,get:function(){return r.__importDefault(a).default}});var o=n(2566);var i=n(253);Object.defineProperty(t,"rA",{enumerable:!0,get:function(){return i.getErrorCausalChain}})},545:(e,t,n)=>{"use strict";n.d(t,{mg:()=>J,vd:()=>G});var r=n(6540),a=n(5556),o=n.n(a),i=n(115),l=n.n(i),s=n(311),c=n.n(s),u=n(2833),d=n.n(u);function f(){return f=Object.assign||function(e){for(var t=1;t=0||(a[n]=e[n]);return a}var 
g={BASE:"base",BODY:"body",HEAD:"head",HTML:"html",LINK:"link",META:"meta",NOSCRIPT:"noscript",SCRIPT:"script",STYLE:"style",TITLE:"title",FRAGMENT:"Symbol(react.fragment)"},b={rel:["amphtml","canonical","alternate"]},y={type:["application/ld+json"]},v={charset:"",name:["robots","description"],property:["og:type","og:title","og:url","og:image","og:image:alt","og:description","twitter:url","twitter:title","twitter:description","twitter:image","twitter:image:alt","twitter:card","twitter:site"]},w=Object.keys(g).map(function(e){return g[e]}),k={accesskey:"accessKey",charset:"charSet",class:"className",contenteditable:"contentEditable",contextmenu:"contextMenu","http-equiv":"httpEquiv",itemprop:"itemProp",tabindex:"tabIndex"},S=Object.keys(k).reduce(function(e,t){return e[k[t]]=t,e},{}),x=function(e,t){for(var n=e.length-1;n>=0;n-=1){var r=e[n];if(Object.prototype.hasOwnProperty.call(r,t))return r[t]}return null},_=function(e){var t=x(e,g.TITLE),n=x(e,"titleTemplate");if(Array.isArray(t)&&(t=t.join("")),n&&t)return n.replace(/%s/g,function(){return t});var r=x(e,"defaultTitle");return t||r||void 0},E=function(e){return x(e,"onChangeClientState")||function(){}},C=function(e,t){return t.filter(function(t){return void 0!==t[e]}).map(function(t){return t[e]}).reduce(function(e,t){return f({},e,t)},{})},A=function(e,t){return t.filter(function(e){return void 0!==e[g.BASE]}).map(function(e){return e[g.BASE]}).reverse().reduce(function(t,n){if(!t.length)for(var r=Object.keys(n),a=0;a/g,">").replace(/"/g,""").replace(/'/g,"'")},R=function(e){return Object.keys(e).reduce(function(t,n){var r=void 0!==e[n]?n+'="'+e[n]+'"':""+n;return t?t+" "+r:r},"")},D=function(e,t){return void 0===t&&(t={}),Object.keys(e).reduce(function(t,n){return t[k[n]||n]=e[n],t},t)},B=function(e,t){return t.map(function(t,n){var a,o=((a={key:n})["data-rh"]=!0,a);return Object.keys(t).forEach(function(e){var 
n=k[e]||e;"innerHTML"===n||"cssText"===n?o.dangerouslySetInnerHTML={__html:t.innerHTML||t.cssText}:o[n]=t[e]}),r.createElement(e,o)})},F=function(e,t,n){switch(e){case g.TITLE:return{toComponent:function(){return n=t.titleAttributes,(a={key:e=t.title})["data-rh"]=!0,o=D(n,a),[r.createElement(g.TITLE,o,e)];var e,n,a,o},toString:function(){return function(e,t,n,r){var a=R(n),o=j(t);return a?"<"+e+' data-rh="true" '+a+">"+O(o,r)+"":"<"+e+' data-rh="true">'+O(o,r)+""}(e,t.title,t.titleAttributes,n)}};case"bodyAttributes":case"htmlAttributes":return{toComponent:function(){return D(t)},toString:function(){return R(t)}};default:return{toComponent:function(){return B(e,t)},toString:function(){return function(e,t,n){return t.reduce(function(t,r){var a=Object.keys(r).filter(function(e){return!("innerHTML"===e||"cssText"===e)}).reduce(function(e,t){var a=void 0===r[t]?t:t+'="'+O(r[t],n)+'"';return e?e+" "+a:a},""),o=r.innerHTML||r.cssText||"",i=-1===N.indexOf(e);return t+"<"+e+' data-rh="true" '+a+(i?"/>":">"+o+"")},"")}(e,t,n)}}}},I=function(e){var t=e.baseTag,n=e.bodyAttributes,r=e.encode,a=e.htmlAttributes,o=e.noscriptTags,i=e.styleTags,l=e.title,s=void 0===l?"":l,c=e.titleAttributes,u=e.linkTags,d=e.metaTags,f=e.scriptTags,p={toComponent:function(){},toString:function(){return""}};if(e.prioritizeSeoTags){var h=function(e){var t=e.linkTags,n=e.scriptTags,r=e.encode,a=P(e.metaTags,v),o=P(t,b),i=P(n,y);return{priorityMethods:{toComponent:function(){return[].concat(B(g.META,a.priority),B(g.LINK,o.priority),B(g.SCRIPT,i.priority))},toString:function(){return F(g.META,a.priority,r)+" "+F(g.LINK,o.priority,r)+" 
"+F(g.SCRIPT,i.priority,r)}},metaTags:a.default,linkTags:o.default,scriptTags:i.default}}(e);p=h.priorityMethods,u=h.linkTags,d=h.metaTags,f=h.scriptTags}return{priority:p,base:F(g.BASE,t,r),bodyAttributes:F("bodyAttributes",n,r),htmlAttributes:F("htmlAttributes",a,r),link:F(g.LINK,u,r),meta:F(g.META,d,r),noscript:F(g.NOSCRIPT,o,r),script:F(g.SCRIPT,f,r),style:F(g.STYLE,i,r),title:F(g.TITLE,{title:s,titleAttributes:c},r)}},z=[],$=function(e,t){var n=this;void 0===t&&(t="undefined"!=typeof document),this.instances=[],this.value={setHelmet:function(e){n.context.helmet=e},helmetInstances:{get:function(){return n.canUseDOM?z:n.instances},add:function(e){(n.canUseDOM?z:n.instances).push(e)},remove:function(e){var t=(n.canUseDOM?z:n.instances).indexOf(e);(n.canUseDOM?z:n.instances).splice(t,1)}}},this.context=e,this.canUseDOM=t,t||(e.helmet=I({baseTag:[],bodyAttributes:{},encodeSpecialCharacters:!0,htmlAttributes:{},linkTags:[],metaTags:[],noscriptTags:[],scriptTags:[],styleTags:[],title:"",titleAttributes:{}}))},U=r.createContext({}),q=o().shape({setHelmet:o().func,helmetInstances:o().shape({get:o().func,add:o().func,remove:o().func})}),H="undefined"!=typeof document,G=function(e){function t(n){var r;return(r=e.call(this,n)||this).helmetData=new $(r.props.context,t.canUseDOM),r}return p(t,e),t.prototype.render=function(){return r.createElement(U.Provider,{value:this.helmetData.value},this.props.children)},t}(r.Component);G.canUseDOM=H,G.propTypes={context:o().shape({helmet:o().shape()}),children:o().node.isRequired},G.defaultProps={context:{}},G.displayName="HelmetProvider";var V=function(e,t){var n,r=document.head||document.querySelector(g.HEAD),a=r.querySelectorAll(e+"[data-rh]"),o=[].slice.call(a),i=[];return t&&t.length&&t.forEach(function(t){var r=document.createElement(e);for(var a in 
t)Object.prototype.hasOwnProperty.call(t,a)&&("innerHTML"===a?r.innerHTML=t.innerHTML:"cssText"===a?r.styleSheet?r.styleSheet.cssText=t.cssText:r.appendChild(document.createTextNode(t.cssText)):r.setAttribute(a,void 0===t[a]?"":t[a]));r.setAttribute("data-rh","true"),o.some(function(e,t){return n=t,r.isEqualNode(e)})?o.splice(n,1):i.push(r)}),o.forEach(function(e){return e.parentNode.removeChild(e)}),i.forEach(function(e){return r.appendChild(e)}),{oldTags:o,newTags:i}},W=function(e,t){var n=document.getElementsByTagName(e)[0];if(n){for(var r=n.getAttribute("data-rh"),a=r?r.split(","):[],o=[].concat(a),i=Object.keys(t),l=0;l=0;d-=1)n.removeAttribute(o[d]);a.length===o.length?n.removeAttribute("data-rh"):n.getAttribute("data-rh")!==i.join(",")&&n.setAttribute("data-rh",i.join(","))}},Q=function(e,t){var n=e.baseTag,r=e.htmlAttributes,a=e.linkTags,o=e.metaTags,i=e.noscriptTags,l=e.onChangeClientState,s=e.scriptTags,c=e.styleTags,u=e.title,d=e.titleAttributes;W(g.BODY,e.bodyAttributes),W(g.HTML,r),function(e,t){void 0!==e&&document.title!==e&&(document.title=j(e)),W(g.TITLE,t)}(u,d);var f={baseTag:V(g.BASE,n),linkTags:V(g.LINK,a),metaTags:V(g.META,o),noscriptTags:V(g.NOSCRIPT,i),scriptTags:V(g.SCRIPT,s),styleTags:V(g.STYLE,c)},p={},h={};Object.keys(f).forEach(function(e){var t=f[e],n=t.newTags,r=t.oldTags;n.length&&(p[e]=n),r.length&&(h[e]=f[e].oldTags)}),t&&t(),l(e,p,h)},K=null,Y=function(e){function t(){for(var t,n=arguments.length,r=new Array(n),a=0;a elements are self-closing and can not contain children. 
Refer to our API for more information.")}},n.flattenArrayTypeChildren=function(e){var t,n=e.child,r=e.arrayTypeChildren;return f({},r,((t={})[n.type]=[].concat(r[n.type]||[],[f({},e.newChildProps,this.mapNestedChildrenToProps(n,e.nestedChildren))]),t))},n.mapObjectTypeChildren=function(e){var t,n,r=e.child,a=e.newProps,o=e.newChildProps,i=e.nestedChildren;switch(r.type){case g.TITLE:return f({},a,((t={})[r.type]=i,t.titleAttributes=f({},o),t));case g.BODY:return f({},a,{bodyAttributes:f({},o)});case g.HTML:return f({},a,{htmlAttributes:f({},o)});default:return f({},a,((n={})[r.type]=f({},o),n))}},n.mapArrayTypeChildrenToProps=function(e,t){var n=f({},t);return Object.keys(e).forEach(function(t){var r;n=f({},n,((r={})[t]=e[t],r))}),n},n.warnOnInvalidChildren=function(e,t){return c()(w.some(function(t){return e.type===t}),"function"==typeof e.type?"You may be attempting to nest components within each other, which is not allowed. Refer to our API for more information.":"Only elements types "+w.join(", ")+" are allowed. Helmet does not support rendering <"+e.type+"> elements. Refer to our API for more information."),c()(!t||"string"==typeof t||Array.isArray(t)&&!t.some(function(e){return"string"!=typeof e}),"Helmet expects a string as a child of <"+e.type+">. Did you forget to wrap your children in braces? 
( <"+e.type+">{``} ) Refer to our API for more information."),!0},n.mapChildrenToProps=function(e,t){var n=this,a={};return r.Children.forEach(e,function(e){if(e&&e.props){var r=e.props,o=r.children,i=m(r,X),l=Object.keys(i).reduce(function(e,t){return e[S[t]||t]=i[t],e},{}),s=e.type;switch("symbol"==typeof s?s=s.toString():n.warnOnInvalidChildren(e,o),s){case g.FRAGMENT:t=n.mapChildrenToProps(o,t);break;case g.LINK:case g.META:case g.NOSCRIPT:case g.SCRIPT:case g.STYLE:a=n.flattenArrayTypeChildren({child:e,arrayTypeChildren:a,newChildProps:l,nestedChildren:o});break;default:t=n.mapObjectTypeChildren({child:e,newProps:t,newChildProps:l,nestedChildren:o})}}}),this.mapArrayTypeChildrenToProps(a,t)},n.render=function(){var e=this.props,t=e.children,n=m(e,Z),a=f({},n),o=n.helmetData;return t&&(a=this.mapChildrenToProps(t,a)),!o||o instanceof $||(o=new $(o.context,o.instances)),o?r.createElement(Y,f({},a,{context:o.value,helmetData:void 0})):r.createElement(U.Consumer,null,function(e){return r.createElement(Y,f({},a,{context:e}))})},t}(r.Component);J.propTypes={base:o().object,bodyAttributes:o().object,children:o().oneOfType([o().arrayOf(o().node),o().node]),defaultTitle:o().string,defer:o().bool,encodeSpecialCharacters:o().bool,htmlAttributes:o().object,link:o().arrayOf(o().object),meta:o().arrayOf(o().object),noscript:o().arrayOf(o().object),onChangeClientState:o().func,script:o().arrayOf(o().object),style:o().arrayOf(o().object),title:o().string,titleAttributes:o().object,titleTemplate:o().string,prioritizeSeoTags:o().bool,helmetData:o().object},J.defaultProps={defer:!0,encodeSpecialCharacters:!0,prioritizeSeoTags:!1},J.displayName="Helmet"},609:(e,t,n)=>{"use strict";n.d(t,{V:()=>s,t:()=>c});var r=n(6540),a=n(9532),o=n(4848);const i=Symbol("EmptyContext"),l=r.createContext(i);function s({children:e,name:t,items:n}){const a=(0,r.useMemo)(()=>t&&n?{name:t,items:n}:null,[t,n]);return(0,o.jsx)(l.Provider,{value:a,children:e})}function c(){const 
e=(0,r.useContext)(l);if(e===i)throw new a.dV("DocsSidebarProvider");return e}},679:(e,t,n)=>{"use strict";n.d(t,{Wf:()=>c});n(6540);const r=JSON.parse('{"N":"localStorage","M":""}'),a=r.N;function o({key:e,oldValue:t,newValue:n,storage:r}){if(t===n)return;const a=document.createEvent("StorageEvent");a.initStorageEvent("storage",!1,!1,e,t,n,window.location.href,r),window.dispatchEvent(a)}function i(e=a){if("undefined"==typeof window)throw new Error("Browser storage is not available on Node.js/Docusaurus SSR process.");if("none"===e)return null;try{return window[e]}catch(n){return t=n,l||(console.warn("Docusaurus browser storage is not available.\nPossible reasons: running Docusaurus in an iframe, in an incognito browser session, or using too strict browser privacy settings.",t),l=!0),null}var t}let l=!1;const s={get:()=>null,set:()=>{},del:()=>{},listen:()=>()=>{}};function c(e,t){const n=`${e}${r.M}`;if("undefined"==typeof window)return function(e){function t(){throw new Error(`Illegal storage API usage for storage key "${e}".\nDocusaurus storage APIs are not supposed to be called on the server-rendering process.\nPlease only call storage APIs in effects and event handlers.`)}return{get:t,set:t,del:t,listen:t}}(n);const a=i(t?.persistence);return null===a?s:{get:()=>{try{return a.getItem(n)}catch(e){return console.error(`Docusaurus storage error, can't get key=${n}`,e),null}},set:e=>{try{const t=a.getItem(n);a.setItem(n,e),o({key:n,oldValue:t,newValue:e,storage:a})}catch(t){console.error(`Docusaurus storage error, can't set ${n}=${e}`,t)}},del:()=>{try{const e=a.getItem(n);a.removeItem(n),o({key:n,oldValue:e,newValue:null,storage:a})}catch(e){console.error(`Docusaurus storage error, can't delete key=${n}`,e)}},listen:e=>{try{const t=t=>{t.storageArea===a&&t.key===n&&e(t)};return window.addEventListener("storage",t),()=>window.removeEventListener("storage",t)}catch(t){return console.error(`Docusaurus storage error, can't listen for changes of 
key=${n}`,t),()=>{}}}}}},961:(e,t,n)=>{"use strict";!function e(){if("undefined"!=typeof __REACT_DEVTOOLS_GLOBAL_HOOK__&&"function"==typeof __REACT_DEVTOOLS_GLOBAL_HOOK__.checkDCE)try{__REACT_DEVTOOLS_GLOBAL_HOOK__.checkDCE(e)}catch(t){console.error(t)}}(),e.exports=n(6221)},1043:(e,t,n)=>{"use strict";n.r(t)},1107:(e,t,n)=>{"use strict";n.d(t,{A:()=>u});n(6540);var r=n(4164),a=n(1312),o=n(6342),i=n(8774),l=n(3427);const s={anchorWithStickyNavbar:"anchorWithStickyNavbar_LWe7",anchorWithHideOnScrollNavbar:"anchorWithHideOnScrollNavbar_WYt5"};var c=n(4848);function u({as:e,id:t,...n}){const u=(0,l.A)(),{navbar:{hideOnScroll:d}}=(0,o.p)();if("h1"===e||!t)return(0,c.jsx)(e,{...n,id:void 0});u.collectAnchor(t);const f=(0,a.T)({id:"theme.common.headingLinkTitle",message:"Direct link to {heading}",description:"Title for link to heading"},{heading:"string"==typeof n.children?n.children:t});return(0,c.jsxs)(e,{...n,className:(0,r.A)("anchor",d?s.anchorWithHideOnScrollNavbar:s.anchorWithStickyNavbar,n.className),id:t,children:[n.children,(0,c.jsx)(i.A,{className:"hash-link",to:`#${t}`,"aria-label":f,title:f,children:"\u200b"})]})}},1122:(e,t,n)=>{"use strict";n.d(t,{A:()=>u});var r=n(6540),a=n(4164),o=n(2303),i=n(5293);const l={themedComponent:"themedComponent_mlkZ","themedComponent--light":"themedComponent--light_NVdE","themedComponent--dark":"themedComponent--dark_xIcU"};var s=n(4848);function c({className:e,children:t}){const n=(0,o.A)(),{colorMode:c}=(0,i.G)();return(0,s.jsx)(s.Fragment,{children:(n?"dark"===c?["dark"]:["light"]:["light","dark"]).map(n=>{const o=t({theme:n,className:(0,a.A)(e,l.themedComponent,l[`themedComponent--${n}`])});return(0,s.jsx)(r.Fragment,{children:o},n)})})}function u(e){const{sources:t,className:n,alt:r,...a}=e;return(0,s.jsx)(c,{className:n,children:({theme:e,className:n})=>(0,s.jsx)("img",{src:t[e],alt:r,className:n,...a})})}},1247:(e,t,n)=>{"use strict";var r=n(9982),a=n(6540),o=n(961);function i(e){var 
t="https://react.dev/errors/"+e;if(1F||(e.current=B[F],B[F]=null,F--)}function $(e,t){F++,B[F]=e.current,e.current=t}var U=I(null),q=I(null),H=I(null),G=I(null);function V(e,t){switch($(H,t),$(q,e),$(U,null),t.nodeType){case 9:case 11:e=(e=t.documentElement)&&(e=e.namespaceURI)?ad(e):0;break;default:if(e=t.tagName,t=t.namespaceURI)e=od(t=ad(t),e);else switch(e){case"svg":e=1;break;case"math":e=2;break;default:e=0}}z(U),$(U,e)}function W(){z(U),z(q),z(H)}function Q(e){null!==e.memoizedState&&$(G,e);var t=U.current,n=od(t,e.type);t!==n&&($(q,e),$(U,n))}function K(e){q.current===e&&(z(U),z(q)),G.current===e&&(z(G),Qd._currentValue=D)}var Y=Object.prototype.hasOwnProperty,X=r.unstable_scheduleCallback,Z=r.unstable_cancelCallback,J=r.unstable_shouldYield,ee=r.unstable_requestPaint,te=r.unstable_now,ne=r.unstable_getCurrentPriorityLevel,re=r.unstable_ImmediatePriority,ae=r.unstable_UserBlockingPriority,oe=r.unstable_NormalPriority,ie=r.unstable_LowPriority,le=r.unstable_IdlePriority,se=r.log,ce=r.unstable_setDisableYieldValue,ue=null,de=null;function fe(e){if("function"==typeof se&&ce(e),de&&"function"==typeof de.setStrictMode)try{de.setStrictMode(ue,e)}catch(t){}}var pe=Math.clz32?Math.clz32:function(e){return 0===(e>>>=0)?32:31-(he(e)/me|0)|0},he=Math.log,me=Math.LN2;var ge=256,be=4194304;function ye(e){var t=42&e;if(0!==t)return t;switch(e&-e){case 1:return 1;case 2:return 2;case 4:return 4;case 8:return 8;case 16:return 16;case 32:return 32;case 64:return 64;case 128:return 128;case 256:case 512:case 1024:case 2048:case 4096:case 8192:case 16384:case 32768:case 65536:case 131072:case 262144:case 524288:case 1048576:case 2097152:return 4194048&e;case 4194304:case 8388608:case 16777216:case 33554432:return 62914560&e;case 67108864:return 67108864;case 134217728:return 134217728;case 268435456:return 268435456;case 536870912:return 536870912;case 1073741824:return 0;default:return e}}function ve(e,t,n){var r=e.pendingLanes;if(0===r)return 0;var 
a=0,o=e.suspendedLanes,i=e.pingedLanes;e=e.warmLanes;var l=134217727&r;return 0!==l?0!==(r=l&~o)?a=ye(r):0!==(i&=l)?a=ye(i):n||0!==(n=l&~e)&&(a=ye(n)):0!==(l=r&~o)?a=ye(l):0!==i?a=ye(i):n||0!==(n=r&~e)&&(a=ye(n)),0===a?0:0!==t&&t!==a&&0===(t&o)&&((o=a&-a)>=(n=t&-t)||32===o&&4194048&n)?t:a}function we(e,t){return 0===(e.pendingLanes&~(e.suspendedLanes&~e.pingedLanes)&t)}function ke(e,t){switch(e){case 1:case 2:case 4:case 8:case 64:return t+250;case 16:case 32:case 128:case 256:case 512:case 1024:case 2048:case 4096:case 8192:case 16384:case 32768:case 65536:case 131072:case 262144:case 524288:case 1048576:case 2097152:return t+5e3;default:return-1}}function Se(){var e=ge;return!(4194048&(ge<<=1))&&(ge=256),e}function xe(){var e=be;return!(62914560&(be<<=1))&&(be=4194304),e}function _e(e){for(var t=[],n=0;31>n;n++)t.push(e);return t}function Ee(e,t){e.pendingLanes|=t,268435456!==t&&(e.suspendedLanes=0,e.pingedLanes=0,e.warmLanes=0)}function Ce(e,t,n){e.pendingLanes|=t,e.suspendedLanes&=~t;var r=31-pe(t);e.entangledLanes|=t,e.entanglements[r]=1073741824|e.entanglements[r]|4194090&n}function Ae(e,t){var n=e.entangledLanes|=t;for(e=e.entanglements;n;){var r=31-pe(n),a=1<)":-1--a||s[r]!==c[a]){var u="\n"+s[r].replace(" at new "," at ");return e.displayName&&u.includes("")&&(u=u.replace("",e.displayName)),u}}while(1<=r&&0<=a);break}}}finally{ot=!1,Error.prepareStackTrace=n}return(n=e?e.displayName||e.name:"")?at(n):""}function lt(e){switch(e.tag){case 26:case 27:case 5:return at(e.type);case 16:return at("Lazy");case 13:return at("Suspense");case 19:return at("SuspenseList");case 0:case 15:return it(e.type,!1);case 11:return it(e.type.render,!1);case 1:return it(e.type,!0);case 31:return at("Activity");default:return""}}function st(e){try{var t="";do{t+=lt(e),e=e.return}while(e);return t}catch(n){return"\nError generating stack: "+n.message+"\n"+n.stack}}function ct(e){switch(typeof 
e){case"bigint":case"boolean":case"number":case"string":case"undefined":case"object":return e;default:return""}}function ut(e){var t=e.type;return(e=e.nodeName)&&"input"===e.toLowerCase()&&("checkbox"===t||"radio"===t)}function dt(e){e._valueTracker||(e._valueTracker=function(e){var t=ut(e)?"checked":"value",n=Object.getOwnPropertyDescriptor(e.constructor.prototype,t),r=""+e[t];if(!e.hasOwnProperty(t)&&void 0!==n&&"function"==typeof n.get&&"function"==typeof n.set){var a=n.get,o=n.set;return Object.defineProperty(e,t,{configurable:!0,get:function(){return a.call(this)},set:function(e){r=""+e,o.call(this,e)}}),Object.defineProperty(e,t,{enumerable:n.enumerable}),{getValue:function(){return r},setValue:function(e){r=""+e},stopTracking:function(){e._valueTracker=null,delete e[t]}}}}(e))}function ft(e){if(!e)return!1;var t=e._valueTracker;if(!t)return!0;var n=t.getValue(),r="";return e&&(r=ut(e)?e.checked?"true":"false":e.value),(e=r)!==n&&(t.setValue(e),!0)}function pt(e){if(void 0===(e=e||("undefined"!=typeof document?document:void 0)))return null;try{return e.activeElement||e.body}catch(t){return e.body}}var ht=/[\n"\\]/g;function mt(e){return e.replace(ht,function(e){return"\\"+e.charCodeAt(0).toString(16)+" "})}function gt(e,t,n,r,a,o,i,l){e.name="",null!=i&&"function"!=typeof i&&"symbol"!=typeof i&&"boolean"!=typeof i?e.type=i:e.removeAttribute("type"),null!=t?"number"===i?(0===t&&""===e.value||e.value!=t)&&(e.value=""+ct(t)):e.value!==""+ct(t)&&(e.value=""+ct(t)):"submit"!==i&&"reset"!==i||e.removeAttribute("value"),null!=t?yt(e,i,ct(t)):null!=n?yt(e,i,ct(n)):null!=r&&e.removeAttribute("value"),null==a&&null!=o&&(e.defaultChecked=!!o),null!=a&&(e.checked=a&&"function"!=typeof a&&"symbol"!=typeof a),null!=l&&"function"!=typeof l&&"symbol"!=typeof l&&"boolean"!=typeof l?e.name=""+ct(l):e.removeAttribute("name")}function bt(e,t,n,r,a,o,i,l){if(null!=o&&"function"!=typeof o&&"symbol"!=typeof o&&"boolean"!=typeof 
o&&(e.type=o),null!=t||null!=n){if(("submit"===o||"reset"===o)&&null==t)return;n=null!=n?""+ct(n):"",t=null!=t?""+ct(t):n,l||t===e.value||(e.value=t),e.defaultValue=t}r="function"!=typeof(r=null!=r?r:a)&&"symbol"!=typeof r&&!!r,e.checked=l?e.checked:!!r,e.defaultChecked=!!r,null!=i&&"function"!=typeof i&&"symbol"!=typeof i&&"boolean"!=typeof i&&(e.name=i)}function yt(e,t,n){"number"===t&&pt(e.ownerDocument)===e||e.defaultValue===""+n||(e.defaultValue=""+n)}function vt(e,t,n,r){if(e=e.options,t){t={};for(var a=0;a=xn),Cn=String.fromCharCode(32),An=!1;function Ln(e,t){switch(e){case"keyup":return-1!==kn.indexOf(t.keyCode);case"keydown":return 229!==t.keyCode;case"keypress":case"mousedown":case"focusout":return!0;default:return!1}}function Tn(e){return"object"==typeof(e=e.detail)&&"data"in e?e.data:null}var jn=!1;var Pn={color:!0,date:!0,datetime:!0,"datetime-local":!0,email:!0,month:!0,number:!0,password:!0,range:!0,search:!0,tel:!0,text:!0,time:!0,url:!0,week:!0};function Mn(e){var t=e&&e.nodeName&&e.nodeName.toLowerCase();return"input"===t?!!Pn[e.type]:"textarea"===t}function Nn(e,t,n,r){Mt?Nt?Nt.push(r):Nt=[r]:Mt=r,0<(t=Hu(t,"onChange")).length&&(n=new Jt("onChange","change",null,n,r),e.push({event:n,listeners:t}))}var On=null,Rn=null;function Dn(e){Du(e,0)}function Bn(e){if(ft(qe(e)))return e}function Fn(e,t){if("change"===e)return t}var In=!1;if(Ft){var zn;if(Ft){var $n="oninput"in document;if(!$n){var Un=document.createElement("div");Un.setAttribute("oninput","return;"),$n="function"==typeof Un.oninput}zn=$n}else zn=!1;In=zn&&(!document.documentMode||9=t)return{node:r,offset:t-e};e=n}e:{for(;r;){if(r.nextSibling){r=r.nextSibling;break e}r=r.parentNode}r=void 0}r=Xn(r)}}function Jn(e,t){return!(!e||!t)&&(e===t||(!e||3!==e.nodeType)&&(t&&3===t.nodeType?Jn(e,t.parentNode):"contains"in e?e.contains(t):!!e.compareDocumentPosition&&!!(16&e.compareDocumentPosition(t))))}function er(e){for(var 
t=pt((e=null!=e&&null!=e.ownerDocument&&null!=e.ownerDocument.defaultView?e.ownerDocument.defaultView:window).document);t instanceof e.HTMLIFrameElement;){try{var n="string"==typeof t.contentWindow.location.href}catch(r){n=!1}if(!n)break;t=pt((e=t.contentWindow).document)}return t}function tr(e){var t=e&&e.nodeName&&e.nodeName.toLowerCase();return t&&("input"===t&&("text"===e.type||"search"===e.type||"tel"===e.type||"url"===e.type||"password"===e.type)||"textarea"===t||"true"===e.contentEditable)}var nr=Ft&&"documentMode"in document&&11>=document.documentMode,rr=null,ar=null,or=null,ir=!1;function lr(e,t,n){var r=n.window===n?n.document:9===n.nodeType?n:n.ownerDocument;ir||null==rr||rr!==pt(r)||("selectionStart"in(r=rr)&&tr(r)?r={start:r.selectionStart,end:r.selectionEnd}:r={anchorNode:(r=(r.ownerDocument&&r.ownerDocument.defaultView||window).getSelection()).anchorNode,anchorOffset:r.anchorOffset,focusNode:r.focusNode,focusOffset:r.focusOffset},or&&Yn(or,r)||(or=r,0<(r=Hu(ar,"onSelect")).length&&(t=new Jt("onSelect","select",null,t,n),e.push({event:t,listeners:r}),t.target=rr)))}function sr(e,t){var n={};return n[e.toLowerCase()]=t.toLowerCase(),n["Webkit"+e]="webkit"+t,n["Moz"+e]="moz"+t,n}var cr={animationend:sr("Animation","AnimationEnd"),animationiteration:sr("Animation","AnimationIteration"),animationstart:sr("Animation","AnimationStart"),transitionrun:sr("Transition","TransitionRun"),transitionstart:sr("Transition","TransitionStart"),transitioncancel:sr("Transition","TransitionCancel"),transitionend:sr("Transition","TransitionEnd")},ur={},dr={};function fr(e){if(ur[e])return ur[e];if(!cr[e])return e;var t,n=cr[e];for(t in n)if(n.hasOwnProperty(t)&&t in dr)return ur[e]=n[t];return e}Ft&&(dr=document.createElement("div").style,"AnimationEvent"in window||(delete cr.animationend.animation,delete cr.animationiteration.animation,delete cr.animationstart.animation),"TransitionEvent"in window||delete cr.transitionend.transition);var 
pr=fr("animationend"),hr=fr("animationiteration"),mr=fr("animationstart"),gr=fr("transitionrun"),br=fr("transitionstart"),yr=fr("transitioncancel"),vr=fr("transitionend"),wr=new Map,kr="abort auxClick beforeToggle cancel canPlay canPlayThrough click close contextMenu copy cut drag dragEnd dragEnter dragExit dragLeave dragOver dragStart drop durationChange emptied encrypted ended error gotPointerCapture input invalid keyDown keyPress keyUp load loadedData loadedMetadata loadStart lostPointerCapture mouseDown mouseMove mouseOut mouseOver mouseUp paste pause play playing pointerCancel pointerDown pointerMove pointerOut pointerOver pointerUp progress rateChange reset resize seeked seeking stalled submit suspend timeUpdate touchCancel touchEnd touchStart volumeChange scroll toggle touchMove waiting wheel".split(" ");function Sr(e,t){wr.set(e,t),Qe(t,[e])}kr.push("scrollEnd");var xr=new WeakMap;function _r(e,t){if("object"==typeof e&&null!==e){var n=xr.get(e);return void 0!==n?n:(t={value:e,source:t,stack:st(t)},xr.set(e,t),t)}return{value:e,source:t,stack:st(t)}}var Er=[],Cr=0,Ar=0;function Lr(){for(var e=Cr,t=Ar=Cr=0;t>=i,a-=i,Xr=1<<32-pe(t)+a|n<o?o:8;var i,l,s,c=O.T,u={};O.T=u,$i(e,!1,t,n);try{var d=a(),f=O.S;if(null!==f&&f(u,d),null!==d&&"object"==typeof d&&"function"==typeof d.then)zi(e,t,(i=r,l=[],s={status:"pending",value:null,reason:null,then:function(e){l.push(e)}},d.then(function(){s.status="fulfilled",s.value=i;for(var e=0;eh?(m=d,d=null):m=d.sibling;var g=p(a,d,l[h],s);if(null===g){null===d&&(d=m);break}e&&d&&null===g.alternate&&t(a,d),i=o(g,i,h),null===u?c=g:u.sibling=g,u=g,d=m}if(h===l.length)return n(a,d),oa&&Jr(a,h),c;if(null===d){for(;hm?(g=h,h=null):g=h.sibling;var v=p(a,h,y.value,c);if(null===v){null===h&&(h=g);break}e&&h&&null===v.alternate&&t(a,h),l=o(v,l,m),null===d?u=v:d.sibling=v,d=v,h=g}if(y.done)return n(a,h),oa&&Jr(a,m),u;if(null===h){for(;!y.done;m++,y=s.next())null!==(y=f(a,y.value,c))&&(l=o(y,l,m),null===d?u=y:d.sibling=y,d=y);return 
oa&&Jr(a,m),u}for(h=r(h);!y.done;m++,y=s.next())null!==(y=b(h,a,m,y.value,c))&&(e&&null!==y.alternate&&h.delete(null===y.key?m:y.key),l=o(y,l,m),null===d?u=y:d.sibling=y,d=y);return e&&h.forEach(function(e){return t(a,e)}),oa&&Jr(a,m),u}(s,c,u=v.call(u),d)}if("function"==typeof u.then)return y(s,c,Xi(u),d);if(u.$$typeof===k)return y(s,c,Aa(s,u),d);Ji(s,u)}return"string"==typeof u&&""!==u||"number"==typeof u||"bigint"==typeof u?(u=""+u,null!==c&&6===c.tag?(n(s,c.sibling),(d=a(c,u)).return=s,s=d):(n(s,c),(d=Ur(u,s.mode,d)).return=s,s=d),l(s)):n(s,c)}return function(e,t,n,r){try{Yi=0;var a=y(e,t,n,r);return Ki=null,a}catch(i){if(i===Ga||i===Wa)throw i;var o=Dr(29,i,null,e.mode);return o.lanes=r,o.return=e,o}}}var nl=tl(!0),rl=tl(!1),al=I(null),ol=null;function il(e){var t=e.alternate;$(ul,1&ul.current),$(al,e),null===ol&&(null===t||null!==ho.current||null!==t.memoizedState)&&(ol=e)}function ll(e){if(22===e.tag){if($(ul,ul.current),$(al,e),null===ol){var t=e.alternate;null!==t&&null!==t.memoizedState&&(ol=e)}}else sl()}function sl(){$(ul,ul.current),$(al,al.current)}function cl(e){z(al),ol===e&&(ol=null),z(ul)}var ul=I(0);function dl(e){for(var t=e;null!==t;){if(13===t.tag){var n=t.memoizedState;if(null!==n&&(null===(n=n.dehydrated)||"$?"===n.data||gd(n)))return t}else if(19===t.tag&&void 0!==t.memoizedProps.revealOrder){if(128&t.flags)return t}else if(null!==t.child){t.child.return=t,t=t.child;continue}if(t===e)break;for(;null===t.sibling;){if(null===t.return||t.return===e)return null;t=t.return}t.sibling.return=t.return,t=t.sibling}return null}function fl(e,t,n,r){n=null==(n=n(r,t=e.memoizedState))?t:f({},t,n),e.memoizedState=n,0===e.lanes&&(e.updateQueue.baseState=n)}var pl={enqueueSetState:function(e,t,n){e=e._reactInternals;var r=Oc(),a=ao(r);a.payload=t,null!=n&&(a.callback=n),null!==(t=oo(e,a,r))&&(Dc(t,e,r),io(t,e,r))},enqueueReplaceState:function(e,t,n){e=e._reactInternals;var 
r=Oc(),a=ao(r);a.tag=1,a.payload=t,null!=n&&(a.callback=n),null!==(t=oo(e,a,r))&&(Dc(t,e,r),io(t,e,r))},enqueueForceUpdate:function(e,t){e=e._reactInternals;var n=Oc(),r=ao(n);r.tag=2,null!=t&&(r.callback=t),null!==(t=oo(e,r,n))&&(Dc(t,e,n),io(t,e,n))}};function hl(e,t,n,r,a,o,i){return"function"==typeof(e=e.stateNode).shouldComponentUpdate?e.shouldComponentUpdate(r,o,i):!t.prototype||!t.prototype.isPureReactComponent||(!Yn(n,r)||!Yn(a,o))}function ml(e,t,n,r){e=t.state,"function"==typeof t.componentWillReceiveProps&&t.componentWillReceiveProps(n,r),"function"==typeof t.UNSAFE_componentWillReceiveProps&&t.UNSAFE_componentWillReceiveProps(n,r),t.state!==e&&pl.enqueueReplaceState(t,t.state,null)}function gl(e,t){var n=t;if("ref"in t)for(var r in n={},t)"ref"!==r&&(n[r]=t[r]);if(e=e.defaultProps)for(var a in n===t&&(n=f({},n)),e)void 0===n[a]&&(n[a]=e[a]);return n}var bl="function"==typeof reportError?reportError:function(e){if("object"==typeof window&&"function"==typeof window.ErrorEvent){var t=new window.ErrorEvent("error",{bubbles:!0,cancelable:!0,message:"object"==typeof e&&null!==e&&"string"==typeof e.message?String(e.message):String(e),error:e});if(!window.dispatchEvent(t))return}else if("object"==typeof process&&"function"==typeof process.emit)return void process.emit("uncaughtException",e);console.error(e)};function yl(e){bl(e)}function vl(e){console.error(e)}function wl(e){bl(e)}function kl(e,t){try{(0,e.onUncaughtError)(t.value,{componentStack:t.stack})}catch(n){setTimeout(function(){throw n})}}function Sl(e,t,n){try{(0,e.onCaughtError)(n.value,{componentStack:n.stack,errorBoundary:1===t.tag?t.stateNode:null})}catch(r){setTimeout(function(){throw r})}}function xl(e,t,n){return(n=ao(n)).tag=3,n.payload={element:null},n.callback=function(){kl(e,t)},n}function _l(e){return(e=ao(e)).tag=3,e}function El(e,t,n,r){var a=n.type.getDerivedStateFromError;if("function"==typeof a){var o=r.value;e.payload=function(){return a(o)},e.callback=function(){Sl(t,n,r)}}var 
i=n.stateNode;null!==i&&"function"==typeof i.componentDidCatch&&(e.callback=function(){Sl(t,n,r),"function"!=typeof a&&(null===_c?_c=new Set([this]):_c.add(this));var e=r.stack;this.componentDidCatch(r.value,{componentStack:null!==e?e:""})})}var Cl=Error(i(461)),Al=!1;function Ll(e,t,n,r){t.child=null===e?rl(t,null,n,r):nl(t,e.child,n,r)}function Tl(e,t,n,r,a){n=n.render;var o=t.ref;if("ref"in r){var i={};for(var l in r)"ref"!==l&&(i[l]=r[l])}else i=r;return Ea(t),r=Mo(e,t,n,i,o,a),l=Do(),null===e||Al?(oa&&l&&ta(t),t.flags|=1,Ll(e,t,r,a),t.child):(Bo(e,t,a),Kl(e,t,a))}function jl(e,t,n,r,a){if(null===e){var o=n.type;return"function"!=typeof o||Br(o)||void 0!==o.defaultProps||null!==n.compare?((e=zr(n.type,null,r,t,t.mode,a)).ref=t.ref,e.return=t,t.child=e):(t.tag=15,t.type=o,Pl(e,t,o,r,a))}if(o=e.child,!Yl(e,a)){var i=o.memoizedProps;if((n=null!==(n=n.compare)?n:Yn)(i,r)&&e.ref===t.ref)return Kl(e,t,a)}return t.flags|=1,(e=Fr(o,r)).ref=t.ref,e.return=t,t.child=e}function Pl(e,t,n,r,a){if(null!==e){var o=e.memoizedProps;if(Yn(o,r)&&e.ref===t.ref){if(Al=!1,t.pendingProps=r=o,!Yl(e,a))return t.lanes=e.lanes,Kl(e,t,a);131072&e.flags&&(Al=!0)}}return Rl(e,t,n,r,a)}function Ml(e,t,n){var r=t.pendingProps,a=r.children,o=null!==e?e.memoizedState:null;if("hidden"===r.mode){if(128&t.flags){if(r=null!==o?o.baseLanes|n:n,null!==e){for(a=t.child=e.child,o=0;null!==a;)o=o|a.lanes|a.childLanes,a=a.sibling;t.childLanes=o&~r}else t.childLanes=0,t.child=null;return Nl(e,t,r,n)}if(!(536870912&n))return t.lanes=t.childLanes=536870912,Nl(e,t,null!==o?o.baseLanes|n:n,n);t.memoizedState={baseLanes:0,cachePool:null},null!==e&&qa(0,null!==o?o.cachePool:null),null!==o?go(t,o):bo(),ll(t)}else null!==o?(qa(0,o.cachePool),go(t,o),sl(),t.memoizedState=null):(null!==e&&qa(0,null),bo(),sl());return Ll(e,t,a,n),t.child}function Nl(e,t,n,r){var a=Ua();return 
a=null===a?null:{parent:Ma._currentValue,pool:a},t.memoizedState={baseLanes:n,cachePool:a},null!==e&&qa(0,null),bo(),ll(t),null!==e&&xa(e,t,r,!0),null}function Ol(e,t){var n=t.ref;if(null===n)null!==e&&null!==e.ref&&(t.flags|=4194816);else{if("function"!=typeof n&&"object"!=typeof n)throw Error(i(284));null!==e&&e.ref===n||(t.flags|=4194816)}}function Rl(e,t,n,r,a){return Ea(t),n=Mo(e,t,n,r,void 0,a),r=Do(),null===e||Al?(oa&&r&&ta(t),t.flags|=1,Ll(e,t,n,a),t.child):(Bo(e,t,a),Kl(e,t,a))}function Dl(e,t,n,r,a,o){return Ea(t),t.updateQueue=null,n=Oo(t,r,n,a),No(e),r=Do(),null===e||Al?(oa&&r&&ta(t),t.flags|=1,Ll(e,t,n,o),t.child):(Bo(e,t,o),Kl(e,t,o))}function Bl(e,t,n,r,a){if(Ea(t),null===t.stateNode){var o=Or,i=n.contextType;"object"==typeof i&&null!==i&&(o=Ca(i)),o=new n(r,o),t.memoizedState=null!==o.state&&void 0!==o.state?o.state:null,o.updater=pl,t.stateNode=o,o._reactInternals=t,(o=t.stateNode).props=r,o.state=t.memoizedState,o.refs={},no(t),i=n.contextType,o.context="object"==typeof i&&null!==i?Ca(i):Or,o.state=t.memoizedState,"function"==typeof(i=n.getDerivedStateFromProps)&&(fl(t,n,i,r),o.state=t.memoizedState),"function"==typeof n.getDerivedStateFromProps||"function"==typeof o.getSnapshotBeforeUpdate||"function"!=typeof o.UNSAFE_componentWillMount&&"function"!=typeof o.componentWillMount||(i=o.state,"function"==typeof o.componentWillMount&&o.componentWillMount(),"function"==typeof o.UNSAFE_componentWillMount&&o.UNSAFE_componentWillMount(),i!==o.state&&pl.enqueueReplaceState(o,o.state,null),uo(t,r,o,a),co(),o.state=t.memoizedState),"function"==typeof o.componentDidMount&&(t.flags|=4194308),r=!0}else if(null===e){o=t.stateNode;var l=t.memoizedProps,s=gl(n,l);o.props=s;var c=o.context,u=n.contextType;i=Or,"object"==typeof u&&null!==u&&(i=Ca(u));var d=n.getDerivedStateFromProps;u="function"==typeof d||"function"==typeof o.getSnapshotBeforeUpdate,l=t.pendingProps!==l,u||"function"!=typeof o.UNSAFE_componentWillReceiveProps&&"function"!=typeof 
o.componentWillReceiveProps||(l||c!==i)&&ml(t,o,r,i),to=!1;var f=t.memoizedState;o.state=f,uo(t,r,o,a),co(),c=t.memoizedState,l||f!==c||to?("function"==typeof d&&(fl(t,n,d,r),c=t.memoizedState),(s=to||hl(t,n,s,r,f,c,i))?(u||"function"!=typeof o.UNSAFE_componentWillMount&&"function"!=typeof o.componentWillMount||("function"==typeof o.componentWillMount&&o.componentWillMount(),"function"==typeof o.UNSAFE_componentWillMount&&o.UNSAFE_componentWillMount()),"function"==typeof o.componentDidMount&&(t.flags|=4194308)):("function"==typeof o.componentDidMount&&(t.flags|=4194308),t.memoizedProps=r,t.memoizedState=c),o.props=r,o.state=c,o.context=i,r=s):("function"==typeof o.componentDidMount&&(t.flags|=4194308),r=!1)}else{o=t.stateNode,ro(e,t),u=gl(n,i=t.memoizedProps),o.props=u,d=t.pendingProps,f=o.context,c=n.contextType,s=Or,"object"==typeof c&&null!==c&&(s=Ca(c)),(c="function"==typeof(l=n.getDerivedStateFromProps)||"function"==typeof o.getSnapshotBeforeUpdate)||"function"!=typeof o.UNSAFE_componentWillReceiveProps&&"function"!=typeof o.componentWillReceiveProps||(i!==d||f!==s)&&ml(t,o,r,s),to=!1,f=t.memoizedState,o.state=f,uo(t,r,o,a),co();var p=t.memoizedState;i!==d||f!==p||to||null!==e&&null!==e.dependencies&&_a(e.dependencies)?("function"==typeof l&&(fl(t,n,l,r),p=t.memoizedState),(u=to||hl(t,n,u,r,f,p,s)||null!==e&&null!==e.dependencies&&_a(e.dependencies))?(c||"function"!=typeof o.UNSAFE_componentWillUpdate&&"function"!=typeof o.componentWillUpdate||("function"==typeof o.componentWillUpdate&&o.componentWillUpdate(r,p,s),"function"==typeof o.UNSAFE_componentWillUpdate&&o.UNSAFE_componentWillUpdate(r,p,s)),"function"==typeof o.componentDidUpdate&&(t.flags|=4),"function"==typeof o.getSnapshotBeforeUpdate&&(t.flags|=1024)):("function"!=typeof o.componentDidUpdate||i===e.memoizedProps&&f===e.memoizedState||(t.flags|=4),"function"!=typeof 
o.getSnapshotBeforeUpdate||i===e.memoizedProps&&f===e.memoizedState||(t.flags|=1024),t.memoizedProps=r,t.memoizedState=p),o.props=r,o.state=p,o.context=s,r=u):("function"!=typeof o.componentDidUpdate||i===e.memoizedProps&&f===e.memoizedState||(t.flags|=4),"function"!=typeof o.getSnapshotBeforeUpdate||i===e.memoizedProps&&f===e.memoizedState||(t.flags|=1024),r=!1)}return o=r,Ol(e,t),r=!!(128&t.flags),o||r?(o=t.stateNode,n=r&&"function"!=typeof n.getDerivedStateFromError?null:o.render(),t.flags|=1,null!==e&&r?(t.child=nl(t,e.child,null,a),t.child=nl(t,null,n,a)):Ll(e,t,n,a),t.memoizedState=o.state,e=t.child):e=Kl(e,t,a),e}function Fl(e,t,n,r){return pa(),t.flags|=256,Ll(e,t,n,r),t.child}var Il={dehydrated:null,treeContext:null,retryLane:0,hydrationErrors:null};function zl(e){return{baseLanes:e,cachePool:Ha()}}function $l(e,t,n){return e=null!==e?e.childLanes&~n:0,t&&(e|=gc),e}function Ul(e,t,n){var r,a=t.pendingProps,o=!1,l=!!(128&t.flags);if((r=l)||(r=(null===e||null!==e.memoizedState)&&!!(2&ul.current)),r&&(o=!0,t.flags&=-129),r=!!(32&t.flags),t.flags&=-33,null===e){if(oa){if(o?il(t):sl(),oa){var s,c=aa;if(s=c){e:{for(s=c,c=la;8!==s.nodeType;){if(!c){c=null;break e}if(null===(s=bd(s.nextSibling))){c=null;break e}}c=s}null!==c?(t.memoizedState={dehydrated:c,treeContext:null!==Yr?{id:Xr,overflow:Zr}:null,retryLane:536870912,hydrationErrors:null},(s=Dr(18,null,null,0)).stateNode=c,s.return=t,t.child=s,ra=t,aa=null,s=!0):s=!1}s||ca(t)}if(null!==(c=t.memoizedState)&&null!==(c=c.dehydrated))return gd(c)?t.lanes=32:t.lanes=536870912,null;cl(t)}return 
c=a.children,a=a.fallback,o?(sl(),c=Hl({mode:"hidden",children:c},o=t.mode),a=$r(a,o,n,null),c.return=t,a.return=t,c.sibling=a,t.child=c,(o=t.child).memoizedState=zl(n),o.childLanes=$l(e,r,n),t.memoizedState=Il,a):(il(t),ql(t,c))}if(null!==(s=e.memoizedState)&&null!==(c=s.dehydrated)){if(l)256&t.flags?(il(t),t.flags&=-257,t=Gl(e,t,n)):null!==t.memoizedState?(sl(),t.child=e.child,t.flags|=128,t=null):(sl(),o=a.fallback,c=t.mode,a=Hl({mode:"visible",children:a.children},c),(o=$r(o,c,n,null)).flags|=2,a.return=t,o.return=t,a.sibling=o,t.child=a,nl(t,e.child,null,n),(a=t.child).memoizedState=zl(n),a.childLanes=$l(e,r,n),t.memoizedState=Il,t=o);else if(il(t),gd(c)){if(r=c.nextSibling&&c.nextSibling.dataset)var u=r.dgst;r=u,(a=Error(i(419))).stack="",a.digest=r,ma({value:a,source:null,stack:null}),t=Gl(e,t,n)}else if(Al||xa(e,t,n,!1),r=0!==(n&e.childLanes),Al||r){if(null!==(r=rc)&&(0!==(a=0!==((a=42&(a=n&-n)?1:Le(a))&(r.suspendedLanes|n))?0:a)&&a!==s.retryLane))throw s.retryLane=a,Pr(e,a),Dc(r,e,a),Cl;"$?"===c.data||Wc(),t=Gl(e,t,n)}else"$?"===c.data?(t.flags|=192,t.child=e.child,t=null):(e=s.treeContext,aa=bd(c.nextSibling),ra=t,oa=!0,ia=null,la=!1,null!==e&&(Qr[Kr++]=Xr,Qr[Kr++]=Zr,Qr[Kr++]=Yr,Xr=e.id,Zr=e.overflow,Yr=t),(t=ql(t,a.children)).flags|=4096);return t}return o?(sl(),o=a.fallback,c=t.mode,u=(s=e.child).sibling,(a=Fr(s,{mode:"hidden",children:a.children})).subtreeFlags=65011712&s.subtreeFlags,null!==u?o=Fr(u,o):(o=$r(o,c,n,null)).flags|=2,o.return=t,a.return=t,a.sibling=o,t.child=a,a=o,o=t.child,null===(c=e.child.memoizedState)?c=zl(n):(null!==(s=c.cachePool)?(u=Ma._currentValue,s=s.parent!==u?{parent:u,pool:u}:s):s=Ha(),c={baseLanes:c.baseLanes|n,cachePool:s}),o.memoizedState=c,o.childLanes=$l(e,r,n),t.memoizedState=Il,a):(il(t),e=(n=e.child).sibling,(n=Fr(n,{mode:"visible",children:a.children})).return=t,n.sibling=null,null!==e&&(null===(r=t.deletions)?(t.deletions=[e],t.flags|=16):r.push(e)),t.child=n,t.memoizedState=null,n)}function 
ql(e,t){return(t=Hl({mode:"visible",children:t},e.mode)).return=e,e.child=t}function Hl(e,t){return(e=Dr(22,e,null,t)).lanes=0,e.stateNode={_visibility:1,_pendingMarkers:null,_retryCache:null,_transitions:null},e}function Gl(e,t,n){return nl(t,e.child,null,n),(e=ql(t,t.pendingProps.children)).flags|=2,t.memoizedState=null,e}function Vl(e,t,n){e.lanes|=t;var r=e.alternate;null!==r&&(r.lanes|=t),ka(e.return,t,n)}function Wl(e,t,n,r,a){var o=e.memoizedState;null===o?e.memoizedState={isBackwards:t,rendering:null,renderingStartTime:0,last:r,tail:n,tailMode:a}:(o.isBackwards=t,o.rendering=null,o.renderingStartTime=0,o.last=r,o.tail=n,o.tailMode=a)}function Ql(e,t,n){var r=t.pendingProps,a=r.revealOrder,o=r.tail;if(Ll(e,t,r.children,n),2&(r=ul.current))r=1&r|2,t.flags|=128;else{if(null!==e&&128&e.flags)e:for(e=t.child;null!==e;){if(13===e.tag)null!==e.memoizedState&&Vl(e,n,t);else if(19===e.tag)Vl(e,n,t);else if(null!==e.child){e.child.return=e,e=e.child;continue}if(e===t)break e;for(;null===e.sibling;){if(null===e.return||e.return===t)break e;e=e.return}e.sibling.return=e.return,e=e.sibling}r&=1}switch($(ul,r),a){case"forwards":for(n=t.child,a=null;null!==n;)null!==(e=n.alternate)&&null===dl(e)&&(a=n),n=n.sibling;null===(n=a)?(a=t.child,t.child=null):(a=n.sibling,n.sibling=null),Wl(t,!1,a,n,o);break;case"backwards":for(n=null,a=t.child,t.child=null;null!==a;){if(null!==(e=a.alternate)&&null===dl(e)){t.child=a;break}e=a.sibling,a.sibling=n,n=a,a=e}Wl(t,!0,n,null,o);break;case"together":Wl(t,!1,null,null,void 0);break;default:t.memoizedState=null}return t.child}function Kl(e,t,n){if(null!==e&&(t.dependencies=e.dependencies),pc|=t.lanes,0===(n&t.childLanes)){if(null===e)return null;if(xa(e,t,n,!1),0===(n&t.childLanes))return null}if(null!==e&&t.child!==e.child)throw Error(i(153));if(null!==t.child){for(n=Fr(e=t.child,e.pendingProps),t.child=n,n.return=t;null!==e.sibling;)e=e.sibling,(n=n.sibling=Fr(e,e.pendingProps)).return=t;n.sibling=null}return t.child}function 
Yl(e,t){return 0!==(e.lanes&t)||!(null===(e=e.dependencies)||!_a(e))}function Xl(e,t,n){if(null!==e)if(e.memoizedProps!==t.pendingProps)Al=!0;else{if(!(Yl(e,n)||128&t.flags))return Al=!1,function(e,t,n){switch(t.tag){case 3:V(t,t.stateNode.containerInfo),va(0,Ma,e.memoizedState.cache),pa();break;case 27:case 5:Q(t);break;case 4:V(t,t.stateNode.containerInfo);break;case 10:va(0,t.type,t.memoizedProps.value);break;case 13:var r=t.memoizedState;if(null!==r)return null!==r.dehydrated?(il(t),t.flags|=128,null):0!==(n&t.child.childLanes)?Ul(e,t,n):(il(t),null!==(e=Kl(e,t,n))?e.sibling:null);il(t);break;case 19:var a=!!(128&e.flags);if((r=0!==(n&t.childLanes))||(xa(e,t,n,!1),r=0!==(n&t.childLanes)),a){if(r)return Ql(e,t,n);t.flags|=128}if(null!==(a=t.memoizedState)&&(a.rendering=null,a.tail=null,a.lastEffect=null),$(ul,ul.current),r)break;return null;case 22:case 23:return t.lanes=0,Ml(e,t,n);case 24:va(0,Ma,e.memoizedState.cache)}return Kl(e,t,n)}(e,t,n);Al=!!(131072&e.flags)}else Al=!1,oa&&1048576&t.flags&&ea(t,Wr,t.index);switch(t.lanes=0,t.tag){case 16:e:{e=t.pendingProps;var r=t.elementType,a=r._init;if(r=a(r._payload),t.type=r,"function"!=typeof r){if(null!=r){if((a=r.$$typeof)===S){t.tag=11,t=Tl(null,t,r,e,n);break e}if(a===E){t.tag=14,t=jl(null,t,r,e,n);break e}}throw t=M(r)||r,Error(i(306,t,""))}Br(r)?(e=gl(r,e),t.tag=1,t=Bl(null,t,r,e,n)):(t.tag=0,t=Rl(null,t,r,e,n))}return t;case 0:return Rl(e,t,t.type,t.pendingProps,n);case 1:return Bl(e,t,r=t.type,a=gl(r,t.pendingProps),n);case 3:e:{if(V(t,t.stateNode.containerInfo),null===e)throw Error(i(387));r=t.pendingProps;var o=t.memoizedState;a=o.element,ro(e,t),uo(t,r,null,n);var l=t.memoizedState;if(r=l.cache,va(0,Ma,r),r!==o.cache&&Sa(t,[Ma],n,!0),co(),r=l.element,o.isDehydrated){if(o={element:r,isDehydrated:!1,cache:l.cache},t.updateQueue.baseState=o,t.memoizedState=o,256&t.flags){t=Fl(e,t,r,n);break e}if(r!==a){ma(a=_r(Error(i(424)),t)),t=Fl(e,t,r,n);break 
e}if(9===(e=t.stateNode.containerInfo).nodeType)e=e.body;else e="HTML"===e.nodeName?e.ownerDocument.body:e;for(aa=bd(e.firstChild),ra=t,oa=!0,ia=null,la=!0,n=rl(t,null,r,n),t.child=n;n;)n.flags=-3&n.flags|4096,n=n.sibling}else{if(pa(),r===a){t=Kl(e,t,n);break e}Ll(e,t,r,n)}t=t.child}return t;case 26:return Ol(e,t),null===e?(n=Ld(t.type,null,t.pendingProps,null))?t.memoizedState=n:oa||(n=t.type,e=t.pendingProps,(r=rd(H.current).createElement(n))[Me]=t,r[Ne]=e,ed(r,n,e),Ge(r),t.stateNode=r):t.memoizedState=Ld(t.type,e.memoizedProps,t.pendingProps,e.memoizedState),null;case 27:return Q(t),null===e&&oa&&(r=t.stateNode=wd(t.type,t.pendingProps,H.current),ra=t,la=!0,a=aa,pd(t.type)?(yd=a,aa=bd(r.firstChild)):aa=a),Ll(e,t,t.pendingProps.children,n),Ol(e,t),null===e&&(t.flags|=4194304),t.child;case 5:return null===e&&oa&&((a=r=aa)&&(null!==(r=function(e,t,n,r){for(;1===e.nodeType;){var a=n;if(e.nodeName.toLowerCase()!==t.toLowerCase()){if(!r&&("INPUT"!==e.nodeName||"hidden"!==e.type))break}else if(r){if(!e[Ie])switch(t){case"meta":if(!e.hasAttribute("itemprop"))break;return e;case"link":if("stylesheet"===(o=e.getAttribute("rel"))&&e.hasAttribute("data-precedence"))break;if(o!==a.rel||e.getAttribute("href")!==(null==a.href||""===a.href?null:a.href)||e.getAttribute("crossorigin")!==(null==a.crossOrigin?null:a.crossOrigin)||e.getAttribute("title")!==(null==a.title?null:a.title))break;return e;case"style":if(e.hasAttribute("data-precedence"))break;return e;case"script":if(((o=e.getAttribute("src"))!==(null==a.src?null:a.src)||e.getAttribute("type")!==(null==a.type?null:a.type)||e.getAttribute("crossorigin")!==(null==a.crossOrigin?null:a.crossOrigin))&&o&&e.hasAttribute("async")&&!e.hasAttribute("itemprop"))break;return e;default:return e}}else{if("input"!==t||"hidden"!==e.type)return e;var o=null==a.name?null:""+a.name;if("hidden"===a.type&&e.getAttribute("name")===o)return e}if(null===(e=bd(e.nextSibling)))break}return 
null}(r,t.type,t.pendingProps,la))?(t.stateNode=r,ra=t,aa=bd(r.firstChild),la=!1,a=!0):a=!1),a||ca(t)),Q(t),a=t.type,o=t.pendingProps,l=null!==e?e.memoizedProps:null,r=o.children,id(a,o)?r=null:null!==l&&id(a,l)&&(t.flags|=32),null!==t.memoizedState&&(a=Mo(e,t,Ro,null,null,n),Qd._currentValue=a),Ol(e,t),Ll(e,t,r,n),t.child;case 6:return null===e&&oa&&((e=n=aa)&&(null!==(n=function(e,t,n){if(""===t)return null;for(;3!==e.nodeType;){if((1!==e.nodeType||"INPUT"!==e.nodeName||"hidden"!==e.type)&&!n)return null;if(null===(e=bd(e.nextSibling)))return null}return e}(n,t.pendingProps,la))?(t.stateNode=n,ra=t,aa=null,e=!0):e=!1),e||ca(t)),null;case 13:return Ul(e,t,n);case 4:return V(t,t.stateNode.containerInfo),r=t.pendingProps,null===e?t.child=nl(t,null,r,n):Ll(e,t,r,n),t.child;case 11:return Tl(e,t,t.type,t.pendingProps,n);case 7:return Ll(e,t,t.pendingProps,n),t.child;case 8:case 12:return Ll(e,t,t.pendingProps.children,n),t.child;case 10:return r=t.pendingProps,va(0,t.type,r.value),Ll(e,t,r.children,n),t.child;case 9:return a=t.type._context,r=t.pendingProps.children,Ea(t),r=r(a=Ca(a)),t.flags|=1,Ll(e,t,r,n),t.child;case 14:return jl(e,t,t.type,t.pendingProps,n);case 15:return Pl(e,t,t.type,t.pendingProps,n);case 19:return Ql(e,t,n);case 31:return r=t.pendingProps,n=t.mode,r={mode:r.mode,children:r.children},null===e?((n=Hl(r,n)).ref=t.ref,t.child=n,n.return=t,t=n):((n=Fr(e.child,r)).ref=t.ref,t.child=n,n.return=t,t=n),t;case 22:return Ml(e,t,n);case 24:return Ea(t),r=Ca(Ma),null===e?(null===(a=Ua())&&(a=rc,o=Na(),a.pooledCache=o,o.refCount++,null!==o&&(a.pooledCacheLanes|=n),a=o),t.memoizedState={parent:r,cache:a},no(t),va(0,Ma,a)):(0!==(e.lanes&n)&&(ro(e,t),uo(t,null,null,n),co()),a=e.memoizedState,o=t.memoizedState,a.parent!==r?(a={parent:r,cache:r},t.memoizedState=a,0===t.lanes&&(t.memoizedState=t.updateQueue.baseState=a),va(0,Ma,r)):(r=o.cache,va(0,Ma,r),r!==a.cache&&Sa(t,[Ma],n,!0))),Ll(e,t,t.pendingProps.children,n),t.child;case 29:throw t.pendingProps}throw 
Error(i(156,t.tag))}function Zl(e){e.flags|=4}function Jl(e,t){if("stylesheet"!==t.type||4&t.state.loading)e.flags&=-16777217;else if(e.flags|=16777216,!$d(t)){if(null!==(t=al.current)&&((4194048&oc)===oc?null!==ol:(62914560&oc)!==oc&&!(536870912&oc)||t!==ol))throw Za=Qa,Va;e.flags|=8192}}function es(e,t){null!==t&&(e.flags|=4),16384&e.flags&&(t=22!==e.tag?xe():536870912,e.lanes|=t,bc|=t)}function ts(e,t){if(!oa)switch(e.tailMode){case"hidden":t=e.tail;for(var n=null;null!==t;)null!==t.alternate&&(n=t),t=t.sibling;null===n?e.tail=null:n.sibling=null;break;case"collapsed":n=e.tail;for(var r=null;null!==n;)null!==n.alternate&&(r=n),n=n.sibling;null===r?t||null===e.tail?e.tail=null:e.tail.sibling=null:r.sibling=null}}function ns(e){var t=null!==e.alternate&&e.alternate.child===e.child,n=0,r=0;if(t)for(var a=e.child;null!==a;)n|=a.lanes|a.childLanes,r|=65011712&a.subtreeFlags,r|=65011712&a.flags,a.return=e,a=a.sibling;else for(a=e.child;null!==a;)n|=a.lanes|a.childLanes,r|=a.subtreeFlags,r|=a.flags,a.return=e,a=a.sibling;return e.subtreeFlags|=r,e.childLanes=n,t}function rs(e,t,n){var r=t.pendingProps;switch(na(t),t.tag){case 31:case 16:case 15:case 0:case 11:case 7:case 8:case 12:case 9:case 14:case 1:return ns(t),null;case 3:return n=t.stateNode,r=null,null!==e&&(r=e.memoizedState.cache),t.memoizedState.cache!==r&&(t.flags|=2048),wa(Ma),W(),n.pendingContext&&(n.context=n.pendingContext,n.pendingContext=null),null!==e&&null!==e.child||(fa(t)?Zl(t):null===e||e.memoizedState.isDehydrated&&!(256&t.flags)||(t.flags|=1024,ha())),ns(t),null;case 26:return n=t.memoizedState,null===e?(Zl(t),null!==n?(ns(t),Jl(t,n)):(ns(t),t.flags&=-16777217)):n?n!==e.memoizedState?(Zl(t),ns(t),Jl(t,n)):(ns(t),t.flags&=-16777217):(e.memoizedProps!==r&&Zl(t),ns(t),t.flags&=-16777217),null;case 27:K(t),n=H.current;var a=t.type;if(null!==e&&null!=t.stateNode)e.memoizedProps!==r&&Zl(t);else{if(!r){if(null===t.stateNode)throw Error(i(166));return 
ns(t),null}e=U.current,fa(t)?ua(t):(e=wd(a,r,n),t.stateNode=e,Zl(t))}return ns(t),null;case 5:if(K(t),n=t.type,null!==e&&null!=t.stateNode)e.memoizedProps!==r&&Zl(t);else{if(!r){if(null===t.stateNode)throw Error(i(166));return ns(t),null}if(e=U.current,fa(t))ua(t);else{switch(a=rd(H.current),e){case 1:e=a.createElementNS("http://www.w3.org/2000/svg",n);break;case 2:e=a.createElementNS("http://www.w3.org/1998/Math/MathML",n);break;default:switch(n){case"svg":e=a.createElementNS("http://www.w3.org/2000/svg",n);break;case"math":e=a.createElementNS("http://www.w3.org/1998/Math/MathML",n);break;case"script":(e=a.createElement("div")).innerHTML=" - + + + - + \ No newline at end of file diff --git a/docs/blog/atom.xml b/docs/blog/atom.xml index 321c11b1..d11a6164 100644 --- a/docs/blog/atom.xml +++ b/docs/blog/atom.xml @@ -91,28 +91,28 @@ <![CDATA[Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving]]> - https://meesho.github.io/BharatMLStack/blog/post-three - + https://meesho.github.io/BharatMLStack/blog/post-four + 2025-03-29T00:00:00.000Z BharatMLStack

Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale.

The platform implements a complete LLMOps lifecycle — from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required.

In addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques — such as quantization strategies, batching configurations, and runtime-specific performance enhancements — enabling teams to balance latency, throughput, and cost based on their use case. The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference.
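To make the kind of per-deployment configuration described above concrete, here is a hedged sketch. The type and field names (`LLMDeploymentSpec`, `quantization`, `max_batch_size`, and so on) are purely illustrative assumptions, not the platform's actual API.

```python
from dataclasses import dataclass

# Hypothetical deployment spec -- field names are illustrative only and do
# not reflect the platform's real API surface.
@dataclass
class LLMDeploymentSpec:
    model_id: str                  # e.g. a Hugging Face repo or an uploaded fine-tune
    quantization: str = "fp16"     # precision/quantization strategy
    max_batch_size: int = 32       # upper bound for dynamic batching
    max_input_tokens: int = 4096
    max_output_tokens: int = 512

    def validate(self) -> None:
        if self.quantization not in {"fp16", "fp8", "int8", "awq"}:
            raise ValueError(f"unsupported quantization: {self.quantization}")
        if self.max_batch_size < 1:
            raise ValueError("max_batch_size must be positive")

spec = LLMDeploymentSpec(model_id="my-org/my-finetuned-llm", quantization="int8")
spec.validate()
```

A spec like this is what lets users trade latency, throughput, and cost per use case without touching infrastructure.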

Why LLM Inference Is Not Just Bigger ML Model Serving

Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled.

Autoregressive Generation and Sequential Computation:

Unlike traditional models such as classifiers or recommenders — where inference cost is relatively constant — LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation. Because tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution.
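The sequential dependence can be made concrete with a toy decode loop; `next_token` below is a trivial stand-in for a real model's forward pass, not an actual inference call.

```python
# Toy greedy decode loop. next_token is a stand-in for a real model's forward
# pass; the point is that each step consumes the entire context so far, so
# decoding is inherently sequential and cannot be parallelized across tokens.
def next_token(context: list[str]) -> str:
    # stand-in "model": stop once the context reaches 6 tokens
    return "<eos>" if len(context) >= 6 else context[-1] + "'"

def generate(prompt: list[str], max_new_tokens: int = 8) -> list[str]:
    context = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(context)   # depends on ALL previously generated tokens
        if tok == "<eos>":
            break
        context.append(tok)         # context grows with every step
    return context[len(prompt):]

generated = generate(["a", "b"])
```

Because each iteration needs the previous one's output, total latency scales with output length, which is why prompt and output sizes dominate LLM cost models.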

Prefill and Decode Phases:

LLM inference typically consists of two distinct stages:

  • Prefill phase — the model processes the input prompt and builds internal representations. This stage is compute-heavy and highly parallelizable.
  • Decode phase — the model generates tokens sequentially, predicting one token at a time using previously generated context.

The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads.
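Counting forward passes makes the asymmetry between the two stages concrete; the numbers below are illustrative, not measurements.

```python
# The whole prompt is absorbed in one parallel (prefill) pass, but every
# generated token costs its own sequential (decode) pass.
def forward_passes(prompt_len: int, new_tokens: int) -> dict:
    passes = 1              # prefill: one batched pass over all prompt tokens
    passes += new_tokens    # decode: one pass per generated token
    return {"prompt_tokens": prompt_len,
            "decode_steps": new_tokens,
            "forward_passes": passes}

stats = forward_passes(prompt_len=1000, new_tokens=100)
```

A 1,000-token prompt costs a single compute-heavy pass, while 100 output tokens cost 100 memory-bound passes, which is why the two phases are optimized (and sometimes even scheduled on hardware) differently.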

Context Management and KV Caching:

Another fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens. KV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:

  • Efficient memory management becomes essential for scaling concurrent requests

This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads.
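A back-of-the-envelope sizing of the KV cache illustrates the memory pressure. The model shape below (32 layers, 32 heads, head dimension 128, fp16) is an assumption roughly matching a 7B-class model, not a statement about any specific deployment.

```python
# Per token, each transformer layer stores one key and one value vector per
# attention head; the factor of 2 accounts for keys plus values.
def kv_cache_bytes(n_layers: int, n_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed 7B-class shape, 4k context, batch of 8, fp16 (2 bytes/element):
gib = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128,
                     seq_len=4096, batch=8, bytes_per_elem=2) / 2**30
```

Under these assumptions the cache alone consumes 16 GiB of GPU memory, before weights or activations, which is why cache-aware memory management is central to scaling concurrent requests.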

Dynamic and Irregular Workloads:

Traditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. As a result:

  • Batch sizes must be dynamic rather than static
  • Scheduling systems must continuously rebalance workloads to maximize GPU utilization

These characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines.
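One widely used technique in such architectures is continuous (in-flight) batching, where finished requests leave the batch at any step and waiting ones are admitted immediately instead of waiting for the whole batch to drain. A toy sketch of the idea, not any engine's actual scheduler:

```python
from collections import deque

# Each request is represented only by how many tokens it still has to
# generate; one loop iteration is one decode step for the whole batch.
def continuous_batching(remaining_tokens: list[int], max_batch: int) -> int:
    waiting = deque(remaining_tokens)
    active: list[int] = []
    steps = 0
    while waiting or active:
        # admit new requests into any free batch slots
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # one decode step: every active request emits one token;
        # finished requests (0 tokens left) drop out immediately
        active = [t - 1 for t in active if t - 1 > 0]
        steps += 1
    return steps

steps = continuous_batching([3, 1, 5, 2], max_batch=2)
```

With static batching, the short requests would hold their slots until the longest request in their batch finished; here slots are recycled every step, keeping GPU utilization high.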

Streaming and User Experience Constraints:

Another distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated.

Because of these differences — sequential generation, growing memory requirements, dynamic workloads, and streaming constraints — LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads.
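A minimal sketch of the streaming pattern using a Python generator; a real server would forward each yielded token to the client (for example over SSE or a WebSocket) rather than collect them in a list.

```python
# Generator-based token streaming: the caller receives each token as soon as
# it is produced instead of waiting for the full completion.
def stream_tokens(tokens: list[str]):
    for tok in tokens:
        # a real server would await the model's next decode step here
        yield tok

received = []
for tok in stream_tokens(["Hello", ",", " world"]):
    received.append(tok)   # client can render each token immediately
```

Streaming shifts the user-facing latency target from total completion time to time-to-first-token, which is one reason LLM serving needs its own observability metrics.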

LLMOps: High-Level Architecture

LLM Architecture

The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention.

Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability.

Supported Inference backends (TensorRT LLM, Dynamo & vLLM)

Tailored for the Use Case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases—whether a real-time voice bot requiring ultra-low, sub-second latency or a massive reasoning task requiring huge context windows—demand different runtime characteristics. Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:
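The "select the best engine per use case" decision can be expressed as a small rules table. The thresholds and requirement fields below are hypothetical illustrations of the idea, not the platform's actual profiler policy:

```python
def select_runtime(p99_ttft_ms, context_window, model_params_b, is_prototype=False):
    """Pick an inference backend from workload requirements (illustrative rules only)."""
    if is_prototype:
        return "vLLM"            # fastest path to a running endpoint
    if model_params_b >= 70 or context_window > 128_000:
        return "Dynamo"          # disaggregated prefill/decode across GPUs
    if p99_ttft_ms <= 500:
        return "TensorRT-LLM"    # compiled engine for tight latency SLAs
    return "vLLM"

print(select_runtime(p99_ttft_ms=200, context_window=8_000, model_params_b=8))       # TensorRT-LLM
print(select_runtime(p99_ttft_ms=2_000, context_window=200_000, model_params_b=70))  # Dynamo
```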

Conclusion

Large language model inference introduces a fundamentally new class of infrastructure challenges—where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads.

The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle—from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity.

Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows.

Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment—allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences.

Future Explorations

While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:

  • TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs for integration into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.
Published: 2024-05-21

Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale.

The platform implements a complete LLMOps lifecycle — from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required.

In addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques — such as quantization strategies, batching configurations, and runtime-specific performance enhancements — enabling teams to balance latency, throughput, and cost based on their use case. The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference.

Why LLM Inference Is Not Just Bigger ML Model Serving

Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled.

Autoregressive Generation and Sequential Computation:

Unlike traditional models such as classifiers or recommenders — where inference cost is relatively constant — LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation. Because tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution.

Prefill and Decode Phases:

LLM inference typically consists of two distinct stages:

  • Prefill phase — the model processes the input prompt and builds internal representations. This stage is compute-heavy and highly parallelizable.
  • Decode phase — the model generates tokens sequentially, predicting one token at a time using previously generated context.

The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads.

Context Management and KV Caching:

Another fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens. KV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:

  • Memory consumption grows with sequence length and batch size
  • GPU memory becomes a critical bottleneck
  • Efficient memory management becomes essential for scaling concurrent requests

This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads.

LLMOps: High-Level Architecture

The platform automates the journey of a model through seven distinct stages:

1. Onboarding & Registration (The Source of Truth)

The lifecycle begins with the Data Scientist or engineer.

  • Model Ingestion: Users onboard models—whether open-source (Hugging Face, NeMo) or internally fine-tuned—via the Truffle Box SDK/UI.
  • LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. This allows Data Scientists to register and version-control prompts (e.g., "customer_support_v2") independently of the application code.

2. The "Black Box" Build Engine

Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources.

  • Transformation: The raw model is converted into a TRT-LLM Checkpoint.
  • Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint.
  • Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware.

3. Intelligent Profiling & Validation

Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler.

  • Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. vLLM).
  • Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost.

4. Smart Artifact Generation & Distribution

To solve the Kubernetes "Cold Start" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:

  • Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup.
  • Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times.

5. Image Streaming & Deployment

Simultaneously, the inference runtime container images are pulled from the Artifact Registry.

  • Image Streaming: We utilize container image streaming to allow pods to start initializing while the massive Triton/Dynamo container layers are still downloading, further shaving seconds off the startup time.

6. The Inference Runtime (Kubernetes)

The workload lands on Kubernetes with Autoscaling.

  • Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference.
  • Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk ("Pull from Disk").

7. Client Interaction & Observability

Finally, the LLM Inference Client executes the request.

  • Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used.
  • Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time.

Observability: Monitoring the Pulse of GenAI

In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. A user doesn't care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows.

To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:

1. Time to First Token (TTFT)

  • Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user.
  • Why it matters: This represents the "Prefill Phase" latency—the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or "hung."
  • Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hit rates), which drastically lowers this metric by skipping redundant prompt processing.

2. Inter-Token Latency (ITL)

  • Definition: ITL measures the average time interval between the generation of consecutive tokens during the "Decode Phase".
  • Why it matters: This defines the "perceived speed" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look "jerky" or slow to the user.
  • Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speeds to maintain a natural conversational flow.

3. Token Throughput vs. Request Throughput

We distinguish between two types of throughput to balance system efficiency with user load:

  • Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching.
  • Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before the queue depth impacts ITL.

The Monitoring Stack

  • Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot "slow generation" incidents that generic "500 error" alerts would miss.
  • Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific "slow" request back to its prompt to understand if a complex input caused the latency spike.
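The two latency metrics above fall directly out of per-token timestamps. A minimal sketch of how a client or proxy might derive TTFT and ITL from a token stream (the timestamps are illustrative):

```python
def ttft_and_itl(request_start, token_times):
    """TTFT = first-token time minus request start; ITL = mean gap between tokens."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Request received at t=0.0s; tokens observed at these timestamps (seconds).
ttft, itl = ttft_and_itl(0.0, [0.120, 0.145, 0.170, 0.195, 0.220])
print(f"TTFT={ttft*1000:.0f}ms  ITL={itl*1000:.0f}ms")   # TTFT=120ms  ITL=25ms
```

Aggregating these per-request values into p99 series is what feeds the Grafana dashboards described above.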

Supported Inference backends (TensorRT LLM, Dynamo & vLLM)

1. TensorRT-LLM: The High-Performance Standard

Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots).

TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization.

Key optimizations we tailor for these high-load cases include:

  • Optimized execution via TensorRT engine compilation
  • Quantization-aware execution for reduced memory usage and improved throughput
  • Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization.
  • Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms.

2. Dynamo: Distributed Inference for Reasoning Models

Suitable for: Very large "reasoning" models (70B+) or scenarios requiring massive context windows where a single GPU's memory is insufficient.

For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework. Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:

  • KV Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation.
  • Prefill vs. Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy "reading" phase independently from the memory-heavy "writing" phase.
  • Distributed execution across multiple GPU resources

3. vLLM: The Flexible Baseline

Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput.

While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline:

  • High throughput through dynamic batching and efficient memory utilization
  • Paged KV cache management for handling long contexts and concurrent requests
  • Strong support for open-source model ecosystems
  • Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build.
  • Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. We use it strategically for initial testing before committing to a full TensorRT optimization pipeline.

Future Explorations

While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:

  • TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs for integration into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.
  • Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a "serverless" experience where specific fine-tunes are hot-swapped instantly per request.
  • Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user's streaming experience.
  • Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., "How do I reset my password?" vs. "Password reset steps"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.
  • Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before the memory fills up, preventing eviction-based slowdowns during traffic spikes.
  • Online Evaluation & Guardrails: We are integrating a lightweight "Trust Layer" into the proxy. This will allow for low-latency input/output filtering (Guardrails) and asynchronous "LLM-as-a-Judge" evaluation pipelines to monitor response quality in production, not just system health.

    By mid-2023, we had transformed our ML stack—building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:

  • 🔹 Scaling model inference without hitting infrastructure roadblocks
  • 🔹 Moving embedding search from batch to real-time for candidate generation

    Here’s how we tackled these last-mile challenges, broke free from infrastructure constraints, and built a cost-efficient, high-performance system.

Breaking Free from the Scalability Ceiling

The Model Serving Bottleneck—A Wake-Up Call

July 2023. With just months left for the Mega Blockbuster Sale (MBS), we noticed a serious issue—scaling our model-serving infrastructure was taking 10–15 minutes. In real-time ML, that's an eternity.

In one of our war rooms, we ran a quick experiment:

  • 🚀 We deployed an XGBoost model on a self-hosted Triton Inference Server running on a 16-core machine.
  • 🚀 Fired requests and compared the outputs with our existing cloud-hosted setup.
  • 🚀 The results matched—perfectly.

That moment changed everything. We prepped a backup Triton setup on EKS, just in case our cloud provider couldn't allocate enough compute resources in time. Luckily, they did—but the seed was planted.

Then in October, just two weeks before MBS, we got an alarming response from our infrastructure team: "Node availability may be an issue."

With no time to waste, we moved 30% of real-time ML traffic to our self-hosted Triton cluster. The results?

  • ✅ p99 latency dropped from 90–100ms to 30–40ms
  • ✅ Triton handled significantly higher throughput on fewer resources
  • ✅ No model changes were needed

    MBS ran without a hitch, proving that self-hosted inference was the way forward.

Scaling Triton on GKE

This left us with two choices:

  • 1️⃣ Port models to a managed cloud inference service, investing time in learning a new deployment stack
  • 2️⃣ Scale our existing Triton setup on GKE, optimizing for cost and performance

    We went with Option 2—and it slashed inference costs to 35% of what we previously paid, while giving us full control over scaling and optimizations.

Fixing the Cold Start Problem

As we onboarded more deep learning (DL) models, we hit a new bottleneck: new inference pods took 7–9 minutes to spin up.

After profiling, we found the culprits:

  • Triton’s base image—a massive 5GB
  • Model binaries—often 1GB+
  • Startup delay—mostly due to downloading and initializing these assets

    To fix this, we built a lightweight Triton image, stripping unused components and shrinking the size to 900MB. This cut cold start times drastically, making auto-scaling faster and smoother.
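Most of the cold-start win is download arithmetic. Assuming (hypothetically) ~100 MB/s of effective pull bandwidth, shrinking the image from 5GB to 900MB removes the bulk of the transfer time before initialization even begins:

```python
def pull_seconds(size_mb, bandwidth_mb_s=100):
    """Time to download an artifact at an assumed effective bandwidth."""
    return size_mb / bandwidth_mb_s

before = pull_seconds(5000) + pull_seconds(1000)   # 5GB image + ~1GB model binary
after  = pull_seconds(900)  + pull_seconds(1000)   # slimmed 900MB image + same binary
print(f"{before:.0f}s -> {after:.0f}s")            # 60s -> 19s of pure download time
```

The remaining minutes of the original 7–9 minute spin-up come from scheduling and model initialization, which the slimmer image also helps by starting sooner.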

Embedding Search: The Last Piece of the Puzzle

By mid-2023, most of our ML stack had gone real-time—except for Candidate Generation (CG), which still ran in batch mode. To truly power real-time recommendations, we needed an online embedding search system.

Choosing the Right Vector Database

We benchmarked three production-ready vector DBs across key parameters:

  • Milvus
  • Qdrant
  • Weaviate

    After extensive POCs, Qdrant stood out for its:

  • ✅ Blazing-fast search latency on high-dimensional vectors
  • ✅ Efficient memory usage, crucial for in-memory workloads
  • ✅ Support for upserts and soft deletes, vital for Ads use cases
  • ✅ gRPC + REST APIs, making integration seamless
  • ✅ Powerful filtering, allowing fine-tuned retrieval (e.g., filtering Ads by category, active status, etc.)

    At its core, Qdrant uses HNSW indexing, delivering both high recall and low-latency nearest-neighbor search—a perfect fit for our needs.
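To illustrate what filtered nearest-neighbor retrieval buys us — independent of Qdrant's actual API — here is a brute-force cosine-similarity search with a payload filter; HNSW accelerates exactly these semantics. The ad catalog data is made up:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def filtered_search(query, points, top_k=2, **must_match):
    """Score only points whose payload matches the filter, then take top-k.
    Qdrant applies comparable filter semantics inside its index traversal."""
    candidates = [
        (pid, cosine(query, vec))
        for pid, vec, payload in points
        if all(payload.get(k) == v for k, v in must_match.items())
    ]
    return sorted(candidates, key=lambda t: -t[1])[:top_k]

# Hypothetical ad embeddings with payloads.
points = [
    ("ad1", [0.9, 0.1], {"category": "sarees", "active": True}),
    ("ad2", [0.8, 0.2], {"category": "sarees", "active": False}),
    ("ad3", [0.1, 0.9], {"category": "shoes",  "active": True}),
]
print(filtered_search([1.0, 0.0], points, category="sarees", active=True))
```

Only `ad1` survives the filter: the inactive saree ad and the off-category ad are never scored, which is the "filtering Ads by category, active status" behavior listed above.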

Embedding Freshness & Real-Time Updates

To ensure embeddings stayed up to date, we built a dual ingestion pipeline:

  • 📌 Daily Refresh: A bulk pipeline updated embeddings overnight
  • 📌 Real-Time Updates: Ads events triggered immediate upserts/deletes

    This setup powered real-time "Similar Products" recommendations on the product page and became the foundation for Ads Candidate Generation, ensuring the right ads surfaced in milliseconds.
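The dual ingestion pipeline reduces to two write paths against one index: bulk upserts from the nightly job, and event-driven upserts/soft deletes. A minimal in-memory sketch of those semantics (the event shapes and ids are assumptions):

```python
class EmbeddingIndex:
    """Toy index with upsert + soft delete, mirroring the two ingestion paths."""
    def __init__(self):
        self.vectors, self.deleted = {}, set()

    def upsert(self, pid, vector):
        self.vectors[pid] = vector
        self.deleted.discard(pid)        # re-upserting revives a soft-deleted id

    def soft_delete(self, pid):
        self.deleted.add(pid)            # excluded from search, vector retained

    def searchable_ids(self):
        return [p for p in self.vectors if p not in self.deleted]

index = EmbeddingIndex()
# Nightly bulk refresh:
for pid, vec in {"p1": [0.1, 0.2], "p2": [0.3, 0.4]}.items():
    index.upsert(pid, vec)
# Real-time ads events:
index.soft_delete("p2")                  # ad paused -> dropped from retrieval
index.upsert("p3", [0.5, 0.6])           # new ad -> serveable immediately
print(sorted(index.searchable_ids()))    # ['p1', 'p3']
```

Soft deletes keep paused ads out of results instantly without rebuilding the index, which is why upsert/soft-delete support was a hard requirement in the vector-DB evaluation.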

Skye

Final Takeaways: Scaling Smartly for Real-Time ML

  • 🚀 Self-hosted inference on Triton gave us lower cost, faster scaling, and better performance than managed services
  • 🚀 Building a custom Triton image reduced cold starts, improving responsiveness
  • 🚀 Qdrant-based embedding search enabled real-time personalization at scale
  • 🚀 Real-time updates for embeddings unlocked dynamic, up-to-date recommendations

    By early 2024, Meesho’s ML stack had evolved into a fully real-time, scalable, and cost-efficient system, setting the foundation for even bigger leaps ahead.

— Aditya Kumar (https://github.com/Adit2607)

To represent these groups efficiently, we adopted a layered storage approach:

    Expiry Timestamp and Schema Version were appended using a semi-colon delimiter at the end of the string.

Example:

feature_1_value,feature_2_value,feature_3_value;expiry_ts

This format allowed:

  • Consistent writes and reads at the group level
For the 0th version of the Interaction Store, we focused on a d…

Storage Structure

Each user’s interactions were stored using a composite key format, uniquely identifying the user and interaction type. This structure allowed efficient organization and quick retrieval of recent activity for recommendation generation:

userId_eventType → ZSET[...(pid, ts)...]

Within each ZSET:

  • The timestamp served as the score, maintaining temporal order

    LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

    · 5 min read
    Jaya Kumar
    Lead ML Engineer @ Meesho

    BharatMLStack


    LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

    Raw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack—from memory management to kernel execution.

    1. Advanced Memory Management: Paged & Prefix KV Caching

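As a rough illustration of the prefix-caching idea (a toy token-level cache, not the actual TRT-LLM or Triton implementation), only the part of a prompt past the longest cached prefix needs fresh prefill compute:

```python
class PrefixKVCache:
    """Toy prefix cache: reuse 'KV state' for the longest cached token prefix."""
    def __init__(self):
        self._cache = {}   # tuple(prompt tokens) -> opaque KV state

    def lookup(self, tokens):
        """Return (cached_len, kv) for the longest cached prefix of `tokens`."""
        for end in range(len(tokens), 0, -1):
            kv = self._cache.get(tuple(tokens[:end]))
            if kv is not None:
                return end, kv
        return 0, None

    def store(self, tokens, kv):
        self._cache[tuple(tokens)] = kv

def prefill(cache, tokens):
    """Only tokens past the cached prefix are actually processed."""
    hit_len, kv = cache.lookup(tokens)
    computed = tokens[hit_len:]                  # suffix that needs compute
    kv = (kv or []) + [f"kv({t})" for t in computed]
    cache.store(tokens, kv)
    return hit_len, kv

cache = PrefixKVCache()
prefill(cache, ["sys", "you", "are", "helpful"])           # cold: computes 4 tokens
hit, _ = prefill(cache, ["sys", "you", "are", "helpful", "hi"])
assert hit == 4                                            # warm: reuses 4, computes 1
```

Shared system prompts make this especially effective: every request that starts with the same instructions skips their prefill entirely.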

    Voice bot qu

    | Inference runtime | Concurrency | p99 TTFT (ms) | p99 ITL (ms) | Token throughput (tokens/sec) | Request throughput (req/sec) | Hardware |
    |---|---|---|---|---|---|---|
    | TensorRT-LLM | 1 | 36.27 | 22.78 | 45.66 | 0.23 | L4 |
    | TensorRT-LLM | 2 | 49.81 | 23.21 | 89.37 | 0.45 | L4 |
    | TensorRT-LLM | 4 | 55.33 | 36.62 | 153.39 | 0.78 | L4 |
    | TensorRT-LLM | 8 | 66.5 | 39.11 | 279.88 | 1.47 | L4 |
    | TensorRT-LLM | 16 | 131.8 | 30.39 | 547.8 | 2.77 | L4 |
    | TensorRT-LLM | 32 | 277.22 | 48.02 | 925.7 | 4.78 | L4 |
    | TensorRT-LLM | 64 | 498.52 | 71.62 | 1,164.40 | 6.2 | L4 |
    | TensorRT-LLM | 128 | 677.31 | 120.37 | 1,445.18 | 7.69 | L4 |
    | TensorRT-LLM | 256 | 1,926.31 | 216.88 | 1,600.81 | 8.52 | L4 |
    | TensorRT-LLM | 1 | 21.17 | 9.24 | 130.05 | 0.68 | A100 |
    | TensorRT-LLM | 2 | 25.78 | 9.21 | 264.5 | 1.35 | A100 |
    | TensorRT-LLM | 4 | 28.52 | 10.99 | 437.69 | 2.27 | A100 |
    | TensorRT-LLM | 8 | 34.4 | 12.61 | 760.49 | 3.96 | A100 |
    | TensorRT-LLM | 16 | 68.03 | 14.32 | 1,343.80 | 7.01 | A100 |
    | TensorRT-LLM | 32 | 185.96 | 16.82 | 2,287.30 | 11.92 | A100 |
    | TensorRT-LLM | 64 | 136.87 | 21.17 | 3,625.22 | 18.89 | A100 |
    | TensorRT-LLM | 128 | 463.78 | 34.15 | 4,456.51 | 23.24 | A100 |
    | TensorRT-LLM | 256 | 890.12 | 59.18 | 5,188.24 | 27.05 | A100 |

    Conclusion

    High-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure.


    These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications.


    Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

    · 14 min read
    Jaya Kumar
    Lead ML Engineer @ Meesho

    BharatMLStack

    Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

    Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale.

    The platform implements a complete LLMOps lifecycle — from model registration and automated compilation to deployment, runtime optimization, and monitoring. It is a self-service environment: users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, then deploy them through a single-click workflow with no manual infrastructure or configuration steps.

    In addition to fully automated deployment, the platform lets users select and apply custom inference optimization techniques — such as quantization strategies, batching configurations, and runtime-specific performance enhancements — enabling teams to balance latency, throughput, and cost for their use case. The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference.

    Why LLM Inference Is Not Just Bigger ML Model Serving

    Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled.

    Autoregressive Generation and Sequential Computation:

    Unlike traditional models such as classifiers or recommenders — where inference cost is relatively constant — LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. Latency and compute requirements therefore vary significantly with prompt length and output size, complicating scheduling and resource allocation. Because tokens cannot be generated fully in parallel during decoding, GPUs may sit underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution.

    Prefill and Decode Phases:

    LLM inference typically consists of two distinct stages:

    • Prefill phase — the model processes the input prompt and builds internal representations. This stage is compute-heavy and highly parallelizable.
    • Decode phase — the model generates tokens sequentially, predicting one token at a time using previously generated context.

    The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads.
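The two phases can be sketched with a toy autoregressive loop (the `generate` stub stands in for a real model; this is purely illustrative):

```python
def run_inference(prompt_tokens, generate, max_new_tokens=4, eos="<eos>"):
    """Toy autoregressive loop showing the prefill/decode split."""
    # Prefill: the whole prompt is processed in one parallel pass (compute-heavy).
    context = list(prompt_tokens)

    # Decode: tokens are generated strictly one at a time (memory-bound, sequential).
    out = []
    for _ in range(max_new_tokens):
        nxt = generate(context)       # each step attends over all prior context
        if nxt == eos:
            break
        out.append(nxt)
        context.append(nxt)           # the new token joins the context
    return out

# Stub "model": emits tokens until the context reaches length 5, then stops.
gen = lambda ctx: "tok" if len(ctx) < 5 else "<eos>"
assert run_inference(["a", "b"], gen) == ["tok", "tok", "tok"]
```

The loop makes the asymmetry visible: prefill cost is paid once per request, while decode cost scales with the number of generated tokens.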

    Context Management and KV Caching:

    Another fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens. KV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:

    • Memory consumption grows with sequence length and batch size
    • GPU memory becomes a critical bottleneck
    • Efficient memory management becomes essential for scaling concurrent requests

    This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads.
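A back-of-the-envelope estimator makes the memory growth concrete (the model shape below is assumed, roughly Llama-8B-like with grouped-query attention):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Estimate KV-cache size: 2 tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed shape: 32 layers, 8 KV heads, head_dim 128, FP16 (2 bytes/element).
gib = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=16, dtype_bytes=2) / 2**30
# == 16.0 GiB for these assumed shapes — linear in both seq_len and batch size
```

The linear growth in both sequence length and batch size is exactly why the decode phase turns memory-bound long before the GPU runs out of compute.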

    Dynamic and Irregular Workloads:

    Traditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. As a result:

    • Batch sizes must be dynamic rather than static
    • Requests may enter and leave batches asynchronously
    • Scheduling systems must continuously rebalance workloads to maximize GPU utilization

    These characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines.
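A toy simulation of this kind of continuous (inflight) batching — requests join and leave the batch between decode steps instead of waiting for the whole batch to drain (illustrative only, not any engine's actual scheduler):

```python
import collections

def inflight_batch(requests, max_batch=4):
    """Toy inflight batching: requests enter/exit the batch per decode step.

    `requests` maps request id -> number of decode steps it needs."""
    pending = collections.deque(requests.items())
    active = {}                       # id -> remaining decode steps
    trace = []                        # batch composition at each step
    while pending or active:
        # Admit waiting requests as soon as slots free up (no full-batch barrier).
        while pending and len(active) < max_batch:
            rid, steps = pending.popleft()
            active[rid] = steps
        trace.append(sorted(active))
        for rid in list(active):      # one decode step for every active request
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]       # finished requests exit immediately
    return trace

# Short requests drain out and new ones slot in mid-flight:
t = inflight_batch({"a": 1, "b": 3, "c": 2, "d": 2, "e": 1}, max_batch=2)
```

The trace shows the batch never idles waiting for its slowest member, which is the utilization win behind production inflight batching.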

    Streaming and User Experience Constraints:

    Another distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated. Because of these differences — sequential generation, growing memory requirements, dynamic workloads, and streaming constraints — LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads.

    LLMOps: High-Level Architecture

    [Figure: LLM Architecture]

    The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that takes users from raw model weights to production-ready inference endpoints with minimal manual intervention.

    Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design above, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability.

    1. Onboarding & Registration (The Source of Truth)

       The lifecycle begins with the Data Scientist or engineer.

       • Model Ingestion: Users onboard models — whether open-source (Hugging Face, NeMo) or internally fine-tuned — via the Truffle Box SDK/UI.
       • LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. This allows Data Scientists to register and version-control prompts (e.g., "customer_support_v2") independently of the application code.

    2. The "Black Box" Build Engine

       Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources.

       • Transformation: The raw model is converted into a TRT-LLM Checkpoint.
       • Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce the memory footprint.
       • Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware.

    3. Intelligent Profiling & Validation

       Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler.

       • Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. vLLM).
       • Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost.

    4. Smart Artifact Generation & Distribution

       To solve the Kubernetes "Cold Start" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:

       • Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup.
       • Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times.

    5. Image Streaming & Deployment

       Simultaneously, the inference runtime container images are pulled from the Artifact Registry.

       • Image Streaming: We utilize container image streaming so pods can start initializing while the massive Triton/Dynamo container layers are still downloading, shaving further seconds off startup time.

    6. The Inference Runtime (Kubernetes)

       The workload lands on Kubernetes with Autoscaling.

       • Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference.
       • Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk ("Pull from Disk").

    7. Client Interaction & Observability

       Finally, the LLM Inference Client executes the request.

       • Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used.
       • Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time.

    Observability: Monitoring the Pulse of GenAI

    In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. A user doesn't care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows.

    To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:

    1. Time to First Token (TTFT)

       • Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user.
       • Why it matters: This represents the "Prefill Phase" latency — the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or "hung."
       • Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hit rates), which drastically lowers this metric by skipping redundant prompt processing.

    2. Inter-Token Latency (ITL)

       • Definition: ITL measures the average time interval between the generation of consecutive tokens during the "Decode Phase".
       • Why it matters: This defines the "perceived speed" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look "jerky" or slow to the user.
       • Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speed to maintain a natural conversational flow.

    3. Token Throughput vs. Request Throughput

       We distinguish between two types of throughput to balance system efficiency with user load:

       • Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching.
       • Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before queue depth impacts ITL.
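Both TTFT and ITL fall out directly from per-token timestamps; a minimal sketch (function name and timestamps are illustrative):

```python
def ttft_and_itl(request_ts, token_ts):
    """TTFT: first token time minus request arrival; ITL: mean gap between tokens."""
    ttft = token_ts[0] - request_ts
    gaps = [b - a for a, b in zip(token_ts, token_ts[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Timestamps in ms (illustrative): request arrives at 0, tokens stream afterwards.
ttft, itl = ttft_and_itl(0.0, [120.0, 150.0, 185.0, 215.0])
# ttft == 120.0 ms (prefill latency); itl ≈ 31.7 ms between decode tokens
```

Tracking the p99 of both per request, rather than a single end-to-end latency, is what lets dashboards separate "slow prefill" incidents from "jerky decode" incidents.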

    The Monitoring Stack

    • Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot "slow generation" incidents that generic "500 error" alerts would miss.
    • Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This lets us trace a specific "slow" request back to its prompt to understand whether a complex input caused the latency spike.

    Supported Inference Backends (TensorRT-LLM, Dynamo & vLLM)

    Tailored for the Use Case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases — whether a real-time voice bot requiring ultra-low, sub-second latency or a massive reasoning task requiring huge context windows — demand different runtime characteristics. Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:

    1. TensorRT-LLM: The High-Performance Standard

       Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots).

       TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization.

       Key optimizations we tailor for these high-load cases include:

       • Optimized execution via TensorRT engine compilation
       • Quantization-aware execution for reduced memory usage and improved throughput
       • Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization.
       • Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms.

    2. Dynamo: Distributed Inference for Reasoning Models

       Suitable for: Very large "reasoning" models (70B+) or scenarios requiring massive context windows where a single GPU's memory is insufficient.

       For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework. Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:

       • KV Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation.
       • Prefill vs. Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy "reading" phase independently from the memory-heavy "writing" phase.
       • Distributed execution across multiple GPU resources

    3. vLLM: The Flexible Baseline

       Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput.

       While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline:

       • High throughput through dynamic batching and efficient memory utilization
       • Paged KV cache management for handling long contexts and concurrent requests
       • Strong support for open-source model ecosystems
       • Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build.
       • Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. We use it strategically for initial testing before committing to a full TensorRT optimization pipeline.

    Conclusion

    Large language model inference introduces a fundamentally new class of infrastructure challenges — where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads.

    The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle — from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity.

    Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows.

    Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment — allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences.

    Future Explorations

    While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:
    • TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs for integration into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.
    • Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a "serverless" experience where specific fine-tunes are hot-swapped instantly per request.
    • Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user's streaming experience.
    • Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., "How do I reset my password?" vs. "Password reset steps"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.
    • Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before memory fills up, preventing eviction-based slowdowns during traffic spikes.
    • Online Evaluation & Guardrails: We are integrating a lightweight "Trust Layer" into the proxy. This will allow for low-latency input/output filtering (Guardrails) and asynchronous "LLM-as-a-Judge" evaluation pipelines to monitor response quality in production, not just system health.
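A minimal sketch of the semantic-cache lookup described above (the embedding function, threshold, and responses are stand-ins; a production version would use a real embedding model and a vector database rather than a linear scan):

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

class SemanticCache:
    """Serve cached answers for semantically similar queries, skipping the GPU."""
    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # query -> vector (any embedding model)
        self.threshold = threshold  # similarity needed for a cache hit
        self._entries = []          # [(vector, response)]

    def get(self, query):
        v = self.embed(query)
        best = max(self._entries, key=lambda e: cosine(v, e[0]), default=None)
        if best and cosine(v, best[0]) >= self.threshold:
            return best[1]          # hit: bypass inference entirely
        return None

    def put(self, query, response):
        self._entries.append((self.embed(query), response))

# Stub embeddings for illustration only:
emb = {"reset password": [1.0, 0.0], "password reset steps": [0.96, 0.28]}
cache = SemanticCache(lambda q: emb[q], threshold=0.9)
cache.put("reset password", "Go to Settings → Reset Password.")
assert cache.get("password reset steps") == "Go to Settings → Reset Password."
```

The threshold is the key tuning knob: too low and unrelated queries collide, too high and near-duplicates still hit the GPU.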

    Cracking the Code: Scaling Model Inference & Real-Time Embedding Search

    · 4 min read
    Aditya Kumar
    Lead Software Engineer @ Meesho
    Jaya Kumar
    Lead ML Engineer @ Meesho
    Adarsha Das
    Senior Architect @ Meesho

    BharatMLStack

    By mid-2023, we had transformed our ML stack—building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:


      feature_1_value,feature_2_value,feature_3_value;expiry_ts


    This format allowed:

    • Consistent writes and reads at the group level

      Why Redis?

      Storage Structure

      Each user’s interactions were stored using a composite key format, uniquely identifying the user and interaction type. This structure allowed efficient organization and quick retrieval of recent activity for recommendation generation:

      userId_eventType → ZSET[...(pid, ts)...]

      Within each ZSET:

      • The timestamp served as the score, maintaining temporal order

        LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

        · 5 min read
        Jaya Kumar
        Lead ML Engineer @ Meesho

        BharatMLStack

        +

        LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

        · 5 min read
        Jaya Kumar
        Lead ML Engineer @ Meesho

        BharatMLStack

        LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

        Raw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack—from memory management to kernel execution.

        1. Advanced Memory Management: Paged & Prefix KV Caching


        Voice bot qu
        | Inference runtime | Concurrency | p99 TTFT (ms) | p99 ITL (ms) | Token throughput (tokens/sec) | Request throughput (req/sec) | Hardware |
        |---|---|---|---|---|---|---|
        | TensorRT-LLM | 1 | 36.27 | 22.78 | 45.66 | 0.23 | L4 |
        | TensorRT-LLM | 2 | 49.81 | 23.21 | 89.37 | 0.45 | L4 |
        | TensorRT-LLM | 4 | 55.33 | 36.62 | 153.39 | 0.78 | L4 |
        | TensorRT-LLM | 8 | 66.5 | 39.11 | 279.88 | 1.47 | L4 |
        | TensorRT-LLM | 16 | 131.8 | 30.39 | 547.8 | 2.77 | L4 |
        | TensorRT-LLM | 32 | 277.22 | 48.02 | 925.7 | 4.78 | L4 |
        | TensorRT-LLM | 64 | 498.52 | 71.62 | 1,164.40 | 6.2 | L4 |
        | TensorRT-LLM | 128 | 677.31 | 120.37 | 1,445.18 | 7.69 | L4 |
        | TensorRT-LLM | 256 | 1,926.31 | 216.88 | 1,600.81 | 8.52 | L4 |
        | TensorRT-LLM | 1 | 21.17 | 9.24 | 130.05 | 0.68 | A100 |
        | TensorRT-LLM | 2 | 25.78 | 9.21 | 264.5 | 1.35 | A100 |
        | TensorRT-LLM | 4 | 28.52 | 10.99 | 437.69 | 2.27 | A100 |
        | TensorRT-LLM | 8 | 34.4 | 12.61 | 760.49 | 3.96 | A100 |
        | TensorRT-LLM | 16 | 68.03 | 14.32 | 1,343.80 | 7.01 | A100 |
        | TensorRT-LLM | 32 | 185.96 | 16.82 | 2,287.30 | 11.92 | A100 |
        | TensorRT-LLM | 64 | 136.87 | 21.17 | 3,625.22 | 18.89 | A100 |
        | TensorRT-LLM | 128 | 463.78 | 34.15 | 4,456.51 | 23.24 | A100 |
        | TensorRT-LLM | 256 | 890.12 | 59.18 | 5,188.24 | 27.05 | A100 |

        Conclusion

        High-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure.

        These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications.


        Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

        · 14 min read
        Jaya Kumar
        Lead ML Engineer @ Meesho

        BharatMLStack

        Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

        Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale.

        The platform implements a complete LLMOps lifecycle — from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, then deploy them with a single-click workflow that requires no manual infrastructure or configuration steps.

        In addition to fully automated deployment, the platform lets users select and apply custom inference optimization techniques — such as quantization strategies, batching configurations, and runtime-specific performance enhancements — enabling teams to balance latency, throughput, and cost for their use case. The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference.

        Why LLM Inference Is Not Just Bigger ML Model Serving

        Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled.

        Autoregressive Generation and Sequential Computation:

        Unlike traditional models such as classifiers or recommenders — where inference cost is relatively constant — LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly with prompt length and output size, complicating scheduling and resource allocation. Because tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution.

        Prefill and Decode Phases:

        LLM inference typically consists of two distinct stages:

        • Prefill phase — the model processes the input prompt and builds internal representations. This stage is compute-heavy and highly parallelizable.
        • Decode phase — the model generates tokens sequentially, predicting one token at a time using previously generated context.

        The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads.
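The two phases can be sketched as a loop (a toy illustration only; `ToyModel` is a stand-in, and real engines fuse, batch, and cache these steps far more aggressively):

```python
class ToyModel:
    """Stand-in for a real engine: prefill builds the KV cache and emits the first token."""
    def prefill(self, prompt_tokens):
        return list(prompt_tokens), 7          # one parallel pass over the whole prompt
    def decode_step(self, kv_cache, last_token):
        kv_cache.append(last_token)            # the cache grows by one entry per step
        return kv_cache, max(last_token - 1, 0)

def generate(model, prompt_tokens, max_new_tokens, eos_token=0):
    # Prefill phase: compute-heavy, parallelizable, runs once per request.
    kv_cache, next_token = model.prefill(prompt_tokens)
    output = [next_token]
    # Decode phase: strictly sequential; each step depends on all prior tokens.
    while len(output) < max_new_tokens and output[-1] != eos_token:
        kv_cache, next_token = model.decode_step(kv_cache, output[-1])
        output.append(next_token)
    return output
```

The asymmetry is visible even in the sketch: prefill is one call regardless of prompt length, while decode is a loop whose length, and therefore latency, depends on the output.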

        Context Management and KV Caching:

        Another fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens. KV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:

        • Memory consumption grows with sequence length and batch size
        • GPU memory becomes a critical bottleneck
        • Efficient memory management becomes essential for scaling concurrent requests

        This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads.
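This memory growth is easy to estimate: the cache holds one key and one value vector per layer, per KV head, per token. A back-of-the-envelope sketch, using an assumed Llama-3.1-8B-like shape (32 layers, 8 KV heads, head dim 128, FP16), which is our illustration rather than a measured figure:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, dtype_bytes=2):
    """KV-cache footprint: 2 tensors (K and V) per layer, one vector per KV head per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Assumed shape: 32 layers, 8 KV heads, head_dim 128, FP16 (2 bytes)
per_seq_gib = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=1) / 2**30   # 1.0 GiB
batch_gib = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=16) / 2**30    # 16.0 GiB
```

At FP16 a single 8K-context sequence already claims about 1 GiB under these assumptions, which is why paged allocation and quantized caches matter at high concurrency.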

        Dynamic and Irregular Workloads:

        Traditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. As a result:

        • Batch sizes must be dynamic rather than static
        • Requests may enter and leave batches asynchronously
        • Scheduling systems must continuously rebalance workloads to maximize GPU utilization

        These characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines.
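A toy scheduler makes the payoff concrete: slots free up the moment a sequence finishes, and queued requests join mid-flight (a sketch under simplified assumptions, not the platform's actual scheduler):

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Count decode steps for a continuous (inflight) batching loop.
    `requests` maps request id -> number of tokens still to generate."""
    waiting = deque(requests)
    active = {}
    steps = 0
    while waiting or active:
        # Admit queued requests into any free batch slots.
        while waiting and len(active) < max_batch:
            rid = waiting.popleft()
            active[rid] = requests[rid]
        # One decode step generates one token for every active sequence.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]  # slot freed this step, not at batch end
        steps += 1
    return steps
```

For the workload in the test below, static batches of two would take 7 steps, since each batch runs as long as its slowest member; the inflight loop finishes in 5.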

        Streaming and User Experience Constraints:

        Another distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated. Because of these differences — sequential generation, growing memory requirements, dynamic workloads, and streaming constraints — LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads.

        LLMOps: High-Level Architecture

        LLM Architecture

        The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention.

        Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability.

        1. Onboarding & Registration (The Source of Truth)

           The lifecycle begins with the Data Scientist or engineer.

           • Model Ingestion: Users onboard models—whether open-source (Hugging Face, NeMo) or internally fine-tuned—via the Truffle Box SDK/UI.
           • LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. This allows Data Scientists to register and version-control prompts (e.g., "customer_support_v2") independently of the application code.

        2. The "Black Box" Build Engine

           Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources.

           • Transformation: The raw model is converted into a TRT-LLM Checkpoint.
           • Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint.
           • Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware.

        3. Intelligent Profiling & Validation

           Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler.

           • Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. vLLM).
           • Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost.

        4. Smart Artifact Generation & Distribution

           To solve the Kubernetes "Cold Start" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:

           • Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup.
           • Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times.

        5. Image Streaming & Deployment

           Simultaneously, the inference runtime container images are pulled from the Artifact Registry.

           • Image Streaming: We utilize container image streaming so that pods can start initializing while the massive Triton/Dynamo container layers are still downloading, shaving further seconds off startup time.

        6. The Inference Runtime (Kubernetes)

           The workload lands on Kubernetes with Autoscaling.

           • Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference.
           • Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk ("Pull from Disk").

        7. Client Interaction & Observability

           Finally, the LLM Inference Client executes the request.

           • Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used.
           • Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real time.

        Observability: Monitoring the Pulse of GenAI

        In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. A user doesn't care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows.

        To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:

        1. Time to First Token (TTFT)

           • Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user.
           • Why it matters: This represents the "Prefill Phase" latency—the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or "hung."
           • Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hit rates), which drastically lowers this metric by skipping redundant prompt processing.

        2. Inter-Token Latency (ITL)

           • Definition: ITL measures the average time interval between the generation of consecutive tokens during the "Decode Phase".
           • Why it matters: This defines the "perceived speed" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look "jerky" or slow to the user.
           • Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speed to maintain a natural conversational flow.

        3. Token Throughput vs. Request Throughput

           We distinguish between two types of throughput to balance system efficiency with user load:

           • Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching.
           • Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before queue depth impacts ITL.

        The Monitoring Stack

        • Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real time, allowing on-call engineers to spot "slow generation" incidents that generic "500 error" alerts would miss.
        • Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific "slow" request back to its prompt to understand whether a complex input caused the latency spike.
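Both TTFT and ITL fall out of per-token timestamps collected at the streaming boundary; a minimal sketch (the function name is ours, not part of the platform):

```python
def latency_metrics(request_ts, token_ts):
    """Compute TTFT and mean ITL in milliseconds from a request arrival
    timestamp and per-token emission timestamps, all given in seconds."""
    ttft_ms = (token_ts[0] - request_ts) * 1000.0
    gaps = [b - a for a, b in zip(token_ts, token_ts[1:])]
    itl_ms = (sum(gaps) / len(gaps)) * 1000.0 if gaps else 0.0
    return ttft_ms, itl_ms
```

In production these per-request values would feed percentile aggregation (p99 TTFT, p99 ITL) rather than being inspected individually.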

        Supported Inference Backends (TensorRT-LLM, Dynamo & vLLM)

        Tailored for the Use Case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases—whether a real-time voice bot requiring ultra-low, sub-second latency or a massive reasoning task requiring huge context windows—demand different runtime characteristics. Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine for the specific requirements of the application:

        1. TensorRT-LLM: The High-Performance Standard

           Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots).

           TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization.

           Key optimizations we tailor for these high-load cases include:

           • Optimized execution via TensorRT engine compilation
           • Quantization-aware execution for reduced memory usage and improved throughput
           • Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization.
           • Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms.

        2. Dynamo: Distributed Inference for Reasoning Models

           Suitable for: Very large "reasoning" models (70B+) or scenarios requiring massive context windows where a single GPU's memory is insufficient.

           For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework. Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:

           • KV-Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation.
           • Prefill vs. Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy "reading" phase independently from the memory-heavy "writing" phase.
           • Distributed execution across multiple GPU resources

        3. vLLM: The Flexible Baseline

           Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput.

           While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline:

           • High throughput through dynamic batching and efficient memory utilization
           • Paged KV cache management for handling long contexts and concurrent requests
           • Strong support for open-source model ecosystems
           • Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build.
           • Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. We use it strategically for initial testing before committing to a full TensorRT optimization pipeline.

        Conclusion

        Large language model inference introduces a fundamentally new class of infrastructure challenges—where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads.

        The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle—from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity.

        Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows.

        Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment—allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences.

        Future Explorations

        While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:

        • TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs for integration into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.
        • Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a "serverless" experience where specific fine-tunes are hot-swapped instantly per request.
        • Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user's streaming experience.
        • Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., "How do I reset my password?" vs. "Password reset steps"), we can bypass the GPU entirely for repetitive queries, reducing latency to near zero.
        • Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before memory fills up, preventing eviction-based slowdowns during traffic spikes.
        • Online Evaluation & Guardrails: We are integrating a lightweight "Trust Layer" into the proxy. This will allow for low-latency input/output filtering (Guardrails) and asynchronous "LLM-as-a-Judge" evaluation pipelines to monitor response quality in production, not just system health.
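The semantic-caching idea can be sketched with cosine similarity over query embeddings (a toy version: embeddings here are supplied by hand, whereas a real layer would pair an embedding model with a vector database):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Sketch: serve a cached response when a query embedding is close enough
    to a previously answered one, bypassing the model entirely."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def store(self, embedding, response):
        self.entries.append((embedding, response))

    def lookup(self, embedding):
        best, best_sim = None, 0.0
        for emb, response in self.entries:
            sim = cosine(embedding, emb)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None
```

The threshold is the key tuning knob: too loose and users get stale or mismatched answers, too strict and the cache rarely hits.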

        Building Meesho’s ML Platform: From Chaos to Cutting-Edge (Part 1)

        · 11 min read
        Adarsha Das
        Senior Architect @ Meesho
        Aditya Kumar
        Lead Software Engineer @ Meesho
        Bhawani Singh
        Architect @ Meesho
        Jigar Dave
        Lead Software Engineer @ Meesho

        BharatMLStack


        The Genesis: How a Friday Night Roast Sparked Meesho’s ML Platform

        It all started in early 2022, over a casual Friday evening catch-up. Like many great origin stories, this one began with friendly banter between a group of backend engineers and data scientists. As the conversations unfolded, so did the roasting—until one remark hit a little too close to home:

        "Why are we still crunching data for Monthly Active Users (MAU) when the next day it’s all about Daily Active Users (DAU)?"


        feature_1_value,feature_2_value,feature_3_value;expiry_ts


        This format allowed:

        • Consistent writes and reads at the group level

          Why Redis?

          Storage Structure

          Each user’s interactions were stored using a composite key format, uniquely identifying the user and interaction type. This structure allowed efficient organization and quick retrieval of recent activity for recommendation generation:

          userId_eventType → ZSET[...(pid, ts)...]

          Within each ZSET:

          • The timestamp served as the score, maintaining temporal order
          • diff --git a/docs/blog/post-three/index.html b/docs/blog/post-three/index.html index a160d2ad..a2ab3419 100644 --- a/docs/blog/post-three/index.html +++ b/docs/blog/post-three/index.html @@ -3,204 +3,91 @@ -Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving | BharatMLStack - - - +Cracking the Code: Scaling Model Inference & Real-Time Embedding Search | BharatMLStack + + + -

            Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

            · 14 min read
            Jaya Kumar
            Lead ML Engineer @ Meesho

            BharatMLStack

            -

            Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

            -

            Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale.

            -

            The platform implements a complete LLMOps lifecycle — from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required.

            -

            In addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques — such as quantization strategies, batching configurations, and runtime-specific performance enhancements — enabling teams to balance latency, throughput, and cost based on their use case. The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference.

            -

            Why LLM Inference Is not just bigger ML model serving

            -

            Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled.

            -

            Autoregressive Generation and Sequential Computation:

            -

            Unlike traditional models such as classifiers or recommenders — where inference cost is relatively constant — LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation. -Because tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution.

            -

            Prefill and Decode Phases:

            -

            LLM inference typically consists of two distinct stages:

            -
              -
            • Prefill phase — the model processes the input prompt and builds internal representations. This stage is compute-heavy and highly parallelizable.
            • -
            • Decode phase — the model generates tokens sequentially, predicting one token at a time using previously generated context.
            • -
            -

            The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads.

            -

            Context Management and KV Caching:

            -

            Another fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens. -KV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:

            • Memory consumption grows with sequence length and batch size
            • GPU memory becomes a critical bottleneck
            • Efficient memory management becomes essential for scaling concurrent requests

            This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads.
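            To make the memory math concrete, KV-cache size is roughly 2 (K and V) × layers × KV heads × head dim × sequence length × batch × bytes per element. A back-of-the-envelope sketch, assuming an illustrative 7B-class configuration (32 layers, 32 KV heads, head dim 128, FP16):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """K and V tensors per layer, each shaped [batch, n_kv_heads, seq_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed 7B-class config, FP16, one 4096-token request
per_request = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1)
print(per_request / 2**30)  # -> 2.0  (GiB of cache for a single request)
```

            At batch size 8 the same cache needs 16 GiB — before counting the model weights — which is why paged cache allocation and eviction policies matter at scale.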

            Dynamic and Irregular Workloads:

            Traditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. As a result:

            • Batch sizes must be dynamic rather than static
            • Requests may enter and leave batches asynchronously
            • Scheduling systems must continuously rebalance workloads to maximize GPU utilization

            These characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines.
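            The effect of letting requests join and leave a batch asynchronously (often called continuous or inflight batching) can be shown with a toy step simulator; the request lengths and batch size below are made up:

```python
import collections

def continuous_batching(requests, max_batch):
    """Toy simulation: `requests` maps request id -> number of decode steps.
    Each tick, finished requests leave and queued ones join immediately."""
    queue = collections.deque(requests)
    remaining = dict(requests)
    batch, ticks = set(), 0
    while queue or batch:
        while queue and len(batch) < max_batch:   # admit new work mid-flight
            batch.add(queue.popleft())
        ticks += 1
        for rid in list(batch):
            remaining[rid] -= 1
            if remaining[rid] == 0:
                batch.remove(rid)                 # slot freed for the next request
    return ticks

print(continuous_batching({"a": 2, "b": 5, "c": 1}, max_batch=2))  # -> 5
```

            With static batching, "c" would wait for the whole first batch to drain (6 ticks total); here the slot freed by "a" is reused immediately, which is exactly what keeps GPU utilization high.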

            Streaming and User Experience Constraints:

            Another distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated.

            Because of these differences — sequential generation, growing memory requirements, dynamic workloads, and streaming constraints — LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads.
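            Token streaming itself is simple to sketch: the server yields tokens as they are produced instead of buffering the full completion (the token strings and delay below are illustrative):

```python
import time

def stream_tokens(tokens, delay_s=0.0):
    """Yield tokens one by one, as a server would over gRPC or SSE."""
    for tok in tokens:
        time.sleep(delay_s)     # stands in for per-token decode latency
        yield tok

# The client renders incrementally instead of waiting for the full answer
chunks = []
for tok in stream_tokens(["Hel", "lo", ", ", "world"]):
    chunks.append(tok)          # e.g. flush each chunk to the UI here
print("".join(chunks))          # -> Hello, world
```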


            LLMOps: High-Level Architecture

            [Figure: LLM Architecture]

            The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention.


            Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability.

            1. Onboarding & Registration (The Source of Truth)

              The lifecycle begins with the Data Scientist or engineer.

              • Model Ingestion: Users onboard models—whether open-source (Hugging Face, NeMo) or internally fine-tuned—via the Truffle Box SDK/UI.
              • LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. This allows Data Scientists to register and version-control prompts (e.g., "customer_support_v2") independently of the application code.
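            A versioned prompt registry of this kind can be sketched as a small key-value store; the class, method names, and templates below are illustrative, not the platform's actual API:

```python
class PromptRegistry:
    """Toy sketch of a versioned prompt store (illustrative API only)."""
    def __init__(self):
        self._store = {}                      # (name, version) -> template

    def register(self, name, template):
        version = sum(1 for (n, _) in self._store if n == name) + 1
        self._store[(name, version)] = template
        return f"{name}_v{version}"           # e.g. "customer_support_v2"

    def get(self, prompt_id):
        name, _, v = prompt_id.rpartition("_v")
        return self._store[(name, int(v))]

registry = PromptRegistry()
registry.register("customer_support", "You are a support agent. {query}")
pid = registry.register("customer_support", "You are a concise support agent. {query}")
print(pid)  # -> customer_support_v2
```

            Applications then reference prompts by id, so a prompt change ships without an application deploy.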

            2. The "Black Box" Build Engine

              Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources.

              • Transformation: The raw model is converted into a TRT-LLM Checkpoint.
              • Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint.
              • Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware.
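            The memory win from quantization is easy to estimate. A rough sketch for weight storage only, assuming an illustrative 7B-parameter model (activations and KV cache excluded):

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate weight storage only (ignores activations and KV cache)."""
    return n_params * bits_per_weight / 8 / 2**30

n = 7_000_000_000                      # assumed 7B-parameter model
fp16 = weight_memory_gib(n, 16)        # ~13.0 GiB
int4 = weight_memory_gib(n, 4)         # ~3.3 GiB — fits much smaller GPUs
```

            The 4x reduction is what lets a model that needs an A100 in FP16 run on an L4-class card after INT4 AWQ, at the cost of a calibration step and a small accuracy tradeoff.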

            3. Intelligent Profiling & Validation

              Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler.

              • Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. vLLM).
              • Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost.
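            The profiler's selection step amounts to "cheapest configuration that satisfies the latency SLO". A minimal sketch — the configs, latencies, and prices below are hypothetical benchmark numbers, not real measurements:

```python
# Hypothetical results: (config, p99 TTFT in ms, $/hour)
profiles = [
    ("L4-trtllm",   220, 0.70),
    ("L4-vllm",     310, 0.70),
    ("A100-trtllm",  90, 3.10),
    ("A100-vllm",   140, 3.10),
]

def pick_config(profiles, ttft_slo_ms):
    """Cheapest configuration whose measured TTFT meets the SLO."""
    ok = [p for p in profiles if p[1] <= ttft_slo_ms]
    return min(ok, key=lambda p: p[2])[0] if ok else None

print(pick_config(profiles, 250))   # -> L4-trtllm   (cheapest within SLO)
print(pick_config(profiles, 100))   # -> A100-trtllm (only A100 meets 100 ms)
```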

            4. Smart Artifact Generation & Distribution

              To solve the Kubernetes "Cold Start" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:

              • Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup.
              • Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times.
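            Simple bandwidth arithmetic shows why very large artifacts get pre-cached disks instead of startup downloads; the 2 Gbit/s effective throughput assumed below is illustrative, not a measured figure:

```python
def download_seconds(size_gib, bandwidth_gbps):
    """Time to pull an artifact at a given effective network bandwidth."""
    size_bits = size_gib * 2**30 * 8
    return size_bits / (bandwidth_gbps * 1e9)

# Assumed 2 Gbit/s effective throughput from object storage to the pod
small = download_seconds(1, 2)    # ~4.3 s  — fine to pull at startup
large = download_seconds(40, 2)   # ~172 s  — motivates pre-cached boot disks
```

            For a 40 GiB engine, nearly three minutes of pure download time would dominate pod startup; a mounted boot disk makes that cost disappear from the scaling path.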

            5. Image Streaming & Deployment

              Simultaneously, the inference runtime container images are pulled from the Artifact Registry.

              • Image Streaming: We utilize container image streaming to allow pods to start initializing while the massive Triton/Dynamo container layers are still downloading, further shaving seconds off the startup time.

            6. The Inference Runtime (Kubernetes)

              The workload lands on Kubernetes with Autoscaling.

              • Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference.
              • Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk ("Pull from Disk").

            7. Client Interaction & Observability

              Finally, the LLM Inference Client executes the request.

              • Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used.
              • Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time.

            Observability: Monitoring the Pulse of GenAI

              In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. A user doesn't care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows.

              To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:

              1. Time to First Token (TTFT)

                • Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user.
                • Why it matters: This represents the "Prefill Phase" latency—the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or "hung."
                • Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hit rates), which drastically lowers this metric by skipping redundant prompt processing.

              2. Inter-Token Latency (ITL)

                • Definition: ITL measures the average time interval between the generation of consecutive tokens during the "Decode Phase".
                • Why it matters: This defines the "perceived speed" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look "jerky" or slow to the user.
                • Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speeds to maintain a natural conversational flow.

              3. Token Throughput vs. Request Throughput

                We distinguish between two types of throughput to balance system efficiency with user load:

                • Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching.
                • Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before the queue depth impacts ITL.

              4. The Monitoring Stack

                • Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot "slow generation" incidents that generic "500 error" alerts would miss.
                • Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific "slow" request back to its prompt to understand if a complex input caused the latency spike.
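            Both TTFT and ITL fall out of per-token timestamps recorded at the proxy. A minimal sketch with synthetic timestamps (seconds):

```python
def llm_metrics(request_start, token_times):
    """Compute TTFT and mean ITL from per-token emission timestamps."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Synthetic trace: request at t=0.0, five tokens streamed back
ttft, itl = llm_metrics(0.0, [0.25, 0.29, 0.33, 0.37, 0.41])
print(ttft)  # -> 0.25  (prefill-dominated)
print(itl)   # ≈ 0.04 s between tokens (decode cadence)
```

            Token throughput is then simply total tokens emitted per second across the fleet, while request throughput counts completed requests — the two diverge as output lengths vary.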

            Supported Inference Backends (TensorRT-LLM, Dynamo & vLLM)


            Tailored for the Use Case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases—whether a real-time voice bot requiring ultra-low, sub-second latency or a massive reasoning task requiring huge context windows—demand different runtime characteristics. Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:

            1. TensorRT-LLM: The High-Performance Standard

              Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots).

              TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization.

              Key optimizations we tailor for these high-load cases include:

              • Optimized execution via TensorRT engine compilation
              • Quantization-aware execution for reduced memory usage and improved throughput
              • Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization.
              • Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms.

            2. Dynamo: Distributed Inference for Reasoning Models

              Suitable for: Very large "reasoning" models (70B+) or scenarios requiring massive context windows where a single GPU's memory is insufficient.

              For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework. Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:

              • KV Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation.
              • Prefill vs. Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy "reading" phase independently from the memory-heavy "writing" phase.
              • Distributed execution across multiple GPU resources
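            The idea behind KV-aware routing can be sketched as "send the request to the worker whose cache shares the longest prefix with the prompt"; the worker names and token ids below are invented for illustration:

```python
def shared_prefix_len(a, b):
    """Length of the common prefix between two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt, workers):
    """Pick the worker whose KV cache overlaps the incoming prompt the most."""
    return max(workers, key=lambda w: shared_prefix_len(prompt, workers[w]))

workers = {
    "worker-a": [1, 2, 3, 9],       # tokens cached from earlier requests
    "worker-b": [1, 2, 3, 4, 5],
}
print(route([1, 2, 3, 4, 7], workers))  # -> worker-b (4 tokens reusable)
```

            The routed worker skips prefill for the shared prefix, which is where the latency and compute savings come from.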

            3. vLLM: The Flexible Baseline

              Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput.

              While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline.

              • High throughput through dynamic batching and efficient memory utilization
              • Paged KV cache management for handling long contexts and concurrent requests
              • Strong support for open-source model ecosystems
              • Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build.
              • Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. We use it strategically for initial testing before committing to a full TensorRT optimization pipeline.

            Conclusion


            Large language model inference introduces a fundamentally new class of infrastructure challenges—where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads.


            The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle—from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity.


            Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows.


            Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment—allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences.


            Future Explorations


            While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:

            • TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs for integration into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.
            • Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a "serverless" experience where specific fine-tunes are hot-swapped instantly per request.
            • Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user's streaming experience.
            • Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., "How do I reset my password?" vs. "Password reset steps"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.
            • Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before the memory fills up, preventing eviction-based slowdowns during traffic spikes.
            • Online Evaluation & Guardrails: We are integrating a lightweight "Trust Layer" into the proxy. This will allow for low-latency input/output filtering (Guardrails) and asynchronous "LLM-as-a-Judge" evaluation pipelines to monitor response quality in production, not just system health.

            Cracking the Code: Scaling Model Inference & Real-Time Embedding Search

            · 4 min read
            Aditya Kumar
            Lead Software Engineer @ Meesho
            Jaya Kumar
            Lead ML Engineer @ Meesho
            Adarsha Das
            Senior Architect @ Meesho


            By mid-2023, we had transformed our ML stack—building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:

            • 🔹 Scaling model inference without hitting infrastructure roadblocks
            • 🔹 Moving embedding search from batch to real-time for candidate generation

            Here’s how we tackled these last-mile challenges, broke free from infrastructure constraints, and built a cost-efficient, high-performance system.

            Breaking Free from the Scalability Ceiling

            The Model Serving Bottleneck—A Wake-Up Call


            July 2023. With just months left for the Mega Blockbuster Sale (MBS), we noticed a serious issue—scaling our model-serving infrastructure was taking 10–15 minutes. In real-time ML, that’s an eternity. In one of our war rooms, we ran a quick experiment:

            • 🚀 We deployed an XGBoost model on a self-hosted Triton Inference Server running on a 16-core machine.
            • 🚀 Fired requests and compared the outputs with our existing cloud-hosted setup.
            • 🚀 The results matched—perfectly.

            That moment changed everything. We prepped a backup Triton setup on EKS, just in case our cloud provider couldn't allocate enough compute resources in time. Luckily, they did—but the seed was planted.

            Then in October, just two weeks before MBS, we got an alarming response from our infrastructure team: "Node availability may be an issue."

            With no time to waste, we moved 30% of real-time ML traffic to our self-hosted Triton cluster. The results?

            • ✅ p99 latency dropped from 90–100ms to 30–40ms
            • ✅ Triton handled significantly higher throughput on fewer resources
            • ✅ No model changes were needed

            MBS ran without a hitch, proving that self-hosted inference was the way forward.


            Scaling Triton on GKE


            This left us with two choices:

            • 1️⃣ Port models to a managed cloud inference service, investing time in learning a new deployment stack
            • 2️⃣ Scale our existing Triton setup on GKE, optimizing for cost and performance

            We went with Option 2—and it slashed inference costs to 35% of what we previously paid, while giving us full control over scaling and optimizations.


            Fixing the Cold Start Problem


            As we onboarded more deep learning (DL) models, we hit a new bottleneck: new inference pods took 7–9 minutes to spin up.


            After profiling, we found the culprits:

            • Triton’s base image—a massive 5GB
            • Model binaries—often 1GB+
            • Startup delay—mostly due to downloading and initializing these assets

            To fix this, we built a lightweight Triton image, stripping unused components and shrinking the size to 900MB. This cut cold start times drastically, making auto-scaling faster and smoother.
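            Rough arithmetic shows why image size dominated the cold start; the 200 MB/s effective pull bandwidth assumed below is illustrative, not a measured figure:

```python
def pull_seconds(size_mb, bandwidth_mb_s=200):
    """Approximate time to pull an artifact at an assumed effective bandwidth."""
    return size_mb / bandwidth_mb_s

before = pull_seconds(5000) + pull_seconds(1000)  # 5GB base image + ~1GB model
after = pull_seconds(900) + pull_seconds(1000)    # slimmed 900MB image + model
print(before, after)  # -> 30.0 9.5
```

            Download time is only part of the 7–9 minute spin-up (initialization dominates the rest), but every layer stripped from the image is time removed from the autoscaling path.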


            Embedding Search: The Last Piece of the Puzzle


            By mid-2023, most of our ML stack had gone real-time—except for Candidate Generation (CG), which still ran in batch mode. To truly power real-time recommendations, we needed an online embedding search system.


            Choosing the Right Vector Database


            We benchmarked three production-ready vector DBs across key parameters:

            • Milvus
            • Qdrant
            • Weaviate

            After extensive POCs, Qdrant stood out for its:

            • ✅ Blazing-fast search latency on high-dimensional vectors
            • ✅ Efficient memory usage, crucial for in-memory workloads
            • ✅ Support for upserts and soft deletes, vital for Ads use cases
            • ✅ gRPC + REST APIs, making integration seamless
            • ✅ Powerful filtering, allowing fine-tuned retrieval (e.g., filtering Ads by category, active status, etc.)

            At its core, Qdrant uses HNSW indexing, delivering both high recall and low-latency nearest-neighbor search—a perfect fit for our needs.
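            Conceptually, the query Qdrant answers is filtered nearest-neighbour search. An exact brute-force sketch of that operation — HNSW approximates it at far lower latency, and the ads, vectors, and payload fields below are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query, points, top_k=2, flt=None):
    """Filtered exact nearest-neighbour search over payload-tagged points."""
    candidates = [p for p in points if flt is None or flt(p["payload"])]
    ranked = sorted(candidates, key=lambda p: cosine(query, p["vector"]), reverse=True)
    return [p["id"] for p in ranked[:top_k]]

points = [
    {"id": "ad-1", "vector": [1.0, 0.0], "payload": {"category": "shoes", "active": True}},
    {"id": "ad-2", "vector": [0.9, 0.1], "payload": {"category": "shoes", "active": False}},
    {"id": "ad-3", "vector": [0.0, 1.0], "payload": {"category": "bags",  "active": True}},
]
# Only active ads in the "shoes" category are eligible
hits = search([1.0, 0.05], points, top_k=2,
              flt=lambda pl: pl["active"] and pl["category"] == "shoes")
print(hits)  # -> ['ad-1']
```

            The payload filter mirrors Qdrant's filtering (category, active status), applied alongside the vector similarity rather than as a post-processing step.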


            Embedding Freshness & Real-Time Updates


            To ensure embeddings stayed up to date, we built a dual ingestion pipeline:

            • 📌 Daily Refresh: A bulk pipeline updated embeddings overnight
            • 📌 Real-Time Updates: Ads events triggered immediate upserts/deletes
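            The dual ingestion pattern reduces to two write paths into the same index — a bulk snapshot plus event-driven upserts and deletes. A toy sketch (class and method names are illustrative, not our production API):

```python
class EmbeddingIndex:
    """Toy in-memory index illustrating the dual ingestion pattern."""
    def __init__(self):
        self.vectors = {}

    def bulk_refresh(self, snapshot):     # nightly batch pipeline
        self.vectors = dict(snapshot)

    def upsert(self, item_id, vector):    # real-time ads event
        self.vectors[item_id] = vector

    def delete(self, item_id):            # e.g. an ad is deactivated
        self.vectors.pop(item_id, None)

idx = EmbeddingIndex()
idx.bulk_refresh({"p1": [0.1, 0.9], "p2": [0.8, 0.2]})
idx.upsert("p3", [0.5, 0.5])   # new product appears mid-day
idx.delete("p2")               # ad paused
print(sorted(idx.vectors))     # -> ['p1', 'p3']
```

            The bulk path guarantees eventual completeness; the event path guarantees freshness between refreshes.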

            This setup powered real-time "Similar Products" recommendations on the product page and became the foundation for Ads Candidate Generation, ensuring the right ads surfaced in milliseconds.


            Final Takeaways: Scaling Smartly for Real-Time ML

            • 🚀 Self-hosted inference on Triton gave us lower cost, faster scaling, and better performance than managed services
            • 🚀 Building a custom Triton image reduced cold starts, improving responsiveness
            • 🚀 Qdrant-based embedding search enabled real-time personalization at scale
            • 🚀 Real-time updates for embeddings unlocked dynamic, up-to-date recommendations

            By early 2024, Meesho’s ML stack had evolved into a fully real-time, scalable, and cost-efficient system, setting the foundation for even bigger leaps ahead.


            Building Meesho’s ML Platform: Lessons from the First-Gen System (Part 2)

            · 7 min read
            Bhawani Singh
            Architect @ Meesho
            Jigar Dave
            Lead Software Engineer @ Meesho
            Adarsha Das
            Senior Architect @ Meesho


            By late 2022, we had built something we were truly proud of—a real-time ML serving system with a DAG-based executor, a feature store, and an interaction store powering key ranking and personalization models. It was a major milestone, the culmination of months of effort from data scientists, ML engineers, and backend teams. Our system was live, and we were ready to push the boundaries of experimentation. And it worked. Mostly.


            -

            LLMOps: High-Level Architecture

            +

            LLMOps: High-Level Architecture

            LLM Architecture

            The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention.

            Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability.

            @@ -229,7 +229,7 @@ Because of these differences — sequential generation, growing memory requireme -

            Supported Inference backends (TensorRT LLM, Dynamo & vLLM)

            +

            Supported Inference backends (TensorRT LLM, Dynamo & vLLM)

            Tailored for the Use Case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases—whether a real-time voice bot requiring ultra-lowsub-second latency or a massive reasoning task requiring huge context windows—demand different runtime characteristics. Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:

            1. @@ -267,12 +267,12 @@ Because of these differences — sequential generation, growing memory requireme
          -

          Conclusion

          +

          Conclusion

          Large language model inference introduces a fundamentally new class of infrastructure challenges—where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads.

          The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle—from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity.

          Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows.

          Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment—allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences.

          -

          Future Explorations

          +

          Future Explorations

          While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:

          • TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs to bake it into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.
Tue, 21 May 2024 00:00:00 GMT BharatMLStack

Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale.

The platform implements a complete LLMOps lifecycle — from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, the platform lets users onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required.

In addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques — such as quantization strategies, batching configurations, and runtime-specific performance enhancements — enabling teams to balance latency, throughput, and cost based on their use case. The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference.

Why LLM Inference Is Not Just Bigger ML Model Serving

            Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled.

Autoregressive Generation and Sequential Computation:

Unlike traditional models such as classifiers or recommenders — where inference cost is relatively constant — LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation. Because tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution.

            Prefill and Decode Phases:

LLM inference typically consists of two distinct stages:

• Prefill phase — the model processes the input prompt and builds internal representations. This stage is compute-heavy and highly parallelizable.
• Decode phase — the model generates tokens sequentially, predicting one token at a time using previously generated context.

            The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads.
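The two phases can be sketched as a toy autoregressive loop. The `model_step` function below is a hypothetical stand-in for a transformer forward pass, purely to show why prefill can run as one parallelizable pass while decode must run once per generated token:

```python
# Toy sketch of the prefill/decode split in autoregressive inference.
# "model_step" is a hypothetical stand-in for a transformer forward pass.

def model_step(context: list[str]) -> str:
    # Pretend "model": derives the next token from the context length.
    return f"tok{len(context)}"

def generate(prompt_tokens: list[str], max_new_tokens: int) -> list[str]:
    # Prefill: the full prompt is processed in a single, parallelizable pass.
    context = list(prompt_tokens)

    # Decode: inherently sequential — each token depends on all prior tokens,
    # so this loop cannot be parallelized across steps.
    generated = []
    for _ in range(max_new_tokens):
        next_token = model_step(context)
        context.append(next_token)  # context grows (and, in practice, so does the KV cache)
        generated.append(next_token)
    return generated

print(generate(["hello", "world"], 3))  # → ['tok2', 'tok3', 'tok4']
```

The cost of `generate` scales with both prompt length (prefill work) and requested output length (number of decode iterations), which is exactly why per-request latency is so variable.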

Context Management and KV Caching:

Another fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens. KV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:

• Memory consumption grows with sequence length and batch size
• GPU memory becomes a critical bottleneck
• Efficient memory management becomes essential for scaling concurrent requests

            This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads.
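The memory pressure is easy to quantify with a back-of-the-envelope formula. The configuration below (32 layers, 8 KV heads, head dimension 128, FP16 cache) is an illustrative Llama-3-8B-like setup, not a measurement from the platform:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    # 2x for the separate Key and Value tensors cached per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative Llama-3-8B-like config with grouped-query attention, FP16 cache.
one_seq = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192, batch=1)
print(one_seq / 2**30)  # → 1.0 (GiB for a single 8K-token sequence)
```

Because the cache scales linearly with sequence length and batch size, 32 concurrent 8K-token requests under these assumptions would need roughly 32 GiB of GPU memory for the cache alone, before counting weights and activations.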

Dynamic and Irregular Workloads:

Traditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. As a result:

• Batch sizes must be dynamic rather than static
• Requests may enter and leave batches asynchronously
• Scheduling systems must continuously rebalance workloads to maximize GPU utilization

            These characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines.
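The idea behind continuous (in-flight) batching can be sketched with a toy scheduler — requests with different output lengths join and leave the running batch at every decode step, instead of waiting for the slowest member of a static batch. This is a simplified illustration, not the actual engine logic:

```python
from collections import deque

def continuous_batching(requests, max_batch: int):
    """requests: list of (request_id, tokens_to_generate).
    Returns, per decode step, the ids occupying a batch slot — showing
    slots refilled the moment a request finishes."""
    waiting = deque(requests)
    running = {}  # request_id -> remaining tokens to generate
    steps = []
    while waiting or running:
        # Admit new requests into free slots at every step (in-flight batching).
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        steps.append(sorted(running))
        # One decode step: every running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # slot freed immediately for the next request
    return steps

# "A" needs 3 tokens, "B" 1, "C" 2; only 2 slots.
# "C" takes B's slot as soon as B finishes, keeping the GPU busy.
print(continuous_batching([("A", 3), ("B", 1), ("C", 2)], max_batch=2))
# → [['A', 'B'], ['A', 'C'], ['A', 'C']]
```

With static batching, "C" would have had to wait for the entire first batch (including the 3-token "A") to drain before starting.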

Streaming and User Experience Constraints:

Another distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated. Because of these differences — sequential generation, growing memory requirements, dynamic workloads, and streaming constraints — LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads.
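Token streaming maps naturally onto a generator: each token is handed to the consumer as soon as it is produced, rather than buffering the whole response. A minimal sketch (the delay simply stands in for one decode step; this is not the platform's transport layer):

```python
import time

def stream_tokens(tokens, inter_token_delay: float = 0.0):
    # Yield each token as soon as it is "generated" — the consumer can
    # render partial output instead of waiting for the full response.
    for tok in tokens:
        time.sleep(inter_token_delay)  # stands in for one decode step
        yield tok

# A client renders output incrementally as tokens arrive:
response = []
for tok in stream_tokens(["Hel", "lo", ",", " world"]):
    response.append(tok)  # e.g., flush to the UI / gRPC stream here
print("".join(response))  # → Hello, world
```

In a real deployment the yield point would write to a gRPC or SSE stream, which is why per-token latency (not just total latency) becomes a first-class serving metric.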

LLMOps: High-Level Architecture

LLM Architecture

The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention.

Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability.
1. Onboarding & Registration (The Source of Truth)

   The lifecycle begins with the Data Scientist or engineer.

   • Model Ingestion: Users onboard models—whether open-source (Hugging Face, NeMo) or internally fine-tuned—via the Truffle Box SDK/UI.
   • LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. This allows Data Scientists to register and version-control prompts (e.g., "customer_support_v2") independently of the application code.

2. The "Black Box" Build Engine

   Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources.

   • Transformation: The raw model is converted into a TRT-LLM Checkpoint.
   • Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint.
   • Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware.

3. Intelligent Profiling & Validation

   Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler.

   • Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. vLLM).
   • Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost.

4. Smart Artifact Generation & Distribution

   To solve the Kubernetes "Cold Start" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:

   • Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup.
   • Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times.

5. Image Streaming & Deployment

   Simultaneously, the inference runtime container images are pulled from the Artifact Registry.

   • Image Streaming: We utilize container image streaming to allow pods to start initializing while the massive Triton/Dynamo container layers are still downloading, further shaving seconds off the startup time.

6. The Inference Runtime (Kubernetes)

   The workload lands on Kubernetes with Autoscaling.

   • Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference.
   • Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk ("Pull from Disk").

7. Client Interaction & Observability

   Finally, the LLM Inference Client executes the request.

   • Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used.
   • Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time.

Observability: Monitoring the Pulse of GenAI

In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. A user doesn't care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows.

To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:

1. Time to First Token (TTFT)

   • Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user.
   • Why it matters: This represents the "Prefill Phase" latency—the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or "hung."
   • Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hit rates), which drastically lowers this metric by skipping redundant prompt processing.

2. Inter-Token Latency (ITL)

   • Definition: ITL measures the average time interval between the generation of consecutive tokens during the "Decode Phase".
   • Why it matters: This defines the "perceived speed" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look "jerky" or slow to the user.
   • Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speeds to maintain a natural conversational flow.

3. Token Throughput vs. Request Throughput

   We distinguish between two types of throughput to balance system efficiency with user load:

   • Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching.
   • Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before the queue depth impacts ITL.

The Monitoring Stack

• Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot "slow generation" incidents that generic "500 error" alerts would miss.
• Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific "slow" request back to its prompt to understand if a complex input caused the latency spike.
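The TTFT and ITL definitions above reduce to simple arithmetic over per-token arrival timestamps. A minimal sketch (the timestamp values and function name are illustrative, not the platform's actual instrumentation schema):

```python
def ttft_and_itl(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """TTFT: delay until the first token arrives.
    ITL: mean gap between consecutive token arrivals."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Request received at t=0.0s; first token at 0.25s, then one token every ~40ms.
ttft, itl = ttft_and_itl(0.0, [0.25, 0.29, 0.33, 0.37])
print(f"TTFT={ttft:.3f}s  ITL={itl:.3f}s")  # → TTFT=0.250s  ITL=0.040s
```

Note how the two metrics isolate the two phases: TTFT is dominated by prefill (and prefix-cache effectiveness), while ITL reflects steady-state decode speed under the current batch load.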

Supported Inference Backends (TensorRT-LLM, Dynamo & vLLM)

Tailored for the Use Case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases—whether a real-time voice bot requiring ultra-low, sub-second latency or a massive reasoning task requiring huge context windows—demand different runtime characteristics. Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:

1. TensorRT-LLM: The High-Performance Standard

   Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots).

   TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization.

   Key optimizations we tailor for these high-load cases include:

   • Optimized execution via TensorRT engine compilation
   • Quantization-aware execution for reduced memory usage and improved throughput
   • Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization.
   • Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms.

2. Dynamo: Distributed Inference for Reasoning Models

   Suitable for: Very large "reasoning" models (70B+) or scenarios requiring massive context windows where a single GPU's memory is insufficient.

   For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework. Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:

   • KV Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation.
   • Prefill vs. Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy "reading" phase independently from the memory-heavy "writing" phase.
   • Distributed execution across multiple GPU resources

3. vLLM: The Flexible Baseline

   Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput.

   While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline:

   • High throughput through dynamic batching and efficient memory utilization
   • Paged KV cache management for handling long contexts and concurrent requests
   • Strong support for open-source model ecosystems
   • Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build.
   • Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. We use it strategically for initial testing before committing to a full TensorRT optimization pipeline.
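The backend-selection logic described above can be approximated by a simple heuristic. The thresholds and field names below are illustrative only — in the real platform this decision is made empirically by the Hardware & Inference Runtime Profiler from benchmark data:

```python
def select_backend(model_params_b: float, latency_sla_ms: float,
                   fits_on_one_gpu: bool, trt_supported: bool) -> str:
    # Very large reasoning models or context windows exceeding one GPU
    # -> disaggregated, distributed serving with Dynamo.
    if not fits_on_one_gpu or model_params_b >= 70:
        return "dynamo"
    # Tight latency SLAs on architectures with a TensorRT-LLM build
    # -> compiled, quantized TRT engine.
    if trt_supported and latency_sla_ms <= 500:
        return "tensorrt-llm"
    # New architectures, prototypes, low-traffic tools -> vLLM baseline.
    return "vllm"

print(select_backend(8, 300, True, True))      # → tensorrt-llm
print(select_backend(405, 2000, False, True))  # → dynamo
print(select_backend(8, 300, True, False))     # → vllm
```

The value of keeping this decision in one place is that new runtimes (or hardware targets) can be added as extra branches without touching deployment workflows.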

            Conclusion

Large language model inference introduces a fundamentally new class of infrastructure challenges—where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads.

The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle—from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity.

Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows.

Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment—allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences.

            Future Explorations

While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:

• TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs for integration into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.
• Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a "serverless" experience where specific fine-tunes are hot-swapped instantly per request.
• Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user's streaming experience.
• Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., "How do I reset my password?" vs. "Password reset steps"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.
• Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before the memory fills up, preventing eviction-based slowdowns during traffic spikes.
• Online Evaluation & Guardrails: We are integrating a lightweight "Trust Layer" into the proxy. This will allow for low-latency input/output filtering (Guardrails) and asynchronous "LLM-as-a-Judge" evaluation pipelines to monitor response quality in production, not just system health.
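The semantic-caching idea can be illustrated with cosine similarity over query embeddings. The embeddings, threshold, and class below are toy assumptions for illustration — a production version would use a real embedding model and a vector database:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_response)

    def get(self, embedding):
        # Return a cached response for any sufficiently similar past query,
        # bypassing the GPU entirely; linear scan stands in for a vector-DB lookup.
        for emb, response in self.entries:
            if cosine(embedding, emb) >= self.threshold:
                return response
        return None

    def put(self, embedding, response):
        self.entries.append((embedding, response))

cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0, 0.1], "To reset your password, ...")
# A near-identical embedding (e.g., "Password reset steps") hits the cache:
print(cache.get([0.99, 0.02, 0.12]))
# An unrelated query misses and falls through to the model:
print(cache.get([0.0, 1.0, 0.0]))  # → None
```

The threshold trades correctness for hit rate: too low and semantically different queries get stale answers, too high and the cache degenerates into exact prefix matching.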
]]>

            By mid-2023, we had transformed our ML stack—building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:

• 🔹 Scaling model inference without hitting infrastructure roadblocks
• 🔹 Moving embedding search from batch to real-time for candidate generation

            Here’s how we tackled these last-mile challenges, broke free from infrastructure constraints, and built a cost-efficient, high-performance system.

Breaking Free from the Scalability Ceiling

The Model Serving Bottleneck—A Wake-Up Call

July 2023. With just months left for the Mega Blockbuster Sale (MBS), we noticed a serious issue—scaling our model-serving infrastructure was taking 10–15 minutes. In real-time ML, that’s an eternity. In one of our war rooms, we ran a quick experiment:

• 🚀 We deployed an XGBoost model on a self-hosted Triton Inference Server running on a 16-core machine.
• 🚀 Fired requests and compared the outputs with our existing cloud-hosted setup.
• 🚀 The results matched—perfectly.

That moment changed everything. We prepped a backup Triton setup on EKS, just in case our cloud provider couldn't allocate enough compute resources in time. Luckily, they did, but the seed was planted.

Then in October, just two weeks before MBS, we got an alarming response from our infrastructure team: "Node availability may be an issue." With no time to waste, we moved 30% of real-time ML traffic to our self-hosted Triton cluster. The results?

- ✅ p99 latency dropped from 90–100ms to 30–40ms
- ✅ Triton handled significantly higher throughput on fewer resources
- ✅ No model changes were needed

            MBS ran without a hitch, proving that self-hosted inference was the way forward.

### Scaling Triton on GKE

This left us with two choices:

- 1️⃣ Port models to a managed cloud inference service, investing time in learning a new deployment stack
- 2️⃣ Scale our existing Triton setup on GKE, optimizing for cost and performance

            We went with Option 2—and it slashed inference costs to 35% of what we previously paid, while giving us full control over scaling and optimizations.

### Fixing the Cold Start Problem

As we onboarded more deep learning (DL) models, we hit a new bottleneck: new inference pods took 7–9 minutes to spin up.

After profiling, we found the culprits:

- Triton's base image: a massive 5GB
- Model binaries: often 1GB+
- Startup delay: mostly due to downloading and initializing these assets

            To fix this, we built a lightweight Triton image, stripping unused components and shrinking the size to 900MB. This cut cold start times drastically, making auto-scaling faster and smoother.

## Embedding Search: The Last Piece of the Puzzle

By mid-2023, most of our ML stack had gone real-time, except for Candidate Generation (CG), which still ran in batch mode. To truly power real-time recommendations, we needed an online embedding search system.

### Choosing the Right Vector Database

            We benchmarked three production-ready vector DBs across key parameters:

- Milvus
- Qdrant
- Weaviate

            After extensive POCs, Qdrant stood out for its:

- ✅ Blazing-fast search latency on high-dimensional vectors
- ✅ Efficient memory usage, crucial for in-memory workloads
- ✅ Support for upserts and soft deletes, vital for Ads use cases
- ✅ gRPC + REST APIs, making integration seamless
- ✅ Powerful filtering, allowing fine-tuned retrieval (e.g., filtering Ads by category, active status, etc.)

            At its core, Qdrant uses HNSW indexing, delivering both high recall and low-latency nearest-neighbor search—a perfect fit for our needs.
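Qdrant evaluates payload filters alongside HNSW traversal; the brute-force sketch below is only meant to illustrate the filter-then-rank semantics described above. All data, field names, and the helper itself are made up for illustration:

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def filtered_search(query, points, top_k=2, **filters):
    """Brute-force stand-in for filtered ANN search: apply metadata
    filters first, then rank the survivors by cosine similarity."""
    candidates = [
        p for p in points
        if all(p["payload"].get(k) == v for k, v in filters.items())
    ]
    candidates.sort(key=lambda p: cosine(query, p["vector"]), reverse=True)
    return [p["id"] for p in candidates[:top_k]]

ads = [
    {"id": "ad1", "vector": [1.0, 0.0], "payload": {"category": "shoes", "active": True}},
    {"id": "ad2", "vector": [0.9, 0.1], "payload": {"category": "shoes", "active": False}},
    {"id": "ad3", "vector": [0.0, 1.0], "payload": {"category": "shoes", "active": True}},
]

# Only active shoe ads are ranked; ad2 is filtered out despite being close.
result = filtered_search([1.0, 0.0], ads, top_k=2, category="shoes", active=True)
```

A real deployment would express the same filter as a query-time condition in the vector database rather than scanning candidates linearly.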

### Embedding Freshness & Real-Time Updates

To ensure embeddings stayed up to date, we built a dual ingestion pipeline:

- 📌 Daily Refresh: A bulk pipeline updated embeddings overnight
- 📌 Real-Time Updates: Ads events triggered immediate upserts/deletes

            This setup powered real-time "Similar Products" recommendations on the product page and became the foundation for Ads Candidate Generation, ensuring the right ads surfaced in milliseconds.


## Final Takeaways: Scaling Smartly for Real-Time ML

- 🚀 Self-hosted inference on Triton gave us lower cost, faster scaling, and better performance than managed services
- 🚀 Building a custom Triton image reduced cold starts, improving responsiveness
- 🚀 Qdrant-based embedding search enabled real-time personalization at scale
- 🚀 Real-time updates for embeddings unlocked dynamic, up-to-date recommendations

            By early 2024, Meesho’s ML stack had evolved into a fully real-time, scalable, and cost-efficient system, setting the foundation for even bigger leaps ahead.

To represent these groups efficiently, we adopted a layered storage approach:

The expiry timestamp and schema version were appended to the end of the string using a semicolon delimiter.

Example:

```
feature_1_value,feature_2_value,feature_3_value;expiry_ts
```

          This format allowed:

- Consistent writes and reads at the group level
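As a sketch, the delimited group string can be produced and parsed with plain string operations. The helpers below assume both the expiry timestamp and the schema version are appended after the feature values, as described above; the function names are illustrative:

```python
def serialize_group(values, expiry_ts, schema_version):
    # Comma-separated feature values, then expiry timestamp and schema
    # version appended with semicolon delimiters.
    return ",".join(values) + ";" + str(expiry_ts) + ";" + str(schema_version)

def parse_group(raw):
    # Split the two trailing metadata fields off first, then the values.
    body, expiry_ts, schema_version = raw.rsplit(";", 2)
    return body.split(","), int(expiry_ts), int(schema_version)

raw = serialize_group(["v1", "v2", "v3"], 1735689600, 2)
values, expiry, version = parse_group(raw)
```

Using `rsplit` from the right keeps parsing correct even if individual feature values never contain the delimiter, which is the invariant this format relies on.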
For the 0th version of the Interaction Store, we focused on a d

### Storage Structure

Each user's interactions were stored using a composite key format, uniquely identifying the user and interaction type. This structure allowed efficient organization and quick retrieval of recent activity for recommendation generation:

```
userId_eventType → ZSET[...(pid, ts)...]
```

          Within each ZSET:

- The timestamp served as the score, maintaining temporal order
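In Redis terms, this layout maps to `ZADD` on write and `ZREVRANGE` on read. The in-memory sketch below mimics those semantics; the `max_events` cap and all names are assumptions for illustration, not the production implementation:

```python
from collections import defaultdict

class InteractionStore:
    """In-memory sketch of the ZSET layout: one sorted set per
    userId_eventType key, with the event timestamp as the score."""

    def __init__(self, max_events=100):
        self.max_events = max_events
        self.zsets = defaultdict(dict)  # key -> {pid: ts}

    def record(self, user_id, event_type, pid, ts):
        key = f"{user_id}_{event_type}"
        zset = self.zsets[key]
        zset[pid] = ts  # upsert keeps one entry per product
        # Cap the window: evict the oldest interaction beyond max_events.
        if len(zset) > self.max_events:
            oldest = min(zset, key=zset.get)
            del zset[oldest]

    def recent(self, user_id, event_type, n=10):
        # Highest score (most recent timestamp) first, like ZREVRANGE.
        key = f"{user_id}_{event_type}"
        ranked = sorted(self.zsets[key].items(), key=lambda kv: kv[1], reverse=True)
        return [pid for pid, _ in ranked[:n]]

store = InteractionStore(max_events=3)
store.record("u1", "click", "p1", 100)
store.record("u1", "click", "p2", 200)
store.record("u1", "click", "p3", 300)
store.record("u1", "click", "p4", 400)  # evicts p1, the oldest
```

Keeping the timestamp as the score means "most recent N interactions" is a single reverse range scan per composite key.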

            4 posts tagged with "bharatmlstack"

            View All Tags

            LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

            · 5 min read
            Jaya Kumar
            Lead ML Engineer @ Meesho

            BharatMLStack

            LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

            Raw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack—from memory management to kernel execution.

            1. Advanced Memory Management: Paged & Prefix KV Caching


Voice bot qu

| Inference runtime | Concurrency | p99 TTFT (ms) | p99 ITL (ms) | Token throughput (tokens/sec) | Request throughput (req/sec) | Hardware |
|---|---|---|---|---|---|---|
| TensorRT-LLM | 1 | 36.27 | 22.78 | 45.66 | 0.23 | L4 |
| TensorRT-LLM | 2 | 49.81 | 23.21 | 89.37 | 0.45 | L4 |
| TensorRT-LLM | 4 | 55.33 | 36.62 | 153.39 | 0.78 | L4 |
| TensorRT-LLM | 8 | 66.5 | 39.11 | 279.88 | 1.47 | L4 |
| TensorRT-LLM | 16 | 131.8 | 30.39 | 547.8 | 2.77 | L4 |
| TensorRT-LLM | 32 | 277.22 | 48.02 | 925.7 | 4.78 | L4 |
| TensorRT-LLM | 64 | 498.52 | 71.62 | 1,164.40 | 6.2 | L4 |
| TensorRT-LLM | 128 | 677.31 | 120.37 | 1,445.18 | 7.69 | L4 |
| TensorRT-LLM | 256 | 1,926.31 | 216.88 | 1,600.81 | 8.52 | L4 |
| TensorRT-LLM | 1 | 21.17 | 9.24 | 130.05 | 0.68 | A100 |
| TensorRT-LLM | 2 | 25.78 | 9.21 | 264.5 | 1.35 | A100 |
| TensorRT-LLM | 4 | 28.52 | 10.99 | 437.69 | 2.27 | A100 |
| TensorRT-LLM | 8 | 34.4 | 12.61 | 760.49 | 3.96 | A100 |
| TensorRT-LLM | 16 | 68.03 | 14.32 | 1,343.80 | 7.01 | A100 |
| TensorRT-LLM | 32 | 185.96 | 16.82 | 2,287.30 | 11.92 | A100 |
| TensorRT-LLM | 64 | 136.87 | 21.17 | 3,625.22 | 18.89 | A100 |
| TensorRT-LLM | 128 | 463.78 | 34.15 | 4,456.51 | 23.24 | A100 |
| TensorRT-LLM | 256 | 890.12 | 59.18 | 5,188.24 | 27.05 | A100 |

## Conclusion

High-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure.

            These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications.

            Cracking the Code: Scaling Model Inference & Real-Time Embedding Search

            · 4 min read
            Aditya Kumar
            Lead Software Engineer @ Meesho
            Jaya Kumar
            Lead ML Engineer @ Meesho
            Adarsha Das
            Senior Architect @ Meesho

            BharatMLStack


            Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

            · 14 min read
            Jaya Kumar
            Lead ML Engineer @ Meesho

            BharatMLStack

## Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale.

The platform implements a complete LLMOps lifecycle, from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required.

In addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques, such as quantization strategies, batching configurations, and runtime-specific performance enhancements, enabling teams to balance latency, throughput, and cost based on their use case. The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference.

## Why LLM Inference Is Not Just Bigger ML Model Serving

Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled.

### Autoregressive Generation and Sequential Computation

Unlike traditional models such as classifiers or recommenders, where inference cost is relatively constant, LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation. Because tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution.

### Prefill and Decode Phases

LLM inference typically consists of two distinct stages:

- Prefill phase: the model processes the input prompt and builds internal representations. This stage is compute-heavy and highly parallelizable.
- Decode phase: the model generates tokens sequentially, predicting one token at a time using previously generated context.

            The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads.

### Context Management and KV Caching

Another fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens. KV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:

- Memory consumption grows with sequence length and batch size
- GPU memory becomes a critical bottleneck
- Efficient memory management becomes essential for scaling concurrent requests

            This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads.
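A back-of-the-envelope estimate makes the memory growth concrete. The shapes below are illustrative (a Llama-2-7B-like configuration in fp16), not a specific production model:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    # 2x for the separate key and value tensors cached per layer;
    # bytes_per_elem=2 corresponds to fp16/bf16 storage.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Illustrative 7B-class shape: 32 layers, 32 KV heads, head_dim 128.
per_seq = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
gib = per_seq / 2**30  # roughly 2 GiB for a single 4096-token sequence
```

At batch size 32, the same arithmetic yields about 64 GiB of KV cache alone, before model weights, which is exactly why paged allocation and cache-aware scheduling become essential at scale.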

### Dynamic and Irregular Workloads

Traditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. As a result:

- Batch sizes must be dynamic rather than static
- Requests may enter and leave batches asynchronously
- Scheduling systems must continuously rebalance workloads to maximize GPU utilization

            These characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines.

### Streaming and User Experience Constraints

Another distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated. Because of these differences (sequential generation, growing memory requirements, dynamic workloads, and streaming constraints), LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads.

## LLMOps: High-Level Architecture

*(Figure: LLM Architecture)*

The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention.

Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability.

1. **Onboarding & Registration (The Source of Truth)**

   The lifecycle begins with the Data Scientist or engineer.

   - Model Ingestion: Users onboard models, whether open-source (Hugging Face, NeMo) or internally fine-tuned, via the Truffle Box SDK/UI.
   - LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. This allows Data Scientists to register and version-control prompts (e.g., "customer_support_v2") independently of the application code.

2. **The "Black Box" Build Engine**

   Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources.

   - Transformation: The raw model is converted into a TRT-LLM Checkpoint.
   - Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint.
   - Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware.

3. **Intelligent Profiling & Validation**

   Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler.

   - Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. vLLM).
   - Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost.

4. **Smart Artifact Generation & Distribution**

   To solve the Kubernetes "Cold Start" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:

   - Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup.
   - Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times.

5. **Image Streaming & Deployment**

   Simultaneously, the inference runtime container images are pulled from the Artifact Registry.

   - Image Streaming: We utilize container image streaming to allow pods to start initializing while the massive Triton/Dynamo container layers are still downloading, further shaving seconds off the startup time.

6. **The Inference Runtime (Kubernetes)**

   The workload lands on Kubernetes with autoscaling.

   - Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference.
   - Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk ("Pull from Disk").

7. **Client Interaction & Observability**

   Finally, the LLM Inference Client executes the request.

   - Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used.
   - Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time.

### Observability: Monitoring the Pulse of GenAI

In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. A user doesn't care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows.

To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:

1. **Time to First Token (TTFT)**

   - Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user.
   - Why it matters: This represents the "Prefill Phase" latency, the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or "hung."
   - Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hit rates), which drastically lowers this metric by skipping redundant prompt processing.


2. **Inter-Token Latency (ITL)**

   - Definition: ITL measures the average time interval between the generation of consecutive tokens during the "Decode Phase".
   - Why it matters: This defines the "perceived speed" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look "jerky" or slow to the user.
   - Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speeds to maintain a natural conversational flow.
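Both TTFT and ITL fall out of per-token timestamps captured at the streaming client or proxy. A minimal sketch of that computation, with an illustrative helper name and synthetic timestamps:

```python
def ttft_and_itl(request_ts, token_ts):
    """Compute Time-To-First-Token and mean Inter-Token Latency (both in ms)
    from a request start time and per-token emission timestamps (seconds)."""
    ttft = (token_ts[0] - request_ts) * 1000.0
    gaps = [(b - a) * 1000.0 for a, b in zip(token_ts, token_ts[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Request arrives at t=0s; first token at 0.12s, then one token every 20ms.
ts = [0.12 + 0.02 * i for i in range(5)]
ttft, itl = ttft_and_itl(0.0, ts)
```

In a real pipeline these timestamps would be recorded per streamed chunk, and the p99 over many requests, rather than a single mean, would drive alerting.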


3. **Token Throughput vs. Request Throughput**

   We distinguish between two types of throughput to balance system efficiency with user load:

   - Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching.
   - Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before the queue depth impacts ITL.


4. **The Monitoring Stack**

   - Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot "slow generation" incidents that generic "500 error" alerts would miss.
   - Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific "slow" request back to its prompt to understand if a complex input caused the latency spike.


## Supported Inference Backends (TensorRT-LLM, Dynamo & vLLM)

Tailored for the use case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases, whether a real-time voice bot requiring ultra-low sub-second latency or a massive reasoning task requiring huge context windows, demand different runtime characteristics. Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:

1. **TensorRT-LLM: The High-Performance Standard**

              Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots).

              TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization.

              Key optimizations we tailor for these high-load cases include:

              • Optimized execution via TensorRT engine compilation
              • Quantization-aware execution for reduced memory usage and improved throughput
              • Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization.
              • Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms.
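The payoff of inflight batching is easiest to see in a toy simulation. The sketch below is not TensorRT-LLM's actual scheduler; it only contrasts the two policies: with continuous batching, finished sequences free their slot immediately and queued requests join the next decode step, instead of the whole batch idling until its longest sequence completes.

```python
# Toy simulation: inflight (continuous) batching vs static batching.
from collections import deque

def run_inflight(requests, max_batch):
    """requests: list of (id, tokens_to_generate). Returns decode steps used."""
    pending = deque(requests)
    active = {}          # id -> tokens remaining
    steps = 0
    while pending or active:
        # Backfill free batch slots from the queue at every step.
        while pending and len(active) < max_batch:
            rid, n = pending.popleft()
            active[rid] = n
        # One decode step generates one token for every active sequence.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # slot freed mid-batch, no waiting
        steps += 1
    return steps

def run_static(requests, max_batch):
    """Static batching: each batch waits for its longest sequence."""
    steps = 0
    for i in range(0, len(requests), max_batch):
        batch = requests[i:i + max_batch]
        steps += max(n for _, n in batch)
    return steps

# Two long and two short generations: inflight finishes in 14 steps,
# static batching needs 20, because short requests never block slots.
reqs = [("a", 10), ("b", 2), ("c", 2), ("d", 10)]
```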
            2. Dynamo: Distributed Inference for Reasoning Models

              Suitable for: Very large "reasoning" models (70B+) or scenarios requiring massive context windows where a single GPU's memory is insufficient.

              For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework. Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:

              • KV Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation.
              • Prefill vs. Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy "reading" phase independently from the memory-heavy "writing" phase.
              • Distributed execution across multiple GPU resources
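The KV-aware routing idea can be illustrated without any inference engine at all. This is a hedged sketch, not Dynamo's actual router: it fingerprints a prompt's prefix and pins that prefix to whichever worker first prefilled it, so a follow-up request sharing the prefix lands on a worker that already holds the KV cache.

```python
# Minimal illustration of KV-aware routing (not Dynamo's implementation):
# requests whose prompts share a prefix are routed to the worker that
# already holds that prefix's KV cache, skipping a redundant prefill.
import hashlib

class KVAwareRouter:
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.cache_owner = {}   # prefix fingerprint -> worker id
        self.next_rr = 0        # round-robin fallback for cold prefixes

    def _fingerprint(self, prompt, prefix_tokens=8):
        prefix = " ".join(prompt.split()[:prefix_tokens])
        return hashlib.sha256(prefix.encode()).hexdigest()

    def route(self, prompt):
        """Returns (worker_id, cache_hit)."""
        fp = self._fingerprint(prompt)
        if fp in self.cache_owner:            # warm: reuse cached KV
            return self.cache_owner[fp], True
        worker = self.next_rr % self.num_workers
        self.next_rr += 1
        self.cache_owner[fp] = worker         # this worker now owns the cache
        return worker, False

router = KVAwareRouter(num_workers=4)
w1, hit1 = router.route("You are a support agent for Meesho. User asks: where is my order?")
w2, hit2 = router.route("You are a support agent for Meesho. User asks: how do I return an item?")
```

Both requests share the same system-prompt prefix, so the second one is routed to the first worker as a cache hit.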
            3. vLLM: The Flexible Baseline

              Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput.

              While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline.

              • High throughput through dynamic batching and efficient memory utilization
              • Paged KV cache management for handling long contexts and concurrent requests
              • Strong support for open-source model ecosystems
              • Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build.
              • Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. We use it strategically for initial testing before committing to a full TensorRT optimization pipeline.

            Conclusion


            Large language model inference introduces a fundamentally new class of infrastructure challenges—where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads.


            The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle—from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity.


            Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows.


            Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment—allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences.

            Future Explorations

            While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:

            • TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs for integration into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.
            • Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a "serverless" experience where specific fine-tunes are hot-swapped instantly per request.
            • Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user's streaming experience.
            • Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., "How do I reset my password?" vs. "Password reset steps"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.
            • Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before memory fills up, preventing eviction-based slowdowns during traffic spikes.
            • Online Evaluation & Guardrails: We are integrating a lightweight "Trust Layer" into the proxy. This will allow for low-latency input/output filtering (Guardrails) and asynchronous "LLM-as-a-Judge" evaluation pipelines to monitor response quality in production, not just system health.
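The semantic-caching idea above can be sketched in pure Python. This is a hedged toy, not the planned production layer: a real deployment would use a learned embedding model and a vector database, while here a bag-of-words vector and brute-force cosine similarity stand in to show the hit/miss semantics.

```python
# Sketch of a semantic cache: a query whose embedding is close enough to
# a cached one reuses the stored response and skips the GPU entirely.
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a learned embedding model.
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.6):
        self.entries = []          # list of (embedding, response)
        self.threshold = threshold

    def lookup(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]          # cache hit: GPU bypassed
        return None                 # miss: fall through to inference

    def insert(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.insert("how do i reset my password", "Go to Settings > Reset Password.")
hit = cache.lookup("how do i reset my password please")
miss = cache.lookup("what is the delivery fee")
```

The threshold trades hit rate against the risk of serving a semantically wrong cached answer, which is why production caches tune it per use case.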


              One post tagged with "embedding-search"

              View All Tags

              Cracking the Code: Scaling Model Inference & Real-Time Embedding Search

              · 4 min read
              Aditya Kumar
              Lead Software Engineer @ Meesho
              Jaya Kumar
              Lead ML Engineer @ Meesho
              Adarsha Das
              Senior Architect @ Meesho

              BharatMLStack


              By mid-2023, we had transformed our ML stack—building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:


                One post tagged with "inferflow"

                View All Tags

                Building Meesho’s ML Platform: Lessons from the First-Gen System (Part 2)

                · 7 min read
                Bhawani Singh
                Architect @ Meesho
                Jigar Dave
                Lead Software Engineer @ Meesho
                Adarsha Das
                Senior Architect @ Meesho

                BharatMLStack


                Building Meesho’s ML Platform: Lessons from the First-Gen System (Part 2)

                By late 2022, we had built something we were truly proud of—a real-time ML serving system with a DAG-based executor, a feature store, and an interaction store powering key ranking and personalization models. It was a major milestone, the culmination of months of effort from data scientists, ML engineers, and backend teams. Our system was live, and we were ready to push the boundaries of experimentation. And it worked. Mostly.

                2 posts tagged with "interaction-store"

                View All Tags

                Building Meesho’s ML Platform: Lessons from the First-Gen System (Part 2)

                · 7 min read
                Bhawani Singh
                Architect @ Meesho
                Jigar Dave
                Lead Software Engineer @ Meesho
                Adarsha Das
                Senior Architect @ Meesho

                BharatMLStack


                Building Meesho’s ML Platform: Lessons from the First-Gen System (Part 2)

                By late 2022, we had built something we were truly proud of—a real-time ML serving system with a DAG-based executor, a feature store, and an interaction store powering key ranking and personalization models. It was a major milestone, the culmination of months of effort from data scientists, ML engineers, and backend teams. Our system was live, and we were ready to push the boundaries of experimentation. And it worked. Mostly.

                feature_1_value,feature_2_value,feature_3_value;expiry_ts

                This format allowed:

                • Consistent writes and reads at the group level
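The packed group-level value shown above (comma-separated feature values, then the expiry timestamp after a semicolon) can be round-tripped with a few lines of Python. This is an illustrative sketch under the stated format only; the function names and the expiry-as-miss behavior are assumptions, not the store's actual API, and field ordering would come from the feature group's schema.

```python
# Hedged sketch: pack/unpack for the group-level value format
# "feature_1_value,feature_2_value,feature_3_value;expiry_ts".
def pack_group(values, expiry_ts):
    # Values are stored positionally, per the group's schema order.
    return ",".join(str(v) for v in values) + ";" + str(expiry_ts)

def unpack_group(raw, now=None):
    # Split the expiry off the end; rsplit guards against ';' in values.
    values_part, expiry_part = raw.rsplit(";", 1)
    expiry_ts = int(expiry_part)
    if now is not None and now >= expiry_ts:
        return None                      # expired: treat as a cache miss
    return values_part.split(","), expiry_ts

raw = pack_group(["0.82", "17", "delhi"], 1_700_000_000)
values, expiry = unpack_group(raw, now=1_600_000_000)
expired = unpack_group(raw, now=1_800_000_000)
```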

                  Why Redis?

                  Storage Structure

                  Each user’s interactions were stored using a composite key format, uniquely identifying the user and interaction type. This structure allowed efficient organization and quick retrieval of recent activity for recommendation generation:

                  userId_eventType → ZSET[...(pid, ts)...]

                  Within each ZSET:

                  • The timestamp served as the score, maintaining temporal order
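The composite-key ZSET layout above can be mimicked without a Redis dependency. The sketch below is a minimal in-process stand-in (not the production store): a `userId_eventType` key maps to a timestamp-ordered list, with the timestamp as the score so the most recent activity reads first, and a size cap playing the role of trimming old entries.

```python
# In-memory stand-in for the Redis interaction store:
# "userId_eventType" -> score-ordered set of (ts, pid).
import bisect

class InteractionStore:
    def __init__(self, max_per_key=100):
        self.zsets = {}              # key -> sorted list of (ts, pid)
        self.max_per_key = max_per_key

    def add(self, user_id, event_type, pid, ts):
        key = f"{user_id}_{event_type}"
        zset = self.zsets.setdefault(key, [])
        bisect.insort(zset, (ts, pid))       # timestamp keeps temporal order
        if len(zset) > self.max_per_key:     # trim oldest, like ZREMRANGEBYRANK
            zset.pop(0)

    def recent(self, user_id, event_type, n):
        # Most recent interactions first, for recommendation generation.
        key = f"{user_id}_{event_type}"
        return [pid for ts, pid in reversed(self.zsets.get(key, []))][:n]

store = InteractionStore()
store.add("u1", "click", "p9", ts=100)
store.add("u1", "click", "p3", ts=250)
store.add("u1", "order", "p3", ts=260)
recent_clicks = store.recent("u1", "click", n=2)
```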

                    2 posts tagged with "llm"

                    View All Tags

                    LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

                    · 5 min read
                    Jaya Kumar
                    Lead ML Engineer @ Meesho

                    BharatMLStack


                    LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

                    Raw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack—from memory management to kernel execution.

                    1. Advanced Memory Management: Paged & Prefix KV Caching


                    Voice bot qu

                    | Inference runtime | Concurrency | p99 TTFT (ms) | p99 ITL (ms) | Token throughput (tokens/sec) | Request throughput (req/sec) | Hardware |
                    |---|---|---|---|---|---|---|
                    | TensorRT-LLM | 1 | 36.27 | 22.78 | 45.66 | 0.23 | L4 |
                    | TensorRT-LLM | 2 | 49.81 | 23.21 | 89.37 | 0.45 | L4 |
                    | TensorRT-LLM | 4 | 55.33 | 36.62 | 153.39 | 0.78 | L4 |
                    | TensorRT-LLM | 8 | 66.5 | 39.11 | 279.88 | 1.47 | L4 |
                    | TensorRT-LLM | 16 | 131.8 | 30.39 | 547.8 | 2.77 | L4 |
                    | TensorRT-LLM | 32 | 277.22 | 48.02 | 925.7 | 4.78 | L4 |
                    | TensorRT-LLM | 64 | 498.52 | 71.62 | 1,164.40 | 6.2 | L4 |
                    | TensorRT-LLM | 128 | 677.31 | 120.37 | 1,445.18 | 7.69 | L4 |
                    | TensorRT-LLM | 256 | 1,926.31 | 216.88 | 1,600.81 | 8.52 | L4 |
                    | TensorRT-LLM | 1 | 21.17 | 9.24 | 130.05 | 0.68 | A100 |
                    | TensorRT-LLM | 2 | 25.78 | 9.21 | 264.5 | 1.35 | A100 |
                    | TensorRT-LLM | 4 | 28.52 | 10.99 | 437.69 | 2.27 | A100 |
                    | TensorRT-LLM | 8 | 34.4 | 12.61 | 760.49 | 3.96 | A100 |
                    | TensorRT-LLM | 16 | 68.03 | 14.32 | 1,343.80 | 7.01 | A100 |
                    | TensorRT-LLM | 32 | 185.96 | 16.82 | 2,287.30 | 11.92 | A100 |
                    | TensorRT-LLM | 64 | 136.87 | 21.17 | 3,625.22 | 18.89 | A100 |
                    | TensorRT-LLM | 128 | 463.78 | 34.15 | 4,456.51 | 23.24 | A100 |
                    | TensorRT-LLM | 256 | 890.12 | 59.18 | 5,188.24 | 27.05 | A100 |
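The two latency columns in the table come from per-token timing of streamed responses: TTFT is the time from request start to the first token, and ITL is the gap between consecutive tokens, with p99 taken across many requests. A minimal sketch of that computation (the percentile method here is a simple nearest-rank approximation, an assumption rather than the exact benchmarking harness):

```python
# How TTFT and ITL are derived from a streamed response's token timestamps.
def ttft_ms(request_start, token_times):
    """Time-to-First-Token: first token timestamp minus request start."""
    return (token_times[0] - request_start) * 1000.0

def itl_ms(token_times):
    """Inter-Token Latency: gaps between consecutive token timestamps."""
    return [(b - a) * 1000.0 for a, b in zip(token_times, token_times[1:])]

def p99(samples):
    # Simple nearest-rank p99 over a pooled sample set.
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(round(0.99 * (len(ordered) - 1))))
    return ordered[idx]

# One toy request: started at t=0.0s, tokens streamed at these times (s).
times = [0.036, 0.059, 0.081, 0.105]
ttft = ttft_ms(0.0, times)          # about 36 ms
gaps = itl_ms(times)                # gaps of roughly 22-24 ms each
```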

                    Conclusion

                    High-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure.


                    These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications.

                    Cracking the Code: Scaling Model Inference & Real-Time Embedding Search

                    · 4 min read
                    Aditya Kumar
                    Lead Software Engineer @ Meesho
                    Jaya Kumar
                    Lead ML Engineer @ Meesho
                    Adarsha Das
                    Senior Architect @ Meesho

                    BharatMLStack


                    By mid-2023, we had transformed our ML stack—building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:

                    • 🔹 Scaling model inference without hitting infrastructure roadblocks
                    • 🔹 Moving embedding search from batch to real-time for candidate generation

                    Here’s how we tackled these last-mile challenges, broke free from infrastructure constraints, and built a cost-efficient, high-performance system.

                    Breaking Free from the Scalability Ceiling

                    The Model Serving Bottleneck—A Wake-Up Call

                    July 2023. With just months left for the Mega Blockbuster Sale (MBS), we noticed a serious issue—scaling our model-serving infrastructure was taking 10–15 minutes. In real-time ML, that’s an eternity. In one of our war rooms, we ran a quick experiment:

                    • 🚀 We deployed an XGBoost model on a self-hosted Triton Inference Server running on a 16-core machine.
                    • 🚀 Fired requests and compared the outputs with our existing cloud-hosted setup.
                    • 🚀 The results matched—perfectly.

                    That moment changed everything. We prepped a backup Triton setup on EKS, just in case our cloud provider couldn't allocate enough compute resources in time. Luckily, they did—but the seed was planted. Then in October, just two weeks before MBS, we got an alarming response from our infrastructure team: "Node availability may be an issue." With no time to waste, we moved 30% of real-time ML traffic to our self-hosted Triton cluster. The results?

                    • ✅ p99 latency dropped from 90–100ms to 30–40ms
                    • ✅ Triton handled significantly higher throughput on fewer resources
                    • ✅ No model changes were needed

                    MBS ran without a hitch, proving that self-hosted inference was the way forward.

                    Scaling Triton on GKE

                    This left us with two choices:

                    • 1️⃣ Port models to a managed cloud inference service, investing time in learning a new deployment stack
                    • 2️⃣ Scale our existing Triton setup on GKE, optimizing for cost and performance

                    We went with Option 2—and it slashed inference costs to 35% of what we previously paid, while giving us full control over scaling and optimizations.

                    Fixing the Cold Start Problem

                    As we onboarded more deep learning (DL) models, we hit a new bottleneck: new inference pods took 7–9 minutes to spin up.

                    After profiling, we found the culprits:

                    • Triton’s base image—a massive 5GB
                    • Model binaries—often 1GB+
                    • Startup delay—mostly due to downloading and initializing these assets

                    To fix this, we built a lightweight Triton image, stripping unused components and shrinking the size to 900MB. This cut cold start times drastically, making auto-scaling faster and smoother.

                    Embedding Search: The Last Piece of the Puzzle

                    By mid-2023, most of our ML stack had gone real-time—except for Candidate Generation (CG), which still ran in batch mode. To truly power real-time recommendations, we needed an online embedding search system.

                    Choosing the Right Vector Database

                    We benchmarked three production-ready vector DBs across key parameters:

                    • Milvus
                    • Qdrant
                    • Weaviate

                    After extensive POCs, Qdrant stood out for its:

                    • ✅ Blazing-fast search latency on high-dimensional vectors
                    • ✅ Efficient memory usage, crucial for in-memory workloads
                    • ✅ Support for upserts and soft deletes, vital for Ads use cases
                    • ✅ gRPC + REST APIs, making integration seamless
                    • ✅ Powerful filtering, allowing fine-tuned retrieval (e.g., filtering Ads by category, active status, etc.)

                    At its core, Qdrant uses HNSW indexing, delivering both high recall and low-latency nearest-neighbor search—a perfect fit for our needs.
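The filtered-retrieval semantics are worth making concrete. The toy below is a brute-force stand-in, not Qdrant and not HNSW: it only shows what "nearest neighbors, restricted by payload filters" means, with hypothetical field names (`category`, `active`) for an Ads use case.

```python
# Toy filtered nearest-neighbor search: payload filters first restrict
# the candidate set, then results are ranked by cosine similarity.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(points, query_vec, payload_filter, top_k=2):
    candidates = [p for p in points
                  if all(p["payload"].get(k) == v
                         for k, v in payload_filter.items())]
    candidates.sort(key=lambda p: cosine(query_vec, p["vector"]), reverse=True)
    return [p["id"] for p in candidates[:top_k]]

ads = [
    {"id": "ad1", "vector": [1.0, 0.0], "payload": {"category": "sarees", "active": True}},
    {"id": "ad2", "vector": [0.9, 0.1], "payload": {"category": "sarees", "active": False}},
    {"id": "ad3", "vector": [0.0, 1.0], "payload": {"category": "shoes",  "active": True}},
]
# Only active saree ads are eligible, however similar the vectors are.
result = search(ads, [1.0, 0.0], {"category": "sarees", "active": True})
```

An HNSW index replaces the brute-force scan with an approximate graph walk, but the filter-then-rank contract is the same.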

                    Embedding Freshness & Real-Time Updates

                    To ensure embeddings stayed up to date, we built a dual ingestion pipeline:

                    • 📌 Daily Refresh: A bulk pipeline updated embeddings overnight
                    • 📌 Real-Time Updates: Ads events triggered immediate upserts/deletes
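The two ingestion paths converge on one index. A minimal sketch of that convergence, assuming a nightly snapshot plus an event stream (the event shape and class names are illustrative, not the production pipeline):

```python
# Sketch of the dual ingestion pipeline: a nightly bulk refresh rebuilds
# the collection, while Ads events apply upserts/deletes immediately so
# the index never serves stale candidates between refreshes.
class EmbeddingIndex:
    def __init__(self):
        self.vectors = {}

    def bulk_refresh(self, snapshot):
        self.vectors = dict(snapshot)        # nightly full rebuild

    def apply_event(self, event):
        if event["type"] == "upsert":        # new or updated ad
            self.vectors[event["id"]] = event["vector"]
        elif event["type"] == "delete":      # ad paused or expired
            self.vectors.pop(event["id"], None)

index = EmbeddingIndex()
index.bulk_refresh({"ad1": [0.1, 0.2], "ad2": [0.3, 0.4]})
index.apply_event({"type": "upsert", "id": "ad3", "vector": [0.5, 0.6]})
index.apply_event({"type": "delete", "id": "ad2"})
```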

                    This setup powered real-time "Similar Products" recommendations on the product page and became the foundation for Ads Candidate Generation, ensuring the right ads surfaced in milliseconds.

                    Skye

                    Final Takeaways: Scaling Smartly for Real-Time ML

                    • 🚀 Self-hosted inference on Triton gave us lower cost, faster scaling, and better performance than managed services
                    • 🚀 Building a custom Triton image reduced cold starts, improving responsiveness
                    • 🚀 Qdrant-based embedding search enabled real-time personalization at scale
                    • 🚀 Real-time updates for embeddings unlocked dynamic, up-to-date recommendations

                    By early 2024, Meesho’s ML stack had evolved into a fully real-time, scalable, and cost-efficient system, setting the foundation for even bigger leaps ahead.


                    These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications.

                    Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

                    · 14 min read
                    Jaya Kumar
                    Lead ML Engineer @ Meesho

                    BharatMLStack

                    Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

                    Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale.


                    The platform implements a complete LLMOps lifecycle — from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required.


                    In addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques — such as quantization strategies, batching configurations, and runtime-specific performance enhancements — enabling teams to balance latency, throughput, and cost based on their use case. The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference.

                    Why LLM Inference Is Not Just Bigger ML Model Serving

                    Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled.

                    Autoregressive Generation and Sequential Computation:

                    Unlike traditional models such as classifiers or recommenders — where inference cost is relatively constant — LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation. Because tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution.
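The sequential dependency can be shown in a few lines. The `next_token` function below is a trivial stand-in for a full forward pass of the model; the point is only that each decode step consumes everything generated so far, so output tokens cannot be produced in parallel:

```python
# The autoregressive loop in a nutshell: one forward pass per output
# token, each conditioned on the entire context generated so far.
def next_token(context):
    # Toy "model": deterministically names the token by context length.
    return f"t{len(context)}"

def generate(prompt_tokens, max_new_tokens):
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):   # inherently sequential
        context.append(next_token(context))
    return context[len(prompt_tokens):]

out = generate(["hello", "world"], max_new_tokens=3)
```

A classifier pays this cost once per request; an LLM pays it once per output token, which is why output length dominates latency.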

                    Prefill and Decode Phases:

                    LLM inference typically consists of two distinct stages:

                    • Prefill phase — the model processes the input prompt and builds internal representations. This stage is compute-heavy and highly parallelizable.
                    • Decode phase — the model generates tokens sequentially, predicting one token at a time using previously generated context.

                    The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads.

                    Context Management and KV Caching:

                    Another fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens. KV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:

                    • Memory consumption grows with sequence length and batch size
                    • GPU memory becomes a critical bottleneck
                    • Efficient memory management becomes essential for scaling concurrent requests

                    This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads.
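The memory growth above is easy to quantify with a back-of-envelope formula. The model shape below is an assumption for illustration (a Llama-3.1-8B-style config: 32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 weights); the linear scaling in sequence length and batch size is the general point:

```python
# Back-of-envelope KV-cache sizing: memory grows linearly with sequence
# length and batch size, which is why GPU memory, not FLOPs, usually
# caps concurrency during decode.
def kv_cache_bytes(seq_len, batch, layers=32, kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # Factor of 2: one cached tensor for keys, one for values, per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len * batch

per_token_kib = kv_cache_bytes(seq_len=1, batch=1) / 1024          # 128 KiB/token
one_request_gib = kv_cache_bytes(seq_len=8192, batch=1) / 1024**3  # 1 GiB
batch_32_gib = kv_cache_bytes(seq_len=8192, batch=32) / 1024**3    # 32 GiB
```

Under these assumptions, 32 concurrent 8k-context requests consume 32 GiB of KV cache alone — before weights — which is exactly the pressure that paged KV caching and cache-aware scheduling are built to manage.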

                    Dynamic and Irregular Workloads:

                    Traditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. As a result:

                    • Batch sizes must be dynamic rather than static
                    • Requests may enter and leave batches asynchronously
                    • Scheduling systems must continuously rebalance workloads to maximize GPU utilization

                    These characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines.

                    Streaming and User Experience Constraints:

                    Another distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated.

                    Because of these differences — sequential generation, growing memory requirements, dynamic workloads, and streaming constraints — LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads.

                    LLMOps: High-Level Architecture

                    LLM Architecture

                    The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention.

                    Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability.

                    1. Onboarding & Registration (The Source of Truth)

                      The lifecycle begins with the Data Scientist or engineer.

                      • Model Ingestion: Users onboard models—whether open-source (Hugging Face, NeMo) or internally fine-tuned—via the Truffle Box SDK/UI.
                      • LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. This allows Data Scientists to register and version-control prompts (e.g., "customer_support_v2") independently of the application code.
                    2. The "Black Box" Build Engine

                      Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources.

                      • Transformation: The raw model is converted into a TRT-LLM Checkpoint.
                      • Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint.
                      • Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware.
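                      To see why quantization matters for footprint, a quick back-of-envelope comparison helps. The numbers below are illustrative for an 8B-parameter model; real engines additionally need activation and KV cache memory.

```python
# Weight-memory footprint at different precisions (weights only).
def weight_gib(n_params, bits_per_param):
    return n_params * bits_per_param / 8 / 2**30

params = 8e9  # an 8B-parameter model
for name, bits in [("FP16", 16), ("FP8", 8), ("INT4 AWQ", 4)]:
    print(f"{name}: {weight_gib(params, bits):.1f} GiB")
# → FP16: 14.9 GiB
# → FP8: 7.5 GiB
# → INT4 AWQ: 3.7 GiB
```

                      Dropping from FP16 to INT4 AWQ cuts weight memory roughly 4x, which is what lets larger models fit on smaller GPUs such as the L4.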
                    3. Intelligent Profiling & Validation

                      Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler.

                      • Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. vLLM).
                      • Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost.
                    4. Smart Artifact Generation & Distribution

                      To solve the Kubernetes "Cold Start" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:

                      • Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup.
                      • Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times.
                    5. Image Streaming & Deployment

                      Simultaneously, the inference runtime container images are pulled from the Artifact Registry.

                      • Image Streaming: We utilize container image streaming to allow pods to start initializing while the massive Triton/Dynamo container layers are still downloading, further shaving seconds off the startup time.
                    6. The Inference Runtime (Kubernetes)

                      The workload lands on Kubernetes with Autoscaling.

                      • Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference.
                      • Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk ("Pull from Disk").
                    7. Client Interaction & Observability

                      Finally, the LLM Inference Client executes the request.

                      • Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used.
                      • Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time.
                    Observability: Monitoring the Pulse of GenAI

                      In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. A user doesn't care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows.

                      To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:
                      1. Time to First Token (TTFT)

                        • Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user.
                        • Why it matters: This represents the "Prefill Phase" latency—the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or "hung."
                        • Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hit rates), which drastically lowers this metric by skipping redundant prompt processing.
                      2. Inter-Token Latency (ITL)

                        • Definition: ITL measures the average time interval between the generation of consecutive tokens during the "Decode Phase".
                        • Why it matters: This defines the "perceived speed" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look "jerky" or slow to the user.
                        • Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speeds to maintain a natural conversational flow.
                      3. Token Throughput vs. Request Throughput

                        We distinguish between two types of throughput to balance system efficiency with user load:

                        • Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching.
                        • Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before the queue depth impacts ITL.
                      4. The Monitoring Stack

                        • Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot "slow generation" incidents that generic "500 error" alerts would miss.
                        • Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific "slow" request back to its prompt to understand if a complex input caused the latency spike.
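                    The two latency metrics above can be computed directly from per-token arrival timestamps. A minimal sketch of the instrumentation math (illustrative, not the platform's actual Grafana pipeline):

```python
# TTFT = first token's arrival minus request arrival (prefill latency).
# ITL  = mean gap between consecutive token arrivals (decode latency).
def ttft_and_itl(request_ts, token_ts):
    ttft = token_ts[0] - request_ts
    gaps = [b - a for a, b in zip(token_ts, token_ts[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Timestamps in seconds for a 4-token response.
ttft, itl = ttft_and_itl(10.00, [10.04, 10.06, 10.08, 10.10])
print(f"TTFT={ttft*1000:.0f}ms, mean ITL={itl*1000:.0f}ms")  # → TTFT=40ms, mean ITL=20ms
```

                    In production one would track percentiles (p99) of these values across requests rather than single-request means.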

                    Supported Inference Backends (TensorRT-LLM, Dynamo & vLLM)

                    Tailored for the Use Case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases—whether a real-time voice bot requiring ultra-low, sub-second latency or a massive reasoning task requiring huge context windows—demand different runtime characteristics. Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:

                    1. TensorRT-LLM: The High-Performance Standard

                      Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots).

                      TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization.

                      Key optimizations we tailor for these high-load cases include:

                      • Optimized execution via TensorRT engine compilation
                      • Quantization-aware execution for reduced memory usage and improved throughput
                      • Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization.
                      • Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms.
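                      Inflight (continuous) batching can be illustrated with a toy scheduler: finished sequences leave the batch at every decode step and queued requests join immediately, instead of the whole batch draining first. This is a sketch of the idea only, not TensorRT-LLM's real batch manager.

```python
from collections import deque

# Each queued item is the number of tokens that request still needs to generate.
# Returns how many decode steps the whole workload takes.
def run_inflight(requests, max_batch):
    queue, active, steps = deque(requests), [], 0
    while queue or active:
        # Freed slots are refilled immediately — requests join mid-flight.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # One decode step for every active sequence; finished ones leave.
        active = [t - 1 for t in active if t > 1]
        steps += 1
    return steps

print(run_inflight([3, 1, 4, 2], max_batch=2))  # → 5
```

                      With 10 token-steps of total work and 2 slots, the scheduler finishes in the ideal 5 steps; a static batcher that waits for the slowest member of each batch would need more.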
                    2. Dynamo: Distributed Inference for Reasoning Models

                      Suitable for: Very large "reasoning" models (70B+) or scenarios requiring massive context windows where a single GPU's memory is insufficient.

                      For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework. Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:

                      • KV Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation.
                      • Prefill vs. Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy "reading" phase independently from the memory-heavy "writing" phase.
                      • Distributed execution across multiple GPU resources
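                      KV-aware routing can be sketched as picking the worker whose cached prefixes best overlap the incoming prompt. The data shapes here are hypothetical simplifications; Dynamo's actual router works on KV block hashes rather than raw token lists.

```python
# Route a request to the worker holding the longest matching cached prefix,
# so the fewest prompt tokens need recomputation.
def route(prompt_tokens, workers):
    # workers: {worker_name: list of cached token-ID prefixes}
    def overlap(prefix):
        n = 0
        for a, b in zip(prompt_tokens, prefix):
            if a != b:
                break
            n += 1
        return n

    return max(workers, key=lambda w: max(map(overlap, workers[w]), default=0))

workers = {"w0": [[1, 2, 3]], "w1": [[1, 2, 3, 4, 5]], "w2": []}
print(route([1, 2, 3, 4, 9], workers))  # → w1 (holds the longest matching prefix)
```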
                    3. vLLM: The Flexible Baseline

                      Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput.

                      While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline:

                      • High throughput through dynamic batching and efficient memory utilization
                      • Paged KV cache management for handling long contexts and concurrent requests
                      • Strong support for open-source model ecosystems
                      • Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build.
                      • Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. We use it strategically for initial testing before committing to a full TensorRT optimization pipeline.

                    Conclusion

                    Large language model inference introduces a fundamentally new class of infrastructure challenges—where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads.

                    The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle—from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity.

                    Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows.

                    Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment—allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences.

                    Future Explorations

                    While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:

                    • TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs for integration into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.
                    • Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a "serverless" experience where specific fine-tunes are hot-swapped instantly per request.
                    • Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user's streaming experience.
                    • Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., "How do I reset my password?" vs. "Password reset steps"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.
                    • Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before the memory fills up, preventing eviction-based slowdowns during traffic spikes.
                    • Online Evaluation & Guardrails: We are integrating a lightweight "Trust Layer" into the proxy. This will allow for low-latency input/output filtering (Guardrails) and asynchronous "LLM-as-a-Judge" evaluation pipelines to monitor response quality in production, not just system health.
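                    The planned semantic-cache idea can be sketched with a toy bag-of-words embedding and cosine similarity standing in for a real embedding model plus vector database:

```python
import math

# Toy embedding: bag-of-words term counts. A production system would use a
# trained embedding model and an approximate-nearest-neighbor index instead.
def embed(text):
    vec = {}
    for w in text.lower().strip("?!. ").split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

cache = {"how do i reset my password": "Go to Settings > Security > Reset."}

def lookup(query, threshold=0.5):
    q = embed(query)
    for key, resp in cache.items():
        if cosine(q, embed(key)) >= threshold:
            return resp  # cache hit: the GPU is bypassed entirely
    return None  # miss: fall through to the LLM

print(lookup("reset my password how"))  # → Go to Settings > Security > Reset.
```

                    The threshold trades hit rate against the risk of returning a cached answer for a subtly different question, which is why guardrail evaluation pairs naturally with this layer.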

                    5 posts tagged with "meesho"

                    View All Tags

                    LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

                    · 5 min read
                    Jaya Kumar
                    Lead ML Engineer @ Meesho

                    BharatMLStack


                    LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

                    Raw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack—from memory management to kernel execution.

                    1. Advanced Memory Management: Paged & Prefix KV Caching
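                    The paged approach named in this section's title can be sketched as block-granular allocation — the idea popularized by vLLM's PagedAttention, where sequences acquire fixed-size KV blocks on demand instead of one contiguous max-length buffer. This is a toy allocator, not any engine's real implementation.

```python
BLOCK_TOKENS = 16  # tokens per KV block (illustrative size)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}  # seq_id -> list of physical block IDs

    def append_token(self, seq_id, pos):
        # Allocate a new block only when the sequence crosses a block boundary,
        # so memory tracks actual length instead of the maximum context.
        if pos % BLOCK_TOKENS == 0:
            self.tables.setdefault(seq_id, []).append(self.free.pop())

    def release(self, seq_id):
        # Finished sequences return their blocks to the shared free pool.
        self.free.extend(self.tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=64)
for pos in range(40):          # a 40-token sequence...
    cache.append_token("req-1", pos)
print(len(cache.tables["req-1"]))  # → 3 (ceil(40/16) blocks, not a 64-block reservation)
```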


                    Voice bot qu

                    | Inference runtime | Concurrency | p99 TTFT (ms) | p99 ITL (ms) | Token throughput (tokens/sec) | Request throughput (req/sec) | Hardware |
                    |---|---|---|---|---|---|---|
                    | TensorRT-LLM | 1 | 36.27 | 22.78 | 45.66 | 0.23 | L4 |
                    | TensorRT-LLM | 2 | 49.81 | 23.21 | 89.37 | 0.45 | L4 |
                    | TensorRT-LLM | 4 | 55.33 | 36.62 | 153.39 | 0.78 | L4 |
                    | TensorRT-LLM | 8 | 66.5 | 39.11 | 279.88 | 1.47 | L4 |
                    | TensorRT-LLM | 16 | 131.8 | 30.39 | 547.8 | 2.77 | L4 |
                    | TensorRT-LLM | 32 | 277.22 | 48.02 | 925.7 | 4.78 | L4 |
                    | TensorRT-LLM | 64 | 498.52 | 71.62 | 1,164.40 | 6.2 | L4 |
                    | TensorRT-LLM | 128 | 677.31 | 120.37 | 1,445.18 | 7.69 | L4 |
                    | TensorRT-LLM | 256 | 1,926.31 | 216.88 | 1,600.81 | 8.52 | L4 |
                    | TensorRT-LLM | 1 | 21.17 | 9.24 | 130.05 | 0.68 | A100 |
                    | TensorRT-LLM | 2 | 25.78 | 9.21 | 264.5 | 1.35 | A100 |
                    | TensorRT-LLM | 4 | 28.52 | 10.99 | 437.69 | 2.27 | A100 |
                    | TensorRT-LLM | 8 | 34.4 | 12.61 | 760.49 | 3.96 | A100 |
                    | TensorRT-LLM | 16 | 68.03 | 14.32 | 1,343.80 | 7.01 | A100 |
                    | TensorRT-LLM | 32 | 185.96 | 16.82 | 2,287.30 | 11.92 | A100 |
                    | TensorRT-LLM | 64 | 136.87 | 21.17 | 3,625.22 | 18.89 | A100 |
                    | TensorRT-LLM | 128 | 463.78 | 34.15 | 4,456.51 | 23.24 | A100 |
                    | TensorRT-LLM | 256 | 890.12 | 59.18 | 5,188.24 | 27.05 | A100 |

                    Conclusion

                    High-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure.


                    These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications.

                    Cracking the Code: Scaling Model Inference & Real-Time Embedding Search

                    · 4 min read
                    Aditya Kumar
                    Lead Software Engineer @ Meesho
                    Jaya Kumar
                    Lead ML Engineer @ Meesho
                    Adarsha Das
                    Senior Architect @ Meesho

                    BharatMLStack

                    - -

                    By mid-2023, we had transformed our ML stack—building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:

                    +

                    These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications.

                    Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

                    · 14 min read
                    Jaya Kumar
                    Lead ML Engineer @ Meesho

                    BharatMLStack

                    +

                    Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

                    +

                    Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale.

                    +

                    The platform implements a complete LLMOps lifecycle — from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required.

                    +

                    In addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques — such as quantization strategies, batching configurations, and runtime-specific performance enhancements — enabling teams to balance latency, throughput, and cost based on their use case. The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference.

                    +

                    Why LLM Inference Is not just bigger ML model serving

                    +

                    Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled.

                    +

                    Autoregressive Generation and Sequential Computation:

                    +

                    Unlike traditional models such as classifiers or recommenders — where inference cost is relatively constant — LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation. +Because tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution.

                    +

                    Prefill and Decode Phases:

                    +

                    LLM inference typically consists of two distinct stages:

                    +
                      +
                    • Prefill phase — the model processes the input prompt and builds internal representations. This stage is compute-heavy and highly parallelizable.
                    • +
                    • Decode phase — the model generates tokens sequentially, predicting one token at a time using previously generated context.
                    • +
                    +

                    The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads.

                    +

                    Context Management and KV Caching:

                    +

                    Another fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens. +KV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:

                    +
                      +
                    • Memory consumption grows with sequence length and batch size
                    • +
                    • GPU memory becomes a critical bottleneck
                    • +
                    • Efficient memory management becomes essential for scaling concurrent requests
                    • +
                    +

                    This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads.

                    +

                    Dynamic and Irregular Workloads:

                    +

                    Traditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. As a result:

                    +
                      +
                    • Batch sizes must be dynamic rather than static
                    • +
                    • Requests may enter and leave batches asynchronously
                    • +
                    • Scheduling systems must continuously rebalance workloads to maximize GPU utilization
                    • +
                    +

                    These characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines.

                    +

                    Streaming and User Experience Constraints:

                    +

                    Another distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated. +Because of these differences — sequential generation, growing memory requirements, dynamic workloads, and streaming constraints — LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads.

                    +

                    LLMOps: High-Level Architecture

                    +

                    LLM Architecture

                    +

                    The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention.


                    Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability.

1. Onboarding & Registration (The Source of Truth)

   The lifecycle begins with the Data Scientist or engineer.

   • Model Ingestion: Users onboard models—whether open-source (Hugging Face, NeMo) or internally fine-tuned—via the Truffle Box SDK/UI.
   • LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. This allows Data Scientists to register and version-control prompts (e.g., "customer_support_v2") independently of the application code.
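A minimal sketch of the versioned-prompt idea, assuming a toy in-memory registry. The `PromptRegistry` class and template names are hypothetical illustrations, not the actual Truffle Box API.

```python
# Hypothetical sketch of a versioned prompt registry: prompts are registered
# and resolved by (name, version), independently of application code.

class PromptRegistry:
    def __init__(self):
        self._templates = {}

    def register(self, name, version, template):
        self._templates[(name, version)] = template

    def resolve(self, name, version="latest"):
        if version == "latest":
            version = max(v for (n, v) in self._templates if n == name)
        return self._templates[(name, version)]

registry = PromptRegistry()
registry.register("customer_support", 1, "You are a helpful agent. {query}")
registry.register("customer_support", 2, "You are a concise support agent. {query}")

# Application code pins a version; prompts can evolve without a redeploy.
prompt = registry.resolve("customer_support", 2).format(query="Where is my order?")
print(prompt)
```

The point of the design is that bumping "customer_support" from v1 to v2 is a registry operation, not a code change.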

2. The "Black Box" Build Engine

   Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources.

   • Transformation: The raw model is converted into a TRT-LLM Checkpoint.
   • Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint.
   • Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware.
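The memory savings from quantization follow directly from bits-per-weight arithmetic. A quick back-of-the-envelope check, assuming an 8B-parameter model (the model size is an assumption for illustration, not a measured benchmark):

```python
# Weight memory for an 8B-parameter model under different precisions.
# INT4 stores each weight in half a byte, FP8 in one byte, FP16 in two.

def weight_memory_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9

n = 8e9  # 8B parameters (e.g. a Llama-3.1-8B-class model)
fp16 = weight_memory_gb(n, 16)   # full-precision baseline
fp8  = weight_memory_gb(n, 8)    # half the footprint
int4 = weight_memory_gb(n, 4)    # quarter of the footprint
print(fp16, fp8, int4)
```

Activations and KV cache come on top of this, but the weight footprint alone often decides whether a model fits a single L4 versus needing an A100.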

3. Intelligent Profiling & Validation

   Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler.

   • Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. vLLM).
   • Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost.
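The selection rule this stage implies can be sketched simply: among benchmarked (hardware, runtime) profiles, pick the cheapest one whose p99 TTFT meets the SLA. All numbers and names below are made up for illustration.

```python
# Illustrative profile selection: cheapest configuration that meets the SLA.

profiles = [
    # (hardware, runtime, p99_ttft_ms, cost_per_hour_usd) -- assumed values
    ("L4",   "TRT-LLM", 66.5, 0.80),
    ("L4",   "vLLM",    95.0, 0.80),
    ("A100", "TRT-LLM", 34.4, 3.70),
    ("A100", "vLLM",    48.0, 3.70),
]

def pick_profile(profiles, ttft_sla_ms):
    eligible = [p for p in profiles if p[2] <= ttft_sla_ms]
    if not eligible:
        raise ValueError("no profile meets the SLA; relax it or add hardware")
    return min(eligible, key=lambda p: p[3])   # cheapest that qualifies

print(pick_profile(profiles, ttft_sla_ms=70))  # a cheaper L4 profile suffices
print(pick_profile(profiles, ttft_sla_ms=40))  # only A100 TRT-LLM qualifies
```

Tightening the SLA from 70ms to 40ms flips the recommendation from L4 to A100, which is exactly the latency-vs-cost trade-off the profiler automates.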

4. Smart Artifact Generation & Distribution

   To solve the Kubernetes "Cold Start" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:

   • Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup.
   • Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times.
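The bifurcated strategy reduces to a simple size threshold. The 8GB cutoff comes from the text above; the function and return labels are illustrative, not the platform's actual API.

```python
# Sketch of the artifact-distribution decision: small models are fetched
# from GCS at pod startup; very large models come from pre-baked disks.

SIZE_THRESHOLD_GB = 8

def distribution_strategy(model_size_gb):
    if model_size_gb > SIZE_THRESHOLD_GB:
        return "secondary-boot-disk"   # pre-cached, attached at node scale-up
    return "gcs-download"              # downloaded by the pod at startup

print(distribution_strategy(3.5))    # small model
print(distribution_strategy(70.0))   # 70GB model: downloads would be too slow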

5. Image Streaming & Deployment

   Simultaneously, the inference runtime container images are pulled from the Artifact Registry.

   • Image Streaming: We utilize container image streaming to allow pods to start initializing while the massive Triton/Dynamo container layers are still downloading, further shaving seconds off the startup time.

6. The Inference Runtime (Kubernetes)

   The workload lands on Kubernetes with Autoscaling.

   • Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference.
   • Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk ("Pull from Disk").

7. Client Interaction & Observability

   Finally, the LLM Inference Client executes the request.

   • Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used.
   • Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time.


                      Observability: Monitoring the Pulse of GenAI


                      In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. A user doesn't care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows.


                      To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:

1. Time to First Token (TTFT)

   • Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user.
   • Why it matters: This represents the "Prefill Phase" latency—the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or "hung."
   • Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hit rates), which drastically lowers this metric by skipping redundant prompt processing.
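TTFT can be measured by timestamping the first item out of a streaming generator. A self-contained sketch with a fake model, where `time.sleep` calls stand in for the prefill and decode phases:

```python
# Measuring TTFT vs. total latency around a streaming generator.
import time

def fake_stream(n_tokens=5, prefill_s=0.02, decode_s=0.005):
    time.sleep(prefill_s)            # "prefill" phase: process the prompt
    for i in range(n_tokens):
        time.sleep(decode_s)         # "decode" phase: one token per step
        yield f"tok{i}"

start = time.perf_counter()
ttft = None
for i, tok in enumerate(fake_stream()):
    if i == 0:
        ttft = time.perf_counter() - start   # time to first token
total = time.perf_counter() - start
print(f"TTFT={ttft*1000:.1f}ms total={total*1000:.1f}ms")
```

In production the same two timestamps are recorded per request and aggregated into p99 dashboards.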

2. Inter-Token Latency (ITL)

   • Definition: ITL measures the average time interval between the generation of consecutive tokens during the "Decode Phase".
   • Why it matters: This defines the "perceived speed" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look "jerky" or slow to the user.
   • Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speeds to maintain a natural conversational flow.
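Given per-token arrival timestamps, ITL is simply the gaps between consecutive tokens. A small sketch with made-up timings:

```python
# Computing average and worst-case inter-token latency (ITL) from
# per-token arrival timestamps (values are illustrative, in seconds).

timestamps = [0.040, 0.065, 0.090, 0.118, 0.140]   # arrival time of each token

gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
avg_itl = sum(gaps) / len(gaps)
worst_itl = max(gaps)
print(f"avg ITL={avg_itl*1000:.1f}ms, worst ITL={worst_itl*1000:.1f}ms")
```

Tracking the worst gap (here 28ms) rather than only the average is what catches the "jerky" generation the text describes.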
3. Token Throughput vs. Request Throughput

   We distinguish between two types of throughput to balance system efficiency with user load:

   • Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching.
   • Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before the queue depth impacts ITL.
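A toy calculation shows how the two views diverge over the same measurement window (all numbers are illustrative):

```python
# Two throughput views over one measurement window: token throughput
# (GPU efficiency) vs. request throughput (user load).

window_s = 10.0
tokens_per_request = [180, 210, 95, 400, 260]   # made-up completion lengths

token_throughput = sum(tokens_per_request) / window_s    # tokens/sec overall
request_throughput = len(tokens_per_request) / window_s  # distinct req/sec
print(token_throughput, request_throughput)
```

A batch of a few long completions and a batch of many short ones can have identical token throughput but very different request throughput, which is why autoscaling keys off the latter.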
4. The Monitoring Stack

   • Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot "slow generation" incidents that generic "500 error" alerts would miss.
   • Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific "slow" request back to its prompt to understand if a complex input caused the latency spike.

Supported Inference Backends (TensorRT-LLM, Dynamo & vLLM)


Tailored for the Use Case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases—whether a real-time voice bot requiring ultra-low, sub-second latency or a massive reasoning task requiring huge context windows—demand different runtime characteristics. Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:

1. TensorRT-LLM: The High-Performance Standard


                      Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots).


TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization.


                      Key optimizations we tailor for these high-load cases include:

• Optimized execution via TensorRT engine compilation
• Quantization-aware execution for reduced memory usage and improved throughput
• Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization.
• Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms.


                    Cracking the Code: Scaling Model Inference & Real-Time Embedding Search

                    · 4 min read
                    Aditya Kumar
                    Lead Software Engineer @ Meesho
                    Jaya Kumar
                    Lead ML Engineer @ Meesho
                    Adarsha Das
                    Senior Architect @ Meesho

                    BharatMLStack

2. Dynamo: Distributed Inference for Reasoning Models

                    Suitable for: Very large "reasoning" models (70B+) or scenarios requiring massive context windows where a single GPU's memory is insufficient.


For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework. Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:

• KV Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation.
• Prefill vs. Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy "reading" phase independently from the memory-heavy "writing" phase.
• Distributed execution across multiple GPU resources
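A toy sketch of the KV-aware routing idea: send the request to the worker whose cached prefix overlaps the prompt most, so the least prefill work has to be redone. Worker state here is a plain dict, not Dynamo's actual router.

```python
# Illustrative KV-aware routing: pick the worker with maximum shared prefix.

def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens, workers):
    """workers: {worker_id: cached_token_list}. Pick max prefix overlap."""
    return max(workers, key=lambda w: shared_prefix_len(request_tokens, workers[w]))

workers = {
    "w1": ["<sys>", "You", "are", "a", "support", "agent"],
    "w2": ["<sys>", "Translate", "the", "following"],
}
req = ["<sys>", "You", "are", "a", "support", "agent", "Hi!"]
print(route(req, workers))   # w1 already holds most of this prompt's KV cache
```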
3. vLLM: The Flexible Baseline

                    Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput.


While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline.

• High throughput through dynamic batching and efficient memory utilization
• Paged KV cache management for handling long contexts and concurrent requests
• Strong support for open-source model ecosystems
• Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build.
• Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. We use it strategically for initial testing before committing to a full TensorRT optimization pipeline.
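Why paged KV cache management helps can be shown with simple block arithmetic. The block size and sequence lengths below are assumptions for illustration, not vLLM's exact defaults:

```python
# Paged KV caching: memory is allocated in small blocks as the sequence
# grows, instead of reserving max_seq_len per request up front.

BLOCK_TOKENS = 16          # tokens per KV block (vLLM-style paging)
MAX_SEQ_LEN = 4096

def blocks_needed(seq_len, block=BLOCK_TOKENS):
    return -(-seq_len // block)        # ceiling division

seq_len = 200                          # actual tokens in this request so far
paged_tokens_reserved = blocks_needed(seq_len) * BLOCK_TOKENS
naive_tokens_reserved = MAX_SEQ_LEN    # contiguous pre-allocation

print(paged_tokens_reserved, naive_tokens_reserved)
```

For a typical short request the paged scheme reserves an order of magnitude fewer KV slots, which is what lets many more requests share one GPU.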

                    Conclusion


                    Large language model inference introduces a fundamentally new class of infrastructure challenges—where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads.


                    The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle—from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity.


                    Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows.


                    Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment—allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences.


                    Future Explorations


                    While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:

• TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs for integration into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.
• Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a "serverless" experience where specific fine-tunes are hot-swapped instantly per request.
• Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user's streaming experience.
• Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., "How do I reset my password?" vs. "Password reset steps"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.
• Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before the memory fills up, preventing eviction-based slowdowns during traffic spikes.
• Online Evaluation & Guardrails: We are integrating a lightweight "Trust Layer" into the proxy. This will allow for low-latency input/output filtering (Guardrails) and asynchronous "LLM-as-a-Judge" evaluation pipelines to monitor response quality in production, not just system health.
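The semantic-caching idea above can be sketched as a similarity lookup. This is a hypothetical illustration: the hand-made 3-d vectors stand in for real embeddings, and the threshold value is arbitrary.

```python
# Hypothetical semantic cache: embed the query, look for a cached query
# above a cosine-similarity threshold, and skip the GPU on a hit.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

cache = {  # toy embedding -> cached answer
    (0.9, 0.1, 0.0): "To reset your password, open Settings > Security.",
}

def lookup(query_vec, threshold=0.95):
    for vec, answer in cache.items():
        if cosine(query_vec, vec) >= threshold:
            return answer          # cache hit: GPU bypassed entirely
    return None                    # miss: fall through to the model

print(lookup((0.88, 0.12, 0.01)))  # near-duplicate query -> served from cache
print(lookup((0.0, 0.2, 0.9)))     # unrelated query -> miss, goes to model
```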


                    By mid-2023, we had transformed our ML stack—building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:


                      feature_1_value,feature_2_value,feature_3_value;expiry_ts


                    This format allowed:

                    • Consistent writes and reads at the group level

                      Why Redis?

                      Storage Structure

                      Each user’s interactions were stored using a composite key format, uniquely identifying the user and interaction type. This structure allowed efficient organization and quick retrieval of recent activity for recommendation generation:

userId_eventType → ZSET[...(pid, ts)...]

                      Within each ZSET:

                      • The timestamp served as the score, maintaining temporal order
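The composite-key pattern can be mimicked in plain Python to show the access path (this mirrors Redis sorted-set ZADD/ZREVRANGE semantics without a Redis client; names are illustrative):

```python
# Pure-Python sketch of the composite-key sorted-set pattern: one sorted
# set per userId_eventType, with the event timestamp as the score so
# recent activity is cheap to fetch.

store = {}  # key -> list of (timestamp, product_id), kept sorted

def record_event(user_id, event_type, pid, ts):
    key = f"{user_id}_{event_type}"          # composite key, e.g. "u42_click"
    store.setdefault(key, []).append((ts, pid))
    store[key].sort()                        # ZSET keeps members ordered by score

def recent(user_id, event_type, n=3):
    key = f"{user_id}_{event_type}"
    return [pid for ts, pid in sorted(store.get(key, []), reverse=True)[:n]]

record_event("u42", "click", "p1", 100)
record_event("u42", "click", "p2", 300)
record_event("u42", "click", "p3", 200)
print(recent("u42", "click", 2))   # most recent product ids first
```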

                        5 posts tagged with "mlplatform"

                        View All Tags

                        LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

                        · 5 min read
                        Jaya Kumar
                        Lead ML Engineer @ Meesho

                        BharatMLStack


                        LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

                        Raw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack—from memory management to kernel execution.

                        1. Advanced Memory Management: Paged & Prefix KV Caching


                        Voice bot qu
| Inference runtime | Concurrency | p99 TTFT (ms) | p99 ITL (ms) | Token throughput (tokens/sec) | Request throughput (req/sec) | Hardware |
|---|---|---|---|---|---|---|
| TensorRT-LLM | 1 | 36.27 | 22.78 | 45.66 | 0.23 | L4 |
| TensorRT-LLM | 2 | 49.81 | 23.21 | 89.37 | 0.45 | L4 |
| TensorRT-LLM | 4 | 55.33 | 36.62 | 153.39 | 0.78 | L4 |
| TensorRT-LLM | 8 | 66.5 | 39.11 | 279.88 | 1.47 | L4 |
| TensorRT-LLM | 16 | 131.8 | 30.39 | 547.8 | 2.77 | L4 |
| TensorRT-LLM | 32 | 277.22 | 48.02 | 925.7 | 4.78 | L4 |
| TensorRT-LLM | 64 | 498.52 | 71.62 | 1,164.40 | 6.2 | L4 |
| TensorRT-LLM | 128 | 677.31 | 120.37 | 1,445.18 | 7.69 | L4 |
| TensorRT-LLM | 256 | 1,926.31 | 216.88 | 1,600.81 | 8.52 | L4 |
| TensorRT-LLM | 1 | 21.17 | 9.24 | 130.05 | 0.68 | A100 |
| TensorRT-LLM | 2 | 25.78 | 9.21 | 264.5 | 1.35 | A100 |
| TensorRT-LLM | 4 | 28.52 | 10.99 | 437.69 | 2.27 | A100 |
| TensorRT-LLM | 8 | 34.4 | 12.61 | 760.49 | 3.96 | A100 |
| TensorRT-LLM | 16 | 68.03 | 14.32 | 1,343.80 | 7.01 | A100 |
| TensorRT-LLM | 32 | 185.96 | 16.82 | 2,287.30 | 11.92 | A100 |
| TensorRT-LLM | 64 | 136.87 | 21.17 | 3,625.22 | 18.89 | A100 |
| TensorRT-LLM | 128 | 463.78 | 34.15 | 4,456.51 | 23.24 | A100 |
| TensorRT-LLM | 256 | 890.12 | 59.18 | 5,188.24 | 27.05 | A100 |

                        Conclusion

                        High-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure.


                        These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications.


                        Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

                        · 14 min read
                        Jaya Kumar
                        Lead ML Engineer @ Meesho

                        BharatMLStack


                        Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale.


                        The platform implements a complete LLMOps lifecycle — from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required.


                        In addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques — such as quantization strategies, batching configurations, and runtime-specific performance enhancements — enabling teams to balance latency, throughput, and cost based on their use case. The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference.

Why LLM Inference Is Not Just Bigger ML Model Serving

                        Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled.

Autoregressive Generation and Sequential Computation:

Unlike traditional models such as classifiers or recommenders — where inference cost is relatively constant — LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation.

Because tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution.

Prefill and Decode Phases:

                        LLM inference typically consists of two distinct stages:

• Prefill phase — the model processes the input prompt and builds internal representations. This stage is compute-heavy and highly parallelizable.
• Decode phase — the model generates tokens sequentially, predicting one token at a time using previously generated context.

                        The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads.
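The two phases can be sketched with a toy loop (the "model" here is a deterministic stand-in, not a real transformer): prefill consumes the whole prompt in one parallel pass and seeds the KV cache, while decode must append one token at a time.

```python
# Toy illustration of the two inference phases. The "model" here is a
# stand-in that emits deterministic fake token ids, not a real transformer.

def prefill(prompt_tokens):
    # Phase 1: process the entire prompt in one parallel, compute-heavy pass
    # and seed the KV cache (represented here as the plain token history).
    return list(prompt_tokens)

def decode_step(kv_cache):
    # Phase 2 step: attend over the whole cache and emit exactly one token.
    return (sum(kv_cache) + len(kv_cache)) % 50_000  # fake next-token id

def generate(prompt_tokens, max_new_tokens, eos_id=0):
    kv_cache = prefill(prompt_tokens)
    output = []
    for _ in range(max_new_tokens):   # inherently sequential: one token per step
        tok = decode_step(kv_cache)
        output.append(tok)
        kv_cache.append(tok)          # the cache grows with every emitted token
        if tok == eos_id:
            break
    return output

print(generate([11, 27, 4], max_new_tokens=4))  # four sequential decode steps
```

Note that nothing inside the decode loop can be parallelized across steps — each iteration needs the token the previous one produced, which is exactly why the decode phase becomes the bottleneck.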

Context Management and KV Caching:

Another fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens.

KV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:

• Memory consumption grows with sequence length and batch size
• GPU memory becomes a critical bottleneck
• Efficient memory management becomes essential for scaling concurrent requests

                        This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads.
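To make the memory pressure concrete, here is a back-of-the-envelope KV-cache calculator. The dimensions are illustrative 7B-class values (assumed for the example, not any specific model's exact configuration):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x because both keys AND values are cached, for every layer,
    # attention head, sequence position, and request in the batch.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative 7B-class dimensions (assumed, not a specific model's config):
total = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                       seq_len=4096, batch=16, dtype_bytes=2)
print(f"{total / 2**30:.0f} GiB")  # KV cache alone at batch 16, 4K context: 32 GiB
```

With these numbers the cache alone consumes tens of gigabytes — before any model weights — which is why paged KV caching and careful batch admission matter so much.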

Dynamic and Irregular Workloads:

                        Traditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. As a result:

• Batch sizes must be dynamic rather than static
• Requests may enter and leave batches asynchronously
• Scheduling systems must continuously rebalance workloads to maximize GPU utilization

                        These characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines.
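A toy simulation of this scheduling style (a simplified stand-in for real inflight/continuous batching, not the actual Triton or vLLM scheduler):

```python
from collections import deque

def simulate_inflight_batching(requests, max_batch):
    """Each request is (id, tokens_to_generate). New requests join the running
    batch the moment a slot frees up, instead of waiting for a full drain."""
    queue = deque(requests)
    active = {}                       # request id -> tokens still to generate
    finished = []
    steps = 0
    while queue or active:
        while queue and len(active) < max_batch:   # admit into free slots
            rid, n = queue.popleft()
            active[rid] = n
        steps += 1                    # one decode step: one token per active request
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]       # slot is reusable on the very next step
                finished.append(rid)
    return steps, finished

print(simulate_inflight_batching([("a", 2), ("b", 5), ("c", 1)], max_batch=2))
```

With static batching the same workload takes 6 decode steps (the full batch waits 5 steps for "b", then "c" runs alone); the inflight scheduler finishes in 5 because "c" slips into the slot "a" vacates.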

Streaming and User Experience Constraints:

Another distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated.

Because of these differences — sequential generation, growing memory requirements, dynamic workloads, and streaming constraints — LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads.

LLMOps: High-Level Architecture

                        LLM Architecture


                        The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention.


                        Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability.

1. Onboarding & Registration (The Source of Truth)


                          The lifecycle begins with the Data Scientist or engineer.

• 🔹 Scaling model inference without hitting infrastructure roadblocks
• 🔹 Moving embedding search from batch to real-time for candidate generation

• Model Ingestion: Users onboard models—whether open-source (Hugging Face, NeMo) or internally fine-tuned—via the Truffle Box SDK/UI.
• LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. This allows Data Scientists to register and version-control prompts (e.g., "customer_support_v2") independently of the application code.

                          Here’s how we tackled these last-mile challenges, broke free from infrastructure constraints, and built a cost-efficient, high-performance system.

Breaking Free from the Scalability Ceiling

The Model Serving Bottleneck—A Wake-Up Call

July 2023. With just months left for the Mega Blockbuster Sale (MBS), we noticed a serious issue—scaling our model-serving infrastructure was taking 10–15 minutes. In real-time ML, that’s an eternity.

In one of our war rooms, we ran a quick experiment:

2. The "Black Box" Build Engine


                          Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources.

• 🚀 We deployed an XGBoost model on a self-hosted Triton Inference Server running on a 16-core machine.
• 🚀 Fired requests and compared the outputs with our existing cloud-hosted setup.
• 🚀 The results matched—perfectly.

• Transformation: The raw model is converted into a TRT-LLM Checkpoint.
• Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint.
• Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware.

That moment changed everything. We prepped a backup Triton setup on EKS, just in case our cloud provider couldn't allocate enough compute resources in time. Luckily, they did—but the seed was planted.

Then in October, just two weeks before MBS, we got an alarming response from our infrastructure team: "Node availability may be an issue."

With no time to waste, we moved 30% of real-time ML traffic to our self-hosted Triton cluster. The results?

3. Intelligent Profiling & Validation


                          Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler.

• ✅ p99 latency dropped from 90–100ms to 30–40ms
• ✅ Triton handled significantly higher throughput on fewer resources
• ✅ No model changes were needed

• Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. vLLM).
• Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost.

                          MBS ran without a hitch, proving that self-hosted inference was the way forward.

Scaling Triton on GKE

                          This left us with two choices:

4. Smart Artifact Generation & Distribution


                          To solve the Kubernetes "Cold Start" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:

• 1️⃣ Port models to a managed cloud inference service, investing time in learning a new deployment stack
• 2️⃣ Scale our existing Triton setup on GKE, optimizing for cost and performance

• Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup.
• Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times.

                          We went with Option 2—and it slashed inference costs to 35% of what we previously paid, while giving us full control over scaling and optimizations.

Fixing the Cold Start Problem

As we onboarded more deep learning (DL) models, we hit a new bottleneck: new inference pods took 7–9 minutes to spin up.

                          After profiling, we found the culprits:

5. Image Streaming & Deployment


                          Simultaneously, the inference runtime container images are pulled from the Artifact Registry.

• Triton’s base image—a massive 5GB
• Model binaries—often 1GB+
• Startup delay—mostly due to downloading and initializing these assets

• Image Streaming: We utilize container image streaming to allow pods to start initializing while the massive Triton/Dynamo container layers are still downloading, further shaving seconds off the startup time.

                          To fix this, we built a lightweight Triton image, stripping unused components and shrinking the size to 900MB. This cut cold start times drastically, making auto-scaling faster and smoother.

Embedding Search: The Last Piece of the Puzzle

                          By mid-2023, most of our ML stack had gone real-time—except for Candidate Generation (CG), which still ran in batch mode. To truly power real-time recommendations, we needed an online embedding search system.

Choosing the Right Vector Database

                          We benchmarked three production-ready vector DBs across key parameters:

6. The Inference Runtime (Kubernetes)


                          The workload lands on Kubernetes with Autoscaling.

• Milvus
• Qdrant
• Weaviate

• Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference.
• Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk ("Pull from Disk").

                          After extensive POCs, Qdrant stood out for its:

7. Client Interaction & Observability


                          Finally, the LLM Inference Client executes the request.

• ✅ Blazing-fast search latency on high-dimensional vectors
• ✅ Efficient memory usage, crucial for in-memory workloads
• ✅ Support for upserts and soft deletes, vital for Ads use cases
• ✅ gRPC + REST APIs, making integration seamless
• ✅ Powerful filtering, allowing fine-tuned retrieval (e.g., filtering Ads by category, active status, etc.)

• Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used.
• Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time.
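The versioned-prompt flow can be sketched as a minimal keyed store (all class and method names here are hypothetical illustrations, not the platform's actual API):

```python
class PromptRegistry:
    """Toy version-controlled prompt store, keyed by (name, version)."""
    def __init__(self):
        self._store = {}

    def register(self, name, version, template):
        self._store[(name, version)] = template

    def render(self, name, version, **variables):
        # Pull the exact versioned template, then inject the request variables.
        return self._store[(name, version)].format(**variables)

registry = PromptRegistry()
registry.register("customer_support", "v2",
                  "You are a support agent. Customer asks: {question}")
prompt = registry.render("customer_support", "v2", question="Where is my order?")
print(prompt)
```

Because the template is resolved by (name, version) at request time, a prompt change is a registry update, never a code deploy.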

                          At its core, Qdrant uses HNSW indexing, delivering both high recall and low-latency nearest-neighbor search—a perfect fit for our needs.
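For intuition, here is what exact nearest-neighbour search looks like in plain Python (toy 3-dimensional "embeddings"); HNSW builds a graph index that reaches approximately the same neighbours without scanning every vector:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def nearest(query, catalog, k=2):
    # Exact (brute-force) search scans the full catalog; HNSW avoids the scan
    # by greedily walking a layered proximity graph, trading exactness for speed.
    ranked = sorted(catalog, key=lambda pid: cosine(query, catalog[pid]), reverse=True)
    return ranked[:k]

catalog = {                      # hypothetical item ids and tiny embeddings
    "saree_red":  [0.9, 0.1, 0.0],
    "saree_blue": [0.8, 0.2, 0.1],
    "phone_case": [0.0, 0.1, 0.9],
}
print(nearest([1.0, 0.0, 0.0], catalog, k=2))  # ['saree_red', 'saree_blue']
```

The brute-force version is O(N) per query, which is exactly what becomes untenable at catalog scale and why an approximate index is the practical choice.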

Embedding Freshness & Real-Time Updates

                          To ensure embeddings stayed up to date, we built a dual ingestion pipeline:


                          Observability: Monitoring the Pulse of GenAI


                          In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. A user doesn't care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows.


                          To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:

1. Time to First Token (TTFT)

• 📌 Daily Refresh: A bulk pipeline updated embeddings overnight
• 📌 Real-Time Updates: Ads events triggered immediate upserts/deletes

• Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user.
• Why it matters: This represents the "Prefill Phase" latency—the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or "hung."
• Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hit rates), which drastically lowers this metric by skipping redundant prompt processing.

                            This setup powered real-time "Similar Products" recommendations on the product page and became the foundation for Ads Candidate Generation, ensuring the right ads surfaced in milliseconds.


                            Final Takeaways: Scaling Smartly for Real-Time ML

2. Inter-Token Latency (ITL)

• 🚀 Self-hosted inference on Triton gave us lower cost, faster scaling, and better performance than managed services
• 🚀 Building a custom Triton image reduced cold starts, improving responsiveness
• 🚀 Qdrant-based embedding search enabled real-time personalization at scale
• 🚀 Real-time updates for embeddings unlocked dynamic, up-to-date recommendations

• Definition: ITL measures the average time interval between the generation of consecutive tokens during the "Decode Phase".
• Why it matters: This defines the "perceived speed" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look "jerky" or slow to the user.
• Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speeds to maintain a natural conversational flow.
3. Token Throughput vs. Request Throughput

We distinguish between two types of throughput to balance system efficiency with user load:

• Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching.
• Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before the queue depth impacts ITL.
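These metrics can be derived from per-token stream timestamps, sketched below (function names are illustrative, not the platform's actual instrumentation API):

```python
def llm_metrics(request_ts, token_ts):
    """Derive TTFT and mean ITL from a request's arrival time and the
    wall-clock timestamps at which each token was streamed back."""
    ttft = token_ts[0] - request_ts
    gaps = [b - a for a, b in zip(token_ts, token_ts[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# One request arriving at t=0.00s, five tokens streamed back (seconds):
ttft, itl = llm_metrics(0.00, [0.12, 0.15, 0.18, 0.21, 0.24])
print(f"TTFT={ttft:.2f}s  ITL={itl * 1000:.0f}ms")  # TTFT=0.12s  ITL=30ms

# Aggregate view over a 1-second window with 3 concurrent requests:
tokens_per_request = [120, 80, 40]
token_throughput = sum(tokens_per_request)    # 240 tokens/sec (GPU efficiency)
request_throughput = len(tokens_per_request)  # 3 req/sec (autoscaling signal)
```

The split matters because a GPU can post excellent token throughput while individual users still see a sluggish stream — which is why both are tracked separately.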
The Monitoring Stack

• Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot "slow generation" incidents that generic "500 error" alerts would miss.
• Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific "slow" request back to its prompt to understand if a complex input caused the latency spike.

Supported Inference Backends (TensorRT-LLM, Dynamo & vLLM)


Tailored for the Use Case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases—whether a real-time voice bot requiring ultra-low, sub-second latency or a massive reasoning task requiring huge context windows—demand different runtime characteristics. Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:

1. TensorRT-LLM: The High-Performance Standard

Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots).

TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization.


                          Key optimizations we tailor for these high-load cases include:

• Optimized execution via TensorRT engine compilation
• Quantization-aware execution for reduced memory usage and improved throughput
• Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization.
• Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms.
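Why quantization matters so much comes down to simple arithmetic on weight footprints (approximate: weights only, ignoring activations and KV cache):

```python
def weight_gb(n_params_billion, bits_per_weight):
    # Approximate weight footprint only; activations and KV cache are extra.
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4 AWQ", 4)]:
    print(f"70B model @ {name}: ~{weight_gb(70, bits):.0f} GB")
```

At FP16 a 70B model needs roughly 140 GB for weights alone — more than a single 80 GB GPU — while INT4 AWQ brings it to roughly 35 GB, leaving headroom for the KV cache.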

                          By early 2024, Meesho’s ML stack had evolved into a fully real-time, scalable, and cost-efficient system, setting the foundation for even bigger leaps ahead.

2. Dynamo: Distributed Inference for Reasoning Models


                        Suitable for: Very large "reasoning" models (70B+) or scenarios requiring massive context windows where a single GPU's memory is insufficient.

For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework. Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:

• KV Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation.
• Prefill vs. Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy "reading" phase independently from the memory-heavy "writing" phase.
• Distributed execution across multiple GPU resources
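KV-aware routing can be illustrated with a toy router that picks the worker sharing the longest cached prompt prefix (a simplified sketch, not Dynamo's actual routing logic; worker names are hypothetical):

```python
def kv_aware_route(prompt_tokens, workers):
    """Toy router: send the request to the worker whose cached token prefix
    overlaps the new prompt the most, so prefill can skip those tokens."""
    def shared_prefix(cached):
        n = 0
        for a, b in zip(cached, prompt_tokens):
            if a != b:
                break
            n += 1
        return n
    return max(workers, key=lambda w: shared_prefix(workers[w]))

workers = {                       # worker id -> token ids already in its KV cache
    "decode-0": [1, 2, 3, 4],
    "decode-1": [1, 9],
    "decode-2": [],
}
print(kv_aware_route([1, 2, 3, 5, 6], workers))  # decode-0 shares a 3-token prefix
```

Routing to `decode-0` means only the last two prompt tokens need fresh prefill computation, which is the whole point of keeping requests sticky to their cache.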
                        +
                      • +
                      • +

                        vLLM: The Flexible Baseline


                        Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput.

While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline.

• High throughput through dynamic batching and efficient memory utilization
• Paged KV cache management for handling long contexts and concurrent requests
• Strong support for open-source model ecosystems
• Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build.
• Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. We use it strategically for initial testing before committing to a full TensorRT optimization pipeline.

                        Conclusion


                        Large language model inference introduces a fundamentally new class of infrastructure challenges—where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads.


                        The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle—from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity.


                        Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows.


                        Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment—allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences.


                        Future Explorations


                        While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:

• TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs for integration into the platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.
• Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a "serverless" experience where specific fine-tunes are hot-swapped instantly per request.
• Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user's streaming experience.
• Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., "How do I reset my password?" vs. "Password reset steps"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.
• Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before the memory fills up, preventing eviction-based slowdowns during traffic spikes.
• Online Evaluation & Guardrails: We are integrating a lightweight "Trust Layer" into the proxy. This will allow for low-latency input/output filtering (Guardrails) and asynchronous "LLM-as-a-Judge" evaluation pipelines to monitor response quality in production, not just system health.

                        Cracking the Code: Scaling Model Inference & Real-Time Embedding Search

                        · 4 min read
                        Aditya Kumar
                        Lead Software Engineer @ Meesho
                        Jaya Kumar
                        Lead ML Engineer @ Meesho
                        Adarsha Das
                        Senior Architect @ Meesho

                        BharatMLStack

                        By mid-2023, we had transformed our ML stack—building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:


                          feature_1_value,feature_2_value,feature_3_value;expiry_ts


                        This format allowed:

                        • Consistent writes and reads at the group level

                          Why Redis?

                          Storage Structure

                          Each user’s interactions were stored using a composite key format, uniquely identifying the user and interaction type. This structure allowed efficient organization and quick retrieval of recent activity for recommendation generation:

userId_eventType → ZSET[...(pid, ts)...]

                          Within each ZSET:

                          • The timestamp served as the score, maintaining temporal order
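The storage semantics can be modelled in a few lines of Python. This in-memory class is a stand-in for the actual Redis commands (ZADD on write, ZREVRANGE on read), with names of our choosing:

```python
from collections import defaultdict

class InteractionStore:
    """In-memory stand-in for the Redis layout: one sorted set per
    userId_eventType key, with each product id scored by its event
    timestamp (ZADD on write, ZREVRANGE on read)."""
    def __init__(self):
        self._zsets: dict[str, dict[str, float]] = defaultdict(dict)

    def record(self, user_id: str, event_type: str, pid: str, ts: float) -> None:
        # Re-adding a pid updates its score, just like ZADD.
        self._zsets[f"{user_id}_{event_type}"][pid] = ts

    def recent(self, user_id: str, event_type: str, n: int) -> list[str]:
        members = self._zsets[f"{user_id}_{event_type}"]
        return [pid for pid, _ in sorted(members.items(), key=lambda kv: -kv[1])][:n]

store = InteractionStore()
store.record("u1", "click", "p1", 100.0)
store.record("u1", "click", "p2", 200.0)
store.record("u1", "click", "p1", 300.0)  # p1 clicked again: its score moves forward
```

Because the timestamp is the score, the most recent interactions always sort first.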


                            One post tagged with "model-inference"

                            View All Tags

                            Cracking the Code: Scaling Model Inference & Real-Time Embedding Search

                            · 4 min read
                            Aditya Kumar
                            Lead Software Engineer @ Meesho
                            Jaya Kumar
                            Lead ML Engineer @ Meesho
                            Adarsha Das
                            Senior Architect @ Meesho

                            BharatMLStack

                            By mid-2023, we had transformed our ML stack—building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:



                              One post tagged with "online-feature-store"

                              View All Tags

                              Building Meesho’s ML Platform: From Chaos to Cutting-Edge (Part 1)

                              · 11 min read
                              Adarsha Das
                              Senior Architect @ Meesho
                              Aditya Kumar
                              Lead Software Engineer @ Meesho
                              Bhawani Singh
                              Architect @ Meesho
                              Jigar Dave
                              Lead Software Engineer @ Meesho

                              BharatMLStack

                              The Genesis: How a Friday Night Roast Sparked Meesho’s ML Platform

                              It all started in early 2022, over a casual Friday evening catch-up. Like many great origin stories, this one began with friendly banter between a group of backend engineers and data scientists. As the conversations unfolded, so did the roasting—until one remark hit a little too close to home:

                              "Why are we still crunching data for Monthly Active Users (MAU) when the next day it’s all about Daily Active Users (DAU)?"



                                  2 posts tagged with "tensorrt-llm"

                                  View All Tags

                                  LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

                                  · 5 min read
                                  Jaya Kumar
                                  Lead ML Engineer @ Meesho

                                  BharatMLStack

                                  LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

                                  Raw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack—from memory management to kernel execution.

                                  1. Advanced Memory Management: Paged & Prefix KV Caching


Voice bot qu

| Inference runtime | Concurrency | p99 TTFT (ms) | p99 ITL (ms) | Token throughput (tokens/sec) | Request throughput (req/sec) | Hardware |
|-------------------|-------------|---------------|--------------|-------------------------------|------------------------------|----------|
| TensorRT-LLM | 1 | 36.27 | 22.78 | 45.66 | 0.23 | L4 |
| TensorRT-LLM | 2 | 49.81 | 23.21 | 89.37 | 0.45 | L4 |
| TensorRT-LLM | 4 | 55.33 | 36.62 | 153.39 | 0.78 | L4 |
| TensorRT-LLM | 8 | 66.5 | 39.11 | 279.88 | 1.47 | L4 |
| TensorRT-LLM | 16 | 131.8 | 30.39 | 547.8 | 2.77 | L4 |
| TensorRT-LLM | 32 | 277.22 | 48.02 | 925.7 | 4.78 | L4 |
| TensorRT-LLM | 64 | 498.52 | 71.62 | 1,164.40 | 6.2 | L4 |
| TensorRT-LLM | 128 | 677.31 | 120.37 | 1,445.18 | 7.69 | L4 |
| TensorRT-LLM | 256 | 1,926.31 | 216.88 | 1,600.81 | 8.52 | L4 |
| TensorRT-LLM | 1 | 21.17 | 9.24 | 130.05 | 0.68 | A100 |
| TensorRT-LLM | 2 | 25.78 | 9.21 | 264.5 | 1.35 | A100 |
| TensorRT-LLM | 4 | 28.52 | 10.99 | 437.69 | 2.27 | A100 |
| TensorRT-LLM | 8 | 34.4 | 12.61 | 760.49 | 3.96 | A100 |
| TensorRT-LLM | 16 | 68.03 | 14.32 | 1,343.80 | 7.01 | A100 |
| TensorRT-LLM | 32 | 185.96 | 16.82 | 2,287.30 | 11.92 | A100 |
| TensorRT-LLM | 64 | 136.87 | 21.17 | 3,625.22 | 18.89 | A100 |
| TensorRT-LLM | 128 | 463.78 | 34.15 | 4,456.51 | 23.24 | A100 |
| TensorRT-LLM | 256 | 890.12 | 59.18 | 5,188.24 | 27.05 | A100 |

                                  Conclusion

                                  High-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure.


                                  These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications.

                                  Cracking the Code: Scaling Model Inference & Real-Time Embedding Search

                                  · 4 min read
                                  Aditya Kumar
                                  Lead Software Engineer @ Meesho
                                  Jaya Kumar
                                  Lead ML Engineer @ Meesho
                                  Adarsha Das
                                  Senior Architect @ Meesho

                                  BharatMLStack


                                  By mid-2023, we had transformed our ML stack—building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:

• 🔹 Scaling model inference without hitting infrastructure roadblocks
• 🔹 Moving embedding search from batch to real-time for candidate generation

                                  Here’s how we tackled these last-mile challenges, broke free from infrastructure constraints, and built a cost-efficient, high-performance system.


                                  Breaking Free from the Scalability Ceiling


                                  The Model Serving Bottleneck—A Wake-Up Call

July 2023. With just months left for the Mega Blockbuster Sale (MBS), we noticed a serious issue—scaling our model-serving infrastructure was taking 10–15 minutes. In real-time ML, that’s an eternity.

In one of our war rooms, we ran a quick experiment:

• 🚀 We deployed an XGBoost model on a self-hosted Triton Inference Server running on a 16-core machine.
• 🚀 Fired requests and compared the outputs with our existing cloud-hosted setup.
• 🚀 The results matched—perfectly.

That moment changed everything. We prepped a backup Triton setup on EKS, just in case our cloud provider couldn't allocate enough compute resources in time. Luckily, they did—but the seed was planted.

Then in October, just two weeks before MBS, we got an alarming response from our infrastructure team:

"Node availability may be an issue."

With no time to waste, we moved 30% of real-time ML traffic to our self-hosted Triton cluster. The results?

• ✅ p99 latency dropped from 90–100ms to 30–40ms
• ✅ Triton handled significantly higher throughput on fewer resources
• ✅ No model changes were needed

                                  MBS ran without a hitch, proving that self-hosted inference was the way forward.


                                  Scaling Triton on GKE


                                  This left us with two choices:

• 1️⃣ Port models to a managed cloud inference service, investing time in learning a new deployment stack
• 2️⃣ Scale our existing Triton setup on GKE, optimizing for cost and performance

                                  We went with Option 2—and it slashed inference costs to 35% of what we previously paid, while giving us full control over scaling and optimizations.


                                  Fixing the Cold Start Problem

As we onboarded more deep learning (DL) models, we hit a new bottleneck: new inference pods took 7–9 minutes to spin up.


                                  After profiling, we found the culprits:

• Triton’s base image—a massive 5GB
• Model binaries—often 1GB+
• Startup delay—mostly due to downloading and initializing these assets

                                  To fix this, we built a lightweight Triton image, stripping unused components and shrinking the size to 900MB. This cut cold start times drastically, making auto-scaling faster and smoother.


                                  Embedding Search: The Last Piece of the Puzzle


                                  By mid-2023, most of our ML stack had gone real-time—except for Candidate Generation (CG), which still ran in batch mode. To truly power real-time recommendations, we needed an online embedding search system.


                                  Choosing the Right Vector Database


                                  We benchmarked three production-ready vector DBs across key parameters:

• Milvus
• Qdrant
• Weaviate

                                  After extensive POCs, Qdrant stood out for its:

• ✅ Blazing-fast search latency on high-dimensional vectors
• ✅ Efficient memory usage, crucial for in-memory workloads
• ✅ Support for upserts and soft deletes, vital for Ads use cases
• ✅ gRPC + REST APIs, making integration seamless
• ✅ Powerful filtering, allowing fine-tuned retrieval (e.g., filtering Ads by category, active status, etc.)

                                  At its core, Qdrant uses HNSW indexing, delivering both high recall and low-latency nearest-neighbor search—a perfect fit for our needs.
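To make the retrieval pattern concrete, here is a brute-force Python stand-in for filtered vector search. Qdrant evaluates the same filter-plus-similarity query over an HNSW index rather than a full scan, so treat this only as a behavioural sketch with names of our choosing:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def filtered_search(points, query_vec, top_k, **filters):
    """Keep only points whose payload matches every filter, then rank
    the survivors by cosine similarity to the query vector."""
    candidates = [
        p for p in points
        if all(p["payload"].get(k) == v for k, v in filters.items())
    ]
    candidates.sort(key=lambda p: cosine(p["vector"], query_vec), reverse=True)
    return [p["id"] for p in candidates[:top_k]]

# Toy ads index: ad2 is inactive, so the payload filter drops it.
ads = [
    {"id": "ad1", "vector": [1.0, 0.0], "payload": {"category": "shoes", "active": True}},
    {"id": "ad2", "vector": [0.9, 0.1], "payload": {"category": "shoes", "active": False}},
    {"id": "ad3", "vector": [0.0, 1.0], "payload": {"category": "shoes", "active": True}},
]
result = filtered_search(ads, [1.0, 0.0], top_k=2, category="shoes", active=True)
```

The payload filter applied before ranking is what enables queries like "active shoe ads only".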


                                  Embedding Freshness & Real-Time Updates


                                  To ensure embeddings stayed up to date, we built a dual ingestion pipeline:

• 📌 Daily Refresh: A bulk pipeline updated embeddings overnight
• 📌 Real-Time Updates: Ads events triggered immediate upserts/deletes

                                  This setup powered real-time "Similar Products" recommendations on the product page and became the foundation for Ads Candidate Generation, ensuring the right ads surfaced in milliseconds.


                                  Skye


                                  Final Takeaways: Scaling Smartly for Real-Time ML

• 🚀 Self-hosted inference on Triton gave us lower cost, faster scaling, and better performance than managed services
• 🚀 Building a custom Triton image reduced cold starts, improving responsiveness
• 🚀 Qdrant-based embedding search enabled real-time personalization at scale
• 🚀 Real-time updates for embeddings unlocked dynamic, up-to-date recommendations

                                  By early 2024, Meesho’s ML stack had evolved into a fully real-time, scalable, and cost-efficient system, setting the foundation for even bigger leaps ahead.


                                  Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

                                  · 14 min read
                                  Jaya Kumar
                                  Lead ML Engineer @ Meesho

                                  BharatMLStack


                                  Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving


                                  Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale.


                                  The platform implements a complete LLMOps lifecycle — from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required.


                                  In addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques — such as quantization strategies, batching configurations, and runtime-specific performance enhancements — enabling teams to balance latency, throughput, and cost based on their use case. The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference.


Why LLM Inference Is Not Just Bigger ML Model Serving


                                  Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled.


                                  Autoregressive Generation and Sequential Computation:

Unlike traditional models such as classifiers or recommenders — where inference cost is relatively constant — LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation.

Because tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution.


                                  Prefill and Decode Phases:


                                  LLM inference typically consists of two distinct stages:

• Prefill phase — the model processes the input prompt and builds internal representations. This stage is compute-heavy and highly parallelizable.
• Decode phase — the model generates tokens sequentially, predicting one token at a time using previously generated context.

                                  The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads.
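A toy loop makes the asymmetry concrete (`toy_next_token` is a deterministic stand-in for a transformer forward pass, not a real model):

```python
def toy_next_token(context: list[int]) -> int:
    # Stand-in for a transformer forward pass: "predict" a token
    # deterministically from the running context.
    return (sum(context) + len(context)) % 50

def generate(prompt: list[int], max_new_tokens: int) -> list[int]:
    # Prefill: the whole prompt is available up front, so a real engine
    # processes it in one highly parallel pass.
    context = list(prompt)
    out = []
    # Decode: inherently sequential; token t+1 needs token t in context,
    # so each iteration depends on the previous one.
    for _ in range(max_new_tokens):
        tok = toy_next_token(context)
        context.append(tok)
        out.append(tok)
    return out

tokens = generate([3, 7, 11], max_new_tokens=4)
```

The loop-carried dependency in decode is exactly why that phase cannot be parallelized the way prefill can.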


                                  Context Management and KV Caching:

Another fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens.

KV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:

• Memory consumption grows with sequence length and batch size
• GPU memory becomes a critical bottleneck
• Efficient memory management becomes essential for scaling concurrent requests

                                  This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads.
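A back-of-the-envelope estimator shows how quickly KV-cache memory grows. The model shape below is an assumed Llama-2-7B-like configuration (32 layers, 32 KV heads, head dimension 128, FP16), used purely for illustration:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    # Per token, each layer stores one key and one value vector of size
    # n_kv_heads * head_dim, in the chosen precision.
    per_token = n_layers * 2 * n_kv_heads * head_dim * dtype_bytes
    return per_token * seq_len * batch

# Assumed 7B-class shape, FP16, 4k context, batch of 8 concurrent requests
# (numbers are assumptions, not measurements).
gib = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8) / 2**30
```

Even at this modest batch size the cache alone reaches tens of GiB, which is why paged KV caching and careful admission control matter.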


                                  Dynamic and Irregular Workloads:


                                  Traditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. As a result:

• Batch sizes must be dynamic rather than static
• Requests may enter and leave batches asynchronously
• Scheduling systems must continuously rebalance workloads to maximize GPU utilization

                                  These characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines.
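The scheduling idea can be sketched as a toy simulation (the names and the one-token-per-step model are ours): requests join and leave the batch between decode steps instead of waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batching(requests, max_batch: int) -> int:
    """Toy inflight-batching scheduler: each step decodes one token for
    every active request; finished requests free their slot immediately
    and queued requests join mid-flight."""
    queue = deque(requests)          # (request_id, tokens_to_generate)
    active: dict[str, int] = {}
    steps = 0
    while queue or active:
        # Admit queued requests into free batch slots before this step.
        while queue and len(active) < max_batch:
            rid, need = queue.popleft()
            active[rid] = need
        steps += 1
        for rid in list(active):     # one decode step per active request
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]      # slot freed for the next queued request
    return steps

# Four requests with very different output lengths, batch capacity 2.
steps = continuous_batching([("a", 2), ("b", 5), ("c", 3), ("d", 1)], max_batch=2)
```

Short requests exit early and hand their slot to waiting ones, which is the core of how inflight batching keeps GPU utilization high under irregular workloads.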


                                  Streaming and User Experience Constraints:

Another distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated.

Because of these differences — sequential generation, growing memory requirements, dynamic workloads, and streaming constraints — LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads.


                                  LLMOps: High-Level Architecture


                                  LLM Architecture


                                  The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention.


                                  Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability.

1. Onboarding & Registration (The Source of Truth)

   The lifecycle begins with the Data Scientist or engineer.

   • Model Ingestion: Users onboard models—whether open-source (Hugging Face, NeMo) or internally fine-tuned—via the Truffle Box SDK/UI.
   • LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. This allows Data Scientists to register and version-control prompts (e.g., "customer_support_v2") independently of the application code.
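The registry's versioning behavior can be sketched in a few lines. The class and method names here are illustrative, not the actual Truffle Box API:

```python
# Minimal sketch of a versioned prompt registry (hypothetical API;
# the real registry lives behind the Truffle Box control plane).
class PromptRegistry:
    def __init__(self) -> None:
        self._prompts: dict[str, dict[int, str]] = {}

    def register(self, name: str, template: str) -> str:
        # Each registration of the same name creates a new version.
        versions = self._prompts.setdefault(name, {})
        version = len(versions) + 1
        versions[version] = template
        return f"{name}_v{version}"  # e.g. "customer_support_v2"

    def get(self, prompt_id: str) -> str:
        name, _, version = prompt_id.rpartition("_v")
        return self._prompts[name][int(version)]

registry = PromptRegistry()
registry.register("customer_support", "You are a support agent. Q: {question}")
pid = registry.register("customer_support", "You are a concise support agent. Q: {question}")
print(pid)  # customer_support_v2
```

Because the application only holds a prompt ID, Data Scientists can roll a new prompt version forward (or back) without an application deploy.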
2. The "Black Box" Build Engine

   Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources.

   • Transformation: The raw model is converted into a TRT-LLM Checkpoint.
   • Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint.
   • Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware.
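To make the memory savings concrete, here is the basic arithmetic behind weight-only INT4 quantization. This is an illustration only; the AWQ and FP8 pipelines in TensorRT-LLM are far more sophisticated (per-group scales, activation-aware scaling, etc.):

```python
def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    # Symmetric quantization into the int4 range [-8, 7]:
    # one shared scale maps the largest weight magnitude to 7.
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

w = [0.12, -0.51, 0.33, 0.70]
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
# Each reconstructed weight lands within half a quantization step of the
# original, while storing 4 bits per weight instead of 16 or 32.
```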
3. Intelligent Profiling & Validation

   Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler.

   • Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. vLLM).
   • Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost.
4. Smart Artifact Generation & Distribution

   To solve the Kubernetes "Cold Start" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:

   • Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup.
   • Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times.
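The bifurcated policy itself is a simple size threshold; a sketch (the 8 GiB cutoff mirrors the ">8GB" rule of thumb above, and the strategy names are illustrative):

```python
GIB = 1 << 30

def distribution_strategy(model_size_bytes: int, threshold: int = 8 * GIB) -> str:
    # Small artifacts: pulled from Cloud Storage at pod startup.
    # Very large artifacts: pre-baked onto secondary boot disks that are
    # attached to new GPU nodes during autoscaling.
    return "secondary-boot-disk" if model_size_bytes > threshold else "gcs-download"

print(distribution_strategy(4 * GIB))   # gcs-download
print(distribution_strategy(16 * GIB))  # secondary-boot-disk
```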
5. Image Streaming & Deployment

   Simultaneously, the inference runtime container images are pulled from the Artifact Registry.

   • Image Streaming: We utilize container image streaming to allow pods to start initializing while the massive Triton/Dynamo container layers are still downloading, further shaving seconds off the startup time.
6. The Inference Runtime (Kubernetes)

   The workload lands on Kubernetes with Autoscaling.

   • Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference.
   • Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk ("Pull from Disk").
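The backend decision made at pod initialization can be summarized as a small dispatch function. The profile field names here are assumptions, not the platform's actual schema:

```python
def select_backend(profile: dict) -> str:
    # Simplified version of the runtime decision driven by the
    # profiling stage (hypothetical field names).
    if profile.get("needs_distributed"):   # 70B+ models / huge contexts
        return "dynamo"
    if profile.get("has_trt_engine"):      # a tuned TRT engine was compiled
        return "tensorrt-llm"
    return "vllm"                          # flexible fallback

print(select_backend({"needs_distributed": True}))  # dynamo
print(select_backend({"has_trt_engine": True}))     # tensorrt-llm
print(select_backend({}))                           # vllm
```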
7. Client Interaction & Observability

   Finally, the LLM Inference Client executes the request.

   • Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used.
   • Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time.
Observability: Monitoring the Pulse of GenAI

In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. A user doesn't care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows.

To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:
1. Time to First Token (TTFT)

   • Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user.
   • Why it matters: This represents the "Prefill Phase" latency—the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or "hung."
   • Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hit rates), which drastically lowers this metric by skipping redundant prompt processing.
2. Inter-Token Latency (ITL)

   • Definition: ITL measures the average time interval between the generation of consecutive tokens during the "Decode Phase".
   • Why it matters: This defines the "perceived speed" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look "jerky" or slow to the user.
   • Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speeds to maintain a natural conversational flow.
3. Token Throughput vs. Request Throughput

   We distinguish between two types of throughput to balance system efficiency with user load:

   • Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching.
   • Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before the queue depth impacts ITL.
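TTFT and ITL fall straight out of per-token arrival timestamps, exactly as a streaming client would observe them; a minimal sketch:

```python
def ttft(request_start: float, token_times: list[float]) -> float:
    # Prefill latency: request arrival to first streamed token.
    return token_times[0] - request_start

def mean_itl(token_times: list[float]) -> float:
    # Decode cadence: average gap between consecutive tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps)

times = [0.25, 0.30, 0.36, 0.41]  # token arrival times in seconds
print(round(ttft(0.0, times), 3))  # 0.25
print(round(mean_itl(times), 3))   # 0.053
```

In production these are aggregated into p99 histograms rather than simple means, but the per-request measurements are the same.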
The Monitoring Stack

• Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot "slow generation" incidents that generic "500 error" alerts would miss.
• Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific "slow" request back to its prompt to understand if a complex input caused the latency spike.

Supported Inference Backends (TensorRT-LLM, Dynamo & vLLM)

Tailored for the Use Case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases—whether a real-time voice bot requiring ultra-low, sub-second latency or a massive reasoning task requiring huge context windows—demand different runtime characteristics. Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:
1. TensorRT-LLM: The High-Performance Standard

   Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots).

   TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization.

   Key optimizations we tailor for these high-load cases include:

   • Optimized execution via TensorRT engine compilation
   • Quantization-aware execution for reduced memory usage and improved throughput
   • Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization.
   • Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms.
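The benefit of inflight (continuous) batching can be seen in a toy simulation: finished requests free their batch slot immediately, so queued requests are not stuck behind the longest request in a batch. This is a drastic simplification of what TensorRT-LLM's scheduler actually does:

```python
from collections import deque

def decode_steps(requests: list[int], max_batch: int) -> int:
    # requests: number of decode steps each request needs.
    # Every iteration, free slots are refilled from the queue and
    # every active request advances by one decode step.
    queue, active, steps = deque(requests), [], 0
    while queue or active:
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps

# Four requests of skewed lengths, batch size 2: inflight batching needs
# 8 total decode steps; a static batcher that drains each full batch
# before admitting new requests would need 9 (8 + 1).
print(decode_steps([8, 1, 1, 1], max_batch=2))  # 8
```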
2. Dynamo: Distributed Inference for Reasoning Models

   Suitable for: Very large "reasoning" models (70B+) or scenarios requiring massive context windows where a single GPU's memory is insufficient.

   For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework. Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:

   • KV-Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation.
   • Prefill vs. Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy "reading" phase independently from the memory-heavy "writing" phase.
   • Distributed execution across multiple GPU resources
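The intuition behind KV-aware routing is prefix matching: send the request to the worker whose cache already covers the longest prefix of the prompt, so that prefill work is not repeated. A toy sketch with illustrative data structures (Dynamo's actual router is considerably more elaborate):

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    # Length of the common token prefix between two sequences.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt_tokens: list[int], worker_caches: dict[str, list[int]]) -> str:
    # Pick the worker whose cached tokens overlap the prompt the most.
    return max(worker_caches,
               key=lambda w: shared_prefix_len(prompt_tokens, worker_caches[w]))

caches = {
    "worker-a": [1, 2, 3],      # cached an unrelated conversation
    "worker-b": [7, 8, 9, 10],  # cached the same system-prompt prefix
}
print(route([7, 8, 9, 4], caches))  # worker-b
```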
3. vLLM: The Flexible Baseline

   Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput.

   While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline:

   • High throughput through dynamic batching and efficient memory utilization
   • Paged KV cache management for handling long contexts and concurrent requests
   • Strong support for open-source model ecosystems
   • Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build.
   • Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. We use it strategically for initial testing before committing to a full TensorRT optimization pipeline.
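The paged KV cache idea mentioned above can be illustrated in miniature: sequences acquire fixed-size blocks on demand instead of pre-reserving memory for the maximum context length. This is a toy model of the concept, not vLLM's implementation:

```python
BLOCK_TOKENS = 16  # tokens whose KV state fits in one block

class PagedKVCache:
    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq id -> block ids

    def append_token(self, seq_id: str, position: int) -> int:
        # Allocate a fresh block only when the current one fills up.
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_TOKENS == 0:
            table.append(self.free_blocks.pop())
        return table[-1]  # block holding this token's KV state

cache = PagedKVCache(num_blocks=8)
for pos in range(20):
    cache.append_token("seq-0", pos)
print(len(cache.block_tables["seq-0"]))  # a 20-token sequence uses 2 blocks
```

Freed blocks can immediately back other sequences, which is what lets vLLM pack many concurrent long-context requests into limited GPU memory.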

                                  Conclusion

                                  Large language model inference introduces a fundamentally new class of infrastructure challenges—where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads.

                                  The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle—from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity.

                                  Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows.

                                  Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment—allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences.

                                  Future Explorations

While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:

• TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs for integration into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.
• Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a "serverless" experience where specific fine-tunes are hot-swapped instantly per request.
• Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user's streaming experience.
• Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., "How do I reset my password?" vs. "Password reset steps"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.
• Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before the memory fills up, preventing eviction-based slowdowns during traffic spikes.
• Online Evaluation & Guardrails: We are integrating a lightweight "Trust Layer" into the proxy. This will allow for low-latency input/output filtering (Guardrails) and asynchronous "LLM-as-a-Judge" evaluation pipelines to monitor response quality in production, not just system health.

                                  2 posts tagged with "vllm"


                                  LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

                                  · 5 min read
                                  Jaya Kumar
                                  Lead ML Engineer @ Meesho

                                  BharatMLStack


                                  LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

                                  Raw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack—from memory management to kernel execution.

                                  1. Advanced Memory Management: Paged & Prefix KV Caching

                                  Voice bot qu
| Inference runtime | Concurrency | p99 TTFT (ms) | p99 ITL (ms) | Token throughput (tokens/sec) | Request throughput (req/sec) | Hardware |
|---|---|---|---|---|---|---|
| TensorRT-LLM | 1 | 36.27 | 22.78 | 45.66 | 0.23 | L4 |
| TensorRT-LLM | 2 | 49.81 | 23.21 | 89.37 | 0.45 | L4 |
| TensorRT-LLM | 4 | 55.33 | 36.62 | 153.39 | 0.78 | L4 |
| TensorRT-LLM | 8 | 66.5 | 39.11 | 279.88 | 1.47 | L4 |
| TensorRT-LLM | 16 | 131.8 | 30.39 | 547.8 | 2.77 | L4 |
| TensorRT-LLM | 32 | 277.22 | 48.02 | 925.7 | 4.78 | L4 |
| TensorRT-LLM | 64 | 498.52 | 71.62 | 1,164.40 | 6.2 | L4 |
| TensorRT-LLM | 128 | 677.31 | 120.37 | 1,445.18 | 7.69 | L4 |
| TensorRT-LLM | 256 | 1,926.31 | 216.88 | 1,600.81 | 8.52 | L4 |
| TensorRT-LLM | 1 | 21.17 | 9.24 | 130.05 | 0.68 | A100 |
| TensorRT-LLM | 2 | 25.78 | 9.21 | 264.5 | 1.35 | A100 |
| TensorRT-LLM | 4 | 28.52 | 10.99 | 437.69 | 2.27 | A100 |
| TensorRT-LLM | 8 | 34.4 | 12.61 | 760.49 | 3.96 | A100 |
| TensorRT-LLM | 16 | 68.03 | 14.32 | 1,343.80 | 7.01 | A100 |
| TensorRT-LLM | 32 | 185.96 | 16.82 | 2,287.30 | 11.92 | A100 |
| TensorRT-LLM | 64 | 136.87 | 21.17 | 3,625.22 | 18.89 | A100 |
| TensorRT-LLM | 128 | 463.78 | 34.15 | 4,456.51 | 23.24 | A100 |
| TensorRT-LLM | 256 | 890.12 | 59.18 | 5,188.24 | 27.05 | A100 |

                                  Conclusion

                                  High-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure.

                                  These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications.

                                  Cracking the Code: Scaling Model Inference & Real-Time Embedding Search

                                  · 4 min read
                                  Aditya Kumar
                                  Lead Software Engineer @ Meesho
                                  Jaya Kumar
                                  Lead ML Engineer @ Meesho
                                  Adarsha Das
                                  Senior Architect @ Meesho

                                  BharatMLStack

                                  By mid-2023, we had transformed our ML stack—building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:

• 🔹 Scaling model inference without hitting infrastructure roadblocks
• 🔹 Moving embedding search from batch to real-time for candidate generation

                                  Here’s how we tackled these last-mile challenges, broke free from infrastructure constraints, and built a cost-efficient, high-performance system.

                                  Breaking Free from the Scalability Ceiling

                                  The Model Serving Bottleneck—A Wake-Up Call

July 2023. With just months left for the Mega Blockbuster Sale (MBS), we noticed a serious issue—scaling our model-serving infrastructure was taking 10–15 minutes. In real-time ML, that’s an eternity.

In one of our war rooms, we ran a quick experiment:

• 🚀 We deployed an XGBoost model on a self-hosted Triton Inference Server running on a 16-core machine.
• 🚀 Fired requests and compared the outputs with our existing cloud-hosted setup.
• 🚀 The results matched—perfectly.

That moment changed everything. We prepped a backup Triton setup on EKS, just in case our cloud provider couldn't allocate enough compute resources in time. Luckily, they did—but the seed was planted.

Then in October, just two weeks before MBS, we got an alarming response from our infrastructure team: "Node availability may be an issue."

With no time to waste, we moved 30% of real-time ML traffic to our self-hosted Triton cluster. The results?

• ✅ p99 latency dropped from 90–100ms to 30–40ms
• ✅ Triton handled significantly higher throughput on fewer resources
• ✅ No model changes were needed

                                  MBS ran without a hitch, proving that self-hosted inference was the way forward.

                                  Scaling Triton on GKE

                                  This left us with two choices:

• 1️⃣ Port models to a managed cloud inference service, investing time in learning a new deployment stack
• 2️⃣ Scale our existing Triton setup on GKE, optimizing for cost and performance

                                  We went with Option 2—and it slashed inference costs to 35% of what we previously paid, while giving us full control over scaling and optimizations.

                                  Fixing the Cold Start Problem

As we onboarded more deep learning (DL) models, we hit a new bottleneck: new inference pods took 7–9 minutes to spin up.

                                  After profiling, we found the culprits:

• Triton’s base image—a massive 5GB
• Model binaries—often 1GB+
• Startup delay—mostly due to downloading and initializing these assets

                                  To fix this, we built a lightweight Triton image, stripping unused components and shrinking the size to 900MB. This cut cold start times drastically, making auto-scaling faster and smoother.

                                  Embedding Search: The Last Piece of the Puzzle

                                  By mid-2023, most of our ML stack had gone real-time—except for Candidate Generation (CG), which still ran in batch mode. To truly power real-time recommendations, we needed an online embedding search system.

                                  -

                                  Choosing the Right Vector Database

                                  -

                                  We benchmarked three production-ready vector DBs across key parameters:

                                  -
                                    -
• Milvus
• Qdrant
• Weaviate

                                  After extensive POCs, Qdrant stood out for its:

• ✅ Blazing-fast search latency on high-dimensional vectors
• ✅ Efficient memory usage, crucial for in-memory workloads
• ✅ Support for upserts and soft deletes, vital for Ads use cases
• ✅ gRPC + REST APIs, making integration seamless
• ✅ Powerful filtering, allowing fine-tuned retrieval (e.g., filtering Ads by category, active status, etc.)

                                  At its core, Qdrant uses HNSW indexing, delivering both high recall and low-latency nearest-neighbor search—a perfect fit for our needs.
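The filtered retrieval described above can be illustrated with a small, pure-Python sketch. This is a brute-force stand-in for what an HNSW-indexed engine like Qdrant does far faster; the data, payload fields, and filter shape are hypothetical:

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def filtered_search(points, query, query_filter, top_k=3):
    """Brute-force filtered vector search.

    `points` is a list of dicts: {"id", "vector", "payload"}.
    A real engine would use an approximate index (e.g. HNSW) instead of
    scanning every point, but the semantics are the same: only points
    whose payload matches the filter are candidates for ranking.
    """
    candidates = [
        p for p in points
        if all(p["payload"].get(k) == v for k, v in query_filter.items())
    ]
    candidates.sort(key=lambda p: cosine_sim(p["vector"], query), reverse=True)
    return [p["id"] for p in candidates[:top_k]]

ads = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"category": "shoes", "active": True}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"category": "shoes", "active": False}},
    {"id": 3, "vector": [0.0, 1.0], "payload": {"category": "shoes", "active": True}},
]

# Only active "shoes" ads are eligible; ranked by similarity to the query.
print(filtered_search(ads, [1.0, 0.0], {"category": "shoes", "active": True}))  # [1, 3]
```

The inactive ad is excluded before ranking, which is exactly why payload filtering matters for Ads use cases: relevance alone is not enough if the ad can no longer be served.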

Embedding Freshness & Real-Time Updates

To ensure embeddings stayed up to date, we built a dual ingestion pipeline:

• 📌 Daily Refresh: A bulk pipeline updated embeddings overnight
• 📌 Real-Time Updates: Ads events triggered immediate upserts/deletes

                                  This setup powered real-time "Similar Products" recommendations on the product page and became the foundation for Ads Candidate Generation, ensuring the right ads surfaced in milliseconds.
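The dual-ingestion idea can be sketched in a few lines of Python. Here an in-memory dict stands in for the vector DB collection, and the event shape is illustrative rather than the actual production schema:

```python
# Minimal sketch of dual ingestion: a bulk daily refresh plus
# event-driven upserts/deletes applied to the same index.

index = {}  # ad_id -> embedding (stand-in for a vector DB collection)

def daily_refresh(snapshot):
    """Bulk pipeline: replace the index with the nightly snapshot."""
    index.clear()
    index.update(snapshot)

def on_ad_event(event):
    """Real-time pipeline: apply a single upsert or delete immediately."""
    if event["type"] == "upsert":
        index[event["ad_id"]] = event["embedding"]
    elif event["type"] == "delete":
        index.pop(event["ad_id"], None)

daily_refresh({"ad-1": [0.1, 0.2], "ad-2": [0.3, 0.4]})
on_ad_event({"type": "upsert", "ad_id": "ad-3", "embedding": [0.5, 0.6]})
on_ad_event({"type": "delete", "ad_id": "ad-1"})
print(sorted(index))  # ['ad-2', 'ad-3']
```

The key property is that both paths converge on one index, so a freshly paused ad disappears from retrieval within the event-propagation delay rather than waiting for the next nightly run.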


                                  Final Takeaways: Scaling Smartly for Real-Time ML

• 🚀 Self-hosted inference on Triton gave us lower cost, faster scaling, and better performance than managed services
• 🚀 Building a custom Triton image reduced cold starts, improving responsiveness
• 🚀 Qdrant-based embedding search enabled real-time personalization at scale
• 🚀 Real-time updates for embeddings unlocked dynamic, up-to-date recommendations

                                  By early 2024, Meesho’s ML stack had evolved into a fully real-time, scalable, and cost-efficient system, setting the foundation for even bigger leaps ahead.


                                  Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

                                  · 14 min read
                                  Jaya Kumar
                                  Lead ML Engineer @ Meesho

                                  BharatMLStack


                                  Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale.

The platform implements a complete LLMOps lifecycle — from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, the platform lets users onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required.

In addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques — such as quantization strategies, batching configurations, and runtime-specific performance enhancements — enabling teams to balance latency, throughput, and cost based on their use case. The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference.

Why LLM Inference Is Not Just Bigger ML Model Serving

                                  Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled.

Autoregressive Generation and Sequential Computation:

Unlike traditional models such as classifiers or recommenders — where inference cost is relatively constant — LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation.

Because tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution.
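The sequential nature of decoding can be made concrete with a minimal sketch of the generation loop (the model here is a toy callable, not a real LLM):

```python
def generate(model, prompt_tokens, max_new_tokens, eos_token):
    """Sketch of autoregressive decoding: each step consumes the full
    context so far, so one forward pass is needed per output token and
    the loop cannot be parallelized across output positions.
    `model` is any callable mapping a token sequence to the next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model(tokens)      # one forward pass per token
        tokens.append(next_token)
        if next_token == eos_token:     # output length is data-dependent
            break
    return tokens

# Toy "model": emits the last token + 1, so generation stops at the EOS value.
toy_model = lambda toks: toks[-1] + 1
print(generate(toy_model, [1, 2], max_new_tokens=10, eos_token=5))  # [1, 2, 3, 4, 5]
```

Note how the total number of iterations, and therefore the latency, depends on the data itself — the scheduling problem traditional fixed-cost ML serving never has to solve.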

Prefill and Decode Phases:

LLM inference typically consists of two distinct stages:

• Prefill phase — the model processes the input prompt and builds internal representations. This stage is compute-heavy and highly parallelizable.
• Decode phase — the model generates tokens sequentially, predicting one token at a time using previously generated context.

                                  The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads.

Context Management and KV Caching:

Another fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens.

KV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:

• Memory consumption grows with sequence length and batch size
• GPU memory becomes a critical bottleneck
• Efficient memory management becomes essential for scaling concurrent requests

                                  This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads.
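The memory growth is easy to quantify with the standard back-of-the-envelope formula (two tensors, key and value, per layer; the example shape below is a Llama-3.1-8B-like configuration with grouped-query attention, used purely for illustration):

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, dtype_bytes=2):
    # 2x accounts for the separate key and value tensors cached per layer.
    return 2 * batch * seq_len * layers * kv_heads * head_dim * dtype_bytes

# Llama-3.1-8B-like shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
gib = kv_cache_bytes(batch=32, seq_len=4096, layers=32, kv_heads=8, head_dim=128) / 2**30
print(f"{gib:.0f} GiB")  # 16 GiB — grows linearly with batch size and sequence length
```

At 32 concurrent 4K-token sequences the cache alone consumes 16 GiB, which is why decode becomes memory-bound long before the GPU runs out of compute.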

Dynamic and Irregular Workloads:

Traditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. As a result:

• Batch sizes must be dynamic rather than static
• Requests may enter and leave batches asynchronously
• Scheduling systems must continuously rebalance workloads to maximize GPU utilization

                                  These characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines.
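A simplified simulation of continuous (in-flight) batching shows why requests must enter and leave batches asynchronously; the step counts are abstract stand-ins for decode iterations:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Sketch of continuous batching: finished sequences free their batch
    slot immediately and queued requests join mid-flight, instead of the
    whole batch draining before new work is admitted.
    `requests` maps request id -> decode steps needed; returns total steps."""
    queue = deque(requests.items())
    active = {}  # request id -> remaining decode steps
    steps = 0
    while queue or active:
        # Admit queued requests into free batch slots at every step.
        while queue and len(active) < max_batch:
            rid, need = queue.popleft()
            active[rid] = need
        # One decode step for every active sequence.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]  # slot freed without stalling the others
        steps += 1
    return steps

# Short and long requests share the batch; short ones don't block long ones.
print(continuous_batching({"a": 2, "b": 8, "c": 2, "d": 8, "e": 2}))  # 8
```

With static batching the same workload would take 10 steps (the first batch of four runs for 8 steps, then "e" runs alone for 2); continuous batching finishes in 8 because "e" slips into a freed slot.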

Streaming and User Experience Constraints:

Another distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated.

Because of these differences — sequential generation, growing memory requirements, dynamic workloads, and streaming constraints — LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads.
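Server-side streaming reduces, at its core, to yielding tokens the moment the decode step produces them rather than buffering the full response — a minimal sketch (in production this generator would feed a gRPC or SSE stream):

```python
def stream_tokens(token_source):
    """Yield each token as soon as it is produced, instead of
    accumulating the whole completion before responding."""
    for token in token_source:
        yield token  # flushed to the client immediately

chunks = []
for tok in stream_tokens(["Namaste", ",", " world", "!"]):
    chunks.append(tok)  # a real client renders each chunk as it arrives
print("".join(chunks))  # Namaste, world!
```

This is why per-token metrics (time to first token, inter-token gaps) matter: the user experiences every individual yield, not just the end-to-end latency.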

LLMOps: High-Level Architecture

[Figure: LLM Architecture]

                                  The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention.


                                  Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability.

1. Onboarding & Registration (The Source of Truth)

   The lifecycle begins with the Data Scientist or engineer.

   • Model Ingestion: Users onboard models—whether open-source (Hugging Face, NeMo) or internally fine-tuned—via the Truffle Box SDK/UI.
   • LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. This allows Data Scientists to register and version-control prompts (e.g., "customer_support_v2") independently of the application code.
2. The "Black Box" Build Engine

   Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources.

   • Transformation: The raw model is converted into a TRT-LLM Checkpoint.
   • Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint.
   • Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware.
3. Intelligent Profiling & Validation

   Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler.

   • Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. vLLM).
   • Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost.
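The profiler's selection rule can be sketched as "cheapest configuration that meets the latency SLA". The benchmark numbers below are illustrative placeholders, not real measurements:

```python
def pick_config(profiles, ttft_sla_ms):
    """Among benchmarked (hardware, runtime) configs that meet the TTFT
    SLA, choose the cheapest. Raises if nothing qualifies."""
    eligible = [p for p in profiles if p["ttft_ms"] <= ttft_sla_ms]
    if not eligible:
        raise ValueError("no config meets the latency SLA")
    return min(eligible, key=lambda p: p["cost_per_hour"])

profiles = [
    {"hw": "L4",   "runtime": "TRT-LLM", "ttft_ms": 180, "cost_per_hour": 0.7},
    {"hw": "L4",   "runtime": "vLLM",    "ttft_ms": 260, "cost_per_hour": 0.7},
    {"hw": "A100", "runtime": "TRT-LLM", "ttft_ms": 90,  "cost_per_hour": 3.7},
]

best = pick_config(profiles, ttft_sla_ms=200)
print(best["hw"], best["runtime"])  # L4 TRT-LLM
```

Empirical profiling matters precisely because the answer flips with the SLA: tighten it below 180 ms and the same rule selects the A100 despite its higher cost.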
4. Smart Artifact Generation & Distribution

   To solve the Kubernetes "Cold Start" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:

   • Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup.
   • Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times.
5. Image Streaming & Deployment

   Simultaneously, the inference runtime container images are pulled from the Artifact Registry.

   • Image Streaming: We utilize container image streaming to allow pods to start initializing while the massive Triton/Dynamo container layers are still downloading, further shaving seconds off the startup time.
6. The Inference Runtime (Kubernetes)

   The workload lands on Kubernetes with Autoscaling.

   • Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference.
   • Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk ("Pull from Disk").
7. Client Interaction & Observability

   Finally, the LLM Inference Client executes the request.

   • Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used.
   • Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time.
Observability: Monitoring the Pulse of GenAI

In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. A user doesn't care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows.

To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:
1. Time to First Token (TTFT)

   • Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user.
   • Why it matters: This represents the "Prefill Phase" latency—the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or "hung."
   • Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hit rates), which drastically lowers this metric by skipping redundant prompt processing.
2. Inter-Token Latency (ITL)

   • Definition: ITL measures the average time interval between the generation of consecutive tokens during the "Decode Phase".
   • Why it matters: This defines the "perceived speed" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look "jerky" or slow to the user.
   • Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speeds to maintain a natural conversational flow.
3. Token Throughput vs. Request Throughput

   We distinguish between two types of throughput to balance system efficiency with user load:

   • Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching.
   • Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before the queue depth impacts ITL.
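Both TTFT and ITL fall out directly from per-token timestamps, as this small sketch shows (the timestamps below are made-up example values):

```python
def llm_latency_metrics(request_ts, token_ts):
    """Compute the two streaming metrics from timestamps (in seconds):
    TTFT = first-token time minus request arrival (prefill latency);
    ITL  = mean gap between consecutive tokens during decode."""
    ttft = token_ts[0] - request_ts
    gaps = [b - a for a, b in zip(token_ts, token_ts[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Request arrives at t=0.0; first token at 0.25s, then one token every ~40ms.
ttft, itl = llm_latency_metrics(0.0, [0.25, 0.29, 0.33, 0.37])
print(f"TTFT={ttft * 1000:.0f}ms  ITL={itl * 1000:.0f}ms")  # TTFT=250ms  ITL=40ms
```

The same stream can look healthy on one metric and bad on the other, which is exactly why both are tracked separately.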
4. The Monitoring Stack

   • Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot "slow generation" incidents that generic "500 error" alerts would miss.
   • Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific "slow" request back to its prompt to understand if a complex input caused the latency spike.

Supported Inference Backends (TensorRT-LLM, Dynamo & vLLM)

Tailored for the Use Case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases—whether a real-time voice bot requiring ultra-low, sub-second latency or a massive reasoning task requiring huge context windows—demand different runtime characteristics. Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:

1. TensorRT-LLM: The High-Performance Standard

   Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots).

   TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization.

   Key optimizations we tailor for these high-load cases include:

   • Optimized execution via TensorRT engine compilation
   • Quantization-aware execution for reduced memory usage and improved throughput
   • Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization.
   • Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms.
2. Dynamo: Distributed Inference for Reasoning Models

   Suitable for: Very large "reasoning" models (70B+) or scenarios requiring massive context windows where a single GPU's memory is insufficient.

   For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework. Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:

   • KV Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation.
   • Prefill vs. Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy "reading" phase independently from the memory-heavy "writing" phase.
   • Distributed execution across multiple GPU resources
3. vLLM: The Flexible Baseline

   Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput.

   While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline.

   • High throughput through dynamic batching and efficient memory utilization
   • Paged KV cache management for handling long contexts and concurrent requests
   • Strong support for open-source model ecosystems
   • Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build.
   • Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines. We use it strategically for initial testing before committing to a full TensorRT optimization pipeline.

                                  Conclusion


                                  Large language model inference introduces a fundamentally new class of infrastructure challenges—where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads.


                                  The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle—from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity.


                                  Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows.


                                  Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment—allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences.


                                  Future Explorations


                                  While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:

• TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs for integration into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.

• Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a "serverless" experience where specific fine-tunes are hot-swapped instantly per request.

• Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user's streaming experience.

• Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., "How do I reset my password?" vs. "Password reset steps"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.

• Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before the memory fills up, preventing eviction-based slowdowns during traffic spikes.

• Online Evaluation & Guardrails: We are integrating a lightweight "Trust Layer" into the proxy. This will allow for low-latency input/output filtering (Guardrails) and asynchronous "LLM-as-a-Judge" evaluation pipelines to monitor response quality in production, not just system health.
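The semantic-caching idea above can be sketched in a few lines. This is an illustrative stand-in, not the platform's implementation: `embed` here is a toy bag-of-words embedding, where a real deployment would use a neural embedding model backed by a vector database.

```python
import math
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words token counts (a real system would use a
    # neural embedding model and a vector database).
    return Counter(tok.strip("?.,!").lower() for tok in text.split())

def cosine(a, b):
    dot = sum(count * b.get(tok, 0) for tok, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    """Serve a cached response when a query is semantically close to a past one."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query):
        q = embed(query)
        for vec, response in self.entries:
            if cosine(q, vec) >= self.threshold:
                return response  # cache hit: GPU bypassed entirely
        return None  # cache miss: fall through to the model

    def put(self, query, response):
        self.entries.append((embed(query), response))
```

With this shape, a query whose embedding is close enough to a previously answered one is served straight from the cache, which is what lets repetitive traffic skip the GPU.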
diff --git a/docs/category/go-sdk/index.html b/docs/category/go-sdk/index.html

                                  Go SDK

                                  Go SDK for BharatML Stack. Provides Go client libraries and packages for interacting with the online feature store, including gRPC clients and protocol buffer definitions.


diff --git a/docs/category/inferflow/index.html b/docs/category/inferflow/index.html

                                  Inferflow

                                  Inferflow is a graph-driven feature retrieval and model inference orchestration engine. It dynamically resolves entity relationships via configurable DAGs, retrieves features from the Online Feature Store, and orchestrates model scoring — all without custom code.


diff --git a/docs/category/numerix/index.html b/docs/category/numerix/index.html

                                  Numerix

                                  Numerix is a mathematical compute engine for BharatML Stack. It is used to perform mathematical operations on matrices and vectors.


diff --git a/docs/category/online-feature-store/index.html b/docs/category/online-feature-store/index.html

                                  Online Feature Store

                                  Online-feature-store is a high-performance, scalable, and production-grade feature store built for modern machine learning systems. It supports both real-time and batch workflows, with a strong emphasis on developer experience, system observability, and low-latency feature retrieval.


diff --git a/docs/category/predator/index.html b/docs/category/predator/index.html
new file mode 100644

                                  Predator

Predator is a scalable, high-performance model inference service built as a wrapper around NVIDIA Triton Inference Server, designed to serve ML models with low latency in Kubernetes, with OnFS and Inferflow integration.

diff --git a/docs/category/python-sdk/index.html b/docs/category/python-sdk/index.html

                                  Python SDK

                                  Python SDK for BharatML Stack. Provides Python client libraries and utilities for interacting with the online feature store, including gRPC clients, Spark integration, and common utilities.


diff --git a/docs/category/quick-start/index.html b/docs/category/quick-start/index.html

                                  Quick Start

                                  Quick Start guide for BharatML Stack. Get up and running quickly with step-by-step instructions, sample data, and Docker Compose setup for local development and testing.


diff --git a/docs/category/sdks/index.html b/docs/category/sdks/index.html

                                  SDKs

                                  Software Development Kits (SDKs) for BharatML Stack. Includes client libraries for Go and Python to interact with the online feature store and other platform components.


diff --git a/docs/category/skye/index.html b/docs/category/skye/index.html
new file mode 100644

                                  Skye

                                  Skye is a high-performance vector similarity search platform that enables fast semantic retrieval by representing data as vectors and querying nearest matches in high-dimensional space. It supports pluggable vector databases, tenant-level index isolation, intelligent caching, and centralized cluster management.

diff --git a/docs/category/trufflebox-ui/index.html b/docs/category/trufflebox-ui/index.html

                                  Trufflebox UI

Trufflebox UI is a modern, feature-rich UI framework for supporting MLOps. It supports feature catalog management, user management, and other admin operations.


diff --git a/docs/category/v100/index.html b/docs/category/v100/index.html
deleted file mode 100644
diff --git a/docs/img/bharatml-stack-logo.jpg b/docs/img/bharatml-stack-logo.jpg (new binary file)
diff --git a/docs/img/skye-rt-consumer-flow.png b/docs/img/skye-rt-consumer-flow.png (new binary file)
diff --git a/docs/img/skye-system-overview.png b/docs/img/skye-system-overview.png (new binary file)
diff --git a/docs/img/v1.0.0-predator-hld.png b/docs/img/v1.0.0-predator-hld.png (new binary file)
diff --git a/docs/index.html b/docs/index.html
                                  BharatMLStack Logo

                                  Welcome to BharatMLStack

                                  Open source, end-to-end ML infrastructure stack built for scale, speed, and simplicity.

                                  Sub-10msP99 Latency
                                  1M+ RPSTested Capacity
                                  Multi-DBSupport

                                  Online Feature Store

                                  High-performance, production-ready feature serving for real-time ML inference

                                  🚀

                                  High-Performance Feature Store

                                  Sub-10ms P99 latency and 1M+ RPS capacity. Built for real-time ML inference with custom PSDB serialization format that outperforms Protocol Buffers and Apache Arrow.

                                  Production-Ready ML Infrastructure

                                  Multi-database backends (Scylla, Dragonfly, Redis), comprehensive monitoring, and enterprise-grade features. Deploy with confidence using battle-tested components.

                                  🛠️

                                  Developer-First Experience

Multi-language SDKs (Go, Python), gRPC APIs, and extensive documentation. From data scientists and ML engineers to backend engineers, everyone gets tools they love.

                                  Built for India's Scale

                                  BharatMLStack is a comprehensive, production-ready machine learning infrastructure platform designed to democratize ML capabilities across India and beyond. Our mission is to provide a robust, scalable, and accessible ML stack that empowers organizations to build, deploy, and manage machine learning solutions at massive scale.

                                  Explore Online Feature Store →

                                  🏆 Key Achievements

                                  • ✅ Sub-10ms P99 latency for real-time inference
                                  • ✅ 1M+ RPS tested with 100 IDs per request
                                  • ✅ PSDB format outperforms Proto3 & Arrow
                                  • ✅ Multi-database: Scylla, Dragonfly, Redis
                                  • ✅ Production-ready with comprehensive monitoring

                                  Trufflebox UI

                                  Modern, feature-rich UI framework for comprehensive MLOps management

                                  📋

                                  Feature Catalog & Management

                                  Comprehensive feature catalog with metadata management, versioning, and governance. Organize and discover features across your ML platform with ease.

                                  👥

                                  User Management & Admin Ops

                                  Role-based access control, user authentication, and administrative operations. Secure your ML platform with enterprise-grade user management capabilities.

                                  🎨

                                  Modern UI Framework

                                  Intuitive, responsive web interface built with modern web technologies. Streamline MLOps workflows with beautiful and functional user experiences.

                                  Modern MLOps Management

                                  Trufflebox UI provides a comprehensive, modern web interface for managing your entire ML infrastructure. Built with cutting-edge web technologies, it delivers an intuitive experience for feature management, user administration, and operational oversight. Streamline your MLOps workflows with enterprise-grade UI components.

                                  Explore Trufflebox UI →

                                  🎨 UI Features

                                  • ✅ Comprehensive feature catalog & discovery
                                  • ✅ Role-based access control & user management
                                  • ✅ Job, Store, Admin Ops management
                                  • ✅ Approval flow for everything
                                  • ✅ Responsive design for desktop & mobile

                                  SDKs

                                  Developer-friendly client libraries and APIs for seamless platform integration

                                  🌐

                                  Multi-Language Support

                                  Native SDKs for Go and Python with idiomatic APIs. Choose the language that fits your team's expertise and existing infrastructure.

                                  🔗

                                  gRPC & REST APIs

                                  High-performance gRPC clients and REST APIs for seamless integration. Built-in support for streaming, batching, and async operations.

                                  Spark Integration

                                  Native Apache Spark integration for batch feature processing and ingestion. Scale your feature engineering workflows with distributed computing power.

                                  Developer-First Integration

                                  Our SDKs are designed with developers in mind, providing idiomatic APIs for Go and Python that feel natural in your existing codebase. Whether you're building microservices, data pipelines, or ML applications, our SDKs provide the tools you need for seamless integration with BharatMLStack's powerful infrastructure.

                                  Explore SDKs →

                                  🛠️ Developer Tools

                                  • ✅ Native Go & Python SDKs with type safety
                                  • ✅ High-performance gRPC
                                  • ✅ Apache Spark integration for publishing features

                                  Numerix

                                  Numerix is a mathematical compute engine for BharatML Stack. It is used to perform mathematical operations on matrices and vectors.

                                  Explore Numerix →

                                  🛠️ Numerix Features

                                  • ✅ Postfix expression evaluation
                                  • ✅ Vectorized math operations
                                  • ✅ Typed evaluation
                                  • ✅ Compiler-assisted SIMD
                                  • ✅ ARM & AMD support
                                  • ✅ Multi-arch builds
                                  • ✅ Deterministic runtime
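As a rough illustration of the postfix-evaluation model the feature list above describes (a sketch only, not Numerix's actual Rust implementation or API), an element-wise postfix evaluator over vectors can look like this:

```python
import operator

# Supported binary operators, applied element-wise over vectors
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def evaluate_postfix(tokens, variables):
    """Evaluate a postfix expression over equal-length vectors, element-wise."""
    length = len(next(iter(variables.values())))
    stack = []
    for tok in tokens:
        if tok in OPS:
            b, a = stack.pop(), stack.pop()
            stack.append([OPS[tok](x, y) for x, y in zip(a, b)])
        elif tok in variables:
            stack.append(variables[tok])
        else:  # numeric literal, broadcast to the vector length
            stack.append([float(tok)] * length)
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]

# 0.2*s1 + 0.8*s2 expressed in postfix: s1 0.2 * s2 0.8 * +
scores = evaluate_postfix(
    ["s1", "0.2", "*", "s2", "0.8", "*", "+"],
    {"s1": [1.0, 2.0], "s2": [3.0, 4.0]},
)  # ≈ [2.6, 3.6]
```

Because the expression is data rather than code, swapping the weights means changing the token list, not recompiling anything — which is the property a postfix engine buys you.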
                                  Open-source, scalable stack for enterprise ML

                                  Build production ML pipelines faster

                                  Open source, end-to-end ML infrastructure stack built for scale, speed, and simplicity. Integrate, deploy, and manage robust ML workflows with full reliability and control.

                                  Adopted by data teams building at scale

                                  BharatML Stack Logo

                                  Why BharatMLStack

                                  The Real Barriers to Scaling Machine Learning

                                  ML teams spend more time fighting infrastructure than building intelligence. BharatMLStack removes those barriers.

                                  🧠

                                  Focus on building intelligence, not infrastructure

                                  • Does every model deployment require a full-stack integration effort?
                                  • Do engineers have to rebuild feature retrieval, endpoint integrations, and logging for each new model?
                                  • Does changing a simple expression like 0.2×s₁ + 0.8×s₂ to 0.3×s₁ + 0.7×s₂ really need code reviews and redeployments?
                                  • Why does deploying intelligence require the devops team to provision infra?

                                  Machine learning teams should be iterating on models, not systems. Yet today, infrastructure complexity turns simple improvements into weeks of engineering effort, slowing experimentation and innovation.
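The weight-change pain point above is exactly what a config-driven scorer removes. A minimal sketch, with a plain dict standing in for a hot-reloaded config store such as etcd (names hypothetical):

```python
def combine_scores(scores, weights):
    """Weighted sum of model scores, driven entirely by configuration."""
    return sum(weights[name] * value for name, value in scores.items())

# Config as it might be loaded from a store like etcd (illustrative shape)
weights = {"s1": 0.2, "s2": 0.8}
before = combine_scores({"s1": 1.0, "s2": 2.0}, weights)  # ≈ 1.8

# Tuning the blend is a config update, not a code review or a redeployment
weights = {"s1": 0.3, "s2": 0.7}
after = combine_scores({"s1": 1.0, "s2": 2.0}, weights)   # ≈ 1.7
```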

                                  💰

                                  Built for scale without exponential cost growth

                                  • Do your infrastructure costs scale faster than your ML impact?
                                  • Are you recomputing the same features, reloading the same data, and moving the same bytes across systems repeatedly?
                                  • Are expensive GPUs and compute sitting underutilized while workloads wait on data or inefficient pipelines?
                                  • Why does scaling ML often mean scaling cost linearly—or worse?

                                  A modern ML platform should eliminate redundant computation, reuse features intelligently, and optimize data access across memory, NVMe, and object storage. Compute should be pooled, scheduled efficiently, and fully utilized—ensuring that scale drives impact, not runaway infrastructure costs.

                                  🌍

                                  Freedom to deploy anywhere, without lock-in

                                  • Are your models tied to a single cloud, making migration costly and complex?
                                  • Does adopting managed services today limit your ability to optimize cost or move infrastructure tomorrow?
                                  • Can you deploy the same ML stack across public cloud, private cloud, or sovereign environments without redesigning everything?
                                  • Why should infrastructure choices dictate the future of your ML systems?

                                  A modern ML platform should be built on open standards and cloud-neutral abstractions, allowing you to deploy anywhere—public cloud, private infrastructure, or sovereign environments. This ensures complete control over your data, freedom from vendor lock-in, and the ability to optimize for cost, performance, and compliance without architectural constraints.

                                  Platform Components

                                  BharatMLStack Components

                                  Purpose-built components for every stage of the ML lifecycle, from feature serving to model deployment.

                                  Online Feature Store

                                  BharatMLStack Online Feature Store delivers sub-10ms, high-throughput access to machine learning features for real-time inference. It seamlessly ingests batch and streaming data, validates schemas, and persists compact, versioned feature groups optimized for low latency and efficiency. With scalable storage backends, gRPC APIs, and binary-optimized formats, it ensures consistent, reliable feature serving across ML pipelines.

                                  Learn more →
                                  🔀

                                  Inferflow

                                  Inferflow is BharatMLStack's intelligent inference gateway that dynamically retrieves and assembles features required by ML models using a graph-based configuration called Inferpipes. It automatically resolves entity relationships, fetches features from the Online Feature Store, and constructs feature vectors without custom code.

                                  Learn more →
                                  🔍

                                  Skye

                                  Skye enables fast similarity retrieval by representing data as vectors and querying nearest matches in high-dimensional space. It supports pluggable vector databases, ensuring flexibility across infrastructure. The system provides tenant-level index isolation while allowing single embedding ingestion even when shared across tenants, reducing redundancy.

                                  Learn more →
                                  🧮

                                  Numerix

                                  Numerix is a high-performance compute engine designed for ultra-fast element-wise matrix operations. Built in Rust and accelerated using SIMD, it delivers exceptional efficiency and predictable performance. Optimized for real-time inference workloads, it achieves strict sub-5ms p99 latency on matrices up to 1000×10.

                                  Learn more →
                                  🚀

                                  Predator

                                  Predator streamlines infrastructure and model lifecycle management. It enables the creation of deployables with specific Triton Server versions and supports seamless model rollouts. Leveraging Helm charts and Argo CD, Predator automates Kubernetes-based deployments while integrating with KEDA for auto-scaling and performance tuning.

                                  Learn more →

                                  Proven at scale

                                  Scaling Numbers

                                  Daily Orders


                                  Daily orders processed via ML pipelines

                                  QPS on FS


                                  QPS on Feature Store with batch size of 100 id lookups

                                  QPS Inference


                                  QPS on Model Inference

                                  QPS Embedding


                                  QPS Embedding Search

                                  See it in action

                                  Demo Videos

                                  Watch short demos of each BharatMLStack component in action.

                                  Feature Store

                                  Learn how to onboard and manage features using the self-serve UI for the Online Feature Store.

                                  Embedding Platform

                                  Walkthrough of onboarding and managing embedding models via the Skye self-serve UI.

                                  Numerix

                                  Step-by-step guide to configuring and running matrix operations through the Numerix self-serve UI.

                                  Predator

                                  How to deploy and manage ML models on Kubernetes using the Predator self-serve UI.

                                  Inferflow

                                  Setting up inferpipes and feature retrieval graphs through the Inferflow self-serve UI.

                                  Deploy ML models with confidence

                                  Comprehensive stack for business-ready ML. Integrates seamlessly with enterprise systems. Robust security and regulatory compliance.

diff --git a/docs/inferflow/v1.0.0/architecture/index.html b/docs/inferflow/v1.0.0/architecture/index.html

                                  BharatMLStack - Inferflow


Inferflow, part of BharatMLStack, is a graph-driven feature retrieval and model inference orchestration engine built in Go. It eliminates the need for custom feature retrieval code by using configurable DAG topologies to dynamically resolve entity relationships, fetch features from the Online Feature Store, and orchestrate model scoring — all driven by configuration stored in etcd.


                                  Overview



                                Request Flow

                                1. Client sends gRPC request with model_config_id + entity IDs

                                2. Load ModelConfig from etcd-backed ConfigMap

                                3. Adapt proto request → ComponentRequest
                                (build ComponentMatrix with entity schema)

                                4. Resolve DAG topology from component_dependency config

                                5. Execute DAG (Kahn's algorithm, concurrent):

                                ├─ FeatureInitComponent: populate matrix with entity IDs + schema

                                ├─ FeatureComponents (parallel): fetch features from OnFS → fill matrix columns

                                ├─ PredatorComponent: build feature payloads from matrix → call model → write scores

                                └─ NumerixComponent: read scores from matrix → call compute → write final scores

                                6. Build response from matrix columns per ResponseConfig

                                7. (Optional) Async Kafka logging of inference features and scores

                                8. Return gRPC response to client
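Step 5's concurrent DAG execution presupposes a valid topological order of the components. A minimal sketch of Kahn's algorithm over a `component_dependency` map (shown here in Python for brevity; the names mirror the example model config, and this is not Inferflow's actual Go code):

```python
from collections import deque

def topological_order(dependency):
    """Kahn's algorithm: dependency maps each component to its downstream components."""
    indegree = {node: 0 for node in dependency}
    for downstream in dependency.values():
        for node in downstream:
            indegree[node] += 1
    ready = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in dependency[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)  # all prerequisites satisfied
    if len(order) != len(dependency):
        raise ValueError("cycle detected in component dependency graph")
    return order

deps = {
    "feature_initializer": ["fs_user", "fs_product"],
    "fs_user": ["ranker_model"],
    "fs_product": ["ranker_model"],
    "ranker_model": [],
}
# feature_initializer runs first; fs_user and fs_product become ready together
# (hence can run in parallel); ranker_model runs once both complete.
```

In the concurrent executor, every node sitting in the ready queue at the same time can be dispatched in parallel, which is how the feature components fan out before the model-scoring step.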

                                Observability

                                Metrics (StatsD / Telegraf)

diff --git a/docs/inferflow/v1.0.0/configuration/index.html b/docs/inferflow/v1.0.0/configuration/index.html

                                Inferflow - Configuration Guide


                                Inferflow is fully config-driven. All model onboarding, feature retrieval logic, DAG topology, and inference behavior are controlled through configuration stored in etcd — with zero code changes required.


                                Configuration Overview


                                Dynamic Configuration (etcd Model Config)

                                Model configurations are stored in etcd and hot-reloaded. Each model is identified by a model_config_id.

                                Config Structure

```json
{
  "model_config_id_example": {
    "dag_execution_config": {
      "component_dependency": {
        "feature_initializer": ["fs_user", "fs_product"],
        "fs_user": ["ranker_model"],
        "fs_product": ["ranker_model"],
        "ranker_model": []
      }
    },
    "component_config": {
      "feature_component_config": {
        "fs_user": { ... },
        "fs_product": { ... }
      },
      "predator_component_config": {
        "ranker_model": { ... }
      },
      "numerix_component_config": {},
      "cache_enabled": true,
      "cache_version": "v1",
      "cache_ttl": 300,
      "error_logging_percent": 10
    },
    "response_config": {
      "features": ["ranker_model:score"],
      "model_schema_perc": 100,
      "logging_perc": 5,
      "log_features": ["fs_user:profile:age", "ranker_model:score"],
      "log_batch_size": 100
    }
  }
}
```

                                DAG Execution Config

                                Defines the component dependency graph.

```json
{
  "component_dependency": {
    "<parent_component>": ["<child_1>", "<child_2>"],
    "<child_1>": ["<grandchild>"],
    "<child_2>": ["<grandchild>"],
    "<grandchild>": []
  }
}
```

                                Rules:

                                • The graph must be a valid DAG (no cycles)

                                  Feature Component Config

                                  Configures how features are fetched from the Online Feature Store.

```json
{
  "fs_user": {
    "fs_keys": {
      "schema": ["user_id"],
      "col": "context:user:user_id"
    },
    "fs_request": {
      "entity_label": "user",
      "feature_groups": [
        {
          "label": "demographics",
          "feature_labels": ["age", "location", "income_bracket"]
        },
        {
          "label": "behavior",
          "feature_labels": ["click_rate", "purchase_freq"]
        }
      ]
    },
    "fs_flatten_resp_keys": ["user_id"],
    "col_name_prefix": "user",
    "comp_cache_enabled": true,
    "comp_cache_ttl": 600,
    "composite_id": false
  }
}
```
| Field | Description |
|---|---|
| `fs_keys` | How to extract lookup keys from the matrix. `schema` defines key column names; `col` references a matrix column |
| `fs_request` | OnFS query: entity label + feature groups with specific features |
| `fs_flatten_resp_keys` | Keys to flatten in response mapping |
| `col_name_prefix` | Prefix for matrix column names (e.g., `user:demographics:age`) |
| `comp_cache_enabled` | Enable in-memory caching for this component |
| `comp_cache_ttl` | Cache TTL in seconds |
| `composite_id` | Whether entity keys are composite |

                                  Predator Component Config

                                  Configures model inference endpoints.

```json
{
  "ranker_model": {
    "model_name": "product_ranker_v3",
    "model_endpoint": "predator-ranker:8080",
    "model_end_points": {
      "predator-ranker-v3:8080": 80,
      "predator-ranker-v4:8080": 20
    },
    "deadline": 100,
    "batch_size": 50,
    "calibration": {
      "enabled": false
    },
    "inputs": {
      "feature_map": {
        "user:demographics:age": "INT32",
        "user:behavior:click_rate": "FP32",
        "product:attributes:category_id": "INT32"
      }
    },
    "outputs": {
      "score_columns": ["score", "confidence"]
    },
    "slate_component": false
  }
}
```
| Field | Description |
|---|---|
| `model_name` | Model identifier on the serving platform |
| `model_endpoint` | Primary model serving endpoint |
| `model_end_points` | Multiple endpoints with percentage-based traffic routing |
| `deadline` | Inference timeout in milliseconds |
| `batch_size` | Max items per inference batch |
| `calibration` | Score calibration settings |
| `inputs.feature_map` | Map of matrix column → data type for model input |
| `outputs.score_columns` | Column names for model output scores |
| `slate_component` | If true, runs per-slate inference |
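The `model_end_points` 80/20 split above implies weighted traffic routing. One way to implement that selection (a sketch under assumptions — Inferflow's actual routing logic is not shown here) is a cumulative walk over the weights:

```go
package main

import (
	"fmt"
	"sort"
)

// pickEndpoint sketches percentage-based traffic routing over a
// model_end_points map. roll is a uniform value in [0,100); endpoints are
// walked in sorted order so the weight-to-endpoint mapping is deterministic.
func pickEndpoint(weights map[string]int, roll int) string {
	names := make([]string, 0, len(weights))
	for n := range weights {
		names = append(names, n)
	}
	sort.Strings(names)
	acc := 0
	for _, n := range names {
		acc += weights[n]
		if roll < acc {
			return n
		}
	}
	return names[len(names)-1] // fallback if weights sum to less than 100
}

func main() {
	eps := map[string]int{
		"predator-ranker-v3:8080": 80,
		"predator-ranker-v4:8080": 20,
	}
	fmt.Println(pickEndpoint(eps, 10)) // lands in v3's 80% share
	fmt.Println(pickEndpoint(eps, 95)) // lands in v4's 20% share
}
```

In production the `roll` would come from a per-request random draw, giving the configured traffic split in expectation.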

                                  Numerix Component Config

                                  Configures compute operations (e.g., reranking).

```json
{
  "reranker": {
    "score_column": "final_score",
    "data_type": "FP32",
    "score_mapping": {
      "ranker_model:score": "FP32",
      "user:behavior:click_rate": "FP32"
    },
    "compute_id": "diversity_rerank_v1",
    "slate_component": false
  }
}
```
| Field | Description |
|---|---|
| `score_column` | Output column name for the computed score |
| `data_type` | Output data type |
| `score_mapping` | Map of matrix columns to include as compute inputs |
| `compute_id` | Identifies the compute operation on Numerix |
| `slate_component` | If true, runs per-slate compute |

                                  Response Config

                                  Controls what data is returned to the client and what is logged.

```json
{
  "features": ["ranker_model:score", "reranker:final_score"],
  "model_schema_perc": 100,
  "logging_perc": 5,
  "log_features": [
    "user:demographics:age",
    "ranker_model:score",
    "reranker:final_score"
  ],
  "log_batch_size": 100
}
```
| Field | Description |
|---|---|
| `features` | Matrix columns to include in the gRPC response |
| `model_schema_perc` | Percentage of requests that include full schema in response |
| `logging_perc` | Percentage of requests to send to Kafka for logging |
| `log_features` | Specific features to include in log messages |
| `log_batch_size` | Batch size for grouped log messages |
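A `logging_perc` of 5 means roughly one request in twenty is logged. One way to make that decision stable per request (an illustrative sketch — not necessarily how Inferflow samples) is to hash the tracking ID:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldLog decides whether a request falls inside the logging_perc sample.
// Hashing the tracking_id makes the decision deterministic per request, so
// retries of the same request log consistently.
func shouldLog(trackingID string, loggingPerc uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(trackingID))
	return h.Sum32()%100 < loggingPerc
}

func main() {
	fmt.Println(shouldLog("req-123", 100)) // always logged at 100%
	fmt.Println(shouldLog("req-123", 0))   // never logged at 0%
}
```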

                                  Service-Level Config

                                  Global settings that apply across all models.

```json
{
  "v2_logging_type": "proto",
  "compression_enabled": false
}
```
| Field | Values | Description |
|---|---|---|
| `v2_logging_type` | proto, arrow, parquet | Serialization format for Kafka inference logs |
| `compression_enabled` | true, false | Enable compression for log messages |

                                  Example: Onboarding a New Model

                                  To onboard a new ranking model, update the etcd config:

                                  Step 1: Define the feature retrieval graph

```json
"component_dependency": {
  "feature_initializer": ["fs_user", "fs_product", "fs_user_x_category"],
  "fs_product": ["fs_user_x_category"],
  "fs_user": ["new_ranker"],
  "fs_user_x_category": ["new_ranker"],
  "new_ranker": []
}
```

                                  Here fs_user_x_category depends on fs_product because it needs the category ID extracted from the product entity to resolve the user x category key.

                                  Step 2: Configure each component (feature groups, model endpoints, etc.)

                                  Step 3: Push the config to etcd — Inferflow picks it up automatically via watchers.


                                  Inferflow - Key Functionalities


                                  Overview

                                  Inferflow is a high-performance, config-driven ML inference orchestration engine built in Go. It provides no-code feature retrieval, DAG-based execution, and multi-pattern model inference — enabling ML teams to onboard new models through configuration changes alone.



                                  DAG Topology Executor

                                  The execution engine uses Kahn's algorithm for topological ordering with concurrent goroutine execution at each level:

```
component_dependency: {
  "feature_initializer": ["fs_user", "fs_product"],
  "fs_user": ["ranker"],
  "fs_product": ["ranker"],
  "ranker": ["reranker"],
  "reranker": []
}
```

                                  This config defines:

• feature_initializer runs first (zero in-degree)
• fs_user and fs_product execute concurrently once initialization completes
• ranker waits for both feature components, and reranker runs last
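The level-by-level scheduling the executor performs can be sketched with a plain Kahn's algorithm over the dependency map. This is an illustrative sketch, not the actual executor code — in the real engine each level's components would be dispatched as concurrent goroutines:

```go
package main

import "fmt"

// kahnLevels computes topological levels from a parent -> children map:
// peel off all zero-in-degree nodes, decrement their children, repeat.
// Components within one level have no mutual dependencies and can run
// concurrently.
func kahnLevels(deps map[string][]string) [][]string {
	indeg := map[string]int{}
	for parent, children := range deps {
		if _, ok := indeg[parent]; !ok {
			indeg[parent] = 0
		}
		for _, c := range children {
			indeg[c]++
		}
	}
	var levels [][]string
	for len(indeg) > 0 {
		var level []string
		for n, d := range indeg {
			if d == 0 {
				level = append(level, n)
			}
		}
		if len(level) == 0 {
			panic("cycle detected: not a valid DAG")
		}
		for _, n := range level {
			for _, c := range deps[n] {
				indeg[c]--
			}
			delete(indeg, n)
		}
		levels = append(levels, level)
	}
	return levels
}

func main() {
	deps := map[string][]string{
		"feature_initializer": {"fs_user", "fs_product"},
		"fs_user":             {"ranker"},
		"fs_product":          {"ranker"},
		"ranker":              {"reranker"},
		"reranker":            {},
	}
	for i, level := range kahnLevels(deps) {
		// Each level would be dispatched as a batch of goroutines.
		fmt.Println(i, len(level)) // level sizes: 1, 2, 1, 1
	}
}
```

The panic on an empty level is also how cycle detection falls out of the algorithm for free: a cyclic config never reaches zero in-degree.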

                                    Inferflow supports three inference patterns via the Predict API, each designed for different ML use cases:

                                    PointWise Inference

                                    Score each target independently against context features.

```protobuf
rpc InferPointWise(PredictRequest) returns (PredictResponse);
```

                                    Use cases: Click-through rate prediction, fraud scoring, relevance ranking

                                    Input: Context features + list of targets (e.g., products) Output: Per-target scores

                                    PairWise Inference

                                    Score pairs of targets relative to each other.

```protobuf
rpc InferPairWise(PredictRequest) returns (PredictResponse);
```

                                    Use cases: Preference learning, comparison-based ranking

                                    Input: Context features + targets + pair indices (first/second) Output: Per-pair scores + optional per-target scores

                                    SlateWise Inference

                                    Score groups (slates) of targets together, capturing inter-item effects.

```protobuf
rpc InferSlateWise(PredictRequest) returns (PredictResponse);
```

                                    Use cases: Whole-page optimization, slate-level reranking, diversity-aware scoring

                                    Input: Context features + targets + slate definitions (target indices per slate) Output: Per-slate scores + optional per-target scores


                                    Entity & Legacy API

                                    RetrieveModelScore

                                    The original Inferflow API for entity-based feature retrieval and scoring:

```protobuf
service Inferflow {
  rpc RetrieveModelScore(InferflowRequestProto) returns (InferflowResponseProto);
}
```

                                    Request structure:

| Field | Description |
|---|---|
| `entities` | List of entity types with their IDs and optional inline features |
| `model_config_id` | Identifies the model configuration (DAG, components, response format) |
| `tracking_id` | Request-level tracing identifier |

                                    Entity structure:


                                    Feature Retrieval Pipeline

                                    Key Resolution

                                    Feature components use FSKeys configuration to dynamically resolve entity keys:

```json
{
  "FSKeys": {
    "schema": ["user_id"],
    "col": "user:profile:user_id"
  }
}
```

                                    The component reads key values from the existing matrix columns, enabling chained entity resolution — e.g., fetch product entity first, extract category, then fetch user x category features.
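The chained resolution described above can be sketched as follows — a component reads an already-populated matrix column (here the category IDs fetched by fs_product) and joins it with the user ID to form the user-x-category lookup key. Column sources and the `:`-joined key format are illustrative assumptions:

```go
package main

import "fmt"

// resolveCompositeKeys builds one composite lookup key per row from two
// matrix columns. In the chained example, categoryIDs would have been
// written by the fs_product component in an earlier DAG level.
func resolveCompositeKeys(userIDs, categoryIDs []string) []string {
	keys := make([]string, len(userIDs))
	for i := range userIDs {
		keys[i] = userIDs[i] + ":" + categoryIDs[i]
	}
	return keys
}

func main() {
	users := []string{"u1", "u2"}
	cats := []string{"c9", "c7"} // e.g. from product:attributes:category_id
	fmt.Println(resolveCompositeKeys(users, cats)) // [u1:c9 u2:c7]
}
```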

                                    Batched Retrieval


                                      Inferflow - Release Notes


                                      Version 1.0.0

Release Date: June 2025
Status: General Availability (GA)


                                      APIs & Protocols

                                      gRPC API

                                      Inferflow Service:

```protobuf
service Inferflow {
  rpc RetrieveModelScore(InferflowRequestProto) returns (InferflowResponseProto);
}
```

                                      Predict Service:

```protobuf
service PredictService {
  rpc InferPointWise(PredictRequest) returns (PredictResponse);
  rpc InferPairWise(PredictRequest) returns (PredictResponse);
  rpc InferSlateWise(PredictRequest) returns (PredictResponse);
}
```

                                      Data Types Supported

| Type | Variants |
|---|---|
| Integers | int8, int16, int32, int64 |
| Floats | float8 (e4m3, e5m2), float16, float32, float64 |
| Strings | Variable length |
| Booleans | Bit-packed |
| Vectors | All scalar types |
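Bit-packing stores eight boolean features per byte instead of one. A minimal sketch of the idea (the actual on-wire layout used by the stack is not specified here, so this is only illustrative):

```go
package main

import "fmt"

// packBools packs a slice of flags into bytes, least-significant bit first:
// bit i%8 of byte i/8 holds flag i.
func packBools(bits []bool) []byte {
	out := make([]byte, (len(bits)+7)/8)
	for i, b := range bits {
		if b {
			out[i/8] |= 1 << (i % 8)
		}
	}
	return out
}

func main() {
	packed := packBools([]bool{true, false, true, true})
	fmt.Printf("%08b\n", packed[0]) // bits 0, 2, 3 set -> 00001101
}
```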


                                      Download & Installation

                                      Source Code

```shell
git clone https://github.com/Meesho/BharatMLStack.git
cd BharatMLStack/inferflow
```

                                      Build

```shell
go build -o inferflow-server cmd/inferflow/main.go
```

                                      Docker

```shell
docker build -t inferflow:latest .
```

                                      Contributing

                                      We welcome contributions from the community! Please see our Contributing Guide for details on how to get started.

                                      diff --git a/docs/intro/index.html b/docs/intro/index.html new file mode 100644 index 00000000..040ca6a9 --- /dev/null +++ b/docs/intro/index.html @@ -0,0 +1,42 @@ + + + + + +BharatMLStack Documentation | BharatMLStack + + + + + + + + +

                                      BharatMLStack Documentation

Welcome to the BharatMLStack documentation. BharatMLStack is an open-source, end-to-end ML infrastructure stack built for scale, speed, and simplicity. Explore the components below to get started.

Quick Start

Get up and running with BharatMLStack in minutes. Step-by-step instructions, sample data, and Docker Compose setup for local development and testing.

Go to Quick Start →

Online Feature Store

Sub-10ms, high-throughput access to machine learning features for real-time inference. Supports batch and streaming ingestion, schema validation, and compact versioned feature groups.

Go to Online Feature Store →

Inferflow

Graph-driven feature retrieval and model inference orchestration engine. Dynamically resolves entity relationships, retrieves features, and orchestrates model scoring — all without custom code.

Go to Inferflow →

Trufflebox UI

Modern, feature-rich UI framework for MLOps management. Supports feature catalog, user management, and admin operations with approval flows.

Go to Trufflebox UI →

SDKs

Client libraries for Go and Python to interact with the Online Feature Store and other platform components. Includes gRPC clients, REST APIs, and Apache Spark integration.

Go to SDKs →

Numerix

High-performance compute engine for ultra-fast element-wise matrix operations. Built in Rust with SIMD acceleration for sub-5ms p99 latency.

Go to Numerix →

                                      Markdown page example


                                      You don't need React to write simple standalone pages.


                                      BharatMLStack - Numerix



                                      Numerix is a Rust-based compute service in BharatMLStack designed for low-latency evaluation of mathematical expressions over feature matrices. Each request carries a compute_id and a matrix of features; Numerix fetches the corresponding postfix expression, maps variables to feature columns (treated as vectors), and evaluates the expression with a stack-based SIMD-optimized runtime.
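To make the evaluation model concrete, here is a minimal sketch in Go of stack-based postfix evaluation over feature columns. This is an illustration only, not the actual Rust runtime: variable tokens push columns (vectors) onto a stack, and operator tokens pop two vectors and apply the operation element-wise.

```go
package main

import "fmt"

// evalPostfix is a minimal, hypothetical sketch of stack-based postfix
// evaluation: variable tokens push feature columns (vectors) onto the
// stack; operator tokens pop two vectors and combine them element-wise.
// The real Numerix runtime is Rust with SIMD-vectorized inner loops.
func evalPostfix(tokens []string, cols map[string][]float32) []float32 {
	var stack [][]float32
	for _, tok := range tokens {
		switch tok {
		case "+", "*":
			b := stack[len(stack)-1]
			a := stack[len(stack)-2]
			stack = stack[:len(stack)-2]
			out := make([]float32, len(a))
			for i := range a { // the vectorizable inner loop
				if tok == "+" {
					out[i] = a[i] + b[i]
				} else {
					out[i] = a[i] * b[i]
				}
			}
			stack = append(stack, out)
		default: // variable token: push its feature column
			stack = append(stack, cols[tok])
		}
	}
	return stack[0]
}

func main() {
	cols := map[string][]float32{
		"feature1": {1.0, 2.0},
		"feature2": {3.0, 4.0},
	}
	// "feature1 feature2 +" in postfix = element-wise sum of the two columns
	fmt.Println(evalPostfix([]string{"feature1", "feature2", "+"}, cols)) // [4 6]
}
```

Because the expression is already in postfix form, evaluation is a single left-to-right pass with no parsing on the request path.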



                                      Why ARM, Why LLVM

                                      During design exploration, we tested SIMD on different architectures and found ARM (AArch64) with NEON/SVE/SVE2 provided excellent performance for our workloads.

                                      Instead of writing custom intrinsics, Numerix compiles with SIMD flags and lets LLVM handle vectorization:

RUSTFLAGS="-C target-feature=+neon,+sve,+sve2" \
cargo build --release --target aarch64-unknown-linux-gnu
                                      • This approach works well because operations are straightforward, data is aligned, and compiler auto-vectorization is reliable.


gRPC Interface
                                      • Response fields: computation_score_data or error

                                      Example (grpcurl):

grpcurl -plaintext \
-import-path ./numerix/src/protos/proto \
-proto numerix.proto \
-d '{
"entityScoreData": {
"schema": ["feature1", "feature2"],
"entityScores": [ { "stringData": { "values": ["1.0", "2.0"] } } ],
"computeId": "1001",
"dataType": "fp32"
}
}' \
localhost:8080 numerix.Numerix/Compute

                                      Observability


                                        Benchmarks (PoC)


                                        This PoC measures the performance of vector addition in Rust with and without compiler SIMD optimizations. Requests consist of repeated fixed-size vector addition operations processed in parallel by the CPU. These results provide perspective on how much faster SIMD makes vectorized computations, and similar improvements are expected for other vectorized operations in Numerix.
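The shape of this workload can be sketched as below. Note the original PoC was written in Rust (built with and without SIMD codegen flags); this Go version only illustrates the benchmark structure, and the vector size and iteration count are arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

// addVectors is the fixed-size element-wise addition kernel the PoC
// times repeatedly. In the Rust PoC this loop is what the compiler
// auto-vectorizes when SIMD target features are enabled.
func addVectors(dst, a, b []float32) {
	for i := range dst {
		dst[i] = a[i] + b[i]
	}
}

func main() {
	const size, iters = 1024, 100_000 // illustrative sizes, not the PoC's
	a := make([]float32, size)
	b := make([]float32, size)
	dst := make([]float32, size)
	for i := range a {
		a[i] = float32(i)
		b[i] = float32(2 * i)
	}
	start := time.Now()
	for n := 0; n < iters; n++ {
		addVectors(dst, a, b)
	}
	perOp := float64(time.Since(start).Nanoseconds()) / float64(iters)
	fmt.Printf("%.1f ns per %d-element add\n", perOp, size)
}
```

Comparing the same kernel compiled with and without vectorization flags isolates the SIMD speedup from other effects.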

                                        System Configuration


                                          Numerix — Key Functionalities


                                          Overview

                                          Numerix evaluates mathematical expressions over feature matrices with a simple, low-latency gRPC surface. Each request references a compute_id; Numerix resolves a postfix expression, maps variables to input columns, and evaluates it over fp32/fp64 vectors with compiler-assisted SIMD.
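The resolution step can be pictured as a registry lookup, sketched below in Go. The registry contents, IDs, and function name are invented for illustration; the point is that expressions are pre-parsed into postfix token lists ahead of time, so the request path is a lookup rather than a parse.

```go
package main

import "fmt"

// Hypothetical sketch of request-time expression resolution: expressions
// are parsed into postfix token lists once, when the registry is loaded,
// so no parsing happens per request. IDs and contents are illustrative.
var registry = map[string][]string{
	"1001": {"feature1", "feature2", "+"},
}

// resolveExpression returns the pre-parsed postfix tokens for a compute_id.
func resolveExpression(computeID string) ([]string, bool) {
	tokens, ok := registry[computeID]
	return tokens, ok
}

func main() {
	if tokens, ok := resolveExpression("1001"); ok {
		fmt.Println(tokens) // [feature1 feature2 +]
	}
}
```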

                                          🚀 Core Capabilities


                                          🎯 D
                                        • Deterministic behavior: No parsing at request time; expression resolved from registry.

                                        gRPC Service

service Numerix {
rpc Compute(NumerixRequestProto) returns (NumerixResponseProto);
}

                                        Example Call (grpcurl)

grpcurl -plaintext \
-import-path ./numerix/src/protos/proto \
-proto numerix.proto \
-d '{
"entityScoreData": {
"schema": ["feature1", "feature2"],
"entityScores": [ { "stringData": { "values": ["1.0", "2.0"] } } ],
"computeId": "1001",
"dataType": "fp32"
}
}' \
localhost:8080 numerix.Numerix/Compute

                                        📊 Observability

RUSTFLAGS="-C target-feature=+neon,+sve,+sve2" \
cargo build --release --target aarch64-unknown-linux-gnu
                                        • Deterministic Runtime: No dynamic parsing in hot path; O(n) across tokens with vectorized inner ops.

                                        🛠️ APIs

                                        gRPC

service Numerix {
rpc Compute(NumerixRequestProto) returns (NumerixResponseProto);
}

                                        Example call:

grpcurl -plaintext \
-import-path ./numerix/src/protos/proto \
-proto numerix.proto \
-d '{
"entityScoreData": {
"schema": ["feature1", "feature2"],
"entityScores": [ { "stringData": { "values": ["1.0", "2.0"] } } ],
"computeId": "1001",
"dataType": "fp32"
}
}' \
localhost:8080 numerix.Numerix/Compute

                                        🏗️ Deployment & Configuration

                                        Environment

APPLICATION_PORT=8083
APP_ENV=prd
APP_LOG_LEVEL=ERROR
APP_NAME=numerix

# Performance
CHANNEL_BUFFER_SIZE=10000

# etcd
ETCD_SERVERS=127.0.0.1:2379

# Metrics
METRIC_SAMPLING_RATE=1
TELEGRAF_UDP_HOST=127.0.0.1
TELEGRAF_UDP_PORT=8125

                                        Containers

                                        • Multi-arch images: linux/amd64, linux/arm64.

License: BharatMLStack Business Source License 1.1.


                                          Built with ❤️ for the ML community from Meesho
If you find this useful, ⭐️ the repo — your support means the world to us!

                                        BharatMLStack - Online Feature Store (OnFS)


                                        The Online Feature Store (OnFS) is part of BharatMLStack, designed to support real-time ML workloads through low-latency feature retrieval and flexible feature ingestion pipelines. It ensures that features generated offline or online are immediately accessible for inference.


BharatMLStack's Online-feature-store Architecture


                                        Serialization Performance Benchmarks


                                        Summary

                                        This report presents comprehensive benchmark results comparing three serialization formats for the BharatML Online Feature Store:


                                          Technical Implementation Notes

                                          PSDB Optimizations

// Object pooling for zero allocations
var psdbPool = GetPSDBPool()

// Direct buffer allocation
headerSize := PSDBLayout1LengthBytes // 9 bytes
dataSize := len(data) * 4 // 4 bytes per int32

// No compression for maximum speed
compressionType = compression.TypeNone
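The pooling pattern behind these optimizations can be illustrated with `sync.Pool`, as in the sketch below. This is not the real `GetPSDBPool` implementation, only a demonstration of why the pooled path can reach near-zero allocations per operation in the benchmarks further down.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sync"
)

// bufPool reuses serialization buffers across calls, so steady-state
// serialization avoids allocating a fresh backing array each time.
// Illustrative only; the actual PSDB pool internals may differ.
var bufPool = sync.Pool{
	New: func() any { b := make([]byte, 0, 1024); return &b },
}

func serializeInt32s(data []int32) []byte {
	bp := bufPool.Get().(*[]byte)
	buf := (*bp)[:0] // reuse pooled capacity
	for _, v := range data {
		buf = binary.LittleEndian.AppendUint32(buf, uint32(v))
	}
	out := make([]byte, len(buf)) // copy out so the buffer can be returned
	copy(out, buf)
	*bp = buf
	bufPool.Put(bp)
	return out
}

func main() {
	fmt.Println(len(serializeInt32s([]int32{1, 2, 3}))) // 3 ints * 4 bytes = 12
}
```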

                                          Memory Layout Comparison

PSDB Layout:    [9-byte header][raw data]
Proto3 Layout: [varint lengths][encoded data][padding]
Arrow Layout: [schema][metadata][buffers][padding]
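A PSDB-style layout (fixed 9-byte header, then raw little-endian values with no per-element framing) can be sketched in Go as follows. The header field meanings here (version byte, element count, reserved bytes) are assumptions for illustration, not the actual PSDB specification.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// headerSize matches the 9-byte header shown in the layout above.
const headerSize = 9

// encodePSDBLike writes an assumed header followed by raw little-endian
// int32 data. The fixed overhead explains the size ratios in the
// benchmarks: 409 bytes for 100 values, 4009 for 1000, and so on.
func encodePSDBLike(values []int32) []byte {
	buf := make([]byte, headerSize+4*len(values))
	buf[0] = 1                                                   // layout version (assumed)
	binary.LittleEndian.PutUint32(buf[1:5], uint32(len(values))) // element count (assumed)
	// buf[5:9] left as reserved/flags in this sketch
	for i, v := range values {
		binary.LittleEndian.PutUint32(buf[headerSize+4*i:], uint32(v))
	}
	return buf
}

func main() {
	b := encodePSDBLike([]int32{1, 2, 3})
	fmt.Println(len(b)) // 9-byte header + 12 bytes of data = 21
}
```

Because there is no per-element framing, decoding is a single pass over a contiguous buffer, which is what keeps PSDB close to the raw data size.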

                                          Conclusion

                                          The optimal format depends on your use case and scale:

                                          PSDB: Best for Small-Medium Scale (≤1,000 features)


                                          Raw Benchmark Output [Uncompressed Data]

                                          goos: darwin
                                          goarch: arm64
                                          pkg: github.com/Meesho/BharatMLStack/online-feature-store/internal/data/blocks
                                          BenchmarkInt32SerializationPSDB/PSDB/Size-100-10 1940238 625.3 ns/op 409.0 bytes 461 B/op 4 allocs/op
                                          BenchmarkInt32SerializationPSDB/PSDB/Size-1000-10 288300 4056 ns/op 4009 bytes 4143 B/op 4 allocs/op
                                          BenchmarkInt32SerializationPSDB/PSDB/Size-10000-10 32144 37357 ns/op 40009 bytes 41032 B/op 4 allocs/op
                                          BenchmarkInt32SerializationPSDB/PSDB/Size-100000-10 3244 359932 ns/op 400009 bytes 401572 B/op 4 allocs/op
                                          BenchmarkInt32SerializationProto3/Proto3/Size-100-10 1703066 695.9 ns/op 486.0 bytes 768 B/op 2 allocs/op
                                          BenchmarkInt32SerializationProto3/Proto3/Size-1000-10 194142 6004 ns/op 4885 bytes 5632 B/op 2 allocs/op
                                          BenchmarkInt32SerializationProto3/Proto3/Size-10000-10 20937 57674 ns/op 48734 bytes 49408 B/op 2 allocs/op
                                          BenchmarkInt32SerializationProto3/Proto3/Size-100000-10 2085 556541 ns/op 487263 bytes 491776 B/op 2 allocs/op
                                          BenchmarkInt32SerializationArrow/Arrow/Size-100-10 302257 3831 ns/op 680.0 bytes 7032 B/op 66 allocs/op
                                          BenchmarkInt32SerializationArrow/Arrow/Size-1000-10 228718 5191 ns/op 4280 bytes 15544 B/op 66 allocs/op
                                          BenchmarkInt32SerializationArrow/Arrow/Size-10000-10 52482 23173 ns/op 40280 bytes 122617 B/op 66 allocs/op
                                          BenchmarkInt32SerializationArrow/Arrow/Size-100000-10 9765 120081 ns/op 400280 bytes 957948 B/op 66 allocs/op
                                          BenchmarkInt32SerializationComparison/Comparison/Size-100/PSDB-10 1919401 670.2 ns/op 409.0 bytes 461 B/op 4 allocs/op
                                          BenchmarkInt32SerializationComparison/Comparison/Size-100/Proto3-10 1733599 693.2 ns/op 490.0 bytes 768 B/op 2 allocs/op
                                          BenchmarkInt32SerializationComparison/Comparison/Size-100/Arrow-10 304066 3896 ns/op 680.0 bytes 7032 B/op 66 allocs/op
                                          BenchmarkInt32SerializationComparison/Comparison/Size-1000/PSDB-10 290784 4074 ns/op 4009 bytes 4143 B/op 4 allocs/op
                                          BenchmarkInt32SerializationComparison/Comparison/Size-1000/Proto3-10 196962 6034 ns/op 4882 bytes 5632 B/op 2 allocs/op
                                          BenchmarkInt32SerializationComparison/Comparison/Size-1000/Arrow-10 227908 5240 ns/op 4280 bytes 15544 B/op 66 allocs/op
                                          BenchmarkInt32SerializationComparison/Comparison/Size-10000/PSDB-10 31732 38064 ns/op 40009 bytes 41024 B/op 4 allocs/op
                                          BenchmarkInt32SerializationComparison/Comparison/Size-10000/Proto3-10 20827 57670 ns/op 48745 bytes 49408 B/op 2 allocs/op
                                          BenchmarkInt32SerializationComparison/Comparison/Size-10000/Arrow-10 52000 23557 ns/op 40280 bytes 122617 B/op 66 allocs/op
                                          BenchmarkInt32SerializationComparison/Comparison/Size-100000/PSDB-10 3268 363817 ns/op 400009 bytes 401575 B/op 4 allocs/op
                                          BenchmarkInt32SerializationComparison/Comparison/Size-100000/Proto3-10 2097 559621 ns/op 487247 bytes 491776 B/op 2 allocs/op
                                          BenchmarkInt32SerializationComparison/Comparison/Size-100000/Arrow-10 10000 118489 ns/op 400280 bytes 957947 B/op 66 allocs/op
                                          BenchmarkInt32SizeComparison/SizeOnly/Size-100-10 1000000000 0.0000223 ns/op 680.0 arrow_bytes 170.0 arrow_ratio_pct 490.0 proto3_bytes 122.5 proto3_ratio_pct 409.0 psdb_bytes 102.2 psdb_ratio_pct 400.0 raw_bytes
                                          BenchmarkInt32SizeComparison/SizeOnly/Size-1000-10 1000000000 0.0000379 ns/op 4280 arrow_bytes 107.0 arrow_ratio_pct 4881 proto3_bytes 122.0 proto3_ratio_pct 4009 psdb_bytes 100.2 psdb_ratio_pct 4000 raw_bytes
                                          BenchmarkInt32SizeComparison/SizeOnly/Size-10000-10 1000000000 0.0001182 ns/op 40280 arrow_bytes 100.7 arrow_ratio_pct 48717 proto3_bytes 121.8 proto3_ratio_pct 40009 psdb_bytes 100.0 psdb_ratio_pct 40000 raw_bytes
                                          BenchmarkInt32SizeComparison/SizeOnly/Size-100000-10 1000000000 0.001034 ns/op 400280 arrow_bytes 100.1 arrow_ratio_pct 487225 proto3_bytes 121.8 proto3_ratio_pct 400009 psdb_bytes 100.0 psdb_ratio_pct 400000 raw_bytes
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-100/PSDB_Pooled-10 1926676 622.4 ns/op 461 B/op 4 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-100/Proto3-10 1713428 685.0 ns/op 768 B/op 2 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-100/Arrow-10 312584 4029 ns/op 7032 B/op 66 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-1000/PSDB_Pooled-10 290197 4189 ns/op 4143 B/op 4 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-1000/Proto3-10 195694 6078 ns/op 5632 B/op 2 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-1000/Arrow-10 224722 5190 ns/op 15544 B/op 66 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-10000/PSDB_Pooled-10 31898 37684 ns/op 41029 B/op 4 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-10000/Proto3-10 20840 58032 ns/op 49408 B/op 2 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-10000/Arrow-10 51440 24049 ns/op 122617 B/op 66 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-100000/PSDB_Pooled-10 3325 357690 ns/op 401814 B/op 4 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-100000/Proto3-10 2158 559694 ns/op 491776 B/op 2 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-100000/Arrow-10 9622 117515 ns/op 957948 B/op 66 allocs/op
                                          BenchmarkInt32Throughput/Throughput/PSDB-10 290912 4101 ns/op 975.31 MB/s 4143 B/op 4 allocs/op
                                          BenchmarkInt32Throughput/Throughput/Proto3-10 199087 6005 ns/op 666.12 MB/s 5632 B/op 2 allocs/op
                                          BenchmarkInt32Throughput/Throughput/Arrow-10 229594 5207 ns/op 768.25 MB/s 15544 B/op 66 allocs/op
                                          BenchmarkGetPSDBPoolWithoutPool-10 23836599 50.64 ns/op 192 B/op 1 allocs/op
                                          BenchmarkGetPSDBPoolWithPool-10 100000000 10.76 ns/op 0 B/op 0 allocs/op
                                          PASS
                                          ok github.com/Meesho/BharatMLStack/online-feature-store/internal/data/blocks 58.891s
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-1000/Arrow-10 224722 5190 ns/op 15544 B/op 66 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-10000/PSDB_Pooled-10 31898 37684 ns/op 41029 B/op 4 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-10000/Proto3-10 20840 58032 ns/op 49408 B/op 2 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-10000/Arrow-10 51440 24049 ns/op 122617 B/op 66 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-100000/PSDB_Pooled-10 3325 357690 ns/op 401814 B/op 4 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-100000/Proto3-10 2158 559694 ns/op 491776 B/op 2 allocs/op
                                          BenchmarkInt32MemoryEfficiency/Memory/Size-100000/Arrow-10 9622 117515 ns/op 957948 B/op 66 allocs/op
                                          BenchmarkInt32Throughput/Throughput/PSDB-10 290912 4101 ns/op 975.31 MB/s 4143 B/op 4 allocs/op
                                          BenchmarkInt32Throughput/Throughput/Proto3-10 199087 6005 ns/op 666.12 MB/s 5632 B/op 2 allocs/op
                                          BenchmarkInt32Throughput/Throughput/Arrow-10 229594 5207 ns/op 768.25 MB/s 15544 B/op 66 allocs/op
                                          BenchmarkGetPSDBPoolWithoutPool-10 23836599 50.64 ns/op 192 B/op 1 allocs/op
                                          BenchmarkGetPSDBPoolWithPool-10 100000000 10.76 ns/op 0 B/op 0 allocs/op
                                          PASS
                                          ok github.com/Meesho/BharatMLStack/online-feature-store/internal/data/blocks 58.891s

                                          Benchmarks run on Apple Silicon (ARM64) with Go 1.22.12. Results may vary on different architectures and Go versions.

Data Format for Permanent & Cache Storage

In this section we will go through the data formats at the heart of the online-feature-store. They are inspired by other storage-efficient formats such as Parquet and Arrow, but custom-built to perform well in constrained environments. The two key data formats are:

• PSDB - Permanent Storage Data Block, used while storing data in ScyllaDB
• CSDB - Cache Storage Data Block, used while storing data in cache

Conceptual Overview

                                          PSDB encodes vector data by flattening multi-dimensional arrays into a single contiguous byte buffer while preserving the ability to reconstruct the original vector boundaries.

                                          Vector Length Metadata

                                          Each feature group maintains metadata about vector dimensions in the Feature Registry. For example, if a feature group has:

fg1:
  version-2:
    features:
      f1: { vector_len: 6, default: [bytes] }
      f2: { vector_len: 3, default: [bytes] }
  version-1:
    features:
      f1: { vector_len: 6, default: [bytes] }
                                          • Feature f1 with vector_len: 6
                                          • Feature f2 with vector_len: 3
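The flatten-and-reconstruct idea behind PSDB vector storage can be sketched as follows. This is an illustrative example, not the actual PSDB implementation; the feature lengths mirror the registry snippet above (f1 with vector_len 6, f2 with vector_len 3).

```go
package main

import "fmt"

// Flatten concatenates fixed-length vectors into one contiguous buffer,
// the way PSDB stores multi-dimensional feature data.
func Flatten(vectors [][]float32) []float32 {
	var flat []float32
	for _, v := range vectors {
		flat = append(flat, v...)
	}
	return flat
}

// Reconstruct recovers the original vector boundaries using the
// per-feature vector_len metadata kept in the Feature Registry.
func Reconstruct(flat []float32, lens []int) [][]float32 {
	out := make([][]float32, 0, len(lens))
	off := 0
	for _, n := range lens {
		out = append(out, flat[off:off+n])
		off += n
	}
	return out
}

func main() {
	f1 := []float32{1, 2, 3, 4, 5, 6} // vector_len: 6
	f2 := []float32{7, 8, 9}          // vector_len: 3
	flat := Flatten([][]float32{f1, f2})
	back := Reconstruct(flat, []int{6, 3})
	fmt.Println(len(flat), len(back[0]), len(back[1])) // 9 6 3
}
```

Because the lengths live in the registry rather than in the data itself, the serialized buffer carries no per-vector framing overhead.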

Overview

Structure and Purpose

                                            Each CSDB contains a mapping of feature group IDs (FG IDs) to deserialized PSDBs. For distributed systems, this structure is flattened into a serialized byte slice. The CSDB supports layout versioning for backward compatibility and negative caching for feature groups with no associated data.

                                            Core Fields and Memory Layout

type CacheStorageDataBlock struct {
    // 8-byte aligned map pointer
    FGIdToDDB map[int]*DeserializedPSDB // offset: 0

    // 24-byte slice (ptr, len, cap)
    serializedCSDB []byte // offset: 8

    // 4-byte fields
    TTL uint32 // offset: 32

    // 1-byte fields
    layoutVersion uint8     // offset: 36
    cacheType     CacheType // offset: 37
    // 2 bytes padding to maintain 4-byte alignment
}

                                            The structure is memory-aligned for optimal performance:

                                            • Pointers and slices are 8-byte aligned

                                              Cache Types

                                              Format & Encoding

                                              CSDB Binary Layout: Serialized CSDBs follow this compact format:

[LayoutVersion (1 byte)][FGID (2 bytes)][DataLen (2 bytes)][Data ...]   → repeated per feature group
                                              • FGID and DataLen are encoded as uint16
                                              • If DataLen == 0, it denotes a negative cache (no data available for that FG)
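A minimal encoder/decoder for this layout might look like the following. This is an illustrative sketch, not the real implementation from the blocks package, and the big-endian byte order is an assumption.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeCSDB writes [LayoutVersion][FGID uint16][DataLen uint16][Data...]
// per feature group; a zero DataLen marks a negative cache entry.
// Byte order is assumed big-endian for this sketch.
func encodeCSDB(layoutVersion uint8, fgData map[uint16][]byte) []byte {
	buf := []byte{layoutVersion}
	for fgID, data := range fgData {
		var hdr [4]byte
		binary.BigEndian.PutUint16(hdr[0:2], fgID)
		binary.BigEndian.PutUint16(hdr[2:4], uint16(len(data)))
		buf = append(buf, hdr[:]...)
		buf = append(buf, data...)
	}
	return buf
}

// decodeCSDB parses the buffer back into per-feature-group byte slices.
// A nil value indicates a negative-cache entry (DataLen == 0).
func decodeCSDB(buf []byte) (uint8, map[uint16][]byte) {
	version := buf[0]
	out := map[uint16][]byte{}
	for i := 1; i+4 <= len(buf); {
		fgID := binary.BigEndian.Uint16(buf[i : i+2])
		n := int(binary.BigEndian.Uint16(buf[i+2 : i+4]))
		i += 4
		if n == 0 {
			out[fgID] = nil // negative cache: FG known, no data
			continue
		}
		out[fgID] = buf[i : i+n]
		i += n
	}
	return version, out
}

func main() {
	enc := encodeCSDB(1, map[uint16][]byte{7: []byte("abc"), 9: nil})
	v, dec := decodeCSDB(enc)
	fmt.Println(v, string(dec[7]), dec[9] == nil) // 1 abc true
}
```

Note how the negative-cache case costs only the 5-byte header, which is what makes caching "no data for this FG" cheap.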
Online Feature Store - Key Functionalities

                                                Overview

                                                The BharatML Online Feature Store is a high-performance, production-ready system designed to serve machine learning features with sub-10ms P99 latency and 1M+ RPS capacity. It bridges the gap between offline feature engineering and real-time model inference.

                                                🚀 Core Capabilities


                                              📊 Use Cases

                                              Real-Time ML Inference

// Fetch user features for recommendation model
query := &onfs.Query{
    EntityLabel: "user",
    FeatureGroups: []onfs.FeatureGroup{
        {
            Label:         "demographics",
            FeatureLabels: []string{"age", "location", "income"},
        },
        {
            Label:         "behavior",
            FeatureLabels: []string{"click_rate", "purchase_history"},
        },
    },
    KeysSchema: []string{"user_id"},
    Keys: []onfs.Keys{
        {Cols: []string{"user_123"}},
    },
}

result, err := client.RetrieveFeatures(ctx, query)

                                              Batch Feature Serving

// Bulk feature retrieval for model training
query := &onfs.Query{
    EntityLabel: "transaction",
    FeatureGroups: []onfs.FeatureGroup{
        {
            Label:         "transaction_history",
            FeatureLabels: []string{"amount", "frequency", "merchant_type"},
        },
        {
            Label:         "risk_scores",
            FeatureLabels: []string{"fraud_score", "credit_score"},
        },
    },
    KeysSchema: []string{"transaction_id"},
    Keys: []onfs.Keys{
        {Cols: []string{"txn_001"}},
        {Cols: []string{"txn_002"}},
        // ... 100s of transaction IDs
    },
}

result, err := client.RetrieveFeatures(ctx, query)

                                              A/B Testing Support

// Version-aware feature retrieval with decoded values
query := &onfs.Query{
    EntityLabel: "experiment",
    FeatureGroups: []onfs.FeatureGroup{
        {
            Label:         "model_features_v2", // Specific version
            FeatureLabels: []string{"feature_a", "feature_b", "feature_c"},
        },
    },
    KeysSchema: []string{"user_id"},
    Keys: []onfs.Keys{
        {Cols: []string{"user_123"}},
    },
}

// Get string-decoded values for easier debugging/analysis
decodedResult, err := client.RetrieveDecodedFeatures(ctx, query)

                                              🎛️ Configuration Options

                                              Performance Tuning

Online Feature Store - Release Notes

                                                Version 1.0.0 🚀

                                                Release Date: June 2025
                                                Status: General Availability (GA)


                                                🛠️ APIs & SDKs

                                                gRPC API

                                                High-performance, language-agnostic interface:

service FeatureStoreService {
    rpc RetrieveFeatures(Query) returns (QueryResult);
    rpc RetrieveDecodedFeatures(Query) returns (DecodedQueryResult);
    rpc PersistFeatures(PersistFeaturesRequest) returns (Result);
}

                                                Go SDK v1.0.0

                                                Native Go client with enterprise features:


                                                  💾 Download & Installation

                                                  Container Images

# Pull the latest images
docker pull ghcr.io/meesho/onfs-api-server:latest
docker pull ghcr.io/meesho/onfs-consumer:latest
docker pull ghcr.io/meesho/horizon:latest
docker pull ghcr.io/meesho/trufflebox-ui:latest

Supported Architectures

                                                  • Linux (amd64)

                                                  Checkout Packages

                                                  Source Code

git clone https://github.com/Meesho/BharatMLStack.git
cd BharatMLStack/online-feature-store
git checkout release/1.0.0

                                                  Contributing

                                                  We welcome contributions from the community! Please see our Contributing Guide for details on how to get started.

                                                  Community & Support


                                                  BharatMLStack - Predator


                                                  Predator is a scalable, high-performance model inference service built as a wrapper around the NVIDIA Triton Inference Server. It is designed to serve a variety of machine learning models (Deep Learning, Tree-based, etc.) with low latency in a Kubernetes (K8s) environment.


                                                  The system integrates seamlessly with the Online Feature Store (OnFS) for real-time feature retrieval and uses Horizon as the deployment orchestration layer. Deployments follow a GitOps pipeline — Horizon generates Helm configurations, commits them to GitHub, and Argo Sync reconciles the desired state onto Kubernetes.


                                                  High-Level Design


                                                  Predator HLD - End-to-end deployment and inference architecture


                                                  End-to-End Flow

1. Model Deployment Trigger: An actor initiates deployment through Trufflebox UI, specifying the GCS path (gcs://) of the trained model. Separately, post-training pipelines write model artifacts to GCS Artifactory.

2. Orchestration via Horizon: Trufflebox UI communicates with Horizon, the deployment orchestration layer. Horizon generates the appropriate Helm chart configuration for the inference service.

3. GitOps Pipeline: Horizon commits the Helm values to a GitHub repository. Argo Sync watches the repo and reconciles the desired state onto the Kubernetes cluster, creating or updating deployable units.

4. Deployable Units (Deployable 1 … N): Each deployable is an independent Kubernetes deployment that:
   • Downloads model artifacts from GCS at startup via an init.sh script.
   • Launches a Triton Inference Server instance loaded with the model.
   • Runs one or more pods, each containing the inference runtime and configured backends.

5. Triton Backends: Each Triton instance supports pluggable backends based on the model type:
   • FIL — GPU-accelerated tree-based models (XGBoost, LightGBM, Random Forest).
   • PyTorch — Native PyTorch models via LibTorch.
   • Python — Custom preprocessing/postprocessing or unsupported model formats.
   • TRT (TensorRT) — GPU-optimized serialized TensorRT engines.
   • ONNX — Framework-agnostic execution via ONNX Runtime.
   • DALI — GPU-accelerated data preprocessing (image, audio, video).

6. Autoscaling with KEDA: The cluster uses KEDA (Kubernetes Event-Driven Autoscaling) to scale deployable pods based on custom metrics (CPU utilization, GPU utilization via DCGM, queue depth, etc.). The underlying Kubernetes scheduler places pods across GPU/CPU node pools.

                                                  Key Design Principles

• GitOps-driven: All deployment state is version-controlled in Git; Argo Sync ensures cluster state matches the declared configuration.
• Isolation per deployable: Each model or model group gets its own deployable unit, preventing noisy-neighbor interference.
• Init-based model loading: Models are materialized to local disk before Triton starts, ensuring deterministic startup and no runtime dependency on remote storage.
• Pluggable backends: The same infrastructure serves deep learning, tree-based, and custom models through Triton's backend abstraction.

                                                  Inference Engine: Triton Inference Server


                                                  NVIDIA Triton Inference Server is a high-performance model serving system designed to deploy ML and deep learning models at scale across CPUs and GPUs. It provides a unified inference runtime that supports multiple frameworks, optimized execution, and production-grade scheduling.


                                                  Triton operates as a standalone server that loads models from a model repository and exposes standardized HTTP/gRPC APIs. Predator uses gRPC for efficient request and response handling via the helix client.


                                                  Core Components

• Model Repository: Central directory where models are stored. Predator typically materializes the model repository onto local disk via an init container, enabling fast model loading and eliminating runtime dependency on remote storage during inference.

                                                  Backends


                                                  A backend is the runtime responsible for executing a model. Each model specifies which backend runs it via configuration.

| Backend | Description |
|---------|-------------|
| TensorRT | GPU-optimized; executes serialized TensorRT engines (kernel fusion, FP16/INT8). |
| PyTorch | Serves native PyTorch models via LibTorch. |
| ONNX Runtime | Framework-agnostic ONNX execution with TensorRT and other accelerators. |
| TensorFlow | Runs TensorFlow SavedModel format. |
| Python backend | Custom Python code for preprocessing, postprocessing, or unsupported models. |
| Custom backends | C++/Python backends for specialized or proprietary runtimes. |
| DALI | GPU-accelerated data preprocessing (image, audio, video). |
| FIL (Forest Inference Library) | GPU-accelerated tree-based models (XGBoost, LightGBM, Random Forest). |

                                                  Key Features

• Dynamic batching: Combines multiple requests into a single batch at runtime — higher GPU utilization, improved throughput, reduced latency variance.
• Concurrent model execution: Runs multiple models, or multiple instances of the same model, and distributes load across GPUs.
• Model versioning: Serves multiple versions of each model side by side.
• Ensemble models: Chains models into a pipeline served as a single ensemble, eliminating intermediate network hops and reducing latency.
• Model instance scaling: Runs multiple copies of a model for parallel inference and load isolation.
• Observability: Exposes Prometheus metrics with granular latency, throughput, and GPU utilization.
• Warmup requests: Preloads kernels to avoid cold-start latency.
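To make the dynamic-batching trade-off concrete, here is a toy sketch of how a preferred batch size and a maximum queue delay interact. This is an illustration of the idea only, not Triton's actual scheduler; all names are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ToyDynamicBatcher:
    """Toy model of dynamic batching; parameters mirror config.pbtxt fields."""
    preferred_batch_sizes: Tuple[int, ...] = (8, 16, 32, 64)
    max_queue_delay_us: int = 2000
    _queue: List[Tuple[int, object]] = field(default_factory=list)

    def enqueue(self, now_us: int, request: object) -> Optional[list]:
        """Queue a request; return a batch if one should be dispatched now."""
        self._queue.append((now_us, request))
        oldest_wait_us = now_us - self._queue[0][0]
        # Dispatch on reaching a preferred size, or once the oldest queued
        # request has waited longer than the allowed queue delay.
        if (len(self._queue) in self.preferred_batch_sizes
                or oldest_wait_us >= self.max_queue_delay_us):
            batch = [req for _, req in self._queue]
            self._queue.clear()
            return batch
        return None
```

Under heavy load, batches fill to a preferred size almost immediately; under light traffic, the queue delay bounds how long any request waits, which is why a small max_queue_delay_microseconds keeps tail latency predictable.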

                                                  Model Repository Structure

```
model_repository/
├── model_A/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.plan
│   └── 2/
│       └── model.plan
└── model_B/
    ├── config.pbtxt
    └── 1/
        └── model.py
```

                                                  The config.pbtxt file defines how Triton loads and executes a model: input/output tensors, batch settings, hardware execution, backend runtime, and optimization parameters. At minimum it defines: backend/platform, max_batch_size, inputs, outputs.


                                                  Sample config.pbtxt

```
name: "product_ranking_model"
platform: "tensorrt_plan"
max_batch_size: 64
input [
  {
    name: "input_embeddings"
    data_type: TYPE_FP16
    dims: [ 128 ]
  },
  {
    name: "context_features"
    data_type: TYPE_FP32
    dims: [ 32 ]
  }
]
output [
  {
    name: "scores"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
instance_group [
  { kind: KIND_GPU, count: 2, gpus: [ 0 ] }
]
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32, 64 ]
  max_queue_delay_microseconds: 2000
}
```

                                                  Kubernetes Deployment Architecture


                                                  Predator inference services are deployed on Kubernetes using Helm-based deployments for standardized, scalable, GPU-optimized model serving. Each deployment consists of Triton Inference Server wrapped within a Predator runtime, with autoscaling driven by CPU and GPU utilization.


                                                  Pod Architecture

```
Predator Pod
├── Init Container (model sync)
└── Triton Inference Server Container
```

                                                  Model artifacts and runtime are initialized before inference traffic is accepted.


                                                  Init Container

• Downloads model artifacts from cloud storage (GCS).
• Populates the Triton model repository directory.
• Example: gcloud storage cp -r gs://.../model-path/* /models

                                                  Benefits: deterministic startup (Triton starts only after models are available), separation of concerns (image = runtime, repository = data).
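The layout above might be sketched as the following pod spec fragment. Container names, image references, and the bucket path are illustrative placeholders, not the actual Helm output:

```yaml
spec:
  volumes:
    - name: model-repo
      emptyDir: {}            # shared scratch volume for the model repository
  initContainers:
    - name: model-sync        # placeholder name
      image: google/cloud-sdk:slim
      # Placeholder bucket path; mirrors the gcloud storage cp step above.
      command: ["sh", "-c", "gcloud storage cp -r gs://example-bucket/model-path/* /models"]
      volumeMounts:
        - name: model-repo
          mountPath: /models
  containers:
    - name: triton
      image: example-registry/tritonserver:custom   # placeholder image
      command: ["tritonserver", "--model-repository=/models"]
      volumeMounts:
        - name: model-repo
          mountPath: /models
```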


                                                  Triton Inference Server Container

• Loads model artifacts from the local repository.
• Manages inference scheduling and request/response handling, and exposes inference endpoints.

                                                  Triton Server Image Strategy


                                                  The Helm chart uses the Triton container image from the internal artifact registry. Production uses custom-built images (only required backends, e.g. TensorRT, Python) to reduce size and startup time. Unnecessary components are excluded; images are built internally and pushed to the registry.


                                                  Response Caching: Custom cache plugins can be added at image build time for optional inference response caching — reducing redundant execution and GPU use for repeated inputs.


                                                  Image Distribution Optimization

• Secondary boot disk image caching: Images are pre-cached on GPU node pool secondary boot disks to avoid repeated pulls during scale-up and to reduce pod startup time and cold-start latency.
• Image streaming: Can be used to progressively pull layers for faster time-to-readiness during scaling.

                                                  Health Probes


Readiness and liveness probes use Triton's /v2/health/ready endpoint, so Triton receives traffic only after its models are loaded; failed instances are restarted automatically.
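In pod-spec terms this maps to probes along these lines. The port is Triton's default HTTP port; the timings are illustrative, not the production values:

```yaml
readinessProbe:
  httpGet:
    path: /v2/health/ready
    port: 8000
  initialDelaySeconds: 10   # give the local model repository time to load
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /v2/health/ready
    port: 8000
  periodSeconds: 10
  failureThreshold: 3       # restart the container after repeated failures
```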


                                                  Resource Configuration


                                                  Sample GPU resource config:

```yaml
limits:
  cpu: 7000m
  memory: 28Gi
  nvidia.com/gpu: 1
```

                                                  Autoscaling Architecture


                                                  Predator uses KEDA (Kubernetes Event-Driven Autoscaling) for scaling deployable pods. KEDA supports custom metric sources including:

• CPU / memory utilization for CPU-based deployments.
• GPU utilization via DCGM (Data Center GPU Manager) for GPU pods — covering utilization, memory, power, etc.
• Custom Prometheus queries for application-level scaling signals (e.g., inference queue depth, request latency).

                                                  KEDA ScaledObjects are configured per deployable, enabling fine-grained, independent scaling for each model or model group.
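As an illustration, a per-deployable ScaledObject scaling on GPU utilization via a DCGM-exporter metric in Prometheus might look like this. Names, thresholds, the server address, and the query are placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ranking-model-scaler          # placeholder
spec:
  scaleTargetRef:
    name: ranking-model               # target Deployment; placeholder
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090   # placeholder
        query: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"ranking-model.*"})
        threshold: "70"               # scale out above ~70% GPU utilization
```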


                                                  Contributing


                                                  We welcome contributions! See the Contributing Guide.


                                                  Community & Support


                                                  License


                                                  BharatMLStack is open-source under the BharatMLStack Business Source License 1.1.

                                                  Built with ❤️ for the ML community from Meesho
                                                  If you find this useful, ⭐️ the repo — your support means the world to us!

                                                  Predator - Key Functionalities


                                                  Overview


Predator is a scalable, high-performance model inference service built as a wrapper around NVIDIA Triton Inference Server. It serves Deep Learning and tree-based models with low latency on Kubernetes, integrates with the Online Feature Store (OnFS), and uses Interflow for orchestration between clients, the feature store, and the inference engine. Clients send inference requests via the Helix client over gRPC.


                                                  Core Capabilities


                                                  Multi-Backend Inference


                                                  Predator leverages Triton's pluggable backends so you can serve a variety of model types from a single deployment:

| Backend | Use Case |
|---------|----------|
| TensorRT | GPU-optimized DL; serialized engines (FP16/INT8) |
| PyTorch | Native PyTorch via LibTorch |
| ONNX Runtime | Framework-agnostic ONNX with TensorRT/GPU |
| TensorFlow | SavedModel format |
| Python | Custom preprocessing, postprocessing, or unsupported models |
| FIL | Tree-based models (XGBoost, LightGBM, Random Forest) on GPU |
| DALI | GPU-accelerated data preprocessing (image, audio, video) |
| Custom | C++/Python backends for proprietary or specialized runtimes |

                                                  Dynamic Batching


                                                  Triton combines multiple incoming requests into a single batch at runtime.

• Higher GPU utilization and improved throughput
• Reduced latency variance
• Configurable preferred_batch_size and max_queue_delay_microseconds in config.pbtxt
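For example, the corresponding config.pbtxt fragment might look like this (values are illustrative, not a recommendation):

```
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 2000
}
```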

                                                  Concurrent Model Execution

• Run multiple models simultaneously
• Run multiple instances of the same model
• Distribute load across GPUs via instance_group in model config
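As a sketch, an instance_group entry running two instances of a model on GPU 0 (values are illustrative):

```
instance_group [
  { kind: KIND_GPU, count: 2, gpus: [ 0 ] }
]
```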

                                                  Model Versioning & Ensembles

• Versioning: Multiple versions per model (e.g. 1/, 2/ directories in the model repository)
• Ensembles: Define a pipeline of models as an ensemble; eliminates intermediate network hops and reduces latency

                                                  Model Instance Scaling

• Deploy multiple copies of a model for parallel inference and load isolation
• Configured via instance_group

                                                  Inference & API


                                                  gRPC via Helix Client


                                                  Predator uses gRPC for efficient request/response handling. Client applications (e.g. Realestate, IOP) send inference requests through the Helix client, which talks to the Triton Inference Server inside the Predator pod.


                                                  Model Repository


                                                  Models are stored in a local model repository. Predator materializes this via an Init Container that downloads artifacts from cloud storage (e.g. GCS) so Triton has no runtime dependency on remote storage during inference.


                                                  Deployment & Operational Features


                                                  Custom Triton Images

• Production uses custom-built Triton images (only required backends) for smaller size and faster startup
• Images are built on a GCP VM, pushed to Artifact Registry, and referenced in Helm deployments
• Optional response caching via custom cache plugins added at image build time

                                                  Image Distribution

• Secondary boot disk caching: Triton image pre-cached on GPU node pools to reduce pod startup and scale-up latency
• Image streaming: Optionally used for faster time-to-readiness during scaling

                                                  Health Probes

• Readiness and liveness probes use /v2/health/ready
• Triton receives traffic only after models are loaded; failed instances are restarted automatically

                                                  Autoscaling

• CPU-based scaling for generic load
• GPU-based scaling using DCGM metrics (utilization, memory, power); custom queries drive scale-up and scale-down

                                                  Observability

• Prometheus metrics: Latency, throughput, GPU utilization, and more
• Metrics are emitted from the Triton inference container and visualized in Grafana
• Warmup requests: Configurable to preload kernels and avoid cold-start latency

                                                  Contributing


                                                  We welcome contributions! See the Contributing Guide.


                                                  Community & Support


                                                  License


                                                  BharatMLStack is open-source under the BharatMLStack Business Source License 1.1.

                                                  Built with ❤️ for the ML community from Meesho
                                                  If you find this useful, ⭐️ the repo — your support means the world to us!

                                                  Predator - Release Notes


                                                  Version 1.0.0


Release Date: June 2025
Status: General Availability (GA)


First stable release of Predator, a scalable model inference service built around NVIDIA Triton Inference Server and part of BharatMLStack. It serves Deep Learning and tree-based models with low latency on Kubernetes, integrates with OnFS and Interflow, and is accessed by clients through the Helix client over gRPC.


                                                  What's New

• Triton inference engine: Unified runtime for DL and tree-based models on CPU/GPU; model repository populated by an Init Container from GCS; gRPC API via the Helix client.
• Multi-backend support: TensorRT, PyTorch, ONNX Runtime, TensorFlow, Python, FIL, DALI, and custom backends.
• Dynamic batching & concurrency: Configurable via config.pbtxt; model versioning and ensembles.
• Kubernetes deployment: Helm-based; Init Container plus Triton container; custom Triton images from Artifact Registry; health probes; CPU/GPU autoscaling.
• Observability: Prometheus metrics and Grafana dashboards; warmup requests for cold-start avoidance.

                                                  BharatML Stack Quick Start Guide


                                                  Discord

                                                  A quick way to get the BharatML Stack Online Feature Store platform up and running locally for development and testing.

                                                  Prerequisites


System Components

                                                  Quick Start

                                                  Starting the System

                                                  Run the start script to set up your workspace and launch all services:

```shell
./start.sh
```

                                                  Testing Different Versions

                                                  You can easily test different versions of the application services by setting environment variables:

```shell
# Test specific versions [Replace with actual versions]
ONFS_VERSION=v1.2.3 HORIZON_VERSION=v2.1.0 TRUFFLEBOX_VERSION=v1.0.5 ./start.sh

# Or set them in your workspace and run docker-compose directly
cd workspace
ONFS_VERSION=main docker-compose up -d onfs-api-server
```

                                                  Available version formats:

- `latest` (default) - Latest stable release

                                                  Stopping the System

                                                  To stop all services:

```bash
./stop.sh
```

                                                  To stop and completely purge all containers, volumes, and workspace:

```bash
./stop.sh --purge
```

                                                  Accessing Services

                                                  Frontend UI


                                                    gRPC API Commands

                                                    Use the following grpcurl commands to interact with the Online Feature Store gRPC API:

                                                    Persist Features:

```bash
grpcurl -plaintext \
  -H "online-feature-store-caller-id: <caller-id>" \
  -H "online-feature-store-auth-token: <auth-token>" \
  -d '<request-body>' \
  localhost:8089 persist.FeatureService/PersistFeatures
```

                                                    Retrieve Features (Decoded):

```bash
grpcurl -plaintext \
  -H "online-feature-store-caller-id: <caller-id>" \
  -H "online-feature-store-auth-token: <auth-token>" \
  -d '<request-body>' \
  localhost:8089 retrieve.FeatureService/RetrieveDecodedResult
```

                                                    Retrieve Features (Binary):

```bash
grpcurl -plaintext \
  -H "online-feature-store-caller-id: <caller-id>" \
  -H "online-feature-store-auth-token: <auth-token>" \
  -d '<request-body>' \
  localhost:8089 retrieve.FeatureService/RetrieveFeatures
```

                                                    Sample Request Bodies

                                                    Single Feature Group Persist:

```json
{
  "data": [{
    "key_values": ["10"],
    "feature_values": [{
      "values": {"fp32_values": [123.45]}
    }]
  }],
  "entity_label": "catalog",
  "feature_group_schema": [{
    "label": "int_fg",
    "feature_labels": ["id"]
  }],
  "keys_schema": ["catalog_id"]
}
```

                                                    Single Feature Group Retrieve:

```json
{
  "entity_label": "catalog",
  "feature_groups": [{
    "label": "int_fg",
    "feature_labels": ["id"]
  }],
  "keys_schema": ["catalog_id"],
  "keys": [{"cols": ["10"]}]
}
```

                                                    Multiple Feature Groups Persist:

```json
{
  "data": [
    {
      "key_values": ["1"],
      "feature_values": [
        {"values": {"fp32_values": [28.5]}},
        {"values": {"string_values": ["Bharat"]}}
      ]
    },
    {
      "key_values": ["2"],
      "feature_values": [
        {"values": {"fp32_values": [32.0]}},
        {"values": {"string_values": ["India"]}}
      ]
    }
  ],
  "entity_label": "catalog",
  "feature_group_schema": [
    {"label": "int_fg", "feature_labels": ["id"]},
    {"label": "string_fg", "feature_labels": ["name"]}
  ],
  "keys_schema": ["catalog_id"]
}
```

                                                    Multiple Feature Groups Retrieve:

```json
{
  "entity_label": "catalog",
  "feature_groups": [
    {"label": "int_fg", "feature_labels": ["id"]},
    {"label": "string_fg", "feature_labels": ["name"]}
  ],
  "keys_schema": ["catalog_id"],
  "keys": [
    {"cols": ["1"]},
    {"cols": ["2"]}
  ]
}
```

                                                    Vector Feature Group Persist:

```json
{
  "data": [{
    "key_values": ["123"],
    "feature_values": [{
      "values": {
        "vector": [{
          "values": {"fp32_values": [1.0, 2.0, 3.0, 4.0]}
        }]
      }
    }]
  }],
  "entity_label": "catalog",
  "feature_group_schema": [{
    "label": "vector_fg",
    "feature_labels": ["embedding"]
  }],
  "keys_schema": ["catalog_id"]
}
```

                                                    Vector Feature Group Retrieve:

```json
{
  "entity_label": "catalog",
  "feature_groups": [{
    "label": "vector_fg",
    "feature_labels": ["embedding"]
  }],
  "keys_schema": ["catalog_id"],
  "keys": [{"cols": ["123"]}]
}
```

                                                    Key Points

                                                    Only one type per feature value block:


                                                    Managing Services

                                                    Viewing Logs

```bash
# View logs for all services
cd workspace && docker-compose logs -f

# View logs for specific services
cd workspace && docker-compose logs -f horizon
cd workspace && docker-compose logs -f trufflebox-ui
cd workspace && docker-compose logs -f onfs-api-server
```

                                                    Service Management

```bash
# Restart a specific service
cd workspace && docker-compose restart horizon

# Stop all services
cd workspace && docker-compose down

# Start services again
cd workspace && docker-compose up -d

# Check service status
cd workspace && docker-compose ps
```

                                                    Troubleshooting

                                                    Common Issues

1. Docker network issues: If containers can't communicate, try recreating the network:

```bash
docker network rm onfs-network
docker network create onfs-network
```
                                                    2. Service health checks failing: Check if all infrastructure services (databases) are running:

```bash
cd workspace && docker-compose ps
```
                                                    3. Image pull issues: Ensure you have access to GitHub Container Registry:

```bash
docker login ghcr.io
```
4. How to use Etcd Workbench?

License

BharatMLStack is licensed under the Business Source License 1.1.


                                                      Built with ❤️ for the ML community from Meesho
If you find this useful, ⭐️ the repo — your support means the world to us!

                                                  BharatMLStack Go SDK

Installation

```bash
go get github.com/Meesho/BharatMLStack/go-sdk
```

                                                  Configuration

                                                  The SDK requires a configuration object with the following fields:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| Host | string | Yes | Server hostname (e.g., "localhost", "feature-store.example.com") |
| Port | string | Yes | Server port (e.g., "8080", "443") |
| CallerId | string | Yes | Unique identifier for your service/application |
| CallerToken | string | Yes | Authentication token for API access |
| DeadLine | int | No | Request timeout in milliseconds (default: 5000) |
| PlainText | bool | No | Use plaintext connection instead of TLS (default: false) |
| BatchSize | int | No | Maximum batch size for bulk operations (default: 50) |
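The optional fields and their documented defaults can be sketched as a small helper. This is illustrative only: the `Config` struct and `applyDefaults` function here are not the SDK's own; the SDK applies its defaults internally.

```go
package main

import "fmt"

// Config mirrors the fields in the table above (illustrative, not the SDK type).
type Config struct {
	Host, Port, CallerId, CallerToken string
	DeadLine  int  // request timeout in milliseconds
	PlainText bool // plaintext connection instead of TLS
	BatchSize int  // maximum batch size for bulk operations
}

// applyDefaults fills unset optional fields with the documented defaults.
func applyDefaults(c Config) Config {
	if c.DeadLine == 0 {
		c.DeadLine = 5000
	}
	if c.BatchSize == 0 {
		c.BatchSize = 50
	}
	return c
}

func main() {
	c := applyDefaults(Config{Host: "localhost", Port: "8080", CallerId: "svc", CallerToken: "tok"})
	fmt.Println(c.DeadLine, c.BatchSize) // 5000 50
}
```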

                                                  Usage

                                                  Basic Usage

```go
package main

import (
	"log"

	"github.com/Meesho/BharatMLStack/go-sdk/pkg/onfs"
)

func main() {
	config := &onfs.Config{
		Host:        "localhost",
		Port:        "8080",
		PlainText:   true, // For local development
		CallerId:    "my-service",
		CallerToken: "my-token",
	}

	// Initialize client (timing and count can be nil)
	client := onfs.NewClientV1(config, nil, nil)
	log.Println("ONFS client initialized")

	// Your feature operations here...
	_ = client
}
```

                                                  Complete Example

package main

import (
	"context"
	"log"
	"time"

	"github.com/Meesho/BharatMLStack/go-sdk/pkg/onfs"
)

func main() {
	// Create configuration
	config := &onfs.Config{
		Host:        "localhost",
		Port:        "8080",
		DeadLine:    5000, // 5 seconds timeout in milliseconds
		PlainText:   true, // Use plaintext connection for local development
		BatchSize:   50,   // Optional: batch size for requests
		CallerId:    "your-service-id",
		CallerToken: "your-auth-token",
	}

	// Timing and count functions (can be nil for basic usage)
	timing := func(name string, value time.Duration, tags []string) {
		log.Printf("Timing: %s took %v with tags %v", name, value, tags)
	}
	count := func(name string, value int64, tags []string) {
		log.Printf("Count: %s = %d with tags %v", name, value, tags)
	}

	// Initialize the client
	client := onfs.InitClient(onfs.Version1, config, timing, count)
	// Or alternatively use: client := onfs.NewClientV1(config, timing, count)

	ctx := context.Background()

	// Example: Retrieve features
	query := &onfs.Query{
		EntityLabel: "user",
		FeatureGroups: []onfs.FeatureGroup{
			{
				Label:         "user_features",
				FeatureLabels: []string{"age", "location", "preferences"},
			},
		},
		KeysSchema: []string{"user_id"},
		Keys: []onfs.Keys{
			{Cols: []string{"12345"}},
			{Cols: []string{"67890"}},
		},
	}

	result, err := client.RetrieveFeatures(ctx, query)
	if err != nil {
		log.Fatalf("Failed to retrieve features: %v", err)
	}

	log.Printf("Retrieved %d rows for entity %s", len(result.Rows), result.EntityLabel)

	// Example: Retrieve decoded features (string values)
	decodedResult, err := client.RetrieveDecodedFeatures(ctx, query)
	if err != nil {
		log.Fatalf("Failed to retrieve decoded features: %v", err)
	}

	log.Printf("Retrieved %d decoded rows", len(decodedResult.Rows))

	// Example: Persist features
	persistRequest := &onfs.PersistFeaturesRequest{
		EntityLabel: "user",
		KeysSchema:  []string{"user_id"},
		FeatureGroups: []onfs.FeatureGroupSchema{
			{
				Label:         "user_features",
				FeatureLabels: []string{"age", "location"},
			},
		},
		Data: []onfs.Data{
			{
				KeyValues: []string{"12345"},
				FeatureValues: []onfs.FeatureValues{
					{
						Values: onfs.Values{
							Int32Values:  []int32{25},
							StringValues: []string{"New York"},
						},
					},
				},
			},
		},
	}

	persistResponse, err := client.PersistFeatures(ctx, persistRequest)
	if err != nil {
		log.Fatalf("Failed to persist features: %v", err)
                                                  }

                                                  log.Printf("Persist result: %s", persistResponse.Message)
                                                  }
                                                  +
```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/Meesho/BharatMLStack/go-sdk/pkg/onfs"
)

func main() {
	// Create configuration
	config := &onfs.Config{
		Host:        "localhost",
		Port:        "8080",
		DeadLine:    5000, // 5 seconds timeout in milliseconds
		PlainText:   true, // Use plaintext connection for local development
		BatchSize:   50,   // Optional: batch size for requests
		CallerId:    "your-service-id",
		CallerToken: "your-auth-token",
	}

	// Timing and count functions (can be nil for basic usage)
	timing := func(name string, value time.Duration, tags []string) {
		log.Printf("Timing: %s took %v with tags %v", name, value, tags)
	}
	count := func(name string, value int64, tags []string) {
		log.Printf("Count: %s = %d with tags %v", name, value, tags)
	}

	// Initialize the client
	client := onfs.InitClient(onfs.Version1, config, timing, count)
	// Or alternatively use: client := onfs.NewClientV1(config, timing, count)

	ctx := context.Background()

	// Example: Retrieve features
	query := &onfs.Query{
		EntityLabel: "user",
		FeatureGroups: []onfs.FeatureGroup{
			{
				Label:         "user_features",
				FeatureLabels: []string{"age", "location", "preferences"},
			},
		},
		KeysSchema: []string{"user_id"},
		Keys: []onfs.Keys{
			{Cols: []string{"12345"}},
			{Cols: []string{"67890"}},
		},
	}

	result, err := client.RetrieveFeatures(ctx, query)
	if err != nil {
		log.Fatalf("Failed to retrieve features: %v", err)
	}
	log.Printf("Retrieved %d rows for entity %s", len(result.Rows), result.EntityLabel)

	// Example: Retrieve decoded features (string values)
	decodedResult, err := client.RetrieveDecodedFeatures(ctx, query)
	if err != nil {
		log.Fatalf("Failed to retrieve decoded features: %v", err)
	}
	log.Printf("Retrieved %d decoded rows", len(decodedResult.Rows))

	// Example: Persist features
	persistRequest := &onfs.PersistFeaturesRequest{
		EntityLabel: "user",
		KeysSchema:  []string{"user_id"},
		FeatureGroups: []onfs.FeatureGroupSchema{
			{
				Label:         "user_features",
				FeatureLabels: []string{"age", "location"},
			},
		},
		Data: []onfs.Data{
			{
				KeyValues: []string{"12345"},
				FeatureValues: []onfs.FeatureValues{
					{
						Values: onfs.Values{
							Int32Values:  []int32{25},
							StringValues: []string{"New York"},
						},
					},
				},
			},
		},
	}

	persistResponse, err := client.PersistFeatures(ctx, persistRequest)
	if err != nil {
		log.Fatalf("Failed to persist features: %v", err)
	}
	log.Printf("Persist result: %s", persistResponse.Message)
}
```

## Development

### Prerequisites

- Go 1.22 or later (as specified in go.mod)

### Building

```bash
# Build all packages
go build ./...

# Run tests
go test ./...

# Run tests with coverage
go test -v -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
```

### Testing

```bash
# Run all tests
go test -v ./...

# Run specific package tests
go test -v ./pkg/onfs

# Run with race detection
go test -race ./...
```

## Contributing

We welcome contributions from the community! Please see our Contributing Guide for details on how to get started.

## Community & Support

## License

BharatMLStack is licensed under the Business Source License 1.1.

---

Built with ❤️ for the ML community from Meesho

If you find this useful, ⭐️ the repo — your support means the world to us!

# GRPC Feature Client

                                                  High-performance gRPC client for BharatML Stack real-time feature operations with direct API access.

## Installation

```bash
pip install grpc_feature_client
```

## Dependencies

This package depends on:

## Features

## Quick Start

```python
from grpc_feature_client import GRPCFeatureClient, GRPCClientConfig

# Configure for real-time operations
config = GRPCClientConfig(
    server_address="localhost:50051",
    job_id="realtime-service",
    job_token="api-token"
)

client = GRPCFeatureClient(config)

# Direct API operations
result = client.persist_features(entity_label, keys_schema, feature_groups, data)
features = client.retrieve_decoded_features(entity_label, feature_groups, keys, entity_keys)
```

## API Reference

### GRPCFeatureClient

```python
class GRPCFeatureClient:
    def __init__(self, config: GRPCClientConfig)

    def persist_features(
        self,
        entity_label: str,
        keys_schema: List[str],
        feature_group_schemas: List[Dict[str, Any]],
        data_rows: List[Dict[str, Any]],
        timeout: Optional[float] = None
    ) -> Dict[str, Any]

    def retrieve_features(
        self,
        entity_label: str,
        feature_groups: List[Dict[str, Any]],
        keys_schema: List[str],
        entity_keys: List[List[str]],
        timeout: Optional[float] = None
    ) -> Dict[str, Any]

    def retrieve_decoded_features(
        self,
        entity_label: str,
        feature_groups: List[Dict[str, Any]],
        keys_schema: List[str],
        entity_keys: List[List[str]],
        timeout: Optional[float] = None
    ) -> Dict[str, Any]
```

### GRPCClientConfig

```python
class GRPCClientConfig:
    def __init__(
        self,
        server_address: str,
        job_id: str,
        job_token: str,
        use_tls: bool = False,
        timeout_seconds: float = 30.0,
        metadata: Dict[str, str] = None,
        max_receive_message_length: int = 4 * 1024 * 1024,
        max_send_message_length: int = 4 * 1024 * 1024
    )
```
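For production use, the defaults above can be overridden. A configuration sketch assuming the constructor documented here; the endpoint, job, and token values are placeholders:

```python
from grpc_feature_client import GRPCFeatureClient, GRPCClientConfig

# Production-style configuration: TLS on, tighter deadline, larger responses.
# Parameter names come from the GRPCClientConfig constructor above; the
# server address, job_id, and job_token values are placeholders.
config = GRPCClientConfig(
    server_address="feature-store.example.com:50051",
    job_id="ranking-service",
    job_token="api-token",
    use_tls=True,                                 # encrypt traffic in transit
    timeout_seconds=5.0,                          # fail fast for online inference
    metadata={"x-env": "prod"},                   # extra headers sent per call
    max_receive_message_length=16 * 1024 * 1024,  # allow larger feature batches
)

client = GRPCFeatureClient(config)
```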

## Usage Examples

### Persisting Features

```python
from grpc_feature_client import GRPCFeatureClient, GRPCClientConfig

config = GRPCClientConfig(
    server_address="feature-store.example.com:50051",
    job_id="predator-service",
    job_token="api-token"
)

client = GRPCFeatureClient(config)

# Persist real-time features
result = client.persist_features(
    entity_label="user_interaction",
    keys_schema=["user_id", "session_id"],
    feature_group_schemas=[{
        "label": "realtime_features",
        "feature_labels": ["click_count", "page_views"]
    }],
    data_rows=[{
        "user_id": "u123",
        "session_id": "s456",
        "click_count": 5,
        "page_views": 3
    }]
)

print(f"Persist result: {result}")
```
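Since the schemas and rows passed to `persist_features` are plain dictionaries, a small helper can assemble them from raw rows. `build_persist_payload` below is a hypothetical convenience, not part of the package:

```python
from typing import Any, Dict, List

def build_persist_payload(
    group_label: str,
    keys_schema: List[str],
    rows: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Assemble the schema and data_rows arguments for persist_features.

    Hypothetical helper: feature labels are inferred as every column in the
    rows that is not a key column.
    """
    feature_labels = sorted(
        {col for row in rows for col in row if col not in keys_schema}
    )
    return {
        "feature_group_schemas": [
            {"label": group_label, "feature_labels": feature_labels}
        ],
        "data_rows": rows,
    }

payload = build_persist_payload(
    group_label="realtime_features",
    keys_schema=["user_id", "session_id"],
    rows=[{"user_id": "u123", "session_id": "s456",
           "click_count": 5, "page_views": 3}],
)
print(payload["feature_group_schemas"])
```

The resulting dictionary can then be splatted into the call, e.g. `client.persist_features(entity_label="user_interaction", keys_schema=[...], **payload)`.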

### Retrieving Features

```python
# Retrieve features for ML model inference
features = client.retrieve_decoded_features(
    entity_label="user_interaction",
    feature_groups=[{
        "label": "user_features",
        "feature_labels": ["age", "location"]
    }],
    keys_schema=["user_id"],
    entity_keys=[["u123"], ["u456"]]
)

print(f"Retrieved features: {features}")
```

### With Context Management

```python
# Use client with automatic cleanup
with GRPCFeatureClient(config) as client:
    result = client.persist_features(...)
    features = client.retrieve_decoded_features(...)
    # Connection automatically closed
```

## When to Use

Use `grpc_feature_client` for:


# Spark Feature Push Client

                                                  Apache Spark-based client for pushing ML features from offline batch sources to the BharatML Stack Online Feature Store via Kafka. This client is designed for data pipeline operations - reading from batch sources and publishing to Kafka for online consumption.

## Installation

```bash
pip install spark_feature_push_client
```

## Dependencies

This package depends on:

## Architecture Role

```
┌─────────────────┐    ┌──────────────────────┐    ┌─────────────┐    ┌─────────────────┐
│ Batch Sources   │───▶│ Spark Feature Push   │───▶│    Kafka    │───▶│ Online Feature  │
│ • Tables        │    │ Client               │    │             │    │ Store           │
│ • Parquet       │    │ • Read & Transform   │    │             │    │                 │
│ • Delta         │    │ • Protobuf Serialize │    │             │    │                 │
│ • S3/GCS/ADLS   │    │ • Batch Processing   │    │             │    │                 │
└─────────────────┘    └──────────────────────┘    └─────────────┘    └─────────────────┘

                                                   ┌─────────────────┐
                                                   │ grpc_feature_   │
                                                   │ client          │
                                                   │ (Real-time)     │
                                                   └─────────────────┘
```
                                                  +
                                                  ┌─────────────────┐    ┌──────────────────────┐    ┌─────────────┐    ┌─────────────────┐
                                                  │ Batch Sources │───▶│ Spark Feature Push │───▶│ Kafka │───▶│ Online Feature │
                                                  │ • Tables │ │ Client │ │ │ │ Store │
                                                  │ • Parquet │ │ • Read & Transform │ │ │ │ │
                                                  │ • Delta │ │ • Protobuf Serialize │ │ │ │ │
                                                  │ • S3/GCS/ADLS │ │ • Batch Processing │ │ │ │ │
                                                  └─────────────────┘ └──────────────────────┘ └─────────────┘ └─────────────────┘


                                                  ┌─────────────────┐
                                                  │ grpc_feature_ │
                                                  │ client │
                                                  │ (Real-time) │
                                                  └─────────────────┘

## Features

- **Batch Source Integration**: Read from Tables (Hive/Delta), Parquet, and Delta files on cloud storage

## When to Use

- 💨 **Single Records**: Persisting individual feature records

## Quick Start

```python
from spark_feature_push_client import OnlineFeatureStorePyClient

# Initialize client with metadata source
client = OnlineFeatureStorePyClient(
    features_metadata_source_url="https://api.example.com/metadata",
    job_id="feature-pipeline-job",
    job_token="your-auth-token"
)

# Get feature configuration
feature_details = client.get_features_details()

# Process your Spark DataFrame
proto_df = client.generate_df_with_protobuf_messages(your_spark_df)

# Push to Kafka
client.write_protobuf_df_to_kafka(
    proto_df,
    kafka_bootstrap_servers="localhost:9092",
    kafka_topic="features.user_features"
)
```

                                                  This package is part of the BharatML Stack ecosystem:


## Prerequisites

- **Java 8/11**: Required by Spark
- **bharatml_common**: For protobuf schemas

```python
# Example Spark session setup
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("FeaturePipeline") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0") \
    .getOrCreate()
```

## Supported Data Sources

### 1. Database Tables

```python
# Hive/Delta tables
df = spark.sql("SELECT * FROM feature_db.user_features")
```

### 2. Cloud Storage - Parquet

```python
# AWS S3
df = spark.read.parquet("s3a://bucket/path/to/features/")

# Google Cloud Storage
df = spark.read.parquet("gs://bucket/path/to/features/")

# Azure Data Lake
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/path/")
```

### 3. Cloud Storage - Delta

```python
# Delta format on cloud storage
df = spark.read.format("delta").load("s3a://bucket/delta-table/")
```

## Configuration Examples

### Basic Pipeline

```python
from pyspark.sql import SparkSession
from spark_feature_push_client import OnlineFeatureStorePyClient

# Create Spark session
spark = SparkSession.builder \
    .appName("FeatureETL") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0") \
    .getOrCreate()

# Initialize client
client = OnlineFeatureStorePyClient(
    features_metadata_source_url="https://metadata-service.example.com/api/v1/features",
    job_id="daily-feature-pipeline",
    job_token="pipeline-secret-token",
    fgs_to_consider=["user_demographics", "user_behavior"]  # Optional: filter feature groups
)

# Get metadata and column mappings
(
    offline_src_type_columns,
    offline_col_to_default_values_map,
    entity_column_names
) = client.get_features_details()

print(f"Entity columns: {entity_column_names}")
print(f"Feature mappings: {offline_src_type_columns}")
```

### Reading from Multiple Sources

```python
def get_features_from_all_sources(spark, entity_columns, feature_mapping, default_values):
    """
    Read and combine features from multiple offline sources.
    """
    dataframes = []

    for source_info in feature_mapping:
        table_name, source_type, feature_list = source_info

        if source_type == "TABLE":
            # Read from Hive/Delta table
            df = spark.table(table_name)
        elif source_type.startswith("PARQUET_"):
            # Read from Parquet files
            df = spark.read.parquet(table_name)
        elif source_type.startswith("DELTA_"):
            # Read from Delta files
            df = spark.read.format("delta").load(table_name)

        # Select and rename columns
        select_cols = entity_columns.copy()
        for original_col, renamed_col in feature_list:
            if original_col in df.columns:
                df = df.withColumnRenamed(original_col, renamed_col)
                select_cols.append(renamed_col)

        df = df.select(select_cols)
        dataframes.append(df)

    # Union all dataframes
    if dataframes:
        combined_df = dataframes[0]
        for df in dataframes[1:]:
            combined_df = combined_df.unionByName(df, allowMissingColumns=True)

        # Fill missing values with defaults
        for col, default_val in default_values.items():
            if col in combined_df.columns:
                combined_df = combined_df.fillna({col: default_val})

        return combined_df

    return None


# Use the function
df = get_features_from_all_sources(
    spark,
    entity_column_names,
    offline_src_type_columns,
    offline_col_to_default_values_map
)
```
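The union-and-fill step above can be mimicked in plain Python to see its semantics without a Spark cluster. This is an illustrative sketch only: `union_with_defaults` is a hypothetical helper, not part of the package, standing in for `unionByName(allowMissingColumns=True)` followed by `fillna`.

```python
def union_with_defaults(rows_a, rows_b, defaults):
    """Union rows that may have different columns (like unionByName with
    allowMissingColumns=True), then fill missing values from defaults."""
    all_cols = sorted(set().union(*[r.keys() for r in rows_a + rows_b]))
    combined = []
    for row in rows_a + rows_b:
        # Missing columns fall back to the configured default (or None)
        combined.append({c: row.get(c, defaults.get(c)) for c in all_cols})
    return combined

a = [{"user_id": "u1", "age": 25}]
b = [{"user_id": "u2", "tier": "basic"}]
print(union_with_defaults(a, b, {"age": 0, "tier": "unknown"}))
```

A row missing `tier` gets `"unknown"`, and a row missing `age` gets `0`, mirroring how `offline_col_to_default_values_map` backfills sparse sources.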

## Protobuf Serialization & Kafka Publishing

```python
# Convert DataFrame to protobuf messages
# This creates binary protobuf messages suitable for Kafka
proto_df = client.generate_df_with_protobuf_messages(
    df,
    intra_batch_size=20  # Batch size for serialization
)

# The proto_df has schema: [value: binary, intra_batch_id: long]
proto_df.printSchema()
# root
#  |-- value: binary (nullable = false)
#  |-- intra_batch_id: long (nullable = false)

# Write to Kafka with batching for better throughput
client.write_protobuf_df_to_kafka(
    proto_df,
    kafka_bootstrap_servers="broker1:9092,broker2:9092,broker3:9092",
    kafka_topic="features.user_features",
    additional_options={
        "kafka.acks": "all",
        "kafka.retries": "3",
        "kafka.compression.type": "snappy"
    },
    kafka_num_batches=4  # Split into 4 parallel Kafka writes
)
```
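The `intra_batch_size` parameter groups rows into serialization batches, and each resulting message carries an `intra_batch_id`. A minimal pure-Python sketch of that grouping (illustrative only; `assign_batch_ids` is a hypothetical helper, and the real client does this on Spark executors):

```python
def assign_batch_ids(rows, intra_batch_size):
    """Tag each row with the batch it belongs to, mirroring the
    [value, intra_batch_id] schema produced by the client."""
    return [(row, i // intra_batch_size) for i, row in enumerate(rows)]

rows = ["user1", "user2", "user3", "user4", "user5"]
print(assign_batch_ids(rows, intra_batch_size=2))
# → [('user1', 0), ('user2', 0), ('user3', 1), ('user4', 1), ('user5', 2)]
```

Larger `intra_batch_size` values mean fewer, bigger Kafka messages; `kafka_num_batches` then controls how many parallel writes those messages are split across.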

## Data Type Handling

The client automatically handles protobuf data type mappings.

### Scalar Types

```python
# Example DataFrame with different types
data = [
    ("user123", 25, 185.5, True, "premium"),  # string, int, float, bool, string
    ("user456", 30, 170.0, False, "basic")
]
df = spark.createDataFrame(data, ["user_id", "age", "height", "is_premium", "tier"])

# Automatically mapped to protobuf:
#   age        -> int32_values
#   height     -> fp32_values
#   is_premium -> bool_values
#   tier       -> string_values
```
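The mapping above can be sketched as a lookup from Python value types to protobuf value fields. This is illustrative only: `proto_field_for` is a hypothetical helper, and the real client works from Spark SQL column types rather than Python values.

```python
def proto_field_for(value):
    """Map a Python value to the protobuf value field it would land in.
    bool must be checked before int, since bool is a subclass of int."""
    if isinstance(value, bool):
        return "bool_values"
    if isinstance(value, int):
        return "int32_values"
    if isinstance(value, float):
        return "fp32_values"
    if isinstance(value, str):
        return "string_values"
    raise TypeError(f"Unsupported type: {type(value).__name__}")

print([proto_field_for(v) for v in (25, 185.5, True, "premium")])
# → ['int32_values', 'fp32_values', 'bool_values', 'string_values']
```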

### Vector Types

```python
# Example with vector/array features
df = spark.createDataFrame([
    ("user123", [0.1, 0.2, 0.3], ["tech", "sports"], [1, 2, 3])
], ["user_id", "embeddings", "interests", "scores"])

# Automatically mapped to protobuf vectors:
#   embeddings -> fp32_values in Vector
#   interests  -> string_values in Vector
#   scores     -> int32_values in Vector
```

## Production Pipeline Example

```python
import os

from pyspark.sql import SparkSession
from spark_feature_push_client import OnlineFeatureStorePyClient


def run_feature_pipeline():
    """
    Complete feature pipeline from batch sources to Kafka.
    """

    # 1. Initialize Spark
    spark = SparkSession.builder \
        .appName("DailyFeaturePipeline") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0") \
        .getOrCreate()

    try:
        # 2. Initialize feature client
        client = OnlineFeatureStorePyClient(
            features_metadata_source_url=os.getenv("METADATA_URL"),
            job_id=os.getenv("JOB_ID"),
            job_token=os.getenv("JOB_TOKEN")
        )

        # 3. Get feature configuration
        feature_mapping, default_values, entity_columns = client.get_features_details()

        # 4. Read and process data
        df = get_features_from_all_sources(spark, entity_columns, feature_mapping, default_values)

        if df is None or df.count() == 0:
            raise ValueError("No data found in sources")

        # 5. Convert to protobuf
        proto_df = client.generate_df_with_protobuf_messages(df, intra_batch_size=50)

        # 6. Publish to Kafka
        client.write_protobuf_df_to_kafka(
            proto_df,
            kafka_bootstrap_servers=os.getenv("KAFKA_BROKERS"),
            kafka_topic=os.getenv("KAFKA_TOPIC"),
            additional_options={
                "kafka.acks": "all",
                "kafka.compression.type": "snappy",
                "kafka.max.request.size": "10485760"  # 10 MB
            },
            kafka_num_batches=int(os.getenv("KAFKA_BATCHES", "4"))
        )

        print(f"✅ Successfully processed {df.count()} records")

    finally:
        spark.stop()


if __name__ == "__main__":
    run_feature_pipeline()
```

                                                  # 4. Read and process data
                                                  df = get_features_from_all_sources(spark, entity_columns, feature_mapping, default_values)

                                                  if df is None or df.count() == 0:
                                                  raise ValueError("No data found in sources")

                                                  # 5. Convert to protobuf
                                                  proto_df = client.generate_df_with_protobuf_messages(df, intra_batch_size=50)

                                                  # 6. Publish to Kafka
                                                  client.write_protobuf_df_to_kafka(
                                                  proto_df,
                                                  kafka_bootstrap_servers=os.getenv("KAFKA_BROKERS"),
                                                  kafka_topic=os.getenv("KAFKA_TOPIC"),
                                                  additional_options={
                                                  "kafka.acks": "all",
                                                  "kafka.compression.type": "snappy",
                                                  "kafka.max.request.size": "10485760" # 10MB
                                                  },
                                                  kafka_num_batches=int(os.getenv("KAFKA_BATCHES", "4"))
                                                  )

                                                  print(f"✅ Successfully processed {df.count()} records")

                                                  finally:
                                                  spark.stop()

                                                  if __name__ == "__main__":
                                                  run_feature_pipeline()

## Configuration Options

### Client Configuration

```python
client = OnlineFeatureStorePyClient(
    features_metadata_source_url="https://api.example.com/metadata",  # Required
    job_id="pipeline-job-001",                                        # Required
    job_token="secret-token-123",                                     # Required
    fgs_to_consider=["user_features", "item_features"]                # Optional: filter feature groups
)
```

### Protobuf Serialization Options

```python
proto_df = client.generate_df_with_protobuf_messages(
    df,
    intra_batch_size=20  # Records per protobuf message (default: 20)
)
```
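Because records are packed `intra_batch_size` per message, the protobuf DataFrame should contain roughly `ceil(records / intra_batch_size)` rows — useful for sanity-checking output volume. The helper below is illustrative arithmetic, not part of the SDK:

```python
import math

def expected_message_count(num_records: int, intra_batch_size: int = 20) -> int:
    """Approximate number of protobuf messages produced when packing
    intra_batch_size records per message."""
    if intra_batch_size <= 0:
        raise ValueError("intra_batch_size must be positive")
    return math.ceil(num_records / intra_batch_size)

print(expected_message_count(1_000_000, intra_batch_size=50))  # 20000
```

Comparing this estimate against `proto_df.count()` is a quick way to catch silent row loss during serialization.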

### Kafka Publishing Options

```python
client.write_protobuf_df_to_kafka(
    proto_df,
    kafka_bootstrap_servers="localhost:9092",
    kafka_topic="features.topic",
    additional_options={
        "kafka.acks": "all",                  # Acknowledgment level
        "kafka.retries": "3",                 # Retry attempts
        "kafka.compression.type": "snappy",   # Compression
        "kafka.batch.size": "16384",          # Batch size
        "kafka.linger.ms": "100",             # Batching delay
        "kafka.max.request.size": "10485760"  # Max message size
    },
    kafka_num_batches=1  # Number of parallel Kafka writers (default: 1)
)
```

## Performance Tuning

### Spark Optimizations

```python
spark = SparkSession.builder \
    .appName("FeaturePipeline") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.adaptive.skewJoin.enabled", "true") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()
```

### Memory Management

```python
# For large datasets, consider:
df = df.repartition(200)  # Optimal partition count
df.cache()                # Cache if reused multiple times
```
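What counts as an "optimal" partition count depends on input size. A common rule of thumb (our assumption, not an SDK recommendation) is to target roughly 128 MB per partition:

```python
def suggest_partitions(total_bytes: int,
                       target_partition_bytes: int = 128 * 1024 * 1024,
                       minimum: int = 8) -> int:
    """Heuristic: one partition per ~128 MB of input, with a small floor."""
    return max(minimum, -(-total_bytes // target_partition_bytes))  # ceiling division

print(suggest_partitions(25 * 1024**3))  # ~25 GB of input -> 200 partitions
```

For a ~25 GB daily snapshot this heuristic lands on the `repartition(200)` used above.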

### Kafka Throughput

```python
# For high-throughput scenarios:
client.write_protobuf_df_to_kafka(
    proto_df,
    kafka_bootstrap_servers="brokers",
    kafka_topic="topic",
    kafka_num_batches=8,  # Increase parallel writers
    additional_options={
        "kafka.batch.size": "65536",     # Larger batches
        "kafka.linger.ms": "100",        # Allow batching delay
        "kafka.compression.type": "lz4"  # Fast compression
    }
)
```

## Monitoring & Debugging

### DataFrame Inspection

```python
# Check data before processing
print(f"Records: {df.count()}")
print(f"Columns: {df.columns}")
df.printSchema()
df.show(5)

# Check protobuf output
proto_df.show(5, truncate=False)
print(f"Protobuf messages: {proto_df.count()}")
```

### Error Handling

```python
try:
    proto_df = client.generate_df_with_protobuf_messages(df)
    client.write_protobuf_df_to_kafka(proto_df, brokers, topic)

except Exception as e:
    print(f"Pipeline failed: {e}")
    # Log to monitoring system
    # Send alerts
    raise
```

## Integration with Other SDKs

### With gRPC Feature Client

```python
# Spark client pushes features to Kafka
spark_client = OnlineFeatureStorePyClient(...)
spark_client.write_protobuf_df_to_kafka(proto_df, brokers, topic)

# gRPC client retrieves features in real-time
from grpc_feature_client import GRPCFeatureClient
grpc_client = GRPCFeatureClient(config)
features = grpc_client.retrieve_decoded_features(...)
```

### With HTTP Feature Client (bharatml_common)

```python
# Use HTTP client for metadata validation
from bharatml_common import HTTPFeatureClient
http_client = HTTPFeatureClient(base_url, job_id, token)
metadata = http_client.get_feature_metadata()

# Validate feature names using shared utilities
from bharatml_common import clean_column_name
clean_features = [clean_column_name(name) for name in feature_names]

# Process with Spark client
spark_client.generate_df_with_protobuf_messages(df)
```

## Common Use Cases

### 1. Daily Batch ETL

```bash
# Cron job: 0 2 * * * (daily at 2 AM)
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0 \
  --conf spark.sql.adaptive.enabled=true \
  daily_feature_pipeline.py
```

### 2. Historical Backfill

```python
# Backfill last 30 days
from datetime import datetime, timedelta

for i in range(30):
    date = datetime.now() - timedelta(days=i)
    df = spark.sql(f"""
        SELECT * FROM features
        WHERE date = '{date.strftime('%Y-%m-%d')}'
    """)

    proto_df = client.generate_df_with_protobuf_messages(df)
    client.write_protobuf_df_to_kafka(proto_df, brokers, f"backfill.{date.strftime('%Y%m%d')}")
```

### 3. Real-time Streaming (Advanced)

```python
# Read from streaming source, process, and publish
streaming_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", input_brokers) \
    .option("subscribe", input_topic) \
    .load()

# Process streaming DataFrame
processed_df = streaming_df.select(...)

# Write to output Kafka (structured streaming; the Kafka sink requires a checkpoint location)
query = processed_df.writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", output_brokers) \
    .option("topic", output_topic) \
    .option("checkpointLocation", checkpoint_dir) \
    .start()
```

## Troubleshooting

### Common Issues

1. **OutOfMemoryError**

   ```python
   # Increase driver memory or reduce partition size
   spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "50")
   ```

2. **Kafka Connection Timeout**

   ```python
   # Check network connectivity and broker addresses
   additional_options = {
       "kafka.request.timeout.ms": "60000",
       "kafka.session.timeout.ms": "30000"
   }
   ```

3. **Protobuf Serialization Errors**

   ```python
   # Check data types and null values
   df = df.fillna({"string_col": "", "numeric_col": 0})
   ```

4. **Metadata API Errors**

   ```python
   # Verify job_id, job_token, and URL
   # Check API server logs
   ```
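For metadata API errors in particular, a pre-flight check that fails fast on missing configuration is easier to debug than an opaque API error mid-job. The helper below is a sketch, not part of the SDK; the variable names mirror the pipeline example:

```python
import os

REQUIRED_ENV_VARS = ("METADATA_URL", "JOB_ID", "JOB_TOKEN", "KAFKA_BROKERS", "KAFKA_TOPIC")

def missing_env_vars(required=REQUIRED_ENV_VARS, env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

def assert_env_ready():
    """Raise early, with a readable message, before starting the Spark job."""
    missing = missing_env_vars()
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```

Calling `assert_env_ready()` at the top of the pipeline turns a cryptic metadata failure into an actionable message.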

### Debug Mode

```python
import logging
logging.basicConfig(level=logging.DEBUG)

# Enable Spark SQL logging
spark.sparkContext.setLogLevel("INFO")
```

## Migration from Legacy Clients

If migrating from older versions:

```python
# Old import
# from online_feature_store_py_client import OnlineFeatureStorePyClient

# New import (same interface)
from spark_feature_push_client import OnlineFeatureStorePyClient

# API remains the same - no code changes needed!
```

## Best Practices

1. **Resource Management**: Always stop Spark sessions
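The resource-management practice can be enforced with a small context manager, so every pipeline gets the `finally: spark.stop()` behavior for free. This wrapper is our sketch, not part of the SDK; it works with any builder whose session exposes `stop()`:

```python
from contextlib import contextmanager

@contextmanager
def managed_session(builder):
    """Create a session from `builder` and guarantee stop() on exit,
    even if the pipeline body raises."""
    session = builder.getOrCreate()
    try:
        yield session
    finally:
        session.stop()

# With Spark this would be used as:
# with managed_session(SparkSession.builder.appName("FeaturePipeline")) as spark:
#     run_pipeline(spark)
```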

## License

BharatMLStack is licensed under the Business Source License 1.1.


Built with ❤️ for the ML community from Meesho

If you find this useful, ⭐️ the repo — your support means the world to us!
0.5https://meesho.github.io/BharatMLStack/category/trufflebox-uiweekly0.5https://meesho.github.io/BharatMLStack/inferflow/v1.0.0weekly0.5https://meesho.github.io/BharatMLStack/inferflow/v1.0.0/architectureweekly0.5https://meesho.github.io/BharatMLStack/inferflow/v1.0.0/configurationweekly0.5https://meesho.github.io/BharatMLStack/inferflow/v1.0.0/functionalitiesweekly0.5https://meesho.github.io/BharatMLStack/inferflow/v1.0.0/release-notesweekly0.5https://meesho.github.io/BharatMLStack/introweekly0.5https://meesho.github.io/BharatMLStack/numerix/v1.0.0weekly0.5https://meesho.github.io/BharatMLStack/numerix/v1.0.0/architectureweekly0.5https://meesho.github.io/BharatMLStack/numerix/v1.0.0/benchmarksweekly0.5https://meesho.github.io/BharatMLStack/numerix/v1.0.0/functionalitiesweekly0.5https://meesho.github.io/BharatMLStack/numerix/v1.0.0/release-notesweekly0.5https://meesho.github.io/BharatMLStack/online-feature-store/v1.0.0weekly0.5https://meesho.github.io/BharatMLStack/online-feature-store/v1.0.0/architectureweekly0.5https://meesho.github.io/BharatMLStack/online-feature-store/v1.0.0/benchmarksweekly0.5https://meesho.github.io/BharatMLStack/online-feature-store/v1.0.0/data-formatsweekly0.5https://meesho.github.io/BharatMLStack/online-feature-store/v1.0.0/functionalitiesweekly0.5https://meesho.github.io/BharatMLStack/online-feature-store/v1.0.0/release-notesweekly0.5https://meesho.github.io/BharatMLStack/predator/v1.0.0weekly0.5https://meesho.github.io/BharatMLStack/predator/v1.0.0/architectureweekly0.5https://meesho.github.io/BharatMLStack/predator/v1.0.0/functionalitiesweekly0.5https://meesho.github.io/BharatMLStack/predator/v1.0.0/release-notesweekly0.5https://meesho.github.io/BharatMLStack/quick-start/v1.0.0weekly0.5https://meesho.github.io/BharatMLStack/quick-start/v1.0.0/quick-startweekly0.5https://meesho.github.io/BharatMLStack/sdks/go/v1.0.0weekly0.5https://meesho.github.io/BharatMLStack/sdks/go/v1.0.0/feature_clientweekly0.5https://meesho.github.io/BharatMLStack
/sdks/python/v1.0.0weekly0.5https://meesho.github.io/BharatMLStack/sdks/python/v1.0.0/grpc_feature_clientweekly0.5https://meesho.github.io/BharatMLStack/sdks/python/v1.0.0/spark_feature_push_clientweekly0.5https://meesho.github.io/BharatMLStack/skye/v1.0.0weekly0.5https://meesho.github.io/BharatMLStack/skye/v1.0.0/architectureweekly0.5https://meesho.github.io/BharatMLStack/skye/v1.0.0/functionalitiesweekly0.5https://meesho.github.io/BharatMLStack/skye/v1.0.0/release-notesweekly0.5https://meesho.github.io/BharatMLStack/trufflebox-ui/v1.0.0weekly0.5https://meesho.github.io/BharatMLStack/trufflebox-ui/v1.0.0/userguideweekly0.5 \ No newline at end of file diff --git a/docs/skye/v1.0.0/architecture/index.html b/docs/skye/v1.0.0/architecture/index.html new file mode 100644 index 00000000..41dfa713 --- /dev/null +++ b/docs/skye/v1.0.0/architecture/index.html @@ -0,0 +1,145 @@ + + + + + +Architecture | BharatMLStack + + + + + + + + +

# Skye - Vector Similarity Search Platform

Skye is BharatMLStack's vector similarity search platform that enables fast semantic retrieval by representing data as vectors and querying nearest matches in high-dimensional space. It is composed of three runnable components: skye-admin, skye-consumers, and skye-serving.

## System Overview

*Skye System Architecture (diagram)*

Skye provides a critical platform for managing data aggregation, model onboarding, and embedding support at production scale. The architecture is designed around three core pillars:

- **Pluggable Vector Databases**: Support for multiple vector database backends (Qdrant, extensible to others) via a generic abstraction layer.
- **Tenant-Level Index Isolation with Shared Embeddings**: Models are stored once but can serve multiple tenants (variants), reducing data redundancy.
- **Event-Driven Administration**: Model lifecycle management is handled through Kafka-based event flows for resilience and fault tolerance.

## Component Architecture

| Component | Role |
|-----------|------|
| skye-serving | Handles real-time similarity search queries with in-memory caching and vector DB lookups |
| skye-consumers | Processes embedding ingestion (reset/delta jobs) and real-time aggregation events from Kafka |
| skye-admin | Manages model lifecycle, onboarding, variant registration, and coordinates Databricks jobs |

## Data Model

### Model and Variant Hierarchy

Skye uses a model-first hierarchy rather than a tenant-first approach. Models sit at the base level, with variants (formerly tenants) nested within each model. This eliminates embedding duplication across tenants.

```
model (e.g., intent_model)
├── model_config (distance_function, vector_dimension, etc.)
├── embedding_store (shared embeddings for all variants)
├── variant_1 (e.g., organic)
│   ├── vss_filter (criteria for index inclusion)
│   ├── vectordb_type (QDRANT, etc.)
│   ├── vectordb_config (host, port, replication, sharding)
│   ├── read_version / write_version
│   └── job_frequency (FREQ_1D, FREQ_3H, etc.)
└── variant_2 (e.g., ad)
    ├── vss_filter
    ├── vectordb_type
    └── ...
```

**Key benefit**: If a model consumes 30M embeddings and is used by two variants, the embeddings are stored once (30M) instead of duplicated (60M).
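The storage arithmetic can be made concrete with a small sketch (function names are illustrative, not Skye's actual code) contrasting the model-first layout with tenant-first duplication:

```python
def storage_tenant_first(embedding_count: int, tenant_count: int) -> int:
    # Tenant-first layout: every tenant keeps its own copy of the embeddings.
    return embedding_count * tenant_count

def storage_model_first(embedding_count: int, variant_count: int) -> int:
    # Model-first layout: one shared embedding_store serves all variants,
    # so storage does not grow with the number of variants.
    return embedding_count

# A 30M-embedding model with two variants (e.g., organic and ad):
shared = storage_model_first(30_000_000, 2)       # stored once: 30M
duplicated = storage_tenant_first(30_000_000, 2)  # duplicated: 60M
```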

### Entity-Based Data Split

Data is split at the entity level (catalog, product, user) into separate tables for both embeddings and aggregator data.

**Embedding tables (per entity):**

```sql
CREATE TABLE catalog_embeddings (
    model_name text,
    version int,
    id text,
    embedding frozen<list<decimal>>,
    search_embedding frozen<list<decimal>>,
    to_be_indexed_variant_1 boolean,
    to_be_indexed_variant_2 boolean,
    PRIMARY KEY ((model_name, version), id)
);
```

**Aggregator tables (per entity):**

```sql
CREATE TABLE catalog_aggregator (
    id text,
    is_live_ad text,
    out_of_stock text,
    PRIMARY KEY (id)
);
```

Each entity is mapped via a store configuration:

```json
{
  "db_conf_id": "1",
  "embeddings_table": "catalog_embeddings",
  "aggregator_table": "catalog_aggregator"
}
```

## Serving Flow

The serving path is optimized for low latency with multiple caching layers:

1. A request arrives at skye-serving via gRPC.
2. ConfigRepo resolves the model configuration, variant filters, and vector DB connection.
3. The in-memory cache is checked first to reduce load on the distributed cache.
4. The distributed cache (Redis) is checked next for cached similarity results.
5. If both caches miss, the vector DB query executes, using the search_indexed_only flag to search only within indexed space.
6. Aggregator data is fetched from ScyllaDB to apply variant-level filters.
7. The response returns ranked similar candidates with scores.
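The steps above can be sketched as a tiered lookup. This is an illustrative sketch, not Skye's actual code: the Redis, vector DB, and aggregator clients are stand-in stubs, and the out-of-stock filter is just an example of a variant-level filter.

```python
class ServingPath:
    """Sketch of the serving flow: in-memory cache -> Redis -> vector DB,
    then variant-level filtering over aggregator data."""

    def __init__(self, redis, vector_db, aggregator):
        self.local = {}               # step 3: in-memory cache
        self.redis = redis            # step 4: distributed cache
        self.vector_db = vector_db    # step 5: vector DB
        self.aggregator = aggregator  # step 6: ScyllaDB aggregator data

    def similar(self, model, variant, query_id, top_k=10):
        key = f"{model}:{variant}:{query_id}:{top_k}"
        if key in self.local:
            return self.local[key]
        cached = self.redis.get(key)
        if cached is None:
            # Cache miss: search only the indexed space (search_indexed_only).
            candidates = self.vector_db.search(model, query_id, top_k,
                                               search_indexed_only=True)
            rows = self.aggregator.fetch([c["id"] for c in candidates])
            # Example variant-level filter: drop out-of-stock candidates.
            cached = [c for c in candidates
                      if rows.get(c["id"], {}).get("out_of_stock") != "true"]
            self.redis.set(key, cached)
        self.local[key] = cached
        return cached


class DictRedis:
    """Minimal in-process stand-in for a Redis client."""
    def __init__(self):
        self.store = {}
    def get(self, k):
        return self.store.get(k)
    def set(self, k, v):
        self.store[k] = v
```

On a repeat query the in-memory cache answers directly, so neither Redis nor the vector DB is touched.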

### Configuration Bootstrap

On startup, ConfigRepo creates:

- A map of each model to its configuration (embedding table, vector DB channel)
- A map of each entity to its aggregator table

```json
{
  "intent_model": {
    "db_conf_id": "1",
    "index_embedding_table": "catalog_embeddings",
    "vector_db_grpc_channel": "<grpc_channel_info>"
  }
}
```
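A minimal sketch of how these two maps could be assembled at startup by joining model registrations against store configs (field names follow the examples above; the join logic itself is an assumption):

```python
# Store configs keyed by store_id, as in the entity mapping shown earlier.
stores = {
    1: {"db_conf_id": "1",
        "embeddings_table": "catalog_embeddings",
        "aggregator_table": "catalog_aggregator"},
}

# Registered models; grpc_channel is a placeholder, as in the example above.
models = [
    {"model_name": "intent_model", "entity": "catalog",
     "store_id": 1, "grpc_channel": "<grpc_channel_info>"},
]

model_map, entity_aggregators = {}, {}
for m in models:
    store = stores[m["store_id"]]
    # Map 1: model name -> embedding table + vector DB channel.
    model_map[m["model_name"]] = {
        "db_conf_id": store["db_conf_id"],
        "index_embedding_table": store["embeddings_table"],
        "vector_db_grpc_channel": m["grpc_channel"],
    }
    # Map 2: entity -> aggregator table.
    entity_aggregators[m["entity"]] = store["aggregator_table"]
```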

## Admin Flows

Skye uses an event-driven approach for model lifecycle management:

- All admin operations are processed asynchronously through Kafka consumers
- A SQL database behind the admin stores all model states
- Pod termination does not affect in-progress operations (events are re-consumed on failure)
- Databricks jobs are triggered and monitored via the admin API

### API Contracts

#### Register Model

```
POST /register-model
```

```json
{
  "entity": "catalog",
  "ingestion_column_mapping": "{\"id_column\":\"id\",\"embedding_column\":\"features\",\"to_be_indexed_column\":\"to_be_indexed\"}",
  "embedding_store_enabled": true,
  "embedding_store_ttl": 604800,
  "mq_id": 804,
  "model_config": "{\"distance_function\":\"DOT\",\"vector_dimension\":32}",
  "store_id": 1,
  "training_data_path": "gcs_path"
}
```
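Note that `ingestion_column_mapping` and `model_config` are JSON documents embedded as strings, so a client has to encode the inner documents separately before building the outer payload. A sketch of that double encoding:

```python
import json

# Inner documents, taken from the register-model example above.
column_mapping = {"id_column": "id",
                  "embedding_column": "features",
                  "to_be_indexed_column": "to_be_indexed"}
model_config = {"distance_function": "DOT", "vector_dimension": 32}

payload = {
    "entity": "catalog",
    "ingestion_column_mapping": json.dumps(column_mapping),  # string, not object
    "embedding_store_enabled": True,
    "embedding_store_ttl": 604800,  # 7 days, in seconds
    "mq_id": 804,
    "model_config": json.dumps(model_config),                # string, not object
    "store_id": 1,
    "training_data_path": "gcs_path",
}

# The receiver decodes the inner documents back out of the strings.
decoded = json.loads(payload["model_config"])
```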

#### Register Variant

```
POST /register-variant
```

```json
{
  "entity": "catalog",
  "model_name": "intent_model",
  "vss_filter": "{...filter criteria...}",
  "vectordb_type": "QDRANT",
  "vectordb_config": "{...connection config...}",
  "job_frequency": "FREQ_1D"
}
```

#### Reset Model

```
POST /reset-model
```

```json
{
  "entity": "catalog",
  "model_name": "intent_model",
  "frequency": "FREQ_1D"
}
```

The response includes variant version mappings, the MQ ID, and the training data path for the Databricks job.

#### Trigger Model Machine

```
POST /trigger-model-machine
```

```json
{
  "entity": "catalog",
  "model_name": "intent_model",
  "variant": "organic"
}
```

#### Promote Model / Variant to Scale-Up Cluster

```
POST /promote-model
POST /promote-variant
```

Used to transition successful experiments from experiment clusters to production clusters.

## Consumer Flows

*Skye Real-Time Consumer Flow (diagram)*

### Reset/Delta Ingestion

Embedding ingestion occurs once per model and executes in parallel for each variant. The Kafka event contract supports:

- **Multiple variants per event**: A single embedding event specifies which variants should index the data
- **Separate search and index embeddings**: Models can have different embeddings for the search space vs. the index space
- **EOF handling**: EOF is sent to all partitions to ensure all data is consumed before completion

```json
{
  "entity": "catalog",
  "model_name": "intent_model",
  "candidate_id": "48869419",
  "version": "1",
  "index_space": {
    "variants_version_map": "{'organic':1,'ad':2}",
    "embedding": [0.036, -0.048, ...],
    "variants_index_map": "{'organic':true,'ad':false}",
    "operation": "A",
    "payload": "{'sscat_id':700}"
  },
  "search_space": {
    "embedding": [0.036, -0.048, ...]
  }
}
```
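Note that `variants_version_map` and `variants_index_map` arrive as stringified maps that use single quotes and JSON-style booleans, so a plain `json.loads` rejects them. A consumer-side parsing sketch, assuming that quoting convention holds (the function name and the normalization trick are illustrative, not Skye's actual code):

```python
import ast

def parse_variant_map(raw: str) -> dict:
    # Maps arrive as strings like "{'organic':1,'ad':2}". Single quotes are
    # not valid JSON, and "true"/"false" are not Python literals, so normalize
    # the booleans and evaluate as a Python literal. (This naive replace would
    # corrupt keys containing "true"/"false"; real code should be stricter.)
    normalized = raw.replace("true", "True").replace("false", "False")
    return ast.literal_eval(normalized)

versions = parse_variant_map("{'organic':1,'ad':2}")
flags = parse_variant_map("{'organic':true,'ad':false}")
# Only variants flagged in variants_index_map should index this candidate.
to_index = [variant for variant, flag in flags.items() if flag]
```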

### Real-Time Consumers

A generic Kafka schema is used for all real-time consumers, simplifying new integrations:

```json
{
  "timestamp": 1719308350,
  "entity_label": "catalog",
  "data": [
    {
      "id": "125138466",
      "label": "is_live_ad",
      "value": "true"
    }
  ]
}
```

### Retry Topic

Failed ingestion events are published to a retry topic for reprocessing, ensuring no data loss:

```json
{
  "timestamp": 1719308350,
  "entity_label": "catalog",
  "model_name": "intent_model",
  "variant": "organic",
  "data": [
    {
      "id": "125138466",
      "label": "is_live_ad",
      "value": "true"
    }
  ]
}
```
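The retry payload is the generic consumer event enriched with the model/variant context it failed for. A republishing sketch (the function name is illustrative):

```python
def to_retry_event(event: dict, model_name: str, variant: str) -> dict:
    # Copy the generic consumer event and attach the failing model/variant so
    # the retry consumer can replay it against exactly that index.
    retry = dict(event)
    retry["model_name"] = model_name
    retry["variant"] = variant
    return retry

failed = {"timestamp": 1719308350, "entity_label": "catalog",
          "data": [{"id": "125138466", "label": "is_live_ad", "value": "true"}]}
retry_event = to_retry_event(failed, "intent_model", "organic")
```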

## Key Design Decisions

### Pluggable Vector Database Support

Skye introduces a generic vector_db_type configuration and converts vendor-specific configs to a generic vector_config, enabling support for multiple vector database backends beyond Qdrant.
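One way such an abstraction layer could look, sketched below. `VectorDB`, `QdrantBackend`, and the toy exact scan are illustrative, not Skye's actual interfaces; a real backend would talk to Qdrant over gRPC and use an ANN index.

```python
from abc import ABC, abstractmethod

class VectorDB(ABC):
    """Generic backend interface: callers depend on this, not a vendor client."""
    @abstractmethod
    def upsert(self, collection: str, ids: list, vectors: list) -> None: ...
    @abstractmethod
    def search(self, collection: str, query: list, top_k: int) -> list: ...

class QdrantBackend(VectorDB):
    """Stand-in for a Qdrant-backed implementation; stores vectors in a dict
    and scans exactly, purely for illustration."""
    def __init__(self, vector_config: dict):
        self.config, self.data = vector_config, {}
    def upsert(self, collection, ids, vectors):
        self.data.setdefault(collection, {}).update(zip(ids, vectors))
    def search(self, collection, query, top_k):
        items = self.data.get(collection, {})
        dot = lambda v: sum(a * b for a, b in zip(v, query))
        # Rank by dot product, matching the DOT distance_function example.
        return sorted(items, key=lambda i: -dot(items[i]))[:top_k]

# vectordb_type from the variant config selects the backend at runtime.
BACKENDS = {"QDRANT": QdrantBackend}

def make_vector_db(vectordb_type: str, vector_config: dict) -> VectorDB:
    return BACKENDS[vectordb_type](vector_config)
```

Adding a new backend then means registering one more entry in the backend map rather than touching serving or consumer code.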

### Variant-Based Model Sharing

By eliminating the tenant-based construct and introducing variants, Skye allows:

- Models to be shared across tenants without duplication
- Each variant to have its own filter criteria, vector DB config, and job frequency
- Independent read/write version tracking per variant

### ScyllaDB for Real-Time Aggregation

Delta Lake was replaced with self-hosted ScyllaDB for cost efficiency. The aggregator is entity-generic (not model/version-specific), since all real-time data is consistent across models.

### Event-Driven State Management

Model state transitions are handled via Kafka events with a SQL database as the backing store. This eliminates:

- Single points of failure in admin/ingestion flows
- Models getting stuck during pod restarts
- Manual intervention for consumer pause/resume

## Resiliency

| Mechanism | Description |
|-----------|-------------|
| Retry topics | Failed ingestion messages are captured in a failure topic for reprocessing |
| Circuit breakers | Applied to similarity search API calls to throttle RPS during failures |
| Snapshot backups | Periodic collection snapshots enable quick restores during downtime |
| Automated cluster setup | Scripted provisioning eliminates configuration inconsistencies |
| Databricks job retries | Lambda functions with retry mechanisms for failed ingestion jobs |
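The circuit-breaker row can be illustrated with a minimal count-based breaker. This is a sketch with illustrative thresholds and names, not Skye's implementation; production breakers typically add sliding windows and half-open probing.

```python
import time

class CircuitBreaker:
    """After max_failures consecutive failures the breaker opens and rejects
    calls for cooldown_s seconds, shedding load on the vector DB."""

    def __init__(self, max_failures=5, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures, self.cooldown_s, self.clock = max_failures, cooldown_s, clock
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                # Open: reject the similarity-search call instead of hitting
                # an already-failing backend.
                raise RuntimeError("circuit open")
            self.opened_at, self.failures = None, 0  # cooldown over, try again
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0
        return result
```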

## Scalability

- **Vector DB Scaling**: Generic scripts for adding nodes to existing clusters, enabling horizontal scaling based on load and RPS
- **Service Scaling**: Hosted on EKS with CPU-based autoscaling
- **Experiment Isolation**: Experiments run on separate EKS and vector DB clusters, reducing production cluster complexity
- **Indexed-Only Search**: The search_indexed_only flag ensures queries only search indexed space, avoiding latency from brute-force searches on partially built indexes

                                                  Observability

Metrics (per model + variant)

| Metric | Description |
|--------|-------------|
| avg_similar_candidates | Average number of similarity candidates returned |
| avg_recall | Recall score of the first similar catalog returned |
| Service Latency | P99.9 / P99 / P95 / P50 |
| Service 5xx Count | Error rate monitoring |
| Vector DB Latency | P99.9 / P99 / P95 / P50 |
| Vector DB QPS | Throughput monitoring |
| ScyllaDB Latency | P99.9 / P99 / P95 / P90 |
| Redis Latency | P99.9 / P99 / P95 / P90 |
| Redis Hit % | Cache effectiveness |

Alerts

| Alert | Threshold |
|-------|-----------|
| Indexed Vector Count | < 95% |
| Events to Failure Topic | Rate > 0 |
| Service 5xx | > 10 |
| Service Latency | Model-dependent SLA |

                                                  Technology Stack

| Component | Technology |
|-----------|------------|
| Language | Go |
| Vector Database | Qdrant (pluggable) |
| Embedding Storage | ScyllaDB |
| Real-Time Aggregation | ScyllaDB |
| Caching | Redis + In-Memory |
| Message Queue | Kafka |
| Configuration | ZooKeeper / etcd |
| Container Orchestration | Kubernetes (EKS) |
| Job Orchestration | Databricks |

                                                  Skye - Functionalities


                                                  Core Capabilities

1. Real-Time Similarity Search

Skye provides real-time nearest-neighbor search across high-dimensional vector spaces. It supports:

• Configurable distance functions: DOT product, Cosine similarity, Euclidean distance
• Configurable vector dimensions: Per-model vector dimension settings
• Indexed-only search: Queries only search within fully indexed space, avoiding brute-force fallback on partially built indexes
• Pagination support: Service-level pagination for clients, even when the underlying vector DB does not natively support it

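Each of these distance functions reduces to a few lines of vector arithmetic. A minimal Go sketch (function names and signatures are ours for illustration, not Skye's API):

```go
package main

import (
	"fmt"
	"math"
)

// Dot returns the dot-product similarity between two vectors.
func Dot(a, b []float32) float64 {
	var s float64
	for i := range a {
		s += float64(a[i]) * float64(b[i])
	}
	return s
}

// Cosine returns dot(a,b) normalized by the vector magnitudes.
func Cosine(a, b []float32) float64 {
	var na, nb float64
	for i := range a {
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return Dot(a, b) / (math.Sqrt(na) * math.Sqrt(nb))
}

// Euclidean returns the L2 distance between two vectors.
func Euclidean(a, b []float32) float64 {
	var s float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		s += d * d
	}
	return math.Sqrt(s)
}

func main() {
	a, b := []float32{1, 0}, []float32{0, 1}
	fmt.Println(Dot(a, b), Cosine(a, b), Euclidean(a, b))
}
```

Per-model configuration then amounts to choosing one of these functions plus a fixed vector dimension.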
                                                  2. Pluggable Vector Database Support

The platform is designed to be vector DB agnostic:

• Generic vector config: A vector_db_type field and generic vectordb_config replace vendor-specific configurations
• Current support: Qdrant with the official Go client
• Extensibility: New vector databases can be integrated by implementing the vector DB interface

                                                  3. Model and Variant Management

Model Registration

• Models are registered via API with entity type, embedding configuration, distance function, vector dimension, and training data path
• Each model is associated with a store ID mapping to specific embedding and aggregator tables

Variant Registration

• Variants represent different views/filters of the same model (e.g., organic, ad, commerce)
• Each variant has its own filter criteria, vector DB cluster, job frequency, and version tracking
• Variants share the same embeddings, eliminating data redundancy

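Because variants share one embedding set and differ only by their filter criteria, variant resolution is essentially "filter the shared candidate set". A toy Go sketch of that layout (all type, field, and variant names here are ours, for illustration only):

```go
package main

import "fmt"

// item carries the per-entity attributes a variant can filter on.
type item struct {
	ID   string
	IsAd bool
}

// variant is a named filter over the model's shared items.
type variant struct {
	Name   string
	Filter func(item) bool
}

// model holds one shared embedding set used by every variant.
type model struct {
	Embeddings map[string][]float32 // shared across all variants
	Items      map[string]item
	Variants   []variant
}

// candidates returns the item IDs a given variant is allowed to serve.
func (m *model) candidates(variantName string) []string {
	var out []string
	for _, v := range m.Variants {
		if v.Name != variantName {
			continue
		}
		for id, it := range m.Items {
			if v.Filter(it) {
				out = append(out, id)
			}
		}
	}
	return out
}

func main() {
	m := &model{
		Embeddings: map[string][]float32{"c1": {1, 0}, "c2": {0, 1}},
		Items:      map[string]item{"c1": {ID: "c1", IsAd: false}, "c2": {ID: "c2", IsAd: true}},
		Variants: []variant{
			{Name: "organic", Filter: func(i item) bool { return !i.IsAd }},
			{Name: "ad", Filter: func(i item) bool { return i.IsAd }},
		},
	}
	fmt.Println(len(m.candidates("organic")), len(m.candidates("ad")))
}
```

Note how adding a third variant would touch only the `Variants` slice; the embeddings are never duplicated.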
Model Promotion

• Successful experiments can be promoted from experiment clusters to production clusters via API

                                                  4. Embedding Ingestion

Batch Ingestion (Reset/Delta Jobs)

• Triggered via Databricks jobs that read from GCS paths
• Supports separate index-space and search-space embeddings
• Per-variant to_be_indexed flags control which embeddings are indexed for each variant
• EOF markers sent to all Kafka partitions ensure complete data consumption

Real-Time Ingestion

• Generic Kafka schema for all real-time consumers
• Entity-based aggregation data (e.g., is_live_ad, out_of_stock) updates in real time
• During model resets, real-time consumers continue pushing data to the latest collection (no pausing)

                                                  5. Real-Time Data Aggregation

• Entity-wise (catalog, product, user) real-time aggregation via ScyllaDB
• Generic approach: aggregator tables are entity-level, not model/version-specific
• All real-time data is consistent across models sharing the same entity

                                                  6. Intelligent Caching

• In-memory cache: First layer, reduces load on the distributed cache
• Distributed cache (Redis): Second layer for cached similarity results
• Hit rate monitoring and cache effectiveness metrics per model

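The two layers form a simple read-through hierarchy: check the in-memory cache, then the distributed cache, then compute. A toy Go sketch of that lookup order (the type and maps below are illustrative stand-ins, not Skye's implementation):

```go
package main

import "fmt"

// layeredCache checks a local in-memory map first, then a "remote" map
// standing in for Redis; remote hits are promoted into the local layer.
type layeredCache struct {
	local, remote map[string][]string // query key -> similar IDs
}

// Get returns the cached value and which layer served it; on a full miss
// it runs compute and populates both layers.
func (c *layeredCache) Get(key string, compute func() []string) ([]string, string) {
	if v, ok := c.local[key]; ok {
		return v, "memory"
	}
	if v, ok := c.remote[key]; ok {
		c.local[key] = v // promote to the in-memory layer
		return v, "redis"
	}
	v := compute()
	c.local[key], c.remote[key] = v, v
	return v, "miss"
}

func main() {
	c := &layeredCache{local: map[string][]string{}, remote: map[string][]string{}}
	compute := func() []string { return []string{"cat1", "cat2"} }
	_, l1 := c.Get("model:variant:q1", compute)
	_, l2 := c.Get("model:variant:q1", compute)
	fmt.Println(l1, l2)
}
```

The per-model hit-rate metrics above measure how often requests terminate in the first two branches rather than falling through to the similarity search.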
7. Embedding Storage

• Optional embedding storage with configurable TTL
• Enables embedding lookup APIs for downstream consumers
• Stored in ScyllaDB with efficient binary serialization

                                                  8. Retry and Fault Tolerance

• Retry topic: Failed ingestion events are published to a dedicated retry topic
• Event-driven state management: Model states persist in a SQL DB, surviving pod restarts
• Kafka-based admin: Asynchronous processing with automatic re-consumption on failure

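The retry-topic pattern can be sketched with buffered channels standing in for Kafka topics. Everything below (the event shape, the function names) is illustrative, not Skye's code:

```go
package main

import "fmt"

// event is a placeholder for an ingestion message.
type event struct{ ID string }

// process fails for one ID to simulate a bad ingestion record.
func process(e event) error {
	if e.ID == "bad" {
		return fmt.Errorf("ingest failed for %s", e.ID)
	}
	return nil
}

// consume drains the main topic; failures are re-published to the retry
// topic instead of being dropped.
func consume(events <-chan event, retry chan<- event) (ok int) {
	for e := range events {
		if err := process(e); err != nil {
			retry <- e // captured for later reprocessing
			continue
		}
		ok++
	}
	return ok
}

func main() {
	events := make(chan event, 3)
	retry := make(chan event, 3)
	for _, id := range []string{"a", "bad", "b"} {
		events <- event{ID: id}
	}
	close(events)
	fmt.Println("processed:", consume(events, retry), "retried:", len(retry))
}
```

A separate consumer on the retry topic can then replay failures with backoff, which is what makes the pipeline self-healing rather than alert-and-fix.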
                                                  9. Experiment Isolation

• Dedicated EKS cluster (skye-service-experiments) for experiments
• Dedicated vector DB cluster for experiment workloads
• Clean separation from production: experiments do not impact production performance
• Promotion path from experiment to production after load analysis

                                                  10. Centralized Cluster Management

• Automated cluster provisioning via scripts (in collaboration with DevOps)
• Consistent configurations across all clusters (eliminates consensus issues)
• Horizontal scaling support: generic scripts for adding nodes to existing clusters

                                                  Onboarding Flow

Step-by-step Process

1. Data Scientist provides a base GCS path where model embeddings will be pushed
2. Register Model via POST /register-model with entity type, column mappings, and model config
3. Register Variant(s) via POST /register-variant with filter criteria, vector DB config, and job frequency
4. Schedule Databricks Job to read data from the GCS path and ingest into the Skye platform
5. Reset Model via POST /reset-model to trigger the first full ingestion
6. Trigger Model Machine via POST /trigger-model-machine to start the indexing pipeline

                                                  Extending to New Tenants

With the variant system, extending a model to a new tenant only requires registering a new variant with the appropriate filters; no re-ingestion of embeddings is needed.


                                                  Skye - Release Notes

v1.0.0

Overview

Initial open-source release of Skye, BharatMLStack's vector similarity search platform. This release represents a complete re-architecture of the internal VSS (Vector Similarity Search) service, addressing scalability, resilience, and operational efficiency challenges from the previous generation.

What's New

Architecture

• Model-first hierarchy: Models at the base level with variants nested within, eliminating embedding duplication across tenants
• Entity-based data split: Separate embedding and aggregator tables per entity type (catalog, product, user)
• Event-driven admin flows: Kafka-based model lifecycle management with SQL-backed state persistence
• Pluggable vector DB support: Generic vector database abstraction replacing vendor-specific tight coupling

                                                  Serving

• Multi-layer caching: In-memory cache + Redis distributed cache for low-latency similarity search
• Indexed-only search: search_indexed_only flag prevents brute-force fallback on partially indexed collections
• Pagination support: Service-level pagination for clients
• Separate search/index embeddings: Models can use different embedding spaces for search and indexing

                                                  Ingestion

• Shared embeddings across variants: Single ingestion per model with parallel variant processing
• Generic RT consumer schema: Simplified onboarding for new real-time data sources
• Retry topic: Automatic capture and reprocessing of failed ingestion events
• EOF to all partitions: Ensures complete data consumption before processing completion

                                                  Operations

• API-based model onboarding: Register models and variants via REST API (replaces the manual Databricks-only flow)
• Automated cluster provisioning: Scripted setup for consistent vector DB cluster configurations
• Experiment isolation: Dedicated EKS and vector DB clusters for experiments
• Comprehensive observability: Per-model and per-variant metrics for latency, throughput, error rates, and cache effectiveness

                                                  Improvements Over Previous Architecture

| Area | Before | After |
|------|--------|-------|
| Embedding storage | Duplicated per tenant | Shared per model |
| Vector DB coupling | Tightly coupled to Qdrant | Pluggable via generic interface |
| State management | In-pod synchronous thread | Event-driven with SQL backing |
| Consumer handling | Paused during ingestion | No pausing; concurrent writes |
| Cluster setup | Manual, error-prone | Automated, consistent |
| Experiment infra | Shared with production | Isolated clusters |
| Failure recovery | Manual intervention | Retry topics + snapshots |
| Observability | Generic alerts | Model + variant level metrics |

                                                  Known Limitations

• Snapshot restore is currently supported for smaller indexes only
• Pagination is handled at the service level (not natively by the vector DB)
• Horizontal scaling of vector DB clusters requires running provisioning scripts

                                                  Technology Stack

• Language: Go
• Vector Database: Qdrant (pluggable)
• Storage: ScyllaDB
• Cache: Redis + In-Memory
• Message Queue: Kafka
• Configuration: ZooKeeper / etcd
• Orchestration: Kubernetes (EKS)

                                                  Usage Guide


                                                  This guide covers the complete setup and usage of the Online Feature Store system, including the core services (Online Feature Store and Horizon) and the TruffleBox UI for feature management.

                                                  Table of Contents


Environment Configuration

                                                    Online Feature Store Configuration

                                                    The Online Feature Store requires several environment variables to configure storage backends, caching, and service settings.

                                                    Core Application Settings

                                                    APP_ENV=prod
                                                    APP_LOG_LEVEL=DEBUG
                                                    APP_METRIC_SAMPLING_RATE=1
                                                    APP_NAME=online-feature-store
                                                    APP_PORT=8005
                                                    AUTH_TOKEN=ofs-token

                                                    Storage Configuration

                                                    ScyllaDB Storage (Primary Storage)

                                                    # Primary ScyllaDB cluster
                                                    STORAGE_SCYLLA_1_CONTACT_POINTS=localhost
                                                    STORAGE_SCYLLA_1_KEYSPACE=ofs
                                                    STORAGE_SCYLLA_1_NUM_CONNS=1
                                                    STORAGE_SCYLLA_1_PORT=9042
                                                    STORAGE_SCYLLA_1_TIMEOUT_IN_MS=300000
                                                    STORAGE_SCYLLA_1_PASSWORD=
                                                    STORAGE_SCYLLA_1_USERNAME=ofs

                                                    # Secondary ScyllaDB cluster
                                                    STORAGE_SCYLLA_5_CONTACT_POINTS=localhost
                                                    STORAGE_SCYLLA_5_KEYSPACE=onfs
                                                    STORAGE_SCYLLA_5_NUM_CONNS=1
                                                    STORAGE_SCYLLA_5_PASSWORD=
                                                    STORAGE_SCYLLA_5_PORT=9042
                                                    STORAGE_SCYLLA_5_TIMEOUT_IN_MS=300000
                                                    STORAGE_SCYLLA_5_USERNAME=

                                                    # Active ScyllaDB configurations
                                                    STORAGE_SCYLLA_ACTIVE_CONFIG_IDS=1,5

                                                    Redis Storage Configuration

                                                    Redis serves dual purposes in the Online Feature Store:


1. Storage Backend
2. Distributed Cache Layer: For improved performance and reduced latency

                                                    Redis configurations can be referenced by their IDs in Store configurations, similar to ScyllaDB. Each Redis configuration can be independently used as either a storage backend or cache layer.

                                                    # Redis Failover Configuration 1 (ID: 2)
                                                    STORAGE_REDIS_FAILOVER_2_SENTINEL_ADDRESSES=localhost:26379
                                                    STORAGE_REDIS_FAILOVER_2_DB=0
                                                    STORAGE_REDIS_FAILOVER_2_DISABLE_IDENTITY=true
                                                    STORAGE_REDIS_FAILOVER_2_MASTER_NAME=mymaster
                                                    STORAGE_REDIS_FAILOVER_2_MAX_IDLE_CONN=32
                                                    STORAGE_REDIS_FAILOVER_2_MIN_IDLE_CONN=20
                                                    STORAGE_REDIS_FAILOVER_2_MAX_ACTIVE_CONN=32
                                                    STORAGE_REDIS_FAILOVER_2_MAX_RETRY=-1
                                                    STORAGE_REDIS_FAILOVER_2_POOL_FIFO=false
                                                    STORAGE_REDIS_FAILOVER_2_READ_TIMEOUT_IN_MS=3000
                                                    STORAGE_REDIS_FAILOVER_2_WRITE_TIMEOUT_IN_MS=3000
                                                    STORAGE_REDIS_FAILOVER_2_POOL_TIMEOUT_IN_MS=3000
                                                    STORAGE_REDIS_FAILOVER_2_POOL_SIZE=32
                                                    STORAGE_REDIS_FAILOVER_2_CONN_MAX_IDLE_TIMEOUT_IN_MINUTES=15
                                                    STORAGE_REDIS_FAILOVER_2_CONN_MAX_AGE_IN_MINUTES=30

                                                    # Redis Failover Configuration 2 (ID: 4)
                                                    STORAGE_REDIS_FAILOVER_4_SENTINEL_ADDRESSES=localhost:26379
                                                    STORAGE_REDIS_FAILOVER_4_DB=0
                                                    STORAGE_REDIS_FAILOVER_4_DISABLE_IDENTITY=true
                                                    STORAGE_REDIS_FAILOVER_4_MASTER_NAME=mymaster
                                                    STORAGE_REDIS_FAILOVER_4_MAX_IDLE_CONN=32
                                                    STORAGE_REDIS_FAILOVER_4_MIN_IDLE_CONN=20
                                                    STORAGE_REDIS_FAILOVER_4_MAX_ACTIVE_CONN=32
                                                    STORAGE_REDIS_FAILOVER_4_MAX_RETRY=-1
                                                    STORAGE_REDIS_FAILOVER_4_POOL_FIFO=false
                                                    STORAGE_REDIS_FAILOVER_4_READ_TIMEOUT_IN_MS=3000
                                                    STORAGE_REDIS_FAILOVER_4_WRITE_TIMEOUT_IN_MS=3000
                                                    STORAGE_REDIS_FAILOVER_4_POOL_TIMEOUT_IN_MS=3000
                                                    STORAGE_REDIS_FAILOVER_4_POOL_SIZE=32
                                                    STORAGE_REDIS_FAILOVER_4_CONN_MAX_IDLE_TIMEOUT_IN_MINUTES=15
                                                    STORAGE_REDIS_FAILOVER_4_CONN_MAX_AGE_IN_MINUTES=30

                                                    # High-Performance Redis Configuration (ID: 6)
                                                    STORAGE_REDIS_FAILOVER_6_CONN_MAX_AGE_IN_MINUTES=-1
                                                    STORAGE_REDIS_FAILOVER_6_CONN_MAX_IDLE_TIMEOUT_IN_MINUTES=30
                                                    STORAGE_REDIS_FAILOVER_6_DB=0
                                                    STORAGE_REDIS_FAILOVER_6_DISABLE_IDENTITY=true
                                                    STORAGE_REDIS_FAILOVER_6_MASTER_NAME=mymaster
                                                    STORAGE_REDIS_FAILOVER_6_MAX_ACTIVE_CONN=202
                                                    STORAGE_REDIS_FAILOVER_6_MAX_IDLE_CONN=157
                                                    STORAGE_REDIS_FAILOVER_6_MAX_RETRY=-1
                                                    STORAGE_REDIS_FAILOVER_6_MIN_IDLE_CONN=52
                                                    STORAGE_REDIS_FAILOVER_6_PASSWORD=
                                                    STORAGE_REDIS_FAILOVER_6_POOL_FIFO=false
                                                    STORAGE_REDIS_FAILOVER_6_POOL_SIZE=202
                                                    STORAGE_REDIS_FAILOVER_6_POOL_TIMEOUT_IN_MS=2
                                                    STORAGE_REDIS_FAILOVER_6_READ_TIMEOUT_IN_MS=75
                                                    STORAGE_REDIS_FAILOVER_6_ROUTE_RANDOM=true
                                                    STORAGE_REDIS_FAILOVER_6_SENTINEL_ADDRESSES=localhost:26379
                                                    STORAGE_REDIS_FAILOVER_6_WRITE_TIMEOUT_IN_MS=300

                                                    # Active Redis configurations
                                                    STORAGE_REDIS_FAILOVER_ACTIVE_CONFIG_IDS=2,4,6

                                                    Caching Configuration

                                                    # In-Memory Cache
                                                    IN_MEM_CACHE_3_ENABLED=true
                                                    IN_MEM_CACHE_3_NAME=onfs
                                                    IN_MEM_CACHE_3_SIZE_IN_BYTES=10000000
                                                    IN_MEM_CACHE_ACTIVE_CONFIG_IDS=3

                                                    # Distributed Cache (uses Redis configurations)
                                                    # Redis configurations (IDs: 2,4,6) can be used for distributed caching
                                                    DISTRIBUTED_CACHE_CONF_IDS=2
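The layering implied by `IN_MEM_CACHE_*` plus `DISTRIBUTED_CACHE_CONF_IDS` can be sketched as a two-tier lookup — in-memory first, then the distributed (Redis) cache, then the storage backend. This is an illustrative sketch, not the actual ONFS read path:

```go
package main

import "fmt"

// tieredCache models a process-local tier backed by a distributed tier.
type tieredCache struct {
	inMem, distributed map[string][]byte
}

// Get checks tiers in order of latency and backfills on a miss.
func (c *tieredCache) Get(key string, loadFromStore func(string) []byte) []byte {
	if v, ok := c.inMem[key]; ok {
		return v // fastest: process-local hit
	}
	if v, ok := c.distributed[key]; ok {
		c.inMem[key] = v // backfill the local tier
		return v
	}
	v := loadFromStore(key) // e.g. the ScyllaDB storage backend
	c.distributed[key] = v
	c.inMem[key] = v
	return v
}

func main() {
	c := &tieredCache{inMem: map[string][]byte{}, distributed: map[string][]byte{}}
	v := c.Get("user:42", func(string) []byte { return []byte("features") })
	fmt.Println(string(v)) // prints "features"
}
```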

                                                    Service Discovery and Configuration

                                                    # ETCD Configuration for service discovery
                                                    ETCD_SERVER=0.0.0.0:2379
                                                    ETCD_WATCHER_ENABLED=true

                                                    Horizon Configuration

                                                    Horizon manages the metadata and configuration for the Online Feature Store system.

                                                    Core Application Settings

                                                    APP_NAME=horizon
                                                    APP_ENVIRONMENT=PROD
                                                    APP_ENV=production
                                                    APP_PORT=8082
                                                    APP_LOG_LEVEL=DEBUG
                                                    APP_METRIC_SAMPLING_RATE=1
                                                    APP_GC_PERCENTAGE=1

                                                    Database Configuration

                                                    # MySQL Master Configuration
                                                    MYSQL_MASTER_MAX_POOL_SIZE=5
                                                    MYSQL_MASTER_MIN_POOL_SIZE=2
                                                    MYSQL_MASTER_PASSWORD=
                                                    MYSQL_MASTER_HOST=127.0.0.1
                                                    MYSQL_MASTER_PORT=3306
                                                    MYSQL_DB_NAME=ml_config
                                                    MYSQL_MASTER_USERNAME=root

                                                    # MySQL Slave Configuration
                                                    MYSQL_SLAVE_MAX_POOL_SIZE=5
                                                    MYSQL_SLAVE_MIN_POOL_SIZE=2
                                                    MYSQL_SLAVE_PASSWORD=
                                                    MYSQL_SLAVE_HOST=127.0.0.1
                                                    MYSQL_SLAVE_USERNAME=root
                                                    MYSQL_SLAVE_PORT=3306

                                                    ScyllaDB Configuration

                                                    # ScyllaDB for Horizon
                                                    SCYLLA_1_CONTACT_POINTS=localhost
                                                    SCYLLA_1_KEYSPACE=onfs
                                                    SCYLLA_1_NUM_CONNS=1
                                                    SCYLLA_1_PORT=9042
                                                    SCYLLA_1_TIMEOUT_IN_MS=300000
                                                    SCYLLA_1_PASSWORD=
                                                    SCYLLA_1_USERNAME=
                                                    SCYLLA_ACTIVE_CONFIG_IDS=1

                                                    Service Integration

                                                    # ETCD Configuration
                                                    ETCD_WATCHER_ENABLED=true
                                                    ETCD_SERVER=localhost:2379

                                                    # Integration with Online Feature Store
                                                    ONLINE_FEATURE_STORE_APP_NAME=online-feature-store

                                                    Key Constructs

                                                    Understanding these key constructs is essential for effectively using the Online Feature Store:


                                                    Job

                                                    Configuration Hierarchy

                                                    The system uses a hierarchical configuration approach:

Store   →   Entity   →   Feature Group   →   Feature
  ↓           ↓               ↓                 ↓
Config     Identity      Collection        Individual
 Level       Level          Level             Level

                                                    This hierarchy allows for:

                                                    • Inheritance: Lower levels inherit settings from higher levels
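The inheritance rule — lower levels inherit settings from higher levels unless they override them — can be sketched as a resolution walk down the hierarchy. Field names here are illustrative, not the actual ONFS schema:

```go
package main

import "fmt"

// level is one node in the Store → Entity → Feature Group → Feature chain.
type level struct {
	name string
	ttl  *int // nil = inherit the setting from the level above
}

// resolveTTL walks the chain top-down; the last level that sets a value wins.
func resolveTTL(chain []level) (int, string) {
	ttl, from := 0, "default"
	for _, l := range chain {
		if l.ttl != nil {
			ttl, from = *l.ttl, l.name
		}
	}
	return ttl, from
}

func main() {
	storeTTL, fgTTL := 3600, 600
	chain := []level{
		{"store", &storeTTL},      // config level sets a default
		{"entity", nil},           // inherits 3600 from store
		{"feature-group", &fgTTL}, // collection level overrides to 600
		{"feature", nil},          // inherits 600
	}
	ttl, from := resolveTTL(chain)
	fmt.Println(ttl, from) // prints "600 feature-group"
}
```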

License

BharatMLStack is licensed under the Business Source License 1.1.


                                                      Built with ❤️ for the ML community from Meesho
                                                      If you find this useful, ⭐️ the repo — your support means the world to us!

                                                  \ No newline at end of file diff --git a/go-sdk/VERSION b/go-sdk/VERSION index 0408c30b..8b3a0227 100644 --- a/go-sdk/VERSION +++ b/go-sdk/VERSION @@ -1 +1 @@ -v1.2.0 \ No newline at end of file +v1.3.0 \ No newline at end of file diff --git a/go-sdk/pkg/clients/inferflow/README.md b/go-sdk/pkg/clients/inferflow/README.md new file mode 100644 index 00000000..ef82948f --- /dev/null +++ b/go-sdk/pkg/clients/inferflow/README.md @@ -0,0 +1,150 @@ +# Inferflow Client + +A Go client library for interacting with the Inferflow Predict service, supporting PointWise, PairWise, and SlateWise inference APIs. + +## Features + +- **Three Inference Patterns**: PointWise (per-target scoring), PairWise (pair-level ranking), SlateWise (group-level scoring) +- **gRPC Communication**: Efficient binary protocol via gRPC +- **Authentication**: Built-in caller ID and token-based auth via metadata +- **Configurable Timeouts**: Per-client deadline configuration +- **Singleton Pattern**: Thread-safe client initialization with `sync.Once` + +## Quick Start + +### 1. Configuration + +Set environment variables with the `INFERFLOW_CLIENT_V1_` prefix: + +```bash +export INFERFLOW_CLIENT_V1_HOST=inferflow.svc +export INFERFLOW_CLIENT_V1_PORT=8080 +export INFERFLOW_CLIENT_V1_DEADLINE_MS=500 +export INFERFLOW_CLIENT_V1_PLAINTEXT=true +export INFERFLOW_CLIENT_V1_AUTH_TOKEN=your-token +export APP_NAME=my-service +``` + +### 2. Initialize Client + +```go +import "github.com/Meesho/BharatMLStack/go-sdk/pkg/clients/inferflow" + +// From environment variables +client := inferflow.GetInferflowClient(1) + +// Or from explicit config +client := inferflow.GetInferflowClientFromConfig(1, inferflow.ClientConfig{ + Host: "inferflow.svc", + Port: "8080", + DeadlineExceedMS: 500, + PlainText: true, + AuthToken: "your-token", +}, "my-service") +``` + +### 3. PointWise Inference + +Score each target independently against context features. 
+ +**Use cases:** CTR prediction, fraud scoring, relevance ranking. + +```go +import grpc "github.com/Meesho/BharatMLStack/go-sdk/pkg/clients/inferflow/client/grpc" + +resp, err := client.InferPointWise(&grpc.PointWiseRequest{ + ModelConfigId: "ranking_model_v1", + TrackingId: "req-123", + TenantId: "tenant-1", + TargetInputSchema: []*grpc.FeatureSchema{ + {Name: "price", DataType: grpc.DataType_DataTypeFP32}, + }, + Targets: []*grpc.Target{ + {Id: "product-1", FeatureValues: [][]byte{priceBytes}}, + {Id: "product-2", FeatureValues: [][]byte{priceBytes}}, + }, + ContextFeatures: []*grpc.ContextFeature{ + {Name: "user_segment", Value: segmentBytes, DataType: grpc.DataType_DataTypeString}, + }, +}) +// resp.TargetScores contains per-target scores +``` + +### 4. PairWise Inference + +Score pairs of targets relative to each other. + +**Use cases:** Preference learning, comparison-based ranking. + +```go +resp, err := client.InferPairWise(&grpc.PairWiseRequest{ + ModelConfigId: "pairwise_model_v1", + TrackingId: "req-456", + TenantId: "tenant-1", + TargetInputSchema: targetSchema, + PairInputSchema: pairSchema, + Targets: targets, + Pairs: []*grpc.TargetPair{ + {FirstTargetIndex: 0, SecondTargetIndex: 1, FeatureValues: pairFeatures}, + }, +}) +// resp.PairScores contains per-pair scores +// resp.TargetScores contains optional per-target scores +``` + +### 5. SlateWise Inference + +Score groups (slates) of targets together, capturing inter-item effects. + +**Use cases:** Whole-page optimization, slate-level reranking, diversity-aware scoring. 
+ +```go +resp, err := client.InferSlateWise(&grpc.SlateWiseRequest{ + ModelConfigId: "slate_model_v1", + TrackingId: "req-789", + TenantId: "tenant-1", + TargetInputSchema: targetSchema, + SlateInputSchema: slateSchema, + Targets: targets, + Slates: []*grpc.TargetSlate{ + {TargetIndices: []int32{0, 1, 2}, FeatureValues: slateFeatures}, + }, +}) +// resp.SlateScores contains per-slate scores +// resp.TargetScores contains optional per-target scores +``` + +## Configuration Options + +| Option | Env Var Suffix | Type | Description | Default | +|--------|---------------|------|-------------|---------| +| `Host` | `HOST` | string | Inferflow service hostname | Required | +| `Port` | `PORT` | string | Inferflow service port | `8080` | +| `DeadlineExceedMS` | `DEADLINE_MS` | int | Request timeout (ms) | `200` | +| `PlainText` | `PLAINTEXT` | bool | Use plaintext connection | `true` | +| `AuthToken` | `AUTH_TOKEN` | string | Authentication token | `""` | + +## API Reference + +### InferflowClient Interface + +```go +type InferflowClient interface { + InferPointWise(request *grpc.PointWiseRequest) (*grpc.PointWiseResponse, error) + InferPairWise(request *grpc.PairWiseRequest) (*grpc.PairWiseResponse, error) + InferSlateWise(request *grpc.SlateWiseRequest) (*grpc.SlateWiseResponse, error) +} +``` + +## Testing + +```bash +go test -v ./pkg/clients/inferflow/... +``` + +## Dependencies + +- `google.golang.org/grpc` — gRPC framework +- `google.golang.org/protobuf` — Protocol Buffers +- `github.com/rs/zerolog` — Structured logging +- `github.com/spf13/viper` — Configuration management diff --git a/go-sdk/pkg/clients/inferflow/client/grpc/predict.pb.go b/go-sdk/pkg/clients/inferflow/client/grpc/predict.pb.go new file mode 100644 index 00000000..1a10406e --- /dev/null +++ b/go-sdk/pkg/clients/inferflow/client/grpc/predict.pb.go @@ -0,0 +1,1318 @@ +// Code generated by protoc-gen-go. DO NOT EDIT. 
+// versions: +// protoc-gen-go v1.36.11 +// protoc v5.28.3 +// source: predict.proto + +package grpc + +import ( + protoreflect "google.golang.org/protobuf/reflect/protoreflect" + protoimpl "google.golang.org/protobuf/runtime/protoimpl" + reflect "reflect" + sync "sync" + unsafe "unsafe" +) + +const ( + // Verify that this generated code is sufficiently up-to-date. + _ = protoimpl.EnforceVersion(20 - protoimpl.MinVersion) + // Verify that runtime/protoimpl is sufficiently up-to-date. + _ = protoimpl.EnforceVersion(protoimpl.MaxVersion - 20) +) + +type DataType int32 + +const ( + DataType_DataTypeUnknown DataType = 0 + DataType_DataTypeFP8E5M2 DataType = 1 + DataType_DataTypeFP8E4M3 DataType = 2 + DataType_DataTypeFP16 DataType = 3 + DataType_DataTypeFP32 DataType = 4 + DataType_DataTypeFP64 DataType = 5 + DataType_DataTypeInt8 DataType = 6 + DataType_DataTypeInt16 DataType = 7 + DataType_DataTypeInt32 DataType = 8 + DataType_DataTypeInt64 DataType = 9 + DataType_DataTypeUint8 DataType = 10 + DataType_DataTypeUint16 DataType = 11 + DataType_DataTypeUint32 DataType = 12 + DataType_DataTypeUint64 DataType = 13 + DataType_DataTypeString DataType = 14 + DataType_DataTypeBool DataType = 15 + DataType_DataTypeFP8E5M2Vector DataType = 16 + DataType_DataTypeFP8E4M3Vector DataType = 17 + DataType_DataTypeFP16Vector DataType = 18 + DataType_DataTypeFP32Vector DataType = 19 + DataType_DataTypeFP64Vector DataType = 20 + DataType_DataTypeInt8Vector DataType = 21 + DataType_DataTypeInt16Vector DataType = 22 + DataType_DataTypeInt32Vector DataType = 23 + DataType_DataTypeInt64Vector DataType = 24 + DataType_DataTypeUint8Vector DataType = 25 + DataType_DataTypeUint16Vector DataType = 26 + DataType_DataTypeUint32Vector DataType = 27 + DataType_DataTypeUint64Vector DataType = 28 + DataType_DataTypeStringVector DataType = 29 + DataType_DataTypeBoolVector DataType = 30 +) + +// Enum value maps for DataType. 
+var ( + DataType_name = map[int32]string{ + 0: "DataTypeUnknown", + 1: "DataTypeFP8E5M2", + 2: "DataTypeFP8E4M3", + 3: "DataTypeFP16", + 4: "DataTypeFP32", + 5: "DataTypeFP64", + 6: "DataTypeInt8", + 7: "DataTypeInt16", + 8: "DataTypeInt32", + 9: "DataTypeInt64", + 10: "DataTypeUint8", + 11: "DataTypeUint16", + 12: "DataTypeUint32", + 13: "DataTypeUint64", + 14: "DataTypeString", + 15: "DataTypeBool", + 16: "DataTypeFP8E5M2Vector", + 17: "DataTypeFP8E4M3Vector", + 18: "DataTypeFP16Vector", + 19: "DataTypeFP32Vector", + 20: "DataTypeFP64Vector", + 21: "DataTypeInt8Vector", + 22: "DataTypeInt16Vector", + 23: "DataTypeInt32Vector", + 24: "DataTypeInt64Vector", + 25: "DataTypeUint8Vector", + 26: "DataTypeUint16Vector", + 27: "DataTypeUint32Vector", + 28: "DataTypeUint64Vector", + 29: "DataTypeStringVector", + 30: "DataTypeBoolVector", + } + DataType_value = map[string]int32{ + "DataTypeUnknown": 0, + "DataTypeFP8E5M2": 1, + "DataTypeFP8E4M3": 2, + "DataTypeFP16": 3, + "DataTypeFP32": 4, + "DataTypeFP64": 5, + "DataTypeInt8": 6, + "DataTypeInt16": 7, + "DataTypeInt32": 8, + "DataTypeInt64": 9, + "DataTypeUint8": 10, + "DataTypeUint16": 11, + "DataTypeUint32": 12, + "DataTypeUint64": 13, + "DataTypeString": 14, + "DataTypeBool": 15, + "DataTypeFP8E5M2Vector": 16, + "DataTypeFP8E4M3Vector": 17, + "DataTypeFP16Vector": 18, + "DataTypeFP32Vector": 19, + "DataTypeFP64Vector": 20, + "DataTypeInt8Vector": 21, + "DataTypeInt16Vector": 22, + "DataTypeInt32Vector": 23, + "DataTypeInt64Vector": 24, + "DataTypeUint8Vector": 25, + "DataTypeUint16Vector": 26, + "DataTypeUint32Vector": 27, + "DataTypeUint64Vector": 28, + "DataTypeStringVector": 29, + "DataTypeBoolVector": 30, + } +) + +func (x DataType) Enum() *DataType { + p := new(DataType) + *p = x + return p +} + +func (x DataType) String() string { + return protoimpl.X.EnumStringOf(x.Descriptor(), protoreflect.EnumNumber(x)) +} + +func (DataType) Descriptor() protoreflect.EnumDescriptor { + return 
file_predict_proto_enumTypes[0].Descriptor() +} + +func (DataType) Type() protoreflect.EnumType { + return &file_predict_proto_enumTypes[0] +} + +func (x DataType) Number() protoreflect.EnumNumber { + return protoreflect.EnumNumber(x) +} + +// Deprecated: Use DataType.Descriptor instead. +func (DataType) EnumDescriptor() ([]byte, []int) { + return file_predict_proto_rawDescGZIP(), []int{0} +} + +// Schema definition for a feature column +type FeatureSchema struct { + state protoimpl.MessageState `protogen:"open.v1"` + Name string `protobuf:"bytes,1,opt,name=name,proto3" json:"name,omitempty"` + DataType DataType `protobuf:"varint,2,opt,name=data_type,json=dataType,proto3,enum=DataType" json:"data_type,omitempty"` + VectorDim int32 `protobuf:"varint,3,opt,name=vector_dim,json=vectorDim,proto3" json:"vector_dim,omitempty"` // 0 = scalar, >0 = fixed vector length + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *FeatureSchema) Reset() { + *x = FeatureSchema{} + mi := &file_predict_proto_msgTypes[0] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *FeatureSchema) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*FeatureSchema) ProtoMessage() {} + +func (x *FeatureSchema) ProtoReflect() protoreflect.Message { + mi := &file_predict_proto_msgTypes[0] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use FeatureSchema.ProtoReflect.Descriptor instead. 
+func (*FeatureSchema) Descriptor() ([]byte, []int) { + return file_predict_proto_rawDescGZIP(), []int{0} +} + +func (x *FeatureSchema) GetName() string { + if x != nil { + return x.Name + } + return "" +} + +func (x *FeatureSchema) GetDataType() DataType { + if x != nil { + return x.DataType + } + return DataType_DataTypeUnknown +} + +func (x *FeatureSchema) GetVectorDim() int32 { + if x != nil { + return x.VectorDim + } + return 0 +} + +// A request-level context feature (user, session, device, etc.) +type ContextFeature struct { + state protoimpl.MessageState `protogen:"open.v1"` + Name string `protobuf:"bytes,1,opt,name=name,proto3" json:"name,omitempty"` + Value []byte `protobuf:"bytes,2,opt,name=value,proto3" json:"value,omitempty"` + DataType DataType `protobuf:"varint,3,opt,name=data_type,json=dataType,proto3,enum=DataType" json:"data_type,omitempty"` + VectorDim int32 `protobuf:"varint,4,opt,name=vector_dim,json=vectorDim,proto3" json:"vector_dim,omitempty"` // 0 = scalar, >0 = fixed vector length + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *ContextFeature) Reset() { + *x = ContextFeature{} + mi := &file_predict_proto_msgTypes[1] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *ContextFeature) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*ContextFeature) ProtoMessage() {} + +func (x *ContextFeature) ProtoReflect() protoreflect.Message { + mi := &file_predict_proto_msgTypes[1] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use ContextFeature.ProtoReflect.Descriptor instead. 
+func (*ContextFeature) Descriptor() ([]byte, []int) { + return file_predict_proto_rawDescGZIP(), []int{1} +} + +func (x *ContextFeature) GetName() string { + if x != nil { + return x.Name + } + return "" +} + +func (x *ContextFeature) GetValue() []byte { + if x != nil { + return x.Value + } + return nil +} + +func (x *ContextFeature) GetDataType() DataType { + if x != nil { + return x.DataType + } + return DataType_DataTypeUnknown +} + +func (x *ContextFeature) GetVectorDim() int32 { + if x != nil { + return x.VectorDim + } + return 0 +} + +// A single entity to be scored/ranked +type Target struct { + state protoimpl.MessageState `protogen:"open.v1"` + Id string `protobuf:"bytes,1,opt,name=id,proto3" json:"id,omitempty"` + FeatureValues [][]byte `protobuf:"bytes,2,rep,name=feature_values,json=featureValues,proto3" json:"feature_values,omitempty"` // aligned with target_input_schema + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *Target) Reset() { + *x = Target{} + mi := &file_predict_proto_msgTypes[2] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *Target) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*Target) ProtoMessage() {} + +func (x *Target) ProtoReflect() protoreflect.Message { + mi := &file_predict_proto_msgTypes[2] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use Target.ProtoReflect.Descriptor instead. 
+func (*Target) Descriptor() ([]byte, []int) { + return file_predict_proto_rawDescGZIP(), []int{2} +} + +func (x *Target) GetId() string { + if x != nil { + return x.Id + } + return "" +} + +func (x *Target) GetFeatureValues() [][]byte { + if x != nil { + return x.FeatureValues + } + return nil +} + +type TargetScore struct { + state protoimpl.MessageState `protogen:"open.v1"` + Error string `protobuf:"bytes,1,opt,name=error,proto3" json:"error,omitempty"` + OutputValues [][]byte `protobuf:"bytes,2,rep,name=output_values,json=outputValues,proto3" json:"output_values,omitempty"` // aligned with target_output_schema + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *TargetScore) Reset() { + *x = TargetScore{} + mi := &file_predict_proto_msgTypes[3] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *TargetScore) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*TargetScore) ProtoMessage() {} + +func (x *TargetScore) ProtoReflect() protoreflect.Message { + mi := &file_predict_proto_msgTypes[3] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use TargetScore.ProtoReflect.Descriptor instead. 
+func (*TargetScore) Descriptor() ([]byte, []int) { + return file_predict_proto_rawDescGZIP(), []int{3} +} + +func (x *TargetScore) GetError() string { + if x != nil { + return x.Error + } + return "" +} + +func (x *TargetScore) GetOutputValues() [][]byte { + if x != nil { + return x.OutputValues + } + return nil +} + +type PointWiseRequest struct { + state protoimpl.MessageState `protogen:"open.v1"` + ModelConfigId string `protobuf:"bytes,1,opt,name=model_config_id,json=modelConfigId,proto3" json:"model_config_id,omitempty"` + TrackingId string `protobuf:"bytes,2,opt,name=tracking_id,json=trackingId,proto3" json:"tracking_id,omitempty"` + ContextFeatures []*ContextFeature `protobuf:"bytes,3,rep,name=context_features,json=contextFeatures,proto3" json:"context_features,omitempty"` + TargetInputSchema []*FeatureSchema `protobuf:"bytes,4,rep,name=target_input_schema,json=targetInputSchema,proto3" json:"target_input_schema,omitempty"` + Targets []*Target `protobuf:"bytes,5,rep,name=targets,proto3" json:"targets,omitempty"` + TenantId string `protobuf:"bytes,6,opt,name=tenant_id,json=tenantId,proto3" json:"tenant_id,omitempty"` + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *PointWiseRequest) Reset() { + *x = PointWiseRequest{} + mi := &file_predict_proto_msgTypes[4] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *PointWiseRequest) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*PointWiseRequest) ProtoMessage() {} + +func (x *PointWiseRequest) ProtoReflect() protoreflect.Message { + mi := &file_predict_proto_msgTypes[4] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use PointWiseRequest.ProtoReflect.Descriptor instead. 
+func (*PointWiseRequest) Descriptor() ([]byte, []int) { + return file_predict_proto_rawDescGZIP(), []int{4} +} + +func (x *PointWiseRequest) GetModelConfigId() string { + if x != nil { + return x.ModelConfigId + } + return "" +} + +func (x *PointWiseRequest) GetTrackingId() string { + if x != nil { + return x.TrackingId + } + return "" +} + +func (x *PointWiseRequest) GetContextFeatures() []*ContextFeature { + if x != nil { + return x.ContextFeatures + } + return nil +} + +func (x *PointWiseRequest) GetTargetInputSchema() []*FeatureSchema { + if x != nil { + return x.TargetInputSchema + } + return nil +} + +func (x *PointWiseRequest) GetTargets() []*Target { + if x != nil { + return x.Targets + } + return nil +} + +func (x *PointWiseRequest) GetTenantId() string { + if x != nil { + return x.TenantId + } + return "" +} + +type PointWiseResponse struct { + state protoimpl.MessageState `protogen:"open.v1"` + TargetOutputSchema []*FeatureSchema `protobuf:"bytes,1,rep,name=target_output_schema,json=targetOutputSchema,proto3" json:"target_output_schema,omitempty"` + TargetScores []*TargetScore `protobuf:"bytes,2,rep,name=target_scores,json=targetScores,proto3" json:"target_scores,omitempty"` + RequestError string `protobuf:"bytes,3,opt,name=request_error,json=requestError,proto3" json:"request_error,omitempty"` + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *PointWiseResponse) Reset() { + *x = PointWiseResponse{} + mi := &file_predict_proto_msgTypes[5] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *PointWiseResponse) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*PointWiseResponse) ProtoMessage() {} + +func (x *PointWiseResponse) ProtoReflect() protoreflect.Message { + mi := &file_predict_proto_msgTypes[5] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == nil { + ms.StoreMessageInfo(mi) + } + return ms + } 
+ return mi.MessageOf(x) +} + +// Deprecated: Use PointWiseResponse.ProtoReflect.Descriptor instead. +func (*PointWiseResponse) Descriptor() ([]byte, []int) { + return file_predict_proto_rawDescGZIP(), []int{5} +} + +func (x *PointWiseResponse) GetTargetOutputSchema() []*FeatureSchema { + if x != nil { + return x.TargetOutputSchema + } + return nil +} + +func (x *PointWiseResponse) GetTargetScores() []*TargetScore { + if x != nil { + return x.TargetScores + } + return nil +} + +func (x *PointWiseResponse) GetRequestError() string { + if x != nil { + return x.RequestError + } + return "" +} + +type TargetPair struct { + state protoimpl.MessageState `protogen:"open.v1"` + FirstTargetIndex int32 `protobuf:"varint,1,opt,name=first_target_index,json=firstTargetIndex,proto3" json:"first_target_index,omitempty"` + SecondTargetIndex int32 `protobuf:"varint,2,opt,name=second_target_index,json=secondTargetIndex,proto3" json:"second_target_index,omitempty"` + FeatureValues [][]byte `protobuf:"bytes,3,rep,name=feature_values,json=featureValues,proto3" json:"feature_values,omitempty"` // aligned with pair_input_schema + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *TargetPair) Reset() { + *x = TargetPair{} + mi := &file_predict_proto_msgTypes[6] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *TargetPair) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*TargetPair) ProtoMessage() {} + +func (x *TargetPair) ProtoReflect() protoreflect.Message { + mi := &file_predict_proto_msgTypes[6] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use TargetPair.ProtoReflect.Descriptor instead. 
+func (*TargetPair) Descriptor() ([]byte, []int) { + return file_predict_proto_rawDescGZIP(), []int{6} +} + +func (x *TargetPair) GetFirstTargetIndex() int32 { + if x != nil { + return x.FirstTargetIndex + } + return 0 +} + +func (x *TargetPair) GetSecondTargetIndex() int32 { + if x != nil { + return x.SecondTargetIndex + } + return 0 +} + +func (x *TargetPair) GetFeatureValues() [][]byte { + if x != nil { + return x.FeatureValues + } + return nil +} + +type PairWiseRequest struct { + state protoimpl.MessageState `protogen:"open.v1"` + ModelConfigId string `protobuf:"bytes,1,opt,name=model_config_id,json=modelConfigId,proto3" json:"model_config_id,omitempty"` + TrackingId string `protobuf:"bytes,2,opt,name=tracking_id,json=trackingId,proto3" json:"tracking_id,omitempty"` + ContextFeatures []*ContextFeature `protobuf:"bytes,3,rep,name=context_features,json=contextFeatures,proto3" json:"context_features,omitempty"` + TargetInputSchema []*FeatureSchema `protobuf:"bytes,4,rep,name=target_input_schema,json=targetInputSchema,proto3" json:"target_input_schema,omitempty"` + PairInputSchema []*FeatureSchema `protobuf:"bytes,5,rep,name=pair_input_schema,json=pairInputSchema,proto3" json:"pair_input_schema,omitempty"` + Pairs []*TargetPair `protobuf:"bytes,6,rep,name=pairs,proto3" json:"pairs,omitempty"` + Targets []*Target `protobuf:"bytes,7,rep,name=targets,proto3" json:"targets,omitempty"` + TenantId string `protobuf:"bytes,8,opt,name=tenant_id,json=tenantId,proto3" json:"tenant_id,omitempty"` + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *PairWiseRequest) Reset() { + *x = PairWiseRequest{} + mi := &file_predict_proto_msgTypes[7] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *PairWiseRequest) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*PairWiseRequest) ProtoMessage() {} + +func (x *PairWiseRequest) ProtoReflect() protoreflect.Message { + mi := 
&file_predict_proto_msgTypes[7] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use PairWiseRequest.ProtoReflect.Descriptor instead. +func (*PairWiseRequest) Descriptor() ([]byte, []int) { + return file_predict_proto_rawDescGZIP(), []int{7} +} + +func (x *PairWiseRequest) GetModelConfigId() string { + if x != nil { + return x.ModelConfigId + } + return "" +} + +func (x *PairWiseRequest) GetTrackingId() string { + if x != nil { + return x.TrackingId + } + return "" +} + +func (x *PairWiseRequest) GetContextFeatures() []*ContextFeature { + if x != nil { + return x.ContextFeatures + } + return nil +} + +func (x *PairWiseRequest) GetTargetInputSchema() []*FeatureSchema { + if x != nil { + return x.TargetInputSchema + } + return nil +} + +func (x *PairWiseRequest) GetPairInputSchema() []*FeatureSchema { + if x != nil { + return x.PairInputSchema + } + return nil +} + +func (x *PairWiseRequest) GetPairs() []*TargetPair { + if x != nil { + return x.Pairs + } + return nil +} + +func (x *PairWiseRequest) GetTargets() []*Target { + if x != nil { + return x.Targets + } + return nil +} + +func (x *PairWiseRequest) GetTenantId() string { + if x != nil { + return x.TenantId + } + return "" +} + +type PairScore struct { + state protoimpl.MessageState `protogen:"open.v1"` + Error string `protobuf:"bytes,1,opt,name=error,proto3" json:"error,omitempty"` + OutputValues [][]byte `protobuf:"bytes,2,rep,name=output_values,json=outputValues,proto3" json:"output_values,omitempty"` // aligned with pair_output_schema + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *PairScore) Reset() { + *x = PairScore{} + mi := &file_predict_proto_msgTypes[8] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *PairScore) String() string { + return 
protoimpl.X.MessageStringOf(x) +} + +func (*PairScore) ProtoMessage() {} + +func (x *PairScore) ProtoReflect() protoreflect.Message { + mi := &file_predict_proto_msgTypes[8] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use PairScore.ProtoReflect.Descriptor instead. +func (*PairScore) Descriptor() ([]byte, []int) { + return file_predict_proto_rawDescGZIP(), []int{8} +} + +func (x *PairScore) GetError() string { + if x != nil { + return x.Error + } + return "" +} + +func (x *PairScore) GetOutputValues() [][]byte { + if x != nil { + return x.OutputValues + } + return nil +} + +type PairWiseResponse struct { + state protoimpl.MessageState `protogen:"open.v1"` + PairScores []*PairScore `protobuf:"bytes,1,rep,name=pair_scores,json=pairScores,proto3" json:"pair_scores,omitempty"` + TargetScores []*TargetScore `protobuf:"bytes,2,rep,name=target_scores,json=targetScores,proto3" json:"target_scores,omitempty"` + TargetOutputSchema []*FeatureSchema `protobuf:"bytes,3,rep,name=target_output_schema,json=targetOutputSchema,proto3" json:"target_output_schema,omitempty"` + PairOutputSchema []*FeatureSchema `protobuf:"bytes,4,rep,name=pair_output_schema,json=pairOutputSchema,proto3" json:"pair_output_schema,omitempty"` + RequestError string `protobuf:"bytes,5,opt,name=request_error,json=requestError,proto3" json:"request_error,omitempty"` + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *PairWiseResponse) Reset() { + *x = PairWiseResponse{} + mi := &file_predict_proto_msgTypes[9] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *PairWiseResponse) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*PairWiseResponse) ProtoMessage() {} + +func (x *PairWiseResponse) ProtoReflect() protoreflect.Message { + mi := 
&file_predict_proto_msgTypes[9] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use PairWiseResponse.ProtoReflect.Descriptor instead. +func (*PairWiseResponse) Descriptor() ([]byte, []int) { + return file_predict_proto_rawDescGZIP(), []int{9} +} + +func (x *PairWiseResponse) GetPairScores() []*PairScore { + if x != nil { + return x.PairScores + } + return nil +} + +func (x *PairWiseResponse) GetTargetScores() []*TargetScore { + if x != nil { + return x.TargetScores + } + return nil +} + +func (x *PairWiseResponse) GetTargetOutputSchema() []*FeatureSchema { + if x != nil { + return x.TargetOutputSchema + } + return nil +} + +func (x *PairWiseResponse) GetPairOutputSchema() []*FeatureSchema { + if x != nil { + return x.PairOutputSchema + } + return nil +} + +func (x *PairWiseResponse) GetRequestError() string { + if x != nil { + return x.RequestError + } + return "" +} + +type TargetSlate struct { + state protoimpl.MessageState `protogen:"open.v1"` + TargetIndices []int32 `protobuf:"varint,1,rep,packed,name=target_indices,json=targetIndices,proto3" json:"target_indices,omitempty"` + FeatureValues [][]byte `protobuf:"bytes,2,rep,name=feature_values,json=featureValues,proto3" json:"feature_values,omitempty"` // aligned with slate_input_schema + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *TargetSlate) Reset() { + *x = TargetSlate{} + mi := &file_predict_proto_msgTypes[10] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *TargetSlate) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*TargetSlate) ProtoMessage() {} + +func (x *TargetSlate) ProtoReflect() protoreflect.Message { + mi := &file_predict_proto_msgTypes[10] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() 
== nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use TargetSlate.ProtoReflect.Descriptor instead. +func (*TargetSlate) Descriptor() ([]byte, []int) { + return file_predict_proto_rawDescGZIP(), []int{10} +} + +func (x *TargetSlate) GetTargetIndices() []int32 { + if x != nil { + return x.TargetIndices + } + return nil +} + +func (x *TargetSlate) GetFeatureValues() [][]byte { + if x != nil { + return x.FeatureValues + } + return nil +} + +type SlateScore struct { + state protoimpl.MessageState `protogen:"open.v1"` + Error string `protobuf:"bytes,1,opt,name=error,proto3" json:"error,omitempty"` + OutputValues [][]byte `protobuf:"bytes,2,rep,name=output_values,json=outputValues,proto3" json:"output_values,omitempty"` // aligned with slate_output_schema + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *SlateScore) Reset() { + *x = SlateScore{} + mi := &file_predict_proto_msgTypes[11] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *SlateScore) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*SlateScore) ProtoMessage() {} + +func (x *SlateScore) ProtoReflect() protoreflect.Message { + mi := &file_predict_proto_msgTypes[11] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use SlateScore.ProtoReflect.Descriptor instead. 
+func (*SlateScore) Descriptor() ([]byte, []int) { + return file_predict_proto_rawDescGZIP(), []int{11} +} + +func (x *SlateScore) GetError() string { + if x != nil { + return x.Error + } + return "" +} + +func (x *SlateScore) GetOutputValues() [][]byte { + if x != nil { + return x.OutputValues + } + return nil +} + +type SlateWiseRequest struct { + state protoimpl.MessageState `protogen:"open.v1"` + ModelConfigId string `protobuf:"bytes,1,opt,name=model_config_id,json=modelConfigId,proto3" json:"model_config_id,omitempty"` + TrackingId string `protobuf:"bytes,2,opt,name=tracking_id,json=trackingId,proto3" json:"tracking_id,omitempty"` + ContextFeatures []*ContextFeature `protobuf:"bytes,3,rep,name=context_features,json=contextFeatures,proto3" json:"context_features,omitempty"` + TargetInputSchema []*FeatureSchema `protobuf:"bytes,4,rep,name=target_input_schema,json=targetInputSchema,proto3" json:"target_input_schema,omitempty"` + SlateInputSchema []*FeatureSchema `protobuf:"bytes,5,rep,name=slate_input_schema,json=slateInputSchema,proto3" json:"slate_input_schema,omitempty"` + Slates []*TargetSlate `protobuf:"bytes,6,rep,name=slates,proto3" json:"slates,omitempty"` + Targets []*Target `protobuf:"bytes,7,rep,name=targets,proto3" json:"targets,omitempty"` + TenantId string `protobuf:"bytes,8,opt,name=tenant_id,json=tenantId,proto3" json:"tenant_id,omitempty"` + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *SlateWiseRequest) Reset() { + *x = SlateWiseRequest{} + mi := &file_predict_proto_msgTypes[12] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *SlateWiseRequest) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*SlateWiseRequest) ProtoMessage() {} + +func (x *SlateWiseRequest) ProtoReflect() protoreflect.Message { + mi := &file_predict_proto_msgTypes[12] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == 
nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use SlateWiseRequest.ProtoReflect.Descriptor instead. +func (*SlateWiseRequest) Descriptor() ([]byte, []int) { + return file_predict_proto_rawDescGZIP(), []int{12} +} + +func (x *SlateWiseRequest) GetModelConfigId() string { + if x != nil { + return x.ModelConfigId + } + return "" +} + +func (x *SlateWiseRequest) GetTrackingId() string { + if x != nil { + return x.TrackingId + } + return "" +} + +func (x *SlateWiseRequest) GetContextFeatures() []*ContextFeature { + if x != nil { + return x.ContextFeatures + } + return nil +} + +func (x *SlateWiseRequest) GetTargetInputSchema() []*FeatureSchema { + if x != nil { + return x.TargetInputSchema + } + return nil +} + +func (x *SlateWiseRequest) GetSlateInputSchema() []*FeatureSchema { + if x != nil { + return x.SlateInputSchema + } + return nil +} + +func (x *SlateWiseRequest) GetSlates() []*TargetSlate { + if x != nil { + return x.Slates + } + return nil +} + +func (x *SlateWiseRequest) GetTargets() []*Target { + if x != nil { + return x.Targets + } + return nil +} + +func (x *SlateWiseRequest) GetTenantId() string { + if x != nil { + return x.TenantId + } + return "" +} + +type SlateWiseResponse struct { + state protoimpl.MessageState `protogen:"open.v1"` + SlateScores []*SlateScore `protobuf:"bytes,1,rep,name=slate_scores,json=slateScores,proto3" json:"slate_scores,omitempty"` + TargetScores []*TargetScore `protobuf:"bytes,2,rep,name=target_scores,json=targetScores,proto3" json:"target_scores,omitempty"` + TargetOutputSchema []*FeatureSchema `protobuf:"bytes,3,rep,name=target_output_schema,json=targetOutputSchema,proto3" json:"target_output_schema,omitempty"` + SlateOutputSchema []*FeatureSchema `protobuf:"bytes,4,rep,name=slate_output_schema,json=slateOutputSchema,proto3" json:"slate_output_schema,omitempty"` + RequestError string `protobuf:"bytes,5,opt,name=request_error,json=requestError,proto3" 
json:"request_error,omitempty"` + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *SlateWiseResponse) Reset() { + *x = SlateWiseResponse{} + mi := &file_predict_proto_msgTypes[13] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *SlateWiseResponse) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*SlateWiseResponse) ProtoMessage() {} + +func (x *SlateWiseResponse) ProtoReflect() protoreflect.Message { + mi := &file_predict_proto_msgTypes[13] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use SlateWiseResponse.ProtoReflect.Descriptor instead. +func (*SlateWiseResponse) Descriptor() ([]byte, []int) { + return file_predict_proto_rawDescGZIP(), []int{13} +} + +func (x *SlateWiseResponse) GetSlateScores() []*SlateScore { + if x != nil { + return x.SlateScores + } + return nil +} + +func (x *SlateWiseResponse) GetTargetScores() []*TargetScore { + if x != nil { + return x.TargetScores + } + return nil +} + +func (x *SlateWiseResponse) GetTargetOutputSchema() []*FeatureSchema { + if x != nil { + return x.TargetOutputSchema + } + return nil +} + +func (x *SlateWiseResponse) GetSlateOutputSchema() []*FeatureSchema { + if x != nil { + return x.SlateOutputSchema + } + return nil +} + +func (x *SlateWiseResponse) GetRequestError() string { + if x != nil { + return x.RequestError + } + return "" +} + +var File_predict_proto protoreflect.FileDescriptor + +const file_predict_proto_rawDesc = "" + + "\n" + + "\rpredict.proto\"j\n" + + "\rFeatureSchema\x12\x12\n" + + "\x04name\x18\x01 \x01(\tR\x04name\x12&\n" + + "\tdata_type\x18\x02 \x01(\x0e2\t.DataTypeR\bdataType\x12\x1d\n" + + "\n" + + "vector_dim\x18\x03 \x01(\x05R\tvectorDim\"\x81\x01\n" + + "\x0eContextFeature\x12\x12\n" + + "\x04name\x18\x01 \x01(\tR\x04name\x12\x14\n" + + 
"\x05value\x18\x02 \x01(\fR\x05value\x12&\n" + + "\tdata_type\x18\x03 \x01(\x0e2\t.DataTypeR\bdataType\x12\x1d\n" + + "\n" + + "vector_dim\x18\x04 \x01(\x05R\tvectorDim\"?\n" + + "\x06Target\x12\x0e\n" + + "\x02id\x18\x01 \x01(\tR\x02id\x12%\n" + + "\x0efeature_values\x18\x02 \x03(\fR\rfeatureValues\"H\n" + + "\vTargetScore\x12\x14\n" + + "\x05error\x18\x01 \x01(\tR\x05error\x12#\n" + + "\routput_values\x18\x02 \x03(\fR\foutputValues\"\x97\x02\n" + + "\x10PointWiseRequest\x12&\n" + + "\x0fmodel_config_id\x18\x01 \x01(\tR\rmodelConfigId\x12\x1f\n" + + "\vtracking_id\x18\x02 \x01(\tR\n" + + "trackingId\x12:\n" + + "\x10context_features\x18\x03 \x03(\v2\x0f.ContextFeatureR\x0fcontextFeatures\x12>\n" + + "\x13target_input_schema\x18\x04 \x03(\v2\x0e.FeatureSchemaR\x11targetInputSchema\x12!\n" + + "\atargets\x18\x05 \x03(\v2\a.TargetR\atargets\x12\x1b\n" + + "\ttenant_id\x18\x06 \x01(\tR\btenantId\"\xad\x01\n" + + "\x11PointWiseResponse\x12@\n" + + "\x14target_output_schema\x18\x01 \x03(\v2\x0e.FeatureSchemaR\x12targetOutputSchema\x121\n" + + "\rtarget_scores\x18\x02 \x03(\v2\f.TargetScoreR\ftargetScores\x12#\n" + + "\rrequest_error\x18\x03 \x01(\tR\frequestError\"\x91\x01\n" + + "\n" + + "TargetPair\x12,\n" + + "\x12first_target_index\x18\x01 \x01(\x05R\x10firstTargetIndex\x12.\n" + + "\x13second_target_index\x18\x02 \x01(\x05R\x11secondTargetIndex\x12%\n" + + "\x0efeature_values\x18\x03 \x03(\fR\rfeatureValues\"\xf5\x02\n" + + "\x0fPairWiseRequest\x12&\n" + + "\x0fmodel_config_id\x18\x01 \x01(\tR\rmodelConfigId\x12\x1f\n" + + "\vtracking_id\x18\x02 \x01(\tR\n" + + "trackingId\x12:\n" + + "\x10context_features\x18\x03 \x03(\v2\x0f.ContextFeatureR\x0fcontextFeatures\x12>\n" + + "\x13target_input_schema\x18\x04 \x03(\v2\x0e.FeatureSchemaR\x11targetInputSchema\x12:\n" + + "\x11pair_input_schema\x18\x05 \x03(\v2\x0e.FeatureSchemaR\x0fpairInputSchema\x12!\n" + + "\x05pairs\x18\x06 \x03(\v2\v.TargetPairR\x05pairs\x12!\n" + + "\atargets\x18\a 
\x03(\v2\a.TargetR\atargets\x12\x1b\n" + + "\ttenant_id\x18\b \x01(\tR\btenantId\"F\n" + + "\tPairScore\x12\x14\n" + + "\x05error\x18\x01 \x01(\tR\x05error\x12#\n" + + "\routput_values\x18\x02 \x03(\fR\foutputValues\"\x97\x02\n" + + "\x10PairWiseResponse\x12+\n" + + "\vpair_scores\x18\x01 \x03(\v2\n" + + ".PairScoreR\n" + + "pairScores\x121\n" + + "\rtarget_scores\x18\x02 \x03(\v2\f.TargetScoreR\ftargetScores\x12@\n" + + "\x14target_output_schema\x18\x03 \x03(\v2\x0e.FeatureSchemaR\x12targetOutputSchema\x12<\n" + + "\x12pair_output_schema\x18\x04 \x03(\v2\x0e.FeatureSchemaR\x10pairOutputSchema\x12#\n" + + "\rrequest_error\x18\x05 \x01(\tR\frequestError\"[\n" + + "\vTargetSlate\x12%\n" + + "\x0etarget_indices\x18\x01 \x03(\x05R\rtargetIndices\x12%\n" + + "\x0efeature_values\x18\x02 \x03(\fR\rfeatureValues\"G\n" + + "\n" + + "SlateScore\x12\x14\n" + + "\x05error\x18\x01 \x01(\tR\x05error\x12#\n" + + "\routput_values\x18\x02 \x03(\fR\foutputValues\"\xfb\x02\n" + + "\x10SlateWiseRequest\x12&\n" + + "\x0fmodel_config_id\x18\x01 \x01(\tR\rmodelConfigId\x12\x1f\n" + + "\vtracking_id\x18\x02 \x01(\tR\n" + + "trackingId\x12:\n" + + "\x10context_features\x18\x03 \x03(\v2\x0f.ContextFeatureR\x0fcontextFeatures\x12>\n" + + "\x13target_input_schema\x18\x04 \x03(\v2\x0e.FeatureSchemaR\x11targetInputSchema\x12<\n" + + "\x12slate_input_schema\x18\x05 \x03(\v2\x0e.FeatureSchemaR\x10slateInputSchema\x12$\n" + + "\x06slates\x18\x06 \x03(\v2\f.TargetSlateR\x06slates\x12!\n" + + "\atargets\x18\a \x03(\v2\a.TargetR\atargets\x12\x1b\n" + + "\ttenant_id\x18\b \x01(\tR\btenantId\"\x9d\x02\n" + + "\x11SlateWiseResponse\x12.\n" + + "\fslate_scores\x18\x01 \x03(\v2\v.SlateScoreR\vslateScores\x121\n" + + "\rtarget_scores\x18\x02 \x03(\v2\f.TargetScoreR\ftargetScores\x12@\n" + + "\x14target_output_schema\x18\x03 \x03(\v2\x0e.FeatureSchemaR\x12targetOutputSchema\x12>\n" + + "\x13slate_output_schema\x18\x04 \x03(\v2\x0e.FeatureSchemaR\x11slateOutputSchema\x12#\n" + + "\rrequest_error\x18\x05 
\x01(\tR\frequestError*\xb9\x05\n" + + "\bDataType\x12\x13\n" + + "\x0fDataTypeUnknown\x10\x00\x12\x13\n" + + "\x0fDataTypeFP8E5M2\x10\x01\x12\x13\n" + + "\x0fDataTypeFP8E4M3\x10\x02\x12\x10\n" + + "\fDataTypeFP16\x10\x03\x12\x10\n" + + "\fDataTypeFP32\x10\x04\x12\x10\n" + + "\fDataTypeFP64\x10\x05\x12\x10\n" + + "\fDataTypeInt8\x10\x06\x12\x11\n" + + "\rDataTypeInt16\x10\a\x12\x11\n" + + "\rDataTypeInt32\x10\b\x12\x11\n" + + "\rDataTypeInt64\x10\t\x12\x11\n" + + "\rDataTypeUint8\x10\n" + + "\x12\x12\n" + + "\x0eDataTypeUint16\x10\v\x12\x12\n" + + "\x0eDataTypeUint32\x10\f\x12\x12\n" + + "\x0eDataTypeUint64\x10\r\x12\x12\n" + + "\x0eDataTypeString\x10\x0e\x12\x10\n" + + "\fDataTypeBool\x10\x0f\x12\x19\n" + + "\x15DataTypeFP8E5M2Vector\x10\x10\x12\x19\n" + + "\x15DataTypeFP8E4M3Vector\x10\x11\x12\x16\n" + + "\x12DataTypeFP16Vector\x10\x12\x12\x16\n" + + "\x12DataTypeFP32Vector\x10\x13\x12\x16\n" + + "\x12DataTypeFP64Vector\x10\x14\x12\x16\n" + + "\x12DataTypeInt8Vector\x10\x15\x12\x17\n" + + "\x13DataTypeInt16Vector\x10\x16\x12\x17\n" + + "\x13DataTypeInt32Vector\x10\x17\x12\x17\n" + + "\x13DataTypeInt64Vector\x10\x18\x12\x17\n" + + "\x13DataTypeUint8Vector\x10\x19\x12\x18\n" + + "\x14DataTypeUint16Vector\x10\x1a\x12\x18\n" + + "\x14DataTypeUint32Vector\x10\x1b\x12\x18\n" + + "\x14DataTypeUint64Vector\x10\x1c\x12\x18\n" + + "\x14DataTypeStringVector\x10\x1d\x12\x16\n" + + "\x12DataTypeBoolVector\x10\x1e2\xb7\x01\n" + + "\aPredict\x129\n" + + "\x0eInferPointWise\x12\x11.PointWiseRequest\x1a\x12.PointWiseResponse\"\x00\x126\n" + + "\rInferPairWise\x12\x10.PairWiseRequest\x1a\x11.PairWiseResponse\"\x00\x129\n" + + "\x0eInferSlateWise\x12\x11.SlateWiseRequest\x1a\x12.SlateWiseResponse\"\x00B\x11Z\x0f../grpc/predictb\x06proto3" + +var ( + file_predict_proto_rawDescOnce sync.Once + file_predict_proto_rawDescData []byte +) + +func file_predict_proto_rawDescGZIP() []byte { + file_predict_proto_rawDescOnce.Do(func() { + file_predict_proto_rawDescData = 
protoimpl.X.CompressGZIP(unsafe.Slice(unsafe.StringData(file_predict_proto_rawDesc), len(file_predict_proto_rawDesc))) + }) + return file_predict_proto_rawDescData +} + +var file_predict_proto_enumTypes = make([]protoimpl.EnumInfo, 1) +var file_predict_proto_msgTypes = make([]protoimpl.MessageInfo, 14) +var file_predict_proto_goTypes = []any{ + (DataType)(0), // 0: DataType + (*FeatureSchema)(nil), // 1: FeatureSchema + (*ContextFeature)(nil), // 2: ContextFeature + (*Target)(nil), // 3: Target + (*TargetScore)(nil), // 4: TargetScore + (*PointWiseRequest)(nil), // 5: PointWiseRequest + (*PointWiseResponse)(nil), // 6: PointWiseResponse + (*TargetPair)(nil), // 7: TargetPair + (*PairWiseRequest)(nil), // 8: PairWiseRequest + (*PairScore)(nil), // 9: PairScore + (*PairWiseResponse)(nil), // 10: PairWiseResponse + (*TargetSlate)(nil), // 11: TargetSlate + (*SlateScore)(nil), // 12: SlateScore + (*SlateWiseRequest)(nil), // 13: SlateWiseRequest + (*SlateWiseResponse)(nil), // 14: SlateWiseResponse +} +var file_predict_proto_depIdxs = []int32{ + 0, // 0: FeatureSchema.data_type:type_name -> DataType + 0, // 1: ContextFeature.data_type:type_name -> DataType + 2, // 2: PointWiseRequest.context_features:type_name -> ContextFeature + 1, // 3: PointWiseRequest.target_input_schema:type_name -> FeatureSchema + 3, // 4: PointWiseRequest.targets:type_name -> Target + 1, // 5: PointWiseResponse.target_output_schema:type_name -> FeatureSchema + 4, // 6: PointWiseResponse.target_scores:type_name -> TargetScore + 2, // 7: PairWiseRequest.context_features:type_name -> ContextFeature + 1, // 8: PairWiseRequest.target_input_schema:type_name -> FeatureSchema + 1, // 9: PairWiseRequest.pair_input_schema:type_name -> FeatureSchema + 7, // 10: PairWiseRequest.pairs:type_name -> TargetPair + 3, // 11: PairWiseRequest.targets:type_name -> Target + 9, // 12: PairWiseResponse.pair_scores:type_name -> PairScore + 4, // 13: PairWiseResponse.target_scores:type_name -> TargetScore + 1, // 14: 
PairWiseResponse.target_output_schema:type_name -> FeatureSchema + 1, // 15: PairWiseResponse.pair_output_schema:type_name -> FeatureSchema + 2, // 16: SlateWiseRequest.context_features:type_name -> ContextFeature + 1, // 17: SlateWiseRequest.target_input_schema:type_name -> FeatureSchema + 1, // 18: SlateWiseRequest.slate_input_schema:type_name -> FeatureSchema + 11, // 19: SlateWiseRequest.slates:type_name -> TargetSlate + 3, // 20: SlateWiseRequest.targets:type_name -> Target + 12, // 21: SlateWiseResponse.slate_scores:type_name -> SlateScore + 4, // 22: SlateWiseResponse.target_scores:type_name -> TargetScore + 1, // 23: SlateWiseResponse.target_output_schema:type_name -> FeatureSchema + 1, // 24: SlateWiseResponse.slate_output_schema:type_name -> FeatureSchema + 5, // 25: Predict.InferPointWise:input_type -> PointWiseRequest + 8, // 26: Predict.InferPairWise:input_type -> PairWiseRequest + 13, // 27: Predict.InferSlateWise:input_type -> SlateWiseRequest + 6, // 28: Predict.InferPointWise:output_type -> PointWiseResponse + 10, // 29: Predict.InferPairWise:output_type -> PairWiseResponse + 14, // 30: Predict.InferSlateWise:output_type -> SlateWiseResponse + 28, // [28:31] is the sub-list for method output_type + 25, // [25:28] is the sub-list for method input_type + 25, // [25:25] is the sub-list for extension type_name + 25, // [25:25] is the sub-list for extension extendee + 0, // [0:25] is the sub-list for field type_name +} + +func init() { file_predict_proto_init() } +func file_predict_proto_init() { + if File_predict_proto != nil { + return + } + type x struct{} + out := protoimpl.TypeBuilder{ + File: protoimpl.DescBuilder{ + GoPackagePath: reflect.TypeOf(x{}).PkgPath(), + RawDescriptor: unsafe.Slice(unsafe.StringData(file_predict_proto_rawDesc), len(file_predict_proto_rawDesc)), + NumEnums: 1, + NumMessages: 14, + NumExtensions: 0, + NumServices: 1, + }, + GoTypes: file_predict_proto_goTypes, + DependencyIndexes: file_predict_proto_depIdxs, + EnumInfos: 
file_predict_proto_enumTypes, + MessageInfos: file_predict_proto_msgTypes, + }.Build() + File_predict_proto = out.File + file_predict_proto_goTypes = nil + file_predict_proto_depIdxs = nil +} diff --git a/go-sdk/pkg/clients/inferflow/client/grpc/predict_grpc.pb.go b/go-sdk/pkg/clients/inferflow/client/grpc/predict_grpc.pb.go new file mode 100644 index 00000000..267954e5 --- /dev/null +++ b/go-sdk/pkg/clients/inferflow/client/grpc/predict_grpc.pb.go @@ -0,0 +1,197 @@ +// Code generated by protoc-gen-go-grpc. DO NOT EDIT. +// versions: +// - protoc-gen-go-grpc v1.6.1 +// - protoc v5.28.3 +// source: predict.proto + +package grpc + +import ( + context "context" + grpc "google.golang.org/grpc" + codes "google.golang.org/grpc/codes" + status "google.golang.org/grpc/status" +) + +// This is a compile-time assertion to ensure that this generated file +// is compatible with the grpc package it is being compiled against. +// Requires gRPC-Go v1.64.0 or later. +const _ = grpc.SupportPackageIsVersion9 + +const ( + Predict_InferPointWise_FullMethodName = "/Predict/InferPointWise" + Predict_InferPairWise_FullMethodName = "/Predict/InferPairWise" + Predict_InferSlateWise_FullMethodName = "/Predict/InferSlateWise" +) + +// PredictClient is the client API for Predict service. +// +// For semantics around ctx use and closing/ending streaming RPCs, please refer to https://pkg.go.dev/google.golang.org/grpc/?tab=doc#ClientConn.NewStream. 
+type PredictClient interface { + InferPointWise(ctx context.Context, in *PointWiseRequest, opts ...grpc.CallOption) (*PointWiseResponse, error) + InferPairWise(ctx context.Context, in *PairWiseRequest, opts ...grpc.CallOption) (*PairWiseResponse, error) + InferSlateWise(ctx context.Context, in *SlateWiseRequest, opts ...grpc.CallOption) (*SlateWiseResponse, error) +} + +type predictClient struct { + cc grpc.ClientConnInterface +} + +func NewPredictClient(cc grpc.ClientConnInterface) PredictClient { + return &predictClient{cc} +} + +func (c *predictClient) InferPointWise(ctx context.Context, in *PointWiseRequest, opts ...grpc.CallOption) (*PointWiseResponse, error) { + cOpts := append([]grpc.CallOption{grpc.StaticMethod()}, opts...) + out := new(PointWiseResponse) + err := c.cc.Invoke(ctx, Predict_InferPointWise_FullMethodName, in, out, cOpts...) + if err != nil { + return nil, err + } + return out, nil +} + +func (c *predictClient) InferPairWise(ctx context.Context, in *PairWiseRequest, opts ...grpc.CallOption) (*PairWiseResponse, error) { + cOpts := append([]grpc.CallOption{grpc.StaticMethod()}, opts...) + out := new(PairWiseResponse) + err := c.cc.Invoke(ctx, Predict_InferPairWise_FullMethodName, in, out, cOpts...) + if err != nil { + return nil, err + } + return out, nil +} + +func (c *predictClient) InferSlateWise(ctx context.Context, in *SlateWiseRequest, opts ...grpc.CallOption) (*SlateWiseResponse, error) { + cOpts := append([]grpc.CallOption{grpc.StaticMethod()}, opts...) + out := new(SlateWiseResponse) + err := c.cc.Invoke(ctx, Predict_InferSlateWise_FullMethodName, in, out, cOpts...) + if err != nil { + return nil, err + } + return out, nil +} + +// PredictServer is the server API for Predict service. +// All implementations must embed UnimplementedPredictServer +// for forward compatibility. 
+type PredictServer interface { + InferPointWise(context.Context, *PointWiseRequest) (*PointWiseResponse, error) + InferPairWise(context.Context, *PairWiseRequest) (*PairWiseResponse, error) + InferSlateWise(context.Context, *SlateWiseRequest) (*SlateWiseResponse, error) + mustEmbedUnimplementedPredictServer() +} + +// UnimplementedPredictServer must be embedded to have +// forward compatible implementations. +// +// NOTE: this should be embedded by value instead of pointer to avoid a nil +// pointer dereference when methods are called. +type UnimplementedPredictServer struct{} + +func (UnimplementedPredictServer) InferPointWise(context.Context, *PointWiseRequest) (*PointWiseResponse, error) { + return nil, status.Error(codes.Unimplemented, "method InferPointWise not implemented") +} +func (UnimplementedPredictServer) InferPairWise(context.Context, *PairWiseRequest) (*PairWiseResponse, error) { + return nil, status.Error(codes.Unimplemented, "method InferPairWise not implemented") +} +func (UnimplementedPredictServer) InferSlateWise(context.Context, *SlateWiseRequest) (*SlateWiseResponse, error) { + return nil, status.Error(codes.Unimplemented, "method InferSlateWise not implemented") +} +func (UnimplementedPredictServer) mustEmbedUnimplementedPredictServer() {} +func (UnimplementedPredictServer) testEmbeddedByValue() {} + +// UnsafePredictServer may be embedded to opt out of forward compatibility for this service. +// Use of this interface is not recommended, as added methods to PredictServer will +// result in compilation errors. +type UnsafePredictServer interface { + mustEmbedUnimplementedPredictServer() +} + +func RegisterPredictServer(s grpc.ServiceRegistrar, srv PredictServer) { + // If the following call panics, it indicates UnimplementedPredictServer was + // embedded by pointer and is nil. 
This will cause panics if an + // unimplemented method is ever invoked, so we test this at initialization + // time to prevent it from happening at runtime later due to I/O. + if t, ok := srv.(interface{ testEmbeddedByValue() }); ok { + t.testEmbeddedByValue() + } + s.RegisterService(&Predict_ServiceDesc, srv) +} + +func _Predict_InferPointWise_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) { + in := new(PointWiseRequest) + if err := dec(in); err != nil { + return nil, err + } + if interceptor == nil { + return srv.(PredictServer).InferPointWise(ctx, in) + } + info := &grpc.UnaryServerInfo{ + Server: srv, + FullMethod: Predict_InferPointWise_FullMethodName, + } + handler := func(ctx context.Context, req interface{}) (interface{}, error) { + return srv.(PredictServer).InferPointWise(ctx, req.(*PointWiseRequest)) + } + return interceptor(ctx, in, info, handler) +} + +func _Predict_InferPairWise_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) { + in := new(PairWiseRequest) + if err := dec(in); err != nil { + return nil, err + } + if interceptor == nil { + return srv.(PredictServer).InferPairWise(ctx, in) + } + info := &grpc.UnaryServerInfo{ + Server: srv, + FullMethod: Predict_InferPairWise_FullMethodName, + } + handler := func(ctx context.Context, req interface{}) (interface{}, error) { + return srv.(PredictServer).InferPairWise(ctx, req.(*PairWiseRequest)) + } + return interceptor(ctx, in, info, handler) +} + +func _Predict_InferSlateWise_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) { + in := new(SlateWiseRequest) + if err := dec(in); err != nil { + return nil, err + } + if interceptor == nil { + return srv.(PredictServer).InferSlateWise(ctx, in) + } + info := &grpc.UnaryServerInfo{ + Server: srv, + 
FullMethod: Predict_InferSlateWise_FullMethodName, + } + handler := func(ctx context.Context, req interface{}) (interface{}, error) { + return srv.(PredictServer).InferSlateWise(ctx, req.(*SlateWiseRequest)) + } + return interceptor(ctx, in, info, handler) +} + +// Predict_ServiceDesc is the grpc.ServiceDesc for Predict service. +// It's only intended for direct use with grpc.RegisterService, +// and not to be introspected or modified (even as a copy) +var Predict_ServiceDesc = grpc.ServiceDesc{ + ServiceName: "Predict", + HandlerType: (*PredictServer)(nil), + Methods: []grpc.MethodDesc{ + { + MethodName: "InferPointWise", + Handler: _Predict_InferPointWise_Handler, + }, + { + MethodName: "InferPairWise", + Handler: _Predict_InferPairWise_Handler, + }, + { + MethodName: "InferSlateWise", + Handler: _Predict_InferSlateWise_Handler, + }, + }, + Streams: []grpc.StreamDesc{}, + Metadata: "predict.proto", +} diff --git a/go-sdk/pkg/clients/inferflow/client/proto/predict.proto b/go-sdk/pkg/clients/inferflow/client/proto/predict.proto new file mode 100644 index 00000000..0cb7f38f --- /dev/null +++ b/go-sdk/pkg/clients/inferflow/client/proto/predict.proto @@ -0,0 +1,148 @@ +syntax = "proto3"; + +option go_package = "../grpc"; + +enum DataType { + DataTypeUnknown = 0; + DataTypeFP8E5M2 = 1; + DataTypeFP8E4M3 = 2; + DataTypeFP16 = 3; + DataTypeFP32 = 4; + DataTypeFP64 = 5; + DataTypeInt8 = 6; + DataTypeInt16 = 7; + DataTypeInt32 = 8; + DataTypeInt64 = 9; + DataTypeUint8 = 10; + DataTypeUint16 = 11; + DataTypeUint32 = 12; + DataTypeUint64 = 13; + DataTypeString = 14; + DataTypeBool = 15; + DataTypeFP8E5M2Vector = 16; + DataTypeFP8E4M3Vector = 17; + DataTypeFP16Vector = 18; + DataTypeFP32Vector = 19; + DataTypeFP64Vector = 20; + DataTypeInt8Vector = 21; + DataTypeInt16Vector = 22; + DataTypeInt32Vector = 23; + DataTypeInt64Vector = 24; + DataTypeUint8Vector = 25; + DataTypeUint16Vector = 26; + DataTypeUint32Vector = 27; + DataTypeUint64Vector = 28; + DataTypeStringVector = 29; 
+ DataTypeBoolVector = 30; +} + +message FeatureSchema { + string name = 1; + DataType data_type = 2; + int32 vector_dim = 3; +} + +message ContextFeature { + string name = 1; + bytes value = 2; + DataType data_type = 3; + int32 vector_dim = 4; +} + +message Target { + string id = 1; + repeated bytes feature_values = 2; +} + +// --- PointWise --- + +message TargetScore { + string error = 1; + repeated bytes output_values = 2; +} + +message PointWiseRequest { + string model_config_id = 1; + string tracking_id = 2; + repeated ContextFeature context_features = 3; + repeated FeatureSchema target_input_schema = 4; + repeated Target targets = 5; + string tenant_id = 6; +} + +message PointWiseResponse { + repeated FeatureSchema target_output_schema = 1; + repeated TargetScore target_scores = 2; + string request_error = 3; +} + +// --- PairWise --- + +message TargetPair { + int32 first_target_index = 1; + int32 second_target_index = 2; + repeated bytes feature_values = 3; +} + +message PairWiseRequest { + string model_config_id = 1; + string tracking_id = 2; + repeated ContextFeature context_features = 3; + repeated FeatureSchema target_input_schema = 4; + repeated FeatureSchema pair_input_schema = 5; + repeated TargetPair pairs = 6; + repeated Target targets = 7; + string tenant_id = 8; +} + +message PairScore { + string error = 1; + repeated bytes output_values = 2; +} + +message PairWiseResponse { + repeated PairScore pair_scores = 1; + repeated TargetScore target_scores = 2; + repeated FeatureSchema target_output_schema = 3; + repeated FeatureSchema pair_output_schema = 4; + string request_error = 5; +} + +// --- SlateWise --- + +message TargetSlate { + repeated int32 target_indices = 1; + repeated bytes feature_values = 2; +} + +message SlateScore { + string error = 1; + repeated bytes output_values = 2; +} + +message SlateWiseRequest { + string model_config_id = 1; + string tracking_id = 2; + repeated ContextFeature context_features = 3; + repeated FeatureSchema 
target_input_schema = 4; + repeated FeatureSchema slate_input_schema = 5; + repeated TargetSlate slates = 6; + repeated Target targets = 7; + string tenant_id = 8; +} + +message SlateWiseResponse { + repeated SlateScore slate_scores = 1; + repeated TargetScore target_scores = 2; + repeated FeatureSchema target_output_schema = 3; + repeated FeatureSchema slate_output_schema = 4; + string request_error = 5; +} + +// --- Service --- + +service Predict { + rpc InferPointWise(PointWiseRequest) returns (PointWiseResponse) {}; + rpc InferPairWise(PairWiseRequest) returns (PairWiseResponse) {}; + rpc InferSlateWise(SlateWiseRequest) returns (SlateWiseResponse) {}; +} diff --git a/go-sdk/pkg/clients/inferflow/conf.go b/go-sdk/pkg/clients/inferflow/conf.go new file mode 100644 index 00000000..81720396 --- /dev/null +++ b/go-sdk/pkg/clients/inferflow/conf.go @@ -0,0 +1,76 @@ +package inferflow + +import ( + "fmt" + "github.com/spf13/viper" +) + +const ( + Host = "HOST" + Port = "PORT" + DeadlineMS = "DEADLINE_MS" + PlainText = "PLAINTEXT" + AuthToken = "AUTH_TOKEN" + DefaultHost = "" + DefaultPort = "8080" + DefaultDeadlineMS = 200 + DefaultPlainText = true + DefaultAuthToken = "" +) + +type ClientConfig struct { + Host string + Port string + DeadlineExceedMS int + PlainText bool + AuthToken string +} + +func getClientConfigs(prefix string) (*ClientConfig, error) { + host := DefaultHost + port := DefaultPort + deadline := DefaultDeadlineMS + plaintext := DefaultPlainText + authToken := DefaultAuthToken + + if viper.IsSet(prefix + Host) { + host = viper.GetString(prefix + Host) + } + if viper.IsSet(prefix + Port) { + port = viper.GetString(prefix + Port) + } + if viper.IsSet(prefix + DeadlineMS) { + deadline = viper.GetInt(prefix + DeadlineMS) + } + if viper.IsSet(prefix + PlainText) { + plaintext = viper.GetBool(prefix + PlainText) + } + if viper.IsSet(prefix + AuthToken) { + authToken = viper.GetString(prefix + AuthToken) + } + conf := &ClientConfig{ + Host: host, + Port: 
port, + DeadlineExceedMS: deadline, + PlainText: plaintext, + AuthToken: authToken, + } + if valid, err := validConfigs(conf); !valid { + return nil, err + } + return conf, nil +} + +func validConfigs(configs *ClientConfig) (bool, error) { + if configs.Host == "" { + return false, fmt.Errorf("inferflow service host is invalid, configured value: %v", configs.Host) + } + if configs.Port == "" { + return false, fmt.Errorf("inferflow service port is invalid, configured value: %v", configs.Port) + } + if configs.DeadlineExceedMS <= 0 { + return false, fmt.Errorf("inferflow service deadline exceed timeout is invalid, configured value: %v", + configs.DeadlineExceedMS) + } + return true, nil +} diff --git a/go-sdk/pkg/clients/inferflow/inferflow.go b/go-sdk/pkg/clients/inferflow/inferflow.go new file mode 100644 index 00000000..e7775660 --- /dev/null +++ b/go-sdk/pkg/clients/inferflow/inferflow.go @@ -0,0 +1,12 @@ +package inferflow + +import ( + "github.com/Meesho/BharatMLStack/go-sdk/pkg/clients/inferflow/client/grpc" +) + +// InferflowClient exposes the Predict service APIs: PointWise, PairWise, and SlateWise. 
+type InferflowClient interface { + InferPointWise(request *grpc.PointWiseRequest) (*grpc.PointWiseResponse, error) + InferPairWise(request *grpc.PairWiseRequest) (*grpc.PairWiseResponse, error) + InferSlateWise(request *grpc.SlateWiseRequest) (*grpc.SlateWiseResponse, error) +} diff --git a/go-sdk/pkg/clients/inferflow/init.go b/go-sdk/pkg/clients/inferflow/init.go new file mode 100644 index 00000000..92d58001 --- /dev/null +++ b/go-sdk/pkg/clients/inferflow/init.go @@ -0,0 +1,19 @@ +package inferflow + +func GetInferflowClient(version int) InferflowClient { + switch version { + case 1: + return InitV1Client() + default: + return nil + } +} + +func GetInferflowClientFromConfig(version int, conf ClientConfig, callerId string) InferflowClient { + switch version { + case 1: + return InitV1ClientFromConfig(conf, callerId) + default: + return nil + } +} diff --git a/go-sdk/pkg/clients/inferflow/models.go b/go-sdk/pkg/clients/inferflow/models.go new file mode 100644 index 00000000..8b9460ea --- /dev/null +++ b/go-sdk/pkg/clients/inferflow/models.go @@ -0,0 +1,8 @@ +package inferflow + +import "github.com/Meesho/BharatMLStack/go-sdk/pkg/grpcclient" + +type ClientV1 struct { + ClientConfigs *ClientConfig + GrpcClient *grpcclient.GRPCClient +} diff --git a/go-sdk/pkg/clients/inferflow/v1.go b/go-sdk/pkg/clients/inferflow/v1.go new file mode 100644 index 00000000..6695fa20 --- /dev/null +++ b/go-sdk/pkg/clients/inferflow/v1.go @@ -0,0 +1,154 @@ +package inferflow + +import ( + "context" + "fmt" + "sync" + "time" + + grpc2 "github.com/Meesho/BharatMLStack/go-sdk/pkg/clients/inferflow/client/grpc" + "github.com/Meesho/BharatMLStack/go-sdk/pkg/grpcclient" + "github.com/rs/zerolog/log" + "github.com/spf13/viper" + "google.golang.org/grpc/metadata" +) + +var ( + client *ClientV1 + onceEnv sync.Once + onceConfig sync.Once + headers metadata.MD +) + +const ( + V1Prefix = "INFERFLOW_CLIENT_V1_" + CallerIDMetadata = "inferflow-caller-id" + AuthMetadata = "inferflow-auth-token" +) + 
+// InitV1Client initializes the client from environment variables (via viper).
+// Safe to call multiple times; initialization runs once.
+func InitV1Client() InferflowClient {
+	onceEnv.Do(func() {
+		clientConfig, err := getClientConfigs(V1Prefix)
+		if err != nil {
+			log.Panic().Err(err).Msgf("Invalid Inferflow client configs: %#v", clientConfig)
+		}
+		grpcClient, grpcErr := getGrpcClient(clientConfig)
+		if grpcErr != nil {
+			log.Panic().Err(grpcErr).Msgf("Error creating inferflow service grpc client, client: %#v", grpcClient)
+		}
+		headers = getMetadata(clientConfig.AuthToken)
+		client = &ClientV1{
+			ClientConfigs: clientConfig,
+			GrpcClient:    grpcClient,
+		}
+	})
+	return client
+}
+
+// InitV1ClientFromConfig initializes the client from an explicit config.
+// Uses a separate sync.Once so it is not blocked by a prior InitV1Client call.
+func InitV1ClientFromConfig(conf ClientConfig, callerId string) InferflowClient {
+	onceConfig.Do(func() {
+		grpcClient, grpcErr := getGrpcClient(&conf)
+		if grpcErr != nil {
+			log.Panic().Err(grpcErr).Msgf("Error creating inferflow service grpc client, client: %#v", grpcClient)
+		}
+		headers = metadata.New(map[string]string{
+			CallerIDMetadata: callerId,
+			AuthMetadata:     conf.AuthToken,
+		})
+		client = &ClientV1{
+			ClientConfigs: &conf,
+			GrpcClient:    grpcClient,
+		}
+	})
+	return client
+}
+
+// getGrpcClient uses named return values so the deferred recover can surface a
+// panic from NewConnFromConfig as an error. With unnamed returns, a recovered
+// panic would make this function return (nil, nil), and callers checking only
+// the error would proceed with a nil client.
+func getGrpcClient(conf *ClientConfig) (client *grpcclient.GRPCClient, err error) {
+	defer func() {
+		if r := recover(); r != nil {
+			err = fmt.Errorf("panic creating grpc client from prefix: %v", r)
+		}
+	}()
+	client = grpcclient.NewConnFromConfig(&grpcclient.Config{
+		Host:                conf.Host,
+		Port:                conf.Port,
+		DeadLine:            conf.DeadlineExceedMS,
+		LoadBalancingPolicy: "round_robin",
+		PlainText:           conf.PlainText,
+	}, V1Prefix)
+	return client, err
+}
+
+func (c *ClientV1) InferPointWise(req *grpc2.PointWiseRequest) (*grpc2.PointWiseResponse, error) {
+	predictClient := grpc2.NewPredictClient(c.GrpcClient)
+ timeout := time.Duration(c.ClientConfigs.DeadlineExceedMS) * time.Millisecond + ctx, cancel := context.WithTimeout(context.Background(), timeout) + defer cancel() + + ctx = metadata.NewOutgoingContext(ctx, headers) + protoResponse, err := predictClient.InferPointWise(ctx, req) + if err != nil { + log.Error().Msgf("Error while calling InferPointWise on inferflow service, err: %v", err) + return nil, err + } else if protoResponse == nil { + log.Error().Msgf("Empty response from inferflow InferPointWise") + return nil, fmt.Errorf("empty response from inferflow InferPointWise") + } + return protoResponse, nil +} + +func (c *ClientV1) InferPairWise(req *grpc2.PairWiseRequest) (*grpc2.PairWiseResponse, error) { + predictClient := grpc2.NewPredictClient(c.GrpcClient) + timeout := time.Duration(c.ClientConfigs.DeadlineExceedMS) * time.Millisecond + ctx, cancel := context.WithTimeout(context.Background(), timeout) + defer cancel() + + ctx = metadata.NewOutgoingContext(ctx, headers) + protoResponse, err := predictClient.InferPairWise(ctx, req) + if err != nil { + log.Error().Msgf("Error while calling InferPairWise on inferflow service, err: %v", err) + return nil, err + } else if protoResponse == nil { + log.Error().Msgf("Empty response from inferflow InferPairWise") + return nil, fmt.Errorf("empty response from inferflow InferPairWise") + } + return protoResponse, nil +} + +func (c *ClientV1) InferSlateWise(req *grpc2.SlateWiseRequest) (*grpc2.SlateWiseResponse, error) { + predictClient := grpc2.NewPredictClient(c.GrpcClient) + timeout := time.Duration(c.ClientConfigs.DeadlineExceedMS) * time.Millisecond + ctx, cancel := context.WithTimeout(context.Background(), timeout) + defer cancel() + + ctx = metadata.NewOutgoingContext(ctx, headers) + protoResponse, err := predictClient.InferSlateWise(ctx, req) + if err != nil { + log.Error().Msgf("Error while calling InferSlateWise on inferflow service, err: %v", err) + return nil, err + } else if protoResponse == nil { + 
log.Error().Msgf("Empty response from inferflow InferSlateWise") + return nil, fmt.Errorf("empty response from inferflow InferSlateWise") + } + return protoResponse, nil +} + +func getMetadata(authToken string) metadata.MD { + callerId := viper.GetString("INFERFLOW_CALLER_ID") + if callerId == "" { + log.Panic().Msgf("INFERFLOW_CALLER_ID not set!") + } + md := metadata.New(map[string]string{ + CallerIDMetadata: callerId, + }) + if authToken != "" { + md.Set(AuthMetadata, authToken) + } + return md +} diff --git a/go-sdk/pkg/clients/inferflow/v1_test.go b/go-sdk/pkg/clients/inferflow/v1_test.go new file mode 100644 index 00000000..6cdd09ca --- /dev/null +++ b/go-sdk/pkg/clients/inferflow/v1_test.go @@ -0,0 +1,144 @@ +package inferflow + +import ( + "testing" + + "github.com/spf13/viper" +) + +func TestGetClientConfigs_Valid(t *testing.T) { + viper.Set("INFERFLOW_CLIENT_V1_HOST", "localhost") + viper.Set("INFERFLOW_CLIENT_V1_PORT", "8080") + viper.Set("INFERFLOW_CLIENT_V1_DEADLINE_MS", 500) + viper.Set("INFERFLOW_CLIENT_V1_PLAINTEXT", true) + defer viper.Reset() + + conf, err := getClientConfigs(V1Prefix) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if conf.Host != "localhost" { + t.Errorf("expected host localhost, got %s", conf.Host) + } + if conf.Port != "8080" { + t.Errorf("expected port 8080, got %s", conf.Port) + } + if conf.DeadlineExceedMS != 500 { + t.Errorf("expected deadline 500, got %d", conf.DeadlineExceedMS) + } + if !conf.PlainText { + t.Error("expected plaintext true") + } +} + +func TestGetClientConfigs_MissingHost(t *testing.T) { + viper.Reset() + viper.Set("INFERFLOW_CLIENT_V1_PORT", "8080") + viper.Set("INFERFLOW_CLIENT_V1_DEADLINE_MS", 500) + defer viper.Reset() + + _, err := getClientConfigs(V1Prefix) + if err == nil { + t.Fatal("expected error for missing host, got nil") + } +} + +func TestGetClientConfigs_EmptyPort(t *testing.T) { + viper.Reset() + viper.Set("INFERFLOW_CLIENT_V1_HOST", "localhost") + 
viper.Set("INFERFLOW_CLIENT_V1_PORT", "") + viper.Set("INFERFLOW_CLIENT_V1_DEADLINE_MS", 500) + defer viper.Reset() + + _, err := getClientConfigs(V1Prefix) + if err == nil { + t.Fatal("expected error for empty port, got nil") + } +} + +func TestGetClientConfigs_InvalidDeadline(t *testing.T) { + viper.Set("INFERFLOW_CLIENT_V1_HOST", "localhost") + viper.Set("INFERFLOW_CLIENT_V1_PORT", "8080") + viper.Set("INFERFLOW_CLIENT_V1_DEADLINE_MS", 0) + defer viper.Reset() + + _, err := getClientConfigs(V1Prefix) + if err == nil { + t.Fatal("expected error for zero deadline, got nil") + } +} + +func TestGetClientConfigs_Defaults(t *testing.T) { + viper.Set("INFERFLOW_CLIENT_V1_HOST", "inferflow.svc") + defer viper.Reset() + + conf, err := getClientConfigs(V1Prefix) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if conf.Port != DefaultPort { + t.Errorf("expected default port %s, got %s", DefaultPort, conf.Port) + } + if conf.DeadlineExceedMS != DefaultDeadlineMS { + t.Errorf("expected default deadline %d, got %d", DefaultDeadlineMS, conf.DeadlineExceedMS) + } + if conf.PlainText != DefaultPlainText { + t.Errorf("expected default plaintext %v, got %v", DefaultPlainText, conf.PlainText) + } +} + +func TestValidConfigs(t *testing.T) { + tests := []struct { + name string + config *ClientConfig + wantOK bool + }{ + { + name: "valid config", + config: &ClientConfig{Host: "localhost", Port: "8080", DeadlineExceedMS: 200}, + wantOK: true, + }, + { + name: "empty host", + config: &ClientConfig{Host: "", Port: "8080", DeadlineExceedMS: 200}, + wantOK: false, + }, + { + name: "empty port", + config: &ClientConfig{Host: "localhost", Port: "", DeadlineExceedMS: 200}, + wantOK: false, + }, + { + name: "zero deadline", + config: &ClientConfig{Host: "localhost", Port: "8080", DeadlineExceedMS: 0}, + wantOK: false, + }, + { + name: "negative deadline", + config: &ClientConfig{Host: "localhost", Port: "8080", DeadlineExceedMS: -1}, + wantOK: false, + }, + } + for _, tt := 
range tests { + t.Run(tt.name, func(t *testing.T) { + ok, err := validConfigs(tt.config) + if ok != tt.wantOK { + t.Errorf("validConfigs() = %v, want %v, err: %v", ok, tt.wantOK, err) + } + }) + } +} + +func TestGetInferflowClient_InvalidVersion(t *testing.T) { + c := GetInferflowClient(99) + if c != nil { + t.Error("expected nil for unsupported version") + } +} + +func TestGetInferflowClientFromConfig_InvalidVersion(t *testing.T) { + c := GetInferflowClientFromConfig(99, ClientConfig{}, "test") + if c != nil { + t.Error("expected nil for unsupported version") + } +} diff --git a/helm-charts/horizon/Chart.yaml b/helm-charts/horizon/Chart.yaml new file mode 100644 index 00000000..701b977b --- /dev/null +++ b/helm-charts/horizon/Chart.yaml @@ -0,0 +1,10 @@ +apiVersion: v2 +name: horizon +description: A Helm chart for the Horizon control plane service (onboarding, GitOps, ArgoCD orchestration) +type: application +version: 1.0.0 +appVersion: "1.0.0" + +maintainers: + - name: BharatMLStack Team + email: ml-oss@meesho.com diff --git a/helm-charts/horizon/templates/NOTES.txt b/helm-charts/horizon/templates/NOTES.txt new file mode 100644 index 00000000..35a4f3c2 --- /dev/null +++ b/helm-charts/horizon/templates/NOTES.txt @@ -0,0 +1,18 @@ +{{ .Chart.Name }} has been deployed. + +Namespace: {{ .Values.namespace }} +Application: {{ .Values.applicationName }} + +{{- if .Values.service.enabled }} +Service: {{ .Values.namespace }}:{{ (index .Values.service.ports 0).port }} +{{- end }} + +{{- if .Values.ingress.enabled }} +Ingress is enabled. 
+{{- end }} + +{{- if .Values.autoscaling.enabled }} +Autoscaling: {{ .Values.autoscaling.minReplicas }} - {{ .Values.autoscaling.maxReplicas }} replicas +{{- else }} +Replicas: {{ .Values.replicaCount }} +{{- end }} diff --git a/helm-charts/horizon/templates/_helpers.tpl b/helm-charts/horizon/templates/_helpers.tpl new file mode 100644 index 00000000..168e0342 --- /dev/null +++ b/helm-charts/horizon/templates/_helpers.tpl @@ -0,0 +1,70 @@ +{{/* vim: set filetype=mustache: */}} +{{/* +Expand the name of the chart. +*/}} +{{- define "name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}} +{{- end -}} + +{{/* +Create a default fully qualified app name. +We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec). +*/}} +{{- define "fullname" -}} +{{- $name := default .Chart.Name .Values.nameOverride -}} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}} +{{- end -}} + +{{- define "labels.selector" -}} +app: {{ .Values.namespace }} +{{- end -}} + +{{- define "labels.primary-selector" -}} +app: {{ .Values.namespace }}-primary +{{- end -}} + +{{- define "labels.common" -}} +{{ template "labels.selector" . 
}} +{{- if and .Values.deployment .Values.deployment.image }} +version: {{ .Values.deployment.image.tag }} +{{- end }} +env: {{ .Values.labels.env }} +team: {{ .Values.labels.team }} +bu: {{ .Values.labels.bu }} +service: {{ .Values.applicationName }} +priority: {{ .Values.labels.priority }} +priority_v2: {{ .Values.labels.priority_v2 | default "cp3" }} +primary_owner: {{ .Values.labels.primary_owner | default .Values.labels.team }} +secondary_owner: {{ .Values.labels.secondary_owner | default .Values.labels.team }} +service_type: {{ .Values.labels.service_type | default "" | replace "," "-" | quote }} +{{- end -}} + +{{- define "labels.chart" -}} +chart: "{{ .Chart.Name }}-{{ .Chart.Version }}" +release: {{ .Release.Name | quote }} +heritage: {{ .Release.Service | quote }} +{{- end -}} + +{{/* +Renders a value that contains template. +Usage: +{{ include "application.tplvalues.render" ( dict "value" .Values.path.to.the.Value "context" $) }} +*/}} + +{{- define "application.tplvalues.render" -}} + {{- if typeIs "string" .value }} + {{- tpl .value .context }} + {{- else }} + {{- tpl (.value | toYaml) .context }} + {{- end }} +{{- end -}} + +{{- define "canary.promURL" -}} +{{- if .Values.canary.promURL }} +{{- .Values.canary.promURL }} +{{- else if eq .Values.labels.env "prod" }} +prod-ops-metricsui.example.com/select/100/ +{{- else }} +https://sb-ops-metricsui.example.com/select/100/ +{{- end }} +{{- end -}} diff --git a/helm-charts/horizon/templates/alert-provider.yaml b/helm-charts/horizon/templates/alert-provider.yaml new file mode 100644 index 00000000..a300bb14 --- /dev/null +++ b/helm-charts/horizon/templates/alert-provider.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.canary (.Values.canary.enabled) .Values.canary.slackWebhookURL (ne .Values.canary.slackWebhookURL "") }} +apiVersion: flagger.app/v1beta1 +kind: AlertProvider +metadata: + name: flagger-status + namespace: {{ .Values.namespace }} +spec: + type: slack + {{- if .Values.canary.slackChannel }} + 
channel: {{ .Values.canary.slackChannel }} + {{- end }} + username: flagger + address: {{ .Values.canary.slackWebhookURL }} +{{- end }} diff --git a/helm-charts/horizon/templates/configmap.yaml b/helm-charts/horizon/templates/configmap.yaml new file mode 100644 index 00000000..fc527219 --- /dev/null +++ b/helm-charts/horizon/templates/configmap.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.configMap .Values.configMap.enabled }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: {{ .Values.namespace }}-config + namespace: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} +data: + {{- range $key, $value := .Values.configMap.data }} + {{ $key }}: {{ $value | quote }} + {{- end }} +{{- end }} diff --git a/helm-charts/horizon/templates/deployment.yaml b/helm-charts/horizon/templates/deployment.yaml new file mode 100644 index 00000000..34cbb1ed --- /dev/null +++ b/helm-charts/horizon/templates/deployment.yaml @@ -0,0 +1,237 @@ +{{- if and .Values.deployment .Values.deployment.enabled }} +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ .Values.namespace }} + namespace: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} +{{- include "labels.selector" . | nindent 4 }} +spec: + {{- with .Values.deployment.minReadySeconds }} + minReadySeconds: {{ . }} + {{- end }} + {{- if not .Values.autoscaling.enabled }} + replicas: {{ .Values.replicaCount }} + {{- end }} + revisionHistoryLimit: {{ .Values.deployment.revisionHistoryLimit }} + selector: + matchLabels: +{{- include "labels.selector" . | nindent 6 }} +{{- with .Values.deployment.updateStrategy }} +{{ toYaml . | indent 2 -}} +{{- end }} + template: + metadata: + annotations: + {{- with .Values.deployment.podAnnotations }} + {{- toYaml . 
| nindent 8 }} + {{- end }} + {{- if .Values.telegraf.enabled }} + telegraf.influxdata.com/class: "infra" + {{- end }} + labels: + {{- include "labels.common" . | nindent 8 }} + spec: + {{- if .Values.priorityClassName }} + priorityClassName: {{ .Values.priorityClassName }} + {{- end }} + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: topology.kubernetes.io/zone + whenUnsatisfiable: ScheduleAnyway + labelSelector: + matchLabels: +{{- include "labels.selector" . | nindent 12 }} + {{- if .Values.deployment.image.pullSecret }} + imagePullSecrets: + - name: {{ .Values.deployment.image.pullSecret }} + {{- end }} + {{- if .Values.deployment.volumes }} + volumes: + {{- toYaml .Values.deployment.volumes | nindent 8 }} + {{- end }} + {{- if .Values.deployment.initContainers }} + initContainers: + {{- toYaml .Values.deployment.initContainers | nindent 8 }} + {{- end }} + containers: + - name: {{ .Values.applicationName }} + {{- if .Values.deployment.volumeMounts }} + volumeMounts: + {{- toYaml .Values.deployment.volumeMounts | nindent 12 }} + {{- end }} + {{- if .Values.deployment.command }} + command: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.command "context" $) | nindent 12 }} + {{- end }} + {{- if .Values.deployment.args }} + args: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.args "context" $) | nindent 12 }} + {{- end }} + image: "{{ .Values.deployment.image.repository }}:{{ .Values.deployment.image.tag }}" + imagePullPolicy: {{ .Values.deployment.image.pullPolicy }} + {{- if .Values.deployment.lifecycle }} + lifecycle: {{ toYaml .Values.deployment.lifecycle | nindent 12 }} + {{- end }} + {{- with .Values.deployment.probes }} + {{- with .liveness }} + livenessProbe: + {{- with .failureThreshold }} + failureThreshold: {{ . }} + {{- end }} + httpGet: + path: {{ .path }} + port: {{ .port }} + scheme: {{ .scheme }} + {{- with .periodSeconds }} + periodSeconds: {{ . 
}} + {{- end }} + {{- with .successThreshold }} + successThreshold: {{ . }} + {{- end }} + {{- with .timeoutSeconds }} + timeoutSeconds: {{ . }} + {{- end }} + initialDelaySeconds: {{ .initialDelaySeconds }} + {{- end }} + {{- end }} + {{- if or .Values.externalSecret.enabled .Values.otel_enabled (and .Values.configMap .Values.configMap.enabled) }} + envFrom: + {{- if .Values.externalSecret.enabled }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }} + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }}-dr + {{- end }} + {{- end }} + {{- if .Values.otel_enabled }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }}-otel + {{- end }} + {{- if and .Values.configMap .Values.configMap.enabled }} + - configMapRef: + name: {{ .Values.namespace }}-config + {{- end }} + {{- end }} + env: + - name: TZ + value: Asia/Kolkata + - name: NODE_IP + valueFrom: + fieldRef: + fieldPath: spec.nodeName + - name: POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name + - name: POD_NAMESPACE + valueFrom: + fieldRef: + fieldPath: metadata.namespace + - name: POD_IP + valueFrom: + fieldRef: + fieldPath: status.podIP + {{- if .Values.telegraf.enabled }} + - name: TELEGRAF_UDP_HOST + valueFrom: + fieldRef: + fieldPath: status.podIP + {{- end }} + {{- if .Values.otel_enabled }} + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: http://$(NODE_IP):4317 + {{- end }} + {{- with .Values.deployment.env }} + {{- range . }} + - name: {{ .name }} + value: "{{ .value }}" + {{- end }} + {{- end }} + {{- with .Values.deployment.ports }} + ports: + {{- range . 
}} + - containerPort: {{ .containerPort }} + name: {{ .name }} + protocol: {{ .protocol }} + {{- end }} + {{- end }} + {{- if .Values.telegraf.enabled }} + - containerPort: 9273 + name: telegraf-sc + protocol: TCP + {{- end }} + {{- with .Values.deployment.probes }} + {{- with .readiness }} + readinessProbe: + {{- with .failureThreshold }} + failureThreshold: {{ . }} + {{- end }} + httpGet: + path: {{ .path }} + port: {{ .port }} + scheme: {{ .scheme }} + {{- with .periodSeconds }} + periodSeconds: {{ . }} + {{- end }} + {{- with .successThreshold }} + successThreshold: {{ . }} + {{- end }} + {{- with .timeoutSeconds}} + timeoutSeconds: {{ . }} + {{- end }} + initialDelaySeconds: {{ .initialDelaySeconds }} + {{- end }} + {{- end }} + {{- with .Values.deployment.resources }} + resources: + {{- with .limits }} + limits: + {{- with .memory }} + memory: "{{ . }}" + {{- end }} + {{- with .cpu }} + cpu: "{{ . }}" + {{- end }} + {{- end }} + {{- with .requests }} + requests: + {{- with .memory }} + memory: "{{ . }}" + {{- end }} + {{- with .cpu }} + cpu: "{{ . }}" + {{- end }} + {{- end }} + {{- end }} + + {{- with .Values.deployment.hostAliases }} + hostAliases: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.deployment.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.deployment.nodeSelector }} + nodeSelector: + {{- toYaml . 
| nindent 8 }}
+      {{- end }}
+      terminationGracePeriodSeconds: {{ .Values.deployment.terminationGracePeriodSeconds | default "300" }}
+      {{- if .Values.deployment.serviceAccount.enabled }}
+      serviceAccountName: {{ .Values.namespace }}
+      {{- end }}
+      {{- if .Values.securityContext }}
+      securityContext:
+        {{- toYaml .Values.securityContext | nindent 8 }}
+      {{- end }}
+{{- end }}
diff --git a/helm-charts/horizon/templates/external-secrets.yaml b/helm-charts/horizon/templates/external-secrets.yaml
new file mode 100644
index 00000000..0d602056
--- /dev/null
+++ b/helm-charts/horizon/templates/external-secrets.yaml
@@ -0,0 +1,30 @@
+{{- if and .Values.externalSecret .Values.externalSecret.enabled }}
+apiVersion: external-secrets.io/v1beta1
+kind: ExternalSecret
+metadata:
+  labels:
+    {{- include "labels.common" . | nindent 4 }}
+{{- if .Values.externalSecret.annotations }}
+  annotations:
+{{ toYaml .Values.externalSecret.annotations | indent 4 }}
+{{- end }}
+  {{- if and .Values.deployment .Values.deployment.enabled }}
+  name: {{ .Values.deployment.envFrom.secretRef }}
+  {{- end}}
+  namespace: {{ .Values.namespace }}
+spec:
+  dataFrom:
+    - extract:
+        conversionStrategy: Default
+        key: {{ .Values.externalSecret.path }}
+  refreshInterval: 15s
+  secretStoreRef:
+    kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }}
+    name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }}
+  target:
+    creationPolicy: Owner
+    deletionPolicy: Retain
+    {{- if and .Values.deployment .Values.deployment.enabled }}
+    name: {{ .Values.deployment.envFrom.secretRef }}
+    {{- end}}
+{{- end }}
diff --git a/helm-charts/horizon/templates/httpproxy.yaml b/helm-charts/horizon/templates/httpproxy.yaml
new file mode 100644
index 00000000..94fe8f3e
--- /dev/null
+++ b/helm-charts/horizon/templates/httpproxy.yaml
@@ -0,0 +1,50 @@
+{{- if and .Values.ingress .Values.ingress.enabled -}}
+{{- if .Values.createContourGateway -}}
+{{- if or ( eq "contour-internal" 
.Values.ingress.ingressClassName ) ( eq "contour-external" .Values.ingress.ingressClassName ) ( eq "contour-internal-0" .Values.ingress.ingressClassName ) ( eq "contour-internal-1" .Values.ingress.ingressClassName ) ( eq "contour-external-0" .Values.ingress.ingressClassName ) ( eq "contour-external-1" .Values.ingress.ingressClassName ) }} +{{- $servicePortNumber := .Values.ingress.servicePortNumber -}} +{{- $pathType := .Values.ingress.pathType -}} +{{- $namespace := .Values.namespace -}} +{{- $ingressClassName := .Values.ingress.ingressClassName -}} +{{ $count := 0 | int }} +{{- range .Values.ingress.hosts }} +apiVersion: projectcontour.io/v1 +kind: HTTPProxy +metadata: + namespace: {{ $namespace }} + name: {{ $namespace }}-{{ $count }} + labels: +{{ include "labels.common" $ | indent 4 }} +{{ include "labels.chart" $ | indent 4 }} + annotations: + projectcontour.io/ingress.class: {{ $.Values.ingress.ingressClassName }} +spec: + ingressClassName: {{ $ingressClassName }} + virtualhost: + fqdn: "{{ .host }}" + includes: + {{- range .paths }} + - conditions: + {{- if or ( eq ( lower .pathType ) "prefix" ) ( eq ( lower .pathType ) "implementationspecific") }} + - prefix: {{ .path }} + {{- end }} + {{- if ( eq ( lower .pathType ) "exact" ) }} + - header: + name: :path + exact: {{ .path }} + - prefix: {{ .path }} + {{- end }} + {{- if .targetService }} + name: {{ .targetService | replace "/" "-" }} + namespace: {{ (split "/" .targetService)._0 }} + {{- else }} + name: {{ $namespace }} + namespace: {{ $namespace }} + {{- end }} + {{- end }} + {{ $count = add1 $count }} +--- + +{{- end -}} +{{- end -}} +{{- end -}} +{{- end -}} diff --git a/helm-charts/horizon/templates/otel-secret.yaml b/helm-charts/horizon/templates/otel-secret.yaml new file mode 100644 index 00000000..c2514b7d --- /dev/null +++ b/helm-charts/horizon/templates/otel-secret.yaml @@ -0,0 +1,26 @@ +{{ if .Values.otel_enabled }} +apiVersion: external-secrets.io/v1beta1 +kind: ExternalSecret +metadata: + 
labels: + {{- include "labels.common" . | nindent 4 }} + annotations: + flagger.app/config-tracking: disabled + name: {{ if and .Values.deployment .Values.deployment.envFrom }}{{ .Values.deployment.envFrom.secretRef }}{{ else }}{{ .Values.namespace }}{{ end }}-otel + namespace: {{ .Values.namespace }} +spec: + dataFrom: + - extract: + conversionStrategy: Default + key: {{ .Values.infrastructure.vault.otelTokenPath | default "org/prd/cntr/devop/coralogix-token" }} + refreshInterval: 15s + secretStoreRef: + kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }} + name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }} + target: + creationPolicy: Owner + deletionPolicy: Retain + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }}-otel + {{- end}} +{{ end }} diff --git a/helm-charts/horizon/templates/pdb.yaml b/helm-charts/horizon/templates/pdb.yaml new file mode 100644 index 00000000..db87f89a --- /dev/null +++ b/helm-charts/horizon/templates/pdb.yaml @@ -0,0 +1,36 @@ +{{- if or (and .Values.podDisruptionBudget .Values.podDisruptionBudget.enabled) (and .Values.deployment .Values.deployment.enabled) }} +{{- if or ( and .Values.deployment (.Values.deployment.enabled) (.Values.autoscaling.enabled) ( gt (int .Values.autoscaling.minReplicas) 1)) ( and (eq .Values.autoscaling.enabled false) .Values.deployment ( gt ( int .Values.deployment.replicaCount) 1 )) }} +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . 
| indent 4 }}
+spec:
+  {{- if .Values.podDisruptionBudget.enabled }}
+  maxUnavailable: {{ .Values.podDisruptionBudget.maxUnavailable }}
+  {{- else }}
+  {{- if and .Values.deployment .Values.deployment.enabled }}
+  {{- if (eq .Values.deployment.updateStrategy.strategy.type "RollingUpdate") }}
+  maxUnavailable: {{ .Values.deployment.updateStrategy.strategy.rollingUpdate.maxSurge | default "10%" }}
+  {{- else }}
+  maxUnavailable: "10%"
+  {{- end }}
+  {{- else }}
+  maxUnavailable: "10%"
+  {{- end }}
+  {{- end }}
+  {{- if .Values.podDisruptionBudget.minAvailable }}
+  minAvailable: {{ .Values.podDisruptionBudget.minAvailable }}
+  {{- end }}
+  selector:
+    matchLabels:
+    {{- if and (.Values.canary.enabled) (eq .Values.labels.env "prod") }}
+      {{- include "labels.primary-selector" . | nindent 6 }}
+    {{- else }}
+      {{- include "labels.selector" . | nindent 6 }}
+    {{- end }}
+{{- end }}
+{{- end }}
diff --git a/helm-charts/horizon/templates/scaledobject.yaml b/helm-charts/horizon/templates/scaledobject.yaml
new file mode 100644
index 00000000..259bfea2
--- /dev/null
+++ b/helm-charts/horizon/templates/scaledobject.yaml
@@ -0,0 +1,56 @@
+{{- if and .Values.autoscaling .Values.autoscaling.enabled }}
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  labels:
+    {{- include "labels.common" . 
| nindent 4 }}
+  namespace: {{ .Values.namespace }}
+  name: {{ .Values.namespace }}
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: {{ .Values.namespace }}
+  pollingInterval: {{ .Values.autoscaling.pollingInterval }}
+  {{- if ( default false (.Values.disasterRecovery).enabled ) }}
+  minReplicaCount: 1
+  {{- else }}
+  minReplicaCount: {{ .Values.canary.minCanaryReplicas | default $.Values.autoscaling.minReplicas }}
+  {{- end }}
+  maxReplicaCount: {{ .Values.canary.maxCanaryReplicas | default $.Values.autoscaling.maxReplicas }}
+  advanced:
+    horizontalPodAutoscalerConfig:
+      behavior:
+        scaleDown:
+          stabilizationWindowSeconds: {{ .Values.autoscaling.scaledown.stabilizationWindowSeconds }}
+          policies:
+          {{- range .Values.autoscaling.scaledown.policies }}
+          - type: {{ .type }}
+            value: {{ .value }}
+            periodSeconds: {{ .periodseconds }}
+          {{- end }}
+          selectPolicy: {{ .Values.autoscaling.scaledown.selectpolicy }}
+        scaleUp:
+          stabilizationWindowSeconds: {{ .Values.autoscaling.scaleup.stabilizationWindowSeconds }}
+          policies:
+          {{- range .Values.autoscaling.scaleup.policies }}
+          - type: {{ .type }}
+            value: {{ .value }}
+            periodSeconds: {{ .periodseconds }}
+          {{- end }}
+          selectPolicy: {{ .Values.autoscaling.scaleup.selectpolicy }}
+  triggers:
+  {{- if ( default false (.Values.disasterRecovery).enabled ) }}
+  {{- range $.Values.autoscaling.triggers }}
+  {{- if or (eq .type "cpu") (eq .type "memory") }}
+  - metadata:
+      {{- toYaml .metadata | nindent 8 }}
+    type: {{ .type }}
+    metricType: "Utilization"
+  {{- end }}
+  {{- end }}
+  {{- else }}
+  {{- toYaml .Values.autoscaling.triggers | nindent 2 }}
+  {{- end }}
+
+{{- end }}
diff --git a/helm-charts/horizon/templates/service.yaml b/helm-charts/horizon/templates/service.yaml
new file mode 100644
index 00000000..8fcc5bc8
--- /dev/null
+++ b/helm-charts/horizon/templates/service.yaml
@@ -0,0 +1,27 @@
+{{- if and .Values.service .Values.service.enabled }}
+{{- if or (eq .Values.canary.enabled false) ( and (.Values.canary.enabled) (ne .Values.labels.env "prod")) }}
+apiVersion: v1
+kind: Service
+metadata:
+  labels:
+    {{- include "labels.common" . | nindent 4 }}
+    {{- include "labels.chart" . | nindent 4 }}
+  namespace: {{ .Values.namespace }}
+  name: {{ .Values.namespace }}
+{{- if .Values.service.annotations }}
+  annotations:
+{{ toYaml .Values.service.annotations | indent 4 }}
+{{- end }}
+spec:
+  type: {{ .Values.service.type }}
+  ports:
+  {{- range .Values.service.ports }}
+  - name: {{ .name }}
+    port: {{ .port }}
+    protocol: {{ .protocol }}
+    targetPort: {{ .targetPort }}
+  {{- end }}
+  selector:
+    {{- include "labels.selector" . | nindent 4 }}
+{{- end }}
+{{- end }}
diff --git a/helm-charts/horizon/templates/serviceaccount.yaml b/helm-charts/horizon/templates/serviceaccount.yaml
new file mode 100644
index 00000000..f05362f5
--- /dev/null
+++ b/helm-charts/horizon/templates/serviceaccount.yaml
@@ -0,0 +1,14 @@
+{{- if and .Values.deployment (.Values.deployment.enabled) (.Values.deployment.serviceAccount.enabled) }}
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  namespace: {{ .Values.namespace }}
+  name: {{ .Values.namespace }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+  {{- with .Values.deployment.serviceAccount.annotations }}
+  annotations:
+    {{- toYaml . 
| nindent 4 }} + {{- end }} +{{- end }} diff --git a/helm-charts/horizon/values.yaml b/helm-charts/horizon/values.yaml new file mode 100644 index 00000000..29b5ee61 --- /dev/null +++ b/helm-charts/horizon/values.yaml @@ -0,0 +1,324 @@ +# Default values for horizon helm chart + +namespace: prd-horizon +applicationName: horizon +replicaCount: 2 + +labels: + env: prd + team: bharatml + bu: ml + priority: p1 + priority_v2: cp3 + service_type: "" + +priorityClassName: "" + +telegraf: + enabled: false + +otel_enabled: false + +infrastructure: + secretStore: + name: vault-backend + kind: ClusterSecretStore + vault: + basePath: "" + otelTokenPath: "" + +deployment: + enabled: true + replicaCount: 2 + revisionHistoryLimit: 3 + image: + repository: ghcr.io/meesho/horizon + tag: latest + pullPolicy: IfNotPresent + ports: + - containerPort: 8082 + name: http + protocol: TCP + probes: + liveness: + path: /health + port: 8082 + scheme: HTTP + initialDelaySeconds: 30 + periodSeconds: 10 + failureThreshold: 3 + successThreshold: 1 + timeoutSeconds: 5 + readiness: + path: /health + port: 8082 + scheme: HTTP + initialDelaySeconds: 20 + periodSeconds: 10 + failureThreshold: 3 + successThreshold: 1 + timeoutSeconds: 5 + resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "1000m" + env: + # Application + - name: APP_NAME + value: "horizon" + - name: APP_ENV + value: "PROD" + - name: APP_PORT + value: "8082" + - name: APP_LOG_LEVEL + value: "DEBUG" + - name: APP_METRIC_SAMPLING_RATE + value: "1" + - name: APP_GC_PERCENTAGE + value: "1" + # MySQL master + - name: MYSQL_MASTER_MAX_POOL_SIZE + value: "5" + - name: MYSQL_MASTER_MIN_POOL_SIZE + value: "2" + - name: MYSQL_MASTER_PASSWORD + value: "root" + - name: MYSQL_MASTER_HOST + value: "mysql" + - name: MYSQL_MASTER_PORT + value: "3306" + - name: MYSQL_DB_NAME + value: "testdb" + - name: MYSQL_MASTER_USERNAME + value: "root" + # MySQL slave + - name: MYSQL_SLAVE_MAX_POOL_SIZE + value: "5" + - name: 
MYSQL_SLAVE_MIN_POOL_SIZE + value: "2" + - name: MYSQL_SLAVE_PASSWORD + value: "root" + - name: MYSQL_SLAVE_HOST + value: "mysql" + - name: MYSQL_SLAVE_PORT + value: "3306" + - name: MYSQL_SLAVE_USERNAME + value: "root" + - name: MYSQL_ACTIVE_CONFIG_IDS + value: "2" + # Etcd + - name: ETCD_WATCHER_ENABLED + value: "true" + - name: ETCD_SERVER + value: "etcd:2379" + # ONFS + - name: ONLINE_FEATURE_STORE_APP_NAME + value: "onfs" + # Redis + - name: REDIS_FAILOVER_ACTIVE_CONFIG_IDS + value: "4" + # ScyllaDB (primary) + - name: SCYLLA_1_CONTACT_POINTS + value: "scylla" + - name: SCYLLA_1_KEYSPACE + value: "onfs" + - name: SCYLLA_1_NUM_CONNS + value: "1" + - name: SCYLLA_1_PORT + value: "9042" + - name: SCYLLA_1_TIMEOUT_IN_MS + value: "300000" + - name: SCYLLA_1_PASSWORD + value: "" + - name: SCYLLA_1_USERNAME + value: "" + - name: SCYLLA_ACTIVE_CONFIG_IDS + value: "1" + # Caching + - name: DISTRIBUTED_CACHE_ACTIVE_CONFIG_IDS + value: "2" + - name: IN_MEMORY_CACHE_ACTIVE_CONFIG_IDS + value: "3" + # CORS + - name: CORS_ORIGINS + value: "http://localhost:3000,http://localhost:8080" + # Service names + - name: HORIZON_APP_NAME + value: "horizon" + - name: NUMERIX_APP_NAME + value: "numerix" + - name: INFERFLOW_APP_NAME + value: "inferflow" + - name: IS_DUMMY_MODEL_ENABLED + value: "true" + # ArgoCD + - name: ARGOCD_API + value: "http://host.docker.internal:8087" + - name: ARGOCD_TOKEN + value: "" + - name: ARGOCD_NAMESPACE + value: "argocd" + - name: ARGOCD_DESTINATION_NAME + value: "in-cluster" + - name: ARGOCD_PROJECT + value: "default" + - name: ARGOCD_HELMCHART_PATH + value: "1.0.0" + - name: ARGOCD_SYNC_POLICY_OPTIONS + value: "CreateNamespace=true" + - name: ARGOCD_INSECURE + value: "true" + # Model path + - name: LOCAL_MODEL_PATH + value: "/tmp/models" + # Environments + - name: SUPPORTED_ENVIRONMENTS + value: "prd,stg,int" + - name: WORKING_ENV + value: "prd" + # Service config + - name: SERVICE_CONFIG_SOURCE + value: "local" + - name: SERVICE_CONFIG_REPO + value: 
"BharatMLStack-configs" + - name: SERVICE_CONFIG_PATH + value: "/app/configs" + # GitHub + - name: REPOSITORY_NAME + value: "" + - name: BRANCH_NAME + value: "main" + - name: GITHUB_APP_ID + value: "" + - name: GITHUB_INSTALLATION_ID + value: "" + - name: GITHUB_PRIVATE_KEY + value: "/app/configs/github.pem" + - name: GITHUB_OWNER + value: "" + - name: GITHUB_COMMIT_AUTHOR + value: "horizon-bot" + - name: GITHUB_COMMIT_EMAIL + value: "devops@example.com" + # GCP + - name: GCP_PROJECT_ID + value: "" + - name: GCS_ENABLED + value: "true" + - name: GCS_MODEL_BUCKET + value: "" + - name: GCS_MODEL_BASE_PATH + value: "" + - name: CLOUDSDK_CONFIG + value: "/home/nonroot/.config/gcloud" + # Skye + - name: SKYE_APP_NAME + value: "skye" + - name: SKYE_SCYLLA_ACTIVE_CONFIG_IDS + value: "2" + - name: SKYE_HOST + value: "scylla" + - name: SKYE_PORT + value: "9042" + - name: SKYE_AUTH_TOKEN + value: "" + - name: SKYE_DEADLINE_EXCEED_MS + value: "2000" + # ScyllaDB (Skye) + - name: SCYLLA_2_CONTACT_POINTS + value: "scylla" + - name: SCYLLA_2_PORT + value: "9042" + - name: SCYLLA_2_KEYSPACE + value: "skye" + - name: SCYLLA_2_USERNAME + value: "" + - name: SCYLLA_2_PASSWORD + value: "" + - name: HORIZON_TO_SKYE_SCYLLA_CONF_ID_MAP + value: "2:1" + # Skye trigger (OSS replacement for Airflow) + - name: USE_SKYE_TRIGGER_INSTEAD_OF_AIRFLOW + value: "true" + - name: SKYE_TRIGGER_URL + value: "http://skye-trigger:8080" + volumes: + - name: configs + configMap: + name: horizon-configs + volumeMounts: + - name: configs + mountPath: /app/configs + readOnly: true + serviceAccount: + enabled: false + annotations: {} + updateStrategy: + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 0 + maxSurge: 1 + terminationGracePeriodSeconds: 30 + +service: + enabled: true + type: ClusterIP + ports: + - name: http + port: 80 + targetPort: 8082 + protocol: TCP + +autoscaling: + enabled: false + minReplicas: 2 + maxReplicas: 10 + pollingInterval: 30 + scaledown: + 
stabilizationWindowSeconds: 300 + policies: + - type: Percent + value: 10 + periodseconds: 60 + selectpolicy: Min + scaleup: + stabilizationWindowSeconds: 0 + policies: + - type: Percent + value: 50 + periodseconds: 60 + selectpolicy: Max + triggers: + - type: cpu + metadata: + value: "70" + metricType: Utilization + +ingress: + enabled: false + ingressClassName: contour-internal +createContourGateway: false + +externalSecret: + enabled: false + path: "" + +configMap: + enabled: false + +canary: + enabled: false + promURL: "" + slackChannel: "" + slackWebhookURL: "" + +podDisruptionBudget: + enabled: false + maxUnavailable: "10%" + +disasterRecovery: + enabled: false diff --git a/helm-charts/inferflow/Chart.yaml b/helm-charts/inferflow/Chart.yaml new file mode 100644 index 00000000..58140987 --- /dev/null +++ b/helm-charts/inferflow/Chart.yaml @@ -0,0 +1,10 @@ +apiVersion: v2 +name: inferflow +description: A Helm chart for the Inferflow inference orchestration service +type: application +version: 1.0.0 +appVersion: "1.0.0" + +maintainers: + - name: BharatMLStack Team + email: ml-oss@meesho.com diff --git a/helm-charts/inferflow/templates/NOTES.txt b/helm-charts/inferflow/templates/NOTES.txt new file mode 100644 index 00000000..35a4f3c2 --- /dev/null +++ b/helm-charts/inferflow/templates/NOTES.txt @@ -0,0 +1,18 @@ +{{ .Chart.Name }} has been deployed. + +Namespace: {{ .Values.namespace }} +Application: {{ .Values.applicationName }} + +{{- if .Values.service.enabled }} +Service: {{ .Values.namespace }}:{{ (index .Values.service.ports 0).port }} +{{- end }} + +{{- if .Values.ingress.enabled }} +Ingress is enabled. 
+{{- end }} + +{{- if .Values.autoscaling.enabled }} +Autoscaling: {{ .Values.autoscaling.minReplicas }} - {{ .Values.autoscaling.maxReplicas }} replicas +{{- else }} +Replicas: {{ .Values.replicaCount }} +{{- end }} diff --git a/helm-charts/inferflow/templates/_helpers.tpl b/helm-charts/inferflow/templates/_helpers.tpl new file mode 100644 index 00000000..168e0342 --- /dev/null +++ b/helm-charts/inferflow/templates/_helpers.tpl @@ -0,0 +1,70 @@ +{{/* vim: set filetype=mustache: */}} +{{/* +Expand the name of the chart. +*/}} +{{- define "name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}} +{{- end -}} + +{{/* +Create a default fully qualified app name. +We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec). +*/}} +{{- define "fullname" -}} +{{- $name := default .Chart.Name .Values.nameOverride -}} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}} +{{- end -}} + +{{- define "labels.selector" -}} +app: {{ .Values.namespace }} +{{- end -}} + +{{- define "labels.primary-selector" -}} +app: {{ .Values.namespace }}-primary +{{- end -}} + +{{- define "labels.common" -}} +{{ template "labels.selector" . 
}}
+{{- if and .Values.deployment .Values.deployment.image }}
+version: {{ .Values.deployment.image.tag }}
+{{- end }}
+env: {{ .Values.labels.env }}
+team: {{ .Values.labels.team }}
+bu: {{ .Values.labels.bu }}
+service: {{ .Values.applicationName }}
+priority: {{ .Values.labels.priority }}
+priority_v2: {{ .Values.labels.priority_v2 | default "cp3" }}
+primary_owner: {{ .Values.labels.primary_owner | default .Values.labels.team }}
+secondary_owner: {{ .Values.labels.secondary_owner | default .Values.labels.team }}
+service_type: {{ .Values.labels.service_type | default "" | replace "," "-" | quote }}
+{{- end -}}
+
+{{- define "labels.chart" -}}
+chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
+release: {{ .Release.Name | quote }}
+heritage: {{ .Release.Service | quote }}
+{{- end -}}
+
+{{/*
+Renders a value that contains template.
+Usage:
+{{ include "application.tplvalues.render" ( dict "value" .Values.path.to.the.Value "context" $) }}
+*/}}
+
+{{- define "application.tplvalues.render" -}}
+  {{- if typeIs "string" .value }}
+  {{- tpl .value .context }}
+  {{- else }}
+  {{- tpl (.value | toYaml) .context }}
+  {{- end }}
+{{- end -}}
+
+{{- define "canary.promURL" -}}
+{{- if .Values.canary.promURL }}
+{{- .Values.canary.promURL }}
+{{- else if eq .Values.labels.env "prod" }}
+https://prod-ops-metricsui.example.com/select/100/
+{{- else }}
+https://sb-ops-metricsui.example.com/select/100/
+{{- end }}
+{{- end -}}
diff --git a/helm-charts/inferflow/templates/alert-provider.yaml b/helm-charts/inferflow/templates/alert-provider.yaml
new file mode 100644
index 00000000..a300bb14
--- /dev/null
+++ b/helm-charts/inferflow/templates/alert-provider.yaml
@@ -0,0 +1,14 @@
+{{- if and .Values.canary (.Values.canary.enabled) .Values.canary.slackWebhookURL (ne .Values.canary.slackWebhookURL "") }}
+apiVersion: flagger.app/v1beta1
+kind: AlertProvider
+metadata:
+  name: flagger-status
+  namespace: {{ .Values.namespace }}
+spec:
+  type: slack
+  {{- if .Values.canary.slackChannel }}
+  
channel: {{ .Values.canary.slackChannel }} + {{- end }} + username: flagger + address: {{ .Values.canary.slackWebhookURL }} +{{- end }} diff --git a/helm-charts/inferflow/templates/configmap.yaml b/helm-charts/inferflow/templates/configmap.yaml new file mode 100644 index 00000000..fc527219 --- /dev/null +++ b/helm-charts/inferflow/templates/configmap.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.configMap .Values.configMap.enabled }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: {{ .Values.namespace }}-config + namespace: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} +data: + {{- range $key, $value := .Values.configMap.data }} + {{ $key }}: {{ $value | quote }} + {{- end }} +{{- end }} diff --git a/helm-charts/inferflow/templates/deployment.yaml b/helm-charts/inferflow/templates/deployment.yaml new file mode 100644 index 00000000..34cbb1ed --- /dev/null +++ b/helm-charts/inferflow/templates/deployment.yaml @@ -0,0 +1,237 @@ +{{- if and .Values.deployment .Values.deployment.enabled }} +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ .Values.namespace }} + namespace: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} +{{- include "labels.selector" . | nindent 4 }} +spec: + {{- with .Values.deployment.minReadySeconds }} + minReadySeconds: {{ . }} + {{- end }} + {{- if not .Values.autoscaling.enabled }} + replicas: {{ .Values.replicaCount }} + {{- end }} + revisionHistoryLimit: {{ .Values.deployment.revisionHistoryLimit }} + selector: + matchLabels: +{{- include "labels.selector" . | nindent 6 }} +{{- with .Values.deployment.updateStrategy }} +{{ toYaml . | indent 2 -}} +{{- end }} + template: + metadata: + annotations: + {{- with .Values.deployment.podAnnotations }} + {{- toYaml . 
| nindent 8 }} + {{- end }} + {{- if .Values.telegraf.enabled }} + telegraf.influxdata.com/class: "infra" + {{- end }} + labels: + {{- include "labels.common" . | nindent 8 }} + spec: + {{- if .Values.priorityClassName }} + priorityClassName: {{ .Values.priorityClassName }} + {{- end }} + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: topology.kubernetes.io/zone + whenUnsatisfiable: ScheduleAnyway + labelSelector: + matchLabels: +{{- include "labels.selector" . | nindent 12 }} + {{- if .Values.deployment.image.pullSecret }} + imagePullSecrets: + - name: {{ .Values.deployment.image.pullSecret }} + {{- end }} + {{- if .Values.deployment.volumes }} + volumes: + {{- toYaml .Values.deployment.volumes | nindent 8 }} + {{- end }} + {{- if .Values.deployment.initContainers }} + initContainers: + {{- toYaml .Values.deployment.initContainers | nindent 8 }} + {{- end }} + containers: + - name: {{ .Values.applicationName }} + {{- if .Values.deployment.volumeMounts }} + volumeMounts: + {{- toYaml .Values.deployment.volumeMounts | nindent 12 }} + {{- end }} + {{- if .Values.deployment.command }} + command: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.command "context" $) | nindent 12 }} + {{- end }} + {{- if .Values.deployment.args }} + args: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.args "context" $) | nindent 12 }} + {{- end }} + image: "{{ .Values.deployment.image.repository }}:{{ .Values.deployment.image.tag }}" + imagePullPolicy: {{ .Values.deployment.image.pullPolicy }} + {{- if .Values.deployment.lifecycle }} + lifecycle: {{ toYaml .Values.deployment.lifecycle | nindent 12 }} + {{- end }} + {{- with .Values.deployment.probes }} + {{- with .liveness }} + livenessProbe: + {{- with .failureThreshold }} + failureThreshold: {{ . }} + {{- end }} + httpGet: + path: {{ .path }} + port: {{ .port }} + scheme: {{ .scheme }} + {{- with .periodSeconds }} + periodSeconds: {{ . 
}} + {{- end }} + {{- with .successThreshold }} + successThreshold: {{ . }} + {{- end }} + {{- with .timeoutSeconds }} + timeoutSeconds: {{ . }} + {{- end }} + initialDelaySeconds: {{ .initialDelaySeconds }} + {{- end }} + {{- end }} + {{- if or .Values.externalSecret.enabled .Values.otel_enabled (and .Values.configMap .Values.configMap.enabled) }} + envFrom: + {{- if .Values.externalSecret.enabled }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }} + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }}-dr + {{- end }} + {{- end }} + {{- if .Values.otel_enabled }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }}-otel + {{- end }} + {{- if and .Values.configMap .Values.configMap.enabled }} + - configMapRef: + name: {{ .Values.namespace }}-config + {{- end }} + {{- end }} + env: + - name: TZ + value: Asia/Kolkata + - name: NODE_IP + valueFrom: + fieldRef: + fieldPath: spec.nodeName + - name: POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name + - name: POD_NAMESPACE + valueFrom: + fieldRef: + fieldPath: metadata.namespace + - name: POD_IP + valueFrom: + fieldRef: + fieldPath: status.podIP + {{- if .Values.telegraf.enabled }} + - name: TELEGRAF_UDP_HOST + valueFrom: + fieldRef: + fieldPath: status.podIP + {{- end }} + {{- if .Values.otel_enabled }} + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: http://$(NODE_IP):4317 + {{- end }} + {{- with .Values.deployment.env }} + {{- range . }} + - name: {{ .name }} + value: "{{ .value }}" + {{- end }} + {{- end }} + {{- with .Values.deployment.ports }} + ports: + {{- range . 
}} + - containerPort: {{ .containerPort }} + name: {{ .name }} + protocol: {{ .protocol }} + {{- end }} + {{- end }} + {{- if .Values.telegraf.enabled }} + - containerPort: 9273 + name: telegraf-sc + protocol: TCP + {{- end }} + {{- with .Values.deployment.probes }} + {{- with .readiness }} + readinessProbe: + {{- with .failureThreshold }} + failureThreshold: {{ . }} + {{- end }} + httpGet: + path: {{ .path }} + port: {{ .port }} + scheme: {{ .scheme }} + {{- with .periodSeconds }} + periodSeconds: {{ . }} + {{- end }} + {{- with .successThreshold }} + successThreshold: {{ . }} + {{- end }} + {{- with .timeoutSeconds}} + timeoutSeconds: {{ . }} + {{- end }} + initialDelaySeconds: {{ .initialDelaySeconds }} + {{- end }} + {{- end }} + {{- with .Values.deployment.resources }} + resources: + {{- with .limits }} + limits: + {{- with .memory }} + memory: "{{ . }}" + {{- end }} + {{- with .cpu }} + cpu: "{{ . }}" + {{- end }} + {{- end }} + {{- with .requests }} + requests: + {{- with .memory }} + memory: "{{ . }}" + {{- end }} + {{- with .cpu }} + cpu: "{{ . }}" + {{- end }} + {{- end }} + {{- end }} + + {{- with .Values.deployment.hostAliases }} + hostAliases: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.deployment.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.deployment.nodeSelector }} + nodeSelector: + {{- toYaml . 
| nindent 8 }} + {{- end }} + terminationGracePeriodSeconds: {{ .Values.deployment.terminationGracePeriodSeconds | default "300" }} + {{- if .Values.deployment.serviceAccount.enabled }} + serviceAccountName: {{ .Values.namespace }} + {{- end }} + {{- if .Values.securityContext }} + securityContext: + {{- toYaml .Values.securityContext | nindent 12 }} + {{- end }} +{{- end }} diff --git a/helm-charts/inferflow/templates/external-secrets.yaml b/helm-charts/inferflow/templates/external-secrets.yaml new file mode 100644 index 00000000..0d602056 --- /dev/null +++ b/helm-charts/inferflow/templates/external-secrets.yaml @@ -0,0 +1,30 @@ +{{- if and .Values.externalSecret .Values.externalSecret.enabled }} +apiVersion: external-secrets.io/v1beta1 +kind: ExternalSecret +metadata: + labels: + {{- include "labels.common" . | nindent 4 }} +{{- if .Values.externalSecret.annotations }} + annotations: +{{ toYaml .Values.externalSecret.annotations | indent 4 }} +{{- end }} + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }} + {{- end}} + namespace: {{ .Values.namespace }} +spec: + dataFrom: + - extract: + conversionStrategy: Default + key: {{ .Values.externalSecret.path }} + refreshInterval: 15s + secretStoreRef: + kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }} + name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }} + target: + creationPolicy: Owner + deletionPolicy: Retain + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }} + {{- end}} +{{- end }} diff --git a/helm-charts/inferflow/templates/httpproxy.yaml b/helm-charts/inferflow/templates/httpproxy.yaml new file mode 100644 index 00000000..94fe8f3e --- /dev/null +++ b/helm-charts/inferflow/templates/httpproxy.yaml @@ -0,0 +1,50 @@ +{{- if and .Values.ingress .Values.ingress.enabled -}} +{{- if .Values.createContourGateway -}} +{{- if or ( eq
"contour-internal" .Values.ingress.ingressClassName ) ( eq "contour-external" .Values.ingress.ingressClassName ) ( eq "contour-internal-0" .Values.ingress.ingressClassName ) ( eq "contour-internal-1" .Values.ingress.ingressClassName ) ( eq "contour-external-0" .Values.ingress.ingressClassName ) ( eq "contour-external-1" .Values.ingress.ingressClassName ) }} +{{- $servicePortNumber := .Values.ingress.servicePortNumber -}} +{{- $pathType := .Values.ingress.pathType -}} +{{- $namespace := .Values.namespace -}} +{{- $ingressClassName := .Values.ingress.ingressClassName -}} +{{ $count := 0 | int }} +{{- range .Values.ingress.hosts }} +apiVersion: projectcontour.io/v1 +kind: HTTPProxy +metadata: + namespace: {{ $namespace }} + name: {{ $namespace }}-{{ $count }} + labels: +{{ include "labels.common" $ | indent 4 }} +{{ include "labels.chart" $ | indent 4 }} + annotations: + projectcontour.io/ingress.class: {{ $.Values.ingress.ingressClassName }} +spec: + ingressClassName: {{ $ingressClassName }} + virtualhost: + fqdn: "{{ .host }}" + includes: + {{- range .paths }} + - conditions: + {{- if or ( eq ( lower .pathType ) "prefix" ) ( eq ( lower .pathType ) "implementationspecific") }} + - prefix: {{ .path }} + {{- end }} + {{- if ( eq ( lower .pathType ) "exact" ) }} + - header: + name: :path + exact: {{ .path }} + - prefix: {{ .path }} + {{- end }} + {{- if .targetService }} + name: {{ .targetService | replace "/" "-" }} + namespace: {{ (split "/" .targetService)._0 }} + {{- else }} + name: {{ $namespace }} + namespace: {{ $namespace }} + {{- end }} + {{- end }} + {{ $count = add1 $count }} +--- + +{{- end -}} +{{- end -}} +{{- end -}} +{{- end -}} diff --git a/helm-charts/inferflow/templates/otel-secret.yaml b/helm-charts/inferflow/templates/otel-secret.yaml new file mode 100644 index 00000000..c2514b7d --- /dev/null +++ b/helm-charts/inferflow/templates/otel-secret.yaml @@ -0,0 +1,26 @@ +{{ if .Values.otel_enabled }} +apiVersion: external-secrets.io/v1beta1 +kind: 
ExternalSecret +metadata: + labels: + {{- include "labels.common" . | nindent 4 }} + annotations: + flagger.app/config-tracking: disabled + name: {{ if and .Values.deployment .Values.deployment.envFrom }}{{ .Values.deployment.envFrom.secretRef }}{{ else }}{{ .Values.namespace }}{{ end }}-otel + namespace: {{ .Values.namespace }} +spec: + dataFrom: + - extract: + conversionStrategy: Default + key: {{ .Values.infrastructure.vault.otelTokenPath | default "org/prd/cntr/devop/coralogix-token" }} + refreshInterval: 15s + secretStoreRef: + kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }} + name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }} + target: + creationPolicy: Owner + deletionPolicy: Retain + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }}-otel + {{- end}} +{{ end }} diff --git a/helm-charts/inferflow/templates/pdb.yaml b/helm-charts/inferflow/templates/pdb.yaml new file mode 100644 index 00000000..db87f89a --- /dev/null +++ b/helm-charts/inferflow/templates/pdb.yaml @@ -0,0 +1,36 @@ +{{- if or (and .Values.podDisruptionBudget .Values.podDisruptionBudget.enabled) (and .Values.deployment .Values.deployment.enabled) }} +{{- if or ( and .Values.deployment (.Values.deployment.enabled) (.Values.autoscaling.enabled) ( gt (int .Values.autoscaling.minReplicas) 1)) ( and (eq .Values.autoscaling.enabled false) .Values.deployment ( gt ( int .Values.deployment.replicaCount) 1 )) }} +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . 
| indent 4 }} +spec: + {{- if .Values.podDisruptionBudget.enabled }} + maxUnavailable: {{ .Values.podDisruptionBudget.maxUnavailable }} + {{- else }} + {{- if and .Values.deployment .Values.deployment.enabled }} + {{- if (eq .Values.deployment.updateStrategy.strategy.type "RollingUpdate") }} + maxUnavailable: {{ .Values.deployment.updateStrategy.strategy.rollingUpdate.maxSurge | default "10%" }} + {{- else }} + maxUnavailable: "10%" + {{- end }} + {{ else }} + maxUnavailable: "10%" + {{- end }} + {{- end }} + {{- if .Values.podDisruptionBudget.minAvailable }} + minAvailable: {{ .Values.podDisruptionBudget.minAvailable }} + {{- end }} + selector: + matchLabels: + {{- if and (.Values.canary.enabled) (eq .Values.labels.env "prod") }} + {{- include "labels.primary-selector" . | nindent 6 }} + {{- else }} + {{- include "labels.selector" . | nindent 6 }} + {{- end }} +{{- end }} +{{- end }} diff --git a/helm-charts/inferflow/templates/scaledobject.yaml b/helm-charts/inferflow/templates/scaledobject.yaml new file mode 100644 index 00000000..259bfea2 --- /dev/null +++ b/helm-charts/inferflow/templates/scaledobject.yaml @@ -0,0 +1,56 @@ +{{- if and .Values.autoscaling .Values.autoscaling.enabled }} +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + labels: + {{- include "labels.common" . 
| nindent 4 }} + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: {{ .Values.namespace }} + pollingInterval: {{ .Values.autoscaling.pollingInterval }} + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + minReplicaCount: 1 + {{- else }} + minReplicaCount: {{ .Values.canary.minCanaryReplicas | default $.Values.autoscaling.minReplicas }} + {{- end }} + maxReplicaCount: {{ .Values.canary.maxCanaryReplicas | default $.Values.autoscaling.maxReplicas }} + advanced: + horizontalPodAutoscalerConfig: + behavior: + scaleDown: + stabilizationWindowSeconds: {{ .Values.autoscaling.scaledown.stabilizationWindowSeconds }} + policies: + {{- range .Values.autoscaling.scaledown.policies }} + - type: {{ .type }} + value: {{ .value }} + periodSeconds: {{ .periodseconds }} + {{- end }} + selectPolicy: {{ .Values.autoscaling.scaledown.selectpolicy }} + scaleUp: + stabilizationWindowSeconds: {{ .Values.autoscaling.scaleup.stabilizationWindowSeconds }} + policies: + {{- range .Values.autoscaling.scaleup.policies }} + - type: {{ .type }} + value: {{ .value }} + periodSeconds: {{ .periodseconds }} + {{- end }} + selectPolicy: {{ .Values.autoscaling.scaleup.selectpolicy }} + triggers: + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + {{- range $.Values.autoscaling.triggers }} + {{- if or (eq .type "cpu") (eq .type "memory") }} + - metadata: + {{- toYaml .metadata | nindent 8 }} + type: {{ .type }} + metricType: "Utilization" + {{- end }} + {{- end }} + {{- else }} + {{- toYaml .Values.autoscaling.triggers | nindent 2 }} + {{ end }} + +{{- end }} diff --git a/helm-charts/inferflow/templates/service.yaml b/helm-charts/inferflow/templates/service.yaml new file mode 100644 index 00000000..8fcc5bc8 --- /dev/null +++ b/helm-charts/inferflow/templates/service.yaml @@ -0,0 +1,27 @@ +{{- if and .Values.service .Values.service.enabled }} +{{- if or (eq .Values.canary.enabled
false) ( and (.Values.canary.enabled) (ne .Values.labels.env "prod")) }} +apiVersion: v1 +kind: Service +metadata: + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} +{{- if .Values.service.annotations }} + annotations: +{{ toYaml .Values.service.annotations | indent 4 }} +{{- end }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} +spec: + type: {{ .Values.service.type }} + ports: + {{- range .Values.service.ports }} + - name: {{ .name }} + port: {{ .port }} + protocol: {{ .protocol }} + targetPort: {{ .targetPort }} + {{- end }} + selector: + {{- include "labels.selector" . | nindent 4 }} +{{- end }} +{{- end }} diff --git a/helm-charts/inferflow/templates/serviceaccount.yaml b/helm-charts/inferflow/templates/serviceaccount.yaml new file mode 100644 index 00000000..f05362f5 --- /dev/null +++ b/helm-charts/inferflow/templates/serviceaccount.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.deployment (.Values.deployment.enabled) (.Values.deployment.serviceAccount.enabled) }} +apiVersion: v1 +kind: ServiceAccount +metadata: + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} + {{- with .Values.deployment.serviceAccount.annotations }} + annotations: + {{- toYaml .
| nindent 4 }} + {{- end }} +{{- end }} diff --git a/helm-charts/inferflow/values.yaml b/helm-charts/inferflow/values.yaml new file mode 100644 index 00000000..4f7699e2 --- /dev/null +++ b/helm-charts/inferflow/values.yaml @@ -0,0 +1,207 @@ +# Default values for inferflow helm chart + +namespace: prd-inferflow +applicationName: inferflow +replicaCount: 2 + +labels: + env: prd + team: bharatml + bu: ml + priority: p1 + priority_v2: cp3 + service_type: "" + +priorityClassName: "" + +telegraf: + enabled: false + +otel_enabled: false + +infrastructure: + secretStore: + name: vault-backend + kind: ClusterSecretStore + vault: + basePath: "" + otelTokenPath: "" + +deployment: + enabled: true + replicaCount: 2 + revisionHistoryLimit: 3 + image: + repository: ghcr.io/meesho/inferflow + tag: latest + pullPolicy: IfNotPresent + ports: + - containerPort: 8085 + name: http + protocol: TCP + probes: + liveness: + path: /health/self + port: 8085 + scheme: HTTP + initialDelaySeconds: 30 + periodSeconds: 10 + failureThreshold: 3 + successThreshold: 1 + timeoutSeconds: 5 + readiness: + path: /health/self + port: 8085 + scheme: HTTP + initialDelaySeconds: 20 + periodSeconds: 10 + failureThreshold: 3 + successThreshold: 1 + timeoutSeconds: 5 + resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "1000m" + env: + - name: APP_ENV + value: "prod" + - name: APP_LOG_LEVEL + value: "INFO" + - name: APP_NAME + value: "inferflow" + - name: APP_PORT + value: "8085" + - name: APP_GC_PERCENTAGE + value: "1" + # Etcd + - name: ETCD_SERVER + value: "http://etcd:2379" + - name: ETCD_WATCHER_ENABLED + value: "true" + # In-memory cache + - name: IN_MEMORY_CACHE_SIZE_IN_BYTES + value: "6000000000" + # DAG topology + - name: DAG_TOPOLOGY_CACHE_SIZE + value: "500" + - name: DAG_TOPOLOGY_CACHE_TTL_SEC + value: "300" + # Numerix client + - name: NUMERIX_CLIENT_V1_AUTHTOKEN + value: "numerix" + - name: NUMERIX_CLIENT_V1_BATCHSIZE + value: "100" + - name: 
NUMERIX_CLIENT_V1_DEADLINE_MS + value: "5000" + - name: NUMERIX_CLIENT_V1_HOST + value: "numerix" + - name: NUMERIX_CLIENT_V1_PLAINTEXT + value: "true" + - name: NUMERIX_CLIENT_V1_PORT + value: "8083" + # ONFS client + - name: EXTERNAL_SERVICE_ONFS_FS_BATCH_SIZE + value: "50" + - name: EXTERNAL_SERVICE_ONFS_FS_CALLER_ID + value: "inferflow" + - name: EXTERNAL_SERVICE_ONFS_FS_CALLER_TOKEN + value: "inferflow" + - name: EXTERNAL_SERVICE_ONFS_FS_GRPC_PLAIN_TEXT + value: "true" + - name: EXTERNAL_SERVICE_ONFS_FS_HOST + value: "onfs-api-server" + - name: EXTERNAL_SERVICE_ONFS_FS_PORT + value: "8089" + - name: EXTERNAL_SERVICE_ONFS_FS_DEAD_LINE + value: "200" + # Predator client + - name: EXTERNAL_SERVICE_PREDATOR_PORT + value: "8090" + - name: EXTERNAL_SERVICE_PREDATOR_GRPC_PLAIN_TEXT + value: "true" + - name: EXTERNAL_SERVICE_PREDATOR_CALLER_ID + value: "inferflow" + - name: EXTERNAL_SERVICE_PREDATOR_CALLER_TOKEN + value: "inferflow" + - name: EXTERNAL_SERVICE_PREDATOR_DEADLINE + value: "200" + # Metrics + - name: METRIC_SAMPLING_RATE + value: "1" + # Kafka logging + - name: KAFKA_BOOTSTRAP_SERVERS + value: "broker:29092" + - name: KAFKA_LOGGING_TOPIC + value: "inferflow_inference_logs" + serviceAccount: + enabled: false + annotations: {} + updateStrategy: + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 0 + maxSurge: 1 + terminationGracePeriodSeconds: 30 + +service: + enabled: true + type: ClusterIP + ports: + - name: http + port: 80 + targetPort: 8085 + protocol: TCP + +autoscaling: + enabled: false + minReplicas: 2 + maxReplicas: 10 + pollingInterval: 30 + scaledown: + stabilizationWindowSeconds: 300 + policies: + - type: Percent + value: 10 + periodseconds: 60 + selectpolicy: Min + scaleup: + stabilizationWindowSeconds: 0 + policies: + - type: Percent + value: 50 + periodseconds: 60 + selectpolicy: Max + triggers: + - type: cpu + metadata: + value: "70" + metricType: Utilization + +ingress: + enabled: false + ingressClassName: contour-internal 
+createContourGateway: false + +externalSecret: + enabled: false + path: "" + +configMap: + enabled: false + +canary: + enabled: false + promURL: "" + slackChannel: "" + slackWebhookURL: "" + +podDisruptionBudget: + enabled: false + maxUnavailable: "10%" + +disasterRecovery: + enabled: false diff --git a/helm-charts/numerix/Chart.yaml b/helm-charts/numerix/Chart.yaml new file mode 100644 index 00000000..37e29c1a --- /dev/null +++ b/helm-charts/numerix/Chart.yaml @@ -0,0 +1,10 @@ +apiVersion: v2 +name: numerix +description: A Helm chart for the Numerix matrix operations service +type: application +version: 1.0.0 +appVersion: "1.0.0" + +maintainers: + - name: BharatMLStack Team + email: ml-oss@meesho.com diff --git a/helm-charts/numerix/templates/NOTES.txt b/helm-charts/numerix/templates/NOTES.txt new file mode 100644 index 00000000..35a4f3c2 --- /dev/null +++ b/helm-charts/numerix/templates/NOTES.txt @@ -0,0 +1,18 @@ +{{ .Chart.Name }} has been deployed. + +Namespace: {{ .Values.namespace }} +Application: {{ .Values.applicationName }} + +{{- if .Values.service.enabled }} +Service: {{ .Values.namespace }}:{{ (index .Values.service.ports 0).port }} +{{- end }} + +{{- if .Values.ingress.enabled }} +Ingress is enabled. +{{- end }} + +{{- if .Values.autoscaling.enabled }} +Autoscaling: {{ .Values.autoscaling.minReplicas }} - {{ .Values.autoscaling.maxReplicas }} replicas +{{- else }} +Replicas: {{ .Values.replicaCount }} +{{- end }} diff --git a/helm-charts/numerix/templates/_helpers.tpl b/helm-charts/numerix/templates/_helpers.tpl new file mode 100644 index 00000000..168e0342 --- /dev/null +++ b/helm-charts/numerix/templates/_helpers.tpl @@ -0,0 +1,70 @@ +{{/* vim: set filetype=mustache: */}} +{{/* +Expand the name of the chart. +*/}} +{{- define "name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}} +{{- end -}} + +{{/* +Create a default fully qualified app name. 
+We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec). +*/}} +{{- define "fullname" -}} +{{- $name := default .Chart.Name .Values.nameOverride -}} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}} +{{- end -}} + +{{- define "labels.selector" -}} +app: {{ .Values.namespace }} +{{- end -}} + +{{- define "labels.primary-selector" -}} +app: {{ .Values.namespace }}-primary +{{- end -}} + +{{- define "labels.common" -}} +{{ template "labels.selector" . }} +{{- if and .Values.deployment .Values.deployment.image }} +version: {{ .Values.deployment.image.tag }} +{{- end }} +env: {{ .Values.labels.env }} +team: {{ .Values.labels.team }} +bu: {{ .Values.labels.bu }} +service: {{ .Values.applicationName }} +priority: {{ .Values.labels.priority }} +priority_v2: {{ .Values.labels.priority_v2 | default "cp3" }} +primary_owner: {{ .Values.labels.primary_owner | default .Values.labels.team }} +secondary_owner: {{ .Values.labels.secondary_owner | default .Values.labels.team }} +service_type: {{ .Values.labels.service_type | default "" | replace "," "-" | quote }} +{{- end -}} + +{{- define "labels.chart" -}} +chart: "{{ .Chart.Name }}-{{ .Chart.Version }}" +release: {{ .Release.Name | quote }} +heritage: {{ .Release.Service | quote }} +{{- end -}} + +{{/* +Renders a value that contains template. 
+Usage: +{{ include "application.tplvalues.render" ( dict "value" .Values.path.to.the.Value "context" $) }} +*/}} + +{{- define "application.tplvalues.render" -}} + {{- if typeIs "string" .value }} + {{- tpl .value .context }} + {{- else }} + {{- tpl (.value | toYaml) .context }} + {{- end }} +{{- end -}} + +{{- define "canary.promURL" -}} +{{- if .Values.canary.promURL }} +{{- .Values.canary.promURL }} +{{- else if eq .Values.labels.env "prod" }} +prod-ops-metricsui.example.com/select/100/ +{{- else }} +https://sb-ops-metricsui.example.com/select/100/ +{{- end }} +{{- end -}} diff --git a/helm-charts/numerix/templates/alert-provider.yaml b/helm-charts/numerix/templates/alert-provider.yaml new file mode 100644 index 00000000..a300bb14 --- /dev/null +++ b/helm-charts/numerix/templates/alert-provider.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.canary (.Values.canary.enabled) .Values.canary.slackWebhookURL (ne .Values.canary.slackWebhookURL "") }} +apiVersion: flagger.app/v1beta1 +kind: AlertProvider +metadata: + name: flagger-status + namespace: {{ .Values.namespace }} +spec: + type: slack + {{- if .Values.canary.slackChannel }} + channel: {{ .Values.canary.slackChannel }} + {{- end }} + username: flagger + address: {{ .Values.canary.slackWebhookURL }} +{{- end }} diff --git a/helm-charts/numerix/templates/configmap.yaml b/helm-charts/numerix/templates/configmap.yaml new file mode 100644 index 00000000..fc527219 --- /dev/null +++ b/helm-charts/numerix/templates/configmap.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.configMap .Values.configMap.enabled }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: {{ .Values.namespace }}-config + namespace: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . 
| indent 4 }} +data: + {{- range $key, $value := .Values.configMap.data }} + {{ $key }}: {{ $value | quote }} + {{- end }} +{{- end }} diff --git a/helm-charts/numerix/templates/deployment.yaml b/helm-charts/numerix/templates/deployment.yaml new file mode 100644 index 00000000..34cbb1ed --- /dev/null +++ b/helm-charts/numerix/templates/deployment.yaml @@ -0,0 +1,236 @@ +{{- if and .Values.deployment .Values.deployment.enabled }} +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ .Values.namespace }} + namespace: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} +spec: + {{- with .Values.deployment.minReadySeconds }} + minReadySeconds: {{ . }} + {{- end }} + {{- if not .Values.autoscaling.enabled }} + replicas: {{ .Values.replicaCount }} + {{- end }} + revisionHistoryLimit: {{ .Values.deployment.revisionHistoryLimit }} + selector: + matchLabels: +{{- include "labels.selector" . | nindent 6 }} +{{- with .Values.deployment.updateStrategy }} +{{ toYaml . | indent 2 -}} +{{- end }} + template: + metadata: + annotations: + {{- with .Values.deployment.podAnnotations }} + {{- toYaml . | nindent 8 }} + {{- end }} + {{- if .Values.telegraf.enabled }} + telegraf.influxdata.com/class: "infra" + {{- end }} + labels: + {{- include "labels.common" . | nindent 8 }} + spec: + {{- if .Values.priorityClassName }} + priorityClassName: {{ .Values.priorityClassName }} + {{- end }} + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: topology.kubernetes.io/zone + whenUnsatisfiable: ScheduleAnyway + labelSelector: + matchLabels: +{{- include "labels.selector" .
| nindent 12 }} + {{- if .Values.deployment.image.pullSecret }} + imagePullSecrets: + - name: {{ .Values.deployment.image.pullSecret }} + {{- end }} + {{- if .Values.deployment.volumes }} + volumes: + {{- toYaml .Values.deployment.volumes | nindent 8 }} + {{- end }} + {{- if .Values.deployment.initContainers }} + initContainers: + {{- toYaml .Values.deployment.initContainers | nindent 8 }} + {{- end }} + containers: + - name: {{ .Values.applicationName }} + {{- if .Values.deployment.volumeMounts }} + volumeMounts: + {{- toYaml .Values.deployment.volumeMounts | nindent 12 }} + {{- end }} + {{- if .Values.deployment.command }} + command: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.command "context" $) | nindent 12 }} + {{- end }} + {{- if .Values.deployment.args }} + args: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.args "context" $) | nindent 12 }} + {{- end }} + image: "{{ .Values.deployment.image.repository }}:{{ .Values.deployment.image.tag }}" + imagePullPolicy: {{ .Values.deployment.image.pullPolicy }} + {{- if .Values.deployment.lifecycle }} + lifecycle: {{ toYaml .Values.deployment.lifecycle | nindent 12 }} + {{- end }} + {{- with .Values.deployment.probes }} + {{- with .liveness }} + livenessProbe: + {{- with .failureThreshold }} + failureThreshold: {{ . }} + {{- end }} + httpGet: + path: {{ .path }} + port: {{ .port }} + scheme: {{ .scheme }} + {{- with .periodSeconds }} + periodSeconds: {{ . }} + {{- end }} + {{- with .successThreshold }} + successThreshold: {{ . }} + {{- end }} + {{- with .timeoutSeconds }} + timeoutSeconds: {{ . 
}} + {{- end }} + initialDelaySeconds: {{ .initialDelaySeconds }} + {{- end }} + {{- end }} + {{- if or .Values.externalSecret.enabled .Values.otel_enabled (and .Values.configMap .Values.configMap.enabled) }} + envFrom: + {{- if .Values.externalSecret.enabled }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }} + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }}-dr + {{- end }} + {{- end }} + {{- if .Values.otel_enabled }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }}-otel + {{- end }} + {{- if and .Values.configMap .Values.configMap.enabled }} + - configMapRef: + name: {{ .Values.namespace }}-config + {{- end }} + {{- end }} + env: + - name: TZ + value: Asia/Kolkata + - name: NODE_IP + valueFrom: + fieldRef: + fieldPath: spec.nodeName + - name: POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name + - name: POD_NAMESPACE + valueFrom: + fieldRef: + fieldPath: metadata.namespace + - name: POD_IP + valueFrom: + fieldRef: + fieldPath: status.podIP + {{- if .Values.telegraf.enabled }} + - name: TELEGRAF_UDP_HOST + valueFrom: + fieldRef: + fieldPath: status.podIP + {{- end }} + {{- if .Values.otel_enabled }} + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: http://$(NODE_IP):4317 + {{- end }} + {{- with .Values.deployment.env }} + {{- range . }} + - name: {{ .name }} + value: "{{ .value }}" + {{- end }} + {{- end }} + {{- with .Values.deployment.ports }} + ports: + {{- range . }} + - containerPort: {{ .containerPort }} + name: {{ .name }} + protocol: {{ .protocol }} + {{- end }} + {{- end }} + {{- if .Values.telegraf.enabled }} + - containerPort: 9273 + name: telegraf-sc + protocol: TCP + {{- end }} + {{- with .Values.deployment.probes }} + {{- with .readiness }} + readinessProbe: + {{- with .failureThreshold }} + failureThreshold: {{ . 
}} + {{- end }} + httpGet: + path: {{ .path }} + port: {{ .port }} + scheme: {{ .scheme }} + {{- with .periodSeconds }} + periodSeconds: {{ . }} + {{- end }} + {{- with .successThreshold }} + successThreshold: {{ . }} + {{- end }} + {{- with .timeoutSeconds}} + timeoutSeconds: {{ . }} + {{- end }} + initialDelaySeconds: {{ .initialDelaySeconds }} + {{- end }} + {{- end }} + {{- with .Values.deployment.resources }} + resources: + {{- with .limits }} + limits: + {{- with .memory }} + memory: "{{ . }}" + {{- end }} + {{- with .cpu }} + cpu: "{{ . }}" + {{- end }} + {{- end }} + {{- with .requests }} + requests: + {{- with .memory }} + memory: "{{ . }}" + {{- end }} + {{- with .cpu }} + cpu: "{{ . }}" + {{- end }} + {{- end }} + {{- end }} + + {{- with .Values.deployment.hostAliases }} + hostAliases: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.deployment.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.deployment.nodeSelector }} + nodeSelector: + {{- toYaml . 
| nindent 8 }} + {{- end }} + terminationGracePeriodSeconds: {{ .Values.deployment.terminationGracePeriodSeconds | default "300" }} + {{- if .Values.deployment.serviceAccount.enabled }} + serviceAccountName: {{ .Values.namespace }} + {{- end }} + {{- if .Values.securityContext }} + securityContext: + {{- toYaml .Values.securityContext | nindent 12 }} + {{- end }} +{{- end }} diff --git a/helm-charts/numerix/templates/external-secrets.yaml b/helm-charts/numerix/templates/external-secrets.yaml new file mode 100644 index 00000000..0d602056 --- /dev/null +++ b/helm-charts/numerix/templates/external-secrets.yaml @@ -0,0 +1,30 @@ +{{- if and .Values.externalSecret .Values.externalSecret.enabled }} +apiVersion: external-secrets.io/v1beta1 +kind: ExternalSecret +metadata: + labels: + {{- include "labels.common" . | nindent 4 }} +{{- if .Values.externalSecret.annotations }} + annotations: +{{ toYaml .Values.externalSecret.annotations | indent 4 }} +{{- end }} + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }} + {{- end}} + namespace: {{ .Values.namespace }} +spec: + dataFrom: + - extract: + conversionStrategy: Default + key: {{ .Values.externalSecret.path }} + refreshInterval: 15s + secretStoreRef: + kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }} + name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }} + target: + creationPolicy: Owner + deletionPolicy: Retain + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }} + {{- end}} +{{- end }} diff --git a/helm-charts/numerix/templates/httpproxy.yaml b/helm-charts/numerix/templates/httpproxy.yaml new file mode 100644 index 00000000..94fe8f3e --- /dev/null +++ b/helm-charts/numerix/templates/httpproxy.yaml @@ -0,0 +1,50 @@ +{{- if and .Values.ingress .Values.ingress.enabled -}} +{{- if .Values.createContourGateway -}} +{{- if or ( eq "contour-internal"
.Values.ingress.ingressClassName ) ( eq "contour-external" .Values.ingress.ingressClassName ) ( eq "contour-internal-0" .Values.ingress.ingressClassName ) ( eq "contour-internal-1" .Values.ingress.ingressClassName ) ( eq "contour-external-0" .Values.ingress.ingressClassName ) ( eq "contour-external-1" .Values.ingress.ingressClassName ) }} +{{- $servicePortNumber := .Values.ingress.servicePortNumber -}} +{{- $pathType := .Values.ingress.pathType -}} +{{- $namespace := .Values.namespace -}} +{{- $ingressClassName := .Values.ingress.ingressClassName -}} +{{ $count := 0 | int }} +{{- range .Values.ingress.hosts }} +apiVersion: projectcontour.io/v1 +kind: HTTPProxy +metadata: + namespace: {{ $namespace }} + name: {{ $namespace }}-{{ $count }} + labels: +{{ include "labels.common" $ | indent 4 }} +{{ include "labels.chart" $ | indent 4 }} + annotations: + projectcontour.io/ingress.class: {{ $.Values.ingress.ingressClassName }} +spec: + ingressClassName: {{ $ingressClassName }} + virtualhost: + fqdn: "{{ .host }}" + includes: + {{- range .paths }} + - conditions: + {{- if or ( eq ( lower .pathType ) "prefix" ) ( eq ( lower .pathType ) "implementationspecific") }} + - prefix: {{ .path }} + {{- end }} + {{- if ( eq ( lower .pathType ) "exact" ) }} + - header: + name: :path + exact: {{ .path }} + - prefix: {{ .path }} + {{- end }} + {{- if .targetService }} + name: {{ .targetService | replace "/" "-" }} + namespace: {{ (split "/" .targetService)._0 }} + {{- else }} + name: {{ $namespace }} + namespace: {{ $namespace }} + {{- end }} + {{- end }} + {{ $count = add1 $count }} +--- + +{{- end -}} +{{- end -}} +{{- end -}} +{{- end -}} diff --git a/helm-charts/numerix/templates/otel-secret.yaml b/helm-charts/numerix/templates/otel-secret.yaml new file mode 100644 index 00000000..c2514b7d --- /dev/null +++ b/helm-charts/numerix/templates/otel-secret.yaml @@ -0,0 +1,26 @@ +{{ if .Values.otel_enabled }} +apiVersion: external-secrets.io/v1beta1 +kind: ExternalSecret +metadata: + 
labels: + {{- include "labels.common" . | nindent 4 }} + annotations: + flagger.app/config-tracking: disabled + name: {{ if and .Values.deployment .Values.deployment.envFrom }}{{ .Values.deployment.envFrom.secretRef }}{{ else }}{{ .Values.namespace }}{{ end }}-otel + namespace: {{ .Values.namespace }} +spec: + dataFrom: + - extract: + conversionStrategy: Default + key: {{ .Values.infrastructure.vault.otelTokenPath | default "org/prd/cntr/devop/coralogix-token" }} + refreshInterval: 15s + secretStoreRef: + kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }} + name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }} + target: + creationPolicy: Owner + deletionPolicy: Retain + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }}-otel + {{- end}} +{{ end }} diff --git a/helm-charts/numerix/templates/pdb.yaml b/helm-charts/numerix/templates/pdb.yaml new file mode 100644 index 00000000..db87f89a --- /dev/null +++ b/helm-charts/numerix/templates/pdb.yaml @@ -0,0 +1,36 @@ +{{- if or (and .Values.podDisruptionBudget .Values.podDisruptionBudget.enabled) (and .Values.deployment .Values.deployment.enabled) }} +{{- if or ( and .Values.deployment (.Values.deployment.enabled) (.Values.autoscaling.enabled) ( gt (int .Values.autoscaling.minReplicas) 1)) ( and (eq .Values.autoscaling.enabled false) .Values.deployment ( gt ( int .Values.deployment.replicaCount) 1 )) }} +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . 
| indent 4 }} +spec: + {{- if .Values.podDisruptionBudget.enabled }} + maxUnavailable: {{ .Values.podDisruptionBudget.maxUnavailable }} + {{- else }} + {{- if and .Values.deployment .Values.deployment.enabled }} + {{- if (eq .Values.deployment.updateStrategy.strategy.type "RollingUpdate") }} + maxUnavailable: {{ .Values.deployment.updateStrategy.strategy.rollingUpdate.maxSurge | default "10%" }} + {{- else }} + maxUnavailable: "10%" + {{- end }} + {{ else }} + maxUnavailable: "10%" + {{- end }} + {{- end }} + {{- if .Values.podDisruptionBudget.minAvailable }} + minAvailable: {{ .Values.podDisruptionBudget.minAvailable }} + {{- end }} + selector: + matchLabels: + {{- if and (.Values.canary.enabled) (eq .Values.labels.env "prod") }} + {{- include "labels.primary-selector" . | nindent 6 }} + {{- else }} + {{- include "labels.selector" . | nindent 6 }} + {{- end }} +{{- end }} +{{- end }} diff --git a/helm-charts/numerix/templates/scaledobject.yaml b/helm-charts/numerix/templates/scaledobject.yaml new file mode 100644 index 00000000..259bfea2 --- /dev/null +++ b/helm-charts/numerix/templates/scaledobject.yaml @@ -0,0 +1,56 @@ +{{- if and .Values.autoscaling .Values.autoscaling.enabled }} +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + labels: + {{- include "labels.common" . 
| nindent 4 }} + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: {{ .Values.namespace }} + pollingInterval: {{ .Values.autoscaling.pollingInterval }} + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + minReplicaCount: 1 + {{- else }} + minReplicaCount: {{ .Values.canary.minCanaryReplicas | default $.Values.autoscaling.minReplicas }} + {{- end }} + maxReplicaCount: {{ .Values.canary.maxCanaryReplicas | default $.Values.autoscaling.maxReplicas }} + advanced: + horizontalPodAutoscalerConfig: + behavior: + scaleDown: + stabilizationWindowSeconds: {{ .Values.autoscaling.scaledown.stabilizationWindowSeconds }} + policies: + {{- range .Values.autoscaling.scaledown.policies }} + - type: {{ .type }} + value: {{ .value }} + periodSeconds: {{ .periodseconds }} + {{- end }} + selectPolicy: {{ .Values.autoscaling.scaledown.selectpolicy }} + scaleUp: + stabilizationWindowSeconds: {{ .Values.autoscaling.scaleup.stabilizationWindowSeconds }} + policies: + {{- range .Values.autoscaling.scaleup.policies }} + - type: {{ .type }} + value: {{ .value }} + periodSeconds: {{ .periodseconds }} + {{- end }} + selectPolicy: {{ .Values.autoscaling.scaleup.selectpolicy }} + triggers: + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + {{- range $.Values.autoscaling.triggers }} + {{- if or (eq .type "cpu") (eq .type "memory") }} + - metadata: + {{- toYaml .metadata | nindent 8 }} + type: {{ .type }} + metricType: "Utilization" + {{- end }} + {{- end }} + {{- else }} + {{- toYaml .Values.autoscaling.triggers | nindent 2 }} + {{ end }} + +{{- end }} diff --git a/helm-charts/numerix/templates/service.yaml b/helm-charts/numerix/templates/service.yaml new file mode 100644 index 00000000..8fcc5bc8 --- /dev/null +++ b/helm-charts/numerix/templates/service.yaml @@ -0,0 +1,29 @@ +{{- if and .Values.service .Values.service.enabled }} +{{- if or (eq .Values.canary.enabled false) ( 
and (.Values.canary.enabled) (ne .Values.labels.env "prod")) }} +apiVersion: v1 +kind: Service +metadata: + labels: + {{- include "labels.common" . | nindent 4 }} + {{- include "labels.chart" . | nindent 4 }} + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} +{{- if .Values.service.annotations }} + annotations: +{{ toYaml .Values.service.annotations | indent 4 }} +{{- end }} +spec: + type: {{ .Values.service.type }} + ports: + {{- range .Values.service.ports }} + - name: {{ .name }} + port: {{ .port }} + protocol: {{ .protocol }} + targetPort: {{ .targetPort }} + {{- end }} + selector: + {{- include "labels.selector" . | nindent 4 }} +{{- end }} +{{- end }} diff --git a/helm-charts/numerix/templates/serviceaccount.yaml b/helm-charts/numerix/templates/serviceaccount.yaml new file mode 100644 index 00000000..f05362f5 --- /dev/null +++ b/helm-charts/numerix/templates/serviceaccount.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.deployment (.Values.deployment.enabled) (.Values.deployment.serviceAccount.enabled) }} +apiVersion: v1 +kind: ServiceAccount +metadata: + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} + {{- with .Values.deployment.serviceAccount.annotations }} + annotations: + {{- toYaml . 
| nindent 4 }} + {{- end }} +{{- end }} diff --git a/helm-charts/numerix/values.yaml b/helm-charts/numerix/values.yaml new file mode 100644 index 00000000..c6be05c1 --- /dev/null +++ b/helm-charts/numerix/values.yaml @@ -0,0 +1,166 @@ +# Default values for numerix helm chart + +namespace: prd-numerix +applicationName: numerix +replicaCount: 2 + +labels: + env: prd + team: bharatml + bu: ml + priority: p1 + priority_v2: cp3 + service_type: "" + +# Priority class (leave empty to skip) +priorityClassName: "" + +# Telegraf sidecar metrics +telegraf: + enabled: false + +# OTEL tracing +otel_enabled: false + +# Infrastructure configuration +infrastructure: + secretStore: + name: vault-backend + kind: ClusterSecretStore + vault: + basePath: "" + otelTokenPath: "" + +# Deployment configuration +deployment: + enabled: true + replicaCount: 2 + revisionHistoryLimit: 3 + image: + repository: ghcr.io/meesho/numerix + tag: latest + pullPolicy: IfNotPresent + ports: + - containerPort: 8083 + name: grpc + protocol: TCP + probes: + liveness: + path: /health + port: 8083 + scheme: HTTP + initialDelaySeconds: 15 + periodSeconds: 10 + failureThreshold: 3 + successThreshold: 1 + timeoutSeconds: 5 + readiness: + path: /health + port: 8083 + scheme: HTTP + initialDelaySeconds: 10 + periodSeconds: 10 + failureThreshold: 3 + successThreshold: 1 + timeoutSeconds: 5 + resources: + requests: + memory: "256Mi" + cpu: "250m" + limits: + memory: "512Mi" + cpu: "500m" + env: + - name: APPLICATION_PORT + value: "8083" + - name: APP_ENV + value: "prd" + - name: APP_NAME + value: "numerix" + - name: APP_LOG_LEVEL + value: "ERROR" + - name: CHANNEL_BUFFER_SIZE + value: "10000" + - name: ETCD_SERVERS + value: "http://etcd:2379" + - name: METRIC_SAMPLING_RATE + value: "1" + - name: LOG_SAMPLING_RATE + value: "1" + serviceAccount: + enabled: false + annotations: {} + updateStrategy: + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 0 + maxSurge: 1 + terminationGracePeriodSeconds: 30 + 
+# Service configuration +service: + enabled: true + type: ClusterIP + ports: + - name: grpc + port: 80 + targetPort: 8083 + protocol: TCP + +# Autoscaling (KEDA ScaledObject) +autoscaling: + enabled: false + minReplicas: 2 + maxReplicas: 10 + pollingInterval: 30 + scaledown: + stabilizationWindowSeconds: 300 + policies: + - type: Percent + value: 10 + periodseconds: 60 + selectpolicy: Min + scaleup: + stabilizationWindowSeconds: 0 + policies: + - type: Percent + value: 50 + periodseconds: 60 + selectpolicy: Max + triggers: + - type: cpu + metadata: + value: "70" + metricType: Utilization + +# Ingress (Contour HTTPProxy) +ingress: + enabled: false + ingressClassName: contour-internal +createContourGateway: false + +# External secrets (Vault) +externalSecret: + enabled: false + path: "" + +# ConfigMap for non-secret env vars +configMap: + enabled: false + +# Canary / Flagger +canary: + enabled: false + promURL: "" + slackChannel: "" + slackWebhookURL: "" + +# Pod Disruption Budget +podDisruptionBudget: + enabled: false + maxUnavailable: "10%" + +# Disaster Recovery +disasterRecovery: + enabled: false diff --git a/helm-charts/onfs-api-server/Chart.yaml b/helm-charts/onfs-api-server/Chart.yaml new file mode 100644 index 00000000..c3a90a16 --- /dev/null +++ b/helm-charts/onfs-api-server/Chart.yaml @@ -0,0 +1,10 @@ +apiVersion: v2 +name: onfs-api-server +description: A Helm chart for the Online Feature Store API Server +type: application +version: 1.0.0 +appVersion: "1.0.0" + +maintainers: + - name: BharatMLStack Team + email: ml-oss@meesho.com diff --git a/helm-charts/onfs-api-server/templates/NOTES.txt b/helm-charts/onfs-api-server/templates/NOTES.txt new file mode 100644 index 00000000..35a4f3c2 --- /dev/null +++ b/helm-charts/onfs-api-server/templates/NOTES.txt @@ -0,0 +1,18 @@ +{{ .Chart.Name }} has been deployed. 
+ +Namespace: {{ .Values.namespace }} +Application: {{ .Values.applicationName }} + +{{- if .Values.service.enabled }} +Service: {{ .Values.namespace }}:{{ (index .Values.service.ports 0).port }} +{{- end }} + +{{- if .Values.ingress.enabled }} +Ingress is enabled. +{{- end }} + +{{- if .Values.autoscaling.enabled }} +Autoscaling: {{ .Values.autoscaling.minReplicas }} - {{ .Values.autoscaling.maxReplicas }} replicas +{{- else }} +Replicas: {{ .Values.replicaCount }} +{{- end }} diff --git a/helm-charts/onfs-api-server/templates/_helpers.tpl b/helm-charts/onfs-api-server/templates/_helpers.tpl new file mode 100644 index 00000000..168e0342 --- /dev/null +++ b/helm-charts/onfs-api-server/templates/_helpers.tpl @@ -0,0 +1,70 @@ +{{/* vim: set filetype=mustache: */}} +{{/* +Expand the name of the chart. +*/}} +{{- define "name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}} +{{- end -}} + +{{/* +Create a default fully qualified app name. +We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec). +*/}} +{{- define "fullname" -}} +{{- $name := default .Chart.Name .Values.nameOverride -}} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}} +{{- end -}} + +{{- define "labels.selector" -}} +app: {{ .Values.namespace }} +{{- end -}} + +{{- define "labels.primary-selector" -}} +app: {{ .Values.namespace }}-primary +{{- end -}} + +{{- define "labels.common" -}} +{{ template "labels.selector" . 
}} +{{- if and .Values.deployment .Values.deployment.image }} +version: {{ .Values.deployment.image.tag }} +{{- end }} +env: {{ .Values.labels.env }} +team: {{ .Values.labels.team }} +bu: {{ .Values.labels.bu }} +service: {{ .Values.applicationName }} +priority: {{ .Values.labels.priority }} +priority_v2: {{ .Values.labels.priority_v2 | default "cp3" }} +primary_owner: {{ .Values.labels.primary_owner | default .Values.labels.team }} +secondary_owner: {{ .Values.labels.secondary_owner | default .Values.labels.team }} +service_type: {{ .Values.labels.service_type | default "" | replace "," "-" | quote }} +{{- end -}} + +{{- define "labels.chart" -}} +chart: "{{ .Chart.Name }}-{{ .Chart.Version }}" +release: {{ .Release.Name | quote }} +heritage: {{ .Release.Service | quote }} +{{- end -}} + +{{/* +Renders a value that contains template. +Usage: +{{ include "application.tplvalues.render" ( dict "value" .Values.path.to.the.Value "context" $) }} +*/}} + +{{- define "application.tplvalues.render" -}} + {{- if typeIs "string" .value }} + {{- tpl .value .context }} + {{- else }} + {{- tpl (.value | toYaml) .context }} + {{- end }} +{{- end -}} + +{{- define "canary.promURL" -}} +{{- if .Values.canary.promURL }} +{{- .Values.canary.promURL }} +{{- else if eq .Values.labels.env "prod" }} +https://prod-ops-metricsui.example.com/select/100/ +{{- else }} +https://sb-ops-metricsui.example.com/select/100/ +{{- end }} +{{- end -}} diff --git a/helm-charts/onfs-api-server/templates/alert-provider.yaml b/helm-charts/onfs-api-server/templates/alert-provider.yaml new file mode 100644 index 00000000..a300bb14 --- /dev/null +++ b/helm-charts/onfs-api-server/templates/alert-provider.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.canary (.Values.canary.enabled) .Values.canary.slackWebhookURL (ne .Values.canary.slackWebhookURL "") }} +apiVersion: flagger.app/v1beta1 +kind: AlertProvider +metadata: + name: flagger-status + namespace: {{ .Values.namespace }} +spec: + type: slack + {{- if 
.Values.canary.slackChannel }} + channel: {{ .Values.canary.slackChannel }} + {{- end }} + username: flagger + address: {{ .Values.canary.slackWebhookURL }} +{{- end }} diff --git a/helm-charts/onfs-api-server/templates/configmap.yaml b/helm-charts/onfs-api-server/templates/configmap.yaml new file mode 100644 index 00000000..fc527219 --- /dev/null +++ b/helm-charts/onfs-api-server/templates/configmap.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.configMap .Values.configMap.enabled }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: {{ .Values.namespace }}-config + namespace: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} +data: + {{- range $key, $value := .Values.configMap.data }} + {{ $key }}: {{ $value | quote }} + {{- end }} +{{- end }} diff --git a/helm-charts/onfs-api-server/templates/deployment.yaml b/helm-charts/onfs-api-server/templates/deployment.yaml new file mode 100644 index 00000000..34cbb1ed --- /dev/null +++ b/helm-charts/onfs-api-server/templates/deployment.yaml @@ -0,0 +1,237 @@ +{{- if and .Values.deployment .Values.deployment.enabled }} +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ .Values.namespace }} + namespace: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} +spec: + {{- with .Values.deployment.minReadySeconds }} + minReadySeconds: {{ . }} + {{- end }} + {{- if not .Values.autoscaling.enabled }} + replicas: {{ .Values.replicaCount }} + {{- end }} + revisionHistoryLimit: {{ .Values.deployment.revisionHistoryLimit }} + selector: + matchLabels: +{{- include "labels.selector" . | nindent 6 }} +{{- with .Values.deployment.updateStrategy }} +{{ toYaml . | indent 2 -}} +{{- end }} + template: + metadata: + annotations: + {{- with .Values.deployment.podAnnotations }} + {{- toYaml . 
| nindent 8 }} + {{- end }} + {{- if .Values.telegraf.enabled }} + telegraf.influxdata.com/class: "infra" + {{- end }} + labels: + {{- include "labels.common" . | nindent 8 }} + spec: + {{- if .Values.priorityClassName }} + priorityClassName: {{ .Values.priorityClassName }} + {{- end }} + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: topology.kubernetes.io/zone + whenUnsatisfiable: ScheduleAnyway + labelSelector: + matchLabels: +{{- include "labels.selector" . | nindent 12 }} + {{- if .Values.deployment.image.pullSecret }} + imagePullSecrets: + - name: {{ .Values.deployment.image.pullSecret }} + {{- end }} + {{- if .Values.deployment.volumes }} + volumes: + {{- toYaml .Values.deployment.volumes | nindent 8 }} + {{- end }} + {{- if .Values.deployment.initContainers }} + initContainers: + {{- toYaml .Values.deployment.initContainers | nindent 8 }} + {{- end }} + containers: + - name: {{ .Values.applicationName }} + {{- if .Values.deployment.volumeMounts }} + volumeMounts: + {{- toYaml .Values.deployment.volumeMounts | nindent 12 }} + {{- end }} + {{- if .Values.deployment.command }} + command: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.command "context" $) | nindent 12 }} + {{- end }} + {{- if .Values.deployment.args }} + args: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.args "context" $) | nindent 12 }} + {{- end }} + image: "{{ .Values.deployment.image.repository }}:{{ .Values.deployment.image.tag }}" + imagePullPolicy: {{ .Values.deployment.image.pullPolicy }} + {{- if .Values.deployment.lifecycle }} + lifecycle: {{ toYaml .Values.deployment.lifecycle | nindent 12 }} + {{- end }} + {{- with .Values.deployment.probes }} + {{- with .liveness }} + livenessProbe: + {{- with .failureThreshold }} + failureThreshold: {{ . }} + {{- end }} + httpGet: + path: {{ .path }} + port: {{ .port }} + scheme: {{ .scheme }} + {{- with .periodSeconds }} + periodSeconds: {{ . 
}} + {{- end }} + {{- with .successThreshold }} + successThreshold: {{ . }} + {{- end }} + {{- with .timeoutSeconds }} + timeoutSeconds: {{ . }} + {{- end }} + initialDelaySeconds: {{ .initialDelaySeconds }} + {{- end }} + {{- end }} + {{- if or .Values.externalSecret.enabled .Values.otel_enabled (and .Values.configMap .Values.configMap.enabled) }} + envFrom: + {{- if .Values.externalSecret.enabled }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }} + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }}-dr + {{- end }} + {{- end }} + {{- if .Values.otel_enabled }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }}-otel + {{- end }} + {{- if and .Values.configMap .Values.configMap.enabled }} + - configMapRef: + name: {{ .Values.namespace }}-config + {{- end }} + {{- end }} + env: + - name: TZ + value: Asia/Kolkata + - name: NODE_IP + valueFrom: + fieldRef: + fieldPath: spec.nodeName + - name: POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name + - name: POD_NAMESPACE + valueFrom: + fieldRef: + fieldPath: metadata.namespace + - name: POD_IP + valueFrom: + fieldRef: + fieldPath: status.podIP + {{- if .Values.telegraf.enabled }} + - name: TELEGRAF_UDP_HOST + valueFrom: + fieldRef: + fieldPath: status.podIP + {{- end }} + {{- if .Values.otel_enabled }} + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: http://$(NODE_IP):4317 + {{- end }} + {{- with .Values.deployment.env }} + {{- range . }} + - name: {{ .name }} + value: "{{ .value }}" + {{- end }} + {{- end }} + {{- with .Values.deployment.ports }} + ports: + {{- range . 
}} + - containerPort: {{ .containerPort }} + name: {{ .name }} + protocol: {{ .protocol }} + {{- end }} + {{- end }} + {{- if .Values.telegraf.enabled }} + - containerPort: 9273 + name: telegraf-sc + protocol: TCP + {{- end }} + {{- with .Values.deployment.probes }} + {{- with .readiness }} + readinessProbe: + {{- with .failureThreshold }} + failureThreshold: {{ . }} + {{- end }} + httpGet: + path: {{ .path }} + port: {{ .port }} + scheme: {{ .scheme }} + {{- with .periodSeconds }} + periodSeconds: {{ . }} + {{- end }} + {{- with .successThreshold }} + successThreshold: {{ . }} + {{- end }} + {{- with .timeoutSeconds}} + timeoutSeconds: {{ . }} + {{- end }} + initialDelaySeconds: {{ .initialDelaySeconds }} + {{- end }} + {{- end }} + {{- with .Values.deployment.resources }} + resources: + {{- with .limits }} + limits: + {{- with .memory }} + memory: "{{ . }}" + {{- end }} + {{- with .cpu }} + cpu: "{{ . }}" + {{- end }} + {{- end }} + {{- with .requests }} + requests: + {{- with .memory }} + memory: "{{ . }}" + {{- end }} + {{- with .cpu }} + cpu: "{{ . }}" + {{- end }} + {{- end }} + {{- end }} + + {{- with .Values.deployment.hostAliases }} + hostAliases: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.deployment.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.deployment.nodeSelector }} + nodeSelector: + {{- toYaml . 
| nindent 8 }} + {{- end }} + terminationGracePeriodSeconds: {{ .Values.deployment.terminationGracePeriodSeconds | default "300" }} + {{- if .Values.deployment.serviceAccount.enabled }} + serviceAccountName: {{ .Values.namespace }} + {{- end }} + {{- if .Values.securityContext }} + securityContext: + {{- toYaml .Values.securityContext | nindent 8 }} + {{- end }} +{{- end }} diff --git a/helm-charts/onfs-api-server/templates/external-secrets.yaml b/helm-charts/onfs-api-server/templates/external-secrets.yaml new file mode 100644 index 00000000..0d602056 --- /dev/null +++ b/helm-charts/onfs-api-server/templates/external-secrets.yaml @@ -0,0 +1,30 @@ +{{- if and .Values.externalSecret .Values.externalSecret.enabled }} +apiVersion: external-secrets.io/v1beta1 +kind: ExternalSecret +metadata: + labels: + {{- include "labels.common" . | nindent 4 }} +{{- if .Values.externalSecret.annotations }} + annotations: +{{ toYaml .Values.externalSecret.annotations | indent 4 }} +{{- end }} + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }} + {{- end}} + namespace: {{ .Values.namespace }} +spec: + dataFrom: + - extract: + conversionStrategy: Default + key: {{ .Values.externalSecret.path }} + refreshInterval: 15s + secretStoreRef: + kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }} + name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }} + target: + creationPolicy: Owner + deletionPolicy: Retain + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }} + {{- end}} +{{- end }} diff --git a/helm-charts/onfs-api-server/templates/httpproxy.yaml b/helm-charts/onfs-api-server/templates/httpproxy.yaml new file mode 100644 index 00000000..94fe8f3e --- /dev/null +++ b/helm-charts/onfs-api-server/templates/httpproxy.yaml @@ -0,0 +1,50 @@ +{{- if and .Values.ingress .Values.ingress.enabled -}} +{{- if 
.Values.createContourGateway -}} +{{- if or ( eq "contour-internal" .Values.ingress.ingressClassName ) ( eq "contour-external" .Values.ingress.ingressClassName ) ( eq "contour-internal-0" .Values.ingress.ingressClassName ) ( eq "contour-internal-1" .Values.ingress.ingressClassName ) ( eq "contour-external-0" .Values.ingress.ingressClassName ) ( eq "contour-external-1" .Values.ingress.ingressClassName ) }} +{{- $servicePortNumber := .Values.ingress.servicePortNumber -}} +{{- $pathType := .Values.ingress.pathType -}} +{{- $namespace := .Values.namespace -}} +{{- $ingressClassName := .Values.ingress.ingressClassName -}} +{{ $count := 0 | int }} +{{- range .Values.ingress.hosts }} +apiVersion: projectcontour.io/v1 +kind: HTTPProxy +metadata: + namespace: {{ $namespace }} + name: {{ $namespace }}-{{ $count }} + labels: +{{ include "labels.common" $ | indent 4 }} +{{ include "labels.chart" $ | indent 4 }} + annotations: + projectcontour.io/ingress.class: {{ $.Values.ingress.ingressClassName }} +spec: + ingressClassName: {{ $ingressClassName }} + virtualhost: + fqdn: "{{ .host }}" + includes: + {{- range .paths }} + - conditions: + {{- if or ( eq ( lower .pathType ) "prefix" ) ( eq ( lower .pathType ) "implementationspecific") }} + - prefix: {{ .path }} + {{- end }} + {{- if ( eq ( lower .pathType ) "exact" ) }} + - header: + name: :path + exact: {{ .path }} + - prefix: {{ .path }} + {{- end }} + {{- if .targetService }} + name: {{ .targetService | replace "/" "-" }} + namespace: {{ (split "/" .targetService)._0 }} + {{- else }} + name: {{ $namespace }} + namespace: {{ $namespace }} + {{- end }} + {{- end }} + {{ $count = add1 $count }} +--- + +{{- end -}} +{{- end -}} +{{- end -}} +{{- end -}} diff --git a/helm-charts/onfs-api-server/templates/otel-secret.yaml b/helm-charts/onfs-api-server/templates/otel-secret.yaml new file mode 100644 index 00000000..c2514b7d --- /dev/null +++ b/helm-charts/onfs-api-server/templates/otel-secret.yaml @@ -0,0 +1,26 @@ +{{ if 
.Values.otel_enabled }} +apiVersion: external-secrets.io/v1beta1 +kind: ExternalSecret +metadata: + labels: + {{- include "labels.common" . | nindent 4 }} + annotations: + flagger.app/config-tracking: disabled + name: {{ if and .Values.deployment .Values.deployment.envFrom }}{{ .Values.deployment.envFrom.secretRef }}{{ else }}{{ .Values.namespace }}{{ end }}-otel + namespace: {{ .Values.namespace }} +spec: + dataFrom: + - extract: + conversionStrategy: Default + key: {{ .Values.infrastructure.vault.otelTokenPath | default "org/prd/cntr/devop/coralogix-token" }} + refreshInterval: 15s + secretStoreRef: + kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }} + name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }} + target: + creationPolicy: Owner + deletionPolicy: Retain + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }}-otel + {{- end}} +{{ end }} diff --git a/helm-charts/onfs-api-server/templates/pdb.yaml b/helm-charts/onfs-api-server/templates/pdb.yaml new file mode 100644 index 00000000..db87f89a --- /dev/null +++ b/helm-charts/onfs-api-server/templates/pdb.yaml @@ -0,0 +1,36 @@ +{{- if or (and .Values.podDisruptionBudget .Values.podDisruptionBudget.enabled) (and .Values.deployment .Values.deployment.enabled) }} +{{- if or ( and .Values.deployment (.Values.deployment.enabled) (.Values.autoscaling.enabled) ( gt (int .Values.autoscaling.minReplicas) 1)) ( and (eq .Values.autoscaling.enabled false) .Values.deployment ( gt ( int .Values.deployment.replicaCount) 1 )) }} +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . 
| indent 4 }} +spec: + {{- if .Values.podDisruptionBudget.enabled }} + maxUnavailable: {{ .Values.podDisruptionBudget.maxUnavailable }} + {{- else }} + {{- if and .Values.deployment .Values.deployment.enabled }} + {{- if (eq .Values.deployment.updateStrategy.strategy.type "RollingUpdate") }} + maxUnavailable: {{ .Values.deployment.updateStrategy.strategy.rollingUpdate.maxSurge | default "10%" }} + {{- else }} + maxUnavailable: "10%" + {{- end }} + {{ else }} + maxUnavailable: "10%" + {{- end }} + {{- end }} + {{- if .Values.podDisruptionBudget.minAvailable }} + minAvailable: {{ .Values.podDisruptionBudget.minAvailable }} + {{- end }} + selector: + matchLabels: + {{- if and (.Values.canary.enabled) (eq .Values.labels.env "prod") }} + {{- include "labels.primary-selector" . | nindent 6 }} + {{- else }} + {{- include "labels.selector" . | nindent 6 }} + {{- end }} +{{- end }} +{{- end }} diff --git a/helm-charts/onfs-api-server/templates/scaledobject.yaml b/helm-charts/onfs-api-server/templates/scaledobject.yaml new file mode 100644 index 00000000..259bfea2 --- /dev/null +++ b/helm-charts/onfs-api-server/templates/scaledobject.yaml @@ -0,0 +1,56 @@ +{{- if and .Values.autoscaling .Values.autoscaling.enabled }} +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + labels: + {{- include "labels.common" . 
| nindent 4 }} + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: {{ .Values.namespace }} + pollingInterval: {{ .Values.autoscaling.pollingInterval }} + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + minReplicaCount: 1 + {{- else }} + minReplicaCount: {{ .Values.canary.minCanaryReplicas | default $.Values.autoscaling.minReplicas }} + {{- end }} + maxReplicaCount: {{ .Values.canary.maxCanaryReplicas | default $.Values.autoscaling.maxReplicas }} + advanced: + horizontalPodAutoscalerConfig: + behavior: + scaleDown: + stabilizationWindowSeconds: {{ .Values.autoscaling.scaledown.stabilizationWindowSeconds }} + policies: + {{- range .Values.autoscaling.scaledown.policies }} + - type: {{ .type }} + value: {{ .value }} + periodSeconds: {{ .periodseconds }} + {{- end }} + selectPolicy: {{ .Values.autoscaling.scaledown.selectpolicy }} + scaleUp: + stabilizationWindowSeconds: {{ .Values.autoscaling.scaleup.stabilizationWindowSeconds }} + policies: + {{- range .Values.autoscaling.scaleup.policies }} + - type: {{ .type }} + value: {{ .value }} + periodSeconds: {{ .periodseconds }} + {{- end }} + selectPolicy: {{ .Values.autoscaling.scaleup.selectpolicy }} + triggers: + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + {{- range $.Values.autoscaling.triggers }} + {{- if or (eq .type "cpu") (eq .type "memory") }} + - metadata: + {{- toYaml .metadata | nindent 8 }} + type: {{ .type }} + metricType: "Utilization" + {{- end }} + {{- end }} + {{- else }} + {{- toYaml .Values.autoscaling.triggers | nindent 2 }} + {{ end }} + +{{- end }} diff --git a/helm-charts/onfs-api-server/templates/service.yaml b/helm-charts/onfs-api-server/templates/service.yaml new file mode 100644 index 00000000..8fcc5bc8 --- /dev/null +++ b/helm-charts/onfs-api-server/templates/service.yaml @@ -0,0 +1,29 @@ +{{- if and .Values.service .Values.service.enabled }} +{{- if or (eq 
.Values.canary.enabled false) ( and (.Values.canary.enabled) (ne .Values.labels.env "prod")) }} +apiVersion: v1 +kind: Service +metadata: + labels: + {{- include "labels.common" . | nindent 4 }} + {{- include "labels.chart" . | nindent 4 }} + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} +{{- if .Values.service.annotations }} + annotations: +{{ toYaml .Values.service.annotations | indent 4 }} +{{- end }} +spec: + type: {{ .Values.service.type }} + ports: + {{- range .Values.service.ports }} + - name: {{ .name }} + port: {{ .port }} + protocol: {{ .protocol }} + targetPort: {{ .targetPort }} + {{- end }} + selector: + {{- include "labels.selector" . | nindent 4 }} +{{- end }} +{{- end }} diff --git a/helm-charts/onfs-api-server/templates/serviceaccount.yaml b/helm-charts/onfs-api-server/templates/serviceaccount.yaml new file mode 100644 index 00000000..f05362f5 --- /dev/null +++ b/helm-charts/onfs-api-server/templates/serviceaccount.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.deployment (.Values.deployment.enabled) (.Values.deployment.serviceAccount.enabled) }} +apiVersion: v1 +kind: ServiceAccount +metadata: + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} + {{- with .Values.deployment.serviceAccount.annotations }} + annotations: + {{- toYaml . 
| nindent 4 }}
+  {{- end }}
+{{- end }}
diff --git a/helm-charts/onfs-api-server/values.yaml b/helm-charts/onfs-api-server/values.yaml
new file mode 100644
index 00000000..68f70e45
--- /dev/null
+++ b/helm-charts/onfs-api-server/values.yaml
@@ -0,0 +1,238 @@
+# Default values for onfs-api-server helm chart
+
+namespace: prd-onfs-api-server
+applicationName: onfs-api-server
+replicaCount: 2
+
+labels:
+  env: prd
+  team: bharatml
+  bu: ml
+  priority: p1
+  priority_v2: cp3
+  service_type: ""
+
+priorityClassName: ""
+
+telegraf:
+  enabled: false
+
+otel_enabled: false
+
+infrastructure:
+  secretStore:
+    name: vault-backend
+    kind: ClusterSecretStore
+  vault:
+    basePath: ""
+    otelTokenPath: ""
+
+deployment:
+  enabled: true
+  replicaCount: 2
+  revisionHistoryLimit: 3
+  image:
+    repository: ghcr.io/meesho/onfs-api-server
+    tag: latest
+    pullPolicy: IfNotPresent
+  ports:
+    - containerPort: 8089
+      name: grpc
+      protocol: TCP
+  probes:
+    liveness:
+      path: /health/self
+      port: 8089
+      scheme: HTTP
+      initialDelaySeconds: 30
+      periodSeconds: 10
+      failureThreshold: 3
+      successThreshold: 1
+      timeoutSeconds: 5
+    readiness:
+      path: /health/self
+      port: 8089
+      scheme: HTTP
+      initialDelaySeconds: 20
+      periodSeconds: 10
+      failureThreshold: 3
+      successThreshold: 1
+      timeoutSeconds: 5
+  resources:
+    requests:
+      memory: "512Mi"
+      cpu: "250m"
+    limits:
+      memory: "1Gi"
+      cpu: "1000m"
+  env:
+    - name: APP_ENV
+      value: "local"
+    - name: APP_LOG_LEVEL
+      value: "DEBUG"
+    - name: APP_METRIC_SAMPLING_RATE
+      value: "1"
+    - name: APP_NAME
+      value: "onfs"
+    - name: APP_PORT
+      value: "8089"
+    - name: AUTH_TOKEN
+      value: "test"
+    - name: ENABLE_HTTP_API
+      value: "true"
+    # Etcd
+    - name: ETCD_SERVER
+      value: "http://etcd:2379"
+    - name: ETCD_WATCHER_ENABLED
+      value: "true"
+    # ScyllaDB
+    - name: STORAGE_SCYLLA_1_CONTACT_POINTS
+      value: "scylla"
+    - name: STORAGE_SCYLLA_1_KEYSPACE
+      value: "onfs"
+    - name: STORAGE_SCYLLA_1_PORT
+      value: "9042"
+    - name: STORAGE_SCYLLA_1_NUM_CONNS
+      value: "1"
+    - name: STORAGE_SCYLLA_1_TIMEOUT_IN_MS
+      value: "300000"
+    - name: STORAGE_SCYLLA_1_USERNAME
+      value: ""
+    - name: STORAGE_SCYLLA_1_PASSWORD
+      value: ""
+    - name: STORAGE_SCYLLA_1_MAJOR_VERSION
+      value: "5"
+    - name: STORAGE_SCYLLA_1_SCYLLA_VERSION
+      value: "5"
+    - name: STORAGE_SCYLLA_ACTIVE_CONFIG_IDS
+      value: "1"
+    # Redis
+    - name: STORAGE_REDIS_STANDALONE_2_ADDR
+      value: "redis:6379"
+    - name: STORAGE_REDIS_STANDALONE_2_DB
+      value: "0"
+    - name: STORAGE_REDIS_STANDALONE_2_DISABLE_IDENTITY
+      value: "true"
+    - name: STORAGE_REDIS_STANDALONE_2_MAX_IDLE_CONN
+      value: "32"
+    - name: STORAGE_REDIS_STANDALONE_2_MIN_IDLE_CONN
+      value: "20"
+    - name: STORAGE_REDIS_STANDALONE_2_MAX_ACTIVE_CONN
+      value: "32"
+    - name: STORAGE_REDIS_STANDALONE_2_MAX_RETRY
+      value: "-1"
+    - name: STORAGE_REDIS_STANDALONE_2_POOL_FIFO
+      value: "false"
+    - name: STORAGE_REDIS_STANDALONE_2_READ_TIMEOUT_IN_MS
+      value: "300"
+    - name: STORAGE_REDIS_STANDALONE_2_WRITE_TIMEOUT_IN_MS
+      value: "300"
+    - name: STORAGE_REDIS_STANDALONE_2_POOL_TIMEOUT_IN_MS
+      value: "300"
+    - name: STORAGE_REDIS_STANDALONE_2_POOL_SIZE
+      value: "32"
+    - name: STORAGE_REDIS_STANDALONE_2_CONN_MAX_IDLE_TIMEOUT_IN_MINUTES
+      value: "15"
+    - name: STORAGE_REDIS_STANDALONE_2_CONN_MAX_AGE_IN_MINUTES
+      value: "30"
+    - name: STORAGE_REDIS_STANDALONE_ACTIVE_CONFIG_IDS
+      value: "2"
+    - name: DISTRIBUTED_CACHE_CONF_IDS
+      value: "2"
+    # In-memory cache
+    - name: IN_MEM_CACHE_3_ENABLED
+      value: "true"
+    - name: IN_MEM_CACHE_3_NAME
+      value: "onfs"
+    - name: IN_MEM_CACHE_3_SIZE_IN_BYTES
+      value: "100000"
+    - name: IN_MEM_CACHE_ACTIVE_CONFIG_IDS
+      value: "3"
+    # P2P cache
+    - name: P2P_CACHE_5_ENABLED
+      value: "true"
+    - name: P2P_CACHE_5_CLUSTER_NAME
+      value: "onfs-cluster"
+    - name: P2P_CACHE_5_NAME
+      value: "p2p-onfs"
+    - name: P2P_CACHE_5_OWN_PARTITION_SIZE_IN_BYTES
+      value: "100000"
+    - name: P2P_CACHE_5_GLOBAL_SIZE_IN_BYTES
+      value: "1000"
+    - name: P2P_CACHE_5_GLOBAL_CACHE_TTL_IN_SECONDS
+      value: "3600"
+    - name: P2P_CACHE_5_NUM_CLIENTS
+      value: "2"
+    - name: P2P_CACHE_5_SERVER_PORT
+      value: "8088"
+    - name: P2P_CACHE_ACTIVE_CONFIG_IDS
+      value: "5"
+  serviceAccount:
+    enabled: false
+    annotations: {}
+  updateStrategy:
+    strategy:
+      type: RollingUpdate
+      rollingUpdate:
+        maxUnavailable: 0
+        maxSurge: 1
+  terminationGracePeriodSeconds: 30
+
+service:
+  enabled: true
+  type: ClusterIP
+  ports:
+    - name: grpc
+      port: 80
+      targetPort: 8089
+      protocol: TCP
+
+autoscaling:
+  enabled: false
+  minReplicas: 2
+  maxReplicas: 10
+  pollingInterval: 30
+  scaledown:
+    stabilizationWindowSeconds: 300
+    policies:
+      - type: Percent
+        value: 10
+        periodseconds: 60
+    selectpolicy: Min
+  scaleup:
+    stabilizationWindowSeconds: 0
+    policies:
+      - type: Percent
+        value: 50
+        periodseconds: 60
+    selectpolicy: Max
+  triggers:
+    - type: cpu
+      metadata:
+        value: "70"
+      metricType: Utilization
+
+ingress:
+  enabled: false
+  ingressClassName: contour-internal
+createContourGateway: false
+
+externalSecret:
+  enabled: false
+  path: ""
+
+configMap:
+  enabled: false
+
+canary:
+  enabled: false
+  promURL: ""
+  slackChannel: ""
+  slackWebhookURL: ""
+
+podDisruptionBudget:
+  enabled: false
+  maxUnavailable: "10%"
+
+disasterRecovery:
+  enabled: false
diff --git a/helm-charts/onfs-consumer/Chart.yaml b/helm-charts/onfs-consumer/Chart.yaml
new file mode 100644
index 00000000..07ef952c
--- /dev/null
+++ b/helm-charts/onfs-consumer/Chart.yaml
@@ -0,0 +1,10 @@
+apiVersion: v2
+name: onfs-consumer
+description: A Helm chart for the Online Feature Store Kafka Consumer
+type: application
+version: 1.0.0
+appVersion: "1.0.0"
+
+maintainers:
+  - name: BharatMLStack Team
+    email: ml-oss@meesho.com
diff --git a/helm-charts/onfs-consumer/templates/NOTES.txt b/helm-charts/onfs-consumer/templates/NOTES.txt
new file mode 100644
index 00000000..35a4f3c2
--- /dev/null
+++ b/helm-charts/onfs-consumer/templates/NOTES.txt
@@ -0,0 +1,18 @@
+{{ .Chart.Name }} has been deployed.
+
+Namespace: {{ .Values.namespace }}
+Application: {{ .Values.applicationName }}
+
+{{- if .Values.service.enabled }}
+Service: {{ .Values.namespace }}:{{ (index .Values.service.ports 0).port }}
+{{- end }}
+
+{{- if .Values.ingress.enabled }}
+Ingress is enabled.
+{{- end }}
+
+{{- if .Values.autoscaling.enabled }}
+Autoscaling: {{ .Values.autoscaling.minReplicas }} - {{ .Values.autoscaling.maxReplicas }} replicas
+{{- else }}
+Replicas: {{ .Values.replicaCount }}
+{{- end }}
diff --git a/helm-charts/onfs-consumer/templates/_helpers.tpl b/helm-charts/onfs-consumer/templates/_helpers.tpl
new file mode 100644
index 00000000..168e0342
--- /dev/null
+++ b/helm-charts/onfs-consumer/templates/_helpers.tpl
@@ -0,0 +1,70 @@
+{{/* vim: set filetype=mustache: */}}
+{{/*
+Expand the name of the chart.
+*/}}
+{{- define "name" -}}
+{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}}
+{{- end -}}
+
+{{/*
+Create a default fully qualified app name.
+We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
+*/}}
+{{- define "fullname" -}}
+{{- $name := default .Chart.Name .Values.nameOverride -}}
+{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}}
+{{- end -}}
+
+{{- define "labels.selector" -}}
+app: {{ .Values.namespace }}
+{{- end -}}
+
+{{- define "labels.primary-selector" -}}
+app: {{ .Values.namespace }}-primary
+{{- end -}}
+
+{{- define "labels.common" -}}
+{{ template "labels.selector" . }}
+{{- if and .Values.deployment .Values.deployment.image }}
+version: {{ .Values.deployment.image.tag }}
+{{- end }}
+env: {{ .Values.labels.env }}
+team: {{ .Values.labels.team }}
+bu: {{ .Values.labels.bu }}
+service: {{ .Values.applicationName }}
+priority: {{ .Values.labels.priority }}
+priority_v2: {{ .Values.labels.priority_v2 | default "cp3" }}
+primary_owner: {{ .Values.labels.primary_owner | default .Values.labels.team }}
+secondary_owner: {{ .Values.labels.secondary_owner | default .Values.labels.team }}
+service_type: {{ .Values.labels.service_type | default "" | replace "," "-" | quote }}
+{{- end -}}
+
+{{- define "labels.chart" -}}
+chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
+release: {{ .Release.Name | quote }}
+heritage: {{ .Release.Service | quote }}
+{{- end -}}
+
+{{/*
+Renders a value that contains template.
+Usage:
+{{ include "application.tplvalues.render" ( dict "value" .Values.path.to.the.Value "context" $) }}
+*/}}
+
+{{- define "application.tplvalues.render" -}}
+  {{- if typeIs "string" .value }}
+    {{- tpl .value .context }}
+  {{- else }}
+    {{- tpl (.value | toYaml) .context }}
+  {{- end }}
+{{- end -}}
+
+{{- define "canary.promURL" -}}
+{{- if .Values.canary.promURL }}
+{{- .Values.canary.promURL }}
+{{- else if eq .Values.labels.env "prod" }}
+https://prod-ops-metricsui.example.com/select/100/
+{{- else }}
+https://sb-ops-metricsui.example.com/select/100/
+{{- end }}
+{{- end -}}
diff --git a/helm-charts/onfs-consumer/templates/alert-provider.yaml b/helm-charts/onfs-consumer/templates/alert-provider.yaml
new file mode 100644
index 00000000..a300bb14
--- /dev/null
+++ b/helm-charts/onfs-consumer/templates/alert-provider.yaml
@@ -0,0 +1,14 @@
+{{- if and .Values.canary (.Values.canary.enabled) .Values.canary.slackWebhookURL (ne .Values.canary.slackWebhookURL "") }}
+apiVersion: flagger.app/v1beta1
+kind: AlertProvider
+metadata:
+  name: flagger-status
+  namespace: {{ .Values.namespace }}
+spec:
+  type: slack
+  {{- if .Values.canary.slackChannel }}
+  channel: {{ .Values.canary.slackChannel }}
+  {{- end }}
+  username: flagger
+  address: {{ .Values.canary.slackWebhookURL }}
+{{- end }}
diff --git a/helm-charts/onfs-consumer/templates/configmap.yaml b/helm-charts/onfs-consumer/templates/configmap.yaml
new file mode 100644
index 00000000..fc527219
--- /dev/null
+++ b/helm-charts/onfs-consumer/templates/configmap.yaml
@@ -0,0 +1,14 @@
+{{- if and .Values.configMap .Values.configMap.enabled }}
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: {{ .Values.namespace }}-config
+  namespace: {{ .Values.namespace }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+data:
+  {{- range $key, $value := .Values.configMap.data }}
+  {{ $key }}: {{ $value | quote }}
+  {{- end }}
+{{- end }}
diff --git a/helm-charts/onfs-consumer/templates/deployment.yaml b/helm-charts/onfs-consumer/templates/deployment.yaml
new file mode 100644
index 00000000..34cbb1ed
--- /dev/null
+++ b/helm-charts/onfs-consumer/templates/deployment.yaml
@@ -0,0 +1,237 @@
+{{- if and .Values.deployment .Values.deployment.enabled }}
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: {{ .Values.namespace }}
+  namespace: {{ .Values.namespace }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+{{- include "labels.selector" . | nindent 4 }}
+spec:
+  {{- with .Values.deployment.minReadySeconds }}
+  minReadySeconds: {{ . }}
+  {{- end }}
+  {{- if not .Values.autoscaling.enabled }}
+  replicas: {{ .Values.replicaCount }}
+  {{- end }}
+  revisionHistoryLimit: {{ .Values.deployment.revisionHistoryLimit }}
+  selector:
+    matchLabels:
+{{- include "labels.selector" . | nindent 6 }}
+{{- with .Values.deployment.updateStrategy }}
+{{ toYaml . | indent 2 -}}
+{{- end }}
+  template:
+    metadata:
+      annotations:
+        {{- with .Values.deployment.podAnnotations }}
+        {{- toYaml . | nindent 8 }}
+        {{- end }}
+        {{- if .Values.telegraf.enabled }}
+        telegraf.influxdata.com/class: "infra"
+        {{- end }}
+      labels:
+        {{- include "labels.common" . | nindent 8 }}
+    spec:
+      {{- if .Values.priorityClassName }}
+      priorityClassName: {{ .Values.priorityClassName }}
+      {{- end }}
+      topologySpreadConstraints:
+        - maxSkew: 1
+          topologyKey: topology.kubernetes.io/zone
+          whenUnsatisfiable: ScheduleAnyway
+          labelSelector:
+            matchLabels:
+{{- include "labels.selector" . | nindent 12 }}
+      {{- if .Values.deployment.image.pullSecret }}
+      imagePullSecrets:
+        - name: {{ .Values.deployment.image.pullSecret }}
+      {{- end }}
+      {{- if .Values.deployment.volumes }}
+      volumes:
+        {{- toYaml .Values.deployment.volumes | nindent 8 }}
+      {{- end }}
+      {{- if .Values.deployment.initContainers }}
+      initContainers:
+        {{- toYaml .Values.deployment.initContainers | nindent 8 }}
+      {{- end }}
+      containers:
+        - name: {{ .Values.applicationName }}
+          {{- if .Values.deployment.volumeMounts }}
+          volumeMounts:
+            {{- toYaml .Values.deployment.volumeMounts | nindent 12 }}
+          {{- end }}
+          {{- if .Values.deployment.command }}
+          command: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.command "context" $) | nindent 12 }}
+          {{- end }}
+          {{- if .Values.deployment.args }}
+          args: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.args "context" $) | nindent 12 }}
+          {{- end }}
+          image: "{{ .Values.deployment.image.repository }}:{{ .Values.deployment.image.tag }}"
+          imagePullPolicy: {{ .Values.deployment.image.pullPolicy }}
+          {{- if .Values.deployment.lifecycle }}
+          lifecycle: {{ toYaml .Values.deployment.lifecycle | nindent 12 }}
+          {{- end }}
+          {{- with .Values.deployment.probes }}
+          {{- with .liveness }}
+          livenessProbe:
+            {{- with .failureThreshold }}
+            failureThreshold: {{ . }}
+            {{- end }}
+            httpGet:
+              path: {{ .path }}
+              port: {{ .port }}
+              scheme: {{ .scheme }}
+            {{- with .periodSeconds }}
+            periodSeconds: {{ . }}
+            {{- end }}
+            {{- with .successThreshold }}
+            successThreshold: {{ . }}
+            {{- end }}
+            {{- with .timeoutSeconds }}
+            timeoutSeconds: {{ . }}
+            {{- end }}
+            initialDelaySeconds: {{ .initialDelaySeconds }}
+          {{- end }}
+          {{- end }}
+          {{- if or .Values.externalSecret.enabled .Values.otel_enabled (and .Values.configMap .Values.configMap.enabled) }}
+          envFrom:
+            {{- if .Values.externalSecret.enabled }}
+            - secretRef:
+                name: {{ .Values.deployment.envFrom.secretRef }}
+            {{- if ( default false (.Values.disasterRecovery).enabled ) }}
+            - secretRef:
+                name: {{ .Values.deployment.envFrom.secretRef }}-dr
+            {{- end }}
+            {{- end }}
+            {{- if .Values.otel_enabled }}
+            - secretRef:
+                name: {{ .Values.deployment.envFrom.secretRef }}-otel
+            {{- end }}
+            {{- if and .Values.configMap .Values.configMap.enabled }}
+            - configMapRef:
+                name: {{ .Values.namespace }}-config
+            {{- end }}
+          {{- end }}
+          env:
+            - name: TZ
+              value: Asia/Kolkata
+            - name: NODE_IP
+              valueFrom:
+                fieldRef:
+                  fieldPath: spec.nodeName
+            - name: POD_NAME
+              valueFrom:
+                fieldRef:
+                  fieldPath: metadata.name
+            - name: POD_NAMESPACE
+              valueFrom:
+                fieldRef:
+                  fieldPath: metadata.namespace
+            - name: POD_IP
+              valueFrom:
+                fieldRef:
+                  fieldPath: status.podIP
+            {{- if .Values.telegraf.enabled }}
+            - name: TELEGRAF_UDP_HOST
+              valueFrom:
+                fieldRef:
+                  fieldPath: status.podIP
+            {{- end }}
+            {{- if .Values.otel_enabled }}
+            - name: OTEL_EXPORTER_OTLP_ENDPOINT
+              value: http://$(NODE_IP):4317
+            {{- end }}
+            {{- with .Values.deployment.env }}
+            {{- range . }}
+            - name: {{ .name }}
+              value: "{{ .value }}"
+            {{- end }}
+            {{- end }}
+          {{- with .Values.deployment.ports }}
+          ports:
+            {{- range . }}
+            - containerPort: {{ .containerPort }}
+              name: {{ .name }}
+              protocol: {{ .protocol }}
+            {{- end }}
+          {{- end }}
+          {{- if .Values.telegraf.enabled }}
+            - containerPort: 9273
+              name: telegraf-sc
+              protocol: TCP
+          {{- end }}
+          {{- with .Values.deployment.probes }}
+          {{- with .readiness }}
+          readinessProbe:
+            {{- with .failureThreshold }}
+            failureThreshold: {{ . }}
+            {{- end }}
+            httpGet:
+              path: {{ .path }}
+              port: {{ .port }}
+              scheme: {{ .scheme }}
+            {{- with .periodSeconds }}
+            periodSeconds: {{ . }}
+            {{- end }}
+            {{- with .successThreshold }}
+            successThreshold: {{ . }}
+            {{- end }}
+            {{- with .timeoutSeconds }}
+            timeoutSeconds: {{ . }}
+            {{- end }}
+            initialDelaySeconds: {{ .initialDelaySeconds }}
+          {{- end }}
+          {{- end }}
+          {{- with .Values.deployment.resources }}
+          resources:
+            {{- with .limits }}
+            limits:
+              {{- with .memory }}
+              memory: "{{ . }}"
+              {{- end }}
+              {{- with .cpu }}
+              cpu: "{{ . }}"
+              {{- end }}
+            {{- end }}
+            {{- with .requests }}
+            requests:
+              {{- with .memory }}
+              memory: "{{ . }}"
+              {{- end }}
+              {{- with .cpu }}
+              cpu: "{{ . }}"
+              {{- end }}
+            {{- end }}
+          {{- end }}
+
+      {{- with .Values.deployment.hostAliases }}
+      hostAliases:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      {{- with .Values.nodeSelector }}
+      nodeSelector:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      {{- with .Values.affinity }}
+      affinity:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      {{- with .Values.deployment.tolerations }}
+      tolerations:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      {{- with .Values.deployment.nodeSelector }}
+      nodeSelector:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      terminationGracePeriodSeconds: {{ .Values.deployment.terminationGracePeriodSeconds | default "300" }}
+      {{- if .Values.deployment.serviceAccount.enabled }}
+      serviceAccountName: {{ .Values.namespace }}
+      {{- end }}
+      {{- if .Values.securityContext }}
+      podSecurityContext:
+        {{- toYaml .Values.securityContext | nindent 12 }}
+      {{- end }}
+{{- end }}
diff --git a/helm-charts/onfs-consumer/templates/external-secrets.yaml b/helm-charts/onfs-consumer/templates/external-secrets.yaml
new file mode 100644
index 00000000..0d602056
--- /dev/null
+++ b/helm-charts/onfs-consumer/templates/external-secrets.yaml
@@ -0,0 +1,30 @@
+{{- if and .Values.externalSecret .Values.externalSecret.enabled }}
+apiVersion: external-secrets.io/v1beta1
+kind: ExternalSecret
+metadata:
+  labels:
+    {{- include "labels.common" . | nindent 4 }}
+{{- if .Values.externalSecret.annotations }}
+  annotations:
+{{ toYaml .Values.externalSecret.annotations | indent 4 }}
+{{- end }}
+  {{- if and .Values.deployment .Values.deployment.enabled }}
+  name: {{ .Values.deployment.envFrom.secretRef }}
+  {{- end }}
+  namespace: {{ .Values.namespace }}
+spec:
+  dataFrom:
+    - extract:
+        conversionStrategy: Default
+        key: {{ .Values.externalSecret.path }}
+  refreshInterval: 15s
+  secretStoreRef:
+    kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }}
+    name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }}
+  target:
+    creationPolicy: Owner
+    deletionPolicy: Retain
+    {{- if and .Values.deployment .Values.deployment.enabled }}
+    name: {{ .Values.deployment.envFrom.secretRef }}
+    {{- end }}
+{{- end }}
diff --git a/helm-charts/onfs-consumer/templates/httpproxy.yaml b/helm-charts/onfs-consumer/templates/httpproxy.yaml
new file mode 100644
index 00000000..94fe8f3e
--- /dev/null
+++ b/helm-charts/onfs-consumer/templates/httpproxy.yaml
@@ -0,0 +1,50 @@
+{{- if and .Values.ingress .Values.ingress.enabled -}}
+{{- if .Values.createContourGateway -}}
+{{- if or ( eq "contour-internal" .Values.ingress.ingressClassName ) ( eq "contour-external" .Values.ingress.ingressClassName ) ( eq "contour-internal-0" .Values.ingress.ingressClassName ) ( eq "contour-internal-1" .Values.ingress.ingressClassName ) ( eq "contour-external-0" .Values.ingress.ingressClassName ) ( eq "contour-external-1" .Values.ingress.ingressClassName ) }}
+{{- $servicePortNumber := .Values.ingress.servicePortNumber -}}
+{{- $pathType := .Values.ingress.pathType -}}
+{{- $namespace := .Values.namespace -}}
+{{- $ingressClassName := .Values.ingress.ingressClassName -}}
+{{ $count := 0 | int }}
+{{- range .Values.ingress.hosts }}
+apiVersion: projectcontour.io/v1
+kind: HTTPProxy
+metadata:
+  namespace: {{ $namespace }}
+  name: {{ $namespace }}-{{ $count }}
+  labels:
+{{ include "labels.common" $ | indent 4 }}
+{{ include "labels.chart" $ | indent 4 }}
+  annotations:
+    projectcontour.io/ingress.class: {{ $.Values.ingress.ingressClassName }}
+spec:
+  ingressClassName: {{ $ingressClassName }}
+  virtualhost:
+    fqdn: "{{ .host }}"
+  includes:
+    {{- range .paths }}
+    - conditions:
+        {{- if or ( eq ( lower .pathType ) "prefix" ) ( eq ( lower .pathType ) "implementationspecific") }}
+        - prefix: {{ .path }}
+        {{- end }}
+        {{- if ( eq ( lower .pathType ) "exact" ) }}
+        - header:
+            name: :path
+            exact: {{ .path }}
+        - prefix: {{ .path }}
+        {{- end }}
+      {{- if .targetService }}
+      name: {{ .targetService | replace "/" "-" }}
+      namespace: {{ (split "/" .targetService)._0 }}
+      {{- else }}
+      name: {{ $namespace }}
+      namespace: {{ $namespace }}
+      {{- end }}
+    {{- end }}
+  {{ $count = add1 $count }}
+---
+
+{{- end -}}
+{{- end -}}
+{{- end -}}
+{{- end -}}
diff --git a/helm-charts/onfs-consumer/templates/otel-secret.yaml b/helm-charts/onfs-consumer/templates/otel-secret.yaml
new file mode 100644
index 00000000..c2514b7d
--- /dev/null
+++ b/helm-charts/onfs-consumer/templates/otel-secret.yaml
@@ -0,0 +1,26 @@
+{{ if .Values.otel_enabled }}
+apiVersion: external-secrets.io/v1beta1
+kind: ExternalSecret
+metadata:
+  labels:
+    {{- include "labels.common" . | nindent 4 }}
+  annotations:
+    flagger.app/config-tracking: disabled
+  name: {{ if and .Values.deployment .Values.deployment.envFrom }}{{ .Values.deployment.envFrom.secretRef }}{{ else }}{{ .Values.namespace }}{{ end }}-otel
+  namespace: {{ .Values.namespace }}
+spec:
+  dataFrom:
+    - extract:
+        conversionStrategy: Default
+        key: {{ .Values.infrastructure.vault.otelTokenPath | default "org/prd/cntr/devop/coralogix-token" }}
+  refreshInterval: 15s
+  secretStoreRef:
+    kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }}
+    name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }}
+  target:
+    creationPolicy: Owner
+    deletionPolicy: Retain
+    {{- if and .Values.deployment .Values.deployment.enabled }}
+    name: {{ .Values.deployment.envFrom.secretRef }}-otel
+    {{- end }}
+{{ end }}
diff --git a/helm-charts/onfs-consumer/templates/pdb.yaml b/helm-charts/onfs-consumer/templates/pdb.yaml
new file mode 100644
index 00000000..db87f89a
--- /dev/null
+++ b/helm-charts/onfs-consumer/templates/pdb.yaml
@@ -0,0 +1,36 @@
+{{- if or (and .Values.podDisruptionBudget .Values.podDisruptionBudget.enabled) (and .Values.deployment .Values.deployment.enabled) }}
+{{- if or ( and .Values.deployment (.Values.deployment.enabled) (.Values.autoscaling.enabled) ( gt (int .Values.autoscaling.minReplicas) 1)) ( and (eq .Values.autoscaling.enabled false) .Values.deployment ( gt ( int .Values.deployment.replicaCount) 1 )) }}
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  namespace: {{ .Values.namespace }}
+  name: {{ .Values.namespace }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+spec:
+  {{- if .Values.podDisruptionBudget.enabled }}
+  maxUnavailable: {{ .Values.podDisruptionBudget.maxUnavailable }}
+  {{- else }}
+  {{- if and .Values.deployment .Values.deployment.enabled }}
+  {{- if (eq .Values.deployment.updateStrategy.strategy.type "RollingUpdate") }}
+  maxUnavailable: {{ .Values.deployment.updateStrategy.strategy.rollingUpdate.maxSurge | default "10%" }}
+  {{- else }}
+  maxUnavailable: "10%"
+  {{- end }}
+  {{- else }}
+  maxUnavailable: "10%"
+  {{- end }}
+  {{- end }}
+  {{- if .Values.podDisruptionBudget.minAvailable }}
+  minAvailable: {{ .Values.podDisruptionBudget.minAvailable }}
+  {{- end }}
+  selector:
+    matchLabels:
+      {{- if and (.Values.canary.enabled) (eq .Values.labels.env "prod") }}
+      {{- include "labels.primary-selector" . | nindent 6 }}
+      {{- else }}
+      {{- include "labels.selector" . | nindent 6 }}
+      {{- end }}
+{{- end }}
+{{- end }}
diff --git a/helm-charts/onfs-consumer/templates/scaledobject.yaml b/helm-charts/onfs-consumer/templates/scaledobject.yaml
new file mode 100644
index 00000000..259bfea2
--- /dev/null
+++ b/helm-charts/onfs-consumer/templates/scaledobject.yaml
@@ -0,0 +1,56 @@
+{{- if and .Values.autoscaling .Values.autoscaling.enabled }}
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  labels:
+    {{- include "labels.common" . | nindent 4 }}
+  namespace: {{ .Values.namespace }}
+  name: {{ .Values.namespace }}
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: {{ .Values.namespace }}
+  pollingInterval: {{ .Values.autoscaling.pollingInterval }}
+  {{- if ( default false (.Values.disasterRecovery).enabled ) }}
+  minReplicaCount: 1
+  {{- else }}
+  minReplicaCount: {{ .Values.canary.minCanaryReplicas | default $.Values.autoscaling.minReplicas }}
+  {{- end }}
+  maxReplicaCount: {{ .Values.canary.maxCanaryReplicas | default $.Values.autoscaling.maxReplicas }}
+  advanced:
+    horizontalPodAutoscalerConfig:
+      behavior:
+        scaleDown:
+          stabilizationWindowSeconds: {{ .Values.autoscaling.scaledown.stabilizationWindowSeconds }}
+          policies:
+            {{- range .Values.autoscaling.scaledown.policies }}
+            - type: {{ .type }}
+              value: {{ .value }}
+              periodSeconds: {{ .periodseconds }}
+            {{- end }}
+          selectPolicy: {{ .Values.autoscaling.scaledown.selectpolicy }}
+        scaleUp:
+          stabilizationWindowSeconds: {{ .Values.autoscaling.scaleup.stabilizationWindowSeconds }}
+          policies:
+            {{- range .Values.autoscaling.scaleup.policies }}
+            - type: {{ .type }}
+              value: {{ .value }}
+              periodSeconds: {{ .periodseconds }}
+            {{- end }}
+          selectPolicy: {{ .Values.autoscaling.scaleup.selectpolicy }}
+  triggers:
+    {{- if ( default false (.Values.disasterRecovery).enabled ) }}
+    {{- range $.Values.autoscaling.triggers }}
+    {{- if or (eq .type "cpu") (eq .type "memory") }}
+    - metadata:
+        {{- toYaml .metadata | nindent 8 }}
+      type: {{ .type }}
+      metricType: "Utilization"
+    {{- end }}
+    {{- end }}
+    {{- else }}
+    {{- toYaml .Values.autoscaling.triggers | nindent 2 }}
+    {{- end }}
+
+{{- end }}
diff --git a/helm-charts/onfs-consumer/templates/service.yaml b/helm-charts/onfs-consumer/templates/service.yaml
new file mode 100644
index 00000000..8fcc5bc8
--- /dev/null
+++ b/helm-charts/onfs-consumer/templates/service.yaml
@@ -0,0 +1,27 @@
+{{- if and .Values.service .Values.service.enabled }}
+{{- if or (eq .Values.canary.enabled false) ( and (.Values.canary.enabled) (ne .Values.labels.env "prod")) }}
+apiVersion: v1
+kind: Service
+metadata:
+  namespace: {{ .Values.namespace }}
+  name: {{ .Values.namespace }}
+{{- if .Values.service.annotations }}
+  annotations:
+{{ toYaml .Values.service.annotations | indent 4 }}
+{{- end }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+spec:
+  type: {{ .Values.service.type }}
+  ports:
+    {{- range .Values.service.ports }}
+    - name: {{ .name }}
+      port: {{ .port }}
+      protocol: {{ .protocol }}
+      targetPort: {{ .targetPort }}
+    {{- end }}
+  selector:
+    {{- include "labels.selector" . | nindent 4 }}
+{{- end }}
+{{- end }}
diff --git a/helm-charts/onfs-consumer/templates/serviceaccount.yaml b/helm-charts/onfs-consumer/templates/serviceaccount.yaml
new file mode 100644
index 00000000..f05362f5
--- /dev/null
+++ b/helm-charts/onfs-consumer/templates/serviceaccount.yaml
@@ -0,0 +1,14 @@
+{{- if and .Values.deployment (.Values.deployment.enabled) (.Values.deployment.serviceAccount.enabled) }}
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  namespace: {{ .Values.namespace }}
+  name: {{ .Values.namespace }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+  {{- with .Values.deployment.serviceAccount.annotations }}
+  annotations:
+    {{- toYaml . | nindent 4 }}
+  {{- end }}
+{{- end }}
diff --git a/helm-charts/onfs-consumer/values.yaml b/helm-charts/onfs-consumer/values.yaml
new file mode 100644
index 00000000..3704b181
--- /dev/null
+++ b/helm-charts/onfs-consumer/values.yaml
@@ -0,0 +1,237 @@
+# Default values for onfs-consumer helm chart
+
+namespace: prd-onfs-consumer
+applicationName: onfs-consumer
+replicaCount: 2
+
+labels:
+  env: prd
+  team: bharatml
+  bu: ml
+  priority: p1
+  priority_v2: cp3
+  service_type: ""
+
+priorityClassName: ""
+
+telegraf:
+  enabled: false
+
+otel_enabled: false
+
+infrastructure:
+  secretStore:
+    name: vault-backend
+    kind: ClusterSecretStore
+  vault:
+    basePath: ""
+    otelTokenPath: ""
+
+deployment:
+  enabled: true
+  replicaCount: 2
+  revisionHistoryLimit: 3
+  image:
+    repository: ghcr.io/meesho/onfs-consumer
+    tag: latest
+    pullPolicy: IfNotPresent
+  ports:
+    - containerPort: 8090
+      name: http
+      protocol: TCP
+  probes:
+    liveness:
+      path: /health/self
+      port: 8090
+      scheme: HTTP
+      initialDelaySeconds: 30
+      periodSeconds: 10
+      failureThreshold: 3
+      successThreshold: 1
+      timeoutSeconds: 5
+    readiness:
+      path: /health/self
+      port: 8090
+      scheme: HTTP
+      initialDelaySeconds: 20
+      periodSeconds: 10
+      failureThreshold: 3
+      successThreshold: 1
+      timeoutSeconds: 5
+  resources:
+    requests:
+      memory: "512Mi"
+      cpu: "250m"
+    limits:
+      memory: "1Gi"
+      cpu: "1000m"
+  env:
+    - name: APP_ENV
+      value: "local"
+    - name: APP_LOG_LEVEL
+      value: "DEBUG"
+    - name: APP_METRIC_SAMPLING_RATE
+      value: "1"
+    - name: APP_NAME
+      value: "onfs"
+    - name: APP_PORT
+      value: "8090"
+    # Kafka consumer
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_AUTO_COMMIT_INTERVAL_MS
+      value: "5000"
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_AUTO_OFFSET_RESET
+      value: "earliest"
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_BATCH_SIZE
+      value: "100"
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_BOOTSTRAP_SERVERS
+      value: "broker:29092"
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_CLIENT_ID
+      value: "onfs-consumer"
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_ENABLE_AUTO_COMMIT
+      value: "true"
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_GROUP_ID
+      value: "onfs-consumer-group"
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_LISTENER_CONCURRENCY
+      value: "2"
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_MAX_WORKERS
+      value: "50"
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_POLL_TIMEOUT
+      value: "1000"
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_SECURITY_PROTOCOL
+      value: "PLAINTEXT"
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_TOPIC
+      value: "online-feature-store.feature_ingestion"
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_BASIC_AUTH_CREDENTIAL_SOURCE
+      value: "USER_INFO"
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_SASL_MECHANISM
+      value: "PLAIN"
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_SASL_PASSWORD
+      value: ""
+    - name: KAFKA_CONSUMERS_FEATURE_CONSUMER_SASL_USERNAME
+      value: ""
+    # Etcd
+    - name: ETCD_SERVER
+      value: "http://etcd:2379"
+    - name: ETCD_WATCHER_ENABLED
+      value: "true"
+    # ScyllaDB
+    - name: STORAGE_SCYLLA_1_CONTACT_POINTS
+      value: "scylla"
+    - name: STORAGE_SCYLLA_1_KEYSPACE
+      value: "onfs"
+    - name: STORAGE_SCYLLA_1_PORT
+      value: "9042"
+    - name: STORAGE_SCYLLA_1_NUM_CONNS
+      value: "1"
+    - name: STORAGE_SCYLLA_1_TIMEOUT_IN_MS
+      value: "300000"
+    - name: STORAGE_SCYLLA_1_USERNAME
+      value: ""
+    - name: STORAGE_SCYLLA_1_PASSWORD
+      value: ""
+    - name: STORAGE_SCYLLA_1_MAJOR_VERSION
+      value: "5"
+    - name: STORAGE_SCYLLA_1_SCYLLA_VERSION
+      value: "5"
+    - name: STORAGE_SCYLLA_ACTIVE_CONFIG_IDS
+      value: "1"
+    # Redis
+    - name: STORAGE_REDIS_STANDALONE_2_ADDR
+      value: "redis:6379"
+    - name: STORAGE_REDIS_STANDALONE_2_DB
+      value: "0"
+    - name: STORAGE_REDIS_STANDALONE_2_DISABLE_IDENTITY
+      value: "true"
+    - name: STORAGE_REDIS_STANDALONE_2_MAX_IDLE_CONN
+      value: "32"
+    - name: STORAGE_REDIS_STANDALONE_2_MIN_IDLE_CONN
+      value: "20"
+    - name: STORAGE_REDIS_STANDALONE_2_MAX_ACTIVE_CONN
+      value: "32"
+    - name: STORAGE_REDIS_STANDALONE_2_MAX_RETRY
+      value: "-1"
+    - name: STORAGE_REDIS_STANDALONE_2_POOL_FIFO
+      value: "false"
+    - name: STORAGE_REDIS_STANDALONE_2_READ_TIMEOUT_IN_MS
+      value: "3000"
+    - name: STORAGE_REDIS_STANDALONE_2_WRITE_TIMEOUT_IN_MS
+      value: "3000"
+    - name: STORAGE_REDIS_STANDALONE_2_POOL_TIMEOUT_IN_MS
+      value: "3000"
+    - name: STORAGE_REDIS_STANDALONE_2_POOL_SIZE
+      value: "32"
+    - name: STORAGE_REDIS_STANDALONE_2_CONN_MAX_IDLE_TIMEOUT_IN_MINUTES
+      value: "15"
+    - name: STORAGE_REDIS_STANDALONE_2_CONN_MAX_AGE_IN_MINUTES
+      value: "30"
+    - name: STORAGE_REDIS_STANDALONE_ACTIVE_CONFIG_IDS
+      value: "2"
+  serviceAccount:
+    enabled: false
+    annotations: {}
+  updateStrategy:
+    strategy:
+      type: RollingUpdate
+      rollingUpdate:
+        maxUnavailable: 0
+        maxSurge: 1
+  terminationGracePeriodSeconds: 30
+
+service:
+  enabled: true
+  type: ClusterIP
+  ports:
+    - name: http
+      port: 80
+      targetPort: 8090
+      protocol: TCP
+
+autoscaling:
+  enabled: false
+  minReplicas: 2
+  maxReplicas: 10
+  pollingInterval: 30
+  scaledown:
+    stabilizationWindowSeconds: 300
+    policies:
+      - type: Percent
+        value: 10
+        periodseconds: 60
+    selectpolicy: Min
+  scaleup:
+    stabilizationWindowSeconds: 0
+    policies:
+      - type: Percent
+        value: 50
+        periodseconds: 60
+    selectpolicy: Max
+  triggers:
+    - type: cpu
+      metadata:
+        value: "70"
+      metricType: Utilization
+
+ingress:
+  enabled: false
+  ingressClassName: contour-internal
+createContourGateway: false
+
+externalSecret:
+  enabled: false
+  path: ""
+
+configMap:
+  enabled: false
+
+canary:
+  enabled: false
+  promURL: ""
+  slackChannel: ""
+  slackWebhookURL: ""
+
+podDisruptionBudget:
+  enabled: false
+  maxUnavailable: "10%"
+
+disasterRecovery:
+  enabled: false
diff --git a/helm-charts/skye-admin/Chart.yaml b/helm-charts/skye-admin/Chart.yaml
new file mode 100644
index 00000000..cf4aefd2
--- /dev/null
+++ b/helm-charts/skye-admin/Chart.yaml
@@ -0,0 +1,10 @@
+apiVersion: v2
+name: skye-admin
+description: A Helm chart for the Skye Admin service (embedding platform administration)
+type: application
+version: 1.0.0
+appVersion: "1.0.0"
+
+maintainers:
+  - name: BharatMLStack Team
+    email: ml-oss@meesho.com
diff --git a/helm-charts/skye-admin/templates/NOTES.txt b/helm-charts/skye-admin/templates/NOTES.txt
new file mode 100644
index 00000000..35a4f3c2
--- /dev/null
+++ b/helm-charts/skye-admin/templates/NOTES.txt
@@ -0,0 +1,18 @@
+{{ .Chart.Name }} has been deployed.
+
+Namespace: {{ .Values.namespace }}
+Application: {{ .Values.applicationName }}
+
+{{- if .Values.service.enabled }}
+Service: {{ .Values.namespace }}:{{ (index .Values.service.ports 0).port }}
+{{- end }}
+
+{{- if .Values.ingress.enabled }}
+Ingress is enabled.
+{{- end }}
+
+{{- if .Values.autoscaling.enabled }}
+Autoscaling: {{ .Values.autoscaling.minReplicas }} - {{ .Values.autoscaling.maxReplicas }} replicas
+{{- else }}
+Replicas: {{ .Values.replicaCount }}
+{{- end }}
diff --git a/helm-charts/skye-admin/templates/_helpers.tpl b/helm-charts/skye-admin/templates/_helpers.tpl
new file mode 100644
index 00000000..168e0342
--- /dev/null
+++ b/helm-charts/skye-admin/templates/_helpers.tpl
@@ -0,0 +1,70 @@
+{{/* vim: set filetype=mustache: */}}
+{{/*
+Expand the name of the chart.
+*/}}
+{{- define "name" -}}
+{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}}
+{{- end -}}
+
+{{/*
+Create a default fully qualified app name.
+We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
+*/}}
+{{- define "fullname" -}}
+{{- $name := default .Chart.Name .Values.nameOverride -}}
+{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}}
+{{- end -}}
+
+{{- define "labels.selector" -}}
+app: {{ .Values.namespace }}
+{{- end -}}
+
+{{- define "labels.primary-selector" -}}
+app: {{ .Values.namespace }}-primary
+{{- end -}}
+
+{{- define "labels.common" -}}
+{{ template "labels.selector" . }}
+{{- if and .Values.deployment .Values.deployment.image }}
+version: {{ .Values.deployment.image.tag }}
+{{- end }}
+env: {{ .Values.labels.env }}
+team: {{ .Values.labels.team }}
+bu: {{ .Values.labels.bu }}
+service: {{ .Values.applicationName }}
+priority: {{ .Values.labels.priority }}
+priority_v2: {{ .Values.labels.priority_v2 | default "cp3" }}
+primary_owner: {{ .Values.labels.primary_owner | default .Values.labels.team }}
+secondary_owner: {{ .Values.labels.secondary_owner | default .Values.labels.team }}
+service_type: {{ .Values.labels.service_type | default "" | replace "," "-" | quote }}
+{{- end -}}
+
+{{- define "labels.chart" -}}
+chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
+release: {{ .Release.Name | quote }}
+heritage: {{ .Release.Service | quote }}
+{{- end -}}
+
+{{/*
+Renders a value that contains template.
+Usage:
+{{ include "application.tplvalues.render" ( dict "value" .Values.path.to.the.Value "context" $) }}
+*/}}
+
+{{- define "application.tplvalues.render" -}}
+  {{- if typeIs "string" .value }}
+    {{- tpl .value .context }}
+  {{- else }}
+    {{- tpl (.value | toYaml) .context }}
+  {{- end }}
+{{- end -}}
+
+{{- define "canary.promURL" -}}
+{{- if .Values.canary.promURL }}
+{{- .Values.canary.promURL }}
+{{- else if eq .Values.labels.env "prod" }}
+https://prod-ops-metricsui.example.com/select/100/
+{{- else }}
+https://sb-ops-metricsui.example.com/select/100/
+{{- end }}
+{{- end -}}
diff --git a/helm-charts/skye-admin/templates/alert-provider.yaml b/helm-charts/skye-admin/templates/alert-provider.yaml
new file mode 100644
index 00000000..a300bb14
--- /dev/null
+++ b/helm-charts/skye-admin/templates/alert-provider.yaml
@@ -0,0 +1,14 @@
+{{- if and .Values.canary (.Values.canary.enabled) .Values.canary.slackWebhookURL (ne .Values.canary.slackWebhookURL "") }}
+apiVersion: flagger.app/v1beta1
+kind: AlertProvider
+metadata:
+  name: flagger-status
+  namespace: {{ .Values.namespace }}
+spec:
+  type: slack
+  {{- if .Values.canary.slackChannel }}
+  channel: {{ .Values.canary.slackChannel }}
+  {{- end }}
+  username: flagger
+  address: {{ .Values.canary.slackWebhookURL }}
+{{- end }}
diff --git a/helm-charts/skye-admin/templates/configmap.yaml b/helm-charts/skye-admin/templates/configmap.yaml
new file mode 100644
index 00000000..fc527219
--- /dev/null
+++ b/helm-charts/skye-admin/templates/configmap.yaml
@@ -0,0 +1,14 @@
+{{- if and .Values.configMap .Values.configMap.enabled }}
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: {{ .Values.namespace }}-config
+  namespace: {{ .Values.namespace }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+data:
+  {{- range $key, $value := .Values.configMap.data }}
+  {{ $key }}: {{ $value | quote }}
+  {{- end }}
+{{- end }}
diff --git a/helm-charts/skye-admin/templates/deployment.yaml b/helm-charts/skye-admin/templates/deployment.yaml
new file mode 100644
index 00000000..34cbb1ed
--- /dev/null
+++ b/helm-charts/skye-admin/templates/deployment.yaml
@@ -0,0 +1,237 @@
+{{- if and .Values.deployment .Values.deployment.enabled }}
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: {{ .Values.namespace }}
+  namespace: {{ .Values.namespace }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+{{- include "labels.selector" . | nindent 4 }}
+spec:
+  {{- with .Values.deployment.minReadySeconds }}
+  minReadySeconds: {{ . }}
+  {{- end }}
+  {{- if not .Values.autoscaling.enabled }}
+  replicas: {{ .Values.replicaCount }}
+  {{- end }}
+  revisionHistoryLimit: {{ .Values.deployment.revisionHistoryLimit }}
+  selector:
+    matchLabels:
+{{- include "labels.selector" . | nindent 6 }}
+{{- with .Values.deployment.updateStrategy }}
+{{ toYaml . | indent 2 -}}
+{{- end }}
+  template:
+    metadata:
+      annotations:
+        {{- with .Values.deployment.podAnnotations }}
+        {{- toYaml . | nindent 8 }}
+        {{- end }}
+        {{- if .Values.telegraf.enabled }}
+        telegraf.influxdata.com/class: "infra"
+        {{- end }}
+      labels:
+        {{- include "labels.common" . | nindent 8 }}
+    spec:
+      {{- if .Values.priorityClassName }}
+      priorityClassName: {{ .Values.priorityClassName }}
+      {{- end }}
+      topologySpreadConstraints:
+        - maxSkew: 1
+          topologyKey: topology.kubernetes.io/zone
+          whenUnsatisfiable: ScheduleAnyway
+          labelSelector:
+            matchLabels:
+{{- include "labels.selector" . | nindent 12 }}
+      {{- if .Values.deployment.image.pullSecret }}
+      imagePullSecrets:
+        - name: {{ .Values.deployment.image.pullSecret }}
+      {{- end }}
+      {{- if .Values.deployment.volumes }}
+      volumes:
+        {{- toYaml .Values.deployment.volumes | nindent 8 }}
+      {{- end }}
+      {{- if .Values.deployment.initContainers }}
+      initContainers:
+        {{- toYaml .Values.deployment.initContainers | nindent 8 }}
+      {{- end }}
+      containers:
+        - name: {{ .Values.applicationName }}
+          {{- if .Values.deployment.volumeMounts }}
+          volumeMounts:
+            {{- toYaml .Values.deployment.volumeMounts | nindent 12 }}
+          {{- end }}
+          {{- if .Values.deployment.command }}
+          command: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.command "context" $) | nindent 12 }}
+          {{- end }}
+          {{- if .Values.deployment.args }}
+          args: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.args "context" $) | nindent 12 }}
+          {{- end }}
+          image: "{{ .Values.deployment.image.repository }}:{{ .Values.deployment.image.tag }}"
+          imagePullPolicy: {{ .Values.deployment.image.pullPolicy }}
+          {{- if .Values.deployment.lifecycle }}
+          lifecycle: {{ toYaml .Values.deployment.lifecycle | nindent 12 }}
+          {{- end }}
+          {{- with .Values.deployment.probes }}
+          {{- with .liveness }}
+          livenessProbe:
+            {{- with .failureThreshold }}
+            failureThreshold: {{ . }}
+            {{- end }}
+            httpGet:
+              path: {{ .path }}
+              port: {{ .port }}
+              scheme: {{ .scheme }}
+            {{- with .periodSeconds }}
+            periodSeconds: {{ . 
}} + {{- end }} + {{- with .successThreshold }} + successThreshold: {{ . }} + {{- end }} + {{- with .timeoutSeconds }} + timeoutSeconds: {{ . }} + {{- end }} + initialDelaySeconds: {{ .initialDelaySeconds }} + {{- end }} + {{- end }} + {{- if or .Values.externalSecret.enabled .Values.otel_enabled (and .Values.configMap .Values.configMap.enabled) }} + envFrom: + {{- if .Values.externalSecret.enabled }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }} + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }}-dr + {{- end }} + {{- end }} + {{- if .Values.otel_enabled }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }}-otel + {{- end }} + {{- if and .Values.configMap .Values.configMap.enabled }} + - configMapRef: + name: {{ .Values.namespace }}-config + {{- end }} + {{- end }} + env: + - name: TZ + value: Asia/Kolkata + - name: NODE_IP + valueFrom: + fieldRef: + fieldPath: spec.nodeName + - name: POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name + - name: POD_NAMESPACE + valueFrom: + fieldRef: + fieldPath: metadata.namespace + - name: POD_IP + valueFrom: + fieldRef: + fieldPath: status.podIP + {{- if .Values.telegraf.enabled }} + - name: TELEGRAF_UDP_HOST + valueFrom: + fieldRef: + fieldPath: status.podIP + {{- end }} + {{- if .Values.otel_enabled }} + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: http://$(NODE_IP):4317 + {{- end }} + {{- with .Values.deployment.env }} + {{- range . }} + - name: {{ .name }} + value: "{{ .value }}" + {{- end }} + {{- end }} + {{- with .Values.deployment.ports }} + ports: + {{- range . 
}} + - containerPort: {{ .containerPort }} + name: {{ .name }} + protocol: {{ .protocol }} + {{- end }} + {{- end }} + {{- if .Values.telegraf.enabled }} + - containerPort: 9273 + name: telegraf-sc + protocol: TCP + {{- end }} + {{- with .Values.deployment.probes }} + {{- with .readiness }} + readinessProbe: + {{- with .failureThreshold }} + failureThreshold: {{ . }} + {{- end }} + httpGet: + path: {{ .path }} + port: {{ .port }} + scheme: {{ .scheme }} + {{- with .periodSeconds }} + periodSeconds: {{ . }} + {{- end }} + {{- with .successThreshold }} + successThreshold: {{ . }} + {{- end }} + {{- with .timeoutSeconds}} + timeoutSeconds: {{ . }} + {{- end }} + initialDelaySeconds: {{ .initialDelaySeconds }} + {{- end }} + {{- end }} + {{- with .Values.deployment.resources }} + resources: + {{- with .limits }} + limits: + {{- with .memory }} + memory: "{{ . }}" + {{- end }} + {{- with .cpu }} + cpu: "{{ . }}" + {{- end }} + {{- end }} + {{- with .requests }} + requests: + {{- with .memory }} + memory: "{{ . }}" + {{- end }} + {{- with .cpu }} + cpu: "{{ . }}" + {{- end }} + {{- end }} + {{- end }} + + {{- with .Values.deployment.hostAliases }} + hostAliases: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.deployment.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.deployment.nodeSelector }} + nodeSelector: + {{- toYaml . 
| nindent 8 }} + {{- end }} + terminationGracePeriodSeconds: {{ .Values.deployment.terminationGracePeriodSeconds | default 300 }} + {{- if .Values.deployment.serviceAccount.enabled }} + serviceAccountName: {{ .Values.namespace }} + {{- end }} + {{- if .Values.securityContext }} + securityContext: + {{- toYaml .Values.securityContext | nindent 8 }} + {{- end }} +{{- end }} diff --git a/helm-charts/skye-admin/templates/external-secrets.yaml b/helm-charts/skye-admin/templates/external-secrets.yaml new file mode 100644 index 00000000..0d602056 --- /dev/null +++ b/helm-charts/skye-admin/templates/external-secrets.yaml @@ -0,0 +1,30 @@ +{{- if and .Values.externalSecret .Values.externalSecret.enabled }} +apiVersion: external-secrets.io/v1beta1 +kind: ExternalSecret +metadata: + labels: + {{- include "labels.common" . | nindent 4 }} +{{- if .Values.externalSecret.annotations }} + annotations: +{{ toYaml .Values.externalSecret.annotations | indent 4 }} +{{- end }} + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }} + {{- end}} + namespace: {{ .Values.namespace }} +spec: + dataFrom: + - extract: + conversionStrategy: Default + key: {{ .Values.externalSecret.path }} + refreshInterval: 15s + secretStoreRef: + kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }} + name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }} + target: + creationPolicy: Owner + deletionPolicy: Retain + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }} + {{- end}} +{{- end }} diff --git a/helm-charts/skye-admin/templates/httpproxy.yaml b/helm-charts/skye-admin/templates/httpproxy.yaml new file mode 100644 index 00000000..94fe8f3e --- /dev/null +++ b/helm-charts/skye-admin/templates/httpproxy.yaml @@ -0,0 +1,50 @@ +{{- if and .Values.ingress .Values.ingress.enabled -}} +{{- if .Values.createContourGateway -}} +{{- if or ( eq 
"contour-internal" .Values.ingress.ingressClassName ) ( eq "contour-external" .Values.ingress.ingressClassName ) ( eq "contour-internal-0" .Values.ingress.ingressClassName ) ( eq "contour-internal-1" .Values.ingress.ingressClassName ) ( eq "contour-external-0" .Values.ingress.ingressClassName ) ( eq "contour-external-1" .Values.ingress.ingressClassName ) }} +{{- $servicePortNumber := .Values.ingress.servicePortNumber -}} +{{- $pathType := .Values.ingress.pathType -}} +{{- $namespace := .Values.namespace -}} +{{- $ingressClassName := .Values.ingress.ingressClassName -}} +{{ $count := 0 | int }} +{{- range .Values.ingress.hosts }} +apiVersion: projectcontour.io/v1 +kind: HTTPProxy +metadata: + namespace: {{ $namespace }} + name: {{ $namespace }}-{{ $count }} + labels: +{{ include "labels.common" $ | indent 4 }} +{{ include "labels.chart" $ | indent 4 }} + annotations: + projectcontour.io/ingress.class: {{ $.Values.ingress.ingressClassName }} +spec: + ingressClassName: {{ $ingressClassName }} + virtualhost: + fqdn: "{{ .host }}" + includes: + {{- range .paths }} + - conditions: + {{- if or ( eq ( lower .pathType ) "prefix" ) ( eq ( lower .pathType ) "implementationspecific") }} + - prefix: {{ .path }} + {{- end }} + {{- if ( eq ( lower .pathType ) "exact" ) }} + - header: + name: :path + exact: {{ .path }} + - prefix: {{ .path }} + {{- end }} + {{- if .targetService }} + name: {{ .targetService | replace "/" "-" }} + namespace: {{ (split "/" .targetService)._0 }} + {{- else }} + name: {{ $namespace }} + namespace: {{ $namespace }} + {{- end }} + {{- end }} + {{ $count = add1 $count }} +--- + +{{- end -}} +{{- end -}} +{{- end -}} +{{- end -}} diff --git a/helm-charts/skye-admin/templates/otel-secret.yaml b/helm-charts/skye-admin/templates/otel-secret.yaml new file mode 100644 index 00000000..c2514b7d --- /dev/null +++ b/helm-charts/skye-admin/templates/otel-secret.yaml @@ -0,0 +1,26 @@ +{{ if .Values.otel_enabled }} +apiVersion: external-secrets.io/v1beta1 +kind: 
ExternalSecret +metadata: + labels: + {{- include "labels.common" . | nindent 4 }} + annotations: + flagger.app/config-tracking: disabled + name: {{ if and .Values.deployment .Values.deployment.envFrom }}{{ .Values.deployment.envFrom.secretRef }}{{ else }}{{ .Values.namespace }}{{ end }}-otel + namespace: {{ .Values.namespace }} +spec: + dataFrom: + - extract: + conversionStrategy: Default + key: {{ .Values.infrastructure.vault.otelTokenPath | default "org/prd/cntr/devop/coralogix-token" }} + refreshInterval: 15s + secretStoreRef: + kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }} + name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }} + target: + creationPolicy: Owner + deletionPolicy: Retain + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }}-otel + {{- end}} +{{ end }} diff --git a/helm-charts/skye-admin/templates/pdb.yaml b/helm-charts/skye-admin/templates/pdb.yaml new file mode 100644 index 00000000..db87f89a --- /dev/null +++ b/helm-charts/skye-admin/templates/pdb.yaml @@ -0,0 +1,36 @@ +{{- if or (and .Values.podDisruptionBudget .Values.podDisruptionBudget.enabled) (and .Values.deployment .Values.deployment.enabled) }} +{{- if or ( and .Values.deployment (.Values.deployment.enabled) (.Values.autoscaling.enabled) ( gt (int .Values.autoscaling.minReplicas) 1)) ( and (eq .Values.autoscaling.enabled false) .Values.deployment ( gt ( int .Values.deployment.replicaCount) 1 )) }} +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . 
| indent 4 }} +spec: + {{- if .Values.podDisruptionBudget.enabled }} + maxUnavailable: {{ .Values.podDisruptionBudget.maxUnavailable }} + {{- else }} + {{- if and .Values.deployment .Values.deployment.enabled }} + {{- if (eq .Values.deployment.updateStrategy.strategy.type "RollingUpdate") }} + maxUnavailable: {{ .Values.deployment.updateStrategy.strategy.rollingUpdate.maxSurge | default "10%" }} + {{- else }} + maxUnavailable: "10%" + {{- end }} + {{ else }} + maxUnavailable: "10%" + {{- end }} + {{- end }} + {{- if .Values.podDisruptionBudget.minAvailable }} + minAvailable: {{ .Values.podDisruptionBudget.minAvailable }} + {{- end }} + selector: + matchLabels: + {{- if and (.Values.canary.enabled) (eq .Values.labels.env "prod") }} + {{- include "labels.primary-selector" . | nindent 6 }} + {{- else }} + {{- include "labels.selector" . | nindent 6 }} + {{- end }} +{{- end }} +{{- end }} diff --git a/helm-charts/skye-admin/templates/scaledobject.yaml b/helm-charts/skye-admin/templates/scaledobject.yaml new file mode 100644 index 00000000..259bfea2 --- /dev/null +++ b/helm-charts/skye-admin/templates/scaledobject.yaml @@ -0,0 +1,56 @@ +{{- if and .Values.autoscaling .Values.autoscaling.enabled }} +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + labels: + {{- include "labels.common" . 
| nindent 4 }} + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: {{ .Values.namespace }} + pollingInterval: {{ .Values.autoscaling.pollingInterval }} + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + minReplicaCount: 1 + {{- else }} + minReplicaCount: {{ .Values.canary.minCanaryReplicas | default $.Values.autoscaling.minReplicas }} + {{- end }} + maxReplicaCount: {{ .Values.canary.maxCanaryReplicas | default $.Values.autoscaling.maxReplicas }} + advanced: + horizontalPodAutoscalerConfig: + behavior: + scaleDown: + stabilizationWindowSeconds: {{ .Values.autoscaling.scaledown.stabilizationWindowSeconds }} + policies: + {{- range .Values.autoscaling.scaledown.policies }} + - type: {{ .type }} + value: {{ .value }} + periodSeconds: {{ .periodseconds }} + {{- end }} + selectPolicy: {{ .Values.autoscaling.scaledown.selectpolicy }} + scaleUp: + stabilizationWindowSeconds: {{ .Values.autoscaling.scaleup.stabilizationWindowSeconds }} + policies: + {{- range .Values.autoscaling.scaleup.policies }} + - type: {{ .type }} + value: {{ .value }} + periodSeconds: {{ .periodseconds }} + {{- end }} + selectPolicy: {{ .Values.autoscaling.scaleup.selectpolicy }} + triggers: + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + {{- range $.Values.autoscaling.triggers }} + {{- if or (eq .type "cpu") (eq .type "memory") }} + - metadata: + {{- toYaml .metadata | nindent 8 }} + type: {{ .type }} + metricType: "Utilization" + {{- end }} + {{- end }} + {{- else }} + {{- toYaml .Values.autoscaling.triggers | nindent 2 }} + {{ end }} + +{{- end }} diff --git a/helm-charts/skye-admin/templates/service.yaml b/helm-charts/skye-admin/templates/service.yaml new file mode 100644 index 00000000..8fcc5bc8 --- /dev/null +++ b/helm-charts/skye-admin/templates/service.yaml @@ -0,0 +1,27 @@ +{{- if and .Values.service .Values.service.enabled }} +{{- if or (eq .Values.canary.enabled false) ( and (.Values.canary.enabled) (ne .Values.labels.env "prod")) }} +apiVersion: v1 +kind: Service +metadata: + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} +{{- if .Values.service.annotations }} + annotations: +{{ toYaml .Values.service.annotations | indent 4 }} +{{- end }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} +spec: + type: {{ .Values.service.type }} + ports: + {{- range .Values.service.ports }} + - name: {{ .name }} + port: {{ .port }} + protocol: {{ .protocol }} + targetPort: {{ .targetPort }} + {{- end }} + selector: + {{- include "labels.selector" . | nindent 4 }} +{{- end }} +{{- end }} diff --git a/helm-charts/skye-admin/templates/serviceaccount.yaml b/helm-charts/skye-admin/templates/serviceaccount.yaml new file mode 100644 index 00000000..f05362f5 --- /dev/null +++ b/helm-charts/skye-admin/templates/serviceaccount.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.deployment (.Values.deployment.enabled) (.Values.deployment.serviceAccount.enabled) }} +apiVersion: v1 +kind: ServiceAccount +metadata: + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} + {{- with .Values.deployment.serviceAccount.annotations }} + annotations: + {{- toYaml . 
| nindent 4 }} + {{- end }} +{{- end }} diff --git a/helm-charts/skye-admin/values.yaml b/helm-charts/skye-admin/values.yaml new file mode 100644 index 00000000..4c93014b --- /dev/null +++ b/helm-charts/skye-admin/values.yaml @@ -0,0 +1,186 @@ +# Default values for skye-admin helm chart + +namespace: prd-skye-admin +applicationName: skye-admin +replicaCount: 2 + +labels: + env: prd + team: bharatml + bu: ml + priority: p1 + priority_v2: cp3 + service_type: "" + +priorityClassName: "" + +telegraf: + enabled: false + +otel_enabled: false + +infrastructure: + secretStore: + name: vault-backend + kind: ClusterSecretStore + vault: + basePath: "" + otelTokenPath: "" + +deployment: + enabled: true + replicaCount: 2 + revisionHistoryLimit: 3 + image: + repository: ghcr.io/meesho/skye-admin + tag: latest + pullPolicy: IfNotPresent + ports: + - containerPort: 8092 + name: http + protocol: TCP + probes: + liveness: + path: /health + port: 8092 + scheme: HTTP + initialDelaySeconds: 30 + periodSeconds: 10 + failureThreshold: 3 + successThreshold: 1 + timeoutSeconds: 5 + readiness: + path: /health + port: 8092 + scheme: HTTP + initialDelaySeconds: 20 + periodSeconds: 10 + failureThreshold: 3 + successThreshold: 1 + timeoutSeconds: 5 + resources: + requests: + memory: "256Mi" + cpu: "250m" + limits: + memory: "512Mi" + cpu: "500m" + env: + - name: APP_NAME + value: "skye" + - name: APP_ENV + value: "local" + - name: APP_LOG_LEVEL + value: "INFO" + - name: APP_METRIC_SAMPLING_RATE + value: "100" + - name: PORT + value: "8092" + # Etcd + - name: ETCD_SERVER + value: "etcd:2379" + - name: ETCD_WATCHER_ENABLED + value: "true" + # Kafka producer (model state) + - name: MODEL_STATE_PRODUCER + value: "1" + - name: KAFKA_PRODUCER_1_TOPICS + value: "skye.model-state" + - name: KAFKA_PRODUCER_1_BOOTSTRAP_SERVERS + value: "broker:29092" + - name: KAFKA_PRODUCER_1_CLIENT_ID + value: "skye-admin-producer" + # Kafka consumer (model state) + - name: MODEL_STATE_CONSUMER + value: "1" + - name: 
KAFKA_1_TOPICS + value: "skye.model-state" + - name: KAFKA_1_BOOTSTRAP_SERVERS + value: "broker:29092" + - name: KAFKA_1_BASIC_AUTH_CREDENTIAL_SOURCE + value: "NONE" + - name: KAFKA_1_GROUP_ID + value: "skye-admin-model-state" + - name: KAFKA_1_AUTO_OFFSET_RESET + value: "earliest" + - name: KAFKA_1_AUTO_COMMIT_INTERVAL_MS + value: "5000" + - name: KAFKA_1_ENABLE_AUTO_COMMIT + value: "false" + - name: KAFKA_1_LISTENER_CONCURRENCY + value: "1" + - name: KAFKA_1_CLIENT_ID + value: "skye-admin-consumer" + - name: KAFKA_1_BATCH_SIZE + value: "10" + - name: KAFKA_1_POLL_TIMEOUT + value: "1000" + serviceAccount: + enabled: false + annotations: {} + updateStrategy: + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 0 + maxSurge: 1 + terminationGracePeriodSeconds: 30 + +service: + enabled: true + type: ClusterIP + ports: + - name: http + port: 80 + targetPort: 8092 + protocol: TCP + +autoscaling: + enabled: false + minReplicas: 2 + maxReplicas: 10 + pollingInterval: 30 + scaledown: + stabilizationWindowSeconds: 300 + policies: + - type: Percent + value: 10 + periodseconds: 60 + selectpolicy: Min + scaleup: + stabilizationWindowSeconds: 0 + policies: + - type: Percent + value: 50 + periodseconds: 60 + selectpolicy: Max + triggers: + - type: cpu + metadata: + value: "70" + metricType: Utilization + +ingress: + enabled: false + ingressClassName: contour-internal +createContourGateway: false + +externalSecret: + enabled: false + path: "" + +configMap: + enabled: false + +canary: + enabled: false + promURL: "" + slackChannel: "" + slackWebhookURL: "" + +podDisruptionBudget: + enabled: false + maxUnavailable: "10%" + +disasterRecovery: + enabled: false diff --git a/helm-charts/skye-consumers/Chart.yaml b/helm-charts/skye-consumers/Chart.yaml new file mode 100644 index 00000000..fe010992 --- /dev/null +++ b/helm-charts/skye-consumers/Chart.yaml @@ -0,0 +1,10 @@ +apiVersion: v2 +name: skye-consumers +description: A Helm chart for the Skye Consumers service 
(embedding, realtime, and delta consumers) +type: application +version: 1.0.0 +appVersion: "1.0.0" + +maintainers: + - name: BharatMLStack Team + email: ml-oss@meesho.com diff --git a/helm-charts/skye-consumers/templates/NOTES.txt b/helm-charts/skye-consumers/templates/NOTES.txt new file mode 100644 index 00000000..35a4f3c2 --- /dev/null +++ b/helm-charts/skye-consumers/templates/NOTES.txt @@ -0,0 +1,18 @@ +{{ .Chart.Name }} has been deployed. + +Namespace: {{ .Values.namespace }} +Application: {{ .Values.applicationName }} + +{{- if .Values.service.enabled }} +Service: {{ .Values.namespace }}:{{ (index .Values.service.ports 0).port }} +{{- end }} + +{{- if .Values.ingress.enabled }} +Ingress is enabled. +{{- end }} + +{{- if .Values.autoscaling.enabled }} +Autoscaling: {{ .Values.autoscaling.minReplicas }} - {{ .Values.autoscaling.maxReplicas }} replicas +{{- else }} +Replicas: {{ .Values.replicaCount }} +{{- end }} diff --git a/helm-charts/skye-consumers/templates/_helpers.tpl b/helm-charts/skye-consumers/templates/_helpers.tpl new file mode 100644 index 00000000..168e0342 --- /dev/null +++ b/helm-charts/skye-consumers/templates/_helpers.tpl @@ -0,0 +1,70 @@ +{{/* vim: set filetype=mustache: */}} +{{/* +Expand the name of the chart. +*/}} +{{- define "name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}} +{{- end -}} + +{{/* +Create a default fully qualified app name. +We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec). +*/}} +{{- define "fullname" -}} +{{- $name := default .Chart.Name .Values.nameOverride -}} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}} +{{- end -}} + +{{- define "labels.selector" -}} +app: {{ .Values.namespace }} +{{- end -}} + +{{- define "labels.primary-selector" -}} +app: {{ .Values.namespace }}-primary +{{- end -}} + +{{- define "labels.common" -}} +{{ template "labels.selector" . 
}} +{{- if and .Values.deployment .Values.deployment.image }} +version: {{ .Values.deployment.image.tag }} +{{- end }} +env: {{ .Values.labels.env }} +team: {{ .Values.labels.team }} +bu: {{ .Values.labels.bu }} +service: {{ .Values.applicationName }} +priority: {{ .Values.labels.priority }} +priority_v2: {{ .Values.labels.priority_v2 | default "cp3" }} +primary_owner: {{ .Values.labels.primary_owner | default .Values.labels.team }} +secondary_owner: {{ .Values.labels.secondary_owner | default .Values.labels.team }} +service_type: {{ .Values.labels.service_type | default "" | replace "," "-" | quote }} +{{- end -}} + +{{- define "labels.chart" -}} +chart: "{{ .Chart.Name }}-{{ .Chart.Version }}" +release: {{ .Release.Name | quote }} +heritage: {{ .Release.Service | quote }} +{{- end -}} + +{{/* +Renders a value that contains a template. +Usage: +{{ include "application.tplvalues.render" ( dict "value" .Values.path.to.the.Value "context" $) }} +*/}} + +{{- define "application.tplvalues.render" -}} + {{- if typeIs "string" .value }} + {{- tpl .value .context }} + {{- else }} + {{- tpl (.value | toYaml) .context }} + {{- end }} +{{- end -}} + +{{- define "canary.promURL" -}} +{{- if .Values.canary.promURL }} +{{- .Values.canary.promURL }} +{{- else if eq .Values.labels.env "prod" }} +https://prod-ops-metricsui.example.com/select/100/ +{{- else }} +https://sb-ops-metricsui.example.com/select/100/ +{{- end }} +{{- end -}} diff --git a/helm-charts/skye-consumers/templates/alert-provider.yaml b/helm-charts/skye-consumers/templates/alert-provider.yaml new file mode 100644 index 00000000..a300bb14 --- /dev/null +++ b/helm-charts/skye-consumers/templates/alert-provider.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.canary (.Values.canary.enabled) .Values.canary.slackWebhookURL (ne .Values.canary.slackWebhookURL "") }} +apiVersion: flagger.app/v1beta1 +kind: AlertProvider +metadata: + name: flagger-status + namespace: {{ .Values.namespace }} +spec: + type: slack + {{- if 
.Values.canary.slackChannel }} + channel: {{ .Values.canary.slackChannel }} + {{- end }} + username: flagger + address: {{ .Values.canary.slackWebhookURL }} +{{- end }} diff --git a/helm-charts/skye-consumers/templates/configmap.yaml b/helm-charts/skye-consumers/templates/configmap.yaml new file mode 100644 index 00000000..fc527219 --- /dev/null +++ b/helm-charts/skye-consumers/templates/configmap.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.configMap .Values.configMap.enabled }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: {{ .Values.namespace }}-config + namespace: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} +data: + {{- range $key, $value := .Values.configMap.data }} + {{ $key }}: {{ $value | quote }} + {{- end }} +{{- end }} diff --git a/helm-charts/skye-consumers/templates/deployment.yaml b/helm-charts/skye-consumers/templates/deployment.yaml new file mode 100644 index 00000000..34cbb1ed --- /dev/null +++ b/helm-charts/skye-consumers/templates/deployment.yaml @@ -0,0 +1,237 @@ +{{- if and .Values.deployment .Values.deployment.enabled }} +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ .Values.namespace }} + namespace: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} +{{- include "labels.selector" . | nindent 4 }} +spec: + {{- with .Values.deployment.minReadySeconds }} + minReadySeconds: {{ . }} + {{- end }} + {{- if not .Values.autoscaling.enabled }} + replicas: {{ .Values.replicaCount }} + {{- end }} + revisionHistoryLimit: {{ .Values.deployment.revisionHistoryLimit }} + selector: + matchLabels: +{{- include "labels.selector" . | nindent 6 }} +{{- with .Values.deployment.updateStrategy }} +{{ toYaml . | indent 2 -}} +{{- end }} + template: + metadata: + annotations: + {{- with .Values.deployment.podAnnotations }} + {{- toYaml . 
| nindent 8 }} + {{- end }} + {{- if .Values.telegraf.enabled }} + telegraf.influxdata.com/class: "infra" + {{- end }} + labels: + {{- include "labels.common" . | nindent 8 }} + spec: + {{- if .Values.priorityClassName }} + priorityClassName: {{ .Values.priorityClassName }} + {{- end }} + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: topology.kubernetes.io/zone + whenUnsatisfiable: ScheduleAnyway + labelSelector: + matchLabels: +{{- include "labels.selector" . | nindent 12 }} + {{- if .Values.deployment.image.pullSecret }} + imagePullSecrets: + - name: {{ .Values.deployment.image.pullSecret }} + {{- end }} + {{- if .Values.deployment.volumes }} + volumes: + {{- toYaml .Values.deployment.volumes | nindent 8 }} + {{- end }} + {{- if .Values.deployment.initContainers }} + initContainers: + {{- toYaml .Values.deployment.initContainers | nindent 8 }} + {{- end }} + containers: + - name: {{ .Values.applicationName }} + {{- if .Values.deployment.volumeMounts }} + volumeMounts: + {{- toYaml .Values.deployment.volumeMounts | nindent 12 }} + {{- end }} + {{- if .Values.deployment.command }} + command: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.command "context" $) | nindent 12 }} + {{- end }} + {{- if .Values.deployment.args }} + args: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.args "context" $) | nindent 12 }} + {{- end }} + image: "{{ .Values.deployment.image.repository }}:{{ .Values.deployment.image.tag }}" + imagePullPolicy: {{ .Values.deployment.image.pullPolicy }} + {{- if .Values.deployment.lifecycle }} + lifecycle: {{ toYaml .Values.deployment.lifecycle | nindent 12 }} + {{- end }} + {{- with .Values.deployment.probes }} + {{- with .liveness }} + livenessProbe: + {{- with .failureThreshold }} + failureThreshold: {{ . }} + {{- end }} + httpGet: + path: {{ .path }} + port: {{ .port }} + scheme: {{ .scheme }} + {{- with .periodSeconds }} + periodSeconds: {{ . 
}} + {{- end }} + {{- with .successThreshold }} + successThreshold: {{ . }} + {{- end }} + {{- with .timeoutSeconds }} + timeoutSeconds: {{ . }} + {{- end }} + initialDelaySeconds: {{ .initialDelaySeconds }} + {{- end }} + {{- end }} + {{- if or .Values.externalSecret.enabled .Values.otel_enabled (and .Values.configMap .Values.configMap.enabled) }} + envFrom: + {{- if .Values.externalSecret.enabled }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }} + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }}-dr + {{- end }} + {{- end }} + {{- if .Values.otel_enabled }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }}-otel + {{- end }} + {{- if and .Values.configMap .Values.configMap.enabled }} + - configMapRef: + name: {{ .Values.namespace }}-config + {{- end }} + {{- end }} + env: + - name: TZ + value: Asia/Kolkata + - name: NODE_IP + valueFrom: + fieldRef: + fieldPath: spec.nodeName + - name: POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name + - name: POD_NAMESPACE + valueFrom: + fieldRef: + fieldPath: metadata.namespace + - name: POD_IP + valueFrom: + fieldRef: + fieldPath: status.podIP + {{- if .Values.telegraf.enabled }} + - name: TELEGRAF_UDP_HOST + valueFrom: + fieldRef: + fieldPath: status.podIP + {{- end }} + {{- if .Values.otel_enabled }} + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: http://$(NODE_IP):4317 + {{- end }} + {{- with .Values.deployment.env }} + {{- range . }} + - name: {{ .name }} + value: "{{ .value }}" + {{- end }} + {{- end }} + {{- with .Values.deployment.ports }} + ports: + {{- range . 
}} + - containerPort: {{ .containerPort }} + name: {{ .name }} + protocol: {{ .protocol }} + {{- end }} + {{- end }} + {{- if .Values.telegraf.enabled }} + - containerPort: 9273 + name: telegraf-sc + protocol: TCP + {{- end }} + {{- with .Values.deployment.probes }} + {{- with .readiness }} + readinessProbe: + {{- with .failureThreshold }} + failureThreshold: {{ . }} + {{- end }} + httpGet: + path: {{ .path }} + port: {{ .port }} + scheme: {{ .scheme }} + {{- with .periodSeconds }} + periodSeconds: {{ . }} + {{- end }} + {{- with .successThreshold }} + successThreshold: {{ . }} + {{- end }} + {{- with .timeoutSeconds}} + timeoutSeconds: {{ . }} + {{- end }} + initialDelaySeconds: {{ .initialDelaySeconds }} + {{- end }} + {{- end }} + {{- with .Values.deployment.resources }} + resources: + {{- with .limits }} + limits: + {{- with .memory }} + memory: "{{ . }}" + {{- end }} + {{- with .cpu }} + cpu: "{{ . }}" + {{- end }} + {{- end }} + {{- with .requests }} + requests: + {{- with .memory }} + memory: "{{ . }}" + {{- end }} + {{- with .cpu }} + cpu: "{{ . }}" + {{- end }} + {{- end }} + {{- end }} + + {{- with .Values.deployment.hostAliases }} + hostAliases: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.deployment.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.deployment.nodeSelector }} + nodeSelector: + {{- toYaml . 
| nindent 8 }} + {{- end }} + terminationGracePeriodSeconds: {{ .Values.deployment.terminationGracePeriodSeconds | default "300" }} + {{- if .Values.deployment.serviceAccount.enabled }} + serviceAccountName: {{ .Values.namespace }} + {{- end }} + {{- if .Values.securityContext }} + securityContext: + {{- toYaml .Values.securityContext | nindent 8 }} + {{- end }} +{{- end }} diff --git a/helm-charts/skye-consumers/templates/external-secrets.yaml b/helm-charts/skye-consumers/templates/external-secrets.yaml new file mode 100644 index 00000000..0d602056 --- /dev/null +++ b/helm-charts/skye-consumers/templates/external-secrets.yaml @@ -0,0 +1,30 @@ +{{- if and .Values.externalSecret .Values.externalSecret.enabled }} +apiVersion: external-secrets.io/v1beta1 +kind: ExternalSecret +metadata: + labels: + {{- include "labels.common" . | nindent 4 }} +{{- if .Values.externalSecret.annotations }} + annotations: +{{ toYaml .Values.externalSecret.annotations | indent 4 }} +{{- end }} + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }} + {{- end}} + namespace: {{ .Values.namespace }} +spec: + dataFrom: + - extract: + conversionStrategy: Default + key: {{ .Values.externalSecret.path }} + refreshInterval: 15s + secretStoreRef: + kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }} + name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }} + target: + creationPolicy: Owner + deletionPolicy: Retain + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }} + {{- end}} +{{- end }} diff --git a/helm-charts/skye-consumers/templates/httpproxy.yaml b/helm-charts/skye-consumers/templates/httpproxy.yaml new file mode 100644 index 00000000..94fe8f3e --- /dev/null +++ b/helm-charts/skye-consumers/templates/httpproxy.yaml @@ -0,0 +1,50 @@ +{{- if and .Values.ingress .Values.ingress.enabled -}} +{{- if 
.Values.createContourGateway -}} +{{- if or ( eq "contour-internal" .Values.ingress.ingressClassName ) ( eq "contour-external" .Values.ingress.ingressClassName ) ( eq "contour-internal-0" .Values.ingress.ingressClassName ) ( eq "contour-internal-1" .Values.ingress.ingressClassName ) ( eq "contour-external-0" .Values.ingress.ingressClassName ) ( eq "contour-external-1" .Values.ingress.ingressClassName ) }} +{{- $servicePortNumber := .Values.ingress.servicePortNumber -}} +{{- $pathType := .Values.ingress.pathType -}} +{{- $namespace := .Values.namespace -}} +{{- $ingressClassName := .Values.ingress.ingressClassName -}} +{{ $count := 0 | int }} +{{- range .Values.ingress.hosts }} +apiVersion: projectcontour.io/v1 +kind: HTTPProxy +metadata: + namespace: {{ $namespace }} + name: {{ $namespace }}-{{ $count }} + labels: +{{ include "labels.common" $ | indent 4 }} +{{ include "labels.chart" $ | indent 4 }} + annotations: + projectcontour.io/ingress.class: {{ $.Values.ingress.ingressClassName }} +spec: + ingressClassName: {{ $ingressClassName }} + virtualhost: + fqdn: "{{ .host }}" + includes: + {{- range .paths }} + - conditions: + {{- if or ( eq ( lower .pathType ) "prefix" ) ( eq ( lower .pathType ) "implementationspecific") }} + - prefix: {{ .path }} + {{- end }} + {{- if ( eq ( lower .pathType ) "exact" ) }} + - header: + name: :path + exact: {{ .path }} + - prefix: {{ .path }} + {{- end }} + {{- if .targetService }} + name: {{ .targetService | replace "/" "-" }} + namespace: {{ (split "/" .targetService)._0 }} + {{- else }} + name: {{ $namespace }} + namespace: {{ $namespace }} + {{- end }} + {{- end }} + {{ $count = add1 $count }} +--- + +{{- end -}} +{{- end -}} +{{- end -}} +{{- end -}} diff --git a/helm-charts/skye-consumers/templates/otel-secret.yaml b/helm-charts/skye-consumers/templates/otel-secret.yaml new file mode 100644 index 00000000..c2514b7d --- /dev/null +++ b/helm-charts/skye-consumers/templates/otel-secret.yaml @@ -0,0 +1,26 @@ +{{ if 
.Values.otel_enabled }} +apiVersion: external-secrets.io/v1beta1 +kind: ExternalSecret +metadata: + labels: + {{- include "labels.common" . | nindent 4 }} + annotations: + flagger.app/config-tracking: disabled + name: {{ if and .Values.deployment .Values.deployment.envFrom }}{{ .Values.deployment.envFrom.secretRef }}{{ else }}{{ .Values.namespace }}{{ end }}-otel + namespace: {{ .Values.namespace }} +spec: + dataFrom: + - extract: + conversionStrategy: Default + key: {{ .Values.infrastructure.vault.otelTokenPath | default "org/prd/cntr/devop/coralogix-token" }} + refreshInterval: 15s + secretStoreRef: + kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }} + name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }} + target: + creationPolicy: Owner + deletionPolicy: Retain + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }}-otel + {{- end}} +{{ end }} diff --git a/helm-charts/skye-consumers/templates/pdb.yaml b/helm-charts/skye-consumers/templates/pdb.yaml new file mode 100644 index 00000000..db87f89a --- /dev/null +++ b/helm-charts/skye-consumers/templates/pdb.yaml @@ -0,0 +1,36 @@ +{{- if or (and .Values.podDisruptionBudget .Values.podDisruptionBudget.enabled) (and .Values.deployment .Values.deployment.enabled) }} +{{- if or ( and .Values.deployment (.Values.deployment.enabled) (.Values.autoscaling.enabled) ( gt (int .Values.autoscaling.minReplicas) 1)) ( and (eq .Values.autoscaling.enabled false) .Values.deployment ( gt ( int .Values.deployment.replicaCount) 1 )) }} +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . 
| indent 4 }} +spec: + {{- if .Values.podDisruptionBudget.enabled }} + maxUnavailable: {{ .Values.podDisruptionBudget.maxUnavailable }} + {{- else }} + {{- if and .Values.deployment .Values.deployment.enabled }} + {{- if (eq .Values.deployment.updateStrategy.strategy.type "RollingUpdate") }} + maxUnavailable: {{ .Values.deployment.updateStrategy.strategy.rollingUpdate.maxSurge | default "10%" }} + {{- else }} + maxUnavailable: "10%" + {{- end }} + {{ else }} + maxUnavailable: "10%" + {{- end }} + {{- end }} + {{- if .Values.podDisruptionBudget.minAvailable }} + minAvailable: {{ .Values.podDisruptionBudget.minAvailable }} + {{- end }} + selector: + matchLabels: + {{- if and (.Values.canary.enabled) (eq .Values.labels.env "prod") }} + {{- include "labels.primary-selector" . | nindent 6 }} + {{- else }} + {{- include "labels.selector" . | nindent 6 }} + {{- end }} +{{- end }} +{{- end }} diff --git a/helm-charts/skye-consumers/templates/scaledobject.yaml b/helm-charts/skye-consumers/templates/scaledobject.yaml new file mode 100644 index 00000000..259bfea2 --- /dev/null +++ b/helm-charts/skye-consumers/templates/scaledobject.yaml @@ -0,0 +1,56 @@ +{{- if and .Values.autoscaling .Values.autoscaling.enabled }} +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + labels: + {{- include "labels.common" . 
| nindent 4 }} + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: {{ .Values.namespace }} + pollingInterval: {{ .Values.autoscaling.pollingInterval }} + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + minReplicaCount: 1 + {{- else }} + minReplicaCount: {{ .Values.canary.minCanaryReplicas | default $.Values.autoscaling.minReplicas }} + {{- end }} + maxReplicaCount: {{ .Values.canary.maxCanaryReplicas | default $.Values.autoscaling.maxReplicas }} + advanced: + horizontalPodAutoscalerConfig: + behavior: + scaleDown: + stabilizationWindowSeconds: {{ .Values.autoscaling.scaledown.stabilizationWindowSeconds }} + policies: + {{- range .Values.autoscaling.scaledown.policies }} + - type: {{ .type }} + value: {{ .value }} + periodSeconds: {{ .periodseconds }} + {{- end }} + selectPolicy: {{ .Values.autoscaling.scaledown.selectpolicy }} + scaleUp: + stabilizationWindowSeconds: {{ .Values.autoscaling.scaleup.stabilizationWindowSeconds }} + policies: + {{- range .Values.autoscaling.scaleup.policies }} + - type: {{ .type }} + value: {{ .value }} + periodSeconds: {{ .periodseconds }} + {{- end }} + selectPolicy: {{ .Values.autoscaling.scaleup.selectpolicy }} + triggers: + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + {{- range $.Values.autoscaling.triggers }} + {{- if or (eq .type "cpu") (eq .type "memory") }} + - metadata: + {{- toYaml .metadata | nindent 8 }} + type: {{ .type }} + metricType: "Utilization" + {{- end }} + {{- end }} + {{- else }} + {{- toYaml .Values.autoscaling.triggers | nindent 2 }} + {{ end }} + +{{- end }} diff --git a/helm-charts/skye-consumers/templates/service.yaml b/helm-charts/skye-consumers/templates/service.yaml new file mode 100644 index 00000000..8fcc5bc8 --- /dev/null +++ b/helm-charts/skye-consumers/templates/service.yaml @@ -0,0 +1,29 @@ +{{- if and .Values.service .Values.service.enabled }} +{{- if or (eq 
.Values.canary.enabled false) ( and (.Values.canary.enabled) (ne .Values.labels.env "prod")) }} +apiVersion: v1 +kind: Service +metadata: + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} +{{- if .Values.service.annotations }} + annotations: +{{ toYaml .Values.service.annotations | indent 4 }} +{{- end }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} +spec: + type: {{ .Values.service.type }} + ports: + {{- range .Values.service.ports }} + - name: {{ .name }} + port: {{ .port }} + protocol: {{ .protocol }} + targetPort: {{ .targetPort }} + {{- end }} + selector: + {{- include "labels.selector" . | nindent 4 }} +{{- end }} +{{- end }} diff --git a/helm-charts/skye-consumers/templates/serviceaccount.yaml b/helm-charts/skye-consumers/templates/serviceaccount.yaml new file mode 100644 index 00000000..f05362f5 --- /dev/null +++ b/helm-charts/skye-consumers/templates/serviceaccount.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.deployment (.Values.deployment.enabled) (.Values.deployment.serviceAccount.enabled) }} +apiVersion: v1 +kind: ServiceAccount +metadata: + namespace: {{ .Values.namespace }} + name: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} + {{- with .Values.deployment.serviceAccount.annotations }} + annotations: + {{- toYaml . 
| nindent 4 }} + {{- end }} +{{- end }} diff --git a/helm-charts/skye-consumers/values.yaml b/helm-charts/skye-consumers/values.yaml new file mode 100644 index 00000000..f18d0bd9 --- /dev/null +++ b/helm-charts/skye-consumers/values.yaml @@ -0,0 +1,270 @@ +# Default values for skye-consumers helm chart + +namespace: prd-skye-consumers +applicationName: skye-consumers +replicaCount: 2 + +labels: + env: prd + team: bharatml + bu: ml + priority: p1 + priority_v2: cp3 + service_type: "" + +priorityClassName: "" + +telegraf: + enabled: false + +otel_enabled: false + +infrastructure: + secretStore: + name: vault-backend + kind: ClusterSecretStore + vault: + basePath: "" + otelTokenPath: "" + +deployment: + enabled: true + replicaCount: 2 + revisionHistoryLimit: 3 + image: + repository: ghcr.io/meesho/skye-consumers + tag: latest + pullPolicy: IfNotPresent + ports: + - containerPort: 8093 + name: http + protocol: TCP + probes: + liveness: + path: /health + port: 8093 + scheme: HTTP + initialDelaySeconds: 30 + periodSeconds: 10 + failureThreshold: 3 + successThreshold: 1 + timeoutSeconds: 5 + readiness: + path: /health + port: 8093 + scheme: HTTP + initialDelaySeconds: 20 + periodSeconds: 10 + failureThreshold: 3 + successThreshold: 1 + timeoutSeconds: 5 + resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "1000m" + env: + - name: APP_NAME + value: "skye" + - name: APP_ENV + value: "local" + - name: APP_LOG_LEVEL + value: "INFO" + - name: APP_METRIC_SAMPLING_RATE + value: "100" + - name: PORT + value: "8093" + # Etcd + - name: ETCD_SERVER + value: "etcd:2379" + - name: ETCD_WATCHER_ENABLED + value: "true" + # Embedding consumer (ID=2) + - name: EMBEDDING_CONSUMER_KAFKA_IDS + value: "2" + - name: KAFKA_2_TOPICS + value: "skye.embedding" + - name: KAFKA_2_BOOTSTRAP_SERVERS + value: "broker:29092" + - name: KAFKA_2_BASIC_AUTH_CREDENTIAL_SOURCE + value: "NONE" + - name: KAFKA_2_GROUP_ID + value: "skye-embedding-consumer" + - name: 
KAFKA_2_AUTO_OFFSET_RESET + value: "earliest" + - name: KAFKA_2_AUTO_COMMIT_INTERVAL_MS + value: "5000" + - name: KAFKA_2_ENABLE_AUTO_COMMIT + value: "false" + - name: KAFKA_2_LISTENER_CONCURRENCY + value: "1" + - name: KAFKA_2_CLIENT_ID + value: "skye-embedding-consumer" + - name: KAFKA_2_BATCH_SIZE + value: "10" + - name: KAFKA_2_POLL_TIMEOUT + value: "1000" + # Embedding sequence consumer (ID=3) + - name: EMBEDDING_CONSUMER_SEQUENCE_KAFKA_IDS + value: "3" + - name: KAFKA_3_TOPICS + value: "skye.embedding-sequence" + - name: KAFKA_3_BOOTSTRAP_SERVERS + value: "broker:29092" + - name: KAFKA_3_BASIC_AUTH_CREDENTIAL_SOURCE + value: "NONE" + - name: KAFKA_3_GROUP_ID + value: "skye-embedding-seq-consumer" + - name: KAFKA_3_AUTO_OFFSET_RESET + value: "earliest" + - name: KAFKA_3_AUTO_COMMIT_INTERVAL_MS + value: "5000" + - name: KAFKA_3_ENABLE_AUTO_COMMIT + value: "false" + - name: KAFKA_3_LISTENER_CONCURRENCY + value: "1" + - name: KAFKA_3_CLIENT_ID + value: "skye-embedding-seq-consumer" + - name: KAFKA_3_BATCH_SIZE + value: "10" + - name: KAFKA_3_POLL_TIMEOUT + value: "1000" + # Realtime consumer (ID=4) + - name: REALTIME_CONSUMER_KAFKA_IDS + value: "4" + - name: KAFKA_4_TOPICS + value: "skye.realtime" + - name: KAFKA_4_BOOTSTRAP_SERVERS + value: "broker:29092" + - name: KAFKA_4_BASIC_AUTH_CREDENTIAL_SOURCE + value: "NONE" + - name: KAFKA_4_GROUP_ID + value: "skye-realtime-consumer" + - name: KAFKA_4_AUTO_OFFSET_RESET + value: "earliest" + - name: KAFKA_4_AUTO_COMMIT_INTERVAL_MS + value: "5000" + - name: KAFKA_4_ENABLE_AUTO_COMMIT + value: "false" + - name: KAFKA_4_LISTENER_CONCURRENCY + value: "1" + - name: KAFKA_4_CLIENT_ID + value: "skye-realtime-consumer" + - name: KAFKA_4_BATCH_SIZE + value: "10" + - name: KAFKA_4_POLL_TIMEOUT + value: "1000" + # Realtime producer (ID=5) + - name: REALTIME_PRODUCER_KAFKA_ID + value: "5" + - name: KAFKA_PRODUCER_5_TOPICS + value: "skye.realtime" + - name: KAFKA_PRODUCER_5_BOOTSTRAP_SERVERS + value: "broker:29092" + - name: 
KAFKA_PRODUCER_5_CLIENT_ID + value: "skye-realtime-producer" + # Realtime delta producer (ID=6) + - name: REALTIME_DELTA_PRODUCER_KAFKA_ID + value: "6" + - name: KAFKA_PRODUCER_6_TOPICS + value: "skye.realtime-delta" + - name: KAFKA_PRODUCER_6_BOOTSTRAP_SERVERS + value: "broker:29092" + - name: KAFKA_PRODUCER_6_CLIENT_ID + value: "skye-realtime-delta-producer" + # Realtime delta consumer (ID=7) + - name: REALTIME_DELTA_CONSUMER_KAFKA_ID + value: "7" + - name: KAFKA_7_TOPICS + value: "skye.realtime-delta" + - name: KAFKA_7_BOOTSTRAP_SERVERS + value: "broker:29092" + - name: KAFKA_7_BASIC_AUTH_CREDENTIAL_SOURCE + value: "NONE" + - name: KAFKA_7_GROUP_ID + value: "skye-realtime-delta-consumer" + - name: KAFKA_7_AUTO_OFFSET_RESET + value: "earliest" + - name: KAFKA_7_AUTO_COMMIT_INTERVAL_MS + value: "5000" + - name: KAFKA_7_ENABLE_AUTO_COMMIT + value: "false" + - name: KAFKA_7_LISTENER_CONCURRENCY + value: "1" + - name: KAFKA_7_CLIENT_ID + value: "skye-realtime-delta-consumer" + - name: KAFKA_7_BATCH_SIZE + value: "10" + - name: KAFKA_7_POLL_TIMEOUT + value: "1000" + serviceAccount: + enabled: false + annotations: {} + updateStrategy: + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 0 + maxSurge: 1 + terminationGracePeriodSeconds: 30 + +service: + enabled: true + type: ClusterIP + ports: + - name: http + port: 80 + targetPort: 8093 + protocol: TCP + +autoscaling: + enabled: false + minReplicas: 2 + maxReplicas: 10 + pollingInterval: 30 + scaledown: + stabilizationWindowSeconds: 300 + policies: + - type: Percent + value: 10 + periodseconds: 60 + selectpolicy: Min + scaleup: + stabilizationWindowSeconds: 0 + policies: + - type: Percent + value: 50 + periodseconds: 60 + selectpolicy: Max + triggers: + - type: cpu + metadata: + value: "70" + metricType: Utilization + +ingress: + enabled: false + ingressClassName: contour-internal +createContourGateway: false + +externalSecret: + enabled: false + path: "" + +configMap: + enabled: false + +canary: + 
enabled: false + promURL: "" + slackChannel: "" + slackWebhookURL: "" + +podDisruptionBudget: + enabled: false + maxUnavailable: "10%" + +disasterRecovery: + enabled: false diff --git a/helm-charts/skye-serving/Chart.yaml b/helm-charts/skye-serving/Chart.yaml new file mode 100644 index 00000000..064903ff --- /dev/null +++ b/helm-charts/skye-serving/Chart.yaml @@ -0,0 +1,10 @@ +apiVersion: v2 +name: skye-serving +description: A Helm chart for the Skye Serving service (embedding search and retrieval) +type: application +version: 1.0.0 +appVersion: "1.0.0" + +maintainers: + - name: BharatMLStack Team + email: ml-oss@meesho.com diff --git a/helm-charts/skye-serving/templates/NOTES.txt b/helm-charts/skye-serving/templates/NOTES.txt new file mode 100644 index 00000000..35a4f3c2 --- /dev/null +++ b/helm-charts/skye-serving/templates/NOTES.txt @@ -0,0 +1,18 @@ +{{ .Chart.Name }} has been deployed. + +Namespace: {{ .Values.namespace }} +Application: {{ .Values.applicationName }} + +{{- if .Values.service.enabled }} +Service: {{ .Values.namespace }}:{{ (index .Values.service.ports 0).port }} +{{- end }} + +{{- if .Values.ingress.enabled }} +Ingress is enabled. +{{- end }} + +{{- if .Values.autoscaling.enabled }} +Autoscaling: {{ .Values.autoscaling.minReplicas }} - {{ .Values.autoscaling.maxReplicas }} replicas +{{- else }} +Replicas: {{ .Values.replicaCount }} +{{- end }} diff --git a/helm-charts/skye-serving/templates/_helpers.tpl b/helm-charts/skye-serving/templates/_helpers.tpl new file mode 100644 index 00000000..168e0342 --- /dev/null +++ b/helm-charts/skye-serving/templates/_helpers.tpl @@ -0,0 +1,70 @@ +{{/* vim: set filetype=mustache: */}} +{{/* +Expand the name of the chart. +*/}} +{{- define "name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}} +{{- end -}} + +{{/* +Create a default fully qualified app name. +We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec). 
+*/}} +{{- define "fullname" -}} +{{- $name := default .Chart.Name .Values.nameOverride -}} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}} +{{- end -}} + +{{- define "labels.selector" -}} +app: {{ .Values.namespace }} +{{- end -}} + +{{- define "labels.primary-selector" -}} +app: {{ .Values.namespace }}-primary +{{- end -}} + +{{- define "labels.common" -}} +{{ template "labels.selector" . }} +{{- if and .Values.deployment .Values.deployment.image }} +version: {{ .Values.deployment.image.tag }} +{{- end }} +env: {{ .Values.labels.env }} +team: {{ .Values.labels.team }} +bu: {{ .Values.labels.bu }} +service: {{ .Values.applicationName }} +priority: {{ .Values.labels.priority }} +priority_v2: {{ .Values.labels.priority_v2 | default "cp3" }} +primary_owner: {{ .Values.labels.primary_owner | default .Values.labels.team }} +secondary_owner: {{ .Values.labels.secondary_owner | default .Values.labels.team }} +service_type: {{ .Values.labels.service_type | default "" | replace "," "-" | quote }} +{{- end -}} + +{{- define "labels.chart" -}} +chart: "{{ .Chart.Name }}-{{ .Chart.Version }}" +release: {{ .Release.Name | quote }} +heritage: {{ .Release.Service | quote }} +{{- end -}} + +{{/* +Renders a value that contains template. 
+Usage: +{{ include "application.tplvalues.render" ( dict "value" .Values.path.to.the.Value "context" $) }} +*/}} + +{{- define "application.tplvalues.render" -}} + {{- if typeIs "string" .value }} + {{- tpl .value .context }} + {{- else }} + {{- tpl (.value | toYaml) .context }} + {{- end }} +{{- end -}} + +{{- define "canary.promURL" -}} +{{- if .Values.canary.promURL }} +{{- .Values.canary.promURL }} +{{- else if eq .Values.labels.env "prod" }} +https://prod-ops-metricsui.example.com/select/100/ +{{- else }} +https://sb-ops-metricsui.example.com/select/100/ +{{- end }} +{{- end -}} diff --git a/helm-charts/skye-serving/templates/alert-provider.yaml b/helm-charts/skye-serving/templates/alert-provider.yaml new file mode 100644 index 00000000..a300bb14 --- /dev/null +++ b/helm-charts/skye-serving/templates/alert-provider.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.canary (.Values.canary.enabled) .Values.canary.slackWebhookURL (ne .Values.canary.slackWebhookURL "") }} +apiVersion: flagger.app/v1beta1 +kind: AlertProvider +metadata: + name: flagger-status + namespace: {{ .Values.namespace }} +spec: + type: slack + {{- if .Values.canary.slackChannel }} + channel: {{ .Values.canary.slackChannel }} + {{- end }} + username: flagger + address: {{ .Values.canary.slackWebhookURL }} +{{- end }} diff --git a/helm-charts/skye-serving/templates/configmap.yaml b/helm-charts/skye-serving/templates/configmap.yaml new file mode 100644 index 00000000..fc527219 --- /dev/null +++ b/helm-charts/skye-serving/templates/configmap.yaml @@ -0,0 +1,14 @@ +{{- if and .Values.configMap .Values.configMap.enabled }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: {{ .Values.namespace }}-config + namespace: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . 
| indent 4 }} +data: + {{- range $key, $value := .Values.configMap.data }} + {{ $key }}: {{ $value | quote }} + {{- end }} +{{- end }} diff --git a/helm-charts/skye-serving/templates/deployment.yaml b/helm-charts/skye-serving/templates/deployment.yaml new file mode 100644 index 00000000..34cbb1ed --- /dev/null +++ b/helm-charts/skye-serving/templates/deployment.yaml @@ -0,0 +1,237 @@ +{{- if and .Values.deployment .Values.deployment.enabled }} +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ .Values.namespace }} + namespace: {{ .Values.namespace }} + labels: +{{ include "labels.common" . | indent 4 }} +{{ include "labels.chart" . | indent 4 }} +{{- include "labels.selector" . | nindent 4 }} +spec: + {{- with .Values.deployment.minReadySeconds }} + minReadySeconds: {{ . }} + {{- end }} + {{- if not .Values.autoscaling.enabled }} + replicas: {{ .Values.replicaCount }} + {{- end }} + revisionHistoryLimit: {{ .Values.deployment.revisionHistoryLimit }} + selector: + matchLabels: +{{- include "labels.selector" . | nindent 6 }} +{{- with .Values.deployment.updateStrategy }} +{{ toYaml . | indent 2 -}} +{{- end }} + template: + metadata: + annotations: + {{- with .Values.deployment.podAnnotations }} + {{- toYaml . | nindent 8 }} + {{- end }} + {{- if .Values.telegraf.enabled }} + telegraf.influxdata.com/class: "infra" + {{- end }} + labels: + {{- include "labels.common" . | nindent 8 }} + spec: + {{- if .Values.priorityClassName }} + priorityClassName: {{ .Values.priorityClassName }} + {{- end }} + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: topology.kubernetes.io/zone + whenUnsatisfiable: ScheduleAnyway + labelSelector: + matchLabels: +{{- include "labels.selector" . 
| nindent 12 }} + {{- if .Values.deployment.image.pullSecret }} + imagePullSecrets: + - name: {{ .Values.deployment.image.pullSecret }} + {{- end }} + {{- if .Values.deployment.volumes }} + volumes: + {{- toYaml .Values.deployment.volumes | nindent 8 }} + {{- end }} + {{- if .Values.deployment.initContainers }} + initContainers: + {{- toYaml .Values.deployment.initContainers | nindent 8 }} + {{- end }} + containers: + - name: {{ .Values.applicationName }} + {{- if .Values.deployment.volumeMounts }} + volumeMounts: + {{- toYaml .Values.deployment.volumeMounts | nindent 12 }} + {{- end }} + {{- if .Values.deployment.command }} + command: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.command "context" $) | nindent 12 }} + {{- end }} + {{- if .Values.deployment.args }} + args: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.args "context" $) | nindent 12 }} + {{- end }} + image: "{{ .Values.deployment.image.repository }}:{{ .Values.deployment.image.tag }}" + imagePullPolicy: {{ .Values.deployment.image.pullPolicy }} + {{- if .Values.deployment.lifecycle }} + lifecycle: {{ toYaml .Values.deployment.lifecycle | nindent 12 }} + {{- end }} + {{- with .Values.deployment.probes }} + {{- with .liveness }} + livenessProbe: + {{- with .failureThreshold }} + failureThreshold: {{ . }} + {{- end }} + httpGet: + path: {{ .path }} + port: {{ .port }} + scheme: {{ .scheme }} + {{- with .periodSeconds }} + periodSeconds: {{ . }} + {{- end }} + {{- with .successThreshold }} + successThreshold: {{ . }} + {{- end }} + {{- with .timeoutSeconds }} + timeoutSeconds: {{ . 
}} + {{- end }} + initialDelaySeconds: {{ .initialDelaySeconds }} + {{- end }} + {{- end }} + {{- if or .Values.externalSecret.enabled .Values.otel_enabled (and .Values.configMap .Values.configMap.enabled) }} + envFrom: + {{- if .Values.externalSecret.enabled }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }} + {{- if ( default false (.Values.disasterRecovery).enabled ) }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }}-dr + {{- end }} + {{- end }} + {{- if .Values.otel_enabled }} + - secretRef: + name: {{ .Values.deployment.envFrom.secretRef }}-otel + {{- end }} + {{- if and .Values.configMap .Values.configMap.enabled }} + - configMapRef: + name: {{ .Values.namespace }}-config + {{- end }} + {{- end }} + env: + - name: TZ + value: Asia/Kolkata + - name: NODE_IP + valueFrom: + fieldRef: + fieldPath: spec.nodeName + - name: POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name + - name: POD_NAMESPACE + valueFrom: + fieldRef: + fieldPath: metadata.namespace + - name: POD_IP + valueFrom: + fieldRef: + fieldPath: status.podIP + {{- if .Values.telegraf.enabled }} + - name: TELEGRAF_UDP_HOST + valueFrom: + fieldRef: + fieldPath: status.podIP + {{- end }} + {{- if .Values.otel_enabled }} + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: http://$(NODE_IP):4317 + {{- end }} + {{- with .Values.deployment.env }} + {{- range . }} + - name: {{ .name }} + value: "{{ .value }}" + {{- end }} + {{- end }} + {{- with .Values.deployment.ports }} + ports: + {{- range . }} + - containerPort: {{ .containerPort }} + name: {{ .name }} + protocol: {{ .protocol }} + {{- end }} + {{- end }} + {{- if .Values.telegraf.enabled }} + - containerPort: 9273 + name: telegraf-sc + protocol: TCP + {{- end }} + {{- with .Values.deployment.probes }} + {{- with .readiness }} + readinessProbe: + {{- with .failureThreshold }} + failureThreshold: {{ . 
}} + {{- end }} + httpGet: + path: {{ .path }} + port: {{ .port }} + scheme: {{ .scheme }} + {{- with .periodSeconds }} + periodSeconds: {{ . }} + {{- end }} + {{- with .successThreshold }} + successThreshold: {{ . }} + {{- end }} + {{- with .timeoutSeconds}} + timeoutSeconds: {{ . }} + {{- end }} + initialDelaySeconds: {{ .initialDelaySeconds }} + {{- end }} + {{- end }} + {{- with .Values.deployment.resources }} + resources: + {{- with .limits }} + limits: + {{- with .memory }} + memory: "{{ . }}" + {{- end }} + {{- with .cpu }} + cpu: "{{ . }}" + {{- end }} + {{- end }} + {{- with .requests }} + requests: + {{- with .memory }} + memory: "{{ . }}" + {{- end }} + {{- with .cpu }} + cpu: "{{ . }}" + {{- end }} + {{- end }} + {{- end }} + + {{- with .Values.deployment.hostAliases }} + hostAliases: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.deployment.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.deployment.nodeSelector }} + nodeSelector: + {{- toYaml . 
| nindent 8 }} + {{- end }} + terminationGracePeriodSeconds: {{ .Values.deployment.terminationGracePeriodSeconds | default "300" }} + {{- if .Values.deployment.serviceAccount.enabled }} + serviceAccountName: {{ .Values.namespace }} + {{- end }} + {{- if .Values.securityContext }} + securityContext: + {{- toYaml .Values.securityContext | nindent 8 }} + {{- end }} +{{- end }} diff --git a/helm-charts/skye-serving/templates/external-secrets.yaml b/helm-charts/skye-serving/templates/external-secrets.yaml new file mode 100644 index 00000000..0d602056 --- /dev/null +++ b/helm-charts/skye-serving/templates/external-secrets.yaml @@ -0,0 +1,30 @@ +{{- if and .Values.externalSecret .Values.externalSecret.enabled }} +apiVersion: external-secrets.io/v1beta1 +kind: ExternalSecret +metadata: + labels: + {{- include "labels.common" . | nindent 4 }} +{{- if .Values.externalSecret.annotations }} + annotations: +{{ toYaml .Values.externalSecret.annotations | indent 4 }} +{{- end }} + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }} + {{- end}} + namespace: {{ .Values.namespace }} +spec: + dataFrom: + - extract: + conversionStrategy: Default + key: {{ .Values.externalSecret.path }} + refreshInterval: 15s + secretStoreRef: + kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }} + name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }} + target: + creationPolicy: Owner + deletionPolicy: Retain + {{- if and .Values.deployment .Values.deployment.enabled }} + name: {{ .Values.deployment.envFrom.secretRef }} + {{- end}} +{{- end }} diff --git a/helm-charts/skye-serving/templates/httpproxy.yaml b/helm-charts/skye-serving/templates/httpproxy.yaml new file mode 100644 index 00000000..94fe8f3e --- /dev/null +++ b/helm-charts/skye-serving/templates/httpproxy.yaml @@ -0,0 +1,50 @@ +{{- if and .Values.ingress .Values.ingress.enabled -}} +{{- if .Values.createContourGateway -}} +{{- 
if or ( eq "contour-internal" .Values.ingress.ingressClassName ) ( eq "contour-external" .Values.ingress.ingressClassName ) ( eq "contour-internal-0" .Values.ingress.ingressClassName ) ( eq "contour-internal-1" .Values.ingress.ingressClassName ) ( eq "contour-external-0" .Values.ingress.ingressClassName ) ( eq "contour-external-1" .Values.ingress.ingressClassName ) }}
+{{- $servicePortNumber := .Values.ingress.servicePortNumber -}}
+{{- $pathType := .Values.ingress.pathType -}}
+{{- $namespace := .Values.namespace -}}
+{{- $ingressClassName := .Values.ingress.ingressClassName -}}
+{{ $count := 0 | int }}
+{{- range .Values.ingress.hosts }}
+apiVersion: projectcontour.io/v1
+kind: HTTPProxy
+metadata:
+  namespace: {{ $namespace }}
+  name: {{ $namespace }}-{{ $count }}
+  labels:
+{{ include "labels.common" $ | indent 4 }}
+{{ include "labels.chart" $ | indent 4 }}
+  annotations:
+    projectcontour.io/ingress.class: {{ $.Values.ingress.ingressClassName }}
+spec:
+  ingressClassName: {{ $ingressClassName }}
+  virtualhost:
+    fqdn: "{{ .host }}"
+  includes:
+    {{- range .paths }}
+    - conditions:
+        {{- if or ( eq ( lower .pathType ) "prefix" ) ( eq ( lower .pathType ) "implementationspecific") }}
+        - prefix: {{ .path }}
+        {{- end }}
+        {{- if ( eq ( lower .pathType ) "exact" ) }}
+        - header:
+            name: :path
+            exact: {{ .path }}
+        - prefix: {{ .path }}
+        {{- end }}
+      {{- if .targetService }}
+      name: {{ .targetService | replace "/" "-" }}
+      namespace: {{ (split "/" .targetService)._0 }}
+      {{- else }}
+      name: {{ $namespace }}
+      namespace: {{ $namespace }}
+      {{- end }}
+    {{- end }}
+  {{ $count = add1 $count }}
+---
+
+{{- end -}}
+{{- end -}}
+{{- end -}}
+{{- end -}}
diff --git a/helm-charts/skye-serving/templates/otel-secret.yaml b/helm-charts/skye-serving/templates/otel-secret.yaml
new file mode 100644
index 00000000..c2514b7d
--- /dev/null
+++ b/helm-charts/skye-serving/templates/otel-secret.yaml
@@ -0,0 +1,26 @@
+{{ if .Values.otel_enabled }}
+apiVersion: external-secrets.io/v1beta1
+kind: ExternalSecret
+metadata:
+  labels:
+    {{- include "labels.common" . | nindent 4 }}
+  annotations:
+    flagger.app/config-tracking: disabled
+  name: {{ if and .Values.deployment .Values.deployment.envFrom }}{{ .Values.deployment.envFrom.secretRef }}{{ else }}{{ .Values.namespace }}{{ end }}-otel
+  namespace: {{ .Values.namespace }}
+spec:
+  dataFrom:
+    - extract:
+        conversionStrategy: Default
+        key: {{ .Values.infrastructure.vault.otelTokenPath | default "org/prd/cntr/devop/coralogix-token" }}
+  refreshInterval: 15s
+  secretStoreRef:
+    kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }}
+    name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }}
+  target:
+    creationPolicy: Owner
+    deletionPolicy: Retain
+    {{- if and .Values.deployment .Values.deployment.enabled }}
+    name: {{ .Values.deployment.envFrom.secretRef }}-otel
+    {{- end}}
+{{ end }}
diff --git a/helm-charts/skye-serving/templates/pdb.yaml b/helm-charts/skye-serving/templates/pdb.yaml
new file mode 100644
index 00000000..db87f89a
--- /dev/null
+++ b/helm-charts/skye-serving/templates/pdb.yaml
@@ -0,0 +1,36 @@
+{{- if or (and .Values.podDisruptionBudget .Values.podDisruptionBudget.enabled) (and .Values.deployment .Values.deployment.enabled) }}
+{{- if or ( and .Values.deployment (.Values.deployment.enabled) (.Values.autoscaling.enabled) ( gt (int .Values.autoscaling.minReplicas) 1)) ( and (eq .Values.autoscaling.enabled false) .Values.deployment ( gt ( int .Values.deployment.replicaCount) 1 )) }}
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  namespace: {{ .Values.namespace }}
+  name: {{ .Values.namespace }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+spec:
+  {{- if .Values.podDisruptionBudget.enabled }}
+  maxUnavailable: {{ .Values.podDisruptionBudget.maxUnavailable }}
+  {{- else }}
+  {{- if and .Values.deployment .Values.deployment.enabled }}
+  {{- if (eq .Values.deployment.updateStrategy.strategy.type "RollingUpdate") }}
+  maxUnavailable: {{ .Values.deployment.updateStrategy.strategy.rollingUpdate.maxSurge | default "10%" }}
+  {{- else }}
+  maxUnavailable: "10%"
+  {{- end }}
+  {{ else }}
+  maxUnavailable: "10%"
+  {{- end }}
+  {{- end }}
+  {{- if .Values.podDisruptionBudget.minAvailable }}
+  minAvailable: {{ .Values.podDisruptionBudget.minAvailable }}
+  {{- end }}
+  selector:
+    matchLabels:
+      {{- if and (.Values.canary.enabled) (eq .Values.labels.env "prod") }}
+      {{- include "labels.primary-selector" . | nindent 6 }}
+      {{- else }}
+      {{- include "labels.selector" . | nindent 6 }}
+      {{- end }}
+{{- end }}
+{{- end }}
diff --git a/helm-charts/skye-serving/templates/scaledobject.yaml b/helm-charts/skye-serving/templates/scaledobject.yaml
new file mode 100644
index 00000000..259bfea2
--- /dev/null
+++ b/helm-charts/skye-serving/templates/scaledobject.yaml
@@ -0,0 +1,56 @@
+{{- if and .Values.autoscaling .Values.autoscaling.enabled }}
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  labels:
+    {{- include "labels.common" . | nindent 4 }}
+  namespace: {{ .Values.namespace }}
+  name: {{ .Values.namespace }}
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: {{ .Values.namespace }}
+  pollingInterval: {{ .Values.autoscaling.pollingInterval }}
+  {{- if ( default false (.Values.disasterRecovery).enabled ) }}
+  minReplicaCount: 1
+  {{- else }}
+  minReplicaCount: {{ .Values.canary.minCanaryReplicas | default $.Values.autoscaling.minReplicas }}
+  {{- end }}
+  maxReplicaCount: {{ .Values.canary.maxCanaryReplicas | default $.Values.autoscaling.maxReplicas }}
+  advanced:
+    horizontalPodAutoscalerConfig:
+      behavior:
+        scaleDown:
+          stabilizationWindowSeconds: {{ .Values.autoscaling.scaledown.stabilizationWindowSeconds }}
+          policies:
+          {{- range .Values.autoscaling.scaledown.policies }}
+          - type: {{ .type }}
+            value: {{ .value }}
+            periodSeconds: {{ .periodseconds }}
+          {{- end }}
+          selectPolicy: {{ .Values.autoscaling.scaledown.selectpolicy }}
+        scaleUp:
+          stabilizationWindowSeconds: {{ .Values.autoscaling.scaleup.stabilizationWindowSeconds }}
+          policies:
+          {{- range .Values.autoscaling.scaleup.policies }}
+          - type: {{ .type }}
+            value: {{ .value }}
+            periodSeconds: {{ .periodseconds }}
+          {{- end }}
+          selectPolicy: {{ .Values.autoscaling.scaleup.selectpolicy }}
+  triggers:
+  {{- if ( default false (.Values.disasterRecovery).enabled ) }}
+  {{- range $.Values.autoscaling.triggers }}
+  {{- if or (eq .type "cpu") (eq .type "memory") }}
+  - metadata:
+      {{- toYaml .metadata | nindent 8 }}
+    type: {{ .type }}
+    metricType: "Utilization"
+  {{- end }}
+  {{- end }}
+  {{- else }}
+  {{- toYaml .Values.autoscaling.triggers | nindent 2 }}
+  {{ end }}
+
+{{- end }}
diff --git a/helm-charts/skye-serving/templates/service.yaml b/helm-charts/skye-serving/templates/service.yaml
new file mode 100644
index 00000000..8fcc5bc8
--- /dev/null
+++ b/helm-charts/skye-serving/templates/service.yaml
@@ -0,0 +1,27 @@
+{{- if and .Values.service .Values.service.enabled }}
+{{- if or (eq .Values.canary.enabled false) ( and (.Values.canary.enabled) (ne .Values.labels.env "prod")) }}
+apiVersion: v1
+kind: Service
+metadata:
+  namespace: {{ .Values.namespace }}
+  name: {{ .Values.namespace }}
+{{- if .Values.service.annotations }}
+  annotations:
+{{ toYaml .Values.service.annotations | indent 4 }}
+{{- end }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+spec:
+  type: {{ .Values.service.type }}
+  ports:
+  {{- range .Values.service.ports }}
+  - name: {{ .name }}
+    port: {{ .port }}
+    protocol: {{ .protocol }}
+    targetPort: {{ .targetPort }}
+  {{- end }}
+  selector:
+    {{- include "labels.selector" . | nindent 4 }}
+{{- end }}
+{{- end }}
diff --git a/helm-charts/skye-serving/templates/serviceaccount.yaml b/helm-charts/skye-serving/templates/serviceaccount.yaml
new file mode 100644
index 00000000..f05362f5
--- /dev/null
+++ b/helm-charts/skye-serving/templates/serviceaccount.yaml
@@ -0,0 +1,14 @@
+{{- if and .Values.deployment (.Values.deployment.enabled) (.Values.deployment.serviceAccount.enabled) }}
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  namespace: {{ .Values.namespace }}
+  name: {{ .Values.namespace }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+  {{- with .Values.deployment.serviceAccount.annotations }}
+  annotations:
+    {{- toYaml . | nindent 4 }}
+  {{- end }}
+{{- end }}
diff --git a/helm-charts/skye-serving/values.yaml b/helm-charts/skye-serving/values.yaml
new file mode 100644
index 00000000..283a32b9
--- /dev/null
+++ b/helm-charts/skye-serving/values.yaml
@@ -0,0 +1,171 @@
+# Default values for skye-serving helm chart
+
+namespace: prd-skye-serving
+applicationName: skye-serving
+replicaCount: 2
+
+labels:
+  env: prd
+  team: bharatml
+  bu: ml
+  priority: p1
+  priority_v2: cp3
+  service_type: ""
+
+priorityClassName: ""
+
+telegraf:
+  enabled: false
+
+otel_enabled: false
+
+infrastructure:
+  secretStore:
+    name: vault-backend
+    kind: ClusterSecretStore
+  vault:
+    basePath: ""
+    otelTokenPath: ""
+
+deployment:
+  enabled: true
+  replicaCount: 2
+  revisionHistoryLimit: 3
+  image:
+    repository: ghcr.io/meesho/skye-serving
+    tag: latest
+    pullPolicy: IfNotPresent
+  ports:
+    - containerPort: 8094
+      name: http
+      protocol: TCP
+  probes:
+    liveness:
+      path: /health/self
+      port: 8094
+      scheme: HTTP
+      initialDelaySeconds: 30
+      periodSeconds: 10
+      failureThreshold: 3
+      successThreshold: 1
+      timeoutSeconds: 5
+    readiness:
+      path: /health/self
+      port: 8094
+      scheme: HTTP
+      initialDelaySeconds: 20
+      periodSeconds: 10
+      failureThreshold: 3
+      successThreshold: 1
+      timeoutSeconds: 5
+  resources:
+    requests:
+      memory: "512Mi"
+      cpu: "250m"
+    limits:
+      memory: "1Gi"
+      cpu: "1000m"
+  env:
+    - name: APP_NAME
+      value: "skye"
+    - name: APP_ENV
+      value: "local"
+    - name: APP_LOG_LEVEL
+      value: "INFO"
+    - name: APP_PORT
+      value: "8094"
+    - name: APP_METRIC_SAMPLING_RATE
+      value: "100"
+    # In-memory cache (10 MB)
+    - name: IN_MEMORY_CACHE_SIZE_IN_BYTES
+      value: "10485760"
+    # Etcd
+    - name: ETCD_SERVER
+      value: "etcd:2379"
+    - name: ETCD_WATCHER_ENABLED
+      value: "true"
+    # Redis
+    - name: REDIS_ADDR
+      value: "redis:6379"
+    - name: REDIS_DB
+      value: "0"
+    # Profiling
+    - name: PROFILING_ENABLED
+      value: "false"
+    # Storage
+    - name: STORAGE_AGGREGATOR_DB_COUNT
+      value: "0"
+    - name: STORAGE_EMBEDDING_STORE_COUNT
+      value: "0"
+    # Auth
+    - name: AUTH_TOKENS
+      value: "test"
+  serviceAccount:
+    enabled: false
+    annotations: {}
+  updateStrategy:
+    strategy:
+      type: RollingUpdate
+      rollingUpdate:
+        maxUnavailable: 0
+        maxSurge: 1
+  terminationGracePeriodSeconds: 30
+
+service:
+  enabled: true
+  type: ClusterIP
+  ports:
+    - name: http
+      port: 80
+      targetPort: 8094
+      protocol: TCP
+
+autoscaling:
+  enabled: false
+  minReplicas: 2
+  maxReplicas: 10
+  pollingInterval: 30
+  scaledown:
+    stabilizationWindowSeconds: 300
+    policies:
+      - type: Percent
+        value: 10
+        periodseconds: 60
+    selectpolicy: Min
+  scaleup:
+    stabilizationWindowSeconds: 0
+    policies:
+      - type: Percent
+        value: 50
+        periodseconds: 60
+    selectpolicy: Max
+  triggers:
+    - type: cpu
+      metadata:
+        value: "70"
+      metricType: Utilization
+
+ingress:
+  enabled: false
+  ingressClassName: contour-internal
+createContourGateway: false
+
+externalSecret:
+  enabled: false
+  path: ""
+
+configMap:
+  enabled: false
+
+canary:
+  enabled: false
+  promURL: ""
+  slackChannel: ""
+  slackWebhookURL: ""
+
+podDisruptionBudget:
+  enabled: false
+  maxUnavailable: "10%"
+
+disasterRecovery:
+  enabled: false
diff --git a/helm-charts/trufflebox-ui/Chart.yaml b/helm-charts/trufflebox-ui/Chart.yaml
new file mode 100644
index 00000000..e5f83fbe
--- /dev/null
+++ b/helm-charts/trufflebox-ui/Chart.yaml
@@ -0,0 +1,10 @@
+apiVersion: v2
+name: trufflebox-ui
+description: A Helm chart for the TruffleBox UI dashboard
+type: application
+version: 1.0.0
+appVersion: "1.0.0"
+
+maintainers:
+  - name: BharatMLStack Team
+    email: ml-oss@meesho.com
diff --git a/helm-charts/trufflebox-ui/templates/NOTES.txt b/helm-charts/trufflebox-ui/templates/NOTES.txt
new file mode 100644
index 00000000..35a4f3c2
--- /dev/null
+++ b/helm-charts/trufflebox-ui/templates/NOTES.txt
@@ -0,0 +1,18 @@
+{{ .Chart.Name }} has been deployed.
+
+Namespace: {{ .Values.namespace }}
+Application: {{ .Values.applicationName }}
+
+{{- if .Values.service.enabled }}
+Service: {{ .Values.namespace }}:{{ (index .Values.service.ports 0).port }}
+{{- end }}
+
+{{- if .Values.ingress.enabled }}
+Ingress is enabled.
+{{- end }}
+
+{{- if .Values.autoscaling.enabled }}
+Autoscaling: {{ .Values.autoscaling.minReplicas }} - {{ .Values.autoscaling.maxReplicas }} replicas
+{{- else }}
+Replicas: {{ .Values.replicaCount }}
+{{- end }}
diff --git a/helm-charts/trufflebox-ui/templates/_helpers.tpl b/helm-charts/trufflebox-ui/templates/_helpers.tpl
new file mode 100644
index 00000000..168e0342
--- /dev/null
+++ b/helm-charts/trufflebox-ui/templates/_helpers.tpl
@@ -0,0 +1,70 @@
+{{/* vim: set filetype=mustache: */}}
+{{/*
+Expand the name of the chart.
+*/}}
+{{- define "name" -}}
+{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}}
+{{- end -}}
+
+{{/*
+Create a default fully qualified app name.
+We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
+*/}}
+{{- define "fullname" -}}
+{{- $name := default .Chart.Name .Values.nameOverride -}}
+{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}}
+{{- end -}}
+
+{{- define "labels.selector" -}}
+app: {{ .Values.namespace }}
+{{- end -}}
+
+{{- define "labels.primary-selector" -}}
+app: {{ .Values.namespace }}-primary
+{{- end -}}
+
+{{- define "labels.common" -}}
+{{ template "labels.selector" . }}
+{{- if and .Values.deployment .Values.deployment.image }}
+version: {{ .Values.deployment.image.tag }}
+{{- end }}
+env: {{ .Values.labels.env }}
+team: {{ .Values.labels.team }}
+bu: {{ .Values.labels.bu }}
+service: {{ .Values.applicationName }}
+priority: {{ .Values.labels.priority }}
+priority_v2: {{ .Values.labels.priority_v2 | default "cp3" }}
+primary_owner: {{ .Values.labels.primary_owner | default .Values.labels.team }}
+secondary_owner: {{ .Values.labels.secondary_owner | default .Values.labels.team }}
+service_type: {{ .Values.labels.service_type | default "" | replace "," "-" | quote }}
+{{- end -}}
+
+{{- define "labels.chart" -}}
+chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
+release: {{ .Release.Name | quote }}
+heritage: {{ .Release.Service | quote }}
+{{- end -}}
+
+{{/*
+Renders a value that contains template.
+Usage:
+{{ include "application.tplvalues.render" ( dict "value" .Values.path.to.the.Value "context" $) }}
+*/}}
+
+{{- define "application.tplvalues.render" -}}
+    {{- if typeIs "string" .value }}
+        {{- tpl .value .context }}
+    {{- else }}
+        {{- tpl (.value | toYaml) .context }}
+    {{- end }}
+{{- end -}}
+
+{{- define "canary.promURL" -}}
+{{- if .Values.canary.promURL }}
+{{- .Values.canary.promURL }}
+{{- else if eq .Values.labels.env "prod" }}
+prod-ops-metricsui.example.com/select/100/
+{{- else }}
+https://sb-ops-metricsui.example.com/select/100/
+{{- end }}
+{{- end -}}
diff --git a/helm-charts/trufflebox-ui/templates/alert-provider.yaml b/helm-charts/trufflebox-ui/templates/alert-provider.yaml
new file mode 100644
index 00000000..a300bb14
--- /dev/null
+++ b/helm-charts/trufflebox-ui/templates/alert-provider.yaml
@@ -0,0 +1,14 @@
+{{- if and .Values.canary (.Values.canary.enabled) .Values.canary.slackWebhookURL (ne .Values.canary.slackWebhookURL "") }}
+apiVersion: flagger.app/v1beta1
+kind: AlertProvider
+metadata:
+  name: flagger-status
+  namespace: {{ .Values.namespace }}
+spec:
+  type: slack
+  {{- if .Values.canary.slackChannel }}
+  channel: {{ .Values.canary.slackChannel }}
+  {{- end }}
+  username: flagger
+  address: {{ .Values.canary.slackWebhookURL }}
+{{- end }}
diff --git a/helm-charts/trufflebox-ui/templates/configmap.yaml b/helm-charts/trufflebox-ui/templates/configmap.yaml
new file mode 100644
index 00000000..fc527219
--- /dev/null
+++ b/helm-charts/trufflebox-ui/templates/configmap.yaml
@@ -0,0 +1,14 @@
+{{- if and .Values.configMap .Values.configMap.enabled }}
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: {{ .Values.namespace }}-config
+  namespace: {{ .Values.namespace }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+data:
+  {{- range $key, $value := .Values.configMap.data }}
+  {{ $key }}: {{ $value | quote }}
+  {{- end }}
+{{- end }}
diff --git a/helm-charts/trufflebox-ui/templates/deployment.yaml b/helm-charts/trufflebox-ui/templates/deployment.yaml
new file mode 100644
index 00000000..34cbb1ed
--- /dev/null
+++ b/helm-charts/trufflebox-ui/templates/deployment.yaml
@@ -0,0 +1,237 @@
+{{- if and .Values.deployment .Values.deployment.enabled }}
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: {{ .Values.namespace }}
+  namespace: {{ .Values.namespace }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+{{- include "labels.selector" . | nindent 4 }}
+spec:
+  {{- with .Values.deployment.minReadySeconds }}
+  minReadySeconds: {{ . }}
+  {{- end }}
+  {{- if not .Values.autoscaling.enabled }}
+  replicas: {{ .Values.replicaCount }}
+  {{- end }}
+  revisionHistoryLimit: {{ .Values.deployment.revisionHistoryLimit }}
+  selector:
+    matchLabels:
+{{- include "labels.selector" . | nindent 6 }}
+{{- with .Values.deployment.updateStrategy }}
+{{ toYaml . | indent 2 -}}
+{{- end }}
+  template:
+    metadata:
+      annotations:
+        {{- with .Values.deployment.podAnnotations }}
+        {{- toYaml . | nindent 8 }}
+        {{- end }}
+        {{- if .Values.telegraf.enabled }}
+        telegraf.influxdata.com/class: "infra"
+        {{- end }}
+      labels:
+        {{- include "labels.common" . | nindent 8 }}
+    spec:
+      {{- if .Values.priorityClassName }}
+      priorityClassName: {{ .Values.priorityClassName }}
+      {{- end }}
+      topologySpreadConstraints:
+        - maxSkew: 1
+          topologyKey: topology.kubernetes.io/zone
+          whenUnsatisfiable: ScheduleAnyway
+          labelSelector:
+            matchLabels:
+{{- include "labels.selector" . | nindent 12 }}
+      {{- if .Values.deployment.image.pullSecret }}
+      imagePullSecrets:
+        - name: {{ .Values.deployment.image.pullSecret }}
+      {{- end }}
+      {{- if .Values.deployment.volumes }}
+      volumes:
+        {{- toYaml .Values.deployment.volumes | nindent 8 }}
+      {{- end }}
+      {{- if .Values.deployment.initContainers }}
+      initContainers:
+        {{- toYaml .Values.deployment.initContainers | nindent 8 }}
+      {{- end }}
+      containers:
+        - name: {{ .Values.applicationName }}
+          {{- if .Values.deployment.volumeMounts }}
+          volumeMounts:
+            {{- toYaml .Values.deployment.volumeMounts | nindent 12 }}
+          {{- end }}
+          {{- if .Values.deployment.command }}
+          command: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.command "context" $) | nindent 12 }}
+          {{- end }}
+          {{- if .Values.deployment.args }}
+          args: {{- include "application.tplvalues.render" (dict "value" .Values.deployment.args "context" $) | nindent 12 }}
+          {{- end }}
+          image: "{{ .Values.deployment.image.repository }}:{{ .Values.deployment.image.tag }}"
+          imagePullPolicy: {{ .Values.deployment.image.pullPolicy }}
+          {{- if .Values.deployment.lifecycle }}
+          lifecycle: {{ toYaml .Values.deployment.lifecycle | nindent 12 }}
+          {{- end }}
+          {{- with .Values.deployment.probes }}
+          {{- with .liveness }}
+          livenessProbe:
+            {{- with .failureThreshold }}
+            failureThreshold: {{ . }}
+            {{- end }}
+            httpGet:
+              path: {{ .path }}
+              port: {{ .port }}
+              scheme: {{ .scheme }}
+            {{- with .periodSeconds }}
+            periodSeconds: {{ . }}
+            {{- end }}
+            {{- with .successThreshold }}
+            successThreshold: {{ . }}
+            {{- end }}
+            {{- with .timeoutSeconds }}
+            timeoutSeconds: {{ . }}
+            {{- end }}
+            initialDelaySeconds: {{ .initialDelaySeconds }}
+          {{- end }}
+          {{- end }}
+          {{- if or .Values.externalSecret.enabled .Values.otel_enabled (and .Values.configMap .Values.configMap.enabled) }}
+          envFrom:
+            {{- if .Values.externalSecret.enabled }}
+            - secretRef:
+                name: {{ .Values.deployment.envFrom.secretRef }}
+            {{- if ( default false (.Values.disasterRecovery).enabled ) }}
+            - secretRef:
+                name: {{ .Values.deployment.envFrom.secretRef }}-dr
+            {{- end }}
+            {{- end }}
+            {{- if .Values.otel_enabled }}
+            - secretRef:
+                name: {{ .Values.deployment.envFrom.secretRef }}-otel
+            {{- end }}
+            {{- if and .Values.configMap .Values.configMap.enabled }}
+            - configMapRef:
+                name: {{ .Values.namespace }}-config
+            {{- end }}
+          {{- end }}
+          env:
+            - name: TZ
+              value: Asia/Kolkata
+            - name: NODE_IP
+              valueFrom:
+                fieldRef:
+                  fieldPath: spec.nodeName
+            - name: POD_NAME
+              valueFrom:
+                fieldRef:
+                  fieldPath: metadata.name
+            - name: POD_NAMESPACE
+              valueFrom:
+                fieldRef:
+                  fieldPath: metadata.namespace
+            - name: POD_IP
+              valueFrom:
+                fieldRef:
+                  fieldPath: status.podIP
+            {{- if .Values.telegraf.enabled }}
+            - name: TELEGRAF_UDP_HOST
+              valueFrom:
+                fieldRef:
+                  fieldPath: status.podIP
+            {{- end }}
+            {{- if .Values.otel_enabled }}
+            - name: OTEL_EXPORTER_OTLP_ENDPOINT
+              value: http://$(NODE_IP):4317
+            {{- end }}
+            {{- with .Values.deployment.env }}
+            {{- range . }}
+            - name: {{ .name }}
+              value: "{{ .value }}"
+            {{- end }}
+            {{- end }}
+          {{- with .Values.deployment.ports }}
+          ports:
+            {{- range . }}
+            - containerPort: {{ .containerPort }}
+              name: {{ .name }}
+              protocol: {{ .protocol }}
+            {{- end }}
+          {{- end }}
+          {{- if .Values.telegraf.enabled }}
+            - containerPort: 9273
+              name: telegraf-sc
+              protocol: TCP
+          {{- end }}
+          {{- with .Values.deployment.probes }}
+          {{- with .readiness }}
+          readinessProbe:
+            {{- with .failureThreshold }}
+            failureThreshold: {{ . }}
+            {{- end }}
+            httpGet:
+              path: {{ .path }}
+              port: {{ .port }}
+              scheme: {{ .scheme }}
+            {{- with .periodSeconds }}
+            periodSeconds: {{ . }}
+            {{- end }}
+            {{- with .successThreshold }}
+            successThreshold: {{ . }}
+            {{- end }}
+            {{- with .timeoutSeconds}}
+            timeoutSeconds: {{ . }}
+            {{- end }}
+            initialDelaySeconds: {{ .initialDelaySeconds }}
+          {{- end }}
+          {{- end }}
+          {{- with .Values.deployment.resources }}
+          resources:
+            {{- with .limits }}
+            limits:
+              {{- with .memory }}
+              memory: "{{ . }}"
+              {{- end }}
+              {{- with .cpu }}
+              cpu: "{{ . }}"
+              {{- end }}
+            {{- end }}
+            {{- with .requests }}
+            requests:
+              {{- with .memory }}
+              memory: "{{ . }}"
+              {{- end }}
+              {{- with .cpu }}
+              cpu: "{{ . }}"
+              {{- end }}
+            {{- end }}
+          {{- end }}
+
+      {{- with .Values.deployment.hostAliases }}
+      hostAliases:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      {{- with .Values.nodeSelector }}
+      nodeSelector:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      {{- with .Values.affinity }}
+      affinity:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      {{- with .Values.deployment.tolerations }}
+      tolerations:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      {{- with .Values.deployment.nodeSelector }}
+      nodeSelector:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      terminationGracePeriodSeconds: {{ .Values.deployment.terminationGracePeriodSeconds | default "300" }}
+      {{- if .Values.deployment.serviceAccount.enabled }}
+      serviceAccountName: {{ .Values.namespace }}
+      {{- end }}
+      {{- if .Values.securityContext }}
+      securityContext:
+        {{- toYaml .Values.securityContext | nindent 8 }}
+      {{- end }}
+{{- end }}
diff --git a/helm-charts/trufflebox-ui/templates/external-secrets.yaml b/helm-charts/trufflebox-ui/templates/external-secrets.yaml
new file mode 100644
index 00000000..0d602056
--- /dev/null
+++ b/helm-charts/trufflebox-ui/templates/external-secrets.yaml
@@ -0,0 +1,30 @@
+{{- if and .Values.externalSecret .Values.externalSecret.enabled }}
+apiVersion: external-secrets.io/v1beta1
+kind: ExternalSecret
+metadata:
+  labels:
+    {{- include "labels.common" . | nindent 4 }}
+{{- if .Values.externalSecret.annotations }}
+  annotations:
+{{ toYaml .Values.externalSecret.annotations | indent 4 }}
+{{- end }}
+  {{- if and .Values.deployment .Values.deployment.enabled }}
+  name: {{ .Values.deployment.envFrom.secretRef }}
+  {{- end}}
+  namespace: {{ .Values.namespace }}
+spec:
+  dataFrom:
+    - extract:
+        conversionStrategy: Default
+        key: {{ .Values.externalSecret.path }}
+  refreshInterval: 15s
+  secretStoreRef:
+    kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }}
+    name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }}
+  target:
+    creationPolicy: Owner
+    deletionPolicy: Retain
+    {{- if and .Values.deployment .Values.deployment.enabled }}
+    name: {{ .Values.deployment.envFrom.secretRef }}
+    {{- end}}
+{{- end }}
diff --git a/helm-charts/trufflebox-ui/templates/httpproxy.yaml b/helm-charts/trufflebox-ui/templates/httpproxy.yaml
new file mode 100644
index 00000000..94fe8f3e
--- /dev/null
+++ b/helm-charts/trufflebox-ui/templates/httpproxy.yaml
@@ -0,0 +1,50 @@
+{{- if and .Values.ingress .Values.ingress.enabled -}}
+{{- if .Values.createContourGateway -}}
+{{- if or ( eq "contour-internal" .Values.ingress.ingressClassName ) ( eq "contour-external" .Values.ingress.ingressClassName ) ( eq "contour-internal-0" .Values.ingress.ingressClassName ) ( eq "contour-internal-1" .Values.ingress.ingressClassName ) ( eq "contour-external-0" .Values.ingress.ingressClassName ) ( eq "contour-external-1" .Values.ingress.ingressClassName ) }}
+{{- $servicePortNumber := .Values.ingress.servicePortNumber -}}
+{{- $pathType := .Values.ingress.pathType -}}
+{{- $namespace := .Values.namespace -}}
+{{- $ingressClassName := .Values.ingress.ingressClassName -}}
+{{ $count := 0 | int }}
+{{- range .Values.ingress.hosts }}
+apiVersion: projectcontour.io/v1
+kind: HTTPProxy
+metadata:
+  namespace: {{ $namespace }}
+  name: {{ $namespace }}-{{ $count }}
+  labels:
+{{ include "labels.common" $ | indent 4 }}
+{{ include "labels.chart" $ | indent 4 }}
+  annotations:
+    projectcontour.io/ingress.class: {{ $.Values.ingress.ingressClassName }}
+spec:
+  ingressClassName: {{ $ingressClassName }}
+  virtualhost:
+    fqdn: "{{ .host }}"
+  includes:
+    {{- range .paths }}
+    - conditions:
+        {{- if or ( eq ( lower .pathType ) "prefix" ) ( eq ( lower .pathType ) "implementationspecific") }}
+        - prefix: {{ .path }}
+        {{- end }}
+        {{- if ( eq ( lower .pathType ) "exact" ) }}
+        - header:
+            name: :path
+            exact: {{ .path }}
+        - prefix: {{ .path }}
+        {{- end }}
+      {{- if .targetService }}
+      name: {{ .targetService | replace "/" "-" }}
+      namespace: {{ (split "/" .targetService)._0 }}
+      {{- else }}
+      name: {{ $namespace }}
+      namespace: {{ $namespace }}
+      {{- end }}
+    {{- end }}
+  {{ $count = add1 $count }}
+---
+
+{{- end -}}
+{{- end -}}
+{{- end -}}
+{{- end -}}
diff --git a/helm-charts/trufflebox-ui/templates/otel-secret.yaml b/helm-charts/trufflebox-ui/templates/otel-secret.yaml
new file mode 100644
index 00000000..c2514b7d
--- /dev/null
+++ b/helm-charts/trufflebox-ui/templates/otel-secret.yaml
@@ -0,0 +1,26 @@
+{{ if .Values.otel_enabled }}
+apiVersion: external-secrets.io/v1beta1
+kind: ExternalSecret
+metadata:
+  labels:
+    {{- include "labels.common" . | nindent 4 }}
+  annotations:
+    flagger.app/config-tracking: disabled
+  name: {{ if and .Values.deployment .Values.deployment.envFrom }}{{ .Values.deployment.envFrom.secretRef }}{{ else }}{{ .Values.namespace }}{{ end }}-otel
+  namespace: {{ .Values.namespace }}
+spec:
+  dataFrom:
+    - extract:
+        conversionStrategy: Default
+        key: {{ .Values.infrastructure.vault.otelTokenPath | default "org/prd/cntr/devop/coralogix-token" }}
+  refreshInterval: 15s
+  secretStoreRef:
+    kind: {{ .Values.infrastructure.secretStore.kind | default "ClusterSecretStore" }}
+    name: {{ .Values.infrastructure.secretStore.name | default "vault-backend" }}
+  target:
+    creationPolicy: Owner
+    deletionPolicy: Retain
+    {{- if and .Values.deployment .Values.deployment.enabled }}
+    name: {{ .Values.deployment.envFrom.secretRef }}-otel
+    {{- end}}
+{{ end }}
diff --git a/helm-charts/trufflebox-ui/templates/pdb.yaml b/helm-charts/trufflebox-ui/templates/pdb.yaml
new file mode 100644
index 00000000..db87f89a
--- /dev/null
+++ b/helm-charts/trufflebox-ui/templates/pdb.yaml
@@ -0,0 +1,36 @@
+{{- if or (and .Values.podDisruptionBudget .Values.podDisruptionBudget.enabled) (and .Values.deployment .Values.deployment.enabled) }}
+{{- if or ( and .Values.deployment (.Values.deployment.enabled) (.Values.autoscaling.enabled) ( gt (int .Values.autoscaling.minReplicas) 1)) ( and (eq .Values.autoscaling.enabled false) .Values.deployment ( gt ( int .Values.deployment.replicaCount) 1 )) }}
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  namespace: {{ .Values.namespace }}
+  name: {{ .Values.namespace }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+spec:
+  {{- if .Values.podDisruptionBudget.enabled }}
+  maxUnavailable: {{ .Values.podDisruptionBudget.maxUnavailable }}
+  {{- else }}
+  {{- if and .Values.deployment .Values.deployment.enabled }}
+  {{- if (eq .Values.deployment.updateStrategy.strategy.type "RollingUpdate") }}
+  maxUnavailable: {{ .Values.deployment.updateStrategy.strategy.rollingUpdate.maxSurge | default "10%" }}
+  {{- else }}
+  maxUnavailable: "10%"
+  {{- end }}
+  {{ else }}
+  maxUnavailable: "10%"
+  {{- end }}
+  {{- end }}
+  {{- if .Values.podDisruptionBudget.minAvailable }}
+  minAvailable: {{ .Values.podDisruptionBudget.minAvailable }}
+  {{- end }}
+  selector:
+    matchLabels:
+      {{- if and (.Values.canary.enabled) (eq .Values.labels.env "prod") }}
+      {{- include "labels.primary-selector" . | nindent 6 }}
+      {{- else }}
+      {{- include "labels.selector" . | nindent 6 }}
+      {{- end }}
+{{- end }}
+{{- end }}
diff --git a/helm-charts/trufflebox-ui/templates/scaledobject.yaml b/helm-charts/trufflebox-ui/templates/scaledobject.yaml
new file mode 100644
index 00000000..259bfea2
--- /dev/null
+++ b/helm-charts/trufflebox-ui/templates/scaledobject.yaml
@@ -0,0 +1,56 @@
+{{- if and .Values.autoscaling .Values.autoscaling.enabled }}
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  labels:
+    {{- include "labels.common" . | nindent 4 }}
+  namespace: {{ .Values.namespace }}
+  name: {{ .Values.namespace }}
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: {{ .Values.namespace }}
+  pollingInterval: {{ .Values.autoscaling.pollingInterval }}
+  {{- if ( default false (.Values.disasterRecovery).enabled ) }}
+  minReplicaCount: 1
+  {{- else }}
+  minReplicaCount: {{ .Values.canary.minCanaryReplicas | default $.Values.autoscaling.minReplicas }}
+  {{- end }}
+  maxReplicaCount: {{ .Values.canary.maxCanaryReplicas | default $.Values.autoscaling.maxReplicas }}
+  advanced:
+    horizontalPodAutoscalerConfig:
+      behavior:
+        scaleDown:
+          stabilizationWindowSeconds: {{ .Values.autoscaling.scaledown.stabilizationWindowSeconds }}
+          policies:
+          {{- range .Values.autoscaling.scaledown.policies }}
+          - type: {{ .type }}
+            value: {{ .value }}
+            periodSeconds: {{ .periodseconds }}
+          {{- end }}
+          selectPolicy: {{ .Values.autoscaling.scaledown.selectpolicy }}
+        scaleUp:
+          stabilizationWindowSeconds: {{ .Values.autoscaling.scaleup.stabilizationWindowSeconds }}
+          policies:
+          {{- range .Values.autoscaling.scaleup.policies }}
+          - type: {{ .type }}
+            value: {{ .value }}
+            periodSeconds: {{ .periodseconds }}
+          {{- end }}
+          selectPolicy: {{ .Values.autoscaling.scaleup.selectpolicy }}
+  triggers:
+  {{- if ( default false (.Values.disasterRecovery).enabled ) }}
+  {{- range $.Values.autoscaling.triggers }}
+  {{- if or (eq .type "cpu") (eq .type "memory") }}
+  - metadata:
+      {{- toYaml .metadata | nindent 8 }}
+    type: {{ .type }}
+    metricType: "Utilization"
+  {{- end }}
+  {{- end }}
+  {{- else }}
+  {{- toYaml .Values.autoscaling.triggers | nindent 2 }}
+  {{ end }}
+
+{{- end }}
diff --git a/helm-charts/trufflebox-ui/templates/service.yaml b/helm-charts/trufflebox-ui/templates/service.yaml
new file mode 100644
index 00000000..8fcc5bc8
--- /dev/null
+++ b/helm-charts/trufflebox-ui/templates/service.yaml
@@ -0,0 +1,27 @@
+{{- if and .Values.service .Values.service.enabled }}
+{{- if or (eq .Values.canary.enabled false) ( and (.Values.canary.enabled) (ne .Values.labels.env "prod")) }}
+apiVersion: v1
+kind: Service
+metadata:
+  namespace: {{ .Values.namespace }}
+  name: {{ .Values.namespace }}
+{{- if .Values.service.annotations }}
+  annotations:
+{{ toYaml .Values.service.annotations | indent 4 }}
+{{- end }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+spec:
+  type: {{ .Values.service.type }}
+  ports:
+  {{- range .Values.service.ports }}
+  - name: {{ .name }}
+    port: {{ .port }}
+    protocol: {{ .protocol }}
+    targetPort: {{ .targetPort }}
+  {{- end }}
+  selector:
+    {{- include "labels.selector" . | nindent 4 }}
+{{- end }}
+{{- end }}
diff --git a/helm-charts/trufflebox-ui/templates/serviceaccount.yaml b/helm-charts/trufflebox-ui/templates/serviceaccount.yaml
new file mode 100644
index 00000000..f05362f5
--- /dev/null
+++ b/helm-charts/trufflebox-ui/templates/serviceaccount.yaml
@@ -0,0 +1,14 @@
+{{- if and .Values.deployment (.Values.deployment.enabled) (.Values.deployment.serviceAccount.enabled) }}
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  namespace: {{ .Values.namespace }}
+  name: {{ .Values.namespace }}
+  labels:
+{{ include "labels.common" . | indent 4 }}
+{{ include "labels.chart" . | indent 4 }}
+  {{- with .Values.deployment.serviceAccount.annotations }}
+  annotations:
+    {{- toYaml . | nindent 4 }}
+  {{- end }}
+{{- end }}
diff --git a/helm-charts/trufflebox-ui/values.yaml b/helm-charts/trufflebox-ui/values.yaml
new file mode 100644
index 00000000..06cffa41
--- /dev/null
+++ b/helm-charts/trufflebox-ui/values.yaml
@@ -0,0 +1,151 @@
+# Default values for trufflebox-ui helm chart
+
+namespace: prd-trufflebox-ui
+applicationName: trufflebox-ui
+replicaCount: 2
+
+labels:
+  env: prd
+  team: bharatml
+  bu: ml
+  priority: p2
+  priority_v2: cp3
+  service_type: ""
+
+priorityClassName: ""
+
+telegraf:
+  enabled: false
+
+otel_enabled: false
+
+infrastructure:
+  secretStore:
+    name: vault-backend
+    kind: ClusterSecretStore
+  vault:
+    basePath: ""
+    otelTokenPath: ""
+
+deployment:
+  enabled: true
+  replicaCount: 2
+  revisionHistoryLimit: 3
+  image:
+    repository: ghcr.io/meesho/trufflebox-ui
+    tag: latest
+    pullPolicy: IfNotPresent
+  ports:
+    - containerPort: 80
+      name: http
+      protocol: TCP
+  probes:
+    liveness:
+      path: /
+      port: 80
+      scheme: HTTP
+      initialDelaySeconds: 10
+      periodSeconds: 10
+      failureThreshold: 3
+      successThreshold: 1
+      timeoutSeconds: 5
+    readiness:
+      path: /
+      port: 80
+      scheme: HTTP
+      initialDelaySeconds: 5
+      periodSeconds: 10
+      failureThreshold: 3
+      successThreshold: 1
+      timeoutSeconds: 5
+  resources:
+    requests:
+      memory: "128Mi"
+      cpu: "100m"
+    limits:
+      memory: "256Mi"
+      cpu: "250m"
+  env:
+    - name: REACT_APP_ENVIRONMENT
+      value: "production"
+    - name: REACT_APP_HORIZON_BASE_URL
+      value: "http://horizon:8082"
+    - name: REACT_APP_ONLINE_FEATURE_STORE_ENABLED
+      value: "true"
+    - name: REACT_APP_INFERFLOW_ENABLED
+      value: "true"
+    - name: REACT_APP_NUMERIX_ENABLED
+      value: "true"
+    - name: REACT_APP_PREDATOR_ENABLED
+      value: "true"
+    - name: REACT_APP_EMBEDDING_PLATFORM_ENABLED
+      value: "false"
+  serviceAccount:
+    enabled: false
+    annotations: {}
+  updateStrategy:
+    strategy:
+      type: RollingUpdate
+      rollingUpdate:
+        maxUnavailable: 0
+        maxSurge: 1
+  terminationGracePeriodSeconds: 30
+
+service:
+  enabled: true
+  type: ClusterIP
+  ports:
+    - name: http
+      port: 80
+      targetPort: 80
+      protocol: TCP
+
+autoscaling:
+  enabled: false
+  minReplicas: 2
+  maxReplicas: 10
+  pollingInterval: 30
+  scaledown:
+    stabilizationWindowSeconds: 300
+    policies:
+      - type: Percent
+        value: 10
+        periodseconds: 60
+    selectpolicy: Min
+  scaleup:
+    stabilizationWindowSeconds: 0
+    policies:
+      - type: Percent
+        value: 50
+        periodseconds: 60
+    selectpolicy: Max
+  triggers:
+    - type: cpu
+      metadata:
+        value: "70"
+      metricType: Utilization
+
+ingress:
+  enabled: false
+  ingressClassName: contour-internal
+createContourGateway: false
+
+externalSecret:
+  enabled: false
+  path: ""
+
+configMap:
+  enabled: false
+
+canary:
+  enabled: false
+  promURL: ""
+  slackChannel: ""
+  slackWebhookURL: ""
+
+podDisruptionBudget:
+  enabled: false
+  maxUnavailable: "10%"
+
+disasterRecovery:
+  enabled: false
diff --git a/inferflow/README.md b/inferflow/README.md
index 27db0b8a..5de41092 100644
--- a/inferflow/README.md
+++ b/inferflow/README.md
@@ -1,121 +1,221 @@
-## inferflow
+![Build Status](https://github.com/Meesho/BharatMLStack/actions/workflows/inferflow.yml/badge.svg)
+![Static Badge](https://img.shields.io/badge/release-v1.0.0-blue?style=flat)
+[![Discord](https://img.shields.io/badge/Discord-Join%20Chat-7289da?style=flat&logo=discord&logoColor=white)](https://discord.gg/XkT7XsV2AU)
-### Build / Debug / Run
+# Inferflow
-* To build the app run
- * export GOOS=linux && go build -o bin/inferflow_app cmd/inferflow/main.go
-* To run or debug the app
- * In IDE please run launch.json in VSCode; you can select debug mode or auto mode.
+DAG-based real-time ML inference orchestration service for BharatML Stack.
-### Notes
-* Configs are present in cmd/inferflow/application.env file as environment variables
+## What is Inferflow?
-* Use camel case for naming config keys in application.env file
+Inferflow is the inference layer of BharatML Stack. It receives scoring requests, orchestrates feature retrieval, model execution, and post-processing through a configurable DAG (Directed Acyclic Graph) pipeline, and returns predictions — all in real time.
-* Make sure to not have "__" (double underscores) in actual config keys
+Each model gets its own DAG of components (feature fetch, model scoring, numeric computation) that execute concurrently where possible, making inference both flexible and fast.
-* Environment variables don't support "." or "-" so we are using "__" (double underscores) as delimiter
+## Features
-* eg. if you have nested config parent.child.name then your environemnt variables should look like parent__child__name. In code you can access the config as kConfig.String("parent.child.name")
+- ⚡ **DAG-Based Execution** — Components run in parallel when independent; topological ordering ensures correctness
+- 🧠 **PointWise, PairWise & SlateWise APIs** — gRPC APIs for per-target, pair-level, and slate-level inference
+- 🔄 **Dynamic Model Configuration** — Model DAGs and configs are loaded from etcd and hot-reloaded without restarts
+- 📦 **Feature Retrieval** — Fetches real-time features from the Online Feature Store (ONFS) via gRPC
+- 🎯 **Model Scoring** — Calls Predator for ML model inference (supports multi-endpoint routing)
+- 🔢 **Numeric Computation** — Delegates matrix operations to Numerix via gRPC
+- 💾 **In-Memory Caching** — FreeCache-based feature caching for low-latency repeated lookups
+- 📊 **Inference Logging** — Asynchronous logging to Kafka in proto, Arrow, or Parquet format
+- 🔀 **Multiplexed Server** — gRPC and HTTP served on a single port via cmux
-* Do not panic in code except during app start; if you want to panic just return an error
+## Architecture
-* Automaxprocs issue (GOMAXPROCS = 1) may cause throttling in kubernetes if cpu_limit is less than 1000m (1 core). Make sure you use atleast 1 core in EKS
+For detailed architecture and data flow diagrams, see the [Inferflow documentation](https://meesho.github.io/BharatMLStack/docs/inferflow).
-* To add test case for a file abc.go add a file abc_test.go in the same package
-* Always check if the variables being shared amongst goroutines are concurrently safe or not
-
-* Headers are automatically converted to PascalCase with mux router
-* TELEGRAF_UDP_HOST and TELEGRAF_UDP_PORT are used for EKS, don't add or change these environment variables in code. application.env doesn't have these variables because default values are handled in code
+### External Dependencies
+| Dependency | Purpose |
+|------------|---------|
+| **etcd** | Dynamic model configuration |
+| **ONFS** | Real-time feature retrieval |
+| **Predator** | ML model inference |
+| **Numerix** | Numeric / matrix computation |
+| **Kafka** | Inference logging |
-### Packaging structure
+## gRPC APIs
- * cmd/ contains main package of the application and application.env
- * internal/ application specific routers, errors, etc.
packages - * handlers/ handlers for various APIs exposed along with their commons & initialization - * pkg/ utils that can be used by other apps - * test/ packaged code that is required for unit testing - * deployments/ contain deployment related files +### Predict Service (`server/proto/predict.proto`) +| RPC | Description | +|-----|-------------| +| `InferPointWise(PointWiseRequest)` | Per-target scoring — score each target independently | +| `InferPairWise(PairWiseRequest)` | Pair-level scoring — rank/score pairs of targets | +| `InferSlateWise(SlateWiseRequest)` | Slate-level scoring — score ordered groups of targets | -## inferflow-client +### Legacy Service (`server/proto/inferflow.proto`) -## Install +| RPC | Description | +|-----|-------------| +| `RetrieveModelScore(InferflowRequestProto)` | Entity-based model scoring (legacy API) | -```xml +### HTTP Endpoints - - com.meesho.ml - inferflow-client - 1.0.2-RELEASE - +| Endpoint | Description | +|----------|-------------| +| `GET /health/self` | Health check | + +For detailed API schemas, see [Predict APIs and Feature Logging](PREDICT_APIS_AND_FEATURE_LOGGING.md). 
+ +## 🧰 SDKs + +Inferflow provides SDKs to interact with the inference service: + +- **[Go SDK](sdks/go/README.md)** - For backend services and ML inference + + +## Quick Start + +### Prerequisites + +- Go 1.24 or later +- Docker and Docker Compose (for local development) +- A running BharatML Stack environment (etcd, ONFS, Predator, Numerix) + +### Using Docker Compose (Recommended) + +The easiest way to run Inferflow with all its dependencies is via the BharatML Stack quick-start: + +```bash +cd quick-start +./start.sh ``` -### Properties:- - -application.yml - -```yml - -grpc: - inferflow-enabled: true - -client: - inferflow-grpc: - host: ${INFERFLOW_GRPC_HOST} - port: ${INFERFLOW_GRPC_PORT} - http2-config: - grpc-deadline: ${INFERFLOW_GRPC_DEADLINE} - connect-timeout: ${INFERFLOW_GRPC_CONNECT_TIMEOUT} - keepAliveTime: ${INFERFLOW_GRPC_KEEP_ALIVE_TIMEOUT} - connection-request-timeout: ${INFERFLOW_GRPC_CONN_REQUEST_TIMEOUT} - pool-size: ${INFERFLOW_GRPC_CHANNEL_POOL_SIZE} - thread-pool-size: ${INFERFLOW_GRPC_THREAD_POOL_SIZE} - bounded-queue-size: ${INFERFLOW_GRPC_QUEUE_POOL_SIZE} - is-plain-text: ${INFERFLOW_GRPC_PLAIN_TEXT} - +This starts Inferflow alongside all required services. See the [Quick Start Guide](../quick-start/README.md) for details.
+ +### Standalone + +```bash +cd inferflow + +# Set up environment variables +cp cmd/inferflow/application.env .env +# Edit .env with your configuration + +# Build +go build -o bin/inferflow cmd/inferflow/main.go + +# Run +./bin/inferflow ``` - -Prod: -```properties - INFERFLOW_GRPC_HOST=inferflow.cluster.meeshoint.in - INFERFLOW_GRPC_PORT=80 - INFERFLOW_GRPC_DEADLINE=500 - INFERFLOW_GRPC_CONNECT_TIMEOUT=100 - INFERFLOW_GRPC_KEEP_ALIVE_TIMEOUT=10000 - INFERFLOW_GRPC_CONN_REQUEST_TIMEOUT=100 - INFERFLOW_GRPC_CHANNEL_POOL_SIZE=1 - INFERFLOW_GRPC_THREAD_POOL_SIZE=100 - INFERFLOW_GRPC_QUEUE_POOL_SIZE=100 - INFERFLOW_GRPC_PLAIN_TEXT=true -``` - -To Connect Prod from Local: -```properties - INFERFLOW_GRPC_HOST=inferflow.meesho.com - INFERFLOW_GRPC_PORT=443 - INFERFLOW_GRPC_DEADLINE=50000 - INFERFLOW_GRPC_CONNECT_TIMEOUT=10000 - INFERFLOW_GRPC_KEEP_ALIVE_TIMEOUT=10000 - INFERFLOW_GRPC_CONN_REQUEST_TIMEOUT=10000 - INFERFLOW_GRPC_CHANNEL_POOL_SIZE=1 - INFERFLOW_GRPC_THREAD_POOL_SIZE=100 - INFERFLOW_GRPC_QUEUE_POOL_SIZE=100 - INFERFLOW_GRPC_PLAIN_TEXT=false + +## Configuration + +Inferflow is configured via environment variables. 
Group reference: + +### Application + +| Variable | Description | Default | +|----------|-------------|---------| +| `APP_ENV` | Environment (dev/staging/prod) | `prod` | +| `APP_NAME` | Application name | `inferflow` | +| `APP_PORT` | Server port (gRPC + HTTP) | `8085` | +| `APP_LOG_LEVEL` | Log level (DEBUG/INFO/ERROR) | `INFO` | +| `APP_GC_PERCENTAGE` | Go GC target percentage | `1` | + +### etcd + +| Variable | Description | Default | +|----------|-------------|---------| +| `ETCD_SERVER` | etcd server address | `http://etcd:2379` | +| `ETCD_WATCHER_ENABLED` | Enable config hot-reload | `true` | + +### Cache + +| Variable | Description | Default | +|----------|-------------|---------| +| `IN_MEMORY_CACHE_SIZE_IN_BYTES` | Feature cache size | `6000000000` | +| `DAG_TOPOLOGY_CACHE_SIZE` | DAG topology cache entries | `500` | +| `DAG_TOPOLOGY_CACHE_TTL_SEC` | DAG cache TTL in seconds | `300` | + +### Predator (ML Model Serving) + +| Variable | Description | Default | +|----------|-------------|---------| +| `EXTERNAL_SERVICE_PREDATOR_PORT` | Predator gRPC port | `8090` | +| `EXTERNAL_SERVICE_PREDATOR_GRPC_PLAIN_TEXT` | Use plaintext gRPC | `true` | +| `EXTERNAL_SERVICE_PREDATOR_CALLER_ID` | Caller identifier | `inferflow` | +| `EXTERNAL_SERVICE_PREDATOR_CALLER_TOKEN` | Auth token | `inferflow` | +| `EXTERNAL_SERVICE_PREDATOR_DEADLINE` | Request deadline (ms) | `200` | + +### Numerix (Matrix Operations) + +| Variable | Description | Default | +|----------|-------------|---------| +| `NUMERIX_CLIENT_V1_HOST` | Numerix host | `numerix` | +| `NUMERIX_CLIENT_V1_PORT` | Numerix port | `8083` | +| `NUMERIX_CLIENT_V1_DEADLINE_MS` | Request deadline (ms) | `5000` | +| `NUMERIX_CLIENT_V1_PLAINTEXT` | Use plaintext gRPC | `true` | +| `NUMERIX_CLIENT_V1_AUTHTOKEN` | Auth token | `numerix` | +| `NUMERIX_CLIENT_V1_BATCHSIZE` | Batch size | `100` | + +### ONFS (Online Feature Store) + +| Variable | Description | Default | +|----------|-------------|---------| +| 
`EXTERNAL_SERVICE_ONFS_FS_HOST` | ONFS host | `onfs-api-server` | +| `EXTERNAL_SERVICE_ONFS_FS_PORT` | ONFS port | `8089` | +| `EXTERNAL_SERVICE_ONFS_FS_GRPC_PLAIN_TEXT` | Use plaintext gRPC | `true` | +| `EXTERNAL_SERVICE_ONFS_FS_CALLER_ID` | Caller identifier | `inferflow` | +| `EXTERNAL_SERVICE_ONFS_FS_CALLER_TOKEN` | Auth token | `inferflow` | +| `EXTERNAL_SERVICE_ONFS_FS_DEAD_LINE` | Request deadline (ms) | `200` | +| `EXTERNAL_SERVICE_ONFS_FS_BATCH_SIZE` | Batch size | `50` | + +### Kafka (Inference Logging) + +| Variable | Description | Default | +|----------|-------------|---------| +| `KAFKA_BOOTSTRAP_SERVERS` | Kafka broker addresses | `broker:29092` | +| `KAFKA_LOGGING_TOPIC` | Inference log topic | `inferflow_inference_logs` | + +## Docker + +```bash +# Build +docker build -f cmd/inferflow/Dockerfile -t inferflow:latest . + +# Run +docker run -p 8085:8085 --env-file .env inferflow:latest ``` -### Usage -API-1: Get model score for entities (eg.catalog) +## Documentation + +| Version | Link | +|---------|------| +| Latest | [Inferflow Documentation](https://meesho.github.io/BharatMLStack/category/inferflow) | + +## 🤝 Contributing + +Contributions are welcome! Please check our [Contribution Guide](../CONTRIBUTING.md) for details on how to get started. 
+ +We encourage you to: +- Join our [Discord community](https://discord.gg/XkT7XsV2AU) to discuss features, ideas, and questions +- Check existing issues before opening a new one +- Follow our coding guidelines and pull request process +- Participate in code reviews and discussions -Class: @Qualifier(InferflowConstants.BeanNames.INFERFLOW_SERVICE) IInferflow Inferflow; ## Community & Support -Method: retrieveModelScore +- 💬 **Discord**: Join our [community chat](https://discord.gg/XkT7XsV2AU) +- 🐛 **Issues**: Report bugs and request features on [GitHub Issues](https://github.com/Meesho/BharatMLStack/issues) +- 📧 **Email**: Contact us at [ml-oss@meesho.com](mailto:ml-oss@meesho.com) -* Note : Implement batching at client side and configure batch size as per your latency needs. +## License +BharatMLStack is open-source software licensed under the [BharatMLStack Business Source License 1.1](LICENSE.md). +--- -temp change \ No newline at end of file +
                                                  + Built with ❤️ for the ML community from Meesho +
                                                  +
                                                  + If you find this useful, ⭐️ the repo — your support means the world to us! +
                                                  diff --git a/online-feature-store/README.md b/online-feature-store/README.md index 9e20618d..e82626a4 100644 --- a/online-feature-store/README.md +++ b/online-feature-store/README.md @@ -33,10 +33,6 @@ Online-feature-store consists of several key components working together: ![Online-feature-store Architecture](../docs-src/static/img/v1.0.0-onfs-arch.png) -## 🚀 Quick Start - -For detailed setup instructions, see the [**Quick Start Guide**](quick-start/README.md). - ## 🧰 SDKs Online-feature-store provides SDKs to interact with the feature store: @@ -44,6 +40,9 @@ Online-feature-store provides SDKs to interact with the feature store: - **[Go SDK](sdks/go/README.md)** - For backend services and ML inference - **[Python SDK](sdks/python/README.md)** - For feature ingestion and Spark jobs +## 🚀 Quick Start + +For detailed setup instructions, see the [**Quick Start Guide**](quick-start/README.md). ## 📊 Use Cases @@ -681,9 +680,6 @@ There are several ways to get help with Online-feature-store: Feedback and contributions are welcome! -## Contributing - -We welcome contributions from the community! Please see our [Contributing Guide](CONTRIBUTING.md) for details on how to get started. ## Community & Support diff --git a/skye/README.md b/skye/README.md index faf071c4..b0f12308 100644 --- a/skye/README.md +++ b/skye/README.md @@ -1,5 +1,223 @@ +![Build Status](https://github.com/Meesho/BharatMLStack/actions/workflows/skye.yml/badge.svg) +![Static Badge](https://img.shields.io/badge/release-v1.0.0-blue?style=flat) +[![Discord](https://img.shields.io/badge/Discord-Join%20Chat-7289da?style=flat&logo=discord&logoColor=white)](https://discord.gg/XkT7XsV2AU) + # Skye -Vector Similarity Search Service with three runnable components: **skye-admin**, **skye-consumers**, and **skye-serving**. +Vector similarity search platform for BharatML Stack. 
+ +Skye enables fast semantic retrieval by representing data as vectors and querying nearest matches in high-dimensional space. It is composed of three runnable components: **skye-admin**, **skye-consumers**, and **skye-serving**. + +## ✨ Features + +- 🔌 **Pluggable Vector Databases** — Support for multiple vector DB backends (Qdrant, NGT, Eigenix) via a generic abstraction layer +- 🏗️ **Shared Embeddings, Isolated Indexes** — Models are stored once but serve multiple tenants (variants), reducing data redundancy +- ⚡ **Event-Driven Administration** — Model lifecycle management through Kafka-based event flows for resilience and fault tolerance +- 💾 **Multi-Layer Caching** — In-memory (FreeCache) + distributed (Redis) caching for ultra-low-latency serving +- 🔍 **Similarity Search APIs** — gRPC APIs for similar-candidate search, bulk embedding retrieval, and dot-product computation +- 🔄 **Real-Time + Batch Ingestion** — Kafka consumers for both reset/delta batch jobs and real-time embedding updates +- 🎯 **Configurable Distance Functions** — DOT, Cosine, and Euclidean distance support +- 🛡️ **Resilience** — Circuit breakers, retry topics, and snapshot-based recovery + +## 🏗️ Architecture + +Skye is built around three components: + +| Component | Role | +|-----------|------| +| **skye-serving** | Handles real-time similarity search queries with in-memory caching and vector DB lookups (gRPC, port 9090) | +| **skye-consumers** | Processes embedding ingestion (reset/delta jobs) and real-time aggregation events from Kafka (HTTP, port 8080) | +| **skye-admin** | Manages model lifecycle, onboarding, variant registration, and coordinates jobs (HTTP, port 8080) | + +For detailed architecture and data flow diagrams, see the [Skye documentation](https://meesho.github.io/BharatMLStack/category/skye). 
+ +### External Dependencies + +| Dependency | Purpose | +|------------|---------| +| **etcd** | Dynamic model/variant configuration | +| **Kafka** | Embedding ingestion events, model state machine | +| **Qdrant** | Vector database (pluggable) | +| **ScyllaDB** | Embedding storage + aggregator data | +| **Redis** | Distributed caching | + +## 📡 gRPC APIs (skye-serving) + +### Similar Candidate Service + +| RPC | Description | +|-----|-------------| +| `GetSimilarCandidates(SkyeRequest)` | Find similar candidates using embeddings or candidate IDs | + +### Embedding Service + +| RPC | Description | +|-----|-------------| +| `GetEmbeddingsForCandidates(SkyeBulkEmbeddingRequest)` | Bulk embedding retrieval for candidate IDs | +| `GetDotProductOfCandidatesForEmbedding(EmbeddingDotProductRequest)` | Compute dot products between an embedding and candidates | + +### HTTP Endpoints + +| Endpoint | Description | +|----------|-------------| +| `GET /health` | Health check | + +## 🔧 Admin HTTP APIs (skye-admin) + +### Model Management + +| Endpoint | Description | +|----------|-------------| +| `POST /api/v1/model/register-model` | Register a new model | +| `POST /api/v1/model/register-variant` | Register a variant for a model | +| `POST /api/v1/model/register-store` | Register a storage store | +| `POST /api/v1/model/register-frequency` | Register job frequency | +| `POST /api/v1/model/register-entity` | Register an entity type | + +### Qdrant Operations + +| Endpoint | Description | +|----------|-------------| +| `POST /api/v1/qdrant/create-collection` | Create a Qdrant collection | +| `POST /api/v1/qdrant/process-model` | Process a model (reset) | +| `POST /api/v1/qdrant/process-multi-variant` | Process multiple variants | +| `POST /api/v1/qdrant/promote-variant` | Promote variant to scale-up cluster | +| `POST /api/v1/qdrant/trigger-indexing` | Trigger indexing pipeline | + +## 🧰 SDKs + +- **[Go SDK](../go-sdk/pkg/clients/skye/)** — Client library for backend services + 
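The multi-layer caching mentioned under Features follows a standard read-through pattern: check the in-memory cache first, then the distributed cache, and only then hit the backing store, back-filling the caches on the way out. A hedged sketch with plain maps standing in for FreeCache and Redis — the types and names here are illustrative, not Skye's API:

```go
package main

import "fmt"

// cacheLayer looks up an embedding by candidate ID. In Skye the layers
// would be FreeCache (L1) and Redis (L2); here a map stands in for both.
type cacheLayer struct{ m map[string][]float32 }

func newLayer() *cacheLayer { return &cacheLayer{m: map[string][]float32{}} }

// readThrough checks each layer in order; on a full miss it calls the
// backing source and back-fills every layer so later reads are cheap.
func readThrough(layers []*cacheLayer, source func(string) []float32, id string) []float32 {
	for _, l := range layers {
		if v, ok := l.m[id]; ok {
			return v
		}
	}
	v := source(id)
	for _, l := range layers {
		l.m[id] = v
	}
	return v
}

func main() {
	l1, l2 := newLayer(), newLayer() // in-memory, then "distributed"
	calls := 0
	fromStore := func(id string) []float32 {
		calls++ // pretend this is the embedding-store lookup
		return []float32{0.1, 0.2}
	}

	readThrough([]*cacheLayer{l1, l2}, fromStore, "candidate-42") // miss → store
	readThrough([]*cacheLayer{l1, l2}, fromStore, "candidate-42") // hit → L1
	fmt.Println(calls) // → 1
}
```

The second lookup never reaches the store, which is the property that makes repeated similarity queries cheap.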
+## 🚀 Quick Start + +### Prerequisites + +- Go 1.24 or later +- Docker and Docker Compose (for local development) +- `librdkafka-dev` (CGO dependency for Kafka client) +- A running BharatML Stack environment (etcd, Kafka, Qdrant, ScyllaDB, Redis) + +### Using Docker Compose (Recommended) + +The easiest way to run Skye with all its dependencies is via the BharatML Stack quick-start: + +```bash +cd quick-start +./start.sh +``` + +This starts Skye alongside all required services. See the [Quick Start Guide](../quick-start/README.md) for details. + +### Standalone + +```bash +cd skye + +# Build all components +go build -o bin/skye-admin ./cmd/admin +go build -o bin/skye-consumers ./cmd/consumers +go build -o bin/skye-serving ./cmd/serving + +# Run (example: serving) +./bin/skye-serving +``` + +## ⚙️ Configuration + +Skye is configured via environment variables (loaded through Viper). Dynamic model/variant configuration is managed via etcd. + +### Application + +| Variable | Description | +|----------|-------------| +| `app_name` | Application name | +| `app_env` | Environment (staging/production) | +| `port` | HTTP/gRPC server port | +| `auth_tokens` | Authentication tokens | + +### etcd + +| Variable | Description | +|----------|-------------| +| `etcd_server` | etcd server address | +| `etcd_username` | etcd username | +| `etcd_password` | etcd password | +| `etcd_watcher_enabled` | Enable config hot-reload | + +### Kafka + +| Variable | Description | +|----------|-------------| +| `kafka_broker` | Kafka broker address | +| `kafka_group_id` | Kafka consumer group ID | +| `kafka_topic` | Kafka topic | +| `embedding_consumer_kafka_ids` | Comma-separated embedding consumer IDs | +| `realtime_consumer_kafka_ids` | Comma-separated real-time consumer IDs | +| `realtime_producer_kafka_id` | Real-time producer ID | + +### Redis + +| Variable | Description | +|----------|-------------| +| `redis_addr` | Redis server address | +| `redis_password` | Redis password | +| 
`redis_db` | Redis database number | + +### Storage + +| Variable | Description | +|----------|-------------| +| `storage_aggregator_db_count` | Number of aggregator database connections | +| `storage_embedding_store_count` | Number of embedding store connections | + +## 🐳 Docker + +```bash +cd skye + +# Build images +docker build -f cmd/admin/Dockerfile -t skye-admin:latest . +docker build -f cmd/consumers/Dockerfile -t skye-consumers:latest . +docker build -f cmd/serving/Dockerfile -t skye-serving:latest . + +# Run (example: serving) +docker run -p 9090:9090 --env-file .env skye-serving:latest + +# Run (example: admin) +docker run -p 8080:8080 --env-file .env skye-admin:latest +``` + +## 📚 Documentation + +| Version | Link | +|---------|------| +| v1.0.0 | [Skye Documentation](https://meesho.github.io/BharatMLStack/category/skye) | + +## 🤝 Contributing + +Contributions are welcome! Please check our [Contribution Guide](../CONTRIBUTING.md) for details on how to get started. + +We encourage you to: +- Join our [Discord community](https://discord.gg/XkT7XsV2AU) to discuss features, ideas, and questions +- Check existing issues before opening a new one +- Follow our coding guidelines and pull request process +- Participate in code reviews and discussions + +## Community & Support + +- 💬 **Discord**: Join our [community chat](https://discord.gg/XkT7XsV2AU) +- 🐛 **Issues**: Report bugs and request features on [GitHub Issues](https://github.com/Meesho/BharatMLStack/issues) +- 📧 **Email**: Contact us at [ml-oss@meesho.com](mailto:ml-oss@meesho.com) + +## License + +BharatMLStack is open-source software licensed under the [BharatMLStack Business Source License 1.1](LICENSE.md). + +--- -- **[Quickstart](docs/QUICKSTART.md)** – Build, configure, and run skye-admin, skye-consumers, and skye-serving. \ No newline at end of file +
                                                  + Built with ❤️ for the ML community from Meesho +
                                                  +
                                                  + If you find this useful, ⭐️ the repo — your support means the world to us! +