From 0b71ed5b2c47aaeeb740acc70f9447ed4addf445 Mon Sep 17 00:00:00 2001
From: "mintlify[bot]" <109931778+mintlify[bot]@users.noreply.github.com>
Date: Sat, 14 Feb 2026 17:09:37 +0000
Subject: [PATCH 1/2] Update integrations/llms/vertex-ai.mdx

Co-Authored-By: mintlify[bot] <109931778+mintlify[bot]@users.noreply.github.com>
---
 integrations/llms/vertex-ai.mdx | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/integrations/llms/vertex-ai.mdx b/integrations/llms/vertex-ai.mdx
index 8a05cb46..cfac849a 100644
--- a/integrations/llms/vertex-ai.mdx
+++ b/integrations/llms/vertex-ai.mdx
@@ -199,6 +199,25 @@ Portkey supports the [Google Vertex AI CountTokens API](https://docs.cloud.googl

Vertex AI supports [context caching](https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-create) to reduce costs and latency for repeated prompts with large amounts of context. You can explicitly create a cache and then reference it in subsequent inference requests.

+
+**This is different from Portkey's gateway caching.** Vertex AI context caching is a native Google feature that caches large context (system instructions, documents, etc.) on Google's servers. Portkey's [simple and semantic caching](/product/ai-gateway/cache-simple-and-semantic) caches complete request-response pairs at the gateway level. Use Vertex AI context caching when you have large, reusable context; use Portkey's gateway caching for repeated identical or similar requests.
+
+
+
+**Gemini models only.** Context caching is only available for Gemini models on Vertex AI (e.g., `gemini-1.5-pro`, `gemini-1.5-flash`). It is not supported for Anthropic, Meta, or other models hosted on Vertex AI.
+
+
+### Pricing
+
+Context caching can significantly reduce costs for applications with large, reusable context:
+
+| Token type | Price per token |
+|------------|-----------------|
+| Cache write (input) | $0.000625 |
+| Cache read (input) | $0.00005 |
+
+Cache reads are ~12x cheaper than cache writes, making this cost-effective for scenarios where you reference the same cached content multiple times.
+
### Step 1: Create a context cache

Use the Vertex AI `cachedContents` endpoint through Portkey to create a cache:

From 241a7b99edcf326b6bc1e8bc04b6dc195c0841b6 Mon Sep 17 00:00:00 2001
From: "mintlify[bot]" <109931778+mintlify[bot]@users.noreply.github.com>
Date: Sat, 14 Feb 2026 17:10:41 +0000
Subject: [PATCH 2/2] Update integrations/llms/vertex-ai.mdx

Co-Authored-By: mintlify[bot] <109931778+mintlify[bot]@users.noreply.github.com>
---
 integrations/llms/vertex-ai.mdx | 107 +++++++++++++++++++++++++++++---
 1 file changed, 97 insertions(+), 10 deletions(-)

diff --git a/integrations/llms/vertex-ai.mdx b/integrations/llms/vertex-ai.mdx
index cfac849a..137eb0d2 100644
--- a/integrations/llms/vertex-ai.mdx
+++ b/integrations/llms/vertex-ai.mdx
@@ -220,9 +220,11 @@ Cache reads are ~12x cheaper than cache writes, making this cost-effective for s

### Step 1: Create a context cache

-Use the Vertex AI `cachedContents` endpoint through Portkey to create a cache:
+Use Portkey's proxy capability with the `x-portkey-custom-host` header to call Vertex AI's native caching endpoints. This routes your request through Portkey while targeting Vertex AI's `cachedContents` API directly.

-```sh cURL
+
+
+```sh
curl --location 'https://api.portkey.ai/v1/projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/cachedContents' \
--header 'x-portkey-provider: {{@my-vertex-ai-provider}}' \
--header 'Content-Type: application/json' \
--header 'x-portkey-api-key: {{your_api_key}}' \
--data '{
    "model": "projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/publishers/google/models/{{MODEL_ID}}",
    "displayName": "{{my-cache-display-name}}",
    "contents": [{
        "role": "user",
        "parts": [{
            "text": "This is sample text to demonstrate explicit caching. (minimum 1024 tokens required)"
        }]
    },
    {
        "role": "model",
        "parts": [{
            "text": "Thank you, I am your helpful assistant"
        }]
-    }]
+    }],
+    "ttl": "3600s"
}'
```

```python
import requests

YOUR_PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
MODEL_ID = "gemini-1.5-pro-001"

url = f"https://api.portkey.ai/v1/projects/{YOUR_PROJECT_ID}/locations/{LOCATION}/cachedContents"

headers = {
    "x-portkey-provider": "@my-vertex-ai-provider",
    "Content-Type": "application/json",
    "x-portkey-api-key": "your_portkey_api_key",
    "x-portkey-custom-host": "https://aiplatform.googleapis.com/v1"
}

payload = {
    "model": f"projects/{YOUR_PROJECT_ID}/locations/{LOCATION}/publishers/google/models/{MODEL_ID}",
    "displayName": "my-cache-display-name",
    "contents": [
        {
            "role": "user",
            "parts": [{"text": "This is sample text to demonstrate explicit caching. (minimum 1024 tokens required)"}]
        },
        {
            "role": "model",
            "parts": [{"text": "Thank you, I am your helpful assistant"}]
        }
    ],
    "ttl": "3600s"
}

response = requests.post(url, headers=headers, json=payload)
cache_name = response.json().get("name")  # Save this for inference requests
print(f"Cache created: {cache_name}")
```

```javascript
const YOUR_PROJECT_ID = "your-project-id";
const LOCATION = "us-central1";
const MODEL_ID = "gemini-1.5-pro-001";

const response = await fetch(
  `https://api.portkey.ai/v1/projects/${YOUR_PROJECT_ID}/locations/${LOCATION}/cachedContents`,
  {
    method: 'POST',
    headers: {
      'x-portkey-provider': '@my-vertex-ai-provider',
      'Content-Type': 'application/json',
      'x-portkey-api-key': 'your_portkey_api_key',
      'x-portkey-custom-host': 'https://aiplatform.googleapis.com/v1'
    },
    body: JSON.stringify({
      model: `projects/${YOUR_PROJECT_ID}/locations/${LOCATION}/publishers/google/models/${MODEL_ID}`,
      displayName: 'my-cache-display-name',
      contents: [
        {
          role: 'user',
          parts: [{ text: 'This is sample text to demonstrate explicit caching. (minimum 1024 tokens required)' }]
        },
        {
          role: 'model',
          parts: [{ text: 'Thank you, I am your helpful assistant' }]
        }
      ],
      ttl: '3600s'
    })
  }
);

const data = await response.json();
const cacheName = data.name; // Save this for inference requests
console.log(`Cache created: ${cacheName}`);
```

**Request variables:**

| Variable | Description |
|----------|-------------|
-| `YOUR_PROJECT_ID` | Your Google Cloud project ID. |
-| `LOCATION` | The region where your model is deployed (e.g., `us-central1`). |
-| `MODEL_ID` | The model identifier (e.g., `gemini-1.5-pro-001`). |
-| `my-cache-display-name` | A unique name to identify your cache. |
-| `your_api_key` | Your Portkey API key. |
-| `@my-vertex-ai-provider` | Your Vertex AI provider slug from Portkey's Model Catalog. |
+| `YOUR_PROJECT_ID` | Your Google Cloud project ID |
+| `LOCATION` | The region where your model is deployed (e.g., `us-central1`) |
+| `MODEL_ID` | The Gemini model identifier (e.g., `gemini-1.5-pro-001`) |
+| `my-cache-display-name` | A unique name to identify your cache |
+| `your_api_key` | Your Portkey API key |
+| `@my-vertex-ai-provider` | Your Vertex AI provider slug from Portkey's Model Catalog |
+| `ttl` | Cache time-to-live (e.g., `3600s` for 1 hour, `86400s` for 1 day) |

-Context caching requires a minimum of 1024 tokens in the cached content. The cache has a default TTL (time-to-live) which you can configure using the `ttl` parameter.
+Context caching requires a minimum of **1024 tokens** in the cached content. The cache has a default TTL (time-to-live) which you can configure using the `ttl` parameter.
+
+**Proxy pattern limitation:** Creating caches uses Portkey's proxy capability to forward requests to Vertex AI's native endpoints. Native endpoint support within Portkey's standard API is not planned since context caching is Gemini-specific.
+
+
### Step 2: Use the cache in inference requests

Once the cache is created, reference it in your chat completion requests using the `cached_content` parameter:
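As a rough sketch of what that request can look like, the snippet below sends a standard chat completion through Portkey's OpenAI-compatible `/v1/chat/completions` endpoint with plain `requests`, mirroring the Step 1 Python example, and passes the cache's `name` from Step 1 as `cached_content`. The short Gemini model ID and the exact passthrough of `cached_content` are assumptions to confirm for your setup rather than a definitive implementation:

```python
# Sketch: reuse the cache created in Step 1 in a chat completion routed through Portkey.
# Values below are placeholders; confirm the exact parameter passthrough for your setup.
import requests

PORTKEY_API_KEY = "your_portkey_api_key"
# The `name` field returned by the cachedContents call in Step 1
# (placeholder path, not a real cache ID).
CACHE_NAME = "projects/.../locations/.../cachedContents/..."

response = requests.post(
    "https://api.portkey.ai/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "x-portkey-api-key": PORTKEY_API_KEY,
        "x-portkey-provider": "@my-vertex-ai-provider",
    },
    json={
        "model": "gemini-1.5-pro-001",  # assumed short model ID; must be a Gemini model
        "messages": [
            {"role": "user", "content": "Summarize the cached document in three bullet points."}
        ],
        # Assumption: Portkey forwards this field to Vertex AI's context cache.
        "cached_content": CACHE_NAME,
    },
)

print(response.json()["choices"][0]["message"]["content"])
```

The bulky, reusable context stays in the cache while each request carries only the short per-turn messages, which is where the cheaper cache-read pricing above pays off.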