
Vertex AI supports [context caching](https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-create) to reduce costs and latency for repeated prompts with large amounts of context. You can explicitly create a cache and then reference it in subsequent inference requests.

<Info>
**This is different from Portkey's gateway caching.** Vertex AI context caching is a native Google feature that caches large context (system instructions, documents, etc.) on Google's servers. Portkey's [simple and semantic caching](/product/ai-gateway/cache-simple-and-semantic) caches complete request-response pairs at the gateway level. Use Vertex AI context caching when you have large, reusable context; use Portkey's gateway caching for repeated identical or similar requests.
</Info>

<Note>
**Gemini models only.** Context caching is only available for Gemini models on Vertex AI (e.g., `gemini-1.5-pro`, `gemini-1.5-flash`). It is not supported for Anthropic, Meta, or other models hosted on Vertex AI.
</Note>

### Pricing

Context caching can significantly reduce costs for applications with large, reusable context:

| Token type | Price per token |
|------------|-----------------|
| Cache write (input) | $0.000625 |
| Cache read (input) | $0.00005 |

Cache reads are roughly 12.5x cheaper than cache writes, making context caching cost-effective when the same cached content is referenced multiple times.
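As an illustration using the rates above: caching a 10,000-token document costs 10,000 × $0.000625 = $6.25 to write, while each subsequent read of that cached content costs 10,000 × $0.00005 = $0.50.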

### Step 1: Create a context cache

Use Portkey's proxy capability with the `x-portkey-custom-host` header to call Vertex AI's native caching endpoints. This routes your request through Portkey while targeting Vertex AI's `cachedContents` API directly.

<Tabs>
<Tab title="cURL">
```sh
curl --location 'https://api.portkey.ai/v1/projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/cachedContents' \
--header 'x-portkey-provider: {{@my-vertex-ai-provider}}' \
--header 'Content-Type: application/json' \
--header 'x-portkey-api-key: {{your_portkey_api_key}}' \
--header 'x-portkey-custom-host: https://aiplatform.googleapis.com/v1' \
--data '{
    "model": "projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/publishers/google/models/{{MODEL_ID}}",
    "displayName": "my-cache-display-name",
    "contents": [{
        "role": "user",
        "parts": [{
            "text": "This is sample text to demonstrate explicit caching. (minimum 1024 tokens required)"
        }]
    }, {
        "role": "model",
"parts": [{
"text": "thankyou I am your helpful assistant"
}]
}],
"ttl": "3600s"
}'
```
</Tab>
<Tab title="Python">
```python
import requests

YOUR_PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
MODEL_ID = "gemini-1.5-pro-001"

url = f"https://api.portkey.ai/v1/projects/{YOUR_PROJECT_ID}/locations/{LOCATION}/cachedContents"

headers = {
"x-portkey-provider": "@my-vertex-ai-provider",
"Content-Type": "application/json",
"x-portkey-api-key": "your_portkey_api_key",
"x-portkey-custom-host": "https://aiplatform.googleapis.com/v1"
}

payload = {
"model": f"projects/{YOUR_PROJECT_ID}/locations/{LOCATION}/publishers/google/models/{MODEL_ID}",
"displayName": "my-cache-display-name",
"contents": [
{
"role": "user",
"parts": [{"text": "This is sample text to demonstrate explicit caching. (minimum 1024 tokens required)"}]
},
{
"role": "model",
"parts": [{"text": "Thank you, I am your helpful assistant"}]
}
],
"ttl": "3600s"
}

response = requests.post(url, headers=headers, json=payload)
cache_name = response.json().get("name") # Save this for inference requests
print(f"Cache created: {cache_name}")
```
</Tab>
<Tab title="NodeJS">
```javascript
const YOUR_PROJECT_ID = "your-project-id";
const LOCATION = "us-central1";
const MODEL_ID = "gemini-1.5-pro-001";

const response = await fetch(
`https://api.portkey.ai/v1/projects/${YOUR_PROJECT_ID}/locations/${LOCATION}/cachedContents`,
{
method: 'POST',
headers: {
'x-portkey-provider': '@my-vertex-ai-provider',
'Content-Type': 'application/json',
'x-portkey-api-key': 'your_portkey_api_key',
'x-portkey-custom-host': 'https://aiplatform.googleapis.com/v1'
},
body: JSON.stringify({
model: `projects/${YOUR_PROJECT_ID}/locations/${LOCATION}/publishers/google/models/${MODEL_ID}`,
displayName: 'my-cache-display-name',
contents: [
{
role: 'user',
parts: [{ text: 'This is sample text to demonstrate explicit caching. (minimum 1024 tokens required)' }]
},
{
role: 'model',
parts: [{ text: 'Thank you, I am your helpful assistant' }]
}
],
ttl: '3600s'
})
}
);

const data = await response.json();
const cacheName = data.name; // Save this for inference requests
console.log(`Cache created: ${cacheName}`);
```
</Tab>
</Tabs>

**Request variables:**

| Variable | Description |
|----------|-------------|
| `YOUR_PROJECT_ID` | Your Google Cloud project ID |
| `LOCATION` | The region where your model is deployed (e.g., `us-central1`) |
| `MODEL_ID` | The Gemini model identifier (e.g., `gemini-1.5-pro-001`) |
| `my-cache-display-name` | A unique name to identify your cache |
| `your_portkey_api_key` | Your Portkey API key |
| `@my-vertex-ai-provider` | Your Vertex AI provider slug from Portkey's Model Catalog |
| `ttl` | Cache time-to-live (e.g., `3600s` for 1 hour, `86400s` for 1 day) |

<Note>
Context caching requires a minimum of **1024 tokens** in the cached content. The cache has a default TTL (time-to-live) which you can configure using the `ttl` parameter.
</Note>

<Warning>
**Proxy pattern limitation:** Creating caches uses Portkey's proxy capability to forward requests to Vertex AI's native endpoints. Native endpoint support within Portkey's standard API is not planned since context caching is Gemini-specific.
</Warning>

### Step 2: Use the cache in inference requests

Once the cache is created, reference it in your chat completion requests using the `cached_content` parameter:
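
Below is a minimal sketch of such a request, assuming the cache `name` returned in Step 1 (a full resource path like `projects/{project}/locations/{location}/cachedContents/{id}`) and that the gateway forwards the `cached_content` field in the request body to Vertex AI unchanged. Adjust the model and provider slug to match your setup.

```python
import requests

PORTKEY_API_KEY = "your_portkey_api_key"   # your Portkey API key
CACHE_NAME = "projects/.../cachedContents/..."  # the `name` returned in Step 1

response = requests.post(
    "https://api.portkey.ai/v1/chat/completions",
    headers={
        "x-portkey-api-key": PORTKEY_API_KEY,
        "x-portkey-provider": "@my-vertex-ai-provider",  # your Vertex AI provider slug
        "Content-Type": "application/json",
    },
    json={
        "model": "gemini-1.5-pro-001",
        "messages": [
            {"role": "user", "content": "Summarize the cached document in three bullet points."}
        ],
        # Reference the context cache created in Step 1 (assumes this field is
        # passed through to Vertex AI alongside the standard parameters)
        "cached_content": CACHE_NAME,
    },
)

print(response.json()["choices"][0]["message"]["content"])
```

Tokens served from the cache are billed at the cache-read rate shown in the pricing table above.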