Beyond Vector RAG: Building Agent Memory That Learns From Experience.
Agent memory has come a long way. Persistent context, vector retrieval, knowledge graphs — the building blocks are real and getting better fast.
But most of what we call "memory" today is still closer to search: chunk text, embed it, retrieve whatever looks similar at query time. That works well for recalling facts and preferences. It starts to break down when you need an agent to recall what happened last time, learn from a mistake, or avoid repeating a failed approach.
We are experimenting with something different: an episodic memory system where a frozen LLM — same weights, no retraining — produces increasingly better decisions over time because the memory feeding it context is continuously evolving.
Then we tested it. The results were interesting.
The Gap Nobody Talks About
Here's a scenario every engineering team has encountered: an AI agent hits a Redis connection pool exhaustion issue. It misdiagnoses it as a database problem. You correct it. Next week, a different service has the exact same failure pattern. The agent makes the exact same mistake.
Why? Because LLMs don't learn at inference time. Corrections adjust behavior within a conversation. Once the session ends, the lesson is gone. The model weights haven't changed. The next conversation starts from zero.
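The scenario above is what an episodic store is meant to fix: persist the correction outside the session, then retrieve it by failure pattern next time. Here is a minimal sketch of that loop; the names (`Episode`, `EpisodeStore`) and the naive symptom-overlap matching are ours for illustration, not any framework's API — a real system would match with embeddings.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One recorded experience: what happened, what was guessed, what was true."""
    symptoms: set[str]       # observed signals, e.g. {"timeout", "redis"}
    wrong_diagnosis: str     # what the agent guessed first
    correct_diagnosis: str   # the human correction
    resolution: str          # what actually fixed it

@dataclass
class EpisodeStore:
    """Keeps corrections across sessions so the next run can be primed with them."""
    episodes: list[Episode] = field(default_factory=list)

    def record(self, ep: Episode) -> None:
        self.episodes.append(ep)

    def recall(self, symptoms: set[str], min_overlap: int = 2) -> list[Episode]:
        # Naive keyword-overlap match; swap in embedding similarity in practice.
        return [e for e in self.episodes
                if len(e.symptoms & symptoms) >= min_overlap]

# Week 1: the agent misdiagnoses Redis pool exhaustion; record the correction.
store = EpisodeStore()
store.record(Episode(
    symptoms={"timeout", "redis", "pool-exhausted"},
    wrong_diagnosis="database overload",
    correct_diagnosis="redis connection pool exhaustion",
    resolution="raise pool size and reuse connections",
))

# Week 2: a different service shows the same pattern. Matching episodes are
# rendered into text and injected into the frozen LLM's context window.
matches = store.recall({"redis", "timeout", "new-service"})
context = "\n".join(
    f"Previously misdiagnosed as '{m.wrong_diagnosis}'; real cause was "
    f"'{m.correct_diagnosis}'. Fix: {m.resolution}."
    for m in matches
)
```

The model weights never change; only the retrieved context does, which is the whole mechanism by which the frozen LLM "learns" across sessions.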
LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale
Raw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack—from memory management to kernel execution.
1. Advanced Memory Management: Paged & Prefix KV Caching
The most significant bottleneck in LLM inference is not always compute, but memory bandwidth—specifically managing the Key-Value (KV) cache.
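To make the paging idea concrete, here is a toy sketch of the bookkeeping behind paged KV caching, in the spirit of vLLM's PagedAttention: KV memory is carved into fixed-size blocks, each sequence holds a block table, and a shared prefix is just shared block IDs with reference counts. Block size and class names here are illustrative assumptions, not production code.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative; real systems tune this)

class BlockAllocator:
    """Tracks free KV blocks and per-block reference counts for prefix sharing."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount: dict[int, int] = {}

    def alloc(self) -> int:
        bid = self.free.pop()
        self.refcount[bid] = 1
        return bid

    def fork(self, block_table: list[int]) -> list[int]:
        # Copy-on-write sharing: a forked sequence reuses the parent's blocks
        # instead of duplicating the prefix KV cache.
        for bid in block_table:
            self.refcount[bid] += 1
        return list(block_table)

    def free_table(self, block_table: list[int]) -> None:
        for bid in block_table:
            self.refcount[bid] -= 1
            if self.refcount[bid] == 0:
                del self.refcount[bid]
                self.free.append(bid)

def blocks_needed(num_tokens: int) -> int:
    return -(-num_tokens // BLOCK_SIZE)  # ceiling division

allocator = BlockAllocator(num_blocks=64)
# A 100-token prompt needs ceil(100/16) = 7 blocks, not one contiguous slab
# sized for the maximum sequence length.
prompt_table = [allocator.alloc() for _ in range(blocks_needed(100))]
# Two parallel decodes share the prompt's KV blocks via refcounts.
branch_a = allocator.fork(prompt_table)
branch_b = allocator.fork(prompt_table)
```

Because allocation is per-block rather than per-maximum-sequence, fragmentation drops and shared prefixes are stored once, which is where most of the memory-bandwidth win comes from.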

