Tutorial

Harper + Vertex AI: The Architecture Every Agent Builder Should Know

Production agents bleed tokens and latency on repeated queries. Pair a managed model layer with a vector-indexed data layer at the edge, and an 80% cache hit rate cuts LLM spend by 80% while delivering sub-100ms responses on semantically similar requests.

Drew Chambers
CMO
at Harper
May 13, 2026

The Tax Nobody Talks About

You've built your agent. It reasons, it retrieves, it acts. And then your cloud bill arrives.

Most teams building on LLMs treat every agent invocation as a fresh, independent transaction — a blank slate that calls the model, waits for a response, and moves on. At small scale, this is fine. At production scale, it becomes a compounding tax: on latency, on tokens, and on cost. The agent pattern that works great in a demo starts to buckle when thousands of users are running similar workflows simultaneously.

The fix isn't a better model. It's a better architecture — and the combination of Harper and Vertex AI is one of the most elegant ways to build it.

What the Problem Actually Looks Like

Consider an agent that answers product questions for an e-commerce platform. Across thousands of daily sessions:

  • "What's your return policy?" gets asked ~400 times a day
  • "Do you ship to Canada?" another ~200 times
  • Hundreds of variations of "How do I track my order?"

With a naive architecture, each of those is a full round-trip: embed the query, retrieve context, construct a prompt, call Gemini, stream a response. You're paying for tokens and waiting on latency every single time — even though the semantically correct answer hasn't changed since yesterday.

This is the problem semantic caching solves. And it's where Harper and Vertex AI fit together almost perfectly.

What Each Piece Brings

Vertex AI

Vertex AI is Google Cloud's managed ML platform, and for agent architectures it offers two things that matter most:

Gemini models — A tiered family ranging from Gemini Flash (fast, cheap, ideal for structured retrieval and summarization tasks) to Gemini Pro (deeper reasoning, more context). For agents, the tier selection is itself a performance optimization: route simple queries to Flash, complex ones to Pro.
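The Flash/Pro routing described above can be sketched as a simple heuristic router. This is an illustrative assumption, not a Vertex AI API: the complexity markers, the word-count cutoff, and the Pro model identifier are all placeholders you'd tune for your own workload.

```python
# Hypothetical tier router. The heuristics and the Pro model identifier
# below are illustrative assumptions, not part of Vertex AI.
COMPLEX_MARKERS = ("compare", "explain why", "step by step", "analyze")

def pick_model(query: str, max_flash_words: int = 40) -> str:
    """Route short, simple queries to Flash; longer or reasoning-heavy ones to Pro."""
    q = query.lower()
    needs_reasoning = any(marker in q for marker in COMPLEX_MARKERS)
    if needs_reasoning or len(query.split()) > max_flash_words:
        return "gemini-2.0-pro"  # assumed Pro identifier; substitute the real one
    return "gemini-2.0-flash-001"
```

The win is that the router is pure and cheap — it runs before any API call, so simple queries never pay the Pro-tier latency or price.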

Text Embeddings API — text-embedding-004 produces 768-dimensional embeddings optimized for semantic similarity. This is the input to the caching layer: embed the incoming query, hand the vector to Harper, and let Harper decide whether to serve a cached response or call the model fresh.

Harper

Harper is a distributed data platform that combines a database, cache, and application server into a single deployable unit. For agent architectures, four characteristics matter:

Native HNSW vector index — Harper indexes vector fields using HNSW (Hierarchical Navigable Small World), a best-in-class approximate nearest-neighbor algorithm. Declare a field as [Float] @indexed(type: "HNSW", distance: "cosine") in your schema and Harper handles the index automatically. Similarity search is a first-class query operation — no external vector database, no extra network hop.

Low-latency, in-memory access — Harper's storage engine keeps hot data in memory and flushes to disk asynchronously. Cache reads are sub-millisecond. For an agent that otherwise waits 1–3 seconds per LLM call, serving a cached response from Harper is an order-of-magnitude latency improvement.

Edge-deployable — Harper runs at the edge, close to your users. This matters for multi-region agent deployments where you want to serve cached responses from the nearest node rather than round-tripping to a central cloud region.

Components (server-side logic) — Harper's v5 component model lets you run Node.js logic directly alongside your data using Fastify routes. Your cache lookup and store logic live inside a Harper component — no extra network hops in your hot path.

The Core Architecture: Semantic Caching

Here's the pattern. When an agent receives a user query:

1. Embed the incoming query (Vertex AI Embeddings)
2. Search Harper's HNSW index for the nearest cached entry
3a. Cache hit (similarity > threshold): return cached response immediately
3b. Cache miss: call Vertex AI (Gemini), store response + embedding in Harper, return response

The division of labor is clean: Vertex AI generates embeddings and runs the LLM; Harper owns the vector index, the similarity search, and the response store.

Schema Definition

Harper's vector index is defined in schema.graphql:

type SemanticCache @table(expiration: 604800) @export {
  id: ID @primaryKey
  question: String
  answer: String
  embedding: [Float] @indexed(type: "HNSW", distance: "cosine")
  model: String
  generatedAt: Long
}

@indexed(type: "HNSW", distance: "cosine") creates the vector index automatically. @table(expiration: 604800) sets a 7-day TTL at the table level — no need to manage expires_at fields manually. @export makes the table accessible via Harper's REST API.

The Python Cache Client

This class wraps the embed → lookup → store cycle on the orchestrator side:

# semantic_cache.py
import hashlib
import httpx
import vertexai
from dataclasses import dataclass
from vertexai.language_models import TextEmbeddingModel

@dataclass
class CacheEntry:
    question: str
    answer: str
    model: str
    similarity: float

class SemanticCache:
    def __init__(
        self,
        harper_url: str,
        harper_token: str,
        gcp_project: str,
        location: str = "us-central1",
        threshold: float = 0.92,
    ):
        vertexai.init(project=gcp_project, location=location)
        self.embed_model = TextEmbeddingModel.from_pretrained("text-embedding-004")
        self.harper_url = harper_url
        self.headers = {
            "Authorization": f"Basic {harper_token}",
            "Content-Type": "application/json",
        }
        self.threshold = threshold

    def embed(self, text: str) -> list[float]:
        return self.embed_model.get_embeddings([text])[0].values

    async def lookup(self, embedding: list[float]) -> CacheEntry | None:
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{self.harper_url}/semantic-cache/lookup",
                headers=self.headers,
                json={"embedding": embedding, "threshold": self.threshold},
            )
            resp.raise_for_status()
            data = resp.json()

        if data.get("result"):
            r = data["result"]
            return CacheEntry(
                question=r["question"],
                answer=r["answer"],
                model=r["model"],
                similarity=data["similarity"],
            )
        return None

    async def store(
        self, question: str, embedding: list[float], answer: str, model: str
    ) -> None:
        async with httpx.AsyncClient() as client:
            await client.post(
                f"{self.harper_url}/semantic-cache/store",
                headers=self.headers,
                json={
                    "id": hashlib.sha256(question.encode()).hexdigest(),
                    "question": question,
                    "answer": answer,
                    "embedding": embedding,
                    "model": model,
                },
            )

Wiring It Into the Agent

# agent.py
from vertexai.generative_models import GenerativeModel
from semantic_cache import SemanticCache

cache = SemanticCache(
    harper_url="https://your-instance.harperdbcloud.com",
    harper_token="your-token",
    gcp_project="your-gcp-project",
    threshold=0.92,
)

gemini = GenerativeModel("gemini-2.0-flash-001")

async def run_agent(user_input: str) -> dict:
    embedding = cache.embed(user_input)

    hit = await cache.lookup(embedding)
    if hit:
        return {
            "response": hit.answer,
            "source": "cache",
            "similarity": round(hit.similarity, 4),
            "model": hit.model,
        }

    response = gemini.generate_content(user_input)
    answer = response.text

    await cache.store(user_input, embedding, answer, "gemini-2.0-flash-001")

    return {
        "response": answer,
        "source": "model",
        "similarity": None,
        "model": "gemini-2.0-flash-001",
    }

The source field is important — log it. Cache hit rates are invisible otherwise, and hit rate is the metric that tells you whether your threshold is tuned right.
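A minimal way to compute hit rate from those logs — the `source` field matches the agent above; the rest of the log-entry shape is assumed:

```python
from collections import Counter

def hit_rate(log_entries: list[dict]) -> float:
    """Fraction of responses served from cache, based on the logged `source` field."""
    counts = Counter(entry["source"] for entry in log_entries)
    total = counts["cache"] + counts["model"]
    return counts["cache"] / total if total else 0.0
```

Track this per day and per query category; a sudden drop usually means either your threshold is too strict or your query distribution shifted.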

The Harper Component: Vector Search Server-Side

The lookup runs inside Harper as a component, co-located with the HNSW index. Harper's search() with a vector sort target performs the nearest-neighbor lookup natively — no manual cosine similarity math, no in-process iteration over stored vectors.

Configure the component to serve Fastify routes:

# config.yaml — Harper component root
fastifyRoutes:
  files: routes/*.js

// routes/cache.js
import { tables } from 'harper';

export default async (server) => {

  // POST /lookup
  server.route({
    url: '/lookup',
    method: 'POST',
    handler: async (request) => {
      const { embedding, threshold = 0.92 } = request.body;

      const results = tables.SemanticCache.search({
        sort: { attribute: 'embedding', target: embedding },
        limit: 1,
        select: ['id', 'question', 'answer', 'model', '$distance'],
      });

      for await (const cached of results) {
        const similarity = 1 - cached.$distance;
        if (similarity >= threshold) {
          return { result: cached, similarity };
        }
      }

      return { result: null };
    },
  });

  // POST /store
  server.route({
    url: '/store',
    method: 'POST',
    handler: async (request) => {
      const { id, question, answer, embedding, model } = request.body;

      await tables.SemanticCache.put({
        id,
        question,
        answer,
        embedding,
        model,
        generatedAt: Date.now(),
      });

      return { success: true };
    },
  });

};

search() returns results sorted by cosine distance via the HNSW index. $distance is the raw distance value; 1 - $distance gives you cosine similarity. With limit: 1, Harper returns only the closest match — if it clears the threshold, it's a hit.
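To make the `1 - $distance` conversion concrete, here is cosine similarity computed by hand — pure Python, no Harper involved; it's only a sanity check on the math, since Harper's index does this for you:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product over the product of magnitudes; cosine distance is 1 minus this."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical vectors: similarity 1.0, distance 0.0.
# Orthogonal vectors: similarity 0.0, distance 1.0.
```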

Compare this to a hand-rolled similarity implementation: no 500-entry fetch, no in-process loop, no numpy dependency. Harper's index does the work.

Threshold Tuning

The similarity threshold is the single most important tuning parameter. Too low and you return wrong cached answers. Too high and you miss cache opportunities.

A good starting strategy:

  • Start at 0.95 and log misses for a week
  • If you're seeing near-duplicate misses, drop to 0.92
  • For high-stakes domains (medical, legal, financial), stay at 0.97+
  • Semantically equivalent questions typically score above 0.95; genuinely different ones fall below 0.85 — the gap between those is your working range
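The tuning loop above can be run offline as a threshold sweep. Given query pairs with precomputed similarities and a human label (equivalent or not), measure the hit rate on equivalents against the false-hit rate on genuinely different pairs — the data shape here is an illustrative assumption:

```python
# Offline threshold sweep over labeled pairs: (similarity, is_equivalent).
def sweep(
    pairs: list[tuple[float, bool]], thresholds: list[float]
) -> dict[float, tuple[float, float]]:
    """Return {threshold: (hit rate on equivalent pairs, false-hit rate on different pairs)}."""
    equiv = [s for s, same in pairs if same]
    diff = [s for s, same in pairs if not same]
    out = {}
    for t in thresholds:
        hits = sum(s >= t for s in equiv) / len(equiv)
        false_hits = sum(s >= t for s in diff) / len(diff)
        out[t] = (hits, false_hits)
    return out
```

Pick the highest threshold whose false-hit rate is zero (or acceptably low for your domain); lowering it past that point trades correctness for hit rate.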

Cache Invalidation

Semantic caches have a different invalidation model than key-value caches. You can't just bust a key — you need to invalidate semantically related entries.

  • TTL-based: Already handled — @table(expiration: 604800) in the schema enforces a 7-day TTL automatically. Adjust per domain.
  • Tag-based: Attach topic tags to cached entries (e.g., ["shipping", "returns"]). When your product policy changes, delete all entries where tags contain returns.
  • Embedding-drift detection: When you upgrade your embedding model, re-embed a sample of cached queries and compare to stored vectors. Significant drift means your cached embeddings are misaligned — wipe and rebuild.
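The drift check in the last bullet reduces to a small comparison. A sketch, assuming you've already fetched a sample of stored vectors and re-embedded the same queries with the new model — the 0.98 tolerance is an illustrative default, not a Harper or Vertex AI constant:

```python
import math

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def drift_detected(
    stored: list[list[float]],
    reembedded: list[list[float]],
    min_mean_similarity: float = 0.98,  # assumed tolerance; tune per model upgrade
) -> bool:
    """Flag drift when mean similarity between stored and re-embedded vectors drops."""
    sims = [_cosine(s, r) for s, r in zip(stored, reembedded)]
    return sum(sims) / len(sims) < min_mean_similarity
```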

Token Efficiency: The Math

Let's make the benefit concrete. Say your agent handles 10,000 queries per day. Average query + context is 800 input tokens; average response is 400 output tokens.

Using Gemini 2.0 Flash pricing as a reference point:

Scenario              Daily API calls   Daily tokens   Relative cost
No caching            10,000            12M            1.0×
60% cache hit rate    4,000             4.8M           0.4×
80% cache hit rate    2,000             2.4M           0.2×

An 80% cache hit rate — realistic for a focused domain like customer support — cuts your LLM spend by 80%. That's not a marginal gain; it's a structural change to your unit economics.
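The table's arithmetic, spelled out — 1,200 tokens per call (800 input + 400 output) and 10,000 daily queries come from the example above:

```python
DAILY_QUERIES = 10_000
TOKENS_PER_CALL = 800 + 400  # input + output tokens, from the example above

def daily_cost_profile(hit_rate: float) -> tuple[int, int, float]:
    """(model calls, total daily tokens, cost relative to no caching)."""
    calls = round(DAILY_QUERIES * (1 - hit_rate))
    tokens = calls * TOKENS_PER_CALL
    return calls, tokens, tokens / (DAILY_QUERIES * TOKENS_PER_CALL)
```

Note the relative cost equals one minus the hit rate: every cached response is a model call you never pay for.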

Context Compression as a Second Lever

Beyond response caching, Harper enables a second token efficiency pattern: storing compressed conversation summaries.

Instead of passing full conversation history to the model on every turn (which grows unboundedly), store a rolling summary in Harper:

# context_store.py
import time

import httpx
from vertexai.generative_models import GenerativeModel

from semantic_cache import SemanticCache

async def get_compressed_context(session_id: str, cache: SemanticCache) -> str:
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            f"{cache.harper_url}/semantic-cache/context/{session_id}",
            headers=cache.headers,
        )
        if resp.status_code == 200:
            return resp.json().get("summary", "")
    return ""

async def update_context_summary(
    session_id: str,
    new_exchange: str,
    cache: SemanticCache,
) -> None:
    existing = await get_compressed_context(session_id, cache)

    compressor = GenerativeModel("gemini-2.0-flash-001")
    prompt = f"""
    Existing summary: {existing}
    New exchange: {new_exchange}
    Produce a concise updated summary (max 200 words) capturing all key facts.
    """
    updated = compressor.generate_content(prompt).text

    async with httpx.AsyncClient() as client:
        await client.post(
            f"{cache.harper_url}/semantic-cache/context/{session_id}",
            headers=cache.headers,
            json={"summary": updated, "updatedAt": int(time.time())},
        )

On each turn you inject the summary (~150 tokens) instead of the full history (potentially 2,000+ tokens). The compression call itself uses Gemini Flash at minimal cost. For long-running agent sessions, this alone can cut per-turn token usage by 60–70%.
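The cumulative effect over a session can be sketched with simple arithmetic — the ~200 tokens per exchange and ~150-token summary are illustrative figures in the same range as the paragraph above:

```python
def context_tokens(
    turns: int, tokens_per_exchange: int = 200, summary_tokens: int = 150
) -> tuple[int, int]:
    """(context tokens with full history, context tokens with rolling summary).

    Full history grows linearly per turn; the summary stays roughly constant.
    Per-turn figures are illustrative assumptions.
    """
    full = sum(t * tokens_per_exchange for t in range(1, turns + 1))
    compressed = turns * summary_tokens
    return full, compressed
```

Full history costs grow quadratically with session length while the summary stays linear, so the savings widen the longer the session runs.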

Performance Architecture

The Latency Profile

A typical agent round-trip without caching:

Embed query:        ~80ms   (Vertex AI Embeddings API)
LLM call:         ~1,200ms  (Gemini Pro, streaming)
Total:            ~1,280ms

With semantic caching via Harper:

Embed query:        ~80ms   (Vertex AI Embeddings API)
Harper HNSW search:  ~2ms   (in-memory vector index)
Cache hit return:    ~5ms   (network + serialization)
Total (hit):        ~87ms

That's a 15× latency improvement on cache hits. For multi-step agents where each step compounds latency, this becomes the difference between a 2-second workflow and a 30-second one.
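Blended per-query latency follows directly from the hit rate — a small sketch using the ~87ms hit and ~1,280ms miss figures from the profile above:

```python
def expected_latency_ms(
    hit_rate: float, hit_ms: float = 87.0, miss_ms: float = 1_280.0
) -> float:
    """Expected per-query latency at a given cache hit rate (figures from above)."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms
```

At an 80% hit rate the blended latency is roughly 326ms per query — about 4× better than no caching even before counting the cache hits a user actually perceives as instant.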

Reference Architecture

Putting it all together:

Harper owns the hot path and the data layer: vector search, response storage, conversation context. Vertex AI owns the model layer: embedding generation and LLM inference. The orchestrator routes between them.

When to Use This Pattern

This architecture pays off most when:

  • Your query space has natural clustering — customer support, internal knowledge bases, FAQ-style agents. High semantic overlap = high cache hit rates.
  • You're operating at scale — the infrastructure overhead only makes sense if you have enough volume to realize the token savings.
  • Latency is user-facing — if a human is waiting for a response, the 15× latency improvement is viscerally noticeable.

It's overkill for:

  • Exploratory research agents where every query is unique by design
  • Low-volume internal tooling where token costs aren't material
  • Use cases where response freshness is critical and TTLs would be near-zero anyway

The Bigger Picture

Most agent architectures are designed around the model. The model is the centerpiece; everything else is scaffolding. This works at small scale.

At production scale, the model becomes one component among several — and the data layer is where the real performance and efficiency work happens. Harper gives you a data layer built for speed, native vector search, and edge deployment. Vertex AI gives you a managed model layer that scales. The two together give you a production agent stack that doesn't just work — it works economically.

The teams building the most efficient agents in 2026 aren't necessarily the ones using the best models. They're the ones who figured out that the best model call is the one you don't have to make.

The Tax Nobody Talks About

You've built your agent. It reasons, it retrieves, it acts. And then your cloud bill arrives.

Most teams building on LLMs treat every agent invocation as a fresh, independent transaction — a blank slate that calls the model, waits for a response, and moves on. At small scale, this is fine. At production scale, it becomes a compounding tax: on latency, on tokens, and on cost. The agent pattern that works great in a demo starts to buckle when thousands of users are running similar workflows simultaneously.

The fix isn't a better model. It's a better architecture — and the combination of Harper and Vertex AI is one of the most elegant ways to build it.

What the Problem Actually Looks Like

Consider an agent that answers product questions for an e-commerce platform. Across thousands of daily sessions:

  • "What's your return policy?" gets asked ~400 times a day
  • "Do you ship to Canada?" another ~200 times
  • Hundreds of variations of "How do I track my order?"

With a naive architecture, each of those is a full round-trip: embed the query, retrieve context, construct a prompt, call Gemini, stream a response. You're paying for tokens and waiting on latency every single time — even though the semantically correct answer hasn't changed since yesterday.

This is the problem semantic caching solves. And it's where Harper and Vertex AI fit together almost perfectly.

What Each Piece Brings

Vertex AI

Vertex AI is Google Cloud's managed ML platform, and for agent architectures it offers two things that matter most:

Gemini models — A tiered family ranging from Gemini Flash (fast, cheap, ideal for structured retrieval and summarization tasks) to Gemini Pro (deeper reasoning, more context). For agents, the tier selection is itself a performance optimization: route simple queries to Flash, complex ones to Pro.

Text Embeddings APItext-embedding-004 produces 768-dimension embeddings optimized for semantic similarity. This is the input to the caching layer: embed the incoming query, hand the vector to Harper, and let Harper decide whether to serve a cached response or call the model fresh.

Harper

Harper is a distributed data platform that combines a database, cache, and application server into a single deployable unit. For agent architectures, three characteristics matter:

Native HNSW vector index — Harper indexes vector fields using HNSW (Hierarchical Navigable Small World), a best-in-class approximate nearest-neighbor algorithm. Declare a field as [Float] @indexed(type: "HNSW", distance: "cosine") in your schema and Harper handles the index automatically. Similarity search is a first-class query operation — no external vector database, no extra network hop.

Low-latency, in-memory access — Harper's storage engine keeps hot data in memory and flushes to disk asynchronously. Cache reads are sub-millisecond. For an agent that otherwise waits 1–3 seconds per LLM call, serving a cached response from Harper is an order-of-magnitude latency improvement.

Edge-deployable — Harper runs at the edge, close to your users. This matters for multi-region agent deployments where you want to serve cached responses from the nearest node rather than round-tripping to a central cloud region.

Components (server-side logic) — Harper's v5 component model lets you run Node.js logic directly alongside your data using Fastify routes. Your cache lookup and store logic live inside a Harper component — no extra network hops in your hot path.

The Core Architecture: Semantic Caching

Here's the pattern. When an agent receives a user query:

1. Embed the incoming query (Vertex AI Embeddings)
2. Search Harper's HNSW index for the nearest cached entry
3a. Cache hit (similarity > threshold): return cached response immediately
3b. Cache miss: call Vertex AI (Gemini), store response + embedding in Harper, return response

The division of labor is clean: Vertex AI generates embeddings and runs the LLM; Harper owns the vector index, the similarity search, and the response store.

Schema Definition

Harper's vector index is defined in schema.graphql:

type SemanticCache @table(expiration: 604800) @export {
  id: ID @primaryKey
  question: String
  answer: String
  embedding: [Float] @indexed(type: "HNSW", distance: "cosine")
  model: String
  generatedAt: Long
}

@indexed(type: "HNSW", distance: "cosine") creates the vector index automatically. @table(expiration: 604800) sets a 7-day TTL at the table level — no need to manage expires_at fields manually. @export makes the table accessible via Harper's REST API.

The Python Cache Client

This class wraps the embed → lookup → store cycle on the orchestrator side:

# semantic_cache.py
import hashlib
import httpx
import vertexai
from dataclasses import dataclass
from vertexai.language_models import TextEmbeddingModel

@dataclass
class CacheEntry:
    question: str
    answer: str
    model: str
    similarity: float

class SemanticCache:
    def __init__(
        self,
        harper_url: str,
        harper_token: str,
        gcp_project: str,
        location: str = "us-central1",
        threshold: float = 0.92,
    ):
        vertexai.init(project=gcp_project, location=location)
        self.embed_model = TextEmbeddingModel.from_pretrained("text-embedding-004")
        self.harper_url = harper_url
        self.headers = {
            "Authorization": f"Basic {harper_token}",
            "Content-Type": "application/json",
        }
        self.threshold = threshold

    def embed(self, text: str) -> list[float]:
        return self.embed_model.get_embeddings([text])[0].values

    async def lookup(self, embedding: list[float]) -> CacheEntry | None:
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{self.harper_url}/semantic-cache/lookup",
                headers=self.headers,
                json={"embedding": embedding, "threshold": self.threshold},
            )
            resp.raise_for_status()
            data = resp.json()

        if data.get("result"):
            r = data["result"]
            return CacheEntry(
                question=r["question"],
                answer=r["answer"],
                model=r["model"],
                similarity=data["similarity"],
            )
        return None

    async def store(
        self, question: str, embedding: list[float], answer: str, model: str
    ) -> None:
        async with httpx.AsyncClient() as client:
            await client.post(
                f"{self.harper_url}/semantic-cache/store",
                headers=self.headers,
                json={
                    "id": hashlib.sha256(question.encode()).hexdigest(),
                    "question": question,
                    "answer": answer,
                    "embedding": embedding,
                    "model": model,
                },
            )

Wiring It Into the Agent

# agent.py
from vertexai.generative_models import GenerativeModel
from semantic_cache import SemanticCache

cache = SemanticCache(
    harper_url="https://your-instance.harperdbcloud.com",
    harper_token="your-token",
    gcp_project="your-gcp-project",
    threshold=0.92,
)

gemini = GenerativeModel("gemini-2.0-flash-001")

async def run_agent(user_input: str) -> dict:
    embedding = cache.embed(user_input)

    hit = await cache.lookup(embedding)
    if hit:
        return {
            "response": hit.answer,
            "source": "cache",
            "similarity": round(hit.similarity, 4),
            "model": hit.model,
        }

    response = gemini.generate_content(user_input)
    answer = response.text

    await cache.store(user_input, embedding, answer, "gemini-2.0-flash-001")

    return {
        "response": answer,
        "source": "model",
        "similarity": None,
        "model": "gemini-2.0-flash-001",
    }

The source field is important — log it. Cache hit rates are invisible otherwise, and hit rate is the metric that tells you whether your threshold is tuned right.

The Harper Component: Vector Search Server-Side

The lookup runs inside Harper as a component, co-located with the HNSW index. Harper's search() with a vector sort target performs the nearest-neighbor lookup natively — no manual cosine similarity math, no in-process iteration over stored vectors.

Configure the component to serve Fastify routes:

# config.yaml — Harper component root
fastifyRoutes:
  files: routes/*.js

// routes/cache.js
import { tables } from 'harper';

export default async (server) => {

  // POST /lookup
  server.route({
    url: '/lookup',
    method: 'POST',
    handler: async (request) => {
      const { embedding, threshold = 0.92 } = request.body;

      const results = tables.SemanticCache.search({
        sort: { attribute: 'embedding', target: embedding },
        limit: 1,
        select: ['id', 'question', 'answer', 'model', '$distance'],
      });

      for await (const cached of results) {
        const similarity = 1 - cached.$distance;
        if (similarity >= threshold) {
          return { result: cached, similarity };
        }
      }

      return { result: null };
    },
  });

  // POST /store
  server.route({
    url: '/store',
    method: 'POST',
    handler: async (request) => {
      const { id, question, answer, embedding, model } = request.body;

      await tables.SemanticCache.put({
        id,
        question,
        answer,
        embedding,
        model,
        generatedAt: Date.now(),
      });

      return { success: true };
    },
  });

};

search() returns results sorted by cosine distance via the HNSW index. $distance is the raw distance value; 1 - $distance gives you cosine similarity. With limit: 1, Harper returns only the closest match — if it clears the threshold, it's a hit.

Compare this to a hand-rolled similarity implementation: no 500-entry fetch, no in-process loop, no numpy dependency. Harper's index does the work.

Threshold Tuning

The similarity threshold is the single most important tuning parameter. Too low and you return wrong cached answers. Too high and you miss cache opportunities.

A good starting strategy:

  • Start at 0.95 and log misses for a week
  • If you're seeing near-duplicate misses, drop to 0.92
  • For high-stakes domains (medical, legal, financial), stay at 0.97+
  • Semantically equivalent questions typically score above 0.95; genuinely different ones fall below 0.85 — the gap between those is your working range

Cache Invalidation

Semantic caches have a different invalidation model than key-value caches. You can't just bust a key — you need to invalidate semantically related entries.

  • TTL-based: Already handled — @table(expiration: 604800) in the schema enforces a 7-day TTL automatically. Adjust per domain.
  • Tag-based: Attach topic tags to cached entries (e.g., ["shipping", "returns"]). When your product policy changes, delete all entries where tags contain returns.
  • Embedding-drift detection: When you upgrade your embedding model, re-embed a sample of cached queries and compare to stored vectors. Significant drift means your cached embeddings are misaligned — wipe and rebuild.

Token Efficiency: The Math

Let's make the benefit concrete. Say your agent handles 10,000 queries per day. Average query + context is 800 input tokens; average response is 400 output tokens.

Using Gemini 2.0 Flash pricing as a reference point:

Scenario Daily API calls Daily tokens Relative cost
No caching 10,000 12M
60% cache hit rate 4,000 4.8M 0.4×
80% cache hit rate 2,000 2.4M 0.2×

An 80% cache hit rate — realistic for a focused domain like customer support — cuts your LLM spend by 80%. That's not a marginal gain; it's a structural change to your unit economics.

Context Compression as a Second Lever

Beyond response caching, Harper enables a second token efficiency pattern: storing compressed conversation summaries.

Instead of passing full conversation history to the model on every turn (which grows unboundedly), store a rolling summary in Harper:

# context_store.py
async def get_compressed_context(session_id: str, cache: SemanticCache) -> str:
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            f"{cache.harper_url}/semantic-cache/context/{session_id}",
            headers=cache.headers,
        )
        if resp.status_code == 200:
            return resp.json().get("summary", "")
    return ""

async def update_context_summary(
    session_id: str,
    new_exchange: str,
    cache: SemanticCache,
) -> None:
    existing = await get_compressed_context(session_id, cache)

    compressor = GenerativeModel("gemini-2.0-flash-001")
    prompt = f"""
    Existing summary: {existing}
    New exchange: {new_exchange}
    Produce a concise updated summary (max 200 words) capturing all key facts.
    """
    updated = compressor.generate_content(prompt).text

    async with httpx.AsyncClient() as client:
        await client.post(
            f"{cache.harper_url}/semantic-cache/context/{session_id}",
            headers=cache.headers,
            json={"summary": updated, "updatedAt": int(time.time())},
        )

On each turn you inject the summary (~150 tokens) instead of the full history (potentially 2,000+ tokens). The compression call itself uses Gemini Flash at minimal cost. For long-running agent sessions, this alone can cut per-turn token usage by 60–70%.

Performance Architecture

The Latency Profile

A typical agent round-trip without caching:

Embed query:        ~80ms   (Vertex AI Embeddings API)
LLM call:         ~1,200ms  (Gemini Pro, streaming)
Total:            ~1,280ms

With semantic caching via Harper:

Embed query:        ~80ms   (Vertex AI Embeddings API)
Harper HNSW search:  ~2ms   (in-memory vector index)
Cache hit return:    ~5ms   (network + serialization)
Total (hit):        ~87ms

That's a 15× latency improvement on cache hits. For multi-step agents where each step compounds latency, this becomes the difference between a 2-second workflow and a 30-second one.

Reference Architecture

Putting it all together:

Harper owns the hot path and the data layer: vector search, response storage, conversation context. Vertex AI owns the model layer: embedding generation and LLM inference. The orchestrator routes between them.

When to Use This Pattern

This architecture pays off most when:

  • Your query space has natural clustering — customer support, internal knowledge bases, FAQ-style agents. High semantic overlap = high cache hit rates.
  • You're operating at scale — the infrastructure overhead only makes sense if you have enough volume to realize the token savings.
  • Latency is user-facing — if a human is waiting for a response, the 15× latency improvement is viscerally noticeable.

It's overkill for:

  • Exploratory research agents where every query is unique by design
  • Low-volume internal tooling where token costs aren't material
  • Use cases where response freshness is critical and TTLs would be near-zero anyway

The Bigger Picture

Most agent architectures are designed around the model. The model is the centerpiece; everything else is scaffolding. This works at small scale.

At production scale, the model becomes one component among several — and the data layer is where the real performance and efficiency work happens. Harper gives you a data layer that was built for speed, native vector search, and edge deployment. Vertex AI gives you a model layer that scales managed. The two together give you a production agent stack that doesn't just work — it works economically.

The teams building the most efficient agents in 2026 aren't necessarily the ones using the best models. They're the ones who figured out that the best model call is the one you don't have to make.


Explore Recent Resources

Tutorial

Production Quality at Vibe Code Velocity: Dispatched Agent Teams with Harper

Harper enables production-grade agentic engineering by collapsing database, cache, runtime, and messaging into one process, reducing agent complexity and review burden. A multi-model dispatch workflow lets specialized agents plan, code, QA, and review in parallel while humans retain control over critical decisions.
Jeff Darnton
SVP, Professional Services & Customer Success
May 2026
Tutorial

Change Data Capture Into a Runtime: One Pipeline for Pages, Search, and AI Agents

Learn how Harper turns CDC streams into real-time workflows that refresh cached pages, update search indexes, and keep AI agent context current. See why landing changes in an application runtime beats warehouses, queues, and traditional CDNs.
Jeff Darnton
SVP, Professional Services & Customer Success
May 2026
Tutorial

Harper + Vertex AI: The Architecture Every Agent Builder Should Know

Production agents bleed tokens and latency on repeated queries. Pair a managed model layer with a vector-indexed data layer at the edge, and an 80% cache hit rate cuts LLM spend by 80% while delivering sub-100ms responses on semantically similar requests.
Drew Chambers
CMO
May 2026
Blog

Why Harper is the Definitive Platform for Enterprise Citizen Developers

Harper bridges the gap between business agility and IT security. Utilizing a unified runtime, Harper Fabric guarantees data sovereignty across any environment, from public clouds to air-gapped facilities. Empower users with secure, compliant AI application development and robust governance.
Stephen Goldberg
CEO & Co-Founder
May 2026
Comparison

Harper vs. Vercel + Supabase

Harper offers a unified application platform alternative to Vercel + Supabase, combining database, cache, app logic, messaging, vectors, and real-time capabilities in one globally distributed runtime to reduce latency, operational complexity, and total cost of ownership.
Harper
May 2026
Report

Agents on Harper: A Practical Guide

A practical guide to building AI agents on Harper. Learn how Harper’s unified runtime supports agentic workloads with local data, durable state, native protocols, deterministic logic, and full-stack hosting, plus how it compares to LangChain, vector databases, memory layers, and hosted agent platforms.
Harper
May 2026