5x Faster HNSW Vector Search with int8 Quantization

Harper 5.1 adds int8 quantization for HNSW indexes: index size drops by roughly 3x, search throughput improves roughly 5x, and p99 latency falls from ~9s to ~0.5s under load. The tradeoff is approximately 1% recall degradation before reranking.

Kris Zyp
SVP of Engineering
at Harper
June 30, 2026

Harper's HNSW index has stored vectors as float32 arrays since the feature shipped. For a typical embedding dimension of 768, each vector node in the index graph is roughly 3KB of raw floats, plus graph structure overhead. At scale, with tens of thousands of vectors and high query concurrency, the cost of decoding those nodes accumulates: each node visit decodes 768 individually tagged floats into a boxed JavaScript array, and a single query can visit hundreds of nodes.

Harper 5.1 adds optional int8 quantization for HNSW indexes (enabled by default), which reduces index size roughly 3x and improves search throughput roughly 5x, at the cost of approximately 1% recall degradation for nearest-neighbor queries.

Enabling it

type Article @table {
  bodyEmbedding: [Float] @indexed(type: "HNSW", quantization: "int8")
}

Because quantization: "int8" is the new default, the declaration above only makes the setting explicit. Existing float32 nodes in the index are still readable: the decode path auto-detects which format each node uses, so you don't need to rebuild the index when enabling quantization.

What the encoding actually looks like

For each vector stored in the graph, Harper scales the float components to the signed int8 range [-127, 127] and stores a single per-vector scale factor alongside. What was 768 × 4 bytes = 3,072 bytes becomes 768 × 1 byte + 4 bytes = 772 bytes, about a 4x reduction in raw vector storage per node. Graph structure overhead sits on top of that, which is consistent with the overall index shrinking by roughly 3x rather than 4x.
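The scaling step described above can be sketched in a few lines. This is a minimal illustration, not Harper's implementation: the function names quantizeInt8 and dequantizeInt8, the choice of max-absolute-value scaling, and the rounding rule are all assumptions made for the example.

```javascript
// Illustrative sketch only: names, scaling rule, and rounding are assumptions,
// not Harper's actual encoder.

// Quantize a float vector to int8 codes plus one per-vector scale factor.
function quantizeInt8(vector) {
  // The largest absolute component is mapped to +/-127.
  let maxAbs = 0;
  for (const v of vector) {
    const a = Math.abs(v);
    if (a > maxAbs) maxAbs = a;
  }
  // Store the scale as a float32 value (4 bytes); avoid a zero scale.
  const scale = maxAbs === 0 ? 1 : Math.fround(maxAbs / 127);
  const codes = new Int8Array(vector.length);
  for (let i = 0; i < vector.length; i++) {
    codes[i] = Math.round(vector[i] / scale);
  }
  return { codes, scale };
}

// Recover approximate float values from the int8 codes.
function dequantizeInt8({ codes, scale }) {
  const out = new Float32Array(codes.length);
  for (let i = 0; i < codes.length; i++) out[i] = codes[i] * scale;
  return out;
}

// Example: a synthetic 768-dimensional vector.
const original = Float32Array.from({ length: 768 }, (_, i) => Math.sin(i));
const quantized = quantizeInt8(original);
const restored = dequantizeInt8(quantized);

const float32Bytes = original.byteLength;         // 768 * 4
const int8Bytes = quantized.codes.byteLength + 4; // 768 * 1 + 4-byte scale
console.log(float32Bytes, int8Bytes);
```

Under these assumptions each restored component differs from the original by at most about half the scale factor, which is the source of the small recall loss during traversal.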

The performance gain is larger than the storage ratio because of how JavaScript handles the decode. The float32 path decoded 768 individually msgpack-tagged values into a boxed array, creating GC pressure proportional to the number of node visits per query. The int8 path reads the entire vector as a single typed-array view — essentially a memcpy. On a 10,000-vector benchmark, this reduced p99 search latency from ~9s to ~0.5s under concurrent load, and improved update throughput roughly 3.6x.
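To make the "single typed-array view" idea concrete, here is a small sketch of decoding a node without allocating per-element values. The byte layout (a 4-byte little-endian float32 scale followed by one int8 code per dimension) and the helper names encodeNode and decodeNode are hypothetical; Harper's real storage format may differ.

```javascript
// Hypothetical node layout for illustration only:
// [ 4-byte float32 scale | one int8 code per dimension ]
const DIMENSIONS = 768;

function encodeNode(codes, scale) {
  const bytes = new Uint8Array(4 + codes.length);
  new DataView(bytes.buffer).setFloat32(0, scale, true); // little-endian scale
  bytes.set(new Uint8Array(codes.buffer, codes.byteOffset, codes.length), 4);
  return bytes;
}

function decodeNode(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const scale = view.getFloat32(0, true);
  // A view over the existing bytes: no copy and no per-element allocation.
  const codes = new Int8Array(bytes.buffer, bytes.byteOffset + 4, bytes.length - 4);
  return { codes, scale };
}

// Example: codes cycling through the full [-127, 127] range.
const nodeCodes = Int8Array.from({ length: DIMENSIONS }, (_, i) => (i % 255) - 127);
const nodeBytes = encodeNode(nodeCodes, 0.5);
const decoded = decodeNode(nodeBytes);
console.log(nodeBytes.byteLength, decoded.codes.length);
```

The decoded codes share the underlying buffer with the encoded bytes, so the decode step creates two small wrapper objects per node instead of hundreds of individually decoded values.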

Distance computation and reranking

During graph traversal, Harper computes distances between the full-precision query vector and the dequantized int8 stored vectors. The query vector is never quantized — only the stored graph nodes are. This asymmetric approach preserves query precision while keeping storage compact.
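As a rough sketch of the asymmetric comparison, the function below keeps the query in full precision and dequantizes each stored component on the fly. The Euclidean metric, the { codes, scale } node shape, and the name asymmetricDistance are assumptions for illustration; the source does not say which distance function or node layout Harper uses.

```javascript
// Illustrative only: metric and node shape are assumptions.
function asymmetricDistance(query, node) {
  let sum = 0;
  for (let i = 0; i < query.length; i++) {
    // Dequantize one stored component; the query value is used as-is.
    const diff = query[i] - node.codes[i] * node.scale;
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Full-precision query vector.
const query = new Float32Array([0.5, -0.25, 1.0]);
// Stored vector [1.0, 0.0, -1.0], quantized with scale 1/127.
const storedNode = { codes: new Int8Array([127, 0, -127]), scale: 1 / 127 };

const approxDistance = asymmetricDistance(query, storedNode);
console.log(approxDistance);
```

Quantizing the query as well would add a second source of rounding error to every comparison, which is the cost the asymmetric approach avoids.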

For nearest-neighbor (sort) queries, Harper adds a reranking step after traversal: the candidate set is re-scored against the record's full-precision vector, and the final sort and $distance values are exact. The ~1% recall loss refers to which candidates make it into the candidate set before reranking, not to the accuracy of the distances returned.
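The reranking step can be pictured as follows. This is a simplified sketch under stated assumptions: rerank and exactDistance are hypothetical helpers, the metric is Euclidean, and fullVectors stands in for a lookup of each record's full-precision vector.

```javascript
// Illustrative only: helper names and the Euclidean metric are assumptions.
function exactDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Re-score traversal candidates against full-precision vectors, then sort.
function rerank(queryVector, candidateIds, fullVectors, limit) {
  return candidateIds
    .map((id) => ({ id, $distance: exactDistance(queryVector, fullVectors.get(id)) }))
    .sort((x, y) => x.$distance - y.$distance)
    .slice(0, limit);
}

// Stand-in for the records' full-precision vectors.
const fullVectors = new Map([
  ['a', new Float32Array([1, 0])],
  ['b', new Float32Array([0.5, 0])],
  ['c', new Float32Array([3, 4])],
]);

// Suppose the approximate traversal returned the candidates in this order.
const reranked = rerank(new Float32Array([0, 0]), ['a', 'b', 'c'], fullVectors, 2);
console.log(reranked.map((r) => r.id)); // [ 'b', 'a' ]
```

Reranking can only reorder and re-score the candidates it is given. A true neighbor that never entered the candidate set cannot be recovered, which is why the recall loss is attributed to traversal rather than to the returned distances.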

For threshold queries ($distance < x), reranking is not currently applied — the approximate distance from traversal is used for the threshold comparison. If you're using distance thresholds and need precise results, this is a limitation worth knowing about. Exact threshold filtering with an int8 index is planned but not in 5.1.

Query-time ef tuning

Separately, 5.1 adds a per-query ef override. The ef parameter controls the candidate set size during graph traversal — larger values increase recall at the cost of more node visits. Previously this was set at index definition time and applied to all queries. Now you can override it per-query:

const results = await Article.search({
  embedding: queryVector,
  limit: 10,
  ef: 200
});

The auto-scaled default ef has also been raised from 50 to 100 in this release, which improves out-of-the-box recall for most workloads without requiring explicit configuration.

When to use it

int8 quantization is worth enabling if you have large vector collections (tens of thousands or more), high query concurrency, or memory pressure. The ~1% recall loss is negligible for most recommendation and semantic search use cases. If you're running precision-sensitive re-ranking pipelines where recall@10 purity matters, measure it against your specific dataset before committing.
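One way to measure this on your own data is to compare the IDs returned by the quantized index against a ground-truth list, for example from a brute-force scan or a float32 index. The helper below is a generic sketch of recall@k; recallAtK is a hypothetical name, and how you obtain the two ID lists is left to your setup.

```javascript
// recall@k: the fraction of the exact top-k IDs that also appear in the
// approximate top-k. Generic sketch, not a Harper API.
function recallAtK(exactIds, approxIds, k) {
  const truth = new Set(exactIds.slice(0, k));
  if (truth.size === 0) return 1; // nothing to find counts as full recall
  let hits = 0;
  for (const id of approxIds.slice(0, k)) {
    if (truth.has(id)) hits++;
  }
  return hits / truth.size;
}

// Example: the approximate result misses one of the ten true neighbors.
const exactTop10 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
const approxTop10 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11];
console.log(recallAtK(exactTop10, approxTop10, 10)); // 0.9
```

Averaging this value over a representative sample of real queries gives a more reliable picture than any single query.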

For small vector collections or low-concurrency workloads, the float32 path is fine — the throughput difference only becomes meaningful once you're seeing GC pressure from high node-visit rates.

