5x Faster HNSW Vector Search with int8 Quantization

Harper's HNSW index has stored vectors as float32 arrays since the feature shipped. For a typical embedding dimension of 768, each vector node in the index graph holds roughly 3KB of raw floats, plus graph structure overhead. Reading a node means decoding 768 individually tagged floats into a boxed JavaScript array, and a single query can visit hundreds of nodes. At scale (tens of thousands of vectors and high query concurrency), that decode cost accumulates.

Harper 5.1 adds optional int8 quantization for HNSW indexes (enabled by default), which reduces index size roughly 3x and improves search throughput roughly 5x, at the cost of approximately 1% recall degradation for nearest-neighbor queries.

Enabling it

type Article @table { bodyEmbedding: [Float] @indexed(type: "HNSW", quantization: "int8") }

quantization: "int8" is the new default, so the declaration above simply makes it explicit. Existing float32 nodes in the index remain readable: the decode path auto-detects which format each node uses, so you don't need to rebuild the index when enabling quantization.

What the encoding actually looks like

For each vector stored in the graph, Harper scales the float components to the signed int8 range [-127, 127] and stores a single per-vector scale factor alongside them. What was 768 × 4 bytes = 3,072 bytes becomes 768 × 1 byte + 4 bytes = 772 bytes, about a 4x reduction in raw storage per node.
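To make the arithmetic concrete, here is a minimal sketch of per-vector int8 scaling. The function names and the choice of max-absolute-value scaling are assumptions for illustration; the post does not show Harper's actual encoder.

```javascript
// Hypothetical helpers: a sketch of symmetric per-vector int8 quantization,
// not Harper's internal implementation.
function quantizeInt8(vector) {
  // The largest absolute component determines the per-vector scale.
  let maxAbs = 0;
  for (const x of vector) maxAbs = Math.max(maxAbs, Math.abs(x));
  // Guard against the all-zero vector to avoid dividing by zero.
  const scale = maxAbs === 0 ? 1 : maxAbs / 127;
  const codes = new Int8Array(vector.length);
  for (let i = 0; i < vector.length; i++) {
    const rounded = Math.round(vector[i] / scale);
    // Clamp to [-127, 127] in case of floating-point overshoot.
    codes[i] = Math.max(-127, Math.min(127, rounded));
  }
  return { codes, scale };
}

function dequantizeInt8({ codes, scale }) {
  const out = new Float32Array(codes.length);
  for (let i = 0; i < codes.length; i++) out[i] = codes[i] * scale;
  return out;
}

// Usage: a 768-dimensional vector shrinks from 3,072 bytes to 772 bytes.
const example = Float32Array.from({ length: 768 }, (_, i) => Math.sin(i));
const encoded = quantizeInt8(example);
console.log(example.byteLength);           // 3072
console.log(encoded.codes.byteLength + 4); // 772 (int8 codes + float32 scale)
```

With this scheme, each reconstructed component is off by at most about half the scale factor, which is where the small recall loss comes from.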

The performance gain is larger than the storage ratio because of how JavaScript handles the decode. The float32 path decoded 768 individually msgpack-tagged values into a boxed array, creating GC pressure proportional to the number of node visits per query. The int8 path reads the entire vector as a single typed-array view — essentially a memcpy. On a 10,000-vector benchmark, this reduced p99 search latency from ~9s to ~0.5s under concurrent load, and improved update throughput roughly 3.6x.
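The difference between the two decode paths can be sketched as follows. The byte layout here (a 4-byte little-endian float32 scale followed by the int8 components) and the function names are assumptions made for illustration, not Harper's documented on-disk format.

```javascript
// Assumed layout for this sketch: [float32 scale (LE)][int8 x N].
function encodeNode(codes, scale) {
  const bytes = new Uint8Array(4 + codes.length);
  new DataView(bytes.buffer).setFloat32(0, scale, true);
  // Copy the raw int8 bytes in one call.
  bytes.set(new Uint8Array(codes.buffer, codes.byteOffset, codes.length), 4);
  return bytes;
}

function viewNode(bytes) {
  const header = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const scale = header.getFloat32(0, true);
  // The Int8Array is a view over the same memory as `bytes`:
  // no per-element decoding and no boxed numbers are created.
  const codes = new Int8Array(bytes.buffer, bytes.byteOffset + 4, bytes.byteLength - 4);
  return { codes, scale };
}
```

Because the view shares the underlying buffer, the cost of "decoding" a node no longer grows with the number of heap objects created per visit, which is the GC pressure described above.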

Distance computation and reranking

During graph traversal, Harper computes distances between the full-precision query vector and the dequantized int8 stored vectors. The query vector is never quantized — only the stored graph nodes are. This asymmetric approach preserves query precision while keeping storage compact.
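A minimal sketch of the asymmetric comparison, assuming Euclidean distance and a hypothetical function name; the distance metric your index actually uses may differ.

```javascript
// Sketch: full-precision query vs. int8 stored vector with a per-vector scale.
// Euclidean distance is an assumption for illustration.
function asymmetricEuclidean(query, codes, scale) {
  if (query.length !== codes.length) {
    throw new RangeError("query and stored vector must have the same dimension");
  }
  let sum = 0;
  for (let i = 0; i < query.length; i++) {
    // Dequantize each stored component on the fly; the query is used as-is.
    const diff = query[i] - codes[i] * scale;
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}
```

Only the stored side carries quantization error, so the error in the computed distance is bounded by the rounding applied to one vector rather than two.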

For nearest-neighbor (sort) queries, Harper adds a reranking step after traversal: the candidate set is re-scored against the record's full-precision vector, and the final sort and $distance values are exact. The ~1% recall loss refers to which candidates make it into the candidate set before reranking, not to the accuracy of the distances returned.
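The reranking step can be sketched like this. The names rerank and getFullVector are hypothetical, and Euclidean distance is again an assumption; the point is only that the final ordering and distances come from the full-precision vectors, not the quantized ones.

```javascript
function euclidean(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// candidateIds: ids selected by the approximate (int8) traversal.
// getFullVector: hypothetical lookup returning the record's float32 vector.
function rerank(query, candidateIds, getFullVector, limit) {
  return candidateIds
    .map((id) => ({ id, distance: euclidean(query, getFullVector(id)) }))
    .sort((a, b) => a.distance - b.distance) // exact ordering
    .slice(0, limit);
}
```

Reranking can only reorder what traversal found. A true neighbor that the approximate pass missed stays missing, which is why the recall figure applies to the candidate set.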

For threshold queries ($distance < x), reranking is not currently applied — the approximate distance from traversal is used for the threshold comparison. If you're using distance thresholds and need precise results, this is a limitation worth knowing about. Exact threshold filtering with an int8 index is planned but not in 5.1.

Query-time ef tuning

Separately, 5.1 adds a per-query ef override. The ef parameter controls the candidate set size during graph traversal — larger values increase recall at the cost of more node visits. Previously this was set at index definition time and applied to all queries. Now you can override it per-query:

const results = await Article.search({ embedding: queryVector, limit: 10, ef: 200 });

The auto-scaled default has also been raised from 50 to 100 in this release, which improves out-of-the-box recall for most workloads without requiring explicit configuration.

When to use it

int8 quantization is worth enabling if you have large vector collections (tens of thousands or more), high query concurrency, or memory pressure. The ~1% recall loss is negligible for most recommendation and semantic search use cases. If you're running precision-sensitive re-ranking pipelines where exact recall@10 matters, measure it against your specific dataset before committing.

For small vector collections or low-concurrency workloads, the float32 path is fine — the throughput difference only becomes meaningful once you're seeing GC pressure from high node-visit rates.
