The Complete Guide to Peer-to-Peer Data Replication in Harper

Discover how HarperDB's peer-to-peer data replication powers globally distributed, low-latency systems with no single point of failure. Learn to configure clusters, secure them with mTLS, and scale seamlessly using real-world examples and practical guides.
By
David Cockerill
June 25, 2025
Platform Software Engineer

Replication is the engine that powers Harper's global data fabric. It transforms distributed nodes into a seamless network of synchronized peers—no leader nodes, no single points of failure, just fast, reliable, and secure data movement.

This guide breaks down the architecture, shows you how to configure it, and explains what makes it special, with practical examples and code along the way.

The Basics: What is Harper Replication?

Replication is Harper's mechanism for creating and maintaining multiple copies of data across distributed nodes. This ensures low-latency reads, high availability, and fault tolerance through eventual consistency.

  • Cluster: A group of Harper instances connected through replication.
  • Node: An individual Harper instance (VM, container, IoT device, etc.).
  • Peer-to-Peer: Every node can both publish and subscribe. No leader, no coordinator.

Real-World Example

A retailer deploys Harper on distributed servers near every user population center. Product catalog and inventory updates made in Denver are replicated in real time to Chicago and Boston, so all locations share consistent data without routing everything through a central server.

How Harper Keeps Clusters in Sync

Harper's support for high-throughput, low-latency, geographically distributed deployments is made possible by three key ingredients:

Audit Log Streaming

Rather than duplicating full data sets, Harper logs each change to an audit log, a compact, transaction-level journal optimized for minimum resource use.

If Node A inserts { id: 1, name: "Tiger" } into the animals table, that change becomes an audit entry that is replicated to other nodes.

The hdb_nodes Table

Cluster configuration is driven by a system table called hdb_nodes. Every node has a record with its hostname, connection URL, and metadata. This table is replicated, so when a new node joins, it inherits the full topology of the cluster.

Mutual TLS for Security

All connections are secured using mutual TLS (mTLS). Each node must present a certificate matching its hostname. Certificates are validated against trusted certificate authorities and cross-checked with hdb_nodes.

Together, these pieces give every write the same path through the cluster:

  1. A client writes data to Node A.
  2. Node A records that transaction in its audit log.
  3. That log entry is replicated securely to peers over WebSockets.
  4. Peers apply the transaction to their local table.
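
The four steps above can be sketched as a toy in-memory model. This is not Harper's implementation: the Node class, its fields, and the direct method call standing in for the secured WebSocket link are all hypothetical, chosen only to show that an audit entry, rather than a full table copy, is what moves between peers.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a Harper node (hypothetical, for illustration only)."""
    name: str
    tables: dict = field(default_factory=dict)     # table name -> {primary key: record}
    audit_log: list = field(default_factory=list)  # ordered list of change entries
    peers: list = field(default_factory=list)

    def write(self, table, record):
        # Steps 1-2: accept the client write and journal it as an audit entry.
        entry = {"table": table, "operation": "upsert", "record": record}
        self.audit_log.append(entry)
        self.apply(entry)
        # Step 3: stream only the entry to each peer. A real cluster would
        # send it over an mTLS-secured WebSocket, not a method call.
        for peer in self.peers:
            peer.receive(entry)

    def receive(self, entry):
        # Step 4: the peer journals and applies the replicated transaction.
        self.audit_log.append(entry)
        self.apply(entry)

    def apply(self, entry):
        rows = self.tables.setdefault(entry["table"], {})
        rows[entry["record"]["id"]] = entry["record"]


node_a, node_b = Node("node-a"), Node("node-b")
node_a.peers.append(node_b)
node_a.write("animals", {"id": 1, "name": "Tiger"})
print(node_b.tables["animals"][1])
```

After the write, Node B holds the same record as Node A even though only a single journal entry crossed the link.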

How to Configure Replication

Option 1: Declarative YAML Config

The most robust and production-ready approach:
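
As a rough sketch of what such a declarative config might contain, the Python snippet below renders a replication section as YAML text. The key names replication, hostname, and routes, along with the server names, are assumptions rather than confirmed settings, so check them against the Harper configuration reference for your version before relying on them.

```python
def render_replication_config(hostname, routes):
    """Render a YAML replication section as text.

    The keys `replication`, `hostname`, and `routes` are assumed names,
    not confirmed Harper settings.
    """
    lines = ["replication:", f"  hostname: {hostname}", "  routes:"]
    lines += [f"    - {route}" for route in routes]
    return "\n".join(lines) + "\n"


# Hypothetical three-node cluster, rendered from the point of view of server-one.
config_text = render_replication_config("server-one", ["server-two", "server-three"])
print(config_text)
```

Each node would carry its own copy of this section, naming itself and the peers it should connect to.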

When Harper initializes, it uses this process to establish connections and populate the hdb_nodes table. Note: for this to function correctly, valid certificates must be configured to enable the mTLS connection.

Option 2: Dynamic API Call

For quick tests or automation:
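
One way to picture the dynamic route is a small Python helper that builds, but does not send, an operations API request. The add_node operation name comes from this guide; the hostname field, the URL and port, and the Authorization header are assumptions made for illustration.

```python
import json
import urllib.request


def build_add_node_request(api_url, peer_hostname, auth_header):
    """Build (but do not send) a request that asks a node to add a peer.

    `add_node` is the operation named in this guide. The `hostname` field,
    the URL, and the Authorization header are assumptions.
    """
    body = {"operation": "add_node", "hostname": peer_hostname}
    return urllib.request.Request(
        api_url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": auth_header},
        method="POST",
    )


request = build_add_node_request(
    "https://server-one.example.com:9925",  # hypothetical operations API URL
    "server-two",
    "Basic <credentials>",                  # placeholder, not a real credential
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending it would be: urllib.request.urlopen(request)
```

Keeping the request construction separate from the network call makes the payload easy to inspect in a test or a dry run before touching a live cluster.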

Control and Flexibility

Harper's replication system offers powerful and granular control over how data is shared across nodes. Whether you're managing compliance constraints, optimizing for performance, or just fine-tuning your architecture, these capabilities let you shape replication to your exact needs.

Database Filtering

By default, Harper replicates all user-created databases. However, you can override this behavior in the harperdb-config.yaml file by explicitly listing only the databases you want replicated. This is useful when some data is sensitive, irrelevant across regions, or simply not needed outside of a specific node.
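
The rule itself is simple enough to model in a few lines of Python. This sketch only illustrates the default-versus-explicit-list behavior described above. The function and database names are hypothetical, and it does not show the actual YAML key used to list databases.

```python
def replicated_databases(all_user_databases, configured=None):
    """Return the databases that would replicate.

    With no explicit list, every user-created database replicates.
    With a list, only the named databases do.
    """
    if configured is None:
        return list(all_user_databases)
    return [db for db in all_user_databases if db in configured]


databases = ["catalog", "inventory", "local_cache"]  # hypothetical names
print(replicated_databases(databases))
print(replicated_databases(databases, configured=["catalog", "inventory"]))
```

In the second call, local_cache stays on the node because it was left off the explicit list.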

Table Filtering

You can also control replication at the table level. Within your GraphQL schema, set the replicate: false flag to prevent specific tables from being included in replication.

This is ideal for caching tables, temporary storage, or data that is large and local-only.

Directionality with Subscriptions

For more advanced scenarios, the add_node and set_node API calls support directional replication. This means you can:

  • Subscribe to a table without publishing.
  • Publish to a table without subscribing.
  • Do both (the default behavior).

This opens the door for sophisticated topologies, like selective upstream syncing or isolating write-heavy tables to specific nodes.
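
To make the three modes concrete, here is a hedged Python sketch that assembles a set_node request body with per-table direction flags. The set_node operation is named in this guide, but the subscriptions, database, table, publish, and subscribe field names are assumptions, as are the host and table names.

```python
import json


def make_subscription(database, table, publish=True, subscribe=True):
    """Describe one directional rule for a table (field names are assumed)."""
    return {"database": database, "table": table,
            "publish": publish, "subscribe": subscribe}


def build_set_node_body(peer_hostname, subscriptions):
    """Assemble a request body for the `set_node` operation named in this guide."""
    return {"operation": "set_node", "hostname": peer_hostname,
            "subscriptions": subscriptions}


# Hypothetical analytics node: receive invoices from the peer, send nothing back.
body = build_set_node_body("server-two", [
    make_subscription("data", "invoices_eu", publish=False, subscribe=True),
])
print(json.dumps(body, indent=2))
```

Leaving both flags at their defaults corresponds to the "do both" case, which mirrors the default behavior described above.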

Real-World Example

A multi-tenant SaaS platform deploys Harper nodes in major cloud regions: US-East, EU-West, and APAC. Each edge node holds customer data specific to that geography. Using table-level subscriptions, the system replicates only the relevant tenants’ records to each region. This minimizes cross-region bandwidth costs, improves latency for customers, and ensures compliance with local data residency regulations.

For instance, a Harper node in Frankfurt might only subscribe to customer_eu and invoices_eu, while ignoring customer_us and invoices_apac. Meanwhile, a central analytics node can be configured to subscribe to all tenant databases for cross-region reporting—without publishing any of its own data back.

This kind of targeted replication is what enables Harper to power modern, distributed apps that are both globally performant and locally compliant.

Managing Node Lifecycles

Harper's replication system is built to accommodate the dynamic nature of distributed systems. Whether you're scaling up, handling failure recovery, or optimizing storage, Harper handles the lifecycle of each node with resilience and flexibility.

New Nodes

When a new node is added to a cluster, Harper performs an initial sync, downloading the full contents of the relevant databases from its first connected peer. This ensures the new node becomes a fully consistent member of the cluster without requiring manual intervention.

This process is automatic and includes replicating databases, tables, and hdb_nodes entries, allowing the new node to discover and connect with other peers in the cluster.

startTime Option

If you don’t want a new node to sync the entire history, you can specify a startTime parameter. This tells Harper to only replicate changes from that point forward.

Use case: spinning up a temporary analytics node that only cares about current and future data, not historical records.
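
Conceptually, a start time acts as a filter over the audit log. The Python sketch below models that idea with a hypothetical entry shape (a timestamp plus a record). It shows the effect of the option, not how Harper stores or reads its log.

```python
from datetime import datetime, timezone


def entries_since(audit_log, start_time):
    """Return audit entries at or after start_time, in log order."""
    return [entry for entry in audit_log if entry["timestamp"] >= start_time]


# Hypothetical audit entries; the field names are illustrative only.
audit_log = [
    {"timestamp": datetime(2025, 6, 1, tzinfo=timezone.utc), "record": {"id": 1}},
    {"timestamp": datetime(2025, 6, 20, tzinfo=timezone.utc), "record": {"id": 2}},
    {"timestamp": datetime(2025, 6, 25, tzinfo=timezone.utc), "record": {"id": 3}},
]
start_time = datetime(2025, 6, 15, tzinfo=timezone.utc)
recent = entries_since(audit_log, start_time)
print([entry["record"]["id"] for entry in recent])  # prints [2, 3]
```

The new node never sees record 1, which is exactly the trade-off: less data to transfer in exchange for an incomplete history.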

Offline Resync

If a node goes offline—due to a network issue, reboot, or planned maintenance—it will automatically resync upon reconnecting. Harper compares audit log sequences and replays any missed transactions to bring the node up to date.

This happens without needing to reinitialize or manually trigger sync processes, making Harper well-suited for edge or intermittently connected environments.
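
A minimal sketch of the catch-up idea, assuming each audit entry carries a monotonically increasing sequence number and the log is ordered by it. The field names and the resync function are hypothetical and only model the "compare, then replay what was missed" behavior.

```python
def resync(local_table, last_applied_sequence, peer_audit_log):
    """Replay entries the node missed while offline and return the new position.

    Assumes peer_audit_log is ordered by its (hypothetical) `sequence` field.
    """
    for entry in peer_audit_log:
        if entry["sequence"] > last_applied_sequence:
            local_table[entry["record"]["id"]] = entry["record"]
            last_applied_sequence = entry["sequence"]
    return last_applied_sequence


peer_audit_log = [
    {"sequence": 1, "record": {"id": 1, "name": "Tiger"}},
    {"sequence": 2, "record": {"id": 2, "name": "Lion"}},
    {"sequence": 3, "record": {"id": 1, "name": "Bengal Tiger"}},
]
# The node applied entry 1, then went offline.
local_table = {1: {"id": 1, "name": "Tiger"}}
position = resync(local_table, 1, peer_audit_log)
print(position, local_table)
```

Only entries 2 and 3 are replayed, so the cost of reconnecting scales with how much the node missed, not with the size of the table.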

Audit Log Pruning

Over time, the audit log can grow significantly, especially in high-write environments. Harper allows you to purge old entries to free up storage. If a new or returning node requests a transaction that has already been purged, Harper will fall back to a full table copy to fill in the gaps.

Admins can configure log retention periods or trigger pruning operations based on time or size thresholds.
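
The fallback decision can be expressed as a small check. In this sketch the integer sequence numbers, the function name, and the returned labels are all hypothetical. The point is only that replay is possible when the first entry a node needs is still retained.

```python
def plan_catch_up(last_applied_sequence, oldest_retained_sequence):
    """Choose how a new or returning node catches up.

    Replay works only if the next entry the node needs
    (last_applied_sequence + 1) has not been purged.
    """
    if last_applied_sequence + 1 >= oldest_retained_sequence:
        return "replay_audit_log"
    return "full_table_copy"


# The peer has purged everything before sequence 100.
print(plan_catch_up(last_applied_sequence=99, oldest_retained_sequence=100))
print(plan_catch_up(last_applied_sequence=50, oldest_retained_sequence=100))
```

A node that stopped at sequence 99 can still replay, while one that stopped at 50 has a gap and needs the full copy.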

Conflict Resolution

Replication conflicts are rare but inevitable in distributed systems. Harper identifies records by primary key, so concurrent writes to the same record do not produce duplicates. Instead, it applies only the latest transaction using a last-write-wins strategy, unless a Conflict-free Replicated Data Type (CRDT) is used.

This ensures eventual consistency even if two nodes attempt to write the same record simultaneously.
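
Last-write-wins is easy to demonstrate in isolation. The sketch below assumes each write carries a numeric timestamp and keeps the existing version on an exact tie. Both the record shape and the tie-break are assumptions, not a description of Harper's internals.

```python
def resolve_last_write_wins(current, incoming):
    """Keep whichever version of a record carries the later timestamp.

    On an exact tie the current version is kept (an assumed tie-break).
    """
    if current is None or incoming["timestamp"] > current["timestamp"]:
        return incoming
    return current


# Two nodes update the same primary key at nearly the same moment.
write_denver = {"id": 7, "price": 19.99, "timestamp": 1750860000.120}
write_boston = {"id": 7, "price": 17.49, "timestamp": 1750860000.450}

winners = []
for arrival_order in ([write_denver, write_boston], [write_boston, write_denver]):
    winner = None
    for write in arrival_order:
        winner = resolve_last_write_wins(winner, write)
    winners.append(winner["price"])
print(winners)  # prints [17.49, 17.49]
```

Because the outcome depends on the timestamps rather than on arrival order, every node settles on the same value, which is what makes the result eventually consistent.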

Monitoring and Troubleshooting

Need a quick health-check on your Harper cluster?

Call the cluster_status operation, and Harper returns a tidy JSON snapshot of every node—showing current WebSocket connections, subscribed tables, per-database latency, and a flag that tells you if clustering is enabled. This single endpoint gives ops teams immediate, actionable insight into replication health and connection quality, making it easy to spot broken links or lagging databases before they disrupt your app.
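
A hedged Python sketch of calling the operation follows. Only the cluster_status operation name comes from this guide; the URL, port, and Authorization header are assumptions, and the response is returned as plain decoded JSON because its exact fields are not reproduced here.

```python
import json
import urllib.request


def build_operation_request(api_url, operation, auth_header):
    """Build a POST request whose JSON body names a single operation."""
    body = json.dumps({"operation": operation}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        headers={"Content-Type": "application/json", "Authorization": auth_header},
        method="POST",
    )


def fetch_cluster_status(api_url, auth_header):
    """Send the request and return the decoded JSON snapshot (needs a live node)."""
    request = build_operation_request(api_url, "cluster_status", auth_header)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Dry run: inspect the payload without contacting a server.
status_request = build_operation_request(
    "https://server-one.example.com:9925",  # hypothetical operations API URL
    "cluster_status",
    "Basic <credentials>",                  # placeholder, not a real credential
)
print(status_request.data.decode("utf-8"))
```

A monitoring job could call fetch_cluster_status on a schedule and alert on whatever connection or latency fields the snapshot exposes in your version.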

What This Enables

Harper’s replication architecture makes global low-latency experiences possible. Whether you're building applications at the edge, running global infrastructure, or scaling dynamically without downtime, Harper gives you the flexibility and resilience to do it right.

Edge Intelligence

Harper is well-suited for deployment on resource-constrained, distributed hardware commonly found in environments like warehouses, retail locations, delivery vehicles, and remote sensor networks.

  • Each node holds just the data it needs.
  • If it loses connection, it keeps running locally.
  • When it's back online, it syncs up automatically—no manual recovery needed.

Real-World Example: A retail chain rolls out Harper to POS devices in 300 stores. Each store gets real-time pricing updates and logs its own sales data locally. Even if the network goes down, the POS system keeps working. Once the connection is restored, the sales data syncs back to HQ without missing a beat.

Multi-Region Deployments

Need to serve customers across continents while respecting data sovereignty laws and reducing latency? Harper has you covered.

  • Securely connect regions with mTLS.
  • Control exactly what gets replicated where.
  • Keep data close to users and inside legal boundaries.

Real-World Example: A SaaS company runs Harper clusters in the US, EU, and APAC. EU customer data never leaves the Frankfurt region to stay GDPR-compliant. Meanwhile, global metrics are rolled up to a central analytics cluster. It's fast, efficient, and compliant—all at once.

Zero-Downtime Scale-Out

Scaling your infrastructure should be smooth—not stressful.

  • Add new nodes without restarts.
  • New nodes join and auto-sync with the cluster without manual setup.
  • Just declare them in the config.

Real-World Example: On Black Friday, an e-commerce platform adds five new edge nodes to handle the traffic spike. Each one boots up, connects securely to the cluster, downloads the necessary data, and starts serving traffic—no downtime, no disruption.

Try It Yourself

  1. Set up 3 Harper nodes with the YAML config.
  2. Insert data into one—watch it appear on others.
  3. Simulate an offline node, then bring it back.

Final Thought

Harper’s peer-to-peer replication, declarative design, and secure architecture make it a powerful tool for building distributed systems that feel like one.

Once you understand how it works, you’ll start designing systems that let Harper do the heavy lifting, so your data is always where it needs to be, fast and safe.
