What is Polyglot Persistence and Why is it Awful?

By Stephen Goldberg, CEO & Co-Founder
April 23, 2019

This post examines the concept of polyglot persistence—using multiple data storage technologies for one application—and argues that mixing databases often adds needless complexity. It suggests that choosing a single versatile database can simplify development and improve performance.

According to Wikipedia, “Polyglot persistence is the concept of using different data storage technologies to handle different data storage needs within a given software application.” James Serra, in his blog, writes, “Polyglot Persistence is a fancy term to mean that when storing data, it is best to use multiple data storage technologies, chosen based upon the way data is being used by individual applications or components of a single application. Different kinds of data are best dealt with different data stores.”

The logic behind this methodology, according to Wikipedia and most other sources, is: “There are numerous databases available to solve different problems. Using a single database to satisfy all of a program's requirements can result in a non-performant, 'jack of all trades, master of none' solution. Relational databases, for example, are good at enforcing relationships that exist between various data tables. To discover a relationship or to find data from different tables that belong to the same object, a SQL join operation can be used. This might work when the data is smaller in size, but becomes problematic when the data involved grows larger. A graph database might solve the problem of relationships in the case of Big Data, but it might not solve the problem of database transactions, which are provided by RDBMS systems. Instead, a NoSQL document database might be used to store unstructured data for that particular part of the problem. Thus, different problems are solved by different database systems, all within the same application.”
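To make the join example in that quote concrete, here is a minimal sketch using Python's built-in sqlite3 module; the customers and orders tables, their columns, and the data are purely illustrative.

```python
import sqlite3

# In-memory database purely for illustration; table names and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         total REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 99.5), (12, 2, 40.0);
""")

# A join reassembles the relationship between the two tables at query time.
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""").fetchall()
print(rows)  # e.g. [('Acme', 349.5), ('Globex', 40.0)]
```

As the quote notes, this works well at modest scale; the pain starts when the joined tables grow large or end up split across separate systems.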

As James Serra notes, this is a “fancy term”, and it certainly sounds smart. It has clearly become the reigning paradigm for most large-scale data management implementations; however, I would argue that it is a terrible idea. Why is it such a bad idea? The answer is pretty simple - consistency, cost, and complexity.

The three Cs - Consistency, Cost, and Complexity

The idea that you should adopt the right tool for the job seems sound. When it comes to implementation in the software world, it often is a good idea. Windows and OS X are great operating systems for end-user interfaces on laptops, mobile devices, and desktops, but far from ideal for server environments. Conversely, I wouldn’t want to support my sales team on Linux; I’ve been there and done that during my days at Red Hat, and it was awful.

I think at this point it’s pretty clear that data is the lifeblood of any organization, regardless of size or industry. One could argue this belief has gone too far. I recently listened to a podcast where the CEO of one of the world’s largest auto manufacturers claimed they weren’t a car company anymore, but a “data platform company”. This made me roll my eyes. Still, while I firmly believe that car companies should build cars, data is a vital asset to any organization, and we all know it.

So, going back to my three Cs - consistency, cost, and complexity - let’s examine how the pervasive concept of polyglot persistence threatens each area of an organization’s data management strategy.

While my career has taken many twists and turns, I have spent basically the entirety of it trying to achieve one single goal for organizations like Red Hat, Nissan North America, The Charlotte Hornets, Parkwood Entertainment, and many others - getting a single view of their data. I have learned an enormous amount on the hundreds of projects I have worked on trying to provide a single view of the truth, and I have fought one battle time and time again: consistency.

By introducing many different databases into their technology ecosystems, as the Wikipedia article suggests above, companies inevitably create a situation where their data is inconsistent. Add to that the fact that we are consuming data at a frequency never before seen, and it becomes nearly impossible to keep the data in sync. The very nature of polyglot persistence ensures this, as it states that certain types of data should live in certain types of systems. That would be fine if there were an easy way to access it holistically across these systems, but that simply doesn’t exist. A year or two ago, many folks argued with me that the solution to this problem was data lakes like Hadoop and other technologies, but I haven’t heard that argument very often in the last 6 to 12 months. Why? Because data goes to data lakes to die. They are slow, expensive, difficult to maintain, and make it challenging to get a near real-time view of your data.

The issue is that this model requires a significant reliance on memory and CPU for each data silo to perform on-read transformations and calculations of data. These systems are being asked to do double duty: the primary function they were designated for in the polyglot model, plus serving as a worker for a data lake. This overtaxes these systems, adds latency, invites data corruption, and creates a lot of complexity.

I fully agree that RDBMSs are ideal for relationships and transactions but fail at massive scale. That said, what you end up with in a polyglot persistence paradigm is an inability to get a consistent view of your data across your entire organization.

A Database for IoT and the convergence of OT and IT

All data is valuable because of its relationships. To truly achieve the promise of Industry 4.0, it will be essential to drive a convergence of OT and IT - combining operational technology (OT) data with IT data. OT data comes at a very high frequency. A single electrical relay can put out 2,000 records a second. One smaller-scale project we are working on has 40 electrical relays - that’s 80,000 records a second. To be valuable, this power consumption data needs to be combined with production data in other systems like ERPs. These relationships will drive the value of that data. For example, understanding in real time what it costs in power to produce a unit of inventory is a question that needs to be answered. This requires a database for IoT as well as a database that can functionally handle IT data.
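As a back-of-the-envelope illustration of the OT/IT correlation described above, here is a minimal Python sketch with entirely hypothetical relay readings, production counts, and electricity price: it aggregates the high-frequency power data down to the grain of the ERP data and computes energy per unit of inventory.

```python
from collections import defaultdict

# Hypothetical OT stream: (line_id, kilowatt_hours) samples from electrical relays.
power_readings = [
    ("line_a", 0.4), ("line_a", 0.5), ("line_b", 0.3),
    ("line_a", 0.6), ("line_b", 0.2),
]

# Hypothetical IT/ERP data: units of inventory produced per line over the same window.
units_produced = {"line_a": 3, "line_b": 2}

# Aggregate the high-frequency OT data to the grain of the ERP data...
kwh_per_line = defaultdict(float)
for line_id, kwh in power_readings:
    kwh_per_line[line_id] += kwh

# ...then join the two to estimate energy cost per unit of inventory.
ELECTRICITY_PRICE = 0.12  # assumed $/kWh, for illustration only
for line_id, kwh in kwh_per_line.items():
    units = units_produced.get(line_id, 0)
    if units:
        print(f"{line_id}: {kwh / units:.3f} kWh/unit, "
              f"${kwh * ELECTRICITY_PRICE / units:.4f} per unit")
```

In a polyglot deployment the readings and the production counts live in different systems, so even this small aggregation ends up running in a third place - which is exactly the correlation problem described next.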

Most folks would use a polyglot persistence model to achieve this. They would use a highly scalable streaming data solution, or an edge database, to consume the OT power data. They would then use an RDBMS to consume the inventory data. How then do we correlate those in real time? Most likely by sending them to a third system. Things will get lost in transit, integrations will break, and consistency is lost.

The True Cost of Polyglot Persistence 

Furthermore, this is highly complex. As we add additional systems for each of these data types, we need additional specialized resources - in both people and hardware - to maintain them. We also need multiple integration layers that oftentimes lack flexibility and ultimately become the failure points in these architectures. The more layers we add, the more challenging it becomes to verify consistency and to manage the complexity. Housing the same data in multiple places also adds significant storage costs, along with increased compute costs.

We are also paying in lost productivity, perhaps more than anywhere else. How long does it take to triage an issue in your operational data pipeline when you have 5 to 7 different sources of the truth? How do you determine what is causing data corruption?

There is also a major risk in terms of compliance. If we look at the data breaches across social media companies, credit bureaus, financial institutions, and so on, how much time has it taken for them to diagnose the real effect of those breaches? Why is that? The answer is pretty simple: they don’t have a holistic picture of their data, nor a unified audit trail on that data. This is becoming more and more dangerous as breaches increasingly affect personal data and as more and more things become connected.

What is the solution?

I am not suggesting we go back to the days of monolithic RDBMS environments. I think it’s clear that paradigm is over. Nor am I suggesting that we abandon many of the products we are currently using. Many of these developer tools are awesome for different uses. Search tools like Elasticsearch have become vital parts of the technology ecosystem, and in-memory databases play important roles where very high-speed search on relational data is needed.

What I am suggesting is that we need to look at data architectures that provide a single persistence model across all these tools, providing those tools with the stability and consistency that they require.  Creating data planes with persistence that can function as middleware layers as well as stream processing engines will be key to reducing complexity.    

If we stop relying on each of these individual tools for long-term persistence and instead view the data inside them as transient, we can accept that their version of the data might be out of sync and that they might crash. If we are able to put persistence in a stream processing layer with ACID compliance and very high stability, we can then rely on that layer to provide a holistic view of our data. Stop overtaxing these systems with data lakes whose storage algorithms make it impossible to do performant transformation and aggregation; instead, allow these endpoint data stores to do their jobs and provide that functionality in a layer that can be used as an operational data pipeline.
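To illustrate the ownership split being proposed - not Harper’s actual implementation, just a sketch with invented names - here is a toy persistence layer that records every write durably and fans transient copies out to endpoint stores, which can always be rebuilt by replaying the log.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical sketch: one durable, ACID-style persistence layer owns the source
# of truth, and endpoint stores (search, cache, analytics) hold transient copies
# they are allowed to lose.

@dataclass
class PersistentStreamLayer:
    log: list = field(default_factory=list)          # durable record of every write
    subscribers: list = field(default_factory=list)  # transient endpoint stores

    def subscribe(self, sink: Callable[[dict], None]) -> None:
        self.subscribers.append(sink)

    def write(self, record: dict) -> None:
        # Persist first: this layer is the single source of truth.
        self.log.append(record)
        # Then fan out to endpoint stores; a failure here loses nothing,
        # because any store can be rehydrated by replaying self.log.
        for sink in self.subscribers:
            try:
                sink(record)
            except Exception:
                pass  # the endpoint store is transient and can be rebuilt later

search_index, cache = [], {}
layer = PersistentStreamLayer()
layer.subscribe(search_index.append)
layer.subscribe(lambda r: cache.__setitem__(r["id"], r))
layer.write({"id": 1, "line": "line_a", "kwh": 0.4})
print(len(layer.log), len(search_index), cache)
```

In a real system the log would be an ACID, replicated store and the fan-out would be a stream processing pipeline; the sketch only shows the ownership split - the layer holds the truth, and the endpoint stores hold disposable projections.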
