Metadata, not data, is what drags your database down
In recent years, data has seen exponential growth, driven by the proliferation of connected devices and the Internet of Things (IoT). With it has come an alarming expansion in the amount of metadata, meaning data that describes and provides information about other data. Although metadata has always been around, it used to be stored in memory and behind the scenes, because it came at a fraction of the size it is today.
Ten years ago, the typical ratio between data and metadata was 1,000:1. This means that a data unit (file, block, or object) that is 32KB in size would carry around 32 bytes of metadata. Existing data engines were able to handle these amounts quite effectively. Since then, however, the ratio has shifted significantly towards metadata: it can now range from 1,000:1 when objects are large to 1:10 when objects are very small. This explosion of metadata has a direct and immediate impact on our data infrastructures.
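To make that shift concrete, here is a quick back-of-the-envelope calculation; the sizes are illustrative assumptions, not measurements:

```python
# Illustrative data-to-metadata ratios (sizes are assumptions, not measurements).

# Large object: a 32KB block with ~32 bytes of metadata -> roughly 1,000:1.
large_data, large_meta = 32 * 1024, 32
print(f"large object: data-to-metadata = {large_data // large_meta}:1")   # 1024:1

# Tiny object: a 16-byte sensor reading with ~160 bytes of metadata -> roughly 1:10.
small_data, small_meta = 16, 160
print(f"small object: data-to-metadata = 1:{small_meta // small_data}")   # 1:10
```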
The massive adoption of cloud applications and infrastructure services, along with IoT, big data analytics, and other data-intensive workloads, means that unstructured data volumes will only continue to grow in the coming years. Current data architectures can no longer support the needs of modern businesses. To tackle the ever-growing challenge of metadata, we need a new architecture to underpin a new generation of data engines that can handle the tsunami of metadata effectively while also giving applications fast access to it.
Understanding metadata: the silent data engine killer
Every database system, whether SQL or NoSQL, uses a storage engine, or data engine, embedded or not, to manage how data is stored. In everyday life, we don’t pay much attention to these engines that run our world; we usually only notice them when they suddenly fail. Similarly, most of us had never even heard the term “data engine” until recently, yet data engines run our databases, storage systems, and basically any application that handles a large amount of data. Just like a car engine, we only become aware of their existence when they break. And just as we wouldn’t expect a sedan’s engine to power a giant truck, a data engine pushed far beyond its design will, probably sooner rather than later, crack under the strain.
So what’s causing our data engines to heat up? The main reason is the overwhelming pace of data growth, especially in metadata, which is the silent data engine killer. Metadata refers to any piece of information about the data, such as an index, that makes it easier to find and work with. Metadata doesn’t have a fixed schema to fit a database; rather, it’s a general description of the data, created by various systems and devices, and it is usually stored in a key-value format. These pieces of metadata have to be stored somewhere, usually hidden away in RAM caches, and they are growing bigger and bigger.
In addition to the continuous increase in the volume of unstructured data, such as documents and audio/video files, the rapid propagation of connected devices and IoT sensors creates a metadata sprawl that is expected to accelerate. The data itself is typically very small (for example, an alphanumeric reading from a sensor), but it is accompanied by large chunks of metadata (location, timestamp, description) that may be even larger than the data itself.
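As a purely hypothetical illustration (the field names and sizes are invented for this sketch), a single sensor reading might look like the following, with the descriptive fields far outweighing the measured value:

```python
import json

# Hypothetical IoT reading: the measurement itself is a few bytes,
# while the descriptive fields around it dominate the record size.
reading = {
    "value": "23.4",                        # the actual data
    "unit": "celsius",                      # everything below describes the data
    "sensor_id": "therm-0042",
    "location": {"lat": 40.7128, "lon": -74.0060, "site": "warehouse-3"},
    "timestamp": "2022-01-15T09:30:00Z",
    "description": "ambient temperature, north wall",
}

record = json.dumps(reading).encode()
print(len(reading["value"]), "bytes of raw data vs", len(record), "bytes in total")
```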
Existing data engines are based on architectures that were not designed to support the scale of modern datasets, and they are stretched to their limits trying to keep up with ever-growing volumes of data. This includes SQL databases, key-value stores, time-series databases, and even unstructured data engines like MongoDB. They all rely on an underlying storage engine (embedded or not) that was not built for today’s data sizes. Now that metadata is much bigger and “leaks” out of memory, access to the underlying storage media is much slower, which causes a performance hit. The size of that hit on the application is determined directly by the data size and the number of objects.
As this trend continues to unfold, data engines must adapt so they can effectively support the metadata processing and management needs of modern businesses.
Under the hood of the data engine
Installed as a software layer between the application and the storage layers, a data engine is an embedded key-value store (KVS) that sorts and indexes data. Historically, data engines were mainly used to handle basic operations of storage management, most notably to create, read, update, and delete (CRUD) data.
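As a minimal sketch of what an embedded KVS does, the snippet below runs the four CRUD operations against Python’s built-in dbm store; it is a stand-in for illustration only, not RocksDB or Speedb, but engines like those play the same architectural role at a far larger scale:

```python
import dbm

# dbm is a tiny embedded key-value store that ships with Python; a production
# data engine such as RocksDB fills the same role between the application and
# its storage, just with far more sophistication.
with dbm.open("example_kvs", "c") as kvs:    # "c" creates the store if missing
    kvs[b"user:42"] = b'{"name": "Ada"}'     # create
    print(kvs[b"user:42"])                   # read
    kvs[b"user:42"] = b'{"name": "Ada L."}'  # update
    del kvs[b"user:42"]                      # delete
```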
Today, a KVS is increasingly implemented as a software layer within the application to execute on-the-fly operations on live data in transit. While existing data engines, such as RocksDB, are being used to handle in-application operations beyond CRUD, they still face limitations due to their design. This type of deployment is often aimed at managing metadata-intensive workloads and preventing metadata access bottlenecks that can lead to performance issues. Because the KVS is going beyond its traditional role as a storage engine, the term “data engine” is being used to describe this wider scope of use cases.
Traditional KVSs are based on data structures that are optimized for either fast writes or fast reads. To store metadata in memory, data engines typically use a log-structured merge (LSM) tree-based KVS. An LSM tree-based KVS has an advantage over a B-tree, another popular KVS data structure, because it writes to immutable SST files: it can ingest data very quickly without having to modify existing on-disk structures. While existing KVS data structures can be tuned for good-enough write and read speeds, they cannot provide high performance for both operations.
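A greatly simplified sketch of the LSM idea follows; this is not how RocksDB is actually implemented, just an illustration of why writes stay cheap while reads may have to consult several immutable runs:

```python
class TinyLSM:
    """Toy LSM-style store: mutable in-memory buffer, immutable sorted runs on "disk"."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}            # mutable, in-memory write buffer
        self.sstables = []            # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # writes never modify existing runs
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def get(self, key):
        if key in self.memtable:                 # newest data first
            return self.memtable[key]
        for run in reversed(self.sstables):      # then newest run to oldest
            if key in run:
                return run[key]
        return None

    def _flush(self):
        # Freeze the memtable into an immutable sorted run (an "SST file").
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

db = TinyLSM()
for i in range(10):
    db.put(f"sensor:{i}", f"reading-{i}")
print(db.get("sensor:3"), "-", len(db.sstables), "immutable runs flushed")
```

A write is a single in-memory insert, but a read may have to check the memtable and every run, which is why real LSM engines add Bloom filters and background compaction; compaction in turn is what creates the write amplification discussed below.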
When your data engine overheats
As data engines are increasingly used to process and map trillions of objects, the limitations of traditional KVSs become apparent. Despite offering more flexibility and speed than traditional relational databases, an LSM-based KVS suffers from limited capacity and from high CPU utilization and memory consumption caused by write amplification, which hurts its performance on solid-state storage media. Developers have to trade write performance against read performance, or vice versa. And configuring a KVS to balance these requirements is not only an ongoing task but also a challenging and labor-intensive one, owing to its complex internal structure.
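Write amplification is the ratio of bytes physically written to storage versus bytes logically written by the application. A commonly used back-of-the-envelope estimate for a leveled LSM tree is sketched below; the fanout and level count are assumed, illustrative values, not measurements of any specific engine:

```python
# Rough write-amplification estimate for a leveled LSM tree (illustrative only).
# Assumption: with a size ratio (fanout) of ~10 between levels, each key is
# rewritten roughly once per level on its way down, each time merged with
# about `fanout` times as much data from the level below.
fanout = 10
levels = 5                    # hypothetical tree depth for a large dataset

write_amplification = 1 + fanout * (levels - 1)   # memtable flush + per-level merges
print(f"~{write_amplification}x: each logical byte can cost tens of physical writes")
```

Every extra physical write burns CPU cycles, memory bandwidth, and flash endurance, which is where the capacity and utilization problems described above come from.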
To keep things running, application developers will find themselves spending more and more time dealing with sharding, database tuning, and other time-consuming operational tasks. These limitations will force many organizations that lack adequate developer resources to fall back on default settings that cannot deliver the performance their applications need.
Obviously, this approach cannot be sustained for long. Due to the inherent shortcomings of existing KVS offerings, currently available data engines struggle to scale while maintaining adequate performance, let alone to scale in a cost-effective manner.
A new data architecture
Recognizing the problems that metadata generates and the limitations of current data engines is what drove my cofounders and me to found Speedb, a data engine that provides faster performance at scale. We set out to build a new data engine from scratch to deal with metadata sprawl, one that eliminates the trade-offs between scalability, performance, and cost while providing superior read and write speeds.
To accomplish this, we redesigned the basic components of the KVS. We developed a new compaction method that dramatically reduces write amplification for large-scale LSM trees; a new flow-control mechanism that eliminates spikes in user latency; and a probabilistic index that consumes less than three bytes per object, regardless of object and key size, delivering extreme performance at scale. Speedb is a drop-in embeddable solution compatible with the RocksDB storage engine that can address the rising demand for high performance at cloud scale. The growth of metadata isn’t slowing down, but with this new architecture, we will at least be able to keep up with demand.
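Speedb’s actual index design isn’t detailed here, but the general idea of a probabilistic index that spends only a couple of bytes per key can be illustrated with a Bloom filter, which answers “might this key exist?” without storing the key itself (an illustrative sketch, not Speedb’s implementation):

```python
import hashlib

class BloomIndex:
    """Toy Bloom filter: a fixed number of bits per key, occasional false
    positives, never false negatives. Illustration only, not Speedb's index."""

    def __init__(self, num_keys, bits_per_key=16, num_hashes=4):
        self.size = num_keys * bits_per_key       # 16 bits = 2 bytes per key
        self.num_hashes = num_hashes
        self.bits = bytearray(self.size // 8 + 1)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

index = BloomIndex(num_keys=1_000, bits_per_key=16)
index.add("object:1234")
print(index.might_contain("object:1234"))   # True
print(index.might_contain("object:9999"))   # almost certainly False
```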
Tags: database
10 Comments
This whole post makes no sense to me.
Unless, maybe, it’s a marketing blurb for a niche technology?
I am trying to understand why the timestamp was considered “metadata” instead of “data” in IoT streams.
“When” the sensor sensed the data is an important and intrinsic piece of information, in my opinion. Without the timestamp, the bits generated by the sensor are of not much value by themselves.
This is true for location too, but here an argument could be made that location hardly changes.
I think the question is how we define metadata distinctly from data.
Maybe it would be nice to clearly disclose that this is an ad, not an opinion piece or essay or anything. I will remember to stay away from Speedb if it ever crops up anywhere.
This is not an ad. They pitched something that I thought could be interesting. If you disagree with it or if you think that their perspective is wrong or misguided, please let us know in the comments. They certainly have an agenda, but if I turned down every pitch with an agenda, I would lose out on some interesting articles.
This is 100% an ad. Maybe they didn’t pay SO to post it, but it doesn’t provide any useful information. It just boils down to, “Current DB engines are inefficient. We created a new one. Come buy it from us.”
It could have provided information on mitigating the issues in current engines. It could have even elaborated more into what/how they specifically fixed the issues, instead of a single paragraph marketing pitch.
In the UK we call these advertorials. I don’t know if the same name is used over the pond. I’ve no objection to them on the SO blog, but it would be courteous to mark them as such.
A piece of total nonsense by a guy who barely understands anything about databases.
Example:
“The data itself is typically very small (for example, an alphanumeric read of a sensor), but it is accompanied by large chunks of metadata (location, timestamp, description) that might be even larger than the data itself.”
I stopped there.
Defining the timestamp as metadata in this context is incredibly silly. It is pure, 100% data, just like the sensor’s unique ID would be. Your sensor’s value is worth nothing and is not exploitable if you don’t have the timestamp at which it was recorded.
It looks like your whole metadata idea is file-related, but the problem here is that this is an article about databases only. Metadata in databases is something else; you can learn what it is here: https://dataedo.com/kb/databases/all/metadata
The article does look a lot like an ad, but that is OK. Everyone who invents some new technology and wants to tell you about what it does sounds like an advertisement. I’m sure that early articles about much of what is now mainstream technology looked a lot like ads at the time.
His example of metadata was a bit off-base. Time and location data for sensors is important data; but his discussion about metadata in general is accurate. I am surprised at a few of the comments that immediately dismissed him as some kind of amateur because of one example. I too created a data engine from scratch that can do queries 10x faster than other RDBMS offerings (and I have the numbers to back it up), but people constantly insist that I must be lying or don’t know anything without even trying to prove me wrong. The software is available for free download but few make the effort. I wish this article had a demo video that showed a side by side comparison of their system instead of a flashy marketing one. Something like this:
It does look like an ad.
First, the whole data vs. metadata discussion makes no sense (a temperature alone is not data; what you put in the metadata are the properties of a temperature point and should be included). This is an “unstructured” data problem, and the solution can be better performance (from hardware or software), but the root cause is lazy object definition and maintenance. There is no end of performance problems if you don’t structure your performance-sensitive data; no external software will solve that indefinitely. At most it will temporarily relieve it, but the next month or year it will again be a problem.
Second, the whole “built from scratch” argument is absurd. You didn’t reinvent the wheel; you simply optimised one part of a regular KVS engine and claim to be vastly more performant than the competition. Where are the numbers? Where are the independent tests?
This is an ad, and a poor one. I vote to remove this false article.