Metadata, not data, is what drags your database down 

Databases today are built for Big Data. But what happens when the metadata is bigger?


In recent years, data has grown exponentially, driven by the proliferation of connected devices and the Internet of Things (IoT). With it has come an alarming expansion in the amount of metadata: data that describes and provides information about other data. Metadata has always been around, but it used to be stored in memory and behind the scenes, because it was a fraction of the size it is today.

Ten years ago, the typical ratio between data and metadata was 1,000:1. This meant that a data unit (a file, block, or object) of 32 KB carried around 32 bytes of metadata, and existing data engines handled these volumes quite effectively. Since then, however, the ratio has shifted significantly toward metadata: it can now range from 1,000:1 when objects are large to 1:10 when objects are very small. This explosion of metadata has a direct and immediate impact on our data infrastructures.
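To make those ratios concrete, here is a quick back-of-the-envelope calculation (the object sizes are illustrative, taken from the examples above):

```python
# Illustrative arithmetic for the shifting data:metadata ratio.
KB = 1024

# Ten years ago: a 32 KB object at a ~1,000:1 data-to-metadata ratio.
data_size = 32 * KB               # 32,768 bytes of data
metadata_size = data_size // 1000
print(metadata_size)              # ~32 bytes of metadata

# Today: a tiny IoT reading can invert the ratio to roughly 1:10.
sensor_reading = 8                # e.g. a short alphanumeric sensor value
iot_metadata = sensor_reading * 10
print(iot_metadata)               # ~80 bytes of location, timestamp, description
```

At scale, that inversion means the metadata, not the data, dominates what the engine has to index and keep hot.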

The massive adoption of cloud applications and infrastructure services, along with IoT, big data analytics, and other data-intensive workloads, means that unstructured data volumes will only continue to grow in the coming years. Current data architectures can no longer support the needs of modern businesses. To tackle the ever-growing challenge of metadata, we need new architecture to underpin a new generation of data engines that can effectively handle the tsunami of metadata while also giving applications fast access to that metadata.

Understanding metadata: the silent data engine killer

Every database system, whether SQL or NoSQL, uses a storage engine—or data engine—embedded or otherwise, to manage how data is stored. In everyday life, we pay little attention to the engines that run our world; like a car engine, a data engine only gets noticed when it breaks. Most of us had never even heard the term “data engine” until recently, yet these engines run our databases, our storage systems, and essentially any application that handles a large amount of data. And just as we wouldn’t expect a sedan’s engine to power a giant truck, a data engine pushed far beyond its design will, probably sooner rather than later, crack under the strain.

So what’s causing our data engines to heat up? The main reason is the overwhelming pace of data growth, especially in metadata, the silent data engine killer. Metadata is any piece of information about the data—an index, for example—that makes the data easier to find and work with. Metadata doesn’t have a fixed schema to fit a database; rather, it’s a general description of the data, created by various systems and devices, and usually stored in a key-value format. These pieces of data, which need to be stored somewhere and have traditionally stayed hidden in the RAM cache, are now becoming bigger and bigger.

In addition to the continuous increase in the volume of unstructured data—such as documents and audio/video files—the rapid propagation of connected devices and IoT sensors creates a metadata sprawl that is expected to accelerate going forward. The data itself is typically tiny (for example, an alphanumeric sensor reading), but it is accompanied by large chunks of metadata (location, timestamp, description) that can exceed the size of the data itself.

Existing data engines are based on architectures that were never designed for the scale of modern datasets, and they are stretched to their limits trying to keep up with ever-growing volumes of data. This is true of SQL databases, key-value stores, time-series databases, and even unstructured-data engines like MongoDB: all of them rely on an underlying storage engine (embedded or not) that was not built to support today’s data sizes. Now that metadata is much bigger and “leaks” out of memory, access to the underlying media is much slower and performance takes a hit—and the size of that hit on the application is determined directly by the data size and the number of objects.

As this trend continues to unfold, data engines must adapt so they can effectively support the metadata processing and management needs of modern businesses.

Under the hood of the data engine

Installed as a software layer between the application and the storage layers, a data engine is an embedded key-value store (KVS) that sorts and indexes data. Historically, data engines were mainly used to handle basic operations of storage management, most notably to create, read, update, and delete (CRUD) data.
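Those CRUD basics can be sketched in a few lines. This example uses Python’s stdlib `dbm` module as a stand-in for an embedded KVS such as RocksDB—the API details differ, but the create/read/update/delete shape is the same:

```python
import dbm
import os
import tempfile

# Minimal CRUD sketch against an embedded key-value store. `dbm` is a
# simple stdlib KVS; a production engine like RocksDB exposes the same
# basic operations (put/get/delete) with far more machinery underneath.
path = os.path.join(tempfile.mkdtemp(), "kv")

kv = dbm.open(path, "c")            # "c": create the store if missing
kv[b"user:42"] = b"alice"           # Create
read_back = kv[b"user:42"]          # Read
kv[b"user:42"] = b"alice_v2"        # Update (overwrite)
del kv[b"user:42"]                  # Delete
still_there = b"user:42" in kv
kv.close()

print(read_back, still_there)       # b'alice' False
```

Note that keys and values are opaque byte strings: the engine sorts and indexes them but attaches no meaning to their contents, which is what makes a KVS a natural home for schemaless metadata.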

Today, KVS is increasingly implemented as a software layer within the application to execute different on-the-fly activities on live data while in transit. While existing data engines, such as RocksDB, are being used to handle in-application operations beyond CRUD, they still face limitations due to their design. This type of deployment is often aimed at managing metadata-intensive workloads and preventing metadata access bottlenecks that may lead to performance issues. Because KVS is going beyond its traditional role as a storage engine, the term “data engine” is being used to describe a wider scope of use cases.

Traditional KVSs are based on data structures that are optimized for either fast write speed or fast read speed. To store metadata in memory, data engines typically use a log-structured merge (LSM) tree-based KVS. An LSM tree-based KVS has an advantage over a B-tree, another popular KVS data structure, because it can absorb writes very quickly without restructuring data in place, thanks to its use of immutable sorted string table (SST) files. But while existing KVS data structures can be tuned for good-enough write and read speeds, they cannot provide high performance for both operations at once.
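A toy sketch makes the write/read asymmetry visible. The class below (a deliberately minimal illustration, not any real engine’s design) buffers writes in a memtable, freezes it into an immutable sorted segment—standing in for an SST file—when it fills, and serves reads by checking the memtable first and then segments from newest to oldest:

```python
import bisect

class TinyLSM:
    """Toy LSM-tree sketch: writes land in an in-memory memtable; when it
    fills, it is frozen into an immutable sorted segment (a stand-in for
    an SST file). Reads probe the memtable, then segments newest-first."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []                 # list of sorted (key, value) runs
        self.limit = memtable_limit

    def put(self, key, value):
        # Writes are cheap: no in-place restructuring, just a dict insert.
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}             # frozen segment is now immutable

    def get(self, key):
        # Reads may touch many places: memtable, then every segment.
        if key in self.memtable:
            return self.memtable[key]
        for seg in reversed(self.segments):        # newest write wins
            i = bisect.bisect_left(seg, (key,))
            if i < len(seg) and seg[i][0] == key:
                return seg[i][1]
        return None

db = TinyLSM()
db.put("a", 1); db.put("b", 2); db.put("a", 3)
print(db.get("a"))   # 3 — the newest value shadows the one in the segment
```

The asymmetry is the point: a `put` never rewrites existing structures, but a `get` may have to search every segment—which is why real engines add Bloom filters and compaction, and why tuning one side tends to cost the other.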

When your data engine overheats

As data engines are increasingly used for processing and mapping trillions of objects, the limitations of traditional KVSs become apparent. Despite offering more flexibility and speed than traditional relational databases, an LSM-based KVS has limited capacity and high CPU utilization and memory consumption due to high write amplification, which hurts its performance on solid-state storage media. Developers are forced to trade write performance against read performance, or vice versa. And configuring a KVS to balance these requirements is not only an ongoing task but also a challenging and labor-intensive one, given the complex internal structure of these stores.
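To see why write amplification hurts, a rough model helps. In a leveled LSM tree, compaction can rewrite each byte roughly once per level, at a cost proportional to the fanout between levels. The numbers below are illustrative defaults, not measurements from any particular engine:

```python
# Back-of-the-envelope model of leveled-LSM write amplification.
# Assumptions (illustrative): compaction into each level rewrites data
# roughly `fanout` times, and data passes through every level.
fanout = 10    # size ratio between adjacent levels (a common default)
levels = 4     # depth of the LSM tree

# +1 accounts for the initial flush of the memtable to level 0.
write_amplification = 1 + levels * fanout
print(write_amplification)   # 41: one logical byte -> ~41 bytes hit the SSD
```

A ~40x amplification means the device does forty times the physical writes the application asked for—burning CPU on compaction, consuming SSD endurance, and stealing bandwidth from reads. That is the overhead Speedb’s compaction redesign, discussed below, targets.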

To keep things running, application developers will find themselves spending more and more time on sharding, database tuning, and other time-consuming operational tasks. Organizations that lack adequate developer resources will be forced to fall back on default settings that fail to meet their applications’ needs.

Obviously, this approach cannot be sustained for long. Due to the inherent shortcomings of existing KVS offerings, currently-available data engines struggle to scale while maintaining adequate performance—let alone scale in a cost-effective manner.

A new data architecture

Recognizing the problems metadata generates, and the limitations of current data engines, is what drove my cofounders and me to found Speedb, a data engine designed for faster performance at scale. We set out to build a new data engine from scratch—one that deals with metadata sprawl and eliminates the trade-offs between scalability, performance, and cost while providing superior read and write speeds.

To accomplish this, we redesigned the basic components of the KVS. We developed a new compaction method that dramatically reduces write amplification for large-scale LSM trees; a new flow-control mechanism that eliminates spikes in user latency; and a probabilistic index that consumes less than three bytes per object, regardless of object and key size, delivering extreme performance at scale. Speedb is an embeddable, drop-in replacement compatible with the RocksDB storage engine that can address the rising demand for high performance at cloud scale. The growth of metadata isn’t slowing down, but with this new architecture, we will at least be able to keep up with demand.
