code-for-a-living January 14, 2021

Have the tables turned on NoSQL?

NoSQL climbed up the charts as the next big thing in system architecture in 2011, but overall interest in it has plateaued recently. You may have heard of it and ignored it, safe in the knowledge that you can always have an SQL command line at your fingertips. But what is NoSQL, what does it have to do with modern development, and is it worth implementing in your project?

Let’s find out.

Sysadmins managing big projects know a few things about traditional SQL databases. First, they are notoriously hard to scale, making it difficult to spread data across services or geographic regions. Second, a small mistake in a single file can tank an entire database. And while SQL statements are fun, it’s easy to drop all tables while futzing with a key or to corrupt an entire repository with a malformed query.

The goal of a NoSQL database, on the other hand, is to ensure ultimate scalability by making sure that the data is stored in a format that can be shared—or sharded—across multiple servers. NoSQL databases scale far more linearly than relational databases, i.e. ones that depend on various keys shared across tables. NoSQL databases come in a lot of flavors: 

  • Indexed document stores like MongoDB
  • Graph databases like Neo4j
  • Column stores like Cassandra
  • Time-series databases, which index data by time stamps, like InfluxDB
  • Hybrid forms that use multiple of the previous paradigms

Some of them even store data in a table format. What they all have in common, though, is that no matter what format they store data in, these databases don’t rely on the relational model of tables linked by shared keys.

NoSQL is no joke

Understanding NoSQL databases takes a minute. Traditional SQL databases use related tables connected by IDs. A single domain entity might be normalized, or fragmented, across multiple tables, which means the overhead necessary to assemble a complete and accurate record can be immense. Instead of, say, a table for user IDs and then a table for addresses, NoSQL lets you create a generalized user object that holds everything important about that user. Because each object is self-contained, you can easily copy or partition the database across multiple machines, which is what lets it scale and replicate.
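To make the contrast concrete, here is a minimal sketch using only Python's standard library. The table names, column names, and sample user are invented for illustration, and the "document" is just a Python dict serialized to JSON rather than a record in any particular NoSQL product.

```python
import json
import sqlite3

# Relational version: one user is split across two tables linked by an ID.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE addresses (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        city TEXT NOT NULL
    );
    INSERT INTO users (id, name) VALUES (1, 'Ada');
    INSERT INTO addresses (user_id, city) VALUES (1, 'London'), (1, 'Turin');
""")

# Reassembling the full record requires a join across both tables.
rows = conn.execute(
    "SELECT u.name, a.city FROM users u "
    "JOIN addresses a ON a.user_id = u.id "
    "WHERE u.id = ? ORDER BY a.id",
    (1,),
).fetchall()
print(rows)  # [('Ada', 'London'), ('Ada', 'Turin')]

# Document version: everything about the user lives in one object,
# so it can be stored, copied, or moved as a single unit.
user_document = {
    "_id": 1,
    "name": "Ada",
    "addresses": [{"city": "London"}, {"city": "Turin"}],
}
serialized = json.dumps(user_document)
print(serialized)
```

The trade-off is visible even at this size: the relational version needs a join to answer "who is user 1 and where do they live," while the document version answers it with one read but repeats the address structure inside every user.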

Further, a NoSQL database allows for fast access to lots of data. An SQL, or relational, database is excellent for data processing, meaning creating granular connections between pieces of data. A NoSQL database is great for finding one piece of data quickly and operating on it. There’s little to no searching; it just gives you the user data.

How it works

Many types of NoSQL databases are designed for fast data lookups. Instead of writing a complex query, many use a single value—a key, a timestamp, a document—and pull data stored under that value. That is, if you expect to want to know the details for a user account, then all the user data can be retrieved by reading that user’s record. The relationships between different records are unimportant and data can change—one record can hold multiple addresses while another can hold none. 
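As a rough illustration of the lookup-by-key idea, the sketch below uses an in-memory Python dict as a stand-in for a key-value store. The `KeyValueStore` class and the `user:42` key format are hypothetical, not the API of any real database.

```python
from typing import Optional


class KeyValueStore:
    """Toy in-memory store: every record is fetched by a single key."""

    def __init__(self) -> None:
        self._records = {}

    def put(self, key: str, record: dict) -> None:
        self._records[key] = record

    def get(self, key: str) -> Optional[dict]:
        # No query planning and no joins: one key, one record (or None).
        return self._records.get(key)


store = KeyValueStore()
# Records under different keys do not need to share a shape.
store.put("user:42", {"name": "Grace", "addresses": ["Arlington", "New York"]})
store.put("user:43", {"name": "Linus", "addresses": []})

print(store.get("user:42"))  # the whole user record in one read
print(store.get("user:99"))  # None: nothing stored under that key
```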

Because companies like Google and Amazon created these databases for their own massive data stores, the goal was to reduce the time needed to grab a piece of data. In fact, many NoSQL databases relax the traditional database expectations of atomicity, consistency, isolation, and durability (ACID) in favor of a far looser interpretation of data storage.

Using a NoSQL database doesn’t mean you can’t use SQL; SQL is just the query language. In fact, NoSQL and SQL can be complementary. Some NoSQL databases use SQL to search the data. Those that don’t can be either analyzed using a SQL query engine like Presto or sent through a data pipeline to more analyzable data warehouses. To be fair, a good data pipeline requires sophisticated ETL processing to get the end data into a usable state. 

Because an SQL database uses a schema, or fixed structure, changes are difficult. Say you’re running a production database full of a million records. Adding a single field can be a nightmare and, if the migration goes wrong, could trash the entire database. Further, connecting those million records through joins is hugely expensive. You can very easily search for a particular piece of data and connect it with another piece when you’re looking at a few records and a few tables. Multiply that out, however, and you’ve got a headache.

NoSQL databases like MongoDB just take data and store it. Want to add a field? Add it to the next record stored. Want to ignore a field? Just don’t read it. You can add multiple addresses to a user record, for example, or none. You can add a last name or avoid adding a last name. And because you can shard the data, you can send some data to a server in an untrusted jurisdiction and other data to a server in a trusted jurisdiction. The database considers each chunk as part of the whole.
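Here is a small sketch of what "just store it" looks like from the application side. A plain Python list of dicts stands in for a document collection; the field names and the `address_count` helper are made up for the example.

```python
# A list of dicts standing in for a schemaless document collection.
users = []

users.append({"name": "Ada", "last_name": "Lovelace", "addresses": ["London"]})
users.append({"name": "Cher"})  # no last name, no addresses
# A brand-new field appears on the next record; older records are untouched.
users.append({
    "name": "Grace",
    "last_name": "Hopper",
    "addresses": ["Arlington", "New York"],
    "nickname": "Amazing Grace",
})


def address_count(doc: dict) -> int:
    # The reader, not the database, decides what a missing field means.
    return len(doc.get("addresses", []))


counts = [address_count(u) for u in users]
nicknames = [u["nickname"] for u in users if "nickname" in u]
print(counts)     # [1, 0, 2]
print(nicknames)  # ['Amazing Grace']
```

Note where the flexibility ends up: every piece of code that reads these records has to cope with fields that may or may not be there.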

Querying data is a little harder. Apache’s Cassandra uses Cassandra Query Language or CQL which, interestingly, does not allow for joins. MongoDB just sends JSON objects in reaction to requests. Need all users in Ohio? MongoDB sends a big chunk of data. Want to delete all users in Spain? MongoDB will run the search and perform the action.
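The two requests above can be sketched in plain Python. This is not MongoDB's driver; the `find` and `delete_many` helpers below are hypothetical stand-ins that only imitate the idea of matching documents against a filter of field values, and the sample users are invented.

```python
def matches(doc: dict, query: dict) -> bool:
    # A document matches when every field in the query has the same value.
    return all(doc.get(field) == value for field, value in query.items())


def find(collection: list, query: dict) -> list:
    return [doc for doc in collection if matches(doc, query)]


def delete_many(collection: list, query: dict) -> int:
    before = len(collection)
    # Keep only the documents that do NOT match, modifying the list in place.
    collection[:] = [doc for doc in collection if not matches(doc, query)]
    return before - len(collection)


users = [
    {"name": "Ann", "state": "Ohio", "country": "US"},
    {"name": "Bo", "state": "Ohio", "country": "US"},
    {"name": "Cruz", "country": "Spain"},
    {"name": "Dee", "state": "Texas", "country": "US"},
]

ohio_users = find(users, {"state": "Ohio"})
deleted = delete_many(users, {"country": "Spain"})
print([u["name"] for u in ohio_users])  # ['Ann', 'Bo']
print(deleted, len(users))              # 1 3
```

Notice there is no join anywhere: each document is tested on its own, which is part of why this style of query spreads across servers so easily.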

Further, there is no need to ping every server to get a piece of data. The closest server will share nothing with other servers and instead return what it has. At some point all the data replicates but each server works in a vacuum. This means that changes to records on one server won’t affect a query made on another server. 

Benefits of NoSQL

NoSQL databases—MongoDB being the most popular—are great for scaling. Because the databases use sharding to partition data on multiple machines, you can ensure that the right data is in the right place at the right time. Further, an outage on one machine won’t take down the entire network. As data grows, the database can simply expand to another device as needed and shrink if things slow down. You can also, say, store geo-specific data in geo-specific servers, ensuring that calls from a certain country are faster on data relating specifically to that country.
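For a feel of how sharding routes data, here is a minimal sketch that picks a shard by hashing the record's key. The `ShardedStore` class and `shard_for` function are hypothetical, and hash-modulo routing is only one simple scheme chosen for illustration; it is not how any specific database is claimed to work.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    # Hash the key so the same key always lands on the same shard.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Toy store where each dict plays the role of one machine."""

    def __init__(self, shard_count: int) -> None:
        self.shards = [{} for _ in range(shard_count)]

    def put(self, key: str, record: dict) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = record

    def get(self, key: str):
        # Only the one shard that owns the key is consulted.
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(shard_count=4)
for i in range(100):
    store.put(f"user:{i}", {"id": i})

print(store.get("user:7"))                   # {'id': 7}
print([len(shard) for shard in store.shards])  # records spread over 4 shards
```

One caveat on the design: with plain modulo routing, changing the number of shards moves most keys to a different shard, so a system that grows and shrinks on demand would need a smarter placement scheme than this sketch uses. Geo-specific placement would swap the hash for a rule based on a field such as the user's country.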

Next, NoSQL databases offer high availability. Because the data is simply a single file, you can copy backups from other servers on the network. If a server fails, another server can take over that server’s shard and incorporate it. The data is constantly replicated and safe.

The problems

NoSQL databases don’t offer much in the way of transaction management or real coding. They are great for storing data that doesn’t change much or changes minutely with every transaction. NoSQL systems have been daunting for new users to approach. While hosted solutions are available, running your own simple instance isn’t as easy as, say, spinning up a MySQL server. 

Finally, because the entire database can have a lot of duplicated data, the actual database can be quite large. There are a number of types of NoSQL databases, with the document-based solution being the most prevalent. However, you can also use key-value databases like Redis as well as tabular ones like HBase and Accumulo.

A key-based solution like Redis is a bit more familiar to admins, and Redis in particular is performant because it stores much of its data in memory. Tabular databases like HBase offer a slightly different system that focuses on, according to the documentation, “very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.”

If NoSQL provides so much freedom and flexibility, why not abandon SQL entirely? The simple answer: Many applications still call for the kinds of constraints, consistency, and safeguards that SQL databases provide. In those cases, some “advantages” of NoSQL may turn to disadvantages. 

Traditional relational databases have long since caught up with the novelties that some NoSQL databases promised. They’ve massively improved their sharding functionality, so you’re no longer limited to scaling vertically. They’ve also introduced more lenient data types; you can now store JSON in PostgreSQL, MySQL, and SQL Server, giving you a MongoDB-like experience.

There are a number of problems with NoSQL databases, the first one being a dearth of sysadmins who can maintain them. Implementing a NoSQL database is a real endeavor and picking the right provider and manager is tough. If you’re in the position to need a massive database you might be in the financial position to pay for that expertise but smaller companies may have to wait.

Further, understanding the NoSQL model is difficult for developers used to coding for SQL systems. Because much of the structure must happen in the application, a developer could go into a dev project expecting certain constraints to be met or errors to throw on duplicate rows. Instead, this logic must be managed in the application itself. NoSQL solutions offer faster and more performant data storage but that’s about it. You, the developer, have to step in to manage the various relationships.
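The duplicate-row case makes a handy example. In the sketch below, SQLite (via Python's standard `sqlite3` module) rejects a duplicate on its own, while a plain list of dicts, standing in for a store with no constraints, accepts anything unless the application checks first. The `insert_unique` helper and the sample emails are hypothetical.

```python
import sqlite3

# SQL side: the database itself enforces uniqueness.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT NOT NULL UNIQUE, name TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("ada@example.com", "Ada"))
try:
    conn.execute("INSERT INTO users VALUES (?, ?)", ("ada@example.com", "Imposter"))
    sql_rejected = False
except sqlite3.IntegrityError:
    sql_rejected = True  # the duplicate never reaches the table

# Schemaless side: a bare collection will happily take the duplicate,
# so the uniqueness rule has to live in application code.
documents = []


def insert_unique(collection: list, doc: dict, key: str) -> None:
    if any(existing.get(key) == doc.get(key) for existing in collection):
        raise ValueError(f"duplicate {key}: {doc.get(key)!r}")
    collection.append(doc)


insert_unique(documents, {"email": "ada@example.com", "name": "Ada"}, "email")
try:
    insert_unique(documents, {"email": "ada@example.com", "name": "Imposter"}, "email")
    app_rejected = False
except ValueError:
    app_rejected = True

print(sql_rejected, app_rejected, len(documents))  # True True 1
```

Both versions end up rejecting the duplicate, but in the second one the guarantee only holds as long as every writer remembers to call the helper.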

Finally, because many NoSQL databases don’t guarantee consistency, rollbacks can be difficult or impossible if something goes wrong. Further, some parts of the database may return inconsistent information. One example experts offer is that an SQL database will return the right bank balance all the time, while a NoSQL solution might return a different balance depending on which server answers. If that sounds scary, you might want to rethink your choice. This happens in real life when you search for orders on ecommerce sites like Amazon. In some cases the data takes a few seconds to appear because it must be populated throughout the network.
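The bank balance example can be simulated in a few lines. The sketch below is a toy model, not any real database's replication protocol: the `EventuallyConsistentStore` class, the replica names, and the explicit `replicate` step are all assumptions made to show a stale read followed by convergence.

```python
class EventuallyConsistentStore:
    """Toy model: writes land on one replica and reach the others later."""

    def __init__(self, replica_names: list) -> None:
        self.replicas = {name: {} for name in replica_names}
        self.pending = []  # queued (target replica, key, value) updates

    def write(self, replica_name: str, key: str, value) -> None:
        self.replicas[replica_name][key] = value
        # Other replicas only learn about the write when replicate() runs.
        for name in self.replicas:
            if name != replica_name:
                self.pending.append((name, key, value))

    def read(self, replica_name: str, key: str):
        # Each replica answers from its own copy, without asking the others.
        return self.replicas[replica_name].get(key)

    def replicate(self) -> None:
        for name, key, value in self.pending:
            self.replicas[name][key] = value
        self.pending.clear()


store = EventuallyConsistentStore(["us-east", "eu-west"])
store.write("us-east", "balance", 100)
store.replicate()  # both replicas now agree on 100

store.write("us-east", "balance", 40)           # a withdrawal hits one replica
stale_balance = store.read("eu-west", "balance")  # still the old value
store.replicate()
fresh_balance = store.read("eu-west", "balance")  # caught up

print(stale_balance, fresh_balance)  # 100 40
```

The window between the write and the `replicate` call is the few seconds in which two servers can give two different answers.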

Believe the hype?

First, we have to remember that NoSQL databases are probably great for Amazon and Google but not so great for your side hustle. The performance benefits become more obvious the greater the scale of your database. Implementing them sounds like fun and it’s a great way to become conversant in a brand new technology, but you could probably do that by reading a few FAQs and trying out a MongoDB install for yourself. Using a NoSQL solution for a small ecommerce site or recommendation engine might not work out so well. A consensus has emerged in conferences and blogs that SQL is the gold standard—with a lot of emphasis on PostgreSQL—and you should use it by default, only deviating if you have good reasons to use NoSQL.

That said, big companies that need the kind of speed that NoSQL offers use these databases, and NoSQL skills are in demand. You can grab a nice salary if you can support someone else’s NoSQL database. By the time you’re ready to implement a NoSQL solution of your own—in a side project or over a massive data store—you’ll be fully versed in the pros and cons and, to paraphrase Kenny Rogers, you’ll know when to shard ‘em, know when to JOIN them, know when to use a schema, and know when to use none.
