When someone asks me what I do for a living, I say that I am a data quality assurance (QA) engineer. They don’t really understand what I mean. “Well, I do data testing,” I try to explain, often to no avail. I have friends in technology and in software development who don’t quite understand what data testing is, why it’s necessary, or where it fits into the world of programming. It’s understandable, as data science is a brand new field, and even those of us who work with data daily have to remain open to anything and everything changing about the way we handle our work.
To understand how data testing works, we have to understand what data engineering is first. Then we can look into data quality, and how to measure it.
Data engineering and analysis
To understand where data testing begins, we’ll need to know how data is engineered and what makes it different from other kinds of programming, such as software development. Let’s start with what data is. Data is some kind of aggregated information kept in a business tool. Whether that tool is a spreadsheet or a database is up to the business, but that original place where data is created is where we start.
Raw data in a source isn’t much use to anyone, which is where data engineering comes in. In data engineering, we call the process of getting that data and making it useful extract, transform, load or ETL. Once the data is extracted from the sources, it can be transformed according to the needs of the business and loaded into business analysis tools. This is where business analysts and financial analysts have the opportunity to use the data sets to create reports, charts, and other requested metrics that inform business decisions.
T is for Transformation
Transformation is perhaps the most critical point in the data engineering process. Let’s use a retail business with multiple stores as a quick example. Say there are several older stores on an outdated point-of-sale (POS) system while the newer stores are running a more modern system. Transactions are recorded and stored differently in each type of POS system on different databases. If the business owner wants to see a weekly report on sales, this would require an aggregate of the transactions from both systems.
To get that, there will have to be a transformation process that can get transaction information from each POS system and bring them together in a way that makes sense. On top of that, questions will quickly emerge about the transaction data and its relation to a sales report. I’ll ask just one: how are returns counted in each system versus an actual sale?
Let’s take this example even further. The original POS system stores everything in a type of database that is incompatible with the newer POS system’s database, so there is no way to simply join the information. Now the transformation stage must include some kind of data conversion before the transactions can be brought together. Ultimately, the business owner just wants to receive the aggregated sales information in a report, which sounded simple in the beginning.
The need for the data (the sales report) and the technical work required to transform it into something that makes sense (bringing together disparate systems) are the two critical elements that define the meaning of the data set. In the case of our example, we were looking for the meaning of “sales” and then needed a report on it. As you can imagine, this kind of nebulous and subjective business definition can make testing data very tricky.
Measuring data quality
Now we have an idea of what data and data engineering look like. We can’t define data quality without some kind of benchmark to measure against, and generally within testing processes that benchmark is a variety of reportable metrics. So how do we find something measurable to validate a product in the data world?
Six dimensions of data quality
The current industry standard in data validation is to use some form of the six dimensions of data quality to test data models, pipelines, architecture, and more. These metrics for data quality were initially defined in a data quality paper called “The Six Primary Dimensions for Data Quality Assessment,” written by the UK chapter of the Data Management Association (DAMA) in 2013. The six dimensions are a largely agreed upon series of validation metrics used to examine the quality of any given data set. They help data quality engineers and data engineers create measurable validation metrics that can be improved on.
The six dimensions are:
- Consistency: If data is copied across multiple databases, systems, tables, and reports, it should remain the same, making it consistent. For example, a customer’s current ZIP Code should always be the same five digits (nine if you are using ZIP+4) no matter where you find it.
- Accuracy: Perhaps the most nebulous of data quality metrics, accuracy is how well the data in question represents the real world occurrence or object. Let’s say there is a column representing the aggregate dollar amount for all of a customer’s transactions and a column representing the sum total of transactions in a table. Each of those values should be able to be clearly traced back to the sources, where it can be proven that the totals are accurate to the real world transactions that occurred.
- Validity: In any given field in a data set, it’s likely there is some kind of data type requirement. You wouldn’t ever expect to see numbers in a state field where the field restrictions are two-letter representations of US states like NY, CA, or IL. If there were an integer in that field, it would break the validity of the data.
- Uniqueness: For each unique record expected in a database there should be one field that uniquely identifies each given record, e.g. a customer account number for an online shopping database. That account number’s uniqueness may be critical to identifying repeated transactions for a single customer account.
- Completeness: Data is incomplete if there are any missing critical fields. Perhaps in a business transaction record there should be a timestamp for every transaction. If that timestamp is ever missing, the transactions data set is incomplete.
- Timeliness: What are the expectations on receiving new data into each report? The timeliness of data is defined by the needs of the business. For example, if it’s necessary for a data set to be refreshed daily, the testing metric for timeliness on that data set is also daily.
For any given data set, data tests and validation should cover each of these dimensions. Especially automated unit tests, but we’ll get to that later. If you’re interested in a deeper dive on the six dimensions of data quality, read this great piece in Towards Data Science.
Data testing in the engineering process
Now that we know about the six dimensions of data quality, how data engineering generally works, and the critical importance of business definitions of data needs, the task becomes bringing all of those things together to create a testing plan. Data quality engineers are at the heart of the data engineering process; they support the engineers’ technical work to deliver the requested data set, and work with business analysts to validate that data.
Types of data QA tests
In the software testing world, there are several common types of useful quality tests that identify bugs, confirm working components, and investigate the software’s expected behavior. These types of tests are still critically useful in the data testing world, so if you have any awareness of these test categories then you already understand something about data testing.
These include but are not limited to:
- Unit tests: Bite-sized tests embedded in the data modeling code, used to identify small breakpoints inside critical code blocks necessary for basic functionality. For example, perhaps there is a column where no NULL values should exist in the data. A quick unit test can check for NULLs in a field and make sure there are none.
- Integration tests: These are tests that look for the sum parts of a program or data pipeline to work rather than just one part of it. In a data pipeline example, integration tests would check that the entire ETL process could be executed successfully from start to finish.
- Smoke tests: Quick tests used to efficiently check the most important and usually common parts of the data pipeline for failure. The term originated in computer hardware testing, when the first initial test was to plug the machine in and see if any of the mechanical components generated smoke. If not, the hardware passed the test. If there was smoke, however… As you can imagine, they’d turn the machine off and try again.
- Regression tests: A series of tests used to check the core operations of a program, called “regression” because the test suite looks at standard functional parts of the code that should never change. Usually a staple suite used before a production release. In the case of data, regression usually covers the mission-critical transformations.
- Feature tests: Tests used to check on any new components (“features”) that have been added to a program on top of its currently released operations. These get added to regression test suites if new features are critical to production releases and should remain unchanged moving forward.
Maybe you’re familiar with a few of these, and others. These are all common, useful types of tests that many software development teams use on a regular basis. And they can and are applied to data testing.
Unit tests as data validation’s secret weapon
As with pretty much every QA role, data QA engineering should add value to the engineering team’s work and deliver feedback in a useful way. Data QA almost always begins with manual testing, especially in enterprises with legacy database and data warehouse technology. There will likely be a few hundred too many SQL scripts on the laptop of the data QA engineer. But that manual testing still needs to efficiently check for the six dimensions of data quality without becoming a bottleneck for production releases. All of that being said, data testing automation is entirely possible and especially when it comes to unit tests and assertions.
In my experience, one of a data QA engineer’s biggest roles is evangelizing unit tests built into the data pipeline. When implemented by a data engineer during the development of the data transformation, unit tests can catch data errors before data QA engineers even look at the data set. Typically data engineers are not used to unit tests as that is more of a software development practice, but I’m not the only one who thinks data engineering needs more unit testing culturally. Popular open source frameworks like data build tool (dbt) have built-in unit tests that can prevent problems with completeness, uniqueness, and timeliness. Other open source validation tools like Great Expectations layer a suite of data assertion unit tests on top of the data pipeline.
Automation is a critical principle in the world of data quality, and the execution of the six dimensions of data quality. For example, using the blend of unit testing and testing for data completeness, data quality engineers can write a small, quick automated test that checks for any NULL values in a critical field. The more automated tests you have in your data pipeline checking for critical data quality dimensions in a variety of ways, the less work the data engineers have to do.
What if, instead of generating a table and looking at it every time using SQL and a database connection, a data engineer had a built-in regression test suite that ran her daily checks for her? These are the kinds of useful automation foundations and test engineering that data quality engineers do.
Data testing is a unique field that is growing and changing on a daily basis. There aren’t too many widely accepted data quality standards, and even those like the six dimensions of data quality are debated over. More data science fields like machine learning and artificial intelligence (AI) are growing and creating new ways to validate data accuracy, consistency, completeness, and more.
What we do know is that data quality right now depends heavily on the subjective meaning of the data set being requested and the needs of the people on the end of the data pipeline. That makes it tough to find the right benchmarks for testing and improving our data quality, but we can still use our knowledge of useful test types and the dimensions of data quality to validate the data that we use every day. And as our understanding of how to use data evolves, so will data quality metrics and our understanding of data testing.