How much can you trust your data?
Published: July 23, 2020
Data is the fuel for intelligent decision making for both humans and machines. Just like high quality fuel ensures that jet engines run efficiently and reliably in the long run, high quality data fuels effective and reliable decision making.
Whether it is for decisions taken by corporate executives, frontline staff or intelligent machine learning models, any intelligent enterprise needs high quality data to operate. But unfortunately, data quality issues are very widespread. In a survey conducted by O’Reilly Media in November 2019, only 10 percent of responding companies stated that they do not face data quality problems.
Why does data quality matter so much?
Let’s have a look at three typical data case studies from different Thoughtworks engagements:
- Corporación Favorita, a large Ecuador-based grocery retailer, needs to predict how much of a given product will sell in the future, based on historical data. (Thoughtworkers participated in the linked Kaggle competition.)
- A large German automotive company, Client Two, needs a product information system that allows their clients to configure the car they want to buy.
- A large online retailer, Client Three, needs dashboards to track sales and logistics KPIs for their products.
For Corporación Favorita, poor data quality leads to inaccurate sales forecasts, and with them to ordering too much or too little of a product. For Client Two, a mismatch between the data in the product information system and what can currently be built in the factories can result in desired car configurations mistakenly not being offered, or in customers buying cars that cannot actually be produced. This leads to lost sales, customer frustration and possibly legal claims.
And for Client Three, the online retailer, poor data quality will lead to company executives, sales managers, and logistics managers drawing incorrect conclusions about the state of the company’s operations. This could result in reduced customer satisfaction, loss of revenue, increased costs, or misdirected investments.
In all of these cases, low data quality leads to poor business decisions being taken, resulting in undesirable business outcomes such as decreased revenue, customer dissatisfaction and increased costs. Gartner reported in 2018 that surveyed organizations believed they, on average, lost $15 million per year due to data quality issues.
Efforts to address data quality can therefore directly help make companies more effective and profitable.
How good is your company’s data quality?
In a modern business, everyone works with data in one way or another, be it producing, managing or using it. Yet like water for fish, we often fail to notice data because it is all around us, and just as fish suffer from bad water quality, we suffer if our data quality decreases.

Unlike fish, though, we can all contribute to addressing data quality issues, and that process starts with assessing the current state of our data quality.
Making data quality measurable
Loosely following David Garvin’s widely referenced definition of quality in “Managing Quality” (1988), we can distinguish between three perspectives on data quality:

Data consumers: Usage perspective
- Does our data meet our consumers’ expectations?
- Does our data satisfy the requirements of its usage?
Business: Value perspective
- How much value are we getting out of our data?
- How much are we willing to invest into our data?
Data producers: Specification perspective
- To which degree does our data fulfill specifications?
- How accurate, complete, and timely is our data?
From these perspectives, we can derive the data quality dimensions that matter to us, such as accuracy, completeness, or timeliness, and create specific metrics to measure the quality for our chosen dimensions. Once we know how good our data quality is along those dimensions, we can design specific improvement strategies for each one.
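As a minimal sketch of such a metric, assuming a Spark DataFrame and a made-up column name, the completeness of a single field can be measured as the fraction of non-null values:

import org.apache.spark.sql.DataFrame

// Completeness of a single column: fraction of rows in which the value is not null.
def completeness(data: DataFrame, column: String): Double = {
  val total = data.count()
  if (total == 0) 1.0
  else data.filter(data(column).isNotNull).count().toDouble / total
}

// completeness(partsData, "part_description") == 1.0 would mean no missing descriptions.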
Automating the assessment of data quality
Assessing data quality can be a labor-intensive and costly process. Some data quality dimensions used in practice can only be assessed with expert human judgement, but many others can be automated with a little effort. An early investment in automating data quality monitoring can pay continuing dividends over time.

Dimensions that can be measured at the data point level include the accuracy of values and the completeness of field values. At the dataset level, they include the completeness of the dataset, the uniqueness of data points, and the timeliness of the data.
Dimensions that require human judgement usually need additional context or a subjective value judgement to assess. Examples of such dimensions are interpretability, ease of understanding, and security.
For those dimensions that we can assess automatically, we can make use of two different validation strategies: Rule-based checks and anomaly detection.
Rule-based checks work well whenever we can define absolute reference points for quality. They are used for conditions that must be met in any case for data to be valid. If these constraints are violated, we know we have a data quality issue.
Examples at the data point level are:
- Part description must not be empty
- Opening hours per day must be between 0 and 24
Examples at the dataset level are:
- There must be exactly 85 unique shops in the dataset
- All categories must be unique
- There must be at least 700,000 data points in the dataset
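As a rough sketch of how such rules translate into code, assuming Spark DataFrames and hypothetical column names based on the examples above, each rule is a predicate that the data must satisfy:

import org.apache.spark.sql.DataFrame

// Data point level: no part may have an empty description.
def partDescriptionsPresent(parts: DataFrame): Boolean =
  parts.filter(parts("part_description").isNull || parts("part_description") === "").count() == 0

// Data point level: opening hours per day must be between 0 and 24.
def openingHoursValid(shops: DataFrame): Boolean =
  shops.filter(shops("opening_hours") < 0 || shops("opening_hours") > 24).count() == 0

// Dataset level: there must be at least 700,000 data points.
def datasetLargeEnough(data: DataFrame): Boolean =
  data.count() >= 700000L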
Anomaly detection, "the identification of rare items, events or observations which raise suspicions" (Wikipedia), works well whenever we can define data quality relative to other data points. It is often used for detecting spikes and drops in time series of metrics data.
An identified anomaly only tells us that there might be something wrong with the data: it might arise from a data quality issue, or it might reflect a genuine outlier event recorded in the dataset. A detected anomaly should therefore be treated as a starting point for investigating what happened.
Examples of anomaly-based validation constraints are:
- The number of transactions should not change by more than 20% from one day to the next
- The number of car parts on offer should only be increasing over time
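As a minimal sketch of the first constraint, with illustrative variable names and numbers, an anomaly check compares today's value of a metric to yesterday's instead of checking it against a fixed threshold:

// Flags an anomaly if a metric changed by more than maxRelativeChange
// (e.g. 0.2 for the 20% rule above) compared to its previous value.
def isAnomalous(previous: Double, current: Double, maxRelativeChange: Double): Boolean =
  if (previous == 0.0) current != 0.0
  else math.abs(current - previous) / previous > maxRelativeChange

// isAnomalous(previous = 120000, current = 80000, maxRelativeChange = 0.2) // true -> investigate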
Case study: Assessing data quality with deequ
Deequ is a Scala library for data quality validation on large datasets with Spark, developed by AWS Labs. Based on our experience, we recommend the library on the Thoughtworks Tech Radar for organizations to “assess”.

We recently used deequ at the online retailer introduced as Client Three. The data quality gates implemented with deequ prevent bad data from feeding forward to external stakeholders.
The library provides both rule-based checks and anomaly detection. Validation can be implemented with a few lines of code. Here is an example for a rule-based check:
val verificationResult = VerificationSuite()
  .onData(data)
  .addCheck(
    Check(CheckLevel.Error, "Testing our data")
      .isUnique("date")) // should not contain duplicates
  .run()

if (verificationResult.status != CheckStatus.Success) {
  println("We found errors in the data:\n")
}
What is happening here:
- We create an instance of the core validation class VerificationSuite. We can chain all operations needed to define our validation as method calls to this object.
- We configure the dataset we want to run our validation on.
- We add a uniqueness check as the validation we want to use.
- We run the validation.
- We check whether the validation succeeded. If not, we can react to the failure. In the example, we just print an error message, but we could also log a message, trigger our monitoring system, send a notification, etc.
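For example, instead of only printing a generic message, we can inspect the verification result to see which constraints failed and why, following the pattern from deequ's documentation:

import com.amazon.deequ.constraints.ConstraintStatus

// Collect the results of all constraints across all checks and report the failed ones.
verificationResult.checkResults
  .flatMap { case (_, checkResult) => checkResult.constraintResults }
  .filter { _.status != ConstraintStatus.Success }
  .foreach { result =>
    println(s"${result.constraint} failed: ${result.message.getOrElse("no details")}")
  }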
A validation using anomaly detection can be implemented with just a few extra lines of code:
val verificationResult = VerificationSuite()
  .onData(todaysDataset)
  .useRepository(metricsRepository)
  .saveOrAppendResult(ResultKey(System.currentTimeMillis()))
  .addAnomalyCheck(
    RelativeRateOfChangeStrategy(maxRateDecrease = Some(0)),
    Size())
  .run()

if (verificationResult.status != Success) {
  println("Anomaly detected in the Size() metric!")
}
Here, the main difference is the addition of a repository. As anomaly detection involves comparing the current metrics to a previous state, we need to store and access that previous state, which is what the repository handles. The anomaly detection itself is configured very similarly to the static rule check, by calling addAnomalyCheck.
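The repository can be backed by different stores. Here is a minimal sketch, assuming deequ's in-memory and filesystem repository implementations, an illustrative storage path, and an existing SparkSession named spark:

import com.amazon.deequ.repository.memory.InMemoryMetricsRepository
import com.amazon.deequ.repository.fs.FileSystemMetricsRepository

// Keeps metrics only for the lifetime of the application -- fine for experimenting.
val inMemoryRepository = new InMemoryMetricsRepository()

// Persists metrics as JSON so that later runs can compare against earlier ones.
val metricsRepository = FileSystemMetricsRepository(spark, "s3://my-bucket/metrics/deequ-metrics.json")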
Deequ provides a lot of different metric analyzers that can be used to assess data quality. They operate on columns of the dataset or the entire dataset itself and can be used for both rule- and anomaly-based validation.
For example, for the completeness dimension, there are analyzers for the completeness of fields and the size of the dataset. For the accuracy dimension, we could use the various statistical analyzers deequ provides to describe the data properties we need.
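As a sketch of how analyzers can be used directly, with hypothetical column names and an existing SparkSession named spark, deequ's AnalysisRunner computes the selected metrics over a dataset:

import com.amazon.deequ.analyzers.{Completeness, Mean, Size, Uniqueness}
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame

val analysisResult: AnalyzerContext = AnalysisRunner
  .onData(data)
  .addAnalyzer(Size())                           // number of rows in the dataset
  .addAnalyzer(Completeness("part_description")) // fraction of non-null values
  .addAnalyzer(Uniqueness(Seq("shop_id")))       // fraction of unique values
  .addAnalyzer(Mean("opening_hours"))            // a simple statistical property
  .run()

// Turn the computed metrics into a DataFrame for inspection or storage.
successMetricsAsDataFrame(spark, analysisResult).show()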
In our project, we found deequ to be worth exploring further. Some of its strengths are:
- Fast execution of rule check and anomaly detection steps
- Validation can be implemented with very little code
- Lots of metric analyzers to choose from
- The library code is fairly easy to understand when you need to dig deeper than the documented examples
- Code and documentation are under very active development
As both the code and the documentation are under active development, deequ seems to us like a promising project overall.
Challenges in modern data quality assessment
Tooling, however, is only one of the challenges for effective data quality assessment. I see three other areas as big challenges:

Detecting data quality issues as close to their source as possible. As with software defects, the earlier we can detect data quality issues, the easier and cheaper they are to fix. In a typical data pipeline, a data point will be combined, aggregated and otherwise transformed several times. Each transformation step multiplies the effort required to detect and trace quality issues. Coordinated data quality gates should therefore be implemented along the entire production pipeline of a data product, and ownership of data quality needs to reside with each data product owner along the pipeline.
Identifying the most impactful data quality issues with relevant validation scenarios. The most impactful data quality issues are those with the biggest effect on the business. Quality gates therefore need to be defined less by what is technically easy to validate, and more by the usage scenarios for the data product. Defining those quality scenarios requires not only a good understanding of the data but, most importantly, a strong understanding of the business domain.
Complementing automated validation with manual validation efficiently. As mentioned above, only some of the desired data quality dimensions can be assessed with automated validation. Depending on the quality scenarios, we might need additional manual validation. Manual validation usually involves more effort and is not as easily repeatable. Therefore, we need to figure out in which cases manual validation is really required and how to integrate it efficiently into the release process for a data product.
Where should you start assessing your data?
In typical organizations with lots of data sets, assessing all of your data products will be overwhelming. To define priorities, you could ask yourself:
- Which KPIs are most sensitive to data quality concerns?
- Which data that we provide to customers or partners is essential in core business processes?
- Which intelligent services are embedded in core business processes?
Overall, data quality assessments are an effective, but often overlooked way to make your company’s data products more trustworthy. Detecting and fixing data quality issues could help you reduce costs, increase customer satisfaction, and improve revenue, which will ultimately contribute to your company’s overall performance.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.