If the past decade saw every company become a tech company, today every company aspires to be a data-driven one. Enterprises are gathering vast amounts of data from diverse sources in real time in the hope of gleaning insights for decision making.
But one of the biggest challenges in leveraging data is that not all data is useful data.
When we meet enterprise leaders, we often hear that they don't have an overall picture of their business and can't connect the dots with the data available to them. They're unable to carry out effective analysis of failures. Some complain that they don't know whether they can monetize their data, or that they can't use it as a differentiator to stay ahead of the competition. Worst of all, their teams don't trust the data from their platform and still fall back on old spreadsheet-based methods to make decisions.
If you look carefully, all the above are data quality problems. And none of them popped up overnight.
Systemic issues causing data quality problems
If we look under the hood, poor data quality is usually the result of several systemic issues. Highlighted below are examples of such problems. Let's put them in perspective using a problem domain – in this case, product pricing for a retailer.
Addressing data quality issues late in the process: instead of taking care of data quality issues in the source systems, teams spend a disproportionate amount of time tinkering with data in downstream systems. This wastes time and hurts overall data quality. If the prices or sales information fed into the pipeline for a pricing algorithm is incorrect to begin with, fixing quality issues in the price recommendations won't improve the overall quality.
Missing context: pushing data quality interventions downstream causes another problem – context gets lost, and addressing one data quality issue myopically often creates more data quality issues elsewhere. Erroneous prices or sales data can be used elsewhere, say for reporting, most often after further transformations. The consumers of this derived data would not know whether values that seem incorrect are in fact correct given the right context. The source system for prices might have applied short-term discounts, resulting in seemingly outlier price points.
Lack of strategy: data teams often address data quality tactically. Without an integrated framework – one that covers the entire ecosystem of products and platforms in an organization – data quality is inconsistent at best. That said, the idea is not to have a single, central place for all data quality checks and tests, but to provide a self-service, integrated framework for executing data quality checks that the entire data platform can use. For instance, if product prices are used in ten different places, with possibly ten different rules, can we use the integrated framework to execute those data quality checks?
Setting non-uniform definitions: organizations don’t invest the time and energy to define standards and redress mechanisms for data quality issues. This leads to a lack of trust in the underlying data itself.
Building point solutions: when data quality is approached from a single user's point of view, the solutions built tend to be tactical and ineffective at identifying the root cause. For example, fixing incorrect price recommendations from the pricing engine by applying rules on the recommended prices doesn't address the root cause, which could be incorrect input to the algorithm or an issue in the algorithm itself.
Underestimating impact - business and people: organizations forget that low data quality doesn't just affect downstream systems, such as business intelligence and predictive dashboards, but team sentiment as well. When teams start losing trust in the data platform, it becomes a huge impediment to change management.
To avoid these issues, you need to treat your data as code and apply the same rigor to data quality as you would to code quality – with a test pyramid of automated tests, fitness functions and so on.
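As a minimal sketch of what that can look like in practice – the dataset, column names and thresholds here are assumptions for illustration, not part of the retailer example above – data quality checks can be written and run like ordinary automated tests alongside the code:

```python
# test_price_feed_quality.py
# Hypothetical data quality tests for a price feed, written as plain
# automated tests so they run in the pipeline the same way unit tests do.
import pandas as pd


def load_price_feed() -> pd.DataFrame:
    # In a real pipeline this would read from the source system;
    # here we use a tiny in-memory sample for illustration.
    return pd.DataFrame(
        {
            "sku": ["A-1", "A-2", "A-3"],
            "price": [19.99, 4.50, 129.00],
            "currency": ["USD", "USD", "USD"],
        }
    )


def test_prices_are_positive():
    prices = load_price_feed()["price"]
    assert (prices > 0).all(), "Prices must be strictly positive"


def test_no_duplicate_skus():
    skus = load_price_feed()["sku"]
    assert skus.is_unique, "Each SKU should appear only once in the feed"


def test_currency_is_standardized():
    currencies = load_price_feed()["currency"]
    assert set(currencies) <= {"USD"}, "Unexpected currency codes in the feed"
```

Checks like these sit at the base of the pyramid; broader fitness functions, such as distribution or freshness checks across whole datasets, sit above them.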
Before we talk about how to fix this and establish trust in our data, let us define what quality data means.
What is data quality?
Data quality refers to the ability of a given dataset to fulfill an intended purpose.
It is the degree to which a set of inherent characteristics fulfills the requirements of a system, determines its fitness for use and ensures its conformance to those requirements. It is important to remember that data quality is an ongoing process: what is good quality data today might not be tomorrow, because today's requirements will not stay the same.
The analogy I use to remind myself of this is 'potato quality in a fast-food chain'. Big, round potatoes are great quality input if the intended purpose is to make french fries. However, those same qualities might be irrelevant if the intended purpose is to make mashed potatoes.
In effect, it is important to define data quality metrics for a given purpose/use-case.
Big data quality
Big data quality is complex and here is why:
Volume - how do we run comprehensive data quality controls over petabytes of data?
Variety - how do we cater to multiple types of data: structured, semi-structured and unstructured?
Velocity - how do we measure data quality fast enough to keep up with high-velocity data?
Veracity - how do we handle inherent impreciseness and uncertainty?
Big data ecosystems bring in additional complexities:
Volume makes the impact larger - with increasing volumes of data, comprehensive data quality assessment is next to impossible, so enterprises work with approximations. Often they end up approximating the quality metrics themselves, using probabilistic techniques and confidence intervals to cope with the scale. This compromises data quality.
For a giant eCommerce retailer, it is normal for its platforms to generate data on the order of gigabytes per second. This data includes orders and transactions, product reviews, returns and refunds, and more. Validating every record at such volumes, although possible, is not practical.
We might want to use key performance indicators (KPIs) aggregated over this data, or over batches of it – such as dollar sales per day or per hour – instead of validating each record. If the aggregated values stay within the normal range (no outliers), that's a good litmus test for the quality of the sales figures.
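As a rough sketch of that litmus test – the order table, column names and thresholds are assumptions for illustration – we can aggregate dollar sales per day and flag days that fall outside the normal range:

```python
# Flag days whose aggregated dollar sales look like outliers,
# instead of validating every individual order record.
import pandas as pd


def daily_sales_outliers(orders: pd.DataFrame, z_threshold: float = 3.0) -> pd.Series:
    """Return daily sales totals that deviate strongly from the norm."""
    daily = (
        orders.assign(day=orders["order_ts"].dt.date)
        .groupby("day")["amount"]
        .sum()
    )
    z_scores = (daily - daily.mean()) / daily.std(ddof=0)
    return daily[z_scores.abs() > z_threshold]


# Illustrative usage with synthetic data:
orders = pd.DataFrame(
    {
        "order_ts": pd.to_datetime(
            [
                "2023-01-01 10:00",
                "2023-01-02 11:00",
                "2023-01-03 09:30",
                "2023-01-04 15:00",
                "2023-01-05 12:00",
            ]
        ),
        # the last day's total is far outside the normal range
        "amount": [200.0, 195.0, 210.0, 205.0, 5000.0],
    }
)
# With so few synthetic points we lower the threshold for the demo
print(daily_sales_outliers(orders, z_threshold=1.5))
```

In practice the 'normal range' would come from historical baselines or seasonality-aware models rather than a simple z-score, but the principle is the same: validate aggregates, not every record.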
Variety makes the process complex - the variety of data being collected (from internal and external sources) has increased manifold – data from the internet and mobile internet, data from the Internet of Things, data from various industries, scientific experimental and observational data. This requires every quality assessment to consider all data sources and types.
Velocity makes the needs instant - businesses are moving from batch processing to near real-time and real-time insights. This requires systems that assess data quality using techniques such as sampling and structural validation, rather than deeper semantic checks, to keep up with the speed.
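A minimal sketch of a structural check cheap enough for a streaming path – the record layout here is an assumption for illustration – only validates the shape and types of each record and defers semantic checks to slower batch jobs:

```python
# Minimal structural validation for high-velocity records:
# check shape and types only, defer semantic checks to batch jobs.
EXPECTED_FIELDS = {"sku": str, "price": float, "store_id": int}


def is_structurally_valid(record: dict) -> bool:
    """Cheap per-record check suitable for a streaming path."""
    if set(record) != set(EXPECTED_FIELDS):
        return False
    return all(
        isinstance(record[name], expected_type)
        for name, expected_type in EXPECTED_FIELDS.items()
    )


# Illustrative usage:
good = {"sku": "A-1", "price": 19.99, "store_id": 42}
bad = {"sku": "A-2", "price": "19.99"}  # wrong type, missing field
print(is_structurally_valid(good), is_structurally_valid(bad))  # True False
```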
Veracity makes the problem more difficult to solve - veracity refers to the inherent noise, abnormality and imprecise or incorrect values in data. This is the most complex dimension. A typical source of veracity problems is human data entry: a point-of-sale terminal that records sales transactions, prices and discounts via barcode scanners will be much more accurate than manually entered sales transactions. I call this the most complex dimension because there are no direct ways to identify such inaccuracies.
Dimensions of data quality
Data quality has ten key dimensions and each of them can be further unpacked into its tenets. The intended purpose of the data defines the quality we need in each dimension, its tenet and the degree to which we need it.
Below are the ten dimensions:
Relevance
Integrity
Availability
Usability
Trustworthiness
Class balance
Consistency
Standardization
Reliability
Uniqueness
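Because the intended purpose determines how much of each dimension we need, it helps to make those expectations explicit per use case. A small sketch of what such a definition could look like – the use cases, chosen dimensions and thresholds are assumptions for illustration:

```python
# Hypothetical per-use-case data quality expectations.
# Dimension names come from the list above; thresholds are illustrative.
QUALITY_EXPECTATIONS = {
    "pricing_algorithm_input": {
        "consistency": 0.99,       # share of records passing cross-source checks
        "availability": "hourly",  # how fresh the data must be
        "uniqueness": 1.0,         # no duplicate SKUs tolerated
    },
    "monthly_sales_report": {
        "consistency": 0.95,
        "availability": "daily",
        "uniqueness": 0.99,
    },
}


def required_level(use_case: str, dimension: str):
    """Look up how much of a quality dimension a given use case demands."""
    return QUALITY_EXPECTATIONS[use_case][dimension]


print(required_level("pricing_algorithm_input", "uniqueness"))  # 1.0
```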
Data quality framework
To solve problems of big data quality, two things have worked for us:
Configurable data quality framework that suits the enterprise's needs
Clear action plan to operationalize this data quality framework
I will explore both in the next part of this series. Read on.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.