By Simon Aubury and Kunal Tiwary
Imagine what your data teams could achieve with an extra two days per week. The thought is exciting, isn’t it? But where would this additional time come from? The answer may lie in better managing the quality of your data. The best way to do that is by catching issues early with the help of a rigorous data testing strategy.
The impact of moving data testing upstream shouldn’t be underestimated. Research suggests that data teams spend 30–40% of their time on data quality issues. That’s a significant amount of time they could be spending on revenue-generating activities like creating better products and features or improving access to faster and more accurate insights across their organization. And beyond productivity and organizational effectiveness, data downtime – caused by missing, inaccurate, or compromised data – can cost companies millions of dollars each year and erode trust in your data team as a revenue driver for the organization.
Missing values in datasets can cause failures in production systems, incorrect data can lead to the wrong business decisions, and changes in data distribution can degrade the performance of machine learning models. Irrelevant product recommendations could impact customer experience and lead to a loss of revenue. In sectors such as healthcare, the consequences can be far more significant: incorrect data can lead to the wrong medication being prescribed, triggering adverse reactions or even death.
In this chapter we’ll take a look at how to implement an effective data testing strategy. But first, we’ll look at the core considerations that lay the foundations for data quality.
The considerations and trade-offs of data quality
Instilling organizational trust in your data, and in the data team behind it, is vital. However, this can only be done by defining what we actually mean by data quality: which features and dimensions are fundamental to it. Here are five areas to consider:
- Freshness: The importance of data recency is context specific. A security monitoring or fraud detection application requires very fresh data so that aberrations are detected and dealt with quickly, whereas training a machine learning model can usually tolerate staler data (a minimal freshness check is sketched after this list).
- Accuracy: Life-impacting decisions, such as assessing drug effectiveness or making financial decisions, leave little room for inaccurate data. However, we may sacrifice some accuracy for speed when offering suggestions on an online retail store or streaming service.
- Consistency: Definitions of common terms need to be consistent. For example, what does “current” mean when we talk about customers: purchased last week, or within the last two months? And what constitutes a “customer”: someone already signed up, authenticated, a real human?
- Understanding the data source: The source of data or how data is captured can affect its accuracy. If a customer service representative at a bank selects a drop-down field in a hurry or without validation, a manual error could lead to incorrect account closure reports.
- Metadata (including lineage): Metadata is the foundation of quality output. It helps characterize data and helps your organization understand and consume it easily. Metadata should explain the who, what, when, how and why of data — it can even provide information on things like the ownership of data product code.
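To make the freshness dimension concrete, here is a minimal check sketched in Python with pandas. The `event_time` column name and the 15-minute SLA are assumptions for illustration; the point is that “fresh enough” is a number agreed with data consumers and then tested against.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

# Hypothetical freshness SLA agreed with data consumers.
FRESHNESS_SLA = timedelta(minutes=15)

def is_fresh(df: pd.DataFrame, ts_column: str = "event_time") -> bool:
    """Return True if the newest record arrived within the freshness SLA."""
    latest = pd.to_datetime(df[ts_column], utc=True).max()
    lag = datetime.now(timezone.utc) - latest.to_pydatetime()
    print(f"Newest record is {lag} old (SLA: {FRESHNESS_SLA})")
    return lag <= FRESHNESS_SLA
```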
Preparing a test strategy: Quality starts with a conversation
Our experience suggests that when data producer teams take ownership of the data testing process, data quality is more easily and consistently maintained. But a robust test strategy requires collaboration between data consumers and data producers. Data consumers need to address the following areas when developing a data test strategy:
- What quality features are important: is it completeness, distinctness, compliance or something else?
- Which business requirements are we building for?
- What target value should producers aim for and optimize toward?
- Which domain-driven quality metrics apply? For example, the needs of retail are quite different from the needs of real estate.
Data producer teams should also aim to capture finer-grained metrics such as:
- Error tolerance
- Ownership – if a quality check is broken, who is going to fix it, and when?
- How much is data quality worth? Will adding more data improve your analytics? What are the appropriate, agreed thresholds and tolerances for data quality for the business? Is 90%, 99% or 99.9% accuracy expected or acceptable for the end user of the data?
- Service level agreements (SLAs) – how much downtime does the business allow for each year?
As the owners of data quality, data producers are ultimately responsible for knowing these thresholds and meeting the agreed expectations. A lightweight way to capture them is a simple quality contract kept alongside the pipeline code, as sketched below.
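The sketch below is purely illustrative: the field names, values and channel are assumptions, not a standard format, but versioning something like this next to the pipeline makes the agreements explicit and reviewable.

```python
from dataclasses import dataclass

@dataclass
class QualityContract:
    dataset: str
    owner: str                           # who fixes a broken check, and when
    accuracy_threshold: float            # e.g. 0.999 == 99.9% accuracy expected
    completeness_threshold: float        # minimum share of complete (non-null) records
    max_downtime_hours_per_year: float   # the agreed SLA
    escalation_channel: str = "#data-quality-alerts"

# Hypothetical contract for a customer dataset.
customer_contract = QualityContract(
    dataset="customers",
    owner="crm-data-producers",
    accuracy_threshold=0.999,
    completeness_threshold=0.98,
    max_downtime_hours_per_year=8.76,    # roughly 99.9% availability
)
```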
Implementing the strategy
With a solid test strategy in place, the next step is to implement data quality tests using a write-audit-publish (WAP) pattern: write the data to a staging location, audit the results, and only then publish. This allows you to make corrections before the data reaches consumers.
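As an illustration, here is a minimal WAP sketch in Python with pandas. The staging and published paths, column names and audit rules are hypothetical; production implementations typically stage data in a branch, staging schema or temporary table in the warehouse or lakehouse.

```python
import pandas as pd

def audit(df: pd.DataFrame) -> list[str]:
    """Run quality checks against staged data and return failure messages."""
    failures = []
    if df["customer_id"].isnull().any():
        failures.append("customer_id contains nulls")
    if (df["order_total"] < 0).any():
        failures.append("order_total contains negative values")
    return failures

def write_audit_publish(df: pd.DataFrame) -> None:
    # 1. Write: land the new batch in a staging area, not the final table.
    df.to_parquet("staging/orders.parquet")

    # 2. Audit: run checks against the staged data.
    failures = audit(pd.read_parquet("staging/orders.parquet"))
    if failures:
        raise ValueError(f"Audit failed, not publishing: {failures}")

    # 3. Publish: promote the audited batch to the production location.
    pd.read_parquet("staging/orders.parquet").to_parquet("published/orders.parquet")
```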
Validating new data ingestion within continuous integration/continuous delivery (CI/CD) pipelines also ensures imported data goes through quality tests – and doesn’t break existing checks. There may be instances where a check should break the pipeline and send a high-urgency alert. If a check flags a negative house number in a pipeline running on real estate data, for instance, you need to stop the run and address the issue immediately; whereas if a house number is merely outside the expected range, the pipeline can continue with a simple alert.
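Building on the real estate example above, the sketch below shows one way checks can carry a severity: blocking failures stop the run, while warnings alert and let it continue. The column name, thresholds and alerting stub are assumptions for illustration.

```python
import pandas as pd

def run_checks(df: pd.DataFrame) -> None:
    checks = [
        # (description, mask of offending rows, severity)
        ("house_number is negative", df["house_number"] < 0, "error"),
        ("house_number above expected range", df["house_number"] > 20_000, "warn"),
    ]
    for description, bad_rows, severity in checks:
        if not bad_rows.any():
            continue
        if severity == "error":
            # Blocking failure: stop the pipeline run immediately.
            raise ValueError(f"{description}: {int(bad_rows.sum())} rows")
        # Non-blocking: raise an alert but let the run continue.
        print(f"WARNING: {description}: {int(bad_rows.sum())} rows")
```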
Making the results from these quality checks available to the wider business is incredibly important. For example, an address list that is 15% incomplete may delay the marketing team’s campaign launch, while a variance of 1% in an engineering measurement could jeopardize an expensive manufacturing process. Making quality levels visible as part of a metadata catalog can also be immensely valuable, as it allows data consumers to make informed decisions when considering the data’s use cases.
Many data quality frameworks today offer profiling reports that include the error and failure distribution. There are good open-source frameworks – we’ve had positive experiences with Great Expectations, Deequ and Soda – that can help you implement data quality tests through a range of features; a short example using one of them follows the list below. Depending on the level of integration you require, some key features to consider for your framework include:
- Whether the solution is open source, to avoid vendor lock-in
- How results can be visualized and shared to ensure transparency across the organization
- Data validation on incremental loads, so checks run on an ongoing basis at whatever the desired ingest frequency is
- Anomaly detection, to automatically catch and raise alerts for unexpected deviations above or below a given tolerance
- Integration with alerting and monitoring tools, to ensure visibility into the system without building observability integrations yourself (and to speed up rolling out the checks)
- Integration with the data catalog, to avoid building discoverability integrations yourself and to create visibility from the start
- Programming language support – choosing a framework that matches the technical stack of your current data ecosystem
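To make this concrete, here is a minimal sketch using the classic pandas-backed Great Expectations interface from its 0.x releases; newer versions use a different, context-based API, so treat the exact calls as illustrative rather than canonical. The sample data and expectations are assumptions.

```python
import great_expectations as ge
import pandas as pd

# Tiny sample batch with deliberate problems: a null and a duplicate ID,
# plus a negative house number.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, None],
    "house_number": [12, 7, -3, 42],
})

df = ge.from_pandas(raw)
results = [
    df.expect_column_values_to_not_be_null("customer_id"),
    df.expect_column_values_to_be_unique("customer_id"),
    df.expect_column_values_to_be_between("house_number", min_value=1, max_value=20_000),
]

for name, result in zip(["not_null", "unique", "in_range"], results):
    print(name, "passed:", result.success)
```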
Just enough testing at just the right time
You should consider implementing data checks at various stages, ensuring quality is maintained throughout your Extract, Transform, Load (ETL) pipelines, along with standalone tests for complex data transformations. Think of this as shifting data quality to the left – running checks as early as possible. Implement schema and basic integrity checks during raw ingestion itself, as well as during the transformation stage. Organizational data quality improves as you build up layers of tests along the data pipeline.
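For example, a schema and basic-integrity check at ingestion time might look like the sketch below. The expected schema is hypothetical; the point is to fail fast, before bad records flow into downstream transformations.

```python
import pandas as pd

# Hypothetical expected schema for a raw customer feed.
EXPECTED_SCHEMA = {
    "customer_id": "int64",
    "email": "object",
    "created_at": "datetime64[ns]",
}

def validate_on_ingest(df: pd.DataFrame) -> None:
    # Schema check: every expected column must be present with the right dtype.
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")

    for column, expected_dtype in EXPECTED_SCHEMA.items():
        actual = str(df[column].dtype)
        if actual != expected_dtype:
            raise TypeError(f"{column}: expected {expected_dtype}, got {actual}")

    # Basic integrity: primary key must be present and unique.
    if df["customer_id"].isnull().any() or df["customer_id"].duplicated().any():
        raise ValueError("customer_id must be non-null and unique")
```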
Treat tests in your development environment and pipelines just as you would in a production setting. Communicate data quality issues to source application teams and fix them at the source. For example, Know Your Customer (KYC) checks in your pipeline depend on non-nullable customer attributes; if the source application doesn’t enforce validation, null or empty values will be ingested, making the transformations pointless or invalid for many rows of data. Monitoring metrics such as row counts, totals and averages, and setting timely failure alerts, will also reduce time to resolution.
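A minimal sketch of that kind of monitoring is shown below. The thresholds, column names and alert stub are assumptions; in practice the alert would feed your existing monitoring or on-call tooling.

```python
import pandas as pd

def alert(message: str) -> None:
    # Stub: replace with your Slack, PagerDuty or email integration.
    print(f"ALERT: {message}")

def monitor_batch(df: pd.DataFrame, expected_min_rows: int = 1_000) -> dict:
    """Compute simple batch metrics and alert on threshold breaches."""
    metrics = {
        "row_count": len(df),
        "null_customer_ids": int(df["customer_id"].isnull().sum()),
        "order_total_sum": float(df["order_total"].sum()),
        "order_total_mean": float(df["order_total"].mean()),
    }
    if metrics["row_count"] < expected_min_rows:
        alert(f"Row count {metrics['row_count']} below expected minimum {expected_min_rows}")
    if metrics["null_customer_ids"] > 0:
        alert(f"{metrics['null_customer_ids']} rows with null customer_id")
    return metrics
```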
Build trust and save time through data quality
A robust data quality test suite focused on trust, relevance and repeatability will go a long way toward instilling confidence in your data consumers and reducing the amount of time spent on quality issues.
Organizations need to build infrastructure that enables data producer teams to fix and resolve issues and deploy changes quickly to continually maintain and evolve the data quality paradigm. Once they do that, they can begin focusing more on value-adding activities, rather than simply fixing problems.
Measuring and communicating data quality will help you achieve alignment on the state and business relevance of data across the organization. It will also allow data teams to make continuous improvements over time.