Fertile data: why building quality into your process is important
Published: July 19, 2018
When it comes to data quality, most technologists are familiar with the adage “Garbage in, garbage out”; and yet today, most organizations appear to be content with wallowing in junk. Just 3% of organizations examined by Harvard Business Review had data that met basic quality standards.
That’s worrying. If you’re planning to increase your use of AI and machine learning to automate some decision making, the actions you take will be fatally compromised if your data isn’t good. Little wonder, then, that senior IT decision makers see data quality as a priority.
There’s often a fear that data quality initiatives will become a Sisyphean task — where endless effort is expended to little effect. But it need not be that way. Huge quality improvements can be realized by applying the principles we routinely use in software development to data.
Fertile data: soil not oil
We’re often told that data is the new oil, but it’s a pretty weak analogy. Having lots of data won’t make you rich: many of the clients we work with have an abundance of data, yet struggle to derive significant value from it.

Instead of thinking about data as oil, we prefer to think of it as soil: something that rewards the time and effort you invest in cultivating it. And that’s where data quality comes in. Even if data quality isn’t the primary driver for an initiative, by making it a routine aspect of the way you work, you can make small investments that pay off in the long run.
Recently, I was working for a large, multinational client that wanted to improve the effectiveness of its online store: to better understand the customer payment conversion funnel and to spot any important patterns in incomplete journeys.
And these folks have a lot of user behavior data: they capture events for everything you click or type while browsing their store, from the buttons you click to buy a product to the number of scrolls or swipes you make while reading a product description. There are petabytes of anonymous user behavior data, and it grows by hundreds of terabytes every day.
We’d spoken to the client about the customer journey and the happy path customers would take through the store. But as I went through the data to find examples to validate my understanding, I kept coming up with instances where that didn’t happen: customers were following some path that we didn’t properly understand. And that’s one of the challenges with big data: you find your edge cases actually happen quite frequently.
The problem for us was that in trying to come up with ways to improve this web store, we now needed to understand these edge cases. Why were customers making unexpected decisions to abandon a purchase? Were we seeing this unusual behavior because of a bug? Was it because we didn’t capture an important bit of information? Were we capturing the wrong information?
To investigate this, we weren’t just looking at user clicks, but also at all of the logs for back-end processes triggered at different points throughout the journey. We were actually doing a lot of QA just by trying to understand this data: going back to engineers with bugs on both the server side and the front end. They were aware of these small issues but didn’t really see the positive impact of fixing them.
Once engineers on both sides were able to see the cumulative effect of these minor data issues, they were keen to get them fixed. But ideally, you should be looking to fix these systems before they’re put into production.
Cultivating your data
You might take the view that data quality is important, but that it’s something that exists separately from your day-to-day work, and that there’s a data quality department or function that will take care of it at some point down the line. That’s where we can learn from how the QA practice evolved in the broader field of software development: we also used to have a separate QA department or function, but Agile and Lean thinking helped us bring quality into the software development process itself.

But if you think about fertile data, caring for quality so that you can deliver powerful insights, you’ll start to appreciate that you need to cultivate it closer to the source: to treat data quality as an integral part of everything you do.
That’s not to say you will solve all your data quality problems. But if you only ever fix them downstream, you’ll never get round to fixing the source, and that can cost you more later on, since more consumers will have to deal with the same issues. Or, worse, they will start consuming the “clean” data from the downstream system instead of the source of truth. This is a common anti-pattern, where aggregated data from an enterprise data warehouse is used as input to transactional systems simply because the warehouse has already gone through the pain of fixing earlier data quality issues.
So what do we mean by making data quality part of your routines? We can draw a parallel with software development practices by looking at it from two perspectives: the activities we do in pre-production and those we rely on once we are live.
In pre-production, if you really care about cultivating your data and you find there’s a problem, you stop the line and fix it so that you don’t produce junk further down the road.
You can think about testing your data quality much as you’d think about testing your code.
You’d typically have a bunch of tests to determine whether your code is working. For data, we think it’s useful to have a published schema for the events you expect to produce, and an idea of how those events flow. That allows you to perform some basic sanity checks on your data, to ensure you are not producing something that doesn’t fit with the schema. Schema evolution testing and publishing can also be added to your deployment pipeline, to ensure changes are backward- and forward-compatible.
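As a concrete, if simplified, illustration, a sanity check like this could sit in the producing service or its deployment pipeline. The sketch below assumes Python and the jsonschema library; the event fields and allowed values are hypothetical rather than taken from any real schema.

```python
# A minimal sketch of validating produced events against a published schema.
# Assumes the jsonschema library; the event shape and field names here are
# illustrative, not drawn from the article.
from jsonschema import validate, ValidationError

CLICK_EVENT_SCHEMA = {
    "type": "object",
    "required": ["event_type", "timestamp", "session_id"],
    "properties": {
        "event_type": {"type": "string", "enum": ["click", "scroll", "swipe"]},
        "timestamp": {"type": "string"},
        "session_id": {"type": "string"},
    },
}

def is_valid_event(event: dict) -> bool:
    """Basic sanity check: does the event fit the published schema?"""
    try:
        validate(instance=event, schema=CLICK_EVENT_SCHEMA)
        return True
    except ValidationError:
        return False
```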
Sometimes this process is more like functional testing. For instance, you might expect a field to contain a date, so you could assert that the data is in the right format and within a valid range of dates: if you’re getting sales data from the 1800s, you probably have a data quality issue.
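A functional-style check along those lines might look something like the following sketch; the field name, date format, and plausible range are assumptions chosen purely for illustration.

```python
# A sketch of a functional-style data check: assert a field parses as a date
# and falls within a plausible range. Field name and bounds are illustrative.
from datetime import date, datetime

EARLIEST_PLAUSIBLE_SALE = date(2000, 1, 1)

def check_sale_date(raw_value: str) -> bool:
    """Return True if the value is a well-formed, plausible sale date."""
    try:
        sale_date = datetime.strptime(raw_value, "%Y-%m-%d").date()
    except ValueError:
        return False  # wrong format counts as a data quality issue
    # Sales data from the 1800s (or the future) signals a quality problem.
    return EARLIEST_PLAUSIBLE_SALE <= sale_date <= date.today()
```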
But the ‘stop the line’ thinking doesn’t work so well for systems in production.
You can’t always just stop a production system to fix a data quality issue.
I learned this lesson while working with a client to build a product that ingested data from government sources. We were consuming a data feed with the aim of surfacing interesting information to the client’s customers.
That meant we had to make various assumptions about the data coming from this external feed. But we kept hitting problems: sometimes fields were missing, or the data simply didn’t make sense. And that kept breaking our product features.
To work out what was happening, we started writing tests that encoded our assumptions about the data, to establish what sorts of patterns would break our system. As we built those tests, we found more and more data quality issues, so we incorporated the tests into our data ingestion pipeline. Every day, the client gave us a new dump of the data, and we ran our tests against it.
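In spirit, those assumption checks looked something like the sketch below, run against each daily dump before ingestion. The file format, field names, and rules here are hypothetical stand-ins for the real feed.

```python
# A sketch of assumption checks run against a daily data dump before ingestion.
# The CSV format, field names, and allowed values are illustrative assumptions.
import csv

ALLOWED_STATUSES = {"open", "closed", "pending"}

def check_dump(path: str) -> list[str]:
    """Return a list of human-readable problems found in the daily dump."""
    problems = []
    with open(path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            if not row.get("record_id"):
                problems.append(f"row {i}: missing record_id")
            if row.get("status") not in ALLOWED_STATUSES:
                problems.append(f"row {i}: unexpected status {row.get('status')!r}")
    return problems
```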
The problem we faced here was that we were able to spot things that might break the product, but stopping the data pipeline to fix them would mean that the product would surface outdated information.
And that’s why I started to question the idea of stopping the line for production issues. Instead, we could ingest the data that was good and simply mark the records that were wrong for follow-up.
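A minimal sketch of that “ingest the good, flag the bad” approach might look like this, with is_valid standing in for whatever checks your pipeline runs:

```python
# A sketch of the "don't stop the line" approach: ingest records that pass
# the checks, and set aside the rest for follow-up instead of halting the
# pipeline. is_valid() stands in for whatever checks the pipeline runs.
def partition_records(records, is_valid):
    """Split records into those safe to ingest and those needing follow-up."""
    good, quarantined = [], []
    for record in records:
        (good if is_valid(record) else quarantined).append(record)
    return good, quarantined

# Usage: ingest the good records, and log or store the quarantined ones so the
# team can investigate without blocking the product's daily refresh.
```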
This approach is closer to the software development practice of doing QA in production: observability and monitoring become the preferred tools.
You can set up thresholds to define your normal expectations and only take action when you get an alert that a threshold has been crossed. Whereas in pre-production you might aim for something as close to perfection as possible, once you’re in production you’re asking what level of quality is acceptable and adapting as you learn more about the data flows.
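For example, a threshold check can be as simple as the sketch below: compute a failure rate per batch and only alert when it crosses a level you’ve agreed is unacceptable. The 5% threshold and the notion of a per-batch check are illustrative assumptions.

```python
# A sketch of threshold-based monitoring for production data quality: compute
# the failure rate for a batch and alert only when it crosses an agreed level.
FAILURE_RATE_THRESHOLD = 0.05  # illustrative: 5% of records failing checks

def should_alert(total_records: int, failed_records: int) -> bool:
    """Return True when the batch's failure rate exceeds the agreed threshold."""
    if total_records == 0:
        return True  # an empty batch is itself worth an alert
    return failed_records / total_records > FAILURE_RATE_THRESHOLD
```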
Making the change stick
While I firmly believe we can learn a lot about improving data quality from software development best practice, that doesn’t mean solving these issues is easy. A lot of our clients don’t have the structures in place to implement these types of practices, so there’s a cost of entry that will require an upfront investment.

It is also unlikely that a single data quality program initiated by senior decision makers is going to solve your problems. Instead, data quality needs to become part of your development process, so that the teams producing and consuming data are the ones thinking about data quality all the time.
These are big changes for many organizations. And if you want to cultivate and use data to empower your business, we think it’s the right way to go.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.