“Yes, excessive automation at Tesla was a mistake. To be precise, my mistake. Humans are underrated.”
Most technologists believe that, given enough R&D, time and resources, everything can be automated. Elon Musk’s story shows us that this is untrue: Tesla's highly automated production line created so many problems that he eventually hired human workers to do the job alongside the robots.
People still play a paramount role in dealing with uncertainties, unexpected events, and approximations. Therefore, despite best efforts, one can successfully automate only a fraction of all the processes. This is particularly true when managing information and processing data instead of building products.
Setting the right kind of expectations and outcomes for data projects is a hard task, especially when one considers a few basic truths:
Digitizing information simply replicates original uncertainties, errors and the lack of understanding in a digital form
Automation does not eliminate all errors; it simply reduces the effort required to handle them or increases the volume of things we can process
When dealing with complexity, we cannot avoid dealing with approximations, errors and unexpected events
So, how does one ensure the reliability of data and of the decisions based on it?
The answer lies not in managing the data alone, but also in managing the information around and about data acquisition, transformation and visualization – to provide better understanding and support decision makers.
A 'single source of truth' was the mantra of every data warehouse project. 'Hoard all the data you can, we’ll figure it out later' was the promise of data lakes. 'Curated data products' is the new norm for data mesh. In the end, what really matters for an organization is to leverage trusted data that supports decision making and day-to-day operations, meetings and activities.
This article discusses how to leverage metadata and data quality to build an engineering foundation of trust, with the proper tools and processes. No matter how much effort or investment we put into building such a foundation, it is only half of what is required to build trust in data.
Metadata is the context we need to understand data
In data governance, metadata usually refers to all the service information needed – both for processes and for the consumption of information – to make sense of data. Collecting and managing this metadata is the first step in moving towards 'data governance.' Data catalog tools respond to this need and are now widespread both as stand-alone products and integrated into the services of all cloud providers.
Among the open source tools, Apache Atlas represents a reference point for completeness and maturity, while Amundsen (created by Lyft) focuses on simplicity and usability by non-technical users. Both these tools approach data governance as centered on two types of 'service' information:
Metadata proper, which describes the structure of the data (e.g. tables, fields, data types, which fields contain sensitive data, and who can access what) and where it is stored (e.g. file paths or databases)
Data lineage, which describes the sequence of processes from data acquisition, through all subsequent transformations and intermediate data sets, up to the final result that is used to provide information or support decisions. This 'technical' information is linked to descriptions closer to the end users' language through glossaries or dictionaries.
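To make these two kinds of 'service' information concrete, here is a minimal, tool-agnostic sketch of how they could be represented in code. The field names and the DatasetMetadata/LineageEdge types are illustrative assumptions, not the actual schema of Apache Atlas or Amundsen.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ColumnMetadata:
    name: str            # e.g. "email"
    data_type: str       # e.g. "string"
    is_sensitive: bool   # flags PII so access policies can be applied

@dataclass
class DatasetMetadata:
    name: str                       # e.g. "crm.customers"
    location: str                   # file path or database/table URI
    columns: List[ColumnMetadata] = field(default_factory=list)
    description: str = ""           # business glossary entry, in end-user language

@dataclass
class LineageEdge:
    source: str   # upstream dataset, e.g. "raw.crm_export"
    target: str   # downstream dataset, e.g. "crm.customers"
    process: str  # the job/transformation that produced the target

# A tiny catalog entry: metadata plus the lineage chain from acquisition to final result
customers = DatasetMetadata(
    name="crm.customers",
    location="s3://lake/curated/crm/customers/",
    columns=[ColumnMetadata("customer_id", "string", False),
             ColumnMetadata("email", "string", True)],
    description="One row per customer, as exposed to reporting",
)
lineage = [LineageEdge("raw.crm_export", "staging.customers", "ingest_crm_job"),
           LineageEdge("staging.customers", "crm.customers", "curate_customers_job")]
```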
This is a great first milestone. However, this ‘descriptive’ approach does not take into account the variability of data, its quality, or the impact on the reliability of information and the effectiveness of decisions. Even if metadata makes the platform more transparent, it tells us nothing about what is inside.
Data quality requires metadata to deliver continuous value
Metadata management – the first level of data governance – rests on the false assumption that, before digital transformation, everything worked correctly but slowly, and that once digitized it will be not just correct but also fast. This approach does not take the changing nature of information into account.
Engineers must consider errors and approximations not as exceptions, but as structural elements of any form of data collection and management, and handle them accordingly.
Let’s delve into the essential role of continuous data quality as part of efficient data governance.
Data profiling
Data profiling offers an understanding of the kind of data we are dealing with: the values it assumes, their statistical distribution, and how much the data grows or shrinks over time. The results constitute a further level of metadata that we can collect. Although it is fairly easy to write programs that generate these statistics, two open source libraries are worth considering: Great Expectations and Deequ (created by AWS). Both can use Spark as the back-end and can therefore analyze massive data sets.
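As a minimal illustration of the idea (a hand-rolled sketch, not the API of Great Expectations or Deequ), the following PySpark snippet computes a few per-column statistics and returns them as a profile record that can be stored as metadata. The table path is an assumption for the example.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()

def profile(df):
    """Return simple per-column statistics: row count, null ratio, distinct count."""
    total = df.count()
    stats = []
    for col in df.columns:
        nulls = df.filter(F.col(col).isNull()).count()
        distinct = df.select(col).distinct().count()
        stats.append({
            "column": col,
            "rows": total,
            "null_ratio": nulls / total if total else 0.0,
            "distinct_values": distinct,
        })
    return stats

# Profile a customers table from the lake (the path is an illustrative assumption)
customers = spark.read.parquet("s3://lake/staging/customers/")
for column_stats in profile(customers):
    print(column_stats)
```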
When we model data, we map our understanding of, and expectations about, the information we need into data structures (i.e. metadata proper: tables, column data types, relationships). In an ideal world, these specifications match the data we eventually receive. However, in the real world, capturing data is tricky.
We successfully collect data through forms, applications, OCR, file specifications, APIs, streams and so on – until we discover unexpected results or ambiguous information. Take the example of an email field. It may contain a valid email, a wrong email, or no email at all. The first case matches the expectation that the application has proper email validation in place. An occurrence of the second case may reveal an unexpected application bug or a partial or incorrect implementation.
The third case, instead, leads to an interesting question: why is there no email? Either:
- the field was not mandatory and the user didn't enter it in
- there was no way of entering it in because there was no email field or it was hidden or locked (e.g. due to a bug, a mistake, or a business/process rule)
But simply stating in our data model that the customers table should have a mandatory email field doesn’t solve the problem, nor does it nudge anyone in the right direction.
The gap between data and information models can be narrowed by integrating data profiling into every data pipeline and collecting metadata time series about the data that is actually flowing. This way we create a solid basis for intercepting anomalies and unexpected events during the testing phase and at runtime, and we drastically reduce the time spent debugging the problems that arise.
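A hedged sketch of what this could look like for the email example, continuing the PySpark profiling sketch above (same spark session and customers DataFrame): each pipeline run classifies the field as valid, invalid or missing, and appends the counts, together with a timestamp, to a metadata table that builds up the time series. The regex, run identifier and table name are assumptions.

```python
from datetime import datetime, timezone
from pyspark.sql import Row, functions as F

EMAIL_RE = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"   # deliberately simple pattern, enough for profiling

def email_profile(df, run_id):
    """Count valid, invalid and missing emails for one pipeline run."""
    counts = df.select(
        F.count(F.when(F.col("email").rlike(EMAIL_RE), 1)).alias("valid"),
        F.count(F.when(F.col("email").isNotNull() & ~F.col("email").rlike(EMAIL_RE), 1)).alias("invalid"),
        F.count(F.when(F.col("email").isNull(), 1)).alias("missing"),
    ).first()
    return {
        "run_id": run_id,
        "profiled_at": datetime.now(timezone.utc).isoformat(),
        "valid": counts["valid"],
        "invalid": counts["invalid"],
        "missing": counts["missing"],
    }

# Appending each run's record to a metadata table builds the time series
record = email_profile(customers, run_id="daily-load-2024-01-01")
spark.createDataFrame([Row(**record)]).write.mode("append").saveAsTable("metadata.email_profile_history")
```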
Data validation
A rather obvious limitation of data profiling is that it only tells us how the data is, not how it should be – which raises the need for data validation. Common dimensions that help assess whether data meets the expected standards are:
Completeness – absence of holes in the information
Unambiguousness – lack of duplicate values
Timeliness – availability of the latest information at the right time
Validity – for a particular period or context
Accuracy – respect of ranges or lists of values
Coherence – satisfaction of a set of relationships between values and the whole or part of the data set itself
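To make the dimensions above concrete, here is a library-agnostic sketch (plain PySpark rather than Great Expectations, Deequ or Griffin) that expresses a few of them as executable checks returning pass/fail results. It re-uses the customers DataFrame from the profiling sketch; the status values, the updated_at column and the freshness threshold are assumptions for the example.

```python
from datetime import datetime, timedelta
from pyspark.sql import functions as F

def check_completeness(df, col):
    """Completeness: no holes in the information (no NULLs)."""
    return df.filter(F.col(col).isNull()).count() == 0

def check_unambiguousness(df, col):
    """Unambiguousness: no duplicate values in the column."""
    return df.select(col).distinct().count() == df.count()

def check_accuracy(df, col, allowed):
    """Accuracy: values respect an allowed list (or, similarly, a range)."""
    return df.filter(~F.col(col).isin(allowed)).count() == 0

def check_timeliness(df, ts_col, max_age=timedelta(days=1)):
    """Timeliness: the most recent record is not older than max_age."""
    latest = df.agg(F.max(ts_col).alias("latest")).first()["latest"]
    return latest is not None and datetime.now() - latest <= max_age

results = {
    "email_complete": check_completeness(customers, "email"),
    "customer_id_unambiguous": check_unambiguousness(customers, "customer_id"),
    "status_accurate": check_accuracy(customers, "status", ["active", "inactive"]),
    "data_timely": check_timeliness(customers, "updated_at"),
}
print(results)  # any False result should trigger an alert or stop the pipeline
```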
All these dimensions have a dual nature: we can define them with absolute values and ratios, or relative to the historical trend of the values themselves. For instance, an often underestimated case is comparing data set sizes with their evolution over time – have they grown, and if so, by how much? Or have they drastically decreased?
These simple questions require a historical series of metadata, but they allow one to quickly interrupt processes that could have disastrous impacts. Consider a case where an empty or partial file is uploaded by mistake: the data pipeline would load little or no information into the first staging area. Left unchecked, the process could go on and delete data in production!
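A minimal sketch of such a guard, assuming a metadata table that records past row counts (the table name, row_count column and 50% threshold are arbitrary assumptions, and spark/customers come from the earlier sketches): compare the current batch size against the recent average and stop the pipeline before it propagates the damage.

```python
from pyspark.sql import functions as F

def assert_size_within_expectations(df, history_table, min_ratio=0.5):
    """Abort the pipeline if the incoming batch is drastically smaller than usual."""
    current = df.count()
    history = spark.table(history_table)          # one row per past run, with a `row_count` column
    avg_past = history.agg(F.avg("row_count")).first()[0]
    if avg_past and current < avg_past * min_ratio:
        raise RuntimeError(
            f"Batch has {current} rows, below {min_ratio:.0%} of the historical average "
            f"({avg_past:.0f}); stopping before anything is overwritten in production."
        )
    return current

# Called before the load step: an empty or partial file fails fast here
assert_size_within_expectations(customers, "metadata.customers_row_counts")
```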
Building effective validation rules requires both technical and domain knowledge. We need to bridge the gap between developers (DEVs) and subject matter experts (SMEs) – helping the former better understand domains and rules, while the latter translate their knowledge into executable documentation rather than specification documents. Just as developers learn to write code, SMEs should learn to write simple rule definitions using programming languages or libraries.
Tools like Apache Griffin (initially created by eBay, now open source) try to fill the gap between DEVs and SMEs with a JSON DSL (domain-specific language): rules are defined as data structures that follow a specific semantic and can also be expressed in SQL. Great Expectations and Deequ, instead, require writing rules in Python or Scala respectively – but both tools can help automatically generate rules from data sampling information, greatly simplifying the DEVs’ work and requiring fewer programming skills from SMEs as well.
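The following sketch shows the general idea of rules-as-data (it is not the actual Apache Griffin DSL, whose schema differs): each rule is a small data structure with an SQL expression that must hold, so an SME can read, review and edit the rule set without touching pipeline code. The rule names and expressions are illustrative assumptions, and customers comes from the earlier sketches.

```python
import json
from pyspark.sql import functions as F

# Rules an SME could maintain as a JSON file alongside the pipeline
rules_json = """
[
  {"name": "email_present",    "expr": "email IS NOT NULL"},
  {"name": "email_wellformed", "expr": "email RLIKE '^[^@]+@[^@]+$'"},
  {"name": "status_allowed",   "expr": "status IN ('active', 'inactive')"}
]
"""

def evaluate_rules(df, rules):
    """Return, for each rule, how many rows violate its SQL expression."""
    results = []
    for rule in rules:
        violations = df.filter(~F.expr(rule["expr"])).count()
        results.append({"rule": rule["name"], "violations": violations})
    return results

for outcome in evaluate_rules(customers, json.loads(rules_json)):
    print(outcome)
```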
As always, the choice between a general-purpose programming language and a DSL must be evaluated based on the complexity of rules that have to be implemented alongside the ability of the SMEs to describe/validate/update them.
For this reason – and to avoid the temptation of over-engineering – it is important to think of SMEs as end users of data validations that they can manage and monitor.
Data quality should be at the center stage of every data project
If we look back, data quality has been around for a long time, both as practices and as tools, but the need for a shorter feedback loop between DEVs and SMEs has increased dramatically.
When building a data pipeline (Extract-Transform-Load, or ETL) for a data warehouse, a waterfall-like approach dictates how quality is handled. The contract between data sources and expected results is defined in specification documents and translated into developers’ activities, with quality tests and gates hardcoded into the process. Unexpected data anomalies usually halt the overall pipeline or cause cascading trouble all the way through.
With the advent of data lakes, data pipelines split between the acquisition of data and its further processing (ELT). The market emphasis on accumulating data to make sense of it later allowed for a more elusive definition of quality. For instance, “Where should I put the quality gates or tests?” is usually a huge debate within data lake teams. The net result: quality is inevitably added to the process as an afterthought, or only when things have already suffered quite badly.
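As an illustration of that debate (a skeleton, not a prescription), quality gates can be attached both to the acquisition step and to each downstream transformation, so that a failure is caught as close to its source as possible. Stage names, paths and table names below are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt-with-gates").getOrCreate()

def gate(checks, stage):
    """Run a set of boolean checks and stop the pipeline at this stage if any fails."""
    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        raise RuntimeError(f"Quality gate failed at stage '{stage}': {failed}")

# ELT-style flow with a gate after each step
raw = spark.read.json("s3://lake/raw/crm/2024-01-01/")
gate({"not_empty": raw.count() > 0}, stage="acquisition")

staged = raw.dropDuplicates(["customer_id"])
gate({"ids_unique": staged.select("customer_id").distinct().count() == staged.count()},
     stage="staging")

curated = staged.filter(F.col("email").isNotNull())
gate({"email_complete": curated.filter(F.col("email").isNull()).count() == 0},
     stage="curation")

curated.write.mode("overwrite").saveAsTable("crm.customers")
```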
As companies shift towards the Data Mesh paradigm, the pace of change has increased dramatically. Dealing with data products and federated data governance has brought data quality back to center stage of every data-oriented organization.
Data trust is the actual goal of every data project
To paraphrase Elon Musk, don’t underrate the role of people in building the chain of trust. Everyone in the organization, not just DEVs and SMEs, should be engaged in taking care of data and making sense of it. This requires extensive investment in gathering needs, listening to feedback, ramping up knowledge of data and training people on the tools.
I would suggest performing this simple test during the next C-level meeting: if there is debate around the correctness of the data, invest more in the activities listed above. If the conversations focus on the data and the decisions to be taken based on it, you are progressing in the right direction.
We recommend listening to this podcast for more information: Data governance: the foundation of data-driven organizations