From my experience working in teams with a mix of data engineers, software engineers and analysts, I have found that the term “test” is a frequent source of confusion. With tools like dbt gaining popularity, the confusion seems to be widening due to a lack of clear guidelines on using the dbt testing framework to test code versus test data. Not only that, I have also found that the term “pipeline” itself can be confusing for some! This motivated me to write this mini-blog as a simple guide to the terms “code test” and “data test” in the context of data pipelines.
Data pipelines are applications that process data, so alongside the data there is code that transforms it. You can visualize this as two perpendicular planes:
- In the Code Plane, code flows vertically up the Y-axis between environments; in traditional software engineering terms, this is a CI/CD pipeline. It is the flow software engineers are very familiar with.
- In the Data Plane, data flows horizontally along the X-axis within each of these environments, where it is transformed from one form to another. This is the data pipeline itself and is something data experts understand very well.
Visualization of a data pipeline as two perpendicular planes
Code tests
When code is flowing through the Code Plane, i.e. being promoted from one environment to another, we should adhere to the proven practices of self-testing code and Test-Driven Development (TDD). This means automated tests run before the code is deployed to the higher-level environment. We can do this by using mock data, writing tests and then invoking a single command that executes them. This way we can be confident that these tests will illuminate any bugs hiding in our code, even when we are using tools like dbt with SQL!
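As an illustration, here is a minimal sketch of such a code test in Python (pytest style). The transformation function `summarise_sales` and its mock rows are hypothetical stand-ins for a pipeline step; the point is that the test runs purely on mock data and is triggered by a single command (`pytest`) before the code is promoted.

```python
# Minimal sketch of a code test, assuming a hypothetical pipeline
# transformation `summarise_sales`. It runs entirely on mock data, so it can
# execute in CI before the code reaches a higher-level environment.


def summarise_sales(rows):
    """Hypothetical transformation: total sales amount per country."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0) + row["amount"]
    return totals


def test_summarise_sales_totals_per_country():
    # Mock data stands in for production data in the Code Plane.
    mock_rows = [
        {"country": "DE", "amount": 10},
        {"country": "DE", "amount": 5},
        {"country": "IN", "amount": 7},
    ]
    assert summarise_sales(mock_rows) == {"DE": 15, "IN": 7}
```

The same idea applies in a dbt project: the transformation logic is exercised against small, known inputs before promotion, rather than against live production data.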
Seeding data for code tests requires production-like data, and this creates a third plane: the Reverse Data Plane. In this plane, data flows along the Y-axis in the opposite direction to the Code Plane. This is production data that can be replicated into test environments using privacy-preserving or obfuscation techniques such as masking or differential privacy, or it can be purely synthetic data created from samples. In the lower-level development environments this can be as simple as taking a sample or an individual data point. To begin with, this flow can be conceptual, in terms of defining test cases, but it is best to automate it eventually, especially in the higher-level environments (a small sketch of such a seeding step follows the diagram below).
Visualization of a reverse data plane in a data pipeline
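To make the reverse data plane concrete, here is a small Python sketch under the assumption of hypothetical helpers (`mask_value`, `seed_test_data`) and column names (`email`, `name`): a repeatable sample of production rows is taken and sensitive fields are obfuscated before the data is copied into a test environment.

```python
# Minimal sketch of seeding a test environment from production data, assuming
# hypothetical row fields "email" and "name". Masking via a one-way hash is
# just one of the obfuscation options mentioned above.
import hashlib
import random


def mask_value(value: str) -> str:
    """Obfuscate a sensitive value with a one-way hash."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def seed_test_data(production_rows, sample_size=100, seed=42):
    """Take a small, repeatable sample of production-like rows and mask the
    sensitive columns before loading them into a lower-level environment."""
    random.seed(seed)
    sample = random.sample(production_rows, min(sample_size, len(production_rows)))
    return [
        {**row, "email": mask_value(row["email"]), "name": mask_value(row["name"])}
        for row in sample
    ]
```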
Data tests
Data tests are executed in the Data Plane and are mostly part of the production functionality, where data is checked for completeness, uniqueness, validity, consistency, referential integrity, accuracy, compliance and so on. In the Data Plane, data mostly comes from untrusted systems, so these data tests ensure that bad data is identified before a data defect turns into a data disaster. Bad data may not only cause application downtime; it can also surface as incorrect sales figures and lead to other disasters! It may often require a manual fix, but the sooner it is identified, the smaller the impact. It is also important to understand that each of these test types, data and code, should run in its respective plane to realize its value and avoid antipatterns.
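As a sketch of what such checks can look like, here is an illustrative Python function over hypothetical `orders` records and `known_customer_ids`; each check corresponds to one of the qualities listed above and, in a real pipeline, would block or quarantine the offending data rather than just report it.

```python
# Minimal sketch of data tests in the Data Plane, assuming hypothetical
# "orders" records with order_id, customer_id and amount fields.
def run_data_tests(orders, known_customer_ids):
    failures = []

    # Completeness: every order must have an amount.
    if any(order.get("amount") is None for order in orders):
        failures.append("completeness: missing amount")

    # Uniqueness: order ids must not repeat.
    order_ids = [order["order_id"] for order in orders]
    if len(order_ids) != len(set(order_ids)):
        failures.append("uniqueness: duplicate order_id")

    # Validity: amounts must not be negative.
    if any(order.get("amount") is not None and order["amount"] < 0 for order in orders):
        failures.append("validity: negative amount")

    # Referential integrity: every order must reference a known customer.
    if any(order["customer_id"] not in known_customer_ids for order in orders):
        failures.append("referential integrity: unknown customer_id")

    return failures
```

In a dbt project, the built-in generic tests (not_null, unique, accepted_values, relationships) cover the same kinds of checks declaratively.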
Always be clear about the test coverage and ensure both data quality and logic are covered. This is best done by wearing a software engineer’s hat and making sure the Code Plane gives maximum test coverage for the logic, and by working with data experts/owners to make sure the Data Plane covers the data quality tests. In both cases, “shift left” is the mantra one should never forget. You can also refer to the Practical data test grid, which guides us towards holistic test coverage in data pipelines, and read through the Traits of productionized data pipelines blog to learn more about robust data pipelines.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.