As the volume and variety of organizational data explode, data teams face complex challenges in managing it. Existing paradigms are becoming inadequate, demanding new thinking and approaches to data management. This blog, based on a conversation between Vanya Seth, Thoughtworks’ Head of Technology for India, Aveek Mishra of Intuit India Development Center, and Rajesh Parikh, Founder and CEO of data catalog organization Cynepia, explores the challenges that data product teams face and how they can solve them.
Accountability
In traditional models, accountability for data often lies with data teams. However, data teams don’t always have the necessary domain knowledge to comprehensively understand the data in front of them. “Within every organization/industry, there are hundreds of sub-domains, making it almost impossible for data teams to become experts,” says Vanya Seth, Head of Technology for Thoughtworks in India.
On the other hand, domain teams should already know the landscape and possess a certain amount of knowledge not only about the data they have, but also about the integrity rules it should abide by. This is where the Data Mesh model comes into play: an analytical data architecture and operating model where data is treated as a product and owned by the teams that most intimately know and consume the data. The model shifts accountability for the quality, integrity and usability of the data to domain teams. This should enhance the organization’s ability to realize value from the data. Simply put, Data Mesh brings the DevOps model to data management.
Data quality
As Rajesh Parikh, Founder and CEO at Cynepia Technologies points out, “one of the biggest challenges in data management today is that bad data isn’t tracked in the data pipeline and so ends up in consumer-facing reports and dashboards.” Current solutions such as observability and data contracts are inadequate if bad data flows uncontrolled. This is true of Data Mesh architecture as well. Aveek Misra, an Engineering Manager for Data Engineering at Intuit India Product Development Centre, adds that “today, whatever quality control is in place fails to detect issues because they do not embody business rules. They perform null checks, row count checks and hash checks, but that is not enough.” This is because data quality control (QC) systems lack domain expertise and knowledge.
There are several interrelated data quality problems that need to be addressed thoughtfully. As accountability shifts to domain teams, they also take on responsibility for defining quality metrics. For instance, in the healthcare industry, the domain team should know that a diabetes test result is valid for only three months. That means they are in the best position to define these rules. However, shifting responsibility upstream alone is not enough.
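To make the diabetes example concrete, here is a minimal sketch of a domain-defined business rule, as opposed to a generic null or row count check. The `LabResult` type, the field names and the 90-day approximation of "three months" are assumptions made for illustration; they do not come from the conversation.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: "three months" is approximated as 90 days for this sketch.
DIABETES_RESULT_VALIDITY = timedelta(days=90)


@dataclass
class LabResult:
    # Hypothetical record shape, used only for illustration.
    patient_id: str
    test_name: str
    taken_on: date


def is_result_valid(result: LabResult, as_of: date) -> bool:
    """Domain rule: a result is usable only inside its validity window."""
    age = as_of - result.taken_on
    # Reject results dated in the future as well as stale ones.
    return timedelta(0) <= age <= DIABETES_RESULT_VALIDITY


def split_by_validity(results, as_of):
    """Separate usable results from stale ones so bad data can be quarantined."""
    valid = [r for r in results if is_result_valid(r, as_of)]
    stale = [r for r in results if not is_result_valid(r, as_of)]
    return valid, stale


results = [
    LabResult("p1", "diabetes", date(2024, 5, 1)),
    LabResult("p2", "diabetes", date(2024, 1, 10)),
]
valid, stale = split_by_validity(results, as_of=date(2024, 6, 1))
print([r.patient_id for r in valid], [r.patient_id for r in stale])
```

A null check would pass both records. Only the rule supplied by the domain team flags the second one as stale.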
Data products are built at multiple levels, from source-oriented data products to aggregated data products to consumer-oriented data products, so even if the data is high quality at the source, it may become distorted later. This is why there need to be robust QC mechanisms at every stage of the data product journey. To do this, there must be a close relationship between domain and data teams to ensure that quality testing is aligned with business goals.
Customer centricity
In most organizations, customer-centricity is an external issue. However, data teams are the customers for the data that domain teams are generating. And unfortunately, as internal customers, they don’t get the same treatment that the external customers do. Customer delight is rarely a top priority. This creates inefficiencies in data management.
“If a data scientist has to hypothesize on how to improve sales, they might have to ask the product teams for that data, or ask the best way to run experiments. This increases the cognitive load on data teams,” says Thoughtworks’ Vanya. The Data Mesh model solves this problem by challenging the status quo, ensuring domain teams are responsible for providing the data. They should actively collaborate with the data team to define details such as whether the data should be exposed through a SQL interface or in a graph format.
Skills and capabilities
Roles in data management are barely a decade old. They are evolving rapidly. “For instance, today, a data analyst is dashboarding and transforming data simultaneously. A data scientist is transforming data and building models. We are looking at an overlap of responsibilities. As the accountability for data moves left, should domain teams also consist of data analysts/engineers/scientists?” asks Cynepia Technologies’ Rajesh.
Thoughtworks’ Head of Tech, Vanya, thinks not. “As these are specialized skills, it is hard to obtain and retain such talent at scale,” she says. This problem is better solved by defining a domain-agnostic self-serve platform that provides everything domain teams need to leverage that data. The Data Mesh abstracts interactions and flow, creating platform capabilities that democratize data so that any developer can build a data product.
Product mindset
Building data products needs a product mindset. As Intuit’s Aveek points out, “In the product world, we have solved how microservice contracts are validated, checking for resilience and circuit breakers, for example. Some of these best practices need to be brought to the data world.”
Thoughtworks’ Vanya expands on this mindset change by suggesting that for data to be considered a product, it needs to be long-lived and used repeatedly. Product-thinking data teams are not just solving point-in-time problems but creating long-term, reusable solutions. This also prevents the creation of hundreds of quick pipelines for point-in-time problems that end up making the system so complex it could collapse like a house of cards.
Governance
“We have traded off data quality and governance for speed of delivery,” says Rajesh from Cynepia Technologies, raising an important concern. The traditional model of governance, which was centralized with a team at the top making decisions, is no longer viable. Moreover, master data management, the most commonly used model today, cannot scale at the rate at which data is evolving.
The future needs yet another change in mindset, this time centered on governance. Vanya reiterates decentralization as an approach that can bring about the required change. Not unlike the microservices model, it is enabled by thoughtful automation. “Platform teams have to automate governance issues, leveraging policy as code and governance as code models. Domain teams must find ways to computationally give feedback to the developer and let the platform take care of it,” she says.
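One way to picture the "policy as code" idea is a set of small, automated checks that the platform runs against every data product and that return feedback a developer can act on. The sketch below is illustrative only: the `DataProduct` shape, the `pii` tag and the two policies are hypothetical and are not taken from any specific governance tool.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    # Hypothetical metadata a platform might hold about a data product.
    name: str
    owner: str
    columns: dict  # column name -> set of tags, e.g. {"email": {"pii"}}
    masked_columns: set = field(default_factory=set)


def policy_has_owner(product):
    """Every data product must name an accountable domain team."""
    if not product.owner:
        return "data product must declare an owning domain team"
    return None


def policy_pii_masked(product):
    """Columns tagged as PII must be masked before they are published."""
    unmasked = sorted(
        col for col, tags in product.columns.items()
        if "pii" in tags and col not in product.masked_columns
    )
    if unmasked:
        return f"PII columns must be masked: {', '.join(unmasked)}"
    return None


# The platform owns this list; domain teams do not re-implement the checks.
POLICIES = [policy_has_owner, policy_pii_masked]


def evaluate(product):
    """Run every policy and collect human-readable violations."""
    return [msg for policy in POLICIES if (msg := policy(product)) is not None]


orders = DataProduct(
    name="orders",
    owner="sales-domain",
    columns={"order_id": set(), "email": {"pii"}},
)
violations = evaluate(orders)
print(violations)
```

Because the checks are ordinary code, they can run on every change to a data product, which is what lets governance decentralize without each domain team reinventing the rules.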
However, governance is not just about quality but also about discoverability. We need to create platforms that empower consumers to browse data and enable decisions for the right dataset for every use case. It needs to be metric-driven and transparent. “And teams should think about it from day one,” adds Intuit’s Aveek. “For instance, GDPR mandates the deletion of data if a customer requests it. But does one even know where all that data is stored? In such cases, lineage becomes very important.”
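To illustrate why lineage matters for a deletion request, the following sketch walks a small lineage graph to list every dataset derived from a source that holds customer data. The dataset names and the `LINEAGE` mapping are invented for the example; a real platform would read this information from its catalog.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets derived from it.
LINEAGE = {
    "crm.customers": ["warehouse.customer_dim", "ml.churn_features"],
    "warehouse.customer_dim": ["reports.revenue_by_customer"],
    "ml.churn_features": [],
    "reports.revenue_by_customer": [],
    "web.page_views": ["reports.traffic"],
    "reports.traffic": [],
}


def downstream_datasets(source, lineage):
    """Breadth-first walk returning the source and everything derived from it."""
    seen = {source}
    queue = deque([source])
    order = []
    while queue:
        current = queue.popleft()
        order.append(current)
        for child in lineage.get(current, []):
            if child not in seen:  # guard against revisiting shared descendants
                seen.add(child)
                queue.append(child)
    return order


# Datasets to review when a customer asks for their data to be deleted.
to_review = downstream_datasets("crm.customers", LINEAGE)
print(to_review)
```

Without recorded lineage, the derived feature table and the report would be easy to miss.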
As data becomes a competitive advantage for businesses, the challenges around data management are likely to become more complex. Success lies in empowering every individual to be accountable for what they do best and leverage every asset without unnecessary bottlenecks.
For a more detailed understanding of the topic, you can watch the entire conversation here.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.