Today’s businesses, especially those that have prioritized digital transformation, need near real-time data. Weekly and monthly batch processes no longer cut it. But acquiring real-time data from various sources to automate processes and make dynamic optimization decisions isn’t an easy task.
Recently, while re-architecting a legacy system and breaking a monolith into microservices for a client, we faced this particular challenge. We set out to change the database for the new architecture and modernize the system module by module. During this phase, we needed both databases to work in sync because different modules might require the same data: the old system would need data generated by the new system in the new database, and vice versa.
We explored change data capture (CDC) to see if it could do just that. This blog takes a deeper dive into what CDC is, the tools we explored, how they work and, most importantly, their benefits. We've also included a few examples and suggestions on how other technologists can go about choosing the right CDC tool for their use case.
What is change data capture?
Change data capture is the process of detecting and capturing changes in a source system (typically a database) and delivering those changes in near real-time to a target system. These changes include inserts, updates and deletes, as well as changes to the database structure made through DDL statements.
How do change data capture tools work?
CDC tools work by continuously monitoring the source system for any changes made to the data. Whenever a change is detected, the tool captures and records it in a separate location, such as a database or log file, or sends it to a message broker. The captured data is then processed, transformed and loaded into a target system such as a data warehouse, analytics platform or another database.
There are various ways to capture changes in a database. Let’s look at a few of them here:
Timestamp-based/query-based
Here we maintain audit columns in the source, such as CREATED_AT, LAST_UPDATED or DATE_MODIFIED, and periodically query the source to pick up any changes based on these columns. It is important to note that this approach cannot capture delete operations, because a deleted row simply stops appearing in the query results.
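To make this concrete, here's a minimal Python sketch of a timestamp-based poller. The customers table, its last_updated column and the polling interval are illustrative assumptions, and a real poller would persist the checkpoint so it survives restarts.

```python
import sqlite3
import time

def poll_changes(conn, last_seen):
    # Timestamp-based CDC: fetch only the rows whose audit column moved past
    # our checkpoint. Deleted rows never appear in this result set.
    cursor = conn.execute(
        "SELECT id, name, last_updated FROM customers "
        "WHERE last_updated > ? ORDER BY last_updated",
        (last_seen,),
    )
    return cursor.fetchall()

def run_poller(conn, interval_seconds=10):
    # Checkpoint of the latest change already delivered; kept in memory only
    # for the sake of the example. Timestamps are assumed to be ISO-8601 strings.
    checkpoint = "1970-01-01 00:00:00"
    while True:
        for row_id, name, last_updated in poll_changes(conn, checkpoint):
            # Deliver the change to the target system (warehouse, broker, ...).
            print(f"changed row: id={row_id}, name={name}, at={last_updated}")
            checkpoint = max(checkpoint, last_updated)
        time.sleep(interval_seconds)
```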
Trigger-based
Triggers are functions that perform user-defined actions in response to a particular event in the database. They can be used to capture any change, including delete operations. However, this approach reduces database performance because every event requires additional writes to record the change.
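For illustration, here's a small, self-contained sketch using SQLite triggers. The customers and customers_changes tables are made up for the example, and a production trigger would usually capture the full row image rather than just the key. It also shows where the overhead comes from: every source write now costs an extra write into the audit table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

-- Audit table the triggers write into; a downstream process reads and forwards it.
CREATE TABLE customers_changes (
    change_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    operation   TEXT NOT NULL,
    customer_id INTEGER NOT NULL,
    changed_at  TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER customers_after_insert AFTER INSERT ON customers BEGIN
    INSERT INTO customers_changes (operation, customer_id) VALUES ('INSERT', NEW.id);
END;

CREATE TRIGGER customers_after_delete AFTER DELETE ON customers BEGIN
    INSERT INTO customers_changes (operation, customer_id) VALUES ('DELETE', OLD.id);
END;
""")

# Each write to the source table triggers an extra write into the audit table.
conn.execute("INSERT INTO customers (name) VALUES ('Ada')")
conn.execute("DELETE FROM customers WHERE name = 'Ada'")
print(conn.execute("SELECT operation, customer_id, changed_at FROM customers_changes").fetchall())
```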
Log-based
Databases maintain transaction logs (or redo logs) that record all events so the database can recover in case of a crash. With log-based CDC, new transactions are read directly from the database's native log. This captures changes without scanning the source tables, which makes it more efficient than the other two approaches.
This approach is akin to event sourcing in an event-driven architecture, where whenever there is a change in the state of a system, we record it as an event. The recorded events can be used to rebuild the system state at any time by replaying them in the same sequence.
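In practice, log-based tools such as Debezium read the transaction log and publish each change as an event on a message broker, and a small consumer replays those events against the target. The sketch below is a hedged example that assumes Debezium's default Kafka change-event envelope and the kafka-python client; the topic name and broker address are placeholders.

```python
import json
from kafka import KafkaConsumer  # kafka-python client

# Assumes a Debezium connector is already streaming the source database's
# transaction log into Kafka; the topic name and broker address are placeholders.
consumer = KafkaConsumer(
    "dbserver1.inventory.customers",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")) if m else None,
)

for message in consumer:
    if message.value is None:  # Kafka tombstone records carry no payload
        continue
    event = message.value["payload"]  # Debezium's default change-event envelope
    op = event["op"]                  # 'c' create, 'u' update, 'd' delete, 'r' snapshot read
    before, after = event["before"], event["after"]
    # Replay the change against the target: upsert on create/update/read, remove on delete.
    if op in ("c", "u", "r"):
        print(f"upsert into target: {after}")
    elif op == "d":
        print(f"delete from target: {before}")
```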
Why use CDC?
Depending on the situation, application, architecture and business needs, there are many scenarios in which CDC is critical. Here are some ways in which CDC benefits the engineering process:
Real-time data availability: CDC tools capture changes in near real-time, ensuring the most up-to-date data is available for analysis, reporting or further processing.
Faster decision-making: CDC helps reduce the latency between the capture and availability of data, enabling quicker analysis and decision-making.
Efficient data integration: CDC tools help capture data from multiple operational sources and consolidate it into a common format in a single target database or data lake.
Customized design of target databases: CDC offers cross-functional benefits like creating a read-only search or query database in a CQRS system, creating an audit database, or capturing data in a data warehouse. It allows non-functional and architectural requirements to be decoupled from the primary data store.
Simplified data migration: In our case, CDC helped maintain data consistency between legacy and new databases during the modernization phase. This applies to various other data migration scenarios as well.
How do you choose the right CDC tool?
There are several CDC tools available in the market, such as Oracle GoldenGate, Debezium, IBM InfoSphere, Striim, StreamSets and Qlik Replicate. These tools can be open-source or paid. They often support both on-prem and cloud environments, and can handle various data sources. When choosing one, consider the following:
Compatibility with data sources: At the very least, the tool you choose must be compatible with all the data sources you want to capture changes from.
Real-time data capture: The tool should capture changes in near real-time so the most up-to-date data is always available.
Data transformation and integration: The CDC tool should be able to handle data transformation from source to target data types.
Pricing: The CDC tool must be cost-effective for your use case. There are open-source, paid and licensed products available that you can choose from.
Ease of use and support: The tool should be easy to use for your team and have adequate support, including comprehensive documentation and technical support.
Miscellaneous features: Depending on your requirements, you might also want to check for additional specific features, such as bidirectional sync between source and target and cloud support.
As businesses become tech-driven, data, both historical and current, will become a critical differentiating factor. Enabling accurate, timely, efficient and cost-effective change data capture will become an integral part of any technology transformation initiative. When you face that situation, I hope this article is helpful.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.