Preparing test data for data engineering is a significant challenge. Transferring data from production to test environments can be risky, so teams often rely on fake or synthetic data instead. In this Radar, we explored novel approaches like synthetic data for testing and training models. But most of the time, lower-cost procedural generation is enough. dbldatagen (Databricks Labs Data Generator) is such a tool; it’s a Python library for generating synthetic data within the Databricks environment for testing, benchmarking, demoing and many other uses. dbldatagen can generate synthetic data at scale, up to billions of rows within minutes, supporting various scenarios such as multiple tables, change data capture and merge/join operations. It can handle Spark SQL primitive types well, generate ranges and discrete values and apply specified distributions. When creating synthetic data using the Databricks ecosystem, dbldatagen is an option worth evaluating.