Synthetic data for testing models

Technology Radar

Published : Oct 26, 2022

Not on the current edition

This blip is not on the current edition of the Radar. If it was on one of the last few editions, it is likely still relevant. If the blip is older, it might no longer be relevant and our assessment might be different today. Unfortunately, we simply don't have the bandwidth to continuously review blips from previous editions of the Radar.

Oct 2022

Assess

During our discussions for this edition of the Radar, several tools and applications for synthetic data generation came up. As the tools mature, we've found that using synthetic data for testing models is a powerful and broadly useful technique. Synthetic data is not intended as a substitute for real data when validating the discriminative power of machine-learning models, but it is useful in a variety of other situations. For example, it can guard against catastrophic model failure in response to rarely occurring events, and it lets teams test data pipelines without exposing personally identifiable information. It is also useful for exploring edge cases that lack real data and for identifying model bias. Helpful tools for generating data include Faker and Synth, which generate data that conforms to desired statistical properties, and tools like Synthetic Data Vault, which generate data that mimics the properties of an input data set.
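To make the pipeline-testing and rare-event cases concrete, here is a minimal sketch that uses only the Python standard library. It does not use Faker, Synth or Synthetic Data Vault; the field names, distributions, `make_synthetic_customers` and `pipeline_step` are all hypothetical and chosen purely for illustration. A dedicated tool would typically offer richer and more realistic generators.

```python
import random


def make_synthetic_customers(n, seed=0, rare_event_rate=0.01):
    """Generate n made-up customer records containing no real personal data.

    All fields and distributions are illustrative assumptions.
    """
    rng = random.Random(seed)  # seeded so test runs are reproducible
    records = []
    for i in range(n):
        is_rare = rng.random() < rare_event_rate
        records.append({
            # Obviously fake identifier, so it cannot be mistaken for real PII.
            "customer_id": f"synthetic-{i:06d}",
            # Age drawn from a normal distribution, clamped to 18..95.
            "age": max(18, min(95, round(rng.gauss(40, 12)))),
            # Rare events get an extreme amount to probe failure modes.
            "amount": 1_000_000.0 if is_rare
            else round(rng.lognormvariate(3.5, 0.8), 2),
            "is_rare_event": is_rare,
        })
    return records


def pipeline_step(record):
    """Hypothetical pipeline step under test: bucket a transaction amount."""
    amount = record["amount"]
    if amount < 0:
        raise ValueError("amount must be non-negative")
    if amount < 50:
        return "small"
    if amount < 500:
        return "medium"
    return "large"


# Exercise the step on synthetic data, including injected rare events.
records = make_synthetic_customers(10_000, seed=42)
buckets = [pipeline_step(r) for r in records]
```

Because the generator is seeded, a failure seen in one run can be reproduced exactly, and the rate of injected rare events can be raised to stress the step under test.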