AI success depends on large volumes of good-quality data. In fact, that’s one of the biggest challenges in AI today. Not only are there significant costs to acquiring, storing and processing data, there are also privacy issues that organizations need to address. Indeed, simply finding enough data to train models effectively can be difficult. However, synthetic data, which is artificially created to mimic ‘real’ data, is emerging as a key solution.
Synthetic data is already being used by big tech to train their flagship AI products. In March, NVIDIA released a huge data set which consisted, in part, of synthetic data. However, it isn’t just an approach used by the big industry players. Indeed, spending on synthetic data is predicted to grow from just $300 million in 2023 to $2.1 billion by 2028.
With this in mind, it’s essential for businesses engaged in training or fine-tuning models to be open to the opportunities of synthetic data. In this article we’ll outline how it can drive your organization’s AI-enabled future.
Synthetic data helps you learn more — faster and cheaper
One of the major benefits of synthetic data is that it can accelerate time to insight — and even teach AI models what “bad” looks like. Let’s look at what that means.
In healthcare, synthetic data can help simulate clinical trials, accelerating research and making it more comprehensive (by creating diverse synthetic virtual populations, for instance).
In banking and financial services, synthetic data is often used in fraud detection. By simulating novel scenarios on a huge scale, it becomes possible for a detection system to learn where and when fraud might occur.
Indeed, this isn’t just a question of speed: it’s also about cost. With synthetic data, organizations can learn more without the costs of acquiring, storing and processing the equivalent volume of real data.
Mitigate bias and develop more effective AI
Synthetic data offers a practical way to mitigate bias in AI models by enabling the creation of balanced datasets that better represent diverse groups, particularly those often underrepresented in real-world data. One of the main reasons AI can be biased is that the real-world data used to train it doesn't have a good mix of different types of people or situations. Imagine, for instance, using AI to sort through CVs: if the model is trained on data that is skewed towards male candidates, its recommendations will be skewed too. Synthetic data helps fix this by creating more examples of the groups that are usually missing or underrepresented in the real data.
Because this synthetic data is made to fill in the gaps, AI models get to learn from a more complete picture. This means they're less likely to make unfair decisions or mistakes based on who or what they haven't seen enough of. So, in simple terms, synthetic data helps AI become fairer by giving it a more balanced and diverse set of examples to learn from.
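To make the idea of filling in the gaps concrete, here is a minimal sketch of one way to create extra examples for an underrepresented group: interpolating between pairs of real examples, a simplified version of the idea behind the SMOTE oversampling technique. The function name, the feature columns and the toy numbers are all hypothetical, and a real project would more likely use a tested library than hand-rolled code.

```python
import random


def synthesize_minority(samples, n_new, seed=0):
    """Create n_new synthetic rows by interpolating between pairs of real rows."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)  # two distinct real minority rows
        t = rng.random()               # position along the line from a to b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Hypothetical toy data: [years_experience, test_score]
majority = [[5, 70], [7, 82], [3, 64], [9, 90], [6, 75], [4, 68]]
minority = [[6, 80], [8, 88]]

# Generate enough synthetic rows to match the size of the majority group
new_rows = synthesize_minority(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_rows
```

Because every synthetic row sits between two real rows, this approach can only fill in gaps within the range the real examples already cover. It cannot invent genuinely new kinds of examples.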
It’s not just a question of bias, though. It’s also about accuracy. If you're trying to train a model on a certain set of images and the data set is too small or lacks detail, synthetic data can expand the data set and give you the richness you’d otherwise lack. In turn, this can help better train an AI system, making it more reliable and effective.
In short, synthetic data might not just be the difference between an AI project floundering and getting off the ground — it can also give it a competitive edge in a world where poor AI products are unfortunately common.
Using synthetic data to better understand the UK’s bus network
In a project with the UK’s Department for Transport we’re using synthetic data to provide better accessibility data.
We need it because we’re trying to train an AI model to identify specific street furniture (such as seating, a pavement or a bus shelter) in images of bus stops. While the model can be trained on original data, that data is limited, which in turn limits the model’s accuracy.
So, we use synthetic data in combination with original images, applying cropping, blurring and image rotation to create a richer data set that will help the AI model accomplish its goal faster.
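The three transformations named above can be sketched in a few lines. This is an illustration only, not the project's actual pipeline: it treats an image as a small grid of grayscale values so that it runs with nothing but the Python standard library, and every function name here is hypothetical. Real work would typically rely on an imaging library.

```python
def crop(img, top, left, height, width):
    """Return the height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def box_blur(img):
    """3x3 mean blur; pixels at the edge average only the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            vals = [img[rr][cc]
                    for rr in range(max(0, r - 1), min(h, r + 2))
                    for cc in range(max(0, c - 1), min(w, c + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def augment(img):
    """Produce three variants of one original: cropped, rotated and blurred."""
    h, w = len(img), len(img[0])
    return [crop(img, 0, 0, h - 1, w - 1), rotate90(img), box_blur(img)]


# Hypothetical 4x4 grayscale image: a bright square on a dark background
image = [[0, 0, 0, 0],
         [0, 255, 255, 0],
         [0, 255, 255, 0],
         [0, 0, 0, 0]]
variants = augment(image)
```

Each original image yields several training examples, and any labels attached to the original (for instance, that a bus shelter is present) can usually be carried over to the variants.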
In turn, this will help the Department for Transport better understand accessibility across the network, giving it an even stronger foundation for improving services for UK citizens.
Do more with data — without compromising security
Yes, data is an asset — but it can also be a liability. And given the demanding nature of our AI era, that means scaling data isn’t just expensive, it’s risky as well.
Synthetic data can play an important role here by helping to create approximations or representations of data, essentially mimicking a given data set but without any sensitive or personal information. This isn’t the only technique available: it’s possible to use anonymization or data masking to remove identifiable information. However, synthetic data can offer more robust privacy because the data matches only the statistical properties of another data set: it is otherwise fictitious.
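As a rough illustration of what matching only the statistical properties means, the sketch below measures the mean and standard deviation of a real column and then draws entirely new values from a distribution with those properties. It makes strong simplifying assumptions: a single numeric column, a bell-shaped distribution, and hypothetical ages as the data. Production generators model many columns and the relationships between them.

```python
import random
import statistics


def fit_and_sample(real_values, n, seed=0):
    """Draw n fictitious values that share the mean and spread of real_values."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical real column: customer ages
real_ages = [34, 45, 29, 52, 41, 38, 47, 33, 50, 36]
synthetic_ages = fit_and_sample(real_ages, n=1000)

# The synthetic column should have a similar mean and spread to the real one
print(statistics.mean(real_ages), statistics.mean(synthetic_ages))
print(statistics.stdev(real_ages), statistics.stdev(synthetic_ages))
```

None of the synthetic values is copied from a real record, yet analysis that depends only on the average and spread of the column would reach similar conclusions on either data set.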
In addition, being able to de-risk data with the help of synthetic data also opens up new opportunities for partnerships and data sharing. Most immediately, this is effective in fields like healthcare where collaborating on research could be incredibly helpful: creating a synthetic dataset that statistically mimics a real one makes it possible for institutions to more readily share what regulations currently prohibit.
While increased consumer awareness of privacy and legislation has made the question of data sharing more urgent than ever for businesses, the ability to generate data makes it possible to explore data sharing with less risk and more confidence. And given AI is shaping the future of many industries, it’s likely that the ability to think beyond the confines of the organization and operate and support a broader ecosystem will be immensely valuable. Organizations that have something to offer will be in a great position for growth.
What are the risks of using synthetic data?
There are nevertheless risks that need to be considered when using synthetic data. First, just as it can tackle bias, it can also perpetuate and exacerbate it. If you create synthetic data by expanding an existing data set that contains significant biases or inaccuracies, the synthetic data set is likely to carry them forward and amplify them. The way to tackle this is to ensure you properly understand your existing data sets, how they were created and what they do and don’t represent.
While it’s true synthetic data can be a helpful privacy tool, arguably offering greater privacy than techniques like data masking and anonymization, it certainly isn’t a silver bullet. It’s possible, for instance, for records to be re-identified if the synthetic data is based on a real dataset.
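One simple first-pass check for this risk is to look for synthetic rows that are identical to a real row, since those would expose a real record directly. The sketch below assumes hypothetical records made up of gender, age and postcode area. Passing this check does not prove a data set is safe, because near matches and rare combinations of attributes can also identify people.

```python
def exact_matches(real_rows, synthetic_rows):
    """Return the synthetic rows that are identical to some real row."""
    real_set = {tuple(r) for r in real_rows}
    return [row for row in synthetic_rows if tuple(row) in real_set]


# Hypothetical records: (gender, age, postcode area)
real = [("F", 34, "SW1A"), ("M", 52, "M1"), ("F", 41, "EH1")]
synthetic = [("F", 36, "M1"), ("M", 52, "M1"), ("M", 29, "EH1")]

leaked = exact_matches(real, synthetic)
print(leaked)  # rows to remove or regenerate before sharing
```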
There are also times when we shouldn’t try to mitigate bias in a data set using synthetic data. That might sound counterintuitive, but sometimes the bias in a data set is important and valuable information in its own right. Enriching data sets purely because synthetic data can be created quickly and easily might sometimes be the wrong approach.
Make synthetic data a key part of your data and AI strategy
The future of AI is, of course, about far more than just synthetic data. But the benefits of synthetic data, when used appropriately, are significant: it can improve time to insight, bolster model accuracy and minimize privacy risks.
This doesn’t mean we should stop collecting, cleaning, analyzing and processing ‘real’ data. But it does mean synthetic data needs to be embedded in your data and AI strategy: it could well give you a competitive edge when it comes to the continuing race to leverage AI.