Nov 01 2022
Data Analytics

What Is Synthetic Data, and How Does It Help Artificial Intelligence?

Created to resemble real data, this resource can help AI applications achieve success while saving time and maintaining privacy.

Businesses that use machine learning or other artificial intelligence for some applications often face an important question: Does the application have enough data to achieve its objective?

Having a sufficient set of data is essential to many AI applications, such as AI-driven simulations, and for training an algorithm. But compiling enough data to support these use cases can be a challenge. The process may be time-consuming, and some use cases may involve variables that can be difficult to test for through traditional means.

Synthetic data has emerged as a resource for building AI-driven models in areas such as manufacturing and supply chain management, as well as for developing algorithmic processes for fraud detection and spam identification. Earlier this year, the MIT Technology Review designated synthetic data as one of its 10 breakthrough technologies of 2022.

But unlike many emerging technologies, synthetic data is available now and usable by businesses large and small as they try to make sense of complex problems.


What Is Synthetic Data?

Synthetic data is information that is artificially produced but modeled on real data from existing inputs. It can be structured, as in a database, or unstructured; either way, the goal is to synthesize information that resembles the original data sets without actually containing them.

The information can be directly produced from real data, but it can also be indirectly produced from a model, removing any tie to directly identifiable data sets.
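The model-based route can be sketched with a deliberately simple example. Here, a Gaussian stands in for whatever generative model an organization might actually use, and all names and numbers are invented for illustration: the model's parameters are fitted to the real records, after which the real records can be set aside and fresh synthetic records drawn from the model alone.

```python
import random
import statistics

random.seed(0)

# "Real" data: customer ages that, in practice, would stay locked down
real_ages = [random.gauss(40, 8) for _ in range(10_000)]

# Step 1: fit a simple model (here, just a mean and standard deviation)
mu = statistics.fmean(real_ages)
sigma = statistics.stdev(real_ages)

# Step 2: discard the real records and sample from the model only,
# so no synthetic value is tied to an identifiable input
synthetic_ages = [random.gauss(mu, sigma) for _ in range(10_000)]

print(round(statistics.fmean(synthetic_ages), 1))  # close to the real mean of 40
```

Because the analysts only ever see draws from the fitted model, the synthetic set preserves the overall shape of the data without reproducing any individual record.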

For artificial intelligence or business analytics applications, data models can be used in a variety of settings such as simulations and visualizations, which can help to identify technical problems in areas such as engineering, financial services or healthcare. According to a 2021 Accenture report, the National Institutes of Health is using a synthetic data set to improve its research into the challenges facing COVID-19 patients, without actually using any true patient data.

“While the pandemic has illustrated potential health research–oriented use cases for synthetic data, we see potential for the technology across a range of other industries,” writes Fernando Lucini, Accenture’s global lead on data science and machine learning engineering.


How Does Synthetic Data Work?

Synthetic data, as the name suggests, is a substitute for actual data — used in place of the real thing and contextualized in similar ways.

This approach enables some important use cases. For example, it can be much faster than traditional data-gathering processes, according to Akhil Docca, a senior product marketing manager for NVIDIA’s NGC cloud services platform.

“The idea is that you can create a really awesome model with synthetic data, bring in real data, fine-tune it, and you’re off and running in production a lot faster,” Docca said. “So, for us, it’s shortening that time to market, if you will.”

Nyla Worker, who serves as product manager for NVIDIA’s Omniverse Replicator, says working with synthetic data in this way, called bootstrapping, carries a lot of potential, especially for business intelligence. In other cases, synthetic data can effectively augment the real data used.

For example, Worker cited the widely reported use of augmented data in training self-driving vehicles. While automotive companies have worked heavily to train vehicles in real-world situations, there may be circumstances where it is difficult to account for real-world conditions, creating potential for error.

“They will have the data maybe in a sunny condition, but not at night, or not in all of these other conditions,” she says. “That is exactly where synthetic data would come in and fill in the gaps for your data set.”

In some cases, legal reasons, such as privacy concerns, may prevent access to certain data. Synthetic data can address these issues while maintaining data privacy. For example, Microsoft’s AI Lab built a synthetic data generator to detect human trafficking while not tracking personally identifiable information related to the subjects being monitored. This helped create a model that nonprofits could use to assess the impact of human trafficking.

“By using synthetic data, we provide a level of indirection,” Microsoft’s AI Lab explained on its website.


How Synthetic Data Benefits Artificial Intelligence

Synthetic data can be used to expand the data pool for a given use case. Whether the aim is to build a full 3D visualization of a real-world setting — say, a factory where a complex array of machines work in unison — or to generate models for optimizing trading in financial markets, the goal is to stretch the capabilities of the AI model.

As author Khaled El Emam wrote in Accelerating AI with Synthetic Data, technical considerations may be only one part of the discussion. El Emam cited privacy considerations as a driver for the use of synthetic data in a specific context.

“For example, a data science group specializing in understanding customer behaviors would need large amounts of data to build its models,” he wrote. “But because of privacy or other concerns, the process for getting access to that customer data is slow and does not provide good enough data when it does arrive because of extensive masking and redaction of information. Instead, a synthetic version of the production data sets can be provided to the analysts for building their models.”

This, according to El Emam, will offer the benefit of allowing the data team to continue working quickly on a given task while getting around the limitations created by consumer privacy regulations such as the European Union’s General Data Protection Regulation and the California Consumer Privacy Act.

“Essentially, it's about deducing the larger patterns that inform us of data present, then generating unlimited amounts of that data with the same statistical distribution, meaning that it's not really ever reversible back to an individual,” says Ali Golshan, CEO and co-founder of Gretel, a startup that is a member of NVIDIA’s Inception program and focuses on privacy-driven AI generation.
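Golshan's point can be illustrated with a minimal, hypothetical sketch: learn only aggregate patterns (here, per-column value frequencies) from a handful of invented records, then generate as many synthetic records as needed from those frequencies. Note this toy version samples each column independently, which discards correlations that a real generator would preserve.

```python
import random
from collections import Counter

random.seed(1)

# Invented "real" records (individuals), used only to learn aggregates
real = [("US", "mobile"), ("US", "desktop"), ("DE", "mobile"),
        ("US", "mobile"), ("FR", "desktop"), ("DE", "mobile")]

# Deduce the larger pattern: how often each value occurs per column
countries = Counter(r[0] for r in real)
devices = Counter(r[1] for r in real)

def sample(counter):
    values, weights = zip(*counter.items())
    return random.choices(values, weights=weights, k=1)[0]

# Generate an effectively unlimited number of records; none is a copy
# of a specific individual, only a draw from the learned distribution
synthetic = [(sample(countries), sample(devices)) for _ in range(1000)]

share_us = sum(1 for c, _ in synthetic if c == "US") / len(synthetic)
print(share_us)  # close to the 0.5 share in the real records
```

Because only the frequencies survive into the generator, the output matches the statistical distribution of the source without being reversible to any individual.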


Strategies to Build and Generate Synthetic Data

To build synthetic data, organizations generally use processing resources that can develop complex, detailed data sets. This may not be a one-step process: if the goal is a data set that cannot be tied to an identifiable source, a generative model must first be trained on real-world results, then used to produce data that is rooted in those results without actually reproducing them.

Often, this will require the scale of cloud platforms such as Google Cloud, Microsoft Azure and Amazon Web Services, which can apply heavy-duty processing power to developing data for business intelligence and machine learning use cases.

Much of this data generation leverages graphics processing units, because AI workloads benefit from the parallel processing capacity that GPUs offer.

The strategies to build synthetic data, whether in physical simulations or elsewhere, are growing more sophisticated. Some startups have emerged in recent years to offer application programming interface (API) solutions for generating synthetic data on the fly for use in analytics or academic research.

Meanwhile, NVIDIA offers its Omniverse platform for building realistic 3D simulations. Last year, the company extended the platform with its Omniverse Replicator tool, which can generate labeled synthetic data for training neural networks from these simulations. Replicator is part of NVIDIA’s Omniverse Cloud Service tool, which can integrate with AWS and other cloud platforms to generate and store data.

“Replicator is basically an API that connects to Omniverse; Omniverse has many connectors and brings in data for many tools,” NVIDIA’s Worker says. “With the Replicator API, you can go directly to that to generate data, or use a third-party tool.”

As AI gets smarter, so too will the ways that organizations generate data to train it.
