What Is Synthetic Data?
Synthetic data is artificially generated information that is modeled on real data. It can be structured, as in a database, or unstructured. Either way, the goal is to synthesize information from real data sets so that it resembles the original without reusing it.
Synthetic data can be produced directly from real data, or indirectly from a model, which removes any tie to directly identifiable data sets.
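To make the model-based route concrete, here is a minimal Python sketch. It assumes the simplest possible model: a normal distribution fitted to a handful of numeric values. The `real_ages` list is made up for illustration, and real generators are far more sophisticated. Fitting a mean and a standard deviation also does not, on its own, guarantee privacy.

```python
import random
import statistics

# Hypothetical "real" measurements (made-up values, for illustration only).
real_ages = [34, 45, 52, 61, 47, 58, 39, 66, 50, 43]

# Step 1: fit a simple model to the real data.
# Here the "model" is just a mean and a standard deviation.
mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)

# Step 2: sample new values from the model instead of copying real records.
rng = random.Random(0)  # fixed seed so the run is repeatable
synthetic_ages = [round(rng.gauss(mu, sigma)) for _ in range(1000)]

print(f"real mean: {mu:.1f}")
print(f"synthetic mean: {statistics.mean(synthetic_ages):.1f}")
```

The synthetic values follow roughly the same distribution as the originals, but each one is drawn from the fitted model rather than copied from an individual record.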
In artificial intelligence and business analytics applications, these data models can be used in settings such as simulations and visualizations, which can help identify technical problems in fields such as engineering, financial services and healthcare. According to a 2021 Accenture report, the National Institutes of Health is using a synthetic data set to improve its research into the challenges facing COVID-19 patients, without using any actual patient data.
“While the pandemic has illustrated potential health research–oriented use cases for synthetic data, we see potential for the technology across a range of other industries,” writes Fernando Lucini, Accenture’s global lead on data science and machine learning engineering.
How Does Synthetic Data Work?
Synthetic data, as the name suggests, is a substitute for actual data — used in place of the real thing and contextualized in similar ways.
This approach enables some important use cases. For example, generating synthetic data can be much faster than gathering data through traditional processes, according to Akhil Docca, a senior product marketing manager for NVIDIA’s NGC cloud services platform.
“The idea is that you can create a really awesome model with synthetic data, bring in real data, fine-tune it, and you’re off and running in production a lot faster,” Docca said. “So, for us, it’s shortening that time to market, if you will.”
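The workflow Docca describes (train on synthetic data first, then fine-tune on real data) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not NVIDIA's method: the model is a one-variable linear regression, the "simulator" and the "real" relationship are both invented, and the `train` and `mse` helpers are hypothetical names.

```python
import random

def train(w, b, data, lr, epochs):
    """Run plain gradient descent on y = w*x + b and return the updated (w, b)."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(w, b, data):
    """Mean squared error of the line y = w*x + b on the given data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumption: the simulator is close to reality but not exact.
# Synthetic data follows y = 2.8x + 0.5; real data follows y = 3x + 1 plus noise.
synthetic = [(x, 2.8 * x + 0.5) for x in (rng.uniform(0, 1) for _ in range(500))]
real = [(x, 3.0 * x + 1.0 + rng.gauss(0, 0.05))
        for x in (rng.uniform(0, 1) for _ in range(20))]

# Step 1: pretrain on plentiful synthetic data.
w, b = train(0.0, 0.0, synthetic, lr=0.1, epochs=2000)
error_before = mse(w, b, real)

# Step 2: fine-tune on the small real data set, starting from the pretrained weights.
w, b = train(w, b, real, lr=0.1, epochs=200)
error_after = mse(w, b, real)

print(f"error on real data before fine-tuning: {error_before:.4f}")
print(f"error on real data after fine-tuning: {error_after:.4f}")
```

The pretrained model starts close to the right answer, so the fine-tuning step needs only 20 real samples and a short run to reduce the error on real data.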
Nyla Worker, who serves as product manager for NVIDIA’s Omniverse Replicator, says working with synthetic data in this way, called bootstrapping, carries a lot of potential, especially for business intelligence. In other cases, synthetic data can effectively augment the real data used.
For example, Worker cited the widely reported use of augmented data in training self-driving vehicles. While automotive companies have worked heavily to train vehicles in real-world situations, there may be circumstances where it is difficult to account for real-world conditions, creating potential for error.
“They will have the data maybe in a sunny condition, but not at night, or not in all of these other conditions,” she says. “That is exactly where synthetic data would come in and fill in the gaps for your data set.”
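A rough Python sketch of that gap-filling idea follows. The sample counts are invented, and `synthesize` is a hypothetical stand-in for whatever simulator or generative model would actually produce the data. Topping up every condition to match the largest one is only one possible policy.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical real data set: each sample is tagged with a driving condition.
# The counts are made up to mirror the "sunny but not at night" gap.
real_samples = (
    [{"condition": "sunny", "source": "real"}] * 900
    + [{"condition": "rain", "source": "real"}] * 80
    + [{"condition": "night", "source": "real"}] * 20
)

def synthesize(condition):
    """Hypothetical placeholder for a simulator that renders one sample."""
    return {"condition": condition, "source": "synthetic",
            "seed": rng.randrange(10**6)}

# Count how many real samples exist per condition.
counts = Counter(s["condition"] for s in real_samples)
target = max(counts.values())

# Generate synthetic samples only for the under-represented conditions.
synthetic_samples = [
    synthesize(condition)
    for condition, count in counts.items()
    for _ in range(target - count)
]

combined = real_samples + synthetic_samples
print(Counter(s["condition"] for s in combined))
```

In this sketch the sunny samples stay entirely real, while the rain and night conditions are topped up with synthetic samples until every condition is equally represented.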
In some cases, legal constraints, such as privacy requirements, may prevent access to certain data. Synthetic data can address these issues while maintaining data privacy. For example, Microsoft’s AI Lab built a synthetic data generator to detect human trafficking without tracking personally identifiable information related to the subjects being monitored. This helped create a model that nonprofits could use to assess the impact of human trafficking.
“By using synthetic data, we provide a level of indirection,” Microsoft’s AI Lab explained on its website.