The Advantages and Limitations of Synthetic Data

Marcelo Benedetti

January 24, 2018

With increased buzz around synthetic data, it is important to understand the advantages and limitations of this solution and its overall effect on the application.

Is it really possible to use generated synthetic data as training data? Let's first take a step back and define synthetic data. In short, synthetic data is system-generated data that mimics real data in terms of essential parameters set by the user. More broadly, it is any production data not obtained by direct measurement, and it is generally considered anonymized. Conceptually, synthetic data may seem like a compilation of “made up” data, but it is produced by specific algorithms designed to create realistic data. Synthetic data can help teach a system how to react to certain situations or criteria.

How is synthetic data generated? One method is to strip any personal information (names, license plates, etc.) from a real dataset so that it is anonymized. Another method is to build a generative model from the original dataset that produces synthetic data closely resembling the authentic data. A generative model learns from large, real datasets so that the data it produces resembles real-world data. Three common types of generative models are Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and autoregressive models. GANs pit a generative network against a discriminative network in a zero-sum game: the generator produces samples, and the discriminator tries to tell them apart from real ones. VAEs encode the input into a compact representation and then decode it, learning to reconstruct the input at the output. Autoregressive models generate data one element at a time; for images, the network is trained to create each pixel based on the previous pixels above and to the left of it.
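
The learn-then-generate idea can be sketched in a few lines of Python. The example below is a deliberately tiny stand-in and not one of the neural models named above: it uses a first-order Markov chain, which is autoregressive only in the loose sense that each new element is sampled based on the previous one. The event names, the `fit_markov_chain` and `generate` helpers, and the sample sequences are all hypothetical and exist only for illustration.

```python
import random
from collections import defaultdict


def fit_markov_chain(sequences):
    """Record, for each symbol, every symbol that followed it in the real data."""
    transitions = defaultdict(list)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            transitions[current].append(following)
    return dict(transitions)


def generate(transitions, start, length, seed=None):
    """Sample a new sequence, each symbol conditioned on the previous one."""
    rng = random.Random(seed)
    out = [start]
    # Stop early if the last symbol was never followed by anything in training.
    while len(out) < length and out[-1] in transitions:
        out.append(rng.choice(transitions[out[-1]]))
    return out


# Hypothetical "authentic" data: short sequences of driving events.
real_sequences = [
    ["cruise", "brake", "stop", "accelerate", "cruise"],
    ["cruise", "lane_change", "cruise", "brake", "stop"],
    ["stop", "accelerate", "cruise", "lane_change", "cruise"],
]

model = fit_markov_chain(real_sequences)
synthetic = generate(model, start="cruise", length=8, seed=42)
print(synthetic)
```

Every transition in the generated sequence was observed in the real sequences, so the output looks plausible, yet the sequence as a whole need not appear anywhere in the original data.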

Now that we've discussed what synthetic data is and how it's created, we can look into why synthetic datasets are used and how they stack up against real data for system training.

Synthetic data has a wide variety of applications, such as image processing, IoT, AI, machine learning, defense, and natural language processing. The main reasons why synthetic data is used instead of real data are cost, privacy, and testing. Producing synthetic data through a generative model is significantly more cost-effective and efficient than collecting real-world data. This especially applies to the autonomous vehicle space, where real-world data can be both time-consuming and costly to collect. Once the generative model is set up, producing new data becomes cheap and fast.

Another major advantage of synthetic data is anonymity: all personal information has been removed and the data cannot be traced back to the original owner, avoiding any possible copyright infringements. This is critical when attempting to recreate realistic user behaviors, as synthetic data protects the privacy and confidentiality of the authentic data. For example, the U.S. Census Bureau has used synthetic data, free of personal information, that mirrors real data collected through household surveys on income and program participation.

Synthetic data can be used to test existing system performance as well as to train new systems on scenarios that are not represented in the authentic data. Rather than using costly real-world data to test whether the system provides the desired output, you can plug in synthetic data and analyze the results. In cases where the authentic data does not represent every possible situation, synthetic data can play a vital part in system training. This is notably relevant in the defense space, where it is necessary to ensure the system can handle a variety of intrusion and attack types. Using artificial data, you can train a system on a multitude of scenarios not covered in the authentic data, thus improving its defensive capabilities.

Synthetic data does not come without its limitations. While synthetic data can mimic many properties of authentic data, it does not copy the original content exactly. Models look for common trends in the original data when creating synthetic data and, in turn, may not cover the corner cases that the authentic data did. In some instances, this may not be a critical issue. However, in most system training scenarios, this will severely limit the trained system's capabilities and negatively impact its output accuracy.
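
A small sketch can make the corner-case problem concrete. It assumes a deliberately simple generative model, a single Gaussian fitted to one numeric column, and hypothetical "authentic" readings in which 1% of the values are rare extremes. The helper names and the numbers are illustrative only; real generative models are far more expressive, but the same effect can appear whenever rare cases are smoothed into the overall trend.

```python
import random
import statistics


def make_real_data():
    """Hypothetical authentic readings: 990 typical values plus 10 rare extremes."""
    typical = [50.0 + (i % 5) - 2 for i in range(990)]  # values from 48 to 52
    rare = [500.0] * 10                                  # the corner case
    return typical + rare


def fit_gaussian(values):
    """'Train' the model: summarize the data by its mean and standard deviation."""
    return statistics.fmean(values), statistics.pstdev(values)


def sample_synthetic(mean, stdev, n, seed=None):
    """Generate n synthetic values from the fitted Gaussian."""
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(n)]


def share_at_least(values, threshold):
    """Fraction of values at or above the threshold."""
    return sum(v >= threshold for v in values) / len(values)


real = make_real_data()
mean, stdev = fit_gaussian(real)
synthetic = sample_synthetic(mean, stdev, n=len(real), seed=0)

print("share of extreme values in real data:     ", share_at_least(real, 400.0))
print("share of extreme values in synthetic data:", share_at_least(synthetic, 400.0))
```

The fitted model matches the overall mean and spread of the real data, but it places essentially no probability near the rare extreme values, so a system trained only on its samples would almost never see that case.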

Also, the quality of synthetic data is highly dependent on the quality of the model that created it. These generative models can be excellent at recognizing statistical regularities in datasets but can also be susceptible to statistical noise, such as adversarial perturbations. Adversarial perturbations can cause the model or network to completely misclassify data and, in turn, create highly inaccurate outputs. One way to address this issue is to feed real-world, human-annotated data into the model and test its outputs for accuracy.
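
The accuracy check described above amounts to comparing the model's outputs with the human annotations. The following is a minimal sketch; the `accuracy` helper, the label names, and the example predictions are assumptions made for illustration and are not taken from any particular system.

```python
def accuracy(predictions, human_labels):
    """Fraction of model outputs that match the human-annotated labels."""
    if len(predictions) != len(human_labels):
        raise ValueError("predictions and human_labels must have the same length")
    if not human_labels:
        raise ValueError("at least one labeled example is required")
    correct = sum(p == y for p, y in zip(predictions, human_labels))
    return correct / len(human_labels)


# Hypothetical validation set: labels assigned by human annotators,
# and the outputs of a model for the same four examples.
human_labels = ["car", "pedestrian", "car", "cyclist"]
predictions = ["car", "pedestrian", "cyclist", "cyclist"]

print(accuracy(predictions, human_labels))
```

In practice the same comparison would be run on a much larger validation set, and a drop in this score is a signal to re-examine the generative model or the data it produced.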

Another challenge presented by using synthetic data is the need for a verification server, an intermediary computer that performs the identical analysis on the initial data. This system is put in place to test and compare the outputs from the authentic and synthetic data. The goal is to ensure the system has been properly trained and is not producing the desired outputs merely because of assumptions that were built into the synthetic data.
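
The comparison such a verification step performs can be sketched as follows: run the same analysis on both datasets and check whether the results agree. The `verify` function, the 5% relative tolerance, and the sample values are assumptions chosen for illustration, not a description of any real verification server.

```python
import statistics


def verify(analysis, authentic, synthetic, rel_tolerance=0.05):
    """Run the same analysis on both datasets and report whether the results agree."""
    authentic_result = analysis(authentic)
    synthetic_result = analysis(synthetic)
    # Relative difference, guarded against division by zero.
    rel_diff = abs(authentic_result - synthetic_result) / max(abs(authentic_result), 1e-12)
    return {
        "authentic": authentic_result,
        "synthetic": synthetic_result,
        "relative_difference": rel_diff,
        "agrees": rel_diff <= rel_tolerance,
    }


# Hypothetical measurements and a synthetic counterpart.
authentic = [10.0, 12.0, 11.0, 13.0, 9.0]
synthetic = [10.5, 11.5, 11.0, 12.0, 10.0]

print(verify(statistics.fmean, authentic, synthetic))   # compares the averages
print(verify(statistics.pstdev, authentic, synthetic))  # compares the spread
```

Here the two datasets have the same average but a noticeably different spread, which is the kind of discrepancy this comparison is meant to surface before conclusions drawn from synthetic data are trusted.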

While synthetic data can be easy to create, cost-effective, and highly useful in some circumstances, there is still a heavy reliance on human-annotated, real-world data. The only way to guarantee a model is generating accurate, realistic outputs is to test its performance on well-understood, human-annotated validation data. While generating realistic synthetic data has become easier over time, real-world, human-annotated data remains a necessary part of machine learning training data.