
With the CVPR conference approaching, we think it is worth highlighting an emerging trend at the intersection of computer vision and synthetic data. Synthetic data is information that is artificially created rather than generated by real-world events. It is not limited to visual data; it also spans voice, entity, and sensor data (such as LiDAR, radar, and GPS). In this article, we elaborate on the value of synthetic data and categorize 45 products in the space.
AI Frontline Note: CVPR, short for Conference on Computer Vision and Pattern Recognition, is an annual global academic conference organized by IEEE, focusing on computer vision and pattern recognition technology. Each year, CVPR has a fixed seminar theme and is generally held in June, often in the western United States, with occasional rotations to the central and eastern regions.
With the development of ready-made training frameworks such as TensorFlow and PyTorch, building machine learning models is easier than ever. Unfortunately, data still represents a “cold start” problem in machine learning. Companies often cannot obtain sufficient data within a given timeframe to build high-accuracy models. Additionally, large companies like Google possess vast, hard-to-breach data moats. Today, companies capturing data are manually labeling it, which can be a slow, expensive, and low-quality process. If synthetic data is utilized, it can help companies bypass these limitations and democratize data.
AI Frontline Note: Data democratization refers to making the various public data held by governments, businesses, and institutions available on the internet so that anyone can access and download it. Citizens have the right to use the data in whatever way they deem appropriate and to choose the experts and applications they want to help them, turning to that help only when it is needed. In other words, citizens have the right to be informed about, to speak on, and to make decisions regarding data.
Synthetic data has numerous benefits:
- Reduces reliance on generating and capturing data.
- If a company chooses to generate synthetic data itself, it minimizes the need for third-party data sources.
- Can be cheaper and faster than manually labeled data.
- Can generate data that is difficult to capture in the real world (e.g., visual content underwater or in military conflict zones).
- Can generate data that rarely occurs in nature but is critical for training (e.g., edge cases).
- Can produce large amounts of data.
- Can provide perfectly labeled data.
- Can support faster labeling iterations.
- Can reduce privacy concerns.
This article mainly focuses on the visual aspects of synthetic data, which comes in two primary forms:
1) Photo-realistic data;
2) Data created programmatically.
Photo-realistic data is crafted by artists with the goal of resembling real objects as closely as possible. The process of generating photo-realistic data takes longer than creating data programmatically.
Synthetic data can be created programmatically with game engines and 3D tools such as Unreal, Unity, and Blender, with programs like Houdini used to accelerate asset creation. Teams can then employ techniques such as GAN-based domain adaptation or domain randomization to increase data variation.
Domain adaptation is the task of classifying an unlabeled dataset (the target) using a labeled dataset from a related domain (the source). It lets teams take lower-quality synthetic data together with real data and use them to make the synthetic data more useful.
AI Frontline Note: Domain adaptation is a crucial part of transfer learning, aimed at mapping data from source and target domains with different distributions into a feature space, minimizing the distance between them. Consequently, the objective function trained on the source domain can be transferred to the target domain, improving accuracy on the target domain.
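To make the idea of aligning two domains concrete, here is a minimal PyTorch sketch. The article mentions GAN-based domain adaptation; for brevity this sketch swaps in a simpler maximum mean discrepancy (MMD) penalty that pulls synthetic (source) and real (target) feature distributions together. The network sizes, loss weight, and random tensors are illustrative placeholders, not anything from the cited work.

```python
# Minimal domain-adaptation sketch (assumption: feature alignment via an
# RBF-kernel MMD penalty; model, data, and hyperparameters are illustrative).
import torch
import torch.nn as nn

def mmd_rbf(x, y, sigma=1.0):
    """Maximum mean discrepancy between two batches of features."""
    def kernel(a, b):
        dist = torch.cdist(a, b) ** 2          # pairwise squared distances
        return torch.exp(-dist / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

feature_net = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 256), nn.ReLU())
classifier = nn.Linear(256, 10)
optimizer = torch.optim.Adam(
    list(feature_net.parameters()) + list(classifier.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(synthetic_images, synthetic_labels, real_images, mmd_weight=0.1):
    """One update: supervised loss on labeled synthetic data plus an MMD term
    that pulls synthetic and (unlabeled) real feature distributions together."""
    src_feats = feature_net(synthetic_images)
    tgt_feats = feature_net(real_images)
    loss = criterion(classifier(src_feats), synthetic_labels)
    loss = loss + mmd_weight * mmd_rbf(src_feats, tgt_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random tensors standing in for real data loaders.
loss = train_step(torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,)),
                  torch.rand(8, 3, 32, 32))
```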
Domain randomization also helps narrow the reality gap. According to a paper by Nvidia, "domain randomization intentionally sacrifices photo-realism by randomly perturbing the environment in non-realistic ways, forcing the network to focus on the essential features of the images." The perturbations can include the image scene, lighting positions and intensities, textures, scales, and object positions. Rather than training a model on a single simulated dataset, the simulator itself is randomized, exposing the model to a wide variety of data (as shown below). Because the barrier to entry is low, this technique has quickly become the most popular approach.
AI Frontline Note: The Nvidia paper can be found at “Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization” https://arxiv.org/pdf/1804.06516.pdf

Source: “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World,” by Tobin, Joshua et al. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2017): 23-30
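As an illustration of the kind of perturbations described above, here is a small, hypothetical Python sketch that samples randomized scene parameters per frame. The `render_scene` hook stands in for whatever game-engine or renderer API a team actually uses and is not a real function; only the parameter sampling is shown.

```python
# Domain-randomization sketch: sample a new, deliberately non-realistic scene
# configuration for every frame. `render_scene` is a hypothetical hook into a
# renderer (Unity, Unreal, Blender, etc.).
import random

TEXTURES = ["checker", "noise", "wood", "metal", "flat_red", "flat_blue"]

def sample_scene_params():
    return {
        "light_position": [random.uniform(-5, 5) for _ in range(3)],  # x, y, z
        "light_intensity": random.uniform(0.2, 3.0),
        "background_texture": random.choice(TEXTURES),
        "object_texture": random.choice(TEXTURES),
        "object_scale": random.uniform(0.5, 2.0),
        "object_position": [random.uniform(-1, 1) for _ in range(3)],
        "camera_fov_deg": random.uniform(40, 90),
        "num_distractor_objects": random.randint(0, 10),
    }

def generate_dataset(num_frames, render_scene):
    """Render `num_frames` images, each with freshly randomized parameters;
    labels come for free from the simulator."""
    return [(render_scene(p), p)
            for p in (sample_scene_params() for _ in range(num_frames))]

# Example with a dummy renderer that just echoes one of the parameters.
frames = generate_dataset(
    3, render_scene=lambda p: f"<image rendered with {p['object_texture']}>")
```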
Domain randomization has a subclass known as guided domain randomization. This line of research focuses on creating the randomizations automatically rather than designing them by hand, which can be tedious and monotonous. Being able to create synthetic data programmatically further shortens the time to value.
Companies can choose to use third-party vendors that provide synthetic data or build internal teams. We know it is challenging to find and hire talent with expertise spanning technical art, game development, and machine learning. When teams do decide to leverage synthetic data, we hear that they mix it with real data for training, typically at a ratio of 80%-90% synthetic to 10%-20% real.
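As a rough illustration of that kind of mix, the following hypothetical PyTorch sketch draws roughly 80% of each training batch from a synthetic dataset and 20% from a real one using a weighted sampler; the datasets and the exact ratio are placeholders.

```python
# Sketch of mixing synthetic and real data at a fixed ratio (assumed here to be
# 80% synthetic / 20% real per batch in expectation); datasets are stand-ins.
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

synthetic_ds = TensorDataset(torch.rand(900, 3, 32, 32),
                             torch.randint(0, 10, (900,)))  # e.g. rendered images
real_ds = TensorDataset(torch.rand(100, 3, 32, 32),
                        torch.randint(0, 10, (100,)))       # e.g. labeled photos

combined = ConcatDataset([synthetic_ds, real_ds])

# Per-sample weights chosen so that, in expectation, 80% of draws come from the
# synthetic pool and 20% from the real pool, regardless of pool sizes.
weights = torch.cat([
    torch.full((len(synthetic_ds),), 0.8 / len(synthetic_ds)),
    torch.full((len(real_ds),), 0.2 / len(real_ds)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(combined),
                                replacement=True)
loader = DataLoader(combined, batch_size=32, sampler=sampler)

images, labels = next(iter(loader))  # a mixed batch, ~80% synthetic on average
```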
Academic research is working toward the point where 100% of training data can be synthetic while matching the accuracy of models trained on real data. For now, cross-domain applications are where synthetic data shines. For example, if you run a self-driving car company whose vehicles will drive in both San Francisco and Tokyo, you need training data from both locations, but perhaps you have no access to data from Tokyo. If you train only on San Francisco data and then drive the car in Tokyo, it will perform worse than if you had trained on synthetic Tokyo data combined with real San Francisco data.
Most synthetic data today suffers from a "reality gap": it does not look fully realistic. As a result, models trained on synthetic data for a given domain rarely match the performance of models trained on real data from that domain. Within a domain, synthetic data can also be challenging to produce because it often needs to capture physical behaviors such as gravity and inertia; reflecting physics accurately is difficult, although game engines are making progress.
Advanced academic research from Berkeley, OpenAI, and Nvidia is driving the capability to generate high-accuracy models using only 100% synthetic data. For instance, a paper from OpenAI used domain randomization to build a data generation pipeline for synthesizing objects. A robot grasping model generated from 100% synthetic data achieved over 90% success when grasping real objects it had never seen before.
AI Frontline Note: OpenAI’s paper can be found at “Domain Randomization and Generative Models for Robotic Grasping” https://arxiv.org/pdf/1710.06425.pdf
Even mixing different types of synthetic data for training can pay off. A paper by Nvidia found that an object pose estimation model trained on a combination of domain-randomized and photo-realistic data could compete with state-of-the-art networks trained on a combination of real and synthetic data. Still, we have yet to see any company successfully run a high-accuracy model in production using 100% synthetic data.
AI Frontline Note: Nvidia’s paper can be found at “Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects” https://arxiv.org/pdf/1809.10790.pdf
The use cases for synthetic data are vast. For computer vision applications, the most common use cases include autonomous systems (such as autonomous vehicles, robotics, and drones), agricultural technology, real estate, video surveillance, consumer packaged goods (CPG), retail, and defense. Synthetic entity data has been driven largely by privacy concerns: it can strip out names, emails, social security numbers, and so on while still reflecting the underlying dataset, which lets data scientists experiment without touching sensitive information. We have also seen synthetic voice data applied in media production.
AI Frontline Note: CPG here stands for consumer packaged goods, i.e., everyday retail products such as food, beverages, and household items.
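To make the entity-data point concrete, here is a small illustrative Python sketch using the open-source Faker library to generate synthetic customer records that keep the shape of a sensitive table without containing any real person's details. The schema and field names are invented for this example.

```python
# Sketch: generate synthetic "entity" records that mirror the schema of a
# sensitive customer table without exposing real names, emails, or SSNs.
# Uses the open-source Faker library; the schema below is invented.
from faker import Faker

fake = Faker()

def synthetic_customer():
    return {
        "name": fake.name(),
        "email": fake.email(),
        "ssn": fake.ssn(),
        "address": fake.address().replace("\n", ", "),
        "signup_date": fake.date_between(start_date="-3y",
                                         end_date="today").isoformat(),
    }

records = [synthetic_customer() for _ in range(5)]
for record in records:
    print(record)
```

Note that a sketch like this only preserves field formats; reproducing the statistical properties of the underlying dataset requires more sophisticated generative approaches of the kind the vendors discussed below offer.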
We categorized 45 synthetic data solutions into six categories:
- Tools
- Sensors (cameras, LiDAR, radar, and GPS)
- Entities
- Voice
- Forensics
- Products/virtual avatars utilizing synthetic data
The market map below is not exhaustive, but it highlights some of the better-known products in the field.
The map above includes products that make use of synthetic data, for example in media production. In recent months a wave of "deepfakes" has emerged: videos or audio clips depicting events that never actually happened. For example, Lyrebird can replicate the voice of U.S. President Donald Trump. The startup Synthesia recently released a machine-learning-generated video of David Beckham speaking about the fight against malaria. Deepfakes of Elon Musk, Salvador Dalí, and Barack Obama have also appeared online.
AI Frontline Note: Salvador Dalí (May 11, 1904 – January 23, 1989) was a Spanish Catalan painter known for his surrealist works and is considered one of the three most representative painters of the 20th century alongside Picasso and Matisse.
Deepfakes are an increasingly worrying problem because they are often nearly indistinguishable from reality. McAfee, Symantec, and academia are researching forensic techniques for detecting them. A paper presented by Symantec at Black Hat 2018 described how to identify fake videos using Google's FaceNet. Researchers at the University at Albany have released software that can flag a video as a deepfake by analyzing how often the subject blinks. We believe that in the future, synthetic audio and video content will be watermarked to avoid confusion.
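As a rough illustration of the eye-blink idea (not the University at Albany system itself), the sketch below computes a standard eye aspect ratio from per-frame eye landmarks, counts blinks, and flags clips with unnaturally low blink rates. The landmark extraction is assumed to come from an external face-landmark model and is not shown; the threshold values are illustrative.

```python
# Illustrative blink-frequency check: given per-frame eye landmarks from some
# face-landmark detector, compute the eye aspect ratio (EAR), count blinks,
# and flag clips whose blink rate looks unnaturally low.
import numpy as np

def eye_aspect_ratio(eye):
    """`eye` is a (6, 2) array of landmarks around one eye; the EAR drops
    sharply when the eye closes (Soukupova & Cech, 2016)."""
    a = np.linalg.norm(eye[1] - eye[5])
    b = np.linalg.norm(eye[2] - eye[4])
    c = np.linalg.norm(eye[0] - eye[3])
    return (a + b) / (2.0 * c)

def blink_rate(ear_per_frame, fps, closed_thresh=0.2):
    """Count EAR dips below `closed_thresh` and return blinks per minute."""
    closed = np.asarray(ear_per_frame) < closed_thresh
    blinks = np.sum(closed[1:] & ~closed[:-1])  # open -> closed transitions
    minutes = len(ear_per_frame) / fps / 60.0
    return blinks / minutes if minutes > 0 else 0.0

def looks_suspicious(ear_per_frame, fps, min_blinks_per_minute=5.0):
    """A very low blink rate is only a weak signal that a video may be
    synthetic; real forensic systems combine many such cues."""
    return blink_rate(ear_per_frame, fps) < min_blinks_per_minute

# Example with fabricated EAR values: mostly open eyes (~0.3) with two dips.
ears = [0.3] * 100 + [0.1] * 3 + [0.3] * 100 + [0.1] * 3 + [0.3] * 100
print(looks_suspicious(ears, fps=30))
```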
Synthetic data is an emerging trend in the fields of machine learning and data science. Synthetic data exists across voice, sensor, and entity data. Compared to data labeling techniques, synthetic data brings many benefits, including speed, cost, scale, and diversity. Some vendors offer synthetic data as a service, while others utilize it to improve media production. With the emergence of Deepfakes, there is a growing need to verify real content versus synthetic content. This field is just beginning but is developing rapidly.
Original link:
https://medium.com/memory-leak/deepfakes-whats-real-with-synthetic-data-5c8348b041d2