Is “fake data” the real deal when training algorithms?

You are driving your car, but you are exhausted. Your shoulders start to sag, your neck begins to droop, your eyelids slide down. As your head pitches forward, you swerve off the road and speed through a field, crashing into a tree.

But what if your car’s monitoring system recognized the telltale signs of drowsiness and told you to pull over and park instead? The European Commission has legislated that from this year, new vehicles must be equipped with systems that detect distracted and drowsy drivers in order to avoid accidents. Today, a number of startups are training artificial intelligence systems to recognize the signs in our facial expressions and body language.

These companies are taking an innovative approach to the field of AI. Instead of filming thousands of real drivers falling asleep and feeding that footage into a deep learning model so it can “learn” the signs of drowsiness, they create millions of fake human avatars that replicate those signs.

“Big data” defines the field of AI for a reason. To be trained accurately, deep learning algorithms need a vast number of data points. That creates problems for a task such as recognizing a person falling asleep at the wheel, which would be difficult and time-consuming to film in thousands of cars. Instead, companies have begun creating virtual datasets.

Synthesis AI and Datagen are two companies that use full-body 3D scans, including detailed facial scans, along with motion data captured by body-worn sensors, to collect raw data from real people. This data is fed through algorithms that repeatedly tweak different dimensions, creating millions of 3D representations of humans, resembling characters in a video game, that act out different behaviors across a variety of simulations.
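To make that step concrete, here is a minimal Python sketch of the idea: take one real scan and sweep variation parameters to multiply it into many synthetic samples. Every name and value below (body types, camera angles, behaviors) is a hypothetical illustration, not either company’s actual pipeline.

```python
import itertools
import random

# Hypothetical variation axes; a real pipeline would have far more dimensions.
BODY_TYPES = ["slim", "average", "broad"]
CAMERA_ANGLES = [-30, -15, 0, 15, 30]          # camera yaw in degrees
LIGHTING = ["daylight", "dusk", "night_ir"]
BEHAVIORS = ["alert", "yawning", "eyes_closing", "head_nodding"]

def generate_variants(scan_id: str, n: int, seed: int = 0) -> list:
    """Describe n synthetic samples derived from one real 3D scan."""
    rng = random.Random(seed)
    combos = list(itertools.product(BODY_TYPES, CAMERA_ANGLES, LIGHTING, BEHAVIORS))
    return [
        {
            "source_scan": scan_id,
            "body_type": body,
            "camera_yaw_deg": yaw,
            "lighting": light,
            "behavior": behavior,
            # small random jitter so no two rendered clips move identically
            "motion_jitter": round(rng.uniform(0.0, 1.0), 3),
        }
        for body, yaw, light, behavior in rng.choices(combos, k=n)
    ]

if __name__ == "__main__":
    for sample in generate_variants("scan_0001", n=3):
        print(sample)
```

A few dozen real scans crossed with enough variation axes quickly yields the millions of distinct training examples the companies describe.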

In the case of someone falling asleep at the wheel, they can film a human actor falling asleep and combine that with motion capture, 3D animation and other techniques used to create video games and animated films, to build the desired simulation. “You can map [the target behaviour] onto thousands of different body types, different angles, different lighting, and also add variability to the movement,” says Yashar Behzadi, CEO of Synthesis AI.

Using synthetic data removes much of the messiness of the more traditional way of training deep learning algorithms. Typically, a company would have to amass a vast collection of real-life footage and have low-paid workers painstakingly tag each of the clips. The labeled clips would then be fed into the model, which would learn to recognize the behaviors.
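In code, that traditional pipeline boils down to supervised learning on human-labeled clips. The toy sketch below shows its shape in Python; the feature extractor is a stub that returns random numbers, and the file names and labels are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def extract_features(clip_path: str) -> np.ndarray:
    # Stand-in for a real video feature extractor (e.g., pooled CNN embeddings).
    return rng.normal(size=16)

# Each clip was tagged by a human annotator: 1 = drowsy, 0 = alert.
labeled_clips = [("clip_001.mp4", 1), ("clip_002.mp4", 0), ("clip_003.mp4", 1),
                 ("clip_004.mp4", 0), ("clip_005.mp4", 1), ("clip_006.mp4", 0)]

X = np.stack([extract_features(path) for path, _ in labeled_clips])
y = np.array([label for _, label in labeled_clips])

# The model "learns" whatever signs the annotators tagged.
model = LogisticRegression().fit(X, y)
print(model.predict(extract_features("clip_new.mp4").reshape(1, -1)))
```

The expensive part is not the training loop but assembling `labeled_clips`, which is exactly the cost synthetic data is meant to cut: generated samples arrive with their labels already known.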

The big sell for the synthetic data approach is that it’s faster and cheaper by a wide margin. But these companies also say it can help combat the bias that creates a huge headache for AI developers. It is well documented that some AI facial recognition software fails to recognize and correctly identify particular demographic groups. This tends to be because these groups are underrepresented in the training data, which means the software is more likely to misidentify these people.

Niharika Jain, a software engineer and expert on gender and racial bias in generative machine learning, highlights the notorious example of Nikon Coolpix’s “blink detection” feature, which, because the training data included a majority of white faces, disproportionately judged Asian faces as blinking. “A good driver monitoring system should avoid falsely identifying members of a certain demographic group as sleeping more often than others,” she says.

The typical response to this problem is to collect more data from underrepresented groups in real settings. But companies like Datagen say that’s no longer necessary. The company can simply create more faces from the underrepresented groups, which means they will make up a larger proportion of the final dataset. Real 3D face scan data from thousands of people is transformed into millions of AI composites. “There is no bias in the data; you have complete control over the age, gender, and ethnicity of the people you generate,” says Gil Elbaz, co-founder of Datagen. The creepy faces that emerge don’t look like real people, but the company says they’re similar enough to teach AI systems how to respond to real people in similar scenarios.
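The rebalancing Datagen describes can be sketched in a few lines of Python. The generator below is a hypothetical stub standing in for a synthetic-face engine; the point is the bookkeeping: count each group’s share, then synthesize enough extra samples to even the groups out.

```python
from collections import Counter

def synthesize_face(group: str) -> dict:
    # Stand-in for a generator that renders a synthetic face for a given group.
    return {"group": group, "synthetic": True}

def rebalance(dataset: list) -> list:
    counts = Counter(sample["group"] for sample in dataset)
    target = max(counts.values())        # bring every group up to the largest
    extras = [
        synthesize_face(group)
        for group, n in counts.items()
        for _ in range(target - n)
    ]
    return dataset + extras

faces = [{"group": "A"}] * 900 + [{"group": "B"}] * 100
balanced = rebalance(faces)
print(Counter(sample["group"] for sample in balanced))  # Counter({'A': 900, 'B': 900})
```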

There is, however, some debate about whether synthetic data can really eliminate bias. Bernease Herman, a data scientist at the University of Washington’s eScience Institute, says that while synthetic data can improve the robustness of facial recognition models on underrepresented groups, she does not believe synthetic data alone can close the gap between the performance on those groups and on others. Although companies sometimes publish academic papers showcasing how their algorithms work, the algorithms themselves are proprietary, so researchers cannot independently evaluate them.

In areas such as virtual reality, as well as robotics, where 3D mapping is important, synthetic data companies say it might actually be better to train AI in simulations, especially as 3D modeling, visual effects and games technologies keep improving. “It’s only a matter of time before… you can create these virtual worlds and fully train your systems in a simulation,” Behzadi says.

This type of thinking is gaining traction in the autonomous vehicle industry, where synthetic data is becoming essential for teaching the AI in self-driving vehicles how to navigate the road. The traditional approach, filming hours of driving and feeding it into a deep learning model, was enough to get cars relatively good at navigating roads. But the problem vexing the industry is how to get cars to reliably handle what are known as “edge cases”: events rare enough that they don’t appear much even in millions of hours of training data. Examples include a child or a dog running into the road, complicated roadworks, or even traffic cones placed in an unexpected position, which was enough to stump a Waymo driverless vehicle in Arizona in 2021.

Synthetic faces made by Datagen.

With synthetic data, companies can create endless variations of such scenarios in virtual worlds that rarely occur in the real world. “Instead of waiting millions more miles to accumulate more examples, they can artificially create as many examples as they need of the edge case for training and testing,” says Phil Koopman, professor of electrical and computer engineering at Carnegie Mellon University.
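A hedged sketch of what that generation step might look like follows; the scenario fields and values are invented for illustration, not any AV company’s actual format. The pattern is simple: pin down the rare event and randomize everything around it.

```python
import random

def make_edge_cases(event: str, n: int, seed: int = 0) -> list:
    """Generate n simulated scenes that all contain one rare event."""
    rng = random.Random(seed)
    return [
        {
            "event": event,  # the rare thing under test
            "time_of_day": rng.choice(["dawn", "noon", "dusk", "night"]),
            "weather": rng.choice(["clear", "rain", "fog"]),
            "ego_speed_kmh": rng.randrange(20, 90, 5),
            "event_distance_m": round(rng.uniform(5.0, 60.0), 1),
        }
        for _ in range(n)
    ]

# Thousands of variations of a scene that might appear once in a lifetime
# of real driving, e.g., a dog running into the road.
for scene in make_edge_cases("dog_enters_road", n=5):
    print(scene)
```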

AV companies such as Waymo, Cruise and Wayve increasingly rely on real-life data combined with simulated driving in virtual worlds. Waymo has created a simulated world using AI and sensor data collected from its self-driving vehicles, complete with artificial raindrops and sun flare. It uses this world to train the vehicles on normal driving situations, as well as on the trickier edge cases. In 2021, Waymo said it had simulated 15 billion kilometers of driving, compared with just 20 million kilometers of real driving (roughly 750 simulated kilometers for every real one).

An added benefit of testing autonomous vehicles in virtual worlds first is that it minimizes the risk of very real accidents. “Fault tolerance is one of the main reasons why autonomous driving is at the forefront of most synthetic data,” Herman says. “A self-driving car making a mistake 1% of the time, or even 0.01% of the time, is probably too much.”

In 2017, Volvo’s self-driving technology, which had learned how to react to large North American animals such as deer, was baffled when it first encountered kangaroos in Australia. “If a simulator doesn’t know about kangaroos, no amount of simulation will create one until it’s seen in testing and the designers figure out how to add it,” Koopman says. For Aaron Roth, professor of computer science and cognitive science at the University of Pennsylvania, the challenge will be to create synthetic data that is indistinguishable from real data. He thinks it is plausible we are already there for facial data, because computers can now generate photorealistic images of faces. “But for a lot of other things,” which may or may not include kangaroos, “I don’t think we’re there yet.”

Sharon D. Cole