Real science

David Ng on deepfakes
RE | Issue 16 | 2019

There is a science to the art of fooling ourselves. Videos, photographs, voice tracks and newspaper articles can all be generated using machine-learning technologies.

To generate deepfake media, you need neural networks trained to encode and decode an image, plus a generator capable of fooling an increasingly discerning discriminator.

Deep learning architectures, from which the term ‘deepfake’ takes its name, are based on neural networks. These layers of interconnected nodes take a large set of input values (such as the colour data for every pixel in an image) and propagate them through the nodes to arrive at a set of outputs. As values pass from node to node, each node performs a computation or transformation on them.

The computations are a function of a set of weights. Each time the neural network makes a mistake, the weights are incrementally adjusted and the process is repeated. In the end, you get an output which can represent a classification (does the image contain a hot dog?) or a transformation.
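The weight-adjustment loop described above can be sketched with a single artificial node. This is a minimal illustration under stated assumptions, not how a production network is built: the two-value "images", the brightness labels, the learning rate and the function names `predict` and `train` are all invented for the example.

```python
import math
import random


def sigmoid(z):
    """Squash any number into a score between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Forward pass: weighted sum of the inputs, squashed to a score."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def train(samples, labels, epochs=2000, learning_rate=0.5, seed=0):
    """Adjust the weights a little after every mistake, then repeat."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in samples[0]]
    bias = 0.0
    for _ in range(epochs):
        for inputs, label in zip(samples, labels):
            # How far the current score is from the correct answer.
            error = predict(weights, bias, inputs) - label
            # Move each weight slightly in the direction that shrinks the error.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical toy data: two "pixel" brightness values per image.
# Label 1 means "bright image", label 0 means "dark image".
samples = [[0.9, 0.8], [0.8, 0.95], [0.1, 0.2], [0.2, 0.05]]
labels = [1, 1, 0, 0]

weights, bias = train(samples, labels)
bright_score = predict(weights, bias, [0.85, 0.9])
dark_score = predict(weights, bias, [0.1, 0.1])
```

After training, `bright_score` should sit above 0.5 and `dark_score` below it, which is the classification the paragraph describes, reduced to its simplest form.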

Deepfake generation uses auto-encoders. These networks can compress and decompress images to reduce noise, generate training data, or make one person look like someone else.

During compression, an encoder takes an original, full-resolution image and reduces it to a ‘latent’ image. It may, for example, encapsulate structural body positions and facial expressions but ignore or minimize eye colour or shape. A visual representation of the encoded image may not look like a face or person at all.

During decompression, a decoder takes the latent image and tries to regenerate the original image. The regenerated image will not be identical to the original, because it is created from less information.
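The compress-then-rebuild cycle can be sketched with a very small auto-encoder. This is only an illustration under assumptions: the four-value "images", the two-value latent size, the purely linear layers and the training settings are all invented here, and real deepfake auto-encoders are far larger and non-linear.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def encode(enc, image):
    """Compress a 4-value image into a 2-value latent representation."""
    return matvec(enc, image)


def decode(dec, latent):
    """Rebuild a 4-value image from the 2-value latent representation."""
    return matvec(dec, latent)


def reconstruction_error(enc, dec, images):
    """Total squared difference between each image and its rebuilt copy."""
    total = 0.0
    for image in images:
        rebuilt = decode(dec, encode(enc, image))
        total += sum((r - x) ** 2 for r, x in zip(rebuilt, image))
    return total


def train(images, epochs=3000, learning_rate=0.02, seed=0):
    rng = random.Random(seed)
    enc = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
    dec = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(4)]
    initial_error = reconstruction_error(enc, dec, images)
    for _ in range(epochs):
        for image in images:
            latent = encode(enc, image)
            rebuilt = decode(dec, latent)
            diff = [r - x for r, x in zip(rebuilt, image)]
            # Error passed back to the latent values (decoder not yet updated).
            latent_grad = [sum(dec[i][j] * diff[i] for i in range(4))
                           for j in range(2)]
            for i in range(4):
                for j in range(2):
                    dec[i][j] -= learning_rate * diff[i] * latent[j]
            for j in range(2):
                for k in range(4):
                    enc[j][k] -= learning_rate * latent_grad[j] * image[k]
    return enc, dec, initial_error


# Hypothetical 4-pixel "images"; they cannot all fit exactly into 2 values.
images = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.9, 1.0, 0.1, 0.0],
]

enc, dec, initial_error = train(images)
final_error = reconstruction_error(enc, dec, images)
```

Training should leave `final_error` lower than `initial_error`. The rebuilt images still differ from the originals because four values have been squeezed through two.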

To make a video of Daniel Radcliffe (as Harry Potter) acting like Daniel Craig (in the role of James Bond), an encoder trained to encode the likenesses of both Potter and Bond is paired with a decoder trained to generate Potter’s likeness. When a video of Bond is input, the encoder scales the video down into the latent space, essentially generalizing Bond’s face by ignoring or minimizing the importance of aspects such as Bond’s hair and voice tone and capturing details such as body movements and words. The decoder rebuilds the video from the latent space, generating Potter’s facial features and voice signature on top of Bond’s body movements and words. The result is a video of Harry Potter as Bond—James Bond.
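The flow of information in that swap can be traced with a toy sketch. Nothing here is a trained network: `Frame`, `Latent`, `encode` and `make_decoder` are hand-written stand-ins invented for this example, showing only which details the encoder keeps and which the decoder supplies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    face: str      # whose facial features appear
    voice: str     # whose voice signature is heard
    movement: str  # what the body is doing
    words: str     # what is being said


@dataclass(frozen=True)
class Latent:
    movement: str
    words: str


def encode(frame):
    """Shared encoder stand-in: keep movement and words, drop the identity."""
    return Latent(movement=frame.movement, words=frame.words)


def make_decoder(identity):
    """Decoder stand-in that always rebuilds frames with one likeness."""
    def decode(latent):
        return Frame(face=identity, voice=identity,
                     movement=latent.movement, words=latent.words)
    return decode


decode_as_potter = make_decoder("Harry Potter")

bond_frame = Frame(face="James Bond", voice="James Bond",
                   movement="adjusts cufflinks", words="Bond. James Bond.")

swapped = decode_as_potter(encode(bond_frame))
```

Here `swapped` carries Potter's face and voice with Bond's movement and words. In a real system the encoder and decoder learn what to keep and what to add from training footage; it is not spelled out by hand.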

Generative adversarial networks (GANs) are now being used to improve the videos generated by the decoder. A discriminator (another type of neural network) evaluates an image to determine error values between the generated image and the discriminator’s understanding of a true image. A generator tries to fool the discriminator by generating images which reduce these error values. As the discriminator and generator cycle through their tasks, the generated images gradually become more difficult to distinguish from the real thing.
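The tug-of-war between generator and discriminator can be sketched with single numbers in place of images. This is a minimal sketch under assumptions: the one-weight generator and discriminator, the "real" data centred on 4.0, the learning rate and all function names are invented for illustration, and a toy this small is not guaranteed to settle the way a full GAN might.

```python
import math
import random


def sigmoid(z):
    """Numerically safe squashing function with output between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def discriminate(d, x):
    """The discriminator's belief (0 to 1) that sample x is real."""
    return sigmoid(d["w"] * x + d["c"])


def generate(g, noise):
    """Turn random noise into a fake sample."""
    return g["a"] * noise + g["b"]


def discriminator_step(d, real, fake, lr):
    """Nudge the discriminator to score the real sample high, the fake low."""
    p_real = discriminate(d, real)
    p_fake = discriminate(d, fake)
    # Gradient of the loss -log(p_real) - log(1 - p_fake).
    grad_w = -(1.0 - p_real) * real + p_fake * fake
    grad_c = -(1.0 - p_real) + p_fake
    return {"w": d["w"] - lr * grad_w, "c": d["c"] - lr * grad_c}


def generator_step(g, d, noise, lr):
    """Nudge the generator so the discriminator scores its output higher."""
    p_fake = discriminate(d, generate(g, noise))
    # Gradient of the loss -log(p_fake) with respect to the generated sample.
    grad_out = -(1.0 - p_fake) * d["w"]
    return {"a": g["a"] - lr * grad_out * noise,
            "b": g["b"] - lr * grad_out}


rng = random.Random(0)
d = {"w": 0.1, "c": 0.0}  # discriminator weights
g = {"a": 1.0, "b": 0.0}  # generator weights

# The two networks take turns: the discriminator learns, then the generator.
for _ in range(500):
    real = rng.gauss(4.0, 0.5)  # hypothetical "real" data
    fake = generate(g, rng.gauss(0.0, 1.0))
    d = discriminator_step(d, real, fake, lr=0.05)
    g = generator_step(g, d, rng.gauss(0.0, 1.0), lr=0.05)
```

Each call to `generator_step` moves the generator's output in whichever direction the current discriminator finds more convincing, while each call to `discriminator_step` makes the discriminator more confident about the real sample it has just seen.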

David Ng, BSc MASc JD, is a partner in Toronto