Generating side sprites with Pix2Pix


This notebook is part of the pixel sides project. It describes an experiment in creating a generative adversarial network model based on the Pix2Pix architecture (Isola et al., 2017) to generate drawings of pixel art character sprites facing right, given an image of the character facing front:


The Pix2Pix architecture comprises a conditional deep convolutional generative adversarial network (cDCGAN) that learns the mapping from images in one domain to images in another by looking at example pairings. It is composed of a generative and a discriminative network, and it has been used in a variety of applications with good results, without requiring hand-engineering of the architecture to fit the specificities of each task.

Loading the paired dataset

The input for a Pix2Pix network is a dataset in which each example is a pair of a source and a target image. The original network receives and produces images of size $(256, 256, 3)$; however, the one implemented here processes images of size $(64, 64, 4)$.
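
Below is a minimal sketch of how such a paired dataset could be loaded with TensorFlow. The file layout and folder names (`sprites/front`, `sprites/right`) are hypothetical assumptions; nearest-neighbor resizing is chosen because it preserves the hard edges of pixel art:

```python
import tensorflow as tf

IMG_SIZE = 64

def load_image(path):
    """Reads an RGBA sprite and resizes it with nearest neighbor,
    which preserves the hard edges of pixel art."""
    image = tf.io.decode_png(tf.io.read_file(path), channels=4)
    image = tf.image.resize(image, [IMG_SIZE, IMG_SIZE],
                            method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
    return tf.cast(image, tf.float32)

def load_pair(front_path, right_path):
    """Loads one (source, target) example of the paired dataset."""
    return load_image(front_path), load_image(right_path)

# Hypothetical layout: one folder per pose, with matching file names.
front_files = sorted(tf.io.gfile.glob("sprites/front/*.png"))
right_files = sorted(tf.io.gfile.glob("sprites/right/*.png"))
dataset = (tf.data.Dataset.from_tensor_slices((front_files, right_files))
           .map(load_pair, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(4))
```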

The architecture requires the data to be scaled to the range $[-1, +1]$, and it commonly employs small perturbations for augmentation, involving (a) translations and (b) mirroring. However, such augmentations were not applied here: all images are perfectly centered and, as they are pixel art, mirroring would semantically change the pose (a right-facing sprite becomes a left-facing one).
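
A minimal sketch of the scaling step, mapping 8-bit channel values to $[-1, +1]$ (the `normalize`/`denormalize` names are ours):

```python
import tensorflow as tf

def normalize(image):
    """Maps pixel values from [0, 255] to [-1, +1], matching the
    tanh output of the generator."""
    return (tf.cast(image, tf.float32) / 127.5) - 1.0

def denormalize(image):
    """Maps values from [-1, +1] back to [0, 1] for display."""
    return (image + 1.0) / 2.0
```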

Had such operations been implemented, the results would look like these:
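
For illustration only (these perturbations are not used in this experiment), here is a sketch of how they could be implemented, applying the same translation and the same mirroring to both images of a pair:

```python
import tensorflow as tf

def random_jitter(source, target):
    """Applies the same translation (upscale + random crop) and the same
    random mirroring to both images of a pair."""
    images = tf.stack([source, target])               # shape (2, 64, 64, 4)
    # (a) translation: enlarge slightly, then crop back to 64x64
    images = tf.image.resize(images, [70, 70],
                             method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
    images = tf.image.random_crop(images, size=[2, 64, 64, 4])
    # (b) mirroring: flip both images of the pair together
    if tf.random.uniform(()) > 0.5:
        images = tf.image.flip_left_right(images)
    return images[0], images[1]
```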

The dataset was created from pairs of images mapping characters facing front to the same characters facing right:

Generator Network

As in the Pix2Pix architecture, the generator network was created as a U-Net, in which the layers in the first half reduce the spatial dimensions down to $(1, 1)$ and those in the second half upscale the image back to its original size. There are skip connections between the two halves of layers.

The number of layers was reduced to fit the new input/output dimensions of $(64, 64)$ instead of $(256, 256)$. As each downsampling layer halves the spatial size, the number of layers needed to reach $(1, 1)$ is:

$$\text{layers} = \log_2(\text{input size})$$

Hence, we used $\log_2 64 = 6$ downsampling layers instead of the 8 in the original architecture. The same goes for the number of upsampling blocks.
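
A minimal sketch of what such a generator could look like in Keras, following the downsample/upsample blocks from the Pix2Pix paper; the filter counts here are illustrative assumptions:

```python
import tensorflow as tf

def downsample(filters, apply_batchnorm=True):
    """Conv block that halves the spatial size."""
    block = tf.keras.Sequential()
    block.add(tf.keras.layers.Conv2D(filters, 4, strides=2, padding="same",
                                     use_bias=False))
    if apply_batchnorm:
        block.add(tf.keras.layers.BatchNormalization())
    block.add(tf.keras.layers.LeakyReLU())
    return block

def upsample(filters, apply_dropout=False):
    """Transposed-conv block that doubles the spatial size."""
    block = tf.keras.Sequential()
    block.add(tf.keras.layers.Conv2DTranspose(filters, 4, strides=2,
                                              padding="same", use_bias=False))
    block.add(tf.keras.layers.BatchNormalization())
    if apply_dropout:
        block.add(tf.keras.layers.Dropout(0.5))
    block.add(tf.keras.layers.ReLU())
    return block

def build_generator():
    """U-Net: 6 downsampling blocks (64 -> 1) and 6 upsampling blocks (1 -> 64)."""
    inputs = tf.keras.Input(shape=[64, 64, 4])
    down_stack = [
        downsample(64, apply_batchnorm=False),  # 32x32
        downsample(128),                        # 16x16
        downsample(256),                        # 8x8
        downsample(512),                        # 4x4
        downsample(512),                        # 2x2
        downsample(512),                        # 1x1 (bottleneck)
    ]
    up_stack = [
        upsample(512, apply_dropout=True),      # 2x2
        upsample(512, apply_dropout=True),      # 4x4
        upsample(256),                          # 8x8
        upsample(128),                          # 16x16
        upsample(64),                           # 32x32
    ]
    x = inputs
    skips = []
    for down in down_stack:
        x = down(x)
        skips.append(x)
    # Skip connections between the two halves (bottleneck excluded)
    for up, skip in zip(up_stack, reversed(skips[:-1])):
        x = up(x)
        x = tf.keras.layers.Concatenate()([x, skip])
    # Last layer maps back to 4 channels in [-1, +1]
    outputs = tf.keras.layers.Conv2DTranspose(4, 4, strides=2, padding="same",
                                              activation="tanh")(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)
```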

We test the image generation from input data to check that an image is indeed produced and that it keeps some relation to the input.
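
A sketch of this sanity check, reusing the loading, normalization, and generator sketches above to run the (still untrained) generator on one example:

```python
import matplotlib.pyplot as plt
import tensorflow as tf

generator = build_generator()
source, target = next(iter(dataset))        # one batch of (front, right) pairs
prediction = generator(normalize(source), training=True)

plt.figure(figsize=(6, 3))
plt.subplot(1, 2, 1)
plt.title("Input (front)")
plt.imshow(tf.cast(source[0], tf.uint8))
plt.subplot(1, 2, 2)
plt.title("Generated (right)")
plt.imshow(denormalize(prediction[0]))      # back to [0, 1] for display
plt.show()
```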

Generator's loss function:
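
Assuming the model follows the formulation in the Pix2Pix paper, the generator loss is the adversarial term plus an L1 distance to the target weighted by $\lambda$ (the paper uses $\lambda = 100$): $\mathcal{L}_G = \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G)$. A sketch:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
LAMBDA = 100  # L1 weight from the Pix2Pix paper

def generator_loss(disc_generated_output, gen_output, target):
    """Adversarial term (fool the discriminator) + weighted L1 to the target."""
    gan_loss = bce(tf.ones_like(disc_generated_output), disc_generated_output)
    l1_loss = tf.reduce_mean(tf.abs(target - gen_output))
    return gan_loss + LAMBDA * l1_loss
```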

Discriminator Network

The discriminator network follows the architecture of a convolutional binary classifier over image patches (a PatchGAN): instead of judging the whole image as either real or fake, it checks image patches of size $(30, 30)$. As it is a conditional discriminator, it receives source images paired with either real target images or fake ones created by the generator.
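
The exact layer configuration is not shown in this export; the sketch below is one arrangement that yields a $(30, 30)$ grid of patch logits from $(64, 64)$ inputs, reusing the `downsample` block from the generator sketch:

```python
import tensorflow as tf

def build_discriminator():
    """PatchGAN: one real/fake logit per patch of the (source, target) pair.
    The layers here are an illustrative assumption chosen so that the output
    grid is (30, 30) for (64, 64) inputs."""
    source = tf.keras.Input(shape=[64, 64, 4], name="source")
    target = tf.keras.Input(shape=[64, 64, 4], name="target")
    x = tf.keras.layers.Concatenate()([source, target])  # condition on source
    x = downsample(64, apply_batchnorm=False)(x)          # 32x32
    x = tf.keras.layers.ZeroPadding2D()(x)                # 34x34
    x = tf.keras.layers.Conv2D(128, 4, strides=1, use_bias=False)(x)  # 31x31
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.LeakyReLU()(x)
    x = tf.keras.layers.ZeroPadding2D()(x)                # 33x33
    logits = tf.keras.layers.Conv2D(1, 4, strides=1)(x)   # 30x30x1
    return tf.keras.Model(inputs=[source, target], outputs=logits)
```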

The discriminator loss function accounts for its mistakes on both real data and fake data coming from the generator network. The total loss is their sum:
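
A sketch of this loss, reusing the `bce` binary cross-entropy defined in the generator loss sketch:

```python
def discriminator_loss(disc_real_output, disc_generated_output):
    """Real targets should be classified as ones, generated ones as zeros;
    the total loss is the sum of the two terms."""
    real_loss = bce(tf.ones_like(disc_real_output), disc_real_output)
    fake_loss = bce(tf.zeros_like(disc_generated_output), disc_generated_output)
    return real_loss + fake_loss
```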

Training and Evaluation