Naruto x AI. Generate some for yourself at sharingans.com

## The SharinGAN

Perhaps the most iconic symbol from Naruto is the sharingan, the infamous eye mark of the Uchiha clan. The original sharingans are visually striking and beautifully designed by the show’s creators. Many Naruto fans have even been inspired to generate their own versions of the sharingan.

After watching the show, I too was inspired to craft my own sharingan. Unfortunately, I’m not very artistic. So instead, I made the sharinGAN: a GAN to create novel sharingan artwork for me.

Since our training data consists of only 15 sharingans from the series, this poses a fun and challenging problem: generating high-fidelity images in the extremely low-data regime.

"And then some day, when you have the same eyes as I do, come before me" - Itachi Uchiha

## GANs in 30 Seconds

GANs, or Generative Adversarial Networks, are a class of approaches for training a deep learning model to generate novel data matching the training data’s distribution. In our case, we want to generate creative, non-degenerate eyes that follow the sharingan distribution.

There are two components to a GAN, namely the generator $G$ that produces novel data and the discriminator $D$ whose role is to discern whether the data is from the pre-existing distribution (“real”) or newly generated (“fake”). $G$ attempts to fool $D$ by progressively learning to produce data that mimics the training distribution.

More formally, we have the following:

$\text{  D  wants to maximize } \mathbb{E}_{x \sim p_{data}(x)}[\log(D(x))]$

Here, $D$ represents the discriminator, which can be viewed as a function defined on the (generated and real) image space that outputs the probability that its input comes from the true data distribution. Ideally, $D$ should output a value close to $1$ whenever $x \sim p_{data}$ is in fact a genuine image.

Conversely, we also have:

$\text{  G  wants to minimize } \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]$

Here, the generator $G$ is a function defined on latent noise $z \sim p_z(z)$. Critically, this random variable is not parameterized by the true data distribution. The math states that $G$ is attempting to transform this random noise $z$ into something that makes $D(G(z))$ close to $1$ (driving $\log(1 - D(G(z)))$ as low as possible), i.e. that tricks the discriminator $D$ into thinking this generated image $G(z)$ is “genuine”, when in fact it was generated artificially by $G$.

Taken together, we see that the generator and discriminator are engaged in the following mini-max game:

$\underset{G}{\min} \underset{D}{\max} \mathbb{E}_{x \sim p_{data}(x)}[\log(D(x))] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]$
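To make the game a bit more concrete, here is a minimal sketch that estimates the value function from a batch of discriminator outputs, using the standard generator term $\log(1 - D(G(z)))$. The function name `value_function` and the sample probabilities are made up for illustration; this is not code from the sharinGAN itself.

```python
import math


def value_function(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs on real images, each in (0, 1)
    d_fake: discriminator outputs on generated images, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake well (hypothetical numbers).
strong_d = value_function(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])

# A fully fooled discriminator that outputs 0.5 for everything.
fooled_d = value_function(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

$D$ pushes this quantity up, while $G$ pushes it down by making the fake outputs approach $1$; the fooled case sits at $-2\log 2$.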

## First, a General Anime GAN

My initial intuition for sharinGAN was to transfer a model from some source distribution to the sharingan distribution. Of course, most of the standard image synthesis benchmarks (FFHQ, LSUN, CIFAR-10, etc.) look nothing like sharingans.

To minimize the perceptual distance from source to target distribution, I decided to first train a GAN to generate anime images; for some reason, plenty of prior work and datasets exist for this purpose.

Thankfully, Gwern has already done precisely this. Unfortunately, he based his implementation on NVIDIA’s TensorFlow stylegan2 model, so I wrote a quick weights converter to port everything over to PyTorch. This made the source model compatible with the Cross-Domain Correspondence and Discriminator Relaxation approaches below.
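The core of such a converter is mostly bookkeeping about tensor layouts. As a rough sketch (not the actual converter I wrote): TensorFlow stores convolution kernels as `[height, width, in, out]` and dense weights as `[in, out]`, whereas PyTorch expects `[out, in, height, width]` and `[out, in]`. The function names and the flat name-to-array dictionary below are assumptions for illustration; a real stylegan2 port also has to rename parameters and handle architecture-specific details that are omitted here.

```python
import numpy as np


def tf_conv_to_torch(w):
    """Reorder a conv kernel from [kh, kw, in, out] to [out, in, kh, kw]."""
    return np.ascontiguousarray(np.transpose(w, (3, 2, 0, 1)))


def tf_dense_to_torch(w):
    """Reorder a dense weight from [in, out] to [out, in]."""
    return np.ascontiguousarray(np.transpose(w, (1, 0)))


def convert_weights(tf_weights):
    """Convert a {name: array} dict, dispatching on the number of dimensions.

    Assumption: 4-D arrays are conv kernels and 2-D arrays are dense weights;
    everything else (biases, scalars) is copied unchanged.
    """
    converted = {}
    for name, w in tf_weights.items():
        if w.ndim == 4:
            converted[name] = tf_conv_to_torch(w)
        elif w.ndim == 2:
            converted[name] = tf_dense_to_torch(w)
        else:
            converted[name] = w.copy()
    return converted
```

The resulting arrays can then be wrapped in tensors and loaded into a matching PyTorch module.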

I also searched for a lottery ticket for this anime model in an attempt to further reduce the dataset size requirements.

## Image Synthesis in the Extremely Limited Data Regime

Unfortunately, the majority of contemporary GAN-based methods are entirely unsuitable for our task. SOTA models like BigGAN (massively scaled model), stylegan and stylegan2 (style transfer-based architecture), stylegan-ADA (Adaptive Discriminator Augmentation), and so on generally require thousands of images to produce even modest results.

A simple solution to this is to perform transfer learning on GANs; that is, given a model trained on some source distribution, adapt it to a target domain. Indeed, a flurry of recent papers do this. In the last 4 years, we’ve seen Transferring GANs (straightforwardly finetuning a pretrained source model); Batch Statistics Adaptation (only finetuning scale and shift parameters); Freeze-D and GANTransferLimitedData (freezing high-resolution discriminator and/or generator layers); and lottery ticket GANs (finding a lottery ticket, then aggressively training it). Though these approaches move the needle on dataset size, they still require data roughly on the order of 100 images.

In general, this class of recent models suffers seriously from overfitting. In the extremely limited data regime, these models literally memorize the target image set and reproduce the original 15 sharingans.

## Cross-Domain Correspondence and Relaxing the Discriminator

Recently, I stumbled upon Few-shot Image Generation via Cross-domain Correspondence by Ojha et al., which achieves impressive results in the extremely limited data regime (i.e. 10 images). This is perfect, given that we only have 15 sharingans. The key insight they share is that the memorization of target distribution training images implies a degradation of relative distances when adapting from the source domain to the target domain. In other words, the mapping from source to target domain induced by the GAN is not injective. As a concrete example, in an overfit model, two perceptually distant images in the training source set are mapped to the same image in the target set. The authors define this loss of correspondence between source distribution images and target distribution images as a lack of "cross-domain correspondence".

Cast in this framework, the overfitting problem is solved by:

• Enforcing distance consistency between the source and target distributions by introducing a cross-domain consistency loss. This loss acts on two $N$-way distributions, one from the (fixed) source domain generator and one from the (learned) source-to-target domain generator, each induced by sampling $N + 1$ latent vectors and taking pairwise cosine similarities between each generation and the remaining $N$ generations.

The loss itself is the expected sum of KL-divergences between the two induced distributions at various layers in the generator(s).

• Relaxing the “realism” criteria on the discriminator to prevent the generator from solely reproducing the training set images. They do this by introducing a patch-level discriminator $D_{patch}$. Instead of discriminating on the entire image, the discriminator occasionally discriminates on a sub-image (i.e. at the patch level).
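The cross-domain consistency loss from the first bullet can be sketched for a single generator layer as follows. This is a simplified NumPy illustration, not the paper's or my training code: the function names are hypothetical, turning the cosine similarities into distributions with a softmax is an assumption, and so is the direction of the KL-divergence (adapted generator relative to the fixed source generator).

```python
import numpy as np


def similarity_distributions(feats):
    """feats: (N + 1, d) features, one row per latent vector.

    Returns an (N + 1, N) array whose row i is a softmax over the cosine
    similarities between generation i and the remaining N generations.
    """
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = sims.shape[0]
    # Drop the diagonal (self-similarity), keeping N entries per row.
    off_diag = sims[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    exp = np.exp(off_diag - off_diag.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


def consistency_loss(source_feats, target_feats):
    """Sum over anchors of KL(adapted || source) at one layer."""
    p_src = similarity_distributions(source_feats)
    p_tgt = similarity_distributions(target_feats)
    return float(np.sum(p_tgt * (np.log(p_tgt) - np.log(p_src))))


rng = np.random.default_rng(0)
source = rng.normal(size=(5, 16))        # stand-in features for N + 1 = 5 latents
collapsed = np.tile(source[0], (5, 1))   # an adapted generator that memorized one image

loss_identical = consistency_loss(source, source)
loss_collapsed = consistency_loss(source, collapsed)
```

When the adapted generator preserves the source's relative distances the loss is zero, and it grows when distinct source generations collapse onto the same target image, which is exactly the memorization failure described above. In practice the loss would be summed over several layers and averaged over many latent batches.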

For sharinGAN purposes, I ended up adapting their architecture for my (lottery ticket) anime stylegan2 model. I also aggressively relaxed the discriminator by using extremely small, and even non-rectangular, patches. Surprisingly, the cross-domain consistency loss only marginally benefitted the sharingan synthesis quality, so I almost entirely relied on the patch-level loss combined with the original GAN loss.
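To give a feel for what "small, non-rectangular patches" might look like, here is a hedged sketch of choosing the discriminator's input. Everything in it is an assumption made for illustration: the helper names, the circular mask as the non-rectangular shape, the `patch_prob` parameter, and the choice to crop raw pixels at all. My actual implementation and the paper's $D_{patch}$ may differ.

```python
import numpy as np


def random_patch(image, patch_size, rng):
    """Crop a random square patch from an (H, W, C) image."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - patch_size + 1)
    left = rng.integers(0, w - patch_size + 1)
    return image[top:top + patch_size, left:left + patch_size]


def circular_mask(patch_size):
    """Boolean disc mask: one possible non-rectangular patch shape."""
    yy, xx = np.mgrid[:patch_size, :patch_size]
    center = (patch_size - 1) / 2
    return (yy - center) ** 2 + (xx - center) ** 2 <= (patch_size / 2) ** 2


def discriminator_input(image, patch_size, patch_prob, rng):
    """With probability patch_prob return a masked patch, else the full image."""
    if rng.random() < patch_prob:
        patch = random_patch(image, patch_size, rng)
        return patch * circular_mask(patch_size)[..., None]
    return image
```

Smaller patches and a higher `patch_prob` relax the discriminator further, since it sees less global structure and therefore has less to memorize.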

## Circularization and Colorization

Because I had so aggressively relaxed the discriminator and relied on patch-level discrimination, the sharinGANs I had generated were slightly…deformed.

Deformed sharingans; discriminator relaxation mitigates memorization but also permits non-circular generations.

I should have probably introduced a rotational symmetry loss to properly address this issue, but I instead borrowed some recent concepts from Chemical Physics to circularize post-generation.

Originally, this circularization method was intended to repair the distortions of velocity map images in order to improve the extracted spectrum resolution. The method essentially approximates a function from a discrete set of trigonometric series that are fit to the angular behavior of each ring in the image. This function can then remap the intensity at a particular radius and angle to yield a circular object.
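The idea can be illustrated on a single ring. The sketch below is a heavily simplified stand-in, not the published method or my pipeline: it assumes the ring's radius has already been measured at a set of angles, fits a low-order trigonometric series to it by least squares, and rescales each radius so the ring becomes circular. The function names and the `order` parameter are hypothetical.

```python
import numpy as np


def fit_trig_series(theta, radius, order=4):
    """Least-squares fit of r(theta) = a0 + sum_m (a_m cos(m theta) + b_m sin(m theta))."""
    cols = [np.ones_like(theta)]
    for m in range(1, order + 1):
        cols += [np.cos(m * theta), np.sin(m * theta)]
    design = np.stack(cols, axis=1)
    coeffs, *_ = np.linalg.lstsq(design, radius, rcond=None)
    return coeffs, design


def circularize_ring(theta, radius, order=4):
    """Rescale radii so the fitted angular distortion is divided out."""
    coeffs, design = fit_trig_series(theta, radius, order)
    fitted = design @ coeffs
    # coeffs[0] is the constant term, i.e. the ring's mean radius.
    return radius * coeffs[0] / fitted


# A ring of mean radius 50 squashed into an ellipse-like shape.
theta = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
distorted = 50.0 * (1.0 + 0.1 * np.cos(2.0 * theta))
corrected = circularize_ring(theta, distorted)
```

For a full image, the same angular correction would be fit per ring and then used to resample pixel intensities in polar coordinates.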