As part of the IFT6266 class at Université de Montréal during the Winter 2017 semester, our final project was a Conditional Image Generation task. The official website explains the project in depth, so here I’ll simply summarize the important points, assuming you’ve read the full description already.
Here is an example of the task at hand.
The objective here is to complete the missing part of the image given the contour (or non-missing parts of the image) and a caption describing the image. As I try different approaches and models, I’ll also try to answer the following questions:
- To what extent is the caption useful for the inpainting task?
- What models produce the “best” results?
- How can we quantitatively evaluate the output in a way that matches human judgements of quality?
The dataset used for the project is a downsampled 64×64 version of the MSCOCO dataset (http://mscoco.org/dataset/#overview).
The dataset is available here: dataset
Intuition and Plan
Let’s start with some context. First, I started learning Python three weeks prior to the class. I code in Matlab, C/C++, and .NET, but not in Python. Second, I’m doing my PhD in Visual Neuroscience at the École d’Optométrie de l’Université de Montréal; I’m interested in Deep Learning, but it is not my field of research. With that in mind, I knew I wouldn’t achieve crazy results and that my best angle to (1) pass the class and (2) learn something useful in the process was to start by reproducing some of the literature and then tweak those models for the task at hand. Once I had reproduced existing results and understood the code a little, I would work on toy projects to get better at the process of working on deep learning problems.
In the explanations given with the project, we can read: “To our knowledge, this is a novel task and has never been done before, so a successful project would in fact be a valuable research contribution.” Reading that, I knew I shouldn’t try too hard to discover something new, given my (non-existent) experience in the field, but rather assemble some basic building blocks that could indicate a good direction for such a problem and, most importantly, learn in the process.
In the following sections I will cover:
- Deep Learning Frameworks
- Keras Examples
- Toy Projects
- Variational Auto-Encoder (VAE)
- Convolutional Auto-Encoder (CAE)
- Text to Image
Deep Learning Frameworks
Once you’ve coded your MLP for MNIST by hand, you can move to a deep learning framework and the fun can really begin!
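To give an idea of the jump in abstraction, here is a minimal sketch of that MNIST-style MLP in Keras. I’m using today’s `tensorflow.keras` spelling and random data standing in for MNIST so the snippet is self-contained; the layer sizes are my own illustrative choices, not a prescription.

```python
import numpy as np
from tensorflow.keras import layers, models

# Random data standing in for MNIST (784-pixel images, 10 classes),
# so this sketch runs without downloading the dataset.
x = np.random.rand(256, 784).astype('float32')
y = np.eye(10)[np.random.randint(0, 10, 256)]

# A two-layer MLP: the whole "by hand" exercise in a few lines.
model = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(512, activation='relu'),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x, y, epochs=1, batch_size=128, verbose=0)
```

With the framework handling backprop and optimization, all that’s left to write is the architecture and the data pipeline.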
The popular frameworks are Theano, Tensorflow, and Torch.
In my case, I thought I’d use Theano since I’m taking the grad course at Université de Montréal, but many examples and resources use Tensorflow or Torch, so I got the chance to experiment with all three. In retrospect, I spent way too much time juggling with dependencies and packages; I cover this in the Problems section.
On top of Theano and Tensorflow, Keras is quite useful to make things even easier with a higher level of abstraction!
Here are the links you’ll surely want to look at for Keras.
First Step: Run the Examples!
I’m using a MacBook as my main laptop and have an Asus ROG with an NVidia GPU running Windows and Linux. After a few hours of not having fun trying to get my environments ready to run Theano and Tensorflow on the GPU, I gave up on Windows, then spent another few hours on Linux before finally getting it to work. (Disclaimer: I’m not very comfortable in Linux.)
Then I ran the Keras examples and they just worked. No errors, no Python 2 vs Python 3 vs hardware vs compiler (gcc, fortran, etc.) vs the world errors… What a blast! I had just used my GPU to train a model very quickly and generated images.
Next Step: Toy Projects!
Now that I had a working setup to train and try models at a decent speed, and understood the theory behind these models, it was time to play with them a little before jumping on a hard problem.
1) X != O
Using Keras and a small CNN, it was easy to classify Os and Xs, even with missing parts. The code is available here.
2) Elephant == Mammoth?
After training a ResNet with the Keras example models, it was easy to classify the first picture as an elephant. The problem is that I Googled that image and it might actually have been used in the training data, so to make sure I used something that hadn’t been seen, I went for a mammoth. It was properly classified as an elephant.
3) Tom Hanks vs Deep Learning
After using a Keras pre-trained ResNet to classify elephants and mammoths, I wanted to test another category. I wanted a plane that for sure wasn’t in the training samples, so I googled “Tom Hanks Plane Crash”, trying to find images from the movie Sully with Tom Hanks. Here are the 2 images I used for classification:
They were resized and filled with white to fit the ResNet input size.
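That resize-and-fill step can be sketched with Pillow as follows. This is my own reconstruction of the idea, not the original script; the 224×224 target assumes a standard ResNet input size.

```python
from PIL import Image

def pad_to_square(img, size=224, fill=(255, 255, 255)):
    # Scale the longer side to `size`, keeping the aspect ratio.
    w, h = img.size
    scale = size / max(w, h)
    new_w, new_h = int(w * scale), int(h * scale)
    img = img.resize((new_w, new_h), Image.BILINEAR)
    # Paste the result centered on a white square canvas.
    canvas = Image.new('RGB', (size, size), fill)
    canvas.paste(img, ((size - new_w) // 2, (size - new_h) // 2))
    return canvas
```

Padding with white instead of stretching keeps the aspect ratio, so the plane isn’t distorted before classification.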
The top-3 classification results were: Plane, Submarine & Aircraft Carrier, which all kinda make sense.
4) Tanks & Elephants – AutoEncoders
Ok. Now that I can learn to classify elephants and planes (and Tom Hanks crashes…), can I generate elephants and/or planes (and even tanks)?
So I first went on and started downloading the full ImageNet dataset, until I realized that I could download images from specific categories only… I was very happy to discover that option, so I’m sharing it here.
- Go on ImageNet
- Search for “Tank”
- Select the category: Tank, army tank, armored combat vehicle, armoured combat vehicle
- Click Downloads
You need to be logged in to download the images, with an email from an institution (or at least not gmail, hotmail, etc.)
You now have 1,488 images of tanks! They come in different sizes, so you’ll have to write a small piece of code to resize them before feeding your neural net.
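Such a resizing script can be as small as this. It’s a Pillow sketch; the 64×64 target and folder layout are my own choices, not the ones from my actual code.

```python
import os
from PIL import Image

def resize_folder(src_dir, dst_dir, size=(64, 64)):
    # Resize every image in src_dir and save a copy in dst_dir.
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        if not name.lower().endswith(('.jpg', '.jpeg', '.png')):
            continue  # skip non-image files
        img = Image.open(os.path.join(src_dir, name)).convert('RGB')
        img = img.resize(size, Image.BILINEAR)
        img.save(os.path.join(dst_dir, name))
```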
After trying different Variational Auto-Encoders with deconvolution for the generator, I was able to see some features of a tank, but not to generate images of a tank: only shapes and shades that let you presume it would be a tank. I have an IPython Notebook here with the pieces of code to reproduce the experiment.
CAE – Convolutional Auto-Encoder
After messing around with Toy Projects, it was time to dive right in. The first model used was the convolutional auto-encoder.
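For reference, a convolutional auto-encoder of this kind can be sketched in Keras as follows. The depth and filter counts below are illustrative, not the ones from my actual runs, and I use the `tensorflow.keras` spelling for a self-contained snippet.

```python
from tensorflow.keras import layers, models

def build_cae():
    inp = layers.Input(shape=(64, 64, 3))
    # Encoder: downsample the image to a small spatial bottleneck.
    x = layers.Conv2D(32, 3, activation='relu', padding='same')(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, activation='relu', padding='same')(x)
    x = layers.MaxPooling2D(2)(x)
    # Decoder: upsample back to the input resolution.
    x = layers.Conv2D(64, 3, activation='relu', padding='same')(x)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(32, 3, activation='relu', padding='same')(x)
    x = layers.UpSampling2D(2)(x)
    out = layers.Conv2D(3, 3, activation='sigmoid', padding='same')(x)
    model = models.Model(inp, out)
    model.compile(optimizer='adam', loss='mse')
    return model
```

Trained with a pixel-wise loss like MSE, such a model tends toward blurry averages, which matches the results below.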
Already we can see that the model is overfitting, so it makes sense that it doesn’t get any better with more epochs.
Here are some more images, from the same training, same epochs.
We can see that the model captures some of the context information, but stays either too blurry or too specific. The problem as training time increases is that the network tries to draw a clearer image, but one that ends up being wrong most of the time. Some components of that model can capture part of the context, but fail to deliver the resolution we would expect. Keeping the convolutions in mind, I decided to explore GANs with convolutional layers.
GAN – The Generative Adversarial Network is one of the latest approaches in deep learning and is getting a lot of attention, given the exciting results the approach has delivered so far.
Generator vs Discriminator
The Generator learns how to generate a new image.
The Discriminator learns to differentiate a real image vs a fake image.
The main idea behind GANs is to make the Generator play versus the Discriminator. In an optimal scenario, the system would reach a fifty-fifty where the discriminator can’t tell if it’s a real or a fake image anymore because the generator has become so good at creating real looking images. (Goodfellow et al., 2014)
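The two sides of this game can be written as two simple loss functions. Here is a plain NumPy sketch, assuming the discriminator outputs probabilities; the non-saturating generator loss is the variant recommended in Goodfellow et al. (2014).

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # The discriminator wants D(x) -> 1 on real images
    # and D(G(z)) -> 0 on generated ones.
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    # Non-saturating generator loss: push D(G(z)) toward 1.
    return -np.mean(np.log(d_fake))
```

At the fifty-fifty equilibrium, the discriminator outputs 0.5 everywhere and its loss settles at 2·log 2 per pair of examples.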
One could ask: how do you evaluate an image to decide it’s good enough and that training can stop? There is currently no real answer to that, unfortunately. The best way is still a subjective evaluation by a human of whether the images are “good enough” and the training can stop or not.
After trying different approaches to training GANs, I can confirm that it can quickly become a nightmare and a time sucker.
Training a GAN is quite difficult: because of this game going on between the discriminator and the generator, you are training 2 models together (as if one network wasn’t hard enough to train).
Paralysis of Analysis. For a class or personal project, it is easy to waste a lot of time going in circles with different parameters and not moving forward. Don’t try to go fast and get results right away. Make sure you have a plan and take notes on your different attempts. Be very rigorous; otherwise, you’ll keep starting new attempts and going in circles. The leap to a very good model might lie in a perfect combination of the amount of data, conv net settings, generator vs discriminator training, training time, hyperparameters, optimization approach, etc.
As opposed to an AutoEncoder, where we try to learn the z-space for our generator, here we use noise (random values) as our z inputs. From the z inputs we train a network that can generate images.
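A DCGAN-style generator mapping that noise vector to an image can be sketched like this in Keras (again `tensorflow.keras` for self-containment; filter counts are illustrative, following the overall scheme of Radford et al., 2015):

```python
from tensorflow.keras import layers, models

def build_generator(z_dim=100):
    # Project the noise vector z to a 4x4 feature map, then upsample
    # with strided transposed convolutions to a 64x64 RGB image.
    # The tanh output puts pixel values in [-1, 1], as in DCGAN.
    return models.Sequential([
        layers.Input(shape=(z_dim,)),
        layers.Dense(4 * 4 * 256, activation='relu'),
        layers.Reshape((4, 4, 256)),
        layers.Conv2DTranspose(128, 4, strides=2, padding='same', activation='relu'),
        layers.Conv2DTranspose(64, 4, strides=2, padding='same', activation='relu'),
        layers.Conv2DTranspose(32, 4, strides=2, padding='same', activation='relu'),
        layers.Conv2DTranspose(3, 4, strides=2, padding='same', activation='tanh'),
    ])
```

(The real DCGAN paper also uses batch normalization between layers; I left it out to keep the sketch short.)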
Messing around with the DCGAN model was a lot of fun!
The results you see are taken from my own training and testing, but were inspired by the Deep Convolutional Generative Adversarial Network (DCGAN) paper (Radford et al., 2015) and these two DCGAN implementations: from Brandon Amos’ blog post and GitHub, and from Taehoon Kim’s GitHub.
With the CelebA dataset, containing 202,600 celebrity faces, I wanted to see if I could reproduce the impressive results, and indeed I could. Some preprocessing was applied to the data to crop it, keeping only the face, centered, and all the same size. Here is a 108×108 example for each face. We see the training happening on a batch of 64.
Here are the results after 23 epochs. The model clearly learned important features. To me that’s quite impressive; now it needs to capture perfection, which might be quite difficult.
I could probably let it run for longer, but I wanted to try different things. Deep Learning is Exciting!
Completion / Inpainting
Since our challenge for the class project is “inpainting” (i.e. filling in the missing part of an image), I tried it on the CelebA dataset, for which I have a trained model that gives “ok” results (given the size of the network and the ~15h training time). In the picture below, the third image is the original, the first one is the cropped version, and the middle one shows different attempts at finding the best reconstructed image (i.e. from the images above) to fill the center. This attempt was inspired by the work of Brandon Amos and his blog post on the subject. His own work was inspired by the paper Semantic Image Inpainting with Perceptual and Contextual Losses (Yeh et al., 2016).
Obviously, both our generator and our discriminator are not perfect, so we can’t expect a very good result, nor a linear progression towards a “good answer”. I also tried this approach on the dataset from the class project, I cover it later in this post.
The main idea for this completion is explained here: “To do completion for some image y, something reasonable that doesn’t work is to maximize D(y) over the missing pixels. This will result in something that’s neither from the data distribution (p_data) nor the generative distribution (p_g). What we want is a reasonable projection of y onto the generative distribution.”
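In code, the loss being minimized over z can be sketched like this. This is a NumPy sketch of the idea from the Yeh et al. / Amos approach, not their actual implementation; the `lam` weighting is an assumed hyperparameter.

```python
import numpy as np

def contextual_loss(g_z, y, mask):
    # How far G(z) is from the corrupted image y on the known
    # (unmasked, mask == 1) pixels only.
    return np.sum(np.abs(mask * (g_z - y)))

def completion_loss(g_z, y, mask, d_of_gz, lam=0.1):
    # Contextual term plus a perceptual term that keeps G(z)
    # looking real to the discriminator (D(G(z)) close to 1).
    return contextual_loss(g_z, y, mask) - lam * np.log(d_of_gz)
```

One then runs gradient descent on z through the trained (frozen) networks and fills the hole with `mask * y + (1 - mask) * G(z_hat)`.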
Obviously, one of the first tests to do with an interesting model is to run it on MNIST, so that’s what I did. Here we see the training on MNIST, learning how to represent handwritten digits, over 8 epochs.
Then it gives us the following results when we generate new images.
* All images are my own, generated from my own training, using and tweaking existing models. All the code is available on my GitHub.
DCGAN on IFT6266 MS Coco dataset.
After getting good results on MNIST and CelebA, I was very optimistic about generating good images from the dataset provided for the project, but unfortunately I was unable to get good-looking generated images. Here are some examples.
The DCGAN network was trained without the captions, only the images. We can see that it learns some of the features and to produce sharper-looking images, but the complexity and variety of the images made it complicated to get very good results. Different parameters should have been tried, and longer training could have helped; I also tried implementing this solution to fix the checkerboard artifacts, but unfortunately failed to implement it properly.
Here is another attempt at filling the cropped part of the image with the DCGAN and completion approach described above.
On some of the images the context and colors were captured and could give an “ok” filling, but on some of them, not at all.
After reading the StackGAN paper, I really wanted to try it for the IFT6266 project, even knowing I would probably spend a lot of time getting it to work. The first step was to reproduce it with birds and flowers, as in the paper, before trying to understand the code and adapt it to the project.
Stage-I generates a low-resolution guess, and Stage-II uses that guess, together with the sentence again, as input to create a higher-resolution version of it. Inspired by StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks (Zhang et al., 2016) and Stacked Generative Adversarial Networks (Huang et al., 2016), I was confident that it could be a nice approach for the class project, so I started from Han Zhang’s implementation, available here.
Here is the idea.
Here are my own generations and my own sentences.
Here is a nice Docker image of a StackGAN that works very well if you want to use the pre-trained model.
GANs are very exciting and there are many great examples online, like this CycleGAN transforming horse videos into zebra videos!
For more GAN models, check out the GAN Zoo.
After seeing the results of the StackGAN, I knew that was the model I wanted to try. I knew how to work with GANs for image generation from noise; now I needed to add a text encoder to create embeddings of the sentences paired with the pictures. After reading Generative Adversarial Text to Image Synthesis (Reed et al., 2016), I wanted to use their model, so I used the Torch implementation created by Scott Ellison Reed, available here.
The results are shown below.
Here is the idea for the Text-to-Image GAN approach.
These are my own images, generated from my own implementation of Reed’s code.
It is very interesting to try to understand how the “space” might be created and how the sentence influences the image. For example, on the “a person and a dog” one, we can feel that it kinda represents a space in which people and dogs live; when asked to draw “a person and a dog”, the model seems to produce a mixture of concepts related to both, in some mixed reality space.
This project was a little nightmare from time to time for someone with no Linux or Python skills, and not doing deep learning as part of their full-time grad studies. After spending countless hours understanding Python 2.7, Python 3.5, Theano, Tensorflow, Torch, Keras and Lasagne on both Mac OS and Linux (after giving up on Windows), I also got the chance to experiment with CUDA and GPU computation, locally on my laptop and on a Calcul Québec cluster called Hades. I wasn’t very successful on Hades, as most of my tests resulted in errors anyway. Here are some of the good ones.
As shown in the images, sometimes you don’t get many hints on where to look to fix the problem, and you can be left a little clueless in front of the situation.
After all the frustration and the wasted time, why not have a corrupted hard drive and lose your results (images and some code) in the process… Because… Why not… The fun part is when you recover a corrupted hard drive and it throws all the files into a single folder called “found.000” — while you do deep learning on images all stored on that hard drive. You end up with a folder of hundreds of thousands of files to go through to find the original files you are looking for.
But after all this, the main takeaway is not to underestimate the time you will spend trying to reproduce examples available online. Because Python libraries evolve so quickly, code on GitHub that is more than 4 months old is most likely not gonna work. You will have to install specific libraries at specific versions to get the full environment right before you can run the code. Sometimes it’s well documented in the GitHub repo; sometimes it’s a couple of hours of Googling and trial and error. When you are not that familiar with Python, that quickly becomes a major time sucker. I know better now.
All the pieces were previously somewhat successful, in the sense that I had results and was mastering the concepts, so I was confident that by putting them all together with a StackGAN I would achieve some results. Unfortunately, I was unable to make it work. With all the problems and time suckers, I realize that I bit off more than I could chew with this model. I will definitely continue to debug the StackGAN to make it work with a text encoder trained on the dataset provided for the project, but I won’t be able to finish it in time to show the results.
It was a very interesting journey, starting from no Python skills to running deep learning on Mac OS, Linux, and Hades with Theano, Tensorflow, Keras and Lasagne. Even if I didn’t manage to get great results, I was impressed with the direction of my own results: with small models (not deep) and around 10h of training, I was able to recognize objects, generate faces, etc.
Unfortunately, I can’t really comment on the usefulness of the caption versus the image and/or a combination of both, but from separate tests, my intuition based on the StackGAN is that the caption can help complete the missing information when the contour can’t give enough context. If we remove the center part of the image, chances are we are removing very valuable information about the context, and the contour might not be enough (even for a human) to infer what’s missing in the middle, or even the space in which the possible answers live. A sentence could help position the possible range of solutions in that space. Filling the missing part is an ill-posed problem, and the more constraints, the better for finding a valid solution.
GANs seem to give the best results for realistic images, but are quite hard to train. They would be the weapon of choice for such a project, provided there is enough time for tweaking parameters and trying several times.
As for evaluating the result, multiple mathematical approaches could work “ok”, but since the possible solutions are pretty much infinite, it is hard to have a flawless evaluation system. Subjective human evaluation remains the best evaluation possible for image generation. One good way of doing it is to produce multiple samples regularly as training goes, so that the results can be analyzed and a human can decide when training should have stopped to get the best model.
Go Deep or Go Home
One of my main takeaways from the IFT6266 class is that trying different deep learning models and parameters is not linear, in the sense that you cannot really run a small test and predict whether you are going in the right direction. It is more a question of whether you are in the right “regime” given your model, the amount of data, and the training time: you might need a minimum depth in your network, with a minimum amount of data and a minimum amount of training time, before you see any encouraging results.
Despite all the problems explained above, I did manage to develop a strong work ethic to accelerate deep learning testing, and a good intuition for how to attack a deep learning problem, not just on paper, but also through all the very important phases of deep learning whose cost you don’t realize when you only work on paper:
- Preprocessing / Cleaning the Data.
- Removing Grayscale images to keep only RGB ones.
- Resizing images to the right shape/size.
- Denoising the caption data (e.g. removing some words)
- Handling the Data.
- When you start working with gigabytes of data, new problems and complications arise.
- Keeping/Saving the data and the results as you go. (e.g. images generated)
- Implementing the Model.
- Using the right tools and the right libraries from low-level (numpy) to higher level ones (e.g. Keras).
- Saving “Checkpoints”.
- Save the parameters of the model every now and then during its training or you might waste a lot of time.
- Visualizing the Results.
- It’s one thing on paper to “know” what to visualize. It’s another one to implement it the way you want.
- Testing/Re-using Pre-Trained Models.
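As an example of the cleaning steps above, the grayscale filter can be sketched like this. It’s a minimal Pillow version; the folder layout is my own assumption, not the project’s actual script.

```python
import os
import shutil
from PIL import Image

def keep_rgb_only(src_dir, dst_dir):
    # Copy only true-RGB images to dst_dir; grayscale images
    # (mode 'L') and anything unreadable are skipped.
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        path = os.path.join(src_dir, name)
        try:
            with Image.open(path) as img:
                if img.mode == 'RGB':
                    shutil.copy(path, os.path.join(dst_dir, name))
        except IOError:
            continue  # not an image file
```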
I’ve learned to become good at this loop, because it’s the same thing that comes up again and again for every test you want to do. Structure yourself and take the time at the beginning to put a good work structure in place; you’ll thank yourself down the road. Otherwise, you will feel like me halfway through, with the feeling of redoing the same things over and over again (e.g. moving image folders, rewriting scripts to resize images, rewriting scripts to remove grayscale images, rerunning experiments because I only kept the generated images and not the model parameters, etc.).
Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014.
Radford, Alec, Luke Metz, and Soumith Chintala. “Unsupervised representation learning with deep convolutional generative adversarial networks.” arXiv preprint arXiv:1511.06434 (2015).
Yeh, Raymond, et al. “Semantic Image Inpainting with Perceptual and Contextual Losses.” arXiv preprint arXiv:1607.07539 (2016).
Zhang, Han, et al. “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks.” arXiv preprint arXiv:1612.03242 (2016).
Huang, Xun, et al. “Stacked Generative Adversarial Networks.” arXiv preprint arXiv:1612.04357 (2016).