Content and Style

In August 2015, researchers from the Bernstein Center for Computational Neuroscience in Tübingen, Germany released a paper titled A Neural Algorithm of Artistic Style. The paper describes a technique that uses state-of-the-art computer vision software to blend the stylistic elements of one image (or set of images) with the content of another. The word "neural" in the title refers to the underlying computer vision software: convolutional neural networks.

Convolutional neural networks consist of layers of "artificial neurons", which are computational units that process (visual) information. Each layer is essentially a set of image filters; each filter extracts a certain feature from the input image. As information flows through the network, the successive representations correspond more and more to the high-level content of the image, while discarding noisy details of the individual pixel values. Thus, the deeper layers output (lossy) compressed versions of the original image, while "trying" to retain as much information as possible about the semantic category of the image.
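To make "each filter extracts a feature" concrete, here is a minimal NumPy sketch (not the paper's actual network, which uses the pretrained VGG architecture): a hand-written 2-D convolution applies a vertical-edge filter to a toy image, responding strongly only where intensity changes from left to right.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2-D convolution: slide the kernel over the image
    and record the filter response at each position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter: fires where intensity changes left-to-right.
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])

# Toy image: dark left half, bright right half.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

response = conv2d(image, edge_kernel)
# The response is zero in the flat region and large (in magnitude)
# at the boundary between the two halves.
```

A real convolutional layer applies many such filters in parallel and learns the kernels from data, but the mechanics are the same.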

To extract the style of an image, we need to do a bit more work. A full description of the method is given in a paper by the same authors titled Texture Synthesis Using Convolutional Neural Networks. To summarize: we first pass the image through the network and capture the output from each layer. For each layer we form a matrix whose entry (i, j) stores the activation of the ith filter at position j. The texture is then computed from the correlations between the responses of different filters. This texture model appropriately discards spatial information, since by definition the texture of an image should be stationary.
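The correlation computation above can be sketched in a few lines of NumPy. If F is the matrix with entry (i, j) equal to the response of filter i at position j, then the correlations are F Fᵀ (the Gram matrix of the filter responses). The shapes here are made up for illustration; in the real network they come from the layer's filter count and spatial resolution.

```python
import numpy as np

def style_correlations(F):
    """F has shape (n_filters, n_positions): entry (i, j) is the
    activation of filter i at spatial position j.  Returns the
    n_filters x n_filters matrix of correlations between filter
    responses, summed over all positions."""
    return F @ F.T

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 100))   # 4 filters, 100 spatial positions
G = style_correlations(F)

# Stationarity: shuffling the spatial positions leaves G unchanged,
# so the representation keeps "what textures appear" but not "where".
F_shuffled = F[:, rng.permutation(100)]
```

The last two lines illustrate why this representation discards spatial layout: summing over positions makes the result invariant to where in the image each feature occurs.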

Thus, for a given neural network and a given image, we can compute mathematical objects which serve as representations for the content and the style of the image. Now suppose we used one image for the content and a different image for the style. To create a new image that has the content of one and style of the other, we start with a random, white noise image, and perform gradient descent to best fit the content and the style. In other words, the goal is to gradually transform an image so that the activations match the content activations of one image and the style activations of another (details in paper's figure 1 and section 3).
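The gradient-descent step can be sketched with a toy stand-in for the network. Here a single fixed linear map W plays the role of the CNN's feature extractor (an assumption for brevity; the real method backpropagates through VGG), and only the content term is shown; the style term adds an analogous loss on the correlation matrices. Starting from white noise, we descend the gradient of the squared distance between the current image's activations and the target's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "network": one fixed linear layer standing in for the CNN.
W = rng.standard_normal((8, 16))

content_image = rng.standard_normal(16)
target = W @ content_image            # content activations to match

x = rng.standard_normal(16)           # start from a white-noise image

# Step size chosen from the layer's spectral norm to guarantee descent.
lr = 0.9 / np.linalg.norm(W, ord=2) ** 2
for _ in range(2000):
    residual = W @ x - target         # mismatch in activation space
    grad = 2 * W.T @ residual         # gradient of ||Wx - target||^2 in x
    x -= lr * grad

content_loss = np.sum((W @ x - target) ** 2)
# After enough steps the activations of x match the content activations.
```

In the full method the loss is a weighted sum of this content term (at one layer) and style terms (correlation-matrix mismatches at several layers), and the weighting controls the trade-off between fidelity to the content and fidelity to the style.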

Here are some examples.

The style image: Monet's Impression, soleil levant.

The style image: van Gogh's A Wheatfield with Cypresses.

One more; this time the content image itself contains a piece by the same artist as the style image:

The style image: Kandinsky's Composition VIII.