Style Transfer: Solved
Interrupting our regular coverage for what's been lighting up the internet the last few days. Both Google and OpenAI have released multi-modal models that, for the first time, integrate image generation and image editing right into the language model itself. This is a big deal because the model generating the images now has a solid understanding of the user's intent. One of the most obvious use cases is style transfer: "Turn this photo into a Simpsons cartoon" or "Make this a Van Gogh painting."
A brief history of style transfer
Repurposed Image Recognition Models
The first widely publicized algorithm for style transfer was published almost ten years ago. Gatys et al. recognized that, in an image recognition model, the early layers mostly capture style (textures, brush strokes), while the later layers mostly capture content. The algorithm feeds a content image (a photo of me) and a style image (Van Gogh's Starry Night) through the model, then optimizes an output image whose activations in the style layers match the style image and whose activations in the content layers match the content image.
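To make that concrete, here is a minimal PyTorch sketch of that optimization loop. The VGG19 layer indices, the style weight, and the file paths are illustrative assumptions, not the paper's exact settings.

```python
# Gatys-style transfer sketch: optimize the pixels of an output image so its
# deep-layer activations match the content photo and its Gram matrices match
# the style image. Layer picks and weights are illustrative; paths are placeholders.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.to(device).eval()
for p in vgg.parameters():
    p.requires_grad_(False)

CONTENT_LAYERS = {21}               # deeper layer -> "what is in the picture"
STYLE_LAYERS = {0, 5, 10, 19, 28}   # earlier layers -> textures / brush strokes

def load(path, size=512):
    tf = transforms.Compose([transforms.Resize(size),
                             transforms.CenterCrop(size),
                             transforms.ToTensor()])
    return tf(Image.open(path).convert("RGB")).unsqueeze(0).to(device)

def gram(x):
    # Gram matrix: channel-to-channel correlations. This throws away *where*
    # features occur and keeps only how strongly they co-occur.
    b, c, h, w = x.shape
    f = x.view(c, h * w)
    return f @ f.t() / (c * h * w)

def features(x):
    content, style = {}, {}
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in CONTENT_LAYERS:
            content[i] = x
        if i in STYLE_LAYERS:
            style[i] = gram(x)
    return content, style

content_img = load("photo_of_me.jpg")   # placeholder path
style_img = load("starry_night.jpg")    # placeholder path

with torch.no_grad():
    target_content, _ = features(content_img)
    _, target_style = features(style_img)

output = content_img.clone().requires_grad_(True)
opt = torch.optim.Adam([output], lr=0.02)

for step in range(300):
    opt.zero_grad()
    c_feats, s_feats = features(output)
    content_loss = sum(F.mse_loss(c_feats[i], target_content[i]) for i in CONTENT_LAYERS)
    style_loss = sum(F.mse_loss(s_feats[i], target_style[i]) for i in STYLE_LAYERS)
    loss = content_loss + 1e6 * style_loss   # style weight is the main knob to tune
    loss.backward()
    opt.step()

transforms.ToPILImage()(output.detach().squeeze(0).clamp(0, 1).cpu()).save("stylized.png")
```

The Gram matrix is the key trick here: because it only records which feature channels fire together, it captures texture and brush strokes while ignoring layout, which is exactly why this approach excels at the former and not the latter.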
Clever as it is, this algorithm and subsequent works were mostly good at matching brush strokes of various art styles but not so much the broader artistic implications, like Picasso's way of messing with perspective or Dali's "everything melts" style.
Diffusion Models
Without getting too technical, these models (DALL·E, Stable Diffusion, Midjourney, etc.) use a reverse diffusion process to start from pure noise and slowly generate a target image, guided by a text prompt. Early models famously had huge problems adhering to prompts (and let's not forget the horrific way they generated hands and fingers). They were, however, great at applying styles to a target prompt: "A bored cat in the style of Girl with a Pearl Earring" and so on. Instead of a text prompt, they could also be prompted with a source image and glean the style from there.
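For a sense of what driving these models looks like in practice, here is a hedged sketch using Hugging Face's diffusers library. The checkpoint name, the strength value, and the file paths are assumptions, and a GPU is assumed for the float16 weights.

```python
# Text-to-image and image-to-image with Stable Diffusion via diffusers.
# Checkpoint id and parameters are assumptions; any SD checkpoint works.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline
from PIL import Image

model_id = "stabilityai/stable-diffusion-2-1"  # assumed checkpoint
device = "cuda"                                # float16 assumes a GPU

# Text-to-image: the prompt carries both content and style.
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
image = pipe("A bored cat in the style of Girl with a Pearl Earring").images[0]
image.save("bored_cat.png")

# Image-to-image: start the reverse diffusion from a noised version of a
# source photo instead of pure noise, so its layout and style bleed through.
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
init = Image.open("photo_of_me.jpg").convert("RGB").resize((768, 768))  # placeholder path
styled = img2img(prompt="an oil painting", image=init, strength=0.6).images[0]
styled.save("styled.png")
```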
They could also be fine-tuned on your own images, and after that time-consuming and finicky process, you could ask for yourself as, say, a Lego minifigure.
However, they were still bad at:
- Text generation
- Prompt adherence
- Image editing
In simple terms, these models do not have a genuine concept of content, just a vague sense of whether the image and the text prompt land close to each other in embedding space.
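That "close to each other" is literally a similarity score between embeddings, in the spirit of CLIP guidance. A small sketch of what that score looks like (the image path is a placeholder):

```python
# Measure how well an image matches a caption using CLIP embeddings.
# This similarity score is all the "understanding" the guidance has to work with.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated_image.png")  # placeholder path
prompts = ["a bored cat in the style of Girl with a Pearl Earring",
           "a photo of a dog on a beach"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher score = the image sits closer to that caption in embedding space.
# Nothing here "understands" the cat; it is just a similarity between vectors.
print(outputs.logits_per_image.softmax(dim=-1))
```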
Multi-modal Models
By baking image generation right into their Large Multimodal Model (LMMM?), both Gemini and GPT-4o have access to all their text-based world knowledge and what they have learned about how objects relate to each other. These models know that the Simpsons have yellow skin and an overbite. They can move parts of the image around while keeping them consistent overall, in a way previous generative models couldn't because those were too focused on individual pixels.
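As a rough illustration of what this looks like from the API side, here is a hedged sketch of an image edit request against the OpenAI Images API. The model name ("gpt-image-1") and the base64 response handling are assumptions on my part; check the current docs before relying on them.

```python
# Hedged sketch: ask a multimodal model to restyle a photo via the
# OpenAI Images API. Model name and response handling are assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.edit(
    model="gpt-image-1",                  # assumed model identifier
    image=open("photo_of_me.jpg", "rb"),  # placeholder path
    prompt="Turn this photo into a Simpsons-style cartoon, keep the pose and background",
)

with open("simpsons_me.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```

The interesting part is not the call itself but that the prompt can now reference the photo's content ("keep the pose and background") and be understood.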
Cheers, and enjoy the weekend!