The latest artificial intelligence (AI) research at Google offers ‘Imagic,’ an efficient technique based on diffusion models for editing images with text prompts


Real-world photo editing involving non-trivial semantic adjustments has become an attractive challenge in image processing, especially in the last few years. In particular, the ability to control an image via a brief natural-language text prompt has been a disruptive innovation in the field.

The current top methods for this task still suffer from several drawbacks: first, they are usually limited to specific domains or synthetically generated images. Second, they offer only limited edits, such as painting over part of the image, adding an object, or transferring a style. Third, they require auxiliary inputs beyond the input image itself, such as image masks indicating the desired editing location.

A team of researchers from Google, the Technion, and the Weizmann Institute of Science has proposed Imagic, a semantic image-editing technique based on Imagen that addresses all of the above problems. Their approach can perform complex non-rigid modifications on real high-resolution photographs using only the input image to be modified and a single text prompt describing the target edit. The output images align well with the target text while preserving the background, composition, and overall structure of the source image. Imagic can make many kinds of changes, including style adjustments, color changes, and object additions, along with more complex edits. Some examples are shown in the figure below.



Given an input image and a target text prompt describing the modification to be applied, Imagic’s goal is to edit the image in a way that satisfies the given text while retaining as much of the original detail as possible.

More precisely, this method consists of three steps, also shown in the image below:

  1. Text embedding optimization. The target text is passed through the text encoder to produce the target text embedding e_tgt. The generative diffusion model is then frozen, and the target text embedding is optimized for a few steps to obtain e_opt, so that the image generated from e_opt matches the input image as closely as possible.
  2. Diffusion model fine-tuning. Even so, when passed through the generative diffusion process, the optimized embedding e_opt may not reproduce the input image exactly. To close this gap, in the second step the model parameters are fine-tuned while the optimized embedding e_opt is kept frozen.
  3. Linear interpolation between the optimized embedding e_opt and the target text embedding e_tgt, using the model fine-tuned in step 2, to find a point that achieves both image fidelity and alignment with the target text. Since the fine-tuned model fully reconstructs the input image at the optimized embedding, moving in the direction of the target text embedding lets the generative diffusion model apply the desired edit. This third step is, more precisely, a simple linear interpolation between the two embeddings.
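The three stages above can be sketched with a toy linear stand-in for the diffusion model. This is a minimal illustration of the optimization structure only, not the actual Imagen architecture or API: `render`, `W`, and the gradient-descent loops are hypothetical stand-ins for the frozen/fine-tuned diffusion model and its training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a linear "renderer" maps a text embedding to an image vector.
W = rng.normal(size=(8, 4))     # frozen "diffusion model" weights (toy)
image = rng.normal(size=8)      # the input image to be reconstructed
e_tgt = rng.normal(size=4)      # target text embedding from the text encoder

def render(weights, e):
    """Toy generative model: produce an 'image' from an embedding."""
    return weights @ e

# Step 1: optimize the embedding (model frozen) to reconstruct the image.
e_opt = e_tgt.copy()
for _ in range(300):
    grad = 2 * W.T @ (render(W, e_opt) - image)  # d/de ||W e - image||^2
    e_opt -= 0.01 * grad

# Step 2: fine-tune the model weights (embedding frozen) to close the
# remaining reconstruction gap at e_opt.
W_ft = W.copy()
for _ in range(300):
    grad = 2 * np.outer(render(W_ft, e_opt) - image, e_opt)
    W_ft -= 0.01 * grad

# Step 3: linearly interpolate between e_opt (fidelity) and e_tgt (edit).
def interpolate(eta):
    return eta * e_tgt + (1.0 - eta) * e_opt

# Generating at an intermediate eta trades off reconstruction vs. the edit.
edited = render(W_ft, interpolate(0.7))
```

Sweeping `eta` from 0 to 1 moves the output from a faithful reconstruction of the input image toward an image fully aligned with the target text; Imagic picks an intermediate point that achieves both.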


The authors present a comparison between Imagic and state-of-the-art models, showing the clear superiority of their approach.


Also, the model’s ability to produce different outputs with different seeds from the same input image and text prompt is shown below.


Imagic still has some drawbacks: in some cases the desired modification is applied only subtly, while in others it is applied effectively but alters extraneous image details. Nevertheless, this is the first time a diffusion model has been able to edit real images from a text prompt with such accuracy, and we cannot wait to see what comes next.

This article is written as a research summary by Marktechpost Staff based on the research paper 'Imagic: Text-Based Real Image Editing with Diffusion Models'. All credit for this research goes to the researchers on this project. Check out the paper.

Leonardo Tanzi is currently a Ph.D. student at the Polytechnic University of Turin, Italy. His current research focuses on human-machine methods for smart support during complex interventions in the medical domain, using deep learning and augmented reality for 3D assistance.
