ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models

WACV 2025

¹Indian Institute of Technology, Roorkee ²Adobe MDSR ³Microsoft Research
⁴Indian Institute of Technology, Bombay ⁵Stanford University ⁶Carnegie Mellon University

*Indicates equal contribution

Work done during internship at Adobe MDSR

Work done while at Adobe

Abstract

Modern Text-to-Image (T2I) diffusion models have revolutionized image editing by enabling the generation of high-quality, photorealistic images. While the de facto method for performing edits with T2I models is through text instructions, specifying an edit in text is non-trivial due to the complex many-to-many mapping between natural language and images. In this work, we address exemplar-based image editing: the task of transferring an edit from an exemplar pair to a content image. We propose ReEdit, a modular and efficient end-to-end framework that captures the edit in both the text and image modalities while preserving the fidelity of the edited image. We validate the effectiveness of ReEdit through extensive comparisons with state-of-the-art baselines and sensitivity analyses of key design choices. Our results demonstrate that ReEdit consistently outperforms contemporary approaches both qualitatively and quantitatively.

Method

[Figure: overview of the ReEdit method.]

Given an exemplar pair, ReEdit first extracts image embeddings from the original image \(x\) and the edited exemplar \(x_{\text{edit}}\) using the image encoder \(\mathcal{E}_{\text{img}}\). The difference between these embeddings is pooled to form \(\Delta_{\text{img}}\), which captures the fine-grained details of the edit. In parallel, text guidance is incorporated through \(\mathcal{E}_{\text{text}}(g_{\text{caption}})\), providing additional context for the edit. These components are combined into a guidance vector \(g\) that conditions the diffusion model \(M\).
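To make the guidance computation concrete, here is a minimal sketch that uses off-the-shelf CLIP encoders from Hugging Face `transformers` as stand-ins for \(\mathcal{E}_{\text{img}}\) and \(\mathcal{E}_{\text{text}}\). The simple embedding difference and the weighted sum with edit weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def guidance_vector(x, x_edit, caption, lam=0.7):
    """Combine the image-space edit direction with text guidance."""
    # E_img(x) and E_img(x_edit): embed the exemplar pair.
    img = processor(images=[x, x_edit], return_tensors="pt")
    emb = model.get_image_features(**img)            # shape (2, d)
    # Delta_img: difference capturing the edit direction.
    delta_img = emb[1] - emb[0]
    # E_text(g_caption): embed the caption describing the edit.
    txt = processor(text=[caption], return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**txt)[0]
    # g: combined guidance used to condition the diffusion model M.
    # The weighted sum with `lam` (the edit weight λ) is our assumption.
    g = lam * delta_img + (1.0 - lam) * txt_emb
    return g / g.norm()
```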

The target image \(y\) is then mapped into the latent space via DDIM inversion, and the subsequent denoising preserves the structural integrity of the original image. During denoising, the model employs cross-attention conditioning, self-attention injection, and feature injection to produce precise, contextually relevant edits. The final output \(\hat{y}_{\text{edit}}\) applies the desired change while respecting the original content.
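For intuition, the DDIM inversion step can be sketched as follows, assuming a `diffusers`-style epsilon-prediction UNet and scheduler. The self-attention and feature injection hooks are elided, so this is a schematic sketch rather than the ReEdit implementation.

```python
import torch

@torch.no_grad()
def ddim_invert(unet, scheduler, latents, cond, num_steps=50):
    """Run the DDIM update in reverse to map a clean latent to noise."""
    scheduler.set_timesteps(num_steps)
    # Sampling goes T -> 0; inversion walks the same timesteps 0 -> T.
    timesteps = list(reversed(scheduler.timesteps))
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        # Predict the noise at the current step, conditioned on g.
        eps = unet(latents, t, encoder_hidden_states=cond).sample
        a_t = scheduler.alphas_cumprod[t]
        a_next = scheduler.alphas_cumprod[t_next]
        # Recover the predicted x0, then deterministically re-noise to t_next.
        x0 = (latents - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        latents = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps
    return latents  # fully inverted latent, ready for guided denoising
```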

Dataset Creation

The evaluation dataset comprises a diverse range of edits, carefully curated to cover a comprehensive set of editing techniques. The table below summarizes the types of edits included:


Type of Edit               Number of Examples
------------------------   ------------------
Global Style Transfer                     428
Background Change                         212
Localized Style Transfer                  290
Object Replacement                        366
Motion Edit                                14
Object Insertion                          164
------------------------   ------------------
Total                                    1474

Summary and statistics of the types of edits in the evaluation dataset. Special care was taken to ensure diversity of edit categories.

Effect of Varying the Edit Weight λ

[Figure: two example rows showing the original image x, the edited exemplar x_edit, the target image y, and the edited outputs at λ = 0, 0.6, 0.7, and 0.8.]

Effect of the edit weight λ on the edited image. Higher values correspond to a stronger influence of the desired edit.

BibTeX

@article{srivastava2024reedit,
  title={ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models},
  author={Srivastava, Ashutosh and Menta, Tarun Ram and Java, Abhinav and Jadhav, Avadhoot and Singh, Silky and Jandial, Surgan and Krishnamurthy, Balaji},
  journal={arXiv preprint arXiv:2411.03982},
  year={2024}
}