
VAREdit - The World’s First Pure Autoregressive Image Editor, Precise Edits in 0.7s

Preface

In recent years, powerful Diffusion Models have taken the field of AI image editing by storm, enabling the generation of stunning and photorealistic images. However, two major pain points lurk behind this achievement: uncontrollable effects and low efficiency. Due to their generation mechanism where “a single change affects the whole”, even when only a local detail needs modification, the model may make unnecessary adjustments that impact areas that should remain unchanged, resulting in imprecise editing. Meanwhile, the lengthy iterative process makes “real-time editing” an unattainable wish.

To overcome these challenges, the HiDream.ai team has pioneered a new approach: introducing the Visual Autoregressive (VAR) architecture into image editing. We propose VAREdit, an innovative instruction-guided editing framework that accurately addresses the inherent flaws of Diffusion Models. VAREdit can achieve “precision editing as instructed” — it strictly follows editing commands to enhance editing quality while pushing generation efficiency to new heights, realizing a dual breakthrough in both accuracy and speed.

Both the model and code have been open-sourced:

GitHub: https://github.com/HiDream-ai/VAREdit

Online Demo: https://huggingface.co/spaces/HiDream-ai/VAREdit-8B-1024

Research Background and Motivation

In recent years, the development of high-quality editing datasets and efficient diffusion denoising architectures has brought significant progress to the field of instruction-guided image editing. Benefiting from the global iterative denoising process of diffusion architectures, the generated edited images demonstrate strong visual fidelity. However, the inherent flaws of this core process also pose major challenges for instruction-guided image editing:

  • Limited editing effects: The global nature of the denoising process means that local editing instructions inevitably affect the global image structure. The editing process can easily permeate areas that should remain unchanged, creating unintended coupling between local edits and global structure, resulting in spurious or incomplete edits.

  • Slow editing speed: Diffusion models rely on multi-step iterative denoising to generate target images. This process requires substantial computational resources and overhead, leading to long editing times that severely hinder deployment in real-time applications requiring immediate feedback.

In contrast, autoregressive (AR) models generate images through sequential causal prediction of visual tokens on a token-by-token basis. This compositional generation process provides a flexible mechanism that can precisely modify edited regions while preserving unaltered areas, addressing the regional coupling issue in diffusion models. However, image editing models based on traditional autoregressive modeling still suffer from difficulties in capturing global structures and low sampling efficiency.
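The causal, token-by-token process described above can be sketched in a few lines. This is a toy illustration only, with a stub in place of a trained model; the names and the toy prediction rule are assumptions, not the VAREdit implementation. The key property it shows is that each token is conditioned only on the tokens generated before it.

```python
# Toy sketch of token-by-token causal generation (hypothetical stub,
# not a trained model): each new visual token is predicted from the
# prefix of tokens generated so far.
def next_token(prefix):
    # Stand-in for a learned predictor; here, an arbitrary toy rule.
    return (sum(prefix) + 1) % 7

def generate_tokens(n):
    tokens = []
    for _ in range(n):
        tokens.append(next_token(tokens))  # condition only on the prefix
    return tokens

print(generate_tokens(5))  # [1, 2, 4, 1, 2]
```

Because generation is compositional, tokens for regions that should stay fixed can in principle be kept while only the edited region's tokens are re-predicted.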

Method Overview

1. Main Architecture

VAREdit, proposed in this work, introduces visual autoregressive modeling into instruction-guided image editing, framing editing as a next-scale prediction task. It achieves precise image editing by autoregressively generating the target feature residuals at each successive scale.
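The coarse-to-fine next-scale loop can be sketched as follows. This is a minimal schematic under stated assumptions, not the released VAREdit code: the transformer is replaced by a random stub, the channel count and scale schedule are made up, and nearest-neighbor upsampling stands in for whatever interpolation the real model uses.

```python
# Schematic of VAR-style next-scale residual generation (hypothetical):
# accumulate residuals from the coarsest scale to the finest.
import numpy as np

def upsample(feat, size):
    """Nearest-neighbor upsample of an (H, W, C) feature map to (size, size, C)."""
    h, w, _ = feat.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return feat[rows][:, cols]

def predict_residual(context, scale, instruction):
    """Stand-in for the learned transformer that would predict the next-scale
    residual from the text instruction and quantized source features."""
    rng = np.random.default_rng(abs(hash((scale, instruction))) % (2**32))
    return rng.standard_normal((scale, scale, 4)) * 0.1

def generate(scales, instruction):
    """Coarse-to-fine loop: upsample what has been generated so far,
    then add the predicted residual for the current scale."""
    acc = predict_residual(None, scales[0], instruction)
    for s in scales[1:]:
        acc = upsample(acc, s)                          # carry coarse structure up
        acc = acc + predict_residual(acc, s, instruction)  # refine at this scale
    return acc

edited = generate([1, 2, 4, 8, 16], "make the sky purple")
print(edited.shape)  # (16, 16, 4)
```

Because each step predicts an entire scale at once rather than one token at a time, the number of autoregressive steps equals the number of scales, which is what makes sub-second editing plausible.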

2. Condition Organization

A core challenge in designing VAREdit lies in how to incorporate source image information into the backbone network as reference data for target-scale generation. This work first explores two organization schemes: full-scale conditioning, in which source features at every scale serve as references, and maximum-scale conditioning, in which only the finest-scale source features do.

This work further conducts a diagnostic analysis of the self-attention mechanism in the model trained under full-scale conditioning. The key observations are as follows:

  • In the first self-attention layer, the attention distribution is broad, covering the corresponding source scale and all coarser ones. This pattern indicates that the initial layer is responsible for establishing global layout and long-range dependencies.
  • In deeper self-attention layers, the attention pattern becomes highly localized. These layers exhibit a strong diagonal structure, suggesting that attention is mainly confined to tokens within spatial neighborhoods. This functional shift demonstrates a transition of attention from global construction to local refinement.
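One way to quantify the "strong diagonal structure" noted above is to measure how much of each query's attention mass falls within a small window around its own position. The sketch below is a hypothetical diagnostic, not taken from the paper's code; the window size and matrix shapes are assumptions.

```python
# Hypothetical diagnostic: fraction of attention mass within +/-window of
# the diagonal. Values near 1.0 indicate highly localized attention.
import numpy as np

def local_attention_mass(attn, window=2):
    """attn: (T, T) row-stochastic attention matrix. Returns the mean
    fraction of each query's attention inside +/-window of its position."""
    T = attn.shape[0]
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    return float((attn * mask).sum(axis=1).mean())

T = 32
diagonal = np.eye(T)                  # perfectly localized (deep layers)
uniform = np.full((T, T), 1.0 / T)    # fully global (first layer)
print(local_attention_mass(diagonal))  # 1.0
print(local_attention_mass(uniform))   # well below 1.0
```

Applied per layer, such a score would separate the globally attending first layer from the locally refining deeper layers described above.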

3. Scale-Aligned Reference

The aforementioned explorations have prompted this work to design a hybrid solution, namely the Scale-Aligned Reference (SAR) module, which provides scale-aligned references in the first layer while all subsequent layers focus solely on the finest-scale source.
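The SAR routing rule can be written down schematically. This is a simplified sketch of the selection logic described above, not the module's actual implementation; the function name and pyramid representation are invented for illustration.

```python
# Schematic (hypothetical) of the Scale-Aligned Reference routing rule:
# layer 0 sees source features aligned to the scale being generated,
# while every deeper layer sees only the finest-scale source features.
def select_reference(layer_idx, source_pyramid, target_scale):
    """source_pyramid: dict mapping scale -> source features at that scale."""
    if layer_idx == 0:
        # Scale-aligned: match the resolution of the scale being predicted,
        # supporting the first layer's global-layout role.
        return source_pyramid[target_scale]
    # Deeper layers only refine locally, so the finest source scale suffices.
    return source_pyramid[max(source_pyramid)]

pyramid = {1: "src@1", 4: "src@4", 16: "src@16"}
print(select_reference(0, pyramid, 4))  # src@4  (scale-aligned)
print(select_reference(3, pyramid, 4))  # src@16 (finest scale)
```

This mirrors the attention analysis: the hybrid design gives the globally attending first layer scale-matched context while keeping the conditioning sequence short for all other layers.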

Experimental Results

  • Leading Quantitative Metrics: VAREdit achieves significant advantages in both traditional CLIP evaluation metrics and GPT metrics (which better reflect editing precision) on two industry-recognized benchmark datasets: EMU-Edit and PIE-Bench. In particular, the VAREdit-8.4B model shows improvements of 41.5% and 30.8% in GPT-Balance metrics on these two datasets compared to the competitive diffusion architecture methods ICEdit and UltraEdit, respectively. Additionally, the smaller VAREdit-2.2B still demonstrates notable performance improvements over previous methods.
  • Exceptional Editing Speed: While significantly improving image editing precision, VAREdit maintains extremely high generation speed. Benefiting from the next-scale prediction paradigm of visual autoregression, VAREdit-8.4B can edit a 512×512 resolution image in 1.2 seconds, which is 2.2 times faster than UltraEdit, a diffusion editing model of similar scale. Furthermore, the smaller VAREdit-2.2B requires only 0.7 seconds for editing while still delivering leading image editing quality compared to previous methods.
  • Wide Applicability: Test results across different editing types show that VAREdit achieves optimal performance in the majority of editing categories. While the 2.2B model exhibits some limitations in challenging global style and text editing tasks, the 8.4B model significantly bridges this performance gap. This demonstrates VAREdit’s excellent scalability — editing performance can be further improved by scaling to larger models and datasets.
  • Excellent Qualitative Results: The visual comparisons clearly demonstrate VAREdit's powerful image editing capabilities. Compared with previous diffusion-based editing architectures, VAREdit exhibits stronger editing precision: instructions are executed more successfully, over-editing is reduced, and the edited images preserve fidelity while retaining natural visual appeal.
  • Improvement from the SAR Module: When comparing the original full-scale conditioning model, maximum-scale conditioning model, and the model optimized with the SAR module, it is evident that the SAR-enhanced model achieves a higher GPT-Balance score. This demonstrates that the injection of scale-matched information effectively improves the precision of image editing.

Conclusion

VAREdit, proposed in this work, introduces an innovative next-scale prediction paradigm into the instruction-guided image editing framework. It predicts the multi-scale visual residuals of target images based on text instructions and quantized source image features. By analyzing the effectiveness of different condition organization formats and proposing the novel SAR module, VAREdit achieves dual improvements in both the precision and efficiency of image editing.

In the future, the HiDream.ai team will continue to explore the architecture of next-generation multimodal image editing models, committed to developing instruction-guided image editing and visual generation technologies with higher quality, faster speed, and stronger controllability.

Paper Link: https://arxiv.org/pdf/2508.15772

Contact

Company: HiDream.ai

Contact Person: Yuechong Zhai

Email: info@hidream.ai

Website: hidream.ai

Telephone: +86 13718564372

City: Beijing/Shanghai/Hefei
