PerTouch

VLM-Driven Agent for Image Retouching

Zewei Chang, Zheng-Peng Duan, Jianxing Zhang, Chun-Le Guo, Siyu Liu
Hyungju Chun, Hyunhee Park, Zikun Liu, Chongyi Li†
VCIP Lab     †Corresponding author



Abstract

Bridging Human Intent and Pixel Manipulation

Image retouching aims to enhance visual quality while aligning with users' personalized aesthetic preferences. To address the challenge of balancing controllability and subjectivity, we propose a unified diffusion-based image retouching framework called PerTouch.

Our method supports semantic-level image retouching while maintaining global aesthetics. Taking as input parameter maps that specify attribute values for individual semantic regions, PerTouch constructs an explicit parameter-to-image mapping for fine-grained retouching. To improve semantic boundary perception, we introduce semantic replacement and parameter perturbation mechanisms during training.

To connect natural-language instructions with visual control, we develop a VLM-driven agent that handles both strong and weak user instructions. Equipped with feedback-driven rethinking and scene-aware memory, the agent better aligns with user intent and captures long-term preferences.

Overview of our PerTouch pipeline

Data & Pipeline

A Comprehensive Framework for Intelligent Retouching
Dataset construction and training pipeline of PerTouch

Data Preparation

We use the MIT-Adobe FiveK dataset and generate parameter maps by extracting semantic masks using SAM and estimating attribute parameters for each region.
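The per-region parameters can be rasterized into a dense map that the diffusion model consumes. A minimal sketch of this step (the function name, attribute layout, and value range are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def build_parameter_map(masks, region_params, num_attrs, h, w):
    """Rasterize per-region attribute values into a dense parameter map.

    masks:         list of (h, w) boolean arrays, e.g. produced by SAM
    region_params: one length-num_attrs vector per region, assumed in
                   [-1, 1] (e.g. brightness, contrast, saturation, ...)
    Returns a (num_attrs, h, w) float32 map; unmasked pixels stay 0
    (interpreted here as "no adjustment").
    """
    param_map = np.zeros((num_attrs, h, w), dtype=np.float32)
    for mask, params in zip(masks, region_params):
        # Broadcast the region's attribute vector over every masked pixel.
        param_map[:, mask] = np.asarray(params, dtype=np.float32)[:, None]
    return param_map
```

Later regions overwrite earlier ones where masks overlap, so mask ordering acts as a simple layering rule in this sketch.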

Semantic Replacement

Constructs diverse yet semantically consistent samples to help the model perceive semantic regions and develop fine-grained retouching capabilities.

Perturbation Mechanism

Applies perturbations to parameter maps to prevent overfitting to segmentation boundaries and improve overall visual quality.
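One simple way to realize such a perturbation is to jitter the attribute values and spatially shift the map during training; the exact noise model below is an illustrative assumption, not the paper's recipe:

```python
import numpy as np

def perturb_parameter_map(param_map, value_noise=0.05, shift_max=3, rng=None):
    """Perturb a (num_attrs, h, w) parameter map for training.

    Adds small Gaussian noise per attribute channel and applies a random
    spatial shift, so the model cannot latch onto exact segmentation
    boundaries or exact parameter values. np.roll wraps at the borders,
    which is acceptable for small shifts in this sketch.
    """
    rng = np.random.default_rng(rng)
    noise = rng.normal(0.0, value_noise, size=param_map.shape[0])
    noisy = param_map + noise[:, None, None]
    dy, dx = rng.integers(-shift_max, shift_max + 1, size=2)
    return np.roll(noisy, shift=(dy, dx), axis=(1, 2)).astype(np.float32)
```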

Agent Workflow

Perception, Execution, and Reflection
Agent workflow in PerTouch
01

Instruction Parsing

The agent categorizes user instructions as strong (explicit, parameter-like) or weak (vague, intent-level) and invokes the appropriate retouching process for each type.
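The routing logic can be sketched as a dispatch on the classified instruction type; here `classify` stands in for the VLM call, and all function names are illustrative placeholders:

```python
def route_instruction(instruction, classify, run_strong, run_weak):
    """Dispatch a user instruction to the matching retouching process.

    classify:   callable standing in for the VLM; must return
                'strong' or 'weak' for the given instruction
    run_strong: handler for explicit requests, e.g. "+0.3 brightness on sky"
    run_weak:   handler for vague intent, e.g. "make it moodier"
    """
    kind = classify(instruction)
    if kind == "strong":
        return run_strong(instruction)
    if kind == "weak":
        return run_weak(instruction)
    raise ValueError(f"unknown instruction type: {kind!r}")
```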

02

Feedback-driven Rethinking

The agent iteratively refines control parameters based on feedback from previous results, enabling it to handle vague user expressions and align with user intent.
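The refinement loop has a simple shape: render, judge, refine, repeat. A minimal sketch, where `critique` stands in for the VLM judge and all callables are assumptions rather than the paper's interfaces:

```python
def rethink_loop(init_params, render, critique, refine, max_iters=3):
    """Iteratively refine control parameters from feedback on each result.

    render:   params -> retouched image (or any result object)
    critique: result -> (satisfied, feedback); stands in for the VLM judge
    refine:   (params, feedback) -> updated params
    Stops early once the critique is satisfied, else after max_iters rounds.
    """
    params = init_params
    result = render(params)
    for _ in range(max_iters):
        satisfied, feedback = critique(result)
        if satisfied:
            break
        params = refine(params, feedback)
        result = render(params)
    return params, result
```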

03

Scene-aware Memory

The agent stores scene semantics and editing parameters in a memory bank, enabling personalized and context-aware retouching based on historical preferences.
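A memory bank of this kind can be sketched as a keyed store of past editing parameters; representing scenes as plain string tags and recalling the historical average are simplifying assumptions for illustration (the paper's agent derives scene semantics from the image):

```python
class SceneMemory:
    """Store (scene, params) pairs and recall a personalized prior."""

    def __init__(self):
        self.bank = {}  # scene tag -> list of parameter vectors

    def store(self, scene, params):
        self.bank.setdefault(scene, []).append(list(params))

    def recall(self, scene):
        """Return the average of stored parameters for this scene,
        or None if the scene has never been edited before."""
        history = self.bank.get(scene)
        if not history:
            return None
        n = len(history)
        return [sum(vals) / n for vals in zip(*history)]
```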

Performance

State-of-the-Art Results, Remarkable Visual Quality

Quantitative Comparison

Method       | A: PSNR↑ / LPIPS↓ | B: PSNR↑ / LPIPS↓ | C: PSNR↑ / LPIPS↓ | D: PSNR↑ / LPIPS↓ | E: PSNR↑ / LPIPS↓
PIENet       | 21.5184 / 0.1265  | 25.9065 / 0.0912  | 25.1927 / 0.0975  | 22.8989 / 0.1119  | 24.1171 / 0.1131
TSFlow       | 20.6123 / 0.1037  | 25.2474 / 0.0716  | 25.6243 / 0.0630  | 22.3720 / 0.0894  | 23.5393 / 0.0822
StarEnhancer | 20.7100 / 0.1057  | 25.7296 / 0.0738  | 25.5198 / 0.0645  | 23.3875 / 0.0803  | 24.4558 / 0.0834
Diffretouch  | 24.5082 / 0.0812  | 26.1473 / 0.0672  | 25.9148 / 0.0684  | 24.5087 / 0.0768  | 24.7373 / 0.0776
PerTouch     | 25.1430 / 0.0798  | 27.4733 / 0.0687  | 26.7510 / 0.0844  | 25.9726 / 0.0823  | 25.6602 / 0.0792

Citation & License

If you find our work useful for your research, please consider citing:

Citation

BibTeX

@inproceedings{chang2026pertouch,
  title     = {PerTouch: A Unified Diffusion-based Image Retouching Framework with VLM-driven Agent},
  author    = {Chang, Zewei and Duan, Zheng-Peng and Zhang, Jianxing and others},
  year      = {2026},
  booktitle = {The 40th Annual AAAI Conference on Artificial Intelligence},
  address   = {Singapore, Singapore},
}

License

Code License · Dataset License