PerTouch

VLM-Driven Agent for Image Retouching

Zewei Chang, Zheng-Peng Duan, Jianxing Zhang, Chun-Le Guo, Siyu Liu
Hyungju Chun, Hyunhee Park, Zikun Liu, Chongyi Li†
VCIP Lab     †Corresponding author



Abstract

Bridging Human Intent and Pixel Manipulation

Image retouching aims to enhance visual quality while aligning with users' personalized aesthetic preferences. To address the challenge of balancing controllability and subjectivity, we propose a unified diffusion-based image retouching framework called PerTouch.

Our method supports semantic-level image retouching while maintaining global aesthetics. Taking as input parameter maps that specify attribute values for individual semantic regions, PerTouch constructs an explicit parameter-to-image mapping for fine-grained retouching. To improve semantic boundary perception, we introduce semantic replacement and parameter perturbation mechanisms during training.

To connect natural-language instructions with visual control, we develop a VLM-driven agent that handles both strong and weak user instructions. Equipped with feedback-driven rethinking and scene-aware memory, the agent better aligns with user intent and captures long-term preferences.

Overview of our PerTouch pipeline

Data & Pipeline

A Comprehensive Framework for Intelligent Retouching
Dataset construction and training pipeline of PerTouch

Data Preparation

We use the MIT-Adobe FiveK dataset and generate parameter maps by extracting semantic masks using SAM and estimating attribute parameters for each region.
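The per-region parameters can be rasterized into a dense map that the diffusion model consumes. A minimal sketch of this step (the function name, attribute layout, and value range are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def build_parameter_map(masks, region_params, num_attrs, h, w):
    """Rasterize per-region attribute values into a dense parameter map.

    masks:         list of (h, w) boolean arrays, e.g. produced by SAM
    region_params: one length-num_attrs vector per region, assumed in
                   [-1, 1] (e.g. brightness, contrast, saturation, ...)
    Returns a (num_attrs, h, w) float32 map; unmasked pixels stay 0
    (interpreted here as "no adjustment").
    """
    param_map = np.zeros((num_attrs, h, w), dtype=np.float32)
    for mask, params in zip(masks, region_params):
        # Broadcast the region's attribute vector over every masked pixel.
        param_map[:, mask] = np.asarray(params, dtype=np.float32)[:, None]
    return param_map
```

Later regions overwrite earlier ones where masks overlap, so mask ordering acts as a simple layering rule in this sketch.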

Semantic Replacement

Constructs diverse yet semantically consistent samples to help the model perceive semantic regions and develop fine-grained retouching capabilities.

Perturbation Mechanism

Applies perturbations to parameter maps to prevent overfitting to segmentation boundaries and improve overall visual quality.
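One simple way to realize such a perturbation is to jitter the attribute values and spatially shift the map during training; the exact noise model below is an illustrative assumption, not the paper's recipe:

```python
import numpy as np

def perturb_parameter_map(param_map, value_noise=0.05, shift_max=3, rng=None):
    """Perturb a (num_attrs, h, w) parameter map for training.

    Adds small Gaussian noise per attribute channel and applies a random
    spatial shift, so the model cannot latch onto exact segmentation
    boundaries or exact parameter values. np.roll wraps at the borders,
    which is acceptable for small shifts in this sketch.
    """
    rng = np.random.default_rng(rng)
    noise = rng.normal(0.0, value_noise, size=param_map.shape[0])
    noisy = param_map + noise[:, None, None]
    dy, dx = rng.integers(-shift_max, shift_max + 1, size=2)
    return np.roll(noisy, shift=(dy, dx), axis=(1, 2)).astype(np.float32)
```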

Agent Workflow

Perception, Execution, and Reflection
Agent workflow in PerTouch
01

Instruction Parsing

The agent categorizes user instructions as strong (explicit, parameter-like) or weak (vague, intent-level) and invokes the appropriate retouching process for each type.
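The routing logic can be sketched as a dispatch on the classified instruction type; here `classify` stands in for the VLM call, and all function names are illustrative placeholders:

```python
def route_instruction(instruction, classify, run_strong, run_weak):
    """Dispatch a user instruction to the matching retouching process.

    classify:   callable standing in for the VLM; must return
                'strong' or 'weak' for the given instruction
    run_strong: handler for explicit requests, e.g. "+0.3 brightness on sky"
    run_weak:   handler for vague intent, e.g. "make it moodier"
    """
    kind = classify(instruction)
    if kind == "strong":
        return run_strong(instruction)
    if kind == "weak":
        return run_weak(instruction)
    raise ValueError(f"unknown instruction type: {kind!r}")
```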

02

Feedback-driven Rethinking

The agent iteratively refines control parameters based on feedback from previous results, enabling it to handle vague user expressions and align with user intent.
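The refinement loop has a simple shape: render, judge, refine, repeat. A minimal sketch, where `critique` stands in for the VLM judge and all callables are assumptions rather than the paper's interfaces:

```python
def rethink_loop(init_params, render, critique, refine, max_iters=3):
    """Iteratively refine control parameters from feedback on each result.

    render:   params -> retouched image (or any result object)
    critique: result -> (satisfied, feedback); stands in for the VLM judge
    refine:   (params, feedback) -> updated params
    Stops early once the critique is satisfied, else after max_iters rounds.
    """
    params = init_params
    result = render(params)
    for _ in range(max_iters):
        satisfied, feedback = critique(result)
        if satisfied:
            break
        params = refine(params, feedback)
        result = render(params)
    return params, result
```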

03

Scene-aware Memory

The agent stores scene semantics and editing parameters in a memory bank, enabling personalized and context-aware retouching based on historical preferences.
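A memory bank of this kind can be sketched as a keyed store of past editing parameters; representing scenes as plain string tags and recalling the historical average are simplifying assumptions for illustration (the paper's agent derives scene semantics from the image):

```python
class SceneMemory:
    """Store (scene, params) pairs and recall a personalized prior."""

    def __init__(self):
        self.bank = {}  # scene tag -> list of parameter vectors

    def store(self, scene, params):
        self.bank.setdefault(scene, []).append(list(params))

    def recall(self, scene):
        """Return the average of stored parameters for this scene,
        or None if the scene has never been edited before."""
        history = self.bank.get(scene)
        if not history:
            return None
        n = len(history)
        return [sum(vals) / n for vals in zip(*history)]
```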

Performance

State-of-the-Art Results, Remarkable Visual Quality

Quantitative Comparison

Method       | A: PSNR↑ / LPIPS↓ | B: PSNR↑ / LPIPS↓ | C: PSNR↑ / LPIPS↓ | D: PSNR↑ / LPIPS↓ | E: PSNR↑ / LPIPS↓
PIENet       | 21.5184 / 0.1265  | 25.9065 / 0.0912  | 25.1927 / 0.0975  | 22.8989 / 0.1119  | 24.1171 / 0.1131
TSFlow       | 20.6123 / 0.1037  | 25.2474 / 0.0716  | 25.6243 / 0.0630  | 22.3720 / 0.0894  | 23.5393 / 0.0822
StarEnhancer | 20.7100 / 0.1057  | 25.7296 / 0.0738  | 25.5198 / 0.0645  | 23.3875 / 0.0803  | 24.4558 / 0.0834
Diffretouch  | 24.5082 / 0.0812  | 26.1473 / 0.0672  | 25.9148 / 0.0684  | 24.5087 / 0.0768  | 24.7373 / 0.0776
PerTouch     | 25.1430 / 0.0798  | 27.4733 / 0.0687  | 26.7510 / 0.0844  | 25.9726 / 0.0823  | 25.6602 / 0.0792

Citation & License

If you find our work useful for your research, please consider citing:

Citation

BibTeX

@inproceedings{chang2026pertouch,
  title     = {PerTouch: A Unified Diffusion-based Image Retouching Framework with VLM-driven Agent},
  author    = {Chang, Zewei and Duan, Zheng-Peng and Zhang, Jianxing and others},
  year      = {2026},
  booktitle = {The 40th Annual AAAI Conference on Artificial Intelligence},
  address   = {Singapore, Singapore},
}

License

Code License · Dataset License