UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

Can Qin1,2,†  Shu Zhang1  Ning Yu1  Yihao Feng1  Xinyi Yang1  Yingbo Zhou1  Huan Wang1  Juan Carlos Niebles1  Caiming Xiong1  Silvio Savarese1  Stefano Ermon3  Yun Fu2  Ran Xu1
1Salesforce AI, Palo Alto, CA
2Northeastern University, Boston, MA 
3Stanford University, Palo Alto, CA
†Work done when Can Qin was an intern at Salesforce AI.


Approach

UniControl is trained on multiple tasks with a unified model and further demonstrates promising zero-shot task generalization, as illustrated by the visual examples above.

Abstract

Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when guided by arbitrary language prompts. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a single framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation.
This figure shows our proposed UniControl method. To accommodate diverse tasks, we design a Mixture-of-Experts (MOE) adapter, with roughly 70K parameters per task, and a task-aware HyperNet (about 12M parameters) that modulates N (here, 7) zero-convolution layers. This structure enables multi-task functionality within a single model, significantly reducing the model size compared to an equivalent stack of single-task models, each with around 1.4B parameters.
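To make the modulation scheme above concrete, here is a minimal PyTorch-style sketch of a per-task MOE adapter and a task-aware HyperNet that emits one channel-wise scale vector per zero-convolution layer. All module names, channel sizes, and the scaling form of the modulation are illustrative assumptions and do not reproduce the released UniControl code.

# Hypothetical sketch of the control path described in the figure caption above.
# Module names, channel sizes, and the channel-wise scaling used for
# "modulation" are assumptions for illustration, not the official code.
import torch
import torch.nn as nn


class MOEAdapter(nn.Module):
    # One lightweight convolutional expert per C2I task (on the order of 70K params each).
    def __init__(self, num_tasks, cond_channels=3, hidden=32):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(cond_channels, hidden, 3, padding=1),
                nn.SiLU(),
                nn.Conv2d(hidden, hidden, 3, padding=1),
            )
            for _ in range(num_tasks)
        ])

    def forward(self, cond, task_id):
        # Route the visual condition (edge map, depth map, etc.) to its task expert.
        return self.experts[task_id](cond)


class TaskAwareHyperNet(nn.Module):
    # Maps a learned task embedding to one scale vector per zero-conv layer.
    def __init__(self, num_tasks, layer_channels, embed_dim=256):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, embed_dim)
        self.heads = nn.ModuleList([nn.Linear(embed_dim, c) for c in layer_channels])

    def forward(self, task_id):
        emb = self.task_embed(task_id)             # (B, embed_dim)
        return [head(emb) for head in self.heads]  # one (B, C_k) scale per zero-conv layer


# Usage sketch: modulate the k-th zero-conv feature map channel-wise.
# zero_conv_out has shape (B, C_k, H, W); scales[k] has shape (B, C_k).
#   scales = hypernet(task_id)
#   modulated = zero_conv_out * scales[k][:, :, None, None]

The point of this design is that the heavy diffusion backbone is shared across tasks; only the small per-task experts and the single HyperNet grow with the number of tasks, rather than stacking a full ~1.4B-parameter single-task model per condition.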

Pre-training Task Results

Visual comparison between official or re-implemented task-specific ControlNet models and our proposed model. The example data are drawn from our test set, sampled from COCO and Laion.

Zero-shot Task Results

(a)-(b): Example results of UniControl on hybrid (unseen combination) conditions, with the keywords "background" and "foreground" attached to the prompts. (c)-(e): Example results of UniControl on three unseen tasks (deblurring, colorization, inpainting).

BibTeX

@article{qin2023unicontrol,
  title={UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild},
  author={Qin, Can and Zhang, Shu and Yu, Ning and Feng, Yihao and Yang, Xinyi and Zhou, Yingbo and Wang, Huan and Niebles, Juan Carlos and Xiong, Caiming and Savarese, Silvio and others},
  journal={arXiv preprint arXiv:2305.11147},
  year={2023}
}