GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation

Can Qin1,†  Ning Yu2  Chen Xing2  Shu Zhang2  Zeyuan Chen2 
Stefano Ermon3  Yun Fu1  Caiming Xiong2  Ran Xu2 
1Northeastern University, Boston, MA 
2Salesforce AI Research, Palo Alto, CA
3Stanford University, Palo Alto, CA
†Work done while Can Qin was an intern at Salesforce AI Research. Primary contact: qin.ca@northeastern.edu 
The proposed GlueNet provides an adaptable interface through which the Stable Diffusion model can accept multi-modal inputs, such as text or audio signals, (a) and (b), or hybrid text-audio signals, (c), for X-to-image generation.

Abstract

Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions. However, the tight coupling between the text encoder and the image decoder in current T2I models makes the text encoder challenging to replace or upgrade: such changes often require massive fine-tuning or even retraining from scratch at prohibitive expense. To address this problem, we propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model. The approach introduces a new training objective that leverages parallel corpora to align the representation spaces of different encoders. Empirical results show that GlueNet can be trained efficiently and enables capabilities beyond previous state-of-the-art models: 1) multilingual language models such as XLM-Roberta can be aligned with existing T2I models, allowing high-quality images to be generated from captions beyond English; 2) GlueNet can align multi-modal encoders such as AudioCLIP with the Stable Diffusion model, enabling sound-to-image generation; 3) it can also upgrade the current text encoder of the latent diffusion model to handle challenging cases in generation. By aligning various feature representations, GlueNet allows new functionality to be integrated into existing T2I models flexibly and efficiently, and sheds light on X-to-image (X2I) generation.
With the proposed GlueNet model in the GlueGen framework, the pre-trained image generator (i.e., the UNet) can be bridged to off-the-shelf single- or multi-modal encoders to expand its functionality, e.g., multilingual or sound-to-image generation, within a limited budget. GlueNet is trained offline and requires neither back-propagation through the UNet nor image-text pairs for training. GlueGen is therefore flexible and efficient to realize.
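To make the alignment idea concrete, here is a minimal PyTorch sketch of offline GlueNet training on a parallel corpus. The MLP translator, the `train_gluenet` helper, and the plain MSE objective are illustrative simplifications under our own naming, not the released implementation; both encoders stay frozen, and neither the UNet nor any image-text pairs are involved.

```python
import torch
import torch.nn as nn


class GlueNet(nn.Module):
    """Translator that maps features of a new encoder into the target text-encoder space."""

    def __init__(self, src_dim: int, tgt_dim: int, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, tgt_dim),
        )

    def forward(self, x):
        # x: (batch, seq_len, src_dim) token features from the new encoder
        return self.net(x)


def train_gluenet(gluenet, src_encoder, clip_text_encoder, parallel_loader,
                  epochs: int = 10, lr: float = 1e-4):
    """Offline alignment on a parallel corpus; both encoders stay frozen."""
    opt = torch.optim.AdamW(gluenet.parameters(), lr=lr)
    for _ in range(epochs):
        for src_text, en_text in parallel_loader:  # e.g., (French caption, English caption)
            with torch.no_grad():
                src_feat = src_encoder(src_text)        # (B, L, src_dim), frozen
                tgt_feat = clip_text_encoder(en_text)   # (B, L, tgt_dim), frozen
            # Simple MSE alignment; assumes both token sequences are padded/truncated
            # to the same length L.
            loss = nn.functional.mse_loss(gluenet(src_feat), tgt_feat)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return gluenet
```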

Sound and Sound/Text-to-image Generation
(AudioCLIP + GlueNet + Stable Diffusion)

Beyond text signals, the proposed GlueNet also achieves sound-to-image generation, (a) and (b), and image generation from hybrid sound-text signals, (c), by aligning the AudioCLIP audio encoder with the CLIP text encoder.
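At inference time, the aligned audio encoder simply replaces the text branch. The sketch below illustrates this flow; `audioclip_audio_encoder`, `gluenet`, and `stable_diffusion_sample` are hypothetical handles standing in for the real components.

```python
import torch


@torch.no_grad()
def sound_to_image(waveform, audioclip_audio_encoder, gluenet, stable_diffusion_sample,
                   guidance_scale: float = 7.5):
    """Generate an image conditioned on an audio clip instead of a caption."""
    audio_feat = audioclip_audio_encoder(waveform)  # AudioCLIP audio features
    cond = gluenet(audio_feat)                      # mapped into the CLIP text space
    # The frozen Stable Diffusion UNet consumes `cond` exactly as it would consume
    # CLIP text embeddings, so no fine-tuning of the image generator is needed.
    return stable_diffusion_sample(cond, guidance_scale=guidance_scale)
```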

Monolingual Text-to-image Generation
(T5-3B + GlueNet + Latent Diffusion)

Monolingual text-to-image generation at 256 $\times$ 256 resolution with guidance weight 7.5 and 200 DDIM steps.
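For reference, the guidance weight above enters the sampler through the standard classifier-free guidance rule, which blends conditional and unconditional noise predictions at every DDIM step. The snippet below is an illustrative sketch with an assumed UNet call signature, not the exact sampling script.

```python
import torch


@torch.no_grad()
def guided_eps(unet, x_t, t, cond, uncond, guidance_weight: float = 7.5):
    """Blend unconditional and conditional noise predictions (classifier-free guidance)."""
    eps_uncond = unet(x_t, t, context=uncond)  # empty-caption conditioning
    eps_cond = unet(x_t, t, context=cond)      # GlueNet-translated T5-3B conditioning
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
# The guided prediction is fed to the DDIM update at each of the 200 steps.
```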

Multilingual Text-to-image Generation
(XLM-Roberta-L + GlueNet + Stable Diffusion)


Multilingual generation results at 512 $\times$ 512 resolution from XLM-Roberta + GlueNet + the SDM decoder (sd-v1-4), all with the same caption, ``afternoon garden oil painting painted by impressionists". With different GlueNets and a multilingual text encoder, the SDM decoder can support languages including Japanese, Italian, Chinese, French, and Spanish. The guidance weight is 7.5 and 50 PLMS sampling steps are used.
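In other words, each language is served by its own trained GlueNet in front of the same frozen decoder, as in the hedged sketch below; the function, parameter names, and language-code keys are illustrative assumptions.

```python
import torch


@torch.no_grad()
def multilingual_to_image(caption, lang, xlm_roberta, gluenets, sdm_sample):
    """gluenets: dict mapping a language code (e.g. 'ja', 'fr') to its trained GlueNet."""
    feat = xlm_roberta(caption)   # multilingual text features, frozen encoder
    cond = gluenets[lang](feat)   # language-specific GlueNet -> CLIP text space
    return sdm_sample(cond, guidance_scale=7.5, plms_steps=50)
```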

Hybrid multilingual generation at 512 $\times$ 512 resolution. The input caption mixes text in three languages: Chinese, Japanese, and English. The caption of (a) is ``colorful, a cat painted by Picasso, sit on a table, is eating food'' and the caption of (b) is ``a white, sedan, crash into a building''. With our GlueNet inserted in front, XLM-Roberta can guide the SDM decoder to generate reasonable results where the original SDM fails.

BibTeX

@article{qin2023gluegen,
  title={GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation},
  author={Qin, Can and Yu, Ning and Xing, Chen and Zhang, Shu and Chen, Zeyuan and Ermon, Stefano and Fu, Yun and Xiong, Caiming and Xu, Ran},
  journal={arXiv preprint arXiv:2303.10056},
  year={2023}
}