DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation


DisenBooth disentangles the subject identity-relevant and identity-irrelevant information in the given input images. It then utilizes the identity-relevant information, e.g., "a S* dog", together with new text descriptions, e.g., "on top of a pink fabric", to generate new images of the subject. This disentangled strategy tackles the problems of existing methods, which either overfit to identity-irrelevant information or change the subject identity in the generated images.


Subject-driven text-to-image generation aims to generate customized images of a given subject based on text descriptions, and has drawn increasing attention recently. Existing methods mainly resort to finetuning a pretrained generative model, where the identity-relevant information (e.g., the boy) and the identity-irrelevant information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. However, the highly entangled latent embedding may lead to the failure of subject-driven text-to-image generation as follows: (i) the identity-irrelevant information hidden in the entangled embedding may dominate the generation process, resulting in generated images that depend heavily on the irrelevant information while ignoring the given text descriptions; (ii) the identity-relevant information carried in the entangled embedding cannot be appropriately preserved, resulting in identity change of the subject in the generated images. To tackle these problems, we propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation. Specifically, DisenBooth finetunes the pretrained diffusion model in the denoising process. Different from previous works that utilize an entangled embedding to denoise each image, DisenBooth instead utilizes disentangled embeddings to respectively preserve the subject identity and capture the identity-irrelevant information. We further design novel weak denoising and contrastive embedding auxiliary tuning objectives to achieve the disentanglement. Extensive experiments show that our proposed DisenBooth framework outperforms baseline models for subject-driven text-to-image generation with the identity-preserved embedding. Additionally, by combining the identity-preserved embedding and the identity-irrelevant embedding, DisenBooth demonstrates more generation flexibility and controllability.

Model Architecture


DisenBooth conducts finetuning in the denoising process, where each input image is denoised with the textual embedding fs, shared by all the images, to preserve the subject identity, and a visual embedding fi to capture the identity-irrelevant information. To disentangle the two embeddings, the weak denoising objective L2 and the contrastive embedding objective L3 are proposed. Fine-tuned parameters include the adapter and the LoRA parameters.
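The combined tuning objective described above can be sketched as follows. This is a minimal illustration, not the paper's released code: the function and argument names (`disentangled_losses`, the lambda weights, the flattened cosine-similarity form of the contrastive objective) are assumptions for exposition, and the noise predictions are taken as precomputed U-Net outputs.

```python
# Hedged sketch of DisenBooth-style disentangled tuning objectives.
# All names and weight values here are illustrative assumptions.
import torch
import torch.nn.functional as F

def disentangled_losses(eps_pred_joint, eps_pred_text_only, eps_true,
                        f_s, f_i, lambda_weak=0.01, lambda_contrast=0.001):
    """Combine the denoising, weak denoising, and contrastive objectives.

    eps_pred_joint:     noise predicted when conditioning on both fs and fi
    eps_pred_text_only: noise predicted when conditioning on fs alone
    eps_true:           ground-truth noise from the diffusion forward process
    f_s, f_i:           textual (shared) and visual (per-image) embeddings
    """
    # L1: standard denoising loss with both embeddings
    l1 = F.mse_loss(eps_pred_joint, eps_true)
    # L2: weak denoising loss, encouraging fs alone to (weakly) denoise the
    # image so that identity-relevant information concentrates in fs
    l2 = F.mse_loss(eps_pred_text_only, eps_true)
    # L3: contrastive embedding loss, pushing fi away from fs
    # (low absolute cosine similarity) so fi holds only irrelevant information
    l3 = F.cosine_similarity(f_s.flatten(1), f_i.flatten(1)).abs().mean()
    return l1 + lambda_weak * l2 + lambda_contrast * l3
```

Only the adapter and LoRA parameters would receive gradients from this loss; the pretrained diffusion backbone stays frozen.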

Visual Comparison to Other Methods

Animation Generation Results


We also use DisenBooth to finetune on some anime characters, and DisenBooth can effectively generate animation images that preserve the subject identity and conform to the text descriptions.

Reference-Image-Guided Subject-Driven Text-to-Image Generation


With the disentangled identity-preserved and identity-irrelevant embeddings, we can jointly use the reference image and the text prompt, making the generated image conform to the text while inheriting some characteristics of the reference image. From left to right, we increase the weight of the identity-irrelevant embedding, and the generated image becomes more similar to the reference image while preserving the pose of the dog.
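The weighted combination described above can be sketched as a simple blend of the two conditioning embeddings at inference time. The function name, the `alpha` scale, and the additive blending form are assumptions for illustration; the actual conditioning mechanism depends on how the adapter injects fi into the diffusion model.

```python
# Hedged sketch: blending the identity-preserved prompt embedding with a
# weighted identity-irrelevant embedding from a reference image.
# Names and the additive form are illustrative assumptions.
import torch

def combine_conditions(f_s_prompt, f_i_reference, alpha):
    """Blend the new-prompt embedding fs with the reference image's
    identity-irrelevant embedding fi. Larger alpha pulls the generation
    toward the reference image's attributes (e.g., pose, background)."""
    return f_s_prompt + alpha * f_i_reference

# Sweeping alpha from 0 upward reproduces the left-to-right progression:
# alpha = 0 uses the text prompt alone; larger alpha inherits more of
# the reference image's identity-irrelevant characteristics.
```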