UniDiffuser

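The UniDiffuser model was proposed in One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu.

The abstract from the paper is:

This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is that learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model: it perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. In particular, UniDiffuser is able to produce perceptually realistic samples in all tasks, and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to the bespoke models (e.g., Stable Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image generation).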
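The abstract's claim that a single noise-prediction network covers every task "by setting proper timesteps" can be made concrete with a small sketch. The function name `timesteps_for_task`, the task labels, and the value `T = 1000` are illustrative assumptions and not part of the diffusers API. The mapping follows the paper's description: a modality held at timestep 0 is clean and acts as the condition, a modality held at the maximum timestep `T` is pure noise and is effectively marginalized out, and tying both timesteps together samples from the joint distribution.

```python
from typing import Tuple

T = 1000  # assumed maximum diffusion timestep (hypothetical value for illustration)


def timesteps_for_task(task: str, t: int) -> Tuple[int, int]:
    """Return the (t_image, t_text) pair given to the network at denoising step t.

    Illustrative only: the real pipeline handles this internally.
    """
    if not 0 <= t <= T:
        raise ValueError(f"t must be in [0, {T}], got {t}")
    if task == "joint":      # both modalities are denoised together
        return (t, t)
    if task == "text2img":   # text is clean (the condition), image is denoised
        return (t, 0)
    if task == "img2text":   # image is clean (the condition), text is denoised
        return (0, t)
    if task == "img":        # text stays pure noise, so it is marginalized out
        return (t, T)
    if task == "text":       # image stays pure noise, so it is marginalized out
        return (T, t)
    raise ValueError(f"unknown task: {task!r}")


for name in ("joint", "text2img", "img2text", "img", "text"):
    print(name, timesteps_for_task(name, 500))
```

The five task labels mirror the five generation modes described in the usage examples on this page.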

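You can find the original codebase at thu-ml/unidiffuser and additional checkpoints at thu-ml.

This pipeline was contributed by dg845. ❤️

Usage Examples

Because the UniDiffuser model is trained to model the joint distribution of (image, text) pairs, it can perform a range of generation tasks:

Unconditional Image and Text Generation

Unconditional generation from a [UniDiffuserPipeline] (where we start only from latents sampled from a standard Gaussian prior) produces an (image, text) pair: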

python

import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Unconditional image and text generation. The generation task is automatically inferred.
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
image = sample.images[0]
text = sample.text[0]
image.save("unidiffuser_joint_sample_image.png")
print(text)

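The UniDiffuser paper calls this "joint" generation, since we are sampling from the joint image-text distribution.

Note that the generation task is inferred from the inputs used when calling the pipeline. You can also specify the unconditional generation task ("mode") manually with [UniDiffuserPipeline.set_joint_mode]: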

python

# Equivalent to the above.
pipe.set_joint_mode()
sample = pipe(num_inference_steps=20, guidance_scale=8.0)

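When the mode is set manually, subsequent calls to the pipeline use the set mode without attempting to infer it. You can reset the mode with [UniDiffuserPipeline.reset_mode], after which the pipeline infers the mode again.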
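The interplay between a manually set mode and inferred modes can be pictured as a tiny state machine. The sketch below is a toy model, not the library's implementation: the class `ModeSelector`, its method names, and the choice to let a prompt take priority over an image are assumptions made for illustration. It only encodes the behavior described in this section: a manual mode wins until it is reset, and otherwise the mode follows from which inputs were supplied.

```python
from typing import Optional


class ModeSelector:
    """Toy model of manual versus inferred generation modes (hypothetical)."""

    def __init__(self) -> None:
        self.manual_mode: Optional[str] = None

    def set_mode(self, mode: str) -> None:
        # A manually set mode sticks for all later calls.
        self.manual_mode = mode

    def reset_mode(self) -> None:
        # After a reset, the mode is inferred from the inputs again.
        self.manual_mode = None

    def resolve(self, prompt: Optional[str] = None, image: Optional[object] = None) -> str:
        if self.manual_mode is not None:
            return self.manual_mode
        if prompt is not None:  # assumed priority: a prompt implies text-to-image
            return "text2img"
        if image is not None:   # an image alone implies image-to-text
            return "img2text"
        return "joint"          # no inputs: unconditional joint generation


selector = ModeSelector()
print(selector.resolve())                      # joint
print(selector.resolve(prompt="an elephant"))  # text2img
selector.set_mode("img")
print(selector.resolve(prompt="an elephant"))  # img
selector.reset_mode()
print(selector.resolve(image=object()))        # img2text
```

In practice this means it is worth calling `reset_mode` before switching tasks if a mode was set manually earlier in the same session.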

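You can also generate only an image or only text (the UniDiffuser paper calls this "marginal" generation, since we sample from the marginal distribution of images and text, respectively):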

python

# Unlike other generation tasks, image-only and text-only generation don't use classifier-free guidance
# Image-only generation
pipe.set_image_mode()
sample_image = pipe(num_inference_steps=20).images[0]
# Text-only generation
pipe.set_text_mode()
sample_text = pipe(num_inference_steps=20).text[0]

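Text-to-Image Generation

UniDiffuser can also sample from conditional distributions; that is, the distribution of images conditioned on a text prompt or the distribution of texts conditioned on an image. Here is an example of sampling from the conditional image distribution (text-to-image generation, or text-conditioned image generation):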

python

import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Text-to-image generation
prompt = "an elephant under the sea"

sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image

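The text2img mode requires that either an input prompt or prompt_embeds be supplied. You can set the text2img mode manually with [UniDiffuserPipeline.set_text_to_image_mode].

Image-to-Text Generation

Similarly, UniDiffuser can produce text samples given an image (image-to-text, or image-conditioned text generation):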

python

import torch

from diffusers import UniDiffuserPipeline
from diffusers.utils import load_image

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))

sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)

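The img2text mode requires that an input image be supplied. You can set the img2text mode manually with [UniDiffuserPipeline.set_image_to_text_mode].

Image Variation

The UniDiffuser authors suggest performing image variation through a "round-trip" generation method: given an input image, we first perform an image-to-text generation, and then perform a text-to-image generation on the output of the first generation. This produces a new image that is semantically similar to the input image: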

python

import torch

from diffusers import UniDiffuserPipeline
from diffusers.utils import load_image

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Image variation can be performed with an image-to-text generation followed by a text-to-image generation:
# 1. Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))

sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)

# 2. Text-to-image generation
sample = pipe(prompt=i2t_text, num_inference_steps=20, guidance_scale=8.0)
final_image = sample.images[0]
final_image.save("unidiffuser_image_variation_sample.png")

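Text Variation

Similarly, text variation can be performed on an input prompt with a text-to-image generation followed by an image-to-text generation: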

python

import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Text variation can be performed with a text-to-image generation followed by an image-to-text generation:
# 1. Text-to-image generation
prompt = "an elephant under the sea"

sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")

# 2. Image-to-text generation
sample = pipe(image=t2i_image, num_inference_steps=20, guidance_scale=8.0)
final_prompt = sample.text[0]
print(final_prompt)
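The two round trips can be wrapped in small helpers so that they are reusable. This is a minimal sketch: the function names `image_variation` and `text_variation` are hypothetical and not part of diffusers. It assumes `pipe` behaves like the pipeline used in the examples on this page, meaning it accepts `prompt=` or `image=` plus `num_inference_steps` and `guidance_scale`, returns an object with `.images` and `.text`, and has no mode set manually.

```python
from typing import Any


def image_variation(pipe: Any, init_image: Any,
                    num_inference_steps: int = 20, guidance_scale: float = 8.0) -> Any:
    """Image -> text -> image round trip; returns the new image."""
    # 1. Image-to-text: describe the input image.
    caption = pipe(image=init_image, num_inference_steps=num_inference_steps,
                   guidance_scale=guidance_scale).text[0]
    # 2. Text-to-image: generate a new image from that description.
    return pipe(prompt=caption, num_inference_steps=num_inference_steps,
                guidance_scale=guidance_scale).images[0]


def text_variation(pipe: Any, prompt: str,
                   num_inference_steps: int = 20, guidance_scale: float = 8.0) -> str:
    """Text -> image -> text round trip; returns the new prompt."""
    # 1. Text-to-image: render the prompt.
    image = pipe(prompt=prompt, num_inference_steps=num_inference_steps,
                 guidance_scale=guidance_scale).images[0]
    # 2. Image-to-text: describe the rendered image.
    return pipe(image=image, num_inference_steps=num_inference_steps,
                guidance_scale=guidance_scale).text[0]
```

With a loaded pipeline, `image_variation(pipe, init_image)` would correspond to `final_image` in the image variation example and `text_variation(pipe, prompt)` to `final_prompt` in the text variation example. If a mode was set manually earlier, calling `pipe.reset_mode()` first lets each step infer its mode from the inputs.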

UniDiffuserPipeline

[[autodoc]] UniDiffuserPipeline
	- all
	- __call__

ImageTextPipelineOutput

[[autodoc]] pipelines.ImageTextPipelineOutput
