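Effective and efficient diffusion

Getting the [DiffusionPipeline] to generate images in a certain style, or to include exactly what you want, can be tricky. Often you have to run the [DiffusionPipeline] several times before you end up with an image you're happy with. But generating an image from nothing is computationally intensive, especially when you run inference over and over again.

This is why it's important to get the most computational (speed) and memory (GPU vRAM) efficiency out of the pipeline: less time between inference cycles means you can iterate faster.

This tutorial walks you through how to generate images faster and better with the [DiffusionPipeline].

Begin by loading the stable-diffusion-v1-5/stable-diffusion-v1-5 model: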

python

from diffusers import DiffusionPipeline

model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True)

The example prompt you'll use is a portrait of an old warrior chief, but feel free to use your own prompt:

python

prompt = "portrait photo of a old warrior chief"

Speed

One of the simplest ways to speed up inference is to place the pipeline on a GPU, the same way you would with any PyTorch module:

python

pipeline = pipeline.to("cuda")

To make sure you can reproduce the same image and improve on it, use a Generator and set a seed for reproducibility:

python

import torch

generator = torch.Generator("cuda").manual_seed(0)

Now you can generate an image:

python

image = pipeline(prompt, generator=generator).images[0]
image

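This process took about 30 seconds on a T4 GPU (it may be faster if your allocated GPU is better than a T4). By default, the [DiffusionPipeline] runs inference in full float32 precision for 50 inference steps. You can speed this up by switching to a lower precision such as float16 or by running fewer inference steps.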
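
To check timings like these on your own hardware, a small wall-clock helper is enough. The following is a minimal sketch: `timed` is a hypothetical helper name, not part of Diffusers, and the `sum` call only stands in for a pipeline call so the snippet runs without a GPU.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs anywhere.
result, seconds = timed(sum, range(1_000_000))
print(f"took {seconds:.3f} s")

# With the objects from this tutorial, the call might look like this:
# output, seconds = timed(pipeline, prompt, generator=generator)
# image = output.images[0]
```

The first call on a fresh process usually includes one-time setup costs, so it can help to time a second run as well.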

Let's start by loading the model in float16 and generating an image:

python

import torch

pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True)
pipeline = pipeline.to("cuda")
generator = torch.Generator("cuda").manual_seed(0)
image = pipeline(prompt, generator=generator).images[0]
image

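This time, it took only about 11 seconds to generate the image, almost 3x faster than before!

Another option is to reduce the number of inference steps. Choosing a more efficient scheduler can help decrease the number of steps without sacrificing output quality. You can find which schedulers are compatible with the current model by reading the compatibles attribute of the pipeline's scheduler: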

python

pipeline.scheduler.compatibles
[
    diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler,
    diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler,
    diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler,
    diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler,
    diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler,
    diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler,
    diffusers.schedulers.scheduling_ddpm.DDPMScheduler,
    diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler,
    diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler,
    diffusers.utils.dummy_torch_and_torchsde_objects.DPMSolverSDEScheduler,
    diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler,
    diffusers.schedulers.scheduling_pndm.PNDMScheduler,
    diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler,
    diffusers.schedulers.scheduling_ddim.DDIMScheduler,
]

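The Stable Diffusion model uses the [PNDMScheduler] by default, which usually requires about 50 inference steps, but more efficient schedulers such as [DPMSolverMultistepScheduler] need only about 20 or 25 inference steps. Use the [~ConfigMixin.from_config] method to load a new scheduler: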

python

from diffusers import DPMSolverMultistepScheduler

pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)

Now set num_inference_steps to 20:

python

generator = torch.Generator("cuda").manual_seed(0)
image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0]
image

Great, you've managed to cut the inference time to just 4 seconds! ⚡️
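
The timings quoted so far can be put side by side. The sketch below uses only the approximate numbers reported in this tutorial (30 s, 11 s, and 4 s on a T4) and assumes, as a rough simplification, that run time scales linearly with the number of inference steps. It ignores fixed costs such as text encoding and VAE decoding, so treat the estimate as a sanity check rather than a prediction.

```python
# Approximate timings reported in this tutorial (T4 GPU).
float32_50_steps_s = 30.0
float16_50_steps_s = 11.0
float16_20_steps_s = 4.0

# Speedup from switching to half precision at the same step count.
fp16_speedup = float32_50_steps_s / float16_50_steps_s

# Assumption: time is proportional to the number of inference steps.
estimated_20_steps_s = float16_50_steps_s * 20 / 50

# Combined speedup from both changes.
total_speedup = float32_50_steps_s / float16_20_steps_s

print(f"float16 speedup: {fp16_speedup:.1f}x")
print(f"linear estimate for 20 steps: {estimated_20_steps_s:.1f} s")
print(f"total speedup: {total_speedup:.1f}x")
```

The linear estimate lands close to the measured 4 seconds, which is consistent with the denoising loop dominating the run time.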

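Memory

The other key to improving pipeline performance is consuming less memory, which indirectly means more speed, since you are usually trying to maximize the number of images generated per second. The easiest way to see how many images you can generate at once is to try different batch sizes until you get an OutOfMemoryError (OOM).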
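
The trial-and-error search described above can be automated. This is a minimal sketch under stated assumptions: `find_max_batch_size` and `fake_run` are hypothetical names, the doubling strategy is just one possible choice, and the built-in `MemoryError` stands in for the GPU error so the snippet runs without PyTorch. On CUDA you would pass the out-of-memory exception class of your PyTorch version (for example `torch.cuda.OutOfMemoryError`, assuming your version exposes it).

```python
def find_max_batch_size(run_batch, oom_error=MemoryError, start=1, limit=64):
    """Double the batch size until run_batch raises oom_error.

    Returns the largest batch size that completed, or 0 if none did.
    """
    largest_ok = 0
    batch_size = start
    while batch_size <= limit:
        try:
            run_batch(batch_size)
        except oom_error:
            break
        largest_ok = batch_size
        batch_size *= 2
    return largest_ok


def fake_run(batch_size):
    """Stand-in for a pipeline call: pretend memory runs out above 5 images."""
    if batch_size > 5:
        raise MemoryError("simulated out of memory")


print(find_max_batch_size(fake_run))
```

Because the search doubles the batch size, it only finds the largest power-of-two multiple of `start` that fits; a finer search between the last success and the first failure is left out for brevity.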

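Create a function that generates a batch of images from a list of prompts and Generators. Make sure to assign each Generator a seed so you can reuse it if it produces a good result.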

python

def get_inputs(batch_size=1):
    generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)]
    prompts = batch_size * [prompt]
    num_inference_steps = 20

    return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps}

Start with batch_size=4 and see how much memory you consume:

python

from diffusers.utils import make_image_grid

images = pipeline(**get_inputs(batch_size=4)).images
make_image_grid(images, 2, 2)

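Unless you have a GPU with more vRAM, the code above probably returned an OOM error! Most of the memory is taken up by the cross-attention layers. Instead of running this operation as one batch, you can run it sequentially to save a significant amount of memory. All you have to do is configure the pipeline with the [~DiffusionPipeline.enable_attention_slicing] method: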

python

pipeline.enable_attention_slicing()
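
To get a feel for why running attention sequentially saves memory, here is a back-of-the-envelope estimate. It is a simplified model, not how Diffusers measures or allocates memory: `attention_scores_bytes` is a hypothetical helper, and the head count and token counts are illustrative assumptions (8 heads, and 4096 query and key tokens, as for a 64×64 latent grid attending to itself).

```python
def attention_scores_bytes(batch_size, num_heads, query_tokens, key_tokens,
                           bytes_per_element=2, slice_size=None):
    """Estimate the bytes of attention score matrices held in memory at once.

    Without slicing, all batch_size * num_heads score matrices exist together.
    With slicing, at most slice_size of them are materialized at a time.
    """
    matrices = batch_size * num_heads
    if slice_size is not None:
        matrices = min(matrices, slice_size)
    return matrices * query_tokens * key_tokens * bytes_per_element


MIB = 1024 ** 2

# Illustrative assumptions: batch of 8, 8 heads, 4096 tokens, float16 (2 bytes).
full = attention_scores_bytes(8, 8, 4096, 4096)
sliced = attention_scores_bytes(8, 8, 4096, 4096, slice_size=1)

print(full // MIB, "MiB without slicing")
print(sliced // MIB, "MiB with one slice at a time")
```

Under these assumptions the score matrices shrink from roughly 2 GiB to 32 MiB at any one time, at the cost of computing the slices one after another.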

Now try increasing batch_size to 8!

python

images = pipeline(**get_inputs(batch_size=8)).images
make_image_grid(images, rows=2, cols=4)

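Whereas before you couldn't even generate a batch of 4 images, now you can generate a batch of 8 images at about 3.5 seconds per image! This is probably the fastest you can go on a T4 GPU without sacrificing quality.

Quality

In the last two sections, you learned how to optimize the speed of your pipeline by using fp16, how to reduce the number of inference steps with a more efficient scheduler, and how to enable attention slicing to reduce memory consumption. Now you'll focus on improving the quality of the generated images.

Better checkpoints

The most obvious step is to use a better checkpoint. The Stable Diffusion model is a good starting point, and several improved versions have been released since its official launch. However, a newer version does not automatically give you better results. You still have to experiment with different checkpoints yourself and do some research (such as using negative prompts) to get the best results.

As the field grows, more and more high-quality checkpoints are being fine-tuned to produce particular styles. Try exploring the Hub and the Diffusers Gallery to find one you're interested in!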
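
The negative prompts mentioned above can be passed alongside the regular prompt. The sketch below extends the idea of the `get_inputs` helper from the Memory section; `get_inputs_with_negative` and its `make_generator` parameter are hypothetical names, and the negative prompt text is only an example. It assumes the pipeline accepts a `negative_prompt` argument and that a list of prompts needs a list of negative prompts of the same length.

```python
def get_inputs_with_negative(prompt, negative_prompt, batch_size=1,
                             num_inference_steps=20, make_generator=None):
    """Build pipeline keyword arguments that include a negative prompt."""
    if make_generator is None:
        # Imported lazily so the helper can be exercised without PyTorch.
        import torch

        def make_generator(seed):
            return torch.Generator("cuda").manual_seed(seed)

    return {
        "prompt": batch_size * [prompt],
        # One negative prompt per prompt in the batch.
        "negative_prompt": batch_size * [negative_prompt],
        "generator": [make_generator(i) for i in range(batch_size)],
        "num_inference_steps": num_inference_steps,
    }


# Demonstration with integer seeds standing in for torch.Generator objects.
inputs = get_inputs_with_negative(
    "portrait photo of a old warrior chief",
    "blurry, low quality",  # hypothetical negative prompt
    batch_size=2,
    make_generator=lambda seed: seed,
)
print(inputs["negative_prompt"])
```

With a GPU available, a call such as `pipeline(**get_inputs_with_negative(prompt, "blurry, low quality", batch_size=4)).images` would use the default CUDA generators.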

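Better pipeline components

You can also try replacing the current pipeline components with newer versions. Let's load the latest autoencoder from Stability AI into the pipeline and generate some images: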

python

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda")
pipeline.vae = vae
images = pipeline(**get_inputs(batch_size=8)).images
make_image_grid(images, rows=2, cols=4)

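Better prompt engineering

The text prompt you use to generate an image is so important that writing it has its own name: prompt engineering. Some questions to keep in mind during prompt engineering are:

How is the image I want to generate, or similar images, stored on the internet?
What additional detail can I give to steer the model towards the style I want?

With this in mind, let's improve the prompt to include color and higher-quality details: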

python

prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes"
prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3  --beta --upbeta"

Generate a batch of images with the new prompt:

python

images = pipeline(**get_inputs(batch_size=8)).images
make_image_grid(images, rows=2, cols=4)

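Pretty impressive! Let's tweak the second image (the one corresponding to the Generator with a seed of 1) a bit more by adding some text about the age of the subject: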

python

prompts = [
    "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3  --beta --upbeta",
    "portrait photo of an old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3  --beta --upbeta",
    "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3  --beta --upbeta",
    "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3  --beta --upbeta",
]

generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))]
images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images
make_image_grid(images, 2, 2)
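
The four prompts above differ only in how they describe the subject's age, so the list can also be built from a template. This is a small sketch of that idea; `base_suffix` and `subjects` are names introduced here for illustration, and the resulting strings are meant to match the hand-written list.

```python
# Shared tail of every prompt, copied from the tutorial's prompt.
base_suffix = (
    ", tribal panther make up, blue on red, side profile, looking away, serious eyes"
    " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3  --beta --upbeta"
)

# Only the age description changes between prompts.
subjects = ["the oldest", "an old", "a", "a young"]

prompts = [f"portrait photo of {subject} warrior chief{base_suffix}" for subject in subjects]

for p in prompts:
    print(p)
```

Keeping the varying part in one list makes it easier to try further variations while holding the seed and the rest of the prompt fixed.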

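Next steps

In this tutorial, you learned how to optimize a [DiffusionPipeline] for computational and memory efficiency, and how to improve the quality of the generated outputs. If you're interested in making your pipeline even faster, take a look at the following resources:

Learn how PyTorch 2.0 and torch.compile can yield 5 - 300% faster inference. On an A100 GPU, inference can be up to 50% faster!
If you can't use PyTorch 2, we recommend installing xFormers. Its memory-efficient attention mechanism works well with PyTorch 1.13.1 for faster speed and reduced memory consumption.
Other optimization techniques, such as model offloading, are covered in this guide.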
