
PyTorch 2.0

🤗 Diffusers supports the latest optimizations from PyTorch 2.0, which include:

  1. A memory-efficient attention implementation, scaled dot product attention, that does not require any extra dependencies such as xFormers.
  2. torch.compile, a just-in-time (JIT) compiler that provides an extra performance boost when individual models are compiled.

Both of these optimizations require PyTorch 2.0 or later and 🤗 Diffusers > 0.13.0.

```bash
pip install --upgrade torch diffusers
```

Scaled dot product attention

torch.nn.functional.scaled_dot_product_attention (SDPA) is an optimized and memory-efficient attention mechanism (similar to xFormers) that automatically enables several other optimizations depending on the model inputs and GPU type. SDPA is enabled by default if you're using PyTorch 2.0 and the latest version of 🤗 Diffusers, so you don't need to add anything to your code.
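
If you want to quickly confirm that your environment actually exposes this API before relying on it, a minimal check (a small sketch, not part of the original docs) is:

```python
import torch
import torch.nn.functional as F

# scaled_dot_product_attention only exists in PyTorch >= 2.0
print(torch.__version__)
print(hasattr(F, "scaled_dot_product_attention"))
```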

However, if you want to explicitly enable it, you can set a [DiffusionPipeline] to use [~models.attention_processor.AttnProcessor2_0]:

```diff
  import torch
  from diffusers import DiffusionPipeline
+ from diffusers.models.attention_processor import AttnProcessor2_0

  pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+ pipe.unet.set_attn_processor(AttnProcessor2_0())

  prompt = "a photo of an astronaut riding a horse on mars"
  image = pipe(prompt).images[0]
```

SDPA should be as fast and memory-efficient as xFormers; check the benchmark for more details.
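
If you want to sanity-check this on your own hardware, a rough timing sketch like the one below (an illustrative example, not the official benchmark script) compares the vanilla [~models.attention_processor.AttnProcessor] against [~models.attention_processor.AttnProcessor2_0]:

```python
import time

import torch
from diffusers import DiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor, AttnProcessor2_0

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"

for name, processor in [("AttnProcessor", AttnProcessor()), ("AttnProcessor2_0 (SDPA)", AttnProcessor2_0())]:
    pipe.unet.set_attn_processor(processor)
    _ = pipe(prompt, num_inference_steps=5)  # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    _ = pipe(prompt, num_inference_steps=25)
    torch.cuda.synchronize()
    print(f"{name}: {time.perf_counter() - start:.2f} s")
```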

In some cases, such as making the pipeline more deterministic or converting it to other formats, it can be helpful to use the vanilla attention processor, [~models.attention_processor.AttnProcessor]. To revert to [~models.attention_processor.AttnProcessor], call the [~UNet2DConditionModel.set_default_attn_processor] function on the pipeline:

```diff
  import torch
  from diffusers import DiffusionPipeline

  pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+ pipe.unet.set_default_attn_processor()

  prompt = "a photo of an astronaut riding a horse on mars"
  image = pipe(prompt).images[0]
```

torch.compile

The torch.compile function can often provide an additional speedup to your PyTorch code. In 🤗 Diffusers, it is usually best to wrap the UNet with torch.compile because it does most of the heavy lifting in the pipeline.

```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

prompt = "a photo of an astronaut riding a horse on mars"  # example inputs so the snippet runs as-is
steps = 25
batch_size = 1
images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images[0]
```

Depending on the GPU type, torch.compile can provide an additional 5-300x speedup on top of SDPA! If you're using a more recent GPU architecture such as Ampere (A100, 3090), Ada (4090), or Hopper (H100), torch.compile is able to squeeze even more performance out of these GPUs.
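
If you are unsure which architecture you are running on, you can check the device's compute capability; a small illustrative check (not part of the original docs):

```python
import torch

# compute capability: Ampere = 8.0/8.6, Ada = 8.9, Hopper = 9.0
major, minor = torch.cuda.get_device_capability()
print(torch.cuda.get_device_name(), f"compute capability {major}.{minor}")
```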

Compilation takes some time to complete, so it is best suited for situations where you prepare the pipeline once and then repeatedly perform the same type of inference operations. For example, calling the compiled pipeline on a different image size triggers compilation again, which can be expensive.
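
In practice this means: compile once, warm the pipeline up at the resolution you intend to use, and keep that resolution fixed. A minimal sketch (illustrative, with assumed example values):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

prompt = "a photo of an astronaut riding a horse on mars"

# the first call pays the compilation cost
_ = pipe(prompt, height=512, width=512, num_inference_steps=25).images[0]

# later calls at the same 512x512 resolution reuse the compiled graph
image = pipe(prompt, height=512, width=512, num_inference_steps=25).images[0]

# switching to a different resolution (e.g. 768x768) would trigger recompilation
```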

For more information and different options for torch.compile, refer to the torch_compile tutorial.

TIP

Learn more about other ways PyTorch 2.0 can help optimize your model in the Accelerate inference of text-to-image diffusion models tutorial.

Benchmark

We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and torch.compile across different GPUs and batch sizes for five of our most-used pipelines. The code was benchmarked on 🤗 Diffusers v0.17.0.dev0 to optimize the use of torch.compile (see here for more details).

Expand the dropdowns below to find the code used to benchmark each pipeline:

Stable Diffusion text-to-image

```python
from diffusers import DiffusionPipeline
import torch

path = "stable-diffusion-v1-5/stable-diffusion-v1-5"

run_compile = True  # Set True / False

pipe = DiffusionPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True)
pipe = pipe.to("cuda")
pipe.unet.to(memory_format=torch.channels_last)

if run_compile:
    print("Run torch compile")
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

prompt = "ghibli style, a fantasy landscape with castles"

for _ in range(3):
    images = pipe(prompt=prompt).images
```

Stable Diffusion image-to-image

```python
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image
import torch

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"

init_image = load_image(url)
init_image = init_image.resize((512, 512))

path = "stable-diffusion-v1-5/stable-diffusion-v1-5"

run_compile = True  # Set True / False

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True)
pipe = pipe.to("cuda")
pipe.unet.to(memory_format=torch.channels_last)

if run_compile:
    print("Run torch compile")
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

prompt = "ghibli style, a fantasy landscape with castles"

for _ in range(3):
    image = pipe(prompt=prompt, image=init_image).images[0]
```

Stable Diffusion inpainting

```python
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image
import torch

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url).resize((512, 512))
mask_image = load_image(mask_url).resize((512, 512))

path = "runwayml/stable-diffusion-inpainting"

run_compile = True  # Set True / False

pipe = StableDiffusionInpaintPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True)
pipe = pipe.to("cuda")
pipe.unet.to(memory_format=torch.channels_last)

if run_compile:
    print("Run torch compile")
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

prompt = "ghibli style, a fantasy landscape with castles"

for _ in range(3):
    image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
```

ControlNet

```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import torch

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"

init_image = load_image(url)
init_image = init_image.resize((512, 512))

path = "stable-diffusion-v1-5/stable-diffusion-v1-5"

run_compile = True  # Set True / False
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    path, controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)

pipe = pipe.to("cuda")
pipe.unet.to(memory_format=torch.channels_last)
pipe.controlnet.to(memory_format=torch.channels_last)

if run_compile:
    print("Run torch compile")
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    pipe.controlnet = torch.compile(pipe.controlnet, mode="reduce-overhead", fullgraph=True)

prompt = "ghibli style, a fantasy landscape with castles"

for _ in range(3):
    image = pipe(prompt=prompt, image=init_image).images[0]
```

DeepFloyd IF text-to-image + upscaling

```python
from diffusers import DiffusionPipeline
import torch

run_compile = True  # Set True / False

pipe_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True)
pipe_1.to("cuda")
pipe_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True)
pipe_2.to("cuda")
pipe_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16, use_safetensors=True)
pipe_3.to("cuda")


pipe_1.unet.to(memory_format=torch.channels_last)
pipe_2.unet.to(memory_format=torch.channels_last)
pipe_3.unet.to(memory_format=torch.channels_last)

if run_compile:
    pipe_1.unet = torch.compile(pipe_1.unet, mode="reduce-overhead", fullgraph=True)
    pipe_2.unet = torch.compile(pipe_2.unet, mode="reduce-overhead", fullgraph=True)
    pipe_3.unet = torch.compile(pipe_3.unet, mode="reduce-overhead", fullgraph=True)

prompt = "the blue hulk"

# text_encoder=None above, so stand-in prompt embeddings are passed to the pipelines directly
prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)
neg_prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)

for _ in range(3):
    image_1 = pipe_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
    image_2 = pipe_2(image=image_1, prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
    image_3 = pipe_3(prompt=prompt, image=image_1, noise_level=100).images
```

The chart below highlights the relative speedups of the [StableDiffusionPipeline] across five GPU families with PyTorch 2.0 and torch.compile enabled. The benchmarks in the following charts are measured in number of iterations per second.

(figure: t2i_speedup)

To give you an even better idea of how this speedup holds up for the other pipelines, consider the following chart for an A100 with PyTorch 2.0 and torch.compile:

(figure: a100_numbers)

In the tables below, we report our findings in terms of iterations per second.

A100 (batch size: 1)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
| --- | --- | --- | --- | --- |
| SD - txt2img | 21.66 | 23.13 | 44.03 | 49.74 |
| SD - img2img | 21.81 | 22.40 | 43.92 | 46.32 |
| SD - inpaint | 22.24 | 23.23 | 43.76 | 49.25 |
| SD - controlnet | 15.02 | 15.82 | 32.13 | 36.08 |
| IF | 20.21 / 13.84 / 24.00 | 20.12 / 13.70 / 24.03 | ❌ | 97.34 / 27.23 / 111.66 |
| SDXL - txt2img | 8.64 | 9.9 | - | - |

A100 (batch size: 4)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
| --- | --- | --- | --- | --- |
| SD - txt2img | 11.6 | 13.12 | 14.62 | 17.27 |
| SD - img2img | 11.47 | 13.06 | 14.66 | 17.25 |
| SD - inpaint | 11.67 | 13.31 | 14.88 | 17.48 |
| SD - controlnet | 8.28 | 9.38 | 10.51 | 12.41 |
| IF | 25.02 | 18.04 | ❌ | 48.47 |
| SDXL - txt2img | 2.44 | 2.74 | - | - |

A100 (batch size: 16)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
| --- | --- | --- | --- | --- |
| SD - txt2img | 3.04 | 3.6 | 3.83 | 4.68 |
| SD - img2img | 2.98 | 3.58 | 3.83 | 4.67 |
| SD - inpaint | 3.04 | 3.66 | 3.9 | 4.76 |
| SD - controlnet | 2.15 | 2.58 | 2.74 | 3.35 |
| IF | 8.78 | 9.82 | ❌ | 16.77 |
| SDXL - txt2img | 0.64 | 0.72 | - | - |

V100 (batch size: 1)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
| --- | --- | --- | --- | --- |
| SD - txt2img | 18.99 | 19.14 | 20.95 | 22.17 |
| SD - img2img | 18.56 | 19.18 | 20.95 | 22.11 |
| SD - inpaint | 19.14 | 19.06 | 21.08 | 22.20 |
| SD - controlnet | 13.48 | 13.93 | 15.18 | 15.88 |
| IF | 20.01 / 9.08 / 23.34 | 19.79 / 8.98 / 24.10 | ❌ | 55.75 / 11.57 / 57.67 |

V100 (batch size: 4)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
| --- | --- | --- | --- | --- |
| SD - txt2img | 5.96 | 5.89 | 6.83 | 6.86 |
| SD - img2img | 5.90 | 5.91 | 6.81 | 6.82 |
| SD - inpaint | 5.99 | 6.03 | 6.93 | 6.95 |
| SD - controlnet | 4.26 | 4.29 | 4.92 | 4.93 |
| IF | 15.41 | 14.76 | ❌ | 22.95 |

V100 (batch size: 16)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
| --- | --- | --- | --- | --- |
| SD - txt2img | 1.66 | 1.66 | 1.92 | 1.90 |
| SD - img2img | 1.65 | 1.65 | 1.91 | 1.89 |
| SD - inpaint | 1.69 | 1.69 | 1.95 | 1.93 |
| SD - controlnet | 1.19 | 1.19 | OOM after warmup | 1.36 |
| IF | 5.43 | 5.29 | ❌ | 7.06 |

T4 (batch size: 1)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
| --- | --- | --- | --- | --- |
| SD - txt2img | 6.9 | 6.95 | 7.3 | 7.56 |
| SD - img2img | 6.84 | 6.99 | 7.04 | 7.55 |
| SD - inpaint | 6.91 | 6.7 | 7.01 | 7.37 |
| SD - controlnet | 4.89 | 4.86 | 5.35 | 5.48 |
| IF | 17.42 / 2.47 / 18.52 | 16.96 / 2.45 / 18.69 | ❌ | 24.63 / 2.47 / 23.39 |
| SDXL - txt2img | 1.15 | 1.16 | - | - |

T4 (batch size: 4)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
| --- | --- | --- | --- | --- |
| SD - txt2img | 1.79 | 1.79 | 2.03 | 1.99 |
| SD - img2img | 1.77 | 1.77 | 2.05 | 2.04 |
| SD - inpaint | 1.81 | 1.82 | 2.09 | 2.09 |
| SD - controlnet | 1.34 | 1.27 | 1.47 | 1.46 |
| IF | 5.79 | 5.61 | ❌ | 7.39 |
| SDXL - txt2img | 0.288 | 0.289 | - | - |

T4 (batch size: 16)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
| --- | --- | --- | --- | --- |
| SD - txt2img | 2.34s | 2.30s | OOM after 2nd iteration | 1.99s |
| SD - img2img | 2.35s | 2.31s | OOM after warmup | 2.00s |
| SD - inpaint | 2.30s | 2.26s | OOM after 2nd iteration | 1.95s |
| SD - controlnet | OOM after 2nd iteration | OOM after 2nd iteration | OOM after warmup | OOM after warmup |
| IF * | 1.44 | 1.44 | ❌ | 1.94 |
| SDXL - txt2img | OOM | OOM | - | - |

RTX 3090 (batch size: 1)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
| --- | --- | --- | --- | --- |
| SD - txt2img | 22.56 | 22.84 | 23.84 | 25.69 |
| SD - img2img | 22.25 | 22.61 | 24.1 | 25.83 |
| SD - inpaint | 22.22 | 22.54 | 24.26 | 26.02 |
| SD - controlnet | 16.03 | 16.33 | 17.38 | 18.56 |
| IF | 27.08 / 9.07 / 31.23 | 26.75 / 8.92 / 31.47 | ❌ | 68.08 / 11.16 / 65.29 |

RTX 3090 (batch size: 4)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
| --- | --- | --- | --- | --- |
| SD - txt2img | 6.46 | 6.35 | 7.29 | 7.3 |
| SD - img2img | 6.33 | 6.27 | 7.31 | 7.26 |
| SD - inpaint | 6.47 | 6.4 | 7.44 | 7.39 |
| SD - controlnet | 4.59 | 4.54 | 5.27 | 5.26 |
| IF | 16.81 | 16.62 | ❌ | 21.57 |

RTX 3090 (batch size: 16)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
| --- | --- | --- | --- | --- |
| SD - txt2img | 1.7 | 1.69 | 1.93 | 1.91 |
| SD - img2img | 1.68 | 1.67 | 1.93 | 1.9 |
| SD - inpaint | 1.72 | 1.71 | 1.97 | 1.94 |
| SD - controlnet | 1.23 | 1.22 | 1.4 | 1.38 |
| IF | 5.01 | 5.00 | ❌ | 6.33 |

RTX 4090 (batch size: 1)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
| --- | --- | --- | --- | --- |
| SD - txt2img | 40.5 | 41.89 | 44.65 | 49.81 |
| SD - img2img | 40.39 | 41.95 | 44.46 | 49.8 |
| SD - inpaint | 40.51 | 41.88 | 44.58 | 49.72 |
| SD - controlnet | 29.27 | 30.29 | 32.26 | 36.03 |
| IF | 69.71 / 18.78 / 85.49 | 69.13 / 18.80 / 85.56 | ❌ | 124.60 / 26.37 / 138.79 |
| SDXL - txt2img | 6.8 | 8.18 | - | - |

RTX 4090 (batch size: 4)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
| --- | --- | --- | --- | --- |
| SD - txt2img | 12.62 | 12.84 | 15.32 | 15.59 |
| SD - img2img | 12.61 | 12.79 | 15.35 | 15.66 |
| SD - inpaint | 12.65 | 12.81 | 15.3 | 15.58 |
| SD - controlnet | 9.1 | 9.25 | 11.03 | 11.22 |
| IF | 31.88 | 31.14 | ❌ | 43.92 |
| SDXL - txt2img | 2.19 | 2.35 | - | - |

RTX 4090 (batch size: 16)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
| --- | --- | --- | --- | --- |
| SD - txt2img | 3.17 | 3.2 | 3.84 | 3.85 |
| SD - img2img | 3.16 | 3.2 | 3.84 | 3.85 |
| SD - inpaint | 3.17 | 3.2 | 3.85 | 3.85 |
| SD - controlnet | 2.23 | 2.3 | 2.7 | 2.75 |
| IF | 9.26 | 9.2 | ❌ | 13.31 |
| SDXL - txt2img | 0.52 | 0.53 | - | - |

Notes

  • Refer to this PR for more details on the environment used for conducting the benchmarks.
  • For the DeepFloyd IF pipelines where batch sizes > 1, we only used a batch size of > 1 in the first IF pipeline for text-to-image generation, not for upscaling. That means the two upscaling pipelines received a batch size of 1.

Thanks to Horace He from the PyTorch team for their support in improving our support of torch.compile() in Diffusers.