Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis

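Kolors is a large-scale text-to-image generation model based on latent diffusion, developed by the Kuaishou Kolors team. Trained on billions of text-image pairs, it is reported to outperform both open-source and closed-source models in visual quality, complex semantic accuracy, and rendering of Chinese and English text. Kolors accepts prompts in both Chinese and English and performs well at understanding and generating Chinese-specific content. For more details, please refer to the technical report.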

The abstract from the technical report is:

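We present Kolors, a latent diffusion model for text-to-image synthesis, characterized by its profound understanding of both English and Chinese, as well as an impressive degree of photorealism. Three key insights contribute to the development of Kolors. Firstly, unlike the large language model T5 used in Imagen and Stable Diffusion 3, Kolors is built upon the General Language Model (GLM), which enhances its comprehension of both English and Chinese. Moreover, we employ a multimodal large language model to recaption the extensive training dataset for fine-grained text understanding. These strategies significantly improve Kolors' ability to comprehend intricate semantics, particularly those involving multiple entities, and enable its advanced text rendering capabilities. Secondly, we divide the training of Kolors into two phases: a concept learning phase that covers broad knowledge, and a quality improvement phase that uses specifically curated high-aesthetic data. Furthermore, we investigate the critical role of the noise schedule in training and introduce a novel schedule to optimize high-resolution image generation. These strategies collectively enhance the visual appeal of the generated high-resolution images. Lastly, we propose a category-balanced benchmark, KolorsPrompts, which serves as a guide for the training and evaluation of Kolors. Consequently, even when employing the commonly used U-Net backbone, Kolors demonstrates remarkable performance in human evaluations, surpassing the existing open-source models and achieving Midjourney-v6 level performance, especially in terms of visual appeal. We will release our code and weights at https://github.com/Kwai-Kolors/Kolors, and hope that it will benefit future research and applications in the visual generation community.

Usage Example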

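python

import torch
from diffusers import DPMSolverMultistepScheduler, KolorsPipeline

# Load the fp16 weights and move the pipeline to the GPU.
pipe = KolorsPipeline.from_pretrained("Kwai-Kolors/Kolors-diffusers", torch_dtype=torch.float16, variant="fp16")
pipe.to("cuda")
# Replace the default scheduler with a DPM-Solver multistep scheduler that uses Karras sigmas.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)

# Chinese prompt: "A photo of a ladybug, macro, zoom, high quality, cinematic, holding a sign that reads '可图'".
image = pipe(
    prompt='一张瓢虫的照片，微距，变焦，高质量，电影，拿着一个牌子，写着"可图"',
    negative_prompt="",
    guidance_scale=6.5,
    num_inference_steps=25,
).images[0]

image.save("kolors_sample.png")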
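The example above passes `use_karras_sigmas=True` to the scheduler. As a rough illustration of what that option changes, the sketch below computes a Karras-style noise-level schedule in plain Python. The function name `karras_sigmas` and the `sigma_min` and `sigma_max` values are assumptions chosen for illustration; they are not the values Kolors uses, since the scheduler derives its own bounds from the model's trained noise schedule.

```python
from typing import List


def karras_sigmas(num_steps: int, sigma_min: float, sigma_max: float, rho: float = 7.0) -> List[float]:
    """Return `num_steps` noise levels running from sigma_max down to sigma_min.

    The levels are spaced evenly in sigma**(1/rho) space, following the
    schedule proposed by Karras et al. (2022).
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    min_inv_rho = sigma_min ** (1.0 / rho)
    max_inv_rho = sigma_max ** (1.0 / rho)
    sigmas = []
    for i in range(num_steps):
        ramp = i / (num_steps - 1)  # goes from 0.0 to 1.0
        sigmas.append((max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho)
    return sigmas


# Illustrative bounds only (assumed, not taken from Kolors).
sigmas = karras_sigmas(num_steps=25, sigma_min=0.03, sigma_max=14.6)
print([round(s, 3) for s in sigmas[:3]], "...", [round(s, 3) for s in sigmas[-3:]])
```

With `rho = 7`, consecutive levels are far apart at high noise and close together at low noise, so more of the 25 steps are spent refining fine detail.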

IP Adapter

Kolors requires its own IP Adapter weights, and it uses Openai-CLIP-336 as the image encoder.

python

import torch
from transformers import CLIPVisionModelWithProjection

from diffusers import DPMSolverMultistepScheduler, KolorsPipeline
from diffusers.utils import load_image

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "Kwai-Kolors/Kolors-IP-Adapter-Plus",
    subfolder="image_encoder",
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    revision="refs/pr/4",
)

pipe = KolorsPipeline.from_pretrained(
    "Kwai-Kolors/Kolors-diffusers", image_encoder=image_encoder, torch_dtype=torch.float16, variant="fp16"
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)

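# The image encoder was already loaded above, so skip loading it again here.
pipe.load_ip_adapter(
    "Kwai-Kolors/Kolors-IP-Adapter-Plus",
    subfolder="",
    weight_name="ip_adapter_plus_general.safetensors",
    revision="refs/pr/4",
    image_encoder_folder=None,
)
# Keep idle submodels on the CPU to reduce GPU memory use.
pipe.enable_model_cpu_offload()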

ipa_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/kolors/cat_square.png")

image = pipe(
    prompt="best quality, high quality",
    negative_prompt="",
    guidance_scale=6.5,
    num_inference_steps=25,
    ip_adapter_image=ipa_image,
).images[0]

image.save("kolors_ipa_sample.png")

KolorsPipeline

[[autodoc]] KolorsPipeline

- all
- __call__

KolorsImg2ImgPipeline

[[autodoc]] KolorsImg2ImgPipeline

- all
- __call__
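A minimal image-to-image sketch, modeled on the text-to-image example earlier on this page. The function names `img2img_denoising_steps` and `run_kolors_img2img` are hypothetical helpers, and the `strength` default is only an illustrative choice. The step-count helper assumes Kolors follows the usual diffusers img2img convention, in which `strength` scales how many of the requested steps are actually run; check the pipeline source before relying on it.

```python
def img2img_denoising_steps(num_inference_steps: int, strength: float) -> int:
    """Estimate how many denoising steps an img2img call runs.

    Assumes the common diffusers convention: int(steps * strength), capped at steps.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


def run_kolors_img2img(prompt: str, image_url: str, strength: float = 0.5, num_inference_steps: int = 25):
    """Load KolorsImg2ImgPipeline and transform the image at `image_url` (needs a CUDA GPU)."""
    # Heavy imports stay local so the helper above works without torch installed.
    import torch
    from diffusers import KolorsImg2ImgPipeline
    from diffusers.utils import load_image

    pipe = KolorsImg2ImgPipeline.from_pretrained(
        "Kwai-Kolors/Kolors-diffusers", torch_dtype=torch.float16, variant="fp16"
    )
    pipe.to("cuda")

    init_image = load_image(image_url)
    return pipe(
        prompt=prompt,
        image=init_image,
        strength=strength,  # higher values move further away from the input image
        guidance_scale=6.5,
        num_inference_steps=num_inference_steps,
    ).images[0]


# With 25 requested steps and strength 0.5, about half of the steps are run.
print(img2img_denoising_steps(25, 0.5))
```

To try it, call `run_kolors_img2img` with a prompt of your choice and an input image, for example the `cat_square.png` URL used in the IP Adapter example, then save the returned image with `.save(...)`.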
