
Batch processing with the Batch API

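The new Batch API lets you create asynchronous batch jobs at a lower price and with higher rate limits.

Batches are completed within 24 hours, but may finish sooner depending on global usage.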

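Ideal use cases for the Batch API include:

  • Tagging, captioning, or enriching content on a marketplace or blog
  • Categorizing support tickets and suggesting answers
  • Performing sentiment analysis on large datasets of customer feedback
  • Generating summaries or translations for collections of documents or articles

and much more!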

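This cookbook walks you through how to use the Batch API with a couple of practical examples.

We will start with an example that categorizes movies using gpt-4o-mini, and then cover how to use this model's vision capabilities to caption images.

Please note that multiple models are available through the Batch API, and that you can use the same parameters in your Batch API calls as with the Chat Completions endpoint.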

Setup

python
# Make sure you have the latest version of the SDK available to use the Batch API
%pip install openai --upgrade
python
import json
from openai import OpenAI
import pandas as pd
from IPython.display import Image, display
python
# Initializing OpenAI client - see https://platform.openai.com/docs/quickstart?context=python
client = OpenAI()

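First example: Categorizing movies

In this example, we will use gpt-4o-mini to extract movie categories from a description of the movie. We will also extract a 1-sentence summary from this description.

We will use JSON mode to extract the categories as an array of strings and the 1-sentence summary in a structured format.

For each movie, we want to get a result that looks like this: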

{
    categories: ['category1', 'category2', 'category3'],
    summary: '1-sentence summary'
}
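
As a minimal sketch, the shape above can be checked with a small helper once a response comes back. The name `parse_movie_result` is hypothetical (it is not part of the cookbook or the SDK), and the only assumption is that the model returns a JSON string with `categories` and `summary` keys as shown above.

```python
import json

def parse_movie_result(raw: str) -> dict:
    """Parse a JSON-mode response and check that it has the expected shape."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    categories = data.get("categories")
    summary = data.get("summary")
    # categories must be an array of strings
    if not isinstance(categories, list) or not all(isinstance(c, str) for c in categories):
        raise ValueError("'categories' must be a list of strings")
    # summary must be a single string
    if not isinstance(summary, str):
        raise ValueError("'summary' must be a string")
    return {"categories": categories, "summary": summary}

example = '{"categories": ["crime", "drama"], "summary": "An aging crime lord hands over his empire to his hesitant son."}'
parsed = parse_movie_result(example)
print(parsed["categories"])
```

JSON mode makes the output valid JSON, but it does not by itself guarantee these exact keys, so a check like this can catch malformed results before they reach downstream code.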

Loading data

We will use the IMDB top 1000 movies dataset for this example.

python
dataset_path = "data/imdb_top_1000.csv"

df = pd.read_csv(dataset_path)
df.head()
| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28,341,469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134,966,411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534,858,444 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57,300,000 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4,360,000 |

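Processing step

Here, we will prepare our requests by first trying them out with the Chat Completions endpoint.

Once we're happy with the results, we can move on to creating the batch file.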

python
categorize_system_prompt = '''
Your goal is to extract movie categories from movie descriptions, as well as a 1-sentence summary for these movies.
You will be provided with a movie description, and you will output a json object containing the following information:

{
    categories: string[] // Array of categories based on the movie description,
    summary: string // 1-sentence summary of the movie based on the movie description
}

Categories refer to the genre or type of the movie, like "action", "romance", "comedy", etc. Keep category names simple and use only lower case letters.
Movies can have several categories, but try to keep it under 3-4. Only mention the categories that are the most obvious based on the description.
'''

def get_categories(description):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.1,
        # This is to enable JSON mode, making sure responses are valid json objects
        response_format={
            "type": "json_object"
        },
        messages=[
            {
                "role": "system",
                "content": categorize_system_prompt
            },
            {
                "role": "user",
                "content": description
            }
        ],
    )

    return response.choices[0].message.content
python
# Testing on a few examples
for _, row in df[:5].iterrows():
    description = row['Overview']
    title = row['Series_Title']
    result = get_categories(description)
    print(f"TITLE: {title}\nOVERVIEW: {description}\n\nRESULT: {result}")
    print("\n\n----------------------------\n\n")
text
    TITLE: The Shawshank Redemption
    OVERVIEW: Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.

    RESULT: {
        "categories": ["drama"],
        "summary": "Two imprisoned men develop a deep bond over the years, ultimately finding redemption through their shared acts of kindness."
    }
        
        
    ----------------------------


    TITLE: The Godfather
    OVERVIEW: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.

    RESULT: {
      "categories": ["crime", "drama"],
      "summary": "An aging crime lord hands over his empire to his hesitant son."
    }


    ----------------------------
    
    
    TITLE: The Dark Knight
    OVERVIEW: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.
    
    RESULT: {
        "categories": ["action", "thriller", "superhero"],
        "summary": "Batman faces a formidable challenge as the Joker unleashes chaos on Gotham City."
    }
    
    
    ----------------------------
    
    
    TITLE: The Godfather: Part II
    OVERVIEW: The early life and career of Vito Corleone in 1920s New York City is portrayed, while his son, Michael, expands and tightens his grip on the family crime syndicate.
    
    RESULT: {
        "categories": ["crime", "drama"],
        "summary": "The film depicts the early life of Vito Corleone and the rise of his son Michael within the family crime syndicate in 1920s New York City."
    }
    
    
    ----------------------------
    
    
    TITLE: 12 Angry Men
    OVERVIEW: A jury holdout attempts to prevent a miscarriage of justice by forcing his colleagues to reconsider the evidence.
    
    RESULT: {
        "categories": ["drama", "thriller"],
        "summary": "A jury holdout fights to ensure justice is served by challenging his fellow jurors to reevaluate the evidence."
    }
    
    
    ----------------------------

Creating the batch file

The batch file, in the jsonl format, should contain one line (json object) per request. Each request is defined as follows:

{
    "custom_id": <REQUEST_ID>,
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": <MODEL>,
        "messages": <MESSAGES>,
        // other parameters
    }
}

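Note: the request ID should be unique per batch. This is what you use to match results to the initial input file, as requests will not be returned in the same order.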
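
The request format and the uniqueness requirement can be sketched as follows. The helper names `build_request` and `assert_unique_ids` are hypothetical and exist only for illustration; the field names are the ones shown in the definition above.

```python
import json

def build_request(custom_id: str, model: str, messages: list, **params) -> dict:
    """Build one batch request; extra keyword arguments go into the body."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {"model": model, "messages": messages, **params},
    }

def assert_unique_ids(requests: list) -> None:
    """Raise if two requests in the same batch share a custom_id."""
    seen = set()
    for req in requests:
        cid = req["custom_id"]
        if cid in seen:
            raise ValueError(f"duplicate custom_id: {cid}")
        seen.add(cid)

descriptions = ["first description", "second description"]
requests = [
    build_request(
        f"task-{i}",
        "gpt-4o-mini",
        [{"role": "user", "content": text}],
        temperature=0.1,
    )
    for i, text in enumerate(descriptions)
]
assert_unique_ids(requests)

# One JSON object per line, which is the jsonl layout described above
jsonl = "\n".join(json.dumps(r) for r in requests)
print(jsonl)
```

Deriving `custom_id` from a stable row index, as done here, keeps IDs unique within the batch and makes it easy to map a result back to its input row later.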

python
# Creating an array of json tasks

tasks = []

for index, row in df.iterrows():
    
    description = row['Overview']
    
    task = {
        "custom_id": f"task-{index}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            # This is what you would have in your Chat Completions API call
            "model": "gpt-4o-mini",
            "temperature": 0.1,
            "response_format": { 
                "type": "json_object"
            },
            "messages": [
                {
                    "role": "system",
                    "content": categorize_system_prompt
                },
                {
                    "role": "user",
                    "content": description
                }
            ],
        }
    }
    
    tasks.append(task)
python
# Creating the file

file_name = "data/batch_tasks_movies.jsonl"

with open(file_name, 'w') as file:
    for obj in tasks:
        file.write(json.dumps(obj) + '\n')
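
For larger datasets, a single input file may need to be split into several batches. The sketch below is one way to do that; `chunk_tasks` is a hypothetical helper, and the default limits (50,000 requests and 200 MB per file) are assumptions that should be checked against the current Batch API documentation.

```python
import json

def chunk_tasks(tasks, max_requests=50_000, max_bytes=200 * 1024 * 1024):
    """Split tasks into lists that respect a request-count and a size limit.

    The default limits are assumptions, not values taken from this cookbook.
    """
    chunks, current, current_bytes = [], [], 0
    for task in tasks:
        # Size of the jsonl line this task would produce, including the newline
        line_bytes = len(json.dumps(task).encode("utf-8")) + 1
        too_many = len(current) >= max_requests
        too_big = current_bytes + line_bytes > max_bytes
        if current and (too_many or too_big):
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(task)
        current_bytes += line_bytes
    if current:
        chunks.append(current)
    return chunks

demo_tasks = [{"custom_id": f"task-{i}"} for i in range(5)]
demo_chunks = chunk_tasks(demo_tasks, max_requests=2)
print([len(c) for c in demo_chunks])  # → [2, 2, 1]
```

Each chunk can then be written to its own jsonl file and submitted as a separate batch job.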

Uploading the file

python
# Open the file in a context manager so the handle is closed after the upload
with open(file_name, "rb") as f:
    batch_file = client.files.create(
        file=f,
        purpose="batch"
    )
python
print(batch_file)
text
    FileObject(id='file-lx16f1KyIxQ2UHVvkG3HLfNR', bytes=1127310, created_at=1721144107, filename='batch_tasks_movies.jsonl', object='file', purpose='batch', status='processed', status_details=None)

Creating the batch job

python
batch_job = client.batches.create(
  input_file_id=batch_file.id,
  endpoint="/v1/chat/completions",
  completion_window="24h"
)

Checking batch status

Note: this can take up to 24 hours, but it will usually complete sooner.

You can keep checking until the status is 'completed'.

python
batch_job = client.batches.retrieve(batch_job.id)
print(batch_job)
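
Rather than re-running the cell by hand, the status check can be wrapped in a polling loop. This is a minimal sketch: `wait_for_batch` is a hypothetical helper, the set of terminal statuses is an assumption based on the documented batch statuses, and the demo uses a stand-in for `client.batches.retrieve` so that it runs without network access.

```python
import time
from types import SimpleNamespace

# Statuses after which the job will not change any more (assumed list)
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(retrieve, batch_id, poll_seconds=60.0, max_polls=None, sleep=time.sleep):
    """Call retrieve(batch_id) until the job reaches a terminal status."""
    polls = 0
    while True:
        job = retrieve(batch_id)
        if job.status in TERMINAL_STATUSES:
            return job
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"batch {batch_id} still {job.status} after {polls} polls")
        sleep(poll_seconds)

# Demo with a fake retrieve function that walks through a few statuses
_statuses = iter(["validating", "in_progress", "completed"])

def fake_retrieve(batch_id):
    return SimpleNamespace(id=batch_id, status=next(_statuses))

final_job = wait_for_batch(fake_retrieve, "batch_demo", poll_seconds=0, sleep=lambda s: None)
print(final_job.status)
```

With the real client, the call would look like `batch_job = wait_for_batch(client.batches.retrieve, batch_job.id)`. A job that ends in a status other than 'completed' should be inspected before trying to download results.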

Retrieving results

python
result_file_id = batch_job.output_file_id
result = client.files.content(result_file_id).content
python
result_file_name = "data/batch_job_results_movies.jsonl"

with open(result_file_name, 'wb') as file:
    file.write(result)
python
# Loading data from saved file
results = []
with open(result_file_name, 'r') as file:
    for line in file:
        # Parsing the JSON string into a dict and appending to the list of results
        json_object = json.loads(line.strip())
        results.append(json_object)

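Reading results

Reminder: the results are not in the same order as in the input file. Make sure to check the custom_id to match the results against the input requests.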

python
# Reading only the first results
for res in results[:5]:
    task_id = res['custom_id']
    # Getting index from task id
    index = task_id.split('-')[-1]
    result = res['response']['body']['choices'][0]['message']['content']
    movie = df.iloc[int(index)]
    description = movie['Overview']
    title = movie['Series_Title']
    print(f"TITLE: {title}\nOVERVIEW: {description}\n\nRESULT: {result}")
    print("\n\n----------------------------\n\n")
text
    TITLE: American Psycho
    OVERVIEW: A wealthy New York City investment banking executive, Patrick Bateman, hides his alternate psychopathic ego from his co-workers and friends as he delves deeper into his violent, hedonistic fantasies.
    
    RESULT: {
        "categories": ["thriller", "psychological", "drama"],
        "summary": "A wealthy investment banker in New York City conceals his psychopathic alter ego while indulging in violent and hedonistic fantasies."
    }
    
    
    ----------------------------
    
    
    TITLE: Lethal Weapon
    OVERVIEW: Two newly paired cops who are complete opposites must put aside their differences in order to catch a gang of drug smugglers.
    
    RESULT: {
        "categories": ["action", "comedy", "crime"],
        "summary": "An action-packed comedy about two mismatched cops teaming up to take down a drug smuggling gang."
    }
    
    
    ----------------------------
    
    
    TITLE: A Star Is Born
    OVERVIEW: A musician helps a young singer find fame as age and alcoholism send his own career into a downward spiral.
    
    RESULT: {
        "categories": ["drama", "music"],
        "summary": "A musician's career spirals downward as he helps a young singer find fame amidst struggles with age and alcoholism."
    }
    
    
    ----------------------------
    
    
    TITLE: From Here to Eternity
    OVERVIEW: In Hawaii in 1941, a private is cruelly punished for not boxing on his unit's team, while his captain's wife and second-in-command are falling in love.
    
    RESULT: {
        "categories": ["drama", "romance", "war"],
        "summary": "A drama set in Hawaii in 1941, where a private faces punishment for not boxing on his unit's team, amidst a forbidden love affair between his captain's wife and second-in-command."
    }
    
    
    ----------------------------
    
    
    TITLE: The Jungle Book
    OVERVIEW: Bagheera the Panther and Baloo the Bear have a difficult time trying to convince a boy to leave the jungle for human civilization.
    
    RESULT: {
        "categories": ["adventure", "animation", "family"],
        "summary": "An adventure-filled animated movie about a panther and a bear trying to persuade a boy to leave the jungle for human civilization."
    }
    
    
    ----------------------------

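Second example: Captioning images

In this example, we will use gpt-4o-mini to caption images of furniture items.

We will use the vision capabilities of the model to analyze the images and generate the captions.

Loading data

We will use the Amazon furniture dataset for this example.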

python
dataset_path = "data/amazon_furniture_dataset.csv"
df = pd.read_csv(dataset_path)
df.head()
| | asin | url | title | brand | price | availability | categories | primary_image | images | upc | ... | color | material | style | important_information | product_overview | about_item | description | specifications | uniq_id | scraped_at |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | B0CJHKVG6P | https://www.amazon.com/dp/B0CJHKVG6P | GOYMFK 1pc Free Standing Shoe Rack, Multi-laye... | GOYMFK | $24.99 | Only 13 left in stock - order soon. | ['Home & Kitchen', 'Storage & Organization', '... | https://m.media-amazon.com/images/I/416WaLx10j... | ['https://m.media-amazon.com/images/I/416WaLx1... | NaN | ... | White | Metal | Modern | [] | [{'Brand': ' GOYMFK '}, {'Color': ' White '}, ... | ['Multiple layers: Provides ample storage spac... | multiple shoes, coats, hats, and other items E... | ['Brand: GOYMFK', 'Color: White', 'Material: M... | 02593e81-5c09-5069-8516-b0b29f439ded | 2024-02-02 15:15:08 |
| 1 | B0B66QHB23 | https://www.amazon.com/dp/B0B66QHB23 | subrtex Leather ding Room, Dining Chairs Set o... | subrtex | NaN | NaN | ['Home & Kitchen', 'Furniture', 'Dining Room F... | https://m.media-amazon.com/images/I/31SejUEWY7... | ['https://m.media-amazon.com/images/I/31SejUEW... | NaN | ... | Black | Sponge | Black Rubber Wood | [] | NaN | ['【Easy Assembly】: Set of 2 dining room chairs... | subrtex Dining chairs Set of 2 | ['Brand: subrtex', 'Color: Black', 'Product Di... | 5938d217-b8c5-5d3e-b1cf-e28e340f292e | 2024-02-02 15:15:09 |
| 2 | B0BXRTWLYK | https://www.amazon.com/dp/B0BXRTWLYK | Plant Repotting Mat MUYETOL Waterproof Transpl... | MUYETOL | $5.98 | In Stock | ['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo... | https://m.media-amazon.com/images/I/41RgefVq70... | ['https://m.media-amazon.com/images/I/41RgefVq... | NaN | ... | Green | Polyethylene | Modern | [] | [{'Brand': ' MUYETOL '}, {'Size': ' 26.8*26.8 ... | ['PLANT REPOTTING MAT SIZE: 26.8" x 26.8", squ... | NaN | ['Brand: MUYETOL', 'Size: 26.8*26.8', 'Item We... | b2ede786-3f51-5a45-9a5b-bcf856958cd8 | 2024-02-02 15:15:09 |
| 3 | B0C1MRB2M8 | https://www.amazon.com/dp/B0C1MRB2M8 | Pickleball Doormat, Welcome Doormat Absorbent ... | VEWETOL | $13.99 | Only 10 left in stock - order soon. | ['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo... | https://m.media-amazon.com/images/I/61vz1Igler... | ['https://m.media-amazon.com/images/I/61vz1Igl... | NaN | ... | A5589 | Rubber | Modern | [] | [{'Brand': ' VEWETOL '}, {'Size': ' 16*24INCH ... | ['Specifications: 16x24 Inch ', " High-Quality... | The decorative doormat features a subtle textu... | ['Brand: VEWETOL', 'Size: 16*24INCH', 'Materia... | 8fd9377b-cfa6-5f10-835c-6b8eca2816b5 | 2024-02-02 15:15:10 |
| 4 | B0CG1N9QRC | https://www.amazon.com/dp/B0CG1N9QRC | JOIN IRON Foldable TV Trays for Eating Set of ... | JOIN IRON Store | $89.99 | Usually ships within 5 to 6 weeks | ['Home & Kitchen', 'Furniture', 'Game & Recrea... | https://m.media-amazon.com/images/I/41p4d4VJnN... | ['https://m.media-amazon.com/images/I/41p4d4VJ... | NaN | ... | Grey Set of 4 | Iron | X Classic Style | [] | NaN | ['Includes 4 Folding Tv Tray Tables And one Co... | Set of Four Folding Trays With Matching Storag... | ['Brand: JOIN IRON', 'Shape: Rectangular', 'In... | bdc9aa30-9439-50dc-8e89-213ea211d66a | 2024-02-02 15:15:11 |

5 rows × 25 columns

Processing step

Again, we will first prepare our requests with the Chat Completions endpoint, and create the batch file afterwards.

python
caption_system_prompt = '''
Your goal is to generate short, descriptive captions for images of items.
You will be provided with an item image and the name of that item and you will output a caption that captures the most important information about the item.
If there are multiple items depicted, refer to the name provided to understand which item you should describe.
Your generated caption should be short (1 sentence), and include only the most important information about the item.
The most important information could be: the type of item, the style (if mentioned), the material or color if especially relevant and/or any distinctive features.
Keep it short and to the point.
'''

def get_caption(img_url, title):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.2,
        max_tokens=300,
        messages=[
            {
                "role": "system",
                "content": caption_system_prompt
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": title
                    },
                    # The content type should be "image_url" to use gpt-4o-mini's vision capabilities
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": img_url
                        }
                    },
                ],
            }
        ]
    )

    return response.choices[0].message.content
python
# Testing on a few images
for _, row in df[:5].iterrows():
    img_url = row['primary_image']
    caption = get_caption(img_url, row['title'])
    img = Image(url=img_url)
    display(img)
    print(f"CAPTION: {caption}\n\n")
text
    CAPTION: A stylish white free-standing shoe rack featuring multiple layers and eight double hooks, perfect for organizing shoes and accessories in living rooms, bathrooms, or hallways.
text

    CAPTION: Set of 2 black leather dining chairs featuring a sleek design with vertical stitching and sturdy wooden legs.
text
CAPTION: The MUYETOL Plant Repotting Mat is a waterproof, portable, and foldable gardening work mat measuring 26.8" x 26.8", designed for easy soil changing and indoor transplanting.
text
    CAPTION: Absorbent non-slip doormat featuring the phrase "It's a good day to play PICKLEBALL" with paddle graphics, measuring 16x24 inches.
text
    CAPTION: Set of 4 foldable TV trays in grey, featuring a compact design with a stand for easy storage, perfect for small spaces.

Creating the batch job

As with the first example, we will create an array of json tasks to generate a jsonl file and use it to create the batch job.

python
# Creating an array of json tasks

tasks = []

for index, row in df.iterrows():
    
    title = row['title']
    img_url = row['primary_image']
    
    task = {
        "custom_id": f"task-{index}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            # This is what you would have in your Chat Completions API call
            "model": "gpt-4o-mini",
            "temperature": 0.2,
            "max_tokens": 300,
            "messages": [
                {
                    "role": "system",
                    "content": caption_system_prompt
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": title
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": img_url
                            }
                        },
                    ],
                }
            ]            
        }
    }
    
    tasks.append(task)
python
# Creating the file

file_name = "data/batch_tasks_furniture.jsonl"

with open(file_name, 'w') as file:
    for obj in tasks:
        file.write(json.dumps(obj) + '\n')
python
# Uploading the file 

# Open the file in a context manager so the handle is closed after the upload
with open(file_name, "rb") as f:
    batch_file = client.files.create(
        file=f,
        purpose="batch"
    )
python
# Creating the job

batch_job = client.batches.create(
  input_file_id=batch_file.id,
  endpoint="/v1/chat/completions",
  completion_window="24h"
)
python
batch_job = client.batches.retrieve(batch_job.id)
print(batch_job)

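Getting results

As with the first example, we can retrieve results once the batch job is done.

Reminder: the results are not in the same order as in the input file. Make sure to check the custom_id to match the results against the input requests.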
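
Individual requests in a batch can fail (for example, if an image URL cannot be fetched), so it can help to separate failed lines before reading the captions. The sketch below is hedged: `split_results` is a hypothetical helper, and it assumes each result line carries an `error` field and a `response.status_code` field, which should be verified against the current Batch API output format.

```python
def split_results(results):
    """Separate result lines into (succeeded, failed) lists.

    Assumes each line has an 'error' field (None on success) and a
    'response' object with a 'status_code'.
    """
    succeeded, failed = [], []
    for res in results:
        response = res.get("response") or {}
        if res.get("error") is None and response.get("status_code") == 200:
            succeeded.append(res)
        else:
            failed.append(res)
    return succeeded, failed

# Illustrative data only; real lines come from the downloaded results file
sample_results = [
    {"custom_id": "task-0", "response": {"status_code": 200, "body": {}}, "error": None},
    {"custom_id": "task-1", "response": {"status_code": 400, "body": {}}, "error": None},
    {"custom_id": "task-2", "response": None, "error": {"code": "example_error"}},
]
ok, bad = split_results(sample_results)
print([r["custom_id"] for r in ok])   # → ['task-0']
print([r["custom_id"] for r in bad])  # → ['task-1', 'task-2']
```

The `custom_id` values of the failed lines identify which input rows may need to be resubmitted in a follow-up batch.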

python
# Retrieving result file

result_file_id = batch_job.output_file_id
result = client.files.content(result_file_id).content
python
result_file_name = "data/batch_job_results_furniture.jsonl"

with open(result_file_name, 'wb') as file:
    file.write(result)
python
# Loading data from saved file

results = []
with open(result_file_name, 'r') as file:
    for line in file:
        # Parsing the JSON string into a dict and appending to the list of results
        json_object = json.loads(line.strip())
        results.append(json_object)
python
# Reading only the first results
for res in results[:5]:
    task_id = res['custom_id']
    # Getting index from task id
    index = task_id.split('-')[-1]
    result = res['response']['body']['choices'][0]['message']['content']
    item = df.iloc[int(index)]
    img_url = item['primary_image']
    img = Image(url=img_url)
    display(img)
    print(f"CAPTION: {result}\n\n")
text
    CAPTION: Brushed brass pedestal towel rack with a sleek, modern design, featuring multiple bars for hanging towels, measuring 25.75 x 14.44 x 32 inches.
text
    CAPTION: Black round end table featuring a tempered glass top and a metal frame, with a lower shelf for additional storage.
text
    CAPTION: Black collapsible and height-adjustable telescoping stool, portable and designed for makeup artists and hairstylists, shown in various stages of folding for easy transport.
text
    CAPTION: Ergonomic pink gaming chair featuring breathable fabric, adjustable height, lumbar support, a footrest, and a swivel recliner function.
text
    CAPTION: A set of two Glitzhome adjustable bar stools featuring a mid-century modern design with swivel seats, PU leather upholstery, and wooden backrests.

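Wrapping up

In this cookbook, we have seen two examples of how to use the new Batch API, but keep in mind that the Batch API works the same way as the Chat Completions endpoint, supporting the same parameters and most of the recent models (gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo...).

By using this API, you can significantly reduce costs, so we recommend switching every workload that can run asynchronously to a batch job with this new API.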