Create eval run
Create a new evaluation run. This is the endpoint that will kick off grading.
Path parameters
eval_id
string
Required
The ID of the evaluation to create a run for.
Request body
name
string
The name of the run.
metadata
object or null
Set of up to 16 key-value pairs that can be attached to an object. This is useful for storing additional information about the object in a structured format and for querying objects via the API or the dashboard. Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
data_source
object
Details about the run's data source.
JsonlRunDataSource
object
Required
A JsonlRunDataSource object that specifies a JSONL file that matches the eval.
type
string
Required
Defaults: jsonl
The type of data source. Always jsonl.
source
object
EvalJsonlFileContentSource
object
Required
type
string
Required
Defaults: file_content
The type of jsonl source. Always file_content.
content
array
Required
The content of the jsonl file.
items
object
item
object
Required
sample
object
EvalJsonlFileIdSource
object
Required
type
string
Required
Defaults: file_id
The type of jsonl source. Always file_id.
id
string
Required
The identifier of the file.
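As a rough sketch of the two JSONL source shapes described above, the Python below assembles a jsonl data source either from inline rows (file_content) or from an uploaded file (file_id). The helper names jsonl_from_rows and jsonl_from_file_id and the file ID "file-abc123" are hypothetical and used only for illustration; the field names are taken from the parameter list above.

```python
import json


def jsonl_from_rows(rows):
    """Build a jsonl data source whose rows are supplied inline.

    Each row is wrapped as {"item": ...}; the optional "sample" key is omitted.
    """
    return {
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": [{"item": row} for row in rows],
        },
    }


def jsonl_from_file_id(file_id):
    """Build a jsonl data source that points at an already uploaded file."""
    return {"type": "jsonl", "source": {"type": "file_id", "id": file_id}}


inline = jsonl_from_rows(
    [{"input": "Tech Company Launches Advanced Artificial Intelligence Platform",
      "ground_truth": "Technology"}]
)
by_id = jsonl_from_file_id("file-abc123")  # hypothetical file ID

print(json.dumps(inline, indent=2))
print(json.dumps(by_id, indent=2))
```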
CompletionsRunDataSource
object
Required
A CompletionsRunDataSource object describing a model sampling configuration.
type
string
Required
Defaults: completions
The type of run data source. Always completions.
input_messages
object
TemplateInputMessages
object
type
string
Required
The type of input messages. Always template.
template
array
Required
A list of chat messages forming the prompt or context. May include variable references to the "item" namespace, for example {{item.name}}.
Input message
object
A message input to the model with a role indicating instruction-following hierarchy. Instructions given with the developer or system role take precedence over instructions given with the user role. Messages with the assistant role are presumed to have been generated by the model in previous interactions.
role
string
Required
The role of the message input. One of user, assistant, system, or developer.
content
string or array
Text, image, or audio input to the model, used to generate a response. Can also contain previous assistant responses.
Text input
string
Required
A text input to the model.
Input item content list
array
Required
A list of one or more input items to the model, containing different content types.
Input text
object
A text input to the model.
type
string
Required
Defaults: input_text
The type of the input item. Always input_text.
text
string
Required
The text input to the model.
Input image
object
An image input to the model.
type
string
Required
Defaults: input_image
The type of the input item. Always input_image.
image_url
string or null
The URL of the image to be sent to the model. A fully qualified URL or base64-encoded image in a data URL.
file_id
string or null
The ID of the file to be sent to the model.
detail
string
Required
The detail level of the image to be sent to the model. One of high, low, or auto. Defaults to auto.
Input file
object
A file input to the model.
type
string
Required
Defaults: input_file
The type of the input item. Always input_file.
file_id
string or null
The ID of the file to be sent to the model.
filename
string
The name of the file to be sent to the model.
file_data
string
The content of the file to be sent to the model.
type
string
The type of the message input. Always message.
Eval message object
object
A message input to the model with a role indicating instruction-following hierarchy. Instructions given with the developer or system role take precedence over instructions given with the user role. Messages with the assistant role are presumed to have been generated by the model in previous interactions.
role
string
Required
The role of the message input. One of user, assistant, system, or developer.
content
string or object
Text inputs to the model; can contain template strings.
Text input
string
Required
A text input to the model.
Input text
object
Required
A text input to the model.
type
string
Required
Defaults: input_text
The type of the input item. Always input_text.
text
string
Required
The text input to the model.
Output text
object
Required
A text output from the model.
type
string
Required
The type of the output text. Always output_text.
text
string
Required
The text output from the model.
type
string
The type of the message input. Always message.
ItemReferenceInputMessages
object
type
string
Required
The type of input messages. Always item_reference.
item_reference
string
Required
A reference to a variable in the "item" namespace, for example "item.name".
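To make the two input_messages variants above more concrete, here is a small Python sketch that builds a template payload and an item_reference payload. The helper names and the preview_template function are hypothetical. In particular, preview_template is only a local approximation of how a {{item.name}} placeholder might be filled in; it is an assumption for illustration and not the service's actual templating logic.

```python
import re


def template_messages(developer_prompt, user_template):
    """Build input_messages of type "template" with two chat messages."""
    return {
        "type": "template",
        "template": [
            {"role": "developer", "content": developer_prompt},
            {"role": "user", "content": user_template},
        ],
    }


def item_reference_messages(reference):
    """Build input_messages of type "item_reference", e.g. "item.name"."""
    return {"type": "item_reference", "item_reference": reference}


def preview_template(text, item):
    """Locally substitute {{item.<key>}} placeholders (illustrative assumption)."""
    return re.sub(
        r"\{\{\s*item\.(\w+)\s*\}\}",
        lambda match: str(item[match.group(1)]),
        text,
    )


messages = template_messages("Categorize the news headline.", "{{item.input}}")
reference = item_reference_messages("item.name")

row = {"input": "Tech Company Launches Advanced Artificial Intelligence Platform"}
print(preview_template(messages["template"][1]["content"], row))
```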
sampling_params
object
temperature
number
Defaults: 1
A higher temperature increases randomness in the outputs.
max_completion_tokens
integer
The maximum number of tokens in the generated output.
top_p
number
Defaults: 1
An alternative to temperature for nucleus sampling; 1.0 includes all tokens.
seed
integer
Defaults: 42
A seed value to initialize the randomness during sampling.
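The sampling parameters above can be assembled as in the following sketch, which starts from the documented defaults (temperature 1, top_p 1, seed 42) and rejects keys that are not in the parameter list. The helper name sampling_params and the choice to send the defaults explicitly are assumptions made for illustration.

```python
# Defaults as documented in the parameter list above.
DOCUMENTED_DEFAULTS = {"temperature": 1, "top_p": 1, "seed": 42}

# Keys listed under sampling_params in the parameter list above.
ALLOWED_KEYS = {"temperature", "max_completion_tokens", "top_p", "seed"}


def sampling_params(**overrides):
    """Merge caller overrides onto the documented defaults.

    Raises ValueError for keys that the parameter list does not mention,
    which catches simple misspellings before a request is sent.
    """
    unknown = set(overrides) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown sampling parameter(s): {sorted(unknown)}")
    return {**DOCUMENTED_DEFAULTS, **overrides}


params = sampling_params(max_completion_tokens=2048, temperature=0.2)
print(params)
```

Note that the example request on this page spells the token limit max_completions_tokens, while the parameter list documents max_completion_tokens; this sketch follows the parameter list.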
model
string
The name of the model to use for generating completions (e.g. "o3-mini").
source
object
EvalJsonlFileContentSource
object
Required
type
string
Required
Defaults: file_content
The type of jsonl source. Always file_content.
content
array
Required
The content of the jsonl file.
items
object
item
object
Required
sample
object
EvalJsonlFileIdSource
object
Required
type
string
Required
Defaults: file_id
The type of jsonl source. Always file_id.
id
string
Required
The identifier of the file.
StoredCompletionsRunDataSource
object
Required
A StoredCompletionsRunDataSource configuration describing a set of filters.
type
string
Required
Defaults: stored_completions
The type of source. Always stored_completions.
metadata
object or null
Set of up to 16 key-value pairs that can be attached to an object. This is useful for storing additional information about the object in a structured format and for querying objects via the API or the dashboard. Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
model
string or null
An optional model to filter by (e.g., 'gpt-4o').
created_after
integer or null
An optional Unix timestamp to filter items created after this time.
created_before
integer or null
An optional Unix timestamp to filter items created before this time.
limit
integer or null
An optional maximum number of items to return.
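The following sketch shows one way to build a stored_completions filter in Python, including a client-side check of the metadata limits stated above (16 pairs, 64-character keys, 512-character values) and a Unix timestamp for created_after. The helper names, the metadata key "team", and the limit of 100 are hypothetical values chosen for illustration.

```python
from datetime import datetime, timezone

MAX_PAIRS, MAX_KEY_LEN, MAX_VALUE_LEN = 16, 64, 512


def validate_metadata(metadata):
    """Check metadata against the limits stated in the parameter list."""
    if len(metadata) > MAX_PAIRS:
        raise ValueError(f"metadata has more than {MAX_PAIRS} pairs")
    for key, value in metadata.items():
        if not isinstance(key, str) or len(key) > MAX_KEY_LEN:
            raise ValueError(f"invalid metadata key: {key!r}")
        if not isinstance(value, str) or len(value) > MAX_VALUE_LEN:
            raise ValueError(f"invalid metadata value for key {key!r}")
    return metadata


def stored_completions_source(model=None, metadata=None,
                              created_after=None, created_before=None,
                              limit=None):
    """Build a stored_completions data source; unset filters stay null."""
    return {
        "type": "stored_completions",
        "model": model,
        "metadata": validate_metadata(metadata) if metadata is not None else None,
        "created_after": created_after,
        "created_before": created_before,
        "limit": limit,
    }


# Unix timestamp (seconds) for 2025-01-01 00:00:00 UTC.
cutoff = int(datetime(2025, 1, 1, tzinfo=timezone.utc).timestamp())
source = stored_completions_source(
    model="gpt-4o", metadata={"team": "search"}, created_after=cutoff, limit=100
)
print(source)
```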
ResponsesRunDataSource
object
Required
A ResponsesRunDataSource object describing a model sampling configuration.
type
string
Required
Defaults: completions
The type of run data source. Always completions.
input_messages
object
input_messages
object
type
string
Required
The type of input messages. Always template.
template
array
Required
A list of chat messages forming the prompt or context. May include variable references to the "item" namespace, for example {{item.name}}.
ChatMessage
object
role
string
Required
The role of the message (e.g. "system", "assistant", "user").
content
string
Required
The content of the message.
Eval message object
object
A message input to the model with a role indicating instruction-following hierarchy. Instructions given with the developer or system role take precedence over instructions given with the user role. Messages with the assistant role are presumed to have been generated by the model in previous interactions.
role
string
Required
The role of the message input. One of user, assistant, system, or developer.
content
string or object
Text inputs to the model; can contain template strings.
Text input
string
Required
A text input to the model.
Input text
object
Required
A text input to the model.
type
string
Required
Defaults: input_text
The type of the input item. Always input_text.
text
string
Required
The text input to the model.
Output text
object
Required
A text output from the model.
type
string
Required
The type of the output text. Always output_text.
text
string
Required
The text output from the model.
type
string
The type of the message input. Always message.
input_messages
object
type
string
Required
The type of input messages. Always item_reference.
item_reference
string
Required
A reference to a variable in the "item" namespace, for example "item.name".
sampling_params
object
temperature
number
Defaults: 1
A higher temperature increases randomness in the outputs.
max_completion_tokens
integer
The maximum number of tokens in the generated output.
top_p
number
Defaults: 1
An alternative to temperature for nucleus sampling; 1.0 includes all tokens.
seed
integer
Defaults: 42
A seed value to initialize the randomness during sampling.
model
string
The name of the model to use for generating completions (e.g. "o3-mini").
source
object
EvalJsonlFileContentSource
object
Required
type
string
Required
Defaults: file_content
The type of jsonl source. Always file_content.
content
array
Required
The content of the jsonl file.
items
object
item
object
Required
sample
object
EvalJsonlFileIdSource
object
Required
type
string
Required
Defaults: file_id
The type of jsonl source. Always file_id.
id
string
Required
The identifier of the file.
EvalResponsesSource
object
Required
An EvalResponsesSource object describing a run data source configuration.
type
string
Required
The type of run data source. Always responses.
metadata
object or null
Metadata filter for the responses. This is a query parameter used to select responses.
model
string or null
The name of the model to find responses for. This is a query parameter used to select responses.
instructions_search
string or null
Optional search string for instructions. This is a query parameter used to select responses.
created_after
integer or null
Only include items created after this timestamp (inclusive). This is a query parameter used to select responses.
created_before
integer or null
Only include items created before this timestamp (inclusive). This is a query parameter used to select responses.
has_tool_calls
boolean or null
Whether the response has tool calls. This is a query parameter used to select responses.
reasoning_effort
string or null
Defaults: medium
o-series models only
Constrains effort on reasoning for reasoning models. Currently supported values are low, medium, and high. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.
temperature
number or null
Sampling temperature. This is a query parameter used to select responses.
top_p
number or null
Nucleus sampling parameter. This is a query parameter used to select responses.
users
array or null
List of user identifiers. This is a query parameter used to select responses.
items
string
allow_parallel_tool_calls
boolean or null
Whether to allow parallel tool calls. This is a query parameter used to select responses.
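As an illustration of the EvalResponsesSource filters above, the sketch below builds a responses source from keyword arguments, checks reasoning_effort against the three documented values, and leaves out filters that were not set. The helper name responses_source is hypothetical, and dropping unset filters instead of sending null is an assumption about what the endpoint accepts.

```python
# Values listed for reasoning_effort in the parameter list above.
REASONING_EFFORTS = {"low", "medium", "high"}


def responses_source(**filters):
    """Build a source of type "responses" from optional query filters."""
    effort = filters.get("reasoning_effort")
    if effort is not None and effort not in REASONING_EFFORTS:
        raise ValueError(f"unsupported reasoning_effort: {effort!r}")
    source = {"type": "responses"}
    # Assumption: a filter left as None can simply be omitted from the payload.
    source.update({key: value for key, value in filters.items() if value is not None})
    return source


selected = responses_source(
    model="gpt-4o-mini", has_tool_calls=False, reasoning_effort="low", users=None
)
print(selected)
```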
Response
The EvalRun object matching the specified ID.
curl https://api.openai.com/v1/evals/eval_67e579652b548190aaa83ada4b125f47/runs \
  -X POST \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "gpt-4o-mini",
    "data_source": {
      "type": "completions",
      "input_messages": {
        "type": "template",
        "template": [
          {"role": "developer", "content": "Categorize a given news headline into one of the following topics: Technology, Markets, World, Business, or Sports.\n\n# Steps\n\n1. Analyze the content of the news headline to understand its primary focus.\n2. Extract the subject matter, identifying any key indicators or keywords.\n3. Use the identified indicators to determine the most suitable category out of the five options: Technology, Markets, World, Business, or Sports.\n4. Ensure only one category is selected per headline.\n\n# Output Format\n\nRespond with the chosen category as a single word. For instance: \"Technology\", \"Markets\", \"World\", \"Business\", or \"Sports\".\n\n# Examples\n\n**Input**: \"Apple Unveils New iPhone Model, Featuring Advanced AI Features\" \n**Output**: \"Technology\"\n\n**Input**: \"Global Stocks Mixed as Investors Await Central Bank Decisions\" \n**Output**: \"Markets\"\n\n**Input**: \"War in Ukraine: Latest Updates on Negotiation Status\" \n**Output**: \"World\"\n\n**Input**: \"Microsoft in Talks to Acquire Gaming Company for $2 Billion\" \n**Output**: \"Business\"\n\n**Input**: \"Manchester United Secures Win in Premier League Football Match\" \n**Output**: \"Sports\" \n\n# Notes\n\n- If the headline appears to fit into more than one category, choose the most dominant theme.\n- Keywords or phrases such as \"stocks\", \"company acquisition\", \"match\", or technological brands can be good indicators for classification.\n"},
          {"role": "user", "content": "{{item.input}}"}
        ]
      },
      "sampling_params": {"temperature": 1, "max_completions_tokens": 2048, "top_p": 1, "seed": 42},
      "model": "gpt-4o-mini",
      "source": {
        "type": "file_content",
        "content": [
          {"item": {"input": "Tech Company Launches Advanced Artificial Intelligence Platform", "ground_truth": "Technology"}}
        ]
      }
    }
  }'
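For readers who prefer Python to curl, the request above can be sketched with the standard library alone. The function name build_create_run_request is hypothetical, and the developer prompt is shortened here for brevity; the URL, headers, and body structure mirror the curl example. The request is only constructed, not sent, so the sketch runs without network access or an API key.

```python
import json
import os
import urllib.request

API_BASE = "https://api.openai.com/v1"


def build_create_run_request(eval_id, body, api_key):
    """Return an unsent POST request for /evals/{eval_id}/runs."""
    return urllib.request.Request(
        url=f"{API_BASE}/evals/{eval_id}/runs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


body = {
    "name": "gpt-4o-mini",
    "data_source": {
        "type": "completions",
        "input_messages": {
            "type": "template",
            "template": [
                # Shortened stand-in for the full developer prompt in the curl example.
                {"role": "developer", "content": "Categorize a given news headline."},
                {"role": "user", "content": "{{item.input}}"},
            ],
        },
        "sampling_params": {"temperature": 1, "top_p": 1, "seed": 42},
        "model": "gpt-4o-mini",
        "source": {
            "type": "file_content",
            "content": [
                {"item": {
                    "input": "Tech Company Launches Advanced Artificial Intelligence Platform",
                    "ground_truth": "Technology",
                }}
            ],
        },
    },
}

request = build_create_run_request(
    "eval_67e579652b548190aaa83ada4b125f47",
    body,
    os.environ.get("OPENAI_API_KEY", ""),
)
print(request.get_method(), request.full_url)

# To actually send the request (requires a valid key and network access):
# with urllib.request.urlopen(request) as response:
#     run = json.load(response)
```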
{
  "object": "eval.run",
  "id": "evalrun_67e57965b480819094274e3a32235e4c",
  "eval_id": "eval_67e579652b548190aaa83ada4b125f47",
  "report_url": "https://platform.openai.com/evaluations/eval_67e579652b548190aaa83ada4b125f47&run_id=evalrun_67e57965b480819094274e3a32235e4c",
  "status": "queued",
  "model": "gpt-4o-mini",
  "name": "gpt-4o-mini",
  "created_at": 1743092069,
  "result_counts": {
    "total": 0,
    "errored": 0,
    "failed": 0,
    "passed": 0
  },
  "per_model_usage": null,
  "per_testing_criteria_results": null,
  "data_source": {
    "type": "completions",
    "source": {
      "type": "file_content",
      "content": [
        {
          "item": {
            "input": "Tech Company Launches Advanced Artificial Intelligence Platform",
            "ground_truth": "Technology"
          }
        }
      ]
    },
    "input_messages": {
      "type": "template",
      "template": [
        {
          "type": "message",
          "role": "developer",
          "content": {
            "type": "input_text",
            "text": "Categorize a given news headline into one of the following topics: Technology, Markets, World, Business, or Sports.\n\n# Steps\n\n1. Analyze the content of the news headline to understand its primary focus.\n2. Extract the subject matter, identifying any key indicators or keywords.\n3. Use the identified indicators to determine the most suitable category out of the five options: Technology, Markets, World, Business, or Sports.\n4. Ensure only one category is selected per headline.\n\n# Output Format\n\nRespond with the chosen category as a single word. For instance: \"Technology\", \"Markets\", \"World\", \"Business\", or \"Sports\".\n\n# Examples\n\n**Input**: \"Apple Unveils New iPhone Model, Featuring Advanced AI Features\" \n**Output**: \"Technology\"\n\n**Input**: \"Global Stocks Mixed as Investors Await Central Bank Decisions\" \n**Output**: \"Markets\"\n\n**Input**: \"War in Ukraine: Latest Updates on Negotiation Status\" \n**Output**: \"World\"\n\n**Input**: \"Microsoft in Talks to Acquire Gaming Company for $2 Billion\" \n**Output**: \"Business\"\n\n**Input**: \"Manchester United Secures Win in Premier League Football Match\" \n**Output**: \"Sports\" \n\n# Notes\n\n- If the headline appears to fit into more than one category, choose the most dominant theme.\n- Keywords or phrases such as \"stocks\", \"company acquisition\", \"match\", or technological brands can be good indicators for classification.\n"
          }
        },
        {
          "type": "message",
          "role": "user",
          "content": {
            "type": "input_text",
            "text": "{{item.input}}"
          }
        }
      ]
    },
    "model": "gpt-4o-mini",
    "sampling_params": {
      "seed": 42,
      "temperature": 1.0,
      "top_p": 1.0,
      "max_completions_tokens": 2048
    }
  },
  "error": null,
  "metadata": {}
}
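A client will typically want a few fields out of the returned EvalRun object. The sketch below reads the status, report URL, and result counts from a trimmed copy of the response above. The function name summarize_run is hypothetical, and treating a total of 0 as "no pass rate yet" is an assumption based on the queued example, where all counts are zero.

```python
import json


def summarize_run(run):
    """Extract the fields a caller is most likely to check after creating a run."""
    counts = run["result_counts"]
    total = counts["total"]
    # A freshly queued run reports zero totals, so avoid dividing by zero.
    pass_rate = counts["passed"] / total if total else None
    return {
        "id": run["id"],
        "status": run["status"],
        "pass_rate": pass_rate,
        "report_url": run["report_url"],
    }


# Trimmed copy of the example response shown above.
sample_response = json.loads("""
{
  "object": "eval.run",
  "id": "evalrun_67e57965b480819094274e3a32235e4c",
  "eval_id": "eval_67e579652b548190aaa83ada4b125f47",
  "report_url": "https://platform.openai.com/evaluations/eval_67e579652b548190aaa83ada4b125f47&run_id=evalrun_67e57965b480819094274e3a32235e4c",
  "status": "queued",
  "result_counts": {"total": 0, "errored": 0, "failed": 0, "passed": 0},
  "error": null,
  "metadata": {}
}
""")

summary = summarize_run(sample_response)
print(summary["status"], summary["pass_rate"])
```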