Create eval
Create the structure of an evaluation that can be used to test a model's performance. An evaluation is a set of testing criteria and a data source. After creating an evaluation, you can run it on different models and model parameters. We support several types of graders and data sources. For more information, see the Evals guide.
Request body
name
string
The name of the evaluation.
metadata
object or null
Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard. Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
data_source_config
object
The configuration for the data source used for the evaluation runs. One of the following:
CustomDataSourceConfig
object
Required
A CustomDataSourceConfig object that defines the schema for the data source used for the evaluation runs (see the example sketch after the fields below). This schema determines:
- The shape of the data used to define your testing criteria
- What data is required when creating a run
type
string
Required
Defaults: custom
The type of data source. Always custom.
item_schema
object
Required
The JSON schema for each row in the data source.
include_sample_schema
boolean
Defaults: false
Whether the eval should expect you to populate the sample namespace (i.e., by generating responses off of your data source).
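Putting these fields together, a custom data source config might look like the sketch below. The item_schema property names (input, ground_truth) are illustrative, not required names; define whatever fields your testing criteria reference.
{
  "type": "custom",
  "item_schema": {
    "type": "object",
    "properties": {
      "input": { "type": "string" },
      "ground_truth": { "type": "string" }
    },
    "required": ["input", "ground_truth"]
  },
  "include_sample_schema": true
}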
LogsDataSourceConfig
object
Required
A data source config which specifies the metadata property of your stored completions query. This is usually metadata like usecase=chatbot or prompt-version=v2.
type
string
Required
Defaults: logs
The type of data source. Always logs.
metadata
object
Metadata filters for the logs data source.
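For example, a logs data source config that scopes eval runs to stored completions tagged with a particular metadata value might look like the sketch below (the usecase=chatbot filter mirrors the request example at the end of this page and is illustrative):
{
  "type": "logs",
  "metadata": {
    "usecase": "chatbot"
  }
}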
testing_criteria
array
Required
A list of graders for all eval runs in this group.
LabelModelGrader
object
A LabelModelGrader object which uses a model to assign labels to each item in the evaluation.
type
string
Required
The object type, which is always label_model.
name
string
Required
The name of the grader.
model
string
Required
The model to use for the evaluation. Must support structured outputs.
input
array
Required
A list of chat messages forming the prompt or context. May include variable references to the "item" namespace, i.e. {{item.name}}.
SimpleInputMessage
object
role
string
Required
The role of the message (e.g. "system", "assistant", "user").
content
string
Required
The content of the message.
Eval message object
object
A message input to the model with a role indicating instruction following hierarchy. Instructions given with the developer or system role take precedence over instructions given with the user role. Messages with the assistant role are presumed to have been generated by the model in previous interactions.
role
string
Required
The role of the message input. One of user, assistant, system, or developer.
content
string or object
Text inputs to the model. May contain template strings.
Text input
string
Required
A text input to the model.
Input text
object
Required
A text input to the model.
type
string
Required
Defaults: input_text
The type of the input item. Always input_text.
text
string
Required
The text input to the model.
Output text
object
Required
A text output from the model.
type
string
Required
The type of the output text. Always output_text.
text
string
Required
The text output from the model.
type
string
The type of the message input. Always message.
labels
array
Required
The labels to assign to each item in the evaluation.
items
string
passing_labels
array
Required
The labels that indicate a passing result. Must be a subset of labels.
items
string
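Taken together, a label_model grader entry looks like the sketch below. It mirrors the grader used in the full request example at the end of this page, so the model, prompt, and labels are illustrative:
{
  "type": "label_model",
  "name": "Example label grader",
  "model": "o3-mini",
  "input": [
    {
      "role": "developer",
      "content": "Classify the sentiment of the following statement as one of 'positive', 'neutral', or 'negative'"
    },
    {
      "role": "user",
      "content": "Statement: {{item.input}}"
    }
  ],
  "labels": ["positive", "neutral", "negative"],
  "passing_labels": ["positive"]
}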
StringCheckGrader
object
A StringCheckGrader object that performs a string comparison between input and reference using a specified operation.
type
string
Required
The object type, which is always string_check.
name
string
Required
The name of the grader.
input
string
Required
The input text. This may include template strings.
reference
string
Required
The reference text. This may include template strings.
operation
string
Required
The string check operation to perform. One of eq, ne, like, or ilike.
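As a sketch, a string_check grader that tests whether the model output exactly matches a reference value from your data source might look like the following. The template paths {{sample.output_text}} and {{item.expected}} are assumptions about your schema; substitute the fields you actually defined:
{
  "type": "string_check",
  "name": "Exact match",
  "input": "{{sample.output_text}}",
  "reference": "{{item.expected}}",
  "operation": "eq"
}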
TextSimilarityGrader
object
A TextSimilarityGrader object which grades text based on similarity metrics.
type
string
Required
Defaults: text_similarity
The type of grader. Always text_similarity.
name
string
The name of the grader.
input
string
Required
The text being graded.
reference
string
Required
The text being graded against.
pass_threshold
number
Required
A float threshold; a similarity score greater than or equal to this value indicates a passing grade.
evaluation_metric
string
Required
The evaluation metric to use. One of fuzzy_match, bleu, gleu, meteor, rouge_1, rouge_2, rouge_3, rouge_4, rouge_5, or rouge_l.
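For instance, a text_similarity grader using the fuzzy_match metric might be configured as below; the pass_threshold and the {{item.reference_answer}} template path are illustrative assumptions:
{
  "type": "text_similarity",
  "name": "Similarity to reference",
  "input": "{{sample.output_text}}",
  "reference": "{{item.reference_answer}}",
  "evaluation_metric": "fuzzy_match",
  "pass_threshold": 0.8
}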
PythonGrader
object
A PythonGrader object that runs a Python script on the input.
type
string
Required
The object type, which is always python.
name
string
Required
The name of the grader.
source
string
Required
The source code of the Python script.
pass_threshold
number
The threshold for the score.
image_tag
string
The image tag to use for the Python script.
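A minimal python grader sketch follows. The grade(sample, item) function signature returning a float is an assumption based on the Evals guide (confirm the expected script contract there), and the keyword check is purely illustrative:
{
  "type": "python",
  "name": "Contains keyword",
  "source": "def grade(sample, item):\n    return 1.0 if item['keyword'] in sample['output_text'] else 0.0",
  "pass_threshold": 0.5
}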
ScoreModelGrader
object
A ScoreModelGrader object that uses a model to assign a score to the input.
type
string
Required
The object type, which is always score_model.
name
string
Required
The name of the grader.
model
string
Required
The model to use for the evaluation.
sampling_params
object
The sampling parameters for the model.
input
array
Required
The input text. This may include template strings.
items
object
A message input to the model with a role indicating instruction following hierarchy. Instructions given with the developer or system role take precedence over instructions given with the user role. Messages with the assistant role are presumed to have been generated by the model in previous interactions.
role
string
Required
The role of the message input. One of user, assistant, system, or developer.
content
string or object
Text inputs to the model. May contain template strings.
Text input
string
Required
A text input to the model.
Input text
object
Required
A text input to the model.
type
string
Required
Defaults: input_text
The type of the input item. Always input_text.
text
string
Required
The text input to the model.
Output text
object
Required
A text output from the model.
type
string
Required
The type of the output text. Always output_text.
text
string
Required
The text output from the model.
type
string
The type of the message input. Always message.
pass_threshold
number
The threshold for the score.
range
array
The range of the score. Defaults to [0, 1].
items
number
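Putting it together, a score_model grader might look like the sketch below. The prompt wording, model, range, and pass_threshold are illustrative, and the {{item.reference}} and {{sample.output_text}} paths assume those fields exist in your data source schema:
{
  "type": "score_model",
  "name": "Factual consistency",
  "model": "o3-mini",
  "input": [
    {
      "role": "developer",
      "content": "Rate how factually consistent the response is with the reference, on a scale from 0 to 1."
    },
    {
      "role": "user",
      "content": "Reference: {{item.reference}}\nResponse: {{sample.output_text}}"
    }
  ],
  "range": [0, 1],
  "pass_threshold": 0.7
}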
Response
The created Eval object.
curl https://api.openai.com/v1/evals \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "Sentiment",
        "data_source_config": {
          "type": "stored_completions",
          "metadata": {
            "usecase": "chatbot"
          }
        },
        "testing_criteria": [
          {
            "type": "label_model",
            "model": "o3-mini",
            "input": [
              {
                "role": "developer",
                "content": "Classify the sentiment of the following statement as one of '\''positive'\'', '\''neutral'\'', or '\''negative'\''"
              },
              {
                "role": "user",
                "content": "Statement: {{item.input}}"
              }
            ],
            "passing_labels": [
              "positive"
            ],
            "labels": [
              "positive",
              "neutral",
              "negative"
            ],
            "name": "Example label grader"
          }
        ]
      }'
1 {2 "object": "eval",3 "id": "eval_67b7fa9a81a88190ab4aa417e397ea21",4 "data_source_config": {5 "type": "stored_completions",6 "metadata": {7 "usecase": "chatbot"8 },9 "schema": {10 "type": "object",11 "properties": {12 "item": {13 "type": "object"14 },15 "sample": {16 "type": "object"17 }18 },19 "required": [20 "item",21 "sample"22 ]23 },24 "testing_criteria": [25 {26 "name": "Example label grader",27 "type": "label_model",28 "model": "o3-mini",29 "input": [30 {31 "type": "message",32 "role": "developer",33 "content": {34 "type": "input_text",35 "text": "Classify the sentiment of the following statement as one of positive, neutral, or negative"36 }37 },38 {39 "type": "message",40 "role": "user",41 "content": {42 "type": "input_text",43 "text": "Statement: {{item.input}}"44 }45 }46 ],47 "passing_labels": [48 "positive"49 ],50 "labels": [51 "positive",52 "neutral",53 "negative"54 ]55 }56 ],57 "name": "Sentiment",58 "created_at": 1740110490,59 "metadata": {60 "description": "An eval for sentiment analysis"61 }62 }