Create eval
Create the structure of an evaluation that can be used to test a model's performance. An evaluation is a set of testing criteria and a datasource. After creating an evaluation, you can run it on different models and model parameters. We support several types of graders and datasources. For more information, see the Evals guide.
Request body
name string
The name of the evaluation.
metadata object or null
Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard. Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
data_source_config object
The configuration for the data source used for the evaluation runs.
CustomDataSourceConfig object Required
A CustomDataSourceConfig object that defines the schema for the data source used for the evaluation runs. This schema defines the shape of the data that is:
- Used to define your testing criteria
- Required when creating a run
type string Required Defaults: custom
The type of data source. Always custom.
item_schema object Required
The JSON schema for each row in the data source.
include_sample_schema boolean Defaults: false
Whether the eval should expect you to populate the sample namespace (i.e., by generating responses off of your data source).
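As a rough illustration of the fields above, a custom data source config could be assembled like this. The row fields ticket_text and category are hypothetical and only show where your own item schema would go.

```python
import json

# Hypothetical row shape: every item carries a ticket text and an expected category.
item_schema = {
    "type": "object",
    "properties": {
        "ticket_text": {"type": "string"},
        "category": {"type": "string"},
    },
    "required": ["ticket_text", "category"],
}

# Field names follow the CustomDataSourceConfig reference above.
data_source_config = {
    "type": "custom",
    "item_schema": item_schema,
    "include_sample_schema": False,  # documented default
}

print(json.dumps(data_source_config, indent=2))
```

Graders could then refer to these fields through template strings such as {{item.ticket_text}}.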
LogsDataSourceConfig object Required
A data source config which specifies the metadata property of your stored completions query. This is usually metadata like usecase=chatbot or prompt-version=v2, etc.
type string Required Defaults: logs
The type of data source. Always logs.
metadata object
Metadata filters for the logs data source.
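A minimal sketch of a logs data source config, assuming the metadata filters are plain string key-value pairs as in the usecase=chatbot example. The helper filters_from_pairs is hypothetical and not part of any API.

```python
def filters_from_pairs(pairs):
    """Turn strings like 'usecase=chatbot' into a metadata filter dict."""
    metadata = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        metadata[key] = value
    return metadata


# Field names follow the LogsDataSourceConfig reference above.
logs_config = {
    "type": "logs",
    "metadata": filters_from_pairs(["usecase=chatbot", "prompt-version=v2"]),
}
```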
testing_criteria array Required
A list of graders for all eval runs in this group.
LabelModelGrader object
A LabelModelGrader object which uses a model to assign labels to each item in the evaluation.
type string Required
The object type, which is always label_model.
name string Required
The name of the grader.
model string Required
The model to use for the evaluation. Must support structured outputs.
input array Required
A list of chat messages forming the prompt or context. May include variable references to the "item" namespace, e.g. {{item.name}}.
SimpleInputMessage object
role string Required
The role of the message (e.g. "system", "assistant", "user").
content string Required
The content of the message.
Eval message object object
A message input to the model with a role indicating instruction following hierarchy. Instructions given with the developer or system role take precedence over instructions given with the user role. Messages with the assistant role are presumed to have been generated by the model in previous interactions.
role string Required
The role of the message input. One of user, assistant, system, or developer.
content string or object
Text inputs to the model - can contain template strings.
Text input string Required
A text input to the model.
Input text object Required
A text input to the model.
type string Required Defaults: input_text
The type of the input item. Always input_text.
text string Required
The text input to the model.
Output text object Required
A text output from the model.
type string Required
The type of the output text. Always output_text.
text string Required
The text output from the model.
type string
The type of the message input. Always message.
labels array Required
The labels to assign to each item in the evaluation.
items string
passing_labels array Required
The labels that indicate a passing result. Must be a subset of labels.
items string
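The LabelModelGrader fields above can be sketched as a small builder that also checks the one constraint the reference states, namely that passing_labels is a subset of labels. The function name make_label_grader is hypothetical, and the check runs locally; it says nothing about how the API itself validates the request.

```python
def make_label_grader(name, model, messages, labels, passing_labels):
    """Build a label_model grader dict after a local subset check."""
    unknown = [label for label in passing_labels if label not in labels]
    if unknown:
        raise ValueError(f"passing_labels not found in labels: {unknown}")
    return {
        "type": "label_model",
        "name": name,
        "model": model,
        "input": messages,
        "labels": list(labels),
        "passing_labels": list(passing_labels),
    }


label_grader = make_label_grader(
    name="Example label grader",
    model="o3-mini",
    messages=[
        {"role": "developer", "content": "Classify the sentiment of the statement."},
        {"role": "user", "content": "Statement: {{item.input}}"},
    ],
    labels=["positive", "neutral", "negative"],
    passing_labels=["positive"],
)
```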
StringCheckGrader object
A StringCheckGrader object that performs a string comparison between input and reference using a specified operation.
type string Required
The object type, which is always string_check.
name string Required
The name of the grader.
input string Required
The input text. This may include template strings.
reference string Required
The reference text. This may include template strings.
operation string Required
The string check operation to perform. One of eq, ne, like, or ilike.
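A hedged sketch of a StringCheckGrader payload follows. The builder make_string_check and the template fields item.answer and item.expected are hypothetical; the only rule enforced is that operation is one of the four documented values.

```python
VALID_OPERATIONS = ("eq", "ne", "like", "ilike")


def make_string_check(name, input_text, reference, operation):
    """Build a string_check grader dict, rejecting undocumented operations."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"operation must be one of {VALID_OPERATIONS}")
    return {
        "type": "string_check",
        "name": name,
        "input": input_text,
        "reference": reference,
        "operation": operation,
    }


string_grader = make_string_check(
    name="Exact match",
    input_text="{{item.answer}}",      # hypothetical item field
    reference="{{item.expected}}",     # hypothetical item field
    operation="eq",
)
```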
TextSimilarityGrader object
A TextSimilarityGrader object which grades text based on similarity metrics.
type string Required Defaults: text_similarity
The type of grader. Always text_similarity.
name string
The name of the grader.
input string Required
The text being graded.
reference string Required
The text being graded against.
pass_threshold number Required
A float threshold. A score greater than or equal to this value indicates a passing grade.
evaluation_metric string Required
The evaluation metric to use. One of fuzzy_match, bleu, gleu, meteor, rouge_1, rouge_2, rouge_3, rouge_4, rouge_5, or rouge_l.
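To make the pass_threshold rule concrete, here is a small sketch that builds a TextSimilarityGrader payload and applies the documented comparison (a score greater than or equal to the threshold passes). The helper names and the template fields are hypothetical, and the score is supplied by hand rather than computed.

```python
VALID_METRICS = {
    "fuzzy_match", "bleu", "gleu", "meteor",
    "rouge_1", "rouge_2", "rouge_3", "rouge_4", "rouge_5", "rouge_l",
}


def make_text_similarity_grader(name, input_text, reference, metric, pass_threshold):
    """Build a text_similarity grader dict with a documented metric name."""
    if metric not in VALID_METRICS:
        raise ValueError(f"unknown evaluation_metric: {metric!r}")
    return {
        "type": "text_similarity",
        "name": name,
        "input": input_text,
        "reference": reference,
        "evaluation_metric": metric,
        "pass_threshold": pass_threshold,
    }


def is_passing(score, pass_threshold):
    """Documented rule: greater than or equal to the threshold passes."""
    return score >= pass_threshold


similarity_grader = make_text_similarity_grader(
    "Summary overlap", "{{item.summary}}", "{{item.reference}}", "rouge_l", 0.8
)
```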
PythonGrader object
A PythonGrader object that runs a Python script on the input.
type string Required
The object type, which is always python.
name string Required
The name of the grader.
source string Required
The source code of the Python script.
pass_threshold number
The threshold for the score.
image_tag string
The image tag to use for the Python script.
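The sketch below shows how a PythonGrader payload might carry its script as a string. It assumes the script defines a grade(sample, item) function that returns a float score; that entry-point convention and the output_text and expected keys are assumptions not stated in this reference, so check the Evals guide before relying on them.

```python
# Assumed convention: the script exposes grade(sample, item) -> float.
GRADER_SOURCE = '''
def grade(sample, item):
    # 1.0 when the model output equals the expected value, otherwise 0.0.
    return 1.0 if sample.get("output_text") == item.get("expected") else 0.0
'''

python_grader = {
    "type": "python",
    "name": "Exact match (python)",
    "source": GRADER_SOURCE,
    "pass_threshold": 1.0,
}

# Local smoke test of the script before sending it anywhere.
namespace = {}
exec(GRADER_SOURCE, namespace)
score = namespace["grade"]({"output_text": "yes"}, {"expected": "yes"})
print(score)
```

Running the script locally first is a cheap way to catch syntax errors in the source string.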
ScoreModelGrader object
A ScoreModelGrader object that uses a model to assign a score to the input.
type string Required
The object type, which is always score_model.
name string Required
The name of the grader.
model string Required
The model to use for the evaluation.
sampling_params object
The sampling parameters for the model.
input array Required
The input text. This may include template strings.
items object
A message input to the model with a role indicating instruction following hierarchy. Instructions given with the developer or system role take precedence over instructions given with the user role. Messages with the assistant role are presumed to have been generated by the model in previous interactions.
role string Required
The role of the message input. One of user, assistant, system, or developer.
content string or object
Text inputs to the model - can contain template strings.
Text input string Required
A text input to the model.
Input text object Required
A text input to the model.
type string Required Defaults: input_text
The type of the input item. Always input_text.
text string Required
The text input to the model.
Output text object Required
A text output from the model.
type string Required
The type of the output text. Always output_text.
text string Required
The text output from the model.
type string
The type of the message input. Always message.
pass_threshold number
The threshold for the score.
range array
The range of the score. Defaults to [0, 1].
items number
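A ScoreModelGrader payload could be put together as follows. The builder make_score_grader is hypothetical, and the local checks (an ordered range, a threshold inside that range) are a cautious assumption rather than rules stated in this reference.

```python
def make_score_grader(name, model, messages, score_range=(0, 1), pass_threshold=None):
    """Build a score_model grader dict; range defaults to the documented [0, 1]."""
    low, high = score_range
    if low >= high:
        raise ValueError("range must be [min, max] with min < max")
    if pass_threshold is not None and not (low <= pass_threshold <= high):
        raise ValueError("pass_threshold should fall inside range")
    grader = {
        "type": "score_model",
        "name": name,
        "model": model,
        "input": messages,
        "range": [low, high],
    }
    if pass_threshold is not None:
        grader["pass_threshold"] = pass_threshold
    return grader


score_grader = make_score_grader(
    name="Helpfulness score",
    model="o3-mini",
    messages=[
        {"role": "developer", "content": "Rate the helpfulness of the reply from 0 to 1."},
        {"role": "user", "content": "Reply: {{item.input}}"},
    ],
    pass_threshold=0.7,
)
```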
Response
The created Eval object.
curl https://api.openai.com/v1/evals \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Sentiment",
    "data_source_config": {
      "type": "stored_completions",
      "metadata": {
        "usecase": "chatbot"
      }
    },
    "testing_criteria": [
      {
        "type": "label_model",
        "model": "o3-mini",
        "input": [
          {
            "role": "developer",
            "content": "Classify the sentiment of the following statement as one of positive, neutral, or negative"
          },
          {
            "role": "user",
            "content": "Statement: {{item.input}}"
          }
        ],
        "passing_labels": [
          "positive"
        ],
        "labels": [
          "positive",
          "neutral",
          "negative"
        ],
        "name": "Example label grader"
      }
    ]
  }'
{
  "object": "eval",
  "id": "eval_67b7fa9a81a88190ab4aa417e397ea21",
  "data_source_config": {
    "type": "stored_completions",
    "metadata": {
      "usecase": "chatbot"
    },
    "schema": {
      "type": "object",
      "properties": {
        "item": {
          "type": "object"
        },
        "sample": {
          "type": "object"
        }
      },
      "required": [
        "item",
        "sample"
      ]
    }
  },
  "testing_criteria": [
    {
      "name": "Example label grader",
      "type": "label_model",
      "model": "o3-mini",
      "input": [
        {
          "type": "message",
          "role": "developer",
          "content": {
            "type": "input_text",
            "text": "Classify the sentiment of the following statement as one of positive, neutral, or negative"
          }
        },
        {
          "type": "message",
          "role": "user",
          "content": {
            "type": "input_text",
            "text": "Statement: {{item.input}}"
          }
        }
      ],
      "passing_labels": [
        "positive"
      ],
      "labels": [
        "positive",
        "neutral",
        "negative"
      ]
    }
  ],
  "name": "Sentiment",
  "created_at": 1740110490,
  "metadata": {
    "description": "An eval for sentiment analysis"
  }
}
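For readers who prefer Python over curl, the same request can be prepared with only the standard library. This is a sketch: it builds the request object without sending it, build_create_eval_request is a hypothetical helper, and the placeholder key is used only when OPENAI_API_KEY is not set.

```python
import json
import os
import urllib.request

EVALS_URL = "https://api.openai.com/v1/evals"


def build_create_eval_request(payload, api_key):
    """Prepare (but do not send) a POST request that creates an eval."""
    return urllib.request.Request(
        EVALS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "name": "Sentiment",
    "data_source_config": {
        "type": "stored_completions",
        "metadata": {"usecase": "chatbot"},
    },
    "testing_criteria": [
        {
            "type": "label_model",
            "name": "Example label grader",
            "model": "o3-mini",
            "input": [
                {"role": "developer", "content": "Classify the sentiment of the following statement as one of positive, neutral, or negative"},
                {"role": "user", "content": "Statement: {{item.input}}"},
            ],
            "labels": ["positive", "neutral", "negative"],
            "passing_labels": ["positive"],
        }
    ],
}

request = build_create_eval_request(
    payload, os.environ.get("OPENAI_API_KEY", "sk-placeholder")
)
# To actually send it: urllib.request.urlopen(request), then json.load() the response.
```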