Create eval
Create the structure of an evaluation that can be used to test a model's performance. An evaluation is a set of testing criteria and a data source. After creating an evaluation, you can run it on different models and model parameters. We support several types of graders and data sources. For more information, see the Evals guide.
Request body
name
string
The name of the evaluation.
metadata
object or null
Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard. Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
data_source_config
object
The configuration for the data source used for the evaluation runs. One of the following:
CustomDataSourceConfig
object
Required
A CustomDataSourceConfig object that defines the schema for the data source used for the evaluation runs (see the example sketch after the fields below). This schema determines:
- The shape of the data used to define your testing criteria
- What data is required when creating a run
type
string
Required
Defaults: custom
The type of data source. Always custom.
item_schema
object
Required
The JSON schema for each row in the data source.
include_sample_schema
boolean
Defaults: false
Whether the eval should expect you to populate the sample namespace (i.e., by generating responses off of your data source).
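Putting these fields together, a custom data source config might look like the sketch below. The item_schema property names (input, ground_truth) are illustrative, not required names; define whatever fields your testing criteria reference.
{
  "type": "custom",
  "item_schema": {
    "type": "object",
    "properties": {
      "input": { "type": "string" },
      "ground_truth": { "type": "string" }
    },
    "required": ["input", "ground_truth"]
  },
  "include_sample_schema": true
}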
LogsDataSourceConfig
object
Required
A data source config which specifies the metadata property of your stored completions query. This is usually metadata like usecase=chatbot or prompt-version=v2.
type
string
Required
Defaults: logs
The type of data source. Always logs.
metadata
object
Metadata filters for the logs data source.
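For example, a logs data source config that scopes eval runs to stored completions tagged with a particular metadata value might look like the sketch below (the usecase=chatbot filter mirrors the request example at the end of this page and is illustrative):
{
  "type": "logs",
  "metadata": {
    "usecase": "chatbot"
  }
}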
testing_criteria
array
Required
A list of graders for all eval runs in this group.
LabelModelGrader
object
A LabelModelGrader object which uses a model to assign labels to each item in the evaluation.
type
string
Required
The object type, which is always label_model.
name
string
Required
The name of the grader.
model
string
Required
The model to use for the evaluation. Must support structured outputs.
input
array
Required
A list of chat messages forming the prompt or context. May include variable references to the "item" namespace, i.e. {{item.name}}.
SimpleInputMessage
object
role
string
Required
The role of the message (e.g. "system", "assistant", "user").
content
string
Required
The content of the message.
Eval message object
object
A message input to the model with a role indicating instruction following hierarchy. Instructions given with the developer or system role take precedence over instructions given with the user role. Messages with the assistant role are presumed to have been generated by the model in previous interactions.
role
string
Required
The role of the message input. One of user, assistant, system, or developer.
content
string or object
Text inputs to the model. May contain template strings.
Text input
string
Required
A text input to the model.
Input text
object
Required
A text input to the model.
type
string
Required
Defaults: input_text
The type of the input item. Always input_text.
text
string
Required
The text input to the model.
Output text
object
Required
A text output from the model.
type
string
Required
The type of the output text. Always output_text.
text
string
Required
The text output from the model.
type
string
The type of the message input. Always message.
labels
array
Required
The labels to assign to each item in the evaluation.
items
string
passing_labels
array
Required
The labels that indicate a passing result. Must be a subset of labels.
items
string
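Taken together, a label_model grader entry looks like the sketch below. It mirrors the grader used in the full request example at the end of this page, so the model, prompt, and labels are illustrative:
{
  "type": "label_model",
  "name": "Example label grader",
  "model": "o3-mini",
  "input": [
    {
      "role": "developer",
      "content": "Classify the sentiment of the following statement as one of 'positive', 'neutral', or 'negative'"
    },
    {
      "role": "user",
      "content": "Statement: {{item.input}}"
    }
  ],
  "labels": ["positive", "neutral", "negative"],
  "passing_labels": ["positive"]
}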
StringCheckGrader
object
A StringCheckGrader object that performs a string comparison between input and reference using a specified operation.
type
string
Required
The object type, which is always string_check.
name
string
Required
The name of the grader.
input
string
Required
The input text. This may include template strings.
reference
string
Required
The reference text. This may include template strings.
operation
string
Required
The string check operation to perform. One of eq, ne, like, or ilike.
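As a sketch, a string_check grader that tests whether the model output exactly matches a reference value from your data source might look like the following. The template paths {{sample.output_text}} and {{item.expected}} are assumptions about your schema; substitute the fields you actually defined:
{
  "type": "string_check",
  "name": "Exact match",
  "input": "{{sample.output_text}}",
  "reference": "{{item.expected}}",
  "operation": "eq"
}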
TextSimilarityGrader
object
A TextSimilarityGrader object which grades text based on similarity metrics.
type
string
Required
Defaults: text_similarity
The type of grader. Always text_similarity.
name
string
The name of the grader.
input
string
Required
The text being graded.
reference
string
Required
The text being graded against.
pass_threshold
number
Required
A float threshold; a similarity score greater than or equal to this value indicates a passing grade.
evaluation_metric
string
Required
The evaluation metric to use. One of fuzzy_match, bleu, gleu, meteor, rouge_1, rouge_2, rouge_3, rouge_4, rouge_5, or rouge_l.
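For instance, a text_similarity grader using the fuzzy_match metric might be configured as below; the pass_threshold and the {{item.reference_answer}} template path are illustrative assumptions:
{
  "type": "text_similarity",
  "name": "Similarity to reference",
  "input": "{{sample.output_text}}",
  "reference": "{{item.reference_answer}}",
  "evaluation_metric": "fuzzy_match",
  "pass_threshold": 0.8
}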
PythonGrader
object
A PythonGrader object that runs a Python script on the input.
type
string
Required
The object type, which is always python.
name
string
Required
The name of the grader.
source
string
Required
The source code of the Python script.
pass_threshold
number
The threshold for the score.
image_tag
string
The image tag to use for the Python script.
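A minimal python grader sketch follows. The grade(sample, item) function signature returning a float is an assumption based on the Evals guide (confirm the expected script contract there), and the keyword check is purely illustrative:
{
  "type": "python",
  "name": "Contains keyword",
  "source": "def grade(sample, item):\n    return 1.0 if item['keyword'] in sample['output_text'] else 0.0",
  "pass_threshold": 0.5
}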
ScoreModelGrader
object
A ScoreModelGrader object that uses a model to assign a score to the input.
type
string
Required
The object type, which is always score_model.
name
string
Required
The name of the grader.
model
string
Required
The model to use for the evaluation.
sampling_params
object
The sampling parameters for the model.
input
array
Required
The input text. This may include template strings.
items
object
A message input to the model with a role indicating instruction following hierarchy. Instructions given with the developer or system role take precedence over instructions given with the user role. Messages with the assistant role are presumed to have been generated by the model in previous interactions.
role
string
Required
The role of the message input. One of user, assistant, system, or developer.
content
string or object
Text inputs to the model. May contain template strings.
Text input
string
Required
A text input to the model.
Input text
object
Required
A text input to the model.
type
string
Required
Defaults: input_text
The type of the input item. Always input_text.
text
string
Required
The text input to the model.
Output text
object
Required
A text output from the model.
type
string
Required
The type of the output text. Always output_text.
text
string
Required
The text output from the model.
type
string
The type of the message input. Always message.
pass_threshold
number
The threshold for the score.
range
array
The range of the score. Defaults to [0, 1].
items
number
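Putting it together, a score_model grader might look like the sketch below. The prompt wording, model, range, and pass_threshold are illustrative, and the {{item.reference}} and {{sample.output_text}} paths assume those fields exist in your data source schema:
{
  "type": "score_model",
  "name": "Factual consistency",
  "model": "o3-mini",
  "input": [
    {
      "role": "developer",
      "content": "Rate how factually consistent the response is with the reference, on a scale from 0 to 1."
    },
    {
      "role": "user",
      "content": "Reference: {{item.reference}}\nResponse: {{sample.output_text}}"
    }
  ],
  "range": [0, 1],
  "pass_threshold": 0.7
}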
Response
The created Eval object.
curl https://api.openai.com/v1/evals \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "Sentiment",
        "data_source_config": {
          "type": "stored_completions",
          "metadata": {
            "usecase": "chatbot"
          }
        },
        "testing_criteria": [
          {
            "type": "label_model",
            "model": "o3-mini",
            "input": [
              {
                "role": "developer",
                "content": "Classify the sentiment of the following statement as one of '\''positive'\'', '\''neutral'\'', or '\''negative'\''"
              },
              {
                "role": "user",
                "content": "Statement: {{item.input}}"
              }
            ],
            "passing_labels": [
              "positive"
            ],
            "labels": [
              "positive",
              "neutral",
              "negative"
            ],
            "name": "Example label grader"
          }
        ]
      }'
1 {2 "object": "eval",3 "id": "eval_67b7fa9a81a88190ab4aa417e397ea21",4 "data_source_config": {5 "type": "stored_completions",6 "metadata": {7 "usecase": "chatbot"8 },9 "schema": {10 "type": "object",11 "properties": {12 "item": {13 "type": "object"14 },15 "sample": {16 "type": "object"17 }18 },19 "required": [20 "item",21 "sample"22 ]23 },24 "testing_criteria": [25 {26 "name": "Example label grader",27 "type": "label_model",28 "model": "o3-mini",29 "input": [30 {31 "type": "message",32 "role": "developer",33 "content": {34 "type": "input_text",35 "text": "Classify the sentiment of the following statement as one of positive, neutral, or negative"36 }37 },38 {39 "type": "message",40 "role": "user",41 "content": {42 "type": "input_text",43 "text": "Statement: {{item.input}}"44 }45 }46 ],47 "passing_labels": [48 "positive"49 ],50 "labels": [51 "positive",52 "neutral",53 "negative"54 ]55 }56 ],57 "name": "Sentiment",58 "created_at": 1740110490,59 "metadata": {60 "description": "An eval for sentiment analysis"61 }62 }