Create eval

POST https://api.openai.com/v1/evals

Create the structure of an evaluation that can be used to test a model's performance. An evaluation is a set of testing criteria and a data source. After creating an evaluation, you can run it on different models and model parameters. We support several types of graders and data sources. For more information, see the Evals guide.
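
As a rough illustration of that structure, the sketch below assembles a request body from a name, a data source config, and a list of graders, and checks the metadata limits listed under Request body. The helpers `validate_metadata` and `build_eval_request` are hypothetical names, not part of any SDK, and the sketch assumes that 16 is an upper bound on the number of metadata pairs.

```python
import json

# Limits taken from the metadata description under "Request body".
MAX_METADATA_PAIRS = 16
MAX_KEY_LENGTH = 64
MAX_VALUE_LENGTH = 512


def validate_metadata(metadata):
    """Raise ValueError if metadata exceeds the documented limits (local check only)."""
    if metadata is None:
        return
    if len(metadata) > MAX_METADATA_PAIRS:
        raise ValueError("metadata may hold at most 16 key-value pairs")
    for key, value in metadata.items():
        if not isinstance(key, str) or len(key) > MAX_KEY_LENGTH:
            raise ValueError(f"invalid metadata key: {key!r}")
        if not isinstance(value, str) or len(value) > MAX_VALUE_LENGTH:
            raise ValueError(f"invalid metadata value for key {key!r}")


def build_eval_request(name, data_source_config, testing_criteria, metadata=None):
    """Assemble a request body dict: testing criteria plus a data source."""
    validate_metadata(metadata)
    if not testing_criteria:
        raise ValueError("testing_criteria must contain at least one grader")
    body = {
        "name": name,
        "data_source_config": data_source_config,
        "testing_criteria": list(testing_criteria),
    }
    if metadata is not None:
        body["metadata"] = metadata
    return body


# Illustrative values only; {{item.answer}} and {{item.expected}} are hypothetical fields.
body = build_eval_request(
    name="Sentiment",
    data_source_config={"type": "logs", "metadata": {"usecase": "chatbot"}},
    testing_criteria=[
        {
            "type": "string_check",
            "name": "Exact match",
            "input": "{{item.answer}}",
            "reference": "{{item.expected}}",
            "operation": "eq",
        }
    ],
    metadata={"description": "An eval for sentiment analysis"},
)
print(json.dumps(body, indent=2))
```

The checks run locally before any request is sent; the API remains the authority on what it accepts.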

Request body

  • name
    string
    The name of the evaluation.
  • metadata
    object or null
    A set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format and for querying objects via the API or the dashboard. Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
  • data_source_config
    object
    The configuration for the data source used for the evaluation runs.
    • CustomDataSourceConfig
      object
      Required

      A CustomDataSourceConfig object that defines the schema for the data source used for the evaluation runs. This schema defines the shape of the data that will be:

      • Used to define your testing criteria
      • Required when creating a run
      • type
        string
        Required
        Defaults: custom

        The type of data source. Always custom.

        • custom
          string
      • item_schema
        object
        Required
        The JSON schema for each row in the data source.
      • include_sample_schema
        boolean
        Defaults: false
        Whether the eval should expect you to populate the sample namespace (i.e., by generating responses from your data source).
    • LogsDataSourceConfig
      object
      Required

      A data source config which specifies the metadata property of your stored completions query. This is usually metadata like usecase=chatbot or prompt-version=v2, etc.

      • type
        string
        Required
        Defaults: logs

        The type of data source. Always logs.

        • logs
          string
      • metadata
        object
        Metadata filters for the logs data source.
  • testing_criteria
    array
    Required
    A list of graders for all eval runs in this group.
    • LabelModelGrader
      object
      A LabelModelGrader object which uses a model to assign labels to each item in the evaluation.
      • type
        string
        Required

        The object type, which is always label_model.

        • label_model
          string
      • name
        string
        Required
        The name of the grader.
      • model
        string
        Required
        The model to use for the evaluation. Must support structured outputs.
      • input
        array
        Required
        A list of chat messages forming the prompt or context. May include variable references to the "item" namespace, e.g. {{item.name}}.
        • SimpleInputMessage
          object
          • role
            string
            Required
            The role of the message (e.g. "system", "assistant", "user").
          • content
            string
            Required
            The content of the message.
        • Eval message object
          object

          A message input to the model with a role indicating the instruction-following hierarchy. Instructions given with the developer or system role take precedence over instructions given with the user role. Messages with the assistant role are presumed to have been generated by the model in previous interactions.

          • role
            string
            Required

            The role of the message input. One of user, assistant, system, or developer.

            • user
              string
            • assistant
              string
            • system
              string
            • developer
              string
          • content
            string or object
            Text inputs to the model - can contain template strings.
            • Text input
              string
              Required
              A text input to the model.
            • Input text
              object
              Required
              A text input to the model.
              • type
                string
                Required
                Defaults: input_text

                The type of the input item. Always input_text.

                • input_text
                  string
              • text
                string
                Required
                The text input to the model.
            • Output text
              object
              Required
              A text output from the model.
              • type
                string
                Required

                The type of the output text. Always output_text.

                • output_text
                  string
              • text
                string
                Required
                The text output from the model.
          • type
            string

            The type of the message input. Always message.

            • message
              string
      • labels
        array
        Required
        The labels to assign to each item in the evaluation.
        • items
          string
      • passing_labels
        array
        Required
        The labels that indicate a passing result. Must be a subset of labels.
        • items
          string
    • StringCheckGrader
      object
      A StringCheckGrader object that performs a string comparison between input and reference using a specified operation.
      • type
        string
        Required

        The object type, which is always string_check.

        • string_check
          string
      • name
        string
        Required
        The name of the grader.
      • input
        string
        Required
        The input text. This may include template strings.
      • reference
        string
        Required
        The reference text. This may include template strings.
      • operation
        string
        Required

        The string check operation to perform. One of eq, ne, like, or ilike.

        • eq
          string
        • ne
          string
        • like
          string
        • ilike
          string
    • TextSimilarityGrader
      object
      A TextSimilarityGrader object which grades text based on similarity metrics.
      • type
        string
        Required
        Defaults: text_similarity
        The type of grader.
        • text_similarity
          string
      • name
        string
        The name of the grader.
      • input
        string
        Required
        The text being graded.
      • reference
        string
        Required
        The text being graded against.
      • pass_threshold
        number
        Required
        A float threshold. A score greater than or equal to this value indicates a passing grade.
      • evaluation_metric
        string
        Required

        The evaluation metric to use. One of fuzzy_match, bleu, gleu, meteor, rouge_1, rouge_2, rouge_3, rouge_4, rouge_5, or rouge_l.

        • fuzzy_match
          string
        • bleu
          string
        • gleu
          string
        • meteor
          string
        • rouge_1
          string
        • rouge_2
          string
        • rouge_3
          string
        • rouge_4
          string
        • rouge_5
          string
        • rouge_l
          string
    • PythonGrader
      object
      A PythonGrader object that runs a Python script on the input.
      • type
        string
        Required

        The object type, which is always python.

        • python
          string
      • name
        string
        Required
        The name of the grader.
      • source
        string
        Required
        The source code of the Python script.
      • pass_threshold
        number
        The threshold for the score.
      • image_tag
        string
        The image tag to use for the Python script.
    • ScoreModelGrader
      object
      A ScoreModelGrader object that uses a model to assign a score to the input.
      • type
        string
        Required

        The object type, which is always score_model.

        • score_model
          string
      • name
        string
        Required
        The name of the grader.
      • model
        string
        Required
        The model to use for the evaluation.
      • sampling_params
        object
        The sampling parameters for the model.
      • input
        array
        Required
        The input messages for the grader. These may include template strings.
        • items
          object

          A message input to the model with a role indicating the instruction-following hierarchy. Instructions given with the developer or system role take precedence over instructions given with the user role. Messages with the assistant role are presumed to have been generated by the model in previous interactions.

          • role
            string
            Required

            The role of the message input. One of user, assistant, system, or developer.

            • user
              string
            • assistant
              string
            • system
              string
            • developer
              string
          • content
            string or object
            Text inputs to the model - can contain template strings.
            • Text input
              string
              Required
              A text input to the model.
            • Input text
              object
              Required
              A text input to the model.
              • type
                string
                Required
                Defaults: input_text

                The type of the input item. Always input_text.

                • input_text
                  string
              • text
                string
                Required
                The text input to the model.
            • Output text
              object
              Required
              A text output from the model.
              • type
                string
                Required

                The type of the output text. Always output_text.

                • output_text
                  string
              • text
                string
                Required
                The text output from the model.
          • type
            string

            The type of the message input. Always message.

            • message
              string
      • pass_threshold
        number
        The threshold for the score.
      • range
        array

        The range of the score. Defaults to [0, 1].

        • items
          number
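
The parameters above can be combined into a custom data source config and a pair of graders. The following is a minimal sketch, assuming hypothetical row fields `answer` and `expected` and a `{{item.<field>}}` template syntax inferred from the `{{item.name}}` example. The helper functions are illustrative and not part of any SDK; `missing_item_fields` is only a local sanity check, not something the API is documented to perform.

```python
import re

# Allowed values, copied from the parameter list above.
STRING_CHECK_OPERATIONS = {"eq", "ne", "like", "ilike"}
TEXT_SIMILARITY_METRICS = {
    "fuzzy_match", "bleu", "gleu", "meteor",
    "rouge_1", "rouge_2", "rouge_3", "rouge_4", "rouge_5", "rouge_l",
}

# Assumed template syntax: {{item.<field>}}.
ITEM_REF = re.compile(r"\{\{\s*item\.([A-Za-z_][A-Za-z0-9_]*)")


def custom_data_source_config(item_schema, include_sample_schema=False):
    """Build a CustomDataSourceConfig dict."""
    return {
        "type": "custom",
        "item_schema": item_schema,
        "include_sample_schema": include_sample_schema,
    }


def string_check_grader(name, input_text, reference, operation):
    """Build a string_check grader, rejecting operations not listed above."""
    if operation not in STRING_CHECK_OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    return {
        "type": "string_check",
        "name": name,
        "input": input_text,
        "reference": reference,
        "operation": operation,
    }


def text_similarity_grader(input_text, reference, evaluation_metric, pass_threshold):
    """Build a text_similarity grader, rejecting metrics not listed above."""
    if evaluation_metric not in TEXT_SIMILARITY_METRICS:
        raise ValueError(f"unknown metric: {evaluation_metric!r}")
    return {
        "type": "text_similarity",
        "input": input_text,
        "reference": reference,
        "evaluation_metric": evaluation_metric,
        "pass_threshold": pass_threshold,
    }


def missing_item_fields(template, item_schema):
    """Return item fields a template references that the schema does not declare."""
    declared = set(item_schema.get("properties", {}))
    return sorted(set(ITEM_REF.findall(template)) - declared)


# Hypothetical row shape: each item carries an "answer" and an "expected" string.
item_schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "expected": {"type": "string"},
    },
    "required": ["answer", "expected"],
}

data_source_config = custom_data_source_config(item_schema)
testing_criteria = [
    string_check_grader("Exact match", "{{item.answer}}", "{{item.expected}}", "eq"),
    text_similarity_grader("{{item.answer}}", "{{item.expected}}", "rouge_l", 0.8),
]

# Fail early if a grader template points at a field the schema lacks.
for grader in testing_criteria:
    for field in ("input", "reference"):
        missing = missing_item_fields(grader[field], item_schema)
        if missing:
            raise ValueError(f"{grader['type']} references undeclared fields: {missing}")
```

Validating enumerated values and template references locally is a design choice that surfaces typos before a request is made; it does not replace server-side validation.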

Response

The created Eval object.

Example request

curl https://api.openai.com/v1/evals \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Sentiment",
    "data_source_config": {
      "type": "stored_completions",
      "metadata": {
        "usecase": "chatbot"
      }
    },
    "testing_criteria": [
      {
        "type": "label_model",
        "model": "o3-mini",
        "input": [
          {
            "role": "developer",
            "content": "Classify the sentiment of the following statement as one of positive, neutral, or negative"
          },
          {
            "role": "user",
            "content": "Statement: {{item.input}}"
          }
        ],
        "passing_labels": [
          "positive"
        ],
        "labels": [
          "positive",
          "neutral",
          "negative"
        ],
        "name": "Example label grader"
      }
    ]
  }'
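
For readers working in Python, the same request can be sketched with the standard library alone. This is an illustration under assumptions rather than an official client: it builds the request object without sending it, reads the key from the `OPENAI_API_KEY` environment variable as the curl example does, and adds a local check that `passing_labels` is a subset of `labels`.

```python
import json
import os
import urllib.request

labels = ["positive", "neutral", "negative"]
passing_labels = ["positive"]

# passing_labels must be a subset of labels, per the parameter description.
if not set(passing_labels) <= set(labels):
    raise ValueError("passing_labels must be a subset of labels")

payload = {
    "name": "Sentiment",
    "data_source_config": {
        "type": "stored_completions",
        "metadata": {"usecase": "chatbot"},
    },
    "testing_criteria": [
        {
            "type": "label_model",
            "model": "o3-mini",
            "input": [
                {
                    "role": "developer",
                    "content": (
                        "Classify the sentiment of the following statement "
                        "as one of positive, neutral, or negative"
                    ),
                },
                {"role": "user", "content": "Statement: {{item.input}}"},
            ],
            "passing_labels": passing_labels,
            "labels": labels,
            "name": "Example label grader",
        }
    ],
}

request = urllib.request.Request(
    "https://api.openai.com/v1/evals",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Sending is left commented out so the sketch runs without network access or a key.
# with urllib.request.urlopen(request) as response:
#     created_eval = json.load(response)
```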

Example response

{
  "object": "eval",
  "id": "eval_67b7fa9a81a88190ab4aa417e397ea21",
  "data_source_config": {
    "type": "stored_completions",
    "metadata": {
      "usecase": "chatbot"
    },
    "schema": {
      "type": "object",
      "properties": {
        "item": {
          "type": "object"
        },
        "sample": {
          "type": "object"
        }
      },
      "required": [
        "item",
        "sample"
      ]
    }
  },
  "testing_criteria": [
    {
      "name": "Example label grader",
      "type": "label_model",
      "model": "o3-mini",
      "input": [
        {
          "type": "message",
          "role": "developer",
          "content": {
            "type": "input_text",
            "text": "Classify the sentiment of the following statement as one of positive, neutral, or negative"
          }
        },
        {
          "type": "message",
          "role": "user",
          "content": {
            "type": "input_text",
            "text": "Statement: {{item.input}}"
          }
        }
      ],
      "passing_labels": [
        "positive"
      ],
      "labels": [
        "positive",
        "neutral",
        "negative"
      ]
    }
  ],
  "name": "Sentiment",
  "created_at": 1740110490,
  "metadata": {
    "description": "An eval for sentiment analysis"
  }
}
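
In the response, message content appears as an object with a type and a text field, whereas the request supplied plain strings; both forms are allowed by the "string or object" content parameter. The sketch below shows one way client code might read either form. The helper `content_text` is a hypothetical name, and reading `created_at` as Unix seconds in UTC is an assumption based on the magnitude of the example value.

```python
from datetime import datetime, timezone


def content_text(content):
    """Return the text of a message content given as a string or a text object."""
    if isinstance(content, str):
        return content
    if isinstance(content, dict) and content.get("type") in ("input_text", "output_text"):
        return content["text"]
    raise TypeError(f"unsupported content: {content!r}")


# The same message as sent in the request and as echoed in the response.
request_message = {"role": "user", "content": "Statement: {{item.input}}"}
response_message = {
    "type": "message",
    "role": "user",
    "content": {"type": "input_text", "text": "Statement: {{item.input}}"},
}

same_text = content_text(request_message["content"]) == content_text(response_message["content"])

# Assumption: created_at is a Unix timestamp in seconds.
created = datetime.fromtimestamp(1740110490, tz=timezone.utc)
print(same_text, created.isoformat())
```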