# Auto Evaluation

> Leverage LLM-based tasks to produce evaluation results.

## Guided Decoding

Use the `auto_evaluation.guided_decoding` task type to configure auto evaluation tasks where the set of potential results is well defined.

<Expandable title="configuration properties" defaultOpen={true}>
  <ResponseField name="model" type="string" required>
    The identifier of the model (in `<model_provider>/<model_name>` format) used to generate the auto evaluation result. The model must support some form of guided decoding (e.g. [OpenAI's response formatting](https://platform.openai.com/docs/guides/structured-outputs)); otherwise results cannot be computed.
  </ResponseField>

  <ResponseField name="system_prompt" type="string">
    An optional system prompt to be sent as the first message of the chat completion request.
  </ResponseField>

  <ResponseField name="prompt" type="string" required>
    The user prompt containing the evaluation question and any relevant data from the evaluation item (see [referencing item data](/docs/v5/next-gen-evaluation/overview#referencing-item-data)).
  </ResponseField>

  <ResponseField name="run_condition" type="object">
    A condition that determines whether the model will be called for each row.
  </ResponseField>

  <ResponseField name="response_format" type="string[]" required>
    The responses for the model to return, each with a name and type configuration. The examples on this page express this as a JSON Schema object whose `properties` define each response.
  </ResponseField>

  <ResponseField name="inference_args" type="object">
    Any additional properties to be included in the chat completion request to the model<br />(e.g. `{ "temperature": 0 }`).
  </ResponseField>
</Expandable>
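As a rough sketch of how these properties fit together, the configuration can be assembled as a plain Python dictionary before it is placed in a task definition. The prompt text, the system prompt, and the `missing_required` helper are illustrative assumptions rather than part of the SDK. `run_condition` is left out because its shape is not described here.

```python
# Required properties per the list above; the helper itself is hypothetical.
REQUIRED_PROPERTIES = ("model", "prompt", "response_format")


def missing_required(configuration: dict) -> list[str]:
    """Return the required property names absent from a configuration."""
    return [name for name in REQUIRED_PROPERTIES if name not in configuration]


configuration = {
    "model": "openai/gpt-4o",  # <model_provider>/<model_name>
    "system_prompt": "You are a strict grader.",  # optional, illustrative text
    "prompt": "Is this answer correct? {{item.generated_output}}",
    "response_format": {
        "type": "object",
        "properties": {"verdict": {"type": "string", "enum": ["yes", "no"]}},
        "required": ["verdict"],
    },
    "inference_args": {"temperature": 0},  # optional extra request properties
}

task = {
    "task_type": "auto_evaluation.guided_decoding",
    "alias": "correctness",
    "configuration": configuration,
}

print(missing_required(configuration))  # prints []
```

Checking for missing required properties locally is only a convenience; the service presumably performs its own validation when the evaluation is created.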

### Example Usage

The following illustrates a basic example in which a guided decoding task is defined to evaluate the correctness of a generated output compared to the ground truth.

```python
# `client` is assumed to be an already-initialized SDK client.
client.evaluations.create(
  name="Example Correctness Evaluation",
  data=[
    {
      "input": "What color is the sky?",
      "expected_output": "Blue",
      "generated_output": "The sky appears blue during ..."
    },
    ...
  ],
  tasks=[
    {
      "task_type": "auto_evaluation.guided_decoding",
      "alias": "correctness",
      "configuration": {
        "model": "openai/o3-mini",
        "prompt": """
          Given the user's query: {{item.input}},
          The agent's response was: {{item.generated_output}}
          The expected response is: {{item.expected_output}}
          Did the agent's response fully represent the expected response?
        """,
        "response_format": {
          "type": "object",
          "properties": {
            "question-response": {
              "type": "string",
              "enum": ["yes", "no"]
            }
          },
          "required": ["question-response"]
        }
      }
    }
  ]
)
```
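To see what the judge model might receive for a single item, the `{{item.<field>}}` placeholders in the prompt can be previewed locally. The sketch below is a hypothetical helper, not the platform's implementation, and it assumes placeholders only reference top-level item fields; the platform's actual substitution rules may differ.

```python
import re

# Matches placeholders of the form {{item.<field>}}, as used in the prompt above.
_PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def preview_prompt(template: str, item: dict) -> str:
    """Substitute {{item.<field>}} placeholders with values from one item.

    Hypothetical local helper for previewing prompts only.
    """
    def replace(match: re.Match) -> str:
        field = match.group(1)
        if field not in item:
            raise KeyError(f"item has no field {field!r}")
        return str(item[field])

    return _PLACEHOLDER.sub(replace, template)


item = {
    "input": "What color is the sky?",
    "expected_output": "Blue",
    "generated_output": "The sky appears blue during ...",
}
template = (
    "Given the user's query: {{item.input}},\n"
    "The expected response is: {{item.expected_output}}"
)
print(preview_prompt(template, item))
```

Raising on an unknown field, rather than leaving the placeholder in place, makes typos in field names visible before an evaluation is submitted.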

You can also customize the response format when defining a task. For example, having the judge LLM give a reason for its final score is often helpful. The following example combines several response types: a boolean, a bounded integer, a bounded decimal, an enumerated category, and free-text reasoning.

```python
tasks = [
  {
    "task_type": "auto_evaluation.guided_decoding",
    "alias": "multi_response_option_judge",
    "configuration": {
      "model": "openai/gpt-4o",
      "prompt": "Evaluate this response...",
      "response_format": {
        "type": "object",
        "properties": {
          "is_helpful": {
            "type": "boolean",
            "description": "Whether the response is helpful"
          },
          "quality_score": {
            "type": "integer",
            "minimum": 1,
            "maximum": 5,
            "description": "Quality score from 1 to 5"
          },
          "accuracy_score": {
            "type": "number",
            "minimum": 0.0,
            "maximum": 1.0,
            "description": "Accuracy as a decimal"
          },
          "category": {
            "type": "string",
            "enum": ["excellent", "good", "fair", "poor"],
            "description": "Quality category"
          },
          "reasoning": {
            "type": "string",
            "description": "Explanation of the evaluation"
          }
        },
        "required": ["is_helpful", "quality_score", "accuracy_score", "category", "reasoning"]
      }
    }
  }
]
```
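When inspecting judge outputs offline, it can help to check them against the same response format. The sketch below is a minimal hand-rolled checker covering only the keywords used in the example above (`required`, `type`, `enum`, `minimum`, `maximum`); it is not a full JSON Schema validator. It assumes the judge's answer is available as a Python dictionary keyed by property name, which this page does not confirm, and the sample responses are written by hand for illustration.

```python
def check_response(response: dict, schema: dict) -> list[str]:
    """Return a list of problems found; an empty list means the response fits."""
    type_checks = {
        "boolean": lambda v: isinstance(v, bool),
        # bool is a subclass of int in Python, so exclude it explicitly.
        "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
        "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
        "string": lambda v: isinstance(v, str),
    }
    problems = [
        f"missing: {name}"
        for name in schema.get("required", [])
        if name not in response
    ]
    for name, spec in schema.get("properties", {}).items():
        if name not in response:
            continue
        value = response[name]
        check = type_checks.get(spec.get("type"))
        if check is not None and not check(value):
            problems.append(f"{name}: expected {spec['type']}")
            continue  # skip range and enum checks on a wrongly typed value
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: not one of {spec['enum']}")
        if "minimum" in spec and value < spec["minimum"]:
            problems.append(f"{name}: below minimum {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            problems.append(f"{name}: above maximum {spec['maximum']}")
    return problems


schema = {
    "type": "object",
    "properties": {
        "is_helpful": {"type": "boolean"},
        "quality_score": {"type": "integer", "minimum": 1, "maximum": 5},
        "category": {"type": "string", "enum": ["excellent", "good", "fair", "poor"]},
    },
    "required": ["is_helpful", "quality_score", "category"],
}

# Hypothetical judge outputs, written by hand for illustration.
good = {"is_helpful": True, "quality_score": 4, "category": "good"}
bad = {"is_helpful": "yes", "quality_score": 9}

print(check_response(good, schema))  # prints []
print(check_response(bad, schema))
```

For anything beyond a quick sanity check, a complete JSON Schema validator is a better fit than this subset.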
