
Overview

Use the custom_function task type to run your own Python scoring logic on each evaluation item, giving you full control over scoring beyond the built-in metrics. The function source code is extracted, sent to the API, and executed server-side in a sandboxed environment.
The scale-gp-beta SDK provides a CustomFunction helper that handles source extraction and serialization for you.

SDK Helper

The CustomFunction class wraps a Python callable and provides methods to serialize it for the API and test it with a dry run.
from scale_gp_beta import SGPClient
from scale_gp_beta.lib import CustomFunction

client = SGPClient(account_id=..., api_key=...)

def string_similarity(expected: str, actual: str) -> float:
    import difflib
    return difflib.SequenceMatcher(None, expected, actual).ratio()

cf = CustomFunction(func=string_similarity)
Any imports your function needs must be placed inside the function body. Module-level imports are not captured when the source is extracted and will not be available at execution time.

Allowed Imports

Custom functions run in a restricted environment. Only these standard library modules are available: math, json, re, collections, itertools, functools, statistics, decimal, fractions, datetime, copy, textwrap, difflib, unicodedata
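As a sketch of what fits within this allow-list, here is a scorer that uses only re from the allowed modules, with the import inside the function body:

```python
def token_overlap(expected: str, actual: str) -> float:
    # Import inside the body so it is captured with the source
    import re
    expected_tokens = set(re.findall(r"\w+", expected.lower()))
    actual_tokens = set(re.findall(r"\w+", actual.lower()))
    if not expected_tokens:
        return 0.0
    # Fraction of expected tokens that appear in the actual output
    return len(expected_tokens & actual_tokens) / len(expected_tokens)
```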

Return Type

Functions must return an int or float. Returning any other type produces an error; bool is rejected explicitly, even though it is a subclass of int in Python.
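In practice this means casting truthy checks to a float yourself rather than returning the comparison result directly, for example:

```python
def exact_match(expected: str, actual: str) -> float:
    # Return 1.0/0.0 rather than True/False, since bool is rejected
    return 1.0 if expected.strip() == actual.strip() else 0.0
```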

Using arg_mapping

By default, function parameter names are matched to dataset column names. Use arg_mapping when your function parameters don’t match the column names in your data:
def score(a: float, b: float) -> float:
    return a - b

cf = CustomFunction(
    func=score,
    alias="difference_score",
    arg_mapping={"a": "model_score", "b": "baseline_score"},
)
Values that already start with item. are passed through unchanged, which lets you reference nested fields; bare column names are automatically prefixed with item.:
cf = CustomFunction(
    func=score,
    arg_mapping={"a": "item.nested.model_score", "b": "baseline_score"},
)
# produces: {"a": "item.nested.model_score", "b": "item.baseline_score"}
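This normalization rule can be illustrated with a small sketch (it mirrors the documented behavior, not the SDK's internal code):

```python
def normalize_arg_mapping(mapping: dict) -> dict:
    # Values already prefixed with "item." pass through unchanged;
    # bare column names get the "item." prefix added.
    return {
        param: col if col.startswith("item.") else f"item.{col}"
        for param, col in mapping.items()
    }
```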

Dry Run

Test your function against sample data before creating a full evaluation. The dry run endpoint is synchronous — it executes your function inline and returns per-row results immediately.
result = cf.dry_run(
    client=client,
    sample_data=[
        {"expected": "Hello world", "actual": "Hello wrold"},
        {"expected": "The quick brown fox jumps over the lazy dog.", "actual": "The quick brown fox jumped over the lazy dog."},
    ],
)
print(result)
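Because the dry run executes remotely, it can also help to sanity-check the function locally on the same rows first. Calling the function with each row's columns as keyword arguments mirrors how columns are bound to parameters:

```python
def string_similarity(expected: str, actual: str) -> float:
    import difflib
    return difflib.SequenceMatcher(None, expected, actual).ratio()

sample_data = [
    {"expected": "Hello world", "actual": "Hello wrold"},
    {"expected": "Hello world", "actual": "Hello world"},
]

# Bind each row's columns to the function's parameters by name
local_scores = [string_similarity(**row) for row in sample_data]
```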

Discovering Columns

If you already have an evaluation and want to see which columns are available for arg_mapping:
from scale_gp_beta.lib import get_evaluation_columns

columns = get_evaluation_columns(client, "eval_abc123")
for col in columns:
    print(col.field_name, col.data_type)
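With the column names in hand, you can build an arg_mapping programmatically. A sketch using inspect on a local function (the column names here are hypothetical stand-ins for what the endpoint returns):

```python
import inspect

def score(a: float, b: float) -> float:
    return a - b

# Hypothetical column names, e.g. collected from get_evaluation_columns
columns = ["model_score", "baseline_score"]

# Pair parameters to columns positionally; adjust to taste
params = list(inspect.signature(score).parameters)
arg_mapping = dict(zip(params, columns))
```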

Example Usage

The following is a complete example that creates an evaluation with two custom function tasks.
from scale_gp_beta import SGPClient
from scale_gp_beta.lib import CustomFunction

client = SGPClient(account_id=..., api_key=...)

def string_similarity(expected: str, actual: str) -> float:
    import difflib
    return difflib.SequenceMatcher(None, expected, actual).ratio()

cf_similarity = CustomFunction(func=string_similarity)

def length_ratio(expected: str, actual: str) -> float:
    if len(expected) == 0:
        return 0.0
    return len(actual) / len(expected)

cf_length = CustomFunction(func=length_ratio)

evaluation = client.evaluations.create(
    name="Custom Scoring Evaluation",
    data=[
        {"expected": "Hello world", "actual": "Hello wrold"},
        {"expected": "The quick brown fox jumps over the lazy dog.", "actual": "The quick brown fox jumped over the lazy dog."},
    ],
    tasks=[cf_similarity.serialize(), cf_length.serialize()],
)
You can also create evaluations using the raw task dict directly, without the SDK helper:
evaluation = client.evaluations.create(
    name="Custom Scoring Evaluation",
    data=[
        {"expected": "Hello world", "actual": "Hello wrold"},
        {"expected": "The quick brown fox jumps over the lazy dog.", "actual": "The quick brown fox jumped over the lazy dog."},
    ],
    tasks=[
        {
            "task_type": "custom_function",
            "alias": "string_similarity",
            "configuration": {
                "function_source": "def string_similarity(expected: str, actual: str) -> float:\n    import difflib\n    return difflib.SequenceMatcher(None, expected, actual).ratio()\n",
            }
        }
    ],
)
Poll for completion and inspect results:
import time

while True:
    evaluation = client.evaluations.retrieve(evaluation.id)
    if evaluation.status in ("completed", "failed"):
        break
    time.sleep(5)

items = client.evaluation_items.list(evaluation_id=evaluation.id)
for item in items:
    print(item.data)  # scores are stored under the task alias key, e.g. data["string_similarity"]
    if item.task_errors:
        print(item.task_errors)
Once the evaluation completes, you can also view the results — including per-item custom function scores — in the SGP UI.
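For a quick summary, you can also aggregate the per-item scores locally. A sketch over plain dicts shaped like item.data (scores keyed by task alias; the values here are hypothetical):

```python
from statistics import mean

# Hypothetical item.data payloads, with scores under each task alias
items_data = [
    {"string_similarity": 0.91, "length_ratio": 1.0},
    {"string_similarity": 0.95, "length_ratio": 1.02},
]

avg_similarity = mean(d["string_similarity"] for d in items_data)
```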

Constraints

| Constraint | Detail |
| --- | --- |
| Return type | int or float only (not bool) |
| Imports | Must be inside the function body; only allowed stdlib modules |
| Function signature | No *args, **kwargs, async, or lambdas |
| Dry run timeout | 10s per row, 30s total |
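The import constraint can be pre-checked locally before submitting. A sketch that scans a function's source for modules outside the allow-list (this mirrors the documented list; the server's actual validation may differ):

```python
import ast
import textwrap

ALLOWED_MODULES = {
    "math", "json", "re", "collections", "itertools", "functools",
    "statistics", "decimal", "fractions", "datetime", "copy",
    "textwrap", "difflib", "unicodedata",
}

def disallowed_imports(source: str) -> set:
    """Return top-level module names imported in source that are
    not on the allow-list."""
    tree = ast.parse(textwrap.dedent(source))
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            used.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            used.add(node.module.split(".")[0])
    return used - ALLOWED_MODULES
```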