POST /v4/models/{model_deployment_id}/chat-completions

Authorizations

x-api-key
string
header
required

Headers

x-selected-account-id
string | null

Path Parameters

model_deployment_id
string
required
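For illustration, the sketch below sets up the required authorization header, the optional account header, and the URL with the path parameter, using Python and the requests library. The base URL and IDs are placeholders, not values taken from this reference.

```python
import requests

# Placeholders; the base URL shown here is an assumption, not part of this reference.
BASE_URL = "https://api.example.com"
MODEL_DEPLOYMENT_ID = "my-model-deployment-id"

headers = {
    "x-api-key": "YOUR_API_KEY",                 # required authorization header
    "x-selected-account-id": "YOUR_ACCOUNT_ID",  # optional header; omit if not needed
    "Content-Type": "application/json",
}

url = f"{BASE_URL}/v4/models/{MODEL_DEPLOYMENT_ID}/chat-completions"
```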

Body

application/json

Represents a chat completion request.

chat_history
object[]
required

Chat history entries with roles and messages. If there's no history, pass an empty list.

prompt
string
required

New user prompt. This will be sent to the model with a user role.

model_request_parameters
object
temperature
number
default: 0.2

What sampling temperature to use, between [0, 2]. Higher values like 1.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. Setting temperature=0.0 will enable fully deterministic (greedy) sampling. NOTE: The temperature range for some models is limited to [0, 1]; if the given value is above the available range, it defaults to the maximum value.

Required range: 0 <= x <= 2
stop_sequences
string[]

List of up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.

max_tokens
integer

The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length. If not specified, max_tokens will be determined based on the model used:

| Model API family | Model API default | EGP applied default |
| --- | --- | --- |
| OpenAI Completions | 16 | context window - prompt size |
| OpenAI Chat Completions | context window - prompt size | context window - prompt size |
| LLM Engine | max_new_tokens parameter is required | 100 |
| Anthropic Claude 2 | max_tokens_to_sample parameter is required | 10000 |

top_p
number

The cumulative probability cutoff for token selection. Lower values mean sampling from a smaller, more top-weighted nucleus. Available for models provided by Google, LLM Engine, and OpenAI.

top_k
number

Sample from the k most likely next tokens at each step. Lower k focuses on higher probability tokens. Available for models provided by Google and LLM Engine.

frequency_penalty
number

Penalize tokens based on how frequently they have already appeared in the text. Positive values encourage the model to generate new tokens and negative values encourage the model to repeat tokens. Available for models provided by LLM Engine and OpenAI.

presence_penalty
number

Penalize tokens based on whether they have already appeared in the text. Positive values encourage the model to generate new tokens and negative values encourage the model to repeat tokens. Available for models provided by LLM Engine and OpenAI.

stream
boolean
default: false

Flag indicating whether to stream the completion response.

logprobs
boolean

Whether to return logprobs. Currently only supported for llmengine chat models.

top_logprobs
integer

Number of top logprobs to return. Currently only supported for llmengine chat models.

chat_template
string

The chat template to use for the completion. Currently only supported for llmengine chat models.

chat_template_kwargs
object

Additional keyword arguments for the chat template. Currently only supported for llmengine chat models.
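Continuing the sketch above, a minimal request body might look like the following. The "role"/"content" field names inside chat_history are assumptions; check the Message schema for the exact shape of each entry.

```python
# Continuing from the url/headers sketch above.
body = {
    # Pass an empty list if there is no prior history.
    "chat_history": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ],
    # The prompt is sent to the model with a user role.
    "prompt": "And roughly how many people live there?",
    "model_request_parameters": {
        "temperature": 0.2,
        "max_tokens": 256,
        "stop_sequences": ["\n\n"],
        "stream": False,
    },
}

resp = requests.post(url, headers=headers, json=body, timeout=60)
resp.raise_for_status()
```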

Response

200 - application/json
completions
array
required
finish_reason
string
prompt_tokens
integer
default: 0
response_tokens
integer
default: 0
choices
object[]
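The sketch below, continuing the request example above, reads the documented response fields. The inner structure of each choices entry is not specified in this reference, so it is printed as-is rather than unpacked.

```python
data = resp.json()

for completion in data["completions"]:
    # finish_reason, prompt_tokens, and response_tokens are documented above;
    # the shape of each choices entry is an assumption left unparsed here.
    print("finish_reason:", completion.get("finish_reason"))
    print("prompt_tokens:", completion.get("prompt_tokens", 0))
    print("response_tokens:", completion.get("response_tokens", 0))
    for choice in completion.get("choices", []):
        print(choice)
```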