Generate completion
Description
Interact with the LLM specified by model_deployment_id. The model will generate a text completion based on the provided prompt.
{
"prompt": "What is the capital of France?"
}
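For orientation, here is a minimal request sketch in Python. The base URL, endpoint path, and authorization header are assumptions for illustration; only the model_deployment_id path parameter and the prompt body field come from this page.

```python
# Minimal completion request sketch. The base URL, endpoint path, and auth
# header are assumptions for illustration; model_deployment_id and "prompt"
# are taken from this page.
import requests

API_BASE = "https://api.example.com"            # assumed base URL
MODEL_DEPLOYMENT_ID = "your-model-deployment"   # path parameter
API_KEY = "your-api-key"

response = requests.post(
    f"{API_BASE}/v4/models/{MODEL_DEPLOYMENT_ID}/completions",  # assumed path
    headers={"Authorization": f"Bearer {API_KEY}"},              # assumed auth scheme
    json={"prompt": "What is the capital of France?"},
)
response.raise_for_status()
print(response.json())
```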
Authorizations
Headers
Path Parameters
Body
What sampling temperature to use, between [0, 2]. Higher values like 1.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. Setting temperature=0.0 enables fully deterministic (greedy) sampling. NOTE: The temperature range for some models is limited to [0, 1]; if the given value is above the supported range, it defaults to the maximum value.
0 <= x <= 2
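The clamping behavior described above can be pictured with a small sketch; the helper name and signature are illustrative, not part of the API.

```python
def clamp_temperature(requested: float, model_max: float = 1.0) -> float:
    """Illustrative only: values above a model's supported maximum fall back
    to that maximum (e.g. for models limited to [0, 1]); 0.0 stays greedy."""
    return min(max(requested, 0.0), model_max)

clamp_temperature(1.8)   # -> 1.0 for a model limited to [0, 1]
clamp_temperature(0.0)   # -> 0.0, fully deterministic (greedy) sampling
```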
List of up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length. If not specified, max_tokens is determined based on the model used:
| Model API family | Model API default | EGP applied default |
| --- | --- | --- |
| OpenAI Completions | 16 | context window - prompt size |
| OpenAI Chat Completions | context window - prompt size | context window - prompt size |
| LLM Engine | max_new_tokens parameter is required | 100 |
| Anthropic Claude 2 | max_tokens_to_sample parameter is required | 10000 |
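The applied defaults in the table can be summarized with a short sketch; the function name and family labels are illustrative, not part of the API.

```python
def egp_default_max_tokens(family: str, context_window: int, prompt_tokens: int) -> int:
    """Illustrative summary of the EGP applied defaults in the table above."""
    if family in ("OpenAI Completions", "OpenAI Chat Completions"):
        return context_window - prompt_tokens
    if family == "LLM Engine":
        return 100
    if family == "Anthropic Claude 2":
        return 10000
    raise ValueError(f"Unknown model API family: {family}")
```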
The cumulative probability cutoff for token selection. Lower values mean sampling from a smaller, more top-weighted nucleus. Available for models provided by Google, LLM Engine, and OpenAI.
Sample from the k most likely next tokens at each step. Lower k focuses on higher probability tokens. Available for models provided by Google and LLM Engine.
Penalize tokens based on how frequently they have already appeared in the text. Positive values encourage the model to generate new tokens, and negative values encourage the model to repeat tokens. Available for models provided by LLM Engine and OpenAI.
Penalize tokens based on whether they have already appeared in the text. Positive values encourage the model to generate new tokens, and negative values encourage the model to repeat tokens. Available for models provided by LLM Engine and OpenAI.
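A combined request-body sketch for the sampling controls described above. Only prompt, temperature, and max_tokens are field names confirmed on this page; the stop-sequence and penalty field names are assumed from common completion-API conventions and may differ here.

```python
# Sampling-control sketch. Only "prompt", "temperature", and "max_tokens" are
# confirmed by this page; the remaining field names are assumptions.
body = {
    "prompt": "What is the capital of France?",
    "temperature": 0.2,        # focused, near-deterministic output
    "max_tokens": 64,
    "stop": ["\n\n"],          # assumed name: up to 4 stop sequences
    "top_p": 0.9,              # assumed name: nucleus sampling cutoff
    "top_k": 40,               # assumed name: top-k sampling
    "frequency_penalty": 0.5,  # assumed name: penalize frequently repeated tokens
    "presence_penalty": 0.5,   # assumed name: penalize tokens already present
}
```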
Flag indicating whether to stream the completion response.
Whether to return logprobs. Currently only supported for llmengine chat models.
Number of top logprobs to return. Currently only supported for llmengine chat models.
The chat template to use for the completion. Currently only supported for llmengine chat models.
Additional keyword arguments for the chat template. Currently only supported for llmengine chat models.
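For completeness, a streaming sketch follows. The stream field name, the endpoint path, and the line-delimited response framing are assumptions based on common completion APIs, not confirmed by this page.

```python
# Streaming sketch; the "stream" field name, endpoint path, and line-delimited
# response framing are assumptions, not confirmed by this page.
import requests

API_BASE = "https://api.example.com"            # assumed base URL
MODEL_DEPLOYMENT_ID = "your-model-deployment"   # path parameter
API_KEY = "your-api-key"

with requests.post(
    f"{API_BASE}/v4/models/{MODEL_DEPLOYMENT_ID}/completions",  # assumed path
    headers={"Authorization": f"Bearer {API_KEY}"},              # assumed auth scheme
    json={"prompt": "What is the capital of France?", "stream": True},
    stream=True,
) as response:
    for line in response.iter_lines():
        if line:
            print(line.decode("utf-8"))
```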