Hyperparameters

Hyperparameters are external settings that determine a model’s behaviour, shape, size, resource use and other characteristics.

In this article, we will discuss some of the common hyperparameters and what they control.

Context Window

The context window (or “context length”) of a large language model is the amount of text, in tokens, that the model can consider or “remember” at any one time. A larger context window enables an AI model to process longer inputs and incorporate more information into each output.

The context window is the maximum number of tokens allowed in a single interaction. The limit applies to the combined total of system tokens, input tokens, conversation history, output tokens, and any other tokens in use at that moment.

Total Window Capacity >= System + Input + History + Output
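
The inequality above can be turned into a simple budget check. The sketch below is illustrative only: the function name, argument names, and token counts are assumptions, and a real application would obtain the counts from the model's own tokenizer.

```python
def remaining_output_budget(window: int, system: int, history: int, user_input: int) -> int:
    """Return how many tokens are left for the model's output.

    All arguments are token counts; the names are illustrative.
    """
    used = system + history + user_input
    if used > window:
        raise ValueError(f"prompt uses {used} tokens, window holds only {window}")
    return window - used


# Hypothetical numbers: an 8,192-token window and a 6,700-token prompt.
print(remaining_output_budget(window=8192, system=500, history=6000, user_input=200))  # 1492
```

If the prompt alone exceeds the window, something has to be dropped or summarised (usually older history) before the model can respond.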

Temperature

The LLM temperature hyperparameter is akin to a dial for randomness or creativity. Raising the temperature flattens the probability distribution over the next token, so less likely words have a better chance of appearing in the model’s output during text generation.

A temperature setting of 1 uses the model's standard probability distribution. Temperatures higher than 1 flatten the probability distribution, encouraging the model to select a wider range of tokens. Conversely, temperatures below 1 sharpen the probability distribution, increasing the model's likelihood of selecting the most probable next token.
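
A common way to implement this is to divide the model's raw scores (logits) by the temperature before converting them to probabilities with a softmax. The sketch below assumes that formulation; the function name and the three example scores are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the probability of the top token shrinking as the temperature rises, while the probabilities of the other tokens grow.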

A temperature value closer to 1.0, such as 0.8, makes the LLM more creative in its responses but potentially less predictable. Meanwhile, a lower temperature, such as 0.2, yields more deterministic responses that are predictable, if staid. Temperatures closer to 2.0 can begin to produce nonsensical output.

The use case informs the ideal temperature value for an LLM. A chatbot designed to be entertaining and creative may benefit from a higher temperature setting, which encourages more varied and imaginative responses. A text summarisation app in a highly regulated field such as law, health, or finance requires the inverse: a low temperature, because its generated summaries must adhere to strict requirements.

Top K and Top P

Top K

The top-k hyperparameter is another diversity-focused setting. The k value caps the number of tokens that can be considered as the next in the sequence. Tokens are ordered by probability, and only the k most probable are kept as candidates.
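
In practice, top-k is often applied to the model's raw scores (logits) rather than to probabilities. The ordering is the same, and masked-out tokens are set to negative infinity so they end up with zero probability. The following is a minimal sketch of that idea; the function name and the scores are illustrative assumptions.

```python
import math


def top_k_mask(logits, k):
    """Set every logit outside the k largest to -inf so it can never be sampled."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Indices of the k largest logits (ties are broken by position).
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(ranked[:k])
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]


logits = [3.2, 1.1, 2.7, -0.5, 0.9]  # hypothetical scores for a five-token vocabulary
print(top_k_mask(logits, k=2))  # [3.2, -inf, 2.7, -inf, -inf]
```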

Top P (nucleus sampling)

Like temperature, top-p sampling also affects word diversity in generated text outputs. Top-p works by setting a cumulative probability threshold p for the next token in an output sequence. The model may only choose from the most probable tokens whose combined probability falls within that threshold.

In top-p sampling, tokens are ranked by probability. Tokens with a greater likelihood of appearing next in the sequence have a higher score, with the opposite being true for less-likely tokens. The model assembles a set of potential next tokens, starting with the most probable, until their cumulative probability reaches the set threshold, then randomly samples a token from that set.

Higher p thresholds result in more diverse outputs, while lower thresholds preserve accuracy and coherency.
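
The procedure can be sketched as follows. This is an illustration under stated assumptions, not any particular library's implementation: the function name and the toy distribution are invented, and the kept probabilities are renormalised so they sum to 1 again.

```python
def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative probability reaches p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the range (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalise so the kept probabilities form a valid distribution.
    return {token: prob / cumulative for token, prob in nucleus}


probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "newt": 0.05}  # hypothetical
print(list(top_p_filter(probs, 0.7)))  # ['cat', 'dog']
print(list(top_p_filter(probs, 0.9)))  # ['cat', 'dog', 'fish']
```

Note that the size of the candidate set changes with the shape of the distribution: a confident model may need only one or two tokens to reach p, while an uncertain one may need many.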

Example

Top-k tells the model to pick the next token from the top ‘k’ tokens in its list, sorted by probability.

Consider the input phrase - “The name of that country is the”. The next token could be “United”, “Netherlands”, “Czech”, and so on, with varying probabilities. There may be dozens of potential outputs with decreasing probabilities but if you set k as 3, you’re telling the model to only pick from the top 3 options.

So if you run the same prompt a bunch of times, you’ll get United very often, and a smattering of Netherlands or Czech, but nothing else.

If you set k to 1, the model will only pick the top token (United, in this case).

Top-p is similar but picks from the top tokens based on the sum of their probabilities. For the previous example, if we set p to 0.15, the model accumulates tokens in descending probability order until the cumulative sum reaches or exceeds 0.15. If United and Netherlands together sum to only 14.7%, which does not yet meet the 0.15 threshold, the next most probable token (e.g., Czech) is also included in the candidate set.

Top-p is more dynamic than top-k, because the number of candidates adapts to the shape of the distribution, and it is often used to exclude outputs with lower probabilities. So if you set p to 0.75, you exclude the least likely tokens that together make up the bottom 25% of the probability mass.
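
The country example can be simulated with a few lines of code. The probabilities below are hypothetical, chosen only so that United and Netherlands add up to 14.7% as in the text; the extra tokens (Dominican, Philippines) and the helper names are also assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical probabilities for the five most likely next tokens.
PROBS = {"United": 0.12, "Netherlands": 0.027, "Czech": 0.023,
         "Dominican": 0.018, "Philippines": 0.015}


def candidates(probs, k=None, p=None):
    """Return the candidate tokens left after optional top-k and top-p filtering."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if k is not None:
        ranked = ranked[:k]
    if p is not None:
        kept, cumulative = [], 0.0
        for token, prob in ranked:
            kept.append((token, prob))
            cumulative += prob
            if cumulative >= p:
                break
        ranked = kept
    return dict(ranked)


def run_prompt(cands, times, seed=0):
    """Sample the next token `times` times and count the outcomes."""
    rng = random.Random(seed)
    return Counter(rng.choices(list(cands), weights=list(cands.values()), k=times))


print(list(candidates(PROBS, k=3)))     # ['United', 'Netherlands', 'Czech']
print(list(candidates(PROBS, k=1)))     # ['United']
print(list(candidates(PROBS, p=0.15)))  # ['United', 'Netherlands', 'Czech']
print(run_prompt(candidates(PROBS, k=3), times=1000))
```

The final line prints how often each of the three surviving tokens was picked over 1,000 simulated runs; United dominates, and no token outside the top three ever appears.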

Temperature vs Top P sampling

The difference between temperature and top-p sampling is that temperature reshapes the probability distribution over all potential tokens, while top-p sampling restricts token selection to a subset of the most probable tokens.

Top P vs Top K

Top-p limits the token pool up to a set p probability total, while top-k limits the pool to the top k most likely terms.

Penalties

Frequency penalty

The frequency penalty hyperparameter helps prevent models from overusing terms within the same output. Once a term appears in the output, the frequency penalty dissuades the model from using it again.

Models assign scores to each token, known as logits, and use logits to calculate probability values. Frequency penalties linearly lower the logit value of a term each time it is repeated, making it progressively less likely to be chosen next time. A higher penalty value lowers the logit by a greater amount per repetition.

Because the model is discouraged from repeating terms, it must choose other terms, leading to more diverse word choices in generated text.
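
The linear behaviour is easy to see in a small sketch. It assumes the additive formulation, in which the penalty multiplied by the number of prior uses is subtracted from the logit; the function name and the numbers are illustrative.

```python
def frequency_penalised_logit(logit, times_used, penalty):
    """Lower a token's logit in proportion to how often it has already been used."""
    return logit - penalty * times_used


# A token with a starting logit of 3.0 and a penalty of 0.5.
for times_used in range(4):
    print(times_used, frequency_penalised_logit(3.0, times_used, penalty=0.5))
# 0 3.0
# 1 2.5
# 2 2.0
# 3 1.5
```

Each additional use removes the same fixed amount (0.5 here) from the logit.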

Repetition penalty

Repetition penalty is similar to frequency penalty, except that it is typically applied multiplicatively rather than additively. Instead of subtracting a fixed amount, it scales the logit of a term that has already been used by the penalty factor, which can make it a stronger discouragement than the frequency penalty. For this reason, lower repetition penalty values are recommended.
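
Implementations of repetition penalty differ. The sketch below assumes one widely used multiplicative form, in which the logit of any token that has already appeared is divided by the penalty if it is positive and multiplied by it if it is negative. The function name and values are illustrative.

```python
def apply_repetition_penalty(logits, generated, penalty):
    """Scale down the logit of every token that already appears in `generated`."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    seen = set(generated)
    adjusted = {}
    for token, logit in logits.items():
        if token in seen:
            # Dividing a positive logit or multiplying a negative one both
            # make the token less likely when penalty > 1.
            logit = logit / penalty if logit > 0 else logit * penalty
        adjusted[token] = logit
    return adjusted


logits = {"bear": 3.0, "fox": -1.0, "owl": 3.0}  # hypothetical scores
print(apply_repetition_penalty(logits, ["bear", "fox"], penalty=1.5))
# {'bear': 2.0, 'fox': -1.5, 'owl': 3.0}
```

A penalty of 1.0 leaves the logits unchanged, which is why practical values tend to sit only a little above 1.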

Presence penalty

The presence penalty is a related hyperparameter that works similarly to the frequency penalty, except it applies only once. The presence penalty lowers a term’s logit value by the same amount, regardless of how often it appears in the output, so long as it appears at least once.

If the term bear appears in the output 10 times, and the term fox appears once, bear has a higher frequency penalty than fox. However, both bear and fox share the same presence penalty.
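
The bear and fox comparison can be worked through numerically. The sketch assumes additive penalties and uses a value of 0.5 for both; these numbers, the function name, and the extra unused term owl are illustrative assumptions.

```python
def penalty_breakdown(count, frequency_penalty, presence_penalty):
    """Return (frequency reduction, presence reduction) for a token used `count` times."""
    frequency_part = frequency_penalty * count
    presence_part = presence_penalty if count > 0 else 0.0
    return frequency_part, presence_part


for token, count in {"bear": 10, "fox": 1, "owl": 0}.items():
    print(token, penalty_breakdown(count, frequency_penalty=0.5, presence_penalty=0.5))
# bear (5.0, 0.5)
# fox (0.5, 0.5)
# owl (0.0, 0.0)
```

The frequency reduction grows with the count, while the presence reduction is the same for bear and fox and zero for a term that has not appeared at all.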

Token number (max tokens)

The token number or max tokens hyperparameter sets an upper limit for output token length. Smaller token number values are ideal for quick tasks such as chatbot conversations and summarisation, which can be handled by small language models as well as LLMs.

Higher token counts are better when longer outputs are needed, such as when using an LLM for vibe coding.

Max Length and Stop Sequence

Max Length - You can manage the number of tokens the model generates by adjusting the max length. Specifying a max length helps you prevent long or irrelevant responses and control costs.

Stop Sequences - A stop sequence is a string that stops the model from generating tokens. Specifying stop sequences is another way to control the length and structure of the model's response. For example, you can tell the model to generate lists that have no more than 10 items by adding "11" as a stop sequence.
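
Both controls can be illustrated with a toy generation loop. This is a sketch, not a real model API: fake_model is a hypothetical stand-in that emits one numbered list item per "token", and generate is an invented helper that shows where the two limits take effect.

```python
def generate(token_stream, max_tokens, stop_sequences=()):
    """Collect tokens until max_tokens is reached or a stop sequence appears."""
    text = ""
    for count, token in enumerate(token_stream, start=1):
        text += token
        for stop in stop_sequences:
            index = text.find(stop)
            if index != -1:
                return text[:index]  # drop the stop sequence and anything after it
        if count >= max_tokens:
            break
    return text


def fake_model():
    """Stand-in for a model: yields one numbered list item per "token"."""
    for i in range(1, 16):
        yield f"{i}. item\n"


listing = generate(fake_model(), max_tokens=100, stop_sequences=["11"])
print(listing.count("\n"))  # 10

short = generate(fake_model(), max_tokens=3)
print(short)
```

With "11" as the stop sequence the list ends after the tenth item, and with max_tokens set to 3 the output is cut off after three tokens regardless of content.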

References / Resources

What are LLM Parameters? - IBM

LLM Parameters Demystified - Cohere

Context Window:

Context Windows - Claude

What is a context window? - IBM

Most devs don’t understand how context windows work - Matt Pocock (Video)

Temperature:

What is LLM Temperature? - IBM

What is the LLM's Temperature - New Machina (Video)

Top K and Top P:

What are the LLM’s Top-P + Top-K? - New Machina (Video)