What's the difference between temperature and top_p?

quangngoc

temperature is a parameter that controls the “creativity” or randomness of the text generated by GPT-3. A higher temperature (e.g., 0.7) results in more diverse and creative output, while a lower temperature (e.g., 0.2) makes the output more deterministic and focused.
In practice, temperature affects the probability distribution over the possible tokens at each step of the generation process. A temperature of 0 would make the model completely deterministic, always choosing the most likely token.
top_p sampling, also called nucleus sampling, can also be used to control the randomness of the output. Here we set a probability threshold that represents the share of the probability distribution to consider for the next token. In other words, it selects the most probable tokens, in descending order of probability, until their probabilities add up to the given threshold.
For example, if we set top_p to 0.05, the model, once it has generated the probability distribution, considers only the highest-probability tokens whose probabilities sum to 5%. It then randomly selects the next token from that set, in proportion to each token's likelihood.
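The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular API implements it: the function name `top_p_filter` and the four-token distribution are made up for the example.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, then renormalize so the kept values sum to 1."""
    # Sort token indices from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Hypothetical next-token distribution over four tokens (illustration only).
probs = [0.5, 0.3, 0.15, 0.05]
nucleus_small = top_p_filter(probs, 0.05)  # only token 0 survives
nucleus_large = top_p_filter(probs, 0.9)   # tokens 0, 1 and 2 survive
```

Note that with a threshold as low as 0.05, a single token whose probability already exceeds 5% fills the whole nucleus, so sampling becomes effectively greedy for that step.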

quangngoc

"Temperature" and "top_p" (the latter also known as "nucleus sampling") are hyperparameters used in text generation with large language models like OpenAI's GPT series. They help control the randomness of the text generation process. While they both influence the diversity of the generated text, they operate differently.

Temperature:
Temperature is a hyperparameter that scales the logits (the inputs to the softmax function that determines the probability distribution over the next token to be generated) by dividing them by the temperature before applying softmax. A lower temperature (less than 1) makes the model more confident in its choices, leading to less random outputs that can sometimes be repetitive or predictable. As the temperature approaches 0, the model almost always chooses the most likely next token, essentially removing randomness. A higher temperature (greater than 1) increases randomness, encouraging the model to choose less likely tokens, which can introduce more variety and creativity but can sometimes result in less coherent text. A temperature of 1 means no scaling is applied, keeping the original logits.
Top_p (Nucleus Sampling):
Top-p, also known as nucleus sampling, is a sampling strategy that samples only from the smallest set of most probable tokens whose cumulative probability mass reaches a certain threshold (p). Instead of sampling from the entire probability distribution, top-p sampling truncates the distribution dynamically: tokens are added in order of decreasing probability until the cumulative probability reaches p, and the rest are discarded. This allows for controlling the diversity of generated text while ensuring that only a subset of the most probable tokens is considered for sampling.

Both temperature and top_p are used to strike a balance between randomness and determinism during text generation, with temperature generally controlling the "spread" of probabilities over all tokens, while top_p controls the "cut-off" point beyond which tokens are no longer considered for sampling. They can be adjusted depending on whether you want more creative, diverse, and potentially surprising outputs (high temperature, high top_p) or more predictable, safe, and coherent outputs (low temperature, low top_p).
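To make the interaction concrete, here is a rough sketch of one sampling step that applies both settings. It assumes temperature is applied first and the top_p cut-off second, which is a common convention but not necessarily what a given API does internally. The function name `sample_token` and its defaults are made up for this example.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token index: temperature reshapes the distribution,
    then top_p cuts off the low-probability tail."""
    if temperature <= 0:
        # Assumption: treat temperature 0 as greedy decoding (argmax).
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature: divide the logits before softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # top_p: keep the most likely tokens until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Sample among the kept tokens in proportion to their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

greedy = sample_token([2.0, 0.5, 0.5], temperature=0)
narrow = sample_token([2.0, 0.5, 0.5], temperature=1.0, top_p=0.1)
wide = sample_token([2.0, 0.5, 0.5], temperature=2.0, top_p=1.0)
```

With these logits, `greedy` and `narrow` always come out as token 0, while `wide` can be any of the three tokens.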

quangngoc

The token logits are converted to probabilities using the softmax function, but not directly. The logits are first divided by the temperature value, which gives the equation below:

Softmax with temperature T, where z_i is the logit of token i:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

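Once the logits are divided by a temperature greater than 1 (and then softmax’ed), the distribution of probabilities becomes more even, which increases the chance of selecting a wider variety of tokens and makes the output more random. A temperature below 1 has the opposite effect: it sharpens the distribution toward the most likely token.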

For logits 2, 0.5, and 0.5, the graph shows the token probabilities converging toward each other as the temperature goes from 0 to 2 (x-axis).
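The same numbers can be reproduced with a short script. This is a minimal sketch using only the standard library; the function name `softmax_with_temperature` and the chosen temperature values are mine, not taken from the graph.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, 0.5]
# Temperature must be > 0 here; 0 itself would divide by zero.
for t in (0.1, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The first token's probability is close to 1 at T = 0.1, about 0.69 at T = 1, and about 0.51 at T = 2, while the other two rise to meet it.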

As a side note, I like to remember this by the analogy that a hotheaded (read higher temperature) person can say anything, so can an LLM.