ChatGPT and other LLMs rely on input text being broken into pieces, each roughly a word-sized sequence of characters or smaller. These pieces are called sub-word tokens; the process is called tokenization and is performed by a tokenizer.
Tokens can be words or just chunks of characters. For example, the word “hamburger” gets broken up into the tokens “ham”, “bur” and “ger”, while a short and common word like “pear” is a single token. Many tokens start with whitespace, for example, “ hello” and “ bye”.
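You can inspect these splits yourself. The sketch below assumes the open-source tiktoken library (pip install tiktoken); the exact pieces depend on which encoding you load, so the “ham”/“bur”/“ger” split from the older GPT-3 tokenizer may come out differently under a newer encoding.

```python
import tiktoken

# cl100k_base is the encoding used by gpt-3.5-turbo and gpt-4
enc = tiktoken.get_encoding("cl100k_base")

for text in ["hamburger", "pear", " hello", " bye"]:
    token_ids = enc.encode(text)
    # Recover the text of each individual token to see where the splits fall
    pieces = [
        enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
        for t in token_ids
    ]
    print(f"{text!r} -> {len(token_ids)} token(s): {pieces}")
```

Note how the leading space in “ hello” and “ bye” is part of the token itself rather than a separate piece.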
The models learn the statistical relationships between these tokens and excel at predicting the next token in a sequence.
The number of tokens processed in a given API request depends on the length of both your inputs and outputs. As a rough rule of thumb, 1 token is approximately 4 characters or 0.75 words for English text.
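To see how that rule of thumb compares with an exact count, here is a small sketch, again assuming tiktoken is available; the sample sentence is arbitrary.

```python
import tiktoken

text = "Tokens are the unit the models actually read and write."
enc = tiktoken.get_encoding("cl100k_base")

approx_by_chars = len(text) / 4             # ~4 characters per token
approx_by_words = len(text.split()) / 0.75  # ~0.75 words per token
exact = len(enc.encode(text))               # actual count for this encoding

print(f"chars/4 estimate:    {approx_by_chars:.1f}")
print(f"words/0.75 estimate: {approx_by_words:.1f}")
print(f"exact token count:   {exact}")
```

The heuristics are useful for quick budgeting, but for anything that must fit a hard context limit, count tokens with the actual tokenizer.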