WordPiece and Byte Pair Encoding (BPE) are both subword tokenization methods used to efficiently handle large vocabularies in natural language processing tasks. Although they are similar in their approach to breaking down words into more manageable subunits (subwords), there are some differences in their algorithms and applications:
Algorithmic Differences:
- Byte Pair Encoding (BPE):
BPE starts with a vocabulary of individual characters and iteratively merges the most frequent pair of adjacent tokens to create a new token. This process is repeated until a desired vocabulary size is reached or a fixed number of merge operations has been performed. BPE is frequency-based, meaning that it merges pairs purely based on how often they occur in the training corpus.
- WordPiece:
WordPiece also begins with a base vocabulary of characters and iteratively creates new tokens. However, instead of relying on frequency alone, WordPiece uses a likelihood-based criterion for merges: it chooses the merge that most increases the likelihood of the training data under the current model (equivalently, the one that most reduces the model's loss). This makes WordPiece's selection step somewhat more involved than BPE's.
What does this mean exactly? Maximizing the likelihood of the training data is equivalent to finding the symbol pair whose probability, divided by the product of the probabilities of its first and second symbols, is the greatest among all symbol pairs. For example, "u" followed by "g" would only be merged if the probability of "ug" divided by the probabilities of "u" and "g" were greater than for any other symbol pair. Intuitively, WordPiece differs slightly from BPE in that it evaluates what it loses by merging two symbols, to make sure the merge is worth it. The short sketch below contrasts the two selection rules on a toy corpus.
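This is a minimal Python sketch of the selection step only; the toy corpus, its word counts, and all names are assumptions made up for this illustration. It shows which pair BPE would merge (highest raw count) versus which pair WordPiece would merge (highest count-ratio score).

```python
from collections import Counter

# Toy corpus: each word is a tuple of its current symbols, mapped to its
# frequency in the (hypothetical) training data.
corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

def pair_and_symbol_counts(corpus):
    """Count every symbol and every adjacent symbol pair, weighted by word frequency."""
    pair_counts, symbol_counts = Counter(), Counter()
    for word, freq in corpus.items():
        for symbol in word:
            symbol_counts[symbol] += freq
        for a, b in zip(word, word[1:]):
            pair_counts[(a, b)] += freq
    return pair_counts, symbol_counts

pair_counts, symbol_counts = pair_and_symbol_counts(corpus)

# BPE: merge the most frequent adjacent pair.
bpe_choice = max(pair_counts, key=pair_counts.get)

# WordPiece: merge the pair with the highest score
#   score(a, b) = count(ab) / (count(a) * count(b)),
# i.e. a pair is attractive only if it co-occurs often *relative to* how
# common its parts are on their own.
def wordpiece_score(pair):
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

wp_choice = max(pair_counts, key=wordpiece_score)

print("BPE would merge:      ", bpe_choice)  # ('u', 'g'), count 20
print("WordPiece would merge:", wp_choice)   # ('g', 's'), score 5 / (20 * 5) = 0.05
```

On this toy data, BPE picks the most common pair ("u", "g"), while WordPiece prefers ("g", "s"), because "s" essentially never occurs outside that pair, so the merge costs little relative to what it gains.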
Applications:
BPE was originally introduced as a data-compression technique but has since been used extensively in NLP, notably by GPT (Generative Pre-trained Transformer) and its successors. It is widely used due to its simplicity and effectiveness.
WordPiece is utilized in Google's NLP models like BERT (Bidirectional Encoder Representations from Transformers) and other related models. It is favored in some applications for providing a somewhat more optimized vocabulary for the training data.
Resulting Vocabularies:
Because BPE merges purely by frequency, its vocabulary tends to reflect common character sequences, even when those merges do not correspond to meaningful subword units.
WordPiece, with its likelihood-based approach, may yield subword units that are more linguistically sensible, since it is driven not just by how often adjacent symbols co-occur but by how much each merge improves the likelihood of the training data.
Subword Representations:
Both BPE and WordPiece generate a list of subwords such that any word in the dataset can be represented as a sequence of these subwords. This helps in dealing with rare or out-of-vocabulary words by representing them as sequences of subword units.
They also both typically mark word boundaries within the vocabulary: BERT's WordPiece prepends "##" to subwords that continue a word, while BPE implementations handle this in different ways (GPT-2's byte-level BPE, for example, encodes a word's leading space as "Ġ").
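A quick way to see these markers on real vocabularies is the Hugging Face transformers library (an assumption of this sketch, not something the section requires); the exact splits depend on each model's pretrained vocabulary, so the outputs in the comments are indicative rather than guaranteed.

```python
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece vocabulary
gpt2 = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE vocabulary

word = "tokenization"
print(bert.tokenize(word))  # e.g. ['token', '##ization'] -- '##' marks a continuation subword
print(gpt2.tokenize(word))  # e.g. ['token', 'ization'] -- GPT-2 instead marks a leading space with 'Ġ'
```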
In practice, BPE and WordPiece are more similar than different when it comes to real-world applications. Both create vocabularies that allow models to handle words not seen during training (out-of-vocabulary words) and to work more efficiently with less frequent words. The choice between the two often depends on the specific model's architecture and the preferences of the model's designers.