Encodage par paires d'octets
under construction
Definition
Byte pair encoding (BPE) is a data compression and subword tokenization algorithm that iteratively replaces the most frequent pair of adjacent symbols in a corpus with a new symbol that does not occur in the data.
See also segment, traitement automatique de la langue naturelle (natural language processing) and Vocabulary (NLP)
French
Encodage par paires d'octets
English
Byte Pair Encoding
BPE
Byte Pair Encoding is a simple data compression algorithm and one of the most widely used subword tokenization algorithms. It replaces the most frequent pair of consecutive bytes in the data with a new byte that did not occur in the initial dataset. In natural language processing, BPE is used to represent a large vocabulary with a small set of subword units, and the most common words are represented in the vocabulary as a single token.

It is used in all GPT versions, RoBERTa, XLM, FlauBERT and more.
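To make the procedure concrete, here is a minimal Python sketch of the merge-learning loop described above. It is an illustration under stated assumptions, not a reference implementation: the function names are chosen for this sketch, the toy corpus is the classic example from Sennrich et al. (2016), and production tokenizers add byte-level fallback, deterministic tie-breaking and an encoder that applies the learned merges to new text.

import re
from collections import Counter

def get_pair_counts(vocab):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Rewrite every word, fusing the chosen symbol pair into one new symbol.
    # The lookarounds ensure only whole symbols match, never substrings of a
    # previously merged symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    # Start from individual characters plus an end-of-word marker, then
    # greedily merge the most frequent pair until the merge budget runs out.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=5)
print(merges)  # first merges: ('e', 's'), ('es', 't'), ('est', '</w>'), ...

Each merge simultaneously compresses the training data and adds one subword unit to the vocabulary, which is why frequent words end up represented as a single token while rare words decompose into smaller pieces.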
Source

GeeksforGeeks, Byte Pair Encoding (BPE) in NLP: https://www.geeksforgeeks.org/byte-pair-encoding-bpe-in-nlp/