"""Tokenization classes for OpenAI GPT."""

import json
import os
import re
import unicodedata
from typing import Optional, Tuple

from ...tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace
from ...utils import logging


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {
    "vocab_file": "vocab.json",
    "merges_file": "merges.txt",
}

PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {"openai-gpt": "https://huggingface.co/openai-gpt/resolve/main/vocab.json"},
    "merges_file": {"openai-gpt": "https://huggingface.co/openai-gpt/resolve/main/merges.txt"},
}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
    "openai-gpt": 512,
}


def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    tokens = text.split()
    return tokens


class BasicTokenizer(object):
    """
    Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.).

    Args:
        do_lower_case (`bool`, *optional*, defaults to `True`):
            Whether or not to lowercase the input when tokenizing.
        never_split (`Iterable`, *optional*):
            Collection of tokens which will never be split during tokenization. Only has an effect when
            `do_basic_tokenize=True`.
        tokenize_chinese_chars (`bool`, *optional*, defaults to `True`):
            Whether or not to tokenize Chinese characters.

            This should likely be deactivated for Japanese (see this
            [issue](https://github.com/huggingface/transformers/issues/328)).
        strip_accents (`bool`, *optional*):
            Whether or not to strip all accents. If this option is not specified, then it will be determined by the
            value for `lowercase` (as in the original BERT).
        do_split_on_punc (`bool`, *optional*, defaults to `True`):
            In some instances we want to skip the basic punctuation splitting so that later tokenization can capture
            the full context of the words, such as contractions.
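
    Example (illustrative; the output follows directly from the tokenization rules implemented below):

    ```python
    >>> tokenizer = BasicTokenizer(do_lower_case=True)
    >>> tokenizer.tokenize("Hello, WORLD!")
    ['hello', ',', 'world', '!']
    ```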
    """

    def __init__(
        self,
        do_lower_case=True,
        never_split=None,
        tokenize_chinese_chars=True,
        strip_accents=None,
        do_split_on_punc=True,
    ):
        if never_split is None:
            never_split = []
        self.do_lower_case = do_lower_case
        self.never_split = set(never_split)
        self.tokenize_chinese_chars = tokenize_chinese_chars
        self.strip_accents = strip_accents
        self.do_split_on_punc = do_split_on_punc

    def tokenize(self, text, never_split=None):
        """
        Basic Tokenization of a piece of text. For sub-word tokenization, see WordPieceTokenizer.

        Args:
            never_split (`List[str]`, *optional*):
                Kept for backward compatibility purposes. Now implemented directly at the base class level (see
                [`PreTrainedTokenizer.tokenize`]). List of tokens not to split.
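
        Example (illustrative; `[SEP]` is just an arbitrary protected token):

        ```python
        >>> BasicTokenizer().tokenize("Hello [SEP]", never_split=["[SEP]"])
        ['hello', '[SEP]']
        ```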
        """
        # union() returns a new set combining the instance-level and call-level collections.
        never_split = self.never_split.union(set(never_split)) if never_split else self.never_split
        text = self._clean_text(text)

        if self.tokenize_chinese_chars:
            text = self._tokenize_chinese_chars(text)
        # Normalizing to NFC prevents treating the same character encoded with different
        # unicode codepoint sequences as different characters.
        unicode_normalized_text = unicodedata.normalize("NFC", text)
        orig_tokens = whitespace_tokenize(unicode_normalized_text)
        split_tokens = []
        for token in orig_tokens:
            if token not in never_split:
                if self.do_lower_case:
                    token = token.lower()
                    if self.strip_accents is not False:
                        token = self._run_strip_accents(token)
                elif self.strip_accents:
                    token = self._run_strip_accents(token)
            split_tokens.extend(self._run_split_on_punc(token, never_split))

        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens

    def _run_strip_accents(self, text):
        """Strips accents from a piece of text."""
        text = unicodedata.normalize("NFD", text)
        output = []
        for char in text:
            cat = unicodedata.category(char)
            if cat == "Mn":
                continue
            output.append(char)
        return "".join(output)

    def _run_split_on_punc(self, text, never_split=None):
        """Splits punctuation on a piece of text."""
        if not self.do_split_on_punc or (never_split is not None and text in never_split):
            return [text]
        chars = list(text)
        i = 0
        start_new_word = True
        output = []
        while i < len(chars):
            char = chars[i]
            if _is_punctuation(char):
                output.append([char])
                start_new_word = True
            else:
                if start_new_word:
                    output.append([])
                start_new_word = False
                output[-1].append(char)
            i += 1

        return ["".join(x) for x in output]

    def _tokenize_chinese_chars(self, text):
        """Adds whitespace around any CJK character."""
        output = []
        for char in text:
            cp = ord(char)
            if self._is_chinese_char(cp):
                output.append(" ")
                output.append(char)
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)

    def _is_chinese_char(self, cp):
        """Checks whether CP is the codepoint of a CJK character."""
        # This defines a "chinese character" as anything in the CJK Unicode blocks.
        # Note that these blocks do not cover all Japanese and Korean characters:
        # Hangul, Hiragana and Katakana live in other blocks and are written with
        # space-separated words, so they are handled like any other script.
        if (
            (cp >= 0x4E00 and cp <= 0x9FFF)
            or (cp >= 0x3400 and cp <= 0x4DBF)
            or (cp >= 0x20000 and cp <= 0x2A6DF)
            or (cp >= 0x2A700 and cp <= 0x2B73F)
            or (cp >= 0x2B740 and cp <= 0x2B81F)
            or (cp >= 0x2B820 and cp <= 0x2CEAF)
            or (cp >= 0xF900 and cp <= 0xFAFF)
            or (cp >= 0x2F800 and cp <= 0x2FA1F)
        ):
            return True

        return False

    def _clean_text(self, text):
        """Performs invalid character removal and whitespace cleanup on text."""
        output = []
        for char in text:
            cp = ord(char)
            if cp == 0 or cp == 0xFFFD or _is_control(char):
                continue
            if _is_whitespace(char):
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)


def get_pairs(word):
    """
    Return the set of symbol pairs in a word. A word is represented as a tuple of symbols (symbols being
    variable-length strings).
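
    Example (illustrative):

    ```python
    >>> sorted(get_pairs(("h", "e", "l", "l", "o</w>")))
    [('e', 'l'), ('h', 'e'), ('l', 'l'), ('l', 'o</w>')]
    ```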
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs


def text_standardize(text):
    """
    Fixes some issues the SpaCy tokenizer had on the BooksCorpus dataset and performs some whitespace
    standardization.
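
    Example (illustrative):

    ```python
    >>> text_standardize("fixes…  some—issues")
    'fixes... some - issues'
    ```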
    """
    text = text.replace("—", "-")
    text = text.replace("–", "-")
    text = text.replace("―", "-")
    text = text.replace("…", "...")
    text = text.replace("´", "'")
    text = re.sub(r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)''', r" \1 ", text)
    text = re.sub(r"\s*\n\s*", " \n ", text)
    text = re.sub(r"[^\S\n]+", " ", text)
    return text.strip()


class OpenAIGPTTokenizer(PreTrainedTokenizer):
    """
    Construct a GPT Tokenizer. Based on Byte-Pair-Encoding with the following peculiarities:

    - lowercases all inputs,
    - uses `SpaCy` tokenizer and `ftfy` for pre-BPE tokenization if they are installed, falling back to BERT's
      `BasicTokenizer` if not.

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
        merges_file (`str`):
            Path to the merges file.
        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
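
    Example (illustrative; `openai-gpt` is the checkpoint referenced by the vocabulary URLs above, and
    `from_pretrained` / `convert_tokens_to_ids` are inherited from [`PreTrainedTokenizer`]):

    ```python
    >>> from transformers import OpenAIGPTTokenizer

    >>> tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
    >>> tokens = tokenizer.tokenize("Hello world!")  # lowercased, BPE pieces end in "</w>"
    >>> ids = tokenizer.convert_tokens_to_ids(tokens)
    ```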
    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
        try:
            import ftfy
            from spacy.lang.en import English

            _nlp = English()
            self.nlp = _nlp.tokenizer
            self.fix_text = ftfy.fix_text
        except ImportError:
            logger.warning("ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.")
            self.nlp = BasicTokenizer(do_lower_case=True)
            self.fix_text = None

        with open(vocab_file, encoding="utf-8") as vocab_handle:
            self.encoder = json.load(vocab_handle)
        self.decoder = {v: k for k, v in self.encoder.items()}
        with open(merges_file, encoding="utf-8") as merges_handle:
            merges = merges_handle.read().split("\n")[1:-1]
        merges = [tuple(merge.split()) for merge in merges]
        self.bpe_ranks = dict(zip(merges, range(len(merges))))
        self.cache = {}

        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def do_lower_case(self):
        return True

    @property
    def vocab_size(self):
        return len(self.encoder)

    def get_vocab(self):
        return dict(self.encoder, **self.added_tokens_encoder)

    def bpe(self, token):
        word = tuple(token[:-1]) + (token[-1] + "</w>",)
        if token in self.cache:
            return self.cache[token]
        pairs = get_pairs(word)

        if not pairs:
            return token + "</w>"

        while True:
            # Merge the lowest-ranked bigram until no known merge remains.
            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    j = word.index(first, i)
                except ValueError:
                    new_word.extend(word[i:])
                    break
                else:
                    new_word.extend(word[i:j])
                    i = j

                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                    new_word.append(first + second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)
        word = " ".join(word)
        if word == "\n  </w>":
            word = "\n</w>"
        self.cache[token] = word
        return word

    def _tokenize(self, text):
        """Tokenize a string."""
        split_tokens = []
        if self.fix_text is None:
            # Using BERT's BasicTokenizer
            text = self.nlp.tokenize(text)
            for token in text:
                split_tokens.extend(list(self.bpe(token).split(" ")))
        else:
            # Using SpaCy & ftfy (original tokenization process of OpenAI GPT)
            text = self.nlp(text_standardize(self.fix_text(text)))
            for token in text:
                split_tokens.extend(list(self.bpe(token.text.lower()).split(" ")))
        return split_tokens

    def _convert_token_to_id(self, token):
        """Converts a token (str) to an id using the vocab."""
        return self.encoder.get(token, self.encoder.get(self.unk_token))

    def _convert_id_to_token(self, index):
        """Converts an id to a token (BPE) using the vocab."""
        return self.decoder.get(index, self.unk_token)

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) to a single string."""
        out_string = "".join(tokens).replace("</w>", " ").strip()
        return out_string

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return
        vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )
        merge_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"]
        )

        with open(vocab_file, "w", encoding="utf-8") as f:
            f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n")

        index = 0
        with open(merge_file, "w", encoding="utf-8") as writer:
            writer.write("#version: 0.2\n")
            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
                if index != token_index:
                    logger.warning(
                        f"Saving vocabulary to {merge_file}: BPE merge indices are not consecutive."
                        " Please check that the tokenizer is not corrupted!"
                    )
                    index = token_index
                writer.write(" ".join(bpe_tokens) + "\n")
                index += 1

        return vocab_file, merge_file