tokenize - Huggingface 的 BERT 标记器未添加填充标记
问题描述
从文档中并不完全清楚,但我可以看到它BertTokenizer
是用 初始化的pad_token='[PAD]'
,所以我假设当你用它编码时,add_special_tokens=True
它会自动填充它。鉴于此pad_token_id=0
,我看不到任何0
stoken_ids
但是:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text, add_special_tokens=True, max_length=2048)
# Print the original sentence.
print('Original: ', text)
# Print the sentence split into tokens.
print('\nTokenized: ', tokens)
# Print the sentence mapped to token ids.
print('\nToken IDs: ', token_ids)
输出:
Original: Toronto's key stock index ended higher in brisk trading on Thursday, extending Wednesday's rally despite being weighed down by losses on Wall Street.
The TSE 300 Composite Index rose 29.80 points to close at 5828.62, outperforming the Dow Jones Industrial Average which slumped 21.27 points to finish at 6658.60.
Toronto added to Wednesday's 55-point rally while investors took profits in New York after the Dow's 92-point gains, said MMS International analyst Katherine Beattie.
"That shows that the markets are very fragile," Beattie said. "They (investors) want to take advantage of any strength to sell," she said.
Toronto was also buoyed by its heavyweight gold group which jumped nearly 2.2 percent, aided by firmer COMEX gold prices. The key June contract rose $1.00 to $344.30.
Ten of Toronto's 14 sub-indices posted gains, led by golds, transportation, forestry products and consumer products.
The weak side included conglomerates, base metals and utilities.
Trading was heavy at 100 million shares worth C$1.54 billion ($1.1 billion).
Advancing stocks outnumbered declines 556 to 395, with 276 issues flat.
Among hot stocks, Bre-X Minerals Ltd. rose 0.13 to 2.30 on 5.0 million shares as investors continued to consider the viability of its Busang gold discovery in Indonesia.
Kenting Energy Services Inc. rose 0.25 to 9.05 after Precision Drilling Corp. amended its takeover offer
Bakery and foodstuffs maker George Weston Ltd. jumped 4.50 to close at 74.50, the TSE's top gainer.
Tokenized: ['toronto', "'", 's', 'key', 'stock', 'index', 'ended', 'higher', 'in', 'brisk', 'trading', 'on', 'thursday', ',', 'extending', 'wednesday', "'", 's', 'rally', 'despite', 'being', 'weighed', 'down', 'by', 'losses', 'on', 'wall', 'street', '.', 'the', 'ts', '##e', '300', 'composite', 'index', 'rose', '29', '.', '80', 'points', 'to', 'close', 'at', '58', '##28', '.', '62', ',', 'out', '##per', '##form', '##ing', 'the', 'dow', 'jones', 'industrial', 'average', 'which', 'slumped', '21', '.', '27', 'points', 'to', 'finish', 'at', '66', '##58', '.', '60', '.', 'toronto', 'added', 'to', 'wednesday', "'", 's', '55', '-', 'point', 'rally', 'while', 'investors', 'took', 'profits', 'in', 'new', 'york', 'after', 'the', 'dow', "'", 's', '92', '-', 'point', 'gains', ',', 'said', 'mm', '##s', 'international', 'analyst', 'katherine', 'beat', '##tie', '.', '"', 'that', 'shows', 'that', 'the', 'markets', 'are', 'very', 'fragile', ',', '"', 'beat', '##tie', 'said', '.', '"', 'they', '(', 'investors', ')', 'want', 'to', 'take', 'advantage', 'of', 'any', 'strength', 'to', 'sell', ',', '"', 'she', 'said', '.', 'toronto', 'was', 'also', 'bu', '##oy', '##ed', 'by', 'its', 'heavyweight', 'gold', 'group', 'which', 'jumped', 'nearly', '2', '.', '2', 'percent', ',', 'aided', 'by', 'firm', '##er', 'come', '##x', 'gold', 'prices', '.', 'the', 'key', 'june', 'contract', 'rose', '$', '1', '.', '00', 'to', '$', '344', '.', '30', '.', 'ten', 'of', 'toronto', "'", 's', '14', 'sub', '-', 'indices', 'posted', 'gains', ',', 'led', 'by', 'gold', '##s', ',', 'transportation', ',', 'forestry', 'products', 'and', 'consumer', 'products', '.', 'the', 'weak', 'side', 'included', 'conglomerate', '##s', ',', 'base', 'metals', 'and', 'utilities', '.', 'trading', 'was', 'heavy', 'at', '100', 'million', 'shares', 'worth', 'c', '$', '1', '.', '54', 'billion', '(', '$', '1', '.', '1', 'billion', ')', '.', 'advancing', 'stocks', 'outnumbered', 'declines', '55', '##6', 'to', '395', ',', 'with', '276', 'issues', 'flat', '.', 'among', 'hot', 'stocks', ',', 'br', '##e', '-', 'x', 'minerals', 'ltd', '.', 'rose', '0', '.', '13', 'to', '2', '.', '30', 'on', '5', '.', '0', 'million', 'shares', 'as', 'investors', 'continued', 'to', 'consider', 'the', 'via', '##bility', 'of', 'its', 'bus', '##ang', 'gold', 'discovery', 'in', 'indonesia', '.', 'kent', '##ing', 'energy', 'services', 'inc', '.', 'rose', '0', '.', '25', 'to', '9', '.', '05', 'after', 'precision', 'drilling', 'corp', '.', 'amended', 'its', 'takeover', 'offer', 'bakery', 'and', 'foods', '##tu', '##ffs', 'maker', 'george', 'weston', 'ltd', '.', 'jumped', '4', '.', '50', 'to', 'close', 'at', '74', '.', '50', ',', 'the', 'ts', '##e', "'", 's', 'top', 'gain', '##er', '.']
Token IDs: [101, 4361, 1005, 1055, 3145, 4518, 5950, 3092, 3020, 1999, 28022, 6202, 2006, 9432, 1010, 8402, 9317, 1005, 1055, 8320, 2750, 2108, 12781, 2091, 2011, 6409, 2006, 2813, 2395, 1012, 1996, 24529, 2063, 3998, 12490, 5950, 3123, 2756, 1012, 3770, 2685, 2000, 2485, 2012, 5388, 22407, 1012, 5786, 1010, 2041, 4842, 14192, 2075, 1996, 23268, 3557, 3919, 2779, 2029, 14319, 2538, 1012, 2676, 2685, 2000, 3926, 2012, 5764, 27814, 1012, 3438, 1012, 4361, 2794, 2000, 9317, 1005, 1055, 4583, 1011, 2391, 8320, 2096, 9387, 2165, 11372, 1999, 2047, 2259, 2044, 1996, 23268, 1005, 1055, 6227, 1011, 2391, 12154, 1010, 2056, 3461, 2015, 2248, 12941, 9477, 3786, 9515, 1012, 1000, 2008, 3065, 2008, 1996, 6089, 2024, 2200, 13072, 1010, 1000, 3786, 9515, 2056, 1012, 1000, 2027, 1006, 9387, 1007, 2215, 2000, 2202, 5056, 1997, 2151, 3997, 2000, 5271, 1010, 1000, 2016, 2056, 1012, 4361, 2001, 2036, 20934, 6977, 2098, 2011, 2049, 8366, 2751, 2177, 2029, 5598, 3053, 1016, 1012, 1016, 3867, 1010, 11553, 2011, 3813, 2121, 2272, 2595, 2751, 7597, 1012, 1996, 3145, 2238, 3206, 3123, 1002, 1015, 1012, 4002, 2000, 1002, 29386, 1012, 2382, 1012, 2702, 1997, 4361, 1005, 1055, 2403, 4942, 1011, 29299, 6866, 12154, 1010, 2419, 2011, 2751, 2015, 1010, 5193, 1010, 13116, 3688, 1998, 7325, 3688, 1012, 1996, 5410, 2217, 2443, 22453, 2015, 1010, 2918, 11970, 1998, 16548, 1012, 6202, 2001, 3082, 2012, 2531, 2454, 6661, 4276, 1039, 1002, 1015, 1012, 5139, 4551, 1006, 1002, 1015, 1012, 1015, 4551, 1007, 1012, 10787, 15768, 21943, 26451, 4583, 2575, 2000, 24673, 1010, 2007, 25113, 3314, 4257, 1012, 2426, 2980, 15768, 1010, 7987, 2063, 1011, 1060, 13246, 5183, 1012, 3123, 1014, 1012, 2410, 2000, 1016, 1012, 2382, 2006, 1019, 1012, 1014, 2454, 6661, 2004, 9387, 2506, 2000, 5136, 1996, 3081, 8553, 1997, 2049, 3902, 5654, 2751, 5456, 1999, 6239, 1012, 5982, 2075, 2943, 2578, 4297, 1012, 3123, 1014, 1012, 2423, 2000, 1023, 1012, 5709, 2044, 11718, 15827, 13058, 1012, 13266, 2049, 15336, 3749, 18112, 1998, 9440, 8525, 21807, 9338, 2577, 12755, 5183, 1012, 5598, 1018, 1012, 2753, 2000, 2485, 2012, 6356, 1012, 2753, 1010, 1996, 24529, 2063, 1005, 1055, 2327, 5114, 2121, 1012, 102]
解决方案
不,它不会。有一个不同的参数pad_to_max_length您必须将其设置为 True 才能添加填充标记。add_special_tokens
将添加 [CLS] 和 [SEP] 令牌(分别为 101 和 102)。
推荐阅读
- android - ScrollView 不适用于固定页脚
- asp.net - 从 asp.net 中的服务器端从 Json 中删除斜线
- javascript - d3.event 在 webpack 构建中为空
- redis - 如果设置了过期,则 nodejs 的 redis 模块不会保存
- git - git上的脏分支
- mongodb - 如何使用 mongodb 在单个查询中更新一个值并增加另一个值
- duplicates - 即使使用 withIdAttribute 也会在 Apache Beam / Dataflow 输入上重复
- javascript - 当我单击php中的pdf链接或使用.htaccess文件时如何隐藏浏览器的pdf下载选项
- primefaces - PrimeFaces 管理主题部分适用
- c# - 用四个不同的数字随机填充列表