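python - "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers." in Hugging Face tokenization
Problem description
I am running into the error in the title, similar to the one in another question (link). In this case I understand what the error means, but I would like to know which row in my data is causing it. Can anyone give me a hint on how to track it down?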
!pip install transformers
!pip install datasets

from transformers import BertTokenizer
from datasets import load_dataset

pos = '/content/drive/MyDrive/positive_preprocess.csv'
neg = '/content/drive/MyDrive/negative_preprocess.csv'

# Load the two CSV files as the "train" and "test" splits
train, test = load_dataset("csv", data_files={"train": pos, "test": neg}, split=['train', 'test'])

# Drop the columns that are not needed for tokenization
train = train.remove_columns(column_names=['Unnamed: 0', 'hashtag', 'label'])
test = test.remove_columns(column_names=['Unnamed: 0', 'hashtag', 'label'])

# `tokenizer` is not defined in the snippet as posted; it is assumed to be
# a BertTokenizer instance created earlier in the notebook.
def tokenize_function(data):
    return tokenizer(data["text"])

tokenized_train = train.map(tokenize_function, batched=True, num_proc=2)
tokenized_test = test.map(tokenize_function, batched=True, num_proc=2)
Traceback:
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/multiprocess/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 174, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 340, in wrapper
out = func(self, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1823, in _map_single
offset=offset,
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1715, in apply_function_on_filtered_inputs
function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
File "<ipython-input-16-aa03da28a7e7>", line 2, in tokenize_function
return tokenizer(data["text"])
File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py", line 2271, in __call__
**kwargs,
File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py", line 2456, in batch_encode_plus
**kwargs,
File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils.py", line 545, in _batch_encode_plus
first_ids = get_input_ids(ids)
File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils.py", line 526, in get_input_ids
"Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
"""
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-42-61ad0d3cfb1a> in <module>()
----> 1 tokenized_train= train.map(tokenize_function,batched=True, num_proc=2)
2 tokenized_test= test.map(tokenize_function,batched=True, num_proc=2)
12 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils.py in get_input_ids()
524 else:
525 raise ValueError(
--> 526 "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
527 )
528
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
Solution
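Judging from the traceback, a likely cause is that some entries in the text column are not strings. An empty CSV cell is typically loaded as None, and the slow BertTokenizer raises this ValueError when a batch element is neither a string nor a list/tuple of strings or integers. Below is a minimal sketch of how the offending rows could be located. The function name `find_invalid_rows` and the sample list are hypothetical and used only for illustration; the list stands in for `train["text"]`.

```python
def find_invalid_rows(texts):
    """Return (index, value) pairs for every entry that is not a str."""
    return [(i, value) for i, value in enumerate(texts) if not isinstance(value, str)]


# Hypothetical sample standing in for train["text"]; None and NaN mimic
# empty cells in the CSV file. An empty string is still a valid str.
sample = ["great film", None, "terrible plot", float("nan"), ""]

bad_rows = find_invalid_rows(sample)
for index, value in bad_rows:
    print(f"row {index}: {value!r} ({type(value).__name__})")
```

On the real dataset this would be called as `find_invalid_rows(train["text"])`, and the reported indices are row positions within that split. Such rows could then be dropped before tokenizing, for example with `train.filter(lambda row: isinstance(row["text"], str))`, assuming that discarding them is acceptable for the task.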