Why does tensorflow give an InvalidArgumentError when splitting the dataframe before tokenization?

Problem description

This is a strange problem and I don't know why it happens or how to fix it. When I run some simple test code for topic classification using tensorflow and BERT:

from transformers import DistilBertTokenizer
from transformers import TFDistilBertForSequenceClassification

import tensorflow as tf

from sklearn.preprocessing import LabelEncoder

import pandas as pd
import numpy as np
from ast import literal_eval  # Lists of topics need to be split


def preprocess_text(df):
    # Remove punctuation and numbers
    df['text'] = df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)

    # Single character removal
    df['text'] = df['text'].str.replace(r"\s+[a-zA-Z]\s+", ' ', regex=True)

    # Removing multiple spaces
    df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)

    # Turn topic column into list from string with square brackets
    df['topic'] = df['topic'].apply(literal_eval)

    # Remove NaNs
    df['text'] = df['text'].fillna('')
    df['topic'] = df['topic'].fillna('')

    return df


# Load dataframe with just text and topic columns
df = pd.DataFrame()
for chunk in pd.read_csv(r'test.csv',
                         sep='|', chunksize=1000, usecols=['text', 'topic']):  # Just read the first 1000 rows for testing
    df = chunk
    break
df = preprocess_text(df)

# Unstack topics columns
df = df.explode('topic').reset_index(drop=True)
df['topic'] = df['topic'].astype('category')
df['topic'] = df['topic'].cat.add_categories('N/A').fillna('N/A')

# Encode labels
le = LabelEncoder()
df['topic_encoded'] = le.fit_transform(df['topic'])

text = list(df['text'])
labels = list(df['topic_encoded'])

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(text, return_tensors='tf', truncation=True, padding=True, max_length=128)

train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), labels))

model = TFDistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=len(np.unique(np.array(labels)))
)

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss, metrics=['accuracy'])

model.fit(
    train_dataset.shuffle(100).batch(8),
    epochs=2
)

everything works perfectly, as you would expect. But when I add the following code (after I encode the topics):

# Consider only Top n tags - want to keep a smaller dataset for testing
n = 5
top_tags = df['topic_encoded'].value_counts()[:n].index.tolist()
df = df[df['topic_encoded'].isin(top_tags)].sort_values(by='topic_encoded', ascending=True).reset_index(drop=True)

I get the error:

tensorflow.python.framework.errors_impl.InvalidArgumentError:  Received a label value of 16 which is outside the valid range of [0, 5).  Label values: 0 0 0 0 0 0 16 0
     [[node compute_loss/sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at \Users\Projects\transformers_test\venv\lib\site-packages\transformers\modeling_tf_utils.py:220) ]] [Op:__inference_train_function_12942]

Function call stack:
train_function

I have no idea why adding this causes the problem. Even if I specify loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) it still gives the same error.

Edit

I have fiddled with splitting the df in a number of ways, and it seems that however I slice it, e.g. df = df[df['topic_encoded'] == 1], I get an InvalidArgumentError. So the problem is caused by taking only a part of the dataframe.

Any help would be greatly appreciated.

Tags: python, pandas, tensorflow, machine-learning, bert-language-model

Solution
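
The error message itself points at the cause: Received a label value of 16 which is outside the valid range of [0, 5). The LabelEncoder was fit on the full dataframe, so the rows that survive the top-n filter keep their original integer codes (here as high as 16), while num_labels is computed from the filtered data as 5. Sparse softmax cross-entropy requires every label to lie in [0, num_labels), so any surviving code >= 5 raises the InvalidArgumentError. The same logic explains the edit: after df = df[df['topic_encoded'] == 1] there is exactly one unique label, so num_labels is 1, and the label value 1 falls outside [0, 1).

The fix is to encode (or re-encode) the labels after filtering, so the codes are contiguous and start at 0. A minimal sketch, leaving the rest of the original script unchanged:

# Consider only Top n tags - filter on the raw topic column first...
n = 5
top_tags = df['topic'].value_counts()[:n].index.tolist()
df = df[df['topic'].isin(top_tags)].sort_values(by='topic', ascending=True).reset_index(drop=True)

# ...then fit the encoder on the filtered data, so the integer
# labels are exactly 0 .. n-1
le = LabelEncoder()
df['topic_encoded'] = le.fit_transform(df['topic'])

Equivalently, if you would rather keep filtering on topic_encoded as in the original snippet, re-fitting the encoder on the already-filtered column remaps the surviving codes into 0 .. n-1 with the same effect:

df['topic_encoded'] = le.fit_transform(df['topic_encoded'])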

