I have a dataset with one target column and two text columns. It is an NLP problem I am trying to solve with deep learning

Problem description

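I am working with a dataset that has three fields: one target field and two text fields. It is essentially an NLP problem. I am trying a deep learning approach, but because I need to take both text fields into account, I get an error when tokenizing X_train after the train-test split. I read the dataset and label-encoded the target column. I cleaned the text columns and then normalized them further with a stemmer. I stored the two text columns in X and the target column in y, then performed a train-test split. After that, I tried to tokenize X_train, which raised an error. Review Text and Review Title are the text columns.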

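import pandas as pd
# Import path assumed; the original post does not show where Tokenizer comes from.
from keras.preprocessing.text import Tokenizer

df = pd.read_csv('train_amazon.csv')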
df.head(10)

df['topic'].nunique()

df['topic'].value_counts()

df['Review Text'].isnull().any()

df['Review Title'].isnull().any()

df['topic'].isnull().any()

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['topic'] = le.fit_transform(df['topic'])

df.head()

le.classes_

dummy_y = pd.get_dummies(df['topic']).values

X = df.iloc[:, :-1].values
y = dummy_y

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 101)

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
vocabulary_size = len(tokenizer.word_index) + 1
vocabulary_size

I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-67-7ab7cb886988> in <module>
      1 tokenizer = Tokenizer()
----> 2 tokenizer.fit_on_texts(X_train)
      3 vocabulary_size = len(tokenizer.word_index) + 1
      4 vocabulary_size

~\Anaconda3\lib\site-packages\keras_preprocessing\text.py in fit_on_texts(self, texts)
    221                                             self.filters,
    222                                             self.lower,
--> 223                                             self.split)
    224             for w in seq:
    225                 if w in self.word_counts:

~\Anaconda3\lib\site-packages\keras_preprocessing\text.py in text_to_word_sequence(text, filters, lower, split)
     41     """
     42     if lower:
---> 43         text = text.lower()
     44 
     45     if sys.version_info < (3,):

AttributeError: 'numpy.ndarray' object has no attribute 'lower'

My X_train has shape (4469, 2).

My X_train looks like this:

array([['use sinc seal miss', 'broken seal'],
       ['took week immedi effect 1 2 hour hour ingest includ tingl extrem slight relax probabl help anxieti much like medic numb make care less what is bother you howev product made difficult focus short term memori sever impact bout week stuff good detail orient job trust take longer sinc long term effect like unknown care take unregul supplements!!!!',
        'careless'],
       ['smell aw mean rancid could make sick sooooo annoy wish could money back',
        'rancid pill'],
       ...,
       ['didn t realiz serv size capsul purchas huge deal fault prefer take pill vitamin idea it s work help',
        'vitamin yeah'],
       ['horribl taste! wast money', 'horribl fake tast'],
       ['nasti stuff work with thick dropper doesn t work well finger bottl leav sticki mess don t lick bitter',
        'nasti']], dtype=object)

Tags: scikit-learn, nlp, tokenize

Solution
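
The traceback points at the likely cause: fit_on_texts iterates over its argument and expects each item to be a string (or a list of tokens). X_train has shape (4469, 2), so each item is a two-element numpy.ndarray row, and text_to_word_sequence fails when it calls .lower() on that row. One possible fix, assuming it is acceptable to merge the review text and the review title into a single document per sample, is to join each row into one string before tokenizing. The following is only a minimal sketch: the helper name join_text_columns and the sample rows are hypothetical and stand in for the real X_train.

```python
def join_text_columns(rows, sep=" "):
    """Collapse each row of text cells into one string per sample.

    `rows` can be a list of lists or a 2-D NumPy object array such as
    X_train; iterating over either yields one row at a time.
    """
    return [sep.join(str(cell) for cell in row) for row in rows]


# Hypothetical stand-in for X_train, using rows quoted in the question.
sample_rows = [
    ["use sinc seal miss", "broken seal"],
    ["horribl taste! wast money", "horribl fake tast"],
]

# One flat string per sample: the 1-D shape fit_on_texts expects.
train_texts = join_text_columns(sample_rows)
print(train_texts)
```

With the real data, train_texts = join_text_columns(X_train) would give a flat list of strings that tokenizer.fit_on_texts(train_texts) can process. The same helper would need to be applied to X_test before calling tokenizer.texts_to_sequences on it.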


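If the review text and the review title should stay separate inputs (for example, to feed two branches of a model), another option is to pull each column out as its own 1-D sequence of strings, which is the shape fit_on_texts accepts. The sketch below uses plain Python lists as a hypothetical stand-in for X_train; with the NumPy array from the question, X_train[:, 0] and X_train[:, 1] would select the same columns. The helper name split_text_columns is an assumption made for illustration.

```python
def split_text_columns(rows):
    """Return one list of strings per column of a 2-D collection of text."""
    columns = zip(*rows)  # transpose: rows -> columns
    return [[str(cell) for cell in column] for column in columns]


# Hypothetical stand-in for X_train, using rows quoted in the question.
sample_rows = [
    ["use sinc seal miss", "broken seal"],
    ["horribl taste! wast money", "horribl fake tast"],
]

# Column 0 holds the review text, column 1 the review title.
review_texts, review_titles = split_text_columns(sample_rows)
print(review_texts)
print(review_titles)
```

A single tokenizer could then be fitted on both lists together (review_texts + review_titles) so the two inputs share one vocabulary, or each column could get its own tokenizer. Which is preferable depends on the model.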