AttributeError: 'NoneType' object has no attribute 'lower' in Python. How to preprocess text content before tokenizing it?

Problem description

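The dataset I am using looks like the sample below. It is a video captioning dataset with the captions in the "caption" column, and a single video clip can have multiple captions.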

video_id       caption
mv89psg6zh4    A bird is bathing in a sink.
mv89psg6zh4    A faucet is running while a bird stands.
mv89psg6zh4    A bird gets washed.
mv89psg6zh4    A parakeet is taking a shower in a sink.
mv89psg6zh4    The bird is taking a bath under the faucet.
mv89psg6zh4    A bird is standing in a sink drinking water.
R2DvpPTfl-E    PLAYING GAME ON LAPTOP.
R2DvpPTfl-E    THE MAN IS WATCHING LAPTOP.
l7x8uIdg2XU    A woman is pouring ingredients into a bowl.
l7x8uIdg2XU    A woman is adding milk to some pasta.
l7x8uIdg2XU    A person adds ingredients to pasta. 
l7x8uIdg2XU    the girls are doing the cooking.

My code works on the "CandidateA" JSON file. However, it does not work on the "Referencedf" JSON file, which looks like this (the full file can be found here):

(Excerpt only):
[{"video_id":"mv89psg6zh4_33_46","caption":"A bird in a sink keeps getting under the running water from a faucet."},{"video_id":"mv89psg6zh4_33_46","caption":"A bird is bathing in a sink."},{"video_id":"mv89psg6zh4_33_46","caption":"A bird is splashing around under a running faucet."},{"video_id":"60x_yxy7Sfw_1_7","caption":"A MAN IS WATCHING A LAPTOP."},{"video_id":"60x_yxy7Sfw_1_7","caption":"A man is sitting at his computer."}]

This is the code I am applying:

import json
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

with open("Referencedf.json", 'r') as f:
    datastore = json.load(f)

captions = []
video_id = []

for item in datastore:
    captions.append(item['caption'])

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(captions)

The error I get is:

AttributeError                            Traceback (most recent call last)
<ipython-input-25-63fee6e467f1> in <module>
      1 tokenizer = Tokenizer(oov_token="<OOV>")
----> 2 tokenizer.fit_on_texts(captions)
      3 word_index = tokenizer.word_index
      4 print(len(word_index))

~\anaconda3\lib\site-packages\keras_preprocessing\text.py in fit_on_texts(self, texts)
    221                                             self.filters,
    222                                             self.lower,
--> 223                                             self.split)
    224             for w in seq:
    225                 if w in self.word_counts:

~\anaconda3\lib\site-packages\keras_preprocessing\text.py in text_to_word_sequence(text, filters, lower, split)
     41     """
     42     if lower:
---> 43         text = text.lower()
     44 
     45     if sys.version_info < (3,):

AttributeError: 'NoneType' object has no attribute 'lower'

Edit:

As @MahindraSinghMeena suggested, I removed the null rows from the dataframe beforehand, which avoids the error:

df = df.dropna()
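
A minimal sketch of that cleanup, assuming the JSON records are loaded into a pandas DataFrame with `video_id` and `caption` columns. The sample rows are illustrative stand-ins for the real file. Restricting `dropna` to the `caption` column is a design choice made here so that rows missing only some other field are not discarded.

```python
import pandas as pd

# Illustrative records in the same shape as Referencedf.json;
# the None caption mimics the kind of entry that triggers the error.
records = [
    {"video_id": "mv89psg6zh4_33_46", "caption": "A bird is bathing in a sink."},
    {"video_id": "SKhmFSV-XB0_12_18", "caption": None},
    {"video_id": "60x_yxy7Sfw_1_7", "caption": "A man is sitting at his computer."},
]

df = pd.DataFrame(records)

# Drop only the rows whose caption is missing.
df = df.dropna(subset=["caption"])

# Plain list of strings, suitable for tokenizer.fit_on_texts(captions).
captions = df["caption"].tolist()
```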

Tags: python, tensorflow, nlp

Solution


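This happens when the texts passed to the Tokenizer contain invalid data. As the error message indicates, at least one element is None, and the Tokenizer calls .lower() on every element. The data should therefore be cleaned to remove such entries.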

The following snippet shows that one entry in the file has an invalid caption.

import json

# Load the reference captions and print every entry whose caption is missing.
with open('/Referencedf.json', 'r') as f:
    datastore = json.load(f)

for d in datastore:
    if d['caption'] is None:
        print(d)

Output:

{'video_id': 'SKhmFSV-XB0_12_18', 'caption': None}
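
One way to do this cleaning without pandas is to skip invalid captions while building the list. The sketch below is an assumption about how the loop from the question could be adapted, not part of the original answer. The inline `datastore` stands in for the result of `json.load`, and `clean_captions` is a hypothetical helper name.

```python
def clean_captions(datastore):
    """Return the caption strings, skipping entries whose caption is missing or not text."""
    captions = []
    for item in datastore:
        caption = item.get("caption")
        # The Tokenizer calls .lower() on each text, so keep only non-empty strings.
        if isinstance(caption, str) and caption.strip():
            captions.append(caption)
    return captions


# Stand-in for the list returned by json.load(f).
datastore = [
    {"video_id": "mv89psg6zh4_33_46", "caption": "A bird is bathing in a sink."},
    {"video_id": "SKhmFSV-XB0_12_18", "caption": None},
    {"video_id": "60x_yxy7Sfw_1_7", "caption": "A MAN IS WATCHING A LAPTOP."},
]

captions = clean_captions(datastore)
print(captions)
```

The resulting list can then be passed to `tokenizer.fit_on_texts(captions)` as in the question.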
