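python - AttributeError: 'NoneType' object has no attribute 'lower'. How do I preprocess text content before tokenizing it?
Problem description
The dataset I am using looks like the one below. It is a video captioning dataset with the captions in the "caption" column; a single video clip has multiple captions.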
video_id caption
mv89psg6zh4 A bird is bathing in a sink.
mv89psg6zh4 A faucet is running while a bird stands.
mv89psg6zh4 A bird gets washed.
mv89psg6zh4 A parakeet is taking a shower in a sink.
mv89psg6zh4 The bird is taking a bath under the faucet.
mv89psg6zh4 A bird is standing in a sink drinking water.
R2DvpPTfl-E PLAYING GAME ON LAPTOP.
R2DvpPTfl-E THE MAN IS WATCHING LAPTOP.
l7x8uIdg2XU A woman is pouring ingredients into a bowl.
l7x8uIdg2XU A woman is adding milk to some pasta.
l7x8uIdg2XU A person adds ingredients to pasta.
l7x8uIdg2XU the girls are doing the cooking.
The code works on the "CandidateA" JSON file. However, it does not work on the "Referencedf" JSON file, which looks like this (the full file can be found here):
(Excerpt only):
[{"video_id":"mv89psg6zh4_33_46","caption":"A bird in a sink keeps getting under the running water from a faucet."},{"video_id":"mv89psg6zh4_33_46","caption":"A bird is bathing in a sink."},{"video_id":"mv89psg6zh4_33_46","caption":"A bird is splashing around under a running faucet."},{"video_id":"60x_yxy7Sfw_1_7","caption":"A MAN IS WATCHING A LAPTOP."},{"video_id":"60x_yxy7Sfw_1_7","caption":"A man is sitting at his computer."}]
This is the code I am applying:
import json
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the list of {"video_id", "caption"} records
with open("Referencedf.json", 'r') as f:
    datastore = json.load(f)

captions = []
video_id = []
for item in datastore:
    captions.append(item['caption'])

# Build the vocabulary; this is the call that raises the error
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(captions)
The error I get is:
AttributeError Traceback (most recent call last)
<ipython-input-25-63fee6e467f1> in <module>
1 tokenizer = Tokenizer(oov_token="<OOV>")
----> 2 tokenizer.fit_on_texts(captions)
3 word_index = tokenizer.word_index
4 print(len(word_index))
~\anaconda3\lib\site-packages\keras_preprocessing\text.py in fit_on_texts(self, texts)
221 self.filters,
222 self.lower,
--> 223 self.split)
224 for w in seq:
225 if w in self.word_counts:
~\anaconda3\lib\site-packages\keras_preprocessing\text.py in text_to_word_sequence(text, filters, lower, split)
41 """
42 if lower:
---> 43 text = text.lower()
44
45 if sys.version_info < (3,):
AttributeError: 'NoneType' object has no attribute 'lower'
Edit:
As @MahindraSinghMeena suggested, I dropped the null rows from the DataFrame beforehand to avoid the error, using:
df = df.dropna()
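A minimal sketch of that cleanup, assuming the JSON records are loaded into a pandas DataFrame. The original post does not show how `df` was built, so the sample records and the `subset` argument below are illustrative assumptions. Plain `df.dropna()` drops a row when any column is null; restricting it to the "caption" column keeps rows that are only missing other fields.

```python
import pandas as pd

# Illustrative records in the same shape as Referencedf.json (assumed sample).
records = [
    {"video_id": "mv89psg6zh4_33_46", "caption": "A bird is bathing in a sink."},
    {"video_id": "SKhmFSV-XB0_12_18", "caption": None},
    {"video_id": "60x_yxy7Sfw_1_7", "caption": "A MAN IS WATCHING A LAPTOP."},
]
df = pd.DataFrame(records)

# Drop only the rows whose caption is missing.
clean_df = df.dropna(subset=["caption"])

# A plain list of strings, suitable for passing to a tokenizer afterwards.
captions = clean_df["caption"].tolist()
print(len(df), len(captions))
```

The resulting `captions` list contains only strings, so a later call such as `tokenizer.fit_on_texts(captions)` should no longer hit a `None` value.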
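Solution
This happens when the texts passed to the Tokenizer contain bad data: the error message indicates that at least one element is None. The data should therefore be cleaned to remove such entries.
You can see from the following snippet that one entry has an invalid caption.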
import json

datastore = json.load(open('/Referencedf.json', 'r'))
for d in datastore:
    if d['caption'] is None:
        print(d)
{'video_id': 'SKhmFSV-XB0_12_18', 'caption': None}
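One way to do this cleaning without pandas is to filter the records while building the caption list. The sketch below is only an illustration: the helper name `split_valid_captions` and the inline sample data are hypothetical, and it additionally treats empty or non-string captions as invalid, which goes slightly beyond the `None` case shown above.

```python
def split_valid_captions(datastore):
    """Separate usable caption strings from records with a bad caption.

    `datastore` is assumed to be a list of dicts with a "caption" key,
    as produced by json.load on the captions file.
    """
    captions = []
    skipped = []
    for item in datastore:
        caption = item.get("caption")
        # Keep only non-empty strings; None and other types are skipped.
        if isinstance(caption, str) and caption.strip():
            captions.append(caption)
        else:
            skipped.append(item)
    return captions, skipped


# Hypothetical sample in the same shape as the JSON excerpt.
sample = [
    {"video_id": "mv89psg6zh4_33_46", "caption": "A bird is bathing in a sink."},
    {"video_id": "SKhmFSV-XB0_12_18", "caption": None},
    {"video_id": "60x_yxy7Sfw_1_7", "caption": "A man is sitting at his computer."},
]

captions, skipped = split_valid_captions(sample)
print(len(captions), "captions kept;", len(skipped), "skipped")
```

The filtered `captions` list can then be passed to `tokenizer.fit_on_texts(captions)`, and `skipped` shows which records were left out so they can be inspected or fixed at the source.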