Stuck on a for loop and inserting into a DataFrame

Problem description

I have this DataFrame:

    manufacturer    description
0   toyota          toyota, gmc 10 years old.
1   NaN             gmc, Motor runs and drives good.
2   NaN             Motor old, in pieces.
3   NaN             2 owner 0 rust. Cadillac.

I want to fill the NaN values with keywords taken from the description. To do that, I created a list of the keywords I want:

keyword = ['gmc', 'toyota', 'cadillac']

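Finally, I want to loop over every row of the DataFrame, split the contents of each row's "description" column, and, if a word is also in the "keyword" list, write it into the "manufacturer" column. For example, the result would look like this: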

    manufacturer    description
0   toyota          toyota, gmc 10 years old.
1   gmc             gmc, Motor runs and drives good.
2   NaN             Motor old, in pieces.
3   cadillac        2 owner 0 rust. Cadillac.

Thanks to a kind person in this community, I was able to improve my code:

import re
keyword = ['gmc', 'toyota', 'cadillac']
bag_of_words = []
for i, description in enumerate(test3['description']):
    bag_of_words = re.findall(r"""[A-Za-z\-]+""", test3["description"][i])
    for word in bag_of_words:
        if word.lower() in keyword:
            test3.loc[i, 'manufacturer'] = word.lower()

But I realized that the first row's value was changed as well, even though it was not NaN:

  manufacturer  description
0   gmc         toyota, gmc 10 years old.
1   gmc         gmc, Motor runs and drives good.
2   NaN         Motor old, in pieces.
3   cadillac    2 owner 0 rust. Cadillac.

I only want to change the NaN values, but when I tried adding this condition:

if word.lower() in keyword and test3.loc[i, 'manufacturer'] == np.nan:

it had no effect.

Tags: python, pandas, dataframe, for-loop

Solution


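Here is a quick fix. You did several things wrong:

  • You mixed up the description's index and the description itself (solved by enumerate()).
  • bag_of_words should be rebuilt for each description, not appended to.
  • The wrong item was being iterated over (it should be word, not bag_of_words).

With intuitive, conventional names, some of these mistakes become easy to spot. It is worth spending some time on naming.

Code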

from nltk.tokenize import RegexpTokenizer

# test3 = the main dataset
keyword = ['gmc', 'toyota', 'cadillac']

# raw string, so the backslashes reach the regex engine unchanged
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

for i, description in enumerate(test3['description']):
    bag_of_words = tokenizer.tokenize(description.lower())
    for word in bag_of_words:
        if word in keyword:
            test3.loc[i, 'manufacturer'] = word

Output

test3
Out[31]: 
  manufacturer                       description
0       toyota             toyota, 10 years old.
1          gmc  gmc, Motor runs and drives good.
2          NaN             Motor old, in pieces.
3     cadillac         2 owner 0 rust. Cadillac.
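
The loop above writes to every row that contains a keyword, so with the question's first row ("toyota, gmc 10 years old.") it would still replace toyota with gmc, the last keyword found. The condition tried in the question, `== np.nan`, cannot help because NaN never compares equal to anything, including itself. A minimal sketch of one way to fill only the missing rows follows; it assumes the DataFrame has the default 0..n-1 index, rebuilds the question's sample data for illustration, and keeps the first keyword found in each row (that choice is an assumption, since the question does not say which keyword should win).

```python
import re

import numpy as np
import pandas as pd

# Sample data copied from the question (assumed default RangeIndex).
test3 = pd.DataFrame({
    "manufacturer": ["toyota", np.nan, np.nan, np.nan],
    "description": [
        "toyota, gmc 10 years old.",
        "gmc, Motor runs and drives good.",
        "Motor old, in pieces.",
        "2 owner 0 rust. Cadillac.",
    ],
})
keyword = ["gmc", "toyota", "cadillac"]

# NaN is never equal to anything, not even itself,
# which is why the `== np.nan` test never fires.
print(np.nan == np.nan)  # False

for i, description in enumerate(test3["description"]):
    # Skip rows that already have a manufacturer.
    if not pd.isna(test3.loc[i, "manufacturer"]):
        continue
    for word in re.findall(r"[A-Za-z\-]+", description.lower()):
        if word in keyword:
            test3.loc[i, "manufacturer"] = word
            break  # keep the first keyword found in this row

print(test3)
```

Here `pd.isna` does the missing-value check, and the `break` stops a later keyword in the same description from overwriting an earlier one.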

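re.findall() instead of RegexpTokenizer

Personally, I find nltk a fairly heavy module to import, install and deploy. If all you need is to split strings, I suggest using re.findall to extract valid word patterns. For example: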

import re

# won't extract numbers, currency signs and apostrophes
re.findall(r"""[A-Za-z\-]+""", test3["description"][3])

# the output is much cleaner than before
Out[39]: ['owner', 'rust', 'Cadillac']

But the choice is up to the user and depends on the overall task.
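
If the regex route is chosen, the same idea can probably be expressed without an explicit loop. The sketch below is not part of the original answer: it builds a single alternation pattern from the keyword list, uses pandas' `Series.str.extract` to pull the first keyword out of each description, and relies on `fillna` so that only missing manufacturers are touched. The sample DataFrame is rebuilt from the question for illustration, and "first keyword wins" is again an assumption.

```python
import re

import numpy as np
import pandas as pd

# Sample data copied from the question.
test3 = pd.DataFrame({
    "manufacturer": ["toyota", np.nan, np.nan, np.nan],
    "description": [
        "toyota, gmc 10 years old.",
        "gmc, Motor runs and drives good.",
        "Motor old, in pieces.",
        "2 owner 0 rust. Cadillac.",
    ],
})
keyword = ["gmc", "toyota", "cadillac"]

# One pattern with a single capture group: \b(gmc|toyota|cadillac)\b
pattern = r"\b(" + "|".join(re.escape(k) for k in keyword) + r")\b"

# First keyword in each description, lower-cased; NaN where nothing matches.
found = (
    test3["description"]
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .str.lower()
)

# fillna only replaces missing values, so existing manufacturers are kept.
test3["manufacturer"] = test3["manufacturer"].fillna(found)

print(test3)
```

The word boundaries (`\b`) keep a keyword from matching inside a longer word, and `re.escape` guards against keywords that contain regex metacharacters.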

