Compare the lists in a DataFrame column against another list, and save the items that are not found in another column

Question

I have a vocabulary list and a DataFrame. The DataFrame contains tokenized sentences.

vocab_list = ['aaa',....,'zzz']

DataFrame

tokenized_sentenced
========
[lorem , ipsum]
[it , is, a, long, established, fact ]
[various, versions, have, evolved, toq]
[the, generated, lorem, ipsum]

How can I store the tokens that are not found in the vocabulary list in a new column of the DataFrame? The result should look like this:

   tokenized_sentenced                        token_not_found_in_vocab
    =========================================|===========================
    [lorem , ipsum]                          |[lorem, ipsum]
    [it , is, a, long, established, fact ]   |[]
    [various, versions, have, evolved, toq]  |[toq]
    [the, generated, lorem, ipsum]           |[lorem, ipsum]

I tried this:

for i in range(0,1005):
  for j in range(0, len(df['tokenized_sentenced'][i])-1):
    if (df['tokenized_sentenced'][i][j] not in vocab_list):
      
      df['token_not_found_in_vocab'][i].append(df['tokenized_sentenced'][i][j])

But I got this error:

AttributeError: 'str' object has no attribute 'append'

Tags: python, pandas, dataframe

Solution


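The following one-liner solves your problem (note that going through a set drops duplicate tokens and does not preserve the original token order):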

df['token_not_found_in_vocab'] = df['tokenized_sentenced'].apply(lambda x: list(set(x).difference(vocab_list)))
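If the order of the tokens or repeated tokens matter, a list comprehension may be a better fit than a set difference. The following is a minimal sketch, not the answerer's method: the contents of vocab_list are made up for illustration (the original list is elided in the question), and the helper name tokens_not_in_vocab is hypothetical.

```python
import pandas as pd

# Hypothetical vocabulary for illustration only; the real list is elided in the question.
vocab_list = ["it", "is", "a", "long", "established", "fact",
              "various", "versions", "have", "evolved", "the", "generated"]

# Converting to a set once makes each membership check O(1) on average.
vocab_set = set(vocab_list)

# Sample data mirroring the expected-result table in the question.
df = pd.DataFrame({
    "tokenized_sentenced": [
        ["lorem", "ipsum"],
        ["it", "is", "a", "long", "established", "fact"],
        ["various", "versions", "have", "evolved", "toq"],
        ["the", "generated", "lorem", "ipsum"],
    ]
})


def tokens_not_in_vocab(tokens):
    """Return the tokens missing from vocab_set, in order, keeping duplicates."""
    return [tok for tok in tokens if tok not in vocab_set]


# Assign the whole column at once instead of appending cell by cell.
df["token_not_found_in_vocab"] = df["tokenized_sentenced"].apply(tokens_not_in_vocab)

print(df["token_not_found_in_vocab"].tolist())
# [['lorem', 'ipsum'], [], ['toq'], ['lorem', 'ipsum']]
```

The AttributeError in the question suggests that the token_not_found_in_vocab column held strings rather than lists when the loop ran; building the column in a single assignment avoids that. Separately, the loop bound len(...)-1 in the original attempt would skip the last token of every sentence.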
