How to tag keywords in Python and add the tags to a new column

Problem description

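I am trying to use the code below to extract the tags found in a sentence, but it returns the keywords instead. What am I missing? How can I output a new column containing all of the tags (not the keywords), separated by commas?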

s = set(dict_list)
f = lambda x: ', '.join(set([y for y in x.split() if y in s]))
# df['tags'] = df['description_summary'].apply(f)

df['tags'] = df['description_summary'].apply(lambda x: ', '.join(set(x.split()).intersection(s)))
df

This is basically the data I am using, from an Excel file:

    description_summary

0   Long sentence with keywords ball and hot
1   Long sentence with keywords stick, glove, and cold

This is the current (incorrect) output:

     description_summary                                     keywords instead of tags

0    Long sentence with keywords ball and hot                ball, hot
1    Long sentence with keywords cold, stick, and glove      cold, stick, glove

This is the output I want:

     description_summary                                     tags

0    Long sentence with keywords ball and hot                toy, temperature
1    Long sentence with keywords cold, stick, and glove      temperature, toy 

This is the dictionary of keywords and tags ('keywords': 'tags'):

dict_list = {'Hot': 'Temperature',
 'Cold': 'Temperature',
 'Very cold': 'Temperature',
 'Ball': 'Toy',
 'Glove': 'Toy',
 'Stick': 'Toy'
 }

How can I output only the tags (comma-separated) in a new column of the same file?

Tags: python, pandas, tags, keyword

Solution


You can use ordinary dictionary indexing to return the value associated with each key, rather than the key itself.
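As a minimal illustration of the difference (a sketch using a shortened, hypothetical two-entry version of the dictionary, not the full data from the question): building a set from a dictionary keeps only its keys, so filtering words against that set can only ever return keywords. Indexing the dictionary with a matched word returns the tag.

```python
# Shortened, illustrative version of the keyword -> tag dictionary.
dict_list = {'Hot': 'Temperature', 'Ball': 'Toy'}

keys = set(dict_list)            # iterating a dict yields its keys only
words = 'Ball and Hot'.split()

# Filtering against the key set returns the keywords themselves.
matched_keywords = [w for w in words if w in keys]

# Indexing the dictionary with each matched word returns its tag.
matched_tags = [dict_list[w] for w in words if w in dict_list]

print(matched_keywords)  # ['Ball', 'Hot']
print(matched_tags)      # ['Toy', 'Temperature']
```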

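Note that I have edited the dictionary from your question so that it is easier to verify that this works, and that you also need to account for case sensitivity.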

import pandas as pd

df = pd.DataFrame({'description_summary':['Long sentence with keywords ball and hot',
                                          'Long sentence with keywords cold, stick, and glove']})

dict_list = {'Hot': 'Temperature (hot)',
             'Cold': 'Temperature (cold)',
             'Very cold': 'Temperature (very cold)',
             'Ball': 'Toy (ball)',
             'Glove': 'Toy (glove)',
             'Stick': 'Toy (stick)'}

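# Lowercase keys and values so the lookup is case-insensitive.
d_lower = {key.lower(): value.lower() for key, value in dict_list.items()}

# Collect the tag of every keyword that occurs in the lowercased text;
# sorted() makes the order of the joined tags deterministic.
df['tags'] = df['description_summary'].apply(lambda x: ', '.join(
      sorted({d_lower[y] for y in d_lower if y in x.lower()})
    ))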

This yields the following 'tags' column:

0                   temperature (hot), toy (ball)
1    temperature (cold), toy (glove), toy (stick)
Name: tags, dtype: object
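The approach above matches keywords as plain substrings, so a keyword such as 'hot' would also match inside an unrelated word like 'photo'. One possible variant, sketched here with the original dictionary from the question, matches whole words with a regular expression instead. The helper name find_tags and the choice to list tags in order of first appearance are assumptions made for illustration, not part of the original answer.

```python
import re
import pandas as pd

# Original keyword -> tag mapping from the question.
dict_list = {'Hot': 'Temperature',
             'Cold': 'Temperature',
             'Very cold': 'Temperature',
             'Ball': 'Toy',
             'Glove': 'Toy',
             'Stick': 'Toy'}

# Lowercase both sides so matching is case-insensitive.
d_lower = {key.lower(): value.lower() for key, value in dict_list.items()}

# One pattern for all keywords, longest first so 'very cold' is tried before 'cold'.
# \b restricts matches to whole words ('hot' does not match inside 'photo').
keywords = sorted(d_lower, key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b')

def find_tags(text):
    """Return the unique tags of the keywords in text, in order of first appearance."""
    tags = [d_lower[match] for match in pattern.findall(text.lower())]
    # dict.fromkeys removes duplicates while preserving insertion order.
    return ', '.join(dict.fromkeys(tags))

df = pd.DataFrame({'description_summary': [
    'Long sentence with keywords ball and hot',
    'Long sentence with keywords cold, stick, and glove']})
df['tags'] = df['description_summary'].apply(find_tags)
print(df['tags'].tolist())  # ['toy, temperature', 'temperature, toy']
```

With the two sample sentences this reproduces the tags requested in the question. If a fixed tag order is preferred over order of appearance, wrapping the tags in sorted() before joining would be a reasonable alternative.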
