Check how many words from a given list appear in a list of texts/strings

Question

I have a list of text data containing reviews, like this:

1. 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.'

2. 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".',

3. 'This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.

I also have a separate list of words, and I want to know which of them appear in these reviews:

['food','science','good','buy','feedback'....]

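I want to know which of these words appear in each review, and to select the reviews that contain a certain number of them. For example, suppose I select only the reviews that contain at least 3 words from this list: the result should show all of those reviews, and also show which words were found in each review when it was selected.

I have code that selects the reviews containing at least 3 of the words, but how do I get the second part, which tells me exactly which words were found? This is my initial code: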

keywords = list(words)
text = list(df.summary.values)
sentences=[]
for element in text:
    if len(set(keywords)&set(element.split(' '))) >=3:
        sentences.append(element)

Tags: python, pandas, list, numpy, text

Answer


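To answer the second part, allow me to revisit how the first part is handled. A convenient approach here is to convert each review string into a set of word strings.

Like this: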

review_1 = "I have bought several of the Vitality canned dog food products and"
review_1 = set(review_1.split(" "))

Now the set review_1 contains each word exactly once. Next, take your word list, convert it to a set, and take the intersection.

words = ['food', 'science', 'good', 'buy', 'feedback']  # .... (list continues)
words = set(words)

matches = review_1.intersection(words)

The resulting set, matches, contains all the words the review and the word list have in common. Its length is the number of matches.
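The same intersection can be applied to every review in a loop, keeping the matched words next to each selected review. The following is only a minimal sketch: the function name `select_reviews`, the `min_matches` parameter, and the two sample reviews are made up for illustration, and it splits on single spaces just like the code in the question.

```python
def select_reviews(reviews, keywords, min_matches=3):
    """Return (review, matched_words) pairs for reviews with enough keyword hits."""
    keyword_set = set(keywords)
    selected = []
    for review in reviews:
        # Words that occur both in the review and in the keyword list
        matched = keyword_set & set(review.split(" "))
        if len(matched) >= min_matches:
            selected.append((review, matched))
    return selected


# Hypothetical sample data, not taken from the original reviews
reviews = [
    "the dog food is good and I would buy it again",
    "peanuts were small and unsalted",
]
keywords = ["food", "science", "good", "buy", "feedback"]

selected = select_reviews(reviews, keywords)
for review, matched in selected:
    print(sorted(matched), "->", review)
```

In the question's setting, `reviews` would be `list(df.summary.values)` and `keywords` would be the existing word list.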

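Note that this will not work if you care about how many times each word matches. For example, if the word "food" is found twice in a review and "science" is found once, does that count as three matching words?

If so, let me know in a comment and I can update the answer with code that covers that scenario.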

Edit: updated to address the question from the comments


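If you want to record how many times each word is repeated, keep the review as a list of words and convert it to a set only when performing the intersection. Then use the list method count to count how many times each match appears in the review. In the example below, I use a dictionary to store the results.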

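# Keep the review as a list of words so that duplicates are preserved
review_1 = "I have bought several of the Vitality canned dog food products and".split(" ")

words = set(['food', 'science', 'good', 'buy', 'feedback'])  # .... (list continues)

# Convert to a set only for the intersection
matches = set(review_1).intersection(words)

# Count how often each matched word occurs in the review itself
match_counts = dict()
for match in matches:
    match_counts[match] = review_1.count(match)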
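One caveat with splitting on spaces: punctuation stays attached to the words, so "food." or "Good" would not match "food" or "good". A possible variation, sketched here under the assumption that lowercasing and treating runs of letters and apostrophes as words is acceptable for your data, uses `re.findall` with `collections.Counter`. The function name `keyword_counts` and the sample review are hypothetical.

```python
import re
from collections import Counter


def keyword_counts(review, keywords):
    """Map each keyword found in the review to its number of occurrences."""
    # Lowercase, then keep runs of letters/apostrophes as words (assumed tokenization)
    tokens = re.findall(r"[a-z']+", review.lower())
    counts = Counter(tokens)
    return {word: counts[word] for word in keywords if counts[word] > 0}


# Hypothetical sample review for illustration
review = "Good food. The food smells good, and the dog likes this food!"
keywords = ["food", "science", "good", "buy", "feedback"]

result = keyword_counts(review, keywords)
print(result)  # {'food': 3, 'good': 2}
```

`len(result)` gives the number of distinct keywords found, and `sum(result.values())` gives the total including repeats, so either definition of "at least 3 words" can be checked.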
