Python - Iterate over a list of keywords, count the matches in a string, and compute a final total

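Problem description

I have a list of terms and want to check whether they appear in a research abstract and, if they do, count how many times they occur. I'm not sure what my code is doing wrong, but it isn't counting correctly. Thanks in advance!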

 mh_terms = ['mental', 'ptsd', 'sud', 'substance abuse', 'drug abuse', 
  'alcohol', 'alcoholism', 'anxiety', 'depressing', 'bipolar', 'mh', 
  'smi', 'oud', 'opioid' ]

  singleabstract = 'This is a research abstract that includes words like 
  mental health and anxiety.  My hope is that I get my code to work and 
  not resort to alcohol.'

  for mh in mh_terms: 
       mh = mh.lower
       mh = str(mh)
       number_of_occurences = 0
       for word in singleabstract.split():
          if mh in word:
          number_of_occurences += 1
  print(number_of_occurences)

Tags: python, list, loops, text

Solution
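
Before restructuring, it may help to see why the posted loop miscounts. As far as can be told from the code as shown: `mh.lower` is missing its parentheses, so `str(mh)` turns into the text representation of a method object rather than the lowercased term; `number_of_occurences` is reset to 0 for every term; and `print` runs only once, after the loop, so it reports the count for the last term only. Below is a minimal corrected sketch that keeps the original substring test. The name `total` is introduced here purely for illustration.

```python
mh_terms = ['mental', 'ptsd', 'sud', 'substance abuse', 'drug abuse',
            'alcohol', 'alcoholism', 'anxiety', 'depressing', 'bipolar',
            'mh', 'smi', 'oud', 'opioid']

singleabstract = ('This is a research abstract that includes words like '
                  'mental health and anxiety.  My hope is that I get my code '
                  'to work and not resort to alcohol.')

# Without parentheses, .lower is a method object, not a string,
# so str() of it never equals the term itself
broken = str('mental'.lower)

total = 0  # initialised once, outside the loop, so counts accumulate
for mh in mh_terms:
    mh = mh.lower()  # call the method to get the lowercased term
    for word in singleabstract.lower().split():
        if mh in word:  # substring test, as in the original code
            total += 1

print(total)  # 3
```

The substring test tolerates trailing punctuation ('anxiety.' still contains 'anxiety'), but it can overcount short terms such as 'mh' whenever they happen to appear inside a longer word.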


In general, a dict is a good fit for grouping. For counting, you could use an implementation like the following:

c = {}

# adjacent string literals inside parentheses are joined into one string
singleabstract = ('This is a research abstract that includes words like '
                  'mental health and anxiety.  My hope is that I get my code to work and '
                  'not resort to alcohol.')

for s in singleabstract.split():
    s = ''.join(char for char in s.lower() if char.isalpha()) # '<punctuation>'.isalpha() yields False
    # you'll need to check if the word is in the dict
    # first, and set it to 1
    if s not in c:
        c[s] = 1
    # otherwise, increment the existing value by 1
    else:
        c[s] += 1

# You can sum the number of occurrences, but you'll need
# to use c.get to avoid KeyErrors
occurrences = sum(c.get(term, 0) for term in mh_terms)

occurrences
3

# or you can use an if in the generator expression
occurrences = sum(c[term] for term in mh_terms if term in c)

The best way to count occurrences is to use collections.Counter. It is a dict subclass, so key lookups are O(1) on average:

from collections import Counter

# adjacent string literals inside parentheses are joined into one string
singleabstract = ('This is a research abstract that includes words like '
                  'mental health and anxiety.  My hope is that I get my code to work and '
                  'not resort to alcohol.')

# the Counter can consume a generator expression analogous to
# the for loop in the dict implementation
c = Counter(''.join(char for char in s.lower() if char.isalpha()) 
            for s in singleabstract.split())

# Then you can iterate through
for term in mh_terms:
    # don't need to use get, as Counter will return 0
    # for missing keys, rather than raising KeyError 
    print(term, c[term]) 

mental 1
ptsd 0
sud 0
substance abuse 0
drug abuse 0
alcohol 1
alcoholism 0
anxiety 1
depressing 0
bipolar 0
mh 0
smi 0
oud 0
opioid 0

To get the desired output, you can sum the Counter values for your terms:

total_occurrences = sum(c[v] for v in mh_terms)

total_occurrences
3
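
One caveat: because the text is split on whitespace, multi-word terms such as 'substance abuse' can never equal a single token, which is why they always show 0 in the output above. If phrases need to be counted too, one possible approach (an assumption on my part, not something the code above does) is a regular expression with word boundaries. A minimal sketch follows; `count_terms` and the sample `text` are hypothetical and only for illustration.

```python
import re
from collections import Counter

def count_terms(text, terms):
    """Count whole-word, case-insensitive matches of each term in text."""
    counts = Counter()
    lowered = text.lower()
    for term in terms:
        # \b anchors the match at word boundaries, so 'alcohol' is not
        # counted inside 'alcoholism'; re.escape protects any special characters
        pattern = r'\b' + re.escape(term.lower()) + r'\b'
        counts[term] = len(re.findall(pattern, lowered))
    return counts

text = ('Substance abuse and alcohol use often co-occur with anxiety; '
        'alcoholism is chronic.')
counts = count_terms(text, ['substance abuse', 'alcohol', 'alcoholism',
                            'anxiety', 'ptsd'])

print(counts['substance abuse'], counts['alcohol'],
      counts['alcoholism'], counts['ptsd'])  # 1 1 1 0
print(sum(counts.values()))                  # 4
```

This scans the text once per term, which should be fine for a short abstract and a handful of terms. For very long texts or large term lists, the single-pass Counter approach is likely to scale better.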
