Python 3 counter ignoring strings shorter than x characters

Problem description

I have a program that counts the words in a text file. Now I want to restrict the counter to strings longer than x characters.

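import collections

input_path = 'C:/Users/micha/Dropbox/IPCC_Boox/FOD_v1_ch15.txt'

# Plain dict mapping each word to its count. The collections module is
# imported as a whole so this name does not clash with collections.Counter.
Counter = {}
with open(input_path, 'r', encoding='utf-8-sig') as fh:
  for line in fh:
    # Drop commas, apostrophes and periods, lowercase, then split on whitespace.
    word_list = line.replace(',','').replace('\'','').replace('.','').lower().split()
    for word in word_list:
      if word not in Counter:
        Counter[word] = 1
      else:
        Counter[word] = Counter[word] + 1
N = 20
# Wrap the dict in collections.Counter to get the N most frequent words.
top_words = collections.Counter(Counter).most_common(N)
for word, frequency in top_words:
    print("%s %d" % (word, frequency))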

I tried the following re code, but it does not work.

    re.sub(r'\b\w{1,3}\b')

I don't know how to implement it...

In the end, I want output that ignores all short words such as and, you, be, etc.

Tags: python, string, counter, analysis, word

Solution


You can do this more simply by checking the length of each word before counting it:

  for word in word_list:
      if len(word) < 5:   # skip words shorter than 5 characters (5 is only an example threshold)
          continue        # skip the counting below and move on to the next word in word_list
      elif word not in Counter:
          Counter[word] = 1
      else:
          Counter[word] = Counter[word] + 1
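
For comparison, here is a minimal self-contained sketch of the same length filter built on collections.Counter. The function name count_long_words, its parameters min_length and top_n, and the sample sentences are all made up for illustration and do not come from the question. Unlike the question's code, it trims punctuation only from the ends of each token instead of deleting commas, apostrophes and periods everywhere.

```python
from collections import Counter
import string


def count_long_words(lines, min_length=4, top_n=20):
    """Return the top_n most common words with at least min_length characters.

    Hypothetical helper for illustration; `lines` is any iterable of strings,
    for example an open file handle.
    """
    counts = Counter()
    for line in lines:
        # Lowercase, split on whitespace, then trim punctuation from both ends.
        words = (token.strip(string.punctuation) for token in line.lower().split())
        # Only words that reach the minimum length are counted.
        counts.update(word for word in words if len(word) >= min_length)
    return counts.most_common(top_n)


# Made-up sample text standing in for the file contents.
sample = [
    "You and the climate, and the climate models.",
    "Climate models be warming; you see.",
]
for word, frequency in count_long_words(sample, min_length=4, top_n=3):
    print("%s %d" % (word, frequency))
```

As for the attempt in the question, re.sub takes at least three arguments (the pattern, the replacement, and the string to process), so calling it with only the pattern raises a TypeError.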
