Getting rid of specific words in a file

Problem description

I am working on a spam filter that reads emails from files. Some of the emails are in HTML format, so they contain parts like this:

br></font><br><br><br><br><br><br><br><br><br><br><br><br><br=
><br><br><br></font></p></center></center></tr></tbody></table></center></=
center></center></center></center></body></html>

This is how I ignore them:

if word[0] == '<' or word[len(word)-1] == '>':

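But some parts still make it into my dictionary. I have been looking for a way to ignore these words, without success. Is there a Python library that handles this, or does anyone know a more effective way to code it?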

This is how I currently read the words:

mail_words = {}
with open(email, 'r', encoding='utf-8') as file:
    text_of_mail = file.read()
    words = text_of_mail.split()
    # strip digits and punctuation from every word ('<' and '>' are kept for the check below)
    words = [w.translate(str.maketrans("", "", "0123456789\"#%&'()*+,-./:;=?@[\\]^_`{|}~")) for w in words]

for word in words:
    # skip empty words and anything that starts or ends like an HTML tag
    if word == '' or word == ' ' or word == '\n' or word[0] == '<' or word[len(word)-1] == '>':
        pass
    elif word not in mail_words:
        mail_words[word] = 1
    else:
        mail_words[word] += 1

Thanks.

Tags: python, file

Solution


Instead of using maketrans, use the lightweight HTML parser built into the standard library:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    """Adjusted from https://docs.python.org/3/library/html.parser.html"""

    def __init__(self):
        super().__init__()
        # per-instance set; a class-level set would be shared by every parser
        self.data_set = set()

    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        # called for the text between tags, which is what we want to keep
        print("Encountered some data  :", data)
        self.data_set.add(data)


parser = MyHTMLParser()

# well formed html example
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

print(parser.data_set)

Output:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html

{'Test', 'Parse me!'}       # parser.data_set  content

You would use it like this:

parser = MyHTMLParser()
with open(email, 'r', encoding='utf-8') as file:
    parser.feed(file.read())
print(parser.data_set)
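
The sample in the question has lines ending in `=` (for example `<br=` followed by `>` on the next line). That looks like quoted-printable soft line breaks, which would split tags across lines before any filter or parser sees them. The following is a minimal sketch, assuming the file really is quoted-printable encoded; `decode_quoted_printable` is a hypothetical helper name and is not part of the original answer:

```python
import quopri

def decode_quoted_printable(raw: bytes, encoding: str = "utf-8") -> str:
    """Undo quoted-printable encoding, including '=' soft line breaks."""
    # quopri.decodestring joins lines that end in '=' and decodes =XX escapes
    decoded = quopri.decodestring(raw)
    # errors="replace" is an assumption: keep going on malformed bytes
    return decoded.decode(encoding, errors="replace")

# hypothetical sample modelled on the fragment in the question
sample = b"<br><br=\n><br></=\ncenter>Hello=20world"
print(decode_quoted_printable(sample))
```

If that assumption holds, the file could be opened in binary mode (`'rb'`) and the decoded text passed to `parser.feed(...)` instead of the raw contents.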

Then post-process the resulting set, for example:

# remove entries consisting purely out of whitespaces \t \n etc.
cleaned = {a.strip() for a in parser.data_set if a.strip()}
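
To get back to the word-count dictionary from the question, the text chunks collected by the parser can be split into words and counted. This is a minimal sketch, assuming a list is used instead of a set so that repeated text is counted every time it occurs. `TextCollector` and `count_words` are hypothetical names, and `collections.Counter` is swapped in for the manual dictionary updates:

```python
from collections import Counter
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects only the text between tags; the tags themselves are dropped."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # a list, so duplicate text is kept for counting

    def handle_data(self, data):
        self.chunks.append(data)

def count_words(html_text):
    """Return a Counter mapping each whitespace-separated word to its count."""
    collector = TextCollector()
    collector.feed(html_text)
    collector.close()  # flush any text still buffered by the parser
    counts = Counter()
    for chunk in collector.chunks:
        # split() with no arguments also discards empty and whitespace-only pieces
        counts.update(chunk.split())
    return counts

# hypothetical input resembling the HTML fragments from the question
mail_words = count_words("<p>cheap pills<br><br>cheap offer</p></center>")
print(mail_words)
```

A `Counter` behaves like a regular dictionary, so `mail_words[word]` works as before. Stripping digits and punctuation, as the question does with `translate`, could still be applied to each word before counting.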
