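python - Extracting only the message bodies from a text file of emails
Problem description
I want to remove all the From, To, Cc, Subject and Sent labels from this text document and keep only the body of each email, so that I can use it to summarize the document's content. What is the best way to do this in Python? I think it is best to do the extraction first and then apply preprocessing. I am attaching the code as well, so any suggestions on how to do this would be very helpful. My doubt is in the payload and is_multipart part of the code, which is not working correctly; I have marked that section with comments and need help with it.
The code and the .txt file are attached below for reference.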
import os, sys, csv
import glob
import re
import email
#from tika import parser
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.summarization import summarize, keywords

# Set path to directory where files are
dirs = 'C:\\Users\\Lenovo\\.spyder-py3\\Testing\\'
#os.chdir(dirs)

for filename in glob.glob(os.path.join(dirs, '*.txt')):
    try:
        for files in filename:
            file = open(filename, 'r', encoding ='utf-8')
            filecontents = file.read()
            filecontents = re.sub(r'\s+', ' ', filecontents)
            print(filecontents)
            filecontents = filecontents.strip('\n')
            b = email.message_from_string(filecontents)# NEED
            if b.is_multipart():#HELP
                for payload in b.get_payload():#HERE
                # if payload.is_multipart(): ...#SO
                    print (payload.get_payload())#COMMENTED
            else:#
                print (b.get_payload())#
            summary = summarize(filecontents, ratio =0.10)
            print(summary)
            kw = keywords(filecontents, words=15)
            print(kw)
            break
        #writer.writerow([file, summary, kw])
    except Exception as e:
        pass
Text file
Stephanie /ANN
From: Mr.A, <.Mr.A@abc.com>
Sent: Wednesday, July 25, 2018 2:27 PM
To: , Tim /ANN; Abd, May /ANN
Cc: Mr.A, ; Theoder Jerry,
Subject: [EXTERNAL] RE: Holdings: XXXX SPA – mfno.1322
Dear Dr. Tim A. ,
The option-2 is fine. By the way, we had received in the past Letter of Authorization for many companies other
than Spa and I guess Xxxx does not do bANNiness with them either. If yes, then need to submit withdrawal
of Letter of Authorization for those companies and send a Letter of Authorization for spa. stating for any
applications submitted. We will send an administrative filing issue letter for both the holder and the agent.
Thank you!
Regards,
Mr.A
PRODUCT Master File
CDER
Currently, there is no requirement to submit or resubmit NAs in any electronic format. However, starting May 5, 2018,
new NAs, as well as any submissions to the existing NAs mANNt be submitted electronically in legal (electronic Common
Technical Document) format specified by GROUP A in the legal guidance. NA submissions that are not submitted in legal
format after this date may be subject to rejection. For more information please check the NA website
www.GROUP A.gov/abc/bca
This communication is an informal communication consistent with which represents my best judgment
at this time, but does not constitute an advisory opinion, does not necessarily represent the formal position of the
GROUP A, and does not bind or otherwise obligate or commit the agency to the views expressed. This communication,
including any attachments, is intended only for the person or entity to which it is addressed and may contain
confidential material. Any review, retransmission, distribution or other ANNe of this information by persons or entities
other than the intended recipient is prohibited. If you received this in error, please destroy any copies, contact the
sender and delete the material from any computer. Thank you.
From: Tim.@xxxx.com [mailto:Tim.@xxxx.com]
Sent: Wednesday, July 25, 2018 2:10 PM
To: Mr.A, <.Mr.A@abc.com>
Cc: May.Abd@xxxx.com
Subject: RE: Holdings: XXXX SPA ‐ dm 013383
Dear ,
XXXX
2
Thanks for your phone call to clarify your needs and to understand the situation. I have confirmed that Xxxx only does
direct bANNiness for test S intermediate with b. and not with the other companies (e,
x, etc.) that are secondary companies. Based on our discANNsion, I believe that we do not need to
provide QAs for these secondary companies or mention them in our NA file as they would be covered under a
separate QA S.p.A. to them. If this is correct, then I believe you mentioned that we have two options as
described below:
Option 1: We can issue a separate QA for each . NA to be specific on which NA is being cross‐referenced
to our NA 13383.
Option 2: We can do a single QA for and mention that they can cross‐reference any of their NAs. This
would allow them to cross‐reference any of their
If I have misunderstood or am incorrect in my response and we need to discANNs further, please let me know.
If not, when you issue your request, can you please send to me and May Abd by email?
Kind regards.
Tim
Tim A. , BsC
Director, YY SERVICES)
Xxxx ANN
Phone/FAX: 2312333
Cell: 23312123131
Email: tim.@xxxx.com
From: , Tim /ANN
Sent: Monday, July 23, 2018 7:05 AM
To: 'Mr.A, '
Cc: Abd, May /ANN
Subject: RE: [EXTERNAL] Holder: XXXX SPA - NA 013383
Dear ,
May is now on vacation and I am covering for her during her absence. Is there a good time to call you today or later this
week? Please let me know and we can schedule or please call my cell phone 21313131231 at your convenience.
Kind regards.
Tim
Tim A. , MSC
Director, PQR
Xxxx
Phone/FAX: 2312313313
Cell: 3142342424
Email: tim.@xxxx.com
XXXX
3
‐‐‐‐‐‐‐‐‐‐ Forwarded message ‐‐‐‐‐‐‐‐‐‐
From: "Mr.A, " <.Mr.A@abc.com>
Date: Jul 20, 2018 9:01 AM
Subject: [EXTERNAL] Holder: XXXX SPA ‐ NA 013383
To: "TRETE/ANN" <May.Abd@xxxx.com>
Cc: "mno.com>
Dear May Abd,
. I need to talk to you on this.
Thank you!
Regards,
Mr.A
PRODUCT Master File
CDER
Currently, there is no requirement to submit or resubmit NAs in any electronic format.
format after this date may be subject to rejection. For more information please check the NA website
www.GROUP A./cder/NA
This communication is an informal communication which represents my best judgment
at this time, but does not constitute an advisory opinion, does not necessarily represent the formal position of the
GROUP A, and does not bind or otherwise obligate or commit the agency to the views expressed. This communication,
including any attachments, is intended only for the person or entity to which it is addressed and may contain
confidential material. Any review, retransmission, distribution or other ANNe of this information by persons or entities
other than the intended recipient is prohibited. If you received this in error, please destroy any copies, contact the
sender and delete the material from any computer. Thank you.
XXXX
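Solution
It is unclear which part of your code you need help with, what you want it to do instead of what it currently does, or how the results should be passed on for further processing.
However, I will note that your code has a number of problems.
- You cannot read an email message as UTF-8 text. Regardless of the file extension, an RFC822 message is just a sequence of bytes. Legacy email can use a multitude of different encodings, and if you try to force it into UTF-8 you will run into UnicodeDecodeError exceptions and other problems.
- As always, the blanket except Exception: is a major bug. Perhaps you are only using it for debugging, but in fact it makes debugging harder.
- A typical modern email has a fairly complex MIME body structure, which you have to analyze before you can decide what you actually want to process. A common occurrence is multipart/alternative, where the same message is presented in different formats so that the recipient can decide whether to read it rendered as HTML, as plain text, or occasionally as PDF, RTF, a single image or something else, depending on the application. In addition, HTML structures commonly have multiple parts because the main HTML wants to pull in small images (company logos, animated emojis and other insults to the reader) that are supplied within the MIME structure. Perhaps see also What are the "parts" in a multipart email?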
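To make the third point concrete, here is a minimal sketch that builds a small multipart/alternative message in memory and lists the content type of every part that walk() visits. The sample message and the helper name describe_structure are made up for illustration; a real message from your files may be nested more deeply.

```python
from email import message_from_bytes
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def describe_structure(raw_bytes):
    """Return the content types of all parts, in the order walk() visits them."""
    msg = message_from_bytes(raw_bytes)
    return [part.get_content_type() for part in msg.walk()]


# Hypothetical sample: the same text offered as plain text and as HTML
sample = MIMEMultipart('alternative')
sample['Subject'] = 'Example'
sample.attach(MIMEText('Hello, world', 'plain', 'utf-8'))
sample.attach(MIMEText('<p>Hello, <b>world</b></p>', 'html', 'utf-8'))

structure = describe_structure(sample.as_bytes())
print(structure)
```

The container itself is the first item that walk() yields, followed by its sub-parts, which is why code that wants text usually has to filter on the content type rather than take the first part.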
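A further complication for this answer is that Python's email library went through a major overhaul fairly recently. The new functionality was introduced provisionally in Python 3.3, but only became the stable, officially documented version in 3.6. Most code you will find in the wild uses the pre-3.6 facilities, but going forward you will probably want to target the new and improved API.
With the legacy API, your code might look something like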
import email
import glob
import os

from gensim.summarization import summarize, keywords  # module removed in gensim 4.0

# `dirs` is the directory path defined in the question's code
for filename in glob.glob(os.path.join(dirs, '*.txt')):
    # Not useful; we already have a filename
    #for files in filename:
    # Open in binary mode, don't try to guess encoding
    # Use a context manager so we don't leave the file open
    with open(filename, 'rb') as file:
        # Just let the email library take it from here
        #filecontents = file.read()
        #filecontents = re.sub(r'\s+', ' ', filecontents)
        #print(filecontents)
        #filecontents = filecontents.strip('\n')
        b = email.message_from_binary_file(file)
    if b.is_multipart():
        # There are a number of things you could do to pick out
        # one or more payloads for analysis, but let's just take
        # the first text/plain part and call it "main_part"
        main_part = None
        for part in b.walk():
            if part.get_content_type() == 'text/plain':
                main_part = part.get_payload()
                break
    else:
        main_part = b.get_payload()
    if main_part is None:
        # No text/plain part in this message; nothing to summarize
        continue
    summary = summarize(main_part, ratio=0.10)
    print(summary)
    kw = keywords(main_part, words=15)
    print(kw)
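One caveat with the legacy API: get_payload() without arguments returns the body exactly as it is stored in the message, so a base64 or quoted-printable part comes back still encoded. A hedged sketch of decoding it follows. The helper name decoded_text and the UTF-8 fallback are assumptions for illustration and are not part of the original answer.

```python
from email import message_from_bytes
from email.mime.text import MIMEText


def decoded_text(part, fallback_charset='utf-8'):
    """Decode a non-multipart part to str using its declared charset."""
    raw = part.get_payload(decode=True)  # undoes base64 / quoted-printable
    if raw is None:  # multipart containers have no payload of their own
        return ''
    charset = part.get_content_charset() or fallback_charset
    try:
        return raw.decode(charset, errors='replace')
    except LookupError:  # charset name that Python does not know
        return raw.decode(fallback_charset, errors='replace')


# Hypothetical sample: MIMEText stores a utf-8 body base64-encoded
sample = MIMEText('Dear Dr. Tim, option 2 is fine.', 'plain', 'utf-8')
part = message_from_bytes(sample.as_bytes())

print(part.get_payload())   # the transfer-encoded form
print(decoded_text(part))   # the readable body
```

Passing the decoded string rather than the raw payload to the summarizer avoids summarizing base64 noise.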
To use the new 3.6+ API, you would adapt it to something like
from email.policy import default as default_email_policy
...
b = email.message_from_binary_file(file, policy=default_email_policy)
main_part = b.get_body(['related', 'plain', 'html'])
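This produces an object of the new class email.message.EmailMessage, which has a number of different methods and different behavior compared with the old class email.message.Message. The documentation suggests that perhaps one day the default policy will be passed in by default, at which point old code will switch to the new behavior (and probably also experience some unpleasant surprises and outright breakage).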
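Note also the get_body() method, new in 3.6, which lets you easily pick out the "probable main part". What it returns is itself a message part (or None if nothing matches), so call get_content() on it to obtain the text. The code above will fall back to HTML if no text/plain part is available, and you will then need further processing to extract the actual text (maybe have a look at BeautifulSoup?).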
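As a rough sketch of that fallback, the following prefers a text/plain body and otherwise strips the tags from the HTML body. It uses the standard library's html.parser instead of BeautifulSoup only to stay self-contained; the helper names body_text and _TextExtractor and the sample message are assumptions for illustration.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default as default_email_policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def body_text(raw_bytes):
    """Return the main body of a message as plain text ('' if none is found)."""
    msg = message_from_bytes(raw_bytes, policy=default_email_policy)
    body = msg.get_body(preferencelist=('plain', 'html'))
    if body is None:
        return ''
    content = body.get_content()  # already decoded to str by the new API
    if body.get_content_type() == 'text/html':
        extractor = _TextExtractor()
        extractor.feed(content)
        extractor.close()
        # Crude: join the text nodes and collapse runs of whitespace
        content = ' '.join(''.join(extractor.chunks).split())
    return content.strip()


# Hypothetical HTML-only message for demonstration
sample = EmailMessage()
sample['Subject'] = 'Example'
sample.set_content('<p>Option 2 is <b>fine</b>.</p>', subtype='html')

text = body_text(sample.as_bytes())
print(text)
```

This tag stripping is deliberately naive (it also keeps the contents of script and style elements, for instance), so a dedicated HTML library is likely the sturdier choice for real mail.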
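There is no technically robust, reliable way to separate boilerplate (headers, signatures and so on) from the actual content of an email. Some HTML email clients may provide hints in the generated message about which <div> elements contain user-entered content, but in the general case you will just have to plod along with (frankly, hopeless) heuristics.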
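For the text file in the question, which appears to be a printed mail thread rather than a raw RFC822 message, one such heuristic is simply to drop every line that starts with a known header label. The sketch below is only that: the label list is an assumption taken from the sample file, the sample text is adapted from it, and a body line that happens to start with "To:" would be dropped as well. Signatures and disclaimers are left in place.

```python
import re

# Header labels as they appear in the question's sample file (an assumption;
# other mail clients and other languages use different labels)
HEADER_LINE = re.compile(r'^\s*(From|Sent|Date|To|Cc|Bcc|Subject)\s*:', re.IGNORECASE)


def strip_header_lines(text):
    """Drop lines that look like printed mail headers; keep everything else."""
    kept = [line for line in text.splitlines() if not HEADER_LINE.match(line)]
    return '\n'.join(kept)


sample = """From: Mr.A, <.Mr.A@abc.com>
Sent: Wednesday, July 25, 2018 2:27 PM
To: , Tim /ANN; Abd, May /ANN
Cc: Mr.A, ; Theoder Jerry,
Subject: [EXTERNAL] RE: Holdings: XXXX SPA
Dear Dr. Tim A. ,
The option-2 is fine.
Thank you!"""

body = strip_header_lines(sample)
print(body)
```

The result can then be passed to the summarizer in place of the raw file contents.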