python - Extract the original email text from HTML tags
Problem description
I have 30B rows. My dataframe looks like
age email
33 </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72">.
<divclass="WordSection1"><p class="MsoNormal"><spanstyle="font-
family:"Calibri",sans-serif; color:black">Iam not interested.
Please unsubscribe me. </span></p><pclass="MsoNormal">
<spanstyle="font-family:"Calibri",sans-serif;color:black">
22 </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72">
<divclass="WordSection1"><p class="MsoNormal"><spanstyle="font-
family:"Calibri",sans-serif;color:black">Please share company
details</span></p><divclass="MsoNormal" align="center"style="text-
align:center"><hr size="2"width="98%" align="center"></div>
<pclass="MsoNormal">
43 </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72">
<divclass="WordSection1"><p class="MsoNormal"><spanstyle="font-
family:"Calibri",sans-serif;color:black">Can you send
some project info for west region ofIndia</span></p><p class="MsoNormal">
<spanstyle="font-family:"Calibri",sans-serif;color:black">
38 </style></head><bodylang="EN-IN" link="#0563C1"vlink="#954F72"><div
class="WordSection1"><pclass="MsoNormal"><span style="font-
family:"Calibri",sans-serif;color:black">Price of Mono perc</span>
</p><divclass="MsoNormal" align="center"style="text-align:center"><hr
size="2"width="98%" align="center"></div><pclass="MsoNormal"><b>
My final dataframe should look like -
age email text
33 </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72">. Iam not interested.
<divclass="WordSection1"><p class="MsoNormal"><spanstyle="font- Please unsubscribe
family:"Calibri",sans-serif; color:black">Iam not interested. me.
Please unsubscribe me. </span></p><pclass="MsoNormal">
<spanstyle="font-family:"Calibri",sans-serif;color:black">
22 </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72"> Please share
<divclass="WordSection1"><p class="MsoNormal"><spanstyle="font- company details
family:"Calibri",sans-serif;color:black">Please share company
details</span></p><divclass="MsoNormal" align="center"style="text-
align:center"><hr size="2"width="98%" align="center"></div>
<pclass="MsoNormal">
43 </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72"> Can you send
<divclass="WordSection1"><p class="MsoNormal"><spanstyle="font- some project
family:"Calibri",sans-serif;color:black">Can you send info for west
some project info for west region ofIndia</span></p><p class="MsoNormal"> region ofIndia
<spanstyle="font-family:"Calibri",sans-serif;color:black">
38 </style></head><bodylang="EN-IN" link="#0563C1"vlink="#954F72"><div Price of Mono
class="WordSection1"><pclass="MsoNormal"><span style="font- perc
family:"Calibri",sans-serif;color:black">Price of Mono perc</span>
</p><divclass="MsoNormal" align="center"style="text-align:center"><hr
size="2"width="98%" align="center"></div><pclass="MsoNormal"><b>
My code looks like -
word1 = 'sans-serif; color:black">'
word2 = "</span></p>"
df['text'] = s.split(word1)[1].split(word2)[0]
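This should return the text between word1 and word2, but it currently does not work. My logic is to extract the email body or message from the text that sits between word1 and word2.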
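For reference, the marker-based idea described above can be sketched on a single string with plain string methods. This is only an illustration under assumptions: the helper name `text_between` and the sample string are hypothetical, and the shorter start marker `color:black">` is chosen here because the sample rows do not all have the same spacing after `sans-serif;`.

```python
def text_between(s, start, end):
    """Return the text between the first `start` marker and the next `end`
    marker, or an empty string if either marker is missing."""
    _, found_start, rest = s.partition(start)
    if not found_start:
        return ""
    body, found_end, _ = rest.partition(end)
    return body if found_end else ""


# Hypothetical sample, shaped like one of the rows in the question.
sample = (
    '<p class="MsoNormal"><span style="font-family:Calibri,sans-serif;'
    'color:black">Please share company details</span></p>'
)

result = text_between(sample, 'color:black">', "</span></p>")
print(result)
```

Applied to the dataframe, a helper like this would presumably be called per row, for example through `df['email'].apply(...)`, rather than on a single variable `s`.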
Solution
Use BeautifulSoup to parse the HTML.
Ex:
from bs4 import BeautifulSoup
df['text'] = df['email'].apply(lambda x: BeautifulSoup(x, "html.parser").find("p", class_="MsoNormal").text)
print(df)
Output:
0 Iam not interested. \nPlease unsubscribe me.
1 Please share company \ndetails
2 Can you send \nsome project info for west regi...
3 Price of Mono perc\n
Name: text, dtype: object
Edit based on the comments
def getText(val):
    soup = BeautifulSoup(val, "html.parser")
    try:
        return soup.find("p", class_="MsoNormal").text
    except Exception:
        # e.g. find() returned None because no <p class="MsoNormal"> exists
        return ""

df['text'] = df['email'].apply(getText)
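The reason for the try/except can be seen on two small inputs. The following is a minimal sketch, assuming `bs4` is installed; the function name `get_text_safe` and both sample strings are made up for illustration. When no matching tag exists, `find` returns `None`, and accessing `.text` on `None` raises `AttributeError`, so the sketch checks for `None` explicitly instead.

```python
from bs4 import BeautifulSoup


def get_text_safe(html):
    """Return the text of the first <p class="MsoNormal"> tag, or "" if none exists."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("p", class_="MsoNormal")
    # find() returns None when nothing matches, so guard before reading .text
    return tag.text if tag is not None else ""


with_paragraph = '<p class="MsoNormal"><span>Hello</span></p>'
without_paragraph = "<div>no paragraph here</div>"

print(repr(get_text_safe(with_paragraph)))
print(repr(get_text_safe(without_paragraph)))
```

Note that `find` returns only the first matching paragraph; if the message spans several `MsoNormal` paragraphs, `find_all` would be needed to collect them.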