Extracting the raw email text from HTML tags

Problem description

I have 30B rows. My DataFrame looks like this:

age                          email
33    </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72">. 
      <divclass="WordSection1"><p class="MsoNormal"><spanstyle="font- 
      family:&quot;Calibri&quot;,sans-serif; color:black">Iam not interested. 
      Please unsubscribe me.&nbsp;</span></p><pclass="MsoNormal">
      <spanstyle="font-family:&quot;Calibri&quot;,sans-serif;color:black">&nbsp;

22    </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72"> 
      <divclass="WordSection1"><p class="MsoNormal"><spanstyle="font- 
      family:&quot;Calibri&quot;,sans-serif;color:black">Please share company 
      details</span></p><divclass="MsoNormal" align="center"style="text- 
      align:center"><hr size="2"width="98%" align="center"></div> 
      <pclass="MsoNormal">

43    </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72"> 
      <divclass="WordSection1"><p class="MsoNormal"><spanstyle="font- 
      family:&quot;Calibri&quot;,sans-serif;color:black">Can you send 
      some project info for west region ofIndia</span></p><p class="MsoNormal"> 
      <spanstyle="font-family:&quot;Calibri&quot;,sans-serif;color:black">

38    </style></head><bodylang="EN-IN" link="#0563C1"vlink="#954F72"><div 
      class="WordSection1"><pclass="MsoNormal"><span style="font- 
     family:&quot;Calibri&quot;,sans-serif;color:black">Price of Mono perc</span> 
     </p><divclass="MsoNormal" align="center"style="text-align:center"><hr 
     size="2"width="98%" align="center"></div><pclass="MsoNormal"><b>

My final DataFrame should look like this:

age                          email                                                   text
33    </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72">.      Iam not interested. 
      <divclass="WordSection1"><p class="MsoNormal"><spanstyle="font-        Please unsubscribe
      family:&quot;Calibri&quot;,sans-serif; color:black">Iam not interested. me.
      Please unsubscribe me.&nbsp;</span></p><pclass="MsoNormal">
      <spanstyle="font-family:&quot;Calibri&quot;,sans-serif;color:black">&nbsp;

22    </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72">         Please share 
      <divclass="WordSection1"><p class="MsoNormal"><spanstyle="font-          company details
      family:&quot;Calibri&quot;,sans-serif;color:black">Please share company 
      details</span></p><divclass="MsoNormal" align="center"style="text- 
      align:center"><hr size="2"width="98%" align="center"></div> 
      <pclass="MsoNormal">

43    </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72">           Can you send 
      <divclass="WordSection1"><p class="MsoNormal"><spanstyle="font-            some project 
      family:&quot;Calibri&quot;,sans-serif;color:black">Can you send            info for west 
      some project info for west region ofIndia</span></p><p class="MsoNormal">  region ofIndia
      <spanstyle="font-family:&quot;Calibri&quot;,sans-serif;color:black">

38    </style></head><bodylang="EN-IN" link="#0563C1"vlink="#954F72"><div         Price of Mono
      class="WordSection1"><pclass="MsoNormal"><span style="font-                 perc
     family:&quot;Calibri&quot;,sans-serif;color:black">Price of Mono perc</span> 
     </p><divclass="MsoNormal" align="center"style="text-align:center"><hr 
     size="2"width="98%" align="center"></div><pclass="MsoNormal"><b>

My code looks like this:

word1 = "sans-serif; color:black">"
word2 = "</span></p>"

df['text'] = s.split(word1)[1].split(word2)[0]

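This should return the text between word1 and word2, but it does not work at the moment. My idea is to extract the email body, i.e. the message text that sits between word1 and word2.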

Tags: python, python-3.x, pandas

Solution


Use BeautifulSoup to parse the HTML.

Example:

from bs4 import BeautifulSoup

# Parse each email's HTML and keep the text of the first <p class="MsoNormal">
df['text'] = df['email'].apply(lambda x: BeautifulSoup(x, "html.parser").find("p", class_="MsoNormal").text)
print(df['text'])

Output:

0        Iam not interested. \nPlease unsubscribe me. 
1                       Please share company \ndetails
2    Can you send \nsome project info for west regi...
3                                 Price of Mono perc\n
Name: text, dtype: object
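The extracted strings above still carry the line breaks from the original markup, and an `&nbsp;` entity comes through as a non-breaking space. If a single-line message is wanted, a minimal sketch of a cleanup step could look like this. The helper name `normalize_ws` is hypothetical and not part of the original answer.

```python
def normalize_ws(text):
    """Collapse every run of whitespace into one space and trim the ends.

    str.split() with no arguments splits on any Unicode whitespace,
    which covers newlines and the non-breaking space (\xa0).
    """
    return " ".join(text.split())


print(normalize_ws("Iam not interested. \nPlease unsubscribe me.\xa0"))
# prints: Iam not interested. Please unsubscribe me.
```

Assuming the `text` column has already been filled, it could be applied with something like `df['text'] = df['text'].map(normalize_ws)`.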

Edit based on the comments

def getText(val):
    soup = BeautifulSoup(val, "html.parser")
    try:
        return soup.find("p", class_="MsoNormal").text
    except AttributeError:
        # find() returned None: this row has no <p class="MsoNormal">
        return ""

df['text'] = df['email'].apply(getText)
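
For comparison, the marker-based idea from the question (take whatever sits between a start marker and an end marker) can be sketched with a regular expression from the standard library. This is only a sketch under assumptions: it assumes the message is the content of the first span whose style ends in `color:black`, as in the sample rows, and the names `PATTERN` and `between_markers` are hypothetical. Regular expressions are fragile on arbitrary HTML, so the BeautifulSoup version above is the safer default.

```python
import html
import re

# Assumption: the message follows the first 'color:black">' and ends at the
# next '</span>'. \s* tolerates the optional space seen in the sample rows.
PATTERN = re.compile(r'color:\s*black">(.*?)</span>', re.DOTALL)


def between_markers(raw):
    """Return the text between the two markers, or "" if they are absent."""
    if not isinstance(raw, str):  # e.g. NaN for a missing email
        return ""
    match = PATTERN.search(raw)
    if match is None:
        return ""
    # Decode entities such as &nbsp; and collapse whitespace and newlines.
    return " ".join(html.unescape(match.group(1)).split())


sample = ('<span style="font-family:&quot;Calibri&quot;,sans-serif; color:black">'
          'Iam not interested. \nPlease unsubscribe me.&nbsp;</span></p>')
print(between_markers(sample))
# prints: Iam not interested. Please unsubscribe me.
```

It would be applied the same way as `getText`, for example `df['text'] = df['email'].apply(between_markers)`.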
