Strip email and text from full tag

Problem description

How do I correctly get the email address and the text from an <a href..> </a> tag?

My code:

import re
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup


url = input("Enter url -")
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags
count = 0
tags = soup.find_all(href=re.compile("mailto"))
for tag in tags:
    count += 1
    print(tag)
print("Total amount of mails:", count)

My program prints the full tag <a href="mailto:johntest@test.com">John Test</a>, but I only want the email address and the name. How can I strip these out correctly?

Tags: python

Solution


You can try it like this:


from bs4 import BeautifulSoup

html = """<a href="mailto:johntest@test.com">John Test</a>"""

# Pass the parser name as the second argument; there is no `parser=` keyword
soup = BeautifulSoup(html, "html.parser")

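# href=True skips anchors that have no href attribute (avoids a KeyError)
for element in soup.find_all("a", href=True):
    href = element["href"]

    # Only handle links that actually start with the mailto: scheme
    if href.startswith("mailto:"):
        # Split once, so any later colon stays in the address part
        email = href.split(":", 1)[1]
        # The visible link text, with surrounding whitespace removed
        name = element.get_text(strip=True)

        print(email, name)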

Output

johntest@test.com John Test
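
If an href carries extra parts such as `?subject=...` or percent-encoded characters, splitting on the colon alone leaves them attached to the address. Below is a minimal sketch of one way to handle that with the standard library's `urllib.parse`. The helper name `split_mailto` is hypothetical, not part of BeautifulSoup or of the original answer, and it assumes a single recipient per link.

```python
from urllib.parse import unquote, urlsplit


def split_mailto(href):
    """Return the address from a mailto: href, or None if it is not one.

    Hypothetical helper for illustration; assumes one recipient per link.
    """
    parts = urlsplit(href.strip())
    # Ignore anything that does not use the mailto scheme
    if parts.scheme.lower() != "mailto":
        return None
    # parts.path holds the address; the query (?subject=...) is split off
    address = unquote(parts.path)
    return address or None


print(split_mailto("mailto:johntest@test.com?subject=Hello"))
```

Inside the loop, this could replace the `split(":")` step: call `split_mailto(element["href"])` and skip the tag when it returns `None`.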
