Scrape emails from a list of URLs saved in CSV - BeautifulSoup

Problem description

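I am trying to parse a list of URLs saved in CSV format to scrape email addresses. However, the code below only manages to fetch email addresses from a single website. I need advice on how to modify the code so that it loops through the list and saves the results (the list of emails) to a CSV file.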

import requests
import re
import csv
from bs4 import BeautifulSoup

allLinks = [];mails=[]
with open(r'url.csv', newline='') as csvfile:
    urls = csv.reader(csvfile, delimiter=' ', quotechar='|')
    links = []
    for url in urls:
        response = requests.get(url)
        soup=BeautifulSoup(response.text,'html.parser')
        links = [a.attrs.get('href') for a in soup.select('a[href]') ]

allLinks=set(links)

def findMails(soup):
    for name in soup.find_all('a'):
        if(name is not None):
            emailText=name.text
            match=bool(re.match('[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$',emailText))
            if('@' in emailText and match==True):
                emailText=emailText.replace(" ",'').replace('\r','')
                emailText=emailText.replace('\n','').replace('\t','')
                if(len(mails)==0)or(emailText not in mails):
                    print(emailText)
                mails.append(emailText)
for link in allLinks:
    if(link.startswith("http") or link.startswith("www")):
        r=requests.get(link)
        data=r.text
        soup=BeautifulSoup(data,'html.parser')
        findMails(soup)

    else:
        newurl=url+link
        r=requests.get(newurl)
        data=r.text
        soup=BeautifulSoup(data,'html.parser')
        findMails(soup)

mails=set(mails)
if(len(mails)==0):
    print("NO MAILS FOUND")

Tags: for-loop, web-scraping, beautifulsoup

Solution


You are overwriting links on every iteration when you mean to append to it.

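import requests
from bs4 import BeautifulSoup

allLinks = []
mails = []
urls = ['https://www.nus.edu.sg/', 'http://gwiconsulting.com/']
links = []

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extend the list (+=) instead of reassigning it, so links from every URL are kept
    links += [a.attrs.get('href') for a in soup.select('a[href]')]

# Deduplicate once all pages have been fetched
allLinks = set(links)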
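
The snippet above hardcodes the URLs, whereas the question reads them from url.csv. There, csv.reader yields each row as a list of strings, so a row cannot be passed straight to requests.get. The following is a minimal sketch under the assumption that the file holds one URL per row in the first column. The helper names read_urls and absolute_links are hypothetical, and urljoin is swapped in for the url + link concatenation from the question so that each relative href is resolved against the page it was found on.

```python
import csv
import io
from urllib.parse import urljoin


def read_urls(csvfile):
    """Return the first column of every non-empty row (assumed layout: one URL per row)."""
    return [row[0].strip() for row in csv.reader(csvfile) if row and row[0].strip()]


def absolute_links(base_url, hrefs):
    """Resolve hrefs against the page they were found on and keep only HTTP(S) links."""
    resolved = {urljoin(base_url, href) for href in hrefs}
    return {link for link in resolved if link.startswith(("http://", "https://"))}


# Stand-in for open('url.csv', newline=''); any file-like object works.
sample = io.StringIO("https://www.nus.edu.sg/\n\nhttp://gwiconsulting.com/\n")
urls = read_urls(sample)

# Example hrefs as they might appear on a page (made up for illustration).
links = absolute_links(
    "http://gwiconsulting.com/",
    ["/contact", "about.html", "mailto:info@example.com"],
)
```

In the scraping loop, absolute_links would be called once per fetched page with that page's URL as base_url, so the later loop over allLinks no longer depends on whatever value url held last.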

Finally, loop over your mails and write them to a CSV file:

import csv

with open("emails.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Email'])
    for mail in mails:
        # writerow expects a sequence of fields; a bare string would be split into characters
        w.writerow([mail])
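
The collect-and-write step can be tried without any network access. The sketch below is an illustration, not the asker's exact method: it reuses the character classes from the question's pattern but searches a whole text with findall instead of matching anchor text one element at a time, and it writes to an in-memory buffer in place of emails.csv. The names EMAIL_RE, extract_emails and write_emails, as well as the sample text, are hypothetical. Note that csv.writer.writerow expects a sequence of fields, so each address is wrapped in a one-element list.

```python
import csv
import io
import re

# Loose pattern based on the one in the question; it is not a full address validator.
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")


def extract_emails(text):
    """Return the unique addresses found in text, sorted for stable output."""
    return sorted(set(EMAIL_RE.findall(text)))


def write_emails(csv_file, emails):
    """Write a header row followed by one address per row."""
    w = csv.writer(csv_file)
    w.writerow(["Email"])
    for mail in emails:
        w.writerow([mail])  # one-element list, so the address stays in a single column


page_text = "Contact: info@example.com or sales@example.org, again info@example.com"
emails = extract_emails(page_text)

buffer = io.StringIO()  # stand-in for open("emails.csv", "w", newline="")
write_emails(buffer, emails)
```

Using a set inside extract_emails removes duplicates before anything is written, which replaces the manual "not in mails" check from the question.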
