Scraping emails from websites with Python 3.x

Problem description

I have a script which is supposed to take a list of websites and search them for email addresses (see the code below). Every time an error occurs, e.g. "website forbidden" or "service temporarily unavailable", the script starts over from the beginning.

# -*- coding: utf-8 -*-

import urllib.request, urllib.error
import re
import csv
import pandas as pd
import os
import ssl

# 1: Get input file path from user '.../Documents/upw/websites.csv'
user_input = input("Enter the path of your file: ")

# If input file doesn't exist
if not os.path.exists(user_input):
    print("File not found, verify the location - ", str(user_input))


def sites(e):
    pass


while True:
    try:
        # 2. read file
        df = pd.read_csv(user_input)

        # 3. create the output csv file
        with open('Emails.csv', mode='w', newline='') as file:
            csv_writer = csv.writer(file, delimiter=',')
            csv_writer.writerow(['Website', 'Email'])

        # 4. Get websites
        for site in list(df['Website']):
            # print(site)
            gcontext = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
            req = urllib.request.Request("http://" + site, headers={
                'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1",
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                'Accept-Encoding': 'none',
                'Accept-Language': 'en-US,en;q=0.8',
                'Connection': 'keep-alive'
            })

            # 5. Scrape email id
            with urllib.request.urlopen(req, context=gcontext) as url:
                s = url.read().decode('utf-8', 'ignore')
                email = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s)
                print(email)

                # 6. Write the output
                with open('Emails.csv', mode='a', newline='') as file:
                    csv_writer = csv.writer(file, delimiter=',')
                    [csv_writer.writerow([site, item]) for item in email]

    except urllib.error.URLError as e:
        print("Failed to open URL {0} Reason: {1}".format(site, e.reason))

If I remove this code:

def sites(e):
    pass

while True:

the script stops as soon as an error occurs.

What it should do instead is keep searching when an error comes back from the web side, rather than stopping the script.

I have been searching online for a while and looked at several posts, but it seems I am going about it wrong, because I haven't found a solution yet.

Any help would be greatly appreciated.

Tags: python-3.x, urllib

Solution


The problem is your while True: loop. It always restarts because when an exception is raised inside the try block, control jumps into the except block; the loop then comes around again and runs the try block from the beginning.
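A minimal, self-contained sketch of that restart behavior (the items list, the forced ValueError, and the attempts guard are hypothetical stand-ins for your websites and network failures):

items = ["a", "b", "c"]
attempts = 0

while True:
    attempts += 1
    if attempts > 2:
        break  # guard so the demo terminates; the original script has no such guard
    try:
        for item in items:
            if item == "b":
                raise ValueError("simulated network error")  # stands in for a failed request
            print("processed", item)
    except ValueError as e:
        print("caught:", e)

# Prints "processed a" twice: after each exception the while loop comes
# around again and the for loop starts over from the first item.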

When you take out the while True: and an exception occurs, the process stops entirely: the exception raised in the try block halts that block, execution moves on to the except block, and then to whatever comes after it in the program.

What you want is the try block inside the loop over df['Website']. That way, if an exception is thrown while one website is being processed, the loop moves on to the next website in the list instead of going all the way back to reading the dataframe and iterating over the websites from the start, as in the code below.

# 2. read file
df = pd.read_csv(user_input)

# 3. create the output csv file
with open('Emails.csv', mode='w', newline='') as file:
    csv_writer = csv.writer(file, delimiter=',')
    csv_writer.writerow(['Website', 'Email'])

# 4. Get websites
for site in list(df['Website']):
    try:
        # print(site)
        gcontext = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
        req = urllib.request.Request("http://" + site, headers={
            'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1",
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'none',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive'
        })

        # 5. Scrape email id
        with urllib.request.urlopen(req, context=gcontext) as url:
            s = url.read().decode('utf-8', 'ignore')
            email = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s)
            print(email)

            # 6. Write the output
            with open('Emails.csv', mode='a', newline='') as file:
                csv_writer = csv.writer(file, delimiter=',')
                for item in email:
                    csv_writer.writerow([site, item])

    except urllib.error.URLError as e:
        print("Failed to open URL {0} Reason: {1}".format(site, e.reason))
