Python - Web scraping at multiple depth levels within a page using the requests module

Problem description

I have a Python 3 script that performs web scraping based on the URLs provided in a CSV file. I am trying to achieve the following:

1.) Get the page from a URL provided in the CSV file

2.) Scrape it with regex + beautifulsoup and search for email addresses, then, if any emails are found, save them to the results.csv file

3.) Search for all other <a> tags (links) on the page

4.) Go to / get all the links found on the first page (the 1st level of scraping) and perform the same actions

5.) Do the same up to a user-defined depth level (e.g. if the user asks for 3 levels of depth: get the page of level 1 (the URL from the CSV file) and perform the required actions on it -> get all pages of level 2 (the links found on level 1) and perform the required actions -> get all pages of level 3 (the links found on level 2) and perform the required actions -> and so on...) (see the rough sketch below)
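A rough sketch of the structure I have in mind (scrape_page and urls_from_csv are only placeholder names, this is not working code):

#Level-by-level loop I am trying to build
current_level = urls_from_csv              #level 1: the URLs from the CSV file
for depth in range(max_level_of_depth):
    next_level = []
    for url in current_level:
        #Placeholder: fetch the page, save any emails, return the links found on it
        links_found = scrape_page(url)
        next_level.extend(links_found)
    #The links found on this level become the pages to scrape on the next level
    current_level = next_level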

How can I create a loop that handles the scraping level by level like this? I have tried many variations of for and while loops, but I could not come up with a working solution.

Here is the code I have so far (at the moment it only handles the 1st level of scraping):

from bs4 import BeautifulSoup
import requests
import csv
import re

import time
import sys, os

#Ask the user for the maximum level of depth for this run of the script
while True:
    try:
        max_level_of_depth = int(input('Max level of depth for webscraping (must be a number - integer): '))
        print('Do not open the input and neither the output CSV files before the script finishes!')
        break
    except ValueError:
        print('You must type a number (integer)! Try again...\n')
        
#Read the csv file with urls
with open('urls.csv', mode='r') as urls:
    #Loop through each url from the csv file
    for url in urls:
        #Strip the trailing newline from the url
        url_from_csv_to_scrape = url.rstrip('\n')
        print('[FROM CSV] Going to ' + url_from_csv_to_scrape)
        #time.sleep(3)
        i = 1
        #Get the content of the webpage
        page = requests.get(url_from_csv_to_scrape)
        page_content = page.text
        soup = BeautifulSoup(page_content, 'lxml')
        #Find all <p> tags on the page
        paragraphs_on_page = soup.find_all('p')
        for paragraph in paragraphs_on_page:
            #Search for email address in the 1st level of the page
            emails = re.findall(r'[a-zA-Z0-9_\-\.]+@[a-zA-Z0-9_\-\.]+\.[a-zA-Z]{2,5}', str(paragraph))
            #If some emails are found on the webpage, save them to csv
            if emails:
                with open('results.csv', mode='a') as results:
                    for email in emails:
                        print(email)
                        if email.endswith(('.jpg', '.jpeg', '.png', '.JPG', '.JPEG', '.PNG')):
                            continue
                        results.write(url_from_csv_to_scrape + ', ' + email + '\n')
                        print('Found an email. Saved it to the output file.\n')
        #Find all <a> tags on the page
        links_on_page = soup.find_all('a')
        #Initialize a list that will be populated with all the urls found on the page, to be crawled later
        found_links_with_href = []
        #Loop through all the <a> tags on the page
        for link in links_on_page:
            try:
                #If <a> tag has href attribute
                if link['href']:
                    link_with_href = link['href']
                    #If the link from the webpage does not have domain and protocol in it, prepend them to it
                    if re.match(r'https?://', link_with_href) is None:
                        #Strip only a leading slash; removing every slash would mangle paths like /contact/form
                        link_with_href = link_with_href.lstrip('/')
                        #Prepend the protocol and domain (the base URL from the CSV) in front of the link
                        link_with_href = url_from_csv_to_scrape.rstrip('/') + '/' + link_with_href
                        #print(link_with_href)
                    found_links_with_href.append(link_with_href)
                    #Keep an independent copy of the list (a plain assignment would only create an alias)
                    found_links_with_href_backup = list(found_links_with_href)
            except KeyError:
                #If <a> tag does not have href attribute, continue to the next <a> tag
                print('No href attribute found, going to next <a> tag...')
                continue

Any help is greatly appreciated.

Thanks

Tags: python, web-scraping, python-requests

Solution


Here is some pseudocode:

def find_page(page):
    #Pseudocode: 'regex' stands for whatever pattern extracts the links from the page
    new = re.findall('regex', page.text)
    #Use extend, not append, so the links are queued one by one (append would add the whole list as a single item)
    new_pages.extend(new)
    return len(new)

new_pages = [page]    #start with the page(s) you already have
used_pages = []       #pages that have already been processed
check = True
while check:
    check = False
    #Iterate over a snapshot, because find_page() adds new items to new_pages
    for item in list(new_pages):
        if item not in used_pages:
            found = find_page(item)
            #'find emails' on this page here
            if found > 0:
                check = True    #new pages were found, so do another pass
        used_pages.append(item)
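Applied to the code from the question, the same idea becomes a level-by-level (breadth-first) crawl. The sketch below is only one possible way to wire it together, not a drop-in final script: scrape_page() is a helper name introduced here, urljoin from the standard library replaces the manual protocol/domain prepending, and a visited set is an addition so the same URL is not fetched twice.

from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
import re

EMAIL_REGEX = re.compile(r'[a-zA-Z0-9_\-\.]+@[a-zA-Z0-9_\-\.]+\.[a-zA-Z]{2,5}')

def scrape_page(url):
    #Fetch one page, append any emails found in <p> tags to results.csv,
    #and return the (absolute) links found on the page
    try:
        page = requests.get(url, timeout=10)
    except requests.RequestException:
        return []
    soup = BeautifulSoup(page.text, 'lxml')
    with open('results.csv', mode='a') as results:
        for paragraph in soup.find_all('p'):
            for email in EMAIL_REGEX.findall(str(paragraph)):
                #Skip regex matches that are actually image file names
                if email.lower().endswith(('.jpg', '.jpeg', '.png')):
                    continue
                results.write(url + ', ' + email + '\n')
    #Collect the links and turn relative ones into absolute URLs
    return [urljoin(url, a_tag['href']) for a_tag in soup.find_all('a', href=True)]

max_level_of_depth = int(input('Max level of depth for webscraping: '))

with open('urls.csv', mode='r') as urls:
    #Level 1 is the list of URLs from the CSV file
    current_level = [line.rstrip('\n') for line in urls if line.strip()]

visited = set()
for depth in range(1, max_level_of_depth + 1):
    print('Scraping level ' + str(depth) + ' (' + str(len(current_level)) + ' pages)')
    next_level = []
    for url in current_level:
        if url in visited:
            continue
        visited.add(url)
        next_level.extend(scrape_page(url))
    #The links found on this level are the pages for the next level
    current_level = next_level

Depending on the sites being crawled you will probably also want to restrict the links to the same domain and add a small delay between requests, because the number of pages can grow very quickly with each additional level.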
