python - Web scraping at multiple depth levels within a page using the requests module
Problem Description
I have a Python 3 script that scrapes web pages based on URLs provided in a CSV file. I am trying to achieve the following:
1.) Fetch the page from a URL provided in the CSV file
2.) Scrape it with regex + BeautifulSoup and search for email addresses; if any emails are found, save them to a results.csv file
3.) Find all other links on the page
4.) Visit/fetch every link found on the first page (first level of scraping) and do the same there
5.) Repeat this for a user-defined number of depth levels (e.g. if the user asks for 3 levels: fetch the level-1 page (the URL from the CSV file) and perform the required actions on it -> fetch all level-2 pages (links taken from level 1) and perform the required actions -> fetch all level-3 pages (links taken from level 2) and perform the required actions -> and so on...)
How can I create a loop that handles scraping by depth level? I have tried many variations with for and while loops, but I could not come up with a working solution.
Here is the code I have so far (at the moment it only handles the first level of scraping):
from bs4 import BeautifulSoup
import requests
import csv
import re
import time
import sys, os

#Ask for the maximum depth level for this run of the script
while True:
    try:
        max_level_of_depth = int(input('Max level of depth for webscraping (must be a number - integer): '))
        print('Do not open the input or the output CSV files before the script finishes!')
        break
    except ValueError:
        print('You must type a number (integer)! Try again...\n')

#Read the csv file with urls
with open('urls.csv', mode='r') as urls:
    #Loop through each url from the csv file
    for url in urls:
        #Strip the trailing newline from the url
        url_from_csv_to_scrape = url.rstrip('\n')
        print('[FROM CSV] Going to ' + url_from_csv_to_scrape)
        #time.sleep(3)
        i = 1
        #Get the content of the webpage
        page = requests.get(url_from_csv_to_scrape)
        page_content = page.text
        soup = BeautifulSoup(page_content, 'lxml')
        #Find all <p> tags on the page
        paragraphs_on_page = soup.find_all('p')
        for paragraph in paragraphs_on_page:
            #Search for email addresses in the 1st level of the page
            emails = re.findall(r'[a-zA-Z0-9_\-\.]+@[a-zA-Z0-9_\-\.]+\.[a-zA-Z]{2,5}', str(paragraph))
            #If some emails are found on the webpage, save them to csv
            if emails:
                with open('results.csv', mode='a') as results:
                    for email in emails:
                        print(email)
                        #Skip matches that are actually image filenames
                        if email.endswith(('.jpg', '.jpeg', '.png', '.JPG', '.JPEG', '.PNG')):
                            continue
                        results.write(url_from_csv_to_scrape + ', ' + email + '\n')
                        print('Found an email. Saved it to the output file.\n')
        #Find all <a> tags on the page
        links_on_page = soup.find_all('a')
        #Initiate a list which will be populated with all found urls to be crawled
        found_links_with_href = []
        #Loop through all the <a> tags on the page
        for link in links_on_page:
            try:
                #If <a> tag has an href attribute
                if link['href']:
                    link_with_href = link['href']
                    #If the link does not include protocol and domain, prepend them
                    if re.match(r'https://', link_with_href) is None and re.match(r'http://', link_with_href) is None:
                        #If the link starts with a slash, drop it so it is not doubled after prepending
                        link_with_href = link_with_href.lstrip('/')
                        #Prepend the domain and protocol in front of the link
                        link_with_href = url_from_csv_to_scrape + link_with_href
                    #print(link_with_href)
                    found_links_with_href.append(link_with_href)
                    found_links_with_href_backup = found_links_with_href
            except KeyError:
                #If <a> tag does not have an href attribute, continue
                print('No href attribute found, going to next <a> tag...')
                continue
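As a side note on the link normalization above: manually prepending the protocol and domain (and stripping slashes) is fragile. The standard library's urllib.parse.urljoin resolves relative hrefs against the current page's URL and handles leading slashes and already-absolute links correctly. A minimal illustration, with an assumed example base URL:

```python
from urllib.parse import urljoin

base = "https://example.com/dir/page.html"

# An absolute path is resolved against the site root
print(urljoin(base, "/contact"))            # https://example.com/contact
# A relative path is resolved against the current directory
print(urljoin(base, "about.html"))          # https://example.com/dir/about.html
# An already-absolute link is returned unchanged
print(urljoin(base, "https://other.org/x")) # https://other.org/x
```

This would replace the whole re.match/lstrip/prepend branch with a single call.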
Any help is greatly appreciated.
Thanks
Solution
Here is some pseudocode:
def find_page(page):
    new = re.findall('regex', page.text)
    new_pages.append(new)
    return len(new)

check = True
new_pages = [page]
used_pages = []

while check:
    for item in new_pages:
        if item not in used_pages:
            found = find_page(item)
            if found == 0:
                check = False
            else:
                'find emails'
            used_pages.append(item)
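Fleshed out, this idea becomes a depth-limited frontier loop: keep a list of URLs for the current level, scrape each one, collect the links into the frontier for the next level, and stop after max_depth passes. The sketch below stubs out the network with an in-memory PAGES dict so the control flow can be tested offline; all names here (PAGES, crawl, the fake URLs) are illustrative, and in the real script the page text would come from requests.get(url).text:

```python
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
LINK_RE = re.compile(r'href="([^"]+)"')

# Stand-in pages; a real crawler would fetch these over HTTP instead.
PAGES = {
    "http://site/1": '<a href="http://site/2">next</a> <p>a@example.com</p>',
    "http://site/2": '<a href="http://site/1">back</a> <p>b@example.com</p>',
}

def crawl(start, max_depth):
    emails = []
    seen = set()          # avoids refetching pages that link to each other
    frontier = [start]    # level 1: the URL(s) from the CSV file
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            if url in seen:
                continue
            seen.add(url)
            html = PAGES.get(url, "")   # real script: requests.get(url).text
            emails.extend(EMAIL_RE.findall(html))
            next_frontier.extend(LINK_RE.findall(html))
        frontier = next_frontier        # links found here feed the next level
    return emails

print(crawl("http://site/1", 2))  # ['a@example.com', 'b@example.com']
```

The seen set is what keeps the loop from running forever when two pages link to each other, which is the role used_pages plays in the pseudocode above.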