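python - How can I scrape faster with Selenium and BeautifulSoup?
Problem description
Thanks to the help of the wonderful people here, I was able to put together some code to scrape a web page. Because the page is dynamic, I had to use Selenium, since BeautifulSoup can only be used on its own when you are scraping static pages.
One drawback is that the whole process of opening the page, waiting for the pop-up to appear, and entering the input takes a lot of time. Time is an issue here because I have to scrape about 1000 pages (one page per zip code), which takes about 10 hours.
How can I optimize the code so that this operation does not take so long?
I will leave the complete code and the list of zip codes below so the problem can be reproduced.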
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import pandas as pd
time_of_day=[]
price=[]
Hours=[]
day=[]
disabled=[]
location=[]
danishzip = pd.read_excel (r'D:\Danish_ZIPs.xlsx')
for i in range(len(danishzip)):
    try:
        zipcode = danishzip['Zip'][i]
        driver = webdriver.Chrome(executable_path = r'C:\Users\user\lib\chromedriver_77.0.3865.40.exe')
        wait = WebDriverWait(driver,10)
        driver.maximize_window()
        driver.get("https://www.nemlig.com/")
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".timeslot-prompt.initial-animation-done")))
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[type='tel'][class^='pro']"))).send_keys(str(zipcode))
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".btn.prompt__button"))).click()
        time.sleep(3)
        soup=BeautifulSoup(driver.page_source,'html.parser')
        for morn,d in zip(soup.select_one('[data-automation="beforDinnerRowTmSlt"]').select('.time-block__time'),soup.select_one('[data-automation="beforDinnerRowTmSlt"]').select('.time-block__item')):
            location.append(soup.find('span', class_='zipAndCity').text)
            time_of_day.append(soup.select_one('[data-automation="beforDinnerRowTmSlt"] > .time-block__row-header').text)
            Hours.append(morn.text)
            price.append(morn.find_next(class_="time-block__cost").text)
            day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)
            if 'disabled' in d['class']:
                disabled.append('1')
            else:
                disabled.append('0')
        for after,d in zip(soup.select_one('[data-automation="afternoonRowTmSlt"]').select('.time-block__time'),soup.select_one('[data-automation="afternoonRowTmSlt"]').select('.time-block__item')):
            location.append(soup.find('span', class_='zipAndCity').text)
            time_of_day.append(soup.select_one('[data-automation="afternoonRowTmSlt"] > .time-block__row-header').text)
            Hours.append(after.text)
            price.append(after.find_next(class_="time-block__cost").text)
            day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)
            if 'disabled' in d['class']:
                disabled.append('1')
            else:
                disabled.append('0')
        for evenin,d in zip(soup.select_one('[data-automation="eveningRowTmSlt"]').select('.time-block__time'),soup.select_one('[data-automation="eveningRowTmSlt"]').select('.time-block__item')):
            location.append(soup.find('span', class_='zipAndCity').text)
            time_of_day.append(soup.select_one('[data-automation="eveningRowTmSlt"] > .time-block__row-header').text)
            Hours.append(evenin.text)
            price.append(evenin.find_next(class_="time-block__cost").text)
            day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)
            if 'disabled' in d['class']:
                disabled.append('1')
            else:
                disabled.append('0')
        df = pd.DataFrame({"time_of_day":time_of_day,"Hours":Hours,"price":price,"Day":day,"Disabled" : disabled, "Location": location})
        print(df)
        driver.close()
    except Exception:
        time_of_day.append('No Zipcode')
        location.append('No Zipcode')
        Hours.append('No Zipcode')
        price.append('No Zipcode')
        day.append('No Zipcode')
        disabled.append('No Zipcode')
        df = pd.DataFrame({"time_of_day":time_of_day,"Hours":Hours,"price":price,"Day":day,"Disabled" : disabled, "Location": location})
        driver.close()
List of zip codes: https://en.wikipedia.org/wiki/List_of_postal_codes_in_Denmark
Solution
A single simple request is enough to get all the information in JSON format:
import requests
headers = {
    'sec-fetch-mode': 'cors',
    'dnt': '1',
    'pragma': 'no-cache',
    'accept-encoding': 'gzip, deflate, br',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/77.0.3865.120 Safari/537.36',
    'accept': 'application/json, text/plain, */*',
    'cache-control': 'no-cache',
    'authority': 'www.nemlig.com',
    'referer': 'https://www.nemlig.com/',
    'sec-fetch-site': 'same-origin',
}
response = requests.get('https://www.nemlig.com/webapi/v2/Delivery/GetDeliveryDays?days=8', headers=headers)
json_data = response.json()
For example, you can change the days= parameter to 20 and get 20 days of data.
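To make the days= adjustment less error-prone than editing the URL string by hand, the query string can be built from a value. The following is only a minimal sketch: the endpoint is the one from the snippet above, while build_delivery_days_url is a hypothetical helper name introduced here for illustration, and the sketch does not check which values the server actually accepts.

```python
from urllib.parse import urlencode

# Endpoint copied from the answer's snippet; the helper below is illustrative only.
BASE_URL = "https://www.nemlig.com/webapi/v2/Delivery/GetDeliveryDays"


def build_delivery_days_url(days=8):
    """Return the endpoint URL with the `days` query parameter set.

    Hypothetical helper, not part of requests or of the site's API.
    """
    if days < 1:
        raise ValueError("days must be a positive integer")
    # urlencode turns {'days': 20} into the query string 'days=20'.
    return f"{BASE_URL}?{urlencode({'days': days})}"


print(build_delivery_days_url(20))
```

The resulting URL can be passed to requests.get(url, headers=headers) in place of the hard-coded string. Alternatively, requests.get accepts a params={'days': 20} argument and appends the same query string itself.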