javascript - How do I "force" JavaScript to render when using HTMLSession.render() during web scraping?
Problem description
I need to scrape postcode data from a website: https://www.pos.com.my/postal-services/quick-access/?postcode-finder#postcodeIds=01000
I started with the usual BeautifulSoup workflow, but then noticed that certain elements could not be found, even though I could locate them when inspecting the page in the browser.
After some research, I suspected this was caused by JavaScript rendering the page dynamically.
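A quick way to confirm that suspicion is to fetch the page without a browser and look for table rows in the raw markup: if the count is zero but the browser shows a table, the rows are being injected by JavaScript. A minimal sketch (the helper name is my own):

```python
def count_raw_table_rows(html: str) -> int:
    """Crude check: count <tr tags in the raw, unrendered markup."""
    return html.lower().count("<tr")

# usage sketch (requires network access):
# import requests
# raw = requests.get(url).text
# if count_raw_table_rows(raw) == 0:
#     print("table is likely rendered client-side by JavaScript")
```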
I then followed the tutorial here, http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/, and it worked well on this page: https://www.pos.com.my/postal-services/quick-access/?postcode-finder#postcodeIds=50250
Naturally, I then went on to loop over the range of possible IDs to extract the data from every page.
I found that the code did not always behave consistently when I looped it over different pages.
For example, when I ran it on this page, https://www.pos.com.my/postal-services/quick-access/?postcode-finder#postcodeIds=01000, the code failed to find the postcode table.
I have been playing around with the code to find an explanation, but to no avail.
I suspect that I may somehow need to refresh the JavaScript rendering, or reset the browser session, on each iteration.
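One way to do that is to retry the render until the expected content actually appears. `requests_html`'s `render()` already accepts `retries`, `sleep`, and `timeout` arguments (for example `resp.html.render(retries=3, sleep=2, timeout=20)`). A library-agnostic retry wrapper, sketched here with hypothetical callables that the caller supplies (one that opens a fresh session and renders the page, one that checks the result looks complete), could look like:

```python
import time

def render_with_retries(get_page, is_complete, attempts=3, delay=2.0):
    """Call get_page() up to `attempts` times and return the first result
    that satisfies is_complete(); sleep `delay` seconds between attempts.
    Returns the last result if no attempt ever passes the check."""
    last = None
    for _ in range(attempts):
        last = get_page()
        if is_complete(last):
            return last
        time.sleep(delay)
    return last
```

The caller decides what "complete" means, e.g. `is_complete=lambda html: "<tr" in html`, so a half-rendered page is retried with a fresh fetch instead of being parsed.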
# http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/
# import HTMLSession from requests_html
from requests_html import HTMLSession
from bs4 import BeautifulSoup

# set 'root' url
rurl = 'https://www.pos.com.my/postal-services/quick-access/?postcode-finder#postcodeIds='

# generate candidate urls from the 5-digit postcode range
urls = []
for i in range(1000, 99999):
    urls.append(rurl + str(i).zfill(5))
#for url in urls:
#    print(url)

# prepare file for output
filename = "MY_POS_Malaysia_postcodes.csv"
f = open(filename, "a+")
f.write("url,location,post_office,postcode_str,state\n")

for url in urls:
    # create a fresh HTML Session object per url
    print("Start session")
    session = HTMLSession()
    # use the session to fetch the page
    resp = session.get(url)
    print(resp)
    # run the page's JavaScript so the 'missing' elements are rendered
    resp.html.render()
    # parse the rendered HTML with BeautifulSoup
    soup = BeautifulSoup(resp.html.html, "lxml")
    print("Start: " + url)
    # look for tr elements (this assumes tr is used exclusively by the postcode table)
    postcodes = soup.find_all("tr")
    # sanity check on the first row before extracting
    if len(postcodes) > 0 and len(postcodes[0]) == 9:
        print("Number of postcodes: " + str(len(postcodes)))
        for postcode in postcodes[1:]:  # skip the header row
            cells = postcode.find_all('td')
            location = cells[0].text.strip()
            post_office = cells[1].text.strip()
            postcode_str = cells[2].text.strip()
            state = cells[3].text.strip()
            print("url: " + url)
            print("location: " + location)
            print("post_office: " + post_office)
            print("postcode_str: " + postcode_str)
            print("state: " + state)
            # strip commas so they don't break the csv columns
            f.write(url.replace(",", " ") + ","
                    + location.replace(",", " ") + ","
                    + post_office.replace(",", " ") + ","
                    + postcode_str.replace(",", " ") + ","
                    + state + "\n")
        print("End: " + url)
    else:
        # no table found: write an empty record for this url
        f.write(url + "," + " " + "," + " " + "," + " " + "," + " " + "\n")
    session.close()
    print("Close session")

f.close()
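As an aside, the `.replace(",", " ")` calls above mangle any field that legitimately contains a comma. The standard `csv` module quotes such fields automatically, so nothing is lost. A minimal sketch (`write_rows` is a name of my choosing; the sample row is made up):

```python
import csv
import io

def write_rows(fileobj, rows):
    """Write (url, location, post_office, postcode_str, state) tuples as CSV,
    letting csv.writer quote any field that contains a comma."""
    writer = csv.writer(fileobj)
    writer.writerow(["url", "location", "post_office", "postcode_str", "state"])
    writer.writerows(rows)

# usage with an in-memory buffer; a real run would pass open(filename, "a+", newline="")
buf = io.StringIO()
write_rows(buf, [("http://example.com/1", "Alor Setar, Kedah",
                  "Pejabat Pos", "01000", "Kedah")])
```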
For every page whose URL exists, I want to extract the postcode table and store it in a CSV file.
I would also appreciate ideas on how to obtain the URLs that actually exist, instead of brute-forcing through a range of numbers.
Thanks!
Solution
I ended up using Selenium instead of HTMLSession.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome(executable_path='/chromedriver_win32/chromedriver.exe')

# wp kuala lumpur has 284 result pages:
# https://www.pos.com.my/postal-services/quick-access/?postcodeFinderState=wp%20kuala%20lumpur&postcodeFinderLocation=&page=1000

# set 'root' url
rurl = 'https://www.pos.com.my/postal-services/quick-access/?postcodeFinderState=wp%20kuala%20lumpur&postcodeFinderLocation=&page='

# generate urls
urls = []
for i in range(1, 284):
    urls.append(rurl + str(i))

# prepare file for output
filename = "MY_POS_Malaysia_postcodes_selenium_kl.csv"
f = open(filename, "a+")
f.write("url,record\n")

timeout = 30
for url in urls:
    driver.get(url)
    try:
        # wait until the postcode table has been rendered
        WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located((By.ID, "postcode-container")))
    except TimeoutException:
        # table never appeared; skip this page and keep the driver alive
        continue
    postcode_element = driver.find_element(By.ID, 'postcode-container').text
    # the element's visible text yields one line per rendered table row/cell
    postcodes = postcode_element.splitlines()
    for postcode in postcodes:
        print(postcode)
        f.write(url + "," + postcode + "\n")

f.close()
driver.quit()
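On the question's last point, about enumerating URLs that actually exist instead of brute-forcing postcode IDs: the state-filtered URL pattern used in this solution can be generated per state name, so only valid state/page combinations are ever requested. A sketch, assuming the query parameters behave as in the URL above (the list of state names and the page count per state still have to come from the site's own dropdown or from a first probing request):

```python
from urllib.parse import quote

# assumed URL template, taken from the state-filtered URL used in the solution
BASE = ("https://www.pos.com.my/postal-services/quick-access/"
        "?postcodeFinderState={state}&postcodeFinderLocation=&page={page}")

def state_urls(state: str, pages: int):
    """Yield one results-page URL per page for the given state name."""
    encoded = quote(state)  # 'wp kuala lumpur' -> 'wp%20kuala%20lumpur'
    for page in range(1, pages + 1):
        yield BASE.format(state=encoded, page=page)
```

With a dictionary such as `{"wp kuala lumpur": 284, ...}` mapping each state to its page count, the scraping loop can iterate over `state_urls()` output instead of a numeric range.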