How to "force" JavaScript rendering when using HTMLSession.render() during web scraping?

Problem description

I need to scrape postcode data from this website: https://www.pos.com.my/postal-services/quick-access/?postcode-finder#postcodeIds=01000

I started with my usual BeautifulSoup workflow, but then noticed that certain elements could not be found, even though I can see them when inspecting the page.

After some research, I suspect this is caused by JavaScript rendering the page dynamically.
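One quick way to verify that (a minimal check added here for illustration, not part of the original post) is to count how many table rows the raw, un-rendered HTML actually contains:

import requests
from bs4 import BeautifulSoup

url = 'https://www.pos.com.my/postal-services/quick-access/?postcode-finder#postcodeIds=01000'

# fetch the page without executing any JavaScript
html = requests.get(url).text
soup = BeautifulSoup(html, "lxml")

# if the postcode table is injected by JavaScript, few or no <tr> rows show up here
print(len(soup.find_all("tr")))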

I then followed the tutorial at http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/ and it worked fine on this page: https://www.pos.com.my/postal-services/quick-access/?postcode-finder#postcodeIds=50250

Naturally, I then went on to loop over the range of possible IDs to extract the data from each page.

I found that the same code does not behave consistently when I loop it over different pages.

For example, when I run the code on this page, https://www.pos.com.my/postal-services/quick-access/?postcode-finder#postcodeIds=01000, it fails to find the postcode table.

I have been fiddling with the code to find an explanation, but to no avail.

I suspect that I may somehow need to refresh the JavaScript rendering or reset the browser session on every request.
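For what it's worth, requests_html's render() also accepts arguments such as wait, sleep, retries and timeout, which give the headless Chromium more time before the DOM is captured. The values below are only illustrative, a sketch of what "forcing" a fuller render per URL might look like:

from requests_html import HTMLSession

url = 'https://www.pos.com.my/postal-services/quick-access/?postcode-finder#postcodeIds=01000'

session = HTMLSession()
resp = session.get(url)
# wait    - seconds to wait before rendering
# sleep   - seconds to sleep after the initial render
# retries - how many times to retry loading the page in Chromium
# timeout - per-render timeout in seconds
resp.html.render(wait=1, sleep=2, retries=3, timeout=20)
print(len(resp.html.find("tr")))   # rows visible after rendering
session.close()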


# http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/

# import HTMLSession from requests_html
from requests_html import HTMLSession
from bs4 import BeautifulSoup

# set 'root' url
rurl = 'https://www.pos.com.my/postal-services/quick-access/?postcode-finder#postcodeIds='
urls = []

for i in range(1000,99999):
    url = rurl + str(i).zfill(5)
    urls.append(url)

#for url in urls:
#    print(url)

# prepare file for output
filename = "MY_POS_Malaysia_postcodes.csv"
f = open(filename, "a+")
headers = "url,location, post_office, postcode_str, state\n"
f.write(headers)

# create a fresh HTML Session object for each URL in the loop below

for url in urls:
    print("Start session")
    session = HTMLSession()
    # Use the object above to connect to needed webpage
    resp = session.get(url)
    print(resp)
    # Run JavaScript code on webpage, so that the 'missing' elements are now shown
    resp.html.render()
    # create beautifulsoup object
    soup = BeautifulSoup(resp.html.html, "lxml")
    # look for tr elements (this assumes tr elements exclusively contain postcode information)
    # sanity check below: the header row should have 9 child nodes
    print("Start: " + url)
    postcodes = soup.find_all("tr")
    if len(postcodes) > 0 and len(postcodes[0]) == 9:
        print("Number of postcodes: " + str(len(postcodes)))
        for postcode in postcodes[1:]:
            cells = postcode.find_all('td')
            location = cells[0].text.strip()
            post_office = cells[1].text.strip()
            postcode_str = cells[2].text.strip()
            state = cells[3].text.strip()
            print("url: " + url)
            print("location: " + location)
            print("post_office: " + post_office) 
            print("postcode_str: " + postcode_str)
            print("state: " + state)
            print('Start writing...')
            f.write(url.replace(",", " ") + "," 
                + location.replace(",", " ") + "," 
                + post_office.replace(",", " ") + ","
                + postcode_str.replace(",", " ") + "," 
                + state + "\n")
            print('End writing')
        print("End: " + url)
    else:
        f.write(url + "," 
                + " " + "," 
                + " " + ","
                + " " + "," 
                + " " + "\n")
    session.close()
    print("Close session")

f.close()

For every page whose URL exists, I want to extract the postcode table and store it in a CSV file.

I would also appreciate ideas on how to obtain the URLs that actually exist, rather than brute-forcing a range of numbers.

Thanks!

Tags: javascript, python, web-scraping, beautifulsoup, geospatial

Solution


I ended up using selenium instead of HTMLSession.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import pandas as pd


# path to the local chromedriver binary (adjust for your setup)
driver = webdriver.Chrome(executable_path='/chromedriver_win32/chromedriver.exe')

# WP Kuala Lumpur (284 pages of results), URL pattern:
# https://www.pos.com.my/postal-services/quick-access/?postcodeFinderState=wp%20kuala%20lumpur&postcodeFinderLocation=&page=1000

# set 'root' url
rurl = 'https://www.pos.com.my/postal-services/quick-access/?postcodeFinderState=wp%20kuala%20lumpur&postcodeFinderLocation=&page='
urls = []

# generate urls
for i in range(1,284):
    url = rurl + str(i)
    urls.append(url)

# prepare file for output
filename = "MY_POS_Malaysia_postcodes_selenium_kl.csv"
f = open(filename, "a+")
headers = "url,record\n"
f.write(headers)

for url in urls:
    driver.get(url)
    timeout = 30
    try:
        # wait until the rendered postcode container is visible on the page
        WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located((By.ID, "postcode-container")))
    except TimeoutException:
        # the container never appeared; skip this page instead of quitting the driver
        continue
    postcode_element = driver.find_element(By.ID, 'postcode-container').text
    # each line of the container's text is one record
    postcodes = postcode_element.splitlines()
    for postcode in postcodes:
        print(postcode)
        f.write(url + "," + postcode + "\n")

driver.quit()
f.close()
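The answer imports pandas but never uses it inside the loop; presumably the CSV is loaded afterwards for further processing. A minimal sketch of that step, assuming none of the scraped lines contain extra commas:

import pandas as pd

# column names come from the header row written above ("url,record")
df = pd.read_csv("MY_POS_Malaysia_postcodes_selenium_kl.csv")
print(df.head())
print(len(df), "records scraped")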

