Is there a way to make a dynamic web page run its JavaScript automatically when scraping it with Python?

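Problem description

While trying to do some web scraping in Python with BeautifulSoup, I ran into a lot of problems. Because this particular page is dynamic, I have been trying to use Selenium to "open" the page first and then process the dynamic content with BeautifulSoup.

The problem is that the dynamic content only shows up in my HTML output if I manually scroll through the site while the program is running. Otherwise those parts of the HTML stay empty, just as if I were using BeautifulSoup on its own without Selenium.

Here is my code: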

import time
from bs4 import BeautifulSoup
from selenium import webdriver

if __name__ == "__main__":

    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    # options.add_argument('--headless')

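    # Raw string so the backslashes in the Windows path are not treated as escapes;
    # `options=` replaces the deprecated `chrome_options=` keyword.
    driver = webdriver.Chrome(r"C:\Program Files (x86)\chromedriver.exe", options=options)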
    driver.get('https://coinmarketcap.com/')
    time.sleep(5)

    html = driver.page_source

    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.tbody
    trs = tbody.contents

    for tr in trs:
        print(tr)

    driver.close()

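Now, if I open Chrome through Selenium with the headless option enabled, I get the same output I would normally get without the page being preloaded. The same thing happens if I am not in headless mode and simply let the page load on its own without manually scrolling through the content. Does anyone know why? Is there a way to get the dynamic content to load without having to scroll manually every time I run the code?

Tags: python, selenium, beautifulsoup

Solution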


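The data is in fact loaded dynamically by JavaScript, so you can fetch it directly from the JSON response of the API call instead of parsing the rendered HTML:

Here is a working example:

Code: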

import requests

# JSON endpoint that serves the listing data
url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath,atl,high24h,low24h,num_market_pairs,cmc_rank,date_added,max_supply,circulating_supply,total_supply,volume_7d,volume_30d'
r = requests.get(url)

# Each entry in cryptoCurrencyList describes one cryptocurrency
for item in r.json()['data']['cryptoCurrencyList']:
    name = item['name']
    print('crypto_name:' + str(name))

Output:

crypto_name:Bitcoin
crypto_name:Ethereum
crypto_name:Binance Coin     
crypto_name:Cardano
crypto_name:Tether
crypto_name:Solana
crypto_name:XRP
crypto_name:Polkadot
crypto_name:USD Coin
crypto_name:Dogecoin
crypto_name:Terra
crypto_name:Uniswap
crypto_name:Wrapped Bitcoin  
crypto_name:Litecoin
crypto_name:Avalanche        
crypto_name:Binance USD      
crypto_name:Chainlink        
crypto_name:Bitcoin Cash     
crypto_name:Algorand
crypto_name:SHIBA INU        
crypto_name:Polygon
crypto_name:Stellar
crypto_name:VeChain
crypto_name:Internet Computer
crypto_name:Cosmos
crypto_name:FTX Token
crypto_name:Filecoin
crypto_name:Axie Infinity
crypto_name:Ethereum Classic
crypto_name:TRON
crypto_name:Bitcoin BEP2
crypto_name:Dai
crypto_name:THETA
crypto_name:Tezos
crypto_name:Fantom
crypto_name:Hedera
crypto_name:NEAR Protocol
crypto_name:Elrond
crypto_name:Monero
crypto_name:Crypto.com Coin
crypto_name:PancakeSwap
crypto_name:EOS
crypto_name:The Graph
crypto_name:Flow
crypto_name:Aave
crypto_name:Klaytn
crypto_name:IOTA
crypto_name:eCash
crypto_name:Quant
crypto_name:Bitcoin SV
crypto_name:Neo
crypto_name:Kusama
crypto_name:UNUS SED LEO
crypto_name:Waves
crypto_name:Stacks
crypto_name:TerraUSD
crypto_name:Harmony
crypto_name:Maker
crypto_name:BitTorrent
crypto_name:Celo
crypto_name:Helium
crypto_name:OMG Network
crypto_name:THORChain
crypto_name:Dash
crypto_name:Amp
crypto_name:Zcash
crypto_name:Compound
crypto_name:Chiliz
crypto_name:Arweave
crypto_name:Holo
crypto_name:Decred
crypto_name:NEM
crypto_name:Theta Fuel
crypto_name:Enjin Coin
crypto_name:Revain
crypto_name:Huobi Token
crypto_name:OKB
crypto_name:Decentraland
crypto_name:SushiSwap
crypto_name:ICON
crypto_name:XDC Network
crypto_name:Qtum
crypto_name:TrueUSD
crypto_name:yearn.finance
crypto_name:Nexo
crypto_name:Celsius
crypto_name:Bitcoin Gold
crypto_name:Curve DAO Token
crypto_name:Mina
crypto_name:KuCoin Token
crypto_name:Zilliqa
crypto_name:Perpetual Protocol
crypto_name:Ren
crypto_name:dYdX
crypto_name:Ravencoin
crypto_name:Synthetix
crypto_name:renBTC
crypto_name:Telcoin
crypto_name:Basic Attention Token
crypto_name:Horizen
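
If driving a real browser is still preferred, the manual scrolling described in the question can be scripted. The following is only a minimal sketch, not a tested fix for this particular site: the helper name `scroll_until_loaded` and its parameters `pause`, `max_rounds` and `step` are hypothetical, and it assumes that the page renders its remaining rows as they scroll into view and keeps them in the DOM afterwards. It relies only on Selenium's `execute_script` method.

```python
import time


def scroll_until_loaded(driver, pause=1.0, max_rounds=30, step=800):
    """Scroll down in steps until the bottom of the page stops moving.

    `driver` is any object exposing Selenium's `execute_script` method.
    Returns the number of scroll steps performed.
    """
    position = 0
    rounds = 0
    while rounds < max_rounds:
        # Re-read the height each round, since new content may extend the page
        height = driver.execute_script("return document.body.scrollHeight")
        if position >= height:
            break  # at the bottom and the page did not grow any further
        position = min(position + step, height)
        driver.execute_script("window.scrollTo(0, arguments[0]);", position)
        time.sleep(pause)  # give the page's JavaScript time to render new rows
        rounds += 1
    return rounds
```

With the code from the question, this would be called as `scroll_until_loaded(driver)` after `driver.get(...)` and before reading `driver.page_source`. Scrolling in small steps rather than jumping straight to the bottom is a deliberate choice here, since a single jump may skip over rows that are only rendered while they are in the viewport.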
