Web scraping: iterating over different pages always returns the content of the first page

Problem description

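I am trying to get some data from the table at "https://etfdb.com/screener/". I can fetch the content of the first page, but when I change the URL to "https://etfdb.com/screener/#page=X" (where X = 1 to 90), I still get the same output as for the first page.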

import bs4 as bs
import requests

parsed = []
for page in range(1, 91):  # pages 1 through 90
    url = 'https://etfdb.com/screener/#page=' + str(page)
    resp = requests.get(url, headers={
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    })
    soup = bs.BeautifulSoup(resp.content, 'lxml')

    table = soup.find('table', {'class': 'table table-bordered table-hover table-striped mm-mobile-table'})
    i = 0
    # Walk the cells one table row (8 cells) at a time
    while i < len(table.find_all('td')):
        try:
            ticker = table.find_all('td')[i].text
            name = table.find_all('td')[i+1].text
            asset_class = table.find_all('td')[i+7].text
            parsed.append([ticker, name, asset_class])
        except IndexError:
            pass
        i = i + 8

Even when I set the page number manually, I still get the results of the first page. I tried changing "#page" to "?page" as suggested here, but to no avail.

Tags: python, python-3.x, web-scraping, beautifulsoup

Solution


So, use Selenium. Basically, it loads the first page and then clicks "Next". It keeps going until there are no more pages to visit.
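As a quick illustration of why the requests loop in the question never leaves page one: the part of a URL after "#" is a fragment, and HTTP clients do not send it to the server, so every iteration asks for the same resource. The sketch below only uses the standard library; `request_target` is a hypothetical helper written for this illustration. That the page's JavaScript is what reads the fragment and loads the rows is an assumption, but it would explain why a real browser is needed.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the path (plus query, if any) that an HTTP client sends to the server."""
    parts = urlsplit(url)
    target = parts.path or '/'
    if parts.query:
        target += '?' + parts.query
    return target


urls = ['https://etfdb.com/screener/#page=%d' % page for page in (1, 2, 90)]
for url in urls:
    # The fragment stays on the client; only the request target goes to the server.
    print(urlsplit(url).fragment, '->', request_target(url))
```

All three URLs map to the same request target, which matches the behaviour described in the question.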

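The issue I ran into was that it ran too fast, so at some point it failed to find "Next" and crashed. I added a 1-second delay (there are better ways to do this, such as an implicit wait; I'm still learning how to use those properly).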
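One alternative to a fixed delay is to poll until the "Next" link is actually there, giving up after a timeout. The following is a minimal plain-Python sketch of that idea so it can run without a browser; `wait_until` and `next_link_ready` are hypothetical names made up for this example, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.25):
    """Call condition() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        time.sleep(poll)


# Stand-in for "is the Next link present yet?": becomes truthy on the third check.
attempts = {'count': 0}


def next_link_ready():
    attempts['count'] += 1
    return 'next-link' if attempts['count'] >= 3 else None


found = wait_until(next_link_ready, timeout=2.0, poll=0.01)
print(found, 'after', attempts['count'], 'checks')
```

With Selenium, the condition could be something like a lambda that calls `driver.find_elements` with the "Next" XPath, since `find_elements` returns an empty list when nothing matches. Selenium also ships its own version of this pattern: `WebDriverWait(driver, timeout).until(...)` combined with `expected_conditions.element_to_be_clickable`, or the simpler `driver.implicitly_wait(seconds)`.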

But this will get you going.

import bs4 as bs
import pandas as pd
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# XPath of the "Next" link in the pagination bar
NEXT_XPATH = '//*[@id="mobile_table_pills"]/div[1]/div/div[2]/div/ul/li[8]/a'

# Selenium 4 style; on Selenium 3, pass the chromedriver path directly to webdriver.Chrome()
driver = webdriver.Chrome(service=Service('C:/chromedriver_win32/chromedriver.exe'))
parsed = []
url = 'https://etfdb.com/screener/'
driver.get(url)

while True:
    try:
        # Parse the table as it is currently rendered in the browser
        soup = bs.BeautifulSoup(driver.page_source, 'lxml')
        table = soup.find('table', {'class': 'table table-bordered table-hover table-striped mm-mobile-table'})
        cells = table.find_all('td')
        # Each row has 8 cells; the first three are ticker, name and asset class
        for i in range(0, len(cells), 8):
            try:
                parsed.append([cells[i].text, cells[i+1].text, cells[i+2].text])
            except IndexError:
                pass
        # Click "Next"; find_element raises when the link is missing, which ends the loop
        driver.find_element(By.XPATH, NEXT_XPATH).click()
        print('Acquired page: %s' % (driver.current_url.split('page=')[-1]))
        time.sleep(1)  # crude delay so the next page has time to load
    except Exception:
        break


df = pd.DataFrame(parsed, columns=['Ticker', 'Name', 'Asset Class'])
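
The inner loop in the script steps through the `td` cells eight at a time and keeps the first three of each group. A rough standalone sketch of that grouping, using made-up cell texts instead of a live page, is shown here. `rows_from_cells` is a hypothetical helper, and the 8-cells-per-row layout is taken from the script above rather than verified against the site.

```python
def rows_from_cells(cells, cells_per_row=8, wanted=(0, 1, 2)):
    """Group a flat list of cell texts into rows and keep the wanted columns."""
    rows = []
    for start in range(0, len(cells), cells_per_row):
        row = cells[start:start + cells_per_row]
        if len(row) < cells_per_row:
            break  # ignore a trailing partial row
        rows.append([row[i] for i in wanted])
    return rows


# Made-up cell texts standing in for [td.text for td in table.find_all('td')]
sample = ['SPY', 'Name A', 'Equity', 'c3', 'c4', 'c5', 'c6', 'c7',
          'AGG', 'Name B', 'Bond', 'c3', 'c4', 'c5', 'c6', 'c7']
print(rows_from_cells(sample))
```

Unlike the script, this sketch drops an incomplete final row instead of relying on an exception, which is a design choice of the example.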

Output:

print (df)
     Ticker      ...        Asset Class
0       SPY      ...             Equity
1       IVV      ...             Equity
2       VTI      ...             Equity
3       VOO      ...             Equity
4       VEA      ...             Equity
5       QQQ      ...             Equity
6       EFA      ...             Equity
7       VWO      ...             Equity
8      IEMG      ...             Equity
9       AGG      ...               Bond
10     IEFA      ...             Equity
11      IJH      ...             Equity
12      VTV      ...             Equity
13      IJR      ...             Equity
14      IWM      ...             Equity
15      IWF      ...             Equity
16      IWD      ...             Equity
17      BND      ...               Bond
18      VUG      ...             Equity
19      EEM      ...             Equity
20      GLD      ...          Commodity
21      VNQ      ...        Real Estate
22      VIG      ...             Equity
23      LQD      ...               Bond
24       VB      ...             Equity
25       VO      ...             Equity
26      XLF      ...             Equity
27     VCSH      ...               Bond
28     USMV      ...             Equity
29      VEU      ...             Equity
    ...      ...                ...
2219    BDD      ...          Commodity
2220   WDRW      ...             Equity
2221   LACK      ...             Equity
2222   HONR      ...             Equity
2223   PEXL      ...             Equity
2224  FOANC      ...             Equity
2225    DYY      ...          Commodity
2226   HAUD      ...             Equity
2227    SCC      ...             Equity
2228   PASS      ...             Equity
2229   CHEP      ...       Alternatives
2230   EKAR      ...             Equity
2231    LTL      ...             Equity
2232    INR      ...           Currency
2233   BUYN      ...             Equity
2234  PETZC      ...             Equity
2235    SBM      ...             Equity
2236   RPUT      ...       Alternatives
2237    SZO      ...          Commodity
2238    EEH      ...             Equity
2239   HEWW      ...             Equity
2240    FUE      ...          Commodity
2241    AGF      ...          Commodity
2242  GRBIC      ...             Equity
2243    VSL      ...             Equity
2244   DLBL      ...               Bond
2245    BOS      ...          Commodity
2246     LD      ...          Commodity
2247    BOM      ...          Commodity
2248    DDP      ...          Commodity

[2249 rows x 3 columns]
