python - A tough scraping case with Selenium

Problem description

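I'm trying to scrape a web table with Selenium, locating the table through an XPath expression.

I first tried looking for the table by its class, but no table was found, so I decided to look for the enclosing div element instead.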

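from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait

# `driver` is an already-created WebDriver with the target page loaded.
xpath="//div[@class='table-scroller ScrollableTable__table-scroller QuoteHistoryTable__table__scroller QuoteHistoryTable__QuoteHistoryTable__table__scroller']"
WebDriverWait(driver, 10).until(
        expected_conditions.visibility_of_element_located((By.XPATH, xpath)))
source = driver.page_source
driver.quit()
soup = BeautifulSoup(source, "html5lib")

table = soup.find('div', {'class': 'table-scroller ScrollableTable__table-scroller QuoteHistoryTable__table__scroller QuoteHistoryTable__QuoteHistoryTable__table__scroller'})
# Newer pandas versions expect a file-like object rather than a raw HTML string.
df = pd.read_html(StringIO(str(table)), flavor='html5lib', header=0, thousands='.', decimal=',')
print(df[0])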

The problem is that only the header gets printed, followed by a first row full of NaNs.

Why am I not getting the table's values? What makes this content so hard to scrape?

Edit: @DebanjanB was able to provide a great answer, but I cannot reproduce the output. What is the reason for that?

Tags: python, selenium, selenium-webdriver, xpath, webdriverwait

Solution

If you inspect the requests the page makes, you may notice an endpoint that returns the data you want as JSON:

https://api.euroinvestor.dk/indices/21/instruments

You can read it straight from the URL with pandas (you don't even need Selenium):

instruments = pd.read_json('https://api.euroinvestor.dk/indices/21/instruments')
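If you would rather look at the payload before handing it to pandas, the same request can be sketched with the standard library alone. This is only an illustration: the shape of the response is not shown above, so `flatten` and `to_rows` are hypothetical helpers that make no assumption about field names, and the `User-Agent` value is an arbitrary placeholder.

```python
import json
from urllib.request import Request, urlopen

API_URL = "https://api.euroinvestor.dk/indices/21/instruments"


def fetch_json(url, timeout=10):
    """Download a URL and decode the response body as JSON."""
    # The User-Agent value is a placeholder, not something the API is known to require.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def to_rows(payload):
    """Accept a list of records or a single record and return flat rows."""
    records = payload if isinstance(payload, list) else [payload]
    return [flatten(r) for r in records if isinstance(r, dict)]
```

Calling `to_rows(fetch_json(API_URL))` should give a list of flat dictionaries that `pd.DataFrame(...)` accepts directly, provided the endpoint returns a JSON object or a list of objects.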

Be sure to check the API's terms of use (especially any rate limits); otherwise you may get blocked.
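If you end up polling the endpoint repeatedly, one simple way to stay polite is to enforce a minimum delay between requests. The sketch below is a hypothetical helper, not part of any library, and the interval you pass in is your own choice; the API's actual limits are not stated here.

```python
import time


class Throttle:
    """Make sure at least `min_interval` seconds pass between wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Called too soon: sleep for the rest of the interval.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Create one instance, for example `throttle = Throttle(2.0)`, and call `throttle.wait()` right before each request.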
