Getting empty table data when web scraping with Python

Problem description

import requests
from bs4 import BeautifulSoup
import lxml.html as lh
from lxml.html.clean import clean_html

url = "https://whalewisdom.com/filer/renaissance-technologies-llc#tabholdings_tab_link"
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.content, 'html.parser')
# lxml's fromstring() does not take a parser-name string the way BeautifulSoup does
doc = lh.fromstring(response.content).xpath("//table[@id='current_holdings_table']")


for i in doc:
  html_data = lh.tostring(i)
  print(html_data)

#soup_table = doc.findAll('table', attrs={'id': 'current_holdings_table'})

You can see the output in the image below — I'm getting back empty table data:

(screenshot of the empty output)
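For what it's worth, the XPath itself is fine: run against HTML that actually contains the table, it matches. A minimal sketch against an inline snippet (hypothetical data) confirms the selector works, which suggests the real page simply doesn't include the table in the server response — it is filled in by JavaScript after the page loads:

```python
import lxml.html as lh

# hypothetical stand-in for the page, with the table present in the raw HTML
html = """
<html><body>
  <table id="current_holdings_table">
    <tr><td>AAPL</td><td>1000</td></tr>
    <tr><td>MSFT</td><td>2000</td></tr>
  </table>
</body></html>
"""

# same selector as in the question
tables = lh.fromstring(html).xpath("//table[@id='current_holdings_table']")
print(len(tables))  # the selector matches when the table exists in the HTML
for row in tables[0].xpath(".//tr"):
    print([cell.text for cell in row.xpath(".//td")])
```

Since `requests` only sees the raw server response, it never runs the JavaScript that populates the table — hence the empty result.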

Tags: python, web-scraping, beautifulsoup, python-requests, lxml

Solution


I'm not familiar with BeautifulSoup, but here's how to do it with Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# use a raw string so the backslashes aren't treated as escape sequences;
# adjust the path to wherever chromedriver lives on your machine
path = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(service=Service(executable_path=path))
url = "https://whalewisdom.com/filer/renaissance-technologies-llc#tabholdings_tab_link"
driver.get(url)
table = driver.execute_script("return document.getElementById('current_holdings_table')")
print(table)
rows = driver.find_elements(By.XPATH, "//table[@id='current_holdings_table']//tr")
for row in rows:
    print(row.get_attribute('innerHTML'))

If you don't want a Chrome window to open, you can use a headless browser such as PhantomJS. You'll need `pip install phantomjs` (https://pypi.org/project/phantomjs/). The code to run it is:

# note: PhantomJS support was removed in Selenium 4, so this requires Selenium 3.x
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
url = "https://whalewisdom.com/filer/renaissance-technologies-llc#tabholdings_tab_link"
driver.get(url)
table = driver.execute_script("return document.getElementById('current_holdings_table')")
rows = driver.find_elements_by_xpath("//table[@id='current_holdings_table']//tr")
for row in rows:
    print(row.get_attribute('innerHTML'))

You may need a few time.sleep() calls to give the page time to load in the headless browser before scraping the table values. (Selenium's explicit waits, via WebDriverWait, are a more reliable alternative to fixed sleeps.)
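Once the rows have loaded, each row's innerHTML string can be turned into a list of cell values — for example with lxml, which the question already imports. A minimal sketch using a hypothetical row string in place of what `row.get_attribute('innerHTML')` would return:

```python
import lxml.html as lh

# hypothetical innerHTML of one <tr>, standing in for row.get_attribute('innerHTML')
inner_html = "<td>AAPL</td><td>Apple Inc</td><td>1,234,567</td>"

# re-wrap the fragment in a table so the HTML parser keeps the <td> cells
table = lh.fromstring("<table><tr>" + inner_html + "</tr></table>")
cells = [td.text_content().strip() for td in table.xpath(".//td")]
print(cells)  # ['AAPL', 'Apple Inc', '1,234,567']
```

This keeps the browser work (Selenium) separate from the parsing work (lxml), so the parsing can be tested without a driver.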
