Extracting an HTML table from a dynamically loaded website

Problem description

I am developing a script to extract an HTML table from a dynamic website. Here is my script:

import time
import sys

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.implicitly_wait(20)

URL = 'https://www.ccee.org.br/portal/faces/pages_publico/o-que-fazemos/como_ccee_atua/precos/precos_medios?_adf.ctrl-state=7e1fw5zdn_14&_afrLoop=19197915280379#!%40%40%3F_afrLoop%3D19197915280379%26_adf.ctrl-state%3D7e1fw5zdn_18'

driver.get(URL)
time.sleep(50)
soup = BeautifulSoup(driver.page_source, "html.parser")
table = soup.find('html')

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll(["td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for item in list_of_rows:
    ' '.join(item)

Data = pd.DataFrame(list_of_rows)

Data.to_csv('Data.csv' ,index = False)

driver.quit()

I am using Selenium for the extraction, but I could not get the table from the page at that URL. When I run this script, I get the following table:

          0         1         2     3     4     5
0                                                
1                                None  None  None
2                                None  None  None
3                  OK        OK              None
4        OK                None  None  None  None
5                                            None
6                          None  None  None  None
7                                None  None  None
8            OKCancel  OKCancel              None
9  OKCancel                None  None  None  None

Tags: python-3.x, selenium-webdriver, selenium-chromedriver

Solution

I have modified your code, and it now exports the table correctly.

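  • The main problem is probably that your table sits inside an iframe, so the driver has to switch into that iframe before it can interact with the table at all.
  • BeautifulSoup's cell.text includes "\n" and "\t" characters, which I removed with a regular expression.
  • See the inline comments in the code for more details, and let me know if you have any questions.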
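
To illustrate the whitespace cleanup from the second point on its own, here is a minimal sketch using only the standard library. The helper name clean_cell and the sample string are hypothetical and not part of the original script. It also differs slightly from the regex in the script: it collapses every run of whitespace to a single space and trims the ends, rather than deleting a newline followed by tabs.

```python
import re

def clean_cell(text):
    """Collapse runs of whitespace (newlines, tabs, spaces) and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical raw cell text, similar in shape to what cell.text can return
raw = "\n\t\t\tSE/CO\n\t\t"
cleaned = clean_cell(raw)
```

Collapsing to a single space instead of an empty string keeps words apart when a cell's text is broken across several lines in the page source.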

Modified script:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import re

driver = webdriver.Chrome()
URL = 'https://www.ccee.org.br/portal/faces/pages_publico/o-que-fazemos/como_ccee_atua/precos/precos_medios?_adf.ctrl-state=7e1fw5zdn_14&_afrLoop=19197915280379#!%40%40%3F_afrLoop%3D19197915280379%26_adf.ctrl-state%3D7e1fw5zdn_18'

driver.get(URL)

WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.ID, 'pt1:myFrame')))   # wait for the iframe to load
driver.switch_to.frame('pt1:myFrame')   # switch into the iframe that contains the table

WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH,"//table//thead/tr/th")))  # wait for table header to load
soup = BeautifulSoup(driver.page_source, "html.parser")
table = soup.find('html')

list_of_rows = []
for row in table.find_all('tr'):
    list_of_cells = []
    for cell in row.find_all('td'):
        text = re.sub(r'[\n\t]+', '', cell.text)   # remove newline and tab characters
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

data = pd.DataFrame(list_of_rows)
data.dropna(axis=0, how='any', inplace=True)   # drop rows with missing cells (rows shorter than the widest row)
header = ['Mes', 'SE/CO', 'S', 'NE', 'N']
data.to_csv('Datax.csv', header=header, index=False)

driver.quit()
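
The dropna step above works because pandas pads shorter rows with None, so any row with fewer cells than the widest row gets dropped. As a rough sketch of the same idea without pandas, the rows can be filtered by length before writing the CSV. The function names and sample values below are made up for illustration. Only the header names come from the script above, and the sketch assumes the data rows have exactly as many cells as the header.

```python
import csv
import io

HEADER = ['Mes', 'SE/CO', 'S', 'NE', 'N']  # column names taken from the script above

def keep_full_rows(rows, width):
    """Return only the rows that have exactly `width` cells."""
    return [row for row in rows if len(row) == width]

def rows_to_csv(rows, header):
    """Serialize the header and the rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical sample: the values are invented for illustration only
sample_rows = [
    [],                                      # empty layout row
    ['OK', 'Cancel'],                        # dialog buttons, not table data
    ['jan/20', '1.0', '2.0', '3.0', '4.0'],  # a full data row
]
full_rows = keep_full_rows(sample_rows, len(HEADER))
csv_text = rows_to_csv(full_rows, HEADER)
```

Filtering by an explicit width makes the intent visible in the code, whereas dropna depends on how the DataFrame happened to pad the ragged rows.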
