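python - How to extract a table from a website using Python?
Problem description
I wrote some code to extract the table from this website ( http://www.nhb.gov.in/OnlineClient/MonthlyPriceAndArrivalReport.aspx ), but I can't get it to work.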
import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import pandas as pd
chrome_path = r"C:\Users\user\Desktop\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.nhb.gov.in/OnlineClient/MonthlyPriceAndArrivalReport.aspx")
html_source = driver.page_source
results=[]
#cauliflower
element_month = driver.find_element_by_id ("ctl00_ContentPlaceHolder1_ddlmonth")
drp_month = Select(element_month)
drp_month.select_by_visible_text("January")
element_category_name = driver.find_element_by_id ("ctl00_ContentPlaceHolder1_drpCategoryName")
drp_category_name = Select(element_category_name)
drp_category_name.select_by_visible_text("VEGETABLES")
time.sleep(2)
element_crop_name = driver.find_element_by_id ("ctl00_ContentPlaceHolder1_drpCropName")
drp_crop_name = Select(element_crop_name)
drp_crop_name.select_by_value("117")
time.sleep(2)
element_variety_name = driver.find_element_by_id ("ctl00_ContentPlaceHolder1_ddlvariety")
drp_variety_name = Select(element_variety_name)
drp_variety_name.select_by_value("18")
element_state = driver.find_element_by_id ("ctl00_ContentPlaceHolder1_LsboxCenterList")
drp_state = Select(element_state)
drp_state.select_by_visible_text("AHMEDABAD")
driver.find_element_by_xpath("""//*[@id="ctl00_ContentPlaceHolder1_btnSearch"]""").click()
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = pd.read_html(driver.page_source)[3]
#number three is arbitrary. I tried all numbers from 1 to 6 and python did not recognize the table at
#the bottom of the screen.
print(len(table))
print(table)
# cauliflower results go on a sheet named "cauliflower";
# the with block saves the workbook on exit, so no writer.save() is needed
with pd.ExcelWriter(r'C:\Users\user\Desktop\python.xlsx') as writer:
    table.to_excel(writer, sheet_name="cauliflower", index=False)
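Can you help me figure out how to extract the table at the bottom of the page? Your help would be much appreciated. Thanks in advance.
Solution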
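You can do this without BeautifulSoup. After clicking the search button, induce WebDriverWait() and wait for visibility_of_element_located(), then get the table element's markup with get_attribute('outerHTML'). Pass that to pd.read_html(str(tableelement))[0] and print(table).
The rest, exporting to Excel or CSV, you can do as you already have.
Code: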
driver.find_element_by_xpath("//*[@id='ctl00_ContentPlaceHolder1_btnSearch']").click()
tableelement=WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"table#ctl00_ContentPlaceHolder1_GridViewmonthlypriceandarrivalreport"))).get_attribute('outerHTML')
table = pd.read_html(str(tableelement))[0]
print(table)
You need to import the following libraries.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
If you also want to use BeautifulSoup, try this code.
driver.find_element_by_xpath("//*[@id='ctl00_ContentPlaceHolder1_btnSearch']").click()
WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"table#ctl00_ContentPlaceHolder1_GridViewmonthlypriceandarrivalreport")))
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = pd.read_html(str(soup))[-1]
print(table)
Output:
S.No. CenterName ... Day30 Day31
0 1.0 AHMEDABAD / अहमदाबाद ... 1.002502e+15 2.005004e+15
1 NaN NaN ... NaN NaN
[2 rows x 35 columns]
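The output above includes a row that is entirely NaN, so it may be worth cleaning the frame before exporting it. The sketch below shows one way to do that with pandas alone. The helper name clean_report and the sample values are made up for illustration; the real table has 35 columns.

```python
import pandas as pd


def clean_report(table: pd.DataFrame) -> pd.DataFrame:
    """Drop rows and columns that are entirely NaN, then renumber the rows."""
    cleaned = table.dropna(axis=0, how="all").dropna(axis=1, how="all")
    return cleaned.reset_index(drop=True)


# Hypothetical stand-in for the scraped table: one data row plus an empty row.
sample = pd.DataFrame(
    {
        "S.No.": [1.0, None],
        "CenterName": ["AHMEDABAD", None],
        "Day1": [1200.0, None],  # invented value, not taken from the site
    }
)

cleaned = clean_report(sample)
print(cleaned)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = cleaned.to_csv(index=False)
print(csv_text)
```

To write a file instead, pass a path such as cleaned.to_csv("cauliflower.csv", index=False), or reuse the pd.ExcelWriter block from the question with cleaned in place of table.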