Using Selenium in Python to get data from a dynamic website: how do I find out how the database query is made?

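Question

I have some coding experience, but not specifically with web applications. My task is to get data from this website: http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/

The data is published daily. I have used Selenium in Python and so far it has worked well: I can get the whole table, store it in a pandas DataFrame, then in a MySQL database, and so on. The problem is that the results from the website are always the same!

This is my code: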

from selenium import webdriver
from bs4 import BeautifulSoup as bs
import time

def GetDataFromWeb(day, month, year):
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    options.add_argument('window-size=1920x1080')
    #had to use these two below because of webdriver crashing issues
    options.add_argument('no-sandbox')
    options.add_argument('disable-dev-shm-usage')

    driver = webdriver.Chrome(chrome_options=options)

    driver.get("http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/")

    #the table is on an iframe
    iframe = driver.find_element_by_id("bvmf_iframe")
    driver.switch_to.default_content()
    driver.switch_to.frame(iframe)

    #getting to the place where I should input the data
    date = driver.find_element_by_id("Data")
    date.send_keys("/".join((str(day),str(month),str(year))))
    date = driver.find_element_by_tag_name("button").click()

    #I have put this wait just to be sure it doesn't try to get info from an unloaded page
    time.sleep(5)

    page = bs(driver.page_source,"html.parser")

    table = page.find(id='tb_principal1')

    headers = ['Dias Corridos', '252','360']

    matrix = []
    for rows in table.select('tr')[2:]:
        values = []
        for columns in rows.select('td'):
            values.append(columns.text.replace(',','.'))
        matrix.append(values)

    df = pd.DataFrame(data=matrix, columns=headers)

    driver.close()

    #only the first 2 columns are interesting for my purposes
    return df.iloc[:,0:2]

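No matter what input I send, the table this function produces is always the same. The values seem to correspond to the date 06/09/2018 (month = 09, day = 06). I think the main problem is that I don't know how the query to their database is made, so it always runs as if with a "default date". I have read people talking about Ajax and JavaScript requests, but I don't know whether that is the case here. How can I find out?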

Tags: python, jquery, ajax, selenium, iframe

Answer

This code works (I updated a few lines of your code):

from selenium import webdriver
from bs4 import BeautifulSoup as bs
import time
import pandas as pd

def GetDataFromWeb(day, month, year):

    #to avoid data error in date handler: day and month must be two digits
    if month < 10:
        month = "0" + str(month)
    if day < 10:
        day = "0" + str(day)

    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    options.add_argument('window-size=1920x1080')
    #had to use these two below because of webdriver crashing issues
    options.add_argument('no-sandbox')
    options.add_argument('disable-dev-shm-usage')

    driver = webdriver.Chrome(chrome_options=options)

    driver.get("http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-de-derivativos/precos-referenciais/taxas-referenciais-bm-fbovespa/")

    #the table is on an iframe
    iframe = driver.find_element_by_id("bvmf_iframe")
    driver.switch_to.default_content()
    driver.switch_to.frame(iframe)

    #getting to the place where I should input the data
    date = driver.find_element_by_id("Data")
    date.clear()  #to clear auto populated data
    date.send_keys((str(day), str(month), str(year)))  #removed the join part

    driver.find_element_by_tag_name("button").click()

    #I have put this wait just to be sure it doesn't try to get info from an unloaded page
    time.sleep(50)

    page = bs(driver.page_source, "html.parser")

    table = page.find(id='tb_principal1')

    headers = ['Dias Corridos', '252', '360']

    matrix = []
    for rows in table.select('tr')[2:]:
        values = []
        for columns in rows.select('td'):
            values.append(columns.text.replace(',', '.'))
        matrix.append(values)

    df = pd.DataFrame(data=matrix, columns=headers)

    driver.close()

    #only the first 2 columns are interesting for my purposes
    return df.iloc[:, 0:2]

print(GetDataFromWeb(3, 9, 2018))

It prints the data matching the requested date.

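I added the following to avoid a data error in the date handler:

if month < 10:
    month="0"+str(month)
if day < 10:
    day="0"+str(day)

and changed how the date is typed into the field:

date.clear() #to clear auto populated data
date.send_keys((str(day), str(month), str(year))) #removed the join part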

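Note that the problem in your code was that the day and month fields take two digits, and the line date.send_keys("/".join((str(day), str(month), str(year)))) produced an invalid value. As a result the system date was selected, and you always saw the same data whatever input you gave. Also, clicking the date field selects the default date, so we have to clear the field first and then send the custom date. Hope this helps.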
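The zero-padding step can also be written as a small standalone helper. This is only a sketch: `format_date_for_field` is a hypothetical name, and the `sep` parameter is an assumption added for illustration, since the answer above types only the digits while the original question typed slashes. Going through `datetime.date` has the side benefit of rejecting impossible dates before the browser is even started.

```python
from datetime import date


def format_date_for_field(day, month, year, sep=""):
    """Return the date as zero-padded DD<sep>MM<sep>YYYY text.

    Hypothetical helper: `sep` is "" to type digits only (as in the
    answer's send_keys call) or "/" to include the slashes.
    """
    # date() raises ValueError for impossible dates such as 31/02/2018.
    d = date(year, month, day)
    return f"{d.day:02d}{sep}{d.month:02d}{sep}{d.year:04d}"


print(format_date_for_field(3, 9, 2018))           # 03092018
print(format_date_for_field(3, 9, 2018, sep="/"))  # 03/09/2018
```

The returned string could then be passed to date.send_keys(...) after date.clear(), in place of the two if blocks.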


Update for the follow-up question: add these imports

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

Then use this line in place of the fixed time.sleep(...) wait:

WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR,'#divContainerIframeBmf > form > div > div > div:nth-child(1) > div:nth-child(3) > div > div > p')))
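To show why an explicit wait tends to be preferable to a fixed sleep, here is a minimal pure-Python sketch of the polling idea behind it. It is not Selenium's implementation: `wait_until` and `table_ready` are hypothetical names, and the simulated condition merely stands in for "the element is present". The point is that the call returns as soon as the condition holds and only fails once the timeout has elapsed.

```python
import time


def wait_until(condition, timeout=30.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Simulated page: the "table" only becomes available on the third check.
attempts = {"count": 0}


def table_ready():
    attempts["count"] += 1
    return "table" if attempts["count"] >= 3 else None


print(wait_until(table_ready, timeout=5, poll_interval=0.01))  # table
```

With a fixed time.sleep(50) every call costs 50 seconds even if the page loads in two; with polling, the cost is roughly the actual load time, capped by the timeout.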
