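python - Chromedriver keeps changing the time zone while scraping
Problem description
Below is the beginning of my Python code, which successfully scrapes all the table information from this website and exports it to a CSV file. The only problem I have with this scraper is that Chromedriver keeps changing the time zone in the top-right corner, which ends up skewing my output by assigning incorrect dates to some games. I tried looking in the page source for a link or tag that would let me click "GMT-8 Pacific Time", but unfortunately I couldn't find anything. Frustratingly, when I copy and paste the URL into my browser, Chrome immediately switches back to Pacific Time. Does anyone know how to fix this time zone issue when scraping with Chromedriver? Thanks in advance!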
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
import pandas as pd
# set scope and create empty lists
year = 2018
lastpage = 50
Date = []
Time = []
Team1 = []
Team2 = []
Score = []
All_ML = []
Team1_ML = []
Team2_ML = []
driver = webdriver.Chrome()
driver.get('http://www.oddsportal.com/')
driver.execute_script('op.selectTimeZone(6);')
# set up for loop to loop through all pages
for x in range(1, lastpage + 1):
    url = "http://www.oddsportal.com/baseball/usa/mlb-" + str(year) + "/results/#/page/" + str(x) + "/'"
    driver.get(url)
    # wait until java table loads and then grab data
    element = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.XPATH, '//*[@id="tournamentTable"]')))
    odds = element.text
    print(odds)
    # reformat resulting text for consistency
    odds = re.sub("[0-9] - ", str(year)[-1] + " -- ", odds)
    odds = re.sub(" - ", "\nteam2", odds)
    # split text by line
    odds = odds.split("\n")
    counter = 1
    # set up loop to classify each line of text
    for line in odds:
        # if a game was abandoned or cancelled, set score to N/A
        if re.match(r".*( {1})[a-zA-Z]*\.$", line):
            Score.append("N/A")
        # if date format is matched, add to date list and reset counter
        if re.match("(.{2} .{3} .{4}.*)", line):
            currdate = line[:11]
            Date.append(currdate)
            counter = 1
        # if time format is matched at beginning of string, add time to list, add team1 to list, check if there was a new date for this game. if not, add current date from previous game
        elif re.match('(.{2}:.{2})', line):
            Time.append(line[:5])
            Team1.append(line[6:])
            if counter > 1:
                Date.append(currdate)
            counter += 1
        # if it's a team2 line, add to team2 list. if score is on the same line, add to score list
        elif re.match("team2.*", line):
            if re.match(".*:.*", line):
                Score.append(re.sub("[a-zA-Z]* *", "", line[-5:]))
                Team2.append(re.sub(" {1}[0-9]*:[0-9]*", "", line[5:len(line)]))
            else:
                Team2.append(re.sub(r" {1}[a-zA-Z]*\.", "", line[5:]))
        # if score is on its own line, add to score list
        elif re.match(".*:.*", line):
            Score.append(re.sub(" ", "", line))
        # add all moneylines to a list
        elif re.match(r"[+\-.*]", line):
            All_ML.append(line)
# close temporary chrome screen once, after the last page (closing it inside the loop would break the next driver.get)
driver.close()
# add odd money lines to list1, even moneylines to list 2
Team1_ML = All_ML[0::2]
Team2_ML = All_ML[1::2]
# create dataframe with all lists
df = pd.DataFrame(
    {'Date': Date,
     'Time': Time,
     'Team1': Team1,
     'Team2': Team2,
     'Score': Score,
     'Team1_ML': Team1_ML,
     'Team2_ML': Team2_ML})
# save
df.to_csv('odds2018.csv')
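Solution
To flesh out pguardiario's comment: if you inspect the buttons in the top-right corner with Chrome DevTools, each one triggers a link to https://www.oddsportal.com/set-timezone/n/, where n is a time zone code.
These links actually trigger a function, op.selectTimeZone(n), which changes the time zone shown on screen. You can try this by entering op.selectTimeZone(n) in your Chrome console.
If that works for you, you can incorporate it by simulating the console JavaScript call, where n is the code of the selected time zone: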
driver.execute_script('op.selectTimeZone(n);')
You can add this after each driver initialization call to force the time zone, for example:
for x in range(1, lastpage + 1):
    url = "http://www.oddsportal.com/baseball/usa/mlb-" + str(year) + "/results/#/page/" + str(x) + "/'"
    driver = webdriver.Chrome()
    driver.get(url)
    # Set timezone
    driver.execute_script('op.selectTimeZone(6);')
    # wait until java table loads and then grab data
    element = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.XPATH, '//*[@id="tournamentTable"]')))
    odds = element.text
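Note that you may need to add a wait timer, since selecting the time zone adds an extra execution step.
Also, you really don't need to re-create the driver on every iteration unless you plan to parallelize the for loop. This will probably run faster if you move the driver initialization and close outside the loop.
EDIT:
So, if you go directly to the results page, you can't set the time zone without triggering a page reload. You may need to move the setup and the initial load outside the loop, for example: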
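As a rough illustration of the wait mentioned above, here is a minimal sketch that runs the time zone call and then polls until the browser reports the document as loaded. The helper name set_timezone_and_wait and its timeout and poll parameters are hypothetical, and treating document.readyState as the "ready" signal is an assumption that has not been checked against this site; waiting on the table element with WebDriverWait, as in the snippet above, may be the more reliable signal.

```python
import time


def set_timezone_and_wait(driver, tz_code, timeout=10.0, poll=0.5):
    """Select a time zone, then poll until the page reports it has loaded.

    `driver` is any object with Selenium's execute_script(script) method.
    Returns True if the page reported "complete" before the timeout.
    """
    # op.selectTimeZone is the site function named in the answer;
    # tz_code is the numeric code (6 in the answer's example).
    driver.execute_script(f"op.selectTimeZone({int(tz_code)});")

    # Assumption: document.readyState is a good enough signal here.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.execute_script("return document.readyState") == "complete":
            return True
        time.sleep(poll)
    return False
```

With a real browser this would be called as set_timezone_and_wait(driver, 6) right after driver.get(...); a False return means the timeout expired and the page should probably not be scraped yet.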
driver = webdriver.Chrome()
driver.get('http://www.oddsportal.com/')
# Proc JS on-click for timezone selection button
driver.execute_script("op.showHideTimeZone();ElementSelect.expand( 'user-header-timezone' , 'user-header-timezone-expander' , null , function(){op.hideTimeZone()} );this.blur();")
driver.execute_script('op.selectTimeZone(6);')
for x in range(1, lastpage + 1):
    url = "http://www.oddsportal.com/baseball/usa/mlb-" + str(year) + "/results/#/page/" + str(x) + "/'"
    driver.get(url)
    # wait until java table loads and then grab data
    element = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.XPATH, '//*[@id="tournamentTable"]')))
    odds = element.text
    print(odds)
# close temporary chrome screen
driver.close()
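The "set the time zone once, then loop over pages" pattern can also be wrapped in a small function. This is only a sketch under stated assumptions: results_url, scrape_pages and the read_table callback are hypothetical names, the URL helper drops the trailing apostrophe that appears in the snippets above, and the dropdown-opening script from the snippet above is left out for brevity (add it back before the selectTimeZone call if the site needs it). Unlike the loop as shown, it keeps the text of every page instead of overwriting odds.

```python
def results_url(year, page):
    """Build the results-page URL used in the snippets above."""
    return (
        "http://www.oddsportal.com/baseball/usa/mlb-"
        f"{year}/results/#/page/{page}/"
    )


def scrape_pages(driver, year, last_page, tz_code, read_table):
    """Set the time zone once on the home page, then visit each page.

    `driver` needs Selenium's get(url) and execute_script(script);
    `read_table` is a caller-supplied function that takes the driver
    and returns the table text for the page currently loaded.
    """
    # Load the home page first so the time zone can be set without
    # a reload of the results page.
    driver.get("http://www.oddsportal.com/")
    driver.execute_script(f"op.selectTimeZone({int(tz_code)});")

    pages = []
    for page in range(1, last_page + 1):
        driver.get(results_url(year, page))
        pages.append(read_table(driver))
    return pages
```

In a real run, read_table would wrap the WebDriverWait call from the snippet above and return element.text, and driver.close() would still be called once after scrape_pages returns.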