python - 带有隐藏列的 Webscraping jTable?
问题描述
我目前正在尝试在 Python 中为以下网页设置 webscraper:
https://understat.com/team/Juventus/2018
专门针对“团队玩家 jTable”
我已经设法用 BeautifulSoup 和 selenium 成功地抓取了表格,但是有一些隐藏的列(可通过选项弹出窗口访问)我无法初始化并包含在我的抓取中。
有谁知道如何改变这个?
import urllib.request
from bs4 import BeautifulSoup
import lxml
import re
import requests
from selenium import webdriver
import pandas as pd
import re
import random
import datetime
base_url = 'https://understat.com/team/Juventus/2018'
url = base_url
data = requests.get(url)
html = data.content
soup = BeautifulSoup(html, 'lxml')
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome('/Users/kylecaron/Desktop/souptest/chromedriver',options=options)
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
headers = soup.find('div', attrs={'class':'players jTable'}).find('table').find_all('th',attrs={'class':'sort'})
headers_list = [header.get_text(strip=True) for header in headers]
body = soup.find('div', attrs={'class':'players jTable'}).table.tbody
all_rows_list = []
for tr in body.find_all('tr'):
row = tr.find_all('td')
current_row = []
for item in row:
current_row.append(item.get_text(strip=True))
all_rows_list.append(current_row)
headers_list = ['№', 'Player', 'Positions', 'Apps', 'Min', 'G', 'A', 'Sh90', 'KP90', 'xG', 'xA', 'xG90', 'xA90']
xg_df = pd.DataFrame(all_rows_list, columns=headers_list)
如果您导航到该网站,则会有隐藏的表格列,例如“XGChain”。我想要刮掉所有这些隐藏的元素,但是做起来有困难。
最好的,凯尔
解决方案
干得好。您仍然可以使用 BeautifulSoup 来遍历tr
andtd
标签,但我总是发现 pandas 更容易获取表格,因为它为您完成了工作。
from selenium import webdriver
import pandas as pd
url = 'https://understat.com/team/Juventus/2018'
driver = webdriver.Chrome()
driver.get(url)
# Click the Options Button
driver.find_element_by_xpath('//*[@id="team-players"]/div[1]/button/i').click()
# Click the fields that are hidden
hidden = [7, 12, 14, 15, 17, 19, 20, 21, 22, 23, 24]
for val in hidden:
x_path = '//*[@id="team-players"]/div[2]/div[2]/div/div[%s]/div[2]/label' %val
driver.find_element_by_xpath(x_path).click()
# Appy the filter
driver.find_element_by_xpath('//*[@id="team-players"]/div[2]/div[3]/a[2]').click()
# get the tables in source
tables = pd.read_html(driver.page_source)
data = tables[1]
data.rename(columns={'Unnamed: 22':"Yellow_Cards", "Unnamed: 23":"Red_Cards"})
driver.close()
输出:
print (data.columns)
Index(['№', 'Player', 'Pos', 'Apps', 'Min', 'G', 'NPG', 'A', 'Sh90', 'KP90',
'xG', 'NPxG', 'xA', 'xGChain', 'xGBuildup', 'xG90', 'NPxG90', 'xA90',
'xG90 + xA90', 'NPxG90 + xA90', 'xGChain90', 'xGBuildup90',
'Yellow_Cards', 'Red_Cards'],
dtype='object')
print (data)
№ Player ... Yellow_Cards Red_Cards
0 1.0 Cristiano Ronaldo ... 2 0
1 2.0 Mario Mandzukic ... 3 0
2 3.0 Paulo Dybala ... 1 0
3 4.0 Federico Bernardeschi ... 2 0
4 5.0 Blaise Matuidi ... 2 0
5 6.0 Rodrigo Bentancur ... 5 1
6 7.0 Juan Cuadrado ... 2 0
7 8.0 Leonardo Bonucci ... 1 0
8 9.0 Miralem Pjanic ... 4 0
9 10.0 Sami Khedira ... 0 0
10 11.0 Giorgio Chiellini ... 1 0
11 12.0 Medhi Benatia ... 2 0
12 13.0 Douglas Costa ... 2 1
13 14.0 Emre Can ... 2 0
14 15.0 Mattia Perin ... 1 0
15 16.0 Mattia De Sciglio ... 0 0
16 17.0 Wojciech Szczesny ... 0 0
17 18.0 Andrea Barzagli ... 0 0
18 19.0 Alex Sandro ... 3 0
19 20.0 Daniele Rugani ... 1 0
20 21.0 Moise Kean ... 0 0
21 22.0 João Cancelo ... 2 0
22 NaN NaN ... 36 2
[23 rows x 24 columns]
推荐阅读
- python - ValueError("张量 %s 不是该图的元素。" % obj)
- node.js - Highlightjs 与 Angular?
- google-signin - 谷歌登录间歇性工作
- php - 使 WC_Cart add_to_cart 方法适用于 Woocommerce 中的客人
- .net - 如何重定向错误但不输出并将文本以正确的顺序写入控制台?(F#)
- markdown - 如何根据 markdown frontmatter 过滤图形 ql 查询?
- javascript - Jquery focusout 不会触发
- git - Git重置原点/主
- c# - 循环的存储选项?C#
- sql - Oracle SQl Developer ORA-01017:用户名/密码无效;登录被拒绝