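python - Scraping data from Basketball Reference
Problem description
I'm trying to scrape some data from a website, but I'm having trouble filtering it down to a single set of results.
I want a DataFrame with all the advanced stats from the 2018-19 season.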
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.basketball-reference.com/players/c/curryst01.html"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
dados_agrupados = pageSoup.find_all("div", {"id": "all_advanced"}, recursive=True)
print(dados_agrupados)
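As you can see, the object dados_agrupados contains the full historical data plus some other information.
How can I filter it further to get the stats for the 2018-19 season specifically?
Solution
To get the advanced stats table, you need to pull it out of the HTML comments, which is where it is located in the page source. I'm not sure what you mean by wanting "all advanced stats from the 2018-19 season": there is only one table here, under id="all_advanced", and it has a single row for that season. If you mean you want to follow the link for that season and pull a table from that page, that's a different problem, but your question isn't clear on that.
So here is how to pull that table and then filter it down to that season's row: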
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.basketball-reference.com/players/c/curryst01.html"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
comments = pageSoup.find_all(string=lambda text: isinstance(text, Comment))
tables = []
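for each in comments:
    # Only comments that contain table markup are worth parsing
    if 'table' in each:
        try:
            tables.append(pd.read_html(each, attrs = {'id': 'advanced'})[0])
        except Exception:
            # Skip comments that hold no table with id="advanced"
            continue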
df = tables[0]
df_filter = df[df['Season'] == '2018-19']
Output:
print (df.to_string())
Season Age Tm Lg Pos G MP PER TS% 3PAr FTr ORB% DRB% TRB% AST% STL% BLK% TOV% USG% Unnamed: 19 OWS DWS WS WS/48 Unnamed: 24 OBPM DBPM BPM VORP
0 2009-10 21.0 GSW NBA PG 80 2896 16.3 0.568 0.332 0.175 1.8 12.0 6.8 24.6 2.5 0.5 16.5 21.8 NaN 3.0 1.6 4.7 0.077 NaN 1.1 -0.5 0.7 2.0
1 2010-11 22.0 GSW NBA PG 74 2489 19.4 0.595 0.325 0.216 2.3 10.9 6.5 28.1 2.2 0.6 16.4 24.4 NaN 5.4 1.3 6.6 0.128 NaN 3.0 -0.7 2.3 2.7
2 2011-12 23.0 GSW NBA PG 26 732 21.2 0.605 0.409 0.159 2.3 11.3 6.8 32.3 2.8 0.8 17.0 24.0 NaN 1.8 0.4 2.2 0.144 NaN 4.1 0.3 4.3 1.2
3 2012-13 24.0 GSW NBA PG 78 2983 21.3 0.589 0.432 0.210 2.3 9.1 5.8 31.1 2.1 0.3 13.7 26.4 NaN 8.4 2.8 11.2 0.180 NaN 5.3 0.1 5.4 5.6
4 2013-14 25.0 GSW NBA PG 78 2846 24.1 0.610 0.445 0.252 1.8 10.9 6.4 39.9 2.2 0.4 16.1 28.3 NaN 9.3 4.0 13.4 0.225 NaN 6.3 1.1 7.4 6.7
5 2014-15 26.0 GSW NBA PG 80 2613 28.0 0.638 0.482 0.251 2.4 11.4 7.0 38.6 3.0 0.5 14.3 28.9 NaN 11.5 4.1 15.7 0.288 NaN 8.2 1.7 9.9 7.9
6 2015-16 27.0 GSW NBA PG 79 2700 31.5 0.669 0.554 0.250 2.9 13.6 8.6 33.7 3.0 0.4 12.9 32.6 NaN 13.8 4.1 17.9 0.318 NaN 10.3 1.6 11.9 9.5
7 2016-17 28.0 GSW NBA PG 79 2638 24.6 0.624 0.547 0.251 2.7 11.4 7.3 31.2 2.6 0.5 13.0 30.1 NaN 8.7 3.9 12.6 0.229 NaN 6.7 0.3 6.9 5.9
8 2017-18 29.0 GSW NBA PG 51 1631 28.2 0.675 0.580 0.350 2.7 14.4 9.0 30.3 2.4 0.4 13.3 31.0 NaN 7.2 1.9 9.1 0.267 NaN 7.8 0.0 7.7 4.0
9 2018-19 30.0 GSW NBA PG 69 2331 24.4 0.641 0.604 0.214 2.2 14.2 8.4 24.2 1.9 0.9 11.6 30.4 NaN 7.2 2.5 9.7 0.199 NaN 7.1 -0.5 6.6 5.1
10 2019-20 31.0 GSW NBA PG 5 139 21.7 0.557 0.598 0.317 3.0 17.8 10.1 42.3 1.7 1.3 14.6 33.6 NaN 0.2 0.1 0.3 0.104 NaN 4.5 -0.6 3.9 0.2
11 Career NaN NaN NBA NaN 699 23998 23.8 0.623 0.481 0.237 2.3 11.8 7.2 31.5 2.5 0.5 14.2 27.9 NaN 76.5 26.7 103.2 0.207 NaN 6.0 0.4 6.4 50.7
And the filtered row:
print (df_filter.to_string())
Season Age Tm Lg Pos G MP PER TS% 3PAr FTr ORB% DRB% TRB% AST% STL% BLK% TOV% USG% Unnamed: 19 OWS DWS WS WS/48 Unnamed: 24 OBPM DBPM BPM VORP
9 2018-19 30.0 GSW NBA PG 69 2331 24.4 0.641 0.604 0.214 2.2 14.2 8.4 24.2 1.9 0.9 11.6 30.4 NaN 7.2 2.5 9.7 0.199 NaN 7.1 -0.5 6.6 5.1
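If you want to see why the table has to come out of the comments, here is a minimal offline sketch of the same idea. It is not the code above: SAMPLE_HTML is a made-up, trimmed-down stand-in for the real page (only the Season, PER and WS values are copied from the output above), rows_from_commented_table is a hypothetical helper name, and the rows are read with BeautifulSoup alone rather than pd.read_html so it runs without a network connection.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical, trimmed-down page: the stats table is wrapped in an HTML
# comment, so a normal search for <table> elements does not see it.
SAMPLE_HTML = """
<div id="all_advanced">
  <!--
  <table id="advanced">
    <thead><tr><th>Season</th><th>PER</th><th>WS</th></tr></thead>
    <tbody>
      <tr><th>2017-18</th><td>28.2</td><td>9.1</td></tr>
      <tr><th>2018-19</th><td>24.4</td><td>9.7</td></tr>
    </tbody>
  </table>
  -->
</div>
"""


def rows_from_commented_table(html, table_id):
    """Return the rows of a commented-out table as a list of dicts."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment nodes are plain strings, so their markup has to be parsed again.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        table = BeautifulSoup(comment, "html.parser").find("table", id=table_id)
        if table is None:
            continue
        header = [cell.get_text(strip=True) for cell in table.thead.find_all("th")]
        return [
            dict(zip(header, (cell.get_text(strip=True)
                              for cell in row.find_all(["th", "td"]))))
            for row in table.tbody.find_all("tr")
        ]
    return []


# The table is invisible until the comment is parsed on its own.
visible_tables = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("table")
rows = rows_from_commented_table(SAMPLE_HTML, "advanced")
season_rows = [row for row in rows if row["Season"] == "2018-19"]

print(visible_tables)
print(season_rows)
```

Note that the values come back as strings here; pd.read_html in the answer converts the numeric columns for you, which is one reason to prefer it on the real page.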