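python - Need help scraping HTML from Basketball Reference
Problem description
I'm very new to web scraping with python/BeautifulSoup/urllib.request, and I have been trying for the longest time to figure out how to scrape this table. I found some other code online and have been trying to understand how it works and modify it, but it always filters out the first column, which I need.
Code: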
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import numpy
# NBA season we will be analyzing
month = "january"
# URL page we will scrape (see image above)
url = "https://www.basketball-reference.com/leagues/NBA_2021_games-{}.html".format(month)
# this is the HTML for given URL
html = urlopen(url)
soup = BeautifulSoup(html)
# use findALL() to get the column headers
soup.findAll()
# use getText()to extract the text we need into a list
headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
# exclude the first column as we will not need the ranking order from Basketball Reference for the analysis
headers=headers[1:]
# avoid the first header row
rows = soup.findAll('tr')[1:]
player_stats = [[td.getText() for td in rows[i].findAll('td')]
                for i in range(len(rows))]
df = pd.DataFrame(player_stats, columns = headers)
Can someone tell me how to scrape the table from this website? I can't for the life of me figure it out: https://www.basketball-reference.com/leagues/NBA_2021_games-january.html
Solution
The simple solution is to use pandas:
import pandas as pd
url = "https://www.basketball-reference.com/leagues/NBA_2021_games-january.html"
# Get a list of tables on the page
df_list = pd.read_html(url)
# Print info.
print(f"Number of tables found: {len(df_list)}") # Out: Number of tables found: 1
# Select first dataframe object and go to town.
df = df_list[0]
print(f"Found {df.shape[0]} rows and {df.shape[1]} columns.") # Found 238 rows and 10 columns.
# You can also drop some of those fields, like the ones for Box Score and Notes, which don't contain too much relevant info:
df.drop(["Unnamed: 6", "Unnamed: 7", "Notes"], axis=1, inplace=True)
Output:
               Date  Start (ET)    Visitor/Neutral    PTS       Home/Neutral  PTS.1  Attend.
0  Fri, Jan 1, 2021       7:00p  Memphis Grizzlies  108.0  Charlotte Hornets   93.0      0.0
1  Fri, Jan 1, 2021       7:00p         Miami Heat   83.0   Dallas Mavericks   93.0      0.0
2  Fri, Jan 1, 2021       7:00p     Boston Celtics   93.0    Detroit Pistons   96.0      0.0
3  Fri, Jan 1, 2021       7:30p      Atlanta Hawks  114.0      Brooklyn Nets   96.0      0.0
4  Fri, Jan 1, 2021       8:00p      Chicago Bulls   96.0    Milwaukee Bucks  126.0      0.0
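As for why the code in the question loses the first column: the date cell in each body row of this table appears to be marked up as a <th> rather than a <td>, so collecting only 'td' cells never sees it. The sketch below illustrates the effect with the standard library's HTMLParser on a small hand-written sample, so it runs without network access. The SAMPLE markup, the RowCollector class, and the collect helper are all assumptions made for illustration, not code from the page or from the question. With BeautifulSoup, the equivalent change would likely be to search each row with find_all(['th', 'td']) instead of findAll('td').

```python
from html.parser import HTMLParser

# Hand-written sample mimicking the assumed layout of one body row:
# the date sits in a <th>, every other cell in a <td>.
SAMPLE = """
<table>
  <tbody>
    <tr>
      <th scope="row"><a href="#">Fri, Jan 1, 2021</a></th>
      <td>7:00p</td>
      <td><a href="#">Memphis Grizzlies</a></td>
      <td>108</td>
    </tr>
  </tbody>
</table>
"""


class RowCollector(HTMLParser):
    """Collect cell text row by row, for the given cell tag names."""

    def __init__(self, cell_tags):
        super().__init__()
        self.cell_tags = set(cell_tags)
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Every <tr> starts a new, empty row.
            self.rows.append([])
        elif tag in self.cell_tags and self.rows:
            # Start a new cell in the current row.
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in self.cell_tags:
            self._in_cell = False

    def handle_data(self, data):
        # Text inside nested tags such as <a> still belongs to the open cell.
        if self._in_cell:
            self.rows[-1][-1] += data


def collect(html, cell_tags):
    parser = RowCollector(cell_tags)
    parser.feed(html)
    parser.close()
    return parser.rows


td_only = collect(SAMPLE, ["td"])          # what the question's code does
th_and_td = collect(SAMPLE, ["th", "td"])  # keeps the date column
print(td_only)
print(th_and_td)
```

On this sample, the 'td'-only pass returns three cells per row and the combined pass returns four, with the date first.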
Edit: I see there are some downvoters here, so let's go old school:
import re
from urllib.request import urlopen
# `url` is the same URL defined in the pandas example above.
with urlopen(url) as resp:
    data = resp.read().decode("utf-8")
def clean_data(d):
    """Collapse every run of whitespace (spaces, tabs, newlines) into a single space."""
    return re.sub(r"\s+", " ", d.strip())
# Capture key segments by element tag and string indexing.
tbl_head = data[data.index("<thead"):data.index("</thead>")]
tbl_body = data[data.index("<tbody"):data.index("</tbody>")]
# Clean our head and body data.
tbl_head = clean_data(tbl_head)
tbl_body = clean_data(tbl_body)
# Simple match to get fields from the table.
# \S gets everything besides whitespace.
th_pat = r">(\S+)<"
p = re.compile(th_pat)
fields = p.findall(tbl_head)
Output of fields:
['Date',
'Visitor/Neutral',
'PTS',
'Home/Neutral',
'PTS',
'&nbsp;',
'&nbsp;',
'Attend.',
'Notes']
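Note that this list has 9 entries, while the pandas approach reported 10 columns: 'Start (ET)' is missing, presumably because \S+ cannot span the space inside it. A minimal sketch of one possible alternative pattern follows. The sample_head string is a hand-written stand-in for the cleaned header markup, not a copy of the live page, and the names strict and loose are only illustrative.

```python
import re

# Hand-written stand-in for the cleaned <thead> markup (an assumption).
sample_head = (
    '<thead> <tr> '
    '<th scope="col">Date</th> '
    '<th scope="col">Start (ET)</th> '
    '<th scope="col">Visitor/Neutral</th> '
    '<th scope="col">PTS</th> '
    '</tr> '
)

# The pattern used above: \S+ stops at the space in "Start (ET)",
# so that header never matches.
strict = re.findall(r">(\S+)<", sample_head)

# Alternative: capture everything between <th ...> and </th>.
# The \b keeps the pattern from matching the <thead> tag itself.
loose = re.findall(r"<th\b[^>]*>(.*?)</th>", sample_head)

print(strict)
print(loose)
```

The second pattern relies on header cells not containing nested tags. If they do, a real HTML parser is probably the safer choice.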
# Creative pattern to capture nested elements.
body_pat = r"""
<t(?:h|d) .+?>
(?:<a.+?>)?
(.*?)
(?:</a>)?
</t\w>
"""
p = re.compile(body_pat, flags = re.X) # Use re.X if doing multiline pattern.
# Further cleaning of body data to remove whitespace.
# (Not super necessary.)
body_data = re.sub(r'("|<t\w) >', r'\1>', tbl_body)
body_data = re.sub(r'>\s+<', '><', body_data)
# Replace <tr> tags with newline character.
body_data = re.sub(r'(<tr>|</tr><tr>)', '\n', body_data)
# Iterate each line, capture our pattern output and add it as a sublist to res.
res = []
for line in body_data.split("\n"):
    tmp = p.findall(line)
    if len(tmp) > 0:
        res.append(tmp)
# First five lines of results
print("\n".join([f"{i}" for i in res[:5]]))
Output:
['Fri, Jan 1, 2021', '7:00p', 'Memphis Grizzlies', '108', 'Charlotte Hornets', '93', 'Box Score', '', '0', '']
['Fri, Jan 1, 2021', '7:00p', 'Miami Heat', '83', 'Dallas Mavericks', '93', 'Box Score', '', '0', '']
['Fri, Jan 1, 2021', '7:00p', 'Boston Celtics', '93', 'Detroit Pistons', '96', 'Box Score', '', '0', '']
['Fri, Jan 1, 2021', '7:30p', 'Atlanta Hawks', '114', 'Brooklyn Nets', '96', 'Box Score', '', '0', '']
['Fri, Jan 1, 2021', '8:00p', 'Chicago Bulls', '96', 'Milwaukee Bucks', '126', 'Box Score', '', '0', '']
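To get from these plain lists back to something tabular without pandas, one option is to pair each row with column names and write the result out with the csv module. The sketch below uses two rows copied from the output above. The column names are an assumption: they reuse the labels from the pandas example, including the 'Unnamed: 6' and 'Unnamed: 7' placeholders, rather than anything produced by the regex code. Converting the two score columns to int is likewise just one possible cleanup step.

```python
import csv
import io

# Two rows copied from the regex output above.
rows = [
    ['Fri, Jan 1, 2021', '7:00p', 'Memphis Grizzlies', '108',
     'Charlotte Hornets', '93', 'Box Score', '', '0', ''],
    ['Fri, Jan 1, 2021', '7:00p', 'Miami Heat', '83',
     'Dallas Mavericks', '93', 'Box Score', '', '0', ''],
]

# Assumed column names, borrowed from the pandas example.
columns = ["Date", "Start (ET)", "Visitor/Neutral", "PTS", "Home/Neutral",
           "PTS.1", "Unnamed: 6", "Unnamed: 7", "Attend.", "Notes"]

records = []
for row in rows:
    record = dict(zip(columns, row))
    # Scores arrive as strings; convert them for later arithmetic.
    record["PTS"] = int(record["PTS"])
    record["PTS.1"] = int(record["PTS.1"])
    records.append(record)

# Write to an in-memory buffer; swap in open("games.csv", "w", newline="")
# to write a file instead (the file name is only an example).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```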