Need help scraping HTML from Basketball Reference

Problem description

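I'm very new to web scraping with python/BeautifulSoup/urllib.request, and I've been trying to figure out how to scrape this table for the longest time. I found some other code online and tried it, and I've been trying to understand how it works and modify it, but it always filters out the first column, which I need.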

Code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import numpy 

# Month of the NBA season we will be analyzing
month = "january"
# URL of the page we will scrape
url = "https://www.basketball-reference.com/leagues/NBA_2021_games-{}.html".format(month)
# this is the HTML for the given URL
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

# use findAll() to get the column headers, and
# getText() to extract the text we need into a list
headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
# exclude the first column as we will not need the ranking order from Basketball Reference for the analysis
headers=headers[1:]

# avoid the first header row
rows = soup.findAll('tr')[1:]

player_stats = [[td.getText() for td in row.findAll('td')] for row in rows]
df = pd.DataFrame(player_stats, columns = headers)

[Image: screenshot of the HTML table]

Can someone tell me how to scrape the table from this site? I can't figure it out for the life of me: https://www.basketball-reference.com/leagues/NBA_2021_games-january.html

Tags: python, web-scraping, beautifulsoup

Solution


The simple solution is to use pandas:

import pandas as pd

url = "https://www.basketball-reference.com/leagues/NBA_2021_games-january.html"


# Get a list of tables on the page
df_list = pd.read_html(url)


# Print info.
print(f"Number of tables found: {len(df_list)}") # Out: Number of tables found: 1


# Select first dataframe object and go to town.
df = df_list[0]


print(f"Found {df.shape[0]} rows and {df.shape[1]} columns.") # Found 238 rows and 10 columns.

# You can also drop some of those fields, like the ones for Box Score and Notes, which don't contain too much relevant info:

df.drop(["Unnamed: 6", "Unnamed: 7", "Notes"], axis=1, inplace=True)

Output:

               Date Start (ET)    Visitor/Neutral    PTS       Home/Neutral  PTS.1  Attend.
0  Fri, Jan 1, 2021      7:00p  Memphis Grizzlies  108.0  Charlotte Hornets   93.0      0.0
1  Fri, Jan 1, 2021      7:00p         Miami Heat   83.0   Dallas Mavericks   93.0      0.0
2  Fri, Jan 1, 2021      7:00p     Boston Celtics   93.0    Detroit Pistons   96.0      0.0
3  Fri, Jan 1, 2021      7:30p      Atlanta Hawks  114.0      Brooklyn Nets   96.0      0.0
4  Fri, Jan 1, 2021      8:00p      Chicago Bulls   96.0    Milwaukee Bucks  126.0      0.0
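
If you would rather stay with BeautifulSoup: the first column most likely disappears because the code in the question only collects `td` cells, while the date cell of each body row appears to be marked up as a `th` (a row header). Dropping the first header with `headers[1:]` then hides the mismatch. A minimal sketch of a fix is below, assuming that markup. `SAMPLE_HTML` is a hypothetical, trimmed-down stand-in for the real page, and `parse_table` is an illustrative helper name, not something from the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample of the schedule table markup
# (an assumption for illustration; the real page has more columns).
SAMPLE_HTML = """
<table id="schedule">
  <thead>
    <tr><th>Date</th><th>Start (ET)</th><th>Visitor/Neutral</th><th>PTS</th></tr>
  </thead>
  <tbody>
    <tr><th scope="row"><a href="#">Fri, Jan 1, 2021</a></th><td>7:00p</td><td><a href="#">Memphis Grizzlies</a></td><td>108</td></tr>
    <tr><th scope="row"><a href="#">Fri, Jan 1, 2021</a></th><td>7:00p</td><td><a href="#">Miami Heat</a></td><td>83</td></tr>
  </tbody>
</table>
"""


def parse_table(html):
    """Return (headers, rows) for the first table, keeping th cells in body rows."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        # Collect th AND td so the row-header cell (the date) is not dropped.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if len(cells) == len(headers):  # skip rows that do not line up with the header
            rows.append(cells)
    return headers, rows


headers, rows = parse_table(SAMPLE_HTML)
print(headers)
print(rows[0])
```

With the real page you would pass the downloaded HTML to `parse_table` instead of the sample string, and the resulting `headers` and `rows` can go straight into `pd.DataFrame(rows, columns=headers)`.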

Edit: I see there are some downvoters here, so let's go old school:

import re
from urllib.request import urlopen

with urlopen(url) as resp:
    data = resp.read().decode("utf-8")


def clean_data(d):
    """Replace newline, tab, and whitespace with single space."""
    return re.sub(r"\s+", " ", d.strip())


# Capture key segments by element tag and string indexing.
tbl_head = data[data.index("<thead"):data.index("</thead>")]
tbl_body = data[data.index("<tbody"):data.index("</tbody>")]


# Clean our head and body data.
tbl_head = clean_data(tbl_head)
tbl_body = clean_data(tbl_body)


# Simple match to get fields from the table.
# \S gets everything besides whitespace.
th_pat = r">(\S+)<"
p = re.compile(th_pat)
fields = p.findall(tbl_head)


Output of fields:

['Date',
 'Visitor/Neutral',
 'PTS',
 'Home/Neutral',
 'PTS',
 '&nbsp;',
 '&nbsp;',
 'Attend.',
 'Notes']
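
Note that the list above has no `Start (ET)` entry, presumably because `\S+` cannot match header text that contains a space. One possible tweak, sketched here on a hypothetical `sample_head` string rather than the real page, is to match any run of characters that are not angle brackets and then discard the whitespace-only matches found between tags.

```python
import re

# Hypothetical cleaned <thead> fragment (an assumption for illustration).
sample_head = (
    '<thead> <tr> <th scope="col">Date</th> <th scope="col">Start (ET)</th> '
    '<th scope="col">Visitor/Neutral</th> <th scope="col">&nbsp;</th> </tr> </thead>'
)

# [^<>]+ allows spaces inside a cell but cannot cross a tag boundary.
header_pat = re.compile(r">([^<>]+)<")

# Whitespace between adjacent tags also matches, so drop matches that are blank.
fields = [m.strip() for m in header_pat.findall(sample_head) if m.strip()]
print(fields)
```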


# Creative pattern to capture nested elements.
body_pat = r"""
    <t(?:h|d) .+?>
    (?:<a.+?>)?
    (.*?)
    (?:</a>)?
    </t\w>
    """
p = re.compile(body_pat, flags = re.X) # Use re.X if doing multiline pattern.


# Further cleaning of body data to remove whitespace.
# (Not super necessary.)
body_data = re.sub(r'("|<t\w) >', r'\1>', tbl_body)
body_data = re.sub(r'>\s+<', '><', body_data)

# Replace <tr> tags with newline character.
body_data = re.sub(r'(<tr>|</tr><tr>)', '\n', body_data)

# Iterate each line, capture our pattern output and add it as a sublist to res.
res = []
for line in body_data.split("\n"):
    tmp = p.findall(line)
    if len(tmp) > 0:
        res.append(tmp)


# First five lines of results
print("\n".join([f"{i}" for i in res[:5]]))

Output:

['Fri, Jan 1, 2021', '7:00p', 'Memphis Grizzlies', '108', 'Charlotte Hornets', '93', 'Box Score', '', '0', '']
['Fri, Jan 1, 2021', '7:00p', 'Miami Heat', '83', 'Dallas Mavericks', '93', 'Box Score', '', '0', '']
['Fri, Jan 1, 2021', '7:00p', 'Boston Celtics', '93', 'Detroit Pistons', '96', 'Box Score', '', '0', '']
['Fri, Jan 1, 2021', '7:30p', 'Atlanta Hawks', '114', 'Brooklyn Nets', '96', 'Box Score', '', '0', '']
['Fri, Jan 1, 2021', '8:00p', 'Chicago Bulls', '96', 'Milwaukee Bucks', '126', 'Box Score', '', '0', '']
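
The rows in `res` are plain lists, so a natural next step is to pair them with column names. The sketch below uses only the standard library and shortened, hand-written sample data. The `dedupe` helper is hypothetical and exists only to tell the two `PTS` columns apart, mirroring the `PTS` / `PTS.1` naming seen in the pandas output earlier. It assumes the field list and each row have the same length.

```python
import csv
import io


def dedupe(names):
    """Make column names unique: PTS, PTS.1, PTS.2, ..."""
    seen = {}
    unique = []
    for name in names:
        count = seen.get(name, 0)
        unique.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    return unique


# Shortened sample data (an assumption; the real rows have more columns).
fields = ["Date", "Start (ET)", "Visitor/Neutral", "PTS", "Home/Neutral", "PTS"]
res = [
    ["Fri, Jan 1, 2021", "7:00p", "Memphis Grizzlies", "108", "Charlotte Hornets", "93"],
    ["Fri, Jan 1, 2021", "7:00p", "Miami Heat", "83", "Dallas Mavericks", "93"],
]

columns = dedupe(fields)
records = [dict(zip(columns, row)) for row in res]

# Write the records as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```

Swapping `io.StringIO()` for `open("games.csv", "w", newline="")` would write the same content to a file; the file name here is only an example.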
