Not returning all tables

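Problem description

I'm trying to scrape all the tables from this website. The site contains more than 10 tables. When I use pd.read_html(), it returns only 3 tables, but I want my script to return all of them.
My script: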

import pandas as pd
url = "https://aws.pro-football-reference.com/teams/mia/2000.htm"
df = pd.read_html(url)
len(df)

Output:

3

In particular, I want this table:

[screenshot of the desired table; image not included]

How can I get all the tables using pd.read_html()?

Tags: python, python-3.x, pandas, web-scraping, html-table

Solution


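pd.read_html extracts <table> elements from a web page using an HTML parser (lxml by default, with BeautifulSoup as a fallback). Fetching the page's HTML with requests and parsing it manually, I found that the page you linked really does contain only three <table> elements. However, the data for several additional tables (including the "Kicking" table you want) can be found inside HTML comments.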
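To make the distinction concrete, here is a minimal sketch on a made-up miniature page (SAMPLE_HTML and its table ids are assumptions for illustration, not the real page). It suggests why a parser sees only the live tables: markup inside a comment is kept as a single string node and is never parsed into elements.

```python
import bs4

# Hypothetical miniature page: one live table, one table hidden in a comment.
SAMPLE_HTML = """
<html><body>
<table id="team_stats"><tr><td>1</td></tr></table>
<!--
<table id="kicking"><tr><td>2</td></tr></table>
-->
</body></html>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only tables that are part of the parsed tree are found here.
live_tables = soup.find_all("table")

# Comment nodes are plain strings to the parser; their markup is not parsed.
comments = soup.find_all(string=lambda s: isinstance(s, bs4.Comment))
hidden = [c for c in comments if "<table" in c]

print(len(live_tables), len(hidden))
```

On this sample, one table is live and one sits inside a comment, which mirrors the situation on the page in question.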

Solution: parse the commented-out tables.

import requests
import bs4
import pandas as pd

url = "https://aws.pro-football-reference.com/teams/mia/2000.htm"
scraped_html = requests.get(url)
# Name the parser explicitly so bs4 does not have to guess (and warn)
soup = bs4.BeautifulSoup(scraped_html.content, "html.parser")

# Get all html comments, then filter out everything that isn't a table
comments = soup.find_all(string=lambda text: isinstance(text, bs4.Comment))
commented_out_tables = [bs4.BeautifulSoup(cmt, "html.parser").find_all('table') for cmt in comments]
# Some of the entries in `commented_out_tables` are empty lists. Remove them.
commented_out_tables = [tab[0] for tab in commented_out_tables if len(tab) == 1]

print(len(commented_out_tables))

This prints 8.

Only one of these is the "Kicking" table. We can find it by looking for the table whose id attribute is set to kicking.

for table in commented_out_tables:
    if table.get('id') == 'kicking':
        kicking_table = table
        break
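If you want every table rather than just one, the loop above can be generalized. The following is a sketch under assumptions, not a tested scraper for that site: collect_tables is a hypothetical helper name, and it assumes each table you care about carries an id attribute (tables without one are skipped).

```python
import bs4

def collect_tables(html):
    """Return {table id: Tag} for live and commented-out tables (hypothetical helper)."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # Start with the tables that are part of the parsed tree.
    tables = list(soup.find_all("table"))
    # Then parse each comment on its own and pick up any tables inside it.
    for comment in soup.find_all(string=lambda s: isinstance(s, bs4.Comment)):
        inner = bs4.BeautifulSoup(comment, "html.parser")
        tables.extend(inner.find_all("table"))
    # Key by id; tables without an id are dropped in this sketch.
    return {t.get("id"): t for t in tables if t.get("id")}
```

With this, collect_tables(scraped_html.content).get("kicking") returns the table, or None when no such id exists, which avoids the NameError you would get from the loop if nothing matched.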

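Turn it into a pd.DataFrame with pd.read_html. Recent pandas versions (2.1 and later) deprecate passing a literal HTML string, so wrap it in io.StringIO:

import io
pd.read_html(io.StringIO(str(kicking_table)))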

This produces the following result:

[  Unnamed: 0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Games       ... Kickoffs Punting
                  No.             Player                Age                Pos     G   GS  ...    KOAvg     Pnt     Yds   Lng Blck   Y/P
 0                1.0          Matt Turk               32.0                  p    16  0.0  ...      NaN    92.0  3870.0  70.0  0.0  42.1
 1               10.0        Olindo Mare               27.0                  k    16  0.0  ...     60.3     NaN     NaN   NaN  NaN   NaN
 2                NaN         Team Total               27.3                NaN    16  NaN  ...     60.3    92.0  3870.0  70.0  0.0  42.1
 3                NaN          Opp Total                NaN                NaN    16  NaN  ...      NaN    87.0  3532.0   NaN  NaN  40.6

 [4 rows x 32 columns]]
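The result has a two-level column header, with "Unnamed: N_level_0" placeholders where the top header cell was empty. If you prefer flat column names, one possible approach is sketched below. The small DataFrame is a hand-built stand-in that borrows a few values from the output above, and flatten is a hypothetical helper, so adapt the naming rule to taste.

```python
import pandas as pd

# Hypothetical stand-in for the two-level header that read_html returned above.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "No."),
    ("Unnamed: 1_level_0", "Player"),
    ("Games", "G"),
    ("Punting", "Pnt"),
])
df = pd.DataFrame([[1.0, "Matt Turk", 16, 92.0]], columns=columns)

def flatten(col):
    top, bottom = col
    # Drop pandas' placeholder for an empty top-level header cell.
    return bottom if top.startswith("Unnamed:") else f"{top}_{bottom}"

df.columns = [flatten(col) for col in df.columns]
print(list(df.columns))
```

Keeping the group name as a prefix (Games_G, Punting_Pnt) is a design choice here: several groups in the real table may reuse short names such as Yds, so dropping the prefix could produce duplicate columns.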
