Unpacking a pandas read_html DataFrame

Question

I am trying to scrape data from a website: https://www.oddsportal.com/american-football/usa/nfl/

This link shows upcoming games.

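So far I have tried using pandas to read the HTML retrieved by Selenium, but the resulting DataFrame has a MultiIndex, and I am not sure how to unpack it into a more readable format.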

import undetected_chromedriver.v2 as uc
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd



driver = uc.Chrome()
driver.get("https://www.oddsportal.com/american-football/usa/nfl/")

table = driver.find_element(By.XPATH, '//*[@id="tournamentTable"]').get_attribute('outerHTML')

df = pd.read_html(table)[0]

Out:
   American football» USA»NFL  ...     
                Today, 17 Oct  ...  B's
0                       13:30  ...   11
1                       17:00  ...   11
2                       17:00  ...   11
3                       17:00  ...   11
4                       17:00  ...   11
5                       17:00  ...   11
6                       17:00  ...   11
7                       17:00  ...   11
8                       2Q 6'  ...   11
9                       2Q 4'  ...   11
10                      2Q 3'  ...   11
11           Tomorrow, 18 Oct  ...  B's
12                        NaN  ...  NaN
13                      00:20  ...   11
14                19 Oct 2021  ...  B's
15                        NaN  ...  NaN
16                      00:15  ...   11
17                15 Nov 2021  ...  B's
18                        NaN  ...  NaN
19                      00:20  ...    1

[20 rows x 7 columns]

I can unpack the HTML data as follows:

all_matches = [i.text for i in driver.find_elements(By.XPATH, '//*[@id="tournamentTable"]/tbody/tr') if "American football" not in i.text]

Out:
["Today, 17 Oct 1 2 B's",
 '',
 '13:30 Jacksonville Jaguars - Miami Dolphins\n  23:20\n+110\n-128\n11',
 '17:00 Baltimore Ravens - Los Angeles Chargers 34:6\n-159\n+138\n11',
 '17:00 Carolina Panthers - Minnesota Vikings 28:34 OT\n+114\n-133\n11',
 '17:00 Chicago Bears - Green Bay Packers 14:24\n+195\n-233\n11',
 '17:00 Detroit Lions - Cincinnati Bengals 11:34\n+165\n-192\n11',
 '17:00 Indianapolis Colts - Houston Texans 31:3\n-556\n+428\n11',
 '17:00 New York Giants - Los Angeles Rams 11:38\n+286\n-345\n11',
 '17:00 Washington Football Team - Kansas City Chiefs 13:31\n+241\n-294\n11',
 "2Q 15'\nCleveland Browns - Arizona Cardinals 14:23\n-149\n+127\n11",
 "2Q 14'\nDenver Broncos - Las Vegas Raiders 7:10\n-213\n+181\n11",
 "2Q 10'\nNew England Patriots - Dallas Cowboys 14:10\n+163\n-189\n11",
 "Tomorrow, 18 Oct 1 2 B's",
 '',
 '00:20 Pittsburgh Steelers - Seattle Seahawks\n-217\n+185\n11',
 "19 Oct 2021 1 2 B's",
 '',
 '00:15 Tennessee Titans - Buffalo Bills\n+206\n-250\n11',
 '',
 '',
 '']

But this requires me to parse the data with a dictionary, and getting the formatting right would be tedious.

My expected output is a DataFrame in this format:

    date         game_time          Team1              Team2         Score     1     2   
0   2021-10-17    13:30      Jacksonville Jaguars   Miami Dolphins  23:20    +110  -128
1   2021-10-17    17:00      Baltimore Ravens       Los Angeles     34:6     -159  +138
2   2021-10-17    17:00      Carolina Panthers      Minnesota Vik   28:34    +114  -133
3   2021-10-17    17:00      Chicago Bears          Green Bay Pack  14:24    +195  -233
4   2021-10-17    17:00      Detroit Lions          Cincinnati Ben  11:34    +165  -192

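I am hoping there is a simpler way to pass the data to pandas' read_html function that removes the MultiIndex and gets me closer to this format. If I can get close, I can format the rest myself. I would like to avoid using a dictionary, but I understand if that is not possible.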

Tags: python, pandas, dataframe, selenium, web-scraping

Answer


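read_html() works for basic, simple tables, but for a complex table you have to write your own code that uses for-loops to process rows and cells individually and if/else to identify the type of data in each row.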
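To illustrate why read_html() alone falls short here, the following is a minimal sketch on a hand-built frame, not on the live page. The column tuples and cell values are assumptions modelled on the output shown in the question. Dropping the outer header level does remove the MultiIndex, but the date rows stay mixed in with the match rows.

```python
import pandas as pd

# Hypothetical miniature of the frame from the question: two header levels,
# with a date row repeated inside the body.
columns = pd.MultiIndex.from_tuples([
    ("American football» USA»NFL", "Today, 17 Oct"),
    ("American football» USA»NFL", "1"),
    ("American football» USA»NFL", "2"),
])
df = pd.DataFrame(
    [
        ["13:30", "+110", "-128"],
        ["Tomorrow, 18 Oct", "1", "2"],   # a date row sitting among the data
        ["00:20", "-217", "+185"],
    ],
    columns=columns,
)

# Drop the outer header level so only one level of column labels remains.
flat = df.copy()
flat.columns = flat.columns.droplevel(0)

print(flat.columns.tolist())
print(flat)
```

The columns are now single-level, yet row 1 is still a date label rather than a match, which is the part that needs row-by-row handling.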

For this I use only Selenium.

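First I get all rows in the table with @id="tournamentTable", then I check the class of each row to detect rows with a date, rows with results, and hidden rows. Then I run different code for each kind of row.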
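The branching described above can be tried without a browser. The sketch below is only an illustration. The helper names classify_row and build_records are hypothetical, the sample rows are made up (including the "dark center" and "odd" class names), and the date is taken from the first cell rather than from a span element.

```python
def classify_row(classes):
    """Return 'date', 'skip', or 'match' for a row's class attribute."""
    if classes == "center nob-border":
        return "date"
    if classes == "table-dummyrow" or "hidden" in classes:
        return "skip"
    return "match"


def build_records(rows):
    """rows: iterable of (class_string, list_of_cell_texts) pairs."""
    records = []
    current_date = None
    for classes, cells in rows:
        kind = classify_row(classes)
        if kind == "date":
            # Remember the date so it can be attached to the rows that follow.
            current_date = cells[0].strip()
        elif kind == "match" and current_date:
            # Rows seen before the first date (e.g. the table header) are ignored.
            records.append([current_date] + [c.strip() for c in cells])
    return records


# Hypothetical sample rows standing in for the <tr> elements of the table.
sample = [
    ("dark center", ["American football» USA»NFL"]),
    ("center nob-border", ["Today, 17 Oct", "1", "2", "B's"]),
    ("table-dummyrow", [""]),
    ("odd", ["13:30", "Jacksonville Jaguars - Miami Dolphins", "23:20", "+110", "-128", "11"]),
    ("odd hidden", [""]),
    ("center nob-border", ["Tomorrow, 18 Oct", "1", "2", "B's"]),
    ("odd", ["00:20", "Pittsburgh Steelers - Seattle Seahawks", "-217", "+185", "11"]),
]

records = build_records(sample)
for record in records:
    print(record)
```

Keeping the classification in a small function makes it easy to adjust if the site changes its class names.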

import selenium.webdriver
from selenium.webdriver.common.by import By
import pandas as pd

#import undetected_chromedriver.v2 as uc
#driver = uc.Chrome()

#driver = selenium.webdriver.Chrome()
driver = selenium.webdriver.Firefox()
driver.get('https://www.oddsportal.com/american-football/usa/nfl')

# ---

all_results = []
date = None  # most recent date row seen; match rows before the first date are ignored

# The find_element(s)_by_* helpers were deprecated in Selenium 4 and later removed,
# so the By-based calls are used here.
all_rows = driver.find_elements(By.XPATH, '//table[@id="tournamentTable"]//tr')

for row in all_rows:
    classes = row.get_attribute('class')
    print('classes:', classes)

    if classes == 'center nob-border':
        # row with a date
        date = row.find_element(By.TAG_NAME, 'span').text.strip()
        print('date:', date)
    elif (classes == 'table-dummyrow') or ('hidden' in classes):
        pass  # skip empty rows
    else:
        if date:
            all_cells = row.find_elements(By.XPATH, './/td')
            print('len(all_cells):', len(all_cells))
            teams = all_cells[1].text.split(' - ')
            if len(all_cells) == 5: 
                # row without score
                row_values = [
                    date,
                    all_cells[0].text.strip(),
                    teams[0].strip(),
                    teams[1].strip(),
                    '',
                    all_cells[2].text.strip(),
                    all_cells[3].text.strip(),
                    all_cells[4].text.strip(),
                ]
            else: 
                # row with score
                row_values = [
                    date,
                    all_cells[0].text.strip(),
                    teams[0].strip(),
                    teams[1].strip(),
                    all_cells[2].text.strip(),
                    all_cells[3].text.strip(),
                    all_cells[4].text.strip(),
                    all_cells[5].text.strip(),
                ]

            print('row:', row_values)
            all_results.append(row_values)
            
print('-----------------------')

df = pd.DataFrame(all_results, columns=['date', 'game_time', 'Team1', 'Team2', 'Score', '1', '2', 'B'])

print(df)

Result:

                date game_time                 Team1                     Team2  Score     1     2   B
0      Today, 18 Oct      4Q 4   Pittsburgh Steelers          Seattle Seahawks  17:17  1.44  2.91  11
1   Tomorrow, 19 Oct     00:15      Tennessee Titans             Buffalo Bills         3.13  1.39  10
2        22 Oct 2021     00:20      Cleveland Browns            Denver Broncos         1.43  2.93   8
3        24 Oct 2021     17:00      Baltimore Ravens        Cincinnati Bengals         1.36  3.30   8
4        24 Oct 2021     17:00     Green Bay Packers  Washington Football Team         1.22  4.59   8
5        24 Oct 2021     17:00        Miami Dolphins           Atlanta Falcons         1.93  1.91   8
6        24 Oct 2021     17:00  New England Patriots             New York Jets         1.33  3.43   8
7        24 Oct 2021     17:00       New York Giants         Carolina Panthers         2.28  1.67   8
8        24 Oct 2021     17:00      Tennessee Titans        Kansas City Chiefs         2.67  1.49   5
9        24 Oct 2021     20:05     Las Vegas Raiders       Philadelphia Eagles         1.63  2.35   8
10       24 Oct 2021     20:05      Los Angeles Rams             Detroit Lions         1.09  7.94   8
11       24 Oct 2021     20:25     Arizona Cardinals            Houston Texans         1.08  8.98   8
12       24 Oct 2021     20:25  Tampa Bay Buccaneers             Chicago Bears         1.14  6.16   8
13       25 Oct 2021     00:20   San Francisco 49ers        Indianapolis Colts         1.50  2.70   8
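The date column above still holds labels such as "Today, 18 Oct" rather than the 2021-10-17 style asked for in the question. A minimal sketch of one way to normalize them follows. The function name normalize_date is hypothetical, the scrape day is an assumed value, month abbreviations are assumed to be English, and other relative labels (if the site uses any) are not handled.

```python
from datetime import date, datetime, timedelta


def normalize_date(label, today):
    """Convert a date label from the table into an ISO 'YYYY-MM-DD' string."""
    label = label.strip()
    if label.startswith("Today"):
        return today.isoformat()
    if label.startswith("Tomorrow"):
        return (today + timedelta(days=1)).isoformat()
    # Remaining labels look like '22 Oct 2021'; %b assumes English month names.
    return datetime.strptime(label, "%d %b %Y").date().isoformat()


scrape_day = date(2021, 10, 18)  # assumed day on which the page was scraped
for label in ["Today, 18 Oct", "Tomorrow, 19 Oct", "22 Oct 2021"]:
    print(label, "->", normalize_date(label, scrape_day))
```

With the DataFrame from the script, something like df['date'].apply(lambda s: normalize_date(s, date.today())) should convert the whole column, provided it runs on the same day as the scrape.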

