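python-3.x - Pandas, BeautifulSoup - Iterating over multiple pages and writing them to Excel
Problem description
I'm collecting a bunch of NCAA soccer statistics and dumping them into an Excel spreadsheet. However, the win/loss/tie (WLT) data spans multiple pages, so I iterate over them. But only the last page of the iteration (4 schools out of 204) ends up stored in Excel. How can I get all 5 pages into the "WLT" sheet in Excel? Thanks for your help.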
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re
import xlsxwriter
import numpy as np
import urllib.request

shutouts = "https://www.ncaa.com/stats/soccer-men/d1/current/team/31"
shutouts = pd.read_html(shutouts)[0]
SOG = 'https://www.ncaa.com/stats/soccer-men/d1/current/team/977'
SOG = pd.read_html(SOG)[0]

# players stats
shutouts_p = 'https://www.ncaa.com/stats/soccer-men/d1/current/individual/1170'
shutouts_p = pd.read_html(shutouts_p)[0]

# Win Loss Tie data
max_page_num = 6
for i in range(1, max_page_num):
    print('page:', i)
    page_num = str(i)
    source = "https://www.ncaa.com/stats/soccer-men/d1/current/team/33/p" + page_num
    WLT = pd.read_html(source)
    WLT = WLT[0]

with pd.ExcelWriter('ncaastats.xlsx') as writer:
    shutouts.to_excel(writer, sheet_name='shutouts')
    shutouts_p.to_excel(writer, sheet_name='shutouts_p')
    SOG.to_excel(writer, sheet_name='SOG')
    WLT.to_excel(writer, sheet_name='WLT')
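Solution
To get all 204 records from the 5 pages into a single pandas DataFrame, you need to keep the table from every iteration and combine them, rather than overwriting it each time.
Code: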
import pandas as pd

# Collect each page's table here
pages = []

# Win Loss Tie data
max_page_num = 6
for i in range(1, max_page_num):
    print('page:', i)
    page_num = str(i)
    source = "https://www.ncaa.com/stats/soccer-men/d1/current/team/33/p" + page_num
    WLT = pd.read_html(source)[0]
    # Keep this page's table instead of overwriting the previous one
    pages.append(WLT)

# DataFrame.append was removed in pandas 2.0, so combine the pages with pd.concat
df = pd.concat(pages, ignore_index=True)
print(df)
Output:
page: 1
page: 2
page: 3
page: 4
page: 5
Rank Team Won Loss Tied Pct.
0 1 Missouri St. 18 1 1 0.925
1 2 Georgetown 20 1 3 0.896
2 - Virginia 21 2 1 0.896
3 4 Saint Mary's (CA) 16 2 0 0.889
4 5 SMU 18 2 1 0.881
5 6 Clemson 18 2 2 0.864
6 7 New Hampshire 15 2 3 0.825
7 8 Campbell 17 3 2 0.818
8 9 Washington 17 4 0 0.810
9 10 UCF 15 3 2 0.800
10 11 Marshall 16 3 3 0.795
11 12 Seattle U 16 3 4 0.783
12 13 Yale 13 3 2 0.778
13 14 Indiana 15 3 4 0.773
14 15 Oral Roberts 13 4 0 0.765
15 16 Stanford 14 3 5 0.750
16 17 Wake Forest 16 5 2 0.739
17 18 Rhode Island 14 4 3 0.738
18 19 Navy 12 4 1 0.735
19 20 St. John's (NY) 14 5 1 0.725
20 21 UIC 13 5 0 0.722
21 22 Penn St. 12 4 3 0.711
22 23 UC Santa Barbara 15 5 4 0.708
23 24 UC Davis 13 5 2 0.700
24 - Charlotte 12 4 4 0.700
25 - Georgia St. 12 4 4 0.700
26 27 Providence 16 7 0 0.696
27 28 San Diego 12 5 1 0.694
28 - FIU 10 3 5 0.694
29 30 Iona 14 6 1 0.690
.. ... ... ... ... ... ...
174 175 Delaware 3 9 3 0.300
175 176 USC Upstate 5 12 0 0.294
176 - Robert Morris 4 11 2 0.294
177 - Stony Brook 4 11 2 0.294
178 - UIW 5 12 0 0.294
179 180 Western Ill. 5 13 1 0.289
180 181 Wisconsin 3 11 4 0.278
181 - Liberty 5 13 0 0.278
182 - San Diego St. 4 12 2 0.278
183 184 Boston U. 4 12 1 0.265
184 - UNC Asheville 4 12 1 0.265
185 186 Wofford 4 13 1 0.250
186 - Valparaiso 4 13 1 0.250
187 - American 3 11 2 0.250
188 - George Mason 4 13 1 0.250
189 - Davidson 3 11 2 0.250
190 - Michigan St. 3 12 3 0.250
191 192 Monmouth 3 12 2 0.235
192 - UAB 3 12 2 0.235
193 194 Old Dominion 3 11 1 0.233
194 195 Sacred Heart 2 11 3 0.219
195 196 Col. of Charleston 2 12 2 0.188
196 197 Holy Cross 3 15 0 0.167
197 - Purdue Fort Wayne 3 15 0 0.167
198 199 San Francisco 2 14 1 0.147
199 200 Evansville 2 15 1 0.139
200 201 Canisius 2 15 0 0.118
201 202 Central Conn. St. 1 13 1 0.100
202 203 VMI 1 16 0 0.059
203 204 Harvard 0 14 1 0.033
[204 rows x 6 columns]
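The question asked for the combined data in the "WLT" sheet, while the answer above only prints it. A minimal sketch of how the two pieces might fit together follows. The helper names `fetch_pages` and `write_sheets`, the `BASE_URL` constant, and the `read_table` parameter are hypothetical and were introduced here for illustration; they are not part of the original question or answer. Only `pd.read_html`, `pd.concat`, `pd.ExcelWriter`, and `DataFrame.to_excel` are pandas calls.

```python
import pandas as pd

# URL prefix taken from the question; the page number is appended to it
BASE_URL = "https://www.ncaa.com/stats/soccer-men/d1/current/team/33/p"


def fetch_pages(base_url, first_page, last_page, read_table=None):
    """Read one table per page and return them as a single DataFrame."""
    if read_table is None:
        # Default reader (assumption): the first HTML table on each page
        def read_table(url):
            return pd.read_html(url)[0]
    frames = [
        read_table(base_url + str(page))
        for page in range(first_page, last_page + 1)
    ]
    # ignore_index=True renumbers the rows 0..n-1 across all pages
    return pd.concat(frames, ignore_index=True)


def write_sheets(path, sheets):
    """Write a {sheet_name: DataFrame} mapping into one Excel workbook."""
    with pd.ExcelWriter(path) as writer:
        for name, frame in sheets.items():
            frame.to_excel(writer, sheet_name=name)
```

With the question's variables, a call such as `write_sheets('ncaastats.xlsx', {'shutouts': shutouts, 'shutouts_p': shutouts_p, 'SOG': SOG, 'WLT': fetch_pages(BASE_URL, 1, 5)})` should then put all five pages into the "WLT" sheet, assuming the site still serves those pages and an Excel engine such as xlsxwriter or openpyxl is installed.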