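python - Extract dates and append them to match the number of games
Problem description
I am currently scraping the college football schedule from the web, one week at a time.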
import requests
import pandas as pd
from bs4 import BeautifulSoup

URL = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

# Team names alternate away, home, away, home, ...
teams = [t.text for t in soup.find_all('span', class_='TeamName')]
away = teams[::2]
home = teams[1::2]
# Collapse newlines and runs of spaces in each game cell to single spaces
time = [' '.join(c.text.split()) for c in soup.find_all('div', class_='CellGame')]

schedule = pd.DataFrame({
    'away': away,
    'home': home,
    'time': time,
})
schedule
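I would like a date column. I am having trouble extracting the dates, repeating each date once for every game played on it, and appending the result to a Python list.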
date = []
for d in soup.find_all('div', class_='TableBaseWrapper'):
    for a in d.find_all('h4'):
        # strip() removes the newlines and padding around the heading text
        date.append(a.text.strip())
print(date)
['Friday, October 2, 2020', 'Saturday, October 3, 2020']
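The dates act as a heading for each table. I want every date to line up with the correct games, and I also want postponed games to show "Postponed".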
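To make the goal concrete, here is a minimal sketch of the expansion being asked for. It assumes the number of games under each date heading is already known; the counts used here (2 and 32) are read off the output shown in the answer, and `games_per_date` and `date_column` are illustrative names that do not appear in the original code.

```python
from itertools import chain, repeat

# Dates as printed by the question's code.
dates = ['Friday, October 2, 2020', 'Saturday, October 3, 2020']

# Assumed number of game rows under each date heading
# (2 Friday games and 32 Saturday games in the answer's output).
games_per_date = [2, 32]

# Repeat every date once per game so the list lines up with the game rows.
date_column = list(chain.from_iterable(
    repeat(d, n) for d, n in zip(dates, games_per_date)
))

print(len(date_column))
```

The hard part in practice is obtaining the counts, which is why looking up the heading from each row, as the answer does, tends to be simpler.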
My plan is to run this code automatically every week.
Thanks in advance.
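For the weekly automation, one possible starting point is to build the URL from a season and a week number. This sketch assumes that the URL in the question follows the pattern season year / `regular` / week number; that is inferred from the single URL shown here and is not confirmed by the site. `schedule_url` is a hypothetical helper.

```python
BASE = 'https://www.cbssports.com/college-football/schedule/FBS'

def schedule_url(season: int, week: int) -> str:
    """Build the schedule URL for one regular-season week (assumed pattern)."""
    return f'{BASE}/{season}/regular/{week}/'

# Reproduces the URL used in the question.
print(schedule_url(2020, 5))
```

The returned string can be passed to `requests.get` in place of the hard-coded URL.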
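*Follow-up after the answer was posted
Beautiful and well done. Using your code, how would I pull the venues, especially for the postponed games? My original code was: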
# Keep only cells whose text mentions "Field" or "Stadium"
venue = [' '.join(v.text.split()).strip('—').strip() for v in soup.find_all('td', text=lambda x: x and ('Field' in x or 'Stadium' in x))]
venues = [x for x in venue if x]
# Pad with 'Postponed' for games that have no venue listed
missing = len(away) - len(venues)
words = ['Postponed'] * max(missing, 0)
venues = venues + words
Solution
You can use .find_previous() to find the date for the current row:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for row in soup.select('.TableBase-bodyTr'):
    home = row.select_one('.TeamLogoNameLockup')
    away = home.find_next(class_='TeamLogoNameLockup')
    time = row.select_one('.CellGame')
    # The nearest <h4> before this row is the date heading of its table
    date = row.find_previous('h4')

    all_data.append({
        'home': home.get_text(strip=True),
        'away': away.get_text(strip=True),
        'time': time.get_text(strip=True, separator=' '),
        'date': date.get_text(strip=True),
    })

df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
    home              away              time                            date
 0  Campbell          Wake Forest       WAKE 66 - CAMP 14               Friday, October 2, 2020
 1  Louisiana Tech    BYU               BYU 45 - LATECH 14              Friday, October 2, 2020
 2  East Carolina     Georgia St.       GAST 35, ECU 10 - 2nd ESPU      Saturday, October 3, 2020
 3  Arkansas St.      Coastal Carolina  CSTCAR 17, ARKST 14 - 2nd ESP2  Saturday, October 3, 2020
 4  Missouri          Tennessee         TENN 21, MIZZOU 6 - 2nd SECN    Saturday, October 3, 2020
 5  Baylor            West Virginia     BAYLOR 7, WVU 7 - 2nd ABC       Saturday, October 3, 2020
 6  TCU               Texas             TCU 14, TEXAS 14 - 2nd FOX      Saturday, October 3, 2020
 7  NC State          Pittsburgh        NCST 17, PITT 10 - 2nd ACCN     Saturday, October 3, 2020
 8  South Carolina    Florida           FLA 17, SC 14 - 2nd ESPN        Saturday, October 3, 2020
 9  UT-San Antonio    UAB               UAB 7, TXSA 3 - 2nd             Saturday, October 3, 2020
10  North Alabama     Liberty           NAL 0, LIB 0 - 1st ESP3         Saturday, October 3, 2020
11  Abil Christian    Army              1:30 pm CBSSN                   Saturday, October 3, 2020
12  Texas A&M         Alabama           3:30 pm                         Saturday, October 3, 2020
13  Texas Tech        Kansas St.        3:30 pm FS1                     Saturday, October 3, 2020
14  North Carolina    Boston College    3:30 pm ABC                     Saturday, October 3, 2020
15  South Florida     Cincinnati        3:30 pm ESP+                    Saturday, October 3, 2020
16  Oklahoma St.      Kansas            3:30 pm ESPN                    Saturday, October 3, 2020
17  Memphis           SMU               3:30 pm ESP2                    Saturday, October 3, 2020
18  Charlotte         FAU               4:00 pm ESPU                    Saturday, October 3, 2020
19  Jacksonville St.  Florida St.       4:00 pm                         Saturday, October 3, 2020
20  Virginia Tech     Duke              4:00 pm ACCN                    Saturday, October 3, 2020
21  Ole Miss          Kentucky          4:00 pm SECN                    Saturday, October 3, 2020
22  W. Kentucky       Middle Tenn.      5:00 pm ESP3                    Saturday, October 3, 2020
23  Navy              Air Force         6:00 pm CBSSN                   Saturday, October 3, 2020
24  Ga. Southern      UL-Monroe         7:00 pm ESP+                    Saturday, October 3, 2020
25  Auburn            Georgia           7:30 pm ESPN                    Saturday, October 3, 2020
26  Arkansas          Miss. State       7:30 pm SECN                    Saturday, October 3, 2020
27  LSU               Vanderbilt        7:30 pm SECN                    Saturday, October 3, 2020
28  Oklahoma          Iowa St.          7:30 pm ABC                     Saturday, October 3, 2020
29  So. Miss          North Texas       7:30 pm                         Saturday, October 3, 2020
30  Tulsa             UCF               7:30 pm ESP2                    Saturday, October 3, 2020
31  Virginia          Clemson           8:00 pm ACCN                    Saturday, October 3, 2020
32  Rice              Marshall          Postponed                       Saturday, October 3, 2020
33  Troy              South Alabama     Postponed                       Saturday, October 3, 2020
It also saves data.csv (the original answer showed a screenshot of the file opened in LibreOffice).
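The idea behind the lookup is "attach the most recent date heading to every row that follows it". The sketch below illustrates that idea with only the standard library, so it runs without network access or third-party packages. The `SAMPLE` markup is a hypothetical, heavily simplified stand-in for the real page, and `ScheduleParser` is an illustrative name; this is not how BeautifulSoup implements `find_previous`.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the schedule page.
SAMPLE = """
<h4>Friday, October 2, 2020</h4>
<table><tr class="TableBase-bodyTr"><td>Campbell</td><td>Wake Forest</td></tr></table>
<h4>Saturday, October 3, 2020</h4>
<table>
<tr class="TableBase-bodyTr"><td>Rice</td><td>Marshall</td></tr>
<tr class="TableBase-bodyTr"><td>Troy</td><td>South Alabama</td></tr>
</table>
"""

class ScheduleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current_date = None  # text of the most recent <h4>
        self.in_h4 = False
        self.in_td = False
        self.cells = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h4':
            self.in_h4 = True
        elif tag == 'tr':
            self.cells = []  # start collecting a new row
        elif tag == 'td':
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == 'h4':
            self.in_h4 = False
        elif tag == 'td':
            self.in_td = False
        elif tag == 'tr':
            # Attach whichever heading was seen last to the finished row.
            self.rows.append((*self.cells, self.current_date))

    def handle_data(self, data):
        if self.in_h4:
            self.current_date = data.strip()
        elif self.in_td:
            self.cells.append(data.strip())

parser = ScheduleParser()
parser.feed(SAMPLE)
for row in parser.rows:
    print(row)
```

With BeautifulSoup, `row.find_previous('h4')` performs this backward lookup directly, so no state has to be tracked by hand.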
EDIT: To scrape the "venue" column, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.cbssports.com/college-football/schedule/FBS/2020/regular/5/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for row in soup.select('.TableBase-bodyTr'):
    home = row.select_one('.TeamLogoNameLockup')
    away = home.find_next(class_='TeamLogoNameLockup')
    time = row.select_one('.CellGame')
    # Rows with only three cells have no venue cell, so use '-' instead
    venue = '-' if len(row.select('td')) == 3 else row.select('td')[3].get_text(strip=True)
    # The nearest <h4> before this row is the date heading of its table
    date = row.find_previous('h4')

    all_data.append({
        'home': home.get_text(strip=True),
        'away': away.get_text(strip=True),
        'time': time.get_text(strip=True, separator=' '),
        'venue': venue,
        'date': date.get_text(strip=True),
    })

df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
    home              away              time                            venue                                        date
 0  Campbell          Wake Forest       WAKE 66 - CAMP 14               -                                            Friday, October 2, 2020
 1  Louisiana Tech    BYU               BYU 45 - LATECH 14              -                                            Friday, October 2, 2020
 2  East Carolina     Georgia St.       GAST 35, ECU 13 - 3rd ESPU      Center Parc Stadium                          Saturday, October 3, 2020
 3  Arkansas St.      Coastal Carolina  CSTCAR 31, ARKST 14 - 3rd ESP2  Brooks Stadium                               Saturday, October 3, 2020
 4  Missouri          Tennessee         TENN 28, MIZZOU 6 - 3rd SECN    Neyland Stadium                              Saturday, October 3, 2020
 5  Baylor            West Virginia     BAYLOR 7, WVU 7 - 3rd ABC       Mountaineer Field at Milan Puskar Stadium    Saturday, October 3, 2020
 6  TCU               Texas             TCU 20, TEXAS 14 - 2nd FOX      DKR-Texas Memorial Stadium                   Saturday, October 3, 2020
 7  NC State          Pittsburgh        NCST 17, PITT 13 - 3rd ACCN     Heinz Field                                  Saturday, October 3, 2020
 8  South Carolina    Florida           FLA 31, SC 14 - 3rd ESPN        Florida Field at Ben Hill Griffin Stadium    Saturday, October 3, 2020
 9  UT-San Antonio    UAB               UAB 14, TXSA 6 - 2nd            Legion Field                                 Saturday, October 3, 2020
10  North Alabama     Liberty           LIB 7, NAL 0 - 2nd ESP3         Williams Stadium                             Saturday, October 3, 2020
11  Abil Christian    Army              ARMY 7, ABIL 0 - 1st CBSSN      Blaik Field at Michie Stadium                Saturday, October 3, 2020
12  Texas A&M         Alabama           3:30 pm                         Bryant-Denny Stadium                         Saturday, October 3, 2020
13  Texas Tech        Kansas St.        3:30 pm FS1                     Bill Snyder Family Stadium                   Saturday, October 3, 2020
14  North Carolina    Boston College    3:30 pm ABC                     Alumni Stadium                               Saturday, October 3, 2020
15  South Florida     Cincinnati        3:30 pm ESP+                    Nippert Stadium                              Saturday, October 3, 2020
16  Oklahoma St.      Kansas            3:30 pm ESPN                    David Booth Kansas Memorial Stadium          Saturday, October 3, 2020
17  Memphis           SMU               3:30 pm ESP2                    Gerald J. Ford Stadium                       Saturday, October 3, 2020
18  Charlotte         FAU               4:00 pm ESPU                    FAU Stadium                                  Saturday, October 3, 2020
19  Jacksonville St.  Florida St.       4:00 pm                         Bobby Bowden Field at Doak Campbell Stadium  Saturday, October 3, 2020
20  Virginia Tech     Duke              4:00 pm ACCN                    Brooks Field at Wallace Wade Stadium         Saturday, October 3, 2020
21  Ole Miss          Kentucky          4:00 pm SECN                    Kroger Field                                 Saturday, October 3, 2020
22  W. Kentucky       Middle Tenn.      5:00 pm ESP3                    Johnny (Red) Floyd Stadium                   Saturday, October 3, 2020
23  Navy              Air Force         6:00 pm CBSSN                   Falcon Stadium                               Saturday, October 3, 2020
24  Ga. Southern      UL-Monroe         7:00 pm ESP+                    JPS Field at James L. Malone Stadium         Saturday, October 3, 2020
25  Auburn            Georgia           7:30 pm ESPN                    Sanford Stadium                              Saturday, October 3, 2020
26  Arkansas          Miss. State       7:30 pm SECN                    Davis Wade Stadium at Scott Field            Saturday, October 3, 2020
27  LSU               Vanderbilt        7:30 pm SECN                    Vanderbilt Stadium                           Saturday, October 3, 2020
28  Oklahoma          Iowa St.          7:30 pm ABC                     Jack Trice Stadium                           Saturday, October 3, 2020
29  So. Miss          North Texas       7:30 pm                         Apogee Stadium                               Saturday, October 3, 2020
30  Tulsa             UCF               7:30 pm ESP2                    Spectrum Stadium                             Saturday, October 3, 2020
31  Virginia          Clemson           8:00 pm ACCN                    Memorial Stadium                             Saturday, October 3, 2020
32  Rice              Marshall          Postponed                       -                                            Saturday, October 3, 2020
33  Troy              South Alabama     Postponed                       -                                            Saturday, October 3, 2020
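The venue fallback can be looked at in isolation. The sketch below works on plain lists of cell texts rather than on parsed HTML; `pick_venue` is a hypothetical helper, and the two sample rows are assumptions built from values in the output above, not the page's actual cell layout.

```python
def pick_venue(cells):
    """Return the fourth cell's text, or '-' when the row has no venue cell."""
    return cells[3] if len(cells) > 3 else '-'

# Hypothetical rows: one scheduled game with a venue, one postponed game without.
scheduled = ['Texas A&M', 'Alabama', '3:30 pm', 'Bryant-Denny Stadium']
postponed = ['Rice', 'Marshall', 'Postponed']

print(pick_venue(scheduled))
print(pick_venue(postponed))
```

Testing `len(cells) > 3` instead of `== 3` is a small variation on the answer's check: it also returns '-' for a row with fewer than three cells rather than raising an IndexError.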