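python - Scrape the parent URLs from a page of links, then those links' child URLs, then the table data, and store it in a readable file
Problem description
I want to scrape all the URLs from https://gg.co.uk/tips/today, for example https://gg.co.uk/racing/16-jun-2020/thirsk-1300, then loop over each of those URLs to get the form-profile links such as https://gg.co.uk/racing/form-profile-2703975, and then parse the table on each form-profile page into one CSV file per race (for example, one file for https://gg.co.uk/racing/16-jun-2020/thirsk-1300). Example output format: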
PLACE DATE / GOING DISTANCE / CLASS TIME / COURSE JOCKEY
16th Jun 2020 Good to Soft 7f Class 5 1:00 Thirsk F Norton
4th Jun 2020 Standard 6f Class 5 4:30 Newcastle J Fanning
I have managed to scrape the links, but I can't work out how to crawl each link and write the output to CSV:
import csv

import requests
from bs4 import BeautifulSoup

page = requests.get('https://gg.co.uk/tips/today')
base_url = 'https://gg.co.uk'
soup = BeautifulSoup(page.text, 'html.parser')

# Collect the href of every race link on the tips page
link_set = set()
for link in soup.find_all('a', {'class': 'winning-post'}):
    web_links = link.get("href")
    print(base_url + web_links)
    link_set.add(web_links)

print(web_links)
Solution
This script gets all the form-profile-xxx URLs from the race page https://gg.co.uk/racing/16-jun-2020/thirsk-1300, then takes from each profile page the row that belongs to this race and saves it to CSV:
import csv

import requests
from bs4 import BeautifulSoup

url = 'https://gg.co.uk/racing/16-jun-2020/thirsk-1300'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
# Each runner on the race page links to its form profile
for a in soup.select('a[href^="/racing/form-profile-"]'):
    u = 'https://gg.co.uk' + a['href']
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    # On the profile page, find the table row that links back to this race
    row = s.select_one('tr:has(a[href="{}"])'.format(url.replace('https://gg.co.uk', '')))
    if not row:
        continue
    # Join the text fragments inside each cell with newlines
    tds = [td.get_text(strip=True, separator='\n') for td in row.select('td')]
    print(tds)
    all_data.append(tds)

# Write one CSV row per scraped table row
with open('data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        writer.writerow(row)
This prints:
['1st\n3\n5', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nHigh Peak\n9\n5\nF Norton\nM Johnston', '5/6\nWon']
['1st\n3\n5', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nHigh Peak\n9st 5lb\nF Norton\nM Johnston', '5/6\nWon']
['1st\n3\n5', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nHigh Peak\n9st 5lb\nF Norton\nM Johnston', '5/6\nWon']
['2nd\n2\n6', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDeputy\n9\n5\nS Donohoe\nC Fellowes', '5/2']
['2nd\n2\n6', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDeputy\n9st 5lb\nS Donohoe\nC Fellowes', '5/2']
['2nd\n2\n6', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDeputy\n9st 5lb\nS Donohoe\nC Fellowes', '5/2']
['3rd\n4\n2', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nInfant Hercules\n9\n5\nKevin Stott\nK A Ryan', '12/1\n2']
['3rd\n4\n2', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nInfant Hercules\n9st 5lb\nKevin Stott\nK A Ryan', '12/1\n2']
['3rd\n4\n2', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nInfant Hercules\n9st 5lb\nKevin Stott\nK A Ryan', '12/1\n2']
['4th\n8\n3', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nChilli Leaves\n9\n0\nCallum Rodriguez\nK Dalgleish', '12/1\n2.5']
['4th\n8\n3', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nChilli Leaves\n9st\nCallum Rodriguez\nK Dalgleish', '12/1\n2.5']
['4th\n8\n3', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nChilli Leaves\n9st\nCallum Rodriguez\nK Dalgleish', '12/1\n2.5']
['5th\n6\n4', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nMy Best Friend\n9\n5\nD Nolan\nD OʼMeara', '15/2\n4.25']
['5th\n6\n4', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nMy Best Friend\n9st 5lb\nD Nolan\nD OʼMeara', '15/2\n4.25']
['6th\n7\n8', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nTopper Bill\n9\n5\nBarry McHugh\nAdrian Nicholls', '25/1\n6.25']
['6th\n7\n8', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nTopper Bill\n9st 5lb\nBarry McHugh\nAdrian Nicholls', '25/1\n6.25']
['6th\n7\n8', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nTopper Bill\n9st 5lb\nBarry McHugh\nAdrian Nicholls', '25/1\n6.25']
['7th\n1\n1', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDandini\n9\n5\nBen Robinson\nOllie Pears', '40/1\n7']
['7th\n1\n1', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDandini\n9st 5lb\nBen Robinson\nOllie Pears', '40/1\n7']
['7th\n1\n1', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDandini\n9st 5lb\nBen Robinson\nOllie Pears', '40/1\n7']
['8th\n5\n7', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nMarsellus\n9\n5\nD Allan\nT D Easterby', '33/1\n27']
['8th\n5\n7', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nMarsellus\n9st 5lb\nD Allan\nT D Easterby', '33/1\n27']
['8th\n5\n7', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nMarsellus\n9st 5lb\nD Allan\nT D Easterby', '33/1\n27']
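Each printed row still packs several values into one cell, separated by newlines. Below is a minimal sketch of how those cells could be split into the flat columns the question asks for. The helper names `flatten_row` and `rows_to_csv` are hypothetical, and the field positions (date, going, distance and class in the second cell; time/course first and jockey second-to-last in the third cell) are assumptions read off the sample output above, not something the site guarantees:

```python
import csv
import io

COLUMNS = ['PLACE', 'DATE', 'GOING', 'DISTANCE', 'CLASS', 'TIME / COURSE', 'JOCKEY']


def flatten_row(tds):
    """Split one scraped row (a list of newline-joined cells) into a flat dict."""
    place_parts = tds[0].split('\n')
    # Pad so a short cell yields empty strings instead of an unpacking error
    race_parts = (tds[1].split('\n') + [''] * 4)[:4]
    runner_parts = tds[2].split('\n')
    date, going, distance, race_class = race_parts
    return {
        'PLACE': place_parts[0],
        'DATE': date,
        'GOING': going,
        'DISTANCE': distance,
        'CLASS': race_class,
        'TIME / COURSE': runner_parts[0],
        # Assumption: the trainer is the last entry and the jockey precedes it
        'JOCKEY': runner_parts[-2] if len(runner_parts) >= 2 else '',
    }


def rows_to_csv(rows, fileobj):
    """Write a header line plus one flattened line per scraped row."""
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS)
    writer.writeheader()
    for tds in rows:
        writer.writerow(flatten_row(tds))


# One row copied from the printed output above
sample = ['1st\n3\n5',
          '16th Jun 2020\nGood to Soft\n7f\nClass 5',
          '1:00 Thirsk\nHigh Peak\n9\n5\nF Norton\nM Johnston',
          '5/6\nWon']

buffer = io.StringIO()
rows_to_csv([sample], buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a handle opened with `open('data.csv', 'w', newline='')`, as the script above does.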
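The script from the solution also saves the scraped rows to data.csv (the original answer showed a screenshot of the file opened in LibreOffice).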
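The solution handles a single race URL, while the question asks for every race linked from the tips page with one CSV file per race. A rough sketch of how the two steps might be chained is shown here. The `winning-post` class comes from the question's own code and has not been checked against the live site; the function names, the `out_dir` parameter and the file-naming scheme are hypothetical choices for illustration. Links are de-duplicated because the printed output above contains repeated rows, which suggests the same profile can be linked more than once on a page:

```python
import csv
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://gg.co.uk'


def unique_links(html, selector):
    """Return absolute URLs for all anchors matching selector, without repeats."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for a in soup.select(selector):
        href = a.get('href')
        if href:
            url = urljoin(BASE_URL, href)
            if url not in links:
                links.append(url)
    return links


def race_links(tips_html):
    # Assumption: race links carry the class used in the question's code
    return unique_links(tips_html, 'a.winning-post')


def profile_links(race_html):
    return unique_links(race_html, 'a[href^="/racing/form-profile-"]')


def race_row(profile_html, race_url):
    """Return the cells of the profile-table row that links back to race_url."""
    soup = BeautifulSoup(profile_html, 'html.parser')
    path = urlparse(race_url).path
    row = soup.select_one('tr:has(a[href="{}"])'.format(path))
    if row is None:
        return None
    return [td.get_text(strip=True, separator='\n') for td in row.select('td')]


def csv_name(race_url):
    """Build a file name from the URL path, dropping the leading 'racing' part."""
    parts = urlparse(race_url).path.strip('/').split('/')
    return '_'.join(parts[1:]) + '.csv'


def fetch_html(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def scrape_all(tips_url=BASE_URL + '/tips/today', fetch=fetch_html, out_dir='.'):
    """Write one CSV per race linked from the tips page."""
    for race_url in race_links(fetch(tips_url)):
        rows = []
        for profile_url in profile_links(fetch(race_url)):
            row = race_row(fetch(profile_url), race_url)
            if row:
                rows.append(row)
        with open(os.path.join(out_dir, csv_name(race_url)), 'w', newline='') as f:
            csv.writer(f).writerows(rows)
```

Calling `scrape_all()` with no arguments would fetch the live pages and write files such as 16-jun-2020_thirsk-1300.csv into the current directory. The `fetch` argument is injectable so the parsing can be tried against saved HTML before making any requests.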