Python: scrape a page's links (parent URLs), then the child URLs inside those links, then table data, and store it in a readable file

Problem description

I want to scrape all the race URLs from https://gg.co.uk/tips/today, e.g. https://gg.co.uk/racing/16-jun-2020/thirsk-1300, then loop over each of those URLs to get the form-profile links such as https://gg.co.uk/racing/form-profile-2703975, and then parse the table from each "https://gg.co.uk/racing/form-profile-2703975" page into a CSV file, one per race (e.g. "https://gg.co.uk/racing/16-jun-2020/thirsk-1300"). Example output format:

PLACE DATE     GOING          DISTANCE / CLASS   TIME / COURSE    JOCKEY
16th Jun 2020  Good to Soft   7f Class 5         1:00 Thirsk      F Norton
4th Jun 2020   Standard       6f Class 5         4:30 Newcastle   J Fanning

I have managed to scrape the links, but I can't loop over each link and output to CSV:

import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://gg.co.uk/tips/today')
base_url = 'https://gg.co.uk'
soup = BeautifulSoup(page.text, 'html.parser')

link_set = set()
for link in soup.find_all('a', {'class': 'winning-post'}):
    web_links = link.get("href")
    print(base_url + web_links)
    link_set.add(web_links)

Tags: python, beautifulsoup, export-to-csv

Solution


This script gets every form-profile-xxx URL from the race page https://gg.co.uk/racing/16-jun-2020/thirsk-1300, then grabs, from each profile page, the table row belonging to that race and saves it all to a CSV:

import csv
import requests
from bs4 import BeautifulSoup


url = 'https://gg.co.uk/racing/16-jun-2020/thirsk-1300'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
# Each horse on the race page links to its form-profile page
for a in soup.select('a[href^="/racing/form-profile-"]'):
    u = 'https://gg.co.uk' + a['href']
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    # On the profile page, select the table row that links back to this race
    row = s.select_one('tr:has(a[href="{}"])'.format(url.replace('https://gg.co.uk', '')))
    if not row:
        continue
    tds = [td.get_text(strip=True, separator='\n') for td in row.select('td')]
    print(tds)
    all_data.append(tds)

with open('data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        writer.writerow(row)
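To cover the full workflow from the question (one CSV per race listed on https://gg.co.uk/tips/today), the same logic can be wrapped in a loop over the race links. This is only a sketch, not tested against the live site: the `winning-post` anchor class comes from the question's own code, and the `race_csv_name` filename helper is a hypothetical naming scheme.

```python
import csv
import requests
from bs4 import BeautifulSoup

base_url = 'https://gg.co.uk'


def race_csv_name(race_path):
    # Hypothetical naming: '/racing/16-jun-2020/thirsk-1300' -> 'thirsk-1300.csv'
    return race_path.rstrip('/').rsplit('/', 1)[-1] + '.csv'


def scrape_race(race_path):
    """Collect one table row per form-profile link found on a race page."""
    soup = BeautifulSoup(requests.get(base_url + race_path).content, 'html.parser')
    rows = []
    for a in soup.select('a[href^="/racing/form-profile-"]'):
        s = BeautifulSoup(requests.get(base_url + a['href']).content, 'html.parser')
        # Pick the profile-table row that links back to this race
        row = s.select_one('tr:has(a[href="{}"])'.format(race_path))
        if row:
            rows.append([td.get_text(strip=True, separator='\n') for td in row.select('td')])
    return rows


if __name__ == '__main__':
    tips = BeautifulSoup(requests.get(base_url + '/tips/today').content, 'html.parser')
    for link in tips.find_all('a', {'class': 'winning-post'}):
        race_path = link.get('href')
        with open(race_csv_name(race_path), 'w', newline='') as f:
            csv.writer(f).writerows(scrape_race(race_path))
```

Each race then ends up in its own file named after the last URL segment.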

Prints:

['1st\n3\n5', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nHigh Peak\n9\n5\nF Norton\nM Johnston', '5/6\nWon']
['1st\n3\n5', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nHigh Peak\n9st 5lb\nF Norton\nM Johnston', '5/6\nWon']
['1st\n3\n5', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nHigh Peak\n9st 5lb\nF Norton\nM Johnston', '5/6\nWon']
['2nd\n2\n6', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDeputy\n9\n5\nS Donohoe\nC Fellowes', '5/2']
['2nd\n2\n6', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDeputy\n9st 5lb\nS Donohoe\nC Fellowes', '5/2']
['2nd\n2\n6', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDeputy\n9st 5lb\nS Donohoe\nC Fellowes', '5/2']
['3rd\n4\n2', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nInfant Hercules\n9\n5\nKevin Stott\nK A Ryan', '12/1\n2']
['3rd\n4\n2', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nInfant Hercules\n9st 5lb\nKevin Stott\nK A Ryan', '12/1\n2']
['3rd\n4\n2', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nInfant Hercules\n9st 5lb\nKevin Stott\nK A Ryan', '12/1\n2']
['4th\n8\n3', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nChilli Leaves\n9\n0\nCallum Rodriguez\nK Dalgleish', '12/1\n2.5']
['4th\n8\n3', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nChilli Leaves\n9st\nCallum Rodriguez\nK Dalgleish', '12/1\n2.5']
['4th\n8\n3', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nChilli Leaves\n9st\nCallum Rodriguez\nK Dalgleish', '12/1\n2.5']
['5th\n6\n4', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nMy Best Friend\n9\n5\nD Nolan\nD OʼMeara', '15/2\n4.25']
['5th\n6\n4', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nMy Best Friend\n9st 5lb\nD Nolan\nD OʼMeara', '15/2\n4.25']
['6th\n7\n8', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nTopper Bill\n9\n5\nBarry McHugh\nAdrian Nicholls', '25/1\n6.25']
['6th\n7\n8', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nTopper Bill\n9st 5lb\nBarry McHugh\nAdrian Nicholls', '25/1\n6.25']
['6th\n7\n8', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nTopper Bill\n9st 5lb\nBarry McHugh\nAdrian Nicholls', '25/1\n6.25']
['7th\n1\n1', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDandini\n9\n5\nBen Robinson\nOllie Pears', '40/1\n7']
['7th\n1\n1', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDandini\n9st 5lb\nBen Robinson\nOllie Pears', '40/1\n7']
['7th\n1\n1', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nDandini\n9st 5lb\nBen Robinson\nOllie Pears', '40/1\n7']
['8th\n5\n7', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nMarsellus\n9\n5\nD Allan\nT D Easterby', '33/1\n27']
['8th\n5\n7', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nMarsellus\n9st 5lb\nD Allan\nT D Easterby', '33/1\n27']
['8th\n5\n7', '16th Jun 2020\nGood to Soft\n7f\nClass 5', '1:00 Thirsk\nMarsellus\n9st 5lb\nD Allan\nT D Easterby', '33/1\n27']
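Note that each row appears several times in the output above, because the race page links to the same form profile more than once. Assuming the rows are plain lists of strings, a minimal order-preserving deduplication sketch before writing the CSV:

```python
def dedupe(rows):
    """Drop repeated rows while preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are unhashable, so use a tuple as the set key
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


all_data = [['1st', 'A'], ['1st', 'A'], ['2nd', 'B']]
print(dedupe(all_data))  # [['1st', 'A'], ['2nd', 'B']]
```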

And saves the rows to data.csv.

