Web scraping PDF links - no results returned

Problem Description

I've set up some code to scrape PDFs from a local council website. I request the page I want, then get the links for the various dates, and within each date there are links to the PDFs. But it doesn't return any results.

I've played around with the code and can't figure it out. It runs in a Jupyter notebook without throwing any errors.

Here is my code:

import requests
from bs4 import BeautifulSoup as bs

dates = ['April 2019', 'July 2019', 'December 2018']
r = requests.get('https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny')
soup = bs(r.content, 'lxml')

f = open(r"E:\Internship\WORK\GMCA\Getting PDFS\gmcabusinessdatelinks.txt", "w+")

for date in dates:
        if ['a'] in soup.select('a:contains("' + date + '")'):
            r2 = requests.get(date['href'])
            print("link1")
            page2 = r2.text
            soup2 = bs(page2, 'lxml')
            pdf_links = soup2.find_all('a', href=True)
            for plink in pdf_links:
                if plink['href'].find('minutes')>1:
                    print("Minutes!")
                    f.write(str(plink['href']) + ' ')
f.close()               

It creates the text file, but the file is blank. I want a text file containing all of the PDF links. Thanks.

Tags: python, web-scraping

Solution

The condition if ['a'] in soup.select('a:contains("' + date + '")') can never be true: soup.select() returns a list of Tag objects, and the list literal ['a'] is never an element of that list, so the loop body is silently skipped. Even if it were entered, date is a plain string, so date['href'] would raise a TypeError. Find the anchor tag directly with a regular expression instead: soup.find('a', text=re.compile(date)).
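To see the failure in isolation, here is a minimal sketch against a hypothetical one-link document (not the real council page):

from bs4 import BeautifulSoup as bs

demo = bs('<a href="/x">April 2019</a>', 'lxml')
matches = demo.select('a:contains("April 2019")')
print(matches)           # [<a href="/x">April 2019</a>]
print(['a'] in matches)  # False - a list literal is never among the Tag objects

The corrected script, with the membership test replaced by the re.compile lookup and a guard for dates that have no matching link: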

import requests
from bs4 import BeautifulSoup as bs
import re

dates = ['April 2019', 'July 2019', 'December 2018']
r = requests.get('https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny')
soup = bs(r.content, 'lxml')

with open(r"E:\gmcabusinessdatelinks.txt", "w+") as f:
    for date in dates:
        # Find the anchor whose text contains the date, e.g. "April 2019".
        link = soup.find('a', text=re.compile(date))
        if link is None:
            # No anchor for this date on the page; skip it.
            continue
        # Fetch the date page and parse it.
        r2 = requests.get(link['href'])
        soup2 = bs(r2.text, 'lxml')
        # Collect every anchor that carries an href attribute.
        pdf_links = soup2.find_all('a', href=True)
        for plink in pdf_links:
            # Keep only the links whose URL mentions 'minutes'.
            if 'minutes' in plink['href']:
                f.write(plink['href'] + ' ')
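One caveat: requests.get(link['href']) assumes the hrefs on the committee page are absolute URLs. If any of them turn out to be relative, they can be resolved against the page they came from with urllib.parse.urljoin before fetching or writing them out. A short sketch; the PDF path below is made up for illustration:

from urllib.parse import urljoin

page = 'https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny'
# urljoin resolves a relative href against the page it came from
# and leaves an already-absolute URL unchanged.
print(urljoin(page, '/documents/minutes_april_2019.pdf'))
# -> https://www.gmcameetings.co.uk/documents/minutes_april_2019.pdf
print(urljoin(page, 'https://www.gmcameetings.co.uk/documents/minutes_april_2019.pdf'))
# -> unchanged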
