Getting the PDFs in the iframe with id="swGoogleDrive"

Problem description

How can I get the PDFs found in the iframe at this URL?


(1) The following code throws an error.

import requests, re
from bs4 import BeautifulSoup

url = r'https://www.d88a.org/domain/102'
headers = {'User-Agent': 'C19SchoolsWebscrape'}

s = requests.Session()
r = s.get(url, headers=headers)

soup = BeautifulSoup(r.content, "lxml")
iframe_src = soup.select_one("swGoogleDrive").attrs["src"]
r = s.get(f"https:{iframe_src}")
print(r)
error: 'NoneType' object has no attribute 'attrs'
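
A likely cause of the error in (1): `select_one("swGoogleDrive")` is a CSS type selector, so it looks for a `<swGoogleDrive>` element rather than an element whose id is `swGoogleDrive`. An id selector needs a leading `#`. The sketch below shows the difference on a made-up HTML snippet; the markup and the `src` value are assumptions for illustration, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption).
html = '<div><iframe id="swGoogleDrive" src="//example.com/embedded"></iframe></div>'
soup = BeautifulSoup(html, "html.parser")

# Type selector: searches for a <swGoogleDrive> tag, which does not exist.
by_tag = soup.select_one("swGoogleDrive")

# Id selector: searches for any element with id="swGoogleDrive".
by_id = soup.select_one("#swGoogleDrive")

# Guard against None before reading attributes.
iframe_src = by_id["src"] if by_id is not None else None
print(by_tag, iframe_src)
```

Checking for `None` before reading `src` turns a missing element into an explicit case instead of an `AttributeError`.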

(2) This also raises an error.

response = requests.get(url, headers=headers)
t = re.search(b'(?<=artist":")(.*?)(?=")', response.content).group(0).decode("utf-8")
print(t)
error: 'NoneType' object has no attribute 'group'
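
The error in (2) has the same shape: `re.search` returns `None` when the pattern does not occur in the input, and calling `.group(0)` on `None` fails. Presumably the page contains no `artist":"` text at all. A minimal sketch of a guarded search follows; the helper name `find_first` and both sample inputs are made up for illustration.

```python
import re


def find_first(pattern: bytes, data: bytes):
    """Return the first match decoded as UTF-8, or None if nothing matches."""
    match = re.search(pattern, data)
    return match.group(0).decode("utf-8") if match else None


# Same pattern as in attempt (2).
pattern = rb'(?<=artist":")(.*?)(?=")'

# Hypothetical inputs: one that contains the key, one that does not.
sample_hit = b'{"artist":"Example Name","title":"x"}'
sample_miss = b"<html><body>no JSON here</body></html>"

print(find_first(pattern, sample_hit))
print(find_first(pattern, sample_miss))
```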

Earlier threads I referred to: "Python BeautifulSoup - Scrape Web Content Inside Iframes" and "Extracting iFrame content using BeautifulSoup".

Tags: python-3.x, iframe, beautifulsoup, lxml

Solution


To get all the links to the PDFs, you can use the following example:

import requests
from bs4 import BeautifulSoup


url = 'https://www.d88a.org/domain/102'

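# Load the page, then load the document that its first <iframe> points to.
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
soup = BeautifulSoup(requests.get(soup.iframe['src']).content, 'html.parser')

# Print the href of every link in the iframe document.
for a in soup.select('a'):
    print(a['href'])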

Prints:

https://drive.google.com/file/d/1bCXyoE7FWWI9RIcDWosHrohYQY7Ryb13/view?usp=drive_web
https://drive.google.com/file/d/1SlR-71M-jCMF-AO4ChdSbywolIF9yL1h/view?usp=drive_web
https://drive.google.com/file/d/1zbrt5Mnt0fZxjeD7DRYvfP6cskYKig27/view?usp=drive_web
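
The printed links point to Google Drive viewer pages rather than to the PDF bytes themselves. As a rough sketch, the file id can be pulled out of each link with a regular expression. The `uc?export=download&id=...` URL form used below is an assumption about Google Drive that is not part of the original answer, and it may not work for large or access-restricted files. The helper names `drive_file_id` and `drive_download_url` are hypothetical.

```python
import re


def drive_file_id(view_url: str):
    """Extract the file id from a .../file/d/<id>/view link, or return None."""
    match = re.search(r"/file/d/([^/]+)", view_url)
    return match.group(1) if match else None


def drive_download_url(view_url: str):
    """Build a direct-download URL (assumed URL form, see the note above)."""
    file_id = drive_file_id(view_url)
    if file_id is None:
        return None
    return f"https://drive.google.com/uc?export=download&id={file_id}"


link = "https://drive.google.com/file/d/1bCXyoE7FWWI9RIcDWosHrohYQY7Ryb13/view?usp=drive_web"
print(drive_file_id(link))
print(drive_download_url(link))
```

The resulting URL could then be fetched with `requests.get` and the response's `content` written to a `.pdf` file, provided the file is publicly shared.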
