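python - How to scrape the download links from every folder on S3
Question
I've been scraping this dynamic website, which is essentially an index of links. I want to collect the download links for all files in every folder, down to the last subfolder, but I don't know what mechanism to use.
Code: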
import time
import lxml
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = 'http://dl.ncsbe.gov.s3.amazonaws.com/index.html?prefix='
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(5)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
lists = []
for tags in soup.find_all('a'):
    links = tags['href']
    lists.append(links)
req = requests.get('https://s3.amazonaws.com/dl.ncsbe.gov?delimiter=/').content #from the network tools in F12
soup = BeautifulSoup(req, 'lxml')
names = []
for common in soup.find_all('prefix')[2:]:
    names.append(common.text)
names.sort()
print(names)
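I just want the download links for every file type in each folder.
Answer
This is a public S3 bucket, so you can fetch its XML listing from the root folder:
https://s3.amazonaws.com/dl.ncsbe.gov/
That means you can request it, parse the XML, and rebuild the URL for every key.
Here's how: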
import requests
import xmltodict

base_url = "https://s3.amazonaws.com/dl.ncsbe.gov"

# Fetch the bucket listing (XML) and convert it to nested dicts.
data = xmltodict.parse(requests.get(base_url).content)

# str.endswith accepts a tuple, so every suffix needs its leading dot.
valid_extensions = (
    ".pdf", ".doc", ".docx", ".txt", ".zip", ".xlsx", ".xls", ".csv", ".mp4",
)

for item in data["ListBucketResult"]["Contents"]:
    if item["Key"].endswith(valid_extensions):
        # Add a separating slash unless the key already starts with one.
        s3_url = base_url + "/" if not item["Key"].startswith("/") else base_url
        print(f'{s3_url}{item["Key"].replace(" ", "%20")}')
This prints the bucket's keys as file URLs (a single listing request returns at most 1,000 keys):
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/2018%20County%20CF%20Procedures%20After%20the%20Election%20New%20Election%20Cycle%20Tasks.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/Audit%20Checklist.doc
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/Audit%20Letter%20-%20standard.docx
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/ICR-201%20Delinquent%20Repts.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/ICR-202%20Late%20Repts.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/ICR-203%20Noncompliant%20Comms.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/Prohibited%20Receipts-Expenditures.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/e-ICR-201.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/e-ICR-202.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/e-ICR-203.pdf
and many more ...
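
If the bucket holds more keys than one listing page returns, or if you want the links grouped per folder as the question asks, something along these lines may help. This is only a sketch, not part of the original answer: it assumes the bucket still allows anonymous listing and accepts ListObjectsV2 requests (`list-type=2`), it uses only the standard library instead of `requests` and `xmltodict`, and the names `fetch`, `list_keys`, and `group_by_folder` are hypothetical helpers made up for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from collections import defaultdict

BASE_URL = "https://s3.amazonaws.com/dl.ncsbe.gov"
# XML namespace used by S3 listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def fetch(url):
    """Download one listing page and return the raw XML bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def list_keys(base_url=BASE_URL, fetch_page=fetch):
    """Yield every object key, following continuation tokens page by page."""
    token = None
    while True:
        params = {"list-type": "2"}
        if token:
            params["continuation-token"] = token
        page = fetch_page(f"{base_url}?{urllib.parse.urlencode(params)}")
        root = ET.fromstring(page)
        for key in root.findall("s3:Contents/s3:Key", S3_NS):
            yield key.text
        truncated = root.findtext("s3:IsTruncated", default="false", namespaces=S3_NS)
        token = root.findtext("s3:NextContinuationToken", namespaces=S3_NS)
        # Stop when the listing is complete or no token was provided.
        if truncated != "true" or not token:
            break


def group_by_folder(keys, base_url=BASE_URL):
    """Map each folder prefix to the percent-encoded URLs of its files."""
    folders = defaultdict(list)
    for key in keys:
        if key.endswith("/"):  # skip zero-byte "folder" placeholder objects
            continue
        folder, _, _name = key.rpartition("/")
        # quote() keeps "/" intact and encodes spaces and other unsafe characters.
        folders[folder].append(f"{base_url}/{urllib.parse.quote(key)}")
    return dict(folders)
```

Calling `group_by_folder(list_keys())` would perform the network requests and return a dict keyed by folder path (the empty string for files at the bucket root). `fetch_page` is a parameter so the paging logic can be exercised with canned XML, and `urllib.parse.quote` replaces the manual `replace(" ", "%20")` so that characters other than spaces are encoded too.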