How to scrape the download links in every folder on S3

Problem description

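I have been scraping this dynamic website, which is essentially an index of links. I want to collect the download links for all files in every folder, down to the deepest subfolder. I'm not sure what approach to use for this.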

Code:

    import time
    import lxml
    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    
    
    url = 'http://dl.ncsbe.gov.s3.amazonaws.com/index.html?prefix='
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    time.sleep(5)
    page = driver.page_source
    driver.quit()
    
    soup = BeautifulSoup(page, 'html.parser')
    lists = []
    for tags in soup.find_all('a', href=True):  # skip anchors without an href
        links = tags['href']
        lists.append(links)
    
    # Endpoint taken from the browser's network tools (F12)
    req = requests.get('https://s3.amazonaws.com/dl.ncsbe.gov?delimiter=/').content
    soup = BeautifulSoup(req, 'lxml')
    names = []
    for common in soup.find_all('prefix')[2:]:
        names.append(common.text)
    names.sort()  # sort once, after the loop
    print(names)

I just want the download links for every file type in each folder.

Tags: python, amazon-s3, web-scraping, data-science

Solution


This is a public S3 bucket, so you can fetch its XML listing from the bucket root:

https://s3.amazonaws.com/dl.ncsbe.gov/

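This means you can request it, parse the XML, and rebuild the URL for every key. Because the request sets no delimiter, the listing is flat, so keys from every subfolder are included.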
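
To make the shape of that listing concrete, here is a minimal offline sketch that uses only the standard library. The sample XML is hand-written in the form of a `ListBucketResult` document; the bucket name, the keys, and the `key_urls` helper are assumptions for illustration and are not taken from the real bucket.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Hand-written sample in the shape of an S3 ListBucketResult document.
# The bucket name and keys are illustrative only.
SAMPLE_XML = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <IsTruncated>false</IsTruncated>
  <Contents><Key>Reports/2020/summary report.pdf</Key><Size>1024</Size></Contents>
  <Contents><Key>Reports/2020/</Key><Size>0</Size></Contents>
  <Contents><Key>data.csv</Key><Size>2048</Size></Contents>
</ListBucketResult>"""

# Every element in the listing lives in this XML namespace.
NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}
BASE_URL = "https://s3.amazonaws.com/example-bucket"  # hypothetical bucket


def key_urls(xml_text, base_url):
    """Return a download URL for every non-folder key in the listing."""
    root = ET.fromstring(xml_text)
    urls = []
    for key in root.findall("s3:Contents/s3:Key", NS):
        if key.text.endswith("/"):  # zero-byte "folder" placeholder, not a file
            continue
        # quote() percent-encodes spaces and other unsafe characters, keeping "/"
        urls.append(f"{base_url}/{quote(key.text)}")
    return urls


urls = key_urls(SAMPLE_XML, BASE_URL)
for url in urls:
    print(url)
```

Each `Contents` element carries one `Key`, and a key's "folders" are just the slash-separated parts of its name, which is why a single flat listing is enough to reach files at any depth.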

Against the live bucket, it looks like this:

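    from urllib.parse import quote

    import requests
    import xmltodict

    base_url = "https://s3.amazonaws.com/dl.ncsbe.gov"
    response = requests.get(base_url)
    response.raise_for_status()  # fail early on an HTTP error
    # force_list keeps "Contents" a list even if the listing holds a single key
    data = xmltodict.parse(response.content, force_list=("Contents",))

    # str.endswith() accepts a tuple, so one call checks every extension
    valid_extensions = (
        ".pdf", ".doc", ".docx", ".txt", ".zip", ".xlsx", ".xls", ".csv", ".mp4",
    )

    for item in data["ListBucketResult"].get("Contents", []):
        key = item["Key"]
        if key.endswith(valid_extensions):
            sep = "" if key.startswith("/") else "/"
            # quote() percent-encodes spaces and other unsafe characters, keeping "/"
            print(f"{base_url}{sep}{quote(key)}")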

This prints the bucket's contents as file URLs (a single listing response holds at most 1,000 keys):

    https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/2018%20County%20CF%20Procedures%20After%20the%20Election%20New%20Election%20Cycle%20Tasks.pdf
    https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/Audit%20Checklist.doc
    https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/Audit%20Letter%20-%20standard.docx
    https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/ICR-201%20Delinquent%20Repts.pdf
    https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/ICR-202%20Late%20Repts.pdf
    https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/ICR-203%20Noncompliant%20Comms.pdf
    https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/Prohibited%20Receipts-Expenditures.pdf
    https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/e-ICR-201.pdf
    https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/e-ICR-202.pdf
    https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/e-ICR-203.pdf

and many more ...
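
If the bucket holds more keys than fit in one response, the listing comes back truncated and has to be followed page by page. The sketch below shows one way to do that with the ListObjectsV2 query parameters `list-type=2` and `continuation-token`. `fetch_page` and `iter_keys` are hypothetical helper names, and the two in-memory pages are hand-written stand-ins so the loop can run without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def fetch_page(bucket_url, token=None):
    """Fetch one ListObjectsV2 page of a public bucket as XML text."""
    params = {"list-type": "2"}
    if token:
        params["continuation-token"] = token
    with urlopen(f"{bucket_url}/?{urlencode(params)}") as resp:
        return resp.read().decode("utf-8")


def iter_keys(fetch):
    """Yield every key, following continuation tokens until the listing ends.

    `fetch` is any callable that takes a token (or None) and returns page XML.
    """
    token = None
    while True:
        root = ET.fromstring(fetch(token))
        for key in root.findall("s3:Contents/s3:Key", NS):
            yield key.text
        truncated = root.findtext("s3:IsTruncated", default="false", namespaces=NS)
        token = root.findtext("s3:NextContinuationToken", namespaces=NS)
        if truncated != "true" or token is None:
            break


# Two hand-written pages standing in for real responses (illustrative keys).
XMLNS = 'xmlns="http://s3.amazonaws.com/doc/2006-03-01/"'
PAGES = {
    None: f"<ListBucketResult {XMLNS}><IsTruncated>true</IsTruncated>"
          "<NextContinuationToken>page-2</NextContinuationToken>"
          "<Contents><Key>a/one.pdf</Key></Contents></ListBucketResult>",
    "page-2": f"<ListBucketResult {XMLNS}><IsTruncated>false</IsTruncated>"
              "<Contents><Key>b/two.csv</Key></Contents></ListBucketResult>",
}

keys = list(iter_keys(PAGES.get))
print(keys)
```

Against the real bucket, something like `iter_keys(lambda token: fetch_page("https://s3.amazonaws.com/dl.ncsbe.gov", token))` should walk the full listing, assuming the bucket allows anonymous ListObjectsV2 requests.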
