How to scrape PDFs to a local folder with filename = url and a delay between iterations?

Problem description

I scraped a website (url = "http://bla.com/bla/bla/bla/bla.txt") for all the links containing .pdf that matter to me. These are now stored in relative_paths:

['http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/3333/jjjjj-99-0065.pdf',
 'http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/3333/jjjjj-99-1679.pdf',
 'http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/4444/jjjjj-99-9526.pdf',]

Now I want to store the PDFs "behind" these links in a local folder, with their url as the filename.

None of the similar questions on the internet seemed to help me reach my goal. The closest I got was when some weird files without even an extension were generated. Below are a few of the more promising code samples I have already tried.

for link in relative_paths:
    content = requests.get(link, verify = False)
    with open(link, 'wb') as pdf:
        pdf.write(content.content)

for link in relative_paths:  
    response = requests.get(url, verify = False)   
    with open(join(r'C:/Users/', basename(url)), 'wb') as f:
        f.write(response.content)

for link in relative_paths:
    filename = link
    with open(filename, 'wb') as f:
        f.write(requests.get(link, verify = False).content)

for link in relative_paths:
    pdf_response = requests.get(link, verify = False)
    filename = link
    with open(filename, 'wb') as f:
        f.write(pdf_response.content)

Now I am confused and do not know how to move on. Could you adapt one of the for loops and add a short explanation? If the url is too long for a filename, splitting at the third-to-last / would also be fine. Thanks :)

Also, the website host asked me not to scrape all the PDFs at once so the server does not get overloaded, since relative_paths holds many more links than shown here. That is why I am looking for a way to build some kind of delay into my requests.

Tags: python, pdf, web-scraping, beautifulsoup, filenames

Solution


Give this a try. A full URL cannot be used directly as a filename (it contains characters such as / and : that are not allowed in file paths, which is why your attempts produced odd files), so the loop below cuts a short name out of each link and pauses for 60 seconds after every 25 downloads:

import time
import requests

count_downloads = 25  #<--- pause after every 25 downloads
time_delay = 60       #<--- wait 60 seconds at each pause

for idx, link in enumerate(relative_paths):
    if idx and idx % count_downloads == 0:
        print('Waiting %s seconds...' % time_delay)
        time.sleep(time_delay)
    filename = link.split('jjjjj-')[-1]  #<--- split on whatever marker precedes the part you want to keep as the filename

    try:
        with open(filename, 'wb') as f:
            f.write(requests.get(link, verify=False).content)
            print('Saved: %s' % link)
    except Exception as ex:
        print('%s not saved. %s' % (link, ex))
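
The filename = link.split('jjjjj-')[-1] line assumes every URL contains that fixed marker. Since splitting at the third-to-last / was also acceptable, here is a minimal alternative sketch (not part of the original answer) that builds the filename from the last three path segments and sleeps after every single request instead of in batches; the output folder C:/Users/pdfs and the one-second delay are just illustrative assumptions:

import os
import time
import requests
from urllib.parse import urlsplit

out_dir = r'C:/Users/pdfs'   # assumed output folder, change as needed
per_request_delay = 1        # assumed pause (in seconds) after every single download

os.makedirs(out_dir, exist_ok=True)

for link in relative_paths:
    # keep the last three path segments, e.g. 'iii_3333_jjjjj-99-0065.pdf';
    # '/' is not allowed in filenames, so the segments are joined with '_'
    segments = urlsplit(link).path.split('/')
    filename = '_'.join(segments[-3:])
    try:
        response = requests.get(link, verify=False)
        response.raise_for_status()  # treat HTTP errors as failures instead of saving an error page
        with open(os.path.join(out_dir, filename), 'wb') as f:
            f.write(response.content)
        print('Saved: %s' % link)
    except Exception as ex:
        print('%s not saved. %s' % (link, ex))
    time.sleep(per_request_delay)  # be polite to the server between requests

Keeping the folder segments in the name also avoids collisions if the same trailing number ever appears under both the 3333/ and 4444/ directories.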
