File is corrupted when I try to download it with requests.get()

Question

I am trying to automate downloading documents with Selenium.

After extracting the URL from the website, I download the file with requests.get():

import time  # needed for time.sleep() below

import requests

url = 'https://www.schroders.com/hkrprewrite/retail/en/attach.aspx?fileid=e47b0366c44e4f33b04c20b8b6878aa7.pdf'
myfile = requests.get(url)
open('/Users/hemanthj/Downloads/AB Test/' + "A-Acc-USD" + '.pdf', 'wb').write(myfile.content)
time.sleep(3)

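The file downloads, but it is corrupted when I try to open it. It is only a few KB in size.

I also tried adding header information as suggested in this thread, with no luck: Corrupted PDF file after using requests.get() with Python

What in the headers makes the download work? Is there any fix?

Tags: python, selenium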

Solution


The problem was an incorrect URL: it returned an HTML page instead of the PDF. Looking through the site, I found the URL you were looking for. Try this code, then open the document with a PDF reader.

import requests
import pathlib


def load_pdf_from(url: str, filename: pathlib.Path) -> None:
    # stream=True defers downloading the body until it is iterated over
    response: requests.Response = requests.get(url, stream=True)
    if response.status_code == 200:
        with open(filename, 'wb') as pdf_file:
            # Write the body to disk in 1 KB chunks
            for chunk in response.iter_content(chunk_size=1024):
                pdf_file.write(chunk)
    else:
        print(f"Failed to load pdf: {url}")


url: str = 'https://www.schroders.com/hkrprewrite/retail/en/attachment2.aspx?fileid=e47b0366c44e4f33b04c20b8b6878aa7.pdf'

target_filename: pathlib.Path = pathlib.Path.cwd().joinpath('loaded_pdf.pdf')

load_pdf_from(url, target_filename)
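
Since the original symptom was an HTML page being saved under a .pdf name, it may help to check the payload before writing it. The sketch below is not part of the answer above. It assumes that a valid PDF starts with the bytes %PDF-, which is the usual file signature, and the helper names looks_like_pdf and save_if_pdf are hypothetical. save_if_pdf accepts any object with .headers and .content attributes, such as a requests.Response.

```python
from pathlib import Path

# Assumption: a well-formed PDF begins with this signature.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF signature."""
    return data.startswith(PDF_MAGIC)


def save_if_pdf(response, filename: Path) -> bool:
    """Write response.content to filename only if it looks like a PDF.

    `response` is expected to behave like a requests.Response, i.e. to
    expose `.headers` (a mapping) and `.content` (bytes).
    Returns True if the file was written, False otherwise.
    """
    if not looks_like_pdf(response.content):
        # An HTML error or landing page typically reports text/html here.
        content_type = response.headers.get("Content-Type", "unknown")
        print(f"Not a PDF (Content-Type: {content_type}); nothing written.")
        return False
    filename.write_bytes(response.content)
    return True
```

With requests installed, a call such as save_if_pdf(requests.get(url), target_filename) would refuse to save the HTML page instead of silently producing a file that no PDF reader can open.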
