How to brute-force a "U-turn" against a redirect

Problem description

Hey Stack Overflow,

I'm trying to save some images with the Python requests library, but I'm running into trouble when saving images from a Chinese website.

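I have 3 sample code snippets to illustrate my problem:

  1. My ideal case, where I save a simple image
  2. Status code: 200. Input and final URLs differ. A dummy image is saved
  3. Status code: 302. Input and final URLs are the same. An odd image is saved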

Sample images: Image1 / Image2

Functions:

import requests

def get_response(url):
    print('Input URL:\n\t %s'%(url))
    response = requests.get(url)
    return response

def get_response_dont_redirect(url):
    print('Input URL:\n\t %s'%(url))
    response = requests.get(url, allow_redirects=False)
    return response

def check_response_status(response):
    status = response.status_code
    if status == 200:
        print(('Final URL:\n\t %s')%response.url)
        print('Status Code: %s / OK'%(status))
        return 'ok'
    if status == 302:
        print(('Final URL:\n\t %s')%response.url)
        print('Status Code: %s / Redirected'%(status))
        return 'redirected'
    if status == 404:
        # 404 is "Not Found"; "Access Denied" would be 401 or 403
        print('Status Code: %s / Not Found'%(status))
        return 'not found'

def save_image(response, status_code):
    if status_code ==302:
        with open('image_wanted.jpg', 'wb') as f:
            print('\nSaving image desired under "image_wanted.jpg"...\n')
            f.write(response.content)
    elif status_code == 200:
        with open('image_redirect.jpg', 'wb') as f:
            print('\nSaving image redirected under "image_redirect.jpg"...\n')
            f.write(response.content)
    elif status_code == 111:
        with open('image_normal.jpg', 'wb') as f:
            print('\nSaving image normal under "image_normal.jpg"...\n')
            f.write(response.content)

def case_1_comments():
    print('-------------------------------------------------------------------')
    print('#Comments:')
    print('# This is the ideal situation where I can simply download an image')
    print('-------------------------------------------------------------------')
def case_2_comments():
    print('-------------------------------------------------------------------')
    print('#Comments:')
    print('# Notice that despite the status code being 200, the input URL and final URL is different ')
    print('\t>I am definitely being redirected')
    print('\t>I get a dummy image from the redirected page')
    print('-------------------------------------------------------------------')
def case_3_comments():
    print('-------------------------------------------------------------------')
    print('#Comments:')
    print('# Here I have set the restriction of "allow_redirects=False" yet I get status code:302 ')
    print('\t>Somehow the input and final URL is the same')
    print('\t>The image saved is perpetually loading...')
    print('-------------------------------------------------------------------')
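
The check above only knows about 200, 302 and 404. As a minimal sketch of a more general label, the standard library's http.HTTPStatus can supply the reason phrase for any known code. describe_status is a hypothetical helper name, not part of my script:

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '302 Found (redirect)' for a status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes the standard library does not define (e.g. the 111 sentinel).
        phrase = 'Unknown'
    if 200 <= code < 300:
        kind = 'success'
    elif 300 <= code < 400:
        kind = 'redirect'
    elif 400 <= code < 500:
        kind = 'client error'
    elif 500 <= code < 600:
        kind = 'server error'
    else:
        kind = 'unclassified'
    return '%s %s (%s)' % (code, phrase, kind)
```

Calling describe_status(response.status_code) would then cover 301, 303, 307 and 308 as redirects too, instead of only 302.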

Case 1: Ideal case

print("\n\n--- Case 1: Ideal ---\n")

url = 'https://i5.walmartimages.ca/images/Large/094/514/6000200094514.jpg'
response = get_response(url)
status = check_response_status(response)
save_image(response, 111)
case_1_comments()

Case 2: without 'allow_redirects=False'

print("\n\n--- Case 2: without 'allow_redirects=False' restriction ---\n")

url = 'http://photo.yupoo.com/evakicks/6b3a8a2a/small.jpg'
response = get_response(url)
status = check_response_status(response)
save_image(response, 200)
case_2_comments()

Case 3: with 'allow_redirects=False'

print("\n\n--- Case 3: with 'allow_redirects=False' restriction ---\n")

url = 'http://photo.yupoo.com/evakicks/6b3a8a2a/small.jpg'
response = get_response_dont_redirect(url)
status = check_response_status(response)
save_image(response, 302)
case_3_comments()

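If you copy-paste my code and run it (see the full script at the end of this question, and pip install requests if you haven't already), you'll see that cases 2 and 3 behave very strangely. My goal is to force my way back to the input URL and save the image that is on that page.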
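
To make the difference between cases 2 and 3 easier to see, here is a rough sketch that lists every hop requests recorded. It relies on response.history (the intermediate responses when redirects are followed) and on the Location header of a redirect response. trace_redirects and print_trace are hypothetical helper names:

```python
def trace_redirects(response):
    """List every hop requests recorded, ending with the final response.

    Each entry is (status_code, url, location), where location is the
    Location header of that hop, or None when the header is absent.
    """
    hops = list(response.history) + [response]
    return [(r.status_code, r.url, r.headers.get('Location')) for r in hops]


def print_trace(response):
    """Print one line per hop, plus the redirect target when there is one."""
    for status, url, location in trace_redirects(response):
        print('%s  %s' % (status, url))
        if location:
            print('    -> Location: %s' % location)
```

With my own helpers this would be print_trace(get_response(url)) for case 2 and print_trace(get_response_dont_redirect(url)) for case 3. In case 3 the history is empty, so the input and final URL are bound to match: the 302 is the first response, and its Location header only says where the server wanted to send me.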

As case 3 shows, I have managed to stay on that page, but for some reason the saved image is just a loading screen.
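
One way to check what was actually written to disk in case 3 is to look at the first bytes of the body before saving it. The sketch below assumes the body of the 302 response may not be image data at all, which I have not verified for this site. sniff_image_type and save_if_image are hypothetical names:

```python
# Leading bytes ("magic numbers") of a few common image formats.
IMAGE_SIGNATURES = {
    'jpeg': b'\xff\xd8\xff',
    'png': b'\x89PNG\r\n\x1a\n',
    'gif': b'GIF8',
}


def sniff_image_type(data):
    """Return 'jpeg', 'png' or 'gif' if data starts like that format, else None."""
    for name, signature in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None


def save_if_image(data, path):
    """Write data to path only when it looks like an image.

    Returns True when the file was written and False when it was skipped.
    """
    if sniff_image_type(data) is None:
        return False
    with open(path, 'wb') as f:
        f.write(data)
    return True
```

Usage would be save_if_image(response.content, 'image_wanted.jpg'). Note that this only catches a body that is not an image; it would not catch the dummy image from case 2, which is itself a valid image file.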

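So I guess my question is: how do I force a "U-turn" against the redirect and save the image at the input URL?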

Here is the entire script to run (please excuse the spaghetti):

import requests

def get_response(url):
    print('Input URL:\n\t %s'%(url))
    response = requests.get(url)
    return response

def get_response_dont_redirect(url):
    print('Input URL:\n\t %s'%(url))
    response = requests.get(url, allow_redirects=False)
    return response

def check_response_status(response):
    status = response.status_code
    if status == 200:
        print(('Final URL:\n\t %s')%response.url)
        print('Status Code: %s / OK'%(status))
        return 'ok'
    if status == 302:
        print(('Final URL:\n\t %s')%response.url)
        print('Status Code: %s / Redirected'%(status))
        return 'redirected'
    if status == 404:
        # 404 is "Not Found"; "Access Denied" would be 401 or 403
        print('Status Code: %s / Not Found'%(status))
        return 'not found'

def save_image(response, status_code):
    if status_code ==302:
        with open('image_wanted.jpg', 'wb') as f:
            print('\nSaving image desired under "image_wanted.jpg"...\n')
            f.write(response.content)
    elif status_code == 200:
        with open('image_redirect.jpg', 'wb') as f:
            print('\nSaving image redirected under "image_redirect.jpg"...\n')
            f.write(response.content)
    elif status_code == 111:
        with open('image_normal.jpg', 'wb') as f:
            print('\nSaving image normal under "image_normal.jpg"...\n')
            f.write(response.content)

def case_1_comments():
    print('-------------------------------------------------------------------')
    print('#Comments:')
    print('# This is the ideal situation where I can simply download an image')
    print('-------------------------------------------------------------------')
def case_2_comments():
    print('-------------------------------------------------------------------')
    print('#Comments:')
    print('# Notice that despite the status code being 200, the input URL and final URL is different ')
    print('\t>I am definitely being redirected')
    print('\t>I get a dummy image from the redirected page')
    print('-------------------------------------------------------------------')
def case_3_comments():
    print('-------------------------------------------------------------------')
    print('#Comments:')
    print('# Here I have set the restriction of "allow_redirects=False" yet I get status code:302 ')
    print('\t>Somehow the input and final URL is the same')
    print('\t>The image saved is perpetually loading...')
    print('-------------------------------------------------------------------')

print("\n\n--- Case 1: Standard procedure ---\n")
url = 'https://i5.walmartimages.ca/images/Large/094/514/6000200094514.jpg'
response = get_response(url)
status = check_response_status(response)
save_image(response, 111)
case_1_comments()

print("\n\n--- Case 2: without 'allow_redirects=False' restriction ---\n")

url = 'http://photo.yupoo.com/evakicks/6b3a8a2a/small.jpg'
response = get_response(url)
status = check_response_status(response)
save_image(response, 200)
case_2_comments()

print("\n\n--- Case 3: with 'allow_redirects=False' restriction ---\n")

url = 'http://photo.yupoo.com/evakicks/6b3a8a2a/small.jpg'
response = get_response_dont_redirect(url)
status = check_response_status(response)
save_image(response, 302)
case_3_comments()

Tags: python, redirect, web-scraping, python-requests

Solution
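
An untested idea rather than a confirmed fix: some image hosts redirect requests that arrive without a Referer header (hotlink protection). Whether photo.yupoo.com does this is an assumption here, not something the question establishes. A minimal sketch of how one might test it with requests, where guess_referer and fetch_with_referer are hypothetical helper names and the guessed Referer value is itself only a starting point:

```python
from urllib.parse import urlsplit


def guess_referer(image_url):
    """Build a candidate Referer: the scheme and host of the image URL itself.

    This is only a guess; the page that really embeds the image may live on
    a different host, in which case that page's URL is the one to try.
    """
    parts = urlsplit(image_url)
    return '%s://%s/' % (parts.scheme, parts.netloc)


def fetch_with_referer(image_url, referer=None):
    """GET image_url with a Referer header, without following redirects."""
    # Third-party dependency; imported here so guess_referer() works without it.
    import requests

    headers = {'Referer': referer or guess_referer(image_url)}
    return requests.get(image_url, headers=headers,
                        allow_redirects=False, timeout=10)
```

If the assumption holds, fetch_with_referer(url).status_code would come back as 200 with the real image in the body; if it still returns 302, the Referer is not what the server is checking and something else (cookies, User-Agent) would need the same treatment.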

