Request returns a 403 error, but the page works with a direct link in the browser

Problem description

I came from here.

I tried a different approach and, knowing very little about this, I realized that the button was calling a specific web page.

Say we browse this page: when you click the "Ver más" button, it seems to call this URL, so I figured we could scrape that page and loop over several pages to collect all the products they contain.

OK, so my new code looks like this:

import pandas as pd
import requests
from bs4 import BeautifulSoup
import datetime
from time import sleep
import random
from Scrapingtools import joinfiles    # my own helpers, not used in this snippet
from Scrapingtools import uploadfiles  # my own helpers, not used in this snippet

#url = 'https://www.pccomponentes.com/procesadores?page='

url_list = [
    'https://www.pccomponentes.com/listado/ajax?page=3&order=relevance&gtmTitle=Procesadores%20PC&idFamilies%5B%5D=4',
    'https://www.pccomponentes.com/listado/ajax?page=1&order=relevance&gtmTitle=Placas%20Base&idFamilies%5B%5D=3',
]


df_list = []
store = 'PCComponentes'
extraction_date = datetime.datetime.today().replace(microsecond=0)

for url in url_list:

    for x in range(0, 1):  # placeholder for the pagination; for now a single request per URL
        headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
        r = requests.get(url, headers=headers)
      # r = requests.get(url + str(x), headers = headers)
        # print (url + str(x))
        soup = BeautifulSoup(r.content,'html.parser')
        # print(soup)
        items = soup.find_all('div',class_='col-xs-6 col-sm-4 col-md-4 col-lg-4')
        print('Response: ' + str(r.status_code) + '. Found ' + str(len(items)) + ' items in ' + url)
        
        for item in items:

            product_name = item.find('h3',class_ = 'c-product-card__title').text.strip()
            try:
                price = item.find('div', class_ = 'c-product-card__prices-actual cy-product-price-normal').text[:-1]
            except AttributeError:
                price = item.find('div', class_ = 'c-product-card__prices-actual c-product-card__prices-actual--discount cy-product-price-discount').text[:-1]
            try:
                old_price = item.find('div',class_ = 'c-product-card__prices-pvp cy-product-price-normal').text[:-1]
            except AttributeError:
                old_price = "No discount"
            # try:
            #     availability = item.find('div', class_ = 'c-product-card__availability disponibilidad-inmediata cy-product-availability-date').text.strip()
            # except AttributeError:
            #     availability = item.find('div', class_ = 'c-product-card__availability disponibilidad-moderada cy-product-availability-date').text.strip()  
            # except AttributeError:
            #     availability = "No Date"  
            try:
                rating = item.find('span',class_ = 'c-star-rating__text cy-product-text').text.strip()
            except AttributeError:
                rating = ""
            try:
                reviews = item.find('span',class_ = 'c-star-rating__text cy-product-rating-result').text.strip()
            except AttributeError:
                reviews = ""
            try:
                brand = item.find('article')['data-brand'] 
            except AttributeError:
                brand = "No brand"
            try:
                category = item.find('article')['data-category']
            except AttributeError:
                category = "No category"
                   
            #  print(product_name, price, old_price, rating, reviews, brand, category, store, extraction_date)

            product_info = {
                'product_name' : product_name,
                'price' : price,
                'old_price' : old_price,
              # 'availability' : availability,
                'rating' : rating,
                'reviews' : reviews,
                'brand' : brand,
                'category' : category,
                'store' : store,
                'date_extraction' : extraction_date,
            }
            df_list.append(product_info)
            
    sleep(random.uniform(3.5, 7.5))

df = pd.DataFrame(df_list)
#print(df)

At this point I have two obstacles that keep me from moving forward. The first, and most important, is the 403 error.

Trying something like this

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'} # This is chrome, you can set whatever browser you like
response = requests.get('https://www.pccomponentes.com/listado/ajax?page=3&order=relevance&gtmTitle=Procesadores%20PC&idFamilies%5B%5D=4', headers=headers)

print (response.status_code)
print (response.url)

doesn't work at all. I have changed the user agent, but nothing helps :(.

If you copy this link and paste it into a browser, the page returns the products that I believe the code above should be able to get.
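
One thing I have read is that sites may check more headers than just the User-Agent. Here is a minimal sketch of what I mean, assuming the block keys off missing browser headers; the values below (Accept, Accept-Language, Referer, X-Requested-With) are guesses at plausible browser defaults, not values confirmed to work against this site:

import requests

# Sketch: send a fuller set of browser-like headers on a Session.
# Whether this actually gets past the 403 depends on the site's bot
# detection; the header values are assumptions, not confirmed to work.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'es-ES,es;q=0.9,en;q=0.8',
    'Referer': 'https://www.pccomponentes.com/procesadores',
    'X-Requested-With': 'XMLHttpRequest',  # the endpoint looks like an AJAX call
})

response = session.get(
    'https://www.pccomponentes.com/listado/ajax'
    '?page=3&order=relevance&gtmTitle=Procesadores%20PC&idFamilies%5B%5D=4'
)
print(response.status_code)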

The second obstacle is iterating over the pages, but first I need help with the page request itself.
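
For the iteration, here is a sketch of what I have in mind, assuming the page query parameter is the only thing that changes between pages. MAX_PAGES is a made-up stopping point; in practice I would stop as soon as a page returns zero items:

from urllib.parse import urlencode, quote

base = 'https://www.pccomponentes.com/listado/ajax'
base_params = {
    'order': 'relevance',
    'gtmTitle': 'Procesadores PC',
    'idFamilies[]': 4,
}
MAX_PAGES = 5  # hypothetical limit; stop earlier when a page has no items

for page in range(1, MAX_PAGES + 1):
    params = dict(base_params, page=page)
    # quote_via=quote encodes spaces as %20, matching the URLs above
    url = base + '?' + urlencode(params, quote_via=quote)
    print(url)  # feed each URL into the request/parse loop shown earlier

Each generated URL could then replace the hard-coded entries in url_list.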

Any ideas on how to solve this?

PS: sorry for the length of the post; I wanted to give the context of what I am trying to achieve.

Regards

Tags: python-3.x, web-scraping, python-requests
