Scrapy gets 400 Bad Request on a request identical to the browser's

Question

I am trying to scrape the price data from the table on this page: https://www.infomoney.com.br/cotacoes/petrobras-petr4/historico/

The data is loaded through a POST request to this URL: https://www.infomoney.com.br/wp-admin/admin-ajax.php

POST form data:

{
    "page":"0",
    "numberItems":"50",
    "action":"more_quotes_history",
    "quotes_history_nonce":"2510da6f8d",
    "symbol":"PETR4"
}

I retrieve quotes_history_nonce from a <script> tag in the page's HTML; it is identical to the value the browser sends.

response.xpath('//script').re(r'quotes_history_nonce":"(\w+)"')
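For anyone who wants to try the extraction outside a Scrapy shell, here is a minimal sketch of the same regular expression applied with Python's re module. The HTML snippet and the extract_nonce helper are made up for illustration; the real page's script contents may look different. WordPress nonces are time-limited, so a value copied from an earlier session may stop working after a while.

```python
import re
from typing import Optional

# Hypothetical excerpt of the page source; the real <script> content may differ.
SAMPLE_HTML = (
    '<script>var quotesData = {"ajaxUrl":"/wp-admin/admin-ajax.php",'
    '"quotes_history_nonce":"2510da6f8d"};</script>'
)

# Same pattern as in the question: capture the word characters after the key.
NONCE_RE = re.compile(r'quotes_history_nonce":"(\w+)"')


def extract_nonce(html: str) -> Optional[str]:
    """Return the first nonce found in the HTML, or None if there is none."""
    match = NONCE_RE.search(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_nonce(SAMPLE_HTML))
```

Returning None when nothing matches makes it easy to stop before sending a POST with an empty nonce.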

I have tried several combinations of headers, including an exact copy of my browser's headers (tested both with and without cookies).

Browser headers:

{
    "Accept":"application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding":"gzip, deflate, br",
    "Accept-Language":"pt-BR,pt;q=0.8,en-US;q=0.5,en;q=0.3",
    "Connection":"keep-alive",
    "Content-Length":"93",
    "Content-Type":"application/x-www-form-urlencoded; charset=UTF-8",
    "Cookie":"_omappvp=S4sOqvsk...; _omappvs=159...; tt_c_vmt=159...; tt_c_c=direct; tt_c_s=direct; tt_c_m=direct; _ttuu.s=1593...; tt.u=0100...",
    "DNT":"1",
    "Host":"www.infomoney.com.br",
    "Origin":"https://www.infomoney.com.br",
    "Referer":"https://www.infomoney.com.br/cotacoes/petrobras-petr4/historico/",
    "TE":"Trailers",
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0",
    "X-Requested-With":"XMLHttpRequest"
}

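By the way, the cookies middleware is enabled.

The response should be simple JSON, but I keep getting a 400. On my first attempts I remember hitting a CORS block (when simulating the request in the browser), but I can no longer reproduce it, so it may not even be relevant.

When I simulate a request made with the date fields filled in, the form gains two new fields: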

{
    "initialDate":"01/07/2019",
    "finalDate":"01/07/2020",
}

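Still no luck. Changing the parameters and resending from the browser works fine; from Scrapy I keep getting a 400. At this point I am out of ideas. What else could I be missing?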
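One thing worth checking when browser headers are copied verbatim is Content-Length. The sketch below url-encodes the form from the question with the standard library and compares the body size with the copied header value. It is only a diagnostic aid: nothing here confirms that this is what triggers the 400, and COPIED_CONTENT_LENGTH, TRANSPORT_HEADERS, and drop_transport_headers are names introduced purely for illustration.

```python
from urllib.parse import urlencode

# Form fields as listed in the question.
form = {
    "page": "0",
    "numberItems": "50",
    "action": "more_quotes_history",
    "quotes_history_nonce": "2510da6f8d",
    "symbol": "PETR4",
}

# Value of the Content-Length header copied from the browser.
COPIED_CONTENT_LENGTH = 93

# Body for the plain form, and for the variant with the two date fields.
body = urlencode(form)
date_form = {**form, "initialDate": "01/07/2019", "finalDate": "01/07/2020"}
date_body = urlencode(date_form)

# Assumption: these headers are normally computed by the HTTP client itself,
# so they can be left out when replaying a request.
TRANSPORT_HEADERS = {"content-length", "host", "connection", "te"}


def drop_transport_headers(headers: dict) -> dict:
    """Return a copy of the headers without the client-managed ones."""
    return {k: v for k, v in headers.items() if k.lower() not in TRANSPORT_HEADERS}


if __name__ == "__main__":
    print(len(body), len(body) == COPIED_CONTENT_LENGTH)
    print(len(date_body), len(date_body) == COPIED_CONTENT_LENGTH)
```

If the length differs for the form actually being sent, as it does once the date fields are added, leaving Content-Length out of the headers dict lets the client compute it and removes that mismatch.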

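Edit: My reply to @Booboo's answer adds detail to the question and is too long for a comment, so I am putting it here.

I expect JSON as the response because that is what my browser receives (screenshot).

What I want to scrape is the data in the table (red box), which is loaded from the JSON. As mentioned, the JSON comes from a POST request to https://www.infomoney.com.br/wp-admin/admin-ajax.php.

I just tried, as suggested, sending the POST parameters as query parameters in a GET request. My browser loads the page, but the data only appears after the subsequent POST request is made, exactly as if the URL had no query parameters. Scrapy loads the page with an empty table, because it does not execute JavaScript and the data is loaded dynamically. (That is why I am requesting the JSON from the API directly.)

I am making the requests with Scrapy, like this: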

scrapy.Request(url='https://www.infomoney.com.br/cotacoes/petrobras-petr4/historico/')

scrapy.FormRequest(url='https://www.infomoney.com.br/wp-admin/admin-ajax.php', formdata=form, headers=headers)

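form and headers are the same as defined above, before this edit.

As I mentioned, the headers in the question are the same as my browser's, so my browser's headers also include "X-Requested-With":"XMLHttpRequest".

Tags: python, web-scraping, scrapy, web-crawler

Answer

This program does not use Scrapy; it is just a plain Python program using the requests package, but it seems to work: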

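import requests

# Form fields copied from the question. The nonce is hard-coded here and
# may need to be refreshed from the page's <script> tag.
data = {
    'page': '0',
    'numberItems': '50',
    'action': 'more_quotes_history',
    'quotes_history_nonce': '2510da6f8d',
    'symbol': 'PETR4'
}

# Only a User-Agent header is set explicitly; requests fills in the rest.
response = requests.post(
    'https://www.infomoney.com.br/wp-admin/admin-ajax.php',
    data=data,
    headers={'user-agent': 'my-app/0.0.1'},
)
response.raise_for_status()  # raise on a 4xx/5xx status instead of parsing an error body
results = response.json()
for result in results:
    print(result)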

Prints:

[{'display': '03/07/2020', 'timestamp': '1593734400'}, 'n/d', '21,98', '-0,36', '21,80', '22,18', '515,18M']
[{'display': '02/07/2020', 'timestamp': '1593648000'}, '22,10', '22,06', '1,61', '21,86', '22,21', '1,22B']
[{'display': '01/07/2020', 'timestamp': '1593561600'}, 'n/d', '21,71', '0,74', '21,52', '22,25', '1,72B']
[{'display': '30/06/2020', 'timestamp': '1593475200'}, '21,34', '21,55', '-0,51', '21,09', '21,80', '1,39B']
[{'display': '29/06/2020', 'timestamp': '1593388800'}, 'n/d', '21,66', '3,93', '20,93', '21,66', '1,27B']
[{'display': '26/06/2020', 'timestamp': '1593129600'}, '21,21', '20,84', '-2,93', '20,78', '21,47', '1,15B']
[{'display': '25/06/2020', 'timestamp': '1593043200'}, '20,91', '21,47', '2,24', '20,73', '21,47', '988,35M']
[{'display': '24/06/2020', 'timestamp': '1592956800'}, '21,49', '21,00', '-3,00', '20,71', '21,56', '1,37B']
[{'display': '23/06/2020', 'timestamp': '1592870400'}, '21,22', '21,65', '3,34', '21,14', '22,07', '1,70B']
[{'display': '22/06/2020', 'timestamp': '1592784000'}, '21,60', '20,95', '-2,42', '20,90', '21,60', '942,34M']
[{'display': '19/06/2020', 'timestamp': '1592524800'}, '22,00', '21,47', '-0,60', '21,22', '22,22', '1,95B']
[{'display': '18/06/2020', 'timestamp': '1592438400'}, '21,18', '21,60', '0,75', '21,08', '21,77', '1,22B']
[{'display': '17/06/2020', 'timestamp': '1592352000'}, '21,48', '21,44', '0,33', '21,15', '21,85', '1,34B']
[{'display': '16/06/2020', 'timestamp': '1592265600'}, '21,56', '21,37', '3,24', '21,17', '21,91', '2,03B']
[{'display': '15/06/2020', 'timestamp': '1592179200'}, '19,81', '20,70', '0,49', '19,54', '21,09', '2,01B']
[{'display': '12/06/2020', 'timestamp': '1591920000'}, '20,62', '20,60', '-3,74', '20,10', '21,17', '2,39B']
[{'display': '10/06/2020', 'timestamp': '1591747200'}, '21,89', '21,40', '-1,47', '21,00', '21,90', '2,18B']
[{'display': '09/06/2020', 'timestamp': '1591660800'}, '22,03', '21,72', '-3,60', '21,64', '22,04', '2,08B']
[{'display': '08/06/2020', 'timestamp': '1591574400'}, '22,55', '22,53', '1,95', '22,01', '22,59', '1,82B']
[{'display': '05/06/2020', 'timestamp': '1591315200'}, '22,29', '22,10', '3,13', '22,06', '23,03', '2,53B']
[{'display': '04/06/2020', 'timestamp': '1591228800'}, '21,39', '21,43', '-0,19', '21,04', '21,78', '2,29B']
[{'display': '03/06/2020', 'timestamp': '1591142400'}, '21,86', '21,47', '0,33', '21,41', '21,91', '1,85B']
[{'display': '02/06/2020', 'timestamp': '1591056000'}, '20,75', '21,40', '5,26', '20,60', '21,40', '1,59B']
[{'display': '01/06/2020', 'timestamp': '1590969600'}, '20,15', '20,33', '-0,05', '20,00', '20,56', '1,75B']
[{'display': '29/05/2020', 'timestamp': '1590710400'}, '19,55', '20,34', '2,88', '19,30', '20,34', '2,53B']
[{'display': '28/05/2020', 'timestamp': '1590624000'}, '19,69', '19,77', '-0,80', '19,45', '20,08', '1,27B']
[{'display': '27/05/2020', 'timestamp': '1590537600'}, '19,80', '19,93', '1,32', '19,15', '19,93', '1,45B']
[{'display': '26/05/2020', 'timestamp': '1590451200'}, '19,98', '19,67', '0,98', '19,33', '20,09', '1,35B']
[{'display': '25/05/2020', 'timestamp': '1590364800'}, '19,48', '19,48', '4,34', '19,26', '19,56', '731,22M']
[{'display': '22/05/2020', 'timestamp': '1590105600'}, '18,80', '18,67', '-2,71', '18,35', '18,90', '1,28B']
[{'display': '21/05/2020', 'timestamp': '1590019200'}, '19,50', '19,19', '-0,57', '19,07', '19,77', '1,57B']
[{'display': '20/05/2020', 'timestamp': '1589932800'}, '19,09', '19,30', '3,32', '19,06', '19,44', '1,43B']
[{'display': '19/05/2020', 'timestamp': '1589846400'}, '18,51', '18,68', '0,76', '18,41', '18,93', '1,49B']
[{'display': '18/05/2020', 'timestamp': '1589760000'}, '18,10', '18,54', '8,11', '17,92', '18,54', '2,13B']
[{'display': '15/05/2020', 'timestamp': '1589500800'}, '17,99', '17,15', '-1,44', '17,15', '18,19', '2,35B']
[{'display': '14/05/2020', 'timestamp': '1589414400'}, '17,40', '17,40', '-1,08', '16,72', '17,47', '2,03B']
[{'display': '13/05/2020', 'timestamp': '1589328000'}, '18,26', '17,59', '-3,03', '17,52', '18,35', '1,54B']
[{'display': '12/05/2020', 'timestamp': '1589241600'}, '18,53', '18,14', '-0,06', '18,13', '18,83', '1,30B']
[{'display': '11/05/2020', 'timestamp': '1589155200'}, '18,30', '18,15', '-1,79', '18,11', '18,93', '1,19B']
[{'display': '08/05/2020', 'timestamp': '1588896000'}, '17,74', '18,48', '5,96', '17,71', '18,58', '1,51B']
[{'display': '07/05/2020', 'timestamp': '1588809600'}, '17,75', '17,44', '0,93', '17,35', '17,87', '1,42B']
[{'display': '06/05/2020', 'timestamp': '1588723200'}, '17,87', '17,28', '-3,68', '17,28', '18,06', '1,18B']
[{'display': '05/05/2020', 'timestamp': '1588636800'}, '17,90', '17,94', '3,22', '17,87', '18,48', '1,36B']
[{'display': '04/05/2020', 'timestamp': '1588550400'}, '17,43', '17,38', '-3,71', '17,18', '17,62', '1,04B']
[{'display': '30/04/2020', 'timestamp': '1588204800'}, '17,98', '18,05', '-0,82', '17,70', '18,42', '1,43B']
[{'display': '29/04/2020', 'timestamp': '1588118400'}, '17,80', '18,20', '5,51', '17,53', '18,48', '1,70B']
[{'display': '28/04/2020', 'timestamp': '1588032000'}, '17,04', '17,25', '4,86', '16,63', '17,25', '1,55B']
[{'display': '27/04/2020', 'timestamp': '1587945600'}, '16,14', '16,45', '3,13', '15,78', '16,53', '1,39B']
[{'display': '24/04/2020', 'timestamp': '1587686400'}, '16,69', '15,95', '-5,90', '15,28', '16,79', '2,56B']
[{'display': '23/04/2020', 'timestamp': '1587600000'}, '17,20', '16,95', '1,19', '16,62', '17,43', '1,61B']
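The rows come back as strings with Brazilian number formatting (decimal commas, 'n/d' for missing values) plus a Unix timestamp. Below is a minimal sketch of converting one row to Python values. parse_decimal and parse_row are hypothetical helpers; the meaning of each column is not stated above, so the numbers are left positional, and the last cell (for example '515,18M') is kept as text.

```python
from datetime import date, datetime, timezone
from typing import List, Optional, Tuple

# One row copied from the output above.
SAMPLE_ROW = [
    {'display': '03/07/2020', 'timestamp': '1593734400'},
    'n/d', '21,98', '-0,36', '21,80', '22,18', '515,18M',
]


def parse_decimal(text: str) -> Optional[float]:
    """Convert '21,98' to 21.98; return None for the 'n/d' placeholder."""
    if text == 'n/d':
        return None
    # Assumption: '.' (if present) is a thousands separator, ',' the decimal mark.
    return float(text.replace('.', '').replace(',', '.'))


def parse_row(row: list) -> Tuple[date, List[Optional[float]], str]:
    """Split a row into (date, numeric cells, raw last cell)."""
    # Assumption: the timestamp is in seconds and refers to midnight UTC.
    day = datetime.fromtimestamp(int(row[0]['timestamp']), tz=timezone.utc).date()
    values = [parse_decimal(cell) for cell in row[1:-1]]
    return day, values, row[-1]


if __name__ == "__main__":
    print(parse_row(SAMPLE_ROW))
```

Keeping the values positional avoids guessing at column names; they can be zipped with the table's header once that has been read from the page.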
