Why does requests.get with the correct headers return empty content?

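Question

I am trying to scrape a website and copied the request headers directly from Chrome. However, when I call requests.get, the returned content is empty, even though the headers I print from the request are correct. Does anyone know what causes this? Thanks!

Mac, Chrome, Python 3.7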

[Images: general information; request information]

import requests


headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8',
    'Cookie': '_RSG=Ja4TD8hvFh2MGc7wBysunA; _RDG=28458f5367f9b123363c043b75e3f9aa31; _RGUID=2acfe6b2-0d74-4913-ac78-dbc2fa1e6416; _abtest_userid=bce0b01e-fdb6-48c8-9b86-4e1d8ef468df; _ga=GA1.2.937100695.1547968515; Session=SmartLinkCode=U155952&SmartLinkKeyWord=&SmartLinkQuery=&SmartLinkHost=&SmartLinkLanguage=zh; HotelCityID=5split%E5%93%88%E5%B0%94%E6%BB%A8splitHarbinsplit2019-01-25split2019-01-26split0; Mkt_UnionRecord=%5B%7B%22aid%22%3A%224897%22%2C%22timestamp%22%3A1548157938143%7D%5D; ASP.NET_SessionId=w1pq5dvchogxhbnxzmbgbtkk; OID_ForOnlineHotel=1509697509766jepc81550141458933102003; _RF1=123.165.147.203; MKT_Pagesource=PC; HotelDomesticVisitedHotels1=698432=0,0,4.5,3674,/hotel/8000/7899/df84daa197dd4b868868cba4db14f71f.jpg,&448367=0,0,4.3,4455,/fd/hotel/g6/M02/6D/8B/CggYtFc1nAKAEnRYAAdgA-rkEXw300.jpg,&13679014=0,0,4.9,1484,/200g0w000000k4wqrB407.jpg,; __zpspc=9.6.1550232718.1550232718.1%234%7C%7C%7C%7C%7C%23; _jzqco=%7C%7C%7C%7C1550232718632%7C1.2024536341.1547968514847.1550141461869.1550232718448.1550141461869.1550232718448.undefined.0.0.13.13; _gid=GA1.2.506035914.1550232719; _bfi=p1%3D102003%26p2%3D102003%26v1%3D18%26v2%3D17; appFloatCnt=8; _bfa=1.1509697509766.jepc8.1.1550141458610.1550232715314.7.19; _bfs=1.2',
    'Host': 'hotels.ctrip.com',
    'Referer': 'http://hotels.ctrip.com/hotel/698432.html?isFull=F',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36'
}

url = 'http://hotels.ctrip.com/Domestic/tool/AjaxHotelCommentList.aspx?MasterHotelID=698432&hotel=698432&property=0&card=0&cardpos=0&NewOpenCount=0&AutoExpiredCount=0&RecordCount=3663&OpenDate=2015-01-01&currentPage=1&orderBy=2&viewVersion=c&eleven=cb6ab06dc6aff1e215d71d006e6de92d3cb1428213f72763175fe035341c4f61&callback=CASTdHqLYNMOfGFbr&_=1550303542815'

data = requests.get(url, headers=headers)
# Prints the headers that were actually sent with the request
print(data.request.headers)

Tags: python, python-3.x, python-requests, web-crawler

Answer


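The request header information in the image you shared shows that the server responded correctly to that request. However, the URL you posted, http://hotels.ctrip.com/Domestic/tool/AjaxHotelCommentList.aspx?MasterHotelID=698432&hotel=698432&property=0&card=0&cardpos=0&NewOpenCount=0&AutoExpiredCount=0&RecordCount=3663&OpenDate=2015-01-01&currentPage=1&orderBy=2&viewVersion=c&eleven=cb6ab06dc6aff1e215d71d006e6de92d3cb1428213f72763175fe035341c4f61&callback=CASTdHqLYNMOfGFbr&_=1550303542815, is different from the one shown in the image.

The page itself appears to call many other URLs to assemble the final content, so there is no guarantee that requests will return the same response you see in the browser. If the server-side implementation relies on the browser's JavaScript engine to execute scripts and then render the content, requests cannot produce the final HTML as it appears in the browser, because it does not execute JavaScript. In those cases it is better to use Selenium WebDriver to load the URL and then read the HTML content.

If you can share the actual URL, I can suggest other ideas.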
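As a rough illustration of the advice above, the sketch below first summarizes what requests actually received (status code, final URL, body size) and then offers a Selenium fallback for pages that need JavaScript. The function names `diagnose`, `fetch_with_requests`, and `fetch_with_selenium` are hypothetical helpers, not part of either library, and the Selenium part assumes Chrome and a matching driver are installed. It is a starting point under those assumptions, not a guaranteed fix for this site.

```python
from types import SimpleNamespace


def diagnose(response):
    """Summarize a requests-style response (status_code, url, headers, content)."""
    body = response.content or b""
    return {
        "status": response.status_code,
        "final_url": response.url,
        "content_type": response.headers.get("Content-Type"),
        "body_bytes": len(body),
        # True when the body is missing or contains only whitespace
        "empty": len(body.strip()) == 0,
    }


def fetch_with_requests(url, headers=None):
    """Fetch url with requests and return (summary, text)."""
    import requests  # third-party; imported lazily so diagnose() works without it

    response = requests.get(url, headers=headers, timeout=10)
    return diagnose(response), response.text


def fetch_with_selenium(url):
    """Load url in headless Chrome so page JavaScript runs, then return the HTML."""
    from selenium import webdriver  # third-party; needs Chrome and a matching driver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails


# Offline check of diagnose() with a stand-in object (no network needed).
fake = SimpleNamespace(
    status_code=200,
    url="http://example.com/",
    headers={"Content-Type": "text/html"},
    content=b"  ",
)
print(diagnose(fake))
```

A possible workflow is to call `fetch_with_requests(url, headers)` first and inspect the summary. A 200 status with an empty body suggests the server is withholding content from non-browser clients or expects scripts to run, in which case `fetch_with_selenium(url)` may be worth trying.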

