Web scraping: are my cookies "not working" in my requests?

Problem description

I'm very new to web scraping. I know nothing about cookies, and that seems to be the problem here. I'm trying something really simple, namely doing a requests.get() on a certain website and then playing with Beautiful Soup:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.immoweb.be/fr/recherche/maison/a-vendre/brabant-wallon?minprice=100000&maxprice=200000&minroom=3&maxroom=20")
print(page)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

This basically doesn't work, because print(soup.prettify()) says: "Request unsuccessful. Incapsula incident ID: 449001030063484539-234265426366891642".

Fair enough; I found out this is because my get needs some cookies. So I used the approach described here, creating a dict of the cookies and passing it as a parameter to my get:

cookies = {'incap_ses_449_150286':'ll/1bp9r6ifi7LPUDiw7Bi/dzlwAAAAAO6OR80W3VDDesKNGYZv4PA==', 'visid_incap_150286':'+Tg7VstMS1OzBycT4432Ey/dzlwAAAAAQUIPAAAAAAAqAettOJXSb8ocwxkzabRx'}
page = requests.get("https://www.immoweb.be/fr/recherche/maison/a-vendre/brabant-wallon?minprice=100000&maxprice=200000&minroom=3&maxroom=20", cookies=cookies)

...and now print(soup.prettify()) prints the whole page, good.

But, basically, if I shut down my computer, come back the next day and run my script again, these hard-coded cookies now seem to be wrong, because they have actually changed, right? That is what I observe: simply re-running my script no longer works. I suppose this is normal cookie behaviour, changing from one day to the next(?).
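For what it's worth, you can inspect the lifetime the server assigns by iterating over a session's cookie jar. A minimal sketch, assuming the same search URL as above (the incap_* cookies appear to come from the site's anti-bot layer and are indeed short-lived):

import requests

session = requests.Session()
session.get("https://www.immoweb.be/fr/recherche/maison/a-vendre/brabant-wallon?minprice=100000&maxprice=200000&minroom=3&maxroom=20")

for cookie in session.cookies:
    # cookie.expires is an epoch timestamp, or None for a session cookie
    # that is meant to die when the browser closes, which is consistent
    # with hard-coded values no longer working the next day.
    print(cookie.name, cookie.expires)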

So I thought I could fetch these cookies automatically just before doing my requests.get(). So I did this:

session = requests.Session()
response = session.get("https://www.immoweb.be/fr/recherche/maison/a-vendre/brabant-wallon?minprice=100000&maxprice=200000&minroom=3&maxroom=20")  # use the session, not a bare requests.get(), so the cookies land in its jar
cookies = session.cookies.get_dict()

When doing this, I do get 2 cookies ('incap_ses_449_150286' and another one), but with values different from what I see in Chrome's developer tools on that web page. And passing those cookies to my get() doesn't seem to work: although I no longer get the "Request unsuccessful" message, print(soup.prettify()) prints almost nothing. The only way I have found to make it work properly is to hand-code the cookies in a dict after looking them up with Chrome's tools... What am I missing?
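As a side note on the mechanics, a requests.Session automatically resends whatever cookies the server has set on it, so there is no need to extract them with get_dict() and pass them back by hand. A minimal sketch, using the same search URL:

import requests
from bs4 import BeautifulSoup

url = ("https://www.immoweb.be/fr/recherche/maison/a-vendre/brabant-wallon"
       "?minprice=100000&maxprice=200000&minroom=3&maxroom=20")

session = requests.Session()
session.get(url)  # first request: the server sets its cookies on the session's jar

# Second request on the same Session: those cookies are sent back
# automatically; no manual cookies= parameter is needed.
page = session.get(url)
soup = BeautifulSoup(page.content, "html.parser")
print(soup.prettify())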

Thanks a lot! Arnaud

Tags: python-3.x, web-scraping, python-requests

Solution


This isn't a Python issue. This is the web server you're connecting to being very particular about what it lets access its site. Something differs between your web browser and requests, the server detects it, and that causes it to allow one and deny the other. The cookies are probably there so it doesn't have to keep repeating this detection (Incapsula, judging by the incident ID in your error message), and by copying the cookies from Chrome into requests you're circumventing it.

Have you tried setting the user agent to Chrome's? Also, check the site's robots.txt to see whether it allows web scrapers; the website owners may simply not want you doing this, and they seem to have already put measures in place to prevent it.
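A minimal sketch of both suggestions, assuming requests and a recent Chrome User-Agent string (the exact version string below is just an example, and an anti-bot layer may still block the request on other signals):

import requests
from bs4 import BeautifulSoup

BASE = "https://www.immoweb.be"
URL = (BASE + "/fr/recherche/maison/a-vendre/brabant-wallon"
       "?minprice=100000&maxprice=200000&minroom=3&maxroom=20")

session = requests.Session()
# Present a browser-like User-Agent instead of requests' default
# ("python-requests/x.y.z"), which is trivial for a server to flag.
session.headers.update({
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/74.0.3729.169 Safari/537.36"),
})

# Check what the site says it allows before scraping anything.
print(session.get(BASE + "/robots.txt").text)

# Reuse the same Session so any cookies set above are sent back.
page = session.get(URL)
print(page.status_code)
soup = BeautifulSoup(page.content, "html.parser")
print(soup.prettify())

If the protection relies on a JavaScript challenge, though, no header tweak will get a plain HTTP client through, since it never executes the page's scripts.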

