xpath - Why do I get an empty Scrapy response?
Problem description
I started the Scrapy shell with:
scrapy shell -s USER_AGENT='Mozilla/5.0' https://www.gumtree.com/p/property-to-rent/brand-new-modern-studio-flat-%C2%A31056pcm-all-bills-included-in-willesden-green-area/1303463798
Next, I checked the response:
In [5]: response
Out[5]: <405 https://www.gumtree.com/p/property-to-rent/brand-new-modern-studio-flat-%C2%A31056pcm-all-bills-included-in-willesden-green-area/1303463798>
After inspecting the page element and copying its XPath, I ran:
In [6]: response.xpath('//*[@id="ad-title"]').extract()
Out[6]: []
The element's outerHTML, copied from the browser:
<h1 itemprop="name" id="ad-title">Brand New Modern Studio Flat £1056pcm | All Bills Included | In Willesden Green area</h1>
Why is the result empty?
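Solution
Try setting the user agent to something more realistic, for example: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0.
Some websites do basic validation on the user agent, and if they detect something odd, they redirect you to a special page. The 405 status in your shell output suggests the server rejected the request instead of serving the ad page, so the XPath has nothing to match.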
scrapy shell -s USER_AGENT='Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0' https://www.gumtree.com/p/property-to-rent/brand-new-modern-studio-flat-%C2%A31056pcm-all-bills-included-in-willesden-green-area/1303463798
>>> response.xpath('//*[@id="ad-title"]').extract()
['<h1 itemprop="name" id="ad-title">Brand New Modern Studio Flat £1056pcm | All Bills Included | In Willesden Green area</h1>']
>>>
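To check outside Scrapy whether the user agent is what triggers the rejection, the comparison can be sketched with only the Python standard library. This is a minimal sketch, not part of the original answer: the helper names build_request, fetch_status and is_rejected are hypothetical, and the site's behavior may have changed since the question was asked.

```python
import urllib.error
import urllib.request

# The realistic user agent string suggested in the answer above.
REALISTIC_UA = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) "
    "Gecko/20100101 Firefox/63.0"
)


def build_request(url, user_agent):
    """Return a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, user_agent, timeout=10):
    """Fetch url and return the HTTP status code (hypothetical helper).

    urlopen raises HTTPError for 4xx/5xx responses, so in that case the
    status code is read from the exception.
    """
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def is_rejected(status):
    """True when the status is not a 2xx success, such as the 405 above."""
    return not 200 <= status < 300
```

Calling fetch_status once with 'Mozilla/5.0' and once with REALISTIC_UA for the same URL, then passing each result to is_rejected, shows whether the server treats the two user agents differently. If both calls are rejected, the block is probably based on something other than the user agent.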