python - Scrapy on AWS EC2 Ubuntu: redirect for booking.com
Problem description
My booking.com scraper works locally but not on AWS. There I get redirected, and then the spider stops working. Some code:
import pandas as pd
from scrapy import Request
from scrapy.spiders import CrawlSpider


class HotelsCrawler(CrawlSpider):
    name = "booking_crawler"
    allowed_domains = ['booking.com']
    # Request headers imitating a mobile Chrome browser
    headers = {
        "User-Agent": "Mozilla/5.0(Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Mobile Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Host": "www.booking.com"
    }
    # Collected results and run parameters
    resultHotels = pd.DataFrame(columns=['hotel_id', 'name', 'score', 'price'])
    fileName = ''
    startDate = None
    endDate = None
    city = ''

    def start_requests(self):
        # Search results for Berlin, 2018-08-28 to 2018-08-29, 2 adults, 1 room
        url = "https://www.booking.com/searchresults.pl.html?ss=Berlin&is_ski_area=0&dest_type=city&checkin_monthday=28&checkin_month=8&checkin_year=2018&checkout_monthday=29&checkout_month=8&checkout_year=2018&no_rooms=1&group_adults=2&group_children=0"
        yield Request(url=url, headers=self.headers, callback=self.parse)
And the log:
2018-08-28 11:14:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.booking.com/robots.txt> (referer: None)
2018-08-28 11:14:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.booking.com/searchresults.pl.html?dest_type=city;ss=Berlin> from <GET https://www.booking.com/searchresults.pl.html?ss=Berlin&is_ski_area=0&dest_type=city&checkin_monthday=28&checkin_month=8&checkin_year=2018&checkout_monthday=29&checkout_month=8&checkout_year=2018&no_rooms=1&group_adults=2&group_children=0>
2018-08-28 11:14:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.booking.com/searchresults.pl.html?dest_type=city;ss=Berlin> (referer: None)
2018-08-28 11:14:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.booking.comhttps': <GET https://www.booking.comhttps//www.booking.com/searchresults.pl.html?dest_id=-1746443&dest_type=city&ss=Berlin&offset=20&pagination_used=1>
2018-08-28 11:14:02 [scrapy.core.engine] INFO: Closing spider (finished)
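In the second line you can see a redirect that I do not get locally, and after that something strange happens to the URL. I am using the AWS EC2 free tier with Ubuntu.
Edit: I ran this code on DigitalOcean and it works there.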
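The filtered request in the log has the host 'www.booking.comhttps', which looks like what you get when a base URL is prepended to a link that is already absolute. The spider's parse method is not shown, so this is only a guess at the cause, not a confirmed diagnosis. Below is a minimal sketch of that failure mode and of joining with urljoin instead; BASE, naive_join, safe_join and the two href values are hypothetical names made up for illustration.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical base URL a spider might prepend to pagination links.
BASE = "https://www.booking.com"


def naive_join(base, href):
    """Plain concatenation: only correct when href is a relative path."""
    return base + href


def safe_join(base, href):
    """urljoin handles both relative and already-absolute hrefs."""
    return urljoin(base, href)


# Hypothetical hrefs: the same page linked relatively and absolutely.
relative_href = "/searchresults.pl.html?dest_type=city&ss=Berlin&offset=20"
absolute_href = "https://www.booking.com" + relative_href

# Concatenating an absolute href yields a URL with a bogus host.
broken = naive_join(BASE, absolute_href)
broken_host = urlparse(broken).hostname

# urljoin gives the same, valid URL in both cases.
from_relative = safe_join(BASE, relative_href)
from_absolute = safe_join(BASE, absolute_href)

print(broken_host)
print(from_relative == from_absolute)
```

If this is what happens, the site may be serving relative links in one environment and absolute links in the other. Inside a Scrapy callback, response.urljoin(href) or response.follow(href) resolves both forms against the response URL.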
Solution