Scrapy on AWS EC2 Ubuntu: redirect for booking.com

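Problem description

My booking.com scraper works locally but not on AWS. On AWS the start request gets redirected, and then the spider stops. Some code: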

import pandas as pd
from scrapy import Request
from scrapy.spiders import CrawlSpider


class HotelsCrawler(CrawlSpider):
    name = "booking_crawler"
    allowed_domains = ['booking.com']
    headers = {
        "User-Agent": "Mozilla/5.0(Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Mobile Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Host": "www.booking.com"
    }
    resultHotels = pd.DataFrame(columns=['hotel_id', 'name', 'score', 'price'])
    fileName = ''
    startDate = None
    endDate = None
    city = ''

    def start_requests(self):
        url = "https://www.booking.com/searchresults.pl.html?ss=Berlin&is_ski_area=0&dest_type=city&checkin_monthday=28&checkin_month=8&checkin_year=2018&checkout_monthday=29&checkout_month=8&checkout_year=2018&no_rooms=1&group_adults=2&group_children=0"
        yield Request(url=url, headers=self.headers, callback=self.parse)

And the log:

2018-08-28 11:14:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.booking.com/robots.txt> (referer: None)
2018-08-28 11:14:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.booking.com/searchresults.pl.html?dest_type=city;ss=Berlin> from <GET https://www.booking.com/searchresults.pl.html?ss=Berlin&is_ski_area=0&dest_type=city&checkin_monthday=28&checkin_month=8&checkin_year=2018&checkout_monthday=29&checkout_month=8&checkout_year=2018&no_rooms=1&group_adults=2&group_children=0>
2018-08-28 11:14:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.booking.com/searchresults.pl.html?dest_type=city;ss=Berlin> (referer: None)
2018-08-28 11:14:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.booking.comhttps': <GET https://www.booking.comhttps//www.booking.com/searchresults.pl.html?dest_id=-1746443&dest_type=city&ss=Berlin&offset=20&pagination_used=1>
2018-08-28 11:14:02 [scrapy.core.engine] INFO: Closing spider (finished)

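In the second line you can see a 301 redirect that I don't get locally, and after it something strange happens with the URL: the next request goes to 'www.booking.comhttps' and is filtered as offsite. I am using the AWS EC2 free tier with Ubuntu.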
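
To make the effect of the redirect in the second log line concrete, here is a minimal sketch that compares the query parameters of the requested URL with those of the redirect target, both copied from the log above. The helper name `query_params` is hypothetical, and the sketch only uses the standard library; it shows what was dropped, not why the server dropped it.

```python
from urllib.parse import urlsplit, parse_qsl


def query_params(url):
    """Return the query parameters of `url` as a dict.

    The redirect target separates parameters with ';', so ';' is
    normalised to '&' before parsing.
    """
    query = urlsplit(url).query.replace(";", "&")
    return dict(parse_qsl(query))


# Both URLs are taken verbatim from the log.
requested = (
    "https://www.booking.com/searchresults.pl.html?ss=Berlin&is_ski_area=0"
    "&dest_type=city&checkin_monthday=28&checkin_month=8&checkin_year=2018"
    "&checkout_monthday=29&checkout_month=8&checkout_year=2018"
    "&no_rooms=1&group_adults=2&group_children=0"
)
redirected = "https://www.booking.com/searchresults.pl.html?dest_type=city;ss=Berlin"

before = query_params(requested)
after = query_params(redirected)

# Parameters present in the request but missing after the redirect.
dropped = sorted(set(before) - set(after))
print("kept:   ", after)
print("dropped:", dropped)
```

Only `ss` and `dest_type` survive the redirect; the check-in and check-out dates and the room and guest counts are gone, so the page that is finally crawled is not the search that was requested.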

Edit: I ran this code on DigitalOcean and it works there.

Tags: python, python-3.x, amazon-ec2, web-scraping, scrapy

Solution
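
The `parse` callback is not shown in the question, so the following is only a guess at the offsite-filtered URL in the log. A host like 'www.booking.comhttps' is what you would get if the callback prepends "https://www.booking.com" to a pagination link that is already absolute on the redirected page. The sketch below reproduces that shape with the standard library and shows that `urljoin` handles relative and absolute links alike. The variable names and the relative form of the link are assumptions for illustration.

```python
from urllib.parse import urljoin, urlsplit

# URL of the page that was actually crawled after the redirect (from the log).
page_url = "https://www.booking.com/searchresults.pl.html?dest_type=city;ss=Berlin"

# Hypothetical pagination hrefs: the same link in relative and absolute form.
relative_href = (
    "/searchresults.pl.html?dest_id=-1746443&dest_type=city"
    "&ss=Berlin&offset=20&pagination_used=1"
)
absolute_href = "https://www.booking.com" + relative_href

# Assumed bug: plain string concatenation works for the relative href
# but breaks as soon as the site returns an absolute one.
broken = "https://www.booking.com" + absolute_href
print(urlsplit(broken).hostname)  # host is no longer www.booking.com

# urljoin resolves a relative href against the page URL and leaves an
# absolute href untouched, so both forms give the same result.
next_from_relative = urljoin(page_url, relative_href)
next_from_absolute = urljoin(page_url, absolute_href)
print(next_from_relative)
print(next_from_absolute)
```

Inside a Scrapy callback the equivalent calls are `response.urljoin(href)` or `response.follow(href, callback=...)`, which resolve the link against the URL of the response. This would address the malformed next-page URL only; it does not explain why the request is redirected on EC2 and not locally or on DigitalOcean.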

