Sending a POST request with Scrapy

Question

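I am learning how to use Scrapy for web scraping, but I ran into a problem scraping dynamically loaded content. I am trying to scrape a phone number from a website that sends a POST request to fetch the number. These are the headers of the POST request it sends: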

Host: www.mymarket.ge
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Referer: https://www.mymarket.ge/en/pr/16399126/savaWro-inventari/fulis-yuTi
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With: XMLHttpRequest
Content-Length: 13
Origin: https://www.mymarket.ge
Connection: keep-alive
Cookie: Lang=en; split_test_version=v1; CookieID=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJEYXRhIjp7IklEIjozOTUwMDY2MzUsImN0IjoxNTkyMzA2NDMxfSwiVG9rZW5JRCI6Ik55empxVStDa21QT1hKaU9lWE56emRzNHNSNWtcL1wvaVVUYjh2dExCT3ZKWT0iLCJJc3N1ZWRBdCI6MTU5MjMyMTc1MiwiRXhwaXJlc0F0IjoxNTkyMzIyMDUyfQ.mYR-I_51WLQbzWi-EH35s30soqoSDNIoOyXgGQ4Eu84; ka=da; SHOW_BETA_POPUP=B; APP_VERSION=B; LastSearch=%7B%22CatID%22%3A%22515%22%7D; PHPSESSID=eihhfcv85liiu3kt55nr9fhu5b; PopUpLog=%7B%22%2A%22%3A%222020-05-07+15%3A13%3A29%22%7D

This is the body:

PrID=16399126

I managed to replicate the POST request on reqbin.com, but I can't figure out how to do it with Scrapy. This is what my code looks like:

class MymarketcrawlerSpider(CrawlSpider):
    name = "mymarketcrawler"
    allowed_domains = ["mymarket.ge"]
    start_urls = ["http://mymarket.ge/"]

    rules = (
        Rule(
            LinkExtractor(allow=r".*mymarket.ge/ka/*", restrict_css=".product-card"),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        item_loader = ItemLoader(item=MymarketItem(), response=response)

        def parse_num(response):
            try:
                response_text = response.text
                response_dict = ast.literal_eval(response_text)
                number = response_dict['Data']['Data']['numberToShow']
                nonlocal item_loader
                item_loader.add_value("number", number)
                yield item_loader.load_item()
            except Exception as e:
                raise CloseSpider(e)


        yield FormRequest.from_response(
            response,
            url=r"https://www.mymarket.ge/ka/pr/ShowFullNumber/",
            headers={
                "Host": "www.mymarket.ge",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
                "Accept": "*/*",
                "Accept-Language": "en-US,en;q=0.5",
                "Accept-Encoding": "gzip, deflate, br",
                "Referer": "https://www.mymarket.ge/ka/pr/16399126/savaWro-inventari/fulis-yuTi",
                "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
                "X-Requested-With": "XMLHttpRequest",
            },
            formdata={"PrID": "16399126"},
            method="POST",
            dont_filter=True,
            callback=parse_num
        )
        item_loader.add_xpath(
            "seller", "//div[@class='d-flex user-profile']/div/span/text()"
        )
        item_loader.add_xpath(
            "product",
            "//div[contains(@class, 'container product')]//h1[contains(@class, 'product-title')]/text()",
        )
        item_loader.add_xpath(
            "price",
            "//div[contains(@class, 'container product')]//span[contains(@class, 'product-price')][1]/text()",
            TakeFirst(),
        )
        item_loader.add_xpath(
            "images",
            "//div[@class='position-sticky']/ul[@id='imageGallery']/li/@data-src",
        )
        item_loader.add_xpath(
            "condition", "//div[contains(@class, 'condition-label')]/text()"
        )
        item_loader.add_xpath(
            "city",
            "//div[@class='d-flex font-14 font-weight-medium location-views']/span[contains(@class, 'location')]/text()",
        )
        item_loader.add_xpath(
            "number_of_views",
            "//div[@class='d-flex font-14 font-weight-medium location-views']/span[contains(@class, 'svg-18')]/span/text()",
        )
        item_loader.add_xpath(
            "publish_date",
            "//div[@class='d-flex left-side']//div[contains(@class, 'font-12')]/span[2]/text()",
        )
        item_loader.add_xpath(
            "total_products_amount",
            "//div[contains(@class, 'user-profile')]/div/a/text()",
            re=r"\d+",
        )
        item_loader.add_xpath(
            "description", "//div[contains(@class, 'texts full')]/p/text()"
        )
        item_loader.add_value("url", response.url)
        yield item_loader.load_item()

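The code above does not work: the number field is not populated. I can print the number to the screen, but I can't save it to the CSV file. The number column in the CSV file is blank; it contains no value.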

Tags: web-scraping, scrapy

Solution


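Scrapy works asynchronously: every link to crawl and every item to process is placed in a queue. That is why you yield a request and then wait for the spider, the downloader, the item pipeline, and so on to process it.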
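A toy, Scrapy-free sketch of such a queue may make the ordering easier to see. Everything in it is a hypothetical stand-in rather than Scrapy's API: the run loop plays the engine, a callback yields either a follow-up request (a tuple of callback and payload) or a finished item (a dict), and the product name and phone number are made up. It is a simplified illustration, not a reproduction of the spider from the question.

```python
from collections import deque


def parse_num(payload):
    # Second callback: runs only when the engine dequeues the follow-up request.
    yield {"number": payload["number"]}


def parse_item(payload):
    # First callback: schedule the follow-up request ...
    yield (parse_num, {"number": "555-0100"})
    # ... then yield the item. parse_num has not run yet at this point.
    yield {"product": payload["product"], "number": None}


def run(start_callback, payload):
    """Toy engine: dequeue requests, run their callbacks, collect the items."""
    queue = deque([(start_callback, payload)])
    items = []
    while queue:
        callback, data = queue.popleft()
        for result in callback(data):
            if isinstance(result, dict):
                items.append(result)  # finished item, goes to the "pipeline"
            else:
                queue.append(result)  # follow-up request, handled later
    return items


items = run(parse_item, {"product": "cash box"})
print(items)
```

The first item reaches the "pipeline" with an empty number, and the number only shows up later in a separate item produced by the second callback.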

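What happens here is that your two requests are processed separately, and the item yielded from parse_item is emitted before the phone number response arrives, which is why you don't see the number. Personally, I would parse the results of the first request, store them in the request's "meta" data, and pass them along to the next request so that the data is available in its callback.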

For example:

import json

from scrapy.exceptions import CloseSpider
from scrapy.http import FormRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule

# MymarketItem is the item class from the question's project; import it
# from wherever your project defines it (the module path is not shown).


class MymarketcrawlerSpider(CrawlSpider):
    name = "mymarketcrawler"
    allowed_domains = ["mymarket.ge"]
    start_urls = ["http://mymarket.ge/"]

    rules = (
        Rule(
            LinkExtractor(allow=r".*mymarket.ge/ka/*", restrict_css=".product-card"),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):

        def parse_num(response):
            # Build the item here, once the phone number response has arrived.
            item_loader = ItemLoader(item=MymarketItem(), response=response)
            try:
                # The endpoint appears to return JSON, so parse it with json.loads;
                # ast.literal_eval fails on JSON literals such as true, false, null.
                response_dict = json.loads(response.text)
                number = response_dict["Data"]["Data"]["numberToShow"]
                # New part: read the value saved by the first callback.
                product = response.meta["product"]

                # You won't need this now: nonlocal item_loader
                item_loader.add_value("number", number)
                # Also new: add the value carried over in meta.
                item_loader.add_value("product", product)
                yield item_loader.load_item()
            except Exception as e:
                raise CloseSpider(str(e))

        # Rewrite your parsers like this: extract each field from the product
        # page first, then pass it along in meta (only "product" is shown).
        product = response.xpath(
            "//div[contains(@class, 'container product')]//h1[contains(@class, 'product-title')]/text()"
        ).get()

        yield FormRequest.from_response(
            response,
            url=r"https://www.mymarket.ge/ka/pr/ShowFullNumber/",
            headers={
                "Host": "www.mymarket.ge",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
                "Accept": "*/*",
                "Accept-Language": "en-US,en;q=0.5",
                "Accept-Encoding": "gzip, deflate, br",
                "Referer": "https://www.mymarket.ge/ka/pr/16399126/savaWro-inventari/fulis-yuTi",
                "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
                "X-Requested-With": "XMLHttpRequest",
            },
            formdata={"PrID": "16399126"},
            method="POST",
            dont_filter=True,
            callback=parse_num,
            meta={"product": product},
        )
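
Note that the code above still hard-codes PrID and the Referer, so every product page would ask for the same number. Below is a minimal sketch of deriving both from the product page URL. It assumes product URLs follow the /pr/<id>/... pattern seen in the Referer header from the question, and the helper name number_request_params is hypothetical.

```python
import re
from urllib.parse import urlencode

# Assumption: the product ID is the numeric path segment right after "/pr/",
# as in the Referer URL shown in the question.
PRODUCT_ID_RE = re.compile(r"/pr/(\d+)(?:/|$)")


def number_request_params(product_url):
    """Return the formdata and Referer for the ShowFullNumber request."""
    match = PRODUCT_ID_RE.search(product_url)
    if match is None:
        raise ValueError(f"no product ID found in {product_url!r}")
    return {
        "formdata": {"PrID": match.group(1)},
        "referer": product_url,
    }


params = number_request_params(
    "https://www.mymarket.ge/ka/pr/16399126/savaWro-inventari/fulis-yuTi"
)
print(params["formdata"])             # {'PrID': '16399126'}
print(urlencode(params["formdata"]))  # PrID=16399126
```

In the spider, parse_item could call number_request_params(response.url) and pass the returned formdata and Referer to the request instead of the literal values.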
