
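Problem description

I am trying to practice web scraping on an e-commerce page. I identified the class name of the container (the cell that holds each product) as 'c3e8SH'. I then used the following code to scrape all the containers on the page, and used len(containers) to check how many were found.

However, it returned 0. Can someone point out what I am doing wrong? Thank you very much!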

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# HTML parsing
page_soup = soup(page_html, 'html.parser')

# grab each product container
containers = page_soup.find_all('div', class_='c3e8SH')
print(len(containers))  # 0 when the asker ran it


Tags: python, web-scraping, beautifulsoup, urlopen

Solution


(1) First, cookies are required.

If you request the following link without cookies, you get a validation page instead of the search results:

https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1




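(2) Second, the page you want to scrape is dynamically loaded.

That is why what you see in a web browser differs from what your code receives. The browser runs the page's JavaScript, which builds the product cells, while urlopen only returns the raw HTML, so find_all has nothing to match.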
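A quick way to confirm this kind of mismatch is to count how often the class appears in the HTML your code actually received. The sketch below uses only the standard library; the names ClassCounter and count_class are made up for this example, and both HTML strings are hand-written stand-ins, not real output from the site.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute includes a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in html carry class_name."""
    counter = ClassCounter(class_name)
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical raw response: an empty mount point that a script fills in later.
raw_html = (
    '<html><body><div id="root"></div>'
    "<script>/* client-side code adds the product cells */</script>"
    "</body></html>"
)

# Hypothetical DOM after the browser has run the script.
rendered_html = (
    '<html><body><div id="root">'
    '<div class="c3e8SH">product A</div>'
    '<div class="c3e8SH">product B</div>'
    "</div></body></html>"
)

print(count_class(raw_html, "c3e8SH"))       # 0
print(count_class(rendered_html, "c3e8SH"))  # 2
```

If the count is 0 for the text you downloaded but the class is visible in the browser's element inspector, the elements are most likely being created after the page loads.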

For convenience, I prefer to use the requests module.


import requests


my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'


cookies = {
    "Hm_lvt_7cd4710f721b473263eed1f0840391b4":"1548133175,1548135160,1548135844",
    "Hm_lpvt_7cd4710f721b473263eed1f0840391b4":"1548135844",
    "x5sec":"7b22617365727665722d6c617a6164613b32223a223862623264333633343063393330376262313364633537653564393939303732434c50706d754946454e2b4b356f7231764b4c643841453d227d",
}

ret = requests.get(my_url, cookies=cookies)
print("New Super Mario Bros" in ret.text) # True 

# ret.text now contains the shop items as JSON-style data


The shop items look like this:

item_json = 

    {
        "@context":"https://schema.org",
        "@type":"ItemList",
        "itemListElement":[
            {
                "offers":{
                    "priceCurrency":"SGD",
                    "@type":"Offer",
                    "price":"72.90",
                    "availability":"https://schema.org/InStock"
                },
                "image":"https://sg-test-11.slatic.net/p/ae0494e8a5eb7412830ac9822984f67a.jpg",
                "@type":"Product",
                "name":"Nintendo Switch New Super Mario Bros U Deluxe",  # item name
                "url":"https://www.lazada.sg/products/nintendo-switch-new-super-mario-bros-u-deluxe-i292338164-s484601143.html?search=1"
            },
            ... 

        ]

    }

As the JSON data shows, you can get each item's name, URL, price, and so on.
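As a minimal sketch of that last step, the code below pulls the name, price, and URL out of data shaped like the example above. It assumes the item list is embedded in the page as a script block of type application/ld+json, which the answer does not confirm, so check ret.text to see how the data is actually delivered. The names JsonLdCollector, extract_items, and sample_html are made up for this example, and sample_html is a hand-written stand-in for ret.text.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def extract_items(html):
    """Return (name, price, url) tuples from any schema.org ItemList in html."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    items = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if not isinstance(data, dict) or data.get("@type") != "ItemList":
            continue
        for element in data.get("itemListElement", []):
            offer = element.get("offers", {})
            items.append((element.get("name"), offer.get("price"), element.get("url")))
    return items


# Hand-written stand-in for ret.text, reusing the item shown in the answer.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "ItemList", "itemListElement": [
  {"@type": "Product",
   "name": "Nintendo Switch New Super Mario Bros U Deluxe",
   "url": "https://www.lazada.sg/products/nintendo-switch-new-super-mario-bros-u-deluxe-i292338164-s484601143.html?search=1",
   "offers": {"@type": "Offer", "priceCurrency": "SGD", "price": "72.90"}}
]}
</script>
</head><body></body></html>
"""

for name, price, url in extract_items(sample_html):
    print(name, price, url)
```

With the real page you would pass ret.text to extract_items instead of sample_html. If the data turns out to be delivered some other way, only the collector class needs to change; the loop over itemListElement stays the same.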


