python - Unable to scrape containers from a web page
Problem description
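I am practicing web scraping on an e-commerce page. I identified the class name of the container (the cell that holds each product) as 'c3e8SH'. I then used the code below to scrape all containers on the page, and called len(containers) to check how many were found.
However, it returned 0. Can someone point out what I am doing wrong? Thank you very much!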
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, 'html.parser')
#grabs each product
containers = page_soup.find_all('div', class_='c3e8SH')
len(containers)
Solution
(1) First, the request needs cookies. If you request the link without cookies, you get a validation page instead of the search results.
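(2) Second, the page you want to scrape is dynamically loaded. That is why what you see in a web browser differs from what your code receives: the product containers are presumably built by JavaScript after the page loads, so a div with class 'c3e8SH' does not exist in the HTML that the server returns.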
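A quick way to confirm this kind of problem is to count the target elements in the raw response and compare that with the markup seen in the browser. The sketch below uses only the standard library and two hypothetical HTML strings (`raw_html`, `rendered_html`) in place of real Lazada responses; the `window.pageData` variable is invented for illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts <div> tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and self.class_name in classes:
            self.count += 1


def count_divs(html, class_name):
    """Return the number of <div class="..."> matches in an HTML string."""
    counter = ClassCounter(class_name)
    counter.feed(html)
    return counter.count


# Hypothetical server response: an empty root element plus data for JavaScript to render.
raw_html = """
<html><body>
<div id="root"></div>
<script>window.pageData = {"items": [{"name": "Nintendo Switch New Super Mario Bros U Deluxe"}]};</script>
</body></html>
"""

# Hypothetical DOM after the browser has run the JavaScript.
rendered_html = '<div id="root"><div class="c3e8SH">item 1</div><div class="c3e8SH">item 2</div></div>'

print(count_divs(raw_html, "c3e8SH"))       # 0
print(count_divs(rendered_html, "c3e8SH"))  # 2
print("New Super Mario Bros" in raw_html)   # True
```

If the count is 0 for the raw response but the product name still appears in the text, the data is in the page in another form, and searching for the rendered class name will not find it.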
For convenience, I prefer to use the requests module.
import requests
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'
cookies = {
"Hm_lvt_7cd4710f721b473263eed1f0840391b4":"1548133175,1548135160,1548135844",
"Hm_lpvt_7cd4710f721b473263eed1f0840391b4":"1548135844",
"x5sec":"7b22617365727665722d6c617a6164613b32223a223862623264333633343063393330376262313364633537653564393939303732434c50706d754946454e2b4b356f7231764b4c643841453d227d",
}
ret = requests.get(my_url, cookies=cookies)
print("New Super Mario Bros" in ret.text) # True
# ret.text then contains the shop items as JSON-style data
The shop items look like this:
item_json =
{
"@context":"https://schema.org",
"@type":"ItemList",
"itemListElement":[
{
"offers":{
"priceCurrency":"SGD",
"@type":"Offer",
"price":"72.90",
"availability":"https://schema.org/InStock"
},
"image":"https://sg-test-11.slatic.net/p/ae0494e8a5eb7412830ac9822984f67a.jpg",
"@type":"Product",
"name":"Nintendo Switch New Super Mario Bros U Deluxe", # item name
"url":"https://www.lazada.sg/products/nintendo-switch-new-super-mario-bros-u-deluxe-i292338164-s484601143.html?search=1"
},
...
]
}
As the JSON data shows, you can get any item's name, URL, price, and so on.
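Pulling the fields out could look like the sketch below. It assumes the item list is embedded in a script tag of type "application/ld+json", which the schema.org "@context" suggests but the answer does not confirm; check the actual ret.text before relying on it. The `sample_html` string is a hypothetical stand-in for ret.text, and `JsonLdCollector` and `extract_items` are illustrative names.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def extract_items(html):
    """Return (name, price, currency, url) tuples from every ItemList block."""
    collector = JsonLdCollector()
    collector.feed(html)
    items = []
    for block in collector.blocks:
        data = json.loads(block)
        if not isinstance(data, dict) or data.get("@type") != "ItemList":
            continue
        for product in data.get("itemListElement", []):
            offer = product.get("offers", {})
            items.append((product.get("name"), offer.get("price"),
                          offer.get("priceCurrency"), product.get("url")))
    return items


# Hypothetical stand-in for ret.text, built from the item shown above.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "ItemList", "itemListElement": [
  {"@type": "Product",
   "name": "Nintendo Switch New Super Mario Bros U Deluxe",
   "url": "https://www.lazada.sg/products/nintendo-switch-new-super-mario-bros-u-deluxe-i292338164-s484601143.html?search=1",
   "offers": {"@type": "Offer", "price": "72.90", "priceCurrency": "SGD"}}
]}
</script>
</head><body></body></html>
"""

for name, price, currency, url in extract_items(sample_html):
    print(name, price, currency)
```

With a real response, pass ret.text to extract_items instead of sample_html. If the data turns out to be assigned to a JavaScript variable rather than stored in a JSON-LD script tag, the collector would need to be adapted accordingly.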