python - How to scrape data from multiple pages of one website with Python and BeautifulSoup
Problem description
# -*- coding: utf-8 -*-
"""
Created on Fri Jun 29 10:38:46 2018
@author: Cinthia
"""
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
array = ['146-face', '153-palettes-sets', 'https://www.sociolla.com/147-eyes', 'https://www.sociolla.com/150-lips', 'https://www.sociolla.com/149-brows', 'https://www.sociolla.com/148-lashes']
base_url='https://www.sociolla.com/142-face'
uClient = uReq(base_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
#grab the product
kosmetik = page_soup.findAll("div", {"class":"col-md-3 col-sm-6 ipad-grid col-xs-12 productitem"})
print(len(kosmetik))
I want to scrape data from that website. The code above only counts the products on the base URL. I don't know how to use the array so that the script collects product data such as description, image, and price from all the pages listed in it.
I'm new to Python and don't know much about loops yet.
Solution
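You can find the root element of the table/grid, here the element with id=product-list-grid, and extract the attributes that hold all the information you need (brand, link, category), together with the src of the first <img> tag.
For pagination, it seems you can reach the next page by adding p=<page number> to the URL, and when a page does not exist the site redirects to the first page. One workaround is to compare the response URL with the URL you requested. If they match, increment the page number and continue; otherwise you have scraped all the pages.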
from bs4 import BeautifulSoup
import urllib.request

count = 1
url = "https://www.sociolla.com/142-nails?p=%d"

def get_url(url):
    req = urllib.request.Request(url)
    return urllib.request.urlopen(req)

expected_url = url % count
response = get_url(expected_url)
results = []
while (response.url == expected_url):
    print("GET {0}".format(expected_url))
    soup = BeautifulSoup(response.read(), "html.parser")
    products = soup.find("div", attrs = {"id" : "product-list-grid"})
    results.append([
        (
            t["data-eec-brand"], #brand
            t["data-eec-category"], #category
            t["data-eec-href"], #product link
            t["data-eec-name"], #product name
            t["data-eec-price"], #price
            t.find("img")["src"] #image link
        )
        for t in products.find_all("div", attrs = {"class" : "product-item"})
        if t
    ])
    count += 1
    expected_url = url % count
    response = get_url(expected_url)
print(results)
The results are stored in results, a list with one entry per page, where each entry is a list of tuples.
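The question also asks how to cover every category in the array, which the code above does not do. Here is a minimal sketch of one way to approach it, reusing the redirect check from the answer and collecting everything into one flat list. The names BASE, page_url, scrape_categories, fetch and parse, as well as the max_pages safety cap, are illustrative assumptions and not part of the original answer. The site's markup and redirect behaviour may also have changed since the answer was written.

```python
import urllib.request
from urllib.parse import urljoin

BASE = "https://www.sociolla.com/"  # assumed site root, taken from the question's URLs


def page_url(category, page):
    """Build a page URL; category may be a slug ('146-face') or a full URL."""
    return "%s?p=%d" % (urljoin(BASE, category), page)


def scrape_categories(categories, fetch, parse, max_pages=50):
    """Collect rows from every page of every category.

    fetch(url) must return (final_url, html); parse(html) must return a list.
    A category is finished when the site redirects, i.e. the final URL
    differs from the requested one. max_pages is a hypothetical cap that
    stops the loop if the site never redirects.
    """
    rows = []
    for category in categories:
        for page in range(1, max_pages + 1):
            url = page_url(category, page)
            final_url, html = fetch(url)
            if final_url != url:
                break  # redirected: no such page, move on to the next category
            rows.extend(parse(html))
    return rows


def fetch(url):
    """Download a page and return (final URL after redirects, raw HTML)."""
    with urllib.request.urlopen(url) as response:
        return response.url, response.read()


def parse(html):
    """Extract one tuple per product, using the attributes from the answer."""
    from bs4 import BeautifulSoup  # imported lazily; only needed for real pages
    soup = BeautifulSoup(html, "html.parser")
    grid = soup.find("div", attrs={"id": "product-list-grid"})
    if grid is None:
        return []
    return [
        (t["data-eec-brand"], t["data-eec-category"], t["data-eec-href"],
         t["data-eec-name"], t["data-eec-price"], t.find("img")["src"])
        for t in grid.find_all("div", attrs={"class": "product-item"})
    ]
```

With the array from the question, scrape_categories(array, fetch, parse) would return a single flat list of tuples covering all categories. Because urljoin leaves absolute URLs untouched, the mix of slugs and full URLs in that array is handled without extra code. Passing fetch and parse as arguments keeps the loop easy to try out with stand-in functions that need no network access.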